What do people hear? Listeners’ perception of conversational speech

Adaeze Adigwe, Sarenne Wallbridge, Simon King

August 2024


Abstract

Conversational agents are becoming increasingly popular, prompting the need for text-to-speech (TTS) systems that sound conversational. Previous research has focused on training TTS models on elicited or found conversational speech, then measuring improvements in listener preference. Preference ratings cannot pinpoint why TTS voices fall short of conversational expectations, underscoring our limited understanding of conversational speaking styles. In this pilot study, we interview naive listeners, who judge whether speech was taken from a conversation and then explain their judgement. Our results indicate that listeners can distinguish conversational utterances from read speech based on acoustic features alone. While listeners’ explanations vary, they generally allude to pronunciation, rhythmic organisation, and inappropriate prosody. Using targeted prosodic modifications to synthesise speech, we shed light on the complexity of evaluating conversational style.

Type

Conference paper

Publication

In Interspeech 2024; Kos, Greece

