What do people hear? Listeners’ perception of conversational speech

Abstract

Conversational agents are becoming increasingly popular, prompting the need for text-to-speech (TTS) systems that sound conversational. Previous research has focused on training TTS models on elicited or found conversational speech, then measuring improvements in listener preference. Preference ratings cannot pinpoint why TTS voices fall short of conversational expectations, underscoring our limited understanding of conversational speaking styles. In this pilot study, we conduct interviews with naive listeners who evaluate whether speech was taken from a conversation, then explain their judgement. Our results indicate that listeners are capable of distinguishing conversational utterances from read speech using acoustic features alone. While listeners’ explanations vary, they generally allude to pronunciation, rhythmic organisation, and inappropriate prosody. Using targeted prosodic modifications to synthesise speech, we shed light on the complexity of evaluating conversational style.

Publication
In Interspeech 2024; Kos, Greece

Sarenne Wallbridge
Machine Learning PhD Fellow

My research interests include machine learning, psycholinguistics, and information theory.

Related