Can Self-Supervised Speech Models Predict the Perceived Acceptability of Prosodic Variation?

Abstract

Though producing an appropriate prosodic realisation of text is a one-to-many problem, modern speech generation often focuses on identifying the "best" or "most likely" output, overlooking acceptable variation across realisations. How listeners perceive such variation, and whether models capture it, is unaccounted for in current evaluation paradigms. In this study, we present exploratory analyses of whether self-supervised learning (SSL) models encode acceptable prosodic variation. Using a new dataset of relative acceptability ratings across carefully controlled, high-quality synthetic utterances, we show that SSL representations contain information predictive of such judgments. By introducing a novel method for deriving probability-based uncertainty from autoregressive speech models, we examine whether this information is available in an unsupervised setting, highlighting the complexity of prosodic perception and the value of more human-centric evaluation paradigms.
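The paper's exact formulation of probability-based uncertainty is not reproduced here. As an illustrative sketch only: for an autoregressive model over discrete speech units, per-utterance uncertainty can be summarised from the model's step-wise predictive distributions, e.g. as mean negative log-likelihood of the realised units and mean predictive entropy. The function name and array layout below are hypothetical, not the authors' API.

```python
import numpy as np

def sequence_uncertainty(step_probs, token_ids):
    """Illustrative probability-based uncertainty for one utterance.

    step_probs: (T, V) array; row t is the model's predictive
        distribution over V discrete speech units at step t.
    token_ids: length-T sequence of the units actually realised.
    Returns (mean negative log-likelihood, mean predictive entropy).
    """
    step_probs = np.asarray(step_probs, dtype=float)
    token_ids = np.asarray(token_ids)
    eps = 1e-12  # guard against log(0)
    # NLL of the realised unit at each step, then averaged.
    nll = -np.log(step_probs[np.arange(len(token_ids)), token_ids] + eps)
    # Shannon entropy of each predictive distribution, then averaged.
    entropy = -(step_probs * np.log(step_probs + eps)).sum(axis=1)
    return nll.mean(), entropy.mean()
```

A flat (uniform) predictive distribution yields maximal uncertainty under both summaries, while a peaked distribution that matches the realised unit yields low values; how such scores relate to perceived acceptability is the empirical question the paper investigates.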

Publication
Published in ASRU 2025, Honolulu, USA

Sarenne Wallbridge
Research Scientist

My research interests include machine learning, psycholinguistics, and information theory.