September 5, 2025
Where AI models fall short in mimicking the expressiveness of human speech

It's not just what is said but how it's articulated that shapes the meaning of human communication; people use intonation to highlight the most important part of a sentence. Take, for instance, the sentence "Molly mailed a melon." If someone asks, "Who mailed the melon?" people are most likely to stress "Molly." If someone asks what Molly did with the melon, the stress shifts to "mailed." And if the question is what Molly mailed, the emphasis falls on "melon."
But pose any of these questions to an artificial intelligence model capable of speech, and it's a different story. Jianjing Kuang, associate professor of linguistics in the School of Arts & Sciences and director of the Penn Phonetics Laboratory, says that while AI models can articulate a word accurately, the technology to capture this kind of intonation, known as prosodic focus, "is not quite there yet."
This summer, she mentored three undergraduate students in a research project comparing how humans and AI models produce and perceive speech: Kevin Li and Henry Huang, second-year computer science students, and Ethan Yang, a third-year mechanical engineering major. The project is part of the Penn Undergraduate Research Mentoring Program (PURM), a 10-week summer research opportunity through the Center for Undergraduate Research and Fellowships that comes with a $5,000 award.
"I've always been interested in linguistics and phonetics, but this is a really good opportunity for me to do hands-on research," says Li, who is from Kansas City, Kansas. Huang, who is from Shenzhen, China, says the experience taught him how to design an experiment and analyze data.
Inputting different contexts, the students generated the sentence "Molly mailed a melon" with 15 AI text-to-speech (TTS) platforms, from major companies like OpenAI, Google, and Meta to smaller ones like Sesame AI and Eleven Labs. They also recorded human volunteers saying the same sentences in Kuang's recording studio so that the AI-generated speech could be compared with human speech.
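To make the setup concrete, here is a minimal sketch of how one of the named platforms could be prompted, using OpenAI's Python SDK as an example; the eliciting question, voice, model name, and output file are illustrative assumptions rather than the team's actual pipeline, and the exact SDK surface may vary by version.

# Minimal sketch, not the study's actual pipeline: synthesize the target
# sentence after a context question meant to elicit focus on "mailed".
# The voice, model name, and the idea of prepending the question are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

context = "What did Molly do with the melon?"  # hypothetical eliciting question
target = "Molly mailed a melon."

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=f"{context} {target}",
)
response.stream_to_file("molly_focus_mailed.mp3")  # save the audio for later acoustic analysis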
Yang, who is from Diamond Bar, California, says the project taught him how to control intonation in TTS models. The team then analyzed acoustic measures such as pitch, intensity, and word duration using the software Praat.
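For readers curious what those measurements look like in practice, here is a minimal sketch using parselmouth, a Python interface to Praat; the file name and the whole-utterance averaging are illustrative assumptions, and per-word measures like the durations discussed below would additionally require word boundaries, for example from forced alignment.

# Minimal sketch, assuming a mono WAV recording of the sentence; extracts the
# three acoustic measures mentioned above: pitch (F0), intensity, and duration.
import numpy as np
import parselmouth

snd = parselmouth.Sound("molly_mailed_a_melon.wav")  # hypothetical file name

pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
mean_f0 = np.mean(f0[f0 > 0])  # ignore unvoiced frames, which Praat reports as 0 Hz

intensity = snd.to_intensity()
mean_db = np.mean(intensity.values)  # average of the intensity contour, in dB

duration_s = snd.get_total_duration()  # whole-utterance duration in seconds

print(f"mean F0 {mean_f0:.1f} Hz, mean intensity {mean_db:.1f} dB, duration {duration_s:.2f} s")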
They found that, compared with human production, most of the TTS models failed to place focus on the correct word. As an example, Li pulled up a graph showing that when speakers were prompted to focus on the word "mailed," its average duration was significantly longer in the human recordings than in the output of any of the speech models.
The students found "huge variability among the models," Kuang says. Some models could not emphasize a certain word even when explicitly instructed to, while others, such as those from OpenAI and Google Gemini, were more capable. Some models emphasized more than one word, one turned the sentence into a question, and another didn't even finish the sentence. Another interesting finding, Kuang says, is that the models had an easier time emphasizing "Molly" than words later in the sentence.
In addition to speech production, the students ran a perception experiment, asking human listeners to rate the naturalness of audio clips and to identify whether the speaker was human or AI. Kuang says listeners' accuracy at telling human from AI speech was very high, suggesting that AI speech is still not human-like.
"The goal is to build bridges between science and industry. I do think they need us—our knowledge—to tell how good the model is and help move us closer to truly natural and expressive AI speech," she says. Kuang adds that working with AI also has implications for better understanding human speech and its uniqueness, such as why certain tasks come easily to us and how to develop better therapies for speech disorders.
Provided by University of Pennsylvania