The Surprising Repercussions of Making AI Assistants Sound Human
By Josh Clark
Published Jun 17, 2017
There’s much effort afoot to make the bots sound less… robotic. Amazon recently enhanced its Speech Synthesis Markup Language to give Alexa a more human range of expression. SSML now lets Alexa whisper, pause, bleep expletives, and vary the speed, volume, emphasis, and pitch of its speech.
This all comes on the heels of Amazon’s February release of so-called speechcons (like emoticons, get it?) meant to add some color to Alexa’s speech. These are phrases like “zoinks,” “yowza,” “read ’em and weep,” “oh brother,” and even “neener neener,” all pre-rendered with maximum inflection. (Still waiting on “whaboom” here.)
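To make this concrete, here’s a rough sketch of what that markup looks like in an Alexa response, using the SSML tags Amazon documents for these features: whispering, pauses, bleeped expletives, prosody and emphasis adjustments, and an interjection speechcon. The wording is my own illustration, not Amazon’s sample code:

    <speak>
        Here’s the forecast at normal volume.
        <amazon:effect name="whispered">And a secret, whispered.</amazon:effect>
        <break time="1s"/>
        <prosody rate="slow" pitch="low">Slower and lower for gravitas,</prosody>
        with <emphasis level="strong">extra punch</emphasis> on one phrase.
        The rude bit gets bleeped: <say-as interpret-as="expletive">darn</say-as>.
        <say-as interpret-as="interjection">zoinks!</say-as>
    </speak>

The interjection value of say-as is how speechcons like “zoinks” and “booyah” get their canned, fully inflected delivery.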
The effort is intended to make Alexa feel less transactional and, well, more human. Writing for Wired, however, Elizabeth Stinson considers whether human personality is really what we want from our bots—or whether it’s just unhelpful misdirection.
“If Alexa starts saying things like hmm and well, you’re going to say things like that back to her,” says Alan Black, a computer scientist at Carnegie Mellon who helped pioneer the use of speech synthesis markup tags in the 1990s. Humans tend to mimic conversational styles; make a digital assistant too casual, and people will reciprocate. “The cost of that is the assistant might not recognize what the user’s saying,” Black says.
A voice assistant’s personality improving at the expense of its function is a tradeoff that user interface designers will increasingly wrestle with. “Do we want a personality to talk to or do we want a utility to give us information? I think in a lot of cases we want a utility to give us information,” says John Jones, who designs chatbots at the global design consultancy Fjord. Just because Alexa can drop colloquialisms and pop culture references doesn’t mean it should. Sometimes you simply want efficiency. A digital assistant should meet a direct command with a short reply, or perhaps silence—not booyah! (Another speechcon Amazon added.)
Personality and utility aren’t mutually exclusive, though. You’ve probably heard the design maxim that form should follow function. Alexa has no physical form to speak of, but its purpose should still inform its persona. For now, though, the comprehension skills of digital assistants remain too rudimentary to bridge these two ideals. “If the speech is very humanlike, it might lead users to think that all of the other aspects of the technology are very good as well,” says Michael McTear, coauthor of The Conversational Interface. The more humanlike an assistant sounds, the more users will expect of it, and the wider the gap between those expectations and its actual abilities.
When designing within the constraints of any system, the goal should be to channel user expectations and behavior to match the actual capabilities of the system. The risk of adding too much personality is that it creates exactly that kind of mismatch. Zoinks!