AI Voice Actors Sound More Human Than Ever
By Karen Hao; translated by Li Xiaoxue

A new wave of startups are using deep learning to build synthetic voice actors for digital assistants, video-game characters, and corporate videos.
The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”
Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.
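WellSaid has not published its client interface, but the "feed text in, get a clip out" workflow it describes might look roughly like the Python sketch below. The endpoint URL, voice name, and request fields are hypothetical placeholders for illustration, not any vendor's real API.

```python
# Hypothetical sketch of the "text in, audio out" workflow described above.
# The endpoint, voice id, and parameters are illustrative placeholders,
# not WellSaid's (or anyone's) actual API.
import requests

API_URL = "https://api.example-voice-vendor.com/v1/synthesize"  # placeholder

def synthesize(text: str, voice: str = "tobin", api_key: str = "YOUR_KEY") -> bytes:
    """Send a script to a licensed synthetic voice; get back raw audio bytes."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"voice": voice, "text": text, "format": "wav"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    clip = synthesize("Welcome to the onboarding course.")
    with open("clip.wav", "wb") as f:
        f.write(clip)
```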
WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute for Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video-game characters.
Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.
AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.
How to fake a voice
Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.
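To make that shift concrete, here is a deliberately tiny PyTorch sketch of the idea: instead of hand-written pronunciation and timing rules, a network is fit to pairs of text and audio features and absorbs pacing and intonation statistically. The model, shapes, and data are toy placeholders, not any production system.

```python
# Toy sketch (assumed, not any vendor's code): a network learns a mapping from
# characters to audio features directly from example pairs, rather than from
# hand-authored pronunciation and timing rules.
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self, vocab_size: int = 64, mel_bins: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 128)     # characters -> vectors
        self.rnn = nn.GRU(128, 256, batch_first=True)  # reads the text in order
        self.to_mel = nn.Linear(256, mel_bins)         # predicts audio features

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.embed(chars))
        return self.to_mel(h)  # one feature frame per character (toy alignment)

# One training step on placeholder data; a real system trains for days on
# hours of recorded speech aligned with its transcripts.
model = TinyTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
chars = torch.randint(0, 64, (8, 32))   # batch of 8 "sentences", 32 chars each
mels = torch.randn(8, 32, 80)           # matching audio features (placeholder)
loss = nn.functional.l1_loss(model(chars), mels)
loss.backward()
opt.step()
```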
Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like—including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
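WellSaid has not released details of those two models, but publicly documented systems (for example, a Tacotron-style acoustic model followed by a neural vocoder) use the same division of labor. The sketch below shows only that split, with toy stand-in modules: stage one drafts the coarse acoustic features, stage two renders the waveform detail.

```python
# Illustrative two-stage pipeline in the spirit described above. This mirrors
# public designs such as Tacotron 2 + a neural vocoder; it is an assumed
# analogy, not WellSaid's architecture, and both modules are toy stand-ins.
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stage 1: text -> broad acoustic features (where accent, pitch, and
    timbre are decided)."""
    def __init__(self, vocab: int = 64, mel_bins: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 128)
        self.encoder = nn.GRU(128, 256, batch_first=True)
        self.to_mel = nn.Linear(256, mel_bins)

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.encoder(self.embed(text_ids))
        return self.to_mel(h)  # coarse spectrogram "sketch" of the speech

class Vocoder(nn.Module):
    """Stage 2: acoustic features -> waveform (where fine detail such as
    breaths and room resonance gets rendered)."""
    def __init__(self, mel_bins: int = 80, hop: int = 256):
        super().__init__()
        self.upsample = nn.Linear(mel_bins, hop)  # toy stand-in for a real vocoder

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.upsample(mels)).flatten(1)  # samples in [-1, 1]

def speak(text_ids: torch.Tensor) -> torch.Tensor:
    mels = AcousticModel()(text_ids)  # model 1: what the voice should sound like
    return Vocoder()(mels)            # model 2: fill in the audio detail
                                      # (untrained weights here; a real system
                                      # would load trained checkpoints)

waveform = speak(torch.randint(0, 64, (1, 32)))  # one sentence of 32 character ids
```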
Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.
Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.
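In code terms, that step resembles what is commonly called speaker adaptation: start from a model pretrained on many voices, then continue training briefly on the new actor's hour or two of recordings. The sketch below reuses the toy TinyTTS class from the earlier example; the checkpoint file and clip data are hypothetical placeholders, and this is an assumed workflow rather than WellSaid's documented process.

```python
# Hedged sketch of speaker adaptation (an assumed workflow, not WellSaid's
# documented process). Reuses TinyTTS from the earlier sketch.
import torch

model = TinyTTS()
model.load_state_dict(torch.load("pretrained_tts.pt"))  # hypothetical multi-voice checkpoint

# Placeholder stand-ins for (text, audio-feature) pairs cut from the actor's
# one-to-two-hour studio session.
actor_clips = [(torch.randint(0, 64, (1, 32)), torch.randn(1, 32, 80))]

# The dataset is small, so fine-tune gently: low learning rate and few passes,
# adapting the voice without "forgetting" general speech patterns.
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
for epoch in range(3):
    for chars, mels in actor_clips:
        opt.zero_grad()
        loss = torch.nn.functional.l1_loss(model(chars), mels)
        loss.backward()
        opt.step()
```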
AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology—a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.