
AI Voice Actors Sound More Human Than Ever

English World (《英語世界》), 2023 Issue 2 · 2023-04-15

A new wave of startups is using deep learning to build synthetic voice actors for digital assistants, video-game characters, and corporate videos.

The company blog post drips with the enthusiasm of a ’90s US infomercial. WellSaid Labs describes what clients can expect from its “eight new digital voice actors!” Tobin is “energetic and insightful.” Paige is “poised and expressive.” Ava is “polished, self-assured, and professional.”

Each one is based on a real voice actor, whose likeness (with consent) has been preserved using AI. Companies can now license these voices to say whatever they need. They simply feed some text into the voice engine, and out will spool a crisp audio clip of a natural-sounding performance.

WellSaid Labs, a Seattle-based startup that spun out of the research nonprofit Allen Institute for Artificial Intelligence, is the latest firm offering AI voices to clients. For now, it specializes in voices for corporate e-learning videos. Other startups make voices for digital assistants, call center operators, and even video-game characters.

Not too long ago, such deepfake voices had something of a lousy reputation for their use in scam calls and internet trickery. But their improving quality has since piqued the interest of a growing number of companies. Recent breakthroughs in deep learning have made it possible to replicate many of the subtleties of human speech. These voices pause and breathe in all the right places. They can change their style or emotion. You can spot the trick if they speak for too long, but in short audio clips, some have become indistinguishable from humans.

AI voices are also cheap, scalable, and easy to work with. Unlike a recording of a human voice actor, synthetic voices can also update their script in real time, opening up new opportunities to personalize advertising.

How to fake a voice

Synthetic voices have been around for a while. But the old ones, including the voices of the original Siri and Alexa, simply glued together words and sounds to achieve a clunky, robotic effect. Getting them to sound any more natural was a laborious manual task.
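The "gluing" the paragraph describes is concatenative synthesis. A minimal sketch is below; the clip bank and its contents are made up for illustration (lists of floats stand in for audio samples), and real engines spliced sub-word units rather than whole words, but the principle, and the clunky hard join, is the same.

```python
# Toy concatenative text-to-speech: look up a pre-recorded clip for each
# word and splice the clips end to end, with a flat silence at every join.

# Hypothetical "clip bank": word -> list of audio samples (floats).
CLIP_BANK = {
    "hello": [0.1, 0.3, 0.2],
    "world": [0.4, 0.1],
}

def concatenative_tts(text, clip_bank, pause=(0.0,)):
    """Splice stored clips for each word in order; unknown words fail,
    just as an old engine could only say what had been recorded."""
    samples = []
    for i, word in enumerate(text.lower().split()):
        if word not in clip_bank:
            raise KeyError(f"no recording for {word!r}")
        if i > 0:
            samples.extend(pause)  # hard silence at the join
        samples.extend(clip_bank[word])
    return samples

audio = concatenative_tts("Hello world", CLIP_BANK)
print(audio)  # [0.1, 0.3, 0.2, 0.0, 0.4, 0.1]
```

The abrupt zero-valued joins are exactly why these voices sounded robotic: nothing smooths the transition between units.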

Deep learning changed that. Voice developers no longer needed to dictate the exact pacing, pronunciation, or intonation of the generated speech. Instead, they could feed a few hours of audio into an algorithm and have the algorithm learn those patterns on its own.

Over the years, researchers have used this basic idea to build voice engines that are more and more sophisticated. The one WellSaid Labs constructed, for example, uses two primary deep-learning models. The first predicts, from a passage of text, the broad strokes of what a speaker will sound like—including accent, pitch, and timbre. The second fills in the details, including breaths and the way the voice resonates in its environment.
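WellSaid's actual models are not public, so the sketch below only illustrates the division of labor the paragraph describes, with hand-written rules standing in for learned networks: stage one maps text to a coarse prosody plan (a relative pitch per word stands in for accent/pitch/timbre), and stage two refines it with detail (a breath token after each pause stands in for breaths and room acoustics). All names and values are illustrative.

```python
# Toy two-stage "voice engine", mirroring the coarse-then-detailed split.
# Real systems learn both stages from hours of a voice actor's audio.

def broad_strokes_model(text):
    """Stage 1: predict a coarse plan as (token, relative_pitch) pairs."""
    words = text.replace(",", " ,").split()
    plan, pitch = [], 1.0
    for w in words:
        if w == ",":
            plan.append(("<pause>", 0.0))
        else:
            plan.append((w, round(pitch, 2)))
            pitch *= 0.95  # gentle pitch declination across the phrase

    return plan

def detail_model(plan):
    """Stage 2: refine the plan -- insert a breath after each pause."""
    out = []
    for token, pitch in plan:
        out.append((token, pitch))
        if token == "<pause>":
            out.append(("<breath>", 0.0))
    return out

render = detail_model(broad_strokes_model("Hello there, friend"))
print(render)
```

The point of the split is that the first model can be trained on coarse, plentiful signals while the second specializes in the fine texture that makes a voice sound alive.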

Making a convincing synthetic voice takes more than just pressing a button, however. Part of what makes a human voice so human is its inconsistency, expressiveness, and ability to deliver the same lines in completely different styles, depending on the context.

Capturing these nuances involves finding the right voice actors to supply the appropriate training data and fine-tune the deep-learning models. WellSaid says the process requires at least an hour or two of audio and a few weeks of labor to develop a realistic-sounding synthetic replica.

AI voices have grown particularly popular among brands looking to maintain a consistent sound in millions of interactions with customers. With the ubiquity of smart speakers today, and the rise of automated customer service agents as well as digital assistants embedded in cars and smart devices, brands may need to produce upwards of a hundred hours of audio a month. But they also no longer want to use the generic voices offered by traditional text-to-speech technology—a trend that accelerated during the pandemic as more and more customers skipped in-store interactions to engage with companies virtually.

“If I’m Pizza Hut, I certainly can’t sound like Domino’s, and I certainly can’t sound like Papa John’s,” says Rupal Patel, a professor at Northeastern University and the founder and CEO of VocaliD, which promises to build custom voices that match a company’s brand identity. “These brands have thought about their colors. They’ve thought about their fonts. Now they’ve got to start thinking about the way their voice sounds as well.”

Whereas companies used to have to hire different voice actors for different markets—the Northeast versus Southern US, or France versus Mexico—some voice AI firms can manipulate the accent or switch the language of a single voice in different ways. This opens up the possibility of adapting ads on streaming platforms depending on who is listening, changing not just the characteristics of the voice but also the words being spoken. A beer ad could tell a listener to stop by a different pub depending on whether it’s playing in New York or Toronto, for example. Resemble.ai, which designs voices for ads and smart assistants, says it’s already working with clients to launch such personalized audio ads on Spotify and Pandora.
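Resemble.ai's pipeline is not shown in the article, but the text-side personalization the beer-ad example describes can be sketched as a template filled per listener locale before it reaches any voice engine. The template, locale table, and pub names below are all made up for illustration.

```python
# Hypothetical locale-personalized ad copy: the same script template is
# rendered with different words depending on where the listener is, and
# the result would then be fed to a synthetic voice (not shown here).

AD_TEMPLATE = "Stop by {pub} tonight for a cold one."

# Illustrative locale table -- the pub names are invented.
LOCALE_PUBS = {
    "new_york": "the 7th Avenue Taproom",
    "toronto": "the Queen Street Taproom",
}

def personalize_ad(locale, template=AD_TEMPLATE, pubs=LOCALE_PUBS):
    """Fill the ad template for a listener's locale, with a generic
    fallback for locales the campaign has no specific copy for."""
    pub = pubs.get(locale, "your local pub")
    return template.format(pub=pub)

print(personalize_ad("toronto"))
# Stop by the Queen Street Taproom tonight for a cold one.
```

Because the words themselves change per listener, this only becomes practical once the voice is synthetic: a human actor cannot re-record every variant in real time.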

But there are limitations to how far AI can go. It’s still difficult to maintain the realism of a voice over the long stretches of time that might be required for an audiobook or podcast. And there’s little ability to control an AI voice’s performance in the same way a director can guide a human performer.

A human touch

In other words, human voice actors aren’t going away just yet. Expressive, creative, and long-form projects are still best done by humans. And for every synthetic voice made by these companies, a voice actor also needs to supply the original training data.

For VocaliD’s Patel, the point of AI voices is ultimately not to replicate human performance or to automate away existing voice-over work. Instead, the promise is that they could open up entirely new possibilities. What if in the future, she says, synthetic voices could be used to rapidly adapt online educational materials to different audiences? “If you’re trying to reach, let’s say, an inner-city group of kids, wouldn’t it be great if that voice actually sounded like it was from their community?” ■
