AI experts weigh the opportunities & risks of OpenAI’s Voice Engine

By Webb Wright, NY Reporter

April 3, 2024 | 9 min read

The company is still trying to determine how its new text-to-speech model could be used beneficially and safely. But the dangers are clear.

OpenAI has deployed Voice Engine in a “small-scale preview” to study its potential benefits and risks. / Adobe Stock

Last Friday, OpenAI – creator of the viral generative AI chatbot ChatGPT – announced in a blog post that it had built an AI model, called Voice Engine, that can convincingly recreate human voices from just fifteen seconds of audio and a text prompt.

According to the blog post, Voice Engine has been deployed in a “small-scale preview” with the goal of understanding the model’s applications and risks. “We are taking a cautious and informed approach to a broader release due to the potential for synthetic voice misuse,” the company said in a statement. “We hope to start a dialogue on the responsible deployment of synthetic voices, and how society can adapt to these new capabilities.”

As is often the case with novel technologies, the Voice Engine cart has in some ways preceded the horse; it isn’t totally clear who would use the text-to-speech model, or for what purposes.

OpenAI suggested in its blog post that it might be used to support people who are unable to verbally communicate as a result of disability or disease. It’s also capable of cloning a person’s voice to read text in a variety of languages, presenting huge potential for audio and video translation, “so creators and businesses can reach more people around the world, fluently and in their own voices,” OpenAI wrote.

Voice Engine represents “a tangible uptick in the state-of-the-art” for text-to-speech AI models, says Daniel Faggella, CEO of Emerj, a market research firm specializing in the AI industry.

Despite the fact that Voice Engine has yet to be publicly released, many marketers have undoubtedly taken notice and begun to consider how the model might be leveraged within their industry.

Celebrity voiceovers in ads, for example, could potentially be created in a matter of seconds, perhaps at a fraction of the current cost.

Customer service could also conceivably be transformed; rather than the mechanistic, uncanny robot voices that currently greet customers on many hotlines, emerging models like Voice Engine could provide the same service in voices that sound much closer to natural human speech.

Eleven Labs, a company specializing in text-to-speech software, says that automation will enhance rather than replace human workers in the customer service sector – an increasingly commonplace sentiment among companies building AI. “Some companies may choose to replace roles in their support teams with AI but we believe that the best customer service will be delivered through a combination of AI and humans,” Sam Sklar, communications lead at Eleven Labs, told The Drum. “With AI helping customers get quick answers to standard questions, support teams can be freed up to spend more time tackling tricky queries.”

What could go wrong?

As OpenAI alluded to in its blog post, such a model could also present serious societal risks, especially during an election year.

Voice Engine or a similar model could be used by bad actors to recreate voices of politicians or public figures in an effort to sway voters or sow disinformation. And it’s hard to imagine how many elderly people with low tech literacy might be swindled out of their money by AI-generated voices mimicking grandchildren who need money for Bible camp.

The world got a taste of this burgeoning kind of scam back in January when voters in New Hampshire received a phone call, seemingly from President Joe Biden, urging them not to vote in the state’s primary election. The call was, in fact, a recorded imitation of the president’s voice that was probably generated using AI.

Acknowledging the dangers of Voice Engine, OpenAI wrote that it’s “engaging with US and international partners from across government, media, entertainment, education, civil society and beyond to ensure we are incorporating their feedback as we build.”

For its cohort of early testers, the company has also prohibited the nonconsensual use of an individual’s voice and implemented a watermarking system to identify audio generated by Voice Engine, among other safety measures.

“OpenAI is doing the right thing by withholding [the] broad release of Voice Engine until more guardrails against abuse are in place,” says Andrew Grotto, William J Perry international security fellow at Stanford University and former senior director for cybersecurity policy for the Obama and Trump administrations.

Text-to-speech technology has been around for decades, and today a number of platforms, including Speechify and Eleven Labs, leverage AI to synthesize speech from text prompts. But these platforms aren’t able to recreate speech that’s based on that of an actual, flesh-and-blood human being; this is the power and peril of Voice Engine.

Advice for marketers

It remains to be seen whether OpenAI will actually release Voice Engine to the general public. But now that the door has been cracked open, it’s likely that other AI developers will eventually release similar tools, potentially with less caution about broad-scale deployment.

Should this model – or another like it – ever reach the hands of marketing teams, what would responsible use look like?

According to Grotto, it’s critical for marketers to carefully consider the context underlying the use of a Voice Engine-like model or text-to-speech AI more broadly. “If it would be weird to use an actual human’s voice in a particular setting or in a particular way, it’s probably even weirder to use a synthetic voice,” he says. “Mind the creepiness factor!”

Grotto adds that “if authenticity is an important dimension of the message, it’s probably best to not use a synthetic voice to communicate it.” Brands, in his view, would not be doing themselves any favors if they used text-to-speech AI tools in order to come across as more relatable to an international audience. “Be thoughtful about the use of accents,” he says. “Don’t misappropriate ethnic or other voices and avoid perpetuating stereotypes.”

Speaking about the recent wave of generative AI more broadly, Faggella of Emerj urges not just marketers but professionals across a wide scope of industries to actively experiment and keep their minds open to new opportunities for boosting efficiency and creativity. “If you don’t want to be a dinosaur, I think it behooves essentially everyone to be tinkering with these tools today,” he says.
