We Are Social is using ElevenLabs text-to-speech tech to give chatbots a voice
We Are Social’s Singapore team explain how they’re using gen AI text-to-speech tool ElevenLabs to make more engaging chatbots and virtual influencers.
Combined with generative AI text tools, ElevenLabs gives virtual influencers the appearance of autonomy / Unsplash
Generative AI text-to-speech tools could give creative and digital agencies the power to furnish chatbots and virtual influencer creations with a real personality. And one of the leading technologies being used by agencies to achieve that goal comes from ElevenLabs, an American tech company.
ElevenLabs’ AI training datasets draw from audiobooks and podcasts – and its tools are already being used to bring virtual influencers to life and make chatbots speak.
According to Manolis Perrakis, innovation director at We Are Social Singapore, “we always like to anthropomorphize technology. Having this machine articulating text into a human-sounding voice is extremely powerful.”
Pricing depends on the level of access and ranges from free to $330. The tool comes with a library of around 70 voices that can be used immediately with a written script. “You can choose a deep, older person’s voice or a young girl reading a children’s story. This allows the creative team more tools to express themselves creatively,” he says.
We Are Social experimented with “a number” of different models and software, Perrakis tells The Drum, including building its own. “The ones that are open source or have to be built by ourselves are quite complex,” he says, and so ElevenLabs proved to be the most feasible solution. Rather than use the company’s off-the-shelf options, it’s built a bespoke solution using the ElevenLabs API.
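We Are Social hasn’t published its integration, but ElevenLabs’ public API is a straightforward HTTP endpoint: a POST with an API key and a text payload returns audio. A minimal sketch of such a call, using only Python’s standard library (the voice ID and settings below are illustrative placeholders, not the agency’s actual configuration):

```python
import json
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(api_key: str, voice_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for ElevenLabs' text-to-speech endpoint.

    The endpoint shape follows ElevenLabs' public API documentation;
    the model ID and voice settings here are placeholder defaults.
    """
    url = f"{API_BASE}/text-to-speech/{voice_id}"
    payload = json.dumps({
        "text": text,
        "model_id": "eleven_monolingual_v1",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.5},
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    req = build_tts_request("YOUR_API_KEY", "VOICE_ID", "Hello from a virtual influencer.")
    # Sending the request returns audio bytes that can be written to disk:
    # with urllib.request.urlopen(req) as resp, open("speech.mp3", "wb") as f:
    #     f.write(resp.read())
```

Building on the API rather than the off-the-shelf web app is what lets an agency wire the voice into a larger pipeline, as We Are Social has done.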
How’s the tech being used right now?
We Are Social’s Singapore office has been creating virtual influencers for a while and added ElevenLabs to its toolbox several months ago. Though it started as an experimental aid, the team is now working on a new campaign set to go live in Japan later this year; Perrakis declines to name the client. By pairing a generative AI text tool such as ChatGPT, underpinned by a large language model (LLM) and tailored to the brand, with the text-to-speech generator, the team has been able to create a chatbot that reads aloud from a script it writes automatically in response to users’ spoken prompts.
So far, there are two primary use cases where text-to-speech tools can add value to virtual influencer projects: streaming and customer service. “Imagine you go to the website of a brand and you’re welcomed by a virtual influencer that can respond to you in human language, based on the questions that you’re asking them,” Perrakis describes.
The other example would be for virtual influencer characters that exist on Twitch or YouTube Live. “It gives them autonomy, but controlled autonomy. And it’s all based on the brand’s values, so it’s not free to say whatever it ‘wants’.”
Head of strategy James Honda-Pinder says text-to-speech tools mean We Are Social’s teams are “able to imbue your virtual influencer with so much more personality. That’s the objective. A personality cuts through in the hurricane of social media.”
In addition to ElevenLabs’ text-to-speech tool, We Are Social is using GPT-4 to generate the text of the ‘script’, while the Web Speech API ‘hears’ voice prompts from a user and transcribes them into text for the virtual influencer to ‘read’ (a test featuring We Are Social’s in-house influencer Cinder is above).
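That stack amounts to a single orchestration step: browser speech recognition hands over a transcript, an LLM call writes an in-character reply, and a text-to-speech call voices it. A minimal, hypothetical Python sketch of that loop, with the LLM and TTS calls injected as plain callables (nothing here is the agency’s actual code; the persona prompt illustrates the “controlled autonomy” Perrakis describes):

```python
from typing import Callable

def influencer_reply(
    transcript: str,
    persona_prompt: str,
    generate: Callable[[str], str],   # stand-in for a GPT-4 chat-completion call
    speak: Callable[[str], bytes],    # stand-in for an ElevenLabs TTS call
) -> tuple[str, bytes]:
    """Turn a user's transcribed speech into spoken audio from the character.

    The persona prompt constrains the LLM so the character stays on-brand
    rather than being free to say whatever it 'wants'.
    """
    prompt = (
        f"{persona_prompt}\n"
        f"User said: {transcript}\n"
        "Reply in character, staying within the brand's values:"
    )
    reply_text = generate(prompt)
    return reply_text, speak(reply_text)
```

Keeping the two API calls behind plain function arguments also makes it easy to swap providers, relevant given the team’s search for tools with broader language support.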
Not a ‘silver bullet’
Though there are benefits to advertisers in deploying animated, speaking brand mascots on the web, a library of only 70 voices might not offer agency clients enough distinctiveness. Brands will want voices that are distinct from their competitors’.
One solution to that problem, albeit one that’s a little farther down the line, Perrakis says, would be recording voices that the agency itself could own and use to train the tool. He’s already added his own voice to the library, he says. “It allows us to have a talent, or a person from our team with a nice-sounding voice, to donate their voice,” he says.
We Are Social is only a couple of months into its use of ElevenLabs and other drawbacks are already apparent. First of all, while a voice can bring a personality to life online, it’s not a “silver bullet” for brand communications, particularly in niches currently occupied by text-based chatbots.
Honda-Pinder notes that giving chatbots a voice won’t necessarily improve the underlying customer experience. “There are so many things that need to be fixed. Before this call, I had a 45-minute conversation with my bank. Virtual influencers have lots of opportunities, but I don’t think they can save customer service.”
Furthermore, the software’s first language is English. That’s not a problem in the US or Britain, but some of the largest advertising markets in the world – including Singapore – are multilingual. Other tools offer better renditions of Mandarin, Cantonese or Japanese, Perrakis says, and the agency is exploring them to broaden the approach’s applicability. “We’re looking at other technologies that offer more languages and translation,” he says.