
Set the tone of trust for the audio renaissance

By Tiffany Xingyu Wang, chief strategy and marketing officer

November 10, 2021 | 6 min read

The audio renaissance is an exciting time for brands and marketers. Tiffany Xingyu Wang, chief strategy and marketing officer at Spectrum Labs, believes safety by design must be given the same priority as growth. To do so, decision-makers must understand the complexities of voice moderation and consider key strategies for preventing toxicity in these popular social spaces.


How can audio platforms be kept free from harassment?

In a single generation, we moved from asynchronous, one-to-one communication on paper to real-time, many-to-many user-generated content on the web. Today’s flourishing audio and voice technologies remind me of the early days of the text-based web.

Now we are at an inflection point. The rise of Clubhouse sparked an audio renaissance and set the stage for Rodeo, Stereo, Twitter Spaces and Spotify Greenroom. Advertisers are riding this shift in consumer behavior and quickly moving into immersive spaces such as Fortnite, the online video game that hosted a virtual Ariana Grande concert attended by 27 million people from all corners of the world. We are entering the multi-sensory, multi-dimensional metaverse through the gateway of the audio renaissance.

The crescendo of toxicity in social audio

According to Pew Research, four in 10 Americans say they’ve experienced online harassment. Every technologist I spoke with who worked on earlier incarnations of the web and social platforms shared a common regret: “I wish I could have fixed ‘it’ at the get-go.”

We’ve just started to see new forms of harassment appearing on audio platforms. Noise spammers seek to disrupt community discussions and harass speakers with persistent and irritating sounds. Large groups conspire to form ‘voice brigades’ and harass a single person during a live voice session.

At the dawn of the audio renaissance, what can we learn from the major missteps of web 2.0? How can we leapfrog into the next era of a safer and more inclusive metaverse?

The aria of safety by design

Safety by design means baking safety thinking and mechanisms into the design phase of a new product or platform.

It entails hiring a diverse policy team to create community guidelines inclusive of the user base. The guidelines, as part of user experience flows, need to be made actionable to the community through self-service, breach-reporting capabilities. Human moderators (enhanced by moderation technologies) become the guardian angels to keep platforms safe and inclusive when some users violate the guidelines.

The ensemble of voice moderation complexities

Because brand trust is now unequivocally tied to a platform’s safety, metaverse builders are actively searching for voice moderation tools.

Unlike text, voice is ephemeral, its file sizes are massive, and real-time moderation is difficult to scale. Information about these constraints is scarce, and best practices are still limited. Through working with a panel of metaverse builders grappling with this complex challenge, I gathered this list of common starter questions for teams adopting voice moderation to safeguard their platforms:

  1. What do you want to moderate? The benefits you reap must justify the cost and privacy trade-offs. It may not be worth it to moderate swear words, whereas moderating for hate speech or child grooming is.

  2. What will you do when you detect problems? When will you mute, warn, suspend or ban a user? Will the action apply to a single user or at another level, such as an entire room? (One possible escalation ladder is sketched after this list.)

  3. What data do you need to store to support moderation decisions? Will you review and take action, then delete? Or will you keep a trail of evidence for reporting to authorities?

  4. What do you want to record? A large part of moderation is going back to see what was said and determining whether it was toxic. When do you want to start recording the content? Will it be all content or only a subset?

  5. Do your privacy policies cover recording? Users expect more privacy in voice, and proactive detection may change their perception of privacy. Coordinate with legal and privacy stakeholders to adjust your terms.
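
To make the second question concrete, here is a minimal sketch of one possible escalation ladder, written in Python. The UserRecord structure, the thresholds and the action names are all hypothetical; they simply illustrate how a platform might map repeated violations to progressively stronger actions.

# A minimal sketch of one possible escalation ladder. All names and
# thresholds are hypothetical, not any specific vendor's API.
from dataclasses import dataclass

# Ordered from least to most severe action.
ESCALATION_LADDER = ["warn", "mute", "suspend", "ban"]

@dataclass
class UserRecord:
    user_id: str
    violations: int = 0  # confirmed guideline breaches so far

def next_action(user: UserRecord) -> str:
    """Pick the enforcement action for a newly confirmed violation."""
    user.violations += 1
    # Clamp to the last rung so repeat offenders stay banned.
    step = min(user.violations - 1, len(ESCALATION_LADDER) - 1)
    return ESCALATION_LADDER[step]

if __name__ == "__main__":
    speaker = UserRecord("user-123")
    for _ in range(5):
        print(next_action(speaker))  # warn, mute, suspend, ban, ban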

The andante of voice solutions

The content moderation industry has produced two major categories of solutions: transcription-based detection and direct-on-audio analysis.

Transcription-based detection converts audio to text and applies content moderation tooling to the transcript. A benefit of transcription-based detection is that it is available today. On the flip side, costs are high, and much of the speaker’s intention is lost, making accurate moderation more difficult.
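
As a rough illustration, a transcription-based pipeline reduces to two steps: transcribe the clip, then score the transcript. The sketch below assumes hypothetical transcribe and score_toxicity functions standing in for whichever speech-to-text service and text moderation model a platform chooses.

# A minimal sketch of a transcription-based moderation pipeline. The
# transcribe and score_toxicity functions are hypothetical placeholders,
# not a specific vendor's API.

def transcribe(audio_path: str) -> str:
    """Placeholder: call a speech-to-text service here."""
    raise NotImplementedError

def score_toxicity(text: str) -> float:
    """Placeholder: call a text moderation model here (returns 0.0-1.0)."""
    raise NotImplementedError

def moderate_clip(audio_path: str, threshold: float = 0.8) -> bool:
    """Return True if the clip should be flagged for human review."""
    transcript = transcribe(audio_path)   # step 1: audio -> text
    score = score_toxicity(transcript)    # step 2: text -> toxicity score
    return score >= threshold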

The emerging real-time audio detectors can directly analyze short audio snippets, skipping the transcription step. This method captures important information that’s lost when voice is transcribed, including tone, pitch, range and volume, which can reveal a speaker’s intention, and enable moderators to make the right decision. But it’s not without its limitations. At present, the amount of data that real-time audio detectors can analyze is still small.
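
To illustrate the idea, the sketch below slices a waveform into short snippets and computes crude volume and pitch proxies. Production detectors rely on trained acoustic models rather than hand-rolled features like these; the code only shows the shape of a snippet-by-snippet analysis that never leaves the audio domain.

# A minimal sketch of direct-on-audio analysis over short snippets, assuming
# the raw waveform is already loaded as a NumPy array. The energy and
# zero-crossing features below are only illustrative stand-ins for the tone,
# pitch and volume cues a trained acoustic model would capture.
import numpy as np

def snippet_features(waveform: np.ndarray, sample_rate: int,
                     window_seconds: float = 2.0):
    """Yield (rms_volume, zero_crossing_rate) for each short window."""
    window = int(window_seconds * sample_rate)
    for start in range(0, len(waveform) - window + 1, window):
        chunk = waveform[start:start + window]
        rms = float(np.sqrt(np.mean(chunk ** 2)))                   # loudness proxy
        zcr = float(np.mean(np.abs(np.diff(np.sign(chunk))) > 0))   # crude pitch proxy
        yield rms, zcr

if __name__ == "__main__":
    sr = 16_000
    # Four seconds of synthetic audio in place of a real microphone stream.
    fake_audio = np.random.uniform(-1.0, 1.0, sr * 4).astype(np.float32)
    for rms, zcr in snippet_features(fake_audio, sr):
        print(f"volume={rms:.3f} zero-crossings={zcr:.3f}")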

Platforms typically adopt a combination of both solutions, coupled with human moderation. But tooling is only part of the answer: high-accuracy detection of toxicity requires a robust AI infrastructure to support it.

Set the tone of trust and safety

It is not an easy battle to set the tone of trust and safety, because it has not been a common corporate priority. However, the tide is turning. Younger generations identify with brands that put safety and inclusion at the core of their growth strategies. Leaders who turn trust and safety into a competitive advantage in this audio renaissance will win on the battlefield of the metaverse.

Tiffany Xingyu Wang is chief strategy and marketing officer at Spectrum Labs.
