How synthetic data could mitigate privacy and bias issues for marketers using AI
Data is the lifeblood of both machine learning and ad targeting. But datasets often contain pernicious biases, and the privacy landscape presents challenges for industries and businesses that handle sensitive information. Here’s a look at how synthetic data can help - part of The Drum’s latest Deep Dive, The New Data & Privacy Playbook.
Synthetic data is generated by AI to supplement or replace real data. / Adobe Stock
Artificial intelligence (AI) has recently become the tech trend du jour across much of the advertising industry. Sparked in large part by the meteoric success of ChatGPT, a growing number of marketers have come to view the rise of AI as a phenomenon that is rapidly and dramatically reshaping their industry.
At the same time, seismic shifts in the data privacy landscape have also had a profound impact on advertising. The deprecation of third-party cookies, for example, and new limitations on cross-app data tracking implemented by Apple and other tech companies have cumulatively resulted in what’s often referred to as ‘signal loss’ – the attenuation of marketers’ ability to track, measure and thereby strategically respond to customer data.
There’s a technology that exists at the confluence between these two trends, and it offers marketers the chance to leverage AI while adapting to new data-related challenges. It’s called synthetic data.
In contrast to ‘real data,’ which reflects real information extracted from the real world, synthetic data is artificial – it’s been generated by AI either to supplement or replace real-world data.
Synthetic data is sometimes referred to as ‘fake data,’ but that phrase is a bit misleading. ‘Fake data’ sounds like something that a corrupt banker fabricates in order to dupe investors. But synthetic data has many legitimate – and perfectly legal – uses that can help marketers to improve their digital ad marketing capabilities while simultaneously adhering to the letter of data laws.
Synthetic data and ML
The most valuable application of synthetic data is probably in the training of machine learning (ML) models.
These models require vast quantities of data in order to be able to make accurate predictions about the real world. The impressive verbal fluency of ChatGPT, for example, stems from the fact that its underlying large language model (LLM), GPT-4, has been trained with an enormous amount of text-based content from the internet.
But it takes time and money for human beings to gather and label (that is, organize according to a particular set of rules) data that’s been collected from the real world. Synthetic data, on the other hand, is both inexpensive to produce and automatically pre-labeled, which means it can easily be fed into an ML model without jeopardizing that model’s functionality. To use an automotive analogy, it’s as if a car manufacturer suddenly developed a means of producing huge amounts of cheap but perfectly usable refined oil using only a computer.
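To make the ‘pre-labeled’ point concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names, the value ranges and the rule linking time on site to clicks are invented, and real synthetic-data tools are far more sophisticated. The point is only that the generator knows the ‘answer’ for each row at the moment it creates it, so no human has to label anything afterwards.

```python
import random

def make_synthetic_impressions(n, seed=0):
    """Generate n made-up ad impressions, each labeled at creation time."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    rows = []
    for _ in range(n):
        age = rng.randint(18, 70)
        minutes_on_site = round(rng.uniform(0.0, 30.0), 1)
        # Hypothetical rule: click probability rises with time on site.
        click_probability = min(0.9, 0.05 + 0.02 * minutes_on_site)
        # The label comes for free, because the generator decides it.
        clicked = rng.random() < click_probability
        rows.append({"age": age, "minutes_on_site": minutes_on_site, "clicked": clicked})
    return rows

rows = make_synthetic_impressions(1000)
print(len(rows), "labeled rows; first row:", rows[0])
```

A thousand labeled rows appear in a fraction of a second. Whether they are any use depends entirely on how closely the invented rule resembles real customer behavior.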
Another benefit of synthetic data is that it can be used to create a more robust dataset, accounting for fringe or anomalous events which might not be reflected in real-world data. “Real data is not perfect,” says Alys Woodward, a senior director analyst at Gartner who specializes in AI and synthetic data. “It lacks what we call ‘edge cases,’ which are unusual occurrences … so training machine learning models [with] real data doesn’t represent the world that you want to train your model for.” In other words: real data, almost by definition, reflects the status quo; an ML model fed exclusively on real data may not be able to expect the unexpected, which can have disastrous consequences.
Take self-driving cars, for example. In order to be safe enough for widespread use, a self-driving car should ideally have an automated response at the ready for virtually every possible contingency that it might encounter while navigating a road. Normally, traffic adheres to a predictable set of rules; cars usually stay in their lanes, pedestrians usually obey traffic lights (unless you happen to be driving in New York City) and meteorites don’t usually come flaming out of the sky. But as every human driver knows, shit occasionally happens, and you have to be able to suddenly adapt.
Developing autonomous vehicles (AVs) that are able to safely respond to any anomalous event on the road is an immensely difficult technical problem; that’s one of the big reasons why they’re still an uncommon sight, despite the predictions of many science fiction writers throughout the ages.
An AV needs to know how to respond, say, to a dog that suddenly runs into the street. Should it veer violently to the right, into a line of parked cars, thereby endangering the life of the driver? Or should it plow on ahead, protecting the driver but perhaps killing the dog? And this scenario could play out in a virtually limitless variety of ways. Road safety contingencies are, in other words, combinatorially explosive - there’s simply not enough real-world data that's available to human programmers in their efforts to build satisfactorily safe self-driving cars. (The fact that the human brain is constantly able to grapple with combinatorial explosiveness to home in on the best course of action in a given situation is a mystery that has long vexed cognitive scientists.)
This is where synthetic data can come into play. By using AI to manufacture data that resembles real-world data while simultaneously accounting for ‘edge cases,’ engineers can get one step closer to developing AVs that can respond flexibly and intelligently to the unpredictability and infinite complexity of the world.
Gartner has estimated that synthetic data will outnumber real data in the training of AI models by 2030.
Synthetic data, privacy and the mitigation of bias
Synthetic data can enable brands to anonymize personal information from real individuals, thereby ensuring the privacy and security of that information.
A healthcare company, for example - which obviously handles massive quantities of sensitive customer information - might use generative AI to produce synthetic data which masks the real data. It’s a bit like a cipher; the sensitive data becomes obfuscated and intelligible only to those with inside knowledge of how to read it.
Synthetic data can also help to fill in demographic gaps found in real data that might otherwise propagate bias. “If [real] data is 80% male and 20% female, then you could build that bias into your model,” says Woodward. Brands can supplement existing and demographically skewed real data sets with synthetic data to create a more even and unbiased distribution.
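Woodward’s 80/20 example can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not how any particular vendor does it: the ‘real’ dataset here is itself simulated, the single `spend` field and the `synthesize` helper are hypothetical, and the synthetic rows are drawn from a simple bell curve fitted to the under-represented group.

```python
import random
import statistics

rng = random.Random(42)

# Stand-in for a skewed real dataset: 80% male, 20% female (values are made up).
real = [{"gender": "male", "spend": rng.gauss(50, 10)} for _ in range(800)]
real += [{"gender": "female", "spend": rng.gauss(55, 12)} for _ in range(200)]

def synthesize(rows, group, count, rng):
    """Draw `count` synthetic rows for `group` from a normal fit to its real rows."""
    values = [r["spend"] for r in rows if r["gender"] == group]
    mu, sigma = statistics.mean(values), statistics.stdev(values)
    return [{"gender": group, "spend": rng.gauss(mu, sigma), "synthetic": True}
            for _ in range(count)]

# Count each group, then add enough synthetic rows to close the gap.
counts = {g: sum(r["gender"] == g for r in real) for g in ("male", "female")}
shortfall = counts["male"] - counts["female"]
balanced = real + synthesize(real, "female", shortfall, rng)

share_female = sum(r["gender"] == "female" for r in balanced) / len(balanced)
print(f"female share before: {counts['female'] / len(real):.0%}, after: {share_female:.0%}")
```

The split moves from 80/20 to 50/50, but the 600 synthetic rows can only echo patterns already present in the 200 real ones. If those 200 are unrepresentative, the synthetic rows will be too.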
“By generating counterfactuals, synthetic data can help identify and correct hidden biases in [AI] models,” says Akash Srivastava, senior research scientist and manager at IBM Research and a co-leader of Project Synderella, an IBM initiative devoted to the generation of synthetic data for privacy-protection purposes. “This can benefit marketers by ensuring that their ad campaigns are not inadvertently biased against certain target audiences.”
It’s important to bear in mind, however, that synthetic data is not conjured up out of thin air. It should be thought of as a simulacrum of real data, an artificial representation of the real world. As such, it carries with it the potential to contain biases. The same risk is presented by generative AI models like ChatGPT: The content they create is based on real data, and can therefore potentially be as prejudiced as that data. Some generative AI models have also been criticized for their propensity to steal the work of human artists.
“Creating synthetic data still requires original data to generate, so it can face the same issues around privacy and consent that surround generative AI,” says Henry Ajder, an expert in AI and deepfakes. “There are also concerns that synthetic data could almost exactly replicate original data ... However, if executed responsibly, synthetic data could help organizations sidestep the sticky issues associated with training on sensitive data, particularly in fields such as healthcare.”
According to IBM’s Srivastava, the responsible use of synthetic data among marketers revolves primarily around careful planning and oversight: “When considering the use of synthetic data for ad-targeting efficacy, brands should first evaluate [the] data privacy and ethical implications,” he says. “Synthetic data can be used to run causal inference and A/B testing on marketing questions, but it is important to ensure that the generated data accurately represents the real-world data it is meant to replace.”
To steer clear of the hidden biases that can be propagated through AI models, Srivastava says, “Brands should take care to use synthetic data that is diverse and representative of their target audience, and constantly monitor and evaluate the results of any ad campaigns utilizing synthetic data.”