Dataset bias and adversarial examples: AI’s data problem
For The Drum’s data deep dive, The New Data & Privacy Playbook, Matt Sutherland of agency True confronts the downsides of AI, data-learned biases and how they can proliferate prejudice.
Are tools like ChatGPT doomed to be culturally biased? / Andrew Neel
Artificial intelligence (AI) is still front-page industry news; it’s sparked conversations around its future usage, the jobs it might impact, and what it will be able to infer and predict as it develops.
However, our tendency to assume that all answers provided by AI are correct is fraught with danger. That’s because, fundamentally, an AI tool can only perform its task as well as its data allows.
AI systems learn from data, so the quality, quantity, and representativeness of the data used to train the AI system directly affect its ability to perform the tasks we ask of it.
That’s why Tay (a Microsoft chatbot for Twitter) ‘learned’ to be racist and discriminatory within hours of going public. It was trained on Twitter and had learned (quite terrifyingly) from the platform’s own users, who were posting offensive content.
This is an example of data holding biases and assumptions that reflect the perspectives of the people who created or curated it. Likewise, if an AI system is trained on a dataset of images that mostly depict white people, it may not be as accurate when identifying people with darker skin tones.
A similar issue has surfaced in predictive policing. Police departments have used AI to predict crime hotspots and identify potential suspects.
There have been instances where these algorithms have shown prejudice because of the data behind them. In 2016, ProPublica found that a risk-assessment algorithm used in Florida was almost twice as likely to wrongly label black defendants as high-risk for committing future crimes as it was white defendants.
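One way to surface this kind of disparity is to compare, group by group, how often people who did not go on to reoffend were still flagged as high-risk (the false positive rate). The snippet below is a minimal sketch of that check; the function name, the record layout and the numbers are all made up for illustration and are not ProPublica's data or method.

```python
from collections import defaultdict

def false_positive_rate_by_group(records):
    """records: iterable of (group, predicted_high_risk, reoffended) tuples."""
    false_positives = defaultdict(int)  # flagged high-risk, did not reoffend
    negatives = defaultdict(int)        # everyone who did not reoffend
    for group, predicted_high_risk, reoffended in records:
        if not reoffended:
            negatives[group] += 1
            if predicted_high_risk:
                false_positives[group] += 1
    return {g: false_positives[g] / negatives[g] for g in negatives}

# Hypothetical records, invented purely to illustrate the calculation.
records = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6 + [("A", True, True)] * 5
    + [("B", True, False)] * 2 + [("B", False, False)] * 8 + [("B", True, True)] * 5
)

rates = false_positive_rate_by_group(records)
print(rates)  # group A is wrongly flagged twice as often as group B
```

A gap between groups on this measure does not explain why the model behaves that way, but it is a cheap first signal that the training data deserves a closer look.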
Meanwhile, in the medical sector, a 2018 study tried to produce AI-generated cancer diagnostics. It was later concluded that the AI was having difficulty distinguishing between malignant and benign lesions. One flaw was that it had started to conclude that the presence of a ruler in the image of the lesion meant that the lesion was more likely to be malignant.
Why? Because the images that were used to train the AI were often medical images where the malignant lesion was being measured. This is known as dataset bias, and it can lead to inaccurate or unfair results. For example, in facial recognition, dataset bias can lead to higher error rates for people with certain demographic characteristics, such as race or gender.
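The ruler problem can be reproduced in miniature. The sketch below assumes a deliberately simple 'model' that picks whichever single feature best matches the label in its training data; the feature names (`ruler`, `irregular`) and all counts are hypothetical. In the training set rulers mostly appear next to malignant lesions, so the model latches onto the ruler. In a second set where every image contains a ruler, that shortcut is no better than a coin toss.

```python
def accuracy(rows, feature):
    """Share of rows where the chosen binary feature equals the label."""
    return sum(row[feature] == row["malignant"] for row in rows) / len(rows)

def fit_stump(rows, features):
    """Pick the single feature that best matches the label on training data."""
    return max(features, key=lambda f: accuracy(rows, f))

def make_rows(n, ruler, irregular, malignant):
    return [{"ruler": ruler, "irregular": irregular, "malignant": malignant}] * n

# Hypothetical training set: rulers mostly appear with malignant lesions.
train = (
    make_rows(40, 1, 1, 1) + make_rows(8, 1, 0, 1) + make_rows(2, 0, 1, 1)
    + make_rows(40, 0, 0, 0) + make_rows(8, 0, 1, 0) + make_rows(2, 1, 0, 0)
)
# Hypothetical test set: every image contains a ruler.
test = (
    make_rows(20, 1, 1, 1) + make_rows(5, 1, 0, 1)
    + make_rows(20, 1, 0, 0) + make_rows(5, 1, 1, 0)
)

chosen = fit_stump(train, ["ruler", "irregular"])
print(chosen, accuracy(train, chosen), accuracy(test, chosen))
print("irregular", accuracy(test, "irregular"))
```

The clinically meaningful feature scores lower on the training data, which is exactly why the shortcut wins there and then fails once the data changes.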
The dark side of AI
A related issue is the problem of ‘adversarial examples’: data that is deliberately modified to fool AI systems. Researchers have shown that adding noise to an image, at a level imperceptible to the human eye, can cause an AI system to misclassify it. This raises concerns about the reliability and robustness of AI systems when they are used in critical applications, such as autonomous vehicles or healthcare.
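To see why tiny changes can have a big effect, consider a minimal sketch in the spirit of the 'fast gradient sign' approach, applied to a toy linear classifier rather than a real image model. Every name and number here is an assumption for illustration: the 'image' is 1,000 pixel values between 0 and 1, and each pixel is nudged by at most 0.02 in the direction that lowers the classifier's score.

```python
def score(weights, pixels):
    return sum(w * p for w, p in zip(weights, pixels))

def predict(weights, pixels):
    return "positive" if score(weights, pixels) > 0 else "negative"

def sign(value):
    return (value > 0) - (value < 0)

def perturb(weights, pixels, epsilon):
    """Shift each pixel by at most epsilon against the sign of its weight.

    For a linear model the gradient of the score with respect to a pixel is
    simply that pixel's weight, so this is the most damaging bounded change.
    """
    return [min(1.0, max(0.0, p - epsilon * sign(w)))
            for w, p in zip(weights, pixels)]

# Toy classifier and image (hypothetical values).
weights = [0.05 if i % 2 == 0 else -0.05 for i in range(1000)]
image = [0.51 if i % 2 == 0 else 0.49 for i in range(1000)]

adversarial_image = perturb(weights, image, epsilon=0.02)
largest_change = max(abs(a - b) for a, b in zip(image, adversarial_image))

print(predict(weights, image), predict(weights, adversarial_image))
print(largest_change)
```

No single pixel moves by more than 2% of its range, yet the small shifts add up across all 1,000 pixels and flip the prediction.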
It’s also possible that generative AI (which can create new content such as images, videos, or text) can cause issues for online know-your-customer (KYC) systems.
KYC is the process of verifying the identity of customers. It’s used by financial institutions and similar businesses to prevent fraud, money laundering, and other illegal activities. Generative AI can be used to fake images, voice or videos that mimic a real person’s appearance or behavior.
What can we do about it?
The first step is to ensure that we provide our AI systems with large volumes of high-quality data that is representative of the task they are designed to perform. The data should be diverse and cover a range of scenarios and variations that the AI system is likely to encounter in its own real-world application.
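A simple first check on representativeness is to compare each group's share of the training data with the share you expect to see in real-world use. The sketch below is one hedged way to do that; the function name, the `tolerance` threshold, the group labels and the figures are all assumptions for illustration.

```python
from collections import Counter

def underrepresented(samples, expected_share, tolerance=0.5):
    """Return groups whose share of the data is below tolerance * expected share."""
    counts = Counter(samples)
    total = len(samples)
    flagged = {}
    for group, expected in expected_share.items():
        actual = counts.get(group, 0) / total
        if actual < expected * tolerance:
            flagged[group] = actual
    return flagged

# Hypothetical image dataset labelled by skin tone, with made-up target shares.
samples = ["lighter"] * 90 + ["darker"] * 10
expected_share = {"lighter": 0.6, "darker": 0.4}

flagged = underrepresented(samples, expected_share)
print(flagged)  # groups that make up far less of the data than expected
```

Passing a check like this does not guarantee fair results, but failing it is a clear sign that more data needs to be collected before training.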
The data must be annotated to provide clear guidance and classification for the AI system to consume. Human values must be considered as part of the design of these AI systems, as well as the data that they are trained upon.
Content by The Drum Network member:
19 years ago true was founded with the aim of being different; straight-talking, to the point, focussed on delivering long-term growth, not through chat, but through action. Creating work that was true to our clients’ needs, true to their customers’ needs and true to our own expectations.