AI Adoption and Data Poisoning

What is AI data poisoning?

AI data poisoning refers to an attack intended to corrupt the data used to train a machine learning (ML) system. ML systems rely on their training data to generalise to novel information, and that data is curated to present the model with a representative sample. When the data is compromised, so is the ML system.

Data poisoning includes tactics like encoding hidden triggers into training data, injecting malicious samples into the data set, assigning incorrect labels or other metadata to the training data, and manipulating the data set by adding fake samples or by altering or removing real ones.
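To make the mislabelling tactic concrete, here is a minimal, illustrative sketch in Python: the same classifier is trained twice on a synthetic data set, once with clean labels and once with 10% of the labels flipped. The data, model choice and poisoning rate are all arbitrary assumptions for demonstration, not a real attack.

```python
# Illustrative sketch: label flipping as a simple poisoning tactic.
# The data set, model and 10% poisoning rate are arbitrary assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# A clean, synthetic binary classification data set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: train on clean labels.
clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("clean accuracy:", accuracy_score(y_test, clean_model.predict(X_test)))

# "Poison" 10% of the training labels by flipping them.
rng = np.random.default_rng(0)
poisoned = y_train.copy()
idx = rng.choice(len(poisoned), size=len(poisoned) // 10, replace=False)
poisoned[idx] = 1 - poisoned[idx]

# Retrain on the poisoned labels: test accuracy typically drops.
poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, poisoned)
print("poisoned accuracy:", accuracy_score(y_test, poisoned_model.predict(X_test)))
```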

Data poisoning leads to model drift, security issues, operational failures and compliance risks, making an ML system more costly and less useful. This has significant implications across nearly every industry, as these systems are used in everything from medical diagnostics to fraud detection to driving autonomous vehicles. It also has significant implications for the deployment of increasingly common artificial intelligence solutions within business.

The University of Chicago’s Glaze Project: Nightshade

In the generative AI era, scholars have weaponised data poisoning to catch out AI organisations whose training data may not be sourced with the consent of all contributors. To that end, researchers at the University of Chicago working on the Glaze Project have developed a tool called Nightshade.

Nightshade’s creators call it “a last defence for content creators against web scrapers that ignore opt-out/do-not-crawl directives.” Applied to an artist’s work, it makes subtle changes that are hard for humans to notice, but that can corrupt the images produced by text-to-image generation AI trained on the altered works.

Part of the reason this kind of attack is difficult to defend against is that these models require such large data sets that identifying instances of poisoned data is challenging, and even relatively small numbers of poisoned samples can create performance problems.

In the case of Nightshade, unscrupulously collected data (here, artists’ work) becomes a risk that can compromise the performance of an AI model. But not all such attacks are altruistic.

Data poisoning is a growing problem for organisations using AI

This isn’t a problem confined to massive AI models with sprawling data sets. Recent breakthroughs in artificial intelligence, such as generative AI, mean that more organisations are engaging with the wider machine learning landscape. As this group of technologies becomes more democratised, the risks grow.

Data poisoning is a growing problem for businesses of all sizes. In its 2025 Cost of a Data Breach Report, IBM revealed that nearly a third of organisations that reported a security incident involving artificial intelligence solutions said the source was a third-party vendor delivering AI as SaaS, a highly democratised business model whose major selling point is its lower upfront costs.

It also reported that data poisoning featured in 15% of all security incidents involving AI models. Yet not all organisations that use these technologies have AI governance policies, or even data governance policies, in place.

In terms of the types of data exposed in a breach, IBM reported that company intellectual property was the most costly data involved, and personally identifiable information (PII) of clients was the most commonly compromised.

Tay: A Machine Learning Project

Microsoft’s Tay experiment is one of the most famous examples of data poisoning in recent history because it was so public.

In 2016, Microsoft released an artificial intelligence bot called “Tay” on the social platform then known as Twitter (now X). It was expected to exchange tweets with people using the platform, learn from those exchanges, and get better and better at conversation.

However, upon release, Tay immediately met the zeitgeist of the open public internet and had to be taken down within a single day after posting thousands of offensive tweets. In a statement made to ABC News, a representative summarised: “Unfortunately, within the first 24 hours of coming online, we became aware of a coordinated effort by some users to abuse Tay’s commenting skills to have Tay respond in inappropriate ways.”

This was not a highly successful machine learning project, but arguably, a lot of Twitter users learnt more about machines. Because users interacting with Tay intentionally exposed the bot to data inappropriate for its training set, the model became unfit for purpose: data poisoning in action!

What do you do about data poisoning?

In the above examples of data poisoning, more cautious data collection would have reduced the impact of bad actors. Nightshade is designed to create negative consequences for incautious scraping. The Tay Twitter bot, by contrast, was designed to develop based on how people interacted with it, so this kind of vulnerability was inherent to its design.

If your organisation is serious about adding value with AI, data governance needs to be a core pillar of your approach. AI and data are closely entwined, so governance must not be an afterthought.

It’s better to prevent data poisoning than to address it after the fact. You can avert such problems with your data in several ways:

  • Identify and review any outliers in your data. You can use machine learning tools to find anomalous records in your training data and eliminate them or flag them for review; the first sketch after this list shows one approach.
  • Maintain good data hygiene. It’s a challenge to achieve perfectly clean and accurate data, but you can keep your sample representative by automating data cleansing to standardise data, reduce errors and avoid duplication (see the second sketch below).
  • Verify your sources. Make sure the data you’re collecting is from trusted sources that actually meet your requirements for sampling. Log the sources of your data, and keep track of its metadata and history, so you always know where it came from and whether anything was changed over time; the second sketch below also illustrates this.
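As a starting point for the outlier review suggested above, here is a minimal sketch using scikit-learn’s IsolationForest. The synthetic features and the contamination rate are assumptions you would replace with your own data and tuning.

```python
# A minimal sketch of outlier flagging with scikit-learn's IsolationForest.
# The synthetic features and contamination rate are assumptions to tune.
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in for numeric features extracted from your training records.
rng = np.random.default_rng(42)
records = rng.normal(loc=0.0, scale=1.0, size=(1000, 5))
records[:10] += 8.0  # a handful of injected, anomalous rows

detector = IsolationForest(contamination=0.01, random_state=42)
labels = detector.fit_predict(records)  # -1 marks suspected outliers

suspect_rows = np.flatnonzero(labels == -1)
print(f"{len(suspect_rows)} records flagged for human review:", suspect_rows[:10])
```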
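And for the hygiene and provenance points, a minimal pandas sketch showing standardisation, de-duplication, source filtering and per-record content hashing so later changes are detectable. The column names ("text", "source") and the trusted-source list are assumptions for illustration.

```python
# A minimal sketch of automated cleansing and provenance logging.
# Column names and the trusted-source list are illustrative assumptions.
import hashlib
import pandas as pd

raw = pd.DataFrame({
    "text":   ["Good sample", "good sample ", "Bad  sample", "Good sample"],
    "source": ["vendor-a", "vendor-a", "unknown", "vendor-a"],
})

# Standardise, then drop exact duplicates introduced by formatting noise.
raw["text"] = raw["text"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
clean = raw.drop_duplicates(subset="text").copy()

# Keep only records from sources you trust.
trusted = {"vendor-a", "vendor-b"}
clean = clean[clean["source"].isin(trusted)]

# Record a content hash per row so any later alteration is detectable.
clean["sha256"] = clean["text"].map(lambda t: hashlib.sha256(t.encode()).hexdigest())
print(clean)
```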

Are you deploying artificial intelligence solutions in your organisation? Our data experts are only a phone call away. Contact us today.
