In less than two years, generative AI tools like ChatGPT have reshaped how digital content is created. But as the technology evolves, an unexpected side effect is growing alongside it: AI models are now learning from content that earlier AI models produced. This feedback loop, in which synthetic data becomes the training material for newer models, is already raising red flags among researchers. The problem isn’t just theoretical; it could seriously undermine how reliable, useful, and accurate future AI systems become. Experts call it “model collapse,” and the contamination of training data sits at the center of it.

What Is Model Collapse—and Why It Matters
Model collapse isn’t just a theoretical concern—it’s a practical one with wide-ranging consequences. It describes a process where AI models, trained increasingly on their own synthetic output or on the output of similar models, start to lose their grounding in real-world data. The result? Models that may sound confident but become increasingly inaccurate, repetitive, or misleading.
In 2023 and 2024, researchers began to take this problem seriously. Several academic papers explored the phenomenon, with one group even coining the term “Model Autophagy Disorder” (MAD). The concern is that AI systems are, metaphorically speaking, eating their own output—relying on data that lacks the nuance, error patterns, and creativity inherent in human-generated material.
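The dynamic is easiest to see in a toy setting. The sketch below is purely illustrative and is not taken from any of the papers above: the “model” does nothing more than fit a mean and a standard deviation to its data, then generate the next generation’s training set by sampling from itself. With nothing tying it back to the original human data, the fitted distribution typically narrows and drifts over the generations, a miniature version of the diversity loss these researchers describe.

import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human" data with genuine diversity.
data = rng.normal(loc=0.0, scale=1.0, size=100)

for generation in range(1001):
    # "Train" a model: here, just estimate a mean and a standard deviation.
    mu, sigma = data.mean(), data.std()
    if generation % 200 == 0:
        print(f"generation {generation:4d}: mean={mu:+.3f}  std={sigma:.3f}")

    # Publish synthetic output, then use it as the next training set.
    # Repeated fitting-and-resampling tends to lose the tails of the
    # original distribution, so the spread shrinks as generations pass.
    data = rng.normal(loc=mu, scale=sigma, size=100)

Real language models are vastly more complicated, but the underlying mechanism, small estimation errors compounding each time a model learns from another model’s output, is essentially the failure mode the model collapse papers formalize.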
A 2024 paper from Apple analyzed this issue in the context of large language models (LLMs) used for reasoning tasks. The researchers found that models begin to struggle when exposed to too much synthetic content, especially in complex reasoning scenarios. Their findings, however, were challenged by experts like Alex Lawsen of Open Philanthropy, who argued that some of the testing methods (such as exceeding token limits) may have skewed the results. This disagreement highlights the still-evolving nature of the research, but it doesn’t eliminate the concern.
Researchers like Maurice Chiodo and Rupprecht Podszun point out that even if an AI model sounds coherent, it doesn’t mean it’s trustworthy. “You can build a very usable model that lies,” said Chiodo. That’s especially dangerous in fields like medicine, law, or journalism, where factual precision is critical. Worse, once synthetic data is deeply embedded in the training pipeline, it’s nearly impossible to untangle—a kind of digital contamination that can’t easily be undone.
The issue also raises questions about fairness and access. If only a handful of companies own clean, human-generated datasets, they gain a massive competitive edge. Newcomers would be left to build on compromised data, making it harder to innovate or compete. In this way, model collapse isn’t just a technical issue—it’s also an economic and ethical one.

Why AI Experts Are Sounding the Alarm on “Data Contamination”
In the early days of the atomic era, scientists discovered that steel and other metals manufactured after the first nuclear weapons tests were contaminated with radioactive fallout. These “polluted” materials couldn’t be used in radiation-sensitive equipment, so researchers turned to older, uncontaminated sources, such as steel salvaged from ships sunk before 1945, to build clean instruments. Today, AI researchers are drawing a stark parallel: the internet is becoming saturated with content generated by AI, and that pollution may be sabotaging the next generation of AI development.
The problem centers on a phenomenon known as model collapse, where AI systems begin to degrade because they’re trained on synthetic content—output created by other AI systems rather than original human input. When AI models learn from this recycled material, their outputs may become less accurate, less creative, and potentially misleading. The effect is cumulative: as more AI-generated data is uploaded online and scraped by future models, the risk of a feedback loop grows.
John Graham-Cumming, former CTO of Cloudflare, dubbed this issue the “low-background steel problem of AI.” Just like researchers needed uncontaminated metals for precision instruments, AI developers now need access to “clean” human-created data—particularly data generated before the generative AI boom in 2022. He even registered a domain, lowbackgroundsteel.ai, to catalog sources of pre-AI content.
Some researchers, including those behind a 2024 paper on legal access to human-generated data, argue that AI is approaching a critical point. If too much synthetic data floods the internet, the ability to train high-quality, trustworthy models could become compromised. Ilia Shumailov, a co-author of several papers on model collapse, described it as a looming risk, though there’s still debate among AI experts about how imminent or severe that collapse might be.
At the heart of the issue is this: generative AI tools are contributing to their own degradation. Every AI-generated blog post, chatbot response, or auto-written email becomes part of a growing pool of data that’s no longer “clean”—and the next AI system trained on that pool inherits its limitations. As Maurice Chiodo from Cambridge’s Centre for the Study of Existential Risk puts it, “Everyone participating in generative AI is polluting the data supply for everyone.”

The Competitive Edge of “Clean” Data
As generative AI tools flood the internet with synthetic content, a new kind of digital inequality is emerging—one based not on computing power or algorithms, but on access to human-generated data. The quality of AI systems depends heavily on the data they’re trained on. When that data is original, human-made, and diverse, the resulting models tend to be more accurate, creative, and useful. But as the volume of AI-generated output balloons, especially since 2022, the availability of unpolluted training data is shrinking. This gives an immediate advantage to companies and institutions that gathered large datasets before the generative AI boom or that have proprietary access to cleaner sources.
Researchers like Maurice Chiodo and Rupprecht Podszun warn that this creates a lockout effect. New entrants into the AI space—startups, independent developers, and academic labs—may find themselves unable to compete because they’re stuck training models on lower-quality, synthetic-heavy data. That could reinforce the dominance of a few major players who already control valuable data pipelines, further concentrating power in an industry already marked by steep barriers to entry. The issue isn’t just about technological capability—it’s about fairness, transparency, and the long-term health of the AI ecosystem. Without access to clean training inputs, future models are more likely to produce generic, unreliable, or biased outputs, and innovation will slow.
The problem is compounded by the fact that the most valuable data isn’t necessarily what’s most factual—it’s what captures how humans actually communicate. Podszun points out that pre-2022 data from emails, forums, and social media carries human idiosyncrasies: real language patterns, creativity, and contextual richness that machine-generated text tends to flatten. AI systems trained primarily on synthetic data risk becoming stylistically dull and epistemically unstable, unable to reflect the nuance and unpredictability of genuine human interaction. That affects not only user experience but also trust—especially in sensitive applications like legal reasoning, medical advice, or education.
Some researchers have suggested solutions like federated learning, which allows AI models to be trained on clean data without giving direct access to that data. In theory, this could level the playing field by letting more developers build on high-quality inputs while still preserving privacy and data control. But implementing such systems at scale would require major policy coordination and technical safeguards, especially given the security and governance challenges of managing sensitive human data. Without meaningful intervention, the gap between those with clean data and those without is likely to widen, turning data access into a form of strategic leverage that shapes the direction of AI development itself.
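As rough intuition for how that could work, here is a minimal sketch of federated averaging, the core idea behind federated learning, using a toy linear model and plain NumPy; the data holders, dataset sizes, and learning rate are all invented for illustration. Each holder improves the shared model on data that never leaves its hands, and only the resulting weights travel back to be averaged. Real deployments add secure aggregation, differential privacy, and a great deal of engineering on top.

import numpy as np

rng = np.random.default_rng(1)

def local_update(weights, X, y, lr=0.1, epochs=20):
    """One data holder improves the shared model on its own private data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w

# Three data holders with private, human-generated datasets that never leave them.
true_w = np.array([2.0, -1.0])
holders = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    holders.append((X, y))

# Federated averaging: only model weights are exchanged, never the raw data.
global_w = np.zeros(2)
for round_number in range(10):
    local_weights = [local_update(global_w, X, y) for X, y in holders]
    global_w = np.mean(local_weights, axis=0)   # the coordinator averages the updates

print("learned weights:", global_w)             # should land close to [2.0, -1.0]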

Cleaning Up the Data Supply: Can the Damage Be Contained?
Tackling AI data contamination isn’t as simple as flipping a switch. Once synthetic content enters the training pipeline, it’s nearly impossible to extract. That’s why researchers like Maurice Chiodo emphasize the urgency of action before the problem becomes irreversible. One of the most straightforward proposals is mandatory labeling of AI-generated content, so that it’s easier to filter out synthetic data during training. But even that faces serious technical and legal challenges. Text can be copied, modified, or stripped of identifying marks. Watermarks can be removed or never applied in the first place. And because content is created and shared across borders, enforcing such labeling consistently is close to impossible.
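If labels were reliable, the filtering itself would be simple. The sketch below, with invented field names and a 2022 cutoff standing in for the pre-generative-AI era, shows the basic idea, and also why it is fragile: anything whose label was stripped or never applied passes straight through.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    ai_generated: bool   # provenance label supplied by the publisher, if any
    year: int            # year the content was created

def keep_for_training(doc: Document, cutoff_year: int = 2022) -> bool:
    """Keep documents labeled as human-made, or documents that predate
    the generative AI boom. The filter is only as good as its labels."""
    return (not doc.ai_generated) or doc.year < cutoff_year

corpus = [
    Document("Trip report from a 2019 hiking forum...", ai_generated=False, year=2019),
    Document("As an AI language model, I cannot...", ai_generated=True, year=2024),
    Document("Auto-written blurb with its label stripped", ai_generated=False, year=2024),
]

clean = [doc for doc in corpus if keep_for_training(doc)]
print(f"kept {len(clean)} of {len(corpus)} documents")  # the unlabeled synthetic blurb slips through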
The idea of maintaining centralized repositories of “clean” data—human-generated, pre-2022 content—has also been floated. But this raises a different set of risks. Large-scale data repositories would become high-value targets for misuse, surveillance, or political influence. Any centralized control of data—especially by governments—would need strong, transparent safeguards, which are rarely easy to design or enforce. Chiodo points out that storing this kind of information brings significant privacy and security concerns: what data gets included, who controls it, how it’s protected, and what happens if that control shifts. Today’s responsible custodian could be tomorrow’s authoritarian actor.
A more decentralized solution may lie in promoting competition and access to uncontaminated data through federated learning or similar frameworks. This would allow developers to train models on clean datasets without taking possession of the raw data itself. That could help preserve equity in the AI landscape while minimizing privacy risks. It’s a promising approach but still in early stages, and it depends on cooperation between data holders, researchers, and regulators—none of which is guaranteed. Rupprecht Podszun believes that if multiple stakeholders can manage access to clean data independently, the risks of political capture and monopolization could be reduced. But without regulatory oversight or shared standards, it’s not clear how well such a system would function at scale.
Governments, meanwhile, are still figuring out how to approach the problem. The U.S. has been hesitant to regulate AI development out of fear of stifling innovation, and the U.K. has leaned toward a similarly light-touch model. In contrast, the European Union has moved ahead with the AI Act, which aims to create a legal framework for responsible AI development. While the Act doesn’t yet address synthetic data pollution directly, it could lay the groundwork for future rules that do. Podszun argues that delaying regulation often leads to worse outcomes later, pointing to how tech platforms consolidated unchecked power during the early internet era. The lesson, he says, is to act before the window for meaningful intervention closes.

What You Can Do: Everyday Steps to Protect Data Quality
While AI model collapse might seem like a problem for researchers and policymakers, everyday users play a role too. Every time someone creates or shares original content—whether it’s a blog post, a social media comment, or even a code snippet—they’re contributing to the pool of human-generated data that AI systems depend on. This makes it all the more important to keep producing and prioritizing authentic, human-made material. If your work is being published online, label it clearly as human-written when possible. It helps platforms, researchers, and eventually model trainers distinguish original data from AI output—especially if content scraping tools become more selective.
Be cautious with how you use AI writing assistants or chatbots. It’s easy to fall into the habit of relying on them for emails, social posts, or other everyday communication, but doing so increases the volume of synthetic content online. If you’re using AI tools, consider editing their outputs significantly or adding your own voice and structure. The goal isn’t to abandon helpful tools but to avoid contributing to a cycle of overreliance and data dilution. The more synthetic content circulates, the harder it becomes for AI models to learn from genuine human behavior.
For developers and creators working in digital media, code, or publishing, consider archiving and tagging your work in trusted repositories—especially if it was created before the generative AI boom. These archives could become essential in the future, much like the Arctic Code Vault or Graham-Cumming’s lowbackgroundsteel.ai project. High-quality, timestamped content will only grow in value as clean data becomes scarcer. In addition, if you manage platforms or datasets, start thinking about provenance tracking—keeping records of how content was created and whether AI tools were involved.
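Provenance tracking doesn’t need to be elaborate to be useful. Here is a minimal sketch with made-up field names: record a cryptographic hash of the content, when and by whom it was made, and whether AI tools were involved, so the claim can be verified later if the text resurfaces elsewhere.

import hashlib
import json
from datetime import datetime, timezone

def provenance_record(text: str, author: str, ai_assisted: bool, tools: list[str]) -> dict:
    """Build a simple provenance record for a piece of content. The hash lets
    anyone confirm later that the text is unchanged; the remaining fields
    document how it was produced."""
    return {
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "author": author,
        "created": datetime.now(timezone.utc).isoformat(),
        "ai_assisted": ai_assisted,
        "tools": tools,
    }

article = "Original essay written without generative AI assistance."
record = provenance_record(article, author="jane@example.org", ai_assisted=False, tools=[])
print(json.dumps(record, indent=2))

A record like this proves nothing by itself, but timestamped, hash-verifiable statements about how content was made are exactly the kind of signal future dataset curators are likely to look for.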
Lastly, stay informed and engaged. The AI space is changing fast, and public awareness can help shape better norms and pressure platforms to prioritize transparency. Just as users pushed for clearer labeling of sponsored posts or deepfakes, similar momentum could help establish better safeguards against data contamination. You don’t need to be an AI expert to care about what information machines learn from. In a world where synthetic content is everywhere, protecting the integrity of human-made data is a shared responsibility.
Why It Matters Now—And What Happens If We Wait
AI isn’t just learning from us—it’s learning from everything we’ve ever published, shared, and stored online. But as generative models flood the internet with synthetic content, the pool of high-quality human data is shrinking. If left unchecked, this could compromise the accuracy, usefulness, and credibility of the next generation of AI systems. And once the training pipeline is saturated with synthetic output, the damage is hard—if not impossible—to reverse. That’s why experts are urging both immediate awareness and long-term thinking: the choices we make now will shape the foundation of future AI.
This isn’t just a technical problem—it’s a structural one. Without proactive steps to preserve clean data and ensure fair access to it, the AI field risks becoming even more concentrated, with a few dominant players controlling the only reliable training sources. That threatens not just innovation, but also accountability. If all models are trained on degraded or derivative content, we risk building systems that sound smart but don’t know what they’re talking about. Worse, once trust in AI-generated content breaks down, rebuilding it may be far more difficult than preventing its erosion in the first place.
Policy alone won’t solve this. Regulation is slow, fragmented, and often comes after the damage is done. What’s needed is a shared shift in mindset—among developers, users, platforms, and policymakers alike. Preserving data integrity must become a core part of how AI is built, trained, and deployed. That means labeling synthetic content, rewarding original human input, and investing in infrastructure that makes clean data accessible without centralizing power. It also means being honest about what today’s AI can and can’t do, and resisting the temptation to flood the system with fast, cheap content that undermines long-term quality.
We’re still early enough in this transition to change course. But that window won’t stay open forever. The integrity of AI depends on the integrity of its training data—and protecting that starts with recognizing the value of human-created content before it becomes the rarest resource in tech.