The Internet is Eating Itself

It starts with a blog post written by ChatGPT, a product description generated by Copilot, a news summary created by Gemini. None of it is written by humans, yet all of it ends up on the web, gets indexed by search engines, and can become part of the training data for the next generation of AI models.

This is not a future vision. It is happening now, on a large scale.

A new study published on arXiv (2602.16065) addresses what happens when generative AI models are trained recursively on data contaminated by earlier AI-generated material. The phenomenon has a name in the research community: model collapse. And according to the research, it is not a question of whether it happens, but of how fast and how hard it hits.

A Self-Reinforcing Problem with Exponential Growth

Figures from the OECD's AI Incidents and Hazard Monitor show that media-registered incidents related to AI-generated content increased from around 50 per month in early 2020 to nearly 500 per month in January 2026 – a tenfold increase in six years, with the last twelve months alone accounting for a doubling.

At the same time, OECD data shows that the proportion of businesses using AI rose from 8.7 percent in 2023 to 20.2 percent in 2025. More use means more AI-generated content, which in turn means more contamination of future training data.

A Europol report, cited in numerous analyses, estimates that up to 90 percent of all online content could be AI-generated by 2026 – admittedly a forward-looking expert estimate rather than a measured fact, but the direction is clear.

A new arXiv preprint (2602.16136) introduced the term Retrieval Collapse in February 2026: AI content contaminates not only training data but also web searches. With 67 percent synthetic content in search pools, the analysis showed that over 80 percent of SEO-exposed results were AI-generated, further eroding access to authentic human text.

By 2026, up to 90 percent of web content could be AI-generated, and next-generation models will train on this very material.

Key figures:

  • 90 percent – estimated AI share of web content by 2026 (Europol)
  • 9 – iterations before complete model collapse in the Nature study
  • 20.2 percent – share of firms using AI in 2025 (OECD)
When AI Trains on AI: Researchers Warn of Digital Self-Cannibalism

The Shumailov Study: From Wikipedia to Nonsense in Nine Steps

The most cited empirical documentation of model collapse comes from Shumailov et al., published in Nature in 2024. Researchers trained language models iteratively on Wikipedia text, where each new generation only had access to text produced by the previous model.

The results were discouraging: even in the early generations, rare words, concepts, and stylistic variants began to disappear. By generation nine, the models produced meaningless text, mixing entirely unrelated concepts such as architecture and biology. What had started as a functioning language model had become a distorted echo of itself.

Theoretical analysis from NYU (2024) confirms the finding mathematically: since each training round reduces the variance in the model's parameter distribution, the process is inevitable without correctives. Rare patterns – which are crucial for a model to handle edge cases, minority languages, and complex topics – disappear with mathematical certainty.
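The mechanism is easy to see in miniature. The toy simulation below (not the Shumailov et al. setup, just a minimal sketch of the same feedback loop) re-estimates a token distribution from samples drawn from the previous "generation". Because each new distribution's support is a subset of the old one, rare tokens can only disappear, never reappear:

```python
import random
from collections import Counter

def retrain(dist, sample_size, rng):
    """One 'generation': sample synthetic text from the current model,
    then re-estimate the token distribution from that sample alone."""
    tokens = rng.choices(list(dist), weights=list(dist.values()), k=sample_size)
    counts = Counter(tokens)
    return {t: c / sample_size for t, c in counts.items()}

def simulate(generations=9, sample_size=200, seed=0):
    rng = random.Random(seed)
    # Toy vocabulary: one common token plus a long tail of rare ones.
    dist = {"common": 0.5, **{f"rare{i}": 0.005 for i in range(100)}}
    vocab_sizes = [len(dist)]
    for _ in range(generations):
        dist = retrain(dist, sample_size, rng)
        vocab_sizes.append(len(dist))  # tokens still surviving
    return vocab_sizes

print(simulate())  # vocabulary size per generation, shrinking over time
```

After nine generations, a large share of the rare "tail" is gone, which is exactly the loss of rare words and edge-case patterns the study describes, compressed into a few lines.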

Noema Magazine has described the phenomenon as the web "eating itself" – a gradual dilution of high-quality data and an amplification of errors in niche domains.

"The model began to mix architecture with biology. What started as Wikipedia knowledge became meaningless text within nine generations." — Shumailov et al., Nature, 2024

Norwegian is Particularly Vulnerable

For Norwegian language technology, this is particularly serious. Norwegian is spoken by around five million people, and the authentic digital text material in Norwegian is limited compared to major languages like English, Spanish, or Mandarin.

Norwegian language models – developed by the National Library and the University of Oslo, among others – depend on the Norwegian text corpus actually reflecting real human language use. If an increasing proportion of Norwegian web text is AI-generated, future Norwegian models risk being trained on a gradually more homogeneous and artificial language.

The consequences may include:

  • Loss of dialectal richness: AI text is typically written in standardized Bokmål or Nynorsk, and dialectal variants risk disappearing from training data
  • Stylistic homogenization: Literary and stylistic variation – essays, local history, debate pieces – is replaced by smooth, neutral AI prose
  • Amplification of bias: Model collapse reinforces statistical averages and underrepresents minority voices

Figures from a Norwegian youth survey show that 70 percent of young people between 16 and 24 used AI for schoolwork in 2025. The demand for good Norwegian-language models is high – but the foundation for building them is under pressure.

Simula Research Laboratory, which according to the Research Council's 2025 evaluation ranks highest in Norway for ICT impact, works on multimodal learning and data methods that can address part of the problem. The institute combines text, images, and sound in its training pipelines and collaborates with clinical partners to build unique datasets.

Can the Problem Be Solved?

The research community is not uniformly pessimistic. The arXiv article (2602.16065) seeks to establish theoretical guarantees for model quality even under contaminated training conditions, pointing out that collapse can be limited – but only with conscious countermeasures.

What helps according to the research?

Data curation and provenance: OpenAI and Google have already begun prioritizing the licensing of human-produced data from before 2022 – a clear recognition that "clean" data access is a competitive advantage. Traceability of data origin is crucial.
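In practice, provenance-based curation can be as simple as filtering on metadata. A minimal sketch (the record fields and the 2022 cutoff are illustrative assumptions, not any vendor's actual schema):

```python
from datetime import date

# Hypothetical document records carrying provenance metadata.
corpus = [
    {"text": "sample A", "source": "newspaper-archive", "crawled": date(2021, 5, 1), "synthetic": False},
    {"text": "sample B", "source": "web-crawl", "crawled": date(2025, 3, 12), "synthetic": None},  # origin unknown
    {"text": "sample C", "source": "licensed-books", "crawled": date(2019, 9, 30), "synthetic": False},
]

CUTOFF = date(2022, 1, 1)  # before AI-generated text flooded the web

def low_contamination(doc):
    """Keep documents that are verifiably human-written, or that were
    crawled before the cutoff, when AI contamination was unlikely."""
    return doc["synthetic"] is False or doc["crawled"] < CUTOFF

clean = [d for d in corpus if low_contamination(d)]
```

The point is the policy, not the code: without recorded origin and date, a filter like this cannot be written at all, which is why traceability is crucial.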

Automatic detection: Tools like GPTZero report 98 percent accuracy on pure AI text and over 90 percent on paraphrased AI text, according to the company's own figures. DependencyAI, which analyzes syntactic structures via spaCy and LightGBM, achieved 88.85 percent accuracy and an 88.94 F1-score across seven different text generators in the M4GT-Bench datasets. These methods can be used to clean training datasets of synthetic content.
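Real detectors such as those named above build on statistical regularities of machine text; as a purely illustrative toy (not how GPTZero or DependencyAI actually work), here is one such signal, lexical diversity measured as type-token ratio:

```python
def type_token_ratio(text):
    """Share of distinct words in a text: a crude proxy for lexical
    diversity, one of the statistical signals detectors build on."""
    words = text.lower().split()
    return len(set(words)) / len(words)

# Invented example sentences: varied phrasing vs. repetitive phrasing.
varied = "The ferry groaned past Lofoten while gulls bickered over herring scraps."
repetitive = "The text is good. The text is clear. The text is good and clear."

print(type_token_ratio(varied), type_token_ratio(repetitive))
```

A single feature like this is far too weak on its own; production detectors combine many such signals (syntactic, lexical, statistical) in a trained classifier.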

Synthetic data with caution: Paradoxically, synthetic data can be used to compensate for data shortages – but only if done in a controlled manner, with verification and clear labeling of origin. Google DeepMind's GDR method (Generalized Data Refinement) filters toxic and inaccurate data from web scrapes and is a promising approach for low-resource languages like Norwegian.
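"Controlled" use of synthetic data typically means labeling its origin and capping its share of the mix. A minimal sketch of such a policy (the cap and function are my own illustration, not the GDR method):

```python
def mix_corpus(real_docs, synthetic_docs, max_synthetic_share=0.2):
    """Combine real and verified synthetic documents, capping the
    synthetic share so real data dominates, and labeling every
    document's origin so the mix stays auditable."""
    # Largest synthetic count keeping synthetic/(real+synthetic) <= cap.
    budget = int(len(real_docs) * max_synthetic_share / (1 - max_synthetic_share))
    kept = synthetic_docs[:budget]
    return [(d, "real") for d in real_docs] + [(d, "synthetic") for d in kept]

mixed = mix_corpus(["r%d" % i for i in range(10)], ["s%d" % i for i in range(10)])
```

The origin labels are what make the difference between accumulation (real data plus marked synthetic data) and the replacement dynamic that drives collapse.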

The National Library's role: Norwegian institutions like the National Library are already working on digitizing and preserving authentic, dated text corpora. This work is not just culturally valuable – it is technologically strategic.

"Accumulation of real data, synthetic verification, and provenance tracking are the three pillars to avoid collapse" — summary of recommendations from Shumailov et al. and arXiv 2602.16065

What Does This Mean for Norwegian Actors?

Norwegian businesses using or developing AI with Norwegian language understanding should take these implications seriously:

Data quality is more important than data quantity. A large amount of web-scraped Norwegian text is not necessarily better if a high proportion is AI-generated. According to an Ahrefs study of 600,000 websites, the correlation between the proportion of AI content and Google ranking is only 0.011, effectively zero: producing AI content confers no measurable ranking advantage.

Traceability is a competitive advantage. Training data should be documented regarding origin and timing. Without this, it is impossible to know how contaminated the dataset is – and impossible to correct the problem over time.

Regulatory pressure is increasing. The EU AI Act sets requirements for transparency regarding training data. Through the EEA agreement, this will also apply to Norwegian actors. Businesses that already have data documentation in place will have an advantage.

Invest in Norwegian-language corpora. Support for the National Library's digitization work, Simula's research, and similar initiatives is not just cultural policy – it is infrastructure for future Norwegian AI competitiveness.

Model collapse is not a theoretical threat. It is a process already underway, documented in some of the world's leading scientific journals. The question is not whether Norwegian language technology will notice it – but whether we act fast enough to mitigate the consequences.