Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

Regurgitative Training: The Value of Real Data in Training Large Language Models

July 3, 2024

93% Match

Jinghui Zhang, Dandan Qiao, ... , Qiang Wei

Computation and Language

Artificial Intelligence

Machine Learning

What happens if we train a new Large Language Model (LLM) using data that are at least partially generated by other LLMs? The explosive success of LLMs means that a substantial amount of content online will be generated by LLMs rather than humans, which will inevitably enter the training datasets of next-generation LLMs. We evaluate the implications of such "regurgitative training" on LLM performance. Through fine-tuning GPT-3.5 with data generated either by itself or by othe...

The Curse of Recursion: Training on Generated Data Makes Models Forget

May 27, 2023

93% Match

Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, ... , Ross Anderson

Machine Learning

Artificial Intelligence

Computation and Language

Cryptography and Security

Computer Vision and Pattern ...

Stable Diffusion revolutionised image creation from descriptive text. GPT-2, GPT-3(.5) and GPT-4 demonstrated astonishing performance across a variety of language tasks. ChatGPT introduced such language models to the general public. It is now clear that large language models (LLMs) are here to stay, and will bring about drastic change in the whole ecosystem of online text and images. In this paper we consider what the future might hold. What will happen to GPT-{n} once LLMs c...

LLM as a Broken Telephone: Iterative Generation Distorts Information

February 27, 2025

93% Match

Amr Mohamed, Mingmeng Geng, ... , Guokan Shang

Computation and Language

Artificial Intelligence

As large language models are increasingly responsible for online content, concerns arise about the impact of repeatedly processing their own outputs. Inspired by the "broken telephone" effect in chained human communication, this study investigates whether LLMs similarly distort information through iterative generation. Through translation-based experiments, we find that distortion accumulates over time, influenced by language choice and chain complexity. While degradation is ...

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text

November 16, 2023

92% Match

Yanzhu Guo, Guokan Shang, ... , Chloé Clavel

Computation and Language

This study investigates the consequences of training large language models (LLMs) on synthetic data generated by their predecessors, an increasingly prevalent practice aimed at addressing the limited supply of human-generated training data. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we developed a set of novel metrics tar...

Characterizing Model Collapse in Large Language Models Using Semantic Networks and Next-Token Probability

October 16, 2024

92% Match

Daniele Gambetta, Gizem Gezici, Fosca Giannotti, Dino Pedreschi, ... , Luca Pappalardo

Computation and Language

Artificial Intelligence

As synthetic content increasingly infiltrates the web, generative AI models may experience an autophagy process, where they are fine-tuned using their own outputs. This autophagy could lead to a phenomenon known as model collapse, which entails a degradation in the performance and diversity of generative AI models over successive generations. Recent studies have explored the emergence of model collapse across various generative AI models and types of data. However, the curren...

Benchmarking Linguistic Diversity of Large Language Models

December 13, 2024

92% Match

Yanzhu Guo, Guokan Shang, Chloé Clavel

Computation and Language

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. Th...

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

December 4, 2024

92% Match

Alex Havrilla, Andrew Dai, Laura O'Mahony, Koen Oostermeijer, Vera Zisler, Alon Albalak, Fabrizio Milo, Sharath Chandra Raparthy, Kanishk Gandhi, Baber Abbasi, Duy Phung, Maia Iyer, Dakota Mahan, Chase Blagden, Srishti Gureja, Mohammed Hamdy, Wen-Ding Li, Giovanni Paolini, ... , Elliot Meyerson

Machine Learning

Artificial Intelligence

Computation and Language

Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of synthetic data generated by each algorithm in terms of data quality, diversity, and complexity. We ...

Generative Monoculture in Large Language Models

July 2, 2024

91% Match

Fan Wu, Emily Black, Varun Chandrasekaran

Computation and Language

Artificial Intelligence

We introduce generative monoculture, a behavior observed in large language models (LLMs) characterized by a significant narrowing of model output diversity relative to available training data for a given task: for example, generating only positive book reviews for books with a mixed reception. While in some cases, generative monoculture enhances performance (e.g., LLMs more often produce efficient code), the dangers are exacerbated in others (e.g., LLMs refuse to share ...

Machine-generated text detection prevents language model collapse

February 21, 2025

91% Match

George Drayson, Vasileios Lampos

Computation and Language

Machine Learning

As Large Language Models (LLMs) become increasingly prevalent, their generated outputs are proliferating across the web, risking a future where machine-generated content dilutes human-authored text. Since web data is the primary resource for LLM pretraining, future models will be trained on an unknown portion of synthetic data. This will lead to model collapse, a degenerative process which causes models to reinforce their own errors and experience a drop in model performance....

Collapse of Self-trained Language Models

April 2, 2024

91% Match

David Herel, Tomas Mikolov

Computation and Language

Artificial Intelligence

In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of...

Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop
