A Survey on Data Contamination for Large Language Models

Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation

June 20, 2024

96% Match

Chunyuan Deng, Yilun Zhao, Yuzhao Heng, Yitong Li, Jiannan Cao, ... , Arman Cohan

Computation and Language

Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that ...

Benchmark Data Contamination of Large Language Models: A Survey

June 6, 2024

95% Match

Cheng Xu, Shuhao Guan, ... , M-Tahar Kechadi

Computation and Language

The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and Gemini has transformed the field of natural language processing. However, it has also resulted in a significant issue known as Benchmark Data Contamination (BDC). This occurs when language models inadvertently incorporate evaluation benchmark information from their training data, leading to inaccurate or unreliable performance during the evaluation phase of the process. This paper reviews the comp...

Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation

February 23, 2025

95% Match

Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, ... , Baishakhi Ray

Machine Learning

Artificial Intelligence

Computation and Language

Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static to dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance s...

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark

October 27, 2023

95% Match

Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, ... , Eneko Agirre

Computation and Language

In this position paper, we argue that the classical evaluation on Natural Language Processing (NLP) tasks using annotated benchmarks is in trouble. The worst kind of data contamination happens when a Large Language Model (LLM) is trained on the test split of a benchmark, and then evaluated in the same benchmark. The extent of the problem is unknown, as it is not straightforward to measure. Contamination causes an overestimation of the performance of a contaminated model in a ...

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models

February 24, 2024

94% Match

Yihong Dong, Xue Jiang, Huanyu Liu, ... , Ge Li

Computation and Language

Artificial Intelligence

Cryptography and Security

Machine Learning

Software Engineering

Recent statements about the impressive capabilities of large language models (LLMs) are usually supported by evaluating on open-access benchmarks. Considering the vast size and wide-ranging sources of LLMs' training data, it could explicitly or implicitly include test data, leading to LLMs being more susceptible to data contamination. However, due to the opacity of training data, the black-box access of models, and the rapid growth of synthetic training data, detecting and mi...

Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?

November 6, 2024

94% Match

Aaditya K. Singh, Muhammed Yusuf Kocyigit, Andrew Poulton, David Esiobu, Maria Lomeli, ... , Dieuwke Hupkes

Computation and Language

Hampering the interpretation of benchmark scores, evaluation data contamination has become a growing concern in the evaluation of LLMs, and an active area of research studies its effects. While evaluation data contamination is easily understood intuitively, it is surprisingly difficult to define precisely which samples should be considered contaminated and, consequently, how it impacts benchmark scores. We propose that these questions should be addressed together and that con...

Investigating Data Contamination for Pre-training Language Models

January 11, 2024

94% Match

Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, ... , Sanmi Koyejo

Computation and Language

Artificial Intelligence

Machine Learning

Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus -- a phenomenon known as data contamination -- in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs' performance on downstr...

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions

October 24, 2024

94% Match

Yujuan Fu, Ozlem Uzuner, ... , Fei Xia

Computation and Language

Large language models (LLMs) have demonstrated great performance across various benchmarks, showing potential as general-purpose task solvers. However, as LLMs are typically trained on vast amounts of data, a significant concern in their evaluation is data contamination, where overlap between training data and evaluation datasets inflates performance assessments. While multiple approaches have been developed to identify data contamination, these approaches rely on specific as...

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

November 8, 2023

94% Match

Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, ... , Ion Stoica

Computation and Language

Artificial Intelligence

Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass thes...

Towards Data Contamination Detection for Modern Large Language Models: Limitations, Inconsistencies, and Oracle Challenges

September 16, 2024

94% Match

Vinay Samuel, Yue Zhou, Henry Peng Zou

Computation and Language

Artificial Intelligence

As large language models achieve increasingly impressive results, questions arise about whether such performance is from generalizability or mere data memorization. Thus, numerous data contamination detection methods have been proposed. However, these approaches are often validated with traditional benchmarks and early-stage LLMs, leaving uncertainty about their effectiveness when evaluating state-of-the-art LLMs on the contamination of more challenging benchmarks. To address...
