How much is enough?: Data requirements for statistical NLP

September 7, 1995

Mark Lauer (Microsoft Institute, Sydney)

Computer Science

Computation and Language

In this paper I explore a number of issues in the analysis of data requirements for statistical NLP systems. A preliminary framework for viewing such systems is proposed and a sample of existing works is compared within this framework. The first steps toward a theory of data requirements are made by establishing some results relevant to bounding the expected error rate of a class of simplified statistical language learners as a function of the volume of training data.

Conserving Fuel in Statistical Language Learning: Predicting Data Requirements

September 7, 1995

93% Match

Mark Lauer (Microsoft Institute, Sydney)

Computation and Language

In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expens...


Designing Statistical Language Learners: Experiments on Noun Compounds

September 25, 1996

91% Match

Mark Lauer (Microsoft Research Institute, Sydney)

Computation and Language

The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: (i) it identifies a new class of designs by specifying an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements; and (ii) it explores the development of a mathematical theory to predict the expected accura...


How Much is Enough? The Diminishing Returns of Tokenization Training Data

February 27, 2025

90% Match

Varshini Reddy, Craig W. Schmidt, ... , Chris Tanner

Computation and Language

Computational Engineering, F...

Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates the impact of tokenizer training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as the data size increases, highlighting a practical limit on how much further scaling the training data can improve tokenization quality. We analyze this phenomenon and attribute the saturation effect to the constrai...


Efficient Methods for Natural Language Processing: A Survey

August 31, 2022

88% Match

Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, ... , Roy Schwartz

Computation and Language

Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates cur...


Learning Computational Grammars

July 15, 2001

88% Match

John Nerbonne, Anja Belz, Nicola Cancedda, Herve Dejean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, ... , Erik F. Tjong Kim Sang

Computation and Language

This paper reports on the "Learning Computational Grammars" (LCG) project, a postdoc network devoted to studying the application of machine learning techniques to grammars suitable for computational use. We were interested in a more systematic survey to understand the relevance of many factors to the success of learning, esp. the availability of annotated data, the kind of dependencies in the data, and the availability of knowledge bases (grammars). We focused on syntax, esp....


Review of Charniak's "Statistical Language Learning"

June 21, 1995

88% Match

David M. Magerman

Computation and Language

This article is an in-depth review of Eugene Charniak's book, "Statistical Language Learning". The review evaluates the appropriateness of the book as an introductory text for statistical language learning for a variety of audiences. It also includes an extensive bibliography of articles and papers which might be used as a supplement to this book for learning or teaching statistical language modeling.


The Cost of Training NLP Models: A Concise Overview

April 19, 2020

88% Match

Or Sharir, Barak Peleg, Yoav Shoham

Computation and Language

Machine Learning

Neural and Evolutionary Comp...

We review the cost of training large-scale language models, and the drivers of these costs. The intended audience includes engineers and scientists budgeting their model-training experiments, as well as non-practitioners trying to make sense of the economics of modern-day Natural Language Processing (NLP).


Language Models as Models of Language

August 13, 2024

87% Match

Raphaël Millière

Computation and Language

This chapter critically examines the potential contributions of modern language models to theoretical linguistics. Despite their focus on engineering goals, these models' ability to acquire sophisticated linguistic knowledge from mere exposure to data warrants a careful reassessment of their relevance to linguistic theory. I review a growing body of empirical evidence suggesting that language models can learn hierarchical syntactic structure and exhibit sensitivity to various...


A Primer on Large Language Models and their Limitations

December 3, 2024

87% Match

Sandra Johnson, David Hyland-Wood

Computation and Language

Artificial Intelligence

This paper provides a primer on Large Language Models (LLMs) and identifies their strengths, limitations, applications and research directions. It is intended to be useful to those in academia and industry who are interested in gaining an understanding of the key LLM concepts and technologies, and in utilising this knowledge in both day to day tasks and in more complex scenarios where this technology can enhance current practices and processes.


"I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001

April 21, 2003

87% Match

Lillian Lee

Computation and Language

A brief, general-audience overview of the history of natural language processing, focusing on data-driven approaches. Topics include "Ambiguity and language analysis", "Firth things first", "A 'C' change", and "The empiricists strike back".
