September 7, 1995
In this paper I explore a number of issues in the analysis of data requirements for statistical NLP systems. A preliminary framework for viewing such systems is proposed and a sample of existing works is compared within this framework. The first steps toward a theory of data requirements are made by establishing some results relevant to bounding the expected error rate of a class of simplified statistical language learners as a function of the volume of training data.
Similar papers
September 7, 1995
In this paper I address the practical concern of predicting how much training data is sufficient for a statistical language learning system. First, I briefly review earlier results and show how these can be combined to bound the expected accuracy of a mode-based learner as a function of the volume of training data. I then develop a more accurate estimate of the expected accuracy function under the assumption that inputs are uniformly distributed. Since this estimate is expens...
September 25, 1996
The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: (i) it identifies a new class of designs by specifying an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements; and (ii) it explores the development of a mathematical theory to predict the expected accura...
February 27, 2025
Tokenization, a crucial initial step in natural language processing, is often assumed to benefit from larger training datasets. This paper investigates the impact of tokenizer training data sizes ranging from 1GB to 900GB. Our findings reveal diminishing returns as the data size increases, highlighting a practical limit on how much further scaling the training data can improve tokenization quality. We analyze this phenomenon and attribute the saturation effect to the constrai...
August 31, 2022
Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates cur...
July 15, 2001
This paper reports on the "Learning Computational Grammars" (LCG) project, a postdoc network devoted to studying the application of machine learning techniques to grammars suitable for computational use. We were interested in a more systematic survey to understand the relevance of many factors to the success of learning, esp. the availability of annotated data, the kind of dependencies in the data, and the availability of knowledge bases (grammars). We focused on syntax, esp....
June 21, 1995
This article is an in-depth review of Eugene Charniak's book, "Statistical Language Learning". The review evaluates the appropriateness of the book as an introductory text for statistical language learning for a variety of audiences. It also includes an extensive bibliography of articles and papers which might be used as a supplement to this book for learning or teaching statistical language modeling.
April 19, 2020
We review the cost of training large-scale language models, and the drivers of these costs. The intended audience includes engineers and scientists budgeting their model-training experiments, as well as non-practitioners trying to make sense of the economics of modern-day Natural Language Processing (NLP).
August 13, 2024
This chapter critically examines the potential contributions of modern language models to theoretical linguistics. Despite their focus on engineering goals, these models' ability to acquire sophisticated linguistic knowledge from mere exposure to data warrants a careful reassessment of their relevance to linguistic theory. I review a growing body of empirical evidence suggesting that language models can learn hierarchical syntactic structure and exhibit sensitivity to various...
December 3, 2024
This paper provides a primer on Large Language Models (LLMs) and identifies their strengths, limitations, applications and research directions. It is intended to be useful to those in academia and industry who are interested in gaining an understanding of the key LLM concepts and technologies, and in utilising this knowledge in both day-to-day tasks and in more complex scenarios where this technology can enhance current practices and processes.
April 21, 2003
A brief, general-audience overview of the history of natural language processing, focusing on data-driven approaches. Topics include "Ambiguity and language analysis", "Firth things first", "A 'C' change", and "The empiricists strike back".