June 15, 1993
Gardner's analysis of the optimal storage capacity of neural networks is extended to study finite-temperature effects. The typical volume of the space of interactions is calculated for strongly diluted networks as a function of the storage ratio $\alpha$, temperature $T$, and the tolerance parameter $m$, from which the optimal storage capacity $\alpha_c$ is obtained as a function of $T$ and $m$. At zero temperature it is found that $\alpha_c = 2$ regardless of $m$, while $\alp...
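For context, Gardner's classic zero-temperature result for the spherical perceptron with stability threshold $\kappa$ (quoted here as standard background, not taken from the abstract above) reads
$$ \alpha_c(\kappa) \;=\; \left[\int_{-\kappa}^{\infty}\frac{dt}{\sqrt{2\pi}}\,e^{-t^{2}/2}\,(t+\kappa)^{2}\right]^{-1}, $$
which reduces to $\alpha_c(0) = 2$, consistent with the $T = 0$ value stated above.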
October 4, 2018
Memorization is worst-case generalization. Based on MacKay's information theoretic model of supervised machine learning, this article discusses how to practically estimate the maximum size of a neural network given a training data set. First, we present four easily applicable rules to analytically determine the capacity of neural network architectures. This allows the comparison of the efficiency of different network architectures independently of a task. Second, we introduce...
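The truncated abstract does not reproduce the four rules, but MacKay's underlying single-neuron result (roughly two bits of capacity per weight) already gives a quick back-of-the-envelope estimate. A minimal sketch, applying that per-parameter heuristic naively to a fully connected network (helper names are illustrative, not the paper's API):

```python
# Minimal sketch, not the paper's four rules: MacKay's single-neuron result
# (about two bits of capacity per weight) applied per parameter as a rough
# upper bound on what a fully connected network can memorize.

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def capacity_estimate_bits(layer_sizes):
    """Rough memorization capacity in bits, assuming ~2 bits per parameter."""
    return 2 * mlp_param_count(layer_sizes)

print(capacity_estimate_bits([784, 64, 10]))   # 2 * (785*64 + 65*10) = 101780
```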
June 5, 2003
We analyze a learning method that uses a margin $\kappa$ {\it \`a la} Gardner for simple perceptron learning. This method corresponds to perceptron learning when $\kappa=0$ and to Hebbian learning when $\kappa \to \infty$. Nevertheless, we find that the generalization ability of the method is superior to that of the perceptron and Hebbian methods at an early stage of learning. We analyze the asymptotic property of the learning curve of this method through comput...
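A minimal sketch of the margin-$\kappa$ rule described above (the normalization of the field and the function names are illustrative assumptions, not the paper's exact prescription): make a Hebbian update whenever the aligned field falls below $\kappa$. With $\kappa = 0$ this is the classical perceptron rule; as $\kappa \to \infty$ every example always triggers an update, so the weights tend to the pure Hebbian sum.

```python
import numpy as np

def margin_perceptron(X, y, kappa, epochs=10):
    """Train a perceptron with a Gardner-style margin kappa (illustrative sketch)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            norm = np.linalg.norm(w)
            field = y_i * np.dot(w, x_i) / norm if norm > 0 else 0.0
            if field <= kappa:        # stability below the required margin
                w += y_i * x_i        # Hebbian update
    return w
```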
August 20, 2017
We derive two critical numbers that predict the behavior of perceptron networks. First, we derive what we call the lossless memory (LM) dimension. The LM dimension is a generalization of the Vapnik--Chervonenkis (VC) dimension that avoids structured data and therefore provides an upper bound for perfectly fitting almost any training data. Second, we derive what we call the MacKay (MK) dimension. This limit indicates a 50% chance of not bein...
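The 50% threshold that the MK dimension formalizes can be illustrated with Cover's 1965 counting function (this is standard background, not the paper's derivation): the fraction of dichotomies of $p$ points in general position in $n$ dimensions that a perceptron can realize drops to exactly one half at $p = 2n$.

```python
from math import comb

def cover_count(p, n):
    """Cover's count of linearly separable dichotomies of p points in n dimensions."""
    return 2 * sum(comb(p - 1, k) for k in range(n))

def separable_fraction(p, n):
    """Probability that a random dichotomy of p generic points is separable."""
    return cover_count(p, n) / 2**p

n = 10
print(separable_fraction(2 * n, n))   # 0.5: the 50% point at p = 2n
print(separable_fraction(n, n))       # 1.0: p <= n is always separable
```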
July 18, 2022
There is mounting evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning a $k$-sparse parity of $n$ bits, a canonical discrete search problem which is statistically...
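A minimal sketch of the task described above (names are illustrative): the label is the parity of an unknown size-$k$ subset $S$ of the $n$ input bits, so a learner must in effect search among roughly $n^k$ candidate supports.

```python
import numpy as np

def sparse_parity_data(n_samples, n, k, seed=0):
    """Generate a k-sparse parity dataset over uniform +-1 inputs."""
    rng = np.random.default_rng(seed)
    S = rng.choice(n, size=k, replace=False)       # hidden support of the parity
    X = rng.choice([-1, 1], size=(n_samples, n))   # uniform +-1 inputs
    y = np.prod(X[:, S], axis=1)                   # +-1 label: parity of the k bits
    return X, y, S

X, y, S = sparse_parity_data(n_samples=1000, n=50, k=3)
```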
December 6, 2024
Parities have become a standard benchmark for evaluating learning algorithms. Recent works show that regular neural networks trained by gradient descent can efficiently learn degree $k$ parities on uniform inputs for constant $k$, but fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), including the degree $d$ parity (the full parity), has remained unsettled. This paper shows that for...
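One reason the almost-full regime is natural (a standard identity over $\pm 1$ inputs, not a claim from the paper): a degree $d-r$ parity equals the full parity times the $r$-sparse parity of the complementary coordinates, since $x_i^2 = 1$. A quick numerical check:

```python
import numpy as np

# chi_S(x) = chi_full(x) * chi_complement(x) for +-1 inputs, because x_i^2 = 1.
rng = np.random.default_rng(1)
d, r = 20, 3
X = rng.choice([-1, 1], size=(1000, d))
comp = rng.choice(d, size=r, replace=False)     # complement of the support
S = np.setdiff1d(np.arange(d), comp)            # almost-full support, |S| = d - r
full = np.prod(X, axis=1)
assert np.all(np.prod(X[:, S], axis=1) == full * np.prod(X[:, comp], axis=1))
```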
May 21, 2017
A fundamental aspect of the limits on learning any computation in neural architectures is the characterization of their optimal capacities. An important, widely used neural architecture is the autoencoder, in which the network reconstructs the input at the output layer via a representation at a hidden layer. Even though the capacities of several neural architectures have been addressed using statistical physics methods, the capacity of autoencoder neural networks is not well-explor...
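To make the architecture concrete, here is a minimal single-hidden-layer autoencoder (the activation, scaling, and variable names are illustrative choices, not the paper's exact model): the input is encoded into a hidden representation and then decoded back into a reconstruction.

```python
import numpy as np

def autoencoder_reconstruct(x, W_enc, W_dec):
    """Reconstruct x via a hidden representation (illustrative sketch)."""
    h = np.tanh(W_enc @ x)      # hidden representation
    return W_dec @ h            # reconstruction at the output layer

n_visible, n_hidden = 100, 20
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(n_hidden, n_visible)) / np.sqrt(n_visible)
W_dec = rng.normal(size=(n_visible, n_hidden)) / np.sqrt(n_hidden)
x = rng.choice([-1.0, 1.0], size=n_visible)
x_hat = autoencoder_reconstruct(x, W_enc, W_dec)
```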
February 16, 2016
We prove that any algorithm for learning parities requires either a memory of quadratic size or an exponential number of samples. This proves a recent conjecture of Steinhardt, Valiant, and Wager and shows that for some learning problems a large storage space is crucial. More formally, in the problem of parity learning, an unknown string $x \in \{0,1\}^n$ is chosen uniformly at random. A learner tries to learn $x$ from a stream of samples $(a_1, b_1), (a_2, b_2), \ldots$, wh...
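The quadratic-memory baseline that the lower bound is matched against can be sketched directly (this is the standard Gaussian-elimination learner, not code from the paper): store about $n$ equations $a_i \cdot x = b_i \pmod 2$, roughly $n^2$ bits, and solve them over GF(2).

```python
import numpy as np

def solve_parity(A, b):
    """Solve A x = b over GF(2) by Gaussian elimination (A assumed full column rank)."""
    A, b = A.copy() % 2, b.copy() % 2
    n, row = A.shape[1], 0
    for col in range(n):
        pivot = next((r for r in range(row, len(A)) if A[r, col]), None)
        if pivot is None:
            continue
        A[[row, pivot]], b[[row, pivot]] = A[[pivot, row]], b[[pivot, row]]
        for r in range(len(A)):
            if r != row and A[r, col]:
                A[r] ^= A[row]
                b[r] ^= b[row]
        row += 1
    return b[:n]

rng = np.random.default_rng(0)
n = 32
x = rng.integers(0, 2, size=n)
A = rng.integers(0, 2, size=(2 * n, n))   # stream of sample vectors a_i
b = A @ x % 2                             # labels b_i = <a_i, x> mod 2
# With 2n random rows, A has full column rank with overwhelming probability.
assert np.array_equal(solve_parity(A, b), x)
```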
November 18, 1996
A perceptron with $N$ random weights can store of the order of $N$ patterns when a fraction of the weights is removed without changing the strengths of the remaining ones. The critical storage capacity is calculated as a function of the concentration of the remaining bonds, both for random outputs and for outputs given by a teacher perceptron. A simple Hebb-like dilution algorithm is presented which, in the teacher case, reaches the optimal generalization ability.
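The abstract does not spell out the dilution algorithm, so the following is a hedged sketch of one plausible Hebb-like rule: form the Hebbian couplings and keep only the fraction $c$ of bonds with the largest magnitude, leaving the surviving strengths unchanged.

```python
import numpy as np

def hebb_dilute(patterns, labels, c):
    """Keep a fraction c of the Hebbian bonds, chosen by magnitude (illustrative)."""
    J = patterns.T @ labels.astype(float)      # Hebbian couplings, one per input
    keep = int(round(c * J.size))
    order = np.argsort(np.abs(J))              # smallest magnitudes first
    diluted = J.copy()
    diluted[order[: J.size - keep]] = 0.0      # remove the weakest bonds
    return diluted
```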
December 25, 2007
In this paper, we address the problem of how many randomly labeled patterns can be correctly classified by a single-layer perceptron when the patterns are correlated with each other. To solve this problem, two analytical schemes are developed, based on the replica method and the Thouless--Anderson--Palmer (TAP) approach, by utilizing an integral formula concerning random rectangular matrices. The validity and relevance of the developed methodologies are shown for one known r...
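A rough numerical companion to the question posed above (an illustration only, not the paper's replica/TAP calculation): draw patterns that share a common component, so they are correlated with each other, attach random labels, and test linear separability with an LP feasibility check.

```python
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Is there a w with y_mu * (w . x_mu) >= 1 for all mu?"""
    P, N = X.shape
    res = linprog(c=np.zeros(N),
                  A_ub=-(y[:, None] * X), b_ub=-np.ones(P),
                  bounds=[(None, None)] * N, method="highs")
    return res.success

def capacity_trial(alpha, N=100, rho=0.3, seed=0):
    rng = np.random.default_rng(seed)
    P = int(alpha * N)
    common = rng.normal(size=N)                              # shared component
    X = np.sqrt(1 - rho) * rng.normal(size=(P, N)) + np.sqrt(rho) * common
    y = rng.choice([-1.0, 1.0], size=P)                      # random labels
    return separable(X, y)

# For independent patterns the separable/unseparable transition sits near
# alpha = 2; correlations shift it, which the replica/TAP schemes quantify.
print(capacity_trial(alpha=1.5), capacity_trial(alpha=2.5))
```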