Replica analysis of overfitting in generalized linear models

April 14, 2020

ACC Coolen, M Sheikh, A Mozeika, F Aguirre-Lopez, F Antenucci

Condensed Matter

Mathematics

Statistics

Disordered Systems and Neura...

Statistics Theory

Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfitting errors are not just noise but take mostly the form of a bias, and that with the replica method from statistical physics one can model and predict this bias and the noise statistics. Here we extend our approach to arbitrary generalized linear regression models (GLM), with possibly correlated covariates. First, we analyse overfitting in ML/MAP inference without having to specify data types or regression models, relying only on the GLM form, and derive generic order parameter equations for the case of $L2$ priors. Second, we derive the probabilistic relationship between true and inferred regression coefficients in GLMs, and show that, for the relevant hyperparameter scaling and correlated covariates, the $L2$ regularization causes a predictable direction change of the coefficient vector. Our results, illustrated by application to linear, logistic, and Cox regression, enable one to correct ML and MAP inferences in GLMs systematically for overfitting bias, and thus extend their applicability into the hitherto forbidden regime $p=O(N)$.
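The claim that ML inference becomes biased when $p=O(N)$ can be checked numerically in the simplest GLM, linear regression with Gaussian noise. The sketch below is a textbook illustration, not the replica calculation of the paper: it assumes i.i.d. standard normal covariates and a hypothetical helper name `ml_noise_variance`. It relies on the standard result that the ML estimate of the noise variance, the mean squared residual, has expectation $\sigma^2(1-p/N)$, so it underestimates $\sigma^2$ by a factor that does not vanish when $p/N$ stays finite.

```python
import numpy as np


def ml_noise_variance(n_samples, n_covariates, sigma=1.0, seed=0):
    """Return the ML estimate of the noise variance in linear regression.

    Illustrative assumptions (not taken from the paper): i.i.d. standard
    normal covariates and true coefficients scaled so the signal is O(1).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, n_covariates))
    beta_true = rng.standard_normal(n_covariates) / np.sqrt(n_covariates)
    y = X @ beta_true + sigma * rng.standard_normal(n_samples)

    # For Gaussian noise the ML coefficients are the least-squares solution.
    beta_ml, *_ = np.linalg.lstsq(X, y, rcond=None)

    # The ML noise variance is the mean squared residual (divide by N, not N - p).
    residuals = y - X @ beta_ml
    return float(np.mean(residuals ** 2))


if __name__ == "__main__":
    n = 2000
    for ratio in (0.05, 0.25, 0.5):
        p = int(ratio * n)
        estimate = ml_noise_variance(n, p)
        # Compare with the expected value sigma^2 * (1 - p/N), here with sigma = 1.
        print(f"p/N = {ratio:.2f}: ML variance = {estimate:.3f}, "
              f"expected about {1 - ratio:.3f}")
```

In this toy setting the bias can be removed by dividing the residual sum of squares by $N-p$ instead of $N$; for logistic or Cox regression no such closed-form correction is available.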

Analysis of overfitting in the regularized Cox model

April 14, 2019

94% Match

M Sheikh, A. C. C. Coolen

stat.ME

cond-mat.dis-nn

cs.LG

math.ST

stat.ML

stat.TH

The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...


Replica analysis of overfitting in regression models for time-to-event data

May 4, 2017

91% Match

ACC Coolen, JE Barrett, ... , CJ Perez-Vicente

Applications

Disordered Systems and Neura...

Data Analysis, Statistics an...

Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitt...


Replica analysis of overfitting in regression models for time to event data: the impact of censoring

December 5, 2023

89% Match

Emanuele Massa, Alexander Mozeika, Anthony Coolen

Methodology

Disordered Systems and Neura...

Statistics Theory

We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, Maximum Likelihood parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical c...


Asymptotics of Non-Convex Generalized Linear Models in High-Dimensions: A proof of the replica formula

February 27, 2025

89% Match

Matteo Vilucchio, Yatin Dandi, ... , Florent Krzakala

Machine Learning

The analytic characterization of the high-dimensional behavior of optimization for Generalized Linear Models (GLMs) with Gaussian data has been a central focus in statistics and probability in recent years. While convex cases, such as the LASSO, ridge regression, and logistic regression, have been extensively studied using a variety of techniques, the non-convex case remains far less understood despite its significance. A non-rigorous statistical physics framework has provide...


Correction of overfitting bias in regression models

April 12, 2022

89% Match

Emanuele Massa, Marianne Jonker, ... , Anthony Coolen

Methodology

Statistics Theory

Data Analysis, Statistics an...

Statistics Theory

Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood regression becomes unreliable due to overfitting. This typically leads to systematic estimation biases and increased estimator variances. It is crucial for inference and prediction to quantify these effects correctly. Several methods have been proposed in the literature to overcome overf...


LR-GLM: High-Dimensional Bayesian Inference Using Low-Rank Data Approximations

May 17, 2019

88% Match

Brian L. Trippe, Jonathan H. Huggins, ... , Tamara Broderick

Computation

Machine Learning

Methodology

Machine Learning

Due to the ease of modern data collection, applied statisticians often have access to a large set of covariates that they wish to relate to some observed outcome. Generalized linear models (GLMs) offer a particularly interpretable framework for such an analysis. In these high-dimensional problems, the number of covariates is often large relative to the number of observations, so we face non-trivial inferential uncertainty; a Bayesian approach allows coherent quantification of...


Replica Analysis for Ensemble Techniques in Variable Selection

August 29, 2024

87% Match

Takashi Takahashi

math.ST

cond-mat.dis-nn

cond-mat.stat-mech

cs.IT

math.IT

stat.TH

Variable selection is a problem in statistics that aims to find the subset of the $N$-dimensional possible explanatory variables that are truly related to the generation process of the response variable. In high-dimensional setups, where the input dimension $N$ is comparable to the data size $M$, it is difficult to use classic methods based on $p$-values. Therefore, methods based on ensemble learning are often used. In this review article, we introduce how the performance...


Understanding Phase Transitions via Mutual Information and MMSE

July 3, 2019

87% Match

Galen Reeves, Henry Pfister

Information Theory

Statistics Theory

The ability to understand and solve high-dimensional inference problems is essential for modern data science. This article examines high-dimensional inference problems through the lens of information theory and focuses on the standard linear model as a canonical example that is both rich enough to be practically useful and simple enough to be studied rigorously. In particular, this model can exhibit phase transitions where an arbitrarily small change in the model parameters c...


Prediction Errors for Penalized Regressions based on Generalized Approximate Message Passing

June 26, 2022

87% Match

Ayaka Sakata

Machine Learning

Disordered Systems and Neura...

Machine Learning

We discuss the prediction accuracy of assumed statistical models in terms of prediction errors for the generalized linear model and penalized maximum likelihood methods. We derive the forms of estimators for the prediction errors, such as $C_p$ criterion, information criteria, and leave-one-out cross validation (LOOCV) error, using the generalized approximate message passing (GAMP) algorithm and replica method. These estimators coincide with each other when the number of mode...


Scalable Bayesian inference for the generalized linear mixed model

March 5, 2024

87% Match

Samuel I. Berchuck, Felipe A. Medeiros, ... , Andrea Agazzi

Computation

Methodology

Machine Learning

The generalized linear mixed model (GLMM) is a popular statistical approach for handling correlated data, and is used extensively in application areas where big data is common, including biomedical data settings. The focus of this paper is scalable statistical inference for the GLMM, where we define statistical inference as: (i) estimation of population parameters, and (ii) evaluation of scientific hypotheses in the presence of uncertainty. Artificial intelligence (AI) learn...
