April 12, 2022
Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood (ML) regression becomes unreliable due to overfitting, which typically leads to systematic estimation biases and increased estimator variances. Quantifying these effects correctly is crucial for inference and prediction. Several methods have been proposed in the literature to overcome overfitting bias or to adjust estimates, but the vast majority of these focus on the regression parameters. Failure to also estimate the nuisance parameters correctly may lead to significant errors in confidence statements and outcome prediction. In this paper we present a jackknife method for deriving a compact set of non-linear equations which describe the statistical properties of the ML estimator in the regime where $p=O(n)$, under the hypothesis of normally distributed covariates. These equations enable one to compute the overfitting bias of ML estimators in parametric regression models as a function of $\zeta = p/n$, and we use them to compute shrinkage factors that remove this bias. The new derivation offers various benefits over the replica approach in terms of increased transparency and reduced assumptions. To illustrate the theory we performed simulation studies for multiple regression models; in all cases we find excellent agreement between theory and simulations.
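To make the regime concrete, here is a minimal Monte Carlo sketch of the overfitting inflation the abstract describes: Gaussian covariates, an unpenalized logistic MLE fitted at fixed $\zeta = p/n$, and an empirical shrinkage factor read off as the slope of the estimates on the truth. This is not the paper's method; the paper's shrinkage factors come from its non-linear equations, which this sketch does not implement, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_mle(X, y, n_iter=100, tol=1e-8):
    """Unpenalized logistic-regression MLE via Newton-Raphson."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        step = np.linalg.solve(X.T @ (X * w[:, None]) + 1e-8 * np.eye(p),
                               X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

n, zeta = 1000, 0.1                      # zeta = p/n held fixed: the regime of the paper
p = int(zeta * n)
beta_true = rng.standard_normal(p)
beta_true /= np.linalg.norm(beta_true)   # unit signal strength

slopes = []
for _ in range(25):
    X = rng.standard_normal((n, p))      # normally distributed covariates, as assumed
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))
    b = logistic_mle(X, y)
    # Project the MLE onto the truth: a slope > 1 reveals overfitting inflation.
    slopes.append((b @ beta_true) / (beta_true @ beta_true))

kappa = float(np.mean(slopes))
print(f"inflation factor at zeta={zeta}: {kappa:.3f} "
      "(dividing the estimates by this factor debiases them on average)")
```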
Similar papers
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) and maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfit...
April 14, 2019
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...
November 25, 2013
The bias of an estimator is defined as the difference of its expected value from the parameter to be estimated, where the expectation is with respect to the model. Loosely speaking, small bias reflects the desire that if an experiment is repeated indefinitely then the average of all the resultant estimates will be close to the parameter value that is estimated. The current paper is a review of the still-expanding repository of methods that have been developed to reduce bias i...
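A textbook instance of this definition (not taken from the review itself): the maximum likelihood estimator of a normal variance divides by $n$ rather than $n-1$, so its bias is exactly $-\sigma^2/n$. A short simulation makes the definition's "repeat the experiment indefinitely" reading visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 4.0, 200_000

# The variance MLE divides by n, so E[s2_mle] = (n-1)/n * sigma2: bias = -sigma2/n.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_mle = samples.var(axis=1)                 # ddof=0, i.e. the biased MLE
print("empirical bias  :", s2_mle.mean() - sigma2)
print("theoretical bias:", -sigma2 / n)      # -0.4
```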
May 22, 2024
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high-dimensional regime, where the maximum partial likelihood estimator (without regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is...
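The PMPLE itself cannot be reproduced from this excerpt, but the elastic net penalty it refers to is standard. Below is a hedged proximal-gradient sketch of a generic separably penalized estimator; a squared-error loss stands in for the negative log partial likelihood, and the names (`enet_prox`, `penalized_estimator`) are illustrative, not from the paper.

```python
import numpy as np

def enet_prox(b, step, lam, alpha):
    """Prox of lam * (alpha*||b||_1 + (1-alpha)/2*||b||_2^2):
    soft-thresholding followed by multiplicative shrinkage."""
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam * alpha, 0.0)
    return b / (1.0 + step * lam * (1.0 - alpha))

def penalized_estimator(grad, p, lam, alpha, step, n_iter=2000):
    """Proximal-gradient descent on smooth_loss + elastic net penalty."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = enet_prox(beta - step * grad(beta), step, lam, alpha)
    return beta

# Demo with a squared-error loss standing in for the partial likelihood.
rng = np.random.default_rng(2)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:s] = 1.0     # s non-zero coefficients
y = X @ beta_true + rng.standard_normal(n)

grad = lambda b: X.T @ (X @ b - y) / n           # gradient of the smooth part
step = n / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant
beta_hat = penalized_estimator(grad, p, lam=0.1, alpha=0.9, step=step)
print("estimated support:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```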
May 4, 2017
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitt...
May 1, 2024
Ridge regression is an indispensable tool in big data econometrics but suffers from bias issues affecting both statistical efficiency and scalability. We introduce an iterative strategy to correct the bias effectively when the dimension $p$ is less than the sample size $n$. For $p>n$, our method optimally reduces the bias to a level unachievable through linear transformations of the response. We employ a Ridge-Screening (RS) method to handle the remaining bias when $p>n$, cre...
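The Ridge-Screening step is not described in this excerpt, but the bias the abstract targets has a closed form in a fixed design: $\mathbb{E}[\hat\beta_\lambda \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \,\beta$. A quick numerical check of that formula follows; it is a starting point for bias correction, not the paper's iterative strategy.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, reps = 100, 20, 5.0, 4000
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
G = X.T @ X + lam * np.eye(p)

# Closed-form conditional bias of ridge: E[beta_ridge | X] - beta.
bias_theory = np.linalg.solve(G, X.T @ X @ beta) - beta

est = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + rng.standard_normal(n)
    est[r] = np.linalg.solve(G, X.T @ y)     # ridge estimate for this replicate

print("max |empirical - theoretical bias|:",
      np.max(np.abs(est.mean(axis=0) - beta - bias_theory)))
```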
December 18, 2008
This paper develops a bias correction scheme for a multivariate normal model under a general parameterization. In the model, the mean vector and the covariance matrix share the same parameters. It includes many important regression models available in the literature as special cases, such as (non)linear regression, errors-in-variables models, and so forth. Moreover, heteroscedastic situations may also be studied within our framework. We derive a general expression for the sec...
September 8, 2014
Unlike the ordinary least-squares (OLS) estimator for the linear model, ridge regression provides coefficient estimates via shrinkage, usually with improved mean-square and prediction error. This is especially true when the observed design matrix is ill-conditioned or singular, either as a result of highly correlated covariates or because the number of covariates exceeds the sample size. This paper introduces novel and fast marginal maximum likelihood (MML) algorithm...
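The paper's MML algorithms are not reproduced here. As a rough stand-in, the sketch below grid-searches the ridge penalty by the marginal likelihood of the conjugate Bayesian ridge model, which is the objective MML maximizes; `ridge_mml_lambda` is an illustrative name and the grid search is deliberately naive.

```python
import numpy as np

def ridge_mml_lambda(X, y, lam_grid):
    """Pick the ridge penalty by maximizing the profile marginal likelihood of
    beta ~ N(0, sigma2/lam * I), y|beta ~ N(X beta, sigma2 * I),
    so that marginally y ~ N(0, sigma2 * (I_n + X X^T / lam))."""
    n = len(y)
    best_ll, best_lam = -np.inf, None
    for lam in lam_grid:
        M = np.eye(n) + X @ X.T / lam
        _, logdet = np.linalg.slogdet(M)
        q = y @ np.linalg.solve(M, y)                  # y^T M^{-1} y
        ll = -0.5 * logdet - 0.5 * n * np.log(q / n)   # sigma2 profiled out
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam

rng = np.random.default_rng(6)
n, p = 60, 100                           # p > n: OLS does not exist, ridge does
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

lam = ridge_mml_lambda(X, y, np.logspace(-2, 3, 40))
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("MML-selected lambda:", lam)
```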
July 26, 2018
We study the implications of including many covariates in a first-step estimate entering a two-step estimation procedure. We find that a first order bias emerges when the number of \textit{included} covariates is "large" relative to the square-root of sample size, rendering standard inference procedures invalid. We show that the jackknife is able to estimate this "many covariates" bias consistently, thereby delivering a new automatic bias-corrected two-step point estimator. T...
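For the generic (non-two-step) case, the jackknife bias estimate this abstract builds on takes one line: $\widehat{\mathrm{bias}} = (n-1)\,(\bar\theta_{(\cdot)} - \hat\theta)$, where $\bar\theta_{(\cdot)}$ averages the leave-one-out estimates. A minimal sketch, using the variance MLE as the target, for which the correction happens to be exact; the two-step estimator of the paper is not implemented here.

```python
import numpy as np

def jackknife_bias_correct(estimator, data):
    """Jackknife correction: theta_corrected = theta_hat - (n-1)*(loo_mean - theta_hat)."""
    n = len(data)
    theta_full = estimator(data)
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    bias_hat = (n - 1) * (loo.mean(axis=0) - theta_full)
    return theta_full - bias_hat, bias_hat

# Example: the biased variance MLE; the jackknife removes its O(1/n) bias.
rng = np.random.default_rng(4)
x = rng.normal(size=30)
corrected, bias_hat = jackknife_bias_correct(lambda d: d.var(), x)
print(corrected, x.var(ddof=1))   # classical exact case: these coincide
```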
February 19, 2020
An important challenge in statistical analysis concerns the control of the finite sample bias of estimators. This problem is magnified in high-dimensional settings where the number of variables $p$ diverges with the sample size $n$, as well as for nonlinear models and/or models with discrete data. For these complex settings, we propose to use a general simulation-based approach and show that the resulting estimator has a bias of order $\mathcal{O}(0)$, hence providing an asym...
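A hedged sketch of the simulation-based idea (an iterative-bootstrap-style fixed-point scheme; the paper's own estimator and guarantees are not reproduced here): repeatedly simulate from the current parameter value, refit the biased estimator, and nudge the parameter by the discrepancy. On the variance-MLE example the fixed point is the unbiased estimator, up to Monte Carlo noise.

```python
import numpy as np

def iterative_bootstrap(pi_hat, simulate, fit, n_iter=20, B=200, seed=0):
    """Fixed-point iteration: theta <- theta + (pi_hat - mean_b fit(simulate(theta)))."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.array(pi_hat, dtype=float))
    for _ in range(n_iter):
        sims = np.array([fit(simulate(theta, rng)) for _ in range(B)])
        theta = theta + (pi_hat - sims.mean(axis=0))
    return theta

rng0 = np.random.default_rng(5)
n = 20
x = rng0.normal(0.0, 2.0, size=n)
pi_hat = x.var()                      # biased variance MLE on the observed data

theta_ib = iterative_bootstrap(
    pi_hat,
    simulate=lambda th, rng: rng.normal(0.0, np.sqrt(th[0]), size=n),
    fit=lambda d: d.var(),
)
print(theta_ib[0], x.var(ddof=1))     # fixed point ~ the unbiased estimator
```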