April 12, 2022
Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood (ML) regression becomes unreliable due to overfitting, which typically leads to systematic estimation biases and increased estimator variances. Quantifying these effects correctly is crucial for inference and prediction. Several methods have been proposed in the literature to overcome overfitting bias or to adjust estimates, but the vast majority of these focus on the regression parameters. Failure to also estimate the nuisance parameters correctly may lead to significant errors in confidence statements and outcome prediction. In this paper we present a jackknife method for deriving a compact set of non-linear equations which describe the statistical properties of the ML estimator in the regime where $p=O(n)$, under the hypothesis of normally distributed covariates. These equations enable one to compute the overfitting bias of ML estimators in parametric regression models as a function of $\zeta = p/n$, and we use them to compute shrinkage factors that remove this bias. The new derivation offers various benefits over the replica approach in terms of increased transparency and reduced assumptions. To illustrate the theory we performed simulation studies for multiple regression models; in all cases we find excellent agreement between theory and simulations.
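To make the regime concrete, here is a minimal Monte Carlo sketch of the overfitting inflation the abstract describes: Gaussian covariates, an unpenalized logistic MLE fitted at fixed $\zeta = p/n$, and an empirical shrinkage factor read off as the slope of the estimates on the truth. This is not the paper's method; the paper's shrinkage factors come from its non-linear equations, which this sketch does not implement, and all names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_mle(X, y, n_iter=100, tol=1e-8):
    """Unpenalized logistic-regression MLE via Newton-Raphson."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = mu * (1.0 - mu)
        step = np.linalg.solve(X.T @ (X * w[:, None]) + 1e-8 * np.eye(p),
                               X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta

n, zeta = 1000, 0.1                      # zeta = p/n held fixed: the regime of the paper
p = int(zeta * n)
beta_true = rng.standard_normal(p)
beta_true /= np.linalg.norm(beta_true)   # unit signal strength

slopes = []
for _ in range(25):
    X = rng.standard_normal((n, p))      # normally distributed covariates, as assumed
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))
    b = logistic_mle(X, y)
    # Project the MLE onto the truth: a slope > 1 reveals overfitting inflation.
    slopes.append((b @ beta_true) / (beta_true @ beta_true))

kappa = float(np.mean(slopes))
print(f"inflation factor at zeta={zeta}: {kappa:.3f} "
      "(dividing the estimates by this factor debiases them on average)")
```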
Similar papers
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) and maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfit...
April 14, 2019
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...
November 25, 2013
The bias of an estimator is defined as the difference of its expected value from the parameter to be estimated, where the expectation is with respect to the model. Loosely speaking, small bias reflects the desire that if an experiment is repeated indefinitely then the average of all the resultant estimates will be close to the parameter value that is estimated. The current paper is a review of the still-expanding repository of methods that have been developed to reduce bias i...
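A textbook instance of this definition (not taken from the review itself): the maximum likelihood estimator of a normal variance divides by $n$ rather than $n-1$, so its bias is exactly $-\sigma^2/n$. A short simulation makes the definition's "repeat the experiment indefinitely" reading visible.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, reps = 10, 4.0, 200_000

# The variance MLE divides by n, so E[s2_mle] = (n-1)/n * sigma2: bias = -sigma2/n.
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2_mle = samples.var(axis=1)                 # ddof=0, i.e. the biased MLE
print("empirical bias  :", s2_mle.mean() - sigma2)
print("theoretical bias:", -sigma2 / n)      # -0.4
```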
May 22, 2024
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high-dimensional regime, where the maximum partial likelihood estimator (without regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is...
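The PMPLE itself cannot be reproduced from this excerpt, but the elastic net penalty it refers to is standard. Below is a hedged proximal-gradient sketch of a generic separably penalized estimator; a squared-error loss stands in for the negative log partial likelihood, and the names (`enet_prox`, `penalized_estimator`) are illustrative, not from the paper.

```python
import numpy as np

def enet_prox(b, step, lam, alpha):
    """Prox of lam * (alpha*||b||_1 + (1-alpha)/2*||b||_2^2):
    soft-thresholding followed by multiplicative shrinkage."""
    b = np.sign(b) * np.maximum(np.abs(b) - step * lam * alpha, 0.0)
    return b / (1.0 + step * lam * (1.0 - alpha))

def penalized_estimator(grad, p, lam, alpha, step, n_iter=2000):
    """Proximal-gradient descent on smooth_loss + elastic net penalty."""
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = enet_prox(beta - step * grad(beta), step, lam, alpha)
    return beta

# Demo with a squared-error loss standing in for the partial likelihood.
rng = np.random.default_rng(2)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p); beta_true[:s] = 1.0     # s non-zero coefficients
y = X @ beta_true + rng.standard_normal(n)

grad = lambda b: X.T @ (X @ b - y) / n           # gradient of the smooth part
step = n / np.linalg.norm(X, 2) ** 2             # 1 / Lipschitz constant
beta_hat = penalized_estimator(grad, p, lam=0.1, alpha=0.9, step=step)
print("estimated support:", np.flatnonzero(np.abs(beta_hat) > 1e-6))
```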
May 4, 2017
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitt...
May 1, 2024
Ridge regression is an indispensable tool in big data econometrics but suffers from bias issues affecting both statistical efficiency and scalability. We introduce an iterative strategy to correct the bias effectively when the dimension $p$ is less than the sample size $n$. For $p>n$, our method optimally reduces the bias to a level unachievable through linear transformations of the response. We employ a Ridge-Screening (RS) method to handle the remaining bias when $p>n$, cre...
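The Ridge-Screening step is not described in this excerpt, but the bias the abstract targets has a closed form in a fixed design: $\mathbb{E}[\hat\beta_\lambda \mid X] = (X^\top X + \lambda I)^{-1} X^\top X \,\beta$. A quick numerical check of that formula follows; it is a starting point for bias correction, not the paper's iterative strategy.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam, reps = 100, 20, 5.0, 4000
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
G = X.T @ X + lam * np.eye(p)

# Closed-form conditional bias of ridge: E[beta_ridge | X] - beta.
bias_theory = np.linalg.solve(G, X.T @ X @ beta) - beta

est = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + rng.standard_normal(n)
    est[r] = np.linalg.solve(G, X.T @ y)     # ridge estimate for this replicate

print("max |empirical - theoretical bias|:",
      np.max(np.abs(est.mean(axis=0) - beta - bias_theory)))
```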
December 18, 2008
This paper develops a bias correction scheme for a multivariate normal model under a general parameterization. In the model, the mean vector and the covariance matrix share the same parameters. It includes many important regression models available in the literature as special cases, such as (non)linear regression, errors-in-variables models, and so forth. Moreover, heteroscedastic situations may also be studied within our framework. We derive a general expression for the sec...
September 8, 2014
Unlike the ordinary least-squares (OLS) estimator for the linear model, ridge regression provides coefficient estimates via shrinkage, usually with improved mean-square and prediction error. This is especially true when the observed design matrix is ill-conditioned or singular, either as a result of highly correlated covariates or because the number of covariates exceeds the sample size. This paper introduces novel and fast marginal maximum likelihood (MML) algorithm...
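The paper's MML algorithms are not reproduced here. As a rough stand-in, the sketch below grid-searches the ridge penalty by the marginal likelihood of the conjugate Bayesian ridge model, which is the objective MML maximizes; `ridge_mml_lambda` is an illustrative name and the grid search is deliberately naive.

```python
import numpy as np

def ridge_mml_lambda(X, y, lam_grid):
    """Pick the ridge penalty by maximizing the profile marginal likelihood of
    beta ~ N(0, sigma2/lam * I), y|beta ~ N(X beta, sigma2 * I),
    so that marginally y ~ N(0, sigma2 * (I_n + X X^T / lam))."""
    n = len(y)
    best_ll, best_lam = -np.inf, None
    for lam in lam_grid:
        M = np.eye(n) + X @ X.T / lam
        _, logdet = np.linalg.slogdet(M)
        q = y @ np.linalg.solve(M, y)                  # y^T M^{-1} y
        ll = -0.5 * logdet - 0.5 * n * np.log(q / n)   # sigma2 profiled out
        if ll > best_ll:
            best_ll, best_lam = ll, lam
    return best_lam

rng = np.random.default_rng(6)
n, p = 60, 100                           # p > n: OLS does not exist, ridge does
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

lam = ridge_mml_lambda(X, y, np.logspace(-2, 3, 40))
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print("MML-selected lambda:", lam)
```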
July 26, 2018
We study the implications of including many covariates in a first-step estimate entering a two-step estimation procedure. We find that a first order bias emerges when the number of \textit{included} covariates is "large" relative to the square-root of sample size, rendering standard inference procedures invalid. We show that the jackknife is able to estimate this "many covariates" bias consistently, thereby delivering a new automatic bias-corrected two-step point estimator. T...
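For the generic (non-two-step) case, the jackknife bias estimate this abstract builds on takes one line: $\widehat{\mathrm{bias}} = (n-1)\,(\bar\theta_{(\cdot)} - \hat\theta)$, where $\bar\theta_{(\cdot)}$ averages the leave-one-out estimates. A minimal sketch, using the variance MLE as the target, for which the correction happens to be exact; the two-step estimator of the paper is not implemented here.

```python
import numpy as np

def jackknife_bias_correct(estimator, data):
    """Jackknife correction: theta_corrected = theta_hat - (n-1)*(loo_mean - theta_hat)."""
    n = len(data)
    theta_full = estimator(data)
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    bias_hat = (n - 1) * (loo.mean(axis=0) - theta_full)
    return theta_full - bias_hat, bias_hat

# Example: the biased variance MLE; the jackknife removes its O(1/n) bias.
rng = np.random.default_rng(4)
x = rng.normal(size=30)
corrected, bias_hat = jackknife_bias_correct(lambda d: d.var(), x)
print(corrected, x.var(ddof=1))   # classical exact case: these coincide
```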
February 19, 2020
An important challenge in statistical analysis concerns the control of the finite sample bias of estimators. This problem is magnified in high-dimensional settings where the number of variables $p$ diverges with the sample size $n$, as well as for nonlinear models and/or models with discrete data. For these complex settings, we propose to use a general simulation-based approach and show that the resulting estimator has a bias of order $\mathcal{O}(0)$, hence providing an asym...
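A hedged sketch of the simulation-based idea (an iterative-bootstrap-style fixed-point scheme; the paper's own estimator and guarantees are not reproduced here): repeatedly simulate from the current parameter value, refit the biased estimator, and nudge the parameter by the discrepancy. On the variance-MLE example the fixed point is the unbiased estimator, up to Monte Carlo noise.

```python
import numpy as np

def iterative_bootstrap(pi_hat, simulate, fit, n_iter=20, B=200, seed=0):
    """Fixed-point iteration: theta <- theta + (pi_hat - mean_b fit(simulate(theta)))."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.array(pi_hat, dtype=float))
    for _ in range(n_iter):
        sims = np.array([fit(simulate(theta, rng)) for _ in range(B)])
        theta = theta + (pi_hat - sims.mean(axis=0))
    return theta

rng0 = np.random.default_rng(5)
n = 20
x = rng0.normal(0.0, 2.0, size=n)
pi_hat = x.var()                      # biased variance MLE on the observed data

theta_ib = iterative_bootstrap(
    pi_hat,
    simulate=lambda th, rng: rng.normal(0.0, np.sqrt(th[0]), size=n),
    fit=lambda d: d.var(),
)
print(theta_ib[0], x.var(ddof=1))     # fixed point ~ the unbiased estimator
```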