May 4, 2017
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitting, and even for Cox's proportional hazards model (the main tool of medical statisticians), one finds in the literature only rules of thumb on the number of samples required to avoid overfitting. In this paper we present a mathematical theory of overfitting in regression models for time-to-event data, which aims to increase our quantitative understanding of the problem and provide practical tools with which to correct regression outcomes for the impact of overfitting. It is based on the replica method, a statistical mechanical technique for the analysis of heterogeneous many-variable systems that has been used successfully for several decades in physics, biology, and computer science, but not yet in medical statistics. We develop the theory initially for arbitrary regression models for time-to-event data, and verify its predictions in detail for the popular Cox model.
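As a rough illustration of the phenomenon described above (a simulation sketch, not the paper's replica calculation), the following Python snippet fits Cox regression by maximum partial likelihood on synthetic data with a non-negligible ratio p/N; the sample size, dimension, true coefficients, and absence of censoring are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
N, p = 200, 40                      # ratio p/N = 0.2
beta_true = np.zeros(p)
beta_true[:5] = 1.0                 # a few truly associated covariates
X = rng.standard_normal((N, p)) / np.sqrt(p)   # keeps the linear predictor O(1)

# Proportional hazards with unit baseline hazard and no censoring, so every
# subject contributes an event to the partial likelihood.
T = rng.exponential(scale=np.exp(-X @ beta_true))
order = np.argsort(T)               # sort so each risk set is a suffix
X, T = X[order], T[order]

def neg_log_partial_likelihood(beta):
    lp = X @ beta
    # log of the sum over the risk set {j : T_j >= T_i}, via a reversed cumsum
    log_risk = np.log(np.cumsum(np.exp(lp)[::-1])[::-1])
    return -(lp - log_risk).sum()

beta_hat = minimize(neg_log_partial_likelihood, np.zeros(p), method="BFGS").x

# The maximum likelihood estimates of the truly associated coefficients tend
# to come out above their true value of 1: the overfitting-induced bias that
# standard p-values and z-scores do not flag.
print("average estimate of the five true coefficients:", beta_hat[:5].mean())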
Similar papers
December 5, 2023
We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, maximum likelihood parameter estimators are known to be biased even for small values of the ratio of the number of covariates to the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical c...
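For readers unfamiliar with how censoring enters such simulations, here is a hedged sketch (assuming the third-party lifelines and pandas packages; the uniform censoring scheme and all parameter values are illustrative, not taken from this paper). Censored subjects contribute to risk sets but are never counted as failures, so heavier censoring reduces the effective number of events.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
N, p = 400, 40
beta_true = np.r_[np.ones(5), np.zeros(p - 5)]
X = rng.standard_normal((N, p)) / np.sqrt(p)

latent_T = rng.exponential(scale=np.exp(-X @ beta_true))     # true event times
C = rng.uniform(0, np.quantile(latent_T, 0.8), size=N)       # censoring times
time = np.minimum(latent_T, C)
event = (latent_T <= C).astype(int)      # 1 = event observed, 0 = censored
print("fraction censored:", 1 - event.mean())

df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)])
df["time"], df["event"] = time, event

# Censored subjects still enter the risk sets but do not count as failures,
# so heavier censoring lowers the effective number of events and aggravates
# overfitting at a given p/N.
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.params_.head())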
April 14, 2019
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...
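A minimal sketch of the regularized Cox model mentioned above (here with a ridge penalty, reusing the simulation style of the earlier snippet; the penalty strengths are arbitrary illustrative values, and this is not the paper's own analysis):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
N, p = 200, 60
beta_true = np.r_[np.ones(5), np.zeros(p - 5)]
X = rng.standard_normal((N, p)) / np.sqrt(p)
T = rng.exponential(scale=np.exp(-X @ beta_true))
order = np.argsort(T)
X, T = X[order], T[order]

def penalized_objective(beta, eta):
    lp = X @ beta
    log_risk = np.log(np.cumsum(np.exp(lp)[::-1])[::-1])     # risk-set sums
    return -(lp - log_risk).sum() + 0.5 * eta * beta @ beta  # ridge penalty

for eta in (0.0, 1.0, 10.0):
    b = minimize(penalized_objective, np.zeros(p), args=(eta,), method="BFGS").x
    print(f"eta = {eta:4.1f}   ||beta_hat|| = {np.linalg.norm(b):.3f}")

# Increasing eta shrinks the estimator towards zero, trading overfitting bias
# for shrinkage bias; the relation between true and inferred parameters as a
# function of such regularization is what theory of this kind aims to predict.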
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. This limitation has become a serious bottleneck for many disciplines with increasingly high-dimensional data. We recently showed that in Cox regression for time-to-event data the overfit...
May 22, 2024
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high-dimensional regime, where the maximum partial likelihood estimator (without regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is...
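As a concrete, hedged illustration of elastic-net penalized Cox regression, the snippet below uses lifelines' CoxPHFitter, assuming a version that accepts the penalizer and l1_ratio arguments; the data dimensions and penalty values are illustrative only and are not taken from the paper.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(3)
N, p = 150, 100                                   # high-dimensional regime
beta_true = np.r_[np.ones(5), np.zeros(p - 5)]    # sparse true association
X = rng.standard_normal((N, p)) / np.sqrt(p)
df = pd.DataFrame(X, columns=[f"x{j}" for j in range(p)])
df["time"] = rng.exponential(scale=np.exp(-X @ beta_true))
df["event"] = 1                                   # all events observed here

# l1_ratio interpolates between ridge (0) and lasso (1); the L1 part drives
# most coefficients to (numerically) zero, which is why the penalized
# estimator can behave well even where the unpenalized one breaks down.
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph.fit(df, duration_col="time", event_col="event")
print("coefficients above 1e-3 in absolute value:",
      int((cph.params_.abs() > 1e-3).sum()))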
April 12, 2022
Regression analysis based on many covariates is becoming increasingly common. However, when the number of covariates $p$ is of the same order as the number of observations $n$, maximum likelihood regression becomes unreliable due to overfitting. This typically leads to systematic estimation biases and increased estimator variances. It is crucial for inference and prediction to quantify these effects correctly. Several methods have been proposed in the literature to overcome overf...
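To make the bias concrete outside the survival setting, here is a small self-contained illustration (not from the paper): logistic regression fitted by Newton's method with p/n = 0.2, where the average maximum likelihood estimate of the truly non-zero coefficients tends to come out noticeably above the true value; all dimensions and coefficient values are illustrative choices.

import numpy as np

rng = np.random.default_rng(4)
n, p, reps = 500, 100, 20                 # p/n = 0.2
beta_true = np.r_[0.5 * np.ones(10), np.zeros(p - 10)]

def fit_logistic(X, y, iters=30):
    """Plain Newton-Raphson for the logistic log-likelihood (no penalty)."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        H = X.T @ (X * (mu * (1 - mu))[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta += np.linalg.solve(H, X.T @ (y - mu))
    return beta

signal_estimates = []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))
    signal_estimates.append(fit_logistic(X, y)[:10].mean())

# The average over the truly non-zero coefficients sits systematically above
# the true value 0.5: the systematic bias (with inflated variance) that the
# abstract above refers to.
print("true value 0.5, average ML estimate:", np.mean(signal_estimates))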
March 21, 2024
Recently, the increasing availability of repeated measurements in biomedical studies has motivated the development of several statistical methods for the dynamic prediction of survival in settings where a large (potentially high-dimensional) number of longitudinal covariates is available. These methods differ both in how they model the longitudinal covariate trajectories and in how they specify the relationship between the longitudinal covariates and the survival outcome. Beca...
May 5, 2022
In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to predict poorly due to overfitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. We will review various c...
April 3, 2022
The proportional hazards model has been extensively used in many fields, such as biomedicine, to estimate the effects of covariates on patients' survival times and to test their statistical significance. The classical theory of maximum partial likelihood estimation (MPLE) is used by most software packages to produce inference, e.g., the coxph function in R and the PHREG procedure in SAS. In this paper, we investigate the asymptotic behavior of the MPLE in the...
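For concreteness, the snippet below shows what such classical MPLE-based inference looks like in practice, using Python's lifelines as a stand-in for R's coxph or SAS's PHREG; the simulated data and the covariate names are purely illustrative.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(6)
N = 500
X = rng.standard_normal((N, 3))
beta_true = np.array([0.5, -0.3, 0.0])        # "noise" has no real effect
df = pd.DataFrame(X, columns=["age", "biomarker", "noise"])
df["time"] = rng.exponential(scale=np.exp(-X @ beta_true))
df["event"] = 1

# print_summary() reports the classical Wald standard errors, z-scores and
# p-values whose large-N, fixed-p justification this line of work re-examines.
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
cph.print_summary()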
August 2, 2007
Many researchers have investigated first hitting times as models for survival data. First hitting times arise naturally in many types of stochastic processes, ranging from Wiener processes to Markov chains. In a survival context, the state of the underlying process represents the strength of an item or the health of an individual. The item fails or the individual experiences a clinical endpoint when the process reaches an adverse threshold state for the first time. The time s...
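A brief sketch of this first-hitting-time view of survival (with arbitrary illustrative parameter choices, not taken from the paper): a Wiener process with negative drift represents declining health, and the survival time is the first time the process crosses an adverse threshold.

import numpy as np

rng = np.random.default_rng(5)
n_subjects, dt, horizon = 200, 0.01, 100.0
mu, sigma = -0.2, 1.0            # drift (rate of health decline) and volatility
x0, threshold = 5.0, 0.0         # initial health level and adverse boundary

def first_hitting_time():
    """Simulate one subject until the process hits the threshold or the horizon."""
    x, t = x0, 0.0
    while t < horizon:
        x += mu * dt + sigma * np.sqrt(dt) * rng.standard_normal()
        t += dt
        if x <= threshold:
            return t, 1          # clinical endpoint observed
    return horizon, 0            # administratively censored at the horizon

times, events = zip(*(first_hitting_time() for _ in range(n_subjects)))
# With x0 above the threshold and negative drift, the hitting time follows an
# inverse Gaussian distribution with mean x0/|mu| = 25 in these units, so the
# simulated mean should land near that value.
print("mean observed time:", np.mean(times),
      "  censored fraction:", 1 - np.mean(events))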
June 19, 2024
In contemporary data analysis, massive datasets have become increasingly important, but they often place considerable demands on computational time and memory. While many existing works offer optimal subsampling methods for conducting analyses on subsamples with minimal efficiency loss, they lack tools for judiciously selecting the optimal subsample size. To bridge this gap, our work introduces tools designed for choosing the op...