December 5, 2023
We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, maximum likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) replica symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and the cumulative hazard.
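The overfitting bias described above can be reproduced in a few lines of numpy. The sketch below is not the paper's RS-based correction algorithm; it is a minimal simulation, with illustrative parameter values, of ML Cox regression under random censoring: event times are drawn with hazard $\exp(\mathbf{x}\cdot\boldsymbol{\beta})$, censoring is independent exponential, and the partial likelihood is maximized by Newton iteration. The projection of the fitted coefficients onto the true ones exposes the multiplicative inflation of the associations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 50                                   # p/N = 0.1: overfitting regime
beta_true = rng.normal(0, 2 / np.sqrt(p), p)     # illustrative signal strength
X = rng.normal(0, 1, (N, p))
T = rng.exponential(1 / np.exp(X @ beta_true))   # event times, hazard exp(x.beta)
C = rng.exponential(1.0, N)                      # independent censoring times
t, d = np.minimum(T, C), (T <= C).astype(float)  # observed time, event indicator

order = np.argsort(-t)               # decreasing time: risk set of i = prefix [0..i]
Xs, ds = X[order], d[order]

def grad_hess(b):
    """Gradient and Hessian of the Cox log partial likelihood (Breslow form)."""
    w = np.exp(Xs @ b)
    cw = np.cumsum(w)
    mu = np.cumsum(w[:, None] * Xs, axis=0) / cw[:, None]     # risk-set mean covariate
    S2 = np.cumsum(np.einsum('i,ij,ik->ijk', w, Xs, Xs), axis=0) / cw[:, None, None]
    g = (ds[:, None] * (Xs - mu)).sum(axis=0)
    H = -np.einsum('i,ijk->jk', ds, S2 - np.einsum('ij,ik->ijk', mu, mu))
    return g, H

b = np.zeros(p)
for _ in range(30):                  # Newton-Raphson on the concave objective
    g, H = grad_hess(b)
    b -= np.linalg.solve(H, g)

kappa = (b @ beta_true) / (beta_true @ beta_true)
print(f"censoring rate: {1 - d.mean():.2f}   inflation slope: {kappa:.2f}")
```

The slope `kappa` comes out above one, i.e. the ML associations are systematically too large in magnitude, which is the bias the theory quantifies and removes.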
Similar papers
May 4, 2017
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitt...
April 14, 2019
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfit...
May 22, 2024
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high dimensional regime, where the maximum partial likelihood estimator (no regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is...
April 13, 2015
To go beyond standard first-order asymptotics for Cox regression, we develop parametric bootstrap and second-order methods. In general, computation of $P$-values beyond first order requires more model specification than is required for the likelihood function. It is problematic to specify a censoring mechanism in enough detail for it to be taken seriously, and conditioning on the censoring does not appear to be a viable alternative. We circumvent this matter by employing a refere...
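The core move of a parametric bootstrap can be illustrated in a model much simpler than the paper's Cox setting. The hedged sketch below uses an exponential survival model with illustrative numbers: resampling is done from the *fitted* model rather than from the empirical data, and the resulting bootstrap standard error is compared against the first-order asymptotic one.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.3, size=60)   # observed survival times (illustrative)
lam_hat = 1.0 / x.mean()                  # ML estimate of the exponential rate

# Parametric bootstrap: draw B replicate samples from the fitted model,
# then recompute the estimator on each replicate.
B = 5000
reps = 1.0 / rng.exponential(scale=1.0 / lam_hat, size=(B, x.size)).mean(axis=1)

se_boot = reps.std()                      # bootstrap standard error of lam_hat
se_asym = lam_hat / np.sqrt(x.size)       # first-order asymptotic standard error
print(f"bootstrap SE: {se_boot:.3f}   asymptotic SE: {se_asym:.3f}")
```

The two standard errors nearly coincide here; the bootstrap's value lies in automatically picking up higher-order corrections that the first-order formula misses, which is exactly the regime the abstract targets.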
October 6, 2018
The Cox proportional hazards regression model is a popular tool for analyzing the relationship between a censored lifetime variable and other relevant factors. The semi-parametric Cox model is widely used to study different types of data arising from applied disciplines such as medical science, biology, and reliability studies. A fully parametric version of the Cox regression model, if properly specified, can yield more efficient parameter estimates leading to better insight ...
May 9, 2023
We study parametric inference on a rich class of hazard regression models in the presence of right-censoring. Previous literature has reported some inferential challenges, such as multimodal or flat likelihood surfaces, in this class of models for some particular data sets. We formalize the study of these inferential problems by linking them to the concepts of near-redundancy and practical non-identifiability of parameters. We show that the maximum likelihood estimators of th...
September 2, 2022
Prevalent cohort sampling is commonly used to study the natural history of a disease when the disease is rare or the failure event typically takes a long time to observe. It is known, however, that the sample collected in this situation is not representative of the target population, which in turn leads to biased sample risk sets. In addition, when survival times are subject to censoring, the censoring mechanism is informative. In this paper, I propose a pseudo-partial likeli...
November 29, 2019
Inferential challenges that arise when data are censored have been extensively studied under the classical frameworks. In this paper, we provide an alternative generalized inferential model approach whose output is a data-dependent plausibility function. This construction is driven by an association between the distribution of the relative likelihood function at the interest parameter and an unobserved auxiliary variable. The plausibility function emerges from the distributio...
August 4, 2021
Non-parametric maximum likelihood estimation encompasses a group of classic methods to estimate distribution-associated functions from potentially censored and truncated data, with extensive applications in survival analysis. These methods, including the Kaplan-Meier estimator and Turnbull's method, often result in overfitting, especially when the sample size is small. We propose an improvement to these methods by applying kernel smoothing to their raw estimates, based on a B...
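The generic idea of smoothing a raw non-parametric estimate can be sketched briefly. The code below is not the paper's method (its data-driven bandwidth selection is truncated in the abstract above); it is a minimal illustration, with an ad hoc fixed bandwidth, of computing the raw Kaplan-Meier step function and passing a Gaussian kernel over it.

```python
import numpy as np

def kaplan_meier(t, d):
    """Raw Kaplan-Meier survival estimate at the sorted observation times.
    t: observed times; d: event indicators (1 = event, 0 = censored).
    Assumes no tied times, for simplicity of the sketch."""
    order = np.argsort(t)
    t, d = t[order], d[order]
    at_risk = len(t) - np.arange(len(t))        # risk-set size at each time
    return t, np.cumprod(1.0 - d / at_risk)

def kernel_smooth(times, surv, grid, h):
    """Gaussian-kernel (Nadaraya-Watson) smoothing of the raw step function."""
    K = np.exp(-0.5 * ((grid[:, None] - times[None, :]) / h) ** 2)
    return (K * surv[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(2)
T = rng.exponential(1.0, 150)                   # true survival: S(t) = exp(-t)
C = rng.exponential(2.0, 150)                   # independent right-censoring
t, d = np.minimum(T, C), (T <= C).astype(float)

times, surv = kaplan_meier(t, d)
grid = np.linspace(0.1, 2.0, 50)
smoothed = kernel_smooth(times, surv, grid, h=0.2)
```

The smoothed curve trades the spurious jumps of the small-sample step estimate for a small amount of kernel-induced bias; choosing `h` well is the crux, which is what the paper's bandwidth-selection procedure addresses.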