December 5, 2023
We use statistical mechanics techniques, viz. the replica method, to model the effect of censoring on overfitting in Cox's proportional hazards model, the dominant regression method for time-to-event data. In the overfitting regime, maximum likelihood (ML) parameter estimators are known to be biased already for small values of the ratio of the number of covariates over the number of samples. The inclusion of censoring was avoided in previous overfitting analyses for mathematical convenience, but is vital to make any theory applicable to real-world medical data, where censoring is ubiquitous. Upon constructing efficient algorithms for solving the new (and more complex) replica symmetric (RS) equations and comparing the solutions with numerical simulation data, we find excellent agreement, even for large censoring rates. We then address the practical problem of using the theory to correct the biased ML estimators without knowledge of the data-generating distribution. This is achieved via a novel numerical algorithm that self-consistently approximates all relevant parameters of the data-generating distribution while simultaneously solving the RS equations. We investigate numerically the statistics of the corrected estimators, and show that the proposed new algorithm indeed succeeds in removing the bias of the ML estimators, for both the association parameters and the cumulative hazard.
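The overfitting bias described above can be reproduced in a few lines of numpy. The sketch below is not the paper's RS-based correction algorithm; it is a minimal simulation, with illustrative parameter values, of ML Cox regression under random censoring: event times are drawn with hazard $\exp(\mathbf{x}\cdot\boldsymbol{\beta})$, censoring is independent exponential, and the partial likelihood is maximized by Newton iteration. The projection of the fitted coefficients onto the true ones exposes the multiplicative inflation of the associations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 500, 50                                   # p/N = 0.1: overfitting regime
beta_true = rng.normal(0, 2 / np.sqrt(p), p)     # illustrative signal strength
X = rng.normal(0, 1, (N, p))
T = rng.exponential(1 / np.exp(X @ beta_true))   # event times, hazard exp(x.beta)
C = rng.exponential(1.0, N)                      # independent censoring times
t, d = np.minimum(T, C), (T <= C).astype(float)  # observed time, event indicator

order = np.argsort(-t)               # decreasing time: risk set of i = prefix [0..i]
Xs, ds = X[order], d[order]

def grad_hess(b):
    """Gradient and Hessian of the Cox log partial likelihood (Breslow form)."""
    w = np.exp(Xs @ b)
    cw = np.cumsum(w)
    mu = np.cumsum(w[:, None] * Xs, axis=0) / cw[:, None]     # risk-set mean covariate
    S2 = np.cumsum(np.einsum('i,ij,ik->ijk', w, Xs, Xs), axis=0) / cw[:, None, None]
    g = (ds[:, None] * (Xs - mu)).sum(axis=0)
    H = -np.einsum('i,ijk->jk', ds, S2 - np.einsum('ij,ik->ijk', mu, mu))
    return g, H

b = np.zeros(p)
for _ in range(30):                  # Newton-Raphson on the concave objective
    g, H = grad_hess(b)
    b -= np.linalg.solve(H, g)

kappa = (b @ beta_true) / (beta_true @ beta_true)
print(f"censoring rate: {1 - d.mean():.2f}   inflation slope: {kappa:.2f}")
```

The slope `kappa` comes out above one, i.e. the ML associations are systematically too large in magnitude, which is the bias the theory quantifies and removes.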
Similar papers
May 4, 2017
Overfitting, which happens when the number of parameters in a model is too large compared to the number of data points available for determining these parameters, is a serious and growing problem in survival analysis. While modern medicine presents us with data of unprecedented dimensionality, these data cannot yet be used effectively for clinical outcome prediction. Standard error measures in maximum likelihood regression, such as p-values and z-scores, are blind to overfitt...
April 14, 2019
The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension $p$ is comparable to the sample size $N$, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and infer...
Nearly all statistical inference methods were developed for the regime where the number $N$ of data samples is much larger than the data dimension $p$. Inference protocols such as maximum likelihood (ML) or maximum a posteriori probability (MAP) are unreliable if $p=O(N)$, due to overfitting. For many disciplines with increasingly high-dimensional data, this limitation has become a serious bottleneck. We recently showed that in Cox regression for time-to-event data the overfit...
May 22, 2024
We investigate analytically the behaviour of the penalized maximum partial likelihood estimator (PMPLE). Our results are derived for a generic separable regularization, but we focus on the elastic net. This penalization is routinely adopted for survival analysis in the high dimensional regime, where the maximum partial likelihood estimator (no regularization) might not even exist. Previous theoretical results require that the number $s$ of non-zero association coefficients is...
April 13, 2015
To go beyond standard first-order asymptotics for Cox regression, we develop parametric bootstrap and second-order methods. In general, computation of $P$-values beyond first order requires more model specification than is required for the likelihood function. It is problematic to specify a censoring mechanism in enough detail for it to be taken seriously, and conditioning on the censoring does not appear to be a viable alternative. We circumvent this matter by employing a refere...
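The core move of a parametric bootstrap can be illustrated in a model much simpler than the paper's Cox setting. The hedged sketch below uses an exponential survival model with illustrative numbers: resampling is done from the *fitted* model rather than from the empirical data, and the resulting bootstrap standard error is compared against the first-order asymptotic one.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.3, size=60)   # observed survival times (illustrative)
lam_hat = 1.0 / x.mean()                  # ML estimate of the exponential rate

# Parametric bootstrap: draw B replicate samples from the fitted model,
# then recompute the estimator on each replicate.
B = 5000
reps = 1.0 / rng.exponential(scale=1.0 / lam_hat, size=(B, x.size)).mean(axis=1)

se_boot = reps.std()                      # bootstrap standard error of lam_hat
se_asym = lam_hat / np.sqrt(x.size)       # first-order asymptotic standard error
print(f"bootstrap SE: {se_boot:.3f}   asymptotic SE: {se_asym:.3f}")
```

The two standard errors nearly coincide here; the bootstrap's value lies in automatically picking up higher-order corrections that the first-order formula misses, which is exactly the regime the abstract targets.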
October 6, 2018
The Cox proportional hazards regression model is a popular tool for analyzing the relationship between a censored lifetime variable and other relevant factors. The semi-parametric Cox model is widely used to study different types of data arising from applied disciplines such as medical science, biology, and reliability studies. A fully parametric version of the Cox regression model, if properly specified, can yield more efficient parameter estimates leading to better insight ...
May 9, 2023
We study parametric inference on a rich class of hazard regression models in the presence of right-censoring. Previous literature has reported some inferential challenges, such as multimodal or flat likelihood surfaces, in this class of models for some particular data sets. We formalize the study of these inferential problems by linking them to the concepts of near-redundancy and practical non-identifiability of parameters. We show that the maximum likelihood estimators of th...
September 2, 2022
Prevalent cohort sampling is commonly used to study the natural history of a disease when the disease is rare or the failure event typically takes a long time to observe. It is known, however, that the sample collected in this situation is not representative of the target population, which in turn leads to biased sample risk sets. In addition, when survival times are subject to censoring, the censoring mechanism is informative. In this paper, I propose a pseudo-partial likeli...
November 29, 2019
Inferential challenges that arise when data are censored have been extensively studied under the classical frameworks. In this paper, we provide an alternative generalized inferential model approach whose output is a data-dependent plausibility function. This construction is driven by an association between the distribution of the relative likelihood function at the interest parameter and an unobserved auxiliary variable. The plausibility function emerges from the distributio...
August 4, 2021
Non-parametric maximum likelihood estimation encompasses a group of classic methods to estimate distribution-associated functions from potentially censored and truncated data, with extensive applications in survival analysis. These methods, including the Kaplan-Meier estimator and Turnbull's method, often result in overfitting, especially when the sample size is small. We propose an improvement to these methods by applying kernel smoothing to their raw estimates, based on a B...
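The generic idea of smoothing a raw non-parametric estimate can be sketched briefly. The code below is not the paper's method (its data-driven bandwidth selection is truncated in the abstract above); it is a minimal illustration, with an ad hoc fixed bandwidth, of computing the raw Kaplan-Meier step function and passing a Gaussian kernel over it.

```python
import numpy as np

def kaplan_meier(t, d):
    """Raw Kaplan-Meier survival estimate at the sorted observation times.
    t: observed times; d: event indicators (1 = event, 0 = censored).
    Assumes no tied times, for simplicity of the sketch."""
    order = np.argsort(t)
    t, d = t[order], d[order]
    at_risk = len(t) - np.arange(len(t))        # risk-set size at each time
    return t, np.cumprod(1.0 - d / at_risk)

def kernel_smooth(times, surv, grid, h):
    """Gaussian-kernel (Nadaraya-Watson) smoothing of the raw step function."""
    K = np.exp(-0.5 * ((grid[:, None] - times[None, :]) / h) ** 2)
    return (K * surv[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(2)
T = rng.exponential(1.0, 150)                   # true survival: S(t) = exp(-t)
C = rng.exponential(2.0, 150)                   # independent right-censoring
t, d = np.minimum(T, C), (T <= C).astype(float)

times, surv = kaplan_meier(t, d)
grid = np.linspace(0.1, 2.0, 50)
smoothed = kernel_smooth(times, surv, grid, h=0.2)
```

The smoothed curve trades the spurious jumps of the small-sample step estimate for a small amount of kernel-induced bias; choosing `h` well is the crux, which is what the paper's bandwidth-selection procedure addresses.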