# Analysis of overfitting in the regularized Cox model

The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension p is comparable to the sample size N, maximum likelihood estimates for its regression parameters are known to be biased or break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and inferred regression parameters in regularized multivariate Cox regression with L2 regularization, in the regime where both p and N are large but with p/N O(1). We thereby generalize a recent study from maximum likelihood to maximum a posteriori inference. We also establish a relationship between the optimal regularization parameter and p/N, allowing for straightforward overfitting corrections in time-to-event analysis.

## Authors

• 3 publications
• 3 publications
04/14/2020

### Replica analysis of overfitting in generalized linear models

Nearly all statistical inference methods were developed for the regime w...
04/12/2022

### Correction of overfitting bias in regression models

Regression analysis based on many covariates is becoming increasingly co...
08/19/2020

### Structure Learning in Inverse Ising Problems Using ℓ_2-Regularized Linear Estimator

Inferring interaction parameters from observed data is a ubiquitous requ...
12/02/2021

### Optimal regularizations for data generation with probabilistic graphical models

Understanding the role of regularization is a central question in Statis...
11/25/2017

### Complex Structure Leads to Overfitting: A Structure Regularization Decoding Method for Natural Language Processing

Recent systems on structured prediction focus on increasing the level of...
07/01/2016

### A new analytical approach to consistency and overfitting in regularized empirical risk minimization

This work considers the problem of binary classification: given training...
09/25/2020

### Towards the interpretation of time-varying regularization parameters in streaming penalized regression models

High-dimensional, streaming datasets are ubiquitous in modern applicatio...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Inference of parameters for generalized linear models using the maximum likelihood (ML) protocol becomes increasingly biased due to overfitting as the ratio increases, where is the number of covariates and the number of training data. Overfitting occurs when model parameters seek to explain not only the ‘signal’ but also the ‘noise’ in training data, and is characterized by a difference in outcome prediction accuracy between training and validation samples. See e.g. [1, 3, 4, 5]

for examples from logistic regression, gamma distributions and Cox models

[1]. Hence, standard statistical significance tests for regression coefficients, being usually based on asymptotic results derived for fixed , become increasingly inaccurate [6]. Unfortunately, in post-genome medicine, having large ratios

is the rule rather than the exception. This prompted epidemiologists to formulate heuristic rules for avoiding overfitting, such as limits on the number of events per variable

[7, 8, 9, 10, 11]. Approximate recipes for correcting ML estimates were developed in e.g. [12, 13]

. Alternative methods of addressing the overfitting problem include feature selection and regularization. In feature selection one seeks to identify a subset of covariates that are informative of outcomes

[14, 15, 16]

. Its advantages include reduction in the required computational resources, and increased interpretability. In regularization one adds a penalty term to the objective function of ML inference (which can alternatively be derived from a prior in Bayesian inference) to suppress the number or magnitude of the model parameters

[17, 18]. Application of regularization to survival analysis with high-dimensional covariates is studied widely, see e.g. [19, 20] and references therein.

A recent study [1] provided a new approach to overfitting in survival analysis. It showed how the replica method from statistical physics can be used to model ML inference analytically as the zero noise limit of a suitably defined stochastic minimization, starting from an information-theoretic measure of overfitting. The theory predicted the quantitative relation between ML-inferred and true parameters in the Cox model [1]

, and a phase transition at

.

Let us denote the set of model parameters as , and the data as

. The observation that ML inference is equivalent to minimization of the Kullback-Leibler divergence between the empirical data distribution

and the parametrized distribution assumed as a model of the data, suggests [1] using as a measure of overfitting111

A similar idea for comparing estimators of probability distributions was used in

[22], using the Lévy distance rather than the KL divergence. Other measures of overfitting can be found in e.g. [20, 23].
, in which are the true (but a priori unknown) parameter values. Perfect regression implies , underfitting implies , and overfitting implies . To gain more intuition for this measure, we generate synthetic data from a simple logistic regression model, find the ML estimators of its parameters, and calculate . Here the parameters are , the data are , with and , and we use the short-hand (with the convention ). The outcome likelihood and the measure are given by

 Pβ(y|x) = (11+\rme−β⋅x)y(11+\rmeβ⋅x)1−y (1.1) E(β⋆,D) = 1NN∑i=1{yilog(1+\rme−β⋅xi1+\rme−β⋆⋅xi)+(1−yi)log(1+\rmeβ⋅xi1+\rmeβ⋆⋅xi)} (1.2)

Results are shown in Figure 1. When , converges towards zero during the minimization, indicating perfect parameter recovery. As the number of samples in the data set is reduced, giving and , converges to increasingly negative values. Since there is no model mismatch (the data were generated from a logistic model), the negative values of indicate overfitting.

Switching from maximum likelihood to maximum a posteriori estimators implies adding a penalty term to the likelihood:

where represents a parameter prior, giving

 E(ϑ⋆,D) ≡ (1.3) = minϑ{1NN∑i=1logp(ti|zi,ϑ⋆)p(ϑ⋆)p(ti|zi,ϑ)p(ϑ)}

MAP regression is equivalent to minimizing the quantity (1.3). This minimization should in principle be over all , but may in practice be constrained to simplify the calculation (see e.g. [24, 25, 26]). For generalized linear models, commonly used priors are (giving regularization222This choice promotes sparsity in the regression coefficient vector , which would result in a horizontal line segment passing through the origin in Fig. 2. Since our theory aims to predict the slope of the data clouds in Fig. 2, we will not pursue regularizers in this paper.

, or ‘LASSO’ regression

[18]) and (giving

regularization, or ‘ridge’ regression).

In the present paper we generalize the replica analysis of [1] from ML to MAP inference, upon adding an regularization term to the log-likelihood function. This term suppresses overfitting effects, and removes the ML phase transition of the Cox model [1] at ; see e.g. Fig. 2. In the presence of an regularizer, correlations between covariates can no longer be transformed away, as was done in [1], leading to the appearance in the theory of the population covariance matrix

of the covariates. Under mild restrictions on the eigenvalue spectrum of this matrix, we show how an accurate theory of overfitting for the regularized Cox proportional hazards model can be developed, in spite of such additional mathematical complications, including for the previously inaccessible regime

. We find, as in [1], that the replica symmetric version of the theory is sufficient to explain accurately the behaviour of interest. The resulting equations can also be used to predict the amount of regularization needed for unbiased regression, expressed in term of spectum of and the ratio .

## 2 Replica analysis of regularized Cox regression

### 2.1 Generalized replica formalism to include priors

Following [1], we interpret minimization of (1.3

) as computing the ground state energy of a statistical mechanical system with degrees of freedom

and Hamiltonian , at inverse temperature , where and

 H(ϑ|ϑ⋆,D) = logN∏i=1[p(ti|zi,ϑ⋆)p(ϑ⋆)p(ti|zi,ϑ)p(ϑ)] (2.1)

We define the associated free energy, which we average over the disorder (the microscopic realization of ), and can compute the disorder-averaged ground state energy as the limit of the disorder-averaged energy density , where

 Eγ(ϑ⋆) = −1N∂∂γ⟨log∫\rmdϑ N∏i=1[p(ti|zi,ϑ)p(ϑ)p(ti|zi,ϑ⋆)p(ϑ⋆)]γ⟩D

The replica identity is subsequently used to simplify the average of the logarithm (see [1] for details and references), giving in the present case

 Eγ(ϑ⋆) = −∂∂γlimn→01Nnlog⟨{∫\rmdϑ N∏i=1[p(ti|zi,ϑ)p(ϑ)p(ti|zi,ϑ⋆)p(ϑ⋆)]γ}n⟩D (2.3) = = ×{∫\rmdz\rmdt p(z)p(t|z,ϑ⋆)n∏α=1[p(t|z,ϑα))p(t|z,ϑ⋆)]γ}N

Equation (2.3

) is applicable to any parametric model

and any prior . See also [27] for alternative results on the use of the replica method in statistical inference. We will now make a specific choice for , and use (2.3) to develop a theory for regression and overfitting in regularized Cox models with Gaussian priors.

### 2.2 Application to the regularized Cox proportional hazards model

Cox’s proportional hazards model originally described in [2] assumes a parametrization of the form

 p(t|z,ϑ) = λ(t)\rmeβ⋅z−exp(β⋅z)∫t0\rmdt′ λ(t′) (2.4)

Its parameters are the coefficients , and a base hazard rate (a nonnegative function defined for ). Substituting translates (2.3) into

 Eγ(β⋆,λ⋆) = −∂∂γlimn→01Nnlog∫{\rmdλ1…\rmdλn}∫\rmdβ1…\rmdβn{n∏α=1[p(βα)p(β⋆)]γ} (2.5) ×{∫\rmdz\rmdt p(z)p(t|z,β⋆,λ⋆)n∏α=1[p(t|z,βα,λα))p(t|z,β⋆,λ⋆)]γ}N

Functional integrals are written as , the true parameters responsible for the data are written as , and we follow the standard convention for regularized Cox models of only including a prior for the association parameters (equivalently, assuming an improper, or ‘flat’, prior for the base hazard rate). Our prior is . To proceed with the analytical treatment, we assume that the covariate vectors are drawn independently from a population distribution with zero mean and covariance matrix . The introduction of regularization means that the regression equations for correlated covariates can no longer be transformed to those corresponding to uncorrelated ones. This leads to a more complex theory than [1], and ultimately to conditions on the eigenvalue spectrum of .

Our analysis is carried out in the regime where both but with fixed ratio . To retain non-zero event times, even for , we must rescale the regression coefficients according to , resulting in . Without this rescaling we would have event time distributions with all weight concentrated on and . We also replace by , to allow for more compact notation. Following [1] we next introduce

 p(y|β0,…,βn)=∫\rmdz p(z)n∏α=0δ[yα−βα⋅z√p] (2.6)

where . The magnitude of represents the relative risk of failure, compared to that of an ‘average’ individual (with ). Therefore can be considered a vector of risk scores. Our energy density then becomes

 Eγ(β⋆,λ⋆) = −∂∂γlimn→01Nnlog∫{\rmdλ1…\rmdλn}∫\rmdβ1…\rmdβnn∏α=1[p(βα)p(β0)]γ (2.7) ×{∫\rmdy p(y|β0,…,βn)∫\rmdt p(t|y0,λ0)n∏α=1[p(t|yα,λα)p(t|y0,λ0)]γ}N

in which now

 p(t|y,λ) = λ(t)\rmey−exp(y)∫t0\rmdt′ λ(t′) (2.8)

To proceed we assume that is Gaussian. This holds for any and as soon as

is Gaussian, and for non-Gaussian covariate statistics it will generally hold due to the Central Limit Theorem if the correlations among the covariates are weak, and

and are large. Since we assumed , the risk score distribution is now given by

 p(y|β0,…,βn)=\rme−12y⋅C−1[{β}]y√(2π)n+1detC[{β}] (2.9)

It is determined in full by the covariance matrix , with entries

 Cαρ[{β}] = ∫\rmdz p(z)(βα⋅z√p)(βρ⋅z√p) =1pβα⋅Aβρ (2.10)

The entries of are given by . The measure the similarity between the -dimensional vectors formed by the regression parameters in different replicas. For each replica pair we use the integral representation of the Dirac delta function, and rescale the conjugate integration parameter by ,

 1=∫\rmdCαρ δ[Cαρ−1pβα⋅Aβρ]=∫\rmdCαρ\rmd^Cαρ2π/p\rme\rmip^Cαρ(Cαρ−βα⋅Aβρ) (2.11)

in order to simplify expression (2.7) to

 Eγ(β⋆,λ⋆) = −∂∂γlimn→01Nnlog∫{\rmdλ1…\rmdλn}∫\rmdC\rmd^C \rme\rmip∑nα,ρ=0^CαρCαρ(2π/p)(n+1)2 (2.12) ×∫\rmdβ1…dβn \rme−ηγ∑nα=1[(βα)2−(β0)2]−\rmi∑nα,ρ=0^Cαρβα⋅Aβρ

The quadratic nature of the exponent in the integral, a consequence of having chosen regularization, allows for a closed form solution. Changing the penalty term to or with would significantly complicate the integrals.

### 2.3 Conversion into a saddle point problem

With a modest amount of foresight we transform , and introduce the short-hand . To evaluate the Gaussian integral in (2.12) we define the matrix and the -dimensional vector , with entries

 Ξαμ;βν=2ηγδαβ(A−1)μν+δμνDαβ,      ξαμ=−D0α~β0μ (2.13)

With these definitions we may write the Gaussian integral in (2.12) as

 ∫(n∏α=1\rmd~βα\rme−ηγ~βα⋅A−1~βα)\rme−12∑nα,ρ=1Dαρ~βα⋅~βρ−∑nρ=1D0ρ~β0⋅~βρ (2.14) = \rme12ξ⋅Ξ−1ξ∫\rmd~β \rme−12(~β−Ξ−1ξ)⋅Ξ(~β−Ξ−1ξ)=(2π)np2√detΞ\rme12ξ⋅Ξ−1ξ

Let and denote the eigenvalues of and , respectively. The two terms and of the matrix , with components and

, clearly commute. The complete set of eigenvectors of

can therefore be written as , with components , and where and , and where both are normalised according to . The eigenvalues of are then , and

 detΞ=p∏μ=1n∏α=1(2ηγaμ+bα),      (Ξ−1)αμ,α′μ′=n∑β=1p∑ν=1uβαvνμuβα′vνμ′2ηγ/aν+bβ (2.15)

Hence the integral (2.14) can be written as

 (2π)np2√detΞ \rme12ξ⋅Ξ−1ξ = \rme12nplog(2π)−12np⟨log(2ηγ/a+b)⟩+12np⟨(ξ⋅^u)2(2ηγ/a+b)−1⟩ (2.16)

where the averages in the exponents are over the eigenvalues and orthonormal eigenvectors of , i.e. . Since with , the integrals over , and the base hazard rates in (2.12) can for be evaluated by steepest descent, provided the limits and commute. Expression (2.16) then enables us to write the result as

 limN→∞Eγ(β⋆,λ⋆) = ∂∂γlimn→01nextrΨ(C,D,λ1…λn) (2.17)

in which

 Ψ(C,D,λ1…λn) = −12ζ[n∑α,ρ=0DαρCαρ−1pD00(~β0)2]+12(n+1−nζ)log(2π) (2.18) −log∫\rmdy \rme−12y⋅C−1y∫\rmdt p(t|y0,λ0)n∏α=1[p(t|yα,λα)p(t|y0,λ0)]γ

where we have defined . Differentiating with respect to removes from the problem, and gives .

## 3 Replica symmetric theory

### 3.1 Replica symmetric saddle points

To proceed, we make the replica symmetric ansatz, which implies assuming ergodicity of the stochastic regression process, and translates into invariance of all order parameters under all permutations of the replicas . Now, for all :

 λα(t)=λ(t),      C0α=c0D0α=d0,      Cαρ=Cδαρ+c(1−δαρ)Dαρ=Dδαρ+d(1−δαρ) (3.1)

Both and are positive definite, so and . We may now write

 C=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝C00c0……c0c0Cc…c⋮cC…c⋮⋮⋮⋱⋮c0cc…C⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,      C−1=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝B00b0……b0b0Bb…b⋮bB…b⋮⋮⋮⋱⋮b0bb…B⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠

The eigenvalues and eigenvectors of , and are found in [1]. has two nondegenerate eigenvalues with , and a further fold degenerate eigenvalue . Hence

 logdetC = log([C+(n−1)c]C00−nc20)+(n−1)log(C−c) = logC00+nlog(C−c)+n(c−c20/C00)C−c+O(n2)

The entries of are found to be

 B00=C+(n−1)cC00[C+(n−1)c]−nc20,      b0=−c0C00[C+(n−1)c]−nc20 (3.3) B=b+1C−c,      b=c20−cC00(C00[C+(n−1)c]−nc20)(C−c) (3.4)

Hence

 y⋅C−1y = B00(y0)2+(B−b)n∑α=1(yα)2+b(n∑α=1yα)2+2b0y0n∑α=1yα

Next we turn to terms in (2.18) that involve the spectrum of . This matrix has one eigenvalue with eigenvector , and the fold degenerate eigenvalue

with eigenspace

. Hence

 ⟨log(2ηγa+b)⟩ = 1npp∑μ=1[(n−1)log(2ηγa+D−d)+log(2ηγa+D−d+nd)] (3.6) = ⟨log(2ηγa+D−d)⟩+⟨da2ηγ+(D−d)a⟩+O(n)

Similarly, using the RS form of , we may write

 ⟨(ξ⋅^u)22ηγ/a+b⟩ = 1npp∑μ=1(p∑ν=1n∑ρ=1(A12β0)νvμν1√n)2d202ηγ/aμ+D+(n−1)d (3.7) = 1pp∑μ=1(β0⋅vμ)2d20aμ2ηγ/aμ+D−d+O(n) = d20 ⟨a2(β0⋅v)22ηγ+(D−d)a⟩+O(n)

The averages in (3.6,3.7

) are now over the joint distribution of eigenvalues and eigenvectors of

only. Inserting the above RS expressions into (2.18), and using , then gives us, with the short-hand ,

 1nΨRS(…) = −12ζ(2d0c0+DC−dc)+12(1−ζ)log(2π)−ηζγS2+O(n) (3.8) +12[log(C−c)+c−c20/C00C−c]−12ζd20 ⟨a2(β0⋅v)22ηγ+(D−d)a⟩ +12ζ⟨log(2ηγa+D−d)⟩+12ζ⟨da2ηγ+(D−d)a⟩+1nlog~S −1nlog∫Dz∫\rmdy0√2π\rme−12B00y20∫\rmdt p(t|y0,λ0) ×[∫\rmdy \rme−12(B−b)y2+y(\rmi√b−b0y0)pγ(t|y,λ)pγ(t|y0,λ0)]n = −12ζ(2d0c0+DC−dc)+12(1−ζ)log(2π)−ηζγS2+O(n) +12[log(C−c)+c−c20/~S2C−c]−12ζd20 ⟨a2(β0⋅v)22ηγ+(D−d)a⟩ +12ζ⟨log(2ηγa+D−d)⟩+12ζ⟨da2ηγ+(D−d)a⟩+12nlog(~S2B00) −1nlog∫DzDy0∫\rmdt p(t|y0/√B00,λ0) ×[∫\rmdy \rme−12(B−b)y2+y(\rmi√b−b0y0/√B00)pγ(t|y,λ)pγ(t|y0/√B00,λ0)]n

We note that

 B−100=~S2−nc20/(C−c)+O(n2),      B−b=1/(C−c) (3.9) b0=−c0/~S2(C−c)+O(n),              b=c20−c~S2~S2(C−c)2+O(n) (3.10)

and these identities enable us, after some simple rearrangements, to write the limit in the much simpler form

 ΨRS(…) = −12ζ{2d0c0+DC−dc+log(2π)+2ηγS2 (3.11) +d20 ⟨a2(β0⋅v)22ηγ+(D−d)a⟩−⟨log(2ηγa+D−d)⟩−⟨da2ηγ+(D−d)a⟩} −∫DzDy0∫\rmdt p(t|~Sy0,λ0)log∫Dy pγ