1 Introduction
Inference of parameters for generalized linear models using the maximum likelihood (ML) protocol becomes increasingly biased due to overfitting as the ratio $\zeta=p/N$ increases, where $p$ is the number of covariates and $N$ the number of training data. Overfitting occurs when model parameters seek to explain not only the 'signal' but also the 'noise' in training data, and is characterized by a difference in outcome prediction accuracy between training and validation samples. See e.g. [3, 4, 5] for examples from logistic regression and gamma distributions, and [1] for Cox models. Hence, standard statistical significance tests for regression coefficients, being usually based on asymptotic results derived for fixed $p$, become increasingly inaccurate [6]. Unfortunately, in post-genome medicine, having large ratios $\zeta$ is the rule rather than the exception. This prompted epidemiologists to formulate heuristic rules for avoiding overfitting, such as limits on the number of events per variable [7, 8, 9, 10, 11]. Approximate recipes for correcting ML estimates were developed in e.g. [12, 13]. Alternative methods of addressing the overfitting problem include feature selection and regularization. In feature selection one seeks to identify a subset of covariates that are informative of outcomes [14, 15, 16]. Its advantages include a reduction in the required computational resources and increased interpretability. In regularization one adds a penalty term to the objective function of ML inference (which can alternatively be derived from a prior in Bayesian inference) to suppress the number or magnitude of the model parameters [17, 18]. The application of regularization to survival analysis with high-dimensional covariates is studied widely; see e.g. [19, 20] and references therein.
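The training-validation gap that characterizes overfitting is easy to exhibit numerically. The sketch below is purely illustrative (our own choice of sizes, seed and optimizer, using standard numpy/scipy; it is not code from the cited references): it fits a logistic model by ML and reports the per-sample log-likelihood difference between training and validation data, which grows with $\zeta=p/N$.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def loglik(beta, Z, y):
    # mean log-likelihood of a logistic model with outcomes y in {-1, 1};
    # log 2cosh(x) is computed as logaddexp(x, -x) for numerical stability
    xi = Z @ beta
    return np.mean(y * xi - np.logaddexp(xi, -xi))

def train_val_gap(p, N):
    # fit by ML on N training samples, evaluate on N fresh validation samples
    beta_star = rng.normal(size=p) / np.sqrt(p)
    def sample(n):
        Z = rng.normal(size=(n, p))
        y = np.where(rng.random(n) < 1 / (1 + np.exp(-2 * Z @ beta_star)), 1, -1)
        return Z, y
    Zt, yt = sample(N)
    Zv, yv = sample(N)
    b_ml = minimize(lambda b: -loglik(b, Zt, yt), np.zeros(p), method="BFGS").x
    return loglik(b_ml, Zt, yt) - loglik(b_ml, Zv, yv)

for zeta in [0.025, 0.1, 0.25]:
    print(zeta, train_val_gap(int(zeta * 400), 400))  # gap grows with zeta = p/N
```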
[Figure 1: Evolution of the overfitting measure $E$ during numerical minimization, for synthetic data generated from a logistic regression model at different values of $\zeta=p/N$. The initial point in the minimization search is the zero vector, giving a positive value of $E$ (implying an underfitted model).]

[Figure 2: Inferred regression coefficients plotted versus the true coefficients, using ML (no regularization) or Maximum A Posteriori Probability regression (MAP, with $L_2$ regularization). Data points on the diagonal imply perfect inference.]
A recent study [1] provided a new approach to overfitting in survival analysis. It showed how the replica method from statistical physics can be used to model ML inference analytically as the zero noise limit of a suitably defined stochastic minimization, starting from an information-theoretic measure of overfitting. The theory predicted the quantitative relation between ML-inferred and true parameters in the Cox model [1], and a phase transition at $\zeta=1$. Let us denote the set of model parameters as $\boldsymbol\theta$, and the data as $D=\{X_1,\dots,X_N\}$.
The observation that ML inference is equivalent to minimization of the Kullback-Leibler divergence between the empirical data distribution $\hat p$ and the parametrized distribution $p(X|\boldsymbol\theta)$ assumed as a model of the data suggests [1] using

E(\boldsymbol\theta|D) = \frac{1}{N}\sum_{i=1}^N \log\Big[\frac{p(X_i|\boldsymbol\theta^\star)}{p(X_i|\boldsymbol\theta)}\Big]

as a measure of overfitting\footnote{A similar idea for comparing estimators of probability distributions was used in [22], using the Lévy distance rather than the KL divergence. Other measures of overfitting can be found in e.g. [20, 23].}, in which $\boldsymbol\theta^\star$ are the true (but a priori unknown) parameter values. Perfect regression implies $E=0$, underfitting implies $E>0$, and overfitting implies $E<0$. To gain more intuition for this measure, we generate synthetic data from a simple logistic regression model, find the ML estimators of its parameters, and calculate $E$. Here the parameters are $\boldsymbol\theta=\boldsymbol\beta\in\mathbb{R}^p$, the data are $D=\{(y_1,\mathbf{z}_1),\dots,(y_N,\mathbf{z}_N)\}$, with $y_i\in\{-1,1\}$ and $\mathbf{z}_i\in\mathbb{R}^p$, and we use the short-hand $\langle f\rangle_D=N^{-1}\sum_{i=1}^N f(y_i,\mathbf{z}_i)$ (with the convention $\boldsymbol\beta\cdot\mathbf{z}=\sum_{\mu=1}^p\beta_\mu z_\mu$). The outcome likelihood and the measure $E$ are given by

p(y|\mathbf{z},\boldsymbol\beta) = \frac{e^{y\,\boldsymbol\beta\cdot\mathbf{z}}}{2\cosh(\boldsymbol\beta\cdot\mathbf{z})}   (1.1)

E(\boldsymbol\beta|D) = \big\langle y\,(\boldsymbol\beta^\star\!-\boldsymbol\beta)\cdot\mathbf{z}\big\rangle_D + \Big\langle\log\frac{\cosh(\boldsymbol\beta\cdot\mathbf{z})}{\cosh(\boldsymbol\beta^\star\!\cdot\mathbf{z})}\Big\rangle_D   (1.2)
Results are shown in Figure 1. When $\zeta=p/N$ is small, $E$ converges towards zero during the minimization, indicating perfect parameter recovery. As the number of samples in the data set is reduced, giving increasingly large values of $\zeta$, $E$ converges to increasingly negative values. Since there is no model mismatch (the data were generated from a logistic model), the negative values of $E$ indicate overfitting.
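The experiment of Figure 1 can be reproduced along the following lines (a minimal sketch with illustrative sizes and seed, not the original code behind the figure): it generates data from (1.1), minimizes the negative log-likelihood starting from the zero vector, and evaluates (1.2).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def log_lik(beta, y, Z):
    # sum_i log p(y_i|z_i, beta), with log 2cosh(x) = logaddexp(x, -x) for stability
    xi = Z @ beta
    return np.sum(y * xi - np.logaddexp(xi, -xi))

def E_measure(beta, beta_star, y, Z):
    # eq (1.2): E(beta|D) = (1/N) sum_i log[ p(y_i|z_i, beta*) / p(y_i|z_i, beta) ]
    return (log_lik(beta_star, y, Z) - log_lik(beta, y, Z)) / len(y)

p, N = 20, 400                                   # zeta = p/N = 0.05
beta_star = rng.normal(size=p) / np.sqrt(p)
Z = rng.normal(size=(N, p))
y = np.where(rng.random(N) < 1 / (1 + np.exp(-2 * Z @ beta_star)), 1, -1)

beta_ml = minimize(lambda b: -log_lik(b, y, Z), np.zeros(p), method="BFGS").x
print(E_measure(beta_ml, beta_star, y, Z))       # near 0 here; more negative for larger zeta
```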
Switching from maximum likelihood to maximum a posteriori (MAP) estimators implies adding a penalty term to the likelihood: $p(\boldsymbol\theta|D)\propto p(D|\boldsymbol\theta)\,p(\boldsymbol\theta)$, where $p(\boldsymbol\theta)$ represents a parameter prior, giving

E(\boldsymbol\theta|D) = \frac{1}{N}\sum_{i=1}^N \log\Big[\frac{p(X_i|\boldsymbol\theta^\star)}{p(X_i|\boldsymbol\theta)}\Big] - \frac{1}{N}\log p(\boldsymbol\theta)   (1.3)

MAP regression is equivalent to minimizing the quantity (1.3). This minimization should in principle be over all $\boldsymbol\theta$, but may in practice be constrained to simplify the calculation (see e.g. [24, 25, 26]). For generalized linear models, commonly used priors are $p(\boldsymbol\beta)\propto e^{-\eta\sum_\mu|\beta_\mu|}$ (giving $L_1$ regularization\footnote{This choice promotes sparsity in the regression coefficient vector $\boldsymbol\beta$, which would result in a horizontal line segment passing through the origin in Fig. 2. Since our theory aims to predict the slope of the data clouds in Fig. 2, we will not pursue $L_1$ regularizers in this paper.}, or 'LASSO' regression [18]) and $p(\boldsymbol\beta)\propto e^{-\frac12\eta\sum_\mu\beta_\mu^2}$ (giving $L_2$ regularization, or 'ridge' regression).
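In a sketch like the one above, switching from ML to MAP with a ridge prior only changes the objective function; the value $\eta=5$ below is an arbitrary illustration, not a recommended setting.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
p, N, eta = 40, 80, 5.0                          # zeta = 0.5: strongly overfitting for ML
beta_star = rng.normal(size=p) / np.sqrt(p)
Z = rng.normal(size=(N, p))
y = np.where(rng.random(N) < 1 / (1 + np.exp(-2 * Z @ beta_star)), 1, -1)

def neg_log_posterior(beta):
    # minus log-likelihood plus the L2 penalty (eta/2)|beta|^2, i.e. eq (1.3)
    # up to beta-independent terms, for the prior p(beta) ~ exp(-(eta/2) beta.beta)
    xi = Z @ beta
    return -np.sum(y * xi - np.logaddexp(xi, -xi)) + 0.5 * eta * beta @ beta

beta_map = minimize(neg_log_posterior, np.zeros(p), method="BFGS").x
```

Plotting `beta_map` against `beta_star` for a range of $\eta$ reproduces the qualitative behaviour of Fig. 2: increasing $\eta$ shrinks the data cloud back towards (and eventually past) the diagonal.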
In the present paper we generalize the replica analysis of [1] from ML to MAP inference, upon adding an $L_2$ regularization term to the log-likelihood function. This term suppresses overfitting effects, and removes the ML phase transition of the Cox model [1] at $\zeta=1$; see e.g. Fig. 2. In the presence of an $L_2$ regularizer, correlations between covariates can no longer be transformed away, as was done in [1], leading to the appearance in the theory of the population covariance matrix $A$ of the covariates. Under mild restrictions on the eigenvalue spectrum of this matrix, we show how an accurate theory of overfitting for the regularized Cox proportional hazards model can be developed, in spite of such additional mathematical complications, including for the previously inaccessible regime $\zeta>1$. We find, as in [1], that the replica symmetric version of the theory is sufficient to explain accurately the behaviour of interest. The resulting equations can also be used to predict the amount of regularization needed for unbiased regression, expressed in terms of the spectrum of $A$ and the ratio $\zeta$.

2 Replica analysis of regularized Cox regression
2.1 Generalized replica formalism to include priors
Following [1], we interpret minimization of (1.3) as computing the ground state energy of a statistical mechanical system with degrees of freedom $\boldsymbol\theta$ and Hamiltonian $\mathcal{H}(\boldsymbol\theta)$, at inverse temperature $\gamma$, where $Z_\gamma=\int\!d\boldsymbol\theta\;e^{-\gamma\mathcal{H}(\boldsymbol\theta)}$ and

\mathcal{H}(\boldsymbol\theta) = \sum_{i=1}^N\log\Big[\frac{p(X_i|\boldsymbol\theta^\star)}{p(X_i|\boldsymbol\theta)}\Big] - \log p(\boldsymbol\theta)   (2.1)
We define the associated free energy, which we average over the disorder (the microscopic realization of the data $D$), and can compute the disorder-averaged ground state energy density $\bar E$ as the $\gamma\to\infty$ limit of the disorder-averaged free energy density $\bar f(\gamma)$, where

\bar f(\gamma) = -\lim_{N\to\infty}\frac{1}{\gamma N}\,\overline{\log Z_\gamma}   (2.2)
The replica identity $\overline{\log Z}=\lim_{n\to0}n^{-1}\log\overline{Z^n}$ is subsequently used to simplify the average of the logarithm (see [1] for details and references), giving in the present case

\bar E = -\lim_{\gamma\to\infty}\lim_{N\to\infty}\lim_{n\to0}\frac{1}{\gamma nN}\log\int\Big[\prod_{\alpha=1}^n d\boldsymbol\theta^\alpha\,p^\gamma(\boldsymbol\theta^\alpha)\Big]\Big\{\Big\langle\prod_{\alpha=1}^n\Big[\frac{p(X|\boldsymbol\theta^\alpha)}{p(X|\boldsymbol\theta^\star)}\Big]^\gamma\Big\rangle_{\!X}\Big\}^N   (2.3)

in which $\langle\cdot\rangle_X$ denotes the average over a single data point $X$, generated from the true distribution $p(X|\boldsymbol\theta^\star)$.
Equation (2.3) is applicable to any parametric model $p(X|\boldsymbol\theta)$ and any prior $p(\boldsymbol\theta)$. See also [27] for alternative results on the use of the replica method in statistical inference. We will now make a specific choice for $p(X|\boldsymbol\theta)$, and use (2.3) to develop a theory for regression and overfitting in regularized Cox models with Gaussian priors.

2.2 Application to the regularized Cox proportional hazards model
Cox's proportional hazards model, originally described in [2], assumes a parametrization of the form

p(t|\mathbf{z},\boldsymbol\beta,\lambda) = \lambda(t)\,e^{\boldsymbol\beta\cdot\mathbf{z}}\,\exp\Big(-e^{\boldsymbol\beta\cdot\mathbf{z}}\int_0^t\!ds\,\lambda(s)\Big)   (2.4)

Its parameters are the coefficients $\boldsymbol\beta=(\beta_1,\dots,\beta_p)\in\mathbb{R}^p$, and a base hazard rate $\lambda(t)$ (a nonnegative function defined for $t\ge0$). Substituting (2.4) translates (2.3) into
\bar E = -\lim_{\gamma\to\infty}\lim_{N\to\infty}\lim_{n\to0}\frac{1}{\gamma nN}\log\int\Big[\prod_{\alpha=1}^n d\boldsymbol\beta^\alpha\{d\lambda_\alpha\}\,p^\gamma(\boldsymbol\beta^\alpha)\Big]\Big\{\int\!d\mathbf{z}\,P(\mathbf{z})\int_0^\infty\!\!dt\;p(t|\mathbf{z},\boldsymbol\beta^\star,\lambda_\star)\prod_{\alpha=1}^n\Big[\frac{\lambda_\alpha(t)\,e^{\boldsymbol\beta^\alpha\cdot\mathbf{z}-e^{\boldsymbol\beta^\alpha\cdot\mathbf{z}}\Lambda_\alpha(t)}}{\lambda_\star(t)\,e^{\boldsymbol\beta^\star\cdot\mathbf{z}-e^{\boldsymbol\beta^\star\cdot\mathbf{z}}\Lambda_\star(t)}}\Big]^\gamma\Big\}^N   (2.5)

Functional integrals are written as $\{d\lambda\}$, cumulative hazards as $\Lambda(t)=\int_0^t ds\,\lambda(s)$, and the true parameters responsible for the data are written as $(\boldsymbol\beta^\star,\lambda_\star)$. We follow the standard convention for regularized Cox models of only including a prior for the association parameters $\boldsymbol\beta$ (equivalently, assuming an improper, or 'flat', prior for the base hazard rate). Our prior is $p(\boldsymbol\beta)\propto e^{-\frac12\eta\,\boldsymbol\beta\cdot\boldsymbol\beta}$. To proceed with the analytical treatment, we assume that the covariate vectors $\mathbf{z}_i$ are drawn independently from a population distribution $P(\mathbf{z})$ with zero mean and covariance matrix $A$. The introduction of $L_2$ regularization means that the regression equations for correlated covariates can no longer be transformed to those corresponding to uncorrelated ones. This leads to a more complex theory than that of [1], and ultimately to conditions on the eigenvalue spectrum of $A$.
Our analysis is carried out in the regime where both $p,N\to\infty$ but with fixed ratio $\zeta=p/N$. To retain non-zero event times, even for $p\to\infty$, we must rescale the regression coefficients according to $\beta_\mu=O(p^{-1/2})$, resulting in risk contributions $\boldsymbol\beta\cdot\mathbf{z}_i=O(1)$. Without this rescaling we would have event time distributions with all weight concentrated on $t=0$ and $t=\infty$. We also replace $\eta$ by $\eta N$, to allow for more compact notation. Following [1] we next introduce

\xi_i^\alpha = \boldsymbol\beta^\alpha\cdot\mathbf{z}_i \qquad (i=1,\dots,N;\ \alpha=0,\dots,n)   (2.6)
where $\boldsymbol\beta^0\equiv\boldsymbol\beta^\star$. The magnitude of $e^{\xi_i^\alpha}$ represents the relative risk of failure of individual $i$, compared to that of an 'average' individual (with $\mathbf{z}=\mathbf{0}$). Therefore $\boldsymbol\xi_i=(\xi_i^0,\dots,\xi_i^n)$ can be considered a vector of risk scores. Our energy density then becomes
\bar E = -\lim_{\gamma\to\infty}\lim_{N\to\infty}\lim_{n\to0}\frac{1}{\gamma nN}\log\int\Big[\prod_{\alpha=1}^n d\boldsymbol\beta^\alpha\{d\lambda_\alpha\}\,e^{-\frac12\gamma\eta N\boldsymbol\beta^\alpha\cdot\boldsymbol\beta^\alpha}\Big]\Big\{\int\!d\boldsymbol\xi\,P(\boldsymbol\xi)\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{\xi^0-e^{\xi^0}\Lambda_\star(t)}\prod_{\alpha=1}^n\Big[\frac{\lambda_\alpha(t)\,e^{\xi^\alpha-e^{\xi^\alpha}\Lambda_\alpha(t)}}{\lambda_\star(t)\,e^{\xi^0-e^{\xi^0}\Lambda_\star(t)}}\Big]^\gamma\Big\}^N   (2.7)

in which now

P(\boldsymbol\xi) = \int\!d\mathbf{z}\,P(\mathbf{z})\prod_{\alpha=0}^n\delta\big[\xi^\alpha-\boldsymbol\beta^\alpha\cdot\mathbf{z}\big]   (2.8)
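Synthetic data of this type are easily generated. The sketch below assumes, purely for illustration, a constant base hazard $\lambda_\star(t)=\lambda_0$, so that $\Lambda_\star(t)=\lambda_0 t$ and the event times are exponential with rate $\lambda_0 e^{\xi_i^0}$; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
p, N, lam0 = 50, 200, 1.0
A = np.eye(p)                                   # population covariance of the covariates
beta_star = rng.normal(size=p) / np.sqrt(p)     # O(p^{-1/2}) components, cf. the rescaling above
Z = rng.multivariate_normal(np.zeros(p), A, size=N)
xi = Z @ beta_star                              # risk scores, eq (2.6), replica alpha = 0
t = rng.exponential(scale=1.0 / (lam0 * np.exp(xi)))  # Cox event times from eq (2.4)
```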
To proceed we assume that $P(\boldsymbol\xi)$ is Gaussian. This holds for any $p$ and $N$ as soon as $P(\mathbf{z})$ is Gaussian, and for non-Gaussian covariate statistics it will generally hold due to the Central Limit Theorem if the correlations among the covariates are weak, and $p$ and $N$ are large. Since we assumed $\langle\mathbf{z}\rangle=\mathbf{0}$, the risk score distribution is now given by

P(\boldsymbol\xi) = \frac{e^{-\frac12\boldsymbol\xi\cdot C^{-1}\boldsymbol\xi}}{\sqrt{(2\pi)^{n+1}\det C}}   (2.9)
It is determined in full by the $(n{+}1)\times(n{+}1)$ covariance matrix $C$, with entries

C_{\alpha\rho} = \langle\xi^\alpha\xi^\rho\rangle = \boldsymbol\beta^\alpha\!\cdot\!A\boldsymbol\beta^\rho \qquad (\alpha,\rho=0,\dots,n)   (2.10)
The entries of $A$ are given by $A_{\mu\nu}=\langle z_\mu z_\nu\rangle$. The $C_{\alpha\rho}$ measure the similarity between the $p$-dimensional vectors formed by the regression parameters in different replicas. For each replica pair $(\alpha,\rho)$ we use the integral representation of the Dirac delta function, and rescale the conjugate integration parameter by a factor $\gamma N$,

\delta\big[C_{\alpha\rho}-\boldsymbol\beta^\alpha\!\cdot\!A\boldsymbol\beta^\rho\big] = \int_{-\infty}^{\infty}\frac{\gamma N\,d\hat C_{\alpha\rho}}{2\pi}\;e^{i\gamma N\hat C_{\alpha\rho}(C_{\alpha\rho}-\boldsymbol\beta^\alpha\cdot A\boldsymbol\beta^\rho)}   (2.11)
in order to simplify expression (2.7) to

\bar E = -\lim_{\gamma\to\infty}\lim_{N\to\infty}\lim_{n\to0}\frac{1}{\gamma nN}\log\int\Big[\prod_{0\le\alpha\le\rho\le n}\frac{\gamma N\,dC_{\alpha\rho}\,d\hat C_{\alpha\rho}}{2\pi}\Big]\,e^{i\gamma N\sum_{\alpha\le\rho}\hat C_{\alpha\rho}C_{\alpha\rho}}\Big[\int\prod_{\alpha=1}^n d\boldsymbol\beta^\alpha\;e^{-\frac12\gamma\eta N\sum_\alpha\boldsymbol\beta^\alpha\cdot\boldsymbol\beta^\alpha-i\gamma N\sum_{\alpha\le\rho}\hat C_{\alpha\rho}\boldsymbol\beta^\alpha\cdot A\boldsymbol\beta^\rho}\Big]\Big[\prod_{\alpha=1}^n\{d\lambda_\alpha\}\Big]\Big\{\cdots\Big\}^N   (2.12)

where the factor $\{\cdots\}^N$ is the one appearing in (2.7), with $P(\boldsymbol\xi)$ now given by (2.9); it depends on the regression coefficients only through the order parameters $C_{\alpha\rho}$.
The quadratic nature of the exponent in the $\boldsymbol\beta$ integral, a consequence of having chosen $L_2$ regularization, allows for a closed form solution. Changing the penalty term to $\eta\sum_\mu|\beta_\mu|$ or to $\eta\sum_\mu|\beta_\mu|^q$ with $q\neq2$ would significantly complicate the integrals.
2.3 Conversion into a saddle point problem
With a modest amount of foresight we transform $\hat C_{\alpha\rho}\to-\frac{i}{2}(1+\delta_{\alpha\rho})\hat C_{\alpha\rho}$, and introduce the short-hand $\mathbf{w}=A\boldsymbol\beta^\star$. To evaluate the Gaussian integral in (2.12) we define the $np\times np$ matrix $M$ and the $np$-dimensional vector $\mathbf{b}$, with entries

M_{\alpha\mu,\rho\nu} = \eta\,\delta_{\alpha\rho}\delta_{\mu\nu} + \hat C_{\alpha\rho}A_{\mu\nu},\qquad b_{\alpha\mu} = -\hat C_{0\alpha}w_\mu \qquad (\alpha,\rho=1,\dots,n)   (2.13)

With these definitions we may write the Gaussian integral in (2.12) as

\int\prod_{\alpha=1}^n d\boldsymbol\beta^\alpha\;e^{-\frac12\gamma N\,\boldsymbol\beta\cdot M\boldsymbol\beta+\gamma N\,\mathbf{b}\cdot\boldsymbol\beta} = \Big(\frac{2\pi}{\gamma N}\Big)^{\!np/2}\,e^{-\frac12{\rm Tr}\log M+\frac12\gamma N\,\mathbf{b}\cdot M^{-1}\mathbf{b}}   (2.14)

in which $\boldsymbol\beta=(\boldsymbol\beta^1,\dots,\boldsymbol\beta^n)$.
Let $\{a_\mu\}$ and $\{d_s\}$ denote the eigenvalues of $A$ and of the $n\times n$ replica matrix $\hat C=(\hat C_{\alpha\rho})_{\alpha,\rho\ge1}$, respectively. The two terms $\eta\,I_n\otimes I_p$ and $\hat C\otimes A$ of the matrix $M$, with components $\eta\,\delta_{\alpha\rho}\delta_{\mu\nu}$ and $\hat C_{\alpha\rho}A_{\mu\nu}$, clearly commute. The complete set of eigenvectors of $M$ can therefore be written as $\mathbf{x}^s\otimes\mathbf{y}^\mu$, with components $x^s_\alpha y^\mu_\nu$, and where $\hat C\mathbf{x}^s=d_s\mathbf{x}^s$ and $A\mathbf{y}^\mu=a_\mu\mathbf{y}^\mu$, and where both are normalised according to $|\mathbf{x}^s|=|\mathbf{y}^\mu|=1$. The eigenvalues of $M$ are then $\eta+d_s a_\mu$, and

{\rm Tr}\log M = \sum_{s=1}^n\sum_{\mu=1}^p\log(\eta+d_s a_\mu),\qquad \mathbf{b}\cdot M^{-1}\mathbf{b} = \sum_{s=1}^n\sum_{\mu=1}^p\frac{(\mathbf{v}\cdot\mathbf{x}^s)^2(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+d_s a_\mu}   (2.15)

with $\mathbf{v}=(\hat C_{01},\dots,\hat C_{0n})$, so that $\mathbf{b}=-\mathbf{v}\otimes\mathbf{w}$.
Hence the integral (2.14) can be written as

\Big(\frac{2\pi}{\gamma N}\Big)^{\!np/2}\exp\Big(-\frac{p}{2}\sum_{s=1}^n\Big[\big\langle\log(\eta+d_s a)\big\rangle_{\!A}-\gamma N(\mathbf{v}\cdot\mathbf{x}^s)^2\Big\langle\frac{(\mathbf{w}\cdot\mathbf{y})^2}{\eta+d_s a}\Big\rangle_{\!A}\Big]\Big)   (2.16)
where the averages in the exponents are over the eigenvalues and orthonormal eigenvectors of $A$, i.e. $\langle g(a,\mathbf{y})\rangle_A=p^{-1}\sum_{\mu=1}^p g(a_\mu,\mathbf{y}^\mu)$. Since $N=p/\zeta$ with $\zeta$ fixed, the integrals over $\{C_{\alpha\rho},\hat C_{\alpha\rho}\}$ and the base hazard rates in (2.12) can for $N\to\infty$ be evaluated by steepest descent, provided the limits $N\to\infty$ and $n\to0$ commute. Expression (2.16) then enables us to write the result as

\bar E = -\lim_{\gamma\to\infty}\lim_{n\to0}\frac{1}{\gamma n}\,{\rm extr}\;\Psi\big[\{C_{\alpha\rho},\hat C_{\alpha\rho}\},\{\lambda_\alpha\}\big]   (2.17)

in which, up to additive constants that do not depend on the order parameters,

\Psi = \frac{\gamma}{2}\Big[\sum_{\alpha,\rho=0}^n\hat C_{\alpha\rho}C_{\alpha\rho}-\hat C_{00}S^2\Big] - \frac{\zeta}{2}\sum_{s=1}^n\big\langle\log(\eta+d_s a)\big\rangle_A + \frac{\gamma}{2}\sum_{s=1}^n\sum_{\mu=1}^p\frac{(\mathbf{v}\cdot\mathbf{x}^s)^2(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+d_s a_\mu} + \log\int\!d\boldsymbol\xi\,P(\boldsymbol\xi)\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{\xi^0-e^{\xi^0}\Lambda_\star(t)}\prod_{\alpha=1}^n\Big[\frac{\lambda_\alpha(t)\,e^{\xi^\alpha-e^{\xi^\alpha}\Lambda_\alpha(t)}}{\lambda_\star(t)\,e^{\xi^0-e^{\xi^0}\Lambda_\star(t)}}\Big]^\gamma   (2.18)

where we have defined $S^2=\boldsymbol\beta^\star\!\cdot\!A\boldsymbol\beta^\star$. Differentiating with respect to $\hat C_{00}$ removes $\hat C_{00}$ from the problem, and gives $C_{00}=S^2$.
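The eigenvalue structure of $M$ used in (2.15), (2.16) can be checked numerically; the sketch below uses random stand-in matrices for $\hat C$ and $A$ (not values from any actual saddle point).

```python
import numpy as np

rng = np.random.default_rng(2)
n_rep, p, eta = 3, 4, 0.5
C_hat = rng.normal(size=(n_rep, n_rep)); C_hat = C_hat @ C_hat.T  # stand-in conjugate matrix
A = rng.normal(size=(p, p)); A = A @ A.T                          # stand-in covariance matrix
M = eta * np.eye(n_rep * p) + np.kron(C_hat, A)                   # M = eta*I + C_hat (x) A, eq (2.13)

d = np.linalg.eigvalsh(C_hat)                                     # eigenvalues d_s
a = np.linalg.eigvalsh(A)                                         # eigenvalues a_mu
predicted = np.sort((eta + np.outer(d, a)).ravel())               # eta + d_s * a_mu
assert np.allclose(np.sort(np.linalg.eigvalsh(M)), predicted)
```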
3 Replica symmetric theory
3.1 Replica symmetric saddle points
To proceed, we make the replica symmetric (RS) ansatz, which implies assuming ergodicity of the stochastic regression process, and translates into invariance of all order parameters under all permutations of the replicas $\{1,\dots,n\}$. Now, for all $\alpha\neq\rho\in\{1,\dots,n\}$:

C_{\alpha\alpha}=C_0,\quad C_{\alpha\rho}=C,\quad C_{0\alpha}=c,\qquad \hat C_{\alpha\alpha}=\hat C_0,\quad \hat C_{\alpha\rho}=\hat C,\quad \hat C_{0\alpha}=\hat c,\qquad \lambda_\alpha(t)=\lambda(t)   (3.1)
Both $C$ and its $2\times2$ sub-matrices on the index pairs $\{0,\alpha\}$ are positive definite, so $C_0-C>0$ and $S^2C_0-c^2>0$. We may now write, using $C_{00}=S^2$,

C = \begin{pmatrix} S^2 & c\,\mathbf{1}_n^{\rm T}\\[1mm] c\,\mathbf{1}_n & (C_0-C)\,I_n+C\,\mathbf{1}_n\mathbf{1}_n^{\rm T}\end{pmatrix}   (3.2)

The eigenvalues and eigenvectors of matrices of the form (3.2) are found in [1]. $C$ has two nondegenerate eigenvalues, with eigenvectors of the form $(u,v\mathbf{1}_n)$, and a further $(n{-}1)$-fold degenerate eigenvalue $C_0-C$. Hence the two nondegenerate eigenvalues are the roots $\kappa$ of $(S^2-\kappa)(C_0-C+nC-\kappa)=nc^2$.
The entries of $C^{-1}$ are found to be, with the short-hand $\Delta_n=S^2(C_0-C+nC)-nc^2$,

(C^{-1})_{00} = \frac{C_0-C+nC}{\Delta_n},\qquad (C^{-1})_{0\alpha}=(C^{-1})_{\alpha0} = -\frac{c}{\Delta_n}   (3.3)

(C^{-1})_{\alpha\rho} = \frac{\delta_{\alpha\rho}}{C_0-C} - \frac{C}{(C_0-C)(C_0-C+nC)} + \frac{c^2}{\Delta_n(C_0-C+nC)}   (3.4)
Hence

\log\det C = (n-1)\log(C_0-C) + \log\big[S^2(C_0-C+nC)-nc^2\big]   (3.5)
Next we turn to terms in (2.18) that involve the spectrum of the $n\times n$ replica matrix $\hat C$. This matrix has one eigenvalue $\hat C_0+(n-1)\hat C$ with eigenvector $\mathbf{1}_n$, and the $(n{-}1)$-fold degenerate eigenvalue $\hat C_0-\hat C$ with eigenspace $\{\mathbf{x}\in\mathbb{R}^n:\,\mathbf{x}\cdot\mathbf{1}_n=0\}$. Hence

\sum_{s=1}^n\big\langle\log(\eta+d_s a)\big\rangle_A = (n-1)\big\langle\log\big(\eta+(\hat C_0-\hat C)a\big)\big\rangle_A + \big\langle\log\big(\eta+(\hat C_0+(n-1)\hat C)a\big)\big\rangle_A   (3.6)
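This spectrum is easily confirmed numerically (a minimal check, with arbitrary illustrative values for $\hat C_0$ and $\hat C$):

```python
import numpy as np

n, C0_hat, C_hat = 5, 2.0, 0.3
D = (C0_hat - C_hat) * np.eye(n) + C_hat * np.ones((n, n))   # RS form of the replica matrix
evals = np.sort(np.linalg.eigvalsh(D))
# one eigenvalue C0_hat + (n-1)*C_hat (eigenvector 1_n), and an (n-1)-fold
# degenerate eigenvalue C0_hat - C_hat (eigenspace orthogonal to 1_n)
assert np.allclose(evals, [C0_hat - C_hat] * (n - 1) + [C0_hat + (n - 1) * C_hat])
```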
Similarly, using the RS form $\mathbf{v}=\hat c\,\mathbf{1}_n$, to which only the eigenvector $\mathbf{1}_n/\sqrt{n}$ of $\hat C$ contributes, we may write

\sum_{s=1}^n\sum_{\mu=1}^p\frac{(\mathbf{v}\cdot\mathbf{x}^s)^2(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+d_s a_\mu} = n\hat c^2\sum_{\mu=1}^p\frac{(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+(\hat C_0+(n-1)\hat C)a_\mu}   (3.7)

The averages in (3.6) and (3.7) are now over the joint distribution of eigenvalues and eigenvectors of $A$ only. Inserting the above RS expressions into (2.18), and using (3.3), (3.4) to decompose the risk score distribution (2.9), then gives us, with the short-hand $Dz=(2\pi)^{-\frac12}e^{-\frac12 z^2}dz$,

\Psi_{RS} = \frac{\gamma n}{2}\big[\hat C_0C_0+(n-1)\hat CC+2\hat c\,c\big] - \frac{\zeta}{2}\Big[(n-1)\big\langle\log\big(\eta+(\hat C_0-\hat C)a\big)\big\rangle_A + \big\langle\log\big(\eta+(\hat C_0+(n-1)\hat C)a\big)\big\rangle_A\Big] + \frac{\gamma n\hat c^2}{2}\sum_{\mu=1}^p\frac{(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+(\hat C_0+(n-1)\hat C)a_\mu} + \log\int\!Dx\,Dz\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\,\big[\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\big]^{-\gamma n}\Big[\int\!Du\;\big[\lambda(t)\,e^{\xi-e^{\xi}\Lambda(t)}\big]^\gamma\Big]^n   (3.8)

in which $\xi=\frac{c}{S}x+z\sqrt{C-\frac{c^2}{S^2}}+u\sqrt{C_0-C}$.
We note that

\lim_{n\to0}\frac{1}{n}\Big[(n-1)\big\langle\log\big(\eta+(\hat C_0-\hat C)a\big)\big\rangle_A+\big\langle\log\big(\eta+(\hat C_0+(n-1)\hat C)a\big)\big\rangle_A\Big] = \big\langle\log\big(\eta+(\hat C_0-\hat C)a\big)\big\rangle_A + \hat C\Big\langle\frac{a}{\eta+(\hat C_0-\hat C)a}\Big\rangle_A   (3.9)

\lim_{n\to0}\frac{1}{n}\log\int\!Dx\,Dz\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\,e^{n\,G(x,z,t)} = \int\!Dx\,Dz\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\,G(x,z,t)   (3.10)

the latter holding for any sufficiently well-behaved function $G$, and these identities enable us, after some simple rearrangements, to write the limit $\lim_{n\to0}\Psi_{RS}/n$ in the much simpler form

\lim_{n\to0}\frac{\Psi_{RS}}{n} = \frac{\gamma}{2}\big[\hat C_0C_0-\hat CC+2\hat c\,c\big] - \frac{\zeta}{2}\Big[\big\langle\log\big(\eta+(\hat C_0-\hat C)a\big)\big\rangle_A + \hat C\Big\langle\frac{a}{\eta+(\hat C_0-\hat C)a}\Big\rangle_A\Big] + \frac{\gamma\hat c^2}{2}\sum_{\mu=1}^p\frac{(\mathbf{w}\cdot\mathbf{y}^\mu)^2}{\eta+(\hat C_0-\hat C)a_\mu} + \int\!Dx\,Dz\int_0^\infty\!\!dt\;\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\Big[\log\int\!Du\,\big[\lambda(t)\,e^{\xi-e^{\xi}\Lambda(t)}\big]^\gamma - \gamma\log\big[\lambda_\star(t)\,e^{Sx-e^{Sx}\Lambda_\star(t)}\big]\Big]   (3.11)