Better Approximate Inference for Partial Likelihood Models with a Latent Structure

10/22/2019 ∙ by Amrith Setlur, et al.

Temporal Point Processes (TPP) with partial likelihoods involving a latent structure often entail an intractable marginalization, thus making inference hard. We propose a novel approach to Maximum Likelihood Estimation (MLE) involving approximate inference over the latent variables by minimizing a tight upper bound on the approximation gap. Given a discrete latent variable Z, the proposed approximation reduces inference complexity from O(|Z|^c) to O(|Z|). We use convex conjugates to determine this upper bound in closed form and show that adding it to the optimization objective yields improved performance for models assuming proportional hazards, as in survival analysis.







1 Introduction

Temporal Point Processes (TPPs) provide a formal framework for modelling the occurrence of discrete events in time (such as failures or financial transactions). Recent work on modelling TPPs with latent factors (Linderman and Adams, 2014; Snoek et al., 2013) has showcased their ability to capture correlations such as inhibitory relationships and a dichotomy between classes of neurons in neural spike recordings. Although there have been several advances in non-parametric Bayesian inference (Samo and Roberts, 2015), most models are parametric (Cox, 1955), with parameter estimation done by maximizing the likelihood of the observed point values. Survival analysis is the problem of estimating survival times for entities (like nodes in a machine), and it has largely relied on TPPs to estimate survival times in the presence of censored observations. Semi-parametric methods like Cox Proportional Hazards (CPH) (Cox, 1955) allow parametric estimation using a partial likelihood objective, without estimating the baseline hazard. We therefore propose an approximate inference strategy for latent variable models with a partial likelihood objective. We introduce an inference method for models where the normalization factor includes interactions over log-linear factors. Such models are common in TPPs assuming proportional hazards (Rosen and Tanner, 1999) and in latent Conditional Random Fields (CRFs), where the normalization involves a sum over finitely many potential functions (Sutton et al., 2012). Rosen and Tanner (1999) introduce an inference strategy for CPH that is similar to our proposed method, but they do not identify the cases where the approximation fails. Although our inference strategy is applicable to the full likelihood of a TPP, we focus on its impact in the case of partial likelihoods, since the objective there is closely related to the MLE objectives observed in latent CRFs, making our work applicable to a broader class of problems.

Inspired by Jebara and Choromanska (2012), we introduce a distribution-agnostic, closed-form tight upper bound on likelihood approximations for TPPs resembling those of Diggle (2005). The upper bound can be minimized via standard gradient-descent-based iterative methods (Ruder, 2016). Finally, we prove a tight upper bound on the Jensen inequality gap for strictly convex polynomial functions on a half-line.

2 The Inference Problem

Given a compact set S equipped with its Borel σ-algebra B(S), N is a TPP if N is a measurable transform from a probability space (Ω, F, P) to the space of counting measures on S. Given a series of events {(t_i, δ_i)}: for a given entity i, the event of interest occurred at time t_i (δ_i = 1) or the observation was censored at t_i (δ_i = 0). The risk set for i is given by R(t_i) = {j : t_j ≥ t_i}. Under a Poisson TPP with intensity λ(t), the likelihood of the event for entity i at t_i, given that the event has not occurred before t_i, is given by eq. 1, where the count N(S) follows a Poisson distribution with mean ∫_S λ(t) dt.
We modify the formulation by adding latent variables (see figure 1), so that the intensity function of the TPP is a function of the parameters, the input and a latent variable z. Rosen and Tanner (1999) and Diggle (2005) used partial likelihood models to efficiently compute MLE estimates of the parameters of an inhomogeneous Poisson process. Partial likelihood was first introduced by Cox with the aim of identifying variables that impact survival without worrying about the baseline hazard (Cox, 1955). For the same reasons, we choose to maximize the partial likelihood of an event conditioned on the risk set R(t_i). Thus the denominator in eq. 1 now involves a sum over a finite set of factors over the latent space (closely resembling latent CRFs (Quattoni et al., 2007) in likelihood estimation).
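To make the setup concrete, the following is a minimal numerical sketch of a CPH-style partial log-likelihood with a discrete latent state marginalized per entity. All names here (K states, prior pi, per-state parameters beta) are illustrative assumptions, not the paper's exact parameterization, and the independent per-entity marginalization shown is the naive computation, not the proposed approximation.

```python
import numpy as np

def latent_cph_partial_loglik(X, times, events, beta, pi):
    """Partial log-likelihood of a CPH model with a discrete latent state.

    Illustrative sketch (not the paper's exact model): each entity's
    hazard ratio is marginalized over K latent states z with prior pi,
    i.e. hazard_i = sum_z pi[z] * exp(x_i . beta[z]).
    X: (n, d) covariates; times: (n,) event/censoring times;
    events: (n,) 1 = event observed, 0 = censored;
    beta: (K, d) per-state log-linear parameters; pi: (K,) latent prior.
    """
    hazards = np.exp(X @ beta.T) @ pi          # (n,) marginalized hazard ratios
    loglik = 0.0
    for i in np.flatnonzero(events):           # censored entities only enter risk sets
        risk = times >= times[i]               # risk set R(t_i)
        loglik += np.log(hazards[i]) - np.log(hazards[risk].sum())
    return loglik

# Tiny synthetic example
rng = np.random.default_rng(0)
n, d, K = 20, 3, 2
X = rng.normal(size=(n, d))
times = rng.exponential(size=n)
events = (rng.uniform(size=n) < 0.7).astype(int)
beta = rng.normal(scale=0.1, size=(K, d))
pi = np.array([0.5, 0.5])
ll = latent_cph_partial_loglik(X, times, events, beta, pi)
```

Each event's term is log of a ratio whose numerator is one summand of the denominator, so every contribution is non-positive.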


3 Approximate Inference Solution

In this section we provide a computationally tractable approximation for the maximum-likelihood estimation of the semi-parametric latent variable model defined in section 2; in section 4 we show the conditions under which the approximation is tight. We define positive random variables (R.V.s) A and B, functions of the latent variable and the input, corresponding to the numerator and denominator of the partial-likelihood term. In the rest of the paper (unless stated otherwise), expectations are taken over the distribution of the latent variable. Using this re-formulation and a Taylor series expansion, we can re-write eq. 1 as eq. 3.

Lemma 1.

If we assume A and B to have moments up to a given order, then their ratio distribution A/B has moments up to a correspondingly limited order (Cedilnik et al., 2006).

Using Mellin transform theory for ratio distributions of positive independent random variables (R.V.s), we have E[A/B] = E[A]·E[1/B]. Based on lemma 1, we limit the expansion in eq. 3 to a finite number of terms. At this point, we are computing expectations over convex functions of A and B. Since f(E[X]) ≤ E[f(X)] for a convex f (Jensen's inequality), we can further approximate eq. 3 with eq. 4. Once again we use a Taylor series approximation to finally arrive at a tractable maximum-likelihood objective (eq. 5). For each data point, the inference complexity under the original objective is O(|Z|^c), whereas under the proposed marginalization the complexity reduces to O(|Z|).
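The two probabilistic facts used above can be checked numerically: for independent positive A and B, E[A/B] = E[A]·E[1/B], and by Jensen's inequality applied to the convex map y ↦ 1/y, E[1/B] ≥ 1/E[B]. A quick Monte-Carlo sanity check (the Gamma distributions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.gamma(3.0, 1.0, size=400_000)   # positive, independent of B
B = rng.gamma(6.0, 1.0, size=400_000)   # positive

lhs = np.mean(A / B)                    # E[A/B]
rhs = np.mean(A) * np.mean(1.0 / B)     # E[A] * E[1/B], by independence
mellin_err = abs(lhs - rhs)             # should vanish up to Monte-Carlo noise

# Jensen gap for the convex function 1/y: E[1/B] - 1/E[B] >= 0
jensen_gap = np.mean(1.0 / B) - 1.0 / np.mean(B)
```

For Gamma(6, 1), E[1/B] = 1/5 while 1/E[B] = 1/6, so the Jensen gap is strictly positive; it is exactly this kind of gap that section 4 bounds.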


The crux of the approximation lies in the Jensen inequality. We therefore devote the following section to identifying a tight, distribution-independent bound on the inequality gap: if this gap is small, then eq. 5 is a good approximation to eq. 1.

4 Bounding the Approximation [Analysis]

We identify the conditions under which the approximation in eq. 3 is feasible and provide a closed-form bound for it. To simplify the statements in the rest of the paper, we introduce some notation and assumptions here. We assume that the R.V. B has finite mean and is sub-Gaussian. This assumption is fairly common in latent variable models, where the true posterior over the latent variable is approximated by a Gaussian distribution. For a continuous function, with its argument lying in a closed, bounded set with high probability, one can bound the values the function attains. Thus for a sub-Gaussian R.V. we obtain reasonable probabilistic bounds on its range, which hold with the probability defined in theorem 1.

The following theorem is stated here without proof (the proof is given in Appendix A.2). It bounds the gap in the Jensen inequality for a series of strictly convex functions f_k, each defined on either the positive or the negative half-line.

Theorem 1.

Although the theorem is stated for the positive half-line, the statement for the negative half-line is similar. The bound holds with high probability and is expressed through the convex conjugate f*(y) = sup_x (xy − f(x)); for strictly convex functions the maximizing set in this supremum is a singleton.
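As a concrete instance of the conjugate computation, take f(x) = eˣ (strictly convex; an illustrative choice, not a function from the paper). Its conjugate is f*(y) = y·log y − y for y > 0, attained at the unique maximizer x* = log y, since the singleton supremum condition y = f′(x) pins x down. A grid-based numerical check:

```python
import numpy as np

def conjugate(f, y, xs):
    """Numerically evaluate f*(y) = sup_x (x*y - f(x)) over a grid xs."""
    vals = xs * y - f(xs)
    i = np.argmax(vals)
    return vals[i], xs[i]          # conjugate value and its maximizer

f = np.exp                          # strictly convex on the real line
y = 2.0
xs = np.linspace(-5.0, 5.0, 200001)
fstar, xstar = conjugate(f, y, xs)

# Closed form for f = exp: f*(y) = y*log(y) - y, maximizer x* = log(y)
fstar_exact = y * np.log(y) - y
xstar_exact = np.log(y)
```

Because f is strictly convex, the numerical maximizer converges to the single point where the supporting line of slope y touches f.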

5 Joint Objective Function

Since gradients of the conjugate functions are well defined in our case (see Appendix A.2), we can show that the approximation in eq. 5 is good when the joint objective in eq. 8 is minimized. The joint objective drives the model to find optimal parameters that maximize the likelihood in eq. 1 while ensuring proximity to the true objective. Eq. 8 can also be viewed as a regularized objective, where the model learns to enforce additional constraints on the variance of the latent distribution and thus ends up with distributions having rapidly decaying Gaussian tails.

The penalty term is obtained using theorem 1, which bounds the Jensen inequality gap for each data point via a sum over gradients computed for the functions f_k.

6 Results

We analyze two types of results: (1) we evaluate our combined objective (eq. 8) on a proportional hazards (CPH) model and show an improvement in the concordance index (table 1); (2) we compare our proposed distribution-agnostic bound against a standard bound on the Jensen inequality (Dragomir, 1999).

Figure 2: Comparison of the upper bounds on the Jensen inequality established by our method and by the baseline (Dragomir, 1999); 95% confidence intervals are shown.
Figure 3: Visualization of our distribution-agnostic bound on Jensen's inequality for strictly convex functions.

6.1 Survival Analysis

Given a discrete latent variable, we model its distribution as a multinomial. The final layer of the input encoder network is a softmax operation, ensuring that the distribution over the latent space is a valid one. We compare our models, Latent Variable CPH with Hard/Soft Gating (LV-CPH-HG / LV-CPH-SG), against three popular baselines, CPH (Cox, 1955), RSF (Ishwaran and Lu, 2007) and DeepSurv (Katzman et al., 2016), on common datasets in survival analysis: METABRIC (Yao, 2014), ROTTERDAM-GBSG (Schumacher et al., 1994) and SUPPORT (Knaus et al., 1995).

For the discrete case, it is easy to see that the regularizer in eq. 8 is minimized when the latent distribution has low variance (or entropy). We enforce a low-entropy distribution by gating (soft or hard) the predictions obtained from the softmax layer. Since, in the discrete case, low entropy implies low variance, one can instead minimize the entropy to effectively reduce the upper bound in theorem 1. We therefore conclude that optimizing the joint objective in eq. 8, rather than the bare approximation in eq. 5, leads to an improved concordance index for CPH models.
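A minimal sketch of the two gating schemes, assuming the latent distribution comes from a softmax over encoder logits (the temperature value and all names are illustrative assumptions): soft gating sharpens the softmax with a low temperature, and hard gating takes the zero-entropy argmax limit.

```python
import numpy as np

def soft_gate(logits, temperature=0.1):
    """Low-temperature softmax: a low-entropy (soft-gated) latent distribution."""
    z = logits / temperature
    z = z - z.max()                      # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def hard_gate(logits):
    """One-hot argmax: the zero-entropy (hard-gated) limit."""
    p = np.zeros_like(logits, dtype=float)
    p[np.argmax(logits)] = 1.0
    return p

def entropy(p):
    q = p[p > 0]
    return float(-(q * np.log(q)).sum())

logits = np.array([1.0, 2.5, 0.3])
p_plain = soft_gate(logits, temperature=1.0)   # ordinary softmax
p_soft = soft_gate(logits)                      # sharpened, lower entropy
p_hard = hard_gate(logits)                      # entropy exactly zero
```

Lowering the temperature monotonically concentrates mass on the argmax state, which is how the gating drives down the entropy term penalized by the regularizer.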

6.2 Tightness of the proposed bound

Figure 2 compares the bound of Dragomir (1999) [baseline] (Appendix A.1) against our tight bound (Appendix A.2), with samples drawn from distinct fixed normal distributions. Our bound is much tighter for smaller values of p, and it converges to the baseline's value for large p. Figure 3 likewise makes it easy to verify that the bound we propose on Jensen's inequality is stronger than the baseline.

Table 1: Concordance index of the hard and soft linear gating networks (LV-CPH-HG / LV-CPH-SG) compared with the baselines CPH, RSF and DeepSurv on METABRIC, ROTTERDAM-GBSG and SUPPORT (95% bootstrap CIs). (Nagpal, 2019)

7 Discussion

We propose an approximation for the MLE objective in TPPs involving a partial likelihood function with latent factors. We also show that the approximation error can be controlled by minimizing a joint objective that includes an upper bound on the approximation gap. We have shown this to be theoretically and empirically better for partial likelihood estimation (in survival analysis). Future work could further exploit the tractable closed-form approximation gap by directly optimizing it with iterative methods like ADAM. Another direction would be to extend this work to other semi-parametric latent variable models, such as Gaussian processes for survival analysis (Fernández et al., 2016).


  • A. Cedilnik, K. Kosmelj, and A. Blejec (2006) Ratio of two random variables: a note on the existence of its moments. Metodoloski zvezki 3 (1), pp. 1.
  • D. R. Cox (1955) Some statistical methods connected with series of events. Journal of the Royal Statistical Society: Series B (Methodological) 17 (2), pp. 129–157.
  • P. J. Diggle (2005) A partial likelihood for spatio-temporal point processes. Biostatistics.
  • S. S. Dragomir (1999) A converse result for Jensen's discrete inequality via Grüss' inequality and applications in information theory. An. Univ. Oradea Fasc. Mat.
  • T. Fernández, N. Rivera, and Y. W. Teh (2016) Gaussian processes for survival analysis. In Advances in Neural Information Processing Systems, pp. 5021–5029.
  • H. Ishwaran and M. Lu (2007) Random survival forests. Wiley StatsRef: Statistics Reference Online, pp. 1–13.
  • T. Jebara and A. Choromanska (2012) Majorization for CRFs and latent likelihoods. In Advances in Neural Information Processing Systems, pp. 557–565.
  • J. L. Katzman, U. Shaham, A. Cloninger, J. Bates, T. Jiang, and Y. Kluger (2016) Deep survival: a deep Cox proportional hazards network. stat 1050, pp. 2.
  • W. A. Knaus, F. E. Harrell, J. Lynn, L. Goldman, R. S. Phillips, A. F. Connors, N. V. Dawson, W. J. Fulkerson, R. M. Califf, N. Desbiens, et al. (1995) The SUPPORT prognostic model: objective estimates of survival for seriously ill hospitalized adults. Annals of Internal Medicine 122 (3), pp. 191–203.
  • S. Linderman and R. Adams (2014) Discovering latent network structure in point process data. In International Conference on Machine Learning, pp. 1413–1421.
  • C. Nagpal (2019) Nonlinear semi-parametric models for survival analysis. arXiv preprint arXiv:1905.05865.
  • A. Quattoni, S. Wang, L. Morency, M. Collins, and T. Darrell (2007) Hidden conditional random fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, pp. 1848–1852.
  • O. Rosen and M. Tanner (1999) Mixtures of proportional hazards regression models. Statistics in Medicine 18 (9), pp. 1119–1131.
  • S. Ruder (2016) An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
  • Y. K. Samo and S. Roberts (2015) Scalable nonparametric Bayesian inference on point processes with Gaussian processes. In International Conference on Machine Learning, pp. 2227–2236.
  • M. Schumacher, G. Bastert, H. Bojar, K. Huebner, M. Olschewski, W. Sauerbrei, C. Schmoor, C. Beyerle, R. Neumann, and H. Rauschecker (1994) Randomized 2 x 2 trial evaluating hormonal treatment and the duration of chemotherapy in node-positive breast cancer patients. German Breast Cancer Study Group. Journal of Clinical Oncology 12 (10), pp. 2086–2093.
  • S. Simic (2009) On an upper bound for Jensen's inequality. Journal of Inequalities in Pure and Applied Mathematics.
  • J. Snoek, R. Zemel, and R. P. Adams (2013) A determinantal point process latent variable model for inhibition in neural spiking data. In Advances in Neural Information Processing Systems, pp. 1932–1940.
  • C. Sutton, A. McCallum, et al. (2012) An introduction to conditional random fields. Foundations and Trends in Machine Learning 4 (4), pp. 267–373.
  • C. Q. Yao (2014) Methods for the identification of biomarkers in prostate and breast cancer. Ph.D. Thesis, University of Toronto (Canada).

Appendix A Appendix

A.1 Bounding Jensen's Inequality [Dragomir – Loose Bound]

Simic (2009) proposes multiple distribution-agnostic upper bounds on Jensen's inequality for generic continuous convex functions defined on a compact set. One of the popular bounds in this regime is Dragomir's inequality (Dragomir, 1999). Given that the R.V. is bounded with high probability, section 4 bounds its range accordingly. By Dragomir's inequality (Dragomir, 1999), for a convex function f,


For the two function families considered, the bounds are given by eq. 10 and eq. 11 respectively.


This bound is easy to compute and can be visualized via the gap shown in figure 3. It is, however, quite naive and generic, since it only uses first-order conditions for convex functions to arrive at an upper bound. In the following section we provide a tighter upper bound under the stronger assumption of strict convexity.
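For intuition, one common Grüss-type converse of Jensen's inequality in the spirit of Dragomir (1999) states that for a convex f on [a, b] with X supported there, E[f(X)] − f(E[X]) ≤ (1/4)(b − a)(f′(b) − f′(a)). This is our paraphrase of that style of bound; the exact constants in eqs. 10 and 11 may differ. A Monte-Carlo check with the illustrative choice f = exp:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.0, 2.0
X = rng.uniform(a, b, size=200_000)    # bounded R.V. on [a, b]

f, fprime = np.exp, np.exp             # convex test function and its derivative
gap = np.mean(f(X)) - f(np.mean(X))    # empirical Jensen gap
bound = 0.25 * (b - a) * (fprime(b) - fprime(a))   # Gruss-type upper bound
```

Here the empirical gap is roughly 0.48 while the bound is about 3.2, illustrating how loose this first-order, distribution-agnostic bound can be and why a tighter conjugate-based bound is worth pursuing.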

A.2 Bounding Jensen's Inequality [Convex Conjugate – Tight Bound]

This section provides the proof of theorem 1 in the main paper. We investigate bounds in the special case of strictly convex functions on a half-line. We also show visually (figure 3) that our bound is the tightest possible distribution-agnostic bound for the given set of functions. With the Jensen gap as the bound of interest,


Figure 3 depicts geometrically the maximization problem on the RHS of eq. 12, which we solve via the convex conjugate of f. The optimization problem in eq. 12 involves identifying the point at which the line in figure 3, whose slope is determined by the expectation, is farthest from f.

Given that f is a closed, proper, strictly convex function on the positive half-line (the case of the negative half-line is similar), we can define the convex conjugate f* of f. The form in eq. 13 is similar to what we need: the distance between f and a line with a given slope.


Since f is strictly convex, the subgradient of f* at any point is a singleton set, equal exactly to the gradient, giving us a unique maximizer in eq. 14.


Eq. 15 tells us that the maximizer (and thus the bound in eq. 12) for a given sample is a function of the gradients of f and its conjugate. Given these, the bound is easy to compute.


In section 6 we empirically compared this bound with eqs. 10 and 11 from Appendix A.1.