1 Background and motivation
Indepth modeling of the evolution of human mortality necessitates analysis of the prevalent causes of death. This is doubly so for making mortality forecasts into the future across different age groups, populations and genders. In this article we develop a methodology for probabilistic forecasting of causespecific mortality in a multipopulation (primarily interpreted as a multinational) context. Thus, we simultaneously fit multiple causespecific longevity surfaces via a spatiotemporal model that accounts for the complex dependencies across causes and countries and across the AgeYear dimensions.
While there have been many works on modeling mortality across several populations (Dong et al. 2020, Enchev et al. 2017, Guibert et al. 2019, Hyndman et al. 2013, Kleinow 2015, Li and Lu 2017, Tsai and Zhang 2019), as well as an active literature on causeofdeath mortality, there are very few that do both simultaneously. As we detail below, there are many natural reasons for building such a joint model, and this gap is arguably driven by the underlying “Big Data” methodological challenge. Indeed, with dozens of mortality datasets that are indexed by countries, causesofdeath, genders, etc., developing a scalable approach is daunting. We demonstrate that this issue may be overcome by adapting machine learning approaches, specifically techniques from multitask learning (Bonilla et al. 2008, Caruana 1997, Letham and Bakshy 2019, Williams et al. 2009). To this end, we employ multioutput Gaussian Processes (MOGP) combined with linear coregionalization. GPs are a kernelbased datadriven regression framework that translates mortality modeling into smoothing and extrapolating an inputoutput response surface based on noisy observations. It yields a full uncertainty quantification for mortality rates and mortality improvement factors. Coregionalization is a dimension reduction technique that enables efficiently handling many correlated outputs.
This work is a continuation of our series of articles Ludkovski et al. (2018), Huynh et al. (2020) and Huynh and Ludkovski (2021) that discussed the application of GPs to model allcause mortality in the singlepopulation and multipopulation contexts, respectively. Unlike allcause mortality in different geographic regions, which tends to exhibit strong correlation and longterm coherence, different causes have less commonality, and thus require a more flexible structure for the respective crossdependence. Moreover, joint analysis of 10+ mortality surfaces brings computational scalability challenges of potentially hundreds of model parameters to calibrate. Thus, in order to carry out causespecific mortality analysis, we make methodological innovations along two directions. First, we compare two distinct versions of the MOGP that implement dimension reduction by fusing outputs through linear combinations of latent functions: the Semiparametric Latent Factor Model (SLFM) and the Intrinsic Coregionalization Model (ICM). ICM assumes a fixed spatial kernel in AgeYear, while SLFM does not. This distinction corresponds to different assumptions about the structural commonality in the modeled mortality surfaces. We conduct sensitivity analysis to assess these two choices for the tasks of insample fitting and of forecasting causespecific mortality. Second, we implement a multilevel MOGPICM model that separates latent factors across the different types (countries, causes, genders, etc.) of categorical inputs describing the populations. The separability assumption in the joint covariance kernel yields a producttype crosspopulation correlation structure, providing insights about the relationships between the mortality improvement trends.
Our models are driven by the scalability issue which has been a critical obstacle to analyze in bulk the largevolume mortality datasets that have become available recently. Thus, to cope with many mortality surfaces, we leverage the twin pillars of multioutput models (Teh et al. 2005, Letham and Bakshy 2019, Williams et al. 2009) and the structured Kronecker covariance that mitigates the typical cubic computational complexity of GPs (Flaxman et al. 2015, Gilboa et al. 2015, Saatçi 2011, Zhe et al. 2019).
Decomposing allcause mortality rates leads to reduced signaltonoise ratio since less common causes intrinsically have limited death counts. Consequently, causespecific analysis must contend with much noisier data. One motivation for the MOGP approach is to explicitly enable data fusion across populations, sharing information to improve model fitting. We demonstrate significant denoising of mortality experience that successfully captures causespecific trends, including for causes with low data credibility. We also document the benefit of joint models to reduce model risk, i.e. improved inference of model hyperparameters. As another feature, our framework can handle nonrectangular datasets, e.g. countries with different period coverages. In one of our case studies we exploit this to borrow the most recent data from other countries to update predicted domestic mortality rates.
Literature Review
One of the few works approaching causespecific mortality modeling within a multipopulation context is the recent study in Lyu et al. (2021). The authors introduced a nested model in the spirit of Li and Lee (2005) to jointly model major causes from three European countries by capturing the crosscause and crosscountry dependencies through common factors.
Since a given death is associated with a single causeofdeath, direct dependence among causes is not observable. As such, many researchers choose to model and forecast each cause in isolation. A variety of forecasting methods for individual causes are employed: univariate timeseries, such as ARIMA methods in Caselli (1996), Knudsen and McNown (1993), McNown and Rogers (1992), dynamic parametrization in Tabeau et al. (1999), least squares methods and variations of the LeeCarter model in Caselli et al. (2019)
. To capture dependence between causes, a common approach is via copulas within the framework of dependent competing risks. The main effort is to characterize the joint distribution of survival times in terms of unobserved causespecific mortality rates, see
Dimitrova et al. (2013), Lo and Wilke (2010) and Li and Lu (2019). Alternatively, Alai et al. (2018)utilized multinomial logistic regression. Outside the competing risks framework,
Arnold (Gaille) and Sherris (2013)proposed a multivariate Vector Error Correction timeseries model to examine the existing cointegration relationships. Another approach is to link multiple causes through a list of clinical factors, then applying a stochastic model to forecast these factors
(Foreman et al. 2018).Several studies relied on compositional data analysis to achieve coherence in the sense that causespecific forecasts sum to the allcause forecast. This entails modeling the bycause distribution of deaths, directly incorporating dependence between causes, see BergeronBoucher et al. (2017), Kjaergaard et al. (2019). We refer to Wilmoth (1995), Caselli et al. (2019), Tabeau et al. (1999) for discussions on the usefulness and limitations in using causespecific models to project allcause mortality.
The remainder of this paper is organized as follows. Section 2 introduces causeofdeath mortality datasets from the HCD. Section 3 describes the MOGPICM framework and its extensions within multinational context. Section 4 focuses on how MOGPs can maximize predictive gains over singlepopulation models and provide insights about the projected trends of aggregate mortality. Section 5 compares the results from Multi and Singlelevel ICM. Finally, Section 6 concludes with main findings and directions for further analysis.
2 The Human Causeofdeath Database
The Human Causeofdeath Database (HCD) (HCD 2021) provides detailed causespecific mortality data for more than a dozen developed countries. The HCD offers three levels for the classification of causes. The short list (with 16 broad categories) and the intermediate list (103 categories) are the same across countries, while the full list is countryspecific. The data for each country is organized by calendar years, age groups, gender, and causes. Datasets for different countries do not line up, both due to historical availability and different timelines for updates, see Figure 21 in Appendix A.
As part of their extensive and welldocumented postprocessing the HCD conducted a series of bridgecoding studies to reduce disruptions in mortality trends due to changes in the International Classification of Disease (ICD). The current 10th Revision of the ICD is far more detailed with the addition of 8,000 categories compared to the 9th Revision (Anderson et al. 2001). As causeofdeath records switched to the 10th Revision, death counts were shifted among some categories. To minimize such jumps, the HCD reconstructed death assignments between the old and the new ICD Revisions; this reconciliation is already part of the HCD datasets we used.
In our first case study we analyze subcategories of Cancer, available from the 103category intermediate list. Being the leading cause of death worldwide (WHO 2021), it is useful to understand the trends in cancer mortality for different age groups and explore the dependence between its common variations. Moreover, cancer types generally do not feature any substantial trend discontinuity due to ICD revisions (Anderson et al. 2001), allowing a better assessment of model performance in terms of insample smoothing and outofsample forecasting.
For our testbed we select 5 variations of cancer: lung and bronchus (LUN), colon and rectum (COL), pancreas (PAN), stomach (STM), and all other cancers (RMN) from 3 countries in the HCD: Czech Republic (CZE), Germany (DEU), and Poland (POL). The chosen Cancer causesofdeath are common in both Male and Female populations, enabling us to jointly model mortality rates across Cause, Country, and Gender factors. For example, we exclude breast cancer from the study as the respective male counts are very limited. Age in the HCD is formatted as discrete fiveyear age groups; we treat it as a continuous covariate encoded via the respective group average (52 for Age group 50–54, 57 for group 55–59, etc). In Figure 1, we visualize the raw logmortality rates across different cancer types in Male populations from Czech Rep., Germany, and Poland. Lung and Colorectal are the leading causes of cancer deaths, while Pancreatic and Stomach are less prevalent and exhibit more volatility. Figure 22 in the Appendix visualizes the respective patterns in Age; we observe a strong convexity in Age, especially for LUN.
Our second case study uses a subset of the HCD data for United States, based on the CDC’s National Center for Health Statistics, where we decompose allcause mortality into its major categories. This analysis is inspired by a similar study conducted by the SOA (Boumezoued et al. 2019) that employed a multivariate LeeCarter model. The SOA study looked at 11 major causes, all Age groups between 0 and 95+ in 1999–2016 with data combined from the HCD and the Global Burden of Disease project. We restrict analysis to Ages 40–69 and years 1999–2018, and moreover reduce to the eight most common causes for the examined age groups: Heart (HEA), Stroke (STK), Diabetes (DIA), Cancers not induced by smoking (CAN), (lung) Cancers induced by smoking (CANL), Respiratory (RES), Drug abuse (DRU), and all other remaining causes (RMN). The cause mapping for this case study closely follows the SOA guidance.
2.1 Stacking SubPopulations
For joint modelling purposes, datasets are stacked together. Generically we have a total of populations, indexed by the subscript . When needed, to indicate country, cause, and gender factors, we reindex by . Throughout, the two main independent variables are Age and Year, , and the observed mortality rate in the th population is denoted by:
(1) 
where is the number of deaths in the th population for the th observation with the corresponding Age & Year inputs. The denominator refers to the corresponding exposedtorisk counts. Note that different causes in the same countrygender combination with have the same , but different ’s. Our GP models work with logmortality rates so that our data is
The overall dataset is then represented as , , .
3 Methodology
3.1 Gaussian Process Regression for Mortality Tables
Consider the nonparametric additive regression model for the populations to be jointly analyzed:
(2) 
where represents an individual entry in the mortality table (indexed by Age, Year), is the observed output in the th population, and , is the underlying latent logmortality surface, obscured with the observation noise .
We first summarize the spatial structure that concerns the dependence of mortality rates as a function of Age and Year. Momentarily focusing on a singleoutput Gaussian Process (SOGP) regression, we put a GP prior on the latent function , meaning that any finite vector at
inputs follows the multivariate Gaussian distribution
where is the mean vector of size and is the by covariance matrix. All properties of a GP are thus completely described by its mean and covariance functions.
The functions characterize our prior beliefs about the response surface . The covariance kernel of the GP defines a similarity between pairs of data points. It characterizes the smoothing process by determining the influence of observations on the distribution of the output. Data points that are close are expected to behave more similarly than data points that are farther away. In terms of spatial dependence on (Age, Year) we concentrate on a common family of covariance functions known as the Matérn class, equipped with automatic relevance determination. Specifically, the Matérn5/2 kernel defines the covariance between two mortality table entries as follows:
(3) 
This kernel is parametrized by the Age lengthscale and the Year lengthscale .
The mean function describes the relevant trends in logmortality rates. We fit a parametric prior mean to capture the longterm longevity features: where ’s are given basis functions and the ’s are unknown coefficients. For example, to reflect a quadratic trend in Age (Figure 22 in Appendix A) and linear trend in Year dimension we use:
(4) 
We are interested in the posterior distribution for at specified inputs given the dataset , , in other words the likelihood of the true response surface being given what we have observed. Using where is the Kronecker delta, we have the Gaussian observation likelihood where the error terms are assumed to be independent and Gaussiandistributed, . Incorporating the evidence from observations, as reflected in the likelihood function and the prior, we obtain the Gaussian posterior
The Universal Kriging equations below provide not only the posterior mean
and posterior variance
but also the estimated coefficients
. Let , , and where is the covariance matrix . The posterior mean of along with the predicted posterior mean and respective variance for any input are:(5)  
(6)  
(7) 
where is the vector of covariances between inputs in the training set and desired test input . Note that the predictive distribution of observation at is similarly obtained as .
Below we use as the model prediction for the respective (log)mortality rate of the th population in cell , and
as the corresponding posterior uncertainty which is used to obtain predictive quantiles around the former prediction.
3.2 Semiparametric Latent Factor Model
The vectorvalued latent response variable over the AgeYear input space is defined as
, where the functions are the logmortality surfaces for the corresponding th population. Similar to singlepopulation GP above, we place a GP prior over the latent function such that:where is the mean vector function whose elements are the mean functions of each output, and is the fused covariance matrix. This implies that we separate the “spatial” dependence, encoded in from the intertask dependence encoded in . While has a natural Euclidean metric that is used in (3), the population indices are generally of factortype. Hence, to describe the dependence across outputs we need hyperparameters which becomes inefficient and unstable beyond 34 populations. Thus, with an eye towards reducing the number of hyperparameters, additional structure is needed for .
In the Semiparametric Latent Factor Model (SLFM) dating back to Teh et al. (2005), each output is assumed to be a linear combination of latent functions:
(8) 
where ’s are independent realizations from GP priors with distinct covariances , and ’s are the factor loadings (), considered part of the kernel hyperparameter space . Thus, the semiparametric name of the model comes from the combination of a nonparametric component (several GPs) and a parametric linear mixing of the functions .
The role of is to achieve dimension reduction for the correlation structure across the ’s. Let be the vector representing the collection of coefficients associated with the th latent function across the outputs. Then the covariance of the vectorvalued function is:
(9) 
where symbolizes the Kronecker product, and each has rank one.
3.3 Intrinsic Coregionalization Model
Similar to SLFM, ICM assumes each output function is generated from a common pool of latent functions, cf. (8). However, the latent all share the same GP prior with the covariance kernel . Then, the covariance for is:
(10) 
where has rank . In other words, the crossoutput correlation is of rank , while the spatial covariance of each output is the same. The th element in the diagonal of the crosscovariance matrix (or in SLFM) represents the process variance of . Since , the individual entries are and the diagonals are , . We can similarly infer the correlation matrix between population and (); for ICM it is
(11) 
In both SLFM and ICM, the fused covariance kernel belongs to the class of separable kernels (Alvarez et al. 2012) and decouples using the Kronecker product into: (1) the coregionalization matrices that measure the interaction between different outputs and (2) the spatial covariance over AgeYear dimensions , cf. Eqs (9) & (10). In ICM, all populations share the same spatial covariance kernel . Such assumption of a common spatial covariance over AgeYear inputs links to the concept of commonality in the mortality structure (but not levels) across populations. Compared to ICM, SLFM offers more flexibility at the cost of adding more hyperparameters: the total number of kernel hyperparameters in the fused covariance matrix of SLFM is visavis hyperparameters in the ICM.
Selecting rank . As is not one of the hyperparameters to be optimized, ad hoc ways are needed to pick it. We use the Bayesian Information Criterion (BIC) to select rank that produces the most parsimonious model, see Williams et al. (2009) and Huynh and Ludkovski (2021). As discussed in Bonilla et al. (2008), taking in ICM corresponds to finding a rank approximation (based on an incomplete Cholesky decomposition) to the fullrank . A similar interpretation holds for SLFM and the respective ’s. While attractive for dimension reduction and computational speedup, low may not be adequate to describe the overall dependence structure and hence clashes with the original goal of capturing the variability present in the fused mortality dataset. In particular, we observe that BIC tends to select which may be too small for . Based on our case studies, we recommend for maximizing predictive performance.
3.4 MultiLevel ICM for Scalable GPs
In the situation when we have multidimensional factor inputs (e.g. Cause and Gender together), one approach is to combine all factor inputs into a single covariate with distinct outputs prior to applying ICM or SLFM. When grows large, ICM becomes less feasible due to its time complexity Bonilla et al. (2008). In this section, we develop the structured Kronecker product kernel (multilevel ICM in Liu et al. (2019), Zhe et al. (2019)) to mitigate this scalability issue in GP. The structured covariance kernel exploits the fact that mortality tables are gridded along each factor dimension.
We express the total number of outputs as the product across types of categorical inputs, , where is the number of levels within the th categorical input. We then decompose the crosspopulation covariance as the Kronecker product:
(12) 
where refers to the crosscovariance matrix between subpopulations within the th categorical input, taken to have rank . Directly marginalizing the crosscovariance matrices yields a convenient interpretation of the correlation between subpopulations within a factor input, and moreover allows for separate estimation of each crosscovariance submatrix , .
The multilevel ICM setup implies that each output is the weighted combination of independent latent functions, all with the spatial covariance kernel :
(13) 
The improvement in scalability of the multilevel ICM can be analyzed via the ranks of the crosscovariance submatrices. Thanks to the Kronecker product’s property, and using Cholesky decomposition, Eq. (12) can be rewritten as:
(14) 
where , , and each vector , () represents the collection of scalar coefficients associated with the th latent function across subpopulations in the th categorical input. Thus, the number of hyperparameters required to estimate the crosscovariance is , which can be much lower than for singlelevel ICM when is large. Note that when , the multilevel ICM utilises more latent functions to generate the model outputs, compensating for the imposed structure in in (14). In terms of overall complexity, multilevel ICM requires time compared to for singlelevel ICM.
Remark: In the typical situation, ’s are small and so it is feasible to consider fullrank multilevel ICM, i.e. . Otherwise, (14) allows to exploit simultaneously the Kronecker product structure, as well as the lowrank approximation. See Table 2 for results on the impact of values in multilevel ICM.
3.5 MOGP Hyperparameters
To implement a GP model requires specifying its hyperparameters. Note that actual inference reduces to linearalgebraic formulas in (6)(7), and the modeling task is to learn the spatial covariance, namely the mean and kernel functions.
Mean function: We make the prior to be populationspecific in order to maximize model flexibility in describing the mortality trend of each population. Thus, we have coefficients , cf. (4).
Observation Likelihood: We assume a constant observation noise within each population . This accounts for heterogeneous characteristics when observations from multiple populations are combined, in particular is smaller for larger populations and for more prevalent causes (Huynh et al. 2020). The ’s are estimated via Maximum Likelihood along with all other hyperparameters. More advanced GP models that either employ Poisson likelihood or infer inputdependent nonparametric are possible but require additional coding and are beyond the scope of this work.
Estimating Hyperparameters: For (multilevel) ICM the set of hyperparameters is ; for SLFM it is . We use the R package kergp (Deville et al. 2019) to carry out the respective Maximum Likelihood Estimation via Kronecker decompositions. Alternatively, to account for model risk and offer more robust results, one could employ a fully Bayesian hyperparameter inference. This could be done with the Stan software (Carpenter et al. 2017) but is beyond the scope of this work.
3.6 Performance Metrics
Given a test set of observed ’s, we evaluate the effectiveness of different models using two metrics. First, we employ the mean absolute percentage error (APE) to examine the discrepancy between the observed and predicted outputs:
(15) 
where is the observed value at test input in the th population, and is the predicted logmortality rate. Note that APE is scaleindependent, enabling us to compare model performance across populations with different exposures.
We also use the Continuous Ranked Probability Score (CRPS) metric to assess the quality of the probabilistic forecasts produced by a MOGP. Indeed, one of the major benefits of GPbased mortality models is a full distribution for future observations which allows a more detailed uncertainty quantification beyond just looking at the predictive mean . CRPS assesses the closeness of the realized outcome relative to its predictive distribution which in the GP contest is Gaussian and leads to
(16) 
where are the standard Gaussian density and cdf. Observe how CRPS penalizes both bias () and excessive predictive variance.
We average both APE and CRPS across a test set of AgeYearPopulation inputs. The resulting mean APE is interpreted as a normalized relative predictive error, and mean CRPS as the squared difference between the forecasted and the empirical cumulative distribution functions. Models with lower mean APE/CRPS are judged to have a better fit.
Mortality Improvement Factors. A common way to interpret a mortality surface is via the (annual) mortality improvement factors which measure longevity changes yearoveryear. The raw annual percentage mortality improvement is . The smoothed improvement factor is obtained by substituting in the GP posterior means ’s:
(17) 
4 Modeling Multiple Causes of Death
To understand the behavior of Agespecific mortality across different causes of death, we begin by generating MOGP models for Cancer variants. Using the HCD database we fit both ICM and SLFM and assess their performance in 3 European countries. We test the resulting predictive performance by computing APE and CRPS for oneyearahead mortality forecasts in three separate test sets, using SOGP as the baseline. All models, including SOGP models, are trained on the same Ages from 50–84 (7 age groups) and three overlapping periods: 1998–2013 for the 2014 prediction, 1999–2014 for 2015 prediction, and lastly 2000–2015 for 2016 prediction. We report APE and CRPS for allCancer logmortality rates since some of the cancer variations, such as Stomach and Pancreatic have relatively few (and therefore more noisy) recorded deaths. The differences in APE and CRPS between MOGP and SOGP models are expressed as the 3year median percentage improvement over SOGP models. Positive improvement means joint models have smaller mean APE/ CRPS values. To compute allCancer CRPS, we leverage the closedform expression of the MOGP multivariate predictive distribution in (6)(7) to simulate the forecast distribution of allCancer logmortality for each Age group in the data. To do so, we first draw () stochastic samples of the joint across all the Cancer types. After exponentiating and summing, we then obtain corresponding samples of (unlogged) allCancer mortality rates.
Cause () 

Czech Rep.  Germany  Poland  

APE  CRPS  APE  CRPS  APE  CRPS  
ICM  12  19.16  30.18  3.86  15.05  12.85  17.72  
17  17.69  19.08  6.53  15.16  13.46  18.64  
22  22.31  21.43  6.30  14.58  17.02  11.01  
SLFM  14  18.04  29.49  5.09  13.51  10.82  16.05  
21  22.87  23.72  2.78  13.55  4.11  10.65  
28  16.97  18.24  8.75  16.73  5.13  3.81 
Table 1 shows the 3year median improvement in MAPE and CRPS for Multicause ICM and SLFM over SOGP models. Overall, joint models produce more accurate mean forecasts (positive improvement in APE) with higher credibility (positive improvement in CRPS). We observe the opportunity to borrow information across different cancers to better estimate the kernel hyperparameters. As a result, joint models can describe important trends for individual cancers, leading to the reduction in disparity between the predicted values and the observed allCancer mortality in the test sets.
4.1 Commonalities in CauseSpecific Mortality Surfaces
The main difference between ICM and SLFM is the underlying assumption about the latent factors . ICM assumes that all factors have the same lengthscales , and is therefore appropriate for modeling homogenous mortality surfaces. SLFM is more general and fits distinct ; it is expected to perform better when the different surfaces exhibit heterogeneity (for example different degree of correlation across Age). We examine this commonality assumption in Figure 2 by displaying sidebyside the mortality improvement factors of the various cancers derived from multicause ICM and SLFM. The BIC selection criterion yields for both ICM and SLFM. In this case study, the results from SLFM are almost identical to ICM predictions. Therefore, the assumption of sharing the spatial kernel over AgeYear inputs across the considered cancer types is plausible. This conclusion is reinforced by Table 3 in Appendix A which shows that the inferred lengthscales for SLFM are very similar for across all three countries. In other words, both of the latent factors learned by SLFM have similar AgeYear spatial dependence, and so there is little loss of fidelity in a priori forcing them to be equal, as is done in ICM. Indeed, the lengthscales in SLFM are close to the ICM ones.
We can also inspect Figure 2 for insights about the relative mortality improvement trends of different cancers. Stomach cancer has the largest improvement rates for most age groups in all three countries. Decline in stomach cancer incidence tends to be associated with economic improvements resulting in healthier diet, better food preservation, clean water supply, etc. We also observe the increasing improvement trend of lung cancer among age groups below 60 in Czech Rep. and Poland, reflecting lower smoking rates in birth cohorts after WWII. Czech Males experienced large increase in the improvement rates in most cancers. In Germany, the improvement rates have been rising among age groups below 60 but slowing down among older age groups. In Polish Males, except stomach cancer, the improvement trends increased until early 2010s and then significantly declined, displaying the impact of an aging population and an increase in lifestyle exposure to risk factors for cancers (Wojtyś et al. 2014). The incidence of Stomach cancer has been flat over time, consistent with the finding in Arash et al. (2020).
Figure 9 visualizes the inferred crosscause correlation matrices . Both ICM and SLFM document a global positive association among mortality rates from these cancers in each country. It is consistent with strong resemblance in the mortality improvement trends between cancers in Figure 2. Since cancers within a given population share common risk factors, innovations in early detection and advancements in treatment for one cancer are likely to have positive effects on other cancer variations. Negative correlation reflects opposite trends, e.g. Stomach and Pancreas cancers in Poland; this may be due to a competing risks context.
For a different take on causeofdeath commonality, we applied the multicause GP ICM model to jointly model top causes in the HCD US dataset, separately for each gender. Using BIC as the criterion, models with rank (for 8 populations) yield the largest BICs for both Males and Females. This indicates a higher degree of heterogeneity in these larger cause groupings, compared to across five cancer causes in the previous study. Thus, the models employ more latent functions to adequately capture the total variability in the joint datasets. The inferred crosscause correlation matrices are displayed in Figure 27 in Appendix B. Overall, our results are in agreement with the SOA study (Boumezoued et al. 2019), for example confirming that there are moderate positive associations between most causes. The strongest correlations are found between Heart disease and Stroke, and between Heart and Drug overdose. As might be expected, there is little correlation between Remaining Causes (RMN) and most other categories. Some of the correlations vary between genders, possibly due to strong observation noise in less common causes.
A complementary way to examine commonalities across causes is to inspect the inferred factor loadings . Populations that have similar loadings will be highly correlated. Figure 23 in Appendix A displays the factor loadings in a CountryCause SLFM with latent factors
. We observe that primary clustering is by Causes rather than by Countries. Some outliers, such as STM in Czech R., can also be seen and suggest idiosyncratic behavior of the respective mortality surface.
4.2 Aggregating byCause models
An important motivation for our work was to use bycause analysis to make more precise conclusions about allcause mortality. For example, the mortality trends of individual Cancers give insights into the respective allCancer mortality trends. Figure 10 shows the results from a multilevel CountryCause GP ICM. It visualizes the aggregated predictive distribution of allCancer logmortality observations, , for Male populations by Country and Age group, using shading to denote predictive quantiles. Note that since the model did not use allCancer mortality during training, the fact that the predictive insample bands closely match the historical movement of allCancer mortality data is a validation of a successful bycause analysis.
The effort in fighting cancer has transpired in all three countries, but the improvement is not uniform. Despite Czech Republic and Poland being socioeconomically similar, Czech Rep. has a faster improvement pace than Poland. Although Germany continues to have the lowest logmortality rates across all Age groups, the Czech Rep. has drastically closed this gap in the last decade, see especially the left panel of Figure 10 (Age group 5559). The main driver is the rapid improvements in all common Cancers in Czech Rep., for example the mortality improvement factor for LUN being recently more than double compared to Germany, cf. Figure 2.
4.3 Trends in US TopLevel Causes
Figure 11 presents the insample posterior distribution of logmortality rates during 1999–2018 along with the projected trends up to 2025 for two Age groups, in both Male and Female US populations. The left panels display the predictive trends for the top 6 causes in each agegroup. We observe improvements in most causes and age groups; the largest improvement being in CANL. In contrast, Drug abuse deaths are rising rapidly among young age groups (e.g: Age 40–44), see the 2 upperleft panels of Figure 11. In the older age groups, mortality from Heart disease experienced large declines in early 2000s, but essentially flattened out after 2015. Boumezoued et al. (2019) emphasized the need for breakpoint detection to improve forecasts of such causes whose trends change over time. In the MOGP framework, the forecasts are driven by the most recent data and are thus automatically adjusted if trends shift. So for example, we do not need to do any breakpoint analysis to achieve the slow pace of future HEA improvement shown in Figure 11.
The middle and right columns in Figure 11 compare the aggregate allcause projections for the US population. We witness the pessimism of the aggregate projected trends based on the bycause models compared to an allcause SOGP (right column), especially in the younger ages. This discrepancy, driven by the growing importance of causes with increasing trends, such as Drug overdose, highlights the additional insights from bycause modeling. The pessimism of bycause analysis was first mentioned in Wilmoth (1995) and reiterated in Boumezoued et al. (2019). For older age groups the underlying dynamics among common causes are more stable and allcause and bycause forecasts are broadly similar.
Note that compared to Figure 10, the GP posterior uncertainty bands widen dramatically as we go from insample (up to Year 2018) to extrapolating for Years 2019–2025. This reflects the datadriven nature of GP forecasts which intrinsically leads to low uncertainty for insample smoothing and widening uncertainty as predictions are made further into the future. This phenomenon gets amplified as we add up the bycause forecasts to obtain allcause predictions, witness the wider band in the middle panels of Figure 10 relative to the right panel based on a SOGP.
As discussed in Huynh and Ludkovski (2021), MOGPs are wellsuited to generate expertbased projections. This is a useful feature to have given that by default projections are driven by the historical trends that might not continue in the future. For instance, the increasing trend of Drug abuse is largely fueled by the opioid epidemic. Assuming this crisis is addressed in the future, the projected DRU mortality should be adjusted downward. In MOGP this can be achieved by modifying the Year trend in . Figure 24 in the Appendix displays an illustration where the trend of Drug abuse is reduced by one third (through lowering the Year effect by one third) of the original pace for both the Male and Female populations. The resulting adjusted forecast for aggregate mortality gets closer to that from the allcause model, reducing the level of pessimism we have observed earlier among the US young population. Another potential adjustment could be for smokinginduced cancer, where the MOGP models extrapolate the historical trend of rapid longevity gains. However, the SOA report (Boumezoued et al. 2019) suggests that this pace might not take place in the intermediate term, lowering aggregate mortality gains.
5 CauseofDeath Joint Modeling in a Multinational Context
We proceed to consider simultaneously 30 populations in the Cancer variations case study, arranged by the three factor inputs of Cause ), Country () and Gender ().
5.1 Multilevel vs Singlelevel Correlation Structure
Figure 14 displays the inferred correlation structure across the above 30 populations. On the left we fit a multilevel ICM () and on the right a singlelevel ICM (). Both models are fitted on Age groups 50–84, Years 1998–2016 for 3 Countries, 5 Cancers, and 2 Genders. Observe that the multilevel model has hyperparameters compared to in the singlelevel model. The right panel does not display any recognisable structure in the inferred correlations; because the model is not aware of the different factor dimensions, after dimension reduction the marginal associations between subpopulations within the original factor inputs are no longer accessible. In contrast, the multilevel model enforces a block structure in the correlation matrix , see Figure 14 (a). Recall that the correlation submatrices for each factor input are estimated separately and the Kronecker product structure implies that we can read off the correlation among any combination of factors. Figure 18 displays the derived sub crosscorrelation matrices from the above threelevel CountryCauseGender GP model. One can then multiply these factorbased correlations to get the total correlation in Figure 14(a). For example, the correlation of Cancer mortality between CzechLungMale and PolishPancreasMale is . This is in fact also the correlation between CzechLungFemale and PolishPancreasFemale () mortality. The limited number of deaths in some causes explains differences in the correlation matrices between single and multilevel ICM. For example, the correlations between PolishStomachFemale and other cancer types are mostly negatively correlated in ICM but (mildly) positively correlated in the multilevel ICM.
5.2 Model Selection
Next, we compare the predictive performance of multilevel vs singlelevel MOGPICM. Following the same setup as Table 1, Table 2 shows the 3year median improvement in APE and CRPS for different joint models relative to SOGP models in a multinational context. Joint models produce more accurate mean forecasts with higher credibility but the predictive gains are not uniform across countries. The largest improvements from multilevel ICM are in Czech Republic; Czech raw data has lower credibility due to its smaller exposures, thus there is more opportunity for data fusion. The results further validate our approach of modelling causespecific mortality across populations: models that incorporate information from foreign countries (e.g. CountryCause GP) have larger predictive improvements compared to Table 1. In Poland due to the structural break encountered in mid2010s (see Fig. 10), the performance of aggregated CountryCause MOGP is consistently worse than those of allcause SOGP; this issue is rectified for CountryCauseGender MOGPs.
CountryCause () 

Czech Rep.  Germany  Poland  

APE  CRPS  APE  CRPS  APE  CRPS  
ICM  32  35.54  30.74  12.02  21.06  40.44  59.47  
47  45.51  50.07  8.50  20.35  37.09  38.48  
62  31.64  36.56  14.08  35.66  13.32  9.56  
SLFM  34  31.51  24.57  4.44  15.87  47.66  45.98  
51  33.78  41.59  0.88  8.62  37.25  24.54  
68  44.79  43.29  8.93  20.49  45.63  37.66  
Multilevel ICM  18  26.42  24.24  4.55  6.84  48.46  61.53  
28  25.74  32.26  5.04  15.75  35.94  44.24  
36  42.03  36.80  3.07  11.12  44.45  53.69  
CountryCauseGender () 

Czech Rep.  Germany  Poland  
APE  CRPS  APE  CRPS  APE  CRPS  
ICM  62  9.99  11.75  5.40  13.40  12.21  6.55  
92  0.52  3.07  10.78  20.35  15.31  4.09  
Multilevel ICM  32  14.85  24.91  68.45  15.22  25.11  7.75  
40  21.34  27.22  40.45  10.22  5.12  8.04 
In contrast to Section 4 where the performance of ICM and SLFM was almost identical, with multiple input factors ICM usually outperforms SLFM. Given the strong commonality in mortality trends across Cancer types, a shared AgeYear covariance kernel appears preferable for information fusion. The comparison between single and multilevel ICM depends upon the number of populations to model. In CountryCauseGender setting with populations, the multilevel ICM () yields better mean APE and CRPS than singlelevel ICM, and moreover uses fewer hyperparameters. For CountryCause setting, the performance is comparable.
Table 2 also shows the impact of the latent ranks and ’s. For CountryCause, tends to yield the best results in singlelevel models; but for CountryCauseGender, ICM with performs consistently worse than , presumably due to unstable estimates of the more than 90 underlying hyperparameters. For multilevel ICM, we generally find that fullrank works best, although lowrank setups also yield good predictive performance, indicating the opportunity to shrink even further the number of kernel hyperparameters.
Figure 19 shows the predicted logmortality rates for individual Cancers and Age group 5559 via fullrank multilevel ICM and singlelevel ICM with . Both models are fitted on Age groups 50–84 and Years 1998–2016 before we perform outofsample forecasts for the next 3 years (2014–2016). The singlelevel ICM produces oversmoothed forecasts for several cancers, especially cancers with large observation noise like Stomach and Pancreas; this problem is mitigated by the shorter lengthscale in Year in multilevel ICM ( versus in singlelevel ICM).
Coherence in causespecific trends: Figure 19 demonstrates that Males and Females do not always share similar progress in mortality reduction. While the trends in Stomach, Colorectal, and Pancreatic cancers are consistent for both genders, for Lung cancer the MaleFemale gap is diminishing rapidly, especially in Germany and Poland. This is driven by a decrease in cigarette consumption among men, while women are more likely to develop Lung cancers that are not associated with smoking. Thus, the concept of forecast coherence (namely extrapolating a stable MaleFemale spread, as observed historically) is not always well suited for causeofdeath analysis.
5.3 Borrowing the Latest Datasets from Other Populations
The period coverage in HCD varies by country as datasets for countries are uploaded asynchronously. This implies that some countries have more uptodate datasets than others, see Figure 21. This offers opportunities to fuse data from other countries to update domestic forecasts. For AgePeriodCohort models, such as Li and Lee (2005), this is generally challenging as the model fitting relies on having a rectangular dataset. Our MOGP framework can straightforwardly handle such ‘notched’ datasets to take full advantage of additional observations in different countries.
As an illustration, we choose Czech Males as a reference population and examine oneyearout prediction quality for several models. We take Czech observations for 1998–2015 and borrow the more uptodate dataset from Germany (1998–2016) to implement the notched setup. Table 4 in the Appendix describes the setup of the different models we consider in this experiment.
Comparison of the prediction accuracy for 2016 allCancer logmortality of Czech Males for indicated Age groups between different models with ‘notched’ setup. Top row: Predictive standard deviation
; bottom row: discrepancy between the predictive mean and the observed value .Figure 20 visualizes relative performance by comparing two components: (a) predictive standard deviation of the underlying 2016 (top panels), and (b) prediction errors relative to the realized 2016 allCancer logmortality rates of Czech Males (bottom). Our benchmark is Czech Males SOGP fitted on allCancer logmortality rates from 1998–2016. We see that a CountryCause MOGP yields more credible predictions (lower ) than the benchmark model that performs insampled smoothing of the 2016 Czech allCancer experience. The forecasts from this joint model are nearly as good as the benchmark as both models produce the smallest prediction errors. In fact, the prediction quality of CountryCause MOGP is as competitive as the Multicause MOGP that simultaneously models the logmortality rates of all 5 cancer types in Czech Males with 2016 observations available. Thus, borrowing the latest information from Germany improves the prediction quality of the recent allCancer mortality rates of Czech Males. The results are as good as when we have domestic data available in Czech Republic.
Remark:
At the moment, the HCD offers causeofdeath data for less than 15 countries. Many countries do not have recent data available yet (e.g. only up to 2014), leading to limited options in terms of the selection and the number of countries we can incorporate with Czech Males in this experiment. Based on our analysis in
Huynh and Ludkovski (2021), choosing countries that are highly correlated with Czech Males is essential to maximize the prediction quality.6 Conclusion
In this article, we develop multioutput GP models for causespecific mortality modeling within a multipopulation context. With the MOGP mechanism, we are able to capture the crosscause dependencies that allow joint models to gain more predictive power over singleoutput models that treat each population independently. Among MOGP variants, SLFM offers more flexibility and is recommended for modeling heterogenous causes, such as toplevel ICD categories. Multilevel ICM is demonstrated to work well for interpretable modeling across multiple factor inputs. Our case studies show the applicability of MOGPs to understand causespecific and aggregate mortality trends, both within a country and across nations, whereby our framework is convenient for information fusion and credibility boosting.
The current work focuses on exploiting the structured Kronecker covariance to efficiently learn the joint covariance kernel. This is sufficient for handling a moderately large number of subpopulations (up to 30 in our case studies); additional techniques would be needed to handle larger datasets, for example across more causesofdeath or across all the countries in HCD. There is an active research area looking at alternative methods (local kernel interpolation, inducing points, block structures, etc., see e.g.,
Flaxman et al. (2015)) for massive scalable GP wellsuited for gridded mortality datasets. Another methodological extension would be to consider a Linear Coregionalization Model (LCM) for mortality modeling. LCM generalizes ICM and SLFM and allows multiple latent functions from GP priors with different covariance kernels. A third direction would be to investigate other kernel families, such as composite kernels or kernels that can incorporate structural changes. The latter would be useful to model (sub)causes that exhibit strong disruptions over time in their mortality trends.Appendix A Additional Plots for the Case Study on Cancer SubTypes
Year ranges available in the HCD by country
Age patterns of logmortality rates of different cancers
Kernel Hyperparameter Learning
Searching for optimal hyperparameters in GP can be challenging if the marginal likelihood features multiple local maxima or flattens around its global maximum (Rasmussen and Williams 2005). When the optimizer fails to find the global maximum, unsuitable lengthscales in Age and Year sometimes result, leading to a poor fit of the data. Table 3 reports the inferred lengthscale in Age () and Year () of the SOGP models fitted on the 5 Cancer types for the Male populations in the three considered countries. Many SOGP models have being too short (less than 5 or so), resulting in oscillatory fitted ’s. Such models lack the ability to distinguish between true signal and the inherent randomness in the data. Similarly, when the estimated lengthscales are too large, the fitted GP surfaces are oversmoothed. Joint models tend to better learn the hyperparameters by enabling data fusion across multiple populations and utilizing more observations. In Table 3 we show that when we fit multicause MOGP (both ICM and SLFM) on all 5 cancer types, the resulting lengthscales are all well calibrated.
Czech Rep.  SOGP on each cancer type  Multicause ICM  Multicause SLFM  
Stomach  Colorectal  Pancreas  Lung  Others  ()  ()  
159.92  16.32  6.21  17.69  12.52  16.90  21.26  19.67  
38.49  9.03  7.82  13.87  3.69  10.12  13.05  11.42 
Germany  SOGP on each cancer type  Multicause ICM  Multicause SLFM  
Stomach  Colorectal  Pancreas  Lung  Others  ()  ()  
8.65  3.72  8.99  0.00  2.79  8.59  11.75  9.46  
7.35  4.90  7.69  4.82  4.44  10.26  9.92  6.77 
Poland  SOGP on each cancer type  Multicause ICM  Multicause SLFM  
Stomach  Colorectal  Pancreas  Lung  Others  ()  ()  
23.96  16.28  5.72  15.25  22.80  16.98  16.20  21.89  
253.34  3.69  6.19  9.04  6.01  12.83  11.19  12.11 
Training Designs in Notched Datasets
GP models  Outcome variable  Country  Year  Pred. Type  Abbrev.  
SOGP  Allcancer logmortality  Czech R.  1998–2016  Insample  SOGP (’98’16)  
1998–2015  Outofsample  SOGP (’98’15)  
Multicause GP 

Czech R.  1998–2016  Insample  Multicause (’98’16)  
1998–2015  Outofsample  Multicause (’98’15)  
CountryCause GP 

Czech R.  19982015  Outofsample  CountryCause  
Germany  19982016 
Illustrating Latent Factor Loadings in SLFM
Appendix B Additional Plots for the US TopLevelCause Study
Adjusting Drug Overdose Trend
Crosscorrelation Matrices for the US AllCause Analysis
References
 Alai et al. [2018] Daniel H. Alai, Séverine Arnold (Gaille), Madhavi Bajekal, and Andrés M. Villegas. Mind the gap: a study of causespecific mortality by socioeconomic circumstances. North American Actuarial Journal, 22(2):161–181, 2018.
 Alvarez et al. [2012] Mauricio A. Alvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vectorvalued functions: a review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.
 Anderson et al. [2001] Robert N Anderson, Arialdi M Miniño, Donna L Hoyert, and Harry M Rosenberg. Comparability of cause of death between ICD9 and ICD10: preliminary estimates. National Vital Statistics Report, 49(2):1–32, May 2001.
 Arash et al. [2020] Etemadi Arash, Shakeri Ramin, Safiri Saeid, and Sadaf G Sepanlou. The global, regional, and national burden of stomach cancer in 195 countries, 1990–2017: a systematic analysis for the global burden of disease study 2017. The Lancet Gastroenterology and Hepatology, 5(1):42–54, 2020.
 Arnold (Gaille) and Sherris [2013] Séverine Arnold (Gaille) and Michael Sherris. Forecasting mortality trends allowing for causeofdeath mortality dependence. North American Actuarial Journal, 17(4):273–282, 2013.
 BergeronBoucher et al. [2017] MariePier BergeronBoucher, Vladimir CanudasRomo, James E. Oeppen, and James Vaupel. Coherent forecasts of mortality with compositional data analysis. Demographic Research, 37(17):527–566, 2017.
 Bonilla et al. [2008] Edwin V Bonilla, Kian M. Chai, and Christopher Williams. Multitask Gaussian Process prediction. In Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.
 Boumezoued et al. [2019] Alexandre Boumezoued, JeanBaptiste Coulomb, Al Klein, Damien Louvet, and Eve Titon. Modeling and forecasting causeofdeath mortality. Technical report, Society of Actuaries, 2019.
 Carpenter et al. [2017] Bob Carpenter, Andrew Gelman, Matthew D.Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
 Caruana [1997] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, Jul 1997.
 Caselli [1996] Graziella Caselli. Future longevity among elderly populations. In Health and mortality among elderly populations, pages 235––265. Oxford University Press, 1996.
 Caselli et al. [2019] Graziella Caselli, Jacques Vallin, and Marco Marsili. How useful are the causes of death when extrapolating mortality trends. an update. In Tommy Bengtsson and Nico Keilman, editors, Old and New Perspectives on Mortality Forecasting, pages 237–259. Springer International Publishing, 2019.
 Deville et al. [2019] Yves Deville, David Ginsbourger, and Olivier Roustant. Contributors: Nicolas Durrande. kergp: Gaussian Process Laboratory, 2019. URL https://CRAN.Rproject.org/package=kergp. R package version 0.5.0.
 Dimitrova et al. [2013] Dimitrina S. Dimitrova, Steven Haberman, and Vladimir K. Kaishev. Dependent competing risks: cause elimination and its impact on survival. Insurance: Mathematics and Economics, 53(2):464–477, 2013.

Dong et al. [2020]
Yumo Dong, Fei Huang, Honglin Yu, and Steven Haberman.
Multipopulation mortality forecasting using tensor decomposition.
Scandinavian Actuarial Journal, 2020(8):754–775, 2020.  Enchev et al. [2017] Vasil Enchev, Torsten Kleinow, and Andrew J. G. Cairns. Multipopulation mortality models: fitting, forecasting and comparisons. Scandinavian Actuarial Journal, 2017(4):319–342, 2017.
 Flaxman et al. [2015] Seth Flaxman, Andrew Wilson, Daniel Neill, Hannes Nickisch, and Alex Smola. Fast Kronecker inference in Gaussian processes with NonGaussian likelihoods. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 607–616, Lille, France, 07–09 Jul 2015. PMLR.
 Foreman et al. [2018] Kyle Foreman, N. Marquez, A. Dolgert, K. Fukutaki, N. Fullman, M. McGaughey, Martin A Pletcher, A. Smith, Kendrick Tang, ChunWei Yuan, J. Brown, Joseph Friedman, J. He, Kyle Heuton, Mollie Holmberg, Disha J Patel, P. Reidy, Austin Carter, Kelly M Cercy, Abigail Chapin, D. DouwesSchultz, Tahvi D Frank, F. Goettsch, P. Liu, Vishnu Nandakumar, M. Reitsma, Vincent Reuter, Nafis Sadat, Reed J. D. Sorensen, V. Srinivasan, R. Updike, Hunter York, Alan D. Lopez, R. Lozano, Stephen S. Lim, A. Mokdad, S. Vollset, and C. Murray. Forecasting life expectancy, years of life lost, and allcause and causespecific mortality for 250 causes of death: reference and alternative scenarios for 2016–40 for 195 countries and territories. Lancet (London, England), 392:2052 – 2090, 2018.
 Gilboa et al. [2015] Elad Gilboa, Yunus Saatçi, and John P. Cunningham. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):424–436, 2015.
 Guibert et al. [2019] Quentin Guibert, Olivier Lopez, and Pierrick Piette. Forecasting mortality rate improvements with a highdimensional VAR. Insurance: Mathematics and Economics, 88:255–272, 2019.
 HCD [2021] HCD. The Human CauseofDeath Database. French Institute for Demographic Studies (France) and Max Planck Institute for Demographic Research (Germany), 2021. URL www.causeofdeath.org.
 Huynh and Ludkovski [2021] Nhan Huynh and Mike Ludkovski. Multioutput Gaussian processes for multipopulation longevity modelling. Annals of Actuarial Science, 15(2):318–345, 2021.
 Huynh et al. [2020] Nhan Huynh, Mike Ludkovski, and Howard Zail. Multipopulation longevity models: a spatial random field approach. Proceedings of the Society of Actuaries 2020 Living to 100 Symposium, 2020.
 Hyndman et al. [2013] Rob J Hyndman, Heather Booth, and Farah Yasmeen. Coherent mortality forecasting: the productratio method with functional time series models. Demography, 50(1):261–283, 2013.
 Kjaergaard et al. [2019] Søren Kjaergaard, Yunus Emre Ergemen, Malene KallestrupLamb, Jim Oeppen, and Rune LindahlJacobsen. Forecasting causes of death by using compositional data analysis: the case of cancer deaths. Journal of the Royal Statistical Society: Series C (Applied Statistics), 68(5):1351–1370, 2019.
 Kleinow [2015] Torsten Kleinow. A common age effect model for the mortality of multiple populations. Insurance: Mathematics and Economics, 63:147–152, 2015.
 Knudsen and McNown [1993] Christin Knudsen and Robert McNown. Changing causes of death and the sex differential in the USA: Recent trends and projections. Population Research and Policy Review, 12(1):27–41, 1993.
 Letham and Bakshy [2019] Benjamin Letham and Eytan Bakshy. Bayesian optimization for policy search via onlineoffline experimentation. Journal of Machine Learning Research, 20(145):1–30, 2019.
 Li and Lu [2017] Hong Li and Yang Lu. Coherent forecasting of mortality rates: A sparse vectorautoregression approach. ASTIN Bulletin: The Journal of the IAA, 47(2):563–600, 2017.
 Li and Lu [2019] Hong Li and Yang Lu. Modeling causeofdeath mortality using hierarchical Archimedean copula. Scandinavian Actuarial Journal, 2019(3):247–272, 2019.
 Li and Lee [2005] Nan Li and Ronald Lee. Coherent mortality forecasts for a group of populations: An extension of the LeeCarter method. Demography, 42(3):575–594, 2005.
 Liu et al. [2019] Haitao Liu, YewSoon Ong, Xiaobo Shen, and Jianfei Cai. When Gaussian process meets big data: A review of scalable GPs, 2019. Arxiv 1807.01065.
 Lo and Wilke [2010] Simon M. S. Lo and Ralf A. Wilke. A copula model for dependent competing risks. Journal of the Royal Statistical Society. Series C (Applied Statistics), 59(2):359–376, 2010.
 Ludkovski et al. [2018] Mike Ludkovski, Jimmy Risk, and Howard Zail. Gaussian process models for mortality rates and improvement factors. ASTIN Bulletin: The Journal of the IAA, 48(3):1307–1347, 2018.
 Lyu et al. [2021] Pintao Lyu, Anja De Waegenaere, and Bertrand Melenberg. A multipopulation approach to forecasting allcause mortality using causeofdeath mortality data. North American Actuarial Journal, 25(1):S421–S456, 2021.
 McNown and Rogers [1992] Robert McNown and Andrei Rogers. Forecasting causespecific mortality using time series methods. International Journal of Forecasting, 8(3):413–432, 1992.
 Rasmussen and Williams [2005] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
 Saatçi [2011] Yunus Saatçi. Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge, 2011.
 Tabeau et al. [1999] Ewa Tabeau, Peter Ekamper, Corina Huisman, and Alinda Bosch. Improving overall mortality forecasts by analysing causeofdeath, period and cohort effects in trends. European Journal of Population / Revue Européenne de Démographie, 15(2):153–183, 1999.

Teh et al. [2005]
Y. Teh, M. Seeger, and Michael I. Jordan.
Semiparametric latent factor models.
In
International Workshop on Artificial Intelligence and Statistics (AISTATS)
, pages 333–340. PMLR, 2005.  Tsai and Zhang [2019] Cary ChiLiang Tsai and Ying Zhang. A multidimensional Bühlmann credibility approach to modeling multipopulation mortality rates. Scandinavian Actuarial Journal, 2019(5):406–431, 2019.
 WHO [2021] WHO. Cancer, 2021. URL https://www.who.int/newsroom/factsheets/detail/cancer.
 Williams et al. [2009] Christopher Williams, Stefan Klanke, Sethu Vijayakumar, and Kian Chai. Multitask Gaussian process learning of robot inverse dynamics. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
 Wilmoth [1995] John R. Wilmoth. Are mortality projections always more pessimistic when disaggregated by cause of death? Mathematical Population Studies, 5(4):293–319, 1995.
 Wojtyś et al. [2014] Piotr Wojtyś, Andrzej Antczak, and Dariusz Godlewski. Predictions of cancer mortality in Poland in 2020. Central European Journal of Medicine, 9:667–675, 2014.
 Zhe et al. [2019] Shandian Zhe, Wei Xing, and Robert M. Kirby. Scalable highorder Gaussian process regression. In Proceedings of the TwentySecond International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 2611–2620. PMLR, 16–18 Apr 2019.