Let be a continuous, strictly decreasing function from to such that . Let be the inverse or the pseudo-inverse of , where the latter is defined as zero for . If the generator is called strict. An Archimedean copula with generator is a function from to defined as
There are many properties that characterize Archimedean copulas, for instance, they are symmetric, associative and their diagonal section is always less than for all . Generators are usually parametric families defined by a single parameter. Most of them are summarised in Nelsen (2006, Table 4.1).
Association measures induced by Archimedean copulas are a function of the generator. For instance, Kendall’s tau becomes
where denotes right derivative of at .
In this work we propose a Bayesian semiparametric generator defined through a quadratic spline. Within a survival analysis context, we model the first derivative of a hazard rate function with a piecewise constant function. The hazard rate and the cumulative hazard functions become linear and quadratic continue functions, respectively. The induced survival function is used as an inverse generator for an Archimedean copula. Convexity constrains are properly addressed and inference on the model is done under a Bayesian approach.
Other studies on semiparametric generators for Archimedean copulas can be found in Genest and Rivest (1993) where their model is based on an empirical Kendall’s process. A new approach and extensions of this latter methodology can be found in Genest et al. (2011). In Guillote and Perron (2015)
the model arises from the one-to-one correspondence between an Archimedean generator and a distribution function of a nonnegative random variable. In particular they use a mixture of Pólya trees as a prior for the corresponding distribution function under a Bayesian nonparametric approach. In a work more related to ours,Vanderhende and Lampert (2005)
use the relationship between quantile functions and Archimedean generators to define a semiparametric generator by supplementing a parametric generator withdependence parameters. Differing to their work, our model is not based on any parametric generator and the Kendall’s tau can take values on the whole interval .
The contents of the rest of the paper is as follows. In Section 2 we present our proposal and characterise its properties. In Section 3 we provide details of how to make posterior inference under a Bayesian approach. In Section 4 we illustrate the performance of our model with a simulation study as well as with a real data set. We conclude with some remarks in Section 5.
To define our proposal we realise that is a decreasing function from to , so it behaves as a survival function, in a failure time data analysis context (e.g. Klein and Moeschberger, 2003). The idea is to propose a semi/non parametric form for the inverse generator by using survival analysis ideas. For that we recall some basic definitions.
Let be a nonnegative function with domain in such that as . Then is a decreasing function from to , so it behaves like an inverse generator . In a survival analysis context, functions , and are the hazard rate, cumulative hazard and survival functions, respectively.
In particular, if , i.e. constant for all , then . If we take , then . Using (1) we obtain that the resulting copula is the independence copula, and what is interesting, is that it does not depend on .
Using these ideas we construct a semiparametric generator in the following way. We first consider a partition of size of the positive real line, with interval limits given by . Then, we define the first derivative of the hazard rate, as a piecewise constant function of the form
where . We recover the hazard rate function as , where is an initial condition. Using (3), the hazard rate becomes a piecewise linear function of the form
where and , for .
Integrating now the hazard function (4), the cumulative hazard is a piecewise quadratic function given by
where and , for .
We therefore define a semiparametric inverse generator as the induced survival function, which can be written as
where is given in (5).
After doing some algebra, we can invert this function to obtain an expression for the generator
The value controls the flexibility of the generator, and thus of the copula. If , the induced Archimedean copula is the independent copula, whereas for larger , the generator, and the induced copula, become more nonparametric. Potentially could be infinite.
We now discuss some properties of our semiparametric generator.
Consider the semiparametric inverse generator , given in (6), and the corresponding generator , given in (LABEL:eq:phi), and assume that are such that , and satisfy conditions (C1) and (C2) given by
, for and for all .
, for and for all .
and are continuous functions of ,
is a convex function,
a strict generator.
Proof For (i) we know that , as in (3), is a piecewise constant discontinuous function, however, function , as in (4), is continuous. To see this, for each , the limit from the left is , and the limit from the right becomes . Since , then both limits coincide. For (ii) we take the second derivative of which becomes , this is positive if and only if . For this to happen we require condition . For (iii), must be a proper survival function, that is, must be nonnegative, which is achieved by imposing condition . Furthermore, we need , which is equivalent to prove that . This is true since is a finite constant, and , so the linear part goes to infinity when .
To see the kind of association induced by our proposal, we computed the Kendall’s tau using expression (2) with generator (LABEL:eq:phi). This is given in the following result.
The Kendall’s tau obtained by the Archimedean copula with semiparametric generator (LABEL:eq:phi) is given by
Moreover, this .
Proof Rewriting expression (2) in terms of the inversed generator we obtain . Computing the derivative we get . Doing the integral we obtain the expression. To obtain the range of possible values of it is easier to re-write in terms of and . This becomes . Here it is straightforward to see that the sign of is determined by the sign of , therefore for all implies and implies .
The expression for tells us that the concordance induced by our semiparametric copula is a function of both, the parameters , as well as of the partition limits . It depends on a definite integral and can be evaluated numerically. What is more important is that covers the whole range from to , showing that our proposal is very flexible.
To illustrate the flexibility of our model we define a partition of the positive real line of size , such that for . We consider two scenarios for the values of the parameters . The first scenario is defined by for all , whereas the second scenario contains for all . Conditions and were satisfied in both cases. Figure 1 contains functions , and for two different scenarios, the solid (blue) line corresponds to the first scenario and the dotted (red) line to the second scenario. In the first case the corresponding hazard function (middle panel) is decreasing, whereas for the second case the hazard function is increasing. The induced concordance values are and , respectively.
As a second example, we consider a partition of size , such that for . We consider three different scenarios for the parameters with , respectively. In the first scenario we assume , in the second and in the third . Posteriorly, we define sequentially with and constants such that constrains and are satisfied, for and . We repeated sampling from these distributions a total of 5,000 times, and for each repetition we computed . The induced histogram densities for the three scenarios are presented in Figure 2. For the first scenario, the values of range from to , showing that our model can capture both negative and positive concordance measures. For the second scenario, the values of
are all positive and the distribution is right skewed, and for the third scenario the values ofare all negative showing a left skewed distribution.
According to (Nelsen, 2006), new generators can be defined if we apply a scale transformation of the form iff , for , where becomes a new Archimedean copula generator. More recently, (Di Bernardino and Rullière, 2013) realised that the new generator induces exactly the same copula (1) as that obtained with . To see this we have that . In other words, an Archimedean copula generator is not unique.
Moreover, in terms of the hazard rate functions, and , induced by generators and , respectively, the relationship becomes . In order to make our semiparametric generator identifiable, without loss of generality, we impose the new constraint
This constraint is equivalent to impose in definition (4).
3 Posterior inference
The copula density , of an Archimedean copula, can be obtained by taking the second crossed derivatives with respect to and in expression (1). In terms of the generator and its inverse this density becomes
where the single and double primes denote first and second derivatives, respectively, and are given by
Let , be a bivariate sample of size from defined in (8). With this we can construct the likelihood for as , where we have made explicit the dependence on in the notation of the copula density. Recall that the parameters must satisfy several conditions, and given in Proposition 1, to make our generator unique, and .
We assume a prior distribution for the ’s of the form
independently for .
Note that we explicitly allow the ’s, for
to be zero with positive probability. This prior choice is useful to define an independence test. Specifically, the hypothesis independent is equivalent to
. To perform the test we can compute the posterior probability of
and its complement and make the decision, say via Bayes factors(Kass and Raftery, 1995).
The posterior distribution of is simply given by the product of expressions (8) and (9), up to a proportionality constant. It is somehow easier to characterize the posterior distribution by implementing a Gibbs sampler (Smith and Roberts, 1993) and sampling from the conditional posterior distributions
for . However, sampling from conditional distributions (10) is not trivial, we therefore propose a Metropolis-Hastings step (Tierney, 1994) by sampling at iteration from a random walk proposal distribution
where the interval represents the conditional support of , is its length, with , for , , for , and . The justification of these bounds obeys the inclusion of constraints and and their derivations are given in Appendix A. The parameters and are tuning parameters that control de acceptance rate.
Therefore, at iteration we accept with probability
4 Numerical studies
We illustrate the performance of our model in two ways, through a simulation study, and with a real data set.
To define the partition of the positive real line we consider a Log- partition defined by for , with . This partition is the result of transforming a uniform partition in the interval via a convex function. In particular we inspired ourselves in the generator of the product copula. Larger values of increase the spread of the partition along the positive real line.
4.1 Simulation study
We generated simulated data from four parametric Archimedean copulas, namely the product, Clayton, Ali-Mikhail-Haq (AMH) and Gumbel copulas. Their features are summarised in Table 1, where we include the parameter space, the generator, the inverse generator, an indicator whether the copula is strict or not and the induced function obtained through inversion of relationship (6).
For each parametric copula we took a sample of size . To specify the copulas we took for the Clayton copula, for the AMH copula, and for the Gumbel copula. For the partition size we compared and tried values .
For the prior distributions (9) we took and . We implemented a MH step with-in the Gibbs sampler where the proposal distributions were specified by and . The acceptance rate attained with these specifications are around 30%, which according to (Robert and Casella, 2010) are optimal for random walks. Finally, the chains were ran for 20,000 iterations with a burn-in of 2,000 and keeping one of every 5
iteration to produce posterior estimates.
To assess goodness of fit (GOF) we computed several statistics. The logarithm of the pseudo marginal likelihood (LPML), originally suggested by (Geisser and Eddy, 1979), to assess the fitting of the model to the data. The supremum norm, defined by to assess the discrepancy between our posterior estimate (posterior mean) from the true inverse generator . We also computed the Kendall’s tau coefficient and compare the posterior point and 95% interval estimates with the true value. These values are shown in Tables 2 to 7. Although we fitted our model with all values of mentioned above, we only show results for some of them in the tables.
Note that, due to the nonunicity of an Archimedean generator, an equivalent constraint to has to be imposed to the parametric generators that we are comparing to. That is we set for the product, Clayton and AMH copulas, and for the Gumbel copula, for say . The difference in the latter case is because, for a Gumbel copula, when . These conditions are already included in the parametrisation used in Table 1.
For the product copula the GOF measures are presented in Table 2. With exception of the partition Log- for , for all settings considered, the true
lies inside the 95% credible intervals. The LPML chooses the model with Log-partitions of size , and corresponds to the second smallest value of the supremum norm. Posterior estimates of functions and are shown in Figure 3. In both cases the true function lies inside the 95% credible intervals.
For the Clayton copula we have two choices of , and . The first choice, , corresponds to a generator that is not strict, that is, for , and for . This is an interesting challenge because our model defined only strict generators. The settings with smallest supremum norm, Log- with , produces the 95% credible interval for closest to the true value, however it does not achieve the largest LPML. The inconsistency of the GOF measures might be due to the non strictness feature of the true generator. Moreover, if we look at the graphs of the posterior estimates of and (Figure 4), for larger values of the true functions lie outside of our posterior estimates. For , the best model is obtained with a Log- partition of size . In this case, posterior estimates of functions and with the best fitting (Figure 5), contain the true functions.
For the AMH copula we have two values of , and . The best fitting consistently chosen by the three GOF criteria is obtained with a Log- and Log- partitions of size , respectively for the two values of . Posterior estimates of functions and with the best fitting are shown in Figures 7 and 6, respectively. In all cases the true functions lie within the 95% credible intervals.
For the Gumbel copula with we have an interesting behaviour. The true function has the feature that . This represents a challenge for our model since we have imposed the constrain which is equivalent to . The highest LPML value is obtained with a Log- partition of size , however the posterior 95% credible interval for does not contain the true value. On the other hand, the second best value of LPML is obtained with an Log- partition of size , and in this case the 95% credible interval for does contain the true value. We select this latter as the best fitting. Posterior estimates of functions and are shown in Figure 8. Recalling that the true hazard function goes asymptotically to infinity when , therefore, for values close to zero the true lies outside our posterior credible intervals, something similar happens in the estimates of the inverse generator. Apart from this, our posterior estimates are very good for .
An important learning from the previous examples is that increasing the partition size doe not necessarily implies a better fitting.
4.2 Real data analysis
In public health it is important to study the factors that determine the birth weight of a child. Low birth weight is associated with high perinatal mortality and morbility (e.g. Stevens-Simon and Orleans, 2001). We study the dependence structure between the age of a mother () and the weight of her child (), and concentrated on mothers of 35 years old and above. The dataset was obtained from the General Hospital of Mexico through the opendata platform that can be accessed at https://datos.gob.mx/busca/dataset/perfiles-metabolicos-neonatales/resource/4ab603eb-b73a-498f-8c56-0dc6d21930e8. It contains records of the neonatal metabolic profile of male babies registered in the year 2017 in Mexico City.
The marginal distributions for variables and , induced by copula (1), are uniform. In practice, copulas are used to model the dependence for any pair of random variables regardless of their marginal distributions. Let and be two random variables with marginal cumulative distributions and
respectively. Then the joint cumulative distribution function foris obtained as (Sklar, 1959), , where is given in (1).
Since we are just interested in modelling the dependence between and , it is common in practice to transform the original data, , , to the unit interval via a modified rank transformation (Deheuvels, 1979) in the following way. Let and then and are the transformed data, where iff for . This is based on the probability integral transform using the empirical cumulative distribution function of each coordinate.
In Figure 9 we show a dispersion diagram of the original data (left panel) and the rank transformed data (right panel). To avoid problems due to ties in the original data, we fist include a perturbation to the data by adding a uniform random variable to each coordinate. The sample Kendall’s tau value for the transformed data is .
We fitted our model to the transformed data with the following specifications. To define the partitions we took values with sizes . For the prior we took , and . The MCMC specifications were the same as those used for the simulated data.
The GOF measures computed were the LPML and the posterior estimates (point and 95% credible interval) of . The results are reported in Table 8. The best fitting model according to LPML is that obtained with a partition of size and Log-. The sample concordance is included in our posterior 95% credible interval estimate .
The estimated hazard rate function and the inverse generator , with the best fitting model, are included in Figure 9. The solid thick line corresponds to the point estimates and the solid thin lines to the 95% credible intervals. For a visual comparison, the blue dotted line corresponds to the function of the independence (product) copula. This confirms that there is a negative (weak) dependence between the age of the mother and the birth weight of the child. The older the mother, the less weight of the child. This finding could potentially help the policy makers to focus campaigns to help the awareness of future mothers.
), such that the prior probability ofis around , in other words, we want . Particularly, for a partition of size we need to specify . In order to get a point of mass proposal in the Metropolis-Hasting step we consider . We re-ran our model using these values with the other specifications left unchanged. The posterior probability of becomes , which leads to an inconclusive test. This might be explained by the fact that although the association is negative, it is very close to zero. To calibrate our independence test, we consider the simulated data from previous section of two copulas, the product copula and the Clayton copula with . In these cases the best fitting was obtained with a partition of size , so we chose the same prior values as for the real data to perform an independence test. Posterior probabilities of are and , respectively, showing that for independent data the posterior probability is a lot larger that 0.5, whereas for clearly dependent data the posterior probability of independence is zero.
5 Concluding remarks
We have proposed a semiparametric Archimedean copula that is flexible enough to capture the behaviour of several families of parametric arquimedean copulas. Our model is capable of modelling positive and negative dependence. The number of parameters in the model to produce a good estimation of the dependence in the data should not be extremely high. For most of the examples considered here 10 parameters are enough.
Defining an appropriate partition to analyse real data sets is not trivial. We suggest to try different values of in a wide range and compare using a GOF criteria like the LPML we used here.
In the exposition and in examples considered here, we concentrated on bivariate copulas, however extensions to more than two dimensions is also possible, say . Performance of our semiparametric copula in this multivariate setting is worth studying.
Our model is motivated by semiparametric proposals for survival analysis functions (Nieto-Barajas and Walker, 2002) and appropriately modified to satisfy the properties of an Archimedean generator. The semiparametric generator presented here turned out to be based on quadratic splines, however, alternative proposals are possible.
Instead of starting with a piecewise function for the derivative of a hazard rate, we could start by defining a piecewise constant function for the hazard rate itself. That is with , and a partition of the positive real line. In this case the cumulative hazard function becomes , for , with . The inverse generator is then a linear spline of the form
and the corresponding Archimedean generator has the form
with . To ensure convexity of the generator we further require . Furthermore, the Kendall’s tau has a simpler expression
However, it can be shown that this expression for the Kendall’s tau only allows positive values, constraining the possible associations captured by the model.
This research was supported by Asociación Mexicana de Cultura, A.C.
Appendix A Derivation of posterior conditional support of
In order to satisfy constraint , we consider first the case . Therefore
. This implies the following constraint for ,
for , where we define the empty sum as zero.
On the order hand, if we have and we get, from condition , the following restriction
This defines the upper bound , for , and .
Because the term appears on the right side of the previous inequality for , we need to consider the following restriction
if . Combining this with the constraint when above, we get the lower bound for .
- Deheuvels (1979) Deheuvels, P. (1979). La fonction de dépendance empirique et ses proprietés. Un test non paramétrique d’indépendance. Roy. Belg. Bull. Cl. Sci., 65, (5), 274–292.
- Di Bernardino and Rullière (2013) Di Bernardino, E. and Rullière, D. (2013). On certain transformation of Archimedean copulas: Application to non-parametric estimation of their generators. Dependence Modeling 1, 1–36.
- Geisser and Eddy (1979) Geisser, S. and Eddy, W. F. (1979). A predictive approach to model selection. Journal of the American Statistical Association 74, 153–160.
- Genest et al. (2011) Genest, C., Nešlehová J. and Ziegel J. (2011). Inference in multivariate Archimedean copula models. TEST 20: 223. https://doi.org/10.1007/s11749-011-0250-6.
- Genest and Rivest (1993) Genest, C. and Rivest, L. (1993). Statistical inference procedures for bivariate Archimedean copulas . Journal of the American Statistical Association, 88, 1034-1043.
- Guillote and Perron (2015) Guillote, S. and Perron, J. (2015). Inference on Archimedean copulas using mixtures of Pólya trees. Journal of Statistical Planning and Inference, 166, 2-13.
- Kass and Raftery (1995) Kass, R.E. and Raftery, A.E. (1995). Bayes Factors. Journal of the American Statistical Association 90, 773–795.
- Klein and Moeschberger (2003) Klein, J.P. and Moeschberger, M.L. (2003). Survival analysis. Springer, New York.
- Nelsen (2006) Nelsen, R.B. (2006). An introduction to copulas. Springer, New York.
- Neuhaus (1971) Neuhaus, G. (1971). On weak convergence of stochastic processes with multidimensional time parameter. The Annals of Mathematical Statistics , 42 (4), 1285-1295.
- Nieto-Barajas and Walker (2002) Nieto-Barajas, L.E. and Walker, S.G. (2002). Markov beta and gamma processes for modeling hazard rates. Scandinavian Journal of Statistics 29, 413–424.
- Robert and Casella (2010) Robert, C. P. and Casella, G. (2010). Introducing Monte Carlo methods with R. Springer, New York.
- Sklar (1959) Sklar, M. (1959). Fonctions de répartition á dimensions et leurs marges. Université Paris 8.
Smith and Roberts (1993)
Smith, A. and Roberts, G. (1993). Bayesian computations via the Gibbs sampler and related Markov chain Monte Carlo methods.Journal of the Royal Statistical Society, Series B 55, 3-–23.
- Stevens-Simon and Orleans (2001) Stevens-Simon, C. and Orleans, M. (2001). Low-birthweight prevention programs: The enigma of failure. Birth 26, 184–191.
- Tierney (1994) Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics 22, 1701-–1722.
- Vanderhende and Lampert (2005) Vanderhende, F. and Lampert, P. (2005). Local dependence estimation using semiparametric Archimedean copulas. The Canadian Journal of Statistics 33, 377–388.