Graphical models are an efficient tool for modeling and inference in high dimensional settings. Directed acyclic graph (DAG) models, known as Bayesian networks(Lauritzen, 1996), are often used to model asymmetric cause-effect relationships. Models represented by undirected graphs are used to model symmetric relationships, for instance gene regulatory networks.
Some graphical models are able to represent both asymmetric and symmetric relationships simultaneously. One such model so-called chain graph model (Lauritzen, 1996, Lauritzen and Wermuth, 1989) which is a generalization of directed and undirected graphical models. Chain graph models contain a mixed set of directed and undirected edges. The vertex set of a chain graph can be partitioned into chain components where edges within a chain component are undirected whereas the edges between two chain components are directed and point in the same direction. Recently, chain graph models are considered in a time series setting (Abegaz and Wit, 2013, Gao and Tian, 2010, Dahlhaus and Eichler, 2003).
There is a rich literature on reconstructing undirected graph for continuous data, categorical data, and mixed categorical and continuous data (Behrouzi and Wit, 2017, Mohammadi et al., 2015, Dobra et al., 2011, Hoff, 2007) and similarly for directed acyclic graphs (Colombo et al., 2012, Kalisch and Bühlmann, 2007). Recently, Abegaz and Wit (2013) have proposed a method based on chain graph model for analyzing time course continuous data, like gene expression data. However, many real-world time series data are not continuous, but are categorical or mixed categorical and continuous. Until now constructing dynamic networks for non-continuous time series data has remained unexplored. Here, we develop a method to explore dynamic or delayed interactions and contemporaneous interactions for time series of categorical data and time series of mixed categorical and continuous data.
The proposed method is based on chain graph models, where the ordered time steps build a DAG of blocks and each block contains an undirected network of variables under consideration at that time point. The method developed in this paper is designed to analyze the nature of interactions present in repeated multivariate time series mixed categorical and continuous data, where we use time series chain graphical models to study the conditional independence relationships among variables at a fixed time point and “causal” relationship among time series components across consecutive time steps. The concept of causality that we use is the concept of Granger causality (Granger, 1969), which exploits the natural time ordering to achieve a “causal” ordering of the variables in multivariate time series. The idea of this causality concept is based on predictability, where one time series is said to be Granger causal for another series if the latter series to be better predicted using all available information than if the information apart from the former series had been used. Our inference procedure not only enforces sparsity on interactions within each time step, but it also between time steps; this feature is particularly realistic in a real-world dynamic networks setting.
We proceed as follow: in section 2, we explain the method where we first introduce dynamic chain graph models in section 2.1, then we propose the Gaussian copula for mixed scale time series data in section 2.2. In sections 2.3 and 2.4 we define a model for underlying multivariate time series components and we explain the procedure of penalized inference based on the L1 norm and smoothly clipped absolute deviation (SCAD) penalty terms. In section 2.5 we present a method for obtaining the log-likelihood of the observed mixed scale time series component under the penalized EM algorithm and we proceed with model selection for tuning the penalty terms. In section 3 we study the performance of the proposed dynamic chain graph model under different scenarios. Furthermore, we compare its performance with the other available methods. The proposed method is demonstrated in section 4 to investigate the course of depression and anxiety disorders.
2.1 Dynamic chain graph models
A chain graph is defined as where is a set of vertices (nodes) and is a set of ordered and unordered pairs of nodes, called edges, which contains the directed and undirected interactions between pairs of nodes. A dynamic chain graph model is associated with a time series chain graph model, where the dependence structure of the time series components can be divided into two sets: intra time-slice dependencies, which are represented by undirected edges that specify the association among variables in a fixed time step, and a set of inter time-slice dependencies, which are represented by associations among variables across consecutive time steps. Links across time steps are directed pointing from a set of nodes at a previous time step, , to nodes at the current time step, . The dynamic chain graph models in our modeling framework relates the time series components at time to only that of at time , but this can be easily extended to a higher order () time steps.
Let be an p-dimensional time series vector representation of variables that have been studied longitudinally across time points. Each time series component is assumed to be sampled times. Thus, represents the value of the -th variable at time for the -th sample, , .
Here, we focus on non-Gaussian multivariate time series data such as ordinal-valued time series taking values in , where is the number of possible categories, or mixed categorical and continuous time series data, which routinely occurs in real world settings.
2.2 Gaussian Copula
To model dependencies among p-dimensional vector we use a Gaussian copula, defined as
where is the a -dimensional Gaussian cdf with correlation matrix , and . From equation (1) the following properties are clear: the joint marginal distribution of any subset of has a Gaussian copula with a correlation matrix and univariate marginals . The Gaussian copula can be expressed in terms of a latent Gaussian variable as follow
Since the marginal distributions are nondecreasing, observing implies . This can be written as set where given the observed data , the latent samples , are constrained to belong to the set
If an observed value of is missing, we define the lower bound and the upper bound of as and , respectively.
2.3 Model definition
We assume a stable dynamic chain graph model meaning that the structure of interactions within each time point remains stable for previous and current time step, and interactions between consecutive time steps are stable too. We use a vector autoregressive process of order , VAR(),
to describe the directed latent interactions, where describes the undirected instantaneous interactions.
The parameter set of this model contains all the conditional independence relationships in the dynamic chain graph model where the following terms hold: if and only if , and if and only if .
Given the set , we calculate the likelihood as
where and . Given the set of parameters, the event in (2.3) does not depends on marginals and contains the relevant information about the copula and the parameters of interest and . We drop the second term in (2.3) because this term does not provide any information about intra and inter time-slice dependencies. As Hoff (2007) proposes we use as the rank likelihood,
We ignore the second term in (2.3) as we do not want to make additional assumption on the unconditional distribution of . And we start from , where we compute the conditional log-likelihood using the conditional distribution . According to (3) the conditional distribution
follows a multivariate normal distribution
which its density for -th observation is defined as
2.4 Penalized EM inference
In Gaussian copula, we treat the marginals distributions as nuisance parameters since our main goal is to learn the dependence structure among time series components both at a fixed time step and also across consecutive time steps. We use an empirical marginal cdf (Genest et al., 1995) to estimate marginals.
Genetic time series data often are high dimensional due to a large number of variables that are measured on small number of samples across only few time steps. Furthermore, many real-world networks (e.g. genetic, genomics, and brain networks) are intrinsically sparse. Thus, incorporating sparsity into the proposed dynamic chain graph model makes the derived model more biologically plausible. Accordingly, we propose a dynamic chain graph model for genetic data based on the penalized likelihood. In order to find the penalized maximum likelihood estimation we will use the EM algorithm (Green, 1990). This modeling technique provides sparse estimates of the autoregressive coefficient matrix and the precision matrix in (3) which are used to reconstruct inter and intra time-slice conditional independences, respectively.
The E-step of the EM algorithm is given by
Under the assumption described in (6), the E-step can be written as
such that conditional expectation at current time, , and at past, , is defined as
and the conditional expectation at inter time-slice dependence is
The latent variables and is used to calculate the conditional expectation of intra time-slice dependencies and , respectively. And is used to calculate . All the three above mentioned conditional expectations are a matrix. When
they can be computed through the second moment. When we use a mean field theory approach (Chandler, 1987) to approximate them as
for intra time-slice dependencies, and for inter time-slice dependencies follows as
This approximation performs well when the interaction between and given the rest of the variables, and the interaction between and given the rest of the variables are close to be independent; this often holds in our proposed dynamic chain graph model which and are sparse.
When the off-diagonal elements of , , and matrices can be computed through the first moment as
and the second moments is
Given the property of Gaussian distribution,
follows a multivariate normal distribution with mean and variance-covariance matrix
Calculating the exact value of the first and second moments is computationally expensive. Moreover, we approximate the first and the second moments as follow
The conditional distribution of follows a multivariate normal distribution with mean and variance-covariance matrix . Due to a property of Gaussian distribution, the conditional distribution of inside the inner expectation in (15) and (16) follows a multivariate normal distribution with mean and variance-covariance matrix as follow
We remark that conditioning on , and is equivalent to
Thus, this conditional distributions follows a truncated normal on the interval which the first and second moments can be obtained via lemma 2.1.
(Johnson et al., 1995). Let such that and are true for any constants that . Then the first and second moments of the truncated normal distribution on the interval are defined as
where is the density function of the standard normal distribution.
where . Here, the first order delta method is used to approximate the nonlinear terms [more details in (Guo et al., 2015)]. Moreover, we approximate the elements of inter time-slice conditional expectation matrix through equations (2.4) and (2.4). For approximating the elements of intra time-slice conditional expectation matrices , we refer to the Appendix.
The M-step of the EM algorithm contains two-stage optimization process where we maximize expectation of the penalized log-likelihood with respect to and . We introduce two different penalty functions and for intra time-slice conditional independencies , and inter time-slice conditional independencies , respectively. Therefore, the objective function for optimization can be defined as
where denotes the expectation of given the data and updated parameters, and and are the -th element of the and matrices. Among different penalty functions, we consider the norm and smoothly clipped absolute deviation (SCAD) penalty functions which have the desirable sparsity properties.
The Lasso or penalty function is defined as
The penalty leads to a desirable optimization problem, where the log-likelihood is convex and can efficiently be solved using various optimization algorithms at the -th iteration of the EM. Under this penalty function, the updated estimates are given via
where the sparsity level of intra and inter time-slice conditional independences are controlled by and . penalty is biased due to its constant rate of penalty. To address this issue, Fan and Li (2001)
proposed SCAD penalty, which results in unbiased estimates for large coefficients.
SCAD penalized EM.
The SCAD penalty function is expressed as
where and are two tunning parameters. The function corresponds to a quadratic spline on with knots at and . A similar function can be written for where and are two knots. The SCAD penalty is symmetric but non-convex, whose first order derivative is given by
The notation stands for the positive part of . Fan and Li (2001) showed that in practice is a good choice. Maximizing non-convex penalized likelihood is challenging. To address this issue, we use an efficient algorithm proposed in Fan et al. (2009), which is based on local linear approximation, to maximize the penalized log-likelihood for the SCAD penalty function. In each its step, a symmetric linear function is used to locally approximates the SCAD penalty. Using the Taylor expansion, and can be approximated in the neighbor of and as follow:
Due to the monotonicity of and over , the derivatives and are non-negative for and . Therefore, under the penalized log-likelihood with SCAD penalty, the estimate of the sparse parameters and relies on the solution of the following optimization problem at step
where , , and , are -th element of and -th element of , respectively. The SCAD penalty applies a constant penalty to large coefficients, whereas the penalty increases linearly as increases. This features keep the SCAD penalty against producing biases for estimating large coefficients. Therefore, the SCAD penalty overcome the bias issue of the penalty. Then a two stage-optimization problem within the M-step of the EM algorithm is employed to solve the objective functions (20) or (21) to estimate the parameters and .
Glasso calculation of .
For the SCAD penalty-based estimation, in the first stage we optimize
for previous . This optimization can be solved efficiently using the graphical lasso algorithm proposed by Friedman et al. (2008). Due to the sparsity in each iteration, we consider a one-step local linear approximation algorithm (LLA). Zou and Li (2008) showed that one-step LLA, asymptotically, performs as well as the fully iterative LLA algorithm as long as initial solution is good enough. In practice, we take the initial value as the penalty graphical LASSO for estimating the intra time-slice conditional independences in order to calculate the initial weights