Matrix-variate data are routinely collected in many fields. As the collection of these types of data expands, so does the need for new statistical methods that can capture the shared and individual-specific structure in multiple matrices. This is particularly the case when matrices in a collection consist of multivariate observations collected over time. Here, we are motivated by the challenge of measuring social coordination between two people who are interacting with one another. In such cases, multiple facial and body features are extracted from videos of each individual in the pair over time. The data for each individual form a matrix, with the columns corresponding to different time points. One component of the variability in the two matrices will be attributable to shared structure that relates to how much people like each other and cooperate (lakin2003using; johnston2002behavioral; marsh2016imitation). Another component will be attributable to variability specific to each individual. Analyzing these paired dynamic matrix-variate data requires a strategy that can accommodate two significant challenges: 1) complex multivariate dependence among variables, and 2) dynamic time-varying lags between the two multivariate time series. Although our motivating example is from human social interactions, similar challenges are posed by other types of paired multivariate data, such as those collected in animal behavior studies, cellular imaging studies, finance, or speech, gesture, and handwriting recognition.
Joint and Individual Variation Explained (JIVE) (lock2013joint) and Common and Individual Feature Analysis (CIFA) (zhou2016group) were developed to capture shared and individual-specific features in pairs of multivariate matrices. In the case of JIVE, the data matrices $X_i$ are decomposed into three parts, $X_i = J_i + A_i + E_i$: a low-rank approximation of joint structure $J_i$, a low-rank approximation of individual variation $A_i$, and an error $E_i$, under the restriction $J A_i^{T} = 0$ for all $i$. Here $J$ is the matrix stacking the $J_i$'s on top of each other. The CIFA decomposition defines a matrix factorization problem, $\min \| X_i - B_c F_{c,i} - B_i F_i \|_F^2$, under the restriction that $B_c^{T} B_i = 0$ for all $i$, with $\| \cdot \|_F$ denoting the Frobenius norm. Thus JIVE assumes orthogonal rows between $J$ and the $A_i$'s, whereas CIFA assumes orthogonality between the columns of $B_c$ and $B_i$. Extensions of these methods are proposed in li2018general and feng2018angle. Related approaches have been used in behavioral research (schouteden2014performing), genomic clustering (lock2013bayesian; ray2014bayesian), neuroimaging (zhou2016linked) and railway network analysis (jere2014extracting). In most cases, frequentist frameworks are used for inference, the methods are not likelihood-based, and the focus is on static data. de2018bayesian developed a method for multigroup factor analysis in a Bayesian framework, which has some commonalities with these approaches but does not impose orthogonality.
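As a concrete illustration of the JIVE-style restriction, the following numpy sketch (with arbitrary dimensions and simulated matrices; it is not the JIVE fitting algorithm itself) builds joint structure with a common row space and individual structure in its orthogonal complement, then verifies that the orthogonality constraint holds:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r_joint, r_indiv = 50, 100, 2, 3  # hypothetical sizes

# Joint structure: rows of each J_i lie in a common row space spanned by V.
V = rng.standard_normal((r_joint, p))
J1 = rng.standard_normal((n, r_joint)) @ V
J2 = rng.standard_normal((n, r_joint)) @ V
J = np.vstack([J1, J2])  # stack the J_i's on top of each other

# Individual structure: projected onto the orthogonal complement of the
# row space of V, which enforces the JIVE restriction J A_i^T = 0.
P = V.T @ np.linalg.solve(V @ V.T, V)  # projection onto row space of V
A1_raw = rng.standard_normal((n, r_indiv)) @ rng.standard_normal((r_indiv, p))
A1 = A1_raw @ (np.eye(p) - P)

print(np.allclose(J @ A1.T, 0.0))  # joint and individual parts are orthogonal
```

The same projection device reappears in our model below, where individual-specific loadings are constrained to the orthogonal complement of the shared loading space.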
One way to accommodate time-varying lags is to temporally align the features in a shared space, avoiding the need to develop a complex model of lagged dependence across the series. However, time alignment is a hard problem. Typically, alignment is done in a first stage, and then an inferential model is applied to the aligned data (vial2009combination). Such two-stage approaches do not provide adequate uncertainty quantification.
Several approaches have been proposed to model warping functions. tsai2013profile used basis functions similar to B-splines with varying knot positions, selecting the knots via stochastic search variable selection. This makes the model more flexible, but at the cost of very high computational demand. kurtek2017geometric put a prior on the warping function based on a geometric condition and developed importance sampling methods; in a multivariate setup, the method becomes very complicated due to the geometric structure. lu2017bayesian uses a similar structure in placing a prior on the warping function, and the computational complexity in the multivariate setting persists.
bharath2017partition and cheng2016bayesian put a Dirichlet prior on the increments of the warping function over a grid of time points, so the estimated warping function is not smooth. Moreover, when the warping function is convolved with an unknown function, computation becomes inefficient due to poor mixing. The warplets of claeskens2010multiresolution are an interesting alternative; nevertheless, that method suffers from a similar computational problem.
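A minimal sketch of the Dirichlet-increment construction described above: increments drawn from a Dirichlet distribution over a grid accumulate to a monotone map of $[0,1]$ onto itself that is only piecewise linear between grid points, hence not smooth (the grid size and concentration parameter below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
K = 20                                         # number of grid cells (arbitrary)
increments = rng.dirichlet(np.full(K, 2.0))    # Dirichlet prior on the increments
warp = np.concatenate([[0.0], np.cumsum(increments)])  # warping values on K+1 grid points

# The draw is monotone increasing and maps [0, 1] onto [0, 1], but it is
# only piecewise linear between grid points, hence not smooth.
assert np.all(np.diff(warp) > 0)
assert np.isclose(warp[-1], 1.0)
```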
For multivariate time warping, listgarten2005multiple proposed a method based on a hidden Markov model. Other works use a warping-based distance to cluster similar time series (orsenigo2010combining; che2017decade). Unfortunately, these algorithms require the two time series to be collected at the same time points. In addition, it is difficult to avoid a two-stage procedure, since there is no straightforward way to combine a statistical model with the warping algorithms.
gervini2004self modeled the warping function as the identity plus a linear combination of basis functions, $\nu(t) = t + \sum_j a_j \phi_j(t)$, where the $\phi_j$'s are characterized using B-splines, with the sum of the $a_j$'s equal to zero. For identifiability, they assumed restrictive conditions on the spline coefficients and did not accommodate multivariate data. telesca2008bayesian developed a related Bayesian approach, but their structure makes it difficult to apply gradient-based Metropolis-Hastings (MH), and finding a good proposal for efficient sampling is problematic.
We propose to estimate the similarity between two multivariate time series with time-varying lags using a Bayesian dynamic factor model that incorporates time warping and parameter estimation in a single step. We assume the multivariate time series have both shared time-aligned factors and individual-specific dynamic factors. The resulting model reduces to a CIFA-style dependence structure, but unlike previous work, we accommodate time dependence and take a Bayesian approach to inference. Key aspects of our Bayesian implementation include likelihood-based estimation of shared and individual-specific subspaces, incorporation of a monotonicity constraint on the warping function for identifiability, and development of an efficient gradient-based Markov chain Monte Carlo (MCMC) algorithm for posterior sampling.
We align the two time series by mapping the features of the shared space using a monotone increasing warping function $\nu(\cdot)$. This flexible function can accommodate situations where the time lags between the multivariate time series change sign and direction. Our monotone function construction differs from previous Bayesian approaches (ramsay1988monotone; he1998monotone; neelon2004bayesian; shively2009bayesian; lin2014bayesian), and is motivated by tractability in obtaining a nonparametric specification amenable to Hamiltonian Monte Carlo (HMC) sampling.
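Our specific construction is developed later; purely as an illustration of one generic way to obtain a smooth monotone increasing warping of $[0,1]$ onto itself, the sketch below normalizes the cumulative integral of a strictly positive integrand (here the exponential of a polynomial; the function class and all numerical choices are hypothetical, not the construction used in this paper):

```python
import numpy as np

def monotone_warp(t, coefs, grid_size=501):
    """Generic monotone map of [0, 1] onto itself: the normalized cumulative
    integral of the strictly positive integrand exp(g), where g is a
    polynomial with the given coefficients (an illustrative choice only)."""
    s = np.linspace(0.0, 1.0, grid_size)
    dens = np.exp(np.polyval(coefs, s))                 # strictly positive
    steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(s)   # trapezoid rule
    cum = np.concatenate([[0.0], np.cumsum(steps)])
    cum /= cum[-1]                                      # forces nu(0)=0, nu(1)=1
    return np.interp(t, s, cum)

t = np.linspace(0.0, 1.0, 11)
nu = monotone_warp(t, [1.5, -0.5, 0.0])
assert np.all(np.diff(nu) > 0) and nu[0] == 0.0 and np.isclose(nu[-1], 1.0)
```

Because any unconstrained parameterization of the integrand yields a valid monotone warp, gradient-based samplers can move freely in the parameter space without rejection for constraint violations.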
In general, posterior samples of the loading matrices are not interpretable without identifiability restrictions (seber2009multivariate; lopes2004bayesian; rovckova2016fast; fruehwirth2018sparse). To avoid arbitrary constraints, which complicate computation, one technique is to post-process an unconstrained MCMC chain. assmann2016bayesian post-process by solving an Orthogonal Procrustes problem to produce a point estimate of the loading matrix, but without uncertainty quantification. We instead post-process the MCMC chain iteratively, so that inference can be drawn from the whole chain. Apart from the computational advantages, we also show identifiability of the warping function in our factor modeling setup, both in theory and in simulations. Moreover, our identifiability result is more general than that of gervini2004self, as we do not assume any particular form of the warping function other than monotonicity.
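To illustrate the rotational ambiguity and the Procrustes-style alignment idea, the sketch below (simulated draws; not the iterative scheme proposed here) rotates each posterior draw of a loading matrix toward a common reference by solving an Orthogonal Procrustes problem via the SVD:

```python
import numpy as np

rng = np.random.default_rng(2)
p, r, n_draws = 20, 3, 200          # hypothetical dimensions and chain length

# Simulated posterior draws of a loading matrix: each draw is the same
# underlying matrix under a different random rotation (rotational
# ambiguity) plus a small amount of noise.
Lam0 = rng.standard_normal((p, r))
draws = []
for _ in range(n_draws):
    Q, _ = np.linalg.qr(rng.standard_normal((r, r)))   # random orthogonal matrix
    draws.append(Lam0 @ Q + 0.01 * rng.standard_normal((p, r)))

def procrustes_rotation(L, ref):
    """Orthogonal matrix R minimizing ||L R - ref||_F, computed via the SVD."""
    U, _, Vt = np.linalg.svd(L.T @ ref)
    return U @ Vt

# Align every draw to a common reference so the whole chain is comparable.
ref = draws[0]
aligned = [L @ procrustes_rotation(L, ref) for L in draws]

# Element-wise posterior spread collapses once the rotations are removed.
assert np.std(aligned, axis=0).mean() < np.std(draws, axis=0).mean()
```

After alignment, element-wise posterior summaries (means, credible intervals) of the loading matrix become meaningful, which is what enables inference from the whole chain rather than a single point estimate.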
In Section 2 we discuss our model in detail. Prior specifications are described in Section LABEL:prior. Our computational scheme is outlined in Section LABEL:compute. Section LABEL:theory discusses theoretical properties such as identifiability of the warping function and posterior concentration. We study the performance of our method in two simulation setups in Section LABEL:sim. Section LABEL:real considers applications to human social interaction datasets. We end with some concluding remarks in Section LABEL:discuss. The Supplementary Materials contain all proofs, additional algorithmic details, and additional results.
We have a pair of $d$-dimensional time-varying random variables $X_1(t)$ and $X_2(t)$. We propose to model the data as a function of time-varying shared latent factors, $\eta(t)$, and individual-specific factors, $\eta_1(t)$ and $\eta_2(t)$. We do time alignment through the shared factors in $\eta(t)$ using warping functions $\nu_j$. Here $\nu_j$ is the warping function for the latent variable $\eta_j(t)$.
Latent factor modeling is natural in this setting, relating the measured multivariate time series to lower-dimensional characteristics while reducing the number of parameters needed to represent the covariance. Since we are using the warping function to align the dynamic factors of the shared space, to ensure identifiability the individual-specific space and the shared space are required to be orthogonal. Thus the corresponding loading matrices of the two orthogonal subspaces are assumed to have orthogonal column spaces. Let $\Lambda$ be the loading matrix of the shared space. Then the shared-space signal belongs to the span of the columns of $\Lambda$, with weights given as multiples of the shared factors. An element from the dynamic shared space can be represented as $\sum_j c_j \lambda_j \eta_j(t)$ for some constants $c_j$, where $\lambda_j$ is the $j$-th column of $\Lambda$. Alternatively, it can also be written as $\Lambda C \eta(t)$, where $C$ is a diagonal matrix with entries $c_j$. The individual-specific space is assumed to lie in the orthogonal complement of the column space of $\Lambda$. Thus we use the orthogonal projection matrix $I - \Lambda(\Lambda^{T}\Lambda)^{-1}\Lambda^{T}$ to construct the loading matrix of the individual-specific part of each signal. The loading matrix for the individual-specific space of $X_1(t)$ is assumed to be $\{I - \Lambda(\Lambda^{T}\Lambda)^{-1}\Lambda^{T}\}\Gamma_1$ for some matrix $\Gamma_1$ of dimension $d \times r_1$, where $r_1$ is the rank. The corresponding loading matrix for the individual-specific space of $X_2(t)$ is $\{I - \Lambda(\Lambda^{T}\Lambda)^{-1}\Lambda^{T}\}\Gamma_2$, with $\Gamma_2$ being a $d \times r_2$ matrix, $r_2$ being the rank.
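A small numpy check of the construction just described, using hypothetical dimensions: projecting an arbitrary matrix onto the orthogonal complement of the shared loading matrix's column space yields individual-specific loadings whose columns are orthogonal to the shared ones by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, r1 = 10, 2, 3     # hypothetical dims: ambient, shared rank, individual rank

Lam = rng.standard_normal((d, r))                      # shared-space loading matrix
P = Lam @ np.linalg.solve(Lam.T @ Lam, Lam.T)          # projection onto col(Lam)

Gamma1 = rng.standard_normal((d, r1))                  # arbitrary matrix
Theta1 = (np.eye(d) - P) @ Gamma1                      # individual-specific loadings

# The individual-specific column space is orthogonal to the shared
# column space, as required for identifiability of the warping.
assert np.allclose(Lam.T @ Theta1, 0.0)
```

The same projection applied to a second matrix gives the individual-specific loadings for the other series, so both individual subspaces are automatically orthogonal to the shared one without any explicit constraint during sampling.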