1 Introduction
Datadriven linear models are a natural approach to modeling timeseries and have been studied extensively in the mathematical sciences and applied fields. These domains include Earth and atmospheric sciences (Leith, 1978; Danforth et al., 2007; Alexander et al., 2008), fluid dynamics (Rowley et al., 2009; Allgaier et al., 2012; Budišić et al., 2012; Towne et al., 2018), and neuroscience (Brunton et al., 2016; Markowitz et al., 2018)
. In nonlinear settings, linear approximations may be justified by their connection to the eigenfunctions of the Koopman operator
(Budišić et al., 2012; Lusch et al., 2018), known as the dynamic mode decomposition (Rowley et al., 2009; Tu et al., 2014; Takeishi et al., 2017). In order to capture nonstationary behavior, it is important to leverage dynamic linear models (DLMs) that vary over time (West and Harrison, 1997). However, since every linear model is parametrized by a system matrix, and this matrix changes over time, a naive DLM fit to a length timeseries of variables depends upon parameters, which could be very large. We propose to manage such complexity by representing the stack of system matrices as a low rank tensor with only parameters (see Fig. 1).There is a large literature on DLMs, including timevarying autoregressive (TVAR) and switching linear dynamical systems (SLDS) models that we cannot review in full here (West and Harrison, 1997). Ours is a regularized optimization method (in the spirit of Ghahramani and Hinton, 2000; Chan et al., 2014, 2015; Tank et al., 2017), complementary to the Bayesian approaches taken in much of the literature (Prado et al., 2000; Fox et al., 2009; Damien et al., 2013; Linderman et al., 2017). We also would like to highlight the recent work of Costa et al. (2019), which uses likelihood tests to adaptively segment a timeseries and fit different models to each segment. However, our approach is inherently more scalable due to the lowrank assumptions we make.
The key innovation of our model is to parametrize the dynamics by a low rank tensor for computational tractability, ease of identification, and interpretation. Tensor decompositions (Kolda and Bader, 2009) are a powerful technique for summarizing multivariate data and an area of ongoing research in theory (Ge and Ma, 2017; Anari et al., 2018, among others) and applications
(e.g., to improve neural networks
Novikov et al., 2015; He et al., 2017). In our formulation, the system tensor representing the DLM is regressed against the data. In this aspect, our method is most similar to the work of Yu and colleagues who considered spatiotemporal forecasting with spatial smoothness regularization (Bahadori et al., 2014; Yu and Liu, 2016; Yu et al., 2018), in contrast to our temporal smoothness. Our work also differs in the emphasis on nonsmooth regularization to find switching or other temporally structured behavior.2 The TVART model
We now introduce our timevarying autoregressive model with low rank tensors (TVART, Fig. 1). Assume we have sampled the trajectory of an dimensional dynamical system for . We split this trajectory into nonoverlapping windows of length , so that . Let be the tensor with entries and similarly let be a tensor of the same size with entries shifted by one time point, . (See Appendix A for the notation conventions.) We call the frontal slices and the snapshot matrices for window . The first of these are
The subsequent snapshots for are each shifted by .
The goal of TVART is to fit an tensor of system matrices, so that for , where is the th frontal slice of . The assumption underlying this goal is that where is constant within a window. This motivates the least squares optimization problem, equivalent to assuming uncorrelated Gaussian errors,
(TVART) 
Without the rank constraint, TVART factors into decoupled problems for each window
. In order to limit the degrees of freedom in the tensor
, we use a low rank formulation. Specifically, we represent using the canonical polyadic (CP) decomposition (Kolda and Bader, 2009) of rank asin terms of the factor matrices , , and , where is the th column of . Thus, the number of parameters is reduced to , which is now linear in and
. We can optionally normalize the factors to have unitlength columns and capture their scalings in a vector
; we consider this a postprocessing step and explicitly state when we do this.When solving TVART for , we work directly with the frontal slices
(1) 
Matrix is the diagonal matrix formed from the th row of (Kolda and Bader, 2009). We call the matrices , , and the TVART dynamical modes. Specifically, we refer to and as the left and right spatial modes, since they determine the loadings of onto the spatial dimensions/channels in the data. The matrix , which determines , contains the temporal modes, since it determines the timevariation of the system matrix across windows.
Equation (1
) is similar to the singular value decomposition (SVD): each slice
is the product of a low rank matrix , a diagonal matrix , and another low rank matrix . However, in the SVD, the left and right singular vectors are orthogonal. Let andbe the QR decompositions of the left and right spatial modes, so that
. Thus, in order to calculate the SVD of , we would have to take the SVD of the matrix . This also illustrates that, in the CP decomposition, slices and may have different left and right singular vectors (but they always lie in the column spaces of and ). This flexibility is important to allow fitting of switching models with different singular subspaces.2.1 Extensions: affine dynamics and higherorder autoregressions
In many applications, the mean of the data may drift over time, and thus affine models are more appropriate than linear models. We can fit an affine model of this type within the TVART framework by appending a row of ones to each and extending by one row to build in a term. In this case, we have that , where is the extra row .
Furthermore, autoregressive models of higher order are often considered, where is predicted from data with lags . In this case, the dimensions of and change to and , respectively, but otherwise the mathematics remain equivalent. For simplicity, we focus on just the case, but higherorder autoregressive models are likely bettersuited to certain applications.
2.2 Norm regularization
A natural approach to TVART is alternating least squares. However, we have found that this is numerically unstable, in particular for a switching linear test problem (Sec. 3.1) with low noise. Additive, independent noise adds a diagonal component to the data covariance, which suggests we apply Tikhonov regularization to the problem. An additional motivation is that we do not want the entries in the matrices , , and to become too large, but some might become large due to the scaling indeterminancy of the CP decomposition (Kolda and Bader, 2009). We thus add a Tikhonov regularization term
to the leastsquares loss. The regularization parameter controls the magnitude of this regularization; as , the constraint disappears. In matrix completion problems, a similar twoterm regularization is often added and can be seen as a convex relaxation of the matrix rank.
2.3 Temporal mode smoothing
We now consider a few possible regularizers on the time components . Recall that the rows of correspond to the loadings at different time windows. By forcing these rows to be correlated, we keep the system matrices from varying too much from window to window, a form of temporal smoothing. The first regularizer we consider is a total variation (TV) penalty:
(2) 
Matrix is the first difference matrix (1 on the principal diagonal and 1 on the lower diagonal). TV prefers piecewise constant time components since it penalizes nonzero firstdifferences with an penalty to enforce sparsity, appropriate for an SLDS. We also consider a spline penalty:
(3) 
This linear smoother penalizes the norm of the first derivative, leading to smoothly varying solutions.
2.4 Regularized cost function
We modify the problem (TVART
), adding the Tikhonov and smoothing penalties to the loss function. These additions result in the regularized cost function
(4) 
where follows equation (1) as before. Here, is either TV or Spline. Increasing the temporal smoothing strength leads to stronger regularization, as does decreasing .
2.5 Alternating minimization algorithm
We minimize the regularized cost (4) using alternating minimization, also known as block coordinate descent. In Algorithm 1, we give the full alternating minimization procedure for solving the regularized problem. The subroutines that minimize for and require different approaches. Since the objective (4) is quadratic in and , we find these by solving a linear matrix equation either directly or using the method of conjugate gradients (CG); this works best for solving the Sylvester equation in . For , with the Spline penalty, the cost is again quadratic so we also use CG. However, the TV penalty is convex but not smooth, so in this case we use the proximal gradient method with Nesterov acceleration. See Appendix B for further algorithm details. Our code is available from https://github.com/kharris/tvart.
Theorem 2.1.
Proof.
We use the framework of Tseng (2001, Theorem 5.1) for cyclic block coordinate descent (of which Algorithm 1 is an example) on objective functions with convex and lower semicontinuous blocks, as well as bounded level sets. We split the cost into smooth and nonsmooth parts
where and contains the remaining loss and Tikhonov terms. The function is continuous and differentiable, and it is strongly convex in each of its blocks , and . However, is not a convex function. Also, is convex and continuous. Let be the initialization and . Denote the level set
Then, since and the ball is bounded, we can conclude that the level set is also bounded. Then by Tseng (2001, Theorem 5.1), we obtain the result. ∎
Remark.
We did not use any structure of besides convexity and lower semicontinuity, thus the same convergence results hold for other regularizations with those properties. However, we did need to use the Tikhonov penalty to ensure that the level sets are bounded.
3 Example applications
In this section we test our method on both synthetic data (switching linear and nonlinear), as well as realworld datasets. With the synthetic linear data, we show that our algorithm can recover the true dynamics and is competitive with other state of the art techniques. In the other examples, we highlight how the recovered modes are interpretable and can correspond to important dynamical regimes. For detailed explanation of the datasets and parameters used, please refer to Appendix D.
3.1 Test problem: switching low rank linear system
We first apply TVART to a simple test case where the true model is a lowrank, switching linear system. We generate two system matrices, and , which are random, rank2 rotation matrices. For the first half of our timeseries, the dynamics follow , and then they switch to . Gaussian noise is added to each entry of to form the observations.
We present results in Fig. 2 for , , and window size . The temporal modes are stable for the first 5 windows, when system is active, then switch to a different state for the remaining windows where is active. If the switch does not exactly occur on the window boundary, little changes. Thus just from the temporal modes we can determine the change point to the resolution of the window size. Examining the TVART output in more detail, only the first four spatial and temporal modes are significantly different from zero. Remarkably, TVART is able to discover the true rank of the system—equal to four because and are each rank two—and this crucially depends on the Tikhonov penalty. Furthermore, we show the TVART reconstruction of in window , the true , and their difference. The reconstruction matches the truth very closely.
We now remark on some other behaviors that we have observed in test problems. When the noise is very small or zero, the Tikhonov regularization term is important. If we eliminate this term by taking , the ALS subproblems become illconditioned and lead to poor results; see the remark after Theorem 2.1. The TV regularization, in contrast, is important for selecting a switching solution under moderate to large noise.
The TVART method was compared to a number of alternative approaches: independent windowed fits of ranks , 6, and 4, as well as a Bayesian SLDS fit using the ssm package with latent spaces of dimensions 6 and 4. On the left side of Fig. 3, we see that, across system sizes , TVART is able to recover the true dynamics as well as or better than a rank 4 SLDS. The rank 6 SLDS performs close to independent rank 4, while the independent models of higher rank perform worse. On the right of Fig. 3, the runtimes of rank 4 TVART and SLDS are compared. We find that TVART offers a speedup of approximately 3x–6x over SLDS for , and the runtimes of both appear to scale superlinearly as . We conclude that TVART with TV regularization is a scalable, alternative way of finding switching linear dynamics.
A further comparison to smoothlyvarying low rank dynamics is given in the Appendix D.1. In this case, TVART with Spline regularization performs much better than SLDS, since it can capture smooth behavior which SLDS cannot.
3.2 Nonlinear example: Lorenz (1963) system
We take a sample trajectory of the canonical Lorenz (1963) chaotic nonlinear dynamical system, a three variable dynamical system that exhibits nonlinear oscillations about two unstable spiral points. The oscillations grow about a single fixed point until the system “flips” to oscillation about the other fixed point. There is a third unstable node at the origin.
In Fig. 4 we depict the temporal modes as well as the result of clustering the matrices using kmedoids algorithm on with 3 clusters. The clustering is able to identify the lobe of the attractor that the system is on (clusters 2 and 3) as well as if it is in a transitory state between the two lobes (cluster 1). This makes sense because in each of these cases the dynamics are dominated by the closest fixed point.
3.3 Worm behavior
We analyze the escape response behavior of the nematode worm Caenorhabditis elegans in response to a heat stimulus (Broekmans et al., 2016a, b). These were used as test data for clustering via adaptive linear models in the recent work of Costa et al. (2019). The results for worm 1 are shown in Fig. 5. We performed clustering on the system matrices as before with three clusters and found that these clusters matched the three behaviors in the data: forward crawling, a turn, and backward crawling. These clusters are not trivial: the data means during forward and backward motion are approximately equal, but the phase velocity switches.
Finally, we also compared our results to the code provided by Costa et al. (2019), which fits an adaptive linear model and clusters the same timeseries; the clustering results are essentially the same. In terms of runtime, fitting a TVART model, performing clustering, and displaying the results takes 23 s (90 iterates), versus 123 s for the code of Costa et al. (2019). Thus, we see that our method is much faster than theirs, while producing essentially the same clustering results.
3.4 Sea surface temperature
As an example of a highdimensional dataset, we applied our method to weekly sea surface temperature data from 1990 until present (Reynolds et al., 2002). The leading modes that are output by the algorithm oscillate seasonally. However, we also find a temporal mode that tracks the El NiñoSouthern Oscillation (ENSO), Fig. 6. The corresponding spatial modes show a plume of warm water in the central and eastern equatorial Pacific Ocean. Warmer than average water in this location is the signature of ENSO.
3.5 Neural activity during a reaching task
A second highdimensional dataset we use comes from electrocorticography recordings (ECoG, measured with chronic electrodes placed below the skull and the dura mater) of a Japanese macaque monkey Macaca fuscata during a reaching task (Chao et al., 2010). During the task, the monkey makes repeated reaches towards food while its limbs are tracked using motion capture and brain activity is recorded. Figure 7 depicts the results: We show motion capture data with dashed lines at the onset and offsets of movements, using a changepoint detection procedure. We also highligh the result of clustering the TVART temporal modes into two clusters. The brain activity reveals two dominant modes, one of which is aligned to movement. These movements are accompanied by an increase in high frequency (32200 Hz) and a decrease in low frequency (232 Hz) power. The dominant TVART mode (show in blue) follows this spectral change in the timeseries, and the left spatial mode associated with it is centered in the premotor region (supplementary Fig. 9). TVART tracks this spectral change directly from the timeseries of electrode voltage.
4 Conclusions
We have presented a low rank tensor formulation of timevarying autoregressive problems that we call TVART. This offers the advantage of being able to scale to problems with many state variables or many time points, as highlighted by our examples. Furthermore, the modes output by our method are often interpretable as dominant dynamical features in the data. Future work should investigate higherorder autoregressive models and applications to other datasets such as economic data.
Acknowledgements
Thank you to Scott Linderman and Nathan Kutz for discussions and to Steven Peterson for help preprocessing the Neurotycho data. KDH was supported by a Washington Research Foundation Postdoctoral Fellowship. BWB, RR, and KDH were supported by NSF CNS award 1630178 and DARPA award FA8750182025.
plus 0.3ex
References
 Alexander et al. (2008) Alexander, M. A., Matrosova, L., Penland, C., Scott, J. D., and Chang, P. (2008). Forecasting Pacific SSTs: Linear Inverse Model Predictions of the PDO. Journal of Climate, 21(2):385–402.
 Allgaier et al. (2012) Allgaier, N. A., Harris, K. D., and Danforth, C. M. (2012). Empirical correction of a toy climate model. Physical Review E, 85(2).

Anari et al. (2018)
Anari, N., Daskalakis, C., Maass, W., Papadimitriou, C., Saberi, A., and
Vempala, S. (2018).
Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons.
In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., CesaBianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems 31, pages 10857–10867. Curran Associates, Inc.  Bahadori et al. (2014) Bahadori, M. T., Yu, Q. R., and Liu, Y. (2014). Fast Multivariate Spatiotemporal Analysis via Low Rank Tensor Learning. In Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N. D., and Weinberger, K. Q., editors, Advances in Neural Information Processing Systems 27, pages 3491–3499. Curran Associates, Inc.
 Broekmans et al. (2016a) Broekmans, O. D., Rodgers, J. B., Ryu, W. S., and Stephens, G. J. (2016a). Data from: Resolving coiled shapes reveals new reorientation behaviors in C. elegans. Dryad Digital Repository.
 Broekmans et al. (2016b) Broekmans, O. D., Rodgers, J. B., Ryu, W. S., and Stephens, G. J. (2016b). Resolving coiled shapes reveals new reorientation behaviors in C. elegans. eLife, 5:e17227.
 Brunton et al. (2016) Brunton, B. W., Johnson, L. A., Ojemann, J. G., and Kutz, J. N. (2016). Extracting spatial–temporal coherent patterns in largescale neural recordings using dynamic mode decomposition. Journal of Neuroscience Methods, 258:1–15.
 Budišić et al. (2012) Budišić, M., Mohr, R., and Mezić, I. (2012). Applied Koopmanism. Chaos: An Interdisciplinary Journal of Nonlinear Science, 22(4):047510.
 Chan et al. (2014) Chan, N. H., Yau, C. Y., and Zhang, R.M. (2014). Group LASSO for Structural Break Time Series. Journal of the American Statistical Association, 109(506):590–599.
 Chan et al. (2015) Chan, N. H., Yau, C. Y., and Zhang, R.M. (2015). LASSO estimation of threshold autoregressive models. Journal of Econometrics, 189(2):285–296.
 Chao et al. (2010) Chao, Z. C., Nagasaka, Y., and Fujii, N. (2010). Longterm asynchronous decoding of arm motion using electrocorticographic signals in monkey. Frontiers in Neuroengineering, 3.
 Costa et al. (2019) Costa, A. C., Ahamed, T., and Stephens, G. J. (2019). Adaptive, locally linear models of complex dynamics. Proceedings of the National Academy of Sciences, page 201813476.
 Damien et al. (2013) Damien, P., Dellaportas, P., Polson, N. G., and Stephens, D. A. (2013). Bayesian Theory and Applications. OUP Oxford.
 Danforth et al. (2007) Danforth, C. M., Kalnay, E., and Miyoshi, T. (2007). Estimating and Correcting Global Weather Model Error. Monthly Weather Review, 135(2):281–299.
 Fox et al. (2009) Fox, E., Sudderth, E. B., Jordan, M. I., and Willsky, A. S. (2009). Nonparametric Bayesian Learning of Switching Linear Dynamical Systems. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 21, pages 457–464. Curran Associates, Inc.
 Ge and Ma (2017) Ge, R. and Ma, T. (2017). On the Optimization Landscape of Tensor Decompositions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 3653–3663. Curran Associates, Inc.
 Ghahramani and Hinton (2000) Ghahramani, Z. and Hinton, G. E. (2000). Variational Learning for Switching StateSpace Models. Neural Computation, 12(4):831–864.
 He et al. (2017) He, Z., Gao, S., Xiao, L., Liu, D., He, H., and Barber, D. (2017). Wider and Deeper, Cheaper and Faster: Tensorized LSTMs for Sequence Learning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 1–11. Curran Associates, Inc.
 Kolda and Bader (2009) Kolda, T. and Bader, B. (2009). Tensor Decompositions and Applications. SIAM Review, 51(3):455–500.
 Leith (1978) Leith, C. E. (1978). Objective Methods for Weather Prediction. Annual Review of Fluid Mechanics, 10(1):107–128.
 Linderman et al. (2017) Linderman, S., Johnson, M., Miller, A., Adams, R., Blei, D., and Paninski, L. (2017). Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems. In Artificial Intelligence and Statistics, pages 914–922.
 Lorenz (1963) Lorenz, E. N. (1963). Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2):130–141.
 Lusch et al. (2018) Lusch, B., Kutz, J. N., and Brunton, S. L. (2018). Deep learning for universal linear embeddings of nonlinear dynamics. Nature Communications, 9(1):4950.

Markowitz et al. (2018)
Markowitz, J. E., Gillis, W. F., Beron, C. C., Neufeld, S. Q., Robertson, K.,
Bhagat, N. D., Peterson, R. E., Peterson, E., Hyun, M., Linderman, S. W.,
Sabatini, B. L., and Datta, S. R. (2018).
The Striatum Organizes 3D Behavior via MomenttoMoment Action Selection.
Cell, 0(0).  Million (2007) Million, E. (2007). The Hadamard Product. Technical report, University of Puget Sound.
 Novikov et al. (2015) Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. (2015). Tensorizing Neural Networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 442–450. Curran Associates, Inc.
 Perraudin et al. (2014) Perraudin, N., Kalofolias, V., Shuman, D., and Vandergheynst, P. (2014). UNLocBoX: A MATLAB convex optimization toolbox for proximalsplitting methods. arXiv:1402.0779 [cs, stat].
 Prado et al. (2000) Prado, R., Huerta, G., and West, M. (2000). Bayesian timevarying autoregressions: Theory, methods and applications. Resenhas do Instituto de Matemática e Estatística da Universidade de São Paulo, 4(4):405–422.
 Reynolds et al. (2002) Reynolds, R. W., Rayner, N. A., Smith, T. M., Stokes, D. C., and Wang, W. (2002). An Improved In Situ and Satellite SST Analysis for Climate. Journal of Climate, 15(13):1609–1625.
 Rowley et al. (2009) Rowley, C. W., Mezić, I., Bagheri, S., Schlatter, P., and Henningson, D. S. (2009). Spectral analysis of nonlinear flows. Journal of Fluid Mechanics, 641:115–127.
 Takeishi et al. (2017) Takeishi, N., Kawahara, Y., and Yairi, T. (2017). Learning Koopman Invariant Subspaces for Dynamic Mode Decomposition. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 1130–1140. Curran Associates, Inc.
 Tank et al. (2017) Tank, A., Fox, E. B., and Shojaie, A. (2017). An Efficient ADMM Algorithm for Structural Break Detection in Multivariate Time Series. arXiv:1711.08392 [stat].
 Towne et al. (2018) Towne, A., Schmidt, O. T., and Colonius, T. (2018). Spectral proper orthogonal decomposition and its relationship to dynamic mode decomposition and resolvent analysis. Journal of Fluid Mechanics, 847:821–867.
 Tseng (2001) Tseng, P. (2001). Convergence of a Block Coordinate Descent Method for Nondifferentiable Minimization. Journal of Optimization Theory and Applications, 109(3):475–494.
 Tu et al. (2014) Tu, J. H., Rowley, C. W., Luchtenburg, D. M., Brunton, S. L., and Kutz, J. N. (2014). On dynamic mode decomposition: Theory and applications. Journal of Computational Dynamics, 1(2):391–421.
 West and Harrison (1997) West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer Science & Business Media.
 Yu et al. (2018) Yu, R., Li, G., and Liu, Y. (2018). Tensor Regression Meets Gaussian Processes. In International Conference on Artificial Intelligence and Statistics, pages 482–490.

Yu and Liu (2016)
Yu, R. and Liu, Y. (2016).
Learning from Multiway Data: Simple and Efficient Tensor
Regression.
In
International Conference on Machine Learning
, pages 373–381.
Appendix A Notation
Throughout this work, we follow the notational conventions of Kolda and Bader (2009). In general, tensors are denoted with calligraphic bold (), matrices with capital bold (), and vectors with lowercase bold symbols (). We use MATLAB style “colon” notation to represent slices, e.g. is the row vector formed from the th row of a matrix . All of the tensors we consider are 3rdorder. For a tensor , let be its th frontal slice, i.e. .
Appendix B Alternating least squares algorithm
Here we describe in detail how to optimize the loss of the unregularized (TVART) problem. The computations required for the regularized problem are essentially the same, with added diagonal matrices to satisfy the Tikhonov assumptions (not shown). Let the loss function be
(B.1) 
Then minimizing the loss is an equivalent formulation of the TVART problem. The cost (B.1) is quadratic and convex in each of the variables , and , although it is not jointly convex in all of them. Smooth and convex leastsquares problems are some of the most studied in optimization, leading to linear equations for the unknowns. In the next three sections, we explicitly lay out the gradients of with respect to each mode which are used in Algorithm 1.
b.1 Subproblem 1: Left Spatial Modes
Taking partial derivatives of with respect to leads to
(B.2) 
Solving requires forming an matrix, and right multiplying by the pseudoinverse of an matrix.
b.2 Subproblem 2: Right Spatial Modes
We now differentiate with respect to and obtain
(B.3) 
which we can rewrite in the form
(B.4) 
where , , and .
For small enough system sizes, we can use the Kronecker product and vectorization operations to rewrite (B.4) as a linear matrixvector equation
(B.5) 
If we set the derivative equal to zero, we obtain a square linear system in unknowns. However, for large , it is impractical to even form this matrix. Thus, we solve this via the matrixvalued method of conjugate gradients.
b.3 Subproblem 3: Temporal Modes
Finally, we derive the gradients of with respect to . The equations for , that is for , are similar except in this case the unknown is a diagonal matrix. The gradients are decoupled across windows (frontal slices), thus for the th window we have
(B.6) 
where is the Hadamard elementwise product, which enforces the constraint that the gradient with respect to is also diagonal.
Define the matrices and . Then by Theorem 2.5 in Million (2007) we have that
where we have substituted . Thus the gradient can be written as
where is the column vector formed by the diagonal elements of . Thus, the leastsquares problem for involves solving an system for each row of independently, . However, adding regularization to the temporal modes destroys this decoupling across time windows, which is why we resort to CG or proximal gradient approaches in Algorithm 1.
b.4 Computational complexity of alternating least squares
Each minimization step requires the solution of a linear matrix equation for . We now comment on the difficulty of solving the subproblems in , , and . Subproblem 1 in is by far the easiest, since it involves just one system solve or pseudoinverse. Subproblem 3 in requires decoupled system solves. The cost of this step is thus linear in . We assume that is small enough that solving for each time window is quick. By far the most difficult step is Subproblem 2 for the right hand factors . Partly this is because we obtain a Sylvester equation which we might naively solve as an square system in unknowns. In practice, we avoid using Kronecker products and instead use a matrixfree method such as conjugate gradients, we can expect this to take less memory and possibly less time, depending on the condition number. However, in either case the subproblem involves coefficient matrices of size , whereas for the other problems these are of size .
The full alternating minimization procedure for the fully regularized problem, Algorithm 1, has similar complexity. However, in that case the cost is dominated by the minimization over , which requires computing the proximal operator at each step of the proximal gradient method.
b.5 Gradient Identities for Least Squares Subproblems
The least squares subproblems for and require us to compute a gradient of a particular form:
(B.7) 
Replacing , and leads to the terms in the Sylvester equation (B.3) for the unknown . On the other hand, replacing , and
and taking the Hadamard product with the identity matrix (i.e., enforcing the diagonal constraint) leads to the terms in (
B.6).Appendix C Algorithm implementation details
The code and instructions for running it are available from https://github.com/kharris/tvart. We implemented Algorithm 1 in MATLAB. It was run on an Intel(R) Xeon(R) CPU E52620 v4 @ 2.10GHz, 1200 MHz with 32 cores and 128 GB RAM using Ubuntu 16.04.1 with Linux kernel 4.15.048generic and 64bit MATLAB R2017b. The CG and proxgradient subroutines are limited to 24 and 40 iterates, respectively. Proximal gradient uses a backtracking line search to find the step size and the proximal operator of TV is evaluating using prox_tv1d from UNLocBox (Perraudin et al., 2014)
; this is typically the bottleneck step for large problems. All hyperparameter tuning (for
, , , and ) was performed manually.The initialization is the following method: We consider the entire timeseries in two snapshot matrices,
We then fit a single model to the timeseries by and form the matrices . When , the TVART modes are then initialized to , , and is the matrix of allones with columns normalized. If , we truncate the smaller singular vectors, and when , we add columns equal to to and and to
. Initializations with unequal columns appear to help speed up the initial phase of the optimization, as opposed to allones. For this reason, we add Gaussian noise to the initializations with standard deviation
to and and with standard deviation to . Initializing to zeros is not appropriate since the gradients in that case are always zero. All of the results we present do not depend strongly on the realization of the noise.For stopping criteria, we use both a relative and absolute tolerance on the decrease in the cost function. Let be the cost (4) at iterate . Then, we stop if either
\[
c_{k-1} - c_k < \mathrm{rtol} \cdot c_{k-1}
\qquad \text{or} \qquad
c_{k-1} - c_k < \mathrm{atol}.
\tag{C.8}
\]
Unless specified otherwise, rtol = and atol = . In all cases we have tested, the relative tolerance is achieved first. We report the cost (4) as the algorithm runs, as well as the root mean square error (RMSE), which we define as:
\[
\mathrm{RMSE} = \frac{\lVert \hat{X} - X \rVert_F}{\sqrt{N(T-1)}},
\tag{C.9}
\]
where \(\hat{X}\) collects the one-step predictions of the fitted model.
The RMSE is normalized so that it gives a measure of average one-step prediction error per channel. This “goodness of fit” metric can then be compared to the standard deviation of the data.
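As a concrete illustration, the stopping rule and normalized RMSE described above can be sketched as follows. This is a minimal Python sketch, not the MATLAB implementation from the repository; the function names and the Frobenius-norm normalization are our reading of the description above.

```python
import numpy as np

def should_stop(costs, rtol=1e-4, atol=1e-6):
    """Stopping rule sketch: halt when the decrease in the cost between
    consecutive iterates falls below a relative or an absolute tolerance.
    (The paper's exact rtol/atol values are problem-dependent.)"""
    if len(costs) < 2:
        return False
    decrease = costs[-2] - costs[-1]
    return decrease < rtol * abs(costs[-1]) or decrease < atol

def rmse(X_true, X_pred):
    """Root mean square one-step prediction error, normalized per channel
    and per time step so it is comparable to the data's standard deviation."""
    X_true = np.asarray(X_true, dtype=float)
    X_pred = np.asarray(X_pred, dtype=float)
    return np.linalg.norm(X_true - X_pred) / np.sqrt(X_true.size)
```

With this normalization, a model that is not overfit should achieve an RMSE near the observation noise level.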
Appendix D Example application details
In this section, we provide precise details of the test problem and datasets we considered. A complete table of parameters is given in Table 1.
Name | N | window size | # windows | rank | hyperparameters | regularizer | affine?
Switching test problem | 6–4000 | 20 | 10 | 4, 6, 8 | 5 | TV | no
Lorenz system | 3 | 10 | 419 | 4 | 0.01, 400 | TV | yes
Worm behavior | 4 | 6 | 33 | 6 | 0.05, 6 | TV | yes
Sea surface temperature | 1259 | 9 | 171 | 6 | | Spline | no
Neural activity | 64 | 200 | 1999 | 8 | 1, 100 | TV | no
D.1 Test problem
We generate the matrices by the following process for :

Form a rotation matrix .

Draw a random matrix , with standard Gaussian entries.
Take the SVD: .

Form a larger rotation matrix by projecting with onto the left singular vectors: .
Then, is a rank 2 rotation matrix. We use and . The data are generated by starting with the initial condition and iterating 200 steps with to remove a transient. After the transient, the vector is renormalized to have . For the first half of the time series , we iterate , while for the second half we iterate . After the first application of , we again rescale to have norm . After generating the time series, we add independent Gaussian observation noise with standard deviation to each entry of . Since the noise vector norm scales as , rescaling the signal vector by the same factor maintains the same signal-to-noise ratio across different system sizes . The RMSE of a model that is not overfit should thus be approximately for any .
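The generation process above can be sketched in Python as follows. This is illustrative only: the rotation angles, seed, and helper names are our choices (the paper's specific angle values are not shown in this extraction), and the paper's MATLAB code remains authoritative.

```python
import numpy as np

def embedded_rotation(N, theta, rng):
    """Embed a 2x2 rotation by angle theta into N dimensions: draw a random
    N x 2 Gaussian matrix, take its left singular vectors U, and form the
    rank-2 matrix A = U R U^T."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    Q = rng.standard_normal((N, 2))
    U, _, _ = np.linalg.svd(Q, full_matrices=False)  # U is N x 2, orthonormal
    return U @ R @ U.T

def switching_series(N, T, theta1, theta2, noise_sd, rng):
    """Generate the switching time series: burn off a transient with A1,
    renormalize x to norm sqrt(N), iterate A1 for the first half and A2 for
    the second (rescaling once after the switch), then add observation noise."""
    A1 = embedded_rotation(N, theta1, rng)
    A2 = embedded_rotation(N, theta2, rng)
    x = rng.standard_normal(N)
    for _ in range(200):                        # remove the transient
        x = A1 @ x
    x *= np.sqrt(N) / np.linalg.norm(x)         # renormalize to norm sqrt(N)
    X = np.empty((N, T))
    for t in range(T):
        X[:, t] = x
        x = (A1 if t < T // 2 else A2) @ x
        if t == T // 2:                         # rescale after first use of A2
            x *= np.sqrt(N) / np.linalg.norm(x)
    return X + noise_sd * rng.standard_normal((N, T))
```

Because x lies in the span of the left singular vectors after one application, A1 acts as a pure rotation there and preserves the norm; the single rescale is only needed when switching into A2's subspace.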
In Fig. 2, we run the switching linear model with , , and . The TVART algorithm is run with , , and and TV regularization. Algorithm 1 converges in 30 iterations with an RMSE of 0.554, which is close to the noise floor .
In Fig. 3, we sweep across , and 4000 for both and , , and . The independent models fit a matrix independently to each window of size ; indep() first performs an SVD on the data and fits the model in the subspace spanned by the first principal components. The SLDS() models fit the same data with the Bayesian SLDS code provided by Scott Linderman at https://github.com/slinderman/ssm. In all cases, the maximum number of switches Kmax = 4. The SLDS model was fit using the mean field variational posterior for 6000 iterations, and the predicted system matrix
was taken as the maximum a posteriori estimate. For the runtime calculation of SLDS, we:

First fit the SLDS for 6000 iterations,

Took the evidence lower bound (ELBO) sequence and smoothed it using a 2nd-order low-pass Butterworth filter with cutoff frequency 0.01 per iteration,

Fit a sigmoid to the smoothed ELBO, and

Chose the final iteration number as the estimated step at which the ELBO reaches 99.9% of its maximum according to the sigmoid.
The SLDS was then refit for that number of iterations to compute the run time.
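The runtime-estimation steps above can be sketched as follows. This is a sketch of the described procedure, not the authors' script: the sigmoid parametrization, initial guesses, and the interpretation of the cutoff frequency (here normalized to the Nyquist rate, as SciPy expects) are our assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.signal import butter, filtfilt

def effective_iterations(elbo, cutoff=0.01, level=0.999):
    """Estimate the iteration at which the ELBO has effectively converged:
    low-pass filter the trace (2nd-order Butterworth, zero-phase), fit a
    sigmoid, and return the first iteration at which the fitted sigmoid
    reaches `level` of its asymptotic range."""
    b, a = butter(2, cutoff)            # cutoff relative to Nyquist (our reading)
    smooth = filtfilt(b, a, np.asarray(elbo, dtype=float))
    t = np.arange(len(smooth), dtype=float)

    def sigmoid(t, lo, hi, mid, rate):
        return lo + (hi - lo) / (1.0 + np.exp(-rate * (t - mid)))

    p0 = [smooth.min(), smooth.max(), len(smooth) / 2.0, 0.01]
    (lo, hi, mid, rate), _ = curve_fit(sigmoid, t, smooth, p0=p0, maxfev=20000)
    target = lo + level * (hi - lo)
    hit = np.nonzero(sigmoid(t, lo, hi, mid, rate) >= target)[0]
    return int(hit[0]) if hit.size else len(smooth) - 1
```

The refit-and-time step then simply reruns the SLDS for the returned number of iterations.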
Another comparison was made by assuming rank 2 orthogonal dynamics that are smooth. Rather than switching the angle once, we sample it from a Gaussian process in order to obtain slowly-varying rank 2 dynamics. In this case, the orthogonal projection into the higher-dimensional space is fixed for all time. The results are shown in Figure 8. We see that TVART(4) substantially outperforms SLDS(6) (fit in the same manner as for the switching test problem described previously) as well as the independent models across system sizes . The parameters used for the sweep were , rank , , , and with the Spline regularization. Interestingly, with this strong temporal regularization, we can fit a different model at each time point.
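One way to sample such a slowly varying angle sequence is from a Gaussian process via a Cholesky factor of the kernel matrix. This is a sketch; the squared-exponential kernel and its hyperparameters here are illustrative choices, not the paper's.

```python
import numpy as np

def gp_angles(T, length_scale=50.0, amplitude=1.0, seed=0):
    """Sample a smooth angle sequence theta_1..theta_T from a Gaussian
    process with a squared-exponential kernel (illustrative hyperparameters)."""
    t = np.arange(T, dtype=float)
    # Squared-exponential (RBF) Gram matrix over the time indices
    K = amplitude**2 * np.exp(-0.5 * (t[:, None] - t[None, :])**2 / length_scale**2)
    L = np.linalg.cholesky(K + 1e-6 * np.eye(T))  # jitter for numerical stability
    rng = np.random.default_rng(seed)
    return L @ rng.standard_normal(T)
```

Each sampled angle then defines a rank 2 system matrix through the single, fixed orthogonal projection, so only the rotation angle varies over time.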
D.2 Lorenz system
We use the usual chaotic parameter set , , and , integrate with 4th-order Runge-Kutta using step size 0.001, and discard the initial 1000-step transient. We observe all three state variables corrupted by additive Gaussian noise with unit standard deviation.
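A minimal integrator for this setup is sketched below, using the standard chaotic parameters (sigma = 10, rho = 28, beta = 8/3); the initial condition and trajectory length are our choices for illustration.

```python
import numpy as np

def lorenz_rk4(T, dt=0.001, sigma=10.0, rho=28.0, beta=8.0/3.0,
               transient=1000, x0=(1.0, 1.0, 1.0)):
    """Integrate the Lorenz system with classical 4th-order Runge-Kutta,
    discard the initial transient, and return a 3 x T trajectory."""
    def f(x):
        return np.array([sigma * (x[1] - x[0]),
                         x[0] * (rho - x[2]) - x[1],
                         x[0] * x[1] - beta * x[2]])
    x = np.array(x0, dtype=float)
    out = np.empty((3, T))
    for t in range(transient + T):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        if t >= transient:
            out[:, t - transient] = x
    return out
```

The noisy observations are then obtained by adding unit-variance Gaussian noise to each entry of the returned trajectory.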
We run TVART with , , , , and affine dynamics. The algorithm converges in 369 iterations to an RMSE of 1.39, which is significantly smaller than the standard deviations of the state variables (7.83, 8.98, 8.76 for , respectively), meaning that the learned TVART model has reasonable one-step predictive power.
D.3 Worm behavior
Worm postural data were analyzed as smooth time series of “eigenworm” principal components. We ran TVART with an affine model and , and on these data. We also compared against the code provided by Costa et al. (2019) at https://github.com/AntonioCCosta/locallinearsegmentation.
D.4 Sea surface temperature
Sea surface temperature grids (Reynolds et al., 2002) were downsampled by a factor of 6 in the latitudinal and longitudinal directions, resulting in final vectors of length . We chose a window size of weeks, resulting in windows, and used parameters , , and the Spline regularizer. In this case, the ALS routine stagnates (the matrix converges very slowly). However, we have found that restarting the algorithm after 10 iterations and setting in the new initialization speeds convergence up significantly. This heuristic was inspired by the fact that the solution has . In the end, the algorithm requires 1176 iterations to converge.

D.5 Neural activity during a reaching task
The ECoG recordings were provided by Chao et al. (2010) as part of the NeuroTycho project. We analyzed the data of monkey K1, date 20090525. We ran TVART on the channel ECoG voltage data after filtering out 50 Hz line noise, downsampling to 500 Hz, and standardization. The parameters were , , , and , resulting in windows of length 0.4 s.
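The preprocessing steps can be sketched as below. The original sampling rate and the notch filter's quality factor are assumptions for illustration; the text specifies only the 50 Hz line-noise removal, the 500 Hz target rate, and per-channel standardization.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt, decimate

def preprocess_ecog(X, fs=1000.0, line_hz=50.0, target_fs=500.0):
    """Preprocessing sketch for a channels x samples array: notch out line
    noise, downsample to the target rate, and standardize each channel."""
    b, a = iirnotch(line_hz, Q=30.0, fs=fs)    # 50 Hz notch (Q is an assumption)
    X = filtfilt(b, a, X, axis=1)              # zero-phase filtering
    factor = int(round(fs / target_fs))
    if factor > 1:
        X = decimate(X, factor, axis=1)        # anti-aliased downsampling
    X = X - X.mean(axis=1, keepdims=True)      # zero mean per channel
    X = X / X.std(axis=1, keepdims=True)       # unit variance per channel
    return X
```

TVART is then run on the standardized array, with the window length chosen so that each window covers 0.4 s of data.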