Multivariate time series are of considerable interest in a number of domains, such as finance, economics, and engineering. Vector autoregressive (VAR) models have played a central role in capturing the dynamics hidden in such time series (Sims, 1980). VAR models typically fit a multivariate time series with linear coefficients representing the dependencies among the variables within a limited number of lags, and an innovation (or error) representing new information (impulses) fed to the process at a given time point. Although it has been common practice to maintain a linear functional form to achieve interpretability and tractability, recent studies have provided a growing body of evidence that nonlinearity often exists in time series, and that allowing for nonlinearities can be valuable for uncovering important features of the dynamics (Jeliazkov, 2013; Kalli and Griffin, 2018; Koop and Korobilis, 2010; Primiceri, 2005; Shen et al., 2019; Teräsvirta, 1994; Tsay, 1998). Many recent studies have used deep learning frameworks with neural networks to model nonlinear processes, for example in video (Finn et al., 2016; Lotter et al., 2017; Oh et al., 2015; Srivastava et al., 2015; Villegas et al., 2017; Wichers et al., 2018) or audio (van den Oord et al., 2016).
The innovation plays a key role by driving the time series, and it can have a concrete meaning, such as economic shocks in finance, external torques applied to a mechanical system, or stimulation in neuroscience experiments. However, its estimation has a serious indeterminacy even with linear models, if only conventional statistical assumptions are made. To facilitate estimation, VAR typically assumes that the innovations are additive, multivariate Gaussian (not necessarily uncorrelated), and temporally independent (or serially uncorrelated). A well-known consequence of this is that the innovations cannot be identified: multiplication of such innovations by any orthogonal matrix will not change the distribution of the observed data, which hinders their interpretation. Some studies proposed to incorporate an independent component analysis (ICA) framework to guarantee identifiability, by assuming mutual independence of non-Gaussian innovations (Gómez-Herrero et al., 2008; Hyvärinen et al., 2010; Lanne et al., 2017; Moneta et al., 2013). However, those studies assumed linear VAR models, while the indeterminacy would presumably be even more serious in general nonlinear VAR (NVAR) models, in which the innovations may no longer be additive. In fact, a serious lack of identifiability in general nonlinear cases is well known in nonlinear ICA (NICA) (Hyvärinen and Pajunen, 1999).
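The rotation indeterminacy of Gaussian innovations can be checked numerically. The following is a minimal sketch, assuming a hypothetical mixing matrix `A` applied to zero-mean, unit-variance Gaussian innovations (neither `A` nor the dimension is taken from the text). Since a zero-mean Gaussian is fully described by its covariance, equal covariances mean equal observed distributions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A = rng.normal(size=(n, n))                    # hypothetical mixing of Gaussian innovations
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # a random orthogonal matrix

# Covariance of A e with e ~ N(0, I), and of (A Q) e' with e' = Q^T e ~ N(0, I).
cov_original = A @ A.T
cov_rotated = (A @ Q) @ (A @ Q).T

# Both models produce the same Gaussian distribution for the observations.
same = np.allclose(cov_original, cov_rotated)
print(same)
```

The mixing matrices `A` and `A @ Q` therefore cannot be told apart from the data, and neither can the corresponding innovations.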
We propose a novel VAR analysis framework called independent innovation analysis (IIA), which enables estimation of innovations hidden in an unknown general NVAR process. We first propose a model which allows for nonlinear interactions between innovations and observations, with very general nonlinearities. IIA can be seen as an extension of recently proposed NICA frameworks (Hyvärinen and Morioka, 2016; Hyvarinen et al., 2019), and guarantees the identifiability of innovations up to permutation and component-wise nonlinearities. The model assumes a certain temporal structure in the innovations, which typically takes the form of nonstationarity, but can be more general. We propose two practical estimation methods for IIA, both of which are self-supervised and can thus be easily implemented with ordinary neural network training. Our identifiability theory for NVAR is quite different from anything presented earlier, and thus it can serve as a new general framework for NVAR processes.
2 Model definition
2.1 NVAR model and demixing model
We here assume a general NVAR model, which is first order (NVAR(1)) for simplicity:

$$\mathbf{x}_t = \mathbf{f}(\mathbf{x}_{t-1}, \mathbf{s}_t), \tag{1}$$

where $\mathbf{f}$ represents an NVAR (mixing) model, and $\mathbf{x}_t$ and $\mathbf{s}_t$ are the observations and innovations (or errors) of the process at time point $t$, respectively. As with ordinary VAR, the innovations are assumed to be temporally independent (serially uncorrelated). Importantly, this model includes potential nonlinear interaction between the observations and innovations, unlike ordinary linear VAR models (Gómez-Herrero et al., 2008; Hyvärinen et al., 2010; Lanne et al., 2017; Moneta et al., 2013) and additive innovation nonlinear models (Shen et al., 2019). We assume that $\mathbf{f}$ is unknown and make minimal regularity assumptions on it in the theorems below. Our goal is to estimate the innovations $\mathbf{s}_t$ (latent components) only from the observations $\mathbf{x}_t$ obtained from the unknown NVAR process. The model, learning algorithms, theorems, and proofs below can be easily extended to higher-order models NVAR($p$) ($p > 1$) by replacing $\mathbf{x}_{t-1}$ by $[\mathbf{x}_{t-1}, \ldots, \mathbf{x}_{t-p}]$.
To estimate the innovation, we propose a new framework called IIA, which learns the inverse (demixing) of the NVAR (mixing) model from the observations in a data-driven manner, based on some statistical assumptions on the innovations. The theory is related to ICA (Hyvärinen, 1999), which estimates a demixing from instantaneous mixtures of latent components, i.e., $\mathbf{x}_t = \mathbf{f}(\mathbf{s}_t)$, where $\mathbf{f}$ is usually a linear function. However, IIA includes a recurrent structure of the observations in the model (Eq. 1), which makes IIA theoretically distinct from ordinary ICA. Nevertheless, in the following we leverage the recently developed theory of NICA (Hyvärinen and Morioka, 2016; Hyvarinen et al., 2019).
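We start by transforming the NVAR model to something similar to NICA. This leads us to consider the following augmented NVAR (mixing) model:

$$\begin{bmatrix}\mathbf{x}_t \\ \mathbf{x}_{t-1}\end{bmatrix} = \tilde{\mathbf{f}}\left(\begin{bmatrix}\mathbf{s}_t \\ \mathbf{x}_{t-1}\end{bmatrix}\right) = \begin{bmatrix}\mathbf{f}(\mathbf{x}_{t-1}, \mathbf{s}_t) \\ \mathbf{x}_{t-1}\end{bmatrix}, \tag{2}$$

where $\tilde{\mathbf{f}}$ is the augmented model, which includes the original NVAR model $\mathbf{f}$ in half of the space, and an identity mapping of $\mathbf{x}_{t-1}$ in the remaining subspace. Importantly, this augmentation does not impose any particular constraint on the original model. We only assume that this augmented model is invertible (i.e., bijective; note that $\mathbf{f}$ itself cannot be invertible) as well as sufficiently smooth, but we do not constrain it in any other way. The estimation of the innovation can then be achieved by learning the inverse (demixing) of the augmented NVAR model $\tilde{\mathbf{f}}$:

$$\begin{bmatrix}\mathbf{s}_t \\ \mathbf{x}_{t-1}\end{bmatrix} = \tilde{\mathbf{g}}\left(\begin{bmatrix}\mathbf{x}_t \\ \mathbf{x}_{t-1}\end{bmatrix}\right) = \begin{bmatrix}\mathbf{g}(\mathbf{x}_t, \mathbf{x}_{t-1}) \\ \mathbf{x}_{t-1}\end{bmatrix}, \tag{3}$$

where $\tilde{\mathbf{g}}$ is the augmented demixing model of the (true) augmented NVAR model $\tilde{\mathbf{f}}$, and $\mathbf{g}$ is the sub-space of the demixing model representing a mapping from two temporally consecutive observations to the innovation at the corresponding time point. This is simply a deduction from Eq. 2, and does not impose any additional assumptions on the original model.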
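The relation between the mixing model and its demixing can be illustrated on a toy process. This is only a sketch under stated assumptions: the mixing function `f`, its weights `W`, the state-dependent gain, and the Laplace innovations are all hypothetical choices, and the demixing `g` is written in closed form here, whereas IIA has to learn it from data because the mixing is unknown.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 200
W = 0.5 * rng.normal(size=(n, n))       # hypothetical recurrent weights

def f(x_prev, s):
    """Toy NVAR(1) mixing: the past observation sets a drift and a positive gain on s."""
    gain = 1.0 + 0.5 * np.tanh(x_prev) ** 2
    return np.tanh(W @ x_prev) + gain * s

def g(x_t, x_prev):
    """Demixing for the toy f: maps two consecutive observations to the innovation."""
    gain = 1.0 + 0.5 * np.tanh(x_prev) ** 2
    return (x_t - np.tanh(W @ x_prev)) / gain

s = rng.laplace(size=(T, n))            # temporally independent innovations
x = np.zeros((T + 1, n))
for t in range(T):
    x[t + 1] = f(x[t], s[t])            # generate the observed series

# Recover every innovation from the pair (x_t, x_{t-1}).
s_rec = np.array([g(x[t + 1], x[t]) for t in range(T)])
max_err = np.max(np.abs(s_rec - s))
print(max_err)
```

Because the gain is strictly positive, the map from the innovation to the current observation is invertible for every fixed past observation, which is what invertibility of the augmented model requires in this toy case.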
2.2 Innovation model with auxiliary variable
The estimation of the demixing model in an unsupervised (or self-supervised) manner needs some assumptions on the innovations. Although some studies guaranteed identifiability by assuming mutual independence of the innovations in linear VAR models (Hyvärinen et al., 2010; Lanne et al., 2017; Moneta et al., 2013), this would not be enough in nonlinear cases, as can be seen from the well-known indeterminacy of NICA with i.i.d. components (Hyvärinen and Pajunen, 1999). Thus, we here adopt the framework recently proposed for NICA (Hyvärinen and Morioka, 2016; Hyvarinen et al., 2019): we assume that we can additionally observe auxiliary information about the innovation for each data point $t$, represented by a random variable $\mathbf{u}_t$, which specifies the modulations of the distributions of the innovations as a function of $\mathbf{u}_t$. In practice, $\mathbf{u}_t$ can simply be the time index (Hyvarinen et al., 2019) or a time-segment index (Hyvärinen and Morioka, 2016), thus incorporating information about nonstationarity. More specifically, we assume the following:
A1. The innovations are temporally independent.
A2. Each $s_i$ is statistically dependent on some fully observed $m$-dimensional random auxiliary variable $\mathbf{u}$, but conditionally independent of the other $s_j$, and has a univariate exponential family distribution conditioned on $\mathbf{u}$ (we omit the data index $t$ here):

$$p(s_i \mid \mathbf{u}) = \frac{Q_i(s_i)}{Z_i(\mathbf{u})} \exp\left[\sum_{j=1}^{k} q_{ij}(s_i)\,\lambda_{ij}(\mathbf{u})\right], \tag{4}$$

where $Q_i$ is the base measure, $Z_i$ is the normalizing constant, $k$ is the model order, $q_{ij}$ is the sufficient statistic, and $\lambda_{ij}$ is a parameter (scalar function) depending on $\mathbf{u}$. (Footnote 1: $k$ is assumed to be minimal, meaning that we cannot rewrite the form with a smaller $k$. The parameters are assumed that for each , . These conditions are required for the distribution to be strongly exponential (Khemakhem et al., 2020), which is not very restrictive, and satisfied by all the usual exponential family distributions.)
The temporal independence (A1) is the ordinary assumption for VAR. Assumption A2 is related to the assumption of Gaussian innovations in ordinary VAR, but requires more specific properties, namely conditional independence and sufficient probabilistic modulation, determined by a fully observable auxiliary variable $\mathbf{u}$. Note that exponential families have universal approximation capabilities, so this assumption is not very restrictive (Sriperumbudur et al., 2017).
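As a worked illustration of A2, a Gaussian innovation whose mean and standard deviation are modulated by the auxiliary variable can be written in exponential family form with model order 2. The modulation functions $\mu_i$ and $\sigma_i$ are hypothetical and introduced only for this example; $Q_i$, $Z_i$, $q_{ij}$, and $\lambda_{ij}$ denote the base measure, normalizing constant, sufficient statistics, and parameters of A2.

```latex
\begin{align*}
p(s_i \mid \mathbf{u})
  &= \frac{1}{\sqrt{2\pi\sigma_i^2(\mathbf{u})}}
     \exp\left[-\frac{(s_i-\mu_i(\mathbf{u}))^2}{2\sigma_i^2(\mathbf{u})}\right] \\
  &= \frac{Q_i(s_i)}{Z_i(\mathbf{u})}
     \exp\left[q_{i1}(s_i)\,\lambda_{i1}(\mathbf{u}) + q_{i2}(s_i)\,\lambda_{i2}(\mathbf{u})\right], \\
Q_i(s_i) &= 1, \qquad q_{i1}(s_i) = s_i, \qquad q_{i2}(s_i) = s_i^2, \\
\lambda_{i1}(\mathbf{u}) &= \frac{\mu_i(\mathbf{u})}{\sigma_i^2(\mathbf{u})}, \qquad
\lambda_{i2}(\mathbf{u}) = -\frac{1}{2\sigma_i^2(\mathbf{u})}, \\
Z_i(\mathbf{u}) &= \sqrt{2\pi\sigma_i^2(\mathbf{u})}\,
  \exp\left[\frac{\mu_i^2(\mathbf{u})}{2\sigma_i^2(\mathbf{u})}\right].
\end{align*}
```

If the mean is fixed at zero and only the variance is modulated, the first term vanishes and the minimal model order becomes 1, with sufficient statistic $s_i^2$.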
3 Learning algorithms
Depending on the type of the auxiliary variable in the innovation model (see A2), we can develop two learning algorithms: (IIA-GCL) a generalized contrastive learning (Hyvarinen et al., 2019)-based framework for the general case of a possibly continuous-valued $\mathbf{u}$, and (IIA-TCL) a time-contrastive learning (Hyvärinen and Morioka, 2016)-based framework for the special case in which the auxiliary variable is an integer taking a finite number of values (e.g., a time segment index).
3.1 Generalized contrastive learning framework (IIA-GCL)
In the general case, we develop a generalized contrastive learning (GCL) framework for IIA, based on the recently proposed NICA framework (Hyvarinen et al., 2019), using the variant in which randomization is performed on $\mathbf{u}$. Thus we define two datasets

$$\tilde{\mathbf{x}}(t) = (\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{u}_t) \quad \text{vs.} \quad \tilde{\mathbf{x}}^*(t) = (\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{u}^*), \tag{5}$$

where $\mathbf{u}^*$ is a random value from the distribution of $\mathbf{u}$, but independent of the observations, created in practice by random permutation of the empirical sample of $\mathbf{u}$. We learn a nonlinear logistic regression system using a regression function of the form
which gives the posterior probability of the first class as $1/(1+\exp(-r(\mathbf{x}_t, \mathbf{x}_{t-1}, \mathbf{u})))$, where $r$ denotes the regression function. The scalar-valued functions that make up the regression function take some of $\mathbf{x}_t$, $\mathbf{x}_{t-1}$, and $\mathbf{u}$ as input; note that the first term has a special factorized form. The universal approximation capacity (Hornik et al., 1989) is assumed for those functions; they would typically be learned by neural networks. This learning framework and the regression function are based on the following Theorem, proven in Supplementary Material A:
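Theorem 1. Assume the following:
(Assumption of Variability) There exist $nk+1$ distinct points $\mathbf{u}_0, \ldots, \mathbf{u}_{nk}$, with $n$ the number of innovations and $k$ the model order in A2, such that the matrix

$$\mathbf{L} = \big(\boldsymbol{\lambda}(\mathbf{u}_1) - \boldsymbol{\lambda}(\mathbf{u}_0), \ldots, \boldsymbol{\lambda}(\mathbf{u}_{nk}) - \boldsymbol{\lambda}(\mathbf{u}_0)\big) \tag{7}$$

of size $nk \times nk$ is invertible, where $\boldsymbol{\lambda}(\mathbf{u}) = (\lambda_{11}(\mathbf{u}), \ldots, \lambda_{nk}(\mathbf{u}))^\top$.
The augmented function $\tilde{\mathbf{f}}$ is invertible.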
The scalar functions in Eq. 6 are twice differentiable, and for each , the following implication holds: .
Then, in the limit of infinite data, in the regression function provides a consistent estimator of the IIA model: The functions give the independent innovations, up to permutation and scalar (component-wise) invertible transformations.
This Theorem guarantees the convergence (consistency) of the learning algorithm. It immediately implies the identifiability of the innovations, up to a permutation and component-wise invertible nonlinearities. This kind of identifiability for innovations is stronger than any previous results in the literature. The estimation is based on learning a nonlinear logistic regression function, and thus can be easily implemented with ordinary neural network training. The Assumption of Variability requires the auxiliary variable to have a sufficiently strong and diverse effect on the distributions of the innovations. The assumptions on the NVAR model are not too restrictive, and are expected to be satisfied in many applications. Assumption 6 indicates that the scalar functions are not functionally redundant: none of them can be represented by a linear combination of the others. Although the assumptions on the nonlinear functions to be trained (assumptions 5 and 6) are not trivial, we expect that they are needed only for a rigorous theory, and are immaterial in any practical implementation.
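The two datasets contrasted by IIA-GCL can be built by pairing consecutive observations with either the observed auxiliary variable or a randomly permuted copy of it. The sketch below shows only this data preparation step, not the regression function; the helper name `make_gcl_datasets`, the array layout, and the placeholder data are assumptions made for illustration.

```python
import numpy as np

def make_gcl_datasets(x, u, rng):
    """Build the real and u-randomized datasets for two-class logistic regression.

    x: array (T, n) of observations; u: array (T,) or (T, m) of auxiliary values.
    Returns stacked features (x_t, x_{t-1}, u) and labels (1 = real, 0 = randomized).
    """
    x_t, x_prev = x[1:], x[:-1]
    u_t = u[1:].reshape(len(x_t), -1)
    u_star = u_t[rng.permutation(len(u_t))]   # breaks the link between u and the observations
    real = np.hstack([x_t, x_prev, u_t])
    randomized = np.hstack([x_t, x_prev, u_star])
    features = np.vstack([real, randomized])
    labels = np.concatenate([np.ones(len(real)), np.zeros(len(randomized))])
    return features, labels

rng = np.random.default_rng(0)
x = rng.normal(size=(101, 4))          # placeholder observations
u = np.arange(101, dtype=float)        # time index used as the auxiliary variable
features, labels = make_gcl_datasets(x, u, rng)
print(features.shape, labels.mean())
```

A classifier trained on `features` and `labels` can only beat chance by exploiting how the auxiliary variable modulates the observations, which is the signal the theory relies on.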
3.2 Time-contrastive learning framework (IIA-TCL)
In the special case in which $u$ is an integer indexing a finite number of classes, we can also develop a TCL-based framework for the estimation (Hyvärinen and Morioka, 2016). This special case includes time-segment-wise stationary processes, in which $u_t$ represents the time segment index at time $t$ (Hyvärinen and Morioka, 2016).
Instead of the basic two-class logistic regression used in IIA-GCL, IIA-TCL uses a multinomial logistic regression (MLR) classifier for the learning. More specifically, we learn a nonlinear MLR using a softmax function which represents the posterior distribution of , by the form
where are the class-specific weight and bias parameters of the MLR, and , , and are again scalar-valued functions assumed to have the universal approximation capacity (Hornik et al., 1989); they would typically be learned by neural networks. This learning framework and the regression function are justified by the following Theorem, proven in Supplementary Material B:
Theorem 2. Assume the following:
The auxiliary variable is an integer in , with the number of values it takes (classes).
(Assumption of Variability) The modulation matrix of size
has full row rank , where .
We train a multinomial logistic regression with universal approximation capability to predict the class label (auxiliary variable) from with regression function in Eq. 8.
The augmented function is invertible.
The scalar functions in Eq. 8 are twice differentiable, and for each , the following implication holds: .
Then, in the limit of infinite data in each class, in the regression function provides a consistent estimator of the IIA model: The functions give the independent innovations, up to permutation and scalar (component-wise) invertible transformations.
Many of the assumptions are the same as those in IIA-GCL, except for the specifics of the innovation model (assumptions 3 and 4) and the learning algorithm (assumption 5). The estimation is based on self-supervised nonlinear MLR, and thus can be easily implemented with ordinary neural network training, like IIA-GCL. Although the estimation methods are different, the identifiability result implied here by IIA-TCL is the same as that given above by IIA-GCL. Note that here the limit of infinite data means that each class (value of $u$) has an infinite number of data points. In practice, each class is thus required to have a sufficient number of samples, so the number of classes needs to be much smaller than the total number of data points; this would be natural if $u$ is a segment index.
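When the auxiliary variable is a segment index, the class labels for the MLR follow directly from the position of each sample in the series. A minimal sketch, assuming equally sized segments and hypothetical series lengths; the helper name `segment_labels` is not from the text.

```python
import numpy as np

def segment_labels(n_samples, n_segments):
    """Assign each time point the index of the equally sized segment it falls in."""
    if n_segments < 1 or n_samples < n_segments:
        raise ValueError("need 1 <= n_segments <= n_samples")
    seg_len = n_samples // n_segments
    labels = np.arange(n_samples) // seg_len
    return np.minimum(labels, n_segments - 1)   # leftover samples join the last segment

u = segment_labels(1024, 256)           # hypothetical series length and segment count
print(u[:10], np.bincount(u).min())
```

With 1024 samples and 256 segments each class holds only 4 samples, which shows why the number of classes has to stay small relative to the amount of data.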
4.1 Simulation 1: IIA-GCL for artificial dynamics with nonstationary innovations
We generated data from an artificial NVAR process with nonstationary innovations. The innovations were randomly generated from a Gaussian distribution by modulating its mean and standard deviation across time, i.e., . The modulations were designed to be temporally smooth and continuous. The dimensions of the observations and innovations ($n$) were both 20. As the NVAR model, we used a multilayer perceptron we call NVAR-MLP, which takes a concatenation of $\mathbf{x}_{t-1}$ and $\mathbf{s}_t$ as input and outputs $\mathbf{x}_t$ (see Supplementary Material C for more details of the experimental settings). The goal of this simulation is to estimate the innovations $\mathbf{s}_t$ only from the observable time series $\mathbf{x}_t$, without knowing the parameters of the NVAR-MLP.
Considering the innovation model with , we here used IIA-GCL for the estimation of the latent innovations. We adopted MLPs as the nonlinear scalar functions in Eq. 6. The nonlinear regression function was trained by back-propagation with a momentum term so as to discriminate the real dataset from its $\mathbf{u}$-randomized version. (As a simple sanity check, we saw that it achieved higher-than-chance classification accuracy after the training; see Fig. 3a in Supplementary Material.) The performance was evaluated by the Pearson correlation between the true innovations and the estimated feature values . It was averaged over 10 runs, for each setting of the complexity (number of layers) of the NVAR-MLP and the number of data points. For comparison, we also applied NICA based on GCL (NICA-GCL; Hyvarinen et al., 2019), an NVAR with additive innovation model (AD-NVAR), and a variational autoencoder (VAE) (Kingma and Welling, 2014) to the same data. We fixed exceptionally for VAE because of the instability of training in high layer models. We additionally applied linear ICA (Hyvarinen, 2001) to the estimations by AD-NVAR and VAE for fair comparisons.
The IIA-GCL framework could reconstruct the innovations reasonably well even for the nonlinear mixture cases () (Fig. 1a). We can see that a larger amount of data makes it possible to achieve higher performance, and that higher complexity of the NVAR model makes learning more difficult. AD-NVAR performed well for the linear mixture case () because the additive innovation model is equivalent to the general NVAR model in the linear case; however, it was much worse in the nonlinear case. As expected, the other methods performed worse than IIA-GCL because their models did not match the NVAR generation model well.
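Since the innovations are identifiable only up to permutation, a correlation-based score has to pair each estimated component with a true one before averaging. The following sketch is one possible way to do this; the greedy pairing, the use of absolute correlations, and the function name `mean_matched_correlation` are assumptions, not the evaluation code used for the reported results.

```python
import numpy as np

def mean_matched_correlation(s_true, s_est):
    """Mean absolute Pearson correlation after greedily pairing estimated and true components."""
    n = s_true.shape[1]
    # Cross-correlation block between true (rows) and estimated (columns) components.
    corr = np.abs(np.corrcoef(s_true.T, s_est.T)[:n, n:])
    matched = []
    for _ in range(n):
        i, j = np.unravel_index(np.argmax(corr), corr.shape)
        matched.append(corr[i, j])
        corr[i, :] = -1.0               # this true component is now taken
        corr[:, j] = -1.0               # this estimate is now taken
    return float(np.mean(matched))

rng = np.random.default_rng(0)
s_true = rng.laplace(size=(1000, 3))
s_est = -2.0 * s_true[:, [2, 0, 1]]     # permuted, sign-flipped and rescaled copy
score = mean_matched_correlation(s_true, s_est)
print(score)
```

Pearson correlation is invariant to scaling and, with the absolute value, to sign, but not to general component-wise nonlinear transformations, so it is a conservative measure of recovery.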
4.2 Simulation 2: IIA-TCL for artificial dynamics with nonstationary innovations
Next, to evaluate the IIA-TCL framework, we applied it to the same data used in Simulation 1. For IIA-TCL, we first divided the time series into 256 equally sized segments, and used the segment label as the auxiliary variable $u$; i.e., we assume that the data are segment-wise stationary. Although this assumption is not consistent with the real innovation model (Section 4.1), it is approximately true because the modulations were temporally smooth and continuous; we thus consider here data with a realistic deviation from model assumptions. We adopted MLPs as the nonlinear scalar functions in Eq. 8, whose architectures were similar to those in Simulation 1. The training and evaluation methods follow those in Simulation 1. (Again as a sanity check, we saw that the MLR achieved higher-than-chance classification accuracy after the training; see Fig. 3b in Supplementary Material.) We discarded the cases of small data sets ( and ) because of the instability of training. For comparison, we also applied NICA based on TCL (NICA-TCL; Hyvärinen and Morioka, 2016). See Supplementary Material D for more details of the training settings.
IIA-TCL performed better than NICA-TCL (Fig. 1b). In addition, even though the innovation model matches IIA-GCL better than IIA-TCL (the modulations are temporally smooth and continuous, and thus not segment-wise stationary), IIA-TCL achieved slightly better performance than IIA-GCL; this finding is consistent with the comparison of NICA-GCL and NICA-TCL by Hyvarinen et al. (2019). As with IIA-GCL, a larger number of data points leads to higher performance (i.e., the method seems to converge), and again, higher complexity of the NVAR models makes learning more difficult.
4.3 Experiments on real brain imaging data
To evaluate the applicability of IIA to real data, we applied it to multivariate time series of electrical activity of the human brain, measured by magnetoencephalography (MEG). In particular, we used a dataset measured during auditory or visual presentations of words (Westner et al., 2018). Although ICA is often used to analyze brain imaging data, relying on the assumption of mutual independence of the hidden components, the event-related components (such as event-related potentials; ERPs) are not likely to be independent because they may have similar temporal patterns time-locked to the stimulation. However, the innovations generating the components should still be independent because they would be generated by different brain sources, which motivates us to use IIA rather than ICA (see Supplementary Material E for the details of the data and settings).
Data and preprocessing
We used a publicly available MEG dataset (Westner et al., 2018). Briefly, the participants were presented with a random word selected from 420 unrelated German nouns either visually or auditorily, randomly for each trial. MEG signals were measured from twenty healthy volunteers by a 148-channel magnetometer (219.1 ± 22.4 trials for each subject; 2,207 auditory and 2,174 visual trials in total for all subjects). We band-pass filtered the data between 4 Hz and 125 Hz (sampling frequency = 300 Hz). The dimension of the data was reduced to 30 by PCA.
We used IIA-TCL for the training, by assuming an NVAR(3) model and the segment-wise stationarity of the latent innovations. The trial data were divided into 84 equally sized segments of 8 samples each (26.7 ms), and the segment label was used as the auxiliary variable $u$. The same segment labels were given across the trials; however, considering the possible stimulus-specific dynamics of the brain, we assigned different labels for the auditory and visual trials. In total, there are 168 segments (classes) to be discriminated by MLR. We used MLPs for the nonlinear scalar functions (Eq. 8), and fixed the number of components to 5. Considering the fast sampling rate of the data, we fixed the time lag between two consecutive samples to 3 (10 ms).
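The label arithmetic described above can be checked with a short sketch. The segment length, segment count, and sampling frequency are taken from the text; the function name `trial_labels` and the choice of which modality receives the offset label range are assumptions made for illustration.

```python
import numpy as np

SEG_LEN = 8       # samples per segment (from the text)
N_SEG = 84        # segments per trial (from the text)
FS_HZ = 300.0     # sampling frequency (from the text)

def trial_labels(is_visual):
    """Segment labels for one trial; visual trials use a disjoint label range (assumed)."""
    within_trial = np.arange(SEG_LEN * N_SEG) // SEG_LEN
    return within_trial + (N_SEG if is_visual else 0)

auditory = trial_labels(False)
visual = trial_labels(True)
n_classes = len(np.union1d(auditory, visual))
segment_ms = 1000.0 * SEG_LEN / FS_HZ
print(n_classes, round(segment_ms, 1))
```

Every auditory trial shares one set of 84 labels and every visual trial another, which gives the 168 classes discriminated by the MLR.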
For evaluation, we performed classification of the stimulus modality (auditory or visual) by using the estimated innovations. The classification was performed using a linear support vector machine (SVM) classifier trained on the stimulation label and sliding-window-averaged innovations (width = 16 and stride = 8 samples) obtained for each trial. The performance was evaluated by the generalizability of a classifier across subjects, i.e., one-subject-out cross-validation (OSO-CV); the feature extractor and the classifier were trained only on the training subjects, and then applied to the held-out subject. For comparison, we also evaluated NICA based on TCL (Hyvärinen and Morioka, 2016) and AD-NVAR(3). We omitted for IIA-TCL because of the instability of training. We visualized the spatial characteristics of each innovation component by estimating the optimal (maximal and minimal) input while fixing to zero.
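The sliding-window averaging used to build classifier features can be sketched as follows. The width and stride come from the text; the function name, the trial length of 672 samples (84 segments of 8 samples), and the placeholder data are assumptions for illustration.

```python
import numpy as np

def sliding_window_mean(x, width=16, stride=8):
    """Average a (T, n) array over windows of `width` samples taken every `stride` samples."""
    if x.shape[0] < width:
        raise ValueError("the series is shorter than one window")
    starts = range(0, x.shape[0] - width + 1, stride)
    return np.stack([x[s:s + width].mean(axis=0) for s in starts])

# Placeholder for the 5 estimated innovation components of one trial.
trial = np.arange(672 * 5, dtype=float).reshape(672, 5)
features = sliding_window_mean(trial)
print(features.shape)
```

Each trial then contributes a fixed-size block of window averages, which can be flattened into one feature vector for the linear classifier.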
Figure 2a shows the decoding accuracies of the stimulus categories, across different methods and the number of layers for each model. The performances by IIA-TCL with nonlinear models () were significantly higher than the other baseline methods (; Wilcoxon signed-rank test, FDR correction), which indicates the importance of the modeling of the MEG signals by NVAR, especially with the nonlinear (non-additive) interactions of the innovations.
The left panels of Fig. 2b show the temporal patterns of the innovations during the auditory and visual stimuli. Some components have clear differences between the stimulus modalities, which implies that those components are related to the stimulus-specific dynamics of the brain; e.g., C1 and C2 represent auditory- and visual-relevant innovations, respectively. Such stimulus-specificity can also be seen from the spatial characteristics of the components; C1 is strongly activated by the MEG signals around auditory areas of the brain, while C2 is more activated by the visual areas. C3 seems to represent stimulus-evoked activities in the parietal region caused by both categories. These results show that IIA-TCL automatically extracted reasonable components (innovations) relevant to the external stimuli from the data in a self-supervised, data-driven manner.
IIA can be seen as a generalization of the recently proposed NICA frameworks (Hyvärinen and Morioka, 2016; Hyvarinen et al., 2019), with the important difference that observations can have recurrent temporal structure. The theory strictly includes NICA as a special case, since the main assumptions can be satisfied even if the NVAR model (Eq. 1) does not actually depend on $\mathbf{x}_{t-1}$, which corresponds to the instantaneous nonlinear mixture model of NICA: $\mathbf{x}_t = \mathbf{f}(\mathbf{s}_t)$. This connection can also be seen by comparing the regression functions; by omitting the dependencies of Eqs. 6 and 8 on $\mathbf{x}_{t-1}$, we obtain the same algorithms as NICA (Hyvärinen and Morioka (2016) with $k=1$, and Hyvarinen et al. (2019)). This indicates that the regression functions of IIA can learn NICA models as a special case.
Applying IIA to time series data has some practical advantages compared to NICA. First, autoregressive structures are generally inherent in any kind of dynamics, and their explicit modeling is beneficial for the estimation. Second, innovations are usually more mutually independent than the processes generated by them, because the independence of processes implies the independence of their innovations, but not vice versa, as argued in the linear case by Hyvärinen (1998). Thus, innovations are likely to give a better fit to any model assuming independence of the latent variables.
While IIA estimates innovations from the observed time series, the NVAR model $\mathbf{f}$ is left unknown, unlike in ordinary VAR analyses. In practice, we can estimate $\mathbf{f}$ after IIA as a post-processing step, by fitting a nonlinear function which outputs $\mathbf{x}_t$ from $\mathbf{x}_{t-1}$ and the estimated $\mathbf{s}_t$. Since IIA guarantees the estimation of $\mathbf{s}_t$ up to a permutation and element-wise invertible nonlinearities, this should be possible if the model to be fitted has universal approximation capability.
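The post-processing step can be sketched on a scalar toy process. Everything here is an assumption made for illustration: the toy dynamics, the use of the true innovations as a stand-in for IIA estimates, and the quadratic feature regression, which replaces the universal approximator (such as a neural network) that a real application would need.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
s = rng.uniform(-1.0, 1.0, size=T)      # stands in for innovations estimated by IIA
x = np.zeros(T + 1)
for t in range(T):                      # toy process with a non-additive innovation term
    x[t + 1] = 0.5 * x[t] + s[t] + 0.3 * x[t] * s[t]

def features(x_prev, s_t):
    """Quadratic features of (x_{t-1}, s_t); a neural network could be used instead."""
    return np.column_stack([np.ones_like(s_t), x_prev, s_t,
                            x_prev * s_t, x_prev ** 2, s_t ** 2])

# Least-squares fit of x_t from x_{t-1} and the innovation.
Phi = features(x[:-1], s)
coef, *_ = np.linalg.lstsq(Phi, x[1:], rcond=None)
residual = np.max(np.abs(Phi @ coef - x[1:]))
print(np.round(coef, 3), residual)
```

With actual IIA estimates, the innovations are known only up to permutation and component-wise invertible transformations, so the fitted function absorbs those transformations rather than matching the original mixing model term by term.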
We proposed independent innovation analysis (IIA) as a new general framework to nonlinearly extract innovations hidden in a time series. In contrast to the common simplifying assumption of additive innovations, IIA can deal with a general nonlinear VAR model in which innovations are not additive. Any general nonlinear interactions between the innovations and the observations are allowed. To guarantee identifiability, IIA requires some assumptions on the innovations, in particular mutual independence conditionally on a fully observable auxiliary variable which also needs to modulate the distributions of the innovations. A typical case would be nonstationary innovations mutually independent at each time point.
We proposed two practical estimation methods, both of which are based on self-supervised training of a nonlinear feature extractor by logistic regression, possibly multinomial. They can thus be easily implemented with ordinary neural network training. In both methods, the consistency of the estimation is guaranteed up to a permutation and component-wise invertible nonlinearity, which gives by far the strongest identifiability result for general NVAR in the literature. IIA can be seen as a generalization of recently proposed NICA frameworks, and includes them as special cases.
Experiments on real brain imaging data by MEG showed distinctive components relevant to the external-stimulus categories. This result suggests a wide applicability of the method to different kinds of time series such as video data, econometric data, and biomedical data, in which innovation plays an important role.
This research was supported in part by JSPS KAKENHI 18KK0284, 19K20355, and 19H05000. A.H. was funded by a Fellow Position from CIFAR as well as the DATAIA convergence institute as part of the “Programme d’Investissement d’Avenir” (ANR-17-CONV-0003) operated by Inria.
- Finn et al. (2016) Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems (NIPS) 29, pp. 64–72.
- Gómez-Herrero et al. (2008) Measuring directional coupling between EEG sources. NeuroImage 43 (3), pp. 497–508.
- Gutmann and Hyvärinen (2012) Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13 (11), pp. 307–361.
- Hastie et al. (2001) The elements of statistical learning. Springer, New York, NY.
- Hornik et al. (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), pp. 359–366.
- Hyvärinen and Morioka (2016) Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. In Advances in Neural Information Processing Systems (NIPS) 29, pp. 3765–3773.
- Hyvärinen and Pajunen (1999) Nonlinear independent component analysis: existence and uniqueness results. Neural Netw. 12 (3), pp. 429–439.
- Hyvarinen et al. (2019) Nonlinear ICA using auxiliary variables and generalized contrastive learning. In AISTATS, pp. 859–868.
- Hyvärinen et al. (2010) Estimation of a structural vector autoregression model using non-Gaussianity. Journal of Machine Learning Research 11 (56), pp. 1709–1731.
- Hyvärinen (1999) Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw. 10 (3), pp. 626–634.
- Hyvarinen (2001) Blind source separation by nonstationarity of variance: a cumulant-based approach. IEEE Transactions on Neural Networks 12 (6), pp. 1471–1474.
- Hyvärinen (1998) Independent component analysis for time-dependent stochastic processes. In ICANN 98, pp. 135–140.
- Jeliazkov (2013) Nonparametric vector autoregressions: specification, estimation, and inference. In VAR Models in Macroeconomics – New Developments and Applications: Essays in Honor of Christopher A. Sims, Vol. 32, pp. 327–359.
- Kalli and Griffin (2018) Bayesian nonparametric vector autoregressive models. Journal of Econometrics 203 (2), pp. 267–282.
- Khemakhem et al. (2020) Variational autoencoders and nonlinear ICA: a unifying framework. In AISTATS.
- Kingma and Welling (2014) Auto-encoding variational Bayes. In ICLR 2014.
- Koop and Korobilis (2010) Bayesian multivariate time series methods for empirical macroeconomics. Found. Trends Econ. 3 (4), pp. 267–358.
- Lanne et al. (2017) Identification and estimation of non-Gaussian structural vector autoregressions. Journal of Econometrics 196 (2), pp. 288–304.
- Lotter et al. (2017) Deep predictive coding networks for video prediction and unsupervised learning. In ICLR 2017.
- Moneta et al. (2013) Causal inference by independent component analysis: theory and applications. Oxford Bulletin of Economics and Statistics 75 (5), pp. 705–730.
- Oh et al. (2015) Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems (NIPS) 28, pp. 2863–2871.
- Primiceri (2005) Time varying structural vector autoregressions and monetary policy. The Review of Economic Studies 72 (3), pp. 821–852.
- Shen et al. (2019) Nonlinear structural vector autoregressive models with application to directed brain networks. IEEE Transactions on Signal Processing 67 (20), pp. 5325–5339.
- Sims (1980) Macroeconomics and reality. Econometrica 48 (1), pp. 1–48.
- Sriperumbudur et al. (2017) Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research 18 (57), pp. 1–59.
- Srivastava et al. (2015) Unsupervised learning of video representations using LSTMs. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp. 843–852.
- Teräsvirta (1994) Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association 89 (425), pp. 208–218.
- Tsay (1998) Testing and modeling multivariate threshold models. Journal of the American Statistical Association 93 (443), pp. 1188–1202.
- van den Oord et al. (2016) WaveNet: A generative model for raw audio. CoRR abs/1609.03499.
- Villegas et al. (2017) Decomposing motion and content for natural video sequence prediction. In ICLR 2017.
- Westner et al. (2018) Across-subjects classification of stimulus modality from human MEG high frequency activity. PLOS Computational Biology 14 (3), pp. 1–14.
- Wichers et al. (2018) Hierarchical long-term video prediction without supervision. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 6038–6046.
A Proof of Theorem 1
The log-pdf of is given by, using the probability transformation formula,
where , , and are the conditional pdfs of , , and , respectively, denotes the Jacobian, and by definition. The third equation is from the structure of the augmented demixing model (Eq. 3), and the last equation is from the temporal independence of (A1).
By well-known theory [3, 4], after convergence of logistic regression, with infinite data and a function approximator with universal approximation capability, the regression function (Eq. 6) will equal the difference of the log-pdfs in the two classes:
where and are the marginal pdfs of the innovations and observations when is integrated out, and the last equation comes from the conditional exponential pdf model of (A2). The Jacobians and marginals cancel out here. Considering its factorized form and the distinct dependency of each term on , , and , one possible solution is
Next, we have to prove that this is the only solution up to the indeterminacies given in the Theorem. Let be the points given by assumption 3 in the Theorem. We plug in each of those points to obtain equations. Collecting those equations into rows, and subtracting the first equation for from the remaining equations, we obtain:
where is a matrix of , with the product of giving row index and column index, is a matrix of given in assumption 3 in the Theorem, , , and the other vectors are collections of the corresponding terms in Eq. 11 at the points , after subtracting from each the one with ; , is a vector of ones, , , and . On both sides of the equation, the terms not depending on disappear through the subtraction with . Defining a compound demixing-mixing function and changing variables to , we then have
Firstly, we will show that is invertible. From the definition of , its partial derivative with respect to is . According to Lemma 3 of , for which satisfies A2, there exist points such that are linearly independent. By differentiating Eq. 14 with respect to and collecting their evaluations at such distinct points for all , we get
where is a matrix collecting to the columns indexed by , and is a collection of partial derivatives of evaluated at the same points. is invertible (through a combination of Lemma 3 of  and the fact that each component of is univariate), and thus the right-hand side is invertible because is invertible as well (assumption 3). The invertibility of the right-hand side implies the invertibility of and .
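The last implication is the standard determinant argument. As a sketch with placeholder symbols (the matrices $A$, $B$, and $C$ are not the paper's notation), and assuming that the left-hand side is a product of two square matrices:

```latex
A B = C, \quad \det C \neq 0
\;\Longrightarrow\; \det A \,\det B = \det C \neq 0
\;\Longrightarrow\; \det A \neq 0 \ \text{and}\ \det B \neq 0 .
```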
Now, define an augmented compound demixing-mixing function , where is the augmented function defined in assumption 5 in the Theorem. The corresponds to defined above. Note that is invertible because both and are invertible. What we need to prove is that is a block-wise invertible point-wise function, in the sense that is a function of only one and not of any of , and vice versa. This can be done by showing that the product of any two distinct partial derivatives of any component is always zero, and that the Jacobian is block diagonal; the upper and lower blocks correspond to and , respectively. Along with invertibility, this means that each component depends on exactly one variable of the corresponding block ( or ). Below, we show this separately for and . Firstly, this is obviously true for because is just an identity mapping of by the definitions of and , and does not depend on ; the lower non-zero block of
is an identity matrix. Next, we will show that for . We differentiate Eq. 14 with respect to (an element of ), and , and get
From the invertibility of and the calculation of differentials, we get
where , , such that the non-zero entries are at indices , , , and . From Lemmas 4 and 5 of , assumption 6 implies that has full row rank , and thus the pseudo-inverse of fulfils . We multiply the equation above from the left by this pseudo-inverse and obtain
In particular, for all , , and . This means that a row of at each has either 1) only one non-zero entry somewhere in the former half block (corresponding to the partial derivatives with respect to ) or 2) non-zero entries only in the latter half block (corresponding to the partial derivatives with respect to ). The latter case is contradictory because it means that the component is a function of only , and cannot satisfy Eq. 14, whose right-hand side is a function of all components of (and ). Therefore, must have only one non-zero entry in the former half block for each row. From the results of and , we deduce that is a block diagonal matrix. Now, by invertibility and continuity of , we deduce that the locations of the non-zero entries are fixed and do not change as a function of . This proves that is a block-wise invertible point-wise function, and () is represented by only one () up to a scalar (component-specific) invertible transformation, and the Theorem is proven.
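To make the structure of this conclusion concrete, the following sketch builds a hypothetical map made of a permutation followed by component-wise invertible scalar functions, estimates its Jacobian by central finite differences at several random points, and checks that every row has exactly one non-zero entry whose location does not change with the evaluation point. The map `pointwise_map`, the permutation `PERM`, the tolerance, and all other names are assumptions for illustration only; this checks the property on a toy example and is not the estimation procedure of the paper.

```python
import numpy as np

PERM = np.array([2, 0, 1])  # hypothetical permutation of the three components

def pointwise_map(y):
    """Component i depends only on y[PERM[i]] through an invertible scalar function."""
    z = y[PERM]
    return np.array([np.tanh(z[0]) + 2.0 * z[0], z[1] ** 3 + z[1], np.exp(z[2])])

def numerical_jacobian(func, y, eps=1e-6):
    """Central finite-difference Jacobian; entry [i, j] approximates d func_i / d y_j."""
    cols = []
    for j in range(y.size):
        step = np.zeros_like(y)
        step[j] = eps
        cols.append((func(y + step) - func(y - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def nonzero_pattern(jac, tol=1e-6):
    """Boolean mask of Jacobian entries whose magnitude exceeds tol."""
    return np.abs(jac) > tol

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(10, 3))
patterns = [nonzero_pattern(numerical_jacobian(pointwise_map, y)) for y in points]

# Exactly one non-zero partial derivative per component at every point ...
one_per_row = all((p.sum(axis=1) == 1).all() for p in patterns)
# ... and its location is the same at every evaluation point.
fixed_location = all((p == patterns[0]).all() for p in patterns)
```

Both flags come out true for this toy map, since each output component depends on a single input through a strictly monotonic function.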
B Proof of Theorem 2
The conditional joint log-pdf of a data point is given by, using the probability transformation formula,
where , , and are the conditional pdfs of , , and , respectively, denotes the Jacobian, and by definition. The second equation is from the structure of the augmented demixing model (Eq. 3) and the temporal independence of (A1), and the last equation is from the conditional exponential family model of the innovation (A2). On the other hand, by applying Bayes' rule to the optimal discrimination relation given by Eq. 8, after dividing all the exponential terms by the one of to avoid the well-known indeterminacy of the softmax function,