# Hidden Markov Nonlinear ICA: Unsupervised Learning from Nonstationary Time Series

Recent advances in nonlinear Independent Component Analysis (ICA) provide a principled framework for unsupervised feature learning and disentanglement. The central idea in such works is that the latent components are assumed to be independent conditional on some observed auxiliary variables, such as the time-segment index. This requires manual segmentation of data into non-stationary segments which is computationally expensive, inaccurate and often impossible. These models are thus not fully unsupervised. We remedy these limitations by combining nonlinear ICA with a Hidden Markov Model, resulting in a model where a latent state acts in place of the observed segment-index. We prove identifiability of the proposed model for a general mixing nonlinearity, such as a neural network. We also show how maximum likelihood estimation of the model can be done using the expectation-maximization algorithm. Thus, we achieve a new nonlinear ICA framework which is unsupervised, more efficient, as well as able to model underlying temporal dynamics.

## Authors

• 3 publications
• 36 publications
• ### Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning

Nonlinear ICA is a fundamental problem for unsupervised representation l...
05/22/2018 ∙ by Aapo Hyvärinen, et al. ∙ 0

• ### Disentangling Identifiable Features from Noisy Data with Structured Nonlinear ICA

We introduce a new general identifiable framework for principled disenta...
06/17/2021 ∙ by Hermanni Hälvä, et al. ∙ 0

• ### Robust Nonlinear Component Estimation with Tikhonov Regularization

Learning reduced component representations of data using nonlinear trans...
07/15/2019 ∙ by Reuben Feinman, et al. ∙ 2

• ### Nonlinear ISA with Auxiliary Variables for Learning Speech Representations

This paper extends recent work on nonlinear Independent Component Analys...
07/25/2020 ∙ by Amrith Setlur, et al. ∙ 7

• ### Independent innovation analysis for nonlinear vector autoregressive process

The nonlinear vector autoregressive (NVAR) model provides an appealing f...
06/19/2020 ∙ by Hiroshi Morioka, et al. ∙ 0

• ### The Incomplete Rosetta Stone Problem: Identifiability Results for Multi-View Nonlinear ICA

We consider the problem of recovering a common latent source with indepe...
05/16/2019 ∙ by Luigi Gresele, et al. ∙ 9

• ### Asymptotic risks of Viterbi segmentation

We consider the maximum likelihood (Viterbi) alignment of a hidden Marko...
02/18/2010 ∙ by Kristi Kuljus, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Representation learning – the task of finding useful features from data – is one of the main challenges in unsupervised learning. Recent theoretical and practical advances in Nonlinear ICA provide a principled approach to this problem (Hyvärinen and Morioka, 2016, 2017; Hyvärinen et al., 2019; Khemakhem et al., 2020; Sorrenson et al., 2020). These works frame Nonlinear ICA as deep generative models, which allows them to harness deep neural networks to recover latent independent components from observed data. Identifiability of the latent components can be guaranteed by explicitly defining probabilistic generative models with appropriate conditional independence structures. A general framework was proposed recently by Hyvärinen et al. (2019), who assumed that the components are independent given some other observed auxiliary variable. For example, in time-series data this can be the time-index or segment-index if the data is non-stationary, as was earlier assumed in Time-Contrastive Learning or TCL (Hyvärinen and Morioka, 2016). Non-stationarity is a fundamental property of many applications, since for example, video, audio, and most neuroscience data are non-stationary.

A crucial limitation of all of the above nonlinear ICA models is that the conditioning auxiliary variable is always assumed observable. In some sense, these models are therefore not fully unsupervised. If, for instance, we wish to exploit the nonstationary temporal structure optimally in estimating independent components, TCL would require segment indices that correspond to the different latent data generative states. In practice we don’t observe such states so the default approach is to manually segment the data.

In general, it is unrealistic to assume that we can infer from observed data alone the exact time-points at which the latent data distribution changes. In fact, such change-points may not exist at all. Segmenting data manually is also infeasible for large datasets. The default approach is therefore to segment data at equal intervals, however, this is problematic for various reasons. Consider, for example, a situation where the true latent state switches between five different states. By segmenting the data at equal intervals we will end up with an unnecessarily large number of states where just a few would have done the job. This is computationally expensive, inaccurate and will completely miss out on temporal dynamics in situations where the latent states repeat over time.

In fact, often a reasonable assumption is that non-stationarity can be succinctly summarized using a limited number of segment indices or latent states, and properly modelling such state switching is likely to improve learning. Notice that even if ground-truth nonstationary information was available, the existing methods lack the machinery to perform inference on latent temporal dynamics. In many applications, for example brain imaging, describing the dynamics in terms of latent states could be very useful in its own right.

The points above highlight the need for a nonlinear ICA method that is able to cluster observations and learn latent states and their temporal dynamics in an unsupervised fashion. A well-known approach to modelling hidden latent states in time series is to use a Hidden Markov Model (HMM). HMMs can be viewed as probabilistic mixture models where the discrete latent states, which specify the data generating distribution, are time-dependent with Markov dynamics. HMMs are especially well suited for modelling non-stationary data as they automatically allow for a representation of the time series in terms of a discrete number of states.

In this work, we therefore resolve the above limitations by combining Nonlinear ICA with a HMM. This idea has been proposed earlier for linear ICA (Penny et al., 2000; Zhou and Zhang, 2008) but their identifiability and estimation results do not directly extend to the nonlinear case. In our model, we achieve this by having the latent state act in place of the conditioning auxiliary variable in the framework of Hyvärinen et al. (2019). Importantly, we are able to prove that Hidden Markov Nonlinear ICAs are identifiable. Attaining identifiability has been a major research focus for both Nonlinear ICA (Hyvärinen and Morioka, 2017; Hyvärinen et al., 2019) and HMMs (Allman et al., 2009; Gassiat et al., 2016), and therefore much of our paper is devoted to combining these two research strands. To the best of our knowledge, this is the first fully unsupervised non-linear ICA, in the sense that the model’s identifiability comes from an unobserved conditioning variable which is inferred from the time series as a part of learning. We further show how the structure of the model allows us to use the Expectation-Maximization (EM) algorithm for parameter estimation. In practice the Hidden Markov Nonlinear ICA is endowed with rich representation learning capabilities that allow it to simultaneously extract independent components and to learn the dynamics of the latent state that drives non-stationarity data, as illustrated by our simulations.

## 2 Background

We start by giving an overview of the problem of unidentifiability in both nonlinear ICA and HMMs, and recently proposed solutions. For both types of models, identifiability arises as a consequence of appropriate temporal structures which suggests a natural synthesis between the two.

### 2.1 Nonlinear Ica and Identifiability

Consider a parametric model of observed data

with marginal likelihood . This model is identifiable if it fulfils below:

 pθ(x)=pθ′(x)⇒θ=θ′:∀(θ,θ′) (1)

In the context of a latent variable model, this is closely connected to the idea of being able to recover the original latent variables, as discussed by Khemakhem et al. (2020).

Assume we observe -dimensional data at discrete time-steps, . Simple nonlinear ICA can be defined as the task of estimating an unobserved

-dimensional independent component vector

such that , as well as the inverse of a mixing function , that has generated the observed data:

 x(t)=f(s(t)) (2)

Unfortunately, without a temporal structure, that is if are i.i.d over the time-index, and if there are no constraints on , then this model is unidentifiable (Hyvärinen and Pajunen, 1999). In fact, the authors show that there are infinite potential nonlinear transformations and independent components that would satisfy the model, with no criterion for choosing one of them over the others.

In order to make the model identifiable, constraints are thus needed. For time-series data, this comes naturally by placing restrictions on the temporal structure of the model. For linear ICA this approach has been shown to yield identifiable models (Belouchrani et al., 1997; Tong et al., 1991), and extensions to the nonlinear case have been also been proposed in earlier work (Harmeling et al., 2003; Sprekeler et al., 2014). The first fully rigorous proof of an identifiable nonlinear ICA model, along with an estimation algorithm (Time-Contrastive Learning or TCL), was given by Hyvärinen and Morioka (2016). The constraint imposed in that work is that of a non-stationary data generative process such that independent component vectors within different time-segments have different distributional parameters. Specifically, the model assumes that each independent component has an exponential family distribution, where the time segment index modulates the natural parameters (denoted as ):

 pτ(si)=qi(si)Z(λi)exp{⟨λi(τ),T(si)⟩} (3)

where is the base measure and are the sufficient statistics. TCL then assumes that the independent components in all the segments are transformed into observed variables by some mixing function (2

). The authors prove identifiability up to a linear transformation of pointwise functions of the components:

 T(s(t))=Ah(x(t);θ)+b (4)

By learning to contrast between the different segments, the TCL algorithm learns the inverse of the mixing function and the independent components.

This seminal work has inspired other frameworks for identifiable nonlinear ICA estimation. Permutation Contrastive Learning (Hyvärinen and Morioka, 2017), for instance, exploits temporal dependencies, rather than non-stationarity, to identify independent components. The unifying tenet of these identifiable nonlinear ICA algorithms is that independent components are conditionally independent given some observed auxiliary variable. This general idea was formalized in Hyvärinen et al. (2019), of which both the TCL (segment index as auxiliary variable) and the PCL (past data as auxiliary variable) are special cases.

These identifiable nonlinear ICA models provide a principled approach to finding meaningful data representations. This is in contrast to the majority of recent deep generative models used for representation learning, such as VAEs (Kingma and Welling, 2014) and GANs (Goodfellow et al., 2014), which are all malaised by unidentifiability. In fact, any generative latent variable model with an unconditional prior is unidentifiable. This issue is portrayed in depth by Khemakhem et al. (2020) who resolve it by introducing identifiable VAE (iVAE). Like regular VAE, this model estimates a full generative model , but with a factorial conditional prior . As in Hyvärinen et al. (2019), u is some auxiliary variable, and iVAE provides a novel algorithm to estimate nonlinear independent components in the same identifiable framework. iVAE however suffers from the same problems as TCL, as its auxiliary variable u has to be observed.

### 2.2 Hidden Markov Models and Identifiability

In order to define HMMs, let

be an observed random variable from a time series with a discrete time index

. In a standard hidden Markov model, distribution of the observations depends conditionally on a discrete latent state random variable as per ; we refer to this as the emission distribution. The latent state undergoes first-order Markov process governed by a

transition-probability matrix

. is used to denote the probability of transitioning from state to , and the starting-state probabilities. The likelihood of a typical HMM is hence given by:

 p(x(1),…,x(T);A)= ∑c(1),…,c(T)π(c(1))p(x(1)|c(1))T∏t=2Ac(t−1),c(t)p(x(t)|c(t)) (5)

HMMs can be viewed as mixture models where the latent state is coupled across time by a Markov process. This observation raises the question of identifiability since mixture models with non-parametric emission distributions are generally unidentifiable, though many commonly used parametric forms are identifiable (Allman et al., 2009). Recently, however, Gassiat et al. (2016) have proven a major result that nonparametric HMMs are in general identifiable under some mild assumptions. We will use this result later and thus reproduce it here (notice that their nonparametric result subsumes parametric HMMs):

###### Theorem 1.

(Gassiat et al., 2016)  Assume the number of latent states, , is known. Use

to denote nonparametric probability distributions of the

emission distributions. Also assume that the transition-matrix is full rank. Then the parameters and are identifiable given the distribution, , of at least 3 consecutive observations , up to label swapping of the hidden states, that is: if is a transition matrix, if is a stationary distribution of with , and if are probability distributions on that verify the equality of the HMM distribution functions , then there exists a permutation of the set such that for all we have and .

Much like for TCL, identifiability in nonparametric HMMs is a result of temporal structure, namely observations across time are independent conditionally on the latent state—this is in contrast to simple (i.i.d.) mixture models for which such general identifiability results are not available. We show below that this temporal structure of the HMM’s, together with nonstationarity similar to TCL, combine to identify the resulting Hidden Markov Nonlinear ICA model.

## 3 Identifiable Nonlinear Ica for Nonstationary Data

In this section, we propose a combination of a hidden Markov model and nonlinear ICA. Specifically, we propose an HMM which has nonlinear ICA as its emission model, and show how to estimate it by Expectation-Maximization.

### 3.1 Model Definition

To incorporate nonlinear ICA into the standard HMM of (5) we define the emission distribution as a deep latent variable model. First, the latent independent component variables are generated from a factorial exponential family prior, given the hidden state , as

 p(s(t)|c(t);λc(t))=N∏i=1p(s(t)i|c(t);λi,c(t)) =N∏i=1h(s(t)i)Z(λi,c(t))exp{⟨λi,c(t),Ti(si)⟩} (6)

where are the base measures, the normalizing constants, and the sufficient statistics. Second, the observed data is generated by a nonlinear mixing function as in Eq. (2).

For remainder of the paper we assume that the exponential family is in minimal representation form so that the sufficient statistics are linearly independent. The corresponding -dimensional parameter vectors are denoted by . The subscripts indicate that the parameters of the different components are modulated directly, and independently, by the HMM latent state. Indeed, it is precisely this conditional dependence of the parameters on the discrete latent state that seeps through our model and generates non-stationary observed data. Note that the parameters themselves are time-homogeneous, that is they are constant over time; instead, the latent state evolves over time and determines which set of parameters is active a point in time. In other words, non-stationary arises purely from the dynamics of the latent state . The full set of parameters for the independent components can hence be captured by a matrix (plus the transition probabilities of the hidden states).

The nonlinear mixing function in Eq. (2) is assumed to be bijective with inverse given by . It follows that in our model the conditional emission distribution for observed data is:

 p(x(t)|c(t);f,λc(t))= |Jg(x(t))|H(g(x(t)))Z(λc(t))exp{⟨λc(t),T(g(x(t)))⟩} (7)

where is short-hand notation for the absolute value of the determinant of the Jacobian of the inverse (demixing) function, and . We have also simplified notation by stacking the vectors for different components and .

We allow to be any arbitrary but bijective function. In practice, it can be learned as a neural network. The model can therefore be viewed as a deep generative model for non-stationary data. Finally, using and our hidden Markov nonlinear ICA model’s data-likelihood is given as:

 p(x(1),…,x(T);Θ)= ∑c(1),…,c(T)(π(c(1))p(x(1)|c(1),θc(1))× T∏t=2Ac(t−1),c(t)p(x(t)|c(t);θc(t))) (8)

where the emission distributions in Eq. (7) should be plugged in.

### 3.2 Estimation

Assume we have a sequence of observed data generated by (8). In order to estimate the model parameters in practice we will choose the factorial prior in (6) from a well-known family such that the normalizing constant is tractable, such as a Gaussian location-scale family. Intractable normalizing constant would make estimation very difficult, even by approximate inference methods such as Variational Bayes or VAEs. However, notice that the choice of distribution for the latent prior does not severely limit the type of data that can be modelled since the non-linear mixing function can be any arbitrary function.

Tractable exponential families also make it easy to estimate the model parameters by maximizing the likelihood in (8) by the EM algorithm. The ”free-energy” EM lower bound for our model is given by:

 L(q(c),Θ):= Eq(c)[logp(c,x(1),…,x(T);Θ)]−Eq(c)[logq(c)] (9)

where , such that the first RHS terms is the complete-data likelihood under some distribution . In the E-step one finds which is the standard result for HMMs and can be easily computed using the forward-backward (Baum-Welch) algorithm. In the M-step we aim to find , which reduces to maximizing:

 ~L(q(c),Θ):=Eq(c⋆)[logp(c,x(1),…,x(T);Θ)]= +T∑t=2Eq(c(t−1)⋆,c(t)⋆)[logAc(t−1),c(t)] (10)

where we have left out the initial-state probability term as we can assume a stationary process and infer them directly from

as its left eigenvector. The M-step updates for

are standard:

 A⋆i,j←∑Tt=2q(c(t−1)⋆=i,c(t)⋆=j)∑Tt=1q(c(t)⋆) (11)

M-step updates for the parameters also follow from standard EM results for exponential families:

 ∇λkT∑t=1Eq(c(t)⋆)[logp(x(t)|c(t);θc(t))] =∇λkT∑t=1Eq(c(t)⋆)[⟨λk,T(g(x(t)))⟩−logZ(λk)] =T∑t=1q(c(t)⋆=k)[T(g(x(t)))−∇λkZ(λk)Z(λk)]=0 ⇒∇λkZ(λk)Z(λk)=∑Tt=1q(c(t)⋆=k)T(g(x(t)))∑Tt=1q(c(t)⋆=k) (12)

where LHS can be rewritten as:

 1Z(λk)∇λk∫(|Jg(x(t))|H(g(x(t))) ×exp{⟨λc(t),T(g(x(t)))⟩}) =Ep(x(t)|c(t);θc(t))[T(g(x(t)))] (13)

Thus the M-step updates for are the ones that solve:

 =∑Tt=1q(c(t)⋆=k)T(g(x(t)))∑Tt=1q(c(t)⋆=k) (14)

In practice, (14

) has closed-form updates for many usual exponential family members. As an example, if we were to use a Gaussian distribution, the updates for mean and variance vectors would be:

 μ⋆k←∑Tt=1q(c(t)⋆=k)g(x(t))∑Tt=1q(c(t)⋆=k) σ2⋆k←diag⎛⎝∑Tt=1q(c(t)⋆=k)y⋆ky⋆,Tk∑Tt=1q(c(t)⋆=k)⎞⎠ (15)

where

Next, the demixing function is estimated by parameterizing it as a deep neural network but for notational simplicity we will not write these parameters explicitly and instead subsume them in . Since an exact M-step is not possible, a gradient ascent step on the lower bound is taken instead, where the gradient is given by:

 ∇g~L(q(c),Θ)=∇gT∑t=1Eq(c(t)⋆)[logp(x(t)|c(t);θc(t))] =∇gT∑t=1log|Jg| +∇gT∑t=1Eq(c(t)⋆)[logH(g)+⟨λc(t),T(g)⟩] (16)

where we have used for brevity. The parameters are then updated as:

 gnew←gold+η∇g~L(q(c),Θ) (17)

See Appendix A for a discussion on the convergence of our algorithm.

The gradient term with respect to the determinant of the Jacobian deserves special attention. It is widely considered difficult to compute, and therefore, normalizing flows models are often used in literature in order to make the Jacobians more tractable. The problem with this approach is that, to our best knowledge, none of such flow models has universal function approximation capabilities (despite some being universal distribution approximators). This would restrict the possible set of nonlinear mixing functions that can be estimated, and is thus not practical for our purposes. Fortunately modern autograd packages such as JAX make it possible to calculate gradients of the log determinant Jacobian efficiently up to moderate dimensions (see Appendix B) – this is the approach we take. Very recent, promising, alternative for computing the log-determinant is the relative gradient (Gresele et al., 2020) which could easily be implemented in our framework. Finally, notice that the second term (16) is easy to evaluate since the expectation is just a discrete sum over the posteriors that we get from the E-step.

### 3.3 Comment on Estimation for Long Time Sequences

The above estimation method may be impractical for very long time sequences since the forward-backward algorithm has computational complexity of . In such situations we can adapt the subchain sampling approach of Foti et al. (2014). This involves splitting up the full dataset into shorter time sequences and then forming minibatches over time. The resulting gradient updates would be biased and therefore a scaling term will be applied to them. The forward-backward algorithm applied to the subchains is also only approximate due to loss of information at the ends of the chains but the authors describe a technique to buffer the chains with extra observations to reduce this effect.

### 3.4 Comment on Dimension Reduction

An important problem in applying our method on real data is dimension reduction. While in the theory above, we assumed that the number of independent components is equal to the number of observed variables, in many practical cases, we would like to have a smaller number of components than observed variables. We propose here two solutions for this problem.

The first solution, which is widely used in the linear ICA case, is to first reduce the dimension of the data by PCA, and then do ICA in that reduced space with the same dimensions of components and observed variables. In the nonlinear case, a number of nonlinear PCA methods, also called manifold learning methods, has been proposed and could be used for such a two-stage method. In particular, dimension reduction is achieved by even the very simplest autoencoders; recent work has developed the theory further in various directions

(Maaten and Hinton, 2008; Vincent et al., 2010). This approach has the advantage of reducing the noise in the data, which is a well-known property of PCA, and allows us to separate the problem of dimension reduction from the problem of developing ICA algorithms. A possible drawback is that such dimension reduction may not be optimal from the viewpoint of estimating independent components.

The second solution is to build an explicit noise model into the nonlinear ICA model, following Khemakhem et al. (2020). Denote by a random vector of Gaussian noise which is white both temporally and spatially and of variance . Instead of the Eq. (2), we would define a mixing model as

 x(t)=f(s(t))+n(t) (18)

where the model of the components is unchanged. We could then combine the variational estimation method presented by Khemakhem et al. (2020) with the HMM inference procedure presented here. However, we leave the details for future work.

## 4 Identifiability Theory

In this section we provide identifiability theory for the model discussed in the previous section. As was discussed above, many deep latent variable models are non-identifiable. In other words, an estimation method such as the EM proposed above might not have a unique solution, or even a small number of solutions which are indistinguishable for any practical purposes.

Fortunately, we are able to combine previous nonlinear ICA theory with the identifiability of Hidden Markov Models to prove the identifiability of our combined model. Albeit our model being different from (Hyvärinen and Morioka, 2017), (Hyvärinen et al., 2019) and (Khemakhem et al., 2020), the identifiability we reach is very similar. We also show that in the case of Gaussian independent components we can get exact identifiability up to linear transformation of the components.

### 4.1 Definitions

In order to illustrate the relationship of our model’s identifiability to earlier works in the area, we introduce the following definitions from Khemakhem et al. (2020)

###### Definition 1.

Let be the equivalence relation on . (8) is said to be identifiable up to if

 p(x(1),…,x(T);Θ)=p(x(1),…,x(T);^Θ)⇒Θ∼^Θ (19)
###### Definition 2.

Let be the binary relation on defined by:

 (f,λ)∼(^f,^λ)↔ ∃W,b∣T(g(x(t)))=WT(^g(x(t)))+b (20)

where is an matrix and is an vector.

If is invertible, the above relation is denote by , and if is a block permutation matrix, it is denoted by . In block permutation, each block linearly transforms into with each corresponding to one, and only one, .

### 4.2 General Result

Now we present our most general Theorem on identifiability. It will be followed by stronger results in the Gaussian case below.

###### Theorem 2.

Assume observed data is generated by a Hidden Markov Nonlinear ICA according to (5) - (8). Also, assume:

1. [label=()]

2. The time-homogeneous transition matrix has full rank and induces an irreducible 111all states can be reached from every stateMarkov chain with a unique stationary state distribution

3. The number of latent states, , is known and

4. There exists an square matrix of the different states’ parameters with respect to a pivot state

 ˜L=⎛⎜ ⎜⎝(λc=1−λc=0)T⋮(λc=NV−λc=0)T⎞⎟ ⎟⎠ (21)

which is invertible.

5. The emission distributions for the different latent states are linearly independent functions of

6. The non-linear mixing function is bijective

Then the model parameters are identifiable.

###### Proof.

Suppose we have

 p(x(1),…,x(T);Θ)=p(x(1),…,x(T);^Θ) (22)

Using assumptions 1-4, we can invoke Theorem 1 and apply it to our model to get:

 ^Ak,l=Aσ(k),σ(l) (23) p(x|k;^θk)=p(x|σ(k);θσ(k)) (24)

where superscript is dropped for convenience. For notational simplicity, and without loss of generality, we assume the components are ordered such that . Substituting in we have:

 |J^g(x)|H(^g(x))Z(^λk)exp{⟨^λk,T(^g(x))⟩} =|Jg(x)|H(g(x))Z(λk)exp{⟨λk,T(g(x))⟩} (25)

for some latent state . Recall from assumption 3 that . We can thus take states and assign one of them, say as a pivot states. Taking logs of (25) for all the other states with respect to the pivot state gives equations of below form:

 ⟨(λk−λ1),T(gi(x))⟩+logZ(λ1)−logZ(λk) =⟨(^λk−^λ1),T(^gi(x))⟩+logZ(^λ1)−logZ(^λc) (26)

Collecting all the such equations, we can stack them into a linear system :

 ˜LT(g(x))=ˆ˜LT(^g(x))+β (27)

where is the invertible square matrix defined in assumption 3, and the elements of are defined similarly, but no assumption about its invertibility is made. The constants that result from the sums of the log-normalizers are stacked to form vector . Multiplying both sides by results in our desired form:

 T(g(x))=˜L−1ˆ˜LT(^g(x))+˜L−1β T(s)=WT(^g(x))+b (28)

Recall that we defined the exponential families to be in minimal representation in Section 3.1. It follows that we can find an arbitrary number of points such that the vectors formed by the sufficient statistic functions of each independent component , are linearly independent. This can be done separately for each . Additionally, as and can be changed independently, we can find for then and are linearly independent for all . Therefore, all elements of the vector are linearly independent which implies that the square matrix in (28) is invertible. ∎

#### 4.2.1 Comments on the assumptions of Theorem 1

The assumptions 1, 2 are standard HMM assumptions. The assumption of a full rank transition matrix is non-standard but crucial here. Intuitively speaking, it allows the latent states to be distinguished from each other, while the irreducibility assumptions ensures that there is a single unique stationary state distribution.222technically one aperiodic state is also required. An aperiodic state is one which can be returned to after an irregular number of steps Notice that these assumptions necessarily hold, for example, when the transition matrix is close to identity, as in a case where the states are strongly persistent.

The assumption that the real number of latent components is known, is valid in certain applications, and if not it could be estimated for instance be increasing the number of latent states between each estimation and then detecting the point at which increases in likelihood become marginal (the elbow method). Assumption 3 is valid in practice as long as the parameters are generated randomly - in that case it almost surely holds as singular solutions will lie in a submanifold of lower dimension. The validity of assumption is less obvious 4, however, we will below prove that it holds, for instance, in the case of Gaussian independent components.

### 4.3 Identifiability With Gaussian Independent Components

In this section, we first provide two lemmas which we use to prove the claim, already alluded to above, that assumption 4 of Theorem 2 is satisfied for Gaussian components. Then, we prove that in this case a stronger form of identifiability can be reached as a special case of above results, namely that we get exact identification of components up to linear transformation. Together these results make a strong case for using Gaussian latent components in practical applications.

We begin by stating two Lemmas (proofs in Appendix C):

###### Lemma 1.

Assumption 4 of Theorem 1 requires the emission distributions defined by (7) to be linearly independent. A sufficient, and necessary, condition is that the conditional source distributions defined by (6) are linearly independent.

###### Lemma 2.

Assume probability density functions of random variables , and that each factorizes across the variables: . If the factorial density functions are linearly independent for some , then the joint-density functions are linearly independent.

Based on these Lemmas, we can prove the following Theorem (proof in Appendix C):

###### Theorem 3.

Assume that distributions of the independent components conditional on the latent state, as defined by (6), are Gaussian parameterised by mean and variance. Assume also that the means of the density functions are all different. Then the emission distributions, defined by (7), are linearly independent, thus satisfying assumption 4 in Theorem 2.

Finally, we have the following Theorem which proves a stronger form of identifiability—essentially recovering the components with minimum indeterminacy—of our Hidden Markov Nonlinear ICA model in the Gaussian case (proof in Appendix C)

###### Theorem 4.

Assume that the latent independent components have a conditionally Gaussian distributions, and assume hypotheses 1, 2,3 and 5 of Theorem 2 hold, as well as the assumptions of Theorem 3. Additionally assume that the mixing function has all of it second-order cross derivatives, then the components in our Hidden Markov Nonlinear ICA model are exactly identified up to linear transformation.

Notice that this proof and the identifiability result is similar to that in (Sorrenson et al., 2020), although our models are entirely different. These authors also prove a general version for other distributions with different sufficient statistics.

## 5 Experiments

In this section we present results from our simulations on artificial non-stationary data. Code, written in JAX, is available at github.com/HHalva/hmnlica.

Dataset: We generated a synthetic dataset from the model defined in Section 3.1. More specifically, the independent components are created from non-stationary Gaussian emission distributions of an HMM with discrete states – the latent state determines the means and the variances of the independent components at each time point. The transition matrix was defined so that at each time-step there was a 99% probability that the state didn’t change and a 1% probability the latent state switches to another state.333for context, the probability of staying in the same state for over 100 time steps with these numbers is around 37% If it switches to another state, it will always go to the next one ‘in line’, where we define a circular ordering for the states. That is, we defined a circular repeating path for the latent state where transitions could only happen to two states such that the transition matrix is close to identity (Figure 1 illustrates this). These settings were chosen to ensure the HMM assumptions of Theorem 2 hold, as well as to reflect a situation where a relatively small number of states repeats over time with some interesting, non-random, temporal dynamics, including persistence to stay in the same state. The mean and variance parameters were chosen at random for each latent state before data generation so that assumption 3 of Theorem 2 holds. Similarly to Hyvärinen et al. (2019), the mixing function (2) was simulated with a randomly initialized, invertible 444invertibility achieved by having all layers units wide and utilizing leaky ReLUs

multi-layer perceptron (MLP) – this produced the observed data for our experiments. The sequences that were created are 100,000 time steps long. The number of latent states was set such that

, which ensures that the assumptions 2 and 3 were fulfilled.

Model estimation: We estimate the (inverse) mixing function and distribution parameters of our Hidden Markov nonlinear-ICA model using the EM algorithm described in Section 3.2. Mean and variance parameter estimates for the independent components are initialized randomly at the start of the EM algorithm. The inverse mixing function is parameterized with an MLP where the number hidden layers is set to match the number of data generating mixing layers. The gradient M-step are taken with the Adam optimizer (Kingma and Ba, 2017). Random restarts were used to avoid inferior local maxima. Further, we found that a stochastic version of our algorithm (see Section 3.3) converged faster – thus, the experiments here have been run with 100 time-step long sub-sequences in minibatches of 64.

Results – independent component recovery: After estimating the model parameters and independent components, a linear sum assignment problem is solved to optimally match each of the estimated components to one of the real ones. This is necessary as the ordering of the components is arbitrary. Mean absolute correlation coefficients over the resulting pairs of true and estimated components are then used to measure how well original components were recovered. This is the methodology taken in previous nonlinear ICA works (Hyvärinen and Morioka, 2016).

Figure 2 shows the mean correlation between the estimated components for our model in comparison to TCL, which is the only other nonlinear ICA model for non-stationary data. For TCL, the data was split into 500 time-step long segments; 500 steps provided best performance relative to other computationally feasible options (100, 250, 750, 1000). We can see that our model outperforms TCL for all levels of nonlinearity. This validates our theoretical arguments that the TCL framework struggles with non-stationary data in which latent states (often a relatively small number) repeat over time since the segments it has access to don’t correspond well with the true data generating process.

Results – temporal dynamics Unlike previous models, Hidden Markov Nonlinear ICA is able to perform unsupervised clustering of latent states and to take into account the learned temporal dynamics in doing so. To estimate this ability, we ran the well-know Viterbi algorithm (Viterbi, 1967) which finds the most likely path of latent states based on our estimated model. The results show that on average for and our model reaches near perfect classification accuracy in the linear ICA case, mean accuracy of around 80% for , and 68% for , thus clearly outperforming random chance level (figure in Appendix E).

## 6 Conclusion

We proposed a framework nonlinear ICA based on a Hidden Markov Model of the temporal dynamics. This improves on existing nonlinear ICA methods in several ways. First, it removes the need for any arbitrary segmentation of the data as in TCL, which is likely to improve the estimation of the demixing function. Second, it leverages the fact that the nonstationary structure is often repeating with a limited number of hidden states, which not only reduces the computation by limiting the number classes, but again is likely to improve estimation of the demixing function. Third, our method estimates the underlying latent temporal dynamics, which are often interesting in their own right. We believe this in an important advance in order to apply nonlinear ICA methods on real data.

#### Acknowledgements

The authors would like to thank Elisabeth Gassiat, Ilyes Khemakhem and Ricardo Pio Monti for helpful discussion. I.K.’s help with the experiments is also much appreciated. A.H. was supported by a Fellowship from CIFAR, and from the DATAIA convergence institute as part of the “Programme d’Investissement d’Avenir” (ANR-17-CONV-0003) operated by Inria.

## References

• E. S. Allman, C. Matias, and J. A. Rhodes (2009) Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics 37 (6A), pp. 3099–3132 (en). Note: arXiv: 0809.5032 External Links: ISSN 0090-5364, Link, Document Cited by: §1, §2.2.
• A. Belouchrani, K. Abed-Meraim, J.-F. Cardoso, and E. Moulines (1997) A blind source separation technique using second-order statistics. IEEE Transactions on Signal Processing 45 (2), pp. 434–444 (en). External Links: ISSN 1053587X, Link, Document Cited by: §2.1.
• A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39 (1), pp. 1–38. Cited by: Appendix A.
• N. Foti, J. Xu, D. Laird, and E. Fox (2014) Stochastic variational inference for hidden Markov models. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3599–3607. External Links: Link Cited by: §3.3.
• E. Gassiat, A. Cleynen, and S. Robin (2016) Inference in finite state space non parametric Hidden Markov Models and applications. Statistics and Computing 26 (1-2), pp. 61–71 (en). External Links: ISSN 0960-3174, 1573-1375, Link, Document Cited by: §1, §2.2, Theorem 1.
• I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.1.
• L. Gresele, G. Fissore, A. Javaloy, B. Schölkopf, and A. Hyvärinen (2020)

Relative gradient optimization of the jacobian term in unsupervised deep learning

.
Submitted. Cited by: §3.2.
• S. Harmeling, A. Ziehe, M. Kawanabe, and K. Müller (2003) Kernel-Based Nonlinear Blind Source Separation. Neural Computation 15 (5), pp. 1089–1124 (en). External Links: ISSN 0899-7667, 1530-888X, Link, Document Cited by: §2.1.
• A. Hyvärinen and H. Morioka (2016)

Unsupervised Feature Extraction by Time-Contrastive Learning and Nonlinear ICA

.
arXiv:1605.06336 [cs, stat] (en). Note: arXiv: 1605.06336 External Links: Link Cited by: §1, §2.1, §5.
• A. Hyvärinen and H. Morioka (2017) Nonlinear ICA of Temporally Dependent Stationary Sources.

Proceedings of the 20 th International Con- ference on Artificial Intelligence and Statistics (AISTATS)

, pp. 14 (en).
Cited by: §1, §1, §2.1, §4.
• A. Hyvärinen and P. Pajunen (1999) Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks 12 (3), pp. 429–439 (en). External Links: ISSN 08936080, Link, Document Cited by: §2.1.
• A. Hyvärinen, H. Sasaki, and R. E. Turner (2019) Nonlinear ICA Using Auxiliary Variables and Generalized Contrastive Learning. arXiv:1805.08651 [cs, stat] (en). Note: arXiv: 1805.08651 External Links: Link Cited by: §1, §1, §2.1, §2.1, §4, §5.
• I. Khemakhem, D. P. Kingma, R. P. Monti, and A. Hyvärinen (2020) Variational Autoencoders and Nonlinear ICA: A Unifying Framework. arXiv:1907.04809 [cs, stat] (en). Note: arXiv: 1907.04809 External Links: Link Cited by: Appendix C, §1, §2.1, §2.1, §3.4, §4.1, §4.
• D. P. Kingma and J. Ba (2017) Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs] (en). Note: arXiv: 1412.6980 External Links: Link Cited by: §5.
• D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat] (en). Note: arXiv: 1312.6114 External Links: Link Cited by: §2.1.
• L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE.

Journal of machine learning research

9 (Nov), pp. 2579–2605.
Cited by: §3.4.
• W.D. Penny, E. R.M., and R. S.J. (2000) Hidden markov independent component analysis. In Advances in Independent Component Analysis, Cited by: §1.
• P. Sorrenson, C. Rother, and U. Köthe (2020) Disentanglement by Nonlinear ICA with General Incompressible-flow Networks (GIN). arXiv:2001.04872 [cs, stat] (en). Note: arXiv: 2001.04872 External Links: Link Cited by: §1, §4.3.
• H. Sprekeler, T. Zito, and L. Wiskott (2014) An extension of slow feature analysis for nonlinear blind source separation. J.\ of Machine Learning Research 15 (1), pp. 921–947. Cited by: §2.1.
• L. Tong, R.-w. Liu, V.C. Soon, and Y.-F. Huang (1991) Indeterminacy and identifiability of blind identification. IEEE Transactions on Circuits and Systems 38 (5), pp. 499–509 (en). External Links: ISSN 00984094, Link, Document Cited by: §2.1.
• P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010)

Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion

.
journal of machine learning research 11 (dec), pp. 3371–3408. Cited by: §3.4.
• A. J. Viterbi (1967) Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory 13, pp. 260–267. Cited by: §5.
• J. Zhou and X. Zhang (2008) An ica mixture hidden markov model for video content analysis. IEEE Transactions on Circuits and Systems for Video Technology 18 (11), pp. 1576–1586. Cited by: §1.

## Appendix A Note on the convergence of our estimation algorithm

Standard theory (Dempster et al., 1977) shows that each EM iteration increases the likelihood, unless parameters are already at a zero-gradient point. Further, maxima of free-energy and likelihood coincide. This also holds under the gradient M-steps in our algorithm (with classical assumption of sufficiently small step size). Under suitable regularity conditions, theoretical limit of infinite data and universal approximation of the nonlinear transformation, combined with our identifiability proof, MLE guarantees convergence to correct parameters up to the equivalence class identified in our Theorem 2. In practice, however, these assumptions may not be satisfied — for instance, parameters may approach a boundary point and likelihood tend to infinity. Random restarts and regularisation are common strategies to avoid these problems.

## Appendix B Note on the compute time of the gradients of the logdet Jacobian

We estimate the non-linear mixing function in our model using a multi-layer perceptron without any restrictions on it. As a consequence of the change of variable formula for probability densities, we have to calculate the gradient of the log-determinant of the Jacobian as part of our parameter updates. JAX, a new machine learning package that utilizes autograd, has the ability to calculate the Jacobian in just a single forward pass thus making the computations efficient for typical data dimensions. For our model, we can see that the compute time required for the log-determinant of the Jacobian starts to dominate as we approach 100 dimensions and above.

## Appendix C Proofs

Proof for Lemma 1

###### Proof.

Assume that we have linear independence of the conditional source distributions as defined by (6). Then we have that:

 a1pS(s|1;λ1)+⋯+aCpS(s|C;λC)≡0⇒a=0 (29)

Above holds if we multiply it through by the Jacobian determinant of the mixing function as we have assumed bijectivity, that is:

 a1|Jg(x)|pS(s|1;λ1)+… +aC|Jg(x)|pS(s|C;λC)≡0⇒a=0 (30)

which is equivalent to:

 a1|Jg(x)|pS(g(x)|1;λ1)+… +aC|Jg(x)|pS(g(x)|C;λC)≡0⇒a=0 (31)

And therefore by (7) we have:

 a1pX(x|1;f,λ1)+… +aCpX(x|C;f,λC)≡0⇒a=0 (32)

So the emission distributions are linearly independent if the densities for the independence components are linearly independent across the different latent states. Necessity follows easily by the reverse of above argumentation. ∎

Proof for Lemma 2

###### Proof.

Assume linear independence of the joint-density functions for some subset of variables , where and , that is:

 w1p1(zi,…,zi+n)+⋯+wKpK(zi,…,zi+n) =w1i+n∏j=ip(j)1(zj)+⋯+wKi+n∏j=ip(j)K(zj)≡0 ⇒w=0 (33)

Now the linear independence for joint of requires:

 w1p1(zi,…,zi+n,zi+n+1)+ …+wKpK(zi,…,zi+n,zi+n+1)≡0⇒w=0

Using the factorial form of the joint, we can rewrite this as:

 w1p(i+n+1)1(zi+n+1)p1(zi,…,zi+n)+… +wKp(i+n+1)K(zi+n+1)pK(zi,…,zi+n) ≡0⇒w=0 (34)

If this didn’t hold we could define constants such that:

 v1p1(zi,…,zi+n)+⋯+vKpK(zi,…,zi+n)≡0 (35)

where the constants are not all zero which would contradict our original assumption. Thus it is sufficient to prove linear independence of , say for , without loss of generality, and then apply the above induction step to guarantee linear independence of the joint-density functions . ∎

Proof for Theorem 3

###### Proof.

By Lemma 1 it is sufficient to prove the linear independence of the different conditional independent component density functions, rather than emission densities. And by Lemma 2 it suffices to prove this only for any one of the different independent components. In exponential family form, the density is written as (see Appendix D):

 p(si|c)=1√2πexp{ηi,c,1si−ηi,c,2s2i}Zi,c (36)

where . We drop subscript for convenience. Consider:

 w11√2πexp{η1,1s−η1,2s2}Z1+… +wC1√2πexp{ηC,1s−ηC,2s2}ZC=0 (37)

First assume, all the are distinct. Also we can assume, without loss of generality, that the latent states are ordered such that . We can divide all the terms with the first density to give:

 w1+w2Z1Z2exp{(η1,1−η2,1)s+(η1,2−η2,2)s2}+… +wCZ1ZCexp{(η1,1−ηC,1)s+(η1,2−ηC,2)s2}=0 (38)

taking of above gives . Repeatedly performing this process for remaining terms eventually gives . Consider now the opposite case in which all the are equal. Then we have:

 w11√2πexp{η1,1}Z1+⋯+wC1√2πexp{ηC,1}ZC=0 (39)

If we re-order the terms such that is the largest (recall we assumed that the means are different). We can again divide everything by this term, take , and establish and repeat the process to get . In the the final case where more than one component has the highest variance, but rest are unequal, (only the equality of the largest variances of the remaining terms matters), we can first perform the variance division followed by division by the largest mean, repeatedly until . ∎

Proof for Theorem 4

###### Proof.

By Theorem 3, the above assumptions suffice for Theorem 2 to hold. Next, note that the sufficient statistics of a Gaussian distribution are twice differentiable. This, combined with the assumption about the existence of ’s cross-derivatives fulfils the conditions of Theorem 2 of Khemakhem et al. (2020) and thus our model’s parameters are identifiable (as per Definition 2). We therefore have:

 (sis2i)=Wj(gj(x)gj(x)2)+bi (40)

for some . Hence, we have

 (w11gj(x)+w12gj(x)2+b11)2 =w21gj(x)+w22gj(x)2+b21 w212z4+2w11w12z3+(w211−w22)z2−w21z+b=0 (41)

Above has to hold for all values of . The trivial solution of all is impossible as would not be invertible. Therefore, it must be that and . Thus we have exact identification (up to linear transformation) for some constants . ∎

## Appendix D Model with Gaussian independent components

 p(si|c)= 1√2πσ2i,cexp{−12σ2i,c(si−μi,c)2} (42) = 1√2πσ2i,cexp{−12σ2i,c(s2i−2siμi,c+μ2i,c)} (43) = 1√2πσ2i,cexp{siμi,cσ2i,c−s2i12σ2i,c−μ2i,c2σ2i,c} (44) = Z−1i,cexp{siμi,cσ2i,c−s2i12σ2i,c} (45)

Therefore by independence of components:

 p(s|c)= exp{N∑i=1(siμi,cσ2i,c−s2i12σ2i,c)}N∏i=1Z−1i,c (46) = exp{N∑i=1(siμ