## 1 Introduction

Many tasks in natural language processing, computational biology, reinforcement learning, and time series analysis rely on learning with sequential data, i.e., estimating functions defined over sequences of observations from training data. Weighted finite automata (WFAs) are a powerful and flexible class of models that can efficiently represent such functions. WFAs are tractable, they encompass a wide range of machine learning models (they can, for example, compute any probability distribution defined by a hidden Markov model (HMM) (denis2008rational) and can model the transition and observation behavior of partially observable Markov decision processes (thon2015links)), and they offer appealing theoretical guarantees. In particular, the so-called *spectral methods* for learning HMMs (hsu2009spectral), WFAs (bailly2009grammatical; balle2014spectral) and related models (glaude2016pac; boots2011closing) provide an alternative to Expectation-Maximization (EM) based learning algorithms that is both computationally efficient and consistent.

One of the major applications of WFAs is to approximate probability distributions over sequences of discrete symbols. Although the WFA model has been extended to the continuous domain (li2020connecting; rabusseau2019connecting) as the so-called linear 2-RNN model (or continuous WFA model), approximating density functions for sequential data over a continuous domain with this model is not straightforward, as the model is not guaranteed to compute a density function by construction. Moreover, due to its linearity, the continuous WFA (CWFA) model is not expressive enough to estimate some common density functions over sequences of continuous random variables, such as the one defined by a Gaussian hidden Markov model.

In recent years, neural networks have been widely applied to density estimation and have proven particularly successful. To estimate a density function with neural networks, the neural density estimator needs to be flexible enough to represent complex densities while retaining tractable inference and learning algorithms. One example of such models is the class of autoregressive models (uria2016neural; uria2013rnade), where the joint density is decomposed into a product of conditionals and each conditional is approximated by a neural network. Another family of methods is the so-called flow-based methods (normalizing flows) (dinh2014nice; dinh2016density; rezende2015variational), which transform a base density (e.g. a standard Gaussian) into the target density by an invertible transformation with tractable Jacobian. Although these methods have been used to estimate sequential densities, the sequences are often assumed to have fixed length, and it is often unclear how to generalize these methods to sequences of varying length at test time, which can be important for sequential tasks such as language modeling. Weighted finite automata, on the other hand, are designed to carry out such tasks in the discrete setting. The question is how to generalize WFAs to approximate density functions over continuous domains.

In this paper, by extending the classic CWFA model with a (nonlinear) feature mapping and a (nonlinear) termination function, we first propose our nonlinear continuous weighted finite automata (NCWFA) model. Combining this model with the RNADE framework (uria2013rnade), we propose RNADE-NCWFA to approximate sequential density functions. The model is flexible as it naturally generalizes to sequences of varying lengths. Moreover, we show that the RNADE-NCWFA model is strictly more expressive than the Gaussian HMM model. In addition, we propose a spectral learning based algorithm for efficiently learning the parameters of a RNADE-NCWFA. For the empirical study, we conduct synthetic experiments using data generated from a Gaussian HMM model. We compare our proposed spectral learning of RNADE-NCWFA with an HMM learned with the EM algorithm, an RNADE with LSTM state updates, and a RNADE-NCWFA learned with stochastic gradient descent. We evaluate the models' performance through their log likelihood on sequences of unseen length (the testing sequences are longer than the training sequences) to assess the models' generalization ability. We show that our model outperforms all the baseline models on this metric, especially for long testing sequences. Moreover, the advantage of our model is more significant when dealing with small training sizes and noisy data.

## 2 Background

In this section, we first introduce basic tensor algebra. Then we introduce the continuous weighted finite automata model as well as the RNADE model for density estimation.

### 2.1 Tensor algebra

We first recall basic definitions of tensor algebra; more details can be found in (Kolda09). A tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$ can simply be seen as a multidimensional array $(\mathcal{T}_{i_1, \ldots, i_p} : i_n \in [d_n], n \in [p])$. The mode-$n$ fibers of $\mathcal{T}$ are the vectors obtained by fixing all indices except the $n$th one, e.g. $\mathcal{T}_{i_1, \ldots, i_{n-1}, :, i_{n+1}, \ldots, i_p} \in \mathbb{R}^{d_n}$. The $n$th mode matricization of $\mathcal{T}$ is the matrix having the mode-$n$ fibers of $\mathcal{T}$ for columns and is denoted by $\mathcal{T}_{(n)} \in \mathbb{R}^{d_n \times d_1 \cdots d_{n-1} d_{n+1} \cdots d_p}$. The vectorization of a tensor is defined by $\operatorname{vec}(\mathcal{T}) = \operatorname{vec}(\mathcal{T}_{(1)})$. In the following, $\mathcal{T}$ always denotes a tensor of size $d_1 \times \cdots \times d_p$.

The mode-$n$ matrix product of the tensor $\mathcal{T}$ and a matrix $X \in \mathbb{R}^{m \times d_n}$ is a tensor denoted by $\mathcal{T} \times_n X$. It is of size $d_1 \times \cdots \times d_{n-1} \times m \times d_{n+1} \times \cdots \times d_p$ and is defined by the relation $\mathcal{Y} = \mathcal{T} \times_n X \Leftrightarrow \mathcal{Y}_{(n)} = X \mathcal{T}_{(n)}$. The mode-$n$ vector product of the tensor $\mathcal{T}$ and a vector $v \in \mathbb{R}^{d_n}$ is a tensor defined by $\mathcal{T} \bullet_n v = \mathcal{T} \times_n v^\top \in \mathbb{R}^{d_1 \times \cdots \times d_{n-1} \times d_{n+1} \times \cdots \times d_p}$. It is easy to check that the $n$-mode product satisfies $(\mathcal{T} \times_n A) \times_n B = \mathcal{T} \times_n BA$, where we assume compatible dimensions of the tensor $\mathcal{T}$ and the matrices $A$ and $B$. Given strictly positive integers $n_1, \ldots, n_k$ satisfying $\sum_i n_i = p$, we use the notation $(\mathcal{T})_{\langle\langle n_1, \ldots, n_k \rangle\rangle}$ to denote the $k$th order tensor obtained by reshaping $\mathcal{T}$ into a tensor^{*} of size $(d_1 \cdots d_{n_1}) \times (d_{n_1+1} \cdots d_{n_1+n_2}) \times \cdots \times (d_{p-n_k+1} \cdots d_p)$.

^{*}Note that the specific ordering used to perform matricization, vectorization and such a reshaping is not relevant as long as it is consistent across all operations.

A rank $R$ tensor train (TT) decomposition (oseledets2011tensor) of a tensor $\mathcal{T} \in \mathbb{R}^{d_1 \times \cdots \times d_p}$ factorizes $\mathcal{T}$ into the product of $p$ core tensors $\mathcal{G}_1 \in \mathbb{R}^{d_1 \times R}, \mathcal{G}_2 \in \mathbb{R}^{R \times d_2 \times R}, \ldots, \mathcal{G}_{p-1} \in \mathbb{R}^{R \times d_{p-1} \times R}, \mathcal{G}_p \in \mathbb{R}^{R \times d_p}$, and is defined^{†} by $\mathcal{T}_{i_1, \ldots, i_p} = (\mathcal{G}_1)_{i_1, :} (\mathcal{G}_2)_{:, i_2, :} \cdots (\mathcal{G}_{p-1})_{:, i_{p-1}, :} (\mathcal{G}_p)_{:, i_p}$ for all indices $i_1 \in [d_1], \ldots, i_p \in [d_p]$ (here $(\mathcal{G}_1)_{i_1, :}$ is a row vector, $(\mathcal{G}_2)_{:, i_2, :}$ is an $R \times R$ matrix, etc.). We will use the notation $\mathcal{T} = \llbracket \mathcal{G}_1, \ldots, \mathcal{G}_p \rrbracket$ to denote this product. The name of this decomposition comes from the fact that the tensor $\mathcal{T}$ is decomposed into a train of lower-order tensors.

^{†}The classical definition of the TT-decomposition allows the rank $R$ to be different for each mode, but this definition is sufficient for the purpose of this paper.
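To make the TT format concrete, the following sketch (illustrative only; shapes and names are our own) builds a random rank-$R$ TT decomposition of a 4th-order tensor and evaluates a single entry as the product of core slices, checking it against the fully contracted tensor:

```python
import numpy as np

# Build a random rank-R TT decomposition of a 4th-order tensor and evaluate
# T[i1, i2, i3, i4] = G1[i1, :] @ G2[:, i2, :] @ G3[:, i3, :] @ G4[:, i4].
d, R, p = 3, 2, 4
rng = np.random.default_rng(0)
cores = ([rng.standard_normal((d, R))]                             # G1: d x R
         + [rng.standard_normal((R, d, R)) for _ in range(p - 2)]  # middle cores
         + [rng.standard_normal((R, d))])                          # Gp: R x d

def tt_entry(cores, idx):
    """Contract the train of cores at a multi-index (i1, ..., ip)."""
    v = cores[0][idx[0], :]                  # row vector of size R
    for core, i in zip(cores[1:-1], idx[1:-1]):
        v = v @ core[:, i, :]                # multiply by an R x R slice
    return v @ cores[-1][:, idx[-1]]         # finish with a column of size R

# Reconstruct the full tensor and check one entry against tt_entry.
T = np.einsum('ar,rbs,sct,td->abcd', *cores)
idx = (1, 0, 2, 1)
assert np.allclose(T[idx], tt_entry(cores, idx))
```

Note how the storage cost of the cores grows linearly in the order $p$, while the full tensor has $d^p$ entries; this is what makes the TT parameterization of Hankel tensors used later tractable.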

### 2.2 Continuous weighted finite automata (CWFAs)

The concept of continuous weighted finite automata (CWFAs) is a generalization of the classic weighted finite automata model to its continuous input case and is also shown to be equivalent to the linear second-order RNN model (li2020connecting; rabusseau2019connecting).

###### Definition 1

A continuous weighted finite automaton with $n$ states (CWFA) is defined by a tuple $M = (\alpha, \mathcal{A}, \Omega)$, where $\alpha \in \mathbb{R}^n$ is the initial weight vector, $\mathcal{A} \in \mathbb{R}^{n \times d \times n}$ is the transition tensor, and $\Omega \in \mathbb{R}^{n \times p}$ is the termination matrix. Let $(\mathbb{R}^d)^*$ denote the set of sequences of $d$-dimensional real-valued vectors. A CWFA computes the following function $f_M : (\mathbb{R}^d)^* \to \mathbb{R}^p$:

$$f_M(x_1, \ldots, x_l) = \Omega^\top h_l, \quad \text{where } h_0 = \alpha \text{ and } h_t = \mathcal{A} \bullet_1 h_{t-1} \bullet_2 x_t \text{ for } t = 1, \ldots, l. \tag{1}$$
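A minimal sketch of the CWFA computation in Eq. (1) (variable names are our own; the shapes follow the definition above, with $n$ states, $d$-dimensional inputs and $p$-dimensional outputs):

```python
import numpy as np

# Sketch of a CWFA forward pass: h_t = A •1 h_{t-1} •2 x_t, output Omega^T h_l.
n, d, p = 4, 3, 2
rng = np.random.default_rng(1)
alpha = rng.standard_normal(n)          # initial weight vector
A = rng.standard_normal((n, d, n))      # transition tensor
Omega = rng.standard_normal((n, p))     # termination matrix

def cwfa_forward(alpha, A, Omega, xs):
    """Compute f(x_1, ..., x_l) for a list of input vectors xs."""
    h = alpha
    for x in xs:
        h = np.einsum('i,idj,d->j', h, A, x)  # contract state and input modes
    return Omega.T @ h

xs = [rng.standard_normal(d) for _ in range(5)]
out = cwfa_forward(alpha, A, Omega, xs)       # vector in R^p
assert out.shape == (p,)
# The computed function is linear in each input: scaling x_1 scales the output.
assert np.allclose(cwfa_forward(alpha, A, Omega, [2 * xs[0]] + xs[1:]), 2 * out)
```

The final assertion illustrates the linearity of the model in each input vector, which is precisely the limitation motivating the nonlinear extension in Section 3.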

To learn the CWFA model, (li2020connecting) extend the spectral learning algorithm for the classic WFA model (mohri2012foundations; balle2014method) to its continuous case. The algorithm relies on the Hankel tensor, which is a generalization of the Hankel matrix.

###### Definition 2

For a function $f : (\mathbb{R}^d)^* \to \mathbb{R}^p$, its Hankel tensor of length $l$, $\mathcal{H}_l \in \mathbb{R}^{d \times \cdots \times d \times p}$, is defined by $(\mathcal{H}_l)_{i_1, \ldots, i_l, :} = f(e_{i_1}, \ldots, e_{i_l})$, where $e_1, \ldots, e_d$ denotes the canonical basis of $\mathbb{R}^d$.
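The following sketch materializes a small Hankel tensor by evaluating a function on all tuples of canonical basis vectors, as in Definition 2. The function `f` here is a toy multilinear map standing in for a CWFA (our own illustrative choice, not the paper's):

```python
import numpy as np

# Build the length-2 Hankel tensor of a toy multilinear function
# f(x1, x2) = sum_{i,j} x1[i] x2[j] W[i, j, :], with p-dimensional outputs.
d, p, l = 3, 2, 2
rng = np.random.default_rng(2)
W = rng.standard_normal((d, d, p))

def f(xs):
    return np.einsum('i,j,ijp->p', xs[0], xs[1], W)

E = np.eye(d)
H = np.zeros((d,) * l + (p,))
for i in range(d):
    for j in range(d):
        H[i, j, :] = f([E[i], E[j]])   # (H_l)_{i1, i2, :} = f(e_{i1}, e_{i2})

# For this multilinear f, the Hankel tensor recovers W exactly.
assert np.allclose(H, W)
```

For a multilinear function, the Hankel tensor fully determines the function's values on all sequences of the given length, which is why learning it suffices for the spectral method below.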

In practice, to learn the Hankel tensor, one can use gradient descent to minimize the loss function $\sum_{i} \big\| \mathcal{H}_l \bullet_1 x_1^{(i)} \bullet_2 x_2^{(i)} \cdots \bullet_l x_l^{(i)} - y^{(i)} \big\|_2^2$ over a training dataset, where $(x_1^{(i)}, \ldots, x_l^{(i)})$ is an input sequence and $y^{(i)}$ the corresponding output. It is shown in (li2020connecting) that the Hankel tensor of a CWFA with finitely many states can be parameterized in its tensor train form. The spectral learning algorithm for CWFAs relies on the following theorem (li2020connecting), which shows how to recover the parameters of a CWFA from the Hankel tensors of the function it computes.

###### Theorem 2.1

Let $f$ be a function computed by a minimal CWFA with $n$ hidden units and let $L$ be an integer such that^{‡} $\operatorname{rank}\big((\mathcal{H}_{2L})_{\langle\langle L, L+1 \rangle\rangle}\big) = n$. Then, for any $P$ and $S$ such that $(\mathcal{H}_{2L})_{\langle\langle L, L+1 \rangle\rangle} = PS$, the parameters of the minimal CWFA computing $f$ can be recovered from $P$, $S$ and the Hankel tensors $\mathcal{H}_L$ and $\mathcal{H}_{2L+1}$ (see (li2020connecting) for the explicit recovery formulas).

^{‡}It is worth mentioning that such an integer does not always exist. See (li2020connecting) for more details.

### 2.3 Real-valued neural autoregressive density estimator (RNADE)

The real-valued neural autoregressive density estimator (RNADE) (uria2013rnade) is a generalization of the original neural autoregressive density estimator (NADE) (uria2016neural)

to continuous variables. The core idea of RNADE is to estimate the joint density using the chain rule and approximate each conditional density via neural networks, i.e.

$$p(x) = \prod_{t=1}^{D} p(x_t \mid x_{<t}), \qquad p(x_t \mid x_{<t}) = p_{\mathcal{M}}(x_t \mid \theta_t), \tag{2}$$

where $x_{<t}$ denotes all attributes preceding $x_t$ in a fixed ordering, and $p_{\mathcal{M}}(\cdot \mid \theta_t)$ is a mixture of Gaussians with parameters $\theta_t = \{\pi_t, \mu_t, \sigma_t\}$. Moreover, we have $p_{\mathcal{M}}(x_t \mid \theta_t) = \sum_{k=1}^{K} (\pi_t)_k \, \mathcal{N}\big(x_t \mid (\mu_t)_k, (\sigma_t)_k^2\big)$, where $(\pi_t)_k$ denotes the $k$th element of $\pi_t$ (same for $\mu_t$ and $\sigma_t$) and $\mathcal{N}(\cdot \mid \mu, \sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$. Note that $\pi_t$, $\mu_t$ and $\sigma_t$ are functions of $x_{<t}$; these functions are often chosen to be various forms of neural networks. In the classic setting, RNADE with $K$ mixing components and $m$ hidden states has the following update rules:

$$h_t = \rho_t(x_{<t}), \tag{3}$$

$$\pi_t = \operatorname{softmax}(W_\pi h_t + b_\pi), \quad \mu_t = W_\mu h_t + b_\mu, \quad \sigma_t = \exp(W_\sigma h_t + b_\sigma), \tag{4}$$

where $\rho_t$ is an update function for the hidden state $h_t \in \mathbb{R}^m$ which is time step dependent (see (uria2013rnade) for more details on the specific update functions used in the original RNADE formulation), $W_\pi$, $W_\mu$ and $W_\sigma$ are $K \times m$ matrices, and $b_\pi$, $b_\mu$ and $b_\sigma$ are vectors of size $K$. The softmax function (bridle1990probabilistic) ensures the mixing weights $\pi_t$ are positive and sum to one, and the exponential ensures the variances are positive. RNADE is trained to minimize the negative log likelihood $-\sum_i \log p(x^{(i)})$ via gradient descent.

## 3 Methodology

To approximate density functions with CWFA, we need to improve the expressivity of the model and constrain it to compute a valid density function. In this section, we first introduce nonlinear continuous weighted finite automata. Then, we present RNADE-NCWFA for sequential density approximation, which combines CWFA with the RNADE framework. In the end, we show that RNADE-NCWFA is strictly more expressive than Gaussian HMM and present our spectral learning based algorithm for learning RNADE-NCWFA.

### 3.1 Nonlinear Continuous Weighted Finite Automata (NCWFAs)

To leverage CWFAs to estimate density functions, we first need to improve the expressivity of the model. We will do so by introducing a nonlinear feature map as well as a nonlinear termination function. We hence propose the nonlinear continuous weighted finite automata (NCWFA) model as follows:

###### Definition 3

A nonlinear continuous weighted finite automaton (NCWFA) is defined by a tuple $\tilde{M} = (\alpha, \mathcal{A}, \phi, \Omega)$, where $\alpha \in \mathbb{R}^n$ is the initial weight vector, $\phi : \mathbb{R}^d \to \mathbb{R}^{d'}$ is the feature map, $\Omega : \mathbb{R}^n \to \mathbb{R}^p$ is the termination function and $\mathcal{A} \in \mathbb{R}^{n \times d' \times n}$ is the transition tensor. Given a sequence $x_1, \ldots, x_l \in \mathbb{R}^d$, the function that the NCWFA computes is:

$$f_{\tilde{M}}(x_1, \ldots, x_l) = \Omega(h_l), \quad \text{where } h_0 = \alpha \text{ and } h_t = \mathcal{A} \bullet_1 h_{t-1} \bullet_2 \phi(x_t). \tag{5}$$

One immediate observation is that we can exactly recover the definition of a CWFA by letting $\phi$ be the identity map and $\Omega$ a linear map $h \mapsto \Omega^\top h$.

### 3.2 Density Estimation with NCWFAs

The second problem to tackle is that we need to constrain the NCWFA so that it can tractably compute a density function. In this section, we will leverage the RNADE method to propose the RNADE-NCWFAs model. The proposed model is flexible and can compute sequential densities of arbitrary sequence length. Moreover, we will show that this model is strictly more expressive than the classic Gaussian HMM model.

Recall that the core idea of RNADE is to estimate the joint density using the chain rule as in Equation 2. Instead of computing the hidden state via the classic RNADE update rule of Equation 3, we use the state update of an NCWFA, i.e., $h_t = \mathcal{A} \bullet_1 h_{t-1} \bullet_2 \phi(x_t)$. One key difference with the classic RNADE model is that this state update function is independent of the time step, allowing the model to generalize to sequences of arbitrary lengths. However, an NCWFA does not readily compute a density function, as the function it computes does not necessarily integrate to one and its output is not guaranteed to be non-negative. To overcome this issue, we adopt the approach used in RNADE by constraining the output of the NCWFA to be a mixture of Gaussians with diagonal covariance matrices:

$$p(x_t \mid x_{<t}) = \sum_{k=1}^{K} \pi_k(h_{t-1}) \, \mathcal{N}\big(x_t \mid \mu_k(h_{t-1}), \Sigma_k(h_{t-1})\big), \tag{6}$$

$$\pi(h) = \operatorname{softmax}(W_\pi h + b_\pi), \quad \mu_k(h) = W_\mu^{(k)} h + b_\mu^{(k)}, \quad \sigma_k(h) = \exp\big(W_\sigma^{(k)} h + b_\sigma^{(k)}\big), \tag{7}$$

$$\Sigma_k(h) = \big((\sigma_k(h) \odot \sigma_k(h)) \mathbf{1}^\top\big) \odot I, \tag{8}$$

where $W_\pi \in \mathbb{R}^{K \times n}$, $b_\pi \in \mathbb{R}^K$, $W_\mu^{(k)}, W_\sigma^{(k)} \in \mathbb{R}^{d \times n}$ and $b_\mu^{(k)}, b_\sigma^{(k)} \in \mathbb{R}^d$; $\odot$ denotes the Hadamard product, $\mathbf{1}$ is an all-one vector and $I$ is an identity matrix, so that $\Sigma_k(h)$ is the diagonal matrix carrying the squared entries of $\sigma_k(h)$ on its diagonal. For simplicity, we approximate each conditional via a mixture of Gaussians with diagonal covariance matrices. This can be changed to a full covariance matrix, provided the corresponding positive semi-definiteness assumption on the matrix is satisfied. Note that this simplification does not affect the expressiveness of the model, as a GMM with diagonal covariance matrices is also a universal approximator for densities and can approximate a GMM with full covariance matrices (benesty2008springer), given enough states. Under this definition, it is easy to show that $f_{\tilde{M}}(x_1, \ldots, x_l) = \prod_{t=1}^{l} p(x_t \mid x_{<t})$ computes the density of the sequence $x_1, \ldots, x_l$, where $x_{<t}$ denotes $x_1, \ldots, x_{t-1}$. We will refer to this NCWFA model with RNADE structure as a RNADE-NCWFA with $n$ states and $K$ mixtures. Note that although the definitions of $\pi$, $\mu$ and $\sigma$ take specific forms here, in practice one can use any differentiable functions of $h$ to compute them, so long as $\pi(h)$ sums to one and $\sigma(h)$ is positive.

One natural question to ask is how expressive this model is. We show in the following theorem (proof in Appendix A) that RNADE-NCWFA is strictly more expressive than Gaussian HMMs, which are well known for sequential modeling (bilmes1998gentle).

###### Theorem 3.1

Given a Gaussian HMM $H$ with $n$ states, where $O$ is the Gaussian emission function, $\mu$ is the initial state distribution and $T$ is the transition matrix, there exists an $n$-state, $n$-mixture RNADE-NCWFA $\tilde{M}$ with full covariance matrices such that the density function over all possible trajectories generated by $H$ can be computed by $\tilde{M}$: $f_{\tilde{M}}(x_1, \ldots, x_l) = p_H(x_1, \ldots, x_l)$ for any trajectory $x_1, \ldots, x_l$. Moreover, there exists a RNADE-NCWFA such that no Gaussian HMM model can compute its density.

Note that a CWFA cannot compute the density function of a Gaussian mixture. Indeed, the function computed by a CWFA on a sequence of length $l$ is linear in each of its inputs, whereas a RNADE-NCWFA associates such an input with a Gaussian density.

To learn a RNADE-NCWFA, we want to maximize the likelihood of a training set $\mathcal{D} = \{x^{(1)}, \ldots, x^{(N)}\}$ of length-$l$ sequences of $d$-dimensional vectors, where each $x^{(i)} = (x_1^{(i)}, \ldots, x_l^{(i)}) \in (\mathbb{R}^d)^l$. More specifically, we want to minimize the negative log likelihood $\mathcal{L} = -\sum_{i=1}^{N} \log f_{\tilde{M}}(x^{(i)})$. One straightforward solution is to use gradient descent to optimize this loss function. However, as pointed out in (bengio1994learning), due to repeated multiplication by the same transition tensor, gradient descent is prone to suffer from the vanishing gradient problem and to fail in capturing long term dependencies. One alternative is the classic spectral learning algorithm for WFAs. Recall that the spectral learning method for CWFAs requires first learning the Hankel tensors $\mathcal{H}_L$, $\mathcal{H}_{2L}$ and $\mathcal{H}_{2L+1}$, and then performing a rank factorization on the learned Hankel tensor to recover the CWFA parameters (see (li2020connecting)). However, due to the nonlinearity added to the model, namely the feature map $\phi$ and the termination function $\Omega$, spectral learning alone will not be enough. To circumvent this issue, we present an algorithm jointly leveraging gradient descent and spectral learning. The idea is to first learn the Hankel tensors of various lengths together with the functions $\phi$ and $\Omega$ using gradient descent, and then to use the spectral learning algorithm to recover the transition tensor and the initial weights.

Let $\theta_\phi$ and $\theta_\Omega$ denote the parameters of the mappings $\phi$ and $\Omega$, respectively (see Eqs. 6-8), and let $\mathcal{H}_l = \llbracket \mathcal{G}_1, \ldots, \mathcal{G}_l \rrbracket$ be the TT form of the Hankel tensor, where $\mathcal{G}_1 \in \mathbb{R}^{d' \times n}$ and $\mathcal{G}_i \in \mathbb{R}^{n \times d' \times n}$ for $1 < i \le l$. The spectral learning method for RNADE-NCWFAs first involves an approximation of the Hankel tensors via minimizing the following loss function:

$$\mathcal{L}(\mathcal{H}_L, \mathcal{H}_{2L}, \mathcal{H}_{2L+1}, \theta_\phi, \theta_\Omega) = -\sum_{i=1}^{N} \log \hat{f}(x^{(i)}), \tag{9}$$

where $\hat{f}$ denotes the density function induced by the TT cores of the Hankel tensors together with $\phi$ and $\Omega$. In this process, we obtain both the Hankel tensors and the parameters of the termination function and the feature map. Then, one can perform a rank factorization on the learned Hankel tensor and recover the remaining parameters of the RNADE-NCWFA, namely $\alpha$ and $\mathcal{A}$. The detailed algorithm is presented in Algorithm 1.
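To illustrate the spectral step in isolation (not the full Algorithm 1, and with the nonlinearities $\phi$ and $\Omega$ omitted), the following sketch builds exact Hankel tensors from a synthetic linear model with scalar outputs, rank-factorizes the length-2 Hankel matrix with an SVD, and recovers parameters equivalent up to a change of basis; all names are our own:

```python
import numpy as np

# Spectral recovery on a purely linear toy model (p = 1 for simplicity).
d, n = 4, 2
rng = np.random.default_rng(4)
alpha = rng.standard_normal(n)
A = rng.standard_normal((n, d, n))
omega = rng.standard_normal(n)

# Hankel tensors: (H_l)_{i1..il} = f(e_{i1}, ..., e_{il}).
H1 = np.einsum('i,idj,j->d', alpha, A, omega)
H2 = np.einsum('i,ids,sej,j->de', alpha, A, A, omega)
H3 = np.einsum('i,ids,set,tfj,j->def', alpha, A, A, A, omega)

# Rank factorization H2 = P @ S via a rank-n truncated SVD.
U, svals, Vt = np.linalg.svd(H2)
P = U[:, :n] * svals[:n]
S = Vt[:n, :]
Pp, Sp = np.linalg.pinv(P), np.linalg.pinv(S)

# Recover parameters (equal to the originals up to an invertible basis change).
A_hat = np.einsum('nd,dif,fm->nim', Pp, H3, Sp)   # n x d x n
alpha_hat = Sp.T @ H1
omega_hat = Pp @ H1

def f(a, T, w, xs):
    h = a
    for x in xs:
        h = np.einsum('i,idj,d->j', h, T, x)
    return h @ w

# The recovered model computes the same function as the ground truth.
xs = [rng.standard_normal(d) for _ in range(3)]
assert np.allclose(f(alpha, A, omega, xs), f(alpha_hat, A_hat, omega_hat, xs))
```

The change of basis cancels in the product $\hat{\alpha}^\top \hat{\mathcal{A}}_{x_1} \cdots \hat{\mathcal{A}}_{x_l} \hat{\omega}$, which is why any rank factorization of the Hankel matrix yields an equivalent model.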

## 4 Experiments

For the experiments, we conduct a synthetic experiment based on data generated from a random 10-state Gaussian HMM. We sample training sequences of lengths 3, 6 and 7 from the HMM. To evaluate each model's ability to generalize to sequences of unseen length, we sample 1,000 test sequences of lengths ranging from 8 to 400 from the same HMM. To test robustness to noise, we inject the training samples with zero-mean Gaussian noise of different standard deviations (0.1 and 1.0).

For the baseline models, we have an HMM learned with the expectation maximization (EM) method, as it can compute the density of sequences of any length by design. We also modify the RNADE model by replacing the hidden state update rule of Equation 3 with an LSTM, giving RNADE the ability to generalize to sequences of arbitrary length regardless of the length of the training sequences; we refer to this model as RNADE-LSTM. For our model, following Algorithm 1, we have the method RNADE-NCWFA (spec). Although we have noted that training the (RNADE-)NCWFA model purely by gradient descent can be problematic, we also list this approach as a baseline, referred to as RNADE-NCWFA (sgd). Whenever gradient descent is involved, we use the Adam optimizer (kingma2014adam) with a learning rate of 0.001 and early stopping. For the HMM and the RNADE-NCWFA models, we set the size of the model to 10 (the ground truth size of the random HMM); for RNADE-LSTM, we set the size of the hidden state to 10. We present the trend of the average log likelihood ratio with respect to the ground truth likelihood as a function of sequence length over 10 seeds in Figure 1, and a snapshot of the log likelihood of each model on testing sequences of length 400 in Appendix B.

From the experiment results, we can see that RNADE-NCWFA (spec) consistently has the best performance across all training sizes and levels of injected noise. This advantage is more significant for small training sizes and/or highly noisy data. Moreover, the spectral learning algorithm yields stable training results, as the standard deviation of the log likelihood (ratio) is the lowest among all methods, especially when few training samples are provided. In addition, this advantage is consistent across all test sequence lengths we experimented with.

## 5 Conclusion and Future Work

In this paper, we propose the RNADE-NCWFA model, an expressive and tractable WFA-based density estimator over sequences of continuous vectors. We extend the notion of continuous WFAs to the nonlinear case by introducing a nonlinear feature mapping function as well as a nonlinear termination function. We then combine this model with ideas from RNADE to obtain our density estimation model RNADE-NCWFA and its spectral learning based algorithm. In addition, we show theoretically that RNADE-NCWFA is strictly more expressive than the Gaussian HMM model, and empirically that our method generalizes well to sequences whose lengths differ from those seen during training. For future work, we plan to conduct more experiments on real datasets and compare with more baselines. Moreover, we did not add a nonlinear transition to the NCWFA model, as it would imply that the Hankel tensor has infinite tensor train rank, hence making the spectral learning algorithm intractable; we will look into possibilities of adding this nonlinearity while retaining a working algorithm. In addition, we would like to examine the expressivity of RNADE-NCWFA more closely.

## References

## Appendix A Proof of Theorem 3.1

###### Theorem A.1

Given a Gaussian HMM $H$ with $n$ states, where $O$ is the Gaussian emission function, $\mu$ is the initial state distribution and $T$ is the transition matrix, there exists an $n$-state, $n$-mixture RNADE-NCWFA $\tilde{M}$ with full covariance matrices such that the density function over all possible trajectories generated by $H$ can be computed by $\tilde{M}$: $f_{\tilde{M}}(x_1, \ldots, x_l) = p_H(x_1, \ldots, x_l)$ for any trajectory $x_1, \ldots, x_l$. Moreover, there exists a RNADE-NCWFA such that no Gaussian HMM model can compute its density.

For the Gaussian HMM $H = (\mu, T, O)$, given an observation sequence $x_1, \ldots, x_l$, its density under $H$ is:

$$p_H(x_1, \ldots, x_l) = \sum_{s_1, \ldots, s_l} \mu_{s_1} \prod_{t=1}^{l-1} T_{s_t, s_{t+1}} \prod_{t=1}^{l} O_{s_t}(x_t),$$

where $O_s(x) = \mathcal{N}(x \mid m_s, \Sigma_s)$ for some mean vector $m_s$ and covariance matrix $\Sigma_s$. Let $\alpha = \mu$, $\phi(x) = (O_1(x), \ldots, O_n(x))^\top$, and $\mathcal{A}_{i, s, j} = \delta_{i s} T_{i, j}$, so that the state update of the NCWFA computes the (unnormalized) forward recursion $(h_t)_j = \sum_{i} (h_{t-1})_i O_i(x_t) T_{i, j}$. Note that it is reasonable to use the state as mixture weights: letting $\pi(h) = h / (\mathbf{1}^\top h)$, $\mu_k(h) = m_k$ and $\Sigma_k(h) = \Sigma_k$, following Equation 7, the weights are positive and sum to one for any state reachable from $\alpha$. Under this parameterization, we have $p(x_t \mid x_{<t}) = (\mathbf{1}^\top h_t) / (\mathbf{1}^\top h_{t-1})$, and since $\mathbf{1}^\top h_0 = \mathbf{1}^\top \mu = 1$, the RNADE-NCWFA computes the following function:

$$f_{\tilde{M}}(x_1, \ldots, x_l) = \prod_{t=1}^{l} p(x_t \mid x_{<t}) = \mathbf{1}^\top h_l.$$

Therefore, we have:

$$f_{\tilde{M}}(x_1, \ldots, x_l) = \sum_{s_1, \ldots, s_l} \mu_{s_1} \prod_{t=1}^{l-1} T_{s_t, s_{t+1}} \prod_{t=1}^{l} O_{s_t}(x_t) = p_H(x_1, \ldots, x_l).$$

For the proof of the second half of the theorem, consider a shifting Gaussian HMM, where the mean vector of the Gaussian emission is a function of the time step. For simplicity, assume the shifting Gaussian HMM is over one dimensional sequences and has one state. In addition, let the mean at time $t$ be $t$ and assume the variance is 1, so the emission function at time $t$ can be written as $\mathcal{N}(\cdot \mid t, 1)$. Then the density of a sequence $x_1, \ldots, x_l$ under this shifting Gaussian HMM is:

$$p(x_1, \ldots, x_l) = \prod_{t=1}^{l} \mathcal{N}(x_t \mid t, 1).$$

We show that this density cannot be computed by a Gaussian HMM with finitely many states. If $p$ could be computed by a Gaussian HMM with emission mean vector $m = (m_1, \ldots, m_n)^\top$, then there would exist an initial weight vector $\mu$ and a transition matrix $T$ satisfying the following linear system:

$$\mu^\top T^{t-1} m = t \quad \text{for all } t \ge 1.$$

This linear system is, however, overdetermined, as $m$ is a vector of finite size, while there are infinitely many linearly independent equations to satisfy. Therefore, a Gaussian HMM with finitely many states cannot compute the density function of a shifting Gaussian HMM.

We now show that such a density can be computed by a RNADE-NCWFA with one mixture component and two states. Let $\alpha = (1, 0)^\top$, $\phi(x) = 1$ (a constant scalar feature, so $d' = 1$), and let the single slice of the transition tensor be $\mathcal{A}_{:, 1, :} = \left(\begin{smallmatrix} 1 & 1 \\ 0 & 1 \end{smallmatrix}\right)$, so that the update $h_t = \mathcal{A} \bullet_1 h_{t-1} \bullet_2 \phi(x_t)$ yields $h_t = (1, t)^\top$. Further, let $\pi(h) = 1$, $\mu(h) = (h)_2 + 1$ and $\sigma(h) = 1$. Then we have:

$$p(x_t \mid x_{<t}) = \mathcal{N}\big(x_t \mid (h_{t-1})_2 + 1, 1\big) = \mathcal{N}(x_t \mid t, 1).$$

Therefore:

$$f_{\tilde{M}}(x_1, \ldots, x_l) = \prod_{t=1}^{l} \mathcal{N}(x_t \mid t, 1) = p(x_1, \ldots, x_l).$$

Therefore, the density of the given shifting Gaussian HMM can be computed by a RNADE-NCWFA, but cannot be computed by a Gaussian HMM with finitely many states.

## Appendix B Experiment Results in Table

In this section, we show a snapshot of the experiment results presented in Figure 1. The results are listed in Table 1; the reported likelihood is evaluated on test sequences of length 400. From this table, we can see more clearly the advantage of our RNADE-NCWFA model when trained with the spectral learning algorithm.

Table 1: Average log likelihood (standard deviation over 10 seeds) on test sequences of length 400.

Training Size | 100 | | | 500 | | | 1000 | |
---|---|---|---|---|---|---|---|---|---
Noise Std | 0 | 0.1 | 1 | 0 | 0.1 | 1 | 0 | 0.1 | 1
HMM (EM) | -615.26 (2.57) | -616.88 (3.40) | -638.44 (7.50) | -601.15 (0.20) | -601.18 (0.17) | -628.75 (1.51) | -600.70 (0.12) | -600.69 (0.12) | -628.91 (1.13)
RNADE-LSTM | -604.71 (3.36) | -604.10 (2.29) | -641.72 (7.17) | -600.86 (0.15) | -600.85 (0.23) | -628.28 (3.56) | -601.01 (1.34) | -600.67 (0.24) | -628.45 (1.56)
RNADE-NCWFA (spec) | -600.96 (0.42) | -601.06 (0.24) | -621.75 (2.71) | -600.73 (0.18) | -600.67 (0.067) | -622.10 (1.44) | -600.50 (0.49) | -600.53 (0.07) | -621.91 (1.13)
RNADE-NCWFA (sgd) | -604.11 (2.13) | -603.29 (1.69) | -633.96 (14.6) | -600.81 (0.20) | -600.91 (0.46) | -631.80 (5.53) | -600.51 (0.06) | -600.52 (0.08) | -629.11 (1.65)
Ground Truth | -600.40 | -600.40 | -600.40 | -600.40 | -600.40 | -600.40 | -600.40 | -600.40 | -600.40