is a WFA satisfying some constraints that computes a probability distribution over strings; PFA are expressively equivalent to Hidden Markov Models (HMM) (dupont2005links), which have been successfully applied in many tasks such as speech recognition (gales2008application) and human activity recognition (nazabal2015discriminative). Recently, the so-called spectral method has been proposed as an alternative to EM-based algorithms to learn HMM (hsuspectral), WFA (bailly2009grammatical), predictive state representations (boots2011closing), and related models. Compared to EM-based methods, the spectral method has the benefits of providing consistent estimators and reducing computational complexity.
Although WFA have been successfully applied in various areas of machine learning, they are inherently linear models: their computation boils down to the composition of linear maps. Recent positive results in machine learning have shown that models based on composing nonlinear functions are both very expressive and able to capture complex structure in data. For example, by leveraging the expressive power of deep convolutional neural networks in the context of reinforcement learning, agents can be trained to outperform humans in Atari games (mnih2013playing) or to defeat world-class Go players (silver2016mastering). Deep convolutional networks have also recently led to considerable breakthroughs in computer vision (krizhevsky2012imagenet), where they showed their ability to disentangle the complex structure of the data by learning a representation that unfolds the original complex feature space (where the data lies on a low-dimensional manifold) into a representation space where the structure has been linearized. It is thus natural to wonder to what extent introducing non-linearity in WFA could be beneficial. We will show that both of these advantages of nonlinear models, namely their expressiveness and their ability to learn rich representations, can be brought to the classical WFA computational model.
In this paper, we propose a nonlinear WFA model (NL-WFA) based on neural networks, along with a learning algorithm. In contrast with WFA, the computation of a NL-WFA relies on successive compositions of nonlinear mappings. This model can be seen as an extension of dynamical recognizers (moore1997dynamical), which are in some sense a nonlinear extension of deterministic finite automata, to the quantitative setting. In contrast with the training of recurrent neural networks (RNN), our learning algorithm does not rely on back-propagation through time. It is inspired by the spectral learning algorithm for WFA, which can be seen as a two-step process: first find a low-rank factorization of the so-called Hankel matrix, leading to a natural embedding of the set of words into a low-dimensional vector space, and then perform regression in this representation space to recover the transition matrices. Similarly, our learning algorithm first finds a nonlinear factorization of the Hankel matrix using an auto-encoder network, thus learning a rich nonlinear representation of the set of strings, and then performs nonlinear regression using a feed-forward network to recover the transition operators in the representation space.
Related works. NL-WFA and RNN are closely related: their computation relies on the composition of nonlinear mappings directed by a sequence of observations. In this paper, we explore a direction somewhat orthogonal to the recent RNN literature by trying to connect such models back to classical computational models from formal language theory. Such connections have been explored in the past in the non-quantitative setting with dynamical recognizers (moore1997dynamical), whose inference has been studied in e.g. (pollack1991induction). The ability of RNN to learn classes of formal languages has also been investigated, see e.g. (avcu2017subregular) and references therein. It is well known that predictive state representations (PSR) (littman2002predictive) are strongly related to WFA (thon2015links). A nonlinear extension of PSR has been proposed for deterministic controlled dynamical systems in (rudary2004nonlinear). More recently, building upon reproducing kernel Hilbert space embeddings of PSR (boots2013hilbert), non-linearity has been introduced into PSR using recurrent neural networks (downey2017predictive; venkatraman2017predictive). One of the main differences with these approaches is that our learning algorithm does not rely on back-propagation through time; instead, we investigate how the spectral learning method for WFA can be beneficially extended to the nonlinear setting.
We first introduce notions on weighted automata and the spectral learning method.
2.1 Weighted finite automaton
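Let $\Sigma^*$ denote the set of strings over a finite alphabet $\Sigma$ and let $\lambda$ be the empty word. A weighted finite automaton (WFA) with $k$ states is a tuple $A = (\alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma})$ where $\alpha_0, \alpha_\infty \in \mathbb{R}^k$ are the initial and final weight vectors respectively, and $A_\sigma \in \mathbb{R}^{k \times k}$ is the transition matrix for each symbol $\sigma \in \Sigma$. A WFA computes a function $f_A : \Sigma^* \to \mathbb{R}$ defined for each word $x = x_1 x_2 \cdots x_n \in \Sigma^*$ by
$$f_A(x) = \alpha_0^\top A_{x_1} A_{x_2} \cdots A_{x_n} \alpha_\infty.$$
By letting $A_x = A_{x_1} A_{x_2} \cdots A_{x_n}$ for any word $x = x_1 x_2 \cdots x_n \in \Sigma^*$ we will often use the shorter notation $f_A(x) = \alpha_0^\top A_x \alpha_\infty$. A WFA $A$ with $k$ states is minimal if its number of states is minimal, i.e., any WFA $B$ such that $f_A = f_B$ has at least $k$ states. A function $f : \Sigma^* \to \mathbb{R}$ is recognizable if it can be computed by a WFA. In this case the rank of $f$ is the number of states of a minimal WFA computing $f$. If $f$ is not recognizable we let $\mathrm{rank}(f) = \infty$.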
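As a minimal sketch of the definition above, the following Python snippet evaluates a WFA on a word by multiplying the initial vector through the transition matrices. The two-state automaton, the alphabet {a, b}, and the name `wfa_value` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def wfa_value(alpha0, alpha_inf, transitions, word):
    """Return alpha0^T A_{x1} ... A_{xn} alpha_inf for word = x1 ... xn."""
    state = np.asarray(alpha0, dtype=float)
    for symbol in word:
        # Row-vector convention: the running state is multiplied on the right.
        state = state @ transitions[symbol]
    return float(state @ alpha_inf)

# Hypothetical 2-state WFA over the alphabet {a, b}.
alpha0 = np.array([1.0, 0.0])
alpha_inf = np.array([0.0, 1.0])
transitions = {
    "a": np.array([[0.5, 0.5], [0.0, 1.0]]),
    "b": np.array([[1.0, 0.0], [0.0, 0.5]]),
}

for w in ["", "a", "ab"]:
    print(repr(w), wfa_value(alpha0, alpha_inf, transitions, w))
```

The empty word simply returns the inner product of the initial and final vectors, which matches the convention that the empty product of matrices is the identity.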
2.2 Hankel matrix
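The Hankel matrix $H_f \in \mathbb{R}^{\Sigma^* \times \Sigma^*}$ associated with a function $f : \Sigma^* \to \mathbb{R}$ is the bi-infinite matrix with entries $(H_f)_{u,v} = f(uv)$ for all words $u, v \in \Sigma^*$. The spectral learning algorithm for WFA relies on the following fundamental relation between the rank of $f$ and the rank of the Hankel matrix $H_f$ (carlyle1971realizations; fliess1974matrices):
Theorem 1. For any $f : \Sigma^* \to \mathbb{R}$, $\mathrm{rank}(f) = \mathrm{rank}(H_f)$.
In practice, one deals with finite sub-blocks of the Hankel matrix. Given a basis $\mathcal{B} = (\mathcal{P}, \mathcal{S}) \subset \Sigma^* \times \Sigma^*$, where $\mathcal{P}$ is a set of prefixes and $\mathcal{S}$ is a set of suffixes, we denote the corresponding sub-block of the Hankel matrix by $H_{\mathcal{B}}$. Among all possible bases, we are particularly interested in the ones with the same rank as $f$. We say that a basis $\mathcal{B}$ is complete if $\mathrm{rank}(H_{\mathcal{B}}) = \mathrm{rank}(f) = \mathrm{rank}(H_f)$.
For an arbitrary basis $\mathcal{B} = (\mathcal{P}, \mathcal{S})$, we define its p-closure by $\mathcal{B}' = (\mathcal{P}', \mathcal{S})$, where $\mathcal{P}' = \mathcal{P} \cup \mathcal{P}\Sigma$. It turns out that a Hankel matrix over a p-closed basis can be partitioned into $|\Sigma| + 1$ blocks of the same size (balle2014spectral):
$$H_{\mathcal{B}'}^\top = \left[ H_\lambda^\top \mid H_{\sigma_1}^\top \mid \cdots \mid H_{\sigma_{|\Sigma|}}^\top \right]$$
where for each $\sigma \in \Sigma \cup \{\lambda\}$ the matrix $H_\sigma \in \mathbb{R}^{\mathcal{P} \times \mathcal{S}}$ is defined by $H_\sigma(u, v) = f(u \sigma v)$.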
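The block structure described above can be illustrated with a short sketch that builds one Hankel block per symbol, plus the block for the empty word, from any function on strings. The function name `hankel_blocks`, the toy target `f`, and the basis are assumptions made for this example only.

```python
import numpy as np

def hankel_blocks(f, prefixes, suffixes, alphabet):
    """Return {sigma: H_sigma} with H_sigma[u, v] = f(u + sigma + v).

    The key "" stands for the empty word, i.e. the block H_lambda.
    """
    blocks = {}
    for sigma in [""] + list(alphabet):
        blocks[sigma] = np.array(
            [[f(u + sigma + v) for v in suffixes] for u in prefixes],
            dtype=float,
        )
    return blocks

def f(word):
    # Hypothetical target function: geometric in the word length (rank 1).
    return 0.5 ** (len(word) + 1)

prefixes = ["", "a", "b"]
suffixes = ["", "a"]
blocks = hankel_blocks(f, prefixes, suffixes, "ab")

# Stacking the blocks row-wise gives the Hankel matrix over the p-closure
# (rows shared by P and P.Sigma simply appear more than once here).
H_closure = np.vstack([blocks[s] for s in ["", "a", "b"]])
print(H_closure.shape, np.linalg.matrix_rank(H_closure))
```

In practice `f` would be replaced by empirical frequencies estimated from data, so the blocks are only approximately low rank.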
2.3 Spectral learning
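It is easy to see that the rank of the Hankel matrix $H_f$ is upper bounded by the rank of $f$: if $A = (\alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma})$ is a WFA with $k$ states computing $f$, then $H_f$ admits the rank $k$ factorization $H_f = PS$ where the matrices $P \in \mathbb{R}^{\Sigma^* \times k}$ and $S \in \mathbb{R}^{k \times \Sigma^*}$ are defined by $P_{u,:} = \alpha_0^\top A_u$ and $S_{:,v} = A_v \alpha_\infty$ for all $u, v \in \Sigma^*$. Moreover, one can check that $H_\sigma = P A_\sigma S$ for each $\sigma \in \Sigma$. The spectral learning algorithm relies on the non-trivial observation that this construction can be reversed: given any rank $k$ factorization $H_\lambda = PS$, the WFA $A = (\alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma})$ defined by
$$\alpha_0^\top = P_{\lambda,:}, \qquad \alpha_\infty = S_{:,\lambda}, \qquad A_\sigma = P^+ H_\sigma S^+$$
is a minimal WFA computing $f$ (balle2014spectral, Lemma 4.1), where $H_\sigma$ for $\sigma \in \Sigma \cup \{\lambda\}$ denote the finite matrices defined above for a prefix-closed complete basis $\mathcal{B}$, and $P^+$, $S^+$ denote Moore-Penrose pseudo-inverses.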
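The reversal described above can be checked numerically. The sketch below takes a hypothetical two-state WFA, builds exact Hankel blocks over a small prefix-closed basis, factorizes the block for the empty word with a truncated SVD (one possible choice of rank factorization), and recovers an equivalent WFA with pseudo-inverses. All names and the toy automaton are assumptions for illustration.

```python
import numpy as np

def wfa_value(alpha0, alpha_inf, transitions, word):
    state = alpha0
    for symbol in word:
        state = state @ transitions[symbol]
    return float(state @ alpha_inf)

def spectral_learn(blocks, alphabet, k, lam_row, lam_col):
    """Recover a k-state WFA from Hankel blocks; blocks[""] is H_lambda."""
    U, s, Vt = np.linalg.svd(blocks[""], full_matrices=False)
    P = U[:, :k] * s[:k]   # rank-k factorization H_lambda = P S
    S = Vt[:k, :]
    P_pinv, S_pinv = np.linalg.pinv(P), np.linalg.pinv(S)
    alpha0 = P[lam_row, :]      # row of P indexed by the empty prefix
    alpha_inf = S[:, lam_col]   # column of S indexed by the empty suffix
    transitions = {a: P_pinv @ blocks[a] @ S_pinv for a in alphabet}
    return alpha0, alpha_inf, transitions

# Hypothetical target WFA with 2 states over {a, b}.
target = (
    np.array([1.0, 0.0]),
    np.array([0.0, 1.0]),
    {"a": np.array([[0.5, 0.5], [0.0, 1.0]]),
     "b": np.array([[1.0, 0.0], [0.0, 0.5]])},
)

def f(word):
    return wfa_value(*target, word)

# Prefix-closed basis; the empty word sits at index 0 on both sides.
basis = ["", "a", "b"]
blocks = {
    s: np.array([[f(u + s + v) for v in basis] for u in basis])
    for s in ["", "a", "b"]
}
learned = spectral_learn(blocks, "ab", k=2, lam_row=0, lam_col=0)

for w in ["", "ab", "bab", "aabba"]:
    print(repr(w), f(w), wfa_value(*learned, w))
```

The learned parameters generally differ from the target ones by a change of basis, but both automata compute the same function, including on words outside the basis.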
3 Nonlinear Weighted Finite Automata
The WFA model assumes that the transition operators $A_\sigma$ are linear. It is natural to wonder whether this linear assumption sometimes induces too strong a model bias (e.g. if one tries to learn a function that is not recognizable by a WFA). Moreover, even for recognizable functions, introducing non-linearity could potentially reduce the number of states needed to represent the function. Consider the following example: given a WFA $A = (\alpha_0, \alpha_\infty, \{A_\sigma\}_{\sigma \in \Sigma})$, the function $f_A^2 : u \mapsto f_A(u)^2$ is recognizable and can be computed by the WFA $A' = (\alpha_0', \alpha_\infty', \{A_\sigma'\}_{\sigma \in \Sigma})$ with $\alpha_0' = \alpha_0 \otimes \alpha_0$, $\alpha_\infty' = \alpha_\infty \otimes \alpha_\infty$ and $A_\sigma' = A_\sigma \otimes A_\sigma$, where $\otimes$ denotes the Kronecker product. One can check that if $\mathrm{rank}(f_A) = k$, then $\mathrm{rank}(f_{A'})$ can be as large as $k^2$, but intuitively the true dimension of the model is $k$ using non-linearity. (Footnote 1: By applying the spectral method on the component-wise square root of the Hankel matrix of $A'$, one would recover the WFA $A$ of rank $k$.) These two observations motivate us to introduce nonlinear WFA (NL-WFA).
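The Kronecker-product construction in this example can be verified numerically. The following sketch draws a random three-state WFA (an arbitrary choice for illustration) and compares the output of the squared automaton with the square of the original output.

```python
import numpy as np

def wfa_value(alpha0, alpha_inf, transitions, word):
    state = alpha0
    for symbol in word:
        state = state @ transitions[symbol]
    return float(state @ alpha_inf)

rng = np.random.default_rng(0)
k = 3  # number of states of the original (hypothetical) WFA
alpha0 = rng.normal(size=k)
alpha_inf = rng.normal(size=k)
A = {s: 0.5 * rng.normal(size=(k, k)) for s in "ab"}

# WFA with k^2 states computing the squared function.
alpha0_sq = np.kron(alpha0, alpha0)
alpha_inf_sq = np.kron(alpha_inf, alpha_inf)
A_sq = {s: np.kron(M, M) for s, M in A.items()}

for w in ["", "a", "ab", "bba"]:
    original = wfa_value(alpha0, alpha_inf, A, w)
    squared = wfa_value(alpha0_sq, alpha_inf_sq, A_sq, w)
    print(repr(w), original ** 2, squared)
```

The agreement follows from the mixed-product property of the Kronecker product; the price is a state space of dimension $k^2$ instead of $k$.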
3.1 Definition of NL-WFA
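We will use the notation $\tilde{g}$ to stress that a function $g$ may be nonlinear. We define a NL-WFA $\tilde{A}$ with $k$ states as a tuple $(\alpha_0, \tilde{G}_\lambda, \{\tilde{G}_\sigma\}_{\sigma \in \Sigma})$, where $\alpha_0 \in \mathbb{R}^k$ is a vector of initial weights, $\tilde{G}_\sigma : \mathbb{R}^k \to \mathbb{R}^k$ is a transition function for each $\sigma \in \Sigma$ and $\tilde{G}_\lambda : \mathbb{R}^k \to \mathbb{R}$ is a termination function. A NL-WFA computes a function $f_{\tilde{A}} : \Sigma^* \to \mathbb{R}$ defined by
$$f_{\tilde{A}}(x) = \tilde{G}_\lambda\big(\tilde{G}_{x_t}(\cdots \tilde{G}_{x_2}(\tilde{G}_{x_1}(\alpha_0)) \cdots)\big)$$
for any word $x = x_1 x_2 \cdots x_t \in \Sigma^*$. Similarly to the linear case, we will sometimes use the shorthand notation $\tilde{G}_x = \tilde{G}_{x_t} \circ \cdots \circ \tilde{G}_{x_2} \circ \tilde{G}_{x_1}$. This nonlinear model can be seen as a generalization of dynamical recognizers (moore1997dynamical) to the quantitative setting. It is easy to see that one recovers the classical WFA model by restricting the functions $\tilde{G}_\sigma$ and $\tilde{G}_\lambda$ to be linear. Of course some restrictions on these nonlinear functions have to be imposed in order to control the expressiveness of the model. In this paper, we consider nonlinear functions computed by neural networks.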
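To make the definition concrete, here is a minimal sketch of how a NL-WFA processes a word: the state starts at the initial vector, each symbol applies its transition function, and the termination function maps the final state to a scalar. The single-layer tanh transitions and the linear termination used below are assumptions chosen for brevity, not the architecture used in the experiments.

```python
import numpy as np

def nlwfa_value(alpha0, transition_fns, termination_fn, word):
    """Apply the transition function of each symbol in turn, then terminate."""
    state = np.asarray(alpha0, dtype=float)
    for symbol in word:
        state = transition_fns[symbol](state)
    return float(termination_fn(state))

rng = np.random.default_rng(0)
k = 3  # number of states (hypothetical)
W = {s: rng.normal(size=(k, k)) for s in "ab"}
w_out = rng.normal(size=k)
alpha0 = rng.normal(size=k)

# Hypothetical nonlinear transitions: one tanh layer per symbol.
# The default argument M=W[s] binds each matrix at definition time.
transition_fns = {s: (lambda h, M=W[s]: np.tanh(h @ M)) for s in "ab"}

def termination_fn(h):
    return h @ w_out  # linear termination, for simplicity

print(nlwfa_value(alpha0, transition_fns, termination_fn, "abba"))
```

Replacing the tanh layers by plain matrix products turns this into an ordinary WFA, which is the restriction to linear maps mentioned above.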
3.2 A representation learning perspective on the spectral algorithm
Our learning algorithm is inspired by the spectral learning method for WFA. In order to give some insights and further motivate our approach, we will first show how the spectral method can be interpreted as a representation learning scheme.
The spectral method can be summarized as a two-stage process consisting of a factorization step and a regression step: first find a low-rank factorization of the Hankel matrix and then perform regression to estimate the transition operators $\{A_\sigma\}_{\sigma \in \Sigma}$.
First focusing on the factorization step, let us observe that one can naturally embed the set of prefixes into the vector space $\mathbb{R}^{\mathcal{S}}$ by mapping each prefix $u$ to the corresponding row $H_{u,:}$ of the Hankel matrix. However, it is easy to check that this representation is highly redundant when the Hankel matrix is of low rank. In the factorization step of the spectral learning algorithm, the rank $k$ factorization $H = PS$ can be seen as finding a low-dimensional representation $P_{u,:} \in \mathbb{R}^k$ for each prefix $u$, from which the original Hankel representation can be recovered using the linear map $S$ (indeed $H_{u,:} = P_{u,:} S$). We can formalize this encoder-decoder perspective by defining two maps $\phi_P : \mathcal{P} \to \mathbb{R}^k$ and $\phi_S : \mathbb{R}^k \to \mathbb{R}^{\mathcal{S}}$ by $\phi_P(u)^\top = P_{u,:}$ and $\phi_S(x)^\top = x^\top S$. One can easily check that $\phi_S(\phi_P(u))^\top = H_{u,:}$, which implies that $\phi_P(u)$ encodes all the information sufficient to predict the value $f(uv)$ for any suffix $v$ (indeed $f(uv) = \phi_P(u)^\top S_{:,v}$).
The regression step of the spectral algorithm consists in recovering the matrices $A_\sigma$ satisfying $H_\sigma = P A_\sigma S$. From our encoder-decoder perspective, this can be seen as recovering the compositional mappings $A_\sigma$ satisfying $\phi_P(u\sigma)^\top = \phi_P(u)^\top A_\sigma$ for each $\sigma \in \Sigma$.
It follows from the previous discussion that non-linearity could be beneficially brought to WFA and into the spectral learning algorithm in two ways: first by using nonlinear methods to perform the factorization of the Hankel matrix, thus discovering a potentially nonlinear embedding of the Hankel representation, and second by allowing the compositional feature maps associated to each symbol to be nonlinear.
4 Learning NL-WFA
Introducing non-linearity can be achieved in several ways. In this paper, we will use neural networks due to their ability to discover relevant nonlinear low-dimensional representation spaces and their expressive power as function approximators.
4.1 Nonlinear factorization
Introducing non-linearity in the factorization step boils down to finding two mappings and such that for any prefix . Briefly going back to the linear case, one can check that if , then we have for each prefix , implying that the encoder-decoder maps satisfy and . Thus the factorization step can essentially be interpreted as finding an auto-encoder able to project down the Hankel representation to a low dimensional space while preserving the relevant information captured by .
How to extend the factorization step to the nonlinear setting should now appear clearly: by training an auto-encoder to learn a low-dimensional representation of the Hankel representations , one will potentially unravel a rich representation of the set of prefixes from which a NL-WFA can be recovered.
Let and be the encoder and decoder maps respectively. We will train the auto-encoder shown in Figure 1 (left) to achieve
More precisely, if , the model is trained to map the original Hankel representation of each prefix to a latent representation vector in , where , and then map this vector back to the original representation . This is achieved by minimizing the reconstruction error (i.e. the
distance between the original representation and its reconstruction). Instead of linearly factorizing the Hankel matrix, we use an auto-encoder framework consisting of two networks, whose hidden layer activation functions are nonlinear. (Footnote 2: We use the (component-wise) function in our experiments.)
More precisely, if we denote the nonlinear activation function by , and we let A, B, C, D be the weights matrices from the left to the right of the neural net shown in Figure 1 (left), the function computed by the auto-encoder can be written as
where the encoder-decoder functions and are defined by and for vectors .
It is easy to check that if the activation function is the identity, one will exactly recover a rank factorization of the Hankel matrix, thus falling back onto the classical factorization step of the spectral learning algorithm.
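The remark about the identity activation can be illustrated with a small forward-pass sketch. The placement of the activations below (nonlinear hidden and latent units in the encoder, a nonlinear hidden layer and linear outputs in the decoder) is an assumption made for this example, as are the layer sizes and the random stand-in for the Hankel matrix; no training is performed.

```python
import numpy as np

def encoder(h, A, B, theta):
    """Map Hankel rows to k-dimensional latent vectors (assumed layout)."""
    return theta(theta(h @ A) @ B)

def decoder(z, C, D, theta):
    """Map latent vectors back to the Hankel representation (assumed layout)."""
    return theta(z @ C) @ D

def reconstruction_loss(H, A, B, C, D, theta):
    H_hat = decoder(encoder(H, A, B, theta), C, D, theta)
    return float(np.mean(np.sum((H - H_hat) ** 2, axis=1)))

rng = np.random.default_rng(0)
n_prefixes, n_suffixes, hidden, k = 8, 6, 5, 2
H = rng.random((n_prefixes, n_suffixes))  # stand-in for an estimated Hankel matrix
A = rng.normal(size=(n_suffixes, hidden))
B = rng.normal(size=(hidden, k))
C = rng.normal(size=(k, hidden))
D = rng.normal(size=(hidden, n_suffixes))

def identity(x):
    return x

# With the identity activation the network computes H A B C D,
# a linear map through a k-dimensional bottleneck.
H_linear = decoder(encoder(H, A, B, identity), C, D, identity)
print(np.linalg.matrix_rank(H_linear))
print(reconstruction_loss(H, A, B, C, D, np.tanh))
```

With the identity activation the reconstruction has rank at most $k$, so minimizing the reconstruction error over the weights amounts to searching for a rank $k$ factorization of the Hankel matrix.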
4.2 Nonlinear regression
Given the encoder-decoder maps and , we then move on to recovering the transition functions. Recall that we wish to find the compositional feature maps for each satisfying for all . Using the encoder map obtained in the factorization step, the mapping can be written as .
In order to learn these transition maps, we will thus train one neural network for each symbol
to minimize the following squared error loss function
The structure of the simple feed-forward network used to learn the transition maps is shown in Figure 1 (right). Let be the two weights matrices, the function computed by this network can be written as
We want to point out that both hidden units and output units of this network are nonlinear. Since this network will be trained to map between latent representations computed by the factorization network, the output units of the transition network and the units corresponding to the latent representation in the factorization network should be of the same nature to facilitate the optimization process.
4.3 Overall learning algorithm
Let be a basis of suffixes and prefixes such that . Let be its -closure (i.e. ) and let . For reasons that will be clarified in the next section, we assume that is prefix-closed (i.e. for any , all prefixes of also belong to ). The first step consists in building the estimate of the Hankel matrix from the training data (by using e.g. the empirical frequencies in the train set), where the rows of are indexed by prefixes in and its columns by suffixes in . The learning algorithm for NL-WFA then consists of two steps:
Train the factorization network to obtain a nonlinear decomposition of the Hankel matrix through the mappings and satisfying
Train the transition networks for each symbol to learn the transition maps satisfying
The resulting NL-WFA is then given by where and is defined by
is the one-hot encoding of the empty suffix.
4.4 Theoretical analysis
While the definitions of the initial vector and termination function given above may seem ad-hoc, we will now show that the learning algorithm we derived corresponds to minimizing an error loss function between and the estimated value over all prefixes in . Intuitively, this means that our learning algorithm aims at minimizing the empirical squared error loss over the training set . More formally, we show in the following theorem that if both the factorization network and the transition networks are trained to optimality (i.e. they both achieve training error), then the resulting NL-WFA exactly recovers the values given in the first column of the estimate of the Hankel matrix.
We first show by induction on the length of a word that
To conclude, for any we have by Eq. (1).
Intuitively, it follows that the learning algorithm described in Section 4.3 aims at minimizing the following loss function
where is the estimated value of the target function on the word , and where the NL-WFA is a function of the encoder-decoder maps and of the transition maps as described in Section 4.3.
Even though Theorem 2 seems to suggest that our learning algorithm is prone to over-fitting, this is not the case. Indeed, akin to the linear spectral learning algorithm, the restriction on the number of states of the NL-WFA (which corresponds to the size of the latent representation layer in the factorization network) induces regularization and forces the learning process to discriminate between signal and noise (i.e. in practice, the networks will not achieve zero error due to the bottleneck structure of the factorization network).
4.5 Applying non-linearity independently in the factorization and transition networks
We have shown that non-linearity can be introduced into the two steps of our learning algorithm. We can thus consider three variants of this algorithm where we either apply non-linearity in the factorization step only, in the regression step only, or in both steps. It is easy to check that these three different settings correspond to three different NL-WFA models depending on whether the termination function only is nonlinear, the transition functions only are nonlinear, or both the termination and transition functions are nonlinear. Indeed, recall that a NL-WFA is defined as a tuple $(\alpha_0, \tilde{G}_\lambda, \{\tilde{G}_\sigma\}_{\sigma \in \Sigma})$. If no non-linearity is introduced in the factorization network, the termination function will have the form
(using the notations from the previous sections), which is linear. Similarly, if no non-linearity is used in the transition networks, the resulting maps will be linear.
One may argue that only applying non-linearity in the termination function would not lead to an expressive enough model. However, it is worth noting that in this case, after the nonlinear factorization step, even though the transition functions are linear they are operating on a nonlinear feature space. This is similar in spirit to the kernel trick, where a linear model is learned in a feature space resulting from a nonlinear transformation of the initial input space. Moreover, if we go back to the example of the squared function $f_A^2$ for some WFA $A$ with $k$ states (see beginning of Section 3), even though $f_A^2$ may have rank up to $k^2$, one can easily build a NL-WFA with $k$ states computing $f_A^2$ where only the termination function is nonlinear.
We compare the classical spectral learning algorithm with the three configurations of our neural-net based NL-WFA learning algorithm: applying non-linearity only in the factorization step (denoted by fac.non), only in the regression step (denoted by tran.non), and in both phases (denoted by both.non). We will perform experiments on a grammatical inference task (i.e. learn a distribution over $\Sigma^*$ from samples drawn from this distribution) with both synthetic and real data.
We use two metrics to evaluate the trained models on a test set: Pautomac score and word error rate.
The Pautomac score was first proposed for the Pautomac challenge (verwer2014pautomac) and is defined by
$$\mathrm{Pautomac}(M) = 2^{-\sum_{x \in T} P_*(x) \log_2 P_M(x)}$$
where $P_M(x)$ is the normalized probability assigned to $x$ by the learned model and $P_*(x)$ is the normalized true probability (both $P_M$ and $P_*$ are normalized to sum to $1$ over the test set $T$). Since the models returned by both our method and the spectral learning algorithm are not guaranteed to output positive values, while the logarithm of a negative value is not defined, we take the absolute values of all the negative outputs.
The word error rate (WER) measures the percentage of incorrectly predicted symbols when, given each prefix of strings in the test set, the most likely next symbol is predicted.
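Both metrics can be sketched in a few lines. The snippet below computes a perplexity-style Pautomac score from raw model outputs and reference probabilities, and a word error rate from a next-symbol predictor. The function names, the `predict_next` interface, and the choice to ignore any end-of-string symbol in the WER are assumptions for illustration and may differ from the evaluation code used in the experiments.

```python
import numpy as np

def pautomac_score(model_values, true_probs):
    """Perplexity-style score; lower is better."""
    # Negative model outputs are replaced by their absolute values.
    p_model = np.abs(np.asarray(model_values, dtype=float))
    p_model = p_model / p_model.sum()   # normalize over the test set
    p_true = np.asarray(true_probs, dtype=float)
    p_true = p_true / p_true.sum()
    return float(2.0 ** (-np.sum(p_true * np.log2(p_model))))

def word_error_rate(predict_next, test_strings):
    """Fraction of positions where the predicted next symbol is wrong."""
    errors = 0
    total = 0
    for word in test_strings:
        for i, symbol in enumerate(word):
            if predict_next(word[:i]) != symbol:
                errors += 1
            total += 1
    return errors / total

# A model matching a uniform distribution over 4 test strings.
print(pautomac_score([0.1, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]))

# A trivial predictor that always guesses "a".
print(word_error_rate(lambda prefix: "a", ["ab", "aa"]))
```

For a model that matches the reference distribution exactly, the score equals two to the power of the entropy of that distribution, which is the smallest value the score can take.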
5.2 Synthetic data: probabilistic Dyck language
For the synthetic data experiment, we generate data from a probabilistic Dyck language. Let , we consider the language generated by the following probabilistic context free grammar
i.e. starting from the start symbol, we draw one of the rules according to their probabilities and apply it to rewrite the symbol into the corresponding right-hand side; this process is repeated until no nonterminal symbols are left. One can check that this distribution will generate balanced strings of brackets. It is well known that this distribution cannot be computed by a WFA (since its support is a context-free language). However, as a WFA can compute any distribution with finite support, it can model the restriction of this distribution to words of length less than some threshold. By using this distribution for our synthetic experiments, we want to showcase the fact that NL-WFA can lead to models with better predictive accuracy when the number of states is limited and that they can better capture the complex structure of this distribution.
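The sampling procedure described above can be sketched as follows. The production rules and their probabilities in `RULES` are assumptions chosen for this illustration (a single nonterminal `S` with three rules whose probabilities make the derivation terminate with probability one); they are not the values used in the experiments.

```python
import random

# Assumed rules for the nonterminal S, with assumed probabilities.
RULES = [
    (("S", "S"), 0.2),        # S -> S S
    (("[", "S", "]"), 0.4),   # S -> [ S ]
    (("[", "]"), 0.4),        # S -> [ ]
]

def sample_dyck(rng):
    """Sample one string by repeatedly rewriting the leftmost nonterminal."""
    stack = ["S"]
    output = []
    while stack:
        symbol = stack.pop()
        if symbol == "S":
            rhs = rng.choices(
                [rule for rule, _ in RULES],
                weights=[prob for _, prob in RULES],
            )[0]
            # Push the right-hand side reversed so it is read left to right.
            stack.extend(reversed(rhs))
        else:
            output.append(symbol)
    return "".join(output)

def is_balanced(word):
    depth = 0
    for char in word:
        depth += 1 if char == "[" else -1
        if depth < 0:
            return False
    return depth == 0

rng = random.Random(0)
samples = [sample_dyck(rng) for _ in range(5)]
print(samples)
```

Using an explicit stack instead of recursion avoids hitting the recursion limit on the occasional long derivation.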
In our experiments, we use empirical frequencies in a training data set to estimate the Hankel matrix , where the p-closed basis is obtained by selecting the most frequent prefixes and suffixes in the training data. We first assess the ability of NL-WFA to better capture the structure in the data when the number of states is limited. We compared the models for different model sizes ranging from to , where $k$ is the number of states of the learned WFA and NL-WFA. For the latter, we used a structure with three hidden layers for the factorization network, where the numbers of hidden units are set to , and . For the transition networks, we use a neural network with hidden units. (Footnote 3: These hyper-parameters are not finely tuned, thus some optimization might potentially improve the results.) We used Adamax (kingma2014adam) with learning rates 0.015 and 0.001 respectively to train these two networks.
All models are trained on a training set of size and the Pautomac score and WER on a test set of size are reported in Figure 2 and 3 respectively. For both metrics, we see that NL-WFA gives better results for small model sizes. While NL-WFA and WFA tend to perform similarly for the Pautomac score for larger model sizes, NL-WFA clearly outperforms WFA in terms of WER in this case. This shows that including non-linearity can increase the prediction power of WFA by discovering the underlying nonlinear structure and can be beneficial when dealing with a small number of states.
We then compared the sample complexity of learning NL-WFA and WFA by training the different models on training sets of sizes ranging from to . For all models the rank is chosen by cross-validation. In Figure 4 and Figure 5, we show the performance of the four models on a test set of size by reporting the average and standard deviation over runs of this experiment. We can see that NL-WFA achieve better results on small sample sizes for the Pautomac score and consistently outperform the linear model for all sample sizes for WER. This shows that NL-WFA can use the training data more efficiently and again that the expressiveness of NL-WFA is beneficial to this learning task.
5.3 Real data: Penn treebank
The Penn Treebank (taylor2003penn) is a well known benchmark dataset for natural language processing. It consists of approximately 7 million words of part-of-speech tagged text, 3 million words of skeletally parsed text, over 2 million words of text parsed for predicate argument structure, and 1.6 million words of transcribed spoken text annotated for speech disfluencies. In this experiment, we use a small portion of the Treebank dataset: the character level of English verbs which was used in the SPICE challenge (balle2017results). This dataset contains 5,987 sentences over an alphabet of 33 symbols as the training set. It also provides two test sets of size 750. We used one of the test sets as a validation set and then tested our models on the other.
For this experiment, the Hankel matrix is of size where the prefixes and suffixes have again been selected by taking the most frequent ones in the training data. We used a five-layer factorization network where the layers are of size , , , and respectively, where $k$ is the number of states of the NL-WFA. The structure of the transition networks is the same as in the previous experiment. For all models, the rank is selected using the validation set.
In Table 1, we report the results for the two metrics on the test set. We can see that for both metrics, one of the NL-WFA models outperforms linear spectral learning. Individually speaking, for modeling the distribution (i.e. the perplexity metric) tran.non gives the best performances, while for the prediction task fac.non shows a significant advantage.
| log(Pauto) | 21.3807 | 12.2571 | 13.8311 | 13.6604 |
(Footnote 4: Since we do not have access to the true probabilities, they are estimated using the empirical frequencies in the test set.)
We believe that trying to combine models from formal language theory (such as weighted automata) and models that have recently led to several successes in machine learning (e.g. neural networks) is an exciting and promising line of research, both from the theoretical and practical sides. This work is a first step in this direction: we proposed a novel nonlinear weighted automata model along with a learning algorithm inspired by the spectral learning method for classical WFA. We showed that non-linearity can be introduced in two ways in WFA, in the termination function or in the transition maps, which directly translates into the two steps of our learning algorithm.
In our experiments, we showed on both synthetic and real data that (i) NL-WFA can lead to models with better predictive accuracy than WFA when the number of states is limited, (ii) NL-WFA are able to capture the complex underlying structure of challenging languages (such as the Dyck language used in our experiments) and (iii) NL-WFA exhibit better sample complexity when learning on data with a complex grammatical structure.
In the future, we intend to investigate further the properties of NL-WFA from both the theoretical and experimental perspectives. For the former, one natural question is whether we could obtain learning guarantees for some specific classes of nonlinear functions. Indeed, one of the main advantages of the spectral learning algorithm is that it provides consistent estimators. While it may be difficult to obtain such guarantees when considering functions computed by neural networks, we believe that studying the case of more tractable nonlinear functions (e.g. polynomials) could be very insightful. We also plan on thoroughly investigating connections between NL-WFA and RNN. From the practical perspective, we want to first tune the hyper-parameters for NL-WFA more extensively on the current datasets to potentially improve the results. In addition, we intend to run further experiments on real data and on different kinds of tasks besides language modeling (e.g. classification, regression). Moreover, due to the strong connection between WFA and PSR, it will be very interesting to use NL-WFA in the context of reinforcement learning.
It is worth mentioning that the spectral learning algorithm cannot straightforwardly be used to learn functions that are not probability distributions. Indeed, while it makes sense in the probabilistic setting to fill the entries of the Hankel matrix corresponding to words that are not in the training data with $0$, it is not clear how to fill these entries when one wants to learn a function that is not a probability distribution, e.g. in a regression task. One way to circumvent this issue is to first use matrix completion techniques to fill these missing entries before performing the low-rank decomposition of the Hankel matrix (balle2012spectral). In contrast, our learning algorithm can directly be applied to this setting by simply adapting the loss function of the factorization network (i.e. simply ignoring the missing entries in the loss function).
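One possible way to ignore missing entries in the reconstruction loss is to mask them out, as in the sketch below. The convention of marking missing Hankel entries with NaN, and the function name `masked_reconstruction_loss`, are assumptions for this illustration.

```python
import numpy as np

def masked_reconstruction_loss(H, H_hat, observed):
    """Mean squared error over observed entries only."""
    observed = np.asarray(observed, dtype=bool)
    # np.where discards the differences at unobserved positions,
    # so NaN placeholders never reach the sum.
    diff = np.where(observed, H - H_hat, 0.0)
    return float(np.sum(diff ** 2) / observed.sum())

# Hypothetical Hankel estimate with one missing entry marked as NaN.
H = np.array([[1.0, np.nan], [3.0, 4.0]])
H_hat = np.array([[1.0, 100.0], [3.0, 2.0]])
observed = ~np.isnan(H)

print(masked_reconstruction_loss(H, H_hat, observed))
```

The reconstruction at the missing position, however large, does not contribute to the loss, so the network is free to fill in that entry.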