Sequence modeling is a fundamental real-world problem with a wide range of applications. Recurrent neural networks (RNNs) are currently preferred in sequence prediction tasks due to their ability to model long-term and variable order dependencies. However, RNNs have disadvantages in several applications because of their inability to natively handle uncertainty, and because of the inscrutable internal representations.
Probabilistic sequence models like Hidden Markov Models (HMM) have the advantage of more interpretable representations and the ability to handle uncertainty. Although overcomplete HMMs with many more hidden states compared to the observed states can, in theory, model long-term temporal dependencies , training HMMs is challenging due to credit diffusion 
. For this reason, simpler and inflexible n-gram models are preferred to HMMs for tasks like language modeling. Tensor decomposition methods have been suggested for the learning of HMMs in order to overcome the credit diffusion problem, but current methods are not applicable to the overcomplete setting where the full-rank requirements on the transition and emission matrices are not fulfilled. Recently there has been renewed interest in the topic of training overcomplete HMMs for higher-order dependencies with the expectation that sparsity structures could potentially alleviate the credit diffusion problem .
In this paper we demonstrate that a particular sparsity structure on the emission matrix can help HMMs learn higher-order temporal structure using the standard Expectation-Maximization algorithms (Baum-Welch) and its online variants. We call this model cloned HMM (CHMM) because the sparsity structure deterministically maps multiple hidden states (clones) to the same emission state, whereas the emission matrix in a standard HMM is dense and allows for any hidden state to emit any emission state. The basic idea originated in a popular compression method called dynamic Markov coding (DMC) 
, where the temporal dependence in a first order Markov chain is gradually increased by ‘cloning’ the states Fig1(a). The same idea has been re-discovered several times in different domains [11, 28, 8, 20]. Cloned hidden states are conjectured to be behind the higher-order sequence representations in bird songs , and specific aspects of the laminar and columnar organization of cortical circuits are postulated to reflect this cloning structure [11, 10] (Fig 1(g)). Researchers have attempted to learn such structures, or partially observable Markov models, using greedy state splitting algorithms that start from a first-order Markov model [7, 5, 20, 28]. Instead, we recognize that the cloning structure can be initialized as a sparse emission matrix of the HMM. The HMM allocates a maximum split capacity for each symbol up front and lets the learning algorithm decide how to utilize that capacity. This results in more flexibility compared to greedy state-splitting approaches (see Appendix A).
We test the efficacy of CHMMs by using them to learn character-level models of English language using a variety of English text datasets. Our experiments show that CHMMs outperform standard HMMs, n-grams, sequence memoizers, and LSTM-RNNs on several of these tasks while being efficient to train and infer. The models learned by CHMM are overcomplete by a factor of , are extremely sparse, and can be thought of as learning the underlying structure of sequence generation. Unlike n-grams, CHMMs can model arbitrarily long temporal dependencies without having to specify the order up front. When a predictive context contains “holes” of irrelevant symbols111The term “holes” has been used  to refer to parts of a context that are irrelevant for prediction. For instance, we can consider a data set in which the context I am going will predict to regardless of the existence of additional intermediate symbols, so that other contexts like I kgt am mnbh going or I mqlr am pzb going should result in the same predictive distribution. We would refer to the groups of inserted symbols kgt, mnbh as “holes”. that do not affect prediction, n-grams and sequence memoizers  will need exponentially (in the number of holes) more training data to be able to use the context for prediction, whereas CHMMs can handle such sequences efficiently and flexibly. Being generative models, they can model uncertainty natively, and can answer queries that were not part of the training objective. The learned transition matrix of a CHMM is a sparse graph of the generation process, and structural analysis of this graph may reveal communities that can help with hierarchical modeling and planning.
The rest of this paper is organized as follows: In Section 2 we introduce a toy example to illustrate the representational properties of CHMM, and discuss related work. In Section 3 we define the CHMM model and briefly discuss the variations of the EM algorithm we use to learn its parameters. Our results (Section 4) are divided into two subsections. In the first subsection we use synthetic data with known optimal solutions to demonstrate the properties of CHMMs in comparison with regular HMMs and LSTM-RNNs. In the second subsection we use a variety of English text datasets to compare CHMMs with n-grams, sequence memoizers, and LSTM-RNNs. We conclude in Section 6 with a discussion of future work in the context of our findings.
2 Modeling a toy sequence
Consider the finite state machine222 FSMs can be classified in nondeterministic finite automata (NFA) and deterministic finite automata (DFA). While it is possible to convert between these two representations, the first one is exponentially more compact for some FSMs. The formulation presented here applies to both categories, thus allowing the most compact representation to be used.
FSMs can be classified in nondeterministic finite automata (NFA) and deterministic finite automata (DFA). While it is possible to convert between these two representations, the first one is exponentially more compact for some FSMs. The formulation presented here applies to both categories, thus allowing the most compact representation to be used.(FSM) depicted in Fig. 1
(e). Each of the 24 nodes corresponds to a different internal state. Arrows are labeled with the probability of transition between two given states. The numbers inside the nodes indicate the symbol that the machine accepts when transitioning out of that node. E.g., when at a node labeled with a 7 (note that there are two internal states with this label), the machine will only accept a 7 as the input, and it will transition with equal probability to a state that accepts either an 8 or a 9. The output sequence of this FSM has interesting long-term structure. For example, observing a 0 guarantees a 1 appearing eight time-steps later. Similarly, a 2 will predict a 3 eight steps later with absolute certainty, despite the stochastic transitions in between. The task at hand is to discover this machine (or an equivalent one) based on a long-enough sequence generated by it.
We could model this sequence as a generic FSM, with the corresponding Bayesian network (BN) shown in Fig.1(b), where is the hidden state of the machine at time , and is the observation when transitioning out of towards . In particular, it can be modelled when the dimension of the hidden space (number of different discrete values that it can take) is of size 24, which is equal to the number of nodes 333We can exploit some properties of the sequence considered herein to slightly restrict the hidden space size.. We would then need to learn the conditional probabilities that each arrow represents. We only need to learn one time slice, since they are all identical.
Another way to encode the FSM as a BN is depicted in Fig. 1c. The additional arrows with respect to the FSM model increase its expressiveness and now we can encode our toy sequence with only 2 hidden states, instead of 24. Since now we can also refer to the previous emission and state to decide the current emission, the hidden state no longer needs to remember in which of the 24 hidden states it is, but only in which of the 2 copies (the top or bottom row of Fig. 1(e) that emit a given symbol) it is. We refer to each hidden state that can take as a different clone of the emission .
It is possible to perform exact inference on this model, as well as to learn the conditional probabilities that define it using expectation maximization (EM). To illustrate this, note that we can collect and write the model as in Fig. 1(d). This graphical model is a hidden Markov model with a fixed emission matrix, since given how we created , the emissions follow deterministically from the state. Now each group of hidden states that result in the same emission is a group of clones, since they look the same when emitted. For this reason, we call it the cloned HMM (CHMM). The number of clones associated to each emission is fixed, but can be different per emission. The transition matrix encodes the conditionals. This means that standard Baum-Welch (minus the learning of the emission matrix) can be used to recover the above FSM when using enough clones (in theory only two clones are required, but in practice more clones can help escape local minima).
To be completely clear, Fig 1(c) and Fig 1(d) are two representations of the same model. The semantics of the hidden variable differs, however: in this example, the hidden states in Fig 1(c) have cardinality 2, whereas the hidden states Fig 1(d) have cardinality 24, since the latter collect both and .
Fig 1(f) illustrates how the different states in an HMM are tied to the same emission state. On the left is the graphical model for an HMM. On the right we show the details of the internals within each node. This CHMM is overcomplete with three times as many hidden states (shown as white circles within the green oval) as emission states (shown as white circles within the blue oval). The emission matrix, comprising the arrows from the hidden states to the emission states, shows the cloned structure with multiple hidden states mapping to the same emission state. This matrix is initialized deterministically and does not change during learning. The transition matrix, encapsulated in the arrow between the green ovals, is initialized randomly and is modified during learning. The set of hidden states that deterministically emit the same observation can be considered as “clones” of each other and of the observed symbol. The different clones, through the learning of the transition matrix, will come to represent the different temporal contexts in which a particular symbol occurs in the training data. The long range dependencies get embedded in the hidden state and are propagated at arbitrary distances, thus becoming Markovian in that hidden state.
Fig 2(a) shows the learned transition matrix when a CHMM is used to learn a model for the sequence generated by the toy FSM. The CHMM was initialized with just two clones for each symbol and it was able to exactly recover the underlying sequence generation structure and achieve the best possible predictive performance. Observe that the number of clones do not correspond to the temporal order: this example has eight times-steps of temporal dependency, but this is captured using only two clones. In contrast, a regular overcomplete HMM failed to capture the long-term dependence corresponding to the generating FSM, see Fig 2(b). The problem here is not that the overcomplete HMM is insufficient, but the opposite, its extra flexibility makes it converge to a bad local minimum.
2.1 Related Work
N-grams, when combined with Kneser-Ney smoothing , are one of the strongest baselines for sequence modeling, being the tool of choice in many settings. Their weakness, however, is that they base their predictions exclusively on the statistics gathered over a finite context. When the required context is small this is not a problem, but since the model size grows exponentially with the length of the context, this quickly results in a) storage issues and b) insufficient data to fit a good model. For instance, in the above FSM, we can increase the length of the random sequence from 8 to an arbitrarily large number and the optimal model for the sequence will still be expressible as a CHMM with 2 clones. On the other hand, the context of the n-grams will grow arbitrarily large while memorizing all the exponentially many random sequences that are observed, becoming unfeasible at some point. Sequence memoizers  provide better smoothing mechanisms than n-grams, but share the same shortcoming as n-grams in not being able to efficiently represent sequences with “holes”.
The idea of cloning the states of a Markov chain to create higher-order representations can be traced back to a popular compression algorithm . Clones are created by identifying the states in a lower-order model that need to be split, and then relearning the counts. This basic idea has been re-discovered and elaborated. multiple times in different fields [18, 11, 28, 13, 17]. Notably, cloned state representations are postulated to be the underlying mechanisms for representing sequential structure in bird songs [13, 17]. While splitting states is a viable method for learning simple sequential structure, it encounters problems with complex sequences (see Appendix A). Partially observable Markov Models (POMMs)  utilize the same idea of a cloned structure and recognize their connection to HMMs, but utilized a greedy clone splitting strategy to learn the HMM.
It is well recognized that constraining an HMM appropriately can alleviate the local minima problem of EM [23, 21]. The only difference between the CHMM and the HMM is that in the CHMM the emission matrix is initialized to a particular deterministic sparse matrix which does not change with further learning. Even though this makes the CHMM less expressive compared to HMMs444For the same number of hidden states. An HMM can always be expressed as a CHMM by increasing the number of hidden states and embedding the emission matrix of the HMM in the transition matrix of the CHMM., it also makes it less prone to finding poor local minima. We show in Section 4 that the above FSM cannot be learned as accurately by an HMM as it is by a CHMM. Even more importantly, knowing the emission matrix a priori results in important computational savings, as we will show in Section 3, making CHMMs more widely applicable. Overcomplete HMMs are appealing because of their ability to represent higher-order information, and  anticipates that specific sparsity structures on the transition matrix of an overcomplete HMM can lead to efficient learning. However, restricting the transition matrix to particular classes like left-to-right or cyclic can limit the expressive power of the model and limit its application. One, perhaps surprising, insight of CHMM is that imposing the sparsity structure on the emission matrix leads to a sparse transition matrix where the sparsity pattern is learned from data.
RNNs, in their different incarnations, keep a state that is propagated while modeling the sequence, which means that they can properly model this type of sequence. The CHMM can be seen as the fully generative version of an RNN, thus being able to deal with all sorts of probabilistic queries in closed form.
3 Learning a CHMM
Since the CHMM is just a normal HMM with additional restrictions, the same equations for learning HMMs apply, with some simplifications. These simplifications result in significant savings in compute and memory.
The probability of a sequence under this model can be expressed as the sum over the (exponentially many) hidden sequences that are compatible with the observed sequence. The probability of each hidden sequence is straightforward to compute using the chain rule:
where means that the summation is only over the hidden values of that emit . There is a fixed, known association between hidden states and emissions, with each hidden state being associated to exclusively one emission. As noted in the previous section, the different hidden states associated to one emission are called the clones of that emission.
A simple and effective way to decide how to allocate clones to each symbol is to do so proportionally to the number of times that the symbol appears in the data. The intuition that more frequent symbols need more clones is obvious when we go to the extreme case of memorizing a sequence: we need exactly as many states as total symbols appear in the sequence, and as many clones per symbol as occurrences of that symbol appear in the sequence.
Expectation-Maximization for CHMM:
If the hidden states are ordered in such a way that all the clones of the first emission are placed first, all the clones of the second emission are placed second, etc, we can divide the transition matrix in submatrices, in which each submatrix is relevant for the transition between a pair of symbols. I.e., the transition matrix can be broken down into smaller submatrices that contain the transition probabilities such that and . The subindices range in , where is the total number of emitted symbols. I.e., they identify a specific set of clones. Arranging the in a grid according to their subindices recovers the full transition matrix . Similarly, the prior probability vector can be broken in sub-vectors that contain the prior probabilities such that . Stacking all the sub-vectors, we recover . The standard Baum-Welch equations can then be expressed in a simpler form:
where the symbol is the elementwise product (with broadcasting where needed) and is the elementwise division (with broadcasting where needed). All vectors are column vectors, where is the number of clones per emission555We use a constant number of clones per emission for simplicity here, but this number can be selected independently for each emission..
Online EM for CHMM:
It is possible to develop an adaptive, online version of the above algorithm by splitting the sequence in contiguous batches and performing EM steps on each batch successively. We store our running statistic after processing batch in , and compute from it. We have
where is a memory parameter and refers to the time steps contained in batch . If we set , then would coincide exactly with the transition matrix from the previous section. For smaller values of , the expected counts are weighed using an exponential window666Note that proper normalization of the exponential window is unnecessary, since it will cancel when computing ., thus giving more weight to the more recent counts. To obtain an online formulation, we simply express recursively
so that the expected counts of the last observed batch are incorporated in our running statistic. The statistics of batch are computed from the E step over that batch, which uses the transition matrix
, obtained after processing the previous batch. This allows to learn from arbitrarily long sequences, and to adapt to changes in the statistics if those happen over time. If we are interested in online learning but not adaptivity, we can feed all the batches from the same sequence multiple times (analogous to the epochs in an NN), so that learning can continue to make progress after one full pass through the sequence.
In a standard HMM with hidden states in which the emission matrix is stochastic, the computational cost for a sequence of length is and the required storage is , including the space for the transition matrix and the forward-backward messages.
In contrast, the CHMM exploits the known sparse emission matrix, which results in savings. With hidden states and clones per emission the computational cost is and the memory requirement is worst case , with additional savings for every pair of symbols that never appear consecutively in the training sequence (since the corresponding submatrix will be zero and does not need to be stored). Space requirements can be improved even further by using the online version of EM described above.
Since where is the total number of symbols, an increase in alphabet size does not increase the computational cost of the CHMM, whereas it does increase the computational cost of the HMM. Since the transition matrix can be stored sparsely, storage cost may or may not be increased depending on the structure of the sequence.
4 Empirical Results
We evaluate CHMMs on two classes of data sets that test a model’s ability to learn long-term temporal structure: 1) Synthetic data sets, which serve to illustrate the model’s representational properties, 2) Character-level language models, which demonstrate that CHMMs can learn on real data distributions. We show that CHMMs outperform LSTM-RNNs, n-grams and sequence memoizers in both the language modeling and synthetic data tasks.
The metric by which these methods are compared is bits per symbol —referred to as BPS— which reflects the compressive capacity of a model. For a sequence and a set of parameters , the BPS is defined as:
4.1 Synthetic data sets
We test CHMMs on two synthetic data sets: the toy example described in Section 2 and a sequence that opens and closes parenthesis and brackets in a syntactically correct way (see below). CHMMs outperformed standard overcomplete HMMs on both these tasks. On a more challenging version of these synthetic tasks, CHMMs outperformed LSTMs. Our experiments additionally reveal the computational advantage of using the online EM training procedure.
4.1.1 Toy example
Data for the toy example is generated from a generalized version of the FSM presented in Section 2. In that FSM, node deterministically led to node through a sequence of stochastically switching nodes in between. Node deterministically led to node in a similar manner. We keep the same structure but add the parameter to control the separation between and . We define and as signal elements, which are separated by noise symbols, with values in . We can interpret as the number of “holes” in the context. The example of Section 2 corresponds to using .
Train and test data are generated by concatenating stochastic sequences of length . Each stochastic noise sequence of length is emitted by repeating the following loop – indexed over . For , the symbol and is emitted with probability. Symbol is then deterministically emitted and followed by . Finally, elements and are emitted with a probability each. The last two noise elements ( and ) are connected with one of the signal elements or , depending on whether the last signal element emitted was or . As discussed in Section 2, this data can be drawn from a cloned HMM with clone for all signal elements and clones for the noise elements. We consider different training set sizes, as specified in Table 1.
Note that even though in theory 2 clones are enough to model this sequence, in practice the learning algorithm might not find the optimal 2 clone model, i.e., it might converge to a bad local minimum. Increasing the number of available clones (thus using an over-flexible model for this sequence) is an effective way to get around the local minima problem.
We compare the performance of a CHMM (with standard and online EM training) with an overcomplete HMM.
CHMM trained with EM: For , the CHMM is initialized with the optimal number of clones for each element. We allocate an extra clone to each noise element for , and an additional one for . The Baum-Welch algorithm presented in Section 3 is run for a maximum number of iterations. We stop the algorithm when the relative gain in train likelihood between 2 iterations of the algorithm is lower than .
CHMM trained with online EM: This is the same model as above, but we use the online EM algorithm. We split our training data into batches of size and consider a discount factor of .
HMM: This is an overcomplete HMM with same number of hidden states than the cloned HMMs above: for , for and for . We use Baum-Welch algorithm from the hmmlearn package777Available in scikit learn .. We use the same transition matrix initialization (random stochastic), number of iterations and convergence criterion as for the CHMMs.
The results of the comparison tabulated in Table 1 show that CHMM outperforms HMM on this data set. Training with online EM results in better models and faster convergence compared to standard EM. CHMMs trained with online EM is able to achieve almost optimal performance on and . For , it reaches the optimal performance for a portion of the random initializations.
|CHMM with EM||CHMM with online EM||HMM||Optimal|
|Training size||Max clones||BPS||N iter||BPS||N iter||BPS||BPS||Clones|
|2||11250||2||0.502 (0.002)||71.4 (13.6)||0.502 (0.001)||11.9 (0.7)||0.657 (0.000)||0.500||2|
|3||33750||2||0.556 (0.000)||42.9 (73.2)||0.446 (0.002)||94.1 (37.1)||0.556 (0.000)||0.444||2|
|4||90000||3||0.500 (0.000)||50.4 (27.5)||0.418 (0.000)||284.4 (212.4)||0.500 (0.000)||0.417||2|
|5||225000||4||0.467 (0.000)||1000.0 (0.0)||0.428 (0.033)||732.0 (240.0)||0.467 (0.000)||0.400||2|
. BPS values are averages over 10 runs and standard deviations are noted in parentheses.
4.1.2 Bracket example
A sequence in this data set consists of brackets and parenthesis [, (, ), ] forming syntactically valid sequence. The syntax is the expected: they must appear in opening-closing pairs, with both elements matching in type and nesting. Sentences are terminated with the symbol . As an example, a syntactically correct sequence with maximum nesting is: .
As in the above example, we compared two variants of CHMM training with an overcomplete HMM. We could only train the HMM for a few tens of iterations – which makes it comparable to both variants of CHMM – because of convergence issues. CHMM achieves the optimal BPS for with batch EM and online EM, and for only with online EM. We additionally found that for , online EM can solve the problem with clones whereas batch EM needs at least clones. An overcomplete HMM is unable to achieve these results in either setting.
|CHMM with EM||CHMM with online EM||HMM||Optimal|
|Training size||Max clones||BPS||N iter||BPS||N iter||BPS||BPS||Clones|
|2||50000||6||1.053 (0.003)||37.2 (1.0)||1.054 (0.02)||5.6 (0.5)||1.178 (0.004)||1.053 (0.003)||3|
|3||50000||20||1.172 (0.003)||69.2 (1.3)||1.134 (0.002)||26.7 (2.1)||1.384 (0.005)||1.130 (0.003)||7|
|3||50000||100||1.191 (0.004)||47.6 (12.4)||1.240 (0.002)||99.9 (0.3)||1.500 (0.006)||1.161 (0.004)||15|
4.1.3 Comparison with LSTM
We introduce two minor modifications of the previous examples that increases the difficulty of the data sets. Our train and test distributions are now a function of some parameter . We show that CHMM is still able to extract relevant information from this more complex environment that allows it to generalize well. In comparison, we show that an LSTM is unable to recover such accurate information and leads to worse predictive performance.
We fix and introduce a parameter which changes the generative model: the data is now generated from the FSM in Fig 3(a). The length of the noise is now stochastic, and its expected length varies with : a large will favor longer sequences – note that the results described earlier consider . We now fix , and generate the train and test sets as mixture of sequences with and sequences with , where . Our train and test sequences have the same distribution and are composed of a large majority of long sequences.
We consider the bracket example with maximum nesting . We define and as the respective probabilities of starting a sentence with a parenthesis and a bracket. Similarly, we introduce the probabilities , and for the first degree of nesting – a parenthesis or a bracket has already been opened – of respectively opening a parenthesis, opening a bracket or closing the previously opened symbol.
Our previous setting considered and . We generate the training sequence with and , and . When generating the test sequence, we set , , and .
The following Table 3 compares the methods:
CHMM: Cloned HMM described trained with online EM.
LSTM: We use the LSTM from the popular repository https://github.com/karpathy/char-rnn trained with default parameters: the number of layers is 2 and the hidden size 100. We set the maximum number of epochs to 400. We generate an additional validation set with same distribution as the test set, which is used to assess convergence.
|CHMM with online EM||LSTM|
|Data set||Avg training set size||Max clones||BPS||N iter||BPS||N iter|
|Toy||59559||2||(0.003)||91.9 (53.3)||0.535 (0.010)||400|
|Bracket||50000||4||(0.015)||15.9 (13.8)||1.211 (0.012)||400|
4.2 Real data sets
We evaluated CHMMs on character-level language modeling with eight text data sets available online:
Six online books accessible from Python’s Natural Language Toolkit package data sets : blake-poems, caroll-alice, shakespeare-hamlet, shakespeare-macbeth, milton-paradise and melville-mobydick.
The war-peace data set accessible from the LSTM repository above.
book1 from the calgary repo, available at: https://www.ics.uci.edu/~dan/class/267/datasets/calgary/?C=M;O=D
For each data set, we first transform all numbers to their character forms and replace all words which appear only once in the training or only in the test set into the word rare. We then remove all characters other than the 26 letters and space. The first 90% of our data is used as training set, and the remaining 10% as the test set. In addition, for computational reason, we limit our training set size to : this enforces us to reduce the training set ratio on some data sets.
We compare CHMM with other commonly used methods for character-level language models:
CHMM: We train CHMMs using the sparse EM algorithm described in Section 3. Smoothing is known to be critical for the sequence prediction task  to avoid overfitting. We regularize via early-stopping [30, 9]. A total capacity of states are allocated to characters, proportional to the number of unique n-grams that end with that character. of the training sequence is retained for validation. The training procedure consists of iterations of online EM and iterations of standard EM, where these parameters are selected using the validation set. After selecting the early stopping parameters, we retrain on the entire training and validation set.
CHMMs outperformed the other methods, including LSTM, for most of the data sets we compared against, see Table 4. Although sequence memoizers consistently returned a better model compared to n-grams, they were significantly outperformed by the CHMM on all data sets except WAR-PEACE, on which it learned a slightly better model.
Since LSTMs are the gold standard in character-level language models , we were surprised that CHMMs outperformed them. We searched over combinations of parameters of LSTM, including the number of layers and the size of the layers and were unable to find a setting that beat CHMMs on these data sets. One possibility is that LSTMs require significantly larger training data sets compared to the ones we use here.
In our tests on character-level models, we found CHMMs to be significantly more computationally efficient than LSTMs during inference. For the forward pass inference on a sequence length of , the optimal CHMM takes only seconds. In comparison, a PyTorch implementation of a one-layer LSTM with 256 hidden units takes seconds on the same GPU. Fig 3(b) shows the average time for a single step of inference in a CHMM as a function of the number of clones.
The models that we train are HMMs with a very large state space and a large number of parameters. The character-level language models have billion parameters. After training, a large fraction of the parameters go to zero, or can be set to zero without affecting the performance. The resultant transition matrix is extremely sparse with 99% of the parameters being zero, as shown in Fig 3(d) for the Milton data set. Our GPU implementation of CHMM does not exploit this sparsity of the learned transition matrix, and additional computational speed advantage over the LSTM could be gained by a sparse implementation.
Increasing the number of hidden states in a standard HMM does not always result in a better likelihood on the training data due to credit diffusion  and local minima. In contrast, in our experiments we found that CHMMs with more clones consistently produced better training likelihoods compared to the ones with fewer clones, supporting the hypothesis that the sparsity structure imposed on the emission matrix serves to alleviate the credit diffusion problem. This obviously doesn’t necessarily translate into an immediate improvement in test data performance when the training dataset is not large enough for the capacity of the model. Figure 3(c) shows the performance of a CHMM on the training data set as the capacity of the model is increased from to total states. All the results in Table 4 were obtained using the same capacity of hidden states for the CHMM. However, it is possible that their performance could be further improved by using the validation set to select an optimal capacity.
|Data set||Train set size||Test set size||cloned HMM||n-grams||SeqM||LSTM|
4.2.1 Discovering ‘word graphs’ using community detection
The transition matrix of a CHMM is a directed graph. Community detection algorithms  can be used to partition this graph into subgraphs to reveal the underlying structure of the data. For a CHMM trained on characters, the subgraphs discovered by Infomap-based community detection  correspond to words or word parts. Frequent and short words or word segments in the data get their own exclusive subgraphs, whereas infrequent and longer words share their graph representations with other words that have shared substrings. A community subgraph of a character-level CHMM can be thought of as a “generalized word”. Some of those subgraphs have deterministic word productions based on what was seen in the training data, whereas some other sub-graphs produce different word-like combinations, many of which are completely novel.
Figure 4 shows a few of the communities discovered in the CHMM transition matrix for milton-paradise, using the Infomap algorithm . The bottom-right graph produces the words “battle” and “rattle”. Interestingly, the word “rattle” does not appear in the training set, and is a generalization produced by CHMM. This generalization occurs by sharing the clone of “t” between “battle” and words like “grateful”, “generate”, etc. that have the “rat” segment in them.
4.2.2 Handling uncertainty
As a demonstration of CHMM’s ability to handle uncertainty, we consider the problem of decoding scrambled text. People have a remarkable ability to decode scrambled text, a topic that has been an internet meme (Fig 5a): “Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe” 
. CHMMs can handle this without having to train on scrambled text if the input is encoded to reflect the uncertainty in the data. We treat the evidence for each in-between character (characters that are between the start of the word and end of the word) as uncertain, with the uncertainty uniformly distributed among the in-between characters of that word. With this encoding, performing MAP inference (Viterbi decoding) decodes the correct words in most cases.
We performed two experiments to test the ability of CHMMs to decode scrambled text. In the first test, we sampled thirty 600-long character sequences from melville-mobydick, and scrambled the words while preserving their beginning and ends. Using the uncertainty encoding described above, and using the CHMM that was trained on melville-mobydick, MAP inference decoded the sequences with 99.02% word accuracy and 99.42% character accuracy. The ability of CHMM to decode scrambled text is not limited to known words. To test this, we selected 638 unseen words from a list of 3000 most frequent English words and scrambled them as above. Out of these the 89 four-letter words were decoded with 74.1% accuracy, the 99 five-letter words with 45.45%, the 144 six-letter words with 40.3%, the 172 seven-letter words with 27.9%, and the 134 eight-letter words with 21.6% accuracies. Figure 5 shows examples of sentences created from the unseen words and their CHMM decodings.
It is noteworthy that this capability is not using a separate dictionary, and not expecting the words that are being decoded to be present in the training set of the CHMM. For example, the words “surprisingly” and “complementary” do not occur in the training set of the CHMM for the Moby Dick dataset. Nevertheless, the learned representation assigns non-zero likelihoods to those words, and their scrambled versions are corrected during inference. Doing this using an LSTM would require sampling from the LSTM and using a dictionary to identify correct words, whereas CHMM’s internal representation and its ability to handle uncertainty allows it to solve this problem as a standard inference query. While this ability is similar to that of humans, we are not making the claim that humans use the same mechanism as CHMMs when they decode scrambled text.
5 Theoretical Results
presents statistical guarantees of the Baum-Welch algorithm for CHMM. It derives a finite sample L2 estimation of the difference between the successive Baum-Welch estimatesand a transition matrix , defined as a global optimum of the population likelihood – the limit of the empirical likelihood in the case of an infinite amount of data.
The L2 norm of the difference between the optimum and the Baum Welch iterates is bounded by the sum of an exponentially decaying term and a residual error. The residual is a function of the number of samples and the degree of truncation introduced in Appendix B. Consequently, Theorem 1 guarantees the local convergence of the Baum-Welch algorithm to a small ball around a global optimum of the population likelihood. The following Theorem 2 proposes a similar bound for HMM and compares the L2 estimation upper bounds for CHMM and HMM.
Theorem 2 guarantees the local convergence of the Baum-Welch algorithm for HMM. For fixed values of and , the L2 estimation upper bound is larger for HMM than for CHMM. Two major differences appear when both bounds are satisfied: (a) CHMM needs fewer iterations to converge to the local optima since its convergence rate – of the order of – is smaller than the rate for HMM and (b) CHMM converges closer to the global optima since its radius of local convergence is smaller.
Learning the parameters of even simple graphical models like HMMs can be very challenging. In this work, we used ideas from neuroscience to constrain the structure of HMMs and showed that this results in efficient learning of temporal structure. The learned models can deal with uncertainty, a critical requirement for agents operating in a noisy and uncertain world.
Although CHMMs outperformed LSTMs in many of our current settings, we would expect LSTM’s performance to be superior when trained on longer datasets. Also, we know that LSTMs can handle large vocabularies well, while this remains to be shown for the case of CHMMs. Despite these caveats, CHMMs seem a good model for sequential structure learning when the generative structure needs to be understood, when this structure is relatively sparse, and when uncertainty needs to be handled. It is also noteworthy that some of the most modern deep learning language models can only handle finite contexts and would not be able to deal with the arbitrarily long memories required by some of the sequences that a CHMM can model.
Our work has opened up several directions for future research. Although the discovered matrices are very sparse, EM has a tendency to overuse the clones. Our preliminary investigations show that imposing a sparse prior (e.g. a Dirichlet prior) on the transitions in a variational Bayesian formulation results in further sparsity. CHMMs induce generalization using a capacity bottleneck on the clones. Further generalization could be obtained using appropriate smoothing mechanisms. One fruitful avenue could be to investigate how n-gram smoothing ideas can be transferred to the CHMM setting. Since the representational ideas of CHMMs originated in biology, investigating the mapping between sequential representations in the neocortex and CHMMs would be an interesting avenue of future research.
We thank Michael Stark, Bruno Olshausen and Tom Silver for helpful comments on the manuscript.
Appendix A Greedy clone splitting
Here we show how greedy clone splitting can fail even in very simple cases. Consider a sequence formed by successively concatenating either “ab” or “(ab)”, choosing between the two at random. An example sequence would be “abab(ab)ab(ab)(ab)abab”. Sequences generated in this fashion can be modeled exactly as a CHMM with one clone for the symbols ( and ), and two clones for the symbols a and b (one clone activates inside the parenthesis, the other outside). A model with a single clone for every symbol will obviously have a lower likelihood given the data.
Using Baum-Welch, the optimal model is readily found from training data when enough clones are available. Now consider a greedy clone-splitting mechanism that starts with one clone per symbol and splits one clone at a time when doing so increases the likelihood of the model:
Splitting a’s clone cannot increase the likelihood of the model, since the single existing clone already predicts the single clone of b with probability 1 and is always correct in doing so.
Splitting b’s clone has the symmetric problem: even though having a different clone activate inside and outside the parenthesis would increase its ability to predict the next symbol (for instance, it should always be ) for the clone inside the parenthesis), since it is always preceded by the same clone of a, it is not possible to disambiguate both clones of b, and therefore this split does not increase the likelihood of the model either.
In this toy example, the only move that results in a likelihood increase is one that splits the clones of a and b simultaneously, and not greedily.
Appendix B Theoretical guarantees
b.1 Upper bounds for CHMM
We consider , a sequence with corresponding latent variables (clones) drawn from a clone HMM with transition matrix . The clones takes their values in and satisfy the following assumption:
The Markov chain associated to the latent sequence is aperiodic, recurrent, has unique stationary distribution , and is revertible. In addition, we can fix a constant such that:
where is the set of admissible transition matrices. The empirical operator for clone HMM can be expressed as:
Given an estimate of the transition matrix, the E-step of Baum-Welch algorithm for clone HMM derives the empirical operator by computing the probabilities . The M-step considers the maximizer operator defined by . That is, the algorithm repeats until convergence: .
In the case of an infinite amount of data, the empirical operator is equivalent to the population function defined as:
When Assumption 1 holds, the following Theorem 3 assess the existence of . This existence is not guaranteed: the quantity depends upon since the samples are not i.i.d.. In addition, we define the population maximizer by the self-consistency condition:
where the operator extends to the population function . corresponds to a global maximizer of the population likelihood, which we would like to recover via the Baum-Welch algorithm. However, since the samples are not i.i.d., we cannot directly analyze the asymptotic behavior of the population function . To this end, we consider a sequence of latent clones drawn from the stationary distribution of the Markov chain. For an integer , we introduce the truncated version of the operator :
The latent variables are conditioned on a window centered around the index. Because of the stationarity of the sequence of latent variables, the expectation does not depend upon the sample size . Consequently, we define the truncated population function :
is properly defined. Theorem 3 proves that under Assumption 1 and 2 (defined below), uniformly approximates the population function . In addition, we extend the operators and to their respective truncated versions:
As the truncation level increases, the operator converges to . Similarly, as increases, this operator converges to . We control their respective convergences with the two following parameters:
Let . Let be the smallest scalar such that:
In addition, let be the smallest scalar such that:
We finally make the following assumption.
For all , the truncated population function is strictly concave. That is,we can fix such that:
In addition for all , the gradient of is Lipschitz with constant . That is:
Finally, it holds .
We are now ready to present our main Theorem 3. It states that the difference between the th iterate of the Baum-Welch algorithm and the global optimum of the population likelihood is the sum of a geometrically decaying term and a residual error – function of the parameters and .
Under Assumption 1, the population function is well-defined.
The proof is presented in Appendix B.3.
b.2 Comparing the upper bounds for CHMM and HMM
We derive herein an upper bound for HMM similar to the one for CHMM in Equation (7) and compare the two upper bounds. Let be the sparse emission matrix associated with a CHMM – defined such that each clone only emits its observation. The normalized log-likelihood of the complete data is now lower-bounded by the empirical operator defined for any couple of estimates by:
By noting that the first term depends only depends upon and the second term only upon , we write:
We note the set of emission matrices. The EM algorithm considers the maximizer operator and repeats until convergence:
where and denote the maximizer operators respectively associated with and . Similarly to the CHMM case, we define the population versions , of the operators introduced, their truncated versions , and the truncated population versions , . In particular, the population versions are only defined on a set of emission matrices with lowest entry lower bounded by some . This does not hold for their truncated versions as the truncated operators do not depend upon . In addition, we introduce the operators corresponding to the decomposition with respect to the transition and emission matrices as in Equation (8). We consider a global population maximizer and assume that as defined in Section B.1. We finally note , and extend Assumption 2 as follows:
For all , the truncated population function is strictly concave with respect to each of its component. That is,we can fix such that:
In addition for all , the gradient of is Lipschitz with respect to each variable:
We define two similar conditions for – with respective Lipschitz constants and .
We note and and assume:
We finally extend Definition 1 as follows:
Let . Let be the smallest scalar such that:
In addition, let be the smallest scalar such that:
We derive the following theorem in the case of HMM. The proof is presented in Appendix B.3.
Finally, the sequence of transition matrices derived by the Baum-Welch algorithm: , , satisfies with probability at least :
We define to recover the bound in Theorem 2. It holds:
b.3 Proof of Theorem 3
We divide our proof in 3 steps.
Step 1: We first prove the existence of the population function defined as: To do so, we prove that the sequence of operators satisfies the Cauchy property:
where . Consequently, since , it holds :
As a result we have:
Using the triangle inequality:
we conclude that if , the sequence