1 Introduction
Sequence modeling is a fundamental realworld problem with a wide range of applications. Recurrent neural networks (RNNs) are currently preferred in sequence prediction tasks due to their ability to model longterm and variable order dependencies. However, RNNs have disadvantages in several applications because of their inability to natively handle uncertainty, and because of the inscrutable internal representations.
Probabilistic sequence models like Hidden Markov Models (HMM) have the advantage of more interpretable representations and the ability to handle uncertainty. Although overcomplete HMMs with many more hidden states compared to the observed states can, in theory, model longterm temporal dependencies [24], training HMMs is challenging due to credit diffusion [3]
. For this reason, simpler and inflexible ngram models are preferred to HMMs for tasks like language modeling. Tensor decomposition methods
[1] have been suggested for the learning of HMMs in order to overcome the credit diffusion problem, but current methods are not applicable to the overcomplete setting where the fullrank requirements on the transition and emission matrices are not fulfilled. Recently there has been renewed interest in the topic of training overcomplete HMMs for higherorder dependencies with the expectation that sparsity structures could potentially alleviate the credit diffusion problem [24].In this paper we demonstrate that a particular sparsity structure on the emission matrix can help HMMs learn higherorder temporal structure using the standard ExpectationMaximization algorithms
[27] (BaumWelch) and its online variants. We call this model cloned HMM (CHMM) because the sparsity structure deterministically maps multiple hidden states (clones) to the same emission state, whereas the emission matrix in a standard HMM is dense and allows for any hidden state to emit any emission state. The basic idea originated in a popular compression method called dynamic Markov coding (DMC) [7], where the temporal dependence in a first order Markov chain is gradually increased by ‘cloning’ the states Fig
1(a). The same idea has been rediscovered several times in different domains [11, 28, 8, 20]. Cloned hidden states are conjectured to be behind the higherorder sequence representations in bird songs [17], and specific aspects of the laminar and columnar organization of cortical circuits are postulated to reflect this cloning structure [11, 10] (Fig 1(g)). Researchers have attempted to learn such structures, or partially observable Markov models, using greedy state splitting algorithms that start from a firstorder Markov model [7, 5, 20, 28]. Instead, we recognize that the cloning structure can be initialized as a sparse emission matrix of the HMM. The HMM allocates a maximum split capacity for each symbol up front and lets the learning algorithm decide how to utilize that capacity. This results in more flexibility compared to greedy statesplitting approaches (see Appendix A).We test the efficacy of CHMMs by using them to learn characterlevel models of English language using a variety of English text datasets. Our experiments show that CHMMs outperform standard HMMs, ngrams, sequence memoizers, and LSTMRNNs on several of these tasks while being efficient to train and infer. The models learned by CHMM are overcomplete by a factor of , are extremely sparse, and can be thought of as learning the underlying structure of sequence generation. Unlike ngrams, CHMMs can model arbitrarily long temporal dependencies without having to specify the order up front. When a predictive context contains “holes” of irrelevant symbols^{1}^{1}1The term “holes” has been used [15] to refer to parts of a context that are irrelevant for prediction. For instance, we can consider a data set in which the context I am going will predict to regardless of the existence of additional intermediate symbols, so that other contexts like I kgt am mnbh going or I mqlr am pzb going should result in the same predictive distribution. We would refer to the groups of inserted symbols kgt, mnbh as “holes”. that do not affect prediction, ngrams and sequence memoizers [26] will need exponentially (in the number of holes) more training data to be able to use the context for prediction, whereas CHMMs can handle such sequences efficiently and flexibly. Being generative models, they can model uncertainty natively, and can answer queries that were not part of the training objective. The learned transition matrix of a CHMM is a sparse graph of the generation process, and structural analysis of this graph may reveal communities that can help with hierarchical modeling and planning.
The rest of this paper is organized as follows: In Section 2 we introduce a toy example to illustrate the representational properties of CHMM, and discuss related work. In Section 3 we define the CHMM model and briefly discuss the variations of the EM algorithm we use to learn its parameters. Our results (Section 4) are divided into two subsections. In the first subsection we use synthetic data with known optimal solutions to demonstrate the properties of CHMMs in comparison with regular HMMs and LSTMRNNs. In the second subsection we use a variety of English text datasets to compare CHMMs with ngrams, sequence memoizers, and LSTMRNNs. We conclude in Section 6 with a discussion of future work in the context of our findings.
2 Modeling a toy sequence
Consider the finite state machine^{2}^{2}2
FSMs can be classified in nondeterministic finite automata (NFA) and deterministic finite automata (DFA). While it is possible to convert between these two representations, the first one is exponentially more compact for some FSMs. The formulation presented here applies to both categories, thus allowing the most compact representation to be used.
(FSM) depicted in Fig. 1(e). Each of the 24 nodes corresponds to a different internal state. Arrows are labeled with the probability of transition between two given states. The numbers inside the nodes indicate the symbol that the machine accepts when transitioning out of that node. E.g., when at a node labeled with a 7 (note that there are two internal states with this label), the machine will only accept a 7 as the input, and it will transition with equal probability to a state that accepts either an 8 or a 9. The output sequence of this FSM has interesting longterm structure. For example, observing a 0 guarantees a 1 appearing eight timesteps later. Similarly, a 2 will predict a 3 eight steps later with absolute certainty, despite the stochastic transitions in between. The task at hand is to discover this machine (or an equivalent one) based on a longenough sequence generated by it.
We could model this sequence as a generic FSM, with the corresponding Bayesian network (BN) shown in Fig.
1(b), where is the hidden state of the machine at time , and is the observation when transitioning out of towards . In particular, it can be modelled when the dimension of the hidden space (number of different discrete values that it can take) is of size 24, which is equal to the number of nodes ^{3}^{3}3We can exploit some properties of the sequence considered herein to slightly restrict the hidden space size.. We would then need to learn the conditional probabilities that each arrow represents. We only need to learn one time slice, since they are all identical.Another way to encode the FSM as a BN is depicted in Fig. 1c. The additional arrows with respect to the FSM model increase its expressiveness and now we can encode our toy sequence with only 2 hidden states, instead of 24. Since now we can also refer to the previous emission and state to decide the current emission, the hidden state no longer needs to remember in which of the 24 hidden states it is, but only in which of the 2 copies (the top or bottom row of Fig. 1(e) that emit a given symbol) it is. We refer to each hidden state that can take as a different clone of the emission .
It is possible to perform exact inference on this model, as well as to learn the conditional probabilities that define it using expectation maximization (EM). To illustrate this, note that we can collect and write the model as in Fig. 1(d). This graphical model is a hidden Markov model with a fixed emission matrix, since given how we created , the emissions follow deterministically from the state. Now each group of hidden states that result in the same emission is a group of clones, since they look the same when emitted. For this reason, we call it the cloned HMM (CHMM). The number of clones associated to each emission is fixed, but can be different per emission. The transition matrix encodes the conditionals. This means that standard BaumWelch (minus the learning of the emission matrix) can be used to recover the above FSM when using enough clones (in theory only two clones are required, but in practice more clones can help escape local minima).
To be completely clear, Fig 1(c) and Fig 1(d) are two representations of the same model. The semantics of the hidden variable differs, however: in this example, the hidden states in Fig 1(c) have cardinality 2, whereas the hidden states Fig 1(d) have cardinality 24, since the latter collect both and .
Fig 1(f) illustrates how the different states in an HMM are tied to the same emission state. On the left is the graphical model for an HMM. On the right we show the details of the internals within each node. This CHMM is overcomplete with three times as many hidden states (shown as white circles within the green oval) as emission states (shown as white circles within the blue oval). The emission matrix, comprising the arrows from the hidden states to the emission states, shows the cloned structure with multiple hidden states mapping to the same emission state. This matrix is initialized deterministically and does not change during learning. The transition matrix, encapsulated in the arrow between the green ovals, is initialized randomly and is modified during learning. The set of hidden states that deterministically emit the same observation can be considered as “clones” of each other and of the observed symbol. The different clones, through the learning of the transition matrix, will come to represent the different temporal contexts in which a particular symbol occurs in the training data. The long range dependencies get embedded in the hidden state and are propagated at arbitrary distances, thus becoming Markovian in that hidden state.
Fig 2(a) shows the learned transition matrix when a CHMM is used to learn a model for the sequence generated by the toy FSM. The CHMM was initialized with just two clones for each symbol and it was able to exactly recover the underlying sequence generation structure and achieve the best possible predictive performance. Observe that the number of clones do not correspond to the temporal order: this example has eight timessteps of temporal dependency, but this is captured using only two clones. In contrast, a regular overcomplete HMM failed to capture the longterm dependence corresponding to the generating FSM, see Fig 2(b). The problem here is not that the overcomplete HMM is insufficient, but the opposite, its extra flexibility makes it converge to a bad local minimum.
2.1 Related Work
Ngrams, when combined with KneserNey smoothing [2], are one of the strongest baselines for sequence modeling, being the tool of choice in many settings. Their weakness, however, is that they base their predictions exclusively on the statistics gathered over a finite context. When the required context is small this is not a problem, but since the model size grows exponentially with the length of the context, this quickly results in a) storage issues and b) insufficient data to fit a good model. For instance, in the above FSM, we can increase the length of the random sequence from 8 to an arbitrarily large number and the optimal model for the sequence will still be expressible as a CHMM with 2 clones. On the other hand, the context of the ngrams will grow arbitrarily large while memorizing all the exponentially many random sequences that are observed, becoming unfeasible at some point. Sequence memoizers [26] provide better smoothing mechanisms than ngrams, but share the same shortcoming as ngrams in not being able to efficiently represent sequences with “holes”.
The idea of cloning the states of a Markov chain to create higherorder representations can be traced back to a popular compression algorithm [7]. Clones are created by identifying the states in a lowerorder model that need to be split, and then relearning the counts. This basic idea has been rediscovered and elaborated. multiple times in different fields [18, 11, 28, 13, 17]. Notably, cloned state representations are postulated to be the underlying mechanisms for representing sequential structure in bird songs [13, 17]. While splitting states is a viable method for learning simple sequential structure, it encounters problems with complex sequences (see Appendix A). Partially observable Markov Models (POMMs) [5] utilize the same idea of a cloned structure and recognize their connection to HMMs, but utilized a greedy clone splitting strategy to learn the HMM.
It is well recognized that constraining an HMM appropriately can alleviate the local minima problem of EM [23, 21]. The only difference between the CHMM and the HMM is that in the CHMM the emission matrix is initialized to a particular deterministic sparse matrix which does not change with further learning. Even though this makes the CHMM less expressive compared to HMMs^{4}^{4}4For the same number of hidden states. An HMM can always be expressed as a CHMM by increasing the number of hidden states and embedding the emission matrix of the HMM in the transition matrix of the CHMM., it also makes it less prone to finding poor local minima. We show in Section 4 that the above FSM cannot be learned as accurately by an HMM as it is by a CHMM. Even more importantly, knowing the emission matrix a priori results in important computational savings, as we will show in Section 3, making CHMMs more widely applicable. Overcomplete HMMs are appealing because of their ability to represent higherorder information, and [24] anticipates that specific sparsity structures on the transition matrix of an overcomplete HMM can lead to efficient learning. However, restricting the transition matrix to particular classes like lefttoright or cyclic can limit the expressive power of the model and limit its application. One, perhaps surprising, insight of CHMM is that imposing the sparsity structure on the emission matrix leads to a sparse transition matrix where the sparsity pattern is learned from data.
RNNs, in their different incarnations, keep a state that is propagated while modeling the sequence, which means that they can properly model this type of sequence. The CHMM can be seen as the fully generative version of an RNN, thus being able to deal with all sorts of probabilistic queries in closed form.
3 Learning a CHMM
Since the CHMM is just a normal HMM with additional restrictions, the same equations for learning HMMs apply, with some simplifications. These simplifications result in significant savings in compute and memory.
The probability of a sequence under this model can be expressed as the sum over the (exponentially many) hidden sequences that are compatible with the observed sequence. The probability of each hidden sequence is straightforward to compute using the chain rule:
(1) 
where means that the summation is only over the hidden values of that emit . There is a fixed, known association between hidden states and emissions, with each hidden state being associated to exclusively one emission. As noted in the previous section, the different hidden states associated to one emission are called the clones of that emission.
The parameters of this model are the prior probabilities, collected in the vector
, such that , the transition probabilities , such that , and the number of clones allocated to each emission. Matrix is row stochastic, i.e, .A simple and effective way to decide how to allocate clones to each symbol is to do so proportionally to the number of times that the symbol appears in the data. The intuition that more frequent symbols need more clones is obvious when we go to the extreme case of memorizing a sequence: we need exactly as many states as total symbols appear in the sequence, and as many clones per symbol as occurrences of that symbol appear in the sequence.
ExpectationMaximization for CHMM:
If the hidden states are ordered in such a way that all the clones of the first emission are placed first, all the clones of the second emission are placed second, etc, we can divide the transition matrix in submatrices, in which each submatrix is relevant for the transition between a pair of symbols. I.e., the transition matrix can be broken down into smaller submatrices that contain the transition probabilities such that and . The subindices range in , where is the total number of emitted symbols. I.e., they identify a specific set of clones. Arranging the in a grid according to their subindices recovers the full transition matrix . Similarly, the prior probability vector can be broken in subvectors that contain the prior probabilities such that . Stacking all the subvectors, we recover . The standard BaumWelch equations can then be expressed in a simpler form:
EStep
MStep
where the symbol is the elementwise product (with broadcasting where needed) and is the elementwise division (with broadcasting where needed). All vectors are column vectors, where is the number of clones per emission^{5}^{5}5We use a constant number of clones per emission for simplicity here, but this number can be selected independently for each emission..
Online EM for CHMM:
It is possible to develop an adaptive, online version of the above algorithm by splitting the sequence in contiguous batches and performing EM steps on each batch successively. We store our running statistic after processing batch in , and compute from it. We have
where is a memory parameter and refers to the time steps contained in batch . If we set , then would coincide exactly with the transition matrix from the previous section. For smaller values of , the expected counts are weighed using an exponential window^{6}^{6}6Note that proper normalization of the exponential window is unnecessary, since it will cancel when computing ., thus giving more weight to the more recent counts. To obtain an online formulation, we simply express recursively
so that the expected counts of the last observed batch are incorporated in our running statistic. The statistics of batch are computed from the E step over that batch, which uses the transition matrix
, obtained after processing the previous batch. This allows to learn from arbitrarily long sequences, and to adapt to changes in the statistics if those happen over time. If we are interested in online learning but not adaptivity, we can feed all the batches from the same sequence multiple times (analogous to the epochs in an NN), so that learning can continue to make progress after one full pass through the sequence.
Computational savings:
In a standard HMM with hidden states in which the emission matrix is stochastic, the computational cost for a sequence of length is and the required storage is , including the space for the transition matrix and the forwardbackward messages.
In contrast, the CHMM exploits the known sparse emission matrix, which results in savings. With hidden states and clones per emission the computational cost is and the memory requirement is worst case , with additional savings for every pair of symbols that never appear consecutively in the training sequence (since the corresponding submatrix will be zero and does not need to be stored). Space requirements can be improved even further by using the online version of EM described above.
Since where is the total number of symbols, an increase in alphabet size does not increase the computational cost of the CHMM, whereas it does increase the computational cost of the HMM. Since the transition matrix can be stored sparsely, storage cost may or may not be increased depending on the structure of the sequence.
4 Empirical Results
We evaluate CHMMs on two classes of data sets that test a model’s ability to learn longterm temporal structure: 1) Synthetic data sets, which serve to illustrate the model’s representational properties, 2) Characterlevel language models, which demonstrate that CHMMs can learn on real data distributions. We show that CHMMs outperform LSTMRNNs, ngrams and sequence memoizers in both the language modeling and synthetic data tasks.
The metric by which these methods are compared is bits per symbol —referred to as BPS— which reflects the compressive capacity of a model. For a sequence and a set of parameters , the BPS is defined as:
4.1 Synthetic data sets
We test CHMMs on two synthetic data sets: the toy example described in Section 2 and a sequence that opens and closes parenthesis and brackets in a syntactically correct way (see below). CHMMs outperformed standard overcomplete HMMs on both these tasks. On a more challenging version of these synthetic tasks, CHMMs outperformed LSTMs. Our experiments additionally reveal the computational advantage of using the online EM training procedure.
4.1.1 Toy example
Data for the toy example is generated from a generalized version of the FSM presented in Section 2. In that FSM, node deterministically led to node through a sequence of stochastically switching nodes in between. Node deterministically led to node in a similar manner. We keep the same structure but add the parameter to control the separation between and . We define and as signal elements, which are separated by noise symbols, with values in . We can interpret as the number of “holes” in the context. The example of Section 2 corresponds to using .
Train and test data are generated by concatenating stochastic sequences of length . Each stochastic noise sequence of length is emitted by repeating the following loop – indexed over . For , the symbol and is emitted with probability. Symbol is then deterministically emitted and followed by . Finally, elements and are emitted with a probability each. The last two noise elements ( and ) are connected with one of the signal elements or , depending on whether the last signal element emitted was or . As discussed in Section 2, this data can be drawn from a cloned HMM with clone for all signal elements and clones for the noise elements. We consider different training set sizes, as specified in Table 1.
Note that even though in theory 2 clones are enough to model this sequence, in practice the learning algorithm might not find the optimal 2 clone model, i.e., it might converge to a bad local minimum. Increasing the number of available clones (thus using an overflexible model for this sequence) is an effective way to get around the local minima problem.
Methods compared:
We compare the performance of a CHMM (with standard and online EM training) with an overcomplete HMM.

CHMM trained with EM: For , the CHMM is initialized with the optimal number of clones for each element. We allocate an extra clone to each noise element for , and an additional one for . The BaumWelch algorithm presented in Section 3 is run for a maximum number of iterations. We stop the algorithm when the relative gain in train likelihood between 2 iterations of the algorithm is lower than .

CHMM trained with online EM: This is the same model as above, but we use the online EM algorithm. We split our training data into batches of size and consider a discount factor of .

HMM: This is an overcomplete HMM with same number of hidden states than the cloned HMMs above: for , for and for . We use BaumWelch algorithm from the hmmlearn package^{7}^{7}7Available in scikit learn [19].. We use the same transition matrix initialization (random stochastic), number of iterations and convergence criterion as for the CHMMs.
The results of the comparison tabulated in Table 1 show that CHMM outperforms HMM on this data set. Training with online EM results in better models and faster convergence compared to standard EM. CHMMs trained with online EM is able to achieve almost optimal performance on and . For , it reaches the optimal performance for a portion of the random initializations.
CHMM with EM  CHMM with online EM  HMM  Optimal  

Training size  Max clones  BPS  N iter  BPS  N iter  BPS  BPS  Clones  
2  11250  2  0.502 (0.002)  71.4 (13.6)  0.502 (0.001)  11.9 (0.7)  0.657 (0.000)  0.500  2 
3  33750  2  0.556 (0.000)  42.9 (73.2)  0.446 (0.002)  94.1 (37.1)  0.556 (0.000)  0.444  2 
4  90000  3  0.500 (0.000)  50.4 (27.5)  0.418 (0.000)  284.4 (212.4)  0.500 (0.000)  0.417  2 
5  225000  4  0.467 (0.000)  1000.0 (0.0)  0.428 (0.033)  732.0 (240.0)  0.467 (0.000)  0.400  2 
. BPS values are averages over 10 runs and standard deviations are noted in parentheses.
4.1.2 Bracket example
A sequence in this data set consists of brackets and parenthesis [, (, ), ] forming syntactically valid sequence. The syntax is the expected: they must appear in openingclosing pairs, with both elements matching in type and nesting. Sentences are terminated with the symbol . As an example, a syntactically correct sequence with maximum nesting is: .
As in the above example, we compared two variants of CHMM training with an overcomplete HMM. We could only train the HMM for a few tens of iterations – which makes it comparable to both variants of CHMM – because of convergence issues. CHMM achieves the optimal BPS for with batch EM and online EM, and for only with online EM. We additionally found that for , online EM can solve the problem with clones whereas batch EM needs at least clones. An overcomplete HMM is unable to achieve these results in either setting.
CHMM with EM  CHMM with online EM  HMM  Optimal  

Training size  Max clones  BPS  N iter  BPS  N iter  BPS  BPS  Clones  
2  50000  6  1.053 (0.003)  37.2 (1.0)  1.054 (0.02)  5.6 (0.5)  1.178 (0.004)  1.053 (0.003)  3 
3  50000  20  1.172 (0.003)  69.2 (1.3)  1.134 (0.002)  26.7 (2.1)  1.384 (0.005)  1.130 (0.003)  7 
3  50000  100  1.191 (0.004)  47.6 (12.4)  1.240 (0.002)  99.9 (0.3)  1.500 (0.006)  1.161 (0.004)  15 
4.1.3 Comparison with LSTM
We introduce two minor modifications of the previous examples that increases the difficulty of the data sets. Our train and test distributions are now a function of some parameter . We show that CHMM is still able to extract relevant information from this more complex environment that allows it to generalize well. In comparison, we show that an LSTM is unable to recover such accurate information and leads to worse predictive performance.
Toy example:
We fix and introduce a parameter which changes the generative model: the data is now generated from the FSM in Fig 3(a). The length of the noise is now stochastic, and its expected length varies with : a large will favor longer sequences – note that the results described earlier consider . We now fix , and generate the train and test sets as mixture of sequences with and sequences with , where . Our train and test sequences have the same distribution and are composed of a large majority of long sequences.
Bracket example:
We consider the bracket example with maximum nesting . We define and as the respective probabilities of starting a sentence with a parenthesis and a bracket. Similarly, we introduce the probabilities , and for the first degree of nesting – a parenthesis or a bracket has already been opened – of respectively opening a parenthesis, opening a bracket or closing the previously opened symbol.
Our previous setting considered and . We generate the training sequence with and , and . When generating the test sequence, we set , , and .
Results:
The following Table 3 compares the methods:

CHMM: Cloned HMM described trained with online EM.

LSTM: We use the LSTM from the popular repository https://github.com/karpathy/charrnn trained with default parameters: the number of layers is 2 and the hidden size 100. We set the maximum number of epochs to 400. We generate an additional validation set with same distribution as the test set, which is used to assess convergence.
CHMM with online EM  LSTM  

Data set  Avg training set size  Max clones  BPS  N iter  BPS  N iter 
Toy  59559  2  (0.003)  91.9 (53.3)  0.535 (0.010)  400 
Bracket  50000  4  (0.015)  15.9 (13.8)  1.211 (0.012)  400 
4.2 Real data sets
We evaluated CHMMs on characterlevel language modeling with eight text data sets available online:

Six online books accessible from Python’s Natural Language Toolkit package data sets [4]: blakepoems, carollalice, shakespearehamlet, shakespearemacbeth, miltonparadise and melvillemobydick.

The warpeace data set accessible from the LSTM repository above.

book1 from the calgary repo, available at: https://www.ics.uci.edu/~dan/class/267/datasets/calgary/?C=M;O=D
For each data set, we first transform all numbers to their character forms and replace all words which appear only once in the training or only in the test set into the word rare. We then remove all characters other than the 26 letters and space. The first 90% of our data is used as training set, and the remaining 10% as the test set. In addition, for computational reason, we limit our training set size to : this enforces us to reduce the training set ratio on some data sets.
Methods compared:
We compare CHMM with other commonly used methods for characterlevel language models:

CHMM: We train CHMMs using the sparse EM algorithm described in Section 3. Smoothing is known to be critical for the sequence prediction task [6] to avoid overfitting. We regularize via earlystopping [30, 9]. A total capacity of states are allocated to characters, proportional to the number of unique ngrams that end with that character. of the training sequence is retained for validation. The training procedure consists of iterations of online EM and iterations of standard EM, where these parameters are selected using the validation set. After selecting the early stopping parameters, we retrain on the entire training and validation set.

SeqM: Hierarchical Bayesian model described in [26]. The authors released an online implementation accessible at https://github.com/jgasthaus/libPLUMP. We tune the smoothing parameter over the same validation dataset used for the CHMM and then retrain on the entire data set.

ngrams: We use ngrams with KneserNey smoothing [14] as implemented in the Kyoto Language Modeling Toolkit [16]. The code is accessible at https://github.com/neubig/kylm

LSTM:
We compared against two LSTM implementations and report the best of the two results for each data set. The first is Karpathy’s charrnn mentioned above, and the second is a PyTorch version at
https://github.com/spro/charrnn.pytorch.
Results:
CHMMs outperformed the other methods, including LSTM, for most of the data sets we compared against, see Table 4. Although sequence memoizers consistently returned a better model compared to ngrams, they were significantly outperformed by the CHMM on all data sets except WARPEACE, on which it learned a slightly better model.
Since LSTMs are the gold standard in characterlevel language models [12], we were surprised that CHMMs outperformed them. We searched over combinations of parameters of LSTM, including the number of layers and the size of the layers and were unable to find a setting that beat CHMMs on these data sets. One possibility is that LSTMs require significantly larger training data sets compared to the ones we use here.
In our tests on characterlevel models, we found CHMMs to be significantly more computationally efficient than LSTMs during inference. For the forward pass inference on a sequence length of , the optimal CHMM takes only seconds. In comparison, a PyTorch implementation of a onelayer LSTM with 256 hidden units takes seconds on the same GPU. Fig 3(b) shows the average time for a single step of inference in a CHMM as a function of the number of clones.
The models that we train are HMMs with a very large state space and a large number of parameters. The characterlevel language models have billion parameters. After training, a large fraction of the parameters go to zero, or can be set to zero without affecting the performance. The resultant transition matrix is extremely sparse with 99% of the parameters being zero, as shown in Fig 3(d) for the Milton data set. Our GPU implementation of CHMM does not exploit this sparsity of the learned transition matrix, and additional computational speed advantage over the LSTM could be gained by a sparse implementation.
Increasing the number of hidden states in a standard HMM does not always result in a better likelihood on the training data due to credit diffusion [3] and local minima. In contrast, in our experiments we found that CHMMs with more clones consistently produced better training likelihoods compared to the ones with fewer clones, supporting the hypothesis that the sparsity structure imposed on the emission matrix serves to alleviate the credit diffusion problem. This obviously doesn’t necessarily translate into an immediate improvement in test data performance when the training dataset is not large enough for the capacity of the model. Figure 3(c) shows the performance of a CHMM on the training data set as the capacity of the model is increased from to total states. All the results in Table 4 were obtained using the same capacity of hidden states for the CHMM. However, it is possible that their performance could be further improved by using the validation set to select an optimal capacity.
Data set  Train set size  Test set size  cloned HMM  ngrams  SeqM  LSTM 

blakepoems  29912  3300  1.75  1.71  1.68  
calgary  638677  7116  1.72  1.64  1.67  
carollalice  118931  13063  1.61  1.57  1.58  
shakespearehamlet  130101  14332  1.72  1.69  1.68  
shakespearemacbeth  79646  8824  1.79  1.77  1.74  
miltonparadise  382942  42297  1.83  1.78  1.78  
melvillemobydick  750000  387864  1.81  1.73  1.76  
warpeace  750000  2237883  1.59  1.65  1.62 
4.2.1 Discovering ‘word graphs’ using community detection
The transition matrix of a CHMM is a directed graph. Community detection algorithms [22] can be used to partition this graph into subgraphs to reveal the underlying structure of the data. For a CHMM trained on characters, the subgraphs discovered by Infomapbased community detection [22] correspond to words or word parts. Frequent and short words or word segments in the data get their own exclusive subgraphs, whereas infrequent and longer words share their graph representations with other words that have shared substrings. A community subgraph of a characterlevel CHMM can be thought of as a “generalized word”. Some of those subgraphs have deterministic word productions based on what was seen in the training data, whereas some other subgraphs produce different wordlike combinations, many of which are completely novel.
Figure 4 shows a few of the communities discovered in the CHMM transition matrix for miltonparadise, using the Infomap algorithm [22]. The bottomright graph produces the words “battle” and “rattle”. Interestingly, the word “rattle” does not appear in the training set, and is a generalization produced by CHMM. This generalization occurs by sharing the clone of “t” between “battle” and words like “grateful”, “generate”, etc. that have the “rat” segment in them.
4.2.2 Handling uncertainty
As a demonstration of CHMM’s ability to handle uncertainty, we consider the problem of decoding scrambled text. People have a remarkable ability to decode scrambled text, a topic that has been an internet meme (Fig 5a): “Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe” [25]
. CHMMs can handle this without having to train on scrambled text if the input is encoded to reflect the uncertainty in the data. We treat the evidence for each inbetween character (characters that are between the start of the word and end of the word) as uncertain, with the uncertainty uniformly distributed among the inbetween characters of that word. With this encoding, performing MAP inference (Viterbi decoding) decodes the correct words in most cases.
We performed two experiments to test the ability of CHMMs to decode scrambled text. In the first test, we sampled thirty 600long character sequences from melvillemobydick, and scrambled the words while preserving their beginning and ends. Using the uncertainty encoding described above, and using the CHMM that was trained on melvillemobydick, MAP inference decoded the sequences with 99.02% word accuracy and 99.42% character accuracy. The ability of CHMM to decode scrambled text is not limited to known words. To test this, we selected 638 unseen words from a list of 3000 most frequent English words and scrambled them as above. Out of these the 89 fourletter words were decoded with 74.1% accuracy, the 99 fiveletter words with 45.45%, the 144 sixletter words with 40.3%, the 172 sevenletter words with 27.9%, and the 134 eightletter words with 21.6% accuracies. Figure 5 shows examples of sentences created from the unseen words and their CHMM decodings.
It is noteworthy that this capability is not using a separate dictionary, and not expecting the words that are being decoded to be present in the training set of the CHMM. For example, the words “surprisingly” and “complementary” do not occur in the training set of the CHMM for the Moby Dick dataset. Nevertheless, the learned representation assigns nonzero likelihoods to those words, and their scrambled versions are corrected during inference. Doing this using an LSTM would require sampling from the LSTM and using a dictionary to identify correct words, whereas CHMM’s internal representation and its ability to handle uncertainty allows it to solve this problem as a standard inference query. While this ability is similar to that of humans, we are not making the claim that humans use the same mechanism as CHMMs when they decode scrambled text.
5 Theoretical Results
This section summarizes our theoretical findings for CHMM: we refer to Appendix B for a complete analysis. In particular, our next Theorem 1
presents statistical guarantees of the BaumWelch algorithm for CHMM. It derives a finite sample L2 estimation of the difference between the successive BaumWelch estimates
and a transition matrix , defined as a global optimum of the population likelihood – the limit of the empirical likelihood in the case of an infinite amount of data.Theorem 1
The L2 norm of the difference between the optimum and the Baum Welch iterates is bounded by the sum of an exponentially decaying term and a residual error. The residual is a function of the number of samples and the degree of truncation introduced in Appendix B. Consequently, Theorem 1 guarantees the local convergence of the BaumWelch algorithm to a small ball around a global optimum of the population likelihood. The following Theorem 2 proposes a similar bound for HMM and compares the L2 estimation upper bounds for CHMM and HMM.
Theorem 2
Theorem 2 guarantees the local convergence of the BaumWelch algorithm for HMM. For fixed values of and , the L2 estimation upper bound is larger for HMM than for CHMM. Two major differences appear when both bounds are satisfied: (a) CHMM needs fewer iterations to converge to the local optima since its convergence rate – of the order of – is smaller than the rate for HMM and (b) CHMM converges closer to the global optima since its radius of local convergence is smaller.
6 Discussion
Learning the parameters of even simple graphical models like HMMs can be very challenging. In this work, we used ideas from neuroscience to constrain the structure of HMMs and showed that this results in efficient learning of temporal structure. The learned models can deal with uncertainty, a critical requirement for agents operating in a noisy and uncertain world.
Although CHMMs outperformed LSTMs in many of our current settings, we would expect LSTM’s performance to be superior when trained on longer datasets. Also, we know that LSTMs can handle large vocabularies well, while this remains to be shown for the case of CHMMs. Despite these caveats, CHMMs seem a good model for sequential structure learning when the generative structure needs to be understood, when this structure is relatively sparse, and when uncertainty needs to be handled. It is also noteworthy that some of the most modern deep learning language models can only handle finite contexts and would not be able to deal with the arbitrarily long memories required by some of the sequences that a CHMM can model.
Our work has opened up several directions for future research. Although the discovered matrices are very sparse, EM has a tendency to overuse the clones. Our preliminary investigations show that imposing a sparse prior (e.g. a Dirichlet prior) on the transitions in a variational Bayesian formulation results in further sparsity. CHMMs induce generalization using a capacity bottleneck on the clones. Further generalization could be obtained using appropriate smoothing mechanisms. One fruitful avenue could be to investigate how ngram smoothing ideas can be transferred to the CHMM setting. Since the representational ideas of CHMMs originated in biology, investigating the mapping between sequential representations in the neocortex and CHMMs would be an interesting avenue of future research.
Acknowledgments
We thank Michael Stark, Bruno Olshausen and Tom Silver for helpful comments on the manuscript.
Appendix
Appendix A Greedy clone splitting
Here we show how greedy clone splitting can fail even in very simple cases. Consider a sequence formed by successively concatenating either “ab” or “(ab)”, choosing between the two at random. An example sequence would be “abab(ab)ab(ab)(ab)abab”. Sequences generated in this fashion can be modeled exactly as a CHMM with one clone for the symbols ( and ), and two clones for the symbols a and b (one clone activates inside the parenthesis, the other outside). A model with a single clone for every symbol will obviously have a lower likelihood given the data.
Using BaumWelch, the optimal model is readily found from training data when enough clones are available. Now consider a greedy clonesplitting mechanism that starts with one clone per symbol and splits one clone at a time when doing so increases the likelihood of the model:

Splitting a’s clone cannot increase the likelihood of the model, since the single existing clone already predicts the single clone of b with probability 1 and is always correct in doing so.

Splitting b’s clone has the symmetric problem: even though having a different clone activate inside and outside the parenthesis would increase its ability to predict the next symbol (for instance, it should always be ) for the clone inside the parenthesis), since it is always preceded by the same clone of a, it is not possible to disambiguate both clones of b, and therefore this split does not increase the likelihood of the model either.
In this toy example, the only move that results in a likelihood increase is one that splits the clones of a and b simultaneously, and not greedily.
Appendix B Theoretical guarantees
b.1 Upper bounds for CHMM
We consider , a sequence with corresponding latent variables (clones) drawn from a clone HMM with transition matrix . The clones takes their values in and satisfy the following assumption:
Assumption 1
The Markov chain associated to the latent sequence is aperiodic, recurrent, has unique stationary distribution , and is revertible. In addition, we can fix a constant such that:
(4) 
Assumption 1 is standard in the literature [29]. Using Jensen’s inequality, the normalized loglikelihood of the complete data is lowerbounded by:
where is the set of admissible transition matrices. The empirical operator for clone HMM can be expressed as:
Given an estimate of the transition matrix, the Estep of BaumWelch algorithm for clone HMM derives the empirical operator by computing the probabilities . The Mstep considers the maximizer operator defined by . That is, the algorithm repeats until convergence: .
In the case of an infinite amount of data, the empirical operator is equivalent to the population function defined as:
When Assumption 1 holds, the following Theorem 3 assess the existence of . This existence is not guaranteed: the quantity depends upon since the samples are not i.i.d.. In addition, we define the population maximizer by the selfconsistency condition:
where the operator extends to the population function . corresponds to a global maximizer of the population likelihood, which we would like to recover via the BaumWelch algorithm. However, since the samples are not i.i.d., we cannot directly analyze the asymptotic behavior of the population function . To this end, we consider a sequence of latent clones drawn from the stationary distribution of the Markov chain. For an integer , we introduce the truncated version of the operator :
The latent variables are conditioned on a window centered around the index. Because of the stationarity of the sequence of latent variables, the expectation does not depend upon the sample size . Consequently, we define the truncated population function :
is properly defined. Theorem 3 proves that under Assumption 1 and 2 (defined below), uniformly approximates the population function . In addition, we extend the operators and to their respective truncated versions:
As the truncation level increases, the operator converges to . Similarly, as increases, this operator converges to . We control their respective convergences with the two following parameters:
Definition 1
Let . Let be the smallest scalar such that:
In addition, let be the smallest scalar such that:
We finally make the following assumption.
Assumption 2
For all , the truncated population function is strictly concave. That is,we can fix such that:
(5) 
In addition for all , the gradient of is Lipschitz with constant . That is:
(6) 
Finally, it holds .
We are now ready to present our main Theorem 3. It states that the difference between the th iterate of the BaumWelch algorithm and the global optimum of the population likelihood is the sum of a geometrically decaying term and a residual error – function of the parameters and .
Theorem 3
Under Assumption 1, the population function is welldefined.
The proof is presented in Appendix B.3.
b.2 Comparing the upper bounds for CHMM and HMM
We derive herein an upper bound for HMM similar to the one for CHMM in Equation (7) and compare the two upper bounds. Let be the sparse emission matrix associated with a CHMM – defined such that each clone only emits its observation. The normalized loglikelihood of the complete data is now lowerbounded by the empirical operator defined for any couple of estimates by:
By noting that the first term depends only depends upon and the second term only upon , we write:
(8) 
We note the set of emission matrices. The EM algorithm considers the maximizer operator and repeats until convergence:
where and denote the maximizer operators respectively associated with and . Similarly to the CHMM case, we define the population versions , of the operators introduced, their truncated versions , and the truncated population versions , . In particular, the population versions are only defined on a set of emission matrices with lowest entry lower bounded by some . This does not hold for their truncated versions as the truncated operators do not depend upon . In addition, we introduce the operators corresponding to the decomposition with respect to the transition and emission matrices as in Equation (8). We consider a global population maximizer and assume that as defined in Section B.1. We finally note , and extend Assumption 2 as follows:
Assumption 3
For all , the truncated population function is strictly concave with respect to each of its component. That is,we can fix such that:
(9) 
In addition for all , the gradient of is Lipschitz with respect to each variable:
(10) 
We define two similar conditions for – with respective Lipschitz constants and .
We note and and assume:
We finally extend Definition 1 as follows:
Definition 2
Let . Let be the smallest scalar such that:
In addition, let be the smallest scalar such that:
We derive the following theorem in the case of HMM. The proof is presented in Appendix B.3.
Theorem 4
Under Assumption 1, the population function is welldefined on . In addition, if Assumption 3 holds, converges uniformly to :
Finally, the sequence of transition matrices derived by the BaumWelch algorithm: , , satisfies with probability at least :
(11) 
We define to recover the bound in Theorem 2. It holds:
b.3 Proof of Theorem 3
Proof:
We divide our proof in 3 steps.
Step 1: We first prove the existence of the population function defined as: To do so, we prove that the sequence of operators satisfies the Cauchy property:
Following Lemma 3 from [29], since the Markov chain with stationary distribution satisfies Assumption 1, it holds for some :
(12) 
where . Consequently, since , it holds :
(13) 
As a result we have:
(14) 
Using the triangle inequality:
we conclude that if , the sequence is Cauchy and is welldefined.
Step 2:
We now prove the uniform convergence of the truncated operator