Global Autoregressive Models for Data-Efficient Sequence Learning

Standard autoregressive seq2seq models are easily trained by max-likelihood, but tend to show poor results under small-data conditions. We introduce a class of seq2seq models, GAMs (Global Autoregressive Models), which combine an autoregressive component with a log-linear component, allowing the use of global a priori features to compensate for lack of data. We train these models in two steps. In the first step, we obtain an unnormalized GAM that maximizes the likelihood of the data, but is improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation. Our experiments focus on language modelling under synthetic conditions and show a strong perplexity reduction from using the second autoregressive model instead of the standard one.




1 Introduction

Neural sequential text generation models have become the standard in NLP applications such as language modelling, NLG, machine translation. When enough data is available, these models can be trained end-to-end with impressive results. Generally, inference and training proceed in an auto-regressive manner, namely, the next decoded symbol is predicted by a locally normalized conditional distribution (the “softmax”). This has several advantages: (i) the probability of the sequence is already normalized, by the chain-rule over local decisions, (ii) max-likelihood (ML) training is easy, because the log-likelihood of the full sequence is simply the sum of local CE (cross-entropy) losses, (iii) exact sampling of full sequences from the model distribution is directly obtained through a sequence of local sampling decisions.

However, these autoregressive models (AMs) tend to suffer from a form of myopia. They have difficulty accounting for global properties of the predicted sequences, from overlooking certain aspects of the semantic input in NLG to duplicating linguistic material or producing “hallucinations” in MT, and generally through being unable to account for long-distance consistency requirements that would be obvious for a human reader. (Footnote 1: To borrow terminology from Reinforcement Learning (RL) Sutton and Barto (2018), such NLP models work by “imitation learning”, without any representation of “objectives” to be realized.) While this defect can be mitigated in the presence of large training sets, it can become serious when this condition is not met.

The main contributions of this paper are as follows.

First, we propose a hybrid seq2seq formalization, the Global Autoregressive Model (GAM), that combines a local autoregressive component with a global log-linear component, allowing the use of a priori features to compensate for the lack of training data. GAMs are related both to the class of Energy-Based Models (EBM) and to that of Exponential Families (EF), and inherit some important properties from those: an intimate relationship between training and sampling (EBM); the identity of empirical and model expectations at maximum-likelihood; concavity of the log-likelihood (EF).

Second, we propose a training procedure in two steps. In the first step, we train through max-likelihood a GAM, which however is unnormalized and improper for fast inference or evaluation. In the second step, we use this GAM to train (by distillation) a second autoregressive model that approximates the normalized distribution associated with the GAM, and can be used for fast inference and evaluation.

Third, we demonstrate the ability of GAMs to be data-efficient, namely, to exploit the original data better than a standard autoregressive model. In order to clarify the core techniques and issues, we design a simple class of synthetic data, consisting of random binary strings containing “motifs” (specific substrings) that we can manipulate in different ways. We show that, in limited data conditions, GAMs are able to exploit the features to obtain final autoregressive models that perform better than the original ones.

The remainder of the paper is structured as follows. In Section 2, we provide some background about autoregressive models, energy-based models, and log-linear models. In Section 3, we introduce GAMs. In Section 4, we describe our focus on synthetic data. In Section 5, we explain our training procedure. In Section 6, we comment on related work. In Section 7, we describe our experiments. In Section 8, we provide an analysis of our results. We conclude with a discussion in Section 9. Note that some additional explanations and experiments are provided in the Supplementary Material, indicated by [SM].

2 Background

Autoregressive models (AM)

These are currently the standard for neural seq2seq processing, with such representatives as RNN/LSTMs Hochreiter and Schmidhuber (1997); Sutskever et al. (2014), ConvS2S Gehring et al. (2017), and Transformer Vaswani et al. (2017). Formally, they are defined through a distribution $r_\eta(x|C)$, where $C$ is an input (aka Context, e.g. a source sentence in Machine Translation (MT)), and $x$ is a target sequence (e.g. a target sentence in MT). We have:

$$r_\eta(x|C) \doteq \prod_i s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C),$$

where each $s_\eta(x_i \mid x_1, \ldots, x_{i-1}, C)$ is a normalized conditional probability over the next symbol of the sequence, computed by a neural network (NN) with parameters $\eta$. The local normalization of the incremental probabilities implies the overall normalization of the distribution $r_\eta(x|C)$, and consequently, the possibility of directly sampling from it and evaluating the likelihood of training sequences.
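As a minimal sketch of the chain-rule factorization above, the following Python snippet uses a toy, hand-written conditional in place of a neural network. The function `next_symbol_probs` and its "repeat the previous symbol" rule are hypothetical, and strings have a fixed length with no end-of-sequence symbol (an assumption made for brevity). It illustrates that local normalization gives a globally normalized distribution and exact ancestral sampling.

```python
import itertools
import math
import random


def next_symbol_probs(prefix):
    """Toy locally normalized conditional s(. | prefix) over {'0', '1'}.

    Hypothetical rule: repeat the previous symbol with probability 0.7.
    """
    if not prefix:
        return {"0": 0.5, "1": 0.5}
    last = prefix[-1]
    other = "1" if last == "0" else "0"
    return {last: 0.7, other: 0.3}


def log_prob(x):
    """log r(x) as a sum of local log-probabilities (chain rule)."""
    return sum(math.log(next_symbol_probs(x[:i])[x[i]]) for i in range(len(x)))


def sample(n, rng):
    """Exact ancestral sampling: one local draw per position."""
    x = ""
    for _ in range(n):
        probs = next_symbol_probs(x)
        x += "0" if rng.random() < probs["0"] else "1"
    return x


n = 6
# Summing r(x) over all strings of length n should give 1.
total = sum(
    math.exp(log_prob("".join(bits))) for bits in itertools.product("01", repeat=n)
)
```

The negative of `log_prob(x)` is the sum of local cross-entropy losses for `x`, which is why max-likelihood training of such models is straightforward.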

Energy-Based Models (EBM)

EBMs are a generic class of models, characterized by an energy function $U_\eta(x)$ computed by a NN parametrized by $\eta$ LeCun et al. (2006). Equivalently, they can be seen as directly defining a potential (an unnormalized probability distribution) $P_\eta(x) = e^{-U_\eta(x)}$, and indirectly the normalized distribution $p_\eta(x) = \frac{1}{Z_\eta} P_\eta(x)$, with $Z_\eta = \sum_x P_\eta(x)$. A fundamental property of these models is that, for max-likelihood training, the SGD updates can be computed through the following formula (Footnote 2: See (LeCun et al., 2006, p. 15), and [SM] for a derivation.):

$$\nabla_\eta \log p_\eta(x) = \nabla_\eta \log P_\eta(x) - E_{x' \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x'),$$

which, in principle, reduces the problem of training with unnormalized potentials to the problem of sampling from them.

Log-Linear Models / Exponential Families

Log-Linear models Jebara (2013) are the conditional version of Exponential Families Jordan (2010). The general form of a log-linear model (for the discrete case) is as follows:

$$p_\lambda(x|C) = \frac{1}{Z_\lambda(C)}\, \mu(x;C)\, e^{\langle \lambda(C), \phi(x;C) \rangle},$$

with $Z_\lambda(C) = \sum_x \mu(x;C)\, e^{\langle \lambda(C), \phi(x;C) \rangle}$. Here $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by scalar product with a real vector of weights $\lambda(C)$ of the same dimension; $\mu(x;C)$ is an arbitrary “base measure”, which is fixed. These models, which allow prior knowledge to be introduced through features and have nice formal properties (see below), were mainstream in NLP before the revival of neural approaches.

3 Proposal: GAMs

We now define Global Autoregressive Models (GAMs). These are hybrid seq2seq models that exploit both local autoregressive properties and global properties of the full target sequence. A GAM is an unnormalized distribution $P_\eta(x|C)$ over sequences $x$, parametrized by a vector $\eta = \eta_1 \oplus \eta_2$:

$$P_\eta(x|C) \doteq r_{\eta_1}(x|C) \cdot e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}.$$

Here $r_{\eta_1}(x|C)$ is an autoregressive seq2seq model for generating $x$ from input $C$, parametrized by $\eta_1$; $\phi(x;C)$ is a vector of predefined real features of the pair $(x, C)$, which is combined by a scalar product with a real vector $\lambda_{\eta_2}(C)$ of the same dimension, computed over the input $C$ by a network parametrized by $\eta_2$. The normalized distribution associated with the GAM is $p_\eta(x|C) = P_\eta(x|C) / Z_\eta(C)$, where $Z_\eta(C) = \sum_x P_\eta(x|C)$.

GAMs appear promising for the following reasons:

  • Features provide a simple way to draw the attention of the model to potentially useful aspects that may be difficult for the AM component to discover on its own from limited data.

  • GAMs are an instance of EBMs, where the potential is the product of an AM potential $r_{\eta_1}(x|C)$ with a “log-linear” potential $e^{\langle \lambda_{\eta_2}(C), \phi(x;C) \rangle}$. Here the gradient relative to the log-linear weights $\lambda = \lambda_{\eta_2}(C)$ takes the especially simple form: $\nabla_\lambda \log p_\eta(x|C) = \phi(x;C) - E_{x' \sim p_\eta(\cdot|C)}\, \phi(x';C)$.

  • Log-linear models, on their own, while great at expressing prior knowledge, are not as good as AM models at discovering unforeseen regularities in the data. Also, they are typically problematic to train from a log-likelihood perspective, because sampling from them is often unfeasible. GAMs address the first issue through the $r$ component, and alleviate the second issue by permitting the use of $r$ as a powerful “proposal” (aka “surrogate”) distribution in importance sampling and related approaches, as we will see.
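The gradient identity for the log-linear part can be checked numerically on a space small enough to enumerate. The sketch below is only an illustration under stated assumptions: an unconditional setting, a white-noise `r`, a hypothetical motif `"101"`, and two hypothetical binary features (with the convention, assumed here, that the first feature equals 1 when the motif is absent). It compares $\phi(x) - E_{x' \sim p}\,\phi(x')$ with a finite-difference gradient of $\log p(x)$.

```python
import itertools
import math

N = 5
MOTIF = "101"  # hypothetical motif, for illustration only
STRINGS = ["".join(b) for b in itertools.product("01", repeat=N)]


def r(x):
    """White-noise autoregressive component: every symbol has probability 1/2."""
    return 0.5 ** len(x)


def phi(x):
    """Two binary features (assumed polarity): motif absent; x starts with '1'."""
    return [0.0 if MOTIF in x else 1.0, 1.0 if x.startswith("1") else 0.0]


def potential(x, lam):
    """Unnormalized GAM potential P(x) = r(x) * exp(<lam, phi(x)>)."""
    return r(x) * math.exp(sum(l * f for l, f in zip(lam, phi(x))))


def log_p(x, lam):
    """Normalized log-probability, with Z computed by brute-force enumeration."""
    z = sum(potential(y, lam) for y in STRINGS)
    return math.log(potential(x, lam) / z)


def model_moments(lam):
    """Exact model expectation of the feature vector."""
    z = sum(potential(y, lam) for y in STRINGS)
    return [
        sum(potential(y, lam) * phi(y)[k] for y in STRINGS) / z
        for k in range(len(lam))
    ]


def analytic_grad(x, lam):
    """phi(x) minus the model moments."""
    return [f - m for f, m in zip(phi(x), model_moments(lam))]


def numeric_grad(x, lam, eps=1e-6):
    """Central finite differences of log p(x) with respect to lam."""
    grad = []
    for k in range(len(lam)):
        up, down = list(lam), list(lam)
        up[k] += eps
        down[k] -= eps
        grad.append((log_p(x, up) - log_p(x, down)) / (2 * eps))
    return grad
```

For realistic sequence lengths the enumeration in `model_moments` is infeasible, which is why the model moments have to be estimated by sampling.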

4 Experimental focus

While the motivation for GAMs ultimately lies in practical NLP applications such as those evoked earlier, in this paper we aim to understand some of their capabilities and training techniques in simple and controllable conditions. We focus on the unconditional (i.e. language modelling) case, and on synthetic data. Our setup is as follows:

  • We consider an underlying process $p_{true}$ that generates binary sequences according to a well-defined and flexible process. In this paper we use PFSAs (Probabilistic Finite State Automata) to impose the presence or absence of sub-strings (“motifs”) anywhere in the generated data, exploiting the intersection properties of automata.

  • Due to the dynamic programming properties of PFSAs, it is possible to compute the true entropy of the process (see [SM]), as well as other quantities (Partition Functions, Mean sequence length); it is also possible to generate training ($D$), validation ($V$), and test data ($T$) in arbitrary quantities.

  • We employ an unconditional GAM of the simple form:

    $$P_\lambda(x) \doteq r(x) \cdot e^{\langle \lambda, \phi(x) \rangle},$$

    where $r$ is trained on $D$ and then kept fixed, and where $\lambda$ is then trained on top of $r$, also on $D$.

    It should be noted that with $r$ fixed in this way, this formulation exactly corresponds to the definition of an exponential family Jordan (2010), with $r$ as base measure. In such models, we have two important properties: (i) the log-likelihood of the data is concave relative to the parameters $\lambda$, and thus a local maximum is also global; (ii) the max-likelihood value $\lambda^*$ has the property that the model expectation $E_{x \sim p_{\lambda^*}(\cdot)}\, \phi(x)$ is equal to the empirical expectation $|D|^{-1} \sum_{x \in D} \phi(x)$ (“Moment Matching” property of exponential families).

  • We are especially interested in the relative data-efficiency of the GAM $P_\lambda$ compared to the AM $r$: namely the ability of the GAM to recover a lower perplexity approximation of $p_{true}$ than $r$, especially in small training-set conditions.
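The moment-matching property can be illustrated on a toy problem where the model moments are computed exactly by enumeration. This is a sketch under assumptions that are not from the paper: strings of length 4, a hypothetical motif `"11"`, a white-noise base measure, a single feature equal to 1 when the motif is absent, and a hand-made dataset `DATA`. Plain gradient ascent on $\lambda$ follows the direction "empirical moment minus model moment".

```python
import itertools
import math

N, MOTIF = 4, "11"  # hypothetical toy setting
STRINGS = ["".join(b) for b in itertools.product("01", repeat=N)]


def phi(x):
    """Single binary feature (assumed polarity): 1 if the motif is absent."""
    return 0.0 if MOTIF in x else 1.0


def model_moment(lam):
    """Exact E_{x ~ p_lam} phi(x), with a white-noise base measure r."""
    weights = [0.5 ** N * math.exp(lam * phi(x)) for x in STRINGS]
    z = sum(weights)
    return sum(w * phi(x) for w, x in zip(weights, STRINGS)) / z


def fit_lambda(data, lr=1.0, steps=2000):
    """Gradient ascent on the log-likelihood: lam += lr * (empirical - model)."""
    empirical = sum(phi(x) for x in data) / len(data)
    lam = 0.0
    for _ in range(steps):
        lam += lr * (empirical - model_moment(lam))
    return lam, empirical


# Nine strings containing the motif and one without: empirical moment 0.1.
DATA = ["0011", "0110", "0111", "1011", "1100",
        "1101", "1110", "1111", "1111", "0101"]
lam_star, empirical = fit_lambda(DATA)
```

In this toy case half of the 16 strings contain the motif, so the fitted weight can be checked in closed form: the model moment equals $e^{\lambda} / (1 + e^{\lambda})$, which matches 0.1 at $\lambda^* = -\log 9$. The weight is negative, so strings without the motif are penalized.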

5 Training procedure

Two-stage training

We consider a two-stage training procedure (see Fig. 1).

Figure 1: Two-stage training

Training-1 This consists in training the model $P_\lambda$ on $D$. This is done by first training $r$ on $D$ in the standard way (by cross-entropy) and then by training $\lambda$ by SGD with the formula (adapted from (3)):

$$\nabla_\lambda \log p_\lambda(x) = \phi(x) - E_{x' \sim p_\lambda(\cdot)}\, \phi(x'). \qquad (5)$$

The main difficulty then consists in computing an estimate of the model moments $E_{x \sim p_\lambda(\cdot)}\, \phi(x)$. In our experiments, we compare two Monte-Carlo approaches Robert and Casella (2005) for addressing this problem: (i) Rejection Sampling (rs), using $r$ as the proposal distribution and (ii) Self-Normalized Importance Sampling (snis) (Y. Bengio and J. S. Senecal, 2008), also using $r$ as the proposal.

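Rejection sampling is performed as follows. We use $r(x)$ as the proposal, and $P_\lambda(x) = r(x)\, e^{\langle \lambda, \phi(x) \rangle}$ as the unnormalized target distribution; for any specific $\lambda$, because our features are bounded between $0$ and $1$, we can easily upper-bound the ratio $P_\lambda(x) / r(x) = e^{\langle \lambda, \phi(x) \rangle}$ by a number $\beta$; we then sample $x$ from $r$, compute the ratio $\rho(x) = P_\lambda(x) / (\beta\, r(x)) \le 1$, and accept $x$ with probability $\rho(x)$. The accepted samples are unbiased samples from $p_\lambda(x)$ and can be used to estimate model moments.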
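The rejection-sampling step can be sketched as follows. Everything concrete here is an assumption made for illustration: the proposal `sample_r` is white noise rather than a trained LSTM, the motif `"1011"` and the two features are hypothetical, and the feature polarity (1 when the motif is absent) is chosen so that a negative weight penalizes strings without the motif. The bound uses only the fact that the features lie in $[0, 1]$.

```python
import math
import random

N, MOTIF = 8, "1011"  # hypothetical toy setting


def phi(x):
    """Binary features (assumed polarity): [motif absent, x starts with '1']."""
    return [0.0 if MOTIF in x else 1.0, 1.0 if x[0] == "1" else 0.0]


def sample_r(rng):
    """Stand-in proposal r: white-noise binary strings of length N."""
    return "".join(rng.choice("01") for _ in range(N))


def log_ratio(x, lam):
    """log(P_lam(x) / r(x)) = <lam, phi(x)>."""
    return sum(l * f for l, f in zip(lam, phi(x)))


def rejection_sample(lam, rng):
    """Draw one exact sample from p_lam using r as the proposal."""
    # Features lie in [0, 1], so <lam, phi(x)> <= sum of the positive weights.
    log_beta = sum(max(l, 0.0) for l in lam)
    while True:
        x = sample_r(rng)
        if rng.random() < math.exp(log_ratio(x, lam) - log_beta):
            return x


def rs_moments(lam, n_accept, rng):
    """Estimate the model moments from n_accept accepted samples."""
    samples = [rejection_sample(lam, rng) for _ in range(n_accept)]
    return [sum(phi(x)[k] for x in samples) / n_accept for k in range(len(lam))]
```

When the weight on the "motif absent" feature is strongly negative and the motif is rare under the proposal, most proposals are rejected, which is the source of the slowdown for rare motifs.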

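Snis also uses the proposal distribution $r$, but does not require an upper-bound, and is directly oriented towards the computation of expectations. In this case, we sample a number of points $x_1, \ldots, x_N$ from $r$, compute “importance ratios” $w(x_i) = P_\lambda(x_i) / r(x_i)$, and estimate $E_{x \sim p_\lambda(\cdot)}\, \phi(x)$ through $\hat{E} = \sum_i w(x_i)\, \phi(x_i) / \sum_i w(x_i)$. The estimate is biased for a given $N$, but consistent (that is, it converges to the true $E$ for $N \to \infty$).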
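A corresponding sketch of the self-normalized importance sampling estimate is given below, under the same illustrative assumptions as before (white-noise stand-in proposal, hypothetical motif and features, assumed feature polarity). Every proposal sample contributes to the estimate, weighted by its importance ratio.

```python
import math
import random

N, MOTIF = 8, "1011"  # hypothetical toy setting


def phi(x):
    """Binary features (assumed polarity): [motif absent, x starts with '1']."""
    return [0.0 if MOTIF in x else 1.0, 1.0 if x[0] == "1" else 0.0]


def sample_r(rng):
    """Stand-in proposal r: white-noise binary strings of length N."""
    return "".join(rng.choice("01") for _ in range(N))


def importance_ratio(x, lam):
    """w(x) = P_lam(x) / r(x) = exp(<lam, phi(x)>)."""
    return math.exp(sum(l * f for l, f in zip(lam, phi(x))))


def snis_moments(lam, n_samples, rng):
    """Self-normalized estimate of E_{x ~ p_lam} phi(x) from proposal samples."""
    xs = [sample_r(rng) for _ in range(n_samples)]
    ws = [importance_ratio(x, lam) for x in xs]
    total = sum(ws)
    return [
        sum(w * phi(x)[k] for w, x in zip(ws, xs)) / total
        for k in range(len(lam))
    ]
```

No upper bound is needed and no sample is discarded; the price is a bias for finite `n_samples`, which vanishes as the number of samples grows.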


Training-2 While Training-1 results in a well-defined model $P_\lambda(x)$, which may fit the data closely in principle, we should not conclude that $P_\lambda$ is convenient to use for inference, namely, in language modeling, for efficiently sampling from its normalized version $p_\lambda(x)$; just as seriously, because of the partition factor $Z_\lambda$, it is also not obvious how to evaluate the perplexity of $P_\lambda$ on test data. In order to do both, one approach consists in using a distillation technique Hinton et al. (2015), where, during training, one expends generous time towards producing a set of samples from $p_\lambda$, for instance by Monte-Carlo (e.g. Rejection Sampling) techniques, and where this set (which may be arbitrarily larger than the original $D$) is in turn used to train a new autoregressive model $\pi_\theta(x)$, which can then be used directly for sampling or for computing data likelihood. This is the approach that we use in our current experiments, again using the original $r$ as a proposal distribution.
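The distillation idea can be sketched end to end on toy data. The sketch below makes several simplifying assumptions that are not from the paper: the proposal is white noise, the weight `LAMBDA` on a single "motif absent" feature is fixed by hand instead of being trained, and the distilled autoregressive model is a count-based Markov model of order `ORDER` with add-one smoothing rather than an LSTM. It only illustrates the two steps: sample from $p_\lambda$ by rejection, then fit a locally normalized model on those samples.

```python
import math
import random
from collections import defaultdict

N, MOTIF, ORDER = 10, "1011", 3  # hypothetical toy setting
LAMBDA = -4.0  # assumed (not trained) weight on the "motif absent" feature


def sample_r(rng):
    """Stand-in proposal r: white-noise binary strings of length N."""
    return "".join(rng.choice("01") for _ in range(N))


def sample_p_lambda(rng):
    """Rejection sampling from p_lambda; since LAMBDA <= 0, the bound beta is 1."""
    while True:
        x = sample_r(rng)
        accept_prob = 1.0 if MOTIF in x else math.exp(LAMBDA)
        if rng.random() < accept_prob:
            return x


def fit_markov(data):
    """Count-based autoregressive model conditioned on the last ORDER symbols."""
    counts = defaultdict(lambda: {"0": 1.0, "1": 1.0})  # add-one smoothing
    for x in data:
        for i, sym in enumerate(x):
            counts[x[max(0, i - ORDER):i]][sym] += 1.0
    return counts


def log_prob(counts, x):
    """log pi(x): sum of local, locally normalized log-probabilities."""
    total = 0.0
    for i, sym in enumerate(x):
        c = counts[x[max(0, i - ORDER):i]]
        total += math.log(c[sym] / (c["0"] + c["1"]))
    return total


rng = random.Random(0)
distilled = [sample_p_lambda(rng) for _ in range(2000)]  # distilled dataset
pi_theta = fit_markov(distilled)  # distilled autoregressive model
```

The distilled model is normalized by construction, so `log_prob(pi_theta, x)` can be used directly to compute cross-entropy on held-out strings, which is not possible with the unnormalized potential alone.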

Cyclical training

In the case of small $|D|$, the proposal distribution $r$ is weak and as a result the distillation process, based on rejection sampling, can be slow. To address this issue, we also consider a cyclical training regime that updates the proposal distribution after distilling each batch of samples, with the intention of reducing the rejection rate. Once the process of distillation is finished, we use the aggregated samples to train the final $\pi_\theta$. The two-stage training procedure is a variant of the cyclical one, with a fixed proposal (see Algorithm 1 for more details).

function train(, DsSize, tReg, mode)    ▷ DsSize: distilled dataset size; tReg
    ▷ initialize and then train RNN
    ▷ train for a given proposal
    if mode = ‘two_stage’ then    ▷ Training-2: distill in one step
    else if mode = ‘cyclic’ then    ▷ Cyclic-training: distill in several steps
        while  do    ▷ proceed to the distillation process
            ▷ accptRate: acceptance rate of during distillation
            .insert(); .insert()
            if not  then
                ▷ improve proposal
                ▷ train for a given proposal
                ▷ check if acceptance rate has stopped improving
        .insert(); .insert()    ▷ add true data to the distilled one
    return
function trainGAM(, D, V, tReg, )    ▷ Training-1
    ▷ initial learning rate
    ▷ empirical moments of the given dataset
    while not earlyStopping() do    ▷ check if has stopped improving
        model_mom    ▷ accumulate the model’s moments
        for  do
            ▷ use or to compute
            model_mom    ▷ moving average
            ▷ use Eq. 5 to compute gradients
    return
Algorithm 1 Training

6 Related Work

Hoang et al. (2018), working in an NMT context, have a similar motivation to ours. They first train an autoregressive seq2seq model (Transformer in their case) on bilingual data, then attempt to control global properties of the generated sequences through the introduction of a priori features. They interpolate the training of the autoregressive model with training of a Moment Matching component which tries to equate the feature expectations of the model with those of the data. Contrary to our approach, they do not directly try to maximize likelihood in an integrated model.

Andor et al. (2016) consider transition-based neural networks, and contrast local to global normalization of decision sequences, showing how the global approach avoids the label bias problem in such tasks as tagging or parsing. They focus on inference as maximization, e.g. finding the best sequence of tags for a sequence of words, and consistent with that objective, their training procedure exploits a beam-search approximation. By contrast, our focus is on inference as sampling in a language modelling perspective, on the complementarity between auto-regressive models and log-linear models, and on the relations between training and sampling in energy-based models.

Figure 2: Cross-entropy in nats per character and frequency of sampling motif, depending on . Two-stage Training. Features are on for all panels (). Panel (a): pure , features (super-motif) and (sub-motif) on; (b): pure , (motif) and (sub-motif) on; (c) pure , on; (d) mixture , on. The plain lines represent cross-entropy, the dashed lines motif frequency.

7 Experiments

We conduct a series of experiments on synthetic data to illustrate our approach.

Synthetic data

To assess the impact of GAMs, we focus on distributions that are likely to be well approximated by the AM in the presence of large data. The first class of distributions is obtained through a PFSA that filters binary strings of fixed length $n$, 0’s and 1’s being equally probable (white-noise strings), through the condition that they contain a specific substring (“motif”) anywhere; here the relative frequency of sequences containing the motif among all sequences varies from to .
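The relative frequency of white-noise strings that contain a given motif can be computed by dynamic programming over a small matching automaton, without enumerating strings. The sketch below is an illustration of that idea and not the PFSA construction used for the experiments; the function names are hypothetical. States record how much of the motif is currently matched, and the last state is absorbing once the motif has been seen.

```python
def build_automaton(motif):
    """delta[state][symbol] -> next state.

    A state is the length of the longest motif prefix matched so far;
    state len(motif) is absorbing (the motif has already appeared).
    """
    m = len(motif)
    delta = []
    for state in range(m + 1):
        row = {}
        for sym in "01":
            if state == m:
                row[sym] = m
                continue
            cand = motif[:state] + sym
            # Longest suffix of cand that is also a prefix of the motif.
            k = len(cand)
            while k > 0 and cand[-k:] != motif[:k]:
                k -= 1
            row[sym] = k
        delta.append(row)
    return delta


def fraction_with_motif(motif, n):
    """Probability that a white-noise string of length n contains the motif."""
    delta = build_automaton(motif)
    m = len(motif)
    mass = [0.0] * (m + 1)
    mass[0] = 1.0
    for _ in range(n):
        new_mass = [0.0] * (m + 1)
        for state, p in enumerate(mass):
            for sym in "01":
                new_mass[delta[state][sym]] += 0.5 * p
        mass = new_mass
    return mass[m]
```

The cost is linear in the string length and in the motif length, so the rarity of a motif can be evaluated for lengths where brute-force enumeration is impossible.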

We also consider mixtures of two PFSAs (motif/anti-motif): the first (with mixture prob. ) produces white-noise strings containing the motif and the second (with mixture prob. ) strings excluding the motif.

From these processes we produce a training set , of size varying between and , a validation set of size (but never smaller than or bigger than ) and a test set of fixed size .


In a real world scenario, prior knowledge about the true process will involve, along with predictive features, a number of noisy and useless features. By training the parameters to match the empirical moments, the GAM will learn to distinguish between these types. In order to simulate this situation we consider feature vectors over our artificial data that involve both types.

With the full string and the fixed motif used in constructing the training data, we consider variations among the 7 binary features in the set :

where iff the motif appears in , iff the motif followed by a zero (“super-motif”) appears in , iff an initial section of the motif (“sub-motif”, roughly half the size of ) appears in . These three features are chosen because they have some correlation with the process for generating the training data. By contrast, the four remaining features are “distractors”: iff begins with a , (resp. , ) iff a certain random, but fixed, string of similar length to (resp. of larger length, of smaller length) appears in . We test different configurations of these features for training , and document the use/non-use of features with a bit-vector of length , for instance means that all features are exploited, apart from .

Implementation aspects
Autoregressive models

The AMs are implemented in PyTorch Paszke et al. (2017) using a 2-layered LSTM Hochreiter and Schmidhuber (1997) with hidden-state size 200. The input is presented through one-hot encodings over the vocabulary . These LSTMs are optimized with Adam Kingma and Ba (2014), with learning rate , and with early stopping (patience ) over a validation set.

Training: Two-Stage and Cyclical

The implementation is described in (Algorithm 1). Here we provide some additional details.

Training-1 For training $P_\lambda$ we test two regimes in Eq. 5, namely rs and snis; in both cases, we first train $r$ on whatever $D$ is available, and use it as the proposal distribution. During rs, we compute the model’s expectation over 10 accepted samples, update the $\lambda$’s according to (5), and iterate. During snis, we keep a buffer of the last samples from $r$ to compute the weighted average of the feature moments. For the training of $\lambda$’s, we use a basic SGD optimization with learning rate . To assess the quality of $P_\lambda$ for early stopping during training, we use the distance between the empirical and model moments:


Training-2 and Cyclical Training When distilling from $P_\lambda$ in Training-2, we use a single proposal $r$, and systematically produce a distilled dataset of size , which corresponds to the highest value of $|D|$ among those considered for training $r$. In Cyclical Training, the distillation process is performed in several stages, with an evolving $r$ for improving the rejection rate.

|D|      m;       m;       m;       mam;     mam;     mam;
         0.998    0.967    2.92     0.997    1.003    4.7
1000     1.009    0.973    2.038    0.77     1.07     3.638
5000     0.995    0.967    0.756    1.12     0.99     1.365
10000    1.134    0.956    1.514    1.011    1.002    1.005
20000    1.497    0.961    0.938    0.965    1.005    0.975

Table 1: Comparison of the time for Training-1 in and ; for motif ; ; with pure (m) and ; with mixture of motif-anti-motif (mam).

8 Results

Cross-entropy comparison

We conduct experiments to compare the cross-entropy (measured in nats) between the initial AM $r$ relative to the test set $T$ and the final AM $\pi_\theta$ also relative to $T$; we vary the size of $D$, the regimes (tReg) for Training-1 (rs or snis), the features employed, and the rarity of the motifs. Figure 2 depicts the resulting curves at the end of the two-stage training (plain lines).

Here we show only a few experiments (a more extensive set is provided in the [SM]).

We observe that, for a small dataset size $|D|$, there is a big gap between the CE of $r$ and the CE of $\pi_\theta$. As $|D|$ increases, these cross-entropies become closer to one another, but a large gap persists for .

We note that the presence of the “fully-predictive” motif feature results in a $\pi_\theta$ that has CE very close to the theoretical entropy, even in low $|D|$ regimes, where $r$ on its own is very weak. (Footnote 3: The CE of a model relative to the true underlying process (approximated by the test set $T$) can never be below the entropy of this process, due to the KL-divergence being non-negative.) Thus, not only is the distilled AM much better than the initial AM, but this is an indication that $P_\lambda$ itself (for which the cross-entropy is more difficult to compute exactly) is a good approximation of the true process.
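The bound invoked in the footnote, that cross-entropy can never fall below the entropy of the true process, follows from $CE(p, q) = H(p) + KL(p \,\|\, q)$ with $KL \ge 0$. A minimal numeric sketch on small hand-made distributions (the vectors used in the checks are arbitrary examples, not experimental values):

```python
import math


def entropy(p):
    """H(p) in nats, with the convention 0 * log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """CE(p, q) = -sum_x p(x) log q(x), in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    """KL(p || q) = CE(p, q) - H(p), always non-negative."""
    return cross_entropy(p, q) - entropy(p)
```

Equality holds only when the model distribution coincides with the true one, which is why a CE close to the theoretical entropy indicates a good approximation of the process.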

By contrast, if the motif feature is absent, then, while $\pi_\theta$ is still better than $r$ in low $|D|$ regimes, it cannot reach the theoretical entropy in such regimes, because features such as the super-motif and sub-motif features can only partially model the data. With large $|D|$, on the other hand, $r$ on its own does a good job at predicting the data, and $P_\lambda$ adds little on top of its $r$ component.

Finally, we note that the two regimes for training $P_\lambda$, rs and snis, result in $\pi_\theta$’s with similar accuracies.

We also observe that with a good performance of $\pi_\theta$, the moments of the motif feature on the distilled dataset are close to the true ones (see [SM] Figure 4, 5, 7).

These trends are consistent across the experiments with different motifs, as can be checked with the more extensive plots and statistics in the [SM] (Figure 4, 5, 7 and Table 3).

Motif frequencies

In order to assess the predictive properties of the obtained AMs, we also compare the frequency of motifs in strings sampled from $r$ and from $\pi_\theta$ ( samples in total). From Figure 2 we see that when we vary $|D|$, the frequency of motifs (dashed lines) is aligned with the CE performance. Namely, $\pi_\theta$ produces a higher fraction of strings with motif than $r$ when $|D|$ is small ().

Detailed illustration To provide more intuition, we provide an illustration from one experiment in Table 2.

5 ’s
6 mom
7 mom
8 mom
9 CEs : 0.45, : 0.56, : 0.47
10 motif freqs : 1.0, : 0.045, : 0.959
Table 2: Illustration. Setting is from Fig. 2, panel (c): n = 30, motif = 10001011111000 (always present in ), ft = 1001111, $|D|$ = 5000, rs used for Training-1. Lines 1,2,3 show one example from respectively; with training set of size 5000, $r$ is only able to generate the motif a fraction of the time (0.045, see line 10), but is better able to generate some submotifs (underlined); $\pi_\theta$ generates the motif frequently (0.959), as illustrated on line 3. With the features from ft (line 4), Training-1 produces a $\lambda$ with first component strongly negative (line 5), meaning that $P_\lambda$ strongly penalizes the absence of the motif; the “distractor” features get a weight close to 0, meaning that they have little predictive power in combination with the motif feature. It is visible from lines 6,7,8 that $\pi_\theta$ is much better able to approximate the true feature expectations than $r$. Finally (line 9), the CE of $\pi_\theta$ relative to the test set is close to the true entropy of the process, while that of $r$ is much further away.
Mixture vs pure

In our experiments, the strings in (motif-anti-motif) contain a motif with . However, if not all of the samples in contain the motif, then the motif feature itself is not fully predictive. It can be seen in panel (d) of Figure 2 that the achieved with trained on mixture has consistent behaviour with the results obtained on the pure of panels (a,b,c).

Regimes in Training-1

For training the GAM we consider two methods, snis and rs. As described in the previous sections, their impact on $P_\lambda$ leads to $\pi_\theta$’s that have similar CE’s and motif frequencies. Despite this resemblance in terms of accuracy, the two methods differ in terms of speed (see Table 1). Namely, when $r$ is close to white noise due to small $|D|$, then for rare motifs rs rejects most samples not containing the motif, due to the effect of the log-linear term and the negative value of the component of $\lambda$ corresponding to the motif feature, while snis is able to exploit all samples. Despite being faster than rs, snis remains competitive in terms of CE.

Cyclical vs two-stage training

We conducted a small experiment to compare the performance of cyclical training with two-stage training in terms of speed and accuracy for a fixed motif and features (see [SM] Table 4, Figure 3). We observed that the CEs of the obtained $\pi_\theta$’s were about the same for different values of $|D|$ and Training-1 regimes. On the other hand, there was no systematic improvement in the training speed of one method over the other.

9 Discussion

The basic idea behind GAMs is very simple. First, we extend the representational power of the autoregressive model $r$ by multiplying it by a log-linear potential, obtaining an unnormalized model $P_\lambda$ (Training-1). Then we try to “project” this extended representation again to an autoregressive model $\pi_\theta$ (Training-2). Our results showed that, under favorable prior knowledge conditions, the final $\pi_\theta$ was able to perform as well, when trained on small data, as the standard $r$, trained on large data. During our experiments, we noticed that training $P_\lambda$ was actually easier than training $\pi_\theta$ from it. Intuitively, the small number of parameters to be fitted in the log-linear model requires less work and fewer data than the training of an autoregressive component. (Footnote 4: At a deeper level, there are extreme situations where the $P_\lambda$ obtained at the end of Training-1 can perfectly represent the true process, but where no autoregressive model can actually fit $P_\lambda$: one way to obtain such situations consists in generating binary strings that satisfy a certain cryptographic predicate, associated with a specific feature; the importance of this feature can be easily detected through Training-1, but an autoregressive model has no chance of generalizing from distilled or true data, even in large quantities.)

It is interesting to relate our study to certain aspects of Reinforcement Learning (RL).

First, consider Training-2. There, we have a “score” $P_\lambda(x)$ that we are trying to approximate through an autoregressive model $\pi_\theta$, which is basically a sequential “policy”. The main difference with RL is that we are not trying to find a policy that maximizes the score (which would be a bad idea for language modelling, as it would tend to concentrate the mass on a few sequences), but one that approximates $P_\lambda$ in a distributional sense; our current distillation technique is only one way to approach this problem, but other techniques more in the spirit of RL are possible, a direction that we leave for future work.

Second, consider Training-1. Our approach, consisting in suggesting to the model a number of prior features, might look too easy and suspicious. But notice that in RL, one would typically directly provide to the model an externally defined reward, a very strong form of prior knowledge. Here, instead, we “only” indicate to the model which features it might attend to, and Training-1 then determines the “reward” through max-likelihood, a milder form of prior knowledge, more respectful of what the data has to say.


Appendix A Supplementary Material

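A.1 SGD in Energy-based models

Formula (1) is fundamental for studying the SGD behavior of Energy-based models, and for convenience, we provide a derivation here.

If we define $Z_\eta = \sum_x P_\eta(x)$, we find that:

$$\nabla_\eta \log Z_\eta = \frac{1}{Z_\eta} \sum_x \nabla_\eta P_\eta(x) = \frac{1}{Z_\eta} \sum_x P_\eta(x)\, \nabla_\eta \log P_\eta(x) = \sum_x p_\eta(x)\, \nabla_\eta \log P_\eta(x) = E_{x \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x).$$

We then have:

$$\nabla_\eta \log p_\eta(x) = \nabla_\eta \log P_\eta(x) - \nabla_\eta \log Z_\eta = \nabla_\eta \log P_\eta(x) - E_{x' \sim p_\eta(\cdot)} \nabla_\eta \log P_\eta(x').$$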
A.2 The relevance of finite state automata; connections and differences with Reinforcement Learning

The way our synthetic data is produced through FSA’s may look contrived, but there are good motivations for using automata in such a study as ours.

Consider the following problem: you are given some RNN that produces sequences $x$ over a vocabulary $V$, with probabilities $r(x)$, but you would like to filter out sequences that do not contain a specific symbol $a$, while preserving the relative probabilities of sequences provided by the RNN: $r'(x) \propto r(x)$ if $x$ contains $a$, and $r'(x) = 0$ otherwise. There appears to be no obvious way to realize $r'$ through an RNN, apart from techniques similar to what we have been describing in our discussion of Training-2.

The situation is completely different with FSA’s. If you have a PFSA (Probabilistic FSA) generating sequences $x$, then you can intersect it with an automaton that accepts all sequences containing at least one $a$, and re-normalize the intersection through dynamic programming, obtaining a new PFSA that generates the filtered distribution. (Footnote 5: And this is exactly what we do to produce training data for our experiments, but using a binary sequence (motif) instead of a single symbol $a$.) Such dynamic programming, with the capacity to anticipate properties that need to be satisfied on the global sequence, is unavailable in the RNN world.

With RNNs, the situation is reminiscent of RL, with a reward associated with having observed an $a$ during the production of the sequence. But a standard RL approach would mean that we would try to maximize the reward, without taking into consideration the original $r$ that we are filtering from. To be correct, we need to find a policy $\pi_\theta$ (similar to an RNN), that tries to approximate $r'$ in a distributional sense, not in a maximization sense (see (Bellemare et al., 2017) for related considerations). This is what we try to do in Training-2, using motifs as our main case-study, instead of a single symbol (which would not make sense for binary strings).

The advantages of using PFSAs in our study are multiple. They provide a well-understood comparison point to the more complex techniques that need to be deployed for autoregressive models. From an operational viewpoint, they also permit, through dynamic programming, to perform various calculations of interest for our study, such as sampling datasets of arbitrary size and computing exact entropy and partition functions that can serve as comparison points for the results obtained with GAMs. In the present paper, we only exploited PFSA’s in the context of motifs, but they provide a much larger class of models that could serve to expand our understanding of sequence-based energy based models.

A.2.1 Computing the Entropy of a PFSA

As mentioned earlier, one advantage of using weighted finite-state automata for generating synthetic data is that some important quantities, such as entropy, mean sequence length, or partition function can be computed by dynamic programming.

Here we only derive a simple iterative method for computing the entropy of a PFSA; the other computations are very similar. (Footnote 6: For another technique, and for extensions to the computation of relative entropy, see Carrasco (1997).)

We consider a PFSA with transitions of the form $(q, l, q', w)$, where $q, q'$ are states, $l$ is the label of the transition from $q$ to $q'$ (in our case $l \in \{0, 1\}$), and $w$ is the probability of the transition. The fact that the automaton is probabilistic, instead of simply weighted, means that the sum of $w$’s associated with transitions starting at $q$ is equal to $1$. We further assume that the automaton is deterministic, namely that $q$ and $l$ uniquely determine the next state $q'$. (Footnote 7: The case of non-deterministic probabilistic automata appears much more difficult Cortes et al. (2008).)

The entropy of a state $q$ is defined as $H(q) \doteq -\sum_x p_q(x) \log p_q(x)$, where $x$ denotes a sequence of labels that ends in a final state of the automaton, for which $p_q(x)$ is computed in the obvious way. The entropy of the automaton as a whole is then defined as $H(q_s)$, where $q_s$ is the initial state of the automaton.

Lemma The entropies of states satisfy the fixpoint equation:

$$H(q) = \sum_{(q, l, q', w)} w \left( H(q') - \log w \right). \qquad (7)$$

Proof. Let’s denote by $q_l$ the state obtained from $q$ by following label $l$, and by $w_l$ the probability of that transition. A sequence starting with $l$ has the form $l \cdot x'$, with $p_q(l \cdot x') = w_l\, p_{q_l}(x')$. We have:

$$H(q) = -\sum_l \sum_{x'} w_l\, p_{q_l}(x') \left[ \log w_l + \log p_{q_l}(x') \right] = \sum_l w_l \left[ -\log w_l + H(q_l) \right],$$

where the last step uses $\sum_{x'} p_{q_l}(x') = 1$.

It is possible to show that the state entropies actually correspond to the least fixpoint of equation (7), and this allows a simple iterative algorithm for computing the state entropies: at time $t = 0$, for all states $q$, we define $H_0(q) = 0$, and then we iterate until convergence:

$$H_{t+1}(q) = \sum_{(q, l, q', w)} w \left( H_t(q') - \log w \right).$$
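The iteration can be sketched in a few lines of Python. The representation is an assumption made for illustration: the automaton is a dictionary mapping each state to its outgoing arcs `(label, next_state, prob)`, and final states are taken to have no outgoing arcs, so their entropy is 0. The helper `white_noise_pfsa` is a hypothetical example automaton generating all binary strings of a fixed length with equal probability.

```python
import math


def state_entropies(transitions, tol=1e-12, max_iter=100000):
    """Iterate H_{t+1}(q) = sum_w w * (H_t(q') - log w), starting from H_0 = 0.

    transitions: dict mapping state -> list of (label, next_state, prob).
    Assumption: final states have no outgoing arcs (entropy 0).
    """
    entropy = {q: 0.0 for q in transitions}
    for _ in range(max_iter):
        updated = {
            q: sum(w * (entropy[nxt] - math.log(w)) for _, nxt, w in arcs)
            for q, arcs in transitions.items()
        }
        delta = max(abs(updated[q] - entropy[q]) for q in transitions)
        entropy = updated
        if delta < tol:
            break
    return entropy


def white_noise_pfsa(n):
    """PFSA for fixed-length white-noise strings: state i = symbols emitted."""
    pfsa = {i: [("0", i + 1, 0.5), ("1", i + 1, 0.5)] for i in range(n)}
    pfsa[n] = []  # final state
    return pfsa
```

For the white-noise automaton of length $n$ the entropy of the initial state is $n \log 2$ nats, which gives a simple sanity check; for acyclic automata the iteration becomes exact after a number of steps equal to the longest path.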

A.3 Additional Experiments and Results

(See next pages)

tReg m: m: m: mam: mam: mam:

Table 3: Overall statistics: for , and , , .

1.02 1.21 1.02 1.51
1000 1.0 1.48 1.08 2.04
5000 1.04 0.57 1.0 0.57
10000 0.98 1.47 1.02 0.45
20000 0.99 2.65 1.0 0.28

Table 4: Cyclical training vs two stage training for motif ;
Figure 3: (Cyclical training) Cross-entropy in nats per character, and frequency of sampling motif, depending on the , while all distractive features and motif feature are 1:
Figure 4: Column 1: Cross-entropy; column 2: ; column 3: frequency of sampling motif, depending on the , all distractive features are 1: . Setting: supermotif+submotif, pure , while varying the rareness of the motif ().
Figure 5: Column 1: Cross-entropy; column 2: ; column 3: frequency of sampling motif, depending on the , all distractive features are 1: . Setting: motif+submotif, pure , while varying the rareness of the motif ().
Figure 6: Column 1: Cross-entropy; column 2: ; column 3: frequency of sampling motif, depending on the , all distractive features are 1: . Setting: motif, pure , while varying the rareness of the motif ().
Figure 7: Column 1: Cross-entropy; column 2: ; column 3: frequency of sampling motif, depending on the , all distractive features are 1: . Setting: motif, mixture , while varying the rareness of the motif ().