Multinomial Loss on Held-out Data for the Sparse Non-negative Matrix Language Model

11/05/2015, by Ciprian Chelba, et al., Google

We describe Sparse Non-negative Matrix (SNM) language model estimation using multinomial loss on held-out data. Being able to train on held-out data is important in practical situations where the training data is usually mismatched from the held-out/test data. It is also less constrained than the previous training algorithm using leave-one-out on training data: it allows the use of richer meta-features in the adjustment model, e.g. the diversity counts used by Kneser-Ney smoothing, which would be difficult to deal with correctly in leave-one-out training. In experiments on the one billion words language modeling benchmark, we are able to slightly improve on our previous results which use a different loss function, and employ leave-one-out training on a subset of the main training set. Surprisingly, an adjustment model with meta-features that discard all lexical information can perform as well as lexicalized meta-features. We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model. In a real-life scenario where the training data is a mix of data sources that are imbalanced in size, and of different degrees of relevance to the held-out and test data, taking into account the data source for a given skip-/n-gram feature and combining them for best performance on held-out/test data improves over skip-/n-gram SNM models trained on pooled data by about 8% in the SMT setup, or as much as 15% in the ASR setup. The ability to mix various data sources based on how relevant they are to a mismatched held-out set is probably the most attractive feature of the new estimation method for SNM LM.


1 Introduction

A statistical language model estimates probability values $P(W)$ for strings of words $W = w_1, \ldots, w_N$ over a vocabulary $\mathcal{V}$ whose size is in the tens to hundreds of thousands, and sometimes even millions. Typically the string $W$ is broken into sentences, or other segments such as utterances in automatic speech recognition, which are often assumed to be conditionally independent; we will assume that $W$ is such a segment, or sentence.

Since the parameter space of $P(w_k \mid w_1, \ldots, w_{k-1})$ is too large, the language model is forced to put the context $w_1, \ldots, w_{k-1}$ into an equivalence class determined by a function $\Phi(w_1, \ldots, w_{k-1})$. As a result,

P(W) \cong \prod_{k=1}^{N} P(w_k \mid \Phi(w_1, \ldots, w_{k-1}))    (1)

Research in language modeling consists of finding appropriate equivalence classifiers $\Phi$ and methods to estimate $P(w_k \mid \Phi(w_1, \ldots, w_{k-1}))$. Once the form $\Phi$ is specified, only the problem of estimating $P(w_k \mid \Phi(w_1, \ldots, w_{k-1}))$ from training data remains.

Perplexity as a Measure of Language Model Quality

A statistical language model can be evaluated by how well it predicts a string of symbols $W_t$—commonly referred to as test data—generated by the source to be modeled.

A commonly used quality measure for a given model $M$ is related to the entropy of the underlying source and was introduced under the name of perplexity (PPL):

PPL(M) = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \ln P_M(w_k \mid w_1, \ldots, w_{k-1})\right)    (2)

For an excellent discussion on the use of perplexity in statistical language modeling, as well as various estimates for the entropy of English the reader is referred to [jelinek97], Section 8.4, pages 141-142 and the additional reading suggested in Section 8.5 of the same book.
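To make Eq. (2) concrete, here is a minimal sketch (not from the paper) of computing perplexity over a set of sentences; model_prob is a hypothetical callable standing in for $P_M(w_k \mid w_1, \ldots, w_{k-1})$.

```python
import math

def perplexity(sentences, model_prob):
    """Compute PPL per Eq. (2): exp of the average negative log-probability.

    `sentences` is an iterable of word lists; `model_prob(word, context)` is a
    hypothetical callable returning the model probability of `word` given its context.
    """
    total_log_prob = 0.0
    num_predictions = 0
    for sentence in sentences:
        context = []
        for word in sentence:
            total_log_prob += math.log(model_prob(word, tuple(context)))
            num_predictions += 1
            context.append(word)
    return math.exp(-total_log_prob / num_predictions)
```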

2 Notation and Modeling Assumptions

We denote with $e$ an event in the training/development/test data corresponding to each prediction in Eq. (1); each event $e$ consists of:

  • a set of features $F(e) \subset \mathcal{F}$, where $\mathcal{F}$ denotes the set of features in the model, collected on the training data;

  • a predicted (target) word $w(e)$ from the LM vocabulary $\mathcal{V}$; we denote with $|\mathcal{V}|$ the size of the vocabulary.

The set of features $F(e)$ is obtained by applying the equivalence classification function $\Phi$ to the context of the prediction. The most successful model so far has been the $n$-gram model, extracting all $k$-gram features of length $k = 0, \ldots, n-1$ from the context. (The empty feature is considered to have length 0; it is present in every event $e$, and it produces the unigram distribution on the language model vocabulary.)
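As an illustration (a toy sketch, not the production extractor), the following enumerates all such n-gram features from a context:

```python
def ngram_features(context, max_n=4):
    """Extract n-gram features of length 0..max_n from the context words;
    the empty feature of length 0 yields the unigram distribution."""
    features = ["[]"]  # the empty feature, present in every event
    for k in range(1, max_n + 1):
        if k <= len(context):
            features.append("[" + " ".join(context[-k:]) + "]")
    return features

# e.g. the context for predicting "fox":
print(ngram_features("the quick brown".split()))
# ['[]', '[brown]', '[quick brown]', '[the quick brown]']
```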

2.1 Skip-n-gram Language Modeling

A simple variant on the $n$-gram model is the skip-$n$-gram model; a skip-$n$-gram feature extracted from the context is characterized by the tuple $(r, s, a)$ where:

  • $r$ denotes the number of remote context words

  • $s$ denotes the number of skipped words

  • $a$ denotes the number of adjacent context words

relative to the target word being predicted. For example, in the sentence,
<S> The quick brown fox jumps over the lazy dog </S>
a skip-gram feature for the target word dog is:
[brown skip-2 over the lazy]

To control the size of $\mathcal{F}$ it is recommended to limit the skip length $s$ and also either or both $r$ and $a$; not setting any such upper bounds will result in events containing a set of skip-gram features whose total representation size is quintic in the length of the sentence.

We configure the skip-$n$-gram feature extractor to produce all features $f$, defined by the equivalence class $\Phi$, that meet constraints on the minimum and maximum values for:

  • the number of context words used, $r + a$;

  • the number of remote words, $r$;

  • the number of adjacent words, $a$;

  • the skip length, $s$.

We also allow the option of not including the exact value of $s$ in the feature representation; this may help with smoothing by sharing counts for various skip features. Tied skip-$n$-gram features will look like:
[curiosity skip-* the cat]
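To illustrate the $(r, s, a)$ parameterization and the tying of the skip length, here is a toy extractor sketch; it is not the production feature extractor, and the parameter names are made up for illustration:

```python
def skip_ngram_features(context, max_remote=1, max_adjacent=3,
                        max_skip=10, tie_skip_length=False):
    """Enumerate skip-n-gram features (r remote words, s skipped words,
    a adjacent words) from `context`, a list of words preceding the target."""
    features = []
    n = len(context)
    for a in range(1, max_adjacent + 1):          # adjacent words, right next to the target
        for s in range(1, max_skip + 1):          # skipped words
            for r in range(1, max_remote + 1):    # remote words
                if a + s + r > n:
                    continue
                adjacent = context[n - a:]
                remote = context[n - a - s - r:n - a - s]
                skip_tag = "skip-*" if tie_skip_length else "skip-%d" % s
                features.append("[%s %s %s]" % (" ".join(remote), skip_tag,
                                                " ".join(adjacent)))
    return features

# Example: target "dog" in "<S> The quick brown fox jumps over the lazy dog </S>"
context = "<S> The quick brown fox jumps over the lazy".split()
print("[brown skip-2 over the lazy]" in skip_ngram_features(context))  # True
```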

Sample feature extraction configuration files for an $n$-gram and a skip-$n$-gram SNM LM are presented in Appendices A and B, respectively. A simple extension that leverages context beyond the current sentence, as well as other categorical features such as geo-location, is presented and evaluated in [Chelba:ASRU2015].

In order to build a good probability estimate for the target word $w$ in a context, or an event $e$ in our notation, we need a way of combining an arbitrary number of features which do not fall into a simple hierarchy like regular $n$-gram features. The following section describes a simple yet novel approach for combining such predictors in a way that is computationally easy, scales gracefully to large amounts of data, and, as it turns out, is also very effective from a modeling point of view.

3 Multinomial Loss for the Sparse Non-negative Matrix Language Model

The sparse non-negative matrix (SNM) language model (LM) [Shazeer:2015]-[Shazeer:2015a] assigns probability to a word by applying the equivalence classification function $\Phi$ to the context of the prediction, as explained in the previous section, and then using a matrix $\mathbf{M}$, whose entry $M_{fw}$ is indexed by feature $f$ and word $w$. We further assume that the model is parameterized as a slight variation on the conditional relative frequencies for words given features, denoted $C_{rel}(f, w) = C(f, w)/C(f)$, where $C(\cdot)$ are counts collected on the training data:

M_{fw} = e^{A(f, w)} \cdot C_{rel}(f, w)    (3)

The adjustment function $A(f, w)$ is a real-valued function whose task is to estimate the relative importance of each input feature $f$ for the prediction of the given target word $w$. It is computed by a linear model on meta-features $h_k(f, w)$ extracted from each link $(f, w)$ and its associated feature $f$:

A(f, w) = \sum_{k} \theta_k \, h_k(f, w)    (4)

The meta-features are either strings identifying the feature type, feature, link etc., or bucketed feature and link counts. We also allow all possible conjunctions of elementary meta-features, and estimate a weight $\theta_k$ for each (elementary or conjoined) meta-feature $h_k$. In order to control the model size we use the hashing technique in [Ganchev:08], [Weinberger:2009]. The meta-feature extraction is explained in more detail in Section 3.2 and the associated Appendix C.
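As a concrete illustration of Eqs. (3)-(4), here is a minimal sketch, assuming meta-features have already been hashed into (index, value) pairs as described in Section 3.2; the function and variable names are hypothetical:

```python
import math

def adjustment(metafeatures, theta):
    """Eq. (4): A(f, w) is a linear model over (hashed) meta-features,
    given as (index, value) pairs extracted from the link (f, w)."""
    return sum(theta[k] * value for k, value in metafeatures)

def adjusted_weight(rel_freq, metafeatures, theta):
    """Eq. (3): M_fw = exp(A(f, w)) * relative frequency of w given f."""
    return math.exp(adjustment(metafeatures, theta)) * rel_freq
```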

Assuming we have a sparse matrix $\mathbf{M}$ of adjusted relative frequencies, the probability of an event $e$ predicting word $w(e)$ in context $F(e)$ is computed as follows:

P(w(e) \mid F(e)) = \frac{\sum_{f \in \mathcal{F}} \mathbb{1}_f(e) \, M_{f w(e)}}{Z(e)}

where $Z(e)$ ensures that the model is properly normalized over the LM vocabulary:

Z(e) = \sum_{w \in \mathcal{V}} \sum_{f \in \mathcal{F}} \mathbb{1}_f(e) \, M_{fw}

and the indicator functions $\mathbb{1}_f(e)$ and $\mathbb{1}_w(e)$ select a given feature $f$, and target word $w$ in the event $e$, respectively:

\mathbb{1}_f(e) = 1 \text{ if } f \in F(e), \text{ else } 0; \qquad \mathbb{1}_w(e) = 1 \text{ if } w = w(e), \text{ else } 0.

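The event probability computation above can be sketched in a few lines, assuming a toy in-memory representation of $\mathbf{M}$ as a Python dict keyed by $(f, w)$ (the production storage is described in Section 3.1):

```python
def event_probability(features, target, M, vocab):
    """P(target | features) = sum_f M[f, target] / Z(e), with
    Z(e) = sum_{w in vocab} sum_f M[f, w]; only links present in M contribute."""
    numerator = sum(M.get((f, target), 0.0) for f in features)
    # Iterating over the full vocabulary is wasteful; a real implementation
    # would sum the sparse rows of M for the features in the event instead.
    Z = sum(M.get((f, w), 0.0) for f in features for w in vocab)
    return numerator / Z
```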
With this notation and using the shorthand $y(e) = \sum_{f \in \mathcal{F}} \mathbb{1}_f(e) \, M_{f w(e)}$, the derivative of the log-probability for event $e$ with respect to the adjustment function for a given link $(f, w)$ is:

\frac{\partial \log P(w(e) \mid F(e))}{\partial A(f, w)} = \mathbb{1}_f(e) \left[ \mathbb{1}_w(e) \, \frac{M_{fw}}{y(e)} - \frac{M_{fw}}{Z(e)} \right]    (5)

making use of the fact that

\frac{\partial M_{fw}}{\partial A(f, w)} = M_{fw}.

Propagating the gradient to the parameters $\theta_k$ of the adjustment function is done using mini-batch estimation, for the reasons detailed in Section 3.1:

\theta_k \leftarrow \theta_k + \eta_k(t) \sum_{e \in B_t} \sum_{(f, w)} \frac{\partial \log P(w(e) \mid F(e))}{\partial A(f, w)} \, h_k(f, w)    (6)

Rather than using a single fixed learning rate $\eta$, we use AdaGrad [Duchi:2011], which uses a separate adaptive learning rate $\eta_k(t)$ for each weight $\theta_k$:

\eta_k(t) = \frac{\gamma}{\sqrt{\Delta_0 + \sum_{t'=1}^{t} g_k(t')^2}}    (7)

where $t$ is the current batch index, $\gamma$ is a constant scaling factor for all learning rates, $\Delta_0$ is an initial accumulator constant, and $g_k(t')$ is the gradient with respect to $\theta_k$ accumulated on batch $t'$. Basing the learning rate on historical gradient information tempers the effect of frequently occurring features, which keeps the weights small and as such acts as a form of regularization.
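A sketch of the per-weight AdaGrad update of Eq. (7) applied to the mini-batch gradient of Eq. (6); the default values for gamma and delta0 below are placeholders, not the values used in the experiments:

```python
import numpy as np

class AdaGrad:
    """Per-weight adaptive learning rates, as in Eq. (7)."""

    def __init__(self, num_weights, gamma=0.1, delta0=1.0):
        self.theta = np.zeros(num_weights)          # meta-feature weights theta_k
        self.accum = np.full(num_weights, delta0)   # Delta_0 + sum of squared gradients
        self.gamma = gamma

    def apply_batch_gradient(self, grad):
        """`grad` is the gradient of the batch log-likelihood w.r.t. theta
        (gradient ascent, since we maximize the log-likelihood)."""
        self.accum += grad ** 2
        self.theta += self.gamma / np.sqrt(self.accum) * grad
```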

3.1 Implementation Notes

From a computational point of view, the two main issues with a straightforward gradient descent parameter update (either on-line or batch) are:

  1. the second term on the right-hand-side (RHS) of Eq. (5) is an update that needs to be propagated to all words in the vocabulary, irrespective of whether they occur on a given training event or not;

  2. keeping the model normalized after a parameter update means recomputing all normalization coefficients.

For mini-/batch updates, the model renormalization is done at the end of each training epoch/iteration, and it is no longer a problem. To side-step the first issue, we notice that mini-/batch updates allow us to accumulate the values $\sum_{e \in B} \mathbb{1}_f(e)/Z(e)$ across the entire mini-/batch $B$, and adjust the cumulative gradient at the end of the mini-/batch, in effect computing:

\sum_{e \in B} \frac{\partial \log P(w(e) \mid F(e))}{\partial A(f, w)} = \sum_{e \in B} \mathbb{1}_f(e) \, \mathbb{1}_w(e) \, \frac{M_{fw}}{y(e)} \;-\; M_{fw} \sum_{e \in B} \frac{\mathbb{1}_f(e)}{Z(e)}    (8)

In summary, we use two maps to compute the gradient updates over a mini-/batch: one keyed by $(f, w)$ pairs, and one keyed by the feature $f$. The first map accumulates the first term on the RHS of Eq. (8), and is updated once for each link $(f, w(e))$ occurring in a training event $e$. The second map accumulates the $\sum_{e} \mathbb{1}_f(e)/Z(e)$ values, and is again updated only for each of the features encountered on a given event in the mini-/batch. At the end of the mini-/batch we update the entries in the first map according to Eq. (8) such that they store the cumulative gradient; these are then used to update the parameters of the adjustment function according to Eq. (6).
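A toy sketch of the two-map bookkeeping just described, assuming the per-event quantities $y(e)$ and $Z(e)$ have already been computed; it returns the cumulative Eq. (8) gradient for every link touched in the batch:

```python
from collections import defaultdict

def batch_link_gradients(batch, M):
    """Accumulate Eq. (8) over a mini-batch.

    Each event in `batch` is (features, target, y_e, Z_e), where y_e and Z_e
    are the event's numerator and normalizer. Returns {(f, w): cumulative gradient}.
    """
    link_grad = defaultdict(float)   # keyed by (f, w): first term of Eq. (8)
    feat_norm = defaultdict(float)   # keyed by f: sum over events of 1_f(e) / Z(e)

    for features, target, y_e, Z_e in batch:
        for f in features:
            link_grad[(f, target)] += M.get((f, target), 0.0) / y_e
            feat_norm[f] += 1.0 / Z_e

    # Adjust the cumulative gradient at the end of the mini-batch (second term).
    for (f, w) in list(link_grad.keys()):
        link_grad[(f, w)] -= M.get((f, w), 0.0) * feat_norm[f]
    return link_grad
```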

The model $M_{fw}$ and the normalization coefficients are stored in maps keyed by $(f, w)$ and $f$, respectively. The $M_{fw}$ map is initialized with relative frequencies computed from the training data; on disk they are stored in an SSTable [Chang:2008] keyed by $f$, with $f$ and $w$ represented as plain strings. For training the adjustment model we only need the rows of the $\mathbf{M}$ matrix that are encountered on development data (i.e., the training data for the adjustment model). A MapReduce [Ghemawat:2004] with two inputs extracts and intersects the features encountered on development data with the features collected on the main training data—where the relative frequencies were also computed. The output is a significantly smaller matrix $\mathbf{M}$ that is loaded in RAM and used to train the adjustment model.
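In place of the SSTable/MapReduce pipeline, the same intersection can be sketched in memory for small data (a toy stand-in, not the production code):

```python
def restrict_to_dev_features(training_rel_freqs, dev_events):
    """Keep only links (f, w) whose feature f occurs in the development data.

    `training_rel_freqs` maps (f, w) -> relative frequency from the main training data;
    `dev_events` is an iterable of (features, target) pairs from the held-out set.
    """
    dev_features = {f for features, _ in dev_events for f in features}
    return {(f, w): v for (f, w), v in training_rel_freqs.items() if f in dev_features}
```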

3.2 Meta-features Extraction

The process of breaking down the original features into meta-features and recombining them allows similar features, i.e. features that differ only in some of their base components, to share weights, thus improving generalization.

Given an event the quick brown fox, the 4-gram feature for the prediction of the target fox would be broken down into the following elementary meta-features:

  • feature identity, e.g. [the quick brown]

  • feature type, e.g. 3-gram

  • feature count

  • target identity, e.g. fox

  • feature-target count

Elementary meta-features of different types are then joined with others to form more complex meta-features, as best described by the pseudo-code in Appendix C; note that the seemingly absent feature-target identity is represented by the conjunction of the feature identity and the target identity.

As count meta-features of the same order of magnitude carry similar information, we group them so they can share weights. We do this by bucketing the count meta-features according to their (floored) value. Since this effectively puts the lowest count values, of which there are many, into a different bucket, we optionally introduce a second (ceiled) bucket to assure smoother transitions. Both buckets are then weighted according to the fraction lost by the corresponding rounding operation.

To control memory usage, we employ a feature hashing technique [Ganchev:08], [Weinberger:2009] where we store the meta-feature weights in a flat hash table of predefined size; strings are fingerprinted, counts are hashed, and the resulting integer is mapped to an index in the table by taking its value modulo the pre-defined size. We do not prevent collisions, which has the potentially undesirable effect of tying together the weights of different meta-features. However, when this happens the most frequent meta-feature will dominate the final value after training, which essentially boils down to a form of pruning. Because of this the model performance does not strongly depend on the size of the hash table.
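A sketch of the count bucketing and feature hashing just described, mirroring the pseudo-code in Appendix C; Python's built-in hash stands in for the fingerprinting function, and hash_size plays the role of the pre-defined table size:

```python
import math

def bucketed_count_metafeatures(count, hash_size):
    """Bucket a count on a log2 scale into a floored and a ceiled bucket,
    each weighted by the fraction lost to the corresponding rounding operation."""
    ln_count = math.log(count, 2)
    lo, hi = math.floor(ln_count), math.ceil(ln_count)
    return [(hash(("count_bucket", lo)) % hash_size, hi - ln_count),
            (hash(("count_bucket", hi)) % hash_size, ln_count - lo)]

def string_metafeature(s, hash_size):
    """Map a string meta-feature (feature id, type, target id, ...) to a slot
    in the flat weight table; collisions are allowed."""
    return (hash(s) % hash_size, 1.0)
```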

4 Experiments

4.1 Experiments on the One Billion Words Language Modeling Benchmark

Our first experimental setup used the One Billion Word Benchmark corpus (http://www.statmt.org/lm-benchmark) made available by [Chelba:2014]. For completeness, here is a short description of the corpus, containing only monolingual English data:

  • Total number of training tokens is about 0.8 billion

  • The vocabulary provided consists of 793471 words including sentence boundary markers <S>, </S>, and was constructed by discarding all words with count below 3

  • Words outside of the vocabulary were mapped to an <UNK> token, also part of the vocabulary

  • Sentence order was randomized

  • The test data consisted of 159658 words (without counting the sentence beginning marker <S> which is never predicted by the language model)

  • The out-of-vocabulary (OOV) rate on the test set was 0.28%.

The foremost concern when using held-out data for estimating the adjustment model is the limited amount of data available in a practical setup, so we used a small development set consisting of 33 thousand words.

We conducted experiments using two feature extraction configurations identical to those used in [Shazeer:2015]: 5-gram and skip-10-gram, see Appendices A and B. The AdaGrad scaling factor $\gamma$ and initial accumulator $\Delta_0$ in Eq. (7) are held fixed, and the mini-batch size is 2048 samples. We also experimented with various adjustment model sizes (200M, 20M, and 200k hashed parameters), non-lexicalized meta-features, and feature-only meta-features, see Appendix C. The results are presented in Tables 1-2.

A first conclusion is that we can indeed get away with very small amounts of development data. This is excellent news, because in practice large amounts of development data for tuning parameters are rarely available; see the SMT experiments presented in the next section. Using meta-features computed only from the feature component of a link does lead to a fairly significant increase in PPL: 5% relative for the 5-gram configuration, and 10% relative for the skip-10-gram configuration.

Surprisingly, when using the 5-gram configuration, discarding the lexicalized meta-features consistently performs slightly better than the lexicalized model; for the skip-10-gram configuration the un-lexicalized model performs essentially as well as the lexicalized model. The number of parameters in the model is very small in this case (on the order of a thousand), so the model no longer over-trains after the first iteration as was the case when using lexicalized link meta-features; meta-feature hashing is not necessary either.

In summary, training and evaluating in exactly the same training/test setup as the one in [Shazeer:2015] we find that:

  1. 5-gram config: using multinomial loss training on 33 thousand words of development data, 200K or larger adjustment model, and un-lexicalized meta-features trained over 5 epochs produces 5-gram SNM PPL of 69.6, which is just a bit better than the 5-gram SNM PPL of 70.8 reported in [Shazeer:2015], Table 1, and very close to the Kneser-Ney PPL of 67.6.

  2. skip-10-gram config: using multinomial loss training on 33 thousand words of development data, 20M or larger adjustment model, and un-lexicalized meta-features trained over 5 epochs produced skip-10-gram SNM PPL of 50.9, again just a bit better than both the skip-10-gram SNM PPL of 52.9 and the RNN-LM PPL of 51.3 reported in [Shazeer:2015], Table 3, respectively.

Model Size (max num hashed params) / Num Training Epochs / Metafeatures Extraction: lexicalized, feature-only / Test Set PPL / Actual Num Hashed Params (non-zero)
0 Unadjusted Model 86.0 0
200M 1 yes no 71.4 116205951
yes yes 75.8 72447
no no 70.3 567
5 yes no 78.7 116205951
yes yes 73.9 72447
no no 69.6 567
20M 1 yes no 71.4 20964888
yes yes 75.8 72344
no no 70.3 567
5 yes no 78.8 20964888
yes yes 73.9 72447
no no 69.6 567
200K 1 yes no 72.0 204800
yes yes 75.9 61022
no no 70.3 566
5 yes no 84.8 204800
yes yes 73.9 61022
no no 69.6 567
Table 1: Experiments on the One Billion Words Language Modeling Benchmark in 5-gram configuration; 2048 mini-batch size, one and five training epochs.
Model Size (max num hashed params) / Num Training Epochs / Metafeatures Extraction: lexicalized, feature-only / Test Set PPL / Actual Num Hashed Params (non-zero)
0 Unadjusted Model 69.2 0
200M 1 yes no 52.2 209234366
yes yes 58.0 740836
no no 52.2 1118
5 yes no 54.3 209234366
yes yes 56.1 740836
no no 50.9 1118
20M 1 yes no 52.2 20971520
yes yes 58.0 560006
no no 52.2 1117
5 yes no 54.4 20971520
yes yes 56.1 560006
no no 50.9 1117
200K 1 yes no 52.4 204800
yes yes 58.0 194524
no no 52.2 1112
5 yes no 56.5 204800
yes yes 56.1 194524
no no 51.0 1112
Table 2: Experiments on the One Billion Words Language Modeling Benchmark in skip-10-gram configuration; 2048 mini-batch size, one and five training epochs.

4.2 Experiments on 10B Words of Burmese Data in Statistical Machine Translation Language Modeling Setup

In a separate set of experiments on Burmese data provided by the statistical machine translation (SMT) team, the held-out data (66 thousand words) and the test data (22 thousand words) are mismatched to the training data, which consists of 11 billion words mostly crawled from the web (and labelled as "web") along with 176 million words (labelled as "target") originating from the parallel data used for training the channel model. The vocabulary size is 785261 words including sentence boundary markers; the out-of-vocabulary rate on both the held-out and test set is 0.6%.

To quantify statistically the mismatch between training and held-out/test data, we trained both Katz and interpolated Kneser-Ney 5-gram models on the pooled training data; the Kneser-Ney LM has PPL of 611 and 615 on the held-out and test data, respectively, whereas the Katz LM is far more severely mismatched, with PPL of 4153 and 4132, respectively. (The cumulative hit-ratios on test data at orders 5 through 1 were 0.2/0.3/0.6/0.9/1.0 for the KN model, and 0.1/0.3/0.6/0.9/1.0 for the Katz model, which may explain the large gap in performance between KN and Katz: the diversity counts that KN falls back on about 80% of the time are more robust to mismatched training/test conditions than the relative frequencies used by Katz.) Because of the mismatch between the training and the held-out/test data, the PPL of the un-adjusted SNM 5-gram LM is significantly lower than that of the SNM adjusted using leave-one-out [Shazeer:2015] on a subset of the shuffled training set: 710 versus 1285.

The full set of results in this experimental setup is presented in Tables 4-5.

When using multinomial adjustment model training on held-out data things fall into place, and the adjusted SNM 5-gram has lower PPL than the unadjusted one: 347 vs. 710; the former is also significantly lower than the 1285 value produced by leave-one-out training. The skip-5-gram SNM model (a trimmed down version of the skip-10-gram in Appendix B) has a PPL of 328, improving only modestly over the 5-gram SNM result—perhaps due to the mismatch between training and development/test data.

We also note that the lexicalized adjustment model works significantly better than either the feature-only or the un-lexicalized one, in contrast to the behavior on the one billion words benchmark.

As an extension we experimented with SNM training that takes into account the data source for a given skip-/n-gram feature, and combines them for best performance on held-out/test data by taking into account the identity of the data source as well. This is the reality of most practical scenarios for training language models. We refer to such features as corpus-tagged features: in training we augment each feature with a tag describing the training corpus it originates from, in this case web and target, respectively; on held-out and test data the event extractor augments each feature with each of the corpus tags seen in training. The adjustment function is then trained to assign a weight to each such corpus-tagged feature. Corpus tagging the features and letting the adjustment model do the combination reduced PPL by about 8% relative over the model trained on pooled data, in both the 5-gram and skip-5-gram configurations.
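A toy sketch of the corpus tagging described above; the "@" separator and the tag names are illustrative only:

```python
def tag_training_features(features, corpus_tag):
    """At training time, each feature records the corpus it originates from."""
    return ["%s@%s" % (f, corpus_tag) for f in features]

def expand_eval_features(features, corpus_tags=("web", "target")):
    """On held-out/test data the event extractor emits one copy of every
    feature per training corpus tag; the adjustment model weights them."""
    return ["%s@%s" % (f, tag) for f in features for tag in corpus_tags]
```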

4.3 Experiments on 35B Words of Italian Data in Language Modeling Setup for Automatic Speech Recognition

We have experimented with SNM LMs in the LM training setup for Italian used in the automatic speech recognition (ASR) project. The same LM is used for two distinct types of ASR requests: voice-search queries (VS) and Android Input Method (IME, speech input on the soft keyboard). As a result we use two separate test sets to evaluate the LM performance, one each for VS and IME.

The held-out data used for training the adjustment model is a mix of VS and IME transcribed utterances, consisting of 36 thousand words split 30/70% between VS/IME, respectively. The adjustment model used 20 million parameters trained using mini-batch AdaGrad (2048 samples batch size) in one epoch.

The training data consists of a total of 35 billion words from various sources, of varying size and degree of relevance to either of the test sets:

  • google.com (111 Gbytes) and maps.google.com (48 Gbytes) query stream

  • high quality web crawl (5 Gbytes)

  • automatically transcribed utterances filtered by ASR confidence for both VS and IME (4.1 and 0.5 Gbytes, respectively)

  • manually transcribed utterances for both VS and IME (0.3 and 0.5 Gbytes, respectively)

  • voice actions training data (0.1 Gbytes)

As a baseline for the SNM we built Katz and interpolated Kneser-Ney 5-gram models by pooling all the training data. We then built a 5-gram SNM LM, as well as a corpus-tagged SNM 5-gram where each n-gram is tagged with the identity of the corpus it occurs in (one of seven tags). Skip-grams were added to either of the SNM models. The results are presented in Table 3; the vocabulary used to train all language models being compared consisted of 4 million words.

Model / IME Perplexity / VS Perplexity
Katz 5-gram 177 154
Interpolated Kneser-Ney 5-gram 152 142
SNM 5-gram, adjusted 104 126
SNM 5-gram, corpus-tagged, adjusted 88 124
SNM 5-gram, skip-gram, adjusted 96 119
SNM 5-gram, skip-gram, corpus-tagged, adjusted 86 119
Table 3: Perplexity Results of Various Approaches to Language Modeling in the Setup Used for Italian ASR.

A first observation is that the SNM 5-gram LM outperforms both Katz and Kneser-Ney LMs significantly on both test sets. We attribute this to the ability of the adjustment model to optimize the combination of various -gram contexts such that they maximize the likelihood of the held-out data; no such information is available to either of the Katz/Kneser-Ney models.

Augmenting the SNM 5-gram with corpus-tags benefits mostly the IME performance; we attribute this to the fact that the vast majority of the training data is closer to the VS test set, and clearly separating the training sources (in particular the ones meant for the IME component of the LM such as web crawl and IME transcriptions) allows the adjustment model to optimize better for that subset of the held-out data. Skip-grams offer relatively modest improvements over either SNM 5-gram models.

5 Conclusions and Future Work

The main conclusion is that training the adjustment model on held-out data using multinomial loss introduces many advantages while matching the previous results reported in [Shazeer:2015]: as observed in [Xu:2011], Section 2, using a binary probability model is expected to yield the same model as a multinomial probability model. Correcting the deficiency in [Shazeer:2015] induced by using a Poisson model for each binary random variable does not seem to make a difference in the quality of the estimated model.

Being able to train on held-out data is very important in practical situations where the training data is usually mismatched from the held-out/test data. It is also less constrained than the previous training algorithm using leave-one-out on training data: it allows the use of richer meta-features in the adjustment model, e.g. the diversity counts used by Kneser-Ney smoothing, which would be difficult to deal with correctly in leave-one-out training. Taking into account the data source for a given skip-/n-gram feature, and combining them for best performance on held-out/test data, improves over SNM models trained on pooled data by about 8% in the SMT setup, or as much as 15% in the ASR/IME setup.

We find that fairly small amounts of held-out data (on the order of 30-70 thousand words) are sufficient for training the adjustment model. Surprisingly, using meta-features that discard all lexical information can sometimes perform as well as lexicalized meta-features, as demonstrated by the results on the One Billion Words Benchmark corpus.

Given the properties of the SNM n-gram LM explored so far:

  • ability to mix various data sources based on how relevant they are to a given held-out set, thus providing an alternative to Bayesian mixing algorithms such as [Allauzen:2011],

  • excellent pruning properties relative to entropy pruning of Katz and Kneser-Ney models [Pelemans:2015],

  • conversion to standard ARPA back-off format [Pelemans:2015],

  • effortless incorporation of richer features such as skip-n-grams and geo-tags [Chelba:ASRU2015],

we believe SNM could provide the estimation backbone for a fully-fledged LM training pipeline used in a real-life setup.

A comparison of SNM against maximum entropy modeling at feature extraction parity is also long overdue.

Model Size (max num hashed params) / Num Training Epochs / Metafeatures Extraction: lexicalized, feature-only / Test Set PPL / Actual Num Hashed Params (non-zero)
Leave-one-out
200M yes no 1285
Multinomial
0 Unadjusted Model 710 0
200M 1 yes no 352 103549851
yes yes 653 87875
no no 569 716
5 yes no 347 103549851
yes yes 638 87875
no no 559 716
20M 1 yes no 353 20963883
yes yes 653 87712
no no 569 716
5 yes no 348 20963883
yes yes 638 87712
no no 559 716
200K 1 yes no 371 204800
yes yes 653 71475
no no 569 713
5 yes no 400 204800
yes yes 638 71475
no no 560 713
Multinomial, corpus-tagged
0 Unadjusted Model 574 0
200M 1 yes no 323 129291753
yes yes 502 157684
no no 447 718
5 yes no 324 129291753
yes yes 488 157684
no no 442 718
20M 1 yes no 323 20970091
yes yes 502 157141
no no 447 718
5 yes no 324 20970091
yes yes 488 157141
no no 442 718
200K 1 yes no 334 204800
yes yes 502 110150
no no 447 715
5 yes no 356 204800
yes yes 489 110150
no no 442 715
Table 4: SMT Burmese Dataset experiments in 5-gram configuration, with and without corpus-tagged feature extraction; 2048 mini-batch size, one and five training epochs.
Model Size (max num hashed params) / Num Training Epochs / Metafeatures Extraction: lexicalized, feature-only / Test Set PPL / Actual Num Hashed Params (non-zero)
Multinomial
0 Unadjusted Model 687 0
200M 1 yes no 328 209574343
yes yes 587 772743
no no 496 1414
20M 1 yes no 328 20971520
yes yes 587 760066
no no 496 1414
200K 1 yes no 342 204800
yes yes 587 200060
no no 496 1408
Multinomial, corpus-tagged
0 Unadjusted Model 567 0
200M 1 yes no 302 209682449
yes yes 474 1366944
no no 405 1416
20M 1 yes no 303 20971520
yes yes 474 1327393
no no 405 1416
200K 1 yes no 312 204800
yes yes 474 204537
no no 405 1409
Table 5: SMT Burmese Dataset experiments in skip-5-gram configuration, with and without corpus-tagged feature extraction; 2048 mini-batch size, one training epoch.

6 Acknowledgments

Thanks go to Yoram Singer for clarifying the correct mini-batch variant of AdaGrad, Noam Shazeer for assistance on understanding his implementation of the adjustment function estimation, Diamantino Caseiro for code reviews, Kunal Talwar, Amir Globerson and Diamantino Caseiro for useful discussions, and Anton Andryeyev for providing the SMT training/held-out/test data sets. Last, but not least, we are thankful to our former summer intern Joris Pelemans for suggestions while preparing the final version of the paper.

Appendix A Appendix: 5-gram Feature Extraction Configuration

// Sample config generating a straight 5-gram language model.
ngram_extractor {
  min_n: 0
  max_n: 4
}

Appendix B Appendix: skip-10-gram Feature Extraction Configuration

// Sample config generating a straight skip-10-gram language model.
ngram_extractor {
  min_n: 0
  max_n: 9
}
skip_ngram_extractor {
  max_context_words: 4
  min_remote_words: 1
  max_remote_words: 1
  min_skip_length: 1
  max_skip_length: 10
  tie_skip_length: true
}
skip_ngram_extractor {
  max_context_words: 5
  min_skip_length: 1
  max_skip_length: 1
  tie_skip_length: false
}

Appendix C Appendix: Meta-features Extraction Pseudo-code

// Metafeatures are represented as tuples (hash_value, weight).
// Concat(metafeatures, end_pos, mf_new) concatenates mf_new
// with all the existing metafeatures up to end_pos.
function ComputeMetafeatures(FeatureTargetPair pair):
  // feature-related metafeatures
  metafeatures <- (Fingerprint(pair.feature.id()), 1.0)
  metafeatures <- (Fingerprint(pair.feature.type()), 1.0)
  ln_count = log(pair.feature.count()) / log(2)
  bucket1 = floor(ln_count)
  bucket2 = ceil(ln_count)
  weight1 = bucket2 - ln_count
  weight2 = ln_count - bucket1
  metafeatures <- (Hash(bucket1), weight1)
  metafeatures <- (Hash(bucket2), weight2)

  // target-related metafeatures
  Concat(metafeatures, metafeatures.size(), (Fingerprint(pair.target.id()), 1.0))

  // feature-target-related metafeatures
  ln_count = log(pair.count()) / log(2)
  bucket1 = floor(ln_count)
  bucket2 = ceil(ln_count)
  weight1 = bucket2 - ln_count
  weight2 = ln_count - bucket1
  Concat(metafeatures, metafeatures.size(), (Hash(bucket1), weight1))
  Concat(metafeatures, metafeatures.size(), (Hash(bucket2), weight2))

  return metafeatures

References

[Allauzen:2011] Cyril Allauzen and Michael Riley. "Bayesian Language Model Interpolation for Mobile Speech Input," Proceedings of Interspeech, 1429–1432, 2011.

[Chang:2008] Fay Chang et al. "Bigtable: A Distributed Storage System for Structured Data," ACM Transactions on Computer Systems, vol. 26, no. 2, pp. 1–26, 2008.

[Chelba:2014] Ciprian Chelba, Tomáš Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling," Proceedings of Interspeech, 2635–2639, 2014.

[Chelba:ASRU2015] Ciprian Chelba and Noam Shazeer. "Sparse Non-negative Matrix Language Modeling for Geo-annotated Query Session Data," ASRU, to appear, 2015.

[Duchi:2011] John Duchi, Elad Hazan and Yoram Singer. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," Journal of Machine Learning Research, 12, 2121–2159, 2011.

[Ganchev:08] Kuzman Ganchev and Mark Dredze. "Small Statistical Models by Random Feature Mixing," Proceedings of the ACL-2008 Workshop on Mobile Language Processing, Association for Computational Linguistics, 2008.

[Ghemawat:2004] Sanjay Ghemawat and Jeff Dean. "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of OSDI, 2004.

[jelinek97] Frederick Jelinek. "Statistical Methods for Speech Recognition," MIT Press, Cambridge, MA, USA, 1997.

[Mikolov:2011] Tomáš Mikolov, Anoop Deoras, Daniel Povey, Lukáš Burget and Jan Černocký. "Strategies for Training Large Scale Neural Network Language Models," Proceedings of ASRU, 196–201, 2011.

[Pelemans:2015] Joris Pelemans, Noam M. Shazeer and Ciprian Chelba. "Pruning Sparse Non-negative Matrix N-gram Language Models," Proceedings of Interspeech, 1433–1437, 2015.

[Shazeer:2015] Noam Shazeer, Joris Pelemans and Ciprian Chelba. "Skip-gram Language Modeling Using Sparse Non-negative Matrix Probability Estimation," CoRR, abs/1412.1454, 2014. Available: http://arxiv.org/abs/1412.1454.

[Shazeer:2015a] Noam Shazeer, Joris Pelemans and Ciprian Chelba. "Sparse Non-negative Matrix Language Modeling For Skip-grams," Proceedings of Interspeech, 1428–1432, 2015.

[Xu:2011] Puyang Xu, Asela Gunawardana, and Sanjeev Khudanpur. "Efficient Subsampling for Training Complex Language Models," Proceedings of EMNLP, 2011.

[Weinberger:2009] Kilian Weinberger et al. "Feature Hashing for Large Scale Multitask Learning," Proceedings of the 26th Annual International Conference on Machine Learning, ACM, pp. 1113–1120, 2009.