1 Introduction
Over the past decade, deep neural networks have become arguably the most popular model choice for a vast number of natural language processing (NLP) tasks and have constantly been delivering state-of-the-art results. Because neural network models assume continuous data, applying a neural network to text data first requires vectorizing the discrete input with a word embedding matrix through a lookup operation, which in turn assumes a predefined vocabulary set. For many NLP tasks, the vocabulary size can easily go up to the order of tens of thousands, which potentially makes the word embedding the largest portion of the trainable parameters. For example, a document classification task like AGnews Zhang et al. (2015) can include up to 60K unique words, with the embedding matrix accounting for 97.6% of the trainable parameters (Table 1), which leads to under-representation of the neural network's own parameters.

Intuitively, using the full or a very large vocabulary is neither economical, as it limits model applicability in computation- or memory-constrained scenarios Yogatama et al. (2015); Faruqui et al. (2015), nor necessary, as many words may contribute little to the end task and could have been safely removed from the vocabulary. Therefore, how to select the best vocabulary is a problem of both theoretical and practical interest. Somewhat surprisingly, this vocabulary selection problem is largely under-addressed in the literature: the de facto standard practice is frequency-based cutoff Luong et al. (2015); Kim (2014), retaining only the words more frequent than a certain threshold (Table 1). Although this simple heuristic has demonstrated strong empirical performance, its task-agnostic nature implies that it is likely not the optimal strategy for many tasks (or any task). Task-aware vocabulary selection strategies and a systematic comparison of different strategies are still lacking.
In this work, we present the first systematic study of the vocabulary selection problem. Our study will be based on text classification tasks, a broad family of NLP tasks including document classification (DC), natural language inference (NLI), natural language understanding in dialog systems (NLU), etc. Specifically, we aim to answer the following questions:


How important a role does the vocabulary selection algorithm play in text classification?

How can the vocabulary size be dramatically reduced while retaining accuracy?
The rest of the paper is organized as follows: we first formally define the vocabulary selection problem (subsection 2.1) and present a quantitative study of classification accuracy under different vocabulary selections to showcase its importance to the end task (subsection 2.2). We also propose two new metrics for evaluating the performance of vocabulary selection in text classification tasks (subsection 2.3). We then propose a novel, task-aware vocabulary selection algorithm called Variational Vocabulary Dropout (VVD) (section 3), which draws on the idea of variational dropout Kingma et al. (2015): if we learn a dropout probability for each word in the vocabulary during model training on a given task, the learned dropout probabilities will imply the importance of each word to the end task and can therefore be leveraged for vocabulary selection. We propose to infer the latent dropout probabilities under a Bayesian inference framework. During test time, we select the sub-vocabulary by retaining only the words whose dropout probability is lower than a certain threshold. Any word deselected by VVD is simply treated as a special token with a null vector representation. Please note that our proposed algorithm needs to retrain the word embedding matrix; it is therefore tangential to research on pre-trained word embeddings like Word2Vec Mikolov et al. (2013) or GloVe Pennington et al. (2014), though we can use them to initialize our embedding.

We conduct comprehensive experiments to evaluate the performance of VVD (section 4) on different end classification tasks. Specifically, we compare against an array of strong baseline selection algorithms, including the frequency-based algorithm Luong et al. (2015), the TF-IDF algorithm Ramos et al. (2003), and the group lasso algorithm Friedman et al. (2010), and demonstrate that VVD consistently outperforms these competing algorithms by a remarkable margin. To show that our conclusions hold broadly, the evaluation covers a wide range of text classification tasks and datasets with different neural networks, including the Convolutional Neural Network (CNN) Kim (2014), Bidirectional Long Short-Term Memory (BiLSTM) Bahdanau et al. (2014), and Enhanced LSTM (ESIM) Chen et al. (2017). In summary, our contributions are threefold:

We formally define the vocabulary selection problem, demonstrate its importance, and propose new evaluation metrics for vocabulary selection in text classification tasks.

We propose a novel vocabulary selection algorithm based on variational dropout by reformulating text classification under the Bayesian inference framework. The code will be released on GitHub at https://github.com/wenhuchen/Variational-Vocabulary-Selection.git.

We conduct comprehensive experiments to demonstrate the superiority of the proposed vocabulary selection algorithm over a number of strong baselines.
2 Vocabulary Selection
2.1 Problem Definition
We now formally define the problem setting and introduce the notation for our problem. Conventionally, we assume the neural classification model vectorizes the discrete language input into a vector representation via an embedding matrix $E \in \mathbb{R}^{|V| \times d}$, where $|V|$ denotes the size of the vocabulary $V$ and $d$ denotes the vector dimension. The embedding is associated with a predefined word-to-index dictionary in which $w_i$ denotes the literal word corresponding to row $i$ of the embedding matrix. The embedding matrix covers the subset of a vocabulary of interest for a particular NLP task; note that $|V|$ is known to be very large due to the rich variation in human languages. Here we showcase the embedding matrix size of a popular text classification model (https://github.com/dennybritz/cnn-text-classification-tf) on the AGnews dataset Zhang et al. (2015) in Table 1, from which we can easily observe that the embedding matrix commonly occupies most of the parameter capacity, which could be the bottleneck in many real-world applications with limited computation resources.
In order to alleviate this redundancy problem and make the embedding matrix as efficient as possible, we are particularly interested in discovering the minimum row-sized embedding $\tilde{E}$ that achieves nearly the same performance as the full row-sized embedding $E$. More formally, we define our problem as follows:

$$\tilde{E} = \arg\min_{\tilde{E}} \#\text{Row}(\tilde{E}) \quad \text{s.t.} \quad \mathcal{A}(f_{\theta}(x; \tilde{E}), y) \ge \mathcal{A}(f_{\theta}(x; E), y) - \epsilon \qquad (1)$$

where $\#\text{Row}(\cdot)$ is the number of rows in the matrix, $f_{\theta}$ is the learned neural model with parameters $\theta$ that predicts the class given the input $x$, $\mathcal{A}$ is the function measuring accuracy between the model prediction and the reference output $y$, and $\epsilon$ is the tolerable performance drop after vocabulary selection. It is worth noting that $\theta$ here includes all parameters of the neural network except the embedding matrix $E$. For each vocabulary selection algorithm, we propose to draw its characteristic curve relating vocabulary capacity to classification accuracy, which we call the (characteristic) accuracy-vocab curve throughout the paper.
2.2 Importance of Vocabulary Selection
In order to investigate the importance of the role played by the vocabulary selection algorithm, we design a Monte-Carlo simulation strategy to approximate the lower and upper bounds on accuracy reachable by a possible selection algorithm at a given vocabulary size. More specifically, for a given vocabulary budget $B$, there exist $\binom{|V|}{B}$ distinct vocabulary subsets of the full vocabulary $V$. Directly enumerating these possibilities is infeasible; instead, we use a Monte-Carlo strategy that randomly picks a vocabulary subset of size $B$, simulating possible selection algorithms by running it $N$ times. After simulation, we obtain various point estimates of accuracy at each given $B$ and depict them in Figure 1 to approximately visualize the upper and lower bounds of the accuracy-vocab curve. From Figure 1, we can easily observe that the accuracy range under a limited vocabulary is extremely large, and that the gap gradually shrinks as the budget increases: for document classification with a budget of 1000, and for natural language understanding with a budget of 27, a selection algorithm can yield accuracies spanning a wide interval. This Monte-Carlo simulation study demonstrates the significance of the vocabulary selection strategy in NLP tasks and also indicates the enormous potential of an optimal vocabulary selection algorithm.

2.3 Evaluation Metrics
In order to evaluate how well a given selection algorithm performs, we propose the evaluation metrics depicted in Figure 1, which quantitatively study the characteristic accuracy-vocab curve. These metrics, namely Area Under Curve (AUC) and Vocab@X%, measure vocabulary selection performance globally and locally, respectively. AUC computes the area enclosed by the curve, which gives an overview of how well the vocabulary selection algorithm performs overall. In comparison, Vocab@X% computes the minimum vocabulary size required if an X% performance drop is allowed, which directly represents how large a vocabulary is needed to achieve a given accuracy. For the local evaluation metric, we mainly consider Vocab@3% and Vocab@5%. However, we observe that computing AUC directly lays too much emphasis on the large-vocabulary region and thus fails to reflect an algorithm's selection capability under low-vocabulary conditions. Therefore, we take the logarithm of the vocabulary size and compute the normalized enclosed area:

$$\text{AUC} = \frac{\int \text{Acc}(v) \, d(\log v)}{\text{Acc}(|V|) \cdot \log |V|} \qquad (2)$$

It is worth noting that Vocab@X% takes values in $[1, |V|]$, with smaller values indicating better performance. Since AUC is normalized by $\text{Acc}(|V|)$, it takes values in $[0, 1]$ regardless of the classification error.
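As a concrete illustration, the two metrics might be computed from sampled points on the accuracy-vocab curve roughly as follows. This is a sketch: the trapezoidal rule and the exact normalization constant (the log of the sampled vocabulary range rather than $\log |V|$) are assumptions, not the paper's exact implementation.

```python
import math

def log_auc(vocab_sizes, accuracies):
    """Normalized area under the accuracy-vocab curve with the vocabulary
    axis on a log scale, via the trapezoidal rule. Normalizing by the
    full-vocabulary accuracy times the log range keeps the value in [0, 1]."""
    area = 0.0
    for (v0, a0), (v1, a1) in zip(zip(vocab_sizes, accuracies),
                                  zip(vocab_sizes[1:], accuracies[1:])):
        area += 0.5 * (a0 + a1) * (math.log(v1) - math.log(v0))
    full_acc = accuracies[-1]
    return area / (full_acc * (math.log(vocab_sizes[-1]) - math.log(vocab_sizes[0])))

def vocab_at(vocab_sizes, accuracies, drop=0.03):
    """Vocab@X%: the smallest vocabulary whose accuracy is within a
    `drop` fraction of the full-vocabulary accuracy."""
    full_acc = accuracies[-1]
    for v, a in zip(vocab_sizes, accuracies):
        if a >= (1.0 - drop) * full_acc:
            return v
    return vocab_sizes[-1]
```

A flat curve yields an AUC of exactly 1, which matches the normalization described above.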
3 Our Method
Inspired by DNN dropout Srivastava et al. (2014); Wang and Manning (2013), we propose to tackle the vocabulary selection problem from a word-level dropout perspective, where each word (an integer index) is associated with a characteristic dropout rate representing the probability of being replaced with an empty placeholder; a higher dropout probability indicates less loss suffered from removing the word from the vocabulary. Hence, the original optimization problem in Equation 1 can be thought of as inferring the latent dropout probability vector. The overview of our approach is depicted in Figure 2: we associate a dropout probability with each row of the embedding matrix and then retrain the complete system, which captures how much each word in the vocabulary contributes to the end NLP task, and we remove the "less contributory" words from the vocabulary without hurting performance.
3.1 Bernoulli Dropout
Here we first assume that the neural network vectorizes the discrete inputs with an embedding matrix $E$ to project given words into a vector space, and we add random dropout noise to the embedding input to simulate the dropout process as follows:

$$e_w = b \cdot \left( E^{\top} \text{OneHot}(w) \right), \qquad b \sim \text{Bernoulli}(1 - p_w) \qquad (3)$$

where OneHot is a function transforming a word $w$ into its one-hot form, and $b$ is the Bernoulli dropout noise with drop probability $p_w$. The embedding output vector $e_w$ is computed from the given embedding matrix under a sampled Bernoulli variable. In order to infer the latent Bernoulli distribution with parameters $p$ under the Bayesian framework, where the training pairs $(x, y)$ are given as the evidence, we first define an objective function and then derive its lower bound, where $p(b)$ is the prior distribution and $q(b)$ denotes the Bernoulli approximate posterior with parameter $p$. We separate the text classification model's parameters $\theta$ from the embedding parameters $E$ and assume the classification model directly takes the embedding as input.
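As a concrete illustration of this word-level Bernoulli dropout, here is a minimal numpy sketch of the sampling step in Equation 3, assuming the per-word drop probabilities are given; the function name and vectorized batching are ours, not the paper's.

```python
import numpy as np

def bernoulli_word_dropout(E, word_ids, p, rng):
    """Word-level Bernoulli dropout: the embedding row of word i
    survives with probability 1 - p[i] and is otherwise replaced by a
    null (zero) vector. `p` holds one dropout probability per word."""
    keep = (rng.random(len(word_ids)) >= p[word_ids]).astype(E.dtype)
    return E[word_ids] * keep[:, None]   # sampled b multiplies each row
```

With all probabilities at 1 every looked-up row becomes the null vector; with all probabilities at 0 the lookup is the ordinary embedding.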
3.2 Gaussian Relaxation
However, the Bernoulli distribution is hard to reparameterize: we would need to enumerate its discrete values to compute the expectation over the stochastic dropout vector. Therefore, we follow Wang and Manning (2013) and use a continuous Gaussian approximation, replacing the Bernoulli noise $b$ with a Gaussian noise $z$:

$$e_w = z \cdot \left( E^{\top} \text{OneHot}(w) \right), \qquad z \sim \mathcal{N}(1, \alpha_w) \qquad (4)$$

where $z$ follows a Gaussian distribution with mean $1$ and variance $\alpha_w = p_w / (1 - p_w)$. It is worth noting that $\alpha_w$ and $p_w$ are in one-to-one correspondence, and $\alpha_w$ is a monotonically increasing function of $p_w$; for more details, please refer to Wang and Manning (2013). Based on this approximation, we can use $\alpha_w$ as the dropout criterion, e.g. discard words whose $\alpha_w$ exceeds a given threshold. We further follow Louizos et al. (2017); Kingma et al. (2015); Molchanov et al. (2017) to reinterpret the input noise as intrinsic stochasticity in the embedding weights themselves:

$$E_{ij} \sim \mathcal{N}(\mu_{ij}, \alpha_i \mu_{ij}^2) \qquad (5)$$
where each row $E_i$ of the embedding matrix follows a multivariate Gaussian distribution whose random weights share a tied variance/mean ratio $\alpha_i$. Thus, we rewrite the evidence lower bound as follows:

$$\mathcal{L}(\mu, \alpha, \theta) = \mathbb{E}_{q(E)}\left[ \log p(y \mid x, E, \theta) \right] - \text{KL}\left( q(E) \,\|\, p(E) \right)$$

where $p(E)$ is the prior distribution and $q(E)$ denotes the Gaussian approximate posterior with parameters $\mu$ and $\alpha$. $\mathcal{L}$ is used as the relaxed evidence lower bound of the marginal log likelihood. Here, we follow Kingma et al. (2015); Louizos et al. (2017) in choosing the prior as the improper log-scale uniform distribution, which guarantees that the regularization term $\text{KL}(q(E) \| p(E))$ depends only on the dropout ratio $\alpha$, i.e. it is irrelevant to $\mu$. Formally, we write the prior distribution as:

$$p(\log |E_{ij}|) = \text{const} \qquad (6)$$
Since there exists no closed-form expression for this KL divergence, we follow Louizos et al. (2017) to approximate it with the following low-variance formula:

$$-\text{KL}\left( q(E_i) \,\|\, p(E_i) \right) \approx k_1 \, \sigma(k_2 + k_3 \log \alpha_i) - 0.5 \log(1 + \alpha_i^{-1}) - k_1 \qquad (7)$$

with constants $k_1 = 0.63576$, $k_2 = 1.87320$, $k_3 = 1.48695$, and $\sigma(\cdot)$ the sigmoid function.
By adopting the improper log-uniform prior, more weights are compressed towards zero, and the KL divergence is negatively correlated with the dropout ratio $\alpha_i$. Intuitively, the dropout ratio $\alpha_i$ is a redundancy indicator for word $i$ in the vocabulary: a larger $\alpha_i$ means less performance loss is caused by dropping the word. During training, we use the reparameterization trick Kingma and Welling (2013) to sample embedding weights from the normal distribution, which reduces the Monte-Carlo variance in Bayesian training.
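The reparameterized sampling and the log-uniform KL term can be sketched in numpy as follows. The sigmoid-based constants follow the approximation of Molchanov et al. (2017); treating that as the exact form behind Equation 7 is an assumption on our part.

```python
import numpy as np

def sample_embedding(mu, log_alpha, rng):
    """Reparameterized draw of the embedding weights: row i is sampled
    as mu_i * (1 + sqrt(alpha_i) * eps) with eps ~ N(0, I), realizing
    E_ij ~ N(mu_ij, alpha_i * mu_ij^2) while keeping gradients w.r.t.
    mu and alpha deterministic."""
    alpha = np.exp(log_alpha)                  # (V,) per-word ratio
    eps = rng.standard_normal(mu.shape)
    return mu * (1.0 + np.sqrt(alpha)[:, None] * eps)

def neg_kl(log_alpha):
    """Approximate -KL(q || p) per word under the log-uniform prior
    (sigmoid fit of Molchanov et al., 2017); the training loss adds
    its negation as the regularizer."""
    k1, k2, k3 = 0.63576, 1.87320, 1.48695
    sig = 1.0 / (1.0 + np.exp(-(k2 + k3 * log_alpha)))
    return k1 * sig - 0.5 * np.log1p(np.exp(-log_alpha)) - k1
```

Note that `neg_kl` increases with $\log \alpha$, i.e. the KL penalty shrinks as a word's dropout ratio grows, matching the intuition that dropping redundant words becomes cheap.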
3.3 Vocabulary Selection
After optimization, we obtain the dropout ratio $\alpha_i$ associated with each word $w_i$. We propose to select the vocabulary subset based on the dropout ratio using a threshold $\gamma$; the remaining vocabulary subset is described as:

$$\tilde{V} = \{ w_i \in V \mid \alpha_i < \gamma \} \qquad (8)$$

where $\tilde{V}$ denotes the selected sub-vocabulary; by adjusting $\gamma$ we can control the selected vocabulary size.
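The selection step of Equation 8 amounts to a one-line filter; a minimal sketch:

```python
def select_vocab(words, alphas, gamma):
    """Keep the words whose learned dropout ratio alpha_i stays below
    the threshold gamma; everything else maps to the null token at
    test time. Raising gamma grows the selected vocabulary."""
    return [w for w, a in zip(words, alphas) if a < gamma]
```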
4 Experiments
We compare the proposed vocabulary selection algorithm against several strong baselines on a wide range of text classification tasks and datasets.
| Datasets | Task | Description | #Class | #Train | #Test |
|---|---|---|---|---|---|
| ATIS-flight Tur et al. (2010) | NLU | Classify airline travel dialog | 21 | 4,478 | 893 |
| Snips Coucke et al. (2018) | NLU | Classify inputs to a personal voice assistant | 7 | 13,084 | 700 |
| AGnews Zhang et al. (2015) | DC | Categories: World, Sports, etc. | 4 | 120,000 | 7,600 |
| DBPedia Lehmann et al. (2015) | DC | Categories: Company, Athlete, Album, etc. | 14 | 560,000 | 70,000 |
| Sogou-news Zhang et al. (2015) | DC | Categories: Sports, Technology, etc. | 5 | 450,000 | 60,000 |
| Yelp-review Zhang et al. (2015) | DC | Categories: Review ratings (1-5) | 5 | 650,000 | 50,000 |
| SNLI Bowman et al. (2015) | NLI | Entailment: Contradict, Neutral, Entail | 3 | 550,152 | 10,000 |
| MNLI Williams et al. (2018) | NLI | Multi-genre entailment | 3 | 392,702 | 10,000 |
4.1 Datasets & Architectures
The main datasets we use are listed in Table 2, which provides an overview of their descriptions and sizes. Specifically, we follow Zhang et al. (2015); Goo et al. (2018); Williams et al. (2018) in preprocessing the document classification, natural language understanding, and natural language inference datasets. We exactly replicate their experimental settings to make our method comparable with theirs. Our models are implemented in TensorFlow Abadi et al. (2015).

In order to evaluate the generalization ability of the VVD selection algorithm across deep learning architectures, we study its performance under different established architectures (depicted in Figure 3). In natural language understanding, we use the recent attention-based model for intention tracking Goo et al. (2018): this model first uses a BiLSTM recurrent network to leverage left-to-right and right-to-left context information to form the hidden representation, then computes self-attention weights to aggregate the hidden representation and predict user intention. In document classification, we mainly follow the CNN architecture Kim (2014) to extract n-gram features and then aggregate these features to predict the document category. In natural language inference, we follow the popular ESIM architecture (Williams et al., 2018; Chen et al., 2017) using the GitHub implementation at https://github.com/coetaur0/ESIM. In this structure, three main components (input encoding, local inference modeling, and inference composition) perform sequential inference and composition to simulate the interaction between premise and hypothesis. Note that we do not apply the syntax-tree-based LSTM proposed in Chen et al. (2017), because the parse tree Klein and Manning (2003) is lost after vocabulary compression; instead, we follow the simpler sequential LSTM framework without any syntax parse as input. Besides, the accuracy curve is obtained using the publicly available test split rather than the official online evaluation, because we need to evaluate many times at different vocabulary capacities.

| Datasets / Reported Accuracy | Accuracy | Vocab | Method | AUC | Vocab@3% | Vocab@5% |
|---|---|---|---|---|---|---|
| Snips / 96.7 Liu and Lane (2016) | 95.9 | 11000 | Frequency | 77.4 | 81 | 61 |
| | 95.9 | | TF-IDF | 77.6 | 81 | 62 |
| | 95.6 | | Group Lasso | 82.1 | 77 | 52 |
| | 96.0 | | VVD | 82.5 | 52 | 36 |
| ATIS-Flight / 94.1 Goo et al. (2018) | 93.8 | 724 | Frequency | 70.1 | 33 | 28 |
| | 93.8 | | TF-IDF | 70.5 | 34 | 28 |
| | 93.8 | | Group Lasso | 72.9 | 30 | 26 |
| | 94.0 | | VVD | 74.8 | 29 | 26 |
| AGnews / 91.1 Zhang et al. (2015) | 91.6 | 61673 | Frequency | 67.1 | 2290 | 1379 |
| | 91.6 | | TF-IDF | 67.8 | 2214 | 1303 |
| | 91.2 | | Group Lasso | 68.3 | 1867 | 1032 |
| | 91.6 | | VVD | 70.5 | 1000 | 673 |
| DBPedia / 98.3 Zhang et al. (2015) | 98.4 | 563355 | Frequency | 69.7 | 1000 | 743 |
| | 98.4 | | TF-IDF | 71.7 | 1703 | 804 |
| | 97.9 | | Group Lasso | 71.9 | 768 | 678 |
| | 98.5 | | VVD | 72.2 | 427 | 297 |
| Sogou-news / 95.0 Zhang et al. (2015) | 93.7 | 254495 | Frequency | 70.9 | 789 | 643 |
| | 93.7 | | TF-IDF | 71.3 | 976 | 776 |
| | 93.6 | | Group Lasso | 73.4 | 765 | 456 |
| | 94.0 | | VVD | 75.5 | 312 | 196 |
| Yelp-review / 58.0 Zhang et al. (2015) | 56.3 | 252712 | Frequency | 74.0 | 1315 | 683 |
| | 56.3 | | TF-IDF | 74.1 | 1630 | 754 |
| | 56.5 | | Group Lasso | 75.4 | 934 | 463 |
| | 57.4 | | VVD | 77.9 | 487 | 287 |
| SNLI / 86.7 Williams et al. (2018) | 84.1 | 42392 | Frequency | 72.2 | 2139 | 1362 |
| | 84.1 | | TF-IDF | 72.8 | 2132 | 1429 |
| | 84.6 | | Group Lasso | 73.6 | 1712 | 1093 |
| | 85.5 | | VVD | 75.0 | 1414 | 854 |
| MNLI / 72.3 Williams et al. (2018) | 69.2 | 100158 | Frequency | 78.5 | 1758 | 952 |
| | 69.2 | | TF-IDF | 78.7 | 1656 | 934 |
| | 70.1 | | Group Lasso | 79.2 | 1466 | 711 |
| | 71.2 | | VVD | 80.1 | 1323 | 641 |
4.2 Baselines
Here we mainly consider the following baselines:
Frequency-based (task-agnostic)
This approach was already discussed extensively in section 1; its basic idea is to rank words by frequency and then set a threshold to cut off the long-tail distribution.
TF-IDF (task-agnostic)
This algorithm views vocabulary selection as a retrieval problem Ramos et al. (2003), where term frequency is the word's corpus frequency and document frequency is the number of sentences in which the word appears. We follow the canonical TF-IDF approach to compute the retrieval score:

$$\text{score}(w) = tf(w) \cdot \left( \log \frac{N}{df(w)} \right)^{\beta} \qquad (9)$$

where $tf(w)$ denotes the word frequency, $\beta$ is the balancing factor, $N$ denotes the number of sentences, and $df(w)$ denotes the number of sentences in which $w$ appears. We rank the whole vocabulary by this score and cut off at a given threshold.
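A minimal sketch of this baseline follows; since the exact placement of the balancing factor in Equation 9 was lost in extraction, treating it as an exponent on the IDF term is our assumption.

```python
import math
from collections import Counter

def tfidf_rank(sentences, beta=1.0):
    """Rank the vocabulary by a TF-IDF style retrieval score: corpus
    term frequency times the (beta-powered) log inverse sentence
    frequency, where df(w) counts the sentences containing w."""
    tf, df = Counter(), Counter()
    for sent in sentences:
        tf.update(sent)          # term frequency over the corpus
        df.update(set(sent))     # sentence-level document frequency
    n = len(sentences)
    score = {w: tf[w] * math.log(n / df[w]) ** beta for w in tf}
    return sorted(score, key=score.get, reverse=True)
```

A word that appears in every sentence gets a zero IDF and falls to the bottom of the ranking, which is the intended behavior for uninformative function words.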
Group Lasso (task-aware)
This baseline aims to find intrinsic sparse structures Liu et al. (2015); Park et al. (2016); Wen et al. (2016) by grouping each row of the word embedding. The regularization objective, which encourages row-wise sparsity, is:

$$\Omega(E) = \lambda \sum_{i=1}^{|V|} \| E_i \|_2 \qquad (10)$$

After optimizing with this regularization, we use a threshold-based selection strategy on the row norms of the embedding matrix: the selected vocabulary is $\tilde{V} = \{ w_i \mid \| E_i \|_2 > \gamma \}$, where $\gamma$ is the threshold.
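As a sketch of this baseline (function names are ours), the penalty of Equation 10 and the subsequent row-norm thresholding are each a couple of lines of numpy:

```python
import numpy as np

def group_lasso_penalty(E, lam=1e-4):
    """Row-wise group-lasso regularizer: the sum of l2 norms of
    embedding rows, driving whole rows (i.e. words) toward zero."""
    return lam * np.linalg.norm(E, axis=1).sum()

def select_by_row_norm(words, E, gamma):
    """Threshold selection on the trained embedding: keep words whose
    row norm exceeds gamma."""
    norms = np.linalg.norm(E, axis=1)
    return [w for w, nrm in zip(words, norms) if nrm > gamma]
```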
4.3 Experimental Results
We report our results for natural language understanding, document classification, and natural language inference in Table 3. First of all, we observe that VVD is able to maintain or even improve the reported accuracy on the DC and NLU tasks; the accuracy of VVD is reported after dropping the words whose dropout rate exceeds the threshold. The exception is NLI Williams et al. (2018), where the common approach uses GloVe Pennington et al. (2014) for initialization while we use random initialization, which makes our model fall slightly behind. It is worth noting that the frequency-based and TF-IDF methods are based on a model trained with cross entropy, while both group lasso and VVD modify the objective function by adding a regularization term. VVD performs very similarly to the baseline models on the DC and NLU tasks while consistently outperforming the baselines (with randomly initialized embeddings) on the more challenging NLI and Yelp-review tasks; in this sense, VVD can also be viewed as a generally effective regularization technique that sparsifies features and alleviates overfitting in NLP tasks. In terms of vocabulary selection capability, our proposed VVD outperforms the competing algorithms on both the AUC and Vocab@X% metrics consistently across datasets, as shown in Table 3. In order to better understand the margin between VVD and the frequency-based method, we plot their accuracy-vocab curves in Figure 4: the curves start from nearly the same accuracy with the full vocabulary, and as the budget gradually decreases, VVD's accuracy decreases at a much lower rate than the competing algorithms', which clearly reflects its superiority in limited-budget scenarios. From the empirical results, we conclude that: 1) the retrieval-based selection algorithm yields marginal improvement on the AUC metric, but the Vocab@X% metric deteriorates; 2) the group lasso and VVD algorithms directly consider the connection between each word and the end classification accuracy, and such task-awareness greatly improves both evaluation metrics. We also observe that the NLU datasets are relatively simple, involving only the detection of keywords from voice inputs to make decent decisions: a keyword vocabulary within 100 words is already enough for promising accuracy. The DC datasets, which require better inner-sentence and inter-sentence understanding, need a hundred-level vocabulary in most cases. The NLI datasets involve more complicated reasoning and interaction, which requires a thousand-level vocabulary.
Case Study
To provide an overview of which words are selected, we depict the selection spectrum over different NLP tasks in Figure 5, from which we observe that most of the selected vocabulary still comes from the high-frequency area to ensure coverage, which also explains why the frequency-based algorithm is already very strong. Furthermore, we use the Snips dataset Coucke et al. (2018) to showcase the difference between the vocabularies selected by VVD and by the frequency-based baseline. The main goal of this dataset is to understand the speaker's intention, such as "BookRestaurant", "PlayMusic", and "SearchLocalEvent". We show the words selected and rejected by our algorithm in Figure 6 under a vocabulary budget of 100: many non-informative but frequent function words like "get", "with", and "five" are not selected, while more task-related but less frequent words like "neighborhood", "search", and "theatre" are selected. Finally, we show a word cloud of the vocabulary selected on Snips Coucke et al. (2018) in Figure 7.
4.4 Discussion
Here we discuss some potential issues that arise when training and evaluating VVD.
Training Speed
Due to the stochasticity of VVD, training takes longer than with the canonical cross-entropy objective. More importantly, we observe that as the full vocabulary size increases, the convergence time of VVD also increases sublinearly, while the convergence time of cross entropy remains quite consistent. We conjecture that this is because VVD bears a heavier burden in inferring the dropout probabilities of the long-tail words. Therefore, we propose a two-step vocabulary reduction to dramatically decrease VVD's training time: in the first step, we cut off the rare words, which does no harm to the final accuracy; then we continue training with VVD on the shrunken vocabulary. This hybrid methodology decreases the training time dramatically.
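The two-step scheme can be sketched as below; `vvd_train` stands in for the (here hypothetical) VVD training loop returning per-word dropout ratios.

```python
def two_step_reduction(word_counts, min_freq, vvd_train):
    """Two-step vocabulary reduction: a cheap frequency cutoff first
    removes the long tail, then VVD is trained only on the survivors.
    `vvd_train` is a callable mapping a vocabulary to learned
    per-word dropout ratios."""
    shrunk = sorted(w for w, c in word_counts.items() if c >= min_freq)
    return shrunk, vvd_train(shrunk)
```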
Evaluation Speed
At each vocabulary point, the network needs to perform one evaluation pass over the whole test set. It is therefore impractical to evaluate every vocabulary size from 1 to $|V|$. Given limited computational resources, we need to sample some vocabulary sizes and estimate the area under the curve from only these points. Uniform sampling proves wasteful: the accuracy curve converges very early, so most uniformly sampled points yield essentially the same accuracy. Therefore, we increase the interval exponentially to place more samples at extremely small vocabulary sizes. For example, given a total vocabulary of 60000, the evaluated sizes are 1, 2, 4, 8, 24, 56, …, 60K. This sampling scheme achieves a reasonably accurate estimate of the accuracy-vocab curve with only a small number of sample points, which is affordable in most cases.
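A simple version of this exponential spacing can be sketched as follows (using a pure doubling schedule; the paper's exact intervals, e.g. the 24 and 56 above, differ slightly):

```python
def eval_points(full_vocab_size, base=2):
    """Exponentially spaced vocabulary sizes for tracing the
    accuracy-vocab curve: dense at tiny vocabularies where accuracy
    changes fast, sparse where the curve has flattened."""
    points, v = [], 1
    while v < full_vocab_size:
        points.append(v)
        v *= base
    points.append(full_vocab_size)  # always include the full vocabulary
    return points
```

For a 60000-word vocabulary this yields only about 17 evaluation points instead of 60000.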
5 Related Work
Neural Network Compression
In order to better apply deep neural networks in limited-resource scenarios, much recent research has aimed to compress model size and reduce required computation. In summary, there are three main directions: weight matrix approximation Le et al. (2015); Tjandra et al. (2017), reducing the precision of the weights Hubara et al. (2017); Han et al. (2015), and sparsification of the weight matrix Wen et al. (2016). Another line of sparsification work relies on the Bayesian inference framework Molchanov et al. (2017); Neklyudov et al. (2017); Louizos et al. (2017). The main advantage of Bayesian sparsification techniques is that they have few hyperparameters compared to pruning-based methods. As stated in Chirkova et al. (2018), Bayesian compression also leads to a higher sparsity level Molchanov et al. (2017); Neklyudov et al. (2017); Louizos et al. (2017). Our proposed VVD is inspired by these predecessors and specifically tackles the vocabulary redundancy problem in NLP tasks.

Vocabulary Reduction
An orthogonal line of research for dealing with a similar vocabulary redundancy problem is character-based approaches to vocabulary reduction Kim et al. (2016); Zhang et al. (2015); Costa-Jussà and Fonollosa (2016); Lee et al. (2017), which decompose words into their character forms to better handle open-world inputs. However, these approaches are not applicable to character-free languages like Chinese and Japanese. Moreover, splitting words into characters incurs a potential loss of word-level surface form and thus needs more parameters at the neural network level to recover it and maintain end-task performance Zhang et al. (2015), which contradicts our initial motivation of compressing neural network models for computation- or memory-constrained scenarios.
6 Conclusion
In this paper, we propose a vocabulary selection algorithm that finds sparsity in the vocabulary and dynamically decreases its size to contain only the useful words. Through our experiments, we have empirically demonstrated that the commonly adopted frequency-based vocabulary selection is already a very strong mechanism, and that applying our proposed VVD can further improve the compression ratio. However, due to time and memory complexity issues, our algorithm and evaluation are more suitable for classification-based applications. In the future, we plan to investigate broader applications like summarization, translation, and question answering.
7 Acknowledgement
The authors would like to thank the anonymous reviewers for their thoughtful comments. This research was sponsored in part by NSF 1528175. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
References
 Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Largescale machine learning on heterogeneous systems. Software available from tensorflow.org.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations.
 Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 632–642.
 Chen et al. (2017) Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1657–1668.
 Chirkova et al. (2018) Nadezhda Chirkova, Ekaterina Lobacheva, and Dmitry Vetrov. 2018. Bayesian compression for natural language processing. arXiv preprint arXiv:1810.10927.
 Costa-Jussà and Fonollosa (2016) Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers.
 Coucke et al. (2018) Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Maël Primet, and Joseph Dureau. 2018. Snips voice platform: an embedded spoken language understanding system for privatebydesign voice interfaces. CoRR, abs/1805.10190.
 Faruqui et al. (2015) Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A. Smith. 2015. Sparse overcomplete word vector representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1491–1500.
 Friedman et al. (2010) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2010. A note on the group lasso and a sparse group lasso. arXiv preprint arXiv:1001.0736.
 Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 753–757.
 Han et al. (2015) Song Han, Huizi Mao, and William J Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations.
 Hubara et al. (2017) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898.
 Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1746–1751.
 Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. 2016. Character-aware neural language models. In AAAI, pages 2741–2749.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. 2015. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational Bayes. International Conference on Learning Representations.
 Klein and Manning (2003) Dan Klein and Christopher D Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in neural information processing systems, pages 3–10.
 Le et al. (2015) Quoc V. Le, Navdeep Jaitly, and Geoffrey E. Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. CoRR, abs/1504.00941.
 Lee et al. (2017) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. TACL, 5:365–378.
 Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195.
 Liu et al. (2015) Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. 2015. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814.
 Liu and Lane (2016) Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016, pages 685–689.
 Louizos et al. (2017) Christos Louizos, Karen Ullrich, and Max Welling. 2017. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3288–3298.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 1412–1421.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
 Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. 2017. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2498–2507.
 Neklyudov et al. (2017) Kirill Neklyudov, Dmitry Molchanov, Arsenii Ashukha, and Dmitry P Vetrov. 2017. Structured Bayesian pruning via log-normal multiplicative noise. In Advances in Neural Information Processing Systems, pages 6775–6784.
 Park et al. (2016) Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yiran Chen, and Pradeep Dubey. 2016. Faster CNNs with direct sparse convolutions and guided pruning. International Conference on Learning Representations.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
 Ramos et al. (2003) Juan Ramos et al. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the First Instructional Conference on Machine Learning, volume 242, pages 133–142.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
 Tjandra et al. (2017) Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Compressing recurrent neural network with tensor train. In 2017 International Joint Conference on Neural Networks, IJCNN 2017, Anchorage, AK, USA, May 14-19, 2017, pages 4451–4458.
 Tur et al. (2010) Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in ATIS? In Spoken Language Technology Workshop (SLT), 2010 IEEE, pages 19–24. IEEE.
 Wang and Manning (2013) Sida Wang and Christopher Manning. 2013. Fast dropout training. In international conference on machine learning, pages 118–126.
 Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. 2016. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082.
 Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122.
 Yogatama et al. (2015) Dani Yogatama, Manaal Faruqui, Chris Dyer, and Noah A. Smith. 2015. Learning word representations with hierarchical sparse coding. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 87–96.
 Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.