In many of these models, the softmax function is commonly used to produce categorical distributions over the output space. Because its cost grows linearly with the output dimension, the softmax layer can become a computational bottleneck for tasks with large output spaces, such as language modeling (Bengio et al., 2003), neural machine translation (Bahdanau et al., 2014),
and face recognition (Sun et al., 2014). In language modeling, even with a small RNN model (Zaremba et al., 2014), softmax accounts for more than 95% of the computation (Merity et al., 2016). This becomes a significant bottleneck when computational resources are limited, for example when deploying the model to mobile devices (Howard et al., 2017).
Many methods have been proposed to reduce softmax complexity. The bottleneck is present in both the training and inference phases, but the objectives differ. For training, the goal is to estimate the categorical distribution and approximate the normalization term as quickly as possible; sampling-based (Gutmann & Hyvärinen, 2012) and hierarchy-based methods (Goodman, 2001; Morin & Bengio, 2005; Chen et al., 2015; Grave et al., 2016) were introduced for this purpose. Hierarchy-based methods speed up training by calculating the normalization term over a subset of classes. Recent works in this area, such as D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016), construct two-level hierarchies for the output classes based on the unbalanced word distribution.
In this work, we focus on reducing softmax computation during the inference phase. Unlike training, in inference our goal is not to compute the exact categorical distribution over the whole vocabulary, but rather to search for the top-k classes accurately and efficiently. Most existing methods formulate this as an approximate maximum inner product search problem given an already learned and fixed softmax, and focus on designing efficient approximation techniques to find the top-k classes after the standard softmax training procedure has been performed (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). Since the softmax training objective is misaligned with the inference objective, the learned and fixed softmax may not be structured in a (hierarchical) way that makes locating the top-k easy, which can lead to sub-optimal trade-offs between efficiency and accuracy. Motivated by this observation, we make the training procedure aware of the approximation used at inference time and adapt the softmax to be hierarchically structured.
To achieve this, we propose a novel Doubly Sparse Softmax (DS-Softmax) layer. The model learns a two-level overlapping hierarchy during training, using a sparse mixture of sparse experts. Each expert is sparse and contains only a small subset of the output classes,
while each class is permitted to belong to more than one expert. Given an input vector and a set of experts, the DS-Softmax first selects the single expert most related to the input (in contrast to a dense mixture of experts). The chosen sparse expert then returns a categorical distribution over its subset of classes; the reduction is achieved because the whole vocabulary never needs to be considered. Compared to existing methods (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a), our model can adapt the softmax embedding during training, which results in a better trade-off. In addition, our method can be combined with existing post-approximation methods by treating each expert as a single softmax.
We conduct experiments on one synthetic dataset and three real tasks: language modeling, neural machine translation, and image classification. We demonstrate that our method can reduce softmax computation dramatically without loss of prediction performance; for example, we achieve more than 23x speedup in language modeling and 15x speedup in translation with similar performance. Qualitatively, we also show that the learned two-level overlapping hierarchy is semantically meaningful on natural language modeling tasks.
The contributions of our work are summarized as follows:
We propose a novel, learning-based method to speed up softmax inference for large output spaces, which enables fast retrieval of the top-k classes.
The proposed method learns a two-level overlapping hierarchy during training that facilitates fast inference and bridges the misaligned objectives of the softmax training and inference phases.
We empirically demonstrate that the proposed method can achieve significant speedup without loss of prediction performance.
2 The Doubly Sparse Softmax
In this section, we introduce the softmax inference problem, as well as the proposed method.
2.1 Softmax Inference Problem
Given a context vector h ∈ R^d, a softmax layer is used to compute a categorical distribution over a set of N classes. In particular, it is defined as p(i | h) = exp(h^T w_i) / Z, where Z = Σ_j exp(h^T w_j) is the normalization term and W = [w_1, …, w_N] is the softmax embedding parameter. For inference, our goal is not to compute the full exact distribution, but rather to find the top-k classes, i.e. the classes i whose logits h^T w_i are among the k largest. The most conventional method is to compute the whole logit vector and find the top-k, which has O(Nd) complexity (top-k selection requires an extra O(N) on average via Quickselect). Facing a large output space (i.e. large N), the softmax layer becomes a bottleneck, and our goal is to find the top-k both accurately and efficiently.
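As a point of reference, the conventional approach above can be sketched in a few lines of NumPy; the sizes and variable names are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical sizes: N = 10,000 output classes, context dimension d = 200.
N, d = 10_000, 200
rng = np.random.default_rng(0)
W = rng.standard_normal((N, d)) * 0.01   # softmax embedding matrix
h = rng.standard_normal(d)               # context vector from the network

# Full softmax: O(N*d) for the logits, plus normalization over all N classes.
logits = W @ h
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Top-k selection via argpartition (average O(N), Quickselect-style).
k = 5
topk = np.argpartition(-logits, k)[:k]
topk = topk[np.argsort(-logits[topk])]   # order the k winners by logit
```

Every query pays the full O(Nd) cost here, which is exactly the cost the DS-Softmax aims to avoid.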
Many natural discrete objects/classes, such as words in natural language, exhibit hierarchical structure, with objects organized in a tree-like fashion. A hierarchical structure enables much faster retrieval, since we do not need to consider the whole set. Goodman (2001); Chen et al. (2015); Grave et al. (2016)
studied two-level hierarchies for language modeling, where each word belongs to a unique cluster (a "cluster" here refers to a cluster of words) and the hierarchy is constructed with different approaches. However, constructing such a hierarchy is very challenging and usually based on heuristics. Moreover, requiring mutually exclusive clusters can be very limiting: in language modeling, for instance, it is often difficult to assign a word to exactly one cluster. For example, if we want to predict the next word of "I want to eat", one possible correct answer is "cookie", and we quickly notice that the plausible answers belong to the category of edible things. If we only search for the answer among words with this edible property, we can dramatically increase efficiency. However, although "cookie" is a correct answer here, the word also appears in non-edible contexts, such as "a piece of data" in computer science literature. A two-level overlapping hierarchy naturally accommodates homonyms like this by allowing each word to belong to more than one cluster. We believe this observation is likely to hold in applications beyond language modeling.
2.3 The Proposed Method
Inspired by such hierarchical structures, we propose our method, the Doubly Sparse Softmax (DS-Softmax), to automatically capture and leverage them for softmax inference speedup. The method learns an overlapping two-level hierarchy among the output classes: the first level is a sparse mixture, and the second level contains several sparse experts. A sparse expert is a cluster of classes covering only a subset of the whole output space, and we allow each class to belong to more than one expert (non-exclusive). To generate the top-k classes, the sparse mixture enables a fast, dynamic selection of the right expert according to the context vector h; the selected sparse expert then allows a fast softmax computation over its small subset of classes.
The framework is illustrated in Figure 1 and contains two major components: (1) a sparse mixture/gating network that enables the selection of a top-1 expert, and (2) sparse experts that are pruned from full softmaxes with group lasso. We also use a load balancing term to balance the utilization of different experts, and mitosis training to scale to a larger number of experts. The final objective is a combination of the task-specific loss, the group lasso losses, and the load balancing regularization losses. The overall training algorithm is summarized in Algorithm 1.
The first level of sparsification is a sparse gating network, designed to find the right expert given the context vector h. To facilitate faster inference, only the single most suitable expert is chosen. This sparse gating network is similar to the one in Shazeer et al. (2017), but the technique we propose here supports a single top-expert output while maintaining meaningful gradients.
To be more specific, suppose we have K experts. Given the context vector h and the gating network weight W_g, the gating values g_i(h), i = 1, …, K, are calculated and normalized prior to the selection, as shown in Eq. 1:

g_i(h) = exp(h^T W_g^{(i)}) / Σ_j exp(h^T W_g^{(j)}),   (1)

and then we keep only the largest gating value while setting all other gates to zero:

g'_i(h) = g_i(h) if i = argmax_j g_j(h), and g'_i(h) = 0 otherwise,

where W_g is the weighting matrix for expert selection and only the top-1 expert is selected. Because the gating values are normalized before the selection, Eq. 1 still allows the gradient to be back-propagated to the whole W_g.
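A minimal sketch of this gating step (normalize over all experts first, then zero out all but the top-1 gate) might look as follows; the variable names are ours, not the paper's:

```python
import numpy as np

def sparse_gate(h, W_g):
    """Normalize gating values over all experts (Eq. 1), then keep only the
    top-1 gate and zero out the rest. Because normalization happens before
    selection, training gradients w.r.t. the whole W_g remain meaningful."""
    scores = W_g @ h                      # one score per expert
    g = np.exp(scores - scores.max())
    g /= g.sum()                          # normalized gating values g_i(h)
    top = int(np.argmax(g))
    g_sparse = np.zeros_like(g)
    g_sparse[top] = g[top]                # only the chosen expert survives
    return g_sparse, top

rng = np.random.default_rng(1)
n_experts, d = 8, 16                      # hypothetical sizes
g_sparse, top = sparse_gate(rng.standard_normal(d),
                            rng.standard_normal((n_experts, d)))
```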
Given the sparse gate, we can further compute the probability of class c under context h as

p(c | h) = exp(g'_e(h) · h^T w_c^{(e)}) / Σ_{c'} exp(g'_e(h) · h^T w_{c'}^{(e)}),   (2)

where e is the chosen expert and W^{(e)} is the softmax embedding weight matrix for the e-th expert. The gating value can be interpreted as an inverse temperature for the final categorical distribution produced by the chosen expert (Hinton et al., 2015): a smaller g'_e(h) gives a more uniform distribution, a larger one makes it sharper, and this is adjusted automatically according to the context. It is worth noting that during inference, since all other gates are zero, we only need to compute the single chosen expert and select the top-k classes from it.
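Putting the two levels together, inference could be sketched as below. The expert class subsets here are random placeholders; in the trained model they come from the group lasso pruning described next.

```python
import numpy as np

# Hypothetical setup: 4 experts, each holding a 300-class subset of N = 1,000.
rng = np.random.default_rng(2)
N, d, n_experts = 1_000, 32, 4
W_g = rng.standard_normal((n_experts, d))
expert_classes = [rng.choice(N, size=300, replace=False) for _ in range(n_experts)]
expert_W = [rng.standard_normal((300, d)) for _ in range(n_experts)]

def ds_softmax_topk(h, k=5):
    # First level: pick the single most relevant expert.
    g = np.exp(W_g @ h)
    g /= g.sum()
    e = int(np.argmax(g))
    # Second level: softmax only over that expert's class subset,
    # with the gate value acting as an inverse temperature.
    logits = g[e] * (expert_W[e] @ h)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top = np.argpartition(-p, k)[:k]
    top = top[np.argsort(-p[top])]        # descending order within the top-k
    return expert_classes[e][top], p[top]

classes, probs = ds_softmax_topk(rng.standard_normal(d))
```

Only the gating scores and one small expert are ever evaluated, which is where the speedup comes from.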
During training, we use the resulting distribution p(c | h) as the softmax output and train the model end-to-end with respect to the task-specific loss function. (In practice, we can also pre-train all layers with a conventional softmax and then re-train only the DS-Softmax layer, keeping the previous layers fixed.) This makes training aware of the approximation used at inference, since the sparse gating network behaves identically in both phases.
Sparse experts with group lasso
The second level of sparsification is the sparse experts. We would like each expert to contain only a small subset of the whole class space, i.e. to output a categorical distribution in which most entries are zero. To obtain a sparse expert, we initialize each expert as a full softmax covering all classes and apply group lasso to iteratively prune out irrelevant classes. More specifically, we add the regularization loss

L_lasso = Σ_k Σ_c ||w_c^{(k)}||_2,   (3)

which treats each class embedding vector within each expert as a group. This term actively prunes the embedding vector of class c in expert k once its norm falls below a pre-defined threshold ε. Under heavy regularization, many classes are pruned out of each expert, yielding a set of sparse experts.
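The group lasso penalty and the norm-based pruning can be sketched as follows; the function names and threshold variable are illustrative:

```python
import numpy as np

def group_lasso_loss(W_expert, lam=1.0):
    """Group lasso over per-class embedding vectors inside one expert:
    the penalty is the sum of row-wise L2 norms, which drives whole
    class embeddings toward zero rather than individual entries."""
    # W_expert: (classes_in_expert, d); one "group" per class embedding row.
    return lam * np.linalg.norm(W_expert, axis=1).sum()

def prune_expert(W_expert, class_ids, eps=0.01):
    """Drop classes whose embedding norm has fallen below the threshold eps."""
    keep = np.linalg.norm(W_expert, axis=1) >= eps
    return W_expert[keep], class_ids[keep]

rng = np.random.default_rng(3)
W = rng.standard_normal((100, 16))
W[:40] *= 1e-4                     # pretend regularization shrank these rows
W_kept, ids_kept = prune_expert(W, np.arange(100))
```

After pruning, the expert's softmax runs only over `ids_kept`, which is what makes the second-level computation small.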
A balanced utilization of experts is preferred. Let K_k denote the number of final classes in the k-th expert and N the total number of classes. The utilization ratio u_k is the probability of expert k being selected on a given dataset; for example, if the model is run 10,000 times and the k-th expert is selected 100 times, then u_k = 0.01. The overall speedup is then roughly N / Σ_k u_k K_k. An unbalanced load is therefore undesirable, since the model can degenerate into a single big softmax and yield less speedup. To address this, we add a loss that encourages a more balanced utilization of experts. We adopt the loading function of Shazeer et al. (2017), shown in Eq. 5, which balances the utilization percentage of each expert by minimizing the coefficient of variation (CV) of the gating outputs. In addition, we include a further group lasso loss term so that each class is encouraged to exist in only one or a few experts.
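The CV-based balancing idea can be sketched as follows, using the total gating mass per expert over a batch as the importance measure (in the spirit of Shazeer et al. (2017); the exact loading function used in our implementation may differ):

```python
import numpy as np

def cv_squared(x, eps=1e-10):
    # Squared coefficient of variation: zero when all entries are equal.
    return x.var() / (x.mean() ** 2 + eps)

def load_balance_loss(gates):
    """gates: (batch, n_experts) array of gating values. Penalizes uneven
    total gate mass across experts, pushing toward balanced utilization."""
    importance = gates.sum(axis=0)       # total gate mass per expert
    return cv_squared(importance)

balanced = np.full((64, 8), 1 / 8)                 # perfectly even load
skewed = np.zeros((64, 8))
skewed[:, 0] = 1.0                                 # one expert does all work
```

Minimizing this loss pushes the skewed case toward the balanced one, preventing the degenerate single-big-softmax solution.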
The proposed method is efficient for softmax inference, as it consists of only (1) a sparse gating step to choose an expert, with O(Kd) complexity given K experts; and (2) a small-scale softmax over the selected sparse expert, with an average complexity of O((mN/K)d) given a balanced set of experts in which each class belongs to m experts on average. With a reasonable K, a significant speedup can be expected.
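As a back-of-the-envelope check of this complexity argument (the concrete numbers below are illustrative, not measured):

```python
# N output classes, K balanced experts, each class in m experts on average,
# context dimension d.
N, K, m, d = 10_000, 64, 2, 200

full_softmax_flops = N * d                 # O(N d) for the full softmax
gating_flops = K * d                       # O(K d) to pick an expert
expert_flops = (m * N / K) * d             # O((m N / K) d) on average
ds_softmax_flops = gating_flops + expert_flops

speedup = full_softmax_flops / ds_softmax_flops
# With these numbers the ratio comes out to roughly 26x.
```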
As for training, since each expert is initialized as a full softmax and only gradually pruned into a sparse one, the DS-Softmax layer requires K times the memory of a regular softmax during training with K experts, even though the final sparse experts need far less. To mitigate this issue, we introduce a memory-efficient training technique, named mitosis training.
Mitosis training is a strategy to progressively increase the number of experts during training. We start with a small number of experts; once training converges, we split each expert into two identical copies and repeat the training procedure with the resulting model. Because each expert is already relatively sparse and smaller than the full softmax at the time it is cloned, this requires much less memory than training all experts from scratch without mitosis. An illustration of mitosis training can be found in Fig. 2.
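One mitosis step (splitting every expert, together with its gating row, into two near-identical copies) might look like this; the noise scale and function name are our own choices for illustration:

```python
import numpy as np

def mitosis_split(expert_weights, W_g, noise=1e-2, rng=None):
    """Clone each expert (and its gating row) into two children, adding
    small noise so the copies can diverge during subsequent training."""
    if rng is None:
        rng = np.random.default_rng()
    new_experts = []
    for W in expert_weights:
        new_experts.append(W.copy())
        new_experts.append(W + noise * rng.standard_normal(W.shape))
    W_g_new = np.repeat(W_g, 2, axis=0)           # duplicate gating rows
    W_g_new += noise * rng.standard_normal(W_g_new.shape)
    return new_experts, W_g_new

rng = np.random.default_rng(5)
experts = [rng.standard_normal((50, 16)) for _ in range(2)]
W_g = rng.standard_normal((2, 16))
experts, W_g = mitosis_split(experts, W_g, rng=rng)   # 2 -> 4 experts
```

Repeating this step (2 → 4 → 8 → …) with pruning in between keeps peak memory well below the cost of training all experts at full size.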
3 Experiments
We present empirical evaluations on both synthetic and real tasks in this section. First, we create a synthetic task with a two-level hierarchy and test our model's ability to learn the hierarchical structure. Second, we consider three real tasks: natural language modeling, neural machine translation, and Chinese handwritten character recognition. Both theoretical speedup (reduction in FLOPs) and real-device latency (on CPU) are reported. Finally, ablations and a case study are presented to better understand what the model has learned.
Regarding experiment setup, we leave the task-specific details for later and present our model setup here. The proposed DS-Softmax layer can be trained jointly with other layers in an end-to-end fashion. For the real tasks, we find it easier to first pre-train the whole model with a conventional softmax, then replace the softmax layer with DS-Softmax and retrain only the new layer, keeping the others fixed, using Adam (Kingma & Ba, 2014). For hyper-parameters, the load balancing weight and the pruning threshold ε are fixed for all tasks at 10 and 0.01 respectively. The group lasso weights share the same value and are tuned with the following strategy: start from zero and increase exponentially until validation performance decreases. The reported performance is on an independent test set. For baselines, we mainly compare to the conventional full softmax, the recently proposed SVD-Softmax (Shim et al., 2017), and D-Softmax (Chen et al., 2015).
3.1 Synthetic task
A synthetic dataset with a two-level hierarchy is constructed to test our model. As illustrated in Figure 3(a), data points are organized around hierarchical centers: multiple sub-clusters belong to one super cluster. To generate such data, we first sample the center of each super cluster, then sample the centers of its sub-clusters around the super cluster center, and finally draw each data point near its sub-cluster center, with the scale ratio between levels set to 10 and the data dimension set to 100. For sanity check and visualization purposes, we make sure the ground-truth hierarchy in the synthetic data has no overlapping.
We treat the coordinates of a data point as features and the sub-cluster membership of the data point as the target. We construct a two-layer Multi-Layer Perceptron (MLP) with DS-Softmax as the final layer; ideally, it should capture the hierarchical structure well. Two super/sub cluster sizes are evaluated, 10x10 and 100x100, where 10x10 means there are 10 super clusters, each containing 10 sub-clusters.
We investigate the captured hierarchy by examining how sub-clusters are distributed across experts. As mentioned, each expert contains only a subset of the output classes, because class-level pruning is conducted during training. We illustrate the remaining classes in each expert in Fig. 3(b) and Fig. 3(c) for the 10x10 and 100x100 sizes respectively, and find that DS-Softmax can perfectly capture the hierarchy. We further perform an ablation analysis on the 10x10 synthetic results, shown in Fig. 4, to study the effect of each additional loss; all the loss terms discussed above prove important to our model.
3.2 Language Modeling
Language modeling is the task of predicting the next word given the context. For a language such as English, the vocabulary is large, and softmax can be a bottleneck for inference efficiency. We use two standard datasets for word-level language modeling: Penn TreeBank (PTB) (Marcus et al., 1994) and WikiText-2 (Merity et al., 2016), with output dimensions of 10,000 and 33,278 respectively. A standard two-layer LSTM model (Gers et al., 1999) with hidden size 200 is used (https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb). We use accuracy as our metric, as it is common (Chen et al., 1998) in natural language modeling, especially in real applications where an extrinsic reward is given, such as voice recognition. Top-1, Top-5, and Top-10 accuracies on the test sets are reported. As shown in Table 1, we achieve 15.99x and 23.86x speedups (in terms of FLOPs) with 64 experts without loss of accuracy. (One copy of each word must be retained among all experts during training; otherwise, 80x speedup is easily achieved without loss of accuracy, at the cost of missing low-frequency words.) Moreover, a slight improvement in performance is observed, which suggests that the mixture of softmaxes brings additional benefit, consistent with the improvement obtained by breaking the low-rank bottleneck (Yang et al., 2017).
[Table 1: Top-1, Top-5, and Top-10 accuracies on PTB and WikiText-2.]
3.3 Neural Machine Translation
Neural machine translation is also commonly used for softmax speedup evaluation. We use the IWSLT English-to-Vietnamese dataset (Luong & Manning, 2015) and evaluate performance by BLEU score (Papineni et al., 2002) with greedy search, assessed on the test set. The vanilla softmax baseline is a seq2seq model (Sutskever et al., 2014)
implemented in TensorFlow (Abadi et al., 2016) (https://github.com/tensorflow/nmt). The output vocabulary contains 7,709 words, and the results are shown in Table 2, where our method achieves 15.08x speedup (in terms of FLOPs) with a similar BLEU score.
[Table 2: BLEU scores and speedups on IWSLT En-Vi (7,709 words); the full softmax baseline achieves BLEU 25.2.]
3.4 Chinese Character Recognition
We also test on a Chinese handwritten character recognition task, using the offline CASIA dataset (Liu et al., 2011) with special characters filtered out. CASIA is a popular Chinese character recognition dataset with around four thousand characters. Unlike language-related tasks, the class distribution here is uniform rather than skewed. Two-thirds of the data is used for training and the rest for testing. Table 3 shows the results; we achieve a significant 6.91x speedup (in terms of FLOPs) on this task.
3.5 Real device comparison
Real-device experiments were conducted on a machine with two Intel Xeon CPUs @ 2.20 GHz and 16 GB memory. All tested models are re-implemented in NumPy to ensure fairness. Two configurations of SVD-Softmax (Shim et al., 2017) are evaluated, SVD-5 and SVD-10, which fully evaluate the top 5% and 10% of candidates after a preview pass with window width 16; indexing and sorting are computationally heavy for SVD-Softmax in a NumPy implementation. One configuration of Differentiated Softmax (D-Softmax) is also compared, although the comparison is somewhat unfair since its main focus is training speedup (Chen et al., 2015). (D-Softmax is selected instead of adaptive-softmax (Grave et al., 2016) because we measure time on CPU rather than GPU.) For D-Softmax, words are sorted by frequency; the first and second quarters use the full and half embedding sizes, and the tail uses a quarter of the embedding size. For example, in PTB, we split the words into buckets of (2500, 2500, 5000) with embedding sizes (200, 100, 50). For a fair comparison, we report latency without sorting and indexing for SVD-Softmax, whereas for the full softmax, D-Softmax, and DS-Softmax the full latency is reported. The latency results are shown in Table 4: DS-Softmax achieves significantly better theoretical speedup as well as lower latency on Wiki-2. Moreover, compared to D-Softmax, our learned hierarchy achieves a much better speedup without loss of performance.
3.6 Mitosis training
Here we evaluate the efficiency of mitosis training on the PTB language modeling task. The model is initialized with 2 experts and cloned to 4, 8, 16, 32, and 64 experts sequentially. As shown in Figure 5(a), cloning happens every 15 epochs and pruning starts 10 epochs after each cloning. In the end, training the DS-64 model requires at most 3.25x the memory of a single softmax while achieving similar performance, significantly less than the original 64-fold memory.
3.7 Qualitative analysis of sparsity
We demonstrate the relationship between redundancy and word frequency in Figure 5(b), where redundancy denotes the number of experts that contain a given word. We find that words with higher frequency appear in more experts. This mirrors a phenomenon observed in topic models (Blei et al., 2003; Wallach, 2006), and the related observation that more frequent words require higher-capacity models (Chen et al., 2015). We manually inspect the smallest expert in such a model, in which 64 words remain (words appearing in more than a certain number of experts are filtered out). The words left in this expert are semantically related: three major groups can be identified, namely money, time, and comparison, shown below:
Money: million, billion, trillion, earnings, share, rate, stake, bond, cents, bid, cash, fine, payable.
Time: years, while, since, before, early, late, yesterday, annual, currently, monthly, annually, Monday, Tuesday, Wednesday, Thursday, Friday.
Comparison: up, down, under, above, below, next, though, against, during, within, including, range, higher, lower, drop, rise, growth, increase, less, compared, unchanged.
3.8 Post-approximation of Learned Experts
To speed up softmax inference, most existing methods are based on post-approximation of a learned and fixed softmax (Shim et al., 2017; Chen et al., 2018a; Mussmann et al., 2017). In DS-Softmax, each expert can be viewed as an individual softmax over a subset of the classes, so we can take a learned DS-Softmax and apply a post-approximation technique such as SVD-Softmax (Shim et al., 2017) to each learned expert. To demonstrate this, two experiments are conducted: applying SVD-10 to DS-2, and applying SVD-50 (top 50% in the preview window) to DS-64, where SVD is applied only to experts with more than one thousand classes. A higher percentage is used for DS-64 because fewer classes remain in each expert. The results in Table 5 show that our technique can be further improved by combining it with SVD-Softmax.
[Table 5: combining DS-Softmax with SVD-Softmax. DS-2 & SVD-10: 0.255 accuracy, 9.64x speedup; DS-64 & SVD-50: 0.255 accuracy, 32.77x speedup.]
4 Related Work
The problem of reducing softmax complexity has been widely studied before (Gutmann & Hyvärinen, 2012; Chen et al., 2015; Grave et al., 2016; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). There are mainly two goals: training speedup and inference speedup. In our work, we focus on inference, where we would like to find the top-k classes efficiently and accurately.
Most existing works for reducing softmax inference complexity are based on post-approximation of a fixed softmax trained in the standard way. Locality Sensitive Hashing (LSH) has been demonstrated as a powerful technique in this category (Shrivastava & Li, 2014; Maddison et al., 2014; Mussmann et al., 2017; Spring & Shrivastava, 2017). Small-world graphs are another powerful technique for this problem (Zhang et al., 2018). Recent work proposes learning-based clustering of the trained embedding, which overcomes the non-differentiability problem (Chen et al., 2018a). In addition, a decomposition-based method, SVD-Softmax (Shim et al., 2017), speeds up the search through a smaller preview matrix. However, as approximations to a fixed softmax, these methods suffer high cost when high precision is required (Chen et al., 2018a), implying a worse trade-off between efficiency and accuracy. In contrast, the proposed DS-Softmax adapts the softmax itself and learns a hierarchical structure to find the top-k classes. Furthermore, these methods can also be applied on top of ours by viewing each expert as a single softmax, which makes them orthogonal to our approach.
Hierarchical softmax is another family of methods similar to ours. The most related ones in this category are D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016). These two methods can speed up both training and inference, while other methods (Morin & Bengio, 2005; Mnih & Hinton, 2009)
usually cannot speed up inference. Their hierarchies are constructed from the unbalanced word/class distribution implied by Zipf's law, which raises two major issues. First, the hierarchy is pre-defined by heuristics and can be sub-optimal. Second, the skewness of the class distribution in some tasks, e.g. image classification, may not be as pronounced as in language modeling. The proposed DS-Softmax overcomes these limitations by automatically learning a two-level overlapping hierarchy among classes.
Our method is inspired by the sparsely-gated mixture-of-experts (MoE) (Shazeer et al., 2017),
which achieves significantly better performance in language modeling and translation with large but sparsely activated experts. However, MoE cannot speed up softmax inference by construction, because each expert covers the whole output space. Our work on softmax inference speedup can also be seen as part of recent efforts to make neural networks more compact (Han et al., 2015; Chen et al., 2018c) and efficient (Howard et al., 2017; Chen et al., 2018b), making modern neural networks faster and more widely applicable.
5 Conclusion
In this paper, we presented the Doubly Sparse Softmax, a sparse mixture of sparse experts for efficient softmax inference. Our method is learning-based and adapts the softmax for fast inference: it learns a two-level overlapping class hierarchy in which each expert is responsible for only a small subset of the output space. During inference, it first identifies the responsible expert and then performs a small-scale softmax computation within that expert. Our experiments on several real-world tasks demonstrate the efficacy of the proposed method.
- Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
- Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
- Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
- Chen et al. (2018a) Chen, P. H., Si, S., Kumar, S., Li, Y., and Hsieh, C.-J. Learning to screen for fast softmax inference on large vocabulary neural networks. arXiv preprint arXiv:1810.12406, 2018a.
- Chen et al. (1998) Chen, S. F., Beeferman, D., and Rosenfeld, R. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, pp. 275–280. Citeseer, 1998.
- Chen et al. (2018b) Chen, T., Lin, J., Lin, T., Han, S., Wang, C., and Zhou, D. Adaptive mixture of low-rank factorizations for compact neural modeling. In Advances in neural information processing systems (CDNNRIA workshop), 2018b.
- Chen et al. (2018c) Chen, T., Min, M. R., and Sun, Y. Learning k-way d-dimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464, 2018c.
- Chen et al. (2015) Chen, W., Grangier, D., and Auli, M. Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906, 2015.
- Gers et al. (1999) Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with lstm. 1999.
- Goodman (2001) Goodman, J. Classes for fast maximum entropy training. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, volume 1, pp. 561–564. IEEE, 2001.
- Grave et al. (2016) Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309, 2016.
- Gutmann & Hyvärinen (2012) Gutmann, M. U. and Hyvärinen, A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
- Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
- Liu et al. (2011) Liu, C.-L., Yin, F., Wang, D.-H., and Wang, Q.-F. Casia online and offline chinese handwriting databases. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pp. 37–41. IEEE, 2011.
- Luong & Manning (2015) Luong, M.-T. and Manning, C. D. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.
- Maddison et al. (2014) Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
- Marcus et al. (1994) Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The penn treebank: annotating predicate argument structure. In Proceedings of the workshop on Human Language Technology, pp. 114–119. Association for Computational Linguistics, 1994.
- Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Mnih & Hinton (2009) Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088, 2009.
- Morin & Bengio (2005) Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252. Citeseer, 2005.
- Mussmann et al. (2017) Mussmann, S., Levy, D., and Ermon, S. Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. arXiv preprint arXiv:1707.03372, 2017.
- Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
- Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
- Shim et al. (2017) Shim, K., Lee, M., Choi, I., Boo, Y., and Sung, W. Svd-softmax: Fast softmax approximation on large vocabulary neural networks. In Advances in Neural Information Processing Systems, pp. 5463–5473, 2017.
- Shrivastava & Li (2014) Shrivastava, A. and Li, P. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
- Spring & Shrivastava (2017) Spring, R. and Shrivastava, A. A new unbiased and efficient class of lsh-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.
- Sun et al. (2014) Sun, Y., Wang, X., and Tang, X. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
- Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
- Wallach (2006) Wallach, H. M. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning, pp. 977–984. ACM, 2006.
- Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953, 2017.
- Zaremba et al. (2014) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- Zhang et al. (2018) Zhang, M., Liu, X., Wang, W., Gao, J., and He, Y. Navigating with graph representations for fast and scalable decoding of neural language models. arXiv preprint arXiv:1806.04189, 2018.