1 Introduction
Deep learning models have demonstrated impressive performance in many classification problems (LeCun et al., 2015). In many of these models, the softmax function is commonly used to produce categorical distributions over the output space. Due to its linear complexity in the output size, the softmax computation can become a bottleneck in tasks with large output dimensions, such as language modeling
(Bengio et al., 2003), neural machine translation (Bahdanau et al., 2014) and face recognition (Sun et al., 2014). In language modeling, even with a small RNN model (Zaremba et al., 2014), the softmax contributes more than 95% of the computation (Merity et al., 2016). This becomes a significant bottleneck when computational resources are limited, such as when deploying the model to mobile devices (Howard et al., 2017).

Many methods have been proposed to reduce softmax complexity. The softmax computation bottleneck is present in both the training and inference phases, but with different objectives. For training, the goal is to estimate the categorical distribution and approximate the normalization term as quickly as possible. Therefore, sampling-based
methods (Gutmann & Hyvärinen, 2012) and hierarchy-based methods (Goodman, 2001; Morin & Bengio, 2005; Chen et al., 2015; Grave et al., 2016) were introduced. Hierarchy-based methods speed up training by calculating the normalization term on a subset of classes. Recent works in this area, such as D-Softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016), construct two-level hierarchies over the output classes using the unbalanced word distribution.

In this work, we focus on reducing softmax computation during the inference phase. Unlike in training, our goal at inference is not to compute the exact categorical distribution over the whole vocabulary, but rather to find the top-k classes accurately and efficiently. Most existing methods formulate this as an approximate maximum inner product search problem given an already learned and fixed softmax, and focus on designing efficient approximation techniques to find the top-k classes after the standard softmax training procedure has been performed (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). Since the softmax training objective is misaligned with the inference objective, the learned and fixed softmax may not be structured in a (hierarchical) way that makes locating the top-k easy, which can lead to suboptimal trade-offs between efficiency and accuracy. Motivated by this observation, we want to make the training procedure aware of the approximation used at inference time, and adapt the softmax to be hierarchically structured.
To achieve this, we propose a novel Doubly Sparse Softmax (DS-Softmax) layer. The model learns a two-level overlapping hierarchy using a sparse-mixture-of-sparse-experts structure during training. Each expert is sparse and contains only a small subset of the entire output class space,
while each class is permitted to belong to more than one expert. Given an input vector and a set of experts, the DS-Softmax first selects the single expert most relevant to the input (in contrast to a dense mixture of experts). The chosen sparse expert then returns a categorical distribution over a subset of the classes. The reduction in computation comes from never having to consider the whole vocabulary. Compared to existing methods (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a), our model can better adapt the softmax embedding during training, which results in a better trade-off. In addition, our method can be combined with existing post-approximation methods by treating each expert as a single softmax.

We conduct experiments on one synthetic dataset and three real tasks: language modeling, neural machine translation, and image classification. We demonstrate that our method can reduce softmax computation dramatically without loss of prediction performance. For example, we achieve more than 23x speedup in language modeling and 15x speedup in translation with similar performance. Qualitatively, we also show that the learned two-level overlapping hierarchy is semantically meaningful on natural language modeling tasks.
The contributions of our work are summarized as follows:

- We propose a novel learning-based method to speed up softmax inference for large output spaces, enabling fast retrieval of the top-k classes.

- The proposed method learns a two-level overlapping hierarchy during training that facilitates fast inference and bridges the misaligned objectives of the softmax training and inference phases.

- We empirically demonstrate that the proposed method achieves significant speedup without loss of prediction performance.
2 The Doubly Sparse Softmax
In this section, we introduce the softmax inference problem, as well as the proposed method.
2.1 Softmax Inference Problem
Given a context vector $h \in \mathbb{R}^d$, a softmax layer is used to compute a categorical distribution over a set of classes $\mathcal{V}$. In particular, it is defined as $p(y=i \mid h) = \exp(w_i^{\top} h)/Z$, where $Z = \sum_{j \in \mathcal{V}} \exp(w_j^{\top} h)$ is the normalization term and $W = [w_1, \dots, w_{|\mathcal{V}|}]$ is the softmax embedding parameter. For inference, our goal is not to compute the full exact distribution, but rather to find the top-k classes, i.e. $\{\, i : p(y=i \mid h) \ge p_{(k)} \,\}$, where $p_{(k)}$ is the $k$-th largest value of $p(y \mid h)$. The conventional way to do so is to compute the whole probability vector and select the top-k, which has $O(d\,|\mathcal{V}|)$ complexity.^{1} Facing a large output space (i.e. large $|\mathcal{V}|$), the softmax layer becomes a bottleneck, and our goal is to find the top-k both accurately and efficiently.

^{1} Top-k selection requires an extra $O(|\mathcal{V}|)$ on average by Quickselect.
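As a concrete picture of this baseline, the naive procedure can be sketched in a few lines of NumPy (the sizes and names here are illustrative, not from the paper):

```python
import numpy as np

def full_softmax_topk(h, W, k):
    """Naive baseline: compute all |V| logits (the O(d|V|) bottleneck),
    normalize, then select the top-k classes."""
    logits = W @ h                         # O(d * |V|) matrix-vector product
    z = np.exp(logits - logits.max())      # numerically stable exponentiation
    probs = z / z.sum()                    # divide by the normalization term Z
    topk = np.argpartition(-probs, k)[:k]  # average-case O(|V|) selection
    return topk[np.argsort(-probs[topk])], probs

rng = np.random.default_rng(0)
d, V = 64, 10_000
W = rng.standard_normal((V, d))            # softmax embedding, one row per class
h = rng.standard_normal(d)                 # context vector
top5, probs = full_softmax_topk(h, W, 5)
```

Every query pays for the full `W @ h` product regardless of how few classes are actually needed, which is exactly the cost the rest of the section tries to avoid.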
2.2 Motivation
Many natural discrete objects/classes, such as words in natural language, exhibit hierarchical structure, with objects organized in a tree-like fashion. A hierarchical structure enables much faster retrieval, since we do not need to consider the whole set. Goodman (2001); Chen et al. (2015); Grave et al. (2016) studied two-level hierarchies for language modeling, where each word belongs to a unique cluster while the hierarchy is constructed with different approaches. (A “cluster” here refers to a cluster of words.) However, constructing the hierarchy is challenging and usually based on heuristics. Moreover, requiring mutually exclusive clusters can be very limiting, because, as in language modeling, it is often difficult to assign a word to exactly one cluster. For example, if we want to predict the next word of “I want to eat” and one possible correct answer is “cookie”, we quickly notice that the answer belongs to something eatable. If we only search for the answer among words with the eatable property, we can dramatically increase efficiency. But even though “cookie” is a correct answer here, it can also appear in non-edible contexts, such as “a piece of data” in computer science literature. A two-level overlapping hierarchy naturally accommodates homonyms like this by allowing each word to belong to more than one cluster. We believe this observation is likely to hold in applications beyond language modeling.

2.3 The Proposed Method
Inspired by such hierarchical structures, we propose our method, the Doubly Sparse Softmax (DS-Softmax), to automatically capture and leverage them for softmax inference speedup. The proposed method learns an overlapping two-level hierarchy among output classes: the first level is the sparse mixture, and the second level contains several sparse experts. A sparse expert is a cluster of classes, a subset of the whole class set, and we allow each class to belong to more than one expert (non-exclusive). To generate the top-k classes, the sparse mixture enables a fast, dynamic selection of the right expert according to the context vector $h$; the selected sparse expert then allows a fast softmax computation over a small subset of the classes.
The framework is illustrated in Figure 1 and contains two major components: (1) the sparse mixture/gating network, which enables the selection of a single top expert, and (2) the sparse experts, which are pruned from a full softmax with group lasso. We also leverage a loading balance term to balance the utilization of different experts, and mitosis training to scale to larger numbers of experts. The final objective is a combination of the task-specific loss $\mathcal{L}_{task}$, the group lasso loss $\mathcal{L}_{lasso}$, and the loading balance regularization losses $\mathcal{L}_{load}$ and $\mathcal{L}_{overlap}$. The overall training algorithm is summarized in Algorithm 1.
Sparse mixture
The first level of sparsification is a sparse gating network, designed to find the right expert given the context vector $h$. To facilitate faster inference, only the single most suitable expert is chosen. This sparse gating network is similar to the one in Shazeer et al. (2017), but the technique we propose here supports a single top expert output while maintaining meaningful gradients.
To be more specific, suppose we have $K$ experts. Given the context vector $h$ and the gating network weight $W_g$, the gating values $g_e(h)$, $e = 1, \dots, K$, are calculated and normalized prior to the selection, as shown in Eq. 1; then only the largest gating value is kept while all other gates are set to zero. More specifically,

$\tilde{g}_e(h) = \dfrac{\exp(W_g^{(e)\top} h)}{\sum_{e'=1}^{K} \exp(W_g^{(e')\top} h)}, \qquad g_e(h) = \begin{cases} \tilde{g}_e(h) & \text{if } e = \arg\max_{e'} \tilde{g}_{e'}(h) \\ 0 & \text{otherwise} \end{cases} \qquad (1)$

where $W_g$ is the weighting matrix for expert selection and only the top-1 expert is kept. Eq. 1 still allows the gradient to be back-propagated to the whole $W_g$, since the gating values are normalized before the top-1 selection.
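A minimal forward-pass sketch of this normalize-then-mask gating step (shapes and names are illustrative):

```python
import numpy as np

def sparse_gate(h, Wg):
    """Top-1 gating: normalize all K gating logits first, then zero out
    everything except the largest gate. Because the softmax runs before
    the masking, the surviving gate value depends on every row of Wg,
    which is what lets gradients reach the whole gating matrix."""
    logits = Wg @ h                     # K gating logits
    g = np.exp(logits - logits.max())
    g = g / g.sum()                     # normalized gate values
    e = int(np.argmax(g))               # index of the chosen expert
    gate = np.zeros_like(g)
    gate[e] = g[e]                      # keep only the top-1 gate
    return e, gate

rng = np.random.default_rng(1)
K, d = 8, 16
Wg = rng.standard_normal((K, d))
h = rng.standard_normal(d)
e, gate = sparse_gate(h, Wg)
```

Had the masking been applied before the normalization, the surviving gate would always be 1 and the gating weights of the unchosen experts would receive no signal; the ordering here is what makes the gate trainable.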
Given the sparse gate, we can further compute the probability of class $i$ under the context $h$ as:

$p(y=i \mid h) = \dfrac{\exp\big(g_e(h)\, w_i^{(e)\top} h\big)}{\sum_{j \in \mathcal{V}_e} \exp\big(g_e(h)\, w_j^{(e)\top} h\big)} \qquad (2)$
where $g_e(h)$ is the gating value of the chosen expert $e$, $W^{(e)}$ is the softmax embedding weight matrix for the $e$-th expert, and $\mathcal{V}_e$ is the subset of classes that expert contains. The gating value can be interpreted as an inverse temperature term for the final categorical distribution produced by the chosen expert (Hinton et al., 2015): a smaller $g_e(h)$ gives a more uniform distribution and a larger one gives a sharper distribution, adjusted automatically according to the context. It is worth noting that during inference we only need to compute the single chosen expert, given that all other gates are zero, and select the top-k classes within it. During training, we use the distribution in Eq. 2 as the softmax output and train it end-to-end w.r.t. the task-specific loss $\mathcal{L}_{task}$.^{2} This makes training aware of the approximation used at inference, since the sparse gating network behaves consistently in both phases.

^{2} In practice, we can also pre-train all layers with a full softmax and then retrain only the softmax layer, which takes the context $h$ and outputs $y$, while leaving the previous layers fixed.

Sparse experts with group lasso
The second level of sparsification is the sparse experts. We would like each expert to contain only a small subset of the whole class set, i.e. to output a categorical distribution in which most entries are zero. To obtain a sparse expert, we initialize each expert as a full softmax covering all classes and apply group lasso to iteratively prune out irrelevant classes. More specifically, we add the regularization loss

$\mathcal{L}_{lasso} = \lambda_{lasso} \sum_{e=1}^{K} \sum_{i \in \mathcal{V}_e} \big\lVert w_i^{(e)} \big\rVert_2 \qquad (3)$

and prune a class from an expert once the norm of its embedding falls below a predefined threshold $\epsilon$:

$\mathcal{V}_e \leftarrow \big\{\, i \in \mathcal{V}_e : \lVert w_i^{(e)} \rVert_2 \ge \epsilon \,\big\} \qquad (4)$

This regularization actively shrinks the embedding vectors in each expert toward zero so that they can be pruned. When heavily regularized, many classes are pruned out of each expert, leaving a set of sparse experts.
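The group lasso penalty and the norm-threshold pruning it enables can be sketched as follows (the toy expert and the threshold value are illustrative):

```python
import numpy as np

def group_lasso(expert):
    """Group lasso over one expert: sum of the L2 norms of the per-class
    embedding rows, which pushes whole rows toward zero."""
    return np.linalg.norm(expert, axis=1).sum()

def prune(expert, class_ids, eps=0.01):
    """Drop classes whose embedding norm fell below the threshold eps;
    returns the surviving class ids and the shrunken weight matrix."""
    keep = np.linalg.norm(expert, axis=1) >= eps
    return class_ids[keep], expert[keep]

# Toy expert: rows 1 and 3 have been regularized down to (near) zero.
expert = np.array([[0.5, 0.5],
                   [1e-4, 1e-4],
                   [0.3, -0.2],
                   [0.0, 0.0]])
ids, pruned = prune(expert, np.arange(4))
```

The key point is that the penalty groups an entire embedding row: either a class keeps a useful embedding in an expert, or the whole row shrinks and the class disappears from that expert's subset.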
Loading Balance
A balanced utilization of experts is preferred. Denote the number of final classes in the $e$-th expert by $|\mathcal{V}_e|$ and the total number of classes by $|\mathcal{V}|$. The utilization ratio $u_e$ indicates the probability of expert $e$ being selected on a given dataset; for example, if the model is run 10,000 times and the $e$-th expert is selected 100 times, its utilization ratio is 0.01. The overall speedup is calculated as $|\mathcal{V}| / \sum_e u_e |\mathcal{V}_e|$. An unbalanced load is therefore undesirable, since the model can degenerate into a single big softmax, leading to less speedup. To address this, we add a loading balance loss that encourages a more balanced utilization of experts. The loading function of Shazeer et al. (2017) is adopted here, shown in Eq. 5: it encourages the utilization of each expert to be balanced by minimizing the squared coefficient of variation (CV) of the gating outputs. In addition, we include the group lasso loss term in Eq. 6 so that each class is encouraged to exist in only one or a few experts.
$\mathcal{L}_{load} = \lambda_{load} \cdot \mathrm{CV}\Big( \textstyle\sum_{h \in \mathcal{B}} g(h) \Big)^{2} \qquad (5)$

$\mathcal{L}_{overlap} = \lambda_{overlap} \sum_{i \in \mathcal{V}} \sum_{e \,:\, i \in \mathcal{V}_e} \big\lVert w_i^{(e)} \big\rVert_2 \qquad (6)$
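A minimal sketch of the CV-based loading balance loss, assuming (as an illustration) that the load of an expert is its summed gate value over a batch:

```python
import numpy as np

def load_balance_loss(gates):
    """Squared coefficient of variation of the per-expert load, where
    the load is the sum of gate values over a batch (shape: batch x K).
    The loss is zero when every expert carries the same load."""
    load = gates.sum(axis=0)
    return (load.std() / load.mean()) ** 2

balanced = np.eye(4)                           # each expert picked once
skewed = np.zeros((4, 4)); skewed[:, 0] = 1.0  # one expert does everything
```

Minimizing this term pushes the gating network away from the degenerate solution in which a single large expert absorbs the whole vocabulary.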
Complexity Analysis
The proposed method is efficient for softmax inference, as it consists only of (1) a sparse gating step to choose an expert, which has $O(dK)$ complexity given $K$ experts, and (2) a small-scale softmax over the selected sparse expert to compute the sparse categorical distribution, which has $O(d \lambda |\mathcal{V}| / K)$ complexity on average, given a balanced set of experts in which a class/word belongs to $\lambda$ experts on average. With a reasonable $K$, a significant speedup can be expected.
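The cost accounting above can be checked with a quick back-of-envelope calculation (the sizes below are hypothetical, chosen to roughly match the language-modeling setting reported later):

```python
# FLOP comparison for the analysis above, with hypothetical sizes:
# d = 200, |V| = 10,000, K = 64 experts, and each class kept in
# lam = 2 experts on average, so a balanced expert holds lam*|V|/K classes.
d, V, K, lam = 200, 10_000, 64, 2.0

full_cost = d * V                   # full softmax logit computation
ds_cost = d * K + d * lam * V / K   # top-1 gating + one sparse expert
speedup = full_cost / ds_cost       # roughly a 26x reduction at these sizes
```

Note that the gating term grows with $K$ while the expert term shrinks with it, so $K$ trades the two costs off against each other.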
Mitosis Training
As for training, since we initialize each expert as a full softmax and gradually prune it into a sparse expert with a small-scale softmax, the DS-Softmax layer requires $K$ times the memory of a regular softmax layer during training when $K$ experts are used, even though the final sparse experts require much less memory. To mitigate this issue, we introduce a memory-efficient training technique named mitosis training.
Mitosis training is a strategy that progressively increases the number of experts during training. We start training with a small number of experts; once it converges, we split each expert into two identical ones and repeat the same training procedure with the resulting model. At the time an expert is split in two, it is already relatively sparse and much smaller than the full softmax, so the memory consumption is far lower than training all experts from full size. An illustration of mitosis training can be found in Fig. 2.
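A sketch of one splitting step; note that the small perturbation added so the two copies can diverge is our own assumption for illustration, while the paper describes splitting into identical experts:

```python
import numpy as np

def mitosis_split(experts, Wg, noise=1e-3, seed=0):
    """Double the expert count: clone each (possibly already pruned)
    expert together with its gating row. A tiny noise term is added here
    (an assumption, not from the paper) so the two copies can diverge."""
    rng = np.random.default_rng(seed)
    new_experts, new_rows = [], []
    for W_e, row in zip(experts, Wg):
        for _ in range(2):
            new_experts.append(W_e + noise * rng.standard_normal(W_e.shape))
            new_rows.append(row + noise * rng.standard_normal(row.shape))
    return new_experts, np.stack(new_rows)

rng = np.random.default_rng(2)
experts = [rng.standard_normal((50, 8)), rng.standard_normal((40, 8))]  # pruned experts
Wg = rng.standard_normal((2, 8))                                        # gating weights
experts2, Wg2 = mitosis_split(experts, Wg)
```

Because splitting happens only after pruning has shrunk each expert, the peak memory is the sum of the current (already sparse) expert sizes rather than $K$ full softmax copies.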
3 Experiments
We present our empirical evaluations on both real and synthetic tasks in this section. First, we create a synthetic task with a two-level hierarchy and test our model's ability to learn the hierarchical structure. Second, we consider three real tasks: natural language modeling, neural machine translation, and Chinese handwritten character recognition. Both theoretical speedup (reduction in FLOPs) and real-device latency (on CPU) are reported. Finally, some ablations and a case study are presented to better understand what the model has learned.
In terms of experimental setup, we leave the task-specific details for later and present the model setup here. The proposed DS-Softmax layer can be trained jointly with the other layers in an end-to-end fashion. For real tasks, we find it easier to first pre-train the whole model with a conventional softmax, then replace the softmax layer with DS-Softmax and retrain only the new layer, keeping the others fixed, using Adam (Kingma & Ba, 2014). For hyper-parameters, the loading balance weight and the pruning threshold $\epsilon$ are fixed for all tasks at 10 and 0.01 respectively. The group lasso weights share the same value and are tuned with the following strategy: start at zero and increase exponentially until validation performance decreases. The reported performance is on an independent testing dataset. For baselines, we mainly compare to the conventional full softmax and the recently proposed SVD-Softmax (Shim et al., 2017) and D-Softmax (Chen et al., 2015).
3.1 Synthetic task
A synthetic dataset with a two-level hierarchy is constructed to test our model. As illustrated in Figure 3(a), data points are organized around hierarchical centers: multiple sub clusters belong to one super cluster. To generate such data, we first sample the center of each super cluster, then sample the centers of its sub clusters around it, and finally draw each data point near a sub cluster center. More specifically,
$\mu_s \sim \mathcal{N}\big(0,\, \sigma_{super}^2 I\big) \qquad (7)$

$\mu_{s,c} \sim \mathcal{N}\big(\mu_s,\, \sigma_{sub}^2 I\big) \qquad (8)$

$x \sim \mathcal{N}\big(\mu_{s,c},\, \sigma_{x}^2 I\big) \qquad (9)$
where the scale parameter is set to 10 and the data dimension to 100. For sanity checking and visualization purposes, we make sure the ground-truth hierarchy in the synthetic data has no overlapping.
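The three-step sampling process can be sketched as follows; the scale values are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def make_hierarchical_data(n_super=10, n_sub=10, dim=100, n_points=20,
                           s_super=10.0, s_sub=1.0, s_point=0.1, seed=0):
    """Sample super-cluster centers, then sub-cluster centers around
    them, then points around each sub center (cf. Eqs. 7-9). The label
    of a point is its globally indexed sub cluster. The scale values
    are illustrative, not the paper's settings."""
    rng = np.random.default_rng(seed)
    X, y = [], []
    for s in range(n_super):
        mu_super = s_super * rng.standard_normal(dim)          # super center
        for c in range(n_sub):
            mu_sub = mu_super + s_sub * rng.standard_normal(dim)  # sub center
            X.append(mu_sub + s_point * rng.standard_normal((n_points, dim)))
            y.extend([s * n_sub + c] * n_points)
    return np.vstack(X), np.array(y)

X, y = make_hierarchical_data(n_super=3, n_sub=4, dim=10, n_points=5)
```

With the super-cluster scale much larger than the sub-cluster scale, the hierarchy is unambiguous, which is what makes this a clean sanity check for whether the learned experts recover it.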
We treat the coordinates of a data point as features and its sub cluster membership as the target. We construct a two-layer Multilayer Perceptron (MLP) with DS-Softmax as the final layer; ideally, it should capture the hierarchical structure. Two super/sub cluster sizes are evaluated, 10x10 and 100x100, where 10x10 means there are 10 super clusters, each containing 10 sub clusters.
We investigate the captured hierarchy by examining how sub clusters are distributed across experts. As mentioned, each expert contains only a subset of the output classes, because class-level pruning is conducted during training. We illustrate the remaining classes in each expert in Fig. 3(b) and Fig. 3(c) for the 10x10 and 100x100 settings respectively, and find that DS-Softmax can perfectly capture the hierarchy. We further run an ablation analysis on the 10x10 results, shown in Fig. 4, to study the effect of each additional loss; all the loss terms discussed above prove important to the model.
3.2 Language Modeling
Language modeling is a task whose goal is to predict the next word given the context. For a language such as English, the vocabulary is large and the softmax can be a bottleneck for inference efficiency. We use two standard datasets for word-level language modeling: Penn Treebank (PTB) (Marcus et al., 1994) and WikiText-2 (Merity et al., 2016), whose output dimensions are 10,000 and 33,278 respectively. A standard two-layer LSTM model (Gers et al., 1999) with hidden size 200 is used.^{3} We use accuracy as our metric, as it is a common metric (Chen et al., 1998) in language modeling, especially in real applications where an extrinsic reward is given, such as voice recognition. Top 1, Top 5 and Top 10 accuracies on the testing datasets are reported. As shown in Table 1, 15.99x and 23.86x speedups (in terms of FLOPs) are achieved with 64 experts without loss of accuracy.^{4} Moreover, a slight improvement in performance is observed, which suggests the mixture of softmaxes brings an additional benefit, consistent with the improvement from breaking the low-rank bottleneck (Yang et al., 2017).

^{3} https://github.com/tensorflow/models/tree/master/tutorials/rnn/ptb
^{4} One copy of each word must be kept across all experts during training. Otherwise, 80x speedup is easily achieved without loss of accuracy, at the cost of missing low-frequency words.
Table 1: Testing accuracy and FLOPs speedup on language modeling.

Task            Method  Top 1  Top 5  Top 10  Speedup
PTB (10,000)    Full    0.252  0.436  0.515   -
                DS-8    0.257  0.448  0.530   2.84x
                DS-16   0.258  0.450  0.529   5.13x
                DS-32   0.259  0.449  0.529   9.43x
                DS-64   0.258  0.450  0.529   15.99x
WIKI2 (33,278)  Full    0.257  0.456  0.533   -
                DS-8    0.259  0.459  0.536   3.52x
                DS-16   0.264  0.469  0.547   6.58x
                DS-32   0.260  0.460  0.535   11.59x
                DS-64   0.259  0.458  0.533   23.86x
3.3 Neural Machine Translation
Neural machine translation is another task commonly used for softmax speedup evaluation. We use the IWSLT English-to-Vietnamese dataset (Luong & Manning, 2015) and evaluate performance by BLEU score (Papineni et al., 2002) with greedy search, assessed on the testing dataset. The baseline full-softmax model is a seq2seq model (Sutskever et al., 2014) implemented in TensorFlow (Abadi et al., 2016).^{5} The number of output words is 7,709, and the results are shown in Table 2, where our method achieves 15.08x speedup (in terms of FLOPs) with a similar BLEU score.

^{5} https://github.com/tensorflow/nmt

Table 2: BLEU score and FLOPs speedup on IWSLT English-Vietnamese.

Task                 Method  BLEU  Speedup
IWSLT En-Ve (7,709)  Full    25.2  -
                     DS-8    25.3  4.38x
                     DS-16   25.1  6.08x
                     DS-32   25.4  10.69x
                     DS-64   25.0  15.08x
3.4 Chinese Character Recognition
We also test on a Chinese handwritten character recognition task, using the offline CASIA dataset (Liu et al., 2011) with special characters filtered out. CASIA is a popular Chinese character recognition dataset with around four thousand characters. Unlike language-related tasks, the class distribution here is uniform rather than skewed. Two-thirds of the data are used for training and the rest for testing. Table 3 shows that we achieve a significant (6.91x) speedup (in terms of FLOPs) on this task.
Table 3: Accuracy and FLOPs speedup on CASIA.

Task           Method  Accuracy  Speedup
CASIA (3,740)  Full    90.6      -
               DS-8    90.8      1.77x
               DS-16   90.2      2.82x
               DS-32   89.9      4.72x
               DS-64   90.1      6.91x
3.5 Real device comparison
Real-device experiments were conducted on a machine with two Intel Xeon CPUs @ 2.20GHz and 16GB memory. All tested models are reimplemented in NumPy to ensure fairness. Two configurations of SVD-Softmax (Shim et al., 2017) are evaluated, SVD-5 and SVD-10, which use the top 5% and 10% of dimensions for final evaluation in the preview window, with window width 16. Indexing and sorting are computationally heavy for SVD-Softmax in a NumPy implementation, so for a fair comparison we report its latency without sorting and indexing; for full softmax, D-Softmax and DS-Softmax, the full latency is reported. One configuration of Differentiated Softmax (D-Softmax) is also compared, although the comparison is somewhat unfair since its main focus is training speedup (Chen et al., 2015).^{6} For D-Softmax, words are sorted by frequency: the first quarter uses the full embedding size, the second quarter half the size, and the tail a quarter of the size. For example, in PTB we split the words into buckets of (2500, 2500, 5000) with embedding sizes (200, 100, 50). The latency results are shown in Table 4: DS-Softmax achieves significantly better theoretical speedup as well as lower latency than the baselines on Wiki2. Moreover, compared to D-Softmax, our learned hierarchy achieves a much better speedup without loss of performance.

^{6} D-Softmax is selected instead of adaptive-softmax (Grave et al., 2016) because we measure time on CPU rather than GPU.
Table 4: Real-device comparison. Each cell reports the task metric value, the FLOPs speedup, and the measured CPU latency in ms.

Task    Full (value/ms)  DS-64 (value/FLOPs/ms)  SVD-5 (value/FLOPs/ms)  SVD-10 (value/FLOPs/ms)  D-Softmax (value/FLOPs/ms)
PTB     0.252 / 0.73     0.258 / 15.99x / 0.05   0.249 / 6.67x / 0.12    0.251 / 5.00x / 0.18     0.245 / 2.00x / 0.36
Wiki2   0.257 / 3.07     0.259 / 23.86x / 0.15   0.253 / 7.35x / 0.43    0.255 / 5.38x / 0.60     0.256 / 2.00x / 1.59
En-Ve   25.2 / 1.91      25.0 / 15.08x / 0.13    25.0 / 6.77x / 0.32     25.1 / 5.06x / 0.42      24.8 / 2.00x / 0.98
CASIA   90.6 / 1.61      90.1 / 6.91x / 0.25     89.9 / 3.00x / 0.59     90.2 / 2.61x / 0.68      -
3.6 Mitosis training
Here we evaluate the efficiency of mitosis training on the PTB language modeling task. The model is initialized with 2 experts and cloned to 4, 8, 16, 32 and 64 experts sequentially. As demonstrated in Figure 5(a), cloning happens every 15 epochs and pruning starts 10 epochs after each cloning. In the end, training the DS-64 model requires at most 3.25x the memory of a single softmax while achieving similar performance, significantly less than the original 64-fold memory.
3.7 Qualitative analysis of sparsity
We demonstrate the relation between redundancy and word frequency in Figure 5(b), where redundancy indicates the number of experts containing a given word. We find that higher-frequency words appear in more experts. This resembles phenomena in topic models (Blei et al., 2003; Wallach, 2006), and echoes the observation that more frequent words require higher-capacity models (Chen et al., 2015). We manually inspect the smallest expert in such a model, in which 64 words remain.^{7} The words left in this expert are semantically related: three major groups can be identified, concerning money, time and comparison, shown in the following:

^{7} Words that exist in more than a threshold number of experts are filtered out.


- Money: million, billion, trillion, earnings, share, rate, stake, bond, cents, bid, cash, fine, payable.

- Time: years, while, since, before, early, late, yesterday, annual, currently, monthly, annually, Monday, Tuesday, Wednesday, Thursday, Friday.

- Comparison: up, down, under, above, below, next, though, against, during, within, including, range, higher, lower, drop, rise, growth, increase, less, compared, unchanged.
3.8 Post-Approximation of Learned Experts
To speed up softmax inference, most existing methods are based on post-approximation of a learned and fixed softmax (Shim et al., 2017; Chen et al., 2018a; Mussmann et al., 2017). In DS-Softmax, we can also consider each expert as an individual softmax over a subset of the classes. This suggests taking the learned DS-Softmax and applying a post-approximation technique, such as SVD-Softmax (Shim et al., 2017), to each learned expert. To demonstrate this, two experiments are conducted: one applies SVD-10 to DS-2; the other applies SVD-50 (top 50% in the preview window) to DS-64, where the SVD is applied only to experts with more than one thousand classes. The higher percentage is used for DS-64 because fewer classes remain in each expert. The results in Table 5 show that our technique can be further improved by combining it with SVD-Softmax.
Table 5: Combining DS-Softmax with SVD-Softmax post-approximation on WikiText-2.

Task            Method          Accuracy  Speedup
WIKI2 (33,278)  Full            0.257     -
                DS-2            0.258     1.83x
                SVD-10          0.255     5.38x
                DS-2 & SVD-10   0.255     9.64x
                DS-64           0.259     23.86x
                SVD-50          0.256     1.72x
                DS-64 & SVD-50  0.255     32.77x
4 Related Work
The problem of reducing softmax complexity has been widely studied (Gutmann & Hyvärinen, 2012; Chen et al., 2015; Grave et al., 2016; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). There are two main goals: training speedup and inference speedup. Our work focuses on inference, where we would like to find the top-k classes efficiently and accurately.
Most existing works for reducing softmax inference complexity are based on post-approximation of a fixed softmax trained in the standard way. Locality Sensitive Hashing (LSH) has been demonstrated as a powerful technique in this category (Shrivastava & Li, 2014; Maddison et al., 2014; Mussmann et al., 2017; Spring & Shrivastava, 2017). Small world graphs are another powerful technique for this problem (Zhang et al., 2018). A recent work proposes learning-based clustering over the trained embedding, which overcomes the non-differentiability problem (Chen et al., 2018a). In addition, a decomposition-based method, SVD-Softmax (Shim et al., 2017), can speed up the search through a smaller preview matrix. However, as approximations of a fixed softmax, these methods suffer a high cost when high precision is required (Chen et al., 2018a), i.e. a worse trade-off between efficiency and accuracy. In contrast, the proposed DS-Softmax adapts the softmax itself and learns a hierarchical structure with which to find the top-k classes. Furthermore, it is worth noting that those methods can also be applied on top of ours, where each expert can be viewed as a single softmax, making them orthogonal to our approach.
Hierarchical softmax is another family of methods related to ours. The most relevant under this category are D-Softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016). These two methods can speed up both training and inference, while other methods (Morin & Bengio, 2005; Mnih & Hinton, 2009) usually cannot speed up inference. Their hierarchies are constructed from the unbalanced word/class distribution implied by Zipf's law. There are two major issues. First, the hierarchy is predefined by heuristics, which can be suboptimal. Second, the skewness of the class distribution in some tasks, e.g. image classification, may not be as pronounced as in language modeling. The proposed DS-Softmax overcomes these limitations by automatically learning the two-level overlapping hierarchy among classes.
Our method is inspired by the sparsely-gated mixture-of-experts (MoE) (Shazeer et al., 2017). MoE achieves significantly better performance in language modeling and translation with large but sparsely activated experts. However, MoE cannot speed up softmax inference by definition, because each expert covers the whole set of output classes. Our work on softmax inference speedup can also be seen as part of recent efforts to make neural networks more compact (Han et al., 2015; Chen et al., 2018c) and efficient (Howard et al., 2017; Chen et al., 2018b), through which modern neural networks become faster and more widely applicable.

5 Conclusion
In this paper, we present the Doubly Sparse Softmax (DS-Softmax), a sparse mixture of sparse experts for efficient softmax inference. Our method is learning-based and adapts the softmax for fast inference by learning a two-level overlapping class hierarchy, in which each expert is responsible for only a small subset of the output class space. During inference, the method first identifies the responsible expert and then performs a small-scale softmax computation within it. Our experiments on several real-world tasks demonstrate the efficacy of the proposed method.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI, 2016.
 Bahdanau et al. (2014) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
 Bengio et al. (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
 Blei et al. (2003) Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
 Chen et al. (2018a) Chen, P. H., Si, S., Kumar, S., Li, Y., and Hsieh, C.J. Learning to screen for fast softmax inference on large vocabulary neural networks. arXiv preprint arXiv:1810.12406, 2018a.
 Chen et al. (1998) Chen, S. F., Beeferman, D., and Rosenfeld, R. Evaluation metrics for language models. In DARPA Broadcast News Transcription and Understanding Workshop, pp. 275–280. Citeseer, 1998.
 Chen et al. (2018b) Chen, T., Lin, J., Lin, T., Han, S., Wang, C., and Zhou, D. Adaptive mixture of lowrank factorizations for compact neural modeling. In Advances in neural information processing systems (CDNNRIA workshop), 2018b.
 Chen et al. (2018c) Chen, T., Min, M. R., and Sun, Y. Learning kway ddimensional discrete codes for compact embedding representations. arXiv preprint arXiv:1806.09464, 2018c.
 Chen et al. (2015) Chen, W., Grangier, D., and Auli, M. Strategies for training large vocabulary neural language models. arXiv preprint arXiv:1512.04906, 2015.
 Gers et al. (1999) Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual prediction with lstm. 1999.
 Goodman (2001) Goodman, J. Classes for fast maximum entropy training. In Acoustics, Speech, and Signal Processing, 2001. Proceedings.(ICASSP’01). 2001 IEEE International Conference on, volume 1, pp. 561–564. IEEE, 2001.
 Grave et al. (2016) Grave, E., Joulin, A., Cissé, M., Grangier, D., and Jégou, H. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309, 2016.
 Gutmann & Hyvärinen (2012) Gutmann, M. U. and Hyvärinen, A. Noisecontrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
 Han et al. (2015) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Liu et al. (2011) Liu, C.-L., Yin, F., Wang, D.-H., and Wang, Q.-F. CASIA online and offline Chinese handwriting databases. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pp. 37–41. IEEE, 2011.
 Luong & Manning (2015) Luong, M.-T. and Manning, C. D. Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Da Nang, Vietnam, 2015.
 Maddison et al. (2014) Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
 Marcus et al. (1994) Marcus, M., Kim, G., Marcinkiewicz, M. A., MacIntyre, R., Bies, A., Ferguson, M., Katz, K., and Schasberger, B. The Penn Treebank: annotating predicate argument structure. In Proceedings of the Workshop on Human Language Technology, pp. 114–119. Association for Computational Linguistics, 1994.
 Merity et al. (2016) Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 Mnih & Hinton (2009) Mnih, A. and Hinton, G. E. A scalable hierarchical distributed language model. In Advances in Neural Information Processing Systems, pp. 1081–1088, 2009.
 Morin & Bengio (2005) Morin, F. and Bengio, Y. Hierarchical probabilistic neural network language model. In AISTATS, volume 5, pp. 246–252. Citeseer, 2005.
 Mussmann et al. (2017) Mussmann, S., Levy, D., and Ermon, S. Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. arXiv preprint arXiv:1707.03372, 2017.
 Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics, 2002.
 Shazeer et al. (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
 Shim et al. (2017) Shim, K., Lee, M., Choi, I., Boo, Y., and Sung, W. SVD-softmax: Fast softmax approximation on large vocabulary neural networks. In Advances in Neural Information Processing Systems, pp. 5463–5473, 2017.
 Shrivastava & Li (2014) Shrivastava, A. and Li, P. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
 Spring & Shrivastava (2017) Spring, R. and Shrivastava, A. A new unbiased and efficient class of LSH-based samplers and estimators for partition function computation in log-linear models. arXiv preprint arXiv:1703.05160, 2017.
 Sun et al. (2014) Sun, Y., Wang, X., and Tang, X. Deep learning face representation from predicting 10,000 classes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1891–1898, 2014.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Wallach (2006) Wallach, H. M. Topic modeling: beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, pp. 977–984. ACM, 2006.
 Yang et al. (2017) Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
 Zaremba et al. (2014) Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 Zhang et al. (2018) Zhang, M., Liu, X., Wang, W., Gao, J., and He, Y. Navigating with graph representations for fast and scalable decoding of neural language models. arXiv preprint arXiv:1806.04189, 2018.