Doubly Sparse: Sparse Mixture of Sparse Experts for Efficient Softmax Inference

01/30/2019 · by Shun Liao, et al.

Computing the softmax function is expensive when the number of output classes is large. In this paper, we present a novel softmax inference speedup method, Doubly Sparse Softmax (DS-Softmax), that leverages a sparse mixture of sparse experts to efficiently retrieve the top-k classes. Unlike most existing methods, which approximate a fixed, already-trained softmax, our method is learning-based and can adapt the softmax weights for a better approximation. In particular, it learns a two-level hierarchy that divides the entire output class space into several partially overlapping experts. Each expert is sparse and contains only a subset of the output classes. To find the top-k classes, the sparse mixture lets us quickly find the most probable expert, and the sparse expert lets us search within a small-scale softmax. We empirically evaluate our method on several real-world tasks (including neural machine translation, language modeling and image classification) and demonstrate that significant computation reductions can be achieved without loss of performance.




1 Introduction

Deep learning models have demonstrated impressive performance in many classification problems (LeCun et al., 2015). In many of these models, the softmax function is commonly used to produce categorical distributions over the output space. Due to its linear complexity, the computation for the softmax layer can become a bottleneck with large output dimensions, such as in language modeling (Bengio et al., 2003), neural machine translation (Bahdanau et al., 2014) and face recognition (Sun et al., 2014). In language modeling, even with a small RNN model (Zaremba et al., 2014), the softmax contributes more than 95% of the computation (Merity et al., 2016). This becomes a significant bottleneck when computational resources are limited, such as when deploying the model to mobile devices (Howard et al., 2017).

Many methods have been proposed to reduce softmax complexity. The softmax computation bottleneck is present in both the training and inference phases, but with different objectives. For training, the goal is to estimate the categorical distribution and approximate the normalization term as quickly as possible; sampling-based (Gutmann & Hyvärinen, 2012) and hierarchy-based methods (Goodman, 2001; Morin & Bengio, 2005; Chen et al., 2015; Grave et al., 2016) were therefore introduced. Hierarchy-based methods speed up training by calculating the normalization term on a subset of classes. Recent works in this area, such as D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016), construct two-level hierarchies for the output classes based on the unbalanced word distribution.

In this work, we focus on reducing softmax computation during the inference phase. Unlike training, the goal at inference is not to compute the exact categorical distribution over the whole vocabulary, but rather to retrieve the top-k classes accurately and efficiently. Most existing methods formulate this as an approximate maximum inner product search problem over an already learned and fixed softmax, and focus on designing efficient approximation techniques to find the top-k classes after the standard softmax training procedure has completed (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). Since the softmax training objective is misaligned with the inference objective, the standard learned and fixed softmax may not be structured in a (hierarchical) way that makes locating the top-k easy, which can lead to sub-optimal trade-offs between efficiency and accuracy. Motivated by this observation, we want to make the training procedure aware of the approximation used at inference time, and adapt the softmax to be hierarchically structured.

To achieve this, we propose a novel Doubly Sparse Softmax (DS-Softmax) layer. The model learns a two-level overlapping hierarchy using a sparse mixture of sparse experts structure during its training. Each expert is sparse and contains only a small subset of the entire output class space, while each class is permitted to belong to more than one expert. Given an input vector and a set of experts, DS-Softmax first selects the single expert most relevant to the input (in contrast to a dense mixture of experts). The chosen sparse expert then returns a categorical distribution over its subset of classes. The computation reduction is achieved because the model never needs to consider the whole vocabulary. Compared to existing methods (Shrivastava & Li, 2014; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a), our model can better adapt the softmax embedding during training, which results in a better trade-off. In addition, our method can be combined with existing post-approximation methods by treating each expert as a single softmax.

We conduct experiments on one synthetic dataset and three real tasks: language modeling, neural machine translation, and image classification. We demonstrate that our method reduces softmax computation dramatically without loss of prediction performance; for example, we achieve more than a 23x speedup in language modeling and a 15x speedup in translation with similar performance. Qualitatively, we also show that the learned two-level overlapping hierarchy is semantically meaningful on natural language modeling tasks.

The contributions of our work are summarized as follows:


  • We propose a novel learning-based method to speed up softmax inference over large output spaces, enabling fast retrieval of the top-k classes.

  • The proposed method learns a two-level overlapping hierarchy during training that facilitates fast inference and bridges the misaligned objectives of the softmax training and inference phases.

  • We empirically demonstrate that the proposed method can achieve significant speedup without loss of prediction performance.

Figure 1: Overview of Doubly Sparse Softmax (DS-Softmax). Each expert is initialized with the whole output space, and only the expert with the highest gating value is chosen for the feed-forward pass. During training, each expert is pruned iteratively so that it contains only a subset of classes in the final model. Faster inference is therefore achieved by conducting the calculation only inside such a subset.

2 The Doubly Sparse Softmax

In this section, we introduce the softmax inference problem, as well as the proposed method.

2.1 Softmax Inference Problem

Given a context vector $h \in \mathbb{R}^d$, a softmax layer is used to compute a categorical distribution over a set of $K$ classes. In particular, it is defined as $p(c \mid h) = \exp(w_c^\top h) / Z$, where $Z = \sum_{c'} \exp(w_{c'}^\top h)$ is the normalization term and $W = [w_1, \dots, w_K]$ is the softmax embedding parameter. For inference, our goal is not to compute the full exact distribution, but rather to find the top-k classes, i.e. the set $\{c : p(c \mid h) \ge p_{(k)}\}$, where $p_{(k)}$ is the $k$-th largest value of $p(\cdot \mid h)$. The conventional method is to compute the whole logit vector and select the top-k, which has $O(Kd)$ complexity (top-k selection requires an extra $O(K)$ on average via Quickselect). Facing a large output space (i.e. large $K$), the softmax layer becomes a bottleneck, and our goal is to find the top-k both accurately and efficiently.
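As a reference point, this conventional baseline can be sketched in NumPy (a minimal illustration with names of our choosing; ranking the logits suffices because softmax is monotone in them):

```python
import numpy as np

def full_softmax_topk(W, h, k):
    """Baseline top-k retrieval: score all K classes, O(K*d) per query.

    W: (K, d) softmax embedding matrix; h: (d,) context vector.
    """
    logits = W @ h                              # O(K*d): the bottleneck
    topk = np.argpartition(-logits, k - 1)[:k]  # O(K) average-case selection
    return topk[np.argsort(-logits[topk])]      # sort only the k winners
```

The matrix-vector product dominates the cost; it is exactly this full pass over $K$ classes that DS-Softmax avoids.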

2.2 Motivation

Many natural discrete objects/classes, such as words in natural language, exhibit a hierarchical structure in which objects are organized in a tree-like fashion. A hierarchical structure enables retrieving objects much faster since we do not need to consider the whole set. Goodman (2001); Chen et al. (2015); Grave et al. (2016) studied two-level hierarchies for language modeling, where each word belongs to a unique cluster while the hierarchy is constructed with different approaches. (A “cluster” here refers to a cluster of words.) However, constructing the hierarchy is very challenging and usually based on heuristics. Moreover, it can be very limiting to constrain the hierarchy to mutually exclusive clusters because, as in language modeling, it is often difficult to assign a word to exactly one cluster. For example, if we want to predict the next word of “I want to eat ___” and one possible correct answer is “cookie”, we quickly notice that the candidate answers belong to something eatable. If we only search for the right answer among words with the eatable property, we can dramatically increase efficiency. However, even though “cookie” is a correct answer here, it can also appear in non-edible contexts, such as “a piece of data” in the computer science literature. A two-level overlapping hierarchy naturally accommodates homonyms like this by allowing each word to belong to more than one cluster. We believe this observation is likely to hold in applications beyond language modeling.

2.3 The Proposed Method

Inspired by such hierarchical structures, we propose our method, Doubly Sparse Softmax (DS-Softmax), to automatically capture and leverage them for softmax inference speedup. The proposed method learns an overlapping two-level hierarchy among the output classes: the first level is a sparse mixture and the second level contains several sparse experts. A sparse expert is a cluster of classes, a subset of the whole class set, and we allow each class to belong to more than one expert (non-exclusive). To generate the top-k classes, the sparse mixture enables a fast and dynamic selection of the right expert according to the context vector $h$, and the selected sparse expert then allows a fast softmax computation over a small subset of the classes.

The framework is illustrated in Figure 1 and contains two major components: (1) a sparse gating network that implements the sparse mixture and enables the selection of the top-1 expert, and (2) sparse experts that are pruned from full softmax layers with group lasso. We also leverage a loading balance term to balance the utilization of different experts, and mitosis training to scale to a larger number of experts. The final objective is a combination of the task-specific loss, the group lasso losses, and the loading balance regularization loss. The overall training algorithm is summarized in Algorithm 1.

  Initialization: Let $x$ be the input, $y$ the corresponding label, $f$ the (pre-trained) layers before the final DS-Softmax, and $\ell$ the distance function for computing the task-specific loss. $W_g$ is the parameter of the gating network, and $\{W^e\}$ are the parameters of the experts.
  while training not converged do
     update $W_g$ and $\{W^e\}$ by descending the combined loss on a mini-batch
     if it is a pruning step then
        for all experts $e$ and classes $c$ do
           remove $w_c^e$ from expert $e$ if $\|w_c^e\| < \epsilon$
        end for
     end if
  end while
Algorithm 1 DS-Softmax
Figure 2: The mitosis training strategy: the sparsity is inherited when a parent expert produces offspring, reducing the memory requirements for training with a large number of experts.

Sparse mixture

The first level of sparsification is a sparse gating network, designed to find the right expert given the context vector $h$. To facilitate faster inference, only the single most suitable expert is chosen. This sparse gating network is similar to the one in (Shazeer et al., 2017), but the technique proposed here supports a single top-expert output while maintaining meaningful gradients.

To be more specific, suppose we have $E$ experts. Given the context vector $h$ and the gating network weights $W_g$, the gating values $g_e(h)$, $e = 1, \dots, E$, are calculated and normalized prior to the selection, as shown in Eq. 1. We then keep only the largest gating value and set all other gates to zero. More specifically,

$$g(h) = \mathrm{softmax}(W_g h), \qquad \tilde{g}_e(h) = \begin{cases} g_e(h), & e = \arg\max_{e'} g_{e'}(h) \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

where $W_g$ is the weighting matrix for group selection, and only the top-1 expert is selected. Eq. 1 still allows the gradient to be back-propagated to the whole $W_g$ since the gating values are normalized before the selection.

Given the sparse gate, we can further compute the probability of class $c$ under the context $h$ as

$$p(c \mid h) = \frac{\exp\!\big(\tilde{g}_e(h)\, w_c^{e\top} h\big)}{\sum_{c'} \exp\!\big(\tilde{g}_e(h)\, w_{c'}^{e\top} h\big)}, \qquad e = \arg\max_{e'} g_{e'}(h), \tag{2}$$

where $W^e$ is the softmax embedding weight matrix for the $e$-th expert. The gating value can be interpreted as an inverse temperature term for the final categorical distribution produced by the chosen expert (Hinton et al., 2015): a smaller $\tilde{g}_e(h)$ gives a more uniform distribution, a larger one makes it sharper, and this is adjusted automatically according to the context. It is worth noting that during inference, we only need to compute the single chosen expert, since the remaining gates are zero, and then select the top-k classes within it.
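The two-step inference path above can be sketched as follows (a minimal NumPy illustration; the container layout and names are our assumptions, standing in for the trained layer):

```python
import numpy as np

def ds_softmax_topk(Wg, experts, h, k):
    """DS-Softmax inference sketch: pick the top-1 expert via the sparse
    gate, then search top-k inside only that expert's class subset.

    Wg: (E, d) gating weights; experts: list of (class_ids, W_e) pairs,
    where W_e of shape (len(class_ids), d) holds the surviving embeddings.
    """
    scores = Wg @ h
    g = np.exp(scores - scores.max())
    g /= g.sum()                       # gates normalized over all experts
    e = int(np.argmax(g))              # single chosen expert
    class_ids, We = experts[e]
    logits = g[e] * (We @ h)           # gate acts as an inverse temperature
    top = np.argsort(-logits)[:k]      # O(K_e * d), with K_e << K
    return [class_ids[i] for i in top]
```

Only the chosen expert's small embedding matrix is touched, which is where the speedup comes from.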

During training, we use $p(c \mid h)$ as the softmax output distribution and train the model end-to-end w.r.t. the task-specific loss function. (In practice, we can also pre-train all layers with a conventional softmax, then re-train only the DS-Softmax layer that takes the context $h$ and produces the output $y$, while leaving the previous layers fixed.) This makes the training aware of the approximation used at inference, since the sparse gating network is consistent in both training and inference.

Sparse experts with group lasso

The second level of sparsification is the sparse experts. We would like each expert to contain only a small subset of the whole class set, i.e. to output a categorical distribution in which most entries are zero. To obtain a sparse expert, we initialize each expert as a full softmax covering all classes and apply group lasso to iteratively prune out irrelevant classes. More specifically, we add the following regularization loss, in which each class embedding vector within an expert forms one group:

$$\mathcal{L}_{\text{lasso}} = \sum_{e} \sum_{c} \| w_c^e \|_2. \tag{3}$$

This regularization term actively prunes the embedding vectors in each expert: once the norm of $w_c^e$ falls below the pre-defined threshold $\epsilon$, class $c$ is removed from expert $e$. When heavily regularized, many classes are pruned out of each expert, leading to a set of sparse experts.
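The group-level penalty and the norm-threshold pruning rule can be sketched as follows (a NumPy illustration; the `lam` coefficient and list-of-matrices layout are our assumptions):

```python
import numpy as np

def group_lasso_loss(expert_weights, lam):
    """Group lasso over class embeddings: lam * sum_e sum_c ||w_c^e||_2.

    expert_weights: list of (K_e, d) matrices, one per expert. Each class's
    embedding row forms one group, so the penalty drives whole rows
    (classes) of each expert toward zero rather than individual entries.
    """
    return lam * sum(np.linalg.norm(We, axis=1).sum() for We in expert_weights)

def prune(expert_weights, eps=0.01):
    """Drop a class from an expert once its embedding norm falls below eps."""
    return [We[np.linalg.norm(We, axis=1) >= eps] for We in expert_weights]
```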

Loading Balance

A balanced utilization of experts is preferred. Let $K_e$ denote the number of final classes in the $e$-th expert and $K$ the total number of classes. The utilization ratio $u_e$ indicates the probability of expert $e$ being selected on a given dataset: for example, if the model is run 10,000 times and the $e$-th expert is selected 100 times, its utilization ratio is 0.01. The overall speedup is then $K / \sum_e u_e K_e$, so an unbalanced load is undesirable: the model can degenerate into a single big softmax, leading to less speedup. To address this issue, we add a loading balance loss that encourages a more balanced utilization of experts. The loading function of Shazeer et al. (2017) is adopted here, shown in Eq. 5: it encourages the utilization percentage of each expert to be balanced by minimizing the squared coefficient of variation (CV) of the per-expert load,

$$\mathcal{L}_{\text{load}} = \mathrm{CV}\big(\mathrm{Load}_1, \dots, \mathrm{Load}_E\big)^2. \tag{5}$$

In addition, we include a second group lasso loss term, grouping each class's embeddings across experts, so that each class is encouraged to exist in only one or a few experts.
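The CV-based balancing term can be sketched as follows (a simplified NumPy stand-in that estimates each expert's load by summing gate values over a batch; the smooth load estimator of Shazeer et al. (2017) is more involved):

```python
import numpy as np

def load_balance_loss(gates):
    """Squared coefficient of variation of per-expert load.

    gates: (batch, E) normalized gating values; summing over the batch
    gives each expert's load. The loss is zero exactly when every expert
    receives the same total gate mass, and grows as utilization skews.
    """
    load = gates.sum(axis=0)
    return (load.std() / load.mean()) ** 2
```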
Complexity Analysis

The proposed method is efficient for softmax inference as it consists only of (1) a sparse gating step to choose an expert, which has $O(Ed)$ complexity given $E$ experts, and (2) a small-scale softmax over the selected sparse expert to compute the sparse categorical distribution, which has $O(mKd/E)$ complexity on average, given a balanced set of experts in which a class/word belongs to $m$ experts on average. With a reasonable $E$, a significant speedup can be expected.
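Under these assumptions, the expected speedup can be estimated directly (a back-of-the-envelope sketch; the function name and FLOPs accounting are ours, and the gating cost is included):

```python
def expected_speedup(K, d, E, utilization, expert_sizes):
    """Estimated FLOPs speedup of DS-Softmax over a full K-way softmax.

    A full softmax costs ~K*d multiply-adds per query. DS-Softmax pays
    E*d for the gating pass plus the expected cost of the chosen expert,
    sum_e u_e * K_e * d, where u_e is expert e's utilization ratio.
    """
    full_cost = K * d
    ds_cost = E * d + sum(u * Ke * d for u, Ke in zip(utilization, expert_sizes))
    return full_cost / ds_cost
```

For instance, with 10,000 classes, 64 balanced experts, and each class living in about two experts, each expert holds roughly 312 classes and the estimate lands in the 20-30x range, consistent in magnitude with the reported results.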

Mitosis Training

As for training, since we initialize each expert with a full softmax and only gradually prune it into a sparse expert with a small-scale softmax, the DS-Softmax layer requires $E$ times the memory of a regular softmax layer during training when we use $E$ experts, even though the final sparse experts require much less memory. To mitigate this issue, we introduce a memory-efficient training technique named mitosis training.

Mitosis training progressively increases the number of experts during training. We start with a smaller number of experts; once training converges, we split each expert into two identical copies and repeat the same training procedure with the resulting model. Because an expert is already relatively sparse and smaller than the full softmax at the time it is split, cloning requires much less memory than training all experts from scratch. An illustration of mitosis training can be found in Fig. 2.
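A mitosis step can be sketched as follows (an illustrative NumPy version; the noise scale and exact cloning details are our assumptions, not the paper's exact recipe):

```python
import numpy as np

def mitosis_split(experts, gate_rows, noise=1e-3, seed=0):
    """Clone every expert into two children that inherit the parent's
    (already pruned) class subset and weights, plus small noise so the
    copies can specialize in different directions after the split.

    experts: list of (class_ids, W_e) pairs; gate_rows: (E, d) gating weights.
    """
    rng = np.random.default_rng(seed)
    new_experts, new_rows = [], []
    for (class_ids, We), wg in zip(experts, gate_rows):
        for _ in range(2):                       # one parent -> two children
            new_experts.append((class_ids.copy(),
                                We + noise * rng.normal(size=We.shape)))
            new_rows.append(wg + noise * rng.normal(size=wg.shape))
    return new_experts, np.stack(new_rows)
```

Because the children copy only the parent's surviving rows, the memory footprint after a split reflects the current sparsity rather than the full $K$-way softmax.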

3 Experiments

We present our empirical evaluations on both synthetic and real tasks in this section. First, we create a synthetic task with a two-level hierarchy and test the model's ability to learn that hierarchical structure. Second, we consider three real tasks: natural language modeling, neural machine translation, and Chinese handwritten character recognition. Both the theoretical speedup (reduction in FLOPs) and the real-device latency (on CPU) are reported. Finally, some ablations and a case study are presented to better understand what the model has learned.

In terms of experimental setup, we leave the task-specific details for later and present the model setup here. The proposed DS-Softmax layer can be trained jointly with other layers in an end-to-end fashion. For the real tasks, we find it easier to first pre-train the whole model with a conventional softmax, then replace the softmax layer with DS-Softmax and retrain only the new layer while keeping the others fixed, using Adam (Kingma & Ba, 2014). For hyper-parameters, the two pruning hyper-parameters are fixed for all tasks at 10 and 0.01 (the latter being the threshold $\epsilon$). The coefficients of the group lasso and loading balance losses share the same value and are tuned with the following strategy: start from zero and increase exponentially until performance on the validation set decreases. The reported performance is on an independent test set. For baselines, we mainly compare to the conventional full softmax, the recently proposed SVD-Softmax (Shim et al., 2017), and D-Softmax (Chen et al., 2015).

(a) Synthetic Data Generation
(b) Results on 10 x 10
(c) Results on 100 x 100
Figure 3: (a) Illustration of the synthetic data. Each input is generated inside a sub cluster (green circle) and its label is that sub cluster; the super cluster information is not available during training. (b) and (c) Results on the tasks of sizes 10x10 and 100x100. The x-axis indexes sub clusters and the y-axis experts; black means the expert handles that sub cluster. The x-axis is ordered by super cluster (e.g., in the 10x10 problem, the first 10 sub clusters belong to one super cluster, and so on).
(a) No Group Lasso
(b) No Expert Group Lasso
(c) No Balancing
Figure 4: Ablation analysis of each loss component by removing it: (a), (b) and (c) show the model trained without group lasso, without expert-level group lasso, and without the balancing factor, respectively. The original result is Fig. 3(b); all panels share the same axes.

3.1 Synthetic task

A synthetic dataset with a two-level hierarchy is constructed to test our model. As illustrated in Figure 3(a), data points are organized around hierarchical centers: multiple sub clusters belong to one super cluster. To generate such data, we first sample the center of each super cluster, then sample the centers of its sub clusters around the super cluster center, and finally draw each data point near its sub cluster center, with the super-to-sub cluster scale set to 10 and the data dimension to 100. For sanity-check and visualization purposes, we make sure the ground-truth hierarchy in the synthetic data is non-overlapping.

We treat the coordinates of a data point as features and its sub cluster membership as the target. We construct a two-layer Multi-Layer Perceptron (MLP) with DS-Softmax as the final layer; ideally, it should capture the hierarchical structure well. Two super/sub cluster sizes are evaluated, 10x10 and 100x100, where 10x10 means there are 10 super clusters, each containing 10 sub clusters.

We investigate the captured hierarchy by examining how sub clusters are distributed across experts. As mentioned, each expert contains only a subset of the output classes, because class-level pruning is conducted during training. We show the remaining classes of each expert in Fig. 3(b) and Fig. 3(c) for the 10x10 and 100x100 sizes respectively, and find that DS-Softmax captures the hierarchy perfectly. We further run an ablation analysis on the 10x10 task, shown in Fig. 4, to study the effect of each additional loss; all the loss terms discussed above prove important to the model.

3.2 Language Modeling

Language modeling is the task of predicting the next word given its context. For a language such as English, a large vocabulary is present and the softmax can be a bottleneck for inference efficiency. We use two standard datasets for word-level language modeling: Penn Treebank (PTB) (Marcus et al., 1994) and WikiText-2 (Merity et al., 2016), whose output dimensions are 10,000 and 33,278 respectively. A standard two-layer LSTM model (Gers et al., 1999) with hidden size 200 is used. We use accuracy as our metric, as it is common (Chen et al., 1998) in language modeling, especially in real applications where an extrinsic reward is given, such as voice recognition; top-1, top-5 and top-10 accuracies on the test sets are reported. As shown in Table 1, 15.99x and 23.86x speedups (in terms of FLOPs) are achieved with 64 experts without loss of accuracy. (One copy of each word must be kept among all experts during training; otherwise, an 80x speedup is easily achieved without loss of accuracy, at the cost of missing low-frequency words.) Moreover, a slight improvement in performance is observed, suggesting that the mixture of softmaxes brings an additional benefit, consistent with the improvement from breaking the low-rank bottleneck (Yang et al., 2017).

Task Method Testing Accuracy Speedup
Top 1 Top 5 Top 10
PTB (10,000) Full 0.252 0.436 0.515 -
DS-8 0.257 0.448 0.530 2.84x
DS-16 0.258 0.450 0.529 5.13x
DS-32 0.259 0.449 0.529 9.43x
DS-64 0.258 0.450 0.529 15.99x
WIKI-2 (33,278) Full 0.257 0.456 0.533 -
DS-8 0.259 0.459 0.536 3.52x
DS-16 0.264 0.469 0.547 6.58x
DS-32 0.260 0.460 0.535 11.59x
DS-64 0.259 0.458 0.533 23.86x
Table 1: Word level natural language modelling results on PTB and WikiText-2. The output dimensions are 10,000 and 33,278 respectively. ’-’ indicates no speedup by definition.
(a) Mitosis Training Result
(b) Uncertainty and Redundancy
Figure 5: (a) Memory required to train DS-64 starting from DS-2. The y-axis is memory relative to one full softmax; the Cloning Start icon marks where cloning happens. (b) A heatmap of the correlation between word frequency and redundancy. The x-axis is the log of word frequency and the y-axis the log of the number of experts containing the word (its redundancy); darker color indicates higher density.

3.3 Neural Machine Translation

Neural machine translation is also commonly used for softmax speedup evaluation. We use the IWSLT English-to-Vietnamese dataset (Luong & Manning, 2015) and evaluate performance by BLEU score (Papineni et al., 2002) with greedy search on the test set. The vanilla softmax model is a seq2seq model (Sutskever et al., 2014) implemented in TensorFlow (Abadi et al.). The output vocabulary contains 7,709 words, and the results are shown in Table 2: our method achieves a 15.08x speedup (in terms of FLOPs) with a similar BLEU score.

Task Method Bleu Score Speedup
IWSLT En-Ve (7,709) Full 25.2 -
DS-8 25.3 4.38x
DS-16 25.1 6.08x
DS-32 25.4 10.69x
DS-64 25.0 15.08x
Table 2: Neural machine translation results on IWSLT English to Vietnamese and the vocabulary size is 7,709.

3.4 Chinese Character Recognition

We also test on a Chinese handwritten character recognition task, using the offline, special-characters-filtered CASIA dataset (Liu et al., 2011). CASIA is a popular Chinese character recognition dataset with around four thousand characters. Unlike the language-related tasks, the class distribution here is close to uniform rather than unbalanced. Two-thirds of the data are used for training and the rest for testing. Table 3 shows the results: we achieve a significant (6.91x) speedup (in terms of FLOPs) on this task.

Task Method Accuracy Speedup
CASIA (3,740) Full 90.6 -
DS-8 90.8 1.77x
DS-16 90.2 2.82x
DS-32 89.9 4.72x
DS-64 90.1 6.91x
Table 3: Image classification results on CASIA. There are 3,740 different characters inside dataset.

3.5 Real device comparison

Real-device experiments were conducted on a machine with two Intel(R) Xeon(R) CPUs @ 2.20GHz and 16GB memory. All tested models were re-implemented in NumPy to ensure fairness. Two configurations of SVD-Softmax (Shim et al., 2017) are evaluated, SVD-5 and SVD-10, which use the top 5% and 10% of dimensions for the final evaluation in the preview window, with a window width of 16. Indexing and sorting are computationally heavy for SVD-Softmax in a NumPy implementation. One configuration of Differentiated (D-)Softmax is also compared, although the comparison is somewhat unfair because its main focus is training speedup (Chen et al., 2015). (D-Softmax is selected instead of adaptive-softmax (Grave et al., 2016) because we measure time on CPU rather than GPU.) For D-Softmax, the words are sorted by frequency: the first and second quarters use the full and half embedding size respectively, and the tail uses a quarter of the embedding size; for example, in PTB we split the words into buckets of (2,500, 2,500, 5,000) with embedding sizes (200, 100, 50). For a fair comparison, we report latency without sorting and indexing for SVD-Softmax, while for the full softmax, D-Softmax and DS-Softmax the full latency is reported. The latency results are shown in Table 4: DS-Softmax achieves both significantly better theoretical speedup and lower latency than the baselines (e.g., on Wiki-2). Moreover, compared to D-Softmax, our learned hierarchy achieves a much better speedup without loss of performance.

Task Full DS-64 (Ours) SVD-5 SVD-10 D-Softmax
Name Value ms Value FLOPs ms Value FLOPs ms Value FLOPs ms Value FLOPs ms
PTB 0.252 0.73 0.258 15.99x 0.05 0.249 6.67x 0.12 0.251 5.00x 0.18 0.245 2.00x 0.36
Wiki-2 0.257 3.07 0.259 23.86x 0.15 0.253 7.35x 0.43 0.255 5.38x 0.60 0.256 2.00x 1.59
En-Ve 25.2 1.91 25.0 15.08x 0.13 25.0 6.77x 0.32 25.1 5.06x 0.42 24.8 2.00x 0.98
CASIA 90.6 1.61 90.1 6.91x 0.25 89.9 3.00x 0.59 90.2 2.61x 0.68 - - -
Table 4: Comparison with SVD-Softmax and D-Softmax on real-device latency. “ms” indicates the latency in milliseconds; “FLOPs” indicates the FLOPs speedup. The value column reports task performance: top-1 accuracy for PTB, Wiki-2 and CASIA, and BLEU score for En-Ve. There is no result for D-Softmax on CASIA because it cannot achieve any speedup there by definition.

3.6 Mitosis training

Here we evaluate the efficiency of mitosis training on the PTB language modeling task. The model is initialized with 2 experts and cloned to 4, 8, 16, 32 and 64 experts sequentially. As demonstrated in Figure 5(a), cloning happens every 15 epochs and pruning starts 10 epochs after each cloning. In the end, the model requires at most 3.25x the memory of a full softmax to train the DS-64 model while achieving similar performance, significantly less than the original 64-fold memory.

3.7 Qualitative analysis of sparsity

We demonstrate the relationship between redundancy and word frequency in Figure 5(b), where redundancy denotes the number of experts containing a given word. We find that words with higher frequency appear in more experts. This resembles the phenomenon in topic models (Blei et al., 2003; Wallach, 2006) and the observation that more frequent words require higher-capacity models (Chen et al., 2015). We manually inspect the smallest expert in such a model, in which 64 words remain (after filtering out words that exist in many experts). The words left in this expert are semantically related: three major groups can be identified, namely money, time and comparison, shown below.


  • Money: million, billion, trillion, earnings, share, rate, stake, bond, cents, bid, cash, fine, payable.

  • Time: years, while, since, before, early, late, yesterday, annual, currently, monthly, annually, Monday, Tuesday, Wednesday, Thursday, Friday.

  • Comparison: up, down, under, above, below, next, though, against, during, within, including, range, higher, lower, drop, rise, growth, increase, less, compared, unchanged.

3.8 Post-approximation of Learned Experts

To speed up softmax inference, most existing methods are based on post-approximation of a learned and fixed softmax (Shim et al., 2017; Chen et al., 2018a; Mussmann et al., 2017). In DS-Softmax, each expert can likewise be viewed as an individual softmax over a subset of the classes, so we can take the learned DS-Softmax and apply a post-approximation technique such as SVD-Softmax (Shim et al., 2017) to each learned expert. To demonstrate this, two experiments are conducted: applying SVD-10 to DS-2, and applying SVD-50 (top 50% in the preview window) to DS-64, where the SVD is applied only to experts with more than one thousand classes. A higher percentage is used for DS-64 because fewer classes remain in each expert. The results in Table 5 show that our technique can be further accelerated by combining it with SVD-Softmax.

Task Method Accuracy Speedup
WIKI-2 (33,278) Full 0.257 -
DS-2 0.258 1.83x
SVD-10 0.255 5.38x
DS-2 & SVD-10 0.255 9.64x
DS-64 0.259 23.86x
SVD-50 0.256 1.72x
DS-64 & SVD-50 0.255 32.77x
Table 5: Evaluation of applying post-approximation methods on the learned experts from DS-Softmax, which can further speedup the inference.

4 Related Work

The problem of reducing softmax complexity has been widely studied before (Gutmann & Hyvärinen, 2012; Chen et al., 2015; Grave et al., 2016; Shim et al., 2017; Zhang et al., 2018; Chen et al., 2018a). There are mainly two goals: training speedup and inference speedup. In our work, we focus on inference, where we would like to find the top-k classes efficiently and accurately.

Most existing works on reducing softmax inference complexity are based on post-approximation of a fixed softmax trained in the standard way. Locality-Sensitive Hashing (LSH) has been demonstrated as a powerful technique in this category (Shrivastava & Li, 2014; Maddison et al., 2014; Mussmann et al., 2017; Spring & Shrivastava, 2017). Small world graphs are another powerful technique for this problem (Zhang et al., 2018). Recent work proposes a learning-based clustering over the trained embedding that overcomes the non-differentiability problem (Chen et al., 2018a). In addition, the decomposition-based SVD-Softmax (Shim et al., 2017) speeds up the search through a smaller preview matrix. However, as approximations of a fixed softmax, the main drawback of these methods is their high cost when high precision is required (Chen et al., 2018a), implying a worse trade-off between efficiency and accuracy. In contrast, the proposed DS-Softmax adapts the softmax itself and learns a hierarchical structure to find the top-k classes. Furthermore, those methods can also be applied on top of ours, with each expert viewed as a single softmax, which makes them orthogonal to our approach.

Hierarchical softmax is another family of methods similar to ours. The most related under this category are D-softmax (Chen et al., 2015) and adaptive-softmax (Grave et al., 2016). These two methods can speed up both training and inference, while other methods (Morin & Bengio, 2005; Mnih & Hinton, 2009) usually cannot speed up inference. Their hierarchies are constructed from the unbalanced word/class distribution that follows from Zipf's law, which raises two issues. First, the hierarchy is pre-defined by heuristics and could be sub-optimal. Second, the skewness of the class distribution in some tasks, e.g. image classification, may not be as significant as in language modeling. The proposed DS-Softmax overcomes these limitations by automatically learning a two-level overlapping hierarchy among classes.

Our method is inspired by the sparsely-gated mixture-of-experts (MoE) (Shazeer et al., 2017), which achieves significantly better performance in language modeling and translation with large but sparsely activated experts. However, MoE cannot speed up softmax inference by definition because each expert covers the whole set of output classes. Our work on softmax inference speedup can also be seen as part of recent efforts to make neural networks more compact (Han et al., 2015; Chen et al., 2018c) and efficient (Howard et al., 2017; Chen et al., 2018b), through which modern neural networks become faster and more widely applicable.

5 Conclusion

In this paper, we presented DS-Softmax, a sparse mixture of sparse experts for efficient softmax inference. Our method is learning-based and adapts the softmax for fast inference by learning a two-level overlapping class hierarchy, in which each expert is responsible for only a small subset of the output class space. During inference, our method first identifies the responsible expert and then performs a small-scale softmax computation within that expert. Our experiments on several real-world tasks demonstrate the efficacy of the proposed method.