Compressing Gradient Optimizers via Count-Sketches

by Ryan Spring, et al.
Rice University

Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence rate of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (8.6 GB instead of 11.7 GB) by compressing the Adam optimizer in the Embedding and Softmax layers with negligible accuracy and performance loss.


1. Introduction

An emerging trend in natural language processing is to train a language model in an unsupervised fashion on a large corpus of text, and then to fine-tune the model for a specific task (Radford et al., 2018; Puri et al., 2018; Devlin et al., 2018). The language model often takes the form of an LSTM (Jozefowicz et al., 2016) or a Transformer network (Vaswani et al., 2017).

These models already contain millions of parameters and will continue to grow even larger. Recently, (Yang et al., 2018) demonstrated that the expressiveness of a single Softmax layer was insufficient for the language modeling task. Their proposed solution was the Mixture of Softmax (MoS) layer, which combines several independent Softmax layers together. The number of Softmax layers typically ranges between 3 and 15, so the proposed solution requires significantly more space, especially for larger vocabularies.

Training large-scale models efficiently is a challenging task. There are numerous publications that describe how to leverage multi-GPU data parallelism and mixed precision training effectively (Hoffer et al., 2017; Ott et al., 2018; Micikevicius et al., 2018). A key tool for improving training time is to increase the batch size, taking advantage of the massive parallelism provided by GPUs. However, increasing the batch size also requires significant amounts of memory. Oftentimes, a practitioner will sacrifice batch size for a larger, more expressive model. For example, (Puri et al., 2018) showed that doubling the dimensionality of a multiplicative LSTM (Krause et al., 2016) from 4096 to 8192 forced them to reduce the batch size per GPU.

One culprit that aggravates the memory capacity issue is the auxiliary parameters used by first-order optimization algorithms, which are commonly used to accelerate the convergence rate of the model. Our proposed solution is to compress the auxiliary parameters of the optimizer using the Count-Sketch data structure (Charikar et al., 2002), freeing up memory for either a more expressive model or a larger batch size for faster training.

We primarily focus on compressing the auxiliary variables for the embedding and Softmax layers. These layers contain a significant portion of the model’s parameters and the set of active features or classes is extremely sparse for many tasks (Spring and Shrivastava, 2017). Consider the language modeling task where there are only a few words out of a large vocabulary in each sentence. There are several algorithms that impose sparsity on the Softmax layer to improve training time. However, getting around memory is still a major challenge. Since the distribution of words follows a power-law distribution, Sampled Softmax (Jean et al., 2014) is commonly used to train language models. (Shrivastava and Li, 2014; Vijayanarasimhan et al., 2014; Yen et al., 2018a) have proposed using approximate nearest-neighbor search to find the output classes that contain the highest gradients.

Our solution takes advantage of the sparsity present in the Embedding and Softmax layers, so the computational cost scales with the gradient sparsity. We directly insert the sparse gradients into the count-sketch, and then retrieve an approximation of the auxiliary variable. Furthermore, we can easily trade off the capacity of the count-sketch to maintain the optimizer’s performance, without increasing the cost of updating or querying the structure. In Section 5, we formally prove this graceful memory trade-off by analyzing the convergence rate of our count-sketch optimizer.

On the 1-Billion Word dataset, we train an LSTM language model using the Adam optimizer, leveraging our count-sketch technique. By compressing the auxiliary variables for the Embedding and Softmax layers, we reduce the memory usage during training by 25% without any accuracy or performance penalty. For an Amazon extreme classification task with over 49.5 million classes, we reduce the training time by 38% by increasing the mini-batch size by a factor of 3.5 using our count-sketch optimizer.

2. Count-Sketch and Streaming Setting

In the traditional streaming setting, we are given a high-dimensional vector $x \in \mathbb{R}^n$ that is too costly to store in memory. We only see a very long sequence of updates over time. The only information available at time $t$ is of the form $(i, \Delta)$, which means that coordinate $i$ is updated by the amount $\Delta$. We are given a limited amount of storage, on the order of $O(\log n)$, which means that we can never store the entire vector. Sketching algorithms aim to estimate the value of the current item $i$, after any number of updates, using only $O(\log n)$ memory.

The Count-Sketch is a popular algorithm for estimation in the streaming setting. Count-Sketch keeps a matrix of bins $S$ of size $v \times w \sim O(\log n)$, where $v$ and $w$ are chosen based on the desired accuracy guarantees. The algorithm uses $v$ random hash functions $h_j$ for $j \in \{1, 2, \dots, v\}$ to map the vector’s components to $w$ different bins, $h_j : \{1, 2, \dots, n\} \rightarrow \{1, 2, \dots, w\}$. In particular, for any row $j$ of sketch $S$, component $i$ is hashed into bin $S_{j, h_j(i)}$. In addition, Count-Sketch uses $v$ random sign functions $s_j$ to map the components of the vectors randomly to $\{+1, -1\}$, $s_j : \{1, 2, \dots, n\} \rightarrow \{+1, -1\}$.

The Count-Sketch supports two operations: UPDATE(item $i$, increment $\Delta$) and QUERY(item $i$). The UPDATE operation updates the sketch with any observed increment. More formally, for an increment $\Delta$ to an item $i$, the sketch is updated by adding $s_j(i) \cdot \Delta$ to the cell $S_{j, h_j(i)}$ for all $j \in \{1, 2, \dots, v\}$. The QUERY operation returns an estimate for component $i$, the median of all the $v$ different associated counters. If the updates are strictly non-negative, we return the minimum value across all the counters.

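Count-Sketch Error: (Charikar et al., 2002) Let $\hat{x}_i$ be the Count-Sketch estimate of component $i$ from vector $x$. For any component $x_i$, with probability $1 - \delta$, a Count-Sketch matrix with width $\Theta(1/\epsilon_1^2)$ and depth $\Theta(\log(1/\delta))$ satisfies:

$$x_i - \epsilon_1 \|x\|_2 \le \hat{x}_i \le x_i + \epsilon_1 \|x\|_2$$

Count-Min Sketch Error: (Cormode and Muthukrishnan, 2005) Let $\hat{x}_i$ be the Count-Min Sketch estimate of component $i$ from vector $x$. For any component $x_i$, with probability $1 - \delta$, a Count-Min Sketch matrix with width $\Theta(1/\epsilon_1)$ and depth $\Theta(\log(1/\delta))$ satisfies:

$$x_i \le \hat{x}_i \le x_i + \epsilon_1 \|x\|_1$$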

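   $v$ universal hash functions $h_j$
   $v$ random sign functions $s_j$

  Initialize count-sketch tensor $S \in \mathbb{R}^{v \times w \times d} = 0$

  UPDATE(Count-Sketch $S$, item $i$, update $\Delta \in \mathbb{R}^d$):
  Update component $i$ with update $\Delta$
  for $j = 1$ to $v$ do
    $S_{j, h_j(i), :} \leftarrow S_{j, h_j(i), :} + s_j(i) \cdot \Delta$
  end for
  QUERY(Count-Sketch $S$, item $i$, Function $F$):
  Query sketch for an estimate for item $i$
   $F$ - MIN for non-negative values; otherwise MEDIAN
  return $F_{j \in \{1, \dots, v\}}(s_j(i) \cdot S_{j, h_j(i), :})$
Algorithm 1 Count-Sketch Tensor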
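As a rough illustration of Algorithm 1, the following is a minimal pure-Python sketch, not the authors' PyTorch implementation. The class name `CountSketchTensor`, the hash family `((a*i + b) mod P) mod w`, and the `signed` flag (which switches between Count-Sketch with MEDIAN and Count-Min Sketch with MIN) are assumptions made for this example.

```python
import random
import statistics


class CountSketchTensor:
    """Illustrative [depth, width, dim] sketch; all names here are hypothetical."""

    P = 2_147_483_647  # Mersenne prime used by the assumed hash family

    def __init__(self, depth, width, dim, seed=0):
        rng = random.Random(seed)
        self.depth, self.width, self.dim = depth, width, dim
        # One (a, b) pair per row for the bin hash and for the sign hash.
        self.hash_ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                        for _ in range(depth)]
        self.sign_ab = [(rng.randrange(1, self.P), rng.randrange(self.P))
                        for _ in range(depth)]
        self.table = [[[0.0] * dim for _ in range(width)] for _ in range(depth)]

    def _bin(self, j, i):
        a, b = self.hash_ab[j]
        return ((a * i + b) % self.P) % self.width

    def _sign(self, j, i):
        a, b = self.sign_ab[j]
        return 1.0 if ((a * i + b) % self.P) % 2 == 0 else -1.0

    def update(self, i, delta, signed=True):
        """Add the length-dim vector delta to item i in every row."""
        for j in range(self.depth):
            s = self._sign(j, i) if signed else 1.0
            row = self.table[j][self._bin(j, i)]
            for k in range(self.dim):
                row[k] += s * delta[k]

    def query(self, i, signed=True):
        """MEDIAN over rows for signed sketches, MIN for non-negative ones."""
        per_row = []
        for j in range(self.depth):
            s = self._sign(j, i) if signed else 1.0
            per_row.append([s * x for x in self.table[j][self._bin(j, i)]])
        reduce = statistics.median if signed else min
        return [reduce(r[k] for r in per_row) for k in range(self.dim)]
```

Each item touches one contiguous length-`dim` vector per row, which mirrors the contiguous-chunk access pattern the paper relies on for GPU and SIMD efficiency.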

3. Intuition

Our goal is to compress the auxiliary variables without incurring significant accuracy loss. Unfortunately, selecting the appropriate compression scheme is not clear without any additional information on the parameter distribution. The challenge is that the parameter distribution can change over time, so any static assumption on the approximation is likely to hurt accuracy. Fortunately, in this section we show that there is a potential solution.

Power Law in Auxiliary Variables over Time: In Figure 2, we plot the auxiliary variables sorted according to their absolute values during training. To understand the dynamics over time, we show the parameters at two different epochs, 5 and 40. The plots clearly indicate a power law behavior where only a few parameters have large magnitudes. In Figure 1, we confirm this behavior for every iteration by plotting the midpoint dividing the head and tails. The auxiliary variables have long tails throughout the training process. Also, this behavior is invariant across the two datasets (Wikitext-2 and ImageNet). To the best of our knowledge, this is the first work that empirically shows the existence of a power law distribution behavior in the gradients and auxiliary variables while training. To dig deeper, we also show the identities of top-100 parameters (the head of power law distribution) for epochs 5, 20, and 40 in Figure 2. The top identities change over time, which makes it difficult to cluster parameters into predefined, static clusters.

Power law and linear sequence of updates: In summary, we need to compress a power law distribution where the top-k identities are constantly changing. Fortunately, the auxiliary variables are updated in a linear fashion. The updates can be written as a linear operator over updates (See Section 4). The count-sketch is a dynamic, low-memory data structure, which preserves high magnitude parameters accurately, while allowing for any sequence of linear updates. The linearity of updates allows us to guarantee that the count-sketch provides an accurate estimation of parameters with high probability at every stage in the iteration. The power law distribution and linear updates make sketching-based ideas a perfect fit for this problem.

Figure 1. An empirical demonstration showing that the model’s gradients and the optimizer’s auxiliary variables follow a power-law distribution. The count-sketch data structure approximates the heavy hitter entries with greater accuracy. Therefore, this experiment implies that the count-sketch data structure is appropriate for compressing the auxiliary variables. The X-axis is the number of iterations during training time. The Y-axis is the 50% threshold that marks the midpoint dividing the head and the tail of the distribution. For a uniform distribution, the midpoint is at 0.5. However, the 50% threshold for the gradients and auxiliary variables is less than 0.2 on average, indicating that they follow a power law distribution. The red line marks the maximum threshold for all layers, while the black line represents the average threshold.
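One plausible reading of the 50% threshold in Figure 1 is the smallest fraction of entries, taken in decreasing order of magnitude, that accounts for half of the total magnitude. The sketch below computes that quantity; the function name `half_mass_fraction` and this exact definition are assumptions, since the caption does not spell out the computation.

```python
def half_mass_fraction(values):
    """Fraction of entries (largest first) needed to reach 50% of total |value|."""
    mags = sorted((abs(v) for v in values), reverse=True)
    total = sum(mags)
    if total == 0:
        return 0.0
    running = 0.0
    for count, m in enumerate(mags, start=1):
        running += m
        if running >= 0.5 * total:
            return count / len(mags)
    return 1.0


uniform = [1.0] * 10            # every entry carries equal mass
heavy_tailed = [100.0] + [1.0] * 9  # one heavy hitter dominates
print(half_mass_fraction(uniform), half_mass_fraction(heavy_tailed))
```

Under this reading, a uniform vector sits at 0.5 while a vector dominated by a few heavy hitters sits well below it, which matches the behavior the caption describes.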

Figure 2. The optimizer’s auxiliary variables follow a power-law distribution, but the features associated with top-k values change during training. The X-Axis is the feature ID, while the Y-Axis is the magnitude. The first two charts show the sorted absolute values for the auxiliary variables at different training epochs. The last two charts plot the top 100 features and their magnitudes. We plot the 1st and 2nd moments of the Adam Optimizer for an LSTM weight matrix trained on the Wikitext-2 dataset.

4. Count-Sketch Optimizers

A major chunk of the parameters in the deep network are contained in the fully-connected layers (Han et al., 2015). Fortunately, for the embedding and softmax layers, the set of active features or classes and their corresponding gradient updates are sparse. Our insight is to use the count-sketch data structure to accurately represent the auxiliary variables in a compressed manner. We will insert the sparse gradient information into the count-sketch and retrieve an approximate value for the auxiliary variable whenever needed.

In the deep learning setting, the high-dimensional vector $x$ is analogous to the matrices used to represent the auxiliary variables. The auxiliary variables are represented with $\mathbb{R}^{n \times d}$ matrices where $n$ is the number of features in the embedding layer or the number of classes in the softmax layer. Since the dimensionality of the columns $d$ is usually in the low thousands, we represent the auxiliary variables with a count-sketch tensor $\mathbb{R}^{v \times w \times d}$ where $v \cdot w \ll n$. This count-sketch tensor preserves structured sparsity where values are read from memory in contiguous chunks along the last dimension of the tensor. See Fig. 3 for a visualization. This tensor structure maintains high performance with GPUs and CPU SIMD vector instructions. On the other hand, the $n$ rows are compressed by randomly combining features and classes together.

Figure 3. Visualization of Count Sketch Tensor. Each color represents a unique feature. For each row, each feature is mapped randomly to a different vector. Each vector is read from and written to memory in contiguous chunks. Preserving the last dimension of the auxiliary variable keeps structured sparsity in the count-sketch data structure, which is necessary for high performance.

Here is a brief overview of three popular first-order optimizers whose auxiliary variables we seek to compress: Momentum (Sutskever et al., 2013; Polyak, 1964) remembers a history of gradient updates, which smooths out random oscillations and accelerates convergence. Adaptive gradient descent algorithms alter the learning rate for each feature based on the frequency of its updates. Sparse, rare features are given larger updates and higher learning rates. These methods track a history of squared gradients for each feature. Adagrad (Duchi et al., 2011) divides the gradient by the square root of the cumulative squared gradient. Adam (Kingma and Ba, 2014) combines momentum and adaptive learning rates together, so it tracks an exponential average of the gradients and squared gradients.

The count-sketch data structure expects to receive a stream of updates $\Delta$. For the Momentum and Adam optimizers, we need to transform the update operation into a form that is compatible with the count-sketch. For an auxiliary variable $X$, the desired update operation is $X \mathrel{+}= \Delta$. Given the appropriate update operation, we replace the addition assignment operator $\mathrel{+}=$ for the original matrix with the Update-Query operation for the Count-Sketch Tensor.

For Momentum, the update rule, given some gradient $g_t$, is $m_t = \gamma \cdot m_{t-1} + g_t$, which is equivalent to $m_t \mathrel{+}= (\gamma - 1) \cdot m_{t-1} + g_t$. For the Adam optimizer, given some constant $c$ and an update $\Delta$, the update rule for the exponential moving average is $x_t = c \cdot x_{t-1} + (1 - c) \cdot \Delta$, which is equivalent to $x_t \mathrel{+}= (1 - c) \cdot (\Delta - x_{t-1})$.
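The rewrite into additive form is plain algebra, and a short numeric check makes it concrete. The function names `momentum_delta` and `ema_delta` below are hypothetical helpers for this example only; each returns the increment that would be passed to the sketch's UPDATE operation.

```python
import math


def momentum_delta(m_prev, grad, gamma):
    """Increment that turns m_prev into gamma * m_prev + grad."""
    return (gamma - 1.0) * m_prev + grad


def ema_delta(x_prev, update, c):
    """Increment that turns x_prev into c * x_prev + (1 - c) * update."""
    return (1.0 - c) * (update - x_prev)


# Adding the increment reproduces the usual recursive form.
m_prev, grad, gamma = 0.5, 0.2, 0.9
assert math.isclose(m_prev + momentum_delta(m_prev, grad, gamma),
                    gamma * m_prev + grad)

x_prev, update, c = 0.3, 1.5, 0.999
assert math.isclose(x_prev + ema_delta(x_prev, update, c),
                    c * x_prev + (1.0 - c) * update)
```

Note that both increments depend on the previous value, which is why the sketched optimizers issue a QUERY before each UPDATE.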

The Count-Sketch is essentially a plug and play replacement that saves memory, while retaining the speed and accuracy of the original matrix. Normally, algorithms that compress memory to save space are slower than their dense counterparts. However, the count-sketch can leverage sparsity by lazily performing updates with high efficiency. In addition, we can gracefully increase the size of the count-sketch for greater accuracy with minimal additional computational cost.

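  Initialize Count-Sketch Tensor $M \in \mathbb{R}^{v \times w \times d} = 0$
   $v$ universal hash functions $h_j$
   $v$ random sign functions $s_j$
  Decay Rate $\gamma$, Learning Rate $\eta$
  MOMENTUM(Item $i$, Parameter $x \in \mathbb{R}^d$, Gradient $g_t \in \mathbb{R}^d$):
   $m_{t-1} \leftarrow$ Query($M$, $i$, MEDIAN)
   $\Delta_M \leftarrow (\gamma - 1) \cdot m_{t-1} + g_t$
  Update($M$, $i$, $\Delta_M$)
   $\hat{m}_t \leftarrow$ Query($M$, $i$, MEDIAN)
   $x_t = x_{t-1} - \eta \cdot \hat{m}_t$
Algorithm 2 Momentum - Count Sketch Optimizer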
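  Initialize Count-Min Sketch Tensor $V \in \mathbb{R}^{v \times w \times d} = 0$
   $v$ universal hash functions $h_j$
  Learning Rate $\eta$
  ADAGRAD(Item $i$, Parameter $x \in \mathbb{R}^d$, Gradient $g_t \in \mathbb{R}^d$):
   $\Delta_V \leftarrow g_t^2$
  UPDATE($V$, $i$, $\Delta_V$)
   $\hat{v}_t \leftarrow$ QUERY($V$, $i$, MIN)
   $x_t = x_{t-1} - \eta \cdot g_t / (\sqrt{\hat{v}_t} + \epsilon)$
Algorithm 3 Adagrad - Count Sketch Optimizer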
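  Initialize Count-Sketch Tensor $M \in \mathbb{R}^{v \times w \times d} = 0$
  Initialize Count-Min-Sketch Tensor $V \in \mathbb{R}^{v \times w \times d} = 0$
   $v$ universal hash functions $h_j$
   $v$ random sign functions $s_j$
  1st Moment Decay Rate $\beta_1$, 2nd Moment Decay Rate $\beta_2$
  Learning Rate $\eta$
  ADAM(Item $i$, Parameter $x \in \mathbb{R}^d$, Gradient $g_t \in \mathbb{R}^d$):
  // Count-Sketch - 1st Moment
   $m_{t-1} \leftarrow$ Query($M$, $i$, MEDIAN)
   $\Delta_M \leftarrow (1 - \beta_1) \cdot (g_t - m_{t-1})$
  Update($M$, $i$, $\Delta_M$)
   $\hat{m}_t \leftarrow$ Query($M$, $i$, MEDIAN)
  // Count-Min Sketch - 2nd Moment
   $v_{t-1} \leftarrow$ Query($V$, $i$, MIN)
   $\Delta_V \leftarrow (1 - \beta_2) \cdot (g_t^2 - v_{t-1})$
  Update($V$, $i$, $\Delta_V$)
   $\hat{v}_t \leftarrow$ Query($V$, $i$, MIN)
   $x_t = x_{t-1} - \eta \cdot \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$
Algorithm 4 Adam - Count Sketch Optimizer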
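To show how the Query-Update-Query pattern of Algorithm 4 fits together, here is a minimal scalar sketch in pure Python. It is an illustration under several assumptions rather than the authors' code: the `Sketch` class, its hash family, the omission of bias correction, and the clamp of the second-moment estimate at zero are all choices made for this example.

```python
import math
import random
import statistics


class Sketch:
    """Scalar counters; signed=True is a Count-Sketch, False a Count-Min Sketch."""

    P = 2_147_483_647  # prime for the assumed hash family

    def __init__(self, depth, width, signed, seed):
        rng = random.Random(seed)
        self.width, self.signed = width, signed
        self.coef = [tuple(rng.randrange(1, self.P) for _ in range(4))
                     for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def _cells(self, i):
        # Yield (row, bin index, sign) for item i in every row of the sketch.
        for row, (a, b, c, d) in zip(self.table, self.coef):
            sign = -1.0 if self.signed and ((c * i + d) % self.P) % 2 else 1.0
            yield row, ((a * i + b) % self.P) % self.width, sign

    def update(self, i, delta):
        for row, col, sign in self._cells(i):
            row[col] += sign * delta

    def query(self, i):
        vals = [sign * row[col] for row, col, sign in self._cells(i)]
        return statistics.median(vals) if self.signed else min(vals)


def sketched_adam_step(param, grad, i, M, V,
                       lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step for the scalar parameter with id i (no bias correction)."""
    M.update(i, (1.0 - b1) * (grad - M.query(i)))         # 1st moment, MEDIAN
    m_hat = M.query(i)
    V.update(i, (1.0 - b2) * (grad * grad - V.query(i)))  # 2nd moment, MIN
    v_hat = max(V.query(i), 0.0)  # assumed guard against a negative estimate
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


M = Sketch(depth=3, width=64, signed=True, seed=1)
V = Sketch(depth=3, width=64, signed=False, seed=2)
p = 1.0
for g in [0.5, -0.2, 0.1]:
    p = sketched_adam_step(p, g, i=5, M=M, V=V)
```

When only one item is active there are no collisions, so the sketched step matches a dense exponential moving average exactly; approximation error only appears once several items share bins.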

Count-Min Sketch Cleaning Heuristic:

Since the Count-Min Sketch only accepts non-negative values, it always overestimates the desired value. The Count-Min Sketch is used to estimate the adaptive learning rate for the Adagrad and Adam optimizers. Therefore, an overestimate will prematurely slow the learning rate for certain elements. Our heuristic solution is to clean the sketch periodically by multiplying the tensor by a constant $\alpha$, where $0 \le \alpha \le 1$, every $C$ iterations. Instead of this heuristic, an alternative is to use principled adaptive sketches (Shrivastava et al., 2016), which can continuously clean the sketch and decay the overestimates over time.

Periodic cleaning works well with the Count-Min Sketch because it provides a better estimate for the top-$k$ elements. During training, the accumulation of updates allows for the heavy hitter estimates to emerge in the sketch (Aghazadeh et al., 2018). Due to stochastic gradient descent, there is a certain amount of noise in the gradient, so cleaning immediately after each update destroys the internal state of the sketch. Furthermore, cleaning reduces the scale of the sketch, reducing the overall noise level. If the noise level is too high relative to the signal, future heavy hitters are ignored because their values are indistinguishable from the noise in the sketch.
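The cleaning heuristic amounts to a periodic in-place rescaling of the Count-Min Sketch counters. The sketch below assumes the table is stored as nested Python lists; the helper names `clean` and `maybe_clean` are hypothetical, and the scale factor used in the usage lines is an illustrative value, not one reported in the paper.

```python
def clean(table, alpha):
    """Scale every counter of a Count-Min Sketch table in place by alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha is expected to lie in [0, 1]")
    for row in table:
        for k in range(len(row)):
            row[k] *= alpha


def maybe_clean(table, step, every, alpha):
    """Clean once every `every` optimizer steps; return True if cleaning ran."""
    if step > 0 and step % every == 0:
        clean(table, alpha)
        return True
    return False


table = [[4.0, 2.0], [8.0, 0.0]]
ran_early = maybe_clean(table, step=124, every=125, alpha=0.5)  # not yet due
ran_due = maybe_clean(table, step=125, every=125, alpha=0.5)    # due now
```

Cleaning between bursts of updates, rather than after every update, gives heavy hitters time to accumulate before the counters are scaled down.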

5. Theoretical Analysis

For stochastic non-convex optimization (Zaheer et al., 2018), we measure how the algorithm converges to a stationary point at iteration $t$, i.e., $\|\nabla f(x_t)\|^2 \le c$ for some small constant $c$. In our analysis, we focus on the Count-Min Sketch Adam optimizer where we do not track the 1st moment, i.e., $\beta_1 = 0$. This optimizer was used in the Amazon Extreme Classification task (See Section 7.3) in order to save additional memory, similar to the Adafactor optimizer (Shazeer and Stern, 2018).

We assume that the function $f$ is $L$-smooth and has bounded gradients, with the bound denoted $G$. In addition, we receive an unbiased stochastic gradient estimate $g_t$ with fixed variance $\sigma^2$. Then, the following theorem holds:

Theorem 5.1.

Let the learning rate . Assume , , and are selected such that and . Given a Count-Min Sketch matrix with width and depth , we have the following bound that holds for Count-Min Sketch Adam with probability where :

The proof of Theorem 5.1 is found in the Appendix. For comparison, we have the convergence bound from (Zaheer et al., 2018) for the standard Adam optimizer where :

Discussion: The bounds are similar except for the additional term caused by the Count-Min Sketch approximation. The theorem states that the Count-Min Sketch Adam converges to a region around a stationary point, with the radius of that region set by the additional error term. The additional error term depends on the adaptivity of the optimizer, the error rate of the sketch, and the gradient norm. The error rate is inversely proportional to the width of the sketch and corresponds with the number of collisions along each row in the sketch. We can improve convergence gracefully by increasing the sketch’s width, which reduces the error caused when multiple components collide in the same bin. In practice, we bound the gradient norm to a reasonable constant to prevent instability. When the sketch width is sufficiently large, the error term becomes a small constant.

Note that the gradient norm decreases over time. Thus, the error caused by the count-sketch approximation decreases as the algorithm progresses, and we can shrink the sketch. A nice property of the count-sketch data structure is that you can add one half of the sketch to the other, reducing its size by half while maintaining its accuracy guarantees. Please see (Matusevych et al., 2012) for more details.
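The halving property can be illustrated on a single sketch row. The sketch below assumes an even width and that an item's bin is its raw hash value modulo the current width, so the bin after folding is the old bin modulo the new width; the function name `fold_row` is hypothetical.

```python
def fold_row(row):
    """Halve one sketch row by adding its second half onto its first half."""
    if len(row) % 2 != 0:
        raise ValueError("folding as written assumes an even width")
    half = len(row) // 2
    return [row[b] + row[b + half] for b in range(half)]


# An item whose raw hash value is 13 lives in bin 13 % 8 = 5 before folding
# and in bin 13 % 4 = 1 afterwards, where its counter has been carried over.
raw_hash = 13
row = [0.0] * 8
row[raw_hash % 8] += 2.5
folded = fold_row(row)
```

Because the operation is just an addition of counters, it is linear like the sketch itself; the folded row behaves like a sketch that was built at half the width from the start.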

The failure probability of exceeding the Count-Min Sketch error bound is controlled by the depth of the sketch: a deeper sketch fails less often. In our theoretical results, the depth of the sketch depends logarithmically on the number of parameters and the number of time steps. However, our experiments show that a modest depth of 3-5 is sufficient.

6. Related Work

Feature Compression: A straight-forward option is to use dimensionality reduction techniques to minimize the number of features, which in turn decreases the size of the model and optimizer simultaneously. (Tito Svenstrup et al., 2017) describes a hash embedding scheme where the output embedding for a feature is a weighted sum between the embedding vectors and the weight vector. Their goal was to minimize the size of the embedding layer while preserving its flexibility to model large vocabularies. However, dramatically reducing the feature space may sacrifice model accuracy. For example, training the BERT language model (Devlin et al., 2018) on a GPU with 12-16 GB memory requires a smaller, less effective architecture than the full-sized model trained on the 64 GB Google TPU.

Gradient Checkpointing: (Siskind and Pearlmutter, 2018; Chen et al., 2016) describe an orthogonal approach where training an $n$-layer neural network requires $O(\sqrt{n})$ memory. Their insight was that storing the activations for the back-propagation pass is the most memory-intensive part of training. Instead of storing all the activations, their algorithm checkpoints certain sections of the neural network and lazily recomputes the activations during the back-propagation phase. In other words, their approach saves memory by sacrificing extra computation time.

Low-Rank Approximation: A low-rank approximation has the potential to reduce the number of parameters from $O(nd)$ to $O(r(n + d))$ where $r \ll \min(n, d)$. However, updating the low-rank matrices is non-trivial. (Shazeer and Stern, 2018) demonstrated that there exists a unique, fast update rule for a rank-1 approximation that minimizes the I-divergence between the approximation and original matrix. Their rank-1 approximation was limited to non-negative matrices, so only the second moment of the Adam optimizer was compressed in their experiments. The drawback of this approach is that it requires materializing the entire matrix via an outer-product, which is prohibitive for large-scale embedding and softmax layers. In addition, since their update rule only applies for rank-1 vectors, their approach lacks the flexibility to increase the model’s memory capacity gracefully.

Count-Sketch: The original objective of the Count-Sketch data structure was to estimate the frequency of various events in the streaming setting. Recently, (Aghazadeh et al., 2018; Tai et al., 2018) demonstrated that the Count-Sketch can learn a compressed model that accurately preserves the features with the largest weights. Their objective focused on feature extraction in ultra-high dimensional settings and was limited to simple, linear models. In this work, we seek to use the Count-Sketch to preserve the different auxiliary variables maintained by commonly used first-order optimizers. The ideal solution is for the memory cost of the optimizer to grow sub-linearly with the model size, giving us the flexibility to increase the model’s capacity.

Type             Count-Sketch   Low-Rank
Gradient Type    Sparse         Dense
Memory Control   Flexible       Fixed
Query Time
Table 1. Trade-offs between the Count-Sketch and Low-Rank Approximation. $k$ is the number of active features or classes. $r$ is the rank of the two factors where $r \ll \min(n, d)$. The Count-Sketch data structure is ideally suited for the sparse embedding and softmax layers because it does not require a matrix-multiplication to reconstruct the entire auxiliary variable.

7. Experiments

All of the experiments were performed with the PyTorch framework on a single machine with 2x Intel Xeon E5-2660 v4 processors (28 cores / 56 threads), 512 GB of memory, and a single Nvidia Tesla V100. The code for the Count-Sketch Optimizer is available online. We designed the experiments to answer these questions:

  1. Do the model’s gradients and the optimizer’s auxiliary variables follow a power-law distribution?

  2. How accurate is our estimate of the auxiliary variables retrieved from the count-sketch data structure?

  3. What is the effect of cleaning the count-min sketch on convergence time and accuracy?

  4. How well does our count-sketch optimizer compare against the low-rank approximation given the same number of parameters?

  5. Does our count-sketch optimizer match the original baseline in terms of speed and accuracy?

Here are the five datasets used in the experiments:

  1. Wikitext-2 (Merity et al., 2016) - This dataset was extracted from Wikipedia and contains 2M training tokens with a vocabulary size of 33,278. (10.8 MB)

  2. Wikitext-103 (Merity et al., 2016) - A larger version of the Wikitext-2 dataset that contains 103M training tokens and its vocabulary size is 267,735. (539.2 MB)

  3. 1-Billion Word (LM1B) (Chelba et al., 2013) - This large-scale corpus contains 0.8 billion training tokens and a vocabulary with 793,471 words. (4.1 GB) An open-sourced PyTorch model is available online.

  4. MegaFace - A facial recognition dataset derived from MegaFace (Challenge 2). Each person is a candidate class, but we only select classes with at least 10 images. Thus, this sampled dataset contains 1,943,802 examples with 80,204 classes. 10K images are randomly sampled to create the test dataset. (4 GB)

  5. Amazon - This sampled recommendation dataset contains 70.3 million examples and over 49.5 million object classes. (20.9 GB)

We implemented the following approaches to compare and contrast against our approach:

  1. Non-Negative Matrix Factorization (NMF) Rank-1 — This decomposition minimizes the I-divergence between the auxiliary variable and the approximation formed from two rank-1 vectors. However, it is limited to non-negative matrices, so it cannot compress the auxiliary variables for Momentum or the 1st Moment of Adam. (Shazeer and Stern, 2018)

  2. Rank-1 — After each update, we perform an SVD decomposition of the auxiliary variable, and only keep the top singular value and its corresponding vectors. During the subsequent update, the auxiliary variable is reconstructed via an outer product. Unlike the NMF Rank-1 Approximation, this approach is not limited to non-negative values, but it is extremely slow and cannot be used in practice.

  3. Count-Sketch — As described in Section 4. This approach is also not limited to non-negative values and is capable of compressing the auxiliary variables for all optimizers efficiently.
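For context on the NMF Rank-1 baseline, the factored estimate described by Shazeer and Stern (2018) keeps only the row sums and column sums of a non-negative matrix and rebuilds an entry as their product divided by the grand total. The sketch below illustrates that reconstruction on plain Python lists; the function names are hypothetical and this is not the code used in the experiments.

```python
def rank1_factors(matrix):
    """Row sums, column sums, and grand total of a non-negative matrix."""
    rows = [sum(r) for r in matrix]
    cols = [sum(c) for c in zip(*matrix)]
    return rows, cols, sum(rows)


def rank1_entry(rows, cols, total, i, j):
    """Approximate entry (i, j) from the two factor vectors."""
    return rows[i] * cols[j] / total


# A matrix that is exactly rank-1 is recovered without error.
second_moment = [[1.0, 2.0], [2.0, 4.0]]
rows, cols, total = rank1_factors(second_moment)
approx = [[rank1_entry(rows, cols, total, i, j) for j in range(2)]
          for i in range(2)]
```

The factors use only n + d numbers, but there is no knob for spending more memory to get a better approximation, which is the flexibility the count-sketch offers through its width.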

Title Symbol
Count-Sketch CS
Low-Rank LR
Adam 1st Moment M
Adam 2nd Moment V
Non-Negative Matrix Factorization NMF
Table 2. Abbreviations

7.1. Small-Scale Experiments

Wikitext-2: The language model is a 2-layer LSTM with 672 hidden units. The dimensionality of the word embeddings is equal to the number of hidden units. The model is unrolled 35 steps for the back-propagation through time (BPTT). The model is regularized via Dropout with a 50% chance of disabling a unit. We train the model for 40 epochs with a mini-batch size of 20. For Momentum, the learning rate is 2.5, the decay rate is 0.9, and we clip the gradient norm to 0.25. For Adam, the learning rate is 0.001, the beta values $(\beta_1, \beta_2)$ are (0.9, 0.999), and gradient clipping is 1. We reduce the learning rate whenever the validation error plateaus. We use the full softmax layer, so only the embedding layer is sparse for this dataset.

$\ell_2$-Norm Approximation Error: Fig. 4 shows the $\ell_2$-Norm between the approximation and the original auxiliary variable over several training iterations. The left figure is for the Momentum optimizer, while the right figure is for the 2nd Moment for the Adam optimizer. All of the methods are given roughly an equal number of parameters to approximate the original auxiliary variable. For the Wikitext-2 dataset, the embedding and softmax layers use [33,278, 672] matrices. Therefore, the rank-1 decomposition uses two vectors that use 33,278 + 672 = 33,950 parameters. The count-sketch data structure is represented with a [3, 16, 672] tensor, containing 32,256 parameters. Our count-sketch approach maps the 33,278 word vocabulary into 16 distinct bins, so there are about 2,080 collisions for each bucket.
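The parameter budget for this comparison can be checked with a few lines of arithmetic. The sketch below takes the column dimension to be 672, following the hidden size in the model description and the Table 3 caption; the variable names are illustrative.

```python
vocab, hidden = 33_278, 672      # Wikitext-2 vocabulary and embedding width
depth, width = 3, 16             # count-sketch tensor is [3, 16, 672]

dense_params = vocab * hidden            # full auxiliary variable
rank1_params = vocab + hidden            # two rank-1 factor vectors
sketch_params = depth * width * hidden   # count-sketch tensor
items_per_bin = vocab / width            # average load of one bin in a row

print(dense_params, rank1_params, sketch_params, round(items_per_bin))
```

Both compressed representations land within a few percent of each other, which is what makes the approximation-error comparison a like-for-like one.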

The Adam optimizer’s 2nd Moment is strictly non-negative and is suitable for the NMF Rank-1 approximation. For the Momentum variable, we supplement the NMF decomposition with the SVD decomposition. The SVD decomposition maintains a good approximation of the Momentum variable. However, it is extremely slow during training, so we only show the approximation error for the first epoch of training. As expected, the NMF Rank-1 baseline poorly approximates the momentum variable, which is not strictly non-negative. It experiences significant variance in its approximation quality. The Count-Sketch is a consistent estimator for both variables, with slightly more error in each case.

Test Perplexity: Tables 3 and 4 show the test perplexity after training the model with the Momentum and Adam optimizers. For the momentum optimizer, the NMF Low-Rank approximation performs poorly, reinforcing the results from Fig. 4. When only the 2nd moment is compressed, the NMF Low-Rank and Count-Sketch approximations have negligible differences. When we compress both the 1st and 2nd moments with the Count-Sketch, there is some minor accuracy loss from the original optimizer.

Momentum   CS      LR-NMF
94.25      95.93   176.31
Table 3. Test Perplexity for Momentum Optimizer on the Wikitext-2 dataset. The size of the count-sketch tensor is [3, 16, 672] while the rank-1 approximation uses 33,278 + 672 parameters.
109.24 105.14 106.32 106.21
Table 4. Test Perplexity for Adam Optimizer on the Wikitext-2 dataset. The modifiers indicate which auxiliary variables are compressed.

Figure 4. Left - Momentum, Right - Adam - 2nd Moment. $\ell_2$-Norm between the approximation and the original auxiliary variable.

MegaFace: For this experiment, we obtain pretrained embeddings of size 512 from the FaceNet architecture (Schroff et al., 2015) trained on the MS-Celeb-1M dataset. Afterwards, we train a softmax classifier on the MegaFace dataset using LSH Sampling (Yen et al., 2018b; Vijayanarasimhan et al., 2014). For LSH Sampling, we use SimHash, i.e., Signed Random Projection (SRP), with K=15 bits per hash fingerprint. There are L=16 hash tables that are rebuilt every 250 iterations. For Adam, the learning rate is 0.001 and the beta values are (0.9, 0.999). For Adagrad, the learning rate is 0.1. All the models were trained for 10 epochs.
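As background on the hashing step, Signed Random Projection builds each fingerprint bit from the sign of a dot product with a random Gaussian vector. The sketch below is a generic pure-Python illustration with K=15 bits and L=16 tables; the function names and the small input dimension are assumptions, and it does not reproduce the sampling pipeline used in the experiments.

```python
import random


def make_srp_tables(dim, bits=15, tables=16, seed=0):
    """One set of `bits` random Gaussian hyperplanes per hash table."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)]


def srp_fingerprint(planes, vec):
    """Pack the signs of the projections of vec into a single integer."""
    code = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0.0 else 0)
    return code


tables = make_srp_tables(dim=8, bits=15, tables=16, seed=1)
vec = [0.3, -1.2, 0.7, 0.0, 2.1, -0.4, 0.9, -0.6]
codes = [srp_fingerprint(planes, vec) for planes in tables]
```

Only the direction of a vector affects its fingerprint, so vectors separated by a small angle tend to agree on most bits and fall into the same buckets.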

Fig. 5 shows the effect of cleaning the Count-Min Sketch Tensor on its corresponding optimizer. We measure how the testing accuracy, convergence rate, and auxiliary variable error changes because of cleaning for the Adam and Adagrad optimizers. The Count-Min Sketch tensor is set to of the original variable’s size. For Adam, the cleaning scheme is every 125 iterations, multiply the count-min sketch by a constant . For Adagrad, the rate of cleaning is the same, but the constant is changed to .

For both Adam and Adagrad, there is a noticeable drop in $\ell_2$-Norm error with cleaning, which reflects positively in terms of test accuracy and convergence. For Adam, the count-sketch optimizer with cleaning closely matches the convergence rate of the baseline and slightly surpasses its test accuracy. The test accuracy for Count-Sketch with cleaning is 69.4%, while the baseline is 69.03%. For Adagrad, cleaning did not improve the initial convergence rate, but allowed the final test accuracy to match the baseline. There is a solid improvement in test accuracy from using cleaning for the Count-Sketch Adagrad optimizer.

Given that the Adam optimizer already contains an exponential decay term, it is surprising that cleaning is necessary. However, despite further hyper-parameter tuning, the count-sketch optimizer with cleaning still achieves the best performance. For dense gradients, the decay term is applied to all elements. Since the gradients are sparse, only the non-zero elements are updated. Thus, the decay is applied in an irregular fashion for the elements in the sketch.

Figure 5. The effect of cleaning on the Count-Min Sketch Tensor and its corresponding optimizer for the MegaFace dataset.

7.2. Large-Scale Language Model

Since the Wikitext-103 and LM1B datasets have large vocabularies, we use Sampled Softmax (Jean et al., 2014) to induce sparsity in the softmax layer and for faster training. Each Count-Sketch Tensor is smaller than the original variable. Therefore, there are at least 15 collisions for each bin on average.
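The collision count can be checked against the sketch shapes reported in the results for each dataset ([3, 17,849, 256] and [3, 52,898, 256]). The vocabulary sizes below are assumptions taken from the commonly cited figures for these datasets, since this section does not state them; the helper names are hypothetical.

```python
def avg_collisions_per_bin(num_rows, width):
    """Average number of original rows hashed into one bin of one hash row."""
    return num_rows / width

def compression_factor(num_rows, depth, width):
    """How many times smaller the whole sketch is than the original variable."""
    return num_rows / (depth * width)

# Assumed vocabulary sizes (not given in this section).
WIKITEXT103_VOCAB = 267_735
LM1B_VOCAB = 793_471

wt_collisions = avg_collisions_per_bin(WIKITEXT103_VOCAB, 17_849)
lm_collisions = avg_collisions_per_bin(LM1B_VOCAB, 52_898)
wt_compression = compression_factor(WIKITEXT103_VOCAB, 3, 17_849)
```

Under these assumptions, each of the three hash rows folds roughly 15 vocabulary entries into every bin, and the full depth-3 tensor is about one fifth the size of the variable it replaces.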

Adagrad - Wikitext-103: Our language model is a single layer LSTM with 1024 hidden units. The dimensionality of the word embeddings is 256 and we use a projection layer between the LSTM and Softmax layers. The model is unrolled 35 steps BPTT. The model is regularized via Dropout with . We train the model for 25 epochs with a mini-batch size of 1024. For the Adagrad optimizer, the gradient norm is clipped to 0.1, the learning rate starts at 0.4 and decays linearly to 0 during training.

Results: For the Wikitext-103 dataset, we allocated a [3, 17,849, 256] Count-Sketch tensor for each auxiliary variable. By providing the Count-Sketch with more parameters, our method has notably better test accuracy than the NMF low-rank approximation while using only slightly more memory. In addition, despite using more parameters than the low-rank approximation, the count-sketch optimizer is still somewhat faster. Finally, the low-rank approximation fails to meet the same accuracy as the original baseline, while surprisingly the count-sketch optimizer has the best test perplexity.
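To make the [depth, width, dim] layout concrete, here is a minimal NumPy sketch of a Count-Sketch tensor that supports sparse row updates and median queries. The class name `CountSketchTensor` is hypothetical, and precomputed random bucket and sign tables stand in for hash functions; this is an illustration of the data structure, not the implementation used in the experiments.

```python
import numpy as np

class CountSketchTensor:
    def __init__(self, depth, width, dim, num_rows, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width, dim))
        # Assumption: random lookup tables replace real hash functions.
        self.bucket = rng.integers(0, width, size=(depth, num_rows))
        self.sign = rng.choice([-1.0, 1.0], size=(depth, num_rows))

    def update(self, rows, values):
        """Add values[k] (a dim-vector) to sketched row rows[k]."""
        for j in range(self.table.shape[0]):
            signed = self.sign[j, rows][:, None] * values
            # np.add.at accumulates correctly when buckets repeat.
            np.add.at(self.table[j], self.bucket[j, rows], signed)

    def query(self, rows):
        """Median over the depth signed estimates for each requested row."""
        est = np.stack([
            self.sign[j, rows][:, None] * self.table[j, self.bucket[j, rows]]
            for j in range(self.table.shape[0])
        ])
        return np.median(est, axis=0)

cs = CountSketchTensor(depth=3, width=16, dim=4, num_rows=100)
cs.update(np.array([7]), np.ones((1, 4)))
est = cs.query(np.array([7]))        # exact here: nothing else was inserted
```

Only the rows touched by a sparse gradient are read or written, so the cost of an update depends on the number of non-zero rows rather than on the vocabulary size.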

Metric           Adagrad   CS       LR-NMF
Time             6.4       6.6      6.7
Size             10,625    10,089   10,077
Test Perplexity  57.63     56.07    58.27
Table 5. Test Perplexity, Running Time, and Memory Consumption on the Wikitext-103 dataset using the Adagrad Optimizer. CS: Count-Sketch, LR: Low-Rank.

Adam - LM1B: For the 1-Billion Word dataset, our goal is to mimic multi-GPU distributed training on a single GPU. The original batch size is 128 with a learning rate of 5e-4. Because we increase the batch size from 128 to 1024, we scale the learning rate linearly by 8× (Goyal et al., 2017). In addition, we decay the learning rate linearly to zero over 5 training epochs. We double the LSTM size from 1024 to 2048, but keep the word embedding size at 256. The model is unrolled for 20 steps of BPTT. Dropout is kept nominally at and the gradient norm is clipped to 1. A surprising side effect of increasing the batch size was that training time fell by roughly 2×, from 12.25 hours to 6.25 hours per epoch, despite using a single GPU.
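The learning-rate schedule follows directly from the numbers in this paragraph. The sketch below applies the linear scaling rule to the batch-size change and then decays linearly to zero; the function names and the choice of `total_steps` are illustrative assumptions.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: the learning rate grows with the batch size."""
    return base_lr * new_batch / base_batch

def linear_decay(peak_lr, step, total_steps):
    """Learning rate at `step`, decaying linearly from peak_lr to zero."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)

peak = scaled_lr(5e-4, 128, 1024)            # batch size 128 -> 1024
total_steps = 1000                           # placeholder schedule length
halfway = linear_decay(peak, total_steps // 2, total_steps)
final = linear_decay(peak, total_steps, total_steps)
```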

Results: For the 1-Billion Word dataset, we allocated a [3, 52,898, 256] Count-Sketch tensor for each auxiliary variable. Our primary comparison is only with the 2nd moment because the NMF low-rank approximation is not applicable to the 1st moment. The count-sketch is slightly more accurate than the low-rank approximation. When both the 1st and 2nd moments are compressed with the count-sketch tensor, its accuracy is on-par with the low-rank approximation that compresses only the 2nd moment. In general, the count-sketch tensor is faster than the low-rank approach while using substantially less GPU memory. For large matrices, there is a noticeable cost with reconstructing the entire matrix to update only a sparse subset of values.

Metric     CS-MV   Adam     CS-V     LR-NMF-V
Time       27.1    26.4     26.75    29.2
Size (MB)  8,591   11,707   10,167   13,259
Table 6. Running Time and Memory Consumption on the 1-Billion Word dataset for the Adam optimizer.

Epoch  CS-MV  Adam   CS-V   LR-NMF-V
1      50.78  48.48  49.49  50.04
2      46.08  45.34  45.22  45.60
3      43.71  42.79  42.95  43.55
4      41.82  41.15  41.23  41.82
5      40.55  39.90  39.88  40.41
Table 7. Convergence Rate (Test Perplexity) after 5 epochs on the 1-Billion Word dataset. The modifiers indicate which auxiliary variables are compressed for the Adam optimizer.

7.3. Extreme Classification

For the extremely large-scale classification task, we conducted our experiments on an Amazon recommendation dataset. The task is to predict an object out of over 49 million classes given a query. The text query is parsed into trigram features. Feature hashing is applied to convert the strings into integers. The input feature dimension is 80K. On average, there are 30 non-zero features per query, so the input layer is very sparse and suitable for our Count-Sketch optimizer. We trained a single hidden layer, fully-connected neural network with an embedding dimension of 1024.

A traditional softmax classifier would require over 200 GB of memory, which is well beyond the memory capacity of the largest GPUs. Instead, we leverage a novel approach for extreme classification called Merged-Averaged Classifiers via Hashing (MACH) (Huang et al., 2018). This algorithm randomly merges the output classes into a manageable number of coarse-grained meta-classes via universal hashing. Several independent, fully-connected neural networks are trained to solve this meta-class classification task. Each meta-classifier is associated with a unique hash function that creates a distinct class mapping. At inference time, we recover the scores for the original classes by aggregating the meta-class scores assigned to the original output class. For this experiment, we used 20K meta-classes in the output layer of each meta-classifier. For high-accuracy models, we use 32 meta-classifiers. Each individual meta-classifier required 414 MB of memory for a total of 12.95 GB. Therefore, our ensemble MACH classifier used less memory than a monolithic softmax classifier.
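The merge-and-aggregate idea can be sketched with small toy sizes. In this example a seeded random assignment stands in for the universal hash functions, scores are aggregated by a plain mean, and all function names are hypothetical; it is meant to illustrate the mechanism, not to reproduce the MACH implementation.

```python
import numpy as np

def make_class_maps(num_classes, num_meta, num_models, seed=0):
    """One class -> meta-class map per meta-classifier.

    Assumption: a seeded random assignment replaces universal hashing.
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, num_meta, size=(num_models, num_classes))

def aggregate_scores(meta_scores, class_maps):
    """meta_scores: (num_models, num_meta) -> per-class scores (num_classes,)."""
    # For every model, look up the meta-class score assigned to each class.
    per_model = np.take_along_axis(meta_scores, class_maps, axis=1)
    return per_model.mean(axis=0)

num_classes, num_meta, num_models = 1000, 50, 4
class_maps = make_class_maps(num_classes, num_meta, num_models)

# Toy case: every meta-classifier is fully confident in the true meta-class.
true_class = 123
meta_scores = np.zeros((num_models, num_meta))
meta_scores[np.arange(num_models), class_maps[:, true_class]] = 1.0
scores = aggregate_scores(meta_scores, class_maps)
```

A wrong class only reaches the same aggregated score as the true class if it collides with it under every one of the independent maps, which becomes less likely as more meta-classifiers are added.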

Since we are primarily interested in faster training times, we limit ourselves to 4 meta-classifiers in this experiment. For our baseline, each meta-classifier is trained using the Adam optimizer with a batch size of 750. Given these settings, a single meta-classifier takes 4 GB of GPU memory, allowing us to train 4 models in parallel on a single GPU. For maximum memory savings, we eliminate the 1st moment and use a count-min sketch tensor of size [3, 266, 1024] for the 2nd moment (1% of original size). By using the Adam Count-Sketch optimizer, we reduce the memory cost for each model from 4 GB to 2.6 GB (45% smaller). We take advantage of this extra memory by increasing the batch size from 750 to 2600 (3.5× larger). As a result, the running time per epoch decreased from 5.32 hours to 3.3 hours (38% faster).
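Several of the ratios quoted in this paragraph can be reproduced with simple arithmetic. The sketch below assumes that "80K" means 80,000 input rows; the variable names are illustrative.

```python
# Sketch rows versus original rows of the 2nd-moment variable.
sketch_rows = 3 * 266                 # depth x width of the [3, 266, 1024] sketch
input_dim = 80_000                    # assumption: "80K" read as 80,000
sketch_fraction = sketch_rows / input_dim

# Batch-size increase enabled by the freed memory.
batch_ratio = 2600 / 750

# Relative reduction in running time per epoch.
time_reduction = (5.32 - 3.3) / 5.32
```

With these inputs the sketch holds about 1% of the original rows, the batch is roughly 3.5 times larger, and the epoch time drops by about 38%.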

We measure the accuracy of the MACH model using the Recall@100 metric on a test dataset containing 20K queries. First, we evaluate the meta-classifiers and aggregate their scores. Then, we check how often the target class appears within the top 100 scores generated by the classifier. A major bottleneck during evaluation is sorting the 49.5 million classes to find the top 100 scores. Since we are only comparing the model’s relative performance and are interested in fast running times, we down-sample the scores from 49.5 million to 1 million. The class subset contains the target classes for all 20K test queries and a random sample of the remaining classes. Given 16 meta-classifiers, the Adam baseline has a 0.6881 recall, while the Count-Sketch optimizer achieves a 0.6889 recall.
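The evaluation procedure can be sketched as follows: build a candidate subset that contains every target class plus a random sample of the rest, then measure how often the target lands in the top k scores. The function names are hypothetical and the sizes are toy values; `np.argpartition` is used so that the full score vector never has to be sorted.

```python
import numpy as np

def candidate_subset(num_classes, target_classes, subset_size, rng):
    """All target classes plus a random sample of the remaining classes."""
    targets = np.unique(target_classes)
    rest = np.setdiff1d(np.arange(num_classes), targets)
    extra = rng.choice(rest, size=subset_size - len(targets), replace=False)
    return np.concatenate([targets, extra])

def recall_at_k(scores, targets, k=100):
    """scores: (num_queries, num_candidates); targets: true column per query."""
    # Indices of the k largest scores per row, in no particular order.
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    hits = (topk == targets[:, None]).any(axis=1)
    return hits.mean()

rng = np.random.default_rng(0)
scores = rng.random((5, 1000))
best = scores.argmax(axis=1)          # targets that always rank first
worst = scores.argmin(axis=1)         # targets that always rank last
subset = candidate_subset(10_000, np.array([3, 7, 7, 42]), 100, rng)
```

Because the subset is smaller than the full class set, recall measured this way is an optimistic estimate; it remains useful for comparing two optimizers evaluated on the same subset.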

Type   Batch Size   Epoch Time (hours)   Recall@100
Adam   750          5.32                 0.4704
CS-V   2600         3.3                  0.4789
Table 8. Extreme Classification: A MACH ensemble with 4 meta-classifiers is trained on a single GPU using Adam and the Count-Sketch optimizer.

8. Conclusion and Future Work

In this paper, we present the concept of a count-sketch tensor to compress the auxiliary variables associated with popular first-order optimizers. The count-sketch tensor retains the constant-time update and query operations, while maintaining structured sparsity for high-speed vectorized operations. The count-sketch tensor can reduce the memory usage of large-scale models with minimal cost by taking advantage of the model’s sparsity. Going forward, we are interested in compressing the auxiliary variables associated with the hidden layers without incurring any performance penalty. We hope to leverage recent ideas of adding sparsity to the hidden layers in order to increase the size of the model without increasing its computational cost (Spring and Shrivastava, 2017; Shazeer et al., 2017; Wen et al., 2017). Structured sparsity in the hidden layers would mesh well with our current approach for the Embedding and Softmax layers.


Appendix A Appendix

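Count-Sketch Error Bound: (Charikar et al., 2002) Let $\hat{x}_i$ be the Count-Sketch estimate of component $i$ from vector $x \in \mathbb{R}^d$. For any component $x_i$, with probability $1 - \delta$, a Count-Sketch matrix with width $\Theta(1/\epsilon^2)$ and depth $\Theta(\log(d/\delta))$ satisfies

$$x_i - \epsilon \|x\|_2 \le \hat{x}_i \le x_i + \epsilon \|x\|_2$$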


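Count-Min Sketch Error Bound: (Cormode and Muthukrishnan, 2005) Let $\hat{x}_i$ be the Count-Min Sketch estimate of component $i$ from a non-negative vector $x \in \mathbb{R}^d$. For any component $x_i$, with probability $1 - \delta$, a Count-Min Sketch matrix with width $\Theta(1/\epsilon)$ and depth $\Theta(\log(d/\delta))$ satisfies

$$x_i \le \hat{x}_i \le x_i + \epsilon \|x\|_1$$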
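The one-sided nature of the Count-Min Sketch guarantee is easy to check empirically: for a non-negative vector, the estimate never falls below the true value, because every bin only accumulates non-negative mass. The sketch below uses random bucket tables in place of hash functions, and the helper names are hypothetical.

```python
import numpy as np

def cms_build(x, depth, width, rng):
    """Insert every component of x into a depth x width Count-Min Sketch."""
    buckets = rng.integers(0, width, size=(depth, x.size))
    table = np.zeros((depth, width))
    for j in range(depth):
        np.add.at(table[j], buckets[j], x)    # accumulate colliding entries
    return table, buckets

def cms_query(table, buckets, i):
    """Estimate x[i] as the minimum over the depth bins it was hashed to."""
    return min(table[j, buckets[j, i]] for j in range(table.shape[0]))

rng = np.random.default_rng(0)
x = rng.random(1000)                          # non-negative entries
table, buckets = cms_build(x, depth=3, width=100, rng=rng)
estimates = np.array([cms_query(table, buckets, i) for i in range(x.size)])
overestimate = estimates - x                  # never negative
```

Taking the minimum over rows picks the bin with the least collision mass, which is why adding depth tightens the estimate without ever producing an underestimate.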


For stochastic non-convex optimization, we measure how the algorithm converges to a stationary point - for some constant . Notation: batch size , learning rate , 2nd moment decay rate , count-min sketch error rate , count-min sketch failure probability . Assumptions: Here are the assumptions used in our analysis:

  1. Function is L-Smooth - There exists a constant such that

  2. Function has bounded gradients -

  3. The stochastic gradient oracle provides us with an unbiased estimate with fixed variance. Let

    represents the randomness (due to mini-batch sampling) at iteration .

For simplicity and to save additional memory by not tracking the 1st moment, let

. In this form, the optimizer is commonly called RMSPROP. Therefore, the update rule for all



where represents the Count-Min Sketch estimate of component from vector .
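A minimal sketch of such an update for sparse rows is given below. It assumes that the exponential moving average of the squared gradient is applied to the sketch as an additive update computed from the current estimate, and that the batch contains each row at most once; the function names, hyper-parameter defaults, and random bucket tables are all assumptions made for this illustration rather than the exact rule analyzed here.

```python
import numpy as np

def cms_query(table, buckets, rows):
    """Minimum over depth of the bins that each requested row maps to."""
    est = np.stack([table[j, buckets[j, rows]] for j in range(table.shape[0])])
    return est.min(axis=0)

def cms_rmsprop_step(params, rows, grads, table, buckets,
                     lr=1e-3, beta2=0.999, eps=1e-8):
    """One RMSProp-style step on the rows with non-zero gradient."""
    v_prev = cms_query(table, buckets, rows)
    # v_t = beta2 * v_prev + (1 - beta2) * g^2, written as an additive update.
    delta = (1.0 - beta2) * (grads ** 2 - v_prev)
    for j in range(table.shape[0]):
        np.add.at(table[j], buckets[j, rows], delta)
    v_hat = cms_query(table, buckets, rows)
    # Clamp at zero in case colliding rows pushed a bin slightly negative.
    params[rows] -= lr * grads / (np.sqrt(np.maximum(v_hat, 0.0)) + eps)

rng = np.random.default_rng(0)
depth, width, dim, num_rows = 3, 5, 2, 10
table = np.zeros((depth, width, dim))
buckets = rng.integers(0, width, size=(depth, num_rows))
params = np.zeros((num_rows, dim))
rows = np.array([3])
cms_rmsprop_step(params, rows, np.ones((1, dim)), table, buckets)
```

Only the rows present in the mini-batch are queried and updated, so the decay reaches a bin only when one of the rows hashed to it receives a gradient.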

Theorem A.1.

Let learning rate and batch size . Assume , , and are selected such that and . Given a Count-Min Sketch matrix width and depth , we have the following bound that holds for Count-Min Sketch Adam with probability where :


Given that the function is -smooth and by the optimizer update rule, we derive the following:


Next, we take the expectation of , given that we know (assumed fixed):

The second equality occurs because is an unbiased estimate of . Now, we upper-bound the term :

From Lemma A.4, we have the second equality. The second inequality occurs because of Lemma A.3, which is derived using the Count-Min Sketch error bound. The third inequality occurs because and when we drop from .

By substituting the upper-bound for , we arrive at the following:

The first inequality follows because the function has bounded gradients - . Now, the second inequality holds because . In addition, we split the and terms using the linearity of expectation. For the third inequality, we use the result and definitions in Lemma A.2. From the specified parameters for , , and , we assume the following conditions hold: and .