1 Introduction
Large-scale deep neural networks have achieved groundbreaking success in various cognitive and recommendation tasks including, but not limited to, image classification
(He et al., 2016), speech recognition (Amodei et al., 2016), and machine translation (Wu et al., 2016). However, modern deep neural networks have grown to hundreds of billions or even trillions of parameters, e.g., GPT-3 (Brown et al., 2020), Turing-NLG (Microsoft, 2020), and Megatron-LM (Shoeybi et al., 2020), and thus training and serving these models requires massive energy consumption that ultimately translates to carbon emissions (Strubell et al., 2019). It is estimated that modern data centers consume 200 TWh
(Jones, 2018; Andrae and Edler, 2015) of energy every year, which accounts for 1% of global electricity usage and is expected to increase to 20% by 2030. A large part of this demand is fuelled by modern deep learning systems, with inference cost dominating (80-90%) the total cost of successful deployment (Hernandez and Brown, 2020). Hence, a number of model compression techniques (Gupta and Agrawal, 2020) such as pruning, quantization (Choi et al., 2018), and distillation (Polino et al., 2018) have been developed and deployed. In recent years, there has been a similar surge (Naumov et al., 2019; Cheng et al., 2016; Zhou et al., 2018; He et al., 2017a) in the development of large-scale deep recommendation networks as well. Personalization and content recommendation systems have become ubiquitous on both edge- and data-center-scale systems, and thus scaling these models poses an even steeper demand on modern infrastructure. Unlike other domains, recommendation models present a new set of challenges due to data non-stationarity, and hence model compression techniques such as network pruning need to be re-evaluated in this context.
Recommendation models usually employ wide-and-deep (Cheng et al., 2016) fully-connected (FC) layers. Recent work (Park et al., 2018) has shown that modern data centers can spend as much as 42% of total computation in FC layers during the serving of recommendation systems. Unstructured network pruning is an effective technique that has been shown to reduce the computation cost of FC layers. For example, with 90% of the parameters in FC layers zeroed out, the computation cost can be reduced by 2-3x (Wang, 2020). Meanwhile, the memory footprint needed to hold the models in memory (Gupta et al., 2020) and the required memory communication bandwidth can also be reduced. This is particularly important in production systems, which typically load a large number of models in memory simultaneously (Park et al., 2018).
Pruning recommendation systems is advantageous but poses a number of challenges:

Online recommendation systems are lifelong learners (Parisi et al., 2019): they need to keep adapting to a non-stationary data distribution using an incremental training paradigm. The non-stationarity is rooted in the flux of content (videos, movies, text, etc.) that gets continuously added to or removed from the system. Such data distribution shift results in continuous change in the feature distribution and, therefore, in the importance of the model parameters. However, existing techniques apply network pruning in a stationary domain, and hence these techniques cannot be applied out-of-the-box.

The architecture of online recommendation models is highly heterogeneous. It contains very large embedding tables (Naumov et al., 2019) to represent categorical features. Besides, various MLPs are deployed to process the dense features and learn the interactions between input features. The FC layers in those MLPs have different propensities to pruning, and usually a lot of hand-tuning is required to find the optimal pruning sparsity and avoid accuracy degradation.
Incremental training periodically produces a new model by training the previous snapshot with the latest data, so that the new model captures the latest data distribution. However, the efficacy of incremental training requires that the network is sufficiently over-parameterized so that it quickly captures the new data patterns. It has been shown (Gordon, 2020) that, compared with the full dense network, the pruned network is no longer sufficiently over-parameterized and thus is not able to learn as well. Figure 1 provides an example showing that a pruned model is no longer able to adapt to the data distribution shift with incremental training.
We overcome the data distribution shift issue in pruning with a twofold strategy: a new incremental training paradigm called Dense-to-Sparse (D2S) and a novel pruning algorithm that is specifically designed for it:

We propose the D2S paradigm, which maintains a full dense model that adapts to the data non-stationarity through incremental training, and periodically produces a pruned model from the latest dense model. It requires a pruning algorithm that is able to produce an accurate pruned model from a well-trained full model with only a limited amount of data for fine-tuning.

To satisfy the requirements of the D2S paradigm, we propose a pruning algorithm based on a binary auxiliary mask. It draws a connection between heuristic Taylor approximation based pruning algorithms
(Molchanov et al., 2016, 2019) and sparse penalty based methods (He et al., 2017b; Liu et al., 2017). Based on this connection, the proposed algorithm provides a unified framework that inherits the advantages of both methods. Because of this, the algorithm also automatically learns the sparsity for each model layer and does not depend on heuristics to tune sparsity.
We also discuss system design considerations for applying pruning to production scale recommendation systems.
2 Background
2.1 Recommendation Models
Figure 2 depicts a generalized architecture used in many of the modern recommendation systems (Naumov et al., 2019; Cheng et al., 2016; Zhou et al., 2018; He et al., 2017a). Personalization models aim to find the most relevant content for a user and employ a combination of real-valued and categorical features. Categorical features are represented using embedding lookup tables, where each unique entity (e.g., videos/movies/text) is allocated a row in the table. For multiple entities, the embedding vectors are looked up and pooled together using some aggregate statistic to produce a single embedding for each categorical feature. Dense features are fed directly into Multi-Layer Perceptron (MLP) layers. All the embedding feature vectors, coupled with the dense features, undergo a feature interaction step where combined factors are learned between each pair of embeddings, as in Factorization Machines (Rendle, 2010) and xDeepFM (Lian et al., 2018). The most common type of interaction is the dot product, as described in the Deep Learning Recommendation Model (DLRM) (Naumov et al., 2019). In this paper, we consider industrial-scale production models similar to the aforementioned architectures.
2.2 Data Non-Stationarity
In contrast to the classical classification/regression problem, in which we assume a fixed data population, online recommendation models observe a gradually shifting data distribution. Suppose that at time $t$ the features and labels follow the data distribution $\mathcal{D}_t$, which changes gradually with $t$. Given the neural network parameterized by $\theta$, we define the loss over a data distribution $\mathcal{D}$ as $L(\theta; \mathcal{D})$. As the data is non-stationary, the parameters of the model also need adjustment in order to capture the data distribution shift. Suppose we set the parameters at time $t$ to be $\theta_t$; our goal is to find $\{\theta_t\}$ (i.e., the parameters of the network at every time $t$) such that the averaged loss across the time period $[0, T]$,

$$\frac{1}{T} \int_0^T L(\theta_t; \mathcal{D}_t) \, dt, \qquad (1)$$

is minimized. Here $T$ can be a very long period, e.g., months for a recommendation system in practice. In practice, these systems are optimized for a moving window of average loss because prediction accuracy on new data (Cervantes et al., 2018b) is much more valuable than lifetime accuracy of the model.
2.3 Incremental Training
$\{\theta_t\}$ can be learned with gradient descent on training data. However, in practice, due to the need of deploying the recommendation model for serving, it is hard to update the parameters in real time (Geng and Smith-Miles, 2009; Cervantes et al., 2018b). Instead, the training process is discretized, where parameters are updated at certain fixed time points using the accumulated batch of data. Without loss of generality, suppose that the parameters are only updated at times $t_1, t_2, \dots$, and assume the time points for updates are equally spaced, i.e., $t_{k+1} - t_k = \Delta t$ and thus $t_k = k \, \Delta t$, where $\Delta t$ denotes the length between two consecutive time points for model update.
Instead of using the loss in Equation 1, $\{\theta_{t_k}\}$ is learned by minimizing the following loss:

$$\frac{1}{T} \sum_{k} \int_{t_k}^{t_{k+1}} L(\theta_{t_k}; \mathcal{D}_t) \, dt. \qquad (2)$$
The incremental training is conducted in the following way. We start with a pre-trained model $\theta_{t_0}$, which is trained with offline data, at time $t_0$. To obtain a new model at $t_k$ ($k$ is a positive integer), we start with the previous snapshot $\theta_{t_{k-1}}$ and apply gradient descent on it with the data collected between times $t_{k-1}$ and $t_k$. This latest model is then fixed and deployed for serving between times $t_k$ and $t_{k+1}$.
Notice that the data distribution shift is a slow and gradual process, and thus as long as $\Delta t$ is not too large, the data distributions between $t_{k-1}$ and $t_{k+1}$ are similar and the model is able to give accurate predictions in the coming time window $[t_k, t_{k+1}]$.
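The discretized incremental-training loop described above can be sketched as follows. This is a minimal illustration, not the production trainer: the linear model, the drift pattern, and all hyperparameters are stand-in assumptions.

```python
import numpy as np

def sgd_step(w, X, y, lr):
    """One gradient step on the squared loss of a linear model y ~ X @ w."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def incremental_training(w, data_stream, epochs, lr):
    """Discretized incremental training: at each update point t_k we start from
    the previous snapshot and train only on the data accumulated in the last
    window, for a small number of epochs (to avoid overfitting that window)."""
    snapshots = []
    for X, y in data_stream:            # one (X, y) chunk per time window
        for _ in range(epochs):
            w = sgd_step(w, X, y, lr)
        snapshots.append(w.copy())      # this snapshot is deployed for serving
    return snapshots

# Toy non-stationary stream: the true regression weight drifts from 1.0 to ~1.95.
rng = np.random.default_rng(0)
stream = []
for k in range(20):
    true_w = 1.0 + 0.05 * k
    X = rng.normal(size=(64, 1))
    stream.append((X, X[:, 0] * true_w))

snaps = incremental_training(np.zeros(1), stream, epochs=3, lr=0.2)
# Later snapshots track the drifted distribution that earlier ones never saw.
```

Because each window is visited only once and for a few epochs, the later snapshots follow the drifting target while earlier snapshots become stale, which mirrors why a served snapshot is only trusted for the next window.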
Pass the training data only for a few epochs
Different from traditional training of neural networks, in which the optimizer passes over the training set for a large number of epochs, in each iteration of online incremental training (i.e., from $t_{k-1}$ to $t_k$), the model will be severely overfitted if the optimizer passes over the training data generated between $t_{k-1}$ and $t_k$ for too many epochs (Zheng et al., 2020). This is similar to catastrophic forgetting (Kirkpatrick et al., 2017) in lifelong learning. Besides, the system requires the trainers to finish training within $\Delta t$ time in order to deploy on time, which also puts a constraint on the number of epochs for training.
2.4 Network Pruning
Pruning is one of the most important techniques for deep neural network (DNN) model compression. By removing redundant weights and their related computation, pruning can dramatically reduce the computation and storage requirements of large DNN models (Zhang et al., 2018). Previous works (Park et al., 2016; Wang, 2020) demonstrate that, on general-purpose processors (e.g., CPUs and GPUs), pruning offers significant speedup during DNN model inference. As an example, Wang (2020) shows that a speedup of up to 2.5x can be achieved on an Nvidia V100 GPU with an unstructured sparsity of 90%. Considering that CPUs, which have much lower hardware parallelism than GPUs, are widely deployed for running deep learning workloads in data centers (Park et al., 2018), an even higher performance benefit is expected (Yu et al., 2017) for pruning on recommendation models.
Learning a sparse network can be formulated as the following constrained optimization problem:

$$\min_{\theta} \; L(\theta; \mathcal{D}) \quad \text{s.t.} \quad \|\theta\|_0 \le K, \qquad (3)$$

where $K$ is the number of nonzero parameters we can have. Notice that the above problem is highly non-convex, and empirically, directly training a sparse network is not able to achieve high accuracy. Instead, starting with a large network and learning a sparse network via pruning works much better. See also Liu et al. (2018); Ye et al. (2020) for more empirical and theoretical results.
Starting with a large dense neural network, network pruning algorithms remove the redundant parameters before, during, or after the training of the full network (for example, "pruning during training" means that pruning happens during the training of the large network). The learned sparse network requires fine-tuning or retraining with a certain amount of data in order to obtain good accuracy. We now give a detailed review of the different types of pruning algorithms.
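As a concrete instance of pruning under the L0 constraint of Equation 3, a minimal magnitude-based projection might look like the sketch below. This is the standard magnitude heuristic, not the AUX method this paper proposes; the example weights are illustrative.

```python
import numpy as np

def magnitude_prune(weights, k):
    """Project weights onto the constraint ||w||_0 <= k by keeping the k
    entries with the largest magnitude and zeroing out the rest. Ties at the
    threshold may keep slightly more than k entries; fine for a sketch."""
    flat = np.abs(weights).ravel()
    if k >= flat.size:
        return weights.copy()
    threshold = np.partition(flat, -k)[-k]   # k-th largest magnitude
    return weights * (np.abs(weights) >= threshold)

w = np.array([[0.1, -2.0, 0.3],
              [1.5, -0.05, 0.7]])
pruned = magnitude_prune(w, 3)   # keeps -2.0, 1.5, 0.7
```

Note that this fixes the per-tensor sparsity up front; learning the per-layer sparsity instead is precisely what the algorithm in Section 5 is designed to do.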
3 Related Work
Model compression has received a lot of attention in the last few years, with techniques like quantization (Choi et al., 2018; Jain et al., 2020), distillation (Polino et al., 2018; Hinton et al., 2015; Kim and Rush, 2016), and factorization (Wang et al., 2019)
making their way into most state-of-the-art models. Network pruning has also been proposed in a number of different settings, and can generally be classified into three classes based on when pruning is applied:
Pruning Before Training Pruning deep learning models before the training process is a recent trend in pruning algorithms. The lottery ticket hypothesis (Frankle and Carbin, 2018) demonstrates that particular subnetworks (winning tickets) can be trained in isolation and reach test accuracy comparable to the original network in a similar number of iterations. Dettmers and Zettlemoyer (2019); van Amersfoort et al. (2020); Wang et al. (2020) propose various criteria and techniques to find the winning tickets using only a few batches of training data. However, pruning-before-training algorithms are not applicable to online recommendation systems. The optimal sparsity structure keeps changing due to the non-stationary data, and thus the winning tickets selected using earlier data might not be winning tickets for later data.
Pruning During Training Typical pruning-during-training algorithms are sparse penalty based: they introduce a sparse penalty into the training objective to enforce sparsity (He et al., 2017b; Liu et al., 2017; Ye et al., 2018; Wen et al., 2016; Zhou et al., 2016; Lebedev and Lempitsky, 2016; Alvarez and Salzmann, 2016; Gordon et al., 2018; Yoon and Hwang, 2017). Besides, various methods (Dettmers and Zettlemoyer, 2019; Ding et al., 2019; Evci et al., 2020) are proposed to keep a model sparse throughout the training process. Recent works (Kusupati et al., 2020; Azarian et al., 2020) learn the threshold for pruning the weights and thus also automatically learn the sparsity per layer. Despite their empirical success and the ability to learn the sparsity, these methods cannot be applied to our online recommendation models. This is because, as we discuss in Section 4.1, for online recommendation systems, sparse models learn slower than full dense models. Therefore, using these pruning-during-training methods will prevent the model from capturing the data distribution shift in the long term.
Pruning After Training
Most pruning algorithms that prune after training use heuristic criteria to measure the weight/neuron importance
(Liu et al., 2018; Blalock et al., 2020; Molchanov et al., 2016; Li et al., 2016; Molchanov et al., 2019). However, these heuristic-based methods usually require strong prior knowledge of the desired sparsity structure, i.e., the desired sparsity for different layers. Handcrafting those hyperparameters can hardly give an optimal solution, and the solution might also change due to the data non-stationarity. Besides, previous works only consider using a single criterion to measure weight importance. Our algorithm design improves over those methods by automatically learning the sparsity structure and combining multiple criteria to measure weight importance.
4 Dense-to-Sparse Paradigm
4.1 Difficulties in Incremental Training of Sparse Network
The success of the incremental training procedure relies on the fact that the full model is able to converge to a good local optimum in every incremental training step and thus captures the distribution shift in the long term. However, this is not the case for the pruned model. The reasons are twofold:

First, as the sparse network has much fewer parameters than the full network, in each incremental training step the sparse network learns slower than the full network and converges to a worse local optimum. This is because the sparse network cannot tap into the optimization benefits of over-parameterization (Jacot et al., 2018; Du et al., 2018; Mei et al., 2018).

Second, different data distributions at different times might require sparse networks with different sparsity structures, i.e., different layers might need different sparsity, or the pruned parameters might be distributed differently within the same layer. However, simply performing incremental training on the sparse network only updates the values of the nonzero parameters and keeps the sparse structure fixed.
Empirically, we find that naively applying incremental training to the sparse models is able to adapt to the data distribution shift only over a short period, as the data distribution shifts only slightly in such a short period. We refer readers to Section 6.2 for detailed empirical results.
4.2 Capture Data Distribution Shift with Full Model and then Prune
To address the difficulties of adapting the sparse model to the data distribution shift, we propose the Dense-to-Sparse (D2S) paradigm. It maintains a full model and applies incremental training on it to adapt to the data distribution shift. The sparse model is then produced periodically from the maintained full model using a customized pruning algorithm.
Figure 4 shows an overview of the D2S paradigm. Every period of length $T_p$, we generate a sparse model with an updated sparsity structure, learned from the recent full model with the latest data. As an example, at time $t + T_p$, we generate a new sparse model by pruning the full model from time $t$. The incremental training data in the time window $[t, t + T_p]$ is used for pruning and the necessary fine-tuning.
The learned sparse model is deployed for serving during the time window $[t + T_p, t + 2T_p]$. Notice that in this period, incremental training is also applied to the sparse model: at any time in this window, the sparse model has been incrementally trained using the training data collected since $t + T_p$.
The hyperparameter $T_p$
needs to be carefully chosen for the D2S paradigm. In an ideal case, we need to choose the value to satisfy two requirements:
Within the serving window (before $t + 2T_p$), the loss of the deployed sparse model needs to be lower than or similar to that of a new sparse model generated by pruning the current full model.

Beyond the serving window (after $t + 2T_p$), the loss of the deployed sparse model will be higher than that of a new sparse model generated by pruning the current full model.
The value of $T_p$ cannot be larger, because the data distribution shift will hurt the accuracy of stale sparse models. On the other hand, considering the noise and variation in the incremental training, we can conservatively choose a slightly lower value. However, $T_p$ cannot be too small either. As shown in Figure 4, for the time window $[t + T_p, t + 2T_p]$, we need to launch three jobs: incremental training of the dense and sparse models, and pruning/fine-tuning. With a smaller value, more training resources are spent on pruning/fine-tuning. In the worst case, if $T_p = \Delta t$, we need to maintain many jobs in parallel, for which the requirement on training resources is unacceptable.
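The job layout of the D2S paradigm can be sketched as a simple schedule. The window indexing and job names below are illustrative assumptions, not the production scheduler described in the paper.

```python
def d2s_schedule(num_windows, prune_every):
    """For each incremental-training window, list the jobs D2S runs: the dense
    model is always incrementally trained; every `prune_every` windows a
    pruning/fine-tuning job derives a fresh sparse model from the latest dense
    snapshot, and in the remaining windows the deployed sparse model is
    incrementally trained instead."""
    events = []
    for k in range(num_windows):
        jobs = ["dense_incremental"]
        if k % prune_every == 0:
            jobs.append("prune_and_finetune")   # new sparse model from the dense snapshot
        else:
            jobs.append("sparse_incremental")   # keep the serving model up to date
        events.append((k, jobs))
    return events

schedule = d2s_schedule(num_windows=6, prune_every=3)
```

Setting `prune_every` smaller increases the fraction of windows that run the extra pruning/fine-tuning job, which is exactly the resource trade-off discussed above.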
4.3 Requirements for Pruning Algorithm
The special design of D2S raises several requirements for the pruning algorithm. First, D2S requires a pruning algorithm that produces a sparse model given a well-trained full model, and thus only a pruning-after-training algorithm is desirable in our system.
Second, following the discussion of the choice of $T_p$ in Section 4.2, if the sparse model obtained by the pruning algorithm is able to recover its original performance with only a small amount of data, then we are able to choose a smaller $T_p$, which makes the pruned model suffer less from the data distribution shift issue.
Third, due to the heterogeneous architecture of the recommendation model, it is expected that the optimal sparsity varies across layers. However, hand-crafted tuning of layer-wise sparsity is untenable due to its high cost. Moreover, the layer-wise sparsity can differ at different times. It is desirable to let the algorithm learn the sparsity of each layer.
In summary, a desirable pruning algorithm for online recommendation systems should be a pruning-after-training algorithm that produces a sparse network, needs only a small amount of data for fine-tuning, and learns the sparsity for each layer automatically.
5 Auxiliary Mask Pruning Algorithm
Classic pruning-after-training algorithms use heuristic criteria to measure the importance of weights and prune the less important weights. Some popular heuristics are magnitude based criteria (Li et al., 2016, 2018; Park et al., 2020) and Taylor approximation based criteria (Molchanov et al., 2016, 2019).
Despite the impressive empirical achievements of these heuristic based methods, they require a significant amount of fine-tuning in terms of more data or more epochs. Moreover, in the presence of data distribution shift, these methods are unable to learn a good sparse network, especially when the amount of data available for fine-tuning is limited. We argue that there are three aspects of these algorithms that can be improved:
(1) Using a heuristic criterion for pruning requires setting the desired sparsity for each layer by hand, which tends to lead to a suboptimal choice; it is better to learn the sparsity for each layer.
(2) These heuristic methods usually require gradual pruning schedules to achieve the best accuracy, which are usually sensitive to hyperparameters. Instead, there needs to be a principled method that can produce sparse networks with minimal tuning.
(3) Previous work only considers using a single heuristic to measure weight importance. However, we find that combining different heuristics can significantly improve pruning quality.
Based on these motivations, we propose an auxiliary mask based pruning method. To address (1) and (2), our algorithm borrows the pruning dynamics of sparse constraint based pruning methods (i.e., He et al. (2017b); Liu et al. (2017)) and is able to automatically decide the sparsity of each layer and the percentage of weights to prune at each iteration. To address (3), we draw an equivalence between the sparse constraint based method and Taylor approximation based heuristics. We find that our dynamics can be viewed as a fine-grained version of Taylor approximation pruning. Based on this insight, we introduce the idea of using multiple heuristics for pruning.
Recall that given a well-trained dense network with parameters $\theta$, our goal is to reduce the size of the network by constraining $\|\theta\|_0 \le K$ for some constraint $K$. We consider adding a binary mask with latent continuous parameters into the network and reducing the number of nonzero parameters by learning sparse masks. That is, for each parameter $\theta_i$, we equip it with a binary mask $m_i = \mathbb{I}(s_i > 0)$ with a latent continuous parameter $s_i$, producing a masked parameter $\theta_i \, \mathbb{I}(s_i > 0)$. We are able to prune out $\theta_i$ by letting $s_i \le 0$ and vice versa. In this way, by learning sparse masks (allowing most $s_i$ to be non-positive), we are able to zero out a large portion of the parameters. An advantage of using a binary mask is that the algorithm will not over-shrink the unpruned weights.
We consider the following problem:

$$\min_{\theta, s} \; L\big(\theta \odot \mathbb{I}(s > 0)\big) \quad \text{s.t.} \quad \|\mathbb{I}(s > 0)\|_0 \le K.$$

Here $\odot$ denotes the element-wise product. Notice that since we consider a binary mask, $\|\mathbb{I}(s > 0)\|_0 = \|\mathbb{I}(s > 0)\|_1$. It is equivalent to consider training by minimizing the following penalized loss:

$$L\big(\theta \odot \mathbb{I}(s > 0)\big) + \lambda \, \|\mathbb{I}(s > 0)\|_1.$$
Here $\lambda$ is the penalty term that enforces sparsity, and a higher $\lambda$ gives higher sparsity. Notice that the indicator function is not differentiable; to overcome this, we use the straight-through estimator (Bengio et al., 2013; Hinton et al., 2012), which replaces the ill-defined gradient of a non-differentiable function in the chain rule with a fake gradient. Specifically, we consider the following update rule at iteration $t$ with learning rate $\eta$:

$$s^{(t+1)} = s^{(t)} - \eta \left( \frac{\partial L\big(\theta \odot \mathbb{I}(s^{(t)} > 0)\big)}{\partial \big(\theta \odot \mathbb{I}(s^{(t)} > 0)\big)} \odot \theta \odot g\big(s^{(t)}\big) + \lambda \, g\big(s^{(t)}\big) \right). \qquad (4)$$

Here $g(s)$ is the fake gradient used to replace the ill-defined $\nabla_s \mathbb{I}(s > 0)$. Some common choices are $g(s) \equiv 1$ (Linear STE) or $g(s) = \mathbb{I}(s > 0)$ (ReLU STE). Notice that for the ReLU STE, the gradient for any entry of $s$ with a negative value always equals zero. This means that once a weight is pruned, there is no chance for it to become alive again, which makes it impossible for the algorithm to correct its mistake if it mistakenly prunes out some important weights. Based on this intuition, we choose the Linear STE in this paper.
The Connection with Taylor Approximation Based Pruning
Choosing the Linear STE $g(s) \equiv 1$ and using the first-order Taylor approximation, we have

$$\frac{\partial L\big(\theta \odot \mathbb{I}(s > 0)\big)}{\partial (\theta_i m_i)} \, \theta_i \;\approx\; L\big(\theta \odot \mathbb{I}(s > 0)\big) - L\big(\theta_{-i} \odot \mathbb{I}(s > 0)\big).$$

Here $\theta_{-i}$ denotes the parameter vector with $\theta_i$ set to 0 and all other elements the same as in $\theta$. From this perspective, our algorithm calculates the weight importance (with batches of data) using the first-order Taylor approximation. Once there is enough evidence that a certain weight is not important, i.e., its corresponding mask becomes zero, the weight is pruned.
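A minimal sketch of one auxiliary-mask update step with the Linear STE is given below. The weight, latent-parameter, and gradient values are illustrative stand-ins; the loss gradient with respect to the masked weights is assumed to come from a data batch.

```python
import numpy as np

def aux_mask_step(w, s, grad_masked_w, lam, lr):
    """One update of the latent mask parameters s using the Linear STE, i.e.
    the ill-defined gradient of the indicator 1{s > 0} is replaced by 1.
    The data term grad_masked_w * w is the first-order Taylor estimate of the
    loss change from pruning each weight; lam is the sparsity penalty."""
    grad_s = grad_masked_w * w + lam       # Linear STE: d(mask)/d(s) ~ 1
    s = s - lr * grad_s
    mask = (s > 0).astype(w.dtype)         # binary mask actually applied to w
    return s, mask

w = np.array([1.0, -0.2, 0.5])
s = np.array([0.1, 0.1, 0.1])
g = np.array([-0.5, 2.0, 0.1])             # stand-in batch gradients w.r.t. w * mask
s, m = aux_mask_step(w, s, g, lam=0.3, lr=0.5)
# The weight with a small Taylor score has its latent parameter pushed below 0.
```

Because the Linear STE keeps updating latent parameters that are already negative, a mistakenly pruned weight can later cross back above zero and be revived, which is the property motivating this choice over the ReLU STE.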
Combining Multiple Criteria: the Generalized Pruning Dynamics
Notice that our algorithm is able to: 1) learn the sparsity of each layer; 2) perform iterative pruning with the pruning percentage decided automatically. To further improve the result, we find that combining multiple criteria is better than using a single criterion to measure weight importance. We make the following modifications over the vanilla version.
First, we find that it is more useful to measure the importance of a weight by how much the loss changes if the weight is pruned (i.e., $|L(\theta) - L(\theta_{-i})|$) rather than by how much the loss decreases (i.e., $L(\theta) - L(\theta_{-i})$). This is consistent with the findings in computer vision
(Molchanov et al., 2016, 2019) and can be achieved by using the absolute value of the Taylor approximation. Besides, we find it useful to also include weight magnitude information when measuring weight importance. Incorporating all these motivations, we consider the following update rule:
$$s_i^{(t+1)} = s_i^{(t)} - \eta \, \big(\lambda - I_i^{(t)}\big), \qquad (5)$$

where we have

$$I_i^{(t)} = \alpha \left| \frac{\partial L\big(\theta \odot \mathbb{I}(s^{(t)} > 0)\big)}{\partial (\theta_i m_i)} \, \theta_i \right| + (1 - \alpha) \, |\theta_i|.$$

Here $\alpha$ controls the ratio of the information coming from the Taylor approximation and from the weight magnitude. Notice that here we are no longer solving an optimization problem but doing a fine-grained version of weight importance measurement.
The Gradient Rescaling Trick
Directly applying the update rule in Equation 5 is problematic. The reason is that the Taylor approximation term and the weight magnitude term may have very different scales within each layer and across different layers. To resolve this issue, we propose the gradient rescaling trick, in which we normalize each component of the gradient to ensure they have the same scale. The final update rule is as follows:

$$s_i^{(t+1)} = s_i^{(t)} - \eta \, \big(\lambda - \tilde{I}_i^{(t)}\big), \qquad (6)$$

where we have

$$\tilde{I}_i^{(t)} = \alpha \, \frac{\left| \frac{\partial L}{\partial (\theta_i m_i)} \, \theta_i \right|}{\left\| \frac{\partial L}{\partial (\theta \odot m)} \odot \theta \right\|} + (1 - \alpha) \, \frac{|\theta_i|}{\|\theta\|}.$$

Here $\|\cdot\|$ denotes the vector norm, computed per layer. We find that without using the weight magnitude information, the error increases by about 120%, and if we directly use the vanilla update rule in Equation 5, the error increases by about 2400%.
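The gradient-rescaling trick amounts to normalizing each criterion within the layer before mixing. A sketch of the combined importance score follows; the weight and gradient values are illustrative.

```python
import numpy as np

def combined_importance(w, grad_masked_w, alpha=0.5, eps=1e-12):
    """Combine the absolute first-order Taylor term and the weight magnitude,
    each normalized to unit L2 norm within the layer so the two criteria share
    the same scale; alpha trades off the two criteria."""
    taylor = np.abs(grad_masked_w * w)
    taylor = taylor / (np.linalg.norm(taylor) + eps)
    magnitude = np.abs(w) / (np.linalg.norm(w) + eps)
    return alpha * taylor + (1 - alpha) * magnitude

w = np.array([1.0, -0.2, 0.5])
g = np.array([-0.5, 2.0, 0.1])
imp = combined_importance(w, g)   # highest score for the large, high-Taylor weight
```

Without the per-layer normalization, whichever term happens to have the larger raw scale would dominate the mask updates in that layer, which is the failure mode of the vanilla rule in Equation 5.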
6 Experiments
Training Setup
We conduct our numerical experiments on training recommendation models for click-through rate prediction tasks on large-scale systems. We use the DLRM model architecture (Naumov et al., 2019) to design the networks with both dense and categorical features and use the dot-product interaction method. Please note that the proposed methods are agnostic to the model architecture and are applicable to most recommendation models.
The models are trained using an asynchronous data-parallel distributed setup, where we have multiple trainer nodes working on a chunk of the data, and a centralized server that synchronizes the weights across trainers. Similar to Zheng et al. (2020), we perform a single pass over the training data. We use the Adagrad (Duchi et al., 2011) optimizer for training the network parameters, and all hyperparameters are selected using cross-validation. We use pretrained models trained with data samples, and then train them incrementally for data samples for all experiments.
We use the stochastic gradient descent optimizer for learning the latent auxiliary parameters $s$ of the mask. During the learning of the sparse mask, we also allow the (unpruned) weights to be updated; unpruned weights are those whose auxiliary parameter is greater than zero. For the proposed algorithm, we achieve the targeted sparsity by tuning the penalty $\lambda$, which penalizes $\|\mathbb{I}(s > 0)\|_1$. For example, two settings of $\lambda$ give around 0.75 (75% of weights are zeroed out) and 0.80 overall sparsity, respectively. We also note that the prune penalty is stable and produces the same desired sparsity for different lengths of the pruning phase. This is very important to avoid re-tuning for different data segments. We set $\alpha = 0.5$, giving equal weights to the information coming from the Taylor approximation and weight magnitude terms for AUX. We aim for an overall target sparsity ratio of 0.8, which is expected to give up to a 2x speedup on large-scale models.
Evaluation
For model accuracy, we report the crossentropy loss (CE) from the classification task. More specifically, recommendation systems often use the Normalized Cross Entropy (He et al., 2014) metric which measures the average log loss divided by what the average log loss would be if a model predicted the background clickthrough rate.
Given that we care more about the performance on incoming new data rather than historical performance over the whole dataset, we also consider the look ahead window evaluation (eval) CE (Cervantes et al., 2018a), in which the CE is calculated using the data (before passing it to the optimizer) within the moving time window (e.g., samples for this paper). Since we only pass each data point to the optimizer once, and the data used for calculating the look ahead CE has not been passed to the optimizer at the time of evaluation, the look ahead eval CE is an improved version of the windowed evaluation loss. We report the relative CE of the compared methods with respect to a baseline.
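The evaluation metrics can be sketched as follows. The exact relative-CE formula is elided in this copy of the paper, so the percentage-change definition below is a hypothetical reading, not the paper's verbatim definition.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy of predicted click probabilities p against labels y."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def relative_ce(ce_method, ce_baseline):
    """Hypothetical relative CE: percent change of a method's CE over a baseline."""
    return 100.0 * (ce_method - ce_baseline) / ce_baseline

# Look-ahead evaluation: score a window of data BEFORE the optimizer consumes it.
preds = np.array([0.9, 0.1])
labels = np.array([1.0, 0.0])
window_ce = cross_entropy(preds, labels)
```

Evaluating each window before training on it makes the look ahead eval CE an honest estimate of serving-time accuracy under the single-pass training regime.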
Compared Methods
We compare the baselines below with the proposed Auxiliary Mask Pruning method (AUX).
(1) Magnitude based pruning (MP) (Li et al., 2016), which uses the weight magnitude to rank weight importance and prunes the least important weights. We consider an iterative version of magnitude based pruning, in which we gradually and linearly increase the pruning ratio (i.e., the percentage of pruned weights) from zero to the targeted sparsity over a period of time (the pruning phase).
(2) Taylor Approximation based Pruning (TP) (Molchanov et al., 2016), which is similar to MP but measures weight importance via the (first-order) Taylor approximation.
(3) Momentum Based Pruning (MoP) (Ding et al., 2019; Dettmers and Zettlemoyer, 2019; Evci et al., 2020): Since we have non-stationary data, it is also reasonable to measure weight importance by its momentum (calculated as the exponential moving average of gradients). Larger momentum means the weight is more important for recent data, as the magnitude of its updates is larger. However, empirically, we find that naively applying those methods gives very poor results. We believe this is because some of the very high magnitude weights (which may have low momentum) affect the network a lot when pruned. We enhance the momentum techniques by measuring the importance of a weight with both the (normalized) weight magnitude and the (normalized) momentum magnitude. That is, the weight importance is calculated by

$$c_1 \, \frac{|\theta_i|}{\|\theta\|} + c_2 \, \frac{|v_i|}{\|v\|},$$

where $v_i$ is the momentum of the weight and $v$ is the momentum of all the weights in the same layer; the coefficient pairs $(c_1, c_2)$ are reported with each MoP variant. Momentum is calculated using the decay parameter (chosen using cross-validation). This algorithm does not prune weights with large magnitude, but at the same time keeps weights with low magnitude and high momentum. This achieves the same objective as methods like Rigging the Lottery (Evci et al., 2020) and Sparse Networks From Scratch (Dettmers and Zettlemoyer, 2019), which periodically prune and grow weights based on gradient momentum.
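The enhanced MoP score can be sketched as a per-layer mix of normalized magnitude and normalized momentum. The coefficient pair mirrors the MoP variants reported in Table 1, while the weight and momentum values are illustrative.

```python
import numpy as np

def mop_importance(w, momentum, c_mag=0.5, c_mom=0.5, eps=1e-12):
    """Weighted sum of normalized weight magnitude and normalized momentum
    magnitude within a layer. High-magnitude weights are protected even when
    their momentum is low, and low-magnitude/high-momentum weights are kept."""
    mag = np.abs(w) / (np.linalg.norm(w) + eps)
    mom = np.abs(momentum) / (np.linalg.norm(momentum) + eps)
    return c_mag * mag + c_mom * mom

w = np.array([2.0, 0.1, 0.5])
v = np.array([0.01, 0.9, 0.05])    # exponential moving average of gradients
score = mop_importance(w, v)       # the small, low-momentum weight ranks last
```

With `(c_mag, c_mom) = (0.5, 0.5)` the large stale weight and the small fast-moving weight both survive, while a weight that is small in both senses is pruned first.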
6.1 Pruning with Fixed Mask
We first evaluate the performance of the pruning algorithm itself independent of the impact of data nonstationarity. We start with a pretrained model using up to samples, apply the pruning phase on samples, and then fix the sparsity structure during subsequent incremental training phases (only allowing the unpruned weights to be updated). Empirically we found that samples are sufficient to learn the mask for all the methods, and increasing this does not improve the accuracy. Figure 5 shows the results based on the relative window look ahead eval CE. AUX outperforms all the baselines with a large margin. Interestingly, all momentum based techniques are worse than pure magnitude based pruning (MP). This shows that MP is a very strong baseline to begin with, which has been shown previously as well (Kusupati et al., 2020). Moreover it is quite challenging to augment the magnitude information with momentum information.
Figure 5 also shows that the AUX method reaches its steady-state loss much earlier than the other methods. For instance, AUX reaches steady state after samples, whereas MP requires at least samples. This also shows that AUX pruning does a much better job of removing unnecessary weights and causes much less disruption to the learning process. Table 1 gives the relative window look-ahead eval CE over the latest samples and the eval CE calculated on the samples right after training ends.
Method          Look Ahead Eval CE %   Eval CE %
MP              0.149                  0.191
TP              0.157                  0.184
MoP (0.5, 0.5)  0.156                  0.211
MoP (0.2, 0.8)  0.206                  0.246
MoP (0.8, 0.2)  0.147                  0.254
AUX             0.097                  0.144
6.2 Difficulty of Directly Adjusting a Sparse Network for Non-stationary Data
We empirically show that adapting the mask to a changing distribution is a nontrivial problem from which all pruning methods suffer. We present a more extensive study of several potential mask adaptation methods for adjusting the sparse model to data non-stationarity.
Mask Adaption Methods
(1) AUX (Fixed Mask): This is the baseline scheme, where we apply AUX pruning for samples and then continue to fine-tune with the fixed mask for the rest of the training. This is exactly how we propose to fine-tune the sparse model during its serving period (i.e., ).
(2) AUX (Adaptation Mask): This is a straightforward way to extend AUX pruning for mask adaptation. We continue to update the mask and the weights together throughout incremental training. Because we use the Linear STE as the fake gradient for the non-differentiable indicator function of the auxiliary mask , weights can continuously be pruned and unpruned, leading to a natural adaptation of the mask. There is no explicit fine-tuning phase here in which the auxiliary mask is fixed. Note that the final achieved sparsity depends solely on the regularization strength and not on the length of the pruning phase; hence, as long as is kept constant, the overall achieved sparsity remains stable.
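The prune-and-regrow behavior of the Linear STE can be sketched as an explicit forward/backward pair. This is a hand-written illustration rather than the paper's implementation, and the variable names are assumptions:

```python
import numpy as np

def ste_forward(w, s):
    # Forward pass: a hard indicator mask derived from the auxiliary
    # variable s zeroes the pruned weights.
    mask = (s > 0).astype(w.dtype)
    return w * mask

def ste_backward(grad_out, w, s):
    # Backward pass: the indicator's true derivative is zero almost
    # everywhere, so the linear STE pretends d(mask)/ds = 1. Gradients
    # therefore reach the auxiliary variables of *pruned* weights too,
    # letting them regrow, which is what enables mask adaptation.
    grad_w = grad_out * (s > 0)  # ordinary chain rule through the mask
    grad_s = grad_out * w        # straight-through path to the mask scores
    return grad_w, grad_s
```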
(3) MoP (Adaptation Mask): Similar to AUX (Adaptation Mask), we consider continuously updating the mask for MoP using the latest momentum and weight-magnitude values. We update the mask with this criterion every samples; using cross-validation, we found that the optimal value for is . This is similar to many recent pruning-during-training methods that employ momentum as the importance metric to rank weights (Ding et al., 2019; Dettmers and Zettlemoyer, 2019; Evci et al., 2020). This technique takes advantage of the fact that even though the magnitude of pruned weights is not updated, their momentum can still increase, which may allow them to be selected in the next selection cycle.
(4) AUX (Dense-to-Sparse): This method combines AUX with the Dense-to-Sparse paradigm. Periodically, the dense model is used to produce a fresh sparse model. We use AUX (D2S @x) to indicate that the second sparse network is generated after x examples. We consider as the offset at which the second sparse network is instantiated by pruning the full dense network.
Here we only consider modifying MoP, as MP and TP prune weights in an unrecoverable way and thus cannot easily be modified to allow the sparsity structure to change. For instance, in MP, the pruned weights are never updated, and hence they never have a chance to be selected again.
Mask Adaptation Results
The results are summarized in Figure 6. All the methods that directly update the sparse model fail to make the sparse network adapt to the new data, as the window CE loss starts to increase after samples. AUX (Adaptation Mask) is also unable to adapt to the distribution shift, which supports our claim that under-parameterized networks struggle to adapt irrespective of the pruning algorithm.
In comparison, the proposed AUX (D2S @ 0.5e10) gives a much lower CE loss because it relies on the dense network to adapt to the data distribution shift. The D2S paradigm is powerful because we can produce sparse networks by pruning the dense network at any chosen time. For instance, the shaded algorithm D2S illustrates how the two sparse networks generated at samples and samples can be combined to give the best overall accuracy. How often we can produce a sparse network depends on how fast a pruned network converges to steady state. This is where the AUX pruning algorithm is ideal, because it converges very quickly after pruning is applied.
One difficulty with data distribution shift is that it is very hard to predict when model accuracy will start to suffer from a stale mask. In Figure 5, we can see that the data distribution shift starts to hurt model accuracy after samples. This threshold can vary across architectures, data segments, and target sparsity ratios. A critical advantage of D2S is that we always keep a dense model for comparison; we can thus monitor the accuracy difference between the dense and sparse models to determine the value and dynamically produce sparse models when the accuracy starts to diverge.
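The dense-vs-sparse monitoring described above can be sketched as a simple trigger; the function name and tolerance value are hypothetical:

```python
def should_reprune(dense_window_ce, sparse_window_ce, tolerance=0.002):
    # Trigger a fresh D2S pruning pass when the sparse model's windowed CE
    # diverges from the always-available dense model's CE by more than
    # `tolerance` (a hypothetical threshold; in practice it would be tuned
    # per architecture, data segment, and target sparsity).
    return (sparse_window_ce - dense_window_ce) > tolerance
```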
6.3 Learned Sparsity for Each Layer
AUX pruning automatically learns the sparsity of each layer, which is critical in reducing the overall accuracy loss. Figure 7 plots the final achieved sparsity of each layer against the relative size of the layer. It suggests that the algorithm tends to assign larger sparsity to layers with smaller size. This is due to the heterogeneity of the architecture: the size of a layer does not necessarily dictate the redundancy of its parameters, contrary to the popular belief that larger layers are more redundant. Hence, applying the same sparsity to all layers or pruning only the largest layers are suboptimal choices.
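Because each layer ends up with its own learned sparsity, reporting it is just a matter of reading the final binary masks. A small sketch (layer names and the dictionary layout are illustrative):

```python
import numpy as np

def sparsity_report(masks):
    # Per-layer sparsity (fraction of pruned weights) plus the overall
    # sparsity, implicitly weighted by layer size via the global counts.
    report = {name: 1.0 - m.mean() for name, m in masks.items()}
    total = sum(m.size for m in masks.values())
    kept = sum(m.sum() for m in masks.values())
    report["overall"] = 1.0 - kept / total
    return report
```

Note that with heterogeneous layer sizes the overall sparsity can sit far from any individual layer's sparsity, which is why per-layer numbers (as in Figures 7 and 8) are worth inspecting.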
Figure 8 plots the layer-wise sparsity against the depth of the layer for three different regularization strengths (), which produce different overall target sparsity ratios. In contrast to domains like computer vision, there is no obvious correlation between sparsity and depth, which makes it all the more desirable for the algorithm to learn the sparsity instead of using handcrafted rules. We can also see how the value can be tuned to modulate the overall sparsity of the network. We further find that a given value produces consistent sparsity ratios across different date ranges and model sizes, which reduces the overhead of tuning .
6.4 Significance of Combining Multiple Criteria
Results in Figure 5 suggest that, except for the proposed AUX algorithm, MP gives the best performance. It is thus of interest to understand whether adding the Taylor approximation term significantly changes the decision on which weights are pruned compared with MP. Figure 9 plots the histograms of the pruned and unpruned weights in four different FC layers. It shows that the algorithm prunes weights with very small magnitude and keeps weights with very large magnitude. However, for the large portion of weights with 'moderate' magnitudes, the algorithm decides their importance by consulting the Taylor approximation term, which explains the large overlap between the weight histograms of the two groups. This shows that gradient information is very important in deciding weight importance for recommendation models, but simple momentum-based ranking heuristics are unable to harness that information; hence a formal algorithm like AUX pruning outperforms momentum-based algorithms.
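For reference, a first-order Taylor criterion scores a weight by the estimated loss change its removal would cause. A minimal sketch, assuming the standard |w · ∂L/∂w| form (the paper's exact Taylor term may differ in detail):

```python
import numpy as np

def taylor_importance(w, grad):
    # First-order Taylor estimate of the loss change when a weight is
    # zeroed: |delta_L| ~= |w * dL/dw|. This is the criterion popularized
    # by Molchanov et al., used here purely as an illustration of how
    # gradient information can separate 'moderate'-magnitude weights.
    return np.abs(w * grad)
```

Two weights with identical magnitude can receive very different scores under this criterion, which is exactly the regime where Figure 9 shows the pruned and unpruned histograms overlapping.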
7 Conclusion & Future Work
We have proposed a novel pruning algorithm, coupled with a novel Dense-to-Sparse (D2S) paradigm, designed for applying pruning to large-scale recommendation systems. We have discussed the implications of this algorithm for system design and shown its efficacy on large-scale recommendation systems. D2S is effective in lowering the accuracy loss but is inefficient during training: a dense model must be trained at all times, which increases the number of model replicas during training. In the future, it will be important to improve mask adaptation techniques to reduce this training overhead.
References
Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pp. 2270–2278.
Deep Speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
On global electricity usage of communication technology: trends to 2030. Challenges 6 (1), pp. 1–41.
Learned threshold pruning. arXiv preprint arXiv:2003.00075.
Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.
Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
Evaluating and characterizing incremental learning from non-stationary data. arXiv preprint arXiv:1806.06610.
Wide & deep learning for recommender systems. arXiv preprint arXiv:1606.07792.
PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
Sparse networks from scratch: faster training without losing performance. arXiv preprint arXiv:1907.04840.
Global sparse momentum SGD for pruning very deep neural networks. arXiv preprint arXiv:1909.12778.
Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
Rigging the lottery: making all tickets winners. arXiv preprint arXiv:1911.11134.
The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
Incremental learning. In Encyclopedia of Biometrics, S. Z. Li and A. Jain (Eds.), pp. 731–735.
MorphNet: fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
Do we really need model compression? http://mitchgordon.me/machine/learning/2020/01/13/do-we-really-need-model-compression.html
Compression of deep learning models for text: a survey. arXiv preprint arXiv:2008.05221.
The architectural implications of Facebook's DNN-based personalized recommendation. arXiv preprint arXiv:1906.03109.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Neural collaborative filtering. arXiv preprint arXiv:1708.05031.
Practical lessons from predicting clicks on ads at Facebook.
Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
Measuring the algorithmic efficiency of neural networks. arXiv preprint arXiv:2005.04305.
Neural networks for machine learning. Coursera, video lectures.
Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580.
Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066.
How to stop data centres from gobbling up the world's electricity. Nature 561, pp. 163–166.
Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
Soft threshold weight reparameterization for learnable sparsity.
Fast ConvNets using group-wise brain damage. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2554–2564.
Optimization based layer-wise magnitude-based pruning for DNN compression. In IJCAI.
Pruning filters for efficient ConvNets. arXiv preprint arXiv:1608.08710.
xDeepFM: combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
ThiNet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 (33), pp. E7665–E7671.
DeepSpeed. GitHub. https://github.com/microsoft/DeepSpeed
Importance estimation for neural network pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11264–11272.
Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
Deep learning recommendation model for personalization and recommendation systems. arXiv preprint arXiv:1906.00091.
Continual lifelong learning with neural networks: a review. Neural Networks 113, pp. 54–71.
Faster CNNs with direct sparse convolutions and guided pruning.
Deep learning inference in Facebook data centers: characterization, performance optimizations and hardware implications. arXiv preprint arXiv:1811.09886.
Lookahead: a far-sighted alternative of magnitude-based pruning. arXiv preprint arXiv:2002.04809.
Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668.
Factorization machines. In Proceedings of the 2010 IEEE International Conference on Data Mining (ICDM '10), pp. 995–1000.
Megatron-LM: training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.
Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
Single-shot structured pruning before training.
Picking winning tickets before training by preserving gradient flow.
Structured pruning of large language models. arXiv preprint arXiv:1910.04732.
SparseRT: accelerating unstructured sparsity on GPUs for deep learning inference.
Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. arXiv preprint arXiv:1802.00124.
Good subnetworks provably exist: pruning via greedy forward selection. In ICML.
Combined group and exclusive sparsity for deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 3958–3966.
Scalpel: customizing DNN pruning to the underlying hardware parallelism. ACM SIGARCH Computer Architecture News 45 (2), pp. 548–560.
A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199.
ShadowSync: performing synchronization in the background for highly scalable distributed training.
Deep interest evolution network for click-through rate prediction. arXiv preprint arXiv:1809.03672.
Less is more: towards compact CNNs. In European Conference on Computer Vision, pp. 662–677.
Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.