The contributions of this work are as follows:
Propose a new greedy forward selection strategy for pruning wide, two-layer networks, which allows for the loss of a winning ticket network to be expressed with respect to the loss of the pre-trained network.
First work to incorporate the amount of training time of a pre-trained network into the theoretical analysis of winning ticket performance derived from this network.
First work to generalize the analysis on winning ticket networks to pre-trained networks that are trained with stochastic gradient descent (SGD).
Extensions (i.e., contributions in progress):
Analyze the relationship between the performance of the winning ticket and the width of the two-layer network.
Demonstrate that state-of-the-art pruning rates can be achieved (i.e., or better, where is the number of nodes in the pruned network) without full pre-training of the global network.
Extension to more complex pruning strategies, such as [log_pruning].
Analyze the loss at initialization to see if it can be reduced in any way or become negligible.
The current trend in deep learning is towards larger models and datasets [xlmr, gpt3, 16by16]. Despite the widespread moral and practical questioning of this trend [costofnlp, transcarbonemit, sotaaimodels, energy_consider], the deep learning community continues to push the limits of experimental scale, finding that severely overparameterized models generalize surprisingly well [doubledescent]. In contrast, the proposal of the Lottery Ticket Hypothesis (LTH) [lth] has led to significant interest in using pruning techniques to discover small (sometimes sparse) models that perform well. LTH claims that, given an overparameterized pre-trained network, a smaller network (i.e., a “winning ticket”) can be discovered within the pre-trained model that, if trained in isolation from the same initial weights, performs comparably to the full model. LTH has been empirically explored in detail, which led to questioning of its applicability to larger models and datasets [rethink_pruning, state_of_sparsity]. However, LTH was subsequently shown to be applicable at scale, given proper hyperparameter tuning and the correct pruning methodology [to_prune_or_not, stable_lth, deconstructing_lth]. In general, such results highlight that good performance can be obtained at a lesser computational cost by pruning large, pre-trained models to form smaller networks, then fine-tuning the weights of these smaller networks (i.e., either from the previous initialization or some later point) [lth_bert, objdet_lth, finetune_rewind].
Despite the value provided by LTH, the methodology still requires an overparameterized, pre-trained model to be present [provable_overparam]. In practice, obtaining this model, especially if it must be trained from scratch, can be quite expensive. In an attempt to discover alternatives to pre-training, several works studied the possibility of pruning networks at initialization [pruning_at_init, gradflow_prune, whats_hidden]. The ability to discover winning tickets within randomly-weighted networks is commonly referred to as the “strong lottery ticket hypothesis” [whats_hidden]. Although randomly-initialized winning tickets cannot yet match the performance of those obtained from fully pre-trained models, several works have shown that high-performing tickets can be obtained from models with limited pre-training [early_bert, early_bird]. Furthermore, winning tickets generated on one dataset can even generalize well to other datasets, given a big enough dataset and sufficient fine-tuning [pretrn_lth, one_ticket_wins]. Therefore, there is hope that winning tickets can be discovered without incurring the full cost of pre-training, thus allowing for the creation of efficient networks without massive, up-front costs.
The extensive empirical analysis of LTH has inspired the development of theoretical foundations for winning tickets [pruning_is_all_you_need, log_pruning_is_all_you_need, subset_lth]. Several works have derived bounds for the performance and size of winning tickets discovered in randomly-initialized networks [pruning_is_all_you_need, log_pruning_is_all_you_need, subset_lth]. However, these works require that the original model be sufficiently overparameterized and typically produce sparse networks that do not provably outperform similarly-sized networks trained with stochastic gradient descent (SGD). In contrast, other theoretical works explore how the performance of winning tickets, derived from pre-trained networks, compares to the performance of networks trained from scratch with gradient descent [provable_subnetworks, log_pruning]. More specifically, greedy forward selection strategies are proposed for pruning pre-trained networks, allowing the loss of the pruned network to be theoretically analyzed as a function of its size. It should be noted that some findings from these works may still be applicable to randomly-initialized networks given the proper assumptions [provable_subnetworks].
A theoretical comparison of winning tickets to networks trained with gradient descent requires that convergence rates for the training dynamics of neural networks be developed. Such convergence rates were originally explored for wide, two-layer neural networks [mf_2layer] using mean-field analysis techniques. Similar theory was later expanded to deeper models, such as transformers and ResNets [resnet, transformer, mf_resnet, mf_transformer]. Generally, the analysis of neural network training dynamics has become a large topic of interest in recent years, leading to novel analysis techniques [ntk, finite_ntk], extensions to alternate optimization techniques [moderate_overparam, relu_alt_min], and even introductions of different architectural components [one_hid_relu, one_conv_layer, two_layer_relu]. However, many of these novel, theoretical developments have yet to be applied to the analysis of LTH.
This paper. Within this paper, a novel pruning methodology based on greedy forward selection is proposed. This pruning methodology is structured such that the loss of the pruned network can be explicitly expressed with respect to the loss of the network being pruned. As a result, the loss of the network being pruned can be unrolled with respect to the amount of training (i.e., with gradient descent or SGD) and used to analyze the amount of training needed to achieve good performance with the pruned network, demonstrating that the amount of needed training for the discovery of winning tickets scales logarithmically with the size of the dataset. Such analysis provides a theoretical foundation for the idea of early bird lottery tickets [early_bird, early_bert] within the deep learning community and explains why LTH is more difficult to replicate at scale [state_of_sparsity, rethink_pruning]. Furthermore, we then show that the proposed pruning methodology can be used to achieve an convergence rate, where is the number of nodes within the pruned network, without the need for any overparameterization assumptions.
Notation. Vectors are represented with bold type (e.g., ), while scalars are represented by normal type (e.g., ). represents the vector norm, while and represent Lipschitz and infinity norms. Unless otherwise specified, vector norms are always considered to be norms. is used to represent the set of positive integers from to (i.e., ). We denote the ball of radius centered at as .
In addition to basic notation, several relevant constructions are used within our theoretical analysis. For all problems, we consider two-layer neural networks of width , defined by (1):
represents all weights within the two-layer neural network. In (1),
represents the activation of a single neuron within the network (i.e., the -th neuron of total neurons). It should be noted that two-layer neural networks have the special property of output activations being separable between neurons. To produce the full network output, the activation is computed for every neuron in the network; then, a uniform average is taken over the activations. The activation of a single neuron is expressed in (2):
represents a smooth activation function (e.g., sigmoid or tanh). Our analysis is agnostic of the choice of the activation function, so long as associated smoothness assumptions are satisfied (i.e., see Section 5). The weights of a single neuron are represented by , where is the dimension of the input vector (i.e., ). We assume that our network is modeling a dataset , where . The dataset has the form , where and for all . In all cases, we consider an -norm regression loss over this dataset during training, defined by (3):
We define . In words, represents a vector of all labels within the dataset , where each label is divided by a constant factor . Similarly, we define as the output of neuron for the -th input vector in the dataset. Utilizing this definition, we then construct the vector , which is a scaled vector of output activations for a single neuron across the entire dataset . We use to denote the convex hull over such activation vectors for all neurons, as shown in (4):
In (4), we use to represent the convex hull. It should be noted that forms a marginal polytope of the feature map for all neurons in the 2-layer network and all examples within the dataset . We use to denote the vertices of the marginal polytope (i.e., ). Furthermore, using this construction of the marginal polytope , the loss on this space can easily be defined by for some . Similarly, we define the diameter of the space as . Unlike [provable_subnetworks], we never make the assumption that .
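The constructions above can be sketched numerically. The following is a minimal illustration under stated assumptions, not the exact formulation of (1)-(4): the parameterization of a neuron as a scalar output weight `b` times an activation of the inner product with input weights `a`, the scaling constant `sqrt(m)` for labels and activation vectors, and the factor `1/2` in the loss are all assumptions made here for concreteness, and every function name is hypothetical.

```python
import math


def neuron_output(theta, x, activation=math.tanh):
    """Output of one hidden neuron on input x.

    ASSUMPTION: theta = (b, a), with scalar output weight b and input
    weights a, so the neuron computes b * activation(<a, x>).
    """
    b, a = theta
    return b * activation(sum(a_j * x_j for a_j, x_j in zip(a, x)))


def network_output(thetas, x):
    """Two-layer network output: uniform average over all neuron outputs."""
    return sum(neuron_output(theta, x) for theta in thetas) / len(thetas)


def activation_vector(theta, inputs):
    """Scaled outputs of a single neuron across the whole dataset.

    ASSUMPTION: the constant scaling factor is sqrt(m), m = dataset size.
    """
    scale = math.sqrt(len(inputs))
    return [neuron_output(theta, x) / scale for x in inputs]


def quadratic_loss(thetas, inputs, labels):
    """ASSUMPTION: the regression loss is 1/(2m) * sum of squared residuals."""
    m = len(inputs)
    residuals = (network_output(thetas, x) - y for x, y in zip(inputs, labels))
    return sum(r * r for r in residuals) / (2 * m)
```

Under these assumptions, the loss of the network equals one half of the squared distance between the scaled label vector and the uniform average of the per-neuron activation vectors, which is why the analysis can be carried out on the convex hull of those vectors.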
In all cases, we assume the existence of a global network of width from which the pruned model is constructed (i.e., notice no assumption is made on the amount of training for the global model). Given a subset of neurons within the global model , we define the forward pass with this subset of neurons as shown in (5):
Beginning from an empty network (i.e., ), our methodology aims to discover a subset of neurons such that
. We find an approximate solution to this combinatorial optimization problem using greedy forward selection. In particular, a pruned network is constructed by greedily selecting the neuron that yields the largest decrease in loss in an iterative fashion. Such a strategy is formalized by (6), where represents the number of forward selection steps:
is permitted to contain duplicate elements. From an analytical perspective, we formalize this forward selection strategy in (7) using the constructions introduced in Section 3:
Given the forward selection strategy in (7), we obtain iterates within each iteration of forward selection, where for all . It should be noted that the analytical formulation of the forward selection strategy provided in (7), which is used within all theoretical analysis, perfectly matches the practical pruning algorithm outlined in (6). Such alignment between the theoretical analysis and practical algorithm is lacking in previous work [provable_subnetworks]; further discussion is provided in Appendix D. To better understand the motivation for this pruning methodology, notice that (7) is structured such that the output of both pruned and global networks is a uniform average over a subset of neuron activations. Such structure is suited to analyzing the loss of the pruned network with respect to the performance of the network being pruned. The exact implementation of our forward selection strategy used in experiments is further elaborated in Section XX.
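The greedy forward selection procedure described in this section can be sketched as follows. This is an illustrative sketch under assumptions rather than the exact rule in (6) or (7): it assumes the pruned network's output is the uniform average of the selected neurons' activation vectors, that the loss is one half of the squared distance to the (scaled) label vector, and that ties are broken by the lowest neuron index. The function and argument names are hypothetical.

```python
def greedy_forward_selection(activations, targets, num_steps):
    """Greedily select neurons (with replacement) to minimize the pruned loss.

    activations[i][j]: (scaled) output of neuron i on example j.
    targets[j]:        (scaled) label of example j.
    Returns the list of selected neuron indices; duplicates are allowed.
    """
    m = len(targets)
    running_sum = [0.0] * m  # sum of activation vectors selected so far
    selected = []
    for step in range(1, num_steps + 1):
        best_index, best_loss = None, float("inf")
        for i, phi in enumerate(activations):
            # Loss if neuron i were added: the pruned output is the uniform
            # average over the `step` selected neurons (ASSUMPTION).
            loss = 0.5 * sum(
                ((running_sum[j] + phi[j]) / step - targets[j]) ** 2
                for j in range(m)
            )
            if loss < best_loss:  # strict '<' keeps the lowest index on ties
                best_index, best_loss = i, loss
        selected.append(best_index)
        running_sum = [s + p for s, p in zip(running_sum, activations[best_index])]
    return selected
```

Each step scans every neuron of the global network, so the sketch costs on the order of the number of steps times the network width times the dataset size. Because selection is with replacement, the same neuron may be chosen repeatedly, which corresponds to giving it a larger weight in the uniform average.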
5 Theoretical Results
Within this section, we overview the main theoretical contributions of this work. All proofs are deferred to the Appendix; see Appendix A. We begin by stating the main assumption considered within our analysis.
For some constant , we assume for all that and . Furthermore, we assume and for defined in (2).
Assumption 1 amounts to a boundedness and smoothness assumption on the data and activation function. Although this assumption may not hold for certain activation functions (e.g., ReLU), the performance of smooth activation functions is comparable to activations such as ReLU in practice, which makes this a mild assumption. Under this assumption, we are able to derive an expression for the loss of the pruned network with respect to the loss of the global network.
Theorem 1 expresses the loss of the pruned network as a function of the global network’s loss. By examining the convergence rate provided via Theorem 1, one can observe that the decrease in the loss of the pruned network is only achieved if the value of does not dominate the expression. This observation raises the question: how can we ensure is small enough to not dominate the expression? To analyze the impact of on the loss of the pruned network, we draw upon previous work that provides convergence rates for 2-layer neural networks, trained with both gradient descent (GD) and stochastic gradient descent (SGD), under mild overparameterization assumptions [moderate_overparam]. Using the convergence rates provided in this work, we can unroll the value of within Theorem 1, and determine how much the global network must be trained to achieve a good loss in the pruned network.
Assume that Assumption 1 holds and a two-layer neural network of width was trained for iterations with SGD over a dataset of size . After iterations, this two-layer network has parameters that will be used to create a pruned network. Furthermore, assume that and , where represents the input dimension of data in . When the network is pruned via (7) for iterations, it will achieve a loss given by , if the amount of training for the global network satisfies the following condition:
Otherwise, the loss of the pruned network will not improve during successive iterations of (7).
In the case that , (9) implies that (i.e., the amount of required training scales logarithmically with the size of the dataset). This reveals that when the dataset is large the global network must be trained more for the pruned network to achieve a good loss. Such a result aligns with empirical observations for the behavior of winning tickets [stable_lth]. Similarly, in the case that , the denominator of (9) approaches . As a result, , revealing that pruned networks can achieve high performance with minimal prior training on smaller datasets [lth]. Such results provide a theoretical explanation for the fact that winning tickets can be discovered with minimal pre-training [early_bird, early_bert], implying that conducting full pre-training of the global network may be a waste of resources for certain datasets. To provide further theoretical validation of this idea, we prove that similar results hold for networks trained with plain gradient descent; see Appendix A.2.2.
The overparameterization requirements in Theorem 5, which are adopted from [moderate_overparam], imply that a larger pre-trained model must be used to achieve good pruning performance on larger datasets, as has been shown in practical experiments [whats_hidden, provable_subnetworks]. Our overparameterization requirements make no assumption about the performance of the global network (e.g., we do not assume ), and assume a mild amount of overparameterization in comparison to previous work [li2018learning, allen2018learning, du2019gradient]. As a result, our results hold in more general cases and account for variable performance of the global network.
In addition to enabling the above analysis, the pruning algorithm in (7) achieves the same convergence rates as previous work, under milder assumptions [provable_subnetworks]. Namely, the following result can be shown to be true for a two-layer network pruned via (7).
Although Theorem 3 relies on the assumption that for the faster rate to be achieved, this assumption is mild in practice given a small amount of prior training. To support this, we empirically examine the assumption for two-layer neural networks trained on image classification datasets and find that it is often satisfied in practice; see Appendix B.
In this section, we empirically analyze our theoretical results from Section 5. We show that the amount of pre-training required to discover a good winning ticket is dependent upon the size of the underlying dataset. Although previous work has demonstrated that winning tickets can be discovered with minimal pre-training [early_bert, early_bird], our expression in Theorem 2 provides a theoretical foundation for this empirical observation. Furthermore, such theory also provides an explanation of why “rewinding” is necessary to achieve good winning ticket results for large-scale datasets [stable_lth] (i.e., larger datasets require more pre-training to achieve reasonable pruning loss). We leverage our experiments to gain an in-depth understanding of the scaling properties of LTH with an overall aim of better understanding how winning tickets can be discovered with minimal training costs for different datasets.
The empirical performance of the pruning rule in (6) has already been experimentally validated by previous work [provable_subnetworks]. To avoid needlessly replicating existing experimental analysis, we specifically focus on studying the scaling properties of this pruning algorithm with respect to different sizes and types of datasets. We perform both small-scale analysis with MLPs, as described in Section 3, and large-scale experiments with modern CNN architectures on ImageNet. We aim to match the performance demonstrated in [provable_subnetworks] with significantly reduced pre-training cost, demonstrating that lottery tickets can be discovered with provably better efficiency in numerous different domains. Such efficient strategies for discovering winning tickets are especially useful when fully pre-trained models for a desired target domain are not available through open-sourced packages online [tensorflow, pytorch] (i.e., this is often the case for industrial applications and other niche domains) and one must obtain a pre-trained model from scratch for pruning purposes.
6.1 Small-Scale Experiments
We conduct experiments on the MNIST dataset with a two-layer MLP model. We binarize MNIST labels to match the single output neuron setup described in Section 3 by considering all classes less than five as class zero and vice versa. Our model architecture exactly matches the description in Section 3, aside from a few minor differences. Namely, we adopt a ReLU activation function (i.e., instead of a smooth activation) within the hidden layer and apply a sigmoid output transformation so that the model can be trained with a binary cross entropy loss (i.e., instead of quadratic loss). These changes are adopted solely for the purpose of improving training stability so that experimental results are more consistent across trials. We train the MLP with a stochastic gradient descent optimizer, momentum of 0.9, no weight decay, and a batch size of 128, which we found to perform well in multiple different experimental settings. Further hyperparameter choices (e.g., learning rate, size of pruned model, and number of training iterations) are explained in Appendix C.1.
To study how dataset size impacts the performance of a winning ticket, we construct “sub-datasets” of various sizes from the original MNIST dataset by randomly sampling an equal number of examples from each of the 10 original classes. More specifically, sub-datasets of sizes between 1K and 50K are constructed in increments of 5K (i.e., this yields datasets of sizes 1K, 5K, 10K, 15K, …, 50K). Experiments are conducted for MLPs with several different hidden dimensions (i.e., ). We pre-train the MLP for 8000 total iterations and use (6) to construct a new pruned network with 200 hidden nodes every 1000 iterations. Then, the performance of the pruned model over the entire training dataset is recorded, allowing the performance of the pruned model to be examined with respect to both dataset size and pre-training iterations. To exactly match the theoretical analysis in Section 5, no fine-tuning is performed on the pruned model prior to measuring its performance. We report the results of these experiments in Figure 1, where the accuracy is measured across three separate trials with different random seeds.
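The label binarization and class-balanced sub-dataset construction described above can be sketched as follows. This is an illustrative sketch, not the code used for the experiments: the helper names, the seeding scheme, and the final shuffle are assumptions, and the labels are assumed to be available as a plain list of integer digits.

```python
import random

# Sub-dataset sizes described in the text: 1K, 5K, 10K, ..., 50K.
SUBSET_SIZES = [1000] + list(range(5000, 50001, 5000))


def binarize_label(digit):
    """Digits 0-4 map to class 0; digits 5-9 map to class 1."""
    return 0 if digit < 5 else 1


def balanced_subset_indices(labels, subset_size, num_classes=10, seed=0):
    """Sample indices so that every original class is equally represented.

    labels: original (non-binarized) integer class labels.
    The seed argument is a hypothetical convenience for reproducibility.
    """
    if subset_size % num_classes != 0:
        raise ValueError("subset_size must be divisible by num_classes")
    per_class = subset_size // num_classes
    rng = random.Random(seed)

    # Group example indices by their original class.
    by_class = {c: [] for c in range(num_classes)}
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for c in range(num_classes):
        # random.sample draws without replacement and raises ValueError
        # if a class has fewer than per_class examples.
        chosen.extend(rng.sample(by_class[c], per_class))
    rng.shuffle(chosen)  # avoid class-ordered batches (ASSUMPTION)
    return chosen
```

Sampling is balanced over the 10 original digit classes, as stated in the text, and binarization would then be applied to the sampled labels.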
As can be seen in Figure 1, the performance of the pruned network agrees with the theoretical analysis provided in Section 5. Namely, as the size of the dataset increases, the amount of training needed to achieve comparable accuracy in the pruned model increases. Furthermore, the number of pre-training iterations needed to achieve good pruning accuracy plateaus as the dataset becomes larger, suggesting a logarithmic dependence on the size of the dataset. Interestingly, pruning results are remarkably uniform across different hidden dimensions of the model being pruned. Such an observation lends further support to the theoretical analysis in Section 5, which predicts that the pruned model performance is not dependent upon the size of the pre-trained network so long as the original network is sufficiently overparameterized.
6.2 Large-Scale Experiments
Appendix A Proofs
A.1 Convergence Analysis
Prior to presenting the proofs of the main theoretical results, we introduce several relevant technical lemmas.
Because the objective , defined over the space , is both quadratic and convex, the following expressions hold for any :
From Lemma 2, we can derive the following inequality.
Assume there exists a sequence of values such that and
where C and are positive constants. Then, it must be the case that for .
We use an inductive argument to prove the above claim. For the base case of , the claim is trivially true because . For the inductive case, we define . It should be noted that is a 1-dimensional convex function. Therefore, given some closed interval within the domain of , the maximum value of over this interval must be achieved on one of the end points. This fact simplifies to the following expression, where and are two values within the domain of such that .
From here, we begin the inductive step, for which we consider two cases.
Case 1: Assume that . Then, the following expression for can be derived.
Case 2: Assume that . Then, the following expression can be derived.
In both cases, it is shown that . Therefore, the lemma is shown to be true by induction. ∎
A.1.1 Proof of Theorem 1
We now present the proof of Theorem 1.
We define . Additionally, we define as follows.
The second equality holds because a linear objective is being optimized on a convex polytope . Thus, the solution to this optimization is known to be achieved on some vertex . Recall, as stated in Section 3, that denotes the diameter of the marginal polytope .
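The vertex claim follows from a one-line argument, written out here for completeness. The symbols below are illustrative notation introduced only for this remark: $g$ is the coefficient vector of the linear objective, $v_1, \dots, v_n$ are the vertices of the polytope, and $s = \sum_{i} \alpha_i v_i$ with $\alpha_i \geq 0$ and $\sum_{i} \alpha_i = 1$ is an arbitrary point of the polytope.

```latex
\langle g, s \rangle
  = \sum_{i=1}^{n} \alpha_i \, \langle g, v_i \rangle
  \;\geq\; \Big( \min_{i \in [n]} \langle g, v_i \rangle \Big) \sum_{i=1}^{n} \alpha_i
  = \min_{i \in [n]} \langle g, v_i \rangle .
```

Since every vertex is itself a point of the polytope, the minimum of the linear objective over the polytope equals its minimum over the vertices.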
We assume the existence of some global two-layer neural network with hidden neurons from which the pruned network is derived. It should be noted that the neurons of this global network are used to define the vertices as described in Section 3. As a result, the loss of this global network, which we denote as , is the loss achieved by a uniform average over the vertices of (i.e., see (1)). In other words, the loss of the global network at the time of pruning is given by the expression below.
It is trivially known that . Intuitively, the value of has an implicit dependence on the amount of training undergone by the global network. However, we make no assumptions regarding the global network’s training (i.e., can be arbitrarily large for the purposes of this analysis). Using Observation 1, as well as Lemmas 1 and 2, we derive the following expression for the loss of iterates obtained with Eq. (7).
where is due to Eq. (7), is because , is from Lemma 2, is from Lemma 1 since it holds , is from the definition of , and are due to Observation 1. This expression can then be rearranged to yield the following recursive expression:
By unrolling the recursion in this expression over iterations, we get the following:
By rearranging terms, we arrive at the desired expression
A.1.2 Proof of Theorem 3
We define . Furthermore, we define as follows.
Notice that minimizes a linear objective (i.e., the dot product with ) over the domain of the marginal polytope . As a result, the optimum is achieved on a vertex of the marginal polytope, implying that for all . We assume that . Under this assumption, it is known that , which allows the following to be derived.
From (7), the following expressions for and can be derived.
Combining all of this together, the following expression can be derived for , where is the diameter of :
Then, by invoking Lemma 14, we derive the following inequality.
With this in mind, the following expression can then be derived for the loss achieved by (7) after iterations.
This yields the desired expression, thus completing the proof. ∎
A.2 Training Analysis
Before analyzing the amount of training needed for a good pruning loss, we introduce several supplemental theorems and lemmas. From [moderate_overparam], we utilize theorems regarding the convergence rates of two-layer neural networks trained with GD and SGD. We begin with the theorem for the convergence of GD in Theorem 4, then provide the associated convergence rate for SGD within Theorem 5. Both Theorems 4 and 5 are simply restated from [moderate_overparam] for convenience.
Theorem 4 (restated from [moderate_overparam]).
Assume there exists a two-layer neural network and associated dataset as described in Section 3.
Denote as the number of hidden neurons in the two-layer neural network, as the number of unique examples in the dataset, and as the input dimension of examples in the dataset.
Assume without loss of generality that the input data within the dataset is normalized so that for all .
A moderate amount of overparameterization within the two-layer network is assumed, given by .
Furthermore, it is assumed that and that the first and second derivatives of the network’s activation function are bounded (i.e., and for some ).
Given these assumptions, the following bound is achieved with a high probability by training the neural network with gradient descent.
In (15), , represents the network weights at iteration of gradient descent, represents the network output over the full dataset , and represents a vector of all dataset labels.
is assumed to be randomly sampled from a normal distribution (i.e.,).
Here, all assumptions of Theorem 4 are adopted, but we assume the two-layer neural network is trained with SGD instead of GD. For SGD, parameter updates are performed over a sequence of randomly-sampled examples within the training dataset (i.e., the true gradient is not computed for each update). Given the assumptions, there exists some event with probability , where , , and . Given the event , with high probability the following bound is achieved for training a two-layer neural network with SGD.
In (16), , represent the network weights at iteration of SGD, is the indicator function for event , represents the output of the two layer neural network over the entire dataset , and represents a vector of all labels in the dataset.
It should be noted that the overparameterization assumptions within Theorems 4 and 5 are very mild, which leads us to adopt this analysis within our work. Namely, we only require that the number of examples in the dataset exceeds the input dimension and the number of parameters within the first neural network layer exceeds the squared size of the dataset. In comparison, previous work lower bounds the number of hidden neurons in the two-layer neural network (i.e., more restrictive than the number of parameters in the first layer) with higher-order polynomials of to achieve similar convergence guarantees [li2018learning, allen2018learning, du2019gradient].
In comparing the convergence rates of Theorems 4 and 5, one can notice that these linear convergence rates are very similar. The extra factor of within the denominator of Theorem 5 is intuitively due to the fact that updates are performed in a single pass through the dataset for SGD, while GD uses the full dataset at every parameter update. Such alignment between the convergence guarantees for GD and SGD allows our analysis to be similar for both algorithms.
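To make the consequence of a linear convergence rate concrete, the following generic sketch computes how many iterations a geometric bound needs to reach a target loss. It does not use the constants of Theorems 4 or 5, which are not reproduced here; the per-iteration contraction factor `contraction` and the function name are assumptions introduced purely for illustration.

```python
import math


def iterations_for_target(initial_loss, target_loss, contraction):
    """Smallest t such that initial_loss * (1 - contraction)**t <= target_loss.

    ASSUMPTION: the loss obeys a geometric bound with a fixed per-iteration
    contraction factor in (0, 1); this is a generic linear rate, not the
    specific bound of Theorem 4 or Theorem 5.
    """
    if not 0.0 < contraction < 1.0:
        raise ValueError("contraction must lie strictly between 0 and 1")
    if initial_loss <= 0.0 or target_loss <= 0.0:
        raise ValueError("losses must be positive")
    if target_loss >= initial_loss:
        return 0
    # Solve (1 - c)^t <= target / initial for t; log1p(-c) = log(1 - c) < 0.
    return math.ceil(math.log(initial_loss / target_loss) / -math.log1p(-contraction))
```

Under such a bound, the required number of iterations grows only with the logarithm of the ratio between the initial and target loss, which is the mechanism behind logarithmic scaling results of the kind discussed in Section 5.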
Assume is the -th iterate of (7). Then, the following is true.
where is the loss achieved by a two-layer neural network of width .
Recall that represents the diameter of the marginal polytope . Beginning with (10), the following can be shown.
We commonly refer to the value of , representing the quadratic loss of the two-layer network over the full dataset. The value of can be expressed as follows.
The expression above is derived by simply applying the definitions associated with that are provided in Section 3.
A.2.1 Proof of Theorem 2
We now provide the proof for Theorem 2.
From Theorem 5, we have a bound for , where represents the loss over the entire dataset after iterations of SGD (i.e., without the factor of ). Two sources of stochasticity exist within the expectation expression : randomness over the event and randomness over the -th iteration of SGD given the first iterations. The probability of event is independent of the randomness over SGD iterations, which allows the following expression to be derived.
where holds from the independence of expectations and is derived from the probability expression for event in Theorem 5. Combining the above expression with (16) from Theorem 5 yields the following, where two possible cases exist.
From Observation 2, we can derive the following, where the expectation is with respect to randomness over SGD iterations (i.e., we assume the global two-layer network of width is trained with SGD).
In (20), it can be seen that all terms on the right-hand-side of the equation will decay to zero as increases aside from the rightmost term. The rightmost term of (20) will remain relatively fixed as increases due to its factor of in the numerator. Within (20), there are two parameters that can be modified by the practitioner: and (i.e., notice that depends on ). All other factors within the expressions are constants based on the dataset that cannot be modified. only appears in the factor of , thus revealing that the value of cannot be naively increased within (20) to remove the factor of .
To determine how can be modified to achieve a better pruning rate, we notice that a setting of would cancel the factor of in (20). With this in mind, we observe the following.