Greedy Pruning with Group Lasso Provably Generalizes for Matrix Sensing and Neural Networks with Quadratic Activations

by Nived Rajaraman, et al.

Pruning schemes are widely used in practice to reduce the complexity of trained models with a massive number of parameters. Several empirical studies have shown that pruning an overparameterized model and then fine-tuning it generalizes well to new samples. Although this pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, little is known about the theory behind its success. In this paper we address this gap by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem, with ground truth U_⋆ ∈ ℝ^{d × r} and overparameterized model U ∈ ℝ^{d × k} with k ≫ r. We study the approximate local minima of the empirical mean squared error augmented with a smooth version of the group Lasso regularizer, ∑_{i=1}^k ‖U e_i‖_2, and show that pruning away the columns with low ℓ_2 norm yields a solution U_prune which has the minimum number of columns, r, yet remains close to the ground truth in training loss. Initializing the subsequent fine-tuning phase from U_prune, the resulting solution converges linearly to a generalization error of O(√(rd/n)), ignoring lower-order terms, which is statistically optimal. While our analysis provides insights into the role of regularization in pruning, we also show that running gradient descent without regularization yields models unsuitable for greedy pruning: many columns can have ℓ_2 norm comparable to the maximum. Lastly, we extend our results to the training and pruning of two-layer neural networks with quadratic activation functions. Our results provide the first rigorous insights into why greedy pruning + fine-tuning produces smaller models that also generalize well.
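The pruning criterion the abstract describes, dropping the columns of U whose ℓ_2 norm is small, can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's construction: the threshold tau, the matrix sizes, and the way the "signal" and "near-zero" columns are built here are assumptions made for the example.

```python
import numpy as np

def group_lasso_penalty(U):
    # Group Lasso regularizer from the abstract: sum of column-wise l2 norms,
    # i.e. sum over i of ||U e_i||_2.
    return np.linalg.norm(U, axis=0).sum()

def greedy_prune(U, tau):
    # Keep only the columns whose l2 norm exceeds the threshold tau.
    norms = np.linalg.norm(U, axis=0)
    return U[:, norms > tau]

# Toy setup (illustrative): d x k overparameterized factor with k >> r,
# where only r columns carry signal and the rest are near zero.
rng = np.random.default_rng(0)
d, r, k = 20, 2, 10
U = np.zeros((d, k))
U[:, :r] = rng.standard_normal((d, r))              # "signal" columns
U[:, r:] = 1e-3 * rng.standard_normal((d, k - r))   # near-zero columns

U_prune = greedy_prune(U, tau=0.1)
print(U_prune.shape)  # (20, 2): exactly the r signal columns survive
```

In the paper's setting the near-zero columns arise from the regularized training dynamics rather than being planted by hand; the point of the sketch is only that column-norm thresholding recovers a rank-r factor.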

