Modern deep learning often trains millions or even billions of parameters (Devlin et al., 2018; Shoeybi et al., 2019; Raffel et al., 2019; Brown et al., 2020) to deliver good performance for a model. Recently, Frankle and Carbin (2018); Frankle et al. (2020) demonstrated that these over-parameterized networks contain sparse subnetworks that, when trained in isolation, can achieve similar or better performance than the original model.
Furthermore, recent studies revisit the initialization stage of finding these subnetworks in vision models (Zhou et al., 2019; Ramanujan et al., 2020). Such a mask, which prunes the full network down to one of these subnetworks, is referred to as a "Supermask." That is to say, subnetworks of a randomly weighted neural network (NN) can achieve competitive performance, which may act as a good "prior" (Gaier and Ha, 2019) and connects to the long history of leveraging random features (Gamba et al., 1961; Baum, 1988) and/or random kernel methods (Rahimi and Recht, 2008, 2009) in machine learning. Here, we examine the following question: how does a fully randomized natural language processing (NLP) model perform in the multi-layer setting, and particularly in the (so far under-explored) one-layer setting?
In this work, we first validate that there exist subnetworks of standard randomly weighted Transformers (Reservoir Transformers in (Shen et al., 2021)) that can perform competitively with fully-weighted alternatives on machine translation and natural language understanding tasks. With 50% of the randomized weights remaining, we find a subnetwork that reaches 29.45/17.29 BLEU on IWSLT14/WMT14, respectively. We also investigate the special case of finding subnetworks in one-layer randomly weighted Transformers (see Fig. 1). To obtain these subnetworks, we repeatedly apply the same randomized Transformer layer several times with different Supermasks. The resulting subnetwork of a one-layer randomly weighted Transformer has similar performance to its multi-layer counterparts with a 30% lower memory footprint. We also study the impact of different depths/widths of Transformers along with the effectiveness of two initialization methods. Finally, using pre-trained embedding layers, we find that the subnetworks hidden in a one-layer randomly weighted Transformer are smaller than, but can match 98%/92% of the performance of, a trained Transformer on IWSLT14/WMT14. We hope our findings can offer new insights for understanding Transformers.
2 Related Work
Lottery Ticket Hypothesis. Frankle and Carbin (2018) found that NNs for computer vision contain subnetworks that can be effectively trained from scratch when reset to their initialization. Subsequent works (Zhou et al., 2019; Ramanujan et al., 2020; Wortsman et al., 2020) demonstrated that so-called winning tickets can achieve competitive performance without training, where the mask used to find the subnetwork at initialization is called a "Supermask." In NLP, previous works find that matching subnetworks exist early in training with Transformers (Yu et al., 2019), LSTMs (Renda et al., 2020), and fully-weighted pre-trained BERT (Chen et al., 2020; Prasanna et al., 2020) or vision-and-language models (Gan et al., 2021), but not at initialization.
Random Features. In the early days of neural networks, fixed random layers (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) were studied in reservoir computing (Maass et al., 2002; Jaeger, 2003; Lukoševičius and Jaeger, 2009), "random kitchen sink" kernel machines (Rahimi and Recht, 2008, 2009), and so on. Recently, random features have also been extensively explored for modern neural networks in deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017; Shen et al., 2021), random kernel features (Peng et al., 2021; Choromanski et al., 2020), and applications in text classification (Conneau et al., 2017; Wieting and Kiela, 2019), summarization (Pilault et al., 2020), and probing (Voita and Titov, 2020).
Compressing Transformers. A wide range of neural network compression techniques have been applied to Transformers. This includes pruning (Fan et al., 2019; Michel et al., 2019; Sanh et al., 2020; Yao et al., 2021), where parts of the model weights are dropped; parameter-sharing (Lan et al., 2020; Dehghani et al., 2018; Bai et al., 2019), where the same parameters are used in different parts of a model; quantization (Shen et al., 2020; Li et al., 2020), where the weights of the Transformer model are represented with fewer bits; and distillation (Sun et al., 2020; Jiao et al., 2020), where a compact student model is trained to mimic a larger teacher model. To find the proposed subnetwork at initialization, we develop our method in the spirit of parameter sharing and pruning.
Finding a Supermask for a Randomly Weighted Transformer. In a general pruning framework, denote the weight matrix as W (W could be a non-square matrix), the input as x, and the network as f(x; W). A subnetwork is defined as f(x; W ⊙ M), where M is a binary matrix and ⊙ is the element-wise product. To find the subnetwork for a randomly weighted network, M is trained while W is kept at a random initialization. Following Ramanujan et al. (2020), denote S as the associated importance score matrix of W, which is learnable during training. We keep the top-k percent of weights by the importance score S to compute M, i.e., M = top_k(S). Note that top_k(·) is a non-differentiable function. To enable the training of S, we use the straight-through gradient estimator (Bengio et al., 2013), in which top_k(·) is treated as the identity in backpropagation. During inference, we can simply construct and store the binary Supermask M and the floating-point W while dropping S for future usage.
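As a concrete sketch, the Supermask construction above can be written in a few lines of NumPy (shapes and the 50% ratio are illustrative; in practice the score update would be handled by an autograd framework):

```python
import numpy as np

def topk_mask(scores: np.ndarray, k_frac: float) -> np.ndarray:
    """Binary mask M keeping the top k_frac fraction of entries of S by value."""
    k = max(1, int(round(k_frac * scores.size)))
    thresh = np.sort(scores, axis=None)[-k]  # k-th largest score
    return (scores >= thresh).astype(scores.dtype)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))  # frozen random weights (never trained)
S = rng.standard_normal((64, 64))  # learnable importance scores
M = topk_mask(S, 0.5)              # Supermask: 50% of weights remain
W_sub = W * M                      # subnetwork weights used in the forward pass
# Straight-through estimator: in backpropagation, top_k is treated as the
# identity, so S receives the gradient of the loss w.r.t. W_sub scaled by W.
```

At inference time only the binary M and the floating-point W need to be stored; S is discarded.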
One-layer randomly weighted Transformer. We use the Transformer architecture (see Vaswani et al. (2017) for more details). For a general randomly weighted L-layer Transformer model with Supermasks, there exist W_i's and M_i's for all layers i ∈ {1, …, L}. Due to the natural property of layer stacking in Transformers, all W_i's have the same shape and the same initialization method. This leads to an unexplored question: "What's hidden in a one-layer (instead of L-layer) randomly weighted Transformer?"
Let us use a toy example to explain why there is no need for L redundant W_i's. Assume that, for a random weight matrix W_i, the probability that it has a "good" subnetwork is p. (Here, "good" can be any pre-defined metric, e.g., the error of the subnetwork stays below a pre-defined threshold ε for all inputs.) Furthermore, assume that for two different layers, the events that each has a "good" subnetwork are independent. Then for L different layers, the probability that all W_i's have "good" subnetworks is p^L. Meanwhile, since a single shared matrix W has the same initialization method as the W_i's, the probability that W has a "good" subnetwork for the i-th layer is also p. Thus, for L different layers, the probability that using W alone generates all L "good" subnetworks is also p^L.
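Under the independence assumption of the toy example, the argument can be written compactly as:

```latex
P\big(W_i \text{ has a good subnetwork}\big) = p, \qquad
P\big(W_1, \dots, W_L \text{ all have good subnetworks}\big) = p^{L},
\qquad
P\big(\text{shared } W \text{ yields all } L \text{ good subnetworks}\big)
  = \prod_{i=1}^{L} p = p^{L}.
```

That is, sharing one random matrix across all L positions does not reduce the chance of finding good subnetworks relative to L independent random matrices.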
In this paper, we investigate the scenario where one randomized layer is applied L times repeatedly with L different Supermasks. As a result, this reduces the memory footprint, since all Supermasks can be stored in binary format.
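A minimal sketch of this shared-layer forward pass (the `layer_fn` below is a hypothetical stand-in for a full Transformer layer; only the binary masks differ across the L applications):

```python
import numpy as np

def layer_fn(x, W):
    # Stand-in for one Transformer layer: a masked linear map plus ReLU.
    return np.maximum(x @ W, 0.0)

def one_layer_forward(x, W, masks):
    """Apply the same random weight matrix W repeatedly, once per Supermask."""
    for M in masks:             # L different binary Supermasks
        x = layer_fn(x, W * M)  # same frozen W, different subnetwork each time
    return x

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16)) / 4.0  # one shared random layer
masks = [(rng.random((16, 16)) < 0.5).astype(W.dtype) for _ in range(6)]
y = one_layer_forward(rng.standard_normal((2, 16)), W, masks)
```

Only one floating-point weight matrix is stored, together with L one-bit masks.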
Model Architecture. For model architectures, we experiment with Transformer_small and Transformer_big, following the same setting as in Ott et al. (2018): 6 encoder layers and 6 decoder layers on IWSLT14 and WMT14. We also vary the depth and width of the Transformer model on machine translation tasks. On IWSLT14, we use 3 different random seeds and plot the mean accuracy ± one standard deviation. All the embedding layers (including the final output projection layer) are also randomized and pruned unless otherwise specified. Moreover, in all figures, "fully-weighted model" denotes the standard full model (all weights remaining).
Machine Translation results. In Fig. 2, we present results for directly pruning a randomly weighted Transformer on IWSLT14 and WMT14 tasks. Specifically, we vary the ratio of remaining parameters in the randomized model.
As can be seen, there is no significant performance difference between a one-layer random Transformer and a 6-layer standard random Transformer across different percentages of remaining weights on IWSLT14 and WMT14. We also observe that having the percentage of remaining randomized weights approach 0 or 100 leads to the worst performance across the settings. This is expected, since the outputs will be random when we have 100% randomized weights, and the model will not perform well when only limited weights are unpruned (close to 0%). The best-performing subnetwork of a one-layer randomized Transformer has 50% of its weights remaining. This connects to the search space of the employed method: since we choose k% out of 100% of the randomized weights, k = 50 leads to the largest search space.
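The search-space claim is simple counting: choosing which k% of n random weights to keep gives C(n, kn) candidate subnetworks, and the binomial coefficient peaks at k = 50% (n here is a small illustrative count):

```python
from math import comb

n = 20  # illustrative number of weights in one matrix
search_space = {k: comb(n, k) for k in range(n + 1)}

# The number of possible subnetworks is largest when half the weights remain.
best_k = max(search_space, key=search_space.get)
```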
Effectiveness of Pre-trained Embedding Layers. Embedding layers are critical since they can be viewed as the inputs for an NLP model, analogous to the image pixels in vision. Plenty of prior studies have explored how to obtain pre-trained embeddings in an unsupervised way (Mikolov et al., 2013; Pennington et al., 2014). We experiment with this practical setting where we have access to encoder/decoder embedding layers pre-trained from the public checkpoint in fairseq (https://github.com/pytorch/fairseq/), and we present the results in Fig. 3. We observe a significant performance boost for a one-layer randomized Transformer across different remaining weights. The difference is much larger for the bigger WMT14 dataset (around +3.0 BLEU for WMT14 and +1.0 BLEU for IWSLT14). The best one-layer randomized Transformer reaches 89%/74% of the fully-weighted Transformer performance on IWSLT14/WMT14, respectively.
Effectiveness of Depth and Width. In Tab. 1, we report the parameter size, BLEU score, and memory size of different one-layer randomized Transformers with 50% remaining weights, where Trans_deep is a 12 encoder/decoder layer variant of Trans_small, and Trans_wide has 2× the hidden size of Trans_small. The results are gathered with pre-trained encoder/decoder embedding layers. (We use the checkpoints from fairseq for Trans_big on WMT14 and Trans_small on IWSLT14 to obtain the pre-trained embedding layers for one-layer Trans_big and one-layer Trans_small. For one-layer Trans_wide on IWSLT14, we pre-train the fully-weighted model and then dump the embedding layer. Trans_deep shares the same embedding as Trans_small.)
Either increasing the depth or enlarging the width improves the performance of our one-layer random Transformer. In particular, the deeper Transformer can already achieve 79%/90% of the fully-weighted baseline models on WMT14/IWSLT14, respectively. For wider models, those numbers increase further to 92%/98%. This is mainly due to the larger search space introduced by the larger weight matrix. Another important point is that even when we increase the depth or enlarge the width of the model, the total memory consumption of these models is actually smaller than that of the standard baseline, since we only have one repeated layer and all the masks can be stored in 1-bit format.
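The memory argument can be made concrete: one float32 layer plus L one-bit masks costs n(32 + L) bits, versus 32·L·n bits for L distinct float32 layers (the per-layer parameter count n below is purely illustrative):

```python
def model_bits(n_params_per_layer: int, n_layers: int, shared_layer: bool) -> int:
    """Storage cost in bits for the layer stack (embeddings excluded)."""
    if shared_layer:
        # one float32 weight matrix + one 1-bit Supermask per application
        return n_params_per_layer * (32 + n_layers)
    # n_layers distinct float32 weight matrices
    return n_params_per_layer * 32 * n_layers

n = 3_000_000  # illustrative per-layer parameter count
shared = model_bits(n, 6, shared_layer=True)
full = model_bits(n, 6, shared_layer=False)
ratio = shared / full  # (32 + 6) / (32 * 6), about 20% of the full cost
```

This is why even the deeper (2L applications) and wider variants stay below the memory footprint of the standard fully-weighted baseline.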
Furthermore, we explore the effect of different ratios of remaining parameters for different models on IWSLT14 in Fig. 4. As can be seen, the wider model always performs better than the standard one across all settings. However, for the deeper model, there is a sharp transition at 50%–60% remaining parameters. The reason is that, given that our deeper model is twice as deep as the original, when we retain more random parameters (>50%), the probability p that a layer has a good subnetwork decreases significantly. The final probability then scales as p^{2L}, which is much smaller than p^{L} (see Section 3).
Different Initialization. Weight initialization is one of the critical components for the success of random features (Wieting and Kiela, 2019; Ramanujan et al., 2020; Shen et al., 2021). We experiment with the Kaiming uniform (Ramanujan et al., 2020) and Xavier uniform (Vaswani et al., 2017) initialization methods, and we scale the standard deviation by √(1/k) when we retain k% of the randomized weights. As shown in Fig. 5, the performance of the one-layer randomized Transformer decreases when we switch to Xavier uniform. The degradation becomes larger as more randomized weights are retained in the network.
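A sketch of the rescaled initialization, assuming a √(1/k) rescaling of the standard deviation when a fraction k of the weights survives the Supermask (the Kaiming-style fan computation is simplified here for illustration, and a normal distribution stands in for the uniform variants):

```python
import numpy as np

def scaled_init(shape, k_frac, rng):
    """Random weights with std inflated to compensate for pruning.

    Only a k_frac fraction of entries survives the Supermask, so the base
    Kaiming-style std sqrt(2 / fan_in) is divided by sqrt(k_frac).
    """
    fan_in = shape[0]
    std = np.sqrt(2.0 / fan_in) / np.sqrt(k_frac)
    return rng.standard_normal(shape) * std

rng = np.random.default_rng(0)
W = scaled_init((512, 512), 0.5, rng)  # 50% of weights will remain
```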
QQP and MNLI results. On QQP and MNLI, we experiment with RoBERTa_base and RoBERTa_large, following Liu et al. (2019). We use the pre-trained embedding layer of RoBERTa (Liu et al., 2019). In Fig. 6 and 7, we show consistent results on QQP and MNLI, except that the best-performing one-layer randomly weighted RoBERTa is obtained when we retain 70% of the randomized weights; it reaches 79%/91% of the fully-weighted RoBERTa accuracy on QQP and MNLI, respectively. The performance approaches 84%/92% of the aforementioned fully-weighted model performance when using the larger hidden size of one-layer randomly weighted RoBERTa_large.
We evaluate on IWSLT14 De-En and WMT14 En-De for machine translation, and on Quora Question Pairs (QQP) (Iyer et al., 2017) and MultiNLI-matched (MNLI) (Williams et al., 2017) for natural language understanding. (For IWSLT, we follow the pre-processing steps in Edunov et al. (2018); the train/val/test split is 129k/10k/6.8k sentences. For WMT, we follow the pre-processing in Ott et al. (2018), with 4.5M/16.5k/3k sentences in train/val/test.)
We use 8 Volta V100 GPUs for WMT, and one V100 for IWSLT, QQP, and MNLI. The hyperparameters on IWSLT14 and WMT14 for training a one-layer randomized Transformer were set to the best-performing values from Ott et al. (2018) for training fully-weighted Transformers. The QQP and MNLI experiments followed Liu et al. (2019).
In this paper, we validate the existence of effective subnetworks in a one-layer randomly weighted Transformer on translation tasks. Hidden within a one-layer randomly weighted Transformer with fixed pre-trained embedding layers, we find there exist subnetworks that are smaller than, but can competitively match, the performance of a trained Transformer on IWSLT14/WMT14.
We thank anonymous reviewers for their comments and suggestions. SS and KK were supported by grants from Samsung, Facebook, and the Berkeley Deep Drive Consortium. We would like to acknowledge DARPA, IARPA, NSF, and ONR for providing partial support of this work.
- Deep equilibrium models. Advances in Neural Information Processing Systems 32, pp. 690–701. Cited by: §2.
- On the capabilities of multilayer perceptrons. Journal of Complexity 4 (3), pp. 193–215. Cited by: §1, §2.
- Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.
- Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA. Cited by: §4.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
- Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of IWSLT. Cited by: §4.
- The lottery ticket hypothesis for pre-trained bert networks. arXiv preprint arXiv:2007.12223. Cited by: §2.
- Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555. Cited by: §2.
- Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364. Cited by: §2.
- Universal transformers. In International Conference on Learning Representations, Cited by: §2.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
- Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Cited by: footnote 5.
- Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations, Cited by: §2.
- The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: §1, §2.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269. Cited by: §1.
- Weight agnostic neural networks. arXiv preprint arXiv:1906.04358. Cited by: §1.
- Echo state property of deep reservoir computing networks. Cognitive Computation 9 (3), pp. 337–350. Cited by: §2.
- Further experiments with PAPA. Il Nuovo Cimento (1955-1965) 20 (2), pp. 112–115. Cited by: §1.
- Playing lottery tickets with vision and language. arXiv preprint arXiv:2104.11832. Cited by: §2.
- First Quora dataset release: question pairs, 2017. URL https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs. Cited by: §4.
- Adaptive nonlinear system identification with echo state networks. In Advances in neural information processing systems, Cited by: §2.
- TinyBERT: distilling BERT for natural language understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4163–4174. Cited by: §2.
- ALBERT: a lite BERT for self-supervised learning of language representations. Cited by: §2.
- Train big, then compress: rethinking model size for efficient training and inference of transformers. In International Conference on Machine Learning, Cited by: §2.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §4.
- Reservoir computing approaches to recurrent neural network training. Computer Science Review 3 (3). Cited by: §2.
- Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural computation 14 (11), pp. 2531–2560. Cited by: §2.
- Are sixteen heads really better than one?. Advances in Neural Information Processing Systems 32, pp. 14014–14024. Cited by: §2.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §4.
- Scaling neural machine translation. arXiv preprint arXiv:1806.00187. Cited by: §4, footnote 5.
- Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6 (2), pp. 163–180. Cited by: §2.
- Random feature attention. In International Conference on Learning Representations, Cited by: §2.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543. Cited by: §4.
- On the impressive performance of randomly weighted encoders in summarization tasks. arXiv preprint arXiv:2002.09084. Cited by: §2.
- When BERT plays the lottery, all tickets are winning. arXiv preprint arXiv:2005.00561. Cited by: §2.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: §1.
- Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184. Cited by: §1, §2.
- Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in neural information processing systems, pp. 1313–1320. Cited by: §1, §2.
- What’s hidden in a randomly weighted neural network?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11893–11902. Cited by: §1, §2, §3, §4.
- Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389. Cited by: §2.
- Movement pruning: adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems 33. Cited by: §2.
- Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7 (2), pp. e1200. Cited by: §2.
- Feedforward neural networks with random weights. In Proceedings of the 11th International Conference on Pattern Recognition, 1992. Vol. II. Conference B: Pattern Recognition Methodology and Systems, pp. 1–4. Cited by: §2.
- Reservoir transformers. In ACL, Cited by: §1, §2, §4.
- Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8815–8821. Cited by: §2.
- Megatron-LM: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1.
- MobileBERT: a compact task-agnostic bert for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170. Cited by: §2.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.
- Information-theoretic probing with minimum description length. arXiv preprint arXiv:2003.12298. Cited by: §2.
- No training required: exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444. Cited by: §2, §4.
- A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: §4.
- Supermasks in superposition for continual learning. Advances in Neural Information Processing Systems (NeurIPS) 6. Cited by: §2.
- MLPruning: a multilevel structured pruning framework for transformer-based models. arXiv preprint arXiv:2105.14636. Cited by: §2.
- Playing the lottery with rewards and multiple languages: lottery tickets in rl and nlp. arXiv preprint arXiv:1906.02768. Cited by: §2.
- Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pp. 3597–3607. Cited by: §1, §2.