1 Introduction
Modern deep learning often trains millions or even billions of parameters
(Devlin et al., 2018; Shoeybi et al., 2019; Raffel et al., 2019; Brown et al., 2020) to deliver good performance for a model. Recently, Frankle and Carbin (2018) and Frankle et al. (2020) demonstrated that these overparameterized networks contain sparse subnetworks that, when trained in isolation, can achieve performance similar to or better than the original model. Furthermore, recent studies revisit the initialization stage of finding these subnetworks in vision models (Zhou et al., 2019; Ramanujan et al., 2020). Such a mask, which masks out part of the entire network to obtain those subnetworks, is referred to as a “Supermask.” That is to say, subnetworks of a randomly weighted neural network (NN) can achieve competitive performance, which may act as a good “prior” (Gaier and Ha, 2019) and connects to the long history of leveraging random features (Gamba et al., 1961; Baum, 1988) and/or random kernel methods (Rahimi and Recht, 2008, 2009) in machine learning. Here, we examine the following question: how does a fully randomized natural language processing (NLP) model perform in the multi-layer setting, and particularly in the (so far underexplored) one-layer setting?
In this work, we first validate that there exist subnetworks of standard randomly weighted Transformers (Reservoir Transformers in Shen et al. (2021)) that can perform competitively with fully-weighted alternatives on machine translation and natural language understanding tasks. With 50% of the randomized weights remaining, we found a subnetwork that reaches 29.45/17.29 BLEU on IWSLT14/WMT14, respectively. We also investigate the special case of finding subnetworks in one-layer randomly weighted Transformers (see Fig. 1). To obtain these subnetworks, we repeatedly apply the same randomized Transformer layer several times with different Supermasks. The resulting subnetwork of a one-layer randomly weighted Transformer performs similarly to its multi-layer counterparts with a 30% lower memory footprint. We also study the impact of different depths/widths of Transformers along with the effectiveness of two initialization methods. Finally, using pretrained embedding layers, we find that the subnetworks hidden in a one-layer randomly weighted Transformer are smaller than, but can match 98%/92% of the performance of, a trained Transformer on IWSLT14/WMT14. We hope our findings can offer new insights for understanding Transformers.
2 Related Work
Lottery Ticket Hypothesis. Frankle and Carbin (2018) found that NNs for computer vision contain subnetworks that can be effectively trained from scratch when reset to their initialization. Subsequent works (Zhou et al., 2019; Ramanujan et al., 2020; Wortsman et al., 2020) demonstrated that so-called winning tickets can achieve competitive performance without training, where the mask for finding such a subnetwork at initialization is called a “Supermask.” In NLP, previous works find that matching subnetworks exist early in training with Transformers (Yu et al., 2019), LSTMs (Renda et al., 2020), and fully-weighted pretrained BERT (Chen et al., 2020; Prasanna et al., 2020) or Vision-and-Language models (Gan et al., 2021), but not at initialization.

Random Features. In the early days of neural networks, fixed random layers (Baum, 1988; Schmidt et al., 1992; Pao et al., 1994) were studied in reservoir computing (Maass et al., 2002; Jaeger, 2003; Lukoševičius and Jaeger, 2009), “random kitchen sink” kernel machines (Rahimi and Recht, 2008, 2009), and so on. Recently, random features have also been extensively explored for modern neural networks in deep reservoir computing networks (Scardapane and Wang, 2017; Gallicchio and Micheli, 2017; Shen et al., 2021), random kernel features (Peng et al., 2021; Choromanski et al., 2020), and applications in text classification (Conneau et al., 2017; Wieting and Kiela, 2019), summarization (Pilault et al., 2020), and probing (Voita and Titov, 2020).
Compressing Transformers. A wide range of neural network compression techniques have been applied to Transformers, including pruning (Fan et al., 2019; Michel et al., 2019; Sanh et al., 2020; Yao et al., 2021), where parts of the model weights are dropped; parameter sharing (Lan et al., 2020; Dehghani et al., 2018; Bai et al., 2019), where the same parameters are reused in different parts of a model; quantization (Shen et al., 2020; Li et al., 2020), where the weights of the Transformer are represented with fewer bits; and distillation (Sun et al., 2020; Jiao et al., 2020), where a compact student model is trained to mimic a larger teacher model. To find the proposed subnetworks at initialization, we develop our method in the spirit of parameter sharing and pruning.
3 Methodology
Finding a Supermask for a Randomly Weighted Transformer. In a general pruning framework, denote the weight matrix as $W$ ($W$ could be a non-square matrix), the input as $x$, and the network as $f(x; W)$. A subnetwork is defined as $f(x; W \odot M)$, where $M$ is a binary matrix and $\odot$ is the element-wise product. To find the subnetwork of a randomly weighted network, $M$ is trained while $W$ is kept at its random initialization. Following Ramanujan et al. (2020), denote $S$ as the importance score matrix associated with $W$, which is learnable during training. We keep the top-$k$ percent of weights ranked by the importance score $S$ to compute $M$, i.e.,
$$M = \mathrm{Top}_k(S).$$
Note that $\mathrm{Top}_k(\cdot)$ is a non-differentiable function. To enable the training of $S$, we use the straight-through gradient estimator (Bengio et al., 2013), in which $\mathrm{Top}_k(\cdot)$ is treated as the identity in backpropagation. During inference, we can simply construct and store the binary Supermask $M$ and the floating-point $W$ for future usage, while dropping $S$.

One-layer randomly weighted Transformer. We use the Transformer architecture (see Vaswani et al. (2017) for more details). For a general randomly weighted $L$-layer Transformer model with Supermasks, there exist $W_l$s and $M_l$s for all layers $l \in \{1, \dots, L\}$. Due to the natural layer-stacking property of Transformers, all $W_l$s have the same shape with the same initialization method. This leads to an unexplored question: “What’s hidden in a one-layer (instead of $L$-layer) randomly weighted Transformer?”
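As a minimal sketch of the top-$k$ selection above, consider the following NumPy toy example for a single linear layer; the function name, shapes, and score values here are hypothetical illustrations, not the paper's actual Transformer implementation:

```python
import numpy as np

def supermask_forward(x, W, S, k=0.5):
    """Toy linear layer f(x; W ⊙ M): keep the top-k fraction of the
    randomly initialized weights W, ranked by importance scores S.
    In training, S would be updated via the straight-through estimator."""
    flat = np.sort(S, axis=None)               # scores in ascending order
    thresh = flat[int((1 - k) * flat.size)]    # cutoff so ~k of weights survive
    M = (S >= thresh).astype(W.dtype)          # binary Supermask
    return x @ (W * M)
```

During backpropagation, the non-differentiable thresholding would be treated as the identity, so gradients reach the scores `S` even though only the binary mask affects the forward pass.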
Let us use a toy example to explain why there is no need for redundant $W_l$s. Assume that, for a randomly weighted matrix $W_l$, the probability that it has a “good” subnetwork is $p$ (here, “good” can be any defined metric, e.g., achieving error at most a predefined $\epsilon$ for all inputs). Furthermore, assume that, for two different layers, the events of having “good” subnetworks are independent. Then for $L$ different layers, the probability that all $W_l$s have “good” subnetworks is $p^L$. Meanwhile, since $W_1$ has the same initialization method as $W_l$, the probability that $W_1$ has a “good” subnetwork for the $l$-th layer is also $p$. Thus, for $L$ different layers, the probability of using $W_1$ to generate all $L$ “good” subnetworks is also $p^L$.

In this paper, we investigate the scenario where one randomized layer is applied $L$ times repeatedly with different Supermasks. As a result, this reduces the memory footprint, since all Supermasks can be stored in binary format.
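The memory saving can be made concrete with a back-of-the-envelope calculation: an $L$-layer model stores $L$ float32 weight tensors, whereas the one-layer variant stores a single float32 tensor plus $L$ one-bit Supermasks. The helper and parameter counts below are illustrative assumptions, not the paper's measured numbers:

```python
def footprint_mb(params_per_layer, n_layers, one_layer=False):
    """Rough weight-storage cost in MB: float32 weights cost 32 bits each;
    each Supermask costs 1 bit per weight."""
    if one_layer:
        # one shared float layer + one binary mask per repeated application
        bits = params_per_layer * 32 + n_layers * params_per_layer * 1
    else:
        # independent float weights for every layer
        bits = n_layers * params_per_layer * 32
    return bits / 8 / 1e6
```

For instance, with 1M parameters per layer and $L = 6$, the shared-layer variant needs about 4.75 MB versus 24 MB for the full stack; the binary masks add little on top of the single float layer.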
4 Experiments
Model Architecture. For model architectures, we experiment with Transformer_small and Transformer_base, following the same setting as in Ott et al. (2018): 6 encoder layers and 6 decoder layers on IWSLT14 and WMT14. We also vary the depth and width of the Transformer model on the machine translation tasks. On IWSLT14, we use 3 different random seeds and plot the mean accuracy ± one standard deviation. All the embedding layers (including the final output projection layer) are also randomized and pruned unless otherwise specified. Moreover, in all figures, the “fully-weighted model” denotes the standard full model (all weights remaining).
Machine Translation results. In Fig. 2, we present results for directly pruning a randomly weighted Transformer on IWSLT14 and WMT14 tasks. Specifically, we vary the ratio of remaining parameters in the randomized model.
As can be seen, there is no significant performance difference between a one-layer random Transformer and a 6-layer standard random Transformer across different percentages of remaining weights on IWSLT14 and WMT14. We also observe that letting the percentage of remaining randomized weights approach 0 or 100 leads to the worst performance across settings. This is expected: the outputs are random when 100% of the randomized weights remain, and the model cannot perform well when only limited weights are left unpruned (close to 0%). The best-performing subnetwork of a one-layer randomized Transformer retains 50% of the weights. This matches the search space of the employed method: when choosing k% out of 100% of the randomized weights, k = 50 yields the largest search space.
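The search-space claim can be sanity-checked directly: the number of binary masks that keep a fraction k of n weights is the binomial coefficient C(n, kn), which peaks at k = 50%. The helper below is our own illustration, not code from the paper:

```python
from math import comb

def n_masks(n_weights, keep_frac):
    """Number of distinct Supermasks retaining keep_frac of n_weights."""
    return comb(n_weights, round(keep_frac * n_weights))
```

For example, with 20 weights there are 184,756 possible masks at 50% retention but only 38,760 at 30% (or 70%), mirroring why intermediate retention ratios leave the mask search the most room.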
Effectiveness of Pretrained Embedding Layers. Embedding layers are critical since they can be viewed as the inputs of an NLP model, analogous to the image pixels in vision. Many prior studies have explored how to obtain pretrained embeddings in an unsupervised way (Mikolov et al., 2013; Pennington et al., 2014). We experiment with the practical setting where we have access to encoder/decoder embedding layers pretrained from the public checkpoint in fairseq (https://github.com/pytorch/fairseq), and we present the results in Fig. 3. We observe a significant performance boost for the one-layer randomized Transformer across different fractions of remaining weights. The difference is much larger for the bigger WMT14 dataset (around +3.0 BLEU for WMT14 versus +1.0 BLEU for IWSLT14). The best one-layer randomized Transformer reaches 89%/74% of the fully-weighted Transformer performance on IWSLT14/WMT14, respectively.
Task  | Model                 | BLEU          | Memory | Weights remaining (%) | # Params
------|-----------------------|---------------|--------|-----------------------|---------
IWSLT | Trans_small           | 34.66 (0.11)  | 148MB  | 100.0                 | 39M
IWSLT | one-layer Trans_small | 30.95 (0.12)  | 28MB   | 50.0                  | 7M
IWSLT | one-layer Trans_wide  | 34.14 (0.08)  | 71MB   | 50.0                  | 18M
IWSLT | one-layer Trans_deep  | 31.51 (0.10)  | 29MB   | 50.0                  | 7M
WMT   | Trans_base            | 27.51         | 328MB  | 100.0                 | 86M
WMT   | one-layer Trans_base  | 20.35         | 96MB   | 50.0                  | 25M
WMT   | one-layer Trans_wide  | 25.24         | 227MB  | 50.0                  | 57M
WMT   | one-layer Trans_deep  | 21.76         | 98MB   | 50.0                  | 25M
Effectiveness of Depth and Width. In Tab. 1, we report the BLEU score, memory footprint, and parameter count of different one-layer randomized Transformers with 50% remaining weights, where Trans_deep is a 12-encoder/decoder-layer variant and Trans_wide has 2x the hidden size. The results are gathered with pretrained encoder/decoder embedding layers. (We use the checkpoints from fairseq for Trans_base on WMT14 and Trans_small on IWSLT14 to obtain the pretrained embedding layers for one-layer Trans_base and one-layer Trans_small. For one-layer Trans_wide on IWSLT14, we pretrain the fully-weighted model and then dump its embedding layer; Trans_deep shares the same embedding as Trans_small.)
Either increasing the depth or enlarging the width improves the performance of our one-layer random Transformer. In particular, the deeper Transformer already achieves 79%/90% of the fully-weighted baseline models on WMT14/IWSLT14, respectively. For wider models, those numbers increase to 92%/98%. This is mainly due to the larger search space introduced by the larger weight matrix. Another important point is that even when we increase the depth or enlarge the width of the model, the total memory consumption is actually smaller than that of the standard baseline, since we store only one repeated layer and all the masks can be stored at 1 bit per weight.
Furthermore, we explore the effect of different ratios of remaining parameters for different models on IWSLT14 in Fig. 4. As can be seen, the wider model always outperforms the standard one across all settings. However, for the deeper model, there is a sharp transition at 50%–60% remaining parameters. The reason is that our deeper model is twice as deep as the original; when we retain more random parameters (beyond 50%), the probability $p$ that the layer has a good subnetwork for each application decreases significantly. This drives the final probability $p^{2L}$ far below $p^{L}$ (see Section 3).
Different Initialization. Weight initialization is one of the critical components for the success of random features (Wieting and Kiela, 2019; Ramanujan et al., 2020; Shen et al., 2021). We experiment with the Kaiming uniform (Ramanujan et al., 2020) and Xavier uniform (Vaswani et al., 2017) initialization methods, scaling the standard deviation according to the fraction of randomized weights retained. As shown in Fig. 5, the performance of the one-layer randomized Transformer decreases when we switch to Xavier uniform, and the degradation grows as more randomized weights are retained in the network.
QQP and MNLI results.
On QQP and MNLI, we experiment with RoBERTa and a wider RoBERTa variant, following Liu et al. (2019). We use the pretrained embedding layer of RoBERTa (Liu et al., 2019). In Fig. 6 and 7, we show consistent results on QQP and MNLI, except that the best-performing one-layer randomly weighted RoBERTa is achieved when we retain 70% of the randomized weights; it reaches 79%/91% of the fully-weighted RoBERTa accuracy on QQP and MNLI, respectively. The performance approaches 84%/92% of the fully-weighted model when using the larger hidden size with the one-layer randomly weighted RoBERTa.
Implementation Details.
We evaluate on IWSLT14 de-en (Cettolo et al., 2015) and WMT14 en-de (Bojar et al., 2014) for machine translation, and on QQP (Iyer et al., 2017) and MultiNLI-matched (MNLI) (Williams et al., 2017) for natural language understanding. (For IWSLT, we follow the preprocessing steps in Edunov et al. (2018); the train/val/test split is 129k/10k/6.8k sentences. For WMT, we follow the preprocessing in Ott et al. (2018), with 4.5M/16.5k/3k sentences in train/val/test.) We use 8 Volta V100 GPUs for WMT, and one V100 for IWSLT, QQP, and MNLI. The hyperparameters on IWSLT14 and WMT14 for training a one-layer randomized Transformer were set to the best-performing values from Ott et al. (2018) for training the fully-weighted Transformer. The QQP and MNLI experiments followed Liu et al. (2019).

5 Conclusions
In this paper, we validate the existence of effective subnetworks in a one-layer randomly weighted Transformer on translation tasks. Hidden within a one-layer randomly weighted Transformer with fixed pretrained embedding layers, there exist subnetworks that are smaller than, but can competitively match, the performance of a trained Transformer on IWSLT14/WMT14.
Acknowledgements
We thank anonymous reviewers for their comments and suggestions. SS and KK were supported by grants from Samsung, Facebook, and the Berkeley Deep Drive Consortium. We would like to acknowledge DARPA, IARPA, NSF, and ONR for providing partial support of this work.
References
- Deep equilibrium models. In Advances in Neural Information Processing Systems 32, pp. 690–701.
- On the capabilities of multilayer perceptrons. Journal of Complexity 4(3), pp. 193–215.
- Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
- Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA.
- Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of IWSLT.
- The lottery ticket hypothesis for pre-trained BERT networks. arXiv preprint arXiv:2007.12223.
- Masked language modeling for proteins via linearly scalable long-context transformers. arXiv preprint arXiv:2006.03555.
- Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
- Universal Transformers. In International Conference on Learning Representations.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana.
- Reducing transformer depth on demand with structured dropout. In International Conference on Learning Representations.
- The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pp. 3259–3269.
- Weight agnostic neural networks. arXiv preprint arXiv:1906.04358.
- Echo state property of deep reservoir computing networks. Cognitive Computation 9(3), pp. 337–350.
- Further experiments with PAPA. Il Nuovo Cimento (1955–1965) 20(2), pp. 112–115.
- Playing lottery tickets with vision and language. arXiv preprint arXiv:2104.11832.
- First Quora dataset release: question pairs, 2017. URL https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
- Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems.
- TinyBERT: distilling BERT for natural language understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4163–4174.
- ALBERT: a lite BERT for self-supervised learning of language representations.
- Train big, then compress: rethinking model size for efficient training and inference of transformers. In International Conference on Machine Learning.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Reservoir computing approaches to recurrent neural network training. Computer Science Review 3(3).
- Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation 14(11), pp. 2531–2560.
- Are sixteen heads really better than one? In Advances in Neural Information Processing Systems 32, pp. 14014–14024.
- Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Scaling neural machine translation. arXiv preprint arXiv:1806.00187.
- Learning and generalization characteristics of the random vector functional-link net. Neurocomputing 6(2), pp. 163–180.
- Random feature attention. In International Conference on Learning Representations.
- GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
- On the impressive performance of randomly weighted encoders in summarization tasks. arXiv preprint arXiv:2002.09084.
- When BERT plays the lottery, all tickets are winning. arXiv preprint arXiv:2005.00561.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184.
- Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pp. 1313–1320.
- What's hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11893–11902.
- Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389.
- Movement pruning: adaptive sparsity by fine-tuning. Advances in Neural Information Processing Systems 33.
- Randomness in neural networks: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7(2), e1200.
- Feedforward neural networks with random weights. In Proceedings of the 11th International Conference on Pattern Recognition, Vol. II, pp. 1–4.
- Reservoir transformers. In ACL.
- Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8815–8821.
- Megatron-LM: training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053.
- MobileBERT: a compact task-agnostic BERT for resource-limited devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2158–2170.
- Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Information-theoretic probing with minimum description length. arXiv preprint arXiv:2003.12298.
- No training required: exploring random encoders for sentence classification. arXiv preprint arXiv:1901.10444.
- A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
- Supermasks in superposition for continual learning. Advances in Neural Information Processing Systems (NeurIPS).
- MLPruning: a multilevel structured pruning framework for transformer-based models. arXiv preprint arXiv:2105.14636.
- Playing the lottery with rewards and multiple languages: lottery tickets in RL and NLP. arXiv preprint arXiv:1906.02768.
- Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems, pp. 3597–3607.