Speeding up Deep Model Training by Sharing Weights and Then Unsharing

by Shuo Yang, et al.

We propose a simple and efficient approach for training the BERT model. Our approach exploits the special structure of BERT that contains a stack of repeated modules (i.e., transformer encoders). Our proposed approach first trains BERT with the weights shared across all the repeated modules till some point. This is for learning the commonly shared component of weights across all repeated layers. We then stop weight sharing and continue training until convergence. We present theoretic insights for training by sharing weights then unsharing with analysis for simplified models. Empirical experiments on the BERT model show that our method yields better performance of trained models, and significantly reduces the number of training iterations.




1 Introduction

It has been widely observed that increasing model size often leads to significantly better performance on various real tasks, especially natural language processing applications (Amodei et al., 2016; Wu et al., 2016; Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020; Brown et al., 2020). However, as models get larger, the optimization problem becomes more challenging and time-consuming. To alleviate this problem, there has been growing interest in developing systems and algorithms for more efficient training of deep learning models, such as distributed large-batch training (Goyal et al., 2017; Shazeer et al., 2018; Lepikhin et al., 2020; You et al., 2020), various normalization methods (De and Smith, 2020; Salimans and Kingma, 2016; Ba et al., 2016), gradient clipping and normalization (Kim et al., 2016; Chen et al., 2018), and progressive/transfer training (Chen et al., 2016; Chang et al., 2018; Gong et al., 2019).

In this paper, we seek a better training approach by exploiting common network architectural patterns. In particular, we are interested in speeding up the training of deep networks which are constructed by repeatedly stacking the same layer, with a special focus on the BERT model. We propose a simple method for efficiently training such kinds of networks. In our approach, we first force the weights to be shared across all the repeated layers and train the network, and then stop the weight-sharing and continue training until convergence. Empirical studies show that our method yields trained models with better performance, or reduces the training time of commonly used models.

Our method is motivated by the successes of weight-sharing models, in particular, ALBERT (Lan et al., 2020). It is a variant of BERT in which weights across all repeated transformer layers are shared. As long as its architecture is sufficiently large, ALBERT is comparable with the original BERT on various downstream natural language processing benchmarks. This indicates that the optimal weights of an ALBERT model and the optimal weights of a BERT model can be very close. We further assume that there is a commonly shared component across weights of repeated layers.

Correspondingly, our training method consists of two phases. In the first phase, a neural network is trained with weights shared across its repeated layers, to learn the commonly shared component across weights of different layers. In the second phase, the weight-sharing constraint is released, and the network is trained until convergence, to learn a different weight for each layer based on the shared component.

We theoretically show that training in the direction of the shared component in the early steps effectively constrains the model complexity and provides better initialization for later training. Eventually, the trained weights are closer to the optimal weights, which leads to better performance. See Figure 1 for an illustration.

Figure 1: Intuition for sharing weights then unsharing. The figure illustrates how the weights change when trained by sharing weights then unsharing (the red arrows) and by gradient descent (the blue arrow). Weight sharing first trains the weights in a more constrained set. It brings the weights closer to the optimal solution, which provides a better initialization for subsequent training. Compared with gradient descent, which directly optimizes the weights in the original parameter space, our proposed training method converges to a solution closer to the optimum, and thus leads to better generalization performance.

Existing approaches can be viewed as extreme cases of our proposed method: ALBERT shares weights throughout the training process, and BERT does not share weights at all. Our experiments show that, under various settings, the best result is often given by sharing weights for the first 10% of the total number of iterations.

The rest of this paper is organized as follows. We present our algorithmic motivation and the sharing weights then unsharing algorithm in Section 2. In Section 3, we present some theoretical insights of the proposed algorithm. In Section 4, we discuss related work. In Section 5, we show detailed experimental setups and results. We also provide various ablation studies on different choices in implementing our algorithm. Finally, we conclude this paper with discussions in Section 6.

2 Sharing Weights and Then Unsharing

In this section, we first motivate the training with “sharing weights and then unsharing” as a way of reparameterization. We show how a particular reparameterization of “stem-direction” and “branch-direction” naturally leads to our algorithm. We then formally present our algorithm.

Algorithmic Motivation

For a machine learning model $\mathcal{M}$ composed of $L$ modules with the same structure (e.g., the transformer modules (Vaswani et al., 2017) in BERT), let $w$ be the trainable weights of $\mathcal{M}$ and $w_i$ be the weights corresponding to the $i$-th module. We can rewrite the $w_i$s as

$$w_i = \frac{1}{\sqrt{L}}\, w_0 + \tilde{w}_i, \quad i = 1, \dots, L.$$

The $w_0$ can be viewed as the stem-direction shared by all $w_i$, while the $\tilde{w}_i$s are the branch-directions, capturing the difference among the $w_i$s. The $1/\sqrt{L}$ is a scaling factor whose meaning will be clear soon. To optimize $w$ by $T$ steps of gradient descent with step size $\eta$, one could do

  1. Sharing weights in early stage: For some $\tau < T$, in the first $\tau$ steps, compute the gradient $g_0$ for $w_0$ and update all $w_i$ by $-\frac{\eta}{\sqrt{L}}\, g_0$.

  2. Unsharing weights in later stage: For $t > \tau$, compute the gradient $\tilde{g}_i$ for $\tilde{w}_i$ and update each $w_i$ with $-\eta\, \tilde{g}_i$.

At a high level, training only on the stem-direction ($w_0$) in the early steps effectively constrains the model complexity and provides better initialization for later training.

It is easy to implement training with the aforementioned reparameterization. The gradients can be adapted from the gradients $g_i$ of the weights $w_i$, where we have

$$g_0 = \frac{1}{\sqrt{L}} \sum_{i=1}^{L} g_i, \qquad \tilde{g}_i = g_i.$$

Thus the effective updates to $w_i$ in every step are

$$w_i \leftarrow w_i - \frac{\eta}{L} \sum_{j=1}^{L} g_j \quad (t \le \tau), \qquad w_i \leftarrow w_i - \eta\, g_i \quad (t > \tau).$$

In the special case where all $g_i$s are equal, “sharing weights and then unsharing” is equivalent to standard gradient descent. This shows that $1/\sqrt{L}$ gives the right scaling.

Practical Algorithm

We now formally present our algorithm, which first trains the deep network with all the weights shared. Then, after a certain number of training steps, it unties the weights and further trains the network until convergence. See Algorithm 1.

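Algorithm 1 Sharing Weights and then Untying
1:  Input: total number of training steps $T$, untying point $\tau$, learning rates $\eta^{(t)}$
2:  Randomly and equally initialize weights $w_1 = \dots = w_L$
3:  for $t = 1$ to $T$ do
4:     Compute gradient $g_i$ of $w_i$, for $i = 1, \dots, L$
5:     if $t \le \tau$ then
6:        $w_i \leftarrow w_i - \eta^{(t)} \cdot \frac{1}{L} \sum_{j=1}^{L} g_j$, for $i = 1, \dots, L$
7:     else
8:        $w_i \leftarrow w_i - \eta^{(t)} g_i$, for $i = 1, \dots, L$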

Note that, in lines 1 to 6, we initialize all the weights equally and then update them using the mean of their gradients. It is easy to see that such an update is equivalent to sharing weights. In lines 7 to 8, the $w_i$s are updated according to their own gradients, which corresponds to unshared weights. For the sake of simplicity, we only show how to update the weights using the plain (stochastic) gradient descent rule. One can replace this plain update rule with any other optimization method, for example, the Adam optimization algorithm (Kingma and Ba, 2015).
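As a rough illustration of Algorithm 1, the following NumPy sketch applies the mean-gradient update before the untying point and the per-layer update afterwards. The function names (`swe_step`, `train`), the stacked-array layout, and the toy quadratic loss used for checking are assumptions made for this example and are not taken from the paper's implementation.

```python
import numpy as np

def swe_step(weights, grads, lr, share):
    """One plain SGD step of 'sharing weights then untying'.

    weights, grads: arrays of shape (num_layers, dim), one row per layer.
    share: True while the step index is before the untying point.
    """
    if share:
        # Every layer receives the mean of all per-layer gradients, so
        # equally initialized layers stay identical (i.e., weights are shared).
        mean_grad = grads.mean(axis=0, keepdims=True)
        return weights - lr * mean_grad
    # After untying, each layer follows its own gradient.
    return weights - lr * grads

def train(grad_fn, init, num_layers, total_steps, untie_step, lr):
    """Run the two-phase schedule; grad_fn maps stacked weights to stacked gradients."""
    weights = np.tile(init, (num_layers, 1))  # equal initialization of all layers
    for t in range(total_steps):
        grads = grad_fn(weights)
        weights = swe_step(weights, grads, lr, share=t < untie_step)
    return weights

# Toy check (hypothetical loss): layer i is pulled towards its own target row.
targets = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 1.0]])
toy_grad = lambda w: w - targets  # gradient of 0.5 * ||w - targets||^2
print(train(toy_grad, np.zeros(2), 3, total_steps=200, untie_step=20, lr=0.1))
```

With `untie_step` equal to `total_steps` the layers never separate and converge to the mean of the targets, which mirrors the "commonly shared component" intuition; with an earlier untying point each layer then moves on to its own target.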

While the repeated layers are the most natural units for weight sharing, they are not the only choice. We may view several layers together as the weight-sharing unit, and share the weights across those units. The layers within the same unit can have different weights. For example, for a 24-layer transformer model, we may combine every four layers as a weight-sharing unit. Thus, there will be six such units for weight sharing. Such flexibility in choosing weight-sharing units allows for a balance between “full weight sharing” and “no weight sharing at all”.
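One way to realize such weight-sharing units is to average gradients only across layers that occupy the same position within their unit. The sketch below assumes per-layer gradients are stacked in a single array and that the unit size divides the number of layers; the helper name `grouped_mean_grads` is hypothetical.

```python
import numpy as np

def grouped_mean_grads(grads, unit_size):
    """Average gradients across weight-sharing units.

    grads: array of shape (num_layers, ...), one slice per layer.
    unit_size: number of consecutive layers forming one unit.
    Layers at the same position inside their unit get the same (mean) gradient.
    """
    num_layers = grads.shape[0]
    if num_layers % unit_size != 0:
        raise ValueError("unit_size must divide the number of layers")
    num_units = num_layers // unit_size
    # Shape (num_units, unit_size, ...): axis 0 indexes the units being shared.
    per_unit = grads.reshape(num_units, unit_size, *grads.shape[1:])
    shared = per_unit.mean(axis=0, keepdims=True)
    # Copy the shared gradient back to every unit and restore the layer axis.
    return np.repeat(shared, num_units, axis=0).reshape(grads.shape)

# 24 layers grouped four at a time: six units, as in the example in the text.
toy_grads = np.arange(24, dtype=float).reshape(24, 1)
print(grouped_mean_grads(toy_grads, unit_size=4)[:8, 0])
```

Under this sketch, `unit_size=1` recovers full weight sharing and `unit_size` equal to the number of layers recovers no sharing at all.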

3 Theoretic Understanding

In this section, we first provide theoretic insights of sharing weights then unsharing via an illustrative example of over-parameterized linear regression. Both the numerical experiment and theoretical analysis show that training by weight sharing allows the model to learn the commonly shared component of weights, which provides good initialization for subsequent training and leads to good generalization performance. We then extend the analysis to a deep linear model and show similar benefits.

3.1 An Illustrative Example

We first use linear regression, the simplest machine learning model, to illustrate our core idea: training with weight sharing helps the model learn the commonly shared component of the optimal solution. Consider a linear model with $y = x^\top w^*$, where $w^* \in \mathbb{R}^d$ is the ground truth model parameter, $x \in \mathbb{R}^d$ is the input, and $y$ is the response. We can rewrite it element-wise as $y = \sum_{i=1}^{d} x_i w^*_i$, i.e., elements in $w^*$ are roughly analogous to the repeated modules.

As a concrete example, we set , and draw and generate 120 training samples by drawing and compute correspondingly. Note that the dimension is larger than the size of training samples, and the regression problem yields an infinite number of solutions that can perfectly fit the training data.

We initialize and train the model with gradient descent, sharing weights then unsharing for 500 steps, with weights untied after 100 steps. Figure 2 shows that training by sharing weights then unsharing, the model first learns the mean of and has significantly better performance when fully trained.

Figure 2: Fitting an under-determined linear regression (fewer training samples than dimensions). “SWE” refers to training by “sharing weights then unsharing”. The first panel shows that both methods perfectly fit the training set, while sharing weights then unsharing has significantly better generalization performance. The second panel shows the change of parameters. The solid and dashed curves are the averages of the first and last 100 elements in $w$. By training with weight sharing, $w$ first learns the mean of $w^*$, which leads to a much better initialization when it starts to learn the whole $w^*$.
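The qualitative behavior described above can be reproduced with a small NumPy sketch. The 120 training samples, 500 steps, and untying after 100 steps follow the text; the dimension of 200, the Gaussian inputs, the ground truth drawn around a common mean of 1, the zero initialization, and the step size of 0.2 are assumptions chosen for illustration and may differ from the setup used for Figure 2.

```python
import numpy as np

def fit(X, y, steps, untie_step, lr):
    """Gradient descent on 0.5 * mean squared error, sharing weights before untie_step."""
    n, d = X.shape
    w = np.zeros(d)  # equal initialization of all coordinates (assumed)
    for t in range(steps):
        g = X.T @ (X @ w - y) / n
        if t < untie_step:
            # Shared phase: every coordinate is updated with the mean gradient.
            g = np.full(d, g.mean())
        w -= lr * g
    return w

rng = np.random.default_rng(0)
d, n = 200, 120                               # more parameters than samples
w_star = 1.0 + 0.1 * rng.standard_normal(d)   # assumed ground truth with a common mean
X = rng.standard_normal((n, d))
y = X @ w_star

w_gd = fit(X, y, steps=500, untie_step=0, lr=0.2)     # plain gradient descent
w_swe = fit(X, y, steps=500, untie_step=100, lr=0.2)  # share, then unshare

# For isotropic Gaussian inputs the expected test error equals ||w - w_star||^2.
err_gd = np.sum((w_gd - w_star) ** 2)
err_swe = np.sum((w_swe - w_star) ** 2)
print("parameter error, GD :", err_gd)
print("parameter error, SWE:", err_swe)
```

In this sketch plain gradient descent converges to the minimum-norm interpolating solution, while the shared phase first moves all coordinates to roughly the mean of the ground truth, so the unshared phase only has to fit the remaining deviations.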

Next, we formalize our observation in Figure 2 that training with sharing weights then unsharing learns the commonly shared component, even when the model is over-parameterized (i.e. having more parameters than training samples).

Denote by the space generated by the shared weights . In the special case of the aforementioned linear regression, is a 1-dimensional space. Let be the commonly shared weights for all coordinates, where denotes the projection to . Here the elements of are simply the average of . Our next proposition shows that when trained with weight sharing, converges to .

Proposition 1.

Consider the -dimensional linear regression problem and training samples with generated from and , we have

where is the solution obtained by training with weight sharing. Especially, this result holds for the over-parameterized regime with .

As an example, for , we have being a dimension-independent constant, whereas scales as . It shows that training with weight sharing effectively constrains the model complexity, and learns the common component of the optimal solution even in the over-parameterized regime.

Further, let be the space orthogonal to and let , we can decompose the error in the parameter space as

Therefore, training with weight sharing corresponds to minimizing the first term , which brings closer to the optimal solution , even when the model is over-parameterized. For the subsequent training, the weights are untied and the model is then trained in the original parameter space . The solution we obtained in (by weight sharing) can be viewed as a good initialization in the parameter space .

3.2 Further Implication for Deep Linear Model

Here we further show the theoretic insights of sharing weights then unsharing, using a deep linear model as an example. In particular, we show that training by weight sharing can 1) speed up the convergence, and 2) provide good initialization for subsequent training.

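A deep linear network is a series of matrix multiplications

$$f(x; W_1, \dots, W_L) = W_L W_{L-1} \cdots W_1 x.$$

At first glance, deep linear models may look trivial since a deep linear model is equivalent to a single matrix. However, when trained with backpropagation, its behavior is analogous to that of generic deep models.

The task is to train the deep linear network to learn a target matrix $\Phi$. To focus on the training dynamics, we adopt the simplified objective function

$$R(W_1, \dots, W_L) = \frac{1}{2} \left\| W_L W_{L-1} \cdots W_1 - \Phi \right\|_F^2.$$

Let $\nabla_l R$ be the gradient of $R$ with respect to $W_l$. We have

$$\nabla_l R = W_{L:l+1}^\top \left( W_{L:1} - \Phi \right) W_{l-1:1}^\top,$$

where $W_{j:i} = W_j W_{j-1} \cdots W_i$ (an empty product is the identity). The standard gradient update is given by

$$W_l(t+1) = W_l(t) - \eta\, \nabla_l R(t), \quad l = 1, \dots, L.$$

To train with weights shared, all the layers need to have the same initialization, and the update is

$$W_l(t+1) = W_l(t) - \frac{\eta}{L} \sum_{i=1}^{L} \nabla_i R(t), \quad l = 1, \dots, L. \tag{2}$$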
Sharing Weights Brings Faster Convergence

Since the initialization and updates are the same for all layers, the parameters $W_l(t)$ are equal for all $l$. For simplicity, we denote the weight at time $t$ by $W(t)$. Note that because the gradients are averaged, the norm of the update to each layer does not scale with the number of layers $L$.
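The shared update can be sketched numerically as follows. The sketch assumes the squared Frobenius objective $\frac{1}{2}\|W_L \cdots W_1 - \Phi\|_F^2$, identity initialization, and a small hand-picked positive definite target; the helper names (`chain`, `layer_grads`, `train_shared`), the depth of 4, and the step size are illustrative assumptions.

```python
import numpy as np

def chain(mats, dim):
    """Return W_k ... W_1 for mats = [W_1, ..., W_k]; the empty product is the identity."""
    out = np.eye(dim)
    for W in mats:
        out = W @ out
    return out

def layer_grads(Ws, target):
    """Gradients of 0.5 * ||W_L ... W_1 - target||_F^2 with respect to each layer."""
    dim = target.shape[0]
    residual = chain(Ws, dim) - target
    grads = []
    for l in range(len(Ws)):
        above = chain(Ws[l + 1:], dim)  # layers applied after layer l
        below = chain(Ws[:l], dim)      # layers applied before layer l
        grads.append(above.T @ residual @ below.T)
    return grads

def train_shared(target, num_layers, steps, lr):
    """Gradient descent where every layer is updated with the averaged gradient."""
    dim = target.shape[0]
    Ws = [np.eye(dim) for _ in range(num_layers)]  # identical initialization
    for _ in range(steps):
        mean_grad = sum(layer_grads(Ws, target)) / num_layers
        Ws = [W - lr * mean_grad for W in Ws]
    return Ws

target = np.array([[2.0, 0.5], [0.5, 1.0]])  # a positive definite example target
Ws = train_shared(target, num_layers=4, steps=500, lr=0.02)
print("residual norm:", np.linalg.norm(chain(Ws, 2) - target))
```

Because all layers start equal and receive the same averaged gradient, they remain equal throughout, so the product is simply a matrix power of the single shared weight.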

We first consider the case where the target matrix $\Phi$ is positive definite. It is immediate that $W = \Phi^{1/L}$, with $L$ the number of layers, is a solution to the deep linear network. We study the convergence result with continuous-time gradient descent (with the extension to discrete-time gradient descent deferred to the appendix), which demonstrates the benefit of training with weight sharing when learning a positive definite matrix $\Phi$. We draw a comparison with training with zero-asymmetric (ZAS) initialization (Wu et al., 2019). To the best of our knowledge, ZAS gives the state-of-the-art convergence rate. It is actually the only work showing the global convergence of deep linear networks trained by gradient descent.

With continuous-time gradient descent (i.e. ), the training dynamics of continuous-time gradient descent can be described as

Fact 1 (Continuous-time gradient descent without weight sharing (Wu et al., 2019)).

For the deep linear network , the continuous time gradient descent with the zero-asymmetric initialization satisfies

Fact 1 shows that with the zero-asymmetric initialization, the continuous gradient descent linearly converges to the global optimal solution for general target matrix .

We have the following convergence result for training with weight sharing.

Theorem 1 (Continuous-time gradient descent with weight sharing).

For the deep linear network , initialize all

with identity matrix

and update according to Equation 2. With a positive definite target matrix , the continuous-time gradient descent satisfies .

Training with ZAS, the loss decays as , whereas for training with weight sharing, the loss is . The extra in the exponent demonstrates the acceleration of training with weight sharing. See Appendix A.5 for the extension to discrete-time gradient descent.

Remark 1.

The difference between convergence rates in Fact 1 and Theorem 1 is not an artifact of analysis. For example, when the target matrix is simply . It can be explicitly shown that with the initialization in Fact 1, we have while training with weight sharing (Theorem 1), we have . This implies that the convergence results in Fact 1 and Theorem 1 cannot be improved in general.

Sharing Weights Provides Good Initialization

In this subsection, we show that weight sharing can provably improve the initialization for training deep linear models, which can bring significant improvement on existing convergence results. Now we consider the case for which the target matrix may not be a positive definite matrix. In the rest of the analysis, we denote to be the loss induced by parameter . The local convergence result has been established as

Fact 2 (Theorem 1 of (Arora et al., 2018)).

For any initialization with and for some . With proper choice of step size , training with gradient descent converges to .

This result shows that starting with a good initialization, i.e. layer-wise similar and small initial loss, the deep linear model converges to the optimal solution. Next we show that training with weight-sharing navigates itself to a good initialization and easily improves the above result. We take reparameterization where is the stem-direction forced to be symmetric.

Theorem 2 (Weight-sharing provably improves initialization).

For any target matrix with relatively small distance to a positive definite matrix as . Initialize all to be identity matrix and train with weight-sharing, with proper choice of , the gradient descent converges to .

We show that when trained with weight sharing, the model first learns the commonly shared component of all layers, and converges to the symmetric matrix . After untying, the convergence then follows from Fact 2, as the weights are already close to the optimal solution. See full proof in the appendix.

It is interesting that weight-sharing easily converts the established local convergence result to global convergence for a large set of target matrix . This demonstrates that training with weight-sharing can find a good initialization, which brings huge benefits to subsequent training.

4 Related Work

Lan et al. (2020) propose ALBERT with the weights being shared across all its transformer layers. Large ALBERT models can achieve good performance on several natural language understanding benchmarks. Bai et al. (2019b) propose trellis networks which are temporal convolution networks with shared weights and obtain good results for language modeling. This line of work is then extended to deep equilibrium models (Bai et al., 2019a) which are equivalent to infinite-depth weight-tied feedforward networks. Dabre and Fujita (2019) show that the translation quality of a model that recurrently stacks a single layer is comparable to having the same number of separate layers. Zhang et al. (2020) also demonstrate the application of weight-sharing in neural architecture search.

There are also a large number of algorithms proposed for the fast training of deep learning models. Net2Net (Chen et al., 2016) achieves acceleration by training small models first and then transferring the knowledge to larger models, which can be viewed as sharing weights within the same layer. Similarly, Chang et al. (2018) propose to view the residual network as a dynamical system: they start training with a shallow model, then double the depth of the model by splitting each layer into two layers and halving the weights in the residual connection. Progressive stacking (Gong et al., 2019) focuses on the fast training of BERT. The algorithm also starts with training a shallow model, then grows the model by copying the shallow model and stacking new layers on top of it. It empirically shows great training acceleration. Dong et al. (2020) demonstrate that it is possible to train an adaptively growing network to achieve acceleration.

5 Experiments

In this section, we first present the experimental setup and results for training the BERT-large model with and without our proposed Sharing WEights (SWE) method. Then, we show ablation studies of how different untying point values ($\tau$) affect the final performance, how our method works with different model sizes, etc. In what follows, unless stated otherwise, BERT always means the BERT-large model.

| Metric | Wiki+Books 0.5M Baseline | Wiki+Books 0.5M SWE | Wiki+Books 1M Baseline | Wiki+Books 1M SWE | XLNet 0.5M Baseline | XLNet 0.5M SWE | XLNet 1M Baseline | XLNet 1M SWE |
|---|---|---|---|---|---|---|---|---|
| Pretrain MLM (acc.%) | 73.66 | 73.92 | 74.98 | 75.09 | 70.06 | 70.18 | 71.75 | 71.76 |
| SQuAD average | 87.27 | 88.34 | 88.82 | 89.51 | 88.85 | 89.20 | 89.93 | 90.09 |
| GLUE average | 78.17 | 78.98 | 79.30 | 80.03 | 79.53 | 79.97 | 80.20 | 80.83 |
| SQuAD v1.1 (F-1%) | 91.54 | 92.54 | 92.58 | 92.81 | 92.41 | 92.82 | 93.37 | 93.55 |
| SQuAD v2.0 (F-1%) | 82.99 | 84.14 | 85.06 | 86.20 | 85.28 | 85.57 | 86.49 | 86.63 |
| GLUE/AX (corr%) | 39.0 | 40.1 | 42.3 | 43.8 | 43.2 | 43.7 | 44.2 | 46.8 |
| GLUE/MNLI-m (acc.%) | 85.9 | 87.2 | 86.9 | 87.8 | 87.3 | 88.4 | 88.6 | 88.8 |
| GLUE/MNLI-mm (acc.%) | 85.4 | 86.7 | 86.1 | 87.4 | 87.4 | 87.4 | 87.9 | 88.5 |
| GLUE/QNLI (acc.%) | 92.3 | 93.4 | 93.5 | 94.0 | 91.9 | 92.8 | 92.3 | 93.1 |
| GLUE/QQP (F-1%) | 72.1 | 71.7 | 72.2 | 72.3 | 71.7 | 72.1 | 72.3 | 72.1 |
| GLUE/SST-2 (acc.%) | 94.3 | 94.8 | 94.8 | 94.9 | 95.7 | 95.4 | 95.9 | 95.7 |

Table 1: Experiment results using BERT. Column headers give the pre-training dataset (Wiki+Books = English Wikipedia + BookCorpus; XLNet = XLNet data), the number of pre-training iterations (0.5M or 1M), and the training method. Models trained with SWE consistently outperform models trained without SWE.

5.1 Experimental Setup

We use the TensorFlow official implementation of BERT (TensorFlow team). We first show experimental results with English Wikipedia and BookCorpus for pre-training as in the original BERT paper (Devlin et al., 2019). We then move to the XLNet enlarged pre-training dataset (Yang et al., 2019). We preprocess all datasets with WordPiece tokenization (Schuster and Nakajima, 2012). We mask 15% of the tokens in each sequence. For experiments on English Wikipedia and BookCorpus, we randomly choose tokens to mask. For experiments on the XLNet dataset, we do whole word masking: when a word is broken into multiple tokens, either all of its tokens are masked or none are. For all experiments, we set both the batch size and sequence length to 512.
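To make the difference between token-level and whole word masking concrete, the following sketch masks complete words only. It assumes the usual WordPiece convention that continuation pieces start with `##`, and it simplifies the procedure by always substituting a `[MASK]` token; the function name `whole_word_mask` and the greedy budget rule are hypothetical and are not taken from the official implementation.

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words until roughly mask_rate of the tokens are covered."""
    rng = random.Random(seed)
    # Group token indices into words: a "##" piece continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    budget = max(1, round(mask_rate * len(tokens)))
    rng.shuffle(words)
    masked = list(tokens)
    used = 0
    for word in words:
        if used + len(word) > budget:
            continue  # skip words that would exceed the masking budget
        for i in word:
            masked[i] = mask_token  # all pieces of the word are masked together
        used += len(word)
    return masked

example = ["the", "un", "##believ", "##able", "story", "of",
           "token", "##ization", "ends", "here"]
print(whole_word_mask(example, seed=1))
```

The point of the sketch is the grouping step: a word such as "un ##believ ##able" is either masked in full or left untouched, whereas token-level masking could hide a single piece of it.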

| Metric | Emb. 768, 0.5M Baseline | Emb. 768, 0.5M SWE | Emb. 768, 1M Baseline | Emb. 768, 1M SWE | Emb. 1536, 0.5M Baseline | Emb. 1536, 0.5M SWE | Emb. 1536, 1M Baseline | Emb. 1536, 1M SWE |
|---|---|---|---|---|---|---|---|---|
| Pretrain MLM (acc.%) | 71.68 | 72.03 | 73.00 | 73.10 | 75.61 | 75.79 | 77.23 | 77.16 |
| SQuAD average | 86.65 | 87.00 | 87.38 | 87.97 | 88.32 | 89.68 | 89.38 | 89.79 |
| GLUE average | 77.30 | 77.98 | 78.13 | 79.28 | 79.00 | 80.08 | 80.37 | 80.88 |
| SQuAD v1.1 (F-1%) | 91.29 | 91.51 | 91.95 | 91.99 | 92.29 | 93.19 | 92.76 | 93.30 |
| SQuAD v2.0 (F-1%) | 82.00 | 82.48 | 82.80 | 83.95 | 84.35 | 86.16 | 86.00 | 86.28 |
| GLUE/AX (corr%) | 36.4 | 38.8 | 37.8 | 42.9 | 41.7 | 44.2 | 44.6 | 46.7 |
| GLUE/MNLI-m (acc.%) | 85.4 | 86.0 | 86.1 | 87.0 | 87.1 | 87.8 | 88.1 | 88.5 |
| GLUE/MNLI-mm (acc.%) | 85.0 | 85.9 | 86.0 | 86.6 | 86.5 | 87.4 | 87.2 | 87.8 |
| GLUE/QNLI (acc.%) | 92.0 | 92.5 | 93.0 | 92.8 | 92.9 | 93.6 | 94.1 | 94.1 |
| GLUE/QQP (F-1%) | 71.5 | 71.5 | 71.8 | 72.0 | 71.8 | 71.9 | 72.6 | 72.3 |
| GLUE/SST-2 (acc.%) | 93.5 | 93.2 | 94.1 | 94.4 | 94.0 | 95.6 | 95.6 | 95.9 |

Table 2: Experiment results using BERT with alternative embedding sizes, on the English Wikipedia + BookCorpus dataset. Column headers give the embedding size, the number of pre-training iterations (0.5M or 1M), and the training method. For results with the original embedding size of 1024, see Table 1. The proposed SWE approach works for different embedding sizes.

| Metric | 0.5M Baseline | 0.5M SWE | 1M Baseline | 1M SWE |
|---|---|---|---|---|
| Pretrain MLM (acc.%) | 68.74 | 69.21 | 69.86 | 70.16 |
| SQuAD average | 82.06 | 82.91 | 85.01 | 85.50 |
| GLUE average | 76.17 | 76.72 | 76.40 | 77.55 |
| SQuAD v1.1 (F-1%) | 88.29 | 89.23 | 90.32 | 90.40 |
| SQuAD v2.0 (F-1%) | 75.82 | 76.59 | 79.69 | 80.59 |
| GLUE/AX (corr%) | 35.3 | 36.2 | 34.7 | 36.4 |
| GLUE/MNLI-m (acc.%) | 83.9 | 84.3 | 84.4 | 85.2 |
| GLUE/MNLI-mm (acc.%) | 83.0 | 83.5 | 83.5 | 84.9 |
| GLUE/QNLI (acc.%) | 90.9 | 91.6 | 91.5 | 92.9 |
| GLUE/QQP (F-1%) | 70.9 | 71.2 | 71.4 | 71.5 |
| GLUE/SST-2 (acc.%) | 93.0 | 93.5 | 92.9 | 94.4 |

Table 3: Experiment results using BERT-base (12-layer with embedding size 768), on the English Wikipedia + BookCorpus dataset. Column headers give the number of pre-training iterations (0.5M or 1M) and the training method. The proposed SWE method also improves the performance of BERT-base.

We use the AdamW optimizer (Loshchilov and Hutter, 2019) with the weight decay rate being , , and . For English Wikipedia and BookCorpus, we use Pre-LN (He et al., 2016; Wang et al., 2019b) instead of the original BERT’s Post-LN. Note that the correct implementation of Pre-LN contains a final layer-norm right before the final classification/masked language modeling layer. Unlike the claim made by (Xiong et al., 2020), we notice that using Pre-LN with learning rate warmup leads to better baseline performance, as opposed to not using learning rate warmup. In our implementation, the learning rate starts from , linearly increases to the peak value of (the same learning rate used by (Xiong et al., 2020)) at the -th iteration, and then linearly decays to at the end of the training. For the XLNet dataset, we apply the same Pre-LN setup except the peak learning chosen to be . The peak learning rate of makes training unstable here and yields worse performance than .

We adopt the training procedure in the TensorFlow official implementation in which a BERT model is trained for 1 million iteration steps (both on English Wikipedia plus BookCorpus and on the XLNet dataset) with a constant batch size of 512. We also report the results of training for a half-million steps, to show how our method performs with less computational resources. When applying the proposed SWE, we train the model with weights shared for 10% of the total number of steps, then train with weights untied for the rest of the steps.
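The schedule described above (linear warmup to a peak, linear decay to the end of training, and untying after 10% of the steps) can be sketched as below. The one-million-step budget and the 10% untying point come from the text; the warmup length and the peak learning rate in the usage lines are placeholder values chosen only for illustration, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(step, total_steps, warmup_steps, peak_lr, init_lr=0.0, final_lr=0.0):
    """Linear warmup from init_lr to peak_lr, then linear decay to final_lr."""
    if step < warmup_steps:
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    decay_steps = total_steps - warmup_steps
    return peak_lr + (final_lr - peak_lr) * (step - warmup_steps) / decay_steps

total_steps = 1_000_000            # training budget from the text
untie_step = total_steps // 10     # weights are shared for the first 10% of steps
warmup_steps = 10_000              # placeholder value, not from the paper
peak_lr = 1e-4                     # placeholder value, not from the paper

for step in (0, warmup_steps, untie_step, total_steps):
    phase = "shared" if step < untie_step else "untied"
    print(step, phase, learning_rate(step, total_steps, warmup_steps, peak_lr))
```

Keeping the untying point as a fraction of the total budget makes it straightforward to reuse the same rule for the half-million-step runs.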

| Metric | Baseline 24x1 (0.5M iter.) | SWE 12x2 (0.5M iter.) | SWE 6x4 (0.5M iter.) | SWE 4x6 (0.5M iter.) | SWE 2x12 (0.5M iter.) | SWE 1x24 (0.5M iter.) | Baseline 24x1 (1M iter.) |
|---|---|---|---|---|---|---|---|
| Pretrain MLM (acc.%) | 73.66 | 73.82 | 73.99 | 73.90 | 74.16 | 73.92 | 74.98 |
| SQuAD average | 87.27 | 87.96 | 88.59 | 89.17 | 88.75 | 88.34 | 88.82 |
| GLUE average | 78.17 | 78.87 | 79.12 | 78.82 | 79.47 | 78.98 | 79.30 |
| SQuAD v1.1 (F-1%) | 91.54 | 92.18 | 92.62 | 92.52 | 92.44 | 92.54 | 92.58 |
| SQuAD v2.0 (F-1%) | 82.99 | 83.74 | 84.56 | 85.82 | 85.06 | 84.14 | 85.06 |
| GLUE/AX (corr%) | 39.0 | 40.7 | 40.3 | 40.2 | 43.0 | 40.1 | 42.3 |
| GLUE/MNLI-m (acc.%) | 85.9 | 86.9 | 87.9 | 87.0 | 87.1 | 87.2 | 86.9 |
| GLUE/MNLI-mm (acc.%) | 85.4 | 85.8 | 87.0 | 86.4 | 86.4 | 86.7 | 86.1 |
| GLUE/QNLI (acc.%) | 92.3 | 93.1 | 93.2 | 92.9 | 93.8 | 93.4 | 93.5 |
| GLUE/QQP (F-1%) | 72.1 | 72.0 | 72.0 | 71.6 | 71.8 | 71.7 | 72.2 |
| GLUE/SST-2 (acc.%) | 94.3 | 94.7 | 94.3 | 94.8 | 94.7 | 94.8 | 94.8 |

Table 4: We group several consecutive layers as a weight-sharing unit instead of sharing weights only across original layers. “AxB” means grouping A layers as a unit which is shared B times. Models are trained on English Wikipedia and BookCorpus. Note that SWE 4x6 and 2x12 trained for a half million iterations achieve fine-tuning results similar to the non-SWE baseline trained for one million iterations.

After pre-training, we fine-tune the models for the Stanford Question Answering Dataset (SQuAD v1.1 and SQuAD v2.0) (Rajpurkar et al., 2016) and the GLUE benchmark (Wang et al., 2019a). For all fine-tuning tasks, we follow the setting as in the literature: the model is fine-tuned for 3 epochs; the learning rate warms up linearly from 0.0 to peak in the first 10% of the training iterations, then linearly decays to

. We select the best peak learning rate from based on performance on the validation set. For the SQuAD datasets, we fine-tune each model 5 times and report the average. For the GLUE benchmark, for each training method, we simply train one BERT model and submit the model’s predictions over the test sets to the GLUE benchmark website to obtain test results. We observed that when using Pre-LN, the GLUE fine-tuning process is stable and only rarely leads to divergence.

5.2 Experiment Results

Both pre-training and fine-tuning results of our method vs. the baseline method are shown in Table 1. We see that our method consistently outperforms baseline methods, especially on fine-tuning tasks. We show training loss curves in Figure 3. The training loss is high when the weights are shared. The loss drops significantly after untying. Eventually, SWE yields a slightly lower training loss than the non-SWE baseline.

Figure 3: BERT pre-training loss curves. In the first 100k steps of SWE training, weights are shared and the pre-training loss is high. After untying, the loss improves significantly. Eventually, SWE yields a slightly lower pre-training loss, and significantly better fine-tuning results. This matches the observation in our illustrative example for linear regression (Figure 2).

5.3 Ablation studies

In this section, we study how our method performs across different experimental settings.

5.3.1 How SWE Works with Different Model Sizes

According to experiments shown in the ALBERT paper (Lan et al., 2020), ALBERT-base with an embedding size of 128 performs favorably compared to ALBERT-base with a smaller or larger embedding size. In this section, we conduct experiments to see if the performance of our SWE training method is also related to the width of the model.

In particular, in addition to the conventional embedding size of 1024 in BERT, we experiment with alternative embedding sizes of 768 and 1536. We also scale the number of hidden units in the feed-forward block, and the size of each self-attention head accordingly. For each model size, we select the best peak learning rate from . The resulting peak learning rates adopted for embedding sizes of 768 and 1536 are and , respectively. Experiment results are shown in Table 2. The proposed SWE approach improves the performance of BERT with different embedding sizes.

Additionally, we experiment with the BERT-base architecture which contains 12 transformer layers, with an embedding size of 768. We keep other experimental settings such as the learning rate schedule unchanged. Results are shown in Table 3. The proposed SWE method also improves the performance of the BERT-base consistently.

5.3.2 When to Stop Weight Sharing

In this section, we study the effects of using different untying points (Algorithm 1). If weights are shared throughout the entire pre-training process, the final performance will be much worse than without any form of weight sharing (Lan et al., 2020). On the other hand, training without any weight sharing yields worse generalization ability.

Results of using different untying point values are summarized in Figure 4. BERT models are trained for 500k iterations on English Wikipedia and BookCorpus. From the results, we see that untying at around 10% of the total number of training steps leads to the best performance. This is the reason we untie at 50k / 100k steps when training BERT for a half-million / one million steps, respectively.

Figure 4: Selecting the untying point from 0 (non-weight-sharing baseline) to the total number of training steps (ALBERT-like). The optimal untying point is around 10% of the total number of training steps.

5.3.3 How to choose weight sharing units

Note that weight sharing need not be restricted to the original layers. We can group several consecutive layers as a weight-sharing unit. We denote by “AxB” the grouping of A layers as a weight-sharing unit which is shared B times. Since BERT has 24 layers, the baseline method without weight sharing can be viewed as “24x1”, and our method shown in Table 1 can be viewed as “1x24”. We present results for more choices of weight-sharing units in Table 4. We can see that, in order to achieve good results, the size of the chosen weight-sharing unit should not be larger than 6 layers. This means that the weights of a layer must be shared at least 4 times.

6 Conclusion

We proposed a simple weight sharing method to speed up the learning of the BERT model and showed promising empirical results. In particular, our method demonstrated consistent improvement for BERT training and its performance on various downstream tasks. Our method is motivated by the successes of weight sharing models in the literature, and is validated with theoretic analysis. For future work, we will extend our empirical studies to other deep learning models and tasks, and analyze under which conditions our method will be helpful.


  • D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in english and mandarin. In International conference on machine learning, pp. 173–182. Cited by: §1.
  • S. Arora, N. Cohen, N. Golowich, and W. Hu (2018) A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281. Cited by: Fact 2.
  • J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §1.
  • S. Bai, J. Z. Kolter, and V. Koltun (2019a) Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 690–701. Cited by: §4.
  • S. Bai, J. Z. Kolter, and V. Koltun (2019b) Trellis networks for sequence modeling. In International Conference on Learning Representations, Cited by: §4.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: §1.
  • B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert (2018) Multi-level residual networks from dynamical systems view. In International Conference on Learning Representations, Cited by: §1, §4.
  • T. Chen, I. Goodfellow, and J. Shlens (2016) Net2net: accelerating learning via knowledge transfer. In International Conference on Learning Representations, Cited by: §B.1, Table 5, §1, §4.
  • Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803. Cited by: §1.
  • R. Dabre and A. Fujita (2019) Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6292–6299. Cited by: §4.
  • S. De and S. L. Smith (2020) Batch normalization biases deep residual networks towards shallow paths. arXiv preprint arXiv:2002.10444. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pp. 4171–4186. Cited by: §1, §5.1.
  • C. Dong, L. Liu, Z. Li, and J. Shang (2020) Towards adaptive residual network training: a neural-ODE perspective. In International Conference on Machine Learning, Cited by: §4.
  • L. Gong, D. He, Z. Li, T. Qin, L. Wang, and T. Liu (2019) Efficient training of BERT by progressively stacking. In International Conference on Machine Learning, pp. 2337–2346. Cited by: §1, §4.
  • P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European conference on computer vision, pp. 630–645. Cited by: §5.1.
  • J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §1.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: §2.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2020) ALBERT: a lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, Cited by: §1, §4, §5.3.1, §5.3.2.
  • D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen (2020) GShard: scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668. Cited by: §1.
  • S. Lippl, B. Peters, and N. Kriegeskorte (2020) Iterative convergent computation is not a useful inductive bias for resnets. In Submission to International Conference on Learning Representations, Cited by: §B.1, Table 5.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations, Cited by: §5.1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67. Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, Cited by: §5.1.
  • T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Conference on Neural Information Processing Systems (NeurIPS), Cited by: §1.
  • M. Schuster and K. Nakajima (2012) Japanese and Korean voice search. In IEEE International Conference on Acoustics, Speech and Signal Processing, Cited by: §5.1.
  • N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, et al. (2018) Mesh-TensorFlow: deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414–10423. Cited by: §1.
  • TensorFlow team. TensorFlow model garden NLP. github.com/tensorflow/models/tree/master/official. Cited by: §5.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019a) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, Cited by: §5.1.
  • Q. Wang, B. Li, T. Xiao, J. Zhu, C. Li, D. F. Wong, and L. S. Chao (2019b) Learning deep transformer models for machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1810–1822. Cited by: §5.1.
  • L. Wu, Q. Wang, and C. Ma (2019) Global convergence of gradient descent for deep linear residual networks. In Advances in Neural Information Processing Systems, pp. 13389–13398. Cited by: §3.2, Fact 1, Fact 3.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §1.
  • R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu (2020) On layer normalization in the transformer architecture. In International Conference on Machine Learning, Cited by: §5.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5753–5763. Cited by: §5.1.
  • Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C. Hsieh (2020) Large batch optimization for deep learning: training bert in 76 minutes. In International Conference on Learning Representations, Cited by: §1.
  • Y. Zhang, Z. Lin, J. Jiang, Q. Zhang, Y. Wang, H. Xue, C. Zhang, and Y. Yang (2020) Deeper insights into weight sharing in neural architecture search. arXiv preprint arXiv:2001.01431. Cited by: §4.

Appendix A Proofs of Section 3

A.1 Proof of Proposition 1


Let be the -dimensional vector with all elements being 1. Let . For simplicity, denote , and , where are scalars. For the training set , setting the gradient of to 0, we have

It implies the solution of to be

Note that, by definition, is orthogonal to . For generated from a Gaussian distribution with identity covariance, we know that are independent Gaussian random variables with variance . Therefore we have

which implies for the weight sharing solution , we have

A.2 Proof of Theorem 1


With weight sharing, we have

For the loss function , we have

By continuous gradient descent with , it is easy to see that

Therefore we have

A.3 Technical Lemma

Lemma 1.

Initializing and training with weight sharing update (Equation 2), by setting , we have


With weight sharing, we know that has the same eigenvectors as . Take any eigenvector, and denote by and the corresponding eigenvalues of and . Thus

For , we would like to show by setting . Since , we know this claim holds trivially at . Now suppose the claim holds for ; then equals

To make , we set and can upper bound by

And guarantees that . By induction, when , we have.

Similarly, for , we would like to show by setting . Note again the claim holds trivially when . Suppose , we can lower bound by

And guarantees that . By induction, when , we have.

Since the two claims hold for all , it follows directly that, by setting , we have

which completes the proof. ∎

A.4 Proof of Theorem 2


Denote . Focusing on the first stage of weight-sharing training, we have . The gradient of for is

which implies that the gradient of is

Therefore, the first stage of weight-sharing is equivalent to training towards . For simplicity, define , where stands for “symmetry”. Further, is positive definite as . We have proved in Theorems 1 and 3 that learning a positive definite matrix with identity initialization by weight sharing is easy and fast.

For the at the end of the first stage in weight-sharing, we have , and . Combining with , we have . Thus satisfies all requirements for initialization in Fact 2. And the convergence to in the second stage of weight-sharing follows immediately. ∎

A.5 Discrete-time gradient descent

One can extend the previous result to discrete-time gradient descent with a positive constant step size . It can be shown that with zero-asymmetric initialization, training with gradient descent will achieve within steps; initializing and training with weight sharing, the deep linear network will learn a positive definite matrix to within steps, which reduces the required iterations by a factor of . Formally, see the convergence result of ZAS in Fact 3 and the convergence result of weight sharing in Theorem 3.

Fact 3 (Discrete-time gradient descent without weight sharing (Wu et al., 2019)).

For deep linear network with zero-asymmetric initialization and discrete-time gradient descent, if the learning rate satisfies , where , then we have linear convergence .

Since the learning rate is , Fact 3 indicates that the gradient descent can achieve within steps.

Theorem 3 (Discrete-time gradient descent with weight sharing).

For the deep linear network , initialize all with identity matrix and update according to Equation 2. With a positive definite target matrix , and setting , we have linear convergence .

Taking as constants and focusing on the scaling with , we have . Because of the extra in the exponent, when learning a positive definite matrix , training with weight sharing can achieve within steps. The dependency on reduces from the previous to linear, which shows the acceleration of training by weight sharing.
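The setting of Theorem 3 can be explored numerically with a small sketch: a deep linear network whose layers all share one matrix, initialized at the identity and trained by discrete-time gradient descent toward a positive definite target. Equation 2 is not reproduced in this appendix, so the update below assumes that every layer receives the average of the per-layer gradients; the function names, step size, depth, and target spectrum are likewise illustrative choices, not values from the paper.

```python
import numpy as np

def train_shared_deep_linear(target, depth, lr, steps):
    """Gradient descent on 0.5 * ||W^depth - target||_F^2 with one shared W."""
    w = np.eye(target.shape[0])  # identity initialization
    losses = []
    for _ in range(steps):
        residual = np.linalg.matrix_power(w, depth) - target
        losses.append(0.5 * np.sum(residual ** 2))
        # Per-layer gradients of the untied network, evaluated at tied weights:
        # (W^(depth-1-k))^T @ residual @ (W^k)^T for k = 0, ..., depth-1.
        grads = [
            np.linalg.matrix_power(w, depth - 1 - k).T
            @ residual
            @ np.linalg.matrix_power(w, k).T
            for k in range(depth)
        ]
        # Assumed sharing rule: all layers take the averaged gradient.
        w = w - lr * np.mean(grads, axis=0)
    return w, losses

def random_positive_definite(dim, low=0.5, high=2.0, seed=0):
    """Symmetric matrix with eigenvalues evenly spaced in [low, high]."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q @ np.diag(np.linspace(low, high, dim)) @ q.T

target = random_positive_definite(4)
w, losses = train_shared_deep_linear(target, depth=4, lr=0.05, steps=500)
```

In this toy run the shared matrix stays symmetric and commutes with the target throughout training, which matches the observation in the proof of Lemma 1 that the shared weight has the same eigenvectors as the target. The sketch does not attempt to verify the step-count bounds themselves.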

A.6 Proof of Theorem 3


Denote . Training with weight sharing, we have

By Lemma 1, setting , we have

Denote , we immediately have

With one step of gradient update, we have

where the denotes the element-wise multiplication. Let

where the matrix comes from

We have

Note that

Using the fact that for ,

Take , we have


Thus we have