
Blockwise Self-Attention for Long Document Understanding

We present BlockBERT, a lightweight and efficient BERT model designed to better model long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and training time, which also enables attention heads to capture either short- or long-range contextual information. We conduct experiments on several benchmark question answering datasets with various paragraph lengths. Results show that BlockBERT uses 18.7-36.1% less memory and reduces the training time by 12.0-25.1%, while achieving comparable and sometimes better prediction accuracy, compared to an advanced BERT-based model, RoBERTa.



1 Introduction

The recent emergence of the pre-training and fine-tuning paradigm, exemplified by methods like ELMo (peters2018deep), GPT-2 (radford2019language), BERT (devlin2019bert), XLNet (yang2019xlnet) and RoBERTa (liu2019roberta), has drastically reshaped the landscape of natural language processing research. These methods first pre-train a deep model with language model objectives using a large corpus and then fine-tune the model using in-domain supervised data for target applications. Despite its conceptual simplicity, this paradigm has established new state-of-the-art baselines across various tasks, such as question answering (devlin2019bert), coreference resolution (joshi2019bert), relation extraction (soares2019matching) and text retrieval (lee2019latent; nogueira2019passage), to name a few.

Building such models in practice, however, is an extremely resource-intensive process. For instance, the training of BERT-family models is notoriously expensive. devlin2019bert report that it takes four days to pre-train BERT-Base/BERT-Large on 4/16 Cloud TPUs, respectively. In order to reduce the pre-training time of RoBERTa to 1 day, liu2019roberta use 1,024 V100 GPUs. One crucial factor that contributes to the long training time is the memory consumption of these deep models, as it directly affects the batch size. Although the fine-tuning stage is relatively inexpensive, the memory issue still restricts the scenarios in which BERT can be used. For instance, “it is currently not possible to re-produce most of the BERT-Large results on the paper using a GPU with 12GB-16GB of RAM, because the maximum batch size that can fit in memory is too small.”

Although one may think that model size is the main contributor to the large memory consumption, our analysis (Section 2.1) shows that one of the main bottlenecks is actually dot-product self-attention, operated in multiple layers of Transformers (vaswani2017attention), the building block of BERT. As the attention operation is quadratic to the sequence length, this fundamentally limits the maximum length of the input sequence, and thus restricts the model capacity in terms of capturing long-distance dependencies. As a result, downstream tasks have to either truncate their sequences to leading tokens (nogueira2019passage) or split their sequences with a sliding window (joshi2019spanbert; joshi2019bert). Ad-hoc handling of long sequences is also required in the pre-training stage, such as updating the model using only short sequences in the early stage (devlin2019bert).

Common strategies for reducing memory consumption, unfortunately, do not work. For instance, shrinking the model by lowering the number of layers $L$, attention heads $A$, or hidden units $H$ leads to significant performance degradation (vaswani2017attention; devlin2019bert) and does not address the long sequence issue. Alternatively, general low-memory training techniques, such as microbatching (huang2018gpipe) and gradient checkpointing (chen2016training), essentially trade off training time for memory consumption, which prolongs the already lengthy training process.

In this work, we explore a different strategy, sparsifying the attention layers, intending to design a lightweight and effective BERT that can model long sequences in a memory-efficient way. Our BlockBert extends BERT by introducing sparse block substructures into the attention matrix to reduce both memory consumption and the number of floating point operations (FLOPs), which also enables attention heads to capture either short- or long-range contextual information. Compared to the previous method that also enforces sparsity (e.g., child2019generating), our approach is much simpler mathematically and very easy to implement. More importantly, the results of experiments conducted on several benchmark question answering datasets with various paragraph lengths show that BlockBert performs comparably or even better than the original BERT-family models, while enjoying an 18.7-36.1% reduction in memory usage and 12.0-25.1% reduction in training time.

The rest of the paper is organized as follows. Section 2 gives a brief introduction of the BERT model, along with an in-depth analysis of its memory usage during training time. We describe our proposed model in Section 3 and contrast it with existing methods that aim for creating a lighter model. Section 4 presents the experimental results and ablation studies, followed by a short survey of other related work in Section 5 and the conclusion in Section 6.

2 Background: Memory Bottleneck in Training BERT

We briefly review BERT and introduce its memory profiling in this section. Following the paradigm of language model pre-training and downstream task fine-tuning, BERT (devlin2019bert) consists of multiple layers of bidirectional Transformers (vaswani2017attention), where each Transformer encoder has a multi-head self-attention layer and a position-wise feed-forward layer. Using the same notation as in (devlin2019bert), we denote the number of Transformer layers by $L$, the number of hidden units by $H$, the number of attention heads by $A$, the sequence length by $N$, and the batch size by $B$. We also assume the feed-forward hidden unit size to be $4H$. (The default parameter settings for BERT-Base and BERT-Large can be found in Figure 6 in Appendix A.1.)

2.1 Memory Profiling

Training BERT is a memory-intensive process. In order to identify the bottleneck, we follow the memory model proposed by sohoni2019low, where the memory usage throughout neural network training is categorized into three main types: (1) Model Memory is used to store model parameters; (2) Optimizer Memory is the additional memory used by the specific learning algorithm during the process; (3) Activation Memory consists of the outputs of each layer, which are cached for reuse in backpropagation to compute gradients.

Take BERT-Base training as an example. The model has 110M parameters, so the model memory uses 0.2 GB if stored in half-precision floating-point format (FP16). For Adam (kingma2014adam), the optimizer needs additional memory to store the gradients, first moments, and second moments of model parameters. If stored using the same precision, the optimizer memory should be three times the model memory. In the current PyTorch Adam implementation, however, the first and second moments are stored in single precision. Consequently, BERT’s optimizer memory (1 GB) is five times the model memory (0.2 GB).
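As a rough check of this arithmetic, the sketch below recomputes model and optimizer memory from the parameter count. The function name and its arguments are illustrative, and it assumes FP16 weights and gradients, FP32 Adam moments, and 1 GB = 10^9 bytes, so the numbers only approximate the profiled values.

```python
def bert_memory_gb(num_params, weight_bytes=2, grad_bytes=2, moment_bytes=4):
    """Estimate (model, optimizer) memory in GB, taking 1 GB = 1e9 bytes.

    Assumption: the optimizer holds one gradient and two Adam moments
    per parameter, as described in the text.
    """
    model = num_params * weight_bytes / 1e9
    optimizer = num_params * (grad_bytes + 2 * moment_bytes) / 1e9
    return model, optimizer


# BERT-Base: FP16 weights and gradients, FP32 first and second moments.
model_gb, opt_gb = bert_memory_gb(110_000_000)
print(f"model {model_gb:.2f} GB, optimizer {opt_gb:.2f} GB, "
      f"ratio {opt_gb / model_gb:.0f}x")

# If the moments were also kept in FP16, the ratio would drop to three.
same_model_gb, same_opt_gb = bert_memory_gb(110_000_000, moment_bytes=2)
```

With these assumptions the mixed-precision case gives a ratio of five and the same-precision case a ratio of three, matching the two statements above.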

Calculating the exact size of activation memory is not trivial because it depends heavily on the implementation of the toolkit. Instead, we measure it empirically by training BERT-Base using Adam with a memory profiler (more details are provided in Appendix A.2).

We use 32 NVIDIA V100 GPUs for training. Each single GPU thus consumes a mini-batch of size $b = B/32 = 8$. Figure 1(a) shows the profiling result for a single GPU, where the model/optimizer/activation memory consumes 0.21/1.03/8.49 GB, resp. We can see that activation memory accounts for the vast majority of the total GPU memory (87.6%) and is clearly the bottleneck. Notice that although our analysis is done on BERT-Base, it can be easily generalized to BERT-Large and other models such as RoBERTa (liu2019roberta) and XLNet (yang2019xlnet).

(a) BERT-Base Training Memory Profiling
(b) Regression Analysis on Activation Memory
Figure 1: Memory Profiling for BERT

2.2 A Regression Analysis on Activation Memory

For BERT, or more specifically, Transformer, the activation memory corresponds to intermediate results of different layers. It grows linearly in all the model hyper-parameters except the sequence length $N$, in which it grows quadratically due to the attention layers. To quantify more clearly the $O(N)$ and $O(N^2)$ components in the activation memory, we conduct a regression analysis as follows. Assume that the activation memory (in each GPU) is a polynomial $a_2 b N^2 + a_1 b N + a_0$, where $b$ is the batch size in each GPU. If we fix the total number of tokens in a GPU, i.e., $bN$, to be constant (in our case, $bN = 4096$), we should have a linear function w.r.t. $N$, i.e., $4096 a_2 N + 4096 a_1 + a_0$. We enumerate several values of $N$ in our experiments, and plot the corresponding profiled activation memory in Figure 1(b). Using ordinary least squares (OLS), with $bN = 4096$, the estimated linear function for activation memory is approximately $0.00715 \times N + 4.83$, where the first term is responsible for the $O(N^2)$ component. When $N = 512$, we can see that for BERT-Base, the $O(N^2)$ component accounts for 3.66 GB and the $O(N)$ component accounts for 4.83 GB. When the sequence length $N$ increases to 1024, however, the $O(N^2)$ component increases to 7.32 GB, while the $O(N)$ component is unchanged.
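The fitted line can be evaluated directly to see how the two components scale. In the sketch below the slope is not taken from the raw profiling data; it is derived from the reported 3.66 GB quadratic component at sequence length 512, and the function name is illustrative.

```python
def activation_components_gb(seq_len, slope=3.66 / 512, intercept=4.83):
    """Split estimated activation memory into its linear and quadratic parts.

    Assumes a fixed token budget per GPU, so the O(N) part is the constant
    intercept and the O(N^2) part grows linearly with the sequence length.
    The default slope is derived from the reported 3.66 GB at N = 512.
    """
    linear_part = intercept
    quadratic_part = slope * seq_len
    return linear_part, quadratic_part


for n in (512, 1024):
    linear_gb, quadratic_gb = activation_components_gb(n)
    print(f"N={n}: O(N) part {linear_gb:.2f} GB, O(N^2) part {quadratic_gb:.2f} GB")
```

Doubling the sequence length at a fixed token budget doubles only the quadratic part, which is why that part dominates for long inputs.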

2.3 General Techniques for Reducing Memory Usage in Model Training

Observing that activation memory is the bottleneck, we discuss the effectiveness of common memory reduction techniques for BERT training below.

Low Precision (micikevicius2017mixed): Low-precision training uses half-precision or mixed-precision arithmetic for training neural networks. This technique has been widely used in Transformer training (ott2019fairseq; liu2019roberta). In this work, we already assume mixed-precision training by default, as indicated in the aforementioned analysis.

Microbatching (huang2018gpipe): Microbatching is to split a batch into small micro-batches (which can be fit into memory), and then run forward and backward passes on them separately with gradients for each micro-batch accumulated. Because it runs forward/backward pass multiple times for a single batch, it trades off time for memory.

Gradient Checkpointing (chen2016training): Gradient checkpointing saves memory by only caching activations of a subset of layers. The un-cached activations will be recomputed during backpropagation from the latest checkpoint. This strategy trades off time for memory by repeating computations that require large memory and will obviously extend the model training time.

Knowledge Distillation (hinton2015distilling): Knowledge distillation aims to compress and transfer knowledge from a teacher model to a simpler student model. However, knowledge distillation relies on a teacher model (which is still expensive in training time) and usually suffers from a certain degree of performance degradation.
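To make the time-for-memory trade-off of microbatching concrete, here is a toy sketch. It uses a hypothetical one-parameter linear model with a mean-squared-error loss, not BERT. It shows that accumulating gradients over micro-batches reproduces the full-batch gradient, at the cost of several smaller passes.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def microbatched_grad(w, xs, ys, micro_size):
    """Accumulate micro-batch gradients, each weighted by its share of the batch."""
    total = 0.0
    for start in range(0, len(xs), micro_size):
        mx = xs[start:start + micro_size]
        my = ys[start:start + micro_size]
        # Only one micro-batch is processed (and held in memory) at a time.
        total += grad_mse(w, mx, my) * len(mx) / len(xs)
    return total


xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2, 6.8, 8.1]
full = grad_mse(0.3, xs, ys)
accumulated = microbatched_grad(0.3, xs, ys, micro_size=2)
print(abs(full - accumulated) < 1e-9)
```

The accumulated gradient matches the full-batch one up to floating-point error, so the technique changes peak memory and wall-clock time but not the update itself.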

As common techniques are limited in reducing both the training time and memory usage, we investigate how to optimize the dot-product attention layers and introduce our approach next.

3 Model: BlockBert

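Following (vaswani2017attention), the dot-product attention in Transformer is defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}}\right) V,$$

where $Q, K, V \in \mathbb{R}^{N \times d}$, with $N$ the sequence length and $d$ a hidden dimension. As we can see, the inner product between $Q$ and $K$ consumes $O(N^2)$ memory. One simple way to reduce the memory consumption of attention is to sparsify the attention matrix. Suppose we have a masking matrix $M \in \{0, 1\}^{N \times N}$. We define a masked version of attention as follows:

$$\mathrm{Attention}(Q, K, V, M) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}} \odot M\right) V, \quad (1)$$

with the operator $\odot$ defined by

$$(A \odot M)_{ij} = \begin{cases} A_{ij} & \text{if } M_{ij} = 1, \\ -\infty & \text{if } M_{ij} = 0. \end{cases}$$

In this work, we design $M$ to be a sparse block matrix, which not only reduces memory and the number of floating point operations (FLOPs) but also benefits from efficient dense matrix support from deep learning frameworks, such as PyTorch and TensorFlow. More formally, we split the length-$N$ input sequence into $n$ partitions, with each partition of length $N/n$. (We assume $N$ is divisible by $n$; if not, we pad the input sequence to make it so.) The $N \times N$ attention matrix is then partitioned into $n \times n$ blocks, where each block matrix is of size $\frac{N}{n} \times \frac{N}{n}$. A sparse block matrix $M$ can be defined by a permutation $\pi$ of $\{1, 2, \ldots, n\}$:

$$M_{ij} = \begin{cases} 1 & \text{if } \pi\left(\left\lfloor \frac{(i-1)\,n}{N} \right\rfloor + 1\right) = \left\lfloor \frac{(j-1)\,n}{N} \right\rfloor + 1, \\ 0 & \text{otherwise.} \end{cases} \quad (2)$$

By writing $Q, K, V$ as block matrices, such that $Q = [Q_1^\top \cdots Q_n^\top]^\top$, $K = [K_1^\top \cdots K_n^\top]^\top$ and $V = [V_1^\top \cdots V_n^\top]^\top$, and plugging them into Equation 1, we can formally define Blockwise Attention as follows:

$$\mathrm{Blockwise\text{-}Attention}(Q, K, V, M) = \begin{bmatrix} \mathrm{softmax}\left(\frac{Q_1 K_{\pi(1)}^\top}{\sqrt{d}}\right) V_{\pi(1)} \\ \vdots \\ \mathrm{softmax}\left(\frac{Q_n K_{\pi(n)}^\top}{\sqrt{d}}\right) V_{\pi(n)} \end{bmatrix}. \quad (3)$$

As a result, it only needs to compute and store $Q_i K_{\pi(i)}^\top$ ($i = 1, \ldots, n$), each of which has size $\frac{N}{n} \times \frac{N}{n}$. In other words, BlockBert reduces the $O(N^2)$ memory consumption and FLOPs by a factor of $n$, since $\frac{N}{n} \times \frac{N}{n} \times n = \frac{N^2}{n}$.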
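The equivalence between masked attention with a block mask and blockwise attention can be checked numerically on a small example. The sketch below is a pure-Python illustration rather than the authors' implementation: the function names are ours, permutations are 0-indexed (so `[1, 2, 0]` corresponds to $(2, 3, 1)$), and the inputs are random toy matrices.

```python
import math
import random


def block_mask(seq_len, perm):
    """N x N 0/1 mask in which query block i attends to key block perm[i]."""
    n = len(perm)
    assert seq_len % n == 0, "pad the sequence so that N is divisible by n"
    size = seq_len // n
    return [[1 if perm[i // size] == j // size else 0 for j in range(seq_len)]
            for i in range(seq_len)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, M):
    """Dense attention: all N x N scores are formed, masked entries set to -inf."""
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) if M[i][j]
                  else -math.inf for j, k in enumerate(K)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


def blockwise_attention(Q, K, V, perm):
    """Only the n blocks Q_i K_perm[i]^T are formed, each of size N/n x N/n."""
    n, d = len(perm), len(Q[0])
    size = len(Q) // n
    out = []
    for i, p in enumerate(perm):
        keys = K[p * size:(p + 1) * size]
        values = V[p * size:(p + 1) * size]
        for q in Q[i * size:(i + 1) * size]:
            weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                               for k in keys])
            out.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return out


random.seed(0)
N, d, perm = 6, 4, [1, 2, 0]
Q, K, V = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
           for _ in range(3))
M = block_mask(N, perm)
dense = masked_attention(Q, K, V, M)
sparse = blockwise_attention(Q, K, V, perm)
max_diff = max(abs(a - b) for ra, rb in zip(dense, sparse) for a, b in zip(ra, rb))
kept = sum(map(sum, M))  # number of unmasked entries, expected N * N / n
```

Both routes give the same output, but the blockwise version never materializes the masked entries, which is where the factor-of-$n$ saving comes from.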

Figure 2: Architecture of Blockwise Multi-head Attention.

3.1 Blockwise Multi-Head Attention

Analogous to Multi-head Attention (vaswani2017attention), we allow queries, keys, and values to be projected multiple times and perform blockwise attentions in parallel. Moreover, different blockwise attention heads can use different masking matrices. The outputs of multiple heads are then concatenated and aggregated with another linear projection. Let $A$ be the number of attention heads and $H$ the number of hidden units. Blockwise multi-head attention is formally defined by:

$$\mathrm{Blockwise\text{-}Multi\text{-}head\text{-}Attention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_A)\, W^O, \quad \mathrm{head}_i = \mathrm{Blockwise\text{-}Attention}(Q W_i^Q, K W_i^K, V W_i^V, M_i),$$

with $d = H/A$, $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{H \times d}$ and the projection matrix $W^O \in \mathbb{R}^{H \times H}$. Each masking matrix $M_i$ is determined by a permutation $\pi_i$ according to Equation 2. In particular, we choose $\pi$ from permutations generated by shifting one position: $\sigma = (2, 3, \ldots, n, 1)$, i.e., we select $\pi \in \{\sigma, \sigma^2, \ldots, \sigma^n\}$. For example, with 12 attention heads ($A = 12$) and 2 blocks ($n = 2$), one configuration can be assigning 10 heads to permutation $(1, 2)$ and the other 2 heads to permutation $(2, 1)$. Figure 2 illustrates the blockwise multi-head attention with the block numbers $n \in \{2, 3\}$. Blockwise sparsity captures both local and long-distance dependencies in a memory-efficient way, which is crucial for long-document understanding tasks. For instance, the identity permutation, i.e., $(1, 2, \ldots, n)$, enables each token to attend to its nearby tokens in self-attention. In other permutations, tokens within the same local group attend together to a long-distance group of tokens. Our proposed BlockBert essentially replaces the multi-head attention layers in Transformer/BERT with blockwise multi-head attention.
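The assignment of shifted permutations to heads can be sketched as follows. The helper names are hypothetical, permutations are 0-indexed (so `[1, 2, 0]` stands for the one-position shift), and a head configuration is written as a tuple of per-permutation head counts.

```python
def shift_permutations(n):
    """All n cyclic shifts of (0, ..., n-1); shift 0 is the identity."""
    return [[(i + s) % n for i in range(n)] for s in range(n)]


def assign_heads(config):
    """Expand a head configuration into one permutation per attention head.

    config[s] is the number of heads that use the permutation shifted by s
    positions, so len(config) is the number of blocks n.
    """
    heads = []
    for perm, count in zip(shift_permutations(len(config)), config):
        heads.extend([perm] * count)
    return heads


heads = assign_heads((8, 2, 2))   # 12 heads, 3 blocks
two_block_heads = assign_heads((10, 2))
print(len(heads), heads[0], heads[8], heads[10])
```

Each entry of `heads` would then determine that head's block mask, with the identity entries giving local attention and the shifted entries giving long-distance attention.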

3.2 Analysis of Memory Usage Reduction

To validate our claim that BlockBert with $n$ blocks can reduce the $O(N^2)$ memory use by a factor of $n$, we perform the same memory profiling as described in Sections 2.1 and 2.2. Again, we fix the number of tokens in each GPU ($bN = 4096$) and vary the sequence length $N$. (We use GPUs with 16 GB of memory for profiling; BERT fails with an out-of-memory error at the largest sequence length.) As we can see from Figure 3 and Figure 4, the empirical results align well with the theoretical values. When we set the number of blocks to 2 and 3 for BlockBert, the estimated $O(N^2)$ activation memory decreases to 1/2 and 1/3 of BERT’s $O(N^2)$ activation memory, resp. As shown in Table 1, for the sequence length $N = 512$, BlockBert with 2 / 3 blocks saves 18.7% / 23.8% overall memory, resp. The saving is more significant for longer sequences. When $N = 1024$, the overall memory reduction of BlockBert with 2 / 3 blocks is 27.3% / 36.1%, resp.


Figure 3: Regression analysis on activation memory for BERT and BlockBert.
N      b    Model            Act. Mem. (GB)
                             O(N)    O(N^2)
512    8    BERT             4.83    3.66
            BlockBert n=2    4.84    1.83
            BlockBert n=3    4.87    1.22
1024   4    BERT             4.83    7.32
            BlockBert n=2    4.84    3.66
            BlockBert n=3    4.87    2.44
Figure 4: Estimated O(N) and O(N^2) activation memory for BERT and BlockBert.
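The overall per-GPU saving can be recovered approximately by adding the model and optimizer memory from Section 2.1 to the estimated activation components. The sketch below does this for the two-block rows only. The dictionary layout and function names are ours, and small rounding differences from the reported totals are expected.

```python
MODEL_GB, OPTIMIZER_GB = 0.21, 1.03  # profiled values from Section 2.1

# (sequence length, model) -> (O(N) part, O(N^2) part) of activation memory, GB
ACTIVATION_GB = {
    (512, "BERT"): (4.83, 3.66),
    (512, "BlockBert n=2"): (4.84, 1.83),
    (1024, "BERT"): (4.83, 7.32),
    (1024, "BlockBert n=2"): (4.84, 3.66),
}


def total_gb(seq_len, model):
    """Model + optimizer + activation memory for one GPU."""
    return MODEL_GB + OPTIMIZER_GB + sum(ACTIVATION_GB[(seq_len, model)])


def saving_percent(seq_len, model):
    """Relative reduction in total memory compared with BERT at the same length."""
    baseline = total_gb(seq_len, "BERT")
    return 100 * (baseline - total_gb(seq_len, model)) / baseline


for n_tokens in (512, 1024):
    print(n_tokens, round(total_gb(n_tokens, "BlockBert n=2"), 2),
          round(saving_percent(n_tokens, "BlockBert n=2"), 1))
```

Because only the quadratic part shrinks, the relative saving grows with sequence length, consistent with the larger reduction reported for longer sequences.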

4 Experiments

We evaluate the pre-training and fine-tuning performance of BlockBert. In particular, when $n = 2$, we denote by 10:2 the configuration that assigns 10 heads to permutation $(1, 2)$ and 2 heads to permutation $(2, 1)$; when $n = 3$, we denote by 8:2:2 the configuration that assigns 8, 2, and 2 heads to permutations $(1, 2, 3)$, $(2, 3, 1)$, and $(3, 1, 2)$, resp. We compare BlockBert with the following baselines:

Google BERT The pre-trained base model from devlin2019bert.

RoBERTa-2seq and RoBERTa-1seq We compare with two versions of RoBERTa (liu2019roberta). RoBERTa-2seq is trained with both masked language model (MLM) task and next sentence prediction (NSP) task, while RoBERTa-1seq refers to the pre-training model with only MLM task.

SparseBert We pre-train BERT models with the Transformer encoder replaced by a Sparse Transformer encoder (child2019generating). Its sparsity hyper-parameters are the stride and the expressivity. The attention masks used for the Sparse Transformer encoder are illustrated in Figure 8.

4.1 Pre-training

All the models follow the base setting, i.e., $L = 12$, $H = 768$, $A = 12$, and are trained on the same corpus, BooksCorpus and English Wikipedia, with uncased word piece tokens. We fix the number of tokens per batch to $B \times N = 131{,}072$, i.e., if the sequence length is $N = 512$ then the batch size is $B = 256$, and if the sequence length is $N = 1024$ then the batch size is $B = 128$. The detailed pre-training configuration is listed in Figure 7 in Appendix A.1. Moreover, the pre-training of SparseBert and BlockBert follows the RoBERTa-1seq setting, i.e., we drop the NSP (Next Sentence Prediction) task, and an input sequence is up to $N$ tokens long unless it reaches a document boundary first.

N      Model            Training Time (day)   Memory (per GPU, GB)   Heads Config.   Valid. ppl
512    RoBERTa-1seq     6.62                  9.73                   -               3.58
       BlockBert n=2    5.83 (-12.0%)         7.91 (-18.7%)          10:2            3.56
       BlockBert n=3    5.80 (-12.5%)         7.32 (-23.8%)          8:2:2           3.71
1024   RoBERTa-1seq     9.66                  13.39                  -               3.60
       BlockBert n=2    7.51 (-22.3%)         9.73 (-27.3%)          9:3             3.57
       BlockBert n=3    7.23 (-25.1%)         8.55 (-36.1%)          8:2:2           3.63
Table 1: Pre-training Performance Analysis.

A summary of the pre-training performance comparison between BlockBert and RoBERTa-1seq is shown in Table 1. Besides memory saving, we also achieve a significant speedup. For example, when $N = 1024$, BlockBert ($n = 2$) reduces the training time from RoBERTa’s 9.7 days to 7.5 days.

4.2 Fine-tuning Tasks

We evaluate BlockBert on several question answering tasks, including SQuAD 1.1/2.0 (rajpurkar2018know) and five other tasks from the MrQA shared task: HotpotQA (yang2018hotpotqa), NewsQA (trischler2017newsqa), SearchQA (dunn2017searchqa), TriviaQA (joshi2017triviaqa) and NaturalQA (kwiatkowski2019natural). Since MrQA does not have an official test set, we follow joshi2019spanbert, who split the development set evenly to build a new development set and test set.

These QA datasets have different paragraph length distribution patterns and are thus ideal for testing the effectiveness of BlockBert. For example, SQuAD, NaturalQA, and HotpotQA consist of mostly short paragraphs (shorter than 512 tokens), while paragraphs in SearchQA (average length 1,004) and TriviaQA (average length 934) have around 1,000 tokens. This means that for SearchQA and TriviaQA, a BERT model with sequence length $N = 512$ can only capture half of the context. The detailed paragraph length distributions can be found in Figure 9.

For all the pre-trained models, we adopt the same fine-tuning QA setup from devlin2019bert. The tokenized paragraph and question are concatenated into a sequence [CLS]question[SEP]paragraph[SEP]. The sequence is then fed into the pre-trained model with two extra linear layers for predicting the start and end positions of the answer spans. The detailed fine-tuning setting is listed in Appendix A.4. Table 2 and Table 3 report the experimental results.

                                SQuAD 1.1        SQuAD 2.0
N      Model                    EM      F1       EM      F1
-      Human Perf.              82.30   91.20    86.80   89.40
512    Google BERT              81.19   88.45    74.08   77.16
       XLNet                    -       -        78.46   81.33
       RoBERTa-2seq             82.91   89.78    75.79   79.17
       RoBERTa-1seq             84.43   91.48    79.22   82.27
       SparseBert               80.49   88.09    74.15   76.96
       BlockBert n=2, 10:2      84.08   90.77    78.34   81.46
       BlockBert n=3, 8:2:2     82.37   89.64    77.33   80.33
1024   RoBERTa-1seq             84.58   91.14    79.34   82.26
       SparseBert               81.02   88.37    74.51   77.57
       BlockBert n=2, 9:3       83.65   90.74    78.55   81.45
       BlockBert n=3, 8:2:2     82.74   90.05    76.79   79.84
Table 2: Dev set results on SQuAD 1.1/2.0. The result of XLNet(-Base) is from (yang2019xlnet).

BlockBert ($n = 2$) vs. RoBERTa-1seq. Comparing BlockBert ($n = 2$) with RoBERTa-1seq on models pre-trained with $N = 512$, we observe an absolute F1 difference from 0.04 (in NaturalQA) to 1.18 (in NewsQA), with an average difference of 0.55. For $N = 1024$, BlockBert achieves more comparable or even better performance (in SearchQA, NewsQA, and HotpotQA) relative to RoBERTa-1seq. The average difference in F1 reduces to 0.27.

BlockBert vs. SparseBert. For $N = 512$, it is interesting that BlockBert with 3 blocks (density 33.33%) performs better than SparseBert (density 44.20%) in both SQuAD and MrQA tasks. Similar patterns can be observed for $N = 1024$. These results show that off-diagonal masking matrices, e.g., the masking matrix defined by permutation $(2, 3, 1)$, play crucial roles in BlockBert.

Effect of Long Sequence Pre-training. Our observations are twofold. (1) Long sequence pre-training benefits long sequence fine-tuning. In TriviaQA and SearchQA, whose paragraph lengths are around 1024, models pre-trained with $N = 1024$ achieve significantly better performance. (2) A mismatch between pre-training and fine-tuning sequence lengths may hurt performance. For example, in SQuAD, we do not see a significant performance gain from using models pre-trained with $N = 1024$; in HotpotQA and NewsQA, longer sequence pre-training even hurts performance.

Effect of #Blocks. It is not surprising that BlockBert with 2 blocks ($n = 2$) performs better than that with 3 blocks ($n = 3$), because it keeps more attention matrix entries. The biggest difference is in SQuAD 2.0 and NewsQA with $N = 1024$, where we observe an absolute loss of 1.6 F1 when increasing the number of blocks from 2 to 3.

In summary, BlockBert not only saves training time and memory, but also achieves competitive and sometimes better performance, especially for tasks with longer sequences. This demonstrates the effectiveness of our blockwise multi-head attention approach.

                                SearchQA        TriviaQA        NewsQA          NaturalQA       HotpotQA
N      Model                    EM      F1      EM      F1      EM      F1      EM      F1      EM      F1
512    Google BERT              74.94   80.37   70.18   75.35   51.27   66.25   66.13   78.29   60.50   77.08
       RoBERTa-2seq             76.12   81.74   71.92   76.79   52.45   66.73   66.98   78.63   61.52   77.81
       RoBERTa-1seq             77.09   82.62   73.65   78.22   56.13   70.64   67.14   79.07   62.77   79.28
       SparseBert               73.36   79.01   68.71   73.15   51.18   65.47   65.53   77.46   58.54   74.85
       BlockBert n=2, 10:2      76.68   82.33   72.36   77.53   54.66   69.46   66.94   79.03   62.13   79.15
       BlockBert n=3, 8:2:2     75.54   81.07   72.05   76.74   53.82   68.39   66.14   78.47   60.64   77.46
1024   RoBERTa-1seq             77.47   83.12   75.29   80.20   55.00   69.64   68.28   80.35   61.89   78.71
       SparseBert               74.83   80.54   70.56   75.34   51.67   67.16   65.07   77.31   59.65   76.02
       BlockBert n=2, 9:3       77.95   83.51   75.06   79.41   55.44   70.08   67.31   79.39   62.13   78.94
       BlockBert n=3, 8:2:2     76.98   82.76   74.78   79.28   53.48   68.50   65.91   78.20   61.89   78.18
Table 3: MrQA test results (Tasks are sorted decreasingly by average paragraph length).

4.3 Ablation Study

We discuss how the assignment of attention heads affects pre-training performance. We grid search attention head assignments and plot their best validation performance in 1.2M training steps. Our observations are threefold: (1) Identity permutation is important. As shown in Figure 5, all optimal solutions assign a considerable number of attention heads to block diagonal matrices, since those matrices enable each token to attend to its nearby tokens; (2) Non-identity permutations follow the rule of “vital few and trivial many.” Although diagonal matrices are important, assigning all attention heads to them (corresponding to 12:0 and 12:0:0 in Figure 5) significantly hurts performance; (3) Pre-training performance and fine-tuning performance are correlated but not always consistent. When $n = 3$, pre-training performance suggests 10:1:1 to be the best head assignment, but we observe that the configuration of 8:2:2 achieves better performance in fine-tuning tasks.

Figure 5: Ablation over blockwise attention heads assignment.

5 Related Work

In this section, we review the related work on memory optimization for neural network training and recent efforts to simplify Transformer and BERT. In recent years, there has been increasing interest in training neural networks with low memory (sohoni2019low). Mainstream techniques include low-precision training (micikevicius2017mixed), microbatching (huang2018gpipe), and gradient checkpointing (chen2016training). Another line of research studies this problem from a theoretical perspective, including the recently proposed lottery ticket hypothesis (frankle2018lottery). Since the invention of Transformer (vaswani2017attention; dai2019transformer) and its successful application to language model pre-training (devlin2019bert; radford2019language; yang2019xlnet; liu2019roberta), several studies have attempted to simplify it from different perspectives. Most of them focus on the sparsification of the attention matrix, such as Star Transformer (guo2019star), Sparse Transformer (child2019generating), Adaptive Sparse Transformer (correia2019adaptively; sukhbaatar2019adaptive), Log-Sparse Transformer (li2019enhancing), etc. However, due to limited support for sparse tensors in current deep learning platforms, most of these studies have to represent a sparse matrix using a dense matrix with a binary mask or rely on customized CUDA kernels (gray2017gpu). Another line of research focuses on knowledge distillation, including DistilBERT, which distills BERT using a smaller BERT, and tang2019distilling, which distills BERT with a BiLSTM (hochreiter1997long).

6 Conclusion

In this work, we study a lightweight BERT model with the goal of achieving both efficiency and effectiveness. We profile and analyze the memory bottlenecks of BERT, and focus on optimizing dot-product self-attention, which consumes quadratic memory with respect to the sequence length. To reduce both training time and memory consumption, we present BlockBert, which sparsifies the attention matrices to be sparse block matrices. The proposed model achieves time and memory savings without significant loss of performance. In the future, we would like to explore more applications of BlockBert on NLP tasks involving long sequences such as coreference resolution (joshi2019bert) and document-level machine translation (miculicich2018document), and also non-NLP tasks such as protein sequence modeling (rives2019biological).


Appendix A

A.1 Notations and Pre-training Hyper-parameters

The notations and pre-training hyper-parameters are listed in Figure 6 and Figure 7.

Description Base Large
Batch size 256 256
# Self-attention heads 12 16
# Layers 12 24
# Hidden units 768 1024
# Feed-forward hidden units 3072 4096
Sequence length 512 512
Figure 6: BERT notations.
Hyper-parameter Value
Dropout 0.1
Attention dropout 0.1
Warmup steps 10K
Weight decay 0.01
Max steps 2.4M
Initial learning rate 0.00025
Learning rate decay Linear
Adam ε 1e-8
Adam β1 0.9
Adam β2 0.999
Gradient Clipping 1.0
Figure 7: Pre-training hyper-parameters.

A.2 Profiler Implementation

Among the three types of training memory, model memory and optimizer memory are relatively easy to profile (they can be computed by enumerating each tensor and summing up tensor.numel() * tensor.element_size()). To calculate activation memory, sohoni2019low traverse PyTorch’s autograd graph and sum up the necessary storage space. They find that the sum of model memory, optimizer memory and activation memory matches the value reported by PyTorch’s memory profiling tool torch.cuda.max_memory_allocated. Based on their observation, we use

$$\texttt{torch.cuda.max\_memory\_allocated} - \text{model memory} - \text{optimizer memory} \quad (4)$$

as an estimate of activation memory. When profiling BERT, we first pre-train it for 1000 steps, and then compute its model and optimizer memory. Finally, we estimate its activation memory according to Equation 4.

A.3 SparseBert

The sparse masking matrices we use for Sparse Transformer (child2019generating) are shown in Figure 8. We adopt the implementation from Fairseq.

Figure 8: The sparse masking matrices we use in Sparse Transformer (fixed mode) encoder. White color indicates attention values to be masked. (a) $N = 512$, density 44.20%; (b) $N = 1024$, density 34.97%.

A.4 Fine-tuning Settings

Our fine-tuning is implemented based on the code bases from HuggingFace and SpanBERT (joshi2019spanbert). We use max_sequence_length=$N$, i.e., we allow the fine-tuning task to input sequences as long as those used by the pre-training model. If the input sequence is too long to fit the max_sequence_length=$N$ constraint, we use a sliding window of stride 128 to split it. We grid search the learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} and the batch size from {16, 32}. The fine-tuning is performed for 4 epochs.
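The sliding-window split can be sketched as below. This is a simplified illustration, not the code from the cited code bases: the helper name is hypothetical, the stride is assumed to be the distance between consecutive window starts, and the space reserved for the question and special tokens is ignored.

```python
def sliding_windows(tokens, max_len, stride=128):
    """Split tokens into windows of at most max_len whose starts are stride apart.

    Assumption: the last window is the first one that reaches the end of
    the sequence, so every token is covered by at least one window.
    """
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows


# A 1,200-token paragraph with max_sequence_length = 1024.
windows = sliding_windows(list(range(1200)), max_len=1024)
print(len(windows), [len(w) for w in windows])
```

A longer pre-training length therefore means fewer windows per paragraph, so more of the context is visible to the model in a single pass.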

Figure 9: Paragraph length (after tokenization) distribution. The distribution of SQuAD 2.0 is very similar to SQuAD 1.1, so we only plot SQuAD 1.1 here.