Federated Split BERT for Heterogeneous Text Classification

by   Zhengyang Li, et al.
Ping An Bank

Pre-trained BERT models have achieved impressive performance in many natural language processing (NLP) tasks. However, in many real-world situations, textual data are decentralized over many clients and cannot be uploaded to a central server due to privacy protection and regulations. Federated learning (FL) enables multiple clients to collaboratively train a global model while keeping local data private. A few studies have investigated BERT in the federated learning setting, but the problem of performance loss caused by heterogeneous (e.g., non-IID) data over clients remains under-explored. To address this issue, we propose a framework, FedSplitBERT, which handles heterogeneous data and decreases the communication cost by splitting the BERT encoder layers into a local part and a global part. The local part parameters are trained by the local client only, while the global part parameters are trained by aggregating gradients of multiple clients. Due to the sheer size of BERT, we explore a quantization method to further reduce the communication cost with minimal performance loss. Our framework is ready-to-use and compatible with many existing federated learning algorithms, including FedAvg, FedProx and FedAdam. Our experiments verify the effectiveness of the proposed framework, which outperforms baseline methods by a significant margin, while FedSplitBERT with quantization can reduce the communication cost by 11.9×.


I Introduction

Recently, language model pre-training has been shown to be highly effective in learning universal language representations from large-scale unlabeled data. Pre-trained models such as ELMo [24], GPT [2] and BERT [6] have achieved great success in many NLP tasks, such as sentiment classification [36], natural language inference [38], and question answering [16]. In many real-world applications, textual data such as clinical records are decentralized and stored locally on client devices [39], such as phones and personal computers. Due to recent stringent regulation on private data protection [32], these data cannot be directly uploaded to a central server. Federated learning [22, 14] allows multiple clients to train a global deep neural network model collaboratively in a distributed environment without moving data to centralized storage.

There have been some attempts at fine-tuning BERT in the federated learning setting. For example, [11] proposed an encryption method for clients to protect data privacy while keeping the performance of fine-tuned models comparable to centralized fine-tuning. [20] confirms that it is possible to both pre-train and fine-tune BERT models in a federated manner using clinical texts from different silos without moving the data. [10] provides an overview of the applicability of federated learning to Transformer-based language models. Another work was the first to apply federated learning to Transformer-based neural machine translation, to avoid sharing customers' chat recordings with the server.

However, there are still a few under-explored challenges for fine-tuning BERT over decentralized data: firstly, the data distribution on clients may vary substantially, which might cause severe performance degradation of the fine-tuned model [40]; secondly, due to the huge number of parameters in BERT, the communication cost between clients and the central server can be very high.

To address these concerns, we propose a new framework, FedSplitBERT (Federated Split BERT), for heterogeneous data. Specifically, we split the BERT encoder stack (12 layers for BERT Base or 24 layers for BERT Large) into two parts at one specific layer, called the critical layer. BERT encoder layers above this critical layer (excluding itself) are learned locally with the data on each client only, rather than jointly trained over all clients. Encoder layers below the critical layer (including itself) are jointly trained over all clients. The intuition behind our method is that encoder layers above the critical layer are close to the softmax output, so their parameters can be tuned to adapt to the local data distribution (local part). The layers below the critical layer are used as a shared feature extractor over all clients (global part). Instead of being merely a provider of data during FL, every client ends up with a personalized model, consisting of the global part and a local part, that fits its specific data distribution. Moreover, because only part of the parameters needs to be transmitted between the central server and the clients, the communication cost during model training drops sharply.

Our contributions are summarized as follows:

  • We propose a new federated learning framework, FedSplitBERT, which specializes in fine-tuning BERT on heterogeneous data over multiple clients.

  • We further investigate a quantization method to reduce the communication cost.

  • We conduct extensive experiments to show the effectiveness of our methods and to demonstrate that the split architecture of the Transformer-based model is well suited to tackling heterogeneity problems.

II Related Works

II-A Federated Learning

The general FL setup involves two types of updates: local updates and server (global) updates. The local updates minimize a local loss function on each client, while the global updates aggregate local weights and synchronize the updated model at each communication round.

Federated averaging (FedAvg) [22] and federated proximal (FedProx) [18] are two general-purpose algorithms in federated learning. FedAvg is an iterative method in which each client optimizes a local surrogate of the global objective function, and the weights are merged at each communication round, with the same learning rate on each client. To address the heterogeneity issue, FedProx adds a regularization term that keeps the weights on each client close to the global model during training. FedProx reaches more stable and better performance than FedAvg on image classification tasks. Recent methods such as FedOpt [26] and FedDyn [1] follow a similar pipeline while improving model aggregation and reducing communication cost. FedSmart [9] optimizes the performance of the client model on its local data by updating weights based on the accuracy on a local validation set. These model aggregation ideas are complementary to our work and can be integrated into the global update step of our proposed framework.
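To make the role of the FedProx regularization term concrete, the following minimal sketch evaluates a proximal local objective on plain Python lists. The penalty form $(\mu/2)\lVert w - w^{\text{global}}\rVert^2$ is the usual statement of FedProx [18], not something specified in this paper; the function name `fedprox_objective`, the coefficient name `mu`, and the toy numbers are illustrative assumptions.

```python
from typing import Sequence


def fedprox_objective(local_loss: float,
                      w: Sequence[float],
                      w_global: Sequence[float],
                      mu: float) -> float:
    """Local loss plus the proximal penalty (mu / 2) * ||w - w_global||^2.

    `mu` is a hypothetical name for the regularization strength.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(w, w_global))
    return local_loss + 0.5 * mu * sq_dist


# Toy example: the penalty is zero at the global model and grows as the
# client weights drift away from it.
near = fedprox_objective(1.0, [1.0, 2.0], [1.0, 2.0], mu=0.1)
far = fedprox_objective(1.0, [3.0, 2.0], [1.0, 2.0], mu=0.1)
```

With the same raw loss, the drifted client pays an extra penalty, which is what pulls heterogeneous clients back toward the shared model.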

Another line of FL work aims to decrease communication cost by splitting the whole model architecture into two parts: the local model and the global model [19, 41, 4]. The main idea is that each client keeps a local model that can adapt to local data, which mitigates the effect of data heterogeneity and reduces the number of parameters that need to be transmitted. For instance, [31] formulates a new bi-level optimization problem designed for personalized FL (pFedMe) by using the Moreau envelope as a regularized loss function. This work can produce personalized models for individual clients, but its communication cost might be too high for the BERT model. Our FedSplitBERT improves communication efficiency by learning part of the BERT layers locally and by quantization. Local global FedAvg (LG-FedAvg) [19] and the heterogeneous Data Adaptive Federated Learning (HDAFL) algorithm [41] use strategies similar to ours: (1) locally train client-specific neural network layers (local models) and (2) aggregate generic (global) model parameters shared by all clients. However, these two papers mainly focus on architectures such as convolutional neural networks (CNNs) and multi-layer perceptrons (MLPs), rather than Transformer-based pre-trained BERT models. Our paper mainly studies how to fine-tune BERT in a federated setting over heterogeneous data on clients.

II-B BERT and Quantization

Since its inception, BERT and its variants have been the base models for many NLP applications [6, 29, 35]. There are a few studies on applying BERT in the federated setting, such as [20] on federated pre-training and fine-tuning of BERT over decentralized clinical notes. However, how to fine-tune BERT on heterogeneous textual data remains under-explored. In this work, we fill this gap by proposing FedSplitBERT. Due to the large number of parameters in BERT models, communication efficiency is a significant challenge for FL.

Quantization is a commonly used approach to reduce model size while preserving as much performance as possible [28, 21]. It reduces the number of bits used to represent a number in the model. For instance, [28] proposes a Hessian-based ultra-low-precision quantization method for BERT. [15] study integer and binary quantization methods for the BERT model during the inference phase. [42] achieves an impressive compression ratio by using quantization-aware training during fine-tuning and quantizing all the embedding layers and fully connected layers of BERT to 8-bit during inference. [13] leverages learned step size quantization for Transformer-based models such as BERT, and further uses knowledge distillation to obtain a compressed "student" BERT model. [27] adapts a random low-precision quantizer for gradients to the federated learning setting. These methods can be used in our FedSplitBERT and could potentially reduce communication cost in federated BERT fine-tuning.

III Methodology

This section covers the details of our FedSplitBERT framework and how we tackle the challenges of heterogeneous data and communication bottleneck in the federated setting.

III-A FedSplitBERT

Fig. 1: Pictorial view of the proposed FedSplitBERT in federated learning for the 12-layer BERT model. All clients share a set of encoder layers from Layer 1 to Layer $c$ (colored blue) and have distinct local layers (from Layer $c+1$ onward) that can potentially adapt to individual clients. The global layers are shared with the server while the local layers are kept private by each client.
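Server executes:
  1: Initialize global model $w^g_0$ as a 16-bit tensor and local models $w^l_k$ on $K$ clients as 32-bit tensors
  2: for iteration $t = 0, \dots, T-1$ do
  3:    $S_t \leftarrow$ (randomly select a number of clients from $K$)
  4:    for $k \in S_t$ in parallel do
  5:       $w^g_{t+1,k} \leftarrow$ ClientUpdate($k$, $w^g_t$)
  6:    end for
  7:    $w^g_{t+1} \leftarrow \sum_{k \in S_t} \frac{n_k}{n} w^g_{t+1,k}$
  8: end for

ClientUpdate($k$, $w^g$):
  1: Cast the 16-bit global tensor $w^g$ to 32-bit on client $k$ as $\tilde{w}^g$
  2: Combine $(\tilde{w}^g, w^l_k)$ into $w_k$
  3: Split local data into a set of mini-batches
  4: for each mini-batch $b$ do
  5:    Optimize with local batch $b$ to compute $\nabla \ell_k(w_k; b)$
  6:    $\tilde{w}^g \leftarrow \tilde{w}^g - \eta \nabla_{w^g} \ell_k(w_k; b)$
  7:    $w^l_k \leftarrow w^l_k - \eta \nabla_{w^l} \ell_k(w_k; b)$
  8: end for
  9: Quantize $\tilde{w}^g$ to 16-bit: $w^g \leftarrow Q(\tilde{w}^g)$
  10: return $w^g$

Algorithm 1 FedSplitBERT with quantization. $\eta$ is the learning rate, $n$ is the total sample size, $n_k$ denotes the sample size on client $k$, and $Q$ is the quantization function.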
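One communication round of the procedure can be sketched on toy vectors as follows. This is a simplified illustration, not the authors' implementation: weights are plain Python lists, quantization is omitted, all clients participate, and the names `client_update`, `fedavg`, and `quadratic_grad`, as well as the toy quadratic loss, are assumptions introduced here.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
GradFn = Callable[[Vector, Vector], Tuple[Vector, Vector]]


def client_update(w_global: Vector, w_local: Vector, grad_fn: GradFn,
                  lr: float, steps: int = 1) -> Tuple[Vector, Vector]:
    """Run local SGD on both parts; only the global part is meant for upload."""
    w_g, w_l = list(w_global), list(w_local)
    for _ in range(steps):
        g_g, g_l = grad_fn(w_g, w_l)
        w_g = [w - lr * g for w, g in zip(w_g, g_g)]
        w_l = [w - lr * g for w, g in zip(w_l, g_l)]
    return w_g, w_l


def fedavg(uploads: Dict[str, Vector], sizes: Dict[str, int]) -> Vector:
    """Average uploaded global parts, weighting client k by n_k / n.

    Here n is the total sample size of the participating clients.
    """
    n = sum(sizes[k] for k in uploads)
    dim = len(next(iter(uploads.values())))
    agg = [0.0] * dim
    for k, w in uploads.items():
        for i, v in enumerate(w):
            agg[i] += sizes[k] / n * v
    return agg


def quadratic_grad(target: float) -> GradFn:
    """Toy loss 0.5 * (w - target)^2 per coordinate, for both parts."""
    def grad(w_g: Vector, w_l: Vector) -> Tuple[Vector, Vector]:
        return [v - target for v in w_g], [v - target for v in w_l]
    return grad


# Two heterogeneous clients pull toward different targets.
sizes = {"a": 100, "b": 300}
targets = {"a": 0.0, "b": 4.0}
w_global = [2.0]
local = {"a": [2.0], "b": [2.0]}

uploads = {}
for k in sizes:
    uploads[k], local[k] = client_update(
        w_global, local[k], quadratic_grad(targets[k]), lr=0.5)
w_global = fedavg(uploads, sizes)
```

After the round, the shared part is a sample-weighted compromise between the clients, while each local part has moved toward its own client's target and is never sent to the server.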

Fig. 1 depicts how our FedSplitBERT proceeds: for each client, it splits the BERT encoder stack into local and global encoder layers. All clients share a set of encoder layers from Layer 1 to Layer $c$ (including itself), called the global model, and each client has distinct local layers (from Layer $c+1$ onward) that can potentially adapt to heterogeneous (non-independent and identically distributed, non-IID) client data. Specifically, for the $k$-th client, we denote its parameters as $w_k = (w^g, w^l_k)$, where $w^g$ denotes the global parameters in the shared encoder layers, and $w^l_k$ denotes the local model parameters specific to the $k$-th client.

During fine-tuning at the $t$-th iteration, the server randomly selects a subset of clients among all active participants, where $K$ is the total number of clients. The selected $k$-th client first feeds each batch of data to the global layers to extract features, then passes these features to the local layers to compute the cross-entropy loss. The difference between local and global layers is that, after back propagation, the client uploads only its updated global weights to the central server; the server then performs aggregation and synchronizes the new weights of the global encoders to the clients. Theoretically, our goal is to optimize the function:


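$$\min_{w^g,\, w^l_1, \dots, w^l_K} \; \sum_{k=1}^{K} \frac{n_k}{n}\, \ell_k(w^g, w^l_k; D_k)$$

Here $w^g$ denotes the shared global parameters, $w^l_k$ the local parameters of the $k$-th client, and $n_k / n$ the share of samples held by that client; $\ell_k$ is the loss of the prediction on $D_k$, where $D_k$ is the data on the $k$-th client. The FedSplitBERT procedure is presented in Algorithm 1, where we use FedAvg to aggregate client weights. Other aggregation methods such as FedOpt could also be used.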

In FedSplitBERT, the global encoders from Layer 1 to Layer $c$ act as a common feature extractor to capture surface features [12], while the local encoders (from Layer $c+1$ onward) aim to extract high-level features that reflect individual client characteristics. We call Layer $c$ the critical layer. In federated BERT fine-tuning, we treat $c$ as a hyper-parameter that balances overfitting against the communication cost of the system. For a fixed BERT architecture, when $c$ is small, the BERT model is mostly an individual local model for each client: the local model has a large number of parameters (and tends to overfit the local data), but the communication cost is low. The extreme case is $c = 0$, where every client has an independent BERT model trained on its own isolated data and there is no communication between clients and the server. Conversely, when $c$ is large and close to the top layers, the local model on each client is small and has few parameters (and is less likely to overfit the local data), but the communication cost is high. The extreme case is $c = 12$ for the standard BERT model (12 layers in total), where all clients share the same BERT model, which is equivalent to the FedAvg algorithm. The problem with FedAvg is that its performance degrades substantially for heterogeneous clients. In our experiments, $c$ is tuned to choose the best model for each dataset.
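The split at the critical layer can be sketched as a simple partition of encoder layer indices. The helper name `split_encoder_layers` and the 1-based layer numbering are illustrative assumptions; how embeddings and the classification head are assigned is not covered by this sketch.

```python
from typing import Dict, List


def split_encoder_layers(c: int, num_layers: int = 12) -> Dict[str, List[int]]:
    """Layers 1..c are global (shared); layers c+1..num_layers stay local."""
    if not 0 <= c <= num_layers:
        raise ValueError("critical layer must lie in [0, num_layers]")
    return {
        "global": list(range(1, c + 1)),
        "local": list(range(c + 1, num_layers + 1)),
    }


isolated = split_encoder_layers(0)    # no shared layers: independent training
balanced = split_encoder_layers(6)    # lower half shared, upper half local
shared = split_encoder_layers(12)     # every layer shared: same as FedAvg
```

The two extreme calls reproduce the cases discussed above: with a critical layer of 0 nothing is communicated, and with 12 nothing is personalized.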

III-B Quantizing Global Weights

FedSplitBERT only requires clients to upload and download $w^g$ (i.e., the weights of the global model) in each communication round, so the communication cost is significantly reduced. However, given the huge number of parameters in BERT, there might still be millions of parameters to upload or download at each communication round, and the network costs remain prohibitive for portable edge devices.

Fig. 2: Quantization Scheme of FedSplitBERT

To further improve communication efficiency, we apply quantization to all the parameters that are uploaded and downloaded (see the quantization process in Fig. 2). For the clients selected in each iteration, it proceeds in the following steps: (1) after back propagation, quantize the global weights $w^g$ from 32-bit to 16-bit on client $k$; (2) upload the quantized $w^g$ to the server for aggregation; (3) the server aggregates the quantized weights in 16-bit according to a chosen rule (e.g., FedAvg, FedProx or FedOpt); (4) the selected clients download the updated $w^g$ and convert it back from 16-bit to 32-bit. The whole FedSplitBERT algorithm with quantization is given in Algorithm 1.
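The 32-bit to 16-bit round trip in steps (1) and (4) can be illustrated with the standard library alone. This sketch uses the IEEE 754 half-precision format of Python's `struct` module as a stand-in for the tensor casts used in the paper; the helper names and sample weights are assumptions for illustration.

```python
import struct
from typing import List


def to_fp16_bytes(weights: List[float]) -> bytes:
    """Pack values as IEEE 754 half precision (2 bytes per value)."""
    return struct.pack(f"<{len(weights)}e", *weights)


def from_fp16_bytes(payload: bytes) -> List[float]:
    """Unpack half-precision values back into Python floats."""
    return list(struct.unpack(f"<{len(payload) // 2}e", payload))


def to_fp32_bytes(weights: List[float]) -> bytes:
    """Pack values as IEEE 754 single precision (4 bytes per value)."""
    return struct.pack(f"<{len(weights)}f", *weights)


weights = [0.1, -1.5, 3.14159, 0.0]
payload16 = to_fp16_bytes(weights)      # what a client would upload
payload32 = to_fp32_bytes(weights)      # unquantized size, for comparison
restored = from_fp16_bytes(payload16)   # what the receiver recovers
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

The 16-bit payload is half the size of the 32-bit one, at the price of a small rounding error per weight; values beyond the half-precision range would raise an `OverflowError` in this sketch.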

IV Experiments

Here we conduct experiments to verify the effectiveness of FedSplitBERT on text classification over extensive public datasets. We aim to demonstrate that our methods can (1) fine-tune BERT over multiple clients with heterogeneous data and produce better performance than baseline methods like FedAvg, FedProx and FedAdam, and (2) reduce the communication cost during fine-tuning in terms of uploading and downloading size.

IV-A Baseline Methods & Datasets

The baseline methods included in this paper are: FedAvg, FedProx and FedAdam [26]. We evaluate all methods on General Language Understanding Evaluation (GLUE) benchmark [33], a collection of 9 sentence-level language understanding tasks:

  • Two sentence-level classification tasks: Corpus of Linguistic Acceptability (CoLA) [37] and Stanford Sentiment Treebank (SST-2) [30].

  • Three sentence-pair similarity tasks: Microsoft Research Paraphrase Corpus (MRPC) [7], Semantic Textual Similarity Benchmark (STSB) [3], and Quora Question Pairs (QQP).

  • Four natural language inference (NLI) tasks: Multi NLI (MNLI) [38], Question NLI (QNLI), Recognizing Textual Entailment (RTE) [5, 8], and Winograd NLI (WNLI) [17].

The performance of our FedSplitBERT framework in comparison with the baseline methods is evaluated under non-IID settings, meaning that the data on different clients have distinct label distributions.

Dataset Client1 Client2 Client3
MRPC 95%/5% 75%/25% 25%/75%
CoLA 90.5%/9.5% 83%/17% 40%/60%
QQP 70%/30% 37%/63% 13%/87%
MNLI 86%/5%/9% 5%/90%/5% 5%/5%/90%
STSB 0.93 2.90 4.26
Other 80%/20% 50%/50% 20%/80%
TABLE I: The label distributions for each client used to construct non-IID datasets for 3 clients (average score for STSB and percentages of every class for classification tasks).

Prior to running experiments, we create synthetic client data from the GLUE benchmark. Throughout our experiments, every client has approximately the same number of samples, i.e., $n_k \approx n / K$, where $n$ is the total sample size and $K$ is the number of clients. We manually construct synthetic data so that the distribution of labels varies across clients, following a distribution scheme specific to each dataset. Table I displays the data distribution schemes of the 9 datasets for 3 clients. In this experiment, 8 of the 9 datasets are classification tasks; the exception is STSB, which is a regression task. Therefore, in Table I we present the average score of each of the three clients for the STSB dataset. Specifically, for STSB we sort the samples by their scores and then split the whole dataset into $K$ parts in order, each with $n / K$ examples. Clients 1, 2 and 3 thus have different score distributions, ranging over 0.0-2.0, 2.0-3.6 and 3.6-5.0, respectively. For the other datasets, we randomly sample data without replacement for each client according to the label probabilities given in the table.
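The two construction schemes described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' script: the helper names, the convention that proportions are indexed by label id, and the toy pool are assumptions introduced here.

```python
import random
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, label)


def sample_client(pool: Dict[int, List[Example]],
                  proportions: Sequence[float],
                  size: int,
                  rng: random.Random) -> List[Example]:
    """Draw about `size` examples without replacement, matching label proportions.

    Drawn examples are removed from `pool`, so clients never share data.
    If a label runs out, fewer examples are returned for that label.
    """
    client: List[Example] = []
    for label, p in enumerate(proportions):
        k = round(p * size)
        rng.shuffle(pool[label])
        client.extend(pool[label][:k])
        del pool[label][:k]
    return client


def split_sorted(scores: Sequence[float], parts: int) -> List[List[float]]:
    """Regression-style split: sort by score, cut into equal contiguous parts."""
    ordered = sorted(scores)
    step = len(ordered) // parts
    return [ordered[i * step:(i + 1) * step] for i in range(parts)]


rng = random.Random(0)
pool = {
    0: [(f"text-a{i}", 0) for i in range(100)],
    1: [(f"text-b{i}", 1) for i in range(100)],
}
client1 = sample_client(pool, [0.8, 0.2], 50, rng)  # 80%/20% label mix
client2 = sample_client(pool, [0.5, 0.5], 50, rng)  # 50%/50% label mix
score_parts = split_sorted([5, 1, 3, 2, 4, 0], 3)
```

Removing drawn examples from the shared pool is what makes the sampling "without replacement" across clients as well as within a client.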

For the experiments with 10 clients, we construct synthetic data for each client in a similar manner, varying the percentage of positive labels from approximately 90% to 0% with a step size of 10%. Table II shows the data construction details of the experiment with 10 clients. Because the MRPC task has a relatively small number of samples and an imbalanced distribution of positive and negative labels, its synthetic non-IID dataset cannot strictly follow the step size of 10%, so we set a different distribution for each client.

Client SST-2/QQP/QNLI MRPC
Client1 90%/10% 92%/8%
Client2 80%/20% 86%/14%
Client3 70%/30% 71%/29%
Client4 60%/40% 66%/34%
Client5 50%/50% 50%/50%
Client6 40%/60% 44%/56%
Client7 30%/70% 35%/65%
Client8 20%/80% 28%/72%
Client9 10%/90% 12%/88%
Client10 2%/98% 51%/48%
TABLE II: The label distributions for each client used to construct non-IID datasets for 10 clients on MRPC, SST-2, QQP and QNLI tasks.
F1/Acc Mcc Acc Acc Acc F1/Acc Acc Acc Pear. corr
FedAvg 76.65/83.96 48.81 88.65/90.77 88.61 78.00/76.66 57.66/61.68 87.02 54.43 64.36 72.25
FedProx 76.67/82.31 46.45 88.92/91.17 88.12 79.89/77.14 57.03/64.81 86.83 52.23 69.71 72.87
FedAdam 76.99/85.41 45.29 89.50/91.65 90.12 80.00/80.66 60.28/72.14 88.36 65.29 70.72 75.61
FedSplitBERT 79.97/86.15 51.17 90.12/93.78 90.54 82.24/78.86 62.78/73.11 90.65 68.90 71.15 77.94
FedSplitBERT+qat 77.95/86.27 49.23 90.29/91.04 90.49 80.32/78.20 59.68/71.89 89.41 67.49 71.94 76.64
TABLE III: Performance of FedAvg, FedProx, FedAdam and FedSplitBERT on 9 GLUE datasets under non-IID setting over 3 clients. Last row, we present the performance of FedSplitBERT with quantization (FedSplitBERT+qat).
F1/Acc Mcc Acc Acc Acc Acc Acc Acc Pear. corr
12 76.65/83.96 48.81 90.77 88.61 78.00/76.66 61.68 87.02 54.43 64.36 72.25
10 78.98/86.03 49.24 93.78 89.19 82.24/78.86 74.11 90.65 68.90 71.15 77.58
8 79.97/86.15 50.33 93.27 90.18 80.70/74.71 73.11 89.72 71.31 70.94 77.72
6 79.57/86.85 51.17 93.43 90.54 80.31/71.83 73.76 89.67 68.46 70.35 77.47
4 77.60/85.96 43.68 91.22 90.35 78.42/70.27 72.61 89.51 61.65 70.58 75.06
0 75.01/86.21 43.40 89.72 90.01 77.96/69.28 69.37 88.49 50.01 68.30 72.47
TABLE IV: Performance of FedSplitBERT under non-IID setting over 3 clients. The critical layer is fixed at 0, 4, 6, 8, 10 and 12 (same as FedAvg).

IV-B Experimental Settings

Test Set & Implementation Since we need the true labels to construct synthetic client data, we merge the training and development sets of each GLUE task into a single dataset, and on each client we split the data 80%/20% into training and test sets. When performing FL experiments, the training data at all clients are used to train the model, and our FedSplitBERT methods are evaluated on the local test sets of the clients. We cannot use the original GLUE test sets because, without labels, it is impossible to construct heterogeneous test data consistent with the training data.

For each dataset, we run five federated methods (FedAvg, FedProx, FedAdam, and FedSplitBERT with and without quantization) and compare their results. We use the pre-trained model 'bert-base-uncased' (Layers=12, Hidden size=768, Self-Attention Heads=12). We quantize 32-bit weights to 16-bit and convert them back with basic PyTorch tensor operations [23]. All our experiments were conducted on an NVIDIA Tesla V100.

Hyperparameters & Evaluation Metrics For each task, the local learning rate $\eta$ takes a value in the set {2e-5, 3e-5, 4e-5, 5e-5}, the number of local fine-tuning epochs ranges from 3 to 9, and the batch size of local training is chosen from {8, 16, 32}, with the input padding length fixed at 128. We used the hyper-parameters of FedAdam from [26]. We conducted the experiments under 3-client and 10-client federated settings and tuned the regularization hyper-parameter for FedProx and the critical layer for FedSplitBERT. At each iteration, all clients participate in model aggregation in our experiments.

We report the average Matthews correlation coefficient for CoLA, the average Pearson correlation for STSB, both the average F1 score and the average accuracy for RTE and MRPC, and the average accuracy for the remaining datasets. For FedSplitBERT, we evaluate the accuracy under the local test setting, following the method in [19].
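For reference, the two correlation metrics can be computed from their standard definitions as in the sketch below. The function names are illustrative, labels are assumed to be binary 0/1 for the Matthews coefficient, and returning 0.0 when the denominator vanishes is a convention chosen here.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between predicted and reference scores."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


perfect = matthews_corrcoef([1, 1, 0, 0], [1, 1, 0, 0])
inverted = matthews_corrcoef([1, 1, 0, 0], [0, 0, 1, 1])
linear = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In the federated evaluation described above, such a metric would be computed on each client's local test set and then averaged over clients.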

IV-C Results and Analysis


Tables III and V report the performance of various FL methods under non-IID settings over 3 and 10 clients, respectively. In the 10-client setting, we test the baseline methods (FedAvg, FedAdam, and FedProx) and FedSplitBERT on 4 datasets, as training up to 10 BERT models at the same time is very resource demanding. Here, the baseline methods train one shared BERT model for all clients. By contrast, we implement FedSplitBERT by setting the critical layer $c$ to 5 values: 0, 4, 6, 8 and 10. When $c = 0$, there is no shared global model among clients, and each client trains its own BERT model on its local isolated data. Because it overfits the local data and learns from less data than a global model, FedSplitBERT with $c = 0$ performs the worst among the FedSplitBERT variants, and sometimes even worse than FedAvg.

Dataset MRPC SST-2 QQP QNLI Average
Acc Acc Acc Acc
FedAvg 62.95 92.36 84.56 82.84 80.68
FedProx 64.15 92.25 86.03 81.16 80.90
FedAdam 67.20 93.15 86.32 84.22 82.72
FedSplitBERT 79.86 94.92 91.35 89.14 88.81
FedSplitBERT+qat 79.96 94.13 90.76 89.24 88.52
TABLE V: Performance of FedAvg, FedProx, FedAdam, and FedSplitBERT on 4 datasets under non-IID setting over 10 clients. The communication rounds for MRPC is set to 6, for SST-2 and for QQP and QNLI.
Fig. 3: The test accuracy versus communication rounds of the baseline methods and FedSplitBERT without quantization on 6 datasets.
Fig. 4: The training loss versus communication rounds of the baseline methods and FedSplitBERT without quantization on 6 datasets.

Our FedSplitBERT and FedSplitBERT+qat outperform the baseline methods (FedAvg, FedProx and FedAdam) significantly on all datasets, with an average score improvement of nearly 4% and at least 1% on any single task. Table IV reports the performance of FedSplitBERT under different choices of the critical layer $c$ (6 values). The study [12] shows that BERT captures phrase-level information in the lower layers, syntactic information in the middle layers, and semantic information in the top layers. Our experiments show that when $c$ ranges from 6 to 8, FedSplitBERT performs better than with $c = 0$ and $c = 12$ on the linguistic acceptability task (CoLA) and on sentence similarity tasks such as MRPC, STSB and QQP. For natural language inference tasks (MNLI, QNLI, WNLI and RTE), a higher $c$ yields better performance. These results suggest that the global encoders of the BERT model learn general features, which are important for these tasks, while the deeper layers corresponding to the local model in FedSplitBERT learn high-level feature representations and semantic information. The critical layer $c$ can therefore be selected according to the type of task. A FedSplitBERT model with $0 < c < 12$ can both use more data to learn general features and keep the ability to capture local characteristics (Fig. 5). A general suggestion for selecting the critical layer $c$ is to try 6, 8 and 10. The value of $c$ balances model capacity against communication cost, so different datasets require different values of $c$.

Fig. 5: The left plot reports the communication cost of each round in MBits versus the critical layer $c$ of FedSplitBERT when the client number $K = 3$. The right plot displays test accuracy versus the number of global layers on 3 datasets.

FedAvg, FedProx and FedAdam correspond to critical layer $c = 12$, so they fail to capture data heterogeneity. Under this setting, one global BERT model cannot represent the data features and heterogeneous distributions of different clients. On the other hand, when $c = 0$, FedSplitBERT is equivalent to isolated training, with all layers updated locally. In this case, the model learns only from the client's local data, leading to overfitting. FedSplitBERT with an appropriate choice of critical layer addresses both problems and achieves better performance as well as faster convergence.

In addition, we report the fine-tuning results of our approach with 16-bit quantization (FedSplitBERT+qat). It achieves performance similar to FedSplitBERT at a very low communication cost. FedSplitBERT+qat performs best on the MRPC and QNLI datasets in the 10-client setting and on STSB in the 3-client setting, which may be due to reduced over-fitting from truncating the weights.

Communication Complexity

Communication cost is an important metric for federated learning. Two factors determine communication complexity: the rate of convergence to a target accuracy, which determines the number of rounds, and the number of model parameters transferred at each round. Figs. 3 and 4 show the convergence of the test accuracy and the training loss over the number of communication rounds in the non-IID setting across 6 datasets: RTE, MRPC, STSB, WNLI, QNLI and CoLA. Our methods converge more quickly than FedAvg and FedProx, achieving higher accuracy and lower training loss across all datasets.

Fig. 6: Communication cost in MBits when baseline methods and FedSplitBERT attain a particular accuracy on four non-IID datasets: RTE (64%), MRPC (84%), WNLI (47%) and QNLI (90%). Client number $K = 3$.
Method Para. size Comm. Cost Comm. gain
FedAvg/FedProx 1.03 6.18 1
FedAdam 1.7 3.4 1.8
FedSplitBERT 0.51 1.02 6.1
FedSplitBERT+qat 0.26 0.52 11.9
TABLE VI: The first column shows the uploading and downloading parameter size in gigabytes in each communication round. The last two columns give the communication cost and gain for the baseline methods and FedSplitBERT with quantization (+qat) to achieve a test accuracy of 84% on MRPC. Note that the client number is $K = 3$ and the critical layer $c$ is fixed.

We report the size of the parameters uploaded and downloaded at each round for the baseline methods and our proposed framework, over 3 clients and a fixed critical layer $c$. The communication cost for these methods to attain the same performance (84% accuracy) on the MRPC dataset is reported in Table VI. The minimum transferred parameter size of FedSplitBERT is 0.26 GB with quantization, about 4 times lower than FedAvg and FedProx and about 6 times lower than FedAdam. Furthermore, since FedSplitBERT converges faster than a fully generic model on heterogeneous data, the communication complexity decreases further with the number of parameter-transferring rounds.
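The relationship between the columns of Table VI can be checked with a few lines of arithmetic. The sketch assumes that the cost column is in the same unit as the per-round size and that cost equals per-round size times the number of rounds; the round counts it derives are inferred from the table, not stated in the paper.

```python
# Per-round parameter size and total communication cost as listed in Table VI.
table_vi = {
    "FedAvg/FedProx": (1.03, 6.18),
    "FedAdam": (1.7, 3.4),
    "FedSplitBERT": (0.51, 1.02),
    "FedSplitBERT+qat": (0.26, 0.52),
}

baseline_cost = table_vi["FedAvg/FedProx"][1]


def implied_rounds(size: float, cost: float) -> int:
    """Rounds needed if cost = per-round size * number of rounds."""
    return round(cost / size)


def gain(cost: float) -> float:
    """Communication gain relative to the FedAvg/FedProx total cost."""
    return baseline_cost / cost


summary = {
    name: (implied_rounds(size, cost), round(gain(cost), 1))
    for name, (size, cost) in table_vi.items()
}
```

Under these assumptions the reported gains are reproduced, and they combine two effects: a smaller payload per round and fewer rounds to reach the 84% target.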

Fig. 6 displays the communication cost, in terms of uploading size in MBits, for all methods to achieve a particular level of test accuracy on four datasets. The figure shows that FedSplitBERT converges faster than FedAvg, FedProx and FedAdam on these 4 datasets. On the RTE task, the dotted blue line shows that FedSplitBERT without quantization exceeds 64% accuracy within only 3 communication rounds, at a cost of nearly 1000 MBits, while the baseline methods need at least 5000 MBits of communication overhead. On the MRPC and WNLI tasks, FedSplitBERT also attains the performance threshold with less than 1 GB of communication cost, whereas the baseline methods need at least 2× the communication cost to achieve the same performance. By quantizing the global layers, our method obtains an 11.9× communication gain while achieving performance comparable to FedAvg (non-IID).

V Conclusion and Discussion

In this paper, we propose a framework, FedSplitBERT, to address heterogeneous data in federated BERT fine-tuning. We evaluate our methods on the GLUE benchmark and find that they exceed FedAvg, FedProx and FedAdam by a significant margin for heterogeneous clients. Through ablation studies, we find that the critical layer of FedSplitBERT can be tuned according to the type of downstream task. Due to the huge size of the BERT model, we also investigate quantization, which can further reduce the communication cost by 11.9×. Our method is easy to implement and very effective on heterogeneous data. Our framework is also compatible with many existing and upcoming FL aggregation algorithms, such as FedYogi. In the future, we can study more NLP tasks with BERT in the federated setting, such as named entity recognition. Another important problem is how to reduce the communication cost given the increasing size of NLP models. We have studied quantization in this paper, and many other efficient weight compression methods can be employed.


This paper is supported by the Key Research and Development Program of Guangdong Province under grant No.2021B0101400003. Corresponding author is Jianzong Wang from Ping An Technology (Shenzhen) Co., Ltd (wangjianzong347@pingan.com.cn).


  • [1] D. A. E. Acar, Y. Zhao, R. Matas, M. Mattina, P. Whatmough, and V. Saligrama (2021) Federated learning based on dynamic regularization. In International Conference on Learning Representations.
  • [2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901.
  • [3] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia (2017) SemEval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation.
  • [4] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai (2021) Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pp. 2089–2099.
  • [5] I. Dagan, O. Glickman, and B. Magnini (2006) The PASCAL recognising textual entailment challenge. In Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Textual Entailment, pp. 177–190.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • [7] W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).
  • [8] D. Giampiccolo, B. Magnini, I. Dagan, and W. B. Dolan (2007) The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pp. 1–9.
  • [9] A. He, J. Wang, Z. Huang, and J. Xiao (2020) FedSmart: an auto updating federated learning optimization mechanism. In Web and Big Data: 4th International Joint Conference, APWeb-WAIM 2020, Tianjin, China, Proceedings, Part I, Berlin, Heidelberg, pp. 716–724.
  • [10] A. Hilmkil, S. Callh, M. Barbieri, L. R. Sütfeld, E. L. Zec, and O. Mogren (2021) Scaling federated learning for fine-tuning of large language models. In International Conference on Applications of Natural Language to Information Systems, pp. 15–23.
  • [11] Y. Huang, Z. Song, D. Chen, K. Li, and S. Arora (2020) TextHide: tackling data privacy in language understanding tasks. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1368–1382.
  • [12] G. Jawahar, B. Sagot, and D. Seddah (2019) What does bert learn about the structure of language?. In ACL 2019-57th Annual Meeting of the Association for Computational Linguistics, Cited by: §III-A, §IV-C.
  • [13] J. Jin, C. Liang, T. Wu, L. Zou, and Z. Gan (2021) KDLSQ-bert: a quantized bert combining knowledge distillation with learned step size quantization. arXiv preprint arXiv:2101.05938. Cited by: §II-B.
  • [14] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2021) Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 14 (1–2), pp. 1–210. Cited by: §I.
  • [15] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer (2021) I-bert: integer-only bert quantization. International Conference on Machine Learning (Accepted). Cited by: §II-B.
  • [16] G. Lai, Q. Xie, H. Liu, Y. Yang, and E. Hovy (2017-09) RACE: large-scale reading comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 785–794. External Links: Document Cited by: §I.
  • [17] H. Levesque, E. Davis, and L. Morgenstern (2012) The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: 3rd item.
  • [18] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith (2020) Federated optimization in heterogeneous networks. Proceedings of Machine Learning and Systems 2, pp. 429–450. Cited by: §II-A.
  • [19] P. P. Liang, T. Liu, L. Ziyin, R. Salakhutdinov, and L. Morency (2019) Think locally, act globally: federated learning with local and global representations. NeurIPS 2019 Workshop on Federated Learning distinguished student paper award. Cited by: §II-A, §IV-B.
  • [20] D. Liu and T. Miller (2021) Federated pretraining and fine tuning of bert using clinical notes from multiple silos. AI for Public Health Workshop at ICLR’21. Cited by: §I, §II-B.
  • [21] Z. Liu, G. Li, and J. Cheng (2021) Hardware acceleration of fully quantized bert for efficient natural language processing. Design, Automation & Test in Europe (DATE). Cited by: §II-B.
  • [22] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §I, §II-A.
  • [23] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019)

    PyTorch: an imperative style, high-performance deep learning library

    In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §IV-B.
  • [24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018-06) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. External Links: Document Cited by: §I.
  • [25] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. Cited by: 3rd item.
  • [26] S. J. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan (2021) Adaptive federated optimization. In International Conference on Learning Representations, Cited by: §II-A, §IV-A, §IV-B.
  • [27] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani (2020) Fedpaq: a communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pp. 2021–2031. Cited by: §II-B.
  • [28] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2020) Q-bert: hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8815–8821. Cited by: §II-B.
  • [29] S. Si, R. Wang, J. Wosik, H. Zhang, D. Dov, G. Wang, and L. Carin (2020-07–08 Aug)

    Students need more attention: bert-based attention model for small data with application to automatic patient message triage

    In Proceedings of the 5th Machine Learning for Healthcare Conference, F. Doshi-Velez, J. Fackler, K. Jung, D. Kale, R. Ranganath, B. Wallace, and J. Wiens (Eds.), Proceedings of Machine Learning Research, Vol. 126, pp. 436–456. External Links: Link Cited by: §II-B.
  • [30] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642. Cited by: 1st item.
  • [31] C. T. Dinh, N. Tran, and J. Nguyen (2020) Personalized federated learning with moreau envelopes. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 21394–21405. Cited by: §II-A.
  • [32] P. Voigt and A. Von dem Bussche (2017) The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing 10, pp. 3152676. Cited by: §I.
  • [33] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, Cited by: §IV-A.
  • [34] J. Wang, Z. Huang, L. Kong, D. Li, and J. Xiao (2021) Modeling without sharing privacy: federated neural machine translation. In International Conference on Web Information Systems Engineering, pp. 216–223. Cited by: §I.
  • [35] R. Wang, S. Si, G. Wang, L. Zhang, L. Carin, and R. Henao (2020-11) Integrating task specific information into pretrained language models for low resource fine tuning. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 3181–3186. External Links: Link, Document Cited by: §II-B.
  • [36] Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based lstm for aspect-level sentiment classification. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp. 606–615. Cited by: §I.
  • [37] A. Warstadt, A. Singh, and S. R. Bowman (2019-03) Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7, pp. 625–641. External Links: Document Cited by: 1st item.
  • [38] A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122. Cited by: §I, 3rd item.
  • [39] J. Xu, B. S. Glicksberg, C. Su, P. Walker, J. Bian, and F. Wang (2021) Federated learning for healthcare informatics. Journal of Healthcare Informatics Research 5 (1), pp. 1–19. Cited by: §I.
  • [40] C. Yang, Q. Wang, M. Xu, Z. Chen, Y. L. Kaigui Bian, and X. Liu (2021) Characterizing impacts of heterogeneity in federated learning upon large-scale smartphone data. Ljubljana, Slovenia.ACM, New York, NY, USA, 12 pages.. External Links: Document Cited by: §I.
  • [41] L. Yang, C. Beliard, and D. Rossi (2020) Heterogeneous data-aware federated learning. IJCAI 2020 Federated Learning Workshop. Cited by: §II-A.
  • [42] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat (2019) Q8bert: quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-S Edition (EMC2-NIPS), pp. 36–39. Cited by: §II-B.