The rapid evolution of pre-trained models trends toward more and higher-quality data, larger numbers of parameters, and stronger computing power BERT; GPT-2; GPT-3; CPM-2; PANGU; ERNIE-Titan; Gopher; Switch; M6; PaLM. This evolution naturally demands ever more compute cost and time: for example, training the 345-million-parameter BERT-large model took 6.16 PF-days, whereas training the 175-billion-parameter GPT-3 model consumed 3.64E+03 PF-days GPT-3. On the one hand, larger models tend to perform better, especially on few- and zero-shot learning tasks ERNIE-Titan; PaLM, which can greatly empower the AI industry. On the other hand, the increasing demand for compute poses challenges and triggers more exploration of advanced distributed training techniques, as well as the optimization of large-scale resource scheduling and allocation strategies.
Extensive research has focused on parallel computing for training large deep learning models. The most popular techniques include data parallelism PS; Poseidon, model parallelism Mesh; Megatron, pipeline parallelism Pipedream; Memory-efficient-pp, and hybrid parallelism Megatron; DeepSpeed, which accelerate the training process along different dimensions. These techniques have been widely applied in the training of recent pre-trained models and have obtained considerable speed-up ratios GPT-3; PANGU; ERNIE-Titan; PaLM.
As the model size continues to grow, e.g. from 175 billion parameters GPT-3 to 540 billion PaLM, more problems arise. One practical problem is that the spare resources a single cluster can provide will more and more often be insufficient for training a large model, which is particularly common for business clusters. A direct way to address the problem is to expand the computing power by building larger-scale hardware. With the development of nationwide infrastructure construction, many organizations have built supercomputers or intelligent clusters distributed in different locations. In most cases, however, each of these clusters serves independently. Shi et al. introduced a method to train deep learning models on public cloud clusters, which utilizes distributed compute connected by networks Public-Cloud. This inspires us to explore potential solutions that connect the computational resources of different cloud clusters and train models collaboratively. Meanwhile, the modularity of many deep learning algorithms enables the decoupling of a model into different components, which can be dispatched to multiple clusters. However, many challenges remain. For example, as the connections between cloud clusters are usually low-bandwidth (even much lower than those mentioned in Public-Cloud), especially between different locations, communication efficiency will easily become a bottleneck.
To take a step forward, in this work we propose Nebula-I, a unified framework for collaboratively training deep learning models over public cloud clusters, especially clusters connected by low-bandwidth wide area networks (WANs). Nebula-I is a stack of optimization layers together with a security layer, designed to better assist the deployment of a deep learning model in the cloud environment. Using Nebula-I, a model can be decoupled and dispatched to different clusters, which then work collaboratively to execute certain tasks. Our motivation for proposing Nebula-I is threefold. Firstly, we aim to set up practical solutions for pre-training larger models collaboratively over cloud clusters, in which the tasks are carried out using compute aggregated from different locations. Secondly, we would like to explore more possibilities that the cloud environment can support, to fully realize the functionality of cloud clusters and existing pre-trained models. Lastly, we expect Nebula-I to provide user-friendly interfaces that minimize the development effort of end users who wish to train models over the clouds.
We demonstrate how Nebula-I works using two natural language processing (NLP) scenarios: a) pre-training a multilingual language model using two remote clusters, and b) fine-tuning a machine translation model using knowledge distilled from pre-trained models. These two steps form the most popular training paradigm of recent deep learning methods and cover a large variety of training tasks. In our demonstration, we deployed Nebula-I on remote cloud clusters with heterogeneous architectures, i.e. GPU and NPU clusters from Baidu and Peng Cheng Laboratory. The connections between the clusters are low-bandwidth WANs (only up to 170 Mbit/s and 60 Mbit/s), significantly slower than the connections within a cluster. This poses great challenges for communication efficiency, especially for training cases that need frequent data exchanges.
Experiments were conducted on both scenarios to validate the effectiveness of Nebula-I. For Scenario-I, we show the computation efficiency of pre-training with Nebula-I over two remote clusters by comparing the throughput, and verify that the model converges well in the cloud environment. Specifically, we achieved new state-of-the-art (SoTA) performance on a series of cross-lingual natural language inference tasks with the multilingual model pre-trained under a novel learning framework and Nebula-I. For Scenario-II, we compared the knowledge-enhanced model with Transformer-base Transformer and show that the accelerated model preserves its performance well when running across clusters. Different combinations of communication optimizations were compared as well.
The contributions of this work can be summarized as follows:
We introduce a novel framework that fits distributed deep learning over remote cloud clusters connected by low-bandwidth networks;
We propose a unified optimization technique in Nebula-I, named Nebula-Optimizer, that jointly optimizes the training strategy, parallelization and communication;
We validate both pre-training and fine-tuning scenarios in the cloud environment, demonstrating the effectiveness of the proposed framework;
We produce a multilingual pre-trained language model, ERNIE-M Extra, which obtains new SoTA results on cross-lingual natural language inference tasks under the proposed multilingual learning framework and Nebula-I.
To our knowledge, this is the first work that systematically introduces collaborative training techniques over cloud clusters, which could inspire further research in this area.
2 The Nebula-I framework
An overview of Nebula-I is shown in Figure 1. Nebula-I is deployed on top of the cloud hardware and the management platform, and supports user-specific tasks in the task layer. The cloud hardware layer is a group of machines connected by WANs, and the hardware can be heterogeneous. The hardware distributed in different locations is managed by a unified management platform, which schedules and allocates resources automatically. After the computational resources are allocated and the training data are ready, users can run optimized distributed learning programs with the help of Nebula-I. The Nebula-I framework contains four layers:
Training optimization layer: The goal of this layer is to provide communication-efficient techniques associated with user-specific deep learning tasks. Once a deep learning task, e.g. pre-training a language model, is selected by the user, this layer offers an entry point to define cloud-based training strategies, decouple the selected model and dispatch it to multiple clusters.
Parallelization layer: This layer improves the computation efficiency through optimized scheduling of different tasks or operations. By adaptively applying the PaddlePaddle hybrid parallelism techniques within each cluster, and designing task-specific distributed learning strategies across clusters, the training components dispatched to different clusters work in parallel to execute the task collaboratively. When designing the parallelization strategies, both the computation operations and the network environment should be taken into account, so as to maximize the overall throughput.
Communication optimization layer: This layer includes an ensemble of data compression methods to accelerate the communication between clusters. For example, sparsification can be used to select the most effective set of elements during data transfer topk. Quantization can further reduce the communication traffic by using low-bit numbers instead of the original data qsgd. Singular value decomposition is a typical low-rank method that selects the most important feature representations Compression-survey.
Security layer: This layer offers security mechanisms for computation and communication throughout the training process across clusters. Through this layer, data confidentiality and security are ensured when collaboratively training models.
We argue that the execution of a model should be optimized from multiple perspectives before it runs on the cloud clusters. In Nebula-I, we call the combination of the three optimization layers (i.e. training strategy optimization, parallelization, and communication optimization) Nebula-Optimizer (Figure 1), which forms a pipelined optimization system. Only by designing specific optimization strategies for each layer and letting them work together can the overall running efficiency be maximized.
2.1 Training optimization layer
A training strategy targeted at cloud-based computing should achieve both good convergence and reduced communication. There are several scenarios for cloud-based computing. For example, different clusters can be assembled in a parameter server (PS)-worker topology, where each worker first computes multiple steps and the gradients are then aggregated by the server. This PS-worker architecture is often used for data parallelism, where each worker owns its data and does not need to share it with others. As each worker holds a copy of the whole model, the volume to transfer can be as large as the model itself; if the parameters are exchanged frequently, this can easily exhaust the network Poseidon. On the other hand, if each worker is allowed to compute many local steps before parameter aggregation, model convergence may suffer FL-outlook. Thus, when deploying Nebula-I in a PS environment, the communication frequency should be defined in advance, possibly together with the model’s learning objective.
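To make the bandwidth concern concrete, the following sketch estimates how long one full exchange of model parameters would take over an inter-cluster link, and how that cost is amortized when workers run several local steps between exchanges. It is a rough illustration only: it assumes FP32 parameters (4 bytes each), ignores latency and protocol overhead, and borrows the 550-million-parameter model size and the 170 Mbit/s and 60 Mbit/s link speeds reported in this paper. The function names and the `local_steps` parameter are hypothetical.

```python
def transfer_seconds(num_params: int, link_mbit_per_s: float, bytes_per_param: int = 4) -> float:
    """Time to ship every parameter once over a link, ignoring latency and overhead."""
    total_bits = num_params * bytes_per_param * 8
    return total_bits / (link_mbit_per_s * 1e6)


def amortized_seconds_per_step(num_params: int, link_mbit_per_s: float, local_steps: int) -> float:
    """Communication time per training step when parameters are exchanged every `local_steps` steps."""
    if local_steps < 1:
        raise ValueError("local_steps must be at least 1")
    return transfer_seconds(num_params, link_mbit_per_s) / local_steps


if __name__ == "__main__":
    params = 550_000_000  # size of the larger model discussed in this paper
    for mbit in (170.0, 60.0):
        full = transfer_seconds(params, mbit)
        every_100 = amortized_seconds_per_step(params, mbit, local_steps=100)
        print(f"{mbit:5.0f} Mbit/s: {full:7.1f} s per exchange, "
              f"{every_100:5.2f} s/step with 100 local steps")
```

Under these assumptions a single full exchange takes on the order of a hundred seconds or more, which illustrates why the synchronization frequency has to be fixed in advance and weighed against convergence.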
Another family of scenarios is where a complex model can be decoupled into multiple parts that are located in different clusters. By applying pipeline parallelism Pipedream, the communication volume between clusters can be greatly reduced compared with data parallelism, as generally only a proportion of parameters need to be transferred. However, the communication frequency is usually high, which can also affect the training efficiency. In the demonstration of the current work, we used this method as the overall architecture, although Nebula-I can scale out to more mechanisms, e.g. a combination of the two above.
Recently, some knowledge distillation (KD)-based studies have offered new insights into training a model collaboratively with multiple workers ELECTRA; KD-MT. We absorb these ideas into Nebula-I and argue that parameter-efficient techniques should be favored when a model is trained over clouds. In the current version of Nebula-I, we focus on supporting KD-based pre-training and fine-tuning of deep learning models. These models are usually easy to decouple into different parts and require much less data to be communicated than traditional training paradigms. We show two examples in Figure 2, where lightweight networks are added into existing large networks and only the parameters of the add-in networks are transferred between clusters ABNet; Prompt-MT.
2.2 Parallelization layer
Using a PS is a direct way to perform distributed training, especially data parallelism PS; Poseidon. However, when data must be exchanged frequently, a PS might not be a good choice, as it can be overwhelmed by the huge volume of communication Geeps, as mentioned above. Later work focuses more on all-reduce together with other MPI primitives to form different types of parallelization techniques, including model parallelism Mesh; Megatron, pipeline parallelism Pipedream; Memory-efficient-pp, and hybrid parallelism Megatron; DeepSpeed; PANGU; ERNIE-Titan.
The development of distributed training techniques has substantially accelerated the pace of training larger models Megatron; Pathways. In Nebula-I, the training environment contains two parts, i.e. intra-cluster and inter-cluster. For the intra-cluster part, existing well-designed parallelization techniques such as Megatron Megatron and DeepSpeed DeepSpeed can still be applied. For the inter-cluster part, because bandwidth is limited, specific parallelization techniques should be designed that consider both the computational interactions between different clouds and the overlap between computation and network transfer. Nebula-I already supports data parallelism, pipeline parallelism, and model parallelism. These techniques can also be assembled to form hybrid parallelism. In the following demonstrations of how Nebula-I works, the overall inter-cluster parallelization architecture is pipeline parallelism.
2.3 Communication optimization layer
Due to the low-bandwidth and high-latency network between different cloud clusters, data communication can easily become the system bottleneck. To address the communication problem in distributed training across clusters, we integrate several communication optimization techniques in Nebula-Optimizer. In distributed training, data compression techniques including gradient or model quantization qsgd, sparsification topk, and low-rank updates atomo; powersgd are popular strategies to reduce the communication overhead Compression-survey.
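As an illustration of the sparsification idea, the sketch below keeps only the k largest-magnitude entries of a gradient vector and transmits their indices and values, while the receiver fills the remaining positions with zeros. This is a minimal, generic top-k example and not the implementation used in Nebula-I; the function names are hypothetical, and refinements that practical systems often add (such as accumulating the dropped entries locally) are omitted.

```python
from typing import List, Tuple


def topk_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Keep the k entries with the largest magnitude; return their indices and values."""
    if not 0 < k <= len(grad):
        raise ValueError("k must be in 1..len(grad)")
    # Sort indices by absolute value, largest first, and keep the first k.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(order[:k])
    return kept, [grad[i] for i in kept]


def densify(indices: List[int], values: List[float], size: int) -> List[float]:
    """Receiver side: rebuild a dense vector, filling dropped entries with zero."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


if __name__ == "__main__":
    g = [0.01, -0.9, 0.05, 0.4, -0.02, 0.0]
    idx, vals = topk_sparsify(g, k=2)
    print(idx, vals)                    # what would be transmitted
    print(densify(idx, vals, len(g)))   # what the receiver reconstructs
```

Only k index-value pairs cross the network instead of the full vector, at the cost of discarding the smaller entries.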
Gradient or model quantization uses low-bit floating point numbers to represent the data communicated through the network. For example, 8-bit quantization 8bit uses only 8-bit floating point numbers to represent each gradient, which reduces communication traffic by 75% compared with the 32-bit counterpart while preserving training convergence. Low-rank methods reduce the number of variables to communicate by factorizing the source data matrix into several smaller matrices lowranksurvey.
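The traffic saving from low-bit representations can be sketched as follows. Note that this example swaps in a simple symmetric linear mapping to 8-bit integers with one shared scale factor, which is easier to show in a few lines than the 8-bit floating point format cited above; it is an illustrative assumption, not the scheme used in Nebula-I, and all function names are hypothetical. The cost of sending the single scale factor is ignored.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [int(round(v / scale)) for v in values], scale


def dequantize_int8(codes: List[int], scale: float) -> List[float]:
    """Receiver side: recover approximate floats from the 8-bit codes."""
    return [c * scale for c in codes]


def traffic_reduction(bits_before: int, bits_after: int) -> float:
    """Fraction of traffic saved when each value shrinks from bits_before to bits_after."""
    return 1.0 - bits_after / bits_before


if __name__ == "__main__":
    grads = [0.5, -1.27, 0.0, 1.0]
    codes, scale = quantize_int8(grads)
    print(codes, dequantize_int8(codes, scale))
    print(traffic_reduction(32, 8))  # share of traffic saved going from 32 to 8 bits
```

Each value is recovered to within half a quantization step, and moving from 32 to 8 bits per value gives the 75% reduction mentioned in the text.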
In Nebula-I, we use singular value decomposition (SVD) to factorize the data matrix $A \in \mathbb{R}^{m \times n}$ into three matrices by the following formula:
$$A = U \Sigma V^{T},$$
where $U \in \mathbb{R}^{m \times m}$, $\Sigma \in \mathbb{R}^{m \times n}$, and $V \in \mathbb{R}^{n \times n}$. By using the low-rank feature of the matrix $A$, we keep only the largest singular values to compress the matrices while preserving the important information of the original matrix. Using $U_k$, $\Sigma_k$ and $V_k$ to denote the compressed matrices for communication, they can be computed by
$$U_k = U_{[:,\,1:k]}, \quad \Sigma_k = \Sigma_{[1:k,\,1:k]}, \quad V_k = V_{[:,\,1:k]},$$
where $k \le \min(m, n)$. $k$ can be a hyper-parameter to tune the training performance while preserving the model accuracy. The data matrix can be recovered by the following formula:
$$A' = U_k \Sigma_k V_k^{T},$$
where $A'$ is approximately equal to $A$. Thus, the compression ratio using SVD is
$$\frac{mn}{k\,(m + n + 1)}.$$
To further reduce the communication volume, we can quantize the compressed matrices from SVD compression with low-bit representations such as 16-bit floating point numbers (FP16) or even 8-bit integers (INT8). For example, for the input data $A$, we can compress the data by
$$\hat{U}_k = Q(U_k), \quad \hat{\Sigma}_k = Q(\Sigma_k), \quad \hat{V}_k = Q(V_k),$$
where $Q$ is the quantization compressor that converts the 32-bit floating point numbers to 16-bit, and $k$ is the number of singular values used.
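The effect of the rank and of the extra quantization step on traffic can be checked with a small calculation. The sketch below assumes that, for an m-by-n matrix truncated to k singular values, the sender transmits an m-by-k factor, the k singular values, and an n-by-k factor; this storage layout and the function names are assumptions made for illustration.

```python
def svd_compressed_elements(m: int, n: int, k: int) -> int:
    """Elements sent for a rank-k factorization: m*k + k singular values + n*k."""
    if not 0 < k <= min(m, n):
        raise ValueError("k must satisfy 0 < k <= min(m, n)")
    return k * (m + n + 1)


def compression_ratio(m: int, n: int, k: int, bits_before: int = 32, bits_after: int = 32) -> float:
    """Ratio of original traffic to compressed traffic, optionally with quantized factors."""
    original_bits = m * n * bits_before
    compressed_bits = svd_compressed_elements(m, n, k) * bits_after
    return original_bits / compressed_bits


if __name__ == "__main__":
    m, n, k = 1024, 1024, 64
    print(compression_ratio(m, n, k))                 # SVD truncation only
    print(compression_ratio(m, n, k, bits_after=16))  # SVD truncation followed by FP16
```

For a 1024-by-1024 matrix with k = 64 the truncation alone gives a ratio of roughly 8, and halving the bit width of the factors doubles it. The ratio drops below 1 once k exceeds mn / (m + n + 1), so k has to stay well below that bound for the compression to pay off.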
2.4 Security layer
Compared with training models within a cluster, cross-cluster training brings more challenges to data security. This is mainly because the model parameters (e.g., weights or gradients) transferred between two clusters may leak sensitive information to malicious adversaries and cause deep privacy leakage fl-security1. As reported in fl-security2, even a small portion of the original gradients can reveal private information about local training datasets. Therefore, the security layer in Nebula-I provides four mechanisms for computation and communication safety.
The first mechanism separates the computing nodes from the public network, e.g., the Internet, so that only one node (called the "Switch Server") can communicate with the outside.
The second mechanism is a unified identity authentication management service, covering the account life cycle and uniform data access control. When a new user needs to submit a task (e.g. upload data and code), we create a unique account and password. Isolated resources (compute, storage and network) are then allocated for each user. When a user applies for permission to use data owned by another user, the data owner provides a data access interface instead of sharing the raw data directly.
The third mechanism is encrypted data transfer across the two clusters using the TLSv1.2 protocol. A cloud certificate manager service issues certificates and manages their life cycle by calling APIs provided by cloud service providers. The Switch Server in the local cluster applies for a digital certificate that carries its own public key, ID, information about the issuing certificate authority (CA), validity time, certificate serial number and other information, including a signature. Meanwhile, the CA public key is sent to the peer Switch Server (which acts as a client).
The last mechanism is the auditing of code, data and operations, which ensures that no data is accessed and no action is performed illegally.
Through these mechanisms, data confidentiality and security are ensured when collaboratively training models over remote heterogeneous clouds.
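For the encrypted-transfer mechanism, a client-side configuration that restricts a connection to TLS 1.2 and verifies the peer certificate could look like the sketch below. It uses Python's standard `ssl` module purely for illustration; the paper does not state which TLS library or settings Nebula-I uses, and the function name and the `ca_file` path argument are hypothetical.

```python
import ssl
from typing import Optional


def make_tls12_client_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a client-side context that verifies the peer and only negotiates TLS 1.2."""
    # create_default_context enables certificate verification and hostname checking.
    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    if ca_file is not None:
        # Trust the CA certificate distributed to this side (the path is hypothetical).
        context.load_verify_locations(cafile=ca_file)
    # Pin both bounds so that neither older nor newer protocol versions are negotiated.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    context.maximum_version = ssl.TLSVersion.TLSv1_2
    return context


if __name__ == "__main__":
    ctx = make_tls12_client_context()
    print(ctx.minimum_version, ctx.maximum_version, ctx.verify_mode)
```

The returned context would then wrap the socket that connects to the peer Switch Server, so that the connection fails if the peer's certificate cannot be validated against the trusted CA.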
3 How Nebula-I works for pre-training and fine-tuning
We took two clusters as our cloud environment to deploy Nebula-I in each scenario that we would like to demonstrate. The two scenarios are designed to show how Nebula-I works in a pre-training - fine-tuning process, which is a general pipeline of training deep learning models. In each scenario, we followed a top-down style of Nebula-I (Figure 1) to design the functionality of each layer and finally make them work together.
3.1 Pre-training: ERNIE-M Extra-Cloud
The first scenario (Scenario-I) is designed to pre-train a new NLP model from an existing well-trained model. The design aims to maximize the utility of existing models while using only knowledge distilled from the cloud. This could be extremely useful when we do not have direct access to certain existing models, or do not have enough capacity to maintain two models during training. In addition, collaborative training on cloud clusters makes it possible to train with pre-training data from multiple sources. In this paper, we simulate collaborative training on multi-source pre-training data with multilingual pre-training. For training optimization, we employed the pre-training structure from ELECTRA ELECTRA, whose architecture contains a generator and a discriminator. The output of Scenario-I is a much larger multilingual language model named ERNIE-M Extra (550 million parameters, serving as the discriminator), distilled from a smaller ERNIE-M model ERNIE-M (220 million parameters, serving as the generator). Notably, the prevailing ELECTRA-based pre-trained models ELECTRA; XLM-E advocate sharing the embeddings (both the token and positional embeddings) between the generator and the discriminator to improve the efficiency of pre-training, which undoubtedly increases the traffic between the two clusters. Furthermore, sharing embeddings forces the generator and the discriminator to have the same hidden size, which severely limits the flexibility of Scenario-I. Consequently, we discard the embedding-sharing strategy, which reduces the communication between the two clusters to only a small number of word indices.
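Unlike previous BERT-style pre-trained language models, which are trained by predicting masked tokens, ELECTRA proposed a sample-efficient pre-training task known as Replaced Token Detection (RTD), which trains a discriminator $D$ to distinguish real input tokens from fake input tokens generated by a generator $G$. Specifically, the generator $G$ is a conventional BERT-style model trained with the Masked Language Modeling (MLM) task BERT. Given the input $\boldsymbol{x} = (x_1, \ldots, x_n)$ that consists of $n$ tokens, MLM replaces the tokens at a randomly selected position subset $\mathcal{M}$ with the [MASK] token to construct the masked input $\boldsymbol{x}^{\text{mask}}$. Then, the generator learns to predict the probability distributions of the masked-out tokens. The discriminator $D$ is another Transformer-based encoder trained with the RTD task, in which the corrupted input $\boldsymbol{x}^{\text{corrupt}}$ is built by replacing the tokens at the positions in $\mathcal{M}$ with the generated samples. Formally, the loss function of ELECTRA is as follows:
$$\mathcal{L}_{\text{ELECTRA}} = \mathcal{L}_{\text{MLM}}(\boldsymbol{x}; \theta_G) + \lambda\, \mathcal{L}_{\text{RTD}}(\boldsymbol{x}^{\text{corrupt}}; \theta_D),$$
where $\theta_G$ and $\theta_D$ are the parameters of the generator and the discriminator, and $\lambda$ weights the RTD loss.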
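The construction of the corrupted input and of the RTD labels can be sketched in a few lines. The example below works on integer token ids, takes the masked positions and the generator's sampled tokens as given, and marks a position as replaced only when the sampled token differs from the original, which is the labeling convention described for ELECTRA. The function name and the toy ids are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_rtd_example(
    tokens: Sequence[int],
    masked_positions: Sequence[int],
    generator_samples: Sequence[int],
) -> Tuple[List[int], List[int]]:
    """Return (corrupted token ids, labels) where label 1 marks a replaced token."""
    if len(masked_positions) != len(generator_samples):
        raise ValueError("one generator sample is needed per masked position")
    corrupted = list(tokens)
    for pos, sample in zip(masked_positions, generator_samples):
        corrupted[pos] = sample
    # A position counts as "replaced" only if the sampled token differs from the original.
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels


if __name__ == "__main__":
    original = [11, 42, 7, 93, 5]
    # The generator guessed position 1 correctly (42) and position 3 incorrectly (8).
    corrupted, labels = build_rtd_example(original, [1, 3], [42, 8])
    print(corrupted, labels)
```

Because the corrupted input is just a sequence of token ids, only integers need to be passed from the generator to the discriminator, which matches the observation above that dropping shared embeddings reduces inter-cluster traffic to word indices.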
Inspired by the language grouping used in multilingual models M2M-100 and by collaborative training on cloud clusters, we propose the framework of Multi-sources Multi-lingual pre-training with Multi-tasks (M3). As shown in Figure 4, the M3 framework constructs n generators for the n language clusters, which are grouped by language family as introduced in M2M-100. A large multilingual corpus covering more languages requires models with increased capacity, and expanding the scale of the model is a commonly used strategy. Under Scenario-I, we instead explore employing more generators, one per language cluster, to learn the corresponding language knowledge. This leads to four benefits: (1) a language cluster with fewer languages demands less model capacity (such as that of a base-size model); (2) training different language clusters on different generators is more beneficial to knowledge transfer between similar languages; (3) the multi-source (multi-generator) structure fits the collaborative training framework on multiple cloud clusters well, since each generator can be deployed on one cluster; (4) different pre-training tasks can be conducted on different generators, which results in multi-task learning. For the discriminator, the multiple sources (generators) can be regarded as multiple teachers, and the fused knowledge from different teachers enables efficient training of the discriminator.
Following XLM-E XLM-E and COCO-LM COCO-LM, we construct three tasks to pre-train ERNIE-M Extra, all relying on self-supervised or weakly supervised signals that can be obtained from massive data without human annotation: the Multilingual Replaced Token Detection (MRTD) task, the Translation Replaced Token Detection (TRTD) task, and the Corrective Language Modeling (CLM) task.
Multilingual Replaced Token Detection: As introduced in XLM-E, the multilingual replaced token detection task is similar to the task in monolingual ELECTRA pre-training, except that the input sentences can come from various languages rather than a single language. Naturally, the generator, the discriminator and the vocabulary are all shared across languages.
Translation Replaced Token Detection: As a first exploration of how to improve discriminative pre-training with parallel corpora, translation replaced token detection, proposed in XLM-E, aims to distinguish real input tokens from replaced ones in corrupted parallel sentence pairs. In detail, given an input parallel sentence pair, MLM first chooses a random subset of positions to be masked, in which the positions are uniformly distributed over both languages. The generators learn to predict the masked tokens and then construct the corrupted parallel sentence pair by replacing the masked-out tokens with generated samples. Discriminative pre-training is conducted on the discriminator by predicting whether each token in the corrupted parallel sentence pair is the original one or a replaced one.
Corrective Language Modeling: Compared with BERT-style pre-trained models, ELECTRA is more compute-efficient and achieves better performance. However, as mentioned in COCO-LM, its lack of language modeling ability limits the application of the model. COCO-LM proposed the corrective language modeling task to alleviate this problem. More specifically, given the corrupted input $\boldsymbol{x}^{\text{corrupt}}$, CLM trains the discriminator to recover the original tokens by optimizing the RTD task and the All-Token MLM task (see ELECTRA for a detailed description) simultaneously. For ERNIE-M Extra, however, the RTD loss is already calculated in both the multilingual and the translation replaced token detection tasks. Therefore, the loss function of corrective language modeling is simplified to the All-Token MLM term alone:
$$\mathcal{L}_{\text{CLM}} = \mathcal{L}_{\text{AT-MLM}}(\boldsymbol{x}^{\text{corrupt}}; \theta_D),$$
where $\theta_D$ denotes the parameters of the discriminator.
In general, we minimize the combined loss of the generators and the discriminator over large monolingual and parallel corpora, in which the i-th generator $G_i$ is deployed on the corresponding i-th cloud cluster. Similar to ELECTRA, only the discriminator is fine-tuned on downstream tasks, while the generators are discarded.
3.2 Fine-tuning: ABNet-Cloud
The goal of the second scenario (Scenario-II) is to fine-tune a task model with the help of existing pre-trained models, e.g. ERNIE-M Extra from Scenario-I. We use machine translation (MT) as the fine-tuning task in this scenario. Specifically, we assume that two pre-trained models, for the source and target languages respectively, are distributed in two remote cloud clusters. While training the MT model, the task executor iteratively requests knowledge from the pre-trained models for encoding and decoding. We adopted the architecture of ABNet ABNet to simulate this process, in which the encoder and decoder parts are initialized from two pre-trained models. Since the source and target models are decoupled, neither cloud cluster can obtain any parallel training instances, and thus the privacy of the user’s corpus can be guaranteed.
ABNet is a parameter-efficient fine-tuning model that comprises adapter modules and pre-trained models for the source and target languages. It utilizes knowledge from separate pre-trained models by training adapter modules inserted between encoder/decoder sub-layers while freezing the parameters of the pre-trained models. On the encoder side, an adapter module with layer normalization and two feed-forward layers with non-linear activation is inserted between each sub-layer of the pre-trained (masked) language model from the source language domain. On the decoder side, the base module is the pre-trained (masked) language model from the target language domain, and an adapter module consisting of multi-head cross-attention, a feed-forward layer, layer normalization and residual connections is inserted between each sub-layer of the base module.
Unlike most previous fine-tuning studies that also adopt a parameter-efficient strategy (Parameter-efficient-1; Parameter-efficient-2), ABNet does not require the architecture of the adapter modules to be fixed. Instead, we can employ different adapter architectures on the encoder and decoder sides, which allows the model to be easily adjusted to different downstream tasks. As shown in Figure 5, the architecture of ABNet is suitable for decoupling. We re-implemented the model in the PaddlePaddle deep learning framework, deployed it on two remote cloud clusters, and named it ABNet-Cloud. We assume that the two pre-trained models from the source and target language domains are located on the remote cloud clusters $C_X$ and $C_Y$, respectively. Before training, the clusters $C_X$ and $C_Y$ obtain the corresponding sequences $X$ and $Y$ from the parallel training dataset $\mathcal{D}$, respectively, which can be controlled and dispatched by the cloud management platform to guarantee users’ data privacy.
During the feed-forward stage, the representations of source sequences are sent from the remote cloud cluster $C_X$ to the remote cloud cluster $C_Y$ after communication optimization. On the remote cloud cluster $C_X$, source sequences are fed into the encoder, which consists of the pre-trained model from the source language domain and the inserted adapter modules. Formally, we denote the adapter module and the pre-trained model layer block on the encoder side as $\mathrm{A}_{\text{enc}}$ and $\mathrm{B}_{\text{enc}}$, respectively. The hidden state of each encoder layer in the model is computed as:
$$H^{E}_{l} = \mathrm{A}_{\text{enc}}\big(\mathrm{B}_{\text{enc}}(H^{E}_{l-1})\big),$$
where $H^{E}_{l}$ denotes the hidden state of the $l$-th encoder layer. After obtaining the hidden state of the last encoder layer, $H^{E}_{L}$, we take it as the representations of source sequences, which are sent to the cloud cluster $C_Y$.
On the remote cloud cluster $C_Y$, target sequences and the representations of source sequences are fed into the decoder, which consists of the pre-trained model from the target language domain and the inserted adapter modules with extra multi-head cross-attention. Specifically, the multi-head cross-attention layers in the adapter modules extract conditional context from the representations of source sequences. Formally, we denote the adapter module and the pre-trained model layer block on the decoder side as $\mathrm{A}_{\text{dec}}$ and $\mathrm{B}_{\text{dec}}$, respectively. The hidden state of each decoder layer in the model is computed as:
$$H^{D}_{l} = \mathrm{A}_{\text{dec}}\big(\mathrm{B}_{\text{dec}}(H^{D}_{l-1}),\, H^{E}_{L}\big),$$
where $H^{D}_{l}$ represents the hidden state of the $l$-th decoder layer.
During back propagation, the gradients of the last encoder layer are sent from the remote cloud cluster $C_Y$ to the remote cloud cluster $C_X$ after communication optimization. On the remote cloud cluster $C_Y$, the decoder computes the conditional MLM loss from the hypothesis sequences and target sequences, and subsequently returns the gradients of the last encoder layer. On the remote cloud cluster $C_X$, the encoder continues to compute the gradients of the remaining encoder layers. Finally, the adapter modules inserted in the encoder and decoder are trained to adapt to the machine translation task, while the parameters of the pre-trained models stay frozen during training.
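The exchange pattern described above, where only the last encoder representation travels forward and only its gradient travels back, can be illustrated with a deliberately tiny model. In the sketch below each side holds a single trainable scalar, standing in for the adapter parameters, and the two classes never share anything except the forwarded representation and the returned gradient. The classes, the tanh encoder and the squared-error loss are all illustrative assumptions and bear no relation to the actual ABNet-Cloud layers or its conditional MLM loss.

```python
import math
from typing import List, Tuple


class EncoderSide:
    """Toy stand-in for the encoder cluster: h_i = tanh(w * x_i)."""

    def __init__(self, w: float) -> None:
        self.w = w

    def forward(self, x: List[float]) -> List[float]:
        self.x = x
        self.h = [math.tanh(self.w * xi) for xi in x]
        return self.h  # this is what would be sent to the decoder cluster

    def backward(self, grad_h: List[float]) -> float:
        # Chain rule through tanh: dh/dw = (1 - h^2) * x.
        return sum(g * (1.0 - h * h) * xi for g, h, xi in zip(grad_h, self.h, self.x))


class DecoderSide:
    """Toy stand-in for the decoder cluster: loss = 0.5 * sum((v * h_i - y_i)^2)."""

    def __init__(self, v: float) -> None:
        self.v = v

    def loss_and_grads(self, h: List[float], y: List[float]) -> Tuple[float, float, List[float]]:
        residual = [self.v * hi - yi for hi, yi in zip(h, y)]
        loss = 0.5 * sum(r * r for r in residual)
        grad_v = sum(r * hi for r, hi in zip(residual, h))
        grad_h = [r * self.v for r in residual]  # this is what would be sent back
        return loss, grad_v, grad_h


def end_to_end_loss(w: float, v: float, x: List[float], y: List[float]) -> float:
    """Same computation on a single machine, used only to check the split version."""
    return 0.5 * sum((v * math.tanh(w * xi) - yi) ** 2 for xi, yi in zip(x, y))


if __name__ == "__main__":
    x, y = [0.5, -1.0, 2.0], [0.1, 0.2, -0.3]
    enc, dec = EncoderSide(0.7), DecoderSide(-1.3)
    h = enc.forward(x)                             # forward: encoder -> decoder
    loss, grad_v, grad_h = dec.loss_and_grads(h, y)
    grad_w = enc.backward(grad_h)                  # backward: decoder -> encoder
    print(loss, grad_v, grad_w)
```

The gradients obtained this way agree with those of the unsplit model, which is why the two halves can be trained on separate clusters without either side seeing the other's input sequences.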
3.3 Network architecture and parallel computing for both the scenarios
Some novel parallel computing methods are introduced to train with high performance on different clusters and to compensate for the differences in computational capability between clusters.
For Scenario-I, data parallelism was adopted on both the generator side and the discriminator side, and we used only one generator for simplicity. One problem is finding a proper way to train the generator and the discriminator on different clusters so as to maximize hardware utilization. Since the data flow is one-way, i.e. the generator $G$ only needs to send its feed-forward result to the discriminator $D$, while $D$ does not need to send gradients back to $G$, the basic inter-cluster training architecture for Scenario-I is simple. After $G$ generates a result for a micro-batch, it sends this result to $D$ through the network and carries on with the rest of its computation. After receiving the result, the discriminator $D$ carries out the corresponding computation.
However, this basic idea introduces a challenge. If $G$ just sends the results for the current micro-batch to the discriminator $D$ and then moves on to the next micro-batch, $G$ and $D$ might not keep the same training pace, i.e. they may process different micro-batches at the same time. This divergence grows when $G$ and $D$ are of different sizes, so that the models saved by $G$ and $D$ may have seen different amounts of training data. To handle this, a synchronization method was added. For Scenario-I, after each optimization stage (updating the parameters in $G$ and $D$), a small tensor is sent between $G$ and $D$ to guarantee that the two parts keep the same training pace.
Besides, the variance in computational capability between clusters is a challenge for high-performance training. For example, if the generator is located on a computationally faster cluster and the discriminator on a slower cluster (this can be caused by diverse reasons, e.g. the different scales or types of hardware, how a certain deep learning framework behaves on different clusters, etc.), assigning each generator one discriminator results in low hardware utilization on the generator side, as the generator spends more time syncing with the discriminator than training. To address this challenge, each generator is assigned multiple discriminators according to the sizes of the generator and the discriminator. A generator with several discriminators first performs one forward computation per assigned discriminator and produces the same number of results. All these results are sent from the generator to a root discriminator in the discriminator cluster and then scattered to the corresponding sub-discriminators. With this modification, the generator spends more time computing, which increases hardware utilization. The synchronization method has to be updated accordingly: all the discriminators sync with each other first, and only then does the root discriminator sync with the generator, so that every trainer keeps the same training pace. In short, several resources on the slower cluster are connected to each resource on the faster cluster, where the number is estimated from a practical performance test on each cluster.
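One way to turn such a performance test into an assignment is sketched below. The text only says the number is estimated from practical tests, so the rule used here (round up the ratio of measured per-batch times) and the simple utilization model are assumptions, as are the function names and the timing values in the usage note.

```python
import math


def discriminators_per_generator(gen_ms_per_batch, disc_ms_per_batch):
    """Estimate how many discriminators one generator can keep busy.

    Assumption: one generator batch feeds exactly one discriminator batch,
    and the per-batch times come from a benchmark run on each cluster.
    """
    if gen_ms_per_batch <= 0 or disc_ms_per_batch <= 0:
        raise ValueError("measured times must be positive")
    return max(1, math.ceil(disc_ms_per_batch / gen_ms_per_batch))


def generator_utilization(num_discriminators, gen_ms_per_batch, disc_ms_per_batch):
    """Fraction of a run the generator spends computing rather than waiting.

    Assumption: the discriminators work in parallel and communication
    time is ignored, so one run lasts max(n * gen_time, disc_time).
    """
    busy = num_discriminators * gen_ms_per_batch
    return busy / max(busy, disc_ms_per_batch)
```

With hypothetical measurements of 1 ms per generator batch and 8 ms per discriminator batch, the sketch suggests eight discriminators per generator, and the modelled generator utilization rises from 0.125 to 1.0.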
The overall parallel computing method for Scenario-I is shown in Figure 6. According to our test, the number was set to 8, i.e. eight discriminators are assigned to one generator in our environment. The generator first computes eight times in each run and generates eight results. These eight results are then sent to the root discriminator on the discriminator side, which scatters the corresponding data to the eight sub-discriminators. After each run, the sub-discriminators sync with the root discriminator within the discriminator cluster, while the generator syncs with the root discriminator. Note that both the generator side and the discriminator side can be scaled to more hardware resources.
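The order of operations in one run can be summarized with a small sketch. It is only a sequential stand-in for the real distributed execution. The function `run_once`, the event names, and the choice of ranks 0 to 7 for the discriminators are hypothetical.

```python
def run_once(generate, num_discriminators=8):
    """Trace one run: generate, send to the root, scatter, then sync twice."""
    events = []

    # The generator computes once per assigned discriminator.
    results = [generate(index) for index in range(num_discriminators)]
    events.append(("generator_computes", len(results)))

    # All results travel over the inter-cluster link to the root discriminator.
    events.append(("send_to_root", len(results)))

    # The root scatters one result to each discriminator rank (assumed 0..n-1).
    shards = {rank: results[rank] for rank in range(num_discriminators)}
    events.append(("root_scatters", len(shards)))

    # The discriminators sync among themselves before the inter-cluster sync.
    events.append(("discriminators_sync", num_discriminators))
    events.append(("generator_syncs_with_root", 1))
    return shards, events
```

Only the send to the root and the final sync cross the slow inter-cluster link, while the scatter and the sync among discriminators stay inside the discriminator cluster.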
For Scenario-II, the encoder not only has to send its results to the decoder, but also has to receive gradients from the decoder. We adopted the pipeline parallelism provided by the PaddlePaddle framework, which provides all the send/recv operations needed by Scenario-II. Inside each cluster, data parallelism was adopted. Since the encoder and the decoder are of the same size in Scenario-II, we can assign almost the same scale of hardware to each of them. The overall parallel computing structure is illustrated in Figure 7.
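The two-way data flow of Scenario-II can be made concrete with a toy example. The sketch below is not the PaddlePaddle pipeline API. It replaces the encoder and decoder with single scalar weights and uses plain function calls where the real system uses send/recv, so all class and function names are hypothetical.

```python
class EncoderStage:
    """Toy encoder on cluster A: activation = weight * x."""

    def __init__(self, weight):
        self.weight = weight
        self._last_input = None

    def forward(self, x):
        self._last_input = x
        return self.weight * x                      # activation sent to the decoder

    def backward(self, grad_activation):
        return grad_activation * self._last_input   # dLoss / d(encoder weight)


class DecoderStage:
    """Toy decoder on cluster B: squared error of weight * activation."""

    def __init__(self, weight):
        self.weight = weight

    def forward_backward(self, activation, target):
        prediction = self.weight * activation
        loss = (prediction - target) ** 2
        grad_prediction = 2 * (prediction - target)
        grad_weight = grad_prediction * activation
        grad_activation = grad_prediction * self.weight  # sent back to the encoder
        return loss, grad_weight, grad_activation


def train_step(encoder, decoder, x, target):
    activation = encoder.forward(x)                         # "send" to the decoder
    loss, grad_dec, grad_act = decoder.forward_backward(activation, target)
    grad_enc = encoder.backward(grad_act)                   # "recv" from the decoder
    return loss, grad_enc, grad_dec
```

The only values that cross the cluster boundary are the activation in the forward pass and its gradient in the backward pass, which is why these two tensors are the natural targets for communication compression.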
4.1 ERNIE-M Extra-Cloud
ERNIE-M Extra was trained with monolingual and parallel corpora. For the monolingual data, ERNIE-M Extra adopted the CC100 corpus cc100; XLM-R used in XLM-R XLM-R, which covers 116 languages, 5 of which are romanized languages. For the bilingual data, we used a total of 15 languages, as in InfoXLM infoxlm, collected from MultiUN UN, IIT Bombay Bombay, OPUS OPUS, and WikiMatrix WM. Following M2M-100 M2M-100, we grouped all the training data into 15 groups by language family.
The prevailing Transformer encoder was adopted as the backbone of the model in Scenario-I. For the generators, a structure with 12 layers, 768 hidden units and 12 heads was employed. For the discriminator, a structure with 24 layers, 1,024 hidden units and 16 heads was used. The activation function is GeLU gelu. To make the most of existing models, we initialized the parameters of the generators and of the discriminator from existing pre-trained models. We used the Adam optimizer Adam to train ERNIE-M Extra; the learning rate was warmed up over 10K steps to its peak value and then decayed linearly. Two further hyperparameters were set to 50 and 1, respectively. The training was separated into two steps, i.e. intra-cluster and inter-cluster. During intra-cluster training, we conducted the pre-training experiments using 64 NVIDIA A100-40GB GPUs with a batch size of 2,048 and a maximum length of 512. During inter-cluster training, we used 8 NVIDIA V100-32GB GPUs and 64 Ascend 910-32GB NPUs and kept the other hyperparameters the same.
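As an illustration of a schedule with warm-up followed by linear decay, a minimal sketch is given below. Only the 10K warm-up steps come from the text. The function name, the peak learning rate and the total number of steps are placeholders chosen for the example.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if total_steps <= warmup_steps:
        raise ValueError("total_steps must exceed warmup_steps")
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up from zero
    remaining = max(0, total_steps - step)            # clamp after the last step
    return peak_lr * remaining / (total_steps - warmup_steps)
```

For example, with a placeholder peak of 1.0, 10,000 warm-up steps and 90,000 total steps, the rate is 0.5 at step 5,000, reaches 1.0 at step 10,000 and falls back to 0.5 at step 50,000.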
We ran experiments on the typical cross-lingual evaluation benchmark XNLI XNLI to evaluate the fine-tuning performance of the pre-trained ERNIE-M Extra.
Cross-lingual Natural Language Inference
As a multilingual language inference task, the cross-lingual natural language inference (XNLI) task aims to determine the relationship between two input sentences. Note that we only evaluated ERNIE-M Extra in the cross-lingual transfer setting of XNLI, in which the model is fine-tuned on the English training set and evaluated on the XNLI test sets of the other languages.
Table 1: XNLI results. Setting: fine-tune cross-lingual model on English training set (Cross-lingual Transfer).
The results of ERNIE-M Extra on the XNLI task are reported in Table 1. ERNIE-M Extra outperforms all baseline models, including XLM XLM, Unicoder Unicoder, XLM-R XLM-R, InfoXLM infoxlm, VECO VECO and ERNIE-M ERNIE-M, on most languages. The reported scores on the test set are averaged over five runs with different random seeds. Under the cross-lingual transfer setting, ERNIE-M Extra achieves an accuracy of 82.2, outperforming InfoXLM by 0.8.
We trained ERNIE-M Extra in both the intra-cloud and the inter-cloud environments. The network optimizations across multiple clouds are described in §4.3.
For Scenario-II, we conducted experiments on Spanish-English (Es-En) and Chinese-English (Zh-En) translation from the IWSLT’14 datasets (https://wit3.fbk.eu/) to verify the effectiveness of the machine translation model for our fine-tuning task. We re-implemented this model, ABNet, in PaddlePaddle from the original PyTorch version and replaced the multilingual model with ERNIE-M ERNIE-M. For all language pairs in IWSLT’14, we merged the validation dataset dev 2010 and the test datasets tst 2010, tst 2011 and tst 2012, and we report the BLEU score on the merged dataset, which strictly follows the dataset configuration of ABNet. Preprocessing such as tokenization was done automatically with the sentencepiece or wordpiece program, depending on the pre-trained model.
Following the metric configuration of ABNet, we used case-insensitive BLEU as the evaluation metric. The BLEU score is calculated using the multi-bleu.perl script.
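To make the metric concrete, the sketch below computes a case-insensitive corpus-level BLEU with one reference per sentence. It is a simplified illustration of the standard BLEU definition (clipped n-gram precisions up to 4-grams and a brevity penalty), not a replacement for multi-bleu.perl. It assumes whitespace-tokenized input, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Case-insensitive corpus BLEU (0 to 100), single reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens = hyp.lower().split()   # lower-casing makes it case-insensitive
        ref_tokens = ref.lower().split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Scores from this sketch are only comparable with published numbers when the same tokenization and reference sets are used.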
We used the ERNIE-M-base model ERNIE-M, the mBERT-base model and the BERT-base model BERT as the pre-trained language models in our experiments. Specifically, on the encoder side we used ERNIE-M-base-cased or mBERT-base-cased as the pre-trained model from the source language domain. On the decoder side, we used bert-base-uncased as the pre-trained model from the target language domain. For the adapter modules, we adopted 512 dimensions for the hidden state between the two feed-forward layers on the encoder side. On the decoder side, we adopted 768 dimensions for the hidden state of the cross-attention module, which equals the hidden dimension of the BERT-base models. We trained the model with a batch size of 128 sequences and a maximum length of 64. Parameters were optimized with the Adam optimizer Adam, and label smoothing Label-smoothing was also adopted. The training was separated into two steps, i.e. intra-cluster and inter-cluster. During intra-cluster training, we conducted the fine-tuning experiments using 2 NVIDIA V100-32GB GPUs. During inter-cluster training, we deployed the encoder and the decoder on 8 NVIDIA V100-32GB GPUs in each cluster. For inference, we used beam search to obtain the translation from the ABNet model.
Table 2 shows the results for the Es-En and Zh-En translation tasks. We compared our re-implemented ABNet with the Transformer-base model in the PaddlePaddle deep learning framework. As can be observed from Table 2, by employing pre-trained models on both the encoder and decoder sides, our methods obtain significant improvements on both language pairs compared with Transformer-base. This validates the correctness of our re-implementation and the effectiveness of ABNet in utilizing the knowledge from pre-trained language models for better translation. We observed that the ERNIE-M-based variant performs better than the mBERT-based variant on both language pairs, especially on the Zh-En pair. This can be mainly attributed to the superiority of ERNIE-M over mBERT. An interesting observation is that the variant trained with communication optimization performs slightly better than its counterpart trained without it. This suggests that appropriate communication optimization might even bring a slight increase in accuracy while significantly improving transmission efficiency.
4.3 Network, parallel computing and communication experiment
To test the performance of the framework, we conducted experiments on the throughput of Scenario-I. To test intra-cloud training, we carried out the test on Peng Cheng Cloud Brain I (V100). To test inter-cloud training with a homogeneous hardware structure, we carried out the test on Baidu Cloud (V100) and Peng Cheng Cloud Brain I (V100). To test inter-cloud training with a heterogeneous hardware structure, we carried out the test on Baidu Cloud (V100) and Peng Cheng Cloud Brain II (Ascend 910). The experimental results for homogeneous hardware can be found in Table 3, and the results for heterogeneous hardware can be found in Table 4.
Table 3: Throughput ratio of Scenario-I with homogeneous hardware.
|Cloud Cluster(s)|Throughput Ratio|
|Intra Peng Cheng Cloud Brain I|1.0|
|Inter Baidu Cloud and Peng Cheng Cloud Brain I|0.85|

Table 4: Speedup ratio of Scenario-I with heterogeneous hardware.
|Cloud Clusters|Speedup Ratio|
|Inter Baidu Cloud and Peng Cheng Cloud Brain II (8 NPUs)|1.0|
|Inter Baidu Cloud and Peng Cheng Cloud Brain II (64 NPUs)|4.34 (see note)|
Note: Some optimizations could be further applied to improve the performance.
Besides the performance experiments, we also carried out an accuracy experiment for inter-cloud training. Since the inter-cloud resources are limited, for inter-cloud training we warm-started the model from a checkpoint instead of training it from scratch. In the experiment, we first trained the model for Scenario-I in the intra-cloud environment for 88,000 steps and then resumed training in the inter-cloud environment (Baidu Cloud and Peng Cheng Cloud Brain I) for another 2,000 steps. From the training loss shown in Figure 8, we can see that inter-cloud training does not affect the convergence of the model.
To test the effect of communication optimization, we conducted experiments using different communication compression strategies in Scenario-II, i.e. the Es-En fine-tuning task. The experimental setting was the same as that in §4.2. FP16 quantization was used to compress the feed-forward activations before transmission, and INT8 quantization was used to compress the back-propagation gradients before transmission. SVD compression was nested with FP16 to further reduce the communication volume, and the effect of the compression ratio was tested by choosing different ratios of SVD singular values. The experiments were carried out on Baidu Cloud (V100) and Peng Cheng Cloud Brain I (V100), which were connected over the Internet with a bandwidth of up to 60 Mbit/s. The test results of the models trained with the different compression methods are shown in Table 5, and the training loss of each method is shown in Figure 9.
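The two quantization steps can be sketched in a few lines. The text does not specify the exact quantization scheme, so the symmetric per-tensor INT8 scaling below is an assumption, and the function names are hypothetical. The FP16 step simply rounds each value to IEEE 754 half precision using Python's `struct` format `e`.

```python
import struct


def to_fp16(values):
    """Round each value to half precision (2 bytes instead of 4 on the wire)."""
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]


def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (assumed scheme).

    Returns integer codes in [-127, 127] plus the scale needed to decode them.
    """
    if not values:
        raise ValueError("cannot quantize an empty tensor")
    largest = max(abs(v) for v in values)
    scale = largest / 127 if largest > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate values on the receiving side."""
    return [code * scale for code in codes]
```

In this scheme each gradient value is sent as one byte plus a single shared scale, and the reconstruction error per value is bounded by about half the scale.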
In Table 5, the baseline was trained with uncompressed communication; FP16+INT8 was trained with FP16 compression of the feed-forward activations and INT8 compression of the back-propagation gradients; and FP16(SVD(·))+INT8 was trained with nested FP16 and SVD compression of the feed-forward activations and INT8 compression of the back-propagation gradients, where the argument of SVD is the ratio of the total singular values that is used. We compared the effect of different compression ratios on the final BLEU results and on training speed. In Table 5, communication compression was used from the first training step.
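A short calculation shows how the ratio of kept singular values relates to the transmitted volume. The sketch assumes that a truncated SVD of a rows-by-cols matrix is sent as its two factor matrices plus the kept singular values. How the payload is actually packed is not stated in the text, and the function names are hypothetical.

```python
import math


def svd_payload_values(rows, cols, keep_ratio):
    """Number of values sent when keeping a fraction of the singular values.

    Assumption: the payload is U_k (rows x k), the k singular values,
    and V_k (cols x k), with k = ceil(keep_ratio * min(rows, cols)).
    """
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    kept = max(1, math.ceil(keep_ratio * min(rows, cols)))
    return kept * (rows + cols + 1)


def svd_volume_ratio(rows, cols, keep_ratio):
    """Transmitted volume relative to sending the dense matrix."""
    return svd_payload_values(rows, cols, keep_ratio) / (rows * cols)
```

For a 64 x 768 activation matrix (one sequence of maximum length 64 with hidden size 768, used here only as an illustrative shape), keeping a quarter of the singular values sends roughly 27% of the dense volume, while keeping all of them sends more than the dense matrix. Under this payload assumption, SVD only pays off below a certain ratio.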
From the experimental results in Table 5, it can be seen that the BLEU values of all the methods with communication compression are lower than that of the baseline; the best of them is 41.92. This demonstrates that compressed communication leads to a loss of model accuracy. The comparison of the compression methods, however, shows that compressing more data does not always lead to a larger BLEU loss. For example, compared with FP16+INT8, FP16(SVD(0.2))+INT8 compresses 41% more data yet improves BLEU by 0.18. FP16(SVD(0.6))+INT8, with a compression ratio of 0.30, achieves the best BLEU among all the compression methods. This suggests that the compression ratio can be treated as a hyper-parameter to tune for communication optimization.
From Table 5 we can see that data compression reduces the data transmission time and speeds up model training. FP16(SVD(0.2))+INT8 saves 3.56 seconds per training step compared with the baseline. It sends only 17% of the baseline's data and uses 19% of its time. These results show that communication compression effectively reduces the training time in the cloud-cluster environment. FP16(SVD(0.9))+INT8 sends less data than FP16+INT8, but is 0.5 seconds slower. This indicates that the SVD decomposition itself consumes considerable computing time: if the data are not compressed enough, the training speed drops. Thus, selecting the compression ratio is an important step.
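The trade-off between compression overhead and saved transmission time can be expressed as a simple break-even check. Only the 60 Mbit/s bandwidth comes from the text. The payload sizes and overheads in the usage note are hypothetical, and the model ignores latency and any overlap of computation with communication.

```python
def transfer_seconds(num_bytes, bandwidth_mbit_per_s=60.0):
    """Time to push a payload through the inter-cluster link."""
    return num_bytes * 8 / (bandwidth_mbit_per_s * 1_000_000)


def compression_pays_off(raw_bytes, compressed_bytes, compress_seconds,
                         bandwidth_mbit_per_s=60.0):
    """True if compressing and sending is faster than sending the raw payload."""
    raw_time = transfer_seconds(raw_bytes, bandwidth_mbit_per_s)
    compressed_time = compress_seconds + transfer_seconds(
        compressed_bytes, bandwidth_mbit_per_s)
    return compressed_time < raw_time
```

With a hypothetical 30 MB payload per step, sending it raw takes 4 seconds at 60 Mbit/s. Compressing it to 7.5 MB with 0.5 seconds of overhead is clearly faster, whereas compressing it only to 27 MB with the same overhead is slower than not compressing at all.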
From the training losses in Figure 9, we can see that communication compression increases the training loss and that, after about 8,000 steps, more compression leads to a larger training loss. All the compression methods, however, still guarantee the convergence of model training.
We also conducted experiments on starting communication compression at different training steps. We chose FP16(SVD(0.6))+INT8, the communication compression method with the best performance according to Table 5, as the target method. The experimental results are shown in Figure 10. We can observe that starting communication compression at step 1,000, 2,000 or 10,000 further improves BLEU, and that starting at step 2,000 or step 10,000 even performs better than the baseline in Table 5. This means that choosing an appropriate step at which to start compression can further balance accuracy and communication efficiency.
The experimental results demonstrate that the NLP tasks in the two scenarios can practically run across the two clusters. For Scenario-I, we have shown that when the communication traffic is low, inter-cloud training does not reduce the throughput much, which is acceptable. Furthermore, when warm-starting the model in the intra-cluster environment and then resuming the training across clouds, we found that the loss still converges well. For Scenario-II, since the communication volume is much larger than that of Scenario-I, we applied several optimization techniques to increase the degree of parallelization and accelerate the communication. The results demonstrate that with the help of Nebula-I the training speed can be considerably improved, to around 3 times that of the baseline, while preserving satisfactory performance (Table 5). It is interesting to see from Figure 10 that compression can even improve on the BLEU of the baseline (43.77 vs 43.19). We conjecture that compression, which gives up some data, is similar to the Dropout mechanism Dropout that is widely applied in deep learning, and can therefore improve the generalizability of a model to some extent.
Trade-off between accuracy and efficiency
Training across cloud clusters suffers from low bandwidth and high latency, which can significantly slow down the training process. Although we have multiple compression techniques to reduce the communication time during training, they may sacrifice some accuracy when the compression ratio is not chosen properly. Furthermore, according to our experimental results (Table 5), some high compression ratios may achieve better generalization performance (possibly similar to the dropout mechanism). Thus, how to select a proper compression ratio (or compression method) for a particular training task so as to achieve the best performance in a specific cross-cloud environment is worth further exploration.
Unlike clusters located in a single data center, cross-cloud clusters generally have different hardware configurations (i.e., hardware heterogeneity). Hardware heterogeneity easily makes the training speed unbalanced, as slow processors become stragglers and make training inefficient. For example, in our studied environment, one cloud is equipped with GPUs and the other with NPUs. Although the straggler problem in distributed training has been well studied, our proposed solution (i.e., splitting the model into two parts according to the model architecture) for enabling distributed training in the cross-cloud environment is quite different from the data-parallel scenario straggler. How to split the model and place its parts on different cloud clusters, taking the hardware configurations into account so as to utilize the hardware resources optimally, is another direction for further study End-to-End.
In the current version of Nebula-I, we only provide basic security mechanisms for computation and communication, based on trust between the two cross-cloud clusters. However, when the system is deployed on two clouds without any mechanism for establishing trust, the data communicated across the two clouds (e.g., activation outputs) may leak private data. The system should also be designed carefully for cases where the data cannot be shared across the cloud clusters.
In addition to the Nebula-I framework, which is proposed to optimize cloud-based training, we would also like to promote the utility of pre-trained models by showing their effectiveness in our two scenarios. Currently, the re-use of many large models is still limited. For example, some models only provide inference APIs, while most end users do not have the hardware capacity to host them GPT-3; PANGU; ERNIE-Titan. Scenario-I is designed to re-use existing models to assist the pre-training of a model with larger capabilities. Figure 4 shows a method to aggregate the abilities of different pre-trained models, which is theoretically effective and efficient. This can also be extended to other tasks in which each language cluster is replaced by a teacher. Each teacher in a different cluster has its own expertise, and the teachers can teach the student collaboratively. In this case, more machine learning tasks can be integrated. Scenario-II is also a good case in which pre-trained models provide their knowledge to downstream tasks. We believe that by deploying large models onto cloud clusters and adding task-specific accessories (e.g. lightweight parameters), their value can be further amplified.
In this work, we proposed a general framework, Nebula-I, implemented using the PaddlePaddle deep learning framework, for collaboratively training deep learning models over remote heterogeneous clusters. We applied Nebula-I in two different NLP scenarios: pre-training a multilingual language model and fine-tuning a machine translation model. The results demonstrate that both model accuracy and communication efficiency can be maintained with the help of Nebula-I. The success in these scenarios not only shows the effectiveness of Nebula-I in optimizing the whole deep learning training process, but also offers potential ways to reuse and promote large models, which would be helpful to the research field. We hope that users can quickly deploy training tasks over cloud clusters under the Nebula-I framework with minimal development, e.g. by adding several lines of code and modifying a configuration file. However, training a general model on cloud clusters remains a huge challenge and needs more exploration from multiple perspectives.
We sincerely thank Dr. Shujian Huang and Dr. Xiaoxiong Zhong for their valuable suggestions in this work, and thank all the members from Peng Cheng-Baidu NLP Joint Lab for their continuous support.