Training Transformers Together

The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, showing them instructions on how to contribute using the available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers can set up collaborative training runs themselves. Finally, we show that the resulting model generates images of reasonable quality on a number of prompts.



1 Introduction

Training state-of-the-art deep learning models is becoming ever more computationally demanding. One infamous example of this trend is transformers (transformer), a popular architecture widely used in NLP (bert; roberta; gpt3), speech processing (asrtransformer; ttstransformer), and computer vision (vit; deit; dino). Transformers benefit from having billions of parameters (gpt3; kaplan2020scaling; ott2018scaling) and large-batch training (Popel2018TrainingTF), which makes them dependent on large-scale training infrastructure (megatron2; shoeybi2019megatron; Lepikhin2020GShardSG).

Unfortunately, this kind of infrastructure can be prohibitively expensive, whether one buys the hardware or rents cloud resources (gpt3cost; gpt3costlambda). As a result, most researchers simply cannot afford to conduct the necessary experiments to develop their ideas, which ultimately slows down scientific progress.

To make large-scale deep learning more accessible, recent work proposes to train these models collaboratively, i.e., to pool together the hardware from many independent parties and train a shared model over the Internet (lc0; hivemind_dmoe; volunteer_dl_async; atre2021distributed; dedloc). Such work proposes general distributed algorithms for training on many devices with uneven compute capability and reliability. However, to make them practical, one must overcome several engineering challenges, such as slow communication, limited memory, and security concerns.

In this demonstration, we collaboratively trained a text-to-image transformer similar to DALL-E (dalle). Our contributions are the following:

  • We modify the DALL-E model, making it suitable for training over the Internet using the method from dedloc and the hivemind library (hivemind). We set up the infrastructure for such a training run and publish the training results.

  • We provide a webpage explaining how to join the ongoing training run, how to address the challenges of collaborative training (slow communication, a low memory budget, support for heterogeneous devices), and how to set up such a training run on one's own.

  • We provide an interactive “calculator” that shows the memory consumed by different models when various memory-efficiency techniques are applied. We also present a tutorial on setting up dataset streaming and model compression using the datasets and bitsandbytes libraries (datasets; bitsandbytes).

2 Demonstration Contents

2.1 Main webpage

The central part of our demonstration is a webpage where people can explore the demonstration materials. The webpage describes the motivation behind collaborative training projects, the method for efficient training from dedloc, and the ongoing collaborative training of our adapted version of DALL-E (see Section 3). Here, we also show a plot of the training objective and the number of active participants.

Next, we provide instructions for joining the training run using free cloud providers or one's own GPU. This involves (1) joining a specific Hugging Face organization, where we can authenticate the users and measure their contribution, and (2) running a Jupyter notebook (jupyter) with the training code. Our intention was for users to explore our collaborative training environment through active participation while reading detailed explanations of how it works. Here, we also provide the link to the interactive dashboard, which shows the statistics and the leaderboard of contributors and provides further information about the training run, such as model checkpoints uploaded to the Model Hub, notebooks for inference, and links to the source code.

Then, we proceed to discuss the engineering challenges of collaborative training runs:

  • Communication efficiency. Most distributed training algorithms are designed for networks inside HPC clusters with 10–100 Gbit/s bandwidth. However, typical Internet connections are orders of magnitude slower (10–100 Mbit/s). To make training over the Internet practical, one can reduce the communication costs using large-batch training (lamb), gradient compression (Dettmers20158BitAF; deepgradientcompression; powersgd; tang20211), parameter sharing (albert; xue2021go), and overlapping computation with communication (zerooffload).

  • Uneven device performance. Traditional data-parallel training waits for the slowest device on every batch. The method of dedloc allows devices to process different numbers of samples within a batch while keeping the guarantees of synchronous training.

  • Memory efficiency. Distributed training requires either storing all parameters and optimizer statistics on each participant, which is challenging in the case of low-end hardware, or using model parallelism, which introduces another level of complexity. Fortunately, the first option is often viable if we reduce memory consumption with 8-bit optimizers (bitsandbytes), offload the optimizer statistics to CPU, or use gradient checkpointing or parameter sharing (albert; xue2021go).

  • Dataset streaming. Participants often cannot store or even download the whole dataset, since datasets used for pretraining transformers may contain hundreds of gigabytes of data. To address that, one can use dataset streaming tools, such as the datasets library (datasets).

  • Security. Crucially, the participants only exchange tensors and never send code to be executed on each other’s computers. Since a malicious participant could still influence the training outcome by sending wrong tensors, we should either authenticate participants, as described in dedloc, and/or use gradient aggregation techniques robust to outliers (karimireddy2020learning; gorbunov2021secure).
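To make the gradient-compression idea above concrete, here is a minimal sketch of uniform 8-bit quantization of a gradient tensor. This simplified linear variant is a stand-in for illustration only; the 8-bit scheme actually cited (Dettmers20158BitAF) uses a non-uniform dynamic code.

```python
# Sketch: compress a gradient vector to one byte per element plus a scale
# and offset, then reconstruct an approximation on the receiving side.

def quantize_8bit(values):
    """Map floats to uint8 codes plus (scale, offset) metadata."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # guard against constant tensors
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_8bit(codes, scale, lo):
    return [c * scale + lo for c in codes]

grad = [-0.5, 0.0, 0.25, 1.0]
codes, scale, lo = quantize_8bit(grad)
restored = dequantize_8bit(codes, scale, lo)
# Worst-case reconstruction error is half a quantization step (scale / 2).
max_err = max(abs(a - b) for a, b in zip(grad, restored))
```

Sending one byte per element instead of four (fp32) cuts gradient traffic roughly 4x, at the cost of a bounded reconstruction error.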

Finally, we provide a recipe on how to combine all that and set up a new collaborative training run using the hivemind library (hivemind).
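As an illustrative (non-runnable) sketch of such a recipe, the snippet below shows the general shape of hivemind's collaborative optimizer API, following the library's quickstart. Argument names and defaults may differ between versions; the peer address is a placeholder, and `model` stands for the user's own module.

```
import torch
import hivemind

# Connect to the distributed hash table used for peer discovery
# ("/ip4/..." is a placeholder for the multiaddress of an existing peer).
dht = hivemind.DHT(initial_peers=["/ip4/..."], start=True)

opt = hivemind.Optimizer(
    dht=dht,
    run_id="my_collaborative_run",   # common name shared by all peers
    batch_size_per_step=32,          # samples this peer processes per step
    target_batch_size=4096,          # global batch size that triggers an update
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
)
```

Every peer that joins with the same `run_id` contributes gradients toward the shared `target_batch_size`, after which all peers apply a synchronized update.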

2.2 Memory calculator

The demonstration webpage includes an interactive “calculator” showing the benefits of various memory-efficiency techniques and their combinations. It can compute the consumption of RAM and GPU memory for BERT (bert), T5 (Raffel2020ExploringTL), GPT-2 (radford2019language), GPT-3 (gpt3), GPT-J (gpt-j), and DALL-E (dalle) when using 8-bit optimizers, offloading the optimizer statistics to CPU, gradient checkpointing, and parameter sharing.
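The estimate behind such a calculator can be sketched as a back-of-envelope function. The byte counts below follow common conventions (fp32 weights and gradients, two Adam moments per parameter); the demo's calculator may use different accounting, and the 1.5B parameter count is an assumed example.

```python
# Rough GPU memory estimate for training a model with n_params parameters.

GB = 1024 ** 3

def training_memory_gb(n_params, *, adam_8bit=False, offload_optimizer=False):
    params = 4 * n_params                        # fp32 master weights
    grads = 4 * n_params                         # fp32 gradients
    stats = (2 if adam_8bit else 8) * n_params   # two Adam moments per parameter
    if offload_optimizer:
        stats = 0                                # statistics move to CPU RAM
    return (params + grads + stats) / GB

# GPT-2-sized model, ~1.5B parameters (assumed size):
full = training_memory_gb(1_500_000_000)
lean = training_memory_gb(1_500_000_000, adam_8bit=True, offload_optimizer=True)
```

Even this crude model shows why combining the techniques matters: optimizer statistics alone account for half of the naive fp32 footprint.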

2.3 Tutorial on memory-efficiency techniques

The demonstration webpage refers to a tutorial on setting up dataset streaming with the datasets library (datasets) and model compression with the bitsandbytes library (bitsandbytes). The goal of the tutorial is to fine-tune the GPT-2 Large model (radford2019language) on the C4 dataset (Raffel2020ExploringTL) using only a low-end GPU, which is possible with the 8-bit Adam optimizer.
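The core idea behind 8-bit optimizers such as the one used in the tutorial can be sketched in a few lines: statistics are split into small blocks, each quantized with its own scale, so a single outlier only degrades one block. This is a simplified symmetric variant; bitsandbytes uses a non-linear dynamic code and much larger blocks.

```python
# Sketch: block-wise 8-bit quantization of optimizer statistics.

BLOCK = 4  # real implementations use larger blocks, e.g. 2048

def quantize_blockwise(values):
    blocks = [values[i:i + BLOCK] for i in range(0, len(values), BLOCK)]
    out = []
    for block in blocks:
        scale = max(abs(v) for v in block) or 1.0  # per-block scale
        codes = [round(v / scale * 127) for v in block]
        out.append((codes, scale))
    return out

def dequantize_blockwise(quantized):
    return [c / 127 * scale for codes, scale in quantized for c in codes]

# Small statistics in one block are unaffected by the huge values in the other:
state = [0.01, -0.02, 0.005, 0.0, 100.0, 99.0, -98.0, 97.0]
restored = dequantize_blockwise(quantize_blockwise(state))
```

With a single global scale, the values around 100 would force a quantization step of ~0.8 and wipe out the small statistics entirely; per-block scales keep both ranges accurate.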

3 Collaborative Training Run

3.1 Model

For the practical example of a collaborative training run, we chose to train a text-to-image transformer similar to DALL-E (dalle), based on a publicly available implementation. Specifically, we used a decoder-only transformer with 1024 hidden units and 64 layers, each of which uses 16 attention heads with a per-head state size of 64 (1.1B parameters in total). We alternated the attention masks as in the original paper, i.e., repeated the “row, column, row, row” masks until the last layer, which used the convolutional mask.

To improve communication and memory efficiency, we tied weights of all “row, column, row, row” layer groups (albert) and tied the input and output embeddings (press2016using), so the model uses 8x fewer parameters (but the same amount of compute). We also used reversible layers (reversible) to reduce memory usage and rotary embeddings (rotary) to improve training stability.
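A rough parameter accounting illustrates why tying helps. A transformer layer holds about 12·d² weights (4·d² for attention, 8·d² for the feed-forward block); tying all “row, column, row, row” groups leaves only 4 unique layers, and tying the input and output embeddings halves the embedding cost. The vocabulary size below is an assumption for illustration, so the exact ratio here differs from the paper's quoted 8x.

```python
# Back-of-envelope parameter counts for the weight-tying scheme.

d, n_layers, vocab = 1024, 64, 16384  # hidden size, depth, assumed joint vocab
per_layer = 12 * d * d                # ~4*d^2 attention + ~8*d^2 FFN weights

untied = n_layers * per_layer + 2 * vocab * d  # separate in/out embeddings
tied = 4 * per_layer + vocab * d               # 4 unique layers, shared embedding

ratio = untied / tied  # how many times fewer unique parameters after tying
```

Note that tying reduces the number of unique parameters (and thus memory and communication), but not the compute: all 64 layers are still executed on every forward pass.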

We replaced the dVAE with VQ-GAN (vqgan), since it has a smaller reconstruction error. We used the checkpoint with the codebook size of 8192. Finally, we used CLIP ViT-B/32 (clip) to choose the best 4 out of 128 generated images.

3.2 Dataset

We trained the model on the first 100 million image-text pairs from LAION-400M (laion). We skipped 10% of the images due to short captions, extreme aspect ratios, and NSFW labels.

Before training, we preprocessed all images with VQGAN and uploaded the VQGAN codes and captions, both compressed with Brotli (brotli), to the Hugging Face Dataset Hub (datasets). During training, we streamed the compressed codes instead of the original images, thus consuming 18x less bandwidth.
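The bandwidth saving comes from storing each image as a short sequence of codebook indices rather than pixels. Below is a minimal sketch of this packing-and-compression step; the demo used Brotli, while zlib (Python stdlib) serves as a stand-in here, and the code grid is fabricated for illustration.

```python
# Sketch: pack VQGAN code indices into bytes and compress them for streaming.

import struct
import zlib

# A fake 16x16 grid of codebook indices (codebook size 8192, so each
# index fits comfortably in 2 bytes).
codes = [513, 1027, 8191, 0] * 64

packed = struct.pack(f"<{len(codes)}H", *codes)  # 2 bytes per code
compressed = zlib.compress(packed, level=9)

# The consumer side decompresses and unpacks losslessly:
restored = list(struct.unpack(f"<{len(codes)}H", zlib.decompress(compressed)))
```

Since the codes are lossless to compress and far smaller than the source images, participants stream them instead of pixels, which is where the 18x bandwidth reduction comes from.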

3.3 Training procedure

We followed the distributed training procedure from dedloc and used the 8-bit LAMB optimizer (lamb; bitsandbytes) offloaded to CPU. We used a linear training schedule with 31250 steps (the first 10% of which were warm-up) and the peak learning rate of . While exchanging gradients and parameters, we used the 8-bit quantization (Dettmers20158BitAF) for tensors with elements and the 16-bit precision for other tensors. Unlike the original paper, we did not use PowerSGD (powersgd).

3.4 Results

The training run lasted for 2.5 months and covered 80% of the training schedule. Besides the authors, 37 volunteers contributed for at least 10 minutes each (see Appendix A).

During inference, we note that limiting sampling to the top 256 logits, or to the top logits whose probabilities sum up to a fixed threshold, greatly improves image quality. The final model generates realistic images for some prompts but fails to draw correct shapes for others, while still using the appropriate image style, textures, and colors (see Appendix B). We attribute this to the fact that our model is too small to capture the full diversity of images in LAION-400M. Still, the model can generalize to concepts not present in the dataset.
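The two filters mentioned above (top-k and nucleus sampling) can be sketched as follows. This pure-Python version operates on a probability list for clarity; real inference code applies the same logic to logit tensors, and the example distribution is fabricated.

```python
# Sketch: restrict sampling to the top-k tokens, or to the smallest set of
# tokens whose cumulative probability reaches p, then renormalize.

def filter_top_k_top_p(probs, k=None, p=None):
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, mass = [], 0.0
    for rank, i in enumerate(order):
        if k is not None and rank >= k:
            break                      # top-k cutoff reached
        keep.append(i)
        mass += probs[i]
        if p is not None and mass >= p:
            break                      # nucleus mass reached
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.1]
top2 = filter_top_k_top_p(probs, k=2)       # keeps the two largest entries
nucleus = filter_top_k_top_p(probs, p=0.8)  # keeps tokens covering 80% mass
```

Both filters prune the low-probability tail, which is where implausible image tokens tend to come from.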


Appendix A Top Volunteers by Contributed Compute Time


Figure 1: Hugging Face usernames of volunteers who contributed the most compute time.

Appendix B Model Inference Results


Figure 2: Inference results of the final model (the prompts are taken from dalle-mega-sample-predictions):
(a)–(c) Prompts leading to realistic outputs.
(d)–(f) Prompts where the model fails to draw the correct object shapes, but uses the appropriate image style, textures, and colors.
(g)–(i) Prompts where the model is able to generalize and draw the concepts not present in the training set. This is checked by inspecting training set images whose CLIP embeddings are close to the prompt embeddings (clip-retrieval).