Training state-of-the-art deep learning models is becoming ever more computationally demanding. One prominent example of this trend is the transformer (transformer), a popular architecture widely used in NLP (bert; roberta; gpt3), speech processing (asrtransformer; ttstransformer), and computer vision (vit; deit; dino). Transformers benefit from having billions of parameters (gpt3; kaplan2020scaling; ott2018scaling) and from large-batch training (Popel2018TrainingTF), which makes them dependent on large-scale training infrastructure (megatron2; shoeybi2019megatron; Lepikhin2020GShardSG).
Unfortunately, this kind of infrastructure can be prohibitively expensive, whether one buys the hardware or rents cloud resources (gpt3cost; gpt3costlambda). As a result, most researchers simply cannot afford to conduct the necessary experiments to develop their ideas, which ultimately slows down scientific progress.
To make large-scale deep learning more accessible, recent work proposes to train these models collaboratively, i.e., to pool together the hardware from many independent parties and train a shared model over the Internet (lc0; hivemind_dmoe; volunteer_dl_async; atre2021distributed; dedloc). Such work proposes general distributed algorithms for training on many devices with uneven compute capability and reliability. However, to make them practical, one must overcome several engineering challenges, such as slow communication, limited memory, and security concerns.
In this demonstration, we collaboratively trained a text-to-image transformer similar to DALL-E (dalle). Our contributions are the following:
We modify the DALL-E model, making it suitable for training over the Internet using the method from dedloc and the hivemind library (hivemind). We set up the infrastructure for such a training run and publish the training results.
We provide a webpage (see https://training-transformers-together.github.io) explaining how to join the ongoing training run, how to address the challenges of collaborative training runs (slow communication, low memory budget, support for heterogeneous devices), and how to set up such a training run yourself.
We provide an interactive “calculator” that shows the memory consumed by different models when various memory-efficiency techniques are applied. Also, we present a tutorial on setting up dataset streaming and model compression using the datasets and bitsandbytes libraries (datasets; bitsandbytes).
2 Demonstration Contents
2.1 Main webpage
The central part of our demonstration is a webpage where people can explore the demonstration materials. The webpage describes the motivation behind collaborative training projects, the method for efficient training from dedloc, and the ongoing collaborative training of our adapted version of DALL-E (see Section 3). Here, we also show a plot of the training objective and the number of active participants.
Next, we provide instructions on how to join the training run using free cloud providers or one’s own GPU. This involves (1) joining a specific Hugging Face organization, where we can authenticate the users and measure their contributions, and (2) running a Jupyter notebook (jupyter) with the training code. Our intention is that users can explore our collaborative training environment through active participation while reading detailed explanations of how it works. Here, we also provide a link to the interactive dashboard, which shows the contributor statistics and leaderboard and provides further information about the training run, such as model checkpoints uploaded to the Model Hub, notebooks for inference, and links to the source code.
Then, we proceed to discuss the engineering challenges of collaborative training runs:
Communication efficiency. Most distributed training algorithms are designed for networks inside HPC clusters with 10–100 Gbit/s bandwidth. However, typical Internet connections are orders of magnitude slower (10–100 Mbit/s). To make training over the Internet practical, one can reduce the communication costs using large-batch training (lamb), gradient compression (Dettmers20158BitAF; deepgradientcompression; powersgd; tang20211), parameter sharing (albert; xue2021go), and overlapping computation with communication (zerooffload).
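To give a feel for gradient compression, here is a minimal pure-Python sketch of uniform 8-bit quantization. Real systems (e.g., blockwise dynamic quantization) are considerably more elaborate, but the bandwidth argument is the same: each value travels as one byte instead of four.

```python
# Minimal sketch of uniform 8-bit quantization for gradient exchange.
# This is an illustration, not the quantizer used in the actual training run.

def quantize_8bit(grads):
    """Map floats to integer codes in [0, 255] plus dequantization params."""
    lo, hi = min(grads), max(grads)
    scale = (hi - lo) / 255 or 1.0      # avoid zero scale for constant inputs
    codes = [round((g - lo) / scale) for g in grads]
    return codes, lo, scale

def dequantize_8bit(codes, lo, scale):
    return [lo + c * scale for c in codes]

grads = [0.5, -1.25, 0.0, 3.0]
codes, lo, scale = quantize_8bit(grads)
restored = dequantize_8bit(codes, lo, scale)
# The round-trip error is bounded by half a quantization step:
assert all(abs(g - r) <= scale / 2 + 1e-9 for g, r in zip(grads, restored))
```

Sending `codes` (one byte each) plus the two range parameters cuts transfer size roughly fourfold compared to 32-bit floats.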
Uneven device performance. Traditional data-parallel training waits for the slowest device on every batch. The method from dedloc instead allows devices to process different numbers of samples within a batch, while keeping the guarantees of synchronous training.
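The idea can be sketched as gradient averaging weighted by each worker’s sample count, so the result equals the gradient of the full global batch. This is a simplified illustration, not the hivemind implementation:

```python
# Sketch of synchronous aggregation when workers contribute unequal
# numbers of samples per global batch (illustrative only).

def aggregate(worker_grads, worker_samples):
    """Average gradients with weights proportional to samples processed."""
    total = sum(worker_samples)
    avg = [0.0] * len(worker_grads[0])
    for grads, n in zip(worker_grads, worker_samples):
        for i, g in enumerate(grads):
            avg[i] += g * (n / total)
    return avg

# A fast worker processed 96 samples, a slow one only 32:
fast_grad = [1.0, 2.0]
slow_grad = [3.0, 6.0]
print(aggregate([fast_grad, slow_grad], [96, 32]))  # [1.5, 3.0]
```

Because weights are proportional to the contributed samples, slow devices still help rather than stall the collective step.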
Memory efficiency. Distributed training requires either storing all parameters and optimizer statistics on each participant, which is challenging on low-end hardware, or using model parallelism, which introduces another level of complexity. Fortunately, the first option is often viable if we reduce memory consumption with 8-bit optimizers (bitsandbytes), offloading the optimizer statistics to CPU, gradient checkpointing, or parameter sharing (albert; xue2021go).
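As a back-of-the-envelope illustration of gradient checkpointing, the sketch below counts how many activations must stay in memory: instead of keeping every layer’s activation for the backward pass, one stores a checkpoint per segment and recomputes the rest on demand. The counts are rough and ignore implementation details.

```python
import math

def stored_activations(n_layers, segment=None):
    """Rough count of activations kept in memory for the backward pass."""
    if segment is None:                # plain backprop keeps every activation
        return n_layers
    checkpoints = math.ceil(n_layers / segment)   # saved segment boundaries
    return checkpoints + segment       # plus one recomputed segment

print(stored_activations(64))      # 64 without checkpointing
print(stored_activations(64, 8))   # 16 with segments of 8 (optimum ~ sqrt(n))
```

Memory drops from O(n) to roughly O(n/k + k) stored activations at the cost of one extra forward pass.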
Dataset streaming. Participants often cannot store or even download the whole dataset, since datasets used for pretraining transformers may contain hundreds of gigabytes of data. To address that, one can use dataset streaming tools, such as the datasets library (datasets).
Security. Crucially, the participants only exchange tensors and never send code to be executed on each other’s computers. Still, since a malicious participant could influence the training outcome by sending wrong tensors, one should authenticate participants, as described in dedloc, and/or use gradient aggregation techniques robust to outliers (karimireddy2020learning; gorbunov2021secure).
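Returning to dataset streaming, the idea can be illustrated with a minimal pure-Python generator: samples are fetched and decoded lazily, one shard at a time, so the full dataset never has to fit on a participant’s disk. The actual demo relies on the datasets library’s streaming mode; the shard names and fetch function below are purely hypothetical.

```python
# Illustrative dataset streaming: only one shard is resident at a time.

def stream_samples(shard_names, fetch):
    """Yield samples lazily; `fetch` downloads/opens one shard of data."""
    for name in shard_names:
        for line in fetch(name):       # only this shard is held in memory
            yield line.strip()

# A fake `fetch` standing in for an HTTP download (hypothetical data):
fake_shards = {"shard-0": ["a caption\n", "another caption\n"],
               "shard-1": ["third caption\n"]}
samples = stream_samples(fake_shards, fake_shards.get)
print(next(samples))  # 'a caption'
```

A training loop consumes this iterator directly, overlapping download with computation instead of waiting for a full dataset copy.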
Finally, we provide a recipe for combining all these techniques and setting up a new collaborative training run with the hivemind library (hivemind).
2.2 Memory calculator
The demonstration webpage includes an interactive “calculator” showing the benefits of various memory-efficiency techniques and their combinations. It can compute the consumption of RAM and GPU memory for BERT (bert), T5 (Raffel2020ExploringTL), GPT-2 (radford2019language), GPT-3 (gpt3), GPT-J (gpt-j), and DALL-E (dalle) when using 8-bit optimizers, offloading the optimizer statistics to CPU, gradient checkpointing, and parameter sharing.
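For intuition, a simplified version of such a calculation fits in a few lines. The byte counts below are rough estimates that ignore activations and framework overhead; they only show the relative effect of switching Adam statistics to 8 bits.

```python
# Back-of-the-envelope memory estimate: parameters, gradients, and
# Adam statistics, under fp32 vs 8-bit optimizer states.

def optimizer_memory_gb(n_params, adam_bits=32, fp16_model=False):
    param_bytes = 2 if fp16_model else 4
    grad_bytes = param_bytes
    state_bytes = 2 * (adam_bits // 8)   # Adam keeps two statistics per param
    total = n_params * (param_bytes + grad_bytes + state_bytes)
    return total / 1024**3

# A ~1.1B-parameter model, like the DALL-E variant trained in this demo:
print(round(optimizer_memory_gb(1.1e9, adam_bits=32), 1))  # ~16.4 GB
print(round(optimizer_memory_gb(1.1e9, adam_bits=8), 1))   # ~10.2 GB
```

The full calculator on the webpage accounts for more factors; this sketch only conveys the shape of the arithmetic.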
2.3 Tutorial on memory-efficiency techniques
The demonstration webpage refers to a tutorial on setting up dataset streaming with the datasets library (datasets) and model compression with the bitsandbytes library (bitsandbytes). The goal of the tutorial is to fine-tune the GPT-2 Large model (radford2019language) on the C4 dataset (Raffel2020ExploringTL) using only a low-end GPU, which becomes possible with the 8-bit Adam optimizer.
3 Collaborative Training Run
For the practical example of a collaborative training run, we chose to train a text-to-image transformer similar to DALL-E (dalle), based on the code from the dalle-pytorch repository (dalle-pytorch). Specifically, we used a decoder-only transformer with 1024 hidden units and 64 layers, each of which uses 16 attention heads with a per-head state size of 64 (1.1B parameters in total). We alternated the attention masks as in the original paper, i.e., repeated the “row, column, row, row” pattern of masks up to the last layer, which had the convolutional mask.
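The alternating mask schedule can be expressed as a short helper (an illustrative sketch, not the actual training code):

```python
# Build the per-layer attention-mask schedule described above:
# repeat "row, column, row, row" and give the last layer a convolutional mask.

def attention_mask_schedule(n_layers=64):
    pattern = ["row", "column", "row", "row"]
    masks = [pattern[i % 4] for i in range(n_layers - 1)]
    masks.append("conv")
    return masks

masks = attention_mask_schedule()
print(masks[:4])   # ['row', 'column', 'row', 'row']
print(masks[-1])   # 'conv'
```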
To improve communication and memory efficiency, we tied the weights of all “row, column, row, row” layer groups (albert) and tied the input and output embeddings (press2016using), so the model uses 8x fewer parameters (but the same amount of compute). We also used reversible layers (reversible) to reduce memory usage and rotary embeddings (rotary) to improve training stability.
We replaced the dVAE with VQ-GAN (vqgan), since it has a smaller reconstruction error. We used the checkpoint with the codebook size of 8192. Finally, we used CLIP ViT/B-32 (clip) to choose the best 4 out of 128 generated images.
We trained the model on the first 100 million image-text pairs from LAION-400M (laion). We skipped 10% of the images due to short captions, extreme aspect ratios, and NSFW labels.
Before training, we preprocessed all images with VQ-GAN and uploaded the VQ-GAN codes and captions, both compressed with Brotli (brotli), to the Hugging Face Dataset Hub (datasets). During training, we streamed the compressed codes instead of the original images, thus consuming 18x less bandwidth.
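As a sanity check on the bandwidth savings, a rough estimate under assumed sizes (256x256 images stored as ~30 KB JPEGs, a 32x32 grid of codes drawn from an 8192-entry codebook) lands in the same ballpark as the figure reported above; the exact 18x comes from measurements, not this arithmetic.

```python
import math

# Assumed average size of a compressed source image (hypothetical figure):
jpeg_bytes = 30_000
# 32x32 grid of VQ-GAN codes, each indexing an 8192-entry codebook:
codes = 32 * 32
bits_per_code = math.ceil(math.log2(8192))   # 13 bits per code
code_bytes = codes * bits_per_code / 8       # 1664 bytes before Brotli

print(round(jpeg_bytes / code_bytes, 1))     # ~18x under these assumptions
```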
3.3 Training procedure
We followed the distributed training procedure from dedloc and used the 8-bit LAMB optimizer (lamb; bitsandbytes) offloaded to CPU. We used a linear training schedule with 31250 steps, the first 10% of which served as warm-up to the peak learning rate. While exchanging gradients and parameters, we used 8-bit quantization (Dettmers20158BitAF) for sufficiently large tensors and 16-bit precision for the other tensors. Unlike the original paper, we did not use PowerSGD (powersgd).
The training run lasted 2.5 months and completed 80% of the training schedule. Besides the authors, 37 volunteers contributed for at least 10 minutes each (see Appendix A).
Appendix A Top Volunteers by Contributed Compute Time
Appendix B Model Inference Results
(a)–(c) Prompts leading to realistic outputs.
(d)–(f) Prompts where the model fails to draw the correct object shapes, but uses the appropriate image style, textures, and colors.
(g)–(i) Prompts where the model is able to generalize and draw concepts not present in the training set. We verified this by inspecting training set images whose CLIP embeddings are close to the prompt embeddings (clip-retrieval).