FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

by Shenggan Cheng, et al.

Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of its special performance characteristics and huge memory consumption. In this paper, we propose FastFold, a highly efficient implementation of the protein structure prediction model for training and inference. FastFold includes a series of GPU optimizations based on a thorough analysis of AlphaFold's performance. Meanwhile, with Dynamic Axial Parallelism and Duality Async Operation, FastFold achieves high model parallelism scaling efficiency, surpassing existing popular model parallelism techniques. Experimental results show that FastFold reduces overall training time from 11 days to 67 hours and achieves a 7.5-9.5X speedup for long-sequence inference. Furthermore, we scaled FastFold to 512 GPUs and achieved an aggregate of 6.02 PetaFLOPs with 90.1% parallel efficiency.




I Introduction

Protein structure prediction has been an important research problem in structural biology for over 50 years. Predicting the three-dimensional structure of a protein directly from its amino acid sequence has a wide range of applications, including drug design and protein design. Both experimental and computational approaches can be used to determine protein structure. The experimental approach yields more accurate structures, but at a high cost in time and money. The computational approach can predict protein structure with high throughput at a low cost, so improving its prediction accuracy is of paramount importance.

The recent success of deep neural networks has led to the widespread use of Artificial Intelligence in a variety of domains, including Computer Vision (CV), Natural Language Processing (NLP), and Recommendation Systems. Convolutional Neural Networks (CNN) were introduced to the field of protein structure prediction by AlphaFold [1] and RaptorX-Contact [2]. These CNN models achieved significant performance improvements in the Critical Assessment of Protein Structure Prediction (CASP) challenge, demonstrating that deep neural networks can be an effective solution to the challenge of protein structure prediction.

Meanwhile, because of the superior performance of Multi-head Attention for sequence modelling [3], the Transformer has brought large performance improvements to NLP and CV and has gradually become the mainstream model structure, as seen in BERT[4], GPT[5], and ViT[6]. AlphaFold 2 [7] successfully introduced the Transformer to protein structure prediction, becoming the first model to achieve atomic accuracy. (In the following sections, AlphaFold refers to the Transformer-based AlphaFold 2 model.)

Although the Transformer delivers impressive prediction accuracy, it raises serious computational challenges for training and inference. As AlphaFold’s intermediate representation has two sequence dimensions, its computational complexity is an order of magnitude higher than that handled by a general Transformer. In addition, AlphaFold has a unique model architecture, different from that of general Transformer models, which makes it less computationally efficient on the GPU platform.

There are two main challenges in training: 1) the limited global batch size prevents training from scaling to more nodes using data parallelism, because a larger batch size leads to a drop in accuracy, so it takes around 11 days to train AlphaFold with its official open-source implementation on 128 Google TPUv3 cores [8]; 2) the huge memory consumption exceeds what current GPUs can handle. During inference, longer sequences place a much greater demand on GPU memory, and the inference time for one long sequence can reach several hours for the AlphaFold model. AlphaFold reduces the demand for GPU memory capacity by using activation checkpointing and chunking, at the cost of some performance.

To solve the challenges mentioned above, we propose FastFold, a highly efficient implementation of the protein structure prediction model for training and inference. To the best of our knowledge, FastFold is the first performance optimization work for the training and inference of protein structure prediction models. FastFold successfully introduces large model training techniques and significantly reduces the time and economic cost of AlphaFold model training and inference.

FastFold consists of a high-performance implementation of Evoformer, the backbone structure of AlphaFold, and an innovative model parallelism strategy called Dynamic Axial Parallelism. We analyzed the complex structure of Evoformer and performed kernel fusion. We additionally optimized the operations unique to Evoformer and wrote specific kernels for Softmax and LayerNorm based on their performance characteristics. The high-performance Evoformer implementation substantially reduces the economic cost of training and inference. For the parallelism strategy, we propose Dynamic Axial Parallelism, which outperforms the current standard Tensor Parallelism in terms of scaling efficiency. For communication optimization, we propose Duality Async Operation and implement it as an extension of PyTorch. Relying on the inserted Duality Async Operation, FastFold implements computation-communication overlap in both the forward and backward passes.

In summary, our paper has the following major contributions:

  • We optimized AlphaFold operators based on AlphaFold-specific performance characteristics. Combined with kernel fusion, the kernel implementation of FastFold achieves significant speedup.

  • We proposed Dynamic Axial Parallelism, which has a lower communication overhead than other model parallelism methods. In terms of communication optimization, the proposed Duality Async Operation implements computation-communication overlap in dynamic computational graph frameworks like PyTorch.

  • We successfully scaled AlphaFold model training to 512 NVIDIA A100 GPUs and obtained an aggregate 6.02 PetaFLOPs at the training stage. The overall training time is reduced from 11 days to 67 hours, with significant economic cost savings. At the inference stage, FastFold achieves a 7.5-9.5X speedup for long sequences and makes inference over extremely long sequences possible.

II Background

II-A Overview of AlphaFold

Unlike prior protein structure models, AlphaFold is an end-to-end model that uses amino acid sequences as model input and directly outputs the three-dimensional structure of the protein. Through genetic database search and structure database search, AlphaFold obtains Multiple Sequence Alignment (MSA) and Templates information of the sequences. MSA information includes amino acid sequences that are similar to the target sequence. Using MSA information, amino acids that have mutated during evolution can be identified, based on the principle that these co-evolving residues will be located at neighboring positions or contacts in the three-dimensional structure of the protein. The Templates information, which contains structural information corresponding to known sequences, will provide sufficient information for the model to predict the protein structure.

The architecture of AlphaFold, shown in Figure 1, consists of three parts: Embedding, Evoformer, and Structure Module. The Embedding part encodes the MSA and Template information of the target sequence into an MSA representation and a pair representation. The MSA representation contains the co-evolution information of all similar sequences, and the pair representation contains the interaction information of residue pairs in the sequences. These representations are fed to the Evoformer blocks, which are discussed in detail in the next section. After Evoformer, the MSA representation and pair representation contain highly processed modeling information and are fed into the Structure Module, which outputs the three-dimensional structure of the protein directly.

To speed up training and reduce memory consumption, AlphaFold training uses Bfloat16 precision [9]. AlphaFold uses the recycling technique to improve prediction accuracy at the cost of repeatedly performing forward propagation of the model. Recycling re-embeds the output of the model into the representations and allows the model to process multiple versions of the embedding features. The number of recycling iterations is sampled uniformly between 1 and 4 during training and is fixed to 4 during inference.

The complete training process of AlphaFold includes Initial Training and Fine-tuning. The main difference is that Fine-tuning uses a larger sequence length for training. In the official AlphaFold experiments[7], the training process was done on 128 TPUv3-cores with a mini-batch size of 128. The limited batch size prevents AlphaFold from scaling to more computational resources, and the overall training time reaches 11 days.


Model                 Initial Training   Fine-tuning
Residues sequence     256                384
Number of sequences   128                512
Batch size            128                128
Precision             Bfloat16           Bfloat16
Training samples ()
Training time         days               days


TABLE I: Details of AlphaFold model training.
Fig. 1: The Architecture of the AlphaFold Model. The amino acid sequence is encoded into MSA and pair representations by the Embedding layer, which are then fed into the Structure Module after 48 Evoformer blocks. In Evoformer, the MSA and pair representations are processed by the MSA Stack and the Pair Stack, respectively. In addition, a communication mechanism allows information to be exchanged between the two representations.

II-B Evoformer

The main trunk of the network consists of 48 Evoformer blocks, and each block has three parts: MSA Stack, Communication, and Pair Stack as shown in Figure 1. The representations that go through Evoformer are highly processed and contain feature relationships in the sequence dimension that can be used by subsequent modules for structure prediction.

MSA representations are processed with Row-wise Attention, Column-wise Attention, and Transition (2 MLP layers), while pair representations are processed with similar blocks plus an additional Triangular Updates Module. The Triangular Updates Module uses triangular relationships in the pair information to infer and update representations. Attention Biasing and Outer Product Mean are designed to enable communication between the two representations.

II-C Parallelism for Training

In modern deep learning training, parallel methods are introduced for two main purposes: 1) to significantly reduce the time cost of training; 2) to train large models with limited resources. For these two purposes, the most mainstream parallel methods include Data Parallelism and Model Parallelism.

The most basic and widely used parallel method is Data Parallelism. Each device has a complete set of model parameters and then processes different training data (mini-batch). During the training phase, each device calculates the local gradient using a local mini-batch, then uses All-Reduce communication to obtain the globally averaged gradient, after which the model parameters are updated.
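The Data Parallelism step described above can be sketched in a few lines. This is a single-process simulation, not distributed code: the helper names `all_reduce_mean` and `data_parallel_step` are hypothetical, and plain Python lists stand in for tensors and devices.

```python
def all_reduce_mean(local_grads):
    """Simulate an averaging All-Reduce: every device ends up with the mean gradient."""
    num_devices = len(local_grads)
    averaged = [sum(vals) / num_devices for vals in zip(*local_grads)]
    return [list(averaged) for _ in range(num_devices)]


def data_parallel_step(replicas, local_grads, lr):
    """Apply one SGD update on every replica using the globally averaged gradient."""
    synced = all_reduce_mean(local_grads)
    return [[p - lr * g for p, g in zip(params, grads)]
            for params, grads in zip(replicas, synced)]


# Two devices hold identical parameters but see different mini-batches.
replicas = [[1.0, 2.0], [1.0, 2.0]]
local_grads = [[0.5, 1.0], [1.5, 3.0]]
updated = data_parallel_step(replicas, local_grads, lr=0.5)
```

Because every replica applies the same averaged gradient, the parameter copies stay identical after the update, which is what allows each device to keep a complete set of model parameters.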

Model Parallelism distributes the model parameters across different devices and is generally classified into Tensor Parallelism and Pipeline Parallelism according to the distribution method. Tensor Parallelism splits the model parameters across devices, each of which computes on its local tensor slice and uses collective communication (All-Gather or All-Reduce) when the entire tensor is needed.

In Pipeline Parallelism, the model is split vertically (layer-wise) among multiple devices, with several successive layers on a single device. Because there are dependencies between the computations on different devices, Pipeline Parallelism introduces device idleness, known as bubbles. To improve the utilization of device resources, the current mainstream approach further divides the mini-batch into micro-batches, providing more opportunities for overlapping and thus reducing the bubble ratio.
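To make the effect of micro-batching concrete, the sketch below uses the commonly quoted estimate for a simple synchronous pipeline schedule (GPipe-style), in which the idle fraction is (p - 1) / (m + p - 1) for p stages and m micro-batches. This formula does not appear in the text above; it is an assumption that also presumes every stage takes the same time per micro-batch.

```python
def bubble_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Estimated idle fraction of a simple synchronous pipeline schedule."""
    if num_stages < 1 or num_micro_batches < 1:
        raise ValueError("stages and micro-batches must be positive")
    return (num_stages - 1) / (num_micro_batches + num_stages - 1)


# With 4 stages, one micro-batch leaves the devices idle most of the time,
# while 13 micro-batches shrink the bubble considerably.
single = bubble_fraction(4, 1)
many = bubble_fraction(4, 13)
```

Under this estimate the bubble only becomes small when the number of micro-batches is large relative to the number of stages, which is hard to arrange when each device can process only a very small batch.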

A combination of several parallel approaches is usually required for training large models. Megatron-LM[10], for example, uses Tensor Parallelism as intra-node model parallel method and Pipeline Parallelism as inter-node model parallel method since Tensor Parallelism is more bandwidth-intensive than Pipeline Parallelism. Finally, Data Parallelism is used to scale to more nodes.

(a) Tensor Parallelism
(b) Pipeline Parallelism
Fig. 2: Two mainstream approaches for Model Parallelism.

III In-depth Analysis of Evoformer

In this section we will compare the differences between Evoformer and Transformer models and perform a performance analysis of Evoformer. These Evoformer performance characteristics inform the subsequent computation and parallel optimization.

For convenience, we denote the number of residues in the input by $N_r$, the number of sequences processed in the MSA stack by $N_s$, the hidden dimension of the MSA representation by $H_m$, and the hidden dimension of the pair representation by $H_z$. Specific values of $N_r$ and $N_s$ can be found in Table I.

III-A Differences between Evoformer and Transformer

There are several key differences between Evoformer and vanilla Transformer:


                   AlphaFold                          ViT-B/16   GPT-2
Sequence Shape     ($N_s$, $N_r$) or ($N_r$, $N_r$)   196        512
Layers             48                                 12         48
Hidden Dim         128 or 256                         768        1600
Heads              8 or 4                             12         25
Params per Layer   1.8 M                              7.1 M      30.7 M


TABLE II: Different Setting of Evoformer in AlphaFold and Transformer in ViT and GPT.

1) Evoformer accepts two inputs, the MSA representation and the pair representation, while the Transformer processes only one input representation. Evoformer therefore has an MSA Stack and a Pair Stack to process the two inputs separately and introduces communication mechanisms for information exchange. The intermediate representation has two sequence dimensions, so attention has to be computed row-wise and column-wise separately. As a result, the basic structure in Evoformer has three parts: Row Attention, Column Attention, and Feed-Forward. For the Pair Stack, the Triangular Updates Module is introduced before the attention module to enhance the modeling of the triangle relationship of residues. The details of the Triangular Updates Module are shown in Figure 4.

Fig. 3: Attention of Evoformer. There are two main differences from vanilla Attention: 1) a gating mechanism is added on the Attention Context; 2) an optional pair bias is added to the Attention Score before Softmax.

2) There are two differences between Attention in Evoformer and vanilla Transformer which are shown in Figure 3. Evoformer Attention introduces a gating mechanism, which uses the Linear layer to project the normalized input and do an element-wise product with Attention Context, thus controlling the output of the Attention module. Another difference is pair bias, which adds linearly projected pair representation into Attention Score. Pair bias is enabled in all Attention except Column Attention in MSA Stack.

3) Pair bias allows the pair representation to participate in the update of the MSA representation, and the MSA representation is transformed into an update for the pair representation with Outer Product Mean. All MSA entries are linearly projected with two different Linear layers. To compute an update for entry ij in the pair representation, the outer products of these vectors from the two columns i and j are averaged over the sequences and projected to the hidden dimension of the pair representation. Using Einstein notation, the core computation of Outer Product Mean is einsum(bsid, bsje -> bijde).
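The core of Outer Product Mean can be written out with explicit loops to show what the einsum contracts. The sketch below is illustrative only: it handles a single batch element (the `b` index is dropped), uses nested Python lists instead of tensors, and omits the final Linear projection to the pair hidden dimension. The function name `outer_product_mean` is a hypothetical helper.

```python
def outer_product_mean(left, right):
    """left[s][i][d], right[s][j][e] -> out[i][j][d][e], averaged over sequences s."""
    n_seq = len(left)
    n_i, dim_d = len(left[0]), len(left[0][0])
    n_j, dim_e = len(right[0]), len(right[0][0])
    out = [[[[0.0] * dim_e for _ in range(dim_d)]
            for _ in range(n_j)] for _ in range(n_i)]
    for s in range(n_seq):
        for i in range(n_i):
            for j in range(n_j):
                for d in range(dim_d):
                    for e in range(dim_e):
                        # Outer product of column i and column j for sequence s.
                        out[i][j][d][e] += left[s][i][d] * right[s][j][e] / n_seq
    return out


# Two sequences, one residue, projection sizes 2 (left) and 1 (right).
left = [[[1.0, 2.0]], [[3.0, 4.0]]]
right = [[[1.0]], [[2.0]]]
result = outer_product_mean(left, right)
```

The sequence index `s` is the only one summed away, so the output grows with the square of the number of residues and with the product of the two projection sizes.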

Fig. 4: Triangular Updates Module in Evoformer. The Pair Stack has two consecutive symmetric Triangular Updates Modules, differing only in the order of the axes in the MatMul part.

III-B Performance Analysis

Further, we can analyze the performance characteristics of Evoformer at the operator level. We classify operators into three main categories based on their computation and memory access characteristics: 1) GEMM. Operators in this category include matrix multiplication, batch matrix-matrix product, and other dense matrix calculations. Tensor Cores on NVIDIA Tesla GPUs can dramatically accelerate GEMM operators; 2) Batch Reduction. Operators in this category include LayerNorm, Softmax, etc. These operations have lower computational intensity than GEMM operators and are more susceptible to memory access bottlenecks; 3) Element-wise Operators. Operators in this category include element-wise addition or product, dropout, and activations. This is the category with the lowest computational intensity.

GEMM operators are generally computed by the highly optimized BLAS library provided by the vendor, such as cuBLAS on GPU platforms and MKL (Math Kernel Library) on CPU platforms. In general, PyTorch and other deep learning frameworks cannot achieve high efficiency for many non-GEMM operators. For AlphaFold model training on the NVIDIA Tesla A100, only 14.7% of the time is spent on GEMM operators, while 55.7% is spent on Batch Reduction, 19.8% on Element-wise operators, and 9.8% on other operators such as data movement. Batch Reduction takes so much time because the LayerNorm and Softmax implementations in PyTorch are very inefficient for this workload. This performance issue also occurs in NLP Transformer models but is more severe in AlphaFold, because the hidden dimension of AlphaFold is much smaller than that of ViT and GPT, as shown in Table II. PyTorch’s native Batch Reduction kernels are less efficient with a small hidden dimension.

From the memory perspective, we observe huge memory consumption during AlphaFold training. However, the overall parameter count of AlphaFold is only 93M according to Table II, which makes it a rather small Transformer model. The memory requirement of AlphaFold is much higher than that of a general Transformer model, largely because of its much larger intermediate activations. Taking the activation in the Attention module as an example, the memory required to store the activation in bytes is $N_r^3 \times N_{head} \times \mathrm{sizeof}(\mathrm{Bfloat16})$, where $N_r$ is the number of residues and $N_{head}$ is the number of attention heads, because of the cubic scaling in the attention context. It exceeds 20 GB for 48 layers when $N_r = 384$ and $N_{head} = 4$. Such huge memory consumption makes it impractical to store all the activations for backpropagation, so AlphaFold leverages the gradient checkpointing technique [11] to reduce memory consumption. However, AlphaFold is still a memory-heavy model, and with limited memory capacity each device can still process only one data sample at a time during training.
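The "exceeds 20 GB" figure can be checked with a short back-of-the-envelope calculation. The sketch below assumes the cubic term has the form n_res^3 × n_heads × bytes per element, and takes 384 residues (the fine-tuning length in Table I), 4 heads (the Pair Stack head count), 2 bytes per Bfloat16 value, and 48 Evoformer blocks. The function name and the exact form of the expression are assumptions made for illustration.

```python
def attention_activation_bytes(n_res, n_heads, bytes_per_elem=2, n_layers=1):
    """Assumed activation size: n_res^3 scores per head, stored for every layer."""
    return n_res ** 3 * n_heads * bytes_per_elem * n_layers


total_bytes = attention_activation_bytes(n_res=384, n_heads=4,
                                         bytes_per_elem=2, n_layers=48)
total_gib = total_bytes / 2 ** 30   # binary gigabytes
total_gb = total_bytes / 1e9        # decimal gigabytes
print(f"{total_gib:.2f} GiB ({total_gb:.2f} GB)")
```

Under these assumptions the total is a little over 20 GB in either unit, and it grows eightfold if the number of residues doubles, which is why long sequences are so demanding.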

IV Implementation and Optimization

In this section we will introduce FastFold implementation and optimization, including: 1) Design and implementation of high performance GPU kernel; 2) Design and implementation of Dynamic Axial Parallelism; 3) Communication optimization through Duality Async Operation.

IV-A Computational Optimizations

According to the performance characteristics described in the previous section, GEMM operators in AlphaFold account for only a small portion of total runtime, and there is a lot of potential for performance improvement in the non-GEMM part. To achieve high performance, we implemented several model-specific optimizations, including kernel fusion and highly optimized kernels for Batch Reduction operators (Softmax and LayerNorm).

IV-A1 Kernel Fusion

Kernel fusion is a common performance optimization technique in deep learning, which can reduce the overhead of memory accesses, improve the overall efficiency of the memory system, and reduce kernel launch overhead. In the AlphaFold model, we apply kernel fusion in two ways:

Merge GEMM. In the computation of Query, Key and Value in Attention, we can merge the three Linear layers of QKV into one, thus improving computational efficiency and reducing kernel launch overhead. This optimization is also used by other Transformer optimization implementations, such as TurboTransformer [12]. In the AlphaFold-specific Triangular Updates Module (Figure 4), we can also use the merge GEMM method to merge the left projection with the right projection and the left gating with the right gating.
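The equivalence behind Merge GEMM can be checked on a toy example: concatenating the three weight matrices column-wise and performing one matrix multiplication yields the same Q, K and V as three separate multiplications. The sketch uses nested lists and hypothetical helper names, and assumes the three projections have the same output width.

```python
def matmul(a, b):
    """Plain dense matrix product for nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def concat_cols(*mats):
    """Concatenate matrices with equal row counts along the column axis."""
    return [sum((m[r] for m in mats), []) for r in range(len(mats[0]))]


def merged_qkv(x, w_q, w_k, w_v):
    """One GEMM against the merged weight, then slice the result into Q, K, V."""
    qkv = matmul(x, concat_cols(w_q, w_k, w_v))
    width = len(w_q[0])
    q = [row[:width] for row in qkv]
    k = [row[width:2 * width] for row in qkv]
    v = [row[2 * width:] for row in qkv]
    return q, k, v


x = [[1, 2], [3, 4]]
w_q = [[1, 0], [0, 1]]
w_k = [[2, 1], [1, 2]]
w_v = [[0, -1], [1, 0]]
q, k, v = merged_qkv(x, w_q, w_k, w_v)
```

The arithmetic is unchanged; the benefit is that the input is read once and a single, larger kernel is launched instead of three small ones.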

JIT Fusion. PyTorch JIT[13] can generate optimized TorchScript "just in time". We therefore use PyTorch JIT to combine several consecutive element-wise operators into single fused operators (bias + sigmoid + element-wise product, bias + dropout + add, etc.).

IV-A2 Fused Softmax

The softmax function is a normalized exponential function that maps its input elements to values between 0 and 1 that sum to 1. For numerical stability, softmax is generally implemented by subtracting the maximum value from all elements before computing the exponential, which narrows the range of the exponentiated values. The formula for softmax is therefore as follows:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$$

The softmax function is used in attention, where the attention score is normalized by the softmax function and subsequently calculated with Value. In the AlphaFold model, the input of the softmax function has many rows, but the number of elements of each row is relatively small. For this situation, if the native kernel is not properly implemented and parallelized, it can easily deliver very low performance.

For small column sizes, we use one warp to compute one row of data, which can be done very efficiently using register-level communication primitives within the warp. Referring to Figure 5, to get the global max of a row, each thread first computes its local max, and WarpAllReduce then produces the global max across threads. We then perform the subtraction and exponential operations, compute the local sum in each thread, and use WarpAllReduce to get the global sum, followed by one final division.

We implement WarpAllReduce to perform the Global Max and Global Sum operations between the threads within a warp, using the warp-level primitive __shfl_xor_sync. In addition, we fuse the scaling and bias addition into the softmax kernel and merge memory accesses to improve the overall performance of this part.

Fig. 5: Implementation of Softmax Kernel in FastFold. FastFold uses a warp to compute a row of data, and uses warp-level primitive to implement WarpAllReduce
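The data flow of the warp-level softmax can be mimicked on the CPU. The sketch below is not the CUDA kernel: it simulates 32 lanes with a Python list and models the XOR-shuffle exchange as a butterfly in which lane `l` combines its value with lane `l ^ offset`. The function names, the strided assignment of columns to lanes, and the optional `scale` and `bias` arguments are assumptions made for illustration.

```python
import math

WARP_SIZE = 32


def warp_all_reduce(lane_values, op):
    """Butterfly reduction: after log2(32) exchange steps every lane holds the result."""
    vals = list(lane_values)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane combines its value with the lane whose index differs in one bit.
        vals = [op(vals[lane], vals[lane ^ offset]) for lane in range(WARP_SIZE)]
        offset //= 2
    return vals


def warp_softmax_row(row, scale=1.0, bias=None):
    """Softmax of one row, computed the way one simulated warp would do it."""
    cols = len(row)
    bias = bias if bias is not None else [0.0] * cols
    x = [row[c] * scale + bias[c] for c in range(cols)]  # fused scale + bias
    # Lane l owns columns l, l + 32, l + 64, ...
    local_max = [max((x[c] for c in range(lane, cols, WARP_SIZE)), default=-math.inf)
                 for lane in range(WARP_SIZE)]
    row_max = warp_all_reduce(local_max, max)
    exps = [math.exp(x[c] - row_max[c % WARP_SIZE]) for c in range(cols)]
    local_sum = [sum(exps[c] for c in range(lane, cols, WARP_SIZE))
                 for lane in range(WARP_SIZE)]
    row_sum = warp_all_reduce(local_sum, lambda a, b: a + b)
    return [exps[c] / row_sum[c % WARP_SIZE] for c in range(cols)]


row = [0.1 * i for i in range(40)]
probs = warp_softmax_row(row)
```

Because every lane ends the butterfly holding the same reduced value, no lane needs to broadcast the result afterwards, and the whole row is processed without touching shared memory.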

IV-A3 LayerNorm

Layer Normalization is one of the most common operations in the Transformer model. There are 12 LayerNorm layers in one Evoformer block: 4 for the MSA Stack, 7 for the Pair Stack, and 1 for Outer Product Mean. In a LayerNorm layer, the input data is normalized using the mean and variance computed over the last dimensions, and an element-wise scale and bias are then applied with the learnable parameters $\gamma$ and $\beta$. The calculation is as follows:

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

There are several methods for calculating the variance, including the two-pass method, the one-pass method, and the Welford algorithm[14]. The two-pass method obtains the global mean in the first pass and calculates the variance from its straightforward definition in the second pass. The one-pass method requires only one pass over the data and uses the formula $\sigma^2 = \mathrm{E}[x^2] - (\mathrm{E}[x])^2$. Compared with the two-pass method, the one-pass method achieves good performance more easily, but it is difficult to use in practical implementations because it is numerically unstable. The Welford algorithm can be represented by the following equations:

$$\bar{x}_{n+1} = \bar{x}_n + \frac{x_{n+1} - \bar{x}_n}{n+1}$$

$$\sigma^2_{n+1} = \sigma^2_n + \frac{(x_{n+1} - \bar{x}_n)(x_{n+1} - \bar{x}_{n+1}) - \sigma^2_n}{n+1}$$

The Welford algorithm is also a single pass method and has good numerical stability, so we use the Welford method to calculate the variance.

Similar to the Softmax kernel, we also use one warp to calculate one-row data. Each thread in one warp calculates local mean and variance from part of data from one row and then uses WarpAllReduce to obtain the global mean and variance. Finally, each thread calculates the normalized value based on the global mean and variance.
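The per-thread and cross-thread steps can be sketched as follows. Each simulated thread runs the Welford update over its strided slice of the row, and the partial results are then combined pairwise. The combination rule used here is the standard parallel merge of (count, mean, sum of squared deviations) triples; the text does not say how the partial statistics are combined inside WarpAllReduce, so that part is an assumption, as are the function names and the choice of 4 threads.

```python
import math


def welford(values):
    """Single-pass running (count, mean, M2), where M2 is the sum of squared deviations."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    return count, mean, m2


def merge(a, b):
    """Combine two partial (count, mean, M2) results into one."""
    (n_a, mean_a, m2_a), (n_b, mean_b, m2_b) = a, b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def row_statistics(row, num_threads=4):
    """Mean and population variance of a row from per-thread partial results."""
    total = (0, 0.0, 0.0)
    for t in range(num_threads):
        total = merge(total, welford(row[t::num_threads]))
    n, mean, m2 = total
    return mean, m2 / n


def layer_norm_row(row, gamma, beta, eps=1e-5, num_threads=4):
    mean, var = row_statistics(row, num_threads)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * inv_std * g + b for x, g, b in zip(row, gamma, beta)]


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
mean, var = row_statistics(row)
normalized = layer_norm_row(row, gamma=[1.0] * 8, beta=[0.0] * 8)
```

Keeping the mean and the sum of squared deviations together is what makes the statistics mergeable in any order, so the data is still read only once.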

In Softmax and LayerNorm kernel, we use vectorized memory access to improve CUDA Kernel performance. Many CUDA Kernels are bandwidth constrained, and using vectorized memory access can reduce the total number of instructions, reduce latency, and improve bandwidth utilization.

IV-B Parallel Evoformer

Because of the restricted global batch size, AlphaFold training can only scale to a maximum of 128 devices. This makes the training time very long. In the official experiments, the complete training time of AlphaFold is as long as 11 days. Therefore, by introducing model parallelism, the training can be better scaled to more computing resources, thus reducing the overall training time.

As described in the background section, Pipeline Parallelism introduces bubbles that reduce the efficiency of hardware resource utilization. To improve the performance of Pipeline Parallelism, the mini-batch needs to be further split into multiple micro-batches and combined with gradient accumulation. Because of the batch size limitation in AlphaFold training, Pipeline Parallelism cannot scale to more devices while maintaining performance.

IV-B1 Tensor Parallelism on AlphaFold

Tensor Parallelism can be applied to AlphaFold model training. The main structure of Evoformer contains Attention and FeedForward, similar to the structure of vanilla Transformer. Therefore, we can use Tensor Parallelism similar to that in Megatron-LM. Tensor Parallelism is mainly imposed on the Linear layer because matrix multiplication is relatively easy to distribute to different devices.

Since there is no previous work on Tensor Parallelism for AlphaFold, we describe here how it can be used for AlphaFold training. The two types of Tensor Parallelism proposed in Megatron-LM are column parallelism and row parallelism. The Linear layer can be written as $Y = XW$, where $X$ and $Y$ are the input and output vectors, respectively, and $W$ is the weight matrix. In column parallelism, we divide the weight matrix column-wise across N devices as $W = [W_1, W_2, \dots, W_N]$ and conduct the matrix multiplications $Y_i = XW_i$ in parallel, resulting in N output vectors $Y_1, \dots, Y_N$. In row parallelism, we divide the weight matrix row-wise and the input vectors column-wise across N devices, then conduct the matrix multiplications $Y_i = X_iW_i$ in parallel; the resulting partial outputs are summed with AllReduce to obtain the final $Y$. In FeedForward, column parallelism can be used in the first Linear layer and row parallelism in the second. In Multi-head Attention, column parallelism can be used for the Linear layers of QKV and row parallelism for the output Linear layer. This arrangement minimizes the number of communications that Tensor Parallelism requires.
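The pairing of column parallelism with row parallelism can be demonstrated on a small FeedForward block. In the sketch below each loop iteration stands in for one device, the final element-wise sum stands in for the AllReduce, and ReLU is used as a placeholder activation. Helper names, shapes and the choice of activation are illustrative assumptions, not taken from the implementation.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def split_cols(m, n):
    step = len(m[0]) // n
    return [[row[d * step:(d + 1) * step] for row in m] for d in range(n)]


def split_rows(m, n):
    step = len(m) // n
    return [m[d * step:(d + 1) * step] for d in range(n)]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def parallel_feed_forward(x, w1, w2, num_devices):
    w1_shards = split_cols(w1, num_devices)  # column parallelism, first layer
    w2_shards = split_rows(w2, num_devices)  # row parallelism, second layer
    partials = []
    for d in range(num_devices):
        # The hidden slice stays local: no communication between the two layers.
        hidden_d = relu(matmul(x, w1_shards[d]))
        partials.append(matmul(hidden_d, w2_shards[d]))
    y = partials[0]
    for p in partials[1:]:
        y = add(y, p)  # stands in for the AllReduce (sum) across devices
    return y


x = [[1, -2], [3, 4]]
w1 = [[1, 0, -1, 2], [2, 1, 0, -1]]
w2 = [[1, 2], [0, 1], [3, -1], [1, 1]]
reference = matmul(relu(matmul(x, w1)), w2)
sharded = parallel_feed_forward(x, w1, w2, num_devices=2)
```

The column-split output of the first layer is exactly the column-split input the second layer needs, so only one AllReduce is required per FeedForward block in the forward pass. Note that the AllReduce still carries a full-size output tensor on every device.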

Tensor Parallelism (TP) introduces many synchronous communications in each Evoformer layer and can only be applied to parts of the model, such as Attention and FeedForward. The scaling of Tensor Parallelism is limited by the number of heads in Attention. AlphaFold has 4 heads in the Pair Stack, so Tensor Parallelism can be scaled to a maximum of 4 devices. Moreover, the parameter count of AlphaFold is small, while its inputs and activations are extremely large. Tensor Parallelism is therefore inefficient for scaling AlphaFold to more devices.

IV-B2 Dynamic Axial Parallelism

To tackle this problem, we propose Dynamic Axial Parallelism (DAP), which provides end-to-end parallelism for Evoformer. In AlphaFold training, the number of parameters in the model is relatively small, but the activation is relatively large. So unlike Tensor Parallelism, we choose to keep the complete model parameters on each device and divide the input and activation among different devices. Both MSA representation and pair representation processed by Evoformer contain two sequence dimensions, but the calculations in Evoformer are all along one sequence dimension in the data. So we can divide on the other dimension and insert All_to_All communication when the two sequence dimensions are transformed, thus keeping the data dimension of the computation direction complete on each device, which is shown in Figure 6(a). No other communication is needed in the computation of Attention. In Outer Product Mean, we need to get the global Left Projection by AllGather, and then perform the outer product mean calculation with the local Right Projection. The Triangular Updates Module also uses a similar approach to Outer Product Mean for parallelism.

(a) Transpose
(b) Outer Product Mean
Fig. 6: Communication for Dynamic Axial Parallelism. All_to_All is required at transpose, for example in the middle of Row Attention and Column Attention. AllGather needs to be inserted in Outer Product Mean and Triangular Updates Module.
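The role of All_to_All in Dynamic Axial Parallelism can be illustrated with a single-process simulation. A small matrix with two sequence dimensions is first sharded by rows, so each device can run a row-wise computation locally. An All_to_All then re-shards it by columns, so each device can run a column-wise computation locally. Lists stand in for devices, sums stand in for attention, and all function names are hypothetical.

```python
def shard_rows(matrix, num_devices):
    step = len(matrix) // num_devices
    return [matrix[d * step:(d + 1) * step] for d in range(num_devices)]


def all_to_all_rows_to_cols(row_shards):
    """Re-shard from 'split by rows' to 'split by columns'."""
    n = len(row_shards)
    step = len(row_shards[0][0]) // n
    # send[src][dst]: the part of src's rows that falls into dst's column range.
    send = [[[row[dst * step:(dst + 1) * step] for row in row_shards[src]]
             for dst in range(n)] for src in range(n)]
    # Each destination stacks the blocks it receives, in source order.
    return [[row for src in range(n) for row in send[src][dst]]
            for dst in range(n)]


matrix = [[r * 10 + c for c in range(4)] for r in range(4)]
row_shards = shard_rows(matrix, num_devices=2)

# Row-wise work needs complete rows, which every device already holds.
row_sums = [[sum(row) for row in shard] for shard in row_shards]

col_shards = all_to_all_rows_to_cols(row_shards)

# Column-wise work now sees complete columns on every device.
col_sums = [[sum(col) for col in zip(*shard)] for shard in col_shards]
```

Each device keeps only its own slice of the activation at all times, and the model parameters are never split, which matches a workload with small parameters and large activations.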

Table III compares the communication overhead introduced by the two parallel approaches, Tensor Parallelism and Dynamic Axial Parallelism. TP supports parallelism only in Attention and FeedForward (FF), while DAP supports all computational modules of Evoformer. TP introduces 12 AllReduce communications in Attention+FF, 6 in the forward pass and 6 in the backward pass. In the forward computation, DAP introduces one AllGather communication in the Outer Product Mean and one in each of the two Triangular Updates Modules. The backward pass has no additional communication overhead. DAP thus makes these two parts parallelizable by introducing three AllGather communications. DAP also needs to insert All_to_All communication between calculations in different directions, 12 times per Evoformer block (6 in the forward pass and 6 in the backward pass). However, TP needs to pass the entire intermediate representation during AllReduce communication, while DAP only needs to pass a fraction of the intermediate representation, determined by the number of devices, during All_to_All communication. So the communication volume of DAP is much smaller than that of TP.

Overall, DAP has several advantages over TP: 1) it supports all computational modules in Evoformer; 2) the amount of communication required by DAP is much smaller than that required by TP; 3) Parallelism can distribute activation to different devices, and DAP consumes less memory than TP because it has more parallel parts; 4) DAP has more opportunities for communication optimization, such as computation-communication overlap.


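Module | TP | DAP
Attention+FF | 12 × AllReduce | No Comm
Outer Product Mean | not supported | 1 × AllGather
Triangle Update Module | not supported | 2 × AllGather
Transpose | No Comm | 12 × All_to_All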


TABLE III: Communication overhead for each Evoformer block.

IV-C Communication Optimization

Dynamic Axial Parallelism requires All_to_All and AllGather communication between all axial parallelism devices. Similar to Tensor Parallelism, because of the synchronized communication in the layer, communication of Dynamic Axial Parallelism can be a bottleneck. So we design and implement some communication optimization strategies to reduce the overhead.

Communication is synchronous in PyTorch. Although computation and communication are assigned to different CUDA streams, PyTorch makes the computation stream wait for the communication to complete. Moreover, the computation in the vanilla Transformer model is sequential, so frameworks such as Megatron-LM[10] and DeepSpeed-MoE[15] have no opportunity to overlap communication with computation inside a Transformer layer.

In AlphaFold, by contrast, there are two representations to process (MSA and pair), which gives us the opportunity to overlap computation and communication. The overlap is affected by the order of kernel launches. However, in dynamic-graph deep learning frameworks like PyTorch, it is hard to use asynchronous communication interfaces explicitly and to implement the corresponding communication in backpropagation. So we design Duality Async Operation for PyTorch to implement communication and computation overlap.

Duality Async Operation, shown in Figure 7, consists of a pair of communication operators. During forward propagation, the former operator triggers asynchronous communication, computation that does not depend on it is then performed on the computation stream, and the latter operator blocks until the communication is completed, after which the subsequent computation proceeds. During backward propagation, the roles are swapped: the latter operator triggers the asynchronous communication and the former operator blocks until it completes. With asynchronous communication, the communication task runs on the communication stream while the computation stream performs non-dependent computation at the same time, so the communication overhead is reduced by computation-communication overlap.

Fig. 7: Asynchronous with Duality Async Operation. Computation and communication overlap is achieved through a pair of communication operators that trigger and block asynchronous communication in the forward and backward.

We implement Duality Async Operation with the PyTorch Autograd Function provided by the PyTorch automatic differentiation package. The Autograd Function makes it possible to define the forward and backward passes independently. We accomplish the interoperation by passing asynchronous communication requests between the two operators, thus enabling triggering and blocking of the asynchronous communication.
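The trigger/block duality can be sketched without PyTorch. The following is a standard-library stand-in and not the authors' implementation: a single-worker thread pool plays the role of the communication stream, `fake_collective` is a hypothetical placeholder for a collective such as All_to_All, and the class and method names are assumptions made for illustration. In the real design, the two operators would be Autograd Functions whose forward and backward methods play these four roles.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the communication stream: one worker running "collectives".
comm_stream = ThreadPoolExecutor(max_workers=1)
events = []  # records the order of trigger/block calls on the main thread


def fake_collective(values):
    """Hypothetical placeholder for All_to_All / AllGather (identity copy)."""
    return list(values)


class DualityAsyncPair:
    """Two operators that share one pending communication request."""

    def __init__(self):
        self._request = None

    def former_forward(self, x):
        events.append("forward: former triggers")
        self._request = comm_stream.submit(fake_collective, x)

    def latter_forward(self):
        events.append("forward: latter blocks")
        return self._request.result()

    def latter_backward(self, grad):
        events.append("backward: latter triggers")
        self._request = comm_stream.submit(fake_collective, grad)

    def former_backward(self):
        events.append("backward: former blocks")
        return self._request.result()


pair = DualityAsyncPair()

# Forward: trigger, do independent work, then block.
pair.former_forward([1, 2, 3])
independent = sum(i * i for i in range(1000))  # no dependency on the communication
events.append("forward: independent compute")
out = pair.latter_forward()

# Backward: the roles of the two operators are swapped.
pair.latter_backward([0.1, 0.2, 0.3])
events.append("backward: independent compute")
grad_in = pair.former_backward()

comm_stream.shutdown()
```

The shared `_request` handle corresponds to the asynchronous communication request that the text describes as being passed between the two operators.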

Fig. 8: Fused Softmax Performance on NVIDIA Tesla A100: (a) BFloat16; (b) Float32.
Fig. 9: LayerNorm Performance on NVIDIA Tesla A100: (a) BFloat16; (b) Float32.

V Evaluations

We first evaluated the performance improvement of the kernels for Evoformer, and then the end-to-end training and inference performance. All experiments were done on the NVIDIA Tesla A100 platform. The baselines are the official implementation of AlphaFold and OpenFold[16], an open-source PyTorch implementation. The official implementation of AlphaFold has only the inference part, while OpenFold reproduces both training and inference according to the original AlphaFold paper.

V-A Evoformer Performance

Figure 8(a) and Figure 9(a) present the performance comparisons of Fused Softmax and LayerNorm, respectively. We use a certain range of problem sizes to evaluate the latency of kernel at Bfloat16 precision. For Fused Softmax, we compare the performance of PyTorch native kernel and FastFold optimized kernel. The problem size means that the length of the sequence of Attention input is and the hidden size of Attention is . As can be seen from Figure 8(a), the FastFold kernel can achieve a performance improvement of . For LayerNorm, we compare not only the PyTorch native kernel but also the highly optimized LayerNorm kernel from NVIDIA Apex[17]. According to Figure 9(a), the performance of FastFold is improved by and compared to PyTorch and Apex, respectively. Because of our special optimization for the limited range, FastFold also achieves a good performance improvement over the highly optimized Apex LayerNorm.

V-B End-to-End Training Performance

In the evaluation of End-to-End Training Performance, we use the training parameters from the official AlphaFold paper for testing as much as possible. This allows a better comparison of how different methods or implementations work on real training scenarios. All training experiments are done on a 128-node GPU supercomputer. In the supercomputer, each node includes 4 NVIDIA Tesla A100s and has NVIDIA NVLink for GPU interconnects.

Because tensor parallelism is more dependent on high-speed interconnections between devices for communication, model parallelism is generally used within nodes and data parallelism between nodes during training. We test the training performance of the model at two levels, model parallelism and data parallelism, respectively, and the results are shown in Figure 10 and Figure 11. For model parallelism, we compare the scalability of two parallel methods, Tensor Parallelism and Dynamic Axial Parallelism, under two training settings, Initial Training and Fine-tuning.

As can be seen in Figure 10, the scalability of DAP is significantly better than that of TP for both Initial Training and Fine-tuning. The scalability of Initial Training is worse because the sequence lengths in the residue and MSA directions are smaller in Initial Training, so the communication overhead is more pronounced. Note that when Initial Training scales to 4 GPUs, we can turn off activation checkpointing because the GPU memory is sufficient. The resulting performance improvement is shown in Figure 10 as the gap between the blue dashed line and the blue solid line.

Fig. 10: Parallel Efficiency on Model Parallelism Intra-node.

For data parallelism, we scale with fixed model parallelism settings. Following the settings of the official AlphaFold paper, data parallelism scales the global batch size to 128. In Fine-tuning, the computation of a sample is scaled to a full node (4 GPUs) using DAP, so data parallelism scales from 1 to 128 nodes. In Initial Training, considering the scaling efficiency, DAP is only scaled to half a node (2 GPUs), so data parallelism is scaled to 64 nodes only. The scaling results are shown in Figure 11. Data parallelism scales almost linearly, and the scaling efficiency of Fine-tuning reaches 90.1%.

Fig. 11: Parallel Efficiency on Data Parallelism Inter-node.


Implementation | Framework | Training Process | Hardware | Step Time (s) | Training Time (days) | Resource
AlphaFold | JAX[18] | Initial training | 128 TPUv3 | / | 11 | 33792 TPU hours
 | | Fine-tuning | | / | |
OpenFold | PyTorch | Initial training | 128 A100 | 6.186 | 8.39 | 25774 GPU hours
 | | Fine-tuning | | 20.657 | |
FastFold | PyTorch | Initial training | 256 A100 | 2.487 | 2.81 | 20738 GPU hours
 | | Fine-tuning | 512 A100 | 4.153 | |


TABLE IV: Resource and Time Cost Comparison.

Based on the results of our evaluations of training performance, the overall time and economic cost of AlphaFold training can be extrapolated. Table IV lists and compares the time and economic costs of the three implementations: AlphaFold, OpenFold, and FastFold. AlphaFold does not have publicly available training code, so the AlphaFold data is taken from the official paper.

Considering the time and economic cost, we chose to use 256 A100s for Initial Training and then scale to 512 A100s during the Fine-tuning phase (Table IV). Based on this setting, FastFold can reduce the training time to 2.81 days. Compared with AlphaFold, which requires 11 days of training, the training time is reduced by a factor of 3.91. Compared with OpenFold, the training time is reduced by a factor of 2.98, and the economic cost is reduced by 20%. In the Fine-tuning phase, FastFold achieved an aggregate of 6.02 PetaFLOPs with 90.1% parallel efficiency. With such a significant reduction in time and economic costs, FastFold makes training a protein structure prediction model faster and cheaper, which will drive the efficiency of research and development of related models and facilitate the development of Evoformer-based protein structure prediction models.
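The ratios quoted above can be rechecked from the Table IV figures with a few lines of arithmetic. The snippet below is only a consistency check on the published numbers. The variable names are illustrative, and the assumption that resource hours equal device count times training days times 24 is ours, which holds for the single-configuration AlphaFold and OpenFold rows.

```python
# Figures taken from Table IV.
alphafold_days = 11.0
openfold_days = 8.39
fastfold_days = 2.81
openfold_gpu_hours = 25774
fastfold_gpu_hours = 20738

# Time reduction factors relative to FastFold.
speedup_vs_alphafold = alphafold_days / fastfold_days
speedup_vs_openfold = openfold_days / fastfold_days

# Relative reduction in GPU hours compared with OpenFold.
cost_reduction = 1 - fastfold_gpu_hours / openfold_gpu_hours

# Assumed relation: resource hours = devices * days * 24.
alphafold_tpu_hours = 128 * alphafold_days * 24
openfold_hours_check = 128 * openfold_days * 24

# Total FastFold training time expressed in hours.
fastfold_hours = fastfold_days * 24

print(f"vs AlphaFold: {speedup_vs_alphafold:.2f}x")
print(f"vs OpenFold:  {speedup_vs_openfold:.2f}x")
print(f"GPU-hour reduction vs OpenFold: {cost_reduction:.1%}")
print(f"FastFold training time: {fastfold_hours:.1f} hours")
```

The OpenFold ratio comes out slightly above the 2.98 quoted in the text, which is consistent with the table entries themselves being rounded.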

V-C End-to-End Inference Performance

We compare the inference performance of FastFold, OpenFold, and AlphaFold implementations in three scenarios: short sequence, long sequence, and extremely long sequence. All inference experiments are done on a GPU server consisting of 8 NVIDIA A100s (with NVLink). In practical inference scenarios with AlphaFold, it is generally necessary to infer multiple models and ensemble the results to improve the accuracy of the prediction results. Since the performance characteristics of multiple models are consistent, all our experiments on inference performance only evaluate the inference performance of a single model.

For short sequences, typically amino acid sequences of no more than 1K in length, single-model inference takes from a few seconds to about a minute. In this range, GPU memory consumption is relatively small and distributed inference would be less efficient. So we compared the inference latency of the three implementations on 1 GPU, and the results are shown in Figure 12. In the scenario of short sequence inference, FastFold’s inference performance is improved by and compared to AlphaFold and OpenFold, respectively. The performance benefits of FastFold are mainly due to the highly optimized CUDA kernel implementation. It is worth noting that the performance of AlphaFold is lower on the GPU platform because it uses JAX, which has better support for Google TPU, and the computational performance of JAX on the GPU platform may not be optimal. In addition to the inference time, it takes seconds to compile kernels during AlphaFold inference when using the JAX framework.

Fig. 12: Comparison of inference latency for short sequences.

For long sequence inference with amino acid sequences of 1K to 2.5K in length, direct inference already encounters memory capacity problems and inference times of several minutes or even tens of minutes. AlphaFold and OpenFold need to use the chunking technique for inference, that is, in a single operator, chunks are divided along the sequence dimension and then computed sequentially. The chunking technique can effectively reduce the memory consumption during the computation, but to a certain extent, it will reduce the inference performance. For FastFold, the distributed inference method can be used to reduce the memory capacity requirement and significantly shorten the inference time. As shown in Figure 13, FastFold can reduce inference time by in comparison to OpenFold and by in comparison to AlphaFold when using distributed inference. We can see that Dynamic Axial Parallelism can scale to more GPUs (Tensor Parallelism can only scale to 4 GPUs due to the limitations mentioned in section IV.B), and the overall scaling efficiency is significantly better than Tensor Parallelism.

Fig. 13: Comparison of inference performance for long sequences: (a) Sequence Length = 1536; (b) Sequence Length = 2048; (c) Sequence Length = 2560.

For inference with extremely long sequences over 3K in length, the memory capacity of a single GPU is exceeded even with the chunking technique. As shown in Table V, for both AlphaFold and OpenFold, Out of Memory (OOM) occurs when the sequence length reaches 3K. For FastFold, however, distributed inference makes the computation and memory of more GPUs available, so extremely long sequence inference can be completed. Moreover, for sequences of up to 4K in length, the inference latency of FastFold is within 10 minutes.


Sequence Length | AlphaFold | OpenFold | FastFold (8 GPU) | FastFold (4 GPU)
2560 | 1511.315 | 1265.193 | 133.381 | 154.422
3072 | OOM | OOM | 201.916 | 238.51
3584 | OOM | OOM | 388.691 | 414.185
4096 | OOM | OOM | 547.955 | OOM


TABLE V: Inference Latency for Extremely Long Sequence (s).
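The latencies in Table V can be turned into speedup factors with a short calculation. The sketch below assumes that the (8 GPU) and (4 GPU) columns report FastFold with distributed inference, which is our reading of the flattened table. The dictionary layout and key names are illustrative, and `None` marks an OOM entry.

```python
# Latencies in seconds from Table V; None marks an Out of Memory entry.
# Assumption: the "(8 GPU)" and "(4 GPU)" columns are FastFold results.
latency = {
    2560: {"alphafold": 1511.315, "openfold": 1265.193, "fastfold_8": 133.381, "fastfold_4": 154.422},
    3072: {"alphafold": None, "openfold": None, "fastfold_8": 201.916, "fastfold_4": 238.51},
    3584: {"alphafold": None, "openfold": None, "fastfold_8": 388.691, "fastfold_4": 414.185},
    4096: {"alphafold": None, "openfold": None, "fastfold_8": 547.955, "fastfold_4": None},
}


def speedup(row, baseline, target="fastfold_8"):
    """Return baseline latency divided by target latency, or None if either ran OOM."""
    if row[baseline] is None or row[target] is None:
        return None
    return row[baseline] / row[target]


# Only the 2560 row has baseline numbers to compare against.
speedup_vs_openfold = speedup(latency[2560], "openfold")
speedup_vs_alphafold = speedup(latency[2560], "alphafold")

# Sequence lengths that complete within 10 minutes on 8 GPUs.
within_ten_minutes = [n for n, row in latency.items()
                      if row["fastfold_8"] is not None and row["fastfold_8"] <= 600]

for name, value in [("OpenFold", speedup_vs_openfold), ("AlphaFold", speedup_vs_alphafold)]:
    print(f"2560 residues, 8 GPUs vs {name}: {value:.2f}x")
```

At 3072 residues and above the baselines run out of memory, so the helper returns `None` there and no ratio is reported.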

FastFold can significantly reduce inference time and economic cost for inference of different sequence lengths, and enable protein structure prediction of ultra-long sequences. FastFold can support large-scale high-throughput protein structure prediction tasks, which will greatly broaden the application scenarios of protein structure prediction models.

V-D Validation

We validated the correctness of FastFold by comparing inference results with the official implementation of AlphaFold from Google DeepMind. From the point of view of theoretical analysis, neither the optimization of the kernel nor the parallel strategy will change the structure and results of the computation. However, because of the introduction of some custom CUDA Kernel, the different calculation methods and orders will lead to certain precision errors. To demonstrate that FastFold’s prediction is as expected, we use both AlphaFold and FastFold to predict the same amino acid sequence, and visualize the inference results for comparison. The results are shown in Figure 14, and it can be seen that the protein structures predicted by FastFold and the results of AlphaFold match.

Fig. 14: Protein structure prediction results of AlphaFold and FastFold.

VI Related Work

The optimization of protein structure prediction model training and inference has rarely been studied. ParaFold[19] is an optimized system for AlphaFold inference on heterogeneous platforms. Its main contribution is in optimizing the data processing flow, including multi-threaded parallel data processing on the CPU platform. FastFold, in contrast, focuses on the training and inference of the AlphaFold model on the GPU platform. The model optimization of FastFold and the data processing flow optimization of ParaFold are complementary and can be combined in future work.

There has been a significant amount of work on optimizing the performance of the Transformer model, which can be summarized into two categories: efficient design of Transformer and efficient implementation of Transformer. Many works try to improve Transformer by reducing the complexity of attention computation through chunking, sliding windows, or low-rank approximation, such as Performer[20], Reformer[21], and Linformer[22]. On the other hand, works such as LightSeq[23, 24] and TurboTransformer[12] focus on optimizing the performance of the vanilla Transformer on the GPU platform. These efforts include CUDA optimization and memory management for the Transformer model. However, there are several differences between the Evoformer and the vanilla Transformer, and while some prior optimization techniques can be reused, FastFold is computationally optimized mostly based on the characteristics of Evoformer.

Many works are being done to tackle the challenge that large-scale training poses. Large batch training, such as LAMB[25] or LARS[26], has also been used to speed up training. These strategies address optimization issues that arise during data parallelism scaling. For large model training, there are mainly two approaches to achieve high performance. Megatron-LM[10] provides a sophisticated hybrid parallel strategy to scale the model to more GPUs for training. Another approach is DeepSpeed ZeRO[27, 28], which provides a memory-efficient optimizer combined with offload technology. For large-scale training of AlphaFold, the main difficulty lies in model parallelism, and FastFold proposed Dynamic Axial Parallelism, which has higher scaling efficiency than the current mainstream model parallelism methods.

VII Conclusion

Addressing the challenges of protein structure prediction model training and inference has important implications for its better application in structural biology. FastFold analyzes the main structure of AlphaFold from a computational perspective, providing an efficient CUDA Kernel implementation on the GPU platform. FastFold also proposed Dynamic Axial Parallelism, which allows both training and inference to scale to more GPUs thus reducing the time consumption by an order of magnitude, exceeding the current mainstream model parallelism methods. With these, FastFold greatly reduces the time cost and economic cost of protein structure prediction model training and inference. It improves the efficiency of design and deployment in the field of protein structure prediction models. Meanwhile, Dynamic Axial Parallelism makes it possible to design and train larger models to get higher performance. Other Protein Structure Prediction Models, such as RoseTTAFold[29] and MSA Transformer[30], can benefit from FastFold’s optimization technologies. Also, the parallel strategy can apply to Multidimensional Transformers[31], such as Video Vision Transformer[32].