A Decentralized Asynchronous Distributed Training Framework
We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit supercomputer. We demonstrate our gradient reduction techniques in the context of training a fully convolutional neural network to approximate the solution of a longstanding scientific inverse problem in materials imaging. Efficient distributed training on a 0.5 PB dataset produces a model capable of atomically-accurate reconstructions of materials, reaching in the process a peak performance of 2.15(4) EFLOPS_16.
In light of the recent successes of ever-larger Deep Neural Network (DNN) models and data sets Dai et al. (2019), the need for efficient distributed machine learning strategies on massively parallel systems is more significant than ever before. Various distributed deep learning approaches have been explored over the years, ranging from Multiple Instruction Multiple Data (MIMD) programming in model-parallelism Dean et al. (2012) to the Single Program Multiple Data (SPMD) style used in data-parallelism, and most recently pipelining algorithms Huang et al. (2018), parallel tensor contractions Shazeer et al. (2018), and task graph-based strategies Jia and Aiken (2019). Despite these advances, data-parallelism Krizhevsky et al. (2012) remains the most widely adopted distributed deep learning strategy: it is broadly applicable, and, by contrast to MIMD programming, its implementation is agnostic to a system's architecture.
As a distribution strategy, data-parallelism is communication-heavy, requiring the execution of blocking communication collectives to synchronize DNN gradients throughout a training run. A sub-optimal overlap between computation and communication operations during a single training step introduces communication overheads, or inefficiencies, in data-parallel distributed deep learning. On small- to moderate-scale systems, with 10s-100s of GPU/TPU accelerators, these scaling inefficiencies can be difficult to detect and systematically optimize due to system noise and load variability. Note, however, that even moderate scaling inefficiencies on the order of 5-10% accumulate across many training steps and training runs, and further increase the enormous carbon footprint of deep learning and its associated environmental impact Strubell et al. (2019). The scaling inefficiencies of data-parallel implementations are most readily apparent on large-scale systems such as supercomputers with 1,000s-10,000s of accelerators. Here, we show that supercomputers are ideal systems to develop and test new gradient reduction strategies to achieve near-linear scaling of data-parallelism. (The gradient reduction strategies we describe below have either been incorporated in the latest release of Horovod (https://github.com/horovod/horovod), i.e. Bitvector Allreduce, or are currently in the pull-request review stage, i.e. Grouping.)
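The gradient synchronization at the heart of data-parallelism can be sketched with a toy, NumPy-only stand-in for a blocking Allreduce (no MPI or NCCL involved; all names here are illustrative, not part of any real library):

```python
import numpy as np

def allreduce_average(worker_grads):
    """Simulate a blocking Allreduce: every worker ends up holding the
    element-wise average of all workers' local gradients."""
    avg = np.mean(worker_grads, axis=0)
    return [avg.copy() for _ in worker_grads]

# Two workers compute different local gradients for the same weight tensor.
g0 = np.array([1.0, 2.0])
g1 = np.array([3.0, 4.0])
synced = allreduce_average([g0, g1])
# Every replica now applies the identical averaged gradient, keeping the
# model weights in lockstep across workers. -> [2.0, 3.0] on both.
```

In a real framework, this collective blocks until all replicas contribute, which is exactly why its overlap with backpropagation compute determines scaling efficiency.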
Extending data-parallelism to the massive scale of supercomputing systems is also motivated by the latter’s traditional workload consisting of scientific numerical simulations Kent and Kotliar (2018). In particular, infusing deep learning into scientific simulations to speed-up their execution and decrease their computational demands often requires approximating the solution of longstanding inverse problems with DNN. Here, we demonstrate the first step in this direction, made possible by our improved gradient reduction strategies.
All measurements reported here were carried out on the Summit supercomputer at the Oak Ridge Leadership Computing Facility, a US Department of Energy Office of Science User Facility. Summit is a system dedicated to open science with access applications in the form of peer-reviewed scientific user proposals.
The Summit system consists of 256 racks populated by IBM Power System AC922 compute nodes (4,608 nodes in total), each equipped with 2 IBM POWER9 CPUs and 6 NVIDIA V100 GPUs. It is ideally suited for deep learning workloads due to its node-local NVMe (burst buffer) and the Tensor Cores on the V100 for faster low-precision operations. Within a Summit node, CPU-GPU and GPU-GPU communications are carried out over NVIDIA's NVLink interconnect, supporting a (peak) bi-directional bandwidth of 100 GB/s, where each set of 3 V100 GPUs is grouped in a ring topology with all-to-all connections to a POWER9 CPU. The CPU-CPU bridge consists of two NVLink connections, each with a (peak) bandwidth of 25 GB/s. Summit nodes are connected in a non-blocking fat-tree topology via a dual-rail Mellanox EDR 100G InfiniBand interconnect. The IBM Alpine GPFS provides 2.5 TB/s aggregate I/O bandwidth, which is not enough to feed over 27,600 V100 GPUs each processing at over 0.5 GB/s, while the NVMe offers a read bandwidth of 6 GB/s per node and provides a local I/O alternative that scales linearly with the number of compute nodes.
All of the data we include here was collected (and reproduced) during normal operations of the Summit supercomputer and in the presence of other applications running on available compute nodes. As such, the performance we report is typical of the system.
We focus on a data-parallelism approach to the distributed training of DNN. To date, the largest distributed DNN training was carried out by Kurth et al. (2018) to learn a segmentation task on climate simulation data. These authors used a modified DNN segmentation model (DeepLabV3 Chen et al. (2017)) which achieved a per-GPU computing performance of 38.45 TFLOPS_16, equivalently 31% of the theoretical peak of the V100 GPU (the subscript 16 refers to float16 precision).
One of the key innovations introduced in Kurth et al. (2018) is a hierarchical Allreduce strategy consisting of intra-node collectives with NCCL (v2.3) and inter-node collectives with IBM’s Spectrum-MPI. This communication strategy proved highly effective at reducing the ratio of communication time to compute time, and achieving a scaling efficiency of 90.7% on 4560 Summit nodes with a sustained (peak) performance of 990 PFLOPS (1.13 EFLOPS), but at the expense of skipping gradient synchronization/reduction every other training step.
The scaling efficiency used in this previous study and in other work using data-parallelism (including ours) is defined as the total number of inputs (i.e. images) processed during a training step as a function of computing resources (e.g. Summit nodes).
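Under this definition, scaling efficiency is simply measured throughput on n nodes divided by n times the single-node throughput. A minimal sketch, with hypothetical throughput numbers chosen only for illustration:

```python
def scaling_efficiency(throughput_n, throughput_1, n_nodes):
    """Scaling efficiency: measured aggregate throughput on n nodes
    relative to perfect (linear) scaling of single-node throughput."""
    return throughput_n / (throughput_1 * n_nodes)

# Hypothetical numbers: one node processes 100 images/s; 4600 nodes
# together process 427,800 images/s.
eff = scaling_efficiency(427_800, 100, 4600)  # -> 0.93
```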
In subsequent sections, we describe new orchestration strategies of collectives during the gradient reduction stage, which prove to be more efficient than a hierarchical allreduce, allowing us to achieve a scaling efficiency of 0.93 on 4,600 nodes, and near-perfect scaling efficiency on compute resources on the order of 1,000s of GPUs or fewer.
The optimized implementation of DNN mathematical operations in cuDNN and their fast execution on state-of-the-art GPUs such as the V100 Tensor Cores leads to small computation times, t_comp, during a training step (typically sub-second to seconds). The time required to perform gradient reduction using blocking collectives, t_comm, is therefore the key quantity to optimize in a data-parallel approach to distributed deep learning. We used Horovod Sergeev and Balso (2018), an open source library, to perform gradient reduction across model replicas during distributed training. Horovod embeds allreduce operations into the TensorFlow computation graph of the DNN and employs efficient inter-GPU communication via the MPI Allreduce algorithm and/or the NVIDIA Collective Communications Library (NCCL) NVIDIA (2018), depending on the configuration selected at installation time. Note that Horovod supports multiple frameworks and can be used to carry out data-parallelism with PyTorch Paszke et al. (2017) and MXNet Chen et al. (2015).
The hierarchical allreduce strategy introduced in Kurth et al. (2018) was originally implemented within Horovod but the publicly available code base does not contain all of the features described in Kurth et al. (2018). As such, a direct comparison between the hierarchical allreduce in Kurth et al. (2018) and the one we use here is not meaningful. Furthermore, some of the features of the original implementation of hierarchical allreduce made assumptions regarding the network topology that were somewhat specific to Summit’s architecture.
In Figure 1, we measured the scaling efficiency of hierarchical allreduce up to 1024 Summit nodes. The sub-linear scaling is evident and was traced to poor overlap between communication and computation caused by inefficient worker coordination at large node counts. The newly released NCCL (v2.4) addresses the latency issues of the systolic ring algorithm of NCCL (v2.3), using an implementation of double binary trees for full bandwidth utilization and logarithmic latency of allreduce operations Sanders et al. (2009). This new NCCL double binary trees implementation obviates the need for Horovod's explicitly hierarchical allreduce altogether, as seen from the 3x gain in performance between the green and blue lines in Figure 1. At larger node counts, the scaling inefficiency of data-parallelism as originally implemented in Horovod becomes apparent, necessitating new strategies.
Our main contributions consist of:
Implementing new gradient reduction strategies which produce optimal overlap between computation and communication, a decrease in t_comm during execution of the computation graph, and state-of-the-art scaling efficiency and performance of distributed deep learning up to 27,600 GPUs.
Harnessing these gradient reduction strategies in the distributed training of a DNN with 220 million weights on a dataset 500 TB in size to approximate, for the first time, a solution to an inverse problem in scientific imaging.
The gradient reduction strategies consist of: (1) a lightweight worker coordination technique (Bitvector Allreduce) and (2) a gradient tensor grouping strategy (Grouping). These two orchestration strategies improve on different aspects of distributed deep learning as currently implemented in Horovod. The effects of Bitvector Allreduce and Grouping on the scaling efficiency are shown in Figure 1 in black and red lines, respectively. In tandem, they lead to markedly better scaling efficiency (Figures 1, 2). These gradient reduction strategies are computing-platform agnostic and do not make any assumptions regarding the interconnect network topology.
First, Bitvector Allreduce modifies how the coordination of gradient tensor reduction via collectives is performed (see Figure 5). The main idea of Bitvector Allreduce is the use of cached meta-data, associated with each gradient tensor and locally accessible to each MPI rank, to globally coordinate the execution of collective operations. In essence, we replace the original master-worker strategy of Horovod with a single collective (an MPI_Allreduce on a bitvector) (see Figure 5b).
Second, we introduce a "grouping" scheme for the gradient tensors akin to a graph coloring algorithm. Essentially, each MPI rank locally colors the nodes of its computational dependency graph (node = gradient tensor), and groups of gradient tensors are formed from like colors (see Figure 6). Collective operations are then only issued for those groups which are ready across all ranks. One of the strengths of "grouping" is granting the user the flexibility to order collectives in a fashion that exploits the architecture of their DNN model, thereby achieving greater efficiency.
Finally, we note that both “Grouping” and “Bitvector Allreduce” can be used independently, but used in combination they provided the massive gains in performance we report here. In the next section we describe in detail the implementations of these novel orchestration strategies.
Harnessing the well-known function approximation capabilities of modern Deep Neural Networks (DNN) to solve challenging inverse problems in imaging Lucas et al. (2018) has been mostly explored within the field of medical imaging Adler and Öktem (2017); Rivenson et al. (2018), though there have been a few notable exceptions within materials imaging Cherukara et al. (2018); Laanait et al. (2019). In contrast to other application domains, materials imaging, especially at the atomic scale, has the benefit of having access to highly-accurate and fully-quantitative forward simulation models and theories underpinned by quantum theory. The massive size of a single training example, which are often multi-dimensional arrays, can easily reach GBs and presents new challenges in the training of DNN. Most notably, the need for efficient I/O and the distributed training of large DNN models and consequently large message sizes. While large scale scientific simulation problems are a prevalent workload on supercomputers, to this date, however, no previous work has harnessed the capabilities of high-performance computing to produce a DNN-based solution to a scientific inverse problem. We show that our improvements to gradient reduction strategies now make it possible to approximate solutions to scientific inverse problems with deep learning and supercomputing.
TensorFlow’s use of a graph-based scheduler permits the order of operations executed across workers to vary, even when running an identical DNN. However, collective operations which involve all workers must be performed in a globally consistent order to avoid deadlock. To solve this issue, Horovod introduces additional worker coordination logic to ensure all workers submit collective operations in a common order. The preexisting logic uses a master-worker coordination strategy in which a single coordinator rank is tasked with gathering requests from all workers, determining common requests across workers, forming responses for common requests, and then broadcasting an ordered list of responses to all workers for execution. Requests are objects submitted by each worker to request a collective operation, containing basic meta-data about the tensor involved in the operation (name, shape, datatype), as well as the type of collective operation desired (allreduce, allgather, or broadcast). Responses, which are associated with a given request, contain aggregated meta-data from all workers submitting a common request (for example, all displacements for an allgather operation and the set of ranks that submitted this request), and are used for the execution of the collective operation (see Figure 5a). This scheme is implemented using MPI collectives, in particular MPI_Gatherv and MPI_Bcast, on serialized representations of the various request and response objects.
This coordination process occurs at frequent regular intervals for the duration of training, where at each tic only common collective operation requests across workers are executed. While this coordination strategy works well up to moderate scales, its effectiveness breaks down once the node count is increased further. At these larger scales, the communication cost for this coordination strategy increases to non-negligible levels, resulting in severe degradation of scaling efficiency (green and blue lines in Figure 1).
To address this, a new lightweight coordination scheme was implemented in Horovod, replacing the master-worker strategy and related MPI collective communication with a global intersection of a bit vector, implemented using only a single MPI_Allreduce operation. One of the major overheads of the existing coordination strategy is that although identical collective operations are completed during every training iteration, requests for each operation are redundantly communicated to the coordinator rank in order to create new responses for execution. To avoid this, we implemented a caching scheme where the responses to execute collective operations are gathered and processed by the coordinator rank only once, with the broadcasted result of this process stored in a cache on every worker. On subsequent iterations, this cached response can be directly used by each worker, bypassing redundant communication of requests to the coordinator rank. Assuming the cache remains globally consistent, it also forms the basis for a simple global enumeration of the collective operations and leads naturally to a simple procedure for worker coordination. For a given set of requests across workers, the coordination process is as follows:
Each worker populates a bit vector, setting bits associated with its pending requests with bit positions determined from the cache.
The bit vectors are globally intersected using MPI_Allreduce with the binary MPI_BAND operation.
Each worker searches for set bits in the intersected bit vector and forms a list of associated cache entries. This list is the common set of collective operation requests for each worker to execute.
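The three steps above can be sketched with plain Python integers standing in for the bit vectors and a bitwise AND standing in for MPI_Allreduce with MPI_BAND (a toy model; real Horovod operates on serialized caches across MPI ranks):

```python
from functools import reduce

def intersect_requests(per_worker_pending, cache_size):
    """Each worker sets one bit per cached collective it is ready to run;
    a bitwise-AND reduction keeps only the globally-ready entries."""
    bitvecs = []
    for pending in per_worker_pending:
        v = 0
        for cache_bit in pending:  # bit position determined from the cache
            v |= 1 << cache_bit
        bitvecs.append(v)
    # Stand-in for MPI_Allreduce(..., op=MPI_BAND) across all ranks.
    common = reduce(lambda a, b: a & b, bitvecs)
    return [i for i in range(cache_size) if (common >> i) & 1]

# Worker 0 is ready for cached tensors {0, 1, 3}; worker 1 for {1, 2, 3}.
ready = intersect_requests([{0, 1, 3}, {1, 2, 3}], cache_size=4)  # -> [1, 3]
```

Only entries 1 and 3 are set on every worker, so only those collectives are executed this cycle; the rest wait for a later tic.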
A depiction of this improved coordination strategy can be seen in Figure 5b. This new coordination strategy greatly reduces communication overheads and resulted in significant improvements to scaling efficiency, shown in the black line in Figure 1.
As noted in the previous section, worker coordination in Horovod occurs at a fixed tic rate, referred to in Horovod as the cycle time (see blue vertical lines in Figure 2). This cycle time is user configurable at run-time via an environment variable. This tic rate controls how often worker coordination occurs and pending collective requests are processed and executed. One of the major features of Horovod is the ability to fuse individual collective operations into single operations on larger message buffers for better network performance. Notably, the scope of this fusion is limited to the requests that are encountered during a single coordination cycle. This leads to a coupling between the cycle time and collective message sizes, where in any given iteration, a shorter cycle time will lead to a more responsive execution of many collective operations with small message sizes, while a larger cycle time will lead to a slower execution of fewer collective operations with larger message sizes. This leads to a tuning dilemma: for low-latency execution of collective operations, the cycle time should be reduced as much as possible; however, for efficient network utilization, the minimum message sizes cannot be too small. Due to this, it is challenging to find an optimal cycle time that effectively balances these requirements and achieves good scaling performance.
To weaken the coupling between the cycle time and message sizes, we implemented an additional feature in Horovod that enables explicit assignment of collective operations into groups. When using this feature, rather than executing all collective operation requests encountered during a given cycle, only requests forming a complete group are fused and executed. If multiple complete groups are encountered, they are fused together into larger messages. By enforcing a lower bound on fusion to complete groups only, a minimum message size independent of the cycle time is enforced. This enables the use of a lower cycle time for low-latency execution with a constant minimum message size, maintaining efficient network utilization. Usage of this feature in tandem with the lightweight bitvector coordination scheme described previously yielded the red performance curve in Figure 1, a significant improvement in scaling behavior.
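The "complete groups only" fusion rule can be illustrated with a small sketch (hypothetical tensor names and a set-based group test; the actual Horovod implementation works on request objects and fusion buffers):

```python
def fuse_complete_groups(pending, groups):
    """Fuse only those requests whose entire assigned group is pending,
    enforcing a minimum message size independent of the cycle time."""
    ready = []
    for group in groups:
        if group <= pending:           # every tensor in the group is ready
            ready.extend(sorted(group))  # complete groups fuse together
    return ready

groups = [{"g0", "g1"}, {"g2", "g3"}]
# Only g0, g1, g2 arrived this cycle: the second group is incomplete,
# so g2 waits for g3 rather than being sent as a small message.
fused = fuse_complete_groups({"g0", "g1", "g2"}, groups)  # -> ["g0", "g1"]
```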
A strong indicator of the efficiency of an application on a supercomputer is its measured power consumption. In particular, the use of blocking collectives such as Allreduce causes all operations executed on a GPU/CPU to cease until the results from the collectives are returned. For instance, in a case where the reduction of gradients stalls due to overheads introduced by an inefficient coordination strategy, this stalling would be reflected in the GPU power consumption via a cyclic increase and decrease in the power as a function of application run-time or, equivalently in our case, the training steps.
In Figure 7, we present the measured power consumption of the main hardware components on Summit during a distributed training run using Bitvector Allreduce and Grouping. The DNN model used in that training run, and throughout the rest of the presented results, is a modified version of the fully-convolutional dense neural network (FC-DenseNet) Jegou et al. (2017) with 220 million parameters. This choice of model produces a message size large enough to ensure that our experiments test the robustness of the new gradient reduction strategies. The distributed training run shown in Figure 7 was carried out on 4,600 of the 4,608 available Summit nodes and allows us to directly measure the efficiency of our application as a whole. We found that energy metrics, collected on time scales similar to the duration of a training step, show that our application's power usage is nearly constant, due to the absence of power usage fluctuations caused by GPU idleness in the presence of communication overheads.
In addition to power consumption, we also profiled the compute performance of distributed training with the new gradient reduction strategies. All of our reported performance measurements include: (1) I/O (reading of data and writing of model checkpoints), (2) computation performed for the DNN forward and backward propagation, and (3) communication operations embedded in the computation graph.
We measure the single-GPU performance of our code using two distinct methods. First, we use an analytical calculation of the mathematical operations performed by DNN convolution layers assuming direct convolution. We then augment that with tracing of calls to cuDNN during execution of TensorFlow's computation graph to eliminate any errors that arise from the availability of multiple numerical implementations of the convolution operation in cuDNN (e.g. FFT vs. Winograd vs. direct convolution) Chetlur et al. (2014). The computational complexity of these algorithms can vary substantially, and TensorFlow makes runtime decisions regarding which algorithm to use for each operation in the graph. As shown in Appendix A (Table 2), our DNN implementation exclusively uses algorithms with a direct convolution implementation, for which the number of multiply-add operations for a direct (2-D) convolution is given by:
ops = 2 x H x W x C_in x C_out x K_h x K_w,

where H and W are the height and width dimensions of the inputs, C_in and C_out are the number of input and output channels respectively, K_h and K_w are the convolution kernel dimensions, and the factor of 2 accounts for "multiply" and "add" operations.
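This count is straightforward to compute; a minimal helper (illustrative names, assuming stride 1 and "same" padding so the output spatial size equals the input's):

```python
def conv2d_flops(h, w, c_in, c_out, k_h, k_w):
    """Multiply-add count for a direct 2-D convolution: each of the
    h*w output positions, for each of c_out output channels, performs
    c_in*k_h*k_w multiplies and adds (factor of 2 counts both)."""
    return 2 * h * w * c_in * c_out * k_h * k_w

# e.g. a 3x3 convolution on a 64x64 feature map with 256 -> 256 channels:
ops = conv2d_flops(64, 64, 256, 256, 3, 3)  # ~4.8 GFLOP per forward pass
```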
The execution time of the TensorFlow graph is obtained through the use of Python's built-in time module as well as a GPU hardware trace with CUPTI. The CUPTI trace provides the runtime of every operation individually for a single training step, whereas the application-level timing has sufficiently low overhead to be used throughout a training run. We denote the application time spent in I/O and memory copies between the host and the device as t_I/O; t_comm and t_comp are the times spent on communication and computation, respectively.
The two performance numbers we report, namely sustained and peak, are then given by

FLOPS_sustained = N_ops / (t_comp + t_comm + t_I/O),   FLOPS_peak = N_ops / t_comp,

where N_ops is 3 times the multiply-add count of the forward convolutions, the factor of 3 accounting for forward convolutions (Conv2D_FWD), gradient backpropagation with respect to the convolution kernels (Conv2D_BackpropKernel), and gradient backpropagation with respect to the inputs (Conv2D_BackpropInput).
Performance measurements on multiple nodes are carried out in a similar fashion, with the minor addition of averaging the timings across all MPI ranks. The sustained performance reported at each node count is averaged across a distributed training run lasting 1,000 steps, and the variance is reported as error bars. While our definition of the peak performance at a single node does not account for communication and I/O time, when we report its value on multiple nodes (see below), we multiply it by the measured scaling efficiency (0.93 for 4,600 Summit nodes). This scaling is performed to accurately reflect the synchronous execution of our application.
In Table 1, we summarize the math operations, their timing, and the overall performance during the execution of our application (one training step) on a single Summit node, using the performance measurement methodology described in the previous section. We also account for the speed-up in execution enabled by the hardware implementation of half-precision intrinsics in the V100's Tensor Cores. This is done by making use of TensorFlow's TF_DISABLE_CUDNN_TENSOR_OP_MATH environment variable. We find that execution with Tensor Cores produces a marked average speed-up of the computation times of the convolution operations relative to execution without them (Table 1).
During DNN training, we attain a sustained (peak) performance of 59.67 (83.92) TFLOPS per GPU, representing 49.7% (70%) of the theoretical peak of a V100 (120 TFLOPS), which, to our knowledge, exceeds the single-GPU performance of all other DNN trained on the same system to date.
Finally, using the communication strategies described in Section 2.3, we are able to achieve a scaling efficiency of 0.93 at 4,600 nodes during distributed deep learning (Figure 8) and reach a sustained (peak) performance of 1.54(2) (2.15(2)) EFLOPS. Both our scaling efficiency and sustained performance improve significantly on the record established by the 2018 ACM Gordon Bell prize winner Kurth et al. (2018). Note that in the results reported in Kurth et al. (2018), synchronized gradient updates were skipped every other training step, which introduces a level of asynchronous execution and reduces their communication overhead (at the expense of gradient staleness). Our reported performance comes from fully-synchronous training, making the two results not directly comparable.
[Table 1: per-operation breakdown listing operation name, type, CUPTI timing (ms) without and with Tensor Core math, analytical float16 operation counts, and the cuDNN algorithm used per GPU; the totals row reports the total math operations and a sustained per-GPU performance of 59.67 TFLOPS.]
In a general inverse problem in imaging, we seek to reconstruct an image x in X from a set of measurements y in Y (typically also given by an image), where X and Y are (Banach) spaces. The forward operator, A, defined by A: X -> Y, maps the space of solutions to the space of measurements. The goal of any reconstruction method is to find x by solving

argmin_x || A(x) - y ||_p + lambda R(x),

where || . ||_p denotes the p-norm (typically, p = 2), lambda >= 0 is a parameter, and R is a regularization function to incorporate a priori knowledge that the solution ought to obey.
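For intuition, when the forward operator is linear and R is the squared 2-norm, the minimizer has a closed form (Tikhonov regularization). A small NumPy sketch with a random linear A standing in for the nonlinear physics of the actual problem:

```python
import numpy as np

def solve_tikhonov(A, y, lam):
    """Closed-form minimizer of ||A x - y||_2^2 + lam * ||x||_2^2
    for a linear forward operator A: solves the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 4))       # toy linear forward operator
x_true = np.array([1.0, -2.0, 0.5, 3.0])
y = A @ x_true                        # noiseless measurements
x_hat = solve_tikhonov(A, y, lam=1e-8)  # recovers x_true to high accuracy
```

The scientific problem below is far harder: A is the nonlinear Schrödinger-equation forward model and the measured y discards phase information, which is precisely why a learned inverse is attractive.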
In our inverse problem of interest, illustrated in Figure 9, x represents the local electron density of a material, y is a diffraction pattern, and A is the celebrated Schrödinger equation of quantum mechanics. The central difficulty of the above inverse problem lies almost entirely in the fact that, experimentally, one can only measure image intensities (i.e. diffraction patterns) of the exiting probe electrons and not the full complex-valued wavefunction needed to find x from y. Consequently, half of the information needed to directly invert the forward model is always missing, a problem known as the phase problem Born and Wolf (2013).
Deep Neural Networks are notoriously data-hungry. To simulate enough training and test data from the forward model in optimal time, we developed a custom multi-GPU, multi-node electron scattering simulation code called NAMSA, which implements the multi-slice algorithm (MSA) Cowley and Moodie (1957), a nearly-exact solution of the fast-electron Schrödinger equation Kirkland (2010).
Our simulation workflow is shown in Figure 10A and consists of the following: a material supercell is built, followed by a simulation of the probe electron wavefunction interacting with and propagating through all atomic planes of the supercell to produce the intensity of the exit wavefunction. This procedure is performed at each position on a 2-D grid (32x32) defined at the surface of the supercell. The stack of exit-wavefunction intensities represents the input to our DNN, while the target output of the DNN is the 2-D projected electron density. The projected electron density is computed, after the scattering simulation, by integrating the electron density along the thickness of the supercell (the z-axis).
Our simulations span over 60,000 solid-state materials crystal structure files accessible via the Materials Project database Ong et al. (2013). For each material structure, multiple crystallographic orientations were simulated, as they produce markedly different pico-diffraction patterns and projected electron densities. In total, 400,000 configuration files were generated and then partitioned into a 90/5/5 split for training, development, and test data sets.
Simulations of training and test data sets were generated on-the-fly and stored on the node-local burst buffer. Given our highly-optimized simulation code NAMSA, we found it to be more time-effective to generate the data immediately before the start of DNN training than to stage upwards of 500 TB of data (to 4,600 nodes) via the global parallel filesystem, a shared resource accessible to all users. Typically, a simulation with 0.5 hours of wall-time generates about a 200 GB data set per compute node. Note that the number of unique samples the DNN model trains on grows linearly with the number of GPUs used during distributed training. The entire complement of 360,000 training configuration files is only used when distributed training reaches 4,600 nodes. All data I/O (file-saving during simulation, DNN model checkpointing, and data reading during DNN training/testing) was carried out via the burst buffer and used LMDB (in addition to Python's built-in multiprocessing module during the reading stage).
Encoder-decoder networks are prevalent in computer vision tasks such as segmentation and denoising Badrinarayanan et al. (2017). This style of DNN architecture learns an encoding of a multidimensional input into a compressed latent representation, followed by learning a reconstruction of a multidimensional output from that encoding along the decoder path Vincent et al. (2008). Encoder-decoder architectures have many variations; our work adapts a fully-convolutional dense neural network (FC-DenseNet) Jegou et al. (2017), shown in Figure 10B. The two main modifications we introduce in our model are: (1) large growth rates (256) for the number of channels of the 2-D convolution layers, and (2) replacing max pooling with average pooling. The former modification is needed to give the model enough capacity to represent our input with its 1024 channels; a smaller number of channels in the first few 2-D convolutional layers would decimate most of the information encoded in the input. The latter modification was found in earlier work to produce substantially more accurate DNN models on atomically-resolved imaging Vasudevan et al. (2018), due to the inherent sparsity of these images. The output of each dense block was passed through a rectifier non-linearity (ReLU) to compute the activation, followed by a dropout layer. In total, our DNN model has 220 million weights (free parameters).
We trained our DNN to learn a reconstruction of the (projected) electron density, $\rho$, by minimizing the loss function

$$\mathcal{L}(W) = L_{\delta}(\rho, \hat{\rho}) + \lambda \lVert W \rVert_2^2$$

where $L_{\delta}$ is the Huber loss evaluated on the true and predicted electron densities, $\rho$ and $\hat{\rho}$, respectively. We use an $\ell_2$-based regularization, $\lVert W \rVert_2^2$, on the weight values $W$ of the model with (weight-decay) coefficient $\lambda$. The Huber loss "cutoff" $\delta$ was initialized and then decreased during training using an exponential decay policy (decay rate of 0.99 every data epoch).
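A numerical sketch of this objective follows. The symbol names and constants here (the initial cutoff of 1.0 in particular) are placeholders, since the paper's specific values are not given in this excerpt.

```python
import numpy as np

def huber(y_true, y_pred, delta):
    """Mean Huber loss: quadratic below the cutoff delta, linear above it."""
    r = np.abs(y_true - y_pred)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)).mean()

def total_loss(y_true, y_pred, weights, delta, weight_decay):
    """Huber reconstruction loss plus l2 weight regularization."""
    l2 = sum((w**2).sum() for w in weights)
    return huber(y_true, y_pred, delta) + weight_decay * l2

# Exponential cutoff decay: delta shrinks by a factor of 0.99 each data epoch.
delta0 = 1.0  # placeholder initial value
deltas = [delta0 * 0.99**epoch for epoch in range(3)]
print(deltas)  # delta after epochs 0, 1, 2
```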
Due to the large DNN model and input sizes, the 16 GB memory of a V100 can only accommodate a training batch size of 1 per GPU, even in a float16 implementation. The global batch size, however, increases linearly with the scale of distributed training, reaching 27,600 at 4600 nodes. It is well established that large batch sizes adversely affect the learning dynamics of DNNs trained with stochastic gradient descent. To mitigate such effects, we used a layer-wise adaptive learning rate scaling strategy (LARS), which computes a layer-wise weight update based on the $\ell_2$-norm of the gradients You et al. (2017). We used LARS in conjunction with an adaptive stochastic gradient descent optimizer (Adam) and a staircase learning rate decay policy. Furthermore, a warm-up policy was used to linearly increase the learning rate from an initial value of 0.0001 to a maximum value that scales with N, the number of GPUs (MPI ranks) participating in the distributed training of the DNN.
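The two learning rate mechanisms can be sketched as follows. The trust coefficient and the toy tensors are our own placeholders, not the values used in the paper; the LARS rule shown is the standard one from You et al. (2017), which scales each layer's rate by the ratio of its weight norm to its gradient norm.

```python
import numpy as np

def lars_lr(weights, grads, base_lr, trust=0.001, eps=1e-9):
    """Layer-wise rate: base_lr scaled by trust * ||w|| / ||g|| for one layer."""
    return base_lr * trust * np.linalg.norm(weights) / (np.linalg.norm(grads) + eps)

def warmup_lr(step, warmup_steps, lr_init, lr_max):
    """Linear warm-up from lr_init to lr_max over warmup_steps."""
    if step >= warmup_steps:
        return lr_max
    return lr_init + (lr_max - lr_init) * step / warmup_steps

w = np.ones(100)       # ||w|| = 10
g = np.full(100, 0.1)  # ||g|| = 1
print(lars_lr(w, g, base_lr=1.0))     # layer-wise scaled rate, ~0.01
print(warmup_lr(50, 100, 1e-4, 0.1))  # halfway through the warm-up ramp
```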
Mixed-precision training has been shown to produce convergence behavior and accuracy similar to training in pure single-precision across many DL applications Child et al. (2019); You et al. (2017), as long as judicious numerical scaling strategies are applied. Here, we performed adaptive scaling of the loss before gradient computation (and application of LARS) to avoid numerical values outside of the dynamic range of float16, using the loss scaling strategies implemented in OpenSeq2Seq Kuchaiev et al. (2018). All of our deep learning code was implemented using the TensorFlow (v1.13) framework Abadi et al. (2016).
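Adaptive loss scaling can be sketched as a backoff scheme: the loss is multiplied by a scale before backpropagation, the scale is halved whenever an overflow (inf/NaN gradient) is detected, and it is doubled after a window of overflow-free steps. This is a generic mixed-precision recipe, not the exact OpenSeq2Seq implementation, and the constants are illustrative.

```python
class LossScaler:
    """Backoff-style dynamic loss scaling for float16 training (illustrative)."""

    def __init__(self, scale=2.0**15, window=2000):
        self.scale = scale        # multiplier applied to the loss
        self.window = window      # overflow-free steps before growing the scale
        self.good_steps = 0

    def update(self, overflow):
        """Call once per step with whether an inf/NaN gradient was observed."""
        if overflow:
            self.scale = max(self.scale / 2.0, 1.0)  # back off, skip the step
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.window:
                self.scale *= 2.0                    # grow back cautiously
                self.good_steps = 0

scaler = LossScaler(scale=1024.0, window=2)
scaler.update(overflow=True)   # halves the scale to 512.0
scaler.update(overflow=False)
scaler.update(overflow=False)  # window reached: doubles back to 1024.0
print(scaler.scale)  # -> 1024.0
```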
We carried out multiple distributed training runs extending to 2,000 training steps. In each run, the DNN was initialized randomly and trained using the optimization strategies described in the preceding sections. We found that the training error converges reasonably well, as shown in Figure 11, for runs spanning 128 through 4096 nodes. These observations indicate that the learning strategies employed were effective in enabling good training behavior irrespective of computational scale, or equivalently batch size.
In typical data-parallelism work, the total size of the training data set, given by the number of training samples, is fixed regardless of the number of DNN model replicas, or equivalently the number of MPI ranks used in distributed training. In our application, however, the total number of unique data samples the DNN encounters during each training run grows linearly with the number of GPUs used (as discussed in section 6.2). This linear growth in the training data set size is necessary given the finite capacity of the node-local storage, which can accommodate less than 1% of the total data, and the massive performance hits our application would incur if I/O were performed directly from the larger-capacity global file system (see subsection 2.2).
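The scaling of the data set with the allocation can be made concrete with a small helper; `samples_per_node` here is a hypothetical shard size chosen for illustration, not the paper's actual figure.

```python
def unique_samples(nodes, samples_per_node=78):
    """Unique training samples seen grows linearly with the nodes (GPUs) used.

    Each node generates and reads its own burst-buffer shard, so doubling
    the allocation doubles the unique data encountered during training.
    """
    return nodes * samples_per_node

print(unique_samples(128), unique_samples(4600))  # linear growth with scale
```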
The increase in the predictive efficacy of machine learning, and deep learning in particular, as a function of growth in data is well documented Halevy et al. (2009); Sun et al. (2017). As our data size grows with the number of MPI ranks used, we expect the quality of the DNN reconstruction on an unseen sample drawn from the test data to improve. We show one such example in Figure 11. We find that the reconstruction of the projected electron density is visibly far closer to the ground truth for a model trained on 1024 nodes than for one trained on 128 nodes. Both DNN models, however, fail to faithfully reconstruct the true electron density of this material across the entire field of view of the image. In the case of the DNN trained on 1024 nodes, it is plausible that its reconstruction capabilities would improve with additional training and hyper-parameter tuning.
We also report the reconstruction error evaluated on the entire test data set for models trained on 128, 1024, and 4096 nodes (see inset in Figure 11). We find that this test error, averaged over all test samples, decreases as the number of compute nodes (and the data set size) increases, indicating improving reconstruction quality on materials configurations unseen during training.
We have shown that by introducing new coordination strategies during gradient reductions, we exceed the state of the art in scaling efficiency. This opens up opportunities to exploit the different levels of parallelism present in many systems such as Summit (e.g., intra-node vs. inter-node) to train even larger models than we do here, for instance via a combination of model- and data-parallelism. In addition, the scaling efficiency results clearly indicate that carefully chosen synchronous gradient reduction strategies yield greater utilization of the interconnect network.
Regarding our application, the promising results shown here are the first use of DNNs to solve the phase problem in the atomic imaging of materials. Future research directions could target improving the reconstruction baseline achieved here and extending the DNN-based reconstruction approach to the full 3-D electron density. Higher-dimensional reconstructions would require GPU-memory-intensive 3-D convolution layers, presenting an additional opportunity to further benchmark the effectiveness of the coordination strategies introduced here, as well as to extend our gradient reduction strategies to model-parallelism.
In light of the ever-increasing data streams emanating from large scale scientific user facilities, we believe this is an opportune time to harness state of the art supercomputing and machine learning. The impact of exascale machine learning on accelerating scientific progress could be, in due time, of comparable magnitude to the advances made possible via large scale physics-based simulations currently enabled by high-performance computing.
This research was funded by a Lab Directed Research and Development project at Oak Ridge National Laboratory, a U.S. Department of Energy facility managed by UT-Battelle, LLC. An award of computer time was provided by the INCITE program. This research also used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725.
This manuscript has been authored by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes. The Department of Energy will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).
Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. (2008). Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103.
Table: cuDNN function/algorithm, number of calls, corresponding DNN operation, and Tensor Cores implementation.