1 Introduction
Clock speeds of the central processing units (CPUs) on the desktop and laptop computers hit the physical limit more than a decade ago, and there will be likely no major breakthrough until quantum computing becomes practical. Instead, the increase in computing power is now accomplished by using multiple cores within a processor chip. Highperformance computing (HPC) means computations that are so large that their requirement on storage, main memory, and raw computational speed cannot be met by a single (desktop) computer (hager2010introduction). Modern HPC machines are equipped with more than one CPU that can work on the same problem (eijkhout2013introduction)
. Often, specialpurpose coprocessors such as graphical processing units (GPUs) are attached to the CPU for orderofmagnitudes of acceleration for some tasks. A GPU can be thought of a massively parallel matrixvector multiplier and vector transformer on a data stream. With the needs of analyzing terabyte or even petabytescale data common, the success of largescale statistical computing heavily relies on how to engage HPC in the statistical practice.
About a decade ago, the second author discussed the potential of GPUs in statistical computing. zhou2010graphics predicted that “GPUs will fundamentally alter the landscape of computational statistics.” Yet, it does not appear that GPU computing, or HPC in general, has completely smeared into the statistical community. Part of the reasons for this may be attributed to the fear that parallel and distributed code is difficult to program, especially in R (rcoreteam)
, “the” programming language of statisticians. On the other hand, the landscape of scientific computing in general, including socalled data science
(donoho2017), has indeed substantially changed. Many highlevel programming languages, e.g., Python (rossum1995python) and Julia (bezanson2017julia), support parallel computing by design or through standard libraries. Accordingly, many software tools have been developed in order to ease programming in and managing HPC environments. Last but not least, cloud computing (fox2011cloud) is getting rid of the necessity for purchasing expensive supercomputers and scales computation as needed.Concurrently, easily parallelizable algorithms for fitting statistical models with hundreds of thousand parameters have also seen significant advances. Traditional NewtonRaphson or quasiNewton type of algorithms face two major challenges in contemporary problems: 1) explosion of dimensionality renders storing and inversion of Hessian matrices prohibitive; 2) regularization of model complexity is almost essential in highdimensional settings, which is often realized by nondifferentiable penalties; this leads to highdimensional, nonsmooth optimization problems. For these reasons, nonsmooth firstorder methods have been extensively studied during the past decade (beck2017first). For relatively simple, decomposable penalties (negahban2012unified), the proximal gradient method (Beck:SiamJournalOnImagingSciences:2009, combettes2011proximal, parikh2014proximal, polson2015proximal) produces a family of easily parallelizable algorithms. For the prominent example of the Lasso (tibshirani1996regression), this method contrasts to the highly efficient sequential coordinate descent method of friedman2010regularization and the smooth approximation approaches, e.g., hunter2005variable. Decomposability or separability of variables is often the key to parallel and distributed algorithms. The popular alternating direction method of multipliers (ADMM, gabay1976dual, glowinski1975approximation, Boyd:FoundationsAndTrendsInMachineLearning:2010) achieves this goal through variable splitting, while often resulting in nontrivial subproblems to solve. As an alternative, the primaldual hybrid gradient (PDHG) algorithm (zhu2008efficient, Esser:SiamJournalOnImagingSciences:2010, chambolle2011first, Condat:JournalOfOptimizationTheoryAndApplications:2012, Vu2013) has a very low periteration complexity, useful for complex penalties such as the generalized lasso (Tibshirani:TheAnnalsOfStatistics:2011, ko2019easily, ko2019optimal). Another route toward separability is through the MM principle (lange2000optimization, hunter2004tutorial, lange2016mm), which has been explored in zhou2010graphics. In fact, the proximal gradient method can be viewed as a realization of the MM principle. Recent developments in the application of this principle include distance majorization (chi2014distance) and proximal distance algorithms (keys2016proximal).
The goal of this paper is to review the advances in parallel and distributed computing environments during the last decade and demonstrate how easy it has become to write a code for largescale, highdimensional statistical models and run it on various distributed environments. In order to make the contrast clear, we deliberately take examples from
zhou2010graphics, namely nonnegative matrix factorization (NMF), positron emission tomography (PET), and multidimensional scaling (MDS). The difference lies in the scale of the examples: our experiments deal with data of size at least and as large as for dense data, and for sparse data. This contrasts with the size of at best of zhou2010graphics. This level of scaling is possible because the use of multiple GPUs in a distributed fashion has become handy, as opposed to the single GPU, CUDA C implementation of 2010. Furthermore, using the power of cloud computing and modern deep learning software, we show that exactly the samecode can run on multiple CPU cores and/or clusters of workstations. Thus we bust the common misconception that deep learning software is dedicated to neural networks and heuristic model fitting. Wherever possible, we apply more recent algorithms in order to cope with the scale of the problems. In addition, a new example of largescale proportional hazards regression model is investigated. We demonstrate the potential of our approach through a single multivariate Cox regression model regularized by the
penalty on the UK Biobank genomics data (with 200,000 subjects), featuring timetoonset of Type 2 Diabetes (T2D) as outcome and 500,000 genomic loci harboring single nucleotide polymorphisms as covariates. To our knowledge, such a largescale joint genomewide association analysis has not been attempted. The reported Cox regression model retains a large proportion of bona fidegenomic loci associated with T2D and recovers many loci near genes involved in insulin resistance and inflammation, which may have been missed in conventional univariate analysis with moderate statistical significance values.
The rest of this article is organized as follows. We review HPC systems and see how they have become easy to use in Section 2. In Section 3, we review software libraries employing the ‘‘write once, run everywhere’’ principle (especially deep learning software), and discuss how they can be employed for fitting highdimensional statistical models on the HPC systems of Section 2. In Section 4, we review modern scalable optimization techniques that suit well to the HPC environment. We present how to distribute a large matrix over multiple devices in Section 5, and numerical examples of NMF, PET, MDS, and regularized Cox regression in Section 6. Finally, we conclude the article in Section 7.
2 Accessible HighPerformance Computing Systems
2.1 Preliminaries
Since modern HPC relies on parallel computing, in this section we review several concepts from parallel computing literature at a level minimally necessary for the subsequent discussions. Further details can be found in nakano2012parallel, eijkhout2013introduction.
Data parallelism.
While parallelism can appear at various levels such as instructionlevel and tasklevel, what is most relevant to statistical computing is datalevel parallelism or data parallelism. If data can be split into several chunks that can be processed independently of each other, then we say there is data parallelism in the problem. Many operations such as scalar multiplication of a vector, matrixvector multiplication, and summation of all elements in a vector can exploit data parallelism using parallel architectures discussed shortly.
Memory models.
In any computing system, processors (CPUs or GPUs) need to access data residing in the memory. While physical computer memory uses complex hierarchies (L1, L2, and L3 caches; bus and networkconnected, etc.), systems employ abstraction to provide programmers an appearance of transparent memory access. Such logical memory models can be categorized into the shared memory model and the distributed memory model. In the shared memory model, all processors share the address space of the system’s memory even if it is physically distributed. For example, if two processors refer to a variable , that means the variable is stored in the same memory address; if a processor alters the variable, then the other processor is affected by the changed value. Modern CPUs that have several cores within a processor chip fall into this category. On the other hand, in the distributed memory model, the system has memory both physically and logically distributed. Processors have their own memory address spaces, and cannot see each other’s memory directly. If two processors refer to a variable , then there are two separate memory locations, each of which belongs to each processor under the same name. Hence the memory does appear distributed to programmers, and the only way processors can exchange information with each other is through passing data through some explicit communication mechanism. The advantage at the cost of this complication is scalability  the number of processors that can work in a tightly coupled fashion is much greater in distributed memory systems (say 100,000) than shared memory systems (say four). Hybrids of the two memory models are also possible. A typical computer cluster consists of multiple nodes interconnected in a variety of network topology. A node is a workstation that can run standalone, with its main memory shared by several processors installed on the motherboard. Hence within a node, it is a shared memory system, whereas across the nodes the cluster is a distributed memory system.
Parallel programming models.
For sharedmemory systems, programming models based on threads are most popular. A thread is a stream of machine language instructions that can be created and run in parallel during the execution of a single program. OpenMP is a widely used extension to the C and Fortran programming languages based on threads. It achieves data parallelism by letting the compiler know what part of the sequential program is parallelizable by creating multiple threads. Simply put, each processor core can run a thread operating on a different partition of the data. In distributedmemory systems, parallelism is difficult to achieve via a simple modification of sequential code like by using OpenMP. The programmer needs to coordinate communications between processors not sharing memory. A de facto standard for such processortoprocessor communication is the message passing interface (MPI). MPI routines mainly consist of pointtopoint communication calls that send and receive data between two processors, and collective communication calls that all processors in a group participate in. Typical collective communication calls include

Scatter: one processor has data as an array, and each other processor receives a partition of the array;

Gather: one processor collects data from all other processors to construct an array;

Broadcast: one processor sends its data to all other devices;

Reduce: gather data and produce a combined output based on an associative binary operator, such as sum or maximum of all the elements.
Parallel architectures.
To realize the above models, a computer architecture that allows simultaneous execution of multiple machine language instructions is required. A single instruction, multiple data (SIMD) architecture has multiple processors that execute the same instruction on different parts of the data. The GPU falls into this category of architectures, as its massive number of cores can run a large number of threads that share memory. A multiple instruction, multiple data (MIMD), or single program, multiple data (SPMD) architecture has multiple CPUs that execute independent parts of program instructions on their own data partition. Most computer clusters fall into this category.
2.2 Multiple CPU nodes: clusters, supercomputers, and clouds
Computing on multiple nodes can be utilized in many different scales. For midsized data, one may build his/her own cluster with a few nodes. This requires determining the topology and purchasing all the required hardware, along with resources to maintain it. This is certainly not familiar to virtually all statisticians. Another option may be using a wellmaintained supercomputer in a nearby HPC center. A user can take advantage of the facility with up to hundreds of thousand cores. The computing jobs on these facilities are often controlled by a job scheduler, such as Sun Grid Engine (sge), Slurm (slurm), Torque (torque), etc. However, access to supercomputers is almost always limited. (Can you name a ‘‘nearby’’ HPC center from your work? If so, how can you submit your job request? What is the cost?) Even when the user has access to them, he/she often has to wait in a very long queue until the requested computation job is started by the scheduler.
In recent years, cloud computing has emerged as a third option. It refers to both the applications delivered as services over the Internet and the hardware and systems software in the data centers that provide those services (Armbrust:2010). Big information technology companies such as Amazon, Microsoft, and Google lend their practically infinite computing resources to users on demand by wrapping the resources as ‘‘virtual machines’’, which are charged per CPU hours and storage. Users basically pay utility bills for their use of computing resources. An important implication of this infrastructure to endusers is that the cost of using 1000 virtual machines for one hour is almost the same as using a single virtual machine for 1000 hours. Therefore a user can build his/her own virtual cluster ‘‘on the fly,’’ increasing the size of the cluster as the size of the problem to solve grows. A catch here is that a cluster does not necessarily possess the power of HPC as suggested in Section 2.1: a requirement for high performance is that all the machines should run in tight lockstep when working on a problem (fox2011cloud). However, early cloud services were more focused on web applications that do not involve frequent data transmissions between computing instances, and less optimized for HPC, yielding discouraging results (evangelinos2008cloud, walker2008benchmarking).
Eventually, many improvements have been made at hardware and software levels to make HPC on clouds feasible. At hardware level, cloud service providers now support CPU instances such as c4, c5, and c5n instances of Amazon Web Services (AWS), with up to 48 physical cores of higher clock speed of up to 3.4 GHz along with support for accelerated SIMD computation. If network bandwidth is critical, the user may choose instances with faster networking (such as c5n instances in AWS), allowing up to 100 Gbps of network bandwidth. At the software level, these providers support tools that manage resources efficiently for scientific computing applications, such as ParallelCluster (parallelcluster) and ElastiCluster (elasticluster). These tools are designed to run programs in clouds in a similar manner to proprietary clusters through a job scheduler. In contrast to a physical cluster in an HPC center, a virtual cluster on a cloud is exclusively created for the user; there is no need for waiting in a long queue. Accordingly, over 10 percent of all HPC jobs are running in clouds, and over 70 percent of HPC centers run some jobs in a cloud as of June 2019; the latter is up from just 13 percent in 2011 (hyperion:2019).
In short, cloud computing is now a costeffective option for statisticians who are in demand for high performance, with not so a steep learning curve.
2.3 MultiGPU node
In some cases, HPC is achieved by installing multiple GPUs on a single node. Over the past two decades, GPUs have gained a sizable amount of popularity among scientists. GPUs were originally designed to aid CPUs in rendering graphics for video games quickly. A key feature of GPUs is their ability to apply a mapping to a large array of floatingpoint numbers simultaneously. The mapping (called a kernel) can be programmed by the user. This feature is enabled by integrating a massive number of simple compute cores in a single processor chip, realizing the SIMD architecture. While this architecture of GPUs was created in need of generating a large number of pixels in a limited time due to the frame rate constraint of highquality video games, the programmability and high throughput soon gained attention from the scientific computing community. Matrixvector multiplication and elementwise nonlinear transformation of a vector can be computed several orders of magnitudes faster on GPU than on CPU. Early applications of generalpurpose GPU programming include physics simulations, signal processing, and geometric computing (owens2007survey). Technologically savvy statisticians demonstrated its potential in Bayesian simulation (suchard2010some, suchard2010understanding) and highdimensional optimization (zhou2010graphics, yu2015high)
. Over time, the number of cores has increased from 240 (Nvidia GTX 285, early 2009) to 4608 (Nvidia Titan RTX, late 2018) and more local memory  separated from CPU’s main memory  has been added (from 1GB of GTX 285 to 24GB for Titan RTX). GPUs could only use singleprecision for their floatingpoint operations, but they now support double and halfprecisions. More sophisticated operations such as tensor operations are also supported. Highend GPUs are now being designed specifically for scientific computing purposes, sometimes with faulttolerance features such as error correction.
A major drawback of GPUs for statistical computing is that GPUs have a smaller memory compared to CPU, and it is slow to transfer data between them. Using multiple GPUs can be a cure: recent GPUs can be installed on a single node and communicate with each other without the meddling of CPU; this effectively increases the local memory of a collection of GPUs. (lee2017large explored this possibility in imagebased regression.) It is relatively inexpensive to construct a node with 48 desktop GPUs compared to a cluster of CPU nodes with a similar computing power (if the main computing tasks are well suited for the SIMD model), and the gain is much larger for the cost. Linear algebra operations that frequently occur in highdimensional optimization is a good example.
Programming environments for GPU computing have been notoriously hostile to programmers for a long time. The major sophistication is that a programmer needs to write two suits of code, the host code that runs on a CPU and kernel functions that run on GPU(s). Data transfer between CPU and GPU(s) also has to be taken care of. Moreover, kernel functions need to be written in special extensions of C, C++, or Fortran, e.g., CUDA (nvidia2007compute) or OpenCL (munshi2009opencl). Combinations of these technical barriers made casual programmers, e.g., statisticians, to keep away from writing GPU code despite its computational gains. There were efforts to sugarcoat these hostile environments with a highlevel language such as R (buckner2009gputools) or Python (tieleman2010gnumpy, klockner2012pycuda, lam2015numba), but these attempts struggled to garner big enough user base to maintain the community in general. The functionalities were often limited and inherently hard to extend.
Fortunately, GPU programming environments have been revolutionized since deep learning (lecun2015deep)
brought sensation in many machine learning applications. Deep learning is almost synonymous to deep neural networks, which refer to a repeated (‘‘layered’’) application of an affine transformation of the input followed by identical elementwise transformations through a nonlinear link function, or ‘‘activation function.’’ Fitting a deep learning model is almost always conducted via (approximate) minimization of the specified loss function through a clever application of the chain rule to the gradient descent method, called ‘‘backpropagation’’
(rumelhart1988learning). These computational features fit well to the SIMD architecture of GPUs, use of which dramatically reduces the training time of this highly overparameterized family of models with a huge amount of training data (raina2009large). Consequently, many efforts had been made to ease GPU programming for deep learning, resulting in easytouse software libraries. Since the sizes of neural networks get ever larger, more HPC capabilities, e.g., support for multiple GPUs and CPU clusters, have been developed. As we review in the next section, programming with those libraries gets rid of many hassles with GPUs, close to the level of conventional programming.Readers might ask: why should statisticians care about deep learning software? As cheng1994neural
pointed out 25 years ago, ‘‘neural networks provide a representational framework for familiar statistical constructs,’’ and ‘‘statistical techniques are sometimes implementable using neuralnetwork technology.’’ For example, linear regression is just a simple neural network with a single layer and linear activation functions. Many more sophisticated statistical frameworks can be mapped to that of neural networks and can benefit from those easetouse deep learning libraries for computational performance boosting.
3 Easytouse Software Libraries for HPC
3.1 Deep learning libraries and HPC
As of writing this article (late 2019), the two most popular deep learning software libraries are TensorFlow
(tensorflow)and PyTorch
(pytorch). There are two common features of these libraries. One is the computation graph that automates the evaluation of the loss function and its differentiation required for backpropagation. The other feature, more relevant to statistical computing, is an efficient and userfriendly interface to linear algebra and convolution routines that work on both CPU and GPU in a unified fashion. A typical pattern of using these libraries is to specify the model and describe how to fit the model to the training data in a highlevel scripting language (mostly Python). To fit a model, the software selects a backend optimized for the system in which the model runs. If the target system is a CPU node, then the software can be configured to utilize the OpenBLAS (openblas) or the Intel Math Kernel Library (mkl), which are optimized implementations of the Basic Linear Algebra Library (BLAS, blas) for sharedmemory systems. If the target system is a workstation with a GPU, then the same script can employ a pair of host and kernel code that may make use of cuBLAS (cublas), a GPU version of BLAS, and cuSPARSE (cusparse), GPUoriented sparse linear algebra routines. Whether to run the model on a CPU or GPU can be controlled by a slight change in the option for device selection, which is usually a line or two of the script. From the last paragraph of the previous section, we see that this ‘‘write once, run everywhere’’ feature of deep learning libraries can make GPU programming easier for statistical computing as well.TensorFlow is a successor of Theano
(bergstra2011theano), one of the first libraries to support symbolic differentiation based on computational graphs. Unlike Theano that generates GPU code on the fly, TensorFlow is equipped with precompiled GPU code for a large class of predefined operations. The computational graph of TensorFlow is static so that a user has to predefine all the operations prior to execution. Unfortunately, such a design does not go along well with the philosophy of scripting languages that the library should work with and makes debugging difficult. To cope with this issue, an ‘‘eager execution’’ mode, which executes commands without building a computational graph, is supported.PyTorch inherits Torch
(collobert2011torch7), an early machine learning library written in a functional programming language called Lua, and Caffe
(jia2014caffe), a Pythonbased deep learning library. Unlike TensorFlow, PyTorch uses dynamic computation graphs, so it does not require computational graphs to be predefined. Thanks to this dynamic execution model, the library is more intuitive and flexible to the user than most of its competitors. PyTorch (and Torch) can also manage GPU memory efficiently. As a result, it is known to be faster than other deep learning libraries (bahrampour2015comparative).Both libraries support multiGPU and multinode computing. In Tensorflow, multiGPU computation is supported natively. If data are distributed in multiple GPUs and one needs data from the other, the GPUs communicate implicitly and the user does not need to care. Multinode communication is more subtle: while remote procedure call is supported natively in the same manner as multiGPU communications, it is recommended to use MPI through the library called Horovod (sergeev2018horovod) for tightlycoupled HPC environments (more information is given in Section 3.2). In PyTorch, both multiGPU and multinode computing are enabled by using the interface torch.distributed. This interface defines MPIstyle (but simplified) communication primitives (see the parallel programming models paragraph in Section 2.1), whose specific implementation is called a backend. Possible communication backends include the MPI, Nvidia Collective Communications Library (NCCL), and Gloo (gloo). NCCL is useful for a multiGPU node; (CUDAaware) MPI maps multiGPU communications to the MPI standard as well as traditional multinode communications; Gloo is useful in cloud environments.
This feature of unified interfaces for various HPC environments is supported through operator overloading or polymorphism in modern programming languages, but achieving this seamlessly with a single library, along with multidevice support, is remarkable. This is partially because of injection of capital in pursuit of commercial promises of deep learning (TensorFlow is being developed by Google, and PyTorch by Facebook). There are other deep learning software libraries with similar HPC supports: Apache MxNet (chen2015mxnet) supports multinode computation via Horovod; multiGPU computing is also supported at the interface level. Microsoft Cognitive Toolkit (CNTK, seide2016cntk) supports parallel stochastic gradient algorithms through MPI.
3.2 Case study: PyTorch versus TensorFlow
In this section, we illustrate how simple it is to write a statistical computing code on multidevice HPC environments using a modern deep learning libraries. We compare PyTorch and TensorFlow code written in Python, which computes a Monte Carlo estimate of the constant
. The emphasis is on readability and flexibility, i.e., how small a modification is needed to run the code written for a singleCPU node on a multiGPU node and a multinode system.Listing 1 shows the Monte Carlo estimation code for PyTorch. Even for those who are not familiar with Python, the code should be quite readable. The main workhorse is function mc_pi() (Lines 1421), which generates a sample of size
from the uniform distribution on
and compute the proportion of the points that fall inside the quarter circle of unit radius centered at the origin. Listing 1 is a fully executable program. It uses torch.distributed interface with an MPI backend (Line 3). An instance of the program of Listing 1 is attached to a device and is executed as a ‘‘process’’. Each process is given its identifier (rank), which is retrieved in Line 5. The total number of processes is known to each process via Line 6. After the proportion of the points in the quartercircle is computed in Line 17, each process gathers the sum of the means computed from all the processes in Line 18 (this is called the allreduce operation; see Section 2.1). Line 19 divides the sum by the number of processes, yielding a Monte Carlo estimate of based on the sample size of .We have been deliberately ambiguous about the ‘‘devices.’’ Here, a CPU core or a GPU is referred to as a device. Listing 1 assumes the environment is a workstation with one or more GPUs, and the backend MPI is CUDAaware. A CUDAaware MPI, e.g., OpenMPI (openmpi), allows data to be sent directly from a GPU to another GPU through the MPI protocols. Lines 910 specify that the devices to use in the program are GPUs. If the environment is a cluster with multiple CPU nodes (or even a single node), then all we need to do is changing Line 9 to device = ’cpu’. The resulting code runs on a cluster seamlessly.
In TensorFlow, however, a separate treatment to multiGPU and cluster settings is almost necessary. The code for multiGPU setting is similar to Listing 1 hence given in Appendix B. In a cluster setting, unfortunately, it is extremely difficult to reuse the multiGPU code. If direct access to individual compute nodes is available, that information can be used to run the code distributedly, albeit not being much intuitive. However, in HPC environments where computing jobs are managed by job schedulers, we often do not have direct access to the compute nodes. The National Energy Research Scientific Computing Center (NERSC), the home of the 13th most powerful supercomputers in the world (as of November 2019), advises that gRPC, the default internode communication method of TensorFlow, is very slow on tightlycoupled nodes, thus recommends a direct use of MPI (nersc:tf). Using MPI with TensorFlow requires an external library called Horovod and a substantial modification of the code, as shown in Listing 2. This is a sharp contrast to Listing 1, where essentially the same PyTorch code can be used in both multiGPU and multinode settings.
Therefore we employ PyTorch in the sequel to implement the highly parallelizable algorithms of Section 4 in a multiGPU node and a cluster on a cloud, as it allows simpler code that runs on various HPC environments with a minimal modification. (In fact this modification can be made automatic through a command line argument.)
4 Highly Parallelizable Algorithms
In this section, we discuss some easily parallelizable optimization algorithms useful for fiting highdimensional statistical models, assuming that data are so large that they have to be stored distributedly. These algorithms can benefit from the distributedmemory environment by using relatively straightforward operations, via distributed matrixvector multiplication and independent update of variables.
4.1 MM algorithms
The MM principle (lange2000optimization, lange2016mm), where ‘‘MM’’ stands for either majorizationminimization or minorizationmaximization, is a useful tool for constructing parallelizable optimization algorithms. In minimizing an objective function iteratively, for each iterate we consider a surrogate function satisfying two conditions: the tangency condition and the domination condition for all . Updating guarantees that is a nonincreasing sequence:
In fact, full minimization of is not necessary for the descent property to hold; merely decreasing it is sufficient. The EM algorithm (dempster1977maximum) is an instance of the MM principle. In order to maximize the marginal loglikelihood
where is the observed data, is unobserved missing data, and is the parameter to estimate, we maximize the surrogate function
since
by Jensen’s inequality, and the second term in the last inequality is irrelavent to . (See wu2010mm for more details about the relation between MM and EM.)
MM updates are usually designed to make a nondifferentiable objective function smooth, linearize the problem, or avoid matrix inversions by a proper choice of the surrogate function. MM is naturally wellsuited for parallel computing environments, as we can choose a separable surrogate function and update variables independently. For example, when maximizing loglikelihoods, a term involving summation inside the logarithm often arises. By Jensen’s inequlity, this term can be minorized and separated as
where ’s are constants and is a constant only depending on ’s. Parallelization of MM algorithms on a single GPU using separable surrogate functions is extensively discussed in zhou2010graphics. Separable surrogate functions are especially important in distributed environments, e.g. multiGPU systems.
4.2 Proximal gradient descent
The proximal gradient descent method is an extension of the gradient descent method, which deals with minimization of sum of two convex functions, i.e.,
Function is possibly nondifferentiable, while is continuously differentiable.
We first define the proximity operator of :
For many functions their proximity operators take closed forms. We say such functions ‘‘proximable’’. For example, consider the indicator function of a closed convex set
The corresponding proximity operator is the Euclidean projection onto : . The proximity operator of the norm is the softthresholding operator:
For many sets, e.g., nonnegative orthant, is simple to compute.
Now we proceed with the proximal gradient descent for minimization of . Assume is convex and has Lipschitz gradients, i.e., for all , in the interior of its domain, and is lowersemicontinuous, convex, and proximable. The Lipschitz gradients naturally result in following surrogate function that majorizes :
Minimizing with respect to results in the update:
(1) 
This update guarantees a nonincreasing sequence of by the MM principle. Proximal gradient method also has an interpretation of forwardbackward operator splitting, and the step size guarantees convergence (combettes2011proximal, Bauschke:ConvexAnalysisAndMonotoneOperatorTheoryIn:2011, combettes2018monotone). If , then the corresponding algorithm is called the projected gradient method. If , then the corresponding algorithm is the iterative shrinkagethresholding algorithm (ISTA, Beck:SiamJournalOnImagingSciences:2009). For many functions , the update (1) is simple and easily parallelized, thus the algorithm is suitable for HPC computing. For example, the softthresholding operator is elementwise hence the updates are independent. In addition, if , then
(2) 
This proximity operator is useful for the example in Section 6.2. See parikh2014proximal for a thorough review and distributedmemory implementations, and polson2015proximal for a statisticsoriented review.
4.3 Primaldual methods
The algorithms discussed so far are primal methods. Primaldual methods introduce additional dual variables but can deal with a larger class of problems. Consider the problems of the form , where is a linear map. We further assume that , are lower semicontinuous, convex, and proper functions. Even if is proximable, the proximity operator for is not easy to compute. Define the convex conjugate of as . It is known that since is lower semicontinuous and convex, so . Then the minimization problem is equivalent to the saddlepoint problem
Under mild conditions strong duality
holds and the saddle point satisfies the optimality conditions
where denotes the subdifferential of a convex function . The vector is the dual variable and the maxmin problem
is called the dual of the original (primal) minimization problem.
A widely known method to solve this saddle point problem in the statistical literature is the ADMM (xue2012positive, zhu2015augmented, ramdas2016fast, gu2018admm). The ADMM update is given by:
(3a)  
(3b)  
(3c) 
The update (3a) is not a proximity operator, as the quadratic term is not spherical. It defines an inner optimization problem that is often nontrivial. In the simplest case of being linear or quadratic (which arises in linear regression), (3a) involves solving a linear system. While it is plausible to obtain the inverse of the involved matrix once and reuse it for future iterations, inverting a matrix even once quickly becomes intractable in the highdimensional setting, as its time complexity is cubic in the number of variables.
The primaldual hybrid gradient method (PDHG, zhu2008efficient, Esser:SiamJournalOnImagingSciences:2010, chambolle2011first) avoids such inversion via the following iteration:
(4a)  
(4b)  
(4c) 
where (4a) and (4b) are dual ascent and primal descent steps, respectively; and are step sizes. The last step (4c) corresponds to the extrapolation. If is proximable, so is , since by Moreau’s decomposition. This method has been studied using monotone operator theory (Condat:JournalOfOptimizationTheoryAndApplications:2012, Vu2013, ko2019easily). Convergence of iteration (4) is guaranteed if , where is the spectral norm of matrix . If has Lipschitz gradients, then the proximal step (4b) can be replaced by a gradient step
The PDHG algorithms are also highly parallelizable as long as the involved proximity operators are easy to compute and separable; no matrix inversion is involved in iteration (4) and only matrixvector multiplications appear.
5 Distributed matrix data structure for PyTorch
For the forthcoming examples and potential future uses in statistical computing, we propose a simple distributed matrix data structure named distmat. In this structure, each process, enumerated by its rank, holds a contiguous block of the full data matrix by rows or columns. The data may be a sparse matrix. If GPUs are involved, each process controls a GPU whose index matches the process rank. For notational simplicity, we denote the dimension to split in square brackets. If a matrix is split over four processes, the process with rank 0 keeps the first 25 rows of the matrix, and the rank 3 process takes the last 25 rows. For the sake of simplicity, we always assume that the size along the split dimension is divided by the number of processes. The code along with the examples in Section 6 is available at http://stat.snu.ac.kr/compstat/software/dist_stat.html.
In distmat, unary elementwise operations such as exponentiation, square root, absolute value, and logarithm of matrix entries were implemented in an obvious way. Binary elementwise operations such as addition, subtraction, multiplication, division were implemented in a similar manner to R’s vector recycling. For example, if two matrices of different dimensions are to be added together, say one is threebyfour and the other is threebyone, the latter matrix is expanded to a threebyfour matrix with the column repeated four times. Another example is adding a onebythree matrix and a fourbyone matrix. The former matrix is expanded to a fourbythree matrix by repeating the row four times, and the latter to a fourbythree matrix by repeating the column three times. Application of this concept is natural using the broadcast semantics of PyTorch. Reduction operations, such as rowwise (columnwise, and matrixwise) summation, (maximum, and minimum) were also implemented in a similar fashion.
Matrix multiplications are more subtle. Six different scenarios of matrixmatrix multiplications, each representing a different configuration of the split dimension of two input matrices and the output matrix, were considered and implemented. These scenarios are listed in Table 1. Note that ‘‘broadcasting’’ and ‘‘reduction’’ in this paragraph are defined over a matrix dimension (rows or columns), unlike in the other parts of this article where they are defined over multiple processes or ranks. The implementation of each case is carried out using the collective communication directives introduced in Section 2.1. Matrix multiplication scenarios are automatically selected based on the shapes of the input matrices and , except for the Scenarios 1 and 3 sharing the same input structure. Those two are further distinguished by the shape of output, . The nonnegative matrix factorization example of Section 6.1, which utilizes distmat most heavily among others, involves Scenarios 1 to 5. Scenario 6 is for matrixvector multiplications, where broadcasting small vectors is almost always efficient.
Description  Usage in Section 6  
1  Inner product, result distributed.  
2  Fat matrix multiplied by a thin and tall matrix.  
3 

,  
4 


5 



6 

In Listing 3, we demonstrate an example usage of distmat. We assume that this program is run with 4 processes (size in Line 5 is 4). Line 11 generates a doubleprecision matrix on CPU sampled from the uniform distribution. The function distgen_uniform has an optional argument TType that allows users to choose the data type and location of the matrix: Line 10 specifies the matrix to be a doubleprecision matrix on CPU. The user may change it to torch.cuda.FloatTensor to create this matrix on a GPU with singleprecision. Line 13 multiplies the two matrices A and B to form a distributed matrix of size . The matrix multiplication routine internally chooses to utilize Scenario 2 in Table 1. In order to compute elementwise, all that is needed to do is to write (1 + AB).log() as in Line 17. Here, is computed elementwise first, then its logarithms are computed. The local block of data can be accessed by appending .chunk to the name of the distributed matrix, as in Lines 16 and 20.
6 Examples
In this section, we compare the performance of the optimization algorithms on four statistical computing examples: nonnegative matrix factorization (NMF), positron emission tomography (PET), multidimensional scaling (MDS), and regularized Cox model for survival analysis. We demonstrate singledevice codes to show the simplicity of the programming and distribute it over a cluster consisted of multiple AWS EC2 instances or a local multiGPU workstation. For NMF and PET, we compare two algorithms, one more classical, and the other based on recent development. We evaluate the objective function once per 100 iterations. For the comparison of execution time, the iteration is run for a fixed number of iterations, regardless of convergence. For comparison of different algorithms regarding the same problem, we iterate until . Table 2 shows the setting of our HPC systems used for the experiments. For virtual cluster experiments, we utilized 1 to 20 of AWS c5.18xlarge instances with 36 physical cores with AVX512 (512bit advanced vector extension to the x86 instruction set) enabled in each instance through CfnCluster. Network bandwidth of each c5.18xlarge instance was 25GB/s. A separate c5.18xlarge instance served as the ‘‘master’’ instance. This instance does not participate in computation by itself but manages the computing jobs over the 1 to 20 ‘‘worker’’ instances. Data and software for the experiments were stored in an Amazon Elastic Block Store (EBS) volume attached to this instance and shared among the worker instances via the network file system. Further details are given in Appendix C. For GPU experiments, we used a local machine with two CPUs (10 cores per CPU) and eight Nvidia GTX 1080 GPUs. These are desktop GPUs, not optimized for doubleprecision. All the experiments were conducted using PyTorch version 0.4 built on the MKL; the released code works for the versions up to 1.3, the most recent stable version as of December 2019.
For all of our experiments, the singleprecision computation results on GPU were almost the same as the doubleprecision result up to six significant digits, except for regularized Cox regression, the necessary cumulative sum operation implemented in PyTorch caused numerical instability in some cases with small penalties. Therefore all the computations for Cox regression were performed in doubleprecision. Extra efforts for writing a multidevice code were modest using distmat. Given around 1000 lines of code to implement basic operations for multidevice configuration in distmat, additional code for our four applications was less than 30 lines for each application.
As can be verified in the sequel, computing on GPUs was effective on midsized (around 10,000 10,000) datasets, but stalled on larger (around 100,000 100,000) datasets due to memory limitation. In contrast, the virtual clusters were not very effective on midsized data, and may even slow down due to communication burden. They were effective and scaled well on larger (around 100,000 100,000) datasets.
local node  AWS c5.18xlarge  
CPU  GPU  CPU  
Model  Intel Xeon E52680 v2  Nvidia GTX 1080 


# of cores  10  2560  18  
Clock  2.8 GHz  1.6 GHz  3.0GHz  
# of entities  2  8 


Total memory  256 GB  64 GB  144 GB 120  
Total cores  20  20,480 (CUDA)  36 120 
6.1 Nonnegative matrix factorization
NMF is a procedure that approximates a nonnegative data matrix by a product of two lowrank nonnegative matrices, and . It is widely used in image processing, bioinformatics, and recommender systems (wang2013nonnegative) where the data have only nonnegative values. One of the first effective algorithms was the multiplicative algorithm introduced by lee1999learning, lee2001algorithms. In a simple setting, NMF minimizes
where denotes the Frobenius norm.
The multiplicative algorithm written using PyTorch for a single device is given in Listing 4. This algorithm can be interpreted as a case of MM algorithm with a surrogate function of based on Jensen’s inequality:
The update rule is:
where and denote elementwise multiplication and division, respectively.
The simplelooking code in Listing 4 can fully utilize the sharedmemory parallelism: if the matrices are stored on the CPU memory, it runs parallelly, fully utilizing OpenMP and MKL/OpenBLAS (depending on installation). If they are stored on a single GPU, it runs parallely utilizing GPU cores through the CUDA libraries. Distributing this algorithm on a large scale machine is straightforward (liu2010distributed).
Figure 1 shows an example of NMF on a publicly available hyperspectral image. It was acquired by the reflective optics system imaging spectrometer sensor in a flight campaign over Pavia University in Italy. The image is essentially a hyperspectral cube. It is interpreted as a matrix and then analyzed using NMF. The rank was set to 20. In the resulting matrix , each column can be interpreted as a composite channel from the original 103 bands. Three of these channels showing distinct features chosen by hand are shown in Figure 1.
A problem with the multiplicative algorithm is the potential to generate subnormal numbers, significantly slowing down the algorithm. A subnormal number or denormal number is a number smaller (in magnitude) than the smallest positive number that can be represented by the floatingpoint number system. Subnormal numbers are generated by the multiplicative algorithm if values smaller than 1 are multiplied repeatedly. Indeed, when Listing 4 was run on a CPU with a small synthetic data of size , we observed a significant slowdown. The IEEE floatingpoint standard is to deal with subnormal numbers properly with a special hardware or software implementation (ieee2008754). In many CPUs, the treatment of subnormal numbers relies on software and hence is very slow. Forcing such value to zero is potentially dangerous depending on applications because it becomes prone to divisionbyzero error. In our experiments, we did not observe divisionbyzero error when flushing the subnormal numbers to zero. In contrast, Nvidia GPUs support subnormal numbers at a hardware level since the Fermi architecture, and simple arithmetic operations do not slow down by subnormal numbers (whitehead2011precision).
Subnormal numbers can be completely avoided (especially in CPUs) by using a different algorithm. The alternating projected gradient (APG) method (lin2007projected) is such an algorithm, and it is also easy to introduce regularization terms.With ridge penalties the objective function
is minimized. The corresponding APG update is given by
where denotes the projection onto the nonnegative orthant; and are the step sizes. This update rule can be interpreted as an MM algorithm, due to the nature of projected gradient. Convergence of APG is guaranteed if , , and .
For the distributed implementation, is assumed to be an matrix. The resulting matrix is distributed as an matrix, and is distributed as an matrix. The distributed code is equivalent to replacing torch.mm with distmat.mm in Listing 4, with an additional optional argument out_sizes=W.sizes on Line 10. As discussed in Section 5, distributed matrix multiplication algorithms are automatically selected from Table 1 based on the arguments.
Table 3 shows the performance of the two NMF algorithms on a input matrix with various values of . For APG, is used. While the APG algorithm requires more operations per iteration than the multiplicative algorithm, it is faster on CPUs, because subnormal numbers are avoided. As GPU does not slow down with subnormal numbers, each iteration is faster in the multiplicative algorithm. Table 4 shows that APG appears to converge slower early on (10,000 iterations), but it eventually catches up (100,000 iterations) in terms of objective value. As more GPUs are used, the algorithms speed up in general. The only exception is with 8 GPUs with , where interGPU communication overhead dominates the actual computation.
Additional experiments were conducted to see how the value of affects the convergence. The results are shown in Table 6. Convergence was faster for higher values of . The number of iterations to convergence in the multiplicative algorithm was longer than the APG with for higherrank decompositions ( = 40 and 60) due to heavier communication burden.
Table 6 displays the performance comparison of APG between singlemachine multiGPU and multiinstance virtual cluster settings. The datasets used were of different sizes: ‘‘small,’’ 10,000 10,000; ‘‘midsize,’’ 200,000 100,000; and ‘‘largesize,’’ 200,000 200,000. For reference, the dataset used in zhou2010graphics was of size . MultiGPU setting achieved up to 4.14xspeedup over a single CPU instance with the small dataset, but could not run larger datasets. The cluster in a cloud was scalable on larger datasets, running faster with more instances, up to 4.10xspeedup over the twoinstance cluster.
10,000 iterations  
method  CPU  1 GPU  2 GPUs  4 GPUs  8 GPUs  
Multiplicative  20  655  160  93  62  50 
40  978  165  102  73  72  
60  1355  168  109  85  86  
APG  20  504  164  97  66  57 
()  40  783  168  106  78  77 
60  1062  174  113  90  92 
method  10,000 iterations  100,000 iterations  
Multiplicative  20  8.270667E+06  8.270009E+06 
40  8.210266E+06  8.208682E+06  
60  8.155084E+06  8.152358E+06  
APG  20  8.271248E+06  8.270005E+06 
()  40  8.210835E+06  8.208452E+06 
60  8.155841E+06  8.151794E+06 
, 8 GPUs  , 8 GPUs  , 4 GPUs  
Method  iterations  time (s)  function  iterations  time (s)  function  iterations  time (s)  function 
Multiplicative  21200  110  8.270530E+06  36600  269  8.209031E+06  50000  446  8.152769E+06 
APG  31500  198  8.270202E+06  37400  310  8.208875E+06  55500  536  8.152228E+06 
APG  30700  191  8.274285E+06  36700  302  8.210324E+06  55500  537  8.153890E+06 
APG  30500  190  8.282346E+06  37300  307  8.223108E+06  47800  460  8.168503E+06 
APG  28000  178  8.389818E+06  31000  257  8.347859E+06  46400  448  8.308998E+06 
configuration  10000 10000  200000 100000  200000 200000  
10,000 iterations  1,000 iterations  1,000 iterations  
GPUs  
1  164  168  174  
2  97  106  113  
4  66  78  90  
8  57  77  92  
AWS EC2 c5.18xlarge instances  
1  236  471  549  
2  198  375  493  1487  1589  2074  
4  205  310  430  863  998  1233  1493  1908  2232 
5  230  340  481  661  896  1082  1326  1652  2070 
8  328  390  536  448  541  688  937  1044  1587 
10  420  559  643  422  540  682  737  937  1179 
20  391  1094  1293  363  489  592  693  818  1041 
6.2 Positron emission tomography
Positron emission tomography (PET) is one of the earliest applications of the EM algorihtm in computed tomography (lange1984reconstruction, vardi1985statistical). In this scenario, we consider a twodimensional imaging consisted of pixels obtained from the circular geometry of photon detectors. We estimate Poisson emission intensities , which is proportional to the concentration of radioactively labeled isotopes injected to biomolecules. Such an isotope emits a positron, which collides with a nearby electron, forming two gammaray photons flying in almost opposite directions. These two photons are detected by a pair of photon detectors corresponding to the line of flight. The coincidence counts are observed. Detector pairs are enumerated by 1, 2, , . The likelihood of detection for a detector pair
is modeled by Poisson distribution with mean
, whereis the probability that a pair of photons is detected by the detector pair
given that a positron is emitted in the pixel location . The matrix can be precomputed based on the geometry of the detectors. The corresponding loglikelihood to maximize is given byWithout a spatial regularization term, the reconstructed intensity map is grainy. One remedy is adding a ridgetype penalty of , where is the finite difference matrix on the pixel grid; each row of has one +1 and one . The MM iteration based on separation of the penalty function by the minorization
is:
where and are precomputed. Matrix is the adjacency matrix corresponding to the grid. See Section 3.2 of zhou2010graphics for the detailed derivation. By using matrix notations and broadcasting semantics, the PyTorch code can be succinctly written as in Listing 5.
Figure 3 shows the results with a RolandVaradhanFrangakis (RVF) phantom (roland2007squared) with with various values of , and Figure 5 shows the results with a extended cardiactorso (XCAT) phantom (lim2018pet, ryu2018splitting) with . Images get smooth as the value of increases, but the edges are blurry.
To promote sharp contrast, the total variation (TV) penalty (rudin1992nonlinear) can be employed. Adding an anisotropic TV penlty yields minimizing
We can use the PDHG algorithm discussed in Section 4.3. Put , , and , where is the allone vector of conforming shape and is the indicator function for the nonnegative orthant. Since is separable in and , applying iteration (4) using the proximity operator (2), we obtain the following update rule:
where is elementwise projection to the interval . Convergence is guaranteed if .
Figures 3 and 5 are the TVreconstructed versions of Figures 3 and 5, respectively. Compare the edge contrast.
Table 7 shows the convergence with different values of penalty parameters. Observe that the algorithm converges faster for large values of .
iterations  time (s)  function  
0  6400  20.6  2.417200E+05 
0.01  4900  15.8  2.412787E+05 
0.1  5000  16.1  2.390336E+05 
1  2800  9.5  2.212579E+05 
Scalability experiments were carried out with large RVFlike phantoms using grid sizes , , and , with the number of detectors (). The matrix is distributed as a matrix, and the matrix is distributed as an matrix. The symmetric adjacency matrix is distributed as a matrix. The sparse structure of these matrices is exploited using the sparse tensor data structure of PyTorch. Timing per 1000 iterations is reported in Table 8. For reference, the data used in zhou2010graphics was for grid with , or . Time per iterations of the PDHG method for the TV penalty is noticeably shorter as each iteration is much simpler than the MM counterpart for the ridge penalty, with no intermediate matrix created. The total elapsed time gets shorter with more GPUs. Although the speedup when adding more devices is somewhat mitigated in this case due to using sparse structure, resulting in 1.25xspeedup for 8 GPUs over 2 GPUs with , we can still take advantage of the scalability of memory with more devices.
configuration  
GPUs  
1  
2  21  35  
4  19  31  
8  18  28  
AWS EC2 c5.18xlarge instances  
1  63  108  530 
2  46  84  381 
4  36  49  210 
5  36  45  188 
8  33  39  178 
10  38  37  153 
20  26  28  131 
6.3 Multidimensional scaling
Multidimensional scaling is one of the earliest applications of the MM principle (deleeuw1977mds, de1977convergence). In this example, we reduce the dimensionality of data points by mapping them into in dimensional Euclidean space in a way that keeps the dissimilarity measure between the data points and as close as possible to that in the original manifold. In other words, we minimize the stress function
Comments
There are no comments yet.