I-A Motivation and Background
Over the past decade, and especially the past few years, there has been a rapid increase in research and development of artificial intelligence (AI) systems across the public and private sectors. A significant fraction of this increase is attributable to remarkable recent advances in a subfield of AI that is termed machine learning. Briefly, a machine learning system uses a number of data samples, referred to as training data, in order to learn a mathematical model of some aspect of the physical world that can then be used for automated decision making; see Fig. 1(a) for an example of this in the context of automated tagging of images of cats and dogs. Training a machine learning model involves mathematical optimization of a data-driven function with respect to the model variable. The decision making capabilities of a machine learning system, in particular, tend to be directly tied to one’s ability to solve the resulting optimization problem up to a prescribed level of accuracy.
While solution accuracy remains one of the defining aspects of machine learning, the advent of big data (in terms of data dimensionality and/or number of training samples) and the adoption of large-scale models with millions of parameters in machine learning methods such as deep learning have made the computing time for training (i.e., the training time) another defining parameter of modern systems. It is against this backdrop that stochastic optimization methods such as stochastic gradient descent (SGD) and its variants [2, 3, 4, 5] have become the de facto standard for faster training of models. These methods process training data one sample or a small batch of samples (referred to as a mini-batch) per iteration, whereas deterministic optimization methods such as gradient descent use the entire batch of training data in each iteration.
Another major shift in machine learning practice concerns the use of distributed and decentralized computing platforms, as opposed to a single computing unit, for training of models. There are myriad reasons for this paradigm shift, which range from the focus on further decreasing the training times and preserving privacy of data to adoption of machine learning for decision making in inherently decentralized systems. In particular, distributed and decentralized training of machine learning models can be epitomized by the following three prototypical frameworks.
Distributed computing framework: A distributed computing framework, also referred to as a compute cluster, brings together a set of computing units such as CPUs and GPUs to accelerate training of large-scale machine learning models from big data in a more cost-effective manner than a single computer with comparable storage capacity, memory, and computing power. Computing units/machines in a compute cluster typically communicate among themselves using either Ethernet or InfiniBand interconnects, with the intra-cluster communication infrastructure often abstracted in the form of a graph in which vertices/nodes correspond to computing units. A graph structure that is commonly utilized for distributed training within compute clusters is the star graph, which corresponds to the so-called master–worker architecture; see, e.g., Fig. 1(b). Training data within this setup are split among the worker nodes, which perform the bulk of the computations, while the master node coordinates the distributed training of the machine learning model among the worker nodes.
Federated learning framework: The term “federated learning,” coined in , refers to any machine learning setup in which a collection of autonomous entities (e.g., smartphones and hospitals), each maintaining its own private training data, collaborate under the coordination of a central server to learn a “global” machine learning model that best describes the collective non-collocated/distributed training data. A typical federated learning system, in which entities collaborate only through communication with the central server and are prohibited from sharing raw training samples with the server, can also be abstracted as a star graph; see, e.g., Fig. 1(c). However, unlike a master–worker distributed machine learning system, in which the primary objective is reduction of the wall-clock time for training of machine learning models, the first and foremost objective of a federated learning system is to preserve privacy of the data of collaborating entities.
Edge computing framework: The term “edge computing” refers to any decentralized computing system comprising geographically distributed and compact computing devices that collaboratively complete a computational task through local computations and device-to-device communications. Some of the defining characteristics of an edge computing system, which set it apart from a compute cluster, include lack of a coordinating central server, (relatively) slower-speed device-to-device communications (e.g., wireless communications and power-line communications), and abstraction of inter-device communication infrastructure in terms of arbitrary graph topologies (as opposed to star topology). Many emerging edge computing systems, such as the internet-of-things (IoT) systems and smart grids, have each computing device connected to a number of data-gathering sensors that generate large volumes of data. Since exchange of these large-scale “local” data among the computing devices becomes prohibitive due to communication constraints, machine learning in such systems necessitates decentralized collaborative training that involves each device learning (approximately) the same “local” model through inter-device communications that best fits the collective system data; see, e.g., Fig. 1(d).
The purpose of this paper is to provide an overview of some important aspects of distributed/decentralized machine learning that have implications for all three of the aforementioned frameworks. We slightly abuse terminology in the following for ease of exposition and refer to training of machine learning models under any one of these frameworks as distributed machine learning. Distributed training of (large-scale) models from (big) distributed datasets raises a number of important questions; these include: (i) What are the fundamental limits on solution accuracy of distributed machine learning? (ii) What kind of optimization frameworks and communication strategies (which exclude exchange of raw data among subcomponents of the system) result in near-optimal distributed learning? (iii) How do the topology of the graph underlying the distributed computing setup and the speed of communication links in the setup impact the learning performance of any optimization framework? A vast body of literature in the last decade has addressed these (and related) questions for distributed machine learning by expanding on foundational works in distributed consensus [8, 9], distributed diffusion, distributed optimization [11, 12], and distributed computing. Several of the key findings of such works have also been elucidated through excellent survey articles and overview papers in recent years [14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Nonetheless, there remains a need to better understand the interplay between solution accuracy, communication capabilities, and computational resources in distributed systems that carry out training using “streaming” data. Indeed, distributed training using streaming data necessitates utilization of single-pass stochastic optimization within the distributed framework, which gives rise to several important operational changes that are not widely known. It is in this regard that this overview paper summarizes some of the key research findings, and their implications, in relation to distributed machine learning from streaming data.
I-B Streaming Data and Distributed Machine Learning
Continuous gathering of data is a hallmark of the digital revolution; in countless applications, this translates into streams of data entering into the respective machine learning systems. Within the context of distributed machine learning, the continuous data gathering has the effect of training data associated with each “node” in the distributed system being given in the form of a data stream (cf. Fig. 3 in Section II). Since “(full) batch processing” is practically infeasible in the face of continuous data arrival, distributed training of models from streaming data requires (single-pass) stochastic optimization methods. Accordingly, we provide in this paper an overview of some of the state-of-the-art concerning stochastic optimization-based distributed training from streaming data.
In contrast to much of the literature on centralized machine learning from streaming data, the (relative) streaming rate of data, defined as the (average) number of new data samples arriving per second, fundamentally shapes the discussion of streaming-based distributed machine learning. In this regard, our objective is to elucidate the performance challenges and fundamental limits when the streaming rate of data is fast compared to the processing speed of computing units and/or the communications speed of inter-node links in the system. In particular, this involves addressing the following question: what happens to the solution accuracy of distributed machine learning when it is impossible to have high-performance computing machines for computing nodes and/or (multi-)gigabit connections for inter-node communication links? Note that this question cannot be addressed by simply “slowing down” the data stream(s) through regular discarding of incoming samples. Within a distributed computing framework, for instance, letting some of the incoming samples pass without updating the model would be antithetical to its overarching objective of accelerated training. Similarly, downsampling of time-series data streams in an edge computing system would cause the system to lose out on critical high-frequency modes of data. In short, processing all data samples arriving into the distributed system and incorporating them into the learned model is both paramount and non-trivial.
There are many ways to frame and analyze the problem of distributed machine learning from fast streaming data, leading to far more relevant works than we can discuss in this overview paper. Instead, we provide a very brief discussion of the different framings, and motivate our prioritization of the following system choices under the general umbrella of distributed machine learning: decentralized-parameter systems, synchronous-communications distributed computing, and statistical risk minimization for training of machine learning models. We dive into the relevant distinctions for these system choices in the following.
I-C General Framing of the Overview
The area of distributed machine learning is far too rich and broad to be covered in a single paper. Instead, we cover only those aspects of the area that are the most relevant to the topic of distributed machine learning from streaming data. To put the rest of our discussion in context, we give a very coarse description of these aspects in the following, drawing out some of the crucial distinctions and pointing out which aspects remain uncovered in the paper.
System models for distributed learning. We abstract away the dependence on any particular computing architecture by modeling the architecture as an interconnected network of (computing) nodes having a certain topology (e.g., star topology for the master–worker architecture). Accordingly, our discussion is applicable to any of the computing frameworks discussed in Section I-A that adhere to the data and system assumptions described later in Section II. In the interest of generality, we also move away from the so-called parameter-server system model that is used in some distributed environments [24, 25]. In the simplest version of this model, a single node, termed the parameter server, maintains and updates parameters of the machine learning model, whereas the remaining nodes in the network compute gradients on their local data that are then transmitted to the parameter server and used to make updates to the shared set of parameters. We instead center our discussion around the decentralized-parameter system model, where each node maintains and updates its own copy of the parameters. This system model offers three advantages: it is more general, since any result that holds for a decentralized-parameter network also holds for a parameter-server network; it prevents a single point of failure in the system; and it allows us to present a unified discussion that transcends multiple system models.
Models for message passing and communications. Algorithmic-level synchronization (or lack thereof) among different computing nodes is one of the most important design choices in distributed implementations. On one hand, synchronous implementations (which often make use of “blocking” message passing protocols for synchronization [26, 27]) can slow down training times due to either message passing (i.e., communications) delays or “straggler” nodes taking longer than the rest of the network to complete their subtasks. On the other hand, asynchronous implementations have the potential to drastically impact the solution accuracy. Such tradeoffs between synchronous and asynchronous implementations, as well as approaches that hybridize the two, have been investigated in recent years [28, 29, 30, 31, 32, 33]. In this paper, we focus exclusively on synchronous implementations for the sake of concreteness. In addition, we abstract lower-level communications within the synchronous system as happening in discrete, pre-defined epochs (time intervals, slots, etc.). While such an abstraction models only a restrictive set of communications protocols, it greatly simplifies the exposition without sacrificing too much of the generality.
Optimization framework for distributed machine learning. Machine learning problems involve the optimization of a “loss” function with respect to the machine learning model. This optimization side of machine learning can be framed in two major interrelated ways. The first (and perhaps most well-known) framing is referred to as empirical risk minimization (ERM). The objective in this case is to minimize the empirical risk $\hat{f}(\mathbf{w})$, defined as the empirical average of the so-called (regularized) loss function evaluated on the training samples, with respect to the model variable $\mathbf{w}$. Under mild assumptions on the loss function, data distribution, and training data, the ERM solution $\mathbf{w}^*_{\mathrm{ERM}}$ is known to converge (with high probability) to the minimizer $\mathbf{w}^*$ of the “true” risk $f(\mathbf{w})$, i.e., the expected loss $\mathbb{E}_{\mathbf{z} \sim \mathcal{D}}[\ell(\mathbf{w}, \mathbf{z})]$, with study of the rates of this convergence being a long-standing and active research area. Distributed learning literature within the ERM framework typically supposes a fixed and finite number of training samples distributed across computing nodes, and primarily focuses on understanding convergence of the output $\hat{\mathbf{w}}$ of different distributed optimization schemes to the ERM solution $\mathbf{w}^*_{\mathrm{ERM}}$. The accuracy of the final solution, termed excess risk and defined as $f(\hat{\mathbf{w}}) - f(\mathbf{w}^*)$, is then provided either implicitly or explicitly in the works as the sum of two gaps: (i) the gap between the risk of the distributed optimization solution and that of the ERM solution, i.e., $f(\hat{\mathbf{w}}) - f(\mathbf{w}^*_{\mathrm{ERM}})$, and (ii) the gap between the risk of the ERM solution and that of the optimal solution, termed Bayes’ risk, i.e., $f(\mathbf{w}^*_{\mathrm{ERM}}) - f(\mathbf{w}^*)$. In contrast to the ERM framework, the second optimization-based framing of machine learning, termed statistical risk minimization (SRM), facilitates a direct bound on the excess risk; see, e.g., Fig. 2. This is because the objective in the SRM framework is to minimize expected loss (risk) over the true data distribution, as opposed to empirical loss over the training data in the ERM framework. The SRM framework falls squarely within the confines of stochastic optimization, with a large body of existing work, covering both centralized and distributed machine learning, that characterizes the excess risk of the resulting solution under the assumption that either the number of training samples is sufficiently large or it grows asymptotically. Since we are concerned with streaming data, in which a virtually unbounded number of samples may arrive at the system, we focus on the SRM-based framework and single-pass stochastic optimization for distributed machine learning. We discuss further the distinction between the convergence results derived under the frameworks of ERM and SRM in the sequel.
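The two-gap view of the excess risk under ERM described above is a telescoping identity, which the following display makes explicit. The symbols are assumptions chosen for illustration: $f$ is the true risk, $\hat{\mathbf{w}}$ the output of a distributed optimization scheme, $\mathbf{w}^*_{\mathrm{ERM}}$ the ERM solution, and $\mathbf{w}^*$ the minimizer of the true risk.

```latex
\underbrace{f(\hat{\mathbf{w}}) - f(\mathbf{w}^*)}_{\text{excess risk}}
  \;=\;
  \underbrace{f(\hat{\mathbf{w}}) - f(\mathbf{w}^*_{\mathrm{ERM}})}_{\text{(i) distributed solution vs.\ ERM solution}}
  \;+\;
  \underbrace{f(\mathbf{w}^*_{\mathrm{ERM}}) - f(\mathbf{w}^*)}_{\text{(ii) ERM solution vs.\ optimal solution}}
```

ERM-based analyses bound the two terms on the right separately, whereas an SRM-based analysis bounds the left-hand side directly.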
Structure of the optimization objective function. The vast majority of works at the intersection of (stochastic) optimization and (distributed) learning suppose that the loss function is convex with respect to the model parameters. But some of the most exciting recent results in machine learning have come about in the context of deep learning, where the objective function tends to be highly nonconvex and most practical methods do not even concern themselves with global optimality of the solution [36, 37]. Nevertheless, for the purpose of being able to carry out analysis, we focus in this overview on either convex problems or structured nonconvex problems, such as principal component analysis (PCA), where the structure can be exploited by local search methods to find a global solution.
I-D An Outline of the Overview Paper
We now provide an outline of the remainder of this paper. Section II gives a formal description of the learning and system models considered in this paper, including the loss function, the streaming data model, the communications model, and the way compute nodes exchange messages with each other during distributed learning. In Section III, we discuss relevant results in (distributed) machine learning that prefigure the state-of-the-art being reviewed in the paper. Section IV and Section V of the paper are devoted to coverage of the state-of-the-art in terms of distributed machine learning from fast streaming data. The main distinction between the two sections is the nature of the communications infrastructure underlying the distributed computing framework. Section IV focuses on the case of (relatively) high-speed communications infrastructure that enables completion of message-passing primitives such as AllReduce in a “reasonable” amount of time, whereas Section V discusses distributed machine learning from streaming data in systems with (relatively) lower-speed communications infrastructure. In both cases, we discuss scenarios and distributed algorithms that can lead to near-optimal excess risk for the final solution as a function of the number of data samples arriving at the system; in addition, we present results of numerical experiments to corroborate some of the stated results. One of the key insights delivered by these two sections is that a judicious use of (implicit or explicit) mini-batching of data samples in distributed systems is fundamental in dealing with fast streaming data in compute- and/or communications-limited scenarios. To this end, we provide theoretical results for the optimum choice of the size of (network-wide and local) mini-batches as well as conditions on when mini-batching is sufficient to achieve near-optimal convergence. We conclude the paper in Section VI with a brief recap of the implications of the presented results for practitioners as well as a discussion of possible next steps for researchers working on distributed machine learning.
I-E Notational Convention
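We use regular-faced (e.g., $a$ and $B$), bold-faced lower-case (e.g., $\mathbf{a}$), and bold-faced upper-case (e.g., $\mathbf{A}$) letters for scalars, vectors, and matrices, respectively. We use calligraphic letters (e.g., $\mathcal{A}$) to represent sets, while $[k]$ denotes the set of first $k$ natural numbers, and $\mathbb{R}_+$ and $\mathbb{Z}_+$ denote the sets of non-negative real numbers and positive integers, respectively. Given a vector $\mathbf{a}$ and a matrix $\mathbf{A}$, $\|\mathbf{a}\|_2$, $\|\mathbf{A}\|_2$, and $\|\mathbf{A}\|_F$ denote the $\ell_2$-norm of $\mathbf{a}$, the spectral norm of $\mathbf{A}$, and the Frobenius norm of $\mathbf{A}$, respectively. Given a symmetric matrix $\mathbf{A} \in \mathbb{R}^{d \times d}$, $\lambda_i(\mathbf{A})$ denotes its $i$-th largest-by-magnitude eigenvalue, i.e., $|\lambda_1(\mathbf{A})| \geq |\lambda_2(\mathbf{A})| \geq \dots \geq |\lambda_d(\mathbf{A})|$. Given a function $f(\mathbf{x}, \mathbf{y})$ that is partially differentiable in the first argument, $\nabla_{\mathbf{x}} f(\mathbf{x}, \mathbf{y})$ denotes the gradient of $f$ with respect to its first argument. Given functions $p(x)$ and $q(x)$, we use Landau’s Big-O notation (e.g., $p(x) = O(q(x))$ or $p(x) = \Omega(q(x))$) to describe the scaling relationship between them. Finally, $\mathbb{E}[\cdot]$ denotes the expectation operator, where the underlying probability space is either implicit from the context or is explicitly noted.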
II Problem Formulation
In this section, we discuss the problem of distributed processing of fast streaming data for machine learning in three parts. First, we describe the general statistical optimization problem underlying machine learning. Second, we describe the system model that formalizes distributed processing of streaming data. Finally, we formalize the notion of fast streaming data in terms of, among other things, data streaming rate, processing rate of compute nodes, and communications rate of inter-node links within the distributed environment.
II-A Statistical Optimization for Machine Learning
Most machine learning problems can be posed as data-driven optimization problems, with the objective termed a loss function that quantifies the error (classification, regression or clustering error, mismatch between the learned and true data distributions, etc.) in a candidate solution. We denote this loss function by $\ell: \mathcal{W} \times \mathcal{Z} \to \mathbb{R}$, where $\mathcal{W}$ denotes the space of candidate machine learning models and $\mathcal{Z}$ denotes the space of data samples. Given a model $\mathbf{w} \in \mathcal{W}$, $\ell(\mathbf{w}, \mathbf{z})$ measures the modeling loss associated with $\mathbf{w}$ in relation to the data sample $\mathbf{z} \in \mathcal{Z}$. (Footnote 1: The model space $\mathcal{W}$ in most formulations is taken to be one that is completely described by a set of parameters. For example, if $\mathcal{W}$ denotes the space of all polynomials of degree $d$, a model $\mathbf{w} \in \mathcal{W}$ is uniquely characterized by the coefficients of the respective polynomial. In this paper, we slightly abuse the notation and use $\mathbf{w}$ to denote both the model and, when the model is parameterizable, its respective parameters.)
Several examples of loss functions and their respective model space(s) for supervised learning (e.g., regression and classification) and unsupervised learning (e.g., feature learning and clustering) problems are listed below.
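Loss functions and models for supervised machine learning. Data samples in supervised learning can be expressed as tuples $\mathbf{z} := (\mathbf{x}, y)$, with $\mathbf{x} \in \mathbb{R}^d$ referred to as data and $y$ referred to as its label. In particular, focusing on the linear classification problem with label $y \in \{-1, +1\}$, augmented data $\tilde{\mathbf{x}} := [\mathbf{x}^T \; 1]^T$, and model space $\mathcal{W} := \mathbb{R}^{d+1}$, the model $\mathbf{w} \in \mathcal{W}$ describes an affine hyperplane in $\mathbb{R}^d$ and two common choices of loss functions are: (i) Hinge loss: $\ell(\mathbf{w}, \mathbf{z}) := \max\{0, 1 - y\,\mathbf{w}^T \tilde{\mathbf{x}}\}$, and (ii) Logistic loss: $\ell(\mathbf{w}, \mathbf{z}) := \ln\big(1 + \exp(-y\,\mathbf{w}^T \tilde{\mathbf{x}})\big)$.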
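As a concrete illustration of the two classification losses, the sketch below evaluates the standard hinge and logistic losses for labels in {-1, +1}, with the bias absorbed into the model through a data vector augmented by a constant 1. The function names and the example numbers are hypothetical, chosen only for this illustration.

```python
import math

def augment(x):
    """Append a constant 1 so that the bias is absorbed into the model vector."""
    return list(x) + [1.0]

def margin(w, x, y):
    """Signed margin y * <w, x_aug> for a label y in {-1, +1}."""
    x_aug = augment(x)
    return y * sum(wi * xi for wi, xi in zip(w, x_aug))

def hinge_loss(w, x, y):
    """max(0, 1 - margin): zero once the sample is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - margin(w, x, y))

def logistic_loss(w, x, y):
    """log(1 + exp(-margin)), evaluated in a numerically stable way."""
    m = margin(w, x, y)
    if m >= 0:
        return math.log1p(math.exp(-m))
    return -m + math.log1p(math.exp(m))

w = [2.0, -1.0, 0.5]   # two weights plus a bias term (illustrative values)
x = [1.0, 1.0]
print(hinge_loss(w, x, +1), hinge_loss(w, x, -1))       # 0.0 2.5
print(logistic_loss(w, x, +1), logistic_loss(w, x, -1))
```

Both losses depend on the sample only through the margin, and both are convex in the model; the hinge loss is not differentiable at margin 1, whereas the logistic loss is smooth.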
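Loss functions and models for unsupervised machine learning. Data samples in unsupervised machine learning do not have labels, with the unlabeled data sample $\mathbf{z} \in \mathbb{R}^d$ in this case. We now describe the loss functions and models/model parameters associated with two popular unsupervised machine learning problems.
Principal component analysis (PCA): The $k$-PCA problem is a feature learning problem in which the model space is $\mathcal{W} := \{\mathbf{W} \in \mathbb{R}^{d \times k} : \mathbf{W}^T \mathbf{W} = \mathbf{I}\}$, the matrix-valued model $\mathbf{W} \in \mathcal{W}$ describes a $k$-dimensional subspace of $\mathbb{R}^d$, and the loss function is $\ell(\mathbf{W}, \mathbf{z}) := \|\mathbf{z} - \mathbf{W}\mathbf{W}^T \mathbf{z}\|_2^2$.
Center-based clustering: The $k$-means clustering problem has the model space $\mathcal{W} := \mathbb{R}^{d \times k}$, the model $\mathbf{W} := [\mathbf{w}_1 \; \cdots \; \mathbf{w}_k]$ is a $k$-tuple of $d$-dimensional cluster centroids, and a common choice of the loss function is $\ell(\mathbf{W}, \mathbf{z}) := \min_{1 \leq i \leq k} \|\mathbf{z} - \mathbf{w}_i\|_2^2$.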
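The two unsupervised losses can be evaluated for a single sample as in the following minimal sketch. It assumes the usual forms of these losses: the squared residual after projecting the sample onto the span of orthonormal columns for PCA, and the squared distance to the nearest centroid for center-based clustering. The function names and the list-of-columns representation of the model are hypothetical.

```python
def pca_loss(W, z):
    """Squared reconstruction error ||z - W W^T z||^2.

    W is given as a list of k columns, assumed orthonormal.
    """
    # Coefficients of z along each column, then the projection onto their span.
    coeffs = [sum(ci * zi for ci, zi in zip(col, z)) for col in W]
    proj = [sum(c * col[i] for c, col in zip(coeffs, W)) for i in range(len(z))]
    return sum((zi - pi) ** 2 for zi, pi in zip(z, proj))

def kmeans_loss(centroids, z):
    """Squared Euclidean distance from z to its nearest centroid."""
    return min(sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in centroids)

z = [3.0, 4.0, 0.0]
W = [[1.0, 0.0, 0.0]]                          # k = 1: span of the first axis
centroids = [[0.0, 0.0, 0.0], [3.0, 3.0, 0.0]]
print(pca_loss(W, z))                          # 16.0
print(kmeans_loss(centroids, z))               # 1.0
```

Neither loss is convex in the model, which is why these problems fall under the structured nonconvex category discussed in this paper.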
In this paper, our discussion of machine learning revolves around the statistical learning viewpoint. To this end, we suppose each data sample $\mathbf{z}$ is drawn from some unknown probability distribution $\mathcal{D}$ that is supported on $\mathcal{Z}$. The overarching goal in (statistical) machine learning then is to obtain a model that has the smallest loss averaged over all $\mathbf{z} \in \mathcal{Z}$. Specifically, let
$$f(\mathbf{w}) := \mathbb{E}_{\mathbf{z} \sim \mathcal{D}}\big[\ell(\mathbf{w}, \mathbf{z})\big] \tag{1}$$
denote the expected loss, also referred to as (statistical) risk, associated with model $\mathbf{w}$ for the entire data space $\mathcal{Z}$. Then, the objective of machine learning from the statistical learning perspective is to approach the Bayes optimal solution $\mathbf{w}^*$ that minimizes the statistical risk, i.e.,
$$\mathbf{w}^* \in \arg\min_{\mathbf{w} \in \mathcal{W}} f(\mathbf{w}). \tag{2}$$
The risk incurred by $\mathbf{w}^*$ (i.e., $f(\mathbf{w}^*)$) is termed Bayes’ risk. The main challenge in machine learning is that the distribution $\mathcal{D}$ is unknown and therefore (2) cannot be directly solved. Instead, one uses training data samples that are independently drawn from $\mathcal{D}$ to obtain a model $\hat{\mathbf{w}}$ whose risk comes close to Bayes’ risk as a function of the number of training samples. In particular, the performance of a machine learning algorithm is measured in terms of either the excess risk of its solution, defined as $f(\hat{\mathbf{w}}) - f(\mathbf{w}^*)$, or the parameter estimation error, calculated in terms of some distance between the solution $\hat{\mathbf{w}}$ and the set of minimizers $\arg\min_{\mathbf{w} \in \mathcal{W}} f(\mathbf{w})$.
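To make the notion of excess risk concrete, consider a toy problem that is not taken from the paper: scalar mean estimation with the loss 0.5*(w - z)^2. The risk is then minimized at the mean of the data distribution, and the excess risk of any w is 0.5*(w - mean)^2. The sketch below runs single-pass SGD with step size 1/t over a simulated stream and reports the excess risk; the Gaussian stream, the step-size rule, and all names are assumptions made for illustration.

```python
import random

def excess_risk(w, mu):
    """Excess risk f(w) - f(w*) for the loss 0.5*(w - z)**2 when E[z] = mu."""
    return 0.5 * (w - mu) ** 2

def single_pass_sgd(stream):
    """One SGD step per arriving sample with step size 1/t; no sample is revisited."""
    w = 0.0
    for t, z in enumerate(stream, start=1):
        grad = w - z          # gradient of 0.5*(w - z)**2 at the current iterate
        w -= grad / t
    return w

rng = random.Random(0)
mu = 2.0                      # mean of the simulated stream (unknown to the learner)
for T in (10, 1000, 100000):
    w_hat = single_pass_sgd(rng.gauss(mu, 1.0) for _ in range(T))
    print(T, excess_risk(w_hat, mu))
```

With this particular step size the iterate equals the running average of the samples seen so far, so the expected excess risk decays in proportion to 1/T for a unit-variance stream.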
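Since optimization is central to machine learning, the geometrical structure and properties of the loss function determine whether and how easily a method finds a solution that has (nearly) minimal excess risk / estimation error. We describe this structure and properties of $\ell(\mathbf{w}, \mathbf{z})$ in terms of its gradients, convexity, and smoothness.
Definition 1 (Existence of Gradients).
A loss function $\ell(\mathbf{w}, \mathbf{z})$ is said to have its gradients exist everywhere if $\nabla_{\mathbf{w}} \ell(\mathbf{w}, \mathbf{z})$ exists for all $(\mathbf{w}, \mathbf{z}) \in \mathcal{W} \times \mathcal{Z}$.
Definition 2 (Convexity and Strong Convexity).
A loss function $\ell(\mathbf{w}, \mathbf{z})$ is convex in $\mathbf{w}$ if for all $\mathbf{z} \in \mathcal{Z}$, all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, and all $\alpha \in [0, 1]$, we have
$$\ell(\alpha \mathbf{w}_1 + (1 - \alpha) \mathbf{w}_2, \mathbf{z}) \leq \alpha\, \ell(\mathbf{w}_1, \mathbf{z}) + (1 - \alpha)\, \ell(\mathbf{w}_2, \mathbf{z}).$$
In words, the function $\ell(\cdot, \mathbf{z})$ for all $\mathbf{z} \in \mathcal{Z}$ must lie below any chord for the loss function to be convex in $\mathbf{w}$. Further, a loss function whose gradients exist everywhere is said to be strongly convex with modulus $m > 0$ if for all $\mathbf{z} \in \mathcal{Z}$ and all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, we have
$$\ell(\mathbf{w}_2, \mathbf{z}) \geq \ell(\mathbf{w}_1, \mathbf{z}) + \nabla_{\mathbf{w}} \ell(\mathbf{w}_1, \mathbf{z})^T (\mathbf{w}_2 - \mathbf{w}_1) + \frac{m}{2} \|\mathbf{w}_2 - \mathbf{w}_1\|_2^2.$$
Definition 3 (Smoothness).
We say that a loss function whose gradients exist everywhere is smooth if its gradients are Lipschitz continuous with some constant $L > 0$, i.e., for all $\mathbf{z} \in \mathcal{Z}$ and all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, we have
$$\|\nabla_{\mathbf{w}} \ell(\mathbf{w}_1, \mathbf{z}) - \nabla_{\mathbf{w}} \ell(\mathbf{w}_2, \mathbf{z})\|_2 \leq L \|\mathbf{w}_1 - \mathbf{w}_2\|_2.$$
Going forward, we drop the subscript $\mathbf{w}$ in $\nabla_{\mathbf{w}}$ for notational compactness. Note that in the case of a smooth, convex (loss) function, gradient-based local search methods are guaranteed to converge to a global minimizer of the function. In addition, the global minimizer is unique for strongly convex functions and convergence of gradient-based methods to the minimizer of these functions is provably fast.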
Our discussion revolves around both convex and (certain structured) nonconvex loss functions. Some of it in relation to convex losses requires an assumption on the variance of the gradients with respect to the data distribution.
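Definition 4 (Gradient Noise).
We say the gradients of $\ell(\mathbf{w}, \mathbf{z})$ have bounded variance if for every $\mathbf{w} \in \mathcal{W}$, we have
$$\mathbb{E}_{\mathbf{z} \sim \mathcal{D}}\Big[\big\|\nabla \ell(\mathbf{w}, \mathbf{z}) - \mathbb{E}_{\mathbf{z}' \sim \mathcal{D}}\big[\nabla \ell(\mathbf{w}, \mathbf{z}')\big]\big\|_2^2\Big] \leq \sigma^2.$$
In the following, we term $\sigma^2$ the gradient noise variance. In addition, we use the notion of single-sample covariance noise variance in lieu of gradient noise variance in relation to our discussion of the nonconvex loss function associated with the $k$-PCA problem.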
Definition 5 (Sample-covariance Noise).
We say the single-sample covariance matrix $\mathbf{z}\mathbf{z}^T$ associated with data sample $\mathbf{z}$ drawn from distribution $\mathcal{D}$ has bounded variance if we have
$$\mathbb{E}_{\mathbf{z} \sim \mathcal{D}}\Big[\big\|\mathbf{z}\mathbf{z}^T - \mathbb{E}_{\mathbf{z}' \sim \mathcal{D}}\big[\mathbf{z}'\mathbf{z}'^T\big]\big\|_F^2\Big] \leq \sigma^2.$$
The gradient (resp., sample covariance) noise variance controls the error associated with evaluating the gradient (resp., sample covariance) at individual sample points instead of evaluating it at the statistical mean of the unknown distribution $\mathcal{D}$. Smaller gradient (resp., sample covariance) noise variance results in faster convergence, and the main message of this paper is that leveraging distributed streams to average out gradient (resp., sample covariance) noise is often an optimum way to speed up convergence in compute- and/or communications-limited regimes.
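The claim that averaging reduces gradient noise can be checked numerically. The sketch below is a toy experiment, not one from the paper: it uses the scalar loss 0.5*(w - z)^2 with Gaussian samples, so a single-sample gradient has variance equal to the data variance, and it estimates by Monte Carlo the variance of a gradient averaged over a mini-batch of B independent samples. All names and parameter values are assumptions.

```python
import random
import statistics

def minibatch_gradient(w, batch):
    """Average of the per-sample gradients of 0.5*(w - z)**2 over a mini-batch."""
    return sum(w - z for z in batch) / len(batch)

def empirical_gradient_variance(w, batch_size, trials, rng, mu=0.0, sigma=1.0):
    """Monte Carlo estimate of the variance of the mini-batch gradient at w."""
    grads = [
        minibatch_gradient(w, [rng.gauss(mu, sigma) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.pvariance(grads)

rng = random.Random(1)
for B in (1, 4, 16, 64):
    var = empirical_gradient_variance(w=0.5, batch_size=B, trials=5000, rng=rng)
    print(f"B={B:3d}  estimated variance={var:.4f}  sigma^2/B={1.0 / B:.4f}")
```

For independent samples the variance of the averaged gradient is the single-sample variance divided by B, which is the mechanism behind the mini-batching strategies discussed in the rest of the paper.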
The last definition we need is that of a bounded model space, which plays a role in the analysis of optimization methods for convex loss functions.
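Definition 6 (Bounded model space).
Let $D_{\mathcal{W}} := \sup_{\mathbf{u}, \mathbf{v} \in \mathcal{W}} \|\mathbf{u} - \mathbf{v}\|_2$ denote the expanse of the model space $\mathcal{W}$. We say that an optimization problem has bounded model space if $D_{\mathcal{W}} < \infty$.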
Since our focus is training from fast streaming data that necessitates distributed processing, we next formalize the distributed processing / communications framework underlying the algorithms being discussed in the paper.
II-B Distributed Training of Machine Learning Models from Streaming Data
In addition to optimality, in the face of large volumes and high dimensionality of data in modern applications, the solution also needs to be efficient in terms of resource utilization (e.g., computation, communication, storage, and energy). In Section I, we discussed three mainstream distributed frameworks for resource-efficient machine learning, where each of the frameworks is primarily designed to adhere to specific practical constraints posed by characteristics of the training data. One such characteristic is the physical locality of data, which results in the following two common scenarios involving streaming data: (i) for the master–worker learning framework, the data stream arrives at a single master node and, in order to ease the computational load and accelerate training, the data stream is then divided among a total of $N$ worker nodes (Fig. 1(b) and Fig. 3(a)), or (ii) for the federated learning and edge computing frameworks, there is a collection of geographically distributed nodes, each of which receives its own independent stream of data, and the goal is to learn a machine learning model using information from all these nodes (Fig. 1(c), Fig. 1(d), and Fig. 3(b)). Despite the apparent physical differences between these two scenarios, we can study them under a unified abstraction that assumes the data are arriving at a hypothetical “data splitter” that then evenly distributes the data across an interconnected network of nodes for distributed processing (Fig. 3(c)).
Mathematically, let us discretize the data arrival time as $t' = 1, 2, \dots$, and let $\{\mathbf{z}_{t'}\}_{t' \in \mathbb{Z}_+}$ be a stream of independent and identically distributed (i.i.d.) data samples, each drawn from $\mathcal{D}$, arriving at the splitter at a fixed rate of $R_s$ samples per second. The splitter then evenly distributes the data stream across a network of $N$ nodes, which we represent by an undirected connected graph $\mathcal{G} := (\mathcal{V}, \mathcal{E})$; here, $\mathcal{V} := \{1, \dots, N\}$ denotes the set of all nodes in the network and $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ denotes the set of edges corresponding to the communication links between these nodes, i.e., $(n, m) \in \mathcal{E}$ means there is a communication link between nodes $n$ and $m$. We also define $t$ to be the index that denotes the total number of data-splitting operations that have been performed within the system. Without loss of generality, we take each data-splitting round to be the time in which nodes carry out a single iteration of a distributed algorithm; i.e., after $t$ data-splitting rounds, the nodes have carried out $t$ iterations of the distributed algorithm under study.
We next set the notation for the distributed data streams within our data-splitting abstraction to facilitate the prevalent practice of training using mini-batches of data samples. To this end, we assume without loss of generality that a total of $B$ samples arrive at the network during each data-splitting round. That is, a system-wide mini-batch of size $B$ is processed by the network during each algorithmic iteration (see, e.g., Fig. 4). Hence, each splitting operation results in a mini-batch of size $B/N$ arriving at each of the $N$ nodes. The data splitting across nodes in the system therefore gives rise to $N$ i.i.d. streams of mini-batched data, where we denote the $B/N$ i.i.d. data samples within the $t$-th mini-batch at node $n$ as $\{\mathbf{z}_{n,b,t}\}_{b=1}^{B/N}$, with the mapping of these samples to the ones in the original data stream $\{\mathbf{z}_{t'}\}$ given in terms of the relationship $t' = (t-1)B + (n-1)B/N + b$.
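The data-splitting abstraction can be sketched as a small generator that buffers one system-wide mini-batch and hands each node an equal share. The contiguous-block assignment of samples to nodes used here is an assumption made for illustration (any even split would fit the description), and the function and parameter names are hypothetical.

```python
def split_stream(stream, num_nodes, batch_size):
    """Yield, for each data-splitting round, a list of num_nodes local mini-batches.

    Assumption (hypothetical layout): within a round, node n receives the n-th
    contiguous block of batch_size // num_nodes samples.
    """
    if batch_size % num_nodes != 0:
        raise ValueError("batch_size must be a multiple of num_nodes")
    local = batch_size // num_nodes
    buffer = []
    for sample in stream:
        buffer.append(sample)
        if len(buffer) == batch_size:
            # One system-wide mini-batch is complete: hand out equal local shares.
            yield [buffer[n * local:(n + 1) * local] for n in range(num_nodes)]
            buffer = []

# Twelve samples labeled by arrival index, 3 nodes, system-wide mini-batch of 6.
rounds = list(split_stream(range(1, 13), num_nodes=3, batch_size=6))
print(rounds[0])  # [[1, 2], [3, 4], [5, 6]]
print(rounds[1])  # [[7, 8], [9, 10], [11, 12]]
```

Each round corresponds to one iteration of the distributed algorithm, and a trailing partial mini-batch is simply held in the buffer until enough samples arrive.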
Given this distributed, streaming data model, our goal is the study of machine learning algorithms that can efficiently process and incorporate the newest-arriving network-wide samples into a running approximation of the Bayes optimal solution (cf. (2)) before the arrival of the next mini-batch of data. In order to highlight the challenges involved in the design of such algorithms, we can divide the task of processing a mini-batch of samples within the network into two phases (cf. Fig. 4): (i) the computation phase, in which each node performs computations over its local mini-batch of data samples, and (ii) the subsequent communications phase, in which nodes share the outcomes of their local computations with each other for eventual incorporation into the (network-wide, decentralized) machine learning models.
Consider now the compute-limited regime within our framework, in which the distributed system comprising compute nodes is incapable of finishing computations on samples between two consecutive data-splitting instances because of the fast data streaming rate. (Indeed, the time between two data-splitting instances decreases as increases.) One could push the system out of this compute-limited regime by adding more compute nodes to the system. Keeping the system-wide mini-batch size fixed (and large), this will result in smaller local mini-batch size . Alternatively, keeping the local mini-batch size fixed, this will result in larger time between two data-splitting instances. And in either case, the system no longer remains compute limited. However, as one adds more and more compute nodes into the distributed system, it could be pushed into the communications-limited regime, in which the size/topology of the network prevents the nodes from completing full exchange of their local computations between two consecutive data-splitting instances. This communications-limited regime—which becomes especially pronounced in systems with slower communications links—can only be mitigated through larger data-splitting intervals, which in turn necessitates a larger for any fixed data streaming rate . But this can again push the system into the compute-limited regime. Therefore, any machine learning algorithm intending to process fast streaming data in an optimal fashion must strike a balance between the compute- and the communications-limited regimes through judicious choices of system parameters such as and . We now formalize some of this discussion in the following, which should lead to a better understanding of the interplay between the data streaming rate, the computational capabilities of compute nodes, the communications capabilities of the network, the system-wide mini-batch size , and the number of compute nodes in distributed systems.
II-C Interplay Between System Parameters in Distributed, Streaming Machine Learning
We have already defined as the number of data samples arriving per second at the splitter. We also assume the compute nodes in the system to be homogeneous in nature and use to denote the processing/compute rate of each of these nodes, defined as the number of data samples per second that can be locally processed per node during the computation phase. Distributed algorithms also involve the use of message passing routines for inter-node communications. We use to denote the rate of messages shared among nodes using such routines, defined as the number of messages (synchronously) communicated between nodes per second during the communications phase. This parameter also subsumes within itself any overhead associated with implementation of the message passing routine such as time spent on additional computations or communications necessitated by the implementation.
Distributed machine learning algorithms typically involve multiple message passing rounds within the communications phase (cf. Fig. 4), which we denote by . This parameter , which we assume remains fixed for the duration of the training, can be constrained in terms of the system parameters , , , , and as follows:
Our focus in this paper is on algorithms that make use of either “exact” or “inexact” distributed averaging procedures within the communications phase for information sharing. Specifically, let be a set of vectors that is distributed across the nodes in the network at the start of any communications phase and define to be an estimate of the average of these vectors at node . We then have the following communications-related characterizations of the algorithms being studied in the paper.
Exact averaging algorithms. After message passing rounds within the communications phase, these algorithms can exactly estimate the average at each node, i.e., .
Inexact averaging algorithms. After message passing rounds within the communications phase, these algorithms can only guarantee -accurate estimates, i.e., for some parameter that typically increases as decreases and/or increases.
Exact averaging algorithms often find applications in settings like high-performance computing clusters and enterprise cloud computing systems, where communications are typically fast and reliable. In contrast, inexact averaging algorithms tend to be more prevalent in settings like edge computing systems, multiagent systems, and IoT systems, where the network connectivity can be sparser and the communications tend to be slower and unreliable.
We have now described all the system parameters needed to formalize the notion of effective (mini-batch) processing rate, , of the distributed system, which is defined as the number of mini-batches comprising samples that can be processed by the system per second. (In the non-distributed setting, corresponding to , it is straightforward to see that .) Under the assumption of a synchronous system in which computation and communications phases are carried out one after the other, the parameter can be defined as follows:
This expression formally highlights the tradeoff between the compute-limited and the communications-limited regimes. In the case of fixed and , for instance, increasing the effective processing rate requires an increase in . Doing so, however, necessitates an increase in that—beyond a certain point—can only be accomplished through an increase in (cf. (3)), which in turn also increases the first term in (4).
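The tradeoff can be made concrete with a small sketch. The code below assumes, consistent with the definitions of the processing and communications rates given earlier, that the computation phase of one iteration lasts (local mini-batch size)/(per-node processing rate) seconds and that the communications phase lasts (number of rounds)/(communications rate) seconds. The function and argument names are hypothetical, and the numerical values are chosen only so that the arithmetic is exact.

```python
def effective_rate(batch_size, num_nodes, proc_rate, comm_rate, rounds):
    """Mini-batches processed per second when the two phases run back to back."""
    compute_time = batch_size / (num_nodes * proc_rate)  # each node handles B/N samples
    comm_time = rounds / comm_rate                       # message-passing rounds per iteration
    return 1.0 / (compute_time + comm_time)


def max_rounds(batch_size, num_nodes, stream_rate, proc_rate, comm_rate):
    """Largest number of rounds that fits between two data-splitting instances."""
    # Time between splits minus time spent computing, converted into rounds.
    slack = batch_size / stream_rate - batch_size / (num_nodes * proc_rate)
    return max(0, int(slack * comm_rate))


# Hypothetical system: 8 nodes, 4096 samples/s arriving, 1024 samples/s per node,
# 64 messages/s, and a system-wide mini-batch of 1024 samples.
r_max = max_rounds(1024, 8, 4096.0, 1024.0, 64.0)
r_e = effective_rate(1024, 8, 1024.0, 64.0, r_max)
```

With these numbers the system can afford at most 8 rounds per iteration, at which point it processes 4 mini-batches (4096 samples) per second and exactly keeps pace with the stream. Setting `num_nodes=1` and `rounds=0` recovers the non-distributed processing rate.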
The overarching theme of this paper is discussion of algorithmic strategies that can be used to tackle the challenge of near-optimal training of machine learning models from fast streaming data, where “fast” is defined in the sense that . This discussion involves allowable selections of system parameters such as the network-wide mini-batch size , the number of nodes , and the number of communications rounds that facilitate taming of the fast incoming data stream without compromising the fidelity of the final solution. In particular, the recommended strategies end up pushing the ratio to satisfy either or for an appropriate parameter , where the latter scenario involves discarding of samples per splitting instance at the data splitter.
In order to prime the reader for subsequent discussion, we also provide a simple example in Fig. 5 that illustrates the impact of the choice of (network-wide) mini-batch size on system performance. We suppose a network of compute nodes, and focus on the exact averaging paradigm described above. We assume a data streaming rate of samples per second, whereas the data processing rate per node is taken to be samples per second. We plot the ratio of the streaming rate and the effective (mini-batch) processing rate as defined in (4), for communications rates and , as a function of the mini-batch size . As noted earlier, the number of samples effectively processed by the network keeps pace with the number of samples arriving at the system provided , and we observe that for sufficiently large mini-batch size , the ratio indeed drops below the line plotted in Fig. 5.
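A rough numerical analogue of this kind of experiment is sketched below. The parameter values are hypothetical and are not those used in Fig. 5; the iteration time is again assumed to be the sum of a computation phase and a communications phase with a fixed number of rounds. The sketch evaluates the ratio of samples arriving per second to samples processed per second for several mini-batch sizes and reports the smallest one for which the system keeps pace.

```python
def pace_ratio(batch_size, num_nodes, stream_rate, proc_rate, comm_rate, rounds):
    """Samples arriving per second divided by samples processed per second."""
    iteration_time = batch_size / (num_nodes * proc_rate) + rounds / comm_rate
    return stream_rate * iteration_time / batch_size


# Hypothetical system parameters (assumed for illustration only).
params = dict(num_nodes=8, stream_rate=4096.0, proc_rate=1024.0, comm_rate=64.0, rounds=4)
ratios = {b: pace_ratio(b, **params) for b in (64, 128, 256, 512, 1024, 2048)}

# Smallest candidate mini-batch size for which the ratio does not exceed one.
smallest_ok = min(b for b, r in ratios.items() if r <= 1.0)
```

For these values the ratio falls from 4.5 at a mini-batch size of 64 to 0.625 at 2048, and first reaches one at 512. The fixed communications cost is amortized over more samples as the mini-batch grows, which is why the ratio decreases.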
Next, we also overlay corresponding plots of the excess risk predicted for Distributed Minibatch SGD, presented in Section IV-A, after samples have arrived at the system. These plots show that increased mini-batch size helps the excess risk, but only to a point. Eventually, becomes so large that the reduction in the number of algorithmic iterations carried out by the network hurts the overall performance more than the increase in the effective processing rate helps it. This illustrates that the mini-batch size must be chosen judiciously, and in the following sections we will discuss theoretical results that shed light on this choice.
III An Overview of the Technical Landscape
This paper ties together research in optimization and distributed processing within the context of machine learning. To elucidate the state of the art and set the stage for the results described in Sections IV and V, we present an overview of these areas and describe in detail key results that will be used later.
III-A Optimization for Machine Learning
As mentioned in Section I-C, the literature on optimization for machine learning can be roughly divided into two interrelated frameworks, namely, statistical risk minimization (SRM) and empirical risk minimization (ERM). Both these frameworks aim to find a solution to the statistical optimization problem (2) and, as such, fall under the broad category of stochastic optimization (SO) within the optimization literature . In particular, the SRM framework is often referred to as stochastic approximation (SA) and the ERM framework is sometimes termed sample-average approximation (SAA) in the literature . In terms of specifics, the SA/SRM framework considers directly the statistical learning problem (2), and researchers have developed algorithms that minimize the risk using “noisy” (stochastic) samples of its gradient . In contrast, the risk in the SAA/ERM framework is approximated by the empirical distribution over a fixed training dataset of data samples. This empirical risk, defined as , is then minimized directly within the ERM framework, usually via some form of gradient-based (first-order) deterministic optimization methods. In the following, we describe a few key results from these two frameworks that are the most relevant to our discussion in this paper.
III-A1 Stochastic Approximation (SA)
The general assumption within the SA framework is that one has access to a stream of noisy gradients in order to solve (2), where the noisy gradient at iteration is defined as
with denoting i.i.d. noise with mean zero and finite variance, i.e., . In the parlance of SA, we have access to a first-order “oracle” that can be queried for a noisy gradient evaluated at the query point . In the parlance of machine learning, we have a stream of data samples , each drawn i.i.d. according to the data distribution , and we solve (2) using the gradients , which have gradient noise variance as defined in Definition 4. (Note that the data arrival index and the algorithmic iteration index are one and the same in a centralized setting; we are using here in lieu of to facilitate comparisons with results in distributed settings.) It is straightforward to verify that these two formulations are equivalent: , so we can define to be the zero-mean gradient noise in our problem setup.
The prototypical SA algorithm for loss functions whose gradients exist is stochastic gradient descent (SGD) , in which iterations/iterates take the form
where denotes projection onto the constraint set and denotes an appropriate stepsize that is either fixed (constant stepsize) or that decays to with increasing according to a prescribed strategy (decaying stepsize).
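The projected SGD iteration can be sketched as follows. This is an illustrative example rather than a reference implementation: the constraint set is assumed to be a Euclidean ball, the oracle is a hypothetical noisy gradient of a simple quadratic risk, and the decaying stepsize 1/t is one choice among many.

```python
import numpy as np


def project_l2_ball(w, radius):
    """Euclidean projection onto the ball of the given radius (assumed constraint set)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)


def sgd(grad_oracle, w0, num_iters, stepsize, radius):
    """Single-pass projected SGD: one oracle query and one projected step per iteration."""
    w = np.asarray(w0, dtype=float)
    for t in range(1, num_iters + 1):
        g = grad_oracle(w)                                 # noisy gradient at the current iterate
        w = project_l2_ball(w - stepsize(t) * g, radius)   # gradient step, then projection
    return w


rng = np.random.default_rng(0)
w_star = np.array([0.5, -0.25])


def noisy_grad(w):
    # Gradient of 0.5 * ||w - w_star||^2 plus zero-mean Gaussian noise.
    return (w - w_star) + 0.1 * rng.standard_normal(w.shape)


w_hat = sgd(noisy_grad, w0=np.zeros(2), num_iters=2000,
            stepsize=lambda t: 1.0 / t, radius=1.0)
```

Each call to `noisy_grad` plays the role of one fresh data sample, so no sample is reused across iterations.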
The term ‘stochastic gradient descent’ is overloaded in the literature. Many papers (e.g., [41, 42]) use the term in the SA sense described here, with a continuous stream of data in which no sample is used more than once. However, other papers (e.g., [2, 43]) use the term within the ERM framework to describe algorithms that operate on a fixed dataset, from which mini-batches of data are sampled with replacement and noisy gradients are computed. To disambiguate, some authors (e.g., ) use the term single-pass SGD to indicate the former usage.
Convex Problems. A common elaboration on SGD for convex loss functions is Polyak–Ruppert averaging [45, 46, 47, 41], in which a running average of iterates is maintained as The convergence rates of SGD for convex SA have been studied under a variety of settings, both with and without iterate averaging. The following result with a modified form of Polyak–Ruppert averaging comes from , in which iterate averaging takes the form
For convex and smooth loss functions with (gradient) Lipschitz constant , gradient noise variance , and bounded model space with expanse , there exist stepsizes such that the approximation error of SGD with iterate averaging in (7) satisfies:
In , an optimal constant stepsize is given in the case where the optimization ends at a finite time horizon known in advance. In this case, the prescribed stepsize is and this achieves the bound in (8). When the time horizon is unknown, a varying stepsize policy achieves expected excess risk , which is optimum for much larger than . For simplicity, we are working with the optimum stepsize proposed in  to retain the analysis for not necessarily much larger than .
It is desirable in some applications to state the SGD results in terms of convergence of the averaged iterate to . In the case of convex, smooth, and twice continuously differentiable loss functions,  provides such results for Polyak–Ruppert averaging in the almost sure sense and also proves asymptotic normality of , i.e., converges to a zero-mean Gaussian vector. In the case of strongly convex and smooth loss functions,  derives non-asymptotic convergence results for the Polyak–Ruppert averaged iterate in the mean-square sense. However, since machine learning is often concerned with minimizing the excess risk , we do not indulge further in discussion of convergence of the SGD iterates to the Bayes optimal solution .
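For reference, the running average of iterates used in Polyak–Ruppert averaging can be maintained incrementally, without storing past iterates. The sketch below implements the plain uniform average only; the modified weighting used in (7) is not reproduced here, and the helper name `running_average` is an assumption.

```python
import numpy as np


def running_average(iterates):
    """Uniform average of the iterates, updated one iterate at a time."""
    avg = np.zeros_like(np.asarray(iterates[0], dtype=float))
    for t, w in enumerate(iterates, start=1):
        # Incremental form of avg_t = ((t - 1) * avg_{t-1} + w_t) / t.
        avg += (np.asarray(w, dtype=float) - avg) / t
    return avg


w_av = running_average([np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 9.0])])
```

In a streaming implementation the loop body would sit inside the SGD loop, so that the averaged iterate is available after every update at constant memory cost.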
A natural question is whether the convergence rate of Theorem 1 can be improved upon by another algorithm. It has been shown that incorporating Nesterov’s acceleration  into SGD can indeed improve this rate somewhat. Roughly speaking, Nesterov’s acceleration introduces a “momentum” term into the SGD iterations, allowing the directions of previous gradients to impact the direction taken during the current step and thereby speeding up convergence. The following formulation is an SGD-based simplification of the accelerated stochastic mirror descent algorithm of . Define the accelerated SGD updates as follows:
where , and and are stepsizes. We then have the following result from .
For convex and smooth loss functions with (gradient) Lipschitz constant , gradient noise variance , and bounded model space with expanse , there are stepsizes and such that the expected risk of accelerated SGD is bounded by
Both Theorem 1 and Theorem 2 explicitly bring out the dependence of the convergence rates on gradient noise variance . And in doing so, they hint at the potential advantages of (centralized or distributed) mini-batching of data for improved performance. In both instances, as the number of samples/iterations goes to infinity and all else is held constant, the terms dominate the convergence rates in (8) and (12). But if the terms can be made to decay faster than their respective first terms then both SGD and accelerated SGD result in improved asymptotic convergence rates in the limit of many data (or gradient) samples.
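To give a feel for the accelerated updates referenced in Theorem 2, the following sketch implements one common three-sequence form of accelerated SGD with Euclidean projection, in the style of accelerated stochastic approximation. It is not claimed to match the paper's updates term by term: the parameter choices (interpolation weight 2/(t+1) and stepsize (t+1)/(4L)) are assumptions that suit the noise-free case, and smaller stepsizes are typically needed when the gradient noise is significant.

```python
import numpy as np


def accelerated_sgd(grad_oracle, w0, num_iters, lipschitz, radius):
    """Three-sequence accelerated SGD with assumed parameter schedules."""
    v = np.asarray(w0, dtype=float)   # sequence that takes the gradient steps
    w = v.copy()                      # averaged sequence returned as the estimate
    for t in range(1, num_iters + 1):
        beta_inv = 2.0 / (t + 1)
        eta = (t + 1) / (4.0 * lipschitz)
        u = beta_inv * v + (1.0 - beta_inv) * w   # query point for the oracle
        v = v - eta * grad_oracle(u)
        norm = np.linalg.norm(v)
        if norm > radius:                         # projection onto a Euclidean ball
            v = v * (radius / norm)
        w = beta_inv * v + (1.0 - beta_inv) * w   # momentum-style averaging
    return w


w_star = np.array([0.5, -0.25])
# Noise-free gradient of 0.5 * ||u - w_star||^2, used here to keep the example deterministic.
w_acc = accelerated_sgd(lambda u: u - w_star, np.zeros(2),
                        num_iters=100, lipschitz=1.0, radius=1.0)
```

The query point `u` mixes the two sequences, which is how the directions of earlier gradients influence the current step.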
We conclude our discussion of SA for convex loss functions by noting that the convergence rate of accelerated SGD is provably optimal for smooth, convex SA problems in the minimax sense: there is no single algorithm that can converge for all such SA problems at a rate faster than . (See  for an argument for this.) However, generalized and sometimes improved rates are possible outside of the regime of this setting. In particular, when is smooth and strongly convex, a convergence rate of is possible for bounded away from zero, and it is the minimax rate [49, 41]. Results are also available when the loss function is non-smooth, when the solution is sparse or otherwise structured, and when the optimization space has a geometry that can be exploited to speed up convergence [50, 51, 3].
Nonconvex Problems. Nonconvex functions can have three types of critical points, defined as points for which : saddle points, local minima, and global minima. This makes optimization of nonconvex (loss) functions using only first-order (gradient) information challenging. While works such as [52, 53, 54, 55, 56, 57, 58] provide convergence rates for nonconvex problems that are similar to their convex programming analogs, the convergence is only guaranteed to a critical point that is not necessarily a global optimum. Nonetheless, global optimization of nonconvex SA problems has been studied in the literature under a variety of assumptions on the geometry of objective functions. A major strand of work in this direction involves modifying the canonical SGD algorithm by injecting slowly decreasing Monte Carlo noise in its iterations. The resulting SA methods have been investigated in works such as [59, 60, 61, 62, 63, 64, 65] under the monikers of (continuous) simulated annealing and stochastic gradient Langevin dynamics. (Strictly speaking,  does not fall under the SA framework being discussed in this section.) A recent work  also provides global convergence guarantees for SGD for the class of (nonconvex) Morse functions.
Another major strand of work in global optimization of nonconvex functions involves explicit exploitation of the geometry of structured nonconvex problems such as principal component analysis (PCA), dictionary learning, phase retrieval, and low-rank matrix completion for global convergence guarantees. In this paper, we focus on one such structured nonconvex SA problem that corresponds to estimating the top eigenvector of the covariance matrix of i.i.d. samples . The investigation of this -PCA problem in the paper, whose global convergence behavior has been investigated in works such as [67, 68, 69, 42], serves two purposes. First, it helps validate the generality of the main message of this paper that the mismatches between the data streaming rate, compute rate, and communications rate can be accounted for through judicious choices of system parameters such as , , and . Second, it helps crystallize the key characteristics of any global convergence analysis of nonconvex problems that can facilitate the convergence speed-up guarantees for the distributed mini-batch framework.
The loss function for the -PCA problem under the assumption of zero-mean distribution supported on and having covariance matrix takes the form
Note that and the optimal solution corresponds to the dominant eigenvector of . In this paper, we focus on the SA approach termed Krasulina’s method  that approximates the optimal solution from data stream using iterations of the form
Notice that changing to in (14) gives us the SGD iteration. Despite the empirical success of SA iterations such as (14) in approximating the top eigenvector of , earlier works only provided asymptotic convergence guarantees for such methods. Recent studies such as [67, 42, 68, 71, 72] have filled this gap by providing non-asymptotic results. The following theorem, which is due to , provides guarantees for Krasulina’s method.
Let the i.i.d. data samples be bounded, i.e., , define , fix any , and define for any . Next, pick any
and choose the stepsize sequence as . Then there exists a sequence of nested subsets of the sample space such that and
where is the conditional expectation over , and and are constants defined as
The convergence guarantees in Theorem 3 depend on problem parameters such as , , and . Recent works [71, 73] have provided lower bounds on the dependence of convergence rates on these parameters for the stochastic PCA problem. Theorem 3 achieves these lower bounds with respect to and up to logarithmic factors. But the dependence on data dimension in Theorem 3 is , while the lower bound suggests dependence. In addition, convergence guarantees for a variant of Krasulina’s method termed Oja’s algorithm are known to achieve this lower bound dependence on data dimensionality [71, 68, 74, 72].
Despite this somewhat suboptimal nature of Theorem 3, Krasulina’s method lends itself to relatively simpler analysis for the distributed (mini-batch) framework being studied in this paper. Specifically, as alluded to in our discussion in Section II-A, implicit averaging out of the sample-covariance noise is the key reason for the potential speed-up in convergence within any distributed processing framework. And while Theorem 3 does not have an explicit dependence on the noise variance , a variance-based analysis of Krasulina’s method—discussed in detail in Section IV and having similar dependence on , , and as Theorem 3—has been provided in a recent work . In contrast, results in [71, 72] are oblivious to the variance in sample covariance and hence cannot be used to show faster convergence within distributed frameworks. On the other hand, while the results in [68, 74] do take the noise variance into account, the probability of success in these works cannot be improved beyond in a single-pass SA setting.
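Krasulina's method itself is short enough to sketch directly. The code below uses the standard form of the update, in which each sample contributes its outer product applied to the current iterate, minus the component of that vector along the iterate. The stepsize schedule c/(offset + t), the constants, and the Gaussian test data are assumptions for illustration; in particular, Gaussian samples are unbounded and therefore fall outside the boundedness assumption of Theorem 3.

```python
import numpy as np


def krasulina(samples, w0, c, offset):
    """One pass over the samples with stepsize eta_t = c / (offset + t)."""
    w = np.asarray(w0, dtype=float)
    for t, x in enumerate(samples, start=1):
        eta = c / (offset + t)
        xw = x @ w
        # (x x^T) w minus its projection onto w; the update is orthogonal to w.
        w = w + eta * (xw * x - (xw ** 2 / (w @ w)) * w)
    return w / np.linalg.norm(w)


rng = np.random.default_rng(0)
scales = np.array([2.0, 1.0, 0.5])                   # covariance is diag(4, 1, 0.25)
samples = rng.standard_normal((5000, 3)) * scales    # zero-mean i.i.d. stream
w_pca = krasulina(samples, w0=np.ones(3), c=1.0, offset=10.0)
alignment = abs(w_pca[0])                            # |cosine| with the top eigenvector e_1
```

Because the update direction is orthogonal to the current iterate, the norm of the iterate never decreases, and only its direction carries information about the top eigenvector.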
III-A2 Empirical Risk Minimization (ERM)
Given the fixed training dataset of i.i.d. samples drawn from the distribution and the corresponding empirical risk , the main objective within the ERM framework is to directly minimize in order to obtain the ERM solution . Such problems, sometimes referred to as finite-sum optimization problems, have traditionally been solved using (deterministic, projected) gradient descent or similar methods. But the advent of massive datasets has made direct computations of gradients of intractable. This has led to the development of several families of SGD-type methods for the ERM problem, where the stochasticity in these methods refers to noisy gradients of the empirical risk , as opposed to noisy gradients of the true risk within the (single-pass) SA framework. Specifically, the prototypical SGD algorithm for the ERM problem samples with replacement a single data sample (or a small mini-batch of samples) from in each iteration , computes the gradient , and takes a step in the negative of the computed gradient’s direction. The iterates of this particular SGD variant are known to converge reasonably fast to the ERM solution under various assumptions on the geometry of the loss function [2, 76].
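The contrast with single-pass SGD can be seen in a small sketch of the ERM variant, in which the same fixed dataset is sampled with replacement at every iteration. The least-squares loss, the noiseless labels (which make the ERM solution equal to the generating vector), and the constant stepsize are all assumptions chosen to keep the example short and deterministic under a fixed seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.standard_normal((n, d))           # fixed training inputs
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless labels: the ERM solution is w_true

w = np.zeros(d)
stepsize = 0.05
for _ in range(5000):
    i = rng.integers(n)                   # draw one index uniformly, with replacement
    residual = X[i] @ w - y[i]
    w -= stepsize * residual * X[i]       # gradient of 0.5 * (x_i^T w - y_i)^2

erm_error = np.linalg.norm(w - w_true)
```

Here 5000 iterations revisit each of the 200 samples many times, which is precisely what the single-pass SA setting rules out.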
A variety of adaptive and more elaborate SGD-style algorithms, such as Adagrad, RMSProp, and Adam [77, 78], which introduce adaptive stepsizes, momentum terms, and Nesterov-style acceleration, have been developed in recent years. Empirically, these methods provide faster convergence to at least a stationary point of the empirical risk, especially when training deep neural networks. (Note that some of these methods have provable convergence issues, even for convex problems.) A family of so-called variance-reduction methods [4, 69, 80, 81, 82], such as stochastic variance reduced gradient (SVRG), stochastically controlled stochastic gradient (SCSG), and Natasha, have also been developed in the literature for the ERM problem. In these methods, iterates from previous epochs are averaged to produce a low-complexity estimate of the gradient with provably small variance, which speeds up convergence. In terms of theoretical analysis, SGD-style and variance-reduction algorithms are studied in both convex and nonconvex settings. Unlike the SA framework, however, the convergence analysis of these methods for the ERM setting is in terms of the computational effort, measured in terms of the number of gradient evaluations, needed to approach a global optimum or a stationary point of the empirical risk .
Since optimization methods for the ERM framework primarily provide bounds on either or , a bound on the excess risk under the ERM setting necessitates additional analytical steps that typically involve bounding the generalization error, defined as , of the ERM solution. Classic generalization error bounds have been provided in terms of the Vapnik–Chervonenkis dimension or Rademacher complexity of the class of functions induced by [34, 83], or in terms of the uniform or so-called “leave-one-out” stability [84, 85, 86, 87] of the solution. Together, the optimization-theoretic bounds and the learning-theoretic bounds on quantities such as the generalization error result in excess risk bounds that decay at rates or for various loss functions as long as the number of optimization iterations is on the order of the number of training samples . Thus, the ERM framework can yield excess risk bounds that match the sample complexity of the ones under the SA framework. Nonetheless, we focus primarily on the SA setting in this paper for two reasons. First, we are concerned with the statistical optimization problem (2), and the SA framework measures performance directly with respect to this problem, whereas the ERM/finite-sum setting yields the final results only after a combination of optimization-theoretic and learning-theoretic bounds. Second, the SA framework is naturally well-suited to the setting of streaming data, whereas ERM supposes access to the entire dataset.
III-B Distributed Optimization and Machine Learning
Distributed optimization is an extremely broad field, with a rich history. In this paper, we only discuss the portion of the literature most relevant to our problem setting. Specifically, we focus on methods for distributing SGD-style algorithms over collections of computing devices and/or processors that communicate over networks defined by graphs and aggregate data by averaging information over the network. We further divide these methods into two categories, based on the nature of distributed averaging that is employed within each algorithm: exact averaging, in which processing nodes use a robust message passing interface (MPI) communications primitive such as AllReduce  to compute exact averages of gradients and/or iterates in the network, and inexact averaging, in which an approximate approach such as distributed consensus/diffusion [8, 9, 10] is used to approximate averages of gradients and/or iterates in the network. The former category of algorithms requires careful network configuration in order to coordinate AllReduce-style averaging, whereas the latter category requires minimal explicit configuration, but the algorithms can suffer from slower convergence due to approximation error in the averaging step.
III-B1 Exact Averaging and Distributed Machine Learning
In the case of algorithms utilizing exact averaging, processing nodes employ an MPI library to compute exact averages in a robust manner. While implementations differ, a generic approach is to compute averages over a spanning tree in the network. Reusing the notation introduced in Section II-C, let be the set of vectors distributed across the network at the start of the averaging subroutine and let denote their average. Then, the average can be obtained at each node in a two-pass manner. In the first pass, each leaf node in the spanning tree passes its vector to its parent node, which averages together the vectors of its child nodes and passes the average to its parent node; this process continues recursively until the root node has the average . In the second pass, the root node disseminates to the network by passing it to its child nodes; this continues recursively until all of the leaf nodes possess . This type of averaging is provably efficient, requiring only exchange of messages within the network.
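The two-pass procedure can be sketched in a few lines. This is a single-process simulation rather than an MPI program, and the tree layout and function name are hypothetical. One detail is made explicit: each node forwards a partial sum together with a count of contributing nodes, an assumed implementation choice that keeps the final average exact even when subtrees have different sizes.

```python
import numpy as np


def tree_average(children, root, vectors):
    """Two-pass exact averaging over a rooted spanning tree.

    Returns a dict mapping each node to the average, and the message count.
    """
    messages = 0

    def reduce_up(node):
        nonlocal messages
        total, count = np.asarray(vectors[node], dtype=float), 1
        for child in children.get(node, []):
            child_total, child_count = reduce_up(child)
            messages += 1                     # one child-to-parent message
            total = total + child_total
            count += child_count
        return total, count

    total, count = reduce_up(root)            # first pass: leaves to root
    average = total / count
    result = {}

    def broadcast_down(node):
        nonlocal messages
        result[node] = average
        for child in children.get(node, []):
            messages += 1                     # one parent-to-child message
            broadcast_down(child)

    broadcast_down(root)                      # second pass: root to leaves
    return result, messages


# Five nodes: node 0 is the root with children 1 and 2; node 1 has children 3 and 4.
children = {0: [1, 2], 1: [3, 4]}
vectors = {n: np.array([float(n), 1.0]) for n in range(5)}
result, messages = tree_average(children, root=0, vectors=vectors)
```

Every edge of the spanning tree carries one message in each pass, so five nodes exchange eight messages in total, which grows linearly with the number of nodes.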
This generic approach to computing exact averages has been applied to distributed machine learning via a variety of implementations, especially under the distributed computing framework. TensorFlow has a package for parameter-server distributed learning on multiple GPUs that uses exact averaging; worker nodes compute gradients, which are forwarded to the parameter server for exact averaging . By contrast, Horovod  is a distributed-parameter library for deep learning that averages gradients using ring AllReduce; the GPU nodes are connected into a ring topology, which makes for simple and efficient exact averaging.
III-B2 Inexact Averaging and Distributed Machine Learning
In the case of algorithms utilizing inexact averaging, processing nodes use local communications, without network-wide coordination, to compute approximate averages of their data. A widespread method for this is averaging consensus, a mainstay of distributed control, signal processing, and learning [89, 15]. Again suppose is the set of vectors distributed across the network at the start of the averaging subroutine and denotes the exact average of these vectors. Next, define a doubly stochastic matrix that is consistent with the topology of the network . That is, is a matrix whose entries are non-negative, whose rows and columns sum to one, whose diagonal entries are non-zero, and whose -th entry only when . Averaging consensus then proceeds in multiple rounds of the following iteration using local communications:
Here, denotes the iteration index for averaging consensus, denotes an approximation of at node after iterations, and . In words, each processing node takes a convex combination of the estimate of at its neighboring nodes. Under mild conditions, averaging consensus converges geometrically on , with the approximation error scaling as .
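A small simulation illustrates the geometric convergence. The sketch below builds a doubly stochastic matrix for a ring network using Metropolis-style weights, which is one standard construction and an assumption here (the text does not prescribe a particular matrix), and then applies repeated rounds of neighbor averaging.

```python
import numpy as np


def metropolis_weights(adjacency):
    """Doubly stochastic weights for an undirected graph (assumed construction)."""
    adjacency = np.asarray(adjacency)
    n = adjacency.shape[0]
    degree = adjacency.sum(axis=1)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                A[i, j] = 1.0 / (1 + max(degree[i], degree[j]))
        A[i, i] = 1.0 - A[i].sum()   # remaining weight stays on the node itself
    return A


def consensus(A, vectors, rounds):
    """Each round replaces every node's estimate by a convex combination of its neighbors'."""
    estimates = np.asarray(vectors, dtype=float)
    for _ in range(rounds):
        estimates = A @ estimates
    return estimates


n = 6
adjacency = np.zeros((n, n), dtype=int)
for i in range(n):                   # ring topology: node i is linked to node i + 1
    adjacency[i, (i + 1) % n] = adjacency[(i + 1) % n, i] = 1
A = metropolis_weights(adjacency)
vectors = np.arange(n * 2, dtype=float).reshape(n, 2)   # one 2-dimensional vector per node
errors = [np.abs(consensus(A, vectors, r) - vectors.mean(axis=0)).max() for r in (1, 10, 50)]
```

The worst-case deviation from the true average shrinks with the number of rounds, at a rate governed by the second-largest eigenvalue magnitude of the weight matrix.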
Distributed gradient descent (DGD) is a classic approach to distributed optimization via inexact averaging . It uses only a single round of averaging consensus per iteration, i.e., using the notation of Section II-C, and it is posed in the setting of finite-sum optimization: each node has a local cost function , and the objective is to minimize the sum . While DGD was originally posed in the framework of distributed control, it applies equally well to the distributed ERM setting in which corresponds to the empirical risk over the training data at node . In terms of specifics, the original DGD formulation supposes a synchronous communications model in which each node computes a weighted average of its neighbors’ iterates at each iteration , after which it takes a gradient step with respect to its local cost function:
Thus, each node takes a standard gradient descent step preceded by one-round averaging consensus on the iterates.
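The DGD iteration can likewise be simulated on a small network. In the sketch below, each node holds a hypothetical quadratic local cost, the mixing matrix is a uniform-weight matrix for a ring, and the stepsize decays as 1/(t+1). Following the classical formulation, the local gradient is evaluated at the node's own pre-averaging iterate; this detail, along with all numerical choices, is an assumption.

```python
import numpy as np


def dgd(A, local_grads, w0, num_iters, stepsize):
    """Distributed gradient descent: one consensus round plus a local gradient step."""
    W = np.asarray(w0, dtype=float).copy()    # row n holds the iterate of node n
    for t in range(1, num_iters + 1):
        grads = np.stack([g(W[n]) for n, g in enumerate(local_grads)])
        W = A @ W - stepsize(t) * grads       # neighbor averaging, then local descent
    return W


n = 6
A = np.zeros((n, n))
for i in range(n):                            # ring with equal weights on self and two neighbors
    A[i, i] = A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0 / 3.0

# Node k has local cost 0.5 * (w - k)^2, so the sum of costs is minimized at the mean, 2.5.
targets = np.arange(n, dtype=float).reshape(n, 1)
local_grads = [lambda w, c=targets[k]: w - c for k in range(n)]

W = dgd(A, local_grads, np.zeros((n, 1)), num_iters=2000,
        stepsize=lambda t: 1.0 / (t + 1))
max_error = np.abs(W - targets.mean()).max()
```

No node ever sees another node's cost function, yet all iterates approach the minimizer of the sum; the residual disagreement between nodes scales with the current stepsize.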
Several extensions to DGD have been proposed in the literature, including extensions to time-varying and directed graphs [90, 91, 92] and variations with stronger convergence guarantees [93, 94, 95]. Other related works have studied distributed (stochastic) optimization via means other than gradient descent, including distributed dual averaging [96, 97] and the alternating direction method of multipliers (ADMM) [98, 99, 100]. The convergence of DGD-style methods has been studied under a variety of settings; two relevant results are that stochastic DGD-style algorithms have error decaying as for general smooth convex functions and for smooth strongly convex functions, even if the network is time varying [90, 91].
We conclude by noting that inexact averaging-based distributed algorithms have also been analyzed/proposed for nonconvex optimization problems. In particular, DGD-style methods for nonconvex finite-sum problems are presented in [101, 102, 103], and convergence rates to stationary points and, when possible, local minima are derived. Further particularization of these works to problems with “nicer” geometry of saddle points and to structured nonconvex problems such as PCA can be found in works such as [104, 105, 106, 107].
III-C Roadmap for the Remainder of the Paper
Putting the results presented in this paper in the context of the preceding discussion, the rest of the paper describes recent results in distributed machine learning from fast streaming data over networks that aggregate the distributed information using both exact and inexact averaging. Specifically, we synthesize results from four recent papers [108, 75, 109, 110] that focus on the distributed SA setting of Section II. Among these works, nodes in [108, 75] exchange messages using a robust MPI primitive such as AllReduce, allowing exact averaging of messages for processing. The main distinction between these two works is that  focuses on distributed convex SA problems, whereas  studies the distributed PCA problem under the SA setting. In contrast, nodes in [109, 110] exchange messages using multiple rounds of averaging consensus and thus, similar to DGD, aggregate information using inexact averaging of messages. Both these works study distributed convex SA problems, with  focusing on dual averaging and  investigating gradient descent as solution strategies.
IV Distributed Stochastic Approximation Using Exact Averaging
We detail two machine learning algorithms in this section for the distributed mini-batch framework of Section II, with one algorithm for general convex loss functions and the other one for the nonconvex loss function corresponding to the -PCA problem. Both these algorithms operate under the assumption of nodes aggregating distributed information via exact averaging using AllReduce-style communications primitives. The main focus in both these algorithms is to strike a balance between streaming, computing, and communications rates, while ensuring that the error in the final estimates is near optimal in terms of the number of samples arriving at the distributed system.
Both the algorithms take advantage of the fact that (implicit or explicit) mini-batching reduces (gradient / sample covariance) noise variance. Between any two data-splitting instances, nodes in each algorithm compute average gradients/iterates over the newest (network-wide) data samples and use these exactly averaged quantities for a stochastic update. Given ample compute resources and keeping everything else fixed, an increase in network-wide mini-batch size under such a strategy decreases both the noise variance and demands on the communications resources. In doing so, however, one also reduces the number of algorithmic iterations that take place within the network per second, which has the potential to slow down the convergence rates of the algorithms to the optimal solutions. An important question then is whether (and when) it is possible to utilize network-wide mini-batch averaging to simultaneously balance the compute-limited and communications-limited regimes in high-rate streaming settings (i.e., ensure ), reduce the noise variance, and guarantee that (order-wise) the convergence rate is not adversely impacted. We address this question in the following for the case of exact averaging.
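The variance-reduction effect of mini-batching mentioned above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not part of the algorithms discussed in this paper: the gradient is a scalar, the noise is i.i.d. Gaussian, and the names `minibatch_gradient_variance`, `true_gradient`, and `sigma` are hypothetical. Under these assumptions, averaging over a mini-batch of size b should shrink the noise variance by roughly a factor of b.

```python
import random
import statistics


def minibatch_gradient_variance(batch_size, sigma=1.0, trials=4000, seed=0):
    """Empirical variance of a mini-batch-averaged noisy scalar gradient.

    Assumption for illustration: each per-sample gradient equals a fixed
    true gradient plus independent zero-mean Gaussian noise with standard
    deviation sigma.
    """
    rng = random.Random(seed)
    true_gradient = 2.0  # hypothetical noiseless gradient value
    averages = []
    for _ in range(trials):
        batch = [true_gradient + rng.gauss(0.0, sigma) for _ in range(batch_size)]
        averages.append(sum(batch) / batch_size)
    return statistics.pvariance(averages)


# Variance of the averaged gradient for increasing mini-batch sizes.
variances = {b: minibatch_gradient_variance(b) for b in (1, 4, 16, 64)}
for b, v in variances.items():
    print(f"batch size {b:3d}: empirical variance {v:.4f}, sigma^2/b = {1.0 / b:.4f}")
```

The sketch only captures the noise-variance side of the trade-off; it says nothing about the reduction in the number of iterations per second that larger mini-batches entail.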
IV-A Distributed Mini-batched Stochastic Convex Approximation
Due to the high impact of mini-batching on the performance of distributed stochastic optimization, distributed methods deploying mini-batching and utilizing exact averaging have been studied extensively in the past few years; see, e.g., [108, 111, 5, 112]. Among these works, the results in [108] provide an upper bound on the network-wide mini-batch size that ensures sample-wise order-optimal convergence in SA settings. In contrast, [5, 111, 112] focus on the selection of mini-batch size under ERM settings. Since the SA setting is best suited for the streaming framework of this paper, our discussion here focuses exclusively on the distributed mini-batch (DMB) algorithm proposed in [108] for stochastic convex approximation. The DMB algorithm is listed as Algorithm 1 in the following and discussed further below.
We begin with the data-splitting model of Section II and initially assume sufficient provisioning of resources so that . The DMB algorithm at iteration in this setting has a mini-batch of data samples at the splitter, which is then distributed as smaller mini-batches of size each across the network of compute nodes. Afterwards, the nodes in the network locally (and in parallel) compute an average gradient of the loss function over their local mini-batch of data samples (see Steps 3–6 in Algorithm 1). Next, nodes engage in distributed exact averaging of their local mini-batched gradients using an AllReduce-style communications primitive to obtain the network-wide mini-batched average gradient (cf. Step 7, Algorithm 1), which is then used to update the network-wide estimate of the machine learning model (cf. Step 8, Algorithm 1).
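The steps described above (split, local gradient averaging, exact network-wide averaging, update) can be sketched in a few lines of code. This is a single-process simulation under stated assumptions rather than a reproduction of Algorithm 1: the loss is a squared loss on synthetic data, `all_reduce_average` is a hypothetical stand-in for an AllReduce-style primitive, the stepsize schedule `1 / (2 + sqrt(t))` is an illustrative choice and not the one analyzed in the paper, and the projection onto a bounded model space is omitted.

```python
import random


def make_stream(seed, w_true=(1.0, -2.0), noise_std=0.1):
    """Hypothetical data stream: y = <w_true, x> + Gaussian noise."""
    rng = random.Random(seed)
    while True:
        x = [rng.gauss(0.0, 1.0) for _ in w_true]
        y = sum(wi * xi for wi, xi in zip(w_true, x)) + rng.gauss(0.0, noise_std)
        yield x, y


def local_average_gradient(w, samples):
    """Average gradient of 0.5 * (<w, x> - y)^2 over one node's local mini-batch."""
    grad = [0.0] * len(w)
    for x, y in samples:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += residual * xi / len(samples)
    return grad


def all_reduce_average(vectors):
    """Stand-in for an AllReduce-style primitive: exact average across nodes."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def dmb(stream, num_nodes, local_batch, iterations, dim):
    """Simulated distributed mini-batch SGD with exact gradient averaging."""
    w = [0.0] * dim
    for t in range(1, iterations + 1):
        # Splitter: form a network-wide mini-batch and split it evenly across nodes.
        batches = [[next(stream) for _ in range(local_batch)] for _ in range(num_nodes)]
        # Each node computes the average gradient over its local mini-batch.
        local_grads = [local_average_gradient(w, batch) for batch in batches]
        # Exact averaging yields the network-wide mini-batched gradient.
        g = all_reduce_average(local_grads)
        # Assumed decaying stepsize (illustrative only).
        step = 1.0 / (2.0 + t ** 0.5)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


w_hat = dmb(make_stream(0), num_nodes=4, local_batch=8, iterations=300, dim=2)
print("estimate:", [round(v, 3) for v in w_hat])
```

Because all local mini-batches have the same size, the exact average of the local gradients coincides with the average gradient over the full network-wide mini-batch, which is why a single shared model estimate can be maintained across the nodes.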
The DMB algorithm can also deal with reasonable under-provisioning of resources without sacrificing too much in terms of the quality of the estimate . Recall that the distributed processing framework cannot process all incoming samples when . However, as long as , the DMB algorithm simply resorts to dropping samples per splitting instance at the splitter in this resource-constrained setting and then proceeds with Steps 2–8 using the remaining samples as before.
The main analytical contribution of [108] is a set of upper bounds on the mini-batch size and, when necessary, on the number of discarded samples that ensure sample-wise order-optimal convergence for the DMB algorithm. We summarize these results of [108] in the following theorem.
Theorem 4. Let the loss function be convex and smooth with -Lipschitz gradients and gradient noise variance . Then, assuming bounded model space and choosing stepsizes as , the approximation error of Algorithm 1 after iterations is bounded as follows:
Furthermore, if for any and , then the approximation error is bounded as
It can be seen from Theorem 4 that the DMB algorithm results in near-optimal convergence rate of , which corresponds to speed-up by a factor of , in two cases. First, when and thus , it can be seen from (19) that this speed-up is obtained as long as . Second, even when and therefore , (19) guarantees the convergence speed-up as long as