1 Introduction: big data, parallel processing and nested models
Serial computing has reached a plateau, and parallel, distributed architectures are becoming widely available, from machines with a few cores to cloud computing with thousands of machines. The combination of powerful nested models with large datasets is a key ingredient in solving difficult problems in machine learning, computer vision and other areas, and it underlies recent successes in deep learning (Hinton et al., 2012; Le et al., 2012; Dean et al., 2012). Unfortunately, parallel computation is not easy, and many good serial algorithms do not parallelise well. The cost of communicating (through the memory hierarchy or a network) greatly exceeds the cost of computing, both in time and energy, and will continue to do so for the foreseeable future (Fuller and Millett, 2011; Graham et al., 2004). Thus, good parallel algorithms must minimise communication and maximise computation per machine, while creating sufficiently many subproblems (ideally independent) to benefit from as many machines as possible. The load (in runtime) on each machine should be approximately equal. Faults become more frequent as the number of machines increases, particularly if they are inexpensive machines. Machines may be heterogeneous and differ in CPU and memory; this is the case with initiatives such as SETI@home, which may become an important source of distributed computation in the future.

Big data applications have additional restrictions. The size of the data means it cannot be stored on a single machine, so distributed-memory architectures are necessary. Sending data between machines is prohibitive because of the size of the data and the high communication costs. In some applications, more data is collected than can be stored, so data must be regularly discarded. In others, such as sensor networks, limited battery life and computational power imply that data must be processed locally.
In this paper, we focus on machine learning models of the form f(x) = f_{K+1}(… f_2(f_1(x)) …), i.e., consisting of a nested mapping from the input x to the output y. Such nested models involve multiple parameterised layers of processing and include deep neural nets (Hinton and Salakhutdinov, 2006), cascades for object recognition in computer vision (Serre et al., 2007; Ranzato et al., 2007) or for phoneme classification in speech processing (Gold and Morgan, 2000; Saon and Chien, 2012), wrapper approaches to classification or regression (Kohavi and John, 1997), and various combinations of feature extraction/learning and preprocessing prior to some learning task. Nested and hierarchical models are ubiquitous in machine learning because they provide a way to construct complex models by the composition of simple layers. However, training nested models is difficult even in the serial case, because function composition produces inherently nonconvex functions, which makes gradient-based optimisation difficult and slow, and sometimes inapplicable (e.g. with nonsmooth or discrete layers).
Our starting point is a recently proposed technique to train nested models, the method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014). This reformulates the optimisation into an iterative procedure that alternates between training the submodels independently and coordinating them. It introduces significant model and data parallelism, can often train the submodels using existing algorithms, and has convergence guarantees to a local stationary point for differentiable functions, while it also applies with nondifferentiable or even discrete layers. MAC has been applied to various nested models (Carreira-Perpiñán and Wang, 2014; Wang and Carreira-Perpiñán, 2014; Carreira-Perpiñán and Raziperchikolaei, 2015; Raziperchikolaei and Carreira-Perpiñán, 2016; Carreira-Perpiñán and Vladymyrov, 2015). However, the original papers proposing MAC (Carreira-Perpiñán and Wang, 2012, 2014) did not address how to run MAC on a distributed computing architecture, where communication between machines is far costlier than computation. This paper proposes ParMAC, a parallel, distributed framework to learn nested models using MAC, implements it in the Message Passing Interface (MPI) for the problem of learning binary autoencoders (BAs), and demonstrates its ability to train on large datasets and achieve large speedups on a distributed cluster. We first review related work (section 2), describe MAC in general and for BAs (section 3) and introduce the ParMAC model and some extensions of it (section 4). Then, we analyse theoretically ParMAC's parallel speedup (section 5) and convergence (section 6). Finally, we describe our MPI implementation of ParMAC for BAs (section 7) and show experimental results (section 8).
Although our MPI implementation and experiments are for a particular ParMAC algorithm (for binary autoencoders), we emphasise that our contributions (the definition of ParMAC and the theoretical analysis of its speedup and convergence) apply to ParMAC in general, for any situation where MAC applies, i.e., nested functions with K layers.
2 Related work
Distributed optimisation and large-scale machine learning have been steadily gaining interest in recent years. Most work has centred on convex optimisation, particularly when the objective function has the form of empirical risk minimisation (data-fitting term plus regulariser) (Cevher et al., 2014), typically solved with stochastic gradient descent (SGD) (Bottou, 2010), coordinate descent (CD) (Wright, 2016) or the alternating direction method of multipliers (ADMM) (Boyd et al., 2011). This has resulted in several variations of parallel SGD (McDonald et al., 2010; Bertsekas, 2011; Zinkevich et al., 2010; Gemulla et al., 2011; Niu et al., 2011), parallel CD (Bradley et al., 2011; Richtárik and Takáč, 2013; Liu and Wright, 2015) and parallel ADMM (Boyd et al., 2011; Ouyang et al., 2013; Zhang and Kwok, 2014).
It is instructive to consider the parallel SGD case in some detail. Here, one typically runs SGD independently on data subsets (done by worker machines), and a parameter server regularly gathers the replica parameters from the workers, averages them and broadcasts them back to the workers. One can show (Zinkevich et al., 2010) that, for a small enough step size and under some technical conditions, the distance to the minimum in objective function value satisfies an upper bound. The upper bound has a term that decreases as the number of workers increases, so that parallelisation helps, but it has another term that is independent of the number of workers, so that past a certain point parallelisation does not help. In practice, the speedups over serial SGD are generally modest. Also, the theoretical guarantees of parallel SGD are restricted to shallow models, as opposed to deep or nested models, because the composition of functions is nearly always nonconvex. Indeed, parallel SGD can diverge with nonconvex models. The intuitive reason for this is that, with local minima, the average of two workers can have a larger objective value than each of the individual workers; indeed, the average of two minima need not be a minimum. In practice, parallel SGD can give reasonable results with nonconvex models if one takes care to average replica models that are close in parameter space and thus associated with the same optimum (e.g. by eliminating "stale" models and other heuristics), but this is not easy (Dean et al., 2012).
Google's DistBelief system (Dean et al., 2012) uses asynchronous parallel SGD, with gradients for the full model computed with backpropagation, to achieve data parallelism (with the caveat above), and some form of model parallelism. The latter is achieved by carefully partitioning the neural net into pieces and allocating them to machines to compute gradients. This is difficult to do and requires a careful match of the neural net structure (number of layers and hidden units, connectivity, etc.) to the target hardware. Although this has managed to train huge nets on huge datasets by using tens of thousands of CPU cores, the speedups achieved were very modest. Other work has used similar techniques but for GPUs (Chen et al., 2012; Zhang et al., 2013; Coates et al., 2013; Seide et al., 2014).
There also exist distributed frameworks for large-scale machine learning, such as TensorFlow (Abadi et al., 2015), with the goal of making cloud computing easily available to train machine learning models. Again, these are often based on shallow models trained with gradient-based convex optimisation techniques, such as parallel SGD. Some of these systems implement some form of deep neural nets.
Finally, there also exist specific approximation techniques for certain types of large-scale machine learning problems, such as spectral problems, using the Nyström formula or other landmark-based methods (Williams and Seeger, 2001; Bengio et al., 2004; Drineas and Mahoney, 2005; Talwalkar et al., 2013; Vladymyrov and Carreira-Perpiñán, 2013, 2016).
ParMAC is specifically designed for nested models, which are typically nonconvex and include deep nets and many other models, some of which have nondifferentiable layers. As we describe below, ParMAC has the advantages of being simple and relatively independent of the target hardware, while achieving high speedups.
3 Optimising nested models using the method of auxiliary coordinates (MAC)
Many machine learning architectures share a fundamental design principle: mathematically, they construct a (deeply) nested mapping from inputs to outputs, of the form f(x; W) = f_{K+1}(… f_2(f_1(x; W_1); W_2) …; W_{K+1}) with parameters W = {W_1, …, W_{K+1}}, such as deep nets or binary autoencoders consisting of multiple processing layers. Such problems are traditionally optimised using methods based on gradients computed with the chain rule. However, such gradients may sometimes be inconvenient to use, or may not exist (e.g. if some of the layers are nondifferentiable). They are also hard to parallelise, because of the inherent sequentiality of the chain rule.
The method of auxiliary coordinates (MAC) (Carreira-Perpiñán and Wang, 2012, 2014) is designed to optimise nested models without using chain-rule gradients, while introducing parallelism. It solves a problem that is equivalent to the nested one but very different in appearance, and which affords embarrassing parallelisation. The idea is to break nested functional relationships judiciously by introducing new variables (the auxiliary coordinates Z) as equality constraints. These are then solved by optimising a penalised function using alternating optimisation over the original parameters W (which we call the W step) and over the coordinates Z (which we call the Z step). The result is a coordination-minimisation (CM) algorithm: the minimisation (W) step updates the parameters by splitting the nested model into independent submodels and training them using existing algorithms, and the coordination (Z) step ensures that corresponding inputs and outputs of submodels eventually match.
MAC algorithms have been developed for several nested models so far: deep nets (Carreira-Perpiñán and Wang, 2014), low-dimensional SVMs (Wang and Carreira-Perpiñán, 2014), binary autoencoders (Carreira-Perpiñán and Raziperchikolaei, 2015), affinity-based loss functions for binary hashing (Raziperchikolaei and Carreira-Perpiñán, 2016) and parametric nonlinear embeddings (Carreira-Perpiñán and Vladymyrov, 2015). In this paper we focus mostly on the particular case of binary autoencoders. These define a nonconvex, nondifferentiable problem, yet their MAC algorithm is simple and effective. It allows us to demonstrate, in an actual implementation on a distributed system, the fundamental properties of ParMAC: how MAC introduces parallelism; how ParMAC keeps the communication between machines low; the use of stochastic optimisation in the W step; and the tradeoff between the different amounts of parallelism in the W and Z steps. It also allows us to test experimentally how good our theoretical model of the speedup is. We first give the detailed MAC algorithm for binary autoencoders, and then generalise it to K hidden layers.
3.1 Optimising binary autoencoders using MAC
A binary autoencoder (BA) is a usual autoencoder but with a binary code layer. It consists of an encoder h that maps a real vector x ∈ R^D onto a binary code vector z ∈ {0,1}^L with L bits, and a linear decoder f which maps z back to R^D in an effort to reconstruct x. We will call h a binary hash function (see later). Let us write h(x) = σ(Wx) (where x includes a bias by having an extra dimension x_{D+1} = 1 for each x), where W ∈ R^{L×(D+1)} and σ(t) is a step function applied elementwise, i.e., σ(t) = 1 if t ≥ 0 and σ(t) = 0 otherwise. Given a dataset of D-dimensional patterns X = (x_1, …, x_N), our objective function, which involves the nested model f(h(x)), is the usual least-squares reconstruction error:

    E(h, f) = Σ_{n=1}^N ‖x_n − f(h(x_n))‖².
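The BA objective above can be sketched in a few lines of NumPy (a minimal illustration only; the dimensions, random weights and helper names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 4, 100                      # input dimension, code bits, points
X = rng.standard_normal((N, D))

W = rng.standard_normal((L, D + 1))      # hash function weights (with bias)
A = rng.standard_normal((D, L + 1))      # linear decoder weights (with bias)

def encode(X, W):
    """Encoder h(x) = step(W x), with a bias dimension appended to x."""
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return (X1 @ W.T >= 0).astype(float)         # elementwise step function

def decode(Z, A):
    """Linear (affine) decoder f(z) = A z mapping codes back to R^D."""
    Z1 = np.hstack([Z, np.ones((len(Z), 1))])
    return Z1 @ A.T

def reconstruction_error(X, W, A):
    """Nested objective E(h, f) = sum_n ||x_n - f(h(x_n))||^2."""
    return np.sum((X - decode(encode(X, W), A)) ** 2)

err = reconstruction_error(X, W, A)
```

Note that `encode` is piecewise constant in `W`, which is exactly why chain-rule gradients of this objective are zero wherever they exist.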
Optimising this nonconvex, nonsmooth function is NP-complete. Where the gradients with respect to the parameters of h do exist, they are zero, so optimisation using chain-rule gradients does not apply. We introduce as auxiliary coordinates the outputs of h, i.e., the codes z_n ∈ {0,1}^L for each of the N input patterns, and obtain the following equality-constrained problem:

    min_{h,f,Z} Σ_{n=1}^N ‖x_n − f(z_n)‖²   s.t.   z_n = h(x_n), z_n ∈ {0,1}^L, n = 1, …, N.
Note the codes are binary. We now apply the quadratic-penalty method (it is also possible to apply the augmented Lagrangian method; Nocedal and Wright, 2006) and minimise the following objective function while progressively increasing the penalty parameter μ, so the constraints are eventually satisfied:

    E_Q(h, f, Z; μ) = Σ_{n=1}^N ‖x_n − f(z_n)‖² + μ Σ_{n=1}^N ‖z_n − h(x_n)‖²,   z_n ∈ {0,1}^L.
Finally, we apply alternating optimisation over Z and (h, f). This results in the following two steps:
Over Z for fixed (h, f), this is a binary optimisation on NL variables, but it separates into N independent optimisations, each on only L variables and with the form of a binary proximal operator (where we omit the index n):

    min_z ‖x − f(z)‖² + μ ‖z − h(x)‖²   s.t.   z ∈ {0,1}^L.

After some transformations, this problem can be solved exactly for small L by enumeration, or approximately for larger L by alternating optimisation over bits, initialised by relaxing the problem to [0,1]^L and truncating its solution (see Carreira-Perpiñán and Raziperchikolaei, 2015 for details).
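For small L, the exact enumeration solution of this binary proximal operator is straightforward to sketch (an illustration under our own notation; we omit the decoder bias for brevity):

```python
import itertools
import numpy as np

def z_step_enumerate(x, h_x, A, mu):
    """Exact Z step for one data point with small L:
    argmin over z in {0,1}^L of ||x - A z||^2 + mu * ||z - h(x)||^2,
    solved by enumerating all 2^L binary codes.
    x: input vector; h_x: current code h(x); A: linear decoder matrix."""
    L = A.shape[1]
    best_z, best_val = None, np.inf
    for bits in itertools.product([0.0, 1.0], repeat=L):
        z = np.array(bits)
        val = np.sum((x - A @ z) ** 2) + mu * np.sum((z - h_x) ** 2)
        if val < best_val:
            best_z, best_val = z, val
    return best_z
```

Since each of the N subproblems is independent, this loop over points is embarrassingly parallel, which is the property ParMAC exploits in the Z step.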
Over (h, f) for fixed Z, we obtain L + D independent problems: one for each of the L single-bit hash functions h_1, …, h_L (which try to predict Z optimally from X), each solvable by fitting a linear SVM; and one for each of the D linear decoders f_1, …, f_D (which try to reconstruct X optimally from Z), each a linear least-squares problem. With linear h and f this simply involves fitting L SVMs to (X, Z) and D linear regressors to (Z, X).
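A minimal sketch of this W step follows. The paper fits each bit with a linear SVM; as a stand-in (an assumption of this sketch, not the paper's choice) we use a regularised least-squares fit whose sign predicts the bit, which keeps the example dependency-free:

```python
import numpy as np

def w_step(X, Z, reg=1e-6):
    """One W step for a BA: fit the decoder (a linear least-squares
    problem, separable over the D output dimensions) and each single-bit
    hash function independently. NOTE: the least-squares fit of the hash
    bits is a stand-in for the paper's linear SVM."""
    N, D = X.shape
    X1 = np.hstack([X, np.ones((N, 1))])        # inputs with bias
    Z1 = np.hstack([Z, np.ones((N, 1))])        # codes with bias
    # Decoder: min_A ||X - Z1 A^T||^2, solved column by column by lstsq.
    A = np.linalg.lstsq(Z1, X, rcond=None)[0].T              # (D, L+1)
    # Hash functions: one independent fit per bit, targets in {-1, +1}.
    T = 2 * Z - 1
    W = np.linalg.solve(X1.T @ X1 + reg * np.eye(D + 1), X1.T @ T).T  # (L, D+1)
    return W, A
```

Each column of `W` and each row of `A` is an independent subproblem, which is the source of model parallelism in the W step.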
The user must choose a schedule for the penalty parameter μ (a sequence of values 0 < μ_1 < μ_2 < …). This should increase slowly enough that the binary codes can change considerably and explore better solutions before the constraints are satisfied and the algorithm stops. With BAs, MAC stops for a finite value of μ (Carreira-Perpiñán and Raziperchikolaei, 2015). This occurs whenever Z does not change compared to the previous Z step, which gives a practical stopping criterion. Also, in order to generalise well to unseen data, we stop iterating for a given value μ_k not when we (sufficiently) optimise E_Q, but when the precision of the hash function on a validation set decreases. This is a form of early stopping that guarantees that we improve (or leave unchanged) the initial Z, and is faster besides. We also have to initialise Z. This can be done by running PCA and binarising its result, for example. Fig. 1 gives the MAC algorithm for BAs.
The BA was proposed as a way to learn good binary hash functions for fast, approximate information retrieval (Carreira-Perpiñán and Raziperchikolaei, 2015). Binary hashing (Grauman and Fergus, 2013) has emerged in recent years as an effective way to do fast, approximate nearest-neighbour searches in image databases. The real-valued, high-dimensional image vectors are mapped onto a binary space with L bits and the search is performed there using Hamming distances, at vastly faster speed and with far smaller memory (e.g. N = 10^9 points with D = 500 take 2 TB as single-precision floats, but only 8 GB using L = 64 bits, which easily fits in RAM). As shown by Carreira-Perpiñán and Raziperchikolaei (2015), training BAs with MAC beats approximate optimisation approaches such as relaxing the codes or the step function in the encoder, and yields state-of-the-art binary hash functions in unsupervised problems, improving over established approaches such as iterative quantisation (ITQ) (Gong et al., 2013).
In this paper, we focus on linear hash functions because these are, by far, the most used type of hash function in the binary hashing literature, since computing the binary codes for a test image must be fast at run time. We also provide an experiment with nonlinear hash functions (an RBF network).
3.2 MAC with K hidden layers
We now consider the more general case of MAC with K hidden layers (Carreira-Perpiñán and Wang, 2012, 2014), inputs x and outputs y (for a BA, y = x). It helps to think of the case of a deep net, and we will use it as a running example, but the ideas apply beyond deep nets. Consider a regression problem of mapping inputs x to outputs y (both high-dimensional) with a deep net f(x; W) = f_{K+1}(… f_1(x; W_1) …; W_{K+1}), given a dataset of N pairs (x_n, y_n). We minimise the least-squares error (other loss functions are possible):

    E(W) = (1/2) Σ_{n=1}^N ‖y_n − f(x_n; W)‖²    (4)
where each layer function has the form f_k(x; W_k) = σ(W_k x), i.e., a linear mapping followed by a squashing nonlinearity (σ applies a scalar function, such as the sigmoid 1/(1 + e^{−t}), elementwise to a vector argument, with output in (0,1)). We introduce one auxiliary variable z_{k,n} per data point and per hidden unit and define the following equality-constrained optimisation problem:

    min_{W,Z} (1/2) Σ_{n=1}^N ‖y_n − f_{K+1}(z_{K,n}; W_{K+1})‖²
    s.t.   z_{k,n} = f_k(z_{k−1,n}; W_k),  k = 1, …, K,  n = 1, …, N    (5)

with z_{0,n} = x_n.
Each z_{k,n} can be seen as the coordinates of x_n in an intermediate feature space, or as the hidden unit activations for x_n. Intuitively, by eliminating Z we see this is equivalent to the nested problem (4); we can prove under very general assumptions that both problems have exactly the same minimisers (Carreira-Perpiñán and Wang, 2012). Applying the quadratic-penalty method, we optimise the following function:

    E_Q(W, Z; μ) = (1/2) Σ_{n=1}^N ‖y_n − f_{K+1}(z_{K,n}; W_{K+1})‖² + (μ/2) Σ_{n=1}^N Σ_{k=1}^K ‖z_{k,n} − f_k(z_{k−1,n}; W_k)‖²
over (W, Z) and drive μ → ∞. This defines a continuous path (W*(μ), Z*(μ)) which, under mild assumptions (Carreira-Perpiñán and Wang, 2012), converges to a minimum of the constrained problem (5), and thus to a minimum of the original problem (4). In practice, we follow this path loosely. The quadratic-penalty objective function can be seen as breaking the functional dependences in the nested mapping f and unfolding it over layers. Every squared term involves only a shallow mapping; all variables are equally scaled, which improves the conditioning of the problem; and the derivatives required are simpler: we require no gradients backpropagated across layers, and sometimes no gradients at all. We now apply alternating optimisation of the quadratic-penalty objective over W and Z:
- W step (submodels)
Minimising E_Q over W for fixed Z results in a separate minimisation over the weights of each hidden unit: each is a single-layer, single-unit submodel that can be solved with existing algorithms (e.g. logistic regression).
- Z step (coordinates)
Minimising E_Q over Z for fixed W separates over the coordinates z_n for each data point, and can be solved using the derivatives with respect to z_n of the single-layer functions (omitting the subindex n):

    min_z (1/2) ‖y − f_{K+1}(z_K)‖² + (μ/2) Σ_{k=1}^K ‖z_k − f_k(z_{k−1})‖².
Thus, the W step results in many independent, single-layer, single-unit submodels that can be trained with existing algorithms, without extra programming cost. The Z step is new and has the form of a "generalised" proximal operator (Rockafellar, 1976; Combettes and Pesquet, 2011). MAC reduces a complex, highly coupled problem (training a deep net) to a sequence of simple, uncoupled problems (the W step) which are coordinated through the auxiliary variables (the Z step). For a large net with a large dataset, this affords an enormous potential for parallel computation.
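To make the quadratic-penalty objective E_Q concrete, here is a sketch for a net with K hidden sigmoid layers and a linear output layer (our simplifying assumptions: no biases, linear last layer; symbol names follow the formulas above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def quadratic_penalty(Ws, Z, X, Y, mu):
    """E_Q(W, Z; mu) = 0.5 * sum_n ||y_n - f_{K+1}(z_{K,n})||^2
                     + 0.5 * mu * sum_{k,n} ||z_{k,n} - f_k(z_{k-1,n})||^2,
    with z_{0,n} = x_n.  Ws: list of K+1 weight matrices (last one linear);
    Z: list of K activation matrices, one row per data point."""
    K = len(Z)
    # Output term: linear last layer applied to the last auxiliary coordinates.
    E = 0.5 * np.sum((Y - Z[K - 1] @ Ws[K].T) ** 2)
    # Penalty terms: each couples only two consecutive layers.
    prev = X
    for k in range(K):
        E += 0.5 * mu * np.sum((Z[k] - sigmoid(prev @ Ws[k].T)) ** 2)
        prev = Z[k]
    return E
```

Note that each squared term touches only one shallow mapping, which is what decouples the W step into per-unit subproblems.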
4 ParMAC: a parallel, distributed computation model for MAC
We now turn to the contribution of this paper, the distributed implementation of MAC algorithms. As we have seen, a specific MAC algorithm depends on the model and objective function and on how the auxiliary coordinates are introduced. The resulting steps may be closed-form, convex, nonconvex, binary, or of other types. However, the following always holds: (1) In the Z step, the subproblems for z_1, …, z_N are independent, one per data point. Each such subproblem depends on all or part of the current model. (2) In the W step, there are M independent submodels, where M depends on the problem. For example, M is the number of hidden units in a deep net, or the number of hash functions and linear decoders in a BA. Each submodel depends on all the data and coordinates (usually, a given submodel depends, for each n, on only a portion of the vector z_n). We now show how to turn this into a distributed, low-communication ParMAC algorithm, give an MPI implementation of ParMAC for BAs, and discuss the convergence of ParMAC. Throughout the paper, unless otherwise indicated, we will use the term "machine" to mean a single-CPU processing unit with its own local memory and disk, which can communicate with other machines in a cluster through a network or shared memory.
4.1 Description of ParMAC
The basic idea in ParMAC is as follows. With large datasets in distributed systems, it is imperative to minimise data movement over the network, because the communication time generally far exceeds the computation time in modern architectures. In MAC we have three types of data: the original training data x_n, the auxiliary coordinates z_n, and the model parameters (the submodels). Usually, the latter type is far smaller. In ParMAC, we never communicate training or coordinate data; each machine keeps a disjoint portion of them, corresponding to a subset of the N points. Only model parameters are communicated, during the W step, following a circular topology, which implicitly implements a stochastic optimisation. The model parameters are the hash functions and the decoder for BAs, and the weight vector of each hidden unit for deep nets. Let us see this in detail (refer to fig. 2).
Assume for simplicity that we have P identical processing machines, each with its own memory and CPU, connected through a network. The machines are connected in a circular (ring) unidirectional topology, i.e., machine 1 → machine 2 → … → machine P → machine 1, where "machine p → machine q" means machine p can send data directly to machine q (and we say machine q is the successor of machine p). Call X = (x_1, …, x_N) the entire dataset and Z = (z_1, …, z_N) the corresponding coordinates. Each machine p will store a subset {(x_n, z_n): n ∈ N_p} such that the subsets are disjoint and their union is the entire data, i.e., the index sets N_1, …, N_P satisfy N_p ∩ N_q = ∅ if p ≠ q and N_1 ∪ … ∪ N_P = {1, …, N}.
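The disjoint index sets N_1, …, N_P can be constructed as a simple contiguous split (a sketch; the function name is ours):

```python
def partition_indices(N, P):
    """Split point indices {0, ..., N-1} into P disjoint, near-equal
    subsets N_1, ..., N_P whose union is the whole index set."""
    base, extra = divmod(N, P)
    subsets, start = [], 0
    for p in range(P):
        size = base + (1 if p < extra else 0)   # spread the remainder
        subsets.append(list(range(start, start + size)))
        start += size
    return subsets
```

In practice one would shuffle the indices first so each machine's portion is a random sample of the dataset, which matters for the stochastic W-step updates described next.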
The Z step is very simple. Before the Z step starts, each machine will contain all the (just updated) submodels. This means that in the Z step each machine processes its auxiliary coordinates independently of all other machines, i.e., no communication occurs. (In fact, the machines need not all start the Z step at the same time: a machine can start the Z step on its data as soon as it has received all the updated submodels in the W step. Likewise, as soon as a machine finishes its Z step, it can start the W step immediately, without waiting for all other machines to finish their Z step. However, in our implementation we treat the W and Z steps as barriers, so that all machines start the W or Z step at the same time.)
The W step is more subtle. At the beginning of the W step, each machine will contain all the submodels and its portion of the data and (just updated) coordinates. Each submodel must have access to the entire data and coordinates in order to update itself and, since the data cannot leave its home machine, the submodel must go to the data (this contrasts with the intuitive notion of the model sitting in a computer while data arrive and are processed). We achieve this in the circular topology as follows. We assume synchronous processing for simplicity, but in practice one would implement this asynchronously. Assume arithmetic modulo P and an imaginary clock whose period equals the time that any one machine takes to process its portion of submodels. At each clock tick, the machines each update a different portion of M/P of the M submodels, in parallel, using their local data. For example, in fig. 2, at clock tick 1 machine 1 updates the first portion of the submodels using its data, machine 2 updates the second portion, and so on. Then each machine sends the submodels it just updated to its successor, also in parallel. In the next tick, each machine updates the submodels it just received (and each machine always uses its own data portion, which never changes). This is repeated until each submodel has visited each machine and thus has been updated with the entire dataset. This happens after P ticks, and we call this an epoch. This process may be repeated for e epochs in eP ticks. At this time, each machine contains M/P submodels that are finished (i.e., updated e times over the entire dataset), while the remaining submodels it contains are not finished; the finished versions of those submodels reside in other machines. Finally, before starting with the Z step, each machine must contain all the (just updated) submodels (i.e., the parameters for the entire nested model).
We achieve this by running a final round of communication without computation, i.e., each machine sends its just updated submodels to its successor. Thus, after one clock tick, machine p sends its finished submodels to machine p + 1 and receives submodels from machine p − 1. After P − 1 clock ticks, each machine has received the remaining submodels that were finished by other machines, hence each machine contains a (redundant) copy of all the current submodels. (In MPI, this can be directly achieved with MPI_Alltoall broadcasting, which scatters/gathers data from all members to all members of a group, i.e., a complete exchange. However, in this paper we implement it using the circular topology mechanism described.) Fig. 3 illustrates the sequence of operations during one epoch for the example of fig. 2.
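The synchronous rotation schedule is easy to simulate and check (a sketch under our own naming; real ParMAC runs this asynchronously over MPI, not as a simulation):

```python
def simulate_w_step(P, M, e):
    """Simulate the synchronous circular W step: at each tick every
    machine updates the block of M/P submodels it currently holds, then
    sends that block to its successor in the ring. After e*P ticks each
    submodel has been updated on every machine e times (P-1 further
    communication-only ticks would replicate the finished submodels
    everywhere). Returns, per submodel, the sequence of machines visited."""
    assert M % P == 0
    block = M // P
    holder = {p: list(range(p * block, (p + 1) * block)) for p in range(P)}
    visits = {m: [] for m in range(M)}
    for tick in range(e * P):
        for p in range(P):
            for m in holder[p]:
                visits[m].append(p)          # "update" submodel m on machine p
        # All machines send their block to their successor, in parallel.
        holder = {(p + 1) % P: holder[p] for p in range(P)}
    return visits
```

This makes the epoch accounting in the text concrete: after e*P ticks every submodel has seen every machine's data portion exactly e times.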
In practice, we use an asynchronous implementation. Each machine keeps a queue of submodels to be processed, and repeatedly performs the following operations: extract a submodel from the queue, process it (except during the final, communication-only round) and send it to the machine's successor (which will insert it in its own queue). If the queue is empty, the machine waits until it is nonempty. The queue of each machine is initialised with the portion of submodels associated with that machine. Each submodel carries a counter of the machines it has visited, which increases every time it visits a machine. When the counter reaches eP, the submodel is in its last machine and last epoch. When it reaches eP + P − 1, it has undergone e epochs of processing and all machines have a copy of it, so it has finished the W step.
Since each submodel is updated as soon as it visits a machine, rather than computing the exact gradient once it has visited all machines and then taking a step, the W step is really carrying out stochastic steps for each submodel. For example, if the update is done by a gradient step, we are actually implementing stochastic gradient descent (SGD) where the minibatches are of size N/P (or smaller, if we subdivide a machine's data portion into minibatches, as will typically be the case in practice). From this point of view, we can regard the W step as doing SGD on each submodel in parallel, by having each submodel visit the minibatches in each machine.
In summary, using P machines, ParMAC iterates as follows:
- W step: the submodels (hash functions and decoders for BAs) visit each machine. This implies we train them with stochastic gradient descent, where one "epoch" for a submodel corresponds to that submodel having visited all P machines. All submodels are communicated in parallel, asynchronously with respect to each other, in a circular topology. With e epochs, the entire model parameters are communicated e times. A last round of communication is needed to ensure each machine has the most updated version of the model for the Z step.
- Z step: identical to MAC, each data point's coordinates are optimised independently, in parallel over the P machines (since each machine p contains x_n and z_n for n ∈ N_p, and all the model parameters). No communication occurs at all.
4.2 A W step with only two rounds of communication
As described, and as implemented in our experiments, running e epochs in the W step requires eP rounds of communication (plus a final round). However, we can run e epochs with only one round of communication by having a submodel do e consecutive passes over each machine's data. In the example of fig. 2, with four machines holding ten points each, running e = 2 epochs for a submodel means the following: instead of visiting the data as 1,…,10, 11,…,20, 21,…,30, 31,…,40, 1,…,10, 11,…,20, 21,…,30, 31,…,40, it visits the data as 1,…,10, 1,…,10, 11,…,20, 11,…,20, 21,…,30, 21,…,30, 31,…,40, 31,…,40. (We can also have intermediate schemes, doing between 1 and e consecutive within-machine passes.) This reduces the amount of shuffling, but that should not be a problem if the data are randomly distributed over machines (and we can still do within-machine shuffling). This effectively reduces the total communication in the W step to two rounds regardless of the number of epochs (with the second round needed to ensure each machine has the most updated submodels).
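The two visit orders can be generated and compared directly (a small sketch; the function name and 1-based indexing, chosen to match the example above, are ours):

```python
def visit_order(P, points_per_machine, e, within_machine=False):
    """Order in which one submodel sees the data indices over e epochs.
    Standard scheme: e full rings over the machines (e*P communications).
    Within-machine scheme: e consecutive passes inside each machine
    before moving on (one ring, i.e., two communication rounds total)."""
    machines = [list(range(p * points_per_machine + 1,
                           (p + 1) * points_per_machine + 1))
                for p in range(P)]
    if within_machine:
        # 1..10, 1..10, 11..20, 11..20, ... (e passes per machine)
        return [i for part in machines for _ in range(e) for i in part]
    # 1..10, 11..20, ..., then repeat the whole ring e times
    return [i for _ in range(e) for part in machines for i in part]
```

Both orders touch exactly the same multiset of points; only the amount of cross-machine interleaving (and hence communication) differs.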
4.3 Extensions of ParMAC
Beyond its basic design, the ParMAC model offers good potential for data shuffling, load balancing, streaming and fault tolerance, which makes it attractive for big data. We describe these next.
Data shuffling. It is well known that shuffling (randomly reordering) the dataset prior to each epoch improves the convergence speed of SGD. With distributed systems, this can sometimes be a problem and require data movement across machines. Shuffling is easy in ParMAC. Within a machine, we can simply access the local data (minibatches) in random order at each epoch. Across machines, we can simply reorganise the circular topology randomly (while keeping it circular) at the beginning of each new epoch (by generating a random permutation and resetting the successor's address of each machine). We could even have each submodel follow a different, random circular topology. However, we do not implement this because it is unlikely to help (since the submodels are independent) and it can unbalance the load over machines.
Load balancing. This is simple because the work in both the W and Z steps is proportional to the number of data points N. Indeed, in the W step each submodel must visit every data point once per epoch. So, even if the submodels differ in size, the training cost of any submodel is proportional to N. In the Z step, each data point is a separate problem dependent on the current model (which is the same for all points), thus all problems are formally identical in complexity. Hence, under the assumption that the machines are identical and that each data point incurs the same runtime, load balancing is trivial: the points are allocated in equal portions of N/P to each machine. If the processing power of machine p is proportional to α_p (where α_p could represent the clock frequency of machine p, say), then we allocate to machine p a subset of the points proportional to α_p, i.e., machine p gets N α_p / (α_1 + … + α_P) data points. This is done once and for all at loading time.
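The proportional allocation can be sketched as follows (our own helper; the tie-breaking rule for rounding leftovers is an assumption, not specified by the text):

```python
def allocate(N, alphas):
    """Allocate N data points proportionally to machine powers alpha_p,
    i.e. machine p gets about N * alpha_p / sum(alphas) points.
    Rounding leftovers go to the most powerful machines first."""
    total = sum(alphas)
    sizes = [int(N * a / total) for a in alphas]
    leftover = N - sum(sizes)
    for p in sorted(range(len(alphas)), key=lambda p: -alphas[p])[:leftover]:
        sizes[p] += 1
    return sizes
```

With identical machines (all α_p equal) this reduces to the N/P split described above.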
In practice, we can expect some degradation of the parallel speedup even with identical machines and submodels of the same type. This is because machines do vary in practice, e.g. the runtime can be affected by differences in ventilation across machines located in different areas of a data centre, or because machines are running other user processes in addition to the ParMAC optimisation. Another type of degradation can happen if the submodels differ significantly in runtime (e.g. because there are different types of submodels): the runtime of the W step will then be driven by the slow submodels, which become a bottleneck. As discussed in section 5.4, we can group the submodels into a smaller number of approximately equal-size aggregate submodels for the purpose of estimating the speedup in theory. This need not be the fastest way to schedule the jobs, and in practice we still process the individual submodels asynchronously.
Streaming. Streaming refers to the ability to discard old data and add new data over time during training. This is useful in online learning, or to allow the data to be refreshed, but it may also be necessary when a machine collects more data than it can store. The circular topology allows us to add or remove machines on the fly easily, and this can be used to implement streaming.
We consider two forms of streaming: (1) new data are added within a machine (e.g. as this machine collects new data), and likewise old data are discarded within a machine; and (2) new data are added by adding a new machine to the topology, and old data are discarded by removing an existing machine from the topology. Both forms are easily achieved in ParMAC. The first form, within-machine, is trivial: a machine can always add or remove data without any change to the system, because the data in each machine is private and never interacts with other machines other than by updating submodels. Adding or discarding data is done at the beginning of the W step. Discarding data simply means removing the corresponding (x_n, z_n) from that machine. Adding data means inserting x_n in that machine and, if necessary, creating within that machine the coordinate values z_n (e.g. by applying the nested model to x_n). We never upload or send any data or coordinate values over the network.
The second form, creating a new machine or removing an existing one, is barely more complicated, assuming some support from the parallel processing library. We describe it conceptually. Imagine we currently have $P$ machines. We can add a new machine, with its own preloaded data, as follows. Adding it to the circular topology simply requires connecting it between any two machines (done by setting the address of their successor): before we have "machine $i$ → machine $j$", afterwards we have "machine $i$ → new machine → machine $j$". We add it in the W step, making sure it receives a copy of the final model that has just been finished. The easiest way to do this is by inserting it in the topology at the end of the W step, when each machine is simply sending along a copy of the final submodels. In the Z step, we proceed as usual, but with $P+1$ machines. Removing a machine is easier. To remove machine $i$, we do so in the Z step, by reconnecting machine $i$'s predecessor to its successor and returning machine $i$ to the cluster. That is all. In the subsequent W step, all machines contain the full model, and the submodels will visit the data in each remaining machine, thus not visiting the data in the removed machine.
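Conceptually, maintaining the ring only requires each machine to know the address of its successor; a minimal sketch (assuming, as our own simplification, a dict mapping machine id to successor id):

```python
def insert_machine(successor, new, after):
    """Insert machine `new` into the ring right after machine `after`."""
    successor[new] = successor[after]   # new machine inherits the old successor
    successor[after] = new              # predecessor now points to the new machine

def remove_machine(successor, gone):
    """Remove machine `gone` by rewiring its predecessor to its successor."""
    pred = next(m for m, s in successor.items() if s == gone)
    successor[pred] = successor.pop(gone)
```

Only one successor pointer changes on insertion and one on removal, which is why the topology can be modified on the fly.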
This situation is similar to discarding a machine in streaming, except that the fault can occur at any time and is not intended. We can handle it with a little extra bookkeeping, and again assuming some support from the parallel processing library. Imagine a fault occurs at machine $i$ and we need to remove it. If it happens during the Z step, all we need to do is discard the faulty machine and reconnect the circular topology. If it happens during the W step, we also discard and reconnect, but in addition we need to rescue the submodels that were being updated in machine $i$, which we lose. To do this, we revert to the previously updated copy of them, which resides in the predecessor of machine $i$ in the circular topology (if there is no such copy, we are at the beginning of the W step and we can use any copy in any machine). As for the remaining submodels being updated in other machines, some will have already been updated in machine $i$ (which require no action) and some will not have been updated in machine $i$ yet (which should not visit it anymore). We can keep track of this information by tagging each submodel with a list of the machines it has not yet visited. At the beginning of the W step the list of each submodel contains all $P$ machines. When this list is empty for a submodel, then that submodel is finished and needs no further updates.
Essentially, the robustness of ParMAC to faults comes from its built-in redundancy. In the W (and Z) step, we can do without the data points in one machine because a good model can still be learned from the remaining data points in the other machines. In the W step, we can also revert to older copies of the lost submodels residing in other machines.
The asynchronous implementation of ParMAC we described earlier relied on tagging each submodel with a counter in order to know whether it needs processing and communicating. A more general mechanism to run ParMAC asynchronously is to tag each submodel with a list (per epoch) of machines it has to visit. All a machine $i$ needs to do upon receiving a submodel is check its list: if $i$ is not in the list, then the submodel has already visited machine $i$ and been updated with its data, so machine $i$ simply sends it along to its successor without updating it again. If $i$ is in the list, then machine $i$ updates the submodel, removes $i$ from its list, and sends it along to its successor. This works even if we use a different communication topology for each submodel at each epoch.
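A minimal sketch of this visit-list mechanism (hypothetical names; `update_fn` stands for the local SGD pass and `send_fn` for forwarding to this machine's successor):

```python
def on_receive(machine_id, submodel, update_fn, send_fn):
    """Handle one incoming submodel on machine `machine_id`.

    `submodel` is a dict with a 'to_visit' set of machine ids (this epoch's
    visit list). Returns True once the submodel has visited every machine.
    """
    if machine_id in submodel["to_visit"]:
        update_fn(submodel)                      # SGD pass over local data
        submodel["to_visit"].remove(machine_id)  # mark this machine as visited
    send_fn(submodel)                            # forward to successor either way
    return not submodel["to_visit"]              # True when the epoch is done
```

Because the decision is made entirely from the tag carried by the submodel, the scheme works with any per-submodel, per-epoch communication topology.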
5 A theoretical model of the parallel speedup for ParMAC
In this section we give a theoretical model to estimate the computation and communication times and the parallel speedup in ParMAC. Specifically, eq. (12) gives the speedup $S(P)$ as a function of the number of machines $P$ and other parameters, and it seems to agree well with our experiments (section 8.3). In practice, this model can be used to estimate the optimal number of machines to use, or to explore the effect on the speedup of different parameter settings (e.g. the number of submodels $M$). Throughout the rest of the paper, we will call "speedup" the ratio $S(P)$ of the runtime using a single machine (i.e., the serial code) vs using $P$ machines (the parallel code), and "perfect speedup" when $S(P) = P$. Our theoretical model applies to the general ParMAC case of $K$ layers, whether differentiable or not; it only assumes that the resulting submodels after introducing auxiliary coordinates are of the same "size", i.e., have the same computation and communication time (this assumption can be relaxed, as we discuss at the end of the section).
We can obtain a quick, rough understanding of the speedup by appealing to (a generalisation of) Amdahl's law (Goedecker and Hoisie, 2001). ParMAC iterates the W and Z steps as follows (where $M$ is the number of submodels and $N$ the number of data points):
W step: $M$ independent problems. Z step: $N$ independent problems.
Roughly speaking, the W step has $M$ independent problems so its speedup would be $\min(P, M)$, while the Z step has $N$ independent problems so its speedup would be $P$ (because in practice $N \gg P$). So the overall speedup would be between $\min(P, M)$ and $P$ depending on the relative runtimes of the W and Z steps. This suggests we would expect a nearly perfect speedup with $P \le M$ and diminishing returns for $P > M$. This simplified picture ignores important factors such as the ratio of computation vs communication (which our model will make more precise), but it does capture the basic, qualitative behaviour of the speedup.
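This back-of-the-envelope bound can be written down directly (an illustration of the generalised-Amdahl argument above, not of eq. (12)):

```python
def amdahl_speedup(P, M, f_W):
    """Rough overall speedup if a fraction f_W of the serial runtime is the
    W step (speedup min(P, M)) and the rest is the Z step (speedup P):
    the weighted harmonic mean of the two per-step speedups."""
    return 1.0 / (f_W / min(P, M) + (1.0 - f_W) / P)
```

For $P \le M$ both terms give speedup $P$; for $P > M$ the W-step term saturates at $M$ and drags the overall speedup below $P$.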
5.1 The theoretical model of the speedup
Let us now develop a more precise, quantitative model. Consider a ParMAC algorithm, operating synchronously, such that there are $M$ independent submodels of the same size in the W step, on a dataset with $N$ training points, distributed over $P$ identical machines (each with $N/P$ points). The ParMAC algorithm runs a certain number of iterations, each consisting of a W and a Z step, so if we ignore small overheads (setup and termination), we can estimate the total runtime as proportional to the number of iterations. Hence, we consider a theoretical model of the runtime of one iteration of the ParMAC algorithm, given the following parameters:
$P$: number of machines.
$N$: number of training points.
We assume $N$ is divisible by $P$. This is not a problem because in practice $N \gg P$ (otherwise, there would be no reason to distribute the optimisation).
$M$: number of submodels in the W step.
This may be smaller than, equal to or greater than $P$.
$e$: number of epochs in the W step.
$t_W^r$: computation time per submodel and data point in the W step.
This is the time to process (within the current epoch) one data point by a submodel, i.e., the time to do an SGD update to a weight vector, per data point (if we use minibatches, then this is the time to process one minibatch divided by the size of the minibatch).
$t_W^c$: communication time per submodel in the W step.
This is the time to send one submodel from one machine to another, including overheads such as buffering, partitioning into messages or waiting time. We assume communication does not overlap with computation, i.e., a machine can either compute or communicate at a given time but not both. Also, communication involves time spent both by the sender and the receiver; we interpret $t_W^c$ as the time spent by a given machine in first receiving a submodel and then sending it.
$t_Z^r$: computation time per data point in the Z step.
This is the time to process one data point entirely, using whatever optimisation algorithm performs the Z step.
$P$, $N$, $M$ and $e$ are integers greater than or equal to 1, and $t_W^r$, $t_W^c$ and $t_Z^r$ are real values greater than 0. This model assumes that $t_W^r$, $t_W^c$ and $t_Z^r$ are constant and equal for every submodel and data point. In reality, even if the submodels are of the same mathematical form and dimension (e.g. each submodel is the weight vector of a linear SVM of a given dimension), the actual times may vary somewhat due to many factors. However, as we will show in section 8.3, the model does agree quite well with the experimentally measured speedups.
Let us compute the runtimes in the W and Z steps under these model assumptions. The runtime in the Z step equals the time for any one machine to process its $N/P$ points with all submodels, i.e., $T_Z(P) = \frac{N}{P}\, t_Z^r$,
since all machines start and end at the same time and do the same amount of computation, without communication. To compute the runtime in the W step, we again consider the synchronous procedure of section 4. At each tick of an imaginary clock, each machine processes its portion of $\lceil M/P \rceil$ submodels on its local data and sends it to its successor. After $P$ ticks, this concludes one epoch. This is repeated for $e$ epochs, followed by a final round of communication of all the submodels. If $M$ is not divisible by $P$, we can apply this procedure pretending there are $P \lceil M/P \rceil$ submodels, some of them fictitious (on which machines do useless work). (This means that our estimated runtime is an upper bound, because when $M$ is not divisible by $P$, there may be a better way to organise the computation in the W step that reduces the time when any machine is idle. In practice this is irrelevant because we implement the computation asynchronously: each machine keeps a queue of incoming submodels it needs to process, from which it repeatedly takes one submodel, processes it and sends it to the machine's successor.) Then, the runtime in each tick is $\lceil M/P \rceil \frac{N}{P} t_W^r$ (time for any one machine to process its $N/P$ points on its portion of submodels) plus $\lceil M/P \rceil t_W^c$ (time for any one machine to send its portion of submodels). The total runtime of the W step is then $eP$ times this plus the time of the final round of communication: $T_W(P) = eP \lceil M/P \rceil \big(\frac{N}{P} t_W^r + t_W^c\big) + P \lceil M/P \rceil t_W^c.$
(The final round actually requires $P - 1$ ticks, but we take it as $P$ ticks to simplify the equation a bit.) Finally, the total runtime of one ParMAC iteration (W and Z steps) with $P$ machines is: $T(P) = T_W(P) + T_Z(P) = eP \lceil M/P \rceil \big(\frac{N}{P} t_W^r + t_W^c\big) + P \lceil M/P \rceil t_W^c + \frac{N}{P} t_Z^r,$
where for $P = 1$ machine we have no communication ($t_W^c = 0$), so $T(1) = eMN t_W^r + N t_Z^r$. Hence, the parallel speedup is $S(P) = \frac{T(1)}{T(P)},$
which can be written more conveniently as $S(P) = \frac{N (eM\rho_W + \rho_Z)}{eP \lceil M/P \rceil \big(\frac{N}{P} \rho_W + 1\big) + P \lceil M/P \rceil + \frac{N}{P} \rho_Z} \qquad (12)$
by defining the following constants: $\rho_W = t_W^r / t_W^c$ and $\rho_Z = t_Z^r / t_W^c$.
These constants can be understood as ratios of computation vs communication, independent of the training set size $N$, number of submodels $M$ and number of machines $P$. These ratios depend on the actual computation within the W and Z steps, and on the performance of the distributed system (computation power of each machine, communication speed over the network or shared memory, efficiency of the parallel processing library that handles the communication between machines). The value of these ratios can vary considerably in practice, but it will typically be considerably smaller than 1, because communication is much slower than computation in current computer architectures.
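Under our reading of the runtime model above (the exact display expressions are reconstructions, so treat this as a sketch rather than the paper's code), one iteration's runtime and the resulting speedup can be computed as:

```python
import math

def runtime(P, N, M, e, t_Wr, t_Wc, t_Zr):
    """Runtime of one ParMAC iteration (W step + Z step) under the model."""
    if P == 1:
        t_Wc = 0.0  # the serial code never communicates
    chunk = math.ceil(M / P)  # submodels handled per machine per tick
    # e*P ticks of (compute + send), plus a final communication-only round.
    T_W = e * P * chunk * ((N / P) * t_Wr + t_Wc) + P * chunk * t_Wc
    T_Z = (N / P) * t_Zr
    return T_W + T_Z

def speedup(P, N, M, e, t_Wr, t_Wc, t_Zr):
    """Parallel speedup S(P) = T(1) / T(P)."""
    return (runtime(1, N, M, e, t_Wr, t_Wc, t_Zr)
            / runtime(P, N, M, e, t_Wr, t_Wc, t_Zr))
```

With $P \le M$ and a small communication-to-computation ratio this gives $S(P) \approx P$, and for $P \gg M$ the communication term eventually makes the speedup decay, matching the analysis in section 5.2.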
5.2 Analysis of the speedup model
We can characterise the speedup $S(P)$ of eq. (12) in the following three cases:
If $P \le M$ and $M$ is divisible by $P$, then $\lceil M/P \rceil = M/P$ and we can write the speedup as follows: $S(P) = \frac{P}{1 + P/\gamma}, \quad \text{with } \gamma = \frac{N (eM\rho_W + \rho_Z)}{(e+1)M}. \qquad (14)$ Here, the speedup no longer involves the ceiling function and is monotonically increasing with $P$. It would asymptote to $\gamma$ as $P \to \infty$, but the expression is only valid up to $P = M$. From (14) we derive the following condition for perfect speedup to occur (in the limit): $P \ll \gamma. \qquad (15)$ (Note that if $t_W^c = 0$, i.e., no communication overhead, then $\gamma = \infty$ and there is no upper bound in (15), but $P \le N$ still holds, because we have to have at least one data point per machine.)
This gives an upper bound on the number of machines for an approximately perfect speedup to be achievable. Although $\rho_W$ and $\rho_Z$ are quite small in practice, the value of $N$ is very large (typically millions or greater), otherwise there would be no need to distribute the data. Hence, we expect $\gamma$ to be large, so $P$ could be quite large. In fact, the limit in how large $P$ can be does not come from this condition (which assumes $P \le M$ anyway) but from the number of submodels, as we will see next.
In summary, we conclude that if $P \le M$ and $M$ is divisible by $P$ then the speedup is given by (14), and in practice typically $S(P) \approx P$.
If $P \le M$ but $M$ is not divisible by $P$: within each interval of $P$ where $\lceil M/P \rceil$ takes a constant value, we obtain that either $S(P)$ is monotonically increasing, or monotonically decreasing, or achieves a single maximum inside the interval. The parallelisation ability in this case is less than if $M$ is divisible by $P$, since now some machines are idle at some times during the W step.
If $P > M$, then $\lceil M/P \rceil = 1$ and we can write the speedup as follows: $S(P) = \frac{N (eM\rho_W + \rho_Z)}{eN\rho_W + (e+1)P + \frac{N}{P}\rho_Z},$ which corresponds to the last interval of $P$ over which $S$ is continuous. We obtain that either $S(P)$ is monotonically decreasing over the whole interval (if $P^* \le M$), or it increases from $P = M$ up to a single maximum at $P^* = \sqrt{N\rho_Z/(e+1)}$ and then decreases monotonically, with maximum value $S(P^*) = \frac{N (eM\rho_W + \rho_Z)}{eN\rho_W + 2\sqrt{(e+1)N\rho_Z}}.$
As $P \to \infty$ we have that $S(P) \to 0$ (assuming $t_W^c > 0$, so that $\rho_W$ and $\rho_Z$ are finite). This decrease of the speedup for large $P$ is caused by the communication overhead in the W step, where $P - M$ machines are idle at each tick of the W step.
In the impractical case where there is no communication cost ($t_W^c = 0$), $S(P)$ is actually monotonically increasing, with $S(P) \to M + t_Z^r/(e\, t_W^r)$ as $P \to \infty$, so the more machines the larger the speedup, although with diminishing returns.
Theorem A.1 shows that $S$ at the beginning of each interval is greater than anywhere before that interval, i.e., $S(M/k) > S(P)$ for $P < M/k$ and integer $k \ge 1$. That is, although the speedup is not necessarily monotonically increasing as a function of $P$, it is monotonically increasing over the sequence of values $P = M/k$. This suggests selecting values of $P$ that make $M/P$ an integer, in particular when $M$ is divisible by $P$.
Globally maximum speedup
This is given by (see appendix A):
If $N\rho_Z/(e+1) \le M^2$: $S_{\max} = S(M)$, achieved at $P = M$.
If $N\rho_Z/(e+1) > M^2$: $S_{\max} = S(P^*)$, achieved at $P^* = \sqrt{N\rho_Z/(e+1)} > M$.
In practice, with large values of $N$, the more likely case is the second one. In this case, the maximum speedup is achieved using more machines than submodels (even though this means some machines will be idle at some times in the W step), and is bigger than $M$. Since diminishing returns occur as we approach the maximum, the practically best value of $P$ will be somewhat smaller than $P^*$.
The “large dataset” case
The case where $N$ is large is practically important because the need for distributed optimisation arises mainly from this. Specifically, if we take $N \to \infty$, the speedup becomes (see appendix A): $S_\infty(P) = \frac{eM\rho_W + \rho_Z}{e \lceil M/P \rceil \rho_W + \rho_Z/P},$ so that the speedup is almost perfect ($S_\infty(P) \approx P$) up to $P = M$, and then it is approximately the weighted harmonic mean of $M$ and $P$ (hence, it is monotonically increasing and between $M$ and $P$). For $P \to \infty$, we have $S_\infty(P) \to M + \rho_Z/(e\rho_W)$.
The “dominant step” case
If we take $t_Z^r \to \infty$, or equivalently $\rho_Z$ very large, which means the Z step dominates the runtime, then $S(P) \to P$. This is because the Z step parallelises perfectly (as long as $P \le N$).
Transformations that keep the speedup invariant
We can rewrite the speedup of eq. (12) as: $S(P) = \frac{eM (N\rho_W) + (N\rho_Z)}{e \lceil M/P \rceil (N\rho_W) + (e+1) P \lceil M/P \rceil + (N\rho_Z)/P} \qquad (21)$ so that $S$ is independent of $N$ on its own, which has been absorbed into the communication-computation ratios $N\rho_W = N t_W^r/t_W^c$ and $N\rho_Z = N t_Z^r/t_W^c$. This means that $S$ depends on the dataset size ($N$) and computation/communication times ($t_W^r$, $t_W^c$, $t_Z^r$) only through $N\rho_W$ and $N\rho_Z$, and is therefore invariant to parameter transformations that leave these ratios unchanged. Such transformations are the following (where $\alpha > 0$):
Scaling $N$, $t_W^r$ and $t_Z^r$ as $\alpha N$, $t_W^r/\alpha$ and $t_Z^r/\alpha$.
“Larger dataset, faster computation,” or “smaller dataset, slower computation.”
Scaling $N$ and $t_W^c$ as $\alpha N$ and $\alpha t_W^c$.
“Larger dataset, slower communication,” or “smaller dataset, faster communication.”
Scaling $t_W^r$, $t_Z^r$ and $t_W^c$ as $\alpha t_W^r$, $\alpha t_Z^r$ and $\alpha t_W^c$.
“Faster computation, faster communication,” or “slower computation, slower communication.”
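These invariances can be checked numerically under our reading of the model (the function `S` below is our own reconstruction of the speedup, not the paper's code):

```python
import math

def S(P, N, M, e, t_Wr, t_Wc, t_Zr):
    """Parallel speedup S(P) = T(1)/T(P) under the runtime model (a sketch)."""
    if P == 1:
        t_Wc = 0.0  # the serial code never communicates
    c = math.ceil(M / P)
    T = e * P * c * ((N / P) * t_Wr + t_Wc) + P * c * t_Wc + (N / P) * t_Zr
    T1 = e * M * N * t_Wr + N * t_Zr
    return T1 / T

# Each of the three scalings above (alpha > 0) leaves S unchanged, e.g.
# S(P, a*N, M, e, t_Wr/a, t_Wc, t_Zr/a) == S(P, N, M, e, t_Wr, t_Wc, t_Zr).
```

The check below verifies all three transformations for several values of $\alpha$.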
5.3 Discussion and examples
Fig. 4 plots a "typical" speedup curve $S(P)$, obtained with a realistic choice of parameter values. It displays the prototypical speedup shape we should expect in practice (the experimental speedups of fig. 10 confirm this). For $P \le M$ the curve is very close to the perfect speedup $S(P) = P$, slowly deviating from it as $P$ approaches $M$. For $P > M$, the curve continues to increase until it reaches its maximum at $P^*$, and decreases thereafter.
Fig. 5 plots $S(P)$ for a wider range of parameter settings. We set the dataset size $N$ to a practically small value, otherwise the curves tend to look like the typical curve from fig. 4. The parameter settings are representative of different, potential practical situations (some more likely than others). We note the following observations:
Again, the most important observation is that the number of submodels $M$ is the parameter with the most direct effect on the speedup: near-perfect speedups ($S(P) \approx P$) occur if $P \le M$, otherwise the speedups are between $M$ and $P$ (and eventually saturate if $P \gg M$).
When the time spent on communication is large in relative terms, the speedup is decreased. This can happen when the runtime of the Z step is low (small $t_Z^r$), when the communication cost is large (large $t_W^c$), or with many epochs (large $e$). Indeed, since the Z step is perfectly parallelisable, any decrease of the speedup should come from the W step.
Some of the curves display noticeable discontinuities, caused by the ceiling function $\lceil M/P \rceil$, occurring at values of $P$ of the form $M/k$ for integer $k \ge 1$. At each such value, $S(P)$ is greater than for any smaller value of $P$, in accordance with theorem A.1. This again suggests selecting values of $P$ that make $M/P$ an integer, in particular when $M$ is a multiple of $P$ ($P = M$, $M/2$, …). This achieves the best speedup efficiency in that machines are never idle (in our theoretical model).
Also, for fixed $P$, the ceiling $\lceil M/P \rceil$ can take the same value for different values of $M$. This explains why some curves (for different $M$) partly overlap.
The maximum speedup is typically larger than $M$ and occurs for $P > M$. It is possible to have the maximum speedup be smaller than $M$ (in which case it occurs at $P \le M$); an example appears in fig. 5, row 3, column 2 (for the larger values). But this happens only when $N\rho_Z/(e+1) \le M^2$, which requires an unusually small dataset and an impractically large number of submodels. Generally, we should expect $P > M$ to be beneficial. This is also seen experimentally in fig. 10.
5.4 Practical considerations
In practice, given a specific problem (with a known number of submodels $M$, number of epochs $e$ and dataset size $N$), our theoretical speedup curves can be used to determine optimal values for the number of machines $P$ to use. As seen in section 8.3, the theoretical curves agree quite well with the experimentally measured speedups. The theoretical curves do need estimates of the computation and communication times $t_W^r$, $t_W^c$ and $t_Z^r$ of the W and Z steps. These are hard to obtain a priori; the computational complexity of the algorithm in big-O notation ignores constant factors that significantly affect the actual times. Their estimates should be measured from test runs.
As seen from eq. (21), we can leave the speedup unchanged by trading off dataset size ($N$) and computation/communication times ($t_W^r$, $t_W^c$, $t_Z^r$) in various ways, as long as one of the following three holds: the products $N t_W^r$ and $N t_Z^r$ remain constant; or the quotient $N / t_W^c$ remains constant; or the quotients $t_W^r / t_W^c$ and $t_Z^r / t_W^c$ remain constant.
Theoretically, the most efficient operating points for $P$ are values such that $M$ is divisible by $P$, because this means no machine is ever idle. In practice, with an asynchronous implementation and with $t_W^r$, $t_W^c$ and $t_Z^r$ exhibiting some variation over submodels and data points, this is not true anymore. Still, if in a given application one is constrained in the number of machines available, choosing $P$ close to a divisor of $M$ would probably be preferable.
One assumption in our speedup model is that the machines are identical in processing power. The model does extend to the case where the machines are different, as noted in our discussion of load balancing (section 4.3). This is because the work performed by each machine is proportional to the number of data points it contains: in the W step, because every submodel runs (implicitly) SGD, and every submodel must visit each machine; in the Z step, because each data point is a separate problem, and involves all submodels (which reside in each machine). Hence, we can equalise the work over machines by loading each machine with an amount of data proportional to its processing speed, independently of the number of submodels.
Another assumption in our model (in the W step) is that all submodels are identical in "size" (computation and communication time). This is sometimes not true. For example, in the BA, we have submodels of two types: the $L$ encoders (each a binary linear SVM operating on a $D$-dimensional input) and the $D$ decoders (each a linear regressor operating on an $L$-dimensional input). Since $L \ll D$, the encoders are bigger than the decoders and take longer to train and communicate. We can still apply our speedup model if we "group" smaller submodels into a single submodel of size comparable to the larger submodels, so as to equalise as much as possible the submodel sizes (the actual implementation does not need to group submodels, of course). For the BA, under the reasonable assumption that the ratio of computation times (and communication times) of decoder vs encoder is $L/D$, we can group the $D$ decoders into $L$ groups of $D/L$ decoders each. Each group of decoders now has a computation and communication time equal to that of one encoder. This gives an effective number of independent submodels $M = 2L$, and this is what we use when applying the model to the experimental speedups in section 8.3.
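The grouping argument for the BA can be sketched numerically (a hypothetical helper; it assumes, as in the text, that a decoder's cost relative to an encoder's is L/D and that L divides D):

```python
def effective_submodels(L, D):
    """Effective number of equal-size submodels for the binary autoencoder:
    L encoders (cost ~ D each) plus D decoders (cost ~ L each), with the
    decoders grouped into bundles whose cost matches one encoder."""
    group_size = D // L          # decoders per bundle, each bundle costs ~ D
    n_groups = D // group_size   # number of decoder bundles (= L)
    return L + n_groups          # encoders + decoder bundles = 2L
```

For instance, a BA with 16-bit codes and 64-dimensional inputs would give 16 + 16 = 32 effective submodels.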
Finally, we emphasise that the goal of this section was to characterise the parallel speedup of ParMAC quantitatively and demonstrate the runtime gains that are achievable by using $P$ machines. In practice, other considerations are also important, such as the economic cost of using many machines (which may limit the maximum $P$ available); the type of machines (obviously, we want all the computation and communication times as small as possible); the choice of optimisation algorithm in the W and Z steps; the fact that, because of its size, the dataset may need to be, or already is, distributed across machines; etc. It is also possible to combine ParMAC with other, orthogonal techniques. For example, if each of the submodels in the W step is a convex optimisation problem (as is the case with the linear SVMs in the binary autoencoder), we could apply the techniques for distributed convex optimisation described in section 2 to each submodel. This would effectively allow for larger speedups when $P > M$.
6 Convergence of ParMAC
The only approximation that ParMAC makes to the original MAC algorithm is using SGD in the W step. Since we can guarantee convergence of SGD under certain conditions, we can recover the original convergence guarantees for MAC. Let us see this in more detail. Convergence of MAC to a stationary point is given by theorem B.3 in Carreira-Perpiñán and Wang (2012), which we quote here:
Consider the constrained problem of eq. (5) and its quadratic-penalty function of eq. (6). Given a positive increasing sequence $(\mu_k) \to \infty$, a nonnegative sequence $(\tau_k) \to 0$, and a starting point, suppose the quadratic-penalty method finds, for each $k = 1, 2, \dots$, an approximate minimiser of the penalty function for $\mu_k$ whose gradient norm is at most $\tau_k$. Then the sequence of approximate minimisers converges to a limit, which is a KKT point for the problem (5), and its Lagrange multiplier vector is given by the limit of $\mu_k$ times the corresponding constraint violations, for $n = 1, \dots, N$.
This theorem applies to the general case of differentiable layers, where the standard Karush-Kuhn-Tucker (KKT) conditions hold (Nocedal and Wright, 2006). It relies on a standard condition for penalty methods for nonconvex problems (Nocedal and Wright, 2006), namely that we must be able to reduce the gradient of the penalised function below an arbitrary tolerance for each value of the penalty parameter $\mu$ (i.e., at each MAC iteration). This can be achieved by running a suitable (unconstrained) optimisation method for sufficiently many iterations. How does this change in the case of ParMAC? The Z step remains unchanged with respect to MAC (the fact that the optimisation is distributed is irrelevant since the subproblems are independent). The W step does change, because we are now obliged to use a distributed, stochastic training. What we need to ensure is that we can reduce the gradient of the penalised function with respect to each submodel (since they are independent subproblems in the W step) below an arbitrary tolerance. This can also be guaranteed under standard conditions. In general, we can use convergence conditions from stochastic optimisation (Benveniste et al., 1990; Kushner and Yin, 2003; Pflug, 1996; Spall, 2003; Bertsekas and Tsitsiklis, 2000). Essentially, these are Robbins-Monro schedules, which require the learning rate $\eta^{(k)}$ of SGD to decrease such that $\eta^{(k)} > 0$, $\sum_k \eta^{(k)} = \infty$ and $\sum_k (\eta^{(k)})^2 < \infty$, where $k$ is the epoch number (SGD iteration, or pass over the entire dataset). (Note that it is not necessary to assume that the points or minibatches are sampled at random during updates. Various results exist that guarantee convergence with deterministic errors, e.g. Bertsekas and Tsitsiklis, 2000 and references therein, rather than stochastic errors. These results assume a bound on the deterministic errors (rather than a bound on the variance of the stochastic errors), and apply to general, nonconvex objective functions with standard conditions (Lipschitz continuity, Robbins-Monro schedules for the step size, etc.). They apply as a particular case to the "incremental gradient" method, where we cycle through the data points in a fixed sequence.) We can give much tighter conditions on the convergence and the convergence rate when the subproblems in the W step are convex (which is often the case, as with logistic or linear regression, linear SVMs, etc.). This is a topic that has received much attention recently (see section 2), and many such conditions exist, often based on techniques such as Nesterov accelerated algorithms and stochastic average gradient (Cevher et al., 2014). They typically bound the distance to the minimum in objective function value as $\mathcal{O}(1/\sqrt{k})$ or $\mathcal{O}(1/k)$.