## 1. Introduction

The dramatic success of deep networks has fueled the growth of massive datasets, e.g. Google’s JFT dataset has 100 million images, this in turn has prompted researchers to employ even larger models. Parallel and distributed training of deep learning is paramount to tackle problems at this scale. Such an escalation however hits a roadblock: to minimize communication costs, one could use large batch sizes in stochastic gradient descent (SGD) but this leads to a degradation of the generalization performance.

^{1}

^{1}1We discuss the connection to the recent work of Goyal et al. (2017) in Section 2.5 On the other hand, small batches incur communication costs that quickly dwarf the benefits of parallelization.

In this paper, we take a different approach than model or data parallelization. The algorithm we introduce, called Parle, trains multiple copies of the same model in parallel. Each of these copies performs multiple gradient steps and communicates its progress very infrequently to the master parameter server. This approach is motivated by the phenomenon of “flat minima” that has been show to improve generalization performance of deep networks. Parle has few low communication requirements and is thus well suited to both single-machine multi-GPU settings and distributed implementations over multiple compute nodes. We demonstrate extensive empirical evidence that it obtains significant performance improvements over baseline models and obtains nearly state-of-the-art generalization errors; it also obtains a - wall-clock time speedup over data-parallel SGD. Moreover, Parle is insensitive to hyper-parameters, all experiments in this paper are conducted with the same hyper-parameters. We do not introduce any additional hyper-parameters over SGD.

### 1.1. Approach

If we denote the parameters of a deep network by , training consists of solving

(1) |

where is the average loss (say cross-entropy) over the entire dataset, along with a regularization term (say weight decay). We denote with

copies of the model by variables, also called “replicas.” They may reside on multiple GPUs on the same computer, on multiple compute nodes. As a means of coupling these variables, consider the loss function of Elastic-SGD

(Zhang et al., 2015a):(2) |

where a parameter couples two replicas and through a “reference” variable . Performing gradient descent on Eq. 2 involves communicating the replicas for all with the reference after each mini-batch. Even though Elastic-SGD was introduced in the parallel setting, it nonetheless introduces significant communication bottlenecks.

In order to reduce this communication, we replace by a modified loss function called “local entropy”

(3) |

where

is the Gaussian kernel with variance

. We will discuss Entropy-SGD (Chaudhari et al., 2016) in Section 2.1 which is an algorithm to solve(4) |

The Parle algorithm instead solves

(5) |

As we will see in Section 2.3, for a semi-convex function , the three problems above, namely, Elastic-SGD Eq. 2, Entropy-SGD Eq. 4 and Parle Eq. 5 are equivalent. They differ in their communication requirements: Entropy-SGD does not involve any communication, Elastic-SGD involves a large communication overhead while Parle strikes a balance between the two.

Parle can be thought of as running Entropy-SGD to minimize and coupling each member of the ensemble via Elastic-SGD. The loss function forces every to minimize its loss and also forces the reference to move towards the mean of all the s. As training progresses, we let both and . This technique is called “scoping” and it causes all replicas to collapse to a single configuration which is our solution. The use of this technique for Elastic-SGD is novel in the literature and it performs very well in our experiments (Section 4).

### 1.2. Motivation

Training independent copies of a deep network in parallel is difficult. Let us discuss an experiment that explores averaging such models and motivates the choice of the loss function in Section 1.1. We trained instances of the All-CNN architecture of Springenberg et al. (2014) on the CIFAR-10 dataset. These networks converge to a training error of and a validation error of . Averaging their softmax predictions performs only slightly better than any individual network and the ensemble gets a validation error of , indeed if we look at the correlation of the softmax predictions of these networks, it shows that they make mistakes on the same examples. This marginal improvement comes at the cost of a large test-time performance penalty.

On the other hand, a model that consists of the average weights of these independent networks (“one shot averaging”) performs poorly: it obtains validation error, which is equivalent to guessing. This is expected, given the non-convexity of the loss function for deep networks. Although individual networks might converge to good local minima, their average need not even be locally optimal.

A typical deep network possess permutation symmetries, i.e., if the first and last layers are fixed, filters on intermediate layer can be permuted without changing the predictions. For a fully-connected network, this is akin to permuting the rows and columns of successive weight matrices. Post-training, we aligned each network to the first one by permuting its filters using a greedy layer-wise matching algorithm. Fig. 1 shows the average overlap of the weights of these aligned filters; it can be seen as a permutation invariant metric between two networks. As we can see, these copies are very far away from each other in weight space.

What is surprising is that while a naively averaged model has close to validation error, a model obtained by averaging after aligning the weights performs much better at validation error. This suggests that if we can force the copies to remain aligned to each other during training, we can obtain one single average model at the end that combines these copies. Parle uses a quadratic distance to align two copies during training. The loss function in Eq. 5 ensures that two replicas and are aligned through the quadratic forcing term . This encourages members of the ensemble to be close to each other in Euclidean space. For a small enough , these replicas are constrained to have a large overlap while still minimizing their individual loss functions . As towards the end of training, the overlap goes to unity.

## 2. Background and related work

The loss function of a deep network

is a non-convex function, and there have been numerous efforts to understand the geometry of the energy landscape. The relevance of this literature to our work is that it has been observed empirically that “flat minima” (local extrema where most of the eigenvalues of the Hessian are near zero) yield parameters that produce classifiers with good generalization

(Hochreiter and Schmidhuber, 1997; Baldassi et al., 2016b). Entropy-SGD (Chaudhari et al., 2016) which we describe in the next section, was designed specifically to seek such minima. In the following section, we also discuss Elastic-SGD (Zhang et al., 2015a) since the updates of Parle to minimize Eq. 5 combine these two algorithms.### 2.1. Entropy-SGD

Consider the sub-problem Eq. 4 that involves minimizing . The authors in Chaudhari et al. (2016)

constructed an SGD-based algorithm to solve this which involves an inner-loop that executes Markov chain Monte Carlo (MCMC) updates. This algorithm can be written as equationparentequation

(6a) | ||||

(6b) | ||||

(6c) |

We initialize to every time is an integer. Let us parse these updates further, note that the variable performs gradient descent on the original loss function with a “proximal” term that ensures that successive updates of stay close to . The variables maintain a running exponential average of and the updates uses this average as the gradient. It can be proved that these updates are equivalent to performing gradient descent on (Chaudhari et al., 2016). The system Eq. 6 is thus simply

with the gradient where denotes the average of the iterates; we use exponential averaging in Eq. 6b. We have introduced a slight modification in Eq. 6a as compared to Chaudhari et al. (2016), namely, we do not add any MCMC noise. In practice, the gradient of a deep network

is computed on mini-batches and is a noisy estimate of the true gradient, so there is already some inherent stochasticity in

Eq. 6a.### 2.2. Elastic-SGD

Let us now discuss how to minimize the objective in Eq. 2. Consider a setting where the objective is split between different workers , these could be physically different computers, or different process on the same machine. Each worker performs SGD to minimize its own objective while communicating the “elastic” gradient after each SGD update. The updates therefore look like equationparentequation

(7a) | ||||

(7b) |

for all replicas with . If one considers the variable as the “master” or the parameter server, we can see that these updates again have the flavor of the master updating itself with the average of the replicas. Indeed Eq. 6a and Eq. 7a are similar; while the former takes steps in the variable, the latter performs independent steps in the variables.

### 2.3. Equivalence of Entropy-SGD and Elastic-SGD

The resemblance of Eq. 6 and Eq. 7 is not a coincidence, the authors in Chaudhari et al. (2017) proved that Elastic-SGD is equivalent to Entropy-SGD if the updates converge quickly, i.e., if the sub-objective of Eq. 6a,

is strictly convex in . This happens if . This condition implies that the stochastic process of the updates in Eq. 6a has an ergodic steady-state distribution whereby temporal averages ( updates of Eq. 6b) are equivalent to spatial averages Eq. 7b. Using different techniques, the two objectives can be shown equivalent under different approximations (Baldassi et al., 2016a).

Operationally speaking, Entropy-SGD is a sequential MCMC algorithm and hence hard to parallelize. The updates in Eq. 6a form a single trajectory of steps before the update Eq. 6c and it is difficult to execute chunks of this trajectory independently. On the other hand, Elastic-SGD is a naturally parallelizable algorithm but suffers from a large communication overhead; every weight update Eq. 7b requires a reduce operation from all the workers and another broadcast of to each of them. This becomes prohibitive for large deep networks. Fortunately, the two algorithms are equivalent, and we can exploit this to minimize Eq. 5.

### 2.4. Scoping

We can see from Eq. 3 that local entropy converges to the original loss function as and it converges to a constant over the entire parameter space when . This is also true for the loss function of Elastic-SGD

as , the variable converges to the average of the global minimizers of . For a convex loss , the replicas and the reference all collapse to the same configuration. Our use of this technique for Elastic-SGD is novel in the literature and it performs very well in our experiments (Section 4).

### 2.5. Related work

There are two primary ways to parallelize the training of deep networks. The first, called model parallelism, distributes a large model across multiple GPUs or compute nodes (Krizhevsky, 2014; Dean et al., 2012). This does not scale well because intermediate results of the computation need to be transmitted to different compute nodes quickly, which puts severe limits on the latencies that the system can tolerate, e.g., large neural networks with recurrent or fully-connected layers are not amenable to model parallelism.

Data parallelism is more widely used (Jin et al., 2016; Moritz et al., 2015; Taylor et al., 2016) and maintains multiple copies of the network on different GPUs or compute nodes. Each mini-batch is split evenly amongst these copies who compute the gradient on their samples in parallel. A parameter server then aggregates these partial gradients and updates the weights. This aggregation can be either synchronous or asynchronous. The former is guaranteed to be exactly equivalent to a non-distributed, sequential implementation while the latter can work with large communication latencies but offers guarantees only for convex or sparse problems (Recht et al., 2011; Duchi et al., 2013).

Both synchronous and asynchronous gradient aggregation necessitate a high per-worker load (Qi et al., 2016) if they are to scale well. This is difficult to do in practice because small mini-batches require frequent communication while large batch-sizes suffer from a degradation of the generalization performance. Previous work in the literature has mostly focused on minimizing the communication requirements using stale gradients in HogWild! fashion (Recht et al., 2011; Zhang et al., 2015b), extreme quantization of gradients (Seide et al., 2014)

etc. These heuristics have shown impressive performance although it is hard to analyze their effect on the underlying optimization for deep networks. Synchronous approaches often require specialized hardware and software implementations to hide communication bottlenecks

(Wu et al., 2015; Abadi et al., 2015; Chen et al., 2015). Recent work by Goyal et al. (2017) shows that one can indeed obtain generalization performance comparable to small batch-sizes by a careful tuning of the learning rate. While large batches, in theory, can enable large learning rates, as is often seen in practice, optimization of a deep network with a large learning rate is difficult and even the training loss may not decrease quickly. The authors demonstrate that a warm-up scheme that increases the learning rate from a small initial value followed by the usual annealing leads to similar training and validation error curves as those of small mini-batches. Parle can benefit from both synchronous and asynchronous gradient aggregation, indeed, each replica in Eq. 5 can itself be data-parallel. Our work shows that exploiting the properties of the optimization landscape leads to better generalization performance.Ensemble methods train multiple models and average their predictions to improve generalization performance^{2}^{2}2

the top-performing methods on ImageNet are ensembles of deep networks (

http://image-net.org/challenges/LSVRC/2016/results). Deploying ensembles of deep networks in practice however remains challenging due to both the memory footprint and test-time latency of state-of-the-art networks. For instance, an ensemble of an object detection network like YOLO (Redmon et al., 2016) that works at frames per second on PASCAL VOC can no longer be used in a real-time environment like a self-driving car (Leonard et al., 2008) with limited computational budget. In contrast to an ensemble, Parle results in one single model that performs well at test-time. Second, training multiple models of an ensemble is computationally expensive (Loshchilov and Hutter, 2016; Huang et al., 2017), and as our experiment in Section 1.2 shows, the ensemble obtains a marginal improvement over each individual model. Parle maintains a correlated, robust ensemble during training and returns the average model that obtains better errors than a naive ensemble.## 3. Parle

Stochastic gradient descent step to minimize Eq. 5 amounts to combining Entropy-SGD in Eq. 6 and Elastic-SGD in Eq. 7. For all replicas , Parle performs the following updates: equationparentequation

(8a) | ||||

(8b) | ||||

(8c) | ||||

(8d) |

We reset to every time is an integer. Notice that Eq. 8a and Eq. 8b are the same as Eqs. 6a to 6b. The update for in Eq. 8d is also the same as the update for the reference variable in Eq. 7 with a step-size of . The only difference is that Eq. 8c takes a gradient step using both the gradient of and the gradient of the elastic term .

###### Remark 1 (Scoping and learning rate annealing).

The gradient of local entropy is

The learning rate for a deep network is typically reduced in steps as the training progresses. As discussed in Section 2.1, we would like to take , which interferes with this learning rate annealing. We have therefore scaled up the learning rate by in Eq. 8c. We do not scale it for the second term because this automatically gives us a weighted combination of the gradients of local entropy and the proximal term. This choice is akin to picking a modified annealing schedule for the parameters and .

### 3.1. Hyper-parameter choice

For all the experiments in this paper, the parameters of Parle in Eq. 8 are fixed to , . We also set in Eq. 8d, i.e., at each update we simply average the replicas to get

The parameters and are updated every time is an integer to

(9) |

where and and is the number of mini-batches in the dataset. We clip at and at . The parameter is fixed to be the initial learning rate.

The only remaining parameter is the learning rate which we drop by a factor of when the validation error plateaus, Section 4 provides more details. We have found that this algorithm is quite robust to parameter changes. In particular, both the speed of convergence and the final generalization error are insensitive to the exact values of or .

###### Remark 2 (Nesterov’s momentum).

We use Nesterov’s momentum (fixed to ) for updating the variables in Eq. 8a and in Eq. 8c. With our specific choice of , the update does not have first-order dynamics, and we therefore do not use momentum to update it. With other choices of , one could use momentum for Eq. 8d as well, but we found this to converge marginally slower.

### 3.2. Many deputies under one sheriff

Consider an alternative optimization problem:

(10) |

Under an ergodicity assumption, from the discussion in Section 2, this is equivalent to Eq. 5. It is also equivalent to Eq. 7 with a modified coupling between the workers where the constant in Eq. 2 takes to different values. This loss function provides a completely distributed version of Parle and has the interpretation of “deputies” connected to a “sheriff” through the term . Each deputy in turn controls the workers through the proximal term .

The variables are high-dimensional and there are copies. Optimizing Eq. 10 therefore involves a large communication overheard: each deputy communicates bits with its workers and the sheriff communicates another bits with each deputy. The total communication complexity is thus quadratic at each weight update which is prohibitive for large . On the other hand, using the updates in Eq. 6 and Eq. 7 to minimize Eq. 5 results in an amortized communication complexity of while remaining equivalent to Eq. 10.

###### Remark 3 (Running Parle on diverse computational platforms).

We can extend the above loss function further and again replace with either or even the Elastic-SGD loss function. Coupled with the idea of HogWild! or stale gradients, one can achieve a seamless trade-off between replicas that may have a lot of computational budget but relatively scarce communication resources, e.g., GPUs, and replicas that can communicate faster than they can compute, e.g., CPUs and mobile devices. Entropy-SGD is naturally suited to the former and Elastic-SGD is naturally suited to the latter, the loss function of Parle shows how to couple such diverse computational platforms together.

## 4. Empirical validation

This section discusses experimental results on a variety of benchmark image classification datasets, namely, MNIST (Section 4.2), CIFAR-10 and CIFAR-100 (Section 4.3) and SVHN (Section 4.4). Parle obtains nearly state-of-the-art errors with a significant wall-clock time speed up without any additional hyper-parameters. In Section 5, we show that Parle obtains error rates better than SGD with full data even in the case when each replica only has access to a subset of the data. For all the following experiments, we use SGD with Nesterov’s momentum (fixed to ) as a baseline and compare the performance of Parle

, Entropy-SGD and Elastic-SGD. We compute the mean and standard deviation of the validation error over

runs for each algorithm with random initialization. However, for some networks that take a long time to train, we only report the results of a single run (see Table 1 and Table 2).### 4.1. Implementation details and communication latencies

We have implemented a parallel version of the updates in Eq. 8, i.e., we execute multiple neural networks in the same process on a standard desktop with GPUs. The communication with the master for the variable happens via optimized NCCL^{3}^{3}3https://devblogs.nvidia.com/parallelforall/fast-multi-gpu-collectives-nccl

routines on PCI-E. This is implemented using PyTorch

^{4}

^{4}4http://pytorch.org in Python but it scales well to replicas. In particular, the reduce operation required in Eqs. 8c to 8d does not incur a large communication overhead. For instance, on our machine, each mini-batch of size for the wide residual network WRN-28-10 in Section 4.3 takes ms on an average, while steps Eqs. 8c to 8d take ms, their ratio is and is therefore negligible. For the All-CNN network in Section 5, this ratio is . Let us however note that addressing communication bottlenecks in a distributed implementation is non-trivial. As an example, the authors in Goyal et al. (2017) perform gradient aggregation and backprop in parallel. Parle can also benefit from such techniques although that is not the focus of our paper.

###### Remark 4 (Plotting against wall-clock time).

Entropy-SGD and SGD are sequential algorithms while Parle and Elastic-SGD can run on all available GPUs simultaneously. In order to obtain a fair comparison, we run the former two in data-parallel fashion on three GPUs and plot all error curves against the wall-clock time. For the networks considered in our experiments, with a batch-size of , the efficiency of a data-parallel implementation in PyTorch is above (except for ).

### 4.2. Mnist

We use the standard architecture (LeCun et al., 1998)

with ReLU nonlinearities

(Nair and Hinton, 2010)(Ioffe and Szegedy, 2015). It has two convolutional layers with and channels, respectively, followed by a fully-connected layer of hidden units that culminates into a -way softmax. Both the convolutional and fully-connected layers use a dropout (Srivastava et al., 2014)of probability

. We do not perform any pre-processing for MNIST. The learning rate is initialized to and dropped by a factor ofat epochs

for SGD and only once, after the second epoch, for Entropy-SGD and Parle. Fig. 2 shows the validation errors for , Parle obtains a validation error of with three replicas as compared to about - on with SGD, Entropy-SGD and Elastic-SGD.### 4.3. CIFAR-10 and CIFAR-100

We use the WRN-28-10 network of Zagoruyko and Komodakis (2016) for these datasets; this architecture was shown to have very good empirical performance in spite of not being very deep. We use the same training pipeline and hyper-parameters as that of the original authors to enable an easy comparison. In particular, we perform global contrast normalization followed by ZCA normalization and train with data-augmentation which involves random mirror-flipping (with probability ) and random crops of size

after padding the image by

pixels on each side. We use a dropout of probability and weight decay of . The learning rate is initialized to and dropped by a factor of at epochs for SGD and for Entropy-SGD and Parle. The former is taken from the original paper while the later was constructed using the heuristic that Parle and Entropy-SGD use gradient evaluations per weight update.Figs. 2(a) to 2(b) show the empirical performance on CIFAR-10 and CIFAR-100 respectively with the same data summarized in Table 1. In our implementation, we obtained a validation error of with SGD as compared to by Zagoruyko and Komodakis (2016) for CIFAR-10, however our baseline for CIFAR-100 matches well with theirs, both obtain validation error. On CIFAR-10, Parle with replicas obtains a significantly better validation error of while it obtains error on CIFAR-100. Note that these are both more than better than the baseline SGD with exactly the same network and pre-processing. For these datasets, we found that the benefit of adding more replicas is small, with , we see an initial speed up for both CIFAR-10 and CIFAR-100, but the network converges to a worse error with the same hyper-parameters. Note that if more GPUs are available, each replica can itself be run in a data-parallel fashion to further accelerate training time. Nevertheless, both versions of Parle are better than the baseline SGD implementation.

It is instructive to compare the performance of Elastic-SGD and Entropy-SGD with Parle. Since Parle essentially combines these two algorithms, as we saw in Section 3, it is a more powerful than either of them. This is corroborated by our experimental evidence. We also observed that Entropy-SGD obtains very similar errors as those of Elastic-SGD with scoping on the parameter. Scoping was adapted from the results of Chaudhari et al. (2016) and in our experience, it improves the performance of Elastic-SGD significantly.

Our errors on CIFAR-10 and CIFAR-100 are better than those reported previously in published literature for a single model ^{5}^{5}5 Recent work by Gastaldi (2017) reports an error of on CIFAR-10 and on CIFAR-100 using “shake-shake” regularization on a three-branch residual network that is trained for epochs. In fact, our result on CIFAR-10 is better than the reported error in Huang et al. (2017) on an ensemble of six DenseNet-100 networks (Huang et al., 2016). Our result on CIFAR-100 is only slightly worse than a DenseNet-100 ensemble which gets (Huang et al., 2017).

Model | Parle | Elastic-SGD | Entropy-SGD | SGD | ||||
---|---|---|---|---|---|---|---|---|

Error | Time | Error | Time | Error | Time | Error | Time | |

(MNIST, ) | ||||||||

WRN-28-10 (CIFAR-10, ) | ||||||||

WRN-28-10 (CIFAR-100, ) | ||||||||

WRN-16-4 (SVHN) | ||||||||

### 4.4. Svhn

SVHN is a dataset consisting of house numbers from Google’s Street View and contains about images of digits. We use the WRN-16-4 network of (Zagoruyko and Komodakis, 2016). We perform global contrast normalization for the input images and do not perform data augmentation. The dropout probability is set to and weight decay is set to . The learning rate is initialized to and dropped by a factor of at epochs for SGD and epochs for Entropy-SGD and Parle.

Fig. 4 shows the validation error for SVHN using Parle with three replicas, Entropy-SGD, and SGD. For comparison, the authors in Zagoruyko and Komodakis (2016) report error with SGD. All the three algorithms obtain comparable errors in this case, with Elastic-SGD being marginally better. For comparison, the best reported result on SVHN is using a larger network WRN-16-8 which gets error, this is very close to our result with Elastic-SGD with scoping. Let us note that Elastic-SGD does not work this well without scoping, we did not get errors below on SVHN.

### 4.5. Training error

Let us look at the training error for WRN-28-10 on CIFAR-10 (Fig. 4(a)) and CIFAR-100 (Fig. 4(b)) and WRN-16-4 on SVHN (Fig. 4(c)). Note that while SGD and Elastic-SGD always converge to near-zero training errors, both Entropy-SGD and Parle have much larger training error and do not over-fit as much. The minima discovered by SGD and Parle are qualitatively different: while the former manages to get almost zero training error and converge to the global minimum of the loss function, it ends up over-fitting to the training data and does not generalize as well. Parle obtains superior generalization performance at the cost of under-fitting to the training data. This also sheds light on the structure of the loss function. For the purposes of deep learning, with current regularization techniques, it is not always important to converge to the global optimum. Flat minima, which may exist at higher energy levels instead afford better generalization performance and can be found by algorithms like Entropy-SGD and Parle that are designed to seek them. Note that Elastic-SGD only guaranteed to find flat minima for semi-convex loss functions (Chaudhari et al., 2017).

## 5. Splitting the data between replicas

In the experiments above, each replica in Parle has access to the entire dataset. Our aim in this section is to explore how much of this improvement can be traded off for speed. If we denote the entire dataset by , each replica (see Section 3) only operates on a subset . We split the dataset evenly amongst the replicas, i.e.,

and all are of the same size. In particular, we ensure that each sample lies in at least one of the subsets . By doing so, we would like to explore the efficacy of the proximal term in Eq. 5. Effectively, the only way a replica gets a gradient term on is through this term. We will consider two cases, (i) with replicas and each gets of the training data, and (ii) with replicas and each gets of the training data.

We use the All-CNN network of Springenberg et al. (2014) for these experiments on CIFAR-10. Again, the hyper-parameters are kept the same as the original authors, in particular, we use a dropout probability of with weight decay of and data augmentation with mirror flips and random crops. Entropy-SGD and SGD are non-distributed algorithms and cannot handle this scenario, we therefore compare Parle with Elastic-SGD (both operate on the same subsets) and a data-parallel SGD on three GPUs with access to the entire dataset. The latter is our baseline and obtains validation error. As Fig. 5(a) and Fig. 5(b) show, quite surprisingly, Parle obtains better error than SGD in spite of the dataset being split between the replicas. The speedup in wall-clock time in Fig. 5(b) is a consequence of the fact that Parle has very few mini-batches. Elastic-SGD with split data converges quickly but does not obtain a validation error as good as SGD on the full dataset. For comparison, Parle obtains an error of on CIFAR-10 with this network if it has access to the entire dataset in minutes. To our knowledge, this is the best reported error on CIFAR-10 without a residual network, which is itself of importance since All-CNN is about faster at test-time than WRN-28-10. In Table 2, for comparison, we also report the error of SGD with access to a random subset of the training data (averaged over independent runs); as expected, this is much worse than its error with access to the full dataset.

This experiment shows that the proximal term is strong enough to pull the replicas towards good regions in the parameter space in spite of their individual gradients being computed on different datasets . If the proximal term is strong enough (as ), they can together converge to a region in the parameter space that works for the entire dataset. Exploiting this observation in a data distributed setting (McMahan et al., 2016) to obtain state-of-the-art performance is a promising direction for future work.

Model | Parle | Elastic-SGD | SGD | |||
---|---|---|---|---|---|---|

Error | Time | Error | Time | Error | Time | |

All-CNN (full data) | ||||||

All-CNN (, 50 data) | ||||||

All-CNN (, 25 data) | ||||||

SGD performs poorly in these cases because it only has access to a (random) subset of the training data.

## 6. Discussion

This paper proposed an algorithm called Parle for training deep neural networks in parallel that exploits the phenomenon of flat regions in the parameter space. Parle requires infrequent communication with the parameter server and instead performs more computation on each client. It scales well to multi-GPU parallel settings and is amenable to a distributed implementation. Our experiments showed that it obtains nearly state-of-the-art performance on benchmark datasets. We obtain significantly better errors than SGD with the same architecture which shows that even with numerous regularization techniques like weight-decay, dropout and batch-normalization, there is still some performance left on the table by SGD, which Parle can extract.

In the broader scope of this work, parallelization of non-convex problems like deep neural networks is fundamentally different from convex problems that have been primarily studied in distributed machine learning. Impressive large-scale distributed systems have been built for the specific purposes of deep learning but obtaining a theoretical understanding of popular heuristics used in these systems is hard.

Parle is a step towards developing such an understanding, for instance, the loss function used in this paper is a specific way to smooth a rugged, non-convex loss function (Chaudhari et al., 2017).Another interesting offshoot of Parle is that different replicas can have very different computational and communication capabilities. For instance, replicas with GPUs are more suited to run Entropy-SGD while CPU clusters and mobile devices, which can typically communicate quicker than they can compute, are more suited to run Elastic-SGD steps. Coupling these diverse platforms together in a single, interpretable loss function to train a shared model, or multiple coupled models of different sizes, is promising for both scaling up further and learning from private and sensitive data.

## References

- Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., Brevdo, E., et al. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems.
- Baldassi et al. (2016a) Baldassi, C., Borgs, C., Chayes, J., Ingrosso, A., Lucibello, C., Saglietti, L., and Zecchina, R. (2016a). Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. PNAS, 113(48):E7655–E7662.
- Baldassi et al. (2016b) Baldassi, C., Gerace, F., Lucibello, C., Saglietti, L., and Zecchina, R. (2016b). Learning may need only a few bits of synaptic precision. Physical Review E, 93(5):052313.
- Chaudhari et al. (2016) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2016). Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. arXiv:1611.01838.
- Chaudhari et al. (2017) Chaudhari, P., Oberman, A., Osher, S., Soatto, S., and Guillame, C. (2017). Deep Relaxation: partial differential equations for optimizing deep neural networks. arXiv:1704.04932.
- Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. (2015). Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv:1512.01274.
- Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., et al. (2012). Large scale distributed deep networks. In NIPS.
- Duchi et al. (2013) Duchi, J., Jordan, M. I., and McMahan, B. (2013). Estimation, optimization, and parallelism when data is sparse. In NIPS.
- Gastaldi (2017) Gastaldi, X. (2017). Shake-shake regularization. arXiv:1705.07485.
- Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. (2017). Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv:1706.02677.
- Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.
- Huang et al. (2017) Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. arXiv:1704.00109.
- Huang et al. (2016) Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. (2016). Densely connected convolutional networks. arXiv:1608.06993.
- Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
- Jin et al. (2016) Jin, P. H., Yuan, Q., Iandola, F., and Keutzer, K. (2016). How to scale distributed deep learning? arXiv:1611.04581.
- Krizhevsky (2014) Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
- Leonard et al. (2008) Leonard, J., How, J., Teller, S., Berger, M., Campbell, S., Fiore, G., Fletcher, L., Frazzoli, E., Huang, A., Karaman, S., et al. (2008). A perception-driven autonomous urban vehicle. Journal of Field Robotics, 25(10):727–774.
- Loshchilov and Hutter (2016) Loshchilov, I. and Hutter, F. (2016). Sgdr: stochastic gradient descent with restarts. arXiv:1608.03983.
- McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. (2016). Communication-efficient learning of deep networks from decentralized data. arXiv:1602.05629.
- Moritz et al. (2015) Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. (2015). Sparknet: Training deep networks in spark. arXiv:1511.06051.
- Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In ICML.
- Qi et al. (2016) Qi, H., Sparks, E. R., and Talwalkar, A. (2016). Paleo: A performance model for deep neural networks. https://openreview.net/forum?id=SyVVJ85lg¬eId=SyVVJ85lg.
- Recht et al. (2011) Recht, B., Re, C., Wright, S., and Niu, F. (2011). Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS.
- Redmon et al. (2016) Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In CVPR.
- Seide et al. (2014) Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association.
- Springenberg et al. (2014) Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv:1412.6806.
- Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958.
- Taylor et al. (2016) Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. (2016). Training neural networks without gradients: A scalable admm approach. In ICML.
- Wu et al. (2015) Wu, R., Yan, S., Shan, Y., Dang, Q., and Sun, G. (2015). Deep image: Scaling up image recognition. arXiv:1501.02876.
- Zagoruyko and Komodakis (2016) Zagoruyko, S. and Komodakis, N. (2016). Wide residual networks. arXiv:1605.07146.
- Zhang et al. (2015a) Zhang, S., Choromanska, A., and LeCun, Y. (2015a). Deep learning with elastic averaging SGD. In NIPS.
- Zhang et al. (2015b) Zhang, W., Gupta, S., Lian, X., and Liu, J. (2015b). Staleness-aware Async-SGD for distributed deep learning. arXiv:1511.05950.