1 Introduction
Over the past few years, neural networks have proven to be a general and effective tool for many practical problems, such as image classification (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016), speech recognition (Hinton et al., 2012; Graves & Jaitly, 2014; Hannun et al., 2014; Chan et al., 2015), machine translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Wu et al., 2016) and speech synthesis (Oord et al., 2016; Arik et al., 2017; Wang et al., 2017)
. Together with their success is the growth in size and computational requirements of training and inference. Currently, a typical approach to address these requirements is to use a heterogeneous distributed environment with a mixture of many CPUs and GPUs. In this environment, it is a common practice for a machine learning practitioner to specify the device placement for certain operations in the neural network. For example, in a neural translation network, each layer, including all LSTM layers, the attention layer, and the softmax layer, is computed by a GPU
(Sutskever et al., 2014; Wu et al., 2016).Although such decisions can be made by machine learning practitioners, they can be challenging, especially when the network has many branches (Szegedy et al., 2016), or when the minibatches get larger. Existing algorithmic solvers (Pellegrini, 2009; Karypis & Kumar, 1995b), on the other hand, are not flexible enough to work with a dynamic environment with many interferences.
In this paper, we propose a method which learns to optimize device placement for training and inference with neural networks. The method, illustrated in Figure 1, takes into account information of the environment by performing series of experiments to understand which parts of the model should be placed on which device, and how to arrange the computations so that the communication is optimized. Key to our method is the use of a sequencetosequence model to read input information about the operations as well as the dependencies between them, and then propose a placement for each operation. Each proposal is executed in the hardware environment to measure the execution time. The execution time is then used as a reward signal to train the recurrent model so that it gives better proposals over time.
Our main result is that our method finds nontrivial placements on multiple devices for InceptionV3 (Szegedy et al., 2016), Recurrent Neural Language Model (Zaremba et al., 2014; Jozefowicz et al., 2016) and Neural Machine Translation (Sutskever et al., 2014; Wu et al., 2016). Singlestep measurements show that Scotch (Pellegrini, 2009) yields disappointing results on all three benchmarks, suggesting that their graphbased heuristics are not flexible enough for them. Our method can find nontrivial placements that are up to times faster. When applied to train the three models in real time, the placements found by our method are up to faster than human experts’ placements.
2 Related Work
Our work is closely related to the idea of using neural networks and reinforcement learning for combinatorial optimization
(Vinyals et al., 2015; Bello et al., 2016). The space of possible placements for a computational graph is discrete, and we model the placements using a sequencetosequence approach, trained with policy gradients. However, experiments in early work were only concerned with toy datasets, whereas this work applies the framework to a largescale practical application with noisy rewards.Reinforcement learning has also been applied to optimize system performance. For example, Mao et al. (2016) propose to train a resource management algorithm with policy gradients. However, they optimize the expected value of a handcrafted objective function based on the reward, unlike this work, where we optimize directly for the running time of the configurations, hence relieving the need to design intermediate cost models.
Graph partitioning is an intensively studied subject in computer science. Early work such as Kernighan & Lin (1970); Kirkpatrick et al. (1983); Fiduccia & Mattheyses (1988); Johnson et al. (1989) employ several iterative refinement procedures that start from a partition and continue to explore similar partitions to improve. Alternative methods such as Hagen & Kahng (1992); Karypis & Kumar (1995b)
perform spectral analyses on matrix representations of graphs to partition them. Despite their extensive literature, graph partitioning algorithms remain heuristics for computational graphs. The reason is that in order to apply these algorithms, one has to construct cost models for the graphs of concern. Since such models are expensive to even estimate and in virtually all cases, are not accurate, graph partitioning algorithms applied on them can lead to unsatisfying results, as we show in Section
4 of this paper.A wellknown graph partitioning algorithm with an open source software library is the Scotch optimizer (Pellegrini, 2009), which we use as a baseline in our experiments. The Scotch mapper attempts to balance the computational load of a collection of tasks among a set of connected processing nodes, while reducing the cost of communication by keeping intensively communicating tasks on nearby nodes. Scotch relies on a collection of graph partitioning techniques such as kway FiducciaMattheyses (Fiduccia & Mattheyses, 1988), multilevel method (Barnard & Simon, 1994; Hendrickson & Leland, 1993; Karypis & Kumar, 1995a), band method (Chevalier & Pellegrini, 2006), diffusion method (Pellegrini, 2007), and dual recursive bipartitioning mapping (Pellegrini & Roman, 1996)).
Scotch models the problem with graphs. The first graph is called the target architecture graph, whose vertices represent hardware resources such as CPUs or GPUs and whose edges represent the communication paths available between them, such as a PCIe bus or a network link. The second graph is called the source graph, which models the computation to be mapped onto the target architecture graph. In the case of TensorFlow (Abadi et al., 2016)
, the computations of programs are modeled as a graph whose vertices represent operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. Scotch users have to choose how and when given partitioning should be applied to graphs. However, in our experiment, we rely on the software’s default strategies implemented in Scotch, which have already been extensively tuned.
3 Method
Consider a TensorFlow computational graph , which consists of operations , and a list of available devices. A placement is an assignment of an operation to a device , where . Let denote the time that it takes to perform a complete execution of under the placement . The goal of device placement optimization is to find such that the execution time is minimized.
3.1 Training with Policy Gradients
While we seek to minimize the execution time , direct optimization of results in two major issues. First, in the beginning of the training process, due to the bad placements sampled, the measurements of can be noisy, leading to inappropriate learning signals. Second, as the RL model gradually converges, the placements that are sampled become more similar to each other, leading to small differences between the corresponding running times, which results in less distinguishable training signals. We empirically find that the square root of running time, , makes the learning process more robust. Accordingly, we propose to train a stochastic policy to minimize the objective
(1) 
In our work, is defined by an attentional sequencetosequence model, which we will describe in Section 3.2. We learn the network parameters using Adam (Kingma & Ba, 2014) optimizer based on policy gradients computed via the REINFORCE equation (Williams, 1992),
(2) 
We estimate by drawing placement samples using
. We reduce the variance of policy gradients by using a baseline term
, leading to(3) 
We find that a simple moving average baseline works well in our experiments. In practice, on computational graphs with large memory footprints, some placements can fail to execute, e.g., putting all of the operations of a huge LSTM on a single GPU will exceed the device’s memory limit. For such cases, we set the square root of running time to a large constant, which we call the failing signal. We specify the failing signal manually depending on the input graph. We observe that throughout our training process, some placements sporadically and unexpectedly fail, perhaps due to factors such as the state of the machine (we train our model on a shared cluster). This phenomenon is particularly undesirable towards the end of the training process, since a large difference between and the baseline leads to a large update of the parameters, which potentially perturbs parameters out of a good minimum. We thus hardcode the training process so that after steps, one performs a parameter update with a sampled placement only if the placement executes. In our experiments, we also find that initializing the baseline with the failing signal results in more exploration.
3.2 Architecture Details
We use a sequencetosequence model (Sutskever et al., 2014) with LSTM (Hochreiter & Schmidhuber, 1997) and a contentbased attention mechanism (Bahdanau et al., 2015) to predict the placements. Figure 2 shows the overall architecture of our model, which can be divided into two parts: encoder RNN and decoder RNN.
The input to the encoder RNN is the sequence of operations of the input graph. We embed the operations by concatenating their information. Specifically, for each input graph , we first collect the types of its operations. An operation’s type describes the underlying computation, such as MatMul or conv2d
. For each type, we store a tunable embedding vector. We then record the size of each operation’s list of output tensors and concatenate them into a fixedsize zeropadded list called the output shape. We also take the onehot encoding vector that represents the operations that are direct inputs and outputs to each operation. Finally, the embedding of each operation is the concatenation of its type, its output shape, and its onehot encoded adjacency information.
The decoder is an attentional LSTM (Bahdanau et al., 2015) with a fixed number of time steps that is equal to the number of operations in a graph . At each step, the decoder outputs the device for the operation at the same encoder time step. Each device has its own tunable embedding, which is then fed as input to the next decoder time step.
3.3 Colocating Operations
A key challenge when applying our method to TensorFlow computational graphs is that these graphs generally have thousands of operations (see Table 1). Modeling such a large number of operations with sequencetosequence models is difficult due to vanishing and exploding gradient issues (Pascanu et al., 2013) and large memory footprints. We propose to reduce the number of objects to place on different devices by manually forcing several operations to be located on the same device. In practice, this is implemented by the colocate_with feature of TensorFlow.
We use several heuristics to create colocation groups. First, we rely on TensorFlow’s default colocation groups, such as colocating each operation’s outputs with its gradients. We further apply a simple heuristic to merge more operations into colocation groups. Specifically, if the output of an operation is consumed only by another operation , then operations and are colocated. Many initialization operations in TensorFlow can be grouped in this way. In our experiments, we apply this heuristic recursively, and after each iteration, we treat the colocation groups as operations until there are not any further groups that can be merged. For certain models, we apply specific rules to construct colocation groups. For example, with ConvNets, we can treat several convolutions and pooling layers as a colocation group, and with RNN models, we treat each LSTM cell as a group.
3.4 Distributed Training
We speed up the training process of our model using asynchronous distributed training, as shown in Figure 3. Our framework consists of several controllers, each of which execute the current policy defined by the attentional sequencetosequence model as described in Section 3.2. All of the controllers interact with a single shared parameter server. We note that the parameter server holds only the controllers’ parameters, and not the input graph’s parameters, because keeping the input graph’s parameters on the parameter server can potentially create a latency bottleneck to transfer these parameters. Each controller in our framework interacts with workers, where is the number of Monte Carlo samples in Equation 3.
The training process has two alternating phases. In the first phase, each worker receives a signal that indicates that it should wait for placements from its controller, while each controller receives a signal that indicates it should sample
placements. Each sampled placement comes with a probability. Each controller then independently sends the placements to their workers, one placement per worker, and sends a signal to indicate a phase change.
In the second phase, each worker executes the placement it receives and measures the running time. To reduce the variance in these measurements, each placement is executed for
steps and the average running time of the steps but the first one is recorded. We observe that in TensorFlow, the first step can take longer to execute compared to the following steps, and hence we treat itss runing time as an outlier. Each controller waits for all of its workers to finish executing their assigned placements and returning their running times. When all of the running times are received, the controller uses the running times to scale the corresponding gradients to asynchronously update the controller parameters that reside in the parameter server.
In our experiments, we use up to controllers, each with either or workers. Under this setting, it takes between to hours to find the best placement for the models in our experiments. Using more workers per controller yields more accurate estimates of the policy gradient as in Equation 3, but comes at the expense of possibly having to put more workers in idle states. We also note that due to the discrepancies between machines, it is more stable to let each controller have its own baseline.
4 Experiments
In the following experiments, we apply our proposed method to assign computations to devices on three important neural networks in the deep learning literature: Recurrent Neural Language Model (RNNLM)
(Zaremba et al., 2014; Jozefowicz et al., 2016), Attentional Neural Machine Translation (Bahdanau et al., 2015), and InceptionV3 (Szegedy et al., 2016). We compare the RL placements against strong existing baselines described in Section 4.2.4.1 Experiment Setup
Benchmarks.
We evaluate our approach on three established deep learning models:

Recurrent Neural Network Language Model (RNNLM) with multiple LSTM layers (Zaremba et al., 2014; Jozefowicz et al., 2016). The grid structure of this model introduces tremendous potential for parallel executions because each LSTM cell can start as soon as its input and previous states are available.

Neural Machine Translation with attention mechanism (NMT) (Bahdanau et al., 2015; Wu et al., 2016). While the architecture of this model is similar to that of RNNLM, its large number of hidden states due to the source and target sentences necessitates model parallelism. Both Sutskever et al. (2014) and Wu et al. (2016) propose to place each LSTM layer, the attention layer, and the softmax layer, each on a separate device. While the authors observe significant improvements at training time, their choices are not optimal. In fact, we show in our experiments that a trained policy can find significantly better placements.

InceptionV3 (Szegedy et al., 2016)
is a widelyused architecture for image recognition and visual feature extraction
(Khetan & Oh, 2016; Esteva et al., 2016). The Inception network has multiple blocks. Each block has several branches of convolutional and pooling layers, which are then concatenated to make the inputs for the next block. While these branches can be executed in parallel, the network’s depth restricts such potential since the later blocks have to wait for the previous ones.
Model details.
For InceptionV3, each step is executed on a batch of images, each of size , which is the widelyused setting for the ImageNet Challenge (Szegedy et al., 2015). For RNNLM and NMT, we use the model with LSTM layers, with sizes of and , respectively. We set the number of unrolling steps for RNNLM, as well as the maximum length for the source and target sentences of NMT, to . Each pass on RNNLM and NMT consists of a minibatch of sequences.
Colocation groups.
We preprocess the TensorFlow computational graphs of the three aforementioned models to manually create their colocation groups. More precisely; for RNNLM and NMT, we treat each LSTM cell, each embedding lookup, each attention step and each softmax prediction step as a group; for InceptionV3, we treat each branch as a group. Table 1 shows the grouping statistics of these models.
Model  #operations  #groups 

RNNLM  8943  188 
NMT  22097  280 
InceptionV3  31180  83 
Tasks  SingleCPU  SingleGPU  #GPUs  Scotch  MinCut  Expert  RLbased  Speedup 

RNNLM  6.89  1.57  2  13.43  11.94  3.81  1.57  0.0% 
(batch )  4  11.52  10.44  4.46  1.57  0.0%  
NMT  10.72  OOM  2  14.19  11.54  4.99  4.04  23.5% 
(batch )  4  11.23  11.78  4.73  3.92  20.6%  
InceptionV3  26.21  4.60  2  25.24  22.88  11.22  4.60  0.0% 
(batch )  4  23.41  24.52  10.65  3.85  19.0% 
Metrics.
We implement training operations for RNNLM and NMT using Adam (Kingma & Ba, 2014)
, and for InceptionV3 using RMSProp
(Tieleman & Hinton, 2012). We evaluate a placement by the total time it takes to perform one forward pass, one backward pass and one parameter update. To reduce measurement variance, we average the running times over several trials. Additionally, we train each model from scratch using the placements found by our method and compare the training time to that of the strongest baseline placement.Devices.
In our experiments, the available devices are 1 Intel Haswell 2300 CPU, which has 18 cores, and either 2 or 4 Nvidia Tesla K80 GPUs. We allow GB of RAM for all models and settings.
4.2 Baselines
SingleCPU.
This placement executes the whole neural network on a single CPU. Processing some large models on GPUs is infeasible due to memory limits, leaving SingleCPU the only choice despite being slow.
SingleGPU.
This placement executes the whole neural network on a single CPU. If an operation lacks GPU implemention, it will be placed on CPU.
Scotch.
We estimate the computational costs of each operation as well as the amount of data that flows along each edge of the neural network model, and feed them to the Scotch static mapper (Pellegrini, 2009). We also annotate the architecture graph (see Section 2) with compute and communication capacities of the underlying devices.
MinCut.
We use the same Scotch optimizer, but eliminate the CPU from the list of available devices fed to the optimizer. Similar to the singleGPU placement, if an operation has no GPU implementation, it runs on the CPU.
Expertdesigned.
For RNNLM and NMT, we put each LSTM layer on a device. For NMT, we also put the attention mechanism and the softmax layer on the same device with the highest LSTM layer, and we put the embedding layer on the same device with the first LSTM layer. For InceptionV3, the common practice for the batch size of is to put the entire model on a single GPU. There is no implementation of InceptionV3 with batch 32 using more than 1 GPU. To create an intuitive baseline on multiple GPUs, we heuristically partition the model into contiguous parts that have roughly the same number of layers. We compare against this approach in Section 4.3. The common practice for InceptionV3 with the larger batch size of is to apply data parallelism using GPUs. Each GPU runs a replica of the model and processes a batch of size (Szegedy et al., 2016). We compare against this approach in Section 4.4.
4.3 SingleStep Runtime Efficiency
In Table 2, we present the perstep running times of the placements found by our method and by the baselines. We observe that our model is either on par with or better than other methods of placements. Despite being given no information other than the running times of the placements and the number of available devices, our model learns subtle tradeoffs between performance gain by parallelism and the costs induced by interdevice communications.
Rnnlm.
Our method detects that it is possible to fit the whole RNNLM graph into one GPU, and decides to do so to save the interdevice communication latencies. The resulting placement is more than twice faster than the best published humandesigned baseline.
Neural MT.
Our method finds a nontrivial placement (see Figure 4) that leads to a speedup of up to for GPUs. Our method also learns to put the less computational expensive operations, such as embedding lookups, on the CPU. We suspect that whilst being the slowest device, the CPU can handle these lookup operations (which are less computationally expensive than other operations) to reduce the load for other GPUs.
InceptionV3.
For InceptionV3 with the batch size of , RLbased placer learns that when there are only
GPUs available, the degree of freedom for model parallelism is limited. It thus places all the operations on a single GPU (although it could use
GPUs). However, when GPUs are available, the RLbased placer finds an efficient way to use all of the GPUs, reducing the model’s perstep running time from seconds to seconds. This result is significant, as neither of our baselines could find a placement better than assigning all the operations to a single GPU.We also conduct a simple extension of our experiments, by increasing the batch sizes of RNNLM and NMT to , and their LSTM sizes to and , respectively. This makes the models’ memory footprints so large that even one layer of them cannot be fitted into any single device, hence ruling out the humandesigned placement. Nevertheless, after several steps of finding placements that fail to run, our approach manages to find a way to successfully place input models on devices The running times of the placements found for large RNNLM and NMT are and seconds, respectively.
4.4 EndtoEnd Runtime Efficiency
We now investigate whether the RLbased placements can speedup not only the singlestep running time but also the entire training process.
Neural MT.
We train our Neural MT model on the WMT14 EnglishGerman dataset.^{1}^{1}1http://www.statmt.org/wmt14/ For these experiments, we preprocess the dataset into word pieces (Wu et al., 2016) such that the vocabularies of both languages consist of word pieces. In order to match our model’s settings, we consider only the translation pairs where no sentence has more than word pieces. We train each model for steps and record their train perplexities. Each training machine has Nvidia Tesla K80 GPUs and Intel Haswell 2300 CPU. Since there are inevitable noises in the computer systems when measuring the running times, we train each model times independently and average their perstep running times and perplexities.
The RLbased placement runs faster than the expertdesigned placement, as shown in the training curves in Figure 6. Quantitatively, the expertdesigned placement, which puts each layer (LSTM, attention and softmax) on a different GPU, takes hours; meanwhile the RLbased placement (see Figure 4) takes hours, giving speed up of total training time. We note that the measured speedup rate (and the running times) of these models appear different than reported in Table 2 because measuring them in our RL method has several overheads.
InceptionV3.
We train InceptionV3 on the ImageNet dataset (Russakovsky et al., 2015) until the model reaches the accuracy of on the validation set. In practice, more often, inception models are trained with data parallelism rather than model parallelism. We thus compare the placements found by our algorithm (see Figure 5) against two such baselines.
The first baseline, called Asynchronous towers, puts one replica of the InceptionV3 network on each GPU. These replicas share the data reading operations, which are assigned to the CPU. Each replica independently performs forward and backward passes to compute the model’s gradients with respect to a minibatch of images and then updates the parameters asynchronously. The second baseline, called Synchronous Tower, is the same as Asynchronous towers, except that it waits for the gradients of all copies before making an update. All settings use the learning rate of and are trained using RMSProp.
Figure 7 shows the training curves of the three settings for InceptionV3. As can be seen from the figure, the endtoend training result confirms that the RLbased placement indeed speedups the training process by compared to the Synchronous Tower. While Asynchronous towers gives a better perstep time, synchronous approaches lead to faster convergence. The training curve of the RLbased placement, being slower at first, eventually crosses the training curve of Asynchronous towers.
4.5 Analysis of Found Placements
In order to understand the rationale behind the RLbased placements, we analyze their profiling information and compare them against those of expertdesigned placements.
Neural MT.
We first compare the perdevice computational loads by RLbased placement and expertdesigned placement for the NMT model. Figure 8 shows such performance profiling. RLbased placement balances the workload significantly better than does the expertdesigned placement. Interestingly, if we do not take into account the time for backpropagation, then expertdesigned placement makes sense because the workload is more balanced (whilst still less balanced than ours). The imbalance is much more significant when backpropagation time is considered.
InceptionV3.
On InceptionV3, however, the RLbased placement does not seek to balance the computations between GPUs, as illustrated in Figure 9top. We suspect this is because InceptionV3 has more dependencies than NMT, allowing less room for model parallelism across GPUs. The reduction in running time of the RLbased placement comes from the less time it spends copying data between devices, as shown in Figure 9bottom. In particular, the model’s parameters are on the same device as the operations that use them, unlike in Synchronous tower, where all towers have to wait for all parameters have to be updated and sent to them. On the contrary, that use them to reduce the communication cost, leading to overall reduction in computing time.
5 Conclusion
In this paper, we present an adaptive method to optimize device placements for neural networks. Key to our approach is the use of a sequencetosequence model to propose device placements given the operations in a neural network. The model is trained to optimize the execution time of the neural network. Besides the execution time, the number of available devices is the only other information about the hardware configuration that we feed to our model.
Our results demonstrate that the proposed approach learns the properties of the environment including the complex tradeoff between computation and communication in hardware. On a range of tasks including image classification, language modeling, and machine translation, our method surpasses placements carefully designed by human experts and highly optimized algorithmic solvers.
Acknowledgements
We thank Martin Abadi, Stephan Gouws, and the Google Brain team for their help with the project.
References
 Abadi et al. (2016) Abadi, Martín, Barham, Paul, Chen, Jianmin, Chen, Zhifeng, Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Irving, Geoffrey, Isard, Michael, Kudlur, Manjunath, Levenberg, Josh, Monga, Rajat, Moore, Sherry, Murray, Derek G., Steiner, Benoit, Tucker, Paul, Vasudevan, Vijay, Warden, Pete, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. Tensorflow: A system for largescale machine learning. arXiv preprint arXiv:1605.08695, 2016.
 Arik et al. (2017) Arik, Sercan O, Chrzanowski, Mike, Coates, Adam, Diamos, Gregory, Gibiansky, Andrew, Kang, Yongguo, Li, Xian, Miller, John, Raiman, Jonathan, Sengupta, Shubho, et al. Deep voice: Realtime neural texttospeech. arXiv preprint arXiv:1702.07825, 2017.
 Bahdanau et al. (2015) Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
 Barnard & Simon (1994) Barnard, S. T. and Simon, H. D. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Concurrency: practice and Experience, 6(2):101–117, 1994.
 Bello et al. (2016) Bello, Irwan, Pham, Hieu, Le, Quoc V., Norouzi, Mohammad, and Bengio, Samy. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940, 2016.
 Chan et al. (2015) Chan, William, Jaitly, Navdeep, Le, Quoc V, and Vinyals, Oriol. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

Chevalier & Pellegrini (2006)
Chevalier, C. and Pellegrini, F.
Improvement of the efficiency of genetic algorithms for scalable parallel graph partitioning in a multilevel framework.
EuroPar, Dresden, LNCS 4128, pp. 243––252, September 2006.  Cho et al. (2014) Cho, Kyunghyun, Van Merriënboer, Bart, Gulcehre, Caglar, Bahdanau, Dzmitry, Bougares, Fethi, Schwenk, Holger, and Bengio, Yoshua. Learning phrase representations using rnn encoderdecoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 Esteva et al. (2016) Esteva, Andre, Kuprel, Brett, Novoa, Rob, Ko, Justin, Swetter, Susan, Blau, Helen M., and Thrun, Sebastian. Dermatologistlevel classification of skin cancer. Nature, 2016.
 Fiduccia & Mattheyses (1988) Fiduccia, Charles M and Mattheyses, Robert M. A lineartime heuristic for improving network partitions. In Papers on Twentyfive years of electronic design automation, pp. 241–247. ACM, 1988.
 Graves & Jaitly (2014) Graves, Alex and Jaitly, Navdeep. Towards endtoend speech recognition with recurrent neural networks. In International Conference on Machine Learning, 2014.
 Hagen & Kahng (1992) Hagen, Lars and Kahng, Andrew B. New spectral methods for ratio cut partitioning and clustering. IEEE transactions on computeraided design of integrated circuits and systems, 11(9):1074–1085, 1992.
 Hannun et al. (2014) Hannun, Awni, Case, Carl, Casper, Jared, Catanzaro, Bryan, Diamos, Greg, Elsen, Erich, Prenger, Ryan, Satheesh, Sanjeev, Sengupta, Shubho, Coates, Adam, et al. Deep speech: Scaling up endtoend speech recognition. arXiv preprint arXiv:1412.5567, 2014.

He et al. (2016)
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian.
Deep residual learning for image recognition.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 770–778, 2016.  Hendrickson & Leland (1993) Hendrickson, B. and Leland, R. A multilevel algorithm for partitioning graphs. Technical Report SAND93–1301, Sandia National Laboratories, June 1993.
 Hinton et al. (2012) Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E., Mohamed, Abdelrahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 2012.
 Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jurgen. Long shortterm memory. Neural Computation, 1997.
 Johnson et al. (1989) Johnson, David S, Aragon, Cecilia R, McGeoch, Lyle A, and Schevon, Catherine. Optimization by simulated annealing: an experimental evaluation; part i, graph partitioning. Operations research, 37(6):865–892, 1989.
 Jozefowicz et al. (2016) Jozefowicz, Rafal, Vinyals, Oriol, Schuster, Mike, Shazeer, Noam, and Wu, Yonghui. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Karypis & Kumar (1995a) Karypis, G. and Kumar, V. A fast and high quality multilevel scheme for partitioning irregular graphs. Technical Report 95035, University of Minnesota, June 1995a.
 Karypis & Kumar (1995b) Karypis, George and Kumar, Vipin. Metis–unstructured graph partitioning and sparse matrix ordering system, version 2.0. 1995b.
 Kernighan & Lin (1970) Kernighan, Brian W and Lin, Shen. An efficient heuristic procedure for partitioning graphs. The Bell system technical journal, 49(2):291–307, 1970.
 Khetan & Oh (2016) Khetan, Ashish and Oh, Sewoong. Achieving budgetoptimality with adaptive schemes in crowdsourcing. In Advances in Neural Information Processing Systems, pp. 4844–4852. Curran Associates, Inc., 2016.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2014.
 Kirkpatrick et al. (1983) Kirkpatrick, Scott, Vecchi, Mario P, et al. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.

Krizhevsky et al. (2012)
Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E.
Imagenet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems, 2012.  Mao et al. (2016) Mao, Hongzi, Alizadeh, Mohammad, Menache, Ishai, and Kandula, Srikanth. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 2016.
 Oord et al. (2016) Oord, Aaron van den, Dieleman, Sander, Zen, Heiga, Simonyan, Karen, Vinyals, Oriol, Graves, Alex, Kalchbrenner, Nal, Senior, Andrew, and Kavukcuoglu, Koray. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 Pascanu et al. (2013) Pascanu, Razvan, Mikolov, Tomas, and Bengio, Yoshua. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
 Pellegrini (2007) Pellegrini, F. A parallelisable multilevel banded diffusion scheme for computing balanced partitions with smooth boundaries. EuroPar, Rennes, LNCS 4641, pp. 191––200, August 2007.
 Pellegrini (2009) Pellegrini, F. Distillating knowledge about scotch. 2009.
 Pellegrini & Roman (1996) Pellegrini, F. and Roman, J. Experimental analysis of the dual recursive bipartitioning algorithm for static mapping. Research Report, LaBRI, Universite Bordeaux I, August 1996.
 Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and FeiFei, Li. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015.
 Sutskever et al. (2014) Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, 2014.
 Szegedy et al. (2015) Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 Szegedy et al. (2016) Szegedy, Christian, Vanhoucke, Vincent, Ioffe, Sergey, Shlens, Jon, and Wojna, Zbigniew. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition, June 2016.
 Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
 Vinyals et al. (2015) Vinyals, Oriol, Fortunato, Meire, and Jaitly, Navdeep. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.
 Wang et al. (2017) Wang, Yuxuan, SkerryRyan, R. J., Stanton, Daisy, Wu, Yonghui, Weiss, Ron J., Jaitly, Navdeep, Yang, Zongheng, Xiao, Ying, Chen, Zhifeng, Bengio, Samy, Le, Quoc V., Agiomyrgiannakis, Yannis, Clark, Rob, and Saurous, Rif A. Tacotron: A fully endtoend texttospeech synthesis model. In InterSpeech, 2017.
 Williams (1992) Williams, Ronald. Simple statistical gradient following algorithms for connectionnist reinforcement learning. In Machine Learning, 1992.
 Wu et al. (2016) Wu, Yonghui, Schuster, Mike, Chen, Zhifeng, Le, Quoc V., Norouzi, Mohammad, Macherey, Wolfgang, Krikun, Maxim, Cao, Yuan, Gao, Qin, Macherey, Klaus, Klingner, Jeff, Shah, Apurva, Johnson, Melvin, Liu, Xiaobing, Łukasz Kaiser, Gouws, Stephan, Kato, Yoshikiyo, Kudo, Taku, Kazawa, Hideto, Stevens, Keith, Kurian, George, Patil, Nishant, Wang, Wei, Young, Cliff, Smith, Jason, Riesa, Jason, Rudnick, Alex, Vinyals, Oriol, Corrado, Greg, Hughes, Macduff, and Dean, Jeffrey. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
 Zaremba et al. (2014) Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.