1 Introduction
* Authors contributed equally
The problem we address is neither VQA nor optimization of a single architecture. Our motivation is to accelerate a large class of dynamic architectures such that they become computationally comparable to their static counterparts. This goal is motivated not only by the recent successes of dynamic architectures, but also by their numerous desirable properties, which make them likely to retain and increase their importance in the future, particularly their ability to explicitly modularize knowledge.
We specifically explore Johnson et al.’s recent work in greatest detail because it serves as a useful testbed for multiple approaches to dynamic batching. Their execution engine’s modules (see Background or original work) not only yield dramatic accuracy gains over all strong baselines, but are also a prime example of explicit modularization of knowledge. We view this as a key advantage of dynamic architectures directly comparable to facets of human intelligence. Our work successfully enables efficient parallelization over minibatches in a large class of architectures despite the fact that a new network is assembled for each example.
1.1 Related Work
Previous notable dynamic graph results include neural module networks, which form the basis of the execution engine of Johnson et al. in their CLEVR IEP result. The difference is that the latter's architecture is built on generic, minimally-engineered neural network blocks that are more likely to generalize to a wider class of problems than the original neural module networks approach, which uses a heavily-engineered question parser and custom per-module architectures. Whereas improvement upon neural module networks constitutes improvement upon a single architecture, improvement on the CLEVR architecture is generalizable to a wide class of models under a minimal set of assumptions (see Discussion).
Additional dynamic graph results include neural Turing machines and memory networks, which both provide auxiliary queryable memory for read/write use during inference. While such architectures are applicable in problems requiring long-term memory, visual question answering places more focus on short-term memory. Like the IEP result, these works tend towards higher-level reasoning. However, they are perhaps less directly comparable than approaches that explicitly attempt to build generalizable program structures, such as neural programmer-interpreters. The main difference is that the IEP result assembles programs that are defined in their entirety before being executed, thus additional dynamic batching optimizations are possible. Note that a subset of our results is applicable in both cases.
Much of our work is built atop the recently published CLEVR dataset and subsequent IEP result. We briefly outline these for convenience.
CLEVR is a VQA dataset comprising 70K images and 700K question/answer/program triplets. Images are synthetic but high-quality 3D renders of geometric objects with varying shapes, sizes, colors, and textures. The standard VQA task is given by (question, image) $\rightarrow$ (answer). CLEVR differs in its inclusion of programs, which are functional representations of the questions. CLEVR therefore allows VQA to be split into two intermediate tasks, as in the IEP result: (question) $\rightarrow$ (program) and (program, image) $\rightarrow$ (answer).
One might argue that intermediate programs are unrealistic, as one is unlikely to have program annotations in large, realistic tasks. From the CLEVR result, it seems likely that one could collect a small number of annotations on realistic datasets and use these to initialize the program generator. This is similar to the transfer learning experiment in the IEP result. However, performance did degrade compared to the original task; additional work is required to close the gap.
1.2.2 Visual Reasoning Programs
The IEP result consists of a program generator and execution engine. The program generator is a 2-layer word-level question encoder LSTM and 2-layer word (function)-level program decoder LSTM. We focus on the execution engine, as it is the dynamic portion of the architecture and the source of the majority of computation time.
The program generator predicts a sequence of functions over the function vocabulary with a standard argmax. As the arity (number of arguments) of each function is predetermined, there exists a unique mapping from the predicted vector of functions to a program tree. This is assembled via a depth-first search. Each function is itself a neural network, with the exception of a special SCENE token, which instead outputs ResNet-101 features taken from an intermediate layer. This program tree is then directly executed, and the outputs are passed through a small classifier network (one convolutional and two fully connected layers) to yield a softmax confidence distribution over answers, which is then optimized as normal via backpropagation over the cross-entropy loss.
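The mapping from a function sequence to a program tree can be sketched as follows. This is a minimal illustration rather than the IEP implementation: it assumes the sequence is given in prefix (pre-order) form, and the `ARITY` table, the function names, and the `Node` and `build_tree` helpers are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical arities; the real function vocabulary is larger.
ARITY: Dict[str, int] = {"scene": 0, "filter_red": 1, "count": 1, "intersect": 2}


@dataclass
class Node:
    fn: str
    children: List["Node"] = field(default_factory=list)


def build_tree(tokens: List[str]) -> Node:
    """Assemble a program tree from a prefix-ordered function sequence."""
    pos = 0

    def parse() -> Node:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("sequence ended before the tree was complete")
        node = Node(tokens[pos])
        pos += 1
        # The fixed arity says exactly how many subtrees follow this token.
        for _ in range(ARITY[node.fn]):
            node.children.append(parse())
        return node

    root = parse()
    if pos != len(tokens):
        raise ValueError("trailing tokens after a complete tree")
    return root


tree = build_tree(["count", "intersect", "filter_red", "scene", "scene"])
```

Because every arity is fixed, the recursive descent never has to guess where a subtree ends, which is why the tree is unique for a given sequence.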
In the IEP result, programs must be executed sequentially with an explicit loop over the examples in each minibatch. As a result, unlike static networks, the computation time of the forward pass scales linearly with the batch size. We present two variants of topological sort that remedy this issue.
To fix notation for what follows, programs have maximum length $l$ and the function vocabulary has size $v$. The batch size is denoted by $b$ and the maximum program tree depth by $D$.
Standard topological sort. First, consider a naive topological sort. Each program tree is sorted via an infix depth-first search. This results in a queue ordering such that each node can be executed sequentially; no node is executed before all of its dependencies. While this operation runs in time linear in the number of nodes (e.g. $O(l)$ per program, where $l$ is the maximum program length), it is fast compared to expensive neural network operations and can be multithreaded extremely efficiently, thus we ignore this factor in our computations.
We now have a flat representation of each program; together these can be viewed as a grid of size $b \times l$, where $b$ is the batch size and $l$ the maximum program length. Instead of executing each program independently, we loop only over the $l$ positions and execute one full slice of $b$ nodes at a time. Each node in a slice may correspond to a different element of the function vocabulary, which has size $v$. However, for $v < b$, we need only make at most $v$ expensive neural network calls per slice instead of $b$. This results in $O(lv)$ neural network calls rather than $O(lb)$.
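The position-by-position batching described above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not our implementation: the `MODULES` table uses simple arithmetic in place of neural networks, each program is assumed to be already sorted into a list of (function, input positions) steps, and `run_batched` is a hypothetical helper that counts module calls.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[str, Tuple[int, ...]]  # (function name, indices of earlier steps)
Batch = List[Tuple[float, ...]]

# Toy stand-ins for neural modules: each maps a batch of argument tuples
# to a batch of outputs in a single call.
MODULES: Dict[str, Callable[[Batch], List[float]]] = {
    "scene": lambda batch: [1.0 for _ in batch],
    "double": lambda batch: [2.0 * x for (x,) in batch],
    "add": lambda batch: [x + y for (x, y) in batch],
}


def run_batched(programs: Sequence[Sequence[Step]]) -> Tuple[List[float], int]:
    """Execute sorted programs position by position, one call per function."""
    outputs: List[List[float]] = [[] for _ in programs]
    calls = 0
    for t in range(max(len(p) for p in programs)):
        # Group the nodes at position t by the function they invoke.
        groups: Dict[str, List[int]] = defaultdict(list)
        for i, prog in enumerate(programs):
            if t < len(prog):
                groups[prog[t][0]].append(i)
        for fn, members in groups.items():
            batch = [tuple(outputs[i][j] for j in programs[i][t][1]) for i in members]
            results = MODULES[fn](batch)  # one call for every program in the group
            calls += 1
            for i, y in zip(members, results):
                outputs[i].append(y)
    return [out[-1] for out in outputs], calls


programs = [
    [("scene", ()), ("double", (0,)), ("double", (1,))],
    [("scene", ()), ("scene", ()), ("add", (0, 1))],
    [("scene", ()), ("double", (0,)), ("double", (1,))],
    [("scene", ()), ("double", (0,))],
]
final, calls = run_batched(programs)  # 11 nodes in total, 5 batched calls
```

In this toy batch, executing each program separately would take one call per node (11 calls), while grouping by function at each position takes 5.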
Improved topological sort. In the improved variant of topological sort, we take this sorting operation one step further. Instead of flattening programs, we label each node by its maximum distance from the root node. Nodes with the same label are pooled across the minibatch. Each pool is executed at once in at most $v$ neural network calls, where $v$ is the function vocabulary size, for a total of $O(vD)$ calls, where $D$ is the maximum program tree depth. In the case where program trees are balanced (important in the design of future datasets), $D = O(\log l)$ for maximum program length $l$, and this yields $O(v \log l)$ execution. The program trees used in CLEVR are, unfortunately, highly imbalanced, thus this approach results in only a 10-25 percent speedup over standard topological sort. Note that, as $D$ is the maximum depth across all programs in a minibatch, $D \le l$.
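A minimal sketch of depth pooling, again with toy arithmetic modules in place of neural networks. The `Node` class, the `MODULES` table, and `run_by_depth` are hypothetical names introduced for illustration, and the sketch assumes tree-structured programs, where every child sits exactly one level below its parent.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    fn: str
    children: List["Node"] = field(default_factory=list)
    value: Optional[float] = None


# Toy stand-ins for neural modules, each applied to a whole pool in one call.
MODULES = {
    "scene": lambda batch: [1.0 for _ in batch],
    "double": lambda batch: [2.0 * x for (x,) in batch],
    "add": lambda batch: [x + y for (x, y) in batch],
}


def run_by_depth(roots: List[Node]) -> Tuple[List[float], int]:
    """Pool nodes by distance from the root and execute the deepest pools first."""
    levels: Dict[int, Dict[str, List[Node]]] = defaultdict(lambda: defaultdict(list))
    stack = [(root, 0) for root in roots]
    while stack:
        node, depth = stack.pop()
        levels[depth][node.fn].append(node)
        stack.extend((child, depth + 1) for child in node.children)

    calls = 0
    for depth in sorted(levels, reverse=True):
        # Children live one level deeper, so their values are already set.
        for fn, pool in levels[depth].items():
            batch = [tuple(c.value for c in node.children) for node in pool]
            for node, y in zip(pool, MODULES[fn](batch)):
                node.value = y
            calls += 1
    return [root.value for root in roots], calls


def chain() -> Node:
    return Node("double", [Node("scene")])


roots = [
    Node("add", [chain(), chain()]),        # (2 * 1) + (2 * 1)
    Node("add", [chain(), Node("scene")]),  # (2 * 1) + 1
]
values, calls = run_by_depth(roots)  # 9 nodes in total, 4 pooled calls
```

The number of calls here is bounded by the number of depth levels times the number of distinct functions, independent of how many trees are in the batch.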
We evaluate performance gains with improved topological sort vs. our implementation of the original IEP architecture. Relevant portions of program construction/execution code are shared appropriately: our experiments are robust to any unintended inefficiency in our implementation of the original architecture.
Using all memory in a single NVIDIA GTX 1080Ti, we achieve 5.5X faster inference and a 2X faster backward pass. It is currently unclear why gains do not better transfer to the backward pass in the PyTorch backend, as the expected gains are symmetric. However, this is not a fair comparison, as over half of computation time is spent in inefficient CPU Python graph sorting code. Furthermore, this CPU code is embarrassingly parallel and should be written in multithreaded C++.
Thus, a fairer comparison is to measure neural cell execution time, in which case we achieve over 14X gains (see Fig. 2). There is a small amount of additional data stacking code omitted from this computation because it can likely also be optimized and is not directly comparable to the original architecture; unfairly including this, gains are still well over 10X.
More importantly, scaling is linear with batch size. Doubling GPU memory yields 2X performance. Those familiar with minibatch parallelism may object that this performance gain usually drops off after a certain point (minibatch size 1000 in our experience). However, this is not likely to be an issue in our case, as there is an additional factor of the program function vocabulary size (40). With equal distribution of execution over functions, minibatch size 1000 per cell corresponds to an overall minibatch of size 40000, which would require approximately 500 GB of GPU memory. While increasing the program vocabulary does incur a linear decrease in performance, it causes an equivalent increase in maximum minibatch size before incurring diminishing returns.
Furthermore, it is possible to maintain such gains in the case of multiple GPUs (e.g. large batch size split over many devices) by assigning a different cell function to each GPU. Goyal et al. recently demonstrated that this sort of data parallel scaling can remain practical even at extremely large batch sizes by scaling the learning rate correspondingly.
3.2 Sparsely Gated Mixture of Experts
As a second test, we evaluate performance on the sparsely gated mixture of experts (MOE) layer as in Shazeer et al., where we maintain a set of $n$ fully connected expert networks of which $k$ are active for each of the $b$ examples in each minibatch, and $k \ll n$. Note that unlike the IEP example above, each MOE layer must fully terminate before the next layer can begin execution: the architecture is not known a priori. We therefore apply a degenerate case of our standard topological sort: we batch computation across all currently known modules. For the MOE layer, this corresponds to $k$ experts in each of $b$ examples for a total of $kb$ known modules.
The naive implementation loops over the $b$ examples and executes the $k$ active experts of each independently, for $kb$ expert calls. Our implementation makes only $n$ expert calls, one per expert. We evaluate vs. the naive implementation using 256-dimensional data. Each expert is a neural network with one hidden layer. We tested many network sizes, but found that the speedup factor is largely independent of network size. We therefore only present results with 256 hidden units in Fig. 3; this is the largest experiment that fit on a single 11GB card.
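The per-expert grouping can be sketched as follows. This is a simplified illustration under stated assumptions: experts are toy scaling functions rather than neural networks, the routing table is fixed by hand rather than produced by a gating network, expert outputs are combined by an unweighted sum, and `make_expert` and `moe_batched` are hypothetical helper names.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_expert(scale: float):
    """Toy expert: scales every input in its batch with a single call."""
    return lambda batch: [scale * x for x in batch]


def moe_batched(
    inputs: Sequence[float],
    routing: Sequence[Sequence[int]],
    experts: Sequence,
) -> Tuple[List[float], int]:
    """Run each active expert once on all examples routed to it."""
    by_expert: Dict[int, List[int]] = defaultdict(list)
    for i, chosen in enumerate(routing):
        for e in chosen:
            by_expert[e].append(i)

    outputs = [0.0] * len(inputs)
    calls = 0
    for e, members in by_expert.items():
        results = experts[e]([inputs[i] for i in members])  # one call per expert
        calls += 1
        for i, y in zip(members, results):
            outputs[i] += y  # unweighted sum; a real gate would weight each term
    return outputs, calls


experts = [make_expert(s) for s in (1.0, 2.0, 3.0)]  # 3 experts
inputs = [1.0, 2.0, 3.0, 4.0]                        # 4 examples
routing = [(0, 1), (1, 2), (0, 2), (0, 1)]           # 2 active experts per example
outputs, calls = moe_batched(inputs, routing, experts)
```

With 4 examples and 2 active experts each, a per-example loop would make 8 expert calls, whereas the grouped version makes one call per active expert (3 here).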
Our metric for performance is cell execution speed (i.e. total time spent executing experts) rather than forward pass speed. This is, as in IEP, to avoid unfair comparison to large swaths of unoptimized Python CPU code. Note that there is an additional data stacking operation in our approach omitted from these calculations, as in the IEP result, because including it is not an equal comparison to the vanilla approach. Also, this stacking operation is almost certainly required for any approach in the distributed case.
Notice that performance quickly decays as the number of experts approaches the minibatch size. This should be expected, as parallelization is impossible if each expert receives one example. It is difficult to predict the largest practical speedup achievable, as we do not possess the computational resources to run production-scale results. However, we can extrapolate from the results above. Furthermore, we assume that our results will scale linearly with many GPUs. This is reasonable both from our suggested parallelization scheme (see IEP) and the results of the similar scheme utilized in the original MOE work.
Consider the case where each of $n$ experts has $h$ hidden units, and our data is $d$ dimensional. This yields parameters. If we want each expert to receive $m$ examples and we use $k$ experts per example, then we must store floating point activations. Thus the ratio of memory required between the activations and the parameters is Note that this does not depend on the number of experts.
As a practical example, with , , and million, this ratio is 7.3. Thus even outrageously large networks require a reasonable fraction of the total memory in order to use extremely large batch sizes. With 10k experts, we would expect 50-80X performance with this configuration. Furthermore, we could increase by up to a factor of 10 if we desired without significant loss of absolute performance and obtain an 8-10X efficiency gain.
The recent independent results of automatic batching in DyNet and the TensorFlow Fold library are closest to our work. However, the DyNet batching optimizer relies on lazy execution to optimally organize data on the fly, and TensorFlow Fold similarly operates over compiled graphs.
Lazy execution is not present in alternative frameworks such as PyTorch by design, as it decouples implementation from execution, which is often an undesirable quality during implementation. Without lazy execution, automatic batching is not possible, but variants of our approach are still viable. Our standard and improved topological sorting approaches can be viewed as a set of manual batching optimizations applicable to a large class of dynamic architectures without requiring lazy execution or prior compilation.
With regard to technical implementation, we mark the dependencies of each cell as in DyNet. However, we batch by dependency depth as in TensorFlow Fold in order to minimize extra CPU code. This facilitates changing between approaches. While our experiments are not directly comparable to the benchmarks vs. TensorFlow Fold in the DyNet result, our speedup curves maintain the linear gains of TensorFlow Fold at large batch sizes.
Our improved topological sort makes the following hard assumptions in order to achieve $O(v \log l)$ complexity, where complexity is measured by the number of calls to expensive (e.g. neural network) functions:
1. There exists a set of $v$ expensive modules.
2. The architecture, composed of $l$ modules, can be executed with batch size $b$ such that $v < b$.
3. The architecture is a balanced tree with structure and module arities known a priori.
In the case where the third assumption fails because the architecture (and arities) are known ahead of time but it is a general DAG, our improved topological sort is still applicable, but with complexity $O(vD)$, where $v$ is the number of expensive modules and $D$ is the maximum dependency path length. This has the same complexity as the standard topological sort but with an equivalent or more favorable constant factor, which can be fairly significant depending on the number of concurrent branches in the graph.
In the case where the architecture is a generic graph but is not specifically known ahead of time, it is no longer possible to use our improved topological sort. However, as the next module is always known in any architecture, it is still possible to apply our standard topological sort approach and achieve $O(vl)$, where $v$ is the number of expensive modules and $l$ the number of modules in the architecture, by aggregating the computations of all current modules over the batch of size $b$, as done in the MOE example above. In the presence of cycles, $l$ becomes the maximum length of an unrolled graph. This is always limited in practice to avoid infinite cycles.
In general, our approach is applicable whenever significant module reuse is present among examples. While it may be possible to improve upon our approach in select cases by searching over the set of known modules to optimize the order of dependency execution, this would require additional CPU code that may not be possible to fully optimize.
We demonstrate the effectiveness of our dynamic batching method on IEP and MOE (our codebase is available at https://github.com/jsuarez5341/Efficient-Dynamic-Batching), achieving speedups in neural cell execution of over 14X and up to 1000X, respectively. In each case, we characterize the trend of improvements as batch size varies: returns increase at first and then remain linear up to extremely large batch sizes. We define the class of problems for which our improved topological sort is applicable as well as the class where it is not but standard topological sort is still feasible; in both cases, we provide complexity bounds as a function of neural network calls. The breadth of architectures in which at least one variant of our approach is applicable implies that a large class of dynamic architectures can be trained and executed as quickly and efficiently as their static counterparts.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
-  A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
-  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. B. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. CoRR, abs/1612.06890, 2016.
-  J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, F. Li, C. L. Zitnick, and R. B. Girshick. Inferring and executing programs for visual reasoning. CoRR, abs/1705.03633, 2017.
-  M. Looks, M. Herreshoff, D. Hutchins, and P. Norvig. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181, 2017.
-  G. Neubig, Y. Goldberg, and C. Dyer. On-the-fly operation batching in dynamic computation graphs. arXiv preprint arXiv:1705.07860, 2017.
-  S. Reed and N. De Freitas. Neural programmer-interpreters. arXiv preprint arXiv:1511.06279, 2015.
-  N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
-  S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc., 2015.
-  J. Weston, S. Chopra, and A. Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.