Cavs: A Vertex-centric Programming Interface for Dynamic Neural Networks

12/11/2017 ∙ by Hao Zhang, et al.

Recent deep learning (DL) models have moved beyond static network architectures to dynamic ones, handling data where the network structure changes with every example, such as sequences of variable lengths, trees, and graphs. Existing dataflow-based programming models for DL---both static and dynamic declaration---either cannot readily express these dynamic models, or are inefficient due to repeated dataflow graph construction and processing, and difficulties in batched execution. We present Cavs, a vertex-centric programming interface and optimized system implementation for dynamic DL models. Cavs represents a dynamic network structure as a static vertex function F and a dynamic instance-specific graph G, and performs backpropagation by scheduling the execution of F following the dependencies in G. Cavs bypasses expensive graph construction and preprocessing overhead, allows for the use of static graph optimization techniques on pre-defined operations in F, and naturally exposes batched execution opportunities over different graphs. Experiments comparing Cavs to two state-of-the-art frameworks for dynamic NNs (TensorFlow Fold and DyNet) demonstrate the efficacy of this approach: Cavs achieves nearly one order of magnitude speedup on training of various dynamic NN architectures, and ablations demonstrate the contribution of our proposed batching and memory management strategies.


1 Introduction

Deep learning (DL), which refers to a class of neural networks (NNs) with deep architectures, is now a workhorse powering state-of-the-art results on a wide spectrum of tasks Yan:2015:HDCNN ; yan2016automatic ; mikolov2013efficient . One reason for its widespread adoption is the variety and quality of software toolkits, such as Caffe jia2014caffe , TensorFlow abadi2016tensorflow and DyNet neubig2017dynet ; neubig2017fly , which ease programming of DL models, and speed computation by harnessing modern computing hardware (e.g. GPUs), software libraries (e.g. CUDA, cuDNN chetlur2014cudnn ), and compute clusters zhang2015poseidon ; zhang2017poseidon ; cui2016geeps .

One dominant paradigm in the training of DL models, adopted by toolkits such as Caffe and TensorFlow, uses static dataflow graphs abadi2016tensorflow ; murray2013naiad . These graphs represent the flow of data through computational functions, and are defined using symbolic programming Bergstra:2011:NIPSW ; abadi2016tensorflow , once before beginning training or testing of the model. The training of these models is performed through auto-differentiation, in which users are only required to assemble their model architectures by connecting operators using a high-level language interface (e.g. Python), after which the framework will automatically derive the correct algorithm for training bartholomew2000automatic . With proper optimization, the execution of these static dataflow graphs can be highly efficient. Specifically, separating model declaration from execution makes it possible for the graph to be further processed and optimized before runtime abadi2016tensorflow . In addition, the evaluation of multiple data samples in a dataflow graph can be naturally batched to leverage the improved computational capability of modern hardware (e.g. GPUs), which is extremely advantageous for DL workloads Krizhevsky:2012:NIPS .

While these static dataflow graphs have major efficiency advantages, their applicability relies on a key assumption – that the dataflow graph (i.e. the NN architecture) is fixed throughout the runtime. With the increasing complexity of the problems to be addressed, DL has been extended and applied to data with more complicated structures, such as sequences hochreiter1997long ; sutskever2014sequence , trees tai2015improved and graphs liang2016semantic , over which the NN may conditionally choose its own computation order for specific modeling needs, i.e. the structure of its dataflow graph changes from example to example during training. To better support these dynamic models, some recent frameworks tokui2015chainer ; neubig2017dynet propose to declare a dataflow graph per sample (a.k.a. dynamic declaration). While dynamic declaration is convenient for developers, as code can basically be written the same way as it usually is in the native programming language (e.g. Python, C++), it exhibits a few limitations. First, programmers still have to write code to explicitly assemble the dataflow graph for each input sample, which might be nontrivial for graphs with sophisticated structures. Second, as graph construction needs to be performed repeatedly, its overhead grows linearly with the number of training instances, preventing the application of complex static graph optimization techniques (in fact, graph construction takes longer than the computation itself in some frameworks looks2017deep , see §5.2). Finally, since each sample owns a dataflow graph specifying its unique computational pattern, batching together similarly shaped computations across instances is nontrivial. Without batched operations, the computation cannot efficiently exploit modern computational hardware, and while some progress has been made in recent research neubig2017fly ; looks2017deep , how to automatically batch the computational operations from different graphs remains a difficult problem.

To address these challenges, we present Cavs, a new programming interface for dynamic NNs, together with a system implementation and optimization strategies tailored to it. Cavs leverages the recurrent and recursive nature of dynamic NNs. Instead of declaring a dataflow graph per sample, it decomposes a dynamic dataflow graph into two components: a static vertex function F that is only declared (by the user) and optimized once, and an input graph G that is instance-specific and not needed until runtime. Thereby, the workflow of training a dynamic NN can be represented as scheduling the execution of F following the structure of the input graph G. Cavs combines the best of symbolic construction of dataflow graphs for DL abadi2016tensorflow ; Bergstra:2011:NIPSW with the vertex-centric model gonzalez2012powergraph in graph computing: it only requires users to define F symbolically by “thinking locally like a vertex” tian2013think . Cavs will perform auto-differentiation, schedule the execution of F following the dependencies reflected by G, and guarantee efficiency and correctness. It also inherits the flexibility of symbolic programming, i.e. users are allowed to declare multiple vertex functions to express more dynamics, or to connect static dataflow graphs with dynamic ones to construct more complex NN architectures.

Cavs demonstrates a few advantages over other programming models. It simplifies user programs and avoids the overhead of repeated dataflow graph construction. Moreover, this vertex-centric model naturally exposes opportunities for batched computation: we introduce a simple batching policy in Cavs’ scheduler to parallelize the execution of F on multiple vertices during the evaluation of a batch of samples with different input graphs (§3.2), and a novel memory management mechanism to guarantee memory coalescing (§3.3). Together they yield significant performance improvements. Compared to dynamic declaration, as the dataflow graph encoded by the vertex function is static throughout the runtime, it can benefit from various static graph optimizations abadi2016tensorflow ; chen2015mxnet ; caffe2 ; xla , such as lazy batching, streaming, and kernel fusion (§3.5), which would otherwise be less effective in the scenario of dynamic declaration because of the repeated preprocessing/optimization cost (see §6).

We implement Cavs as an additional layer pluggable into most existing DL frameworks to enhance their support for dynamic NNs. To evaluate its performance, we compare Cavs to TensorFlow Fold looks2017deep and DyNet neubig2017dynet ; neubig2017fly , two state-of-the-art systems supporting dynamic NNs and dynamic batching. We focus our experiments on GPU training, and verify that both Fold and DyNet suffer from substantial overhead caused by repeated graph preprocessing or construction, which is bypassed by Cavs (§5.2). In terms of overall performance, on static NNs Cavs demonstrates equivalent or slightly better performance than Fold and DyNet, while on several dynamic NNs with notably difficult-to-batch workloads (e.g. Tree-LSTM tai2015improved and Tree-FC looks2017deep ), Cavs demonstrates nearly one order of magnitude speedups across various dataset and hyperparameter settings (§5.1). We further investigate the key contributing factors to this performance: Cavs benefits not only from a better memory management strategy, but also from graph execution optimizations which were originally designed for static dataflow graphs and are perhaps less useful in dynamic declaration.

/* (a) static declaration */
// all samples must share one graph
declare a static dataflow graph.
for each iteration:
    read the next data batch.
    perform batched computation over the batch with the graph.

/* (b) dynamic declaration */
for each iteration:
    read the next data batch.
    for each sample in the batch:
        declare a dataflow graph for the sample.
        perform single-instance computation with that graph.

/* (c) our proposed vertex-centric model */
declare a symbolic vertex function F.
for each iteration:
    read the next data batch.
    read the associated input graphs.
    compute F over the input graphs with the batch as inputs.

Figure 1: The workflows of (a) static declaration, (b) dynamic declaration, (c) Cavs’ vertex-centric programming model.
Model                                      Frameworks                         Graph Cons. Overhead   Graph Exec. Optimization
static declaration                         Caffe, Theano, TensorFlow, MxNet   low                    beneficial
dynamic declaration (instant evaluation)   PyTorch, Chainer                   N/A                    unavailable
dynamic declaration (lazy evaluation)      DyNet                              high                   not beneficial
Fold                                       TensorFlow-Fold                    high                   unknown
Vertex-centric                             Cavs                               low                    beneficial

Figure 2: Left (a)-(d): A cell function shown in (a) could be applied on different structures such as a (b) chain, (c) tree, or (d) graph. Right table: the landscape of existing programming models for dynamic NNs, and their advantages and disadvantages (see §2.2 and §6).

2 Background

DL is distinguished from other ML algorithms mainly by its use of deep neural networks, a family of ML models with many interconnected layers, each composed of various mathematical operations (e.g. matrix multiplication, convolution, nonlinear activation). Before a DL model can give predictions, it is usually trained by stochastic gradient descent, an iterative process in which gradients are calculated through backpropagation rumelhart1988learning . There is a natural connection between directed graphs and NNs: we can map the graph nodes to the computational operations or parameters in NNs, and let the edges indicate the direction of the data being passed between the nodes. In this case, we can represent the process of training NNs as batches of data flowing through computational graphs, i.e. dataflow graphs Bergstra:2011:NIPSW ; abadi2016tensorflow ; neubig2017dynet .

Figure 1(a) summarizes the programming model derived from these dataflow graphs, which is named static declaration and has been adopted in many DL frameworks Bergstra:2011:NIPSW ; abadi2016tensorflow ; chen2015mxnet . Without ambiguity, we use the term dataflow graph to denote both the graph itself and the computational function it implies. On one hand, its execution is highly efficient because the computation over multiple samples is batched: at each iteration, a batched tensor of samples is fed to the graph, and the computation is executed in a single pass, allowing for efficient use of memory caches and parallelized computation. On the other hand, this paradigm relies on a key assumption: the dataflow graph is static for all samples and fixed throughout the computation. Hence, the graph will only be declared once, with a constant graph construction/optimization overhead; all samples share the same computational pattern specified in the graph, so the computation of different samples can by nature be batched by simply expanding the input with a batch dimension. Though static declaration is effective for a wide range of NN models, such as convolutional neural networks (CNNs) over fixed-size images, it is much more difficult to apply to graphs with dynamically changing structures, some examples of which are shown in the next section.

2.1 Dynamically Structured Computational Graphs

Modern DL has been developed and applied extensively over data with more complicated structures, e.g. data structured as sequences, trees, and graphs, which are required to tackle practical problems such as machine translation sutskever2014sequence ; tai2015improved , question answering tan2015lstm , and semantic image segmentation yan2016combining ; liang2016semantic . As a concrete example of dynamic NNs, we will use recurrent and recursive neural networks (RNNs) elman1990finding ; hochreiter1997long ; socher2013recursive . RNNs are a class of NNs generally applied to modeling structured inputs or outputs, e.g. sequences or graphs. At the core of an RNN is a cell function with trainable parameters. It will be dynamically applied at different places of the input structure, and optionally produce an output if needed. Figure 2(a) illustrates such a cell function: it takes an input element, forwards it through a few mathematical transformations, and generates an intermediate state and optionally an output. Depending on what transformations are applied, different variants of RNNs have been derived, such as long short-term memory units (LSTM) hochreiter1997long and gated recurrent units (GRU) chung2014empirical . However, the internals of the cells themselves are secondary; the dynamics of the net as a whole are mainly reflected by the structures that the NN works on.

Sequence RNNs. When the inputs to the RNN are sequences (e.g. sentences) as in Figure 2b, the cell function is applied across all elements of the sequence. At each step t, it takes the element x_t (e.g. a word) from the input sequence and the state variable h_{t-1} maintained by the model at step t-1. It computes an output y_t and a new state h_t that will be used at the next step t+1; hence the states form a recurrence h_t = f(x_t, h_{t-1}), where f is the cell function. This sequence RNN encodes not only the data values, but also the dependencies present in the sequence. If represented as a dataflow graph, the graph exhibits a chain structure. As the input or output sequences usually have variable length (e.g. translating an arbitrary-length English sentence into Chinese), the dataflow graph needs to be dynamically changed, i.e. the number of steps of the chain must be adapted to fit the length of the input or output.

Tree-structured RNNs. Further, RNNs can be enhanced to model data with more complex structures suited for downstream tasks. For example, tree-structured RNNs (Tree-RNNs, Figure 2c) have been used to classify the sentiment of sentences pang2002thumbs given an associated binary tree representing the sentence structure tai2015improved ; socher2011parsing . In this case, a leaf of the tree maps to a word of the sentence, and an internal node corresponds to a multi-word phrase. To process this structure, the cell function scans the tree recursively, starting from the leaf nodes until reaching the root. At a node t, it computes the state h_t = f(x_t, h_L, h_R), where x_t is the input to the node, and h_L and h_R are the states of its left and right children, respectively. As the tree structure varies from example to example, the dataflow graph of a Tree-RNN is highly dynamic.

Graph-structured RNNs. Similarly, RNNs can be extended to compute over more general graphs, such as N-ary trees or graphs (Figure 2d), as long as their parameters remain learnable. In fact, various NNs have been developed toward more dynamic workflows liang2016semantic ; tai2015improved , and have proven quite effective because of their ability to encode structured information. While we take RNNs as examples for explanation, we note there are many other dynamic NNs in the literature or in production, with their dynamics reflected in various ways: variably sized inputs/outputs sutskever2014sequence ; bahdanau2014neural ; elman1990finding ; hochreiter1997long ; dyer2015transition ; buckman2016transition , variably structured inputs/outputs socher2011parsing ; tai2015improved ; liang2016semantic , or nontrivial inference algorithms graves2006connectionist ; zheng2015conditional ; gormley2015approximation ; kong2015segmental .

2.2 Programming Dynamic Dataflow Graphs

As the assumption in §2 no longer holds for dynamic structures, static dataflow graphs in their original form cannot be used. There are currently two remedies to this problem: expanding the graph programming language to allow it to explicitly include the control structures necessary to implement these applications, or forgoing the efficiency gains afforded by static dataflow graphs and instead using a dynamic declaration framework that reconstructs the graph for every training example. We explain both below and summarize them in Figure 2.

Static declaration. Static unrolling abadi2016tensorflow is a standard way to express sequence RNNs with fixed steps. To handle variable-length data, it declares an RNN with a number of steps equal to the length of the longest sequence in the dataset. It then appends zeros to the end of the other sequences so that all have equal length, and feeds them in batches to the dataflow graph for computation. Static unrolling enables batched computation over multiple sequences, but obviously results in substantial unnecessary computation. (It is also possible to split sentences into several buckets of different lengths, which alleviates this problem somewhat but adds some degree of code complexity and is not a fundamental solution.) Dynamic unrolling implements basic control flow functionality within static graphs, allowing for the declaration of graph operators similar to while loops. At each iteration of training, the cell function of the RNN will be executed a number of times determined at runtime by the length of the longest sentence in the batch. It still pads the sequences in the batch before performing batched computation, so it also wastes computational resources. Notably, both of these methods essentially cannot support structures more complex than sequences.

Dynamic declaration. Dynamic declaration is able to express dynamically varying dataflow graphs by creating a unique dataflow graph for each sample according to its associated structure. It however requires users to explicitly write (more) code to build a dataflow graph for each input graph, which is nontrivial for graphs with sophisticated structures. As the dataflow graphs are always changing, it can hardly benefit from well-established dataflow graph optimization techniques (§3.5) – each dataflow graph would have to be processed and optimized for only a single sample, while the optimization itself has an overhead. More importantly, as we are unable to naturally batch the computation of different samples, single-instance training would be very inefficient in the absence of batched computation. At the backend, since a dataflow graph needs to be constructed per sample, the overhead grows linearly with the number of samples, and sometimes yields degraded performance looks2017deep (§5.2), even for frameworks with optimized graph construction implementations neubig2017dynet .

TensorFlow Fold looks2017deep and DyNet neubig2017fly go one step further and perform dynamic batching for dynamic dataflow graphs. Fold turns dynamic dataflow graphs into a static control flow graph to enable batched execution, but introduces a complicated functional programming-like language and a large graph preprocessing overhead. DyNet proposes an auto-batching strategy that searches for batching opportunities by profiling every fine-grained operator, but this step itself has non-negligible overhead, and it loses the opportunity for graph-level optimizations. There are also some “imperative” frameworks, such as PyTorch pytorch and Chainer chainer , that allow users to construct dynamic NNs but perform instant evaluation of each user expression. As model construction and execution are coupled, it is difficult for them to perform dynamic batching or graph optimization. Overall, these frameworks are still far from efficient when handling dynamic NNs. We next describe our proposed vertex-centric programming model to overcome the aforementioned limitations.

3 Cavs Design and Optimization

Our motivation comes from several key principles that ML developers usually follow to ensure the feasibility and learnability of a model when designing dynamic NNs. We note most dynamic NNs are designed to exhibit a recursive structure (e.g. sequence RNN, Tree-RNN), a combination of static and recursive structures (e.g. LRCN donahue2015long ; andreas2016neural , attention xu2015show ), or even a combination of different recursive structures (e.g. encoder-decoder RNNs sutskever2014sequence ). Within one such structure, a function is dynamically applied over instance-specific graphs, and every vertex of the graph usually interacts in the same way with its neighboring vertices following that function. The computational function itself, however, is usually static and parameterized by fixed learnable parameters.

This observation motivates us to design a new programming model, called Cavs, that combines the best of dataflow graphs with the vertex-centric model in graph computing. As a comparison, we present the workflow of Cavs in Figure 1c. For clarity, we will use the following terminology and notation in the rest of the paper: we call the instance-specific structure associated with an input sample an input graph, notated as G, and a node in that graph a vertex, to distinguish it from a dataflow graph and the nodes (which are usually operators or variables) therein. Figure 3 illustrates the concept of this vertex-centric programming model. To describe an aforementioned dynamic structure, different from dynamic declaration, which requires users to manually declare a dataflow graph for each sample according to its associated graph, Cavs instead directly takes the input graph as an input argument. To be aware of what computation shall be performed, Cavs requires users to implement a simple vertex function F by “thinking like a vertex”, informing the framework how one vertex in a dynamic NN interacts with its connected vertices (if there are any). In F, users can utilize conventional DL operators to assemble a symbolic construct that will be evaluated dynamically following the structure of G, while Cavs ensures correctness and efficiency. Therefore, a vertex function F, together with an input graph G, implicitly encodes a recurrent dataflow graph; F maps to a subgraph of the implicit full dataflow graph of the model that would need to be explicitly declared in traditional programming models. For convenience of notation, we will call any part of the structure that cannot be encoded by F and G external to F, and vice versa. Cavs allows users to connect any external static dataflow graph to a dynamic structure encoded by F to express various model architectures (e.g. connecting a CNN to an RNN), or to declare multiple vertex functions for different structures and connect them appropriately to express more complex models (e.g. an encoder-decoder LSTM network).

While it is still necessary to create an I/O function to read the input graphs for each sample, this must be done in any model, and only once before training commences, which means that it can be shared across epochs or even training runs. Cavs no longer requires users to construct the full dataflow graph for each sample by themselves. As repeated graph construction is bypassed, its overhead is also avoided. With this vertex-centric model, Cavs transforms the problem of evaluating multiple dataflow graphs with different structures looks2017deep ; neubig2017fly into a simpler form – scheduling the execution of the vertex function following the input graphs. For the latter problem, we can easily batch the execution of F over multiple vertices at runtime (§3.2), leveraging the batched computational capability of modern hardware. Moreover, as the vertex function itself maps to a static symbolic dataflow graph, it is open to and can benefit from various graph optimization techniques originally developed for static declaration, such as kernel fusion, streaming, and our proposed lazy batching, which might not be effective in the scenario of dynamic declaration. We next describe Cavs’ APIs.

Figure 3: Cavs represents a dynamic structure as a dynamic input graph (left) and a static vertex function (right).

3.1 Programming Interface

Besides the generic math operators used to declare the computation, Cavs exposes four symbolic APIs for users to specify how the messages shall be passed between vertices in their vertex functions: gather, scatter, pull, push.

  • gather(child_idx): gather accepts an index of the child vertices, gets the child content from the gather/scatter buffer, and returns a symbol that represents the output of the indexed child vertex.

  • scatter(op): scatter is the reverse API of gather, and takes a symbol op as its input argument. Scatter sets the output of the current vertex in the gather/scatter buffer.

gather and scatter resemble the GAS model in graph computing gonzalez2012powergraph – both are vertex-centric APIs that help users express the overall computational patterns by thinking locally like a vertex: gather receives messages from dependent vertices, while scatter updates information to parent vertices. Note, however, several key differences: (1) gather and scatter here are fully symbolic – gather allows backpropagation through it; (2) in graph computing, all nodes interact with their connected nodes in the same way following a user-specified apply function, while in dynamic NNs, a vertex usually interacts differently with its different child vertices, as specified by the symbolic program (between the calls to gather and scatter) in the vertex function; (3) in graph computing, a vertex of a graph always interacts with other vertices of that graph, while in DL, a vertex of a dynamic NN usually takes input not only from the internal of the structure expressed by F and G (internal data path in Figure 3), but also from outside of it, e.g. a step in an RNN can take inputs from a CNN feature extractor or some external I/O (external data path in Figure 3). In this case, gather and scatter are insufficient to express such semantics. Cavs therefore provides two more APIs:

  • pull(): pull grabs inputs from outside the current dynamic structure, e.g. another NN or some I/O.

  • push(op): push is the reverse of pull: it sets the output of the current vertex as op. If this vertex is pulled by others, the content of op will be returned.

With appropriate indexing, push and pull connect a vertex inside a dynamic structure expressed by F to connectors external to F, such as another dynamic structure or another static dataflow graph. With these four APIs, we present in Figure 4 an example user program showing how the N-ary child-sum Tree-LSTM tai2015improved can be simply expressed using them and other mathematical operators.

def F():
    for k in range(N):
        S = gather(k)                 # gather states of child vertices
        c_k, h_k = split(S, 2)        # get hidden states c and h
    x = pull()                        # pull the first external input x
    # specify the computation
    h = sum(h_k for k in range(N))
    i = sigmoid(W_i * x + U_i * h + b_i)
    for k in range(N):
        f_k = sigmoid(W_f * x + U_f * h_k + b_f)
    o = sigmoid(W_o * x + U_o * h + b_o)
    u = tanh(W_u * x + U_u * h + b_u)
    c = i * u + sum(f_k * c_k for k in range(N))
    h = o * tanh(c)
    scatter(concat([c, h], 1))        # scatter c, h to parent vertices
    push(h)                           # push h to external connectors

Figure 4: An N-ary child-sum TreeLSTM tai2015improved in Cavs.

Expressiveness. With these four APIs, Cavs can be seen as a middle ground between static and dynamic declaration. In the best case, the model can be easily represented by a single vertex function plus input graphs. In the worst-case scenario, where every sample has a unique input graph and every vertex in the graph has a unique way of interacting with its neighboring vertices, Cavs reduces to dynamic declaration: one has to define a vertex function for each vertex of the input graphs. However, dynamic NNs in this scenario are very rare and usually not preferred because of the difficulty of design, programming, and learning.

3.2 Scheduling

Once users define the vertex function F and launch the execution, the Cavs scheduler arranges the evaluation of F over the input graphs. Given F, Cavs’ scheduler follows its scheduling policies to efficiently perform backpropagation for all samples and their associated graphs.

Backpropagation. Cavs performs backpropagation hinton2006reducing as follows. For a sample with its input graph G, the scheduler starts the forward pass from the input vertices of G, and proceeds following the direction indicated by the edges in G: at each sub-step, the scheduler figures out the next activated vertex in G, and evaluates F at this vertex following the symbolic program in F. It then marks this vertex as evaluated, and proceeds with the next activated vertex until reaching a terminal vertex (e.g. the loss function). A vertex of G is activated if and only if all its dependent vertices have been evaluated. The backward pass starts right after the forward pass. The scheduler first resets the status of all vertices to not evaluated, then scans the graph in the reverse direction, starting from the ending point of the forward pass. It similarly figures out the next activated vertex, but applies the backward function of F (denoted ∂F), which is automatically derived by Cavs via auto-differentiation (§3.4), until all vertices have been evaluated in the backward pass.

1:function Forward(F, G)
2:  set task ID t = 0 and task stack S = ∅.
3:  while NOT all vertices in G are evaluated do
4:    figure out all activated vertices in G as a set V_t.
5:    push V_t into S.
6:    evaluate F on V_t in a batched manner (see §3.5).
7:    set the status of all vertices in V_t as evaluated.
8:    set t = t + 1.
9:  end while
10:  return S.
11:end function
12:function Backward(S)
13:  set t as the size of S.
14:  while S is not empty do
15:    pop the top element of S as V_t.
16:    evaluate ∂F on V_t in a batched manner (see §3.5).
17:    set t = t - 1.
18:  end while
19:end function
Algorithm 1 Backpropagation with the batching policy.

To train a NN to convergence, the above process has to be iterated by the scheduler over all samples and their associated graphs, for many epochs. Instead of sequential execution, Cavs adopts a batching policy to perform batched computation, exploiting the fact that evaluating a set of identical arithmetic operations together is significantly faster than evaluating each of them sequentially.

Batching policy. Given a data batch and its associated graphs, this policy groups multiple vertices and then performs batched evaluation of F in order to reduce kernel launches and exploit parallelism. Algorithm 1 details this policy. Specifically, the scheduler divides the forward evaluation of (a batch of) graphs into multiple steps. At each step t, it analyzes the graphs at runtime and determines a set V_t containing all currently activated vertices. It then evaluates F over these vertices by creating a batching execution task, with the task ID set to t. (Without ambiguity, we use V_t to denote both the set of vertices to be batched together and the batching task itself.) The task is passed to a graph execution engine, which will further optimize the execution and conduct the actual computation (§3.5). Meanwhile, the scheduler records the information of this task by pushing V_t onto a stack S. At each step of the backward pass, the scheduler pops an element from S and creates a corresponding backward batching task – the execution engine will evaluate the derivative function ∂F over the vertices in it – until all vertices have been evaluated.

We note the batching policy plays a similar role to the dynamic batching in Fold looks2017deep and DyNet neubig2017dynet in the scenario of dynamic declaration. However, Cavs determines how to batch fully dynamically at runtime using a simple breadth-first search with negligible cost (instead of analyzing graphs before every iteration of the execution), as sketched below. We next describe an improved memory management strategy based on this batching policy.
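To make this scheduling step concrete, the following is a minimal sketch (our own illustration, not Cavs’ actual code; the Vertex struct and function name are hypothetical) of how batching tasks can be formed by a breadth-first traversal over the vertices of a batch of input graphs:

#include <cstddef>
#include <vector>

struct Vertex {
  std::vector<int> children;  // vertices this vertex depends on
  std::vector<int> parents;   // vertices that depend on this vertex
};

// Group the vertices of all input graphs in the batch into batching tasks
// V_0, V_1, ...: a vertex joins a task as soon as all its children are evaluated.
std::vector<std::vector<int>> BuildBatchingTasks(const std::vector<Vertex>& vertices) {
  std::vector<int> unresolved(vertices.size());
  std::vector<int> frontier;
  for (std::size_t v = 0; v < vertices.size(); ++v) {
    unresolved[v] = static_cast<int>(vertices[v].children.size());
    if (unresolved[v] == 0) frontier.push_back(static_cast<int>(v));  // leaves are activated first
  }
  std::vector<std::vector<int>> tasks;
  while (!frontier.empty()) {
    tasks.push_back(frontier);  // one batched evaluation of F per task
    std::vector<int> next;
    for (int v : frontier)
      for (int p : vertices[v].parents)
        if (--unresolved[p] == 0) next.push_back(p);  // parent becomes activated
    frontier.swap(next);
  }
  return tasks;  // replayed in reverse order (as a stack) for the backward pass
}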

3.3 Memory Management

Figure 5: A color or a dashed-line box corresponds to a batching task. The rectangles are memory blocks. The numbers are vertex IDs. Memory blocks in one row belong to one dynamic tensor and are physically continuous, though we separate them into different boxes.

In static declaration abadi2016tensorflow ; neubig2017dynet , a symbol in the user program usually corresponds to a tensor object, with its shape inferred from the program and the batch size specified in advance. The framework usually preallocates continuous storage in memory for each tensor and fixes it throughout the runtime. However, in Cavs, as each batching task is determined at runtime and not visible before execution, memory management exhibits more complexity. For the batched computation to be efficient, Cavs must guarantee that the inputs and intermediate states produced during the evaluation of F over a group of runtime-determined vertices are coalesced in memory. If we adopted the aforementioned strategy, for each operation in F, Cavs would have to index each slice of its input tensor (which may be scattered in different places) and rearrange them into a continuous memory block, which might cause nontrivial overhead.

Cavs proposes a novel data structure, the dynamic tensor, to address this challenge (Figure 6). A dynamic tensor is a wrapper of a multi-dimensional array abadi2016tensorflow ; walt2011numpy that contains four main members: shape, bs, a pointer p to a chunk of memory, and offset. shape is an array of integers representing the shape of the tensor excluding the batch dimension. It can be inferred from the user program and set before execution. The batch size bs is implemented as a placeholder in Cavs, with its value dynamically set by the scheduler at the beginning of a batching task. For each dynamic tensor, Cavs preallocates a chunk of continuous memory and points p to its starting address. This memory block is often very large and not fixed-size – it can be further extended if needed. To access a dynamic tensor, the execution engine moves p forward by the value specified in offset, and reads/writes a number of elements equal to bs times the product of the entries of shape. Therefore, bs together with offset provides a view of the tensor, and the state of the tensor varies based on their values. Given a vertex function F, Cavs creates dynamic tensors for each non-parameter symbol in F and for their gradients, while it creates static tensors for model parameters.

struct DynamicTensor {
  vector<int> shape;
  int bs;
  int offset;
  void* p;
};

Figure 6: Dynamic tensor.
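To make the offset/bs bookkeeping concrete, here is a minimal C++ sketch (element type float assumed for illustration; the helper names are ours, not Cavs’ actual API) of how an execution engine could derive the continuous block viewed by the current batching task from a dynamic tensor:

#include <cstddef>
#include <vector>

struct DynamicTensorSketch {
  std::vector<int> shape;  // per-vertex shape, excluding the batch dimension
  int bs = 0;              // batch size of the current batching task
  std::size_t offset = 0;  // start of the current view, counted in elements
  float* p = nullptr;      // preallocated, growable chunk of continuous memory
};

std::size_t SampleSize(const DynamicTensorSketch& t) {
  std::size_t n = 1;
  for (int d : t.shape) n *= static_cast<std::size_t>(d);
  return n;
}

// The continuous block read/written by the current task starts at p + offset
// and holds bs * prod(shape) elements.
float* ViewBegin(const DynamicTensorSketch& t) { return t.p + t.offset; }
std::size_t ViewLength(const DynamicTensorSketch& t) {
  return static_cast<std::size_t>(t.bs) * SampleSize(t);
}

// After a task finishes, the scheduler advances the offset past the block just written.
void AdvanceOffset(DynamicTensorSketch& t) { t.offset += ViewLength(t); }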

Figure 5 illustrates how memory is assigned during the forward pass by simply manipulating dynamic tensors. In particular, in a training iteration, for a batching task V_t, the scheduler sets bs of all dynamic tensors to |V_t| (the number of vertices in V_t). The execution engine then performs batched evaluation of each expression in F one by one. For an expression such as y = op(x) (without loss of generality, user-defined expressions can be arbitrary, e.g. with more than one argument or return value), the engine first reads T_x, the dynamic tensor of the RHS symbol x, by offsetting its pointer by offset and then reading a block of bs × prod(shape) elements, and presents it as a tensor with batch size |V_t| and the other dimensions specified by shape. It then applies the batched computational kernel of the operator op over this continuous block, and writes the results to T_y, the dynamic tensor of the LHS symbol y, in the continuous block starting at its current offset. Upon the completion of V_t, the scheduler increases the offset of each dynamic tensor by the size of the block just written, respectively. It then starts the next batching task, until F has been evaluated at all vertices of all input graphs. Hence, the intermediate results generated in each batching task in the forward pass are stored continuously in the dynamic tensors, and their offsets are recorded.

The scheduler then starts the backward pass. It initializes the gradient dynamic tensor of each dynamic tensor to zero. Since the backward execution follows exactly the reverse order of the forward pass (Algorithm 1), the intermediate results generated during the forward pass can be easily accessed by decreasing the offsets of the dynamic tensors. Specifically, for each batching task V_t popped from S, the execution engine sets bs of all dynamic tensors to |V_t|, and decreases the offset of each tensor and its gradient by the size of the block written for this task in the forward pass. For an expression in ∂F that corresponds to an expression in F (see §3.4), the engine reads the current states of the involved dynamic tensors (which are continuous in memory) and performs batched execution of grad_op. Different from the forward pass, the gradient result is added to the current state of the gradient tensor instead of overwriting it.

At the entrance of F, the vertices in the current task V_t need to interact with their dependent vertices evaluated in previous tasks to gather their outputs as inputs (L3 of Figure 4), or to pull inputs from the external (L5 of Figure 4). Cavs maintains memory buffers to enable this (Figure 5). The memory buffers are key-value stores where the key is a vertex ID and the value is a tensor slice corresponding to the results of the scattered symbol at that vertex (with batch size 1). Cavs provides a query function IndexBuffer(op, key) that returns the value in op’s corresponding buffer given a key. During execution, a gather expression triggers memory movements: for each vertex in V_t, the scheduler figures out the vertex ID of its child vertex with index child_idx in the input graph; it then indexes the content in the gatherBuffer with that vertex ID as key, and copies and writes it into the gather output tensor as a slice. Similarly, at the exit of F, a scatter expression splits the current state of its operand tensor into slices, and moves them to the gatherBuffer for its parent vertices to gather. The push/pull operators work similarly to gather/scatter, but over the pushBuffer and pullBuffer, respectively, to communicate messages with the external.
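For illustration, a minimal sketch of the gather-side memory movement (our own simplification; the buffer layout, types, and names are assumptions, and leaf vertices without the indexed child are omitted) could look as follows, copying the children’s slices into one continuous destination block:

#include <cstddef>
#include <cstring>
#include <unordered_map>
#include <vector>

using Slice = std::vector<float>;                            // per-vertex result, batch size 1
using GatherScatterBuffer = std::unordered_map<int, Slice>;  // key: vertex ID

// Copy, for every vertex in the current task, the slice of its child_idx-th child
// into dst, a continuous block of task.size() * slice_size floats.
void GatherTask(const GatherScatterBuffer& buffer,
                const std::vector<std::vector<int>>& children,  // children[v]: child IDs of vertex v
                const std::vector<int>& task, int child_idx,
                std::size_t slice_size, float* dst) {
  for (std::size_t i = 0; i < task.size(); ++i) {
    int child = children[task[i]][child_idx];
    const Slice& s = buffer.at(child);
    std::memcpy(dst + i * slice_size, s.data(), slice_size * sizeof(float));
  }
}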

Algorithm 2 summarizes the memory management process during the forward pass. With this strategy, Cavs guarantees memory continuity for any batched computation in F. Compared to dynamic batching in DyNet, Cavs performs memory movement only at the entrance and exit of F, instead of for each expression (operator); it therefore significantly reduces the overhead from memory operations (§5.3).

1:function Forward(F, batching tasks V_0, ..., V_{T-1})
2:  for t = 0, ..., T-1 do
3:    for each dynamic tensor in F do set its bs to |V_t| end for
4:    for each expression in F do
5:      if the expression is a gather(child_idx) then
6:        locate the output dynamic tensor of the gather expression.
7:        for each vertex v in V_t do
8:          find the child of v with index child_idx in v's input graph.
9:          copy IndexBuffer(gather, child ID) into the output dynamic tensor as a slice.
10:        end for
11:      else if the expression is a scatter(op) then
12:        locate the dynamic tensor of op.
13:        for each vertex v in V_t do
14:          take the slice of the tensor corresponding to v.
15:          write it into the gatherBuffer with key v.
16:        end for
17:      else
18:        perform batched computation of the expression over the corresponding dynamic tensors (§3.3).
19:      end if
20:    end for
21:    for each dynamic tensor in F do increase its offset by |V_t| x prod(shape) end for
22:  end for
23:end function
Algorithm 2 Memory management at forward pass.

3.4 Auto-differentiation

Cavs naturally supports auto-differentiation. Given a vertex function F, it derives ∂F following the auto-differentiation rules: for each math expression such as y = op(x) in F, Cavs generates a corresponding backward expression ∇x = grad_op(x, y, ∇y) in ∂F. For the four proposed operators, with the memory management strategy described above, we note that scatter is the backward operator of gather, in the sense that if gather collects inputs from the gatherBuffer previously written by scatter in the forward pass, a scatter needs to be performed to write the gradients to the gatherBuffer for its dependent vertices to gather in the backward pass. Hence, for an expression like S = gather(k) in F, Cavs will generate a backward expression scatter(∇S) in ∂F. Similarly, the gradient operator of scatter is gather. The same auto-differentiation rule applies to push and pull as well.
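As a concrete illustration of these rules on the expressions of Figure 4 (our own worked example, using the standard derivative of tanh): the forward expression h = o ⊙ tanh(c) yields the backward expressions ∇o += tanh(c) ⊙ ∇h and ∇c += o ⊙ (1 - tanh(c)^2) ⊙ ∇h, with the results accumulated as described in §3.3; and the forward expression S = gather(k) yields the backward expression scatter(∇S), which writes ∇S to the gatherBuffer for the k-th child to gather during its own backward step.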

3.5 Graph Execution Engine

Figure 7: The dataflow graph encoded by the vertex function F of Tree-LSTM.

Benefiting from the vertex-centric representation, the vertex function essentially defines a (small) static dataflow graph that is open to various graph execution optimization techniques (which might not be the case in dynamic declaration). We next discuss three optimizations in Cavs’ execution engine for improved performance.

Lazy batching and streaming. (Streaming is a terminology borrowed from CUDA programming which means executing different commands concurrently or out of order with respect to each other on different GPU streams. As Cavs’ optimization strategies are agnostic to the low-level hardware, we use streaming interchangeably with multi-threading if the underlying computing hardware is a CPU.) In addition to the batched execution of F, lazy batching and streaming explore potential parallelism for certain groups of finer-grained operators in F or ∂F, called lazy and eager operators.

Definition.

An operator o in F (∂F) is a lazy operator if, at the forward (backward) pass, for any vertex v, the evaluation of F (∂F) at any parent (dependent) vertex of v does not rely on the evaluation of o at v. It is an eager operator if the evaluation of o at v does not rely on the evaluation of F (∂F) at any dependent (parent) vertex of v.

In Cavs, figuring out the eager and lazy operators in F and ∂F is straightforward given the following proposition:

Proposition.

Denote G_F (G_∂F) as the dataflow graph encoded by F (∂F), and v_g (v_s) as the node corresponding to the gather (scatter) operator, respectively. A node that has v_g as its ancestor and is not on any path from v_g to v_s is a lazy operator. A node that has v_s as its descendant and is not on any path from v_g to v_s is an eager operator.

Figure 7 illustrates the dataflow graph of the vertex function of Tree-LSTM, with eager and lazy operators colored. A property of these operators is that their evaluation is not fully subject to the dependencies reflected by the input graph G. For instance, the pull operator in Figure 7 is eager and can be executed ahead of time – even before F has been evaluated at the vertices gather tries to interact with; the push operator is lazy, so we can defer its execution without impacting the evaluation of F at parent vertices. Similarly, in ∂F, the gradient derivations for model parameters are mostly lazy – their execution can be deferred as long as the gradients of the hidden states are derived and propagated in time. Cavs leverages this property and proposes the lazy batching strategy. It defers the execution of all lazy operators in F and ∂F until all batching tasks have finished. It then performs a batched execution of these lazy operators over all vertices of all input graphs. These operators include, but are not limited to, the push operator, which performs memory copies, and the math operators computing gradients of the model parameters. Lazy batching helps exploit more parallelism in the execution of lazy operators and significantly reduces kernel launches. Empirically, lazy batching brings an overall improvement (see §5.3).

On the other hand, we are unable to “eagerly” batch eager operators, as their execution over some vertices relies on knowing the detailed memory locations of all intermediate results in advance, a condition which is not satisfied in Cavs where memory is dynamically assigned. To leverage the exhibited parallelization opportunity between eager operators and the operators on the path from gather to scatter (Figure 7), Cavs proposes a streaming strategy that pipelines the execution of these two groups of operators. It allocates two streams, puts the eager operators on one stream, and the rest (excluding lazy operators) on the other. Hence, independent operators in the two streams run in parallel, while for those operators that depend on an eager operator, the dependency is respected by synchronization barriers (see Figure 7). It is also possible to parallelize independent paths from v_g to v_s in the graph, but we find it does not yield further improvement.
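The following is a minimal sketch of the streaming idea using the standard CUDA runtime API (the kernels themselves are elided and left as comments; this is our illustration of the mechanism, not Cavs’ actual code): eager operators run on one stream, the gather-to-scatter chain on another, and an event acts as the synchronization barrier where the chain consumes an eager operator’s output.

#include <cuda_runtime.h>

void RunTaskWithTwoStreams() {
  cudaStream_t eager_stream, main_stream;
  cudaEvent_t eager_done;
  cudaStreamCreate(&eager_stream);
  cudaStreamCreate(&main_stream);
  cudaEventCreate(&eager_done);

  // launch_pull<<<grid, block, 0, eager_stream>>>(...);   // eager operator, can start early
  cudaEventRecord(eager_done, eager_stream);               // mark completion of the eager work
  // launch_gather<<<grid, block, 0, main_stream>>>(...);  // independent work runs concurrently
  cudaStreamWaitEvent(main_stream, eager_done, 0);         // barrier: the next op needs pull's output
  // launch_gates<<<grid, block, 0, main_stream>>>(...);

  cudaStreamSynchronize(main_stream);
  cudaEventDestroy(eager_done);
  cudaStreamDestroy(eager_stream);
  cudaStreamDestroy(main_stream);
}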

Automatic kernel fusion. Since Cavs abstracts out a static dataflow graph encoded by F that will be dynamically evaluated elsewhere, we can replace the original graph with an optimized one that runs more efficiently, as long as it accepts the same inputs and produces the same outputs.

Particularly, given F, before execution, Cavs will run a fusion detector gysi2015stella to scan its corresponding dataflow graph and report all fuse-able subgraphs therein, i.e. subgraphs whose nodes can all be fused into a single operator that behaves equivalently but takes less execution time (e.g. with fewer kernel launches and less I/O, or faster computation). Currently, we only detect groups of directly linked elementwise operators, such as those shown in Figure 7, and we use a simple union-find algorithm to detect the largest possible fuse-able subgraphs. Given a fuse-able subgraph, Cavs adopts de facto automatic code generation techniques quinlan2000rose ; dave2009cetus ; ragan2013halide ; nvrtc to generate lower-level kernel code as its implementation. Replacing the original fuse-able subgraphs with fused operators during execution is beneficial in several aspects: (1) it reduces the number of kernel launches; (2) on some devices such as GPUs, kernel fusion transforms device memory accesses into faster register accesses. We empirically observe a further improvement with automatic kernel fusion (§5.3).
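For intuition, a fused operator produced for such a subgraph might look like the following single-pass loop (a simplified single-child variant written by us for illustration, not the generated code): the unfused version would launch one kernel and make one pass over memory per elementwise operator, whereas the fused version makes a single pass and keeps intermediates in registers.

#include <cmath>
#include <cstddef>

// Fused elementwise tail of a (simplified, single-child) LSTM cell:
//   c = i * u + f * c_prev;  h = o * tanh(c)
void FusedCellTail(const float* i, const float* u, const float* f, const float* c_prev,
                   const float* o, float* c, float* h, std::size_t n) {
  for (std::size_t k = 0; k < n; ++k) {   // on a GPU, each k would map to one thread
    float ck = i[k] * u[k] + f[k] * c_prev[k];
    c[k] = ck;                            // ck stays in a register between the two expressions
    h[k] = o[k] * std::tanh(ck);
  }
}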

4 Implementation

We implement Cavs as a pluggable C++ library that can be integrated with existing DL frameworks to provide or enhance their support for dynamic NNs. We next briefly discuss implementation details. For clarity, we assume the host framework is composed of three major layers (which is the case for most popular frameworks Bergstra:2011:NIPSW ; abadi2016tensorflow ; neubig2017dynet ): (1) a frontend that provides a device-agnostic symbolic programming interface; (2) an intermediate layer that implements the core execution logic; (3) a backend with device-specific kernels for all provided operators.

Frontend. Cavs provides a base class GraphSupport in addition to conventional operators and the four proposed APIs (§3.1). Users are required to instantiate it by providing a symbolic vertex function – therefore an instantiation of GraphSupport represents a single dynamic structure. To construct more complex structures (e.g. encoder-decoder LSTM sutskever2014sequence , LRCN donahue2015long ), users employ push and pull to connect this dynamic structure to external structures.

Intermediate layer. At the intermediate layer, Cavs creates a unique scope abadi2016tensorflow and generates a small dataflow graph for each instantiation of GraphSupport, connecting them appropriately with other parts of the model according to the user program. Cavs implements its core runtime logic at this layer, i.e. the scheduler, the memory management, and the graph execution engine. During execution, the execution engine first analyzes the received dataflow graphs and incorporates the optimizations described in §3.5. The scheduler then instructs the system to read training samples and their associated graphs (e.g. adjacency matrices). It starts training by submitting batching tasks to the execution engine and assigning memory accordingly.

Backend. Following common practice abadi2016tensorflow ; neubig2017dynet ; caffe2 , Cavs puts device-specific kernel implementations for each supported operator at this layer. Each operator implementation is a function that takes static tensors as inputs and produces static tensors as outputs – the higher-layer logic, i.e. how the computation is scheduled or how the memory is assigned, is invisible to this layer. Cavs reuses the native operator implementations from the host framework, while it provides optimized implementations for the four proposed primitives (gather, scatter, pull, push). Specifically, gather and pull index different slices of a tensor and put them together continuously in memory; scatter and push, by contrast, split a tensor along its batch dimension and copy different slices to different places. Cavs implements a customized memcpy kernel for these four operators so that copying multiple slices from (or to) different places is performed within one kernel, further reducing kernel launches.

5 Evaluation

In this section, we evaluate Cavs on training different NNs across multiple datasets, obtaining the following major findings: (1) Cavs has little overhead: when training static NNs that can by nature be batched, Cavs demonstrates equal performance to other DL systems; on several NNs with notably difficult-to-batch structures, Cavs outperforms all existing frameworks by a large margin. (2) We confirm the graph construction overhead is substantial in both Fold looks2017deep and dynamic declaration neubig2017dynet , while Cavs bypasses it by loading input graphs through I/O. (3) We verify the effectiveness of our proposed design and optimizations via ablation studies, and discuss Cavs’ advantages over other state-of-the-art DL systems for dynamic dataflow graphs.

Environment. We perform all experiments in this paper on a single machine with an NVIDIA Titan X (GM200) GPU, a 16-core (32-thread) CPU, and CUDA toolkit 8.0 and cuDNN v6 installed. As modern DL models are mostly trained using GPUs, we focus our evaluation on GPUs, but note that Cavs’ design and implementation do not rely on a specific type of device. We borrow the implementations of most mathematical operators from TensorFlow v1.2, while we implement the four proposed operators and other system modules ourselves. We mainly compare Cavs to TensorFlow v1.2 abadi2016tensorflow with XLA xla and its variant Fold looks2017deep , as well as DyNet v2.0 neubig2017dynet with autobatching neubig2017fly , as they have reported better performance than other frameworks pytorch ; chainer on dynamic NNs. We focus on metrics for system performance, e.g. the average time to scan one epoch of data. Cavs produces exactly the same numerical results as other frameworks, hence the same per-epoch convergence. (The code of Cavs will be released along with the next major release of the DyNet project: http://dynet.io/.)

Models and datasets. We experiment on the following models, in increasing order of difficulty to batch: (a) Fixed-LSTM language model (LM): a static sequence LSTM with fixed steps for language modeling sundermeyer2012lstm ; sutskever2014sequence ; zaremba2014recurrent . We train it using the PTB dataset ptb , which contains over 10K different words. We set the number of steps to 64, i.e. at each iteration of training, the model takes a 64-word sentence from the training corpus, and predicts the next word of each word therein. Obviously, the computation can by nature be batched easily, as each sentence has exactly the same size. (b) Var-LSTM LM: a variant that accepts variable-length inputs. At each iteration the model takes a batch of natural sentences with different lengths from PTB, and predicts the next words. (c) Tree-FC: the benchmarking model used in looks2017deep with a single fully-connected layer as its cell function. Following the same setting as looks2017deep , we train it over synthetic samples generated by their code treefc-code – each sample is associated with a complete binary tree with 256 leaves (therefore 511 vertices per graph). (d) Tree-LSTM: a family of dynamic NNs widely adopted for text analysis liang2016semantic ; vinyals2015grammar . We implement the binary child-sum Tree-LSTM model in tai2015improved , and train it as a sentiment classifier using the Stanford sentiment treebank (SST) dataset socher2013recursive , which contains 8544 training sentences, the longest of which has 54 words. Each sentence is associated with a human-annotated grammar tree.

Figure 8: Comparing five systems in terms of the average time to finish one epoch of training (lower is better) on four models: Fixed-LSTM, Var-LSTM, Tree-FC and Tree-LSTM. In (a)-(d) we fix the hidden size and vary the batch size, while in (e)-(h) we fix the batch size and vary the hidden size.

5.1 Overall Performance

We first verify the viability of our design on the easiest-to-batch case: the Fixed-LSTM language model. We compare Cavs to the following strong baselines: (1) CuDNN chetlur2014cudnn : a CuDNN-based fixed-step sequence LSTM, which is highly optimized by NVIDIA using handcrafted kernels and stands as the best-performing implementation on NVIDIA GPUs; (2) TF: the official implementation of the Fixed-LSTM LM in the TensorFlow repository TF-fixedLSTM based on static declaration; (3) DyNet: we implement a 64-step LSTM in DyNet based on dynamic declaration – we declare a dataflow graph per sample and train with autobatching neubig2017fly enabled; (4) Cavs with the batching policy, where all input samples share the same input graph – a 64-node chain. We train the model to convergence and report the average time per epoch in Figure 8(a)(e), where in (a) we fix the hidden size of the LSTM unit to 512 and vary the batch size, and in (e) we fix the batch size and vary the hidden size. Empirically, CuDNN performs best in all cases, but note that it is highly inflexible. Cavs performs slightly better than TF in various settings, verifying that our system has little overhead dealing with fully static graphs, though it is specialized for dynamic ones. We also conclude from Figure 8 that batching is essential for GPU-based DL: batched execution is nearly one order of magnitude faster than single-instance execution regardless of the framework used. For Cavs, the batching policy is 1.7x, 3.8x, 7.0x, 12x, 15x, 25x, 36x faster than the serial policy at the respective batch sizes.

Next, we experiment with Var-LSTM, the most common RNN for variable-length sequences. We compare the following three implementations (the CuDNN-based LSTM cannot handle variable-length inputs): (1) TF: an official TensorFlow implementation based on the dynamic unrolling approach described in §2.2; (2) DyNet: an official implementation from the DyNet benchmark repository based on dynamic declaration DyNet-VarLSTM ; (3) Cavs: where each input sentence is associated with a chain graph whose number of vertices equals the number of words. We vary the batch size and the hidden size, and report the results in Figure 8(b)(f), respectively. Although all three systems perform batched computation in different ways, Cavs is consistently 2-3 times faster than TF, and outperforms DyNet by a large margin. Compared to TF, Cavs saves computational resources: TF dynamically unrolls the LSTM unit according to the longest sentence in the current batch, but it cannot prevent unnecessary computation for those sentences that are shorter than the longest one.

We then turn to Tree-FC, a dynamic model for benchmarking. Since vanilla TensorFlow is unable to batch its computation, we compare Cavs to (1) DyNet and (2) Fold, a specialized library built upon TensorFlow for dynamic NNs with a depth-based dynamic batching strategy. To enable the batching, Fold needs to preprocess the input graphs, translate them into intermediate representations, and pass them to the lower-level TensorFlow control flow engine for execution. We report the results in Figure 8(c)(g) with varying batch size and hidden size, respectively. For all systems, we allocate a single CPU thread for graph preprocessing or construction. Cavs shows at least an order of magnitude speedup over Fold and DyNet in these experiments. Because the size of the synthetic trees is large, one major advantage of Cavs over them is the alleviation of substantial graph preprocessing/construction overhead. With a single CPU thread, Fold takes even more time on graph preprocessing than on computation (§5.3).

Finally, we compare the three frameworks on Tree-LSTM in Figure 8(d)(h): Cavs is 8-10x faster than Fold, and consistently outperforms DyNet. One difference in this experiment is that we allocate as many CPU threads as possible (32 on our machine) to accelerate graph preprocessing for Fold, as otherwise it takes much longer. Further, we note DyNet performs much better here than on Tree-FC, as the input graphs in SST (maximally 52 leaves) are much smaller than the synthetic ones (256 leaves each) in the Tree-FC experiments. We observe that DyNet needs more time for graph construction on large input graphs, and that its dynamic batching is less effective on larger input graphs, as it has to perform frequent memory checks to support dynamic batching, which we discuss in §5.3.

5.2 Graph Construction and Computation

In this section, we investigate the graph construction overhead in Fold and DyNet. To batch the computation of different graphs, Fold analyzes the input graphs to recognize batchable dynamic operations and translates them into intermediate instructions, from which TensorFlow generates appropriate control flow graphs for evaluation – we treat the overhead of both steps as Fold's graph construction overhead. DyNet, as a typical dynamic declaration framework, has to construct as many dataflow graphs as there are samples. Though DyNet has optimized its graph construction to be lightweight, the overhead still grows with the training set and the size of the input graphs. By contrast, Cavs takes constant time to construct the small dataflow graph encoded by the vertex function, and then reads input graphs through I/O. To quantify the overhead, we separate graph construction from computation, and visualize in Figure 9(a) how it changes with the average number of leaves (graph size) of the input graphs when training Tree-FC, with the batch size fixed. We compare (1) Cavs, (2) Fold-1, which is Fold with one graph processing thread, and (3) DyNet. For one epoch, we plot both the (averaged) absolute time for graph construction and its percentage of the overall time. All three systems take increasingly more time as the input graphs grow, but Cavs, which loads graphs through I/O, incurs the least overhead in all settings. In relative terms, Fold wastes 50% of its time at 32 leaves and 80% when the tree has 1024 leaves, while DyNet and Cavs spend far smaller fractions.
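As a rough sketch of why the overheads scale so differently (the per-unit costs below are made-up placeholders; only what each term grows with matters):

# Hypothetical per-epoch cost model for graph construction/preprocessing;
# the constants are placeholders, the scaling behavior is the point.
def dynamic_declaration_cost(num_samples, avg_graph_nodes, per_node=1e-6):
    # One dataflow graph per sample: grows with the training-set size and
    # with the size of every input graph.
    return num_samples * avg_graph_nodes * per_node

def fold_preprocess_cost(num_samples, avg_graph_nodes, per_node=3e-6):
    # Analyze each input graph and translate it into intermediate
    # instructions before execution: the same scaling, different constant.
    return num_samples * avg_graph_nodes * per_node

def cavs_construction_cost(vertex_function_ops, per_op=1e-4):
    # One small dataflow graph for the vertex function, built once; the
    # per-sample graphs are only read through I/O.
    return vertex_function_ops * per_op

print(dynamic_declaration_cost(50_000, 512))  # grows with data and graph size
print(cavs_construction_cost(30))             # constant w.r.t. both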

We also examine how this overhead relates to the batch size when the computational workload is fixed. We report in Figure 9(b) the same metrics when training Tree-LSTM with varying batch size. We add another baseline, Fold-32, which uses 32 threads for Fold's graph preprocessing. As Fold-1 takes much longer than the others, we report its time here instead of in Figure 9: 1.1, 7.14, 31.35, 40.1, 46.13, 48.77. Apart from Fold-1, all systems take almost constant time for graph construction in one epoch regardless of the batch size; Fold-32 and DyNet take similar time, while Cavs takes 20x less. Nevertheless, on the percentage scale, increasing the batch size makes this overhead more prominent, because a larger batch size yields better computational efficiency and therefore less time to finish one epoch. This reflects, from one perspective, that graph construction is a main obstacle that grows with the number of training samples and prevents efficient training of dynamic NNs in existing frameworks, while Cavs overcomes this barrier through its design.

Figure 9: The averaged graph construction overhead per epoch when training (a) Tree-FC with different sizes of input graphs and (b) Tree-LSTM with different batch sizes. The curves show absolute time in seconds (left y-axis), and the bars show its percentage of the overall time (right y-axis).

Apart from graph construction, we report in Table 1 the computation-only time: Cavs achieves up to 5.4x/9.7x speedups over Fold/DyNet on Tree-FC, and up to 7.2x/2.4x on Tree-LSTM. Besides lower system overhead, the advantage stems from two main sources: an optimized graph execution engine and a better-suited memory management strategy, which we investigate next.

# leaves   Comp. time (s)        Speedup         Batch size   Comp. time (s)        Speedup
32         0.58 / 3.1 / 4.1      5.4x / 7.1x     1            76.2 / 550 / 61.6     7.2x / 0.8x
64         1.1 / 3.9 / 8.0       3.7x / 7.5x     16           9.80 / 69 / 12        7.0x / 1.2x
128        2.0 / 6.2 / 16.0      3.0x / 7.9x     32           6.15 / 43 / 9.9       7.0x / 1.6x
256        3.9 / 10.6 / 33.7     2.7x / 8.7x     64           4.1 / 29 / 7.4        7.2x / 1.8x
512        8.0 / 18.5 / 70.6     2.3x / 8.9x     128          2.9 / 20.5 / 5.9      7.1x / 2.0x
1024       15.8 / 32.4 / 153     2.1x / 9.7x     256          2.3 / 15.8 / 5.4      7.0x / 2.4x
Table 1: The computation time in seconds (Cavs / Fold / DyNet) and the speedup (Cavs vs. Fold / DyNet) for training one epoch on Tree-FC with varying input tree sizes (left part), and on Tree-LSTM with varying batch sizes (right part).

5.3 Ablation Studies

Graph Execution Engine. To reveal how much each optimization in §3.5 contributes to the final performance, we disable lazy batching, fusion, and streaming in Cavs and set this configuration as the baseline (speedup = 1). We then turn on one optimization at a time and record the speedup it brings. We train Fixed-LSTM and Tree-LSTM, and report in Figure 10 the average speedup in computation-only time per epoch over the baseline configuration, with the batch size fixed and the hidden size varied. Lazy batching and fusion consistently deliver nontrivial improvements – lazy batching is more beneficial at a larger hidden size while fusion is more effective at a smaller one, which is expected: lazy batching mainly parallelizes matrix operations (e.g. matmul), whose complexity grows superlinearly with the hidden size, while fusion mostly targets elementwise operations, whose complexity is linear in the hidden size gustafson1988reevaluating .
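This intuition can be checked with a rough operation count for a single LSTM cell; the sketch below assumes the standard four-gate formulation and approximate constants.

# Rough FLOP counts for one LSTM cell with batch size b and hidden size h.
# Gate matmuls scale as O(b*h^2), elementwise gate math as O(b*h): batching
# the matmuls pays off at large h, while fusing the many small elementwise
# kernels matters most at small h.
def matmul_flops(b, h):
    # four gates, each multiplying [b, 2h] (input + previous state) by [2h, h]
    return 4 * b * (2 * h) * h * 2

def elementwise_flops(b, h):
    # sigmoid/tanh activations and pointwise products, a few ops per element
    return 10 * b * h

for h in (64, 512, 2048):
    ratio = matmul_flops(32, h) / elementwise_flops(32, h)
    print(f"h={h}: matmul vs. elementwise FLOPs ~ {ratio:.0f}x")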

Streaming, compared to the other strategies, is less effective on Tree-LSTM than on Fixed-LSTM, as we found that the depth of the input trees in SST exhibits high variance, i.e. some trees are much deeper than others. In this case, many batching tasks have only one vertex to evaluate; the computation is highly fragmented and the efficiency is bounded by kernel launch latency. Lazy batching and fusion still help, as they both reduce kernel launches (§3.5). Streaming, which tries to pipeline multiple kernels, can hardly yield a noticeable improvement.

Figure 10: Improvement of each optimization strategy in the execution engine over the baseline configuration (speedup = 1).
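The fragmentation effect can be illustrated by counting how many vertices are available to each batching step when tree depths are skewed; the tree shapes below are hypothetical.

# Why skewed tree depths fragment batched execution: vertices that become
# ready at the same step form one batching task, and a task with a single
# vertex cannot amortize its kernel-launch latency.
from collections import Counter

def vertices_per_step(tree_depths):
    # tree_depths: depth of each tree in the batch; assume, for illustration,
    # that one vertex per level per tree becomes ready at each step.
    counts = Counter()
    for depth in tree_depths:
        for level in range(depth):
            counts[level] += 1
    return [counts[level] for level in sorted(counts)]

print(vertices_per_step([6, 6, 6, 6]))    # balanced: [4, 4, 4, 4, 4, 4]
print(vertices_per_step([2, 3, 4, 20]))   # skewed: [4, 4, 3, 2, 1, 1, ..., 1]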

Memory Management. Cavs' performance advantage is also attributable to its memory management, which reduces memory movement while guaranteeing memory continuity.

Quantitatively, it is difficult to compare Cavs to Fold, as Fold relies on TensorFlow, where memory management is tightly coupled with other aspects of the system. Qualitatively, Cavs requires less memory movement (e.g. memcpy) during dynamic batching. Built upon the tf_while operator, whenever Fold performs depth-based batching, it has to move the contents of all dataflow-graph nodes at the current depth to a desired location, as the control flow does not support cross-depth memory indexing. This results in redundant memcpy, especially when the graphs are highly skewed. By contrast, Cavs only copies the contents needed by the current batching task. DyNet has a specialized memory management strategy for dynamic NNs; compared to Cavs, however, it suffers substantial overhead from repeated checks of memory continuity – whenever DyNet wants to batch operators with the same signature, it checks whether their inputs are continuous in memory neubig2017fly . The checking overhead increases with the batch size and is more prominent on GPUs. Thanks to the simplicity of both systems, we are able to profile the memory-related overhead during both training and inference and separate it from computation. We compare the two systems on Tree-LSTM, and report the per-epoch breakdown in Table 2 under different batch sizes. The improvement is significant (2-3x) at larger batch sizes, especially during inference, where DyNet's continuity checks are concentrated.

Batch size   Memory operations (s) (Cavs / DyNet)        Computation (s) (Cavs / DyNet)
             Train           Inference                   Train          Inference
16           1.14 / 1.33     0.6 / 1.33                  9.8 / 12       2.9 / 8.53
32           0.67 / 0.87     0.35 / 0.87                 6.1 / 9.8      1.9 / 5.35
64           0.39 / 0.6      0.21 / 0.6                  4.0 / 7.4      1.3 / 3.48
128          0.25 / 0.44     0.13 / 0.44                 2.9 / 5.9      0.97 / 2.52
256          0.17 / 0.44     0.09 / 0.44                 2.3 / 5.4      0.77 / 2.58
Table 2: Breakdown of the average time per epoch on memory-related operations and computation. We compare Cavs to DyNet on Tree-LSTM for both training and inference, with varying batch size.
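The following sketch abstracts the two behaviors we profile; it is a simplification for illustration, not the actual code of either system. A DyNet-style batcher must check, for every batched operation, whether the operands already sit back-to-back in memory and copy them together otherwise, whereas a Cavs-style scheduler gathers exactly the rows needed by the current batching task into a contiguous buffer, with no per-operation checks.

# Simplified contrast of the two memory strategies (illustrative only).
import numpy as np

def addr(view):
    return view.__array_interface__['data'][0]

def needs_copy(views):
    # DyNet-style continuity check, run for every batched operation: can the
    # operand views be used in place, or must they first be copied together?
    return any(addr(views[i]) != addr(views[i - 1]) + views[i - 1].nbytes
               for i in range(1, len(views)))

def gather_for_task(storage, row_indices):
    # Cavs-style gather: copy exactly the rows the batching task needs into a
    # contiguous buffer; no continuity checks are required.
    return storage[row_indices]

storage = np.zeros((8, 4), dtype=np.float32)
print(needs_copy([storage[2], storage[3], storage[4]]))   # False: adjacent rows
print(needs_copy([storage[0], storage[5], storage[2]]))   # True: memcpy needed
print(gather_for_task(storage, [0, 5, 2]).shape)          # (3, 4)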

Others. Beyond system performance, we also investigate whether Cavs, as an interface, simplifies user programs (though we do not claim this as a contribution). We compare Cavs to Fold and DyNet in terms of the lines of code (LoC) needed to create a few notable dynamic NNs, including Var-LSTM, Tree-LSTM, and a multi-layer sequence LSTM, with Python as the host language. Considering only model declaration, Fold in general requires 3.5x more LoC than Cavs, while DyNet requires slightly more LoC than Cavs because of the function that repeatedly declares per-sample graphs.
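As a flavor of the difference, consider the toy sketch below; the API is a hypothetical stand-in, not the real Cavs, Fold, or DyNet interface. In the vertex-centric style the user writes one vertex function once and the per-sample structure is supplied purely as data, whereas dynamic declaration requires a builder routine that re-declares the dataflow graph for every sample.

# Toy illustration of the vertex-centric declaration style (hypothetical API;
# the "cell" below is a trivial stand-in for a real Tree-LSTM cell).
def vertex_fn(child_values, word_value):
    # written once: how a vertex combines its gathered children and its input
    return word_value + sum(child_values)

def evaluate_graph(vertex_fn, children, words):
    # schedule the same vertex_fn over any input graph, supplied as data:
    # children[i] lists the children of vertex i; the last vertex is the root
    memo = {}
    def visit(i):
        if i not in memo:
            memo[i] = vertex_fn([visit(c) for c in children[i]], words[i])
        return memo[i]
    return visit(len(children) - 1)

# The per-sample part is only the data below; a dynamic-declaration program
# would instead call a graph-building routine on every sample, every epoch.
children = [[], [], [0, 1]]       # a three-vertex tree rooted at vertex 2
words = [1.0, 2.0, 0.5]
print(evaluate_graph(vertex_fn, children, words))   # 0.5 + 1.0 + 2.0 = 3.5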

6 Related Work

In addition to §2, we discuss other related work.

Graph execution optimization. Optimizing the execution of DL dataflow graphs proceeds mainly in two ways: better operator implementations, or optimizing the execution of (sub-)graphs. As Cavs is implemented as a plug-in that enhances existing frameworks, it benefits from any improved implementation of specific operators (e.g. cuDNN) grave2016efficient ; MKL-DNN ; chetlur2014cudnn ; eigenweb . In addition, Cavs has optimized implementations of its four proposed primitives (gather/scatter/pull/push). At the graph level, a variety of well-developed techniques from other areas, such as kernel fusion, common subexpression elimination, and constant folding, have been adapted to speed up the computation of DL dataflow graphs abadi2016tensorflow ; chen2015mxnet ; caffe2 ; xla . They are usually applied after graph declaration but before execution, so that the actual computation is conducted on an optimized graph rather than the original one. However, these graph optimizations are less beneficial under dynamic declaration, in which the graph changes with each sample and must be re-processed and re-optimized every iteration, which may cause substantial overhead. By contrast, Cavs separates the static vertex function from the dynamically varying input graph, so it benefits from most of the aforementioned optimizations, as shown in §5.3. We draw insights from these strategies and reflect them in Cavs' execution engine. We further propose lazy batching and streaming to exploit the additional parallelism exposed by our programming model.

Vertex-centric models. The vertex-centric programming model has been extensively developed in the area of graph computing malewicz2010pregel ; gonzalez2012powergraph ; chen2015powerlyra ; sundaram2015graphmat . Cavs draws insights from the GAS model gonzalez2012powergraph , but faces entirely different challenges in system and interface design, such as expressiveness, scheduling batched execution over different graphs, and guaranteeing memory continuity, as discussed in §3.1.

7 Conclusion

We present Cavs, a vertex-centric programming interface and an efficient system for dynamic deep learning. Cavs represents a dynamic NN structure as a static vertex function plus dynamic input graphs. It provides four novel APIs that allow users to easily program these types of NNs. With its scheduling policy, memory management strategy, and graph execution optimizations, Cavs avoids the substantial graph construction overhead suffered by dynamic declaration, and sets new state-of-the-art system performance for several notable dynamic NN architectures.

References