Parallelizable Stack Long Short-Term Memory

Shuoyang Ding, et al.
Johns Hopkins University

Stack Long Short-Term Memory (StackLSTM) is useful for various applications such as parsing and string-to-tree neural machine translation, but it is also notoriously difficult to parallelize for GPU training because its computations depend on discrete operations. In this paper, we tackle this problem by exploiting the state access patterns of StackLSTM to homogenize computations across different discrete operations. Our parsing experiments show that the method scales almost linearly with increasing batch size, and that our parallelized PyTorch implementation trains significantly faster than the DyNet C++ implementation.





1 Introduction

Tree-structured representations of language have been successfully applied to various applications including dependency parsing Dyer et al. (2015), sentiment analysis Socher et al. (2011), and neural machine translation Eriguchi et al. (2017). However, most of the neural network architectures used to build tree-structured representations cannot exploit the full parallelism of GPUs through minibatched training, as the computation performed for each instance is conditioned on the input/output structures and hence cannot be naïvely grouped together as a batch. This lack of parallelism is one of the major hurdles preventing wider practical adoption of these representations (e.g., in neural machine translation), as many natural language processing tasks currently require scaling up to very large training corpora in order to reach state-of-the-art performance.

We seek to advance the state of the art on this problem by proposing a parallelization scheme for one such network architecture, the Stack Long Short-Term Memory (StackLSTM) proposed by Dyer et al. (2015). This architecture has been successfully applied to dependency parsing Dyer et al. (2015, 2016); Ballesteros et al. (2017) and syntax-aware neural machine translation Eriguchi et al. (2017), but none of these results were produced with minibatched training. We show that our parallelization scheme is feasible in practice: it scales up near-linearly with increasing batch size while reproducing a set of results reported in Ballesteros et al. (2017).

2 StackLSTM

StackLSTM Dyer et al. (2015) is an LSTM architecture Hochreiter and Schmidhuber (1997) augmented with a stack that stores some of the hidden states built in the past. Unlike traditional LSTMs, which always build the current state from the previous state h_{t-1}, StackLSTM builds its states from the head of the state stack h_{p_t}, maintained by a stack top pointer p_t. At each time step t, StackLSTM takes a real-valued input vector x_t together with an additional discrete operation on the stack, which determines what computation needs to be conducted and how the stack top pointer should be updated. Throughout this section, we index input vectors (e.g., word embeddings) x_t by the time step at which they are fed into the network, and hidden states in the stack h_i by their position i in the stack, where i is the 0-based index starting from the stack bottom.

The set of input discrete actions typically contains at least Push and Pop operations. When these operations are taken as input, the corresponding computations on the StackLSTM are listed below. (To simplify the presentation, we omit the updates on cell states, because in practice the operations performed on cell states and hidden states are the same.)

Transition System | Transition Op | Stack Op       | Buffer Op | Composition Op
Arc-Standard      | Shift         | push           | pop       | none
                  | Left-Arc      | pop, pop, push | hold      | comp
                  | Right-Arc     | pop            | hold      | comp
Arc-Eager         | Shift         | push           | pop       | none
                  | Reduce        | pop            | hold      | none
                  | Left-Arc      | pop            | hold      | comp
                  | Right-Arc     | push           | pop       | comp
Arc-Hybrid        | Shift         | push           | pop       | none
                  | Left-Arc      | pop            | hold      | comp
                  | Right-Arc     | pop            | hold      | comp
Table 1: Correspondence between transition operations and stack/buffer operations for StackLSTM, where comp denotes the composition function proposed by Dyer et al. (2015). S0 and B0 refer to the token-level representations corresponding to the top elements of the stack and buffer, while S1 and B1 refer to those second to the top. We use a different notation here to avoid confusion with the states in the StackLSTM, which represent non-local information beyond the token level.
  • Push: read the previous hidden state h_{p_t}, perform the LSTM forward computation with x_t and h_{p_t}, write the new hidden state to h_{p_t + 1}, and update the stack top pointer with p_{t+1} = p_t + 1.

  • Pop: update the stack top pointer with p_{t+1} = p_t - 1.
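These two operations can be illustrated with a minimal, framework-free sketch. The toy `cell` below, which merely adds its arguments, is a hypothetical stand-in for the real LSTM forward computation:

```python
def stack_lstm_step(stack, p, x, op, lstm_cell):
    # One unbatched StackLSTM step over a preallocated stack of hidden states.
    # stack[0] is the initial (bottom) state; p is the stack top pointer.
    if op == "push":
        h = lstm_cell(x, stack[p])   # read h_p, compute the new state
        stack[p + 1] = h             # write it to position p + 1
        return p + 1                 # pointer moves up
    else:                            # "pop"
        return p - 1                 # pointer moves down; no computation

# Toy stand-in for the LSTM cell: new state = old state + input.
cell = lambda x, h: h + x
stack = [0] * 8
p = 0
p = stack_lstm_step(stack, p, 1, "push", cell)   # stack[1] = 1
p = stack_lstm_step(stack, p, 2, "push", cell)   # stack[2] = 3
p = stack_lstm_step(stack, p, 0, "pop", cell)    # top is stack[1] again
```

Note the branch on `op`: it is exactly this data-dependent control flow that blocks naïve minibatching.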

Reflecting on the discussion of parallelism above, one should notice that StackLSTM falls into the category of neural network architectures for which minibatched training is difficult: the computation performed by StackLSTM at each time step depends on the discrete input actions. The following section proposes a solution to this problem.

3 Parallelizable StackLSTM

Continuing the formulation of the previous section, we first present our proposed solution for the case where the set of discrete actions contains only Push and Pop operations; we then discuss its applicability to the transition systems that are used for building representations of dependency trees.

The first modification we make to the Push and Pop operations above is to unify their pointer updates as p_{t+1} = p_t + o_t, where o_t is the input discrete operation, taking the value +1 for Push and -1 for Pop. After this modification, we arrive at the following observations:

Observation 1

The computation performed for the Pop operation is a subset of that performed for the Push operation.

What remains in order to homogenize the Push and Pop operations is to conduct the extra computations needed for Push even when Pop is fed in, while guaranteeing the correctness of the resulting hidden state both at the current time step and in the future. The next observation points out a way to obtain this guarantee:

Observation 2

In a StackLSTM, given the current stack top pointer position p_t, any hidden state h_i with i > p_t will not be read until it is overwritten by a Push operation.

What follows from this observation is the guarantee that we can always safely overwrite hidden states indexed higher than the current stack top pointer, because any read of these states is known to happen after another overwrite. This allows us to perform the extra computation even when a Pop operation is fed in: the extra computation, in particular the update of h_{p_t + 1}, does not harm the validity of the hidden states at any time step.
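The homogenized step can be sketched as follows (the toy `cell` adding its arguments is again a hypothetical stand-in for the LSTM forward computation): the Push computation runs unconditionally, and only the pointer update depends on the discrete operation.

```python
def parallelizable_step(stack, p, x, o, lstm_cell):
    # Branch-free StackLSTM step: the Push computation runs unconditionally.
    # Per Observation 2, overwriting stack[p + 1] is always safe, because that
    # slot cannot be read again before another Push overwrites it.
    stack[p + 1] = lstm_cell(x, stack[p])
    return p + o                     # o = +1 for Push, -1 for Pop

cell = lambda x, h: h + x
stack = [0] * 8
p = parallelizable_step(stack, 0, 1, +1, cell)   # Push: stack[1] = 1
p = parallelizable_step(stack, p, 2, +1, cell)   # Push: stack[2] = 3
p = parallelizable_step(stack, p, 9, -1, cell)   # Pop: stack[3] gets a dead value
```

After the Pop, the stack top is stack[1] = 1, exactly as in the sequential version; the extra write to stack[3] is harmless because that slot is dead until the next Push overwrites it.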

Algorithm 1 gives the final forward computation for the Parallelizable StackLSTM. Note that this algorithm contains no if-statements that depend on the stack operation, and hence is homogeneous when grouped into batches consisting of multiple operation trajectories.

Input: input vector x_t, discrete stack operation o_t
Output: current top hidden state h_{p_{t+1}}
h_{p_t + 1} ← LSTM(x_t, h_{p_t});
p_{t+1} ← p_t + o_t;
return h_{p_{t+1}};
Algorithm 1: Forward Computation for Parallelizable StackLSTM

In the transition systems Nivre (2008); Kuhlmann et al. (2011) used in real tasks (e.g., transition-based parsing), as shown in Table 1, more operations than Push and Pop are needed for the StackLSTM. Fortunately, for the Arc-Eager and Arc-Hybrid transition systems, we can simply add a hold operation, denoted by the value 0 for the discrete operation input. For that reason, we focus on the parallelization of these two transition systems in this paper. Note that both observations discussed above remain valid after adding the hold operation.

4 Experiments

4.1 Setup

We implemented the architecture described above in PyTorch Paszke et al. (2017). We implemented the batched stack as a float tensor wrapped in a non-leaf variable, thus enabling in-place operations on that variable. At each time step, the batched stack is queried and updated with a batch of stack head positions represented by an integer vector, an operation made possible by the gather operation and advanced indexing. Due to this implementation choice, the stack size has to be determined at initialization time and cannot be grown dynamically. Nonetheless, a fixed stack size of 150 works for all the experiments we conducted.
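The batched update can be sketched with advanced indexing; the sketch below uses NumPy for brevity, which mirrors the analogous gather/indexing operations in the PyTorch implementation (names and shapes here are illustrative, not the exact implementation):

```python
import numpy as np

def batched_step(stack, ptr, new_h, ops):
    # stack: (batch, depth, hidden) preallocated, fixed-depth batched stack.
    # ptr:   (batch,) integer vector of stack top positions.
    # new_h: (batch, hidden) freshly computed hidden states.
    # ops:   (batch,) discrete operations in {-1, 0, +1}.
    rows = np.arange(stack.shape[0])
    stack[rows, ptr + 1] = new_h   # one homogeneous write for the whole batch
    ptr = ptr + ops                # per-instance pointer update
    return ptr, stack[rows, ptr]   # gather the new stack tops

stack = np.zeros((2, 4, 2))
ptr = np.array([1, 1])
new_h = np.array([[1.0, 1.0], [2.0, 2.0]])
ptr, top = batched_step(stack, ptr, new_h, np.array([+1, -1]))
```

Both instances execute the same write, yet instance 0 (Push) ends with its new state on top while instance 1 (Pop) ends with an older state on top, illustrating why the fixed-depth tensor must be preallocated.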

We use the dependency parsing task to evaluate the correctness and the scalability of our method. For comparison with previous work, we follow the architecture introduced in Dyer et al. (2015); Ballesteros et al. (2017) and choose the Arc-Hybrid transition system. We follow the data setup of Chen and Manning (2014); Dyer et al. (2015); Ballesteros et al. (2017), using the Stanford Dependency Treebank de Marneffe et al. (2006) and extracting the Arc-Hybrid static oracle with the code associated with Qi and Manning (2017). The part-of-speech (POS) tags are generated with the Stanford POS tagger Toutanova et al. (2003), with a test set accuracy of 97.47%. We use exactly the same pre-trained English word embeddings as Dyer et al. (2015).

We use Adam Kingma and Ba (2014) as the optimization algorithm. Following Goyal et al. (2017), we apply linear warmup to the learning rate over the first 5 epochs. The target learning rate is set by a base learning rate multiplied by the batch size, but capped at 0.02, because we find Adam to be unstable beyond that learning rate. After warmup, we halve the learning rate every time the loss on the development set fails to improve (ReduceLROnPlateau). We clip all gradient norms to 5.0 and apply ℓ2 regularization.
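A sketch of this warmup schedule follows; `init_lr` and `base_lr` are hypothetical placeholder constants, not the exact values used in the experiments, and post-warmup decay is left to ReduceLROnPlateau:

```python
def warmup_lr(epoch, init_lr, base_lr, batch_size, warmup_epochs=5, cap=0.02):
    # Linear warmup from init_lr toward min(base_lr * batch_size, cap)
    # over warmup_epochs epochs; after warmup the target rate is returned
    # (its halving on development-loss plateaus is handled elsewhere).
    target = min(base_lr * batch_size, cap)
    if epoch >= warmup_epochs:
        return target
    return init_lr + (epoch / warmup_epochs) * (target - init_lr)
```

At batch size 64 with a base rate of, say, 0.001, the uncapped target 0.064 exceeds the cap, so the schedule warms up to 0.02.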

We started with the hyper-parameter choices of Dyer et al. (2015) but made some modifications based on performance on the development set: we use hidden dimension 200 for all the LSTM units, 200 for the parser state representation before the final softmax layer, and embedding dimension 48 for the action embedding.

We use a Tesla K80 GPU for all experiments, in order to compare with Neubig et al. (2017b); Dyer et al. (2015). We also use the same hyper-parameter setting as Dyer et al. (2015) for the speed comparison experiments. All speeds are measured by running through one training epoch and averaging.

4.2 Results

Figure 1: Training speed at different batch sizes. Note that the axis is in log scale in order to show all the data points properly.
Batch size |      dev       |      test
1*         | 92.50*  89.79* | 92.10*  89.61*
8          | 92.93   90.42  | 92.54   90.11
16         | 92.62   90.19  | 92.53   90.13
32         | 92.43   89.89  | 92.31   89.94
64         | 92.53   90.04  | 92.22   89.73
128        | 92.39   89.73  | 92.55   90.02
256        | 92.15   89.46  | 91.99   89.43
Table 2: Dependency parsing results with various training batch sizes and without the composition function. The results marked with asterisks were reported in Ballesteros et al. (2017).

Figure 1 shows the training speed at batch sizes up to 256. (At batch size 512, the longest sentence in the training data cannot fit onto the GPU.) The speed-up of our model is close to linear, which means there is very little overhead associated with our batching scheme. Quantitatively, according to Amdahl's law Amdahl (1967), the proportion of parallelized computations is 99.92% at batch size 64. We also compared our implementation with the one that comes with Dyer et al. (2015), which is implemented in C++ with DyNet Neubig et al. (2017a). DyNet is known to be heavily optimized for CPU computation, and hence their implementation is reasonably fast even without batching and GPU acceleration, as shown in Figure 1. (Their speed was measured on one core of an Intel Xeon E7-4830 CPU.)
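The quoted proportion can be checked by inverting Amdahl's law, using the 60.8x speed-up measured at batch size 64 reported in Section 5:

```python
def amdahl_parallel_fraction(speedup, n):
    # Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to recover the
    # parallelized proportion p from a measured speed-up S on n-way parallelism.
    return (1 - 1 / speedup) / (1 - 1 / n)

# With the 60.8x speed-up measured at batch size 64:
p = amdahl_parallel_fraction(60.8, 64)   # ~0.9992, i.e. 99.92% parallelized
```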

However, we would like to point out that we focus on the speed-up we are able to obtain rather than on absolute speed: our batching scheme is framework-universal, and superior speed might be obtained by combining it with alternative frameworks or languages (for example, the torch C++ interface).

The dependency parsing results are shown in Table 2. Our implementation yields better test set performance than that reported in Ballesteros et al. (2017) for all batch size configurations except 256, where we observe a modest performance loss. Like Goyal et al. (2017); Keskar et al. (2016); Masters and Luschi (2018), we initially observed more significant test-time performance deterioration for models trained without learning rate warmup, and, concurring with the findings in Goyal et al. (2017), we find warmup very helpful for stabilizing large-batch training. We did not run experiments with batch sizes below 8, as they are too slow due to Python's inherent performance issues.

5 Related Work

DyNet has support for automatic minibatching Neubig et al. (2017b), which determines which computations can be batched by traversing the computation graph to find homogeneous operations. We cannot directly compare with that framework's automatic batching solution for StackLSTM, because DyNet automatic batching cannot handle graph structures that depend on runtime input values, which is the case for StackLSTM. We can, however, draw a loose comparison to the results reported in that paper for BiLSTM transition-based parsing Kiperwasser and Goldberg (2016). Comparing batch size 64 to batch size 1, they obtained a 3.64x speed-up on CPU and a 2.73x speed-up on a Tesla K80 GPU, while our architecture-specific manual batching scheme obtains a 60.8x speed-up. The main reason for this difference is that their graph-traversing automatic batching scheme carries a much larger overhead than our manual batching approach.

Another toolkit that supports automatic minibatching is Matchbox, which operates by analyzing the single-instance model definition and deterministically converting the operations into their minibatched counterparts. While this mechanism eliminates the need to traverse the whole computation graph, it cannot homogenize the operations in each branch of an if-statement. Instead, it performs each operation separately and applies masking to the results, whereas our method does not require any masking. Unfortunately, we are also not able to compare with this toolkit at the time of this work, as it lacks support for several operations we need.
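To make the contrast concrete, a masked update of the kind described above can be sketched as follows (hypothetical names; shown in NumPy): both branch results are computed for every instance and a mask then selects the relevant one, whereas our homogenized step performs a single computation per instance with no mask at all.

```python
import numpy as np

def masked_step(h_push, h_pop, is_push):
    # Masking approach: compute both branch results for every instance,
    # then select per instance with a 0/1 mask; the unused results are
    # wasted computation.
    mask = is_push[:, None].astype(h_push.dtype)
    return mask * h_push + (1.0 - mask) * h_pop

h_push = np.array([[1.0], [1.0]])
h_pop = np.array([[2.0], [2.0]])
out = masked_step(h_push, h_pop, np.array([1, 0]))
```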

Similar in spirit to our work, Bowman et al. (2016) attempted to parallelize StackLSTM using Thin-stack, a data structure that reduces the space complexity by storing all the intermediate stack top elements in a tensor and using a queue to control element access. Thanks to PyTorch, however, our implementation does not directly depend on the notion of a Thin-stack. Instead, when an element is popped from the stack, we simply shift the stack top pointer and potentially overwrite the corresponding sub-tensor later. In other words, we have no need to directly maintain all the intermediate stack top elements: in PyTorch, when an element in the stack is overwritten, its underlying sub-tensor is not destructed, as there are still nodes in the computation graph that point to it. Hence, when performing back-propagation, the gradient can still flow back to the elements previously popped from the stack and to their respective predecessors, and we also effectively store each intermediate stack top element only once. Besides, Bowman et al. (2016) did not attempt to eliminate the conditional branches in the StackLSTM algorithm, which is the main algorithmic contribution of this work.

6 Conclusion

We propose a parallelizable version of StackLSTM that is able to fully exploit GPU parallelism through minibatched training. Empirical results show that our parallelization scheme yields performance comparable to previous work, and that our method scales almost linearly with increasing batch size.

Because our parallelization scheme is based on the observations made in Section 3, we cannot yet efficiently batch either the Arc-Standard transition system or the token-level composition function proposed in Dyer et al. (2015). We leave the parallelization of these architectures to future work.

Our parallelization scheme makes it feasible to run large-data experiments for various tasks that require large training data to perform well, such as RNNG-based syntax-aware neural machine translation Eriguchi et al. (2017).


Acknowledgments

The authors would like to thank Peng Qi for helping with data preprocessing and James Bradbury for helpful technical discussions. This material is based upon work supported in part by the DARPA LORELEI and IARPA MATERIAL programs.