An Empirical Exploration of Skip Connections for Sequential Tagging

10/11/2016 ∙ by Huijia Wu, et al.

In this paper, we empirically explore the effects of various kinds of skip connections in stacked bidirectional LSTMs for sequential tagging. We investigate three kinds of skip connections connecting to LSTM cells: (a) skip connections to the gates, (b) skip connections to the internal states and (c) skip connections to the cell outputs. We present comprehensive experiments showing that skip connections to cell outputs outperform the remaining two. Furthermore, we observe that using gated identity functions as skip mappings works well. Based on these novel skip connections, we successfully train deep stacked bidirectional LSTM models and obtain state-of-the-art results on CCG supertagging and comparable results on POS tagging.



1 Introduction

This work is licensed under a Creative Commons Attribution 4.0 International License. License details:

In natural language processing, sequential tagging mainly refers to the task of assigning a discrete label to each token in a sequence. Typical examples include part-of-speech (POS) tagging and Combinatory Categorial Grammar (CCG) supertagging. A regular feature of sequential tagging is that the input tokens in a sequence cannot be assumed to be independent, since the same token can be assigned different tags in different contexts. Therefore, the classifier needs a memory of the context to make a correct prediction.

Bidirectional LSTMs [Graves and Schmidhuber2005] have become dominant in sequential tagging problems due to their superior performance [Wang et al.2015, Vaswani et al.2016, Lample et al.2016]. The horizontal hierarchy of LSTMs with bidirectional processing can remember long-range dependencies without affecting the short-term storage. Although these models have a deep horizontal hierarchy (the depth is the same as the sequence length), the vertical hierarchy is often shallow, which may not be efficient at representing each token. Stacked LSTMs are deep in both directions, but become harder to train due to the feed-forward structure of the stacked layers.

Skip connections (or shortcut connections) enable unimpeded information flow by adding direct connections across different layers [Raiko et al.2012, Graves2013, Hermans and Schrauwen2013]. However, there has been little exploration or analysis of the various kinds of skip connections in stacked LSTMs. There are two issues in handling skip connections in stacked LSTMs: one is where to add the skip connections, the other is what kind of skip connections should be used to pass the information. To answer the first question, we empirically analyze three positions in LSTM blocks that can receive the previous layer’s output. For the second one, we present an identity mapping to receive the previous layer’s outputs. Furthermore, following the gate design of LSTM [Hochreiter and Schmidhuber1997, Gers et al.2000] and highway networks [Srivastava et al.2015a, Srivastava et al.2015b], we observe that adding a multiplicative gate to the identity function helps to improve performance.

In this paper, we present a neural architecture for sequential tagging. The inputs of the network are token representations. We concatenate word embeddings with character embeddings to represent the words and morphemes. A deep stacked bidirectional LSTM with well-designed skip connections is then used to extract the features needed for classification from the inputs. The output layer uses a softmax function to output the tag distribution for each token.

Our main contribution is an empirical evaluation of the effects of various kinds of skip connections within stacked LSTMs. We present comprehensive experiments on the supertagging task showing that skip connections to the cell outputs, using an identity function multiplied with an exclusive gate, help to improve the network performance. Our model is evaluated on two sequential tagging tasks, obtaining state-of-the-art results on CCG supertagging and comparable results on POS tagging.

2 Related Work

Skip connections have been widely used for training deep neural networks. For recurrent neural networks, Schmidhuber (1992) and El Hihi and Bengio (1995) introduced deep RNNs by stacking hidden layers on top of each other. Raiko et al. (2012), Graves (2013), and Hermans and Schrauwen (2013) proposed the use of skip connections in stacked RNNs. However, researchers have paid less attention to analyzing the various kinds of skip connections, which is our focus in this paper.

The works most closely related to ours are Srivastava et al. (2015), He et al. (2015), Kalchbrenner et al. (2015), Yao et al. (2015), Zhang et al. (2016), and Zilly et al. (2016). These works are all based on the design of extra connections between different layers. Srivastava et al. (2015) and He et al. (2015) mainly focus on feed-forward neural networks, using well-designed skip connections across different layers to make the information pass more easily. The Grid LSTM proposed by Kalchbrenner et al. (2015) extends one-dimensional LSTMs to multi-dimensional LSTMs, which provides a more general framework to construct deep LSTMs.

Yao et al. (2015) and Zhang et al. (2016) propose highway LSTMs by introducing gated direct connections between internal states in adjacent layers and do not use skip connections, while we propose gated skip connections across cell outputs. Zilly et al. (2016) introduce recurrent highway networks (RHN), which use a single recurrent layer to make the RNN deep in the vertical direction, in contrast to our stacked models.

3 Recurrent Neural Networks for Sequential Tagging

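Consider a recurrent neural network applied to sequential tagging: given a sequence $x = (x_1, \dots, x_T)$, the RNN computes the hidden state $h = (h_1, \dots, h_T)$ and the output $y = (y_1, \dots, y_T)$ by iterating the following equations:

$$h_t = f(x_t, h_{t-1}; \theta_h) \tag{1}$$

$$y_t = g(h_t; \theta_o) \tag{2}$$

where $t \in \{1, \dots, T\}$ represents the time, $x_t$ represents the input at time $t$, and $h_{t-1}$ and $h_t$ are the previous and the current hidden state, respectively. $f$ and $g$ are the transition function and the output function, respectively. $\theta_h$ and $\theta_o$ are network parameters.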

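We use a negative log-likelihood cost to evaluate the performance, which can be written as:

$$J(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log y^{n}_{t^n} \tag{3}$$

where $t^n$ is the true target for sample $n$, and $y^{n}_{t^n}$ is the $t^n$-th output in the softmax layer given the inputs $x^n$.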
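The negative log-likelihood cost can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and averaging over the $N$ samples is an assumption about the normalization.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def negative_log_likelihood(logits, targets):
    """Mean negative log-probability assigned to the true tag.

    logits:  (N, K) pre-softmax scores, one row per sample.
    targets: (N,) integer index of the true tag for each sample.
    Averaging over N is an assumption made for this sketch.
    """
    probs = softmax(logits)
    true_tag_probs = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(true_tag_probs))

# With all-zero scores the softmax is uniform over the 4 tags,
# so every true tag receives probability 1/4.
logits = np.zeros((2, 4))
targets = np.array([1, 3])
loss = negative_log_likelihood(logits, targets)
```

In this toy case the cost equals $\log 4$, the cost of a classifier that has learned nothing about the tags.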

The core idea of Long Short-Term Memory networks is to replace (1) with the following equation:

$$c_t = f(x_t, h_{t-1}) + c_{t-1} \tag{4}$$

where $c_t$ is the internal state of the memory cell, which is designed to store information for a much longer time. Besides this, LSTM uses gates to avoid weight update conflicts.

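Standard LSTMs process sequences in temporal order, which ignores future context. Bidirectional LSTMs solve this problem by combining both the forward and the backward processing of the input sequences using two separate recurrent hidden layers:

$$\overrightarrow{h}_t = \mathrm{LSTM}(\overrightarrow{x}_t, \overrightarrow{h}_{t-1}, \overrightarrow{c}_{t-1}) \tag{5}$$

$$\overleftarrow{h}_t = \mathrm{LSTM}(\overleftarrow{x}_t, \overleftarrow{h}_{t-1}, \overleftarrow{c}_{t-1}) \tag{6}$$

$$y_t = g(\overrightarrow{h}_t, \overleftarrow{h}_t) \tag{7}$$

where $\mathrm{LSTM}(\cdot)$ is the LSTM computation. $\overrightarrow{x}_t$ and $\overleftarrow{x}_t$ are the forward and the backward input sequence, respectively. The outputs of the two hidden layers $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ in a bidirectional LSTM are connected to the output layer.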
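Bidirectional processing can be sketched as two independent recurrent passes, one over the sequence and one over its reversal. The sketch below is an illustration under stated assumptions: `run_rnn` and `bidirectional` are hypothetical helper names, the step function is a stand-in for the LSTM computation, and joining the two directions by concatenation is an assumption about how they are passed on.

```python
import numpy as np

def run_rnn(step, xs, h0):
    """Iterate h_t = step(x_t, h_{t-1}) and collect every hidden state."""
    h, states = h0, []
    for x in xs:
        h = step(x, h)
        states.append(h)
    return states

def bidirectional(step_fw, step_bw, xs, h0):
    """Run one pass left-to-right and one right-to-left, then align them.

    The backward pass consumes the reversed sequence; its states are
    reversed again so that position t holds the summary of x_t..x_T.
    Concatenating the two directions is an assumption of this sketch.
    """
    forward = run_rnn(step_fw, xs, h0)
    backward = run_rnn(step_bw, xs[::-1], h0)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]

# Toy step function (a running sum), used only to make the data flow visible.
running_sum = lambda x, h: h + x
xs = [np.array([1.0]), np.array([2.0]), np.array([3.0])]
states = bidirectional(running_sum, running_sum, xs, np.zeros(1))
```

With the running-sum step, the first component of each state is the sum of the tokens seen so far from the left and the second is the sum seen so far from the right, which shows that every position has access to both past and future context.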

A stacked RNN is one type of deep RNN, in which the hidden layers are stacked on top of each other, each feeding up to the layer above:

$$h_t^l = f^l(h_t^{l-1}, h_{t-1}^l) \tag{8}$$

where $h_t^l$ is the $t$-th hidden state of the $l$-th layer.

4 Various kinds of Skip Connections

Skip connections in simple RNNs are trivial since there is only one position to connect to the hidden units. For stacked LSTMs, however, the skip connections need to be treated carefully to train the network successfully. In this section, we analyze and compare various types of skip connections. First, we give a detailed definition of stacked LSTMs, which helps us to describe skip connections. Then we construct skip connections in stacked LSTMs. Finally, we formulate the various kinds of skip connections.

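Stacked LSTMs without skip connections can be defined as:

$$\begin{aligned}
\begin{pmatrix} i_t^l \\ f_t^l \\ o_t^l \\ s_t^l \end{pmatrix} &= \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \\
c_t^l &= f_t^l \odot c_{t-1}^l + i_t^l \odot s_t^l \\
h_t^l &= o_t^l \odot \tanh(c_t^l)
\end{aligned} \tag{9}$$

During the forward pass, the LSTM needs to calculate $c_t^l$ and $h_t^l$, which are the cell’s internal state and the cell’s output state, respectively. To get $c_t^l$, $s_t^l$ needs to be computed to store the current input. This result is then multiplied by the input gate $i_t^l$, which decides when to keep or override information in the memory cell $c_t^l$. The cell is designed to store the previous information $c_{t-1}^l$, which can be reset by a forget gate $f_t^l$. The new cell state is then obtained by adding the result to the current input. The cell outputs $h_t^l$ are computed by multiplying the activated cell state by the output gate $o_t^l$, which learns when to access the memory cell and when to block it. “sigm” and “tanh” are the sigmoid and tanh activation functions, respectively. $W^l$ is the weight matrix that needs to be learned.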
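The forward pass described above can be sketched for a single layer and a single time step as follows. This is a minimal NumPy illustration, not the authors' Theano code: the function name, the argument names, and the row ordering of the stacked weight matrix (input gate, forget gate, output gate, candidate input) are assumptions, and bias terms are omitted.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_below, h_prev, c_prev, W):
    """One LSTM step for one layer of a stacked LSTM.

    h_below: (n_in,) output of the layer below at the current time step
    h_prev:  (n,)    this layer's output at the previous time step
    c_prev:  (n,)    this layer's internal state at the previous time step
    W:       (4n, n_in + n) stacked weights; the row order is an assumption
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_below, h_prev])
    i = sigm(z[0:n])            # input gate: how much new input to write
    f = sigm(z[n:2 * n])        # forget gate: how much old state to keep
    o = sigm(z[2 * n:3 * n])    # output gate: how much of the cell to expose
    s = np.tanh(z[3 * n:4 * n])  # candidate input
    c = f * c_prev + i * s      # new internal state
    h = o * np.tanh(c)          # new cell output
    return h, c

# With all-zero weights every gate equals 0.5 and the candidate input is 0,
# so the new internal state is exactly half of the previous one.
n = 2
h, c = lstm_step(np.zeros(n), np.zeros(n), np.array([1.0, -2.0]), np.zeros((4 * n, 2 * n)))
```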

The hidden units in stacked LSTMs have two forms. One is the hidden units in the same layer, $\{h_t^l\}_{t=1}^{T}$, which are connected through an LSTM. The other is the hidden units at the same time step, $\{h_t^l\}_{l=1}^{L}$, which are connected through a feed-forward network. An LSTM can keep the short-term memory for a long time, so the error signals can be easily passed through $\{1, \dots, T\}$. However, when the number of stacked layers is large, the feed-forward network suffers from the vanishing/exploding gradient problem, which makes the gradients hard to pass through $\{1, \dots, L\}$.

The core idea of LSTM is to use an identity function to build the constant error carousel. He et al. (2015) also use an identity mapping to train a very deep convolutional neural network with improved performance. All these inspired us to use an identity function for the skip connections. Moreover, the gates of LSTM are essential for avoiding weight update conflicts, which are also invoked by skip connections. Following highway gating, we use a gate multiplied with the identity mapping to avoid the conflicts.

Skip connections are cross-layer connections, which means that the output of layer $l-2$ is not only connected to layer $l-1$, but also connected to layer $l$. For stacked LSTMs, $h_t^{l-2}$ can be connected to the gates, the internal states, and the cell outputs in layer $l$’s LSTM blocks. We formalize these below:

Skip connections to the gates.

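We can connect $h_t^{l-2}$ to the gates through an identity mapping:

$$\begin{pmatrix} i_t^l \\ f_t^l \\ o_t^l \\ s_t^l \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} \begin{pmatrix} W^l & I^l \end{pmatrix} \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \\ h_t^{l-2} \end{pmatrix} \tag{10}$$

where $I^l$ is the identity mapping.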

Skip connections to the internal states.

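Another kind of skip connection is to connect $h_t^{l-2}$ to the cell’s internal state $c_t^l$:

$$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot s_t^l + h_t^{l-2} \tag{11}$$

$$h_t^l = o_t^l \odot \tanh(c_t^l) \tag{12}$$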

Skip connections to the cell outputs.

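We can also connect $h_t^{l-2}$ to the cell outputs:

$$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot s_t^l \tag{13}$$

$$h_t^l = o_t^l \odot \tanh(c_t^l) + h_t^{l-2} \tag{14}$$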

Skip connections using gates.

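Consider the case of skip connections to the cell outputs. The cell outputs grow linearly with the network depth, which makes the derivatives vanish and the network hard to converge. Inspired by the introduction of LSTM gates, we add a gate to control the skip connections through retrieving or blocking them:

$$\begin{aligned}
\begin{pmatrix} i_t^l \\ f_t^l \\ o_t^l \\ g_t^l \\ s_t^l \end{pmatrix} &= \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \\
c_t^l &= f_t^l \odot c_{t-1}^l + i_t^l \odot s_t^l \\
h_t^l &= o_t^l \odot \tanh(c_t^l) + g_t^l \odot h_t^{l-2}
\end{aligned} \tag{15}$$

where $g_t^l$ is the gate which can be used to access the skipped output $h_t^{l-2}$ or block it. When $g_t^l$ equals 0, no skipped output can be passed through the skip connections, which is equivalent to traditional stacked LSTMs. Otherwise, it behaves like a feed-forward LSTM using gated identity connections. Here we omit the case of adding gates to skip connections to the internal state, which is similar to the above case.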
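The skip connections to the internal state, to the cell output, and to the cell output through a gate differ only in where the skipped output is added. The sketch below puts them side by side for one layer and one time step. It is an illustration under stated assumptions, not the authors' code: the function name, the `mode` strings, the separate gate weight matrix `W_g`, and the row ordering of `W` are all hypothetical, bias terms are omitted, and the variant that connects to the gates is left out.

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step_with_skip(h_below, h_prev, c_prev, h_skip, W, W_g=None, mode="gated_output"):
    """One LSTM step of layer l that also receives h_skip, the output of layer l-2.

    mode: "none", "internal", "output" or "gated_output" (hypothetical names).
    W:    (4n, n_in + n) stacked weights for input, forget, output gate and candidate.
    W_g:  (n, n_in + n) weights of the skip gate, only used by "gated_output".
    h_skip must have the same size n as the layer output.
    """
    if mode not in ("none", "internal", "output", "gated_output"):
        raise ValueError(f"unknown mode: {mode}")
    n = h_prev.shape[0]
    x = np.concatenate([h_below, h_prev])
    z = W @ x
    i, f, o = sigm(z[:n]), sigm(z[n:2 * n]), sigm(z[2 * n:3 * n])
    s = np.tanh(z[3 * n:])
    c = f * c_prev + i * s
    if mode == "internal":
        c = c + h_skip              # skip connection to the internal state
    h = o * np.tanh(c)
    if mode == "output":
        h = h + h_skip              # identity skip connection to the cell output
    elif mode == "gated_output":
        g = sigm(W_g @ x)           # gate that retrieves or blocks the skipped output
        h = h + g * h_skip
    return h, c

# Toy check with zero weights: all gates are 0.5 and the candidate input is 0.
n = 2
zeros, ones = np.zeros(n), np.ones(n)
W, W_g = np.zeros((4 * n, 2 * n)), np.zeros((n, 2 * n))
outputs = {m: lstm_step_with_skip(zeros, zeros, zeros, ones, W, W_g, mode=m)[0]
           for m in ("none", "internal", "output", "gated_output")}
```

In the toy check the ungated output connection passes the skipped vector through unchanged, while the gated variant scales it by the gate value, which is the mechanism that keeps the outputs from growing with depth.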

Skip connections in bidirectional LSTM.

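Using skip connections in a bidirectional LSTM is similar to using them in a unidirectional LSTM, with bidirectional processing:

$$\begin{aligned}
\overrightarrow{c}_t^l &= \overrightarrow{f}_t^l \odot \overrightarrow{c}_{t-1}^l + \overrightarrow{i}_t^l \odot \overrightarrow{s}_t^l \\
\overrightarrow{h}_t^l &= \overrightarrow{o}_t^l \odot \tanh(\overrightarrow{c}_t^l) + \overrightarrow{g}_t^l \odot \overrightarrow{h}_t^{l-2} \\
\overleftarrow{c}_t^l &= \overleftarrow{f}_t^l \odot \overleftarrow{c}_{t-1}^l + \overleftarrow{i}_t^l \odot \overleftarrow{s}_t^l \\
\overleftarrow{h}_t^l &= \overleftarrow{o}_t^l \odot \tanh(\overleftarrow{c}_t^l) + \overleftarrow{g}_t^l \odot \overleftarrow{h}_t^{l-2}
\end{aligned} \tag{16}$$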


5 Neural Architecture for Sequential Tagging

Sequential tagging can be formulated as $P(t \mid w; \theta)$, where $w = [w_1, \dots, w_T]$ indicates the $T$ words in a sentence, and $t = [t_1, \dots, t_T]$ indicates the corresponding $T$ tags. In this section we introduce a neural architecture for $P(\cdot)$, which includes an input layer, stacked hidden layers and an output layer. Since the stacked hidden layers have already been introduced, we only introduce the input and the output layers here.

5.1 Network Inputs

Network inputs are the representation of each token in a sequence. There are many kinds of token representations, such as using a single word embedding, using a local window approach, or a combination of word and character-level representation. Here our inputs contain the concatenation of word representations, character representations, and capitalization representations.

Word representations.

All words in the vocabulary share a common look-up table, which is initialized randomly or with pre-trained embeddings. Each word in a sentence can be mapped to an embedding vector $w_t$. The whole sentence is then represented by a matrix with column vectors $[w_1, w_2, \dots, w_T]$. We use a context window of size $d$ surrounding a word $w_t$ to get its context information. Following Wu et al. (2016), we add logistic gates to each token in the context window. The word representation is computed as $w_t = [r_{t-\lfloor d/2 \rfloor} w_{t-\lfloor d/2 \rfloor}; \dots; r_{t+\lfloor d/2 \rfloor} w_{t+\lfloor d/2 \rfloor}]$, where $r_t := [r_{t-\lfloor d/2 \rfloor}, \dots, r_{t+\lfloor d/2 \rfloor}]$ is a logistic gate to filter the unnecessary contexts, and $w_{t-\lfloor d/2 \rfloor}, \dots, w_{t+\lfloor d/2 \rfloor}$ are the word embeddings in the local window.

Character representations.

Prefix and suffix information about words are important features in sequential tagging. Inspired by Fonseca et al. (2015), who use character prefixes and suffixes of length 1 to 5 for part-of-speech tagging, we concatenate character embeddings in a word to get the character-level representation. Concretely, consider a word $w$ consisting of a sequence of characters $[c_1, c_2, \dots, c_{l_w}]$, where $l_w$ is the length of the word, and let $L(\cdot)$ be the look-up table for characters. We concatenate the leftmost 5 character embeddings $L(c_1), \dots, L(c_5)$ with the rightmost 5 character embeddings $L(c_{l_w-4}), \dots, L(c_{l_w})$. When a word has fewer than five characters, we pad the remaining characters with the same special symbol.
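Selecting the characters whose embeddings are concatenated can be sketched as follows. This is an illustration only: the function name, the `<pad>` symbol, and the placement of the padding (to the right of a short prefix and to the left of a short suffix) are assumptions, since the text only states that one special symbol is used.

```python
PAD = "<pad>"  # hypothetical padding symbol

def affix_characters(word, k=5, pad=PAD):
    """Return the k leftmost and k rightmost characters of a word.

    Words shorter than k characters are padded with one special symbol.
    Where the padding goes is an assumption of this sketch.
    """
    chars = list(word)
    missing = max(0, k - len(chars))
    prefix = chars[:k] + [pad] * missing
    suffix = [pad] * missing + chars[-k:]
    return prefix + suffix

long_word = affix_characters("tagging")
short_word = affix_characters("cat")
```

Every word is thereby mapped to exactly 2k characters, so looking each one up in the character table and concatenating the embeddings yields a fixed-size vector regardless of word length.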

Capitalization representations.

We lowercase the words to decrease the size of the word vocabulary and reduce sparsity, but we need extra capitalization embeddings to store the capitalization features, which represent whether or not a word is capitalized.

5.2 Network Outputs

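For sequential tagging, we use a softmax activation function in the output layer:

$$y_t = \mathrm{softmax}(z_t), \qquad z_t = W^{hy} [\overrightarrow{h}_t; \overleftarrow{h}_t]$$

where $y_t$ is a probability distribution over all possible tags. $y_{t,k} = \frac{\exp(z_{t,k})}{\sum_{k'} \exp(z_{t,k'})}$ is the $k$-th dimension of $y_t$, which corresponds to the $k$-th tag in the tag set. $W^{hy}$ is the hidden-to-output weight.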

6 Experiments

6.1 Combinatory Categorial Grammar Supertagging

Combinatory Categorial Grammar (CCG) supertagging is a sequential tagging problem in natural language processing. The task is to assign a supertag to each word in a sentence. In CCG the supertags stand for the lexical categories, which are composed of basic categories such as $N$, $NP$ and $PP$, and complex categories, which are combinations of the basic categories based on a set of rules. For detailed explanations of CCG, see [Steedman2000, Steedman and Baldridge2011].

The training set of this task contains only 39604 sentences, which is too small to train a deep model and may cause over-parametrization. We nevertheless choose it because many authors have already shown that a bidirectional recurrent net fits the task [Lewis et al.2016, Vaswani et al.2016].

6.1.1 Dataset and Pre-processing

Our experiments are performed on CCGBank [Hockenmaier and Steedman2007], which is a translation from the Penn Treebank [Marcus et al.1993] to CCG with a coverage of 99.4%. We follow the standard splits, using sections 02-21 for training, section 00 for development and section 23 for testing. We use a full category set containing 1285 tags. All digits are mapped to the same digit ‘9’, and all words are lowercased.
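The token normalization described above can be sketched as follows. The function name is hypothetical, and treating a word as capitalized when its first character is uppercase is an assumption; the flag is returned so that the capitalization information lost by lowercasing can still be fed to the network.

```python
import re

def normalize_token(token):
    """Map every digit to '9' and lowercase the token.

    Returns the normalized token together with a capitalization flag.
    The flag definition (first character is uppercase) is an assumption.
    """
    is_capitalized = token[:1].isupper()
    normalized = re.sub(r"\d", "9", token).lower()
    return normalized, is_capitalized

examples = [normalize_token(t) for t in ["Boeing-747", "1987", "the"]]
```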

6.1.2 Network Configuration


There are two types of weights in our experiments: recurrent and non-recurrent weights. For non-recurrent weights, we initialize word embeddings with the pre-trained 200-dimensional GloVe vectors [Pennington et al.2014]. Other weights are initialized with the Gaussian distribution $\mathcal{N}(0, \frac{1}{\sqrt{\text{fan-in}}})$ scaled by a factor of 0.1, where fan-in is the number of units in the input layer. For recurrent weight matrices, following Saxe et al. (2013), we initialize with random orthogonal matrices through SVD to avoid unstable gradients. Orthogonal initialization for recurrent weights is important in our experiments, and gives a relative performance improvement over other methods such as Xavier initialization [Glorot and Bengio2010].
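The two initialization schemes can be sketched in NumPy as follows. This is a minimal illustration with hypothetical function names. It assumes that the Gaussian has zero mean and a standard deviation of 1/sqrt(fan-in) before the 0.1 scaling, and it obtains the orthogonal matrix as the left singular vectors of a random Gaussian matrix, which is one common way to realize initialization through SVD.

```python
import numpy as np

def orthogonal_init(n, rng):
    """Random n x n orthogonal matrix taken from the SVD of a Gaussian matrix."""
    a = rng.standard_normal((n, n))
    u, _, _ = np.linalg.svd(a)
    return u

def scaled_gaussian_init(fan_out, fan_in, rng, scale=0.1):
    """Gaussian weights with std 1/sqrt(fan_in), multiplied by `scale`.

    Interpreting 1/sqrt(fan_in) as a standard deviation is an assumption.
    """
    return scale * rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

rng = np.random.default_rng(0)
W_rec = orthogonal_init(8, rng)              # recurrent weights
W_in = scaled_gaussian_init(8, 16, rng)      # non-recurrent weights
```

An orthogonal matrix preserves the norm of any vector it multiplies, so repeated multiplication by the recurrent matrix neither shrinks nor amplifies the signal at initialization.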


For the word representations, we use a small window size of 3 for the convolutional layer. The dimension of the word representation after the convolutional operation is 600. The sizes of the character embeddings and the capitalization embeddings are both set to 5. The number of cells of the stacked bidirectional LSTM is set to 512. We also tried 400 cells or 600 cells and found that this number had little impact on performance. All stacked hidden layers have the same number of cells. The output layer has 1286 neurons, which equals the number of tags in the training set plus a rare symbol.


We train the networks using the back-propagation algorithm with stochastic gradient descent (SGD), using an equal learning rate of 0.02 for all layers. We also tried other optimization methods, such as momentum [Plaut and others1986], Adadelta [Zeiler2012], or Adam [Kingma and Ba2014], but none of them performs as well as SGD. Gradient clipping is not used. We use on-line learning in our experiments, which means the parameters are updated after every training sequence, one at a time. We trained the 7-layer network for roughly 2 to 3 days on one NVIDIA TITAN X GPU using Theano [Team et al.2016].
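On-line learning with plain SGD can be sketched as the loop below. The function names and the dictionary-based parameter layout are hypothetical, and the toy gradient only stands in for back-propagation through the network; the point is that one update is applied per training sequence, with the same learning rate for every parameter and no momentum or clipping.

```python
def sgd_online(params, grad_fn, sequences, learning_rate=0.02):
    """Apply one plain SGD update after every training sequence."""
    for sequence in sequences:
        grads = grad_fn(params, sequence)
        for name, grad in grads.items():
            params[name] = params[name] - learning_rate * grad
    return params

# Toy stand-in for back-propagation: the gradient of (w - target)^2.
def toy_grad(params, target):
    return {"w": 2.0 * (params["w"] - target)}

params = sgd_online({"w": 0.0}, toy_grad, [1.0] * 500)
```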


Dropout [Srivastava et al.2014] is the only regularizer in our model to avoid overfitting. Other regularization methods such as weight decay and batch normalization do not work in our experiments. We add a binary dropout mask to the local context windows on the embedding layer, with a drop rate $p$ of 0.25. We also apply dropout to the output of the first hidden layer and the last hidden layer, with a drop rate of 0.5. At test time, weights are scaled with a factor $1-p$.
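The dropout scheme can be sketched as follows, assuming the standard formulation in which a unit is dropped with probability p during training and the weights that consume the dropped units are multiplied by the keep probability 1 - p at test time. The function names are hypothetical.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Zero each unit independently with probability p (binary mask)."""
    mask = rng.random(x.shape) >= p
    return x * mask

def scale_weights_for_test(W, p):
    """Scale weights by the keep probability 1 - p for use at test time."""
    return (1.0 - p) * W

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0, 4.0])
W = np.arange(8, dtype=float).reshape(2, 4)

kept_all = dropout_train(x, 0.0, rng)     # p = 0 keeps every unit
dropped_all = dropout_train(x, 1.0, rng)  # p = 1 drops every unit
W_test = scale_weights_for_test(W, 0.5)
```

Scaling the weights by 1 - p makes the test-time pre-activation `W_test @ x` equal to the expected training-time pre-activation, since each unit survives the mask with probability 1 - p.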

6.1.3 Results

Table 1 shows the comparisons with other models for supertagging. The comparisons do not include any externally labeled data or POS labels. We use stacked bidirectional LSTMs with gated skip connections for the comparisons, and select the model with the highest 1-best supertagging accuracy on the development set for final testing. Our model achieves state-of-the-art results compared to the existing systems. The character-level information (+ 3% relative accuracy) and dropout (+ 8% relative accuracy) are necessary to improve the performance.

Model                                          Dev     Test
Clark and Curran (2007)                        91.5    92.0
Lewis et al. (2014)                            91.3    91.6
Lewis et al. (2016)                            94.1    94.3
Xu et al. (2015)                               93.1    93.0
Xu et al. (2016)                               93.49   93.52
Vaswani et al. (2016)                          94.24   94.5
7-layers + skip output + gating                94.51   94.67
7-layers + skip output + gating (no char)      94.33   94.45
7-layers + skip output + gating (no dropout)   94.06   94.0
9-layers + skip output + gating                94.55   94.69

Table 1: 1-best supertagging accuracy on CCGbank. “skip output” refers to the skip connections to the cell output, “gating” refers to adding a gate to the identity function, “no char” refers to the models that do not use the character-level information, “no dropout” refers to models that do not use dropout.

6.1.4 Experiments on Skip Connections

We experiment with a 7-layer model on CCGbank to compare different kinds of skip connections introduced in Section 4. Our analysis mainly focuses on the identity function and the gating mechanism. The comparisons (Table 2) are summarized as follows:

No skip connections.

When the number of stacked layers is large, the performance degrades without skip connections. The accuracy of a 7-layer stacked model without skip connections is 93.94% (Table 2), which is lower than that of the best models using skip connections.

Various kinds of skip connections.

We experiment with the gated identity connections between internal states introduced in Zhang et al. (2016), but the network performs poorly (Table 2, 93.14%). We also implement the method proposed in Zilly et al. (2016), for which we use a single bidirectional RHN layer with a recurrent depth of 3 and a slight modification. (Our original implementation of Zilly et al. (2016) with a recurrent depth of 3 fails to converge, possibly because the recurrent state explodes under addition; to avoid this, we use output gates in the last recurrent step.) Skip connections to the cell outputs with the identity function and multiplicative gating achieve the highest accuracy (Table 2, 94.51%) on the development set. We also observe that skip connections to the internal states without a gate perform slightly better (Table 2, 94.33%) than those with a gate (94.24%) on the development set. We recommend setting the forget bias to 0 to get a better development accuracy.

Identity mapping.

We apply the sigmoid function to the previous outputs to break the identity link, replacing $h_t^{l-2}$ in Eq. (15) with $\sigma(h_t^{l-2})$, where $\sigma(x) = \frac{1}{1+e^{-x}}$. The result of the sigmoid function is 94.02% (Table 2), which is poorer than that of the identity function. We can infer that the identity function is more suitable than other scaled functions such as sigmoid or tanh for transmitting information.

Exclusive gating.

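Following the gating mechanism adopted in highway networks, we consider adding a gate to control the skip connections flexibly. Our gating function is $g_t^l = \mathrm{sigm}\left(W_g^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}\right)$, where $W_g^l$ denotes the weights of the gate. Gated identity connections are essential to achieving state-of-the-art results on CCGbank.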

Case                          Variant                                 Dev     Test
H-LSTM, Zhang et al. (2016)   -                                       93.14   93.52
RHN, Zilly et al. (2016)      recurrent depth 3, with output gates    94.28   94.24
no skip connections           -                                       93.94   94.26
to the gates, Eq. (10)        -                                       93.9    94.22
to the internals              no gate, Eq. (11)                       94.33   94.63
                              with gate                               94.24   94.52
to the outputs                no gate, Eq. (14)                       93.89   93.98
                              with gate, $b_f \neq 0$, Eq. (15)       94.23   94.81
                              with gate, $b_f = 0$, Eq. (15)          94.51   94.67
                              sigmoid mapping                         94.02   94.18

Table 2: Accuracy on CCGbank using 7-layer stacked bidirectional LSTMs, with different types of skip connections. $b_f$ is the bias of the forget gate. We report “fail” when the validation error is higher than 10%.

6.1.5 Experiments on Number of Layers

Table 3 compares the effect of depth in the stacked models. The performance improves as the number of layers increases, but it degrades when the number of layers exceeds 9. In our experiments, between 7 and 9 stacked layers is the best choice when using skip connections. Note that we do not use layer-wise pretraining [Bengio et al.2007, Simonyan and Zisserman2014], which is an important technique in training deep networks. Further improvements might be obtained by using this method to build a deeper network.

#Layers   Dev     Test
3         94.21   94.35
5         94.51   94.57
7         94.51   94.67
9         94.55   94.7
11        94.43   94.65

Table 3: Accuracy on CCGbank using gated identity connections to cell outputs, with different numbers of stacked layers.

6.2 Part-of-speech Tagging

Part-of-speech tagging is another sequential tagging task, which is to assign a POS tag to each word in a sentence. It is very similar to the supertagging task, so the two tasks can be solved in a unified architecture. For POS tagging, we use the same network configuration as for supertagging, except for the word vocabulary size and the tag set size. We conduct experiments on the Wall Street Journal portion of the Penn Treebank dataset, adopting the standard splits (sections 0-18 for training, sections 19-21 for validation and sections 22-24 for testing).

Model                               Test
Søgaard (2011)                      97.5
Ling et al. (2015)                  97.36
Wang et al. (2015)                  97.78
Vaswani et al. (2016)               97.4
7-layers + skip output + gating     97.45
9-layers + skip output + gating     97.45

Table 4: Accuracy for POS tagging on WSJ.

Although the POS tagging result presented in Table 4 is slightly below the state of the art, we neither tune any parameters nor change the network architecture; we simply use the configuration that achieves the best development accuracy on the supertagging task. This suggests that the model generalizes across tasks and avoids the heavy work of re-designing the model.

7 Conclusions

This paper investigates various kinds of skip connections in stacked bidirectional LSTM models. We present a deep stacked network (7 or 9 layers) that can be easily trained and achieves improved accuracy on CCG supertagging and POS tagging. Our experiments show that skip connections to the cell outputs with the gated identity function perform best. Our explorations could easily be applied to other sequential processing problems that can be modeled with RNN architectures.


The research work has been funded by the Natural Science Foundation of China under Grant No. 61333018. We thank the anonymous reviewers for their useful comments that greatly improved the manuscript.