
Multi-Zone Unit for Recurrent Neural Networks

Recurrent neural networks (RNNs) have been widely used to deal with sequence learning problems. The input-dependent transition function, which folds new observations into hidden states to sequentially construct fixed-length representations of arbitrary-length sequences, plays a critical role in RNNs. Based on single space composition, transition functions in existing RNNs often have difficulty in capturing complicated long-range dependencies. In this paper, we introduce a new Multi-zone Unit (MZU) for RNNs. The key idea is to design a transition function that is capable of modeling multiple space composition. The MZU consists of three components: zone generation, zone composition, and zone aggregation. Experimental results on multiple datasets of the character-level language modeling task and the aspect-based sentiment analysis task demonstrate the superiority of the MZU.


Introduction

Processing sequential data of variable length is a major challenge in the field of natural language processing (NLP). Recurrent Neural Networks (RNNs), Long Short-term Memories (LSTMs) [15] and Gated Recurrent Units (GRUs) [6] in particular, have recently become one of the most popular tools to approach sequence learning tasks, such as handwriting recognition [10], sentiment classification [34, 38], sequence labeling [16, 1], language modeling [12, 47, 7, 30, 23] and machine translation [5, 27].

RNNs sequentially construct fixed-length representations of arbitrary-length sequences by folding new observations into their hidden states using an input-dependent transition operator. The hidden state is updated recursively from the previous hidden state and the current input as $h_t = f(x_t, h_{t-1})$, where $f$ is a differentiable function with learnable parameters; in a vanilla RNN, $f$ multiplies its inputs by a weight matrix and squashes the result with a non-linear function. The state transition between consecutive hidden states adds a new input to the summary of the previous ones. However, this procedure of constructing a new summary from the combination of the previous one and the new input in conventional RNNs is based only on a single space composition, which often has difficulty in capturing complicated long-range dependencies, i.e., in allowing the hidden state to adapt rapidly to quickly changing modes of the input while still preserving a useful summary of the past.

Recent studies in neural machine translation show that it is beneficial to linearly project one single-space representation into multi-space representations and then capture useful information from these different representation spaces, e.g., the multi-head attention [36, 5, 27], which has also been verified effective for capturing multiple semantic aspects from the user utterance in generative dialogue systems [35]. Although some effective RNN extensions, e.g., the GRU and LSTM, have been proposed with gating units controlling the information flow, these gating units are also generated by such a function based on single space composition, which is not expressive enough to capture the potentially complex information needed to control the information flow.

In this paper, we thus boost the transition function through modeling multiple space composition, and propose a Multi-zone Unit for Recurrent Neural Networks, named MZU. The MZU consists of three components: zone generation, zone composition, and zone aggregation. At each time step, the MZU projects the previous hidden state and the current input into multiple zones (zone generation), then conducts full interactions and compositions on these zones (zone composition), and finally aggregates them into the final representation (zone aggregation). In particular, we propose three effective approaches for the zone composition to verify the MZU. The first model exploits self-attention [36] to draw global dependencies on zones. The second model utilizes a graph convolutional network [18, 19] to perform neighborhood mixing on zones by leveraging the graph connectivity structure as a filter. The third model uses the dynamic routing algorithm [33] to model part-whole relationships on zones. To further enhance the MZU, we propose an effective regularization objective to promote the diversity of the multiple zones. The architecture of the MZU is as generic as that of standard RNNs, and thus can be extended to deep MZUs.

We evaluate the MZU on 1) the challenging character-level language modeling task with two standard datasets, namely the well known Penn Treebank [26] and the larger Wikipedia dataset (text8) [25]; and 2) the aspect-based sentiment analysis (ABSA) task with two datasets of the SemEval 2014 Task 4 [32] in different domains. Experimental results on both tasks demonstrate the superiority of the MZU to previous competitive RNNs and its generalizability across tasks.

Our main contributions are three-fold:

  • We propose a new and generic Multi-zone Unit for RNNs, along with three effective approaches for the zone composition.

  • To further enhance the MZU, we propose an effective regularization objective to promote the diversity of multiple zones.

  • We provide empirical and visualization analyses to reveal advantages of the MZU, e.g., capturing richer linguistic information and better long sequence processing.

Multi-Zone Unit for RNNs

Overview

The MZU maintains a hidden state $h_t$ to summarize past inputs at each time step $t$, updated as:

$h_t = (1 - g_t) \odot h_{t-1} + g_t \odot \tilde{h}_t$ (1)

where $\odot$ is an element-wise product and $g_t$ is the gate that controls the flow of information from the previous hidden state $h_{t-1}$ and the candidate activation $\tilde{h}_t$ at the current time step, which is computed as:

$\tilde{h}_t = \tanh\big(\mathrm{MZT}_c(x_t, h_{t-1})\big)$ (2)

where $x_t$ is the input embedding at time $t$, and the gate $g_t$ is computed as:

$g_t = \sigma\big(\mathrm{MZT}_g(x_t, h_{t-1})\big)$ (3)

where $\mathrm{MZT}_c$ and $\mathrm{MZT}_g$ are multi-zone transformation functions, designed for modeling multiple space composition; their formal definitions will be described in the next section.

The MZU extends the standard RNN by replacing the linear transformation (i.e., the $\mathbf{W}[x_t, h_{t-1}]$ function) on $[x_t, h_{t-1}]$ with multi-zone transformation functions, i.e., the $\mathrm{MZT}_c$ and the $\mathrm{MZT}_g$ in Eqs. (2) and (3), while maintaining a simplified gating mechanism to control the information flow. We can also design more complex gating mechanisms, as in the GRU and LSTM, to further enhance the MZU.
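To make the recurrence concrete, the following minimal NumPy sketch implements one MZU step in the spirit of Eqs. (1)-(3). The GRU-style interpolation and the tanh/sigmoid non-linearities are our assumptions (the text only specifies that a gate mixes the previous state with the candidate activation), and mzt_c / mzt_g stand in for the multi-zone transformation functions defined in the next section.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mzu_step(x_t, h_prev, mzt_c, mzt_g):
    """One MZU step (a sketch of Eqs. (1)-(3), with assumed non-linearities).

    mzt_c / mzt_g: callables implementing the multi-zone transformation
    function on (x_t, h_{t-1}); see the next section.
    """
    h_cand = np.tanh(mzt_c(x_t, h_prev))     # candidate activation, Eq. (2)
    g = sigmoid(mzt_g(x_t, h_prev))          # gate, Eq. (3)
    return (1.0 - g) * h_prev + g * h_cand   # gated update, Eq. (1)
```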

Figure 1: Illustration of the multi-zone transformation function, which consists of three components. (1) Zone Generation: linearly project $[x_t; h_{t-1}]$ into zones $Z = \{z_1, \dots, z_n\}$; (2) Zone Composition: conduct compositions on $Z$ and output zones $Z' = \{z'_1, \dots, z'_m\}$; (3) Zone Aggregation: conduct a deep abstraction on $Z'$ with a position-wise FFN and output new zones $\hat{Z} = \{\hat{z}_1, \dots, \hat{z}_m\}$, then aggregate the representations by concatenation and finally conduct a linear transformation to obtain the final representation.

Multi-Zone Transformation Function

The multi-zone transformation function, i.e., the MZT-function (in Eqs. (2) and (3)), is designed to enhance both the transformation module (i.e., the candidate activation $\tilde{h}_t$) and the gating module (i.e., the gate $g_t$). The MZT-function takes only the embedding $x_t$ and the previous hidden state $h_{t-1}$ as inputs, where $x_t \in \mathbb{R}^{d_e}$ and $h_{t-1} \in \mathbb{R}^{d_h}$. Figure 1 shows the framework of the MZT-function, which consists of three components: Zone Generation, Zone Composition, and Zone Aggregation.

Zone Generation.

As in Figure 1, we linearly project the current input $x_t$ and the previous hidden state $h_{t-1}$ into $n$ zone vectors $Z = \{z_1, \dots, z_n\}$, parameterized with different weight matrices. Each zone is computed as:

$z_i = \mathbf{W}_i [x_t; h_{t-1}]$ (4)

where $\mathbf{W}_i \in \mathbb{R}^{d_z \times (d_e + d_h)}$ (we set $d_z = d_h / n$).

Zone Composition.

Then we conduct a multi-zone composition with fully connected interactions on $Z$, and output zones $Z' = \{z'_1, \dots, z'_m\}$, where $m$ can be equal to $n$ or not. In this step, zones in $Z$ can fully interact with each other and generate new representations through our specially designed composition models. This composition can be designed flexibly; in particular, we provide three models, i.e., a self-attention based model, a graph convolutional model and a capsule networks based model, which will be formally described later.

Zone Aggregation.

We further conduct a deep abstraction on the newly generated zones $Z'$, and aggregate them into the final representation in this step. In particular, we apply a position-wise feed-forward network (FFN) [36] to each zone $z'_j$ to conduct a deep abstraction and feature extraction, and output $\hat{z}_j$ correspondingly. This consists of two linear transformations with a ReLU activation in between:

$\hat{z}_j = \mathrm{FFN}(z'_j) = \max(0,\, z'_j \mathbf{W}_1 + b_1)\,\mathbf{W}_2 + b_2$ (5)

where $\mathbf{W}_1 \in \mathbb{R}^{d_z \times d_f}$, $\mathbf{W}_2 \in \mathbb{R}^{d_f \times d_z}$, and $d_f$ is the filter size. The linear transformations are the same across different zones. After that, we obtain the deeply transformed zones $\hat{Z} = \{\hat{z}_1, \dots, \hat{z}_m\}$. Then, we aggregate the representations by concatenating these zones and conduct a linear transformation to obtain the final representation.
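The three steps can be summarized in a short sketch. The code below is a minimal NumPy rendering of the MZT-function under the notation above; the parameter container and its key names (W_zone, W1/b1/W2/b2, W_out) are illustrative, and compose stands for one of the zone-composition models described next.

```python
import numpy as np

def mzt(x_t, h_prev, params, compose):
    """Multi-zone transformation function (a sketch of Figure 1).

    params: dict of illustrative weights
        W_zone : list of n matrices of shape (d_z, d_e + d_h)  -- zone generation
        W1, b1, W2, b2 : position-wise FFN weights, Eq. (5)
        W_out  : final linear transformation after concatenation
    compose: callable mapping zones (n, d_z) -> new zones (m, d_z)
    """
    inp = np.concatenate([x_t, h_prev])                        # [x_t; h_{t-1}]
    zones = np.stack([W_i @ inp for W_i in params["W_zone"]])  # (1) zone generation, Eq. (4)
    new_zones = compose(zones)                                 # (2) zone composition
    ffn = np.maximum(0.0, new_zones @ params["W1"] + params["b1"])  # ReLU
    ffn = ffn @ params["W2"] + params["b2"]                    # (3) position-wise FFN, Eq. (5)
    return params["W_out"] @ ffn.reshape(-1)                   # concatenate + final linear
```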

Models for Zone Composition

Note that how to conduct full interactions on the multiple zones $Z$ and generate the new zones $Z'$, i.e., the "Zone Composition" step (Figure 1-(2)), is crucial. It can be designed flexibly; in this paper, we propose three effective approaches:

1) We apply self-attention [36] to compute the relevance between each zone pair in $Z$, and generate new zones by a weighted sum of linearly transformed zones in $Z$, since self-attention has proved effective at drawing global dependencies between representations at different positions.

2) We exploit a Graph Convolutional Network (GCN) [18, 19] to conduct interactions on the zones in $Z$ by taking the zones as nodes of a graph, and generate new zones by absorbing information from their neighbor nodes along weighted edges. As a special form of Laplacian Smoothing [22], the GCN takes the graph connectivity structure as a filter to perform neighborhood mixing, and is therefore well suited for zone composition.

3) We utilize capsule networks [13, 33, 14], which have strong capabilities in representation composition and have been successfully applied to several NLP tasks [43, 37, 42]. In particular, we model part-whole relationships between the old zones $Z$ and the new zones $Z'$ to aggregate diverse and useful information from the old ones, and achieve better representation composition through iterations of dynamic routing [33].

Self-attention based MZU (MZU-Att)

We first project the zones $Z$ into queries, keys and values of dimension $d_k$, each with a linear transformation, parameterized with different weights. Then we compute the dot products of the query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. Then the matrix of outputs is computed as:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$ (6)

which gives the new zones $Z'$.
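A minimal sketch of this composition model, assuming single-head scaled dot-product attention over the zone vectors (the projection matrices are assumed parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def zone_self_attention(zones, W_q, W_k, W_v):
    """Self-attention based zone composition (Eq. (6)).

    zones: (n, d_z); W_q, W_k, W_v: (d_z, d_k) projection matrices.
    Returns the new zones as a weighted sum of the projected values.
    """
    Q, K, V = zones @ W_q, zones @ W_k, zones @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # relevance between zone pairs
    return weights @ V
```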

Graph Convolutional MZU (MZU-GCN)

We take the zones $Z$ as the nodes and construct a graph $G = (V, E)$, where $V$ and $E$ represent the sets of nodes and edges, respectively. Given node $z_i$ and node $z_j$, we use the cosine similarity to compute the semantic relevance between the two nodes as the weighted edge $e_{ij}$, which is calculated by:

$e_{ij} = \dfrac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|}$ (7)

where $\|\cdot\|$ stands for the L2 norm. Then we obtain the adjacency matrix $A$ (self-connections are added after this computation):

$A = (e_{ij}) \in \mathbb{R}^{n \times n}$ (8)

The degree matrix is represented as $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i = \sum_{j} A_{ij}$. Meanwhile, we stack the representations of all nodes into a matrix $Z \in \mathbb{R}^{n \times d_z}$. Then we obtain the new node matrix $Z'$, which incorporates the structured interactive information, through one layer of convolution as follows:

$Z' = \mathrm{ReLU}\!\left(D^{-\frac{1}{2}} A D^{-\frac{1}{2}} Z \mathbf{W}\right)$ (9)

where $\mathbf{W} \in \mathbb{R}^{d_z \times d_c}$ is the weight matrix, $d_c$ (we set $d_c = d_z$) is the dimension of this convolutional layer, and $Z' \in \mathbb{R}^{n \times d_c}$. It is a special form of Laplacian Smoothing [22], which computes the new representation of a node as the weighted average of itself and its neighbors.
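A sketch of this composition model; the ReLU non-linearity and the simple additive handling of the self-connections are our assumptions, while the cosine-similarity edges and the symmetric normalization follow the text.

```python
import numpy as np

def gcn_zone_composition(zones, W):
    """Graph-convolutional zone composition (Eqs. (7)-(9)).

    zones: (n, d_z) node matrix Z; W: (d_z, d_c) layer weights.
    """
    norms = np.linalg.norm(zones, axis=1, keepdims=True)
    A = (zones @ zones.T) / (norms @ norms.T)            # cosine edges, Eqs. (7)-(8)
    A = A + np.eye(len(zones))                           # self-connections (assumed as +I)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))   # D^{-1/2}
    return np.maximum(0.0, D_inv_sqrt @ A @ D_inv_sqrt @ zones @ W)  # Eq. (9)
```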

Capsule Networks based MZU (MZU-Cap)

We take the zones $Z$ as low-level capsules, and aim to output high-level capsules (i.e., zones) $Z' = \{z'_1, \dots, z'_m\}$, where $m$ can differ from $n$ (we set $m = 2$). We aggregate information from $Z$ and conduct representation composition to generate $Z'$ by modeling part-whole relationships between the capsules in $Z$ and the capsules in $Z'$.

We first initialize the logits $b_{ij}$ to 0; they are the log prior probabilities that capsule $i$ should be coupled to capsule $j$. We then generate the "prediction vectors" $\hat{u}_{j|i}$ from the low-level capsules as follows:

$\hat{u}_{j|i} = \mathbf{W}_{ij} z_i$ (10)

where $\mathbf{W}_{ij}$ is the weight tensor. (In practice, we share the transformation matrix of each output capsule among all the input capsules, i.e., $\mathbf{W}_{ij} = \mathbf{W}_j$.)

Then we aggregate information from $Z$ and conduct representation composition to generate $Z'$ through iterations of dynamic routing on $\hat{u}_{j|i}$. At each iteration, we first compute the coupling coefficients $c_{ij}$ between low-level capsule $i$ and all high-level capsules by a softmax on the logits $b_{ij}$, as follows:

$c_{ij} = \dfrac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}$ (11)

After that, we generate $s_j$ by a weighted sum over the prediction vectors $\hat{u}_{j|i}$ with the coupling coefficients as weights, computed as:

$s_j = \sum_{i} c_{ij}\, \hat{u}_{j|i}$ (12)

Then we take $s_j$ as input and output the capsule $z'_j$ by squashing it as follows:

$z'_j = \mathrm{squash}(s_j) = \dfrac{\|s_j\|^2}{1 + \|s_j\|^2}\,\dfrac{s_j}{\|s_j\|}$ (13)

where $\mathrm{squash}(\cdot)$ [33] is a non-linear function that ensures short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1.

The logits $b_{ij}$ are then iteratively refined by measuring the agreement (scalar product) between the current output $z'_j$ and the prediction $\hat{u}_{j|i}$:

$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot z'_j$ (14)

After several iterations of dynamic routing, we obtain the final capsules (zones) $Z'$.
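The routing loop can be sketched as follows; the weight sharing of the footnote above and three routing iterations (as used in our experiments) are assumed in this minimal NumPy version.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Eq. (13): shrink short vectors toward zero, long ones to length < 1."""
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def capsule_zone_composition(zones, W, num_iters=3):
    """Capsule-network zone composition via dynamic routing (Eqs. (10)-(14)).

    zones: (n, d_z) low-level capsules; W: (m, d_z, d_out) transformation
    matrices, one per output capsule, shared over input capsules.
    """
    n, m = zones.shape[0], W.shape[0]
    u_hat = np.einsum("mdo,nd->nmo", W, zones)      # prediction vectors, Eq. (10)
    b = np.zeros((n, m))                            # routing logits, initialized to 0
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # coupling, Eq. (11)
        s = np.einsum("nm,nmo->mo", c, u_hat)       # weighted sum, Eq. (12)
        v = squash(s)                               # output capsules, Eq. (13)
        b = b + np.einsum("nmo,mo->nm", u_hat, v)   # agreement update, Eq. (14)
    return v                                        # new zones Z'
```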

Deep Transition MZU

The MZU is as generic as standard RNNs, and can be extended to deep MZUs. Since recent studies [31, 3, 27] have demonstrated the superiority of deep transition RNNs over deep stacked RNNs, we extend the MZU to a more powerful deep transition MZU. The entire deep transition block is composed of one MZU cell followed by several transition MZU (T-MZU) cells at each time step. As a special case of the MZU, the T-MZU only has a "state" as input, i.e., the embedding input of the T-MZU is a zero vector. In the whole recurrent procedure, at the current time step, the "state" output of one MZU/T-MZU cell is used as the "state" input of the next T-MZU cell, and the "state" output of the last T-MZU cell for the current time step is carried over as the "state" input of the first MZU cell for the next time step.

For a T-MZU cell, the hidden state $h_t^{l}$ at time step $t$ and transition depth $l$ is computed as follows:

$h_t^{l} = (1 - g_t^{l}) \odot h_t^{l-1} + g_t^{l} \odot \tilde{h}_t^{l}$ (15)
$\tilde{h}_t^{l} = \tanh\big(\mathrm{MZT}_c^{l}(\mathbf{0}, h_t^{l-1})\big)$ (16)

where the gate $g_t^{l}$ is computed as:

$g_t^{l} = \sigma\big(\mathrm{MZT}_g^{l}(\mathbf{0}, h_t^{l-1})\big)$ (17)

where $\mathrm{MZT}_c^{l}$ and $\mathrm{MZT}_g^{l}$ are multi-zone transformation functions, as described in Section "Multi-Zone Transformation Function".
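The recurrence over time and transition depth can be sketched as follows; treating the cells as interchangeable callables h_new = cell(x_t, h_prev) is an assumption of this sketch rather than an interface defined in the text.

```python
import numpy as np

def deep_transition_mzu(inputs, h0, mzu_cell, tmzu_cells):
    """Deep transition MZU: one MZU cell followed by several T-MZU cells
    per time step; T-MZU cells receive a zero vector as embedding input.

    inputs: sequence of input embeddings x_t; h0: initial hidden state.
    """
    h = h0
    outputs = []
    for x_t in inputs:
        h = mzu_cell(x_t, h)              # first cell folds in the new input
        zero = np.zeros_like(x_t)
        for t_cell in tmzu_cells:         # transition cells: zero embedding input
            h = t_cell(zero, h)
        outputs.append(h)                 # last T-MZU state is carried to the next step
    return outputs, h
```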

Regularization on Multiple Zones

We expect the representations in the multiple zones of the MZU to be as different as possible. Thus we propose a disagreement regularization on these zones, inspired by Li et al. (2018). This regularization is designed to maximize the cosine distance (i.e., the negative cosine similarity) between zone pairs. Our objective is to enlarge the average cosine distance among all zone pairs, computed as:

$D(Z) = \dfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\left(-\dfrac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|}\right)$ (18)
Training Objective.

Taking the regularization into account, the training objective of the MZU is:

$J(\theta) = L_{\mathrm{task}} - \lambda \cdot \dfrac{1}{M T}\sum_{t=1}^{T}\sum_{m=1}^{M} D(Z_{t,m})$ (19)

where $M$ is the number of MZT-functions in the MZU, $T$ is the number of inputs (e.g., tokens), $Z_{t,m}$ denotes the zones produced by the $m$-th MZT-function at time step $t$, $L_{\mathrm{task}}$ is a task-specific training objective, and $\lambda$ is a hyper-parameter used to balance the preference between the two terms of the loss function.
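A minimal sketch of the regularizer and the combined objective under this reading of Eqs. (18)-(19); the averaging over MZT calls is expressed as a plain mean over the collected zone sets.

```python
import numpy as np

def zone_disagreement(zones):
    """Average negative cosine similarity over all zone pairs (Eq. (18))."""
    z = zones / np.linalg.norm(zones, axis=1, keepdims=True)
    n = len(zones)
    return -np.sum(z @ z.T) / (n * n)

def mzu_objective(task_loss, zone_sets, lam=1.0):
    """Eq. (19): task loss minus lambda times the average disagreement
    over all MZT-function calls (M functions x T time steps)."""
    return task_loss - lam * np.mean([zone_disagreement(z) for z in zone_sets])
```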

Experiments

We verify the MZU on the character-level language modeling task and the aspect-based sentiment analysis task.

Character-level Language Modeling

We use two standard datasets, namely the Penn Treebank [26] and the larger Wikipedia dataset (text8) [25].

Penn Treebank. The Penn Treebank dataset is a collection of Wall Street Journal articles written in English. We follow the preprocessing procedure introduced in [29], and split the data into training, validation and test sets consisting of 5.0M, 390K and 440K characters, respectively.

The Wikipedia Corpus (text8). The text8 dataset consists of 100M characters extracted from the English Wikipedia. text8 contains only alphabets and spaces, so there are 27 symbols in total. In order to compare with previous work, we follow the preprocessing procedure in [29], and split the data into training, validation and test sets consisting of 90M, 5M, and 5M characters, respectively.

Training Details.

We train the models using Adam [17] with an initial learning rate of 0.001. Each update uses a mini-batch of 256 examples. We use truncated backpropagation through time to approximate the gradients, setting the truncation length to 150. The norm of the gradient is clipped to 5.0. We apply layer normalization to our models, and apply dropout to the candidate activation to avoid overfitting. The dropout rates for the Penn Treebank task and the text8 task are set to 0.5 and 0.3, respectively. The hidden size and filter size for Penn Treebank are set to 800 and 1,000, respectively, and those for text8 are set to 1,536 and 3,072, respectively. The embedding size is 256. We share the parameters of the MZT-function across different transition depths (i.e., in the deep transition counterpart) for Penn Treebank. We measure the performance of checkpoints by evaluating bits per character (BPC) over the validation set, and test with the best one. BPC is the negative log-likelihood divided by the natural logarithm of 2; the lower, the better.
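For clarity, the conversion from the usual average negative log-likelihood (in nats) to BPC is just a change of logarithm base:

```python
import math

def bits_per_character(avg_nll_nats):
    """BPC = average negative log-likelihood per character, in bits."""
    return avg_nll_nats / math.log(2)
```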

System Description.

We use MZU-Att, MZU-GCN and MZU-Cap to denote the self-attention based MZU, the graph convolutional MZU and the capsule networks based MZU, respectively. "DT" stands for deep transition MZU cells, with the transition depth set to 1 by default, i.e., one MZU cell plus one T-MZU cell. For all MZU models, we apply 4 zones in the MZT-functions, except when otherwise mentioned. For all MZU-Cap models, we set the number of output capsules to 2, and use 3 routing iterations, according to preliminary experiments. We set $\lambda$ to 1.0 in Eq. (19) for all experiments. In preliminary experiments with different settings of $\lambda$, we find that setting $\lambda > 0$ (promoting different zones) consistently improves the performance (by 0.005 to 0.007 BPC) on both shallow and deep MZU models, while setting $\lambda < 0$ (promoting similar zones), e.g., -0.5 or -1.0, sharply degrades the performance (by 0.03 to 0.05 BPC). This indicates that modeling the multi-zone transformation affects the state transition, and consequently the performance.

Model Size BPC
LSTM [21] 1.36
Zoneout LSTM [21] 1.27
2-Layers LSTM [47] 6.6M 1.243
LN-HM-LSTM [7] 1.24
HyperLSTM [12] 14.4M 1.219
NASCell [47] 16.3M 1.214
FS-LSTM-4 [30] 7.2M 1.190
res-IndRNN [23] 1.19
MZU-Att+DT (ours) 6.5M 1.189
MZU-GCN+DT (ours) 6.4M 1.183
MZU-Cap+DT (ours) 7.2M 1.181
Table 1: BPC on Penn Treebank test set.
Model Size BPC
Large td-LSTM [41] 1.49
MI-LSTM [39] 1.44
BN-LSTM [8] 1.36
Zoneout LSTM [21] 1.336
LN HM-LSTM [7] 35M 1.29
Large RHN [46] 45M 1.27
Large mLSTM [20] 45M 1.27
MZU-Att+DT (ours) 32M 1.268
MZU-GCN+DT (ours) 32M 1.260
MZU-Cap+DT (ours) 41M 1.249
Transformer [2] 235M 1.13
Transformer-XL [9] 277M 1.08
Table 2: BPC on text8 test set.
Main Results.

Table 1 shows the results on the Penn Treebank test set, compared with existing competitive approaches. Our MZU models outperform previous competitive approaches, including deeper RNNs, e.g., the 4-layer FS-LSTM and the 11-layer res-IndRNN, without using more parameters. Among the three MZU models, MZU-GCN+DT and MZU-Cap+DT achieve better performance than MZU-Att+DT, and MZU-Cap+DT achieves the best BPC of 1.181. MZU-Cap can model part-whole relationships between the old zones and the new zones to aggregate useful information through multiple iterations of routing-by-agreement. Such multi-round refinement lets the new zones contain more accurate representations, and therefore MZU-Cap performs better than the other MZUs.

To demonstrate that MZUs work well across datasets, we conduct experiments on the larger text8 dataset. As shown in Table 2, using even fewer parameters, our MZU-Att+DT, MZU-GCN+DT and MZU-Cap+DT outperform several previous deep RNNs, including the 10-depth Large RHN and the stacked Large mLSTM, with the exception of the recently proposed Transformer-based models [2, 9], which are trained with a much longer context (512), deeper layers (the 24-layer Transformer and the 64-layer Transformer-XL) and a larger number of parameters. Explicitly using a longer context indeed improves the performance of language modeling, whereas in this work we mainly focus on boosting the transition functions (i.e., operating only on $x_t$ and $h_{t-1}$) of RNNs. Our approaches can also be applied to Transformer-based architectures, e.g., for enhancing the multi-head attention.

#Zones MZU-Att MZU-GCN MZU-Cap
2 1.191 1.187 1.186
4 1.189 1.183 1.181
8 1.200 1.189 1.181
16 1.208 1.185 1.186
Table 3: Effects of zone numbers on the Penn Treebank test set. All the tested models are with "DT", which is omitted due to limited table space.
Model No-DT DT
GRU 1.269 1.208
MZRNN-Att 1.214 1.189
     Regular-Gate 1.236 1.207
     Regular-Trans. 1.247 1.208
MZRNN-GCN 1.196 1.183
     Regular-Gate 1.224 1.197
     Regular-Trans. 1.225 1.191
MZRNN-Cap 1.192 1.181
     Regular-Gate 1.219 1.190
     Regular-Trans. 1.211 1.188
Table 4: Ablation study of MZT-functions in MZUs. "Regular" stands for replacing the MZT-function with the regular linear transformation on $[x_t, h_{t-1}]$.
The Number of Zones.

We investigate the impact of the number of zones on MZUs; in particular, we conduct experiments with the three deep MZU models (i.e., with "DT") on Penn Treebank. For each model, we test 2, 4, 8, and 16 zones in the MZT-functions. As the dimension of each zone is determined by the hidden size and the zone number, i.e., $d_z = d_h / n$, more zones do not lead to more parameters. As shown in Table 3, using 4 zones achieves the best performance for all MZU models. Additionally, we observe that MZU-Att is more sensitive to the number of zones, and leveraging too many zones (e.g., 8 or 16) sharply degrades its performance. MZU-GCN and MZU-Cap are more robust when modeling multiple zones, benefiting from the powerful graph convolutional structure and the dynamic routing algorithm, respectively.

Figure 2: BPC of shallow (left) and deep (right) models on sequences of different lengths. The numbers on the X-axis stand for sequences longer than the corresponding length (x10), e.g., 9 for sequences with (90, 120] characters. The superiority of MZUs over GRUs is more obvious on long sequences (e.g., longer than 90 characters).
Ablation Study.

Since the MZT-function works in both the gating module (i.e., $g_t$) and the transformation module (i.e., $\tilde{h}_t$), we conduct an ablation study to investigate its effects on both modules. We show the results in Table 4, where "Regular" stands for replacing the MZT-function with the regular linear transformation on $[x_t, h_{t-1}]$. As shown in Table 4, the MZT-function is crucial for both the gating module ("Gate") and the transformation module ("Trans."), for both shallow ("No-DT") and deep ("DT") MZU models, since replacing either module with the Regular one sharply degrades the performance. We also list the results of the GRU and the deep transition GRU [31] (i.e., "DT") in Table 4 (we also apply dropout and layer normalization to the GRUs, and build these models with the same settings as our MZUs). Even with one module degenerated, our MZUs still outperform the GRUs, which are based only on single space composition, in the shallow and deep settings, respectively.

About Length.

We investigate the effects of MZU models on sequences of different lengths, and compare them with the GRU and the deep transition GRU (GRU+DT) in Figure 2. The results of shallow and deep models are shown in the left and right subfigures, respectively. As shown in Figure 2, on sequences of different lengths, shallow and deep MZU models consistently yield better performance than the GRU and the GRU+DT, respectively. The superiority is more obvious on long sequences (e.g., longer than 90 characters) for both shallow and deep MZU models, demonstrating the stronger capability of the MZU for long sequence processing.

Figure 3: Relevance results between each of the last five characters (i.e., 'm', 'e', 'n', 't', 's') and all characters before it, from the GRU (a) and the MZU (b). (c) shows the zone-specific relevance result of the MZU between the last character 's' and all characters before it. The higher the relevance, the whiter the grayscale. Blue rectangles and green ellipses indicate high-relevance parts. '_' stands for the space between two words.
Visualization Analysis.

We conduct a visualization analysis to demonstrate the superiority of the MZU (Figure 3(b)) over the GRU (Figure 3(a)). We compute the relevance between each of the last five characters (i.e., 'm', 'e', 'n', 't', 's') and all characters before it. For example, the relevance between 'm' and the first character 'p' is calculated by the similarity between the candidate activation (which stands for the transformation at the current step) of 'm' and the hidden state of 'p'. The higher the relevance, the whiter the grayscale. From Figure 3(a) (GRU), we find that 1) the gray distribution is not sharply contoured; and 2) the highest-relevance positions for each computed character usually occur at the nearest position to it or at positions with the same surface character, such as the boxes marked in blue. In Figure 3(b) (MZU), the gray distribution is more sharply contoured, e.g., the three regions marked by green ellipses. Additionally, it captures some philological segments, such as the suffixes "ments" and "ed", and parts of the word "inc.", which are related to the word "agreements". Furthermore, we investigate how the multiple zones affect the performance, and show the zone-specific relevance between the last character 's' and all characters before it in Figure 3(c). We find that different zones indeed capture different philological information. For example, "zone1" and "zone3" capture some suffixes (e.g., "ment", "ed"), "zone2" captures the same surface character 's' at long distance, and "zone4" not only captures the 't' nearest to 's', but also captures the long-distance and frequent collocation "ts".

Sentence
The appetizers are ok, but the service is slow.
Aspect   The appetizers   service
Sentiment   Neutral   Negative
Table 5: An example that contains different sentiment polarities towards two aspects.
Models Restaurant Laptop
DS HDS DS HDS
TD-LSTM* [34] 73.44±1.17 56.48±2.46 62.23±0.92 46.11±1.89
ATAE-LSTM* [38] 73.74±3.01 50.98±2.27 64.38±4.52 40.39±1.30
GCAE* [40] 77.28±0.32 56.73±0.56 69.14±0.32 47.06±2.45
GRU (With-DT) 78.25±0.51 56.17±0.72 71.01±0.73 47.03±1.15
MZRNN (With-DT) 79.40±0.23 57.59±0.70 72.45±0.68 47.96±1.12
Table 6: Accuracy (mean over five training runs, with standard deviation) on the SemEval 2014 term-based datasets. Results of existing systems (marked with "*") are cited from [40].

Aspect Based Sentiment Analysis

Task Description.

Since RNNs are widely used for sequence modeling and classification tasks, we choose the well-known and challenging ABSA task as a case study to demonstrate the generalizability of the MZU across tasks. We test on the aspect-term based task of SemEval 2014 Task 4 [32] with two datasets that contain domain-specific customer reviews for restaurants and laptops, each of which uses four sentiment labels (i.e., Positive, Negative, Neutral, and Conflict). For this task, the goal is to infer the sentiment polarity over the aspect (i.e., the given term), which is a subsequence of the sentence. The example in Table 5 shows the customer's different attitudes towards two terms: "The appetizers" and "service". For each dataset, we follow [40, 24] and construct a hard sub-dataset (named "HDS") by extraction from the full dataset (named "DS"). In "HDS", all sentences contain multiple aspects, each of which corresponds to a different sentiment label. On this sub-dataset, we copy each sentence $k$ times, where $k$ is the number of aspects in the sentence.

Models.

We compare our MZU with the deep transition GRU and several previous competitive systems. For the MZU, we use the MZRNN with deep transition. We treat ABSA as a sentence-level classification task. Therefore, for both the GRU model and the MZRNN model, we follow [34] and first encode the sentence as a standard Bi-RNN encoder does. Then we conduct mean pooling on the output hidden states to generate a sentence vector. Next, we concatenate the sentence vector with the given aspect embedding (the mean of the embeddings for multi-word aspects). Finally, we predict the sentiment polarity from the concatenated representation with a softmax layer.
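A minimal sketch of this classification head on top of the encoder states; the parameter names (W_cls, b_cls) are illustrative, and the number of classes follows the four sentiment labels of the task.

```python
import numpy as np

def absa_predict(hidden_states, aspect_embeddings, W_cls, b_cls):
    """Sentence-level ABSA head: mean-pool the encoder states, concatenate
    the (mean) aspect embedding, and apply a softmax layer.

    hidden_states: (T, d) Bi-RNN outputs; aspect_embeddings: (k, d_e)
    embeddings of the aspect words; W_cls: (4, d + d_e); b_cls: (4,).
    """
    sentence_vec = hidden_states.mean(axis=0)        # mean pooling over time
    aspect_vec = aspect_embeddings.mean(axis=0)      # mean for multi-word aspects
    feats = np.concatenate([sentence_vec, aspect_vec])
    logits = W_cls @ feats + b_cls
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # distribution over the 4 labels
```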

Results.

As shown in Table 6, our implemented deep transition GRU outperforms the previous LSTM-based systems (i.e., TD-LSTM [34] and ATAE-LSTM [38]) and the gated CNN-based system GCAE [40] (comparable on the "HDS" sub-datasets), demonstrating that our deep transition GRU is a strong baseline. When incorporating the MZU, our MZRNN further improves the performance on multiple datasets by significant margins. These results demonstrate the superiority and generalizability of the MZU across tasks.

Related Work

This work is inspired by the idea of processing attention on multiple subspaces with different heads [36]. Different from this multi-channel style operation, however, we conduct fully connected interactions on multiple zones to extract useful information. In particular, we propose three effective models to conduct the interactions and representation composition.

Some researchers [31, 46, 11, 30] boost the transition function with deep architectures, such as adding intermediate layers to increase the transition depth. Different from these works, we boost the performance of RNNs by modeling multiple zone composition for the transition function. Additionally, our MZU is generic and can be easily extended to deep MZUs.

Some researchers devise models to capture specific features from the processed history sequence, including incorporating an attention mechanism into LSTM cells [44] or extending the GRU with a recurrent attention unit [45]. Different from these works, we focus on boosting the transition function by modeling multiple zone composition, which operates only on $x_t$ and $h_{t-1}$ at each time step.

Some studies [28, 4] investigate regularization techniques for optimizing LSTM-based models, such as weight dropout, variational dropout, weight tying, temporal activation regularization, and past decode regularization. These techniques can be used to further optimize MZU-based models in the future.

Conclusion

We propose a generic Multi-zone Unit for RNNs along with three effective variants (i.e., MZU-Att, MZU-GCN and MZU-Cap), which boost the state transition function through modeling multiple space composition. To further enhance MZUs, we propose an effective regularization objective to promote the diversity of the multiple zones. Experiments on two character-level language modeling datasets show that our MZUs can automatically capture rich linguistic information and substantially improve performance. Experimental results on the ABSA task further demonstrate the superiority and generalizability of the MZU.

Acknowledgments

Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204), National Natural Science Foundation of China (No. 61761166008), Beijing Advanced Innovation Center for Language Resources (No. TYR17002).

References

  • [1] A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In ACL, External Links: Link Cited by: Introduction.
  • [2] R. Al-Rfou, D. Choe, N. Constant, et al. (2018) Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444. Cited by: Main Results., Table 2.
  • [3] A. V. M. Barone, J. Helcl, R. Sennrich, et al. (2017) Deep architectures for neural machine translation. In WMT, Cited by: Deep Transition MZU.
  • [4] S. Brahma (2018) Improved language modeling by decoding the past. CoRR abs/1808.05908. External Links: Link, 1808.05908 Cited by: Related Work.
  • [5] M. X. Chen, O. Firat, A. Bapna, et al. (2018) The best of both worlds: combining recent advances in neural machine translation. In ACL, External Links: Link Cited by: Introduction, Introduction.
  • [6] K. Cho, B. van Merrienboer, C. Gulcehre, et al. (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP, Cited by: Introduction.
  • [7] J. Chung, S. Ahn, and Y. Bengio (2017) Hierarchical multiscale recurrent neural networks. In ICLR, Cited by: Introduction, Table 1, Table 2.
  • [8] T. Cooijmans, N. Ballas, C. Laurent, et al. (2017) Recurrent batch normalization. In ICLR, Cited by: Table 2.
  • [9] Z. Dai, Z. Yang, Y. Yang, et al. (2019) Transformer-xl: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860. Cited by: Main Results., Table 2.
  • [10] A. Graves (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: Introduction.
  • [11] A. Graves (2016) Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983. External Links: Link, 1603.08983 Cited by: Related Work.
  • [12] D. Ha, A. Dai, and Q. V. Le (2017) Hypernetworks. In ICLR, Cited by: Introduction, Table 1.
  • [13] G. E. Hinton, A. Krizhevsky, and S. D. Wang (2011) Transforming auto-encoders. In ICANN, Cited by: Models for Zone Composition.
  • [14] G. E. Hinton, S. Sabour, and N. Frosst (2018) Matrix capsules with em routing. In ICLR, Cited by: Models for Zone Composition.
  • [15] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation. Cited by: Introduction.
  • [16] Z. Huang, W. Xu, and K. Yu (2015) Bidirectional lstm-crf models for sequence tagging. arXiv. Cited by: Introduction.
  • [17] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Training Details..
  • [18] T. N. Kipf and M. Welling (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308. Cited by: Introduction, Models for Zone Composition.
  • [19] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: Introduction, Models for Zone Composition.
  • [20] B. Krause, L. Lu, I. Murray, et al. (2016) Multiplicative lstm for sequence modelling. In ICLR (Workshop), Cited by: Table 2.
  • [21] D. Krueger, T. Maharaj, J. Kramár, et al. (2017) Zoneout: regularizing rnns by randomly preserving hidden activations. In ICLR, Cited by: Table 1, Table 2.
  • [22] Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, Cited by: Graph Convolutional MZU (MZU-GCN), Models for Zone Composition.
  • [23] S. Li, W. Li, C. Cook, et al. (2018) Independently recurrent neural network (indrnn): building a longer and deeper rnn. In CVPR, Cited by: Introduction, Table 1.
  • [24] Y. Liang, F. Meng, J. Zhang, J. Xu, Y. Chen, and J. Zhou (2019) A novel aspect-guided deep transition model for aspect based sentiment analysis. In EMNLP, Cited by: Task Description..
  • [25] M. Mahoney (2011) Large text compression benchmark. Cited by: Introduction, Character-level Language Modeling.
  • [26] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini (1993) Building a large annotated corpus of english: the penn treebank. Computational linguistics 19 (2). Cited by: Introduction, Character-level Language Modeling.
  • [27] F. Meng and J. Zhang (2019) DTMT: A novel deep transition architecture for neural machine translation. AAAI. Cited by: Introduction, Introduction, Deep Transition MZU.
  • [28] S. Merity, N. S. Keskar, and R. Socher (2018) Regularizing and optimizing lstm language models. In ICLR, Cited by: Related Work.
  • [29] T. Mikolov, I. Sutskever, A. Deoras, et al. (2012) Subword language modeling with neural networks. Cited by: Character-level Language Modeling, Character-level Language Modeling.
  • [30] A. Mujika, F. Meier, and A. Steger (2017) Fast-slow recurrent neural networks. In NIPS, Cited by: Introduction, Table 1, Related Work.
  • [31] R. Pascanu, C. Gulcehre, K. Cho, et al. (2014) How to construct deep recurrent neural networks. In ICLR, Cited by: Deep Transition MZU, Ablation Study., Related Work.
  • [32] M. Pontiki, D. Galanis, J. Pavlopoulos, et al. (2014) SemEval-2014 task 4: aspect based sentiment analysis. In SemEval, External Links: Document, Link Cited by: Introduction, Task Description..
  • [33] S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In NIPS, Cited by: Introduction, Capsule Networks based MZU (MZU), Models for Zone Composition.
  • [34] D. Tang, B. Qin, X. Feng, et al. (2016) Effective lstms for target-dependent sentiment classification. In COLING, External Links: Link Cited by: Introduction, Models., Results., Table 6.
  • [35] C. Tao, S. Gao, M. Shang, et al. (2018) Get the point of my utterance! learning towards effective responses with multi-head attention mechanism.. In IJCAI, Cited by: Introduction.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, et al. (2017) Attention is all you need. In NIPS, Cited by: Introduction, Introduction, Zone Aggregation., Models for Zone Composition, Related Work.
  • [37] Q. Wang, J. Qiu, Y. Zhou, et al. (2018) Recurrent capsule network for relations extraction: A practical application to the severity classification of coronary artery disease. CoRR abs/1807.06718. External Links: Link, 1807.06718 Cited by: Models for Zone Composition.
  • [38] Y. Wang, M. Huang, x. zhu, et al. (2016) Attention-based lstm for aspect-level sentiment classification. In EMNLP, External Links: Document, Link Cited by: Introduction, Results., Table 6.
  • [39] Y. Wu, S. Zhang, Y. Zhang, et al. (2016) On multiplicative integration with recurrent neural networks. In NIPS, Cited by: Table 2.
  • [40] W. Xue and T. Li (2018) Aspect based sentiment analysis with gated convolutional networks. In ACL, External Links: Link Cited by: Task Description., Results., Table 6.
  • [41] S. Zhang, Y. Wu, T. Che, et al. (2016) Architectural complexity measures of recurrent neural networks. In NIPS, Cited by: Table 2.
  • [42] X. Zhang, P. Li, W. Jia, and H. Zhao (2019) Multi-labeled relation extraction with attentive capsule network. In AAAI, Cited by: Models for Zone Composition.
  • [43] W. Zhao, J. Ye, M. Yang, et al. (2018) Investigating capsule networks with dynamic routing for text classification. In EMNLP, External Links: Link Cited by: Models for Zone Composition.
  • [44] G. Zhong, X. Lin, and K. Chen (2018) Long short-term attention. CoRR abs/1810.12752. External Links: Link, 1810.12752 Cited by: Related Work.
  • [45] G. Zhong, G. Yue, and X. Ling (2018) Recurrent attention unit. CoRR abs/1810.12754. External Links: Link, 1810.12754 Cited by: Related Work.
  • [46] J. G. Zilly, R. K. Srivastava, J. Koutník, et al. (2017) Recurrent highway networks. In ICML, Cited by: Table 2, Related Work.
  • [47] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: Introduction, Table 1.