Introduction
Processing sequential data of variable length is a major challenge in natural language processing (NLP). Recurrent Neural Networks (RNNs), and Long Short-Term Memories (LSTMs) [15] and Gated Recurrent Units (GRUs) [6] in particular, have recently become some of the most popular tools for sequence learning tasks such as handwriting recognition [10], sentiment classification [34, 38], sequence labeling [16, 1], language modeling [12, 47, 7, 30, 23], and machine translation [5, 27].
RNNs sequentially construct fixed-length representations of arbitrary-length sequences by folding new observations into their hidden states using an input-dependent transition operator. The hidden state is updated recursively from the previous hidden state and the current input as $h_t = f(h_{t-1}, x_t)$, where $f$ is a differentiable function with learnable parameters; for a vanilla RNN, $f$ multiplies its inputs by a matrix and squashes the result with a nonlinear function. The state transition between consecutive hidden states adds a new input to the summary of previous ones. However, in conventional RNNs this construction of a new summary from the previous one and the new input relies on a single-space composition only. Such a composition often has difficulty capturing complicated long-range dependencies and letting the hidden state adapt rapidly to quickly changing modes of the input while still preserving a useful summary of the past.
Recent studies in neural machine translation show that it is beneficial to linearly project a single-space representation into multi-space representations and then capture useful information from these different representation spaces, e.g., with multi-head attention [36, 5, 27], which has also been verified to be effective for capturing multiple semantic aspects of the user utterance in generative dialogue systems [35]. Some effective RNN extensions, e.g., GRU and LSTM, have been proposed with gating units that control the information flow. However, these gating units are also generated by the same kind of function ($f$) based on single-space composition, which is not expressive enough to capture potentially complex information for controlling the information flow.
In this paper, we thus boost the function $f$ by modeling multiple-space composition, and propose a Multi-Zone Unit for Recurrent Neural Networks, named MZU. The MZU consists of three components: zone generation, zone composition, and zone aggregation. At each time step, the MZU projects the previous hidden state and the current input into multiple zones (zone generation), then conducts full interactions and compositions on these zones (zone composition), and finally aggregates them into the final representation (zone aggregation). In particular, we propose three effective approaches for the zone composition to verify the MZU. The first model exploits self-attention [36] to draw global dependencies on zones. The second model utilizes a graph convolutional network [18, 19] to perform neighborhood mixing on zones by leveraging the graph connectivity structure as a filter. The third model uses the dynamic routing algorithm [33] to model part-whole relationships on zones. To further enhance the MZU, we propose an effective regularization objective to promote diversity among the multiple zones. Like standard RNNs, the MZU architecture is generic and can thus be extended to deep MZUs.
We evaluate the MZU on 1) the challenging character-level language modeling task with two standard datasets, namely the well-known Penn Treebank [26] and the larger Wikipedia dataset (text8) [25]; and 2) the aspect-based sentiment analysis (ABSA) task with two datasets of the SemEval 2014 Task 4 [32] in different domains. Experimental results on both tasks demonstrate the superiority of the MZU over previous competitive RNNs and its generalizability across tasks.
Our main contributions are threefold:

We propose a new and generic Multi-Zone Unit for RNNs, along with three effective approaches for the zone composition.

To further enhance the MZU, we propose an effective regularization objective to promote the diversity of multiple zones.

We provide empirical and visualization analyses to reveal advantages of the MZU, e.g., capturing richer linguistic information and better long sequence processing.
Multi-Zone Unit for RNNs
Overview
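MZU maintains a hidden state $h_t$ to summarize past inputs at each time step $t$, updated as:
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{1}$$
where $\odot$ is an element-wise product and $z_t$ is the gate that controls the flow of information from the previous hidden state $h_{t-1}$ and the candidate activation $\tilde{h}_t$ at the current time step. The candidate activation is computed as:
$$\tilde{h}_t = \tanh(\mathcal{F}_h(x_t, h_{t-1})) \tag{2}$$
where $x_t$ is the input embedding at time $t$, and the gate $z_t$ is computed as:
$$z_t = \sigma(\mathcal{F}_z(x_t, h_{t-1})) \tag{3}$$
where $\mathcal{F}_h$ and $\mathcal{F}_z$ are multi-zone transformation functions, designed for modeling multiple-space composition; their formal definitions are given in the next section.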
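The recurrent update described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a GRU-style element-wise interpolation between the previous state and the candidate, a tanh candidate, and a sigmoid gate. The names `gated_step`, `f_candidate`, and `f_gate` are hypothetical, and the two transformation functions are stood in for by plain linear maps rather than the multi-zone functions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_step(x_t, h_prev, f_candidate, f_gate):
    """One recurrent step with a single gate (assumed GRU-style form)."""
    h_tilde = np.tanh(f_candidate(x_t, h_prev))  # candidate activation
    z_t = sigmoid(f_gate(x_t, h_prev))           # gate values in (0, 1)
    # Assumption: the gate interpolates element-wise between the
    # previous hidden state and the candidate activation.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Stand-in transformation functions: plain linear maps on [x_t; h_prev].
# These are placeholders, NOT the multi-zone transformation functions.
rng = np.random.default_rng(0)
d_x, d_h = 3, 4
W_c = rng.normal(size=(d_x + d_h, d_h))
W_z = rng.normal(size=(d_x + d_h, d_h))
f_c = lambda x, h: np.concatenate([x, h]) @ W_c
f_z = lambda x, h: np.concatenate([x, h]) @ W_z

h = np.zeros(d_h)
for x in rng.normal(size=(5, d_x)):  # a toy sequence of 5 inputs
    h = gated_step(x, h, f_c, f_z)
print(h.shape)  # (4,)
```

Passing the two transformation functions as arguments keeps the gating logic separate from how the transformations are built, so a multi-zone version could be swapped in without touching `gated_step`.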
MZU extends the standard RNN by replacing the linear transformation (i.e., the $f$ function) on $x_t$ and $h_{t-1}$ with multi-zone transformation functions, i.e., the two $\mathcal{F}$ functions in Eq. (2) and (3), while maintaining a simplified gating mechanism to control the information flow. More complex gating mechanisms, such as those of GRU and LSTM, could also be designed to further enhance MZU.
Multi-Zone Transformation Function
The multi-zone transformation function, i.e., the $\mathcal{F}$ function (in Eq. (2) and (3)), is designed to enhance both the transformation module (i.e., the candidate activation $\tilde{h}_t$) and the gating module (i.e., the gate $z_t$). The function takes only the embedding $x_t$ and the previous hidden state $h_{t-1}$ as inputs, where $x_t \in \mathbb{R}^{d_x}$ and $h_{t-1} \in \mathbb{R}^{d_h}$. Figure 1 shows the framework of the function, which consists of three components: Zone Generation, Zone Composition, and Zone Aggregation.
Zone Generation.
Zone Composition.
Then we conduct a multi-zone composition with fully connected interactions on the generated zones, and output new zones, whose number may or may not equal the number of generated zones. In this step, the generated zones can fully interact with each other and produce new representations through our specially designed composition models. This composition can be designed flexibly; in particular, we provide three models, i.e., a self-attention based model, a graph convolutional model, and a capsule networks based model, which are formally described later.
Zone Aggregation.
We further conduct a deep abstraction on the newly generated zones and aggregate them into the final representation in this step. In particular, we apply a position-wise feed-forward network (FFN) [36] to each new zone to conduct a deep abstraction and feature extraction, producing one transformed zone per input zone. The FFN consists of two linear transformations with a ReLU activation in between:
$$\mathrm{FFN}(v) = \max(0,\, v W_1 + b_1)\, W_2 + b_2 \tag{5}$$
where $v$ is a newly generated zone, $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are biases, and the inner dimension of the FFN is the filter size. The linear transformations are the same across different zones. After that, we obtain the deeply transformed zones. Then, we aggregate the representations by concatenating these zones and applying a linear transformation to obtain the final representation.
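The aggregation step described above (a shared position-wise FFN per zone, then concatenation and a linear map) can be sketched as follows. This is a minimal illustration under assumed shapes: the function name `aggregate_zones`, the parameter names, and the toy sizes are all hypothetical and not taken from the paper.

```python
import numpy as np

def aggregate_zones(zones, W1, b1, W2, b2, W_o):
    """zones: (n, d_z). Apply one shared FFN to every zone, then
    concatenate the transformed zones and project them linearly."""
    hidden = np.maximum(0.0, zones @ W1 + b1)  # ReLU, shape (n, d_ff)
    transformed = hidden @ W2 + b2             # shape (n, d_z)
    flat = transformed.reshape(-1)             # concatenation, (n * d_z,)
    return flat @ W_o                          # final representation, (d_h,)

rng = np.random.default_rng(0)
n, d_z, d_ff, d_h = 4, 8, 16, 32  # toy sizes chosen for illustration
zones = rng.normal(size=(n, d_z))
W1, b1 = rng.normal(size=(d_z, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_z)), np.zeros(d_z)
W_o = rng.normal(size=(n * d_z, d_h))

out = aggregate_zones(zones, W1, b1, W2, b2, W_o)
print(out.shape)  # (32,)
```

Because the same `W1`, `b1`, `W2`, `b2` are applied to every row, the FFN parameter count does not grow with the number of zones; only the final projection `W_o` sees all zones at once.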
Models for Zone Composition
Note that how to conduct full interactions on the multiple zones and generate the new zones, i.e., the “Zone Composition” step (Figure 1(2)), is crucial. This step can be designed flexibly; in this paper, we propose three effective approaches:
1) We apply self-attention [36] to compute the relevance between each pair of zones, and generate new zones as a weighted sum of the linearly transformed zones, since self-attention has proved effective at drawing global dependencies between representations at different positions.
2) We exploit a Graph Convolutional Network (GCN) [18, 19] to conduct interactions on the zones by taking the zones as nodes of a graph, and generate new zones by absorbing information from their neighbor nodes along weighted edges. As a special form of Laplacian smoothing [22], GCN takes the graph connectivity structure as a filter to perform neighborhood mixing, and is therefore well suited to zone composition.
3) We utilize capsule networks [13, 33, 14], which have strong capabilities in representation composition and have been successfully applied to some NLP tasks [43, 37, 42]. In particular, we model part-whole relationships between the old zones and the new zones to aggregate diverse and useful information from the old ones, and achieve better representation composition through iterations of dynamic routing [33].
Self-attention based MZU (MZU-ATT)
We first project the zones into queries, keys, and values of dimension $d_k$, each with a linear transformation parameterized by different weights. Then we compute the dot products of each query with all keys, divide each by $\sqrt{d_k}$, and apply a softmax function to obtain the weights on the values. In practice, we compute the attention function on the set of queries simultaneously, packed together into a matrix $Q$. The keys and values are also packed together into matrices $K$ and $V$. Then the matrix of outputs is computed as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \tag{6}$$
whose rows are the new zones.
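The scaled dot-product attention over zones described above can be sketched as follows. The sketch follows the standard formulation of [36]; the function name `self_attention_zones`, the weight names, and the toy sizes are hypothetical choices for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_zones(zones, W_q, W_k, W_v):
    """zones: (n, d_z). Returns the new zones (n, d_k) and the
    attention weights (n, n) between every pair of zones."""
    Q, K, V = zones @ W_q, zones @ W_k, zones @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_z, d_k = 4, 8, 8  # toy sizes chosen for illustration
zones = rng.normal(size=(n, d_z))
W_q, W_k, W_v = (rng.normal(size=(d_z, d_k)) for _ in range(3))

new_zones, weights = self_attention_zones(zones, W_q, W_k, W_v)
print(new_zones.shape)  # (4, 8)
```

Each row of `weights` sums to one, so every new zone is a convex combination of the value-projected zones.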
Graph Convolutional MZU (MZU-GCN)
We take the zones as nodes and construct a graph $G = (V, E)$, where $V$ and $E$ represent the sets of nodes and edges, respectively. Given two nodes, we compute the semantic relevance between them and use it as the weight of the edge connecting them, which is calculated by:
(7) 
where $\|\cdot\|$ stands for the L2 norm. Then we obtain the adjacency matrix $A$ (self-connections are added after this computation):
(8) 
The degree matrix is represented as $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i = \sum_j A_{ij}$. Meanwhile, we stack the representations of all nodes into a matrix. Then we obtain the new node matrix, which incorporates the structured interactive information, through one layer of convolution as follows:
(9) 
where the layer is parameterized by a weight matrix whose output size is the dimension of this convolutional layer. It is a special form of Laplacian smoothing [22], which computes the new representation of a node as the weighted average of itself and its neighbors.
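One way to realize the graph convolution over zones described above is sketched below. Several details are assumptions rather than the paper's stated choices: edge weights are taken to be cosine similarities clipped at zero (so that degrees stay positive), the propagation uses the symmetric normalization of [19], and a ReLU follows the layer. The function name `gcn_zone_composition` and the toy sizes are hypothetical.

```python
import numpy as np

def gcn_zone_composition(nodes, W, eps=1e-12):
    """nodes: (n, d) zone representations used as graph nodes."""
    norms = np.linalg.norm(nodes, axis=1, keepdims=True)       # L2 norms
    cos = (nodes @ nodes.T) / np.maximum(norms * norms.T, eps)
    edges = np.clip(cos, 0.0, None)       # assumption: non-negative weights
    A = edges + np.eye(nodes.shape[0])    # add self-connections afterwards
    d = A.sum(axis=1)                     # node degrees (all positive)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A @ D_inv_sqrt   # assumed symmetric normalization
    return np.maximum(0.0, A_hat @ nodes @ W), A_hat  # assumed ReLU

rng = np.random.default_rng(0)
n, d = 4, 8  # toy sizes chosen for illustration
nodes = rng.normal(size=(n, d))
W = rng.normal(size=(d, d))

new_nodes, A_hat = gcn_zone_composition(nodes, W)
print(new_nodes.shape)  # (4, 8)
```

The product `A_hat @ nodes` mixes each node with its neighbors in proportion to the normalized edge weights, which is the neighborhood-mixing behavior the text attributes to Laplacian smoothing.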
Capsule Networks based MZU (MZU-CAP)
We take the zones as low-level capsules $u_i$ and aim to output high-level capsules (i.e., new zones) $v_j$. We aggregate information from the low-level capsules and conduct representation composition to generate the high-level capsules by modeling part-whole relationships between the two sets of capsules.
We first initialize the logits $b_{ij}$ to 0; these are the log prior probabilities that capsule $u_i$ should be coupled to capsule $v_j$. We then generate the “prediction vectors” $\hat{u}_{j|i}$ from the low-level capsules as follows:
$$\hat{u}_{j|i} = W_{ij} u_i \tag{10}$$
where $W$ is the weight tensor. (In practice, we share the transformation matrix of each output capsule among all the input capsules, that is, $W_{ij} = W_j$.)
Then we aggregate information from the low-level capsules and conduct representation composition to generate the high-level capsules through iterations of dynamic routing on the prediction vectors. At each iteration, we first compute the coupling coefficients $c_{ij}$ between low-level capsule $u_i$ and all high-level capsules by a “softmax” over the logits $b_{ij}$:
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})} \tag{11}$$
After that, we generate $s_j$ by a weighted sum over the prediction vectors, with the coupling coefficients as weights:
$$s_j = \sum_{i} c_{ij}\, \hat{u}_{j|i} \tag{12}$$
Then we take $s_j$ as input and output capsule $v_j$ by squashing:
$$v_j = \mathrm{squash}(s_j) \tag{13}$$
where $\mathrm{squash}(\cdot)$ [33] is a nonlinear function that ensures short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1.
Then the logits $b_{ij}$ are iteratively refined by measuring the agreement (scalar product) between the current output $v_j$ and the prediction $\hat{u}_{j|i}$:
$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j \tag{14}$$
After the final iteration of dynamic routing, we obtain the final capsules (zones) $v_j$.
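The routing procedure described above can be sketched as follows, using the dynamic routing formulation of [33] with one transformation matrix per output capsule shared across all input capsules. The function names, the `eps` stabilizer inside `squash`, and the toy sizes are hypothetical choices for illustration.

```python
import numpy as np

def softmax(a, axis):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def squash(s, eps=1e-9):
    """Shrink short vectors toward 0 and long vectors to just below 1."""
    sq = np.sum(s * s, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(U, W, num_iters=3):
    """U: (n, d_in) low-level capsules; W: (m, d_in, d_out), one matrix
    per output capsule, shared across all input capsules."""
    n, m = U.shape[0], W.shape[0]
    u_hat = np.einsum('id,jde->ije', U, W)     # predictions, (n, m, d_out)
    b = np.zeros((n, m))                       # routing logits start at 0
    for _ in range(num_iters):
        c = softmax(b, axis=1)                 # couple each input to outputs
        s = np.einsum('ij,ije->je', c, u_hat)  # weighted sums, (m, d_out)
        v = squash(s)                          # high-level capsules
        b = b + np.einsum('ije,je->ij', u_hat, v)  # agreement update
    return v

rng = np.random.default_rng(0)
n, m, d_in, d_out = 4, 2, 8, 8  # toy sizes chosen for illustration
U = rng.normal(size=(n, d_in))
W = rng.normal(size=(m, d_in, d_out)) * 0.1

v = dynamic_routing(U, W, num_iters=3)
print(v.shape)  # (2, 8)
```

The logit update after the last iteration is not used; it is kept inside the loop only to keep the code short.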
Deep Transition MZU
Like standard RNNs, the MZU is generic and can be extended to deep MZUs. Since recent studies [31, 3, 27] have demonstrated the superiority of deep transition RNNs over deep stacked RNNs, we extend the MZU to a more powerful deep transition MZU. At each time step, the entire deep transition block is composed of one MZU cell followed by several transition MZU (T-MZU) cells. As a special case of MZU, T-MZU only has the “state” as input, i.e., the embedding input of T-MZU is a zero vector. Within the current time step, the “state” output of one MZU/T-MZU cell is used as the “state” input of the next T-MZU cell, and the “state” output of the last T-MZU cell is carried over as the “state” input of the MZU cell at the next time step.
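For a T-MZU cell, each hidden state $h_t^{l}$ at time step $t$ and transition depth $l$ is computed as follows:
$$h_t^{l} = (1 - z_t^{l}) \odot h_t^{l-1} + z_t^{l} \odot \tilde{h}_t^{l} \tag{15}$$
$$\tilde{h}_t^{l} = \tanh(\mathcal{F}_h^{l}(h_t^{l-1})) \tag{16}$$
where the gate $z_t^{l}$ is computed as:
$$z_t^{l} = \sigma(\mathcal{F}_z^{l}(h_t^{l-1})) \tag{17}$$
where $\mathcal{F}_h^{l}$ and $\mathcal{F}_z^{l}$ are Multi-Zone Transformation functions, as described in Section Multi-Zone Transformation Function.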
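The chaining of cells in the deep transition block can be sketched as follows. The sketch only shows how the “state” flows through the cells; `deep_transition_step` and `toy_cell` are hypothetical names, and `toy_cell` is a trivial stand-in rather than an MZU cell.

```python
import numpy as np

def deep_transition_step(x_t, h_prev, cell, transition_cells):
    """One time step: the first cell sees the input embedding, and each
    transition cell sees only the state (its embedding input is zero)."""
    h = cell(x_t, h_prev)
    zero_input = np.zeros_like(x_t)
    for t_cell in transition_cells:
        h = t_cell(zero_input, h)
    return h

def toy_cell(x, h):
    # Stand-in recurrent cell used only to make the sketch runnable.
    return np.tanh(0.5 * h + x.mean())

xs = np.array([[1.0, 0.0], [0.0, -1.0], [0.5, 0.5]])  # toy input sequence
h = np.zeros(3)
for x_t in xs:
    # Transition depth 1: one cell plus one transition cell per step.
    h = deep_transition_step(x_t, h, toy_cell, [toy_cell])
print(h.shape)  # (3,)
```

The state returned for one time step is passed back in as `h_prev` for the next, which matches the carry-over from the last transition cell to the first cell of the following step.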
Regularization on Multiple Zones
We expect the representations in the multiple zones of the MZU to be as different as possible. Thus we propose a disagreement regularization on these zones, inspired by Li et al. (2018). This regularization is designed to maximize the cosine distance (i.e., negative cosine similarity) between zone pairs. Our objective is to enlarge the average cosine distance among all zone pairs, computed as:
(18) 
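The disagreement term described above can be sketched as the negative mean cosine similarity over zone pairs. This is an illustration under an assumption: the average is taken over all ordered pairs, including each zone paired with itself, which may differ from the paper's exact normalization. The function name `zone_disagreement` is hypothetical.

```python
import numpy as np

def zone_disagreement(zones, eps=1e-12):
    """zones: (n, d). Average cosine distance (negative cosine
    similarity) over all n * n ordered zone pairs (assumed averaging)."""
    norms = np.linalg.norm(zones, axis=1, keepdims=True)
    unit = zones / np.maximum(norms, eps)  # unit-length zones
    cos = unit @ unit.T                    # pairwise cosine similarities
    return -cos.mean()

identical = np.ones((3, 5))  # three identical zones
orthogonal = np.eye(4)       # four mutually orthogonal zones
print(zone_disagreement(identical))   # close to -1.0
print(zone_disagreement(orthogonal))  # -0.25
```

Identical zones give the smallest value, and the value grows as zones point in more different directions, so maximizing it pushes the zones apart.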
Training Objective.
Taking the regularization into account, the training objective of MZU is:
(19) 
where is the number of functions in the MZU, is the number of inputs (e.g., tokens), is a task specific training objective, and
is a hyperparameter used to balance the preference between two terms of the loss function.
Experiments
We verify the MZU on the character-level language modeling task and the aspect-based sentiment analysis task.
Character-level Language Modeling
We use two standard datasets, namely the Penn Treebank [26] and the larger Wikipedia dataset (text8) [25].
Penn Treebank. The Penn Treebank dataset is a collection of Wall Street Journal articles written in English. We follow the preprocessing procedure introduced in [29], and split the data into training, validation, and test sets consisting of 5.0M, 390K, and 440K characters, respectively.
The Wikipedia Corpus (text8). The text8 dataset consists of 100M characters extracted from the English Wikipedia. text8 contains only alphabetic characters and spaces, and thus has 27 symbols in total. To compare with previous work, we follow the preprocessing procedure in [29], and split the data into training, validation, and test sets consisting of 90M, 5M, and 5M characters, respectively.
Training Details.
We train the model using Adam [17] with an initial learning rate of 0.001. Each update uses a mini-batch of 256 examples. We use truncated backpropagation through time to approximate the gradients, with the truncation length set to 150. The norm of the gradient is clipped at 5.0. We apply layer normalization to our models, and apply dropout to the candidate activation to avoid overfitting. The dropout rates for the Penn Treebank task and the text8 task are set to 0.5 and 0.3, respectively. The hidden size and filter size are set to 800 and 1,000 for Penn Treebank, and to 1,536 and 3,072 for text8. The embedding size is 256. We share parameters of the $\mathcal{F}$ function across different depths (i.e., with its deep transition counterpart) for Penn Treebank. We measure the performance of checkpoints by evaluating bits per character (BPC) on the validation set, and test with the best one. BPC is the negative log-likelihood divided by the natural logarithm of 2; lower is better.
System Description.
We use MZU-ATT, MZU-GCN, and MZU-CAP to denote the self-attention based MZU, the graph convolutional MZU, and the capsule networks based MZU, respectively. “DT” stands for deep transition MZU cells, with the transition depth set to 1 by default, i.e., one MZU cell plus one T-MZU cell. For all MZU models, we apply 4 zones in the $\mathcal{F}$ functions, unless otherwise mentioned. For all MZU-CAP models, we set the number of output capsules to 2 and use 3 routing iterations, according to preliminary experiments. We set $\lambda$ to 1.0 in Eq. 19 for all experiments. From preliminary experiments with different settings of $\lambda$, we find that setting $\lambda > 0$ (promoting different zones) consistently improves the performance (by 0.005 to 0.007 BPC) on both shallow and deep MZU models, while setting $\lambda < 0$ (promoting similar zones), e.g., $-0.5$ or $-1.0$, sharply degrades the performance (by 0.03 to 0.05 BPC). This indicates that modeling the multi-zone transformation affects the state transition, and consequently affects the performance.
Table 1: BPC on the Penn Treebank test set.
Model  Size  BPC
LSTM [21]  –  1.36
Zoneout LSTM [21]  –  1.27
2-Layers LSTM [47]  6.6M  1.243
LN HM-LSTM [7]  –  1.24
HyperLSTM [12]  14.4M  1.219
NASCell [47]  16.3M  1.214
FS-LSTM-4 [30]  7.2M  1.190
res-IndRNN [23]  –  1.19
MZU-ATT+DT (ours)  6.5M  1.189
MZU-GCN+DT (ours)  6.4M  1.183
MZU-CAP+DT (ours)  7.2M  1.181
Table 2: BPC on the text8 test set.
Model  Size  BPC
Large td-LSTM [41]  –  1.49
MI-LSTM [39]  –  1.44
BN-LSTM [8]  –  1.36
Zoneout LSTM [21]  –  1.336
LN HM-LSTM [7]  35M  1.29
Large RHN [46]  45M  1.27
Large mLSTM [20]  45M  1.27
MZU-ATT+DT (ours)  32M  1.268
MZU-GCN+DT (ours)  32M  1.260
MZU-CAP+DT (ours)  41M  1.249
Transformer [2]  235M  1.13
Transformer-XL [9]  277M  1.08
Main Results.
Table 1 shows results on the Penn Treebank test set, in comparison with existing competitive approaches. Our MZU models outperform previous competitive approaches, including deeper RNNs, e.g., the 4-layer FS-LSTM and the 11-layer res-IndRNN, without using more parameters. Among the three MZU models, MZU-GCN+DT and MZU-CAP+DT achieve better performance than MZU-ATT+DT, and MZU-CAP+DT achieves the best BPC of 1.181. MZU-CAP can model part-whole relationships between the old zones and the new zones to aggregate useful information, through multiple iterations of routing-by-agreement. Such multi-round refinement lets the new zones contain more accurate representations, and therefore MZU-CAP performs better than the other MZUs.
To demonstrate that MZUs work well across datasets, we conduct experiments on the larger text8 dataset. As shown in Table 2, even with fewer parameters, our MZU-ATT+DT, MZU-GCN+DT, and MZU-CAP+DT outperform several previous deep RNNs, including the 10-depth Large RHN and the stacked Large mLSTM. The exceptions are the recently proposed Transformer-based models [2, 9], which are trained with much longer context (512), deeper layers (64-layer Transformer and 24-layer Transformer-XL), and a larger number of parameters. Explicitly using longer context indeed improves the performance of language modeling. In this work, however, we mainly focus on boosting the transition functions of RNNs (i.e., functions that only operate on $x_t$ and $h_{t-1}$). Our approaches can also be applied to Transformer-based architectures, e.g., for enhancing the multi-head attention.
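Table 3: BPC of the deep MZU models (+DT) on Penn Treebank with different numbers of zones.
#Zones  MZU-ATT  MZU-GCN  MZU-CAP
2  1.191  1.187  1.186
4  1.189  1.183  1.181
8  1.200  1.189  1.181
16  1.208  1.185  1.186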
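Table 4: Ablation results (BPC) for shallow (“NoDT”) and deep (“DT”) models. “Regular” replaces the multi-zone transformation with a regular linear transformation in the gating module (“Gate”) or the transformation module (“Trans.”).
Model  NoDT  DT
GRU  1.269  1.208
MZRNN (self-attention)  1.214  1.189
  Regular-Gate  1.236  1.207
  Regular-Trans.  1.247  1.208
MZRNN (graph convolutional)  1.196  1.183
  Regular-Gate  1.224  1.197
  Regular-Trans.  1.225  1.191
MZRNN (capsule networks)  1.192  1.181
  Regular-Gate  1.219  1.190
  Regular-Trans.  1.211  1.188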
The Number of Zones.
We investigate the impact of the number of zones on MZUs; in particular, we conduct experiments with the three deep MZU models (MZU-ATT+DT, MZU-GCN+DT, and MZU-CAP+DT) on Penn Treebank. For each model, we test 2, 4, 8, and 16 zones in the $\mathcal{F}$ functions. As the dimension of each zone is determined by the hidden size and the number of zones, more zones do not lead to more parameters. As shown in Table 3, using 4 zones achieves the best performance for all MZU models (tied with 8 zones for MZU-CAP). Additionally, we observe that MZU-ATT is more sensitive to the number of zones, and using too many zones (e.g., 8 or 16) sharply degrades its performance. MZU-GCN and MZU-CAP are more robust in modeling multiple zones, benefiting from the powerful graph convolutional structure and the dynamic routing algorithm, respectively.
Ablation Study.
Since the $\mathcal{F}$ function works on the gating module (i.e., $z_t$) and the transformation module (i.e., $\tilde{h}_t$), we conduct an ablation study to investigate its effects on both modules. We show results in Table 4, where “Regular” stands for replacing the $\mathcal{F}$ function with the regular linear transformation on $x_t$ and $h_{t-1}$. As shown in Table 4, the $\mathcal{F}$ function is crucial for both the gating module (“Gate”) and the transformation module (“Trans.”), for both shallow (“NoDT”) and deep (“DT”) MZU models, since replacing either module with the Regular one sharply degrades the performance. We also list results of GRU and deep transition GRU [31] (i.e., “DT”) in Table 4. (We also apply dropout and layer normalization to the GRUs, and build them with the same settings as our MZUs.) Even with one module degenerated, our MZUs still outperform the GRUs, which are based only on single-space composition, in the shallow and deep settings, respectively.
About Length.
We investigate the effects of MZU models on sequences of different lengths, and compare them with GRU and deep transition GRU (GRU+DT) in Figure 2. The results of the shallow and deep models are in the left and right subfigures, respectively. As shown in Figure 2, on sequences of different lengths, shallow and deep MZU models consistently yield better performance than the GRU and the GRU+DT, respectively. The advantage is more obvious on long sequences (e.g., 90) for both shallow and deep MZU models, demonstrating the stronger capability of the MZU for long sequence processing.
Visualization Analysis.
We conduct a visualization analysis to demonstrate the superiority of the MZU (Figure 3(b)) over GRU (Figure 3(a)). We compute the relevance between each of the last five characters (i.e., ‘m’, ‘e’, ‘n’, ‘t’, ‘s’) and all characters before it. For example, the relevance between ‘m’ and the first character ‘p’ is calculated as the similarity between the candidate activation of ‘m’ (which stands for the transformation at the current step) and the hidden state of ‘p’. The higher the relevance, the whiter the corresponding grayscale cell. From Figure 3(a) (GRU), we find that 1) the gray distribution is not sharply contoured; and 2) the positions with the highest relevance to each computed character usually occur at the nearest position to it or at positions with the same surface character, such as the boxes marked in blue. In Figure 3(b) (MZU), the gray distribution is more sharply contoured, e.g., in the three regions marked by green ellipses. Additionally, the MZU captures some philological segments, such as the suffixes “ments” and “ed”, and parts of the word “inc.”, which are related to the word “agreements”. Furthermore, we investigate how multiple zones affect the performance, and show the zone-specific relevance between the last character ‘s’ and all characters before it in Figure 3(c) (MZU). We find that different zones indeed capture different philological information. For example, “zone1” and “zone3” capture some suffixes (e.g., “ment”, “ed”), “zone2” captures the same surface character ‘s’ at long distance, and “zone4” captures not only the nearest ‘t’ to ‘s’, but also the long-distance and frequent collocation “ts”.
Table 5: An example sentence with two aspect terms and their sentiment labels.
Sentence
Aspect  The appetizers  service
Sentiment  Neutral  Negative
Table 6: Accuracy (mean ± standard deviation over five training runs) on the SemEval 2014 term-based datasets. Results of existing systems (marked with “*”) are cited from [40].
Models  Restaurant DS  Restaurant HDS  Laptop DS  Laptop HDS
TD-LSTM* [34]  73.44±1.17  56.48±2.46  62.23±0.92  46.11±1.89
ATAE-LSTM* [38]  73.74±3.01  50.98±2.27  64.38±4.52  40.39±1.30
GCAE* [40]  77.28±0.32  56.73±0.56  69.14±0.32  47.06±2.45
GRU (with DT)  78.25±0.51  56.17±0.72  71.01±0.73  47.03±1.15
MZRNN (with DT)  79.40±0.23  57.59±0.70  72.45±0.68  47.96±1.12
Aspect Based Sentiment Analysis
Task Description.
Since RNNs are widely used for sequence modeling and classification tasks, we choose the well-known and challenging ABSA task as a case study to demonstrate the generalizability of the MZU across tasks. We test on the aspect-term based task of the SemEval 2014 Task 4 [32] with two datasets that contain domain-specific customer reviews for restaurants and laptops, each of which has four sentiment labels (i.e., Positive, Negative, Neutral, and Conflict). For this task, the goal is to infer the sentiment polarity over the aspect (i.e., the given term), which is a subsequence of the sentence. The example in Table 5 shows the customer’s different attitudes towards two terms: “The appetizers” and “service”. For each dataset, we follow [40, 24] and construct a hard subdataset (named “HDS”) extracted from the full dataset (named “DS”). In “HDS”, all sentences contain multiple aspects, each of which corresponds to a different sentiment label. On this subdataset, we copy each sentence $n$ times, where $n$ is the number of aspects in the sentence.
Models.
We compare our MZU with the deep transition GRU and several previous competitive systems. For the MZU, we use MZRNN with deep transition. We treat ABSA as a sentence-level classification task. Therefore, for both the GRU model and the MZRNN model, we follow [34] and first encode the sentence as a standard BiRNN encoder does. Then we conduct mean pooling over the output hidden states to generate a sentence vector. Next, we concatenate the sentence vector with the given aspect embedding (the mean of the embeddings for multi-word aspects). Finally, we predict the sentiment polarity from the concatenated representation with a softmax layer.
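The classification head described above (mean pooling, concatenation with the aspect embedding, softmax) can be sketched as follows. The encoder is omitted: `hidden_states` stands in for the BiRNN outputs, and the function name `classify_aspect`, the parameter names, and the toy sizes are hypothetical.

```python
import numpy as np

def classify_aspect(hidden_states, aspect_embeddings, W, b):
    """hidden_states: (T, d_h) encoder outputs for one sentence.
    aspect_embeddings: (k, d_e) embeddings of the k aspect words."""
    sentence_vec = hidden_states.mean(axis=0)    # mean pooling over time
    aspect_vec = aspect_embeddings.mean(axis=0)  # mean for multi-word aspects
    features = np.concatenate([sentence_vec, aspect_vec])
    logits = features @ W + b
    logits = logits - logits.max()               # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum()                       # distribution over labels

rng = np.random.default_rng(0)
T, d_h, k, d_e, num_labels = 6, 10, 2, 5, 4  # toy sizes for illustration
hidden_states = rng.normal(size=(T, d_h))
aspect_embeddings = rng.normal(size=(k, d_e))
W = rng.normal(size=(d_h + d_e, num_labels))
b = np.zeros(num_labels)

probs = classify_aspect(hidden_states, aspect_embeddings, W, b)
print(probs.shape)  # (4,)
```

The four outputs correspond to the four sentiment labels of the task; the predicted polarity is the label with the highest probability.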
Results.
As shown in Table 6, our implementation of the deep transition GRU outperforms most previous LSTM-based systems (i.e., TD-LSTM [34] and ATAE-LSTM [38]) and the gated CNN-based system GCAE [40] (it is comparable on the “HDS” subdatasets), demonstrating that our deep transition GRU is a strong baseline. When incorporating the MZU, our MZRNN further improves the mean accuracy over the deep transition GRU on all four datasets, by about 0.9 to 1.4 points. These results demonstrate the superiority of the MZU and its generalizability across tasks.
Related Work
This work is inspired by the idea of processing attention on multiple subspaces with different heads [36]. Different from this multi-channel style of operation, we conduct fully connected interactions on multiple zones to extract useful information. In particular, we propose three effective models to conduct interactions and representation composition.
Some researchers [31, 46, 11, 30] boost the transition function with deep architectures, such as adding intermediate layers to increase the transition depth. Different from these works, we boost the performance of RNNs by modeling multiple-zone composition for the transition function. Additionally, our MZU is generic and can be easily extended to deep MZUs.
Some researchers devise models to capture specific features from the processed history sequence, including incorporating an attention mechanism into LSTM cells [44] or extending the GRU with a recurrent attention unit [45]. Different from these works, we focus on boosting the transition function by modeling multiple-zone composition, which only operates on $x_t$ and $h_{t-1}$ at each time step.
Some studies [28, 4] investigate regularization techniques for optimizing LSTM-based models, such as weight dropout, variational dropout, weight tying, temporal activation regularization, and past decode regularization. These techniques can be used to further optimize MZU-based models in the future.
Conclusion
We propose a generic Multi-Zone Unit for RNNs along with three effective variants (i.e., the self-attention based MZU-ATT, the graph convolutional MZU-GCN, and the capsule networks based MZU-CAP), which boost the state transition function by modeling multiple-space composition. To further enhance MZUs, we propose an effective regularization objective to promote diversity among the multiple zones. Experiments on two character-level language modeling datasets show that our MZUs can automatically capture rich linguistic information and improve on previous competitive RNNs. Experimental results on the ABSA task further demonstrate the superiority and generalizability of the MZU.
Acknowledgments
Yang Liu is supported by the National Key R&D Program of China (No. 2017YFB0202204), National Natural Science Foundation of China (No. 61761166008), Beijing Advanced Innovation Center for Language Resources (No. TYR17002).
References
[1] (2018) Contextual string embeddings for sequence labeling. In ACL.
[2] (2018) Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.
[3] (2017) Deep architectures for neural machine translation. In WMT.
[4] (2018) Improved language modeling by decoding the past. CoRR abs/1808.05908.
[5] (2018) The best of both worlds: combining recent advances in neural machine translation. In ACL.
[6] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[7] (2017) Hierarchical multiscale recurrent neural networks. In ICLR.
[8] (2017) Recurrent batch normalization. In ICLR.
[9] (2019) Transformer-XL: attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
[10] (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
[11] (2016) Adaptive computation time for recurrent neural networks. CoRR abs/1603.08983.
[12] (2017) Hypernetworks. In ICLR.
[13] (2011) Transforming auto-encoders. In ICANN.
[14] (2018) Matrix capsules with EM routing. In ICLR.
[15] (1997) Long short-term memory. Neural computation.
[16] (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv.
[17] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[18] (2016) Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.
[19] (2017) Semi-supervised classification with graph convolutional networks. In ICLR.
[20] (2016) Multiplicative LSTM for sequence modelling. In ICLR (Workshop).
[21] (2017) Zoneout: regularizing RNNs by randomly preserving hidden activations. In ICLR.
[22] (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
[23] (2018) Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In CVPR.
[24] (2019) A novel aspect-guided deep transition model for aspect based sentiment analysis. In EMNLP.
[25] (2011) Large text compression benchmark.
[26] (1993) Building a large annotated corpus of English: the Penn Treebank. Computational linguistics 19 (2).
[27] (2019) DTMT: A novel deep transition architecture for neural machine translation. AAAI.
[28] (2018) Regularizing and optimizing LSTM language models. In ICLR.
[29] (2012) Subword language modeling with neural networks.
[30] (2017) Fast-slow recurrent neural networks. In NIPS.
[31] (2014) How to construct deep recurrent neural networks. In ICLR.
[32] (2014) SemEval-2014 task 4: aspect based sentiment analysis. In SemEval.
[33] (2017) Dynamic routing between capsules. In NIPS.
[34] (2016) Effective LSTMs for target-dependent sentiment classification. In COLING.
[35] (2018) Get the point of my utterance! learning towards effective responses with multi-head attention mechanism. In IJCAI.
[36] (2017) Attention is all you need. In NIPS.
[37] (2018) Recurrent capsule network for relations extraction: A practical application to the severity classification of coronary artery disease. CoRR abs/1807.06718.
[38] (2016) Attention-based LSTM for aspect-level sentiment classification. In EMNLP.
[39] (2016) On multiplicative integration with recurrent neural networks. In NIPS.
[40] (2018) Aspect based sentiment analysis with gated convolutional networks. In ACL.
[41] (2016) Architectural complexity measures of recurrent neural networks. In NIPS.
[42] (2019) Multi-labeled relation extraction with attentive capsule network. In AAAI.
[43] (2018) Investigating capsule networks with dynamic routing for text classification. In EMNLP.
[44] (2018) Long short-term attention. CoRR abs/1810.12752.
[45] (2018) Recurrent attention unit. CoRR abs/1810.12754.
[46] (2017) Recurrent highway networks. In ICML.
[47] (2017) Neural architecture search with reinforcement learning. In ICLR.