Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons

09/04/2019 ∙ by Jie Hao, et al. ∙ Florida State University ∙ Tencent

Recent studies have shown that a hybrid of self-attention networks (SANs) and recurrent neural networks (RNNs) outperforms both individual architectures, while not much is known about why the hybrid models work. With the belief that modeling hierarchical structure is an essential complementarity between SANs and RNNs, we propose to further enhance the strength of hybrid models with an advanced variant of RNNs, Ordered Neurons LSTM (ON-LSTM), which introduces a syntax-oriented inductive bias to perform tree-like composition. Experimental results on the benchmark machine translation task show that the proposed approach outperforms both individual architectures and a standard hybrid model. Further analyses on targeted linguistic evaluation and logical inference tasks demonstrate that the proposed approach indeed benefits from better modeling of hierarchical structure.




1 Introduction

Self-attention networks (Sans, Lin:2017:ICLR) have advanced the state of the art on a variety of natural language processing (NLP) tasks, such as machine translation Vaswani:2017:NIPS, semantic role labelling Tan:2018:AAAI, and language representations Devlin:2018:arxiv. However, a previous study empirically reveals that the hierarchical structure of the input sentence, which is essential for language understanding, is not well modeled by Sans Tran:2018:EMNLP. Recently, hybrid models that combine the strengths of Sans and recurrent neural networks (Rnns) have outperformed both individual architectures on a machine translation task Chen:2018:ACL. We attribute the improvement to Rnns compensating for the limited ability of Sans to represent hierarchical structure, which is exactly the strength of Rnns Tran:2018:EMNLP.

Starting with this intuition, we propose to further enhance the representational power of hybrid models with an advanced Rnns variant, Ordered Neurons Lstm (On-Lstm, Shen:2019:ICLR). On-Lstm is better at modeling hierarchical structure because it introduces a syntax-oriented inductive bias, which enables Rnns to perform tree-like composition by controlling the update frequency of neurons. Specifically, we stack a Sans encoder on top of an On-Lstm encoder (cascaded encoder), so that the Sans encoder can extract richer representations from input augmented with structural context. To reinforce the strength of modeling hierarchical structure, we propose to simultaneously expose both types of signals by explicitly combining the outputs of the Sans and On-Lstm encoders.

We validate our hypothesis across a range of tasks, including machine translation, targeted linguistic evaluation, and logical inference. While machine translation is a benchmark task for deep learning models, the last two tasks focus on evaluating how much structural information is encoded in the learned representations. Experimental results show that the proposed approach consistently improves performance on all tasks, and that modeling hierarchical structure is indeed an essential complementarity between Sans and Rnns.

The contributions of this paper are:

  • We empirically demonstrate that a better modeling of hierarchical structure is an essential strength of hybrid models over the vanilla Sans.

  • Our study shows that augmenting Rnns with ordered neurons Shen:2019:ICLR produces promising improvements on machine translation; the lack of such evidence is one potential criticism of On-Lstm.

2 Approach

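Partially motivated by wang:2016:COLING and Chen:2018:ACL, we stack a Sans encoder on top of an Rnns encoder to form a cascaded encoder. In the cascaded encoder, hierarchical structure modeling is enhanced in the bottom Rnns encoder, based on which the Sans encoder is able to extract representations with richer hierarchical information. Let $\mathbf{X} = \{x_1, \dots, x_N\}$ be the input sequence; the representation of the cascaded encoder is calculated by

$$\mathbf{H}^{K}_{\mathrm{Rnns}} = \mathrm{Enc}_{\mathrm{Rnns}}(\mathbf{X})$$
$$\mathbf{H}^{L}_{\mathrm{Sans}} = \mathrm{Enc}_{\mathrm{Sans}}(\mathbf{H}^{K}_{\mathrm{Rnns}})$$

where $\mathrm{Enc}_{\mathrm{Rnns}}(\cdot)$ is a $K$-layer Rnns encoder that reads the input sequence, and $\mathrm{Enc}_{\mathrm{Sans}}(\cdot)$ is an $L$-layer Sans encoder that takes the output of the Rnns encoder as input.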

In this work, we replace the standard Rnns with recently proposed On-Lstm for better modeling of hierarchical structure, and directly combine the two encoder outputs to build even richer representations, as described below.

Modeling Hierarchical Structure with Ordered Neurons

On-Lstm introduces a new syntax-oriented inductive bias, Ordered Neurons, which enables Lstm models to perform tree-like composition without breaking their sequential form Shen:2019:ICLR. Ordered neurons enable dynamic allocation of neurons to represent different time-scale dependencies by controlling the update frequency of neurons. The assumption behind ordered neurons is that some neurons always update more (or less) frequently than the others, and that order is pre-determined as part of the model architecture. Formally, On-Lstm introduces novel ordered neuron rules to update the cell state:

$$\omega_t = \tilde{f}_t \circ \tilde{i}_t$$
$$\hat{f}_t = f_t \circ \omega_t + (\tilde{f}_t - \omega_t)$$
$$\hat{i}_t = i_t \circ \omega_t + (\tilde{i}_t - \omega_t)$$
$$c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t$$

where the forget gate $f_t$, input gate $i_t$ and state $\hat{c}_t$ are the same as in the standard Lstm hochreiter1997long. The master forget gate $\tilde{f}_t$ and the master input gate $\tilde{i}_t$ are newly introduced to control the erasing and the writing behaviors respectively. $\omega_t$ indicates the overlap, and when the overlap exists ($\exists k, \omega_{tk} > 0$), the corresponding neurons are further controlled by the standard gates $f_t$ and $i_t$.
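The cell-state update can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the update rules of Shen:2019:ICLR (overlap as the elementwise product of the two master gates, with the standard gates applied only inside the overlap). The function name `on_lstm_cell_update` and all gate values below are hypothetical and hand-picked; in a real model the gates are computed from the input and the previous hidden state.

```python
import numpy as np

def on_lstm_cell_update(c_prev, c_hat, f, i, f_master, i_master):
    """Combine standard and master gates into the new cell state (all elementwise)."""
    omega = f_master * i_master             # overlap of the two master gates
    f_hat = f * omega + (f_master - omega)  # standard forget gate acts only inside the overlap
    i_hat = i * omega + (i_master - omega)  # standard input gate acts only inside the overlap
    return f_hat * c_prev + i_hat * c_hat

# Toy example with 4 neurons; every value here is illustrative, not learned.
c_prev = np.array([1.0, 1.0, 1.0, 1.0])     # previous cell state
c_hat = np.array([0.5, 0.5, 0.5, 0.5])      # candidate state
f = np.full(4, 0.9)                         # standard forget gate
i = np.full(4, 0.1)                         # standard input gate
f_master = np.array([0.0, 0.0, 1.0, 1.0])   # keep history in the last two neurons
i_master = np.array([1.0, 1.0, 1.0, 0.0])   # write new content to the first three neurons

c_new = on_lstm_cell_update(c_prev, c_hat, f, i, f_master, i_master)
print(c_new)
```

In this toy setting the first two neurons are fully overwritten by the candidate state, the last neuron is copied unchanged, and only the third neuron (the overlap) is governed by the standard gates.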

An ideal master gate is in binary format such as $(0, 0, 1, 1, 1)$, which splits the cell state into two contiguous parts: a 0-part and a 1-part. The neurons corresponding to the 0-part and the 1-part are updated more and less frequently, respectively, so that the information in 0-part neurons is kept for only a few time steps, while the information in 1-part neurons lasts for more time steps. Since such binary gates are not differentiable, the goal becomes finding the splitting point $d$ (the index of the first 1 in the ideal master gate). To this end, Shen:2019:ICLR introduced a new activation function:

$$\mathrm{CU}(\cdot) = \mathrm{cumsum}(\mathrm{softmax}(\cdot))$$

softmax produces a probability distribution (e.g. $(0.1, 0.2, 0.4, 0.2, 0.1)$) to indicate the probability of each position being the splitting point $d$. cumsum is the cumulative probability distribution, in which the $k$-th probability refers to the probability that $d$ falls within the first $k$ positions. The output for the above example is $(0.1, 0.3, 0.7, 0.9, 1.0)$, in which different values denote different update frequencies. It also equals the probability of each position's value being 1 in the ideal master gate. Since this ideal master gate is binary, $\mathrm{CU}(\cdot)$ is the expectation of the ideal master gate.
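The cumulative-softmax activation, and the claim that it equals the expectation of the binary master gate, can be checked numerically. The sketch below is illustrative only: the helper names `softmax` and `cumax` and the five-position distribution are assumptions chosen for the example, not taken from any released implementation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cumax(z):
    """Cumulative sum of a softmax: a soft, monotonically increasing gate."""
    return np.cumsum(softmax(z))

# Illustrative distribution over the splitting point d (5 positions).
probs = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
gate = np.cumsum(probs)                  # approximately (0.1, 0.3, 0.7, 0.9, 1.0)

# Feeding log-probabilities through cumax recovers the same gate,
# because softmax(log p) = p for a normalized p.
gate_from_logits = cumax(np.log(probs))

# Expectation view: row d of `ideal_gates` is the binary gate whose first 1 is at index d.
n = len(probs)
ideal_gates = np.triu(np.ones((n, n)))
expected_gate = probs @ ideal_gates      # sum over d of p(d) * ideal gate for d
```

Here `expected_gate` and `gate` agree up to floating-point error, which is the sense in which the soft gate is the expectation of the ideal binary one.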

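Based on this activation function, the master gates are defined as

$$\tilde{f}_t = \mathrm{CU}_f(x_t, h_{t-1})$$
$$\tilde{i}_t = 1 - \mathrm{CU}_i(x_t, h_{t-1})$$

where $x_t$ is the current input and $h_{t-1}$ is the hidden state of the previous step. $\mathrm{CU}_f$ and $\mathrm{CU}_i$ are two individual activation functions with their own trainable parameters.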
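A minimal sketch of the two master gates follows. It assumes each activation applies the cumulative softmax to an affine function of the current input and the previous hidden state, as in Shen:2019:ICLR; the parameter names (`W_f`, `U_f`, `b_f`, and so on) and the random initial values are hypothetical.

```python
import numpy as np

def cumax(z):
    """Cumulative sum of a numerically stable softmax."""
    e = np.exp(z - np.max(z))
    return np.cumsum(e / e.sum())

def master_gates(x_t, h_prev, params):
    """Return (master forget gate, master input gate) for one time step."""
    z_f = params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"]
    z_i = params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"]
    f_master = cumax(z_f)          # non-decreasing, ends at 1
    i_master = 1.0 - cumax(z_i)    # non-increasing, ends at 0
    return f_master, i_master

# Randomly initialized stand-ins for trainable parameters (assumption for illustration).
rng = np.random.default_rng(0)
d_in, d_hid = 3, 5
params = {
    "W_f": rng.normal(size=(d_hid, d_in)),
    "U_f": rng.normal(size=(d_hid, d_hid)),
    "b_f": np.zeros(d_hid),
    "W_i": rng.normal(size=(d_hid, d_in)),
    "U_i": rng.normal(size=(d_hid, d_hid)),
    "b_i": np.zeros(d_hid),
}
f_master, i_master = master_gates(rng.normal(size=d_in), np.zeros(d_hid), params)
```

The opposite monotonic directions are what let the two gates split the neurons into a region that keeps history and a region that accepts new content.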

Short-Cut Connection

Inspired by previous work on exploiting deep representations Peters:2018:NAACL; Dou:2018:EMNLP, we propose to simultaneously expose both types of signals by explicitly combining them with a simple short-cut connection He:2016:CVPR.

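Similar to positional encoding injection in Transformer Vaswani:2017:NIPS, we add the output of the On-Lstm encoder to the output of the Sans encoder:

$$\widehat{\mathbf{H}} = \mathbf{H}^{K}_{\mathrm{On\text{-}Lstm}} + \mathbf{H}^{L}_{\mathrm{Sans}}$$

where $\mathbf{H}^{K}_{\mathrm{On\text{-}Lstm}}$ is the output of the On-Lstm encoder, and $\mathbf{H}^{L}_{\mathrm{Sans}}$ is the output of the Sans encoder.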
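The data flow of the cascaded encoder with the short-cut connection can be sketched as below. The two encoder functions are deliberately simple placeholders (a running sum and a single parameter-free attention step) and are assumptions for illustration, not the On-Lstm or Transformer layers used in the experiments; only the wiring in `cascaded_encoder_with_shortcut` reflects the description above.

```python
import numpy as np

def toy_rnn_encoder(x):
    """Placeholder for the recurrent encoder: a left-to-right running sum."""
    return np.cumsum(x, axis=0)

def toy_san_encoder(h):
    """Placeholder for the self-attention encoder: one parameter-free attention step."""
    scores = h @ h.T / np.sqrt(h.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ h

def cascaded_encoder_with_shortcut(x, rnn_encoder, san_encoder):
    """Run the recurrent encoder, feed its output to the attention encoder, then add both."""
    h_rnn = rnn_encoder(x)       # shape (N, d)
    h_san = san_encoder(h_rnn)   # shape (N, d)
    return h_rnn + h_san         # short-cut: expose both signals to the decoder

x = np.random.default_rng(0).normal(size=(4, 8))   # 4 tokens, model dimension 8
h_final = cascaded_encoder_with_shortcut(x, toy_rnn_encoder, toy_san_encoder)
```

Because the combination is a plain elementwise sum, both encoder outputs must share the same sequence length and hidden size, and the short-cut adds no parameters.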

3 Experiments

We chose machine translation, targeted linguistic evaluation and logical inference tasks to conduct experiments in this work. The first two tasks evaluate and analyze models on natural language, for which hierarchical structure is an inherent attribute. The third task aims to directly evaluate the effects of hierarchical structure modeling on an artificial language.

3.1 Machine Translation

For machine translation, we used the benchmark WMT14 English⇒German dataset. Sentences were encoded using byte-pair encoding (BPE) with a 32K word-piece vocabulary sennrich2016neural. We implemented the proposed approaches on top of Transformer Vaswani:2017:NIPS, a state-of-the-art Sans-based model for machine translation, followed the setting in previous work Vaswani:2017:NIPS to train the models, and reproduced their reported results. We tested both the Base and Big models, which differ in hidden size (512 vs. 1024), filter size (2048 vs. 4096) and number of attention heads (8 vs. 16). All the model variants were implemented on the encoder. The implementation details are introduced in Appendix A.1. Table 1 lists the results.

#  Encoder Architecture        Para.  BLEU
Base Model
1  6L Sans                     88M    27.31
2  6L Lstm                     97M    27.23
3  6L On-Lstm                  110M   27.44
4  6L Lstm + 4L Sans           104M   27.78
5  6L On-Lstm + 4L Sans        123M   28.27
6  3L On-Lstm + 3L Sans        99M    28.21
7      + Short-Cut             99M    28.37
Big Model
8  6L Sans                     264M   28.58
9  Hybrid Model + Short-Cut    308M   29.30

Table 1: Case-sensitive BLEU scores on the WMT14 English⇒German translation task. Significance over the conventional self-attention counterpart was tested by bootstrap resampling. “6L Sans” is the state-of-the-art Transformer model. “nL Lstm + mL Sans” denotes stacking n Lstm layers and m Sans layers sequentially. “Hybrid Model” denotes “3L On-Lstm + 3L Sans”.


Baselines

(Rows 1-3) Following Chen:2018:ACL, the three baselines are implemented with the same framework and optimization techniques as used in Vaswani:2017:NIPS. The difference between them is that they adopt Sans, Lstm and On-Lstm as basic building blocks respectively. As seen, the three architectures achieve similar performance, each drawing on its own representational strengths.

Hybrid Models

(Rows 4-7) We first followed Chen:2018:ACL to stack 6 Rnns layers and 4 Sans layers sequentially (Row 4), which outperforms each of the individual models. This is consistent with the results reported by Chen:2018:ACL. In this setting, the On-Lstm model significantly outperforms its Lstm counterpart (Row 5), and reducing the encoder depth still maintains the performance (Row 6). We attribute these results to the strength of On-Lstm in modeling hierarchical structure, which we believe is an essential complementarity between Sans and Rnns. In addition, the Short-Cut connection improves translation performance by providing richer representations (Row 7).

Stronger Baseline

(Rows 8-9) We finally conducted experiments on a stronger baseline – the Transformer-Big model (Row 8), which outperforms its Transformer-Base counterpart (Row 1) by 1.27 BLEU points. As seen, our model consistently improves performance over the stronger baseline by 0.72 BLEU points, demonstrating the effectiveness and universality of the proposed approach.

Assessing Encoder Strategies

#  Encoder Architecture     Para.    BLEU
1  3L On-Lstm → 3L Sans     99M      28.21
2  3L Sans → 3L On-Lstm     99M      27.39
3  8L Lstm                  102.2M   27.25
4  10L Sans                 100.6M   27.76

Table 2: Results for encoder strategies. Case-sensitive BLEU scores on the WMT14 English⇒German translation task. “A → B” denotes stacking B on top of A. The model in Row 1 is the hybrid model in Table 1.

We first investigate the effect of stacking order in the encoder. As a comparison with the proposed hybrid model, we stack a 3-layer On-Lstm on top of a 3-layer Sans (Table 2, Row 2). It performs worse than the stacking order used in the proposed hybrid model. The result supports the viewpoint that the Sans encoder is able to extract richer representations if the input is augmented with sequential context Chen:2018:ACL.

Moreover, to dispel the doubt that the improvement of the hybrid model comes from the increase in parameters, we investigate the 8-layer Lstm and 10-layer Sans encoders (Rows 3-4), which have more parameters than the proposed hybrid model. The results show that the hybrid model consistently outperforms these model variants with fewer parameters, so the improvement should not be attributed to a larger parameter count.

3.2 Targeted Linguistic Evaluation

                 S       O       Hybrid + Short-Cut
Task             Final   Final   H_O     H_S     Final
Surface Tasks
SeLen            92.71   90.70   91.94   89.50   89.86
WC               81.79   76.42   90.38   79.10   80.37
Avg              87.25   83.56   91.16   84.30   85.12
Syntactic Tasks
TrDep            44.78   52.58   51.19   52.55   53.28
ToCo             84.53   86.32   86.29   87.92   87.89
BShif            52.66   82.68   81.79   82.05   81.90
Avg              60.66   73.86   73.09   74.17   74.36
Semantic Tasks
Tense            84.76   86.00   83.88   86.05   85.91
SubN             85.18   85.44   85.56   84.59   85.81
ObjN             81.45   86.78   85.72   85.80   85.38
SoMo             49.87   49.54   49.23   49.12   49.92
CoIn             68.97   72.03   72.06   72.05   72.23
Avg              74.05   75.96   75.29   75.52   75.85

Table 3: Performance on the linguistic probing tasks, which evaluate the linguistic knowledge embedded in the learned representations. “S” and “O” denote the Sans and On-Lstm baseline models. “H_O” and “H_S” are respectively the outputs of the On-Lstm encoder and the Sans encoder in the hybrid model, and “Final” denotes the final output exposed to the decoder.

To gain linguistic insights into the learned representations, we conducted probing tasks conneau:2018:acl to evaluate the linguistic knowledge embedded in the final encoding representation learned by each model, as shown in Table 3. We evaluated Sans and the proposed hybrid model with Short-Cut connection on these 10 targeted linguistic evaluation tasks. The tasks and model details are described in Appendix A.2.

Experimental results are presented in Table 3. Several observations can be made here. The proposed hybrid model with short-cut produces more informative representations in most tasks (“Final” in “S” vs. “Final” in “Hybrid + Short-Cut”), indicating the effectiveness of the model. The only exception is the surface tasks, which is consistent with the conclusion in conneau:2018:acl: as a model captures deeper linguistic properties, it tends to forget these superficial features. Short-cut further improves the performance by providing richer representations (the Sans encoder output “H_S” vs. “Final” in “Hybrid + Short-Cut”). On syntactic tasks in particular, our proposed model surpasses the baseline by more than 13 points on average (74.36 vs. 60.66), which again verifies that On-Lstm enhances the strength of modeling hierarchical structure for self-attention.

3.3 Logical Inference

Figure 1: Accuracy of logical inference when training on logic data with at most 6 logical operators in the sequence.

We also verified the model’s performance in the logical inference task proposed by Bowman:2015:arXiv. This task is well suited to evaluating the ability to model hierarchical structure. Models need to learn the hierarchical and nested structures of language in order to predict accurate logical relations between sentences Bowman:2015:arXiv; Tran:2018:EMNLP; Shen:2019:ICLR. The artificial language of the task has six types of words {a, b, c, d, e, f} in the vocabulary and three logical operators {or, and, not}. The goal of the task is to predict one of seven logical relations between two given sentences. These seven relations are: two entailment types (⊏, ⊐), equivalence (≡), exhaustive and non-exhaustive contradiction (^, |), and semantic independence (#, ⌣).

We evaluated the Sans, Lstm, On-Lstm and proposed model. We followed Tran:2018:EMNLP to use two hidden layers with Short-Cut connection in all models. The model details and hyperparameters are described in Appendix A.3.

Figure 1 shows the results. The proposed hybrid model outperforms both the Lstm-based and the Sans-based baselines in all cases. Consistent with Shen:2019:ICLR, on the longer sequences (≥ 7) that were not included during training, the proposed model also obtains the best performance, and its gap over the other models is larger than on the shorter sequences (≤ 6). This verifies that the proposed model is better at modeling more complex hierarchical structure in sequences. It also indicates that the hybrid model has a stronger generalization ability.

4 Related Work

Improved Self-Attention Networks

Recently, there has been a large body of work on improving Sans in various NLP tasks Yang:2018:EMNLP; Wu:2018:ACL; Yang:2019:AAAI; Yang:2019:NAACL; Guo:2019:AAAI; Wang:2019:ACL; sukhbaatar:2019:ACL, as well as in image classification bello:2019:attention and automatic speech recognition mohamed:2019:transformers tasks. In these works, several strategies are proposed to improve Sans by enhancing local and global information. In this work, we enhance Sans with On-Lstm to form a hybrid model Chen:2018:ACL, and thoroughly evaluate the performance on machine translation, targeted linguistic evaluation, and logical inference tasks.

Structure Modeling for Neural Networks in NLP

Structure modeling in NLP has been studied for a long time, as natural language sentences inherently have hierarchical structures chomsky2014aspects; bever1970cognitive. With the emergence of deep learning, tree-based models have been proposed to integrate syntactic tree structure into Recursive Neural Networks Socher:2013:EMNLP, Lstm Tai:2015:ACL, and Cnn Mou:2016:AAAI. As for Sans, Hao:2019:EMNLPa, Ma:2019:NAACL and Wang:2019:EMNLP enhance Sans with neural syntactic distance, multi-granularity attention scope and structural position representations, which are generated from the syntactic tree structures.

Closely related to our work, Hao:2019:NAACL find that the integration of the recurrence in Sans encoder can provide more syntactic structure features to the encoder representations. Our work follows this direction and empirically evaluates the structure modelling on the related tasks.

5 Conclusion

In this paper, we adopt the On-Lstm, which models tree structure with a novel activation function and structured gating mechanism, as the Rnns counterpart to boost the hybrid model. We also propose a modification of the cascaded encoder by explicitly combining the outputs of individual components, to enhance the ability of hierarchical structure modeling in a hybrid model. Experimental results on machine translation, targeted linguistic evaluation and logical inference tasks show that the proposed models achieve better performances by modeling hierarchical structure of sequence.


J.Z. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM126558. We thank the anonymous reviewers for their insightful comments.


Appendix A Supplemental Material

A.1 Machine Translation

We conducted experiments on the widely-used WMT14 English⇒German dataset, consisting of about 4.56M sentence pairs. We used newstest2013 and newstest2014 as the development set and test set respectively. We applied the byte pair encoding (BPE) toolkit with 32K merge operations. The case-sensitive NIST BLEU score papineni2002bleu is used as the evaluation metric. All models were trained on eight NVIDIA Tesla P40 GPUs, each of which was allocated a batch size of 4096 tokens.

The Base model has an embedding size and hidden size of 512, a filter size of 2048 and 8 attention heads. In comparison, the Big model has an embedding size and hidden size of 1024, a filter size of 4096 and 16 attention heads. For both Base and Big models, the number of encoder and decoder layers is 6, and all dropout rates are 0.1. Adam kingma2015adam is used with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. The learning rate is 1.0 and linearly warms up over the first 4,000 steps, then decreases proportionally to the inverse square root of the step number. Label smoothing is 0.1 during training DBLP:conf/cvpr/SzegedyVISW16. All the results we report are based on individual models, without model averaging or ensembling.

A.2 Targeted Linguistic Evaluation

We conducted 10 probing tasks to study what linguistic properties are captured by the encoders conneau:2018:acl. A probing task is a classification problem that focuses on simple linguistic properties of sentences. ‘SeLen’ predicts the length of sentences in terms of number of words. ‘WC’ tests whether it is possible to recover information about the original words of a sentence from its sentence embedding. ‘TrDep’ checks whether an encoder infers the hierarchical structure of sentences. In the ‘ToCo’ task, sentences should be classified in terms of the sequence of top constituents immediately below the sentence node. ‘BShif’ tests whether two consecutive tokens within the sentence have been inverted. ‘Tense’ asks for the tense of the main-clause verb. ‘SubN’ focuses on the number of the main clause’s subject. ‘ObjN’ tests for the number of the direct object of the main clause. In ‘SoMo’, some sentences are modified by replacing a random noun or verb with another one, and the classifier should tell whether a sentence has been modified. ‘CoIn’ contains sentences made of two coordinate clauses. In half of the sentences the order of the clauses is inverted, and the task is to tell whether a sentence is intact or modified.

Each of our probing models consists of a pre-trained encoder, taken from one of the machine translation model variants, followed by an MLP classifier conneau:2018:acl. The mean of the encoding layer serves as the sentence representation passed to the classifier. The MLP classifier has a dropout rate of 0.3 and a learning rate of 0.0005 with the Adam optimizer, and was trained for 250 epochs.
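The probing setup (mean-pooled encoder states fed to an MLP) can be sketched as below. This is a simplified forward pass only; the padding mask, the single hidden layer, the ReLU nonlinearity and the random weights are all assumptions for illustration, and dropout and training are omitted.

```python
import numpy as np

def mean_pool(encoder_states, lengths):
    """Average encoder states over the real (non-padded) positions of each sentence."""
    _, max_len, _ = encoder_states.shape
    mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(encoder_states.dtype)
    summed = (encoder_states * mask[:, :, None]).sum(axis=1)
    return summed / lengths[:, None]

def mlp_logits(sentence_vecs, w1, b1, w2, b2):
    """One-hidden-layer classifier; the ReLU nonlinearity is an assumption."""
    hidden = np.maximum(0.0, sentence_vecs @ w1 + b1)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
states = rng.normal(size=(2, 3, 4))   # 2 sentences, up to 3 tokens, hidden size 4
lengths = np.array([3, 2])            # the second sentence has one padded position
sentence_vecs = mean_pool(states, lengths)

w1, b1 = rng.normal(size=(4, 6)), np.zeros(6)
w2, b2 = rng.normal(size=(6, 3)), np.zeros(3)
logits = mlp_logits(sentence_vecs, w1, b1, w2, b2)   # one score per class and sentence
```

Since the encoder is pre-trained and only the classifier is described as being trained, differences in probing accuracy can be read as differences in what the encoder representations already contain.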

A.3 Logical Inference

We used the artificial data described in Bowman:2015:arXiv. The train/dev/test split ratios are set to 0.8/0.1/0.1, with the number of logical operations ranging from 1 to 12. We followed Tran:2018:EMNLP to implement the architectures: premise and hypothesis sentences are encoded into fixed-size vectors, which are concatenated and fed to a three-layer feed-forward network for classification of the logical relation. For Lstm-based models, we took the last hidden state of the top layer as a fixed-size vector representation of the sentence. For the hybrid and Sans models, we used two trainable queries to obtain the fixed-size representation.
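One way to obtain a fixed-size vector from a variable-length sequence with trainable queries is attention pooling, sketched below. The text does not specify how the two query outputs are combined, so concatenating them is an assumption, as are the scaled dot-product scoring, the function name `query_pool`, and the random stand-ins for the trained queries and encoder states.

```python
import numpy as np

def query_pool(states, queries):
    """Attend over `states` (N, d) with each query (q, d); concatenate the q summaries."""
    scores = queries @ states.T / np.sqrt(states.shape[-1])   # (q, N)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return (weights @ states).reshape(-1)                     # (q * d,)

rng = np.random.default_rng(0)
d = 8
queries = rng.normal(size=(2, d))      # stand-ins for the two trainable queries
premise = rng.normal(size=(5, d))      # encoder states of a 5-token premise
hypothesis = rng.normal(size=(7, d))   # encoder states of a 7-token hypothesis

# Both sentences map to vectors of the same size regardless of their length,
# and the pair is concatenated before the feed-forward classifier.
pair_vec = np.concatenate([query_pool(premise, queries), query_pool(hypothesis, queries)])
```

The resulting `pair_vec` has a length that depends only on the hidden size and the number of queries, so sentences of different lengths can share one classifier.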

In our experiments, both the word embedding size and the hidden size are set to 256. All models have two layers, a dropout rate of 0.2 and a learning rate of 0.0001 with the Adam optimizer, and were trained for 100 epochs. For the hybrid model specifically, we stacked one On-Lstm layer and one Sans layer sequentially. A Short-Cut connection between layers is added to all models for fair comparison.