Two-Headed Monster And Crossed Co-Attention Networks

This paper presents a preliminary investigation of a new co-attention mechanism in neural transduction models. We propose a paradigm, termed Two-Headed Monster (THM), which consists of two symmetric encoder modules and one decoder module connected with co-attention. As a concrete implementation of THM, Crossed Co-Attention Networks (CCNs) are designed based on the Transformer model. We evaluate CCNs on the WMT 2014 EN-DE and WMT 2016 EN-FI translation tasks, where our model outperforms the strong Transformer baseline by 0.51 (big) and 0.74 (base) BLEU points on EN-DE and by 0.17 (big) and 0.47 (base) BLEU points on EN-FI.



1 Introduction

Attention has emerged as a prominent neural module, widely adopted across deep learning research problems (Das et al., 2017; Hermann et al., 2015; Rocktäschel et al., 2015; Santos et al., 2016; Xu and Saenko, 2016; Yang et al., 2016; Yin et al., 2016; Zhu et al., 2016; Xu et al., 2015; Chorowski et al., 2015) such as VQA, reading comprehension, textual entailment, image captioning and speech recognition. Its remarkable success is also evident in machine translation (Bahdanau et al., 2014; Vaswani et al., 2017).

This work proposes an end-to-end co-attentional neural structure, named Crossed Co-Attention Networks (CCNs), to address machine translation, a typical sequence-to-sequence NLP task. We customize the Transformer (Vaswani et al., 2017), which is built on non-local operations (Wang et al., 2018), by giving it two input branches and tailoring its multi-head attention mechanism to exchange information between these two parallel branches. We denote the higher-level, more abstract paradigm generalized from CCNs as "Two-Headed Monster" (THM). It represents a broader class of neural structures that benefit from two parallel neural channels intertwined with each other through, for example, a co-attention mechanism, as illustrated in Fig. 1.

Figure 1: Two-Headed Monster.

Co-attention is widely adopted in multi-modal scenarios (Lu et al., 2016a; Yu et al., 2017; Tay et al., 2018; Xiong et al., 2016; Lu et al., 2016b). Its basic idea is to make two feature maps from different domains attend to each other symmetrically and thus output a summarized representation for each domain. In this work, we emphasize a parallel and symmetric structure that operates on two input channels and has two output channels, but we do not assume that the two input channels must be disparate. Our co-attention mechanism is designed in a "Transformer" style and, to the best of our knowledge, our proposed Crossed Co-Attention Network is one of the first (if not the only) implementations of co-attention on the Transformer model. As a preliminary investigation, we apply our model to the popular machine translation task, where the two input channels come from the same domain. Our code also leverages half-precision floating point (FP16) training and synchronous distributed training for inter-GPU communication (we do not discard gradients calculated by "stragglers"), which dramatically accelerates our training procedure (Ott et al., 2018; Micikevicius et al., 2018). We will release our code after the paper is de-anonymized.

2 Model Architecture

We propose an end-to-end neural architecture, based on the Transformer, to address a class of sequence-to-sequence tasks where the model takes input from two channels. We design a Crossed Co-Attention Mechanism that makes our model capable of attending to two information flows simultaneously in both the encoding and the decoding stages. Our co-attention mechanism is realized simply by crossing the connections of the Value, Key and Query gates of regular multi-head attention modules, so we term our model Crossed Co-Attention Networks.

2.1 Generic Co-Attention

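In this section, we first review non-local operations and relate them to the dot-product attention widely used in self-attention modules, and then formulate the co-attention mechanism in a generic way. A non-local operation is defined as a building block in deep neural networks that captures long-range dependencies, where every response is computed as a linear combination of all features in the input feature map (Wang et al., 2018). Suppose the input feature maps are $\mathcal{V} = [v_1, \dots, v_n]^\top$, $\mathcal{K} = [k_1, \dots, k_n]^\top$ and $\mathcal{Q} = [q_1, \dots, q_n]^\top$, each in $\mathbb{R}^{n \times d}$, and the output feature map $\mathcal{Y} = [y_1, \dots, y_n]^\top$ is of the same size as the input. Then a generic non-local operation is formulated as follows:

$$y_i = \frac{1}{C(q_i, \mathcal{K})} \sum_{j=1}^{n} f(q_i, k_j)\, g(v_j) \tag{1}$$

We basically follow the definition of the non-local operation in Wang et al. (2018), where $f: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is a pairwise function ("$\times$" is the Cartesian product), $g: \mathbb{R}^d \to \mathbb{R}^d$ is a unary function and $C$ calculates a normalizer, but we dispense with the assumption that $\mathcal{V} = \mathcal{K} = \mathcal{Q}$. However, if we assume $f(q_i, k_j) = \exp(q_i^\top k_j / \sqrt{d})$, $g(v_j) = v_j$, the normalizer $C(q_i, \mathcal{K}) = \sum_{j=1}^{n} f(q_i, k_j)$ and $\mathcal{V} = \mathcal{K} = \mathcal{Q}$, then the non-local operation degrades to the multi-head self-attention described in Vaswani et al. (2017) (formula 2 describes only one attention head):

$$\mathcal{Y} = \mathrm{softmax}\left(\frac{\mathcal{Q}\mathcal{K}^\top}{\sqrt{d}}\right)\mathcal{V} \tag{2}$$

Considering two input channels, denoted as 'left' and 'right', we present the following non-local operation as a definition of co-attention, where $\mathcal{V}^{\text{left}}, \mathcal{K}^{\text{left}}, \mathcal{Q}^{\text{left}}$ and $\mathcal{V}^{\text{right}}, \mathcal{K}^{\text{right}}, \mathcal{Q}^{\text{right}}$ are the feature maps of the two channels. Note that when $\mathcal{Q}^{\text{left}} = \mathcal{Q}^{\text{right}}$, the co-attention degrades to two self-attention modules.

$$y_i^{\text{left}} = \frac{1}{C(q_i^{\text{right}}, \mathcal{K}^{\text{left}})} \sum_{j=1}^{n} f(q_i^{\text{right}}, k_j^{\text{left}})\, g(v_j^{\text{left}}), \qquad y_i^{\text{right}} = \frac{1}{C(q_i^{\text{left}}, \mathcal{K}^{\text{right}})} \sum_{j=1}^{n} f(q_i^{\text{left}}, k_j^{\text{right}})\, g(v_j^{\text{right}}) \tag{3}$$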
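As a minimal sketch of the formulation in this section, the following pure-Python code implements a generic non-local operation with pluggable pairwise and unary functions, and checks numerically that the exponentiated scaled dot product with an identity unary function reproduces single-head softmax attention. The function names, the toy inputs and the crossed assignment of queries in `co_attention` (queries taken from the opposite channel, as in Section 2.2) are illustrative assumptions; learned projections and multiple heads are omitted.

```python
import math

def non_local(queries, keys, values, f, g):
    """Generic non-local operation: y_i = (1 / C_i) * sum_j f(q_i, k_j) * g(v_j)."""
    mapped = [g(v) for v in values]
    dim = len(mapped[0])
    outputs = []
    for q in queries:
        weights = [f(q, k) for k in keys]
        normalizer = sum(weights)  # C_i, assumed here to be the sum of pairwise scores
        outputs.append([
            sum(w * m[c] for w, m in zip(weights, mapped)) / normalizer
            for c in range(dim)
        ])
    return outputs

def exp_scaled_dot(q, k):
    """Pairwise function that yields scaled dot-product attention weights."""
    return math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)))

def identity(v):
    return v

def softmax_attention(queries, keys, values):
    """Single-head scaled dot-product attention, written row by row."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append([
            sum((e / total) * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))
        ])
    return outputs

def co_attention(left, right):
    """Assumed crossed wiring: each branch keeps its own keys and values
    and takes its queries from the other branch."""
    y_left = non_local(right, left, left, exp_scaled_dot, identity)
    y_right = non_local(left, right, right, exp_scaled_dot, identity)
    return y_left, y_right

def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy feature maps (3 positions, dimension 2); the values are arbitrary.
left = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
right = [[0.5, -0.5], [2.0, 0.0], [0.0, 0.3]]

same_as_softmax = close(
    non_local(left, left, left, exp_scaled_dot, identity),
    softmax_attention(left, left, left),
)
y_l, y_r = co_attention(left, left)  # identical channels
degrades_to_self_attention = close(y_l, softmax_attention(left, left, left)) and close(y_l, y_r)
```

Both flags evaluate to true for these inputs: feeding the same feature map to both channels makes each branch behave like an ordinary self-attention module.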


2.2 Crossed Co-Attention Networks

Based on the Transformer model (Vaswani et al., 2017), we design a novel co-attention mechanism. It consists of two symmetrical branches that work in parallel to assimilate information from two input channels respectively. Unlike previously known co-attention mechanisms such as Xiong et al. (2017) and Lu et al. (2016a), our co-attention is built by connecting two multiplicative attention modules (Vaswani et al., 2017), each containing three gates: Value, Key and Query. The information flows from the two input channels then interact with and benefit from each other via crossed connections. Suppose the input fed into the left branch is $X^{\text{left}}$ and the input fed into the right branch is $X^{\text{right}}$. In our encoder, the left branch takes $X^{\text{left}}$ as Value (V) and Key (K) and takes $X^{\text{right}}$ as Query (Q). The right branch, in turn, takes $X^{\text{left}}$ as Query (Q) and $X^{\text{right}}$ as Value (V) and Key (K). This design is meant to let each branch largely keep the information of its own domain: when an attention module takes its Values from its own branch, every output response is a weighted combination of that branch's features. As a special case, if the Value gate applies no transformation (the unary function is the identity), then every response of the left branch lies in the row space of $X^{\text{left}}$. For machine translation, the two encoder branches take in the same input sequence, but to reduce the redundancy between the two parallel branches, we apply dropout and input corruption to the input embeddings of the two branches separately. While our model shares BPE embeddings (Sennrich et al., 2015) globally, for the input matrices of the encoder branches we randomly select and swap two sub-word tokens with a given probability.
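The input corruption described above can be sketched as follows. This is an illustrative assumption rather than the released implementation: the function name `swap_two_tokens`, the explicit random generator and the probability value 0.5 are placeholders, since the text does not fix them here.

```python
import random

def swap_two_tokens(tokens, swap_prob, rng):
    """With probability swap_prob, return a copy with two random positions swapped."""
    corrupted = list(tokens)
    if len(corrupted) >= 2 and rng.random() < swap_prob:
        i, j = rng.sample(range(len(corrupted)), 2)  # two distinct positions
        corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted

rng = random.Random(0)  # seeded only to make the sketch reproducible
sentence = ["the", "two", "head@@", "ed", "monster"]  # hypothetical BPE tokens

# Each encoder branch receives its own, independently corrupted copy.
left_input = swap_two_tokens(sentence, swap_prob=0.5, rng=rng)
right_input = swap_two_tokens(sentence, swap_prob=0.5, rng=rng)
```

Corrupting the two copies independently means the branches usually see slightly different token orders for the same sentence, which is one way to reduce the redundancy between them.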


Figure 2: Crossed Co-Attention Networks.

In the encoder-decoder attention layers, the multi-head attention modules on the two decoder branches use the outputs of the two encoder branches as Value and Key alternately, while taking the self-attended output embedding from below as Query. The outputs of the two decoder branches are concatenated, passed through a linear transformation and then fed into a feed-forward network. In addition to our co-attention mechanism, we keep one self-attention layer in the decoder for reading in the shifted output embedding. We adopt the same input masking and sinusoidal position encoding as the Transformer, which we do not describe further here.
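The merge of the two decoder branches (concatenation followed by a linear transformation) can be sketched as below. The function name, the toy branch outputs and the averaging weight matrix are hypothetical; in a trained model the weights and bias would be learned, and the result would then go through the feed-forward network.

```python
def concat_and_project(left_out, right_out, weight, bias):
    """Concatenate per-position branch outputs (d + d features), then apply
    one linear map that projects back to d features."""
    merged = []
    for l_row, r_row in zip(left_out, right_out):
        joined = l_row + r_row  # list concatenation, length 2 * d
        merged.append([
            sum(x * w for x, w in zip(joined, w_row)) + b
            for w_row, b in zip(weight, bias)
        ])
    return merged

# Toy example with d = 2 and two target positions.
left_out = [[1.0, 2.0], [3.0, 4.0]]
right_out = [[3.0, 4.0], [5.0, 6.0]]

# Hypothetical d x 2d weights that simply average the two branches.
weight = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
bias = [0.0, 0.0]

merged = concat_and_project(left_out, right_out, weight, bias)
# merged == [[2.0, 3.0], [4.0, 5.0]]
```

Projecting back to the original width keeps the merged representation compatible with the feed-forward network that follows.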

Model             Dataset        Epoch Time (s)  BLEU   Number of Parameters  Batch Size (tokens/GPU)
Transformer-Base  WMT2014 EN-DE          684.52  27.21            61,364,224                    6,528
THM / CCN-Base    WMT2014 EN-DE         1090.65  27.95           114,928,640                    6,528
Transformer-Base  WMT2016 EN-FI          232.97  16.12            55,883,776                    6,528
THM / CCN-Base    WMT2016 EN-FI          410.79  16.59           109,448,192                    6,528
Transformer-Big   WMT2014 EN-DE         1982.63  28.13           210,808,832                    2,176
THM / CCN-Big     WMT2014 EN-DE         3611.53  28.64           424,892,416                    2,176
Transformer-Big   WMT2016 EN-FI          726.51  16.21           199,847,936                    2,176
THM / CCN-Big     WMT2016 EN-FI         1387.22  16.38           413,931,520                    2,176
Table 1: Comparison Between Our Proposed Method and the Transformer Baseline on WMT 2014 EN-DE and WMT 2016 EN-FI

3 Experiments

3.1 Setup

We demonstrate our model on the WMT 2014 EN-DE and WMT 2016 EN-FI machine translation tasks. For convenience, in this section we do not differentiate between THM and CCN, which is an implementation of THM. The raw input data is pre-processed with length filtering, as in previous work (Ott et al., 2018), and the final dataset for each language pair consists of training, validation and test examples. Considering the scale of the training sets, we adopt a shared BPE dictionary for each language pair, whose size is chosen separately for EN-DE and EN-FI. Our base and big CCNs use the same number of encoder and decoder blocks and the same hidden state sizes as the corresponding settings of the Transformer paper. We train our models on an NVIDIA DGX-1 GPU server with Tesla V100-16GB GPUs. To make full use of the computational resources, we adopt FP16 computation and use one batch size, measured in tokens per GPU, for base models and a smaller one for big models (both Transformer and THM; see the Batch Size column of Table 1). We adopt the sequence-to-sequence toolkit FairSeq (Ott et al., 2019), released by Facebook AI Research, for our Transformer baseline, and our THM code is built upon it as well. We train all base models for around one day and all big models for around two days. For model selection, we strictly choose the model that achieves the highest BLEU on the Dev set.

3.2 Experimental Results

Main Results:

Our experiments demonstrate the effectiveness of our proposed crossed co-attention mechanism, which improves the BLEU scores of machine translation by 0.17 to 0.74 points over the Transformer baselines, as shown in Table 1. In addition, the co-attention mechanism has, by and large, reduced the training, validation and test loss from the first training epoch onward compared with the Transformer baselines, as shown in Appendix A.1. However, since the number of parameters roughly doubles, the epoch time also increases by roughly 60% to 90% (Table 1).
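The trade-off reported in Table 1 can be recomputed directly from its rows. The short script below is only a convenience sketch: the dictionary layout and the helper name `compare` are our own choices, while the numbers are copied from Table 1.

```python
# (model size, dataset) -> (Transformer row, THM/CCN row);
# each row is (epoch time in seconds, BLEU, number of parameters).
TABLE_1 = {
    ("base", "EN-DE"): ((684.52, 27.21, 61_364_224), (1090.65, 27.95, 114_928_640)),
    ("base", "EN-FI"): ((232.97, 16.12, 55_883_776), (410.79, 16.59, 109_448_192)),
    ("big", "EN-DE"): ((1982.63, 28.13, 210_808_832), (3611.53, 28.64, 424_892_416)),
    ("big", "EN-FI"): ((726.51, 16.21, 199_847_936), (1387.22, 16.38, 413_931_520)),
}

def compare(baseline, ccn):
    """Return the BLEU gain and the time and parameter ratios of CCN over the baseline."""
    return {
        "bleu_gain": round(ccn[1] - baseline[1], 2),
        "time_ratio": round(ccn[0] / baseline[0], 2),
        "param_ratio": round(ccn[2] / baseline[2], 2),
    }

results = {setting: compare(*rows) for setting, rows in TABLE_1.items()}

for (size, dataset), stats in results.items():
    print(size, dataset, stats)
```

The BLEU gains come out as 0.74, 0.47, 0.51 and 0.17, matching the figures quoted in the abstract, while both the epoch time and the parameter count grow by a factor between about 1.5 and 2.1 in every setting.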

Capability of Model Selection:

In addition to BLEU, loss and time efficiency, we also find that the THM/CCN models are better at selecting good models with the Dev set from among all models derived in all training epochs. As shown in Table 2, for THM/CCN the models that achieve the highest BLEU on the Dev set also rank highly on the Test set. In 75% of cases THM selects TOP 3 models, and in all cases it selects TOP 5 models, whereas the Transformer selects TOP 10 models in only 50% of cases.
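The model-selection criterion can be made concrete with a small sketch: pick the epoch with the highest Dev BLEU, then find where that checkpoint ranks on the Test set among all epochs. The helper names and all BLEU values below are made up for illustration and are not the paper's per-epoch results.

```python
def rank_on_test_of_dev_best(dev_bleu, test_bleu):
    """1-based Test-set rank of the epoch that has the highest Dev BLEU."""
    selected = max(range(len(dev_bleu)), key=lambda epoch: dev_bleu[epoch])
    better = sum(1 for score in test_bleu if score > test_bleu[selected])
    return better + 1

def top_k_rate(ranks, k):
    """Fraction of runs whose Dev-selected checkpoint ranks within the top k on Test."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)

# Hypothetical per-epoch BLEU curves for two training runs.
run_a = {"dev": [24.1, 25.0, 25.3, 25.2], "test": [26.0, 26.9, 27.1, 27.3]}
run_b = {"dev": [15.0, 15.4, 15.1], "test": [16.0, 16.5, 16.2]}

ranks = [
    rank_on_test_of_dev_best(run_a["dev"], run_a["test"]),  # Dev picks epoch 3, Test rank 2
    rank_on_test_of_dev_best(run_b["dev"], run_b["test"]),  # Dev picks epoch 2, Test rank 1
]
top1 = top_k_rate(ranks, 1)  # 0.5
top3 = top_k_rate(ranks, 3)  # 1.0
```

Each percentage in Table 2 is a rate of this kind, computed over the evaluated settings for a given k.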

Performance across Languages:

We test our proposed method on two language pairs, EN-DE and EN-FI. The improved BLEU scores and the better model selection on both base and big models suggest that the method generalizes across these language pairs.

Rank of Selected Model  THM / CCN  Transformer
TOP 1                         25%           0%
TOP 3                         75%           0%
TOP 5                        100%           0%
TOP 10                       100%          50%
Table 2: Whether the models selected on the Dev set also rank highly on the Test set. We report the percentage of selected models that rank TOP 1, TOP 3, TOP 5 or TOP 10 among all models derived from all training epochs.

4 Related Work


Multi-head self-attention has demonstrated its capacity in neural transduction models (Vaswani et al., 2017), language model pre-training (Devlin et al., 2018; Radford et al., 2018) and speech synthesis (Yang et al., 2019c). While this attention mechanism, which eschews recurrence, is known for modeling global dependencies and is considered faster than recurrent layers (Vaswani et al., 2017), recent work points out that it may tend to overlook neighboring information (Yang et al., 2019a; Xu et al., 2019). Applying an adaptive attention span has been found conducive to character-level language modeling (Sukhbaatar et al., 2019). Yang et al. (2018a) propose to model localness for self-attention, which helps capture local information by learning a Gaussian bias that predicts the region of local attention. Other work indicates that adding convolution layers would ameliorate the aforementioned issue (Yang et al., 2018b, 2019b). Multi-head attention can also be used in multi-modal scenarios, where the V, K and Q gates take in data from different domains: Helcl et al. (2018) add an attention layer on top of the encoder-decoder layer, with K and V being CNN-extracted image features.

Machine Translation:

Some recent advances in machine translation aim to find more efficient models based on the Transformer: Hao et al. (2019) add a recurrence encoder to model recurrence for the Transformer; So et al. (2019) demonstrate the power of neural architecture search and find that the resulting Evolved Transformer architecture outperforms human-designed ones; Wu et al. (2019) propose dynamic convolutions that are simpler and more efficient than self-attention. Other work shows that scaling training across multiple GPUs can improve the experimental results and shorten the training time (Ott et al., 2018). A novel research direction is semi-supervised or unsupervised machine translation, aimed at low-resource languages where parallel data is usually unavailable (Cheng, 2019; Artetxe et al., 2017; Lample et al., 2017).

5 Conclusion

We propose a novel co-attention mechanism consisting of two parallel attention modules connected with each other in a crossed manner. We first formulate co-attention in a general sense as a non-local operation, and then show that a specific type of co-attention, crossed co-attention, can improve machine translation by 0.17 to 0.74 BLEU points and enhance the capability of model selection. However, time efficiency is reduced since the number of parameters increases.


Appendix A Appendices

A.1 Comparisons of Loss between CCN Models and Transformer Baselines

Figure 3: Loss vs Epoch for THM-base and Transformer-base on EN-DE
Figure 4: Loss vs Epoch for THM-big and Transformer-big on EN-DE
Figure 5: Loss vs Epoch for THM-base and Transformer-base on EN-FI
Figure 6: Loss vs Epoch for THM-big and Transformer-big on EN-FI