Area Attention

10/23/2018 · by Yang Li, et al.

Existing attention mechanisms are mostly item-based in that a model is designed to attend to a single item in a collection of items (the memory). Intuitively, an area in the memory that may contain multiple items can be worth attending to as a whole. We propose area attention: a way to attend to an area of the memory, where each area contains a group of items that are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area, can vary depending on the learned coherence of the adjacent items. By giving the model the option to attend to an area of items, instead of only a single item, we hope attention mechanisms can better capture the nature of the task. Area attention can work along with multi-head attention for attending to multiple areas in the memory. We evaluate area attention on two tasks, neural machine translation and image captioning, and improve upon strong (state-of-the-art) baselines in both cases. These improvements are obtainable with a basic form of area attention that is parameter-free. In addition to proposing the novel concept of area attention, we contribute an efficient way of computing it by leveraging the technique of summed area tables.

1 Introduction

Attentional mechanisms have significantly boosted the accuracy on a variety of deep learning tasks (Bahdanau et al., 2014; Luong et al., 2015; Xu et al., 2015). They allow the model to selectively focus on specific pieces of information, which can be a word in a sentence for neural machine translation (Bahdanau et al., 2014; Luong et al., 2015) or a region of pixels in image captioning (Xu et al., 2015; Sharma et al., 2018).

An attentional mechanism typically follows a memory-query paradigm, where the memory contains a collection of items of information from a source modality, such as the embeddings of an image or the hidden states of encoding an input sentence, and the query comes from a target modality, such as the hidden state of a decoder model. In recent architectures such as Transformer (Vaswani et al., 2017), self-attention involves queries and memory from the same modality for either the encoder or the decoder. Each item $i$ in the memory $M$ has a key $k_i$ and value $v_i$, where the key is used to compute the probability $a_i$ regarding how well the query $q$ matches the item (see Equation 1).

$$a_i = \frac{\exp(f_{att}(q, k_i))}{\sum_{j=1}^{|M|} \exp(f_{att}(q, k_j))} \tag{1}$$

The typical choices for $f_{att}$ include dot products (Luong et al., 2015) and a multilayer perceptron (Bahdanau et al., 2014). The output $O_{q,M}$ from querying the memory $M$ with $q$ is then calculated as the sum of all the values in the memory weighted by their probabilities (see Equation 2), which can be fed to other parts of the model for further calculation.

$$O_{q,M} = \sum_{i=1}^{|M|} a_i v_i \tag{2}$$

During training, the model learns to attend to specific pieces of information, e.g., the correspondence between a word in the target sentence and a word in the source sentence for translation tasks.
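To make the memory-query computation concrete, the following is a minimal NumPy sketch of Equations 1 and 2, using a dot product as the compatibility function (Luong et al., 2015); the function and variable names are illustrative, not from the paper.

import numpy as np

def item_attention(query, keys, values):
    # Standard item-based attention (Equations 1 and 2).
    # query:  [D]     -- query vector q
    # keys:   [N, D]  -- one key k_i per memory item
    # values: [N, D]  -- one value v_i per memory item
    scores = keys @ query                     # f_att as a dot product, shape [N]
    weights = np.exp(scores - scores.max())   # Equation 1: softmax over memory items
    weights /= weights.sum()
    return weights @ values                   # Equation 2: weighted sum of values, shape [D]

# Toy usage: a memory of 4 items with depth 8.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
query = rng.normal(size=8)
print(item_attention(query, keys, values).shape)  # (8,)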

Attention mechanisms are typically designed to focus on individual items in the entire memory, where each item defines the granularity of what the model can attend to. For example, it can be a character for a character-level translation model, a word for a word-level model, or a grid cell for an image-based model. Such a construction of attention granularity is predetermined rather than learned. While this kind of item-based attention has been helpful for many tasks, it can be fundamentally limited for modeling the complex attention distributions that might be involved in a task.

In this paper, we propose area attention: a general mechanism for the model to attend to a group of items in the memory that are structurally adjacent. In area attention, each unit for attention calculation is an area that can contain one or more items. Each of these areas can aggregate a varying number of items, so the granularity of attention is learned from the data rather than predetermined. Note that area attention subsumes item-based attention: when an area contains a single item, it is equivalent to a regular attention mechanism. Area attention can be used along with multi-head attention (Vaswani et al., 2017); with each head using area attention, multi-head area attention allows the model to attend to multiple areas in the memory. As we show in the experiments, the combination of the two achieved the best results.

Extensive experiments indicate that area attention outperforms regular attention on a number of recent models for two popular tasks: machine translation (both token- and character-level translation on WMT'14 EN-DE and EN-FR) and image captioning (trained on COCO and tested both in-domain on COCO40 and out-of-domain on Flickr 1K). These models involve several distinct architectures, such as the canonical LSTM seq2seq with attention (Luong et al., 2015) and the encoder-decoder Transformer (Vaswani et al., 2017; Sharma et al., 2018).

2 Related Work

Item-grouping has been brought up in a number of language-specific tasks. Ranges or segments of a sentence, beyond individual tokens, have often been considered for problems such as dependency parsing or constituency parsing in natural language processing. Recent works (Wang & Chang, 2016; Stern et al., 2017; Kitaev & Klein, 2018) represent a sentence segment by subtracting the encoding of the first token from that of the last token in the segment, assuming the encoder captures the contextual dependency of tokens. The popular choices of encoder are LSTM (Wang & Chang, 2016; Stern et al., 2017) or Transformer (Kitaev & Klein, 2018). In contrast, the representation of an area (or a segment) in area attention, in its basic form, is defined as the mean of all the vectors in the segment, where each vector does not need to carry contextual dependency. We calculate the mean of each area of vectors using a subtraction operation over a summed area table (Viola & Jones, 2001), which is fundamentally different from the subtraction applied in these previous works.

Lee et al. proposed a rich representation for a segment in coreference resolution tasks (Lee et al., 2017), where each span (segment) in a document is represented as a concatenation of the encodings of the first and last words in the span, the size of the span, and an attention-weighted sum of the word embeddings within the span. Again, this approach operates on encodings that have already captured contextual dependency between tokens, while the area attention we propose does not require each item to carry contextual or dependency information. In addition, the concept of a range, segment or span proposed in these works addresses their specific tasks, rather than aiming to improve general attentional mechanisms.

Previous work has proposed several methods for capturing structure in attention calculation. For example, Kim et al. used a conditional random field to directly model the dependency between items, which allows multiple "cliques" of items to be attended to at the same time (Kim et al., 2017). Niculae and Blondel approached the problem from a different angle by using regularizers to encourage attention to be placed on contiguous segments (Niculae & Blondel, 2017). In image captioning tasks, Pedersoli et al. enabled a model to attend to object proposals on an image (Pedersoli et al., 2016), while You et al. applied attention to semantic concepts and visual attributes extracted from an image (You et al., 2016).

Compared to these previous works, the area attention we propose here does not require training a special network or sub-network, or using an additional loss (regularizer), to capture structure. It allows a model to attend to information at varying granularity, which can be at the input layer, where each item might lack contextual information, or in the latent space. It is easy to apply area attention to existing single-head or multi-head attention mechanisms. By enhancing Transformer, an attention-based architecture (Vaswani et al., 2017), with area attention, we achieved state-of-the-art results on a number of tasks.

3 Area-Based Attention Mechanisms

An area is a group of structurally adjacent items in the memory. When the memory consists of a sequence of items, a 1-dimensional structure, an area is a range of items that are sequentially (or temporally) adjacent, and the number of items in the area can be one or more. Many language-related tasks fall into the 1-dimensional case, e.g., machine translation or sequence prediction tasks. In Figure 1, the original memory is a 4-item sequence. By combining the adjacent items in the sequence, we form the area memory, where each item is a combination of multiple adjacent items in the original memory. We can limit the maximum area size to consider for a task. In Figure 1, the maximum area size is 3.

Figure 1: An illustration of area attention for the 1-dimensional case. In this example, the memory is a 4-item sequence and the maximum size of an area allowed is 3.

When the memory contains a grid of items, a 2-dimensional structure, an area can be any rectangular region in the grid (see Figure 2). This resembles many image-related tasks, e.g., image captioning. Again, we can limit the maximum size allowed for an area. For a 2-dimensional area, we can set the maximum height and width for each area. In this example, the original memory is a 3x3 grid of items and the maximum height and width allowed for each area is 2.

Figure 2: An illustration of area attention for the 2-dimensional case. In this example, the memory is a 3x3 grid and the dimension allowed for an area is 2x2.

As we can see, many areas can be generated by combining adjacent items. For the 1-dimensional case, the number of areas that can be generated is $|R| = SL - \frac{S(S-1)}{2}$, where $S$ is the maximum size of an area and $L$ is the length of the sequence. For the 2-dimensional case, a quadratic number of areas can be generated from the original memory: $|R| = |R_v| |R_h|$, with $|R_v| = AH - \frac{A(A-1)}{2}$ and $|R_h| = BW - \frac{B(B-1)}{2}$, where $H$ and $W$ are the height and width of the memory grid and $A$ and $B$ are the maximum height and width allowed for a rectangular area.
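These counts can be sanity-checked by direct enumeration; the closed forms follow from summing the number of valid positions for each allowed size. The short Python sketch below (helper names are ours) verifies them against the settings of Figures 1 and 2.

# Count 1-D areas of length <= S in a sequence of length L and
# check against the closed form S*L - S*(S-1)/2.
def count_1d(L, S):
    return sum(L - s + 1 for s in range(1, S + 1))

assert count_1d(L=4, S=3) == 3 * 4 - 3 * 2 // 2  # 9 areas for the 4-item sequence of Figure 1

# Count 2-D rectangular areas of height <= A and width <= B in an H x W grid.
def count_2d(H, W, A, B):
    return count_1d(H, A) * count_1d(W, B)

assert count_2d(H=3, W=3, A=2, B=2) == 25  # the 3x3 grid with 2x2 maximum of Figure 2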

To be able to attend to each area, we need to define the key and value for each area that contains one or multiple items in the original memory. As the first step to explore area attention, we define the key of an area, $\mu_i$, simply as the mean vector of the keys of the items in the area.

$$\mu_i = \frac{1}{|r_i|} \sum_{j=1}^{|r_i|} k_{i,j} \tag{3}$$

where $|r_i|$ is the size of the area $r_i$ and $k_{i,j}$ is the key of the $j$th item in the area. For the value of an area, we simply define it as the sum of all the value vectors in the area.

$$v_i^{r} = \sum_{j=1}^{|r_i|} v_{i,j} \tag{4}$$

With the keys and values defined, we can use the standard way of calculating attention, as discussed in Equation 1 and Equation 2. Note that this basic form of area attention (Eq.3 and Eq.4) is parameter-free: it does not introduce any parameters to be learned.
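As an illustration, the following NumPy sketch builds the area memory for a 1-dimensional sequence using the basic, parameter-free definitions of Equations 3 and 4. It uses a naive loop for clarity (Section 3.2 describes the efficient computation), and the names are illustrative.

import numpy as np

def area_key_value_1d(keys, values, max_area=3):
    # Basic (parameter-free) area memory for a 1-D sequence: one entry per area,
    # with key = mean of the item keys (Eq. 3) and value = sum of the item values (Eq. 4).
    L = keys.shape[0]
    area_keys, area_values = [], []
    for size in range(1, max_area + 1):
        for start in range(L - size + 1):
            area_keys.append(keys[start:start + size].mean(axis=0))
            area_values.append(values[start:start + size].sum(axis=0))
    return np.stack(area_keys), np.stack(area_values)

# The resulting area memory can then be queried with the standard attention of
# Equations 1 and 2 (e.g., the item_attention sketch shown earlier).
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
values = rng.normal(size=(4, 8))
ak, av = area_key_value_1d(keys, values, max_area=3)
print(ak.shape)  # (9, 8): 4 + 3 + 2 = 9 areas for a 4-item sequence with maximum area size 3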

3.1 Combining Area Features

Alternatively, we can derive a richer representation of each area by using features other than the mean of the key vectors of the area. For example, we can consider the standard deviation, $\sigma_i$, of the key vectors within each area.

$$\sigma_i = \sqrt{\frac{1}{|r_i|} \sum_{j=1}^{|r_i|} (k_{i,j} - \mu_i)^2} \tag{5}$$

We can also consider the height and width of each area, $h_i$ and $w_i$, as features of the area. To combine these features, we use a multi-layer perceptron. To do so, we treat $h_i$ and $w_i$ as discrete values and project them onto a vector space using embeddings (see Equations 6 and 7).

$$e_i^{h} = \mathbf{1}(h_i) E^{h} \tag{6}$$

$$e_i^{w} = \mathbf{1}(w_i) E^{w} \tag{7}$$

where $\mathbf{1}(h_i)$ and $\mathbf{1}(w_i)$ are the one-hot encodings of $h_i$ and $w_i$, and $E^{h} \in \mathbb{R}^{A \times d_e}$ and $E^{w} \in \mathbb{R}^{B \times d_e}$ are the embedding matrices, with $d_e$ the depth of the embedding. We concatenate them to form the representation of the shape of an area.

$$e_i = [e_i^{h}; e_i^{w}] \tag{8}$$

We then combine these features using a single-layer perceptron followed by a linear transformation (see Equation 9).

$$k_i^{r} = \phi(\mu_i W_{\mu} + \sigma_i W_{\sigma} + e_i W_{e}) W_{d} \tag{9}$$

where $\phi$ is a nonlinear transformation such as ReLU, and $W_{\mu} \in \mathbb{R}^{D \times D}$, $W_{\sigma} \in \mathbb{R}^{D \times D}$, $W_{e} \in \mathbb{R}^{2 d_e \times D}$ and $W_{d} \in \mathbb{R}^{D \times D}$ are trainable parameters, with $D$ the depth of the key vectors.
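As an illustration of Equation 9, the sketch below combines the mean, standard deviation and shape embedding of each area into a richer key. The shapes and randomly initialized matrices are illustrative stand-ins for parameters that would be learned.

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def combined_area_key(mu, sigma, shape_emb, W_mu, W_sigma, W_e, W_d):
    # Richer area key of Equation 9: a single-layer perceptron over the mean,
    # standard deviation and shape embedding, followed by a linear projection.
    hidden = relu(mu @ W_mu + sigma @ W_sigma + shape_emb @ W_e)
    return hidden @ W_d

# Illustrative shapes: key depth D=8, shape-embedding depth d_e=4 (so 2*d_e=8).
rng = np.random.default_rng(0)
D, d_e, n_areas = 8, 4, 9
mu = rng.normal(size=(n_areas, D))
sigma = rng.normal(size=(n_areas, D))
shape_emb = rng.normal(size=(n_areas, 2 * d_e))
W_mu, W_sigma, W_d = (rng.normal(size=(D, D)) for _ in range(3))
W_e = rng.normal(size=(2 * d_e, D))
print(combined_area_key(mu, sigma, shape_emb, W_mu, W_sigma, W_e, W_d).shape)  # (9, 8)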

3.2 Fast Computation Using Summed Area Table

If we naively compute $\mu_i$, $\sigma_i$ and $v_i^{r}$ for every area, the time complexity of computing attention will be $O(|M| S_{\max}^2)$, where $|M|$ is the size of the memory, which is $L$ for a 1-dimensional sequence or $HW$ for a 2-dimensional memory, and $S_{\max}$ is the maximum size of an area, which is $S$ in the 1-dimensional case and $AB$ in the 2-dimensional case. This is computationally expensive in comparison to attention computed on the original memory, which is $O(|M|)$. To address the issue, we use a summed area table, an optimization technique that has been used in computer vision for computing features on image areas (Viola & Jones, 2001). It allows a summation-based feature in each rectangular area to be calculated in constant time, which brings the time complexity down to $O(|M| S_{\max})$. We report the actual time cost in the experimental section.

The summed area table is based on a pre-computed integral image, $I$, which can be computed in a single pass over the memory (see Equation 10). Here let us focus on the area value calculation for a 2-dimensional memory, because a 1-dimensional memory is just a special case with the height of the memory grid being 1.

$$I_{x,y} = v_{x,y} + I_{x,y-1} + I_{x-1,y} - I_{x-1,y-1} \tag{10}$$

where $x$ and $y$ are the coordinates of the item in the memory and $v_{x,y}$ is its value vector, with $I_{x,0} = I_{0,y} = 0$. With the integral image, we can calculate the key and value of each area in constant time. The sum of all the vectors in a rectangular area can be computed as follows (Equation 11).

$$v_{x_1,y_1,x_2,y_2} = I_{x_2,y_2} + I_{x_1-1,y_1-1} - I_{x_2,y_1-1} - I_{x_1-1,y_2} \tag{11}$$

where $v_{x_1,y_1,x_2,y_2}$ is the value for the area located with the top-left corner at $(x_1, y_1)$ and the bottom-right corner at $(x_2, y_2)$. By dividing $v_{x_1,y_1,x_2,y_2}$ by the size of the area, we can easily compute $\mu_{x_1,y_1,x_2,y_2}$. Based on the summed area table, $\sigma^2_{x_1,y_1,x_2,y_2}$ (and thus $\sigma_{x_1,y_1,x_2,y_2}$) can also be computed in constant time for each area (see Equation 12), using $I'$, the integral image of the element-wise squared memory.

$$\sigma^2_{x_1,y_1,x_2,y_2} = \frac{I'_{x_2,y_2} + I'_{x_1-1,y_1-1} - I'_{x_2,y_1-1} - I'_{x_1-1,y_2}}{(x_2 - x_1 + 1)(y_2 - y_1 + 1)} - \mu^2_{x_1,y_1,x_2,y_2} \tag{12}$$
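The identities of Equations 10 to 12 can be checked with a few lines of NumPy. The sketch below builds the integral images with cumulative sums and verifies the constant-time area sum, mean and standard deviation against direct computation; the coordinate conventions and names here are ours.

import numpy as np

rng = np.random.default_rng(0)
H, W, D = 3, 3, 8
memory = rng.normal(size=(H, W, D))

def integral_image(grid):
    # Equation 10 via cumulative sums, padded with a zero row and column.
    I = grid.cumsum(axis=0).cumsum(axis=1)
    return np.pad(I, ((1, 0), (1, 0), (0, 0)))

I = integral_image(memory)        # for area sums (Eq. 11)
I2 = integral_image(memory ** 2)  # for area variances (Eq. 12)

# Area with top-left (x1, y1) and bottom-right (x2, y2), 1-indexed, inclusive.
x1, y1, x2, y2 = 1, 2, 2, 3
area_sum = I[x2, y2] + I[x1 - 1, y1 - 1] - I[x2, y1 - 1] - I[x1 - 1, y2]
size = (x2 - x1 + 1) * (y2 - y1 + 1)
mu = area_sum / size
var = (I2[x2, y2] + I2[x1 - 1, y1 - 1] - I2[x2, y1 - 1] - I2[x1 - 1, y2]) / size - mu ** 2

block = memory[x1 - 1:x2, y1 - 1:y2].reshape(-1, D)
assert np.allclose(area_sum, block.sum(axis=0))
assert np.allclose(mu, block.mean(axis=0))
assert np.allclose(var, block.var(axis=0))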

The core component for computing these quantities is the ability to quickly compute the sum of vectors in each area once we obtain the integral image $I$ for each coordinate $(x, y)$, as shown in Equations 10 and 11. We present the pseudocode for performing Equation 10 and Equation 11, as well as computing the shape size of each area, in Algorithm 1, and the code for computing the mean, sum and standard deviation (Equation 12) in Algorithm 2. The pseudocode is based on tensor operations, which can be implemented efficiently using libraries such as TensorFlow (https://github.com/tensorflow/tensorflow) and PyTorch (https://github.com/pytorch/pytorch).

Input: A tensor $G$ of shape $[H, W, D]$ that represents a grid with height $H$ and width $W$, where each item is a vector of depth $D$.
Output: Sum of vectors of each area, $U$, and height and width of each area, $S_h$ and $S_w$.
Hyperparameters: maximum area height $A$ and width $B$ allowed.
1 Compute the horizontal integral image $I_h$ by cumulative sum along the horizontal direction over $G$;
2 Compute the integral image $I$ by cumulative sum along the vertical direction over $I_h$;
3 Acquire $\hat{I}$ by padding all-zero vectors to the left and top of $I$;
4 for $h = 1, \dots, A$ do
5       for $w = 1, \dots, B$ do
6             $I_1 \leftarrow \hat{I}[h:, w:]$;
7             $I_2 \leftarrow \hat{I}[:-h, :-w]$;
8             $I_3 \leftarrow \hat{I}[h:, :-w]$;
9             $I_4 \leftarrow \hat{I}[:-h, w:]$;
10            $U_{hw} \leftarrow I_1 + I_2 - I_3 - I_4$;
11            $\bar{S}_h \leftarrow$ fill a tensor of the same spatial shape as $U_{hw}$ with value $h$ for the height of each area;
12            $\bar{S}_w \leftarrow$ fill a tensor of the same spatial shape as $U_{hw}$ with value $w$ for the width of each area;
13            $U \leftarrow [U; U_{hw}]$, reshape $U_{hw}$ to $[-1, D]$ and concatenate on the first dimension;
14            $S_h \leftarrow [S_h; \bar{S}_h]$, reshape $\bar{S}_h$ to $[-1, 1]$ and concatenate on the first dimension;
15            $S_w \leftarrow [S_w; \bar{S}_w]$, reshape $\bar{S}_w$ to $[-1, 1]$ and concatenate on the first dimension;
16      end for
17 end for
return $U$, $S_h$ and $S_w$.
Algorithm 1: Compute the vector sum and the size of each area, for all the qualified rectangular areas on a given grid.
Input: A tensor $G$ of shape $[H, W, D]$ that represents a grid with height $H$ and width $W$, where each item is a vector of depth $D$.
Output: Vector mean $M$, standard deviation $\Sigma$ and sum $U$ as well as height $S_h$ and width $S_w$ of each area.
1 Acquire $U$, $S_h$ and $S_w$ using Algorithm 1 with input $G$;
2 Acquire $U'$ using Algorithm 1 with input $G \odot G$, where $\odot$ is element-wise multiplication;
3 $M \leftarrow U \oslash (S_h S_w)$, where $\oslash$ is element-wise division;
4 $M' \leftarrow U' \oslash (S_h S_w)$;
5 $\Sigma \leftarrow \sqrt{M' - M \odot M}$;
return $M$, $\Sigma$, $U$ as well as $S_h$ and $S_w$.
Algorithm 2: Compute the vector mean, standard deviation, and sum as well as the size of each area, for all the qualified rectangular areas on a grid.
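For reference, here is a compact NumPy sketch in the spirit of Algorithms 1 and 2 (function names and slicing conventions are ours): it enumerates every rectangular area up to a maximum height and width via the integral image and returns the per-area sum, mean and standard deviation.

import numpy as np

def area_sums(G, max_h, max_w):
    # Algorithm 1 sketch: vector sum plus height/width of every rectangular
    # area up to max_h x max_w, computed from an integral image.
    H, W, D = G.shape
    I = np.pad(G.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0), (0, 0)))
    sums, heights, widths = [], [], []
    for h in range(1, max_h + 1):
        for w in range(1, max_w + 1):
            U = I[h:, w:] + I[:-h, :-w] - I[h:, :-w] - I[:-h, w:]
            U = U.reshape(-1, D)
            sums.append(U)
            heights.append(np.full(len(U), h))
            widths.append(np.full(len(U), w))
    return np.concatenate(sums), np.concatenate(heights), np.concatenate(widths)

def area_stats(G, max_h, max_w):
    # Algorithm 2 sketch: mean, standard deviation and sum of every area.
    U, Sh, Sw = area_sums(G, max_h, max_w)
    U2, _, _ = area_sums(G * G, max_h, max_w)
    size = (Sh * Sw)[:, None]
    mu = U / size
    sigma = np.sqrt(np.maximum(U2 / size - mu * mu, 0.0))  # clip tiny negatives
    return mu, sigma, U, Sh, Sw

rng = np.random.default_rng(0)
grid = rng.normal(size=(3, 3, 8))
mu, sigma, U, Sh, Sw = area_stats(grid, max_h=2, max_w=2)
print(mu.shape)  # (25, 8): the 25 areas of a 3x3 grid with maximum area size 2x2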

4 Experiments

We experimented with area attention on two important tasks: neural machine translation (including both token- and character-level translation) and image captioning, where attention mechanisms have been a common component in model architectures. The architectures we investigate involve several popular encoder and decoder choices, such as LSTM (Hochreiter & Schmidhuber, 1997) and Transformer (Vaswani et al., 2017). The attention mechanisms in these tasks include both self-attention and encoder-decoder attention.

4.1 Token-Level Neural Machine Translation

Transformer (Vaswani et al., 2017) recently established state-of-the-art performance on the WMT 2014 English-to-German and English-to-French tasks, while LSTM with encoder-decoder attention has been a popular choice for neural machine translation. We use the same datasets as (Vaswani et al., 2017): the WMT 2014 English-German dataset contains about 4.5 million English-German sentence pairs, and the English-French dataset has about 36 million English-French sentence pairs (Wu et al., 2016). A token is either a byte pair (Britz et al., 2017) or a word piece (Wu et al., 2016), as in the original Transformer experiments.

4.1.1 Transformer Token-Level MT Experiments

Transformer heavily uses attentional mechanisms, including both self-attention in the encoder and the decoder, and attention from the decoder to the encoder. We vary the configuration of Transformer to investigate how area attention impacts the model. In particular, we investigated the following variations of Transformer: Tiny (#hidden layers=2, hidden size=128, filter size=512, #attention heads=4), Small (#hidden layers=2, hidden size=256, filter size=1024, #attention heads=4), Base (#hidden layers=6, hidden size=512, filter size=2048, #attention heads=8) and Big (#hidden layers=6, hidden size=1024, filter size=4096, #attention heads=16).

During training, sentence pairs were batched together based on their approximate sequence lengths. All the model variations except Big used a training batch that contained a set of sentence pairs amounting to approximately 32,000 source and target tokens, and were trained on one machine with 8 NVIDIA P100 GPUs for a total of 250,000 steps. Given this batch size, each training step for the Transformer Base model, on 8 NVIDIA P100 GPUs, took 0.4 seconds for Regular Attention, 0.5 seconds for the basic form of Area Attention (Eq.3 and Eq.4), and 0.8 seconds for Area Attention using multiple features (Eq.9 and Eq.4).

For Big, due to the memory constraint, we had to use a smaller batch size that amounts to roughly 16,000 source and target tokens, and trained the model for 600,000 steps. Each training step took 0.5 seconds for Regular Attention, 0.6 seconds for the basic form of Area Attention (Eq.3 and 4), and 1.0 seconds for Area Attention using multiple features (Eq.9 and 4). Similar to previous work, we used the Adam optimizer with a varying learning rate over the course of training; see (Vaswani et al., 2017) for details.

Model Regular Attention Area Attention (Eq.3 and 4) Area Attention (Eq.9 and 4)
EN-DE EN-FR EN-DE EN-FR EN-DE EN-FR
Tiny 18.60 27.07 19.30 27.39 19.45 27.79
Small 22.80 31.91 22.86 32.47 23.19 32.97
Base 27.96 39.10 28.17 39.22 28.52 39.28
Big 29.43 40.88 29.48 41.06 29.68 41.03
Table 1: The BLEU scores on token-level translation tasks for the variations of the Transformer-based architecture.

We applied area attention to each of the Transformer variations, with a maximum area size of 5, to both the encoder and decoder self-attention, and to the encoder-decoder attention in the first two layers. We found that area attention consistently improved Transformer across all the model variations (see Table 1), even with the basic form of area attention where no additional parameters are used (Eq.3 and Eq.4). For Transformer Base, area attention achieved BLEU scores (EN-DE: 28.52 and EN-FR: 39.28) that surpassed the previously reported results for both EN-DE and EN-FR. In particular, Transformer Base with area attention even outperformed the previously reported result of Transformer Big with regular attention (Vaswani et al., 2017) on EN-DE.

For EN-FR, the performance of Transformer Big with regular attention, our baseline, does not match what was reported in the Transformer paper (Vaswani et al., 2017), largely due to a different batch size and a different number of training steps, although area attention still outperformed the baseline consistently. On the other hand, area attention with Transformer Big achieved BLEU 29.68 on EN-DE, improving upon the state-of-the-art result of 28.4 reported in (Vaswani et al., 2017) by a significant margin.

4.1.2 LSTM Token-Level MT Experiments

We used a 2-layer LSTM for both encoder and decoder. The encoder-decoder attention is based on multiplicative attention where the alignment of a query and a memory key is computed as their dot product (Luong et al., 2015). We vary the size of LSTM and the number of attention heads to investigate how area attention can improve LSTM with varying capacity on translation tasks. The purpose is to observe the impact of area attention on each LSTM configuration, rather than for a comparison with Transformer.

Because LSTM requires sequential computation along a sequence, it trains rather slowly compared to Transformer. To improve GPU utilization, we increased data parallelism by using a much larger batch size than for training Transformer. We trained each LSTM model on one machine with 8 NVIDIA P100 GPUs. For a model with 256 or 512 LSTM cells, we trained it for 50,000 steps using a batch size that amounts to approximately 160,000 source and target tokens. When the number of cells is 1024, we had to use a smaller batch size of roughly 128,000 tokens, due to the memory constraint, and trained the model for 625,000 steps.

In these experiments, we used the maximum area size of 2 and the attention is computed from the output of the decoder’s top layer to that of the encoder. Similar to what we observed with Transformer, area attention consistently improves LSTM architectures in all the conditions (see Table 2).

#Cells #Heads Regular Attention Area Attention (Eq.3,4) Area Attention (Eq.9,4)
EN-DE EN-FR EN-DE EN-FR EN-DE EN-FR
256 1 19.05 28.51 19.08 28.39 19.60 28.61
256 4 19.92 29 20.21 29.45 20.64 30.39
512 1 22.13 31.95 22.14 32.08 22.02 31.73
512 4 22.78 33.2 22.73 33.05 23.18 33.44
1024 1 23.8 31.66 24 34.57 23.39 34.70
1024 4 20.06 32.82 24.48 35.54 24.95 36.02
Table 2: The BLEU scores on token-level translation tasks for the LSTM-based architecture with varying model capacities.

4.2 Character-Level Neural Machine Translation

Compared to token-level translation, character-level translation requires the model to handle significantly longer sequences, which is a more difficult and often less studied task. We speculate that the ability to combine adjacent characters, as enabled by area attention, is likely useful for improving regular attentional mechanisms. As before, we experimented with both Transformer and LSTM-based architectures for this task, using the same datasets, batching and training strategies as in the token-level translation experiments.

Transformer has not previously been used for character-level translation tasks; the best results we found in the literature were reported by (Wu et al., 2016). We found that area attention consistently improved Transformer across all the model configurations (see Table 3), with the best scores on both the English-to-German and the English-to-French character-level translation tasks achieved by Transformer Big with area attention. Note that these accuracy gains are based on the basic form of area attention (see Eq.3 and Eq.4), which does not add any additional trainable parameters to the model.

Similarly, we tested LSTM architectures on the character-level translation tasks. We found area attention outperformed the baselines in most conditions (see Table 4). The improvement seems more substantial when a model is relatively small.

Model Regular Attention Area Attention (Eq.3 and 4)
EN-DE EN-FR EN-DE EN-FR
Tiny 6.97 9.47 7.39 11.79
Small 12.18 18.75 13.44 21.24
Base 24.65 32.80 25.03 33.69
Big 25.24 33.82 26.65 34.81
Table 3: The BLEU scores on character-level translation tasks for the Transformer-based architecture with varying model capacities.
#Cells #Heads Regular Attention Area Attention (Eq.3 and 4)
EN-DE EN-FR EN-DE EN-FR
256 1 9.96 16.5 11.17 17.14
256 4 11.43 17.86 12.48 19.06
512 1 16.31 23.51 16.55 24.05
512 4 17.5 25.01 17.28 24.97
1024 1 20.45 28.7 21.17 28.99
1024 4 21.28 30.11 21.53 30.28
Table 4: The BLEU scores on character-level translation tasks for the LSTM-based architecture with varying model capacities.

4.3 Image Captioning

Image captioning is the task of generating a natural language description of an image that reflects its visual content. This task has been addressed previously using a deep architecture that features an image encoder and a language decoder (Xu et al., 2015; Sharma et al., 2018). The image encoder typically employs a convolutional net such as ResNet (He et al., 2015) to embed the image and then uses a recurrent net such as LSTM, or a Transformer (Sharma et al., 2018), to encode the image based on these embeddings. For the decoder, either LSTM (Xu et al., 2015) or Transformer (Sharma et al., 2018) has been used to generate natural language descriptions. In many of these designs, attention mechanisms have been an important component that allows the decoder to selectively focus on a specific part of the image at each step of decoding, which often leads to better captioning quality.

In this experiment, we follow a champion condition in the experimental setup of (Sharma et al., 2018) that achieved state-of-the-art results. It uses a pre-trained Inception-ResNet to generate image embeddings, a 6-layer Transformer for image encoding and a 6-layer Transformer for decoding. The dimension of Transformer is 512 and the number of heads is 8. We intend to investigate how area attention improves the captioning accuracy, particularly regarding self-attention and encoder-decoder attention computed off the image, which resembles a 2-dimensional case for using area attention. We also vary the maximum area size allowed to examine the impact.

Similar to (Sharma et al., 2018), we trained each model on the training and development sets provided by the COCO dataset (Lin et al., 2014), which has 82K images for training and 40K for validation. Each of these images has at least 5 groundtruth captions. The training was conducted on a distributed learning infrastructure (Dean et al., 2012) with 10 GPU cores where updates are applied asynchronously across multiple replicas. We then tested each model on the COCO40 (Lin et al., 2014) and the Flickr 1K (Young et al., 2014) test sets; Flickr 1K is out-of-domain for the trained model. For each experiment, we report the CIDEr (Vedantam et al., 2014) and ROUGE-L (Lin & Och, 2004) metrics. For both metrics, a higher number means better captioning accuracy, i.e., a closer distance between the predicted and the groundtruth captions. Similar to the previous work (Sharma et al., 2018), we report the numerical values returned by the COCO online evaluation server (http://mscoco.org/dataset/#captions-eval) for the COCO C40 test set (see Table 5).

In the benchmark model, regular multi-head attention is used. We then experimented with several variations by adding area attention, with different maximum area sizes, to the first 2 layers of the image-encoder self-attention and the encoder-decoder (caption-to-image) attention, both of which are 2-dimensional cases of area attention. With a maximum area size of 2 by 2, an area can be 1 by 1, 2 by 1, 1 by 2, or 2 by 2, as illustrated in Figure 2; a maximum area size of 3 by 3 allows more area shapes.

Model COCO40 Flickr 1K
CIDEr ROUGE-L CIDEr ROUGE-L
Benchmark (Sharma et al., 2018) 1.032 0.700 0.359 0.416
Benchmark Replicate 1.034 0.701 0.355 0.409
Eq.3 & 4 1.060 0.704 0.364 0.420
Eq.3 & 4 1.060 0.706 0.377 0.419
Eq.9 & 4 1.045 0.707 0.372 0.420
Table 5: Test accuracy of image captioning models on COCO40 (in-domain) and Flickr 1K (out-of-domain) tasks.

We found that models with area attention outperformed the benchmark on both the CIDEr and ROUGE-L metrics by a large margin. The models using Eq.3 and Eq.4 are parameter-free: they do not use any additional parameters beyond the benchmark model, and they achieved the best results overall. Eq.9 adds a small fraction of parameters to the benchmark model and did not seem to improve on the parameter-free version of area attention, although it still outperformed the benchmark.

5 Conclusions

In this paper, we present a novel attentional mechanism that allows the model to attend to areas as a whole, where an area contains one item or a group of items in the memory. The items in an area are either spatially adjacent when the memory has a 2-dimensional structure, such as images, or temporally adjacent for 1-dimensional memory, such as natural language sentences. Importantly, the size of an area, i.e., the number of items in an area or the level of aggregation, can vary depending on the learned coherence of the adjacent items, which gives the model the ability to attend to information at varying granularity. Area attention contrasts with existing attentional mechanisms, which are item-based. We evaluated area attention on two tasks, neural machine translation and image captioning, based on model architectures such as Transformer and LSTM. On both tasks, we obtained new state-of-the-art results using area attention.

References