FAN: Focused Attention Networks

05/27/2019 ∙ by Chu Wang, et al. ∙ Adobe ∙ McGill University

Attention networks show promise for both vision and language tasks, by emphasizing relationships between constituent elements through appropriate weighting functions. Such elements could be regions in an image output by a region proposal network, or words in a sentence, represented by word embeddings. Thus far, however, the learning of attention weights has been driven solely by the minimization of task specific loss functions. Here we introduce a method of learning attention weights that better emphasizes informative pair-wise relations between entities. The key idea is to use a novel center-mass cross entropy loss, which can be applied in conjunction with the task specific ones. We then introduce a focused attention backbone to learn these attention weights for general tasks. We demonstrate that the focused attention module leads to a new state-of-the-art for the recovery of relations in a relationship proposal task. Our experiments show that it also boosts performance for diverse vision and language tasks, including object detection, scene categorization and document classification.


1 Introduction

Complex tasks involving visual perception or language interpretation are inherently contextual. In an image of an office scene, for example, a computer mouse may not easily be recognized due to its small size, but the detection of a computer keyboard will hint at its presence and constrain its possible locations. The study of objects in their context is a cornerstone of much past computer vision work belongie2007 . Scene categories are themselves often determined by the relationships between objects or environments commonly found in them zhou2017places . In natural language processing as well, words must be interpreted in their context, that is, in their relation to other words or phrases in sentences. Machine learning algorithms that learn object-to-object or word-to-word relationships have thus been sought. Among them, attention networks have shown great promise for the task of learning relationship attention weights between entities velivckovic2017graph ; vaswani2017attention . As a recent example, the scaled dot product attention module from vaswani2017attention achieves state-of-the-art performance in language translation tasks.

In the present article we propose to explicitly supervise the learning of attention weights between constituent elements of a data source, using a novel center-mass cross entropy loss. The minimization of this loss increases relation weights between entity pairs which are more commonly observed in the data, without the need for handcrafted frequency measurements. We then design a focused attention network that is end-to-end trainable and which explicitly learns pairwise element affinities without requiring relationship annotations in the data. Our experiments demonstrate that the focused attention module improves upon the baseline, and also upon the case of attention without focus, for both computer vision and natural language processing tasks. This backbone shows promise for learning informative relationships; for example, in a relationship proposal task it matches the present state-of-the-art relproposal , even without the use of ground truth relationship labels. When ground truth labels are used for focused attention learning, it leads to a further 25% relative improvement, as measured by a relationship recall metric.

2 Motivation

Figure 1: Relationships predicted on the MIT67 dataset, comparing the entities attended to (left panels) and the quality of attention weights (right panels) for Relation Networks hu2017relation versus Focused Attention Networks. Left: Relation Networks hu2017relation tend to learn weights between a reference object (the blue box) and its surrounding context, while Focused Attention Networks better emphasize relationships between distinct objects. Right: Relation Networks can suffer from a poor selection of regions to pair, or low between-object relationship weights, in comparison to Focused Attention Networks. The relation weights can be viewed by zooming in. The networks are pre-trained on the minicoco dataset. More examples are in the supplementary material.

Attention Networks – The Present State

The modeling of relations between objects, as well as objects in their common contexts, has a rich history in computer vision belongie2007 ; torralba2003contextual ; galleguillos2010context . Deep learning based object detection systems leverage attention models to this end, achieving impressive performance in recognition tasks. The scaled dot product attention module of vaswani2017attention , for example, uses learned pairwise attention weights between region proposal network (RPN) generated bounding boxes in images of natural scenes hu2017relation to boost object detection. Pixel level attention models have also been explored to aid semantic segmentation zhao2018psanet and video classification wang2018nonlocal .

Current approaches to learning the attention weights do not adequately reflect relations between entities in practice, as may occur in a typical visual scene. In fact, for a given reference object (region), Relation Networks hu2017relation tend to predict high attention weights with scaled or shifted bounding boxes surrounding the same object instance. This is likely because including surrounding context, or simply restoring missing parts of the reference object, boosts object detection. The learned relationship weights between distinct objects (regions) are often small in magnitude. Typical qualitative examples comparing Relation Networks with our Focused Attention Network are shown in Figure 1, with a quantitative comparison reported in Section 5. An analogous situation can be shown to occur empirically in applications of attention networks to natural language processing. For a document classification task, attention weights learned using Hierarchical Attention Networks hatt tend to concentrate on a few words in a sentence, as illustrated by the examples in Figure 2.

Attention Networks – Limitations

A present limitation of attention networks in various applications is the use of only task specific losses as their training objectives. There is little work thus far on explicitly supervising the learning of the weights so that they are more distributed across meaningful entities. For example, Relation Networks hu2017relation and networks applied to segmentation problems, such as PSANet zhao2018psanet , learn attention weights solely by minimizing categorical cross entropy for classification, L1 loss for bounding box localization, or pixel-wise cross entropy loss for semantic segmentation zhao2018psanet . In language tasks, including machine translation vaswani2017attention and document classification hatt , the attention weights are likewise learned solely by minimizing the categorical cross entropy loss. In what follows we shall refer to such attention networks as unsupervised.

Figure 2: Left: Visualization of the word importance factor in a sentence from the 20 Newsgroups dataset (see Section 4.5). "hatt": weights learned using Hierarchical Attention Networks hatt , "unsup": weights from the unsupervised case of our Focused Attention Network module, and "sup": weights from the supervised case. Right: semantic word-to-word relationship labels used in the supervision of our network (see Section 3.2). More examples are in the supplementary material.

Whereas attention aggregation with learned weights boosts performance for these specific tasks, our earlier examples provide evidence that relationships between distinct entities may not be adequately captured. In the present article we address this limitation by focusing the attention weight learning, applying a novel center-mass cross entropy loss on the attention matrix learned by the network. We discuss our explicit supervision of the attention weights in the following section.

3 Focusing the Attention

Given our goal of better capturing relationships between distinct entities through learned attention weights, we propose to explicitly supervise the learning of the attention relationship weights. We accomplish this by introducing a novel center-mass cross entropy loss.

3.1 Problem Statement

Given a set of entities generated by a feature embedding framework, which can be a region proposal network (RPN) fasterRCNN or a word embedding layer with a bidirectional LSTM hatt , we define $f_i$ as the embedding feature of the $i$-th entity. To compute the relatedness or affinity between entity $i$ and entity $j$, we define an attention function $\omega$ which computes the pairwise attention weight as

$$\omega_{ij} = \omega(f_i, f_j). \qquad (1)$$

A specific form of this attention function applied in this paper is reviewed in Section 4.1, and it originates from the scaled dot product attention module of vaswani2017attention .

We can now build an attention graph whose vertices represent entities in a data source, with features $f_i$, and whose edge weights $\omega_{ij}$ represent pairwise affinities between the vertices. We define the graph adjacency matrix for this attention graph as $W = [\,\omega_{ij}\,]$. We propose to supervise the learning of $W$ so that the matrix entries corresponding to entity pairs with high co-occurrence in the training data gain higher attention weights.

3.2 Supervision Target

We now discuss how to construct ground truth supervision labels in matrix form to supervise the learning of the entries of $W$. For visual recognition tasks we want our attention weights to focus on relationships between objects from different categories, so for each entry $R_{ij}$ of the ground truth relationship label matrix $R$, we assign $R_{ij} = 1$ only when: 1) entities $i$ and $j$ overlap with two different ground truth objects' bounding boxes with sufficient IOU (intersection over union), and 2) the category labels of those ground truth objects are different. For language tasks we want the attention weights to reveal meaningful word pairs according to the semantics of the language. For example, relationships between nouns and nouns, verbs and nouns, nouns and adjectives, and adverbs and verbs should be encouraged. To this end, we build a minimalistic word category pair dictionary and assign label $R_{ij} = 1$ when the word category pair of words $i$ and $j$ is found in the semantic pair dictionary. The semantic pair dictionary is shown on the right side of Figure 2, highlighting word category pairs that are considered ground truth.
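
As a concrete illustration of this labeling rule for the visual case, the sketch below builds $R$ for a single image. It is written in NumPy with hypothetical helper names (e.g. iou, relation_label_matrix) and leaves the IOU threshold as a parameter; it is a minimal sketch rather than our actual MxNet implementation (Section 8).

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def relation_label_matrix(proposals, gt_boxes, gt_labels, iou_thresh):
    """Build the ground truth relation label matrix R for one image.

    R[i, j] = 1 only when proposals i and j overlap (by at least iou_thresh)
    with two *different* ground truth boxes whose category labels differ.
    """
    n = len(proposals)
    # Match each proposal to its best-overlapping ground truth box, if any.
    match = [-1] * n
    for i, p in enumerate(proposals):
        overlaps = [iou(p, g) for g in gt_boxes]
        best = int(np.argmax(overlaps)) if overlaps else -1
        if best >= 0 and overlaps[best] >= iou_thresh:
            match[i] = best
    R = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            if match[i] < 0 or match[j] < 0 or match[i] == match[j]:
                continue
            if gt_labels[match[i]] != gt_labels[match[j]]:
                R[i, j] = 1.0
    return R
```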

Center-Mass

Intuitively, we would like to have high affinity weights at those entries where $R_{ij} = 1$, and low affinity weights elsewhere. In other words, we want the attention weights to concentrate on the 1's in the ground truth relationship label matrix $R$. We capture this via a notion of center-mass of ground truth relation weights, which we define as

$$M = \sum_{i,j} \sigma_m(W)_{ij}\, R_{ij}, \qquad (2)$$

where $\sigma_m$ is a matrix-wise softmax operation.

3.3 Center-mass Cross Entropy Loss

Key to our approach is the introduction of a center-mass cross entropy loss, which aims to focus attention weight learning so that $M$ is high for pairs of commonly occurring distinct entities. The loss is computed as

$$\mathcal{L}_{rel} = -(1 - M)^{\gamma} \log(M). \qquad (3)$$

When minimizing this loss over the center-mass, its gradient elevates those entries of $W$ that correspond to a ground truth relation label of 1 in the matrix $R$. More frequently occurring 1-labeled pairs in $R$ cumulatively receive stronger emphasis, for example, human-horse pairs versus horse-chair pairs in natural images. Furthermore, when supervising the attention learning in conjunction with another task specific loss, the matrix entries that reduce the task loss will also be optimized. The resultant dominant entries will not only reflect entity pairs with high co-occurrence, but will also help improve the main objective. The focal term $(1 - M)^{\gamma}$ focalloss helps shrink the gap between well converged center-masses and those that are far from convergence: for a higher center-mass value the gradient of the log loss is scaled down, whereas for a lower center-mass it is scaled up, which is the motivation for using a focal term focalloss . The focal term prevents committing solely to the most dominant entries, and thus promotes diversity. We fix the value of $\gamma$ empirically in our experiments.
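
For illustration, the following NumPy sketch computes the forward value of the center-mass and of the loss in equation (3). The function names are our own and the default value of gamma is only a placeholder; the actual training code implements this loss in MxNet and Keras (Section 8) so that gradients flow back into the attention weights.

```python
import numpy as np

def matrix_softmax(w):
    """Matrix-wise softmax: normalize over all entries of the attention matrix."""
    w = w - w.max()                     # subtract max for numerical stability
    e = np.exp(w)
    return e / e.sum()

def center_mass(w, r):
    """Center-mass M: total softmax attention mass falling on 1-labeled pairs."""
    return float((matrix_softmax(w) * r).sum())

def center_mass_cross_entropy(w, r, gamma=2.0, eps=1e-12):
    """Focal-style log loss on the center-mass, as in equation (3).

    gamma is a hyper-parameter; the default of 2.0 here is only a placeholder.
    """
    m = center_mass(w, r)
    return -((1.0 - m) ** gamma) * np.log(m + eps)
```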

4 Network Architecture

Our focused attention module originates from the scaled dot product attention module in vaswani2017attention . We now discuss our network structures, for both the focused attention weight learning backbone and various specific tasks, as shown in Figure 3.

Figure 3: Top: The Focused Attention Network backbone. Bottom left: we add a detection branch to the backbone, similar to hu2017relation . Bottom middle: we add a scene recognition branch to the backbone. Bottom right: we insert the Focused Attention Module into a Hierarchical Attention Network hatt .

4.1 Scaled Dot Product Attention Network

We briefly review the computation of attention weights in the scaled dot product attention module vaswani2017attention , given a pair of nodes from the attention graph defined in Section 3.1. Let an entity node $i$ consist of its feature embedding, defined as $f_i$. Given a reference entity node $i$, such as one of the blue boxes in Figure 1, the attention weight $\omega_{ij}$ indicates its affinity to a surrounding entity node $j$. Following vaswani2017attention , it is computed as a scaled dot product,

$$\omega_{ij} = \frac{(W_q f_i)^{\top} (W_k f_j)}{\sqrt{d_k}}, \qquad (4)$$

with a softmax activation over these scaled dot products producing the coefficients used for aggregation (Section 4.2). Both $W_q$ and $W_k$ are learned matrices, so this is essentially a linear transformation that projects the embedding features $f_i$ and $f_j$ into metric spaces in which to measure how well they match. The feature dimension after projection is $d_k$. From the above formulation, the attention graph affinity matrix is defined as $W = [\,\omega_{ij}\,]$.
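
A minimal sketch of equation (4) in NumPy is given below; the function name and the random initialization shown in the usage comment are ours, for illustration only.

```python
import numpy as np

def scaled_dot_product_affinity(F, W_q, W_k):
    """Affinity matrix of scaled dot products between all entity pairs (equation 4).

    F   : (N, d)   embedding features, one row per entity
    W_q : (d, d_k) query projection matrix
    W_k : (d, d_k) key projection matrix
    Returns W with W[i, j] = <W_q f_i, W_k f_j> / sqrt(d_k).
    """
    d_k = W_q.shape[1]
    return (F @ W_q) @ (F @ W_k).T / np.sqrt(d_k)

# Example usage with random features and projections (shapes are illustrative):
# F = np.random.randn(8, 64); W_q = np.random.randn(64, 64); W_k = np.random.randn(64, 64)
# W = scaled_dot_product_affinity(F, W_q, W_k)   # (8, 8) affinity matrix
```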

4.2 Focused Attention Network (FAN) Backbone

In Figure 3 top, we illustrate the base Focused Attention Network architecture. More specifically, the dot product attention weights go through a matrix-wise softmax operation to generate the attention matrix output , that is used for the focused supervision with the center-mass cross entropy loss defined in Section 3.3. We shall refer to this loss term as relation loss. In parallel, a row-wise softmax is applied to to output the coefficients , which are then used for attention weighted aggregation . The aggregation output from the FAN module is sent to a task specific loss function. The entire module is end-to-end trainable, with both the task loss and the relation loss. We now generalize the FAN module to different network architectures for application to various machine learning tasks.
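
Putting the pieces together, here is a self-contained NumPy sketch of one forward pass through the FAN backbone under the notation above: the matrix-wise softmax branch feeds the center-mass relation loss, while the row-wise softmax branch produces the aggregation sent to the task head. The function name and the placeholder gamma are ours; the real implementation is in MxNet and Keras (Section 8).

```python
import numpy as np

def fan_backbone_forward(F, W_q, W_k, R, gamma=2.0, eps=1e-12):
    """Illustrative forward pass of the Focused Attention Network backbone.

    F : (N, d) entity features;  R : (N, N) ground truth relation label matrix.
    Returns the aggregated contextual features and the relation loss value.
    """
    d_k = W_q.shape[1]
    S = (F @ W_q) @ (F @ W_k).T / np.sqrt(d_k)         # pre-activation affinities W
    # Branch 1: matrix-wise softmax -> center-mass -> relation loss (Section 3.3).
    e = np.exp(S - S.max())
    W_m = e / e.sum()                                   # matrix-wise softmax
    m = (W_m * R).sum()                                 # center-mass M
    relation_loss = -((1.0 - m) ** gamma) * np.log(m + eps)
    # Branch 2: row-wise softmax -> coefficients -> attention weighted aggregation.
    e_row = np.exp(S - S.max(axis=1, keepdims=True))
    alpha = e_row / e_row.sum(axis=1, keepdims=True)    # aggregation coefficients
    G = alpha @ F                                       # contextual features g_i
    return G, relation_loss
```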

4.3 Object Detection and Relationship Proposals

In Figure 3 bottom left, we demonstrate how to generalize the FAN module for application to object detection and relationship proposal generation. The network is end-to-end trainable with the detection loss, the RPN loss and our relation loss. On top of the ROI pooling features from the Faster R-CNN backbone fasterRCNN , contextual features from attention aggregation are applied to further boost detection performance:

$$g_i = \sum_j \alpha_{ij} f_j. \qquad (5)$$

The final feature descriptor for the detection head combines $f_i$ with its contextual feature $g_i$, following hu2017relation . In parallel, the attention matrix output $W_m$ can be used to generate relationship proposals by finding the top K weighted pairs in the matrix.
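
For example, the top-K pair extraction from the attention matrix can be sketched as follows (NumPy; the function name is ours, and the diagonal is excluded since self-pairs are not relationships).

```python
import numpy as np

def top_k_relation_pairs(W_m, k):
    """Return the k entity index pairs (i, j) with the largest attention weights.

    W_m : (N, N) attention matrix output (e.g. the matrix-wise softmax of the scores).
    """
    W = W_m.copy()
    np.fill_diagonal(W, -np.inf)          # ignore self-relations
    flat = np.argsort(W, axis=None)[::-1][:k]
    return [tuple(np.unravel_index(idx, W.shape)) for idx in flat]
```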

4.4 Scene Categorization Task

In Figure 3 bottom middle, we demonstrate how to apply the Focused Attention Network module to scene categorization. Since there are no bounding box annotations in most scene recognition datasets, we adopt a pre-trained Focused Attention Network detection module described in Section 4.3, in conjunction with a newly added convolution branch, to perform scene recognition. In order to maintain the learned relationship weights from the pre-trained module, which helps encode object co-occurrence context in the aggregation result, we fix the parameters in the convolution backbone, RPN layer and Focused Attention Network module, but make all other layers trainable. Fixed layers are shaded in grey in Figure 3 (bottom middle).

The scene categorization network works as follows. From the convolution backbone, we apply an additional convolution layer followed by global average pooling to acquire the scene level feature descriptor $f_s$. The Focused Attention Network module takes as input the object proposals' visual features $\{f_i\}$, and outputs the aggregation result as the scene contextual feature $g$. The input to the scene classification head thus combines $f_s$ with $g$, and the class scores are output.
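
A rough NumPy sketch of how these inputs could be assembled is shown below. The concatenation of $f_s$ and $g$ and the mean pooling over proposals are our assumptions for illustration; the exact reduction and fusion in our implementation may differ (see the dimensions in Figure 5).

```python
import numpy as np

def scene_head_input(conv_feature_map, proposal_features, alpha):
    """Assemble the input to the scene classification head (illustrative).

    conv_feature_map  : (C, H, W) output of the additional convolution layer
    proposal_features : (N, d)    visual features of the object proposals
    alpha             : (N, N)    row-softmax attention coefficients from the FAN module
    """
    # Scene level descriptor f_s: global average pooling over spatial locations.
    f_s = conv_feature_map.mean(axis=(1, 2))              # (C,)
    # Scene contextual feature g: attention aggregation over proposals,
    # reduced here by mean pooling (an assumed reduction).
    g = (alpha @ proposal_features).mean(axis=0)          # (d,)
    # Combine both descriptors for the classification head (assumed concatenation).
    return np.concatenate([f_s, g])
```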

4.5 Document Categorization Task

In Figure 3 bottom right, we demonstrate how to apply the Focused Attention Network module to a document classification task, using hierarchical attention networks hatt . We insert the module into the word level attention layer, placing it in parallel with the original word-to-sentence attention module. The module learns word-to-word attention from the semantic supervision labels discussed in Section 3.2. Then, through attention aggregation, a sentence level descriptor is obtained. This descriptor is concatenated with the output of the original word-to-sentence attention to yield a more comprehensive sentence level embedding. The sentence level features for the entire document are then sent to the sentence-to-document attention layer, and a final descriptor for the document is sent to the final classification layer.

Word Importance Factor

The word-to-sentence attention module directly models the importance of a single word given a learned sentence representation. The Focused Attention Network module instead first learns meaningful word-to-word attention through focused supervision, and then uses attention aggregation to compute a sentence level descriptor. Although the mechanism is different, the resultant output is comparable at the sentence level. More specifically, the word importance factor defined in equation 6 of hatt is comparable to $\alpha_j = \sum_i \alpha_{ij}$, the relative aggregation strength of word $j$ in our Focused Attention Network. Thus we define the word importance factor as $\alpha_j$, because it represents the contribution of the $j$-th word to the final aggregation result.
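
Under this column-sum reading of the aggregation coefficients (our reconstruction of the definition above), the factor can be computed as in the short NumPy sketch below; the normalization to a unit sum is an assumption made for display purposes.

```python
import numpy as np

def word_importance(alpha):
    """Word importance factors from word-to-word attention coefficients.

    alpha : (T, T) row-softmax attention matrix over the T words of a sentence.
    Word j's importance is taken as its total contribution to the aggregation,
    i.e. the column sum of attention it receives, normalized to sum to one.
    """
    importance = alpha.sum(axis=0)
    return importance / importance.sum()
```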

5 Experiments

We evaluate our Focused Attention Networks on a variety of tasks using the following datasets:
VOC07: part of the PASCAL VOC detection dataset voc . It consists of 5k images for training and 5k for testing.
MSCOCO: consists of 80 object categories mscoco . Within the 35k validation images of the COCO2014 detection benchmark, a selected 5k subset named "minival" is commonly used when reporting test time performance for ablation studies hu2017relation . We used the remaining 30k validation images for training and the 5k "minival" images for testing. We refer to this split as "minicoco".
Visual Genome: a large scale relationship understanding benchmark visualgenome , consisting of 150 object categories and human annotated relationship labels between objects. We used 70k images for training and 30k for testing, as in the scene graph literature neuralmotifs ; xu2017scenegraph .
MIT67: a scene categorization benchmark consisting of 67 scene categories, with each category having 80 training images and 20 test images mit67 .
20 Newsgroups: a document classification dataset consisting of text documents from 20 categories 20news , with 11,314 training documents and 7,532 test documents.

5.1 Network Training

In our experiments we use the ResNet101 architecture as the CNN backbone he2016deep . Further details on the hyper-parameters used in training and on the input/output dimensions can be found in the supplementary material. Following Section 4.3, we first train the detection-and-relation joint framework end-to-end with a detection task loss and a relation loss on the minicoco dataset. We refer to this network as "FAN-minicoco". We report detection results as well as relation learning quality. We then fine-tune the scene task structure on the MIT67 dataset, using the pre-trained FAN-minicoco network (see Section 4.4), and report scene categorization performance.

5.2 Relationship Proposal Task

Relationship Recall Metric

We evaluate the learned relationships using a recall metric defined as $\text{Recall@K} = N_{match} / N_{gt}$. Here $N_{gt}$ stands for the number of unique ground truth relations in a given image and $N_{match}$ stands for the number of unique matched ground truth relations in the top-K ranked relation weight list. In the calculation of $N_{match}$, we only consider a match when both bounding boxes in a given relationship pair have IOU overlaps above a threshold with the corresponding ground truth boxes in a ground truth relationship pair. Therefore, Recall@K measures how well the top-K ranked relation weights capture the ground truth labeled relationships.
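
A simplified sketch of this metric is given below. The argument names are ours, and the assumption that each ground truth relation pair is matched at most once, as well as ordering and deduplication details, may differ in our actual evaluation code.

```python
def relation_recall_at_k(pred_pairs, gt_rel_pairs, iou_matrix, iou_thresh):
    """Recall@K = (# unique matched ground truth relations) / (# unique ground truth relations).

    pred_pairs   : list of (i, j) proposal index pairs, ranked by relation weight (top K)
    gt_rel_pairs : list of (a, b) ground truth box index pairs labeled as related
    iou_matrix   : (N, M) IOU between each of the N proposals and each of the M ground truth boxes
    iou_thresh   : minimum overlap for a proposal to count as matching a ground truth box
    """
    matched = set()
    for (i, j) in pred_pairs:
        for idx, (a, b) in enumerate(gt_rel_pairs):
            if idx in matched:
                continue
            # Both boxes of the predicted pair must sufficiently overlap the
            # corresponding ground truth boxes of a labeled relationship.
            if iou_matrix[i][a] > iou_thresh and iou_matrix[j][b] > iou_thresh:
                matched.add(idx)
                break
    return len(matched) / max(len(gt_rel_pairs), 1)
```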

Ablation Study on Focused Supervision We first provide a model ablation study, examining different strategies for supervising the focused attention. For each case we train the detection-and-relation joint framework from Section 4.3 on the VOC07 dataset. First, we apply a row-wise softmax over the pre-activation matrix $W$, calculate the center-mass in a row-wise manner, and apply the center-mass cross entropy loss accordingly. We refer to this as "row". Second, we apply the supervision explained in Section 3.3 but without the focal term, and refer to this as "mat". Finally, we add the focal term to the matrix supervision, referring to this as "mat-focal". The results are summarized in Figure 4 (left). They indicate that the focused attention weights, when supervised using the center-mass cross entropy loss with a focal term (Section 3.3), are indeed better concentrated on inter-object relationships, as reflected by the recall metric, when compared with the unsupervised case. In addition, "mat-focal" is superior to the other explored strategies of supervision. Thus, in all our experiments, unless stated otherwise, we apply the matrix supervision with the focal term.

Figure 4: Recall metric results with varying top K. Left: Model ablation study on the VOC07 test set, with ground truth relation labels constructed following Section 3.2. Right: Recall comparison for the Visual Genome dataset, where the ground truth relation labels are human annotated. See the text for a discussion.

Relationship Proposal Recall Evaluation

We now evaluate the relationships learned by the unsupervised Focused Attention Network model (similar to hu2017relation ), the Focused Attention Network supervised with weak relation labels described in Section 3.2, as well as the case of supervision with human annotated ground truth relation labels. We refer to the three models as “unsup”, “sup-cate”, and “sup-gt”. We also include the reported recall metric from Relationship Proposal Networks relproposal , which is a state-of-the-art level relationship learning network with strong supervision, using ground truth relationships. The evaluation of recall on the Visual Genome dataset is summarized in Figure 4 (right). Our center-mass cross entropy loss does not require potentially costly human annotated relationship labels for learning, yet it achieves the same level of performance as the present state-of-the-art relproposal (the green curve in Figure 4 right). When supervised with the ground truth relation labels instead of the weak labels (Section 3.2), we significantly outperform relation proposal networks (by about 25% in relative terms for all K thresholds) with this recall metric (the red curve in Figure 4 right).

VOC07 | base F-RCNN | FAN + task loss hu2017relation | FAN + task loss + relation loss
avg mAP (%) | 47.0 | 47.6 | 48.0
mAP@0.5 (%) | 78.2 | 79.4 | 80.0

minicoco | base F-RCNN | FAN + task loss hu2017relation | FAN + task loss + relation loss
avg mAP (%) | 26.8 | 27.5 | 27.9
mAP@0.5 (%) | 46.6 | 47.4 | 47.8

Table 1: Object Detection Results. mAP@0.5: average precision at a bounding box overlap (IOU) threshold of 0.5. avg mAP: mAP averaged over multiple bounding box overlap thresholds.

5.3 Object Detection Task

In Table 1 we provide results on object detection using the PASCAL VOC voc and MSCOCO mscoco benchmarks. In both cases we improve upon the baseline and slightly outperform the unsupervised case (similar to Relation Networks hu2017relation ). This suggests that relation weights learned using our focused attention network are at least as good as those from hu2017relation , in terms of object detection performance.

MIT67 | CNN | CNN | CNN + ROIs | CNN + FAN-unsup | CNN + FAN-minicoco
Pretraining | Imgnet | Imgnet+COCO | Imgnet+COCO | Imgnet+COCO | Imgnet+COCO
Accuracy (%) | 75.1 | 76.8 | 77.9 | 76.9 | 80.1

Table 2: MIT67 Scene Categorization Results. See the text in Section 5.4 for a discussion, and Section 4.4 for details regarding the feature descriptors used in each configuration.

5.4 Scene Categorization Task

We adopt the FAN-minicoco network (Section 5.1), and add an additional scene task branch to fine-tune it on MIT67, as discussed in Section 4.4. We then apply this model to the MIT67 dataset, with the results shown in Table 2. We refer to the backbone as "CNN" (first column). In the second column we apply FAN-minicoco, but without its detection branch. In the third column we apply FAN-minicoco, with the detection branch but without the FAN module. In the fourth column we apply FAN-minicoco trained without the relation loss. Finally, in the fifth column we apply our full Focused Attention Network, as explained in Section 4.4. It is evident that the supervised case (fifth column) demonstrates a non-trivial improvement over the baseline (third column) and also significantly outperforms the unsupervised case (fourth column). This suggests that relation weights learned solely by minimizing the detection loss do not generalize well to a scene task, whereas those learned by our Focused Attention Network, supervised with weak relation labels, do. We hypothesize that recovering informative relations between distinct objects, which is what our Focused Attention Network is designed to do, is particularly beneficial for scene categorization.

20news | Hatt hatt | FAN-hatt | FAN-hatt-cate | FAN-hatt-semantic
Accuracy (%) | 64.0 | 64.5 | 65.0 | 65.3

Table 3: Document categorization results for the 20 Newsgroups dataset. See the text for a discussion.

5.5 Document Categorization Task

We present document classification results on the 20 Newsgroups dataset in Table 3. We provide a comparison between the base Hierarchical Attention Network (Hatt) and FAN-hatt, explained in Section 4.5, with and without relation loss supervision. More specifically, FAN-hatt-semantic uses the supervision labels explained in Section 3.2, while FAN-hatt-cate simply considers all word pairs from different word categories to be ground truth relations. The semantic focused attention supervision results in an improvement over both the unsupervised case and the baseline. This suggests that our Focused Attention Network provides a more comprehensive sentence level embedding. In addition, the qualitative distributions in Figure 2 suggest that focused semantic attention encourages more diversity in assigning word importance. As a result, more relevant words are incorporated in the sentence level embedding, which in turn leads to better document classification performance.

6 Conclusion

Our Focused Attention Network is versatile, and allows the user to direct the learning of attention weights in the manner they choose. In proof of concept experiments we have demonstrated the benefit of learning relations between distinct objects for computer vision tasks, and between lexical categories (words) for a natural language processing task. It not only boosts performance in object detection, scene categorization and document classification, but also leads to state-of-the-art performance in a relationship proposal task. In the future we envision its use as a component for deep learning architectures where supervised control of relationship weights is desired, since it is adaptable, modular, and end-to-end trainable in conjunction with a task specific loss.

References

  • [1] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. IJCV, 2010.
  • [2] Carolina Galleguillos and Serge Belongie. Context based object categorization: A critical survey. Computer vision and image understanding, 2010.
  • [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [4] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. CVPR, 2018.
  • [5] Thorsten Joachims. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. Carnegie-Mellon University Technical Report, 1996.
  • [6] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
  • [7] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
  • [8] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. CVPR, 2017.
  • [9] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. ECCV, 2014.
  • [10] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. CVPR, 2009.
  • [11] Andrew Rabinovich, Andrea Vedaldi, Carolina Galleguillos, Eric Wiewiora, and Serge J. Belongie. Objects in context. ICCV, 2007.
  • [12] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NIPS, 2015.
  • [13] Antonio Torralba. Contextual priming for object detection. IJCV, 2003.
  • [14] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NIPS, 2017.
  • [15] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [16] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CVPR, 2018.
  • [17] Danfei Xu, Yuke Zhu, Christopher Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. CVPR, 2017.
  • [18] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. NAACL-HLT, 2016.
  • [19] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. CVPR, 2018.
  • [20] Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang, and Ahmed Elgammal. Relationship proposal networks. CVPR, 2017.
  • [21] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. ECCV, 2018.
  • [22] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

7 Network Training Details

Figure 5: The input/output dimension details pertaining to Figure 3 of our main article. The dimensions shown are for the case of a batch size of 1. Left: we add a detection branch to the backbone. Middle: we add a scene recognition branch to the backbone. Right: we insert the Focused Attention Module into a Hierarchical Attention Network.

Vision Tasks

Unless stated otherwise, all the vision task networks are based on a ResNet101 [3] structure trained with a batch size of 2 (images), using an initial learning rate that is decreased after 5 epochs, with 8 epochs in total for each training session. SGD with momentum is used as the optimizer. The number of RPN proposals is fixed, so the attention weight matrix has a fixed dimension for a single image. Further details regarding the input/output dimensions of the intermediate layers can be found in Figure 5 (left and middle).

Language Task

For the document classification task, the network structure is based on a Hierarchical Attention Network [18]. For all experiments, the batch size (in documents), the word embedding dimension, the maximum number of words in a sentence, and the maximum number of sentences in a document are all fixed; the word level Focused Attention Network's attention weight matrix therefore has a fixed dimension for a single sentence. The output dimension of the Bi-LSTMs is set to 100, and the attention dimension in the attention models is also set to 100. The Adam optimizer [6] is applied. The network is trained end-to-end with the categorization loss and the relation loss for 15 epochs. Further details regarding the input/output dimensions of intermediate layers can be found in Figure 5 (right).

8 Implementation Details

We ran multiple trials of our experiments and observed that the results are relatively stable and are reproducible. Furthermore, we plan to release our code upon acceptance of this article. Given that all our datasets are publicly available this will allow other researchers to both reproduce our experiments and use our Focused Attention Network module for their own research.

Vision Tasks

We implemented the center-mass cross entropy loss as well as the Focused Attention Module using MxNet. For the Faster R-CNN backbone, we adopted the source code from Relation Networks [4].

Language Tasks

We implemented the Hierarchical Attention Networks according to [18] in Keras with a TensorFlow backend. The word-to-word Focused Attention Network module, as well as the center-mass cross entropy loss, are also implemented in the same Keras based framework.

Runtime and machine configuration

All our experiments are carried out on a Linux machine with 2 Titan XP GPUs, an Intel Core i9 CPU and 64GB of RAM. The Figures and Tables referred to in the following list are those in the main article.

  • Figure 4, Relationship Proposal. For a typical run of Visual Genome Focused Attention Network training, it takes 55 hours for 8 epochs using the above machine configuration.

  • Table 1, Object Detection. For a typical run of VOC07 Focused Attention Network training, it takes 4 hours when training for 8 epochs. For a typical run on minicoco, it takes 26 hours using the same setup.

  • Table 2, Scene Categorization. For a typical run of the MIT67 dataset, it takes 2 hours when training for 8 epochs.

  • Table 3, Document Classification. For a typical run on the 20 Newsgroups dataset, it takes 30 minutes for 15 epochs.

We also determined that when compared with unsupervised cases of the above experiments, the use of the Focused Attention Network module does not add any noticeable run time overhead.

9 Additional Results

9.1 Convergence of Center-Mass

COCO | Training | Testing
un-sup [4] | 0.020 | 0.013
sup-obj | 0.747 | 0.459
Table 4: Center-mass values for the FAN-minicoco network during training and testing. The values reported are evaluated on the minicoco train/test sets.

We provide additional results illustrating the convergence of the center-mass during training in Table 4. The center-mass is the sum of the element-wise product between the post-softmax attention weight matrix and the ground truth label matrix, $M = \sum_{i,j} \sigma_m(W)_{ij} R_{ij}$; for more details, please refer to Section 3.2 in the main article. Applying the FAN-minicoco network described in Section 5 of the main article, "sup-obj" stands for focusing the attention using the ground truth label matrix constructed following Section 3.2 of the main article, and "un-sup" stands for the unfocused case in which the relation loss is removed during training, which is similar to [4]. The converged center-mass value for the supervised case is much higher than that for the unsupervised case. Empirically, these results suggest that our relation loss, whose design goal is to increase the center-mass during learning, is effective. Furthermore, the gap between the training center-mass and the testing one is reasonable for the supervised case, i.e., we do not appear to be suffering from over-fitting. We have observed the same general trends for the other tasks as well.

9.2 Relation Visualizations

Visual Relationships

We now provide additional qualitative visualizations showing typical relationship weights learned by our method. In Figure 6, we visualize the predicted relationships on images from the MIT67 dataset, using a Focused Attention Network pre-trained on the minicoco dataset, referred to as FAN-minicoco, as discussed in Section 5.1 of the main article. We compare this with the corresponding unsupervised case, which is similar to Relation Networks [4].

Word Importance in a Sentence

In Figures 7, 8 and 9 we provide additional visualizations of the word importance factor in a sentence (defined in Section 4.5 of the main article), using the same format as that used in Figure 2 in the main article.

Figure 6: The visualization of relationships recovered on additional images of the MIT67 dataset; left column: Relation Networks [4], right column: Focused Attention Networks. See the caption of Figure 1 of the main article for an explanation.
Figure 7: The visualization of the word importance factor in a sentence. See Section 3.2 and the caption of Figure 2 of the main article for an explanation.
Figure 8: Additional visualization of the word importance factor in a sentence. See Section 3.2 and the caption of Figure 2 of the main article for an explanation.
Figure 9: Additional visualization of the word importance factor in a sentence. See Section 3.2 and the caption of Figure 2 of the main article for an explanation.