Multi-Label Image Recognition with Multi-Class Attentional Regions

07/03/2020 ∙ by Bin-Bin Gao, et al.

Multi-label image recognition is a practical and challenging task compared to single-label image classification. However, previous works may be suboptimal because of a great number of object proposals or complex attentional region generation modules. In this paper, we propose a simple but efficient two-stream framework to recognize multi-category objects from global image to local regions, similar to how human beings perceive objects. To bridge the gap between global and local streams, we propose a multi-class attentional region module which aims to make the number of attentional regions as small as possible and keep the diversity of these regions as high as possible. Our method can efficiently and effectively recognize multi-class objects with an affordable computation cost and a parameter-free region localization module. Over three benchmarks on multi-label image classification, we create new state-of-the-art results with a single model only using image semantics without label dependency. In addition, the effectiveness of the proposed method is extensively demonstrated under different factors such as global pooling strategy, input size and network architecture.




1 Introduction

Convolutional Neural Networks (CNNs) have made revolutionary breakthroughs on various computer vision tasks. For example, on single-label image recognition (SLR), a fundamental vision task, CNNs have surpassed human-level performance [he2015delving] on large-scale ImageNet. Unlike SLR, multi-label image recognition (MLR) needs to predict a set of objects or attributes of interest present in a given image. Meanwhile, these objects or attributes usually exhibit complex variations in spatial location, object scale and occlusion. Nonetheless, MLR still has wide applications such as scene understanding [shao2015deeply], face or human attribute recognition [liu2015deep, li2016human] and multi-object perception [wei2015hcp]. These factors make MLR a practical and challenging task.

MLR can be simply addressed by using an SLR framework to predict whether an object of each category is present or not. Recently, many works have used deep CNNs to improve the performance of MLR. These works can be roughly divided into three types according to what they exploit: spatial information [wei2015hcp, yang2016exploit], visual attention [chen2018recurrent, wang2017multi, zhu2017learning, guo2019visual] and label dependency [wang2016cnn, chen2018order, chen2019multi, chenlearning].

Since the goal of MLR is to predict a set of object categories instead of producing accurate spatial locations of all possible objects, we argue that it is not necessary to spend computation resources on hundreds of object proposals as in HCP [wei2015hcp] or to incur the labor cost of bounding box annotation as in Fev+Lv [yang2016exploit]. RARL [chen2018recurrent] and RDAL [wang2017multi] introduce a reinforcement learning module and a spatial transformer layer, respectively, to localize attentional regions, and sequentially predict the label distribution based on the generated regions. The main problem of these two methods is that the generated attentional regions are always category-agnostic, and it is also difficult to guarantee the diversity of these local regions. In fact, the number of attentional regions should be as small as possible while their diversity remains high. Recently, MLGCN [chen2019multi] and SSGRL [chenlearning] try to model the label dependency with graph CNNs to boost the performance of MLR. However, in this paper, we aim to improve the performance of MLR with only image semantics.

In order to exploit the semantic information of an image, let us recall how we humans recognize multiple objects appearing in an image. Firstly, people may glance at a given image to discover some possible object regions from a global view. Then, these possible object regions guide the eye movements and help to make decisions on specific object categories in a region-by-region manner. In other words, most of the time we humans do not recognize multiple objects at a single glance but in at least two steps, from a global view to local regions. In this paper, we ask whether machines can acquire the ability to recognize multiple objects like humans.

Inspired by this observation, we propose a novel multi-label image recognition framework with Multi-Class Attentional Regions (MCAR), as illustrated in Fig. 1. This framework contains a global image stream, a local region stream, and a multi-class attentional region module. Firstly, the global image stream takes an image as the input of a deep CNN and learns global representations supervised by the corresponding labels. Then, the multi-class attentional region module discovers possible object regions using the information from the global stream, which is similar to the way we humans recognize multiple objects. Finally, these localized regions are fed to the shared CNN to obtain their predicted class distributions through the local region stream. The local region stream can recognize objects better since it flexibly focuses on the details of each object, which alleviates the difficulty of recognizing objects at different spatial locations and scales.

The contributions of this paper can be summarized into three aspects.

  • Firstly, we present a multi-label image recognition framework that can efficiently and effectively recognize multiple objects in a global-to-local manner. To the best of our knowledge, this is the first time that a global-to-local learning mechanism in a unified framework has been proposed to find possible regions for multi-label images.

  • Secondly, we propose a simple but effective multi-class attentional region module which includes three steps: generation, selection, and localization. In practice, it can dynamically generate a small number of attentional regions while keeping their diversity as high as possible.

  • Thirdly, we create new state-of-the-art results on three widely used benchmarks with only a single model. Our method provides an affordable computation cost and needs no extra parameters. In addition, we also extensively demonstrate the effectiveness of the proposed method under different conditions like global pooling strategy, input sizes and network architectures.

2 Related Works

Spatial Information. How to utilize the spatial information of an image is crucial for almost all visual recognition tasks such as image recognition [lazebnik2006beyond, he2015spatial], object detection [girshick2014rich] and semantic segmentation [zhao2017pyramid, chen2017rethinking]. It is closely related to how to design (or learn) effective features, because objects usually appear at different scales and at different spatial locations. HCP [wei2015hcp] uses BING or EdgeBoxes to generate hundreds of object proposals for each image, in a manner similar to R-CNN [girshick2014rich], and aggregates the prediction scores of these proposals to obtain the final prediction. However, a large number of proposals usually brings a huge computation cost. Fev+Lv [yang2016exploit] generates proposals using bounding box annotations and combines the local proposal features with global CNN features to produce the final feature representations. It reduces the number of proposals but introduces the labor cost of annotation.

Visual Attention. Attention mechanisms have been widely used in many vision tasks, such as visual tracking [bazzani2011learning], fine-grained image recognition [fu2017look], image captioning [xu2015show], image question answering [anderson2018bottom], and semantic segmentation [hong2016learning]. RARL [chen2018recurrent] uses a recurrent attention reinforcement learning module [mnih2014recurrent] to localize a sequence of attention regions and further predict label scores conditioned on these regions. Instead of the reinforcement learning used in RARL, RDAL [wang2017multi] introduces a spatial transformer layer [jaderberg2015spatial, yu2019delta] for localizing attentional regions from an image and an LSTM unit to sequentially predict the category distribution based on the features of these localized regions. Unlike RARL and RDAL, SRN [zhu2017learning] and ACfs [guo2019visual] combine an attention regularization loss with the multi-label loss to improve performance. Specifically, SRN [zhu2017learning] captures both spatial semantic and label correlations based on the weighted attention map, while ACfs [guo2019visual] enforces the network to learn attention consistency, i.e., the classification attention map should follow the same transformation when the input image is spatially transformed.

Label Dependency. In order to exploit label dependency, CNN-RNN [wang2016cnn] jointly learns image features and label correlations in a unified framework composed of a CNN module and an LSTM layer. The limitation is that it requires a pre-defined label order for model training. Order-Free RNN [chen2018order] relaxes the label order constraint by learning a visual attention model and a confidence-ranked LSTM. Recently, SSGRL [chenlearning] directly uses a graph convolutional network to model the label dependency among all labels. In this paper, by contrast, we deliberately avoid using any information from label dependency and aim to improve the performance of multi-label recognition with only image semantic information. We leave it as future work to further boost recognition performance by integrating label correlation into our framework.

3 MCAR Framework

In this section, we first present a two-stream framework which contains a global image stream and a local region stream. Then, we elaborate on the multi-class attentional region module, which bridges the gap between global and local views. Finally, we present the optimization details of our framework.

Figure 1: The pipeline of our MCAR framework for multi-label image recognition. MCAR first feeds an input image into a deep CNN model to extract its global feature representation through the global image stream. Then, the multi-class attentional region module roughly localizes possible object regions by using the information from the global stream. Finally, these localized regions are fed to the shared CNN to obtain their predicted class distributions through the local region stream. At the inference stage, MCAR aggregates the predictions from the global and local streams with category-wise max-pooling and produces the final prediction.

3.1 Two-Stream Framework

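Global Image Stream. Given an input image $I \in \mathbb{R}^{h \times w \times 3}$, where $h$ and $w$ are the image's height and width, we denote its corresponding label as $y = [y^1, y^2, \dots, y^C]^\top$, where $y^c$ is a binary indicator: $y^c = 1$ if image $I$ is tagged with label $c$, and $y^c = 0$ otherwise. $C$ is the number of all possible categories in the dataset.

We assume that $F = f(I; \theta)$ is the activation map of the last convolutional layer of a CNN, where $\theta$ denotes the parameters of the CNN and $F \in \mathbb{R}^{h' \times w' \times d}$. Then, a global pooling function $g(\cdot)$ encodes the activation map $F$ to a single vector $x \in \mathbb{R}^{d}$, i.e. $x = g(F)$. Here $x$ can be considered as a global feature representation of the image $I$. In order to get its prediction score, a 1×1 convolutional layer with weights $W \in \mathbb{R}^{d \times C}$ transfers $x$ to $s \in \mathbb{R}^{C}$ by

$$s = W^\top x. \tag{1}$$

We then use a sigmoid function $\sigma(\cdot)$ to turn $s$ into the range $[0, 1]$, that is

$$p_g = \sigma(s) = \frac{1}{1 + \exp(-s)}, \tag{2}$$

where $p_g$ stands for the global prediction distribution.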
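As a minimal sketch of the global stream head described above (pooling, a linear 1×1 classifier, then a sigmoid), consider the following pure-Python example. The function names, the choice of average pooling as the global pooling function, and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
import math

def global_avg_pool(F):
    """Average an activation map F[i][j][k] over all spatial positions (i, j)."""
    h, w, d = len(F), len(F[0]), len(F[0][0])
    return [sum(F[i][j][k] for i in range(h) for j in range(w)) / (h * w)
            for k in range(d)]

def linear_scores(x, W):
    """Compute class scores from a pooled feature x; W holds d rows of C weights."""
    C = len(W[0])
    return [sum(W[k][c] * x[k] for k in range(len(x))) for c in range(C)]

def sigmoid(s):
    """Element-wise logistic function mapping scores into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in s]

# Toy example (assumed values): a 2x2 map with d = 2 channels, C = 3 categories.
F = [[[1.0, 0.0], [3.0, 2.0]],
     [[0.0, 2.0], [0.0, 0.0]]]
W = [[1.0, -1.0, 0.0],
     [0.0, 1.0, 0.0]]
x = global_avg_pool(F)    # [1.0, 1.0]
s = linear_scores(x, W)   # [1.0, 0.0, 0.0]
p_g = sigmoid(s)          # one probability per category
```

Each entry of `p_g` is an independent per-category probability, so several categories can be predicted as present at once.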

Local Regions Stream. We assume that $\{R_1, R_2, \dots, R_N\}$ is a set of $N$ local regions cropped from the input image $I$. These local regions are first resized to the input size $h \times w$ by bilinear upsampling. Then, they are fed to the CNN shared with the global stream to get the prediction distributions $p_{l,1}, \dots, p_{l,N}$ with Eq. 1 and 2. Finally, these local region distributions are aggregated by a category-wise max-pooling operation:

$$p_l^c = \max_{1 \le n \le N} p_{l,n}^c, \quad c = 1, \dots, C, \tag{3}$$

where $p_{l,n}^c$ is the $c$-th category score of the local prediction $p_{l,n}$ of the $n$-th region. The subscript $l$ means the distribution is from local regions.

Note that the global and local streams share the same network and introduce no additional parameters. This is clearly different from the classical two-stream architecture, which usually contains two parallel subnetworks. The inputs to our two streams are the whole image and local regions from it, respectively. These local regions are dynamically generated by using the information of the global stream. Therefore, our framework is also different from existing methods whose inputs are always two parallel views, like video frames and optical flow in video classification [Simonyan14]. During the training stage, we jointly train these two streams. At the inference stage, we fuse the predictions from the global stream ($p_g$) and the local stream ($p_l$) with a category-wise max-pooling operation to generate the final predicted distribution of image $I$.
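The category-wise max-pooling that aggregates the local predictions, and that is used again to fuse the global and local predictions at inference, can be sketched as follows. The helper name and all score values are hypothetical and chosen only for illustration.

```python
def categorywise_max(predictions):
    """Element-wise maximum over a list of per-category score vectors."""
    return [max(scores) for scores in zip(*predictions)]

# Hypothetical scores for C = 3 categories from 2 local regions.
local_preds = [[0.9, 0.1, 0.2],
               [0.3, 0.8, 0.1]]
p_l = categorywise_max(local_preds)       # aggregated local prediction
p_g = [0.6, 0.4, 0.3]                     # hypothetical global prediction
p_final = categorywise_max([p_g, p_l])    # fused prediction at inference
print(p_l, p_final)
```

Because the maximum is taken per category, a category is scored highly as soon as any single region (or the global view) is confident about it.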

3.2 From Global to Local

Potential object regions are not available from image-level labels, so they must be generated in an efficient way. The generation module and the candidate regions should satisfy some basic principles. First, the diversity of candidate regions should be as high as possible so that they can cover all possible objects of a given multi-label image. Second, the number of candidate regions should be as small as possible to ensure efficiency, since more candidate regions require more computation resources when they are fed to the shared CNN simultaneously. Last but not least, the candidate region generation module should have a simple network architecture and few parameters to limit the computation cost and storage overhead.

Attentional Maps Generation. The class activation mapping method [zhou2016learning] intuitively shows the discriminative image regions and helps us understand how a CNN identifies a particular category. To obtain class activation maps, we directly apply the 1×1 convolutional layer to the class-agnostic activation maps from the global stream, that is

$$M = W^\top F, \tag{4}$$

where the product is applied at every spatial location and $M \in \mathbb{R}^{h' \times w' \times C}$. The class activation map of the $c$-th category is denoted as $M^c$, and $M^c_{i,j}$ directly indicates the importance of the activation at spatial location $(i, j)$ leading to the classification of an image to class $c$.

The discriminative regions of a specific class map $M^c$ differ significantly among all possible class maps $\{M^1, \dots, M^C\}$. If we employ class maps to localize the potential object regions, it is easy to satisfy the first principle, i.e., to increase the diversity of different proposals.
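Applying the classifier weights at every spatial location of the activation map, as described above, can be sketched in a few lines. The function name, the nested-list data layout and the toy values are assumptions made for illustration.

```python
def class_activation_maps(F, W):
    """Apply classifier weights at every spatial location.

    F[i][j] is a d-dimensional feature and W[k][c] the weight of channel k
    for class c. Returns M[c][i][j], one spatial map per class.
    """
    h, w, d = len(F), len(F[0]), len(W)
    C = len(W[0])
    return [[[sum(W[k][c] * F[i][j][k] for k in range(d)) for j in range(w)]
             for i in range(h)]
            for c in range(C)]

# Toy example (assumed values): a 1x2 map, d = 2 channels, C = 2 classes.
F = [[[1.0, 0.0], [0.0, 2.0]]]
W = [[1.0, 0.0],
     [0.0, 1.0]]
M = class_activation_maps(F, W)
print(M)  # [[[1.0, 0.0]], [[0.0, 2.0]]]
```

In this toy case each class responds at a different location, which is the kind of diversity between class maps that the text relies on.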

Attentional Maps Selection. The number of class activation maps is equal to that of all categories associated with a dataset. For example, there are 20 and 80 categories on PASCAL VOC and MS-COCO datasets, respectively. If we use all class maps, it leads to two problems. First, the generated regions are too many to ensure efficiency. Second, a majority of regions will be redundant or meaningless because an image usually consists of a few instances.

As the network is trained under the supervision of ground-truth labels, the predicted distribution becomes close to the ground-truth distribution. It is therefore reasonable to assume that a high category confidence means that the corresponding object is present in the image with high probability. Hence, we sort the predicted scores $p_g$ (whose dimension is equal to the number of classes) in descending order and select the top $N$ class attentional maps. In experiments, we can see that satisfactory performance can be achieved when $N$ is a small number (such as 2 or 4), which is far less than the number of all categories. Another benefit is that the proposed method may encourage the network to implicitly learn label correlations if the selected attentional maps do not fully cover all object categories. This is because the local stream is also supervised by the ground-truth label distribution.
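The selection step (sort the predicted scores in descending order and keep only the most confident categories) can be sketched as follows. The helper name and the scores are hypothetical.

```python
def select_top_n(scores, n):
    """Return the indices of the n highest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return order[:n]

# Hypothetical global scores for 5 categories; keep the 2 most confident.
p_g = [0.05, 0.92, 0.40, 0.77, 0.10]
top = select_top_n(p_g, 2)
print(top)  # [1, 3]
```

Only the attentional maps of the returned categories would then be passed on to localization, so the number of regions is independent of the total number of categories in the dataset.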

Local Regions Localization. We still denote the selected class attentional maps as $M^c$ for notational simplicity. Each $M^c$ is normalized to the range $[0, 1]$ by the sigmoid function in Eq. 2. Furthermore, we simply upsample $M^c$ to the input size $h \times w$ to align the spatial semantics between $M^c$ and the input image $I$.

Figure 2: The visualization of local region localization with class attentional map. We firstly decompose the class attentional map into two marginal distributions along row and column. Then, the class attentional region is localized by these two marginal distributions.

The value of $M^c_{i,j}$ represents the probability that spatial location $(i, j)$ belongs to the $c$-th category. In order to efficiently localize regions of interest, we decompose each selected attentional map $M^c$ into a row and a column marginal distribution, which represent the probability of an object being present at the corresponding location (as shown in Fig. 2). We compute the marginal distributions of the class attentional map over the $x$ and $y$ axes, respectively, as

$$p_x^c(j) = \max_{1 \le i \le h} M^c_{i,j}, \qquad p_y^c(i) = \max_{1 \le j \le w} M^c_{i,j}. \tag{5}$$

Then, $p_x^c$ and $p_y^c$ are normalized by min-max normalization such that each distribution is scaled to the range $[0, 1]$, that is

$$\hat{p}_x^c = \frac{p_x^c - \min(p_x^c)}{\max(p_x^c) - \min(p_x^c)}, \qquad \hat{p}_y^c = \frac{p_y^c - \min(p_y^c)}{\max(p_y^c) - \min(p_y^c)}. \tag{6}$$

In order to localize one discriminative region, we need to solve the following integer inequalities:

$$\hat{p}_x^c(j) \ge \tau, \qquad \hat{p}_y^c(i) \ge \tau, \qquad i \in \{1, \dots, h\},\ j \in \{1, \dots, w\}, \tag{7}$$

where $\tau$ is a constant threshold. The solution of Eq. 7 may be a single interval or a union of multiple intervals, and each interval corresponds to the spatial location of a specific object region. In fact, $\hat{p}_x^c$ or $\hat{p}_y^c$ may have one peak when the input image contains only one object, as in Fig. 3(a), and may have multiple peaks when the input image contains multiple objects of the same category at different spatial locations, as in Fig. 3(b) and 3(c). However, our objective is to recognize multiple objects in a given image, and only one discriminative region needs to be selected for each category. Therefore, some constraints have to be added such that a unique interval can be chosen among multiple feasible intervals. To achieve this goal, we pick the interval that contains the global maximum peak in the case of multiple local maximum peaks, as shown in Fig. 3(b), and choose the widest interval in the case of multiple global maximum peaks, as shown in Fig. 3(c). For all selected class attentional maps, discriminative regions are generated by solving Eq. 7 under the above constraints.
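The localization procedure above (marginals, min-max normalization, thresholding, and choosing one interval per axis) can be sketched in pure Python. This is an illustrative sketch under assumptions: the marginals are taken as max-projections, the function names are hypothetical, and a constant map is treated as yielding no region.

```python
def marginals(M):
    """Max-project a 2-D attention map onto its columns (x) and rows (y)."""
    p_x = [max(col) for col in zip(*M)]
    p_y = [max(row) for row in M]
    return p_x, p_y

def min_max(p):
    """Rescale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(p), max(p)
    if hi == lo:
        return [0.0 for _ in p]
    return [(v - lo) / (hi - lo) for v in p]

def intervals_above(p, tau):
    """Maximal runs of consecutive indices with value >= tau (inclusive ends)."""
    runs, start = [], None
    for i, v in enumerate(p):
        if v >= tau and start is None:
            start = i
        elif v < tau and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(p) - 1))
    return runs

def pick_interval(p, tau):
    """Keep the run containing the global maximum; break ties by width."""
    runs = intervals_above(p, tau)
    if not runs:
        return None
    peak = max(p)
    with_peak = [r for r in runs if peak in p[r[0]:r[1] + 1]]
    return max(with_peak, key=lambda r: r[1] - r[0])

def localize(M, tau=0.5):
    """Return the (x-interval, y-interval) of one region for attention map M."""
    p_x, p_y = marginals(M)
    return pick_interval(min_max(p_x), tau), pick_interval(min_max(p_y), tau)

# Toy 3x4 attention map (assumed values) with one bright blob.
M = [[0.1, 0.2, 0.1, 0.1],
     [0.1, 0.9, 0.8, 0.1],
     [0.1, 0.7, 0.6, 0.1]]
print(localize(M))  # ((1, 2), (1, 2))
```

The two returned intervals give the column and row extents of a bounding box, so no learnable parameters are involved in this step.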

Figure 3: Some examples of marginal distributions: (a) single peak, (b) multiple peaks, (c) multiple peaks. Black curves represent the marginal distribution, the blue dashed line is the threshold $\tau$, and the interval between the two red dashed lines is the desired localization.

3.3 Two-Stream Learning

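Given a training dataset $\mathcal{D} = \{(I_i, y_i)\}_{i=1}^{K}$, where $I_i$ is the $i$-th image and $y_i$ represents the corresponding labels, the learning goal of our framework is to find $\theta$ and $W$ via jointly learning the global and local streams in an end-to-end manner. Thus, our overall loss function is formulated as the weighted sum of two streams,

$$\mathcal{L} = \mathcal{L}_g + \mathcal{L}_l, \tag{8}$$

where $\mathcal{L}_g$ and $\mathcal{L}_l$ represent the global and the local loss, respectively. Specifically, we adopt the binary cross entropy loss for both the global and the local stream,

$$\mathcal{L}_{\ast} = -\sum_{i=1}^{K} \sum_{c=1}^{C} \left[ y_i^c \log p_{\ast,i}^c + (1 - y_i^c) \log\left(1 - p_{\ast,i}^c\right) \right], \quad \ast \in \{g, l\}, \tag{9}$$

where $p_{g,i}^c$ and $p_{l,i}^c$ are the prediction scores of the $c$-th category of the $i$-th image from the global and local stream, respectively. Optimization is performed using SGD and standard back propagation.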
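For a single image, the two-stream objective with binary cross entropy on each stream can be sketched as follows. Equal weights for the two terms and the small clipping constant `eps` are assumptions of this sketch, and the function names are hypothetical.

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross entropy summed over categories for one image.

    y holds 0/1 labels and p the predicted probabilities; eps guards log(0).
    """
    return -sum(t * math.log(max(q, eps)) + (1 - t) * math.log(max(1 - q, eps))
                for t, q in zip(y, p))

def two_stream_loss(y, p_g, p_l):
    """Sum of the global and local BCE terms (equal weights assumed here)."""
    return bce(y, p_g) + bce(y, p_l)

# Toy example: 2 categories, both streams maximally uncertain.
y = [1, 0]
loss = two_stream_loss(y, [0.5, 0.5], [0.5, 0.5])
print(loss)  # four times log(2)
```

Both terms are computed against the same image-level labels, which is why the local stream needs no extra annotation.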

4 Experiments

In this section, we firstly report extensive experimental results and comparisons that demonstrate the effectiveness of the proposed method. Next, we present an ablation study to carefully evaluate and discuss the contribution of the crucial components. Finally, we visualize the produced local regions to further help us understand how the network recognizes a multi-label image.

4.1 Experiment Setting

Implementation Details. We perform experiments to validate the effectiveness of the proposed MCAR on three benchmarks in multi-label classification: MS-COCO [lin2014microsoft], PASCAL VOC 2007 and 2012 [everingham2010pascal], using the open-source framework PyTorch.

Following recent MLR works, we compare the proposed method with state-of-the-art methods using the powerful ResNet-101 [he2016deep] model. Some popular and lightweight models such as MobileNet-v2 [sandlermobilenetv2] and ResNet-50 [he2016deep] are also used to further evaluate our method. In general, for each of these networks we remove the fully-connected layers before the final output and replace them with global pooling followed by a 1×1 convolutional layer and a sigmoid layer. These models are all pre-trained on ImageNet, and we train them with stochastic gradient descent (SGD) with momentum using image-level labels only.

During training, all input images are resized to a fixed size (i.e., 256×256 or 448×448) with random horizontal flips and color jittering for data augmentation. To speed up convergence, we do not use random cropping: although it can improve performance, it needs more training time. We train all networks for 60 epochs in total. Unless otherwise stated, we set the number of selected attentional maps $N$ to 4 and the localization threshold $\tau$ to 0.5 in our experiments. The effects of these hyper-parameters ($N$ and $\tau$) are discussed in the ablation study.

Evaluation Metrics. MLR performance is mainly evaluated with two metrics: the average precision (AP) for each category and the mean average precision (mAP) over all categories. We first employ AP and mAP to evaluate all the methods. Following the conventional setting [wei2015hcp, chen2019multi, chenlearning], we also compute the precision, recall and F1-measure for performance comparison on the MS-COCO dataset. For each image, we assign a positive label if its prediction probability is greater than a threshold, and compare the assigned labels with the ground-truth labels. The overall precision (OP), recall (OR), F1-measure (OF1) and per-category precision (CP), recall (CR), F1-measure (CF1) are computed as follows:

$$\begin{aligned}
\mathrm{OP} &= \frac{\sum_i N_i^c}{\sum_i N_i^p}, & \mathrm{CP} &= \frac{1}{C} \sum_i \frac{N_i^c}{N_i^p}, \\
\mathrm{OR} &= \frac{\sum_i N_i^c}{\sum_i N_i^g}, & \mathrm{CR} &= \frac{1}{C} \sum_i \frac{N_i^c}{N_i^g}, \\
\mathrm{OF1} &= \frac{2 \times \mathrm{OP} \times \mathrm{OR}}{\mathrm{OP} + \mathrm{OR}}, & \mathrm{CF1} &= \frac{2 \times \mathrm{CP} \times \mathrm{CR}}{\mathrm{CP} + \mathrm{CR}},
\end{aligned} \tag{10}$$

where $N_i^c$ is the number of images correctly predicted for the $i$-th category, $N_i^p$ is the number of predicted images for the $i$-th category, and $N_i^g$ is the number of ground truth images for the $i$-th category. We also compute these metrics in another way, in which each image is assigned the labels with the top-3 highest scores. It is worth noting that these metrics may be affected by the threshold. Among them, OF1 and CF1 are more stable than OP, CP, OR and CR, while AP and mAP are the most important metrics and provide a more comprehensive comparison.
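The overall and per-category metrics can be computed directly from three per-category counts. The sketch below uses a hypothetical function name and toy counts, and assumes every category has at least one predicted and one ground-truth image so that no division by zero occurs.

```python
def multilabel_metrics(n_correct, n_pred, n_gt):
    """Overall and per-category precision, recall and F1 from per-category counts.

    n_correct[i], n_pred[i], n_gt[i]: correctly predicted, predicted and
    ground-truth image counts for category i (all assumed to be non-zero).
    """
    C = len(n_correct)
    op = sum(n_correct) / sum(n_pred)
    or_ = sum(n_correct) / sum(n_gt)
    cp = sum(c / p for c, p in zip(n_correct, n_pred)) / C
    cr = sum(c / g for c, g in zip(n_correct, n_gt)) / C
    of1 = 2 * op * or_ / (op + or_)
    cf1 = 2 * cp * cr / (cp + cr)
    return {"OP": op, "OR": or_, "OF1": of1, "CP": cp, "CR": cr, "CF1": cf1}

# Toy counts for two categories (assumed values).
m = multilabel_metrics(n_correct=[8, 2], n_pred=[10, 4], n_gt=[16, 2])
print(m)
```

In this toy case the frequent first category dominates the overall scores, whereas the per-category scores weight both categories equally, which is why the two families of metrics can differ noticeably.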

4.2 Comparisons with State-of-the-Art Methods

To verify the effectiveness of our method, we compare the proposed method with state-of-the-art methods on MS-COCO [lin2014microsoft] and PASCAL VOC 2007 & 2012 [everingham2010pascal].

MS-COCO. MS-COCO [lin2014microsoft] is a widely used dataset for evaluating multiple tasks such as object detection, semantic segmentation and image captioning, and it has recently been adopted to evaluate multi-label image recognition. It contains 82,081 images as the training set and 40,137 images as the validation set, and covers 80 object categories. Compared to VOC 2007 & 2012 [everingham2010pascal], both the size of the training set and the number of object categories are larger. Meanwhile, the number of labels per image, the scale of different objects and the number of images in each category vary considerably, which makes it more challenging.

                                                               All                                 Top3
Methods                         Input Size  Backbone    mAP    CP    CR    CF1   OP    OR    OF1   CP    CR    CF1   OP    OR    OF1
CNN-RNN [wang2016cnn]           -           VGG16       61.2   -     -     -     -     -     -     66.0  55.6  60.4  69.2  66.4  67.8
RDAL [wang2017multi]            -           VGG16       -      -     -     -     -     -     -     79.1  58.7  67.4  84.0  63.0  72.0
Order-Free RNN [chen2018order]  -           ResNet-152  -      -     -     -     -     -     -     71.6  54.8  62.1  74.2  62.2  67.7
ML-ZSL [lee2018multi]           -           ResNet-152  -      -     -     -     -     -     -     74.1  64.5  69.0  -     -     -
SRN [zhu2017learning]           224×224     ResNet-101  77.1   81.6  65.4  71.2  82.7  69.9  75.8  85.2  58.8  67.4  87.4  62.5  72.9
ACfs [guo2019visual]            288×288     ResNet-101  77.5   77.4  68.3  72.2  79.8  73.1  76.3  85.2  59.4  68.0  86.6  63.3  73.1
ResNet-101 [ge2018multi]        448×448     ResNet-101  -      73.8  72.9  72.8  77.5  75.1  76.3  78.3  63.7  69.5  83.8  64.9  73.1
Multi-Evidence [ge2018multi]    448×448     ResNet-101  -      80.4  70.2  74.9  85.2  72.5  78.4  84.5  62.2  70.6  89.1  64.3  74.7
SSGRL* [chenlearning]           448×448     ResNet-101  81.9   84.2  70.3  76.6  85.8  72.4  78.6  88.0  63.1  73.5  90.2  64.5  75.2
Baseline                        448×448     ResNet-101  77.1   72.7  72.3  72.5  77.4  75.5  76.5  77.8  63.5  69.9  84.0  65.5  73.6
MCAR                            448×448     ResNet-101  83.8   85.0  72.1  78.0  88.0  73.9  80.3  88.1  65.5  75.1  91.0  66.3  76.7
SSGRL [chenlearning]            576×576     ResNet-101  83.8   89.9  68.5  76.8  91.3  70.8  79.7  91.9  62.5  72.7  93.8  64.1  76.2
MCAR                            576×576     ResNet-101  84.5   84.3  73.9  78.7  86.9  76.1  81.1  87.8  65.9  75.3  90.4  67.1  77.0
Table 1: Comparisons of mAP, CP, CR, CF1 and OP, OR, OF1 (in %) of our model and state-of-the-art methods on the MS-COCO dataset. * indicates that the results are reproduced by using the open-source code [chenlearning], and - denotes that the corresponding result is not provided.

Results on MS-COCO. The results on MS-COCO are reported in Table 1. When the input size is 448×448 (the most common setting in MLR), our method is already comparable to the state-of-the-art SSGRL [chenlearning], which uses additional label dependency and a larger input to boost performance. Moreover, if we simply resize the input image to 576×576 during the testing stage while still using the model weights trained with 448×448 inputs, our method achieves 84.5% mAP, which outperforms SSGRL by 0.7 points. In order to compare fairly with SSGRL, we re-implement its experiment with 448×448 input following the same setting as described in SSGRL. In Table 1, we can see that our method outperforms SSGRL by 1.9 points in this setting (83.8% vs. 81.9%).

The performance of our method is also significantly better than that of Multi-Evidence [ge2018multi]: it improves CF1 by 3.1 points, OF1 by 1.9 points, CF1-top3 by 4.5 points and OF1-top3 by 2.0 points. Note that our baseline ResNet-101 model achieves 77.1% mAP, which should be close to that of the baseline of Multi-Evidence [ge2018multi] because their F1-measures are nearly the same. In comparison to the baseline, our method is 6.7 points higher in mAP (83.8% vs. 77.1%).

Meanwhile, we show the AP of each class in Fig. 4 for further comparison with the baseline model. Our method brings clear improvements on almost all categories, especially on some difficult categories such as “toaster” and “hair drier”. In short, MCAR outperforms all compared state-of-the-art methods and surpasses the baseline by a large margin, even though it needs neither a large number of proposals nor label dependency information. This further demonstrates the effectiveness of the proposed method for large-scale multi-label image recognition.

PASCAL VOC 2007 and 2012. PASCAL VOC 2007 and 2012 [everingham2010pascal] are the most widely used datasets for MLR. There are 9,963 and 22,531 images in VOC 2007 and 2012, respectively. Each image contains one or several labels, corresponding to 20 object categories. These images are divided into three parts: train, val and test sets. In order to compare fairly with other competitors, we follow the common setting to train our model on the train-val sets, and then evaluate the produced models on the test set. VOC 2007 contains a train-val set of 5,011 images and a test set of 4,952 images. VOC 2012 consists of 11,540 images as the train-val set and 10,991 images as the test set.

Figure 4: AP (in %) of each category of our proposed framework and the ResNet-101 baseline on the COCO dataset.

 Methods Backbone aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
 CNN-RNN [wang2016cnn] VGG16 96.7 83.1 94.2 92.8 61.2 82.1 89.1 94.2 64.2 83.6 70.0 92.4 91.7 84.2 93.7 59.8 93.2 75.3 99.7 78.6 84.0
 VGG+SVM [Simonyan15]  VGG16&19 98.9 95.0 96.8 95.4 69.7 90.4 93.5 96.0 74.2 86.6 87.8 96.0 96.3 93.1 97.2 70.0 92.1 80.3 98.1 87.0 89.7
 Fev+Lv [yang2016exploit] VGG16 97.9 97.0 96.6 94.6 73.6 93.9 96.5 95.5 73.7 90.3 82.8 95.4 97.7 95.9 98.6 77.6 88.7 78.0 98.3 89.0 90.6
 HCP [wei2015hcp] VGG16 98.6 97.1 98.0 95.6 75.3 94.7 95.8 97.3 73.1 90.2 80.0 97.3 96.1 94.9 96.3 78.3 94.7 76.2 97.9 91.5 90.9
 RDAL [wang2017multi] VGG16 98.6 97.4 96.3 96.2 75.2 92.4 96.5 97.1 76.5 92.0 87.7 96.8 97.5 93.8 98.5 81.6 93.7 82.8 98.6 89.3 91.9
 RARL [chen2018recurrent] VGG16 98.6 97.1 97.1 95.5 75.6 92.8 96.8 97.3 78.3 92.2 87.6 96.9 96.5 93.6 98.5 81.6 93.1 83.2 98.5 89.3 92.0
 SSGRL* [chenlearning] ResNet-101 99.5 97.1 97.6 97.8 82.6 94.8 96.7 98.1 78.0 97.0 85.6 97.8 98.3 96.4 98.1 84.9 96.5 79.8 98.4 92.8 93.4
 Baseline ResNet-101 99.0 97.9 97.2 97.6 80.2 93.6 96.0 98.0 81.8 92.0 84.6 97.5 97.2 95.3 97.9 81.8 94.6 84.1 98.2 93.6 92.9
 MCAR ResNet-101 99.7 99.0 98.5 98.2 85.4 96.9 97.4 98.9 83.7 95.5 88.8 99.1 98.2 95.1 99.1 84.8 97.1 87.8 98.3 94.8 94.8
Table 2: Comparisons of AP and mAP (in %) of our model and state-of-the-art methods on PASCAL VOC 2007. * indicates methods using a larger input size.

 Methods Backbone aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
 VGG+SVM [Simonyan15] VGG16&19 99.0 89.1 96.0 94.1 74.1 92.2 85.3 97.9 79.9 92.0 83.7 97.5 96.5 94.7 97.1 63.7 93.6 75.2 97.4 87.8 89.3
 Fev+Lv [yang2016exploit] VGG16 98.4 92.8 93.4 90.7 74.9 93.2 90.2 96.1 78.2 89.8 80.6 95.7 96.1 95.3 97.5 73.1 91.2 75.4 97.0 88.2 89.4
 HCP [wei2015hcp] VGG16 99.1 92.8 97.4 94.4 79.9 93.6 89.8 98.2 78.2 94.9 79.8 97.8 97.0 93.8 96.4 74.3 94.7 71.9 96.7 88.6 90.5
 SSGRL* [chenlearning] ResNet-101 99.5 95.1 97.4 96.4 85.8 94.5 93.7 98.9 86.7 96.3 84.6 98.9 98.6 96.2 98.7 82.2 98.2 84.2 98.1 93.5 93.9
 MCAR MobileNet-v2 98.6 92.3 95.4 93.3 77.7 93.8 92.6 97.6 80.8 90.9 82.3 96.5 96.6 95.5 98.3 78.4 92.6 78.7 96.8 90.9 91.0
 MCAR ResNet-50 99.6 95.6 97.5 95.2 85.1 95.5 94.3 98.6 85.2 95.8 83.9 98.4 98.0 97.2 98.8 81.6 95.5 81.8 98.3 93.6 93.5
 MCAR ResNet-101 99.6 97.1 98.3 96.6 87.0 95.5 94.4 98.8 87.0 96.9 85.0 98.7 98.3 97.3 99.0 83.8 96.8 83.7 98.3 93.5 94.3
Table 3: Comparisons of AP and mAP (in %) of our model and state-of-the-art methods on PASCAL VOC 2012. * indicates methods using a larger input size.

Results on VOC 2007. We first report the AP for each category and the mAP over all categories on the VOC 2007 test set in Table 2. The current state of the art is SSGRL [chenlearning], which uses a GCN to model label dependency to boost performance. We can see that our method achieves the best mAP among all methods. It outperforms SSGRL [chenlearning] by 1.4 points (94.8% vs. 93.4%) even though SSGRL uses a larger input size of 576×576. Moreover, the proposed method improves on the baseline ResNet-101 model by 1.9 points under the same setting, including data augmentation and optimization hyper-parameters. Last but not least, our framework shows good performance on some difficult categories such as “bottle”, “table” and “sofa”. This shows that exploiting both global and local visual information is crucial for multi-label recognition.

Results on VOC 2012. We report the results on the VOC 2012 test set, obtained with the PASCAL VOC evaluation server, in Table 3. We compare state-of-the-art methods with our method on several backbone networks. First, with ResNet-101 as the backbone, we still achieve the best mAP with a smaller input size than SSGRL [chenlearning]. Second, with lightweight networks, i.e. MobileNet-v2 and ResNet-50, our method achieves better performance than the VGG-based methods. This implies that it is easy to extend our method to resource-restricted devices such as mobile phones.

Line No.  Methods   Global  Local  VOC 2007      MS-COCO
0         Baseline  ✓              92.9          77.1
1         MCAR      ✓              93.4 (+0.5)   81.9 (+4.8)
2         MCAR              ✓      94.2 (+1.3)   82.9 (+5.8)
3         MCAR      ✓       ✓      94.8 (+1.9)   83.8 (+6.7)
Table 4: Ablative study of two streams in MCAR. Values in parentheses are mAP gains over the baseline.

4.3 Ablation Study

In this section, we first discuss the contribution of each component in our two-stream architecture and demonstrate its effectiveness. The training details are exactly the same as those described in Section 4.2. We also present the effects of the hyper-parameters of the local region localization module, namely the number of selected regions $N$ and the threshold $\tau$. This experiment is conducted on VOC 2007 using lightweight backbone networks, i.e. MobileNet-v2 and ResNet-50, with the input size set to 256×256. Finally, we extensively analyze the effects of our method under different conditions such as different global pooling strategies, various input sizes, and different network architectures.

Contributions of the proposed two-stream framework. To explore the effectiveness of the two streams, we jointly train the global and local streams in MCAR and, at inference time, report the influence of using each stream in Table 4. First, thanks to the joint training strategy, our MCAR significantly outperforms the baseline method even when the same global image is taken as input (line 1 vs. line 0). Such an improvement is intuitive because MCAR becomes more robust and generalizes better by learning not only from the global image but also from local regions at various scales. Second, we can see that using the local stream alone performs better than using only the global stream (line 2 vs. line 1), because the local stream is able to flexibly focus on the details of each object. Nonetheless, we want to emphasize that the global stream plays an important role in guiding the learning of the local stream. Last but not least, employing both the global and local streams achieves the best results (line 3). This is similar to human perception, because we usually make a final decision after our brain gathers information from different spatial locations and object scales.
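The three inference settings compared in Table 4 (global only, local only, both) can be illustrated with a minimal sketch. The fusion rule is not stated in this section, so the category-wise maximum used below is an assumption made for illustration, and the function name `fuse_streams` and its `mode` argument are hypothetical.

```python
from typing import List, Sequence


def fuse_streams(
    global_scores: Sequence[float],
    local_scores: Sequence[Sequence[float]],
    mode: str = "both",
) -> List[float]:
    """Combine per-category scores from the global image and the local regions.

    global_scores: one score per category for the whole image.
    local_scores:  one score vector per local region.
    mode:          "global", "local" or "both" (lines 1, 2 and 3 of Table 4).
    ASSUMPTION: scores are fused with a category-wise maximum.
    """
    if mode == "global":
        return list(global_scores)
    if not local_scores:
        raise ValueError("at least one local region is required")
    num_classes = len(global_scores)
    # Best score of each category over all local regions.
    local_best = [max(region[c] for region in local_scores) for c in range(num_classes)]
    if mode == "local":
        return local_best
    if mode == "both":
        return [max(g, l) for g, l in zip(global_scores, local_best)]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    g = [0.9, 0.2, 0.4]
    regions = [[0.7, 0.1, 0.8], [0.3, 0.6, 0.5]]
    print(fuse_streams(g, regions, "local"))  # [0.7, 0.6, 0.8]
    print(fuse_streams(g, regions, "both"))   # [0.9, 0.6, 0.8]
```

In this toy example the second and third categories score low on the whole image but high on a local region, which is the situation the two-stream design is meant to handle.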

Figure 5: mAP comparisons of our MCAR with different values of the number of local regions, panels (a) MobileNet-v2 and (b) ResNet-50, and of the localization threshold, panels (c) MobileNet-v2 and (d) ResNet-50.

Number of local regions. We fix the localization threshold to 0.5 and choose the number of local regions from a given set of values. Note that a value of 0 means that we train the model using the global stream only, which is equivalent to our baseline. In Fig. 5(a) and 5(b), we show the mAP curves for different numbers of local regions. First, the mAP shows an upward trend as the number of local regions increases. This means that using more local regions helps to improve multi-label classification performance. Second, the performance tends to be stable when the number is set to 6 or 8, which implies that the improvement is not significant for larger values. Third, the performance with a small number of regions (e.g. 1, 2, or 4) is significantly better than that of a pure global stream (i.e., 0 regions). This further verifies the effectiveness of the proposed strategy of selecting high-confidence local regions. Another benefit of the region selection strategy is that it reduces the computational cost.
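As a rough illustration of selecting a small number of high-confidence regions, the sketch below keeps the categories with the highest scores. It assumes that the confidence used for ranking is the per-category score of the global stream; the helper name `select_top_categories` is hypothetical.

```python
from typing import List, Sequence


def select_top_categories(global_scores: Sequence[float], num_regions: int) -> List[int]:
    """Return the indices of the `num_regions` highest-scoring categories.

    ASSUMPTION: one candidate region is generated per selected category, and
    categories are ranked by their global-stream score.
    """
    if num_regions < 0:
        raise ValueError("num_regions must be non-negative")
    ranked = sorted(range(len(global_scores)), key=lambda i: global_scores[i], reverse=True)
    return ranked[:num_regions]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.8]
    print(select_top_categories(scores, 2))  # [1, 3]
    print(select_top_categories(scores, 0))  # [] (global stream only)
```

Because only the selected categories produce regions, the cost of the local stream grows with the number of regions rather than with the number of categories.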

Threshold of localization. To explore the sensitivity to the threshold in Eq. 7, we fix the number of local regions to 4 and test different threshold values. The whole image is considered as a local region when the threshold equals 0, which is also equivalent to the baseline method. We show the mAP as a function of the threshold in Fig. 5(c) and 5(d). First, we observe that the performance is better when the threshold is greater than 0. Second, the performance drops when the threshold is either too small or too large. We argue that if the threshold is too small, the local regions contain more context information and lack discriminative features, because all local regions are close to the original input image. When the threshold is too large, the local regions contain only the most discriminative parts of an object, which easily leads to over-fitting. A value between 0.5 and 0.7 is a good choice.
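The effect of the threshold can be sketched on a toy activation map. Eq. 7 is not reproduced in this section, so the formulation below is an assumption: the map is min-max normalized to [0, 1] and the region is the bounding box of all cells at or above the threshold `tau`. The function name `localize_region` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive cell indices


def localize_region(activation: List[List[float]], tau: float) -> Box:
    """Bounding box of the cells whose normalized activation is >= tau.

    ASSUMPTION: the activation map is min-max normalized before thresholding.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    flat = [v for row in activation for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    rows, cols = [], []
    for r, row in enumerate(activation):
        for c, value in enumerate(row):
            # A constant map carries no localization signal: keep every cell.
            normalized = (value - low) / span if span > 0 else 1.0
            if normalized >= tau:
                rows.append(r)
                cols.append(c)
    return min(rows), min(cols), max(rows), max(cols)


if __name__ == "__main__":
    cam = [
        [0.1, 0.2, 0.1, 0.0],
        [0.2, 0.9, 0.8, 0.1],
        [0.1, 0.7, 1.0, 0.1],
        [0.0, 0.1, 0.2, 0.1],
    ]
    print(localize_region(cam, 0.0))  # (0, 0, 3, 3): the whole map
    print(localize_region(cam, 0.5))  # (1, 1, 2, 2): the object core
    print(localize_region(cam, 1.0))  # (2, 2, 2, 2): only the peak
```

The three calls mirror the discussion above: a threshold of 0 returns the whole image, a moderate threshold returns the object, and a very large threshold shrinks the region to its most discriminative part.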

Global pooling strategy. Encoding spatial feature descriptors into a single vector is a necessary step in state-of-the-art CNNs. Early works, e.g. AlexNet and VGGNet, use fully connected layers, while the more recent ResNet usually employs global average pooling (GAP), which outputs the spatial average of each feature map.

GMP (Global Maximum Pooling) easily falls into over-fitting because it forces the network to learn only the most discriminative features. GAP usually has better generalization ability than GMP. However, GAP may lead to under-fitting and slow convergence because it gives the same importance to all spatial feature descriptors. Our local region localization needs to discover the discriminative region, which seems to be opposite to the objective of GAP. In order to alleviate this conflict, we propose a simple solution termed Global Weighted Pooling (GWP), which is the average of GAP and GMP.
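The three pooling operators can be written down in a few lines. The following is a minimal framework-free sketch that treats each channel as a nested list; the function names are hypothetical, and the only definition taken from the text is that GWP is the average of GAP and GMP.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel: H x W activations


def global_average_pool(channel: FeatureMap) -> float:
    """Spatial mean of one feature map (GAP)."""
    values = [v for row in channel for v in row]
    return sum(values) / len(values)


def global_max_pool(channel: FeatureMap) -> float:
    """Spatial maximum of one feature map (GMP)."""
    return max(v for row in channel for v in row)


def global_weighted_pool(channels: List[FeatureMap]) -> List[float]:
    """GWP as described in the text: the average of GAP and GMP, per channel."""
    return [0.5 * (global_average_pool(c) + global_max_pool(c)) for c in channels]


if __name__ == "__main__":
    features = [
        [[0.0, 1.0], [2.0, 5.0]],  # GAP = 2.0, GMP = 5.0
        [[4.0, 4.0], [4.0, 4.0]],  # GAP = 4.0, GMP = 4.0
    ]
    print(global_weighted_pool(features))  # [3.5, 4.0]
```

For a flat feature map the three operators coincide, while for a peaked map GWP lies halfway between the mean and the peak, which is the compromise the paragraph above describes.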

Methods       MobileNet-v2              ResNet-50                 ResNet-101
Input Size    256          448          256          448          256          448
Baseline      61.5         67.8         70.1         75.4         71.2         77.1
MCAR (GAP)    66.6 (+5.1)  74.3 (+6.5)  75.9 (+5.8)  78.0 (+2.6)  77.4 (+6.2)  80.5 (+3.4)
MCAR (GWP)    69.8 (+8.3)  75.0 (+7.2)  78.0 (+7.9)  82.1 (+6.7)  79.4 (+8.2)  83.8 (+6.7)
Table 5: Comparisons of mAP (in %) of our methods and the baseline on the COCO dataset. The improvements of our method over the baseline are given in parentheses.

In Table 5, we can see that MCAR with GWP further boosts performance on the MS-COCO dataset. With an input size of 448, it improves the mAP by 4.1 points and 3.3 points over the traditional GAP on ResNet-50 and ResNet-101, respectively. On VOC 2007, the overall performance of MCAR with GWP is slightly worse than that of MCAR with GAP. A possible reason is that the deep networks are trained on a small-scale set of training images, which makes them prone to over-fitting on VOC 2007.
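The point differences quoted above follow directly from Table 5. As a small arithmetic check, the sketch below stores the table values and recomputes the gains; the dictionary layout and the helper name `gain` are illustrative choices, not part of the method.

```python
# mAP (%) on MS-COCO, copied from Table 5: (backbone, input size) -> method -> mAP.
TABLE5 = {
    ("MobileNet-v2", 256): {"baseline": 61.5, "gap": 66.6, "gwp": 69.8},
    ("MobileNet-v2", 448): {"baseline": 67.8, "gap": 74.3, "gwp": 75.0},
    ("ResNet-50", 256): {"baseline": 70.1, "gap": 75.9, "gwp": 78.0},
    ("ResNet-50", 448): {"baseline": 75.4, "gap": 78.0, "gwp": 82.1},
    ("ResNet-101", 256): {"baseline": 71.2, "gap": 77.4, "gwp": 79.4},
    ("ResNet-101", 448): {"baseline": 77.1, "gap": 80.5, "gwp": 83.8},
}


def gain(backbone: str, size: int, method: str, reference: str = "baseline") -> float:
    """mAP difference in points between `method` and `reference`, rounded to 0.1."""
    row = TABLE5[(backbone, size)]
    return round(row[method] - row[reference], 1)


if __name__ == "__main__":
    # GWP vs. GAP at input size 448.
    print(gain("ResNet-50", 448, "gwp", "gap"))   # 4.1
    print(gain("ResNet-101", 448, "gwp", "gap"))  # 3.3
    # GWP vs. baseline at both input sizes.
    for backbone in ("MobileNet-v2", "ResNet-50", "ResNet-101"):
        print(backbone, gain(backbone, 256, "gwp"), gain(backbone, 448, "gwp"))
```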

Network architectures. Recent state-of-the-art methods usually take ResNet-101 as the backbone when reporting their performance. However, lightweight networks are widely adopted in real applications. To meet such requirements, we extensively evaluate the proposed method on MobileNet-v2 and ResNet-50 in addition to ResNet-101 and report the results in Table 5. Deeper networks tend to obtain better performance. This is not surprising, because a bigger network has more parameters and a deeper structure, which provide stronger capacity and transferability. Note that our method still performs well with the lightweight MobileNet-v2. In addition, the proposed method brings significant improvements for all backbones. In Table 5, our MCAR with GWP improves the baseline by about 7 points on all three backbones at an input size of 448×448.

Figure 6: Selected examples of region localization and classification results on PASCAL VOC 2012 testing images. Each region box is associated with a category name, a global stream score and a two-stream score, organized as “category: global score / two-stream score”. These region boxes are displayed with conditions on , , and . Note that different colors denote different rankings based on their global scores. These pictures are best viewed in color.

Input size. The performance of multi-label recognition is sensitive to the choice of input size. Generally, a larger input size tends to give better performance, as reported in Table 5. However, it is more practical to employ small inputs on resource-restricted devices. Somewhat surprisingly, MCAR benefits more from small inputs. In Table 5, we can see that our method tends to produce a larger improvement over the baseline when a smaller input size is employed (e.g., with GWP on ResNet-101, 8.2 points at size 256 vs. 6.7 points at size 448). This advantage comes from the two-stream architecture, which can look at an image in a comprehensive manner (from global to local). This indicates that our method is well suited to low-resolution inputs.

4.4 Visualization

To analyze where our model focuses in an image, we show the class attentional regions generated by the multi-class attentional region module in Fig. 6. It can be seen that these attentional regions cover almost all possible objects in each image, which is consistent with our initial intention. In addition, we find that the global prediction scores of some small-scale objects are low, e.g. a dog, a car, a bottle and a person in several of the images in Fig. 6. This indicates that it is suboptimal to use the global image stream alone, especially for small-scale and partly occluded objects. Our two-stream network mitigates this drawback because it recognizes such objects from a closer view (as reflected by their high two-stream scores).

5 Conclusion

We observe that humans recognize multiple objects in two steps, usually following a global-to-local rule. By first looking at the whole image, people discover the places that deserve more attention. These attentional regions are then examined more closely to reach a better decision. Inspired by this observation, we develop a two-stream framework that recognizes multi-label images from global to local, as the human perception system does. In order to localize object regions, we propose an efficient multi-class attentional region module which significantly reduces the number of regions while keeping their diversity. Our method can efficiently and effectively recognize multi-class objects with an affordable computation cost and a parameter-free region localization module. On three prevalent multi-label benchmarks, the proposed method achieves state-of-the-art results. In the future, we will try to integrate label dependency into our method to further boost the performance. It is also an interesting direction to explore how to extend the proposed method to weakly supervised image detection and semantic segmentation.