Semantic-Aware Scene Recognition

Scene recognition is currently one of the most challenging research fields in computer vision. This may be due to the ambiguity between classes: images of several scene classes may share similar objects, which causes confusion among them. The problem is aggravated when images of a particular scene class are notably different. Convolutional Neural Networks (CNNs) have significantly boosted performance in scene recognition, although it is still far below that of other recognition tasks (e.g., object or image recognition). In this paper, we describe a novel approach for scene recognition based on an end-to-end multi-modal CNN that combines image and context information by means of an attention module. Context information, in the form of semantic segmentation, is used to gate features extracted from the RGB image by leveraging the information encoded in the semantic representation: the set of scene objects and stuff, and their relative locations. This gating process reinforces the learning of indicative scene content and enhances scene disambiguation by refocusing the receptive fields of the CNN towards it. Experimental results on four publicly available datasets show that the proposed approach outperforms every other state-of-the-art method while significantly reducing the number of network parameters. All the code and data used in this paper are available at




1 Motivation and Problem Statement

Scene recognition is a hot research topic whose complexity, according to reported performances Zhou et al. (2018), ranks among the highest of image understanding challenges. The paradigm shift forced by the advent of Deep Learning methods, and specifically of Convolutional Neural Networks (CNNs), has significantly enhanced results, although they are still far below those achieved in tasks such as image classification, object detection and semantic segmentation Russakovsky et al. (2015); Zhou et al. (2017); Cordts et al. (2016).

Figure 1: Scene recognition performance of CNN-based solutions on the Places-365 Standard Dataset Zhou et al. (2018). Each network is represented by its Top@1 accuracy (vertical axis), its number of layers (horizontal axis) and its number of units (radius of each circle). Circle colors are only for better identification.

The complexity of the scene recognition task lies partially in the ambiguity between different scene categories showing similar appearances and object distributions: inter-scene boundaries can be blurry, as the set of objects that defines a scene might be highly similar to another’s. Therefore, scene recognition should cope with the classical tug-of-war between repeatable (handling intra-class variation) and distinctive (discriminating among different categories) characterizations. Whereas CNNs have been shown to automatically yield successful trade-off solutions, the complexity of the problem increases with the number of categories, especially when training datasets are unbalanced.

Recent studies on the semantic interpretability of CNNs suggest that the learning of scenes is inherent to the learning of the objects they include Zhou et al. (2015); Bau et al. (2017). Object detectors somehow arise as latent variables in hidden units within networks trained to recognize scenes. These detectors operate without constraining the networks to decompose the scene recognition problem Zhou et al. (2015). The number of detectors is, to some extent, larger for deeper and wider (with a larger number of units) networks. In recent years, the main trend to enhance scene recognition performance has focused on increasing the number of CNN units. However, performance does not increase linearly with the number of network parameters Zhou et al. (2018). For instance, DenseNet-161 Huang et al. (2017) is twenty times deeper than AlexNet Krizhevsky et al. (2012), and VGG-16 Simonyan and Zisserman (2014b) has twenty times more units than GoogLeNet, but the performances of DenseNet-161 and VGG-16 for scene recognition are only slightly better than those of AlexNet and GoogLeNet, respectively (see Figure 1).

This paper presents a novel strategy to improve scene recognition: the use of object-level information to guide scene learning during the training process. It is well known that, probably due to the ImageNet fine-tuning paradigm He et al. (2018), widely used models are biased towards the spatial centre of the visual representation (see Figure 2) Das et al. (2017). The proposed approach relies on semantic-driven attention mechanisms to prioritize the learning of common scene objects. This strategy achieves the best reported results on the validation set of the Places365 Standard dataset Zhou et al. (2018) (see Figure 1) without increasing the network’s width or depth, i.e. using fewer units and fewer layers than VGG-16 and DenseNet-161 respectively.

Figure 2: Scene recognition results depending on the input data. (a) RGB image corresponding to the "Kitchen" scene class. (b) Semantic segmentation of (a) obtained by a state-of-the-art CNN-based algorithm. (c) Class Activation Map (CAM) Zhou et al. (2016) using only the RGB image (a). (d) CAM using only the semantic segmentation (b). (e) CAM for the proposed approach, using both (a) and (b). Top@3 predicted classes are included at the top-left corner of each image. Better viewed in color.

Similar strategies have been recently proposed to constrain scene recognition through histograms of patch-wise object detections Wang et al. (2017b); Cheng et al. (2018). Compared to these strategies, the proposed method naturally exploits the spatial relationships between objects and emphasizes interpretability by relying on semantic segmentation. Semantic segmentation can be understood as a dense object detection task. In fact, scene recognition, object detection and semantic segmentation are interrelated tasks that share a common branch in the recently proposed taskonomies, or task-similarity trees, Zamir et al. (2018). However, the performance achieved by semantic segmentation methods is generally lower than that of object detection ones, mainly due to the difference in size and variety of their respective datasets. Nevertheless, the proposed method overcomes this gap and yields a higher scene recognition performance than object-constrained methods Wang et al. (2017b); Cheng et al. (2018), using a substantially smaller number of units (see Section 4).

In essence, the proposed work aims to enhance scene recognition without increasing the network’s capacity or depth, leading to lower training and inference times, as well as to smaller data and resource requirements for training. The specific contributions of this paper are threefold:

  1. We propose an end-to-end multi-modal deep learning architecture which gathers both image and context information using a two-branched CNN architecture.

  2. We propose to use semantic segmentation as an additional information source to automatically create, through a convolutional neural network, an attention model to reinforce the learning of relevant contextual information.

  3. We validate the effectiveness of the proposed method through experimental results on public scene recognition datasets such as ADE20K Zhou et al. (2017), MIT Indoor 67 Quattoni and Torralba (2009), SUN 397 Xiao et al. (2010) and Places365 Zhou et al. (2018) obtaining state-of-the-art results.

2 State of the Art

2.1 Scene Recognition

A variety of scene recognition methods have been proposed in recent years. Existing methods can be broadly organized into two different categories: those based on hand-crafted feature representations and those based on CNN architectures.

Among the first group, the earliest works proposed to design holistic feature representations. The GIST descriptor Oliva and Torralba (2001); Oliva (2005) was used to generate a holistic, low-dimensional representation for each image. However, precisely due to the holistic approach, GIST lacks information on the scene’s local structure. To overcome this problem, local feature representations were used to exploit the pattern of local patches and combine their representations in a concatenated feature vector. Census Transform Histogram (CENTRIST) Wu and Rehg (2011) encodes the local structural properties within an image and suppresses detailed textural information to boost scene recognition performance. To further increase generalization, Oriented Texture Curves (OTC) Margolin et al. (2014) captures patch textures along multiple orientations to be robust to illumination changes, geometric distortions and local contrast differences. Overall, despite reporting noteworthy results, these hand-crafted features, either holistic or local, exploit low-level features which may not be discriminative enough for ambiguous or highly related scenes Zhou et al. (2018). Besides, the hand-crafted nature of the features may hinder their scalability, as ad hoc designs may be required for new domains.

Solutions based on CNNs generally result in higher performances. Recent network architectures, together with multi-million-image datasets such as Places-365, far surpass the results obtained by hand-crafted features Zhou et al. (2018). CNNs exploit multi-scale feature representations using convolutional layers and do not require the design of hand-crafted features, as these are implicitly learned through the training process. Besides, CNNs combine low-level latent information such as color, texture and material with high-level information, e.g., parts and objects, to obtain better scene representations and boost scene recognition performance Bau et al. (2017). Well-consolidated network architectures, such as AlexNet Krizhevsky et al. (2012), GoogLeNet Szegedy et al. (2015), VGG-16 Simonyan and Zisserman (2014b), ResNet-152 He et al. (2016) and DenseNet-161 Huang et al. (2017), have all been benchmarked on the challenging Places-365 Standard dataset Zhou et al. (2018).

Features extracted from these networks have also been combined with different scene classification methods. For instance, Xie et al. Xie et al. (2015) proposed to use features from AlexNet and VGG with Mid-level Local Representation (MLR) and Convolutional Fisher Vector representation (CFV) dictionaries to incorporate local and structural information. Cimpoi et al. Cimpoi et al. (2015) used material and texture information extracted from a CNN filter bank and from a bag-of-visual-words to increase generalization for enhanced scene recognition. Yoo et al. Yoo et al. (2014) benefited from multi-scale CNN-based activations aggregated by a Fisher kernel framework to perform significantly better on the MIT Indoor Dataset Quattoni and Torralba (2009).

While accuracies from these networks and methods are much higher than those reported by methods based on hand-crafted features, they are far from those achieved by CNNs in other image classification problems (e.g., the best accuracies reported on the ImageNet challenge Mahajan et al. (2018) and on the Person Re-Identification task Quan et al. (2019)).

As described in Section 1, the increase in capacity (number of units) of CNNs does not, for scene recognition, lead to a linear rise in performance. This might be explained by the inability of the networks to handle inter-class similarities Cheng et al. (2018): unrelated scene classes may share objects that are prone to produce similar scene representations, weakening the network’s generalization power. To cope with this problem, some recent methods incorporate discriminative object information to constrain scene recognition. For instance, Wang et al. Wang et al. (2017b) designed an architecture where patch-based features are extracted from customized object-wise and scene-wise CNNs to construct semantic representations of the scene. A scene recognition method built on these representations (Vectors of Semantically Aggregated Descriptors (VSAD)) yields excellent performance on standard scene recognition benchmarks. VSAD’s performance has been recently enhanced by measuring correlations between objects among different scene classes Cheng et al. (2018). These correlations are then used to reduce the effect of common objects in scene misclassification and enhance the effect of discriminative objects through a Semantic Descriptor with Objectness (SDO).

Even though these methods constitute the state of the art in scene recognition, their reliance on patch-based object classification techniques to obtain object correlations requires extensive and sensitive parametrization (scale, patch size, stride, overlap…). Moreover, their descriptors are object-centered and lack information on the spatial interrelations between object instances. We instead propose to use an end-to-end CNN that exploits spatial relationships by using semantic segmentation instead of object information to guide the network’s attention. This provides equal or better performance while significantly reducing the number of network units and hyper-parameters.

2.2 Semantic Segmentation

Semantic segmentation is the task of assigning a unique object (or stuff) label to every pixel of an image. As reported in several benchmarks (MIT Scene Parsing Benchmark Zhou et al. (2017), CityScapes Dataset Cordts et al. (2016), Mapillary Challenge Neuhold et al. (2017) and COCO Challenge Lin et al. (2014)), top-performing strategies, which are completely based on end-to-end deep learning architectures, are year by year getting closer to human accuracy.

Among the top-performing strategies, Wang et al. proposed to use a dense up-sampling CNN to generate pixel-level predictions within a hybrid dilated convolution framework Wang et al. (2018). In this vein, the Unified Perceptual Parsing Xiao et al. (2018) benefits from a multi-task framework to recognize several visual concepts given a single image (semantic segmentation, materials, texture). By regularization, when multiple tasks are trained simultaneously, their individual results are boosted. Zhao et al. proposed the Pyramid Pooling Module Zhao et al. (2017) to model contextual information between different semantic labels (e.g., an airplane is likely to be on a runway or flying across the sky but rarely over the water). These spatial relationships allow reducing the complexity associated with large sets of object labels, generally improving performance. Guo et al. Guo et al. (2018) proposed the Inter-Class Shared Boundary (ISB) Encoder, which aims to encode the level of spatial adjacency between object-class pairs into the segmentation network to obtain higher accuracy for classes representing small objects.

Motivated by the success of modelling contextual information and spatial relationships, we decided to explore the use of semantic segmentation as an additional information modality to confront scene recognition.

2.3 Multi-Modal Deep Learning Architectures

Integrating complementary image modalities (e.g., depth, optical flow or heat maps) as additional sources of information to multi-modal classification models has been recently used to boost different computer vision tasks: pedestrian detection Park et al. (2013); Yang et al. (2015); Daniel Costea and Nedevschi (2016); Mao et al. (2017), action recognition Simonyan and Zisserman (2014a); Park et al. (2016) and hand-gesture recognition Abavisani et al. (2019).

In terms of pedestrian detection, numerous works have been proposed to combine different sources of information to improve mono-modal results Park et al. (2013); Yang et al. (2015); Daniel Costea and Nedevschi (2016); Mao et al. (2017). Park et al. proposed the incorporation of optical-flow-based features to model motion through a decision-forest-based detection scheme Park et al. (2013). Convolutional Channel Features (CCF) Yang et al. (2015) uses low-level features extracted from the VGG-16 network as an additional input channel to enhance pedestrian detection. Moreover, Costea and Nedevschi Daniel Costea and Nedevschi (2016) used short-range and long-range multiresolution channel features obtained by eight different individual classifiers to propose a detector which obtains state-of-the-art results at a higher speed. An extensive study on which features may help pedestrian detection has been carried out by Mao et al. Mao et al. (2017). Results suggest that mono-modal pedestrian detection results are surpassed when depth, heat maps, optical flow, or segmentation representations are integrated as additional information modalities into a CNN-based pedestrian detector.

The effectiveness of modelling temporal features using multi-modal networks has also been studied Simonyan and Zisserman (2014a); Park et al. (2016). Simonyan et al. Simonyan and Zisserman (2014a) and Park et al. Park et al. (2016) proposed to incorporate optical flow as an additional input for action recognition. These methods lead to state-of-the-art performance when classifying actions in videos. Similarly, hand-gesture recognition has also benefited from the use of optical flow Abavisani et al. (2019).

In this paper, we propose to combine RGB images and their corresponding semantic segmentation. Additionally, instead of using traditional combinations—linear combination, feature concatenation or averaging Simonyan and Zisserman (2014a); Park et al. (2016); Mao et al. (2017)—we propose to use a CNN architecture based on a novel attention mechanism.

2.4 Attention Networks

CNNs with architectures based on attention modules have been recently shown to improve performances on a wide and varied set of tasks Mnih et al. (2014); Ba et al. (2014); Xu et al. (2015); Gregor et al. (2015). These attention modules are inspired by saliency theories on the human visual system, and aim to enhance CNNs by refocusing them onto task-relevant image content, emphasizing the learning of this content Woo et al. (2018).

In image classification, the Residual Attention Network Wang et al. (2017a), based on encoder-decoder stacked attention modules, increases classification performance by generating attention-aware features. The Convolutional Block Attention Module (CBAM) Woo et al. (2018) infers both spatial and channel attention maps to refine features at all levels while obtaining top performance on ImageNet classification tasks and on MS COCO and VOC2007 object detection datasets. An evaluation benchmark proposed by Kiela et al. Kiela et al. (2018) suggested that the incorporation of different multi-modal attention methods, including additive, max-pooling, and bilinear models, is highly beneficial for image classification algorithms.

In terms of action recognition and video analysis, attention is used to focus on relevant content over time. Yan et al. proposed to use soft attention models by combining a convolutional Long Short-Term Memory Network (LSTM) with a hierarchical architecture to improve action recognition metrics Yan et al. (2017).

Guided by the broad range and impact of attention mechanisms, in this paper we propose to train an attention mechanism based on semantic information. For the semantic segmentation of a given image, this mechanism returns a map based on prior learning. This map promises to reinforce the learning of features derived from common scene objects and to hinder the effect of features produced by non-specific objects in the scene.

3 Proposed Architecture

This paper proposes a multi-modal deep-learning scene recognition system composed of a two-branched CNN and an Attention Module (see Figure 3).

Figure 3: Multi-modal deep-learning scene recognition architecture. The architecture is composed of a Semantic Branch, which extracts meaningful features from a semantic segmentation score map. As semantic features are the only ones that can be extracted from this image domain, this Branch conveys an attention map based only on meaningful and representative scene objects and their relationships. The RGB Branch, on the contrary, extracts features from the color image. These features are then gated in the Attention Module using the semantic-based attention map. Through this process, the network is refocused towards the meaningful objects that the Semantic Branch has learned to be relevant for recognition. Better viewed in color.

3.1 Design’s motivation

The basic idea behind the design is that features extracted from RGB and semantic segmentation domains complement each other for scene recognition. If we train a CNN to recognize scenes just based on the semantic segmentation of an input RGB image (see top Branch in Figure 3), this network tends to focus on representative and discriminative scene objects, yielding genuine-semantic representations based on objects and stuff classes. We propose to use these representations to train an attention map. This map is then used to gate RGB-extracted representations (see bottom Branch in Figure 3), hence refocusing them on the learned scene content.

The effect of this attention mechanism is that the receptive field of the most probable neuron at the final classification layer, known as the class-activation map Zhou et al. (2016), is enlarged and displaced towards scene-discriminative elements that have been learned by the Semantic Branch. An example of this behaviour is depicted in Figure 2: the RGB Branch is focused on the center of the image; the Semantic Branch mainly focuses on the oven for predicting the scene class; the combined network includes the oven and other "Kitchen" discriminative objects, such as the stove and the cabinet, for prediction.

3.2 Preliminaries

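The proposed network architecture is presented in Figure 3. Let $I \in \mathbb{R}^{w \times h \times 3}$ be a normalized RGB image and let $M \in \mathbb{R}^{w \times h \times L}$ be a score tensor representing I’s semantic segmentation, where $M_{i,j} \in \mathbb{R}^{1 \times 1 \times L}$ represents the probability distribution for the $(i,j)$-th pixel on the $L$-set of learned semantic labels.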

The network is composed of a Semantic Branch (Figure 3 top) and an RGB Branch (Figure 3 bottom). Their output is a pair of feature tensors, $F_M$ and $F_I$ respectively, which are fed to the Attention Module to obtain the final feature tensor $F_A$. This is fed to a linear classifier to obtain the final prediction for a $K$-class scene recognition problem.

3.3 RGB Branch

It consists of a complete ResNet-18 architecture He et al. (2016). This Branch is fed with I and returns a set of RGB-based features $F_I \in \mathbb{R}^{w' \times h' \times c'}$, where $w'$, $h'$ and $c'$ are the width, the height and the number of channels of the output feature map. This Branch includes the original ResNet-18’s Basic Blocks, with three of them (Basic Blocks 2, 3 and 4) implementing residual connections. $F_I$, which is obtained from Basic Block 4, is then forwarded into the Attention Module.

3.4 Semantic Branch

Given I, a semantic segmentation network is used to infer a semantic segmentation score tensor M, which is then fed to the Semantic Branch. M encodes information on the set of meaningful and representative objects, their spatial relations and their location in the scene. This Branch returns a set of semantic-based features $F_M \in \mathbb{R}^{w' \times h' \times c'}$ that model the set of semantic labels and their semantic inter-dependencies in spatial and channel dimensions.

The architecture of the Semantic Branch is a shallow network whose output size exactly matches that of a ResNet-18 He et al. (2016) feature map (i.e., before the linear classifier). A shallow network is preferred here, as its input M lacks texture information. Nonetheless, we compare the effect of using deeper networks for this Branch in Section 4.4.

Figure 4: Channel Attention Module (ChAM) Woo et al. (2018) architecture. ChAM is fed with an arbitrary size feature map $F \in \mathbb{R}^{w_F \times h_F \times c_F}$ and infers a 1D channel attention map $a \in \mathbb{R}^{c_F}$. Better viewed in color.

To reinforce the semantic classes that are relevant for a given image, three Channel Attention Modules (ChAM) Woo et al. (2018) are interspersed between convolutional blocks in the Semantic Branch. The ChAM module (see Figure 4) exploits the inter-channel relationships between features and results in a per-channel attention map. In this map, some feature channels are reinforced and some inhibited after the sigmoid layer. As each channel represents the probability of a semantic class, ChAM forces the network to focus on certain classes. Specifically, ChAM is fed with an arbitrary size feature map $F \in \mathbb{R}^{w_F \times h_F \times c_F}$ and infers a 1D channel attention map $a \in \mathbb{R}^{c_F}$. F is squeezed along the spatial dimensions by using both average-pooling and max-pooling operations, obtaining feature vectors $F_{avg} \in \mathbb{R}^{c_F}$ and $F_{max} \in \mathbb{R}^{c_F}$. These vectors are processed by two shared fully connected layers separated by a ReLU activation layer. The resulting vectors $W_1 \phi(W_0 F_{avg})$ and $W_1 \phi(W_0 F_{max})$ are then added and normalized via a sigmoid layer to yield the 1D channel attention map $a$:

$$a = \sigma\big(W_1\,\phi(W_0 F_{avg}) + W_1\,\phi(W_0 F_{max})\big), \tag{1}$$

where $\sigma$ and $\phi$ are Sigmoid and ReLU activation functions respectively, $W_0 \in \mathbb{R}^{\frac{c_F}{r} \times c_F}$ and $W_1 \in \mathbb{R}^{c_F \times \frac{c_F}{r}}$ are the weights of the fully connected layers, and $r$ is a reduction ratio parameter.

This channel-attention map is used to weight F by:

$$F' = F \odot a, \tag{2}$$

where $\odot$ is the Hadamard product, with $a$ broadcast along the spatial dimensions.
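As an illustration of the channel attention just described, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the nested-list tensor layout and the toy weights are assumptions made for this example; bias terms are omitted, following the CBAM formulation of Woo et al. (2018).

```python
import math


def channel_attention(feature_map, w0, w1):
    """Return one attention weight per channel.

    feature_map: list of c channels, each a list of rows (h x w floats).
    w0: (c // r) x c matrix, w1: c x (c // r) matrix of the shared layers.
    """
    # Squeeze the spatial dimensions with average- and max-pooling.
    avg_vec = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    max_vec = [max(map(max, ch)) for ch in feature_map]

    def matvec(matrix, vector):
        return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

    def shared_mlp(vector):
        hidden = [max(0.0, h) for h in matvec(w0, vector)]  # ReLU
        return matvec(w1, hidden)

    # Both pooled vectors go through the same two layers, then are added.
    summed = [p + q for p, q in zip(shared_mlp(avg_vec), shared_mlp(max_vec))]
    return [1.0 / (1.0 + math.exp(-s)) for s in summed]  # sigmoid


def apply_channel_attention(feature_map, attention):
    """Scale every value of channel k by attention[k] (broadcast product)."""
    return [[[attention[k] * x for x in row] for row in ch]
            for k, ch in enumerate(feature_map)]


# Toy input (hypothetical): 2 channels of size 2 x 2, reduction ratio r = 2.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w0 = [[1.0, 0.0]]     # 1 x 2
w1 = [[1.0], [-1.0]]  # 2 x 1
attention = channel_attention(features, w0, w1)
weighted = apply_channel_attention(features, attention)
```

With these toy weights the first channel is reinforced (weight close to 1) and the second is inhibited (weight close to 0), which is the behaviour the text attributes to the sigmoid layer.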

3.5 Attention Module

The Attention Module is used to obtain a set of semantic-weighted features $F_A$, which are forwarded to a linear classifier to obtain the final scene posterior probabilities. The architecture and design parameters of the proposed Semantic Attention Module are depicted in Figure 5 and detailed in Table 1. Alternative attention mechanisms have been evaluated (see Section 4.4). Hereinafter, we describe the one that yields the highest performance.

Name           Output Size      Blocks
att_conv_I     512 × 5 × 5      3 × 3, 512, stride 1
att_conv_M     512 × 5 × 5      3 × 3, 512, stride 1
att_conv2_I    1024 × 3 × 3     3 × 3, 1024, stride 1
att_conv2_M    1024 × 3 × 3     3 × 3, 1024, stride 1
attention      1024 × 3 × 3     Hadamard product
avg_pool       1024 × 1 × 1     Average Pooling
classifier     K × 1            Dropout, K-dimensional FC, Softmax

Table 1: Attention Module designed architecture.
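The spatial sizes listed in Table 1 can be checked with the standard convolution output-size formula. The sketch below assumes unpadded 3 × 3 convolutions and a 7 × 7 input feature map (the usual ResNet-18 output for a 224 × 224 crop); neither the padding nor the input size is stated in the table itself, so both are assumptions.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# Assumed 7 x 7 input, followed by two unpadded 3 x 3, stride-1 convolutions.
size = 7
sizes = [size]
for _ in range(2):
    size = conv_output_size(size, kernel=3, stride=1, padding=0)
    sizes.append(size)
```

Under these assumptions the sizes shrink from 7 to 5 to 3, matching the 5 × 5 and 3 × 3 output sizes of the two convolutional stages in the table.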

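Semantic gating representations $F_{M,A}$ are obtained from the output of the Semantic Branch $F_M$ by a double convolutional block followed by a sigmoid module:

$$F_{M,A} = \sigma\big(W_M^{2} * \phi(W_M^{1} * F_M + b_M^{1}) + b_M^{2}\big), \tag{3}$$

where $\sigma$ and $\phi$ are again Sigmoid and ReLU activation functions respectively, $*$ denotes convolution, and $W_M^{1}, W_M^{2}$ and $b_M^{1}, b_M^{2}$ are the weights and biases of the two convolutional layers.

Similarly, RGB features to be gated, $F_{I,A}$, are obtained from the output of the RGB Branch $F_I$ by:

$$F_{I,A} = \phi\big(W_I^{2} * \phi(W_I^{1} * F_I)\big), \tag{4}$$

where $W_I^{1}$ and $W_I^{2}$ are the weights of the two convolutional layers of this Branch.

Gating is performed by simply multiplying these two representations:

$$F_A = F_{I,A} \odot F_{M,A}. \tag{5}$$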
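The gating step can be pictured with a small pure-Python sketch. It operates on flat lists for brevity, and the function names and toy values are hypothetical; it only illustrates how a sigmoid-bounded semantic gate scales RGB features element-wise.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(rgb_features, semantic_scores):
    """Element-wise product of RGB features with a sigmoid semantic gate."""
    if len(rgb_features) != len(semantic_scores):
        raise ValueError("both representations must have the same size")
    return [r * sigmoid(s) for r, s in zip(rgb_features, semantic_scores)]


# Toy values (hypothetical): identical RGB responses, different semantic scores.
gated = gate([2.0, 2.0, 2.0], [-10.0, 0.0, 10.0])
```

A strongly negative semantic score suppresses the RGB response almost entirely, a zero score halves it, and a strongly positive score lets it through nearly unchanged.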

Figure 5: Attention Module architecture. It aims to obtain a set of semantic-weighted features $F_A$, based on RGB and semantic representations $F_I$ and $F_M$. Better viewed in color.

$F_A$ is then forwarded to a classifier composed of an average pooling, a dropout, and a fully connected layer, generating a feature vector $f \in \mathbb{R}^{K}$. The inference of scene posterior probabilities from f is achieved by using a logarithmic normalized exponential (logarithmic softmax) function $\gamma(f)$:

$$y_k = \gamma(f)_k = \log \frac{\exp(f_k)}{\sum_{i=1}^{K} \exp(f_i)}, \tag{6}$$

where $y_k$ is the log-posterior probability for class $k$ given the feature vector f.
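For reference, a logarithmic softmax can be computed in a numerically stable way by subtracting the maximum score before exponentiating. The sketch below is a generic illustration with hypothetical input scores, not code from the paper.

```python
import math


def log_softmax(scores):
    """Log of the normalized exponential, stabilized by the max score."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_sum for s in scores]


# Hypothetical class scores for a 3-class problem.
log_probs = log_softmax([2.0, 1.0, 0.1])
probs = [math.exp(v) for v in log_probs]
```

Exponentiating the outputs recovers probabilities that sum to one, and the stabilization keeps the computation finite even for very large scores.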

3.6 Training Procedure and Loss

A two-stage learning procedure has been followed paying particular attention to prevent one of the branches dominating the training process.

Observe that some training examples might be better classified using either RGB or semantic features. In our experiments, when both branches were jointly trained, and one of the branches was more discriminative than the other, the overall loss was small; hence, hindering the optimization of the less discriminative Branch. To avoid this situation, during the first stage, both RGB and Semantic Branches are separately trained for a given scenario. At the second training step, pre-trained RGB and semantic branches are frozen while the Attention Module and the linear classifier are fully trained from scratch.

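Both training stages are optimized by minimizing the following loss function:

$$\min_{\theta} \frac{1}{N} \sum_{n=1}^{N} \ell\big(\gamma(f_n), t_n\big), \qquad \ell\big(\gamma(f_n), t_n\big) = -\gamma(f_n)_{t_n}, \tag{7}$$

where $\theta$ denotes the trainable parameters, $\ell$ is the Negative Log-Likelihood (NLL) loss, $f_n$ and $t_n$ are the feature vector and the ground-truth scene class of the $n$-th sample, and $N$ is the number of training samples for a given batch.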
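The batch-averaged NLL loss can be sketched as follows. The function name and the toy batch are hypothetical; the inputs are assumed to be log-probabilities such as those produced by a logarithmic softmax, together with integer ground-truth class indices.

```python
import math


def nll_loss(batch_log_probs, targets):
    """Mean negative log-likelihood of the target classes over a batch."""
    if len(batch_log_probs) != len(targets):
        raise ValueError("one target is required per sample")
    picked = [log_probs[t] for log_probs, t in zip(batch_log_probs, targets)]
    return -sum(picked) / len(targets)


# Hypothetical batch of two samples over three classes.
batch = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.8), math.log(0.1)],
]
loss = nll_loss(batch, targets=[0, 1])
```

The loss only looks at the log-probability assigned to the correct class of each sample, so it approaches zero as those probabilities approach one.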

4 Experiments and Results

In this section, we evaluate the proposed scene recognition network on four well-known and publicly available datasets: ADE20K Zhou et al. (2017), MIT Indoor 67 Quattoni and Torralba (2009), SUN 397 Xiao et al. (2010) and Places365 Zhou et al. (2018). To ease the reproducibility of the method, its implementation design is first explained and common evaluation metrics for all the experiments are presented. Then, to assess the effect of each design decision, we perform a varied set of ablation studies. The section closes with a comparison against state-of-the-art approaches on three publicly available datasets.

Figure 6: Image examples extracted from ADE20K Zhou et al. (2017) (top row) and MIT Indoor 67 Quattoni and Torralba (2009) (bottom row) datasets. Observe how blurred the boundaries between different kinds of scenes are in both indoor and outdoor scenarios. Better viewed in color.

4.1 Implementation Design

Data Dimensions: Each input RGB image I is spatially adapted to the network input by re-sizing its smaller edge and then cropping to a square of $224 \times 224$ pixels. In the training stage, this cropping is randomized as part of the data augmentation protocol, whereas in the validation stage, we follow a ten-crop evaluation protocol as described in Zhou et al. (2018). The Semantic Branch inherits this adaptation. RGB and Semantic Branches produce features $F_I$ and $F_M$ respectively, both with dimensions $512 \times 7 \times 7$. After the Attention Module, a $1024 \times 3 \times 3$ attention representation $F_A$ is obtained. This representation is fed to the linear classifier, yielding a probability vector with dimensions $K \times 1$, $K$ being the number of scene classes of a given dataset.
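A ten-crop evaluation is commonly implemented as the four corner crops plus the centre crop, each also mirrored horizontally, with the predictions of the ten views typically averaged. The sketch below computes the crop boxes under that common definition; the function name and the 256 × 256 image with a 224 crop are hypothetical values chosen for illustration.

```python
def ten_crop_boxes(width, height, crop):
    """Return 10 (box, flipped) pairs; box = (left, top, right, bottom)."""
    if crop > width or crop > height:
        raise ValueError("crop must fit inside the image")
    origins = [
        (0, 0),                                       # top-left
        (width - crop, 0),                            # top-right
        (0, height - crop),                           # bottom-left
        (width - crop, height - crop),                # bottom-right
        ((width - crop) // 2, (height - crop) // 2),  # centre
    ]
    boxes = [(left, top, left + crop, top + crop) for left, top in origins]
    # Each box is used twice: as is, and horizontally flipped.
    return [(box, False) for box in boxes] + [(box, True) for box in boxes]


# Hypothetical sizes: a 256 x 256 resized image and a 224-pixel square crop.
crops = ten_crop_boxes(width=256, height=256, crop=224)
```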

Semantic Segmentation: It is performed by a UPerNet-50 network Xiao et al. (2018) trained on the ADE20K dataset with $L = 150$ objects and stuff classes. Its output (an $L$-dimensional vector per pixel, indicating its probability distribution over the labels) is the score tensor M. To ease storage and convergence, this tensor is sparsified, i.e. for every pixel $(i,j)$, all the probabilities in $M_{i,j}$ which are not among the three top ones are set to zero.
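The sparsification step can be sketched in pure Python as follows. The function names and the nested-list layout (rows of pixels, each pixel a list of class probabilities) are assumptions for this example; the kept values are left un-normalized, since the text does not mention any re-normalization.

```python
def sparsify_top_k(pixel_scores, k=3):
    """Keep the k largest class probabilities of one pixel, zero the rest."""
    order = sorted(range(len(pixel_scores)),
                   key=lambda c: pixel_scores[c], reverse=True)
    keep = set(order[:k])
    return [p if c in keep else 0.0 for c, p in enumerate(pixel_scores)]


def sparsify_score_tensor(score_tensor, k=3):
    """Apply the per-pixel sparsification to an h x w x L nested list."""
    return [[sparsify_top_k(pixel, k) for pixel in row] for row in score_tensor]


# Hypothetical 1 x 1 image with 5 semantic classes.
tensor = [[[0.05, 0.4, 0.1, 0.3, 0.15]]]
sparse = sparsify_score_tensor(tensor)
```

Only three non-zero entries per pixel remain, which is what makes the tensor cheap to store in a sparse format.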

Data Augmentation: These techniques are used in order to reduce model overfitting and to increase network generalization. Specifically, for the RGB image, we use regular random crop, horizontal flipping, Gaussian blur, contrast normalization, Gaussian noise and brightness changes (all but the first two as in the Imgaug Library Jung (2019)). Due to their nature, for the semantic segmentation tensors, we only apply regular random crop and horizontal flipping. The training dataset is shuffled at each epoch.
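Geometric augmentations need to stay aligned across modalities, so that each RGB pixel still corresponds to its semantic label. The sketch below assumes that the same random crop offset and flip decision are shared by the image and its segmentation; this sharing, the function name and the toy grids are assumptions for illustration, not details given in the text.

```python
import random


def paired_crop_and_flip(image, segmentation, crop_size, rng):
    """Apply one random crop and one random horizontal flip to both inputs.

    image, segmentation: 2-D lists (rows of pixels) with identical height/width.
    rng: a random.Random instance, so that results are reproducible.
    """
    height, width = len(image), len(image[0])
    if (len(segmentation), len(segmentation[0])) != (height, width):
        raise ValueError("image and segmentation must have the same size")
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    flip = rng.random() < 0.5

    def transform(grid):
        rows = [row[left:left + crop_size] for row in grid[top:top + crop_size]]
        return [row[::-1] for row in rows] if flip else rows

    return transform(image), transform(segmentation)


# Toy 6 x 6 grids whose pixels store their own coordinates, to check alignment.
image = [[(r, c) for c in range(6)] for r in range(6)]
segmentation = [[(r, c) for c in range(6)] for r in range(6)]
image_crop, seg_crop = paired_crop_and_flip(
    image, segmentation, crop_size=4, rng=random.Random(0))
```

Photometric augmentations (blur, noise, contrast, brightness) would then be applied to the RGB crop only, as the text describes.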

Figure 7: Image examples extracted from SUN 397 Xiao et al. (2010) (top row) and Places 365 Zhou et al. (2018) (bottom row) datasets. Observe how blurred the boundaries between different kinds of scenes are in both indoor and outdoor scenarios. Better viewed in color.


To minimize the loss function in Equation 7 and optimize the network’s trainable parameters $\theta$, the Deep Frank-Wolfe (DFW) Berrada et al. (2019) algorithm is used. DFW is a first-order optimization algorithm for deep neural networks which, contrary to Stochastic Gradient Descent (SGD), only requires the initial learning rate hyper-parameter and not a hand-designed decay schedule. In all our experiments a single initial learning rate was used; alternative values did not improve learning performance. The use of DFW resulted in similar performance to that of SGD with a learning-rate decay schedule suited for each dataset, in around the same number of epochs. However, by suppressing the decay policy hyper-parameter, the training process was lightened, as the number of hyper-parameters to sweep was reduced. Momentum and weight decay were set to the same values for both stages in the training procedure (Section 3.6).

Hardware and Software: The model design, training and evaluation have been implemented using the PyTorch 1.1.0 deep learning framework Paszke et al. (2017), running on a PC with a 12-core CPU and an NVIDIA TITAN Xp 12GB Graphics Processing Unit.

4.2 Datasets

ADE20K Scene Parsing Dataset: The scene parsing dataset ADE20K Zhou et al. (2017) contains 22210 scene-centric images exhaustively annotated with objects. The dataset is divided into 20210 images for training and 2000 images for validation. ADE20K is a semantic segmentation benchmark with ground-truth for semantic categories provided for each image, hence enabling its use for training semantic segmentation algorithms (Section 4.1-Semantic Segmentation). This dataset is used in our ablation studies (Section 4.4) to reduce the influence of semantic segmentation quality on the evaluation results. Additionally, the dataset includes ground truth for the classification of scene classes. Therefore, it can also be used as a scene recognition benchmark, although the distribution of images across scene classes is notably unbalanced. The top row of Figure 6 depicts example images and scene classes of this dataset.

MIT Indoor 67 Dataset: The MIT Indoor 67 dataset Quattoni and Torralba (2009) contains 15620 RGB images of indoor places arranged into scene classes. In the reported experiments, we followed the division between train and validation sets proposed by the authors Quattoni and Torralba (2009): 80 training images and 20 images for validation per scene class. The bottom row of Figure 6 shows example images and scene classes of the dataset.

SUN 397 Dataset: The SUN 397 dataset Xiao et al. (2010) is a large-scale scene recognition dataset composed of 39700 RGB images divided into scene classes, covering a large variety of environmental scenes both indoor and outdoor. An evaluation protocol to divide the full dataset into train and validation sets is provided with the dataset Xiao et al. (2010). Following this protocol, the dataset is divided into 50 training and validation images per scene class. The top row of Figure 7 shows example images and scene classes of this dataset.

Places 365 Database: The Places365 dataset Zhou et al. (2018) is explicitly designed for scene recognition. It is composed of 10 million images comprising 434 scene classes. There are two versions of the dataset: Places365-Standard, with 1.8 million train and 36000 validation images from scene classes, and Places365-Challenge-2016, in which the training set is enlarged with 6.2 million extra images, including 69 new scene classes (leading to a total of 8 million train images from 434 scene classes). In this paper, experiments are carried out using the Places365-Standard dataset. The bottom row of Figure 7 presents example images and scene classes of this dataset.

4.3 Evaluation Metrics

Scene recognition benchmarks are generally evaluated via the Top@k accuracy metric. In this paper, the Top@1, Top@2 and Top@5 accuracy metrics have been chosen. The Top@1 accuracy measures the percentage of validation/testing images whose top-scored class coincides with the ground-truth label. The Top@2 and Top@5 accuracies are the percentages of validation/testing images whose ground-truth label corresponds to any of the 2 and 5 top-scored classes, respectively.
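The Top@k definition above can be made concrete with a short sketch. The helper name, the tie-breaking rule (lower class index first) and the toy scores are assumptions made for illustration only.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose label is among the k top-scored classes."""
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices sorted by decreasing score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda c: (-sample_scores[c], c))
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


scores = [
    [0.6, 0.3, 0.1],   # label 0 -> ranked first
    [0.5, 0.4, 0.1],   # label 1 -> ranked second
    [0.5, 0.3, 0.2],   # label 2 -> ranked third
    [0.1, 0.2, 0.7],   # label 2 -> ranked first
]
labels = [0, 1, 2, 2]
top1 = top_k_accuracy(scores, labels, 1)   # 2 of 4 samples are hits
top2 = top_k_accuracy(scores, labels, 2)   # 3 of 4 samples are hits
```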

The Top@k accuracy metrics are biased towards classes that are over-represented in the validation set; in other words, under-represented classes barely affect these metrics. In order to cope with unbalanced validation sets (e.g., ADE20K for scene recognition), we propose to use an additional performance metric, the Mean Class Accuracy (MCA):


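\[ \mathrm{MCA} = \frac{1}{K} \sum_{i=1}^{K} \mathrm{Top}_i@1 \]

where K is the number of scene classes and Top_i@1 is the Top@1 metric for scene class i. Note that MCA equals Top@1 for perfectly balanced datasets.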
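To show why MCA is more informative than Top@1 on unbalanced sets, the following sketch evaluates a degenerate classifier on a toy validation set. Function names and data are hypothetical; classes that do not appear in the validation labels are simply skipped, which is an assumption of this sketch.

```python
from collections import defaultdict


def mean_class_accuracy(predictions, labels):
    """Average of the per-class Top@1 accuracies, in percent."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for predicted, label in zip(predictions, labels):
        total[label] += 1
        correct[label] += int(predicted == label)
    per_class = [100.0 * correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)


def top1_accuracy(predictions, labels):
    """Plain Top@1 accuracy over all samples, in percent."""
    hits = sum(int(p == l) for p, l in zip(predictions, labels))
    return 100.0 * hits / len(labels)


# Unbalanced toy validation set: 8 images of class 0, 2 images of class 1.
labels = [0] * 8 + [1] * 2
predictions = [0] * 10          # a classifier that always answers class 0
top1 = top1_accuracy(predictions, labels)
mca = mean_class_accuracy(predictions, labels)
```

Here Top@1 rewards the classifier for the over-represented class, whereas MCA averages a perfect score on class 0 with a zero score on class 1.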

Backbone Channel Attention Module Woo et al. (2018) Pretraining Number of Parameters Top@1 Top@2 Top@5 MCA
ResNet-18 Scratch 12 M 49.58 60.21 71.87 11.55
ResNet-18 Scratch 12 M 50.60 60.45 72.10 12.17
ResNet-18 ImageNet 12 M 52.17 61.86 71.75 15.54
3 Convolutional Layers Scratch 6.4 M 49.80 60.55 72.53 11.37
3 Convolutional Layers Scratch 6.5 M 50.00 60.95 73.53 11.67
4 Convolutional Layers Scratch 2.5 M 49.90 60.50 71.70 12.94
4 Convolutional Layers Scratch 2.6 M 50.35 60.90 72.55 13.53

Table 2: Ablation results for different architectures for the Semantic Branch

4.4 Ablation Studies

The aim of this section is to gauge the influence of different design elements that constitute significant aspects of the proposed method. First, the effect of different Semantic Branch architectures is evaluated. Second, the effect of variations in both the attention mechanism and the attention module architecture is assessed and analyzed. Third, the influence of the multi-modal architecture and the attention module with respect to the RGB Branch is quantified. As stated in Section 4.2, all the ablation studies are carried out using the ADE20K dataset.

Influence of the Semantic Branch Architecture

Table 2 presents results when only the semantic representations are used for scene recognition. Seven configurations for the Semantic Branch, resulting from the combination of three design parameters (variation of the network architecture, use of ChAM modules Woo et al. (2018), and use of pre-trained models), have been explored.

Results from Table 2 suggest that the inference capabilities of deeper and wider CNNs are not fully exploited, as similar results can be achieved with shallower networks. For instance, when trained from scratch, there is not a significant difference in terms of Top@k and MCA metrics between using ResNet-18 or a shallower network; e.g., a relative improvement of with respect to the 4-convolutional layers configuration. Using a pre-trained network moderately improves performance: ResNet-18 pre-trained on ImageNet performs a (Top@) and a (MCA) better than ResNet-18 from scratch, but not drastically, due to the non-RGB nature of the semantic domain. The use of ChAM modules (as in Figure 3 for the shallow networks and as in Woo et al. (2018) for ResNet-18) slightly improves between a and in terms of MCA, without implying a significant increase in complexity (0.1 Million additional units on average).

In the light of these experiments, we opted for the 4-convolutional-layer architecture for the Semantic Branch as a trade-off between performance and complexity. This architecture yields a performance comparable to that of ResNet-18 ( and for Top@1 and MCA respectively), while using substantially fewer units ().

Figure 8: Different attention mechanisms architectures used for the attention module ablation study. Better viewed in color.

Influence of the Attention Module

We have performed two ablation studies on the influence of the attention module. First, we evaluated the effect of using different attention mechanisms, all of them sharing a similar network architecture: two convolutional layers for each Branch using and kernels. Then, once an attention mechanism was selected, we further explored the use of alternative network architectures to drive the attention.

Attention Mechanisms

Table 3 presents an extensive study on attention mechanisms to validate their effectiveness and performance, which is compared with the RGB Baseline. To ease the analysis, Figure 8 graphically depicts each of the evaluated attention mechanisms.

Results from Table 3 suggest that attention mechanisms generally outperform the RGB baseline in terms of MCA. Additionally, those based on Hadamard combination, either gated or not, perform better than additive and concatenation mechanisms. Additive combination results in a and a decrease in terms of Top@1 and MCA with respect to the Gated RGB Hadamard combination (G-RGB-H). Similarly, attention based on feature concatenation performs a (Top@) and a (MCA) worse than G-RGB-H. Besides, concatenation increases the complexity of the linear classifier by generating a larger input to this module. In our opinion, the decreases in performance may be indicative of the inability of additive and concatenation mechanisms to favourably scale scene-relevant information.

Focusing on Hadamard combination, the use of a gated RGB combination (G-RGB-H) slightly improves the non-gated one (a for Top@1 and a for MCA) thanks to the normalized semantic attention obtained after the sigmoid activation layer. The effect of this normalized map is to gate RGB information, hence maintaining its numerical range instead of scaling it as in the non-gated Hadamard combination. Moreover, gating may also serve to nullify non-significant RGB information, easing the learning process. Additionally, we tested gating attention contrari-wise, i.e. using RGB features to gate semantic features (see Figure 8 right and Table 3 bottom). In this case, feature representations were obtained from the semantic domain—which lacks texture and color information—and the attention map was inferred from RGB domain—which lacks objects’ labels and clear boundaries. These representation problems deteriorate performance a and a for Top@1 and MCA respectively. However, the Gated Sem combination improves the performance of the Semantic Branch itself (compare Table 2 third row and Table 3 last row). In conclusion, we opted for the top-performing G-RGB-H attention mechanism.
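The combination strategies compared in this study can be contrasted with a minimal element-wise sketch on toy feature vectors. It deliberately omits the convolutional layers that adapt each Branch before the combination, and all function and variable names are hypothetical; only the sigmoid gating follows directly from the description above.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def additive(f_rgb, f_sem):
    return [a + b for a, b in zip(f_rgb, f_sem)]


def concatenation(f_rgb, f_sem):
    # Doubles the feature length, hence a larger input for the classifier.
    return list(f_rgb) + list(f_sem)


def hadamard(f_rgb, f_sem):
    # Non-gated product: semantic features rescale RGB features freely.
    return [a * b for a, b in zip(f_rgb, f_sem)]


def gated_rgb_hadamard(f_rgb, f_sem):
    # Semantic features are squashed to (0, 1) and used as a gate on RGB.
    return [a * sigmoid(b) for a, b in zip(f_rgb, f_sem)]


def gated_sem_hadamard(f_rgb, f_sem):
    # Reverse direction: RGB features gate the semantic features.
    return [sigmoid(a) * b for a, b in zip(f_rgb, f_sem)]


f_rgb = [2.0, 0.5, 1.0]
f_sem = [4.0, -4.0, 0.0]
gated = gated_rgb_hadamard(f_rgb, f_sem)
```

With the gate bounded in (0, 1), each gated value stays within the magnitude of the original RGB feature, while a strongly negative semantic response pushes the corresponding RGB feature towards zero.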

Attention Mechanism Top@1 Top@2 Top@5 MCA
RGB Baseline 56.90 67.25 78.00 20.80
Additive Combination 59.60 70.60 80.60 25.36
Concatenation 61.85 72.25 81.60 25.38
Hadamard Combination 62.25 71.95 81.30 26.38
Gated RGB Hadamard Combination (G-RGB-H) 62.55 73.25 82.75 27.00
Gated Sem Hadamard Combination 56.55 66.30 75.45 20.11
Table 3: Ablation results for different attention mechanisms
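For readers who wish to compare the rows of Table 3 against G-RGB-H, the sketch below computes percentage differences from the tabulated Top@1 and MCA values. It assumes that a relative difference is defined as (value - reference) / reference; this may not match the exact definition used for the figures quoted in the text, and the helper name is hypothetical.

```python
# (Top@1, MCA) pairs copied from Table 3.
table3 = {
    "RGB Baseline": (56.90, 20.80),
    "Additive Combination": (59.60, 25.36),
    "Concatenation": (61.85, 25.38),
    "Hadamard Combination": (62.25, 26.38),
    "Gated RGB Hadamard Combination (G-RGB-H)": (62.55, 27.00),
    "Gated Sem Hadamard Combination": (56.55, 20.11),
}


def relative_change(value, reference):
    """Signed percentage difference of value with respect to reference."""
    return 100.0 * (value - reference) / reference


ref_top1, ref_mca = table3["Gated RGB Hadamard Combination (G-RGB-H)"]
changes = {
    name: (relative_change(top1, ref_top1), relative_change(mca, ref_mca))
    for name, (top1, mca) in table3.items()
}

for name, (d_top1, d_mca) in changes.items():
    print(f"{name}: Top@1 {d_top1:+.2f}%, MCA {d_mca:+.2f}%")
```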

Architecture of the G-RGB-H module

Table 4 presents comparative results for several variations of this module’s architecture: direct gating without adaptation (No Conv layers), using 2 or 3 convolutional layers, and using channel-increasing layers.

Results indicate that the inclusion of convolutional layers to adapt and before the attention mechanism improves performance. The lack of convolutional layers compared with a 2-layer configuration leads to a decrease in performance of a for Top@1 and a for MCA. In our opinion, this owes to the two-stage learning procedure that we follow (see Section 3.6): if no convolutional layers are used, features from the semantic and the RGB Branches are not adapted, hindering the learning process. On the other hand, the use of a deeper attention module (using 3 instead of 2 layers) produces a decrease in performance of and in terms of Top@1 and MCA respectively, suggesting training over-fitting.

Regarding kernel nature, the use of channel-increasing layers improves performance with respect to using no layers at all, but its performance still lags behind that obtained by also considering the width and height dimensions ( spatial kernels).

In conclusion, we opted for a G-RGB-H attention strategy based on two convolutional layers composed of and kernels at each Branch.

Influence of Semantic Segmentation in Scene Recognition

Table 5 gauges the effectiveness of the proposed architecture when compared to the sole use of either the RGB Branch or the Semantic Branch. Comparative performances show that the RGB and semantic branches strongly complement each other when used for scene recognition. The proposed architecture (Table 5 last row) outperforms the RGB baseline (Table 5 first row): a in terms of Top@1 performance and a in terms of MCA. Note that, due to the unbalanced dataset, the MCA metric is more adequate for comparison.

Conv Layers Kernel Size Top@1 Top@2 Top@5 MCA
No Conv Layers - 60.84 70.55 80.22 24.20
2 1×1×512, 1×1×1024 61.00 72.30 81.95 24.92
2 3×3×512, 3×3×1024 62.55 73.25 82.75 27.00
3 3×3×512, 3×3×1024, 3×3×1024 61.20 71.15 81.75 24.90
Table 4: Ablation results for different G-RGB-H architectures
RGB Semantic Top@1 Top@2 Top@5 MCA
✓ - 56.90 67.25 78.00 20.80
- ✓ 50.60 60.45 72.10 12.17
✓ ✓ 62.55 73.25 82.75 27.00
Table 5: Scene recognition results on ADE20K

4.5 State-of-the-art Comparison

In this section, the proposed method is compared with 19 state-of-the-art approaches, ranging from common CNN architectures trained for scene recognition to methods using objects to drive scene recognition. Comparison is performed on three datasets: MIT Indoor 67 Quattoni and Torralba (2009), SUN 397 Xiao et al. (2010) and Places365 Zhou et al. (2018). Unless explicitly mentioned, results of all the approaches are extracted from the respective referenced papers.

Method Backbone Number of Parameters Top@1
PlaceNet Zhou et al. (2014) Places-CNN 62 M 68.24
MOP-CNN Gong et al. (2014) CaffeNet 62 M 68.90
CNNaug-SVM Sharif Razavian et al. (2014) OverFeat 145 M 69.00
HybridNet Zhou et al. (2014) Places-CNN 62 M 70.80
URDL + CNNaug Liu et al. (2014) AlexNet 62 M 71.90
MPP-FCR2 (7 scales) Yoo et al. (2014) AlexNet 62 M 75.67
DSFL + CNN Zuo et al. (2014) AlexNet 62 M 76.23
MPP + DSFL Yoo et al. (2014) AlexNet 62 M 80.78
CFV Cimpoi et al. (2015) VGG-19 143 M 81.00
CS Xie et al. (2015) VGG-19 143 M 82.24
SDO (1 scale) Cheng et al. (2018) 2×VGG-19 276 M 83.98
VSAD Wang et al. (2017b) 2×VGG-19 276 M 86.20
SDO (9 scales) Cheng et al. (2018) 2×VGG-19 276 M 86.76
RGB Branch ResNet-18 12 M 82.68
RGB Branch* ResNet-50 25 M 84.40
Semantic Branch 4 Conv 2.6 M 73.43
Ours RGB Branch + Sem Branch + G-RGB-H 47 M 85.58
Ours* RGB Branch* + Sem Branch + G-RGB-H 85 M 87.10
Table 6: State-of-the-art results on MIT Indoor 67 dataset. Methods using objects to drive scene recognition include: Cheng et al. (2018); Wang et al. (2017b), Semantic Branch, Ours and Ours*.
Method Backbone Number of Parameters Top@1
Decaf Donahue et al. (2014) AlexNet 62 M 40.94
MOP-CNN Gong et al. (2014) CaffeNet 62 M 51.98
HybridNet Zhou et al. (2014) Places-CNN 62 M 53.86
Places-CNN Zhou et al. (2014) Places-CNN 62 M 54.23
Places-CNN ft Zhou et al. (2014) Places-CNN 62 M 56.20
CS Xie et al. (2015) VGG-19 143 M 64.53
SDO (1 scale) Cheng et al. (2018) 2×VGG-19 276 M 66.98
VSAD Wang et al. (2017b) 2×VGG-19 276 M 73.00
SDO (9 scales) Cheng et al. (2018) 2×VGG-19 276 M 73.41
RGB Branch ResNet-18 12 M 67.65
RGB Branch* ResNet-50 25 M 70.87
Semantic Branch 4 Conv 2.6 M 51.32
Ours RGB Branch + Sem Branch + G-RGB-H 47 M 71.25
Ours* RGB Branch* + Sem Branch + G-RGB-H 85 M 74.04
Table 7: State-of-the-art results on SUN 397 dataset. Methods using objects to drive scene recognition include: Cheng et al. (2018); Wang et al. (2017b), Semantic Branch, Ours and Ours*.

Evaluation on MIT Indoor 67 and SUN 397

Tables 6 and 7 gather the performance of scene recognition methods on the MIT Indoor 67 and the SUN 397 datasets, respectively. As these datasets are balanced, we only include the Top@1 metric. All the compared algorithms are based on CNNs (see the Backbone column for details). For this evaluation, and to offer a fair comparison in terms of network complexity, we also present results for a version of the proposed method (Ours) that uses a ResNet-50 network, instead of a ResNet-18, for the RGB Branch (Ours*).

Results on MIT Indoor 67 dataset (see Table 6) are in line with those of Table 5, suggesting a high complementarity between the RGB and the semantic representations. The proposed method increases Top@1 performance of the RGB Branch by a and a for ResNet-18 and ResNet-50 backbones respectively. These relative increments are smaller than those reported in Table 5 as the semantic segmentation model, trained over the ADE20K dataset, is less tailored to the scenario. The proposed method (Ours) performs better than most of the state-of-the-art algorithms while using a substantially smaller number of parameters. e.g., a improvement over the single scale SDO Cheng et al. (2018) (a method similar in spirit) while reducing a the number of parameters. The ResNet-50 backbone version of our method (Ours*) outperforms every other state-of-the-art method, providing relative performance increments of and with respect to VSAD and multi-scale SDO, still with a significant reduction in the model’s complexity— less parameters. Additionally, whereas multi-scale patch-based algorithms require severe parametrization of the size, stride and scale of each of the explored patches, the proposed algorithm is only parametrized in terms of training hyper-parameters, highly simplifying its learning process.

A similar discussion applies to the SUN 397 Dataset (see Table 7). The proposed method (Ours) outperforms most of the state-of-the-art approaches while maintaining lower complexity; with a more complex network for the RGB Branch (Ours*), our results improve on all reported methods while still using fewer than one third of the parameters of the most complex ones.

Network Number of Parameters Top@1 Top@2 Top@5 MCA
AlexNet 62 M 47.45 62.33 78.39 49.15
AlexNet @Zhou et al. (2018) 62 M 53.17 - 82.89 -
GoogLeNet @Zhou et al. (2018) 7 M 53.63 - 83.88 -
ResNet-18 12 M 53.05 68.87 83.86 54.40
ResNet-50 25 M 55.47 70.40 85.36 55.47
ResNet-50 @Zhou et al. (2018) 25 M 54.74 - 85.08 -
VGG-19 @Zhou et al. (2018) 143 M 55.24 - 84.91 -
DenseNet-161 29 M 56.12 71.48 86.12 56.12
Semantic Branch 2.6 M 36.20 50.11 68.48 36.20
Ours 47 M 56.51 71.57 86.00 56.51
Table 8: State-of-the-art results on Places-365 Dataset (%). (@Zhou et al. (2018) stands for performance metrics reported in Zhou et al. (2018).)
Figure 9: Qualitative results. The first and second columns show the RGB and semantic segmentation images for some examples from the ADE20K, the SUN 397 and the Places 365 validation sets. The third, fourth and fifth columns depict the Class Activation Map (CAM) Zhou et al. (2016) obtained by using features extracted from: the RGB Branch used as baseline (ResNet-18), the Semantic Branch and the proposed method (Ours). The CAM represents the image areas that produce a greater activation of the network. CAM images also indicate the ground-truth label and the Top 3 predictions. Better viewed in color.

Evaluation on Places 365

Table 8 presents results for the challenging Places365 Dataset. This table includes both results from the original Places365 paper (@Zhou et al. (2018) in Table 8) and those we obtained by downloading and evaluating the publicly available network models.

The proposed method (Ours) obtains the best results while maintaining relatively low complexity. Its performance improves those of the deepest network, DenseNet-161, by a in terms of Top@1 accuracy and it surpasses the most complex network, VGG-19, by a reducing the number of parameters a .

4.6 Qualitative interpretation of the attention module

To qualitatively analyze the benefits of the attention module, Figure 9 depicts intermediate results of our method, including, from left to right: the original color image, its semantic segmentation, and the Class Activation Maps (CAMs) Zhou et al. (2016) computed from the features at the output of the RGB Branch, at the output of the Semantic Branch and after the attention module. CAMs are here represented normalized in color, with dark-red and dark-blue representing maximum (1) and minimum (0) activation respectively. Top 3 predicted classes for each Branch are included at the top-left corner of each image.
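As a reminder of how such maps are typically obtained, the sketch below computes a CAM as a class-weighted sum of feature channels, following the general formulation of Zhou et al. (2016), and rescales it to the [0, 1] range. The min-max rescaling is an assumption about how the maps were normalized for display; function names and toy values are hypothetical.

```python
def class_activation_map(features, class_weights):
    """Weighted sum over channels: features[k][y][x] * class_weights[k]."""
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for channel, weight in zip(features, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * channel[y][x]
    return cam


def normalize_map(cam):
    """Min-max rescaling to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in cam for v in row]
    low, high = min(flat), max(flat)
    span = high - low
    if span == 0:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - low) / span for v in row] for row in cam]


features = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel 0 fires at the top-left location
    [[0.0, 0.0], [0.0, 2.0]],   # channel 1 fires at the bottom-right location
]
weights = [0.5, 1.5]            # hypothetical classifier weights for one class
cam = normalize_map(class_activation_map(features, weights))
```

In this toy case the bottom-right location receives the maximum value (1) and would be drawn in dark red, while locations with no activation map to the minimum (0).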

The automatic refocusing capability of the attention module can be observed at first glance. Whereas CAMs of the RGB Branch are clearly biased towards the image center, after the attention module the attention is focused on human-accountable concepts that can be indicative of the scene class, e.g., the microwave for the kitchen, the animals for the chicken farm or the mirror for the bathroom. This refocusing is especially useful for the disambiguation between similar classes, e.g., the runway to correct the airport prediction in the second row, and to drive attention towards discriminative objects in conflicting scenes, e.g., the mirror can be used to recognize the bathroom in the last row.

Owing to the trained convolutional layers in the attention module (see Section 3.5), the final CAM is not derived from the straight multiplication of the RGB and the Semantic CAMs. Instead, the effect of the Semantic CAM, together with refocusing, is generally an enlargement of the focus area (compare third and fifth columns in Figure 9), hence increasing the amount of information that may be used to discriminate between scene classes. This may be a consequence of the larger CAMs yielded by the Semantic Branch: as the semantic segmentation contains less information than the color image, the Semantic Branch tends to focus either on small discriminative objects or on large areas containing objects’ transitions.

Figure 10: Examples of limitations of the proposed method. In some cases, when semantic segmentation is insufficient (top row) or incorrect (bottom row) network predictions are erroneous. In the top row, the presence of a cabinet and the absence of the computer in the semantic segmentation leads to a kitchen prediction. In the bottom row, the erroneous segmentation of the floor as water leads to water-related scene predictions.

4.7 Limitations of the proposed method

According to the reported results, semantic segmentation is useful to gate the process of scene recognition on RGB images. However, if semantic segmentation is flawed or imprecise, the proposed method may not be able to overcome the erroneous information.

Figure 10 depicts two qualitative examples of these problems. In the top row, the semantic image lacks information on discriminative objects, e.g., the computer. In the absence of these objects, the cabinet object, which is correctly segmented, dominates the gating process and drives an erroneous recognition: a "Kitchen" instead of an "Office". The bottom row contains another problematic situation. The semantic segmentation network misclassifies the court as water (probably due to its color and texture). The proposed network, guided by the dominant presence of water, infers that the 3 most probable scene classes are those in which water is prominent, hence failing.

5 Conclusions

This paper describes a novel approach to scene recognition based on an end-to-end multi-modal convolutional neural network. The proposed method gathers both image and context information using a two-branched CNN (the traditional RGB branch and a complementary semantic information branch) whose features are combined via an attention module. Several attention strategies have been explored, including classical additive and concatenating strategies as well as novel strategies based on fully trained convolutional layers followed by a Hadamard product. Among the latter, the top performing strategy relies on a softmax transformation of the convolved Semantic Branch features. These transformed features are used to gate convolved features of the RGB Branch, which reinforces the learning of relevant context information by changing the focus of attention towards human-accountable concepts indicative of scene classes.

Results on publicly available datasets (ADE20K, MIT Indoor 67, SUN 397 and Places 365) confirm that the proposed method outperforms every reported state-of-the-art method while significantly reducing the number of network parameters. This model simplification decreases training and inference times and the amount of necessary data for the learning stage. Overall, results confirm that the combination of both RGB and semantic segmentation modalities generally benefits scene recognition.


This study has been partially supported by the Spanish Government through its TEC2017-88169-R MobiNetVideo project.



  • M. Abavisani, H. R. V. Joze, and V. M. Patel (2019) Improving the performance of unimodal dynamic hand-gesture recognition with multimodal training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1165–1174. Cited by: §2.3, §2.3.
  • J. Ba, V. Mnih, and K. Kavukcuoglu (2014) Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014. Cited by: §2.4.
  • D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6541–6549. Cited by: §1, §2.1.
  • L. Berrada, A. Zisserman, and M. P. Kumar (2019) Deep frank-wolfe for neural network optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • X. Cheng, J. Lu, J. Feng, B. Yuan, and J. Zhou (2018) Scene recognition with objectness. Pattern Recognition 74, pp. 474–487. Cited by: §1, §2.1, §4, Table 6, Table 7.
  • M. Cimpoi, S. Maji, and A. Vedaldi (2015) Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3828–3836. Cited by: §2.1, Table 6.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2.
  • A. Daniel Costea and S. Nedevschi (2016) Semantic channels for fast pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2360–2368. Cited by: §2.3, §2.3.
  • A. Das, H. Agrawal, L. Zitnick, D. Parikh, and D. Batra (2017) Human attention in visual question answering: do humans and deep networks look at the same regions?. Computer Vision and Image Understanding 163, pp. 90–100. Cited by: §1.
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell (2014) Decaf: a deep convolutional activation feature for generic visual recognition. In Proceedings of the International Conference on Machine Learning (ICML), pp. 647–655. Cited by: Table 7.
  • Y. Gong, L. Wang, R. Guo, and S. Lazebnik (2014) Multi-scale orderless pooling of deep convolutional activation features. In Proceedings of the European Conference on Computer Vision, pp. 392–407. Cited by: Table 6, Table 7.
  • K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015) Draw: a recurrent neural network for image generation. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §2.4.
  • D. Guo, L. Zhu, Y. Lu, H. Yu, and S. Wang (2018) Small object sensitive segmentation of urban street scene with spatial adjacency between object classes. IEEE Transactions on Image Processing 28 (6), pp. 2643–2653. Cited by: §2.2.
  • K. He, R. Girshick, and P. Dollár (2018) Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018. Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §2.1, §3.3, §3.4.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 3. Cited by: §1, §2.1.
  • A. Jung (2019) Imgaug Library. Image augmentation for machine learning experiments. Note: [Online; accessed 21-August-2019] Cited by: §4.1.
  • D. Kiela, E. Grave, A. Joulin, and T. Mikolov (2018) Efficient large-scale multi-modal classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2.4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1, §2.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §2.2.
  • B. Liu, J. Liu, J. Wang, and H. Lu (2014) Learning a representative and discriminative part model with deep convolutional features for scene recognition. In Proceedings of the Asian Conference on Computer Vision (ACCV), pp. 643–658. Cited by: Table 6.
  • D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: §2.1.
  • J. Mao, T. Xiao, Y. Jiang, and Z. Cao (2017) What can help pedestrian detection?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6034–6043. Cited by: §2.3, §2.3, §2.3.
  • R. Margolin, L. Zelnik-Manor, and A. Tal (2014) Otc: a novel local descriptor for scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 377–391. Cited by: §2.1.
  • V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Vol. 2. Cited by: §2.4.
  • G. Neuhold, T. Ollmann, S. R. Bulò, and P. Kontschieder (2017) The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 5000–5009. Cited by: §2.2.
  • A. Oliva and A. Torralba (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. International Journal of Computer Vision 42 (3), pp. 145–175. Cited by: §2.1.
  • A. Oliva (2005) Gist of the scene. In Neurobiology of Attention, pp. 251–256. Cited by: §2.1.
  • D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár (2013) Exploring weak stabilization for motion feature extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2882–2889. Cited by: §2.3, §2.3.
  • E. Park, X. Han, T. L. Berg, and A. C. Berg (2016) Combining multiple sources of knowledge in deep cnns for action recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1–8. Cited by: §2.3, §2.3, §2.3.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), Cited by: §4.1.
  • R. Quan, X. Dong, Y. Wu, L. Zhu, and Y. Yang (2019) Auto-reid: searching for a part-aware convnet for person re-identification. arXiv preprint arXiv:1903.09776, 2019. Cited by: §2.1.
  • A. Quattoni and A. Torralba (2009) Recognizing indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 413–420. Cited by: item 3, §2.1, Figure 6, §4.2, §4.5, §4.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §1.
  • A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson (2014) CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) workshops, pp. 806–813. Cited by: Table 6.
  • K. Simonyan and A. Zisserman (2014a) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576. Cited by: §2.3, §2.3, §2.3.
  • K. Simonyan and A. Zisserman (2014b) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. Cited by: §1, §2.1.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9. Cited by: §2.1.
  • F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017a) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3156–3164. Cited by: §2.4.
  • P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell (2018) Understanding convolution for semantic segmentation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1451–1460. Cited by: §2.2.
  • Z. Wang, L. Wang, Y. Wang, B. Zhang, and Y. Qiao (2017b) Weakly supervised patchnets: describing and aggregating local patches for scene recognition. IEEE Transactions on Image Processing 26 (4), pp. 2028–2041. Cited by: §1, §2.1, Table 6, Table 7.
  • S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV). Cited by: §2.4, §2.4, Figure 4, §3.4, §4, §4, Table 2.
  • J. Wu and J. M. Rehg (2011) CENTRIST: a visual descriptor for scene categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (8), pp. 1489–1501. Cited by: §2.1.
  • J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) SUN database: large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. Cited by: item 3, Figure 7, §4.2, §4.5, §4.
  • T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun (2018) Unified perceptual parsing for scene understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 418–434. Cited by: §2.2, §4.1.
  • G. Xie, X. Zhang, S. Yan, and C. Liu (2015) Hybrid CNN and dictionary-based models for scene recognition and domain adaptation. IEEE Transactions on Circuits and Systems for Video Technology 27 (6), pp. 1263–1274. Cited by: §2.1, Table 6, Table 7.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2048–2057. Cited by: §2.4.
  • S. Yan, J. S. Smith, W. Lu, and B. Zhang (2017) CHAM: action recognition using convolutional hierarchical attention model. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 3958–3962. Cited by: §2.4.
  • B. Yang, J. Yan, Z. Lei, and S. Z. Li (2015) Convolutional channel features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 82–90. Cited by: §2.3, §2.3.
  • D. Yoo, S. Park, J. Lee, and I. S. Kweon (2014) Fisher kernel for deep neural activations. arXiv preprint arXiv:1412.1628. Cited by: §2.1, Table 6.
  • A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and S. Savarese (2018) Taskonomy: disentangling task transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3712–3722. Cited by: §1.
  • H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2881–2890. Cited by: §2.2.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. Cited by: Figure 2, §3.1, Figure 9, §4.6.
  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2015) Object detectors emerge in deep scene CNNs. In Proceedings of the International Conference on Learning Representations (ICLR). Cited by: §1.
  • B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2018) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464. Cited by: Figure 1, item 3, §1, §1, §1, §2.1, §2.1, Figure 7, §4.1, §4.2, §4.5, §4, Table 8, §4.
  • B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva (2014) Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, pp. 487–495. Cited by: Table 6, Table 7.
  • B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba (2017) Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: item 3, §1, §2.2, Figure 6, §4.2, §4.
  • Z. Zuo, G. Wang, B. Shuai, L. Zhao, Q. Yang, and X. Jiang (2014) Learning discriminative and shareable features for scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 552–568. Cited by: Table 6.