Salient Object Detection via High-to-Low Hierarchical Context Aggregation
Recent progress on salient object detection mainly focuses on how to effectively integrate convolutional side-output features in convolutional neural networks (CNNs). Based on this, most existing state-of-the-art saliency detectors design complex network structures to fuse the side-output features of the backbone feature extraction networks. However, should the fusion strategies become more and more complex for accurate salient object detection? In this paper, we observe that the contexts of a natural image can be well expressed by a high-to-low self-learning of side-output convolutional features. As we know, the contexts of an image usually refer to the global structures, and the top layers of a CNN usually learn to convey global information. On the other hand, it is difficult for the intermediate side-output features to express contextual information. Here, we design an hourglass network with intermediate supervision to learn contextual features in a high-to-low manner. The learned hierarchical contexts are aggregated to generate the hybrid contextual expression for an input image. Finally, the hybrid contextual features can be used for accurate saliency estimation. We extensively evaluate our method on six challenging saliency datasets, and our simple method achieves state-of-the-art performance under various evaluation metrics. Code will be released upon paper acceptance.
The progress in saliency detection has been beneficial to a wide range of vision applications, including image retrieval, visual tracking [36], content-aware video compression, and weakly supervised learning [46, 47]. Although numerous valuable models have been presented [25, 4, 57, 29, 17, 53, 15] and significant progress has been made, accurately detecting salient objects in static images remains an open problem, especially in complicated scenarios.
Conventional saliency detection methods usually design hand-crafted low-level features and heuristic priors, which struggle to describe semantic objects and scenes. Recent progress on saliency detection mainly benefits from convolutional neural networks (CNNs) [32, 26, 57, 45, 54, 21, 22]. The backbone of a CNN usually consists of several blocks of stacked convolutional and pooling layers, in which the blocks near the network inputs are called bottom sides and the others top sides. It is well accepted that the top sides of a CNN contain semantically meaningful information while the bottom sides contain complementary spatial details [48, 30, 16]. Therefore, current state-of-the-art saliency detectors [4, 51, 45, 54, 29, 44, 55, 43, 16] mainly aim at designing complex network structures to fuse the features or results from various side-outputs. For example, Hou et al. carefully selected several combination sets of various side-output results and fused the combination results for accurate saliency segmentation. Wang et al. proposed a recurrent module to filter out noisy information for side-output features. Although significant progress has been made in this direction [16, 55, 44], the side-output fusion strategies have become more and more complex. Do we have to continue in this direction to further improve saliency detection?
To answer this question, we notice that some recent studies [58, 52] find that a CNN can learn global contextual information for input images at the top convolution layers by enlarging receptive fields. This is not directly applicable to saliency detection, because saliency detection requires not only global contextual information but also local spatial details. Instead of fusing side-output features in complicated ways as in [4, 57, 51], we consider constructing hierarchical contextual features. Specifically, we let the global contextual information obtained at the top sides flow into the bottom sides. The top contextual information learns to guide the bottom sides to construct contextual features at fine spatial scales that only emphasize salient objects. Hence the obtained contexts are different from side-output features or combinations of them, which only contain or at least emphasize local representations of an image. A visualization of contexts learned by our model can be found in Figure 1.
Intuitively, the hierarchical contexts should be learned in a high-to-low manner, which means the top sides should learn contexts first and then the bottom sides can learn contexts at large spatial resolutions using the information flowing from the top sides. Hence we build an hourglass network and add intermediate supervision after the context module at each side. In the training process, we find that the top sides are automatically optimized first, which is consistent with our hypothesis. This will be demonstrated in Section 4. Finally, we simply aggregate the hierarchical contexts for accurate salient object detection. The experimental results demonstrate that our simple idea can favorably outperform recent state-of-the-art methods that use heavily engineered networks. Our contributions are threefold:
We build an hourglass network with intermediate supervision to learn hierarchical contexts, which are generated with the guidance of global contextual information and thus only emphasize salient objects at different scales.
We propose a hierarchical context aggregation module to ensure the network is optimized from the top sides to bottom sides. We aggregate the learned hierarchical contexts at different scales to perform accurate salient object detection unlike previous studies [16, 55, 43] that fuse side-output features or some complex combinations of side-outputs.
We extensively compare our method with recent state-of-the-art methods on six popular datasets. Our simple method favorably outperforms these competitors under various metrics.
Salient object detection is a very active research field due to its wide applications and challenging scenarios. Here, we briefly divide the related work into four parts to review the development of saliency detection and context learning.
Heuristic saliency detection methods usually extract hand-crafted low-level features and apply machine learning models to classify these features. Some heuristic saliency priors are utilized to ensure the accuracy, such as color contrast [1, 7], center prior [20, 19] and background prior [50, 60]. DRFI is a comprehensive representative of this kind of method, integrating various features and priors. However, it is difficult for the low-level features to describe semantic information, and the saliency priors are not robust enough for complicated scenarios. Hence deep learning based methods have dominated this field due to their powerful representation capability.
Region-based saliency detection appeared in the early era of deep learning based saliency. These approaches view each image patch as a basic processing unit to perform saliency detection. Lee et al. utilized both low-level hand-crafted features and high-level deep features to classify candidate regions as salient or not. The low-level features are compared with other parts of an image to form a distance map that is then encoded by the CNN. Wang et al. presented a two-stage training strategy to sort the segmented object proposals, in which the first stage extracts features and the second stage predicts the saliency score for each region. Li et al. extracted multi-scale deep features which are used to infer the saliency scores for image segments.
CNN-based image-to-image saliency detection models [4, 57, 51, 45, 54, 29, 44, 17, 55, 5, 43, 27, 28, 16, 24, 32, 26] take saliency detection as a pixel-wise binary classification task and perform image-to-image predictions. For example, Chen et al.  proposed a two-stream network which consists of a fixation stream and a semantic stream. Zhang et al.  introduced an attention guided network that progressively integrates multiple layer-wise attention for saliency detection. Islam et al.  introduced a new deep learning solution with a hierarchical representation of relative saliency and stage-wise refinement.
How to effectively fuse multi-level CNN features is the main research direction for CNN-based saliency detection methods [4, 51, 45, 54, 29, 44, 55, 43, 16, 24, 32, 26]. There are too many studies to list here, but the general trend of recent designs is becoming more and more complicated. We will provide detailed discussion about these methods in Section 4. Compared with them, we focus on a simple yet effective design in this paper.
Context learning has recently been explored in semantic segmentation [58, 52]. Zhao et al. added a pyramid pooling module for global context construction upon the final layer of the deep network, by which they significantly improved the performance of semantic segmentation. Zhang et al. built a context encoding module using the encoding layer on top of the neural network to conduct accurate semantic segmentation. In saliency detection, Wang et al. followed the pyramid pooling design of Zhao et al. to extract contextual information. Zhao et al. proposed a global context module and a local context module to extract the global and local contexts. The global context module is fed with a superpixel-centered large window including the full image, while the local context module takes a superpixel-centered small window with a small image patch. Hence the goal of extracting multiple contexts in that work is achieved by multi-scale inputs.
A full literature review of salient object detection is out of the scope of this paper. Please refer to [2, 8, 12] for a more comprehensive survey. In this paper, we focus on context learning rather than the previous multi-level feature fusion to improve saliency detection. Different from prior saliency work that uses multiple networks, each with a pyramid pooling module at the top, we propose an elegant single network. Different from prior work that uses multi-scale inputs, we use single-scale inputs to extract multi-level contexts. The resulting model is simple yet effective.
In this section, we will elaborate our proposed framework for salient object detection. We first introduce our base network in Section 3.1. Then, we present a Mirror-linked Hourglass Network (MLHN) in Section 3.2. A detailed description of the Hierarchical Context Aggregation (HCA) module is finally provided in Section 3.3. We show an overall network architecture in Figure 2.
We use VGG16 as our backbone net, whose final fully connected layers are removed to serve image-to-image translation. Salient object detection usually requires global information to judge which objects are salient, so enlarging the receptive field of the network is helpful. To this end, we retain the final pooling layer and, following prior work, transform the last two fully connected layers into convolution layers, each with 1024 channels. Therefore, there are five pooling layers in the backbone net. They divide the convolution layers into six convolution blocks, ordered from bottom to top. We consider the top block as the top valve that controls the overall contextual information flow in the network. The resolution of feature maps in each convolution block is half that of the preceding one. Following [16, 48], the side-output of each convolution block means the connection from the last layer of this block.
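As a rough illustration of the halving described above, the following sketch lists the feature map size of each of the six blocks for a given input size. It assumes that every pooling layer rounds odd sizes up (ceiling), as Caffe pooling layers do; the function name `side_resolutions` and the example input size are hypothetical.

```python
import math

def side_resolutions(height, width, num_blocks=6):
    """Return the (height, width) of the feature maps in each convolution block.

    Assumption: every pooling layer between two blocks halves the resolution
    and rounds up (ceiling), as Caffe pooling layers do.
    """
    sizes = [(height, width)]
    for _ in range(num_blocks - 1):  # five pooling layers separate six blocks
        h, w = sizes[-1]
        sizes.append((math.ceil(h / 2), math.ceil(w / 2)))
    return sizes

sizes = side_resolutions(300, 400)
for i, (h, w) in enumerate(sizes, start=1):
    print(f"block {i}: {h} x {w}")
```

Under this assumption, a 19-row map is pooled to 10 rows, so upsampling the smaller map by two gives 20 rows, one more than the map it has to be fused with. This is one reason a crop step is needed when upsampled and lower-side feature maps are combined.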
Based on the backbone net, we build a Mirror-linked Hourglass Network (MLHN). An overview of MLHN is displayed in Figure 2. More concretely, we upsample the top convolution block by two times and connect a convolution layer (w/o non-linearization) after the block directly below it. The resulting two feature maps are fused using an element-wise summation. For the upsampling, the side-output of the top block is first connected to a convolution layer (w/o non-linearization), which is followed by a deconvolution layer. This deconvolution upsamples a feature map by two times using bilinear interpolation. A crop operation is performed to ensure that the upsampled feature map has the same size as the feature map of the lower block. To convert the fused feature map into contextual information, two sequential convolution layers are then connected to obtain the contextual features of this side. These two convolution layers play the role of a transform function, which uses the contextual information of the upper side to guide the features of the lower side to generate contexts. The contextual features of the remaining sides can be obtained in a similar way. For a clear presentation, this can be formulated as
A standard encoder-decoder network can be formulated as
In this way, the proposed MLHN gradually flows top contextual information into lower sides, so the lower sides are expected to only emphasize the details of salient regions in an image.
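The top-down step described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: all convolutions are reduced to 1x1 convolutions, nearest-neighbour repetition stands in for the bilinear deconvolution, a single 1x1 convolution with ReLU stands in for the two sequential convolution layers, and the channel counts and function names (`conv1x1`, `upsample2x`, `decode_step`) are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution without non-linearity; x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Double the spatial size (nearest neighbour stands in for the bilinear deconvolution)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode_step(upper_context, side_feature, w_up, w_side, w_transform):
    """Fuse the context from the side above with the encoder feature of the current side."""
    _, h, w = side_feature.shape
    up = upsample2x(conv1x1(upper_context, w_up))[:, :h, :w]  # upsample, then crop
    fused = up + conv1x1(side_feature, w_side)                # element-wise summation
    # Assumed stand-in for the two sequential convolution layers (transform function).
    return np.maximum(conv1x1(fused, w_transform), 0.0)

rng = np.random.default_rng(0)
upper = rng.standard_normal((8, 10, 13))  # context of the upper side
side = rng.standard_normal((4, 19, 25))   # encoder feature of the current side
context = decode_step(
    upper, side,
    w_up=rng.standard_normal((6, 8)),
    w_side=rng.standard_normal((6, 4)),
    w_transform=rng.standard_normal((6, 6)),
)
print(context.shape)  # (6, 19, 25)
```

Applying `decode_step` repeatedly, each time feeding the returned context as `upper_context` of the next lower side, mirrors the high-to-low flow of contextual information.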
The two sequential convolution layers (orange box in Figure 2) are with kernel size for and kernel size for . The numbers of output channels are 512, 256, 256, 128 and 128 from to , respectively. On one hand, the encoded features in the base network are connected to the decoder part in a Mirror-linked way. On the other hand, the proposed network is symmetric with as its center, just like an hourglass. Hence we call our network Mirror-linked Hourglass Network (MLHN).
Intuitively, the proposed MLHN should be optimized from the top sides to the bottom sides, because the global contextual information is contained in the top sides and flows to the bottom sides gradually. Therefore, unlike previous encoder-decoder networks [37, 31] that impose supervision only at the final layer of the decoder, we adopt supervision at all context learning stages through a Hierarchical Context Aggregation (HCA) module. The HCA module is shown in Figure 3.
The side-output of each decoder side is first connected with two convolution layers, which are with kernel size of for , for and for . The numbers of channels for them are 512, 512, 256, 256, 128 and 128, respectively. Then, we add a convolution layer without non-linearization to decrease the number of channels to 25 for all sides. The 25-channel map is the context map at each side. A deconvolution layer with fixed bilinear kernel is employed to upsample the context map into the size of original image. In order to better understand this process, we formulate it as
Here, one function is a linear transformation for channel reorganization, and the other transforms the fused features at each stage into contexts at various scales.
The saliency prediction map can be obtained by simply adding a convolution (w/o non-linearization). We put the intermediate supervision here for each side to help the top sides be optimized first. The upsampled context maps of all sides are aggregated using a standard concatenation. Two convolutions follow to further fuse the hierarchical contexts for the final high-quality prediction of saliency maps. We empirically find that large kernel sizes are slightly helpful here, but they also slow the network down because the aggregated context map has the size of the original image. Therefore, we do not use larger kernel sizes.
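The aggregation step can be illustrated with a small NumPy sketch. It assumes six sides with 25-channel context maps, each at half the resolution of the previous one; nearest-neighbour lookup replaces the fixed bilinear deconvolution, and a single 1x1 convolution stands in for the fusion convolutions. The helper names `upsample_to` and `aggregate_contexts` are hypothetical.

```python
import numpy as np

def upsample_to(x, height, width):
    """Resize a (C, h, w) map to (C, height, width) by nearest-neighbour lookup.

    Assumption: this replaces the fixed bilinear deconvolution to keep the sketch short.
    """
    _, h, w = x.shape
    rows = np.arange(height) * h // height
    cols = np.arange(width) * w // width
    return x[:, rows[:, None], cols[None, :]]

def aggregate_contexts(context_maps, height, width, w_fuse):
    """Concatenate upsampled context maps and fuse them into one saliency logit map."""
    stacked = np.concatenate([upsample_to(c, height, width) for c in context_maps], axis=0)
    logits = np.einsum("oc,chw->ohw", w_fuse, stacked)  # 1x1 stand-in for the fusion convs
    return stacked, logits

rng = np.random.default_rng(0)
H, W = 64, 96
# Six sides, each with a 25-channel context map at half the resolution of the previous one.
contexts = [rng.standard_normal((25, H >> i, W >> i)) for i in range(6)]
stacked, logits = aggregate_contexts(contexts, H, W, rng.standard_normal((1, 150)))
print(stacked.shape, logits.shape)  # (150, 64, 96) (1, 64, 96)
```

The concatenated map has 6 x 25 = 150 channels at full image resolution, which is why the kernel size of the fusion convolutions has a direct impact on speed.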
The essential function of HCA lies in three aspects. Firstly, the intermediate supervision of HCA can help MLHN be optimized from top to bottom, so that the global contextual information at top sides will flow to bottom sides gradually. Secondly, the added convolution layers can encourage each side to generate contexts at the corresponding scale. Thirdly, the hierarchical contexts at all sides are aggregated for final saliency map prediction, unlike previous methods [16, 48, 30] that compute final results by fusing results of various side-outputs.
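A minimal sketch of intermediate supervision is given below. The text does not specify the loss function or the loss weights, so this sketch assumes a per-pixel sigmoid cross-entropy and equal weights for all side predictions and the fused prediction; the function names are hypothetical, and all predictions are assumed to be already upsampled to the ground-truth size.

```python
import numpy as np

def sigmoid_cross_entropy(logits, target):
    """Mean per-pixel sigmoid cross-entropy, computed in a numerically stable form."""
    return np.mean(np.maximum(logits, 0) - logits * target + np.log1p(np.exp(-np.abs(logits))))

def deeply_supervised_loss(side_logits, fused_logits, target):
    """Sum of the losses of all side predictions plus the loss of the fused prediction.

    Assumption: every term has weight 1.
    """
    side_losses = [sigmoid_cross_entropy(s, target) for s in side_logits]
    total = sum(side_losses) + sigmoid_cross_entropy(fused_logits, target)
    return total, side_losses

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0                               # toy ground-truth mask
side_logits = [np.zeros((8, 8)) for _ in range(6)]   # six uninformative side predictions
total, per_side = deeply_supervised_loss(side_logits, np.zeros((8, 8)), target)
print(total, per_side)
```

Tracking the entries of `per_side` during training is one way to check which sides are optimized first.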
Due to the nature of the multi-scale and multi-level learning in deep neural networks, a large number of architectures have emerged that are designed to utilize the hierarchical deep features. For example, multi-scale learning can use skip-layer connections [13, 31], which are widely accepted owing to their strong capability to fuse hierarchical deep features inside the networks. On the other hand, multi-scale learning can use encoder-decoder networks that progressively decode the hierarchical deep representation learned in the encoder backbone net. We have seen these two structures applied in various vision tasks.
We continue our discussion by briefly categorizing inside multi-scale deep learning into five classes: hyper feature learning, FCN style, HED style, DSS style and encoder-decoder networks. An overall illustration of them is summarized in Figure 4. Our following discussion of them will clearly show the differences between our proposed HCA network and previous efforts on multi-scale learning.
Hyper feature learning: Hyper feature learning is the most intuitive way to pursue multi-scale information, as illustrated in Figure 4(a). Examples of this structure for saliency include [24, 51, 5, 43, 27]. These models concatenate/sum multi-scale deep features from multiple levels of backbone nets [24, 51] or branches of the multi-stream nets [5, 43, 27]. The fused hyper features are then used for final predictions.
FCN style: Since the top sides of neural networks usually contain more reliable semantic information, a reasonable revision of hyper feature learning is to progressively fuse deep features from upper layers to lower layers [31, 37], as shown in Figure 4(b). The top semantic features are combined with bottom low-level features to capture fine-grained details. The feature fusion can be a simple element-wise summation, a simple feature map concatenation (U-Net), or more complex designs based on them.
Most recent saliency models fall into this category [57, 45, 54, 29, 44, 17, 55]. They differ from each other in their fusion strategies. One notable similarity of these models is that the final prediction is produced using the fused feature maps at the largest scale. Hence the final fused features are expected to learn both global semantic information and local low-level details. To better achieve this goal, recent state-of-the-art models have designed very complex fusion strategies [29, 44, 4].
HED style: HED-like networks [48, 30] add deep supervision at the intermediate sides to perform predictions, and the final result is a combination of predictions at all sides (shown in Figure 4(c)). Unlike multi-scale feature fusion, HED performs multi-scale prediction fusion. Chen et al.  followed this style to perform saliency detection.
DSS style: DSS network  is an extension of HED architecture. The side-output of each network side is fused with side-outputs from some of the upper sides. For each side, which upper sides to choose for fusion is carefully selected by experiments. The difference between HED and DSS can be clearly seen in Figure 4(d).
Encoder-decoder networks: To benefit from the powerful representation capability of deep networks, one can also decode the high-level representation at the top layers , as displayed in Figure 4(e). The decoder gradually enlarges its resolution to decode local information from upper layers.
HCA network: We show a streamlined diagram of our proposed HCA network in Figure 4(f). Its left part looks a bit like an FCN (Figure 4(b)) or an encoder-decoder network (Figure 4(e)) with parallel connections. Unlike the FCN and encoder-decoder nets that perform predictions using the final fused hybrid features, our HCA network aggregates hierarchical contexts to perform predictions. The contexts are learned in a high-to-low manner through the proposed HCA module, so that the top sides, which are optimized first, can generate global contextual information to guide lower layers to produce scale-specific contexts. We show a demonstration of this high-to-low optimization in Figure 5, which includes the loss curves of all sides during training. We can clearly see that the top side is optimized first and the lower sides follow sequentially. Without carefully designed feature fusion strategies [29, 55, 44, 4], the simple HCA can learn high-quality contexts for accurate salient object detection.
We implement the proposed network using the well-known Caffe framework. The convolution layers contained in the original VGG16 are initialized using the publicly available pretrained ImageNet model. The weights of the other layers are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The upsampling operations are implemented by deconvolution layers with bilinear interpolation kernels, which are frozen during training. The network is optimized using SGD with the poly learning rate policy, in which the current learning rate equals the base one multiplied by $(1 - \frac{curr\_iter}{max\_iter})^{power}$. The hyperparameters $power$ and $max\_iter$ are set to 0.9 and 20000, respectively, so that the training takes 20000 iterations in total. The initial learning rate is set to 1e-7. The momentum and weight decay are set to 0.9 and 0.0005, respectively. All the experiments in this paper are performed on a TITAN Xp GPU.
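The learning rate schedule can be sketched as below, assuming the usual form of the Caffe poly policy, base_lr * (1 - iter / max_iter) ^ power, with the values reported above as defaults. The function name `poly_learning_rate` is hypothetical.

```python
def poly_learning_rate(iteration, base_lr=1e-7, max_iter=20000, power=0.9):
    """Learning rate under the poly policy (assumed Caffe form)."""
    return base_lr * (1.0 - iteration / max_iter) ** power

for it in (0, 5000, 10000, 15000, 20000):
    print(it, poly_learning_rate(it))
```

The rate starts at the base value and decays smoothly to zero at the final iteration, so no separate step schedule is needed.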
Datasets. We extensively evaluate our method on six popular datasets, including DUTS, ECSSD, SOD, HKU-IS, THUR15K and DUT-OMRON. These six datasets consist of 15572, 1000, 300, 4447, 6232 and 5168 natural complex images, respectively, with corresponding pixel-wise ground truth labeling. Among them, DUTS is a recently released challenging dataset consisting of 10553 training images and 5019 test images in very complex scenarios. For fair comparison, we follow recent studies [44, 29, 43, 51] to use the DUTS training set for training and test on the DUTS test set and the other five datasets.
We utilize two evaluation metrics to evaluate our method as well as other state-of-the-art salient object detectors: the max F-measure score and the mean absolute error (MAE). Given a predicted saliency map with continuous probability values, we can convert it into binary maps with arbitrary thresholds and compute the corresponding precision/recall values. Taking the average of the precision/recall values over all images in a dataset, we obtain many mean precision/recall pairs. The F-measure score is then an overall performance indicator:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}},$$

in which $\beta^2$ is usually set to 0.3 to emphasize more on precision. We follow recent studies [32, 16, 55, 56, 29, 25, 4] to report the max $F_\beta$ across different thresholds. Given a saliency map $S$ and the corresponding ground truth $G$ that are normalized to [0, 1], MAE can be calculated as

$$\text{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| S(i, j) - G(i, j) \right|,$$

where $H$ and $W$ represent the height and width, respectively. $S(i, j)$ denotes the saliency score at location $(i, j)$, similar to $G(i, j)$.
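The two metrics can be sketched in NumPy as follows. The sketch follows the description above: precision and recall are averaged over all images for each threshold, the F-measure is computed from the averaged pairs with the precision weight set to 0.3, and the maximum over thresholds is reported. The number of thresholds (256), the convention of zero precision for an empty prediction, and the function names are assumptions.

```python
import numpy as np

def max_f_measure(saliency_maps, ground_truths, beta_sq=0.3, num_thresholds=256):
    """Max F-measure over thresholds, using dataset-averaged precision and recall."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds, endpoint=False)
    precision = np.zeros(num_thresholds)
    recall = np.zeros(num_thresholds)
    for sal, gt in zip(saliency_maps, ground_truths):
        gt = gt > 0.5
        for k, t in enumerate(thresholds):
            pred = sal > t                              # binarize at threshold t
            tp = np.logical_and(pred, gt).sum()
            precision[k] += tp / max(pred.sum(), 1)     # assumed: 0 if nothing is predicted
            recall[k] += tp / max(gt.sum(), 1)
    precision /= len(saliency_maps)
    recall /= len(saliency_maps)
    f = (1 + beta_sq) * precision * recall / np.maximum(beta_sq * precision + recall, 1e-12)
    return f.max()

def mean_absolute_error(saliency_map, ground_truth):
    """Mean absolute difference between a saliency map and its ground truth, both in [0, 1]."""
    return np.abs(saliency_map - ground_truth).mean()

gt = np.zeros((4, 4))
gt[1:3, 1:3] = 1.0
noisy = np.clip(gt + 0.2 * np.random.default_rng(0).random((4, 4)), 0.0, 1.0)
print(max_f_measure([noisy], [gt]), mean_absolute_error(noisy, gt))
```

A perfect prediction gives a max F-measure of 1 and an MAE of 0, which is a convenient sanity check before running the metrics on a full dataset.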
| No. | Module | Side 1 | Side 2 | Side 3 | Side 4 | Side 5 | Side 6 |
We compare our proposed salient object detector with 16 recent state-of-the-art saliency models, including DRFI, MDF, LEGS, DCL, DHS, ELD, RFCN, NLDF, DSS, SRM, Amulet, UCF, BRN, PiCA, C2S and RAS. Among them, DRFI is the state-of-the-art non-deep-learning based method, and the other 15 models are all based on deep learning. We do not report MDF results on the HKU-IS dataset because MDF uses a part of HKU-IS for training. For the same reason, we do not report DHS results on DUT-OMRON. For fair comparison, all these models are tested using the publicly available code and pretrained models released by the authors with default settings. We also report the results of the ResNet-101 version of our proposed HCA. Since ResNet is deep enough to capture global contexts, we exclude the sixth side in HCA.
Table 1 summarizes the numeric comparison in terms of max F-measure and MAE on six datasets. HCA outperforms the other competitors in most cases, which demonstrates its effectiveness. With the VGG16 backbone, the max F-measure values of HCA are 2.1%, 1.0%, 0.9%, 1.1%, 0.6% and 0.5% higher than the second best method on the DUTS, ECSSD, SOD, HKU-IS, DUT-OMRON and THUR15K datasets, respectively. On the SOD dataset in terms of the MAE metric, HCA performs slightly worse than the best result. PiCA appears to achieve second place. With the ResNet backbone, the performance gap between the proposed HCA and other ResNet based competitors is larger than with the VGG16 backbone net. Specifically, the max F-measure values of HCA are 2.2%, 1.3%, 1.3%, 1.7%, 3.0% and 0.8% higher than the second best method on the six datasets, respectively.
We also provide a qualitative comparison in Figure 6. For objects with various shapes and scales, HCA can segment the entire objects with fine details (rows 1-2). HCA is also robust to complicated backgrounds (rows 3-5), multiple objects (rows 6-7) and confusing stuff (row 8).
To evaluate the influence of various design choices of MLHN and HCA (the 2Conv blocks in Figure 2 and Figure 3), we perform seven ablation studies with the VGG16 backbone. The detailed experimental settings and corresponding evaluation results are shown in Table 2 and Table 3, respectively. We can observe that our proposed method is not sensitive to different parameter settings, and the default design achieves slightly better results. These ablation studies also reflect some interesting phenomena. For example, experiment #5 suggests that a larger convolution kernel at the sixth side is helpful to obtain accurate global contexts. Experiments #6 and #7 demonstrate that introducing more convolution channels does not improve the performance. Interestingly, we observe that the default convolution parameter settings are similar to those of DSS although we have a different network architecture (see Section 4). Perhaps this is due to the intrinsic properties of the backbone nets.
Salient object detection is highly related to the global contextual information, which can be used to judge which parts of an image are salient. Motivated by this, we propose a simple yet effective method in this paper. Our method starts from the top sides of neural networks and gradually flows the top global contexts into lower sides to obtain hierarchical contexts. These hierarchical contexts are aggregated for the final salient object detection. Our method reaches a new state of the art on six datasets when compared with 16 recent saliency models. In the future, we plan to apply the proposed network architecture to other vision tasks that need global information.