Computer vision has achieved great success in recent years thanks to deep learning techniques that rely on large-scale training data. Some tasks, such as visual tracking and salient object detection, deal with single-modal data (e.g. RGB, depth or thermal images/videos) and are easily influenced by illumination, cluttered backgrounds, etc. Recent works have validated the effectiveness of incorporating information from multiple modalities (named multi-modal), whether the data are heterogeneous or homogeneous. For example, RGB-Depth or RGB-Thermal data is introduced into multi-modal moving object detection Li2017Weighted , visual tracking Li2016Learning or salient object detection Li2017A ; and RGB-Text-Speech is introduced into sentiment analysis poria2016fusing . Although a good deep neural network can already be obtained from these data for the corresponding task, the model may still perform poorly when challenging factors occur.
According to our observations, the data collected from different domains can be complementary to each other. However, in some cases, only a limited number of modalities can provide useful information for training deep neural networks. If all data are treated equally, a noisy modality will mislead the final representation. To make deep neural networks robust to a modality of poor quality while still exploiting the rich information from the remaining domains, a quality measurement mechanism is required in the design of the network architecture.
The target of this paper is to optimize hyperparameters dynamically according to the different quality of different domains. One intuitive idea is to employ a CNN to predict the quality-aware hyperparameters for each sequence. However, we do not have ground truth values for these hyperparameters, so we cannot provide target values for the network to regress in a supervised manner. Inspired by recent progress in deep reinforcement learning, we instead train an agent to learn a policy for taking better actions by giving it a reward for each action according to the current state. The learning goal is to maximize the expected return over a time sequence, where the return at each time step is defined as the sum of rewards from this time step to the end of the sequence. For our quality-aware multi-modal task, we utilize a neural network to represent the agent and allow it to choose the quality weights for each image by regarding the choice as its action. By defining the reward as saliency detection accuracy, the goal of reinforcement learning becomes maximizing the expected cumulative salient object detection accuracy, which is consistent with the saliency evaluation. Similar views can also be found in Dong_2018_CVPR .
In this paper, we propose a general quality estimation network that can perceive the quality of input data from different sensors; the whole pipeline is shown in Figure 1. Specifically, we treat the quality estimation of each domain as a decision-making problem, and train an agent to interact with the environment in order to explore and learn how to weight each domain. The state is the input data from different domains; the actions are to increase the weight of each modality, to decrease it, or to terminate the tuning; and the reward is calculated according to the loss between the estimated results and the ground truth. The quality estimation network can be trained with deep reinforcement learning algorithms; we adopt the deep Q-network Mnih2015Human in this paper because it is simple and efficient to implement. More advanced reinforcement learning techniques, such as dueling network architectures wang2016dueling or actor-critic algorithms mnih2016asynchronous , can also be applied in our setting. The network automatically assigns low quality scores to modalities of poor quality in order to make the final results more accurate. We demonstrate the proposed quality estimation network on multi-modal salient object detection in this paper.
The main contributions of this paper can be summarized as follows:
We introduce a novel and general quality estimation network using deep reinforcement learning, which does not require any explicit annotations of the quality.
We successfully apply the introduced quality estimation network to the multi-modal saliency detection task, and further propose a coarse-to-fine salient object detection framework based on generative adversarial networks.
Extensive experiments on two public multi-modal saliency detection datasets validate the effectiveness of the introduced algorithm.
2 Related Works
In this section, we give a brief review of multi-modal saliency detection methods, deep reinforcement learning and generative adversarial networks. Comprehensive literature reviews of these saliency detection methods can be found in Peng2014RGBD Borji2014Salient .
Deep Reinforcement Learning.
Deep learning and reinforcement learning are regarded as the most important routes toward general artificial intelligence. Different from supervised and unsupervised learning, reinforcement learning aims at learning to execute the "right" action in a given environment (state) so as to obtain the maximum reward. The whole learning process of the agent is guided by the reward given by the environment. Deep reinforcement learning (DRL) was first proposed by Mnih et al. Mnih2015Human in 2013, who utilized deep neural networks, i.e. Deep Q-learning Networks (DQN), to parametrize an action-value function for playing Atari games, reaching human-level performance. Perhaps the most notable and successful application of reinforcement learning is the game of Go, where a policy network and a value network were combined to beat many world-class professional players Silver2016Mastering . Asynchronous deep reinforcement learning was also introduced in Babaeizadeh2016GA3C to tackle the training efficiency issue by Mnih et al. In computer vision, DQN has also been applied to many domains, such as object detection caicedo2015active ; Kong2017Collaborative ; Jie2017Tree , visual tracking yun2017adnet ; Wang_2018_CVPR , and face hallucination Cao2017Attention . Caicedo et al. introduced DRL into the object detection community in caicedo2015active , which is also the first attempt to treat object detection as a decision-making problem. Other DRL-based object detectors further improve this baseline by introducing a tree-structured search process Jie2017Tree or a multi-agent DRL algorithm Kong2017Collaborative . Yun et al. yun2017adnet propose the action-decision network, which treats visual tracking as a decision-making problem and teaches the agent to move the bounding box along with the target object. However, no prior work has focused on handling the saliency detection problem with deep reinforcement learning.
Our work is the first to introduce DRL into the multi-modal saliency detection community, automatically learning to weight different data in order to better fuse the multi-modal information.
Generative Adversarial Network. More and more researchers are focusing on generative adversarial networks (GANs), which were first proposed by Goodfellow et al. in Goodfellow2014Generative . Recently, many works have attempted to generate more realistic images Arjovsky2017Wasserstein Gulrajani2017Improved , and there are also interesting works on image transformation Isola2016Image Dong2017Unsupervised . The image-conditioned GAN for super-resolution proposed by Ledig et al. achieved impressive performance Ledig2016Photo . Pan et al. first proposed to generate saliency results for given images based on GANs in Pan2017SalGAN . GANs have also achieved great success in text-based image generation, such as Dash2017TAC . Li et al. propose to use a perceptual GAN to handle small object detection in Li2017Perceptual . Besides, the theoretical modelling of GANs has been one of the hottest topics in recent years Arjovsky2017Wasserstein Saatchi2017Bayesian Wang2017IRGAN Yu2016SeqGAN . To the best of our knowledge, this work makes the first attempt to introduce GANs into the multi-modal saliency detection task.
Multi-Modal Saliency Detection. The multi-modal saliency detection discussed in this paper mainly focuses on RGBD and RGBT data. Compared with RGB saliency detection, multi-modal salient object detection has received less research attention Maki1996A , Lang2012Depth , Desingh2013Depth , Zhang2010Stereoscopic , Shen2012A . An early computational model of depth-based attention, which measures disparity, flow and motion, was proposed by Maki et al. Maki1996A . Similarly, Zhang et al. propose a stereoscopic saliency detection algorithm based on depth and motion contrast for 3D videos in Zhang2010Stereoscopic . Desingh et al. Desingh2013Depth estimate salient regions by fusing the saliency maps produced independently from appearance and depth cues. However, these methods either treat the depth map as an indicator for weighting the RGB saliency map Maki1996A , Zhang2010Stereoscopic or consider the depth map as an independent image channel for saliency detection Desingh2013Depth , Lang2012Depth . On the other hand, Peng et al. Peng2014RGBD propose a multi-stage RGBD model that combines both depth and appearance cues to detect saliency. Ren et al. Ren2015Exploiting directly integrate the normalized depth prior and the surface orientation prior with RGB saliency cues for RGBD saliency detection. These methods combine the depth-induced saliency map with the RGB saliency map either directly Ju2014Depth , Ren2015Exploiting or in a hierarchical way Peng2014RGBD to calculate the final RGBD saliency map. However, such saliency-map-level integration is not optimal, as it is restricted by the already determined saliency values.
3 Our Method
In this section, we first give an overview of the designed quality-aware multi-modal saliency detection network; the whole pipeline is shown in Figure 1. Then, we introduce the coarse single-modal saliency estimation network. After that, we explain in detail why and how to adaptively weight the multi-modal data via deep reinforcement learning. Finally, we describe how to train and test the adaptive weighting module.
To validate the effectiveness of our proposed general quality estimation network, we conduct our experiments on multi-modal saliency detection. This task aims at finding the salient regions in multi-modal data, and its key lies in how to adaptively fuse the multi-modal data to predict the final saliency results. The proposed multi-modal saliency detector dynamically pursues this target by adaptively weighting the saliency result of each modality using deep reinforcement learning, as shown in Figure 1.
For the coarse single-modal saliency estimation network, we introduce the conditional generative adversarial network (CGAN) to predict coarse saliency maps. The CGAN consists of two sub-networks, i.e. the generator G and the discriminator D. The generator follows the encoder-decoder framework: the encoder is a truncated VGG network (with the fully connected layers removed) that extracts features from the input images, and the decoder is a reversed truncated VGG network that upsamples the encoded information and outputs the saliency detection results. The discriminator is a standard convolutional neural network (CNN), which is introduced to detect whether a given image is real (from the ground truth saliency maps) or fake (from the generated saliency results). Through the competition between these two models, both of them can alternately and iteratively boost their performance. Moreover, we also adopt the content loss to stabilize GAN training and speed up the training process, as bousmalis2017unsupervised Pan2017SalGAN do. Hence, for each modality, we obtain one coarse saliency result produced by the corresponding saliency generation network.
We handle the adaptive fusion with deep reinforcement learning, which fuses the multi-modal data through the interaction between the agent and the environment. We take the output of the GANs as the state; the actions are to increase the weight values, to decrease them, or to terminate the tuning; and we give the agent a positive or negative reward according to the loss between the predicted saliency maps and the ground truth. In the testing phase, the deep Q-network can be used directly to predict the weight of each modality until the terminate action is selected or other stopping conditions are met. This is the first time that quality-aware multi-modal adaptive fusion is treated as a decision-making problem, and the proposed weighting mechanism can also be applied to other quality-aware tasks.
3.2 Review: Generative Adversarial Network
GANs attempt to learn a mapping from a random noise vector z to a generated image y, G: z → y, in an unsupervised way Goodfellow2014Generative . They utilize a discriminative network D to judge whether a sample comes from the dataset or is produced by a generative model G. These two networks, i.e. G and D, are trained simultaneously so that G learns to generate images that are hard for D to classify, while D attempts to discriminate the images generated by G. Finally, when G is well trained, it is not easy for D to detect the generated images.
The whole training procedure of GANs can be regarded as a min-max process:
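For reference, the standard GAN objective of Goodfellow2014Generative can be written in the notation used here as follows. The symbols $p_{\mathrm{data}}$ and $p_z$ for the data and noise distributions are assumed names that do not appear in this text.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{y \sim p_{\mathrm{data}}(y)} \big[ \log D(y) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```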
Conditional GANs generate images y based on a random noise vector z and an observed image x, G: {x, z} → y. The whole training procedure of CGANs can be formulated as:
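A commonly used form of the conditional objective, given here as a sketch in the same notation, conditions both networks on the observed image x. The joint expectations over $(x, y)$ and $(x, z)$ are an assumption about how the samples are drawn.

```latex
\min_{G} \max_{D} \; \mathcal{L}_{\mathrm{cGAN}}(G, D) =
  \mathbb{E}_{x, y} \big[ \log D(x, y) \big]
  + \mathbb{E}_{x, z} \big[ \log \big( 1 - D(x, G(x, z)) \big) \big]
```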
Pathak et al. Pathak2016Context found that combining CGANs with a traditional loss, such as a norm-based distance, generates more realistic images. The job of the discriminator remains unchanged, whereas the generator not only tries to fool the discriminator but also needs to fit the given ground truth in the sense of that loss:
3.3 Network Architecture
As shown in Figure 1, our multi-modal saliency detection can be divided into two main stages. In stage-I, we take the multi-modal data as input and directly output the corresponding coarse saliency maps. To achieve this, we introduce an encoder-decoder architecture that contains two truncated VGG networks. This encoder-decoder architecture has been widely used in many tasks, especially semantic segmentation badrinarayanan2017segnet , saliency detection Zhao2015Saliency , etc. Specifically, we remove the fully connected layers from the standard VGG network to form the encoder and reverse the network to form the decoder. Hence, we can obtain coarse saliency maps from these sub-networks. The weights of the encoder are initialized with those of the VGG-16 model pre-trained on the ImageNet dataset for general object classification Deng2009ImageNet . The weights of the decoder are randomly initialized. In the training phase, we fix the parameters of the earlier layers and only fine-tune the last two groups of convolutional layers in VGG-16 to save computational resources. We use the same discriminator as Pan2017SalGAN , which is composed of six 3×3 kernel convolutions interspersed with three pooling layers, followed by three fully connected layers.
How to adaptively fuse these coarse results is another key problem in multi-modal tasks. The target of this paper is to optimize hyperparameters dynamically according to the different quality of different domains (in this paper, RGB and thermal images, or RGB and depth images). One intuitive idea is to employ a CNN to estimate the quality-aware hyperparameters for each sequence. However, we do not have ground truth values for these parameters, so we cannot provide target values for training the network in the popular supervised way. Motivated by recent developments in deep reinforcement learning, we treat the coarse results as the state and train an agent to interact with the environment in order to capture the quality of the input data for better information fusion. This works because the learning target is to maximize the expected return over a time sequence, where the return at time step t is defined as the accumulation of rewards from t to the end of the sequence. For our quality-aware multi-modal task, we utilize a neural network to represent the agent and allow it to choose the quality weights for each image by regarding the choice as its action. By defining the reward as saliency detection accuracy, the goal of reinforcement learning becomes maximizing the expected cumulative salient object detection accuracy, which is consistent with the saliency evaluation. Similar views can also be found in Dong_2018_CVPR .
The goal of the agent is to learn from the environment a suitable weight for each modality. During the training phase, the agent receives a positive or negative reward for each decision made when interacting with the environment. When testing, the agent neither receives rewards nor updates the model; it simply follows the learned policy. Formally, the Markov Decision Process (MDP) has a set of actions A, a set of states S, and a reward function R. We define these basic elements as follows:
State. The state of our agent is a tuple with three main components, i.e. the coarse saliency results from each of the two subnetworks and the fused result from the previous steps. We resize and concatenate these three results into a single tensor, which serves as our state and is fed into two subsequent fully connected (fc) layers that output the actions.
Action. We design three actions to adjust the weights, which fall into two groups: adjusting the weights and terminating the adjustment. Based on the input state, the agent can select a series of adjustment actions (i.e. increase or decrease) to tune the weight and finally select the terminate action, thereby achieving automatic weighting. Every modality starts from the same initial weight value.
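The three actions and the resulting fusion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the step size `STEP`, the initial weight `INIT_WEIGHT`, the clipping of weights to [0, 1], and the normalised weighted sum in `fuse` are all hypothetical choices, since the text does not specify them.

```python
# Hypothetical constants: the text gives neither the step size nor the initial weight.
STEP = 0.05
INIT_WEIGHT = 0.5

ACTIONS = ("increase", "decrease", "terminate")


def apply_action(weight, action):
    """Return (new_weight, done) after applying one action to a modality weight."""
    if action == "terminate":
        return weight, True
    if action == "increase":
        return min(1.0, weight + STEP), False
    if action == "decrease":
        return max(0.0, weight - STEP), False
    raise ValueError("unknown action: %r" % (action,))


def fuse(map_a, map_b, weight_a, weight_b):
    """Fuse two same-sized saliency maps (nested lists) by a normalised weighted sum."""
    total = weight_a + weight_b
    if total == 0.0:
        # Degenerate case: fall back to equal contributions.
        weight_a = weight_b = 0.5
        total = 1.0
    return [
        [(weight_a * a + weight_b * b) / total for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(map_a, map_b)
    ]
```

An episode would start each modality at `INIT_WEIGHT`, call `apply_action` repeatedly with the actions chosen by the agent, and call `fuse` once `done` becomes true.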
Reward. The target of the agent is to obtain the maximum reward; thus, the design of the reward is key to the success of the learned policy. The reward can be computed during the training phase only, because it requires ground truth saliency maps. In this paper, we use the fused final saliency results as the criterion for rewards, measured by the mean squared loss between the predicted saliency map and the ground truth saliency map. The reward for the increase/decrease actions can be set as:
where s and s′ are the current and next state, respectively. This equation means that we give a positive reward if the loss decreases after a series of weight adjustments. Otherwise, we punish the agent with a negative reward.
When to stop the adjustment process is another key to the success of the adaptive weighting mechanism. If the adjustment stops too early, we may not obtain the optimal weights; if it is not stopped in time, the time cost becomes large. Hence, we design another reward function specifically for the terminate action:
where the pre-defined threshold parameter is set to 0.04 in our experiments. This function means that if the agent chooses the terminate action, we compute the final weighted saliency result and compare it with the ground truth saliency map to obtain the MSE value of the current state. If the MSE is less than the given threshold, we consider it time to stop the weight adjustment and give the agent a positive reward; otherwise, we give a negative reward to punish the agent.
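The two reward rules can be sketched as below. Only the threshold 0.04 and the sign conventions come from the text; the magnitudes are assumptions. `ADJUST_REWARD = 1.0` is a placeholder, and `TERMINATE_REWARD = 2.0` assumes that the reward-function parameter reported as 2 in the experimental settings is the terminate magnitude.

```python
ADJUST_REWARD = 1.0      # assumed magnitude; the text only specifies the sign
TERMINATE_REWARD = 2.0   # assumed to be the reward parameter set to 2 in the experiments
MSE_THRESHOLD = 0.04     # threshold stated in the text


def mse(pred, gt):
    """Mean squared error between two same-sized saliency maps (nested lists)."""
    count = sum(len(row) for row in gt)
    total = sum(
        (p - g) ** 2
        for row_p, row_g in zip(pred, gt)
        for p, g in zip(row_p, row_g)
    )
    return total / count


def adjust_reward(loss_before, loss_after):
    """Reward for an increase/decrease action: positive only if the loss went down."""
    return ADJUST_REWARD if loss_after < loss_before else -ADJUST_REWARD


def terminate_reward(fused, gt):
    """Reward for the terminate action: positive only if the fused MSE is below the threshold."""
    return TERMINATE_REWARD if mse(fused, gt) < MSE_THRESHOLD else -TERMINATE_REWARD
```

Under this sketch an unchanged loss is treated as a failure and penalised, which matches the "otherwise" branch of the description.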
3.4 The Training
We train the quality-aware multi-modal saliency detection network in two stages. In stage-I, we first train the coarse saliency estimation network with the mean squared loss and the adversarial loss. In stage-II, we then train the adaptive fusion module (i.e. the deep reinforcement learning module). The loss functions used in these two stages are introduced in turn below.
Loss Function in Stage-I. To achieve better saliency estimation, the proposed encoder-decoder architecture is trained by combining a content loss and an adversarial loss, a combination widely used in prior works Ledig2016Photo Pan2017SalGAN . The content loss is computed on a per-pixel basis, where each value of the predicted saliency map is compared with its corresponding value in the ground truth map. Given an image of a certain resolution, together with its ground truth saliency map and the predicted saliency map, the content loss, which measures the mean squared error (MSE) or Euclidean loss between the predicted and ground truth saliency maps, can be defined as:
The adversarial loss function is adopted from conditional generative adversarial networks (CGANs). This network consists of one generator and one discriminator, and these two models play a game-theoretical min-max game. Specifically, the generative model tries to fit the saliency distribution provided by the reference images and produces “fake” samples to fool the discriminative model, while the discriminative model tries to recognize whether the sampled image comes from the ground truth or was estimated by the generative model. Through the competition between these two models, both of them can alternately and iteratively boost their performance. The mathematical function can be formulated as:
The discriminator's output on a generated saliency map is the probability of fooling the discriminator, so the loss associated with the generator grows as the chances of fooling the discriminator become lower.
As described in the sections above, we combine the MSE loss with the adversarial loss to obtain a generator that converges more stably and quickly. The final loss function for the generator during adversarial training can be formulated as:
where the tradeoff parameter balances these two loss functions. We experimentally set this parameter to 0.33 to obtain better saliency detection results.
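The stage-I generator loss can be illustrated with the following sketch. It rests on two assumptions that the text does not confirm: the adversarial term is taken as the negative log of the discriminator's output on the generated map (consistent with the loss growing when the chance of fooling the discriminator is low), and the trade-off value 0.33 is assumed to scale the content loss. The names `content_loss`, `adversarial_loss`, `generator_loss` and `d_fake` are hypothetical.

```python
import math

ALPHA = 0.33  # trade-off value stated in the text; which term it scales is an assumption


def content_loss(pred, gt):
    """Per-pixel mean squared error between predicted and ground truth maps."""
    n = sum(len(row) for row in gt)
    return sum(
        (p - g) ** 2
        for row_p, row_g in zip(pred, gt)
        for p, g in zip(row_p, row_g)
    ) / n


def adversarial_loss(d_fake, eps=1e-12):
    """-log D(.) on a generated map: grows as the chance of fooling D falls."""
    return -math.log(max(d_fake, eps))


def generator_loss(pred, gt, d_fake, alpha=ALPHA):
    """Assumed combination: alpha * content loss + adversarial loss."""
    return alpha * content_loss(pred, gt) + adversarial_loss(d_fake)
```

Here `d_fake` stands for the discriminator's probability that the generated map is real; in a full training loop it would come from a forward pass of the discriminator.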
During adversarial training, we alternate between training the generator and the discriminator after each iteration (batch). Weight regularization (i.e. weight decay) is applied when training both the generator and the discriminator. AdaGrad with a preset initial learning rate is used for model optimization.
Loss Function in Stage-II. The parameters of the Q-network are initialized randomly. The agent is set to interact with the environment over multiple episodes, each representing a different training image. We also adopt an ε-greedy policy bellemare2013arcade to train the Q-network, which gradually shifts from exploration to exploitation according to the value of ε. During exploration, the agent selects actions randomly to observe different transitions and collect a varied set of experiences. During exploitation, the agent chooses actions according to the learned policy and learns from its own successes and mistakes.
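A minimal sketch of ε-greedy action selection is given below. The linear annealing schedule and its end value `eps_end=0.1` are assumptions made for illustration; the text only states that training shifts gradually from exploration to exploitation.

```python
import random


def epsilon_at(step, total_steps, eps_start=1.0, eps_end=0.1):
    """Linearly anneal epsilon from eps_start to eps_end (assumed schedule)."""
    if total_steps <= 0:
        return eps_end
    frac = min(1.0, max(0.0, step / total_steps))
    return eps_start + frac * (eps_end - eps_start)


def select_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the highest-valued action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With three actions (increase, decrease, terminate), `q_values` would be the three outputs of the Q-network for the current state.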
The use of a target network and experience replay lin1993reinforcement is the key ingredient in the success of the DQN algorithm. The target network, with parameters θ⁻, is copied from the online network at fixed step intervals and kept fixed on all other steps, so that its parameters equal the online parameters θ right after each copy. The target in DQN can be described by the following formulation:
A replay memory is used to store the experiences of past episodes, which allows one transition to be used in multiple model updates and breaks the short-term strong correlations between training samples. Each time a Q-learning update is applied, a mini-batch randomly sampled from the replay memory is used as the training samples. Given transition samples (s, a, r, s′), the update of the network weights at each iteration is as follows:
where a′ ranges over the actions that can be taken at state s′, α is the learning rate and γ is the discount factor.
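The target computation, replay memory and weight update can be sketched as follows. To stay self-contained, the sketch replaces the Q-network with a hypothetical linear Q-function (one weight vector per action), so the gradient reduces to the state vector. The values `GAMMA = 0.9` and `LEARNING_RATE = 1e-4` assume that the 0.9 and 0.0001 listed in the experimental settings are the discount factor and learning rate.

```python
import random
from collections import deque

GAMMA = 0.9           # discount factor (assumed to be the 0.9 listed in the experiments)
LEARNING_RATE = 1e-4  # learning rate (assumed to be the 0.0001 listed in the experiments)


def make_memory(capacity):
    """Replay memory that discards the oldest transitions once full."""
    return deque(maxlen=capacity)


def q_values(theta, state):
    """Linear stand-in for the Q-network: one weight vector per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]


def dqn_target(reward, next_state, done, theta_target, gamma=GAMMA):
    """Y = r if the episode ended, else r + gamma * max_a' Q(s', a'; theta_target)."""
    if done:
        return reward
    return reward + gamma * max(q_values(theta_target, next_state))


def q_update(theta, theta_target, batch, lr=LEARNING_RATE, gamma=GAMMA):
    """Apply theta <- theta + lr * (Y - Q(s, a; theta)) * grad Q for each sampled transition."""
    for state, action, reward, next_state, done in batch:
        target = dqn_target(reward, next_state, done, theta_target, gamma)
        td_error = target - q_values(theta, state)[action]
        # For a linear Q-function the gradient w.r.t. theta[action] is the state itself.
        theta[action] = [w + lr * td_error * x for w, x in zip(theta[action], state)]
    return theta


def sample_batch(memory, batch_size, rng=random):
    """Uniformly sample transitions from the replay memory."""
    return rng.sample(list(memory), min(batch_size, len(memory)))


def sync_target(theta):
    """Copy the online parameters into the target network."""
    return [row[:] for row in theta]
```

In a training loop, each transition `(state, action, reward, next_state, done)` is appended to the memory, `q_update` is called on a sampled mini-batch, and `sync_target` is called at fixed step intervals.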
The pseudo-code for training the quality estimation network can be found in Algorithm 1.
In this section, we validate the proposed approach on two public multi-modal saliency detection benchmarks, namely the RGB-Depth (RGBD) and RGB-Thermal (RGBT) salient object detection benchmarks. We first introduce the evaluation criteria and the datasets, and then analyse the experimental results on the RGBD and RGBT datasets. We also provide an ablation study of the components and an efficiency analysis.
| Method | RGB P | RGB R | RGB F | Depth P | Depth R | Depth F | RGB-D P | RGB-D R | RGB-D F | Code type |
|---|---|---|---|---|---|---|---|---|---|---|
| MST (CVPR2016) | 0.6856 | 0.5980 | 0.6312 | 0.5601 | 0.5178 | 0.5242 | 0.6415 | 0.6276 | 0.6103 | Wrapping code |
| HSaliency (CVPR2013) | 0.7048 | 0.4820 | 0.5891 | 0.3900 | 0.4547 | 0.3755 | 0.6479 | 0.4991 | 0.5487 | Wrapping code |
| Ours (Equal Weights) | 0.8407 | 0.8575 | 0.8339 | 0.8312 | 0.8521 | 0.8252 | 0.8480 | 0.8625 | 0.8362 | Theano+Lasagne |
| Ours (Adaptive Weights) | 0.8407 | 0.8575 | 0.8339 | 0.8312 | 0.8521 | 0.8252 | 0.8541 | 0.8596 | 0.8440 | Theano+Lasagne |
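| Method | RGB P | RGB R | RGB F | Thermal P | Thermal R | Thermal F | RGB-T P | RGB-T R | RGB-T F | Code type | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BR (ECCV2010) | 0.724 | 0.260 | 0.411 | 0.648 | 0.413 | 0.488 | 0.804 | 0.366 | 0.520 | Matlab & C++ | 8.23 |
| Ours (Equal Weights) | 0.8474 | 0.8453 | 0.8351 | 0.8321 | 0.8501 | 0.8251 | 0.8497 | 0.8595 | 0.8386 | Python | - |
| Ours (Adaptive Weights) | 0.8474 | 0.8453 | 0.8351 | 0.8321 | 0.8501 | 0.8251 | 0.8520 | 0.8591 | 0.8413 | Python | 5.88 |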
4.1 Evaluation Criteria and Dataset Description
For fair comparison, we fix all parameters and other settings of our approach in the experiments, and use the default parameters released in the public code of the other baseline methods. In our experiments, one parameter of our reward function is set to 2, and three further training parameters are set to 0.0001, 0.9 and 1.0, respectively.
For quantitative evaluation, we regard saliency detection as a classification problem and evaluate the results using two groups of evaluation criteria, i.e. Precision, Recall and F-measure (P, R, F for short), and MSE. The mathematical formulations of P, R and F can be described as follows:
where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. We set the weighting hyperparameter of the F-measure to 0.3 in all our experiments.
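The three criteria can be computed as in the sketch below, which uses the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and the weighted F-measure (1 + β²)·P·R / (β²·P + R). Treating the hyperparameter 0.3 as β² follows common practice in saliency evaluation and is an assumption here; the function name and the zero fallbacks for empty denominators are also illustrative choices.

```python
BETA_SQ = 0.3  # assumed to be the hyperparameter set to 0.3 in the text


def precision_recall_f(pred_mask, gt_mask, beta_sq=BETA_SQ):
    """Compute precision, recall and F-measure from two binary masks (nested lists of 0/1)."""
    tp = fp = fn = 0
    for row_p, row_g in zip(pred_mask, gt_mask):
        for p, g in zip(row_p, row_g):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta_sq * precision + recall
    f_measure = (1 + beta_sq) * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

A continuous saliency map would first be binarised with a threshold before being passed in as `pred_mask`.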
Given the ground truth saliency map and the predicted result, the mean squared error (MSE) can be written as:
We evaluate salient object detectors on two public saliency detection benchmarks, namely the RGBD benchmark (the NJU2000 dataset) Ju2014Depth and the RGBT benchmark Li2017A . The RGBD dataset consists of 2,000 stereo images, together with the corresponding depth maps and manually labeled ground truth. These images are collected from the Internet, 3D movies and photographs taken with a Fuji W3 stereo camera. The dataset authors perform mask labeling in a 3D display environment using Nvidia 3D Vision, because labeling results on 2D images may differ slightly from those in real 3D environments. The project page of this benchmark can be found at http://mcg.nju.edu.cn/publication/2014/icip14-jur/index.html .
To evaluate the generalization of our proposed quality-aware multi-modal deep saliency detection network, we also report the saliency detection performance on the RGBT benchmark. The newest RGBT benchmark, proposed by Li et al., includes 821 aligned RGB-T image pairs with annotated ground truth, and it also provides fine-grained annotations of 11 challenges, which allow researchers to analyse the challenge-sensitive performance of different algorithms. Moreover, its authors implement 3 kinds of baseline methods with different inputs (RGB, thermal and RGB-T) for evaluation. The detailed configuration of this benchmark can be found at http://chenglongli.cn/people/lcl/journals.html .
4.2 Compare with State-of-the-art Methods
We compare our proposed quality-aware multi-modal salient object detection network with 11 state-of-the-art saliency detectors on the RGBD saliency detection benchmark, comprising 6 traditional methods and 5 deep learning based approaches: RR Li2015Robust , MST Tu2016Real , BSCA Qin2015Saliency , DeepSaliency Li2015DeepSaliency , DRFT Wang2013Salient , DSS hou2016deeplycvpr , HSaliency Yan2013Hierarchical , MDF Li2015Visual , RBD Zhu2014Saliency , MCDL Zhao2015Saliency , MLNet mlnet2016 .
The baseline methods we compare against on the RGBT saliency detection dataset are adopted directly from that benchmark. The saliency detection performance of our proposed method and the other state-of-the-art detectors on the two benchmarks is discussed in the following subsections.
4.2.1 Results on RGB-Depth Dataset
We first report the Precision, Recall and F-measure of each method on the entire dataset, as shown in Table 2. The evaluation results show that the proposed method substantially outperforms all baseline approaches. This comparison clearly demonstrates the effectiveness of our approach for adaptively fusing color and depth images. Besides, the proposed quality-aware adaptively weighted RGB-D saliency results are better than the single-modal results. This demonstrates that depth images are complementary to RGB data and effective in boosting image saliency detection.
To give a more intuitive understanding of the saliency detection results, we plot the PR curves in Figure 2. The proposed method achieves better salient object detection results than the other state-of-the-art approaches. Example saliency detection results are shown in Figure 3.
4.2.2 Results on RGB-Thermal Dataset
To further validate the generality and effectiveness of our quality-aware deep multi-modal saliency detection network, we also conduct experiments on another multi-modal dataset, i.e. the RGB-Thermal dataset. We again report the Precision, Recall and F-measure values on this dataset. The detailed saliency detection results of our method and other state-of-the-art algorithms can be found in Table 3. Similar conclusions can be drawn from this dataset, and we do not reiterate them here.
4.3 Ablation Study
We discuss the details of our approach by analysing the main components and efficiency in this section.
Components Analysis. To justify the significance of the main components of the proposed approach, we implement two special versions for comparative analysis, including: 1) Ours-I, that removes the adversarial loss in the proposed network architecture, i.e. only the MSE loss used to train the network; 2) Ours-II, removes the modal weights and naively fuse the multi-modal data with equal contributions. Intuitively, Ours-I is designed to validate the effectiveness of adversarial training, and Our-II is implemented to check the validity of quality-aware deep Q-network which used to adaptively measure the quality of multi-modal data.
As the MSE results presented in Table 4, and we can summarize the following conclusions. 1) The complete algorithm achieves superior performance than Ours-I, validating the effectiveness of adversarial loss. 2) Our method substantially outperform Ours-II. This demonstrate the significance of the introduced quality-aware deep Q-network to achieve adaptive fusion of different source data. It is also worthy to note that the proposed quality-aware weighting mechanism is a general adaptive weighting framework and it can also be applied in many other related tasks, such as multi-modal visual tracking, multi-modal moving object detection or quality-aware procedure. We leave this for our future works.
Efficiency Analysis. Runtime of our approach against other methods are all presented in Table 3 (in the column FPS). The experiments are carried out on a desktop with an Intel I7 3.4GHz CPU, GTX1080 and 32 GB RAM, and our code is implemented based on the deep learning framework Theano 333http://deeplearning.net/software/theano/ and Lasagne 444http://lasagne.readthedocs.io/en/latest/. It is obviously that our method achieved better trade-off between detection accuracy and efficiency.
In this paper, we validated the effectiveness of our algorithm on the task of multi-modal saliency detection. Specifically speaking, only two modality are contained in our case, i.e. RGB-Thermal or RGB-Depth images. How to deal with more modalities with our method is also worthy to consider, for example, RGB-Thermal-Depth image pairs. As shown in Figure 4, we can adaptive weighting these modalities in a sequential manner. Another possible solution is that, we take these modalities as the input state, and output corresponding weights directly. We leave these ideas as our future works.
In this paper, we propose a novel quality-aware multi-modal saliency detection neural network using deep reinforcement learning. To the best of our knowledge, this is the first attempt to introduce deep reinforcement learning into the multi-modal saliency detection problem to handle the adaptive weighting of different modal data. Our network architecture follows the coarse-to-fine framework, that is to say, our pipeline consists of two sub-networks, i.e. a coarse single-modal saliency estimation network and an adaptive fusion Q-network. For each modality, we detect salient objects using an encoder-decoder network and train the network with a content loss and an adversarial loss. We treat the adaptive weighting of different data in the multi-modal case as a decision-making problem and teach the agent to learn a weighting policy through the interaction between the agent and the environment. It is also worth noting that our adaptive weighting mechanism is a general weighting method and can also be applied to other related tasks. Extensive experiments on RGBD and RGBT benchmarks validated the effectiveness of our proposed quality-aware deep multi-modal salient object detection network.