Visual saliency prediction is a fundamental vision problem that has been studied extensively for several decades [1, 2, 3]. With the proposal of comprehensive rules [4, 5], large training datasets [6, 7, 8] and deep learning algorithms [9, 10], the performance of saliency models has improved steadily. Meanwhile, many saliency-based attentive systems have achieved impressive performance in image recognition, video compression, content-based advertising, robot interaction and navigation.
Alongside the success of saliency models, an important concern has gradually emerged in the literature: to what extent can state-of-the-art saliency models capture the inherent characteristics of the human attention mechanism? To address this concern, this paper proposes to utilize aerial videos captured by drones. Whereas existing benchmark datasets mostly consist of ground-level images [7, 16] and videos [17, 18], a drone can observe the visual world from many different viewpoints, providing us with an opportunity to revisit the problem of visual saliency prediction from an aerial perspective. In particular, from these aerial videos, we wish to explore reliable answers to two questions in visual saliency estimation:
1. Do previous ground-level saliency models still work well on aerial videos?
2. How can ground-level knowledge be transferred to aerial platforms to develop an aerial saliency model?
To answer these two questions, we first collect a large number of aerial videos captured by drones and manually divide them into shots. For these shots, we conduct extensive eye-tracking experiments to collect the fixation data of 24 subjects in free-viewing conditions. By fusing these fixations, the ground-truth salient regions in aerial videos can be annotated without much ambiguity, and the corresponding ground-truth saliency maps can be used for model training and benchmarking. By testing the performance of ten ground-level saliency models, we find that these models still work impressively in capturing salient regions in aerial videos. However, large gaps remain between their predictions and the ground-truth maps. Therefore, it is necessary to explore the way that drones look and to develop a saliency model suited to the saliency prediction task on aerial videos.
Inspired by the crowdsourced annotation process of multiple subjects in eye-tracking experiments (see Fig. 1), this paper proposes a Crowdsourced Multi-path Network (CMNet) that provides a way to transfer knowledge from classic models to deep networks for visual saliency prediction in aerial videos. The system framework of our approach is shown in Fig. 2. In this framework, common low-level features are first extracted in a low-level module, which are then fed into a multi-path module. Considering that ground-level models still have impressive performance on aerial videos, we initialize each path under the supervision of a classic ground-level saliency model. After that, redundant paths are identified and removed via a path selection algorithm that jointly considers path diversity, representativeness and the overall complexity of the multi-path module. The selected paths then enter the fusion module, whose predictions are fused and fine-tuned on aerial videos. Finally, spatial saliency maps can be efficiently predicted, which are then adaptively fused with the temporal saliency predictions to obtain clean and accurate saliency maps for various types of aerial videos. Experimental results show that the proposed approach outperforms ten state-of-the-art saliency models on aerial videos.
The main contributions of this paper include: 1) We propose a large-scale video dataset for aerial saliency prediction. To the best of our knowledge, it is the first large-scale video dataset that can be used to study visual saliency/attention on drones; 2) We propose a Crowdsourced Multi-path Network (CMNet) that provides a way to transfer the knowledge from multiple classic models into a single deep model; 3) We propose an effective path selection algorithm, which can be used to balance the complexity and effectiveness of the multi-path network.
2 Related Works
In this section, we present a brief review of computational saliency models from three aspects: heuristic models, non-deep learning models and deep learning models.
2.1 Heuristic Saliency Models
A number of heuristic saliency models have been proposed in the past decades [1, 19, 20, 21, 22]. These models can be roughly categorized into bottom-up approaches and top-down approaches. The bottom-up approaches [23, 24] are stimulus-driven and infer saliency from the visual stimuli themselves with hand-crafted features (e.g., direction, color and intensity) and/or limited human knowledge (e.g., center-bias). For example, Fang et al. proposed a video saliency model that first detected spatial and temporal saliency and then fused them according to spatial compactness and temporal motion contrast. Later, Fang et al. proposed a saliency model for stereoscopic video that fused the spatial and temporal saliency maps through uncertainty weighting. Moreover, some researchers argued that whole objects can be perceived before individual features are analyzed. Based on this assumption, a proto-object based computational model was proposed that utilizes border ownership and group cells working together to extract the foreground targets (i.e., the figures). Similarly, Rueopas et al. proposed a bionic model that utilized mid-level features such as corners, line intersections and line endings to extract possible locations of figures in natural images.
In many scenarios, these bottom-up approaches achieve impressive performance because they inherently capture certain attention mechanisms of human beings. However, these models may have difficulty suppressing background distractors due to imperfect hand-crafted features or heuristic fusion strategies. To address this problem, some top-down approaches heuristically incorporate high-level factors into the saliency computation framework. For example, Borji et al. proposed a unified Bayesian approach that integrates the global context of a scene, previously attended locations and previous motor actions over time to predict the next attended locations. Chen et al. proposed a video saliency model that combined top-down saliency maps with bottom-up ones through point-wise multiplication. In fact, some studies have shown that top-down cues can boost the performance of bottom-up saliency models [31, 32].
With the assistance of high-level cues, these top-down approaches can achieve impressive performance in scenarios with high-level saliency cues (e.g., faces and vehicles). However, they may have difficulty processing images with rich content (e.g., a scene with many cars and pedestrians). By inspecting the representative failures of many bottom-up and top-down heuristic models, researchers find that most of them stem from imperfect hand-crafted features and heuristic fusion strategies. Therefore, non-deep learning saliency models were proposed to find optimal fusion strategies, while deep learning models were later proposed to learn optimal saliency features.
2.2 Non-Deep Learning Saliency Models
To learn the optimal fusion strategy, plenty of learning-based saliency models have been proposed [33, 34, 35, 36, 37]. For example, Vig et al. proposed a simple bottom-up saliency model for dynamic scenarios with the aim of keeping the number of salient regions to a minimum. Recently, Fang et al. proposed an image saliency model that learns a set of discriminative subspaces that perform best in popping out targets and suppressing distractors. Xu et al. developed a data-driven method for video saliency detection, in which a support vector machine was learnt to integrate High Efficiency Video Coding (HEVC) features. Li et al. proposed a saliency model that measured the joint visual surprise from intrinsic and extrinsic contexts. Thousands of Gaussian mixture models were learnt first, which were then used as extrinsic context to measure the patch surprise.
Beyond the low-level features, some researchers proposed to combine multi-level features in saliency models via machine learning [34, 40]. In one such model, the low-level, high-level and center-bias priors were fused to estimate image saliency, and the fusion weights were learnt from an eye-tracking dataset. Vig et al. proposed an automatic hyper-parameter optimization to efficiently guide a large-scale search for optimal features. Such multi-layer features were then combined with a simple linear classifier to obtain the final saliency. A saliency model for remote sensing images was also proposed, which can detect multi-class geo-spatial targets by integrating visual saliency modeling with the discriminative learning of a sparse coding dictionary. Zhang et al. proposed a co-saliency detection model that incorporated multi-instance learning and self-paced learning into a unified framework. This model formulated the co-saliency detection problem as a multi-instance learning paradigm, while the embedded self-paced learning paradigm further alleviated the data ambiguity.
Generally speaking, these non-deep learning models can achieve promising results since they transfer human knowledge from training data to new cases in a supervised manner. However, the hand-crafted features used in these models may be imperfect and thus inherently set an upper bound on their performance.
2.3 Deep Learning Saliency Models
Compared with heuristic and non-deep learning models, the greatest advantage of deep learning saliency models is their capability to learn feature representations [43, 44, 45]. For example, Kümmerer et al. presented a Convolutional Neural Network (CNN) that reused AlexNet to generate high-dimensional features. These features were then used in fixation prediction. Later, Kümmerer et al. proposed a saliency model built on the transfer learning technique. This model used the features from the VGG-19 network for saliency prediction with no additional fine-tuning. Pan et al. proposed two designs, a shallow network and a deeper network, that can be trained end-to-end for fixation prediction. Lahiri et al. proposed a saliency model that used a two-step learning strategy. The first step weakly pre-trained a saliency model on large-scale datasets (e.g., ImageNet) without recorded fixations, while the second step refined the pre-trained model on a limited number of images with ground-truth saliency maps. By pre-training the backbone network on massive images from ImageNet, the knowledge for the high-level object recognition task can be reused in the low-level saliency prediction task, leading to the impressive performance of deep saliency models.
Beyond these models, some deep models focus on designing specific architectures or loss functions for the saliency computation task. For example, Imamoglu et al. proposed a CNN-based saliency prediction model that utilized the objectness scores predicted by selected features from CNNs to detect attentive regions. Wang and Shen proposed a skip-layer network structure to capture hierarchical saliency information, from deep layers with global saliency information to shallow layers with local saliency responses. Liu et al. proposed a computational framework to simultaneously learn both bottom-up and top-down features from raw image data using a multi-resolution CNN. Bak et al. proposed a two-stream CNN that mimics the pathways in the brain and combines networks trained on spatial and temporal information to predict video saliency. Jetley et al. introduced a saliency model that adopted a set of loss functions to train a fully-convolutional architecture.
These deep learning models, benefiting from the powerful capability of CNNs to extract hierarchical feature representations, usually have high computational efficiency and impressive performance on ground-level images and/or videos. However, it is still unclear whether saliency models trained on ground-level images and/or videos can be reused on aerial platforms, since the visual patterns of many objects (e.g., pedestrians and vehicles) can change remarkably between ground-level and aerial viewpoints. Therefore, it is necessary to construct a large-scale aerial saliency dataset to benchmark ground-level saliency models.
3 The Aerial Video Saliency Dataset
In this section, we present a large-scale dataset for aerial video saliency. We also benchmark classic ground-level models to show the difference and correlation between aerial and ground-level saliency prediction.
3.1 Dataset Construction
To construct the dataset, we download hundreds of long aerial videos captured by drones from the Internet. We manually divide these long videos into shots and randomly sample 1,000 shots with a total length of 1.6 hours (i.e., 177,664 frames at 30 FPS). By inspecting the selected shots, we find that the dataset mainly covers videos from four genres: building, human, vehicle, and others (e.g., animal, boat and aircraft). As a result, the aerial video saliency dataset, denoted as AVS1K, contains four subsets that are denoted as AVS1K-B, AVS1K-H, AVS1K-V and AVS1K-O, respectively. Some dataset statistics can be found in Tab. I.
Tab. I: Dataset statistics of AVS1K (columns: Dataset, Video, Max Res, Frames, Avg (s)).
To annotate the ground-truth salient regions, we conduct extensive eye-tracking experiments with 24 subjects (20 males and 4 females, aged between 20 and 28). Note that each video is free-viewed by 17-20 randomly selected subjects. All subjects have normal or corrected-to-normal vision, and none of them has seen these videos before. In the experiments, the videos are displayed on a 22-inch color monitor with the resolution of . A chin rest is adopted to eliminate the error caused by head wobble and to fix the viewing distance to the monitor at 75 cm. Other experimental conditions such as illumination and noise are kept constant for all subjects.
In the eye-tracking experiments, we divide all the 1,000 videos into subgroups, each of which contains 30 videos (about 4 minutes per subgroup). In this way, each subject can take a long rest after free-viewing the videos in each subgroup, which makes the collected eye-tracking data more reliable. An eye-tracking apparatus (SMI RED 500) is used to record various types of eye movements (e.g., fixation, saccade and blink) at a sampling rate of 500 Hz. To annotate the salient regions, we only keep the fixations and ignore other types. Such fixations on a video are denoted as , where is a fixation with the coordinate and and the start time stamp .
Given the fixation data, we can compute a fixation density map for each frame to annotate the ground-truth salient regions, i.e., the salient regions that a drone should look at from the perspective of human-being. Let be a frame presented at time , we measure the fixation density map of , denoted as as in . The value of at pixel can be computed as
where and measure the spatial and temporal influences of the fixation to the pixel , respectively. By using an indicator function that equals 1 if and 0 otherwise, we only consider the influence of fixations in a short period after . Let be the coordinate of , the values of and can be computed as
where and are two constants to control the spatial and temporal influences of fixations, which are empirically set to 3% of video width (or video height if it is larger) and 0.1s, respectively. Some representative frames from AVS1K, the recorded fixations and the ground-truth saliency maps generated by Eq. (2) can be found in Fig. 3.
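As a rough illustration of how such a fixation density map could be computed, the sketch below assumes that both the spatial and the temporal influence are Gaussian falloffs and that the indicator keeps only fixations starting at or after the frame time. The function name `fixation_density_map`, the Gaussian form and the `(x, y, t_start)` tuple layout are assumptions made for illustration; only the two constants (3% of the larger frame dimension and 0.1 s) come from the text.

```python
import math


def fixation_density_map(fixations, frame_time, width, height,
                         sigma_s=None, sigma_t=0.1):
    """Accumulate fixation influences on one frame (illustrative sketch).

    fixations: list of (x, y, t_start) tuples, t_start in seconds.
    sigma_s defaults to 3% of the larger frame dimension, sigma_t to 0.1 s.
    The Gaussian influence functions are an assumption, not the paper's form.
    """
    if sigma_s is None:
        sigma_s = 0.03 * max(width, height)
    density = [[0.0] * width for _ in range(height)]
    for fx, fy, ft in fixations:
        dt = ft - frame_time
        if dt < 0:
            # Indicator: ignore fixations that started before this frame.
            continue
        temporal = math.exp(-dt * dt / (2.0 * sigma_t ** 2))
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                spatial = math.exp(-d2 / (2.0 * sigma_s ** 2))
                density[y][x] += spatial * temporal
    return density
```

Under these assumptions, a fixation that starts exactly at the frame time contributes a peak of 1 at its own coordinate, and its contribution fades both with distance and with the delay between the frame and the fixation.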
3.2 Comparison of Aerial/Ground-level Datasets
Given the dataset, we can measure the correlation and difference between the problems of aerial and ground-level saliency prediction. Toward this end, we first show representative frames from AVS1K and the latest large-scale ground-level video saliency dataset DHF1K in Fig. 4. From these frames, we find that aerial videos often have higher viewpoints, wider fields of vision and smaller targets. In other words, the visual patterns in aerial videos may be remarkably different from those on the ground. It is thus worth exploring the question of whether previous ground-level saliency models still work well on aerial videos.
To address this concern, we test ten classic ground-level saliency models on both AVS1K and DHF1K. These models include AIM, AWS, BMS, GB, HFT, ICL, IT, QDCT, SP and SUN. Note that these models are not learning-based and are thus less sensitive to dataset bias.
The models are evaluated with three metrics: the traditional Area Under the ROC Curve (AUC), the shuffled AUC (sAUC) and the Normalized Scanpath Saliency (NSS). AUC is computed by enumerating all possible thresholds to generate a ROC curve of true positive rate versus false positive rate, while sAUC takes the fixations shuffled from other frames as negatives in generating the curve. NSS measures the average response at ground-truth salient regions when the estimated saliency maps are normalized to zero mean and unit standard deviation. Here we adopt an implementation that efficiently computes NSS via element-wise multiplication of the estimated and ground-truth saliency maps. Typically, AUC may assign high scores to a fuzzy saliency map if it correctly predicts the order of salient and less-salient locations, while sAUC and NSS prefer clean saliency maps that only pop out the most salient locations and suppress all other regions.
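To make the two basic metrics concrete, a minimal sketch is given below. It assumes flattened saliency maps stored as plain Python lists and a binary fixation map of the same length; the function names `nss` and `auc` are illustrative and do not refer to any particular benchmark toolbox. sAUC would follow the same threshold sweep as `auc`, with negatives drawn from fixations of other frames.

```python
import math


def nss(saliency, fixations):
    """Normalized Scanpath Saliency on flattened maps.

    saliency: list of floats; fixations: list of 0/1 flags of equal length.
    The map is standardized, then averaged over fixated pixels via an
    element-wise product with the binary fixation map.
    """
    n = len(saliency)
    mean = sum(saliency) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in saliency) / n)
    if std == 0:
        return 0.0
    normalized = [(s - mean) / std for s in saliency]
    return sum(z * f for z, f in zip(normalized, fixations)) / sum(fixations)


def auc(saliency, fixations):
    """Area under the ROC curve obtained by sweeping every distinct threshold."""
    positives = [s for s, f in zip(saliency, fixations) if f]
    negatives = [s for s, f in zip(saliency, fixations) if not f]
    points = [(0.0, 0.0)]
    for threshold in sorted(set(saliency), reverse=True):
        tpr = sum(p >= threshold for p in positives) / len(positives)
        fpr = sum(q >= threshold for q in negatives) / len(negatives)
        points.append((fpr, tpr))
    # Trapezoidal integration of true positive rate over false positive rate.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A map that ranks every fixated pixel above every non-fixated one reaches an AUC of 1 regardless of how blurry it is, whereas NSS additionally rewards a large standardized response at the fixated pixels.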
Based on the three metrics, Tab. II shows the performance of ten classic models on AVS1K and DHF1K. From Tab. II, we find that the challenges of aerial videos are different from those of ground-level videos. The AUC scores of all models on AVS1K are lower than those on DHF1K, indicating that it is more difficult to correctly predict the saliency order of various locations. This may be caused by the fact that aerial videos often have wider fields of vision and thus contain richer content than ground-level videos. Surprisingly, sAUC and NSS, which focus on the saliency amplitude, achieve even higher scores on AVS1K than on DHF1K. This implies that the salient targets in aerial videos, which are usually very small, demonstrate an impressive capability to pop out from their surroundings when viewed from higher viewpoints. To sum up, aerial and ground-level videos pose different challenges in visual saliency estimation, which deserve further investigation to depict the inherent attentive mechanisms a drone should have so as to behave like a human being.
One more thing we can learn from Tab. II is that the ground-level models, with heuristically designed features and frameworks, still work impressively in detecting the salient regions of aerial videos. This shows the possibility to reuse the knowledge encoded in these ground-level models in aerial saliency prediction. Moreover, even for the best models such as HFT and SP, the performance scores on AVS1K are still far from perfect. The AUC scores of HFT and SP only reach 0.789 and 0.781 on AVS1K, respectively. Therefore, it is necessary to further study how to transfer the knowledge in these classic ground-level models to aerial platforms to develop an aerial saliency model.
4 The Crowdsourced Multi-path Network
In this section, we present CMNet, a network that absorbs the ground-level knowledge in classic saliency models and evolves to handle aerial video saliency as a human being does. Toward this end, CMNet contains multiple paths, each of which is initialized under the supervision of a classic ground-level model. The representative paths are then selected according to their representativeness and diversity so as to reduce the complexity of CMNet. After that, paths are fused and simultaneously fine-tuned on aerial videos to transfer ground-level knowledge to aerial saliency prediction. Finally, a spatiotemporal optimization algorithm is adopted to incorporate the influence of temporal saliency.
4.1 Path Initialization
As shown in Fig. 2, the proposed CMNet starts with a low-level module that consists of two convolution layers and one max pooling layer. By down-sampling the input frames to the same resolution, the low-level module outputs 64 feature maps with the resolution . These feature maps obtained after only two convolution layers contain many low-level preattentive features such as edges, corners, curves and colors, and such local features can be reused and shared in many higher level neural processes. In this study, we initialize the parameters of the low-level module with the first two convolution layers of VGG16.
Given these low-level features, ground-level models offer many ways to extract and fuse saliency cues from them. To make use of the knowledge in these models, we select the ten classic models tested in Section 3, each of which is used to supervise the initialization of one network path in Fig. 2. In the initialization, we first obtain the saliency maps of a ground-level model on the training set (500 videos) and validation set (250 videos) of AVS1K. These model-estimated saliency maps are then used as ground truth to fine-tune the convolution layers in each CMNet path. In this process, the learning rate of the low-level module is set to zero so that the parameters of each network path are updated independently. By minimizing the cross entropy loss between path outputs and ground-level model predictions, each path is forced to behave like a classic ground-level model so as to distill its knowledge of saliency prediction. In practice, each path consists of five convolution layers, one pooling layer, two Squeeze-and-Excitation (SE) blocks and one inception block with the capability of multi-scale perception.
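The supervision signal used in this initialization can be illustrated with a small sketch. It assumes a pixel-wise binary cross entropy between a path output and the saliency map of the supervising classic model, both flattened and scaled to [0, 1]; the exact form of the loss and the name `cross_entropy_loss` are assumptions for illustration.

```python
import math


def cross_entropy_loss(path_output, teacher_map, eps=1e-7):
    """Mean pixel-wise binary cross entropy (illustrative assumption).

    path_output: predicted saliency values in [0, 1] from one network path.
    teacher_map: saliency values in [0, 1] estimated by a classic model,
                 used here in place of human ground truth.
    """
    total = 0.0
    for p, q in zip(path_output, teacher_map):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
    return total / len(path_output)
```

For a fixed teacher map, this loss is smallest when the path output reproduces the teacher map pixel by pixel, which is what pushes each path to imitate its supervising model.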
After initializing the ten network paths under the supervision of ten classic ground-level models with heuristically designed saliency features and rules, CMNet inherently learns how to extract and fuse saliency-related features from different perspectives. The problem is, the knowledge encoded in the multiple paths of CMNet is highly redundant, and how to remove such redundancy to reduce model complexity is the next issue to be addressed.
4.2 Path Selection
To remove the path redundancy while maintaining the advantage of the crowdsourcing mechanism, the most straightforward way is to select a set of representative paths that depict visual saliency from several complementary perspectives. Toward this end, we propose a path selection algorithm that jointly considers path diversity, representativeness and the overall complexity of the multi-path module. These selection criteria are defined as follows:
Representativeness. The selected paths should represent all the unselected paths (i.e., high similarity).
Diversity. The selected paths should have high diversity.
Complexity. The number of selected paths should be small to reduce the network complexity.
To select a subset of paths from candidate paths ( in this study), we adopt a column vector with binary components. The th component of , denoted as , equals 1 if the th path is selected and 0 otherwise. By assuming that there are frames from the training and validation videos of AVS1K, we denote the saliency map predicted by the th path on the th frame as . As a result, the path selection process can be solved by optimizing
where denotes the number of non-zero components in and thus reflects the complexity of the multi-path module. The terms and denote the representativeness and diversity to be maximized, respectively. The is a weight parameter to balance the representativeness and diversity, which is empirically set to 0.2 (its influence on final results will be discussed in experiments).
The term is defined according to path similarities. That is, the unselected paths should be highly similar to selected ones that are considered to be representative. This term can be defined as
where is a small value to avoid dividing by zero. The term measures the similarity between the th and th paths that can be measured in a data-driven manner:
where and are the width and height of the input images, respectively. By resizing and to the input image resolution and normalizing them into probability distributions, measures the average histogram intersection between the saliency maps estimated by two paths. As shown in Fig. 5, many pairs of paths have high similarities. By maximizing , the similarity between selected and unselected paths can become very large, leading to a less-redundant multi-path module.
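The data-driven similarity between two paths can be sketched as follows. The sketch assumes that the saliency maps of both paths are already resized to a common resolution and flattened into lists, one per frame; the helper names `to_distribution` and `path_similarity` are illustrative.

```python
def to_distribution(saliency):
    """Normalize a flattened saliency map so that its values sum to one."""
    total = sum(saliency)
    if total == 0:
        return [1.0 / len(saliency)] * len(saliency)  # fall back to uniform
    return [s / total for s in saliency]


def path_similarity(maps_a, maps_b):
    """Average histogram intersection between two paths over all frames.

    maps_a, maps_b: lists of flattened saliency maps (one per frame),
    predicted by two different paths on the same frames.
    """
    score = 0.0
    for map_a, map_b in zip(maps_a, maps_b):
        dist_a = to_distribution(map_a)
        dist_b = to_distribution(map_b)
        score += sum(min(a, b) for a, b in zip(dist_a, dist_b))
    return score / len(maps_a)
```

With this definition the similarity lies in [0, 1]: two paths that always highlight the same pixels in the same proportions score 1, and two paths whose salient regions never overlap score 0.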
The representativeness term is defined between selected and unselected paths, while the diversity term is defined only on the selected ones and aims to maximize their difference
We can see that this term penalizes the co-selection of two highly similar paths.
By incorporating (4) and (6) into the optimization objective (3), we obtain a binary optimization problem with quadratic terms. Ideally, we can enumerate all the possible values of the binary vector when is small. When is large, we propose a greedy algorithm to solve (3). The algorithm first randomly selects half of the paths and then iteratively removes one selected path and adds one unselected path until the objective function stops increasing. Note that the greedy algorithm can be run several times, and the selected paths with the maximum objective function value are used. In this way, the complexity of the path selection algorithm can be greatly reduced. On synthetic similarity matrices, we find that the local optimum reached by the greedy algorithm approximates the global optimum well.
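The greedy search can be illustrated on a toy problem. The sketch below is only an interpretation: the `objective` function is a hypothetical stand-in that rewards representativeness and diversity and penalizes the number of selected paths (its exact form and the weight `mu` are assumptions, not objective (3)), and each step is allowed to remove one path, add one path, or swap one for another. An exhaustive search is included for comparison on small inputs.

```python
import itertools
import random


def objective(selected, sim, lam=0.2, mu=0.05):
    """Hypothetical score: representativeness + lam * diversity - mu * size."""
    chosen = sorted(selected)
    if not chosen:
        return float("-inf")
    rest = [j for j in range(len(sim)) if j not in selected]
    # Representativeness: each unselected path should resemble a selected one.
    rep = (sum(max(sim[i][j] for i in chosen) for j in rest) / len(rest)
           if rest else 1.0)
    # Diversity: selected paths should be dissimilar to each other.
    pairs = list(itertools.combinations(chosen, 2))
    div = (sum(1.0 - sim[i][j] for i, j in pairs) / len(pairs)
           if pairs else 1.0)
    return rep + lam * div - mu * len(chosen)


def neighbours(selected, count):
    """All subsets reachable by one removal, one addition, or one swap."""
    rest = [j for j in range(count) if j not in selected]
    for i in selected:
        yield selected - {i}
    for j in rest:
        yield selected | {j}
    for i in selected:
        for j in rest:
            yield (selected - {i}) | {j}


def greedy_select(sim, restarts=10, seed=0):
    """Local search from random half-size subsets; keep the best restart."""
    rng = random.Random(seed)
    count = len(sim)
    best, best_value = None, float("-inf")
    for _ in range(restarts):
        current = frozenset(rng.sample(range(count), count // 2))
        value = objective(current, sim)
        while True:
            candidate = max(neighbours(current, count),
                            key=lambda s: objective(s, sim))
            candidate_value = objective(candidate, sim)
            if candidate_value <= value:
                break  # the objective stopped increasing
            current, value = candidate, candidate_value
        if value > best_value:
            best, best_value = current, value
    return sorted(best), best_value


def exhaustive_select(sim):
    """Enumerate every non-empty subset; feasible only for few paths."""
    count = len(sim)
    subsets = (frozenset(c) for k in range(1, count + 1)
               for c in itertools.combinations(range(count), k))
    best = max(subsets, key=lambda s: objective(s, sim))
    return sorted(best), objective(best, sim)
```

On a small similarity matrix, calling `greedy_select(sim)` and `exhaustive_select(sim)` side by side gives a quick check of how close the local optimum is to the global one, since the greedy value can never exceed the exhaustive value.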
4.3 Spatial and Spatiotemporal Fusion
By solving (3) defined over the similarity matrix of ten paths, we obtain three representative paths (i.e., the paths pre-trained by IT, QDCT, and SUN under the given parameters). After that, these selected paths are fused with one concatenation layer, four convolution layers, one SE block and a deconvolution layer to output the spatial saliency maps. The overall structure of CMNet can be found in Fig. 6. In training CMNet, the parameters of the fusion module are randomly initialized and then optimized with a learning rate of , and a batch size of 4.
Beyond the spatial fusion of paths, another necessary fusion is that of the spatial and temporal saliency maps. Let be the spatial saliency map given by CMNet and be the temporal saliency map given by an existing temporal saliency model (e.g., ). Inspired by , we propose a spatiotemporal fusion framework that adaptively fuses and :
where is the refined saliency map, is the collaborative interaction of and , and is the selected spatial or temporal saliency map according to a heuristic rule. is a variant to balance and . We first compute the spatial-to-temporal consistency and temporal-to-spatial consistency scores :
where is the entropy function and indicates the per-pixel multiplication. We can see that the spatial-to-temporal consistency will be higher than the temporal-to-spatial consistency if the temporal saliency map is cleaner, and vice versa. As a result, the collaboration interaction map can be computed by emphasizing the cleaner map:
The selection map serves as a compensation for . Let and be the average weighted distances of salient pixels to their gravity centers in and , respectively. The is defined as the map with more compact salient regions:
Intuitively, we can trust if the spatial and temporal saliency maps are highly consistent. If not, we can select the most compact map as the final prediction. Let be the average weighted distances of salient pixels to their gravity center in , the parameter can be computed as
where is a predefined weight that is empirically set to 2.1 (its influence will be discussed in experiments). By incorporating (9), (10) and (11) into (7), we can adaptively fuse the spatial and temporal saliency maps.
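The compactness measure used to choose between the spatial and temporal maps can be sketched as follows. The sketch assumes that the gravity center is the saliency-weighted centroid and that distances are Euclidean and weighted by the saliency values; the names `weighted_spread` and `select_compact` are illustrative.

```python
import math


def weighted_spread(saliency):
    """Average saliency-weighted distance of pixels to the gravity center.

    saliency: 2D list of non-negative values (rows of a saliency map).
    Smaller values indicate more compact salient regions.
    """
    total = sum(sum(row) for row in saliency)
    if total == 0:
        return 0.0
    center_x = sum(x * v for row in saliency for x, v in enumerate(row)) / total
    center_y = sum(y * v for y, row in enumerate(saliency) for v in row) / total
    distance = sum(v * math.hypot(x - center_x, y - center_y)
                   for y, row in enumerate(saliency)
                   for x, v in enumerate(row))
    return distance / total


def select_compact(spatial, temporal):
    """Return whichever map has the more compact salient regions."""
    if weighted_spread(spatial) <= weighted_spread(temporal):
        return spatial
    return temporal
```

A map with a single salient blob has a spread close to zero, while a map whose saliency is scattered across the frame has a large spread and is therefore not selected.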
5 Experiments
In this section, experiments are conducted to demonstrate the challenges of AVS1K and the effectiveness of CMNet. We first introduce the experimental setting and then benchmark the state-of-the-art models, followed by a detailed discussion.
5.1 Experimental Setting
In the experiments, we test our approach on our aerial video saliency dataset AVS1K as well as the ground-level video saliency dataset DHF1K. DHF1K is currently the largest ground-level video saliency dataset and spans a large range of scenarios, viewpoints, motions, object types and background complexity. It is divided into a training set, a validation set and a test set, which contain 600, 100 and 300 videos, respectively.
On these two datasets, we compare CMNet and its variants CMNet+ (with spatiotemporal refinement) and CMNet- (without spatiotemporal refinement and the supervision of classic ground-level models) with ten state-of-the-art models. These models can be roughly categorized into three groups, including:
3) The Deep Learning Group (DL Group) contains five deep learning models, including eDN , iSEEL , SalNet , DVA  and STS . Among these models, iSEEL and eDN are built on pre-extracted features and thus cannot be re-trained. For the other three models, we fine-tune them on the two datasets and use a mark to indicate the retrained model.
In the comparisons, we adopt five evaluation metrics: the aforementioned AUC, sAUC and NSS, as well as the Similarity Metric (SIM) and the Correlation Coefficient (CC). SIM measures the similarity of two saliency maps viewed as probability distributions, while CC is computed as the linear correlation between the estimated and ground-truth saliency maps. Note that for all five metrics, higher values indicate better performance. In the comparisons, we resize all saliency maps to the original resolution of the video.
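The two additional metrics can be sketched in a few lines. The sketch assumes flattened, non-negative saliency maps of equal length with non-zero sums and non-constant values; the function names `similarity_metric` and `correlation_coefficient` are illustrative.

```python
import math


def similarity_metric(predicted, ground_truth):
    """SIM: sum of pixel-wise minima after normalizing both maps to sum to one."""
    sum_p, sum_g = sum(predicted), sum(ground_truth)
    return sum(min(p / sum_p, g / sum_g)
               for p, g in zip(predicted, ground_truth))


def correlation_coefficient(predicted, ground_truth):
    """CC: Pearson linear correlation between two flattened saliency maps."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_g = sum(ground_truth) / n
    cov = sum((p - mean_p) * (g - mean_g)
              for p, g in zip(predicted, ground_truth))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_g = sum((g - mean_g) ** 2 for g in ground_truth)
    return cov / math.sqrt(var_p * var_g)
```

SIM ranges from 0 (no overlap) to 1 (identical distributions), and CC ranges from -1 to 1, so both grow as the prediction approaches the ground truth.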
5.2 Comparison with the State-of-the-art Models
The performance of 13 state-of-the-art models on the AVS1K dataset, including three heuristic models, two non-deep learning models and eight deep learning models, is shown in Tab. III. Moreover, the Receiver Operating Characteristic (ROC) curves [77, 78] are given in Fig. 7. Some representative results of these models are shown in Fig. 8.
From Tab. III, we find that our fundamental multi-path network CMNet-, in which the spatiotemporal refinement and the supervision of classic ground-level models are not used, still outperforms the other ten state-of-the-art models in terms of NSS and CC and ranks second in terms of AUC (worse than DVA), sAUC (worse than SalNet) and SIM (worse than DVA). Note that NSS is the primary metric recommended by many surveys on saliency evaluation metrics [52, 79]. The impressive performance of CMNet- can be explained by its multi-path structure. The low-level module of CMNet- can extract many low-level preattentive features, based on which the multi-path module can further extract saliency cues at higher levels from different perspectives. These high-level saliency cues are then fused to obtain the final saliency map. In this way, CMNet- has better representation capability than traditional single-path networks (such as SalNet) and classic two-stream networks for video (such as STS). Although DVA also adopts a multi-stream structure that feeds supervision directly into multiple layers, CMNet- still performs better than DVA in terms of sAUC, NSS and CC.
In Tab. III, we also find that CMNet outperforms CMNet- in terms of all metrics except AUC. This can be explained by the crowdsourced strategy adopted in training multiple paths. After initializing different network paths under the supervision of selected ground-level models with heuristically designed saliency features and rules, CMNet inherently learns how to extract and fuse saliency-related features from both the aerial and ground-level perspectives. In addition, the path selection algorithm ensures the adopted ground-level “perspectives” are representative. In this manner, the biases of classic ground-level models can be further investigated and utilized to improve the saliency prediction accuracy. As a result, the effectiveness of the crowdsourced strategy can be well justified.
From Tab. III, we also observe that CMNet+ outperforms all the other models in terms of all metrics except SIM. This may be attributed to the proposed spatiotemporal optimization algorithm. Based on the mutual consistency and weighted spatiotemporal saliency, the optimization algorithm measures the confidence of both the spatial and the temporal saliency maps to incorporate the influence of motion information into the final saliency. In this manner, the optimization algorithm tends to generate cleaner saliency maps with more compact salient regions (see Fig. 8). As a result, the most salient locations can pop out in the saliency maps predicted by CMNet+, leading to high NSS and sAUC scores.
Beyond the performance of CMNet and its variants, we also find that the heuristic models in the H Group perform worse than the models in the NL Group and the DL Group. The models in the H Group usually rely on low-level hand-crafted features and predefined rules for feature fusion, which makes it a great challenge for them to handle unknown aerial scenarios. By replacing the predefined feature fusion rules with learned fusion strategies, models in the NL Group become slightly better but are still far from satisfactory. The key issue here is that hand-crafted features designed for ground-level scenarios may no longer be suitable for aerial scenarios. In other words, there may exist many irregular salient visual patterns in aerial videos, which should be learned from data. This also explains the impressive performance of models in the DL Group, since they benefit from the powerful capability of CNNs in extracting hierarchical feature representations.
From these results, we can answer the question of how drones look. When the application scenario shifts from ground level to aerial, the salient visual patterns, as well as the feature fusion strategies, may become remarkably different. As a result, it is necessary to learn the saliency cues and their fusion strategies that best characterize the salient visual patterns from the aerial perspective. In addition, there exist some inherent correlations between ground-level and aerial scenarios, implying that a drone can also benefit from ground-level knowledge in learning how to look. Indeed, far more annotated data exist for ground-level platforms than for aerial ones. By transferring the ground-level knowledge to aerial platforms, a drone can gain a better capability of handling various visual patterns, leading to better performance.
5.3 Performance Analysis
In this section, we conduct several experiments to analyze the performance of CMNet+ (and CMNet) from multiple perspectives, including parameter influences, generalization ability and its performance on the four subsets of AVS1K.
In the first experiment, we analyze the parameter in (3) that balances representativeness and diversity in CMNet+. The NSS curve of CMNet+ on AVS1K under different values of this parameter is shown in Fig. 9. From Fig. 9, we find that when the parameter falls in [0, 0.24], CMNet+ achieves the best performance (NSS=2.133) with three representative paths (IT, QDCT and SUN). As the parameter grows, the number of selected paths decreases. When it falls in [0.26, 0.56], CMNet+ has lower complexity (only two selected paths, BMS and IT) but also the lowest performance (NSS=2.035). When it grows larger, two paths are still selected, but they may be supervised by two different models. For example, when it falls in [0.58, 0.68], CMNet+ selects AIM and IT with NSS=2.093. When it falls in [0.70, 1.00], CMNet+ selects IT and SUN with NSS=2.093. To sum up, over a wide range of this parameter, the path selection algorithm tends to select two or three paths to reduce the model complexity. Therefore, we use a fixed setting of this parameter in all experiments, pursuing better performance at an acceptable complexity.
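The sweep described above can be summarized as a small lookup structure. The sketch below only transcribes the ranges, selected paths and NSS scores reported in this paragraph; the names `SweepRange`, `REPORTED`, `best_range` and `lookup` are hypothetical, and the code does not reproduce the path selection algorithm itself.

```python
from typing import List, NamedTuple, Optional, Tuple


class SweepRange(NamedTuple):
    low: float               # lower end of the parameter range
    high: float              # upper end of the parameter range
    paths: Tuple[str, ...]   # ground-level models supervising the paths
    nss: float               # NSS of CMNet+ on AVS1K in this range


# Values transcribed from the paragraph above (read from Fig. 9).
REPORTED: List[SweepRange] = [
    SweepRange(0.00, 0.24, ("IT", "QDCT", "SUN"), 2.133),
    SweepRange(0.26, 0.56, ("BMS", "IT"), 2.035),
    SweepRange(0.58, 0.68, ("AIM", "IT"), 2.093),
    SweepRange(0.70, 1.00, ("IT", "SUN"), 2.093),
]


def best_range(ranges: List[SweepRange]) -> SweepRange:
    """Return the range with the highest reported NSS."""
    return max(ranges, key=lambda r: r.nss)


def lookup(value: float, ranges: List[SweepRange]) -> Optional[SweepRange]:
    """Return the reported range containing `value`, or None if the value
    lies between two reported ranges."""
    for r in ranges:
        if r.low <= value <= r.high:
            return r
    return None


best = best_range(REPORTED)
```

Laid out this way, the trade-off is easy to read: the three-path configuration gives the highest NSS, while every two-path configuration trades some accuracy for lower complexity.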
In the second experiment, we analyze the parameter in (11), which serves as a threshold and thereby balances the fusion of spatial and temporal saliency maps in CMNet+. The curves of the AUC, sAUC, NSS, SIM and CC scores on AVS1K under different values of this parameter are shown in Fig. 10. We find that the AUC, NSS and CC curves are convex, the sAUC curve is monotonically increasing, and the SIM curve is monotonically decreasing. The overall performance is generally stable when the parameter falls in [1.8, 2.3]. With a small value, many saliency maps are refined with the thresholded term in (11) equal to zero, implying that the spatial and temporal saliency maps cannot be adaptively fused in (7). On the contrary, a large value may generate a non-zero term in most cases and thus lead to noisy saliency maps due to the additive fusion strategy of (7). Therefore, we use a fixed setting of this parameter in all experiments.
In the third experiment, we compare CMNet with DVA on the ground-level video saliency dataset DHF1K. The main objective of this experiment is to verify the generalization ability of CMNet on ground-level scenarios. Note that the DVA model, after being fine-tuned, has the best performance among the ten models from the three groups on AVS1K (see Tab. III), and we omit the spatiotemporal refinement step of our approach for a fair comparison. Quantitative results of these two models, after being fine-tuned on DHF1K, are shown in Tab. IV. We observe that the proposed CMNet outperforms DVA on DHF1K. This supports the generalization ability of CMNet, implying that the multi-path network architecture can be used for saliency prediction in both aerial and ground-level scenarios.
Furthermore, to verify the performance of CMNet+ on different scenarios of aerial videos, we show its performance on the four subsets of AVS1K in Tab. V. From this table, we observe that CMNet+ performs relatively better on AVS1K-H and AVS1K-V than on AVS1K-B and AVS1K-O. In most aerial videos, humans and vehicles usually have relatively appropriate sizes and significant motions, which make them easier to pop out from the local context. On the contrary, buildings are static and usually large, making both the spatial and the temporal saliency prediction very challenging. Similarly, the subset AVS1K-O contains many diversified scenarios involving planes, boats and animals. In these scenarios, the appearances and motion patterns of salient targets may change remarkably, making it difficult to separate them from the distractors.
In this work, we build a large-scale video dataset for aerial saliency prediction. It contains 1,000 videos, and can be categorized into four subsets: building, human, vehicle and others. Based on this dataset, we propose a crowdsourced multi-path network (CMNet) for aerial saliency prediction, which simulates the process of eye-tracking experiments using a multi-path network structure and a crowdsourced training strategy to transfer human knowledge from ground-level models into the network paths. A spatiotemporal optimization algorithm is also proposed to fuse the spatial and temporal saliency maps. Experimental results demonstrate the superior performance of our proposed models with respect to other state-of-the-art saliency models.
In future work, we will explore the feasibility of simultaneously learning from both ground-level and aerial data. Ground-level object detectors will also be explored to facilitate the rough localization of large and static buildings, as well as animals and boats with high appearance/motion variance.
This work was partially supported by grants from the National Natural Science Foundation of China (61672072, U1611461), the Beijing Nova Program (Z181100006218063), and the Fundamental Research Funds for the Central Universities.
-  L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene analysis,” IEEE Transactions on pattern analysis and machine intelligence, vol. 20, no. 11, pp. 1254–1259, 1998.
-  A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 185–207, 2012.
-  T. V. Nguyen, Q. Zhao, and S. Yan, “Attentive systems: A survey,” International Journal of Computer Vision, no. 11, pp. 1–25, 2017.
-  S. Goferman, L. Zelnik-Manor, and A. Tal, “Context-aware saliency detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 10, p. 1915, 2012.
-  J. Li, L. Y. Duan, X. Chen, T. Huang, and Y. Tian, “Finding the secret of image saliency in the frequency domain,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 12, p. 2428, 2015.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
-  A. Borji and L. Itti, “Cat2000: A large scale fixation dataset for boosting saliency research,” CVPR 2015 workshop on ”Future of Datasets”, 2015, arXiv preprint arXiv:1505.03581.
-  S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
-  J. Pan, E. Sayrol, X. Giro-i-Nieto, K. McGuinness, and N. E. O'Connor, “Shallow and deep convolutional networks for saliency prediction,” pp. 598–606, 2016.
-  S. Fang, J. Li, Y. Tian, T. Huang, and X. Chen, “Learning discriminative subspaces on random contrasts for image saliency analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 5, pp. 1095–1108, 2017.
-  R. Gupta, M. T. Khanna, and S. Chaudhury, “Visual saliency guided video compression algorithm,” Signal Processing Image Communication, vol. 28, no. 9, pp. 1006–1022, 2013.
-  C. Shen and Q. Zhao, Webpage Saliency. Springer International Publishing, 2014.
C. Muhl, Y. Nagai, and G. Sagerer, “On constructing a communicative space in
German Conference on Advances in Artificial Intelligence, 2007, pp. 264–278.
-  D. W. Gage, “Saliency detection and model-based tracking: a two part vision system for small robot navigation in forested environment,” Proc Spie, vol. 8387, no. 15, p. 27, 2012.
-  S. Symons and K. Nieselt, “Mgv: a generic graph viewer for comparative omics data,” Bioinformatics, vol. 27, no. 16, pp. 2248–55, 2011.
-  Y. Fang, J. Wang, J. Li, R. Pépion, and P. L. Callet, “An eye tracking database for stereoscopic video,” in International Workshop on Quality of Multimedia Experience, 2014, pp. 51–52.
-  W. Wang, J. Shen, and L. Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2018.
-  L. Itti, “Automatic foveation for video compression using a neurobiological model of visual attention,” IEEE Transactions on Image Processing A Publication of the IEEE Signal Processing Society, vol. 13, no. 10, p. 1304, 2004.
-  H. J. Seo and P. Milanfar, “Static and space-time visual saliency detection by self-resemblance,” Journal of vision, vol. 9, no. 12, pp. 15–15, 2009.
-  P. Jiang, H. Ling, J. Yu, and J. Peng, “Salient region detection by ufo: Uniqueness, focusness and objectness,” in Proceedings of the IEEE international conference on computer vision, 2013, pp. 1976–1983.
-  N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu, “Saliency detection on light field,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2806–2813.
-  Y. Zhang, Z. Mao, J. Li, and Q. Tian, “Salient region detection for complex background images using integrated features,” Information Sciences, vol. 281, pp. 586–600, 2014.
-  L. Zhang, Y. Xia, R. Ji, and X. Li, “Spatial-aware object-level saliency prediction by learning graphlet hierarchies,” IEEE Transactions on Industrial Electronics, vol. 62, no. 2, pp. 1301–1308, 2015.
-  Y. Fang, W. Lin, Z. Chen, C.-M. Tsai, and C.-W. Lin, “A video saliency detection model in compressed domain,” IEEE transactions on circuits and systems for video technology, vol. 24, no. 1, pp. 27–38, 2014.
-  Y. Fang, C. Zhang, J. Li, J. Lei, M. P. Da Silva, and P. Le Callet, “Visual attention modeling for stereoscopic video: a benchmark and computational model,” IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4684–4696, 2017.
-  A. F. Russell, S. Mihalaş, R. von der Heydt, E. Niebur, and R. Etienne-Cummings, “A model of proto-object based saliency,” Vision Research, vol. 94, pp. 1–15, 2014.
-  W. Rueopas, S. Leelhapantu, and T. H. Chalidabhongse, “A corner-based saliency model,” in Computer Science and Software Engineering (JCSSE), 2016 13th International Joint Conference on. IEEE, 2016, pp. 1–6.
-  A. Borji, D. N. Sihite, and L. Itti, “Probabilistic learning of task-specific visual attention,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 470–477.
-  Y. Chen, G. Tao, Q. Xie, and M. Song, “Video attention prediction using gaze saliency,” Multimedia Tools and Applications, pp. 1–18, 2016.
-  L. Zhang, R. Hong, Y. Gao, R. Ji, Q. Dai, and X. Li, “Image categorization by learning a propagated graphlet path,” IEEE transactions on neural networks and learning systems, vol. 27, no. 3, pp. 674–685, 2016.
-  J. Sun, P. Wang, Y.-K. Luo, G.-M. Hao, and H. Qiao, “Precision work-piece detection and measurement combining top-down and bottom-up saliency,” International Journal of Automation and Computing, May 2018. [Online]. Available: https://doi.org/10.1007/s11633-018-1123-1
-  J. Li, Y. Tian, T. Huang, and W. Gao, “Probabilistic multi-task learning for visual saliency estimation in video,” International journal of computer vision, vol. 90, no. 2, pp. 150–165, 2010.
-  W.-F. Lee, T.-H. Huang, S.-L. Yeh, and H. H. Chen, “Learning-based prediction of visual attention for video signals,” IEEE Transactions on Image Processing, vol. 20, no. 11, pp. 3028–3038, 2011.
-  E. Vig, M. Dorr, and E. Barth, “Efficient visual coding and the predictability of eye movements on natural movies,” Spatial Vision, vol. 22, no. 5, pp. 397–408, 2009.
-  E. Vig, M. Dorr, T. Martinetz, and E. Barth, “Intrinsic dimensionality predicts the saliency of natural dynamic scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 6, pp. 1080–1091, 2012.
-  E. Vig, M. Dorr, and D. Cox, “Large-scale optimization of hierarchical features for saliency prediction in natural images,” in Computer Vision and Pattern Recognition, 2014, pp. 2798–2805.
-  M. Xu, L. Jiang, X. Sun, Z. Ye, and Z. Wang, “Learning to detect video saliency with hevc features,” IEEE Transactions on Image Processing, vol. 26, no. 1, pp. 369–385, 2017.
-  J. Li, Y. Tian, X. Chen, and T. Huang, “Measuring visual surprise jointly from intrinsic and extrinsic contexts for image saliency estimation,” International Journal of Computer Vision, vol. 120, no. 1, pp. 44–60, 2016.
-  M. Song, C. Chen, S. Wang, and Y. Yang, “Low-level and high-level prior learning for visual saliency estimation,” Information Sciences, vol. 281, pp. 573–585, 2014.
-  J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, S. Bu, and J. Wu, “Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 89, pp. 37–48, 2014.
-  D. Zhang, D. Meng, and J. Han, “Co-saliency detection via a self-paced multiple-instance learning framework,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 5, pp. 865–878, 2017.
-  P. Hu, B. Shuai, J. Liu, and G. Wang, “Deep level sets for salient object detection,” in CVPR, vol. 1, 2017, p. 2.
-  J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3668–3677.
-  X. Li, L. Zhao, L. Wei, M.-H. Yang, F. Wu, Y. Zhuang, H. Ling, and J. Wang, “Deepsaliency: Multi-task deep neural network model for salient object detection,” IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3919–3930, 2016.
-  M. Kümmerer, L. Theis, and M. Bethge, “Deep gaze i: Boosting saliency prediction with feature maps trained on imagenet,” arXiv preprint arXiv:1411.1045, 2014.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  M. Kümmerer, T. S. Wallis, and M. Bethge, “Deepgaze ii: Reading fixations from deep features trained on object recognition,” arXiv preprint arXiv:1610.01563, 2016.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  A. Lahiri, S. Roy, A. Santara, P. Mitra, and P. K. Biswas, “Wepsam: Weakly pre-learnt saliency model,” arXiv preprint arXiv:1605.01101, 2016.
-  N. Imamoglu, C. Zhang, W. Shmoda, Y. Fang, and B. Shi, “Saliency detection by forward and backward cues in deep-cnn,” in Image Processing (ICIP), 2017 IEEE International Conference on. IEEE, 2017, pp. 430–434.
-  W. Wang and J. Shen, “Deep visual attention prediction,” IEEE Transactions on Image Processing, vol. 27, no. 5, pp. 2368–2378, 2018.
-  N. Liu, J. Han, D. Zhang, S. Wen, and T. Liu, “Predicting eye fixations using convolutional neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on. IEEE, 2015, pp. 362–370.
-  Ç. Bak, A. Erdem, and E. Erdem, “Two-stream convolutional networks for dynamic saliency prediction,” arXiv preprint arXiv:1607.04730, 2016.
-  S. Jetley, N. Murray, and E. Vig, “End-to-end saliency mapping via probability distribution prediction,” Proceedings of Computer Vision and Pattern Recognition 2016, pp. 5753–5761, 2016.
-  J. Li, C. Xia, and X. Chen, “A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 349–364, 2018.
-  W. Wang, J. Shen, F. Guo, M.-M. Cheng, and A. Borji, “Revisiting video saliency: A large-scale benchmark and a new model,” arXiv preprint arXiv:1801.07424, 2018.
-  N. Bruce and J. Tsotsos, “Attention based on information maximization,” Journal of Vision, vol. 7, no. 9, pp. 950–950, 2007.
-  A. Garcia-Diaz, V. Leboran, X. R. Fdez-Vidal, and X. M. Pardo, “On the relationship between optical variability, visual saliency, and eye fixations: A computational approach,” Journal of vision, vol. 12, no. 6, pp. 17–17, 2012.
-  J. Zhang and S. Sclaroff, “Saliency detection: A boolean map approach,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 153–160.
-  J. Harel, C. Koch, and P. Perona, “Graph-based visual saliency,” in Advances in neural information processing systems, 2007, pp. 545–552.
-  J. Li, M. D. Levine, X. An, X. Xu, and H. He, “Visual saliency based on scale-space analysis in the frequency domain,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 4, pp. 996–1010, 2013.
-  X. Hou and L. Zhang, “Dynamic visual attention: Searching for coding length increments,” in Advances in neural information processing systems, 2009, pp. 681–688.
-  B. Schauerte and R. Stiefelhagen, “Quaternion-based spectral saliency detection for eye fixation prediction,” in Computer Vision–ECCV 2012. Springer, 2012, pp. 116–129.
-  J. Li, Y. Tian, and T. Huang, “Visual saliency with statistical priors,” International journal of computer vision, vol. 107, no. 3, pp. 239–253, 2014.
-  L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, “Sun: A bayesian framework for saliency using natural statistics,” Journal of vision, vol. 8, no. 7, pp. 32–32, 2008.
-  N. Riche, M. Duvinage, M. Mancas, B. Gosselin, and T. Dutoit, “Saliency and human fixations: state-of-the-art and study of comparison metrics,” in Computer Vision (ICCV), 2013 IEEE International Conference on. IEEE, 2013, pp. 1153–1160.
-  J. Li, C. Xia, Y. Song, S. Fang, and X. Chen, “A data-driven metric for comprehensive evaluation of saliency models,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 190–198.
-  Z. Bylinskii, T. Judd, A. Oliva, A. Torralba, and F. Durand, “What do different evaluation metrics tell us about saliency models?” arXiv preprint arXiv:1604.03605, 2016.
-  S. Marat, T. H. Phuoc, L. Granjon, N. Guyader, D. Pellerin, and A. Guérin-Dugué, “Modelling spatio-temporal saliency to predict gaze direction for short videos,” International journal of computer vision, vol. 82, no. 3, p. 231, 2009.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
-  Z. Liu, X. Zhang, S. Luo, and O. Le Meur, “Superpixel-based spatiotemporal saliency detection,” IEEE transactions on circuits and systems for video technology, vol. 24, no. 9, pp. 1522–1540, 2014.
-  H. R. Tavakoli, A. Borji, J. Laaksonen, and E. Rahtu, “Exploiting inter-image similarity and ensemble of extreme learners for fixation prediction using deep features,” Neurocomputing, vol. 244, pp. 10–18, 2017.
-  C. Bak, A. Kocak, E. Erdem, and A. Erdem, “Spatio-temporal saliency networks for dynamic saliency prediction,” IEEE Transactions on Multimedia, 2017.
-  W. Hou, X. Gao, D. Tao, and X. Li, “Visual saliency detection using information divergence,” Pattern Recognition, vol. 46, no. 10, pp. 2658–2669, 2013.
-  A. Borji, “Boosting bottom-up and top-down visual features for saliency estimation,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 438–445.
-  K. Fukunaga, Introduction to statistical pattern recognition. Academic press, 2013.
-  C. W. Therrien and C. W. Therrien, Decision, estimation, and classification: an introduction to pattern recognition and related topics. Wiley New York, 1989.
-  N. Liu and J. Han, “A deep spatial contextual long-term recurrent convolutional network for saliency detection,” arXiv preprint arXiv:1610.01708, 2016.