Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While the visual information from overhead images, combined with powerful models and efficient algorithms, yields good performance on scene recognition, additional information is always a bonus, for example, the corresponding audio information. In this paper, to improve the performance of aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit knowledge from sound events to improve the performance of aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for aerial scene recognition. The source code is publicly available for reproducibility purposes: https://github.com/DTaoo/Multimodal-Aerial-Scene-Recognition.
Scene recognition is a longstanding, hallmark problem in the field of computer vision, and it refers to assigning a scene-level label to an image based on its overall contents. Most scene recognition approaches in the community make use of ground images and have achieved remarkable performance. By contrast, overhead images usually cover larger geographical areas and are capable of offering more comprehensive information from a bird’s eye view than ground images. Hence aerial scene recognition has received increased interest. The success of current state-of-the-art aerial scene understanding models can be attributed to the development of novel convolutional neural networks (CNNs) that aim at learning good visual representations from images.
Albeit successful, these models may not work well in some cases, particularly when they are directly used in worldwide applications, suffering from pervasive factors such as different remote imaging sensors, lighting conditions, orientations, and seasonal variations. A study in neurobiology reveals that human perception usually benefits from the integration of both visual and auditory knowledge. Inspired by this finding, we argue that the soundscapes of aerial scenes are partially free of the aforementioned factors and can be a helpful cue for identifying scene categories (Fig. 1). This is based on the observation that the visual appearance of an aerial scene and its soundscape are closely connected. For instance, sound events like broadcasting, people talking, and perhaps whistling are likely to be heard in all train stations in the world, while cheering and shouting are expected to be heard in most sports venues. However, incorporating sound knowledge into a visual aerial scene recognition model and assessing its contribution to this task remain underexplored. In addition, it is worth mentioning that with the now widespread availability of smartphones, wearable devices, and audio sharing platforms, geotagged audio data have become easily accessible, which enables us to explore this topic.
In this work, we are interested in the audiovisual aerial scene recognition task that simultaneously uses both visual and audio messages to identify the scene of a geographical region. To this end, we construct a new dataset, named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE), providing 5075 paired images and sound clips categorized into 13 scenes, which will be introduced in Section 3. According to our preliminary experiments, simply concatenating representations from the two modalities is not helpful, slightly degrading the recognition performance compared to a vision-only model. Knowing that sound events are related to scenes, this preliminary result indicates that the model cannot directly learn the underlying relation between the sound events and the scenes. So directly transferring the sound event knowledge to scene recognition may be the key to making progress. Following this direction, with the multimodal representations, we propose three approaches that effectively exploit the audio knowledge for the aerial scene recognition task, which will be detailed in Section 4. We compare our proposed approaches with baselines in Section 5, showing the benefit of exploiting the sound event knowledge for the aerial scene recognition task.
In summary, this work's contributions are threefold.
The audiovisual perception of human beings gives us an incentive to investigate a novel audiovisual aerial scene recognition task. We are not aware of any previous work exploring this topic.
We create an annotated dataset consisting of 5075 geotagged aerial image-sound pairs involving 13 scene classes. This dataset covers a large variety of scenes from across the world.
We propose three approaches to exploit the audio knowledge: preserving the capacity of recognizing sound events, constructing a mutual representation in order to learn the underlying relation between sound events and scenes, and directly learning this relation through the posterior probabilities of sound events given a scene. In addition, we validate the effectiveness of these approaches through extensive ablation studies and experiments.
In this section, we briefly review some related works in aerial scene recognition, multimodal learning, and cross-task transfer.
Earlier studies on aerial scene recognition [33, 23, 24] mainly focused on extracting low-level visual attributes and/or modeling mid-level spatial features [15, 17, 28]. Recently, deep networks, especially CNNs, have brought substantial progress in aerial scene recognition [20, 4, 6]; e.g., Nogueira et al. analysed three possible strategies for fully exploiting the power of existing CNNs. Meanwhile, other works [16, 32] aim at designing effective network architectures for aerial scene recognition in different settings; e.g., Kyrkou et al. designed a lightweight yet efficient CNN architecture for embedded systems. Moreover, some methods were proposed to address the limited availability of aerial images by employing more efficient networks [34, 37]. Although these methods have achieved great empirical success, they usually learn scene knowledge from a single modality, i.e., images. Different from previous works, this paper focuses on exploiting multiple modalities (i.e., image and sound) to achieve robust aerial scene recognition performance.
Information in the real world usually comes in different modalities, with each modality characterized by very distinct statistical properties, e.g., sound and image. A natural way to improve performance on a relevant task is to integrate the information from different modalities. In past decades, numerous works have developed promising methods on related topics, such as reducing audio noise by introducing visual lip information for speech recognition [11, 1], and improving facial sentiment recognition by resorting to the voice signal [35]. Recently, more attention has been paid to learning to analyze real-world multimodal scenarios [21, 12, 13] and events [26, 31]. These works have confirmed the advantages of multimodal learning. In this paper, we propose to recognize aerial scenes by leveraging the bridge between scene and sound to better understand aerial scenes.
Transferring the learned knowledge from one task to another related task has proven to be an effective way of better modeling data and correlating messages [7, 2, 14]. Aytar et al. proposed a teacher-student framework that transfers the discriminative knowledge of visual recognition to the representation learning task of the sound modality by minimizing the differences in the distribution of categories. Ehrlich et al. learned a joint feature representation with interaction between different tasks (such as PCA and facial landmark detection) to facilitate facial attribute recognition. Chaplot et al. utilized a dual-attention unit to align textual and visual representations with the transferred knowledge of words and objects. Due to the partial correlation between scenes and sound events, Imoto et al. proposed a method for sound event detection by transferring the knowledge of scenes with soft labels. Salem et al. proposed to transfer sound clustering knowledge to the image recognition task by predicting the distribution of sound clusters from an overhead image; similar work can also be found in the literature. By contrast, this paper strives to exploit effective sound event knowledge to facilitate the aerial scene understanding task.
To our knowledge, the audiovisual aerial scene recognition task has not been explored before. Although Salem et al. have established a dataset to explore the correlation between geotagged audio and overhead images, its low-quality images and absent scene labels make it difficult to facilitate research in this field. To this end, we construct a new dataset, named ADVANCE (publicly available at https://zenodo.org/record/3828124), which contains 5075 pairs of aerial images and sounds, classified into 13 classes.
The audio data are collected from Freesound (https://freesound.org/browse/geotags/), where we remove the audio recordings that are shorter than 2 seconds, and extend those between 2 and 10 seconds to longer than 10 seconds by replicating the audio content. Each audio recording is attached to the geographic coordinates of the sound being recorded. From this location information, we download the updated aerial images from Google Earth (https://earthengine.google.com/). Then we pair the downloaded aerial image with a randomly extracted 10-second sound clip from the entire audio recording. Finally, the paired data are labeled according to the annotations from OpenStreetMap (https://www.openstreetmap.org/), again using the geographic coordinates attached to the audio recording. These annotations have been manually corrected and verified by participants, since some of them may not be up to date. The overview of the dataset construction is shown in Fig. 2.
Due to the inherent uneven distribution of scene classes, the collected data are strongly unbalanced, which makes the training process difficult. So, two extra steps are designed to alleviate the unbalanced-distribution problem. First, we filter out the scenes whose numbers of paired samples are less than 10, such as deserts and wind turbine sites. Then, for scenes that have fewer than 100 samples, we apply a small offset to the original geographic coordinates in four directions. Correspondingly, four new aerial images are generated from Google Earth and paired with the same audio recording, while for each image, a new 10-second sound clip is randomly extracted from the recording. Fig. 3 reveals the final number of paired samples per class. Moreover, as shown in Fig. 4, the samples are distributed over the whole world, increasing the diversity of the aerial images and sounds.
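As an illustration, the coordinate-offset step can be sketched as follows; the offset magnitude `delta` is a hypothetical value, as the exact offset used for ADVANCE is not stated here:

```python
def offset_coordinates(lat, lon, delta=0.001):
    """Shift a geographic coordinate in the four cardinal directions.
    `delta` (in degrees) is a hypothetical offset size, not the one
    used to build ADVANCE."""
    return [
        (lat + delta, lon),  # north
        (lat - delta, lon),  # south
        (lat, lon + delta),  # east
        (lat, lon - delta),  # west
    ]

# Each offset location yields a new aerial image from Google Earth,
# paired with a fresh 10-second clip from the same audio recording.
new_points = offset_coordinates(48.8566, 2.3522)
```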
In this paper, we focus on the audiovisual aerial scene recognition task, based on two modalities, i.e., image and audio. We propose to exploit audio knowledge to better solve the aerial scene recognition task. In this section, we detail our proposed approaches for building a bridge that transfers sound event knowledge to scene recognition in a multimodal framework.
Let $(x, y)$ be a paired data sample drawn from the image-audio data distribution $\mathcal{D}$, where $x = (x_v, x_a)$ contains the image $x_v$ and the paired audio $x_a$, and $y$ is the scene label. Our built dataset ADVANCE is thus an empirical distribution of the real data distribution $\mathcal{D}$. We denote by $f_v$ and $f_a$ two convolutional networks for extracting representations from images and sound clips respectively, and by $f_{va}$ the network that concatenates the representations from $f_v$ and $f_a$, as shown in Fig. 5.
Then, with the extracted representation, a fully-connected layer, whose size depends on the choice of the network and the number of categories, transforms the representation to the label space, and predicts the probability of each category with an appropriate activation function. We write the output of the entire network before the final activation as $z(x)$, and use the notation $z^e(x)$ for sound event recognition. The scene recognition objective is

$$\mathcal{L}_s = \mathrm{KL}\big(\mathbb{1}_y \,\|\, \sigma(z(x))\big),$$

where $\mathbb{1}_y$ is the one-hot vector whose elements are all null except the $y$-th, indicating that the sample belongs to the scene $y$, $\sigma$ is the softmax function, and $\mathrm{KL}$ is the Kullback-Leibler divergence between the one-hot ground truth $\mathbb{1}_y$ and the predicted probability $\sigma(z(x))$.
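With a one-hot target, this KL objective reduces to the cross-entropy $-\log \sigma_y(z(x))$; a minimal NumPy sketch with hypothetical pre-activation scores over three scenes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def scene_loss(z, y):
    """KL divergence between the one-hot ground truth and the predicted
    scene distribution; with a one-hot target this is just -log p_y."""
    p = softmax(z)
    return -np.log(p[y])

z = np.array([2.0, 0.5, -1.0])       # hypothetical scores for 3 scenes
loss = scene_loss(z, 0)              # ground-truth scene is class 0
```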
In the rest of this section, we formulate our proposed model architecture for addressing the multimodal scene recognition task, and present our idea of exploiting the audio knowledge along three directions: (1) avoiding forgetting the audio knowledge during training by preserving the capacity of recognizing sound events; (2) constructing a mutual representation that solves the main task and the sound event recognition task simultaneously, allowing the model to learn the underlying relation between sound events and scenes; (3) directly learning the relation between sound events and scenes. Our total objective function is

$$\mathcal{L} = \mathcal{L}_s + \lambda_1 \mathcal{L}_t + \lambda_2 \mathcal{L}_e, \tag{1}$$

where $\mathcal{L}_t$ denotes one of the knowledge-transfer losses introduced below, $\mathcal{L}_e$ is the optional event-relevance loss, and $\lambda_1, \lambda_2$ are balancing hyper-parameters.
For the multimodal learning task with deep networks, we adopt a model architecture that concatenates the representations from two deep convolutional networks on images and sound clips. That is, we mainly use the concatenated multimodal network to produce the scene predictions in Equation (1), and we will focus on this setting to evaluate our proposed approaches for exploiting audio knowledge.
Furthermore, pre-training on related datasets helps accelerate the training process and improve the performance on a new dataset, especially a relatively small one. For our task, the paired data samples are limited, and our preliminary experiments show that the image network $f_v$ and the audio network $f_a$ benefit a lot from pre-training on the AID dataset for classifying scenes from aerial images, and on AudioSet for recognizing 527 audio events from sound clips; $\bar{f}_v$ and $\bar{f}_a$ denote the pre-trained networks. We keep the notations $f_v$, $f_a$ and $f_{va}$ for the networks initialized from the pre-trained weights, since we will mainly build on this initialization in the following. In the cases where networks are initialized with random weights, we add a dagger, as in $f_v^{\dagger}$, which differs from $f_v$ only in the initialization. The performance of randomly initialized networks will also be reported in the experimental section, presenting the contribution from each modality and from the pre-trained models.
For our task of aerial scene recognition, audio knowledge is expected to be helpful since the scene information is related to sound events. While initializing the network by the pre-trained weights is an implicit way of transferring the knowledge to the main task, the audio source knowledge may easily be forgotten during fine-tuning. Without audio source knowledge, the model can hardly recognize the sound events, leading to a random confusion between sound events and scenes.
To preserve this knowledge, we propose to record the soft responses of target samples from the pre-trained model and retain them during fine-tuning. This simple but efficient approach is known as knowledge distillation, originally proposed for distilling the knowledge from an ensemble of models into a single model, and it has also been used in domain adaptation and lifelong learning. All of these approaches preserve the source knowledge by minimizing the KL divergence between the responses of the pre-trained model and those of the training model. To avoid the saturated regions of the softmax, the pre-activations are divided by a large scalar, called the temperature $T$, to provide smooth responses, with which the knowledge can be transferred more easily.
However, because sound event recognition is a multi-label task, $z^e$ is activated by the sigmoid function instead of the softmax, so the knowledge distillation technique is implemented by a sum of KL divergences instead of a single one:

$$\mathcal{L}_{KD}(f_a) = \sum_{k} \mathrm{KL}_b\Big( s\big(\bar{z}^e_k(x_a)/T\big) \,\Big\|\, s\big(z^e_k(x_a)/T\big) \Big), \tag{3}$$

where $\mathrm{KL}_b(p \,\|\, q) = p \log\frac{p}{q} + (1-p) \log\frac{1-p}{1-q}$ is the KL divergence on binary recognition, $s$ is the sigmoid function, $s(\bar{z}^e_k(x_a)/T)$ is the $k$-th element of the smoothed responses, indicating the probability of the $k$-th sound event happening in sound clip $x_a$ as predicted by the pre-trained model $\bar{f}_a$, while $s(z^e_k(x_a)/T)$ is predicted by the training model $f_a$.
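This sum of per-event binary KL divergences can be sketched in NumPy as follows; the pre-activations and the temperature value are illustrative, and the real loss sums over all 527 AudioSet events:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_kl(p, q, eps=1e-8):
    """Element-wise KL divergence between two Bernoulli distributions."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def distill_loss(z_teacher, z_student, T=2.0):
    """Sum of per-event binary KLs between the temperature-smoothed
    responses of the pre-trained (teacher) and training (student)
    audio networks."""
    p = sigmoid(z_teacher / T)
    q = sigmoid(z_student / T)
    return binary_kl(p, q).sum()

z_t = np.array([3.0, -2.0, 0.5])     # teacher pre-activations (3 events for illustration)
z_s = np.array([2.5, -1.5, 0.7])     # student pre-activations
loss = distill_loss(z_t, z_s)
```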
From another viewpoint, knowledge distillation with a high temperature is equivalent to minimizing the squared Euclidean distance (SQ) between the pre-activated outputs. Thereby, apart from Equation (3), we also propose to preserve the source knowledge by directly comparing the pre-activated outputs:

$$\mathcal{L}_{SQ}(f_a) = \big\| \bar{z}^e(x_a) - z^e(x_a) \big\|_2^2. \tag{4}$$
Different from the idea of preserving the knowledge within the audio modality, we encourage our model, together with the visual modality, to learn a mutual representation that recognizes the scenes and the sound events simultaneously. Specifically, we optimize the sound event recognition task using the concatenated representation, with the knowledge distillation technique:

$$\mathcal{L}_{KD}(f_{va}) = \sum_{k} \mathrm{KL}_b\Big( s\big(\bar{z}^e_k(x_a)/T\big) \,\Big\|\, s\big(z^e_{va,k}(x)/T\big) \Big), \tag{5}$$

where $z^e_{va}(x)$ denotes the sound event output computed from the concatenated representation.
Similar to (4), we also propose to minimize the squared Euclidean distance between the pre-activated outputs of the pre-trained audio network and of the concatenated representation:

$$\mathcal{L}_{SQ}(f_{va}) = \big\| \bar{z}^e(x_a) - z^e_{va}(x) \big\|_2^2. \tag{6}$$
This simultaneous multi-task technique is very common within a single modality, such as solving depth estimation, surface normal estimation, and semantic segmentation from a single image, or recognizing acoustic scenes and sound events from audio, in the hope that the model can learn the underlying relationships among the tasks. We implement this idea through either Equation (5) or (6), encouraging the model to solve the two tasks simultaneously and find the underlying relation between sound events and scenes.
The two previously proposed approaches are based on the multi-task learning framework, solving two tasks in parallel using either different or shared representations, in order to preserve the audio source knowledge or learn an underlying relation between aerial scenes and sound events. Here, we propose an explicit way of directly modeling the relation between scenes and sound events, creating a bridge for transferring knowledge between the two modalities.
We employ the paired image-audio data samples from our built dataset as introduced in Section 3, analyze the sound events happening in each scene, and obtain the posteriors $p(e_k \mid y = c)$ of the $k$-th sound event given scene $c$. Then, instead of predicting the probability of sound events directly by the network, we estimate this probability distribution with the help of the posteriors and the predicted probability of scenes:

$$\hat{p}(e_k \mid x) = \sum_{c} p(e_k \mid y = c)\, \sigma_c(z(x)),$$

where $\sigma_c(z(x))$ is the predicted probability of the $c$-th scene, and the posterior $p(e_k \mid y = c)$ is obtained by averaging the pre-trained model's responses over all samples that belong to the scene $c$. This estimation is in fact the compound distribution that marginalizes out the probability of scenes, while we search for the optimal scene probability distribution (ideally one-hot) by aligning $\hat{p}(e_k \mid x)$ with the soft responses:

$$\mathcal{L}_{CT} = \sum_{k} \mathrm{KL}_b\Big( s\big(\bar{z}^e_k(x_a)\big) \,\Big\|\, \hat{p}(e_k \mid x) \Big).$$
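The compound estimate above is a single matrix-vector product; a NumPy sketch with hypothetical sizes (13 scenes as in ADVANCE, 5 events instead of 527) and random posteriors:

```python
import numpy as np

# posteriors[c, k] = p(event k | scene c); in practice this is obtained by
# averaging the pre-trained audio model's responses over all training
# samples of scene c.  Random values here, for illustration only.
rng = np.random.default_rng(0)
posteriors = rng.random((13, 5))

def estimated_event_probs(scene_probs, posteriors):
    """Compound distribution marginalizing the scene variable out:
    p(e_k | x) = sum_c p(e_k | y=c) * p(y=c | x)."""
    return scene_probs @ posteriors

scene_probs = np.full(13, 1 / 13)      # predicted scene distribution (uniform here)
event_probs = estimated_event_probs(scene_probs, posteriors)
```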
Besides estimating the probability of each sound event happening in a specific scene, we also investigate possible concomitant sound events in that scene. Some sound events may largely overlap in a given scene, and this coincidence can be used as a characteristic for recognizing scenes. We propose to extract this characteristic from the predicted sound event probabilities of all samples that belong to the specific scene.
We denote by $E_c \in \mathbb{R}^{n_c \times K}$ the sound event probabilities of the $n_c$ samples in the scene $c$, where each row of $E_c$ holds one sample's probabilities of the $K$ sound events. Then we extract the eigenvector $u_c$ that corresponds to the largest eigenvalue of the Gram matrix $E_c^{\top} E_c$. This eigenvector indicates the correlated sound events and quantifies their relevance in the scene by its direction. We thus propose to align the direction of $u_y$, the event relevance of the ground truth scene $y$, with the estimated distribution $\hat{p}(e \mid x)$:

$$\mathcal{L}_{e} = 1 - \frac{\big\langle u_y,\ \hat{p}(e \mid x) \big\rangle}{\|u_y\|_2\, \|\hat{p}(e \mid x)\|_2}.$$
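The relevance extraction amounts to the principal eigenvector of the Gram matrix followed by a cosine alignment; a NumPy sketch with hypothetical sizes (20 samples, 5 events):

```python
import numpy as np

def event_relevance(E):
    """Leading eigenvector of the Gram matrix E^T E, where each row of E
    holds one sample's sound event probabilities for a given scene; it
    captures which events co-occur in that scene."""
    gram = E.T @ E
    vals, vecs = np.linalg.eigh(gram)      # eigh: gram is symmetric
    u = vecs[:, -1]                        # eigenvector of the largest eigenvalue
    return np.abs(u)                       # fix the sign ambiguity

def cosine_alignment_loss(u, p_hat, eps=1e-8):
    """1 - cosine similarity: encourages the estimated event distribution
    to point in the direction of the scene's event-relevance vector."""
    return 1.0 - u @ p_hat / (np.linalg.norm(u) * np.linalg.norm(p_hat) + eps)

rng = np.random.default_rng(1)
E = rng.random((20, 5))                    # 20 samples, 5 events (illustrative sizes)
u = event_relevance(E)
loss = cosine_alignment_loss(u, E.mean(axis=0))
```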
In this section, we conduct comparison experiments, as well as some ablation studies, to validate the proposed approaches on the ADVANCE dataset.
Our built ADVANCE dataset is employed for evaluation, where 70% of the image-sound pairs are used for training, 10% for validation, and 20% for testing. Note that these three subsets do not share audiovisual pairs collected from the same coordinates. Before being fed to the recognition model, the sound clips are sub-sampled at 16 kHz. Then, the short-term Fourier transform is computed using a window size of 1024 and a hop length of 400. The generated spectrogram is then projected onto the log-mel scale to obtain a time-frequency audio matrix. Finally, we normalize each feature dimension to have zero mean and unit variance. The images are all resized to 256×256, and horizontal flipping, color, and brightness jittering are used for data augmentation.
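A plain-NumPy sketch of this audio preprocessing (window 1024, hop 400, 16 kHz, 10-second clip, as stated above; the mel projection is omitted for brevity):

```python
import numpy as np

def stft_magnitude(y, n_fft=1024, hop=400):
    """Magnitude spectrogram with the stated window size (1024) and hop
    length (400); a minimal short-term Fourier transform pipeline."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # shape (time, n_fft // 2 + 1)

# 10-second clip sampled at 16 kHz, as in the dataset (random signal here)
y = np.random.default_rng(2).standard_normal(16000 * 10)
spec = stft_magnitude(y)
log_spec = np.log(spec + 1e-6)                   # log scale (mel projection omitted)
# normalize each feature dimension to zero mean and unit variance
norm = (log_spec - log_spec.mean(axis=0)) / (log_spec.std(axis=0) + 1e-6)
```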
For the network setting, the visual pathway employs the AID pre-trained ResNet-101 for modeling the scene content, and the audio pathway adopts the AudioSet pre-trained ResNet-50 for modeling the sound content. The whole network is optimized via the Adam optimizer with a weight decay rate of 1e-4 and a relatively small learning rate of 1e-5, as both backbones have been pre-trained on external knowledge. Using a grid search strategy, the hyper-parameters $\lambda_1$ and $\lambda_2$ are set to 0.1 and 0.001, respectively. We adopt the weighted-average precision, recall, and F-score metrics for evaluation, which are more convincing in the presence of an uneven distribution of scene classes.
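For reference, the support-weighted F-score used here can be sketched as follows (toy labels for illustration):

```python
import numpy as np

def weighted_f_score(y_true, y_pred, n_classes):
    """Per-class F-scores averaged with weights proportional to each
    class's support; more informative than a plain macro average when
    the scene classes are unevenly distributed."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total_f, total_support = 0.0, 0
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        support = np.sum(y_true == c)
        total_f += f * support
        total_support += support
    return total_f / total_support

score = weighted_f_score([0, 0, 1, 2], [0, 1, 1, 2], 3)
```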
Table 1 shows the recognition results of different learning approaches under the unimodal and multimodal scenarios, from which four points deserve attention. First, according to the unimodal results, the sound data provide a certain reference for different scene categories, although they are significantly worse than the image-based results. This phenomenon suggests that we can take advantage of the audio information to further improve recognition results. Second, simply combining the information from both modalities does not bring benefits but slightly lowers the results (72.85 vs. 72.71 in F-score). This could be because the pre-trained knowledge of the audio modality is forgotten, or because the audio messages are not fully exploited with only the coarse scene labels. Third, when the sound event knowledge is transferred for scene modeling, all of the proposed approaches yield considerable improvements. The results of the knowledge-preservation variants (Equations (3) and (4)) show that preserving audio event knowledge is an effective means of better exploiting audio messages for scene recognition, and the performance of the mutual-representation variants (Equations (5) and (6)) demonstrates that transferring the unimodal knowledge of sound events to the multimodal network helps to learn a better mutual representation of scene content across modalities. Fourth, among all the compared approaches, our proposed cross-task transfer approach shows the best results, as it better imposes the sound event knowledge by exploiting the underlying relation between scenes and sound events.
In Table 1, ± denotes the standard deviation.
We use the CAM technique to highlight the parts of the input image that contribute most to identifying the specific scene category. Fig. 6 compares the visualization results and the predicted probabilities of the ground-truth label among different approaches. By resorting to the sound event knowledge, as well as its association with the scene information, our proposed model can better localize the salient area of the correct aerial scene and provides a higher predicted probability for the ground-truth category, e.g., the harbour and bridge classes.
Apart from the multimodal setting, we have also conducted experiments under unimodal settings, shown in Table 2, to present the contributions of the pre-trained models and to verify the benefits of the sound event knowledge for aerial scene recognition. For these unimodal experiments, we keep one modality as input and set the other to zeros. When only the audio data are considered, the sound event knowledge is transferred within the audio modality, so the mutual-representation losses become equivalent to the knowledge-preservation losses; similarly for the visual modality case. Comparing the results of networks with randomly initialized weights against those initialized from the pre-trained weights, we find that initializing the network from the pre-trained model significantly boosts the performance, which confirms that pre-training on a large-scale dataset benefits learning on small datasets. Another remark from this table is that the results of the three proposed approaches show that both unimodal networks can take advantage of the sound event knowledge to achieve better scene recognition performance. This further validates the generalization of the proposed approaches, in both the multimodal and the unimodal input cases. Compared with the multi-task frameworks, the cross-task transfer approach can better utilize the correlation between sound events and scene categories via the statistical posteriors.
In this subsection, we directly validate the effectiveness of the scene-to-event transfer term and the event relevance term, without the supervision of the scene recognition objective. Table 3 shows the comparison results. By resorting to the scene-to-event transfer term, performing sound event recognition alone can already give the model the ability to distinguish different scenes. When further equipped with the event relevance of the scenes, the model reaches higher performance. This demonstrates that cross-task transfer can indeed provide reasonable knowledge if the inherent correlation between the tasks is well exploited. By contrast, as the multi-task learning approaches do not take good advantage of this knowledge, their scene recognition performance remains at the chance level.
To better illustrate the correlation between aerial scenes and sound events, we further visualize the embedding results. Specifically, we use the well-trained cross-task transfer model to predict the sound event distribution on the testing set. Ideally, the sound event distribution can separate the scenes from each other, since each scene exhibits a different sound event distribution. Hence, we use t-SNE to visualize the high-dimensional sound event distributions of different scenes. Fig. 7 shows the visualization results, where points in different colors denote different scene categories. As the knowledge-preservation approach operates within the audio modality only, the sound event knowledge cannot be well transferred to the entire model, leading to mixed scene distributions. By contrast, as the mutual-representation approach transfers the sound event knowledge into the multimodal network, the predicted sound event distribution can separate different scenes to some extent. By further introducing the correlation between scenes and events, i.e., the event relevance term, different scenes can be further disentangled, which confirms the feasibility and merits of cross-task transfer.
In this paper, we explore a novel multimodal aerial scene recognition task that considers both visual and audio data. We have constructed a dataset, named ADVANCE, with labeled, paired audiovisual samples from across the world, to facilitate research on this topic. Observing that simply combining both modalities cannot easily improve scene recognition performance, we propose to directly transfer the sound event knowledge to the scene recognition task, for the reasons that sound events are related to scenes and that this underlying relation is not well exploited. Three kinds of effective cross-task transfer approaches are thus developed to improve scene recognition performance, and extensive experimental results show their effectiveness, confirming the benefit of exploiting audio knowledge for aerial scene recognition.
Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443.
When deep learning meets metric learning: remote sensing image scene classification via learning discriminative CNNs. IEEE Transactions on Geoscience and Remote Sensing 56 (5), pp. 2811–2821.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 47–55.
Scene classification via a gradient boosting random convolutional network framework. IEEE Transactions on Geoscience and Remote Sensing 54 (3), pp. 1793–1802.
Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.
Deep learning based feature selection for remote sensing scene classification. IEEE Geoscience and Remote Sensing Letters 12 (11), pp. 2321–2325.