Cross-domain Deep Feature Combination for Bird Species Classification with Audio-visual Data

by Bold Naranchimeg, et al.
University of Fukui

In the past decade, many state-of-the-art algorithms for image classification and audio classification have achieved notable success through the development of deep convolutional neural networks (CNNs). However, most of these works exploit only a single type of training data. In this paper, we present a study on classifying bird species by combining visual (image) and audio (sound) data using CNNs, a combination that has been sparsely treated so far. Specifically, we propose CNN-based multimodal learning models with three types of fusion strategies (early, middle, and late) to address the issue of combining training data across domains. The advantage of our proposed method lies in the fact that we can use a CNN not only to extract features from image and audio data (spectrograms) but also to combine the features across modalities. In the experiments, we train and evaluate the network structures on the comprehensive CUB-200-2011 standard data set combined with our originally collected audio data set covering the corresponding species. We observe that a model that uses the combination of both data types outperforms models trained with only one type of data. We also show that transfer learning can significantly increase classification performance.




1 Introduction

Identifying the species of a bird is a widely studied problem for ornithologists and an important task in ecosystem monitoring and biodiversity preservation. Recognizing bird species in images is challenging because of variations in appearance, background, and environment. Despite this, the fine-grained recognition task has received a significant amount of attention in the computer vision community [1, 2, 3, 4] because of its potentially widespread applications. Compared to generic object recognition, fine-grained recognition benefits more from learning critical parts of the object that help align objects of the same class and discriminate between neighboring classes. Current state-of-the-art methods (e.g., [1, 2, 5, 6, 7]) adopt CNN-based architectures that learn representations directly from the raw data and can be used to extract a set of discriminative features.

On the other hand, sound also provides important information about the world around us. Many animals make sounds either for communication or in the course of living activities such as moving, flying, and mating. Although sound is in some cases complementary to visual information, such as when we listen to something out of view, vision and hearing are often informative about the same structures in the world [8]. As a consequence, numerous efforts have been devoted in recent years to recognizing bird species based on auditory data [9, 10]. Adapting CNN architectures for audio event detection has become common practice, and generating deep features from visual representations of audio recordings has proven to be very effective [11], for example for bird sounds [12, 10]. To improve the affinity between images and sounds, we use a CNN to extract features from spectrograms of the audio recordings.

In the real world, humans are able to handle information from multiple modalities. In the field of machine learning, multimodal recognition can improve performance over unimodal recognition by using complementary sources of information [13, 14]. Multimodal learning has been used to fuse different modalities in tasks such as image and sentence matching [15], RGB-D object recognition [16], action detection [17], and especially speech recognition [15, 18, 19, 20]. The results have shown that one modality can enhance the performance of another by providing relevant information. Furthermore, the authors of [14, 21] proposed fusion schemes for multimodal learning that take the architectures of neural networks into account.

Our main contributions are summarized as follows:

  • We propose that the combination of image and sound provides a richer training signal for bird species classification within a CNN framework; to the best of our knowledge, this is the first such attempt.

  • We investigate three strategies for fusing the audio and image modalities using CNNs.

  • We collect at least 10 audio recordings per species for 178 bird species, corresponding to the image dataset CUB-200-2011 [22].

Specifically, we adopt CNNs to jointly process the two modalities for bird species classification. The three fusion strategies are: (1) an early fusion strategy, in which the feature vectors of the two modalities are concatenated and fed into the CNN; (2) a middle fusion strategy, in which features learned from each modality are combined at the middle level of the CNN; and (3) a late fusion strategy, in which the outputs of the single-modality networks are fused to determine the final classification. Our experimental results show that the architecture with the late fusion strategy outperforms the other proposed architectures, which indicates that combining the decisions of the classifiers from the two modalities is superior. In addition, we apply a two-stage training procedure, which improves the classification accuracy.

The rest of the paper is organized as follows. In Section 2, we review related work on bird species identification with CNNs and on multimodal learning algorithms. Section 3 describes several architectures, including multimodal CNNs that jointly process the two modalities. In Section 4, we describe the integrated dataset used to evaluate our architectures. Section 5 describes the experimental setup and results. In Section 6, we conclude the paper.

2 Related work

Our work is related to the problem of recognition from multimodal data as well as convolutional neural networks for image classification. We will briefly highlight connections and differences between our work and existing works.

Deep neural networks (DNNs) have been successfully applied to single modalities such as text [23, 24, 25], images [26, 27, 28], and audio [29, 30], showing their ability to learn representations directly from raw data and to extract a set of discriminative features. The CNN is a powerful deep architecture commonly used for image classification [31, 32, 33]. The use of CNNs for fine-grained categorization, such as bird species categorization, has been proposed in many studies [1, 2, 5, 6, 7], which employ part- or pose-based approaches to achieve good performance. CNNs have also been applied to speech processing [34, 35]. In [11, 36], the authors used CNNs to extract features from spectral representations of audio recordings. Extracting features from spectrograms of audio recordings has proved efficient, for example for bird sounds [9, 12], so we use spectrogram representations for the audio data. In this paper, instead of employing part- or pose-based approaches, we focus on integrating raw image and audio data using a conventional CNN in order to determine which fusion strategy performs best.

Multimodal learning algorithms have been used for tasks such as image-sentence matching [37], action recognition [17], RGB-D object recognition [16], and speech recognition (audio-visual speech recognition [18] and visual-only speech recognition [15]). Among the many approaches to multimodal learning, multimodal integration is commonly realized by three categories of approaches. First, in the early fusion approach, feature vectors from multiple modalities are concatenated and transformed to obtain a multimodal feature vector. For example, Ngiam et al. [14] used a DNN to extract fused representations directly from multimodal signal inputs. Second, in the middle fusion approach, Huang et al. [21] employed a deep belief network (DBN) to combine mid-level features learned from single modalities. Lastly, in the late fusion approach, the outputs of unimodal classifiers are merged to determine the final classification. For example, for RGB-D object recognition, Eitel et al. [16] proposed two separate CNN streams that process RGB and depth data independently and are combined with a late fusion approach. Similarly, Simonyan et al. [17] proposed a two-stream network architecture for action recognition in videos, in which one stream processes spatial features from RGB image inputs and the other processes temporal features from optical flow inputs. These two works combined their streams by concatenating features and by averaging the prediction scores of the two CNNs, respectively. In contrast to these works, we propose simple yet effective fusion methods based on concatenation, summation, and multiplication for the three strategies.

3 Methodology

Our three multimodal architectures extend a conventional CNN for large-scale image classification [31]. Our implementation is based on CaffeNet [38], which can be treated as a variation of the structure proposed by Krizhevsky et al. [31].

3.1 Feature extraction

A CNN uses hierarchical features in its processing pipeline. The features from the initial layers are primitive, while those from later layers are high-level abstract features built from combinations of lower-level features. CaffeNet consists of five convolutional layers (with max pooling layers following the first, second, and fifth convolutional layers) followed by three fully connected (FC) layers and a softmax classifier. A rectified linear unit is applied after every convolutional and fully connected layer, and local normalization is applied in the first and second convolutional layers. Processing through this 8-layer CNN can be viewed as a progression from low- to mid- to high-level features. We hypothesize that combining the features of different layers in this pipeline can lead to better performance.
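As a rough check of how the spatial resolution shrinks along this pipeline, the sketch below traces the feature-map size through the convolutional and pooling layers. The kernel sizes, strides, paddings, and channel counts are the commonly published CaffeNet/AlexNet values and are assumptions here, since the text does not list them; the names `CAFFENET_LAYERS`, `output_size`, and `trace_shapes` are illustrative only.

```python
# Layer hyperparameters are the commonly published CaffeNet/AlexNet values
# (an assumption; they are not stated in the text).
# Each entry: (name, kernel, stride, pad, output channels or None for pooling).
CAFFENET_LAYERS = [
    ("conv1", 11, 4, 0, 96),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 256),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 384),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]


def output_size(size, kernel, stride, pad):
    """Spatial size after a convolution or pooling window slides over the input."""
    return (size + 2 * pad - kernel) // stride + 1


def trace_shapes(input_size=227, input_channels=3):
    """Return (layer name, height, width, channels) for every layer in order."""
    size, channels = input_size, input_channels
    shapes = []
    for name, kernel, stride, pad, out_channels in CAFFENET_LAYERS:
        size = output_size(size, kernel, stride, pad)
        if out_channels is not None:  # pooling keeps the channel count
            channels = out_channels
        shapes.append((name, size, size, channels))
    return shapes


for name, height, width, channels in trace_shapes():
    print(f"{name}: {height} x {width} x {channels}")
```

Under these assumptions, a 227×227 input is reduced to a 6×6 map with 256 channels at the last pooling layer, which is the point where the fully connected layers take over.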

3.2 Feature combination

We propose three strategies for fusing features: early fusion, middle fusion, and late fusion. Early fusion, also known as feature-level fusion, is a feature combination scheme in which features from multiple modalities are concatenated to form a merged feature vector. Middle fusion, also called mid-level combination, combines the high-level features learned by each single-modality network. Late fusion, also called decision-level fusion, combines the decisions of the unimodal classifiers and determines the final classification. In this paper, we use concatenation to combine low- and high-level features, and summation or multiplication to combine the decisions of the classifiers. The feature combination layers can be trained with standard back-propagation and stochastic gradient descent.

Figure 1: The architecture of the early fusion model (Net1). The RGB images of the two modalities are concatenated at the merging layer to produce a merged output volume, from which the convolutional layers extract joint features. We use the HDF5 format to manage the datasets of the two modalities because of its flexible data storage and unlimited data types.

3.3 Architectures of proposed models

This section describes the proposed multimodal learning models, which combine audio and image data using CNNs with different fusion approaches. We use the same architecture for both the audio and image modalities in order to focus on evaluating the effectiveness of the feature combination approaches.

3.3.1 Early fusion model

One direct approach to combining audio and image data is to train a CNN on the concatenated audio and image data, as shown in Fig. 1. In this strategy, the input vectors of the two modalities are concatenated and then processed together through the rest of the CNN pipeline. This model is the most computationally efficient of the three, because its number of learnable weight parameters is about half that of the late fusion model.
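A minimal sketch of the input-level merge may help here. It assumes that the two inputs are stacked along the channel axis, which the text suggests but does not state explicitly, and it represents each input as a nested list of height × width × channels. The function name `concat_channels` is hypothetical.

```python
def concat_channels(image, spectrogram):
    """Stack two H x W x C inputs (nested lists) along the channel axis.

    Assumption: early fusion merges the inputs channel-wise, so two RGB
    inputs of equal height and width give one input with six channels.
    """
    if len(image) != len(spectrogram) or len(image[0]) != len(spectrogram[0]):
        raise ValueError("both inputs must share height and width")
    return [
        [img_px + spec_px for img_px, spec_px in zip(img_row, spec_row)]
        for img_row, spec_row in zip(image, spectrogram)
    ]


# Toy 1 x 2 "images" with three channels each.
image = [[[1, 2, 3], [4, 5, 6]]]
spectrogram = [[[7, 8, 9], [10, 11, 12]]]
merged = concat_channels(image, spectrogram)
print(len(merged[0][0]))  # number of channels after the merge
```

Because everything after the merge is a single network, only one set of convolutional and fully connected weights has to be learned, which is consistent with the efficiency argument above.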

3.3.2 Middle fusion model

In the middle fusion strategy, unimodal features are extracted independently from the audio and image data and then combined into a multimodal representation by concatenating the activations of the last pooling layers of the two modalities. The multimodal representation is then learned in the subsequent fully connected layers. The middle fusion model is shown in Fig. 2.

Figure 2: The architecture of the middle fusion model (Net2). The activations of the pool5 layers of the two modalities are concatenated at the merging layer and fed into the three fully connected layers, with a softmax at the end.
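The concatenation of the last pooling activations can be sketched as follows. The 6×6×256 shape of each pooling output is the usual CaffeNet value and is an assumption here; `flatten`, `middle_fusion`, and `make_map` are illustrative helper names, not part of the original implementation.

```python
def flatten(feature_map):
    """Flatten an H x W x C nested list into one flat vector."""
    return [value for row in feature_map for pixel in row for value in pixel]


def middle_fusion(image_pool5, audio_pool5):
    """Concatenate the flattened last-pooling activations of both streams."""
    return flatten(image_pool5) + flatten(audio_pool5)


def make_map(height, width, channels, value):
    """Build a constant feature map, used only to illustrate the shapes."""
    return [[[value] * channels for _ in range(width)] for _ in range(height)]


# Assumed pool5 shape per stream: 6 x 6 x 256 (standard CaffeNet).
fused = middle_fusion(make_map(6, 6, 256, 0.0), make_map(6, 6, 256, 1.0))
print(len(fused))  # input dimension seen by the first shared FC layer
```

With this layout the first shared fully connected layer sees a vector twice as long as in a single-stream network, while the convolutional weights of the two streams stay separate.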

3.3.3 Decision or late fusion model

In contrast to the middle fusion model, the extracted unimodal features are learned separately to compute unimodal scores, and these scores are then integrated to determine the final scores. The late fusion model consists of two streams that process audio and image data (green and blue) independently and are fused after the last fully connected layers, as shown in Fig. 3. Among the various ways of combining CNNs for different modalities, one straightforward way is to add fully connected layers that combine the outputs of the streams, as presented in [16]. Instead of using an FC layer to combine the two streams, we apply element-wise summation and element-wise multiplication to fuse the outputs of the streams.

Figure 3: The architecture of the late fusion model (Net3). The last fully connected layer of each stream holds the unimodal scores for each class, which are fused at the fusing layer by summing or multiplying the unimodal scores.
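The two element-wise fusion rules can be sketched as below. The score values are made up for illustration, and `fuse_scores` and `predict` are hypothetical names; the sketch only shows the arithmetic of the fusing step, not the training of the streams.

```python
def fuse_scores(image_scores, audio_scores, mode="sum"):
    """Fuse per-class scores of two streams element-wise."""
    if len(image_scores) != len(audio_scores):
        raise ValueError("both streams must score the same classes")
    if mode == "sum":
        return [a + b for a, b in zip(image_scores, audio_scores)]
    if mode == "product":
        return [a * b for a, b in zip(image_scores, audio_scores)]
    raise ValueError("mode must be 'sum' or 'product'")


def predict(scores):
    """Index of the class with the highest fused score."""
    return max(range(len(scores)), key=scores.__getitem__)


# Illustrative scores for three classes (not taken from the experiments).
image_scores = [0.1, 0.6, 0.3]
audio_scores = [0.5, 0.4, 0.1]

print(predict(fuse_scores(image_scores, audio_scores, mode="sum")))
print(predict(fuse_scores(image_scores, audio_scores, mode="product")))
```

Both rules require the two streams to output vectors of the same length, and neither adds learnable parameters to the fusing step itself.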

4 Database

We evaluate our models on the popular fine-grained bird dataset CUB-200-2011 [22] and on our originally collected sound dataset, gathered from the bird sound sharing database Xeno-Canto.

Figure 4: (a) Audio (top) and spectrogram (bottom) of the black-footed albatross. (b) The spectrogram of 10 seconds duration, which will be fed into the bird classification model.

The CUB-200-2011 [22] dataset contains 11788 images of 200 bird species, with each image downscaled to 227×227 pixels. Spectro-temporal features (spectrograms) are extracted from the audio recordings that we collected from Xeno-Canto and are used as the audio representation. Based on the 200 species of CUB-200-2011, we tried to harvest at least 10 different audio recordings for each species. As a result, a complete set of recordings (at least 10) was collected for 178 species, an incomplete set (fewer than 10) was collected for 19 species, and no recordings could be collected for 3 species. The spectrograms are obtained by applying the short-time Fourier transform (STFT) to 10-second audio frames, windowed with a Hanning window (size 512, 50% overlap). Because the sounds of birds are usually contained in a small portion of the frequency range (mostly around 2-8 kHz), as stated in [10], we only extract features from the range 0-10 kHz. To focus only on sounds produced by the vocal organ of birds (i.e., calls and songs), we first obtain the maximum amplitude of the audio and remove every frame whose amplitude stays below a threshold fraction of that maximum. Finally, the spectrograms are saved as 227×227-pixel color images, and the resulting dataset contains 4807 images of 194 bird species. Several examples of images and spectrograms for different bird species are shown in Fig. 5. We follow the standard training/test split of the CUB-200-2011 dataset suggested in [22]. The sound dataset is split into two halves for the training and test sets, respectively.
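The bookkeeping behind these spectrogram settings can be sketched as follows. The window size (512), the 50% overlap, the 10-second duration, and the 10 kHz upper limit come from the text; the 44.1 kHz sample rate and the amplitude fraction passed to `keep_loud_frames` are assumptions chosen only for illustration, and all function names are hypothetical.

```python
def stft_frame_count(num_samples, window_size=512, overlap=0.5):
    """Number of full analysis windows that fit into the signal."""
    hop = int(window_size * (1.0 - overlap))
    if num_samples < window_size:
        return 0
    return 1 + (num_samples - window_size) // hop


def bins_up_to(max_freq_hz, sample_rate, window_size=512):
    """Count STFT bins from 0 Hz up to max_freq_hz (inclusive).

    Bin k is centred at k * sample_rate / window_size.
    """
    highest_bin = int(max_freq_hz * window_size // sample_rate)
    return min(highest_bin, window_size // 2) + 1


def keep_loud_frames(frames, fraction):
    """Drop frames whose peak stays below fraction * global peak amplitude."""
    global_max = max(abs(sample) for frame in frames for sample in frame)
    threshold = fraction * global_max
    return [f for f in frames if max(abs(sample) for sample in f) >= threshold]


SAMPLE_RATE = 44100  # assumed; the recording sample rate is not given in the text
print(stft_frame_count(10 * SAMPLE_RATE))  # time frames in one 10-second clip
print(bins_up_to(10000, SAMPLE_RATE))      # frequency bins kept for 0-10 kHz

# Toy frames; the 0.1 fraction is a placeholder, not the value used in the paper.
print(keep_loud_frames([[0.0, 0.01], [0.5, -1.0], [0.02, 0.0]], fraction=0.1))
```

The resulting time-frequency grid is larger than 227×227 under these assumptions, so some resizing is implied when the spectrogram is saved as an image.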

The multimodal CNNs are trained in a supervised manner, so we create an integrated dataset by matching the two datasets and their corresponding labels using the HDF5 file format. An HDF5 file can contain any collection of data entities (images, videos, audio recordings, text, etc.) in a single file and can be used to organize heterogeneous collections consisting of very large and complex datasets.

Figure 5: Examples from CUB-200-2011 and the audio dataset: (a) the yellow-bellied flycatcher, (b) the seaside sparrow, (c) the western grebe, and (d) the pacific loon.

5 Experiments and Results

The experiments in this section evaluate the effectiveness of our proposed architectures (Net1, Net2, Net3). To evaluate the advantage of our late fusion approach, we conduct a comparative experiment between the Net3 model and two existing fusion approaches [16, 17]. In addition, we fine-tune a pretrained model for all the models (including the comparative models) to improve performance. To improve repeatability and focus only on the evaluation of the fusion step, we use the well-known CaffeNet model [38] to extract features, train, and fine-tune the CNNs, keeping the default structure and parameter settings fixed except for the base learning rate and batch size. We set the learning rate to 0.001 for single-modality learning and 0.0001 for multimodal learning. The batch size is set to 32 for single-modality learning and 1 for multimodal learning because of limited resources.

5.1 Quantitative result: Single-modality vs. Multi-modality

The core idea of this paper is to address the integration of image and audio data. It is therefore necessary to compare the performance of single-modality models (image or sound) and multi-modality models (both image and sound: Net1, Net2, Net3). To focus on the evaluation of the fusion models, we do not use any transfer learning techniques (e.g., fine-tuning a pretrained model) in this experiment, which helps isolate the gains due to the proposed architectures. Table 1 summarizes the results of the single-modality and proposed models. We observe that combining the two modalities using CNNs improves on the performance of models that use only one modality, and that extracting features separately from image and audio and fusing them at the late stage gives a significant gain. Interestingly, the low-level and mid-level fusion models perform only slightly better than the single-modality models. One possible reason is that the CNN learns features for the predominant modality. In contrast, learning features separately for the different modalities results in more independent features, which leads to better performance. Note that the result obtained with multi-modality differs from simply combining the results of two CNNs trained separately: the parameters of the two modalities are estimated jointly and can therefore influence each other. We provide a detailed discussion in the following section.


Modality          Model              Accuracy (%)
Single modality   Image              16.2
Single modality   Audio              46.4
Multi modality    Net1               50.0
Multi modality    Net2               49.9
Multi modality    Net3 (summation)   53.8

Table 1: Comparative results between individual modality and multimodal CNNs.

Figure 6 plots the learning process of the single-modality models (image only and audio only) and the Net3 model over the training epochs. It can be observed that Net3 significantly improves the results.

Figure 6: Test accuracy vs. Epoch.

5.2 Qualitative result: Single-modality vs. Multi-modality

We perform a qualitative study to analyze the effects of the multimodal learning models by comparing the single-modality and multi-modality networks. First, we select some classes for which the multi-modality models provide the correct answer while the single-modality models produce a wrong classification. Then, we study why the multi-modality models provide the right answers while the single-modality models fail. Figure 7 shows some examples of single-modality vs. multi-modality classification. In the first column, the single-modality models predict the input image and spectrogram as the ‘barn swallow’ and the ‘ring-billed gull’, respectively, rather than the ‘red-bellied woodpecker’. However, the multi-modality models are able to predict the right answer, because they use joint features of the different modalities. We observe that when the single-modality model gives the right answer for the spectrogram, the probability that the multi-modality models give the right answer is higher than when the single-modality classification is correct for the image. In the second column of Fig. 7, the single-modality model misclassifies the ‘clark nutcracker’ image as the ‘great grey shrike’, whereas the other models provide correct answers. Lastly, the Net3 model is able to provide the right answer on the ‘belted kingfisher’ while the other models misclassify it (last column).

To better understand the differences between the models, we analyze the features learned by each network by visualizing the filters of the first convolutional layers, shown in Fig. 8. First, we see that the filters of each network have similar patterns across the different models. Second, we see that the single-modality filters have more meaningful patterns than the multi-modality filters. As mentioned before, it seems that learning features separately for the different modalities results in more independent features. Finally, it can be seen that the filters of the early fusion model combine patterns from both networks, but most filters are similar to those of the spectrogram network. This suggests that the multimodal models (early fusion and mid-level fusion) learn features for the predominant modality.


Figure 7: Effects of combining image and spectrogram. The top two rows show sample images and spectrograms of different bird species, which are fed into the single-modality and multi-modality models. The bottom rows show the resulting classifications, where the multimodal networks provide a correct classification while the single-modality classifications are incorrect.

5.3 Quantitative result: Net3 vs. existing late fusion approaches

To evaluate the effectiveness of our late fusion approach, we conduct comparative experiments between Net3 and the existing late fusion methods presented in [17, 16]. The differences between the late fusion approaches are shown in Fig. 9.





Figure 8: Visualization of the 96 filters of the first convolutional layer. The left side shows the filters of the image network, while the right side shows the filters of the spectrogram network. It can be seen that the filters (left or right side) of the different models have similar patterns. However, Net1 seems to have a mixture of the filters of both networks.

  • FC7 concat [16]: The FC7 layers (green and blue) of the two networks are concatenated and merged into the fusion layer, which performs a tensor multiplication of the two vectors. The resulting fusion vector is then passed through one additional fully connected layer for classification. This fusion method is therefore a linear combination of pair-wise interactions between the two features. However, it is not suitable when the features have different sizes.

  • Averaging scores [17]: Each network focuses on learning features from images or spectrograms, respectively, and the final classification is computed as the average of the softmax scores of the two networks. This fusion method does not consider pair-wise interactions between the features. However, it is suitable when the model consists of differently structured network streams.

In terms of pair-wise interactions, our method is similar to the FC7 concat method. However, we fuse the final outputs of the networks. Figure 6 plots the learning curve of the Net3 model with the different late fusion methods over the epochs, indicating that averaging the softmax scores gives the lowest performance and that our fusion approach performs best.
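For comparison, the score-averaging baseline in the style of [17] can be sketched as follows: each stream's raw class scores are turned into probabilities with a softmax, and the two probability vectors are averaged. The scores below are made up for illustration and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw class scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_softmax(image_scores, audio_scores):
    """Average the per-class softmax probabilities of the two streams."""
    p_image, p_audio = softmax(image_scores), softmax(audio_scores)
    return [(a + b) / 2.0 for a, b in zip(p_image, p_audio)]


# Illustrative raw scores for three classes (not taken from the experiments).
fused = average_softmax([2.0, 1.0, 0.0], [0.0, 3.0, 0.0])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

Because each stream is normalized on its own before averaging, this rule treats the two streams symmetrically and does not depend on both streams sharing the same internal structure.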

Figure 9: Differences between the late fusion approaches.

5.4 Fine-tuning the pretrained model

Combining the modalities at the late stage of the CNN has proven to be more effective, so in this section we show additional results obtained by fine-tuning the pretrained CaffeNet CNN within the Net3 model. One natural way of fine-tuning is to initialize both the image and audio CNNs with the weights and biases of the first seven layers of the pretrained CaffeNet network, discarding the last fully connected layer. In place of the last fully connected layer of the pretrained model, we add a new, randomly initialized fully connected layer for 200-class bird classification (194 classes in our experiment, because of the missing audio data).
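The initialization described above can be sketched with plain dictionaries of weight matrices. This is only an illustration of the idea (copy every pretrained layer except the last one, then attach a freshly initialized output layer); the layer names, the toy matrix sizes, and the Gaussian initialization with standard deviation 0.01 are assumptions, and `init_from_pretrained` is a hypothetical helper.

```python
import copy
import random


def init_from_pretrained(pretrained, num_classes, last_layer="fc8", std=0.01, seed=0):
    """Copy all layers except the last one and add a new random output layer.

    Each weight matrix is a list of rows with shape (outputs x inputs).
    """
    rng = random.Random(seed)
    in_features = len(pretrained[last_layer][0])
    weights = {
        name: copy.deepcopy(matrix)
        for name, matrix in pretrained.items()
        if name != last_layer
    }
    weights[last_layer] = [
        [rng.gauss(0.0, std) for _ in range(in_features)]
        for _ in range(num_classes)
    ]
    return weights


# Toy "pretrained" network: fc7 has 3 outputs, fc8 scores 5 original classes.
pretrained = {
    "fc7": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "fc8": [[1.0, 2.0, 3.0] for _ in range(5)],
}
image_stream = init_from_pretrained(pretrained, num_classes=194, seed=1)
audio_stream = init_from_pretrained(pretrained, num_classes=194, seed=2)
print(len(image_stream["fc8"]), len(image_stream["fc8"][0]))
```

Both streams start from the same pretrained feature layers but receive independent copies, so later updates to one stream do not affect the other.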





(a) Pooling layer
(b) Fully connected layer
Figure 10: Feature visualization of network layers. These are examples of features at different levels in Net3, where (a) shows the features of different pooling layers of the image (left) and audio (right) networks, and (b) shows, from top to bottom, the features of the last fully connected layer for image only, spectrogram only, the fused layer using FC7 concat, and the fused layer using summation. The red rectangles mark incorrect classifications, and the green rectangles mark correct ones.

Another method of fine-tuning the model is to train Net3 in two stages: the two streams are first trained individually, followed by a joint fine-tuning. We train the image and audio CNNs separately, adapting the weights of the pretrained model and learning the weights of the new 194-class output layer. After this training, the networks can be used to perform separate classification for each modality. We then train the entire model with the learning rates of the individual stream networks set to zero, which freezes them so that only the fusion part of the network is trained. As shown in Table 2, two-stage training gives the best performance, which shows that an additional modality can improve performance. We found that fine-tuning the pretrained model for the two modalities and training them simultaneously is significantly worse than two-stage training. We think the problem relates to the difference between the sizes of the two datasets and between the batch sizes used for single-modality and multi-modality training. In this experiment, we used net surgery to fine-tune two different CNNs from one pretrained model and to merge two pretrained models into our model for fine-tuning.
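The freezing step of the second stage can be sketched with per-parameter learning-rate multipliers: a multiplier of zero leaves a parameter untouched by gradient descent. The parameter names, the scalar toy parameters, and the gradient values are hypothetical and serve only to show the update rule.

```python
def sgd_step(params, grads, base_lr, lr_mult):
    """One SGD update in which each parameter has its own learning-rate multiplier."""
    return {
        name: value - base_lr * lr_mult.get(name, 1.0) * grads[name]
        for name, value in params.items()
    }


# Toy scalar "parameters" standing in for the weights of each network part.
params = {"image_stream.w": 1.0, "audio_stream.w": 2.0, "fusion.w": 3.0}
grads = {"image_stream.w": 0.5, "audio_stream.w": 0.5, "fusion.w": 0.5}

# Stage 2: a multiplier of 0 freezes both streams; only the fusion part moves.
stage2_mult = {"image_stream.w": 0.0, "audio_stream.w": 0.0, "fusion.w": 1.0}
updated = sgd_step(params, grads, base_lr=0.1, lr_mult=stage2_mult)
print(updated)
```

In the first stage the same update would be applied to each stream on its own with non-zero multipliers, so the two stages differ only in which multipliers are set to zero.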

To confirm that the features extracted from the sample image and spectrogram during fine-tuning were meaningful, we visualized the activations of different layers, especially the fused layers in Net3. The results, shown in Fig. 10, allowed us to confirm that the learned features were meaningful and qualitatively resembled the sample image and spectrogram. Moreover, the activations of our fused layer benefit from incorporating the features of both streams when an individual network fails to produce the right answer.


Training method                           Fusion method                 Accuracy (%)
Fine-tuning weights of pretrained model   Summation                     65.0
Two-stage fine-tuning                     Summation (ours, Net3)        78.9
Two-stage fine-tuning                     Multiplication (ours, Net3)   75.0
Two-stage fine-tuning                     Simonyan et al. [17]          70.0
Two-stage fine-tuning                     Eitel et al. [16]             72.5

Table 2: Classification performance of the fine-tuned Net3 model with different fusion and fine-tuning methods.

6 Conclusion

In this paper, we proposed three multimodal CNN architectures with different fusion strategies, which jointly process image and audio data for bird classification. Experimental results verified that the two-stream multimodal CNN with the late fusion strategy outperforms the others. In addition, we proposed a summation-based fusion method to combine multiple CNNs, which shows better performance than several existing fusion methods. Moreover, with the help of two-stage fine-tuning, our method becomes more effective. However, our method still has several drawbacks: (1) Choosing a suitable audio duration based on the vocal features of the birds to be recognized is an essential ingredient for improvement, but it is missing from our current work. (2) Our method is based on raw image data, so part detection and feature extraction from pose-normalized regions may improve the classification performance.

In future work, we aim to apply multiple kernel learning to combine multiple modalities, which can learn an optimal composite kernel by combining basis kernels constructed from the different features of the modalities. We are also interested in designing new CNN architectures that increase the number of modalities.


  • [1] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” European conference on computer vision, pp.834–849, Springer, 2014.
  • [2] S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” arXiv preprint arXiv:1406.2952, 2014.
  • [3] Y. Cui, F. Zhou, Y. Lin, and S. Belongie, “Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1153–1162, 2016.
  • [4] E. Gavves, B. Fernando, C.G. Snoek, A.W. Smeulders, and T. Tuytelaars, “Local alignments for fine-grained categorization,” International Journal of Computer Vision, vol.111, no.2, pp.191–212, 2015.
  • [5] P. Guo and R. Farrell, “Fine-grained visual categorization using pairs: Pose and appearance integration for recognizing subcategories,” arXiv preprint arXiv:1801.09057, 2018.
  • [6] V. Lebedev, A. Babenko, and V. Lempitsky, “Impostor networks for fast fine-grained recognition,” arXiv preprint arXiv:1806.05217, 2018.
  • [7] X. He and Y. Peng, “Fine-grained image classification via combining vision and language,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5994–6002, 2017.
  • [8] A. Owens, J. Wu, J.H. McDermott, W.T. Freeman, and A. Torralba, “Ambient sound provides supervision for visual learning,” European Conference on Computer Vision, pp.801–816, Springer, 2016.
  • [9] S. Kahl, T. Wilhelm-Stein, H. Hussein, H. Klinck, D. Kowerko, M. Ritter, and M. Eibl, “Large-scale bird sound classification using convolutional neural networks,” Working notes of CLEF, 2017.
  • [10] E. Cakir, S. Adavanne, G. Parascandolo, K. Drossos, and T. Virtanen, “Convolutional recurrent neural networks for bird audio detection,” Signal Processing Conference (EUSIPCO), 2017 25th European, pp.1744–1748, IEEE, 2017.
  • [11] N. Takahashi, M. Gygli, and L. Van Gool, “Aenet: Learning deep audio features for video analysis,” IEEE Transactions on Multimedia, vol.20, no.3, pp.513–524, 2018.
  • [12] K.J. Piczak, “Recognizing bird species in audio recordings using deep convolutional neural networks,” CLEF (Working Notes), pp.534–543, 2016.
  • [13] N. Srivastava and R.R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” Advances in neural information processing systems, pp.2222–2230, 2012.
  • [14] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng, “Multimodal deep learning,” Proceedings of the 28th international conference on machine learning (ICML-11), pp.689–696, 2011.
  • [15] E. Tatulli and T. Hueber, “Feature extraction using multimodal convolutional neural networks for visual speech recognition,” Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp.2971–2975, IEEE, 2017.
  • [16] A. Eitel, J.T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp.681–687, IEEE, 2015.
  • [17] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in neural information processing systems, pp.568–576, 2014.
  • [18] K. Noda, Y. Yamaguchi, K. Nakadai, H.G. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol.42, no.4, pp.722–737, 2015.
  • [19] H. Meutzner, N. Ma, R. Nickel, C. Schymura, and D. Kolossa, “Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates,” Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp.5320–5324, IEEE, 2017.
  • [20] A. Torfi, S.M. Iranmanesh, N.M. Nasrabadi, and J. Dawson, “Coupled 3d convolutional neural networks for audio-visual recognition,” arXiv preprint, 2017.
  • [21] J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.7596–7599, IEEE, 2013.
  • [22] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.
  • [23] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol.12, no.Aug, pp.2493–2537, 2011.
  • [24] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, pp.649–657, 2015.
  • [25] Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
  • [26] Q.V. Le, “Building high-level features using large scale unsupervised learning,” Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.8595–8598, IEEE, 2013.
  • [27] Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, and L.D. Jackel, “Handwritten digit recognition with a back-propagation network,” Advances in neural information processing systems, pp.396–404, 1990.
  • [28] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol.1, no.4, pp.541–551, 1989.
  • [29] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol.29, no.6, pp.82–97, 2012.
  • [30] Y. Petetin, C. Laroche, and A. Mayoue, “Deep neural networks for audio scene recognition,” EUSIPCO, pp.125–129, 2015.
  • [31] A. Krizhevsky, I. Sutskever, and G.E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, pp.1097–1105, 2012.
  • [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp.1–9, 2015.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp.770–778, 2016.
  • [34] T.N. Sainath, A.r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for lvcsr,” Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on, pp.8614–8618, IEEE, 2013.
  • [35] O. Abdel-Hamid, A.r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol.22, no.10, pp.1533–1545, 2014.
  • [36] M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, “Exploiting spectro-temporal locality in deep learning based acoustic event detection,” EURASIP Journal on Audio, Speech, and Music Processing, vol.2015, no.1, p.26, 2015.
  • [37] L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” Proceedings of the IEEE international conference on computer vision, pp.2623–2631, 2015.
  • [38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” Proceedings of the 22nd ACM international conference on Multimedia, pp.675–678, ACM, 2014.