Identifying the species of a bird is a widely studied problem for ornithologists and an important task in ecosystem monitoring and biodiversity preservation. Recognizing bird species in images is challenging due to variations in appearance, background, and environmental conditions. Despite these challenges, this fine-grained recognition task has received a significant amount of attention in the computer vision community [1, 2, 3, 4] because of its potential widespread applications. Compared to generic object recognition, fine-grained recognition benefits more from learning critical parts of the object, which help align objects of the same class and discriminate between neighboring classes. Current state-of-the-art methods (e.g., [1, 2, 5, 6, 7]) adopt CNN-based architectures that learn representations directly from the raw data and can be used to extract a set of discriminative features.
On the other hand, sound also provides important information about the world around us. Many animals produce sounds, either for communication or as part of activities such as moving, flying, and mating. Although sound is in some cases complementary to visual information, such as when we listen to something out of view, vision and hearing are often informative about the same structures in the world. As a consequence, numerous efforts have been devoted to recognizing bird species from auditory data [9, 10] in recent years. Adapting CNN architectures for audio event detection has become common practice, and generating deep features from visual representations of audio recordings has proven very effective, for example for bird sounds [12, 10]. To improve the affinity between images and sounds, we use a CNN to extract features from spectrograms of audio recordings.
In the real world, humans are able to integrate information from multiple modalities. In the field of machine learning, multimodal recognition can improve performance over unimodal recognition by exploiting complementary sources of information [13, 14]. Multimodal learning has been used for tasks such as image and sentence matching, RGB-D object recognition, action detection, and especially speech recognition [15, 18, 19, 20], fusing different modalities. The results have shown that one modality can enhance the performance of the other by providing relevant information. Furthermore, the authors of [14, 21] proposed fusion schemes for multimodal learning that take the architectures of the neural networks into account.
Our main contributions are summarized as follows:
We propose combining image and sound to provide a richer training signal for bird species classification within a CNN framework; to the best of our knowledge, this is the first such attempt.
We investigate three strategies for fusing the audio and image modalities using CNNs.
We collect at least 10 audio recordings for each of 178 bird species, corresponding to the image dataset CUB-200-2011.
Specifically, we adopt CNNs to process the two modalities jointly for bird species classification. Three strategies are investigated for fusing the audio and image modalities using CNNs: (1) an early fusion strategy, in which the feature vectors related to each modality are concatenated and input to the CNN; (2) a middle fusion strategy, in which features learned by each single-modality stream are combined at the mid-level of the CNN; and (3) a late fusion strategy, in which the outputs of the single-modality networks are fused to determine the final classification. Our experimental results show that the architecture with the late fusion strategy performs best among the proposed architectures, which indicates that combining the decisions of the classifiers from the two modalities is superior. In addition, we apply a two-stage training procedure, which further improves classification accuracy.
The rest of the paper is organized as follows. Section 2 reviews related work on bird species identification with CNNs and on multimodal learning algorithms. Section 3 describes several architectures, among them a multimodal CNN that processes the two modalities jointly. Section 4 describes the integrated dataset used to evaluate our architectures. Section 5 presents the experimental setup and results. Section 6 concludes the paper.
2 Related work
Our work is related to the problem of recognition from multimodal data as well as convolutional neural networks for image classification. We will briefly highlight connections and differences between our work and existing works.
Deep neural networks (DNNs) have been successfully applied to single modalities such as text [23, 24, 25], images [26, 27, 28], and audio [29, 30], showing their ability to learn representations directly from raw data that can be used to extract a set of discriminative features. The CNN is one powerful deep architecture commonly used for image classification [31, 32, 33]. The use of CNNs for fine-grained categorization such as bird species categorization has been proposed in many studies [1, 2, 5, 6, 7], which employ part/pose-based approaches to achieve good performance. CNNs have also been applied to speech processing [34, 35]. In [11, 36], the authors used CNNs to extract features from spectral representations of audio recordings. Extracting features from spectrograms of audio recordings has proven effective, for example for bird sounds [9, 12]; we therefore use spectrogram representations for the audio data. In this paper, instead of employing part/pose-based approaches, we focus on integrating raw image and audio data using a conventional CNN, in order to determine which fusion strategy performs best.
Multimodal learning algorithms have been used for tasks such as image-sentence matching, action recognition, RGB-D object recognition, and speech recognition (both audio-visual and visual-only speech recognition). Among the many approaches to multimodal learning, multimodal integration is commonly realized in three different categories of approaches. First, in the early fusion approach, feature vectors from multiple modalities are concatenated and transformed to obtain a multimodal feature vector. For example, Ngiam et al. used a DNN to extract fused representations directly from multimodal signal inputs. Likewise, in the middle fusion approach, Huang et al. employed a deep belief network (DBN) to combine mid-level features learned by single-modality networks. Lastly, in the late fusion approach, the outputs of unimodal classifiers are merged to determine the final classification. For example, for RGB-D object recognition, Eitel et al. proposed two separate CNN streams that process RGB and depth data independently and are combined with a late fusion approach. Similarly, Simonyan et al. proposed a two-stream network architecture for action recognition in videos (one stream processing spatial features from RGB images, the other processing temporal features from optical flow). They combined the two streams by concatenating features and by averaging the prediction scores of the two CNNs, respectively. In contrast to these works, we propose simple yet effective concatenation-, summation-, and multiplication-based fusion methods with respect to the three strategies.
3 Proposed method

Our three multimodal architectures extend a conventional CNN for large-scale image classification. Our implementation is based on CaffeNet and can be treated as a variation of the structure proposed by Krizhevsky et al.
3.1 Feature extraction
A CNN uses hierarchical features in its processing pipeline. The features from the initial layers are primitive, while later layers produce high-level abstract features built from combinations of lower-level features. CaffeNet consists of five convolutional layers (with max pooling layers following the first, second, and fifth convolutional layers) followed by three fully connected (FC) layers and a softmax classifier. A rectified linear unit is applied to every convolutional and fully connected layer, and local response normalization is applied in the first and second convolutional layers. The processing through this eight-layer network can be viewed as a progression from low- to mid- to high-level features. We hypothesize that combining the features of different layers in this pipeline can lead to better performance.
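For reference, the stream described above can be sketched in a PyTorch-style definition. This is an illustrative approximation, not the authors' Caffe implementation: the original's local response normalization and grouped convolutions are omitted, and the output layer is set to 194 classes as an assumption matching our dataset.

```python
import torch
import torch.nn as nn

# CaffeNet-like stream: 5 conv layers (pooling after conv1, conv2, conv5),
# then 3 fully connected layers. LRN and channel grouping are omitted.
caffenet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),               # pool after conv1
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),               # pool after conv2
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),               # pool after conv5
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),             # fc6
    nn.Linear(4096, 4096), nn.ReLU(),                    # fc7
    nn.Linear(4096, 194),                                # fc8 (194 classes)
)

# A 227x227 input image yields one score per class.
out = caffenet_like(torch.zeros(1, 3, 227, 227))
```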
3.2 Feature combination
We propose three strategies for fusing features: early fusion, middle fusion, and late fusion. Early fusion, also known as feature-level fusion, is a combination scheme in which features from multiple modalities are concatenated to form a merged feature vector. Middle fusion, also called mid-level combination, combines the high-level features learned by each single-modality network. Late fusion, also called decision-level fusion, combines the decisions of the unimodal classifiers to determine the final classification. In this paper, we use concatenation to combine low- and high-level features, and summation or multiplication to combine the decisions of the classifiers. The feature combination layers can be trained with standard back-propagation and stochastic gradient descent.
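As a minimal NumPy sketch of the three combination operators (all vector sizes here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
f_img = rng.standard_normal(4096)   # hypothetical image feature vector
f_aud = rng.standard_normal(4096)   # hypothetical audio feature vector

# Early/middle fusion: concatenate the feature vectors into one merged vector.
f_concat = np.concatenate([f_img, f_aud])   # shape (8192,)

# Late fusion: combine per-class scores element-wise instead of features.
s_img = rng.random(194)   # hypothetical class scores from the image stream
s_aud = rng.random(194)   # hypothetical class scores from the audio stream
s_sum = s_img + s_aud     # element-wise summation
s_mult = s_img * s_aud    # element-wise multiplication
```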
3.3 Architectures of proposed models
This section describes the proposed multimodal learning models, which combine audio and images using CNNs with different fusion approaches. We use the same architecture for both the audio and image modalities in order to focus on evaluating the effectiveness of the feature combination approaches.
3.3.1 Early fusion model
One direct approach to combining audio and images is to train a CNN over the concatenated audio and image data, as shown in Fig. 1. In this strategy, the input vectors related to each modality are concatenated and then processed together throughout the rest of the CNN pipeline. This model is the most computationally efficient compared to the middle and late fusion models, because its number of learnable weight parameters is roughly half that of the late fusion model.
3.3.2 Middle fusion model
In the middle fusion strategy, unimodal features are extracted independently from audio and images and then combined into a multimodal representation by concatenating the activations of the last pooling layers of the two modalities. The multimodal representation is then learned in the following fully connected layers. The middle fusion model is shown in Fig. 2.
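Concretely, the mid-level wiring can be sketched in NumPy as follows (the activation sizes and the single toy FC layer are assumptions; CaffeNet's pool5 flattens to 256×6×6 = 9216 values per modality):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical flattened last-pooling-layer activations, one per modality.
pool5_img = rng.standard_normal(9216)
pool5_aud = rng.standard_normal(9216)

# Middle fusion: concatenate the pool5 activations of the two streams ...
multimodal = np.concatenate([pool5_img, pool5_aud])   # shape (18432,)

# ... then learn the merged representation in shared FC layers (one shown).
W_fc6 = rng.standard_normal((4096, multimodal.size)) * 0.01
fc6 = np.maximum(W_fc6 @ multimodal, 0.0)             # ReLU(FC6)
```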
3.3.3 Decision or late fusion model
In contrast to the middle fusion model, the extracted unimodal features are learned separately to compute unimodal scores, and these scores are then integrated to determine the final scores. The late fusion model consists of two streams processing audio and image data (green and blue) independently, which are fused after the last fully connected layers, as shown in Fig. 3. Among the various ways of combining CNNs over different modalities, one straightforward way is to add additional fully connected layers that combine the outputs of the two streams, as presented in . Instead of using an FC layer to combine the two streams, we apply element-wise summation and element-wise multiplication to fuse the outputs of the two streams.
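The decision-level combination can be sketched in NumPy as follows (the logits are random stand-ins for the two streams' outputs; 194 is our class count):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits_img = rng.standard_normal(194)   # hypothetical image-stream outputs
logits_aud = rng.standard_normal(194)   # hypothetical audio-stream outputs

p_img = softmax(logits_img)
p_aud = softmax(logits_aud)

# Element-wise fusion of the two streams' outputs, then the final decision.
pred_sum = int(np.argmax(p_img + p_aud))    # summation-based fusion
pred_mult = int(np.argmax(p_img * p_aud))   # multiplication-based fusion
```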
4 Dataset

The CUB-200-2011 dataset contains 11,788 images of 200 bird species, with each image downscaled to 227×227 pixels. Spectro-temporal features (spectrograms) are extracted from audio recordings that we collected from xeno-canto to be used as the audio representation. Based on the 200 species of CUB-200-2011, we tried to harvest at least 10 different audio recordings per species. As a result, recordings for 178 species were collected completely, recordings for 19 species were collected only partially, and recordings for 3 species could not be collected. The spectrograms are obtained by applying the short-time Fourier transform (STFT) over 10-second audio frames, windowed with a Hanning window (size 512, 50% overlap). Since the sounds of birds are usually contained in a small portion of the frequency range (mostly around 2-8 kHz) as stated in , we only extract features from the range of 0-10 kHz. In order to focus only on sounds produced in the vocal organ of birds (i.e., calls and songs), we first obtained the maximum amplitude of each recording and removed frames whose amplitudes were all below a fraction of that maximum. Finally, the spectrograms are saved as 227×227-pixel color images, and the resulting dataset contains 4,807 spectrogram images of 194 bird species. Several examples of images and spectrograms for different bird species are shown in Fig. 5. We follow the standard training/test split of the CUB-200-2011 dataset suggested in . The sound dataset is split into two halves, for training and testing respectively.
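The spectrogram extraction described above can be sketched with SciPy as follows. The sampling rate and the silence-threshold fraction are our assumptions (the paper does not state them); the window size, overlap, and 0-10 kHz band follow the text.

```python
import numpy as np
from scipy import signal

sr = 22050                                                  # assumed sampling rate (Hz)
audio = np.random.default_rng(0).standard_normal(sr * 10)   # one 10 s audio frame

# STFT with a size-512 Hann window and 50% overlap (hop of 256 samples).
freqs, times, stft = signal.stft(audio, fs=sr, window='hann',
                                 nperseg=512, noverlap=256)
spec = np.abs(stft)

# Keep only the 0-10 kHz band, where most bird vocalizations lie.
band = freqs <= 10000
spec = spec[band, :]

# Drop near-silent frames relative to the maximum amplitude
# (the 0.1 fraction is an assumption; the paper's value is unspecified).
keep = spec.max(axis=0) >= 0.1 * spec.max()
spec = spec[:, keep]
```

The resulting magnitude array would then be rendered and saved as a 227×227 color image.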
Since the multimodal CNNs are trained in a supervised manner, we create an integrated dataset by matching the two datasets and their corresponding labels using the HDF5 file format (https://www.hdfgroup.org/HDF5/). An HDF5 file can contain any collection of data entities (images, videos, audio recordings, text, etc.) in a single file, and can be used to organize heterogeneous collections of very large and complex datasets.
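Building such an integrated file can be sketched with h5py as follows. The file name, dataset keys, and tiny array sizes are illustrative assumptions, not the authors' actual layout.

```python
import numpy as np
import h5py

rng = np.random.default_rng(0)
# Toy paired samples: one image and one spectrogram per label.
images = rng.random((8, 3, 227, 227), dtype=np.float32)
spectrograms = rng.random((8, 3, 227, 227), dtype=np.float32)
labels = np.arange(8, dtype=np.int64)   # hypothetical class labels

# Store the matched modalities and labels together in one HDF5 file.
with h5py.File('birds_train.h5', 'w') as f:
    f.create_dataset('image', data=images)
    f.create_dataset('spectrogram', data=spectrograms)
    f.create_dataset('label', data=labels)
```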
5 Experiments and Results
The experiments in this section are conducted to evaluate the effectiveness of our proposed architectures (Net1, Net2, Net3). In order to evaluate the advantage of our late fusion approach, we conduct a comparative experiment between the Net3 model and two existing fusion approaches [16, 17]. In addition, we fine-tune a pretrained model for all models (including the comparative models) to improve performance. To improve repeatability and focus only on the evaluation of the fusion step, we use the well-known CaffeNet model to extract features and to train and fine-tune the CNNs, with the default structure and parameter settings fixed except for the base learning rate and batch size. We set the learning rate to 0.001 for single-modality and 0.0001 for multimodal learning. The batch size is set to 32 for single-modality and to 1 for multimodal learning due to limited resources.
5.1 Quantitative result: Single-modality vs. Multi-modality
The core idea of this paper is the integration of image and audio. Therefore, it is necessary to compare the performance of single-modality models (image or sound) with multi-modality models (both image and sound: Net1, Net2, Net3). To focus on the evaluation of the fusion models, we do not use any transfer learning techniques (e.g., fine-tuning a pretrained model) in this experiment, which helps isolate the gains from the proposed architectures. Table 1 summarizes the results of the single-modality and proposed models. We observe that combining the two modalities using a CNN improves performance over using only one modality, and that extracting features separately from image and audio and fusing them at a late stage performs best, with a significant gain. Interestingly, the performance of the low-level and mid-level fusion models is only slightly better than that of the single-modality models. One possible reason is that the CNN learns features for the predominant modality. In contrast, learning features separately for the different modalities results in more independent features, which leads to better performance. Note that the result obtained by multi-modality differs from simply combining the results of two CNNs trained separately: the parameters for the two modalities are jointly estimated and can thus mutually influence each other. We provide a detailed discussion in the following section.
5.2 Qualitative result: Single-modality vs. Multi-modality
We perform a qualitative study to analyze the effects of the multimodal learning models by comparing the single-modality and multi-modality networks. First, we select classes for which the multi-modality models provide the correct answer while the single-modality models produce the wrong classification. Then, we study why the multi-modality models succeed where the single-modality models fail. Figure 7 shows some examples of single-modality vs. multi-modality classification. In the first column, the single-modality models predict the input image and spectrogram as the 'barn swallow' and the 'ring-billed gull', respectively, rather than the 'red-bellied woodpecker'. However, the multi-modality models are able to predict the right answer, because they exploit joint features of the different modalities. We observe that when the single-modality model gives the right answer for the spectrogram, the probability that the multi-modality models give the right answer is higher than when the single-modality classification is correct for the image. In the second column of Fig. 7, the single-modality model misclassifies the 'Clark's nutcracker' image as the 'great grey shrike', whereas the other models give correct answers. Lastly, the Net3 model is able to give the right answer while the other models misclassify the 'belted kingfisher' (last column).
To better understand the difference between the models, we analyze the features learned by each network by visualizing the filters of the first convolutional layers, shown in Fig. 8. First, we see that the filters of the different models have similar patterns. Second, the single-modality filters have more meaningful patterns than the multi-modality filters. As mentioned above, learning features separately for the different modalities appears to result in more independent features. Finally, the filters of the early fusion model combine patterns from both networks, but most of its filters resemble those of the spectrogram network. This suggests that the multimodal models (early and mid-level fusion) learn features for the predominant modality.
5.3 Quantitative result: Net3 vs. existing late fusion approaches
To evaluate the effectiveness of our late fusion approach, we conduct comparative experiments on Net3 against the existing late fusion methods presented in [17, 16]. The differences between the late fusion approaches are shown in Fig. 9.
FC7 concat: The FC7 layers (green and blue) of the two networks are concatenated and merged into the fusion layer, which performs a tensor multiplication of the two vectors. The resulting fusion vector is then passed through one additional fully connected layer for classification. This means this fusion method is a linear combination of pairwise interactions between the two features. However, this method is not suitable when the features have different sizes.
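Read this way, the pairwise-interaction fusion can be sketched in NumPy as follows. The toy feature size and weight shapes are our assumptions (real FC7 vectors would have 4096 dimensions each).

```python
import numpy as np

rng = np.random.default_rng(0)
fc7_img = rng.standard_normal(8)   # toy FC7 features from the image stream
fc7_aud = rng.standard_normal(8)   # toy FC7 features from the audio stream

# Pairwise interactions: every element of one vector multiplied with
# every element of the other (outer product), flattened ...
interactions = np.outer(fc7_img, fc7_aud).ravel()   # shape (64,)

# ... followed by one fully connected layer for classification, so the
# scores are a linear combination of the pairwise interactions.
W = rng.standard_normal((194, interactions.size)) * 0.1
scores = W @ interactions
```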
Averaging scores: Each network focuses on learning features from images or spectrograms, respectively, and the final classification is computed as the average of the softmax scores of the two networks. This fusion method does not consider pairwise interactions between the features; however, it is suitable when the model consists of differently structured network streams.
In terms of pairwise interactions, our method is similar to the FC7 concat method; however, we fuse the final outputs of the two networks. Figure 6 plots the learning curves of the Net3 model with the different late fusion methods for each epoch, indicating that averaging the softmax scores gives the lowest performance, while our fusion approach performs best.
5.4 Fine-tuning the pretrained model
Combining the modalities at a late stage of the CNN has proven to be more effective, so in this section we show additional results obtained by fine-tuning the pretrained CaffeNet CNN under the Net3 model. One natural approach to fine-tuning is to initialize both the image and audio CNNs with the weights and biases of the first seven layers of the pretrained CaffeNet, discarding the last fully connected layer. In place of the last fully connected layer of the pretrained model, we add a randomly initialized new fully connected layer for 200-class bird classification (194 classes in our experiment, due to the missing audio data).
[Figure panels: (a) Pooling layer; (b) Fully connected layer]
Another method of fine-tuning is to train Net3 in two stages: first the two streams are trained individually, followed by joint fine-tuning. We train the image and audio CNNs separately, adapting the weights of the pretrained model and learning the weights of the new 194-class output layer. After this training, the networks can perform separate classification for each modality. We then train the entire model, freezing the individual stream networks by setting their learning rate to zero and training only the fusion part of the network. As shown in Table 2, two-stage training gives the best performance, which shows that the additional modality can improve performance. We found that fine-tuning the pretrained model on the two modalities and training them simultaneously performs significantly worse than two-stage training. We believe the problem relates to the difference in size between the two datasets and to the different batch sizes used for single-modality and multi-modality training. In this experiment, we used net surgery (https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb) to fine-tune two different CNNs from one pretrained model and to merge the two fine-tuned models into our model.
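The freezing trick in the second stage can be illustrated with a pure-NumPy sketch (parameter names, shapes, and learning rates are hypothetical): a zero learning rate on the stream parameters leaves them untouched, while the fusion part is still updated.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical parameters: two pretrained stream layers and one fusion layer.
params = {
    'img_stream': rng.standard_normal((4, 4)),
    'aud_stream': rng.standard_normal((4, 4)),
    'fusion': rng.standard_normal((4, 4)),
}
# Stage 2: zero learning rate freezes the individual streams.
lrs = {'img_stream': 0.0, 'aud_stream': 0.0, 'fusion': 1e-2}

grads = {k: rng.standard_normal(v.shape) for k, v in params.items()}
before = {k: v.copy() for k, v in params.items()}

# One SGD step: only the fusion parameters actually change.
for k in params:
    params[k] -= lrs[k] * grads[k]
```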
To confirm that the features extracted from a sample image and spectrogram during fine-tuning were meaningful, we visualized the activations of different layers, especially the fused layers in Net3. The results, shown in Fig. 10, confirm that the learned features are meaningful and qualitatively resemble the sample image and spectrogram. Moreover, the activations of our fused layer benefit from incorporating the features of both streams when an individual network fails to produce the right answer.
Table 2: Classification accuracy (%) of the fine-tuning methods under the late fusion model.
|Fine-tuning weights of pretrained model (summation)|65.0|
|Two-stage fine-tuning, summation (ours, Net3)|78.9|
|Two-stage fine-tuning, multiplication (ours, Net3)|75.0|
|Two-stage fine-tuning, Simonyan et al.|70.0|
|Two-stage fine-tuning, Eitel et al.|72.5|
6 Conclusion

In this paper, we proposed three multimodal CNN architectures with different fusion strategies, which jointly process image and audio data for bird classification. Experimental results verified that the two-stream multimodal CNN with the late fusion strategy outperforms the others. In addition, we proposed a summation-based fusion method for combining multiple CNNs, which shows better performance than several existing fusion methods. Moreover, two-stage fine-tuning makes our method more effective. However, several drawbacks remain: (1) choosing a suitable audio duration based on the vocal characteristics of the birds to be recognized, an essential ingredient for improvement, is missing from our current work; and (2) our method is based on raw image data, so part detection and extracting features from pose-normalized regions may improve classification performance.
As future work, we aim to apply multiple kernel learning to combine multiple modalities, which can learn an optimal composite kernel by combining basis kernels constructed from features of the different modalities. We are also interested in designing new CNN architectures with a larger number of modalities.
-  N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part-based r-cnns for fine-grained category detection,” European conference on computer vision, pp.834–849, Springer, 2014.
-  S. Branson, G. Van Horn, S. Belongie, and P. Perona, “Bird species categorization using pose normalized deep convolutional nets,” arXiv preprint arXiv:1406.2952, 2014.
-  Y. Cui, F. Zhou, Y. Lin, and S. Belongie, “Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.1153–1162, 2016.
-  E. Gavves, B. Fernando, C.G. Snoek, A.W. Smeulders, and T. Tuytelaars, “Local alignments for fine-grained categorization,” International Journal of Computer Vision, vol.111, no.2, pp.191–212, 2015.
-  P. Guo and R. Farrell, “Fine-grained visual categorization using pairs: Pose and appearance integration for recognizing subcategories,” arXiv preprint arXiv:1801.09057, 2018.
-  V. Lebedev, A. Babenko, and V. Lempitsky, “Impostor networks for fast fine-grained recognition,” arXiv preprint arXiv:1806.05217, 2018.
-  X. He and Y. Peng, “Fine-grained image classification via combining vision and language,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5994–6002, 2017.
-  A. Owens, J. Wu, J.H. McDermott, W.T. Freeman, and A. Torralba, “Ambient sound provides supervision for visual learning,” European Conference on Computer Vision, pp.801–816, Springer, 2016.
-  S. Kahl, T. Wilhelm-Stein, H. Hussein, H. Klinck, D. Kowerko, M. Ritter, and M. Eibl, “Large-scale bird sound classification using convolutional neural networks,” Working notes of CLEF, 2017.
-  E. Cakir, S. Adavanne, G. Parascandolo, K. Drossos, and T. Virtanen, “Convolutional recurrent neural networks for bird audio detection,” Signal Processing Conference (EUSIPCO), 2017 25th European, pp.1744–1748, IEEE, 2017.
-  N. Takahashi, M. Gygli, and L. Van Gool, “Aenet: Learning deep audio features for video analysis,” IEEE Transactions on Multimedia, vol.20, no.3, pp.513–524, 2018.
-  K.J. Piczak, “Recognizing bird species in audio recordings using deep convolutional neural networks.,” CLEF (Working Notes), pp.534–543, 2016.
-  N. Srivastava and R.R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” Advances in neural information processing systems, pp.2222–2230, 2012.
-  J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A.Y. Ng, “Multimodal deep learning,” Proceedings of the 28th international conference on machine learning (ICML-11), pp.689–696, 2011.
-  E. Tatulli and T. Hueber, “Feature extraction using multimodal convolutional neural networks for visual speech recognition,” Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp.2971–2975, IEEE, 2017.
-  A. Eitel, J.T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard, “Multimodal deep learning for robust rgb-d object recognition,” Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pp.681–687, IEEE, 2015.
-  K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” Advances in neural information processing systems, pp.568–576, 2014.
-  K. Noda, Y. Yamaguchi, K. Nakadai, H.G. Okuno, and T. Ogata, “Audio-visual speech recognition using deep learning,” Applied Intelligence, vol.42, no.4, pp.722–737, 2015.
-  H. Meutzner, N. Ma, R. Nickel, C. Schymura, and D. Kolossa, “Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates,” Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp.5320–5324, IEEE, 2017.
-  A. Torfi, S.M. Iranmanesh, N.M. Nasrabadi, and J. Dawson, “Coupled 3d convolutional neural networks for audio-visual recognition,” arXiv preprint, 2017.
-  J. Huang and B. Kingsbury, “Audio-visual deep learning for noise robust speech recognition,” Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.7596–7599, IEEE, 2013.
-  C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” Tech. Rep. CNS-TR-2011-001, California Institute of Technology, 2011.
-  R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol.12, no.Aug, pp.2493–2537, 2011.
-  X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” Advances in neural information processing systems, pp.649–657, 2015.
-  Y. Kim, “Convolutional neural networks for sentence classification,” arXiv preprint arXiv:1408.5882, 2014.
-  Q.V. Le, “Building high-level features using large scale unsupervised learning,” Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp.8595–8598, IEEE, 2013.
-  Y. LeCun, B.E. Boser, J.S. Denker, D. Henderson, R.E. Howard, W.E. Hubbard, and L.D. Jackel, “Handwritten digit recognition with a back-propagation network,” Advances in neural information processing systems, pp.396–404, 1990.
-  Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol.1, no.4, pp.541–551, 1989.
-  G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, et al., “Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups,” IEEE Signal processing magazine, vol.29, no.6, pp.82–97, 2012.
-  Y. Petetin, C. Laroche, and A. Mayoue, “Deep neural networks for audio scene recognition,” EUSIPCO, pp.125–129, 2015.
-  A. Krizhevsky, I. Sutskever, and G.E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, pp.1097–1105, 2012.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp.1–9, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp.770–778, 2016.
-  T.N. Sainath, A.r. Mohamed, B. Kingsbury, and B. Ramabhadran, “Deep convolutional neural networks for lvcsr,” Acoustics, speech and signal processing (ICASSP), 2013 IEEE international conference on, pp.8614–8618, IEEE, 2013.
-  O. Abdel-Hamid, A.r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Transactions on audio, speech, and language processing, vol.22, no.10, pp.1533–1545, 2014.
-  M. Espi, M. Fujimoto, K. Kinoshita, and T. Nakatani, “Exploiting spectro-temporal locality in deep learning based acoustic event detection,” EURASIP Journal on Audio, Speech, and Music Processing, vol.2015, no.1, p.26, 2015.
-  L. Ma, Z. Lu, L. Shang, and H. Li, “Multimodal convolutional neural networks for matching image and sentence,” Proceedings of the IEEE international conference on computer vision, pp.2623–2631, 2015.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” Proceedings of the 22nd ACM international conference on Multimedia, pp.675–678, ACM, 2014.