1 Introduction
The brain combines multisensory information to understand its surroundings. Through various sensory experiences, humans learn the relationships between multisensory data and come to understand the experienced situation. This mechanism of learning the relationships between different stimuli is called associative learning [4, 5, 18, 19, 26, 33, 34]. Owing to the associative learning mechanism, humans can robustly understand and perceive their surroundings even when only some of the modalities are available.
In the field of machine learning, utilizing multimodality is also an important issue because of its usefulness in a wide range of applications
[2, 3]. As a representative example, object recognition and scene understanding methods based on multimodal data outperform methods using only single-modal data
[11, 17]. Moreover, one can generate synthesized data for a missing or desired modality [6, 14, 21, 22, 31, 37]. Cross-modal data association is one of the fundamental steps toward understanding the relationships among multimodal data. For this reason, many studies have attempted to solve the cross-modal data association problem with deep learning algorithms
[2]. Audio data and visual data have been associated in [11, 21, 23, 17]. The works in [22, 27] deal with visual data and hand pose data. The association between heterogeneous visual data, including RGB images, depth images, and segmentation maps, has been attempted in [6, 31]. Recent studies have adopted an approach that encodes cross-modal data into a shared latent space to memorize common features among multiple modalities [6, 11, 17, 22, 35]. However, as pointed out in [7], most existing studies did not consider the case in which the characteristics of each modality are very different from those of the others. In this case, it is hard to encode a meaningful common feature representing all characteristics of the heterogeneous modalities, or the encoding can be biased toward a dominant modality. Furthermore, the existing approach encounters a scalability problem, since the capacity of the shared latent space decreases as the number of modalities increases.
To mitigate the limitations of the shared latent space approach, we propose an approach that adopts a distributed latent space concept. In our approach, as shown in Figure 1, each modality is encoded by a standard variational autoencoder (VAE), and the distributed latent space encoded from each modality is associated with that of the other modality via a cross-modal associator between them. This approach is inspired by research on associative memory cells in the brain [29, 28], where intra-modal association is performed in each sensory cortex and cross-modal association is performed among the sensory cortices.
The intra-modal association is realized by the VAE, and the cross-modal association is realized by the proposed associator. In the proposed structure, the information on each modality is memorized in its own VAE, and the cross-modal association can be easily performed through the associator between them. The loss function to train the model is derived via the variational inference framework. The advantages of the proposed approach are discussed in view of its generalization ability for semi-supervised learning, its scalability, and its flexibility of encoding dimension. In experiments, the effectiveness, performance, and advantages of the proposed approach are evaluated through comparison with existing methods and empirical self-analysis using various datasets, including voice and visual data.
2 Related Works
2.1 Multimodality in Machine Learning
One of the major issues in machine learning is exploiting multimodal data for various applications, such as data generation [12, 9, 22, 37], retrieval [30], and recognition [11, 17]. Many studies extract modality-independent features by finding a shared representation of multimodal data [2]. The shared representation is utilized in diverse applications, such as handling a missing modality [6, 22] or achieving better performance than models trained on single-modal data [11, 17]. Research related to multimodality can be categorized into two groups [2].
One group comprises methods that map data from diverse modalities to a shared latent space. [35] proposes an extended version of the variational autoencoder [12] that combines distribution parameters from the encoders and calculates integrated distribution parameters. [22] also proposes a variant of the variational autoencoder [12] for hand pose estimation with multimodal data. The model proposed by [22] chooses an input-modality and output-modality pair and trains the corresponding encoder and decoder pair at every iteration. [6] trains an autoencoder that takes RGB images, depth images, and semantic images as its network input; the trained model can then generate a complete depth image and semantic image from an RGB image and partial depth and semantic images. [17] builds a deep-belief network structure that maps audio data and lip images into common hidden nodes for audio-visual speech recognition.
[11] extends the RBM structure to reflect the sequential characteristics of a speech dataset. The other group comprises methods that encode the corresponding data into the latent space of each modality but enforce similarity constraints on the corresponding latent vectors.
[31] trains domain-specific encoders and decoders, allowing encoders and decoders from different modalities to be combined; the model is then able to generate an unseen data pair by combining the encoders and decoders. [7] first extracts a low-level representation from the original data. Then [7] trains autoencoders for each modality and enforces similarity constraints on the embedding spaces of the autoencoders for correlated data pairs. In [8], a model is trained to maximize the similarity between an image feature and a vectorized label in order to infer a proper label for a given image.

2.2 Associative Learning inspired by the Brain
The artificial neural network is an engineering model inspired by the biological mechanisms of the brain. The parameters of such networks are classically updated by the Hebbian learning rule, in which the weight connections between nodes that fire together for the input data are strengthened [10]. The Hopfield network and the Boltzmann machine are representative examples [1]. The Hopfield network models the associative memory of humans; the network is trained to memorize specific patterns. Even if the input is incomplete, the Hopfield network can restore the incomplete data through recurrent iteration. The Boltzmann machine is a stochastic version of the Hopfield network, which can learn a latent representation of the input data through its hidden nodes.

There have been many studies that investigate associative learning from the perspective of neuroscience [4, 5, 26, 34]. Recent studies that analyze associative learning at the cellular substrate level [29, 28, 33] introduce associative memory cells to describe the brain neurons mainly involved in the integration and storage of associated signals. The brain learns associated information by enhancing the strength of the synapses between co-activated associative memory cells, which are activated by associated signals. In this paper, we realize the cross-modal association mechanism recently proposed in [29], which presents a comprehensive diagram based on the associative memory cells.

3 Method
3.1 Problem Statements
According to recent studies [28, 29], the associative learning process in the brain includes intra-modal and cross-modal association processes. The intra-modal association process makes humans familiar with single-modal sensory information. On the other hand, the cross-modal association process enhances the strength of the synapses connecting the multimodal information to be associated. The goal of this paper is to establish a Bayesian formulation of these two association processes and to realize them in a variational autoencoder framework.
3.2 Graphical Model of Cross-Modal Association
3.2.1 Intra-Modal Association
Intra-modal association is the process of memorizing single-domain information. To efficiently memorize a vast amount of information, the model needs to extract expressive features of the data. One way to make the encoding model remember the features of the data in an unsupervised manner is to formulate a mathematical model that reconstructs the original sensory data from the encoded information. Figure 2(a) shows the Bayesian graphical model formulating the intra-modal association, which memorizes a distribution of the latent variable $z_a$ associated with the input variable $x_a$ for an observation in modality $a$. In the Bayesian framework, the objective is to infer the model parameter $\theta_a$ of the posterior distribution $p_{\theta_a}(z_a \mid x_a)$.
One of the most popular approaches to approximate an intractable posterior is the variational inference method. In this method, the variational distribution $q_{\phi_a}(z_a \mid x_a)$ approximates the true posterior $p_{\theta_a}(z_a \mid x_a)$ by minimizing the Kullback-Leibler divergence $D_{KL}\big(q_{\phi_a}(z_a \mid x_a) \,\|\, p_{\theta_a}(z_a \mid x_a)\big)$. According to [12], this minimization can be replaced with the maximization of the evidence lower bound, given by

$\mathcal{L}(\theta_a, \phi_a; x_a) = -D_{KL}\big(q_{\phi_a}(z_a \mid x_a) \,\|\, p(z_a)\big) + \mathbb{E}_{q_{\phi_a}(z_a \mid x_a)}\big[\log p_{\theta_a}(x_a \mid z_a)\big], \quad (1)$

where $\mathbb{E}_q$ indicates expectation over the distribution $q_{\phi_a}(z_a \mid x_a)$.
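The equivalence between minimizing the KL divergence to the true posterior and maximizing the evidence lower bound can be checked numerically on a toy discrete model. The following sketch is our own illustration (not from the paper); it verifies the identity $\log p(x) = \mathcal{L} + D_{KL}(q \,\|\, p(z \mid x))$:

```python
import numpy as np

# Toy discrete model: latent z in {0, 1}, one fixed observation x.
p_z = np.array([0.6, 0.4])            # prior p(z)
p_x_given_z = np.array([0.2, 0.7])    # likelihood p(x|z) for the observed x
p_x = np.sum(p_z * p_x_given_z)       # evidence p(x)
post = p_z * p_x_given_z / p_x        # true posterior p(z|x)

q = np.array([0.5, 0.5])              # an arbitrary variational distribution

elbo = np.sum(q * (np.log(p_z * p_x_given_z) - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(post)))

# log p(x) = ELBO + KL(q || p(z|x)) holds exactly, so maximizing the
# ELBO over q is equivalent to minimizing the KL term.
assert np.isclose(np.log(p_x), elbo + kl)
```

Since the log-evidence is fixed with respect to $q$, pushing the ELBO up necessarily pushes the KL term down.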
3.2.2 Cross-Modal Association
In this section, we design a graphical model to represent the cross-modal association mechanism, as in Figure 2(b). Without loss of generality, we consider a path from modality $a$ to modality $b$. From observations of an associated variable pair $(x_a, x_b)$, the distribution parameters are inferred to model the association between $z_a$ and $z_b$.

For a given observation pair $(x_a, x_b)$, the cross-posterior distribution $p(z_b \mid x_a)$ is defined by marginalization over $z_a$ as

$p(z_b \mid x_a) = \int p(z_b \mid z_a)\, p(z_a \mid x_a)\, dz_a. \quad (2)$
To establish the cross-modal association model, we define a variational distribution $q(z_b \mid x_a)$ for the cross-posterior distribution. Then, to infer the distribution parameters, we minimize the Kullback-Leibler divergence between $q(z_b \mid x_a)$ and the true posterior $p(z_b \mid x_a, x_b)$. To avoid clutter, subscripts for the distribution parameters are omitted in the remainder of this section. The Kullback-Leibler divergence between $q(z_b \mid x_a)$ and $p(z_b \mid x_a, x_b)$ is given by

$D_{KL}\big(q(z_b \mid x_a) \,\|\, p(z_b \mid x_a, x_b)\big) = \log p(x_b \mid x_a) - \mathcal{L}_{a \to b}, \quad (3)$

where

$\mathcal{L}_{a \to b} = \mathbb{E}_{q(z_b \mid x_a)}\big[\log p(x_b, z_b \mid x_a) - \log q(z_b \mid x_a)\big]. \quad (4)$

Since the log-evidence $\log p(x_b \mid x_a)$ is independent of the model parameters, the target problem is identical to maximizing the evidence lower bound $\mathcal{L}_{a \to b}$. With probabilistic manipulations, $\mathcal{L}_{a \to b}$ can be decomposed as follows:

$\mathcal{L}_{a \to b} = -D_{KL}\big(q(z_b \mid x_a) \,\|\, p(z_b)\big) + \mathbb{E}_{q(z_b \mid x_a)}\big[\log p(x_b \mid z_b)\big]. \quad (5)$
The detailed derivation is given in Appendix A of the supplementary document. In Eq. (5), the first term is a negative KL divergence term that encourages $q(z_b \mid x_a)$ to have a similar distribution to the prior distribution of the target modality. The expectation term in Eq. (5) minimizes the reconstruction error of the output decoded from $z_b$ fired from $z_a$, which also promotes the inference of the association. By similar steps, we can derive the opposite association, from modality $b$ to modality $a$.
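For reference, the decomposition in Eq. (5) can be sketched as follows, assuming the factorization $p(x_b, z_b \mid x_a) = p(x_b \mid z_b)\, p(z_b)$, i.e., that $x_b$ depends on $x_a$ only through $z_b$ and that the prior over $z_b$ does not depend on $x_a$ (our reading of the graphical model in Figure 2(b)):

```latex
\begin{aligned}
\mathcal{L}_{a\to b}
&= \mathbb{E}_{q(z_b \mid x_a)}\big[\log p(x_b, z_b \mid x_a) - \log q(z_b \mid x_a)\big] \\
&= \mathbb{E}_{q(z_b \mid x_a)}\big[\log p(x_b \mid z_b) + \log p(z_b) - \log q(z_b \mid x_a)\big] \\
&= -D_{KL}\big(q(z_b \mid x_a)\,\|\,p(z_b)\big) + \mathbb{E}_{q(z_b \mid x_a)}\big[\log p(x_b \mid z_b)\big].
\end{aligned}
```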
3.3 Realization: Cross-Modal Association Network
We realize the aforementioned intra-modal and cross-modal association models by extending the variational autoencoder (VAE) framework [12]. Figure 3 illustrates the proposed cross-modal association network for modalities $a$ and $b$. Although only two modalities are considered in this paper, the proposed model can also be applied to associations among three or more modalities. In the proposed structure, the encoder produces the parameters of $q(z \mid x)$, and the decoder produces the parameters of $p(x \mid z)$. The encoder and decoder are realized by deep neural networks. Likewise, the models associating the latent spaces, $p(z_b \mid z_a)$ and $p(z_a \mid z_b)$, are also realized by deep neural networks, which we call associators. Thus, the intra-modal association network contains several autoencoders, each of which handles only one of the multiple modalities. The latent spaces of the autoencoders are connected by associators in a pairwise manner, which configures the cross-modal association network.
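A minimal sketch of the three kinds of modules, assuming simple fully connected networks (the class names, layer sizes, and hidden widths are our own illustrative assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an observation x to the mean and log-variance of q(z|x)."""
    def __init__(self, x_dim, z_dim, h_dim=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, z_dim)
        self.logvar = nn.Linear(h_dim, z_dim)
    def forward(self, x):
        h = self.body(x)
        return self.mu(h), self.logvar(h)

class Decoder(nn.Module):
    """Maps a latent vector z back to a reconstruction of x."""
    def __init__(self, z_dim, x_dim, h_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))
    def forward(self, z):
        return self.net(z)

class Associator(nn.Module):
    """Translates a latent vector of modality a into the parameters
    of a distribution over the latent space of modality b."""
    def __init__(self, za_dim, zb_dim, h_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(za_dim, h_dim), nn.ReLU())
        self.mu = nn.Linear(h_dim, zb_dim)
        self.logvar = nn.Linear(h_dim, zb_dim)
    def forward(self, z_a):
        h = self.body(z_a)
        return self.mu(h), self.logvar(h)
```

Note that the associator produces distribution parameters over the target latent space, mirroring the encoder, so the two latent spaces may have different dimensions.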
The proposed network is trained in two phases: an intra-modal training phase and a cross-modal training phase. In the intra-modal training phase, the autoencoder for each modality is trained separately by minimizing the approximated version of the negative evidence lower bound in Eq. (1) [12]. As derived in [12], the variational distributions are assumed to be centered isotropic multivariate Gaussian distributions. For a given observation sample $x_a$, the encoder produces the mean $\mu_a$ and the variance $\sigma_a^2$ of a Gaussian distribution over $z_a$. Then, the latent vector is sampled as $z_a = \mu_a + \sigma_a \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. Similarly, the decoder produces the mean and the variance of a Gaussian distribution over $x_a$, from which the reconstruction vector $\hat{x}_a$ is sampled in the same manner. Using these samples, the empirical loss for the autoencoder of modality $a$ can be derived as
$\mathcal{L}^{a} = \|x_a - \hat{x}_a\|^2 - \frac{\beta}{2} \sum_{j=1}^{J_a} \big(1 + \log \sigma_{a,j}^2 - \mu_{a,j}^2 - \sigma_{a,j}^2\big), \quad (6)$

where $\beta$ is a user-defined parameter and $J_a$ is the dimension of the latent variable $z_a$. $\mu_{a,j}$ and $\sigma_{a,j}$ denote the $j$th elements of $\mu_a$ and $\sigma_a$. The detailed derivation is presented in Appendix B of the supplementary document.
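The sampling step and the empirical loss of Eq. (6) can be sketched as follows, assuming a squared-error reconstruction term and the closed-form Gaussian KL term; the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Empirical VAE loss: squared reconstruction error minus beta/2 times
    the closed-form KL term between N(mu, sigma^2 I) and N(0, I)."""
    recon = np.sum((x - x_hat) ** 2)
    kl_term = 0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon - beta * kl_term
```

With $\mu = 0$ and $\log \sigma^2 = 0$ (a standard normal posterior), the KL term vanishes and only the reconstruction error remains.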
After the convergence of the intra-modal training phase, the cross-modal training phase proceeds to train the associators while freezing the weights of the autoencoders. In the same way as in the intra-modal training phase, for a given observation pair $(x_a, x_b)$, the encoders produce the latent vectors $z_a$ and $z_b$, respectively. In addition, the associators produce the latent vectors $z_{a \to b}$ and $z_{b \to a}$ from inputs $z_a$ and $z_b$, respectively. Thereafter, the decoders produce the reconstruction vectors $\hat{x}_{a \to b}$ and $\hat{x}_{b \to a}$ from $z_{a \to b}$ and $z_{b \to a}$, respectively.
Using these samples, the empirical loss for the associator from $a$ to $b$ is designed according to Eq. (5) as follows:

$\mathcal{L}^{a \to b} = \|x_b - \hat{x}_{a \to b}\|^2 - \frac{\beta}{2} \sum_{j=1}^{J_b} \big(1 + \log \sigma_{ab,j}^2 - \mu_{ab,j}^2 - \sigma_{ab,j}^2\big), \quad (7)$

where $\beta$ is a user-defined parameter and $J_b$ is the dimension of the latent variable $z_b$. $(\mu_{ab}, \sigma_{ab})$ are the parameters of the Gaussian distribution produced by the associator. The detailed derivation is presented in Appendix B of the supplementary document.
The loss for the opposite associator, from $b$ to $a$, is given in the same form with the indices swapped. Note that all the means and variances in Eq. (6) and Eq. (7) are functions of the weights of the encoders, decoders, or associators. Hence, the weights of the proposed network are trained in the negative direction of the gradients of the losses with respect to those weights.
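The cross-modal training phase can be sketched as below, using plain linear layers as stand-ins for the pretrained encoder, decoder, and associator. Only the reconstruction term of Eq. (7) is shown, and all shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Stand-in modules for one direction (a -> b); shapes are illustrative only.
enc_a = nn.Linear(784, 64)       # pretrained encoder for modality a
dec_b = nn.Linear(64, 784)       # pretrained decoder for modality b
assoc_ab = nn.Linear(64, 64)     # associator z_a -> z_b

# Phase 2: freeze the pretrained autoencoders, train only the associator.
for p in list(enc_a.parameters()) + list(dec_b.parameters()):
    p.requires_grad_(False)

opt = torch.optim.Adam(assoc_ab.parameters(), lr=1e-3)

x_a = torch.randn(8, 784)        # a paired mini-batch (x_a, x_b)
x_b = torch.randn(8, 784)

z_a = enc_a(x_a)                 # latent vector of the source modality
z_ab = assoc_ab(z_a)             # associated latent for the target modality
x_b_hat = dec_b(z_ab)            # reconstruction in the target modality

# Reconstruction term of Eq. (7); the KL regularizer is omitted for brevity.
loss = ((x_b - x_b_hat) ** 2).sum()
opt.zero_grad()
loss.backward()                  # gradients flow only into the associator
opt.step()
```

Because the autoencoder weights are frozen, the gradient step changes only the associator, so the information memorized in each modality's VAE is preserved.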
3.4 Advantages of Proposed Method
Owing to the newly introduced associator, the proposed model can associate heterogeneous modalities effectively. Reckless coalescence of heterogeneous data may have a fatal impact on associative learning, for example the problem that shared latent vectors become biased toward the dominant modality. In our model, however, the associator acts as a translator between heterogeneous modalities, and thus the characteristics of each latent space are preserved. Furthermore, in contrast to the existing models that adopt a shared latent space for the different modalities [35, 22, 6, 17], our structure allows a flexible encoding dimension for each latent space depending on the complexity of each modality. This provides better cross-modal data association results.
The proposed model easily incorporates additional modalities while maintaining the existing ones. That is, a new modality can be added by training only a new associator between an existing autoencoder and the new autoencoder. Although this associator connects the new modality with only one of the existing modalities, the model can associate the new modality with the rest of the modalities by passing through multiple associators.
Finally, in contrast to existing models which always require paired data for cross-modal association, our structure can train the associator with only a small amount of paired data in a semi-supervised manner, after learning each autoencoder independently using unpaired data. Since obtaining paired data for cross-modal association is more expensive than obtaining unpaired data, our model is cost-effective. Furthermore, our model is plausible in the sense that, when a person learns a cross-modal association, only a few paired examples from a teacher are needed once the person has become familiar with each modality through self-experience.
In the following experiment section, we validate the aforementioned advantages of the proposed structure.
4 Experiment
This section presents experimental results validating the effectiveness of the proposed method. In our experiments, we used images and voice recordings sharing common semantic meanings to evaluate our model on the visual and auditory modalities. The implementation details of the network architectures are provided in Appendix C of the supplementary document.
4.1 Datasets
Google Speech Commands (GSC) [32]: As the data for the auditory modality, we used the GSC dataset, which consists of 105,829 audio samples containing utterances of 35 short words. Each audio sample is one second long and encoded with a sampling rate of 16 kHz. Among the 35 words, we chose 14 words: one for each digit ('ZERO' to 'NINE') and four traffic commands ('GO,' 'STOP,' 'LEFT,' 'RIGHT'). The chosen set has 54,239 samples. We extracted Mel-Frequency Cepstral Coefficients (MFCCs) from each audio clip to generate an audio feature. MFCCs have been widely used in processing voice data because they reflect the human auditory perception mechanism well [16, 15, 20]. The resulting feature for each clip is a matrix of coefficients. We randomly divided the original dataset into training, validation, and test sets at a ratio of 8:1:1.
MNIST [13]: We used the MNIST dataset as the visual data corresponding to each digit in GSC. MNIST consists of center-aligned grayscale images of handwritten digits from 0 to 9. The dataset contains 60k training and 10k test samples.
Fashion-MNIST (FMNIST) [36]: We used the FMNIST dataset as another visual modality with a similar specification to the MNIST dataset. FMNIST consists of center-aligned grayscale images labeled with one of 10 kinds of clothing, such as T-shirt, Trouser, and Sneaker. The dataset contains 60k training and 10k test samples.
German Traffic Sign Recognition Benchmark (GTSRB) [24]: For the visual data corresponding to the traffic commands in GSC, we used the GTSRB dataset, which consists of 51,839 RGB color images illustrating 42 kinds of traffic signs. In particular, to evaluate the performance on pairs of traffic sign images and voice commands in GSC, we chose four pair sets, where each pair has a similar semantic meaning, i.e., ('Ahead only,' 'GO'), ('No entry for vehicle,' 'STOP'), ('Turn left and ahead,' 'LEFT'), and ('Turn right and ahead,' 'RIGHT'). Then, to prevent the four signs from occupying the entire latent space, we chose additional sign images such as 'No overtaking,' 'Entry to 30kph zone,' 'Prohibit overweighted vehicle,' 'No-waiting zone,' and 'Roundabout'. The chosen set includes 10,709 samples. All of the chosen signs have a circular backboard. The image sizes vary in the original dataset; in our experiments, we resized all images to a common resolution.
Table 1: Recognition Accuracy (%) of cross-modal association (source → target).

| Model | FMNIST→MNIST | MNIST→FMNIST | MNIST→GSC | GSC→MNIST | FMNIST→GSC | GSC→FMNIST | GTSRB→GSC | GSC→GTSRB |
|---|---|---|---|---|---|---|---|---|
| VAE [12] | 47.36 | 41.86 | 10.34 | 28.61 | 12.83 | 22.18 | 35.93 | 19.44 |
| VAE-CG [12] | 82.41 | 83.84 | 32.46 | 66.62 | 29.87 | 63.79 | 28.43 | 55.00 |
| JMVAE [25] | 83.49 | 88.14 | 28.15 | 62.31 | 47.58 | 51.23 | 41.02 | 65.18 |
| CVA [22] | 76.51 | 85.88 | 24.61 | 65.04 | 18.70 | 59.73 | 31.02 | 77.78 |
| MVAE [35] | 62.65 | 77.62 | 23.04 | 46.70 | 13.52 | 33.24 | 28.06 | 69.17 |
| ours | 82.02 | 94.26 | 59.47 | 88.66 | 43.95 | 77.84 | 58.89 | 77.87 |
| ours-flex | – | – | – | – | – | – | 61.39 | 80.56 |
Table 2: Recognition accuracy on original data (Rec) and on VAE-reconstructed data (VAE), with the latent dimension of each autoencoder.

| Dataset | Rec (%) | VAE (%) | dim(z) |
|---|---|---|---|
| MNIST | 97.97 | 96.12 | 64 |
| FMNIST | 89.22 | 80.54 | 64 |
| GSC | 88.65 | 81.93 | 64 |
| GTSRB | 98.53 | 95.70 | 64 |
| GTSRB | – | 95.50 | 256 |
4.2 Evaluation Metric
Since the datasets in Section 4.1 have no direct matching relationships, we cannot measure the cross-likelihood for paired samples used in [35, 25]. In our work, we used the reconstruction accuracy as the evaluation metric for the association models. The quality of an image reconstructed by an association model is a valid measure for evaluating the association model, since an acceptable reconstruction should be recognizable both to a human and to a recognition model. The reconstruction accuracy was measured by our own recognition networks trained on the original data used in our experiments. Table 2 shows the performance of the recognition networks trained on the original data, which is sufficient for evaluating the reconstructed outputs of the compared models. For the GSC dataset, we obtain performance comparable to the 88.2% reported in [32].

4.3 Intra-Modal Association
As mentioned in Section 3.1, learning the intra-modal association that encodes single-modal input data into the latent space is essential for the cross-modal association. For a fair comparison with existing works, we trained the encoders and decoders for each dataset with a fixed latent-space dimension of 64. In addition, to show the advantage of the proposed model that the dimension of the latent space can be flexibly designed according to the complexity of the target modality, we trained an additional autoencoder with a latent-space dimension of 256 for the GTSRB dataset.
Table 2 shows the reconstruction performance of the intra-modal association networks implemented by VAEs. As shown in the table, the voice data in the GSC dataset yield much lower accuracy, which means that the voice data are harder to reconstruct than the other modalities. Since FMNIST contains easily confused classes such as pullover, coat, and shirt, the performance on the FMNIST dataset is also degraded.
4.4 Cross-Modal Association
We evaluated the proposed model in four scenarios: (1) association between FMNIST and MNIST, (2) MNIST and GSC, (3) FMNIST and GSC, and (4) GTSRB and GSC. Scenario (1) is for association between datasets with similar characteristics. Scenarios (2) and (3) are for association between heterogeneous datasets, i.e., voice and image datasets. Scenario (4) is a more practical case than the others. To train the cross-modal association, we used randomly paired training samples drawn from the correlated classes of each dataset. For example, we paired a randomly chosen sample from the '0' class of the MNIST dataset with a randomly chosen sample from the 'ZERO' class of the GSC dataset.
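The random pairing procedure can be sketched as follows, assuming a mapping between correlated classes such as MNIST digit labels and GSC command words (the function and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_pairs(labels_a, labels_b, class_map, n_pairs):
    """Randomly pair sample indices whose classes are semantically correlated.
    class_map maps a class in dataset A to its counterpart in dataset B,
    e.g. {0: 'ZERO', 1: 'ONE', ...} for MNIST digits and GSC words."""
    by_class_b = {}
    for i, y in enumerate(labels_b):
        by_class_b.setdefault(y, []).append(i)
    pairs = []
    for _ in range(n_pairs):
        i = rng.integers(len(labels_a))                    # random sample from A
        j = rng.choice(by_class_b[class_map[labels_a[i]]]) # correlated sample from B
        pairs.append((i, j))
    return pairs
```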
To evaluate the proposed associator, the following methods were compared. VAE and VAE-CG are variants of the standard VAE [12]. For VAE, the direct concatenation of the paired data is given as the input. To allow the VAE to acquire cross-generation capability, VAE-CG is trained to generate the target-modality sample for a given input sample from the other modality; this model must be trained only on supervised input-output pairs. Joint Multimodal Variational AutoEncoder (JMVAE) [25] has two kinds of latent spaces: one for each modality and one for jointly encoding the two modalities. The joint latent space is shared for association between the two modalities. Training of the joint encoding minimizes the Kullback-Leibler divergence between the latent vector of each encoder and the joint latent vector of the joint encoder. In the comparison, its hyperparameter was set to 0.01 for all scenarios. Cross-modal Variational AutoEncoder (CVA) [22] is an extension of the VAE for cross-modal data. In CVA, the latent space is shared between the two modalities, and selected input-output modality pairs are trained alternately over the iterations. Multimodal Variational AutoEncoder (MVAE) [35] is also a variant of the VAE for cross-modal data. MVAE uses a standard VAE for each modality, but the latent spaces are associated via a shared latent space expressing the unified distribution of the associated modalities. We trained MVAE using the subsampled training paradigm presented in its paper.
To evaluate the flexibility of the encoding dimension in our model, we conducted an experiment in which each modality is encoded in a latent space of a different dimension from the other. ours-flex uses a larger latent-space dimension (256) for the GTSRB dataset. Except for ours-flex, all compared models use the same VAE whose latent-space dimension is 64.
Table 1 shows the evaluation results of the proposed model and the compared models for cross-modal association. The proposed model achieves significant improvements over the compared algorithms in most scenarios. Interestingly, in the challenging scenarios involving association between heterogeneous modalities, for instance between audio (GSC) and visual data (MNIST, GTSRB), the proposed model achieves a remarkable improvement over the existing models.
Figure 4 shows the qualitative results of our model for images generated from the GSC dataset. Figure 4 (a) and (b) show three generated images for each 'number' command of GSC. Figure 4 (c) shows five generated images for each 'traffic command' of GSC. The proposed model successfully generates images with the correct semantics, although the generated images converge to similar shapes due to the random pairing of training samples.
4.5 Application: Hand pose estimation
We conducted additional experiments for hand pose estimation on the Rendered Hand pose Dataset (RHD) [38]. The RHD dataset provides an RGB image, a depth map, a segmentation map, and 21 keypoints for each hand. We consider the case of generating 3D keypoints from the RGB image. The evaluation metric is the average End-Point-Error (EPE), which measures the Euclidean distance between the ground-truth and estimated keypoints, as in CVA [22]. We used the same encoder and decoder structures as CVA and added the associator. Table 3 and Figure 5 show that our model outperforms the previous works [22, 38].
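The average EPE metric reduces to a short computation; a sketch assuming keypoints stored as (N, 21, 3) arrays (the function name is ours):

```python
import numpy as np

def average_epe(pred, gt):
    """Average End-Point-Error: mean Euclidean distance between predicted
    and ground-truth 3D keypoints. pred, gt: arrays of shape (N, 21, 3)."""
    return np.mean(np.linalg.norm(pred - gt, axis=-1))
```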
4.6 Semi-supervised Learning
We conducted an additional experiment to verify the effectiveness of the proposed associator in semi-supervised learning. Figure 6 illustrates the performance of the proposed associator as the proportion of paired data varies from 100% to 1% in three scenarios: FMNIST-MNIST, MNIST-GSC, and FMNIST-GSC. The result shows that the proposed associator can achieve good performance with only a small proportion of paired data (5%) in a semi-supervised manner.
4.7 Scalability
The proposed structure can easily expand to a new modality while maintaining the existing modalities. That is, a new modality can be added by training only a new associator between an existing autoencoder and the new autoencoder. Since an associator connects only two latent spaces, fully connecting a new modality to a network that already associates several modalities would require one new associator per existing modality. In our model, this inefficiency can be mitigated by cascading associations through multiple associators. For example, suppose that the MNIST and FMNIST datasets are connected by an associator, and GSC and FMNIST are also connected by an associator. Then, even if there is no direct associator between GSC and MNIST, the association between them can be made by cascading the two existing associators. Table 4 compares the results of cascading association and direct association. Although the cascading association incurs some performance degradation, it still performs well compared to the other algorithms presented in Table 1.
Table 4: Recognition accuracy (%) of direct vs. cascading association.

| GSC→MNIST (direct) | GSC→FMNIST→MNIST (cascading) |
|---|---|
| 88.66 | 76.99 |
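Cascading association is simply a composition of associators; a minimal sketch (the function name is ours):

```python
def cascade(z_src, associators):
    """Chain association through intermediate modalities, e.g. GSC -> FMNIST -> MNIST.
    `associators` is a list of functions, each mapping one latent space to the next."""
    z = z_src
    for assoc in associators:
        z = assoc(z)
    return z
```

For example, a GSC latent vector could be passed through a GSC-to-FMNIST associator and then an FMNIST-to-MNIST associator before being decoded by the MNIST decoder.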
5 Conclusion
We proposed a novel multimodal association network structure that consists of multiple modalspecific autoencoders and associators for crossmodal association. By adopting the associators, the proposed multimodal network can incorporate new modalities easily and efficiently while preserving the encoded information in the latent space of each modality. In addition, the proposed network can effectively associate even heterogeneous modalities by designing each latent space independently and can be trained by a small amount of paired data in a semisupervised manner. Based on the validation of our structure in experiments, future work can attempt to implement a largescale multimodal association network for practical use.
References
 [1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 [2] T. Baltrušaitis, C. Ahuja, and L.P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 [4] T. V. Bliss and G. L. Collingridge. A synaptic model of memory: longterm potentiation in the hippocampus. Nature, 361(6407):31, 1993.
 [5] D. V. Buonomano and M. M. Merzenich. Cortical plasticity: from synapses to maps. Annual review of neuroscience, 21(1):149–186, 1998.
 [6] C. Cadena, A. R. Dick, and I. D. Reid. Multimodal autoencoders as joint estimators for robotics scene understanding. In Robotics: Science and Systems, 2016.
 [7] S. Chaudhury, S. Dasgupta, A. Munawar, M. A. S. Khan, and R. Tachibana. Conditional generation of multimodal data using constrained embedding space mapping. arXiv preprint arXiv:1707.00860, 2017.
 [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visualsemantic embedding model. In Advances in neural information processing systems, pages 2121–2129, 2013.
 [9] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [10] D. O. Hebb. The organization of behavior. na, 1961.

 [11] D. Hu, X. Li, et al. Temporal multimodal learning in audiovisual speech recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3574–3582, 2016.
 [12] D. P. Kingma and M. Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [13] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

 [14] J. Lim, Y. Yoo, B. Heo, and J. Y. Choi. Pose transforming network: Learning to disentangle human posture in variational autoencoded latent space. Pattern Recognition Letters, 2018.
 [15] B. Logan et al. Mel frequency cepstral coefficients for music modeling. In ISMIR, volume 270, pages 1–11, 2000.
 [16] L. Muda, M. Begam, and I. Elamvazuthi. Voice recognition algorithms using mel frequency cepstral coefficient (mfcc) and dynamic time warping (dtw) techniques. arXiv preprint arXiv:1003.4083, 2010.
 [17] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML11), pages 689–696, 2011.
 [18] I. P. Pavlov. Conditional reflexes: an investigation of the physiological activity of the cerebral cortex. 1927.
 [19] I. P. Pavlov and W. Gantt. Lectures on conditioned reflexes: Twentyfive years of objective study of the higher nervous activity (behaviour) of animals. 1928.
 [20] J. Rubin, R. Abreu, A. Ganguli, S. Nelaturi, I. Matei, and K. Sricharan. Classifying heart sound recordings using deep convolutional neural networks and melfrequency cepstral coefficients. In Computing in Cardiology Conference (CinC), 2016, pages 813–816. IEEE, 2016.
 [21] A. Senocak, T.H. Oh, J. Kim, M.H. Yang, and I. S. Kweon. Learning to localize sound source in visual scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4358–4366, 2018.
 [22] A. Spurr, J. Song, S. Park, and O. Hilliges. Crossmodal deep variational hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–98, 2018.
 [23] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
 [24] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The German Traffic Sign Recognition Benchmark: A multiclass classification competition. In IEEE International Joint Conference on Neural Networks, pages 1453–1460, 2011.
 [25] M. Suzuki, K. Nakayama, and Y. Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
 [26] H. Van Praag, B. R. Christie, T. J. Sejnowski, and F. H. Gage. Running enhances neurogenesis, learning, and longterm potentiation in mice. Proceedings of the National Academy of Sciences, 96(23):13427–13431, 1999.
 [27] C. Wan, T. Probst, L. Van Gool, and A. Yao. Crossing nets: Combining gans and vaes with a shared latent space for hand pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 680–689, 2017.
 [28] J.H. Wang and S. Cui. Associative memory cells: formation, function and perspective. F1000Research, 6, 2017.
 [29] J.H. Wang and S. Cui. Associative memory cells and their working principle in the brain. F1000Research, 7, 2018.
 [30] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang. A comprehensive survey on crossmodal retrieval. arXiv preprint arXiv:1607.06215, 2016.
 [31] Y. Wang, J. van de Weijer, and L. Herranz. Mix and match networks: encoderdecoder alignment for zeropair image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5467–5476, 2018.
 [32] P. Warden. Speech commands: A dataset for limitedvocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
 [33] E. A. Wasserman and R. R. Miller. What’s elementary about associative learning? Annual review of psychology, 48(1):573–607, 1997.
 [34] N. M. Weinberger. Specific longterm memory traces in primary auditory cortex. Nature Reviews Neuroscience, 5(4):279, 2004.
 [35] M. Wu and N. Goodman. Multimodal generative models for scalable weaklysupervised learning. In Advances in Neural Information Processing Systems, pages 5580–5590, 2018.
 [36] H. Xiao, K. Rasul, and R. Vollgraf. Fashionmnist: a novel image dataset for benchmarking machine learning algorithms, 2017.

 [37] Y. Yoo, S. Yun, H. J. Chang, Y. Demiris, and J. Y. Choi. Variational autoencoded regression: high dimensional regression of visual data on complex manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2017.
 [38] C. Zimmermann and T. Brox. Learning to estimate 3d hand pose from single rgb images. arXiv preprint arXiv:1705.01389, 2017.