- Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA
  Deep learning has successfully shown excellent performance in learning j...
- Multimodal Metric Learning for Tag-based Music Retrieval
  Tag-based music retrieval is crucial to browse large-scale music librari...
- Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking
  A major challenge in matching images and text is that they have intrinsi...
- deepsing: Generating Sentiment-aware Visual Stories using Cross-modal Music Translation
  In this paper we propose a deep learning method for performing attribute...
- CNN based music emotion classification
  Music emotion recognition (MER) is usually regarded as a multi-label tag...
- Target-Oriented Deformation of Visual-Semantic Embedding Space
  Multimodal embedding is a crucial research topic for cross-modal underst...
- Learning Affective Correspondence between Music and Image
  We introduce the problem of learning affective correspondence between au...
Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space
Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music with similar emotions can make the perceived emotion more vivid and intense. Existing emotion-based image and music matching methods either employ a limited set of categorical emotion states, which cannot adequately reflect the complexity and subtlety of emotions, or train the matching model with an impractical multi-stage pipeline. In this paper, we study end-to-end matching between images and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space that preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and the regression task in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. Extensive experiments on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching compared with state-of-the-art approaches.
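To make the joint objective concrete, below is a minimal sketch, assuming PyTorch. Everything here is illustrative rather than the paper's exact formulation: the names (CDCMLSketch, cdcml_loss), the feature dimensions (2048 for image features, 128 for music features), the projection-head design, the exponential VA-distance similarity target, and the loss weights are all assumptions. It shows the overall shape of the method the abstract describes: two modality encoders mapping into a shared embedding space, per-modality VA regression heads, and a joint loss that combines a cross-modal continuous metric term with single-modal VA regression.

```python
# A minimal sketch of the CDCML idea, assuming PyTorch and precomputed
# modality features; architectures and loss forms are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDCMLSketch(nn.Module):
    """Image and music encoders share a latent space; each also regresses VA."""
    def __init__(self, img_dim=2048, mus_dim=128, emb_dim=256):
        super().__init__()
        # Hypothetical projection heads on top of precomputed features.
        self.img_proj = nn.Sequential(nn.Linear(img_dim, emb_dim), nn.ReLU(),
                                      nn.Linear(emb_dim, emb_dim))
        self.mus_proj = nn.Sequential(nn.Linear(mus_dim, emb_dim), nn.ReLU(),
                                      nn.Linear(emb_dim, emb_dim))
        # Per-modality valence-arousal regression heads (label-space task).
        self.img_va = nn.Linear(emb_dim, 2)
        self.mus_va = nn.Linear(emb_dim, 2)

    def forward(self, img_feat, mus_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_mus = F.normalize(self.mus_proj(mus_feat), dim=-1)
        return z_img, z_mus, self.img_va(z_img), self.mus_va(z_mus)

def cdcml_loss(z_img, z_mus, va_img_pred, va_mus_pred, va_img, va_mus,
               lambda_metric=1.0, lambda_reg=1.0):
    # Continuous cross-modal target: pairs whose VA labels are close should
    # be close in the shared embedding (one plausible instantiation).
    target_sim = torch.exp(-torch.cdist(va_img, va_mus))   # (B, B), in (0, 1]
    embed_sim = (z_img @ z_mus.t() + 1) / 2                # cosine -> [0, 1]
    metric = F.mse_loss(embed_sim, target_sim)
    # Single-modal VA regression keeps each modality's emotion structure.
    reg = F.mse_loss(va_img_pred, va_img) + F.mse_loss(va_mus_pred, va_mus)
    return lambda_metric * metric + lambda_reg * reg

# Toy usage with random features and VA labels in [-1, 1]:
model = CDCMLSketch()
z_i, z_m, p_i, p_m = model(torch.randn(8, 2048), torch.randn(8, 128))
va_i, va_m = torch.rand(8, 2) * 2 - 1, torch.rand(8, 2) * 2 - 1
loss = cdcml_loss(z_i, z_m, p_i, p_m, va_i, va_m)
loss.backward()
```

Jointly optimizing the metric and regression terms is what lets a single end-to-end model serve both cross-modal matching (rank music for an image by embedding similarity) and single-modal VA prediction, in line with the abstract's description.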