Deep Learning Techniques for Future Intelligent Cross-Media Retrieval

With advances in technology and the expansion of broadcasting, cross-media retrieval has gained much attention. It plays a significant role in big data applications and consists of searching for and finding data across different types of media. In this paper, we provide a novel taxonomy according to the challenges faced by multi-modal deep learning approaches in solving cross-media retrieval, namely: representation, alignment, and translation. These challenges are evaluated on deep learning (DL) based methods, which are categorized into four main groups: 1) unsupervised methods, 2) supervised methods, 3) pairwise based methods, and 4) rank based methods. Then, we present some well-known cross-media datasets used for retrieval, considering the importance of these datasets in the context of deep learning based cross-media retrieval approaches. Moreover, we present an extensive review of state-of-the-art problems and their corresponding solutions to encourage the use of deep learning in cross-media retrieval. The fundamental objective of this work is to exploit Deep Neural Networks (DNNs) for bridging the "media gap", and to provide researchers and developers with a better understanding of the underlying problems and potential solutions of deep learning assisted cross-media retrieval. To the best of our knowledge, this is the first comprehensive survey to address cross-media retrieval under deep learning methods.




I Introduction

Social media websites (e.g., Facebook, YouTube, Instagram, Flickr, and Twitter) have tremendously increased the volume of multimedia data on the Internet. Given this large volume of data and the heterogeneity of the data sources, data retrieval is becoming increasingly challenging. Generally, multimodal data (i.e., data from different sources, such as video, audio, text, and images) are used to describe the same events or occasions. For instance, a web page may describe an event in several modalities (image, audio, video, and text). As the amount of multimodal data grows, the accuracy of search results for the information of interest decreases. The evolution of search algorithms for indexing and searching multimodal data has made it easier to find information of interest efficiently. Nevertheless, these algorithms only support single-modality search, which comprises two main classes: content-based retrieval and keyword-based retrieval [gasser2019towards].

In the last few years, many cross-media retrieval methods have been proposed [rehman2018benchmark, peng2018overview, dong2018semi, xia2018cross, liu2018multi, shu2018crossfire, xu2018deep]. Among traditional approaches, Canonical Correlation Analysis (CCA) [hardoon2004canonical] and Partial Least Squares (PLS) [rosipal2005overview, sharma2011bypassing] are usually adopted to explicitly project data from different modalities into a common space for similarity measurement. The Bilinear Model (BLM) [tenenbaum2000separating] learns a common subspace in which data from different modalities (e.g., text and image) are projected to the same coordinates. Generalized Multiview Analysis (GMA) [sharma2012generalized] can be used to combine CCA, BLM, and PLS for solving the cross-media retrieval task. Gong et al. [gong2014multi] proposed a CCA variant that incorporates high-level semantic information as a third view. Ranjan et al. [ranjan2015multi] also introduced a variant of CCA called multi-label Canonical Correlation Analysis (ml-CCA), which learns the weights of shared subspaces using high-level semantics in the form of multi-label annotations. Rasiwasia et al. [rasiwasia2014cluster] proposed a cluster CCA method to learn discriminant isomorphic representations that maximize the correlation between two modalities while distinguishing the different categories. Sharma et al. [sharma2012generalized] proposed a variant of Marginal Fisher Analysis (MFA) called Generalized Multiview Marginal Fisher Analysis (GMMFA).
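To make the idea of projecting two modalities into a common space concrete, the following is a minimal sketch of plain linear CCA, assuming NumPy is available. The function name `linear_cca`, the small ridge term `reg`, and the synthetic "image" and "text" features are illustrative assumptions and are not taken from any of the cited works.

```python
import numpy as np

def linear_cca(X, Y, k, reg=1e-6):
    """Return projections (Wx, Wy) and the top-k canonical correlations."""
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Covariance matrices; `reg` (an assumption) keeps them invertible.
    Sxx = X.T @ X / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)

    def inv_sqrt(S):
        # Inverse square root of a symmetric positive-definite matrix.
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    Sxx_is, Syy_is = inv_sqrt(Sxx), inv_sqrt(Syy)
    # Singular values of the whitened cross-covariance are the correlations.
    U, s, Vt = np.linalg.svd(Sxx_is @ Sxy @ Syy_is)
    Wx = Sxx_is @ U[:, :k]
    Wy = Syy_is @ Vt.T[:, :k]
    return Wx, Wy, s[:k]

# Synthetic data: both views are noisy linear functions of shared latent factors.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 2))
X = z @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(500, 8))  # "image" view
Y = z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(500, 5))  # "text" view
Wx, Wy, corrs = linear_cca(X, Y, k=2)
```

After fitting, `X @ Wx` and `Y @ Wy` live in the same two-dimensional space, so items from the two views can be compared directly, for example by cosine similarity.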

Ref. Year Topic Deep Learning Cross-media Retrieval

Supervised Unsupervised Pairwise Rank Representation Alignment Translation
[schmidhuber2015deep] 2015 Deep learning in neural networks: An overview
[lecun2015deep] 2015 Deep Learning
[liu2017survey] 2017 A survey of deep neural network architectures and their applications.
[ahmad2019deep] 2019 Deep learning: methods and applications
[deng2014tutorial] 2014 A tutorial survey of architectures, algorithms, and applications for deep learning
[pouyanfar2018survey] 2018 A survey on deep learning: Algorithms, techniques, and applications
[arulkumaran2017deep] 2017 Deep reinforcement learning: A brief survey

[hussein2017imitation] 2017 Imitation learning: A survey of learning methods
[chen2014big] 2014 Big data deep learning: challenges and perspectives
[najafabadi2015deep] 2015 Deep learning applications and challenges in big data analytics
[hordri2017systematic] 2017 A systematic literature review on features of deep learning in big data analytics
[peng2017overview] 2017 An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges
[wang2016comprehensive] 2016 A comprehensive survey on cross-modal retrieval
[liu2010cross] 2010 Cross-media retrieval: state-of-the-art and open issues
Our work 2020 Deep Learning Techniques: Evolving Machine Intelligence for Future Intelligent Cross-media Retrieval
Table I: Comparison of existing survey articles on deep learning and cross-media retrieval. ✔ represents that the topic is covered, ✘ represents the topic is not covered, and ❊ represents the topic is partially covered.

Even though each of the aforementioned contributions is valuable to the cross-media retrieval community, these methods still lack satisfactory performance. The key reason is that conventional feature learning techniques can hardly tackle the problem of image understanding, yet cross-media retrieval depends heavily on the representation of visual features shared between images and text. Recently, deep learning models have made significant progress in fields such as computer vision [tu2017csfl, benjdira2019unsupervised], engineering [rehman2016face], health [yang2018clinical], and hydrology [Rehman2019water]. Donahue et al. [donahue2014decaf] proposed a deep eight-layer neural network called DeCAF, which confirmed that Convolutional Neural Network (CNN) features are helpful for various feature extraction tasks.

In this paper, we investigate the deep learning approaches applied to cross-media search, which are indispensable for the adoption and implementation of cross-media retrieval. A DNN is designed to simulate the neuronal structure of the human brain and is a powerful way to model the correlations among multiple media. For this reason, several researchers have explored DNNs for the search and retrieval of data from heterogeneous sources. Although recent DNN-based methods for cross-media retrieval have achieved better performance [peng2016cross], significant improvements are still needed in this area.

We explore the following three main challenges for using deep learning techniques in cross-media retrieval.

Figure 1: Taxonomy of the proposed work.
  1. Representation. The aim is to learn representations of cross-media data that reduce redundancy. This is challenging in cross-media retrieval because the data are heterogeneous. For instance, text is normally symbolic, while audio and video modalities are represented as signals. Therefore, learning the representation of each modality in a common semantic space is a challenging task.

  2. Alignment. The key objective is to find the correlations between elements of different modalities in order to mitigate the modality-to-modality mismatch. For instance, we may want to align each human action image with the corresponding part of a video showing a series of different human actions. To achieve this, we need to measure the similarity between different modalities and deal with other correlation uncertainties.

  3. Translation. This is the mapping of data from one modality to another. It is difficult because the data are heterogeneous and the relationship between modalities is hard to identify. For instance, an image can be described in many different ways, and a single perfect translation may not exist. Therefore, it is hard to choose an appropriate translation for a particular task in which multiple parameters are crucial. In particular, a query often has no single correct answer in translation, because there is no common criterion for deciding which answer is right and which is wrong.

For each of the aforementioned problems in cross-media retrieval, we provide a taxonomy of classes and sub-classes. A detailed taxonomy is provided in Fig. 1. We found that some key issues of deep learning in cross-media retrieval, concerning concepts, methodologies, and benchmarks, are still not clear in the literature. To tackle the aforementioned challenges, we investigate DNN-based methods for cross-media retrieval.

I-A Comparison with Related Survey Articles

Our survey is unique in the sense that it comprehensively covers the area of DNN-based cross-media retrieval. To the best of our knowledge, there is no prior detailed survey article that jointly considers DNNs and cross-media retrieval. Although there is an extensive literature of survey articles on DNNs or on cross-media retrieval, these surveys focus on only one of the two topics.

General surveys regarding deep learning are discussed in [lecun2015deep, liu2017survey, ahmad2019deep, pouyanfar2018survey]. Surveys dealing only with the cross-media retrieval domain are presented in [liu2010cross]. Our work is closely related to [wang2016comprehensive, peng2017overview]; however, they cover the broader picture of the cross-media retrieval domain, whereas our work focuses on DL-based cross-media retrieval. Furthermore, we provide a novel taxonomy according to the challenges faced by multi-modal deep learning approaches in solving cross-media retrieval, namely: representation, alignment, and translation. To the best of our knowledge, this is the only work to date that provides a detailed survey of DL-based methods for solving cross-media retrieval challenges (representation, alignment, and translation). A summarized comparison of survey articles on DL and cross-media retrieval is provided in Table I.

I-B Our Contributions

To summarize, our main objectives in this paper are as follows.

Figure 2: A generalized framework of cross-media retrieval system.
  • Provide an up-to-date survey of current advances in cross-modal retrieval. This adds value compared to previous surveys and helps readers quickly understand the trends in cross-media retrieval.

  • Provide a useful categorization of cross-media retrieval under DNN approaches. The contrasts between the various types of techniques are explained, which helps readers better understand the deep learning techniques used in cross-media retrieval.

  • Provide a detailed explanation of almost every cross-media dataset. Furthermore, their advantages and disadvantages are discussed to help developers and researchers choose a suitable dataset for their learning algorithms.

  • Present the key challenges and opportunities in the area of cross-media retrieval and discuss open future research challenges.

II An Overview of Cross-media Retrieval and Deep Learning

Before going into depth, we introduce the fundamental concepts of cross-media retrieval and deep learning techniques. This section is divided into three subsections: subsection A discusses cross-media retrieval, subsection B discusses deep learning techniques and the algorithms used for representation learning, and subsection C explains why DL is important for cross-media retrieval.

II-A Cross-media Retrieval

Cross-media retrieval is the search for data of different modalities (e.g., images, texts, videos) given a query of any single modality. The generic framework of cross-media retrieval is shown in Fig. 2, in which data are represented in different modalities such as text, image, and video. Different algorithms (e.g., CNN, SIFT, LDA, TF-IDF, etc.) are applied to learn the feature vectors of each modality. Furthermore, cross-media correlation learning is performed on these features to obtain a joint semantic space for the multimodal data. Finally, the semantic representations allow the cross-media retrieval system to rank and summarize the search results.
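The final ranking stage of the framework can be illustrated with a minimal sketch, assuming NumPy and assuming that the text query and the images have already been mapped into a joint semantic space. The function name `rank_by_cosine` and the toy embeddings are hypothetical; cosine similarity is one common choice of similarity measure, not one prescribed by the framework.

```python
import numpy as np

def rank_by_cosine(query_vec, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                   # cosine similarity of each item to the query
    order = np.argsort(-scores)      # descending similarity
    return order, scores[order]

# Toy common-space embeddings (made-up numbers): one text query, three images.
text_query = np.array([0.9, 0.1, 0.0])
image_gallery = np.array([
    [0.0, 1.0, 0.0],   # image 0
    [1.0, 0.2, 0.0],   # image 1
    [0.5, 0.5, 0.5],   # image 2
])
order, scores = rank_by_cosine(text_query, image_gallery)
```

The same function works in the opposite direction (an image query against a gallery of text embeddings), which is what makes a shared space convenient for cross-media search.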

It is important to note that cross-media retrieval differs from other correlation matching approaches between media types (e.g., image and text). For example, correlation matching approaches [yang2017image, vinyals2015show] are used only to generate text descriptions of images/videos, whereas cross-media retrieval approaches endeavor to retrieve text given image/video data and vice versa. Image annotation methods [ballan2014cross] assign the most relevant tags to images as descriptions, whereas in cross-media retrieval the text also includes sentence and paragraph descriptions instead of only tags.

Cross-media retrieval is an open research issue in real-world applications. With the popularity of social media platforms (e.g., Facebook, Twitter, YouTube, Flickr, and Instagram), different types of media (images, videos, texts) are flooding the Internet. To tackle this issue, different cross-media retrieval approaches have been proposed [8695043, 8643797, 8673892, 8691806, 8716706]. However, in this paper we only consider DNN-based cross-media retrieval approaches that learn common representations, because DNN-based approaches improve the performance of different learning algorithms in the cross-media retrieval domain. Moreover, to our knowledge, this is the only survey that jointly considers DNNs and cross-media retrieval. We categorize the DNN-based methods for each challenge of cross-media retrieval into four classes: (1) unsupervised methods, (2) supervised methods, (3) pairwise based methods, and (4) rank based methods.

  1. Unsupervised methods. Unsupervised methods leverage co-occurrence information instead of label information to learn common representations across data of different modalities. Specifically, these methods treat the different modalities that co-occur in one multi-modal document as sharing the same semantics. For instance, a web page may contain both text and pictures illustrating the same theme, and users draw on both the images and the text to understand a particular event or topic on that page.

  2. Supervised methods. In supervised methods, label information is used to learn common representations. These methods increase the correlation among intra-class samples and decrease the correlation among inter-class samples to obtain discriminative representations. However, obtaining annotated data is costly and laborious because of manual labelling.

  3. Pairwise based methods. These methods learn common representations from similar/dissimilar pairs by learning a semantic distance metric between data of different modalities.

  4. Rank based methods. These methods learn common representations for cross-media retrieval through learning to rank.
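The difference between the pairwise and rank based classes can be sketched with two widely used loss functions, assuming NumPy. The contrastive loss and the margin-based triplet loss below are generic textbook formulations chosen for illustration; the function names, margins, and toy vectors are assumptions and do not reproduce any specific method from the literature surveyed here.

```python
import numpy as np

def contrastive_loss(img, txt, similar, margin=1.0):
    """Pairwise loss: pull similar pairs together, push dissimilar pairs
    at least `margin` apart (Euclidean distance in the common space)."""
    d = float(np.linalg.norm(img - txt))
    return d ** 2 if similar else max(0.0, margin - d) ** 2

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_ranking_loss(query, positive, negative, margin=0.2):
    """Rank loss: the matching item should score higher than the
    non-matching item by at least `margin`."""
    return max(0.0, margin - cosine(query, positive) + cosine(query, negative))

# Toy embeddings in a 2-D common space (made-up numbers).
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

pair_match = contrastive_loss(a, a, similar=True)      # identical pair, no penalty
pair_far = contrastive_loss(a, b, similar=False)       # already farther than margin
pair_wrong = contrastive_loss(a, b, similar=True)      # similar pair that is far apart
rank_good = triplet_ranking_loss(a, positive=a, negative=b)
rank_bad = triplet_ranking_loss(a, positive=b, negative=a)
```

The pairwise loss looks at one pair and its similar/dissimilar label at a time, whereas the ranking loss only constrains the relative order of a matching and a non-matching item for the same query.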

II-B Deep Learning Techniques

Deep Learning (DL) is a sub-class of Machine Learning (ML). DL networks are neural networks that discover important object features. These algorithms attempt to learn multiple levels of representation by using a hierarchy of layers. If the system is provided with a large amount of data, it begins to understand it through feature extraction and to respond in useful ways. Most deep learning algorithms are built on neural network architectures, which is why they are often called Deep Neural Networks (DNNs).

Different DL architectures (Deep Neural Networks, Convolutional Neural Networks, Deep Belief Networks, Recurrent Neural Networks) have been successful in efficiently solving many computer vision problems whose solutions are difficult to obtain analytically. These problems include handwritten digit recognition, optical character recognition, object classification, face detection, image captioning, and facial expression analysis [bengio2013representation, schmidhuber2015deep, lecun2015deep].

Currently, DL algorithms are also being tested in interdisciplinary research domains, such as bio-informatics, drug design, medical image analysis, material inspection, agriculture, and hydrology [Rehman2019water, bovsnavcki2019deep, stephenson2019survey, bastian2019visual, alhnaity2019using]. Progress in these fields increasingly depends on deep learning, which is itself still evolving and in need of creative ideas [cirecsan2012multi, krizhevsky2012imagenet, marblestone2016toward].

II-B1 Evolution and Classification of Deep Learning Techniques

Figure 3: An overview of the evolution of deep learning from conventional Machine Intelligence and Machine Learning paradigms.

Since the early excitement stirred by ML in the 1950s, smaller subsets of machine intelligence have been impacting a myriad of applications over the last three decades, as shown in Fig. 3. The term “deep learning” was introduced to the machine learning community by Rina Dechter in 1986 [AAA12, lecun2015deep], and to artificial neural networks by Igor Aizenberg and his colleagues in 2000, in the domain of Boolean threshold neurons [AAA14, AAA15]. Earlier, in 1965, Alexey Ivakhnenko and Lapa had published the first general learning algorithm for feed-forward, supervised, multi-layer perceptrons [A91]. In 1980, Kunihiko Fukushima introduced the Neocognitron in the computer vision domain [A92]. Furthermore, in 1989 Yann LeCun applied the standard backpropagation algorithm to a deep neural network for handwriting recognition [A93, A94, A95, A96].

Although deep learning has existed for more than three decades, it has only recently gained interest in the machine learning community. Before 2006, attempts to train large deep architectures generally failed. In 2006, successful training schemes for deep architectures emerged with the algorithms for training Deep Belief Networks (DBNs) by Hinton et al. [A97] and autoencoders by Ranzato et al. [A98] and Bengio et al. [A99], based on unsupervised pre-training followed by supervised fine-tuning. Following the same path, different approaches were proposed to deal with the aforementioned issues under different circumstances.

Before 2011, CNNs had not succeeded in efficiently solving computer vision problems, even though the backpropagation algorithm had been used for decades to train CNNs and Graphics Processing Unit (GPU) implementations of Neural Networks (NNs), including CNNs, had existed for years [A100, A101]. In 2011, however, CNNs achieved superhuman performance in a visual pattern recognition contest. In the same year, CNNs also won the ICDAR Chinese handwriting contest. The success of deep learning algorithms in image and object recognition began in 2012. In May 2012, CNNs won the ISBI image segmentation contest [A102], which attracted significant attention from researchers. At CVPR 2012, Ciresan et al. showed how max-pooling CNNs on GPUs can dramatically improve several computer vision benchmark records [A103]. Following the same path, in October 2012, Krizhevsky et al. [krizhevsky2012imagenet] showed the dominance of DNNs over shallow machine learning methods by winning the large-scale ImageNet competition by a large margin.

Researchers believe that the ImageNet victory in the Large Scale Visual Recognition Challenge (ILSVRC) 2012 marked the beginning of the “deep learning revolution” that has transformed the Artificial Intelligence (AI) industry.


II-C Why DL for Cross-media Retrieval?

Before going into detail, it is useful to understand the reasons for applying DNNs to cross-media retrieval. DNNs have several attractive characteristics: (1) they are end-to-end learning models, (2) they can be trained efficiently using back-propagation, and (3) their performance increases as the size of the data increases [moreira2019contextual, zhang2019deeprec, you2019hierarchical]. Furthermore, the architecture of DNNs is hierarchical and trained end-to-end. The main advantage of such an architecture appears when dealing with multimedia data. For example, a web page may contain textual data (reviews [zheng2017joint], tweets [gong2016hashtag]), visual data (posts, scenery images), audio data, and video data. Here, modality-specific feature extraction is complex and time consuming. To process textual data, for instance, we would first need to perform expensive and time-consuming pre-processing (e.g., keyword extraction, main topic selection). DNNs, however, can process all the textual information in a sequential end-to-end manner [zheng2017joint]. These properties make DNNs well suited to multi-modal tasks [zhang2017bjoint] and motivate the use of neural end-to-end learning models.

As far as interaction-only settings (i.e., matrix completion) are concerned, DNNs are necessary for dealing with huge amounts of training data and high complexity. He et al. [he2017neural] outperformed the conventional Matrix Factorization (MF) method by using a Multi-Layer Perceptron (MLP) to approximate the interaction function. Moreover, typical ML models (e.g., BPR and MF) also achieve their best performance on interaction-only data when trained with momentum-based gradient descent [tay2018latent]. Nevertheless, these models also benefit from recent DNN-based improvements such as batch normalization, Adam, and optimized weight initialization [he2017neural, zhang2018neurec]. In fact, most cross-media retrieval algorithms have adopted DNN-based structures to improve their performance, such as Deep Canonical Correlation Analysis (DCCA) [liu2019new], Deep Canonically Correlated Auto-Encoder (DCCAE) [wang2016deep], and Discriminative Deep Canonical Correlation Analysis (DisDCCA) [elmadany2016multiview]. Therefore, DL is a highly useful tool in today’s research and industrial environments for the advancement of cross-media retrieval methods.

We summarize some of the useful strengths of DNNs based cross-media retrieval models, which are as follows:

II-C1 Flexibility

DNN-based approaches are also known as global learning because of their vast application domain. The flexibility of DL methods has been further boosted by the advent of well-known DL frameworks such as Caffe, TensorFlow, PyTorch, Keras, Theano, and MXNet. Each of these frameworks has an active community and support, which makes development and engineering easier and more efficient. For instance, concatenating different neural models becomes easier and produces more powerful hybrid structures. Hence, hybrid cross-media retrieval models that capture better features and perform well become easier to implement.

II-C2 Generalization

This property makes DL methods unique and in high demand. They can be used in many different applications and with different data types. For example, in transfer learning, DL-based methods can share knowledge across different tasks. Because DL algorithms capture both low-level and high-level features, the learned features may be beneficial for other tasks [bengio2013representation]. Andreas et al. [andreas2015deep] and Perera et al. [perera2019deep] demonstrated the successful performance of DNN-based methods in transfer learning.

II-C3 Nonlinear Transformation

DNN-based models can handle non-linearity in data by using non-linear activation functions such as sigmoid, ReLU, and tanh. This helps the models capture complex patterns within the dataset. Traditional cross-media retrieval methods such as CCA, BLM, and Linear Discriminant Analysis (LDA) are linear models, and they need DNN-based methods to extract nonlinear features. For example, DCCA first uses DNNs to extract nonlinear features and then uses linear CCA to calculate the canonical matrices. It is well known that neural networks can approximate any continuous function by varying the activation functions [abc].
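As a small illustration of the activation functions named above, the following sketch (assuming NumPy) evaluates sigmoid, ReLU, and tanh on a few inputs and checks that ReLU breaks the additivity property that every linear map satisfies. The helper names and test inputs are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
outputs = {"sigmoid": sigmoid(x), "relu": relu(x), "tanh": np.tanh(x)}

# A linear map f satisfies f(a + b) == f(a) + f(b); ReLU does not,
# which is what lets stacked layers model nonlinear relationships.
a, b = np.array([-1.0]), np.array([1.0])
relu_is_additive = bool(np.allclose(relu(a + b), relu(a) + relu(b)))
```

Here `relu(a + b)` is 0 while `relu(a) + relu(b)` is 1, so `relu_is_additive` is `False`. A purely linear model such as CCA cannot represent this kind of behaviour on its own.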

II-C4 Robustness

DL-based methods do not need manually designed feature extraction algorithms; instead, features are learned in an end-to-end manner. Hence, the system achieves better performance despite variations in the input data. The authors of [Gaurav] and [Wicker2019RobustnessO3] showed the robustness of DL against adversarial attacks in visual recognition applications.

III Cross-media Datasets

Datasets play a critical role in the evaluation of learning algorithms. Their selection is very important for feature extraction and for training different DL algorithms. We summarize some of the well-known cross-media datasets below, and Table II compares them.

  1. Wikipedia: this dataset is widely used in the cross-media domain to evaluate the performance of different learning algorithms. It consists of 2,866 image-text pairs from 10 distinct classes collected from Wikipedia articles.

  2. NUS-WIDE: a popular dataset in the cross-media community, second to the Wikipedia dataset. It contains 269,648 labeled images of 81 different concepts from Flickr. Every image in the dataset is aligned with its associated user tags, forming an image-text pair. Overall, the dataset contains 425,059 unique tags associated with these images. To enhance the quality of the tags, those that appear fewer than 100 times or do not exist in WordNet [miller1995wordnet] were pruned. Hence, 5,018 unique tags are included in this dataset.

  3. Pascal VOC: the dataset consists of image-tag pairs from 20 distinct classes, with 5,011 training pairs and 4,952 testing pairs. Some images carry more than one label; several studies in the literature therefore select only uni-labelled images, which results in 2,808 training and 2,841 testing pairs [sharma2012generalized]. The image features chosen were GIST, color, and histogram features [hwang2012reading], whereas the text features were 399-dimensional tag occurrence vectors.

  4. FB5K: the dataset contains 5,130 image-tag pairs with the associated users’ feelings, collected from Facebook [ur2018facebook5k]. The image-text pairs are split into 80% for training and 20% for testing.

  5. Twitter100K: this dataset is made up of 100,000 image-text pairs collected from Twitter. It uses 50,000 and 40,000 image-text pairs for training and testing, respectively. Moreover, about 1/4 of the images in this dataset contain text that is highly correlated with the paired tweets.

  6. XMedia: this is the only dataset in the cross-media domain with five different modalities: video, audio, image, text, and 3-Dimensional (3D) model. It consists of 20 distinct classes, such as elephant, explosion, bird, and dog. Each class contains a total of 600 media instances: 250 texts, 250 images, 25 videos, 50 audio clips, and 25 3D models. The data were collected from popular websites, namely Flickr, YouTube, Wikipedia, 3D Warehouse, and the Princeton 3D model search engine.

  7. Flickr30K: this dataset is the extended version of the Flickr8k dataset [hodosh2013framing]. It consists of 31,783 images collected from Flickr. Each image is associated with five descriptive sentences written by native English speakers.

  8. INRIA-Websearch: this dataset contains 353 image search queries, along with their meta-data and ground-truth annotations. In total, it consists of 71,478 images.

  9. IAPR TC-12: the dataset consists of 20,000 images (plus 20,000 corresponding thumbnails) taken at locations around the world and comprising a varied cross-section of still natural images. The images were collected between 2001 and 2005. Moreover, the collection is spatially diverse, as the images were collected from more than 30 countries.

  10. ALIPR: the dataset contains annotation results for more than 54,700 images created by users, which are viewable on the ALIPR website.

  11. LabelMe: the dataset contains 30,000 images associated with 183 labels. The main source of the dataset was crowd-sourcing through the MIT CSAIL database of objects and scenes.

  12. Corel5K: the dataset was collected from 50 Corel Stock Photo CDs. It consists of 5,000 images in total, with 100 images per topic. Each image is associated with 1-5 keywords, from a total of 371 keywords. Before modelling, all the images in the dataset are pre-segmented using normalized cuts [shi2000normalized]. The representation consists of a total of 36 features: 18 color features, 12 texture features, and 6 shape features.

  13. Corel30K: this dataset is an extended version of Corel5K. It contains 31,695 images and 5,587 associated words. It uses 90% (28,525) of the images for training and 10% (3,170) for testing. Compared with Corel5K, it is much improved in terms of examples per label and database size, and hence plays a significant role in evaluating learning systems.

  14. AnnoSearch: the dataset contains 2.4 million photos collected from popular websites and from the University of Washington (UW). The images are of high quality and have rich associated descriptions, such as title, category, and comments from the photographers. These descriptions cover, to a certain degree, the concepts of the associated images.

  15. Clickture: this dataset was obtained from one year of click logs of a commercial image search website. It contains 212.3 million triads. Each triad is defined as

    $(K, Q, C)$,

    meaning that image $K$ was clicked $C$ times in the search results of query $Q$ within one year, by different users at different times. The full Clickture dataset consists of 40 million unique images and 73.6 million unique text queries. Moreover, there is also a lite version, titled “Clickture-Lite”, which consists of 1 million images and 11.7 million text queries.

  16. ESP: the dataset contains more than 10 million images. The key source of the dataset was crowd-sourcing through an online image-labeling game. The main objective of this cross-media dataset is to label most of the images on the Internet. Its authors envisioned that if the game were hosted on a popular gaming site, such as Yahoo! Games, and people played it as they play other games, most images could be labeled within a matter of weeks. Furthermore, it is predicted that if 5,000 people played the game regularly for 31 days, they could assign labels to all Google images.
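The tag pruning described for NUS-WIDE (item 2) can be sketched as a simple filter. This is a toy illustration under stated assumptions: the function name `prune_tags` is hypothetical, a small Python set stands in for the WordNet vocabulary, and a tag is kept only if it is both frequent enough and present in that vocabulary. A threshold of 2 is used here instead of 100 so that the toy data stay small.

```python
from collections import Counter

def prune_tags(tagged_images, vocabulary, min_count=100):
    """Keep tags that occur at least `min_count` times and are in `vocabulary`."""
    counts = Counter(tag for tags in tagged_images for tag in tags)
    kept = {t for t, c in counts.items() if c >= min_count and t in vocabulary}
    pruned = [[t for t in tags if t in kept] for tags in tagged_images]
    return pruned, kept

# Toy data: each inner list holds the user tags of one image.
images = [
    ["sky", "beach", "xyz123"],
    ["sky", "dog"],
    ["beach", "xyz123"],
    ["sky"],
]
wordnet_stand_in = {"sky", "beach", "dog"}   # assumption: stand-in for WordNet

pruned, kept = prune_tags(images, wordnet_stand_in, min_count=2)
```

In this toy run, "xyz123" is dropped because it is not in the vocabulary and "dog" is dropped because it appears only once, leaving "sky" and "beach".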

Ref Dataset Year Data size URL Image Text Tags Video Audio 3D Model
[rasiwasia2010new] Wikipedia 2010 2,866 - - - -
[chua2009nus] NUS-WIDE 2009 269,648 - - -
[hwang2012reading] Pascal VOC 2015 9,963 - - -
[young2014image] Flickr30K 2014 31,783 - - - -
[krapac2010improving] INRIA-Websearch 2010 71,478 - - - -
[ur2018benchmark] FB5K 2018 5,140 - - ✔ - - -
[hu2018twitter100k] Twitter100K 2018 100,000 - - - -
[peng2018overview] Xmedia 2018 12,000 -
[grubinger2006iapr] IAPR TC-12 2006 20,000 - - - -
[li2011real] ALIPR 2011 - - - -
[carneiro2007supervised] SML 2007 - - - - - - - -
[lavrenko2004model] Corel5K 2007 5,000 - - -
[von2004labeling] ESP 2004 - - - - - -
[russell2008labelme] LabelMe 2008 - LabelMe/intro.html - - - -
[wang2006annosearch] AnnoSearch 2006 - - - - -
[hua2013clickage] Clickture 2013 - - - - -

Table II: A summary of datasets in cross-media retrieval. For each dataset we identify the modality used to tackle the problem of cross-media retrieval.

IV Challenges in Cross-media Retrieval and Proposed DL based Methods

In this survey, we provide a novel taxonomy according to the challenges faced by multi-modal deep learning approaches in solving cross-media retrieval. Subsection A explains data representation in cross-modal retrieval, which has always been a difficult task in deep learning. Subsection B describes multimodal alignment, which is also challenging because it requires finding the relationships and correlations between different instances across modalities. Finally, Subsection C considers translation, which refers to mapping data from one modality to another. To tackle these problems, we present an extensive review of the state-of-the-art problems and their corresponding solutions to leverage deep learning in cross-media retrieval applications. This taxonomy will enable researchers to better understand the state-of-the-art problems and solutions, and to identify future research directions.

IV-A Representations

Figure 4: An illustration of multimedia for learning shared space representations utilizing deep learning model.

Data representation in cross-modal retrieval has always been a difficult task in deep learning. Multi-modal representations deal with the representation of data from multiple domains. Learning a common semantic space from different modalities faces several challenges, such as combining data from heterogeneous sources (image, text, video), handling noise, and handling missing data in some modalities. Semantic data representation tries to learn the correlation across different modalities. First, to represent multimodal data in a common semantic space, cross-media correlation learning is performed for feature extraction. Then, the semantic representations allow the cross-media retrieval system to perform search result ranking and summarization. Semantic data representation is therefore essential for multi-modal problems and improves the performance of any cross-media retrieval model.

Semantic representations are non-uniform in a low-level feature space. For example, modeling a broad theme, such as “Asia”, is more challenging than modeling a specific theme, such as “sky”, because no significant, unique visual feature characterizes the concept of “Asia”. Neglecting such semantic representation would therefore be inappropriate, and a good representation is indispensable for deep learning models. Bengio et al. [bengio2013representation] discussed several properties of good representations, such as sparsity, smoothness, and spatial and temporal coherence. It is important to represent data in a meaningful way to enhance the performance of DNN based cross-media retrieval models.

In recent years, many conventional methods have been replaced by advanced DNN based methods. For instance, the bag of visual words (BoVW) and the scale invariant feature transform (SIFT) were formerly used to represent an image, whereas CNNs [krizhevsky2012imagenet] are now used for image description. Similarly, Mel-frequency cepstral coefficients (MFCC) have been superseded by deep neural networks in the audio domain for speech recognition [hinton2012deep]. An overview of such approaches is shown in Fig. 4, with representative work summarized in Table III.

IV-A1 Unsupervised DNNs based Methods

The major advantage of neural network based joint representations comes from their ability to pre-train on unlabeled data when labeled data is not sufficient for supervised learning. It is also common to fine-tune the resulting representation on the particular task at hand, as a representation constructed from unsupervised data is generic and not necessarily optimal for a specific task [wang2015deepp, rehman2018quantum]. Unsupervised methods use co-occurrence information instead of label information to learn common representations across data of different modalities. Srivastava et al. [srivastava2012learning] learned representations of multimodal data using a Deep Belief Network (DBN). They first model each media type with a separate DBN and then join the two networks by learning a shared RBM on top.

Chen et al. [chen2017deep] proposed conditional generative adversarial networks (CGAN) to achieve cross-modal audio-visual generation (e.g., sound and image). Unlike traditional Generative Adversarial Networks (GANs), their system handles cross-modality generation, such as sound to image (S2I) and image to sound (I2S). Furthermore, a fully connected layer and several deconvolution layers of deep convolutional neural networks are used as the image encoder and decoder, respectively; sound generation is handled in a similar way. Following the same path, Zhang et al. [zhang2017hashgan] proposed an adversarial model called HashGAN. It consists of three main modules: (1) a feature learning module for multi-modal data, which uses a CNN to extract high-level semantic information, (2) a generative attention module, which extracts foreground and background feature representations, and (3) a discriminative hash coding module, which preserves the similarity across modalities.

The Multi-modal Stacked Auto-Encoders (MSAE) model [wang2014effective] projects features from different modalities into a common latent space for efficient cross-modal retrieval. According to its authors, this model offers three advantages over earlier approaches. First, its non-linear mapping is more expressive. Second, since it is an unsupervised learning method, its dependency on labeled data is minimal. Third, its memory usage is optimized and independent of the training dataset size. In contrast, the authors of [fan2017unsupervised] proposed an unsupervised deep learning approach in a text subspace for cross-media retrieval. They claimed that the proposed text subspace is more efficient and useful than a conventional latent subspace.

IV-A2 Supervised DNNs based Methods

Ngiam et al. [ngiam2011multimodal] were the first to address a multimodal deep learning approach in audio and video retrieval. They trained deep networks on a series of multimodal learning tasks to learn a shared representation across modalities and tested it on a single modality; for example, the system was trained with video data but tested with audio data, and vice versa.

Deep Cross-modal Hashing (DCMH) [jiang2016deep] efficiently reveals the correlations across modalities. It is an end-to-end learning paradigm that integrates two parts: (1) a feature learning part, and (2) a hash-code learning part. Cao et al. [cao2016deep] proposed the Deep Visual-Semantic Hashing (DVSH) model, which utilizes two different DNN models, a CNN and a Long Short Term Memory (LSTM) network, to learn similar representations for visual data and natural language.
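Once hash codes have been learned for each modality, cross-modal retrieval reduces to ranking database items by their Hamming distance to the query code. The following is a minimal sketch of that retrieval step only, assuming the binary codes are already given; the function names and toy codes are hypothetical and are not taken from DCMH or DVSH.

```python
def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(a != b for a, b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by increasing Hamming distance to the query."""
    distances = [hamming_distance(query_code, code) for code in database_codes]
    return sorted(range(len(database_codes)), key=lambda i: distances[i])


# Toy example (hypothetical values): one text query code against three image codes.
text_query = [1, 0, 1, 1]
image_codes = [[0, 1, 0, 0], [1, 0, 1, 0], [1, 0, 1, 1]]
ranking = rank_by_hamming(text_query, image_codes)
```

Because the comparison only counts differing bits, ranking stays cheap even for large databases, which is the usual motivation for hashing-based retrieval.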

Wang et al. [wang2015deep] proposed a regularized deep neural network (RE-DNN), which utilizes deep CNN features and topic features as visual and textual semantic representations across modalities. This model is able to capture both intra-modal and inter-modal relationships for cross-media retrieval. They further improved their work in [wang2016joint, wang2013learning] by coupling common subspace learning and feature selection in a joint feature learning framework. Unlike previous models, this approach considers the correlation and feature selection problems at the same time. They learn the projection matrices through linear regression to map cross-modality data into a common subspace, and use the $\ell_{2,1}$-norm to select similar/dissimilar features from the various feature spaces. Furthermore, the inter-modality and intra-modality similarities are preserved through multimodal graph regularization.
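To make the feature selection step concrete, the sketch below computes an $\ell_{2,1}$ norm, assuming this is the norm meant above: the sum of the Euclidean norms of the rows of a projection matrix. Penalizing it drives whole rows toward zero, so the corresponding features are effectively discarded. The matrix `W` and the function name are hypothetical illustrations.

```python
import math


def l21_norm(matrix):
    """Sum of the Euclidean norms of the rows of a matrix given as a list of rows."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


# Hypothetical projection matrix: rows index input features,
# columns index dimensions of the common subspace.
W = [[3.0, 4.0],   # row norm 5
     [0.0, 0.0],   # row norm 0: this feature is not selected
     [0.0, 1.0]]   # row norm 1
value = l21_norm(W)
```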

IV-A3 Pairwise-based DNNs Methods

These methods utilize similar/dissimilar pairs to learn a semantic metric distance between data of different modalities, which is termed heterogeneous metric learning.

Social media networks, e.g., Flickr, Facebook, YouTube, WeChat, and Twitter, have produced immense amounts of data on the web and have therefore attracted considerable attention. They play a significant role in multimedia related applications, including cross-media retrieval. Social media networks differ from traditional media networks and pose unique challenges to data analysis: 1) the data on social media websites are diverse and noisy, and 2) the data are heterogeneous and present in different modalities, e.g., image, text, video, and audio, on the same platform. To predict the links between instances of social media, Yuan et al. [yuan2013latent] proposed a latent feature learning approach based on Relational Generative Deep Belief Nets (RGDBN). This model learns latent features for social media by utilizing the relationships between social media instances in the network. By integrating the Indian buffet process into a modified DBN, they learn the optimal latent features that best embed both the media content and its relationships. The proposed RGDBN is able to analyze the correlation between homogeneous and heterogeneous data for cross-media retrieval.

Following the same path, Wang et al. [wang2015image] proposed a Modality-Specific Deep Structure (MSDS) based on modality-specific feature learning. The MSDS model uses two different types of CNN to represent raw data in the latent space, and the semantic information between images and texts in the latent space is learned with a one-vs-more learning scheme. Deep Cross-Modal Hashing (DCMH) [jiang2017deep] extends traditional deep models for cross-modal retrieval, but it can only capture intra-modal information and ignores inter-modal correlations, which makes the retrieved results suboptimal. To overcome this limitation, Pairwise Relationship guided Deep Hashing (PRDH) [yang2017pairwise] adopts deep CNN models to learn feature representations and hash codes for each modality in an end-to-end architecture. Moreover, decorrelation constraints are integrated into the single deep architecture to enhance the classification performance of each individual hash bit.

IV-A4 Rank-based DNNs Methods

These methods utilize rank lists to learn semantic representations, in which each candidate is ranked by its similarity distance to the query. In this regard, Hu et al. [hu2018twitter100k] reported high cross-media retrieval performance using a dual-CNN architecture, which models image and text independently and then ranks candidates by their similarity distance to the query. Frome et al. [frome2013devise] introduced a deep visual-semantic embedding (DeViSE) approach that leverages useful information learned in the text domain and transfers it to a system trained for visual recognition. Similarly, Weston et al. [weston2010large] employed an online learning-to-rank approach, called WSABIE, to train a joint embedding model of labels and images. The authors of [srivastava2012multimodal] developed a Deep Boltzmann Machine (DBM) to represent the joint cross-modal probability distribution over sentences and images. Different from RNN-based approaches, Socher et al. [socher2014grounded] introduced Dependency Tree Recursive Neural Networks (DT-RNNs), which embed one modality (e.g., sentences) into a vector space using dependency trees in order to retrieve the other modality (e.g., images). However, these methods reason about the image only at the global level, using a single fixed-size representation from the top layer of a CNN as a description of the entire image. In contrast, the model presented in [karpathy2014deep] explicitly addresses the challenges posed by complex scenes. They formulated a max-margin objective for a DNN that learns to embed both image and text into a joint semantic space. The ranking function for joint image-text representations is:



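$$C(\theta) = \sum_{k} \Big[ \sum_{l \neq k} \max(0, S_{kl} - S_{kk} + \Delta) + \sum_{l \neq k} \max(0, S_{lk} - S_{kk} + \Delta) \Big]$$

where $S_{kl}$ denotes the score between image $k$ and sentence $l$, and $\Delta$ is a hyperparameter that is cross-validated. The objective stipulates that the score for true image-sentence pairs $S_{kk}$ should be higher than $S_{kl}$ or $S_{lk}$ for any $l \neq k$ by at least a margin of $\Delta$.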
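A bidirectional max-margin ranking objective of this kind can be sketched in a few lines. The sketch assumes a square score matrix in which `scores[k][l]` is the score between image `k` and sentence `l`, with matching pairs on the diagonal; the function name and the toy scores are hypothetical, and the way the scores themselves are computed by the network is not modeled here.

```python
def max_margin_ranking_loss(scores, margin):
    """Bidirectional hinge ranking loss over a square image-sentence score matrix.

    scores[k][l] is the score between image k and sentence l; matching pairs
    are assumed to lie on the diagonal.
    """
    n = len(scores)
    loss = 0.0
    for k in range(n):
        for l in range(n):
            if l == k:
                continue
            # Image k should score higher with its own sentence than with sentence l.
            loss += max(0.0, scores[k][l] - scores[k][k] + margin)
            # Sentence k should score higher with its own image than with image l.
            loss += max(0.0, scores[l][k] - scores[k][k] + margin)
    return loss


# Toy scores (hypothetical values).
well_separated = [[5.0, 1.0], [0.5, 4.0]]    # true pairs win by more than the margin
poorly_separated = [[1.0, 1.5], [0.5, 1.0]]  # true pairs do not win by the margin
loss_good = max_margin_ranking_loss(well_separated, margin=1.0)
loss_bad = max_margin_ranking_loss(poorly_separated, margin=1.0)
```

The loss is zero as soon as every true pair beats all mismatched pairs by the margin, so training effort concentrates on pairs that still violate the ranking.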

Reference | Modalities | Representation
[chen2017deep], [srivastava2012learning], [zhang2017hashgan], [wang2014effective], [fan2017unsupervised] | Audio and Images; Text and Images | Unsupervised
[ngiam2011multimodal], [jiang2016deep], [wang2015deep], [wang2016joint, wang2013learning], [cao2016deep] | Audio and Video; Text and Images; Images and Audio | Supervised
[yang2017pairwise, yuan2013latent, wang2015image] | Audio and Images; Text and Images | Pairwise
[hu2018twitter100k], [frome2013devise], [karpathy2014deep], [weston2010large], [srivastava2012multimodal], [socher2014grounded] | Text and Images; Label and Images; Sentences and Images | Rank-based

Table III: Summary of DNN based methods for the cross-media representations task.

IV-B Alignment

Figure 5: An example of cross-media multi-level alignment for correlation learning, which not only explores global alignment between original instances and local alignment between fine-grained patches, but also captures relation alignment lying in the context.

Multimodal alignment is a challenging task in cross-media retrieval. It consists of finding the relationships and correlations between different instances across modalities, for example, aligning the text and images of a particular website, since the reader gains a better understanding from both modalities present on a webpage than from just one. Multimodal alignment is significant for cross-media retrieval, as it allows us to retrieve content of a different modality based on the input query (e.g., image retrieval with text as the query, and vice versa), as shown in Fig. 5. Furthermore, we summarize different DNN based methods for the cross-media alignment task in Table IV.

Reference | Modalities | Alignment
[rehman2018benchmark, kruskal1983overview, dong2018semi, yan2018joint, chung2018unsupervised] | Image and Text; Speech and Text | Unsupervised
[qi2018cross], [jourabloo2016large] | Image and Text; Image and Gesture | Supervised
[feng2014cross, yuan2018recursive, peng2018ccl, zheng2017dual] | Image and Text | Pairwise

Table IV: Summary of DNN based methods for the cross-media alignment task.

IV-B1 Unsupervised DNNs based Methods

Unsupervised methods operate without label information between instances from different modalities. These methods enforce some constraints on the alignment, such as the temporal ordering of sequences and the existence of a similarity metric between the modalities.

To align multi-view time series, Kruskal et al. [kruskal1983overview] proposed the Dynamic Time Warping (DTW) approach, which measures the similarity between two sequences and finds an optimal match between them through time warping (frame insertion). DTW can be used directly for multimodal alignment by hand-crafting similarity metrics between modalities; for example, Rehman et al. [rehman2018benchmark] introduced a similarity measurement between texts, images, and users’ feelings to align images and texts.
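The dynamic-programming core of DTW can be sketched as follows. This is the textbook recurrence, not the formulation of any specific cited paper; the default absolute-difference `dist` is a placeholder assumption that would be replaced by a hand-crafted cross-modal similarity metric in a multimodal setting.

```python
def dtw_distance(seq_a, seq_b, dist=lambda a, b: abs(a - b)):
    """Minimal cumulative cost of aligning two sequences with dynamic time warping."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] holds the best cost of aligning the first i items of seq_a
    # with the first j items of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: repeating an item of seq_b,
            # repeating an item of seq_a, or advancing both sequences.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# The second sequence repeats one frame, which warping absorbs at no cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```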

Because DTW requires a pre-defined similarity metric between the modalities, the original formulation was extended with canonical correlation analysis (CCA) [dong2018semi, yan2018joint]. Trigeorgis et al. [trigeorgis2018deep] proposed a Deep Canonical Time Warping (DCTW) approach to automatically learn complex non-linear representations of multiple time series that are highly correlated and temporally aligned. Yan et al. [yan2015deep] proposed an end-to-end approach based on deep CCA. They formulated the objective function as:



and the objective function can be rewritten as follows.


Furthermore, Yan et al. [yan2015deep] also optimized the memory consumption and speed of the DCCA framework using a GPU implementation with the CULA libraries, which significantly increases efficiency compared with the CPU implementation.
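For orientation, the quantity that CCA-style objectives maximize can be computed directly for fixed features: the total correlation between two views equals the sum of the singular values of $T = \Sigma_{xx}^{-1/2} \Sigma_{xy} \Sigma_{yy}^{-1/2}$. The NumPy sketch below illustrates this general CCA quantity only; it is not a reconstruction of the specific equations or the GPU implementation in [yan2015deep], and the ridge term `reg`, the function names, and the toy data are assumptions made for the example.

```python
import numpy as np


def inv_sqrt(mat):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def total_correlation(X, Y, reg=1e-6):
    """Sum of canonical correlations between two views.

    X has shape (n_samples, d_x) and Y has shape (n_samples, d_y).
    reg is a small ridge term (an assumption of this sketch) that keeps
    the covariance matrices invertible.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cov_xx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    cov_yy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    cov_xy = Xc.T @ Yc / (n - 1)
    T = inv_sqrt(cov_xx) @ cov_xy @ inv_sqrt(cov_yy)
    # The singular values of T are the canonical correlations.
    return np.linalg.svd(T, compute_uv=False).sum()


rng = np.random.default_rng(0)
view_a = rng.normal(size=(500, 3))
mixing = np.array([[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 3.0]])
view_b = view_a @ mixing            # a linearly related second view
view_c = rng.normal(size=(500, 3))  # an unrelated view

related = total_correlation(view_a, view_b)
unrelated = total_correlation(view_a, view_c)
```

In a deep variant, `X` and `Y` would be the outputs of two networks and this quantity would be maximized with respect to the network parameters.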

Chung et al. [chung2018unsupervised] proposed an unsupervised cross-modal alignment method to learn the embedding spaces of speech and text. In particular, the approach uses Speech2Vec [chung2018speech2vec] and Word2Vec [mikolov2013distributed] to learn the respective speech and text embedding spaces. It then aligns the two spaces through adversarial training, followed by a refinement step.

IV-B2 Supervised DNNs based Methods

When aligning an image with text, it is important to consider not only the visual regions and keywords but also the relations between them. Such relations are very important for cross-media learning; however, they are ignored in most previous works. For this purpose, Qi et al. [qi2018cross] proposed the Cross-media Relation Attention Network (CRAN) with multi-level alignment, which efficiently handles the relations between different modalities. In another article, Jourabloo et al. [jourabloo2016large] combined a CNN regressor with a 3D Morphable Model (3DMM) to align faces under large pose variations. Dai et al. [dai2018cross] proposed a unified framework for the cross-media alignment task. They proposed a fused objective function, which combines a CCA-like correlation capability with an LDA-like discriminative capability. Further, Jia et al. [jia2018deep] proposed an efficient CNN model with three main parts: the visual part is responsible for visual feature extraction, the text part is responsible for text feature extraction, and the fusion part fuses the image and sentence to generate a decisive alignment score for the tweet (image and sentence pair).

IV-B3 Pairwise-based DNNs Methods

With the recent advances of deep learning in multimedia applications, such as image classification [krizhevsky2012imagenet] and object detection [ren2015faster], researchers have adopted deep neural networks to learn a common space for cross-media retrieval, aiming to fully utilize their considerable ability to model highly nonlinear correlation. Most deep learning based methods construct a multi-pathway network, where each pathway handles the data of one media type, and the pathways are linked at a joint layer to model cross-media correlation. Ngiam et al. propose bimodal autoencoders (Bimodal AE) to extend the restricted Boltzmann machine (RBM) [ngiam2011multimodal]; they model the correlation by mutual reconstruction between different media types. The multimodal deep belief network [srivastava2012learning] adopts two kinds of DBNs to model the distribution over data of different media types and constructs a joint RBM to learn cross-media correlation. Liu et al. propose deep canonical correlation analysis (DCCA) to combine traditional CCA with deep networks [liu2019new], maximizing the correlation on top of two subnetworks. Feng et al. jointly model cross-media correlation and reconstruction information in the correspondence autoencoder (Corr-AE) [feng2014cross]. Furthermore, Yuan et al. propose a recursive pyramid network with joint attention (RPJA) [yuan2018recursive]. They construct a hierarchical network structure with a stacked learning strategy, which aims to fully exploit both inter-media and intra-media correlation. Cross-modal correlation learning (CCL) [peng2018ccl] utilizes fine-grained information and adopts a multi-task learning strategy for better performance. Zheng et al. propose a dual-path convolutional network to learn image-text embeddings [zheng2017dual]. They conduct efficient and effective end-to-end learning directly from the data with the help of supervision. Besides, Plummer et al. provide the first large-scale dataset of region-to-phrase correspondences for image description, based on the Flickr-30K dataset [plummer2015flickr30k], where image regions depict the corresponding entities for richer image-to-sentence modelling.

However, the above methods mainly focus on pairwise correlation, which exists in the global alignment between original instances of different media types. Although some of them attempt to explore local alignment between fine-grained patches, they all ignore the important relation information lying in the context of these fine-grained patches, which can provide rich complementary hints for cross-media correlation learning. Thus, multi-level cross-media alignment should be fully exploited to learn more precise correlations between different media types.

IV-C Translation

Figure 6: A generalized description of example-based multimodal translation. It shows that the system retrieves an efficient translation as soon as it gets a query.

Translation refers to mapping data from one modality to another. For example, given a query in one modality, the task is to retrieve similar information in a different modality. This task is a critical problem in cross-media retrieval [pereira2014role], computer vision, and multimedia [bernardi2016automatic]. An overview of multi-modal translation is shown in Fig. 6, and representative work is summarized in Table V.

In recent years, many deep learning based methods have been proposed to address multimodal translation challenges. The task is important because retrieval across modalities has to fully understand the visual scene and produce a grammatically correct and brief text depicting it. Multimodal translation is a very challenging issue for the deep learning community for several reasons. Foremost, it is often hard to choose an appropriate translation for a particular task, where multiple parameters are crucial. In particular, there is often no single correct answer to a query in translation, as there is no common criterion for deciding which answer is right and which is wrong.

Another important reason is the variety of media and the linguistic, regional, and cultural differences, which demand expertise in the individual domains of translation across image, text, and audio channels. We categorize deep learning based multimodal translation methods into two types: unsupervised and supervised.

IV-C1 Unsupervised DNNs based Methods

These approaches normally rely on finding the nearest samples in the dictionary and, through consensus caption selection, use one of their captions as the translated output. Devlin et al. [devlin2015language] proposed a k-nearest neighbor retrieval approach to achieve translation results.
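The nearest-neighbor idea with consensus caption selection can be sketched as follows: retrieve the k training images closest to the query, then return the candidate caption that is, on average, most similar to the other candidates. This is a simplified illustration and not the implementation of [devlin2015language]; the word-overlap (Jaccard) similarity is a stand-in assumption for a proper caption similarity measure, and all names and toy data are hypothetical.

```python
def jaccard(caption_a, caption_b):
    """Word-overlap similarity between two captions (stand-in similarity measure)."""
    words_a, words_b = set(caption_a.lower().split()), set(caption_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def consensus_caption(query_feature, train_features, train_captions, k=3):
    """Return the caption among the k nearest neighbors that best agrees with the rest."""
    order = sorted(range(len(train_features)),
                   key=lambda i: squared_distance(query_feature, train_features[i]))
    candidates = [train_captions[i] for i in order[:k]]

    def mean_similarity(idx):
        others = [c for j, c in enumerate(candidates) if j != idx]
        if not others:
            return 0.0
        return sum(jaccard(candidates[idx], c) for c in others) / len(others)

    best = max(range(len(candidates)), key=mean_similarity)
    return candidates[best]


# Toy image features and captions (hypothetical values).
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0]]
captions = ["a dog runs on grass", "a dog runs in a park",
            "a dog sits on grass", "a plane in the sky"]
result = consensus_caption([0.0, 0.05], features, captions, k=3)
```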

In [socher2010connecting], the authors projected words and image regions into a common space. Moreover, they used large unlabeled text corpora to learn semantic word representations for cross-media retrieval. Following the same path, Socher et al. [socher2013zero] proposed two different deep neural network models for translation. First, they trained a DNN model on many images in order to obtain rich features [coates2011importance]; at the same time, a neural language model [bengio2003neural] was trained to extract embedding representations of text. They then trained a linear mapping between the image features and the text embeddings to bridge the semantic gap and link the two modalities. Lample et al. [lample2018word] proposed an unsupervised bilingual translation method that can build a bilingual dictionary between two different languages. The key benefit of this method is that it does not use any cross-lingual annotated data; instead, it only uses two monolingual corpora, one for the source and one for the target language.
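The linear mapping between image features and text embeddings can be illustrated with ordinary least squares. This is a simplified sketch rather than the training procedure of [socher2013zero]: the two-dimensional label embeddings, the class prototypes, and the function names are hypothetical, and a real system would use learned CNN features and word vectors.

```python
import numpy as np


def fit_linear_map(image_features, text_embeddings):
    """Least-squares matrix W such that image_features @ W approximates text_embeddings."""
    W, *_ = np.linalg.lstsq(image_features, text_embeddings, rcond=None)
    return W


def nearest_label(projected, label_embeddings):
    """Label whose embedding has the highest cosine similarity to the projected image."""
    names = list(label_embeddings)
    matrix = np.stack([label_embeddings[name] for name in names])
    sims = matrix @ projected / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(projected))
    return names[int(np.argmax(sims))]


rng = np.random.default_rng(0)
# Hypothetical text embeddings and image-feature prototypes for two classes.
label_embeddings = {"cat": np.array([1.0, 0.0]), "truck": np.array([0.0, 1.0])}
prototypes = {"cat": np.array([1.0, 0.0, 1.0, 0.0]),
              "truck": np.array([0.0, 1.0, 0.0, 1.0])}

train_x, train_t = [], []
for name, proto in prototypes.items():
    for _ in range(20):
        train_x.append(proto + 0.1 * rng.normal(size=4))  # noisy image feature
        train_t.append(label_embeddings[name])            # matching text embedding

W = fit_linear_map(np.array(train_x), np.array(train_t))
predicted = nearest_label(prototypes["cat"] @ W, label_embeddings)
```

Because images are mapped into the text embedding space, any label with an embedding can be compared against a projected image, which is what makes this style of mapping attractive for retrieval across modalities.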

IV-C2 Supervised DNNs based Methods

These approaches rely on label information to retrieve cross-modality instances. Yagcioglu et al. [yagcioglu2015distributed] used a CNN-based image representation to translate the given visual query into a distributional semantics based form. Selecting an intermediate semantic space for correlation measurement during retrieval is an alternative way to tackle the translation problem. Socher et al. [socher2014grounded] used an intermediate semantic space to translate a common representation from text to image and vice versa. Similarly, Xu et al. [xu2015jointly] proposed an integrated paradigm that models video and text data simultaneously. Their model contains three fundamental parts: a semantic language model, a video model, and a joint embedding model. The language model embeds sentences into a continuous vector space, whereas the video model uses a DNN to capture semantic correlation from videos. Finally, in the joint embedding model, the distance between the outputs of the deep video model and the language model is minimized in the common space to leverage the semantic correlation between the modalities. Cao et al. [cao2016deep] proposed the Deep Visual-Semantic Hashing (DVSH) model for cross-media retrieval. It generates compact hash codes of visual and text data in a supervised manner and is thereby able to learn the semantic correlation between image and text data. The architecture fuses joint multimodal embedding and cross-media hashing, based on a CNN for images, an RNN for text, and a max-margin objective that incorporates both images and text to enable similarity preservation and standard hash codes. Lebret et al. [lebret2015phrase] used a CNN to generate an image representation, which allows the system to infer phrases that describe the image. Moreover, given a set of top-ranked predicted phrases, a trigram constrained language model is used to generate syntactically correct sentences from different subsets of phrases. Wei et al. [wei2017cross] tackled the cross-media retrieval problem through an approach called deep semantic matching (deep-SM), in which images and text are mapped into a joint semantic space using two independent DNN models.

Popular multimodal techniques commonly learn a semantic space for image and text features to find the semantic correlation between them. However, using the same projection into the semantic space for two different tasks, such as image-to-text and text-to-image retrieval, may lead to performance degradation. Therefore, Wei et al. [wei2016modality] proposed a method called Modality-Dependent Cross-media Retrieval (MDCR) to tackle the projection problem efficiently. In their method, they learn two pairs of projections into the semantic space, one for each retrieval task, instead of a single pair.

Reference | Modalities | Translation
[socher2010connecting], [socher2013zero] | Image and Text | Unsupervised
[yagcioglu2015distributed, socher2014grounded, lebret2015phrase, wei2017cross], [xu2015jointly], [cao2016deep], [wei2016modality] | Image and Text; Video and Text; Image and Audio; Image and Text | Supervised

Table V: Summary of DNNs based methods for the cross-media translation task.

V Discussion

In this section, we provide a summarized overview of each technical challenge, namely: representation, alignment, and translation, with a discussion of future directions and research problems faced by multi-modal deep learning approaches with application to cross-media retrieval as shown in Fig. 7. We also highlight the lessons and “best practices” obtained from our review of the existing work.

V-A Lessons Learned and Best Practices

Based on the reviewed papers, we derive a set of lessons learned and “best practices” to be considered in implementing and deploying deep learning based cross-media retrieval for solving different challenges, such as representation, alignment, and translation. The key criteria used for solving each challenge are described as follows.

V-A1 Representation

We described four major types of deep learning approaches to multimodal representation: unsupervised, supervised, pairwise, and rank based deep learning methods. Unsupervised methods use co-occurrence information instead of label information to learn common representations across data of different modalities. These methods are commonly used for audio-visual speech recognition (AVSR), affect recognition, and multimodal gesture recognition. The remaining three types of methods project each modality into a separate space and are often used in applications where only a single modality is required for retrieval, such as zero-shot learning. Moreover, the networks used for the representation task are mostly static; in the future, however, they may switch dynamically between modalities [baltruvsaitis2019multimodal, pahuja2019structure].

V-A2 Translation

Cross-media translation methods are extremely challenging to evaluate. Some tasks, for instance speech recognition, have a single correct translation, whereas others, such as speech synthesis and image description, do not. Most of the time it is hard to choose an appropriate translation for a particular task, where multiple answers are acceptable. However, a number of probabilistic metrics can be added to help in model evaluation.

Normally, human judgment is used to evaluate such tasks. A group of experts is asked to evaluate each translation manually on some scale, covering speech synthesis quality (mean opinion score) [van2016wavenet, zen2012statistical], the realism of visual speech [taylor2012dynamic, anderson2013expressive], and the relevance and grammatical correctness of media descriptions [chen2015microsoft, kulkarni2013babytalk, mitchell2012midge, venugopalan2014translating]. Preference studies are an alternative option, in which various translations are presented to the participant for comparison [sarfjoo2017using, taylor2017deep]. However, human judgment is a slow and expensive process, and it is also affected by differences in culture, age, and gender preferences. Solving the evaluation challenge would therefore help to advance multimodal translation methods.

V-A3 Alignment

Cross-media alignment has several challenges, which are summarized as follows:

  1. Datasets with clearly annotated alignment are scarce.

  2. The development of common similarity metrics between different modalities is hard.

  3. Elements in one modality may not have a corresponding element in the other modality.

The literature shows that most work on cross-media alignment has focused on aligning sequences in an unsupervised manner using graphical models and dynamic programming methods [rehman2019quantum, ur2019learning, rehman2018design]. Most of these methods use hand-crafted similarity measures between different modalities or rely on unsupervised algorithms. However, supervised learning techniques have recently become popular due to the availability of labeled training data.

Figure 7: Open problems and challenges for future direction

V-B Challenges and Open Problems

V-B1 Dataset Construction

Current state-of-the-art cross-media datasets have significant gaps. First, datasets such as the Wikipedia dataset [rasiwasia2010new] consist of only two media types, i.e., images and texts, and the Pascal VOC 2012 dataset [hwang2012reading] has only 20 classes, whereas cross-media retrieval spans different domains such as images, texts, audio, video, and 3D models. Therefore, handling queries from an unknown domain is challenging for a system trained on a small dataset [ur2019unsupervised]. Second, some current cross-media datasets are deficient in context information, which reduces cross-media retrieval efficiency. Third, a major limitation of benchmark cross-media retrieval datasets, for instance Xmedia [peng2018overview], IAPR TC-12 [grubinger2006iapr], and Wikipedia, is their size. This scarcity of data makes decisions challenging for learning systems. Finally, some datasets, such as ALIPR [li2011real] and SML [carneiro2007supervised], lack proper image labelling aligned with the training set. Furthermore, datasets such as ESP [von2004labeling], LabelMe [russell2008labelme], and AnnoSearch [wang2006annosearch] place no restrictions on the annotation vocabulary, which results in weak semantic linkage among different modalities. The aforementioned discussion suggests that the performance of a cross-media retrieval method depends directly on the nature of the dataset used for evaluation [ur2020optimization]. Therefore, we propose some significant characteristics of a good cross-media retrieval dataset, which are as follows:

  1. Social media platforms are the best source for dataset collection, as they contain varied domains and informal text language.

  2. There must be no constraint on modality categorization.

  3. Besides images and texts, the dataset should also contain other modalities such as video, audio, and 3-dimensional (3D) models, as is expected in real-time scenarios.

  4. To avoid overfitting during network training, the dataset must be significantly large. A large dataset also helps the learning algorithm understand the underlying patterns in the data and produce efficient results.

  5. The dataset should aid in reducing the semantic gap for efficient retrieval by providing coherent visual content descriptors. Datasets with structured alignment between distinct modalities also help the learning algorithm to be more robust.

V-B2 Scalability on large-scale data

With the advancement of technology and the expansion of social media websites around the globe, a large amount of multimedia data is produced over the internet. Fortunately, deep learning models have exhibited very promising and efficient performance in handling huge amounts of data [najafabadi2015deep] with the help of the Graphics Processing Unit (GPU). Therefore, the need for scalable and robust models for distributed platforms is significant. Furthermore, it is worthwhile to investigate how to effectively organize the related data of each modality into a common semantic space. We regard compression procedures [serra2017getting] as one of the promising future directions for cross-media retrieval. High-dimensional input data can be compressed into compact embeddings to reduce the space and computation time required during model learning.
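The idea of compressing high-dimensional inputs into compact embeddings can be illustrated with a Gaussian random projection. This is a stand-in technique picked for brevity, not the compression procedure of [serra2017getting]; the function names, the dimensions (512 to 32), and the seed are assumptions made for the example.

```python
import math
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a fixed Gaussian random projection matrix (out_dim x in_dim).

    Entries are scaled by 1/sqrt(out_dim) so that distances between
    vectors are roughly preserved after projection.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def compress(vector, projection):
    """Map a high-dimensional vector to a compact embedding."""
    return [sum(w * x for w, x in zip(row, vector)) for row in projection]


# Hypothetical 512-dimensional descriptor compressed to 32 dimensions.
projection = make_projection(in_dim=512, out_dim=32)
descriptor = [1.0] * 512
embedding = compress(descriptor, projection)
```

The same projection matrix would have to be reused for every item in the collection so that compressed items remain comparable. A learned encoder could replace the random matrix where training data are available.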

V-B3 Deep Neural Network

Deep learning work on multimodal research is still scarce. Various multimodal hashing techniques have been introduced for cross-media retrieval [bronstein2010data, kumar2011learning, zhen2012co, zhen2012probabilistic, song2013inter, wang2014effective, yu2014discriminative, liu2014collaborative, zhang2014large, wu2015quantized, lin2015semantics, long2016composite]. However, these methods are based on shallow architectures, which cannot efficiently learn semantic information between different modalities. Recently, various deep learning models [frome2013devise, kiros2014multimodal, long2015learning, karpathy2015deep, donahue2015long, gao2015you, andreas2015deep, xia2014supervised, lai2015simultaneous, zhu2016deep, cao2016deep] have been shown to extract semantic information between different modalities more efficiently than shallow methods. However, they were restricted to single-modality retrieval. One promising solution to this problem is transfer learning, which significantly improves the learning task in a specific domain by using knowledge transferred from a different domain. DNN-based models are well suited to transfer learning, as they learn both low-level and high-level features that separate the differences among various cross-media domains.

V-B4 Informal annotations

Social networking websites such as YouTube, Facebook, Instagram, Twitter, and Flickr have produced a large amount of multimodal data over the internet. Generally, this data is poorly organized and has scarce and noisy annotations. However, these annotations provide a correlation between different multimodal data. The key question is how to use the limited and noisy annotations of a large amount of multimodal data to learn semantic information across media.

V-B5 Practical Cross-media Retrieval Applications

Cross-media retrieval is a hot topic these days, and its practical applications will soon become feasible owing to continuous improvements in the performance of multimodal methods. Such applications will provide easy and flexible retrieval from one modality to another. Furthermore, cross-media retrieval is important in many sectors, such as press companies, television, the entertainment industry, and many others. Currently, people do not want to search for text only; they want to visualize things completely. For example, if you are looking for how to install the Windows operating system on your machine, reading a complete article is harder than following a few steps in a video. Moreover, a video explains and visualizes things better than text and is easier to understand. A smart city therefore needs search that is not confined to a single domain, with cross-modal search also at people's fingertips.

V-B6 Evaluation Criteria

In the cross-media community, each time a model is proposed, it is expected to show its effectiveness against numerous baselines. However, most authors do not treat this rigorously and choose baselines and datasets freely. This creates several issues in evaluating cross-media models. First, it makes the reported scores inconsistent, since each author reports their own assessed results, and these results sometimes conflict. For instance, the score originally reported for the NCF model in its pioneering work [he2017neural] ranks very low compared to that of its variant/modified version [zheng2018mars]. This makes it very difficult to compare state-of-the-art neural models. The main question that arises is how to solve this issue. Other domains, such as Natural Language Processing (NLP) and Image Processing, have baseline datasets such as ImageNet and MNIST for the evaluation of models. Therefore, we strongly believe that such a standardized system is needed for the cross-media domain. Second, dataset splits, particularly test sets, must be designed properly. Without this, it is challenging to measure model performance. Finally, when using deep learning models, it is important to consider the size of the dataset, as the performance of deep learning models varies with the amount of data.
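One way to make dataset splits reproducible across papers is to derive each item's split from a hash of a stable identifier. The sketch below is only an illustration under assumptions: the identifier format, the function name `assign_split`, and the 20% test fraction are hypothetical and not taken from any existing benchmark.

```python
import hashlib


def assign_split(pair_id, test_fraction=0.2):
    """Deterministically assign an image-text pair to 'train' or 'test'.

    The decision depends only on pair_id, so every research group that
    uses the same identifiers obtains exactly the same test set.
    """
    digest = hashlib.sha256(pair_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Hypothetical identifiers for paired image-text items.
pair_ids = ["pair-%04d" % k for k in range(1000)]
splits = {pid: assign_split(pid) for pid in pair_ids}
test_ids = [pid for pid, split in splits.items() if split == "test"]
```

Because the image and the text of a pair share one identifier, both always land in the same split, which avoids leaking paired content from the training set into the test set.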

V-B7 Requirement Gap and Conflict

Through our review, we found some blind spots in DNN-based approaches, such as pairwise based DL methods and rank based DL methods, for solving alignment and translation in cross-media retrieval. Pairwise based DL methods aim to learn common representations from similar/dissimilar pairs, learning a semantic metric distance between data of different modalities, whereas rank based DL methods learn common representations for cross-media retrieval through learning to rank. These approaches are necessary to solve the aforementioned challenges in cross-media retrieval. However, they have received little attention in cross-media retrieval, and only a few articles have been published in the shallow domain [bai2018learning, grangier2018discriminative, mcfee2018metric].
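The difference between the two families can be made concrete with their typical loss functions. The sketch below assumes a contrastive loss for the pairwise case and a triplet hinge loss for the ranking case. These are common choices, not necessarily the ones used in the cited works, and the function names, margin value, and toy embeddings are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two embeddings in the common space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, similar, margin=1.0):
    """Pairwise objective: pull similar pairs together and push
    dissimilar pairs at least `margin` apart."""
    d = distance(u, v)
    if similar:
        return d ** 2
    return max(0.0, margin - d) ** 2


def triplet_ranking_loss(query, positive, negative, margin=1.0):
    """Ranking objective: the relevant item should be closer to the
    query than the irrelevant item by at least `margin`."""
    return max(0.0, margin + distance(query, positive) - distance(query, negative))


# Toy embeddings: an image query, a matching text, and a non-matching text.
image_query = [0.0, 0.0]
matching_text = [0.0, 1.0]
other_text = [3.0, 4.0]

pair_loss = contrastive_loss(image_query, matching_text, similar=True)
rank_loss = triplet_ranking_loss(image_query, matching_text, other_text)
```

The pairwise loss looks at one pair at a time and needs a similar/dissimilar label, while the ranking loss only needs to know which of two candidates is more relevant to the query. In a DL method these losses would be minimized over the parameters of the networks that produce the embeddings.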

Moreover, most researchers use an individual deep learning model for each separate modality. We strongly recommend that researchers examine the recent mathematical theory of deep learning models to investigate why a single model has not achieved benchmark results in cross-media retrieval. We also encourage finding a common semantic space for the features extracted simultaneously from different modalities using DL models. Furthermore, the conflict between service quality and retrieval is also noteworthy. For example, DL methods fulfill multiple requirements of feature extraction and distance detection but can be too heavyweight to meet the real-time constraints of cross-media retrieval. How to strike a balance among these conflicting requirements deserves future study. The key is to balance feature extraction, similarity measurement, and service quality.

VI Conclusion

Multimedia information retrieval is a rapidly growing research field that aims to build models that can validate the information from different modalities. This paper reviewed cross-media retrieval in terms of DNN-based algorithms and presented them in a common classification built upon three technical challenges faced by multimodal researchers: alignment, translation, and representation. For each challenge, we introduced different sub-classes of DNN-based methods to bridge the media gap and to provide researchers and developers with a better understanding of the underlying problems and potential solutions in current deep learning assisted cross-media retrieval research.


This work was partially supported by China’s National Key R&D Program (No. 2018YFB0803600), the National Natural Science Foundation of China (No. 61801008), the Beijing Natural Science Foundation (No. L172049), and the Scientific Research Common Program of Beijing Municipal Commission of Education (No. KM201910005025).