Heterogeneous Representation Learning: A Review

04/28/2020 · Joey Tianyi Zhou et al. · Nanyang Technological University; Agency for Science, Technology and Research

Real-world data usually exhibit heterogeneous properties in terms of modalities, views, or resources, which raises unique challenges whose key we term Heterogeneous Representation Learning (HRL) in this paper. This brief survey covers HRL, centered around several major learning settings and real-world applications. First, from a mathematical perspective, we present a unified learning framework that is able to model most existing learning settings with heterogeneous inputs. We then conduct a comprehensive discussion of the HRL framework by reviewing selected learning problems from this mathematical perspective, including multi-view learning, heterogeneous transfer learning, learning using privileged information, and heterogeneous multi-task learning. For each learning task, we also discuss representative applications and instantiate the terms of the mathematical framework. Finally, we highlight challenges that remain under-explored in HRL and present future research directions. To the best of our knowledge, no existing framework unifies these heterogeneous problems, and we hope this survey will benefit the community.




1 Introduction

In many real-world problems, such as social computing, video surveillance, and healthcare, data are collected from diverse domains or obtained from various feature extractors/sensors, leading to heterogeneous properties in terms of modalities (e.g., text, images, and videos) or views (e.g., multi-lingual and cross-sensor data). For example, posts on social media such as Facebook usually combine text, images, and voice to describe different profiles of the same topics/events. In other words, cross-modal or multi-view data do not exist in isolation. Instead, they usually combine various data types in a many-to-one correspondence, because the complete information about a given object/event is distributed over multiple views/modalities.

Figure 1: Relationship between HRL and learning tasks/applications. “MVL”: Multi-View Learning, “HTL”: Heterogeneous Transfer Learning, “HMTL”: Heterogeneous Multi-task Learning, “LUPI”: Learning Using Privileged Information; “MVSC”: Multi-view Subspace Clustering, “MVC”: Multi-view Classification, “CLTC”: Cross-lingual Text Classification, “LRSL”: Low-Resource Sequence Labeling, “MORPI”: Multi-object Recognition with PI, “HPI”: Hashing with PI.

Besides many-to-one correspondence, another form of heterogeneity lies in one-to-one correspondence. A typical example is cross-lingual machine translation, where all data are textual but in different languages, and each bilingual data pair carries the “same and complete” information about the object/event of interest. Many other tasks follow this paradigm, e.g., person re-identification, tracking, and 3D point matching.

In the aforementioned problems, the heterogeneity of the data makes it infeasible to extract desirable features with the same or homogeneous descriptors. In addition to data heterogeneity, task heterogeneity is another challenge: different tasks may have different inputs, outputs, and associated downstream tasks. It remains unclear how to discover the hidden useful information contained in multiple related heterogeneous tasks so as to improve the generalization performance of all of them.

To address these challenges, a variety of techniques have been proposed, including but not limited to multi-view/multi-modal learning, heterogeneous transfer learning, and heterogeneous multi-task learning. Unfortunately, these techniques have mostly been studied in isolation within each domain. To the best of our knowledge, no study unifies these topics and gives a comprehensive review from the perspective of Heterogeneous Representation Learning (HRL). We believe HRL could serve as an effective unified lens on the recent explosion of interest in these vibrant multi-disciplinary fields.

In this review, we first give a formal definition and narrow down our research focus to differentiate it from related but distinct lines of work.

Definition 1.1

Given a data set X = {X^1, …, X^M} drawn from M domains together with cross-domain correspondences (e.g., one-to-one or many-to-one), where the original feature spaces of the domains are heterogeneous (e.g., differing in dimension or type across domains), Heterogeneous Representation Learning aims at learning feature/task mappings that connect the different domains to facilitate the downstream tasks.

Note that the aforementioned correspondences are shared across domains, which is necessary for HRL in different ways. For example, many-to-one correspondences could be the labels accompanying the data: if data from two domains are annotated with the same label, they should exhibit similar properties in the feature space. In addition, to learn a common representation shared by different views/modalities, one-to-one correspondence is usually taken explicitly in multi-view analysis, where the data are aligned sample-wise. Clearly, how to utilize these correspondences is a crucial research problem in HRL.

Unlike existing surveys, which take either the perspective of a general learning setting such as transfer learning [9], multi-view learning [13], or multi-task learning [15], or the perspective of downstream applications [5, 12], we revisit our recent efforts in several selected learning tasks and their downstream applications from the viewpoint of HRL (see Fig. 1). To the best of our knowledge, this is the first study to discuss these diverse learning settings and applications from a unified, mathematically grounded HRL perspective. Beyond their similarity, we further discuss their differences as well as the research goals along the HRL line.

2 Mathematical View

In this section, we review the studied problems by providing a mathematical view of HRL. Specifically, all HRL-related tasks can be described by the following framework, whose objective function consists of three major terms: 1) a within-domain loss, 2) an inter-domain loss, and 3) a task-related regularization.

    min_{Θ_w, Θ_c, Θ_r}  Σ_m ℓ_w(X^m; Θ_w) + ℓ_c(X^1, …, X^M; Θ_c) + Ω(Θ_r)
    s.t. Θ_w ∈ C_w, Θ_c ∈ C_c, Θ_r ∈ C_r,   (1)

where X^m denotes the input data of the m-th domain, and Θ_w, Θ_c, and Θ_r denote the learnable parameters of the within-domain loss ℓ_w, the inter-domain loss ℓ_c, and the task-related regularization Ω, respectively. For different tasks, these parameter sets may be further constrained by the sets C_w, C_c, and C_r. Note that "domain" has different definitions in different tasks and applications; for example, it refers to a "view" in multi-view analysis.

The within-domain loss is a domain-related function used to describe the task/property within each domain. The inter-domain loss aims to model the relationship across domains with the help of the cross-domain correspondences, which can be constructed or given in different scenarios. For example, in multi-view learning, the view-based one-to-one correspondences are usually given during data collection or preparation. In transfer learning, the multi-class labels can be deemed many-to-one correspondences. In multi-label learning, one-to-many correspondences can be established between a data point and multiple labels. The third term in Eq. (1) is the regularization term used to avoid over-fitting or impose data priors, which is defined specifically for different tasks and architectures, for example, sparsity and low-rankness.
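As a toy illustration, the three-term structure of Eq. (1) can be sketched in numpy. The two-view data, linear maps, and squared losses below are illustrative assumptions rather than any specific method from the survey:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-view data: the same 5 objects seen through two heterogeneous feature spaces.
X1 = rng.normal(size=(5, 4))   # view 1: 4-dim features
X2 = rng.normal(size=(5, 6))   # view 2: 6-dim features

# Learnable linear maps projecting each view into a shared 3-dim space
# (stand-ins for the parameter sets in Eq. (1)).
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(6, 3))

def objective(W1, W2, lam=0.1):
    H1, H2 = X1 @ W1, X2 @ W2
    # 1) within-domain loss: per-view reconstruction error
    within = ((X1 - H1 @ W1.T) ** 2).sum() + ((X2 - H2 @ W2.T) ** 2).sum()
    # 2) inter-domain loss: rows correspond one-to-one, so shared codes should agree
    cross = ((H1 - H2) ** 2).sum()
    # 3) task-related regularization: here a simple weight-decay prior
    reg = lam * ((W1 ** 2).sum() + (W2 ** 2).sum())
    return within + cross + reg

loss = objective(W1, W2)
```

Any concrete method in the following sections can be read as a particular choice of these three terms.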

In the following sections, we discuss our recent efforts on multi-view learning, heterogeneous transfer learning, learning using privileged information, and heterogeneous multi-task learning within the proposed framework. (Note that we do not aim to review all related learning tasks and applications, which is beyond the scope of this brief review; instead, we summarize some of our works within the aforementioned unified framework to illustrate the homogeneity of these tasks/applications.)

3 Multi-view Learning

As one of the most important topics in machine learning, multi-view learning (MVL) aims at fusing knowledge from multiple views to facilitate downstream learning tasks, e.g., clustering, classification, and retrieval. The key challenge of MVL is exploiting the data correspondence across multiple views. The mappings among different views couple the view-specific knowledge, while an additional task-related regularization is incorporated based on different priors on the latent data structure.

3.1 Multi-view Subspace clustering

Multi-view subspace clustering (MVSC) aims to group similar objects into the same subspace and dissimilar objects into different subspaces by exploiting the available multi-view information. The common core of existing MVSC methods is to encapsulate the complementary information of different views into a shared/common representation, followed by a single-view clustering approach. More specifically, MVSC first learns a latent representation that bridges the gap among the multiple views and then applies a traditional clustering method, such as spectral clustering or k-means, to this representation to obtain the final partition.
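The two-step recipe (learn a shared representation, then run a single-view clusterer on it) can be sketched as follows. The averaging-based fusion and the plain Lloyd's k-means below are deliberately simplistic stand-ins for the learned latent space and the spectral step:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two well-separated clusters observed under two views of different dimensionality.
centers = np.array([0.0, 8.0])
labels_true = np.repeat([0, 1], 10)
X1 = centers[labels_true, None] + rng.normal(scale=0.5, size=(20, 3))
X2 = centers[labels_true, None] + rng.normal(scale=0.5, size=(20, 5))

# Step 1: fuse the views into a shared representation.  Here we z-score each
# view and take per-sample means -- a crude stand-in for a learned latent space.
def view_embed(X):
    Z = (X - X.mean(0)) / X.std(0)
    return Z.mean(1, keepdims=True)

H = np.hstack([view_embed(X1), view_embed(X2)])  # shared 2-dim representation

# Step 2: run a single-view clustering method (Lloyd's k-means) on H.
def kmeans(H, k=2, iters=20):
    C = H[[0, -1]]  # deterministic init: one seed from each end of the data
    for _ in range(iters):
        assign = np.argmin(((H[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([H[assign == j].mean(0) for j in range(k)])
    return assign

assign = kmeans(H)
# On this easy toy data, the fused representation separates the clusters.
acc = max((assign == labels_true).mean(), (assign != labels_true).mean())
```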

Formally, let X^v denote the data of the v-th view, each view consisting of n observations; one aims to infer a shared latent representation for each data point across views. To obtain this latent representation, the objective of most existing methods can generally be decomposed into two major losses: a view-specific loss and an inter-view consistency loss.

Different methods define these loss functions differently. For example, Huang et al. [4] propose a multi-view spectral clustering network in which the view-specific loss is a spectral-clustering-style objective defined on a precomputed similarity graph for each view, and the latent representation is the output of a parametric model applied to the input, i.e., a learnable subnetwork in [4] or a linear mapping in [7]. Kang et al. propose an objective that integrates graph construction and spectral clustering. Specifically, the view-specific term jointly learns a subspace representation from each view's feature matrix and the graph Laplacian derived from it, the cross-view term aligns the per-view partition results with a consensus cluster indicator matrix, and a regularization term is used to avoid overfitting.

More recently, Peng et al. propose to cluster multi-view data with automatic parameter learning. Specifically, the method defines the view-specific loss as a combination of a view-specific reconstruction loss and a geometric-consistency term defined over the learned representation of each data point and a per-view connection graph, while the cross-view loss enforces cross-view cluster-assignment consistency.

Besides, some subspace clustering methods share a latent-representation formulation. For example, Li et al. define a view-specific loss whose first part applies a reconstruction loss to alleviate the effect of noise, and whose second part learns a common latent representation by using it to reconstruct all view-specific representations with a weighting parameter per network. The cross-view term lets a common subspace representation reveal the subspace structure of the latent representation, and a nuclear-norm regularizer is used to guarantee high within-class homogeneity.

3.2 Multi-view Classification

Multi-view classification (MVC) aims to build a classification model from the given multiple views of the training data. At test time, all available views of each test instance are likewise considered to make the final prediction. Different from multi-view subspace clustering, multi-view classification usually involves a single step rather than a two-step optimization, utilizing the supervised information to guide representation learning.

To regularize the representation learning, most methods employ a single quotient to incorporate the supervised information across views. As a result, their main difference lies in the choice of the cross-view loss, which can be generalized as a ratio of the within-class compactness to the between-class scatter, both computed across all views. For example, Kim et al. extend canonical correlation analysis (CCA) [1] to discriminative CCA, wherein the within-class correlation is maximized while the between-class correlation is simultaneously minimized in the learned common space. Diethe et al. extend this idea by adopting Fisher discriminant analysis to explicitly use the label information.
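Classical CCA, the starting point of these discriminative extensions, finds maximally correlated projections of two views via the SVD of the whitened cross-covariance. The toy data below assume a one-dimensional shared latent signal between the views; the small ridge term is an assumption added for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two views sharing a 1-dim latent signal z plus view-specific noise dimensions.
z = rng.normal(size=(200, 1))
X = np.hstack([z, rng.normal(size=(200, 2))])
Y = np.hstack([-z, rng.normal(size=(200, 3))])

def cca_first_correlation(X, Y, eps=1e-6):
    """Top canonical correlation via SVD of Cxx^{-1/2} Cxy Cyy^{-1/2}."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(M, compute_uv=False)[0]

rho = cca_first_correlation(X, Y)  # close to 1: the shared signal is recovered
```

CCA is sign-invariant, so the negated copy of z in Y is still found with correlation near 1.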

In addition, another group of methods incorporates the supervised information into the classification loss rather than the representation learning loss (as suggested in [10], classification can also be treated as a special case of representation learning). For example, Zhang et al. propose a reconstruction loss that encodes all partial views into a shared representation, with an indicator marking the view availability of each data point, together with a clustering-like loss that learns a structured representation in the common space. Xiao et al. use the cross-entropy loss for multi-view action recognition, where the labels are predicted by the network.

In the positive-unlabeled learning setting, Zhou et al. propose a co-regularization framework that constructs the cross-view loss so that the density-ratio functions of the two views agree on the same instance. The classification loss is defined to be the unconstrained least-squares importance fitting objective, where the density ratio is taken between the positive samples and all samples, and an indicator marks the positive annotations.

4 Heterogeneous Transfer Learning

Heterogeneous Transfer Learning (HTL) aims to improve performance on a target domain by exploiting auxiliary labeled source-domain data whose feature space differs from that of the target domain. To bridge the source and target domains, most existing HTL methods require either cross-domain correspondences or shared labels to learn the mappings, and most of them can be summarized as combining within-domain losses on the two domains with a cross-domain discrepancy loss.

4.1 Cross-lingual Text Classification

Cross-lingual text classification (CLTC) refers to the task of classifying documents in different languages into the same taxonomy of categories. Text classification relies heavily on manually annotated training data, which is expensive to create. It is therefore attractive to explore how training data in a single source language (the source domain) can be used to classify text written in a different target language (the target domain) with no or few labeled examples. Most existing studies minimize the feature differences between the two languages and perform the translation by learning transformations based on either labels or one-to-one correspondences.

Most existing CLTC methods fall roughly into two groups. The first group usually assumes a number of cross-lingual correspondences, such as cross-lingual translation pairs. Given the correspondences, some works use an auto-encoder loss as the domain discrepancy loss, defined over the hidden representations of the source and target domains. More specifically, Zhou et al. [18] and follow-up work formulate the cross-domain mapping as a linear mapping, where the representation is learned from a linear reconstruction objective [11] or a marginalized stacked denoising autoencoder [18]. In [18], the loss on the source domain is the stacked denoising autoencoder loss, defined over the source-domain data matrix and its corrupted versions. Zhou et al. (2019) further extend the linear cross-domain mapping to a nonlinear mapping using a multilayer perceptron.
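A minimal sketch of learning such a linear cross-domain mapping from correspondence pairs, here by ordinary least squares on synthetic "bilingual" representations (the ground-truth mapping W_true and the noise level are assumptions of the toy setup, not part of any cited method):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy bilingual correspondences: each target representation is a fixed linear
# transform of its source counterpart, plus small noise (mimicking translation pairs).
W_true = rng.normal(size=(4, 4))
Xs = rng.normal(size=(50, 4))                        # source-language representations
Xt = Xs @ W_true + 0.01 * rng.normal(size=(50, 4))   # aligned target representations

# Learn the cross-domain linear map G minimizing ||Xs G - Xt||_F^2.
G, *_ = np.linalg.lstsq(Xs, Xt, rcond=None)

# Relative alignment error of the learned mapping on the correspondence pairs.
err = np.linalg.norm(Xs @ G - Xt) / np.linalg.norm(Xt)
```

With enough pairs, G recovers the underlying transform up to the noise level, which is the intuition behind correspondence-based CLTC.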

Another group of methods projects the heterogeneous feature spaces into a higher-dimensional space [17] so that the features from the two domains become homogeneous, using techniques such as the Maximum Mean Discrepancy (MMD) to measure and minimize the discrepancy between the two domains.
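The MMD itself is straightforward to compute; a small numpy sketch with an RBF kernel (the bandwidth gamma and the Gaussian toy data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def mmd_rbf(X, Y, gamma=0.5):
    """Biased estimate of the squared MMD with an RBF kernel between samples X and Y."""
    def k(A, B):
        d = ((A[:, None] - B[None]) ** 2).sum(-1)   # pairwise squared distances
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

# Identical distributions give a near-zero MMD; a mean shift gives a large one.
same = mmd_rbf(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)))
shifted = mmd_rbf(rng.normal(size=(100, 2)), rng.normal(size=(100, 2)) + 3.0)
```

Minimizing such a statistic over the projection parameters is what drives the features of the two domains toward a common distribution.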

The other branch of methods assumes that only a few annotated target data points are given and constructs the domain classification loss by directly exploiting the label information of the two domains, deploying supervised loss functions on both domains realized with source- and target-domain parameters. For example, Xiao and Guo deploy squared losses on both domains, where the prediction function is represented by a kernel matrix together with a kernel-matching matrix. Duan et al. propose the heterogeneous feature augmentation method, which augments the homogeneous common features in an SVM-style approach with the help of the labels; the domain losses are defined to be hinge losses.

4.2 Low-Resource Sequence Labeling

Sequence labeling tasks such as named entity recognition (NER) and part-of-speech (POS) tagging are fundamental problems in natural language processing (NLP). Recently, a variety of deep models have been proposed for sequence labeling; they generalize well to new entities by automatically learning features from data. However, when the annotated corpus is small, especially in low-resource scenarios, the performance of these methods degrades significantly because the hidden representations cannot be learned adequately. To enable knowledge transfer for low-resource sequence labeling (LRSL), recent methods such as [3] build on cross-resource word embeddings, which bridge the low- and high-resource languages and enable knowledge transfer.

To implement the domain loss, most existing sequence-labeling works use a linear-chain model based on a first-order Markov structure, namely the linear-chain conditional random field (CRF). For example, Zhou et al. employ a CRF as the label decoder to induce a probability distribution over label sequences conditioned on the word-level latent features. The decoder involves two kinds of cliques: local cliques, which correspond to the individual elements in the sequence, and transition cliques, which reflect the evolution of states between two neighboring elements. The linear-chain CRF then scores a label sequence by combining the local and transition potentials and normalizing by a partition term over all possible label sequences.

The model is optimized by maximizing this conditional log-likelihood, which acts as the domain discriminative loss, computed over the word-level representations of each input word of a sequence from the source and target domains.
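The linear-chain CRF log-likelihood can be sketched with the standard forward algorithm. The random emission and transition scores below stand in for the word-level features and learned potentials:

```python
import numpy as np

rng = np.random.default_rng(5)

T, K = 6, 3                       # sequence length, number of labels
emis = rng.normal(size=(T, K))    # local (emission) scores per position and label
trans = rng.normal(size=(K, K))   # transition scores between consecutive labels
y = rng.integers(0, K, size=T)    # an arbitrary reference label sequence

def log_partition(emis, trans):
    """log Z via the forward algorithm: sums over all K^T label sequences."""
    alpha = emis[0]
    for t in range(1, len(emis)):
        # alpha_t[j] = emis[t, j] + logsum_i(alpha_{t-1}[i] + trans[i, j])
        alpha = emis[t] + np.logaddexp.reduce(alpha[:, None] + trans, axis=0)
    return np.logaddexp.reduce(alpha)

def seq_score(emis, trans, y):
    """Unnormalized score of one label sequence: local + transition potentials."""
    return emis[np.arange(len(y)), y].sum() + trans[y[:-1], y[1:]].sum()

# Conditional log-likelihood of the sequence y (the quantity being maximized).
log_prob = seq_score(emis, trans, y) - log_partition(emis, trans)
```

The dynamic program makes the normalizer tractable, which is why the linear-chain structure is the standard label decoder for sequence labeling.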

To model the domain discrepancy loss, adversarial discriminators have been introduced with great success in low-resource sequence labeling tasks. For example, Zhou et al. propose the generalized resource-adversarial discriminator (GRAD) as a domain discrepancy loss that assigns adaptive weights to each sample, so that the imbalance in training size between the two domains can be largely alleviated. Most existing adversarial discriminators can be understood as instances of GRAD.

where indicator functions denote whether a sentence comes from the high-resource (source) or the low-resource (target) side, a weighting factor balances the loss contributions of the high and low resources, and a per-sample factor controls the contribution of individual samples by measuring the discrepancy between the prediction and the true label (easy samples contribute less).
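Such a resource-weighted discriminator loss can be sketched in a focal-loss style. The particular form below, with a balance factor alpha and easiness exponent gamma, is an illustrative assumption rather than the exact GRAD objective:

```python
import numpy as np

def resource_weighted_loss(p_src, p_tgt, alpha=0.25, gamma=2.0):
    """Binary discriminator loss where alpha balances high/low resource and the
    (1 - p)^gamma factor shrinks the contribution of easy (confident) samples.
    p_src / p_tgt: predicted probability of 'source' for source / target sentences."""
    l_src = -alpha * (1 - p_src) ** gamma * np.log(p_src)        # source labeled 1
    l_tgt = -(1 - alpha) * p_tgt ** gamma * np.log(1 - p_tgt)    # target labeled 0
    return l_src.mean() + l_tgt.mean()

# Confidently classified (easy) sentences contribute far less than ambiguous ones.
easy = resource_weighted_loss(np.array([0.95]), np.array([0.05]))
hard = resource_weighted_loss(np.array([0.55]), np.array([0.45]))
```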

5 Learning Using Privileged Information

Different from heterogeneous transfer learning, where either unlabeled cross-domain correspondences or labels for the two domains are given, Learning Using Privileged Information (LUPI) is a setting proposed by Vapnik and Vashist (2009). In brief, LUPI assumes that triplets (x, x*, y) are available in the training phase, but only x is available in the testing phase, i.e., testing proceeds without access to the privileged information (PI) x*. This setting has been extended to application scenarios such as image classification [6], information retrieval [19], and so on. Mathematically, the objective functions of these studies share a common form that combines a task loss with a PI regularization loss defined on the privileged information x*. Different from heterogeneous transfer learning or multi-view learning, which directly use the cross-domain correspondences as input to minimize a cross-domain loss, LUPI leverages the privileged information to shape the task loss and thereby achieve better generalization.
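A toy sketch of this objective: a task loss on the regular features plus a regularizer tying the model's predictions to a predictor built on the privileged features. The linear models, squared losses, and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(8)

# Training triplets (x, x_star, y): x_star is privileged and absent at test time.
n = 60
x_star = rng.normal(size=(n, 2))                  # privileged features (informative)
y = (x_star[:, 0] > 0).astype(float) * 2 - 1      # labels determined by the PI
x = np.hstack([x_star[:, :1] + 0.5 * rng.normal(size=(n, 1)),
               rng.normal(size=(n, 3))])          # regular features: noisy + irrelevant

def objective(w, u, lam=1.0, mu=0.1):
    task = ((x @ w - y) ** 2).mean()              # task loss on the regular features
    pi_reg = ((x @ w - x_star @ u) ** 2).mean()   # tie predictions to a PI-based model
    return task + lam * pi_reg + mu * (w @ w + u @ u)

w = rng.normal(size=4) * 0.1
u = rng.normal(size=2) * 0.1
loss = objective(w, u)
```

At test time only w is used on x; the PI-based term exists solely to guide training, which is the essence of the LUPI setting.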

5.1 Multi-Object Recognition with PI

Multi-object recognition is usually recast as a multi-instance multi-label (MIML) learning problem. Specifically, MIML assumes there are bags in the training data, where each bag contains a number of instances with instance-level label vectors, and the bag-level label set contains the labels associated with the bag. In multi-object recognition, additional information such as bounding boxes, image captions, and descriptions is often available during the training phase, which is referred to as privileged information (PI). Based on this observation, Yang et al. propose using bounding boxes and image captions as PI to improve multi-object recognition (MORPI). Formally, for each training bag there exists a privileged bag containing the corresponding privileged instances. The task loss is the MIML-FCN loss, in which the relation between instance-level and bag-level labels is expressed by pooling over instances, the bag-level label prediction is generated from a fully convolutional network, and the loss can be the squared loss or a ranking loss for multi-object recognition.

To utilize the privileged information, the PI regularization loss is modeled via the output of an additional network applied to the input privileged bag.

5.2 Hashing with PI

Learning to hash aims to learn a binary code consisting of a sequence of bits from a specific dataset so that nearest-neighbor search in the hash-coding space approximates search in the original space. Most existing learning-to-hash methods assume that sufficient data, either labeled or unlabeled, are available in the domain of interest (i.e., the target domain) for training. However, this assumption cannot be satisfied in some applications. To address this problem, Zhou et al. [19, 20] apply LUPI to learning to hash, termed HPI. In these works, the task loss takes the form of a quantization error between the binary code matrix and an orthogonally projected privileged data matrix, where the projection is defined as a linear mapping in [19] and as a nonlinear neural network in [20]. There are additional regularizers to further constrain the hash codes. For example, [19] incorporates the graph structure into the formulation through a pre-computed graph Laplacian matrix on the target domain.
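The quantization-error task loss with an orthogonal projection can be sketched with ITQ-style alternating updates; the random data and projections below are a toy stand-in for the learned hash projections:

```python
import numpy as np

rng = np.random.default_rng(6)

Xp = rng.normal(size=(40, 8))         # privileged data matrix
V = Xp @ rng.normal(size=(8, 4))      # 4-dim real-valued projections to binarize

# Alternate between binary codes B and an orthogonal rotation R to
# minimize the quantization error ||B - V R||_F^2.
R = np.eye(4)
for _ in range(20):
    B = np.sign(V @ R)                  # fix R: optimal codes are element-wise signs
    U, _, Wt = np.linalg.svd(V.T @ B)   # fix B: orthogonal Procrustes for R
    R = U @ Wt

err0 = np.linalg.norm(np.sign(V) - V) ** 2   # quantization error without rotation
err = np.linalg.norm(B - V @ R) ** 2         # error after alternating updates
```

Each alternating step never increases the objective, so the rotated codes quantize at least as well as the naive signs.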

6 Heterogeneous Multi-task Learning

Multi-Task Learning (MTL) is a learning paradigm that aims to leverage the information contained in multiple related tasks so that the generalization performance of all the tasks improves. Different from traditional MTL, with tasks of only one type, heterogeneous multi-task learning (HMTL) involves different types of tasks, including supervised learning, unsupervised learning, reinforcement learning, and so on [15]. Formally, HMTL can usually be formulated as a minimization problem over the per-task losses together with cross-task coupling and regularization terms.

Here each task has its own dataset and weight-parameter set for the corresponding learner. A cross-task term can be explicitly formulated to further capture the relationship across tasks; alternatively, a shared-parameter constraint enforces different tasks to share the same weight parameters and thereby transfers knowledge across tasks. Either sharing the parameter-constraint space or an explicit coupling loss can facilitate knowledge sharing among tasks. To avoid over-fitting of the learners, one usually introduces a regularization term into the objective function.
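Hard parameter sharing, the simplest realization of the shared-parameter constraint, can be sketched as two heterogeneous tasks (a regression and a classification) sharing the representation weights W while keeping task-specific heads. The tiny networks and losses are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Two tasks share the representation weights W but keep task-specific heads v1, v2.
X1, X2 = rng.normal(size=(30, 5)), rng.normal(size=(40, 5))
y1 = rng.normal(size=30)                      # task 1: regression targets
y2 = (rng.normal(size=40) > 0).astype(float)  # task 2: binary classification labels

W = rng.normal(size=(5, 3)) * 0.1             # shared representation parameters
v1, v2 = rng.normal(size=3) * 0.1, rng.normal(size=3) * 0.1

def total_loss(W, v1, v2, lam=0.01):
    r1 = np.tanh(X1 @ W) @ v1 - y1                     # task 1 residual
    p2 = 1 / (1 + np.exp(-(np.tanh(X2 @ W) @ v2)))     # task 2 predicted probability
    l1 = (r1 ** 2).mean()                              # squared loss (regression)
    l2 = -(y2 * np.log(p2) + (1 - y2) * np.log(1 - p2)).mean()  # cross-entropy
    reg = lam * (W ** 2).sum()                         # regularizer against overfitting
    return l1 + l2 + reg

loss = total_loss(W, v1, v2)
```

Because W appears in both losses, gradients from either task reshape the representation the other task uses, which is the knowledge-transfer mechanism of the shared-parameter constraint.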

To improve sequence labeling performance, Zhou et al. [16] consider fully annotated data, incompletely annotated data, and unsupervised data in a unified framework. Specifically, the task loss combines fully and semi-supervised CRF losses on the fully and incompletely annotated data, respectively, with an unsupervised term realized by an autoencoder-like structure whose mixing parameter is given by a softmax function. In [16], the task relationship is implicitly learned through shared parameters lying in the same space for all tasks, and a regularizer is adopted to avoid overfitting.

To model the relationship between different tasks, different methods use different losses. For example, in our recent work [14], the task relationship is regularized by minimizing the negative correlation among the task-specific neural networks. Zhou et al. propose to learn a sparse transformation matrix between two heterogeneous domains by exploiting the commonality of multiple binary classification tasks: the linear mapping minimizes the difference between each target binary classifier's weights and the transformed classifier's weights over tasks, while a row-wise sparsity penalty is enforced on the transformation matrix.

7 Discussion and Future Direction

In this paper, we provide a unified HRL framework for mathematically understanding over a dozen learning tasks/problems across multiple areas. We analyze in depth the shared and distinct loss terms of more than ten learning tasks that are popular in machine learning, multimedia analysis, computer vision, data mining, and natural language processing. We believe that such a unified view will benefit the AI community in both industry and academia, from literature review to future directions. Beyond the aforementioned applications, other interesting applications, such as person re-identification, translation, visual question answering, video captioning, text-to-image generation, and point matching, could also be revisited in our HRL framework. Despite the recent advances, we believe several important and fundamental challenges of HRL deserve more attention in future research.

  • Theoretical analysis of heterogeneous tasks/domains. Many experimental studies show that incorporating heterogeneous tasks/domains can boost performance. However, there is little theoretical analysis of the relationship between the performance improvement and the number of training data/tasks. A quantitative metric is also needed to measure the degree to which different domains contribute to the final performance.

  • Universal pre-trained models for HRL. Pre-training for a single domain/modality has been well studied over the last decade, e.g., ResNets and word2vec. Recently, increasing attention has been paid to pre-trainable generic representations such as ViLBERT [8] for visual-linguistic tasks and M-BERT [2] for multilingual tasks. However, research on pre-trained models for other applications/tasks, such as audio-visual and video-linguistic tasks, is still at an early stage.

  • Heterogeneous feature generation. This paper discusses the learning tasks from the perspective of feature representation or domain matching. However, related tasks such as image/video captioning and image/text style transfer require the model to generate heterogeneous features for out-of-domain data.


  • [1] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan (2009) Multi-view clustering via canonical correlation analysis. In ICML, pp. 129–136. Cited by: §3.2.
  • [2] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. Cited by: 2nd item.
  • [3] M. Fang and T. Cohn (2017) Model transfer for tagging low-resource languages using a bilingual dictionary. In ACL, pp. 587–593. Cited by: §4.2.
  • [4] Z. Huang, J. T. Zhou, X. Peng, C. Zhang, H. Zhu, and J. Lv (2019-07) Multi-view spectral clustering network. In IJCAI, pp. 2563–2569. Cited by: §3.1.
  • [5] K. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. Barnes, and D. Brown (2019) Text classification algorithms: a survey. Information 10 (4), pp. 150. Cited by: §1.
  • [6] J. Lambert, O. Sener, and S. Savarese (2018) Deep learning under privileged information using heteroscedastic dropout. In CVPR, pp. 8886–8895. Cited by: §5.
  • [7] Y. Li, F. Nie, H. Huang, and J. Huang (2015) Large-scale multi-view spectral clustering via bipartite graph. In AAAI, Cited by: §3.1.
  • [8] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23. Cited by: 2nd item.
  • [9] S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE TKDE 22 (10), pp. 1345–1359. Cited by: §1.
  • [10] X. Peng, H. Zhu, J. Feng, C. Shen, H. Zhang, and J. T. Zhou (2019) Deep clustering with sample-assignment invariance prior. IEEE TNNLS, pp. 1–12. Cited by: §3.2.
  • [11] X. Shi, Q. Liu, W. Fan, P. S. Yu, and R. Zhu (2010) Transfer learning on heterogenous feature spaces via spectral transformation. In ICDM, pp. 1049–1054. Cited by: §4.1.
  • [12] J. Wang, T. Zhang, N. Sebe, H. T. Shen, et al. (2017) A survey on learning to hash. IEEE TPAMI 40 (4), pp. 769–790. Cited by: §1.
  • [13] C. Xu, D. Tao, and C. Xu (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §1.
  • [14] L. Zhang, Z. Shi, M. Cheng, Y. Liu, J. Bian, J. T. Zhou, G. Zheng, and Z. Zeng (2019) Nonlinear regression via deep negative correlation learning. IEEE TPAMI. Cited by: §6.
  • [15] Y. Zhang and Q. Yang (2017) A survey on multi-task learning. arXiv preprint arXiv:1707.08114. Cited by: §1, §6.
  • [16] J. T. Zhou, M. Fang, H. Zhang, C. Gong, X. Peng, Z. Cao, and R. S. M. Goh (2019) Learning with annotation of various degrees. IEEE TNNLS 30 (9), pp. 2794–2804. Cited by: §6.
  • [17] J. T. Zhou, S. J. Pan, I. W. Tsang, and S. Ho (2016) Transfer learning for cross-language text categorization through active correspondences construction. In AAAI, pp. 2400–2406. Cited by: §4.1.
  • [18] J. T. Zhou, S. J. Pan, I. W. Tsang, and Y. Yan (2014) Hybrid heterogeneous transfer learning through deep learning. In AAAI, pp. 2213–2220. Cited by: §4.1.
  • [19] J. T. Zhou, X. Xu, S. J. Pan, I. W. Tsang, Z. Qin, and R. S. M. Goh (2016) Transfer hashing with privileged information. In IJCAI, pp. 2414–2420. Cited by: §5.2, §5.
  • [20] J. T. Zhou, H. Zhao, X. Peng, M. Fang, Z. Qin, and R. S. M. Goh (2018) Transfer hashing: from shallow to deep. IEEE TNNLS 29 (12), pp. 6191–6201. Cited by: §5.2.