1 Introduction
In many real-world problems, such as social computing, video surveillance, and healthcare, data are collected from diverse domains or obtained from various feature extractors/sensors, leading to heterogeneous properties in terms of modalities (e.g., texts, images, and videos) or views (e.g., multilingual corpora and cross-sensor measurements). For example, posts on social media such as Facebook usually employ texts, images, and voice simultaneously to describe different profiles of the same topics/events. In other words, cross-modal or multi-view data do not exist in isolation. Instead, they are usually combinations of various data types in a many-to-one correspondence, because the complete information of a given object/event is distributed across multiple views/modalities.
Besides the many-to-one correspondence, another type of heterogeneity lies in the one-to-one correspondence. A typical example is cross-lingual machine translation, wherein all data are texts but in different languages, and each bilingual data pair contains the "same and complete" information about the object/event of interest. Many other tasks follow this paradigm, e.g., person re-identification, tracking, and 3D point matching.
In the aforementioned problems, the heterogeneity of the data makes it infeasible to extract desirable features with the same homogeneous descriptors. Beyond data heterogeneity, task heterogeneity poses another challenge: different tasks may have different inputs, outputs, and associated downstream tasks. It remains unclear how to discover the hidden useful information contained in multiple related heterogeneous tasks to improve the generalization performance of all the tasks.
To address these challenges, a variety of techniques have been proposed, including but not limited to multi-view/multi-modal learning, heterogeneous transfer learning, and heterogeneous multi-task learning. Unfortunately, these techniques have mostly been studied in isolation within their respective domains. To the best of our knowledge, no study has unified these topics and given a comprehensive review from the perspective of Heterogeneous Representation Learning (HRL). We believe that HRL could serve as an effective unified framework connecting these vibrant, multidisciplinary fields.
In this review, we first give a formal definition and narrow down our research focus to differentiate it from related works in other lines of research.
Definition 1.1
Given a data set drawn from multiple domains together with cross-domain correspondences (e.g., one-to-one or many-to-one), where the original feature spaces of the domains are heterogeneous, Heterogeneous Representation Learning aims to learn feature/task mappings that connect the different domains so as to facilitate the downstream tasks.
Note that the aforementioned correspondences are shared across domains, and they are exploited by HRL in different manners. For example, many-to-one correspondences could be the labels accompanying the data: if data from two domains are annotated with the same label, they should exhibit some similar properties in the feature space. In addition, to learn a common representation shared by different views/modalities, one-to-one correspondence is usually exploited explicitly in multi-view analysis, wherein the data are aligned sample-wise. Clearly, how to utilize these correspondences is a crucial research problem in HRL.
Unlike existing surveys, which are written either from the perspective of a general learning setting such as transfer learning [9], multi-view learning [13], or multi-task learning [15], or from the perspective of downstream applications [5, 12], we revisit our recent efforts in several selected learning tasks and their downstream applications from the viewpoint of HRL (see Fig. 1). To the best of our knowledge, this is also the first study to discuss these diverse learning settings and applications in a unified, mathematically formulated HRL perspective. Beyond their similarities, we further discuss their differences as well as the research goals along the HRL direction.
2 Mathematical View
In this section, we provide a mathematical view of HRL. Specifically, all tasks related to HRL can be described in a framework whose objective function consists of three major terms: 1) a within-domain loss, 2) an inter-domain loss, and 3) a task-related regularization:
$\min_{\Theta_w,\,\Theta_b,\,\Theta_r}\; \mathcal{L}_w(X;\Theta_w) + \mathcal{L}_b(X;\Theta_b) + \mathcal{R}(X;\Theta_r), \quad \text{s.t. } \Theta_w \in \mathcal{C}_w,\; \Theta_b \in \mathcal{C}_b,\; \Theta_r \in \mathcal{C}_r,$  (1)
where $X$ denotes the input data; $\Theta_w$, $\Theta_b$, and $\Theta_r$ denote the learnable parameters of the within-domain loss $\mathcal{L}_w$, the inter-domain loss $\mathcal{L}_b$, and the task-related regularization $\mathcal{R}$, respectively. For different tasks, these parameter sets may be further constrained by $\mathcal{C}_w$, $\mathcal{C}_b$, and $\mathcal{C}_r$. Note that "domain" has different definitions in different tasks and applications; for example, it refers to "view" in multi-view analysis.
The within-domain loss is a domain-related function used to describe the tasks/properties within each domain. The inter-domain loss models the relationship across domains with the help of the cross-domain correspondences, which can be constructed or given in different scenarios. For example, in multi-view learning, the one-to-one correspondence between views is usually given during data collection or preparation. In transfer learning, the multi-class labels can be regarded as many-to-one correspondences. In multi-label learning, one-to-many correspondences can be established between a data point and multiple labels. The third term in Eq. (1) is a regularization term used to avoid overfitting or to encode data priors, and it is defined specifically for different tasks and architectures, for example, via sparsity or low-rankness.
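As a toy illustration of this three-term objective, the following sketch evaluates the framework for two 1-D domains with one-to-one correspondences. All loss definitions, parameter shapes, and numbers here are invented for illustration; they are not taken from any of the surveyed methods.

```python
# Hedged sketch of the three-term HRL objective: a within-domain loss,
# an inter-domain loss tied to one-to-one correspondences, and a
# task-related regularizer. All choices are illustrative.

def within_domain_loss(X, W):
    """Toy reconstruction-style loss: how far each 1-D point is from W * x."""
    return sum((x - W * x) ** 2 for x in X) / len(X)

def inter_domain_loss(Xa, Xb, Wa, Wb):
    """Toy correspondence loss: paired samples should map to nearby points."""
    return sum((Wa * a - Wb * b) ** 2 for a, b in zip(Xa, Xb)) / len(Xa)

def regularizer(Wa, Wb, lam=0.1):
    """Simple L2 penalty standing in for task-related priors."""
    return lam * (Wa ** 2 + Wb ** 2)

def hrl_objective(Xa, Xb, Wa, Wb):
    return (within_domain_loss(Xa, Wa) + within_domain_loss(Xb, Wb)
            + inter_domain_loss(Xa, Xb, Wa, Wb) + regularizer(Wa, Wb))

Xa = [1.0, 2.0, 3.0]   # "domain A" samples
Xb = [2.0, 4.0, 6.0]   # corresponding "domain B" samples (one-to-one)
print(hrl_objective(Xa, Xb, 1.0, 0.5))
```

In a real method, each term would of course be instantiated by the task-specific losses discussed in the following sections; the sketch only shows how the three terms compose.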
In the following sections, we discuss our recent efforts on multi-view learning, heterogeneous transfer learning, learning using privileged information, and heterogeneous multi-task learning within the proposed framework.¹ ¹Note that we do not aim to review all related learning tasks and applications, which is beyond the scope of this brief review. Instead, we summarize and introduce some of our works under the aforementioned unified framework to verify the homogeneity of these tasks/applications.
3 Multi-view Learning
As one of the most important topics in machine learning, multi-view learning (MVL) aims to fuse the knowledge from multiple views to facilitate downstream learning tasks, e.g., clustering, classification, and retrieval. The key challenge of MVL is exploring the data correspondence across multiple views. The mappings among different views couple the view-specific knowledge, while an additional task-related regularization is incorporated based on different priors on the latent structure of the data.
3.1 Multi-view Subspace Clustering
Multi-view subspace clustering (MVSC) aims to group similar objects into the same subspace and dissimilar objects into different subspaces by exploring the available multi-view information. The core commonality of existing MVSC methods is encapsulating the complementary information of different views into a shared/common representation, followed by a single-view clustering approach. More specifically, MVSC first learns a latent representation to bridge the gap among multiple views and then applies a traditional clustering method, such as spectral clustering or k-means, to the representation to obtain the final clustering partitions.
Formally, given the data of each view, where each view consists of the same number of observations, one aims to infer a shared latent representation for each data point across views. To obtain the latent representation, the objective of most existing methods can generally be decomposed into two major kinds of loss: a view-specific loss and an inter-view consistency loss.
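A minimal sketch of this two-term decomposition follows; the scalar "views", linear decoders, and loss choices are purely illustrative and do not correspond to any specific MVSC model.

```python
# Sketch: score a candidate shared representation Z against multiple views.
# Each view has a toy linear "decoder" weight; the view-specific term
# measures reconstruction, the consistency term measures cross-view agreement.

def view_specific_loss(X, Z, w):
    """Reconstruction: view data should be recoverable from Z via weight w."""
    return sum((x - w * z) ** 2 for x, z in zip(X, Z)) / len(X)

def consistency_loss(Z_views):
    """Average pairwise disagreement of per-view encodings (inter-view term)."""
    loss, pairs = 0.0, 0
    for i in range(len(Z_views)):
        for j in range(i + 1, len(Z_views)):
            loss += sum((a - b) ** 2 for a, b in zip(Z_views[i], Z_views[j]))
            pairs += 1
    return loss / max(pairs, 1)
```

After minimizing the sum of these two terms over the latent representation, a traditional clustering method such as k-means would then be applied to it, as described above.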
Different methods define these loss functions differently. For example, the multi-view spectral clustering network [4] builds the view-specific loss on a precomputed similarity graph for each view, while the latent representation is the output of a parametric model w.r.t. the input; more specifically, the model is a learnable sub-network in [4] or a linear mapping in [7]. Kang et al. propose an objective that integrates graph construction and spectral clustering: the view-specific term couples the feature matrix and subspace representation of each view, the consistency term aligns the per-view partition results (derived from each view's graph Laplacian) with a consensus cluster indicator matrix, and a regularizer is used to avoid overfitting. More recently, Peng et al. propose to cluster multi-view data with automatic parameter learning; the method defines the view-specific loss as a combination of a view-specific reconstruction loss and a geometric consistency term over the connection graph of each view, while the inter-view loss enforces cross-view cluster-assignment consistency.
Besides, some subspace clustering methods share another common formulation. For example, Li et al. define a view-specific loss whose first part applies a reconstruction loss to alleviate the effect of noise, and whose second part learns a common latent representation by using it to reconstruct all view-specific representations, with a weighting parameter for each view's network. The inter-view term lets a common subspace representation reveal the subspace structure of the latent representation, and a nuclear-norm regularizer is used to guarantee high within-class homogeneity.
3.2 Multi-view Classification
Multi-view classification (MVC) aims to build a classification model from the given multiple views of the training data. In the test stage, all available views of each testing instance are also considered to make the final prediction. Different from multi-view subspace clustering, multi-view classification usually involves a single-step rather than a two-step optimization, by utilizing the supervised information to guide the representation learning.
To regularize the representation learning, most methods employ a single quotient to incorporate the supervised information across views. As a result, their main difference lies in the choice of the cross-view loss, which can be generalized as a quotient between the within-class compactness and the between-class scatter, both computed across all views. For example, Kim et al. extend canonical correlation analysis (CCA) [1] to discriminative CCA, wherein the within-class correlation is maximized while the between-class correlation is simultaneously minimized in the learned common space. By adopting Fisher discriminant analysis, Diethe et al. extend this idea by explicitly using the label information.
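The quotient criterion above can be sketched concretely for 1-D features pooled across views. This is only an illustration of the scatter quotient itself, not an implementation of discriminative CCA.

```python
# Sketch of the quotient criterion: between-class scatter divided by
# within-class compactness, computed for 1-D features. Illustrative only.

def scatter_quotient(features, labels):
    classes = sorted(set(labels))
    mean_all = sum(features) / len(features)
    s_w = s_b = 0.0
    for c in classes:
        xs = [f for f, l in zip(features, labels) if l == c]
        mu = sum(xs) / len(xs)
        s_w += sum((x - mu) ** 2 for x in xs)   # within-class compactness
        s_b += len(xs) * (mu - mean_all) ** 2   # between-class scatter
    return s_b / s_w                            # larger = more discriminative
```

A discriminative multi-view method would seek a projection of each view that maximizes such a quotient in the common space.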
In addition, another group of methods incorporates the supervised information into the classification loss rather than the representation learning loss.² For example, Zhang et al. propose a reconstruction loss that encodes all partial views into a shared representation, with an indicator of the view availability of each data point, together with a clustering-like loss that learns a structured representation in a common space. Xiao et al. use the cross-entropy loss for multi-view action recognition, where the labels are predicted by the network. ²As suggested in [10], classification can also be treated as a special case of representation learning.
In the positive-unlabeled learning task, Zhou et al. propose a co-regularization framework that constructs the cross-view loss so that the density-ratio functions of the two views agree on the same instance. The classification loss is defined as the unconstrained least-squares importance fitting objective, where the density ratio is taken between the positive samples and all samples, and an indicator denotes the positive annotation.
4 Heterogeneous Transfer Learning
Heterogeneous Transfer Learning (HTL) aims to improve the performance on a target domain by exploiting auxiliary labeled source-domain data whose feature space differs from that of the target domain. To bridge the source and target domains, most existing HTL methods require either cross-domain correspondences or shared labels to learn the mappings, and most of them can be summarized into a common formulation.
4.1 Cross-lingual Text Classification
Cross-lingual text classification (CLTC) refers to the task of classifying documents in different languages into the same taxonomy of categories. Text classification relies heavily on manually annotated training data, which is expensive to create. Therefore, it is promising to explore how to use training data given in only one source language (i.e., the source domain) to classify text written in a different target language (i.e., the target domain) with no or only a small number of labeled data. Most existing studies minimize the feature differences between data in the two languages and conduct translation by learning transformations based on either labels or one-to-one correspondences.
Most existing CLTC methods can be roughly categorized into two groups. The first group usually assumes that a number of cross-lingual correspondences, such as cross-lingual translation pairs, are available. With these correspondences, some works adopt an autoencoder loss as the domain discrepancy loss, which aligns the hidden representations of the source and target domains. More specifically, Zhou et al. [18] formulate the cross-domain mapping as a linear mapping, where the representation is learned from a linear reconstruction objective [11] or a marginalized stacked denoising autoencoder [18]; the loss on the source domain is defined as a stacked denoising autoencoder loss over the union of the source-domain data matrix and its corrupted versions. Zhou et al. further extend the linear cross-domain mapping to a nonlinear mapping realized by a multi-layer perceptron. Another group of methods projects the heterogeneous feature spaces into a higher-dimensional space [17] so that the features from the two domains become homogeneous, via techniques such as the Maximum Mean Discrepancy (MMD).
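The MMD term mentioned above can be sketched directly. The following is a simple biased V-statistic estimator of squared MMD with an RBF kernel over 1-D samples; the kernel choice and bandwidth are illustrative.

```python
import math

# Biased (V-statistic) estimate of MMD^2 between two 1-D samples with an
# RBF kernel. A small MMD^2 means the two domains look alike under the kernel.

def rbf(a, b, gamma=1.0):
    return math.exp(-gamma * (a - b) ** 2)

def mmd2(X, Y, gamma=1.0):
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / len(X) ** 2
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / len(Y) ** 2
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy
```

In HTL, such an estimator would be evaluated on the projected source and target features and minimized to make the two domains homogeneous.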
The other branch of methods assumes that only a few annotated target data points are given, and constructs the domain classification loss by directly exploiting the label information of the two domains, deploying loss functions on both domains. For example, Xiao et al. deploy the squared loss on both domains, where the prediction function is represented by a kernel matrix together with a kernel matching matrix. Duan et al. propose the heterogeneous feature augmentation method, which augments the homogeneous common features via an SVM-style approach with the help of the labels; the domain losses are defined to be hinge losses.
4.2 Low-Resource Sequence Labeling
Sequence labeling tasks such as named entity recognition (NER) and part-of-speech (POS) tagging are fundamental problems in natural language processing (NLP). Recently, a variety of deep models have been proposed for sequence labeling, which generalize well to new entities by automatically learning features from the data. However, when the annotated corpus is small, especially in the low-resource scenario, the performance of these methods degrades significantly because the hidden representations cannot be learned adequately. To enable knowledge transfer for low-resource sequence labeling (LRSL), recent methods such as [3] build cross-resource word embeddings, which bridge the low and high resources. To implement the domain loss, most existing works on sequence labeling use a linear-chain model based on a first-order Markov structure, i.e., the linear-chain conditional random field (CRF).
For example, Zhou et al. employ a CRF as the label decoder to induce a probability distribution over label sequences conditioned on the word-level latent features. In the decoder, there are two kinds of cliques: local cliques, which correspond to the individual elements of the sequence, and transition cliques, which reflect the evolution of states between two neighboring elements. With a transition distribution over adjacent labels, the linear-chain CRF is written as a product of clique potentials divided by a normalization term that sums over all possible label sequences. The model is optimized by maximizing this conditional log-likelihood, which acts as the domain discriminative loss over the word-level representations of input words from the source and target domains.
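The CRF normalization term can be computed exactly with the forward algorithm. The sketch below uses arbitrary toy emission/transition scores; it illustrates the standard linear-chain computation rather than any specific paper's model.

```python
import math
from itertools import product

# Forward algorithm for a linear-chain CRF: log_partition computes log Z,
# the log-sum over all label sequences of emission + transition scores.

def log_partition(emis, trans):
    """emis: [T][K] emission scores; trans: [K][K] transition scores."""
    K = len(emis[0])
    alpha = list(emis[0])                        # log-potentials at t = 0
    for t in range(1, len(emis)):
        alpha = [
            emis[t][j] + math.log(sum(math.exp(alpha[i] + trans[i][j])
                                      for i in range(K)))
            for j in range(K)
        ]
    return math.log(sum(math.exp(a) for a in alpha))

def sequence_score(ys, emis, trans):
    """Unnormalized log-score of one label sequence ys."""
    s = emis[0][ys[0]]
    for t in range(1, len(ys)):
        s += trans[ys[t - 1]][ys[t]] + emis[t][ys[t]]
    return s
```

The conditional log-likelihood maximized during training is then `sequence_score(gold, emis, trans) - log_partition(emis, trans)`.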
To model the domain discrepancy loss, adversarial discriminators have been introduced with great success in low-resource sequence labeling tasks. For example, one recent work proposes the generalized resource-adversarial discriminator (GRAD) as a domain discrepancy loss that enables adaptive weights for each sample. As a result, the imbalance in training size between the two domains can be largely alleviated. Most existing adversarial discriminators can be understood within the GRAD formulation,
where indicator functions denote whether a sentence comes from the high-resource (source) or the low-resource (target) corpus; a weighting factor balances the loss contributions of the high and low resources, and a focusing parameter controls the contribution of individual samples by measuring the discrepancy between the prediction and the true label (easy samples contribute less).
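A minimal sketch of such a resource-weighted, focal-style discriminator loss follows. The function name, default values, and exact weighting scheme are illustrative assumptions in the spirit of GRAD, not the paper's exact formulation.

```python
import math

# Focal-style discriminator loss: alpha balances high/low-resource
# contributions, and (1 - p)^gamma down-weights easy (confident) samples.

def resource_adv_loss(p_high, is_high, alpha=0.5, gamma=2.0):
    """p_high: predicted prob. the sentence is high-resource;
    is_high: 1 if it truly is high-resource, else 0."""
    p_true = p_high if is_high else 1.0 - p_high   # prob. of the true resource
    weight = alpha if is_high else 1.0 - alpha     # resource balancing
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)
```

Confidently correct predictions yield a near-zero loss, so training focuses on samples the discriminator still gets wrong.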
5 Learning Using Privileged Information
Different from heterogeneous transfer learning, where either unlabeled cross-domain correspondences or labels for both domains are given, Learning Using Privileged Information (LUPI) is a setting proposed by Vapnik and Vashist. In brief, LUPI assumes that triplets of input, privileged information, and label are available in the training phase, but only the input is available in the testing phase, i.e., testing proceeds without access to the privileged information (PI). This setting has been extended to application scenarios such as image classification [6] and information retrieval [19]. Mathematically, the objective functions of these studies share a common formulation,
in which a task loss is combined with a PI regularization loss defined on the model output and the privileged information. Different from heterogeneous transfer learning or multi-view learning, which directly use the cross-domain correspondences as input to minimize a cross-domain loss, LUPI leverages the privileged information to shape the task loss and achieve better generalization.
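The combined objective can be sketched as follows. The squared-loss choices and the trade-off parameter are illustrative assumptions; real LUPI methods instantiate both terms differently per task.

```python
# Illustrative LUPI-style objective: the usual task loss on (prediction,
# label) plus a PI term encouraging predictions to match scores computed
# from privileged information (available only during training).

def lupi_objective(preds, labels, pi_scores, lam=0.5):
    task = sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(preds)
    pi_reg = sum((p - s) ** 2 for p, s in zip(preds, pi_scores)) / len(preds)
    return task + lam * pi_reg
```

At test time only the prediction model is used; the PI term exists solely to guide training.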
5.1 Multi-Object Recognition with PI
Multi-object recognition is usually recast as a multi-instance multi-label (MIML) learning problem. Specifically, MIML assumes that the training data consist of bags, where each bag contains a number of instances with instance-level label vectors, and the bag-level label set contains the labels associated with the bag. In multi-object recognition, additional information such as bounding boxes, image captions, and descriptions is often available during the training phase, which is referred to as privileged information (PI). Based on this observation, Yang et al. propose using bounding boxes and image captions as PI to improve multi-object recognition (MORPI). Formally, for each training bag there exists a privileged bag containing the corresponding privileged instances. The task loss refers to the MIML-FCN loss, in which the relation between instance-level labels and bag-level labels is expressed by pooling the instance-level predictions (e.g., via a max operator); the bag-level prediction is generated from a fully convolutional network, and the loss can be the squared loss or a ranking loss for multi-object recognition.
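The instance-to-bag relation can be sketched as follows, assuming the common MIML convention of max-pooling instance scores per class (an assumption here, since the exact pooling is method-specific).

```python
# Sketch of the MIML instance/bag relation: a bag's score for each class
# is the max over its instances' (toy) scores; the bag label thresholds it.

def bag_scores(instance_scores):
    """instance_scores: [n_instances][n_classes] -> per-class bag scores."""
    n_classes = len(instance_scores[0])
    return [max(inst[c] for inst in instance_scores) for c in range(n_classes)]

def bag_labels(instance_scores, thresh=0.5):
    return [1 if s > thresh else 0 for s in bag_scores(instance_scores)]
```

With such a pooling, any loss on bag-level labels (squared or ranking) back-propagates to the instance-level predictor.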
To utilize the privileged information, the PI regularization loss is modeled on the output of the network for an input privileged bag.
5.2 Hashing with PI
Learning to hash aims to learn binary codes, each consisting of a sequence of bits, from a specific dataset, so that the nearest-neighbor search result in the hash coding space is as close as possible to the search result in the original space. Most existing learning-to-hash methods assume that there are sufficient data, either labeled or unlabeled, in the domain of interest (i.e., the target domain) for training. However, this assumption cannot be satisfied in some applications. To address this problem, Zhou et al. [19, 20] apply LUPI to learning to hash, termed HPI. In these works, the task loss takes the form of a quantization error,
in which an orthogonal projection matrix maps the privileged data matrix to its binary code matrix; the projection is defined as a linear mapping in [19] and as a nonlinear neural network in [20]. There are additional regularizers that further constrain the hash codes. For example, Zhou et al. [19] incorporate the graph structure into the formulation through a graph Laplacian matrix precomputed on the target domain.
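The quantization error itself can be sketched concretely: project the data, binarize with a sign function, and measure the squared distance between the codes and the projections. The tiny matrices below are illustrative.

```python
# Sketch of a quantization error for hashing: B = sign(X P), and the
# loss is || B - X P ||_F^2. Pure-Python 2-D example, illustrative only.

def matmul(X, P):
    return [[sum(x[k] * P[k][j] for k in range(len(P)))
             for j in range(len(P[0]))] for x in X]

def sign(v):
    return 1.0 if v >= 0 else -1.0

def quantization_error(X, P):
    XP = matmul(X, P)
    B = [[sign(v) for v in row] for row in XP]   # binary codes
    return sum((b - v) ** 2 for rb, rv in zip(B, XP)
               for b, v in zip(rb, rv))
```

Minimizing this error over the (orthogonality-constrained) projection drives the real-valued projections toward the binary codes, which is the essence of the task loss above.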
6 Heterogeneous Multi-task Learning
Multi-Task Learning (MTL) is a learning paradigm that aims to leverage the information contained in multiple related tasks so that the generalization performance of all the tasks is improved. Different from traditional MTL, where the tasks are of only one type, heterogeneous multi-task learning (HMTL) involves different types of tasks, including supervised learning, unsupervised learning, reinforcement learning, and so on [15]. Formally, HMTL can usually be formulated as a minimization over the weight parameters of the learner for each task's dataset. The relationship across tasks can be captured either explicitly, through a cross-task loss, or implicitly, by constraining different tasks to share the same weight parameters, thus transferring knowledge across tasks. Either sharing the parameter constraint space or adding an explicit loss can facilitate knowledge sharing among tasks. To avoid overfitting of the learners, a regularization term is usually introduced into the objective function.
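The implicit, shared-parameter variant can be sketched with hard parameter sharing: one shared feature map feeds two heterogeneous heads (a regression task and a classification task). All function names, losses, and shapes are illustrative.

```python
import math

# Hard parameter sharing sketch for HMTL: a shared scalar feature map
# feeds a regression head (squared loss) and a classification head
# (logistic loss); the objective sums per-task losses plus an L2 penalty.

def shared_features(x, w_shared):
    return math.tanh(w_shared * x)            # shared representation

def hmtl_objective(data_reg, data_clf, w_shared, w_reg, w_clf, lam=0.01):
    loss = 0.0
    for x, y in data_reg:                     # task 1: regression
        loss += (w_reg * shared_features(x, w_shared) - y) ** 2
    for x, y in data_clf:                     # task 2: classification, y in {-1, +1}
        z = w_clf * shared_features(x, w_shared)
        loss += math.log(1.0 + math.exp(-y * z))
    return loss + lam * (w_shared ** 2 + w_reg ** 2 + w_clf ** 2)
```

Because both heads back-propagate through `w_shared`, knowledge from one task shapes the representation used by the other, which is the mechanism the text describes.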
To improve sequence labeling performance, Zhou et al. [16] consider fully annotated data, incompletely annotated data, and unsupervised data in a unified framework. Specifically, the task loss of this work combines the fully supervised and semi-supervised CRF losses on the fully and incompletely annotated data, respectively, with an unsupervised loss realized by an autoencoder-like structure whose parameters are defined by a softmax function. In [16], the task relationship is implicitly learned through shared parameters that live in the same space for all tasks, and a regularizer is adopted to avoid overfitting.
To model the relationship between different tasks, different methods adopt different losses. For example, in our recent work [14], the task relationship is regularized by minimizing the negative correlation among the learned task-specific neural networks. Zhou et al. propose to learn a sparse transformation matrix between two heterogeneous domains by exploiting the commonality of multiple binary classification tasks: a linear mapping minimizes the difference between the target binary classifier weights and the transformed classifier weights over the tasks, and a regularizer enforces sparsity on each row of the transformation matrix.
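The negative-correlation regularizer mentioned above can be sketched for an ensemble of task networks with scalar outputs. This follows the classic negative correlation learning penalty; its use here is illustrative, not the exact loss of [14].

```python
# Sketch of a negative-correlation penalty over ensemble members: each
# member's deviation from the ensemble mean is multiplied by the summed
# deviations of the others, and the products are accumulated.

def ncl_penalty(outputs):
    """outputs: [n_members][n_samples] predictions of each member."""
    n = len(outputs)
    penalty = 0.0
    for s in range(len(outputs[0])):
        mean = sum(m[s] for m in outputs) / n
        for i in range(n):
            others = sum(outputs[j][s] - mean for j in range(n) if j != i)
            penalty += (outputs[i][s] - mean) * others
    return penalty
```

Adding this penalty to the per-task losses rewards diversity: members that all deviate from the mean in the same direction are penalized, while decorrelated members drive the penalty negative.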
7 Discussion and Future Direction
In this paper, we provide a unified HRL framework to mathematically understand over a dozen learning tasks/problems across multiple areas. We analyze in depth the shared and distinct loss terms of more than ten learning tasks that are popular in machine learning, multimedia analysis, computer vision, data mining, and natural language processing. We believe that such a unified view will benefit the AI community in both industry and academia, from literature review to future directions. In addition to the aforementioned applications, other interesting applications, such as person re-identification, translation, visual question answering, video captioning, text-to-image generation, and point matching, could also be revisited in our HRL framework. Despite the recent advances, we think several important and fundamental challenges of HRL deserve more attention in future research.

Theoretical analysis of heterogeneous tasks/domains. Many experimental studies show that incorporating heterogeneous tasks/domains can boost performance. However, there is little theoretical analysis of the relationship between the performance improvement and the number of training data/tasks. A quantitative metric is also expected to measure the degree of contribution of different domains to the final performance.

Universal pre-trained models for HRL. Pre-training for a single domain/modality has been well studied in the last decade, with models such as ResNets and word2vec. Recently, increasing attention has been paid to pre-trainable generic representations, such as ViLBERT [8] for visual-linguistic tasks and M-BERT [2] for multilingual tasks. However, research on pre-trained models for other applications/tasks, such as audio-visual and video-linguistic tasks, is still at an early stage.

Heterogeneous feature generation. This paper discusses the above learning tasks from the perspective of feature representation or domain matching. However, some related tasks, such as image/video captioning and image/text style transfer, require the model to generate heterogeneous features for out-of-domain data.
References
 [1] (2009) Multi-view clustering via canonical correlation analysis. In ICML, pp. 129–136. Cited by: §3.2.
 [2] (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. Cited by: 2nd item.
 [3] (2017) Model transfer for tagging low-resource languages using a bilingual dictionary. In ACL, pp. 587–593. Cited by: §4.2.
 [4] (2019) Multi-view spectral clustering network. In IJCAI, pp. 2563–2569. Cited by: §3.1.
 [5] (2019) Text classification algorithms: a survey. Information 10 (4), pp. 150. Cited by: §1.
 [6] (2018) Deep learning under privileged information using heteroscedastic dropout. In CVPR, pp. 8886–8895. Cited by: §5.
 [7] (2015) Large-scale multi-view spectral clustering via bipartite graph. In AAAI. Cited by: §3.1.
 [8] (2019) ViLBERT: pre-training task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, pp. 13–23. Cited by: 2nd item.
 [9] (2009) A survey on transfer learning. IEEE TKDE 22 (10), pp. 1345–1359. Cited by: §1.
 [10] (2019) Deep clustering with sample-assignment invariance prior. IEEE TNNLS, pp. 1–12. Cited by: footnote 2.
 [11] (2010) Transfer learning on heterogeneous feature spaces via spectral transformation. In ICDM, pp. 1049–1054. Cited by: §4.1.
 [12] (2017) A survey on learning to hash. IEEE TPAMI 40 (4), pp. 769–790. Cited by: §1.
 [13] (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §1.
 [14] (2019) Nonlinear regression via deep negative correlation learning. IEEE TPAMI. Cited by: §6.
 [15] (2017) A survey on multi-task learning. arXiv preprint arXiv:1707.08114. Cited by: §1, §6.
 [16] (2019) Learning with annotation of various degrees. IEEE TNNLS 30 (9), pp. 2794–2804. Cited by: §6.
 [17] (2016) Transfer learning for cross-language text categorization through active correspondences construction. In AAAI, pp. 2400–2406. Cited by: §4.1.
 [18] (2014) Hybrid heterogeneous transfer learning through deep learning. In AAAI, pp. 2213–2220. Cited by: §4.1.
 [19] (2016) Transfer hashing with privileged information. In IJCAI, pp. 2414–2420. Cited by: §5.2, §5.
 [20] (2018) Transfer hashing: from shallow to deep. IEEE TNNLS 29 (12), pp. 6191–6201. Cited by: §5.2.