Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an Algorithm

While many approaches exist in the literature to learn representations for data collections in multiple modalities, the generalizability of the learnt representations to previously unseen data is a largely overlooked subject. In this work, we first present a theoretical analysis of learning multi-modal nonlinear embeddings in a supervised setting. Our performance bounds indicate that for successful generalization in multi-modal classification and retrieval problems, the regularity of the interpolation functions extending the embedding to the whole data space is as important as the between-class separation and cross-modal alignment criteria. We then propose a multi-modal nonlinear representation learning algorithm that is motivated by these theoretical findings, where the embeddings of the training samples are optimized jointly with the Lipschitz regularity of the interpolators. Experimental comparison to recent multi-modal and single-modal learning algorithms suggests that the proposed method yields promising performance in multi-modal image classification and cross-modal image-text retrieval applications.

I Introduction

With the increasing accessibility of data acquisition and storage technologies, the need for successfully analyzing and interpreting multi-modal data collections has become more important. Many applications involve the acquisition or analysis of data collections available in multiple modalities. In some problems, the purpose is to fuse the information in different modalities to attain higher detection or classification accuracy than in a single modality. For instance, the integration of multi-modal medical data such as patient information, MRI scans and ECG recordings may lead to more accurate clinical decisions. Similarly, an image sample and a text sample extracted from the same web page can be seen as observations of the same data in two different modalities, which can be used together for the categorization of the web page. Meanwhile, some other applications require the retrieval of data samples in a certain modality while relevant query samples are provided in another modality. For instance, in an image-text cross-modal retrieval problem, a text sample may be provided as the query and one might be interested in retrieving image samples belonging to the same category as the query text sample, as in web image search applications. In this paper, we study the problem of learning supervised nonlinear representations for multi-modal classification and cross-modal retrieval applications.

Multi-modal learning algorithms often rely on computing joint representations for multi-modal data in a common domain, where the main challenge is to efficiently align data samples from different modalities without damaging the inherent geometry of the individual modalities. The main approaches in the literature are as follows. Subspace learning methods such as CCA [surveyOnMl] align different modalities via linear projections or transformations. Supervised variants of such linear embedding methods aim to enhance the separation between different data classes [gma], [jfssl] in addition to the alignment of different modalities. However, such linear embedding methods have limitations in challenging data sets where different modalities are weakly linked. In particular, when the data from different modalities have significantly dissimilar geometric structures, linear methods may fall short of learning effective joint representations since they are mostly restricted by the original geometry of the individual modalities.

Kernel extensions of linear methods provide nonlinear representations that may alleviate some of these shortcomings [surveyOnMl]; however, the resulting algorithms might still lack flexibility in certain problems. In particular, the suitability of the selected kernel type might vary largely depending on the structure of the data set and there is often no guarantee that the learnt embedding will perform well on the test data at hand.

In recent years, impressive performance has been attained in retrieval and classification problems with deep learning algorithms based on cross-modal CNNs and autoencoders [multimodalDeepLearning], [WeiZLWLZY17], [FengWL14]. While such methods can compute effective and powerful nonlinear representations, they typically require much larger data sets compared to subspace methods or their nonlinear kernel extensions, and their training complexity is significantly higher.

While the aforementioned multi-modal learning approaches might be preferable to each other depending on the setting, their capacity to generalize to novel test samples is a questionable issue in general. A multi-modal learning method may yield promising performance figures on training data, e.g., it may perfectly align different modalities and separate training samples from different classes, while its performance may be much lower on previously unseen test data. In particular, due to the limited capacity of the model they learn, linear subspace methods may be expected to have relatively close accuracy on training and test data. On the other hand, more sophisticated methods such as deep learning algorithms computing complex and rich models may suffer from overfitting if the amount of training data is insufficient. Although the learnt model fits to the characteristics of the training data very well, it might fail to generalize to previously unseen test data. In fact, the theoretical characterization of the generalization capability of multi-modal learning algorithms is a largely overlooked problem in the literature, despite its importance. The previous study [supervisedManifold] proposes generalization bounds on the performance of supervised nonlinear embedding algorithms; however, it treats the problem in a single modality. To the best of our knowledge, a mathematically rigorous study of the performance of supervised multi-modal embedding algorithms has not been proposed so far.

In this paper, we consider the problem of learning supervised nonlinear embeddings for multi-modal classification and cross-modal retrieval applications that can generalize well to new test data. Our main purpose in preferring a nonlinear embedding approach as opposed to linear subspace methods is to achieve a relatively high model capacity that can adapt to challenging data geometries. On the other hand, we adhere to a shallow data representation model consisting of a single-stage embedding as opposed to deep methods, in order to achieve applicability to settings with restricted availability of training data or limited computation budget. Our effort hence seeks a balance between the ease of training (which linear subspace methods have) and the richness and flexibility of nonlinear representation models (which deep learning methods have), while ensuring good generalizability to new test data.

Our study has two main contributions. We first propose a theoretical analysis of the problem of learning supervised multi-modal embeddings. We provide an extension of the results in [supervisedManifold] to the multi-modal setting. We consider a nonlinear embedding model where the training samples from different modalities are jointly mapped to a common lower-dimensional domain. The extension of the embedding to new test samples is achieved via Lipschitz-continuous interpolation functions that generalize the pointwise embeddings to the data space of each modality. Our theoretical bounds suggest that in order to attain good generalization performance in classification and retrieval applications, the multi-modal embedding of training samples should satisfy three conditions: (1) Different modalities should be aligned sufficiently well; (2) Different classes should be sufficiently well-separated from each other; (3) The geometric structure of each modality (captured through nearest neighborhoods) should be preserved. Then, under these conditions, our theoretical analysis shows that the embedding generalizes well to test data, provided that the Lipschitz constants of the interpolation functions are sufficiently low. This points to an important trade-off in learning nonlinear embeddings: Multi-modal methods may fail to generalize to test data if the nonlinear interpolation functions are too irregular, even if the embeddings of training samples exhibit good cross-modal alignment and between-class separation properties.

Our next contribution is to propose a new supervised nonlinear multi-modal learning algorithm. Motivated by the above theoretical findings, we formulate an optimization problem where a cross-modal alignment term and a between-class separation term for the embeddings of the training samples are jointly optimized with the Lipschitz constants of the interpolation functions that generalize the embeddings to the ambient space of each modality. The resulting objective function is minimized with an iterative optimization procedure, where the nonlinear embedding coordinates are learnt jointly with the Lipschitz-continuous interpolator parameters. Compared to existing approaches, our method has the advantage of providing more flexible representations than subspace methods thanks to the employed nonlinear models, while it entails a much more lightweight training phase compared to more elaborate approaches such as deep learning methods. We test the proposed algorithm in multi-view image classification and image-text cross-modal retrieval applications. Experimental results show that the proposed method yields quite satisfactory performance in comparison with recent multi-modal learning approaches.

The rest of the paper is organized as follows. In Section II, we overview the related literature. In Section III, we present a theoretical analysis of the multi-modal representation learning problem. In Section IV, we propose a supervised nonlinear multi-modal representation learning algorithm that is motivated by the theoretical findings of Section III. In Section V, we experimentally evaluate the performance of the proposed method, and in Section VI, we conclude.

II Related Work

The multi-modal (multi-view) learning approaches in the literature can be mainly grouped as co-training methods, multiple kernel learning algorithms, subspace learning-based approaches and deep learning methods. Co-training methods learn separate models in different modalities by encouraging their predictions to be similar [BlumM98]. The study in [coTrainingEM] improves the co-training algorithm using Expectation Maximization (EM) to assign probabilistic labels to unlabeled samples, where each modality classifier iteratively uses the probabilities of class labels. A probabilistic model for Support Vector Machine (SVM) is constructed in [coEM] based on the Co-EM approach. There also exist co-regression algorithms employing the co-training idea. A regression algorithm that uses two k-NN regressors is presented to learn appropriate labels for unlabeled samples in [coRegularization]. The co-training technique is also used in graph-based methods such as [bayesianCoTraining], where a Gaussian process model is used on an undirected Bayesian graph representation for all modalities. Co-training algorithms have been applied to the analysis of multi-modal data sets in various applications [fusionCotraining], [multimodalVideoCotraining], [DuanTTLH17].

Subspace learning methods are based on computing linear projections or transformations that suitably align samples from different modalities. The well-known unsupervised subspace learning algorithm CCA (Canonical Correlation Analysis) maximizes the correlation between different modalities [surveyOnMl]. Alternative versions of CCA such as cluster CCA [clusterCCA], multi-label CCA [multilabelCca] and three-view CCA [threeViewCca] have been proposed to improve the performance of CCA in various supervised tasks, all of which employ linear projections. In recent years, many supervised subspace methods have been proposed, which aim to enhance the between-class separation and cross-modal alignment when learning linear projections of data. The GMLDA (Generalized Multiview Analysis) method proposes a multi-modal extension of the LDA algorithm within this framework [gma]. The JFSSL (Joint Feature Selection and Subspace Learning) method additionally uses a joint graphical model for calculating projections with relevant and irrelevant features [jfssl]. Some other subspace learning methods propose solutions based on the metric learning [metricLearning] and matrix factorization [matrixFactorization] ideas.

Subspace learning methods have the advantage of involving relatively simple models; however, their performance may be poor in difficult data geometries where the distributions of the modalities are significantly different. Nonlinear methods offer more flexible representations in such cases. Nonlinear representations may follow from the kernel extensions of linear subspace methods. For instance, a nonlinear kernel extension of CCA (called Kernel CCA) can be found in [surveyOnMl] and [KussG03]. Some other methods are based on combining kernels in different modalities. Convex combinations of multiple Laplacian kernels are learnt in [ArgyriouHP05], while the power mean of the kernels of multilayer graphs is used in semi-supervised learning in [MercadoTH19]. The problem of learning multiple kernels has also been explored in [markKernels], [sdpKernel], [smoKernel], [largeScaleKernel], [groupLassoKernel]. Unlike kernel methods, some multi-view algorithms compute nonlinear embedding coordinates in a non-parametric manner. The multi-view learning algorithm [spectralEmbeddingMv] finds a nonlinear low-dimensional representation for multi-modal data based on spectral embeddings. The graph-based multi-modal clustering approach in [multimodalJointClustering] computes a nonlinear embedding while jointly estimating a clustering of the multi-view data.

With the evolution of computation techniques in recent years, many deep learning methods have been proposed for processing large multi-modal data sets. Deep multi-view autoencoders have been proposed in [multimodalDeepLearning], [FengWL14], [mmRetrievalDeepLearning] for learning a shared representation across different data modalities. The method in [deepMultimodalCrossWeights] learns cross-weights between the different modality layers of a stacked denoising autoencoder structure. Convolutional neural network structures are also widely used for alignment in multi-modal applications, where the visual modality features obtained with CNNs are combined with features of other modalities [WeiZLWLZY17], [CastrejonAVPT16], [multimodalDeepImageAnnotation]. The work in [multiViewImageClassification] proposes to train a separate classifier for each image modality using features generated by deep CNNs and then combine the classifier outputs of different views. GAN-type architectures are also used for adversarially training feature generators and domain discriminators across different modalities or domains [deepMultimodalRepresentation]. The method in [ZhaoDF17] proposes to learn a common latent representation for different modalities via a deep matrix factorization scheme.

Finally, some other previous works related to our study are the following. The theoretical analysis in [supervisedManifold] provides performance bounds for supervised nonlinear embeddings in a single modality. The idea in [supervisedManifold] is developed in this paper to perform a theoretical analysis for multi-modal embeddings. The previous work [nsse] proposes a supervised nonlinear dimensionality reduction algorithm via smooth representations, as in our work; however, it treats the embedding problem in a single modality. Lastly, a preliminary version of our work was presented in [KayaV19]. The current paper builds on [KayaV19] by including a theoretical analysis of the multi-modal learning problem and significantly extending the experimental results.

III Performance Bounds for Multi-Modal Learning with Supervised Embeddings

In this section, we first describe the multi-modal representation learning setting considered in this study and then present a theoretical analysis of multi-modal classification and retrieval with supervised embeddings.

III-A Notation and Setting

We consider a setting with $M$ data classes and $V$ modalities (also called views) such that a data sample $x$ has an observation $x^{(v)}$ in each modality (or view) $v = 1, \dots, V$. Let the data samples from each class $m$ in each modality $v$ be drawn from a probability measure $\nu_m^{(v)}$ on a Hilbert space $H^{(v)}$. We assume that the probability measure $\nu_m^{(v)}$ has a bounded support $\mathcal{M}_m^{(v)} \subset H^{(v)}$ for each $v$, and that the probability measures in different modalities are independent for each class $m$.

Let $X = \{x_i\}$ be a set of training samples such that each $i$-th training sample $x_i$ belongs to one of the classes $m \in \{1, \dots, M\}$. In each modality $v$, the observations $x_i^{(v)}$ of the training samples from each class $m$ are independent and identically distributed, drawn from the probability measure $\nu_m^{(v)}$. In this paper, we study a setting where the training samples from all modalities are embedded into a common Euclidean domain $\mathbb{R}^d$, such that each training sample $x_i^{(v)}$ from modality $v$ is mapped to a vector $y_i^{(v)} \in \mathbb{R}^d$. Although we do not impose any conditions on the dimension $d$ of the embedding, $d$ is typically small in many methods.

Focusing mainly on a scenario where the embedding is nonlinear in this work, we assume that the embedding of the training samples is extended to the whole data space through interpolation functions $f^{(v)}: H^{(v)} \to \mathbb{R}^d$, for $v = 1, \dots, V$, such that each training sample $x_i^{(v)}$ in a modality $v$ is mapped to its embedding as $f^{(v)}(x_i^{(v)}) = y_i^{(v)}$. We characterize the regularity of the interpolation functions with their Lipschitz continuity, which is defined as follows.

Definition 1.

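A function $f: H \to \mathbb{R}^d$ defined on a Hilbert space $H$ is Lipschitz continuous with constant $L > 0$ if for any $x_1, x_2 \in H$, the function satisfies $\|f(x_1) - f(x_2)\| \le L \, \|x_1 - x_2\|$.

The notation $\|\cdot\|$ will denote the usual norm in the space of interest (e.g. $\ell^2$-norm, or $L^2$-norm), unless stated otherwise. Now, for each modality $v$, let $B_\delta(x^{(v)})$ be an open ball of radius $\delta$ around the point $x^{(v)}$

$$ B_\delta(x^{(v)}) = \{ z^{(v)} \in H^{(v)} : \|x^{(v)} - z^{(v)}\| < \delta \}. $$

Then, for each class $m$, we define a parameter $\eta_{m,\delta}$, which is a lower bound on the measure of the open ball $B_\delta(x^{(v)})$ around any point from class $m$ in any modality

$$ \eta_{m,\delta} := \min_{v = 1, \dots, V} \ \inf_{x^{(v)} \in \mathcal{M}_m^{(v)}} \nu_m^{(v)}\big( B_\delta(x^{(v)}) \big). $$

In the following, $C(\cdot)$ denotes the class label of a sample, $|\cdot|$ refers to the cardinality of a set, the notation $x^{(v)} \sim \nu_m^{(v)}$ means that the sample $x^{(v)}$ is drawn from the distribution $\nu_m^{(v)}$, $P(\cdot)$ denotes the probability of an event, and $\|\cdot\|_F$ denotes the Frobenius norm. The notation $\mathrm{tr}(\cdot)$ stands for the trace of a matrix, and $A_{ij}$ indicates the entry of a matrix $A$ in the $i$-th row and the $j$-th column.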

III-B Theoretical Analysis of Classification and Retrieval Performance

We now present performance bounds for the multi-modal classification problem and the cross-modal retrieval problem.

III-B1 Multi-Modal Classification Performance

Let $x$ be a test sample with an observation $x^{(v)}$ available in a specific modality $v$. Denoting the true class of $x$ by $m$, we assume that the observation $x^{(v)}$ of the test sample is drawn from the probability measure $\nu_m^{(v)}$ independently of the training samples.

We consider a classification setting where the class label of $x$ is estimated by first embedding $x^{(v)}$ into $\mathbb{R}^d$ as $f^{(v)}(x^{(v)})$ through the interpolator $f^{(v)}$ learnt using the training samples. Then the estimate of the class label of $x$ is found via nearest-neighbor classification in $\mathbb{R}^d$ over the embeddings of the training samples from all modalities $u = 1, \dots, V$. (We adopt the notation $C(x_i)$ instead of $C(x_i^{(v)})$ for class labels, as the observation of a sample in any modality has the same class label.) Hence, the class label of the test sample is estimated as $\hat{C}(x) = C(x_{i^*})$, where

$$ i^* = \arg\min_{i} \ \min_{u = 1, \dots, V} \| y_i^{(u)} - f^{(v)}(x^{(v)}) \|. \tag{1} $$
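
As an illustration of the nearest-neighbor rule in (1), the following minimal Python sketch classifies an already embedded test sample. The function name `classify`, the nested-list data layout, and the toy coordinates are assumptions made here for illustration only; the embedding of the query is taken as given rather than computed by an interpolator.

```python
import math

def classify(query_embedding, train_embeddings, train_labels):
    """Rule (1): return the label of the training embedding closest to the query.

    train_embeddings[u][i] plays the role of y_i^(u), the embedding of
    training sample i in modality u; train_labels[i] plays the role of C(x_i),
    which is shared by all modalities of sample i.
    """
    best_index, best_distance = None, math.inf
    for modality in train_embeddings:        # search over modalities u = 1..V
        for i, y in enumerate(modality):     # search over training samples i
            distance = math.dist(y, query_embedding)
            if distance < best_distance:
                best_index, best_distance = i, distance
    return train_labels[best_index]

# Hypothetical toy data: two samples, two modalities, embeddings in R^2.
train_embeddings = [
    [(0.0, 0.0), (4.0, 4.0)],  # modality 1
    [(0.2, 0.1), (3.9, 4.2)],  # modality 2
]
train_labels = ["a", "b"]
print(classify((3.5, 3.8), train_embeddings, train_labels))
```

The joint search over samples and modalities is what lets a query observed in one modality be matched against training embeddings from any other modality.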

Before stating our main result, we first present the following lemma.

Lemma 1.

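Let the training sample set $X$ contain at least $N_m$ training samples from class $m$, whose observations $x_i^{(u)}$ with $x_i^{(u)} \sim \nu_m^{(u)}$ are available in all modalities $u = 1, \dots, V$. Assume that the interpolation function $f^{(u)}$ in each modality $u$ is Lipschitz continuous with constant $L$.

Let $x$ be a test sample from class $m$ with an observation $x^{(v)}$ given in modality $v$, drawn with respect to the probability measure $\nu_m^{(v)}$ independently of the training samples. Let $x^{(u)}$ be the observation of the same sample in an arbitrary modality $u$, which need not be available to the learning algorithm. For an arbitrary modality $u$, define $A^{(u)}$ as the set of the training samples from class $m$ within a $\delta$-neighborhood of $x^{(u)}$ in $H^{(u)}$

$$ A^{(u)} = \{ x_i^{(u)} : x_i \in X, \ C(x_i) = m, \ x_i^{(u)} \in B_\delta(x^{(u)}) \}. $$

Assume that for some $Q$ and $\delta$, the number $N_m$ of training samples from class $m$ satisfies

$$ N_m > \frac{Q}{\eta_{m,\delta}}. $$

Then for any $\epsilon > 0$, with probability at least

$$ 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right) - (1 - \eta_{m,\delta})^Q, $$

the set $A^{(u)}$ contains at least $Q$ samples, the distance between $f^{(u)}(x^{(u)})$ and the sample mean of the embeddings of its neighboring training samples is bounded as

$$ \left\| f^{(u)}(x^{(u)}) - \frac{1}{|A^{(u)}|} \sum_{x_i^{(u)} \in A^{(u)}} f^{(u)}(x_i^{(u)}) \right\| \le L \delta + \sqrt{d} \, \epsilon, \tag{2} $$

and also there is at least one $x_i^{(u)} \in A^{(u)}$ such that its observation $x_i^{(v)}$ in modality $v$ satisfies $x_i^{(v)} \in B_\delta(x^{(v)})$.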

Lemma 1 is proved in the Appendix. The purpose of Lemma 1 is to see how much the embedding of a test sample through a Lipschitz-continuous interpolator is expected to deviate from the average embedding of the training samples surrounding it. Lemma 1 provides a probabilistic upper bound on this deviation, which is used in Theorems 1 and 2 for bounding the classification and retrieval errors. Note that the classification algorithm knows the observation of the test sample only in modality $v$, and classifies it through its embedding $f^{(v)}(x^{(v)})$ with respect to the rule in (1). The entity $x^{(u)}$ in the lemma denotes a hypothetical observation of $x$ in an arbitrary modality $u$. Although we conceptually refer to $x^{(u)}$ in the derivations, it is not known to the classification algorithm in practice (unless $u = v$).
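
To get a feel for how the probability in Lemma 1 behaves, the sketch below evaluates its three-term lower bound numerically. It reads the flattened expressions of the lemma as $N_m > Q / \eta_{m,\delta}$ and $(1 - \eta_{m,\delta})^Q$; the function name, argument names, and the numerical values are hypothetical choices made for illustration and do not come from the experiments.

```python
import math

def lemma1_probability(n_m, eta_ball, q, d, eps, lipschitz, delta):
    """Evaluate the probability lower bound of Lemma 1.

    n_m: number of training samples N_m from the class
    eta_ball: ball-measure lower bound eta_{m,delta}
    q: the parameter Q; d: embedding dimension
    eps, delta: the parameters epsilon and delta; lipschitz: the constant L
    """
    if n_m * eta_ball <= q:
        raise ValueError("Lemma 1 requires N_m > Q / eta_{m,delta}")
    term_count = math.exp(-2.0 * (n_m * eta_ball - q) ** 2 / n_m)
    term_mean = 2.0 * d * math.exp(-q * eps ** 2 / (2.0 * lipschitz ** 2 * delta ** 2))
    term_cross = (1.0 - eta_ball) ** q
    # The result can be negative, in which case the bound is vacuous.
    return 1.0 - term_count - term_mean - term_cross

# Hypothetical values, with Q chosen proportionally to N_m.
bound = lemma1_probability(n_m=2000, eta_ball=0.1, q=100, d=2,
                           eps=1.0, lipschitz=1.0, delta=0.1)
print(bound)
```

Scaling `n_m` and `q` together shrinks all three subtracted terms, which is the exponential behavior discussed after Theorem 1.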

In the following theorem, we present our main result for multi-modal classification with supervised embeddings.

Theorem 1.

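Let the training sample set $X$ contain at least $N_m$ training samples from class $m$, whose observations $x_i^{(u)}$ with $x_i^{(u)} \sim \nu_m^{(u)}$ are available in all modalities $u = 1, \dots, V$. Let $\{y_i^{(v)}\}$ be an embedding of $X$ in $\mathbb{R}^d$ with the following properties

$$
\begin{aligned}
&\text{(P1)} \quad \| y_i^{(v)} - y_i^{(u)} \| \le \eta \ \text{ for all training samples } x_i \text{ and for all } v, u \in \{1, \dots, V\} \\
&\text{(P2)} \quad \| y_i^{(u)} - y_j^{(u)} \| \le R_\delta \ \text{ for all } u \in \{1, \dots, V\}, \text{ if } \| x_i^{(u)} - x_j^{(u)} \| \le 2\delta \text{ and } C(x_i) = C(x_j) \\
&\text{(P3)} \quad \| y_i^{(v)} - y_j^{(u)} \| > \gamma \ \text{ for all } v, u \in \{1, \dots, V\}, \text{ if } C(x_i) \ne C(x_j)
\end{aligned}
$$

where $\eta$ and $\gamma$ are some constants and $R_\delta$ is a $\delta$-dependent constant. Assume that the interpolation function $f^{(v)}$ in each modality $v$ is a Lipschitz continuous function with constant $L$ such that for some parameters $\epsilon$ and $\delta$, the following inequality is satisfied

$$ 6 L \delta + 2 \sqrt{d} \, \epsilon + 2 R_\delta + 2 \eta \le \gamma. \tag{3} $$

Then for some $Q$, if the number of training samples is such that

$$ N_m > \frac{Q}{\eta_{m,\delta}}, \tag{4} $$

the probability of correctly classifying a test sample $x$ from class $m$ observed as $x^{(v)}$ in modality $v$ via the nearest neighbor classification rule in (1) is lower bounded as

$$ P\big( \hat{C}(x) = m \big) \ge 1 - \left[ \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) + 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right) + (1 - \eta_{m,\delta})^Q \right]^V. \tag{5} $$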

The proof of Theorem 1 is given in the Appendix. The theorem intuitively states the following: First, (P1), (P2), and (P3) define the properties that the embedding should have, which are illustrated in Figure 1. (P1) requires the observations $x_i^{(v)}$, $x_i^{(u)}$ of the same training sample in two different modalities to be mapped to nearby points in the common domain of embedding, so that the distance between their embeddings does not exceed some threshold $\eta$. This property imposes that different modalities be well aligned through the learnt embedding. The property (P2) indicates that two nearby samples from the same modality and the same class should be mapped to nearby points, so that a distance of $2\delta$ in the original domain is mapped to a distance of at most $R_\delta$ in the domain of embedding, where $R_\delta$ is a constant depending on $\delta$. This can be seen as a condition for the preservation of the local geometry of each modality within the same class. Lastly, the property (P3) requires samples from different classes to be separated by a distance of at least $\gamma$ in the domain of embedding, regardless of their modality. Here, the parameter $\gamma$ can be seen as a separation margin between different classes in the learnt embedding.

If the embedding of the training samples has these properties, supposing that the condition in (3) is satisfied, Theorem 1 guarantees that the probability of correctly classifying a test sample from some class $m$ approaches 1 at an exponential rate as the number $N_m$ of training samples from that class increases. This can be verified by observing that $Q$ should be chosen proportionally to the parameter $N_m$ as seen in (4), in which case every term inside the brackets in (5) decays exponentially in $N_m$. Here, an important observation is that as the number of modalities $V$ increases, the correct classification probability improves at an exponential rate. This confirms that the multi-modal learning algorithm can successfully fuse the information obtained from different modalities for improving the classification performance.

Finally, a crucial implication of Theorem 1 is that the condition in (3) must be satisfied in order to achieve high classification accuracy. The condition (3) is quite central to our study and it will be of importance when proposing an algorithm in Section IV. It states that a certain compromise must be sought between the Lipschitz regularity of the interpolator and the separation between different classes: When learning nonlinear embeddings, the separation $\gamma$ between training samples from different classes should be adjusted in a way to allow the existence of a sufficiently regular interpolator, so that $L$ remains sufficiently small. While an embedding with too small a $\gamma$ value would fail to satisfy the condition (3), increasing $\gamma$ too much would result in a highly irregular warping of the training samples, which typically leads to an increase in the magnitude of the interpolator parameters. This results in an interpolator with poor Lipschitz regularity with a large $L$ value, where the condition (3) would fail again. Hence, the condition (3) points to how the separation margin $\gamma$ and the interpolator regularity $L$ should be jointly taken into account when learning an embedding with good generalization properties.
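
The trade-off expressed by condition (3) can be made concrete with a small numerical check. The sketch below tests whether a given set of constants satisfies (3) and solves (3) for the largest admissible Lipschitz constant. All function names, argument names, and numbers are hypothetical and serve only as an illustration under the reading of (3) as $6L\delta + 2\sqrt{d}\epsilon + 2R_\delta + 2\eta \le \gamma$.

```python
import math

def condition_3_holds(lipschitz, delta, d, eps, r_delta, eta_align, gamma):
    """Check 6*L*delta + 2*sqrt(d)*eps + 2*R_delta + 2*eta <= gamma."""
    left = (6.0 * lipschitz * delta + 2.0 * math.sqrt(d) * eps
            + 2.0 * r_delta + 2.0 * eta_align)
    return left <= gamma

def max_lipschitz(delta, d, eps, r_delta, eta_align, gamma):
    """Largest L for which condition (3) holds, obtained by solving (3) for L.

    A negative result means no interpolator can satisfy (3) for this margin.
    """
    slack = gamma - 2.0 * math.sqrt(d) * eps - 2.0 * r_delta - 2.0 * eta_align
    return slack / (6.0 * delta)

# Hypothetical constants of an embedding.
params = dict(delta=0.1, d=2, eps=0.05, r_delta=0.2, eta_align=0.1, gamma=2.0)
l_max = max_lipschitz(**params)
print(l_max, condition_3_holds(lipschitz=1.0, **params))
```

For a fixed margin, tighter alignment (smaller `eta_align`) and better preserved local geometry (smaller `r_delta`) leave more room for the Lipschitz constant of the interpolator.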

III-B2 Cross-Modal Retrieval Performance

Next, we analyze the performance of cross-modal retrieval via supervised embeddings. Given the multi-modal data set $X$, where each data sample belongs to one of the classes $1, \dots, M$, we formally define the retrieval problem as follows. Let $x$ be a query test sample observed in modality $v$. We study a cross-modal retrieval setting where the purpose is to retrieve samples from a certain modality $u$ that are “relevant” to the query sample from modality $v$. We consider two samples to be relevant if they belong to the same class.

Denoting the modality of the query sample by $v$ and the modality of the retrieved samples by $u$, we consider a retrieval strategy that returns the $K$ most relevant samples to the query sample, based on the distance of the samples in the domain of embedding. Hence, given the query sample $x^{(v)}$, it is first embedded into $\mathbb{R}^d$ as $f^{(v)}(x^{(v)})$ via the interpolator $f^{(v)}$; and then the $K$ training samples from modality $u$ whose embeddings have the smallest distance to $f^{(v)}(x^{(v)})$ are retrieved as the most relevant samples, thus returning the set $\{x_{i_k}^{(u)}\}_{k=1}^{K}$, where

$$
\begin{aligned}
i_1 &= \arg\min_{i} \| f^{(u)}(x_i^{(u)}) - f^{(v)}(x^{(v)}) \| \\
i_k &= \arg\min_{i \notin \{i_1, \dots, i_{k-1}\}} \| f^{(u)}(x_i^{(u)}) - f^{(v)}(x^{(v)}) \|, \ \text{ for } k = 2, \dots, K.
\end{aligned}
\tag{6}
$$
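
The retrieval rule in (6) amounts to ranking the candidate embeddings of the target modality by their distance to the embedded query and keeping the first $K$. A minimal sketch is given below; the function name `retrieve` and the toy coordinates are assumptions made for illustration, and the embeddings are taken as given.

```python
import math

def retrieve(query_embedding, candidate_embeddings, k):
    """Return the indices i_1, ..., i_K of rule (6).

    candidate_embeddings[i] plays the role of f^(u)(x_i^(u)), and
    query_embedding plays the role of f^(v)(x^(v)). Sorting by distance and
    keeping the first k indices gives the same result as the successive
    argmin steps in (6); ties are broken by the smaller index.
    """
    order = sorted(
        range(len(candidate_embeddings)),
        key=lambda i: math.dist(candidate_embeddings[i], query_embedding),
    )
    return order[:k]

# Hypothetical embeddings of four training samples in the retrieved modality.
candidates = [(0.0, 0.0), (1.0, 1.0), (0.9, 1.2), (5.0, 5.0)]
print(retrieve((1.0, 1.0), candidates, k=2))
```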

The precision rate $P$ and the recall rate $R$ of the retrieval algorithm are then given by

$$ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \tag{7} $$

where $TP$, $FP$, and $FN$ respectively denote the number of true positive, false positive, and false negative samples depending on whether the retrieved and unretrieved samples are relevant or not.
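
The precision and recall rates in (7) can be computed directly from the class labels of the retrieved samples. The sketch below is a minimal illustration; the function name, its arguments, and the toy labels are hypothetical.

```python
def precision_recall(retrieved_labels, query_label, num_relevant):
    """Compute the rates in (7) for one query.

    retrieved_labels: class labels of the K retrieved samples
    query_label: class label of the query sample
    num_relevant: number of samples of the query's class in the searched set
    """
    if not retrieved_labels or num_relevant <= 0:
        raise ValueError("need at least one retrieved and one relevant sample")
    tp = sum(1 for label in retrieved_labels if label == query_label)
    fp = len(retrieved_labels) - tp   # retrieved but not relevant
    fn = num_relevant - tp            # relevant but not retrieved
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical query of class "a": 3 samples retrieved, 4 relevant in total.
precision, recall = precision_recall(["a", "a", "b"], "a", num_relevant=4)
print(precision, recall)
```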

We present the following main result regarding the performance of cross-modal retrieval with supervised embeddings.

Theorem 2.

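Let the training sample set $X$ contain $N_m$ training samples from class $m$, with observations $x_i^{(v)}$ and $x_i^{(u)}$ available in the modalities $v$ and $u$. Let $\{y_i^{(v)}\}$, $\{y_i^{(u)}\}$ be an embedding of $X$ in $\mathbb{R}^d$ with the following properties:

$$
\begin{aligned}
&\text{(P1)} \quad \| y_i^{(v)} - y_i^{(u)} \| \le \eta \ \text{ for all training samples } x_i \\
&\text{(P2)} \quad \text{For two samples } x_i \text{ and } x_j \text{ with } C(x_i) = C(x_j): \\
&\qquad\quad \| y_i^{(v)} - y_j^{(v)} \| \le R_\delta \ \text{ if } \ \| x_i^{(v)} - x_j^{(v)} \| \le 2\delta; \\
&\qquad\quad \| y_i^{(u)} - y_j^{(u)} \| \le R_\delta \ \text{ if } \ \| x_i^{(u)} - x_j^{(u)} \| \le 2\delta \\
&\text{(P3)} \quad \| y_i^{(v)} - y_j^{(u)} \| > \gamma \ \text{ if } C(x_i) \ne C(x_j),
\end{aligned}
$$

where $\eta$ and $\gamma$ are some constants and $R_\delta$ is a $\delta$-dependent constant. Assume that the interpolation functions $f^{(v)}$ and $f^{(u)}$ in modalities $v$ and $u$ are Lipschitz continuous with constant $L$ such that for some parameters $\epsilon$ and $\delta$, the following inequality holds

$$ 6 L \delta + 2 \sqrt{d} \, \epsilon + 2 R_\delta + 2 \eta \le \gamma. \tag{8} $$

For some $Q$, let the number of training samples from class $m$ be such that

$$ N_m > \frac{Q}{\eta_{m,\delta}}. $$

Let $x^{(v)}$ be a query sample from class $m$ observed in modality $v$, the relevant samples to which are sought in modality $u$. Then, with probability at least

$$ 1 - \exp\left( -\frac{2 (N_m \eta_{m,\delta} - Q)^2}{N_m} \right) - 2d \exp\left( -\frac{Q \epsilon^2}{2 L^2 \delta^2} \right) - (1 - \eta_{m,\delta})^Q $$

the precision rate of the retrieval algorithm in (6) satisfies

$$
\begin{aligned}
P &= 1, && \text{if } K \le Q \\
P &\ge \frac{Q}{K}, && \text{if } K > Q
\end{aligned}
\tag{9}
$$

and the recall rate of the retrieval algorithm satisfies

$$
\begin{aligned}
R &= \frac{K}{N_m}, && \text{if } K \le Q \\
R &\ge \frac{Q}{N_m}, && \text{if } K > Q.
\end{aligned}
\tag{10}
$$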

The proof of Theorem 2 is given in the Appendix. Theorem 2 can be interpreted similarly to Theorem 1. The properties (P1), (P2) and (P3) ensure that the learnt embedding aligns modalities $v$ and $u$ sufficiently well, while mapping nearby samples from the same classes to nearby points, and increasing the distance between samples from different classes. Assuming that the condition (8) is satisfied, the precision and recall rates given in (9) and (10) are attained with probability approaching 1 at an exponential rate as the number of training samples increases. In the proof of the theorem, the precision and recall rates in (9) and (10) are obtained by identifying the conditions under which at least $Q$ samples out of the $K$ samples returned by the retrieval algorithm are relevant to the query sample.
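
The two regimes of (9) and (10) can be tabulated with a few lines of code. The sketch below reads the flattened expressions as $P \ge Q/K$, $R = K/N_m$ and $R \ge Q/N_m$; the function name and the example values are hypothetical.

```python
def retrieval_guarantees(k, q, n_m):
    """Precision and recall values stated in (9) and (10).

    For K <= Q the values are exact; for K > Q they are lower bounds.
    k: number of retrieved samples K; q: the parameter Q;
    n_m: number of training samples N_m of the query's class.
    """
    if k <= q:
        return 1.0, k / n_m
    return q / k, q / n_m

# Hypothetical setting with Q = 10 and N_m = 50.
print(retrieval_guarantees(k=5, q=10, n_m=50))
print(retrieval_guarantees(k=20, q=10, n_m=50))
```

Beyond $K = Q$ the guaranteed precision decays as $Q/K$ while the guaranteed recall stays at $Q/N_m$, which reflects that only $Q$ relevant samples are certified by the analysis.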

The condition (8) required for successful cross-modal retrieval is the same as the condition (3) for accurate multi-modal classification. Hence, similarly to the findings of our multi-modal classification analysis, the results of our retrieval analysis also suggest that it is necessary to find a good compromise between the Lipschitz continuity of the interpolators and the separation between different classes when learning nonlinear embeddings for cross-modal retrieval applications.

IV Proposed Multi-Modal Supervised Embedding Method

In this section, we propose a multi-modal nonlinear dimensionality reduction algorithm that relies on the theoretical findings of Section III. We formulate the nonlinear embedding problem in Section IV-A and then discuss its solution in Section IV-B.

IV-A Problem Formulation

Let $X^{(v)}$ denote the training data matrix of modality $v$, each row of which is the observation $x_i^{(v)}$ of some training sample in the $v$-th modality. Here $N^{(v)}$ is the total number of observations from all classes in modality $v$, and the dimension of the Hilbert space $H^{(v)}$ of modality $v$ is assumed to be finite in a practical setting. (Although the observations of all training samples were assumed to be available in all modalities for the simplicity of the theoretical analysis in Section III, here we remove this assumption and allow some observations to be missing in some modalities. Hence $N^{(v)}$ may be different for different $v$.) Given the training samples $\{X^{(v)}\}$ from modalities $v = 1, \dots, V$, we would like to compute embeddings $Y^{(v)} \in \mathbb{R}^{N^{(v)} \times d}$ of the training samples into the common domain $\mathbb{R}^d$, such that each $x_i^{(v)}$ is mapped to a vector $y_i^{(v)} \in \mathbb{R}^d$. The embedding is extended to the whole data space through interpolation functions $f^{(v)}: H^{(v)} \to \mathbb{R}^d$ such that each training sample is mapped to its embedding as $f^{(v)}(x_i^{(v)}) = y_i^{(v)}$.

Our main purpose is to find an embedding that can be successfully generalized to initially unavailable test samples. We recall from our theoretical analysis that for successful generalization in multi-modal classification and retrieval, the embedding must have the properties (P1), (P2) and (P3) given in Theorems 1 and 2, while the Lipschitz constant of the interpolators must be kept sufficiently small as imposed by the conditions (3) and (8). We now formulate our multi-modal learning problem in the light of these results.

Lipschitz regularity of the interpolators. For the extension of the embedding, we choose to use RBF interpolation functions, which are analytical functions with well-studied properties. Hence, the interpolator of each modality $v$ has the form $f^{(v)} = [f^{(v)}_1 \ \dots \ f^{(v)}_d]$, where

$$f^{(v)}_k(x^{(v)}) = \sum_{i=1}^{N^{(v)}} C^{(v)}_{ik} \, \phi^{(v)}\big(\|x^{(v)} - x^{(v)}_i\|\big) \tag{11}$$

is the $k$-th component of $f^{(v)}$. Here

$$\phi^{(v)}(r) = e^{-r^2/(\sigma^{(v)})^2}$$

is a Gaussian RBF kernel with scale parameter $\sigma^{(v)}$ and $C^{(v)}_{ik}$ are the interpolator coefficients.

The Lipschitz continuity of Gaussian RBF interpolators has been studied in [nsse], from which it follows that $f^{(v)}$ is Lipschitz-continuous with constant

$$L^{(v)} = \sqrt{2}\, e^{-1/2} \sqrt{N^{(v)}}\, (\sigma^{(v)})^{-1} \|C^{(v)}\|_F. \tag{12}$$

Here $C^{(v)}$ is the coefficient matrix with entries $C^{(v)}_{ik}$. The interpolator coefficients can be easily obtained as

$$C^{(v)} = (\Psi^{(v)})^{-1} Y^{(v)}$$

by fitting the embedding coordinates $Y^{(v)}$ to the training data of modality $v$, where $\Psi^{(v)}$ is the RBF kernel matrix with entries $\Psi^{(v)}_{ij} = \phi^{(v)}(\|x^{(v)}_i - x^{(v)}_j\|)$.
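As an illustration of (11) and (12), the following minimal NumPy sketch fits a Gaussian RBF interpolator for a single modality and evaluates the Lipschitz bound. The function names, the random toy data, and the choice of scale are assumptions made only for this example and are not part of the original method description.

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma):
    """Entries phi(||x_i - z_j||) = exp(-||x_i - z_j||^2 / sigma^2)."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma**2)

def fit_rbf_interpolator(X, Y, sigma):
    """Solve Psi C = Y for the coefficient matrix C (avoids forming Psi^{-1})."""
    Psi = gaussian_kernel_matrix(X, X, sigma)
    return np.linalg.solve(Psi, Y)

def interpolate(X_train, C, sigma, X_new):
    """Evaluate the interpolator of (11) on each row of X_new."""
    return gaussian_kernel_matrix(X_new, X_train, sigma) @ C

def lipschitz_constant(C, sigma):
    """Bound (12): sqrt(2) e^{-1/2} sqrt(N) sigma^{-1} ||C||_F."""
    n_samples = C.shape[0]
    frob = np.linalg.norm(C, "fro")
    return np.sqrt(2.0) * np.exp(-0.5) * np.sqrt(n_samples) * frob / sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))   # toy observations of one modality (hypothetical)
Y = rng.normal(size=(30, 2))   # toy embedding coordinates, d = 2 (hypothetical)
sigma = 2.0                    # arbitrary kernel scale for the example

C = fit_rbf_interpolator(X, Y, sigma)
L = lipschitz_constant(C, sigma)
```

By construction the interpolator reproduces the training embeddings, and the distance between the images of any two points is at most `L` times the distance between the points.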

The conditions (3) and (8) suggest that the Lipschitz constants of the interpolators should be sufficiently small for successful generalization of the embedding to test data. In view of these results, when learning a nonlinear embedding, we propose to keep the inverse kernel scale $(\sigma^{(v)})^{-1}$ of each modality small through the term

$$\sum_{v=1}^{V} (\sigma^{(v)})^{-2}$$

and to minimize the interpolator coefficients of all modalities through

$$\sum_{v=1}^{V} \|C^{(v)}\|_F^2 = \sum_{v=1}^{V} \|(\Psi^{(v)})^{-1} Y^{(v)}\|_F^2 = \mathrm{tr}(\tilde{Y}^T \tilde{\Psi}^{-2} \tilde{Y})$$

so that the Lipschitz constant $L^{(v)}$ in (12) is minimized for each modality $v$. Here

$$\tilde{Y} = [(Y^{(1)})^T \ (Y^{(2)})^T \ \dots \ (Y^{(V)})^T]^T \in \mathbb{R}^{N \times d}$$

denotes the matrix containing the embeddings from all modalities (with $N = \sum_{v=1}^{V} N^{(v)}$) and $\tilde{\Psi}$ is a block-diagonal matrix containing the kernel matrix $\Psi^{(v)}$ in its $v$-th block.
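The trace form of the coefficient penalty follows in two short steps, spelled out here as a sketch. It assumes only that each kernel matrix $\Psi^{(v)}$ is symmetric and invertible, which holds for a Gaussian kernel evaluated on distinct training points.

$$
\begin{aligned}
\|(\Psi^{(v)})^{-1} Y^{(v)}\|_F^2
&= \mathrm{tr}\big( (Y^{(v)})^T (\Psi^{(v)})^{-T} (\Psi^{(v)})^{-1} Y^{(v)} \big)
 = \mathrm{tr}\big( (Y^{(v)})^T (\Psi^{(v)})^{-2} Y^{(v)} \big), \\
\sum_{v=1}^{V} \mathrm{tr}\big( (Y^{(v)})^T (\Psi^{(v)})^{-2} Y^{(v)} \big)
&= \mathrm{tr}\big( \tilde{Y}^T \tilde{\Psi}^{-2} \tilde{Y} \big).
\end{aligned}
$$

The first line uses the symmetry of $\Psi^{(v)}$. The second uses the fact that the inverse square of a block-diagonal matrix is block-diagonal, with $(\Psi^{(v)})^{-2}$ in its $v$-th block.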

Within-class compactness. Theorems 1 and 2 suggest that the constant in (P2) should be kept small, so that the conditions (3) and (8) are more likely to be met. Although it is not easy to analytically formulate the minimization of , in practice if nearby samples from the same modality and same class are embedded into nearby points, will be small. This problem is well-studied in the manifold learning literature. The total weighted distance between the embeddings of same-class samples can be formulated as

$$\sum_{v=1}^{V} \sum_{i,j=1}^{N^{(v)}} (W^{(v)}_w)_{ij} \, \|y^{(v)}_i - y^{(v)}_j\|^2 = \mathrm{tr}(\tilde{Y}^T \tilde{L}_w \tilde{Y}). \tag{13}$$

Here is chosen as a weight matrix whose entries represent the affinity between the data samples when and are from the same class (for a scale parameter ), and otherwise. In the equality, the block-diagonal matrix contains the within-class Laplacian in its -th block, where is the diagonal degree matrix with -th diagonal entry given by . The term in (13) hence imposes nearby samples , from the same class and the same modality to be mapped to nearby coordinates.

Between-class separation. In Theorems 1 and 2, the between-class margin in (P3) must be sufficiently large for conditions (3) and (8) to be satisfied. Since it is difficult to formulate the maximization of the exact value of , we relax this problem to the maximization of

$$\sum_{v=1}^{V} \sum_{i,j=1}^{N^{(v)}} (W^{(v)}_b)_{ij} \, \|y^{(v)}_i - y^{(v)}_j\|^2 = \mathrm{tr}(\tilde{Y}^T \tilde{L}_b \tilde{Y})$$

which aims to increase the separation between the samples from different classes within each modality . Here the matrix has entries when and are from different classes; and , otherwise. The block-diagonal matrix contains the between-class Laplacian in its -th block, where is the diagonal between-class degree matrix with -th diagonal entry given by .
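The within-class and between-class terms can be illustrated for a single modality with the NumPy sketch below. The Gaussian form of the same-class affinity, the unit weight for different-class pairs, the convention $L = D - W$, and all names and toy data are assumptions of this example. Under that convention with a symmetric weight matrix, the pairwise sum equals twice the trace form, a constant factor that does not change the minimizer.

```python
import numpy as np

def class_laplacians(X, labels, theta):
    """Build assumed within-class and between-class weights and Laplacians."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    same_class = labels[:, None] == labels[None, :]
    # Assumed affinity: Gaussian weight for same-class pairs, zero otherwise.
    W_w = np.where(same_class, np.exp(-sq_dists / theta**2), 0.0)
    np.fill_diagonal(W_w, 0.0)
    # Assumed between-class weight: 1 for different-class pairs, zero otherwise.
    W_b = np.where(same_class, 0.0, 1.0)
    # Graph Laplacian L = D - W, with D the diagonal degree matrix.
    L_w = np.diag(W_w.sum(axis=1)) - W_w
    L_b = np.diag(W_b.sum(axis=1)) - W_b
    return W_w, W_b, L_w, L_b

def weighted_pairwise_sum(W, Y):
    """Compute sum_{i,j} W_ij ||y_i - y_j||^2 directly."""
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return (W * sq_dists).sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))        # toy observations (hypothetical)
labels = np.repeat([0, 1], 10)      # two toy classes
Y = rng.normal(size=(20, 2))        # toy embedding, d = 2

W_w, W_b, L_w, L_b = class_laplacians(X, labels, theta=1.0)
within = weighted_pairwise_sum(W_w, Y)    # equals 2 * tr(Y^T L_w Y)
between = weighted_pairwise_sum(W_b, Y)   # equals 2 * tr(Y^T L_b Y)
```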

Cross-modal alignment. Finally, the constant in property (P1) in Theorems 1 and 2 should be sufficiently small for conditions (3) and (8) to be met. The parameter represents the distance between the embeddings of the observations of the same sample in different modalities. We relax the minimization of to the minimization of the following term, which aims to embed samples of high affinity from different modalities , into nearby points

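$$\sum_{v=1}^{V} \sum_{u \neq v} \sum_{i=1}^{N^{(v)}} \sum_{j=1}^{N^{(u)}} (W^{(vu)}_w)_{ij} \, \|y^{(v)}_i - y^{(u)}_j\|^2 = \mathrm{tr}(\tilde{Y}^T \tilde{L}_{cw} \tilde{Y}).$$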

Here, the matrix encodes the affinities between sample pairs from different modalities. is nonzero only if and are from the same class, in which case it is computed with the Gaussian kernel based on the distance between and when transferred to a common modality (i.e., using or , otherwise in some other modality if the former ones are not possible). Denoting by the cross-modal within-class weight matrix containing in its -th block, the corresponding Laplacian matrix is computed as , where is the diagonal degree matrix with -th diagonal entry given by .

Meanwhile, the property (P3) in Theorems 1 and 2 suggests that two samples from modalities , should be separated if they are from different classes. We thus propose to maximize

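$$\sum_{v=1}^{V} \sum_{u \neq v} \sum_{i=1}^{N^{(v)}} \sum_{j=1}^{N^{(u)}} (W^{(vu)}_b)_{ij} \, \|y^{(v)}_i - y^{(u)}_j\|^2 = \mathrm{tr}(\tilde{Y}^T \tilde{L}_{cb} \tilde{Y})$$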

where the matrix is formed by setting if and are from different classes, and 0 otherwise. The cross-modal between-class weight matrix contains the matrix in its -th block, while is the corresponding Laplacian matrix given by , with denoting the diagonal degree matrix with -th diagonal entry given by .

Overall problem. We now combine all these objectives in the following overall optimization problem

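$$\underset{\tilde{Y}, \{\sigma^{(v)}\}}{\text{minimize}} \ \ \mathrm{tr}(\tilde{Y}^T \tilde{L}_w \tilde{Y}) - \mu_1 \mathrm{tr}(\tilde{Y}^T \tilde{L}_b \tilde{Y}) + \mu_2 \mathrm{tr}(\tilde{Y}^T \tilde{\Psi}^{-2} \tilde{Y}) + \mu_3 \sum_{v=1}^{V} (\sigma^{(v)})^{-2} + \mu_4 \mathrm{tr}(\tilde{Y}^T \tilde{L}_{cw} \tilde{Y}) - \mu_5 \mathrm{tr}(\tilde{Y}^T \tilde{L}_{cb} \tilde{Y}) \tag{14}$$

subject to $\tilde{Y}^T \tilde{Y} = I$, where $\mu_1, \dots, \mu_5$ are positive weight parameters, $I$ is the $d \times d$ identity matrix, and the optimization constraint $\tilde{Y}^T \tilde{Y} = I$ is for the normalization of the learnt coordinates.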

IV-B Solution of the Optimization Problem

Defining

$$A = \tilde{L}_w - \mu_1 \tilde{L}_b + \mu_2 \tilde{\Psi}^{-2} + \mu_4 \tilde{L}_{cw} - \mu_5 \tilde{L}_{cb} \tag{15}$$

the problem in (14) can be rewritten as

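$$\underset{\tilde{Y}, \{\sigma^{(v)}\}}{\text{minimize}} \ \ \mathrm{tr}(\tilde{Y}^T A \tilde{Y}) + \mu_3 \sum_{v=1}^{V} (\sigma^{(v)})^{-2}, \quad \text{subject to } \tilde{Y}^T \tilde{Y} = I. \tag{16}$$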

The above problem is not jointly convex in $\tilde{Y}$ and $\{\sigma^{(v)}\}$, hence it is not easy to find its global optimum. We minimize the objective function with an iterative alternating optimization scheme, where in each iteration we first optimize $\tilde{Y}$ by fixing $\{\sigma^{(v)}\}$, and then optimize $\{\sigma^{(v)}\}$ by fixing $\tilde{Y}$, as follows.

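Optimization of $\tilde{Y}$: When $\{\sigma^{(v)}\}$ are fixed, the optimization problem in (16) becomes

$$\underset{\tilde{Y}}{\text{minimize}} \ \ \mathrm{tr}(\tilde{Y}^T A \tilde{Y}) \quad \text{subject to } \tilde{Y}^T \tilde{Y} = I. \tag{17}$$

The solution to this problem is given by the eigenvectors of the matrix $A$ corresponding to its $d$ smallest eigenvalues.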
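The embedding update in (17) can be sketched with a symmetric eigendecomposition. In the NumPy snippet below, the function name and the random symmetric test matrix are assumptions for illustration. The matrix is symmetrized before calling `numpy.linalg.eigh`, which returns eigenvalues in ascending order.

```python
import numpy as np

def solve_embedding(A, d):
    """Return the eigenvectors of A for its d smallest eigenvalues."""
    A_sym = 0.5 * (A + A.T)                      # guard against rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(A_sym)     # ascending eigenvalues
    return eigvecs[:, :d], eigvals[:d]

rng = np.random.default_rng(0)
M = rng.normal(size=(12, 12))
A = M + M.T                                      # toy symmetric matrix (hypothetical)
Y, smallest = solve_embedding(A, d=3)

objective = np.trace(Y.T @ A @ Y)                # equals the sum of the 3 smallest eigenvalues
```

The columns of `Y` are orthonormal, so the constraint of (17) is satisfied, and no other orthonormal choice attains a smaller trace.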

Optimization of $\{\sigma^{(v)}\}$: Fixing $\tilde{Y}$, the problem (16) becomes

$$\underset{\{\sigma^{(v)}\}}{\text{minimize}} \ \ \mu_2 \, \mathrm{tr}(\tilde{Y}^T \tilde{\Psi}^{-2} \tilde{Y}) + \mu_3 \sum_{v=1}^{V} (\sigma^{(v)})^{-2}. \tag{18}$$

Note that the first term in the objective depends on the kernel scale parameters $\{\sigma^{(v)}\}$ through the entries of the kernel matrix $\tilde{\Psi}$. Due to the block-diagonal structure of $\tilde{\Psi}$ and the separability of the second term, the objective (18) can be decomposed into $V$ individual objectives, each one of which is a function of only one scale parameter $\sigma^{(v)}$. We minimize these objective functions one by one, by optimizing one scale parameter at a time through exhaustive search.
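The per-modality search described above can be sketched as a scan over a finite list of candidate scales. The candidate grid, the weights, the function names, and the toy data below are assumptions of this example; the source does not specify how the search range is chosen.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """Kernel matrix with entries exp(-||x_i - x_j||^2 / sigma^2)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dists / sigma**2)

def scale_objective(X, Y, sigma, mu2, mu3):
    """One modality's share of (18): mu2 * ||Psi^{-1} Y||_F^2 + mu3 / sigma^2."""
    C = np.linalg.solve(gaussian_kernel_matrix(X, sigma), Y)
    return mu2 * np.sum(C**2) + mu3 / sigma**2

def search_scale(X, Y, candidates, mu2, mu3):
    """Exhaustive search: evaluate every candidate and keep the best one."""
    costs = np.array([scale_objective(X, Y, s, mu2, mu3) for s in candidates])
    return candidates[int(np.argmin(costs))], costs

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 4))                 # toy observations (hypothetical)
Y = rng.normal(size=(25, 2))                 # toy embedding block for this modality
candidates = np.array([0.5, 1.0, 1.5, 2.0])  # assumed search grid

best_sigma, costs = search_scale(X, Y, candidates, mu2=1.0, mu3=1.0)
```

Because each modality contributes an independent term, the same search can be run once per modality without any coupling between the scales.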

If $\mu_1$ and $\mu_5$ are sufficiently small, the matrix $A$ becomes positive semi-definite. In this case, the objective function is guaranteed to converge, since it is nonnegative and neither the update on $\tilde{Y}$ nor the update on $\{\sigma^{(v)}\}$ increases it. We continue the iterations until the convergence of the objective. We call the proposed algorithm Multi-modal Nonlinear Supervised Embedding (MNSE), which is summarized in Algorithm 1.
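The alternating scheme can be sketched end to end as below. This is a simplified illustration and not a reproduction of Algorithm 1: the graph term `G` stands in for $\tilde{L}_w - \mu_1 \tilde{L}_b + \mu_4 \tilde{L}_{cw} - \mu_5 \tilde{L}_{cb}$ and is built here from binary same-class weights only, and the grid, weights, names, and toy data are all assumptions.

```python
import numpy as np

def inv_sq_kernel(X, sigma):
    """Return Psi^{-2} for the Gaussian kernel matrix Psi of one modality."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Psi_inv = np.linalg.inv(np.exp(-sq_dists / sigma**2))
    return Psi_inv @ Psi_inv

def alternating_sketch(X_list, G, d, grid, mu2, mu3, n_iter=8):
    bounds = np.concatenate([[0], np.cumsum([X.shape[0] for X in X_list])])
    blocks = [slice(bounds[v], bounds[v + 1]) for v in range(len(X_list))]
    sigmas = [grid[0]] * len(X_list)             # arbitrary initial scales
    history = []
    for _ in range(n_iter):
        # Step 1: fix the scales and update the embedding as in (17).
        A = G.copy()
        for X, blk, s in zip(X_list, blocks, sigmas):
            A[blk, blk] += mu2 * inv_sq_kernel(X, s)
        _, eigvecs = np.linalg.eigh(0.5 * (A + A.T))
        Y = eigvecs[:, :d]
        # Step 2: fix the embedding and search each scale separately as in (18).
        total = np.trace(Y.T @ G @ Y)
        for v, (X, blk) in enumerate(zip(X_list, blocks)):
            costs = [mu2 * np.trace(Y[blk].T @ inv_sq_kernel(X, s) @ Y[blk])
                     + mu3 / s**2 for s in grid]
            sigmas[v] = grid[int(np.argmin(costs))]
            total += min(costs)
        history.append(total)                    # objective after both updates
    return Y, sigmas, history

rng = np.random.default_rng(0)
X_list = [rng.normal(size=(12, 5)), rng.normal(size=(12, 8))]  # two toy modalities
labels = np.tile(np.repeat([0, 1, 2], 4), 2)                   # labels of all 24 rows
W = (labels[:, None] == labels[None, :]).astype(float)
np.fill_diagonal(W, 0.0)
G = np.diag(W.sum(axis=1)) - W               # placeholder graph term (assumption)

Y, sigmas, history = alternating_sketch(
    X_list, G, d=2, grid=[0.5, 1.0, 2.0], mu2=0.1, mu3=1.0)
```

Each recorded objective value is no larger than the previous one: the eigenvector step is optimal for the current scales, and the grid searched in the second step always contains the current scale.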

IV-C Complexity Analysis

The complexity of the proposed MNSE method is mainly determined by those of the problems (17) and (18) repeated in the main loop of the algorithm. When computing the matrix in (15), the matrices , , , , and can be constructed with complexity not exceeding , where is the total number of observations from all modalities. The eigenvalue decomposition step in (17) is of complexity . In the optimization problem (18), the evaluation of the objective for each value requires operations in modality ; hence, the total complexity of finding all is smaller than . Therefore, the overall complexity of the algorithm is determined as .

V Experimental Results

We first study the stabilization of the proposed MNSE algorithm and its sensitivity to the algorithm parameters in Section V-A. Then, we evaluate its performance with comparative experiments in multi-view image classification and image-text retrieval applications in Section V-B.

V-A Stabilization and Sensitivity Analysis of MNSE

We analyze the performance of MNSE in an image classification setting. The experiments are done on the MIT-CBCL multi-view face data set [MITCBCL], which contains face images of 10 participants captured under 36 illumination conditions and 9 different pose angles. Images with frontal and profile poses are used in the experiments, which are considered to represent two different modalities. Some sample images of two participants in both modalities are shown in Figure 2. The images in each modality are randomly divided into 100 training and 260 test images in each experiment. The embedding parameters are computed using the training images, which are then applied to the test images to estimate their class labels via NN classification. The misclassification errors obtained with the representations of the images in Modalities 1 and 2 are reported individually. The reported results are the average of 10 random trials.

We first study in Figure 3 the evolution of the objective function of (14) and the misclassification error (in percentage) of the test images throughout the optimization iterations. Figure 3(a) indicates that the overall objective function steadily decreases throughout the iterations as expected, confirming the efficacy of the proposed optimization procedure. The updates on both the embeddings and the kernel scale parameters ensure that the overall objective decreases or remains constant. The misclassification errors in Figure 3(b) are seen to decrease rather regularly during the iterations, in line with the decrease in the objective. This suggests that the proposed objective function is indeed well-representative of the classification error of the algorithm.

Next, the effect of the weight parameters on the algorithm performance is studied in Figure 4. Figure 4(a) shows the variation of the misclassification error with and , by fixing , , and . Similarly, 4(b) shows the variation of the error with and , by fixing and . The parameters and are set to be equal, motivated by the similarity in the construction of the between-class separation matrices associated with these parameters. Figure 4(a) indicates that the weight of the squared norms of the interpolator coefficient matrices should be relatively low (), while the weight for the kernel scale parameter terms should be higher (). This can be explained in the way that an appropriate assignment of and should balance the orders of the magnitudes of their corresponding terms in the objective, which are significantly different. Figure 4(b) suggests that the weight parameter of the cross-modal within-class similarity term and the weight parameters for the between-class discrimination terms can be chosen in a rather large region (