1 Introduction
The Laplacian operator and related constructions play a pivotal role in a wide range of applications in the machine learning, pattern recognition, and computer vision communities. It has been shown that many problems in these fields boil down to finding some eigenvectors and eigenvalues of a Laplacian constructed on some high-dimensional data. Important examples include spectral clustering (Ng et al., 2001), where clusters are determined by the first eigenvectors of the Laplacian; eigenmaps (Belkin & Niyogi, 2002) and, more generally, diffusion maps (Coifman & Lafon, 2006), where one tries to find a low-dimensional manifold structure using the first smallest eigenvectors of the Laplacian; and diffusion metrics (Coifman et al., 2005), measuring the "connectivity" of points on a manifold and expressed through the eigenvalues and eigenvectors of the Laplacian. Other applications heavily relying on the properties of the Laplacian include spectral graph partitioning (Ding et al., 2001), spectral hashing (Weiss et al., 2008), spectral correspondence, image segmentation (Shi & Malik, 1997), and shape analysis (Levy, 2006). Because of the intimate relation between the Laplacian operator, Riemannian geometry, and diffusion processes, it is common to encounter the umbrella term spectral or diffusion geometry in relation to the above problems.
These applications have been considered mostly in the context of unimodal data, i.e., a single data space. However, many applications involve observations and measurements of data made using different modalities, such as multimedia documents (Weston et al., 2010; Rasiwasia et al., 2010; McFee & Lanckriet, 2011), audio and video (Kidron et al., 2005; Alameda-Pineda et al., 2011), or medical imaging modalities like PET and CT (Bronstein et al., 2010). Such problems of multimodal (or multi-view) data analysis have gained increasing interest in the computer vision and pattern recognition communities; however, there have been only a few attempts to extend the powerful spectral methods to such settings.
In this paper, we propose a general framework that allows extending different diffusion and spectral methods to the multimodal setting by finding a common eigenbasis of multiple Laplacians. Numerically, this problem is posed as approximate joint diagonalization of several matrices. Such methods have received limited attention in the numerical mathematics community (Bunse-Gerstner et al., 1993) and have been employed for joint diagonalization of covariance matrices in blind source separation applications by Cardoso & Souloumiac (1993; 1996), Yeredor (2002), and Ziehe (2005). To the best of our knowledge, this is the first time they are applied to spectral embeddings. Besides providing a principled approach to data fusion, our framework gives a theoretical explanation of existing methods for multimodal data analysis. In particular, we show that many recent works on multi-view clustering (de Sa, 2005; Ma & Lee, 2008; Tang et al., 2009; Cai et al., 2011; Kumar et al., 2011) can be considered particular instances of our framework.
2 Background
Let us be given some data represented as an m-dimensional manifold X embedded into a d-dimensional Euclidean space. In many applications d is very large while the intrinsic dimension m of the data is small, and one tries to study the structure of the manifold rather than its d-dimensional embedding. Such a structure can be characterized by means of the Laplace-Beltrami operator. In the discrete setting, the manifold is often represented by a weighted graph with n vertices and edge weights w_ij representing local connectivity, obtained e.g. using a Gaussian kernel (see von Luxburg (2007)). The Laplace-Beltrami operator can be discretized^1 as L = I - D^{-1/2} W D^{-1/2}, where W = (w_ij) and D = diag(Σ_j w_ij). Such a discretization is often referred to as the symmetric normalized Laplacian; it admits a unitary diagonalization L = U Λ U^T, with Λ = diag(λ_1, ..., λ_n) containing the eigenvalues 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n. Geometric constructions associated with eigenvectors and eigenvalues of the Laplacian play an important role in machine learning, since several archetypical problems can be formulated in these terms:

^1 There exist many different constructions of the discrete Laplacian. For the sake of simplicity, we adopt the symmetric Laplacian; our framework is applicable to other discretizations as well.
Eigenmaps. Nonlinear dimensionality reduction methods try to capture the intrinsic low-dimensional structure of the manifold X. Belkin & Niyogi (2002) showed that finding a neighborhood-preserving m-dimensional embedding of X can be posed as the minimum eigenvalue problem
min_{U^T U = I} tr(U^T L U).    (1)
This problem is minimized by setting U to be the matrix containing the first m eigenvectors of L, thus effectively embedding the data by means of the eigenfunctions of the Laplace-Beltrami operator (the null eigenvector is usually discarded). Such an embedding is referred to as a Laplacian eigenmap. More generally, a diffusion map is given as a mapping of the form x_i ↦ (f(λ_1) u_{i1}, ..., f(λ_m) u_{im}), where f is some transfer function acting as a "low-pass filter" on the eigenvalues (Coifman et al., 2005; Coifman & Lafon, 2006).
Diffusion distances. Coifman et al. (2005; 2006) related the eigenmaps to heat diffusion and random processes on manifolds and defined a family of diffusion metrics that in the most general setting can be written as
d_f^2(x_i, x_j) = Σ_{l>1} f^2(λ_l) (u_{il} - u_{jl})^2.    (2)
The particular choice f(λ) = e^{-tλ} gives the heat diffusion distance, related to the connectivity of points on the manifold by means of a diffusion process of length t. Such distances are intrinsic and thus invariant to the manifold embedding, and are robust to topological noise.
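As a concrete illustration of (2) with the heat kernel, the diffusion distance can be computed directly from an eigendecomposition of the Laplacian. The following is a minimal NumPy sketch in our notation, not taken from the paper:

```python
import numpy as np

def diffusion_distance(evals, evecs, t=1.0):
    """All-pairs diffusion distances: d_t(x_i, x_j)^2 =
    sum_l f(lambda_l)^2 (u_il - u_jl)^2 with f(lambda) = exp(-t * lambda).
    The trivial null eigenvector is discarded."""
    f = np.exp(-t * evals[1:])                 # low-pass filter on the spectrum
    Y = evecs[:, 1:] * f[None, :]              # diffusion map embedding
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.sqrt(np.maximum(sq, 0.0))

# example: combinatorial Laplacian of a 4-node path graph 0-1-2-3
W = np.diag(np.ones(3), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
evals, evecs = np.linalg.eigh(L)
D = diffusion_distance(evals, evecs, t=0.5)
```

On the path graph, the resulting metric grows with the graph distance, as expected from the diffusion interpretation.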
Spectral clustering. Ng et al. (2001) proposed a very efficient and robust clustering approach based on the observation that the multiplicity of the null eigenvalue of L is equal to the number of connected components of the underlying graph. The corresponding eigenvectors act as indicator functions of these components. Embedding the data using these eigenvectors and then applying some standard clustering algorithm such as k-means was shown to produce significantly better results than clustering the high-dimensional data directly.
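The multiplicity observation is easy to verify numerically. In the illustrative sketch below (not from the paper), a graph with two connected components yields a two-dimensional null space of the symmetric normalized Laplacian, and the null eigenvectors are constant on each component:

```python
import numpy as np

# weight matrix of a graph with two connected components (two 3-cliques)
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)

d = W.sum(axis=1)
L = np.eye(6) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

w, U = np.linalg.eigh(L)
null_dim = int(np.sum(np.abs(w) < 1e-8))   # multiplicity of the zero eigenvalue
embedding = U[:, :null_dim]                # rows act as component indicators
```

Rows of the embedding coincide within a component and differ across components, so any standard clustering of the embedded points recovers the components.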
3 Multimodal diffusion geometry
Recently, attempts to analyze different "views" or modalities of data have become increasingly popular. Such data can be modeled as different manifolds X_1, ..., X_K, which can have embeddings of different dimensionality (d_1, ..., d_K) and sometimes different structure. We are interested in analyzing these manifolds simultaneously in order to extract their joint intrinsic structure. We assume that we are given n corresponding samples on the manifolds and can construct the Laplacian matrices L_1, ..., L_K as described in the previous section.
Trying to use the eigenvectors of the Laplacian matrices independently is problematic: for a set of eigenvectors corresponding to an eigenvalue with multiplicity greater than one, we can speak only of an eigen-subspace, and any basis spanning it is a valid set of eigenvectors. As a result, the eigenvectors of the Laplacians in different modalities can be substantially different (Figure 1, top).
Joint diagonalization. A solution is to try to find the eigenbases of the Laplacians simultaneously. This problem is known as joint diagonalization and consists of finding a joint set of orthonormal eigenvectors V such that V^T L_k V = Λ_k are diagonal matrices of the eigenvalues of L_k. Such a common eigenbasis resolves the inherent ambiguity in the definition of the eigenvectors and "couples" the different modalities (Figure 1, bottom). However, due to differences between the modalities and the presence of noise, the Laplacian matrices rarely have an exact joint eigenbasis (one exists iff the matrices commute). It is still possible to find an approximate joint diagonalization by solving
min_{V^T V = I} Σ_{k=1}^K off(V^T L_k V),    (3)
where off is some off-diagonality criterion, e.g. the sum of squared off-diagonal elements, off(A) = Σ_{i≠j} a_{ij}^2. In this case, the matrices V^T L_k V are only approximately diagonal; we refer to the averages of their diagonal elements, λ̂_i = (1/K) Σ_{k=1}^K (V^T L_k V)_{ii}, as the joint approximate eigenvalues of L_1, ..., L_K. This definition allows us to naturally extend the diffusion-geometric methods discussed in the previous section (eigenmaps, diffusion distances, spectral clustering, etc.) to the multimodal setting by simply replacing the eigenvalues and eigenvectors of a single Laplacian by the joint eigenvectors and eigenvalues of the multiple Laplacians L_1, ..., L_K.
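In code, the off-diagonality criterion and the joint approximate eigenvalues amount to a few lines (a NumPy sketch under the notation above, not the authors' implementation):

```python
import numpy as np

def off(A):
    """Off-diagonality criterion: sum of squared off-diagonal elements."""
    return float(np.sum(A ** 2) - np.sum(np.diag(A) ** 2))

def joint_approx_eigenvalues(V, laplacians):
    """Joint approximate eigenvalues: averages of the diagonals of V^T L_k V."""
    return np.mean([np.diag(V.T @ L @ V) for L in laplacians], axis=0)
```

Given an approximate joint eigenbasis V, these two functions are all that is needed to plug multiple Laplacians into the unimodal constructions of the previous section.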
Numerical computation. A numerical method for joint diagonalization based on a modified Jacobi iteration traces back to Bunse-Gerstner et al. (1993), and was used at about the same time by Cardoso & Souloumiac (1993; 1996) for joint diagonalization of covariance matrices in the context of blind source separation. The idea of the standard Jacobi method for eigenvalue computation is to apply a sequence of plane rotations that sequentially minimize the off-diagonal elements of the given matrix. The rotation is applied "in place" and does not require matrix multiplication. In the modified Jacobi method (referred to as JADE), the rotations are applied to reduce the off-diagonality criterion (3) at each step. Let R(i,j,θ,φ) denote the (complex) plane rotation matrix whose entries are equal to those of the identity matrix except for the elements
r_ii = r_jj = cos θ,   r_ij = e^{iφ} sin θ,   r_ji = -e^{-iφ} sin θ,    (4)
where θ and φ are the rotation angles. Cardoso & Souloumiac (1996) show that the problem
min_{θ,φ} Σ_{k=1}^K off(R^H(i,j,θ,φ) L_k R(i,j,θ,φ))    (5)
has a simple explicit solution based on a 3×3 eigenvalue problem. JADE is one of the most common algorithms in the field of joint diagonalization and has complexity comparable to that of the standard Jacobi method. There are other algorithms as well, such as the ACDC method of Yeredor (2002) and different versions of the idea of minimizing a suitable cost function on the Stiefel manifold (Rahbar & Reilly (2000)).
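A minimal real-symmetric variant of such a Jacobi scheme can be sketched as follows. The closed-form rotation below follows the real-case Cardoso-Souloumiac rule (the optimal angle comes from the principal eigenvector of a small accumulated matrix); it is illustrative and not the JADE code itself:

```python
import numpy as np

def joint_diagonalize(matrices, sweeps=50, tol=1e-12):
    """Approximate joint diagonalization of real symmetric matrices by Jacobi
    plane rotations; each (p, q) rotation angle is chosen in closed form
    (real-case Cardoso-Souloumiac rule) to reduce the off-diagonal energy."""
    A = [M.astype(float).copy() for M in matrices]
    n = A[0].shape[0]
    V = np.eye(n)
    for _ in range(sweeps):
        changed = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # 2x2 subproblem: one row h_k = (a_pp - a_qq, 2 a_pq) per matrix
                h = np.array([[M[p, p] - M[q, q], 2.0 * M[p, q]] for M in A])
                G = h.T @ h
                _, E = np.linalg.eigh(G)
                x, y = E[:, -1]                       # principal eigenvector of G
                if x < 0.0:
                    x, y = -x, -y
                r = np.hypot(x, y)
                if r < tol:
                    continue
                c = np.sqrt((x + r) / (2.0 * r))      # cos(theta)
                s = y / np.sqrt(2.0 * r * (x + r))    # sin(theta)
                if abs(s) < tol:
                    continue
                changed = True
                R = np.array([[c, -s], [s, c]])
                for M in A:                           # apply the rotation in place
                    M[[p, q], :] = R.T @ M[[p, q], :]
                    M[:, [p, q]] = M[:, [p, q]] @ R
                V[:, [p, q]] = V[:, [p, q]] @ R
        if not changed:
            break
    return V, A
```

For exactly commuting matrices the sweeps drive the off-diagonal energy to machine precision and V recovers a common eigenbasis; for noisy Laplacians the same loop yields the approximate minimizer of (3).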
Analytic computation. In the spectral clustering problem, we are looking for the null eigenvectors of the Laplacians. Assuming that the first c eigenvalues of the Laplacians are zero, we want to find U = (u_1, ..., u_c) such that L_k u_i ≈ 0 for all i = 1, ..., c and k = 1, ..., K, by reformulating (3) as
min_{U^T U = I} Σ_{k=1}^K tr(U^T L_k U).    (6)
Since Σ_k tr(U^T L_k U) = tr(U^T (Σ_k L_k) U), the problem can be equivalently recast as single-modality clustering with the "average" Laplacian L̄ = (1/K) Σ_{k=1}^K L_k. We can also consider other averaging operators, e.g. the weighted arithmetic mean Σ_k α_k L_k; we discuss such methods in the next section. For zero eigenvalues, (6) is akin to (3), which justifies the successful use of such "averaging" methods in problems of multimodal spectral clustering (Ma & Lee (2008); Cai et al. (2011)). However, iterative methods such as JADE that explicitly minimize the off-diagonality criterion (3) are more generic and applicable to settings where one has to find all or many joint eigenvectors, e.g., for computing eigenmaps or diffusion distances.
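The equivalence between (6) and the average Laplacian can be checked numerically. In the illustrative sketch below (not from the paper), the first c eigenvectors of the average Laplacian attain the minimum of the summed traces:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_laplacian(n, rng):
    """Symmetric normalized Laplacian of a random dense weighted graph
    (any of the constructions from Section 2 would do here)."""
    W = rng.random((n, n))
    W = (W + W.T) / 2.0
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)
    return np.eye(n) - (W / np.sqrt(d)[:, None]) / np.sqrt(d)[None, :]

Ls = [random_laplacian(8, rng) for _ in range(3)]
L_bar = sum(Ls) / len(Ls)                 # "average" Laplacian

c = 2
w, U = np.linalg.eigh(L_bar)
U_c = U[:, :c]                            # first c eigenvectors of L_bar
obj = sum(np.trace(U_c.T @ L @ U_c) for L in Ls)
```

By the Ky Fan characterization, obj equals K times the sum of the c smallest eigenvalues of the average Laplacian, and no other orthonormal U can do better.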
4 Relation to previous works
There have been numerous recent works on multimodal spectral-type clustering proposing different ways of fusing multiple modalities based on different principles. Considering these methods through the prism of joint diagonalization, we show many commonalities and equivalences between algorithms stemming from different motivations and coming from various communities. Ma & Lee (2008) considered the detection of shots in video sequences using fusion of video and audio information, employing for this purpose spectral clustering of a Laplacian created as a weighted arithmetic mean of the Laplacians of the individual modalities. Tang et al. (2009) used a low-rank factorization of the weight matrices, trying to find a common factor U such that W_k ≈ U Λ_k U^T by solving
min_{U, Λ_1, ..., Λ_K} Σ_{k=1}^K ||W_k - U Λ_k U^T||_F^2    (7)
using a quasi-Newton method. Besides the fact that the factorization is applied to the weight matrix (it can be equivalently applied to the Laplacian), we see here a (non-orthogonal) joint diagonalization problem with an off-diagonality criterion of the kind considered by Yeredor (2002).
Cai et al. (2011) proposed a method for multi-view spectral clustering (MVSC) by solving^2

^2 Cai et al. (2011) also impose a nonnegativity constraint on the matrix U in order to obtain cluster indicators directly and bypass the k-means clustering stage. We ignore this additional constraint for simplicity of the discussion; such a constraint can be added to all the problems discussed in this paper.
min_{U, U_1, ..., U_K} Σ_{k=1}^K tr(U_k^T L_k U_k) + μ Σ_{k=1}^K ||U_k - U||_F^2   s.t. U_k^T U_k = I, k = 1, ..., K.    (8)
The authors show that this problem can be equivalently posed as
max_{U^T U = I} Σ_{k=1}^K tr(U^T (L_k + μI)^{-1} U),    (9)
and then employ an iterative algorithm to find the solution U. First, we observe that problem (8) consists of K minimum-eigenvalue problems w.r.t. the bases U_k, with the addition of a coupling term encouraging each U_k to be as close as possible to some common basis U (note that the authors do not impose the orthogonality constraint U^T U = I, but for sufficiently large μ, the proximity to the orthogonal U_k makes U approximately orthogonal). Thus, it is possible to interpret (8) as a kind of joint diagonalization criterion. Second, problem (9) can be rewritten as the minimum eigenvalue problem
min_{U^T U = I} tr(U^T L̂ U),   where L̂ = Σ_{k=1}^K μ L_k (L_k + μI)^{-1},    (10)
whose solution is given by the matrix composed of the first c eigenvectors of L̂ = Σ_{k=1}^K (L_k^{-1} + μ^{-1} I)^{-1}, a regularized version of the harmonic mean of the Laplacian matrices. We can thus regard the method of Cai et al. (2011) as a particular instance of the joint diagonalization approach discussed in the previous section.
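Under our reading of this averaging operator (a reconstruction; the exact constants in Cai et al. (2011) may differ), the identity μL(L+μI)^{-1} = (L^{-1} + μ^{-1}I)^{-1} and the arithmetic-mean limit μ → ∞ can be verified numerically on a positive definite stand-in for a Laplacian:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
L = A @ A.T + np.eye(5)      # positive definite stand-in for a Laplacian
I = np.eye(5)

def regularized(L, mu):
    """Per-modality term of the averaging operator: mu * L (L + mu I)^{-1}."""
    return mu * L @ np.linalg.inv(L + mu * I)

# same matrix written as a regularized harmonic-mean term (mu = 10)
harmonic_form = np.linalg.inv(np.linalg.inv(L) + I / 10.0)
```

For large μ each term approaches L_k itself, recovering the arithmetic-mean ("average" Laplacian) behavior of the previous section.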
Kumar et al. (2011) proposed the centroid co-regularization approach for multimodal clustering based on the minimization of
min_{U, U_1, ..., U_K} Σ_{k=1}^K tr(U_k^T L_k U_k) - μ Σ_{k=1}^K tr(U_k U_k^T U U^T)   s.t. U^T U = I, U_k^T U_k = I.    (11)
This function is minimized alternatingly, first with respect to the U_k and then with respect to U. Problems (11) and (8) are similar in spirit (the former uses the subspace similarity tr(U_k U_k^T U U^T) as the coupling term, while the latter uses the dissimilarity ||U_k - U||_F^2), and both fall under our joint diagonalization framework.
We must stress that these methods were developed for clustering problems, where one has to find only the null eigenvectors, and they do not adapt easily to other applications of diffusion geometry where one has to find many or all joint eigenvectors of the Laplacians (e.g., the computation of diffusion distances). In particular, the iterative solvers used by Tang et al. (2009), Kumar et al. (2011), and Cai et al. (2011) do not scale up to such cases. On the other hand, algorithms such as the modified Jacobi iteration (JADE) are designed to find a full set of joint eigenvectors and have complexity akin to the standard Jacobi iteration. Further speedup might be achieved by making explicit use of the sparse structure of the Laplacian matrices, which JADE does not take advantage of.
5 Results
We tested the proposed approach on three applications: dimensionality reduction, diffusion distance, and spectral clustering. All the datasets and code generating the results in this section are available from anonymous.com. Additional results are shown in the supplementary material.
Swiss rolls. In the first experiment, we used two Swiss roll surfaces with slightly different embeddings as two different data modalities. The rolls were constructed in such a way that in each modality there is topological noise (connectivity "across" the roll loops) at different points. Laplacians were constructed as in Belkin & Niyogi (2002) using nearest-neighbor connectivity and Gaussian weights. Figure 1 shows the first few eigenvectors computed using each Laplacian individually and jointly. Figure 2 shows two-dimensional embeddings of the same surfaces using the first nontrivial eigenvectors. When using joint eigenvectors, we are able to correctly capture the intrinsic structure of the data. Figure 3 shows the diffusion distance on the Swiss roll surfaces, computed using the first 100 eigenvectors and the heat diffusion kernel f(λ) = e^{-tλ}. Topological noise is clearly visible, especially in the first modality, resulting in a small distance between two loops. This phenomenon does not occur when using joint eigenvectors.
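The graph Laplacian construction used throughout the experiments (nearest-neighbor connectivity, Gaussian weights, symmetric normalization) can be sketched as follows; this is illustrative NumPy code, not the authors' implementation, and the parameters k and sigma stand in for the experiment-specific values:

```python
import numpy as np

def symmetric_laplacian(X, k=10, sigma=1.0):
    """Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2} of a
    k-nearest-neighbor graph with Gaussian edge weights."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]            # skip the point itself
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2.0 * sigma ** 2))
    W = np.maximum(W, W.T)                            # symmetrize the graph
    d = np.maximum(W.sum(axis=1), 1e-12)
    s = 1.0 / np.sqrt(d)
    return np.eye(n) - (s[:, None] * W) * s[None, :]
```

By construction the result is symmetric with spectrum in [0, 2] and a null eigenvalue, as required by the background section.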
Synthetic data clustering. In the second experiment, we performed clustering on several synthetic multimodal datasets. Laplacians were constructed using 15 nearest neighbors (10 for the circles), with the Gaussian weight scale selected using the self-tuning approach of Perona & Zelnik-Manor (2004). We compare spectral clustering based on the single modalities (SC1 and SC2), joint diagonalization obtained using the JADE method of Cardoso & Souloumiac (1996), the harmonic mean of Laplacians (JD-HM; Cai et al. (2011)), and the non-spectral Comraf clustering algorithm (Bekkerman & Jeon (2007)). Quality was measured using the clustering accuracy criterion as defined in Bekkerman & Jeon (2007). For Blobs, accuracy is averaged over 100 experiments run on randomly generated datasets.
The results are summarized in Figure 4 and Table 1. Surprisingly, the simple-minded averaging approach performs extremely well; this is consistent with previously reported results and the success of the methods of Cai et al. (2011) (essentially a harmonic mean) and Ma & Lee (2008) (an arithmetic mean).
Table 1: Clustering accuracy.

Dataset   Clus.  SC1        SC2        JADE       JD-HM      Comraf
Blobs     6      91.0±7.2%  90.8±7.2%  97.3±4.2%  98.3±3.0%  86.9±8.6%
Circles   4      65.9%      63.4%      100.0%     99.8%      31.4%
NIPS      4      63.3%      75.1%      99.9%      99.9%      51.8%
NUS       7      83.5%      71.0%      92.4%      80.7%      82.1%
Caltech   7      73.3%      76.2%      86.7%      84.8%      –
Caltech   20     66.3%      70.7%      73.3%      76.0%      –
NUS dataset. In the third experiment, we used a subset of the NUS-WIDE dataset (Chua et al. (2009)) containing annotated images. The images were selected on purpose to have ambiguous content and annotations (e.g., swimming tigers are also tagged as "water", making them easy to confuse with, e.g., whales). As two different modalities, we used 64-dimensional color histograms and 1000-dimensional bags of words. Laplacians were constructed using 10 nearest neighbors, and the Gaussian weight scale was selected using self-tuning. Table 1 shows the performance of the different clustering methods, and Figure 5 exemplifies the clustered images.
Using JADE joint diagonalization, we produced all the joint eigenvectors of the Laplacians of the two modalities. Figure 8 (top) shows the distance matrices between the objects in the NUS dataset obtained using uni- and multimodal diffusion distances (computed according to (2) using the heat diffusion kernel f(λ) = e^{-tλ}). Ideally, the distance matrix should contain zero blocks on the diagonal (objects of the same class) and nonzero values elsewhere (objects from different classes). Thresholding these distances at a set of levels and measuring the false positive/true positive rates (FPR/TPR), we produce ROC curves that clearly indicate the advantage of using multiple modalities (Figure 8).
In Figure 7 (top), we used the diffusion distance to progressively sample the NUS dataset using the farthest point sampling strategy: starting with some point, pick the second one as the most distant from the first, then the third as the most distant from the first and the second, and so on. Such sampling is almost-optimal (Hochbaum & Shmoys (1985)) and is known to produce a progressively refined covering of the set. In fact, the first samples produced in this way cover all the classes present in the dataset, which is an indication of the meaningfulness of such a sampling.
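Given a precomputed distance matrix, the farthest point sampling strategy takes only a few lines (a sketch, not the authors' code):

```python
import numpy as np

def farthest_point_sampling(D, m, start=0):
    """Greedy farthest point sampling: each new sample maximizes its distance
    to the samples already chosen (the 2-approximation to the k-center
    problem of Hochbaum & Shmoys (1985))."""
    samples = [start]
    mind = D[start].copy()                 # distance to the nearest chosen sample
    for _ in range(m - 1):
        nxt = int(np.argmax(mind))
        samples.append(nxt)
        mind = np.minimum(mind, D[nxt])
    return samples
```

Running it with the multimodal diffusion distance matrix in place of D reproduces the sampling experiment described above; each iteration costs O(n), so the whole procedure is O(nm).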
Caltech dataset. In the fourth experiment, we repeated the third experiment on a subset of the Caltech-101 dataset with 7 and 20 image classes, as in Cai et al. (2011). For each image, kernels arising from different visual descriptors were given. As the two modalities, we used the bio-inspired features and the 4×4 pyramid histogram of visual words (PHOW) for the 7-class experiment, and the geometric blur and 4×4 PHOW descriptors for the 20-class experiment. Laplacians were constructed from these kernels with the Gaussian weight scale selected by self-tuning. Diffusion distances were computed using the heat kernel f(λ) = e^{-tλ}. The results are shown in Figures 6–8.
6 Discussion and Conclusions
We presented a framework for multimodal data analysis using approximate joint diagonalization of Laplacian matrices, naturally extending the classical construction of diffusion geometry to the multimodal setting. This construction allowed an almost straightforward extension of various diffusiongeometric data analysis tools such as spectral clustering and manifold learning based on diffusion maps. In followup studies, we intend to show multimodal extensions of other related techniques such as spectral hashing.
We also showed that many previously proposed approaches to multimodal spectral clustering are nearly equivalent and try to solve some version of the joint approximate diagonalization problem. From the numerical perspective, existing methods were tailored for computing the null joint eigenvectors that are sought for in clustering problems. The underlying optimization problems are poorly suited for broader applications of diffusion geometry such as nonlinear dimensionality reduction and manifold learning, where many or all eigenvectors of the Laplacians are of interest. While approximate joint diagonalization methods developed in the signal processing community for source separation problems can address the latter case, they were initially developed for full matrices and do not take advantage of the sparse structure of Laplacians.
To the best of our knowledge, there currently exists no efficient tool to compute the joint eigenvectors of very large sparse matrices, akin to MATLAB's eigs. We believe that the presented construction makes the need for such a tool central enough to deserve the interest of the entire machine learning community. In future work, we will consider extending standard methods for the eigendecomposition of large sparse matrices to the joint diagonalization case.
References
 Alameda-Pineda et al. (2011) Alameda-Pineda, X., Khalidov, V., Horaud, R., and Forbes, F. Finding audio-visual events in informal social gatherings. In Proc. ICMI, 2011.
 Bekkerman & Jeon (2007) Bekkerman, R. and Jeon, J. Multimodal clustering for multimedia collections. In Proc. CVPR, 2007.
 Belkin & Niyogi (2002) Belkin, M. and Niyogi, P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15:1373–1396, 2002.
 Bronstein et al. (2010) Bronstein, M. M., Bronstein, A. M., Michel, F., and Paragios, N. Data fusion through crossmodality metric learning using similaritysensitive hashing. In Proc. CVPR, pp. 3594–3601, 2010.
 Bunse-Gerstner et al. (1993) Bunse-Gerstner, A., Byers, R., and Mehrmann, V. Numerical methods for simultaneous diagonalization. SIAM J. Matrix Anal. Appl., 14(4):927–949, 1993.
 Cai et al. (2011) Cai, X., Nie, F., Huang, H., and Kamangar, F. Heterogeneous image feature integration via multimodal spectral clustering. In Proc. CVPR, 2011.
 Cardoso & Souloumiac (1993) Cardoso, J.-F. and Souloumiac, A. Blind beamforming for non-Gaussian signals. Radar and Signal Processing, 140(6):362–370, 1993.
 Cardoso & Souloumiac (1996) Cardoso, J.-F. and Souloumiac, A. Jacobi angles for simultaneous diagonalization. SIAM J. Matrix Anal. Appl., 17:161–164, 1996.
 Chua et al. (2009) Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y.-T. NUS-WIDE: A real-world web image database from National University of Singapore. In Proc. CIVR, 2009.
 Coifman et al. (2005) Coifman, R. R., Lafon, S., Lee, A. B., Maggioni, M., Warner, F., and Zucker, S. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. In PNAS, pp. 7426–7431, 2005.
 Coifman & Lafon (2006) Coifman, R.R. and Lafon, S. Diffusion maps. Applied and Computational Harmonic Analysis, 21:5–30, 2006.
 de Sa (2005) de Sa, V.R. Spectral clustering with two views. In Proc. ICML Workshop on learning with multiple views, 2005.
 Ding et al. (2001) Ding, C. H. Q., He, X., Zha, H., Gu, M., and Simon, H. D. A min-max cut algorithm for graph partitioning and data clustering. In Proc. Conf. Data Mining, 2001.

 Hochbaum & Shmoys (1985) Hochbaum, D. S. and Shmoys, D. B. A best possible heuristic for the k-center problem. Mathematics of Operations Research, pp. 180–184, 1985.
 Kidron et al. (2005) Kidron, E., Schechner, Y. Y., and Elad, M. Pixels that sound. In Proc. CVPR, 2005.
 Kumar et al. (2011) Kumar, A., Rai, P., and Daumé III, H. Coregularized multiview spectral clustering. In Proc. NIPS, 2011.
 Levy (2006) Levy, B. Laplace-Beltrami eigenfunctions towards an algorithm that "understands" geometry. In Proc. SMI, 2006.
 Ma & Lee (2008) Ma, C. and Lee, C.H. Unsupervised anchor shot detection using multimodal spectral clustering. In Proc. ICASSP, 2008.
 McFee & Lanckriet (2011) McFee, B. and Lanckriet, G. R. G. Learning multimodal similarity. JMLR, 12:491–523, 2011.

 Ng et al. (2001) Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. In Proc. NIPS, 2001.
 Perona & Zelnik-Manor (2004) Perona, P. and Zelnik-Manor, L. Self-tuning spectral clustering. In Proc. NIPS, 2004.
 Rahbar & Reilly (2000) Rahbar, K. and Reilly, J. P. Geometric optimization methods for blind source separation of signals. In Proc. ICA, pp. 375–380, 2000.
 Rasiwasia et al. (2010) Rasiwasia, N., Costa Pereira, J., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., and Vasconcelos, N. A new approach to crossmodal multimedia retrieval. In Proc. ICM, pp. 251–260, 2010.
 Shi & Malik (1997) Shi, J. and Malik, J. Normalized cuts and image segmentation. Trans. PAMI, 22:888–905, 1997.
 Tang et al. (2009) Tang, W., Lu, Z., and Dhillon, I.S. Clustering with multiple graphs. In Proc. Data Mining, 2009.
 von Luxburg (2007) von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.
 Weiss et al. (2008) Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. NIPS, 2008.
 Weston et al. (2010) Weston, J., Bengio, S., and Usunier, N. Large scale image annotation: learning to rank with joint wordimage embeddings. Machine learning, 81(1):21–35, 2010.
 Yeredor (2002) Yeredor, A. Non-orthogonal joint diagonalization in the least-squares sense with application in blind source separation. Trans. Signal Proc., 50(7):1545–1553, 2002.
 Ziehe (2005) Ziehe, A. Blind Source Separation based on Joint Diagonalization of Matrices with Applications in Biomedical Signal Processing. Dissertation, University of Potsdam, 2005.