Similarity Search Over Graphs Using Localized Spectral Analysis

07/11/2017 · by Yariv Aizenbud, et al.

This paper provides a new similarity detection algorithm. Given an input set of multi-dimensional data points and an additional reference data point for similarity finding, the algorithm uses a kernel method that embeds the data points into a low-dimensional manifold. Unlike other kernel methods, which consider the entire data for the embedding, our method selects a specific set of kernel eigenvectors. The eigenvectors are chosen to separate the reference data point from the rest of the data, so that data points similar to it can be easily identified as being distinct from most of the members in the dataset.

I Introduction

In recent years, there has been ongoing interest in finding efficient solutions for discovering similarity between data points. Measuring similarity plays a central role in computer vision [1], speech recognition, text analysis [2] and anomaly detection [3], to name a few. The problem can be defined as follows: given a reference data point from a collection of data points in a $d$-dimensional metric space, we want to find the data points that are most similar to the reference data point. While solving the similarity search problem accurately has always been an important goal, the rapid growth in the amount of collected data raises a need for solving this problem efficiently as well. In this work, we propose a robust method for similarity detection using the notion of localized spectral methods over graphs [4]. The main advantage of our method lies in the fact that while many algorithms, such as nearest-neighbors and its variants, search for similarity in the feature space (whether in the original ambient space or in a low-dimensional embedded space), our method searches for similarity in the intrinsic characteristics of the data. This is done by looking at the eigenvectors that enable us to separate the relevant data points from the rest of the data. The methodology consists of two steps: 1. decomposition of a graph (given as a kernel) that represents the imposed similarity metric between data points; 2. a search for resemblance in the new space in which, following step 1, separation exists. We apply this method to synthetic and real datasets and compare the obtained results with other known methods.

II Preliminaries

II-A Related Work

Similarity search has a key role in many applications involving high-dimensional data. Extensive research has been done to achieve both efficient and accurate similarity search results. Nearest-neighbors search and its approximated variants, such as hashing methods [5], are popular solutions and have been widely used to achieve fast approximate similarity search. Others, such as [6, 7], implement more robust and efficient methodologies. The similarity search task in [8] is done by redefining the feature space via local intensity histograms, which can be further used as attributes for image matching. Another approach constructs eigenvectors for each pixel from geometrical moments based on local histograms to automatically detect corresponding landmarks in CT brain images [9]. Most methods address the similarity search problem in the feature space, using either the original features or an alternative representation such as hashing or dimensionality reduction. In this work, we address the problem by using the intrinsic characteristics of the data rather than the feature space.

II-B Geometry Preservation by Kernels

Data analysis often involves non-linear relations between data points that are hard to extract via conventional linear methods. PCA and SVM, for example, are two well-known methods that lack the ability to handle such relations between data points due to their linear nature. As a direct result, one would like to choose a method that maps the data points into a higher dimensional space while exploiting their non-linear properties and relations. Kernel methods enable us to operate on and analyze data in a high-dimensional environment while extracting the non-linear properties and relations in different scenarios [10]. When data is analyzed to find similarities between data points, exploiting non-linearity is important and therefore kernel methods can be useful. Two important examples in the area of kernel methods are Diffusion Maps [11] and Laplacian Eigenmaps [12]. Diffusion Maps show that the eigenvectors of a Markov matrix can be considered as a set of coordinates for the dataset, which can then be represented as a set of data points in a Euclidean space. This procedure captures most of the original geometry of the data. Laplacian Eigenmaps show that a graph built from neighborhood information can be considered a discrete approximation of the low-dimensional manifold embedded in the high-dimensional space. The usefulness of kernel methods and their relation to dimensionality reduction, classification and anomaly detection are described in [13, 14, 15].
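To make the kernel construction concrete, the following minimal Python sketch builds a Gaussian affinity kernel, normalizes it symmetrically, and embeds the data with the leading eigenvectors, in the spirit of Diffusion Maps [11]. The bandwidth eps, the number of components, and the function name are illustrative assumptions, not values prescribed by the references above.

```python
import numpy as np

def diffusion_map_embedding(X, eps=1.0, n_components=3):
    """Embed the rows of X (an n x d data matrix) using the leading
    eigenvectors of a symmetrically normalized Gaussian kernel."""
    # Pairwise squared Euclidean distances between all data points.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)              # Gaussian affinity kernel
    d = K.sum(axis=1)                        # node degrees
    A = K / np.sqrt(np.outer(d, d))          # A = D^{-1/2} K D^{-1/2}
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    # Each row gives the coordinates of one data point on the embedded axes.
    return eigvecs[:, order] * eigvals[order]
```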

III Main Approach

III-A Similarity Search Assumptions

Our method relies on two main assumptions. 1. There is a low-dimensional space that separates between data points; specifically, it separates our reference data point from the rest of the data points. 2. Each data point, which belongs to a high-dimensional space, can be characterized in a space of lower dimension than the original (ambient) space by choosing an appropriate kernel. The first assumption is common since, in most cases, there is a strong dependency (linear or non-linear) between different coordinates, which results in a space of lower dimension than the ambient one. If the data points are inseparable, then the data is assumed to be homogeneous and the notion of similarity is meaningless. Choosing an appropriate kernel, as in our second assumption, can uncover hidden relations between data points. Our approach does not rely on prior knowledge or assumptions regarding the distribution of the data or its parameters.

III-B Similarity Search Description

Data points (whether via linear or non-linear methods) are mapped into their low-dimensional embedding space by utilizing the largest eigenvalues and their corresponding eigenvectors. This process captures the geometry of the data. The successful use of a small number of eigenvalues is demonstrated in [11]. Classical spectral methods suggest using the largest eigenvalues and their corresponding eigenvectors. However, following this common choice of eigenvalues, we will not characterize accurately the reference point and its similar data points.

We suggest a method that is classified as a localized spectral method over graphs. The method proposes that the intrinsic characteristics of a reference data point can be measured mostly by the values of its top eigenvectors. We define the top eigenvectors as those having the largest absolute value in the coordinate of the reference data point. Moreover, data points that have similar top eigenvectors share characteristics. Similarity is defined by the norm of the localized spectral reconstruction error. We test this method on both synthetic and real datasets and provide comparative results.
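As an illustration of this localization idea, the short sketch below ranks eigenvectors by the magnitude of their entry at the reference coordinate rather than by eigenvalue. It assumes eigvecs holds the kernel eigenvectors as columns (for instance, as computed from the kernel matrix in the earlier sketch); the function name is illustrative.

```python
import numpy as np

def top_local_eigenvectors(eigvecs, ref_idx, k):
    """Pick the k eigenvectors whose entries have the largest absolute
    value at the reference data point's coordinate (row ref_idx)."""
    significance = np.abs(eigvecs[ref_idx, :])   # |psi_r| per eigenvector
    top = np.argsort(significance)[::-1][:k]     # k most significant columns
    return eigvecs[:, top], top
```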

III-C Localized Spectral Similarity Analysis Algorithm

Let $X = \{x_1, \dots, x_n\}$ be a set of data points in $\mathbb{R}^d$ and let $x_r \in X$ be a reference data point. We are looking to identify data points from $X$ that are similar to $x_r$. We build a graph whose kernel is $K$, where $K_{ij} = e^{-\|x_i - x_j\|^2/\varepsilon}$. The normalized kernel is $P = D^{-1}K$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j} K_{ij}$. Each row of $P$ sums to 1 and consists of only real non-negative entries; therefore, the matrix $P$ can be viewed as a Markov transition matrix. We define $A$ to be the matrix $A = D^{-1/2} K D^{-1/2}$, which is a symmetric matrix with positive eigenvalues that can be viewed as a graph Laplacian matrix. The eigenvalue decomposition (EVD) of $A$ is denoted by $A = \Psi \Lambda \Psi^{T}$. The vector $\psi_i$, the $i$-th row of the matrix $\Psi$, gives the coordinates of $x_i$ on the embedded axes.

Let $|\psi_r|$ denote the absolute values of the vector $\psi_r$ sorted in descending order, and let $\pi_1, \dots, \pi_n$ be the corresponding ordering of the eigenvector indices. The matrix $\tilde{\Psi} = [\psi^{(\pi_1)}, \dots, \psi^{(\pi_n)}]$ contains the eigenvectors (the columns of $\Psi$) sorted by descending significance to the reference data point, and let $B$ be the truncated matrix that consists of the first $k$ columns of $\tilde{\Psi}$, $k \ll n$.

Next, we enhance data points similar to the reference data point by correlating the embedded data points with a unit-vector in the direction of $b_r$, the $r$-th row of $B$, i.e.

$$ s = B \, \frac{b_r^{T}}{\|b_r\|}. \qquad (1) $$

Once $s$ is computed, each data point $x_i$ gets a score $|s_i|$, where similar data points have a higher score.

Input: Data matrix $X \in \mathbb{R}^{n \times d}$ with $n$ measurements and $d$ features, $k$ - number of vectors to use, $r$ - index of the data point to search for
Output: Similarity score for each data point
1:  Build the kernel matrix $K$, $K_{ij} = e^{-\|x_i - x_j\|^2/\varepsilon}$
2:  Construct the diffusion map matrix $A = D^{-1/2} K D^{-1/2}$, $D_{ii} = \sum_{j} K_{ij}$.
3:  Compute the EVD of $A$, $A = \Psi \Lambda \Psi^{T}$
4:  Sort $|\psi_r|$ in descending order, and store the indexes (indicated by $\pi_1, \dots, \pi_k$) that correspond to the $k$ largest values of $|\psi_r|$.
5:  Form a new data matrix on the selected vectors: $B = [\psi^{(\pi_1)}, \dots, \psi^{(\pi_k)}]$
6:  Compute the score vector $s = B \, b_r^{T} / \|b_r\|$, where $b_r$ is the $r$-th row of $B$
7:  return the absolute value of the elements in $s$

Algorithm 1 Find Kernel Similarities
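A minimal NumPy sketch of Algorithm 1 is given below. The Gaussian kernel with bandwidth eps is consistent with the experiments that follow, but the function name and default parameters are illustrative assumptions rather than the authors' reference implementation.

```python
import numpy as np

def find_kernel_similarities(X, k, r, eps=1.0):
    """Score every row of X by its similarity to the reference row r
    (a sketch of Algorithm 1; higher scores mean more similar)."""
    # 1: Gaussian kernel matrix
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / eps)
    # 2: symmetric normalization A = D^{-1/2} K D^{-1/2}
    d = K.sum(axis=1)
    A = K / np.sqrt(np.outer(d, d))
    # 3: eigenvalue decomposition of A
    eigvals, eigvecs = np.linalg.eigh(A)
    # 4: rank eigenvectors by their magnitude at the reference coordinate
    top = np.argsort(np.abs(eigvecs[r, :]))[::-1][:k]
    # 5: restrict the embedding to the k most significant eigenvectors
    B = eigvecs[:, top]
    # 6: correlate every embedded point with the reference direction
    b_r = B[r, :]
    s = B @ (b_r / np.linalg.norm(b_r))
    # 7: similarity scores
    return np.abs(s)
```

For a data matrix X and reference row index r, np.argsort(find_kernel_similarities(X, k, r))[::-1] then ranks the data points from most to least similar to the reference.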

In practice, since $n$ is large, it is usually impractical to compute the full EVD of the kernel $A$. Therefore, one can compute only a small number of the largest eigenvectors of $A$. This can be done, for example, by using power iterations [16] or by randomized SVD algorithms [17, 18, 19].
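For instance, SciPy's ARPACK-based eigensolver returns only a requested number of leading eigenpairs. The snippet below is a self-contained illustration on a small random dataset; the kernel construction mirrors the Algorithm 1 sketch above, and the number of eigenpairs is an arbitrary placeholder.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

# Small stand-in dataset; in practice X is the real data matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))

# Symmetrically normalized Gaussian kernel, as in Algorithm 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists)
d = K.sum(axis=1)
A = K / np.sqrt(np.outer(d, d))

# Compute only the 20 largest-magnitude eigenpairs instead of the full EVD.
eigvals, eigvecs = eigsh(A, k=20, which='LM')
```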

IV Experimental Results

Algorithm 1 was tested on two datasets: 1) a synthetically generated 3D surface and 2) a gray-level image (the Mona Lisa painting).
In the first experiment, two abnormal data points hovering above the surface were injected into the 3D surface. One point was selected to be the reference data point while the other served as the test data point; the selection of the two data points was random. In the second experiment, the Mona Lisa painting, the image was rearranged into a series of column vectors extracted from overlapping sliding blocks, and a reference patch from the Mona Lisa's skin was chosen at random. The aim of both experiments is to locate the resembling data points: for the 3D surface we would like to find the second abnormal data point, and for the Mona Lisa image we would like to recognize other patches of skin.

Fig. 1: (a) 3D surface data with two abnormal data points, marked in green (the reference data point) and red (the desired matching point). The proposed method finds the relevant data point quite easily, while alternative methods suggest incorrect data points as similar ones. (b) Reconstruction magnitude for each data point, obtained using the top three singular vectors of the reference data point with the new corresponding singular values. It is easily seen that the two last data points in the dataset, located at rows 2501 and 2502 and known as the reference data point and the matching point, both have a distinctively high reconstruction magnitude compared to the rest of the data points.

IV-A Generated 3D Surface

Data points forming a 3D surface were generated from 2500 data points, and two new observations lying outside the terrain were injected into the data. Both of these data points were located at the end of the dataset, in rows 2501 and 2502. The goal was to match the most similar observation to the reference data point. For comparison purposes, the Nearest Neighbor (NN) and Kernel Nearest Neighbor (Kernel-NN) algorithms were chosen [20]. A Gaussian kernel was selected for this experiment, both for the suggested algorithm and for Kernel-NN. Figure 1(a) shows the terrain that was generated. The original data points, which belong to the surface, are colored in blue, while the two abnormal data points that were artificially injected into the data are marked in green and red. One of the data points (green) was chosen to be the reference data point while the other (red) was chosen to be the matching data point that the algorithms try to locate. The reference data point was chosen randomly, and we repeated the experiment with different data point locations over the surface. Figure 1(b) shows the reconstruction magnitude for each data point, obtained using the top three singular vectors of the reference data point with the new corresponding singular values. Note that when taking all singular vectors, instead of choosing only the top k (the top three in the current experiment), the algorithm will not perform well: choosing all of them constructs the space with respect to the full data rather than with respect to the reference point and its similar data points, which results in poor performance for similarity detection. Our method outperforms both NN and Kernel-NN. The alternative methods had difficulty finding the matching data point consistently, compared to our method, for different locations of the reference and matching data points along the terrain. For the presented experiment (Figure 1), Kernel-NN ranks the matching data point as only the 409th most similar, and regular NN ranks it as the 2459th out of 2502 data points, while our method correctly ranks it as the most similar data point.
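A sketch of how such an experiment could be set up is shown below. The terrain shape, anomaly coordinates, kernel bandwidth, and the reuse of find_kernel_similarities from the Algorithm 1 sketch are illustrative assumptions; the paper's exact construction is not reproduced here.

```python
import numpy as np

# Illustrative 50x50 grid terrain (2500 surface points); not the paper's exact surface.
u, v = np.meshgrid(np.linspace(-2, 2, 50), np.linspace(-2, 2, 50))
surface = np.column_stack([u.ravel(), v.ravel(),
                           (np.sin(u) * np.cos(v)).ravel()])

# Two abnormal data points hovering above the terrain, appended as rows 2501-2502.
anomalies = np.array([[0.5, 0.5, 3.0],
                      [-0.5, -0.5, 3.0]])
X = np.vstack([surface, anomalies])

# Reference point: the first anomaly (0-based index 2500).
# Uses find_kernel_similarities from the Algorithm 1 sketch above.
scores = find_kernel_similarities(X, k=3, r=2500, eps=1.0)
ranking = np.argsort(scores)[::-1]   # most similar first
# ranking[0] is the reference itself; the matching anomaly (index 2501)
# is expected to appear next if the method behaves as described.
```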

Fig. 2: (a) Original Mona Lisa. (b) First top eigenvector of the kernel. Dark regions correspond to patches that are similar to the reference patch of skin and neck (indicated in green).

IV-B Image Analysis

A gray-level image of the Mona Lisa was divided into overlapping sliding blocks and transformed into a data matrix in which each row corresponds to an image patch. Next, an arbitrary patch from the Mona Lisa's skin patches was chosen (its image coordinate is indicated in green in Figure 2(a)). Algorithm 1 was applied to the data matrix, building a Gaussian kernel and computing its first eigenvectors. Figure 2(b) shows the magnitude of the first top eigenvector. It can be seen from the figure that the dark regions correspond to patches that are similar to the reference patch of skin and neck.
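A sketch of the patch-matrix construction might look as follows. The patch size, the reference index, and the reuse of find_kernel_similarities from the Algorithm 1 sketch are placeholders, since the original block size is not reproduced here.

```python
import numpy as np

def image_to_patch_matrix(img, patch_size):
    """Rearrange a gray-level image into a data matrix whose rows are all
    overlapping patch_size x patch_size sliding blocks, flattened."""
    h, w = img.shape
    rows = []
    for i in range(h - patch_size + 1):
        for j in range(w - patch_size + 1):
            rows.append(img[i:i + patch_size, j:j + patch_size].ravel())
    return np.asarray(rows, dtype=float)

# patch_size=8 and the reference index below are placeholders, not the paper's values.
# patches = image_to_patch_matrix(gray_image, patch_size=8)
# scores = find_kernel_similarities(patches, k=1, r=reference_patch_index)
```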

V Conclusion

In this paper, we presented a new algorithm for detecting similarities within a given dataset. The algorithm is based on localized spectral analysis. The method characterizes a reference data point by looking at the significant eigenvectors of the embedding kernel. These eigenvectors form a basis that enables us to differentiate similar data points from the rest of the data. Numerical results of the algorithm were presented; they exhibit the potential of the new method.

References

  • [1] J. Wang, H. T. Shen, J. Song, and J. Ji, “Hashing for similarity search: A survey,” arXiv preprint arXiv:1408.2927, 2014.
  • [2] A. Z. Broder, “On the resemblance and containment of documents,” in Compression and Complexity of Sequences 1997. Proceedings.   IEEE, 1997, pp. 21–29.
  • [3] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, p. 15, 2009.
  • [4] X. Cheng, M. Rachh, and S. Steinerberger, “On the diffusion geometry of graph laplacians and applications,” arXiv preprint arXiv:1611.03033, 2016.
  • [5] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in VLDB, vol. 99, no. 6, 1999, pp. 518–529.
  • [6] J. Song, Y. Yang, X. Li, Z. Huang, and Y. Yang, “Robust hashing with local models for approximate similarity search,” IEEE transactions on cybernetics, vol. 44, no. 7, pp. 1225–1236, 2014.
  • [7] N. Kraus, D. Carmel, I. Keidar, and M. Orenbach, “Nearbucket-lsh: Efficient similarity search in P2P networks,” in International Conference on Similarity Search and Applications.   Springer, 2016, pp. 236–249.
  • [8] D. Shen, “Image registration by local histogram matching,” Pattern Recognition, vol. 40, no. 4, pp. 1161–1172, 2007.
  • [9] T. Sun, C. Li, and H. Feng, “An eigenvector-based corresponding points auto-detection algorithm for non-rigid registration of CT brain images,” in Bioinformatics and Biomedical Engineering, 2009. ICBBE 2009. 3rd International Conference on.   IEEE, 2009, pp. 1–4.
  • [10] C. M. Bishop, “Pattern recognition,” Machine Learning, vol. 128, pp. 1–58, 2006.
  • [11] R. R. Coifman and S. Lafon, “Diffusion maps,” Applied and Computational Harmonic Analysis, vol. 21, no. 1, pp. 5–30, 2006.
  • [12] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [13] G. Mishne and I. Cohen, “Multiscale anomaly detection using diffusion maps,” IEEE Journal of selected topics in signal processing, vol. 7, no. 1, pp. 111–123, 2013.
  • [14] F. Chernogorov, J. Turkka, T. Ristaniemi, and A. Averbuch, “Detection of sleeping cells in LTE networks using diffusion maps,” in Vehicular Technology Conference (VTC Spring), 2011 IEEE 73rd.   IEEE, 2011, pp. 1–5.
  • [15] B. Du, L. Zhang, L. Zhang, T. Chen, and K. Wu, “A discriminative manifold learning based dimension reduction method for hyperspectral classification,” International Journal of Fuzzy Systems, vol. 14, no. 2, pp. 272–277, 2012.
  • [16] G. Golub and C. Van Loan, Matrix computations.   Johns Hopkins Univ Pr, 1996, vol. 3.
  • [17] Y. Aizenbud and A. Averbuch, “Matrix decompositions using sub-gaussian random matrices,” arXiv preprint arXiv:1602.03360, 2016.
  • [18] N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011.
  • [19] F. Woolfe, E. Liberty, V. Rokhlin, and M. Tygert, “A fast randomized algorithm for the approximation of matrices,” Applied and Computational Harmonic Analysis, vol. 25, no. 3, pp. 335–366, 2008.
  • [20] K. Yu, L. Ji, and X. Zhang, “Kernel nearest-neighbor algorithm,” Neural Processing Letters, vol. 15, no. 2, pp. 147–156, 2002.