Subspace Approximation for Approximate Nearest Neighbor Search in NLP

08/25/2017 ∙ by Jing Wang, et al. ∙ Rutgers University

Most natural language processing tasks, such as word analogy, document similarity and machine translation, can be formulated as an approximate nearest neighbor search problem. Take the question-answering task as an example: given a question as the query, the goal is to retrieve its nearest neighbor in the training dataset as the answer. However, existing methods for approximate nearest neighbor search may not perform well owing to the following practical challenges: 1) there is noise in the data; 2) the large-scale dataset yields a huge retrieval space and high search time complexity. To solve these problems, we propose a novel approximate nearest neighbor search framework which i) projects the data to a subspace based on spectral analysis, which eliminates the influence of noise; ii) partitions the training dataset into different groups in order to reduce the search space. Specifically, the retrieval space is reduced from $O(n)$ to $O(\log n)$ (where $n$ is the number of data points in the training dataset). We prove that the retrieved nearest neighbor in the projected subspace is the same as the one in the original feature space. We demonstrate the outstanding performance of our framework on real-world natural language processing tasks.




1 Introduction

Artificial intelligence (AI) is a thriving field with active research topics and practical products, such as Amazon’s Echo, Google’s Home smart speakers and Apple’s Siri. People are enthusiastic about communicating with these intelligent products. Take Amazon Echo for example: you can ask Echo for the calories of every food on your plate if you are on a diet. Whenever you need to check your calendar, just ask “Alexa, what’s on my calendar today?” This boosts the development of natural language processing, the AI technology that makes communication between AI products and humans in human language possible. The communication between humans and AI products is mainly in the format of question answering (QA). QA is a complex and general natural language task. Most natural language processing tasks can be treated as question answering problems, such as the word analogy task [Mikolov et al., 2013], machine translation [Wu et al., 2016], named entity recognition (NER) [Liu et al., 2011b, Passos et al., 2014], part-of-speech tagging (POS) [Kumar et al., 2016], and sentiment analysis [Socher et al., 2013].

There are many works designed for the question answering task, such as deep learning models [Kumar et al., 2016] and information extraction systems [Yates et al., 2007]. In this work, we propose to solve the question answering task by the approximate nearest neighbor search method. Formally, given the question as a query $q$ and the training data set $P$, the nearest neighbor search aims to retrieve the nearest neighbor of the query $q$, denoted as $p^*$, from $P$ as the answer. We assume that $p^*$ is within distance 1 from the query $q$, and all other points are at distance at least $1+\epsilon$ ($\epsilon > 0$) from the query $q$. The returned point $p^*$ is called a $(1+\epsilon)$-approximate nearest neighbor to $q$, which can be expressed as:

$$\|q - p^*\|_2 \le (1+\epsilon) \min_{p \in P} \|q - p\|_2 .$$

However, in real-world natural language processing applications, there is usually noise in the data, such as spelling errors, non-standard words in newsgroups, and pause-filling words in speech. Hence, we assume the data set $P$ is corrupted by arbitrary noise to create $\tilde{P}$, where each observed point is $\tilde{p}_i = p_i + t_i$. The query $q$ is perturbed similarly to get $\tilde{q} = q + t_q$. We assume that the noise is bounded in norm, both for every $t_i$ and for $t_q$.
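The setting above can be illustrated with a small NumPy sketch. Everything in it is an assumption made for illustration rather than the paper's setup: the helper names (`nearest_neighbor`, `is_approx_nn`, `bounded_noise`), the synthetic data, and the noise bound `eps / 8`. It places one point at distance 1 from the query and all others at distance at least `1 + eps`, then perturbs both the data and the query with bounded noise.

```python
import numpy as np

def nearest_neighbor(points, query):
    """Exact nearest neighbor by linear scan; returns (index, distance)."""
    dists = np.linalg.norm(points - query, axis=1)
    idx = int(np.argmin(dists))
    return idx, float(dists[idx])

def is_approx_nn(points, query, candidate_idx, eps):
    """True if the candidate satisfies ||q - p'|| <= (1 + eps) * min_p ||q - p||."""
    dists = np.linalg.norm(points - query, axis=1)
    return bool(dists[candidate_idx] <= (1.0 + eps) * dists.min())

def bounded_noise(rng, shape, bound):
    """Random vectors along the last axis, each with l2 norm at most `bound`."""
    t = rng.normal(size=shape)
    t /= np.linalg.norm(t, axis=-1, keepdims=True)
    return t * bound * rng.uniform(0.0, 1.0, size=shape[:-1] + (1,))

rng = np.random.default_rng(0)
d, n_other, eps = 8, 20, 0.5

# Clean data: p* (index 0) at distance 1 from q, all others at distance >= 1 + eps.
query = np.zeros(d)
p_star = np.eye(d)[0]
directions = rng.normal(size=(n_other, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = 1.0 + eps + rng.uniform(0.0, 1.0, size=(n_other, 1))
points = np.vstack([p_star, directions * radii])

# Noisy observations: every noise vector has norm at most eps / 8 (assumed bound).
noise_bound = eps / 8
noisy_points = points + bounded_noise(rng, points.shape, noise_bound)
noisy_query = query + bounded_noise(rng, query.shape, noise_bound)

clean_idx, _ = nearest_neighbor(points, query)
noisy_idx, _ = nearest_neighbor(noisy_points, noisy_query)
print(clean_idx, noisy_idx)
```

With a noise bound below `eps / 4`, the triangle inequality keeps the perturbed `p_star` strictly closer to the perturbed query than any other perturbed point, so both scans return index 0 in this toy setup.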

There are many approaches proposed to solve the approximate nearest neighbor search problem. Existing methods can be classified into two groups: data-independent methods and data-dependent methods. The data-independent approaches are mainly based on random projection to get data partitions, such as Locality-Sensitive Hashing and Minwise Hashing. Recently, data-dependent methods have received interest for their outstanding performance. They utilize spectral decomposition to map the data to different subspaces, such as Spectral Hashing. However, no theoretical guarantee about their performance is provided. Existing methods cannot handle natural language processing problems well for the following reasons. First, the data set in natural language processing is usually large in scale, which yields a huge search space. Moreover, the singular value decomposition, which is widely used to obtain the low-rank subspace, is too expensive here. Second, the data is noisy. Existing data-aware projection is not robust to noisy data and therefore cannot lead to correct partitions.

To solve the above-mentioned problems, we propose a novel iterated spectral approximate nearest neighbor search framework for general question answering tasks, Random Subspace based Spectral Hashing (RSSH). Our framework consists of the following major steps:

  • As the data is noisy, we first project the data to the clean low-rank subspace. We obtain a low-rank approximation within $(1+\epsilon)$ of optimal for spectral norm error by the randomized block Krylov method, whose time complexity is analyzed in [Musco and Musco, 2015].

  • To reduce the search space, we partition the data into different clusters. With the low-rank subspace approximation, data points are clustered according to their distance to the subspace.

  • Given the query, we first locate its nearest subspace and then search for the nearest neighbor in the data partition corresponding to that subspace.

With our framework, we provide theoretical guarantees in the following ways:

  • With the low-rank approximation, we prove that the noise in the projected subspace is small.

  • With the data partition strategy, all data will fall into a certain partition within $O(\log n)$ iterations.

  • We prove that our method can return the nearest neighbor of the query in the low-rank subspace, which is the nearest neighbor in the clean space.

To the best of our knowledge, this is the first attempt at spectral nearest neighbor search for the question answering problem with theoretical justification. Generally, our framework can solve the word similarity task, text classification problems (sentiment analysis), the word analogy task and the named entity recognition problem.

The theoretical analysis in this work is mainly inspired by [Abdullah et al., 2014]. The difference is that the subspace of the data set is computed directly in [Abdullah et al., 2014], whereas in our work we approximate the subspace by a randomized variant of the Block Lanczos method [Musco and Musco, 2015]. In this way, our method enjoys higher time efficiency and returns a $(1+\epsilon)$-approximate nearest neighbor.

2 Notation

In this work, we let $\sigma_j(A)$ denote the $j$-th largest singular value of a real matrix $A$. $\|A\|_F$ is used to denote the Frobenius norm of $A$. All vector norms, i.e. $\|v\|$ for $v \in \mathbb{R}^d$, refer to the $\ell_2$-norm.

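The spectral norm of a matrix $A \in \mathbb{R}^{n \times d}$ is defined as

$$\|A\| = \sup_{x \in \mathbb{R}^d,\ \|x\| = 1} \|Ax\|,$$

where all vector norms refer throughout to the $\ell_2$-norm. It is clear that $\sigma_1(A)$ equals the spectral norm of $A$. The Frobenius norm of $A$ is defined as $\|A\|_F = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$, and let $A^T$ denote the transpose of $A$. A singular vector of $A$ is a unit vector $v \in \mathbb{R}^d$ associated with a singular value $s \in \mathbb{R}$ and a unit vector $u \in \mathbb{R}^n$ such that $Av = su$ and $u^T A = s v^T$. (We may also refer to $v$ and $u$ as a pair of right-singular and left-singular vectors associated with $s$.)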
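These definitions can be checked numerically. The following is a minimal sketch using NumPy on a random matrix; the matrix `A` and its shape are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Singular values are returned in descending order.
singular_values = np.linalg.svd(A, compute_uv=False)
spectral_norm = np.linalg.norm(A, 2)
frobenius_norm = np.linalg.norm(A, "fro")

# The spectral norm equals the largest singular value.
assert np.isclose(spectral_norm, singular_values[0])
# The Frobenius norm is the root of the sum of squared entries,
# which also equals the root of the sum of squared singular values.
assert np.isclose(frobenius_norm, np.sqrt((A ** 2).sum()))
assert np.isclose(frobenius_norm, np.sqrt((singular_values ** 2).sum()))

# A pair of left/right singular vectors (u, v) with singular value s
# satisfies A v = s u and u^T A = s v^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
u, v = U[:, 0], Vt[0]
assert np.allclose(A @ v, s[0] * u)
assert np.allclose(u @ A, s[0] * v)
```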

Let $\tilde{p}_U$ denote the projection of a point $\tilde{p}$ onto $U$. Then the distance between a point $x$ and a set (possibly a subspace) $S$ is defined as $d(x, S) = \inf_{y \in S} \|x - y\|$.
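For a linear subspace given by an orthonormal basis, the projection and the distance have a closed form. The sketch below assumes the subspace passes through the origin and is represented by a matrix `basis` with orthonormal columns; the function names are hypothetical.

```python
import numpy as np

def project_onto_subspace(x, basis):
    """Orthogonal projection of x onto span(basis); basis is (d, k) with orthonormal columns."""
    return basis @ (basis.T @ x)

def distance_to_subspace(x, basis):
    """Euclidean distance from x to span(basis), attained at the orthogonal projection."""
    return float(np.linalg.norm(x - project_onto_subspace(x, basis)))

# The xy-plane inside R^3.
basis = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
x = np.array([3.0, 4.0, 5.0])

projection = project_onto_subspace(x, basis)   # drops the z-coordinate
distance = distance_to_subspace(x, basis)      # equals |z| = 5
```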

3 Problem

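Given an $n$-point dataset $P$ and a query point $q$, both lying in a $d$-dimensional space $\mathbb{R}^d$, we aim to find its nearest neighbor $p^*$, which satisfies:

$$\|q - p^*\| \le 1 \quad \text{and} \quad \|q - p\| \ge 1 + \epsilon \ \ \text{for all } p \in P \setminus \{p^*\}.$$

Assume that the data points are corrupted by arbitrary small noise $t_i$ which is bounded for all $i$ ($1 \le i \le n$). The observed set $\tilde{P}$ consists of points $\tilde{p}_i = p_i + t_i$ for all $p_i \in P$ and the noisy query point $\tilde{q} = q + t_q$, whose noise $t_q$ is bounded in the same way.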

4 Algorithm

4.1 Subspace Approximation

We utilize a randomized variant of the Block Lanczos method proposed in [Musco and Musco, 2015] to approximate the low-rank subspace of the data set.

2:, ,
4:Orthonormalize ’s columns to obtain
5:Compute the truncated SVD
Algorithm 1 Block Lanczos method [Musco and Musco, 2015]
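Theorem 1.

[Musco and Musco, 2015] For any data matrix $A \in \mathbb{R}^{n \times d}$, the rank-$k$ approximation obtained by singular value decomposition is denoted as $A_k$. Algorithm 1 returns $Z$, which forms the low-rank approximation $ZZ^T A$. Then the following bounds hold with probability at least $99/100$:

$$\|A - ZZ^T A\|_F \le (1+\epsilon)\,\|A - A_k\|_F,$$
$$\|A - ZZ^T A\|_2 \le (1+\epsilon)\,\|A - A_k\|_2,$$
$$\forall i \le k:\ \left| u_i^T A A^T u_i - z_i^T A A^T z_i \right| \le \epsilon\, \sigma_{k+1}^2,$$

where $u_i$ is the $i$-th left singular vector of $A$, $z_i$ is the $i$-th column of $Z$, and $\sigma_{k+1}$ is the $(k+1)$-th singular value of $A$. The runtime of the algorithm is $O\!\left(\mathrm{nnz}(A)\,\frac{k \log d}{\sqrt{\epsilon}} + \frac{n k^2 \log^2 d}{\epsilon} + \frac{k^3 \log^3 d}{\epsilon^{3/2}}\right)$.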

Algorithm 1 returns the matrix which is the approximation to the left singular vectors of data matrix . We use to approximate the right singular vectors of data matrix .
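A randomized block Krylov iteration in the spirit of [Musco and Musco, 2015] can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function name `block_krylov_subspace`, the choice of iteration count `ceil(log(d) / sqrt(eps))` (the cited analysis only fixes it up to a constant), and the per-block re-orthonormalization added for numerical stability are all assumptions.

```python
import numpy as np

def block_krylov_subspace(A, k, eps=0.5, seed=0):
    """Return Z (n x k, orthonormal columns) approximating the top-k left singular vectors of A."""
    n, d = A.shape
    rng = np.random.default_rng(seed)
    num_iters = max(1, int(np.ceil(np.log(d) / np.sqrt(eps))))  # assumed constant factor 1

    # Random Gaussian starting block, pushed through A.
    block, _ = np.linalg.qr(A @ rng.normal(size=(d, k)))
    blocks = [block]
    for _ in range(num_iters):
        # Next Krylov block (A A^T) * block; re-orthonormalizing keeps the same
        # column span per block while avoiding overflow of the powers.
        block, _ = np.linalg.qr(A @ (A.T @ block))
        blocks.append(block)

    # Orthonormal basis Q of the whole Krylov subspace.
    Q, _ = np.linalg.qr(np.hstack(blocks))

    # Top-k left singular vectors of the small matrix Q^T A, lifted back by Q.
    U_small, _, _ = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ U_small[:, :k]

# Toy data: an exactly rank-3 signal plus small dense noise.
rng = np.random.default_rng(1)
U0, _ = np.linalg.qr(rng.normal(size=(60, 3)))
V0, _ = np.linalg.qr(rng.normal(size=(30, 3)))
A = U0 @ np.diag([10.0, 8.0, 6.0]) @ V0.T + 0.01 * rng.normal(size=(60, 30))

Z = block_krylov_subspace(A, k=3, eps=0.5)
spectral_error = np.linalg.norm(A - Z @ (Z.T @ A), 2)
optimal_error = np.linalg.svd(A, compute_uv=False)[3]  # sigma_{k+1}
print(spectral_error, optimal_error)
```

On data with a clear spectral gap such as this toy matrix, the spectral error of `Z @ Z.T @ A` should be close to the optimal rank-`k` error, while only matrix products with `A` and factorizations of thin matrices are needed.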

4.2 Data Partition

Lemma 2.

[Abdullah et al., 2014] The nearest neighbor of $\tilde{q}$ in $\tilde{P}$ is $\tilde{p}^*$.

1:Input: , rank , error , threshold
3:while  do
4:     Compute the -dimensional subspace approximation and low-rank projection matrix of by Algorithm 1
5:     Compute the distance between data points and subspace
6:     Partition data points
7:     Update dataset
8:     Update iteration
9:end while
10:Output: , , .
Algorithm 2 Spectral Data Partition by Low-rank Subspace Approximation
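The iterative partition and the query step described in the introduction can be sketched as below. Several parts are assumptions made for illustration, not taken from the paper: an exact SVD stands in for Algorithm 1, the subspaces pass through the origin, and the capture threshold is twice the mean squared distance (chosen so that at least half of the remaining points are captured in every round). All function names are hypothetical.

```python
import numpy as np

def top_k_subspace(X, k):
    """Orthonormal basis (d x k') of the top right singular directions of the rows of X.
    An exact SVD is used here as a stand-in for the approximate Algorithm 1."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def spectral_partition(X, k):
    """Repeatedly fit a k-dimensional subspace and capture the points close to it."""
    remaining = np.arange(X.shape[0])
    partitions = []
    while remaining.size > 0:
        basis = top_k_subspace(X[remaining], k)
        residual = X[remaining] - (X[remaining] @ basis) @ basis.T
        sq_dist = (residual ** 2).sum(axis=1)
        # Assumed threshold: points within twice the mean squared distance are captured.
        captured = sq_dist <= 2.0 * sq_dist.mean()
        partitions.append((basis, remaining[captured]))
        remaining = remaining[~captured]
    return partitions

def query_partition(X, partitions, q):
    """Locate the subspace nearest to q, then scan only the points assigned to it."""
    def dist_to_subspace(part):
        basis = part[0]
        return np.linalg.norm(q - basis @ (basis.T @ q))
    _, idx = min(partitions, key=dist_to_subspace)
    dists = np.linalg.norm(X[idx] - q, axis=1)
    return int(idx[np.argmin(dists)])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
partitions = spectral_partition(X, k=2)
answer = query_partition(X, partitions, X[5] + 0.01 * rng.normal(size=10))
print(len(partitions), [len(idx) for _, idx in partitions])
```

In this sketch every round captures at least half of the remaining points, so the number of partitions grows at most logarithmically with the number of points. Whether the point returned by `query_partition` is the true nearest neighbor depends on the separation and noise assumptions stated in the paper, which random Gaussian data does not satisfy.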
Lemma 3.

Algorithm 2 terminates within $O(\log n)$ iterations.


Let be the -dimensional subspace of with projection matrix , let be the -dimensional subspace of with projection matrix , let be the low-rank approximation returned by Algorithm 1 with projection matrix . The distance between data points and subspace is computed as:

According to Theorem 1, we can have:

, We can get


Since minimizes the sum of squared distances from all to ,

Then, we can get:


Hence, at most half of the points in the current data set have a distance to the approximated subspace greater than the threshold, so the captured set contains at least half of the points. The algorithm then proceeds on the remaining set. After $O(\log n)$ iterations all points of $\tilde{P}$ must be captured. ∎
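The counting step behind this halving argument can be written out as follows. This is a sketch under assumed notation that does not appear in the original: $m$ is the number of points still uncaptured at the start of an iteration, $d_i$ is the distance of the $i$-th such point to the approximated subspace, $\bar{D} > 0$ is their mean squared distance, $n_j$ is the number of points remaining after $j$ iterations, and the factor 2 in the threshold is illustrative (the threshold used by Algorithm 2 may differ).

```latex
\begin{align*}
\bar{D} &= \frac{1}{m}\sum_{i=1}^{m} d_i^2, \\
\left|\{\, i : d_i^2 > 2\bar{D} \,\}\right| \cdot 2\bar{D}
  &\le \sum_{i=1}^{m} d_i^2 = m\,\bar{D}
  \quad\Longrightarrow\quad
  \left|\{\, i : d_i^2 > 2\bar{D} \,\}\right| \le \frac{m}{2}, \\
n_j &\le \frac{n}{2^{j}}
  \quad\Longrightarrow\quad
  n_j < 1 \ \text{ once } j > \log_2 n .
\end{align*}
```

If $\bar{D} = 0$, every remaining point lies on the subspace and is captured immediately, so the bound on the number of iterations still holds.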

Lemma 4.

The approximated subspace that captures $\tilde{p}^*$ returns this $\tilde{p}^*$ as the $(1+\epsilon)$-approximate nearest neighbor of $\tilde{q}$ (in $\tilde{P}$).

Proof of Lemma 4.

Fix that is captured by the same , and use the triangle inequality to write


Similarly for , , and by our assumption . By the triangle inequality, we get


Similarly, we bound

By using Pythagoras’ Theorem (recall both ),

Hence, $\tilde{p}^*$ is reported by the $k$-dimensional subspace it is assigned to. ∎

5 Experiment

In this experiment, we compare our algorithm with existing hashing algorithms.

5.1 Baseline algorithms

Our comparative algorithms include state-of-the-art learning-to-hash algorithms.

We refer to our algorithm as Random Subspace based Spectral Hashing (RSSH).

5.2 Datasets

Dataset     #training   #query   #feature   #class
MNIST       69,000      1,000    784        10
CIFAR-10    59,000      1,000    512        10
COIL-20     20,019      2,000    1,024      20
VOC2007     5,011       4,096    3,720      20
Table 1: Summary of Datasets

Our experimental datasets include MNIST, CIFAR-10 (kriz/cifar.html/), COIL-20 and the 2007 PASCAL VOC challenge dataset.

MNIST. It is a well-known dataset of handwritten digits from “0” to “9”. The dataset consists of 70,000 samples in a feature space of dimension 784. We split the samples into a training set and a query set containing 69,000 and 1,000 samples respectively.

CIFAR-10. There are 60,000 images in 10 classes, such as “horse” and “truck”. We use 59,000 images as the training set and 1,000 images as the query set. Each image is represented by a 512-dimensional GIST feature.

COIL-20. It is from the Columbia University Image Library and contains 20 objects. Each image is represented by a 1,024-dimensional feature vector. For each object, we choose 60% of the images for training and use the others as queries.

Figure 1: The average precision on VOC2007.

VOC2007. The VOC2007 dataset consists of three subsets: training, validation and testing. We use the first two subsets for training, containing 5,011 samples, and the other as the query set, containing 4,096 samples. We resize each image to [80, 100] and extract the HOG feature with cell size 10 as its feature representation. All the images in VOC2007 are divided into 20 subjects, such as “aeroplane” and “dining table”. For the classification task on each subject, there are 200 to 500 positive samples and the remaining roughly 4,000 samples are negative. Thus, the label distribution of the query set is unbalanced. A brief description of the datasets is presented in Table 1.

5.3 Evaluation Metrics

All the experimental datasets are fully annotated. We report the classification result based on the ground-truth label information. That is, the label of the query is assigned by its nearest neighbor. For the first three datasets in Table 1, we report the classification accuracy. For the VOC2007 dataset, we report the precision, as the label distribution is highly unbalanced. The criteria are defined in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}.$$

For the retrieval task, we report the recall with top [1, 10, 25] retrieved samples. The true neighbors are defined by the Euclidean distance.
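The evaluation criteria can be computed as in the following sketch. The accuracy and precision formulas are the standard ones; the exact recall definition is an assumption (fraction of the `k` true Euclidean nearest neighbors that appear among the top `k` retrieved samples, averaged over queries), and all function names are hypothetical.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall_at_k(database, queries, retrieved, k):
    """Average overlap between the top-k retrieved indices and the k true
    nearest neighbors under the Euclidean distance."""
    hits = 0
    for q, ret in zip(queries, retrieved):
        dists = np.linalg.norm(database - q, axis=1)
        true_neighbors = {int(i) for i in np.argsort(dists)[:k]}
        hits += len(true_neighbors & {int(i) for i in list(ret)[:k]})
    return hits / (k * len(queries))

# Toy example: four points on a line, one query near the first point.
database = np.array([[0.0], [1.0], [2.0], [3.0]])
queries = np.array([[0.1]])
retrieved = [[0, 3]]  # a retrieval result containing one true neighbor and one miss
r = recall_at_k(database, queries, retrieved, k=2)
```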

We report the aforementioned evaluation criteria with the number of hash bits varying in the range of [2, 128].

Figure 2: The average recall in terms of the number of hash bits with different retrievals on the VOC2007. Panels: (a) Recall of Top 1 retrieval; (b) Recall of Top 10 retrieval; (c) Recall of Top 25 retrieval.

5.4 Classification Results

The classification accuracy on CIFAR-10, MNIST and COIL-20 is reported in Figures 3a, 3b and 3c. We can see that our algorithm achieves the best accuracy across the numbers of hash bits on the three datasets. For example, on CIFAR-10, the accuracy of our algorithm RSSH reaches 53.20% while the comparative algorithms are all below 40.00%. Increasing the number of hash bits improves the accuracy of all algorithms, and our algorithm remains in the leading place. For example, on MNIST, ITH and SH reach an accuracy of 93.00%, but our algorithm still holds an advantage of more than 4 percentage points with 97.30%. Moreover, our algorithm obtains significantly good performance even with limited information, that is, when the number of hash bits is small. For instance, RSSH then reaches an accuracy of 87.70%, much better than the comparative algorithms.

For the classification results on VOC2007, we report the precision on 12 of the 20 classes as representatives in Figure 4. We can see that it is a tough task for all the methods, but our algorithm still obtains satisfying performance. For example, on the classification task of “horse”, our algorithm obtains around a 10% advantage over all the comparative algorithms. The average precision on all 20 classes is presented in Figure 1. We can see that our algorithm obtains the overall best result.

5.5 Retrieval Results

The retrieval results on MNIST, CIFAR-10 and COIL-20 are presented in Figure 3d to Figure 3l. Our algorithm obtains the best recall with a varying number of retrieved samples. For example, on CIFAR-10 with Top 10 retrieval, our algorithm reaches a recall over 70%, while the others are below 20%. On MNIST with Top 25 retrieved samples, the recall of RSSH reaches 90%, while the comparative algorithms are around 40%.

Figure 3: Classification accuracy and recall in terms of the number of hash bits on three datasets. Panels: (a) Classification accuracy on CIFAR-10; (b) Classification accuracy on MNIST; (c) Classification accuracy on COIL-20; (d), (g), (j) Recall of Top 1 retrieval; (e), (h), (k) Recall of Top 10 retrieval; (f), (i), (l) Recall of Top 25 retrieval.
Figure 4: Precision and recall in terms of the number of hash bits on various classes of VOC2007. Panels show the precision on the classes: (a) Aeroplane; (b) Bicycle; (c) Bird; (d) Bus; (e); (f) Cow; (g) Dining table; (h) Horse; (i) Motorbike; (j) Potted plant; (k) Train; (l) Tv monitor.


  • [Abdullah et al., 2014] Abdullah, A., Andoni, A., Kannan, R., and Krauthgamer, R. (2014). Spectral approaches to nearest neighbor search. In IEEE 55th Annual Symposium on Foundations of Computer Science, pages 581–590. IEEE.
  • [Andreas et al., 2016] Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016). Learning to compose neural networks for question answering. arXiv preprint arXiv:1601.01705.
  • [Gong and Lazebnik, 2011] Gong, Y. and Lazebnik, S. (2011). Iterative quantization: A procrustean approach to learning binary codes. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, pages 817–824.
  • [Kumar et al., 2016] Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., Zhong, V., Paulus, R., and Socher, R. (2016). Ask me anything: Dynamic memory networks for natural language processing. In International Conference on Machine Learning, pages 1378–1387.
  • [Liu et al., 2011a] Liu, W., Wang, J., Kumar, S., and Chang, S.-F. (2011a). Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning, pages 1–8.
  • [Liu et al., 2011b] Liu, X., Zhang, S., Wei, F., and Zhou, M. (2011b). Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pages 359–367. Association for Computational Linguistics.
  • [Mikolov et al., 2013] Mikolov, T., Yih, W.-t., and Zweig, G. (2013). Linguistic regularities in continuous space word representations. In HLT-NAACL, volume 13, pages 746–751.
  • [Musco and Musco, 2015] Musco, C. and Musco, C. (2015). Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, pages 1396–1404.
  • [Passos et al., 2014] Passos, A., Kumar, V., and McCallum, A. (2014). Lexicon infused phrase embeddings for named entity resolution. In Proceedings of the Eighteenth Conference on Computational Language Learning, pages 78–86.
  • [Ramanathan et al., 2014] Ramanathan, V., Joulin, A., Liang, P., and Fei-Fei, L. (2014). Linking people in videos with “their” names using coreference resolution. In European Conference on Computer Vision, pages 95–110.
  • [Socher et al., 2013] Socher, R., Perelygin, A., Wu, J. Y., Chuang, J., Manning, C. D., Ng, A. Y., Potts, C., et al. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.
  • [Weiss et al., 2012] Weiss, Y., Fergus, R., and Torralba, A. (2012). Multidimensional spectral hashing. In Proceedings of the 12th European Conference on Computer Vision, pages 340–353.
  • [Weiss et al., 2009] Weiss, Y., Torralba, A., and Fergus, R. (2009). Spectral hashing. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 1753–1760.
  • [Wu et al., 2016] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
  • [Xia et al., 2015] Xia, Y., He, K., Kohli, P., and Sun, J. (2015). Sparse projections for high-dimensional binary codes. In Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, pages 3332–3339.
  • [Yates et al., 2007] Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M., and Soderland, S. (2007). Textrunner: open information extraction on the web. In Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, pages 25–26.
  • [Yu et al., 2014] Yu, F. X., Kumar, S., Gong, Y., and Chang, S. (2014). Circulant binary embedding. In Proceedings of the 31st International Conference on Machine Learning, pages 946–954.