High-Dimensional Similarity Search with Quantum-Assisted Variational Autoencoder

06/13/2020 · Nicholas Gao, et al. · Technische Universität München, University of Bristol, NASA

Recent progress in quantum algorithms and hardware indicates the potential importance of quantum computing in the near future. However, finding suitable application areas remains an active area of research. Quantum machine learning is touted as a potential approach to demonstrate quantum advantage within both the gate-model and the adiabatic schemes. For instance, the Quantum-assisted Variational Autoencoder (QVAE) has been proposed as a quantum enhancement to the discrete variational autoencoder (VAE). We extend previous work and study the real-world applicability of a QVAE by presenting a proof of concept for similarity search in large-scale high-dimensional datasets. While exact and fast similarity search algorithms are available for low-dimensional datasets, scaling to high-dimensional data is non-trivial. We show how to construct a space-efficient search index based on the latent space representation of a QVAE. Our experiments show a correlation between the Hamming distance in the embedded space and the Euclidean distance in the original space on the Moderate Resolution Imaging Spectroradiometer (MODIS) dataset. Further, we find real-world speedups compared to linear search and demonstrate memory-efficient scaling to half a billion data points.


1. Introduction

(a) Candidates within a given Hamming distance
(b) NDVI comparison
Figure 1. Similarity search on the MODIS dataset. Figure 1(a) shows the retrieved candidates and Figure 1(b) shows a comparison between the reference point and a random candidate.


Advances in hardware and algorithms are encouraging signs that quantum computers will have useful applications (preskill2018quantum; arute2019quantum), though hard challenges remain. One of these challenges is finding early target applications. Quantum machine learning (QML) is being explored as a tool to demonstrate quantum advantage (biamonte2017quantum), a situation in which a quantum computer is preferred over a classical one. For example, the quantum-assisted variational autoencoder (QVAE) (khoshaman2018quantum; vinci2019path) integrates samples from a quantum annealer into the prior of a discrete variational autoencoder (VAE) (rolfe_discrete_2016; kingma_auto-encoding_2013). Quantum annealers can be thought of as analog quantum devices that physically simulate a quantum Boltzmann machine (QBM) (amin2018quantum). Quantum and classical Boltzmann machines (BMs) can model powerful and flexible distributions. Training such models, however, is known to be difficult and slow classically (long2010restricted). The use of quantum annealers as samplers could result in more robust and scalable sampling, and allow training QBMs for use in models of practical impact. In this work, we investigate the real-world applicability of a QVAE to the problem of similarity search in large-scale high-dimensional datasets. It should be clear, though, that we explore feasibility and neither achieve nor aim for state-of-the-art performance.

Similarity search is an important problem for many applications, such as computer vision and recommender systems, and has been extensively studied in the literature. However, few works have focused on scaling to billions of data points with thousands of dimensions (li_approximate_2016). The methodologies for scaling to billions of data points and for scaling to very high-dimensional spaces differ. While research on billion-scale datasets has focused on efficiently utilizing current hardware and accelerators (camerra2014beyond; johnson_billion-scale_2017; yagoubi_radiussketch:_2017), recent work on very high-dimensional data suggests the use of deep neural networks as dimensionality reduction tools (do_learning_2016; srivastava_unsupervised_2015; carreira-perpinan_hashing_2015; cao_correlation_2016).

With the increasing size and dimensionality of datasets, this issue is gaining importance: massive datasets push classical algorithms to their limits. An area especially affected is the Earth sciences, where datasets grow exponentially with increasing spatial and temporal resolution. NASA, for instance, operates hundreds of satellites that collect continuous data of Earth's processes to study land-cover change, atmospheric systems, and many other phenomena (board2007earth). The ever-growing size of these datasets poses challenges to scientists studying the Earth system.

We present an approximate nearest neighbor search (ANNS) approach which uses a QVAE to embed data into an efficiently searchable binary Hamming space. The binary nature of the QVAE latent space removes the need for any additional discretization and enables end-to-end training. Further, we evaluate the approximation quality of the Hamming distance in the compressed space, the effect of the quantum prior, the resulting speedup and the memory footprint using the MODIS dataset. An example of our ANNS is shown in Figure 1, where a reference point together with all candidates within a given Hamming distance is shown in Figure 1(a) and a comparison with a random candidate is shown in Figure 1(b).

To the best of our knowledge, this is the first work that explores the applicability of QML to similarity search in large-scale high-dimensional datasets. Further, we demonstrate the usefulness of a quantum-hardware-specific feature, the transverse field, in shaping the latent distribution to improve real-world performance, and we present a framework to index hundreds of millions of data points with minimal space requirements.

The remainder of this paper is structured as follows. In Section 2 we provide the necessary background and position our work in the context of recent research. We present our method in Section 3 and evaluate it in Section 4. The results are discussed in Section 5, and we conclude in Section 6.

2. Related Work

In the following, we present related work and background. We start with the motivation for high-dimensional similarity search in the Earth sciences in Section 2.1. Next, we examine existing ANNS methods and distinguish our approach from existing work in Section 2.2. Section 2.3 provides background on VAEs and the transition to a QVAE, i.e., a VAE with a QBM prior; QBMs, and BMs in general, are explained in Section 2.4.

2.1. Earth Science

Satellite observations, physical model outputs, and ground-based systems collect spatio-temporal data that are used to understand changes in the Earth system. Analyses of these datasets have allowed scientists to uncover changes in interannual variability, anomalies (wright2002automated), location similarity, and seasonal phenomenology (zhang2003monitoring). While many currently available geospatial tools provide easy access to spatial processing, tools for large-scale similarity search on high-resolution datasets are not as well developed. As data in the Earth sciences continue to grow and available applications broaden, computational efficiency will gain importance.

Steady improvements to sensing hardware and high-performance computing enable remote sensing data and physical model outputs of higher spatial, temporal, and spectral resolution. However, improved resolution causes an exponential increase in data storage requirements, leading to large-scale analysis challenges. For example, the MODIS dataset (remer2005modis) from NASA's Terra and Aqua satellites, which we use in this study, provides 17 years of daily observations of the entire Earth's surface at a spatial resolution of 250 meters over 36 spectral bands, at the petabyte scale. After data collection, higher-level products are generated from the raw sensor output using physics-based pre-processing for specific applications such as land-surface temperature, snow cover, and fire. For instance, the MODIS vegetation dataset is widely used to monitor seasonal variations in phenology at a global scale (zhang2003monitoring). As satellite and climate datasets continue to improve, high-dimensional data analysis will continue to be a burden to researchers.

In recent years, data mining tools have been commonly used to study large-scale Earth science datasets for land-cover change detection (boriah2008land), climate networks (steinhaeuser2012multivariate), and many other applications (kamath2001mining). Climate networks, for example, are used to locate climatic teleconnections by quantifying dependencies between all locations within the dataset (zhou2015teleconnection). For high-resolution data, exhaustive approaches to generating these complex networks become infeasible without an efficient similarity search. Here, we explore the applicability of a QVAE to index multi-variate time series of land-cover change using the MODIS dataset.

2.2. Approximate Nearest Neighbor Search

Finding nearest neighbors is important in a variety of areas, e.g., databases, computer vision or recommender systems. However, performing an exact nearest neighbor search on a dataset of N points with d dimensions is computationally expensive when both N and d are large, due to the curse of dimensionality (indyk_approximate_1998).

In many applications it is sufficient to find near objects instead of the nearest ones, motivating approximate methods. Generally, ANNS algorithms rely on pre-computed indices whose initial build may take significant time but which reduce the query time by orders of magnitude. A variety of approaches have been proposed, but most can be categorized into hashing-based or graph-based methods (li_approximate_2016). Since our method falls into the first category, we elaborate on hashing-based approaches here. The most prominent example is locality-sensitive hashing (LSH) (gionis_similarity_1999). It uses hash tables, i.e., dictionaries that map from a hash to all items with this hash, as its index, and relies on ensembles of similarity-preserving hash functions. At query time, items that agree with the query in at least one hash are returned as candidates and then refined by linear search. Although LSH performs well on low-dimensional datasets, high-dimensional datasets usually require prior dimensionality reduction.
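To make this bucket-and-refine pattern concrete, the following is a minimal, illustrative Python sketch of random-hyperplane LSH with several hash tables. The class and parameter names are ours and this is not the implementation evaluated in this paper:

import numpy as np

class RandomProjectionLSH:
    """Minimal bucket-and-refine LSH sketch using random hyperplane hashes."""

    def __init__(self, data, n_tables=10, n_bits=10, seed=0):
        rng = np.random.default_rng(seed)
        self.data = np.asarray(data, dtype=np.float64)        # (N, d)
        # one random projection matrix (= one hash function ensemble) per table
        self.planes = rng.normal(size=(n_tables, n_bits, self.data.shape[1]))
        self.tables = []
        for P in self.planes:
            table = {}
            for idx, key in enumerate(self._hash(self.data, P)):
                table.setdefault(key, []).append(idx)
            self.tables.append(table)

    @staticmethod
    def _hash(x, P):
        # sign pattern of the projections, packed into a hashable key per row
        bits = (x @ P.T > 0)
        return [row.tobytes() for row in bits]

    def query(self, q, k):
        # candidates: union of the buckets the query falls into, one per table
        candidates = set()
        for P, table in zip(self.planes, self.tables):
            key = self._hash(q[None, :], P)[0]
            candidates.update(table.get(key, []))
        if not candidates:
            return np.array([], dtype=np.int64)
        candidates = np.fromiter(candidates, dtype=np.int64)
        # refine the candidates by exact Euclidean distance (linear search)
        dists = np.linalg.norm(self.data[candidates] - q, axis=1)
        return candidates[np.argsort(dists)[:k]]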

Research into using deep neural networks, especially autoencoders, extends hashing-based approaches by learning the hash function. While this approach has shown promising results for high-dimensional data, it comes with some disadvantages: the discrete space of a hash function yields a non-smooth optimization problem, which has often been addressed with complex alternating optimization algorithms (do_learning_2016; carreira-perpinan_hashing_2015) or by relaxing to a continuous space (srivastava_unsupervised_2015). Further, labeled data is rarely available for similarity search, so most methods rely on unsupervised learning (do_learning_2016; srivastava_unsupervised_2015; carreira-perpinan_hashing_2015), although supervised (cao_correlation_2016) and semi-supervised (do_learning_2016) methods exist.

In this work, we use a QVAE to learn hash functions in a Hamming space. This setup yields several advantages: it relies on the VAE framework to perform fully scalable, unsupervised analysis of high-dimensional data. It exploits a generalization of the VAE to discrete latent variables (rolfe_discrete_2016) to enable end-to-end training with gradient descent while maintaining a binary latent space at inference. Furthermore, VAEs are probabilistic autoencoders, so our method encourages similarity between neighboring hashes, making the Hamming distance meaningful. Finally, as we discuss in further detail in Section 2.4, the integration of a QBM prior allows continuous control over the distribution of the latent variables by adjusting the tunneling term (transverse field) in the QBM.

2.3. Quantum Variational Autoencoder

A VAE is a generative model with latent variables ζ that consists of two neural networks, an encoder q_φ(ζ|x) and a decoder p_θ(x|ζ), parametrized by φ and θ, respectively. It avoids computing the intractable true posterior by using a variational approximation (hoffman_stochastic_2013), i.e., it is trained by maximizing a tractable evidence lower bound (ELBO) L of the log-likelihood log p_θ(x):

L(θ, φ; x) = E_{q_φ(ζ|x)}[log p_θ(x|ζ)] − D_KL(q_φ(ζ|x) ‖ p_θ(ζ)),   (1)
log p_θ(x) ≥ L(θ, φ; x).   (2)

The first term in Equation (1) is known as the reconstruction loss and the second term is the KL divergence between the approximate posterior q_φ(ζ|x) and the prior p_θ(ζ).

The reparametrization trick (kingma_auto-encoding_2013) stabilizes the gradients by introducing an auxiliary random variable ρ ~ U(0, 1) and a reparametrization function f. A sample from the approximate posterior is then rewritten as ζ = f(l, ρ), with l being the 'logits' output by the encoder. This allows us to reformulate the KL divergence:

D_KL(q ‖ p) = −H(q) − E_q[log p(ζ)],   (3)
H(q) = −E_q[log q(ζ)],   (4)

where we used q and p in place of q_φ(ζ|x) and p_θ(ζ) to avoid clutter. Computing the cross-entropy term E_q[log p(ζ)] depends on the prior distribution, while the entropy H(q) in Equation (4) depends on the reparametrization. Since our method requires a discrete latent space, we will focus on BMs as priors and binary latent spaces from here on. The generalization of the reparametrization trick to discrete variables is performed following Reference (khoshaman2018gumbolt), which introduces a smoothed continuous approximation of the latent discrete variables z as

ζ = σ((l + σ⁻¹(ρ)) / τ),   ρ ~ U(0, 1),   (5)

with σ the logistic function and the temperature τ regulating the 'strength' of the discretization. To get a deterministic hash function, we decided to map to the state with the highest likelihood (i.e., z_i = 1 if l_i > 0) instead of sampling and discretizing during inference. Using this Bernoulli reparametrization, the closed-form solution for Equation (4) is the element-wise cross-entropy (khoshaman2018quantum):

H(q) = −Σ_i [ σ(l_i) log σ(l_i) + (1 − σ(l_i)) log(1 − σ(l_i)) ].   (6)
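A minimal PyTorch sketch of this reparametrization, the entropy term and the deterministic code; the function names are ours and the snippet is only meant to illustrate Equations (5) and (6):

import torch

def smoothed_bernoulli(logits, tau):
    """Continuous relaxation of a Bernoulli sample, Eq. (5):
    zeta = sigmoid((l + logit(rho)) / tau) with rho ~ U(0, 1)."""
    rho = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(rho) - torch.log1p(-rho)      # logit(rho) = sigma^{-1}(rho)
    return torch.sigmoid((logits + noise) / tau)

def bernoulli_entropy(logits):
    """Closed-form entropy of the factorized Bernoulli posterior, Eq. (6)."""
    q = torch.sigmoid(logits)
    eps = 1e-10                                     # numerical stability
    return -(q * torch.log(q + eps) + (1 - q) * torch.log(1 - q + eps)).sum(dim=-1)

def deterministic_code(logits):
    """At inference, map to the most likely state instead of sampling."""
    return (logits > 0).to(torch.uint8)             # sigmoid(l) > 0.5  <=>  l > 0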

If the prior p_θ(ζ) is given by a BM, the gradients of the cross-entropy term E_q[log p(ζ)] can be computed in the same way as when training the BM with q_φ as the target distribution, as described in Section 2.4, with the gradients propagating through the samples of the positive phase to the encoder (rolfe_discrete_2016).

Replacing the BM prior with a QBM yields the QVAE. The QBM offers an additional parameter, the transverse field, which enables continuous control over the prior distribution: by increasing the transverse field, the prior loses expressive power. Previous work (hoffman_elbo_2016; so_nderby_ladder_2016; casale_gaussian_2018) has shown that the expressive power of the latent prior is linked to the number of decoupled bits in the latent distribution, i.e., decreasing the expressiveness of the prior narrows the latent distribution. In the discrete case, this causes the encoder to map to fewer distinct bit strings. Further discussion of the transverse field can be found in Section 2.4.

2.4. Quantum Boltzmann Machines

(a) Logical graph
(b) Hardware embedding
Figure 2. Representations of a restricted Boltzmann machine (RBM). The red squares represent the left nodes and the blue spheres the right nodes. Figure 2(a) shows the logical graph and Figure 2(b) a hardware embedding where chains of qubits form single logical nodes.

A BM is an energy-based graphical model composed of stochastic binary nodes, with weighted connections between and biases applied to the nodes, i.e., a weighted graph with nodes V and edges E, a node weight (bias) function a: V → ℝ and an edge weight function w: E → ℝ describes a BM over the states z ∈ {0, 1}^|V| with parameters θ = {a, w}. We refer to this graph as the logical graph. The energy of a state z is given by

E_θ(z) = −Σ_{i∈V} a_i z_i − Σ_{(i,j)∈E} w_{ij} z_i z_j,   (7)

and the probability mass function (PMF) of the state distribution is

p_θ(z) = exp(−β E_θ(z)) / Z_θ,   (8)

with Z_θ = Σ_z exp(−β E_θ(z)) being an intractable normalization constant and β a parameter recognized by physicists as the inverse temperature.

A BM is usually trained via contrastive divergence (CD) (hinton_training_2002; tieleman2009using). Given a target distribution p_D and a BM parametrized by θ, the gradients of the negative log-likelihood can be computed as

∇_θ = β ( E_{z∼p_D}[∂_θ E_θ(z)] − E_{z∼p_θ}[∂_θ E_θ(z)] ).   (9)

The positive phase consists of the gradients of the energy for samples from the target distribution, while the negative phase is computed with samples from the PMF in Equation (8). Since it is hard to sample from and compute probabilities on fully-connected BMs (koller2007graphical), RBMs are commonly used, as they can be sampled more efficiently using block-Gibbs sampling. An RBM is a symmetric complete bipartite BM, as shown in Figure 2(a).
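A small NumPy sketch of CD-1 for an RBM, illustrating the positive/negative phase split of Equation (9) with block-Gibbs sampling; it is a generic textbook-style implementation with our own helper names, not the code used for the experiments:

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradients(v_data, a, b, W):
    """One CD-1 step for an RBM with visible biases a, hidden biases b and
    weights W (beta fixed to 1). Returns gradients of the negative
    log-likelihood, mirroring the phases of Eq. (9)."""
    # positive phase: hidden activations driven by the data
    ph_data = sigmoid(v_data @ W + b)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)

    # negative phase: one block-Gibbs sweep to obtain model samples
    pv_model = sigmoid(h_data @ W.T + a)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W + b)

    # <dE/dtheta>_data - <dE/dtheta>_model  with  E = -a.v - b.h - v.W.h
    grad_W = -(v_data.T @ ph_data - v_model.T @ ph_model) / len(v_data)
    grad_a = -(v_data - v_model).mean(axis=0)
    grad_b = -(ph_data - ph_model).mean(axis=0)
    return grad_a, grad_b, grad_W   # apply with theta <- theta - lr * grad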

The next paragraph describes sampling from a quantum annealer and the related background. However, the quantum annealer may also be treated as a black-box sampler that returns samples given the model parameters and a transverse field, in which case the paragraph may be skipped.

Quantum annealing (QA), an optimization algorithm exploiting quantum phenomena, has been proposed for sampling from complex Boltzmann-like distributions. QA has been demonstrated to solve a range of instances of optimization problems (rieffel2019ans), though it remains unclear what speedup QA can provide; even defining and detecting speedup, especially in small and noisy hardware implementations, is challenging (ronnow2014defining; katzgraber2014glassy). To sample a QBM using QA, the framework outlined in Equation (7) has to be mapped to an Ising model, a quantum system represented by the Hamiltonian

H_P = Σ_{i∈V} h_i σ_i^z + Σ_{(i,j)∈E} J_{ij} σ_i^z σ_j^z,   (10)

where the nodes have been replaced by the quantum operators σ_i^z, which return eigenvalues s_i in the set {−1, +1} when applied to the state of variable i, corresponding to 0 and 1, respectively. The possible states in the 'classical' and Ising representations are related by z_i = (s_i + 1)/2. The parameters a and w are replaced with the Ising model parameters h and J, which are conceptually equivalent. In the hardware, these parameters are referred to as the flux bias and the coupling strength, respectively. The full model describing the dynamics of the quantum annealer, equivalent to the time-dependent transverse field Ising model, is (amin2018quantum)

H(t) = A(t) H_D + B(t) H_P,   (11)

where the transverse field term is

H_D = −Σ_{i∈V} σ_i^x,   (12)

with σ_i^x being quantum operators on the complex-valued Hilbert space, A(t) and B(t) being monotonic functions and t_a being the total annealing time (biswas2017nasa). Generally, at the start of an anneal, A(0) ≫ B(0) ≈ 0. A(t) decreases and B(t) increases monotonically with t until, at the end of the anneal, B(t_a) ≫ A(t_a) ≈ 0. When A(t) ≠ 0, Equation (11) contains terms that describe quantum tunneling effects between various units (the transverse field H_D) that are not present in the classical model.

At a high level, increasing the transverse field increases the amount of quantum fluctuations present in the model. These quantum fluctuations move the classical Boltzmann distribution towards a factorized binomial distribution with success probability 1/2 per bit. This reduction impedes the ability to capture correlations between bits and causes the distribution to lose expressive power.

Lastly, the connectivity of the BM may not be reflected in the connectivity between qubits in the annealer. If that is the case, one needs to embed the logical graph into a hardware-specific one. This can be done by a 1-to-many mapping, i.e., one logical variable is represented by multiple qubits. These qubits are arranged in a 'chain' by setting the coupling strength between them to a large value, encouraging them to take the same value. Full details of the embedding process used in this model can be found in Reference (adachi2015application). An example of an embedded graph with maximum degree 6 can be found in Figure 2(b), where chains of red squares and chains of blue circles represent the RBM nodes.
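For illustration, the following sketch converts BM parameters in the {0, 1} convention of Equation (7) into Ising fields and couplings for Equation (10) under the substitution z_i = (s_i + 1)/2. The sign conventions follow our statement of Equation (7) and the helper name is ours; in practice the resulting h and J would additionally be rescaled to the annealer's allowed parameter range and distributed over the qubit chains of the embedding.

import numpy as np

def bm_to_ising(a, W):
    """Map BM parameters over z in {0, 1} (Eq. 7) to Ising parameters over
    s in {-1, +1} (Eq. 10) via z = (s + 1) / 2.

    Assumes E(z) = -sum_i a_i z_i - sum_{(i,j)} W_ij z_i z_j with a symmetric
    weight matrix W (zero diagonal) and each edge counted once."""
    W_upper = np.triu(W, k=1)                 # each edge (i, j) with i < j
    J = -W_upper / 4.0                        # coupling strengths
    h = -(a / 2.0 + (W_upper + W_upper.T).sum(axis=1) / 4.0)   # local fields
    return h, J   # H_P(s) matches E(z(s)) up to a state-independent constant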

3. Method

Figure 3. QVAE pipeline: The sequence from left to right represents the forward pass of our model, where x is drawn from the data distribution, l denotes the 'logits' output by the encoder, ρ is a uniformly distributed random variable, f is the reparametrization function, q_φ(ζ|x) the discrete latent distribution and p_θ(x|ζ) the distribution of the reconstruction.

This section describes our QVAE-based ANNS approach. Section 3.1 covers the implementation of our QVAE and Section 3.2 focuses on how to perform the ANNS.

3.1. Model

The model architecture is depicted in Figure 3. The network is a VAE as described in Section 2.3. The encoder uses two fully-connected layers followed by a hierarchical posterior from Reference (rolfe_discrete_2016). The hierarchical posterior allows modeling more complex posterior distributions and thus optimizing a tighter ELBO. To avoid overfitting, i.e., encoding the data distribution in the decoder's parameters, the decoder network is simpler, featuring only two fully-connected layers. All layers use the ReLU activation, and we use the reparametrization of discrete latent variables from Equation (5) with the temperature τ chosen following (khoshaman2018gumbolt; vinci2019path).

We scale our data to be zero-centered with unit variance and choose a multivariate normal distribution with unit variance as reconstruction target. Therefore, our reconstruction loss, the first term of Equation (1), is the mean squared error

L_rec(x) = ‖x − f_{θ,φ}(x)‖₂²,   (13)

where f_{θ,φ}(x) is the forward pass of our model, i.e., mapping x to the latent space and back to the original space.
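A minimal PyTorch sketch of this reconstruction path, using hidden sizes 128 and 64 and a 64-bit latent space as in Section 4 and reusing the smoothed_bernoulli helper from the sketch in Section 2.3. It omits the hierarchical posterior and the KL/prior term, assumes the decoder mirrors the encoder, and is not the exact architecture used in the experiments:

import torch
import torch.nn as nn

class QVAESketch(nn.Module):
    """Simplified reconstruction path: two-layer encoder, Bernoulli latent
    code, two-layer decoder. Prior/KL handling is left out for brevity."""

    def __init__(self, d_in, d_latent=64, h1=128, h2=64, tau=1.0):
        super().__init__()
        self.tau = tau
        self.encoder = nn.Sequential(
            nn.Linear(d_in, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, d_latent),            # outputs the logits l
        )
        self.decoder = nn.Sequential(
            nn.Linear(d_latent, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, d_in),
        )

    def forward(self, x):
        logits = self.encoder(x)
        zeta = smoothed_bernoulli(logits, self.tau)          # Eq. (5), see Section 2.3
        x_hat = self.decoder(zeta)
        recon = ((x - x_hat) ** 2).sum(dim=-1).mean()        # Eq. (13)
        return recon, logits

    @torch.no_grad()
    def hash(self, x):
        """Deterministic binary code used later to build the search index."""
        return (self.encoder(x) > 0).to(torch.uint8)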

We compare an RBM, a classically simulated QBM and a QBM trained with samples drawn from a quantum annealer. Parameter gradients are computed by CD, Equation (9), for all priors.

The quantum annealer we used in the experiments, a D-Wave 2000Q, has 2048 qubits, each of degree 6. Since any RBM with more than 6 nodes per side is not native to the hardware, we must embed the logical graph using the method outlined in Section 2.4, with a chain coupling strength of -1, the maximum allowed by the D-Wave 2000Q. At the end of an anneal, the value of a logical variable is determined by a majority vote over its chain.

The quantum annealer samples at some unknown effective inverse temperature β_eff, analogous to the inverse temperature β of the classical RBM, Equation (8). For a standalone RBM, changing β is equivalent to scaling the learning rate during training, Equation (9). However, this does not hold for an RBM placed as the prior of a VAE, because the gradients of the encoder also propagate through the positive phase (vinci2019path). We estimate this parameter throughout training with an auxiliary RBM. We define the auxiliary RBM with the same parameters as the QBM and add the effective temperature β_eff as the only free variable, i.e., the PMF of the auxiliary RBM is proportional to exp(−β_eff E_θ(z)). We estimate the effective temperature by optimizing w.r.t. β_eff using CD as in Equation (9). The negative samples are drawn via Gibbs sampling, while the positive samples come from the quantum annealer. When sampling the auxiliary RBM we use the effective parameters β_eff a and β_eff w. While this technique is not scalable to large BMs, it is likely that the effective temperature at which quantum annealers sample can be fixed and stabilized with improved control of the annealing schedule. For the classical and simulated models, we keep β fixed.
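A schematic sketch of what such a temperature update could look like; the helper names are ours and it only illustrates the CD-style estimate (the derivative of the negative log-likelihood of p(z) ∝ exp(−β E(z)) w.r.t. β is E(z) minus the model's mean energy), not the exact procedure used here:

import numpy as np

def bm_energy(z, a, W):
    """Energy of Eq. (7) for a batch of binary states z (one row per sample)."""
    return -(z @ a) - np.einsum('bi,ij,bj->b', z, np.triu(W, k=1), z)

def update_beta(beta, z_annealer, z_gibbs, a, W, lr=1e-3):
    """One gradient step on the effective inverse temperature: annealer samples
    act as the positive phase, Gibbs samples drawn from the auxiliary RBM at
    the current beta act as the negative phase."""
    grad = bm_energy(z_annealer, a, W).mean() - bm_energy(z_gibbs, a, W).mean()
    return beta - lr * grad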

Finally, the parameters on the D-Wave are restricted to a limited range. To maximize similarity between the auxiliary RBM and the quantum annealer and to limit the magnitude of these parameters, we apply an L2 regularization to the parameters of the QBM and project them back onto the unit L1 box after every update step. All updates of the model are computed using Adam (kingma_adam_2017).

3.2. Search Algorithm

Here we describe the search algorithm used and provide complexity bounds on all operations. Throughout this section, N denotes the size of the dataset and H the number of unique bit strings. For simplicity, we treat the dimensionality d of the original space and d_z of the latent space as fixed, since both are small compared to N. It should be noted that while we use the Euclidean distance here, our approach is also applicable to any other distance metric, provided Equation (13) is adjusted accordingly.

After training, we use the encoder of our QVAE to index the data. As discussed in Section 2.3, we map to the most likely state during encoding instead of using the reparametrization, Equation (5). Due to the finite number of states in our latent space, the encodings of similar objects will map to the same bit string. We then construct an inverted index, a hash map that maps from a bit string to all data points which have been mapped to that bit string. Since every item is processed only once and independently, this operation is in O(N), and the original data can be removed from main memory after processing, reducing memory usage.
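A possible sketch of the index construction in Python, reusing the hypothetical model.hash helper from the sketch in Section 3.1; it is illustrative only, not our implementation:

import numpy as np
from collections import defaultdict

def build_index(model, batches):
    """Build the inverted index mapping a binary code to the ids of all data
    points encoded to it. Every point is touched exactly once (O(N));
    afterwards only the codes and ids need to stay in memory."""
    index = defaultdict(list)
    offset = 0
    for batch in batches:                            # iterate the dataset in chunks
        codes = model.hash(batch).cpu().numpy()      # (B, d_z) array of 0/1 codes
        for i, code in enumerate(codes):
            index[code.tobytes()].append(offset + i)
        offset += len(batch)
    return index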

Given a query item and a number of requested neighbors k, we first encode the query item with the encoder, which is a constant-time operation, and sort all occupied bit strings by their Hamming distance to the query embedding. Computing the Hamming distances and searching the index are both in O(H). While naively sorting requires O(H log H) operations, we can reduce this to O(H) by taking advantage of the discrete values of the Hamming distance and RadixSort. We also explored iterative branch-and-bound algorithms but found them to be slower than RadixSort when thousands of bit strings have to be iterated in a sparse index with H ≪ 2^{d_z}. Next, the items are iterated in order of their Hamming distance and compared to the query item in terms of Euclidean distance in the original space. We stop the iteration after a fixed number of comparisons T, with k ≤ T ≪ N, to keep the search time fixed and independent of the query item; other stopping criteria are also possible, e.g., a maximum Hamming distance or a maximum number of visited hashes. After stopping, the k best results are returned, which can be done in O(T) by partitioning. The complete runtime of a single search pass is therefore in O(H + T).
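Continuing the sketch above, a possible query routine (again illustrative only, reusing the hypothetical build_index/model.hash helpers and assuming the raw data is available as a NumPy array for the refinement step):

import numpy as np
import torch

def query(index, model, data, q, k, max_comparisons):
    """Approximate k-NN query: rank occupied codes by Hamming distance with a
    counting sort, then refine candidates by exact Euclidean distance in the
    original space until a fixed comparison budget is exhausted."""
    q_t = torch.as_tensor(q, dtype=torch.float32)[None]
    q_code = model.hash(q_t).cpu().numpy()[0]                  # (d_z,) 0/1 code

    # counting sort of the H occupied codes by Hamming distance (O(H))
    buckets = [[] for _ in range(len(q_code) + 1)]
    for key in index:
        code = np.frombuffer(key, dtype=np.uint8)
        buckets[int(np.count_nonzero(code != q_code))].append(key)

    # collect candidate ids in order of increasing Hamming distance
    candidates = []
    for bucket in buckets:
        for key in bucket:
            candidates.extend(index[key])
        if len(candidates) >= max_comparisons:
            break
    candidates = candidates[:max_comparisons]

    # refine by exact distance; argpartition keeps the selection linear in T
    dists = np.linalg.norm(data[candidates] - q, axis=1)
    top = np.argpartition(dists, min(k, len(dists)) - 1)[:k]
    return [candidates[i] for i in top[np.argsort(dists[top])]]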

The memory usage scales linearly with the number of objects and is constant with respect to the original dimensionality of the data. In fact, we only need to store every bit string once and a single 8-byte integer per object, so the total space requirement is O(H d_z + N), which simplifies to O(N) for a fixed d_z.

Although a smaller H decreases the search time and memory footprint, increasing the number of distinct bit strings yields a higher-resolution space and thus better search results. To control the number of occupied bit strings, one could adjust the latent dimension d_z, but since the maximum number of distinct bit strings scales exponentially with d_z, this only allows for coarse control. We propose to use the tunable quantum effects of a QBM, the transverse field defined in Equation (12), to gain continuous control over the latent distribution. As discussed in Section 2.4, the transverse field affects the expressiveness of our prior and allows us to narrow the latent distribution while preserving the network's ability to learn how many states should be occupied.

4. Experiments

We conducted a series of experiments on the MODIS dataset to evaluate our approach. The exact dataset definitions can be found in Section 4.1. As a first experiment, in Section 4.2, we trained a model with an RBM prior and performed ANNS to verify that the Hamming distance in the compressed space functions as a proxy for the Euclidean distance in the original space. We then classically simulated sampling from the quantum distribution to evaluate the effect of the transverse field on the latent space and search speed in Section 4.3. Next, we compared models trained via quantum simulation, classical sampling and hardware sampling in terms of speedup by recall in Section 4.4. For this experiment, we did not include comparisons to other methods, as the goal of this work and the associated implementation has not been to push the limits in terms of speed or accuracy; we discuss this further in Section 5. Lastly, we investigated the memory consumption and show how our method scales to half a billion data points in Section 4.5.

We compare all of our results to a baseline linear search and report time improvements in terms of the speedup S = t_lin / t_QVAE, where t_QVAE and t_lin are the average elapsed wall times a query takes to complete for our QVAE approach and for linear search, respectively. The quality of the search is assessed by the recall R = |A_k ∩ B_k| / k, where A_k and B_k are the first k results from our QVAE search and from linear search, respectively.
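For concreteness, a tiny sketch of how these two metrics can be computed per query; the two search callables are placeholders:

import time

def evaluate_query(approx_search, linear_search, q, k):
    """Measure the speedup S = t_lin / t_qvae and the recall |A ∩ B| / k for a
    single query, where A are the approximate and B the exact k results."""
    t0 = time.perf_counter()
    approx = approx_search(q, k)
    t_qvae = time.perf_counter() - t0

    t0 = time.perf_counter()
    exact = linear_search(q, k)
    t_lin = time.perf_counter() - t0

    recall = len(set(approx) & set(exact)) / k
    return t_lin / t_qvae, recall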

We executed all experiments on a computer with 36 CPU cores and 384 GB of RAM. When reporting timings or speedups, we loaded the complete dataset into main memory to remove hard-drive access from the wall-time measurements. We used a D-Wave 2000Q quantum annealer and simulated the sampling with Quantum Monte Carlo (gull2011continuous). The dimensions of the two hidden layers in the encoder and decoder were fixed to 128 and 64. We fixed the latent size to 64, chosen from {32, 64, 128} as the value with the highest speedup for the RBM model at a recall of 0.8.

4.1. Dataset

The proposed approach has been evaluated for the problem of detecting similarities of land-cover vegetation over multiple years using the MODIS dataset from the Terra satellite (justice1998moderate). Terra is a polar-orbiting satellite that covers the entire globe once per day at approximately 10:30 am, capturing 36 spectral bands at a 250 m to 2 km spatial resolution. In this work, we used the MODIS Terra 16-day 500 m vegetation indices (MOD13A1.006) from January 2016 to December 2018 (solano2010modis), i.e., 69 time steps. Each MODIS tile covers a region of 1200 by 1200 km with 2400 by 2400 pixels. Six variables were selected: Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI), red reflectance (0.645 μm), near-infrared reflectance (0.858 μm), blue reflectance (0.469 μm), and middle-infrared reflectance (2.13 μm). We ignored all pixels with missing data. If not otherwise stated, we used the tiles from the rectangular region covering North America, h08v04 to h12v06. After filtering, 45,386,870 data points remained, each with 414 dimensions.

4.2. Embedded Proximity

Figure 4. Recall plotted against the percentage of the dataset that has been searched, for different values of k. Larger recall values are preferred. Each box extends from the lower to the upper quartile of the data, the black line within marks the median, the whiskers show the range of the data and outliers are marked as circles.

The quality with which the Hamming distance in the compressed space approximates the Euclidean distance in the original space is crucial to the functionality of our search. To validate this assumption, we trained a model with an RBM prior and performed k-ANNS, as described in Section 3.2, with different percentages of the dataset as stopping criterion. The search results were then compared to the k best results retrieved by linear search. We chose different values of k for comparison, took 200 query samples and report statistics on the recall. This experiment should show that, for a given fraction p of the dataset, retrieving items in order of their Hamming distance results in significantly larger recall values than randomly selecting items, which has an expected recall of p.

Figure 4 presents the results, with the searched percentage of the dataset in logarithmic steps on the x-axis and the recall on the y-axis. It can be seen that, by iterating in order of the Hamming distance, we only need to compare against a small fraction of the dataset to retrieve most of the nearest neighbors. While the error bars range from recall 0 to 1 for small percentages of the dataset, increasing the search radius increases the recall towards 1 and reduces the variance significantly. We also found the search parameter k to have little impact on the recall or search time, so it was set to 100 for the remaining experiments.

4.3. Transverse Field

Figure 5. Impact of the transverse field on the latent space and search speed. The red line shows the number of unique bit strings used to encode the dataset, while the blue line shows the speedup at a fixed recall of 0.8 for k = 100. The error bars indicate the standard deviation of the speedup.

As discussed in Section 2.3, increasing the transverse field increases quantum fluctuations and narrows the learned distribution. In the context of ANNS, this yields a hash function that maps to fewer distinct bit strings H. In Section 3.2 we highlighted the trade-off between a query's speed and quality w.r.t. H. In this experiment, we measured the effect of the transverse field in two ways: 1) the number of bit strings H and 2) the speedup at a fixed recall of 0.8. We used the simulated QBM for this experiment for better reproducibility, as there are many sources of noise in near-term quantum devices, which we discuss further in Section 5. Additionally, at the time of writing, simulated sampling still outperformed actual quantum samplers for small systems.

Both results can be found in Figure 5, where the red line corresponds to the number of occupied bit strings and the blue line to the resulting speedup when performing ANNS. It can be seen that increasing the transverse field indeed narrows the learned distribution, while the speedup peaks for transverse fields in the range of 0.4 to 0.8, supporting our assumption about the importance of the prior. Instead of fixing a single value in this range, we present results for both cases (0.4 and 0.8) in the next section to illustrate the behavior of the transverse field.

4.4. Speedup

Figure 6. Speedup of models trained with an RBM, a simulated QBM and an annealed QBM, plotted by recall for k = 100. Closer to the top right corner is better.

This experiment highlights the trade-off between query speed and quality and compares the different priors. We chose 300 random data points from the dataset, computed the exact 100 nearest neighbors using linear search and compared them to the results from our ANNS, measuring the wall time for both methods.

For the ANNS, we took several fixed percentages of the dataset as stopping criterion. We did not include larger or smaller values, as smaller values did not provide sufficient quality and larger values experience diminishing returns. While we originally planned to compare all models with a 64-bit latent space, we found differences between the simulated and the hardware QBM: the learned distribution widened when using the hardware model instead of narrowing as in the simulation. Thus, we chose a model with a 32-bit hardware QBM such that its number of bit strings is approximately equal to that of the simulated QBM model with a transverse field of 0.4. In addition to the hardware model, we present 64-bit results for an RBM as well as for QBMs with the transverse fields from Section 4.3, 0.4 and 0.8.

Figure 6 shows the trade-off between recall and speedup. For a fixed recall of 0.9, the annealed model performed worst with an average speedup of 263, while the classical model provided a 326-fold speedup and the simulated quantum models returned results 373 and 283 times faster than linear search for transverse fields 0.4 and 0.8, respectively. Relaxing the result quality to an average recall of 0.5 yields significant speedups for all models: the annealed quantum model achieved an average speedup of 1485, the RBM model returned results 1587 times faster, and the simulated QBM models 1820 and 2714 times faster. While the best simulated quantum model yields the highest speedup at any given recall, the quantum annealed model performed strictly worse than both simulated models but achieved higher speedups than the classical model for recalls between 0.55 and 0.8. We discuss the disparity between simulation and hardware further in Section 5.

4.5. Memory Consumption

Number of data points | QVAE | HNSW (M=6) | HNSW (M=48) | LSH (L=10) | LSH (L=100)
4,465,537 (1 tile) | 0.06 GB | 7.5 GB (0.6 GB) | 8.9 GB (2 GB) | 0.34 GB | 3.4 GB
45,386,870 (USA) | 0.59 GB | 73 GB (3 GB) | 88 GB (18 GB) | 3.4 GB | 34 GB
113,789,007 (upper left quarter) | 1.4 GB | 183 GB (8 GB) | 221 GB (46 GB) | 8.5 GB | 85 GB
180,491,212 (left half) | 2.6 GB | 291 GB (12 GB) | 350 GB (72 GB) | 14 GB | 135 GB
534,748,234 (total) | 6.5 GB | *860 GB (35 GB) | *1037 GB (212 GB) | 40 GB | *398 GB
Table 1. Memory usage of the search indices constructed by our method, HNSW and LSH for differently sized datasets. Numbers marked with * are theoretical estimates; we were not able to construct these indices due to memory limitations. Since HNSW requires the complete dataset to be in memory, the space taken up by the index alone is given in brackets.

In this section, we experimentally investigate the memory scaling discussed in Section 3.2. We compared our approach to the graph-based state-of-the-art Hierarchical Navigable Small World graphs (HNSW) (malkov_efficient_2018) [1] and to LSH. HNSW's memory usage is in O(N (d + M)), LSH's memory usage is in O(N L) and our approach is in O(N), where N is the number of elements, M is the maximum node degree in HNSW and L is the number of hash tables in LSH. For fixed M and L, all methods are in O(N), though our method has a significantly smaller pre-factor. As discussed in Section 3.2, our method requires only 8 bytes per object, equivalent to LSH with a single hash table, while HNSW requires tens to hundreds of bytes per object (malkov_efficient_2018) and LSH requires 8 L bytes per object. Further, HNSW needs to hold the complete dataset in memory to construct the index, while our method and LSH can process every data point independently to reduce memory usage during construction. Here we show our simulated QBM model with a transverse field of 0.4, HNSW with the minimum and maximum recommended values for M, 6 and 48 respectively, and LSH with commonly used lower and upper bounds on the number of hash tables, L = 10 and L = 100. Though higher-resolution hashes in LSH require more space, we present optimistic estimates by choosing a low number of coarse hash functions per hash table, 10. We evaluated different subsets of the MODIS dataset to explore the limits of each method.

[1] We used the author's implementation on GitHub: https://github.com/nmslib/hnswlib

Table 1 shows the index sizes of each method for different dataset sizes. Due to the requirement of HNSW to access the complete dataset while constructing the index, we were not able to build its index for the complete MODIS dataset, as it exceeded the available memory. Similarly, we were not able to build the LSH index with L = 100 for the complete MODIS dataset, as the index itself exceeded our available memory. Our method scales favorably compared to HNSW and LSH: it allows storing the search index for the complete MODIS dataset in memory and does not require the dataset to be loaded at build time.

5. Discussion

Although the results from Section 4.2 and Section 4.3 back up our assumptions about the behavior of the latent space and the transverse field, Section 4.4 shows different behavior when translating these insights to quantum hardware. We found that the hardware model widens the learned distribution instead of narrowing it. Multiple factors may cause this behavior. Sources of noise include decoherence of the qubits as a result of coupling with the environment, as well as thermal fluctuations. Moreover, there are errors from the mapping between logical variables and hardware qubits; in fact, an efficient mapping between the embedded and the logical space may not exist (marshall_perils_2019). The latter error could be minimized by using BMs that are native to the hardware, e.g., graphs with a maximum degree of 6. Nonetheless, steadily improving hardware and algorithms will reduce noise and embedding errors over time.

Despite these shortcomings, the hardware results, and especially the simulated results, show performance comparable to the RBM, particularly at lower recall values. Though this is a simple task, the data shown here are an encouraging sign that quantum hardware can be integrated into classical deep learning methods.

As mentioned in Section 4, we excluded comparisons to other methods in Section 4.4, as this work focuses on exploring similarity search as an early application for quantum computing by constructing a proof of concept. There are a variety of unexplored options to improve upon our results, e.g., using more expressive encoders, including domain-specific features, adding similarity-preserving losses (do_learning_2016), integrating faster Hamming-space search and switching from Python to C++ like most ANNS methods (li_approximate_2016). However, here we provide some broader context via a comparison to HNSW [1]. At an average recall of 0.95, our best model achieved an average speedup of 233, while HNSW returned results around 200,603 times faster than linear search. We used the largest recommended value for M, 48, and determined the ef parameter by a small hyperparameter search, as recommended (malkov_efficient_2018), fixing it to 400 during construction and 100 during search.

The last point we want to address is the memory issue of HNSW and LSH from Section 4.5. Workarounds exist, e.g., offloading the index to non-volatile memory, but they would significantly slow down both methods. In contrast, our approach is capable of keeping an index for half a billion data points in main memory.

6. Conclusion

To summarize, we explored a QVAE-based approach to similarity search in large-scale high-dimensional datasets as an early application target for quantum computing. We discussed several potential advantages of our approach: the data is inherently binarized, the memory footprint scales favorably, quantum fluctuations in the form of the transverse field can be used to gain continuous control over the shape of the latent space, and the training of the QVAE can be assisted by quantum annealers. We experimentally demonstrated the quality with which the Hamming distance in the latent space approximates the Euclidean distance in the original space, and we verified real-world speedups on the MODIS dataset with low space requirements.

These insights may gain importance as quantum devices grow in complexity and become intractable to simulate (arute2019quantum). Whether this means that quantum devices will exceed classical ones in applications remains an open question, one we have begun to address here.

Acknowledgments

We would like to thank Scott D. Pakin from Los Alamos National Laboratory (LANL) for his help with running the experiments and are grateful for support from NASA's Advanced Information Systems Technology Program (grant #AIST16-0137) and NASA Ames Research Center. We appreciate support from the AFRL Information Directorate under grant F4HBKC4162G001 and the Office of the Director of National Intelligence (ODNI) and the Intelligence Advanced Research Projects Activity (IARPA), via IAA 145483. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, AFRL, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

References