Motivation. Text analytics has become an important part of business intelligence as enterprises increasingly seek to extract insights for decision making from text data sets. At the same time, data is being generated at an unprecedented rate, so text data sets can grow very large. For example, the Google Books Ngram data set contains 2.2 TB of data, and the Common Crawl corpus contains petabytes of data. Processing such large text data sets can be computationally expensive, especially if it involves sophisticated algorithms.
The above challenge is exacerbated when it is desirable to run different types of queries against a data set, which makes it expensive to build multiple indices to speed up query processing. For example, given a data set comprising user reviews on products, an enterprise may want to count positive vs. negative reviews, use the reviews to make recommendations, or retrieve reviews relating to a particular product.
In this paper, we propose a framework called EmApprox to speed up a wide range of analytical queries over large text data sets. The key idea behind EmApprox is to build a general index that can be used to guide the processing of a query toward a subset of the data that is most similar to the query. An example is the execution of a query that seeks to count the number of occurrences of a given phrase. EmApprox would select a sample of the data set, where it preferentially chooses items that are most similar to the phrase in the query, counts the occurrences in the sample and uses that count to estimate the number of occurrences in the entire data set. Clearly, the result is approximate. Thus, users of EmApprox would need to tolerate some imprecision in the estimated results, although EmApprox allows users to trade off precision and performance by adjusting the sampling rate.
Our approach is related to the many approximate query processing (AQP) systems that have been designed to answer aggregation queries over relational data sets using an estimator with error bounds by sampling a subset of the data [3, 43, 15, 33, 2, 9, 25]. In essence, one can think of EmApprox as extending AQP to text analytics. EmApprox supports the estimation of error bounds when possible, e.g., for aggregation queries. However, EmApprox can also be used in scenarios where it is not possible to estimate error bounds, such as information retrieval, making it applicable to many different text analytics queries and applications.
System overview. Figure 1 gives an overview of EmApprox. As mentioned above, EmApprox executes a query on a sample of the data set to reduce query processing time. Straightforward use of random sampling can lead to large errors, however, when sampling from a skewed distribution. To mitigate this issue, EmApprox builds an index offline, then consults this index at query processing time to guide sampling toward subsets of data that are most similar to the query.
Specifically, EmApprox uses a natural language processing (NLP) model to learn vector representations for unique words and documents, where the resulting vectors can be composed and used to compute a similarity metric. Then, assuming that the data set is partitioned into a number of subcollections as shown in Figure 1, e.g., because of storage in a file system such as HDFS, EmApprox computes a vector for each subcollection from the vectors of the documents contained in it. The final index contains vectors for unique words together with vectors for the subcollections.
At query processing time, EmApprox first computes a vector for the query using the vectors of the words in the query. It then uses probability proportional to size (pps) sampling, where size refers to the number of relevant data items in a cluster, to perform unequal probability cluster sampling over the subcollections. Sampling probabilities are proportional to the subcollections’ similarity to the query, as computed from the subcollections’ and the query’s vectors.
To reduce the storage overhead of the index, we use locality-sensitive hashing (LSH) to hash each real-valued vector to a bit vector, which can preserve the distances among the original vectors. LSH bit vectors also make the computation of similarity extremely cheap: it reduces to the Hamming distance between two bit vectors, which can be computed efficiently using XOR. This optimization greatly increases the scalability and efficiency of EmApprox’s index.
Queries. We have implemented a prototype of EmApprox and use it to support approximate processing for three different types of queries: (1) aggregation queries that count occurrences within a text data set, (2) retrieval queries, both Boolean and ranked, that retrieve relevant documents, and (3) recommendation queries that predict users’ ratings for products. For aggregation queries, we show that it is possible to compute estimated error bounds along with the approximate results, and that the training objective of PV-DBOW (the specific NLP model that we use) is directly related to minimizing the variance of the estimated results when using similarity-driven sampling. For the retrieval queries, we use EmApprox in a fashion similar to distributed information retrieval (DIR), where a query is only processed against subcollections that are expected to be most relevant to the query [15, 43, 9, 25]. Finally, for the recommendation queries, we use the user-centric collaborative filtering (CF) algorithm to predict a target user’s ratings using the average of other users’ ratings weighted by similarities between their product reviews.
Evaluation. We generate a large number of queries for each query type, and execute them on three different data sets. We adopt equal probability cluster sampling over subcollections as the baseline for our evaluation. We show that EmApprox can achieve significant improvements on different domain-specific metrics (e.g., error bounds and precision@k) compared to the baseline with very little extra overhead during query processing. For example, to match the error bounds in aggregation queries achieved by EmApprox, the baseline would have to process 4x the amount of data. We also show that EmApprox can achieve significant speedups if users can tolerate modest amounts of imprecision. For example, when sampling at 10%, EmApprox speeds up a set of queries counting phrase occurrences by almost 10x while achieving estimated relative errors of less than 22% for 90% of the queries.
While EmApprox is extremely efficient for processing queries that estimate results based on a sample such as aggregation and recommendation queries, like all sampling-based approaches, it is less effective for speeding up queries similar to information retrieval queries. This is because these queries are seeking specific data items in the data set, and it is impossible to estimate missed data items based on the sample.
Contributions. In summary, our contributions include: (i) to our knowledge, our work is the first to leverage an NLP model to build a general-purpose index to guide the approximate execution of text analytical queries; (ii) we show that the training objective of PV-DBOW is directly correlated with minimizing the variance of counting queries; (iii) we propose similarity-driven sampling, which can significantly increase accuracy compared to random sampling for three distinct types of approximate queries in three different application domains; (iv) we show that hashing real-valued vectors into lightweight LSH bit vectors significantly improves storage and computation efficiency without compromising the quality of approximate query results.
II Background and Related Work
Our work draws inspiration from recent work that proposes a novel database index based on machine learning (ML) models learned from the data, which probabilistically map a queried key to the position of a desired record. A queried key and the matching record are analogous to an approximate query and its relevant data in our work. However, our index targets approximate text analytical queries, which are very different workloads from typical database queries.
II-A Approximate Query Processing
Traditional AQP systems have mainly targeted aggregation queries over relational data sets. AQP++ is a recent database system aimed at interactive response times for aggregation queries over relational data sets, leveraging both sampling-based AQP and precomputed aggregates. BlinkDB is an AQP system built on top of Spark that proposes methodologies for selecting offline-generated samples to answer queries. It uses query column sets (QCSs) to represent the sets of columns appearing in past workloads, and creates stratified samples for each QCS. It assumes that QCSs are stable over time, so it does not perform well for queries outside the QCS coverage of the offline samples. ApproxHadoop is an online sampling framework that supports approximating aggregation tasks with error bounds under the MapReduce programming model in Hadoop. It applies cluster sampling over the input data with a uniform sampling probability. As a consequence, its result estimation is prone to large error bounds over skewed data. Sapprox has an offline and an online component. It collects the occurrences of sub-data sets offline and uses this information to facilitate online cluster sampling. Sapprox targets certain counting queries over relational data sets, whereas EmApprox targets text analytics workloads, and its offline indexing scheme is more general-purpose than an occurrence-based one.
II-B Cluster Sampling
Simple random sampling, stratified sampling and cluster sampling are three common methods of sampling, where the most computationally efficient sampling method is cluster sampling, since it avoids a full scan of the data set. As large data sets are usually partitioned, a cluster often corresponds to a partition [15, 43].
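Suppose we want to estimate the frequency $\tau$ of a phrase occurring in a large data set partitioned into disjoint subcollections of documents. If we take a cluster sample from the data set using subcollections as the sampling units, we can derive the estimator $\hat{\tau}$ using cluster sampling theory:
$$\hat{\tau} = \frac{1}{n}\sum_{i \in s}\frac{\tau_i}{\pi_i} \pm \epsilon \tag{1}$$
where $s$ is the chosen sample, $i$ is a subcollection in the sample, $\tau_i$ is the frequency of the phrase in $i$, $\pi_i$ is the sampling probability for $i$, and $\epsilon$ is the estimated error bound. $\epsilon$ in turn is computed as:
$$\epsilon = t_{n-1,\,1-\alpha/2}\sqrt{\hat{V}(\hat{\tau})}, \qquad \hat{V}(\hat{\tau}) = \frac{1}{n(n-1)}\sum_{i \in s}\left(\frac{\tau_i}{\pi_i} - \hat{\tau}\right)^2 \tag{2}$$
where $n$ is the sample size and $t_{n-1,\,1-\alpha/2}$ is the critical value of a $t$-distribution at confidence level $1-\alpha$ with $n-1$ degrees of freedom. We observe that as the $\pi_i$’s approach $\tau_i/\tau$, $\hat{V}(\hat{\tau})$ and hence $\epsilon$ will approach 0. The goal of probability proportional to size (pps) sampling is to set each $\pi_i$ close to $\tau_i/\tau$ by leveraging auxiliary information about each sampling unit, so that we can reduce the error bound of our estimator $\hat{\tau}$. If the auxiliary information is a good predictor of $\tau_i$, then $\pi_i$ should be close to $\tau_i/\tau$, which gives the estimator a small variance.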
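The estimator and error bound referred to as Eq (1) and Eq (2) can be illustrated with a small sketch. It assumes the standard with-replacement pps (Hansen-Hurwitz) form, in which each sampled local count is scaled by the inverse of its sampling probability. The function name `pps_estimate`, the toy counts, and the hard-coded t critical value are illustrative assumptions, not part of EmApprox.

```python
import math

def pps_estimate(local_totals, probs, t_crit):
    """Estimate a population total from a pps cluster sample.

    local_totals[j]: count observed in the j-th sampled cluster
    probs[j]:        probability with which that cluster was drawn
    t_crit:          t critical value for n-1 degrees of freedom
    Returns (point_estimate, error_bound).
    """
    n = len(local_totals)
    if n < 2:
        raise ValueError("need at least two sampled clusters")
    # Scale each local count by the inverse of its sampling probability.
    scaled = [tau / pi for tau, pi in zip(local_totals, probs)]
    estimate = sum(scaled) / n
    variance = sum((x - estimate) ** 2 for x in scaled) / (n * (n - 1))
    return estimate, t_crit * math.sqrt(variance)

# Hypothetical data set: four subcollections with these phrase counts.
counts = [10, 20, 30, 40]
total = sum(counts)
drawn = [1, 3, 3]   # indices of the sampled subcollections (with replacement)
T_CRIT = 4.303      # t critical value, 95% confidence, 2 degrees of freedom

# Ideal pps: probabilities exactly proportional to the local counts.
pps_probs = [c / total for c in counts]
est_pps, err_pps = pps_estimate(
    [counts[i] for i in drawn], [pps_probs[i] for i in drawn], T_CRIT)

# Baseline: equal probability for every subcollection.
uniform_probs = [1 / len(counts)] * len(counts)
est_uni, err_uni = pps_estimate(
    [counts[i] for i in drawn], [uniform_probs[i] for i in drawn], T_CRIT)
```

With ideal probabilities every scaled value equals the true total, so the error bound collapses to zero, whereas the uniform baseline on the same draw keeps a wide bound. In practice the probabilities only approximate the ideal ones, which is why the quality of the auxiliary information matters.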
II-C Paragraph Vectors
Word2vec uses an unsupervised neural network to learn vector representations (embeddings) for words. It seeks to produce vectors that are close in a vector space for words having similar contexts, where context refers to the words surrounding a word in a pre-determined window. For example, synonyms like “smart” and “intelligent,” or closely related words such as “copper” and “iron,” are likely to be surrounded by similar words, so that Word2vec will produce spatially close vector representations for them. Similarity between two words can thus be scored based on the dot product between their corresponding vectors. The learned vectors also exhibit additive compositionality, enabling semantic reasoning through simple arithmetic such as element-wise addition over different vectors. For example, a well-known result from the Word2vec literature is that vec(“Germany”) + vec(“capital”) is close to vec(“Berlin”).
Paragraph Vector (PV) is similar to Word2vec, but jointly learns vector representations for words and for variable-length text ranging from sentences to entire documents. Distributed Bag of Words PV (PV-DBOW) is a version of PV that has been shown to be effective in information retrieval due to its direct relationship to word distributions in text data sets. By setting PV-DBOW’s window size to each document’s length, the generative probability of word $w$ in a document $d$ is modeled by a softmax function:
$$P(w \mid d) = \frac{\exp(\vec{w}\cdot\vec{d})}{\sum_{w' \in V}\exp(\vec{w'}\cdot\vec{d})} \tag{3}$$
where $\vec{w}$ and $\vec{d}$ are vector representations for $w$ and $d$, and $V$ is the vocabulary (i.e., the set of unique words in the data set). PV-DBOW learns the word and document vectors using standard maximum likelihood estimation (MLE), by maximizing the likelihood of observing the training text data set under the distribution defined by Eq (3). As a result, the training process outputs word and document vectors that satisfy Eq (3), which forms the theoretical foundation of our approximation index.
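The softmax in Eq (3) can be made concrete with a minimal sketch: the probability of a word given a document is the exponentiated dot product of their vectors, normalized over the vocabulary. The two-dimensional vectors and the function name `word_given_doc` are made up for illustration and do not come from a trained PV-DBOW model.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def word_given_doc(word_vecs, doc_vec):
    """Softmax over the vocabulary of the dot products with the document vector."""
    scores = {w: dot(vec, doc_vec) for w, vec in word_vecs.items()}
    shift = max(scores.values())  # subtract the max for numerical stability
    weights = {w: math.exp(s - shift) for w, s in scores.items()}
    norm = sum(weights.values())
    return {w: x / norm for w, x in weights.items()}

# Toy vocabulary and document vector (hypothetical values).
toy_word_vecs = {
    "spark": [0.9, 0.1],
    "hadoop": [0.7, 0.3],
    "poetry": [-0.8, 0.5],
}
toy_doc_vec = [1.0, 0.2]
probs = word_given_doc(toy_word_vecs, toy_doc_vec)
```

Words whose vectors point in the same direction as the document vector receive most of the probability mass, which is the property the index relies on.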
To reduce the expense of computing Eq (3) during training, a technique called negative sampling has been proposed that randomly samples a subset of words in that document according to a noise distribution to approximate Eq (3). The training process is equivalent to implicitly factorizing a shifted matrix of pointwise mutual information (PMI) between words and documents:
$$\vec{w}\cdot\vec{d} = \mathrm{PMI}(w, d) - \log k \tag{4}$$
where $\mathrm{PMI}(w, d)$ is the pointwise mutual information between word $w$ and document $d$, and $k$ is a constant representing the number of negative samples for each positive instance in the training process. PMI can be estimated empirically by observing the frequencies of words in documents in the data set as $\mathrm{PMI}(w, d) = \log\frac{tf_{w,d}/|d|}{F_w/|D|}$, where $tf_{w,d}$ is the frequency of $w$ in document $d$, $|d|$ is the length of (number of words in) $d$, $D$ is the data set, $F_w$ is the total number of occurrences of $w$ in $D$, and $|D|$ is the total number of words in $D$.
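Given PMI’s definition, Eq (4) reveals that the exponential of the dot product between a document vector and a word vector is proportional to the probability of document $d$ predicting word $w$, i.e., the probability that a word chosen randomly from $d$ is $w$:
$$\exp(\vec{w}\cdot\vec{d}) \propto \frac{tf_{w,d}}{|d|} = P(w \mid d) \tag{5}$$
We can see from Eq (5) that, for a given word $w$, the exponential of the inner product between the word vector $\vec{w}$ and each document vector $\vec{d}$ grows with the share of $d$ made up of occurrences of $w$; for documents of comparable length, it is therefore proportional to the percentage of the total occurrences of $w$ contributed by each document $d$.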
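The step from the shifted PMI in Eq (4) to the proportionality in Eq (5) can be checked numerically. The sketch below assumes the empirical PMI estimate built from word frequencies, document lengths and corpus totals as described above; the three toy documents and the value of `K` are arbitrary illustrations.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is a list of words.
toy_docs = {
    "d1": "spark spark hadoop".split(),
    "d2": "spark hive hive hive".split(),
    "d3": "hadoop hive spark spark spark".split(),
}
K = 5  # number of negative samples per positive instance (assumed value)

corpus_len = sum(len(words) for words in toy_docs.values())
word_totals = Counter(w for words in toy_docs.values() for w in words)

def shifted_pmi(word, doc_id):
    """Empirical PMI(word, doc) minus log K."""
    words = toy_docs[doc_id]
    p_w_given_d = words.count(word) / len(words)
    p_w = word_totals[word] / corpus_len
    return math.log(p_w_given_d / p_w) - math.log(K)

# For a fixed word, exp(shifted PMI) divided by P(w|d) is the same
# constant for every document, which is the proportionality claim.
ratios = {}
for doc_id, words in toy_docs.items():
    p_w_given_d = words.count("spark") / len(words)
    ratios[doc_id] = math.exp(shifted_pmi("spark", doc_id)) / p_w_given_d
```

Here the constant works out to the corpus length divided by `K` times the total count of the word, so it depends on the word but not on the document.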
II-D Locality-Sensitive Hashing
Hashing methods have been studied extensively for searching for similar data samples in a high-dimensional data set to solve the approximate nearest neighbor problem. LSH is among the most popular choices for indexing a data set using hashing [20, 14]. The basic idea behind LSH is to transform each item in a high-dimensional space into a bit vector with $m$ bits, using $m$ binary-valued hash functions $h_1$, …, $h_m$. In order for the bit vector to preserve the original vectors’ similarity, each hash function $h$ must satisfy the property:
$$\Pr[h(\vec{x}) = h(\vec{y})] = sim(\vec{x}, \vec{y})$$
where $\vec{x}$ and $\vec{y}$ are two vectors in the data set and $sim$ is a similarity measure, such as Jaccard, Euclidean or cosine. $\Pr[h(\vec{x}) = h(\vec{y})]$ is computed as one minus the ratio of the Hamming distance between two bit vectors over the total number of bits in them. Similarity between two items is preserved in the mapping; that is, if two items’ LSH bit vectors are close in Hamming distance, then the probability that they are close to each other in the original metric space is also high. This property allows items’ LSH bit vectors to efficiently index a data set for similarity search.
III Similarity-driven Sampling
In this section, we discuss cluster sampling with probabilities driven by the similarities of a query to subcollections, which can be computed online using the offline-trained PV-DBOW vectors. The sampling method is inspired by approximating aggregation queries (Section II-B) that seek to compute a sum or mean over the data set, such as counting the occurrences of a phrase or the number of documents on a given topic. Finally, we describe our approximation index and the use of LSH to compress the real-valued vectors.
Query vector representation. We assume a query $Q$ is represented as a piece of text comprising words $w_1, \dots, w_l$. Under the bag-of-words assumption, the probability of $Q$ in a document $d$ is the joint probability of its words:
$$P(Q \mid d) = \prod_{i=1}^{l} P(w_i \mid d) \tag{6}$$
If we define $\vec{Q}$ as the element-wise arithmetic sum of its individual words’ vectors, $\vec{Q} = \sum_{i=1}^{l}\vec{w_i}$, then by combining Eq (5) and Eq (6) we can derive that $P(Q \mid d)$ is proportional to the exponential of $\vec{Q}\cdot\vec{d}$, where Eq (5) is derived from PV-DBOW’s training objective:
$$\exp(\vec{Q}\cdot\vec{d}) = \prod_{i=1}^{l}\exp(\vec{w_i}\cdot\vec{d}) \propto \prod_{i=1}^{l}P(w_i \mid d) = P(Q \mid d) \tag{7}$$
which has exactly the same form as Eq (5), where a query comprises only a single word. Therefore, by computing $\vec{Q}$ this way, we can conveniently derive the probability of document $d$ predicting this query, under the assumption that the words are independent.
III-A Sampling Probability Estimation
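We probabilistically define a document’s similarity to a query as the probability of $d$ predicting $Q$. $\exp(\vec{Q}\cdot\vec{d})$ in Eq (7) can be used to compute $P(Q \mid d)$ at sampling time using $\vec{Q}$ and $\vec{d}$, both derived from the offline-trained PV-DBOW model. Suppose we use documents as clusters for pps sampling to estimate the quantity of $Q$ throughout the data set; then we can use $P(Q \mid d)$ as each document’s auxiliary information for setting sampling probabilities proportional to its similarity to $Q$:
$$\pi_d = \frac{P(Q \mid d)}{\sum_{d' \in D} P(Q \mid d')} = \frac{\exp(\vec{Q}\cdot\vec{d})}{\sum_{d' \in D}\exp(\vec{Q}\cdot\vec{d'})} \tag{8}$$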
A large data set is often partitioned into many subcollections, and cluster sampling with those as clusters is a more efficient sampling design [15, 43]. Similar to sampling documents with similarity-driven probabilities, we propose to use a subcollection’s vector representation to compute its similarity to the query and set sampling probabilities proportionally. We define a subcollection’s vector representation as the element-wise arithmetic mean of the vectors of the documents in it: $\vec{c} = \frac{1}{|c|}\sum_{d \in c}\vec{d}$, where $|c|$ is the number of documents in $c$, and $d$ is a document in $c$.
We now analyze why the arithmetic mean of document vectors is a reasonable vector representation for a subcollection. Given this definition of $\vec{c}$ and Eq (7), we can derive the exponential of the dot product between a subcollection and a query as the geometric mean of each $P(Q \mid d)$ in $c$:
$$\exp(\vec{Q}\cdot\vec{c}) = \Big(\prod_{d \in c}\exp(\vec{Q}\cdot\vec{d})\Big)^{1/|c|} \tag{9}$$
$$\exp(\vec{Q}\cdot\vec{c}) \propto \Big(\prod_{d \in c}P(Q \mid d)\Big)^{1/|c|} = P(Q \mid c) \tag{10}$$
where we use $P(Q \mid c)$ to express the similarity of a subcollection to the query. Similar to computing the similarity of $Q$ to a document, we can use Eq (10)’s left hand side to compute the similarity of $Q$ to a subcollection. Following the same idea as using documents as sampling units, we define a probability distribution $\pi_c$ for each subcollection with respect to $Q$ in the same form as Eq (8):
$$\pi_c = \frac{\exp(\vec{Q}\cdot\vec{c})}{\sum_{c'}\exp(\vec{Q}\cdot\vec{c'})} \tag{11}$$
where $\exp(\vec{Q}\cdot\vec{c})$ can be computed at sampling time.
A cluster sample over the subcollections with probabilities set according to Eq (11) can greatly reduce the uncertainty in estimating the occurrence of $Q$. Variants of this query include estimating the number of documents that contain $Q$ or the number of documents semantically similar to $Q$. Both distributions defined in Eq (8) and (11) essentially normalize the probability of a phrase appearing in a specific document or subcollection against every document or subcollection in the data set. Interestingly, Eq (8) and (11) have the same form as a softmax classifier over query words, which predicts their probabilities conditioned on a document or subcollection.
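The sampling-probability computation described in this section can be sketched in a few lines: sum the word vectors to get a query vector, average document vectors to get subcollection vectors, and apply a softmax over the query-subcollection dot products. All vectors, block names and function names below are hypothetical, and the sketch uses real-valued vectors rather than LSH signatures.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def vector_sum(vectors):
    return [sum(col) for col in zip(*vectors)]

def vector_mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sampling_probabilities(query_vec, subcollection_vecs):
    """Softmax of the query/subcollection dot products."""
    scores = {c: dot(query_vec, v) for c, v in subcollection_vecs.items()}
    shift = max(scores.values())  # for numerical stability
    weights = {c: math.exp(s - shift) for c, s in scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

# Toy index (hypothetical values): word vectors and per-block document vectors.
toy_word_vecs = {"machine": [0.6, 0.2], "learning": [0.5, 0.4]}
toy_subcollections = {
    "block0": [[0.9, 0.3], [0.8, 0.1]],
    "block1": [[-0.4, 0.2], [-0.6, 0.5]],
}

query_vec = vector_sum([toy_word_vecs[w] for w in ["machine", "learning"]])
sub_vecs = {c: vector_mean(docs) for c, docs in toy_subcollections.items()}
pi = sampling_probabilities(query_vec, sub_vecs)

# The exponentiated score of a block equals the geometric mean of the
# exponentiated scores of its documents.
block0_docs = toy_subcollections["block0"]
geo_mean = math.prod(
    math.exp(dot(query_vec, d)) for d in block0_docs) ** (1 / len(block0_docs))
```

The last lines check the geometric-mean relationship between a subcollection and its documents on the toy data, which is what justifies averaging document vectors.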
III-B Approximation Index
The approximation index includes vectors for every word, document and subcollection. The index can occupy significant storage space for a large data set, so we propose to map the real-valued vectors to LSH bit vectors to reduce the required storage. In addition, computing the similarity between bit vectors is much cheaper than computing the dot product between real-valued vectors.
LSH. LSH bit vectors can preserve different distance metrics (e.g., cosine) between their real-valued vector counterparts. Computing the Hamming distance of the LSH bit vectors using XOR is also much more efficient than computing a dot product. We slightly modify the gradient descent-based training process of PV-DBOW: at each update step, we normalize the vectors to unit length so that the dot product of two trained vectors is equivalent to their cosine similarity. The value of the hash function for preserving cosine distance depends on the dot product between a random plane $\vec{r}$ and an item vector $\vec{x}$: $h(\vec{x})$ evaluates to 1 if $\vec{r}\cdot\vec{x} \geq 0$, and 0 otherwise. $\vec{r}$ usually has a standard multi-dimensional Gaussian distribution, and a new $\vec{r}$ is generated each time the hash function is applied. To generate the LSH signature for a real-valued vector, we first choose a dimension $m$ for the bit vector, then apply the hash function $m$ times to generate each bit, each time choosing a random $\vec{r}$; the same $m$ random planes are used for every vector so that the resulting signatures are comparable. Specifically, we can approximate $\cos(\vec{x}, \vec{y})$ using $\cos\left(\frac{H}{m}\pi\right)$, where $H$ is the Hamming distance between $\vec{x}$ and $\vec{y}$’s corresponding LSH bit vectors.
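A minimal sketch of random-hyperplane LSH for cosine similarity follows. It assumes the standard construction: one Gaussian random plane per bit, the same planes shared by all vectors, signatures packed into Python integers, and the Hamming distance computed with XOR. The function names, the seed and the bit count are illustrative choices, not those of the EmApprox prototype.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def make_hyperplanes(num_bits, dim, seed=0):
    """One Gaussian random plane per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vec, planes):
    """Pack one bit per plane into an integer: 1 if the vector lies on the
    non-negative side of the plane, 0 otherwise."""
    sig = 0
    for plane in planes:
        bit = 1 if dot(plane, vec) >= 0 else 0
        sig = (sig << 1) | bit
    return sig

def hamming(sig_a, sig_b):
    # XOR leaves a 1 exactly where the two signatures differ.
    return bin(sig_a ^ sig_b).count("1")

def approx_cosine(sig_a, sig_b, num_bits):
    """Estimate the cosine from the fraction of differing bits."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / num_bits)

NUM_BITS = 2048
planes = make_hyperplanes(NUM_BITS, dim=8, seed=7)

x = [1.0] + [0.0] * 7
y = [1.0, 1.0] + [0.0] * 6
sig_x = lsh_signature(x, planes)
sig_y = lsh_signature(y, planes)

estimate = approx_cosine(sig_x, sig_y, NUM_BITS)
true_cos = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
```

More bits give a tighter estimate at the cost of a larger index, which is the trade-off behind choosing the signature length.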
IV Beyond Aggregation Query
IV-A Query characterization
In the previous section, we showed that certain aggregation queries, such as estimating the occurrence of a phrase, can benefit from our approximation index. In this section, we describe how to approximate queries in other areas using our index, including DIR and recommendation. We note that the goal of IR is to identify documents “similar” to a query, and that some recommendation techniques in data mining depend on identifying users “similar” to a target user. We characterize our targeted queries as follows: 1) the query can be represented by words; 2) the relevancy of the query to a subset of data can be reduced to the generative probability of the query’s relevant data given this subset, which is our definition of similarity; 3) the efficiency of the approximate computation can be improved by processing the most similar data in the data set.
IV-B Distributed Information Retrieval
Information retrieval from disjoint subcollections of documents is known as distributed information retrieval (DIR), where many subcollections are ignored for improved retrieval efficiency [22, 7]. The selection of the subcollections usually depends on a ranking model for each subcollection that is available at retrieval time, such as language model based (e.g., topic distributions of the subcollection) or simple metadata (e.g., source of the documents) [42, 22]. EmApprox’s index can be used for subcollection selection under the vector space retrieval paradigm for DIR [12, 22]. We target Boolean and ranked retrieval models for DIR in the following discussion.
Boolean retrieval. Under the Boolean model, a query $Q$ is a Boolean expression of words connected by Boolean operators (e.g., $\wedge$, $\vee$), where a term $w$ evaluates to true only when it is contained in a document $d$. The retrieval result is a set of documents that satisfy the Boolean expression.
When approximating a Boolean query $Q$, we first compute its similarity to each subcollection, $P(Q \mid c)$, as previously defined. We can compute $P(Q \mid c)$ using the same sequence in which the Boolean query is evaluated, i.e., $\wedge$ takes precedence over $\vee$. We first compute each individual term ($w_i$)’s similarity to a subcollection using Eq (10), in order to compute the query’s overall similarity. Since $w_1 \wedge w_2$ implies that $w_1$ and $w_2$ both have to exist for it to be true, whereas satisfying $w_1 \vee w_2$ requires either $w_1$ or $w_2$ to exist, the generative probability of $w_1 \wedge w_2$ is equivalent to $P(w_1 \mid c) \cdot P(w_2 \mid c)$; by the same token, the generative probability of $w_1 \vee w_2$ is $P(w_1 \mid c) + P(w_2 \mid c)$. For example, suppose we have a Boolean query $Q = w_1 \wedge w_2 \vee w_3$; then $P(Q \mid c)$ can be computed as $P(w_1 \mid c) \cdot P(w_2 \mid c) + P(w_3 \mid c)$, where each $P(w_i \mid c)$ is computed using Eq (10). Finally, we use the overall similarity of the query to the subcollections to compute the sampling probability of each subcollection and sample a subset of subcollections. Then, only documents in the chosen subcollections are evaluated against the Boolean query to retrieve matching documents.
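The rule described above for Boolean queries (multiply term similarities for a conjunction, add them for a disjunction, then normalize across subcollections) can be sketched as a small recursive evaluator. The nested-tuple query representation, the operator names `"AND"` and `"OR"`, and the per-term similarity values are assumptions made for illustration.

```python
def boolean_similarity(expr, term_sim):
    """Similarity of a Boolean expression to one subcollection.

    expr is either a term (str) or a tuple (op, left, right) with op in
    {"AND", "OR"}; term_sim maps each term to its similarity to the
    subcollection.
    """
    if isinstance(expr, str):
        return term_sim[expr]
    op, left, right = expr
    left_score = boolean_similarity(left, term_sim)
    right_score = boolean_similarity(right, term_sim)
    if op == "AND":
        return left_score * right_score   # both terms must be present
    if op == "OR":
        return left_score + right_score   # either term suffices
    raise ValueError(f"unsupported operator: {op}")

# (w1 AND w2) OR w3, with AND binding tighter than OR.
query = ("OR", ("AND", "w1", "w2"), "w3")

# Hypothetical per-term similarities for two subcollections.
term_sims = {
    "c1": {"w1": 0.5, "w2": 0.4, "w3": 0.1},
    "c2": {"w1": 0.1, "w2": 0.1, "w3": 0.2},
}

scores = {c: boolean_similarity(query, sims) for c, sims in term_sims.items()}
total = sum(scores.values())
probs = {c: s / total for c, s in scores.items()}
```

Encoding precedence in the tree structure keeps the evaluator itself free of parsing logic; a real query parser would build this tree from the query string.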
Ranked retrieval. Instead of returning a set of documents that precisely match a Boolean expression, ranked retrieval returns a list of documents ranked by their relevancy to the query. The query is usually a phrase comprising a sequence of words. In terms of ranking, each document is assigned a score using a function with potentially many factors considered, such as the tf-idf of the query terms or the cosine similarity between the vector representations of the query and a document obtained under the same language model. Similar to Boolean retrieval, our proposed framework first samples a subset of subcollections with probabilities proportional to the query’s similarity to each subcollection, which is more straightforward to compute than under the Boolean model: we simply use Eq (11) to compute its similarity to a subcollection. We then apply a user-specified scoring function to documents from the chosen subcollections, such as BM25 or a language model-based function [40, 4].
One of the most successful recommendation techniques in data mining is neighborhood-based collaborative filtering (CF), which has been deployed in many commercial websites [26, 35, 32]. Given a target user $u$, user-centric neighborhood CF first tries to identify a subset of similar users as neighbors, then uses the similarity score between $u$ and each neighbor to predict a rating for item $i$. It computes the prediction by taking an average of all the ratings for $i$ by $u$’s neighbors, weighted by their similarity scores. Symmetrically, neighborhood CF can be item-centric, predicting a rating for an item $i$ based on similarities between $i$ and other items purchased by $u$.
Without loss of generality, we will focus on user-centric CF. Review text embeds rich information, which has been used in past work to model users’ behaviors and items’ properties as an alternative to data containing only numerical ratings [29, 38]. A user’s vector representation can be learned under the PV-DBOW model by defining a document as all the reviews a user $u$ has written. Consequently, a user’s vector $\vec{u}$ would encode $u$’s preferences. When the total number of users in the data set is large, identifying a target user’s neighbors among them is expensive. Our proposed index makes this process more amenable to large data sets by selecting only the more similar subcollections of users. Suppose the review data set is sorted by users; then the similarity of a user to a collection of reviews can be computed using Eq (10) to sample the most similar collections of users. Predicted ratings for the new user can then be computed using any model/metric with the neighbors in the chosen subcollections.
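As a concrete example, the predicted rating $\hat{r}_{u,i}$ of an item $i$ by user $u$ can directly leverage the similarity of $u$’s review text vector to that of each user $v$ who has also rated item $i$, computed as $\hat{r}_{u,i} = \frac{\sum_{v \in U_i} sim(u, v)\, r_{v,i}}{\sum_{v \in U_i} sim(u, v)}$, where $U_i$ is the set of all users who have rated item $i$ in the chosen subcollections, $r_{v,i}$ is user $v$’s rating for item $i$, and $sim(u, v)$ is characterized by the similarity of the review texts $u$ and $v$ have written, as measured by their vectors $\vec{u}$ and $\vec{v}$.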
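The similarity-weighted average used for rating prediction can be sketched as follows. The sketch assumes cosine similarity between user vectors as the similarity score and skips neighbors with non-positive similarity; both choices, along with the function names and toy vectors, are illustrative assumptions rather than details confirmed by the text.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

def predict_rating(target_vec, rated_neighbours):
    """Similarity-weighted average of neighbours' ratings for one item.

    rated_neighbours: list of (user_vector, rating) pairs for users in the
    sampled subcollections who rated the item. Returns None when no
    neighbour has positive similarity to the target user.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for vec, rating in rated_neighbours:
        sim = cosine(target_vec, vec)
        if sim <= 0:
            continue  # assumption: ignore dissimilar users
        weighted_sum += sim * rating
        weight_total += sim
    return weighted_sum / weight_total if weight_total > 0 else None

target = [1.0, 0.0]
# One neighbour pointing the same way (rating 5), one orthogonal (rating 1).
prediction_a = predict_rating(target, [([2.0, 0.0], 5), ([0.0, 3.0], 1)])
# Two equally similar neighbours with ratings 2 and 4.
prediction_b = predict_rating(target, [([1.0, 1.0], 2), ([1.0, -1.0], 4)])
prediction_none = predict_rating(target, [])
```

Because only neighbors from the sampled subcollections are considered, the cost of a prediction scales with the sample size rather than with the total number of users.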
IV-D Discussion on Document Allocation
The document allocation policy can affect DIR’s performance, since the storage of documents needs to be skewed for DIR to be effective [7, 22]. This is because a query expects to retrieve as many relevant documents as possible from only a subset of the data set. For example, documents on the same topic or from the same source are allocated together. The document allocation policy can also affect the performance of recommendation, since identifying similar users is similar to retrieving relevant documents. On the other hand, the accuracy of aggregation estimators is not dependent on document allocation, since the local sum of each chosen subcollection is multiplied by the inverse of its own sampling probability as a scaling factor to compute the overall estimate; i.e., subcollections with a large local sum simply have a small scaling factor.
In order to enhance the performance of DIR and recommendation queries, we propose to allocate documents based on their vectors’ pair-wise cosine distance by clustering the documents in the original data set using spherical K-means, which is a version of K-means that uses the cosine distance of the items as the distance metric. The clustering process takes as input the collection of document vectors, and produces an allocation where semantically similar documents are allocated together.
When documents $d_1$, $d_2$, …, $d_n$ are semantically similar, the probabilities $P(w \mid d_1)$, $P(w \mid d_2)$, …, $P(w \mid d_n)$ are also similar for a query word $w$. The result of clustering is a more skewed distribution of the documents, so documents that have similar probabilities of predicting a query word tend to be allocated together. As Eqs (9) and (10) show, $P(w \mid c)$ is the geometric mean of each document’s probability of predicting $w$ in that subcollection; so if the $P(w \mid d)$’s are similar within each $c$, their geometric mean will approach its maximum, which equals the arithmetic mean of all the $P(w \mid d)$, according to the AM-GM inequality. This suggests that the allocation policy would produce a skewed sampling probability distribution $\pi_c$, which is desired for a retrieval query.
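The allocation step can be illustrated with a bare-bones spherical K-means: vectors are normalized to unit length, each vector is assigned to the centroid with the highest cosine similarity, and centroids are re-normalized means. This is a simplified sketch with naive initialization (the first `k` points) and a fixed iteration count; the function names and toy vectors are assumptions, and a production system would likely use a library implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(vec):
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def spherical_kmeans(vectors, k, iterations=10):
    """Cluster vectors by cosine similarity; returns one cluster id per vector."""
    points = [normalize(v) for v in vectors]
    centroids = [points[j] for j in range(k)]  # naive init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the centroid with the largest cosine similarity.
        assignment = [
            max(range(k), key=lambda j: dot(p, centroids[j])) for p in points
        ]
        # Recompute each centroid as the normalized mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return assignment

# Toy document vectors (hypothetical): two clear directions.
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
assignment = spherical_kmeans(toy_doc_vecs, k=2)
```

Documents sharing a cluster id would then be stored in the same subcollection, which is what skews the sampling distribution toward a few relevant blocks.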
Although EmApprox can approximate a range of text analytical queries, we highlight a couple of limitations.
Model drift. We assume the text data set is historical and stable. If new documents are added to the existing data set, PV-DBOW can compute the vector for an unseen document through an inference step over the trained model using the words in the new document. Alternatively, an unseen document’s vector representation can be obtained by averaging the word vectors in that document. However, the originally trained PV-DBOW model may still drift due to document updates. The PV-DBOW model should therefore be retrained to capture the true word frequency distributions in the data set, which requires the offline index to be rebuilt.
Unseen words in the query. Currently we assume the query does not include words that are outside the vocabulary of the data set, so that the vector of any word in the query can be directly obtained. Our approximation technique does not yet support queries that include words unseen by the PV-DBOW model during training over the data set.
We have implemented a prototype of EmApprox as a Python library comprising two parts: one for building offline indexers and one for building approximate query processing applications (queries for short). Users write indexers and queries as Spark programs in Python using the PySpark and EmApprox libraries. Figure 2 gives an overview of a system built using EmApprox, where documents are stored in blocks of an HDFS filesystem, with blocks considered subcollections of documents and used as sampling units. Note that a single indexer can be used to index many different data sets of the same type, and many different queries can be executed against each index/data set.
We leave the task of writing indexers to the user because it allows the flexibility to index many different types of data (e.g., different data layouts and definitions of documents). It is quite simple to write indexers given the EmApprox library. Specifically, users need to write code to parse a given data set to extract the documents (much of this code can come from standardized libraries) and identify the documents in each subcollection (HDFS block). All other functionalities are implemented in the EmApprox library, and simply require the user program to call several functions. Similarly, the main difference between an approximate query built using EmApprox and a precise query is the invocation of several EmApprox functions.
An offline indexer uses EmApprox to learn vector representations for words and documents, cluster documents (when desired) using K-means as discussed in Section IV-D, compute vectors for blocks (subcollections), compute corresponding LSH bit vectors, and prepare the index. This process is shown in Figure 2. (We do not show the clustering for simplicity.) We use Gensim as the default library for PV-DBOW model training, but alternative implementations that run on distributed systems, such as TensorFlow, can also be used to reduce the training time. We use Gensim in our prototype because it is a widely adopted PV-DBOW implementation.
The execution of an approximate query is also shown in Figure 2. First, the query uses EmApprox to read the index into an in-memory hash table, compute sampling probabilities for all HDFS blocks, and choose a sample using pps sampling. It then launches a Spark job. Within the Spark job, EmApprox and PySpark are used to read the sample from the data set into an RDD, after which the rest of the Spark job executes. We provide two simple reduce functions that compute the estimated sum and average, along with the confidence interval (see Eq (1) and (2)).
Table I: Query types and evaluation metrics.
| Query | Type | Description | Metrics |
|---|---|---|---|
| aggregation | aggregation | estimates frequency for target phrase. | error bound |
| Boolean retrieval | DIR | retrieves a (sub)set of documents that precisely match a Boolean query. | recall |
| ranked retrieval | DIR | retrieves top-k documents ranked by a given scoring function over a set of query terms. | precision |
| user-centric CF | recommendation | predicts ratings on unbought products and outputs top-k recommendations for a user. | MSE, precision |
Table II: Data sets used in the evaluation.
|data set||description||queries||size||document definition||PV-DBOW training time||index size|
|Wikipedia||5 million Wikipedia articles in XML format.||aggregation||62GB||each Wikipedia article||4.3h||125MB|
|CCNews||22 million news articles crawled from the Web in 2016, in JSON format.||aggregation, DIR||65GB||each news article||6.2h||280MB|
|Amazon reviews||142 million user-product interactions from 5/1996 to 7/2014, in JSON format.||recommendation||55GB||all reviews written by the same user||3.8h||87.5MB|
We evaluate EmApprox using synthetic queries of the three types discussed above. Table I summarizes the query types and the evaluation metrics. Table II summarizes the data sets that we use in our evaluation.
VII-A Evaluation Methodology and Metrics
Data. We use three data sets: a snapshot of Wikipedia, a news corpus from Common Crawl (CCNews), and a set of Amazon user reviews. Table II summarizes the data sets, their use in different queries, the training time of PV-DBOW using Gensim, and the size of the resulting indices after compression using LSH.
Experimental platform. Experiments are run on a cluster of 6 servers interconnected with 1Gbps Ethernet. Each server is equipped with a 2.5GHz Intel Xeon CPU with 12 cores, 256GB of RAM, and a 100GB SSD. Data sets are stored in an HDFS file system hosted on the servers’ SSDs, configured with a replication factor of 2 and block size of 32MB. Applications are written in Python 3 and run on Spark 2.0.2.
Index construction. We train PV-DBOW to produce word and document vectors with 100 dimensions. We use LSH vectors of 100 bits, which compresses the PV-DBOW learned vectors by a factor of 64. We explore the sensitivity of EmApprox to these parameters in Section VII-E.
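As an illustration of the compression step, the following sketch maps a real-valued vector to a bit vector using random-hyperplane LSH and estimates cosine similarity from the Hamming distance between two signatures. It assumes the random-hyperplane scheme and 64-bit floating-point components (which is what makes 100 dimensions to 100 bits a factor of 64); all function names are hypothetical.

```python
import math
import random


def make_hyperplanes(dim, bits, seed=0):
    """Generate one random Gaussian hyperplane (normal vector) per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side, else 0."""
    return [
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    ]


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity: the fraction of differing bits approximates angle / pi."""
    hamming = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * hamming / len(sig_a))


def compression_factor(dim, bits, bits_per_component=64):
    """Size ratio of the real-valued vector to its bit signature (assumes 64-bit floats)."""
    return dim * bits_per_component / bits
```

With `dim=100` and `bits=100`, `compression_factor` returns 64.0 under the 64-bit assumption.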
Baselines. We analyze and compare EmApprox’s performance against simple random clustered sampling (SRCS)  of HDFS blocks and precise execution. We compare execution times (speedups) and query-specific metrics. We run the precise executions as “pure” Spark programs on an unmodified Spark system. We run SRCS using the EmApprox prototype, replacing pps sampling with simple random sampling.
Aggregation queries. We run aggregation queries that estimate the numbers of occurrences of target phrases in the Wikipedia data set, along with the corresponding relative errors. We create 200 queries by randomly selecting phrases from the data set. The lengths of the phrases follow a normal distribution with a mean of 2 words and a standard deviation of 1.
The approximate answer to each query executed with a given sampling rate includes the estimated count and an estimated confidence interval. We report the estimated relative error at the 95% confidence level as the ratio of the confidence interval to the estimated count. We also compare the estimated relative error with the actual relative error, computed as the absolute difference between the estimated count and the precise answer, divided by the precise answer.
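The two error metrics can be written down as a short sketch. It assumes the confidence interval is reported as a half-width around the estimate; the function names are hypothetical.

```python
def estimated_relative_error(estimate, half_width):
    """Ratio of the confidence interval half-width to the estimated count."""
    return half_width / estimate


def actual_relative_error(estimate, precise):
    """Absolute deviation of the estimate from the precise answer, relative to the precise answer."""
    return abs(estimate - precise) / precise
```

For instance, an estimate of 200 with a half-width of 50 has an estimated relative error of 0.25, regardless of the precise answer.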
DIR queries. We cluster documents within the Wikipedia and CCNews data sets as explained in Section IV-D, setting the number of centroids equal to the number of HDFS blocks in the file holding the data set.
We generate 200 sets of randomly chosen words, 100 from the Wikipedia data set and 100 from CCNews. Set sizes follow a normal distribution with an average size of 3 and a standard deviation of 1. We randomly insert Boolean operators (and and or) to form Boolean queries, and use the sets of words directly as queries in ranked retrieval.
We use the BM25 ranking function in ranked retrieval. We choose BM25 from a plethora of ranking functions, including functions that use Paragraph Vector, because it is widely adopted by search platforms such as Solr and IR libraries such as Apache Lucene.
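For reference, a minimal sketch of BM25 scoring over tokenized documents follows. It assumes the common defaults k1 = 1.2 and b = 0.75 and a smoothed, non-negative IDF variant; the paper does not state which parameters or IDF form were used, and the function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in docs against query_terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In an approximate execution, `docs` would contain only the documents from the sampled blocks, so the top-k list is computed over a subset of the collection.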
In Boolean retrieval, we report recall, defined as the ratio of the number of documents retrieved by the approximate query processing to the number of documents retrieved by the precise execution of the query. In ranked retrieval, we report precision-at-k (P@k), defined as the percentage of the top k documents retrieved by the approximate query processing that are in the top k retrieved by the precise query execution. We report precision instead of recall in ranked retrieval because users typically care most about the top-ranked answers, where k is typically small.
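Both metrics are straightforward to compute from document identifiers. The sketch below is illustrative only; the function names are hypothetical, and returning 1.0 for an empty precise result set is an assumption.

```python
def boolean_recall(approx_docs, precise_docs):
    """Fraction of the precisely retrieved documents that the approximate run also retrieved."""
    precise = set(precise_docs)
    if not precise:
        return 1.0  # Assumption: nothing to miss when the precise result is empty.
    return len(set(approx_docs) & precise) / len(precise)


def precision_at_k(approx_ranked, precise_ranked, k):
    """Fraction of the approximate top-k that also appears in the precise top-k."""
    return len(set(approx_ranked[:k]) & set(precise_ranked[:k])) / k
```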
Recommendation queries. We approximate user-centric CF, which takes as input a target user with a history of past reviews/purchases and outputs predicted ratings for unpurchased items. It also generates a top-k list of recommended items sorted by their predicted ratings.
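A minimal sketch of user-centric CF is shown below. It assumes cosine similarity between users' rating vectors and a similarity-weighted average for prediction, which is one common neighborhood-based formulation and not necessarily the exact variant used in the evaluation. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two users' {item: rating} dictionaries."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def predict_ratings(target, others):
    """Predict ratings for items the target user has not rated.

    target: {item: rating}; others: list of {item: rating}, one per other user.
    """
    weighted, total = {}, {}
    for other in others:
        sim = cosine(target, other)
        if sim <= 0:
            continue  # Users with no overlap contribute nothing.
        for item, rating in other.items():
            if item in target:
                continue
            weighted[item] = weighted.get(item, 0.0) + sim * rating
            total[item] = total.get(item, 0.0) + sim
    return {item: weighted[item] / total[item] for item in weighted}


def top_k_items(predictions, k):
    """Items with the k highest predicted ratings, best first."""
    return sorted(predictions, key=predictions.get, reverse=True)[:k]
```

In an approximate execution, `others` would hold only the users found in the sampled blocks, so sampling toward users similar to the target keeps the most influential neighbors.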
We rearrange the reviews in the original data set to group all reviews written by a unique user together. Each group of reviews written by a unique user is then considered a single document. We randomly select 100 users and remove 20% of each selected user’s ratings from the data set to be used as test data. We then cluster the remaining documents as we did for the DIR queries and construct the index. Finally, we construct 100 queries for the selected users, where each query outputs the predicted ratings as computed by the CF algorithm for the users/reviews in the test data.
The rating scale is 1-5 in the Amazon data set. We report the mean squared error (MSE) and P@k to measure prediction performance. MSE is computed for the predicted vs. actual ratings as a measure of accuracy for the predictions. P@k is the percentage of the items recommended in the top-k list that were actually purchased by the target user. (We assume that a user purchased an item if s/he reviewed it.)
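The two recommendation metrics can be sketched as follows. The sketch assumes MSE is computed only over held-out items for which a prediction exists; the function names are hypothetical.

```python
def mean_squared_error(predicted, actual):
    """MSE over held-out items that received a prediction.

    predicted and actual are {item: rating} dictionaries.
    """
    items = [i for i in actual if i in predicted]
    return sum((predicted[i] - actual[i]) ** 2 for i in items) / len(items)


def purchase_precision_at_k(recommended, purchased, k):
    """Fraction of the top-k recommended items that the user actually purchased (reviewed)."""
    return sum(1 for item in recommended[:k] if item in purchased) / k
```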
VII-B Results for aggregation queries
Figure 7 plots the CDFs of estimated relative errors when running the 200 queries under EmApprox and SRCS at 1%, 2.5%, 5% and 10% sampling rates. We observe that: (1) EmApprox consistently achieves smaller estimated relative errors than SRCS at the same sampling rate; (2) the “tails” of the CDFs are “shorter,” meaning that there are fewer query answers with large estimated relative errors; and (3) EmApprox achieves smaller maximum estimated relative errors. Under SRCS, the estimated relative errors can be large at very low sampling rates. For example, the estimated relative errors are 80% and 95% at the 50th (median) and 90th percentiles, respectively, at a 1% sampling rate. Under EmApprox, they are reduced to 25% and 45%. Errors become much smaller with increasing sampling rates.
Figures 10(a) and (b) show average speedups compared to precise execution and average relative errors (both estimated and actual), respectively. We observe that speedups are slightly smaller for EmApprox compared to SRCS. This is because EmApprox does incur a small amount of extra overhead to compute the sampling probabilities for blocks. On the other hand, as already mentioned, EmApprox achieves much smaller relative errors than SRCS. Specifically, SRCS has to process roughly 4x the amount of data processed by EmApprox to achieve similar relative errors: for example, the relative errors under EmApprox at 2.5% sampling rate are similar to SRCS’s relative errors at 10% sampling rate.
In summary, EmApprox significantly outperforms SRCS. If the user can tolerate the estimated relative error profile for EmApprox at a 10% sampling rate (i.e., 30% and 40% at the 50th and 90th percentiles, respectively), then EmApprox achieves an average speedup of 10x for the 200 aggregation queries. Further, the user can trade off between accuracy and performance by adjusting the sampling rate.
VII-C Results for DIR queries
Figure 15 plots the CDFs of recall for the Boolean queries under EmApprox and SRCS when run on the Wikipedia (100 queries) and CCNews (100 queries) data sets. Similar to the observations for aggregation queries, we observe that EmApprox significantly outperforms SRCS. Unfortunately, even at a high sampling rate (e.g., 50%), EmApprox misses significant fractions of relevant documents for many queries. This is because relevant documents can be spread out across many subcollections even with clustering, as clustering is performed based on overall contents of documents, which may be very different from the few words in a Boolean query.
Figures 19(a) and (b) show Boolean retrieval’s speedups over precise execution and the average achieved recall rates, respectively. We observe that the much higher sampling rates required to achieve higher recall rates constrain achievable speedups.
Figure 19(c) shows the average P@10 results for ranked retrieval for the Wikipedia data set. EmApprox significantly outperforms SRCS at lower sampling rates, but even its achieved precision levels are unlikely to be acceptable. EmApprox achieves much better precision levels at higher sampling rates; e.g., 0.57 and 0.78 at 50% and 75% sampling rates, respectively.
In summary, EmApprox significantly outperforms SRCS. However, the nature of the problem, which is to find specific items in a large data set, reduces the effectiveness of sampling, even when sampling is directed by some knowledge of content. Thus, EmApprox can only achieve modest speedups while achieving relatively high recall rates and precision levels. For example, EmApprox achieves an average speedup of 1.3x at a sampling rate of 75%. Achieved average recall and P@10 are 0.89 and 0.78, respectively.
VII-D Results for recommendation queries
Figure 22 shows average MSE and P@10 under different sampling rates for EmApprox and SRCS. Similar to results for the other two query types, EmApprox outperforms SRCS. The differences between the two approaches are less pronounced, however. For example, EmApprox outperforms SRCS by 8% and 7.4% for average MSE and average P@10, respectively, at a sampling rate of 25%. This is likely due to the fact that user-centric CF itself does not achieve high accuracy—the precise execution achieves an average MSE of 1.015 and average P@10 of 0.32%—so that selecting customers most similar to the target customer does not have a large impact compared to random selection. EmApprox incurs minimal overheads at query processing time, however, and so its increased accuracy is still desirable, especially when the number of customers is large. EmApprox speeds up the query processing time by almost 9x while degrading P@10 by 18.7% at 10% sampling rate. Speedup is over 3x with 12.5% degradation of P@10 at 25% sampling.
VII-E Sensitivity Analysis
In this subsection, we study the effect of the dimension of the PV-DBOW vectors and the length of the LSH bit vectors. For DIR and recommendation, we also study the number of clusters used by K-means during offline preprocessing.
Vector dimensions. Intuitively, the PV-DBOW vector dimension is more important than the LSH bit-vector length, as it directly reflects how accurately the vectors can capture the word distributions in the data set. The LSH bit vectors only approximate their corresponding real-valued vectors and are thus unlikely to improve on their performance. As a result, we first tune the PV-DBOW dimension and then the LSH length. We find that the performance of EmApprox gradually improves as the PV-DBOW dimension increases, and then stabilizes once the dimension is sufficiently large across our data sets. We would like the dimension to be as small as possible, since large values significantly slow down the training of PV-DBOW. We choose 100 as the default vector size during training for convenience across the evaluations, since we observe that the performance of the queries is relatively stable once the vector size reaches 100. Note that the best dimension still depends on the combination of data set and queries.
Figures 27(a) and (b) show how the estimated error of phrase occurrence queries and the MSE of recommendation queries vary with the PV-DBOW vector dimension. The effect of the LSH bit-vector length is similar: performance first gradually improves and then stabilizes as the length grows. Our objective in tuning the LSH length is to match the performance obtained by directly using real-valued vectors. Figures 27(c) and (d) show the impact of the LSH length on the phrase occurrence and ranked retrieval queries. We notice that the smallest length that matches the performance of the real-valued counterpart depends on both the data set and the specific queries. For the experiments, we picked 100 bits for convenience because we have observed that this length consistently performs on par with the original real-valued vectors across the queries.
Number of clusters for K-means. We have found that the performance of EmApprox on DIR and recommendation queries is sensitive to the number of clusters in K-means. Figures 30(a) and (b) show the P@10 metric for ranked retrieval over the Wikipedia data set and for recommendation under different numbers of clusters. We observe that performance gradually improves as the number of clusters increases and plateaus as it approaches the number of HDFS blocks in the data set, after which it starts to drop if the number of clusters keeps increasing. This is because, when there are too many clusters, the data set is close to being randomly shuffled.
Our results show that EmApprox is more effective for queries that estimate results from samples, with higher sampling rates leading to increasingly precise estimates, than for queries that seek specific data items within large data sets. This matches intuition, since EmApprox’s index is meant to guide sampling toward subcollections that are most similar to a query, rather than to identify data items that are guaranteed to be relevant to the query, as a traditional inverted index is meant to do. Thus, EmApprox can help speed up aggregation and recommendation queries by close to 10x if the user can tolerate modest amounts of imprecision. In addition, the user can gracefully trade off between imprecision and performance by adjusting the sampling rate. While we can observe the same trade-off for DIR, it is likely that users cannot tolerate the imprecision at lower sampling rates. For example, even at a sampling rate of 50%, EmApprox achieves an average P@10 of less than 0.6, implying that the approximate processing misses over 4 of the top 10 ranked documents on average. In comparison, with a sampling rate of 10%, EmApprox achieves an average estimated relative error of 18.2% for the aggregation queries, and an average degradation of 18.7% in P@10 for the recommendation queries.
VIII Conclusion and future work
We present an approximation framework for a wide range of analytical queries over large text data sets, using a light-weight index based on an NLP model (PV-DBOW). We formally show that the training objective of PV-DBOW is directly related to the generative probability of a query given a collection of documents. We also characterize text analytic queries that are suitable for our similarity-driven approximation approach. Our experiments show that our light-weight index can reduce the execution time by almost an order of magnitude while degrading gracefully in approximation quality with decreasing sampling rates. EmApprox is particularly useful for exploratory text analytics.
Future Work. We believe our proposed index can be effective under more scenarios and data sets. For example, our system can be used to efficiently detect the sentiment for a large text data set; it can also potentially be used toward selecting relevant users/products for more complex recommendation methods, such as model-based algorithms involving multiple data sources [29, 44]. We also plan to extend our key methodology—learning an index directly from the data set to facilitate approximate computation—to other types of data sets, such as visual data, time series, etc.
References
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §VI.
- (2014) Knowing when you’re wrong: building fast and reliable approximate query processing systems. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 481–492. Cited by: §I.
- (2013) BlinkDB: queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys ’13, New York, NY, USA, pp. 29–42. Cited by: §I, §II-A.
- (2016) Analysis of the paragraph vector model for information retrieval. In Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR ’16, New York, NY, USA, pp. 133–142. Cited by: §II-C, §IV-B.
- (2016) Improving language estimation with the paragraph vector model for ad-hoc retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, New York, NY, USA, pp. 869–872. Cited by: §VII-A.
- (2019) Apache Solr. Note: http://lucene.apache.org/solr/ Cited by: §VII-A.
- (2002) Distributed information retrieval. In Advances in information retrieval, pp. 127–150. Cited by: §IV-B, §IV-D.
- (2002) Similarity estimation techniques from rounding algorithms. In Proceedings of the thirty-fourth annual ACM symposium on Theory of computing, pp. 380–388. Cited by: §III-B.
- (2001) A robust, optimization-based approach for approximate answering of aggregate queries. In ACM SIGMOD Record, Vol. 30, pp. 295–306. Cited by: §I, §I.
- (2016) Common crawl news dataset. Note: http://commoncrawl.org/2016/10/news-dataset-available. Cited by: §VII-A.
- (2018) Common crawl. Note: https://registry.opendata.aws/commoncrawl/ Cited by: §I.
- (2015) Search engines: information retrieval in practice. Vol. 283, Addison-Wesley Reading. Cited by: §IV-B, §IV-B.
- (2018) Gensim. Note: https://radimrehurek.com/gensim/ Cited by: §VI.
- (1999) Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, San Francisco, CA, USA, pp. 518–529. Cited by: §II-D.
- (2015) ApproxHadoop: bringing approximations to MapReduce frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, New York, NY, USA. Cited by: §I, §I, §II-A, §II-B, §III-A.
- (2019) Google Book Ngrams. Note: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Cited by: §I.
- (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, Republic and Canton of Geneva, Switzerland, pp. 507–517. Cited by: §VII-A.
- (2007) The AM-GM inequality. The Mathematical Intelligencer 29 (4), pp. 7–7. Cited by: §IV-D.
- (2019) Apache Lucene. Note: http://lucene.apache.org/ Cited by: §VII-A.
- (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC ’98, New York, NY, USA, pp. 604–613. Cited by: §I, §II-D.
- (2018) The case for learned index structures. In Proceedings of the 2018 International Conference on Management of Data, pp. 489–504. Cited by: §II.
- (2015) Selective search: efficient and effective search of large textual collections. ACM Transactions on Information Systems (TOIS) 33 (4), pp. 17. Cited by: §IV-B, §IV-D.
- (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning, pp. 1188–1196. Cited by: §I, §I, §II-C, §II-C, §V.
- (2014) Neural word embedding as implicit matrix factorization. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, Cambridge, MA, USA, pp. 2177–2185. Cited by: §II-C.
- (2018-12-01) Approximate query processing: what is new and where to go?. Data Science and Engineering 3 (4), pp. 379–397. Cited by: §I, §I.
- (2003) Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet computing (1), pp. 76–80. Cited by: §IV-C.
- (2009) Sampling: design and analysis. Advanced (Cengage Learning), Cengage Learning. Cited by: §I, §I, §II-B, §VII-A.
- (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §VII-A.
- (2013) Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, New York, NY, USA, pp. 165–172. Cited by: §I, §IV-C, §VIII.
- (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, Cited by: §II-C.
- (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §II-C, §II-C.
- (2015) A comprehensive survey of neighborhood-based recommendation methods. In Recommender systems handbook, pp. 37–76. Cited by: §IV-C.
- (2018) AQP++: connecting approximate query processing with aggregate precomputation for interactive analytics. In Proceedings of the 2018 International Conference on Management of Data, pp. 1477–1492. Cited by: §I, §II-A.
- (2017) PySpark. Note: https://spark.apache.org/docs/latest/api/python/index.html Cited by: §VI.
- (2015) Recommender systems: introduction and challenges. In Recommender systems handbook, pp. 1–34. Cited by: §I, §IV-C.
- (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3 (4), pp. 333–389. Cited by: §IV-B, §VII-A.
- (2010) The Hadoop distributed file system. In 2010 IEEE 26th symposium on mass storage systems and technologies (MSST), pp. 1–10. Cited by: §I.
- (2016) Meta-prod2vec: product embeddings using side-information for recommendation. In Proceedings of the 10th ACM Conference on Recommender Systems, RecSys ’16, New York, NY, USA, pp. 225–232. Cited by: §IV-C.
- (2014) Hashing for similarity search: a survey. arXiv preprint arXiv:1408.2927. Cited by: §II-D.
- (2006) LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 178–185. Cited by: §IV-B.
- (2018) Wikipedia database. Note: http://en.wikipedia.org/wiki/Wikipedia_database. Cited by: §VII-A.
- (1999) Cluster-based language models for distributed retrieval. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 254–261. Cited by: §IV-B.
- (2016-11) Sapprox: enabling efficient and accurate approximations on sub-datasets with distribution-aware online sampling. Proc. VLDB Endow. 10 (3), pp. 109–120. Cited by: §I, §I, §I, §II-A, §II-B, §III-A.
- (2017) Joint representation learning for top-n recommendation with heterogeneous information sources. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM ’17, New York, NY, USA, pp. 1449–1458. Cited by: §IV-C, §VIII.
- (2005) Efficient online spherical k-means clustering. In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., Vol. 5, pp. 3180–3185. Cited by: §IV-D.