Attention improves concentration when learning node embeddings

06/11/2020 ∙ by Matthew Dippel, et al. ∙ Northeastern University ∙ Amazon

We consider the problem of predicting edges in a graph from node attributes in an e-commerce setting. Specifically, given nodes labelled with search query text, we want to predict links to related queries that share products. Experiments with a range of deep neural architectures show that simple feedforward networks with an attention mechanism perform best for learning embeddings. The simplicity of these models allows us to explain the performance of attention. We propose an analytically tractable model of query generation, AttEST, that views both products and the query text as vectors embedded in a latent space. We prove (and empirically validate) that the point-wise mutual information (PMI) matrix of the AttEST query text embeddings displays a low-rank behavior analogous to that observed in word embeddings. This low-rank property allows us to derive a loss function that maximizes the mutual information between related queries, which is used to train an attention network to learn query embeddings. This AttEST network beats traditional memory-based LSTM architectures by over 20% on the F1 score. We find that the weights from the attention mechanism correlate strongly with the weights of the best linear unbiased estimator (BLUE) for the product vectors, and conclude that attention plays an important role in variance reduction.


1 Introduction

Graphs are used in various applications such as bioinformatics Fout et al. (2017), recommender systems Ying et al. (2018), and social network analysis Hamilton et al. (2017). An important learning problem on these graphs is to predict whether two nodes share an edge, given information about other nodes in the graph. Solving this is crucial for tasks like metabolic network construction, movie recommendation, and knowledge graph completion.

One approach to downstream tasks such as link prediction is to employ low-dimensional vector representations of nodes generated by latent variable models. These techniques were originally developed for representing text and, separately, for representing images. Subsequently, they have been combined Hamilton et al. (2017) to generate embeddings for graphs that respect the semantic similarity of node features (text/images).

In this work, we address the link prediction problem in an e-commerce setting. We consider the query graph consisting of nodes representing search queries entered by users seeking a specific product; there is a link between two queries if users purchased the same product after making each query. The query reformulation problem is to infer the links in the graph for a newly added node labeled by its query. For example, a user may enter the never-before-seen query "anxiety toy". The system should infer that the user is searching for products also bought after searches for "fidget spinners", and consequently that there should be a link between the two queries in the query graph. It is easy to see from this example that purely syntactic (string matching) approaches are insufficient. Furthermore, note that queries cannot be treated as a bag of words; the sequence of words conveys important meaning. For instance, the queries "milk chocolate" and "chocolate milk" contain the same words but refer to different products.

Existing latent variable approaches are unable to solve this problem (see Related work). Keeping the above examples in mind, we approached the query reformulation problem using deep latent variable models that are sensitive to word sequences. In particular, we considered long short-term memory networks (LSTMs) as well as feedforward networks with attention. Indeed, more sophisticated models like BERT Devlin et al. (2019) require billions of parameters, which makes them infeasible to train given the large vocabulary size of commercial datasets. LSTMs, and in particular the feedforward network with attention, require fewer parameters and resources to train. Our experiments (see Section 2) show that attention networks perform very well and are significantly quicker to train.

This raises the question of why attention networks perform well. Indeed, there has been existing work investigating the limits of attention networks (see Related work). However, we are not aware of a sound analytical justification for the success of attention networks. Our main contribution is to offer a succinct, model-based explanation that may serve as a basis toward understanding more sophisticated models such as BERT.

Our results We solve the query reformulation problem using a feedforward attention network with a cross-entropy-based loss function. To judge the output of the models, we introduce a novel analog of the F1 score intended to capture both the relevance of the output queries and the diversity of the products that the reformulated queries lead to. We compare the attention-based approach to a hybrid method using graph embeddings and a long short-term memory network (LSTM), as well as to a purely text-based approach. We show that the attention mechanism beats (all reasonable variants of) these other approaches by over 20% on the F1 score (Section 2).

We formulate a model of query generation, Attention Embeddings for Short Texts (AttEST), that matches statistical properties of queries and allows us to explain the success of attention networks trained with the cross-entropy loss function. Analogous to word embeddings, we show that the PMI of two queries is the dot product of their vector embeddings (see Corollary 3.1), matching the empirical observation that the query PMI matrix is low-rank. Using this property, we give theoretical validation for the cross-entropy-based loss function as maximizing mutual information. The AttEST model also allows us to prove that a weighted average of trigram vectors is the best linear unbiased estimator (BLUE) for the product desired by a query (Section 3.1). Interestingly, in Figure 1 we observe a notable correlation between the (empirical) weights from the attention mechanism and the BLUE weights derived from Lemma 3.4. This validates the AttEST model and suggests that the attention mechanism weights allow more efficient concentration to the product vector by reducing the variance of the estimate.

Related work Since the success of word2vec Mikolov et al. (2013) in finding word embeddings, there have been many variants and extensions Pennington et al. (2014); De Boom et al. (2016); Arora et al. (2017); Le and Mikolov (2014). There has also been a lot of work on embeddings for nodes in a graph Kipf and Welling (2017); Hamilton et al. (2017); Narayanan et al. (2017); Velickovic et al. (2018). The former line of work does not consider any graph structure and only embeds textual features, while the latter does not have features associated with the nodes and hence cannot incorporate that information when generating embeddings.

Another line of work in representation learning leverages multiple types of entities, for example text and images Klein et al. (2015); Kiela and Bottou (2014), building multi-modal models that embed both types of entities in the same latent space. Saunshi et al. (2019) provide a theoretical framework for generating embeddings for entities that share a semantic similarity relationship, generalizing embeddings for nodes in a graph. However, their framework does not account for features associated with the entities and hence does not apply to our scenario.

Vaswani et al. (2017) showed that the attention mechanism of Bahdanau et al. (2015) supplants and outperforms recurrent models for many problems via the Transformer network. We indirectly corroborate this by showing that the weighting scheme induced by an attention mechanism gives the least-variance estimator for the true embedding. On the other hand, Jain and Wallace (2019) argue against using attention weights as a measure of feature importance for RNN-based models. This does not contradict our reasoning that attention weights enhance query embeddings, since we employ simple feedforward networks with attention. Devlin et al. (2019) introduce the BERT model, which uses bidirectional training of the Transformer to train language models. BERT and its augmentations Liu et al. (2019); Shen et al. (2019) represent the state of the art in language modeling.

Our work builds, in a nontrivial fashion, on the seminal RAND-WALK model of Arora et al. (2016). RAND-WALK is a generative model of word embeddings that provides an explanation for the low-rank nature of the point-wise mutual information (PMI) matrix Deerwester et al. (1990); Turney and Pantel (2010), among others. In contrast to RAND-WALK, which analyzed long-form text over small vocabularies and could exploit ergodicity, AttEST analyzes short queries over a massive vocabulary and so required novel yet justifiable modeling assumptions; in addition to corroborating the low-rank nature of PMI matrices, AttEST explains the effectiveness of both the loss function and the attention mechanism, with empirical validation.

In the specific context of e-commerce, there has been an empirical study using LSTM networks to map queries to structured attributes Wu et al. (2017), as well as work on the more specific problem of ranking query reformulations Sheldon et al. (2011); Santos et al. (2013). As opposed to the former work, our latent space model AttEST allows for arbitrary downstream tasks on queries while having a theoretical grounding. This grounding also solves the reformulation-ranking problem: the embedding step is followed by a k-nearest-neighbor search in the latent space to shortlist reformulations.

2 Experimental results

Our primary data set uses query-product pairs from the Electronics category of the Amazon US locale, sampled during the March–April 2018 period. The resulting query-product graph has approximately 670,000 queries, 146,000 products, and 1 million edges. The second data set is sampled from the Amazon US locale over a period of 91 days, up to August 26, 2017. We partition the queries and products into disjoint clusters using a spectral clustering algorithm and take all queries and products that appear in the largest 25 clusters; in total, this data set contains approximately 250,000 queries. From the bipartite query-product graph, we completed all triangles and took the resulting query-query graph on only the query nodes. Finally, we removed isolated queries from the query-query graph and viewed the edges of the graph as data samples. Both the primary and Top-25 Clusters data sets are split 95%–5% into training and testing sets.
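For concreteness, the triangle-completion step can be sketched in a few lines; the function and variable names below are illustrative rather than taken from our pipeline.

```python
from collections import defaultdict
from itertools import combinations

def build_query_query_graph(query_product_edges):
    """Complete triangles in the bipartite query-product graph:
    two queries become adjacent if they share at least one purchased product."""
    product_to_queries = defaultdict(set)
    for query, product in query_product_edges:
        product_to_queries[product].add(query)

    query_query_edges = set()
    for queries in product_to_queries.values():
        for q1, q2 in combinations(sorted(queries), 2):
            query_query_edges.add((q1, q2))

    # Isolated queries are dropped implicitly: only queries in some edge remain.
    return query_query_edges

# Toy example.
edges = [("fidget spinner", "B01"), ("anxiety toy", "B01"), ("usb cable", "B02")]
print(build_query_query_graph(edges))  # {('anxiety toy', 'fidget spinner')}
```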

2.1 Metrics for query reformulation

Evaluating the quality of query reformulations is a non-trivial task, since the data set only contains the products that were purchased by the customer searching for a given query. Because the product search engine returns more than twenty products on the first page alone, all of which may be relevant, it is not clear how to determine whether the nearest-neighbor queries are useful reformulations. We propose two metrics which measure the precision and recall of the top five reformulated queries.

  1. Query Precision@K: Precision is defined to be the fraction of the five reformulations which are 'relevant' to the initial query q. We say a reformulation q′ is 'relevant' if the top K products (by purchases) of q′ contain at least one of the top K products of q.

  2. Product Recall@K: Recall is defined to be the fraction of the top K products associated with q that appear in the list of top K products associated with some reformulation q′.

Query precision measures the fraction of reformulated queries that are 'valid', while product recall measures the diversity of the reformulations. Since the products associated with a query are derived only from clicks, add-to-cart events, and purchases, very similar queries may have no overlap of products; therefore, a high product recall score suggests that the list of reformulations is diverse. Since these metrics are ultimately evaluated from purchase behavior associated with queries, they lower-bound the 'true' precision and recall as would be determined by a human evaluator.
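Both metrics can be computed directly from the purchase data; a minimal sketch follows, where the input format (lists of top-K product ids for the original query and for each reformulation) and the function names are our own illustrative choices.

```python
def query_precision_at_k(query_products, reformulation_products):
    """Fraction of reformulations whose top-K product list shares at least
    one product with the top-K products of the original query."""
    original = set(query_products)
    relevant = sum(1 for prods in reformulation_products if original & set(prods))
    return relevant / len(reformulation_products)

def product_recall_at_k(query_products, reformulation_products):
    """Fraction of the original query's top-K products covered by the union
    of the reformulations' top-K product lists."""
    covered = set().union(*map(set, reformulation_products)) if reformulation_products else set()
    return sum(1 for p in query_products if p in covered) / len(query_products)

# Toy example: top-2 products per query, two reformulations.
orig = ["p1", "p2"]
reforms = [["p1", "p9"], ["p7", "p8"]]
print(query_precision_at_k(orig, reforms))  # 0.5
print(product_recall_at_k(orig, reforms))   # 0.5
```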

2.2 Models

Trigram Hash As a baseline, we consider the purely textual Trigram Hash model; it ignores any behavioral connections. In particular, it would not associate "fidget-spinner" with "anxiety attention toy", since the two strings are textually so dissimilar. Each query is treated as a bag of character trigrams and hashed down to a 300-dimensional vector. Given a new (test) query, a nearest-neighbor search finds the closest training queries under the Bray-Curtis distance.
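A minimal sketch of such a baseline is shown below; the whitespace padding and the use of Python's built-in hash for bucketing are illustrative assumptions, since the exact hashing scheme is not specified above.

```python
import numpy as np
from scipy.spatial.distance import braycurtis

DIM = 300  # dimensionality used by the baseline

def trigram_hash_vector(query, dim=DIM):
    """Treat the query as a bag of character trigrams and hash the counts
    into a fixed-size vector."""
    vec = np.zeros(dim)
    padded = f"  {query.lower()}  "
    for i in range(len(padded) - 2):
        vec[hash(padded[i:i + 3]) % dim] += 1.0
    return vec

def nearest_queries(test_query, train_queries, k=5):
    """Return the k training queries closest under the Bray-Curtis distance."""
    test_vec = trigram_hash_vector(test_query)
    scored = [(braycurtis(test_vec, trigram_hash_vector(q)), q) for q in train_queries]
    return [q for _, q in sorted(scored)[:k]]

print(nearest_queries("chocolate milk", ["milk chocolate", "chocolate milk 2%", "usb cable"], k=2))
```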

LSTM to match graph embeddings There is plenty of existing literature on generating meaningful embeddings for nodes in a graph, as detailed in the Related work section. We make use of two such tools, node2vec Grover and Leskovec (2016) and GraphSAGE Hamilton et al. (2017), by applying them to the query-product graph. We take the learned embeddings for the query nodes and train an LSTM to match these embeddings given only the query string. Matching an LSTM to either set of embeddings performs similarly; we present the LSTM+node2vec results, which were a few percentage points better.

We also have an LSTM-only model that takes the query text directly as input. We enforce the graph structure in the query embedding via positive samples (queries which share the same product set) and negative samples (queries with no products in common). Given a proposed embedding for a query, a set of positive sample embeddings, and a set of negative sample embeddings, the loss function drives the query embedding toward the positive embeddings and away from the negative ones.
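The extracted text does not preserve the displayed loss, so the sketch below shows one plausible GraphSAGE-style cross-entropy form consistent with this description (pull the embedding toward positives, push it away from negatives); it should be read as an assumption rather than the exact loss used here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, positives, negatives):
    """Cross-entropy-style loss: pull the query embedding q toward positive
    sample embeddings and push it away from negative sample embeddings.
    q: (d,), positives: (P, d), negatives: (N, d).
    This is a GraphSAGE-style sketch, not necessarily the paper's exact loss."""
    pos_scores = positives @ q          # (P,)
    neg_scores = negatives @ q          # (N,)
    pos_term = -F.logsigmoid(pos_scores).mean()
    neg_term = -F.logsigmoid(-neg_scores).mean()
    return pos_term + neg_term

q = torch.randn(64, requires_grad=True)
loss = contrastive_loss(q, torch.randn(4, 64), torch.randn(10, 64))
loss.backward()
```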

Attention on textual input Our main model simply learns trigram embeddings, and uses attention on the trigram embeddings to compute a weighted average. Note that the learning is inductive and unsupervised with the loss function agnostic to the downstream metric (in particular to the diversity component, Product Recall@K).

The input to the network is a vector whose length equals the maximum query length, where each coordinate holds a unique identifier for the trigram at that position. The network then uses an embedding layer to obtain a vector representation of each trigram. The attention mechanism applies a different linear transformation to each trigram vector to obtain a score, and a softmax over these scores yields the corresponding weights. The final query vector is the weighted average of the trigram vectors under the attention weights.
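A minimal PyTorch-style sketch of this encoder is given below. For simplicity it scores every trigram with a single shared linear layer, whereas the description above allows a different (e.g. position-specific) transformation per trigram; the vocabulary size and dimension are illustrative.

```python
import torch
import torch.nn as nn

class AttentionQueryEncoder(nn.Module):
    """Sketch of the attention model: embed each trigram id, score it with a
    linear layer, softmax the scores, and return the weighted average."""

    def __init__(self, vocab_size=50000, dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.score = nn.Linear(dim, 1)

    def forward(self, trigram_ids):            # (batch, max_len) integer ids
        vecs = self.embed(trigram_ids)          # (batch, max_len, dim)
        scores = self.score(vecs).squeeze(-1)   # (batch, max_len)
        scores = scores.masked_fill(trigram_ids == 0, float("-inf"))  # ignore padding
        weights = torch.softmax(scores, dim=-1)            # attention weights
        return (weights.unsqueeze(-1) * vecs).sum(dim=1)   # (batch, dim)

encoder = AttentionQueryEncoder()
ids = torch.randint(1, 50000, (2, 40))  # two queries, up to 40 trigrams each
print(encoder(ids).shape)               # torch.Size([2, 300])
```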

While we always use the same loss function, we have a few variants in our experiments. First, we consider two positive sampling methods while training: uniformly sampling from neighbors, and sampling using the GraphSAGE approach of running multiple fixed-length random walks from each node and using all co-occurring pairs of nodes as positive samples. Since the GraphSAGE sampling performs slightly better, we only present those results in Table 1. The second variant additionally provides word data on top of trigram data. This variant performs slightly worse than providing only the trigrams, which we believe is because the trigram-only model is better equipped to handle typographical errors: the word model wastes some attention weight on words, which may contain such errors, thereby hurting performance.
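A sketch of the GraphSAGE-style positive sampler is shown below; the walk count and walk length are illustrative hyperparameters.

```python
import random
from itertools import combinations

def random_walk_positive_pairs(adjacency, walks_per_node=5, walk_length=3, seed=0):
    """Run short fixed-length random walks from every node and emit all
    co-occurring node pairs on each walk as positive samples
    (GraphSAGE-style sampling; walk count and length are illustrative)."""
    rng = random.Random(seed)
    pairs = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length and adjacency[walk[-1]]:
                walk.append(rng.choice(adjacency[walk[-1]]))
            pairs.extend((a, b) for a, b in combinations(set(walk), 2))
    return pairs

graph = {"q1": ["q2"], "q2": ["q1", "q3"], "q3": ["q2"]}
print(random_walk_positive_pairs(graph)[:5])
```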

2.3 Results

Dataset            Model            Query Precision@20   Product Recall@20   F1
Top-25 Clusters    Trigram Hash     45.6%                60.3%               47.6%
Top-25 Clusters    LSTM only        43.2%                55.9%               44.4%
Top-25 Clusters    LSTM+node2vec    59.8%                64.0%               57.2%
Top-25 Clusters    Attention        65.9%                70.8%               65.3%
Electronics        Trigram Hash     37.22%               51.85%              41.62%
Electronics        Attention+Word   52.22%               61.41%              55.02%
Electronics        Attention        59.44%               68.20%              62.20%

Table 1: Experimental results of the various models. Query Precision and Product Recall are reported as percentages of the best possible scores on the test dataset, found by brute-force search.

On the Top-25 Clusters dataset, the LSTM models performed reasonably comparably to the attention models. On the primary dataset, however, the LSTM models performed very poorly. We conjecture that the LSTM models were memorizing textual information for the top queries; since the Top-25 Clusters dataset has only 25 popular clusters, the LSTM model was able to predict the correct product class fairly easily. This strategy became useless on the larger primary dataset due to infeasible LSTM training times. The larger and sparser dataset lets the attention model really shine.

The attention models performed much better than any of the other neural models. Given that the attention model and loss function are agnostic to the metric, one reason that the Top-25 Clusters performance is superior to the Electronics performance could be that clustering enhances the diversity component of the F1 score. Interestingly, the baseline Trigram Hash model was the runner-up.

3 AttEST: Model and Theoretical Results

In the AttEST model, query generation is viewed as a two-step process: the user first thinks of a product to search for, and then generates a query based on that product. Let the latent space be the unit sphere in ℝ^d. A product is selected by sampling a vector v uniformly from this sphere. A query q is generated by synthesizing an ordered sequence of n-grams, where the sequence length is determined by sampling from a Poisson distribution truncated at the maximum query length. The e-commerce vocabulary is on the order of tens of billions of tokens, including the regular English lexicon as well as brands, models, ISBN codes, product codes, etc. The concatenative morphology of English (e.g., 'antigovernment' is the sum of the morphemes 'anti', 'govern', and 'ment') and of various codes (e.g., ISBNs) allows us to derive meaning from the constituent n-grams. We use trigrams in our experiments because the memory requirements of our data set prohibit larger n-grams. Let T be an isotropic set of vectors representing all possible trigrams. The i-th trigram t_i of q is sampled from the following mixture distribution on T:

    Pr[t_i = t | v] = (1 − α_i) · exp(⟨t, v⟩ / σ_i) / Z_{σ_i}(v) + α_i / |T|,    (1)

where Z_{σ_i}(v) = Σ_{t′ ∈ T} exp(⟨t′, v⟩ / σ_i) is the partition function for the exponential distribution in the mixture, α_i is the mixture parameter, and σ_i is the positional spread parameter. In a mild abuse of terminology we will use trigram and product to refer to their corresponding vectors.

The exponential component of the mixture samples trigrams near the product, while the uniform component models noise in the generative process. We follow the log-linear model introduced by Arora et al. (2016) but differ from it in several key ways to accommodate specific characteristics of (short-text) queries that are not found in (longer-form) written language. Our changes help model query generation in a natural manner: a user searches for a product by listing the attributes associated with it, and while most trigrams are very relevant to the product, there will inevitably be some that introduce noise into the query. The position-dependent mixing and spread parameters control how the noise changes depending on where in the query the trigram is; generally, the query becomes noisier the longer it gets. We make the simplifying assumption that the trigrams are all sampled independently of each other, which is not true in practice.

To generate the graph, queries are first generated according to the mixture in Equation 1, and each query gets an associated vertex labeled by its trigrams. Two queries are adjacent in the graph if the product vectors that generated them are within distance δ of each other, for some parameter δ.
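The generative process can be summarized by the following sketch, which samples a product vector and then a query according to the mixture in Equation 1; the truncated-Poisson mean, the toy parameter values, and the exact parameterization of the spread are assumptions made for illustration.

```python
import numpy as np

def sample_query(trigram_vecs, alphas, sigmas, max_len=40, seed=0):
    """Sample one AttEST query: pick a product vector uniformly on the sphere,
    then draw each trigram from a mixture of a log-linear component centered
    on the product and a uniform (noise) component."""
    rng = np.random.default_rng(seed)
    d = trigram_vecs.shape[1]
    product = rng.standard_normal(d)
    product /= np.linalg.norm(product)                  # uniform on the unit sphere
    length = min(max(1, rng.poisson(10)), max_len)      # truncated Poisson length
    query = []
    for i in range(length):
        if rng.random() < alphas[i]:                    # uniform (noise) component
            query.append(rng.integers(len(trigram_vecs)))
        else:                                           # log-linear component
            logits = trigram_vecs @ product / sigmas[i]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                        # normalize by the partition function
            query.append(rng.choice(len(trigram_vecs), p=probs))
    return product, query

T = np.random.default_rng(1).standard_normal((1000, 50))  # toy isotropic trigram vectors
n = 40
product, q = sample_query(T, alphas=np.linspace(0.05, 0.3, n), sigmas=np.linspace(0.5, 2.0, n))
print(len(q), q[:5])
```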

3.1 Attention

We state some basic properties of the mean, variance and partition function of the trigrams sampled by the AttEST model (Equation 1).

Lemma 3.1.

The mean of the trigram sampled in the i-th position is a position-dependent scaling of the product vector v (so that, suitably rescaled, it is an unbiased estimator of v).

Lemma 3.2.

The expected squared ℓ2-distance of the i-th trigram from its mean is an explicit function of the position parameters α_i and σ_i.

For large vocabularies (|T| large), the partition function Z_{σ_i}(v) can be approximated by a constant Z that does not depend on the product vector v.

Lemma 3.3 (Concentration of partition functions, Lemma 2.1 from Arora et al. (2016)).

For trigram vectors of the form t = s · t̂, where t̂ comes from a spherical Gaussian distribution and s is a bounded scalar random variable, there exists a constant Z such that, for all but a negligible fraction of product vectors v, (1 − ε) Z ≤ Z(v) ≤ (1 + ε) Z for a small ε.

3.2 Low rank of PMI

Theorem 3.1.

Let q1 and q2 be query vectors generated by the AttEST model, and denote by p(q1, q2) the probability that q1 and q2 co-occur (i.e., are adjacent) in the query graph. Then log p(q1, q2) is determined by the embeddings q1 and q2 alone, up to a small additive error.

Proof sketch.

Start by averaging the probability of the event that q1 and q2 co-occur over all pairs of product vectors that are close enough to induce an edge in the query graph.

The resulting expression contains two products of per-trigram probabilities drawn from the mixture distribution. To complete the proof, we take the following steps. First, we use Lemma 3.3 to factor out the partition functions. Next, we show that the uniform component of the mixture can be ignored without incurring too much error. Finally, we remove the dependence on the second product vector by exploiting its closeness to the first, using the assumption that co-occurring queries are generated by nearby products. These steps allow us to complete the calculation and finish the proof. ∎

Theorem 3.2.

Let q be a query vector generated by the AttEST model. Then log p(q), the log-probability of generating q, is likewise determined by the embedding q alone, up to a small additive error.

The proof of Theorem 3.2 follows a similar argument to Theorem 3.1 and together they imply:

Corollary 3.1.

For queries q1 and q2 generated by the AttEST model, PMI(q1, q2) = ⟨q1, q2⟩ up to a small error term.

Since the query vectors are d-dimensional, the corollary shows that the PMI matrix has rank (approximately) d.
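As an illustration of this low-rank behavior, one can form an empirical PMI matrix from co-occurrence counts and inspect its spectrum; the construction below uses synthetic low-dimensional query vectors and is purely illustrative.

```python
import numpy as np

def pmi_matrix(cooccurrence):
    """PMI(i, j) = log( p(i, j) / (p(i) p(j)) ), computed from a symmetric
    co-occurrence count matrix with a small floor to avoid log(0)."""
    total = cooccurrence.sum()
    joint = cooccurrence / total
    marginal = joint.sum(axis=1, keepdims=True)
    return np.log(np.maximum(joint, 1e-12) / (marginal @ marginal.T))

# Toy check: co-occurrences generated from low-dimensional query vectors
# should yield a PMI matrix whose spectrum decays quickly.
rng = np.random.default_rng(0)
queries = rng.standard_normal((200, 10))        # 200 queries, d = 10
counts = np.exp(queries @ queries.T)            # heavier co-occurrence for aligned queries
singular_values = np.linalg.svd(pmi_matrix(counts), compute_uv=False)
# Most of the spectral mass sits in roughly the first d directions
# (plus a few low-order correction terms).
print(singular_values[:12] / singular_values[0])
```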

3.3 Loss function derivation

We now provide theoretical justification for the cross-entropy loss used in the AttEST attention model. Let Q be the set of all empirical queries, and for a query q ∈ Q let N(q) (respectively N̄(q)) denote the queries adjacent (respectively not adjacent) to q in the query-query graph.

To derive the loss function, we start by maximizing the mutual information (MI) between the marginal distributions of the two endpoints of the edge distribution, while minimizing the MI between the marginal distributions of the two endpoints of the non-edge distribution. This ensures that the amount of information a query embedding provides about its related queries (and only its related queries) is maximized. Let X and Y be the marginal distributions of the first and second vertex of an edge sampled from the edge distribution, and let X′ and Y′ be the corresponding marginal endpoint distributions for non-edges. Recall the definition of mutual information:

    I(X; Y) = Σ_{x, y} p(x, y) log [ p(x, y) / ( p(x) p(y) ) ].

We maximize the mutual information between X and Y while minimizing it between X′ and Y′. Equivalently, we can maximize the exponential of this objective.

After some algebraic manipulation and approximations (see Appendix), the objective reduces to sums over edges and non-edges of terms involving inner products of query embeddings, where the last step follows from Corollary 3.1. We can remove the exponential due to monotonicity. Finally, over a small range the sigmoid function is well-approximated by an exponential, which lets us take the logarithm of a sigmoid for each term in the sums and arrive at the cross-entropy loss function.

3.4 Experimental validation via trigram variance

Intrigued by the success of the attention mechanism, we ran several statistical analyses on the attention weights. An immediate observation was that the curve of attention weights by trigram position exhibits a downward trend. This matches the intuition that the importance, or information content, of trigrams early in the query is high, while trigrams at the end of a long query are less useful for inferring the product the user has in mind. The initial oscillations also suggest that users of search engines tend to order their descriptors from more important to less (e.g., 'iphone white 32gb' is preferred over '32gb white iphone'). However, this qualitative link between the semantics of search and the empirical attention weights does not immediately suggest a quantitative link to our theoretical AttEST model. Our first inkling of an explanation came when we noticed that the sequence of attention weights correlates strongly with the inverse of their variance, which we explain now.

Let X_1, …, X_n be independent, real-valued random variables drawn from different distributions, such that all distributions have the same expectation μ but (possibly different) variances Var(X_1), …, Var(X_n). A special case of the well-known Gauss–Markov theorem states that:

Lemma 3.4.

The best linear unbiased estimator (BLUE) of μ is the weighted average Σ_i w_i X_i, where w_i is proportional to 1/Var(X_i) and the weights are normalized to sum to one.

When we plot the empirical attention weights and the BLUE weights, calculated using Lemma 3.4 but based on the empirical variance of the attention weights, we see a remarkably good fit (see Figure 1). This led us to the hypothesis that attention weights essentially function as BLUE weights, enabling more accurate inference of the product (vector) from the trigrams.
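The variance-reduction effect of inverse-variance weighting is easy to reproduce in simulation; the sketch below uses the linear-variance assumption of Section 3.4 (Var(X_i) proportional to the position i) and synthetic data, not the data behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 40, 20000
variances = np.arange(1, n + 1, dtype=float)    # Var(X_i) grows linearly with position i
sigma = np.sqrt(variances)

samples = rng.standard_normal((trials, n)) * sigma   # mean 0, Var(X_i) = i
uniform_estimates = samples.mean(axis=1)             # unweighted average

blue_weights = (1 / variances) / (1 / variances).sum()  # w_i proportional to 1/Var(X_i)
blue_estimates = samples @ blue_weights                  # BLUE-weighted average

print(uniform_estimates.var())   # stays near (n + 1) / (2n), about 0.5
print(blue_estimates.var())      # near 1 / H_n (about 0.23 here), noticeably smaller
```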

From Lemma 3.1 we see that the mean of the trigram vectors (suitably scaled) is an unbiased estimator of the product, but how exactly do the variances predicted by the AttEST model manage to fit the empirical variances? Recall from Equation 1 that we have two parameters for each trigram position i, the noise (mixture) parameter α_i and the spread parameter σ_i; that is, the AttEST model has two degrees of freedom with which to fit the variance value at each position. To validate the observed fit of the AttEST model, we fixed one of the two parameter families to a constant across positions and assumed a simple linear form for the variances, i.e., variance growing linearly with position; by Lemma 3.2 this implies a simple fixed form for the other family. The linear growth of variance with trigram position is consistent with the observation that queries tend to get more discursive and ramble the longer they go on. It is also consistent with the practice of FKMR (fewer keywords, more results) enshrined in search engines, where later parts of queries are preferentially dropped in order to provide meaningful results. Figure 1 shows an excellent fit between the empirical attention variance (scatter plot) and the theoretical AttEST variance (idealized line) using this linear model of growth in variance with trigram position.

Finally, in Figure 1 we plot the per-position parameter values chosen so as to perfectly match the empirical attention variance with the idealized variance line; these values can be thought of as the residuals. Recall that this parameter captures how strongly the chosen trigram at that position aligns with the product vector. In the figure, we notice an initial high fluctuation with a peak roughly between positions 10 and 20, which corroborates the intuition that the most relevant keywords appear once the first word has narrowed the category and the second word has focused the query onto the product. Further, looking at the tail end, we see that this importance goes down and the curve flattens, indicating that the later keywords are not as relevant to determining the product.

Given that the attention weights behave as the BLUE weights, we now provide a quantitative justification for why the weighted average outperforms the unweighted average in terms of the variance of the inferred product (vector).

Lemma 3.5.

Let the variance of the trigram in position i be proportional to i. Then the variance of the unweighted query vector is bounded away from zero, while the variance of the weighted query vector is O(1/log n), where n is the query length.

Proof.
  1. The variance of the unweighted query vector is the variance of the uniform average of the n trigram vectors.

  2. The variance of the weighted query vector is the variance of the average taken with the BLUE weights of Lemma 3.4.

For the unweighted case, the uniform weights give a variance that converges to a constant bounded away from zero as n grows. For the weighted case, the variance is proportional to 1/H_n, where H_n is the n-th Harmonic number, which is approximately ln n; in particular, the variance is on the order of 1/ln n and vanishes as the query length grows. ∎
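For completeness, here is a worked version of the two variance calculations under the assumption Var(X_i) = c·i (the symbols are ours):

```latex
% Variance of the unweighted (uniform-weight) estimator:
\operatorname{Var}\!\Big(\tfrac{1}{n}\sum_{i=1}^{n} X_i\Big)
  = \frac{1}{n^2}\sum_{i=1}^{n} c\,i
  = \frac{c\,(n+1)}{2n} \;\xrightarrow{\;n\to\infty\;}\; \frac{c}{2}.

% Variance of the BLUE (inverse-variance weighted) estimator:
\operatorname{Var}\!\Big(\sum_{i=1}^{n} w_i X_i\Big)
  = \Big(\sum_{i=1}^{n} \frac{1}{c\,i}\Big)^{-1}
  = \frac{c}{H_n} \;\approx\; \frac{c}{\ln n} \;\xrightarrow{\;n\to\infty\;}\; 0,
\qquad w_i = \frac{1/(c\,i)}{\sum_{j} 1/(c\,j)}.
```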

Lemma 3.5 shows that, with growing query length, the variance with attention weights vanishes to zero whereas the variance of the unweighted (or uniformly weighted) case remains at a constant bounded away from 0. In other words the explanation for the success of attention mechanisms is that they provide an efficient method to reduce the variance (increase concentration) in the estimation of the ground truth.

Figure 1: From left to right: The left plot shows the weights produced by the attention mechanism and the BLUE weights predicted by Lemma 3.4; weights are averaged over all queries and shown by the position of the trigram in the query. The middle plot is a scatter plot of the variance predicted by theory (without adjusting the per-position parameter values) alongside the variance from the data. The best linear fit, shown by the Ideal Variance curve, is achieved by using the parameter values shown in the right plot.

4 Conclusion

Predicting edges in a graph from node attributes is a general problem beyond e-commerce. Future work can consider alternative datasets with, for example, images as node attributes. However, studying alternative datasets requires careful adaptation of the theoretical analysis in this work, as the AttEST model relies on the following properties of e-commerce data sets: node attributes are short texts, the vocabulary is very large, and the weight of a trigram correlates with its position. Analogues of these properties must be identified when analyzing alternative datasets. For instance, would the decomposition of text into trigrams work well in languages without concatenative morphology, such as Arabic or Hebrew?

References

  • [1] S. Arora, Y. Li, Y. Liang, T. Ma, and A. Risteski (2016) RAND-WALK: a latent variable model approach to word embeddings. Transactions of the Association for Computational Linguistics (TACL).
  • [2] S. Arora, Y. Liang, and T. Ma (2017) A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations (ICLR).
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations (ICLR).
  • [4] C. De Boom, S. Van Canneyt, T. Demeester, and B. Dhoedt (2016) Representation learning for very short texts using weighted word embedding aggregation. Pattern Recognition Letters 80, pp. 150–156.
  • [5] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science 41 (6), pp. 391–407.
  • [6] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pp. 4171–4186.
  • [7] P. Embrechts, C. Klüppelberg, and T. Mikosch (1997) Modelling extremal events for insurance and finance. Springer, Berlin Heidelberg.
  • [8] A. Fout, J. Byrd, B. Shariat, and A. Ben-Hur (2017) Protein interface prediction using graph convolutional networks. In Advances in Neural Information Processing Systems, pp. 6530–6539.
  • [9] A. Gasull, J. A. López-Salcedo, and F. Utzet (2015) Maxima of Gamma random variables and other Weibull-like distributions and the Lambert W function. Test 24 (4), pp. 714–733.
  • [10] A. Grover and J. Leskovec (2016) node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 855–864.
  • [11] W. L. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems (NIPS), pp. 1025–1035.
  • [12] S. Jain and B. C. Wallace (2019) Attention is not explanation. In Proceedings of NAACL-HLT 2019, pp. 3543–3556.
  • [13] D. Kiela and L. Bottou (2014) Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 36–45.
  • [14] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).
  • [15] B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Associating neural word embeddings with deep image representations using Fisher vectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4437–4446.
  • [16] B. Laurent and P. Massart (2000) Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28 (5), pp. 1302–1338.
  • [17] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML), pp. 1188–1196.
  • [18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. CoRR abs/1907.11692.
  • [19] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR.
  • [20] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal (2017) graph2vec: learning distributed representations of graphs. CoRR abs/1707.05005.
  • [21] J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • [22] R. L. Santos, C. Macdonald, and I. Ounis (2013) Learning to rank query suggestions for adhoc and diversity search. Information Retrieval 16 (4), pp. 429–451.
  • [23] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar (2019) A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning (ICML), pp. 5628–5637.
  • [24] D. Sheldon, M. Shokouhi, M. Szummer, and N. Craswell (2011) LambdaMerge: merging the results of query reformulations. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (WSDM), pp. 795–804.
  • [25] S. Shen, Z. Dong, J. Ye, L. Ma, Z. Yao, A. Gholami, M. W. Mahoney, and K. Keutzer (2019) Q-BERT: Hessian based ultra low precision quantization of BERT. CoRR abs/1909.05840.
  • [26] P. D. Turney and P. Pantel (2010) From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37 (1), pp. 141–188.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • [28] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations (ICLR).
  • [29] C. Wu, A. Ahmed, G. R. Kumar, and R. Datta (2017) Predicting latent structured intents from shopping queries. In Proceedings of the 26th International Conference on World Wide Web (WWW), pp. 1133–1141.
  • [30] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, and J. Leskovec (2018) Graph convolutional neural networks for web-scale recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), pp. 974–983.

Appendix A Proofs

Proofs are organized by paper sections.

A.1 Trigram Properties

Lemma A.1.

For every position i, the mean of a trigram sampled from the mixture distribution in Equation 1 over the set of all trigrams T is a position-dependent scaling of the product vector v.

Proof.

Start by expanding the inner expectation over the mixture distribution.

Note that the right term is equal to zero, since the trigram vectors are sampled from a spherical Gaussian centered at the origin. The left term depends only on a single trigram, so we can change the expectation to be over a single vector sampled from a spherical Gaussian and replace the sum with a factor of |T|.

We take the orthogonal decomposition of the trigram vector into components parallel and perpendicular to the product vector; this allows us to set the contribution of the perpendicular component to 0.

We can factor the expectation in the last line, since the perpendicular component is independent of the parallel one. The perpendicular component is a linear transformation of a mean-zero Gaussian and is therefore itself a mean-zero Gaussian, so the first term in the sum vanishes. The parallel component is a rank-one linear transformation of the original Gaussian, so we can compute its expectation as that of a one-dimensional Gaussian.

Lemma A.2.

The expected squared ℓ2-distance of a sampled trigram from its mean can be computed explicitly in terms of the position parameters α_i and σ_i.

Proof.

Without loss of generality, we assume that the product vector is the first standard basis vector.

The first expectation term can be dealt with easily enough: the second term is zero since that expectation is over a standard normal random variable, and the sum in the first term can be replaced with a factor of |T| since each summand has the same value. The remaining expectation is over a standard normal variable and can be calculated explicitly, just as in the proof of Lemma A.1.

That leaves only the expectation of the squared norm of a trigram, where the final step follows from the well-known approximation of block maxima by the Gumbel distribution ([7], p. 156). The mean of the Gumbel distribution is its location parameter plus the Euler–Mascheroni constant times its scale parameter, and the stated equality then follows from Stirling's approximation. [9] suggests nonstandard approximation terms which converge much faster to the Gumbel distribution in our parameter regime.

Combining all of the above yields the claimed expression for the expected squared distance of a trigram from its mean. ∎

Lemma A.3 (Concentration of partition functions, Lemma 2.1 from [1]).

For trigram vectors of the form t = s · t̂, where t̂ comes from a spherical Gaussian distribution and s is a bounded scalar random variable, there exists a constant Z such that, for all but a negligible fraction of product vectors v, (1 − ε) Z ≤ Z(v) ≤ (1 + ε) Z for a small ε.

A.2 Low rank of PMI

Theorem A.1.

Let q1 and q2 be query vectors generated by the AttEST model, and denote by p(q1, q2) the probability that q1 and q2 co-occur in the query graph. Then log p(q1, q2) is determined by the embeddings q1 and q2 alone, up to a small additive error.

Theorem A.2.

Let q be a query vector generated by the AttEST model. Then log p(q), the log-probability of generating q, is likewise determined by the embedding q alone, up to a small additive error.

The proof of Theorem A.2 follows a similar argument to Theorem A.1 and together they imply:

Corollary A.1.

For queries q1 and q2 generated by the AttEST model, PMI(q1, q2) = ⟨q1, q2⟩ up to a small error term.

To prove Theorem A.1, we make use of the following lemmas.

Lemma A.4 (Length of spherical Gaussian concentrates).

Let x be a spherical Gaussian vector in d dimensions. Then the squared norm of x concentrates around d, with sub-exponential tail bounds in both directions.

Proof.

The above lemma is simply a reparametrization of the corollary of Lemma 1 from [16]. For the first bound, one can solve for the deviation parameter to get the exact relation between the two formulations. For the second bound, one solves in the same way. ∎

Lemma A.5.

We can remove the dependence of the partition functions on the product vectors, replacing each partition function by the constant Z of Lemma A.3 at the cost of a small multiplicative error.

Proof.

Lemma A.3 implies that, with probability close to 1, each partition function lies within a (1 ± ε) factor of the constant Z.

Using a first-order approximation for small ε, we can replace the first partition function by Z, and similarly for the other term, completing the proof. ∎

This can be proven using Lemma A.3 following an argument similar to the one in the proof of Theorem 2.1 in [1].

Lemma A.6 (Ignoring the uniform component).

If the model parameters satisfy a mild condition, then with high probability the uniform component can be ignored without incurring too much error. Formally, splitting each trigram probability into its exponential component and its uniform component, the contribution of the uniform component is negligible with high probability.

We denote by G the event on which the uniform component can be ignored.

Lemma A.7 (Lemma A.5 from [1]).

Let x be a fixed vector whose norm is at most a constant multiple of √d. Then, for a random variable c with uniform distribution over the sphere, the expectation of exp(⟨x, c⟩) concentrates around a constant, up to a small multiplicative error (2).

Proof of Theorem a.1.

Our proof of the co-occurrence probability initially follows the structure of the proof of Theorem 2.2 in [1], particularly for the concentration of the partition functions. After that point, however, our proof crucially diverges in order to deal with the mixture distributions of multiple trigrams. First, using the law of total expectation, we write the probability of co-occurrence in terms of the probability of the two queries being sampled from product vectors that are within distance δ of each other.

The second step is valid since the only property of the second product vector that we use is that it lies in an ℓ2-ball of radius δ around the first. We now use Lemma A.5 to remove the dependence of the partition functions on the product vectors and to get the following.

Let A be the term inside the expectation above, and let B be the same term without the uniform components. We condition on the event G from Lemma A.6; when G occurs, Lemma A.6 allows us to ignore the uniform component.

Some algebraic manipulation shows that conditioning on G allows us to focus on the term B alone. Note that for practical purposes it is equal to 1 at values of