On the Dimensionality of Embeddings for Sparse Features and Data

Maxim Naumov

01/07/2019

In this note we discuss a common misconception, namely that embeddings are always used to reduce the dimensionality of the item space. We show that when we measure dimensionality in terms of information entropy, the embedding of sparse probability distributions, which can be used to represent sparse features or data, may or may not reduce the dimensionality of the item space. However, the embeddings do provide a different and often more meaningful representation of the items for the particular task at hand. Also, we give upper bounds and more precise guidelines for choosing the embedding dimension.


1 Background

The mapping of concepts, objects or items into a vector space is called an embedding. The representation of n items in a d-dimensional vector space is used in many applications, which can broadly be split into two categories.

The first category is characterized by models that produce embeddings. In this case the embeddings represent the information obtained by the model about the item space. If the model represents a probability distribution over items in the dataset then we can interpret the output as an embedding of this probability distribution.

For instance, singular value decomposition Golub & Van Loan (2012) is used in latent semantic analysis/indexing (LSA/LSI) Dumais (2005) to produce low-rank approximations of the original data. In this case a low-rank approximation of the word-document matrix A ∈ R^{m×n} for m words and n documents, where each row corresponds to a word and each column corresponds to a document, while non-zero entries represent the occurrence of words in documents, can be written as

A \approx \tilde{A} = W H^T    (1)

where the matrices W ∈ R^{m×d} and H ∈ R^{n×d} can be interpreted as embeddings of words and documents in a d-dimensional space. A similar interpretation can be given to matrix factorization for collaborative filtering, with the caveat that empty matrix entries are unknown (rather than being 0) Koren et al. (2009); Frolov & Oseledets (2017).

Also, deep learning models based on auto-encoders Sedhain et al. (2015), multi-label classification Bengio et al. (2010) and neural machine translation (NMT) Neubig (2017) can be seen as generating a probability distribution over a set of classes/objects. Let this probability distribution be represented as a vector p ∈ R^n. Notice that the embeddings produced by these models, such as embeddings of target language words in NMT, represent an embedding of this probability distribution over n items into a d-dimensional space

\tilde{p} = W^T p    (2)

for some embedding matrix W ∈ R^{n×d}. Notice that during training we often start with an arbitrary probability distribution and therefore the vector p is dense. However, as training converges it is common for the probability distribution to concentrate at particular items and therefore the vector p becomes sparse (and if needed it can be quantized to eliminate a long tail of small values). It will become clear in this note that the information the embedding needs to represent varies greatly based on the signature of the vector p. In particular, when p is very sparse and its elements can be quantized to a few values, the embedding dimension needed to represent the information in it can be relatively small.
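As a concrete illustration, here is a minimal sketch (in NumPy, with hypothetical sizes n and d and hypothetical probability values) that computes the entropy of a dense, near-uniform distribution and of a sparse, quantized one, and forms the corresponding embedding of each distribution as in (2).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 16                       # hypothetical item count and embedding size
W = rng.normal(size=(n, d))           # embedding matrix W from (2)

def entropy_bits(p):
    """Shannon entropy of a probability vector, in bits."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Early in training: an (almost) uniform, dense distribution over all n items.
p_dense = np.full(n, 1.0 / n)

# Near convergence: mass concentrated on a few items, the long tail quantized away.
p_sparse = np.zeros(n)
p_sparse[[3, 42, 7]] = [0.7, 0.2, 0.1]

for name, p in [("dense", p_dense), ("sparse", p_sparse)]:
    p_tilde = W.T @ p                 # embedding of the distribution, as in (2)
    print(f"{name}: H(p) = {entropy_bits(p):6.2f} bits, embedding shape = {p_tilde.shape}")
```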

The second category is characterized by models that consume embeddings. In this case embeddings represent information obtained from the input features or data. It is common for these input features to be sparse, such as the history of user clicks on webpages or words present in a post. The sparse features are often represented by a list of integer indices that select items from a larger sequence/set (in contrast to dense features that are represented by a single floating point number). They can also be represented by a sparse vector p as will be shown in the next section.

For instance, neural collaborative filtering He et al. (2017), wide & deep Cheng et al. (2016) and deep & cross recommendation systems Wang et al. (2017) all use embeddings to process sparse input features. The advantage of using embeddings instead of lists of sparse items is that we can measure distances between them in a more meaningful way. Also, notice that the embedding elements represent the sparse features in some abstract space relevant to the model at hand, while the integer indices simply represent an ordering of the input data.

The natural language processing models Kalchbrenner & Blunsom (2013); Sutskever et al. (2014) may fall somewhere in between these two categories. In particular, NMT models Neubig (2017) often use two embeddings, one representing words in the source and another in the target language. On one hand, the source embedding can be seen as a sparse feature consumed by the model. It is characterized by a list of indices that selects words used in the input sentence. On the other hand, the target embedding can be seen as a representation of the probability distribution over words in the target language.

In this note we focus on the embedding of sparse vectors p that can be used to represent sparse features and data, which belong to the second category. We point out that the implications of choosing a particular value of the embedding dimension d, a hyper-parameter, are not theoretically well understood. The choice is usually based on empirical experiments or resource limitations, such as the compute and memory available on hardware platforms Park et al. (2018). A recent work attempted to explain the sizing of the embedding vectors based on a pairwise inner product dissimilarity metric Yin & Shen (2018).

We propose an alternative approach based on the entropy information measure. We leverage the ideas of Tishby et al. (1999); Shwartz-Ziv & Tishby (2017) as well as Traub & Wozniakowski (1980); Pinkus (1985); Donoho (2006), but in contrast to them we do not attempt to explain the behavior of neural networks or find ways to compress the parameters/data. We use the information measure to discuss the dimensionality and provide guidelines on the sizing of the embedding vectors, i.e. we provide roofline and more precise models for selecting the dimensionality d.

2 Introduction to embeddings

Let n items be mapped into a d-dimensional vector space. The vectors corresponding to the items are often organized into an embedding table, which can be seen as a tall matrix W ∈ R^{n×d} with n ≫ d that can be written as

W^T = [w_1, w_2, \ldots, w_n]    (3)

where the vector w_i ∈ R^d corresponds to the i-th item.

A sparse feature is characterized by a list of integer indices, which can be represented as item lookups with different signatures in the embedding table.

The item lookup with a single index i is often encoded as a dense matrix-vector multiplication

\tilde{w} = W^T a    (4)

where the vector a ∈ R^n and

a = e_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T    (5)

has a 1 in the i-th position and 0 everywhere else. It is often referred to as a one-hot encoding vector.
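As a quick sanity check, the following minimal sketch (in NumPy, with small hypothetical values of n, d and i) verifies that the matrix-vector formulation (4) with the one-hot vector (5) is just a row selection from the embedding table.

```python
import numpy as np

n, d = 10, 4                                       # hypothetical item count and embedding size
W = np.arange(n * d, dtype=float).reshape(n, d)    # embedding table W from (3)

i = 6
a = np.zeros(n)
a[i] = 1.0                                         # one-hot encoding vector, eq. (5)

w_tilde = W.T @ a                                  # dense matrix-vector lookup, eq. (4)
assert np.allclose(w_tilde, W[i])                  # identical to reading the i-th row of W
```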

Notice that we can select multiple items with some weights in a single lookup and express it as

\tilde{w} = W^T a    (6)

where the vector a ∈ R^n and

a = [0, \ldots, \alpha_{i_1}, \ldots, \alpha_{i_k}, \ldots, 0]^T    (7)

has weight \alpha_{i_j} in the i_j-th position and 0 everywhere else Jia et al. (2014); Paszke et al. (2017).

Further, we can generalize this to multiple lookups, each selecting multiple items with some weights, and encode it as a dense matrix-sparse matrix multiplication

\tilde{W} = W^T A    (8)

where the sparse matrix A ∈ R^{n×ℓ} and

A = [a_1, a_2, \ldots, a_\ell]    (9)

is composed of multiple vectors a_j, each corresponding to a single lookup, with non-zero elements corresponding to the items being selected. (Note that the vector subscript here has a different meaning than before: in (9) it denotes the j-th lookup, while in (5) it denoted the i-th item being selected.) The output matrix, of size d × ℓ, is the result of the ℓ lookups.
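The sketch below (in NumPy/SciPy, with hypothetical lookup signatures) assembles the sparse matrix A of (9) and checks that the dense matrix-sparse matrix product (8) coincides with gathering the selected rows of the embedding table and forming their weighted sums, which is how such lookups are often implemented in practice.

```python
import numpy as np
from scipy.sparse import csc_matrix

rng = np.random.default_rng(0)
n, d, ell = 1000, 8, 3                # items, embedding size, number of lookups
W = rng.normal(size=(n, d))           # embedding table W from (3)

# Each lookup selects a few items with weights (the signature of a sparse feature).
lookups = [
    ([5, 17, 42],       [1.0, 1.0, 1.0]),
    ([7],               [1.0]),
    ([3, 900, 17, 250], [0.5, 0.25, 0.15, 0.1]),
]

# Assemble the sparse matrix A = [a_1, ..., a_ell] of (9), one column per lookup.
rows, cols, vals = [], [], []
for j, (idx, wts) in enumerate(lookups):
    rows += idx
    cols += [j] * len(idx)
    vals += wts
A = csc_matrix((vals, (rows, cols)), shape=(n, ell))

W_tilde = np.asarray(A.T @ W).T       # equals W^T A of (8); shape (d, ell)

# Same result as gathering the selected rows and forming their weighted sums.
for j, (idx, wts) in enumerate(lookups):
    assert np.allclose(W_tilde[:, j], (np.array(wts)[:, None] * W[idx]).sum(axis=0))
```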

This setup is often used to state that the embedding vectors project the n-dimensional item space into d-dimensional embedding vectors. In the next section we will examine this statement in more detail and show that it can be misleading and lead to incorrect conclusions.

3 The dimensionality of embeddings

Notice that when we discussed space dimensions in the previous section, we never took into consideration the precise vector element type and how much information we can represent with it. Let us now incorporate it into our analysis by measuring the cardinality of the set a vector can describe and the information associated with it.

Recall that the entropy of an information source is given by

H = -\sum_{i} p_i \log_2 p_i    (10)

where p_i is the probability of the i-th symbol being communicated Shannon & Weaver (1949).

Notice that the embedding vector has d elements, each with s bits. Therefore, it could represent

2^{sd}    (11)

values and if we interpret it as an information source its entropy is

H(\tilde{w}) = -\sum_{j=1}^{2^{sd}} q_j \log_2 q_j    (12)

where q_j denotes the probability of the j-th value being selected. Hence, if q is uniform then

H(\tilde{w}) = \log_2 2^{sd} = sd    (13)

3.1 Single lookup of single item

Let a single lookup of a single item be done as shown in (4). In this case the i-th embedding vector w_i represents the i-th item from the item space.

Notice that because we represent this lookup as a binary vector a in (5) in the n-dimensional space, it can describe only n items and if we interpret it as an information source its entropy is

H(a) = -\sum_{i=1}^{n} p_i \log_2 p_i    (14)

where p_i denotes the probability of the i-th item being selected. Note that if p is uniform then

H(a) = \log_2 n    (15)

Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (14). For instance, under the assumption of uniform probability of item/value selection, if n = 1M then using (15) we have H(a) = \log_2 10^6 \approx 20 bits, so a single 32-bit element is enough to represent the information in the item space.

3.2 Single lookup of multiple items

Let a single lookup of multiple items be done as shown in (6). In this case the resulting vector represents a combination of k selected items from the item space.

Let us first consider the case when this lookup is a binary vector in the n-dimensional space, i.e. \alpha_{i_j} = 1 for j = 1, \ldots, k in (7). In this case, the vector can describe \binom{n}{k} items, with

\binom{n}{k} = \frac{n!}{k! \, (n-k)!}    (16)

and if we interpret it as an information source its entropy is

H(a) = -\sum_{j=1}^{\binom{n}{k}} p_j \log_2 p_j    (17)

where p_j is the probability of a combination being selected. Note that if p is uniform then

H(a) = \log_2 \binom{n}{k} = \log_2 n! - \log_2 k! - \log_2 (n-k)!    (18)

which can be evaluated using properties of logarithms and Ramanujan's approximation Ramanujan (1988)

n! \approx \sqrt{\pi} \left( \frac{n}{e} \right)^n \left( 8n^3 + 4n^2 + n + \frac{1}{30} \right)^{1/6}    (20)

that are described in more detail in the appendix.

Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (17). For instance, under the assumption of uniform probability of item/value selection, if n = 1M and k = 100 then using (18) we have H(a) \approx 1324 bits (see Tab. 1), so a vector with 64 elements of 32 bits each is enough to represent the information in the item space.
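The following sketch (Python standard library only) shows one way to evaluate (18): exactly via the log-gamma function and approximately via Ramanujan's formula (20), and then to read off the number of s-bit elements needed. Note that the exact evaluation can give somewhat larger entropies than the approximate values reported in Tab. 1.

```python
import math

LOG2E = 1.0 / math.log(2.0)

def log2_factorial(n):
    """Exact log2(n!) via the log-gamma function."""
    return math.lgamma(n + 1) * LOG2E

def log2_factorial_ramanujan(n):
    """Ramanujan's approximation (20): n! ~ sqrt(pi) (n/e)^n (8n^3 + 4n^2 + n + 1/30)^(1/6)."""
    return (0.5 * math.log(math.pi) + n * (math.log(n) - 1.0)
            + math.log(8 * n**3 + 4 * n**2 + n + 1.0 / 30.0) / 6.0) * LOG2E

def log2_binomial(n, k, log2_fact=log2_factorial):
    """Entropy roofline (18): log2 of the number of k-item subsets of n items."""
    return log2_fact(n) - log2_fact(k) - log2_fact(n - k)

n, k, s = 1_000_000, 100, 32
H_exact = log2_binomial(n, k)
H_ram = log2_binomial(n, k, log2_factorial_ramanujan)
d_min = math.ceil(H_exact / s)        # number of s-bit elements needed to hold H bits
print(f"H = {H_exact:.1f} bits (Ramanujan: {H_ram:.1f}), so d >= {d_min} at s = {s} bits")
```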

3.3 Single lookup of multiple items with weights

Let us now consider the case when this lookup is a vector in the n-dimensional space, with each of the weights \alpha_{i_j} represented by s bits for j = 1, \ldots, k in (7). Notice that in this case the analysis of the previous section remains the same, but we can now select 2^s values for each position in the lookup. Therefore, the vector can describe

\binom{n}{k} \, 2^{sk}    (21)

items. Note that if we interpret it as an information source, under the assumption that the selection is uniform, its entropy is

H(a) = \log_2 \binom{n}{k} + sk    (22)

Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (22). For instance, under the assumption of equal probability of item/value selection, continuing the previous example with n = 1M, k = 100 and weights quantized to s = 16 bits, (22) adds sk = 1600 bits to (18), so a vector with 128 elements of 32 bits each is enough to represent the information in the item space.

3.4 The effects of mini-batch

We point out that the use of a mini-batch does not affect the dimensionality analysis because each lookup vector in the mini-batch is treated independently. This can be observed in (8), where the ℓ lookups encoded in the sparse matrix A against the embedding table W generate ℓ independent columns in the output matrix.

4 Upper bounds and sizing guidelines for embeddings

Notice that embeddings corresponding to sparse features do not necessarily reduce the dimensionality of data, as measured by the information they can represent. The dimensionality reduction depends on the size of the embedding vectors and the detailed signature of the input lookup vectors.

The upper bound (roofline) dimensionality of the different types of lookups is provided in (15), (18) and (22), which can be compared with the dimensionality of embeddings given in (13). We have tabulated a few sample lookup signatures and embedding dimensions for comparison in Tab. 1.

n     k     H(a) [bits]   d (s=8)   d (s=16)   d (s=32)   s*d [bits]
1M    1     19.9          4         2          1          32
10M   1     23.2          4         2          1          32
100M  1     26.5          4         2          1          32
1M    10    163.0         24        12         6          192
10M   10    196.3         28        14         7          224
100M  10    229.5         32        16         8          256
1M    100   1324.1        168       84         42         1344
10M   100   1656.3        208       104        52         1664
100M  100   1988.5        252       126        63         2016
1M    1000  9958.0        1248      624        312        9984
10M   1000  13281.2       1664      832        416        13312
100M  1000  16603.3       2076      1038       519        16608
Table 1: Entropy of a sample set of lookup signatures (n items, k selected per lookup) and the corresponding embedding dimensions d for s bits per element
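A minimal sketch (Python standard library) of the roofline sizing rule d ≥ H/s (rounded up to an integer) over the same grid of lookup signatures; the entropies here are evaluated exactly via the log-gamma function, so the numbers may differ somewhat from the approximate values and rounding used in Tab. 1.

```python
import math

def log2_binomial(n, k):
    """Exact log2 C(n, k), the entropy roofline of (15)/(18), via the log-gamma function."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)

def embedding_dims(n, k, widths=(8, 16, 32)):
    """Roofline entropy and minimum number of s-bit elements needed to hold it."""
    H = log2_binomial(n, k)
    return H, {s: math.ceil(H / s) for s in widths}

for n in (10**6, 10**7, 10**8):
    for k in (1, 10, 100, 1000):
        H, dims = embedding_dims(n, k)
        print(f"n = {n:>9}, k = {k:>4}: H = {H:8.1f} bits, d = {dims}")
```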

Although the function (18) achieves its maximum for larger k (at k = n/2), as shown in Fig. 1(a), notice that for the sample lookup signatures in Tab. 1 the choice of k has a much larger effect on the entropy than the choice of n, as shown in Fig. 1(b). We point out that the function is plotted on a log scale in both plots.

(a) Entropy (18) as a function of k for fixed n
(b) Entropy (18) for the sample lookup signatures in Tab. 1
Figure 1: Entropy of a sample set of lookup signatures in log scale

In practice not all combinations will be exercised by the item lookups. We can make a pass over the input dataset to discover how many combinations are actually present and how often the same combinations are repeated. Then, we can use formulas (14) and (17) to estimate the dimensionality of the input dataset from the information perspective. Ultimately, this more precise measure of the dimensionality of the input dataset can guide a better choice of the embedding vector size and element type to be used in the model.
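A minimal sketch of such a pass (Python standard library, over a small hypothetical list of lookups): count how often each distinct combination of indices occurs, compute the empirical entropy of the observed lookups as in (17), and divide by the element width s to suggest an embedding size.

```python
import math
from collections import Counter

# Hypothetical input: each lookup is a list of item indices (a sparse feature value).
dataset = [
    [5, 17, 42],
    [7],
    [5, 17, 42],
    [3, 900, 17, 250],
    [7],
    [5, 17, 42],
]

# Count how often each distinct combination is exercised (order inside a lookup ignored).
counts = Counter(tuple(sorted(lookup)) for lookup in dataset)
total = sum(counts.values())

# Empirical entropy of the observed combinations, in bits (cf. (17)).
H = -sum((c / total) * math.log2(c / total) for c in counts.values())

s = 32                                   # bits per embedding element
d = max(1, math.ceil(H / s))             # suggested embedding dimension roofline
print(f"{len(counts)} distinct combinations, H = {H:.2f} bits, suggested d >= {d} at s = {s} bits")
```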

Once again we point out that although embeddings do not necessarily reduce the dimensionality of lookups, they do provide a different and very useful representation of them. Note that the embedding values are learned during training. Therefore, the embeddings are found based on what the lookups represent/mean in some abstract space relevant to the model at hand.

5 Conclusion

We have discussed embeddings corresponding to dense and sparse probability distributions. We have analyzed the dimensionality of the item lookups and embeddings corresponding to sparse features and data. We have shown that using embeddings does not necessarily reduce the dimension of the lookups, as measured in terms of the information they can represent. We have also provided rooflines and more precise guidelines for choosing the embedding size for the dataset and model at hand.

Acknowledgements

The author would like to thank Aleksandr Ulanov, Dheevatsa Mudigere, Satish Nadathur and Misha Smelyanskiy for thoughtful comments as well as Mark Tygert and Juan Miguel Pino for insightful discussions on the use of embeddings in different applications and NMT models.

References

6 Appendix

6.1 Logarithmic identities

We list a few handy logarithmic identities in this section Cormen et al. (2009). The multiplication and division operations are equivalent to

\log_b (xy) = \log_b x + \log_b y    (24)

and

\log_b (x/y) = \log_b x - \log_b y    (25)

respectively. Also, we can change the logarithm from base a to base b using the following

\log_b x = \frac{\log_a x}{\log_a b}    (26)

6.2 Approximation theory

There exist a number of approximations to the factorial function, including Stirling's approximation Le Cam (1986); Romik (2000)

n! \approx \sqrt{2 \pi n} \left( \frac{n}{e} \right)^n    (27)

and the more accurate Ramanujan approximation Ramanujan (1988); Karatsuba (2001)

n! \approx \sqrt{\pi} \left( \frac{n}{e} \right)^n \left( 8n^3 + 4n^2 + n + \frac{1}{30} \right)^{1/6}    (28)

Then, using (28) it follows that

\log_2 n! \approx n \log_2 n - n \log_2 e + \frac{1}{6} \log_2 \left( 8n^3 + 4n^2 + n + \frac{1}{30} \right) + \frac{1}{2} \log_2 \pi    (29)

Finally, notice that using logarithmic properties and (29) we can write

\log_2 \binom{n}{k} = \log_2 n! - \log_2 k! - \log_2 (n-k)! \approx n \log_2 n - k \log_2 k - (n-k) \log_2 (n-k) + \frac{1}{6} \log_2 \frac{8n^3 + 4n^2 + n + 1/30}{(8k^3 + 4k^2 + k + 1/30)(8(n-k)^3 + 4(n-k)^2 + (n-k) + 1/30)} - \frac{1}{2} \log_2 \pi

We will take advantage of this expression in this note.