1 Background
The mapping of concepts, objects or items into a vector space is called an embedding. The representation of items in a d-dimensional vector space is used in many applications, which can broadly be split into two categories.

The first category is characterized by models that produce embeddings. In this case the embeddings represent the information obtained by the model about the item space. If the model represents a probability distribution over items in the dataset, then we can interpret its output as an embedding of this probability distribution.
For instance, singular value decomposition (SVD) Golub & Van Loan (2012) is used in latent semantic analysis/indexing (LSA/LSI) Dumais (2005) to produce low-rank approximations of the original data. In this case a low-rank approximation of the word-document matrix A ∈ R^{m×n} for m words and n documents, where each row corresponds to a word and each column corresponds to a document, while nonzero entries represent the occurrence of words in documents, can be written as

(1)  A ≈ U Σ V^T

where Σ ∈ R^{d×d} is a diagonal matrix of singular values, and the matrices U ∈ R^{m×d} and V ∈ R^{n×d} can be interpreted as embeddings of words and documents in a d-dimensional space. A similar interpretation can be given to matrix factorization for collaborative filtering, with the caveat that empty matrix entries are unknown (rather than being 0) Koren et al. (2009); Frolov & Oseledets (2017).
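The low-rank factorization in (1) can be sketched with a truncated SVD; the word-document matrix below is a made-up random count matrix, and the sizes m, n and d are purely illustrative.

```python
import numpy as np

# A minimal sketch of (1): a rank-d SVD of a hypothetical word-document
# count matrix A (rows = words, columns = documents); sizes are illustrative.
rng = np.random.default_rng(0)
m, n, d = 1000, 200, 16
A = rng.poisson(0.1, size=(m, n)).astype(float)  # mostly zero counts

U, S, Vt = np.linalg.svd(A, full_matrices=False)
word_emb = U[:, :d] * S[:d]   # d-dimensional word embeddings
doc_emb = Vt[:d, :].T         # d-dimensional document embeddings

# Their product is the best rank-d approximation of A in the Frobenius norm.
A_d = word_emb @ doc_emb.T
assert np.linalg.norm(A - A_d) <= np.linalg.norm(A)
```

Each row of `word_emb` (resp. `doc_emb`) plays the role of a word (resp. document) embedding vector.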
Also, deep learning models based on autoencoders Sedhain et al. (2015), multi-label classification Bengio et al. (2010) and neural machine translation (NMT) Neubig (2017) can be seen as generating a probability distribution over a set of n classes/objects. Let this probability distribution be represented as a vector p ∈ R^n. Notice that the embeddings produced by these models, such as embeddings of target-language words in NMT, represent an embedding of this probability distribution over n items into a d-dimensional space

(2)  z = W^T p

for some embedding matrix W ∈ R^{n×d}. Notice that during training we often start with an arbitrary probability distribution, and therefore the vector p is dense. However, as training converges it is common for the probability distribution to concentrate on particular items, and therefore the vector p becomes sparse (and, if needed, it can be quantized to eliminate a long tail of small values). It will become clear in this note that the information the embedding needs to represent varies greatly based on the signature of the vector p. In particular, when p is very sparse and its elements can be quantized to a few values, the embedding dimension needed to represent the information in it can be relatively small.
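The contrast between a dense (early-training) and a sparse (late-training) distribution in (2) can be sketched as follows; the sizes n and d and the selected indices are illustrative.

```python
import numpy as np

# Sketch of (2): embedding a probability distribution p over n items into
# a d-dimensional space via z = W^T p (sizes are illustrative).
rng = np.random.default_rng(0)
n, d = 10000, 32
W = rng.standard_normal((n, d))

# Early in training: a dense, near-arbitrary distribution over all items.
p_dense = rng.random(n)
p_dense /= p_dense.sum()

# Near convergence: mass concentrated on a few items, long tail quantized to 0.
p_sparse = np.zeros(n)
p_sparse[[3, 17, 42]] = [0.7, 0.2, 0.1]

z_dense = W.T @ p_dense    # mixes information from all n embedding rows
z_sparse = W.T @ p_sparse  # just a weighted sum of three embedding rows
assert z_dense.shape == z_sparse.shape == (d,)
```

In the sparse case the product touches only the rows of W corresponding to nonzero entries of p, which is the regime this note focuses on.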
The second category is characterized by models that consume embeddings. In this case embeddings represent information obtained from the input features or data. It is common for these input features to be sparse, such as the history of user clicks on webpages or words present in a post. The sparse features are often represented by a list of integer indices that select items from a larger sequence/set (in contrast to dense features that are represented by a single floating point number). They can also be represented by a sparse vector p as will be shown in the next section.
For instance, neural collaborative filtering He et al. (2017), wide & deep Cheng et al. (2016) and deep & cross recommendation systems Wang et al. (2017) all use embeddings to process sparse input features. The advantage of using embeddings instead of lists of sparse items is that we can measure the distance between them in a more meaningful way. Also, notice that embedding elements represent the sparse features in some abstract space relevant to the model at hand, while integers simply represent an ordering of the input data.
The natural language processing models Kalchbrenner & Blunsom (2013); Sutskever et al. (2014) may fall somewhere in between these two categories. In particular, NMT models Neubig (2017) often use two embeddings: one representing words in the source language and another in the target language. On one hand, the source embedding can be seen as a sparse feature consumed by the model. It is characterized by a list of indices that selects the words used in the input sentence. On the other hand, the target embedding can be seen as a representation of the probability distribution over words in the target language.

In this note we focus on the embedding of sparse vectors p that can be used to represent the sparse features and data that belong to the second category. We point out that the implications of choosing a particular value of the embedding dimension d are not theoretically well understood. The choice is usually based on empirical experiments or resource limitations, such as the compute and memory available on hardware platforms Park et al. (2018). A recent work attempted to explain the sizing of the embedding vectors based on a pairwise inner product dissimilarity metric Yin & Shen (2018).
We propose an alternative approach based on the entropy information measure. We leverage the ideas of Tishby et al. (1999); Shwartz-Ziv & Tishby (2017) as well as Traub & Wozniakowski (1980); Pinkus (1985); Donoho (2006), but in contrast to them we do not attempt to explain the behavior of neural networks or find ways to compress the parameters/data. We use the information measure to discuss the dimensionality of embeddings and to provide guidelines for sizing the embedding vectors, i.e. we provide roofline and more precise models for selecting the dimensionality d.

2 Introduction to embeddings
Let n items be mapped into a d-dimensional vector space. The vectors corresponding to the items are often organized into an embedding table, which can be seen as a tall matrix W ∈ R^{n×d} with n ≫ d that can be written as

(3)  W = [w_1, w_2, ..., w_n]^T

where vector w_i ∈ R^d corresponds to the i-th item.
A sparse feature is characterized by a list of integer indices, which can be represented as item lookups with different signatures in the embedding table.
The item lookup with a single index i is often encoded as a dense matrix-vector multiplication

(4)  w_i = W^T a_i

where vector a_i ∈ R^n and

(5)  a_i = [0, ..., 0, 1, 0, ..., 0]^T

has 1 in the i-th position and 0 everywhere else. It is often referred to as a one-hot encoding vector.
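The equivalence between the one-hot product in (4)-(5) and simply reading a row of the table can be sketched as follows (the sizes and index are illustrative):

```python
import numpy as np

# Sketch of (4)-(5): a single-item lookup as W^T a_i with a one-hot a_i.
rng = np.random.default_rng(0)
n, d, i = 1000, 8, 42
W = rng.standard_normal((n, d))  # embedding table

a = np.zeros(n)
a[i] = 1.0                       # one-hot encoding of item i

w = W.T @ a                      # dense matrix-vector multiplication
assert np.allclose(w, W[i])      # identical to simply reading row i
```

In practice frameworks implement the row read directly rather than materializing the one-hot vector.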
Notice that we can select multiple items with some weights in a single lookup and express it as

(6)  z = W^T a

where vector a ∈ R^n and

(7)  a = [..., α_i, ...]^T

has weight α_i at the i-th position and 0 everywhere else Jia et al. (2014); Paszke et al. (2017).
Further, we can generalize this to m lookups, where each selects multiple items with some weights, and encode it as a dense matrix-sparse matrix multiplication

(8)  Z = W^T A

where sparse matrix A ∈ R^{n×m} and

(9)  A = [a_1, a_2, ..., a_m]

is composed of m vectors a_j, each corresponding to a single lookup [1], with nonzero elements corresponding to the items being selected. The output matrix Z ∈ R^{d×m} is the result of m lookups.

[1] Notice that the vector subscript here has a different meaning than before: in (9) it denotes the j-th lookup, while in (5) it denoted the i-th item being selected.
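The batched lookup (8)-(9) can be sketched as follows; the sparse matrix A is stored densely here for clarity, and the sizes, indices and weights are illustrative.

```python
import numpy as np

# Sketch of (8)-(9): m lookups, each selecting a few items with weights,
# expressed as Z = W^T A (A is stored densely for clarity).
rng = np.random.default_rng(0)
n, d, m = 1000, 8, 3
W = rng.standard_normal((n, d))

A = np.zeros((n, m))
A[[5, 9], 0] = [1.0, 0.5]        # lookup 1: items 5 and 9, with weights
A[7, 1] = 2.0                    # lookup 2: item 7 only
A[[5, 7, 11], 2] = 1.0           # lookup 3: items 5, 7 and 11

Z = W.T @ A                      # each column of Z is one lookup result
assert Z.shape == (d, m)
assert np.allclose(Z[:, 1], 2.0 * W[7])
```

This weighted-sum-of-rows pattern is what embedding-bag style operators in deep learning frameworks compute, without ever forming A explicitly.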
This setup is often used to state that the embedding projects the n-dimensional item space into d-dimensional embedding vectors. In the next section we will examine this statement in more detail and show that it can be misleading and therefore lead to incorrect conclusions.
3 The dimensionality of embeddings
Notice that when we discussed space dimensions in the previous section, we never took into consideration the precise vector element type and how much information it can represent. Let us now incorporate this into our analysis by measuring the cardinality of the set the element type can describe and the information associated with it.
Recall that the entropy of an information source is given by

(10)  H = − Σ_i p_i log_2 p_i

where p_i is the probability of the i-th symbol being communicated Shannon & Weaver (1949).
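Equation (10) translates directly into code; the toy distributions below are illustrative.

```python
import math

# Sketch of (10): Shannon entropy, in bits, of a discrete distribution.
# Zero-probability terms contribute nothing (the 0*log 0 = 0 convention).
def entropy_bits(p):
    return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

assert abs(entropy_bits([0.5, 0.5]) - 1.0) < 1e-12  # a fair coin: 1 bit
assert entropy_bits([1.0]) == 0.0                   # a certain outcome: 0 bits
assert abs(entropy_bits([1/8] * 8) - 3.0) < 1e-12   # uniform over 2^3 outcomes
```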
Notice that the embedding vector has d elements, each with s bits. Therefore, it can represent

(11)  2^{sd}

values, and if we interpret it as an information source its entropy is

(12)  H_e = − Σ_{j=1}^{2^{sd}} q_j log_2 q_j

where q_j denotes the probability of the j-th value being selected. Hence, if q is uniform then

(13)  H_e = log_2 2^{sd} = sd
3.1 Single lookup of single item
Let a single lookup of a single item be done as shown in (4). In this case the i-th embedding vector represents the i-th item from the item space.

Notice that because we represent this lookup as a binary vector in (5) in n-dimensional space, it can describe only n items, and if we interpret it as an information source its entropy is

(14)  H_l = − Σ_{i=1}^{n} p_i log_2 p_i

where p_i denotes the probability of the i-th item being selected. Note that if p is uniform then

(15)  H_l = log_2 n
Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (14). For instance, under the assumption of uniform probability of item/value selection, if n = 1M then using (15) we have H_l ≈ 19.9 < 32 = sd, and therefore a single 32-bit element is enough to represent the information in the item space.
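This comparison can be checked numerically; n, s and d follow the single-item example in the text.

```python
import math

# Sketch of the comparison in Sec. 3.1: entropy of a uniform single-item
# lookup, log2(n) from (15), versus the s*d bits of the embedding in (13).
n = 10**6        # 1M items
s, d = 32, 1     # a single 32-bit element

H_lookup = math.log2(n)    # ~19.93 bits of information in the lookup
H_embed = s * d            # 32 bits representable by the embedding
assert H_lookup < H_embed  # one 32-bit element suffices
```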
3.2 Single lookup of multiple items
Let a single lookup of multiple items be done as shown in (6). In this case the i-th embedding vector represents combinations of the i-th and other items from the item space.

Let us first consider the case when this lookup of k items is a binary vector in n-dimensional space, i.e. α_i = 1 for every selected position i in (7). In this case, the vector can describe c items, with

(16)  c = C(n,k) = n! / (k! (n−k)!)

and if we interpret it as an information source its entropy is

(17)  H_l = − Σ_{i=1}^{c} p_i log_2 p_i

where p_i is the probability of a combination being selected. Note that if p is uniform then

(18)  H_l = log_2 C(n,k)
(19)      ≈ n log_2 n − k log_2 k − (n−k) log_2 (n−k) − log_2 √π
            + (1/6) log_2 [ (8n^3 + 4n^2 + n + 1/30) / ((8k^3 + 4k^2 + k + 1/30)(8(n−k)^3 + 4(n−k)^2 + (n−k) + 1/30)) ]
where we have used properties of logarithms and Ramanujan's approximation Ramanujan (1988)

(20)  n! ≈ √π (n/e)^n (8n^3 + 4n^2 + n + 1/30)^{1/6}

that are described in more detail in the appendix.
Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (17). For instance, under the assumption of uniform probability of item/value selection, if n = 100M and k = 100, then using (18) we have H_l ≈ 1988.5 < 2048 = sd, and therefore a vector with 64 elements of 32 bits each is enough to represent the information in the item space.
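The roofline entropy log2 C(n,k) in (18) can also be evaluated exactly with the log-gamma function; note that exact values can differ somewhat from the Ramanujan-based approximation used in the text.

```python
import math

# Sketch of (16)-(18): log2 of the binomial coefficient C(n,k), computed
# via the log-gamma function (exact up to floating point).
def log2_binomial(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

assert abs(log2_binomial(10, 2) - math.log2(45)) < 1e-9  # C(10,2) = 45
# A lookup of k = 10 out of n = 1M items needs a few hundred bits, well
# below the 64 * 32 = 2048 bits of a 64-element fp32 embedding vector.
assert log2_binomial(10**6, 10) < 64 * 32
```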
3.3 Single lookup of multiple items with weights
Let us now consider the case when this lookup is a vector in n-dimensional space, with each of the k weights α_i represented by t bits in (7). Notice that in this case the analysis of the previous section remains the same, but we can now select 2^t values for each position in the lookup. Therefore, the vector can describe c items, with

(21)  c = C(n,k) 2^{tk}

Note that if we interpret it as an information source, under the assumption that p is uniform, then its entropy is

(22)  H_l = log_2 (C(n,k) 2^{tk})
(23)      = log_2 C(n,k) + tk
Therefore, the dimensionality of the item and embedding spaces, as measured by the information they can represent, can be compared by looking at (12) and (22). For instance, under the assumption of equal probability of item/value selection, if n = 100M and k = 100 as before, with the weights quantized to t = 16 bits, then using (22) we have H_l ≈ 1988.5 + 1600 ≈ 3588.5 < 4096 = sd, and therefore a vector with 128 elements of 32 bits each is enough to represent the information in the item space.
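Adding the t·k weight bits of (22) on top of log2 C(n,k) is a one-liner; the values of n, k and t below are illustrative.

```python
import math

# Sketch of (21)-(22): entropy of a uniform lookup of k items whose
# weights are quantized to t bits each.
def log2_binomial(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

def weighted_lookup_entropy_bits(n, k, t):
    return log2_binomial(n, k) + t * k  # extra t bits per selected item

n, k, t = 10**6, 10, 8
H = weighted_lookup_entropy_bits(n, k, t)
assert H == log2_binomial(n, k) + 80    # t*k = 80 additional bits
```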
3.4 The effects of minibatch
We point out that the use of a minibatch of size m does not affect the dimensionality analysis, because each lookup vector in the minibatch is treated independently. This can be observed in (8), where m lookups from the sparse matrix A against the embedding table W generate m results in the matrix Z.
4 Upper bounds and sizing guidelines for embeddings
Notice that embeddings corresponding to sparse features do not necessarily reduce the dimensionality of data, as measured by the information they can represent. The dimensionality reduction depends on the size of the embedding vectors and the detailed signature of the input lookup vectors.
The upper bound (roofline) dimensionality of the different types of lookups is provided in (15), (18) and (22), which can be compared with the dimensionality of embeddings given in (13). We have tabulated a few sample lookup signatures and embedding dimensions for comparison in Tab. 1.
Table 1: Sample input lookup signatures and embedding dimensions.

  Input Lookup Signature   |          Embedding Dimensions
    n       k      H(.)    |  d (s=8)   d (s=16)   d (s=32)    H(.)
    1M      1      19.9    |      4         2          1         32
   10M      1      23.2    |      4         2          1         32
  100M      1      26.5    |      4         2          1         32
    1M     10     163.0    |     24        12          6        192
   10M     10     196.3    |     28        14          7        224
  100M     10     229.5    |     32        16          8        256
    1M    100    1324.1    |    168        84         42       1344
   10M    100    1656.3    |    208       104         52       1664
  100M    100    1988.5    |    252       126         63       2016
    1M   1000    9958.0    |   1248       624        312       9984
   10M   1000   13281.2    |   1664       832        416      13312
  100M   1000   16603.3    |   2076      1038        519      16608
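A short script in the spirit of Tab. 1, assuming the sizing rule d = ceil(H/s) with the s = 32 sizing scaled to narrower element types; H is computed exactly via log-gamma here, whereas the table uses an approximation, so the entropy column differs somewhat for larger k.

```python
import math

# Sketch of the sizing rule behind Tab. 1: pick the number of elements d
# so that s*d covers the roofline entropy H of the lookup.
def log2_binomial(n, k):
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(2)

for n in (10**6, 10**7, 10**8):
    for k in (1, 10, 100, 1000):
        H = log2_binomial(n, k)
        d32 = math.ceil(H / 32)       # elements at s = 32 bits
        d16, d8 = 2 * d32, 4 * d32    # scaled for narrower element types
        print(f"n={n:>9,} k={k:>4} H={H:9.1f} "
              f"d(s=8)={d8} d(s=16)={d16} d(s=32)={d32}")
```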
Although the function (18) achieves its maximum for larger k (at k = n/2), as shown in Fig. 1(a), notice that for the sample lookup signatures in Tab. 1 the choice of k has a much larger effect on the entropy than the choice of n, as shown in Fig. 1(b). We point out that the function is plotted on a log scale in both plots.
In practice not all combinations will be exercised by the item lookups. We can make a pass over the input dataset to discover how many combinations are actually present and how often the same combinations are repeated. Then, we can use formulas (14) and (17) to estimate the dimensionality of the input dataset from the information perspective. Ultimately, this more precise measure of the dimensionality of the input dataset can guide a better choice of the embedding vector size and element type to be used in the model.
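The dataset pass described above can be sketched as follows; the toy lookup lists are made up, and the empirical frequencies stand in for the probabilities in (14) and (17).

```python
from collections import Counter
import math

# Sketch of the dataset pass suggested in the text: count how often each
# exact combination of looked-up items occurs, then estimate the empirical
# entropy of the lookups from those frequencies (toy data).
lookups = [(3, 7), (7, 3), (1,), (3, 7), (2, 5, 9), (1,)]

counts = Counter(tuple(sorted(l)) for l in lookups)   # order-insensitive
total = sum(counts.values())
H = -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"{len(counts)} distinct combinations, empirical entropy {H:.2f} bits")
```

The empirical entropy is typically far below the uniform roofline, which argues for smaller embedding sizes than the worst-case analysis suggests.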
Once again we point out that although embeddings do not necessarily reduce the dimensionality of the lookups, they do provide a different and very useful representation of them. Note that the embedding values are learned during training. Therefore, the embeddings are found based on what the lookups represent/mean in some abstract space relevant to the model at hand.
5 Conclusion
We have discussed embeddings corresponding to dense and sparse probability distributions. We have analyzed the dimensionality of the item lookups and embeddings corresponding to sparse features and data. We have shown that using embeddings does not necessarily reduce the dimension of the lookups, as measured in terms of the information they can represent. We have also provided rooflines and more precise guidelines for choosing the embedding size for the dataset and model at hand.
Acknowledgements
The author would like to thank Aleksandr Ulanov, Dheevatsa Mudigere, Satish Nadathur and Misha Smelyanskiy for thoughtful comments, as well as Mark Tygert and Juan Miguel Pino for insightful discussions on the use of embeddings in different applications and NMT models.
References
 Bengio et al. (2010) Samy Bengio, Jason Weston, and David Grangier. Label embedding trees for large multi-class tasks. In Proc. Advances in Neural Information Processing Systems, 2010.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. Wide & deep learning for recommender systems. In Proc. 1st Workshop on Deep Learning for Recommender Systems, pp. 7–10, 2016.
 Cormen et al. (2009) Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
 Donoho (2006) David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
 Dumais (2005) Susan T. Dumais. Latent semantic analysis. Annual Review of Information Science and Technology, 38:188 – 230, 2005.
 Frolov & Oseledets (2017) Evgeny Frolov and Ivan Oseledets. Tensor methods and recommender systems. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 7(3):e1201, 2017.
 Golub & Van Loan (2012) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2012.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proc. 26th Int. Conf. World Wide Web, pp. 173–182, 2017.
 Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. CoRR, 2014. URL https://arxiv.org/abs/1408.5093.
 Kalchbrenner & Blunsom (2013) Nal Kalchbrenner and Phil Blunsom. Recurrent continuous translation models. In Proc. 2013 Conf. Empirical Methods in Natural Language Processing, pp. 1700–1709, 2013.
 Karatsuba (2001) Ekatherina Karatsuba. On the asymptotic representation of the Euler gamma function by Ramanujan. Journal of Computational and Applied Mathematics, 135:225–240, 2001.
 Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 8:30–37, 2009.

 Le Cam (1986) Lucien Le Cam. The central limit theorem around 1935. Statistical Science, 1:78–96, 1986.
 Neubig (2017) Graham Neubig. Neural machine translation and sequence-to-sequence models: a tutorial. CoRR, 2017. URL https://arxiv.org/abs/1703.01619.
 Park et al. (2018) Jongsoo Park, Maxim Naumov, Protonu Basu, Summer Deng, Aravind Kalaiah, Daya Khudia, James Law, Parth Malani, Andrey Malevich, Satish Nadathur, Juan Pino, Martin Schatz, Alexander Sidorov, Viswanath Sivakumar, Andrew Tulloch, Xiaodong Wang, Yiming Wu, Hector Yuen, Utku Diril, Dmytro Dzhulgakov, Kim Hazelwood, Bill Jia, Yangqing Jia, Lin Qiao, Vijay Rao, Nadav Rotem, Sungjoo Yoo, and Mikhail Smelyanskiy. Deep learning inference in facebook data centers: Characterization, performance optimizations and hardware implications. CoRR, 2018. URL https://arxiv.org/abs/1811.09886.

Paszke et al. (2017)
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary
DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer.
Automatic differentiation in PyTorch.
Proc. Advances in Neural Information Processing Systems, 2017.  Pinkus (1985) Alan Pinkus. nwidths in approximation theory. SpringerVerlag, Berlin, 1985.
 Ramanujan (1988) Srinivasa Ramanujan. The Lost Notebook and Other Unpublished Papers. Springer, Berlin, 1988.
 Romik (2000) Dan Romik. Stirling’s approximation for n!: The ultimate short proof? The American Mathematical Monthly, 107:556–557, 2000.

Sedhain et al. (2015)
Suvash Sedhain, Aditya K. Menon, Scott Sanner, and Lexing Xie.
Autorec: Autoencoders meet collaborative filtering.
In Proc. 24th Int. Conf. World Wide Web, 2015.  Shannon & Weaver (1949) Claude Shannon and Warren Weaver. The Mathematical Theory of Communication. University of Illinois Press, 1949.
 Shwartz-Ziv & Tishby (2017) Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, 2017. URL https://arxiv.org/abs/1703.00810.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
 Tishby et al. (1999) Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proc. 37th Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
 Traub & Wozniakowski (1980) Joe F. Traub and Henryk Wozniakowski. A General Theory of Optimal Algorithms. Academic Press, New York, 1980.
 Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click predictions. In Proc. ADKDD, pp. 12, 2017.
 Yin & Shen (2018) Zi Yin and Yuanyuan Shen. On the dimensionality of word embeddings. In Proc. Neural Information Processing Systems, 2018.
6 Appendix
6.1 Logarithmic Identities
We list a few handy logarithmic identities in this section Cormen et al. (2009). The multiplication and division operations are equivalent to

(24)  log_b (xy) = log_b x + log_b y

and

(25)  log_b (x/y) = log_b x − log_b y

respectively. Also, we can change a logarithm from base b to base c using the following

(26)  log_c x = log_b x / log_b c
6.2 Approximation theory
There exist a number of approximations to the factorial function, including Stirling's approximation Le Cam (1986); Romik (2000)

(27)  n! ≈ √(2πn) (n/e)^n

and the more accurate Ramanujan's approximation Ramanujan (1988); Karatsuba (2001)

(28)  n! ≈ √π (n/e)^n (8n^3 + 4n^2 + n + 1/30)^{1/6}

Then, using (28) and the identities above, it follows that

(29)  log_2 C(n,k) ≈ n log_2 n − k log_2 k − (n−k) log_2 (n−k) − log_2 √π
            + (1/6) log_2 [ (8n^3 + 4n^2 + n + 1/30) / ((8k^3 + 4k^2 + k + 1/30)(8(n−k)^3 + 4(n−k)^2 + (n−k) + 1/30)) ]
We will take advantage of this expression in this note.