1 Introduction
In this paper, we consider the problem of embedding words into a low-dimensional space in order to measure the semantic similarity between them. For example, how can we determine whether the word “table” is semantically more similar to the word “stool” than to the word “sky”? This is achieved by constructing a low-dimensional vector representation for each word and measuring the similarity between the words as the similarity between the corresponding vectors.
One of the most popular word embedding models, proposed by Mikolov et al. (2013), is a discriminative neural network that optimizes the Skip-Gram Negative Sampling (SGNS) objective (see Equation 3). It aims at predicting whether two words can be found close to each other within a text. As shown in Section 2, the process of training word embeddings using SGNS can be divided into two general steps with clear objectives:

Step 1. Search for a low-rank matrix X that provides a good SGNS objective value;

Step 2. Search for a good low-rank representation X = WC^⊤ in terms of linguistic metrics, where W is a matrix of word embeddings and C is a matrix of so-called context embeddings.
Unfortunately, most previous approaches mixed these two steps into a single one, which entails a not completely correct formulation of the optimization problem. For example, popular approaches to training embeddings (including the original “word2vec” implementation) do not take into account that the objective from Step 1 depends only on the product X = WC^⊤: instead of straightforwardly computing the derivative w.r.t. X, these methods are explicitly based on the derivatives w.r.t. W and C, which complicates the optimization procedure. Moreover, such approaches do not take into account that the parametrization of the matrix X = WC^⊤ is non-unique and Step 2 is required. Indeed, for any invertible matrix S, we have

X = WC^⊤ = (WS)(CS^{−⊤})^⊤;

therefore, the solutions (W, C) and (WS, CS^{−⊤}) are equally good in terms of the SGNS objective but entail different cosine similarities between embeddings and, as a result, different performance in terms of linguistic metrics (see Section 4.2 for details).

A successful attempt to follow the two steps described above, which outperforms the original SGNS optimization approach in terms of various linguistic tasks, was proposed by Levy and Goldberg (2014). In order to obtain a low-rank matrix X on Step 1, the method reduces the dimensionality of the Shifted Positive Pointwise Mutual Information (SPPMI) matrix via Singular Value Decomposition (SVD). On Step 2, it computes the embeddings W and C via a simple formula that depends on the factors obtained by SVD. However, this method has one important limitation: SVD provides a solution to a surrogate optimization problem, which has no direct relation to the SGNS objective. In fact, SVD minimizes the Mean Squared Error (MSE) between X and the SPPMI matrix, which does not lead to minimization of the SGNS objective in general (see Section 6.1 and Section 4.2 in Levy and Goldberg (2014) for details).

These issues bring us to the main idea of our paper: while keeping the low-rank matrix search setup on Step 1, optimize the original SGNS objective directly. This leads to an optimization problem over the matrix X with a low-rank constraint, which is often solved by applying the Riemannian optimization framework (Mishra et al. (2014); Udriste (1994)). In our paper, we use the projector-splitting algorithm of Lubich and Oseledets (2014), which is easy to implement and has low computational complexity. Of course, Step 2 may be improved as well, but we regard this as a direction of future work.
As a result, our approach achieves a significant improvement in terms of SGNS optimization on Step 1 and, moreover, the improvement on Step 1 entails an improvement on Step 2 in terms of linguistic metrics. Thus the proposed two-step decomposition of the problem makes sense and, most importantly, opens the way to applying even more advanced approaches based on it (e.g., more advanced Riemannian optimization techniques for Step 1 or a more sophisticated treatment of Step 2).
To summarize, the main contributions of our paper are:

We reformulated the problem of SGNS word embedding learning as a two-step procedure with clear objectives;

For Step 1, we developed an algorithm based on the Riemannian optimization framework that optimizes the SGNS objective over the low-rank matrix X directly;
2 Problem Setting
2.1 Skip-Gram Negative Sampling
In this paper, we consider the Skip-Gram Negative Sampling (SGNS) word embedding model (Mikolov et al. (2013)), which is a probabilistic discriminative model. Assume we have a text corpus given as a sequence of words w_1, …, w_n, where n may be very large and each w_i belongs to a vocabulary of words V_W. A context c of the word w_i is a word from the set {w_{i−L}, …, w_{i−1}, w_{i+1}, …, w_{i+L}} for some fixed window size L. Let w, c ∈ ℝ^d be the word embeddings of word w and context c, respectively. Assume they are specified by the following mappings:

W: V_W → ℝ^d,   C: V_C → ℝ^d,

where V_C is the vocabulary of contexts. The ultimate goal of SGNS word embedding training is to fit good mappings W and C.
Let D be a multiset of all word-context pairs (w, c) observed in the corpus. In the SGNS model, the probability that a word-context pair (w, c) is observed in the corpus is modeled as the following distribution:

P(#(w, c) ≠ 0 | w, c) = σ(⟨w, c⟩) = 1 / (1 + e^{−⟨w, c⟩}),   (1)

where #(w, c) is the number of times the pair (w, c) appears in D and ⟨w, c⟩ is the scalar product of the vectors w and c. The number d is a hyperparameter that adjusts the flexibility of the model. It usually takes values from tens to hundreds.
In order to collect a training set, we take all pairs (w, c) from D as positive examples and k randomly generated pairs (w, c) as negative ones. The numbers of times the word w and the context c appear in D can be computed as

#(w) = Σ_{c ∈ V_C} #(w, c),   #(c) = Σ_{w ∈ V_W} #(w, c),

accordingly. Then negative examples are generated from the distribution defined by these counters: P_D(c) = #(c) / |D|, where |D| is the total number of word-context pairs in D.
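For illustration, drawing negatives from the distribution P_D(c) = #(c)/|D| defined above can be sketched in a few lines of numpy; the dense counts matrix and the function name are a toy setup of our own, not part of the original implementation:

```python
import numpy as np

def sample_negative_contexts(counts, k, rng):
    """Draw k negative context indices from P_D(c) = #(c) / |D|,
    where counts[w, c] = #(w, c) and |D| = counts.sum()."""
    p = counts.sum(axis=0) / counts.sum()   # #(c) / |D| for each context c
    return rng.choice(counts.shape[1], size=k, p=p)

rng = np.random.default_rng(0)
counts = np.array([[4.0, 0.0, 1.0],
                   [2.0, 3.0, 0.0]])       # 2 words, 3 contexts, |D| = 10
negatives = sample_negative_contexts(counts, k=10000, rng=rng)
```

With these counts, the three contexts should be sampled with probabilities 0.6, 0.3 and 0.1, respectively.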
In this way, we have a model maximizing the following logarithmic likelihood objective for each word-context pair (w, c):

log σ(⟨w, c⟩) + k · E_{c′ ∼ P_D} log σ(−⟨w, c′⟩).   (2)

In order to maximize the objective over all observations for each pair (w, c), we arrive at the following SGNS optimization problem over all possible mappings W and C:

l = Σ_{w ∈ V_W} Σ_{c ∈ V_C} #(w, c) (log σ(⟨w, c⟩) + k · E_{c′ ∼ P_D} log σ(−⟨w, c′⟩)) → max_{W, C}.   (3)
Usually, this optimization is performed via a stochastic gradient descent procedure during a pass through the corpus (Mikolov et al. (2013); Rong (2014)).
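To make the objective concrete, here is a minimal numpy sketch of (3) for dense toy data, with the expectation over negative samples written in closed form; the function names and toy sizes are our own illustration, not the original implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(counts, W, C, k):
    """SGNS log-likelihood: sum over all pairs (w, c) of
    #(w,c) log sigma(<w,c>) + k (#(w)#(c)/|D|) log sigma(-<w,c>),
    where the second term is the expectation over negative samples
    c' ~ #(c')/|D| written in closed form."""
    nw = counts.sum(axis=1, keepdims=True)   # #(w), shape (n, 1)
    nc = counts.sum(axis=0, keepdims=True)   # #(c), shape (1, m)
    total = counts.sum()                     # |D|
    X = W @ C.T                              # x_{w,c} = <w, c>
    return float(np.sum(counts * np.log(sigmoid(X))
                        + k * nw * nc / total * np.log(sigmoid(-X))))

rng = np.random.default_rng(0)
counts = rng.integers(1, 5, size=(4, 3)).astype(float)  # toy #(w, c)
W = rng.normal(size=(4, 2))                             # word embeddings, d = 2
C = rng.normal(size=(3, 2))                             # context embeddings
value = sgns_objective(counts, W, C, k=5)
```

Both terms are logarithms of sigmoids and hence non-positive, so the objective is bounded above by zero; training makes it as large as possible.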
2.2 Optimization over Low-Rank Matrices
Relying on the perspective proposed by Levy and Goldberg (2014), let us show that the optimization problem given by (3) can be considered as a problem of searching for a matrix X that maximizes a certain objective function and satisfies a rank constraint (Step 1 in the scheme described in Section 1).
2.2.1 SGNS Loss Function
As shown by Levy and Goldberg (2014), the logarithmic likelihood (3) can be represented as the sum of l_{w,c}(w, c) over all pairs (w, c), where l_{w,c} has the following form:

l_{w,c}(w, c) = #(w, c) log σ(⟨w, c⟩) + k (#(w) #(c) / |D|) log σ(−⟨w, c⟩).   (4)

A crucial observation is that this loss function depends only on the scalar product ⟨w, c⟩ but not on the embeddings w and c separately:

l_{w,c}(w, c) = f_{w,c}(x_{w,c}),   f_{w,c}(x) = a_{w,c} log σ(x) + b_{w,c} log σ(−x),

where x_{w,c} is the scalar product ⟨w, c⟩, and a_{w,c} = #(w, c) and b_{w,c} = k #(w) #(c) / |D| are constants.
2.2.2 Matrix Notation
Denote |V_W| as n and |V_C| as m. Let W ∈ ℝ^{n×d} and C ∈ ℝ^{m×d} be matrices, where each row w ∈ ℝ^d of matrix W is the word embedding of the corresponding word w and each row c ∈ ℝ^d of matrix C is the context embedding of the corresponding context c. Then the elements of the product of these matrices,

X = WC^⊤,

are the scalar products x_{w,c} of all pairs (w, c). Note that this matrix has rank d, because X equals the product of two matrices with sizes n×d and d×m. Now we can write the SGNS objective given by (3) as a function of X:

F(X) = Σ_{w ∈ V_W} Σ_{c ∈ V_C} f_{w,c}(x_{w,c}),   F: ℝ^{n×m} → ℝ.   (5)
This leads us to the following proposition:

Proposition 1. The SGNS optimization problem given by (3) can be rewritten in the following constrained form:

maximize F(X) subject to X ∈ M_d,   (6)

where M_d is the manifold of all matrices in ℝ^{n×m} of rank d.
The key idea of this paper is to solve the optimization problem given by (6) via the framework of Riemannian optimization, which we introduce in Section 3.
It is important to note that this perspective does not assume optimization over the parameters W and C directly. This entails optimization in a space with (n + m − d) · d degrees of freedom (Mukherjee et al. (2015)) instead of (n + m) · d, which simplifies the optimization process (see Section 5 for the experimental results).
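The fact that the objective depends on (W, C) only through X = WC^⊤ is easy to verify numerically. The sketch below, with toy data and helper names of our own, checks that the objective written as a function of X alone is unchanged under a reparametrization of the factors by a non-singular matrix S:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_F(counts, X, k):
    """The SGNS objective as a function of X = W C^T alone."""
    nw = counts.sum(axis=1, keepdims=True)
    nc = counts.sum(axis=0, keepdims=True)
    return float(np.sum(counts * np.log(sigmoid(X))
                        + k * nw * nc / counts.sum() * np.log(sigmoid(-X))))

rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(6, 5)).astype(float)
W = rng.normal(size=(6, 3))
C = rng.normal(size=(5, 3))
S = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)    # a non-singular reparametrization

X1 = W @ C.T                                     # original factors
X2 = (W @ S) @ (C @ np.linalg.inv(S).T).T        # reparametrized factors
```

Both factorizations give the same matrix X of rank at most d = 3, and hence the same objective value, while the individual embeddings (and their cosine similarities) differ.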
2.3 Computing Embeddings from a Low-Rank Solution
Once X is found, we need to recover W and C such that X = WC^⊤ (Step 2 in the scheme described in Section 1). This problem does not have a unique solution, since if W and C satisfy this equation, then WS and CS^{−⊤} satisfy it as well for any non-singular matrix S. Moreover, different solutions may achieve different values of the linguistic metrics (see Section 4.2 for details). While our paper focuses on Step 1, for Step 2 we use a heuristic approach that was proposed by Levy et al. (2015) and shows good results in practice. We compute the SVD of X in the form

X = UΣV^⊤,

where U and V have orthonormal columns and Σ is the diagonal matrix of singular values, and use

W = U√Σ,   C = V√Σ

as the matrices of embeddings.
A simple justification of this solution is the following: we need to map words into vectors in such a way that similar words have similar embeddings in terms of cosine similarity. It is reasonable to assume that two words are similar if they share contexts. Therefore, we can estimate the similarity of two words w_1, w_2 as

s(w_1, w_2) = Σ_{c ∈ V_C} x_{w_1,c} · x_{w_2,c},

which is the element of the matrix XX^⊤ with indices (w_1, w_2). Note that XX^⊤ = UΣV^⊤VΣU^⊤ = UΣ²U^⊤. If we choose W = UΣ, we exactly obtain ⟨w_1, w_2⟩ = s(w_1, w_2), since WW^⊤ = XX^⊤ in this case. That is, the scalar product of the embeddings coincides with the intuitive similarity s(w_1, w_2). However, scaling by √Σ instead of Σ was shown by Levy et al. (2015) to be a better solution in experiments.
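As a minimal sketch of this Step 2 computation, assuming a dense low-rank X (the function name is our own):

```python
import numpy as np

def embeddings_from_low_rank(X):
    """Heuristic of Levy et al. (2015): compute X = U S V^T and return
    W = U sqrt(S), C = V sqrt(S), so that W C^T = X and the spectrum
    is split evenly between the two factors.  In practice one would
    keep only the d leading columns."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    root = np.sqrt(s)
    return U * root, Vt.T * root   # rows are word / context embeddings

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))   # a rank-3 "solution"
W, C = embeddings_from_low_rank(X)
```

The factors reproduce X exactly, and the columns of W and C carry equal norms √σ_j, which is the "even split" of the spectrum that distinguishes this heuristic from the choice W = UΣ.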
3 Proposed Method
3.1 Riemannian Optimization
3.1.1 General Scheme
The main idea of Riemannian optimization (Udriste (1994)) is to consider (6) as a constrained optimization problem over the manifold M_d. Assume we have an approximate solution X_i at the current step of the optimization process, where i is the step number. In order to improve X_i, the next step of standard gradient ascent outputs the point X_i + ∇F(X_i), where ∇F(X_i) is the gradient of the objective F at the point X_i. Note that the gradient ∇F(X_i) can be naturally considered as a matrix in ℝ^{n×m}. The point X_i + ∇F(X_i) generally leaves the manifold M_d, because its rank can be greater than d. That is why Riemannian optimization methods map the point X_i + ∇F(X_i) back onto the manifold M_d. The standard Riemannian gradient method first projects the gradient step onto the tangent space at the current point X_i and then retracts it back to the manifold:

X_{i+1} = R(P_{T_M}(X_i + ∇F(X_i))),

where R is the retraction operator and P_{T_M} is the projection onto the tangent space.
Although the optimization problem is non-convex, Riemannian optimization methods show good performance on it. Theoretical properties and convergence guarantees of such methods are discussed more thoroughly by Wei et al. (2016).
3.1.2 Projector-Splitting Algorithm
In our paper, we use a simplified version of this approach that retracts the point X_i + ∇F(X_i) directly to the manifold and does not require projection onto the tangent space, as illustrated in Figure 1:

X_{i+1} = R(X_i + ∇F(X_i)).

Intuitively, the retractor R finds a rank-d matrix on the manifold M_d that is similar to the high-rank matrix X_i + ∇F(X_i) in terms of the Frobenius norm. How can we do that? The most straightforward way to reduce the rank of X_i + ∇F(X_i) is to perform the SVD, which keeps its d largest singular values:

1. U_{i+1}, S_{i+1}, V_{i+1}^⊤ ← SVD_d(X_i + ∇F(X_i)),
2. X_{i+1} ← U_{i+1} S_{i+1} V_{i+1}^⊤.   (7)

However, this is computationally expensive. Instead, we use the projector-splitting method (Lubich and Oseledets (2014)), which is a second-order retraction onto the manifold (for details, see the review by Absil and Oseledets (2015)). Its practical implementation is also quite intuitive: instead of computing the full SVD of X_i + ∇F(X_i) according to the gradient projection method, we use just one step of the block power method (Bentbib and Kanber (2015)) for computing the SVD, which reduces the computational complexity.
Let us keep the current point in the following factorized form:

X_i = U_i S_i V_i^⊤,   (8)

where the matrices U_i ∈ ℝ^{n×d} and V_i ∈ ℝ^{m×d} have orthonormal columns and S_i ∈ ℝ^{d×d}. Then we need to perform two QR decompositions to retract the point X_i + ∇F(X_i) back to the manifold:

1. U_{i+1}, S′ ← QR((X_i + ∇F(X_i)) V_i),
2. V_{i+1}, S_{i+1}^⊤ ← QR((X_i + ∇F(X_i))^⊤ U_{i+1}),
3. X_{i+1} ← U_{i+1} S_{i+1} V_{i+1}^⊤.

In this way, we always keep the solution X_{i+1} = U_{i+1} S_{i+1} V_{i+1}^⊤ on the manifold M_d and in the form (8).
Importantly, we only need to compute the products (X_i + ∇F(X_i)) V_i and (X_i + ∇F(X_i))^⊤ U_{i+1}, so the gradients with respect to U, S and V are never computed explicitly, thus avoiding the subtle case where S is close to singular (the so-called singular (critical) point on the manifold). Indeed, the gradient with respect to U (while keeping the orthogonality constraints) can be written (Koch and Lubich (2007)) as

∂F/∂U = ∇F(X) V S^{−1},

which means that the gradient will be large if S is close to singular. The projector-splitting scheme is free from this problem.
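A compact numpy sketch of this two-QR retraction follows; the matrices are dense for clarity (in the actual algorithm the products with the gradient can be formed without materializing X), and the function name is our own:

```python
import numpy as np

def projector_splitting_retract(U, S, Vt, G):
    """Retract X_i + G (with G = lambda * grad F(X_i)) back onto the
    rank-d manifold via two QR decompositions, keeping the form (8)."""
    Xg = U @ S @ Vt + G                    # the point to be retracted
    U1, _ = np.linalg.qr(Xg @ Vt.T)        # 1) update the column space
    V1, St = np.linalg.qr(Xg.T @ U1)       # 2) update the row space and core
    return U1, St.T, V1.T                  # X_{i+1} = U1 @ St.T @ V1.T

# Sanity check: retracting a point already on the manifold returns it.
rng = np.random.default_rng(3)
A = rng.normal(size=(7, 3)) @ rng.normal(size=(3, 5))   # rank-3 matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, S, Vt = U[:, :3], np.diag(s[:3]), Vt[:3]
U1, S1, V1t = projector_splitting_retract(U, S, Vt, np.zeros_like(A))
```

With a zero gradient the retraction reproduces the input exactly, and the returned factors keep orthonormal columns, so the iterate stays in the form (8).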
3.2 Algorithm
In the case of the SGNS objective given by (5), an element of the gradient ∇F has the form:

(∇F(X))_{w,c} = ∂f_{w,c}(x_{w,c}) / ∂x_{w,c} = #(w, c) · σ(−x_{w,c}) − k (#(w) #(c) / |D|) · σ(x_{w,c}).

To make the method more flexible in terms of convergence properties, we additionally use a step size parameter λ ∈ ℝ. In this case, the retractor R retracts the point X_i + λ∇F(X_i) instead of X_i + ∇F(X_i) onto the manifold.
The whole optimization procedure is summarized in Algorithm 1.
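Putting the pieces together, the procedure can be sketched as a self-contained toy loop on dense data; the function names, initialization, and hyperparameter values below are our own illustration, not the reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def grad_F(counts, X, k):
    """Elementwise gradient: #(w,c) sigma(-x) - k #(w)#(c)/|D| sigma(x)."""
    nw = counts.sum(axis=1, keepdims=True)
    nc = counts.sum(axis=0, keepdims=True)
    return counts * sigmoid(-X) - k * nw * nc / counts.sum() * sigmoid(X)

def F(counts, X, k):
    """The SGNS objective (5) as a function of X."""
    nw = counts.sum(axis=1, keepdims=True)
    nc = counts.sum(axis=0, keepdims=True)
    return float(np.sum(counts * np.log(sigmoid(X))
                        + k * nw * nc / counts.sum() * np.log(sigmoid(-X))))

def ro_sgns(counts, d, k, lam, iters, seed=0):
    """Gradient step + projector-splitting retraction, keeping
    X_i = U_i S_i V_i^T on the rank-d manifold throughout."""
    rng = np.random.default_rng(seed)
    n, m = counts.shape
    U, _ = np.linalg.qr(rng.normal(size=(n, d)))
    V, _ = np.linalg.qr(rng.normal(size=(m, d)))
    S = np.eye(d)
    for _ in range(iters):
        X = U @ S @ V.T
        Xg = X + lam * grad_F(counts, X, k)   # gradient step
        U, _ = np.linalg.qr(Xg @ V)           # two-QR retraction
        V, St = np.linalg.qr(Xg.T @ U)
        S = St.T
    return U @ S @ V.T

rng = np.random.default_rng(4)
counts = rng.integers(0, 10, size=(12, 10)).astype(float)
X_start = ro_sgns(counts, d=3, k=5, lam=0.01, iters=0)
X_opt = ro_sgns(counts, d=3, k=5, lam=0.01, iters=300)
```

On this toy problem the objective F grows from its value at the random initialization as the iterations proceed; a real run would stream sparse counts and monitor linguistic metrics as described in Section 5.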
4 Experimental Setup
4.1 Training Models
We compare the performance of our method (“RO-SGNS” in the tables) to two baselines: SGNS embeddings optimized via stochastic gradient descent as implemented in the original “word2vec” (“SGD-SGNS” in the tables) (Mikolov et al. (2013)), and embeddings obtained by SVD over the SPPMI matrix (“SVD-SPPMI” in the tables) (Levy and Goldberg (2014)). We also experimented with blockwise alternating optimization over the factors W and C, but the results were almost the same as the SGD results, so we do not include them in the paper. The source code of our experiments is available online: https://github.com/AlexGrinch/ro_sgns.
The models were trained on the English Wikipedia “enwik9” corpus (http://mattmahoney.net/dc/textdata), which was previously used in most papers on this topic. As in previous studies, we counted only the words which occur more than  times in the training corpus (Levy and Goldberg (2014); Mikolov et al. (2013)). As a result, we obtained a vocabulary of  unique tokens (the set of words and the set of contexts are equal). The size of the context window was set to  for all experiments, as was done in Levy and Goldberg (2014); Mikolov et al. (2013). We conduct three series of experiments: for dimensionality , , and .
The optimization step size λ is chosen to be small enough to avoid huge gradient values. However, a thorough choice of λ does not result in a significant difference in performance (this parameter was tuned on the training data only; the exact values used in the experiments are reported below).
4.2 Evaluation
We evaluate word embeddings via the word similarity task. We use the following popular datasets for this purpose: “wordsim-353” (Finkelstein et al. (2001); 3 datasets), “simlex-999” (Hill et al. (2016)) and “men” (Bruni et al. (2014)). The original “wordsim-353” dataset is a mixture of word pairs for both the word similarity and word relatedness tasks. This dataset was split (Agirre et al. (2009)) into two intersecting parts, “wordsim-sim” (“ws-sim” in the tables) and “wordsim-rel” (“ws-rel” in the tables), to separate the words from the different tasks. In our experiments, we use both of them along with the full version of “wordsim-353” (“ws-full” in the tables). Each dataset contains word pairs together with assessor-assigned similarity scores for each pair. As a quality measure, we use Spearman's correlation between these human ratings and the cosine similarities of the corresponding embedding pairs. We call this quality metric linguistic in our paper.
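This evaluation protocol can be sketched as follows; the toy vocabulary, embeddings and ratings are our own, and Spearman's correlation is computed from ranks directly since the toy scores have no ties:

```python
import numpy as np

def spearman(x, y):
    """Spearman's rank correlation (no tie handling needed for toy data)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

def linguistic_score(W, vocab, pairs, human_scores):
    """Correlation between assessor ratings and cosine similarities."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    model_scores = [cos(W[vocab[a]], W[vocab[b]]) for a, b in pairs]
    return spearman(np.array(model_scores), np.array(human_scores))

vocab = {"cat": 0, "dog": 1, "sky": 2}
W = np.array([[1.0, 0.0],
              [0.9, 0.1],
              [0.0, 1.0]])
pairs = [("cat", "dog"), ("cat", "sky"), ("dog", "sky")]
human = [9.0, 1.0, 2.0]
score = linguistic_score(W, vocab, pairs, human)
```

Here the ranking of the model's cosine similarities matches the ranking of the human ratings exactly, so the linguistic score is 1.0.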
5 Results of Experiments
First of all, we compare the values of the SGNS objective obtained by the methods. The comparison is presented in Table 1.
Table 1: Comparison of the methods in terms of the SGNS objective value (SGD-SGNS, SVD-SPPMI, RO-SGNS).
Dim.  Algorithm  ws-sim  ws-rel  ws-full  simlex  men

SGD-SGNS  0.719  0.570  0.662  0.288  0.645
SVD-SPPMI  0.722  0.585  0.669  0.317  0.686
RO-SGNS  0.729  0.597  0.677  0.322  0.683
SGD-SGNS  0.733  0.584  0.677  0.317  0.664
SVD-SPPMI  0.747  0.625  0.694  0.347  0.710
RO-SGNS  0.757  0.647  0.708  0.353  0.701
SGD-SGNS  0.738  0.600  0.688  0.350  0.712
SVD-SPPMI  0.765  0.639  0.707  0.380  0.737
RO-SGNS  0.767  0.654  0.715  0.383  0.732
five  he  main

SVD-SPPMI  RO-SGNS  SVD-SPPMI  RO-SGNS  SVD-SPPMI  RO-SGNS
Neighbors  Dist.  Neighbors  Dist.  Neighbors  Dist.  Neighbors  Dist.  Neighbors  Dist.  Neighbors  Dist.
lb  0.748  four  0.999  she  0.918  when  0.904  major  0.631  major  0.689
kg  0.731  three  0.999  was  0.797  had  0.903  busiest  0.621  important  0.661
mm  0.670  six  0.997  promptly  0.742  was  0.901  principal  0.607  line  0.631
mk  0.651  seven  0.997  having  0.731  who  0.892  nearest  0.607  external  0.624
lbf  0.650  eight  0.996  dumbledore  0.731  she  0.884  connecting  0.591  principal  0.618
per  0.644  and  0.985  him  0.730  by  0.880  linking  0.588  primary  0.612
usa  

SGD-SGNS  SVD-SPPMI  RO-SGNS
Neighbors  Dist.  Neighbors  Dist.  Neighbors  Dist. 
akron  0.536  wisconsin  0.700  georgia  0.707 
midwest  0.535  delaware  0.693  delaware  0.706 
burbank  0.534  ohio  0.691  maryland  0.705 
nevada  0.534  northeast  0.690  illinois  0.704 
arizona  0.533  cities  0.688  madison  0.703 
uk  0.532  southwest  0.684  arkansas  0.699 
youngstown  0.532  places  0.684  dakota  0.690 
utah  0.530  counties  0.681  tennessee  0.689 
milwaukee  0.530  maryland  0.680  northeast  0.687 
headquartered  0.527  dakota  0.674  nebraska  0.686 
We see that the SGD-SGNS and SVD-SPPMI methods provide quite similar results; however, the proposed method obtains significantly better SGNS objective values, which demonstrates the feasibility of using the Riemannian optimization framework for the SGNS optimization problem. It is interesting to note that the SVD-SPPMI method, which does not optimize the SGNS objective directly, obtains better results than the SGD-SGNS method, which aims at optimizing SGNS. This fact additionally confirms the idea described in Section 2.2.2 that the independent optimization over the parameters W and C may decrease the performance.
However, the target performance measure of embedding models is the correlation between semantic similarity and human assessment (Section 4.2). Table 2 presents the comparison of the methods in terms of it. We see that our method outperforms the competitors on all datasets except for the “men” dataset, where it obtains slightly worse results. Moreover, it is important that a higher dimension entails a higher performance gain of our method in comparison to the competitors.
To understand how our model improves or degrades the performance in comparison to the baseline, we examined several words whose neighbors in terms of cosine distance change significantly. Table 3 demonstrates the neighbors of the words “five”, “he” and “main” for both the SVD-SPPMI and RO-SGNS models. A neighbor is marked bold if we suppose that it has a similar semantic meaning to the source word. First of all, we notice that our model produces much better neighbors for words describing digits or numbers (see the word “five” as an example). A similar situation happens for many other words; e.g., in the case of “main”, the nearest neighbors contain 4 similar words for our model instead of 2 in the case of SVD-SPPMI. The neighborhood of “he” contains fewer semantically similar words in the case of our model. However, it filters out irrelevant words, such as “promptly” and “dumbledore”.
Table 4 contains the nearest words to the word “usa” from the 11th to the 20th. We marked the names of US states bold and did not present the top-10 nearest words, as they are exactly names of states for all three models. Some non-bold words are arguably relevant, as they represent large US cities (“akron”, “burbank”, “madison”) or geographical regions of several states (“midwest”, “northeast”, “southwest”), but there are also some completely irrelevant words (“uk”, “cities”, “places”) presented by the first two models.
Our experiments show that the optimal number of iterations K in the optimization procedure and the step size λ depend on the particular value of d. For , we have , for , we have , and for , we have . Moreover, the best results were obtained when the SVD-SPPMI embeddings were used as an initialization of the Riemannian optimization process.
Figure 2 illustrates how the correlation between semantic similarity and human assessment scores changes through the iterations of our method. The optimal value of K is the same for both the whole test set and its 10-fold subsets chosen for cross-validation. The idea of stopping the optimization procedure at some iteration is also discussed by Lai et al. (2015).
Training models of the same dimensionality () on the English Wikipedia corpus using SGD-SGNS, SVD-SPPMI and RO-SGNS took minutes, minutes and minutes, respectively. Our method works slower, but not significantly. Moreover, since we did not focus on optimizing the efficiency of the code, this time can be reduced.
6 Related Work
6.1 Word Embeddings
Skip-Gram Negative Sampling was introduced by Mikolov et al. (2013). The “negative sampling” approach is thoroughly described by Goldberg and Levy (2014), and the learning method is explained by Rong (2014). There are several open-source implementations of the SGNS neural network, which is widely known as “word2vec”: the original Google word2vec (https://code.google.com/archive/p/word2vec/) and the Gensim word2vec (https://radimrehurek.com/gensim/models/word2vec.html), among others.
As shown in Section 2.2, Skip-Gram Negative Sampling optimization can be reformulated as a problem of searching for a low-rank matrix. In order to be able to use out-of-the-box SVD for this task, the authors of Levy and Goldberg (2014) used a surrogate version of SGNS as the objective function. There are two general assumptions made in their algorithm that distinguish it from the SGNS optimization:

1. SVD optimizes the Mean Squared Error (MSE) objective instead of the SGNS loss function.

2. In order to avoid infinite elements in the SPMI matrix, it is transformed in an ad-hoc manner (into the SPPMI matrix) before applying SVD.
This makes the objective not interpretable in terms of the original task (3). As mentioned in Levy and Goldberg (2014), the SGNS objective weighs different (w, c) pairs differently, unlike the SVD, which works with the same weight for all pairs, and this may entail a performance drop. A comprehensive explanation of the relation between the SGNS and SVD-SPPMI methods is provided by Keerthi et al. (2015). Lai et al. (2015) and Levy et al. (2015) give a good overview of highly practical methods to improve these word embedding models.
6.2 Riemannian Optimization
An introduction to optimization over Riemannian manifolds can be found in Udriste (1994). An overview of retractions of high-rank matrices to low-rank manifolds is provided by Absil and Oseledets (2015). The projector-splitting algorithm was introduced by Lubich and Oseledets (2014); it was also mentioned by Absil and Oseledets (2015) as the “Lie-Trotter retraction”.
Riemannian optimization has been successfully applied to various data science problems: for example, matrix completion (Vandereycken (2013)), large-scale recommender systems (Tan et al. (2014)), and tensor completion (Kressner et al. (2014)).

7 Conclusions
In our paper, we proposed a general two-step scheme for training the SGNS word embedding model and introduced an algorithm that searches for a solution in low-rank form via the Riemannian optimization framework. We also demonstrated the superiority of our method by providing an experimental comparison to existing state-of-the-art approaches.
A possible direction of future work is to apply more advanced optimization techniques to Step 1 of the scheme proposed in Section 1 and to explore Step 2, i.e., obtaining embeddings from a given low-rank matrix.
Acknowledgments
This research was supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001).
References
 Absil and Oseledets (2015) P.-A. Absil and Ivan V. Oseledets. 2015. Low-rank retractions: a survey and new results. Computational Optimization and Applications 62(1):5–29.
 Agirre et al. (2009) Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In NAACL. pages 19–27.
 Bentbib and Kanber (2015) A. H. Bentbib and A. Kanber. 2015. Block power method for SVD decomposition. Analele Stiintifice Ale Universitatii Ovidius Constanta, Seria Matematica 23(2):45–58.
 Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research (JAIR) 49:1–47.
 Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In WWW. pages 406–414.
 Goldberg and Levy (2014) Yoav Goldberg and Omer Levy. 2014. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
 Hill et al. (2016) Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics.
 Keerthi et al. (2015) S. Sathiya Keerthi, Tobias Schnabel, and Rajiv Khanna. 2015. Towards a better understanding of predict and count models. arXiv preprint arXiv:1511.02024.
 Koch and Lubich (2007) Othmar Koch and Christian Lubich. 2007. Dynamical low-rank approximation. SIAM J. Matrix Anal. Appl. 29(2):434–454.
 Kressner et al. (2014) Daniel Kressner, Michael Steinlechner, and Bart Vandereycken. 2014. Low-rank tensor completion by Riemannian optimization. BIT Numerical Mathematics 54(2):447–468.
 Lai et al. (2015) Siwei Lai, Kang Liu, Shi He, and Jun Zhao. 2015. How to generate a good word embedding? arXiv preprint arXiv:1507.05523.
 Levy and Goldberg (2014) Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS. pages 2177–2185.
 Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225.
 Lubich and Oseledets (2014) Christian Lubich and Ivan V. Oseledets. 2014. A projector-splitting integrator for dynamical low-rank approximation. BIT Numerical Mathematics 54(1):171–188.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS. pages 3111–3119.
 Mishra et al. (2014) Bamdev Mishra, Gilles Meyer, Silvère Bonnabel, and Rodolphe Sepulchre. 2014. Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics 29(3–4):591–621.
 Mukherjee et al. (2015) A. Mukherjee, K. Chen, N. Wang, and J. Zhu. 2015. On the degrees of freedom of reduced-rank estimators in multivariate regression. Biometrika 102(2):457–477.
 Rong (2014) Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.
 Schnabel et al. (2015) Tobias Schnabel, Igor Labutov, David Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In EMNLP.
 Tan et al. (2014) Mingkui Tan, Ivor W. Tsang, Li Wang, Bart Vandereycken, and Sinno Jialin Pan. 2014. Riemannian pursuit for big matrix recovery. In ICML. volume 32, pages 1539–1547.
 Udriste (1994) Constantin Udriste. 1994. Convex functions and optimization methods on Riemannian manifolds, volume 297. Springer Science & Business Media.
 Vandereycken (2013) Bart Vandereycken. 2013. Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization 23(2):1214–1236.
 Wei et al. (2016) Ke Wei, Jian-Feng Cai, Tony F. Chan, and Shingyu Leung. 2016. Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications 37(3):1198–1222.