1 Introduction
Existing methods of neural word embeddings are typically designed to go through the entire training data multiple times. For example, negative sampling (Mikolov et al., 2013b) needs to precompute the noise distribution from the entire training data before performing Stochastic Gradient Descent (SGD). It thus needs to go through the training data at least twice. Similarly, hierarchical softmax (Mikolov et al., 2013b) has to determine the tree structure, and GloVe (Pennington et al., 2014) has to count co-occurrence frequencies, before performing SGD.
The fact that those existing methods are multi-pass algorithms means that they cannot perform incremental model updates when additional training data is provided. Instead, they have to retrain the model on the old and new training data from scratch.
However, retraining is inefficient since it has to process the entire training data received thus far whenever new training data is provided. This is especially problematic when the new training data is much smaller than the old data. One such situation is that the embedding model is updated on a small amount of training data that includes newly emerged words, so that they can be added to the vocabulary set immediately. Another situation is that the word embeddings are learned from ever-evolving data such as news articles and microblogs (Peng et al., 2017) and the embedding model is periodically updated on newly generated data (e.g., once a week or month).
This paper investigates an incremental training method of word embeddings with a focus on the skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013b) because of its popularity. We present a simple incremental extension of SGNS, referred to as incremental SGNS, and provide a thorough theoretical analysis to demonstrate its validity. Our analysis reveals that, under a mild assumption, the optimal solution of incremental SGNS agrees with that of the original SGNS when the training data size is infinitely large. See Section 4 for the formal and strict statement. Additionally, we present techniques for the efficient implementation of incremental SGNS.
Three experiments were conducted to assess the correctness of the theoretical analysis as well as the practical usefulness of incremental SGNS. The first experiment empirically investigates the validity of the theoretical analysis result. The second experiment compares the word embeddings learned by incremental SGNS and the original SGNS across five benchmark datasets, and demonstrates that those word embeddings are of comparable quality. The last experiment explores the training time of incremental SGNS, demonstrating that it is able to save much training time by avoiding expensive retraining when additional training data is provided.
2 SGNS Overview
As a preliminary, this section provides a brief overview of SGNS.
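Given a word sequence, $w_1, w_2, \dots, w_n$, for training, the skip-gram model seeks to minimize the following objective to learn word embeddings:
$$L_{\mathrm{SG}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \log p(w_{i+j} \mid w_i),$$
where $w_i$ is a target word and $w_{i+j}$ is a context word within a window of size $c$. $p(w_{i+j} \mid w_i)$ represents the probability that $w_{i+j}$ appears within the neighbor of $w_i$, and is defined as
$$p(w_{i+j} \mid w_i) = \frac{\exp(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w_{i+j}})}{\sum_{w \in W} \exp(\mathbf{t}_{w_i} \cdot \mathbf{c}_{w})}, \tag{1}$$
where $\mathbf{t}_w$ and $\mathbf{c}_w$ are $w$'s embeddings when it behaves as a target and context, respectively. $W$ represents the vocabulary set.
Since it is too expensive to optimize the above objective, Mikolov et al. (2013b) proposed negative sampling to speed up skip-gram training. This approximates Eq. (1) using sigmoid functions and $k$ randomly sampled words, called negative samples. The resulting objective is given as
$$L_{\mathrm{SGNS}} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \left( \psi^{+}_{w_i, w_{i+j}} + k\, \mathbb{E}_{v \sim q(v)}\!\left[ \psi^{-}_{w_i, v} \right] \right),$$
where $\psi^{+}_{w,v} = \log \sigma(\mathbf{t}_w \cdot \mathbf{c}_v)$, $\psi^{-}_{w,v} = \log \sigma(-\mathbf{t}_w \cdot \mathbf{c}_v)$, and $\sigma(x)$ is the sigmoid function. The negative sample $v$ is drawn from a smoothed unigram probability distribution referred to as the noise distribution: $q(v) \propto f(v)^{\alpha}$, where $f(v)$ represents the frequency of a word $v$ in the training data and $\alpha$ is a smoothing parameter ($0 \le \alpha \le 1$).
The objective is optimized by SGD. Given a target-context word pair ($w_i$ and $w_{i+j}$) and $k$ negative samples ($v_1, v_2, \dots, v_k$) drawn from the noise distribution, the gradient of $-\psi^{+}_{w_i, w_{i+j}} - \sum_{k'=1}^{k} \psi^{-}_{w_i, v_{k'}}$ is computed. Then, gradient descent is performed to update $\mathbf{t}_{w_i}$, $\mathbf{c}_{w_{i+j}}$, and $\mathbf{c}_{v_1}, \dots, \mathbf{c}_{v_k}$.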
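As a rough illustration of the SGD step described above, the following Python sketch updates one target vector against one observed context vector and a few negative samples. It is a minimal sketch under stated assumptions, not the word2vec implementation: the function names (`sgns_step`, `sigmoid`, `dot`), the plain-list vector representation, and the fixed learning rate `lr` are all choices made for this example.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_step(target_vec, context_vecs, labels, lr):
    """One SGD step for a target vector and its context vectors.

    labels[i] is 1 for the observed context word and 0 for a negative
    sample. Vectors are updated in place. The sampled loss, evaluated
    before the update, is returned.
    """
    grad_target = [0.0] * len(target_vec)
    loss = 0.0
    for vec, label in zip(context_vecs, labels):
        score = sigmoid(dot(target_vec, vec))
        # -log sigma(t.c) for a positive pair, -log sigma(-t.c) for a negative.
        p = score if label == 1 else 1.0 - score
        loss -= math.log(max(p, 1e-12))
        g = score - label  # derivative of the loss w.r.t. the dot product
        for d in range(len(vec)):
            grad_target[d] += g * vec[d]      # uses the pre-update context value
            vec[d] -= lr * g * target_vec[d]  # update the context vector
    for d in range(len(target_vec)):
        target_vec[d] -= lr * grad_target[d]
    return loss
```

Calling `sgns_step` repeatedly on the same pair should drive the sampled loss down, which is a quick sanity check on the sign of the gradient.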
SGNS training needs to scan the entire training data multiple times because it has to precompute the noise distribution before performing SGD. This makes it difficult to perform incremental model update when additional training data is provided.
3 Incremental SGNS
This section explores incremental training of SGNS. The incremental training algorithm (Section 3.1), its efficient implementation (Section 3.2), and the computational complexity (Section 3.3) are discussed in turn.
3.1 Algorithm
Algorithm 1 presents incremental SGNS, which goes through the training data in a single pass to update word embeddings incrementally. Unlike the original SGNS, it does not precompute the noise distribution. Instead, it reads the training data word by word^1 to incrementally update the word frequency distribution and the noise distribution while performing SGD. Hereafter, the original SGNS (cf. Section 2) is referred to as batch SGNS to emphasize that the noise distribution is computed in a batch fashion.
^1 In practice, Algorithm 1 buffers a sequence of words $w_{i-c}, \dots, w_{i+c}$ (rather than a single word $w_i$) at each step, as it requires access to the context words in line 7. This is not a practical problem because the window size $c$ is usually small and independent of the training data size $n$.
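The single-pass control flow can be sketched as follows. This is an illustrative sketch rather than a transcription of Algorithm 1: the generator name `incremental_pairs` and its parameters are hypothetical, list indexing stands in for the small word buffer, and the noise distribution is rebuilt naively from the counts seen so far, which costs time linear in the vocabulary size per word and ignores the efficiency techniques of Section 3.2.

```python
import random
from collections import Counter


def incremental_pairs(words, window=2, num_neg=2, alpha=0.75, seed=0):
    """Yield (target, context, negatives) in one pass over `words`.

    Negative samples are drawn from the smoothed unigram distribution of
    the words read so far, not of the whole corpus.
    """
    rng = random.Random(seed)
    freq = Counter()
    for i, target in enumerate(words):
        freq[target] += 1  # update the word frequency before sampling
        vocab = list(freq)
        weights = [freq[w] ** alpha for w in vocab]
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            negatives = rng.choices(vocab, weights=weights, k=num_neg)
            # An SGD update on (target, words[j], negatives) would go here.
            yield target, words[j], negatives
```

For the first word of a stream, the only possible negative sample is that word itself, which makes the dependence on the prefix of the data easy to see.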
The learning rate for SGD is adjusted by using AdaGrad (Duchi et al., 2011). Although the linear decay function has widely been used for training batch SGNS (Mikolov, 2013), adaptive methods such as AdaGrad are more suitable for the incremental training since the amount of training data is unknown in advance or can increase unboundedly.
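A per-dimension AdaGrad update can be sketched as below. The function name `adagrad_update`, the base rate `lr`, and the stabilizer `eps` are assumptions for illustration. The point of the sketch is that the step size depends only on the gradients observed so far, so the total amount of training data never has to be known.

```python
import math


def adagrad_update(param, grad, accum, lr=0.1, eps=1e-8):
    """Apply one AdaGrad step in place.

    `accum` holds the running sum of squared gradients for each dimension
    and is updated together with `param`.
    """
    for d, g in enumerate(grad):
        accum[d] += g * g
        param[d] -= lr * g / (math.sqrt(accum[d]) + eps)
```

In contrast, a linearly decaying learning rate needs the total number of training words up front, which is unavailable when data keeps arriving.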
It is straightforward to extend incremental SGNS to the minibatch setting by reading a subset of the training data (or minibatch), rather than a single word, at a time to update the noise distribution and perform SGD (Algorithm 2). Although this paper primarily focuses on incremental SGNS, the minibatch algorithm is also important in practice because it is easier to multithread.
Alternatives to Algorithm 2 might be possible. Other possible approaches include computing the noise distribution separately on each subset of the training data, fixing the noise distribution after computing it from the first (possibly large) subset, and so on. We exclude such alternatives from our investigation because it is difficult to justify them theoretically.
3.2 Efficient implementation
Although the incremental SGNS is conceptually simple, implementation issues are involved.
3.2.1 Dynamic vocabulary
One problem that arises when training incremental SGNS is how to maintain the vocabulary set. Since new words emerge endlessly in the training data, the vocabulary set can grow unboundedly and exhaust memory.
We address this problem by dynamically changing the vocabulary set. The Misra-Gries algorithm (Misra and Gries, 1982) is used to approximately keep track of the top-$m$ frequent words during training, and those words are used as the dynamic vocabulary set. This method allows the maximum vocabulary size to be explicitly limited to $m$, while being able to dynamically change the vocabulary set.
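The counting scheme can be sketched as below. This is a textbook variant of the Misra-Gries summary with at most `m` counters. The function name `misra_gries` is hypothetical, and how the embeddings of evicted words are handled is not covered here.

```python
def misra_gries(stream, m):
    """Return approximate counts for at most `m` frequent items in `stream`."""
    counters = {}
    for w in stream:
        if w in counters:
            counters[w] += 1
        elif len(counters) < m:
            counters[w] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The stored counts never exceed the true frequencies, and the number of tracked words never exceeds `m`, which is what bounds the memory used by the vocabulary.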


3.2.2 Adaptive unigram table
Another problem is how to generate negative samples efficiently. Since $k$ negative samples per target-context pair have to be generated from the noise distribution, the sampling speed has a significant effect on the overall training efficiency.
Let us first examine how negative samples are generated in batch SGNS. In a popular implementation (Mikolov, 2013), a word array (referred to as a unigram table) is constructed such that the number of occurrences of a word $w$ in it is proportional to $f(w)^{\alpha}$. See Table 1 for an example. Using the unigram table, negative samples can be efficiently generated by sampling the table elements uniformly at random. It takes only $O(1)$ time to generate one negative sample.
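A static unigram table of this kind can be sketched as follows. The helper names `build_unigram_table` and `sample_negative` and the small `table_size` are assumptions for the example, and the rounding of entry counts is a simplification.

```python
import random


def build_unigram_table(freq, alpha=0.75, table_size=1000):
    """Build a word array in which each word appears roughly in proportion
    to its frequency raised to the power `alpha`."""
    total = sum(f ** alpha for f in freq.values())
    table = []
    for w, f in freq.items():
        table.extend([w] * round(table_size * f ** alpha / total))
    return table


def sample_negative(table, rng=random):
    # One uniform index lookup, independent of the vocabulary size.
    return table[rng.randrange(len(table))]
```

Because the table is built once from fixed counts, any change in the word frequencies invalidates it, which is the limitation the adaptive table is meant to remove.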
The above method assumes that the noise distribution is fixed and thus cannot be used directly for the incremental training. One simple solution is to reconstruct the unigram table whenever new training data is provided. However, such a method is not effective for incremental SGNS, because the unigram table reconstruction requires time.^2
^2 This overhead is amortized in minibatch SGNS if the minibatch size is sufficiently large. Our discussion here is dedicated to efficiently performing the incremental training irrespective of the minibatch size.
We propose a reservoir-based algorithm for efficiently updating the unigram table (Vitter, 1985; Efraimidis, 2015) (Algorithm 3). The algorithm incrementally updates the unigram table $T$ while limiting its maximum size to $\tau$. In case $|T| < \tau$, it can be easily confirmed that the number of occurrences of a word $w$ in $T$ is $f(w)^{\alpha}$. In case $|T| = \tau$, since $z = \sum_{w \in W} f(w)^{\alpha}$ is equal to the normalization factor of the noise distribution, it can be proven by induction that, for all $j$, $T[j]$ is a word $w$ with probability $q(w)$. See Vitter (1985) and Efraimidis (2015) for reference.
Note on implementation
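In line 8, $F$ copies of $w_i$ are added to $T$, where $F$ is the increase in $f(w_i)^{\alpha}$ caused by the new occurrence of $w_i$. When $F$ is not an integer, the copies are generated so that their expected number becomes $F$. Specifically, $\lceil F \rceil$ copies are added to $T$ with probability $F - \lfloor F \rfloor$, and $\lfloor F \rfloor$ copies are added otherwise.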
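The probabilistic conversion of a non-integer count into an integer can be sketched as a stochastic rounding helper. The name `stochastic_round` is hypothetical. The helper returns the next larger integer with probability equal to the fractional part, so that the expected value equals the input.

```python
import math
import random


def stochastic_round(x, rng=random):
    """Round `x` down or up at random so that the expected result equals `x`."""
    lower = math.floor(x)
    return lower + (1 if rng.random() < x - lower else 0)
```

An integer input is always returned unchanged, because its fractional part is zero.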
The loop from line 10 to 12 becomes expensive if implemented straightforwardly because the maximum table size $\tau$ is typically set large (e.g., $\tau = 10^{8}$ in word2vec (Mikolov, 2013)). For acceleration, instead of checking all elements in the unigram table, $\tau F / z$ randomly chosen elements are substituted with $w_i$, where $F$ is the number of copies from line 8 and $z$ is the normalization factor of the noise distribution. Note that $\tau F / z$ is the expected number of table elements to be substituted in the original algorithm. This approximation achieves a great speed-up because we usually have $\tau F / z \ll \tau$. In fact, it can be proven that it takes $O(1)$ time when $\alpha = 1.0$. See Appendix A^3 for more discussion.
^3 The appendices are in the supplementary material.
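Putting these pieces together, an adaptive unigram table might look like the sketch below. It is written under explicit assumptions and is not a transcription of Algorithm 3 or 4: the class name `AdaptiveUnigramTable` is hypothetical, the per-occurrence increment is taken to be $f^{\alpha} - (f-1)^{\alpha}$ so that the total mass of a word stays $f(w)^{\alpha}$, and $0 < \alpha \le 1$ is assumed so that each increment is at most one.

```python
import math
import random


class AdaptiveUnigramTable:
    def __init__(self, max_size, alpha=0.75, seed=0):
        self.max_size = max_size
        self.alpha = alpha          # assumed to satisfy 0 < alpha <= 1
        self.rng = random.Random(seed)
        self.freq = {}
        self.z = 0.0                # running sum of f(w)^alpha over all words
        self.table = []

    def _round(self, x):
        # Stochastic rounding: the expected result equals x.
        lower = math.floor(x)
        return lower + (1 if self.rng.random() < x - lower else 0)

    def update(self, word):
        f = self.freq.get(word, 0) + 1
        self.freq[word] = f
        inc = f ** self.alpha - (f - 1) ** self.alpha
        self.z += inc
        if len(self.table) < self.max_size:
            # Table not yet full: append the (rounded) increment.
            self.table.extend([word] * self._round(inc))
        else:
            # Table full: overwrite the expected number of random slots
            # instead of scanning every slot.
            n_sub = self._round(self.max_size * inc / self.z)
            for _ in range(n_sub):
                self.table[self.rng.randrange(self.max_size)] = word

    def sample(self):
        return self.table[self.rng.randrange(len(self.table))]
```

With `alpha=1.0` every increment is exactly one, so once the table is full at most one slot is overwritten per update.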
3.3 Computational complexity
Both incremental and batch SGNS have the same space complexity, which is independent of the training data size $n$. Both require $O(|W|)$ space, where $W$ is the vocabulary set, to store the word embeddings and the word frequency counts, and $O(|T|)$ space to store the unigram table $T$.
The two algorithms also have the same time complexity. Both require $O(n)$ training time when the training data size is $n$. Although incremental SGNS requires extra time for updating the dynamic vocabulary and adaptive unigram table, these costs are practically negligible, as will be demonstrated in Section 5.3.
4 Theoretical Analysis
Although the extension from batch to incremental SGNS is simple and intuitive, it is not readily clear whether incremental SGNS can learn word embeddings as well as the batch counterpart. To answer this question, in this section we examine incremental SGNS from a theoretical point of view.
The analysis begins by examining the difference between the objectives optimized by batch and incremental SGNS (Section 4.1). Then, probabilistic properties of their difference are investigated to demonstrate the relationship between batch and incremental SGNS (Sections 4.2 and 4.3). We briefly touch on minibatch SGNS at the end of this section (Section 4.4).
4.1 Objective difference
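As discussed in Section 2, batch SGNS optimizes the following objective:
$$L_B(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \left( \psi^{+}_{w_i, w_{i+j}} + k\, \mathbb{E}_{v \sim q_n(v)}\!\left[ \psi^{-}_{w_i, v} \right] \right),$$
where $\theta = (\mathbf{t}_1, \dots, \mathbf{t}_{|W|}, \mathbf{c}_1, \dots, \mathbf{c}_{|W|})$ collectively represents the model parameter^4 (i.e., word embeddings) and $q_n(v)$ represents the noise distribution. Note that the noise distribution is represented in a different notation than Section 2 to make its dependence on the whole training data explicit. The function $q_i(v)$ is defined as $q_i(v) = f_i(v)^{\alpha} / \sum_{v' \in W} f_i(v')^{\alpha}$, where $f_i(v)$ represents the word frequency in the first $i$ words of the training data.
^4 We treat words as integers and thus $W = \{1, 2, \dots, |W|\}$.
In contrast, incremental SGNS computes the gradient of $-\psi^{+}_{w_i, w_{i+j}} - k\, \mathbb{E}_{v \sim q_i(v)}[\psi^{-}_{w_i, v}]$ at each step to perform gradient descent. Note that the noise distribution does not depend on $n$ but rather on $i$. Because it can be seen as a sample approximation of the gradient of
$$L_I(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{\substack{|j| \le c \\ j \ne 0}} \left( \psi^{+}_{w_i, w_{i+j}} + k\, \mathbb{E}_{v \sim q_i(v)}\!\left[ \psi^{-}_{w_i, v} \right] \right),$$
incremental SGNS can be interpreted as optimizing $L_I(\theta)$ with SGD.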
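The only difference between the two settings is which noise distribution each step sees: the one computed from the whole data or the one computed from the words read so far. A minimal sketch of the latter is given below, with hypothetical helper names `noise_distribution` and `prefix_noise_distributions`.

```python
from collections import Counter


def noise_distribution(counts, alpha=1.0):
    """Smoothed unigram distribution for the given word counts."""
    z = sum(c ** alpha for c in counts.values())
    return {w: c ** alpha / z for w, c in counts.items()}


def prefix_noise_distributions(words, alpha=1.0):
    """Yield the noise distribution after each word of the stream."""
    counts = Counter()
    for w in words:
        counts[w] += 1
        yield noise_distribution(counts, alpha)
```

The last distribution yielded coincides with the one batch training would precompute, while the earlier ones differ from it. The analysis in this section quantifies the effect of that discrepancy.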
Since the expectation terms in the objectives can be rewritten as , the difference between the two objectives can be given as
where is the delta function.
4.2 Unsmoothed case
Let us begin by examining the objective difference in the unsmoothed case, $\alpha = 1.0$.
The technical difficulty in analyzing is that it is dependent on the word order in the training data. To address this difficulty, we assume that the words in the training data are generated from some stationary distribution. This assumption allows us to investigate the property of from a probabilistic perspective. Regarding the validity of this assumption, we want to note that this assumption is already taken by the original SGNS: the probability that the target and context words cooccur is assumed to be independent of their position in the training data.
Below, we introduce some definitions and notation in preparation for the analysis.
Definition 1.
Let $X_{i,w}$ be a random variable that represents $\delta_{w_i, w}$. It takes 1 when the $i$-th word in the training data is $w$ and 0 otherwise.
Recall that we assume that the words in the training data are generated from a stationary distribution. This assumption means that the expectation and (co)variance of $X_{i,w}$ do not depend on the index $i$. Hereafter, they are respectively denoted as $\mathbb{E}[X_{i,w}] = \mu_w$ and $\mathbb{V}[X_{i,w}, X_{i,v}] = \rho_{w,v}$.
Definition 2.
Let $Y_{i,w}$ be a random variable that represents $q_i(w)$ when $\alpha = 1.0$. It is given as $Y_{i,w} = \frac{1}{i} \sum_{i'=1}^{i} X_{i',w}$.
4.2.1 Convergence of the first and second order moments of
It can be shown that the first order moment of
has an analytical form.Theorem 1.
The first order moment of is given as
where $H_n = \sum_{k=1}^{n} 1/k$ is the $n$-th harmonic number.
Sketch of proof.
Notice that can be written as
Because we have, for any and such that ,
plugging this into proves the theorem. See Appendix B.1 for the complete proof. ∎
Theorem 1 readily gives the convergence property of the first order moment of :
Theorem 2.
The firstorder moment of decreases in the order of :
and thus converges to zero in the limit of infinity:
Proof.
We have $H_n \le 1 + \log n$ from the upper integral bound, and thus Theorem 1 gives the proof. ∎
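For completeness, the integral bound on the harmonic number invoked in this proof can be written out as follows. This is the standard argument rather than a step taken from the paper, and $H_n = \sum_{k=1}^{n} 1/k$ is assumed to denote the $n$-th harmonic number:

```latex
\begin{align*}
H_n &= 1 + \sum_{k=2}^{n} \frac{1}{k}
     \le 1 + \sum_{k=2}^{n} \int_{k-1}^{k} \frac{dx}{x}
     = 1 + \int_{1}^{n} \frac{dx}{x}
     = 1 + \log n .
\end{align*}
```

The inequality uses $1/k \le 1/x$ for every $x \in [k-1, k]$.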
A similar result to Theorem 2 can be obtained for the second order moment of as well.
Theorem 3.
The secondorder moment of decreases in the order of :
and thus converges to zero in the limit of infinity:
Proof.
Omitted. See Appendix B.2. ∎
4.2.2 Main result
The above theorems reveal the relationship between the optimal solutions of the two objectives, as stated in the next lemma.
Lemma 4.
Let and be the optimal solutions of and , respectively: and . Then,
(2)  
(3) 
Proof.
The proof is made by the squeeze theorem. Let . The optimality of gives . Also, the optimality of gives
We thus have . Since Theorem 2 implies that the right hand side converges to zero when , the squeeze theorem gives Eq. (2). Next, we have
(4) 
Theorem 3 suggests that Eq. (4) converges to zero when . Also, the nonnegativity of the variance gives . Therefore, the squeeze theorem gives Eq. (3). ∎
We are now ready to provide the main result of the analysis. The next theorem shows the convergence of .
Theorem 5.
converges in probability to :
Proof.
Informally, this theorem can be interpreted as suggesting that the optimal solutions of batch and incremental SGNS agree when the training data size $n$ is infinitely large.
4.3 Smoothed case
We next examine the smoothed case ($\alpha < 1.0$). In this case, the noise distribution can be represented by using the ones in the unsmoothed case:
where and corresponds to the noise distribution in the unsmoothed case.
Definition 3.
Let be a random variable that represents in the smoothed case. Then, it can be written by using :
where .
Because is no longer a linear combination of , it becomes difficult to derive similar proofs to the unsmoothed case. To address this difficulty, is approximated by the firstorder Taylor expansion around
The firstorder Taylor approximation gives
where and . Consequently, it can be shown that the first and second order moments of have the order of in the smoothed case as well. See Appendix C for the details.
4.4 Minibatch SGNS
The same analysis result can also be obtained for the minibatch SGNS. We can prove Theorems 2 and 3 in the minibatch case as well (see Appendix D for the proof). The other part of the analysis remains the same.
5 Experiments
Three experiments were conducted to investigate the correctness of the theoretical analysis (Section 5.1) and the practical usefulness of incremental SGNS (Sections 5.2 and 5.3). Details of the experimental settings that do not fit into the paper are presented in Appendix E.
5.1 Validation of theorems
An empirical experiment was conducted to validate the result of the theoretical analysis. Since it is difficult to assess the main result in Section 4.2.2 directly, the theorems in Sections 4.2.1, from which the main result is readily derived, were investigated. Specifically, the first and second order moments of were computed on datasets of increasing sizes to empirically investigate the convergence property.
Datasets of various sizes were constructed from the English Gigaword corpus (Napoles et al., 2012). The datasets made up of words were constructed by randomly sampling sentences from the Gigaword corpus. The value of was varied over . different datasets were created for each size to compute the first and second order moments.
Figure 1 (top left) shows loglog plots of the first order moments of computed on the different sized datasets when . The crosses and circles represent the empirical values and theoretical values obtained by Theorem 1, respectively. Figure 1 (top right) similarly illustrates the second order moments of . Since Theorem 3 suggests that the second order moment decreases in the order of , the graph is also shown. The graph was fitted to the empirical data by minimizing the squared error.
The top left figure demonstrates that the empirical values of the first order moments fit the theoretical result very well, providing a strong empirical evidence for the correctness of Theorem 1. In addition, the two figures show that the first and second order moments decrease almost in the order of , converging to zero as the data size increases. This result validates Theorems 2 and 3.
Figures 1 (bottom left) and (bottom right) show similar results when . Since we do not have theoretical estimates of the first order moment when , the graphs are shown in both figures. From these, we can again observe that the first and second order moments decrease almost in the order of . This indicates the validity of the investigation in Section 4.3. The relatively larger deviations from the graphs , compared with the top right figure, are attributed to the first-order Taylor approximation.
5.2 Quality of word embeddings
The next experiment investigates the quality of the word embeddings learned by incremental SGNS through comparison with the batch counterparts.
The Gigaword corpus was used for the training. For the comparison, both our own implementation of batch SGNS as well as word2vec (Mikolov et al., 2013c) were used (denoted as batch and w2v). The training configurations of the three methods were set the same as much as possible, although it is impossible to do so perfectly. For example, incremental SGNS (denoted as incremental) utilized the dynamic vocabulary (c.f., Section 3.2.1) and thus we set the maximum vocabulary size to control the vocabulary size. On the other hand, we set a frequency threshold to determine the vocabulary size of w2v. We set k for incremental, while setting the frequency threshold to for w2v. This yields vocabulary sets of comparable sizes: and .
The learned word embeddings were assessed on five benchmark datasets commonly used in the literature (Levy et al., 2015): WordSim-353 (Agirre et al., 2009), MEN (Bruni et al., 2013), SimLex-999 (Hill et al., 2015), the MSR analogy dataset (Mikolov et al., 2013c), and the Google analogy dataset (Mikolov et al., 2013a). The first three are for a semantic similarity task, and the remaining two are for a word analogy task. As evaluation measures, Spearman's $\rho$ and prediction accuracy were used in the two tasks, respectively.
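For the similarity task, the evaluation measure can be sketched as below: model scores (for instance, cosine similarities between embeddings) are compared with human ratings through Spearman's rank correlation. The helper names are hypothetical. The sketch assumes that neither input is constant, and the experiments may have relied on a standard library routine instead.

```python
import math


def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den


def average_ranks(values):
    """Rank the values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

Here `x` would hold the model's similarity scores for the word pairs of a dataset and `y` the corresponding human ratings.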
Figures 2 (a) and (b) show the results on the similarity datasets and the analogy datasets, respectively. We see that the three methods (incremental, batch, and w2v) perform equally well on all of the datasets. This indicates that incremental SGNS can learn word embeddings as good as those of its batch counterparts, while being able to perform incremental model updates. Although incremental performs slightly better than the batch methods on some datasets, the difference seems to be a product of chance.
The figures also show the results of incremental SGNS when the maximum vocabulary size was reduced to k and k (incremental150k and incremental100k). The resulting vocabulary sizes were and , respectively. We see that incremental150k and incremental100k perform comparatively well with incremental, although relatively large performance drops are observed in some datasets (MEN and MSR). This demonstrates that the MisraGries algorithm can effectively control the vocabulary size.
[Figure 2: (a) results on the similarity datasets; (b) results on the analogy datasets; (c) update time.]
5.3 Update time
The last experiment investigates how much time incremental SGNS can save by avoiding retraining when updating the word embeddings.
In this experiment, incremental was first trained on the initial training data of size^5 and then updated on the new training data of size to measure the update time. For comparison, batch and w2v were retrained on the combination of the initial and new training data. We fixed and varied over .
^5 The number of sentences here.
Figure 2 (c) compares the update time of the three methods across various values of . We see that incremental significantly reduces the update time. It achieves and times speedup compared with batch and w2v (when ). This represents the advantage of the incremental algorithm, as well as the time efficiency of the dynamic vocabulary and adaptive unigram table. We note that batch is slower than w2v because it uses AdaGrad, which maintains different learning rates for different dimensions of the parameter, while w2v uses the same learning rate for all dimensions.
6 Related Work
Word representations based on distributional semantics have been common (Turney and Pantel, 2010; Baroni and Lenci, 2010). The distributional methods typically begin by constructing a wordcontext matrix and then applying dimension reduction techniques such as SVD to obtain highquality word meaning representations. Although some investigated incremental updating of the wordcontext matrix (Yin et al., 2015; Goyal and Daume III, 2011), they did not explore the reduced representations. On the other hand, neural word embeddings have recently gained much popularity as an alternative. However, most previous studies have not explored incremental strategies (Mikolov et al., 2013a, b; Pennington et al., 2014).
Very recently, Peng et al. (2017) proposed an incremental learning method of hierarchical softmax. Because hierarchical softmax and negative sampling have different advantages (Peng et al., 2017), the incremental SGNS and their method are complementary to each other. Also, their updating method needs to scan not only new but also old training data, and thus is not an incremental algorithm in a strict sense. As a consequence, it potentially incurs the same time complexity as the retraining. Another consequence is that their method has to retain the old training data and thus wastes space, while incremental SGNS can discard old training examples after processing them.
There are publicly available implementations for training SGNS, one of the most popular being word2vec (Mikolov, 2013). However, it does not support incremental training. Gensim (Řehůřek and Sojka, 2010) also offers SGNS training. Although Gensim allows the incremental updating of SGNS models, it is done in an ad hoc manner. In Gensim, the vocabulary set as well as the unigram table are fixed once trained, meaning that new words cannot be added. Also, no theoretical account is provided for the validity of its training method.
7 Conclusion and Future Work
This paper proposed incremental SGNS and provided a thorough theoretical analysis to demonstrate its validity. We also conducted experiments to empirically demonstrate its effectiveness. Although incremental model updates are often required in practical machine learning applications, little attention has been paid to learning word embeddings incrementally. We consider that incremental SGNS successfully addresses this situation and serves as a useful tool for practitioners.
The success of this work suggests several research directions to be explored in the future. One possibility is to explore extending other embedding methods such as GloVe (Pennington et al., 2014) to incremental algorithms. Such studies would further extend the potential of word embedding methods.
References
 Agirre et al. (2009) Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and WordNet-based approaches. In Proceedings of NAACL. pages 19–27. http://www.aclweb.org/anthology/N/N09/N091003.
 Baroni and Lenci (2010) Marco Baroni and Alessandro Lenci. 2010. Distributional memory: A general framework for corpus-based semantics. Computational Linguistics 36:673–721. http://aclweb.org/anthology/J/J10/J104006.
 Bruni et al. (2013) E. Bruni, N. K. Tran, and M. Baroni. 2013. Multimodal distributional semantics. Journal of Artificial Intelligence Research 49:1–49.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12:2121–2159.
 Efraimidis (2015) Pavlos S. Efraimidis. 2015. Weighted random sampling over data streams. ArXiv:1012.0256.
 Goyal and Daume III (2011) Amit Goyal and Hal Daume III. 2011. Approximate scalable bounded space sketch for large data nlp. In Proceedings of EMNLP. pages 250–261. http://www.aclweb.org/anthology/D111023.
 Hill et al. (2015) Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41:665–695. http://aclweb.org/anthology/J/J15/J154004.
 Levy et al. (2015) Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics 3:211–225. https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/570.
 Mikolov (2013) Tomas Mikolov. 2013. word2vec. https://code.google.com/archive/p/word2vec.

 Mikolov et al. (2013a) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Workshop at ICLR.
 Mikolov et al. (2013b) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in NIPS. pages 3111–3119.
 Mikolov et al. (2013c) Tomas Mikolov, WenTau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of NAACL. pages 746–751. http://www.aclweb.org/anthology/N131090.
 Misra and Gries (1982) Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of Computer Programming 2(2):143–152.
 Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated english gigaword ldc2012t21.
 Peng et al. (2017) Hao Peng, Jianxin Li, Yangqiu Song, and Yaopeng Liu. 2017. Incrementally learning the hierarchical softmax function for neural language models. In Proceedings of AAAI (to appear).
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP. pages 1532–1543. http://www.aclweb.org/anthology/D141162.
 Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. pages 45–50.
 Turney and Pantel (2010) Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37:141–188.
 Vitter (1985) Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software 11:37–57.
 Yin et al. (2015) Wenpeng Yin, Tobias Schnabel, and Hinrich Schütze. 2015. Online updating of word representations for partofspeech tagging. In Proceedings of EMNLP. pages 1329–1334. http://aclweb.org/anthology/D151155.
Appendix A Note on Adaptive Unigram Table
Algorithm 4 illustrates the efficient implementation of the adaptive unigram table (cf. Section 3.2.2). In lines 8 and 10, $F$ and $\tau F / z$ (the number of copies to add and the expected number of elements to substitute, respectively) are not always integers and therefore they are probabilistically converted into integers as explained in the paper.
The time complexity of Algorithm 4 is $O(1)$ per update in the case of $\alpha = 1.0$. When $|T| < \tau$, the update (line 8) takes $O(1)$ time since we always have $F = 1$. When $|T| = \tau$, we have $z \ge \tau$ and consequently $\tau F / z \le 1$. This means that the update (lines 10–13) takes $O(1)$ time.
Even if $\alpha \ne 1.0$, the value of $z$ becomes sufficiently large in practice, and thus the update becomes efficient as demonstrated in the experiment.
Appendix B Complete Proofs
This appendix provides complete proofs of Theorems 1, 3, and 5.
B.1 Proof of Theorem 1
Proof.
The first order moment of can be rewritten as
Here, for any and such that , we have
Therefore, we have
∎
B.2 Proof of Theorem 3
To prove Theorem 3, we begin by examining the upper and lower bounds of in the following lemma, and then make use of the bounds to evaluate the order of the second order moment of .
Lemma 6.
For any and such that , we have
Proof.
We have
To prove the lemma, we rewrite the expression by splitting the set of into two subsets. Let be a set of such that , , and are independent from each other (i.e., , , and are all different), and let be its complementary set:
Then, is upperbounded as
where the inequality holds because , , and are binary random variables and thus . Here, we have , because includes elements such that and also includes and elements such that and , respectively. And we consequently have . Therefore, the upperbound can be rewritten as
Similarly, by making use of , the lowerbound can be derived:
∎
Making use of the above lemma, we can prove Theorem 3.
Proof.
The upperbound of is examined to prove the theorem. Let . Making use of Jensen’s inequality, we have