1 Introduction
Recently, embedding representations have been widely used in almost all AI-related fields, from feature maps Krizhevsky et al. (2012) in computer vision, to word embeddings Mikolov et al. (2013); Pennington et al. (2014) in natural language processing, to user/item embeddings Mnih and Salakhutdinov (2008); Hu et al. (2008) in recommender systems. Usually, the embeddings are high-dimensional vectors. Take language models for example: in GPT Radford et al. (2018) and the BERT-Base model Devlin et al. (2018), 768-dimensional vectors are used to represent words; the BERT-Large model utilizes 1024-dimensional vectors, and GPT-2 Radford et al. (2019) may have used even higher dimensions in its unreleased large models. In recommender systems, things are slightly different: the dimension of user/item embeddings is usually set to be reasonably small, such as 50 or 100, but the number of users and items is on a much bigger scale. While the size of a word vocabulary normally ranges from 50,000 to 150,000, the number of users and items can be millions or even billions in large-scale real-world commercial recommender systems Bennett et al. (2007).
Given the massive number of parameters in modern neural networks with embedding layers, mitigating over-parameterization can play a big role in preventing overfitting in deep learning. We propose a regularization method, Stochastically Shared Embeddings (SSE), that uses prior information about similarities between embeddings, such as semantically and grammatically related words in natural languages or real-world users who share social relationships. Critically, SSE proceeds by stochastically transitioning between embeddings, as opposed to more brute-force regularization such as graph-based Laplacian regularization and ridge regularization. Thus, SSE integrates seamlessly with existing stochastic optimization methods, and the resulting regularization is data-driven.
We begin the paper with the mathematical formulation of the problem, propose SSE, and provide the motivations behind it. We provide a theoretical analysis of SSE that can be compared with excess risk bounds based on empirical Rademacher complexity. We then conduct experiments on a total of 6 tasks, from simple neural networks with one hidden layer in recommender systems to the Transformer and BERT in natural language processing, and find that when used along with widely used regularization methods such as weight decay and dropout, our proposed method can further reduce overfitting, which often leads to more favorable generalization results.
2 Related Work
Regularization techniques are used to control model complexity and avoid overfitting. ℓ2 regularization Hoerl and Kennard (1970) is the most widely used approach and has been used in many matrix factorization models in recommender systems; ℓ1 regularization Tibshirani (1996) is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓ2 regularization is often too weak, while dropout Hinton et al. (2012); Srivastava et al. (2014) is more effective in practice. There are many other regularization techniques, including parameter sharing Goodfellow et al. (2016) and max-norm regularization Srebro et al. (2005); Pascanu et al. (2013).
Our proposed SSE-Graph is very different from graph Laplacian regularization Cai et al. (2011), in which the distances between any two embeddings connected over the graph are directly penalized. Hard parameter sharing uses one embedding to replace all distinct embeddings in the same group, which inevitably introduces a significant bias. Soft parameter sharing Nowlan and Hinton (1992) is similar to the graph Laplacian, penalizing the distances between any two embeddings. These methods have no dependence on the loss, while the proposed SSE-Graph method is data-driven in that the loss influences the effect of regularization. Unlike graph Laplacian regularization and hard and soft parameter sharing, our method is stochastic by nature. This allows our model to enjoy similar advantages as dropout Srivastava et al. (2014).
Interestingly, in the original BERT model's pre-training stage Devlin et al. (2018), a variant of SSE-SE is already implicitly used for token embeddings, but for a different reason. In Devlin et al. (2018), the authors masked 15% of words, and 10% of the time replaced the [mask] token with a random token. In the next section, we discuss how SSE-SE differs from this heuristic. Another closely related technique to ours is label smoothing Szegedy et al. (2016), which is widely used in the computer vision community. We find that in the classification setting, if we apply SSE-SE to the one-hot encodings associated with the output only, our SSE-SE is closely related to label smoothing, which can be treated as a special case of our proposed method.
3 Stochastically Shared Embeddings
Throughout this paper, the network input x_i and label y_i will be encoded into indices j_1^i, …, j_M^i, which are elements of I_1 × ⋯ × I_M, the index sets of the embedding tables. A typical choice is that the indices are the encoding of a dictionary for words in natural language applications, or user and item tables in recommendation systems. Each index, j_l, within the l-th table, is associated with an embedding E_l[j_l], which is a trainable vector in R^{d_l}. The embeddings associated with the label y_i are usually non-trainable one-hot vectors corresponding to label look-up tables, while embeddings associated with the input x_i are trainable embedding vectors for embedding look-up tables. In natural language applications, we appropriately modify this framework to accommodate sequences such as sentences.
The loss function can be written as a function of the embeddings:

(1)   min_θ Σ_{i=1}^{n} ℓ(x_i, y_i | θ)

where y_i is the label and θ encompasses all trainable parameters, including the embeddings {E_l[j_l] : j_l ∈ I_l, l = 1, …, M}. The loss function ℓ is a mapping from embedding spaces to the reals. For text input, each E_l[j_l^i] is a word embedding vector in the input sentence or document. For recommender systems, there are usually two embedding look-up tables: one for users and one for items He et al. (2017), so the objective function, such as the mean squared loss or some ranking loss, will comprise both user and item embeddings for each input. We can more succinctly write the matrix of all embeddings for the i-th sample as E[j^i] = (E_1[j_1^i], …, E_M[j_M^i]), where j^i = (j_1^i, …, j_M^i). By an abuse of notation we write the loss as a function of the embedding matrix, ℓ(E[j^i] | θ).
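As a concrete toy instance of writing the loss as a function of the looked-up embeddings, consider the squared loss of matrix factorization for one (user, item, rating) sample. The sketch below is purely illustrative (the function name and plain-list representation of embeddings are ours, not the paper's):

```python
def mf_loss(user_emb, item_emb, rating):
    """Squared loss for one explicit-feedback sample, written as a
    function of the two embeddings it looks up: the model parameters
    enter the loss only through these two vectors."""
    pred = sum(u * v for u, v in zip(user_emb, item_emb))  # inner product
    return (pred - rating) ** 2
```

Replacing either embedding vector (as SSE does stochastically) changes the loss only through this look-up, which is what makes the formulation in terms of embedding matrices convenient.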
Suppose that we have access to knowledge graphs Miller (1995); Lehmann et al. (2015) over embeddings, and we have a prior belief that two embeddings will share information and replacing one with the other should not incur a significant change in the loss distribution. For example, if two movies are both comedies and they are starred by the same actors, it is very likely that for the same user, replacing one comedy movie with the other comedy movie will result in little change in the loss distribution. In stochastic optimization, we can replace the loss gradient for one movie’s embedding with the other similar movie’s embedding, and this will not significantly bias the gradient if the prior belief is accurate. On the other hand, if this exchange is stochastic, then it will act to smooth the gradient steps in the long run, thus regularizing the gradient updates.
3.1 General SSE with Knowledge Graphs: SSE-Graph
Instead of optimizing the objective function in (1), SSE-Graph, described in Algorithm 1, Figure 1, and Figure 2, approximately optimizes the objective function below:

(2)   S_n(θ) = Σ_{i=1}^{n} Σ_{k ∈ I} p(j^i, k | Φ) · ℓ(E[k] | θ)
where p(j, k | Φ) is the transition probability (with parameters Φ) of exchanging the encoding vector E[j] with a new encoding vector E[k] in the Cartesian product index set I = I_1 × ⋯ × I_M of all embedding tables. When there is a single embedding table (M = 1), there are no hard restrictions on the transition probabilities p(·, · | Φ), but when there are multiple tables (M > 1) we will enforce that p(·, · | Φ) takes a tensor product form (see (4)). When we are assuming that there is only a single embedding table (M = 1), we will not bold j, E[j] and will suppress their indices.
In the single embedding table case, M = 1, there are many ways to define the transition probability from j to k. One simple and effective way is to use a random walk (with random restart and self-loops) on a knowledge graph, i.e. when embedding j is connected with k but not with l, we can set the ratio of p(j, k | Φ) and p(j, l | Φ) to be a constant greater than 1. In more formal notation, we have
(3)   j ~ k,  j ≁ l  ⟹  p(j, k | Φ) / p(j, l | Φ) = ρ,

where "~" denotes connection in the knowledge graph and ρ > 1 is a tuning parameter. This is motivated by the fact that embeddings connected with each other in knowledge graphs should bear more resemblance and thus be more likely to be replaced by each other. Also, we let Σ_{k ≠ j} p(j, k | Φ) = p_0, where p_0 is called the SSE probability, and the embedding retainment probability is 1 − p_0. We treat both ρ and p_0 as tuning hyper-parameters in experiments. With (3) and Σ_k p(j, k | Φ) = 1, we can derive the transition probabilities between any two embeddings to fill out the transition probability table.
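To make the construction concrete, the sketch below (our own illustrative code, not the authors' implementation) fills out one such transition table from an undirected knowledge graph: indices connected to j get a weight ρ times larger than unconnected ones, the total off-diagonal mass is the SSE probability p_0, and index j keeps its own embedding with probability 1 − p_0:

```python
def transition_probs(num_embeddings, edges, rho=100.0, p0=0.01):
    """Build an SSE-Graph transition table (list of per-index rows).

    Neighbors in the knowledge graph are rho times more likely to be
    swapped in than non-neighbors, the off-diagonal mass of each row
    sums to p0, and each row sums to 1. Assumes num_embeddings >= 2.
    """
    neighbors = {j: set() for j in range(num_embeddings)}
    for j, k in edges:  # undirected edges
        neighbors[j].add(k)
        neighbors[k].add(j)
    table = []
    for j in range(num_embeddings):
        # unnormalized weights over k != j: rho for neighbors, 1 otherwise
        w = [0.0 if k == j else (rho if k in neighbors[j] else 1.0)
             for k in range(num_embeddings)]
        z = sum(w)
        row = [p0 * wk / z for wk in w]  # off-diagonal mass sums to p0
        row[j] = 1.0 - p0                # embedding retainment probability
        table.append(row)
    return table
```

By construction, the ratio of a neighbor's probability to a non-neighbor's probability in any row is exactly ρ, matching the random-walk intuition above.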
When there are multiple embedding tables, M > 1, we will force the transition from j to k to decompose into independent transitions from j_l to k_l within embedding table l (and index set I_l). Each table may have its own knowledge graph, resulting in its own transition probabilities p_l(·, · | Φ_l). The more general form of the SSE-Graph objective is given below:

(4)   S_n(θ) = Σ_{i=1}^{n} Σ_{k_1, …, k_M} p_1(j_1^i, k_1 | Φ_1) ⋯ p_M(j_M^i, k_M | Φ_M) · ℓ(E_1[k_1], …, E_M[k_M] | θ)
Optimizing (4) with SGD or its variants (Adagrad Duchi et al. (2011), Adam Kingma and Ba (2014)) is simple: we just need to randomly switch each original embedding tensor with another embedding tensor randomly sampled according to the transition probability (see Algorithm 1). This is equivalent to having a randomized embedding look-up layer, as shown in Figure 1.
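A minimal sketch of the resulting randomized look-up (hypothetical helper; a real implementation would operate on index tensors inside the framework's embedding layer): at each SGD iteration, every index is re-sampled from its row of a transition-probability table before the embedding is fetched, so gradient updates are stochastically shared among related embeddings.

```python
import random

def sse_lookup(j, table, rng=random):
    """Stochastically replace index j before the embedding look-up,
    sampling from row j of a transition table whose rows sum to 1.
    Called fresh at every training iteration."""
    u, acc = rng.random(), 0.0
    for k, p in enumerate(table[j]):
        acc += p
        if u < acc:
            return k
    return j  # numerical fallback if the row sums to slightly under 1
```

With probability 1 − p_0 the sampled index is j itself, so most iterations behave exactly like a standard look-up; the occasional swap is what injects the regularization.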
We can also accommodate sequences of embeddings, which commonly occur in natural language applications, by considering a tuple of indices instead of the single index j_l^i for the l-th embedding table in (4), with one entry per embedding in table l associated with the i-th sample. When there is more than one embedding look-up table, we sometimes prefer to use a different p_0 and ρ for different look-up tables in (3) and the SSE probability constraint. For example, in recommender systems, we would use (ρ_u, p_u) for the user embedding table and (ρ_i, p_i) for the item embedding table.
We find that SSE with knowledge graphs, i.e., SSE-Graph, can force similar embeddings to cluster when compared to the original neural network without SSE-Graph. In Figure 3, one can easily see that more embeddings tend to cluster into 2 black holes after applying SSE-Graph, when embeddings are projected into 3D spaces using PCA. Interestingly, a similar phenomenon occurs when assuming the knowledge graph is a complete graph, which leads to the SSE-SE variant introduced below.
3.2 Simplified SSE with Complete Graph: SSE-SE
One clear limitation of applying SSE-Graph is that not every dataset comes with good-quality knowledge graphs on embeddings. For those cases, we could assume there is a complete graph over all embeddings, so there is a small transition probability between every pair of different embeddings:

(5)   p(j, k | Φ) = p_0 / (N − 1)  for all k ≠ j,

where N is the size of the embedding table and, as before, p(j, j | Φ) = 1 − p_0. The SGD procedure in Algorithm 1 can still be applied, and we call this algorithm SSE-SE (Stochastic Shared Embeddings - Simple and Easy). It is worth noting that SSE-Graph and SSE-SE are applied not only to embeddings associated with the input but also to those associated with the output. Unless there are considerably more embeddings than data points and the model is significantly overfitting, a small SSE probability p_0 normally gives reasonably good results.
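Under the complete-graph assumption, sampling a replacement index needs no transition table at all. A minimal sketch of SSE-SE's index replacement, with hypothetical names (real implementations live inside each framework's embedding layer and operate on whole index tensors):

```python
import random

def sse_se(j, num_embeddings, p0=0.01, rng=random):
    """SSE-SE replacement following (5): keep index j with probability
    1 - p0, otherwise swap in one of the other N - 1 indices uniformly
    at random. Applied independently at every training iteration."""
    if rng.random() < p0:
        k = rng.randrange(num_embeddings - 1)
        return k if k < j else k + 1  # skip j itself
    return j
```

Because the draw is repeated at every iteration, the same training example stochastically spreads its gradient over many embeddings across the course of training, unlike a one-time data augmentation.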
Interestingly, we found that the SSE-SE framework is related to several techniques used in practice. For example, BERT pre-training unintentionally applied a method similar to SSE-SE to the input by replacing the masked word with a random word. This implicitly introduces an SSE layer for the input in Figure 1, because embeddings associated with the input are now stochastically mapped according to (5). The main difference between this and SSE-SE is that the BERT heuristic merely augments the input once, while SSE introduces randomization at every iteration, and we can also accommodate label embeddings. In Section 4.4, we will show that SSE-SE improves both the original BERT pre-training procedure and the fine-tuning procedure.
3.3 Theoretical Guarantees
We explain why SSE can reduce the variance of estimators and thus lead to better generalization performance. For simplicity, we consider the SSE-Graph objective (2) in which there is no transition associated with the label y_i, and only the embeddings associated with the input x_i undergo a transition. When this is the case, we can think of the loss as a function of the embedding and the label, ℓ(E[j^i], y_i | θ). We take this approach because it is more straightforward to compare our resulting theory to existing excess risk bounds.
The SSE objective in the case of only input transitions can be written as

(6)   S_n(θ) = (1/n) Σ_{i=1}^{n} Σ_{k} p(j^i, k | Φ) · ℓ(E[k], y_i | θ),

and there may be some constraint on θ. Let θ̂ denote the minimizer of S_n(θ) subject to this constraint. We will show in the subsequent theory that minimizing S_n(θ) will get us close to a minimizer θ⋆ of S(θ) = E S_n(θ), and that under some conditions this will get us close to the Bayes risk. We will use the standard definitions of empirical and true risk, R_n(θ) = (1/n) Σ_{i=1}^{n} ℓ(x_i, y_i | θ) and R(θ) = E R_n(θ).
Our results depend on the following decomposition of the risk:

(7)   R(θ̂) − R(θ⋆) = [R(θ̂) − S(θ̂)] + [S(θ̂) − S(θ⋆)] + [S(θ⋆) − R(θ⋆)] ≤ 2B + E(θ̂),

where B = sup_θ |S(θ) − R(θ)| and E(θ̂) = S(θ̂) − S(θ⋆). We can think of B as representing the bias due to SSE, and E(θ̂) as an SSE form of excess risk. Then, by another application of similar bounds and the optimality of θ̂ for the empirical objective S_n,

(8)   E(θ̂) ≤ [S(θ̂) − S_n(θ̂)] + [S_n(θ⋆) − S(θ⋆)] ≤ 2 sup_θ |S_n(θ) − S(θ)|.

The high-level idea behind the following results is that when the SSE protocol reflects the underlying distribution of the data, the bias term B is small, and if the SSE transitions are well mixing, then the SSE excess risk E(θ̂) will be of smaller order than the standard Rademacher complexity. This results in a small excess risk.
Theorem 1.
Consider SSE-Graph with only input transitions. Let L̄ denote the expected loss conditional on the input and L̃ = ℓ − L̄ the residual loss. Define the conditional and residual SSE empirical Rademacher complexities to be
(9)  
(10) 
respectively, where σ is a Rademacher random vector in {−1, +1}^n. Then we can decompose the SSE empirical risk into
(12) 
Remark 1.
The transition probabilities in (9), (10) act to smooth the empirical Rademacher complexity. To see this, notice that we can write the inner term of (9) as an inner product of σ with a vector of transition-averaged conditional losses, where the transition probabilities have been collected into a transition matrix. Transition matrices are contractive and will induce dependencies between the Rademacher random variables, thereby stochastically reducing the supremum. In the case of no label noise, namely when the conditional distribution of the label given the input is a point mass, the residual loss L̃ is identically 0 and so is the residual complexity (10). The use of the conditional losses L̄, as opposed to the losses ℓ, will also make (9) of smaller order than the standard empirical Rademacher complexity. We demonstrate this with a partial simulation on the Movielens1m dataset in Figure 5 of the Appendix.
Theorem 2.
Let the SSE-bias be defined as B = sup_θ |S(θ) − R(θ)|. Suppose that 0 ≤ ℓ(·, · | θ) ≤ b for some b > 0; then, with high probability, the excess risk R(θ̂) − R(θ⋆) is bounded by 2B plus stochastic terms controlled by the complexities of Theorem 1 and a deviation term of order b/√n.
Remark 2.
The price for 'smoothing' the Rademacher complexity in Theorem 1 is that SSE may introduce a bias. This will be particularly prominent when the SSE transitions have little to do with the underlying distribution of (x_i, y_i). On the other extreme, suppose that the transition probability p(j^i, ·) is non-zero only over a neighborhood of j^i, and that for data with encodings in that neighborhood the samples are identically distributed with (x_i, y_i); then B = 0. In all likelihood, the SSE transition probabilities will not be supported over neighborhoods of iid random pairs, but with a well-chosen SSE protocol the neighborhoods contain approximately iid pairs and B is small.
Table 1: Movielens1m and Movielens10m. Parameters are (ρ_u, ρ_i, p_u, p_i) for the user and item embedding tables; "-" means not applicable.

| Model | ML1m RMSE | ρ_u | ρ_i | p_u | p_i | ML10m RMSE | ρ_u | ρ_i | p_u | p_i |
|---|---|---|---|---|---|---|---|---|---|---|
| SGD-MF | 1.0984 | - | - | - | - | 1.9490 | - | - | - | - |
| Graph Laplacian + ALS-MF | 1.0464 | - | - | - | - | 1.9755 | - | - | - | - |
| SSE-Graph + SGD-MF | 1.0145 | 500 | 200 | 0.005 | 0.005 | 1.9019 | 1 | 500 | 0.01 | 0.01 |
| SSE-SE + SGD-MF | 1.0150 | 1 | 1 | 0.005 | 0.005 | 1.9085 | 1 | 1 | 0.01 | 0.01 |
Table 2: Douban, Movielens10m, and Netflix. p_d is the dropout probability and p_0 is the SSE probability; "-" means not applicable.

| Model | Douban RMSE | p_d | p_0 | ML10m RMSE | p_d | p_0 | Netflix RMSE | p_d | p_0 |
|---|---|---|---|---|---|---|---|---|---|
| MF | 0.7339 | - | - | 0.8851 | - | - | 0.8941 | - | - |
| Dropout + MF | 0.7296 | 0.1 | - | 0.8813 | 0.1 | - | 0.8897 | 0.1 | - |
| SSE-SE + MF | 0.7201 | - | 0.008 | 0.8715 | - | 0.008 | 0.8842 | - | 0.008 |
| SSE-SE + Dropout + MF | 0.7185 | 0.1 | 0.005 | 0.8678 | 0.1 | 0.005 | 0.8790 | 0.1 | 0.005 |
Table 3: Movielens1m, Yahoo Music, and Foursquare (three ranking metrics per dataset).

| Model | P@1 | P@5 | P@10 | P@1 | P@5 | P@10 | P@1 | P@5 | P@10 |
|---|---|---|---|---|---|---|---|---|---|
| SQL-Rank Wu et al. (2018) | 0.7369 | 0.6717 | 0.6183 | 0.4551 | 0.3614 | 0.3069 | 0.0583 | 0.0194 | 0.0170 |
| BPR | 0.6977 | 0.6568 | 0.6257 | 0.3971 | 0.3295 | 0.2806 | 0.0437 | 0.0189 | 0.0143 |
| Dropout + BPR | 0.7031 | 0.6548 | 0.6273 | 0.4080 | 0.3315 | 0.2847 | 0.0437 | 0.0184 | 0.0146 |
| SSE-SE + BPR | 0.7254 | 0.6813 | 0.6469 | 0.4297 | 0.3498 | 0.3005 | 0.0609 | 0.0262 | 0.0155 |
4 Experiments
We have conducted extensive experiments on 6 tasks, including 3 recommendation tasks (explicit feedback, implicit feedback, and sequential recommendation) and 3 NLP tasks (neural machine translation, BERT pre-training, and BERT fine-tuning for sentiment classification), and found that our proposed SSE can effectively improve generalization performance on a wide variety of tasks. Details about the datasets and parameter settings can be found in the appendix.
4.1 Neural Networks with One Hidden Layer (Matrix Factorization and BPR)
The Matrix Factorization algorithm (MF) Mnih and Salakhutdinov (2008) and the Bayesian Personalized Ranking algorithm (BPR) Rendle et al. (2009) can be viewed as neural networks with one hidden layer (latent features) and are quite popular in recommendation tasks. MF uses the squared loss, designed for explicit feedback data, while BPR uses a pairwise ranking loss, designed for implicit feedback data.
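For reference, the pairwise BPR loss for a single (user, preferred item, unobserved item) triple can be sketched as follows (illustrative code with plain-list embeddings, not the experimental implementation):

```python
import math

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """Pairwise BPR loss for one triple: -log sigmoid(s_pos - s_neg),
    encouraging the observed item to score higher than the unobserved one."""
    s_pos = sum(u * v for u, v in zip(user_emb, pos_item_emb))
    s_neg = sum(u * v for u, v in zip(user_emb, neg_item_emb))
    return -math.log(1.0 / (1.0 + math.exp(-(s_pos - s_neg))))
```

SSE applies identically in both cases, since each loss touches the parameters only through the embeddings it looks up.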
First, we conduct experiments on two explicit feedback datasets: Movielens1m and Movielens10m. For these datasets, we can construct graphs based on the actors/actresses starring in the movies. We compare SSE-Graph and the popular graph Laplacian regularization (GLR) method Rao et al. (2015) in Table 1. The results show that SSE-Graph consistently outperforms GLR. This indicates that our SSE-Graph has greater potential than graph Laplacian regularization, as we do not explicitly penalize the distances across embeddings but rather implicitly penalize the effects of similar embeddings on the loss. Furthermore, we show that even without existing knowledge graphs of embeddings, our SSE-SE performs only slightly worse than SSE-Graph but still much better than GLR and MF.
In general, SSE-SE is a good alternative when graph information is not available. We then show that our proposed SSE-SE can be used together with standard regularization techniques, such as dropout and weight decay, to improve recommendation results regardless of the loss function and the dimensionality of embeddings. This is evident in Table 2 and Table 3. With the help of SSE-SE, BPR can perform better than the state-of-the-art listwise approach SQL-Rank Wu et al. (2018) in most cases. We include the optimal SSE parameters in the tables for reference and leave other experimental details to the appendix. In the rest of the paper, we mostly focus on SSE-SE, as we do not have high-quality graphs of embeddings for most datasets.
Table 4: Movielens1m. SSE-SE parameters are the SSE probabilities for input and output embeddings; "-" means not applicable.

| Model | NDCG | Hit Ratio | Dimension | # of Blocks | p_0 (input) | p_0 (output) |
|---|---|---|---|---|---|---|
| SASRec | 0.5941 | 0.8182 | 100 | 2 | - | - |
| SASRec | 0.5996 | 0.8272 | 100 | 6 | - | - |
| SSE-SE + SASRec | 0.6092 | 0.8250 | 100 | 2 | 0.1 | 0 |
| SSE-SE + SASRec | 0.6085 | 0.8293 | 100 | 2 | 0 | 0.1 |
| SSE-SE + SASRec | 0.6200 | 0.8315 | 100 | 2 | 0.1 | 0.1 |
| SSE-SE + SASRec | 0.6265 | 0.8364 | 100 | 6 | 0.1 | 0.1 |
4.2 Transformer Encoder Model for Sequential Recommendation
SASRec Kang and McAuley (2018) is the state-of-the-art algorithm for the sequential recommendation task. It applies the transformer model Vaswani et al. (2017): a sequence of items purchased by a user can be viewed as a sentence in the transformer, and next-item prediction is equivalent to next-word prediction in a language model. In Table 4, we apply SSE-SE to input embeddings only, to output embeddings only, and to both, and observe that all of these variants significantly improve over the state-of-the-art SASRec baseline. The regularization effect of SSE-SE is even more obvious when we increase the number of self-attention blocks from 2 to 6, as this leads to a more sophisticated model with many more parameters, which overfits terribly even with dropout and weight decay. We can see in Table 4 that when both methods use dropout and weight decay, SSE-SE + SASRec does much better than SASRec without SSE-SE.
Table 5: Test BLEU on WMT newstest sets.

| Model | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Transformer | 21.0 | 20.7 | 22.7 | 20.6 | 20.6 | 25.3 | 26.2 | 28.4 | 32.1 | 27.2 | 38.8 |
| SSE-SE + Transformer | 21.4 | 21.1 | 23.0 | 21.0 | 20.8 | 25.2 | 27.2 | 29.2 | 33.1 | 27.9 | 39.9 |
4.3 Neural Machine Translation
We use the transformer model Vaswani et al. (2017) as the backbone for our experiments. The baseline model is the standard 6-layer transformer architecture, and we apply SSE-SE to both the encoder and the decoder by replacing the corresponding vocabularies' embeddings in the source and target sentences. We train on the standard WMT 2014 English-to-German dataset, which consists of roughly 4.5 million parallel sentence pairs, and test on the WMT 2008 to 2018 newstest sets. We use the OpenNMT implementation in our experiments, with the same dropout rate of 0.1 and label smoothing value of 0.1 for both the baseline model and our SSE-enhanced model. The only difference between the two models is whether or not we use our proposed SSE-SE, as in (5), for both encoder and decoder embedding layers. We evaluate both models' performance on the test datasets using BLEU scores Post (2018).
We summarize our results in Table 5 and find that SSE-SE helps improve accuracy and BLEU scores on both dev and test sets in 10 out of 11 years from 2008 to 2018. In particular, on the last 5 years' test sets from 2014 to 2018, the transformer model with SSE-SE improves BLEU scores by 0.92 on average compared to the baseline model without SSE-SE.
4.4 BERT for Sentiment Classification
BERT's model architecture Devlin et al. (2018) is a multi-layer bidirectional Transformer encoder based on the Transformer model in neural machine translation. Although SSE-SE can be used in both the pre-training and fine-tuning stages of BERT, we mainly focus on pre-training, as fine-tuning bears more similarity to the previous section. We use an SSE probability of 0.015 for embeddings (one-hot encodings) associated with labels and an SSE probability of 0.015 for embeddings (word-piece embeddings) associated with inputs. One thing worth noting is that even in the original BERT model's pre-training stage, SSE-SE is already implicitly used for token embeddings: the authors masked 15% of words, for a maximum of 80 words in sequences of maximum length 512, and 10% of the time replaced the [mask] token with a random token. That is roughly equivalent to an SSE probability of 0.015 for replacing input word-piece embeddings.
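The equivalence is a simple product of rates: only the randomly-replaced fraction of the selected tokens behaves like an SSE transition.

```python
# BERT pre-training heuristic: 15% of tokens are selected, and 10% of
# those are replaced by a random token rather than the [mask] token.
mask_rate = 0.15
random_replace_rate = 0.10
effective_sse_prob = mask_rate * random_replace_rate  # ~0.015
```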
We continue to pre-train the Google pre-trained BERT model on our crawled IMDB movie reviews with and without SSE-SE and compare downstream task performance. In Table 6, we find that the SSE-SE pre-trained BERT base model helps us achieve state-of-the-art results on the IMDB sentiment classification task, better than the previous best in Howard and Ruder (2018). We report a test set accuracy of 0.9542 after fine-tuning for one epoch only. For the similar SST-2 sentiment classification task in Table 7, we also find that SSE-SE helps BERT pre-train better: our SSE-SE pre-trained model achieves 94.3% accuracy on the SST-2 test set after 3 epochs of fine-tuning, while the standard pre-trained BERT model only reaches 93.8% after fine-tuning. Furthermore, we show that SSE-SE with SSE probability 0.01 can also improve dev and test accuracy in the fine-tuning stage. If we use SSE-SE in both the pre-training and fine-tuning stages of the BERT base model, we can achieve 94.5% accuracy on the SST-2 test set, approaching the 94.9% accuracy of the BERT large model. We are optimistic that our SSE-SE can be applied to the BERT large model as well in the future.

Table 6: IMDB Test Set.
| Model | AUC | Accuracy | F1 Score |
|---|---|---|---|
| ULMFiT Howard and Ruder (2018) | - | 0.9540 | - |
| Google Pre-trained Model + Fine-tuning | 0.9415 | 0.9415 | 0.9419 |
| Pre-training + Fine-tuning | 0.9518 | 0.9518 | 0.9523 |
| (SSE-SE + Pre-training) + Fine-tuning | 0.9542 | 0.9542 | 0.9545 |
Table 7: SST-2 Dev and Test Sets.

| Model | Dev AUC | Dev Accuracy | Dev F1 Score | Test Accuracy (%) |
|---|---|---|---|---|
| Google Pre-trained + Fine-tuning | 0.9230 | 0.9232 | 0.9253 | 93.6 |
| Pre-training + Fine-tuning | 0.9265 | 0.9266 | 0.9281 | 93.8 |
| (SSE-SE + Pre-training) + Fine-tuning | 0.9276 | 0.9278 | 0.9295 | 94.3 |
| (SSE-SE + Pre-training) + (SSE-SE + Fine-tuning) | 0.9323 | 0.9323 | 0.9336 | 94.5 |
4.5 Speed and Convergence Comparisons
In Figure 4, it is clear that our one-hidden-layer neural networks with SSE-SE achieve much better generalization results than their respective standalone versions. One can also easily see that the SSE versions of the algorithms converge at much faster speeds with the same learning rate.
5 Conclusion
We have proposed Stochastically Shared Embeddings, a data-driven approach to regularization that stands in contrast to brute-force regularization such as Laplacian and ridge regularization. Our theory is a first step towards explaining the regularization effect of SSE, particularly by 'smoothing' the Rademacher complexity. Extensive experimentation demonstrates that SSE can be fruitfully integrated into existing deep learning applications.
References
 Bennett et al. (2007) James Bennett, Stan Lanning, et al. The Netflix prize. In Proceedings of KDD Cup and Workshop, volume 2007, page 35. New York, NY, USA, 2007.
 Cai et al. (2011) Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8):1548–1560, 2011.
 Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning, volume 1. MIT Press, Cambridge, 2016.
 He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182. International World Wide Web Conferences Steering Committee, 2017.
 Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 Hoerl and Kennard (1970) Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for non-orthogonal problems. Technometrics, 12(1):55–67, 1970.
 Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
 Hu et al. (2008) Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, volume 8, pages 263–272. Citeseer, 2008.
 Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. arXiv preprint arXiv:1808.09781, 2018.
 Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 Miller (1995) George A Miller. WordNet: a lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
 Mnih and Salakhutdinov (2008) Andriy Mnih and Ruslan R Salakhutdinov. Probabilistic matrix factorization. In Advances in Neural Information Processing Systems, pages 1257–1264, 2008.
 Nowlan and Hinton (1992) Steven J Nowlan and Geoffrey E Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.

 Pascanu et al. (2013) Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
 Post (2018) Matt Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Brussels, Belgium, October 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.
 Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. OpenAI technical report, 2018.
 Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. URL https://openai.com/blog/better-language-models, 2019.
 Rao et al. (2015) Nikhil Rao, Hsiang-Fu Yu, Pradeep K Ravikumar, and Inderjit S Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In Advances in Neural Information Processing Systems, pages 2107–2115, 2015.

 Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 452–461. AUAI Press, 2009.
 Srebro et al. (2005) Nathan Srebro, Jason Rennie, and Tommi S Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2005.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 Wu et al. (2017) Liwei Wu, Cho-Jui Hsieh, and James Sharpnack. Large-scale collaborative ranking in near-linear time. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 515–524. ACM, 2017.
 Wu et al. (2018) Liwei Wu, Cho-Jui Hsieh, and James Sharpnack. SQL-Rank: A listwise approach to collaborative ranking. In Proceedings of Machine Learning Research (35th International Conference on Machine Learning), volume 80, 2018.
6 Appendix
For the experiments in Section 4.1, we use Julia and C++ to implement SGD. For the experiments in Sections 4.2 and 4.4, we use TensorFlow and the SGD/Adam optimizers. For the experiments in Section 4.3, we use PyTorch and Adam with the noam decay scheme and warm-up. We find that none of these choices affect the strong empirical results supporting the effectiveness of our proposed methods, especially SSE-SE. In any deep learning framework, we can introduce stochasticity into the original embedding look-up behavior and easily implement the SSE layer in Figure 1 as a custom operator.

6.1 Neural Networks with One Hidden Layer
To run SSE-Graph, we need to construct good-quality knowledge graphs on embeddings. We managed to match movies in the Movielens1M and Movielens10M datasets to their IMDB pages, so we can extract plentiful information for each movie, such as the cast, user reviews, and so on. For simplicity, we construct the knowledge graph on item-side embeddings using the cast of movies: two items are connected by an edge when they share one or more actors/actresses. On the user side, we do not have good-quality graphs: we are only able to create a graph on users in the Movielens1M dataset based on their age groups, and we do not have any side information on users in the Movielens10M dataset. When running experiments, we do a parameter sweep for the weight decay parameter and then fix it before tuning the parameters for SSE-Graph and SSE-SE. We use different SSE parameters for the user-side and item-side embedding tables. The optimal parameters are stated in Table 1 and Table 2. We use a learning rate of 0.01 in all SGD experiments.
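The edge rule above (connect two movies when they share cast members) can be sketched in a few lines; the function name `build_item_graph` and the toy cast data are our illustration, not the actual preprocessing code used for the experiments:

```python
from collections import defaultdict
from itertools import combinations

def build_item_graph(movie_cast):
    """Connect two movies with an edge whenever they share at least
    one actor/actress, as in the item-side knowledge graph above."""
    by_actor = defaultdict(set)
    for movie, cast in movie_cast.items():
        for actor in cast:
            by_actor[actor].add(movie)
    edges = set()
    for movies in by_actor.values():
        # every pair of movies sharing this actor gets an edge
        for a, b in combinations(sorted(movies), 2):
            edges.add((a, b))
    return edges

# toy example with hypothetical cast lists
cast = {
    "Heat": {"Al Pacino", "Robert De Niro"},
    "The Godfather Part II": {"Al Pacino", "Robert De Niro"},
    "Taxi Driver": {"Robert De Niro"},
}
print(build_item_graph(cast))
```

Grouping movies by actor first keeps the construction near-linear in the cast lists, rather than comparing all movie pairs directly.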
In the first leg of experiments, we examine users with fewer than 60 ratings in the Movielens1M and Movielens10M datasets. In this scenario, the graph should carry higher importance. One can easily see from Table 1 that without using graph information, our proposed SSE-SE is the best-performing matrix factorization algorithm among all methods, including the popular ALS-MF and SGD-MF, in terms of RMSE. With graph information, our proposed SSE-Graph performs significantly better than the graph Laplacian regularized matrix factorization method. This indicates that SSE-Graph has great potential over graph Laplacian regularization, as we do not explicitly penalize the distances between embeddings but rather implicitly penalize the effects of similar embeddings on the loss.
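For contrast, graph Laplacian regularization adds an explicit distance penalty to the loss; a standard formulation (the notation $e_i$ for the embedding of node $i$ is ours) is:

```latex
% Graph Laplacian regularization explicitly penalizes the distance
% between embeddings of neighboring nodes (i, j) in the graph G:
\Omega(E) \;=\; \lambda \sum_{(i,j) \in \mathcal{E}(G)} \lVert e_i - e_j \rVert_2^2
% SSE-Graph instead leaves the loss unchanged and stochastically
% swaps e_i for e_j during training with some probability.
```

The difference is that the Laplacian term pulls connected embeddings together regardless of the loss, while SSE only penalizes embedding differences to the extent that they change the loss.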
In the second leg of experiments, we remove the constraint on the maximum number of ratings per user. We want to show that SSE-SE can be a good alternative when graph information is not available. We follow the same procedures as in Wu et al. (2017, 2018). In Table 2, we can see that SSE-SE can be used together with dropout to achieve the smallest RMSE across the Douban, Movielens10M, and Netflix datasets. In Table 3, one can see that SSE-SE is more effective than dropout in this case and can perform better than the state-of-the-art listwise approach SQL-Rank Wu et al. (2018) on 2 out of 3 datasets.
In Table 2, SSE-SE has two tuning parameters, the probability of replacing user-side embeddings and the probability of replacing item-side embeddings, because there are two embedding tables; here, for simplicity, we use a single tuning parameter. We use the same dropout probability, user/item embedding dimension, weight decay, and learning rate for all experiments, with the exception that the learning rate is reduced when both SSE-SE and dropout are applied. We use one value of the SSE parameter for the Douban dataset and a different value for the Movielens10M and Netflix datasets.
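The SSE layer described above, stochastic replacement of indices before the usual embedding lookup, with separate probabilities for the two tables, can be sketched as follows. This is a NumPy illustration under our own naming (`sse_se_lookup`, the table sizes, and the probabilities are hypothetical), not the exact operator used in the experiments:

```python
import numpy as np

def sse_se_lookup(table, ids, p, rng):
    """SSE-SE: each index is replaced by a uniformly random index
    with probability p before the usual embedding lookup.
    Applied at training time only; inference uses the plain lookup."""
    ids = np.asarray(ids)
    mask = rng.random(ids.shape) < p                      # which ids to replace
    rand = rng.integers(0, table.shape[0], size=ids.shape)
    return table[np.where(mask, rand, ids)]

rng = np.random.default_rng(0)
user_table = rng.normal(size=(6040, 100))   # e.g. Movielens1M has 6040 users
item_table = rng.normal(size=(3706, 100))
user_ids = np.array([0, 17, 42])
item_ids = np.array([5, 5, 1000])
# separate replacement probabilities for the two embedding tables
u_emb = sse_se_lookup(user_table, user_ids, p=0.01, rng=rng)
v_emb = sse_se_lookup(item_table, item_ids, p=0.05, rng=rng)
print(u_emb.shape, v_emb.shape)  # (3, 100) (3, 100)
```

Setting `p=0` recovers the ordinary embedding lookup, which is one way to check an implementation.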
6.2 Neural Machine Translation
We use the transformer model Vaswani et al. (2017) as the backbone for our experiments. The control group is the standard transformer encoder-decoder architecture with self-attention. In the experiment group, we apply SSE-SE to both the encoder and the decoder by replacing the corresponding vocabularies' embeddings in the source and target sentences. We trained on the standard WMT 2014 English-to-German dataset, which consists of roughly 4.5 million parallel sentence pairs, and tested on the WMT 2008 to 2018 newstest sets. Sentences were encoded using byte-pair encoding with a vocabulary of 32,000 tokens. We use the SentencePiece, OpenNMT, and SacreBLEU implementations in our experiments. We trained the 6-layer transformer base model on a single machine with 4 NVIDIA V100 GPUs for 20,000 steps. We use the same dropout rate of 0.1 and label smoothing value of 0.1 for the baseline model and our SSE-enhanced model, and both models use the same embedding dimensionality. When decoding, we use beam search with a beam size of 4 and a length penalty of 0.6, and replace unknown words using attention. For both models, we average the last 5 checkpoints (we save checkpoints every 10,000 steps) and evaluate performance on the test datasets using BLEU scores. The only difference between the two models is whether we use our proposed SSE-SE, with the replacement probability of Equation 5, for both the encoder and decoder embedding layers.
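Applying SSE-SE to both sides of the transformer amounts to perturbing the source and target token ids independently, each within its own vocabulary, before the embedding lookup. The sketch below (function name `sse_se_batch` and all values are our illustration, not the authors' code) shows this shape-preserving preprocessing step:

```python
import numpy as np

def sse_se_batch(src_ids, tgt_ids, src_vocab, tgt_vocab, p, rng):
    """With probability p, replace each source/target token id by a
    uniformly random id from its own vocabulary, so both encoder and
    decoder embedding layers receive SSE-SE during training."""
    def perturb(ids, vocab_size):
        ids = np.asarray(ids)
        mask = rng.random(ids.shape) < p
        rand = rng.integers(0, vocab_size, size=ids.shape)
        return np.where(mask, rand, ids)
    return perturb(src_ids, src_vocab), perturb(tgt_ids, tgt_vocab)

rng = np.random.default_rng(42)
src = np.array([[5, 9, 31999]])   # one source sentence of BPE ids
tgt = np.array([[2, 7, 11]])      # the corresponding target ids
noisy_src, noisy_tgt = sse_se_batch(src, tgt, 32000, 32000, p=0.1, rng=rng)
print(noisy_src.shape, noisy_tgt.shape)  # (1, 3) (1, 3)
```

Because only the ids change, the rest of the transformer pipeline (positional encodings, attention masks) is untouched.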
6.3 BERT
In the first leg of experiments, we crawled one million user reviews from IMDB and pretrained the BERT-Base model (12 blocks) using sequences of maximum length 512 and a batch size of 8, with the same learning rate for both models, on one NVIDIA V100 GPU. Then we pretrained on a mixture of our crawled reviews and the reviews in the IMDB sentiment classification task (250K reviews in train and 250K reviews in test) before continuing pretraining on the IMDB sentiment classification reviews only. In total, both models are pretrained on the same datasets for the same number of steps, with the only difference being that our model utilizes SSE-SE. In the second leg of experiments, we fine-tuned the two models obtained in the first-leg experiments on two sentiment classification tasks: the IMDB sentiment classification task and the SST-2 sentiment classification task. The goal of pretraining on the IMDB dataset but fine-tuning for the SST-2 task is to explore whether SSE-SE can play a role in transfer learning.
The results for the IMDB sentiment task are summarized in Table 6. In these experiments, we use a maximum sequence length of 512 with the same learning rate and dropout probability across models, and we run fine-tuning for 1 epoch for the two pretrained models we obtained before. For the Google pretrained BERT-Base model, we find that we need to run a minimum of 2 epochs. This shows that our additional pretraining can speed up fine-tuning. We find that the Google pretrained model performs worst in accuracy because it was pretrained only on Wikipedia and the Books corpus, while ours have additionally seen many user reviews. We also find that the SSE-SE pretrained model can achieve an accuracy of 0.9542 after fine-tuning for only one epoch. In contrast, the accuracy is only 0.9518 without SSE-SE applied to the output embeddings.
For the SST-2 task, we use a maximum sequence length of 128 and a dropout probability of 0.1, and we run fine-tuning for 3 epochs for all 3 models in Table 7. We report AUC, accuracy, and F1 score on the dev data. For test results, we submitted our predictions to the GLUE website for the official evaluation. We find that even in transfer learning, our SSE-SE pretrained model still enjoys advantages over the Google pretrained model and our pretrained model without SSE-SE. Our SSE-SE pretrained model achieves 94.3% accuracy on the SST-2 test set, versus 93.6% and 93.8% respectively. If we use SSE-SE for both pretraining and fine-tuning, we can achieve 94.5% accuracy on the SST-2 test set, which approaches the 94.9% reported for the BERT-Large model. An SSE probability of 0.01 is used for fine-tuning.
6.4 Proofs
Throughout this section, we will suppress the probability parameters.
Proof of Theorem 1.
Consider the following variability term,
(13) 
Let us break the variability term into two components
where the pair represents the random input and label. To control the first term, we introduce a ghost dataset whose samples are drawn independently and identically from the same distribution. Define
(14) 
be the empirical SSE risk with respect to this ghost dataset.
We will rewrite the first term in terms of the ghost dataset and apply Jensen's inequality and the law of iterated conditional expectation:
(15)  
(16)  
(17)  
(18) 
Notice that
Because the samples are independent, the term in question is a vector of symmetric independent random variables. Thus its distribution is not affected by multiplication by arbitrary Rademacher vectors.
But this is bounded by
For the second term,
we will introduce a second ghost dataset drawn i.i.d. from the same distribution. Because we are augmenting the input, this results in a new ghost encoding. Let
(19) 
be the empirical risk with respect to this ghost dataset. Then we have that
Thus,
(20)  
(21)  
(22)  
(23) 
Notice that we may write,
Again we may introduce a second set of Rademacher random variables, which results in
And this is bounded by
by Jensen’s inequality again. ∎
Proof of Theorem 2.
The claim about the expectation is immediate. It remains to show our concentration inequality. Consider changing a single sample, which results in a perturbed SSE empirical risk. Then,
Then the result follows from McDiarmid’s inequality.
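For reference, the bounded-differences form of McDiarmid's inequality used here states (a standard result; the constants $c_i$ are whatever bound the single-sample perturbation above yields):

```latex
% McDiarmid: if changing any single sample (x_i, y_i) changes the
% value of f by at most c_i, then for all t > 0,
P\bigl(f - \mathbb{E} f \ge t\bigr)
  \;\le\; \exp\!\Bigl(-\frac{2 t^2}{\sum_{i=1}^{n} c_i^2}\Bigr).
% With a uniform bound c_i = c/n for every i, this gives the
% familiar rate exp(-2 n t^2 / c^2).
```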
∎