Most machine learning methodology can be described as deriving computational representations from real-world objects and then classifying those representations according to the task. Accordingly, there have been two main approaches to improving model performance: (1) starting with better representations of the data [19, 25], and (2) building more complex architectures that can extract important features and generate higher-level representations [30, 5].
For better initial representations, many NLP researchers have used pretrained word vectors, trained on large corpora with unsupervised algorithms such as word2vec, GloVe, and fastText. Pretrained word vectors not only capture the general meaning of words but also improve model performance on most NLP tasks. Further, word vector post-processing studies [7, 32, 23, 11] have tried to enrich the pretrained representations using external semantic lexicons. These post-processing methods compensate for a weak point of word vector generation algorithms, namely their heavy dependence on word order, and improve model performance.
When training an NLP model, we first initialize word representations with pretrained word vectors and then update both the model parameters and the word representations together. However, model performance can be limited by the initial word vectors. Pretrained word representations capture the general meaning of words, but in some tasks the words do not carry that general meaning. Although task-specific meanings can in principle be learned during training, the model may fail to learn them in context: since the word vectors are updated by a gradient descent algorithm, their values change only slightly, and they tend to converge to local minima.
Therefore, we propose a simple trick for finding better representations: applying random noise maskers to the word vectors during an iterative training process, which we call GROVER (Gradual Rumination On the Vector with maskERs). We expect the noise to help the model learn better representations. Additionally, GROVER regularizes the model by adding random noise to the fine-tuned word vectors. We show that a proper degree of noise not only helps a model learn task-specific representations but also regularizes the model, improving its performance.
Our contributions through GROVER are summarized as follows:
- GROVER mitigates the word vector updating problem and further fine-tunes the word vectors for the task.
- GROVER increases model performance through this further fine-tuning, and it can be applied to any model architecture that uses word embeddings.
- GROVER regularizes the model and can be combined with other regularization techniques, further increasing model performance.
Adding noise to input data is an old idea. However, there has been little research on word-level noise, since even small perturbations of language can change the meanings of words.
NLP tasks that consume text as sentences and phrases treat each word as a feature. However, a large number of features can lead a model to overfit the training data due to the curse of dimensionality. The simplest way to reduce the number of features is therefore to drop words from the sentence at random.
Word Embedding Perturbation. Miyato et al. (2016) used word embedding perturbation to regularize models within an adversarial training framework, and Cheng et al. (2018) utilized such noise to build a robust machine translation model. There was also an approach that treats the perturbation as data augmentation. These previous works added noise to all word embeddings whenever they were used; such methods can regularize the models, particularly the model weights, but do not attend to the word representations themselves. In contrast, our method gradually adds noise to word embeddings according to word frequency, and uses an iterative training process to benefit from the noise and obtain better word representations.
Dropout is applied to neural network models, masking random neurons with 0. It randomly and temporarily removes activations during training, so the masked weights are prevented from updating. As a result, the model is prevented from over-tuning on specific features, which brings regularization. Dropout also discourages the weights from co-adapting and provides an ensemble effect.
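As an illustration, the masking described above can be sketched with NumPy; this is a minimal (inverted) dropout sketch, not the paper's implementation:

```python
import numpy as np

# Minimal sketch of inverted dropout on an activation matrix:
# zero each activation with probability p, scale survivors by 1/(1-p)
# so the expected activation magnitude is unchanged at test time.
def dropout(activations, p=0.1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    mask = rng.random(activations.shape) >= p  # True = neuron survives
    return activations * mask / (1.0 - p)

x = np.ones((4, 8))
y = dropout(x, p=0.5)  # surviving entries become 2.0, the rest 0.0
```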
Batch Normalization (BN) normalizes the features according to mini-batch statistics. Batch normalization helps the features avoid internal covariate shift, in which the weight gradients are highly dependent on the gradients of previous layers. Besides, batch normalization speeds up the training process by reshaping the loss function.
Layer Normalization (LN) normalizes the features in a similar fashion, but computes the statistics across the features of each example rather than across the mini-batch, so the same statistics are used for all feature dimensions.
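The difference between the two normalization axes can be shown in a short sketch (learnable scale/shift parameters are omitted for brevity):

```python
import numpy as np

# BN normalizes each feature across the mini-batch (axis 0);
# LN normalizes each example across its features (axis 1).
def batch_norm(x, eps=1e-5):  # x: (batch, features)
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):  # x: (batch, features)
    mu = x.mean(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(32, 16))
bn, ln = batch_norm(x), layer_norm(x)
# bn has zero mean per feature column; ln has zero mean per example row
```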
The representations fine-tuned by GROVER can be regarded as pretrained representations from the previous training process.
Pretrained Embedding Vector, also called pretrained word representation, consists of pairs of (token, n-dimensional float vector), following the distributional representation hypothesis. The word vectors are usually learned by unsupervised algorithms (e.g., word2vec, GloVe, fastText) on a substantial corpus to represent the general meanings of words, and are widely used to initialize the word vectors in models.
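The (token, vector) format can be read with a few lines of code; this is a minimal loader for text-format vector files where each line is "token v1 v2 ... vn" (as in GloVe's .txt releases), with made-up sample data:

```python
import io
import numpy as np

# Parse "token v1 v2 ... vn" lines into a {token: vector} dict.
def load_vectors(lines):
    vectors = {}
    for line in lines:
        token, *values = line.rstrip().split(" ")
        vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

# Illustrative stand-in for an open file handle over a vectors file.
sample = io.StringIO("the 0.1 0.2 0.3\nword 0.4 0.5 0.6\n")
emb = load_vectors(sample)
```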
Pretrained Embedding Model approaches have been suggested recently to obtain a deep representation of each word in context. Previous research [18, 25, 6] trained deep architectures and then used the model outputs to represent words.
We take the pretrained embedding vector approach because we believe that the contextual meaning of words can be learned during training, provided the model architecture processes the sequential information of the words.
The overall process including GROVER is illustrated in Figure 1. GROVER fits into a conventional training framework with early stopping, but it requires a meta-level loop that trains the model again; we denote each iteration of this loop as a meta-epoch.
When a training process finishes, we extract the fine-tuned word embeddings. Next, we add maskers filled with random values to the embeddings and re-train the model from scratch with the noised embeddings. In other words, GROVER adds random noise to the fine-tuned word embeddings and then repeats the training process with them. Observing the validation performance at every meta-epoch, we select the best model and its fine-tuned embedding. Additional details of GROVER are described below.
Maskers start from infrequently used words and move to frequently used words.
Random noise in a portion of the word embeddings changes the distribution of all word representations during training, since higher-level representations such as sentence representations are generated from the word representations. Hence, the model performance could be affected too strongly by randomly noised maskers. To change the distribution of word representations gradually, we introduce the noise incrementally with maskers that start from infrequently used words and move toward frequently used words.
Gradualness. While following the aforementioned process, we adjust the number of random maskers according to validation performance. If the validation performance of the re-trained model increases, we move the maskers to the next most frequent words. Otherwise, we roll the word embeddings back to the previous ones and increase the number of random maskers without moving them, so that the maskers perturb the same words again in the next step. As a result, GROVER can process both the words from the previous step and the words in the current step. This gradualness lets the word vectors benefit from noise again and makes GROVER dynamic.
Degree of Noises (Step Size & Noise Range). Moderate noise is required to change the word vector distribution, so we introduce two hyperparameters that control it: step size and noise range.
Step size determines how far the random maskers move toward more frequent words in the next training process; we set it to 10% of the vocabulary size by default. For example, the bottom 10% of the frequency-ordered vocabulary is masked in the first training process, and the bottom 10% to 20% is masked in the next step. Note, however, that gradualness can change the number of maskers.
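One masking step under the default hyperparameters can be sketched as follows, assuming the embedding matrix rows are sorted from most to least frequent word (the function name and signature are illustrative, not the paper's code):

```python
import numpy as np

# With step_size=0.1, band 0 perturbs the bottom 10% of the vocabulary,
# band 1 perturbs the 10-20% band (next most frequent words), and so on.
def add_noise_band(emb, band, step_size=0.1, noise_range=1.0, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    vocab, dim = emb.shape
    hi = vocab - int(vocab * step_size * band)  # band 0 = least frequent
    lo = max(0, hi - int(vocab * step_size))
    noised = emb.copy()
    noised[lo:hi] += rng.uniform(-noise_range, noise_range,
                                 size=(hi - lo, dim))
    return noised

emb = np.zeros((100, 300))
noised = add_noise_band(emb, band=0)  # only rows 90..99 are perturbed
```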
The maskers are filled with random values sampled from a uniform distribution whose range is defined as the noise range; the default is between -1 and 1. The noise could be extended to well-defined perturbation methods such as Gaussian kernels. The effects of both hyperparameters are presented in the ablation studies.
The overall algorithm of GROVER combines these components: iterative re-training, frequency-ordered maskers, gradualness, and a controlled degree of noise.
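The original algorithm listing did not survive extraction, so the meta-epoch loop is sketched below. `train_until_early_stop` and `validate` are hypothetical stand-ins for the usual training and validation routines (stubbed here so the sketch runs); the masker bookkeeping follows the gradualness rule described above:

```python
import numpy as np

def train_until_early_stop(emb):
    return emb                         # stand-in: fine-tune and return emb

def validate(emb):
    return -float(np.abs(emb).sum())   # stand-in validation score

def grover(emb, step_size=0.1, noise_range=1.0, meta_epochs=5, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    vocab, dim = emb.shape
    band_words = int(vocab * step_size)
    current = train_until_early_stop(emb)          # plain training first
    best_emb, best_score = current, validate(current)
    band, width = 0, 1                             # masker position / count
    for _ in range(meta_epochs):
        # mask `width` bands ending at `band`, counted from the
        # infrequent end of the frequency-sorted vocabulary
        hi = vocab - band * band_words
        lo = max(0, hi - width * band_words)
        noised = current.copy()
        noised[lo:hi] += rng.uniform(-noise_range, noise_range,
                                     size=(hi - lo, dim))
        retrained = train_until_early_stop(noised)
        score = validate(retrained)
        if score > best_score:                     # keep; move the maskers
            best_emb, best_score = retrained, score
            current, band, width = retrained, band + 1, 1
        else:                                      # roll back; widen maskers
            width += 1
    return best_emb
```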
| Model | DBpedia | YahooA. (Upper) | YahooA. (Lower) | AGNews | Yelp | IMDB |
|---|---|---|---|---|---|---|
| Clf w/ Rand | 98.00 | 65.88 | 42.73 | 89.00 | 56.56 | 76.20 |
| Clf w/ word2vec | 98.22 | 70.71 | 42.76 | 91.20 | 56.40 | 75.87 |
| Clf w/ GloVe | 98.70 | 74.28 | 51.49 | 91.55 | 56.81 | 81.43 |
| Clf w/ fastText | 98.01 | 64.06 | 40.07 | 86.12 | 55.03 | 64.38 |
| Clf w/ BERTTokenEmb | 97.81 | 55.15 | 32.74 | 89.92 | 54.53 | 64.66 |
| Clf w/ ExtroGloVe | 98.67 | 74.52 | 52.02 | 91.63 | 57.82 | 81.71 |
GROVER is applied to the embeddings by adding random noise to the fine-tuned word vectors, so our method is independent of model architecture in that most NLP models use word-level embeddings. The random noise might disturb the representation, but during the re-training processes the noise that harms performance is compensated for, while the rest helps the word vectors jump away from their initial values, provided the degree of noise is moderate.
The noise introduced by GROVER prevents the model from overfitting to the validation set during re-training, while the model with GROVER incrementally fits the validation set through early stopping in each training process. The model therefore keeps fitting the validation set under regularization, so test performance will increase as long as validation performance correlates with test performance.
We prepare three topic classification datasets (DBpedia ontology, YahooAnswers, AGNews) and two sentiment classification datasets (Yelp reviews, IMDB). The YahooAnswers dataset is used for two different tasks: classifying upper-level categories and classifying lower-level categories. The data statistics are presented in Table 1. We split 15% of each training set into a validation set, and each dataset has its own test set. The validation set is used for early stopping at every epoch and at every meta-epoch. We use the first 100 words of each text as input, including all special symbols, in a 300-dimensional embedding space.
We use TextCNN classifiers. The model consists of two convolutional layers with 32 and 16 channels, respectively. We adopt multiple kernel sizes (2, 3, 4, and 5), each followed by ReLU activation and max-pooling, and concatenate the outputs after every max-pooling layer. We optimize the model using Adam with a learning rate of 1e-3 and apply early stopping.
Unless mentioned otherwise, the initial word embeddings are GloVe post-processed by extrofitting. We also present the performance of GROVER on other major pretrained word embeddings.
We implement five regularization (including normalization) methods to compare with ours. First, word dropping is implemented in the pre-processing step, removing random words from the text; by reducing the number of words in the training data, word dropping acts as regularization. We set the drop probability to 0.1. Dropout is added to the final fully connected layer with dropout probability 0.1, which performs best in our experiments. Batch Normalization is placed between each convolutional layer and its activation function, as in the original paper, and Layer Normalization is implemented in the same position. We report performance averaged over 10 runs.
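The word-dropping pre-processing step above is simple enough to sketch directly (an illustrative helper, not the paper's code):

```python
import numpy as np

# Word dropping: each token is removed independently with probability p
# before the text is fed to the model.
def word_drop(tokens, p=0.1, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    return [t for t in tokens if rng.random() >= p]

word_drop(["this", "movie", "was", "surprisingly", "good"], p=0.1)
```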
| Model | DBpedia | YahooA. (Upper) | YahooA. (Lower) | AGNews | Yelp | IMDB |
|---|---|---|---|---|---|---|
| Base TextCNN | 98.67 | 74.52 | 52.02 | 91.63 | 57.82 | 81.71 |
| + WordDrop (Δ) | -0.05 | -4.17 | -8.73 | -0.50 | -1.60 | -0.88 |
| + DO (p=0.1) (Δ) | +0.05 | -0.03 | -0.03 | +0.21 | +0.18 | -0.29 |
| + DO (p=0.1) | 98.72 | 74.49 | 51.99 | 91.84 | 58.00 | 81.41 |
| + BN | 98.50 | 74.43 | 51.36 | 91.41 | 57.58 | 81.27 |
| + LN | 98.75 | 75.46 | 53.38 | 91.56 | 57.63 | 81.19 |
| + DO&BN | 98.51 | 74.90 | 51.48 | 91.18 | 57.86 | 80.81 |
| + DO&LN | 98.81 | 75.34 | 53.78 | 91.49 | 58.18 | 81.19 |
Top-20 Nearest Words (Cosine Similarity)
| Step Size | DBpedia | YahooA. (Upper) | YahooA. (Lower) | AGNews | Yelp | IMDB |
|---|---|---|---|---|---|---|
| 0.05 | 98.73 | 75.28 | 53.74 | 91.95 | 57.92 | 81.24 |
| 0.1 | 98.72 | 75.69 | 53.82 | 91.91 | 57.98 | 80.99 |
| 0.2 | 98.71 | 75.23 | 53.72 | 92.18 | 58.43 | 81.32 |
| 0.5 | 98.73 | 74.57 | 52.56 | 92.01 | 58.02 | 81.33 |
| 1.0 | 98.68 | 74.31 | 51.35 | 91.53 | 58.06 | 81.43 |
We experiment with our method initialized with five different pretrained word embeddings: word2vec (GoogleNews-vectors-negative300.bin), GloVe (glove.42B.300d.txt), fastText (wiki-news-300d-1M-subword.vec), token embeddings extracted from BERT (https://github.com/Kyubyong/bert-token-embeddings; bert-base-uncased.30522.768d.vec), and extrofitted GloVe. The results are presented in Table 2.
GROVER improves the performance on most of the datasets, even when using randomly initialized word vectors. Since we train the model from scratch except for the word embeddings, this result implies that we can learn better word representations through GROVER. In some cases, however, GROVER degrades the model performance because the distribution of the pretrained word vectors is already good enough for the task.
The model performance with regularization techniques is presented in Table 3. The performance gain is comparable to other regularization methods, except on IMDB. This might be because IMDB is small, so its validation set cannot represent the distribution of the test set: the model fits the validation set through early stopping, but improvement on the validation set does not carry over to the test set. We present both the validation and test performance on YahooAnswers (Upper) and IMDB, where GROVER performs well and poorly, respectively (see Figure 2). The results show that improvements on the validation set do not guarantee improvements on the test set. Another possible cause is the degree of noise added by GROVER, which might disturb the features of the IMDB dataset too much; a further ablation study on the degree of noise is discussed below.
We also present the results when our method is combined with other regularization methods in Table 3. GROVER combines well with the other regularization techniques, further improving the model performance.
Next, we analyze the embeddings further fine-tuned through GROVER.
We extract the word representations updated on each dataset and plot the top-100 nearest words, visualizing the distribution with t-SNE, as presented in Figure 3. The distribution of frequently used words is trained further than in the fine-tuned embedding, showing that our method can further specialize the word vector distribution.
We present the top-20 nearest words of a cue word in Table 4. We can again observe that the word vectors are further fine-tuned; moreover, we find similar words that do not appear among the neighbors of the embedding fine-tuned only once.
| Noise Range | DBpedia | YahooA. (Upper) | YahooA. (Lower) | AGNews | Yelp | IMDB |
|---|---|---|---|---|---|---|
| 0.1 | 98.72 | 75.31 | 54.13 | 92.03 | 57.10 | 80.76 |
| 0.5 | 98.74 | 75.26 | 53.42 | 92.04 | 57.59 | 81.64 |
| 1.0 | 98.72 | 75.69 | 53.82 | 91.91 | 57.98 | 80.99 |
| 2.0 | 98.57 | 75.49 | 53.84 | 91.76 | 57.29 | 80.85 |
| 10.0 | 97.83 | 62.62 | 38.54 | 87.24 | 54.51 | 72.34 |
| Model | DBpedia | YahooA. (Upper) | YahooA. (Lower) | AGNews | Yelp | IMDB |
|---|---|---|---|---|---|---|
| Proposed GROVER | 98.72 | 75.69 | 53.82 | 91.91 | 57.98 | 80.99 |
| No Gradualness | 98.73 | 75.18 | 53.59 | 91.87 | 57.63 | 80.24 |
Degree of Noises (Step Size & Noise Range).
The degree of noise added by GROVER is an important factor: some noise should be small enough to be corrected during training, while other noise should be large enough to change the word vector distribution. We first vary the step size, i.e., how far the random maskers move in each training process. The larger the step size, the more words are masked per training process, and thus the higher the degree of noise; likewise, the degree of noise decreases as the step size becomes small. The effect of step size is presented in Table 5. Moreover, the range of the random values that fill the maskers also affects the degree of noise; performance as a function of the noise range is presented in Table 6.
We find that the degree of noise, controlled by the step size and the noise range, should be chosen carefully for each dataset. With an appropriate degree of noise, however, GROVER consistently performs well.
Gradualness. Our proposed method increases the number of maskers when validation performance decreases, in order to apply noise to the same words again; otherwise, we simply move to the next most frequent words. We compare three settings: (1) no gradualness, (2) gradualness only when validation performance increases, which is the reverse of our proposal, and (3) gradualness both when validation performance increases and when it decreases. The results are presented in Table 7. Although our proposed approach performs best, the gradualness policy can be treated as a hyperparameter, since the performance gaps are small.
As a regularization technique. GROVER prevents the model from overfitting to the input features (word vectors) by slightly modifying a portion of the word vectors with random noise. Our method shows performance gains on most text classification datasets. Moreover, GROVER can be combined with other regularization techniques, bringing further improvements.
As representation learning. Recent research on contextual representation [25, 6] has largely improved model performance, but pretrained embedding models require substantial additional computational cost and data resources. With our method, we believe such contextual information can be learned through the word vector updates of the iterative training process.
Furthermore, our final representations are learned from the given training set only and do not require an additional embedding model; that is, the representations are specialized to the model architecture solving the given task. Although our method requires additional training time, the performance gain from GROVER is useful when the pretrained word vectors are not suitable for the given task (e.g., random initialization) or when collecting additional data is hard.
We propose GROVER, which adds random noise to word embeddings to change the word vector distribution and regularize the model. Through the re-training process, some of the noise is compensated for while the rest is utilized to learn better representations. In our experiments, GROVER regularizes the model while incrementally fitting the validation set through early stopping. We expect our method to be useful for improving model performance and obtaining representations specialized to a given task.
-  (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: Regularization Techniques, Regularization Implementation.
-  (2016) Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606. Cited by: Motivation, Pretrained Representations, Performance.
-  (2008) Importance of semantic representation: dataless classification.. In AAAI, Vol. 2, pp. 830–835. Cited by: Datasets.
-  (2018) Towards robust neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1756–1766. Cited by: Word-level Noises.
-  (2017) Very deep convolutional networks for text classification. In European Chapter of the Association for Computational Linguistics EACL’17, Cited by: Motivation.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Pretrained Representations, Performance, Discussion.
-  (2015) Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1606–1615. Cited by: Motivation.
-  (2000) Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405 (6789), pp. 947. Cited by: Classifier.
-  (2018) Norm matters: efficient and accurate normalization schemes in deep networks. In Advances in Neural Information Processing Systems, pp. 2164–2174. Cited by: Regularization Techniques.
-  (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: Regularization Techniques, Regularization Implementation.
-  (2018) Extrofitting: enriching word representation and its vector space with semantic lexicons. ACL 2018, pp. 24. Cited by: Motivation, Classifier, Performance.
-  (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751. Cited by: Classifier.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Classifier.
-  (2015) DBpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195. Cited by: Datasets.
-  (2018) Towards understanding regularization in batch normalization. Cited by: Regularization Techniques.
-  (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 142–150. Cited by: Datasets.
-  (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: Word Representations.
-  (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6294–6305. Cited by: Pretrained Representations.
-  (2016) Context2vec: learning generic context embedding with bidirectional lstm. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 51–61. Cited by: Motivation.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: Motivation, Pretrained Representations, Performance.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: Pretrained Representations.
-  (2016) Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725. Cited by: Word-level Noises.
-  (2017) Semantic specialization of distributional word vector spaces using monolingual and cross-lingual constraints. Transactions of the Association for Computational Linguistics 5, pp. 309–324. Cited by: Motivation.
-  (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Motivation, Pretrained Representations, Classifier, Performance.
-  (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Vol. 1, pp. 2227–2237. Cited by: Motivation, Pretrained Representations, Discussion.
-  (1986) Experiments on learning by back propagation. Cited by: Word-level Noises.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: Regularization Techniques, Regularization Implementation.
-  (2010) Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Cited by: Motivation.
-  (2017) L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350. Cited by: Regularization Techniques.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: Motivation.
-  (2014) Word representations via gaussian embedding. arXiv preprint arXiv:1412.6623. Cited by: GROVER Details.
-  (2017) Cross-lingual induction and transfer of verb classes based on word vector space specialisation. arXiv preprint arXiv:1707.06945. Cited by: Motivation.
-  (2018) Word embedding perturbation for sentence classification. arXiv preprint arXiv:1804.08166. Cited by: Word-level Noises.
-  (2015) Character-level convolutional networks for text classification. In Advances in neural information processing systems, pp. 649–657. Cited by: Datasets.