Modern machine learning systems based on deep neural networks, which are usually over-parameterized (i.e., models that often have far more parameters than the number of training examples; Soltanolkotabi et al. (2018); Ari and Saha (2009); Denton et al. (2014)), are prone to over-fitting. In practice, there exist several strategies that explicitly help models achieve better generalization, including $L_1$ and $L_2$ regularization of parameters. There are also algorithmic ways to achieve the same goal, such as dropout Srivastava et al. (2014). Recent studies show that these explicit regularization techniques do not significantly affect generalization performance; the model architecture itself plays a more important role Zhang et al. (2016). Several theory papers discover that SGD has an implicit regularization effect that tends to make the weight norm smaller in logistic regression and 1-layer neural networks Li et al. (2017). Understanding this effect in practical deep learning systems is crucial.
1.1 Our Contributions
In this paper, we study how implicit regularization behaves in deep learning systems for NLP tasks and how its implications relate to understanding over-parameterized models' generalization performance. Specifically, we try to answer the following research questions:
RQ1: Does SGD implicit regularization exist in deep learning systems? If so, how can we make better use of this phenomenon in state-of-the-art models?
SGD and implicit regularization
We show that the implicit regularization effect of SGD gradually disappears as pure SGD moves toward mini-batch methods. Since pure SGD converges slowly, we recommend that researchers with abundant computational resources use pure SGD instead of mini-batches to achieve potentially better performance.
The role of explicit regularization
We discover that the implicit regularization effect is observable even when dropout is added, although such an explicit regularizer can compensate for part of the generalization gap at large batch sizes. This gives us a better understanding of why explicit regularization does not help that much.
RQ2: Will SGD implicit regularization be affected by the model's capability of finding intrinsic patterns, especially when the training data is challenging to learn?
Impact of limited training data
Neural networks have limited capability to learn from a given training data size. Intuition suggests that if the dataset is extremely small, the network will struggle to learn the intrinsic pattern of the data, i.e., show extremely bad generalization performance. To our surprise, we only observe a slow, steady decrease in performance as we shrink the dataset to a fraction of the original. Moreover, we observe the implicit regularization effect under all of the partial dataset sizes we test. This means that implicit regularization, as an intrinsic property of SGD, is only slightly affected by dataset size when the model is over-parameterized. It sheds light on future analysis of the generalization performance of over-parameterized models.
Fitting corrupted labels
Over-parameterized models are able to learn random labels: they tend to learn recognizable patterns while at the same time fitting noise by brute force Zhang et al. (2016). As the noise level rises, the pattern becomes harder to recognize and SGD converges more slowly. Under this situation, is the decrease in generalization performance due to a decline of the implicit regularization effect? Our experiments surprisingly show that, under different acceptable noise levels, the implicit regularization effect stays the same. This helps researchers study why over-parameterized models retain generalization ability even though they can fit random data.
RQ3: Initialization of the model affects its performance. Can we discover any correlation between the initialization range and implicit regularization performance?
Initialization based effect
Recent studies show that smaller weight initialization provides better generalization for over-parameterized matrix factorization models trained with gradient descent (GD) and early stopping Li et al. (2017). Initialization might thus potentially impact the implicit regularization effect in our experiments; yet, we discover that this conclusion is not universal for deep learning systems trained with SGD. Our experiments show that models' implicit regularization performance does not have a steady, monotonic relationship with the weight initialization range. Based on our results, models with higher implicit regularization performance tend to prefer smaller initialization ranges. We illustrate our findings in the following sections.
2 NLP Tasks and Configuration
In order to observe the impact of SGD implicit regularization, we use vanilla SGD with batch size 1 as the baseline method. We then observe how the model performs as we increase the batch size, shifting toward GD. Since different settings should be treated equally, instead of iterating through the training data by epochs, we shuffle the data each time a random batch is selected and measure the number of steps required. We choose sentence classification, language modeling, and sequence tagging as our target NLP tasks since they are core NLP problems with different machine learning settings. Our reported results are averaged over at least 5 runs. We specifically focus on SGD without momentum. Given a random pair $(x_i, y_i)$ from the training data, the update of a parameter $w$ with learning rate $\eta$ is

$$w \leftarrow w - \eta \, \nabla_w \ell(f(x_i; w), y_i),$$
where $\ell$ is the loss function and $f(x_i; w)$ stands for the computed label. Our experiments specifically focus on the cross-entropy loss because it is practical yet lacks theoretical treatment in the SGD implicit regularization literature:

$$\ell(f(x_i; w), y_i) = -\sum_{c} y_{i,c} \log f_c(x_i; w).$$
Note that the definition above is only for pure SGD, which updates each parameter with a batch size of 1. Since larger batch sizes can also be used for the optimization method, the update rule with batch size $B$ generalizes to

$$w \leftarrow w - \frac{\eta}{B} \sum_{i \in \mathcal{B}} \nabla_w \ell(f(x_i; w), y_i),$$

where $\mathcal{B}$ is a randomly selected batch of size $B$.
As the batch size increases from 1 to the size of the complete training set, the optimization method gradually moves from pure SGD to GD. If the implicit regularization of pure SGD can be observed, generalization performance will decrease as the batch size in the above update rule increases. We define this degradation of generalization performance as: performance decreases while batch size increases. This technique is used to study SGD implicit regularization.
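The interpolation between pure SGD and GD described above can be sketched as follows. This is an illustrative toy (logistic regression with cross-entropy loss, a synthetic dataset, and hypothetical hyperparameters), not the paper's actual training code; `batch_size=1` is pure SGD and `batch_size=len(X)` recovers GD:

```python
import numpy as np

def sgd_step(w, X, y, lr, batch_size, rng):
    """One (mini-batch) SGD step for logistic regression with cross-entropy loss."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    p = 1.0 / (1.0 + np.exp(-xb @ w))      # predicted P(y = 1)
    grad = xb.T @ (p - yb) / batch_size    # averaged cross-entropy gradient
    return w - lr * grad

# toy data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(3)
for _ in range(500):
    w = sgd_step(w, X, y, lr=0.1, batch_size=1, rng=rng)   # pure SGD
acc = np.mean((1.0 / (1.0 + np.exp(-X @ w)) > 0.5) == y)
```

Replacing `batch_size=1` with a larger value (up to `len(X)`) traverses exactly the SGD-to-GD spectrum studied in the experiments.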
2.1 Sentence classification
The first task that we investigate is sentence classification. The goal is to classify the sentiment of a sentence $s$ as positive or negative with a one-hot label $y = (y_+, y_-)$, where $y_+$ stands for the probability of a positive meaning and $y_-$ stands for the contrary. The model learns the probability given a sentence by

$$p(y \mid s) = \mathrm{softmax}(f(s; w)).$$
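The probability the model learns can be illustrated with a minimal softmax over two logits; the encoder output values here are hypothetical stand-ins for what a sentence encoder would produce:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# hypothetical encoder output: two logits (positive, negative)
logits = np.array([2.0, -1.0])
p = softmax(logits)            # p[0] = P(positive), p[1] = P(negative)
```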
We use a one-layer CNN with the same hyperparameter settings as Kim Kim (2014) but eliminate all explicit regularization, including the norm constraint and dropout. The three datasets we select are also used by Kim Kim (2014), with pre-trained embeddings. (1) MR: movie reviews with one sentence per review, each either positive or negative Pang and Lee (2005). (2) SST2: Stanford Sentiment Treebank with neutral reviews removed and binary labels Socher et al. (2013). (3) Subj: subjectivity dataset for classifying whether a sentence is subjective or objective Pang and Lee (2004). MR and Subj do not provide a standard train/test split, so 10-fold cross-validation is used, strictly following the choice of Kim Kim (2014) and Zhang and Wallace Zhang and Wallace (2015). SST2 has a train/test split, so we average the performance over 10 runs.
2.2 Language modeling
Language modeling is the task of predicting the next word in a sentence. Given the current words $w_1, \dots, w_t$, the model should learn to predict a probability distribution over the next word:

$$p(w_{t+1} \mid w_1, \dots, w_t).$$
Using a model to generate words is different from training a classifier as in sentence classification. Thus, we choose it as our second task.
We use one dataset, the Penn Treebank corpus Marcus et al. (1993), focusing only on a 2-layer LSTM. Learning rate decay is applied during training since it significantly helps the model converge. To analyze the role of explicit regularization, focusing on dropout, we train the model in 2 different settings: without dropout, or with dropout set to 0.5. Both settings are repeated over 10 runs. To account for the effect of initialization, we choose 5 different initialization ranges, each with 2 fixed initialization schemes.
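Since language modeling performance is reported as perplexity, a quick sketch of how perplexity follows from the average token-level cross-entropy; the per-token negative log-likelihoods below are hypothetical:

```python
import math

def perplexity(neg_log_probs):
    """Perplexity = exp(mean per-token cross-entropy); natural log assumed."""
    return math.exp(sum(neg_log_probs) / len(neg_log_probs))

# hypothetical per-token negative log-likelihoods from a language model
nll = [4.9, 5.1, 4.8, 5.2]
ppl = perplexity(nll)   # mean NLL is 5.0, so ppl = exp(5.0) ≈ 148.4
```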
2.3 Sequence tagging
Sequence tagging, as a supervised learning task, is a core NLP problem. It can be viewed as an extension of the classification problem and a simpler form of structured prediction: input a word sequence and output a tagging sequence. Concretely, for a word sequence $x_{1:n}$, we want to find the tag sequence $y_{1:n}$ with maximum conditional probability:

$$\hat{y}_{1:n} = \arg\max_{y_{1:n}} p(y_{1:n} \mid x_{1:n}).$$
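With a per-token cross-entropy loss and no structured decoder, the conditional probability factorizes over tokens, so decoding reduces to an independent argmax per position. A minimal sketch with hypothetical per-token tag scores:

```python
import numpy as np

# hypothetical per-token tag probabilities for a 3-token sentence and 2 tags
scores = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.6, 0.4]])
tags = scores.argmax(axis=1)   # independent argmax per token
```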
We consider one dataset: the CoNLL 2003 NER task Tjong Kim Sang and De Meulder (2003). The model architecture we use is a BiLSTM with pre-trained embeddings and without a conditional random field (CRF) loss, since we want to consistently use the ordinary cross-entropy loss in all three tasks. We repeat the experiment five times and report the averaged results.
3 Shallow Neural Network Experiments
As deep neural networks remain mysterious for many reasons, many researchers have tried to reveal their inner logic starting from shallow models Mianjy et al. (2018); Soudry et al. (2017); Gunasekar et al. (2017). It is useful to appeal to the simple case of shallow neural network models to see if there are parallel insights that can help us understand generalization better before we move on to deep learning systems in the next section.
We build a simple 3-layer neural network on a non-linearly separable dataset, as one step beyond analyzing logistic regression on linearly separable data Soudry et al. (2017). The dataset we choose consists of 200 points forming two interleaving half circles with Gaussian noise of standard deviation 0.35. The train/test split ratio is 7:3.
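The two-moons dataset above can be generated with a short numpy sketch; this mirrors scikit-learn's `make_moons` construction, and the seeds are arbitrary:

```python
import numpy as np

def make_two_moons(n=200, noise=0.35, seed=0):
    """Two interleaving half circles with additive Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0, np.pi, n // 2)
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)            # first moon
    lower = np.stack([1 - np.cos(t), 0.5 - np.sin(t)], axis=1)  # second moon
    X = np.vstack([upper, lower]) + rng.normal(scale=noise, size=(n, 2))
    y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
    return X, y

X, y = make_two_moons()
split = int(0.7 * len(X))              # 7:3 train/test split
perm = np.random.default_rng(1).permutation(len(X))
X_train, X_test = X[perm[:split]], X[perm[split:]]
y_train, y_test = y[perm[:split]], y[perm[split:]]
```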
The generalization capability of a model is commonly evaluated by measuring the complexity of the model with specific yardsticks, of which the norm and the Lipschitz constant are two of the most widely used Ng (2004); Xu and Mannor (2012); McAllester (1999). Here, we choose these 2 measures to study how implicit regularization is related to a model's generalization:
- Averaged norm of all weight vectors.
- Lipschitz constant of the given loss function $\ell$, i.e., the smallest $L$ such that $|\ell(u) - \ell(v)| \le L \, \|u - v\|$.
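Both yardsticks can be approximated numerically. The sketch below is our own rough version, not the paper's measurement code: the row-wise norm convention and the finite-difference Lipschitz estimate are assumptions:

```python
import numpy as np

def averaged_weight_norm(weight_matrices):
    """Average L2 norm over all hidden-unit weight vectors (rows of each matrix)."""
    norms = [np.linalg.norm(row) for W in weight_matrices for row in W]
    return float(np.mean(norms))

def empirical_lipschitz(f, X, eps=1e-4, seed=0):
    """Crude empirical Lipschitz estimate of f via small random input perturbations."""
    rng = np.random.default_rng(seed)
    ratios = []
    for x in X:
        d = rng.normal(size=x.shape)
        d = eps * d / np.linalg.norm(d)          # perturbation of norm eps
        ratios.append(abs(f(x + d) - f(x)) / eps)
    return max(ratios)

W1 = np.array([[3.0, 4.0], [0.0, 1.0]])          # two hidden units: norms 5 and 1
avg = averaged_weight_norm([W1])                  # (5 + 1) / 2 = 3.0
est = empirical_lipschitz(lambda x: 2.0 * x[0], np.zeros((50, 2)))  # true constant 2
```

The finite-difference estimate only lower-bounds the true constant, which is why many random directions are sampled.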
Figure 2 shows that pure SGD has the smallest averaged norm and Lipschitz constant due to the implicit regularization effect. Figure 1 shows the decision boundary of the model with different batch sizes. Along with the fact that all batch sizes achieve the same test accuracy, pure SGD with batch size 1 tends to have a more stable decision boundary that is prevented from over-fitting the data. Larger batch sizes are more volatile and prone to converge to a decision boundary with more curvature. Even though we select the simplest model with the best accuracy when measuring performance on test data, these convergence and over-fitting observations give us some insight into how implicit regularization can help SGD achieve better generalization.
The results increase our confidence that SGD’s implicit regularization performance will be consistent on practical deep learning models. Our central findings in the 3-layer neural network can be summarized as:
F0-1: SGD with smaller batch sizes has a stronger implicit regularization effect as it approaches pure SGD.
F0-2: Pure SGD has far better implicit regularization performance, whereas all other batch sizes show similar performance.
4 Analysis on Deep Learning
We now move our study to deep learning systems. To answer RQ1: after eliminating all explicit regularization effects, whether an over-parameterized model can achieve good generalization performance depends solely on the model architecture and its learning method.
The results in Figure 3 consistently reflect the findings in the shallow experiments. Pure SGD has a better generalization performance in all 3 tasks. Thus, we can generalize F0-1 and F0-2 to deep learning:
F1-1: Generalization decreases significantly between batch sizes 1 and 5, which shows that, even though accuracy decreases almost monotonically with batch size, pure SGD still stands out among the others.
On the other hand, pure SGD converges significantly more slowly than the others. Since current state-of-the-art models in NLP often use mini-batch methods to measure performance, we recommend that future studies use pure SGD to better test models' generalization ability, assuming researchers have appropriate computational resources.
Consistency of learning rate
When using SGD, Kim Kim (2014)'s CNN model can converge without learning rate decay. We try multiple learning rates to make sure the observation is convincing. The results on the SST2 dataset in Figure 4 show that the effect is consistent: SGD with batch size 1 converges to a more generalizable global minimum. Note that in Figure 4 the performance of batch size 1 weakens as the learning rate decreases, because batch size 1 requires a relatively larger learning rate to converge well.
4.1 The Role of explicit regularization
Language Modeling (Penn Treebank, perplexity; lower is better):

| Batch Size | Dev Perplexity (no dropout) | Test Perplexity (no dropout) | Dev Perplexity (dropout 0.5) | Test Perplexity (dropout 0.5) |
|---|---|---|---|---|
| 1 | 140.08 (139.38, 141.26) | 135.41 (134.22, 136.26) | 94.13 (93.83, 94.30) | 89.24 (88.80, 89.59) |
| 5 | 139.73 (138.55, 140.95) | 136.02 (133.70, 137.72) | 100.26 (99.90, 100.87) | 95.38 (94.76, 95.96) |
| 10 | 140.47 (137.64, 141.93) | 136.80 (134.94, 138.33) | 103.30 (102.24, 103.92) | 99.25 (97.96, 100.28) |
| 20 | 140.82 (138.85, 142.38) | 137.06 (135.38, 138.14) | 105.58 (104.16, 106.35) | 102.03 (100.66, 102.65) |
| 30 | 141.60 (139.75, 142.49) | 137.82 (135.90, 138.96) | 137.80 (134.89, 139.55) | 138.23 (135.27, 139.15) |
| 40 | 141.45 (139.10, 142.95) | 137.80 (134.89, 139.55) | 107.28 (106.18, 107.95) | 103.63 (102.96, 104.15) |
| 50 | 141.29 (139.15, 142.92) | 138.32 (135.27, 139.49) | 107.48 (106.11, 108.10) | 104.00 (102.50, 105.11) |

Sequence Tagging (CoNLL 2003, F1; higher is better):

| Batch Size | Dev F1 (no dropout) | Test F1 (no dropout) | Dev F1 (dropout 0.5) | Test F1 (dropout 0.5) |
|---|---|---|---|---|
| 1 | 86.92 (80.83, 81.60) | 81.22 (80.83, 81.60) | 89.241 (88.89, 89.74) | 84.12 (83.53, 84.50) |
| 5 | 85.72 (85.35, 85.95) | 79.73 (79.15, 80.18) | 88.598 (88.06, 88.89) | 83.32 (82.99, 83.89) |
| 10 | 85.09 (84.99, 89.73) | 79.12 (78.61, 79.73) | 88.401 (87.95, 88.68) | 83.12 (82.24, 83.52) |
| 20 | 84.45 (84.19, 84.76) | 78.31 (77.50, 78.68) | 87.729 (87.56, 87.86) | 82.72 (82.21, 83.23) |
| 30 | 84.20 (83.85, 84.93) | 78.30 (77.72, 78.84) | 87.594 (87.26, 87.77) | 82.45 (82.13, 82.76) |
| 40 | 83.76 (83.47, 84.16) | 77.79 (77.28, 77.99) | 87.607 (87.54, 87.78) | 82.51 (82.14, 82.74) |
| 50 | 83.49 (83.29, 83.80) | 77.92 (77.53, 78.43) | 87.619 (87.50, 87.93) | 82.46 (82.20, 82.94) |
Most of our experiments are done without explicit regularization. These strategies are used to help deep neural nets with a large number of parameters deal with overfitting. We specifically focus on dropout in this paper, since it has a greater impact than other regularization methods Srivastava et al. (2014). The key idea is to randomly drop units during training to prevent units from co-adapting too much.
We do have an understanding of the implicit bias of dropout, which tends to equalize the norms of the weight vectors of all hidden units in single-hidden-layer linear neural networks Mianjy et al. (2018), and of its close connection with an adaptive $L_2$-style regularizer Wager et al. (2013); Helmbold and Long (2015). Empirical studies show that even though dropout and norm regularization are important, model architecture modification is more crucial Zhang and Wallace (2015); Zhang et al. (2016).
Our experiments confirm that dropout can improve generalization performance. More importantly, however, the sentence classification, language modeling, and sequence tagging experiments with dropout still exhibit implicit regularization. We further compare implicit regularization performance with and without dropout in all three tasks. The results shown in Figure 3 imply that the implicit regularization effect is weakened but still present, keeping pure SGD the most generalizable model.
F1-2: Dropout, as an explicit regularization, neither tremendously affects generalization performance nor completely substitutes for the regularization effect of SGD.
We recommend that researchers further study the impact of other explicit regularizers such as early stopping and the L2 regularizer. Although dropout is only a small piece of the explicit regularization story, our study points toward implicit regularization as a major reason why an over-parameterized model, although it can learn all the training data by brute force, still generalizes well.
5 Neural Network’s Finite Capability
A deep learning model has its own finite capability to fit modern datasets. Under rare circumstances, the intrinsic pattern of the training data is hard to discover, which raises the question of whether an over-parameterized model will learn the data by brute force or reach the limit of its ability to understand the pattern. We study how implicit regularization is affected by such dataset challenges. Specifically, we test our models both on training data that is smaller than the original and on data with corrupted labels, which can make the task almost impossible to learn. We answer RQ2 here.
5.1 Limited sample expressivity
There exist analyses of the expressivity of neural networks Mhaskar and Poggio (2016); Cohen and Shashua (2016). One recent study concerns finite sample size: how large the model should be in order to learn a certain number of random labels by brute force Zhang et al. (2016). Here, we pose a converse question: is there a specific limit on the size of the data below which neural networks can hardly capture the real pattern, and struggle to choose between fitting the data as random labels and trying their best to find intrinsic features? Admittedly, it is more realistic to use SGD when the training sample size is small, as SGD is the slowest optimization algorithm compared with Adadelta Zeiler (2012), Adagrad Duchi et al. (2011), and Adam Kingma and Ba (2014). However, if this limit is not small enough, we cannot utilize the implicit regularization ability of SGD on models with small data sizes. Verifying how a model's expressivity affects SGD implicit regularization on limited-sample datasets is therefore crucial.
In the sentence classification experiment on the SST2 dataset, we extract 0.1, 0.3, and 0.6 of the original data. Intuition suggests that if the dataset is extremely small, the neural network will struggle to learn the intrinsic pattern from the data, namely, show extremely bad generalization performance. Unexpectedly, even though the performance of the model decreases steadily as the data fraction decreases, we observe that the implicit regularization effect is consistent under different fractions of the original training data. An important conclusion can be made:
F2-1: The implicit regularization effect is an intrinsic property of SGD that is not affected by data size for over-parameterized models.
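The partial-dataset setup used in these experiments can be sketched as random subsampling without replacement; the function name and seed are ours:

```python
import numpy as np

def subsample(X, y, fraction, seed=0):
    """Keep a random `fraction` of the training examples, without replacement."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(fraction * len(X)))
    idx = rng.choice(len(X), size=n_keep, replace=False)
    return X[idx], y[idx]

X = np.arange(1000).reshape(500, 2)
y = np.arange(500) % 2
X_small, y_small = subsample(X, y, fraction=0.3)   # keeps 150 of 500 examples
```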
|Initialization range||Batch size 1||Batch size 5||Batch size 10||Batch size 20||Batch size 30||Batch size 40||Batch size 50|
5.2 Fitting corrupted labels
We run our experiments with partially corrupted labels as a variant of the randomization test from non-parametric statistics Edgington (1986). For a corruption level $p$, each training label is independently corrupted to a wrong label with probability $p$. The sentence classification datasets we choose have binary labels, so a corrupted label is flipped to the wrong one. This study is inspired by Zhang et al.'s generalization study Zhang et al. (2016). They surprisingly observed that SGD on several state-of-the-art models could fit any level of label corruption, including fully random labels, which challenges the traditional thinking on generalization. Even though the training error can be reduced to 0, the generalization ability is still affected by label corruption due to over-fitting.
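The label-corruption procedure can be sketched as follows; the function name and seed are ours, and `level` plays the role of the corruption probability $p$:

```python
import numpy as np

def corrupt_binary_labels(y, level, seed=0):
    """Independently flip each binary label with probability `level`."""
    rng = np.random.default_rng(seed)
    flip = rng.random(len(y)) < level
    return np.where(flip, 1 - y, y), flip

y = np.zeros(10000, dtype=int)
y_corrupt, flipped = corrupt_binary_labels(y, level=0.3)
frac = flipped.mean()          # empirical corruption rate, close to 0.3
```

At `level=0.5` a binary dataset carries no label signal at all, which matches the observation that all batch sizes collapse to the same performance near that point.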
Since the model tends to recognize the intrinsic pattern of the training sample while at the same time brute-forcing the rest, it is unknown why an over-parameterized model still has generalization ability when it could brute-force all the data. One interesting observation is that the performance of models with larger batch sizes, such as 35 in our sentence classification experiments, does not drop suddenly at some corruption level but decreases predictably with the percentage of corruption. In contrast, SGD with smaller batch sizes tends to keep its generalization gap over large batch sizes and starts to lose this advantage at label corruption levels close to 0.3. All batch sizes converge to almost the same generalization performance at a corruption level close to 0.4. A corruption level of 0.5 on a binary-label dataset means the model performs random prediction, which makes all batch sizes the same. The questions are: why does this cross-batch-size convergence only happen after some corruption level, and why do smaller batch sizes have a larger rate of decrease?
We believe there are intrinsic reasons that predispose small-batch models, especially pure SGD, to learn the pattern as much as possible, and implicit regularization is one of the biggest. To gain further insight into this phenomenon, we measure the role of implicit regularization at multiple corruption levels. It can be clearly seen that the implicit regularization effect remains strong at corruption levels 0.1 and 0.2, but is weakened at 0.3.
F2-2: SGD implicit regularization improves generalization even when brute-force learning is required.
This gives us new insights into over-parameterized models' generalization ability. Although the implicit regularization of SGD might not completely solve the intriguing question of generalization, it sheds light on future study of deep learning models' intrinsic abilities and optimization strategies.
6 Weight Initialization Stability
Various kinds of initialization schemes are in widespread use in mainstream deep learning studies Saxe et al. (2013); Sutskever et al. (2013); Duchi et al. (2011). Xavier initialization and He initialization Jia et al. (2014); He et al. (2015) are among the most widely used. The study of Sutskever et al. Sutskever et al. (2013) shows that well-designed initialization can bring a breakthrough in the performance of deep and recurrent neural networks using SGD. Given the role of initialization in deep learning, it is necessary to verify the consistency of our observation that pure SGD has the strongest implicit regularization among all batch sizes. We present our answer to RQ3 based on Table 2. In language modeling, a smaller initialization range yields a strong implicit regularization effect, while larger weight initialization can weaken it. This observation is consistent with a recent theoretical study on matrix factorization Li et al. (2017):
F3: Smaller initialization range will lead to a better generalization effect.
In addition, a large batch size might prefer a larger initialization range, although this cannot guarantee a globally best solution. Based on the above study, using a slightly larger batch size such as 5 with a proper weight initialization setting could potentially restore the implicit regularization effect while converging much faster. We will continue studying how to make this effect more practical.
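An initialization range $r$, as compared in Table 2, can be sketched as uniform draws from $[-r, r]$; the specific ranges and shapes below are hypothetical, not the paper's exact settings:

```python
import numpy as np

def uniform_init(shape, r, seed=0):
    """Initialize weights uniformly in [-r, r]; smaller r = smaller init range."""
    rng = np.random.default_rng(seed)
    return rng.uniform(-r, r, size=shape)

# hypothetical small vs. large initialization ranges for a 512x512 layer
W_small = uniform_init((512, 512), r=0.05)
W_large = uniform_init((512, 512), r=0.5)
```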
7 Related work
A few papers have constructed theories of SGD implicit regularization. Li et al. Li et al. (2017) prove the implicit regularization effect of SGD on small machine learning models, based on the fact that SGD prefers some generalizable local minima to others Srebro et al. (2011). However, these theories do not establish universal conclusions that can be applied to deeper neural networks with more practical activations and the cross-entropy loss function. Such theory construction requires a better understanding of the optimization algorithm's behavior for non-linear models and of the relevant statistics. In this work, we study this unsolved problem through large-scale and comprehensive empirical studies.
One study shows that logistic regression models and 2-layer neural networks with monotonically decreasing loss functions tend to converge in the direction of the max-margin solution when using GD and SGD Soudry et al. (2017). We further extend this conclusion by studying more practical deep learning systems.
Zhang et al. Zhang et al. (2016) point out how traditional approaches fail to describe why over-parameterized models generalize well, by successfully letting state-of-the-art models fit corrupted labels. However, that study does not reach a clear conclusion about the implicit regularization effect. In this work, we explore further how implicit regularization is affected by an over-parameterized model's capability of fitting corrupted labels. The role of dropout in SGD's implicit regularization is also analyzed, building on the work of Wager et al. Wager et al. (2013) and Mianjy et al. Mianjy et al. (2018).
We analyze the effect of SGD implicit regularization in deep learning systems, especially on NLP tasks. The results are consistent across settings: (i) dropout, as an explicit regularizer, cannot completely replace the effect of SGD implicit regularization; (ii) models' limited capabilities do not greatly affect SGD implicit regularization; (iii) SGD models with stronger implicit regularization tend to prefer lower initialization ranges. Our study can help future researchers better measure their state-of-the-art models' generalization performance using SGD.
- Ari and Saha (2009) Samit Ari and Goutam Saha. 2009. In search of an optimization technique for artificial neural network to classify abnormal heart sounds. Applied Soft Computing, 9(1):330–340.
- Cohen and Shashua (2016) Nadav Cohen and Amnon Shashua. 2016. Convolutional rectifier networks as generalized tensor decompositions. In International Conference on Machine Learning, pages 955–963.
- Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pages 1269–1277.
- Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159.
- Edgington (1986) Eugene S Edgington. 1986. Randomization tests. Marcel Dekker, Inc.
- Gunasekar et al. (2017) Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. 2017. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151–6159.
- He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.
- Helmbold and Long (2015) David P Helmbold and Philip M Long. 2015. On the inductive bias of dropout. The Journal of Machine Learning Research, 16(1):3403–3454.
- Jia et al. (2014) Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross B. Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Li et al. (2017) Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. 2017. Algorithmic regularization in over-parameterized matrix recovery. arXiv preprint arXiv:1712.09203.
- Marcus et al. (1993) Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
- McAllester (1999) David A McAllester. 1999. PAC-Bayesian model averaging. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pages 164–170. ACM.
- Mhaskar and Poggio (2016) Hrushikesh N Mhaskar and Tomaso Poggio. 2016. Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848.
- Mianjy et al. (2018) Poorya Mianjy, Raman Arora, and Rene Vidal. 2018. On the implicit bias of dropout. arXiv preprint arXiv:1806.09777.
- Ng (2004) Andrew Y Ng. 2004. Feature selection, l 1 vs. l 2 regularization, and rotational invariance. In Proceedings of the twenty-first international conference on Machine learning, page 78. ACM.
- Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 271. Association for Computational Linguistics.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics, pages 115–124. Association for Computational Linguistics.
- Saxe et al. (2013) Andrew M Saxe, James L McClelland, and Surya Ganguli. 2013. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Soltanolkotabi et al. (2018) Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. 2018. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory.
- Soudry et al. (2017) Daniel Soudry, Elad Hoffer, and Nathan Srebro. 2017. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345.
- Srebro et al. (2011) Nati Srebro, Karthik Sridharan, and Ambuj Tewari. 2011. On the universality of online mirror descent. In Advances in neural information processing systems, pages 2645–2653.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Sutskever et al. (2013) Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147.
- Tjong Kim Sang and De Meulder (2003) Erik F Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.
- Wager et al. (2013) Stefan Wager, Sida Wang, and Percy S Liang. 2013. Dropout training as adaptive regularization. In Advances in neural information processing systems, pages 351–359.
- Xu and Mannor (2012) Huan Xu and Shie Mannor. 2012. Robustness and generalization. Machine learning, 86(3):391–423.
- Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- Zhang and Wallace (2015) Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820.