1 Introduction
Regularization is one of the key elements of machine learning, particularly of deep learning
(Goodfellow et al., 2016): it allows models to generalize well to unseen data even when trained on a finite training set or with an imperfect optimization procedure. In the traditional sense of optimization, and also in the older neural-network literature, the term “regularization” is reserved solely for a penalty term in the loss function
(Bishop, 1995a). Recently, the term has adopted a broader meaning: Goodfellow et al. (2016, Chap. 5) loosely define it as “any modification we make to a learning algorithm that is intended to reduce its test error but not its training error”. We find this definition slightly restrictive, since many techniques considered regularization do reduce the training error (e.g. weight decay in AlexNet (Krizhevsky et al., 2012)), and therefore present our own working definition:

Definition 1.
Regularization is any supplementary technique that aims at making the model generalize better, i.e. produce better results on the test set.
This can include various properties of the loss function, the loss optimization algorithm, or other techniques. Note that this definition is more in line with machine learning literature than with inverse problems literature, the latter using a more restrictive definition.
Before we proceed to the presentation of our taxonomy, we revisit some basic machine learning theory in Section 2; this provides a justification for the top level of the taxonomy. In Sections 3–7, we continue with a finer division of the individual classes of regularization techniques, followed by our practical recommendations in Section 8. We are aware that the many research works discussed in this taxonomy cannot each be summarized in a single sentence. For the sake of structuring the multitude of papers, we decided to describe merely a certain subset of their properties, according to the focus of our taxonomy.
2 Theoretical framework
The central task of our interest is model fitting: finding a function $f$ that can well approximate a desired mapping from inputs $x$ to desired outputs $f(x)$. A given input $x$ can have an associated target $t$, which dictates the desired output $f(x)$ directly (or, in some applications, indirectly (Ulyanov et al., 2016; Johnson et al., 2016)). A typical example of having available targets is supervised learning. Data samples $(x, t)$ then follow a ground truth probability distribution $P$.

In many applications, neural networks have proven to be a good family of functions to choose from. A neural network is a function $f_w$ with trainable weights $w$. Training the network means finding a weight configuration $w^*$ minimizing a loss function $\mathcal{L}$:

$w^* = \arg\min_{w} \mathcal{L}(w)$  (1)
Usually the loss function takes the form of expected risk:

$\mathcal{L}(w) = \mathbb{E}_{(x,t) \sim P} \big[ E(f_w(x), t) + R(\cdot) \big]$  (2)

where we identify two parts: an error function $E$ and a regularization term $R$. The error function depends on the targets $t$ and assigns a penalty to model predictions according to their consistency with the targets. The regularization term assigns a penalty to the model based on other criteria; it may depend on anything except the targets $t$, for example on the weights $w$ (see Section 6).

The expected risk cannot be minimized directly, since the data distribution $P$ is unknown. Instead, a training set $\mathcal{D}$ sampled from $P$ is given. The minimization of the expected risk can then be approximated by minimizing the empirical risk $\hat{\mathcal{L}}$:

$\hat{\mathcal{L}}(w) = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \big[ E(f_w(x_i), t_i) + R(\cdot) \big]$  (3)

where $(x_i, t_i)$ are samples from $\mathcal{D}$.
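To make Eq. (3) concrete, the following sketch computes the empirical risk for a toy linear model, with a squared error and an $L_2$ penalty standing in for $E$ and $R$; all concrete choices here are illustrative, not prescribed by the taxonomy.

```python
import numpy as np

def model(w, x):
    # A toy linear "network": f_w(x) = w . x
    return np.dot(w, x)

def error(y, t):
    # Squared error between prediction y and target t
    return (y - t) ** 2

def regularizer(w, lam=0.01):
    # Example regularization term depending only on the weights (L2)
    return lam * np.sum(w ** 2)

def empirical_risk(w, data):
    # Eq. (3): average error over the training set plus the regularization term
    err = np.mean([error(model(w, x), t) for x, t in data])
    return err + regularizer(w)

data = [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), -1.0)]
w = np.array([1.0, -1.0])
risk = empirical_risk(w, data)  # error is 0 here, so risk equals the penalty 0.02
```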
Now we have the minimal background to formalize the division of regularization methods into a systematic taxonomy. In the minimization of the empirical risk, Eq. (3), we can identify the following elements that are responsible for the value of the learned weights, and thus can contribute to regularization:

$\mathcal{D}$: The training set, discussed in Section 3

$f_w$: The selected model family, discussed in Section 4

$E$: The error function, briefly discussed in Section 5

$R$: The regularization term, discussed in Section 6

The optimization procedure itself, discussed in Section 7
Ambiguity regarding the splitting of methods into these categories and their subcategories is discussed in Appendix A using notation from Section 3.
3 Regularization via data
The quality of a trained model depends largely on the training data. Apart from the acquisition/selection of appropriate training data, it is possible to employ regularization via data. This is done by applying some transformation to the training set $\mathcal{D}$, resulting in a new set $\tilde{\mathcal{D}}$. Some transformations perform feature extraction or preprocessing, modifying the feature space or the distribution of the data to some representation that simplifies the learning task. Other methods allow generating new samples to create a larger, possibly infinite, augmented dataset. These two principles are somewhat independent and may be combined; the goal of regularization via data is either one of them, or the other, or both. They both rely on transformations with (stochastic) parameters:

Definition 2.
A transformation with stochastic parameters is a function $\tau_\theta$ with parameters $\theta$ which follow some probability distribution.

In this context we consider transformations $\tau_\theta$ which can operate on network inputs, activations in hidden layers, or targets. An example of a transformation with stochastic parameters is the corruption of inputs by Gaussian noise (Bishop, 1995b; An, 1996):

$\tau_\theta(x) = x + \theta, \qquad \theta \sim \mathcal{N}(0, \sigma^2 I)$  (4)

The stochasticity of the transformation parameters $\theta$ is responsible for generating new samples, i.e. data augmentation. Note that the term data augmentation often refers specifically to transformations of inputs or hidden activations, but here we also list transformations of targets for completeness. The exception to the stochasticity is when $\theta$ follows a delta distribution, in which case the transformation parameters become deterministic and the dataset size is not augmented.
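A minimal sketch of this transformation; `sigma` and the random seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise_augment(x, sigma=0.1):
    # Eq. (4): tau_theta(x) = x + theta, with theta ~ N(0, sigma^2 I).
    # Because theta is resampled on every call, each call yields a new
    # augmented sample from the same original input x.
    theta = rng.normal(0.0, sigma, size=x.shape)
    return x + theta

x = np.zeros(4)
augmented = [gaussian_noise_augment(x) for _ in range(3)]
```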
We can categorize the data-based methods according to the properties of the transformation used and of the distribution of its parameters $\theta$. We identify the following criteria for categorization (some of them later serve as columns in Tables 1–2):
Stochasticity of the transformation parameters

Deterministic parameters: Parameters follow a delta distribution; the size of the dataset remains unchanged

Stochastic parameters: Allow generation of a larger, possibly infinite, dataset. Various strategies for sampling of $\theta$ exist:

Random: Draw a random $\theta$ from the specified distribution

Adaptive: The value of $\theta$ is the result of an optimization procedure, usually with the objective of maximizing the network error on the transformed sample (such a “challenging” sample is considered the most informative one at the current training stage), or of minimizing the difference between the network prediction and a predefined fake target

Constrained optimization: $\theta$ found by maximizing the error under hard constraints (the support of the distribution of $\theta$ controls the strongest allowed transformation)

Unconstrained optimization: $\theta$ found by maximizing a modified error function, using the distribution of $\theta$ as weighting (proposed herein for completeness, not yet tested)

Stochastic: $\theta$ found by taking a fixed number of samples of $\theta$ and using the one yielding the highest error

Effect on the data representation

Representation-preserving transformations: Preserve the feature space and attempt to preserve the data distribution

Representation-modifying transformations: Map the data to a different representation (different distribution or even a new feature space) that may disentangle the underlying factors of the original representation and make the learning problem easier
Transformation space

Input: Transformation is applied to the input $x$

Hidden-feature space: Transformation is applied to some deep-layer representation of the samples (this also uses parts of $f_w$ and $w$ to map the input into the hidden-feature space; such transformations act inside the network and thus can be considered part of the architecture, additionally fitting Section 4)

Target: Transformation is applied to the target $t$ (can only be used during the training phase, since labels are not shown to the model at test time)
Universality

Generic: Applicable to all data domains

Domain-specific: Specific (handcrafted) to the problem at hand, for example image rotations
Dependence of the distribution of $\theta$

$p(\theta)$: the distribution of $\theta$ is the same for all samples

$p(\theta \mid t)$: the distribution of $\theta$ can be different for each target (class) $t$

$p(\theta \mid \tilde{t})$: the distribution of $\theta$ depends on a desired (fake) target $\tilde{t}$

$p(\theta \mid x)$: the distribution of $\theta$ can be different for each input vector $x$ (with implicit dependence on $f_w$ and $w$ if the transformation is in hidden-feature space)

$p(\theta \mid \mathcal{D})$: the distribution of $\theta$ depends on the whole training dataset

$p(\theta \mid \text{batch})$: the distribution of $\theta$ depends on a batch of training inputs (for example (parts of) the current mini-batch, or also previous mini-batches)

$p(\theta \mid \text{time})$: the distribution of $\theta$ depends on time (the current training iteration)

$p(\theta \mid \phi)$: the distribution of $\theta$ depends on some trainable parameters $\phi$ subject to loss minimization (i.e. the parameters $\phi$ evolve during training along with the network weights $w$)

Combinations of the above
Phase

Training: Transformation of training samples

Test: Transformation of test samples, for example multiple augmented variants of a sample are classified and the result is aggregated over them
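The “stochastic” variant of adaptive sampling described above can be sketched as follows; the linear model and additive-shift transformation are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

def error(w, x, t):
    # Squared error of a toy linear model
    return float((np.dot(w, x) - t) ** 2)

def shift_transform(x, theta):
    # Simple parametric transformation: additive shift by theta
    return x + theta

def adaptive_stochastic_sample(w, x, t, k=8, sigma=0.5):
    # "Stochastic" adaptive strategy: draw k random samples of theta and
    # keep the transformed sample on which the model error is highest.
    candidates = [shift_transform(x, rng.normal(0.0, sigma, x.shape)) for _ in range(k)]
    errors = [error(w, c, t) for c in candidates]
    return candidates[int(np.argmax(errors))]

w = np.array([1.0, 1.0])
x = np.array([0.5, 0.5])
hard_x = adaptive_stochastic_sample(w, x, t=1.0)
```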
Table 1: Data-based regularization methods using generic transformations.

Method | Transformation space | Stochasticity ($\theta$ sampling) | Phase
Gaussian noise on input (Bishop, 1995a; An, 1996) | Input | Random | Training
Gaussian noise on hidden units (DeVries and Taylor, 2017) | Hidden features | Random | Training
Dropout (Hinton et al., 2012; Srivastava et al., 2014) | Input and hidden features | Random | Training
Random dropout probability (Bouthillier et al., 2015, Sec. 4) | Input and hidden features | Random | Training
Curriculum dropout (Morerio et al., 2017) | Input and hidden features | Random | Training
Bayesian dropout (Maeda, 2014) | Input and hidden features | Random | Training
Standout (adaptive dropout) (Ba and Frey, 2013) | Input and hidden features | Random | Training
“Projection” of dropout noise into input space (Bouthillier et al., 2015, Sec. 3) | Input (uses an auxiliary transformation in hidden-feature space) | Random | Training
Approximation of a Gaussian process by test-time dropout (Gal and Ghahramani, 2016) | Input and hidden features | Random | Test
Stochastic depth (Huang et al., 2016b) | Hidden features | Random | Training
Noisy activation functions (Nair and Hinton, 2010; Xu et al., 2015; Gülçehre et al., 2016a) | Hidden features | Random | Training
Training with adversarial examples (Szegedy et al., 2014) | Input | Adaptive, constrained | Training
Network fooling (adversarial examples) (Szegedy et al., 2014); not for regularization | Input | Adaptive, constrained | Test
Synthetic minority oversampling in hidden-feature space (Wong et al., 2016) | Hidden features | Random | Training
Inter- and extrapolation in hidden-feature space (DeVries and Taylor, 2017) | Hidden features | Random | Training
Batch normalization (Ioffe and Szegedy, 2015), Ghost batch normalization (Hoffer et al., 2017) | Hidden features | Deterministic | Training and test
Layer normalization (Ba et al., 2016) | Hidden features | Deterministic | Training and test
Annealed noise on targets (Wang and Principe, 1999) | Target | Random | Training
Label smoothing (Szegedy et al., 2016, Sec. 7; Goodfellow et al., 2016, Chap. 7) | Target | Deterministic | Training
Model compression (mimic models, distilled models) (Bucilă et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015) | Target | Deterministic | Training
Table 2: Data-based regularization methods using domain-specific transformations.

Method | Transformation space | Stochasticity ($\theta$ sampling) | Phase
Rigid and elastic image transformations (Baird, 1990; Yaeger et al., 1996; Simard et al., 2003; Ciresan et al., 2010) | Input | Random | Training
Test-time image transformations (Simonyan and Zisserman, 2015; Dieleman et al., 2015) | Input | Random | Test
Sound transformations (Salamon and Bello, 2017) | Input | Random | Training
Error-maximizing rigid image transformations (Loosli et al., 2007; Fawzi et al., 2016) | Input | Adaptive (stochastic and constrained, respectively) | Training
Learning class-specific elastic image-deformation fields (Hauberg et al., 2016) | Input | Random | Training
Any handcrafted data preprocessing, e.g. the scale-invariant feature transform (SIFT) for images (Lowe, 1999) | Input | Deterministic | Training and test
Overfeat (Sermanet et al., 2013) | Input | Deterministic | Training and test
A review of existing methods that use generic transformations can be found in Table 1. Dropout in its original form (Hinton et al., 2012; Srivastava et al., 2014) is one of the most popular methods from the generic group, and several variants of Dropout have been proposed that provide additional theoretical motivation and improved empirical results (Standout (Ba and Frey, 2013), Random dropout probability (Bouthillier et al., 2015), Bayesian dropout (Maeda, 2014), Test-time dropout (Gal and Ghahramani, 2016)).
Table 2 contains a list of some domain-specific methods, focused especially on the image domain. Here the most widely used method is rigid and elastic image deformation.
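As an illustration of dropout as a stochastic transformation of activations, here is a minimal sketch of the common “inverted” formulation; the 1/(1−p) rescaling is an implementation convention, not a detail from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(h, p=0.5, training=True):
    # Inverted dropout: zero each activation with probability p during
    # training and rescale the survivors by 1/(1-p), so the expected
    # activation matches the (identity) test-time behaviour.
    if not training:
        return h
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

h = np.ones(1000)
h_train = dropout(h, p=0.5, training=True)   # zeros and rescaled survivors
h_test = dropout(h, p=0.5, training=False)   # identity at test time
```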
Target-preserving data augmentation
In the following, we discuss an important group of methods: target-preserving data augmentation. These methods use stochastic transformations $\tau_\theta$ in input and hidden-feature spaces while preserving the original target $t$. As can be seen in the respective two columns in Tables 1–2, most of the listed methods have exactly these properties. These methods transform the training set $\mathcal{D}$ into a distribution $\tilde{P}$ of augmented samples, which is used for training instead. In other words, the training samples $(x_i, t_i)$ are replaced in the empirical risk loss function (Eq. (3)) by augmented training samples $(\tau_\theta(x_i), t_i)$. By randomly sampling the transformation parameters $\theta$, and thus creating many new samples from each original training sample, data augmentation attempts to bridge the limited-data gap between the expected and the empirical risk, Eqs. (2)–(3). While unlimited sampling from $\tilde{P}$ provides more data than the original dataset $\mathcal{D}$, both of them usually are merely approximations of the ground truth data distribution $P$ or of an ideal training dataset; both $\mathcal{D}$ and $\tilde{P}$ have their own distinct biases, advantages, and disadvantages. For example, elastic image deformations result in images that are not perfectly realistic; this is not necessarily a disadvantage, but it is a bias compared to the ground truth data distribution, and in any case the advantage of having more training data often prevails. In some cases it may even be desired for $\tilde{P}$ to deliberately differ from the ground truth data distribution. For example, in the case of class imbalance (unbalanced abundance or importance of classes), a common regularization strategy is to undersample or oversample the data, sometimes leading to a less realistic but better model. This is how an ideal training dataset may differ from the ground truth data distribution.
If the transformation is additionally representation-preserving, then the distribution $\tilde{P}$ created by the transformation attempts to mimic the ground truth data distribution $P$. Otherwise, the notion of a “ground truth data distribution” in the modified representation may be vague. We provide more details about the transition from $\mathcal{D}$ to $\tilde{P}$ in Appendix B.
Summary of data-based methods
Data-based regularization is a popular and very useful way to improve the results of deep learning. In this section we formalized this group of methods and showed that seemingly unrelated techniques such as target-preserving data augmentation, Dropout, or Batch normalization are methodologically surprisingly close to each other. In Section 8 we discuss future directions that we find promising.
4 Regularization via the network architecture
A network architecture can be selected to have certain properties or match certain assumptions in order to have a regularizing effect.¹

¹The network architecture is represented by a function $f$, which together with the set $W$ of all its possible weight configurations defines the set of mappings $\{f_w : w \in W\}$ that this particular architecture can realize.
Table 3: Regularization via the network architecture, and the assumptions behind the individual methods.

Method | Method class | Assumptions about an appropriate learnable input–output mapping
Any chosen (not overly complex) architecture | * | The mapping can be well approximated by functions from the chosen family which are easily accessible by optimization.
Small network | * | The mapping is simple (the complexity of the mapping depends on the number of network units and layers).
Deep network | * | The mapping is complex, but can be decomposed into a composition (or, generally, a directed acyclic graph) of simple nonlinear transformations, e.g. an affine transformation followed by a simple nonlinearity (fully-connected layer), a “multi-channel convolution” followed by a simple nonlinearity (convolutional layer), etc.
Hard bottleneck (layer with few neurons); soft bottleneck (e.g. Jacobian penalty (Rifai et al., 2011c), see Section 6) | Layer operation | The data concentrate around a lower-dimensional manifold; the data have few factors of variation.
Convolutional networks (Fukushima and Miyake, 1982; Rumelhart et al., 1986, pp. 348–352; LeCun et al., 1989; Simard et al., 2003) | Layer operation | Spatially local and shift-equivariant feature extraction is all we need.
Dilated convolutions (Yu and Koltun, 2015) | Layer operation | Like convolutional networks. Additionally: sparse sampling of wide local neighborhoods provides relevant information and preserves relevant high-resolution information better than architectures with downscaling and upsampling.
Strided convolutions (see Dumoulin and Visin, 2016) | Layer operation | The mapping reacts reliably to features that do not vary too abruptly in space, i.e. which are present in several neighboring pixels and can be detected even if the filter center skips some of the pixels. The output is robust to slight changes of the location of features, and to changes of strength/presence of spatially strongly varying features.
Pooling | Layer operation | The output is invariant to slight spatial distortions of the input (slight changes of the location of (deep) features). Features that are sensitive to such distortions can be discarded.
Stochastic pooling (Zeiler and Fergus, 2013) | Layer operation | The output is robust to slight changes of the location (like pooling) but also of the strength/presence of (deep) features.
Training with different kinds of noise (including Dropout; see Section 3) | Noise | The mapping is robust to noise: the given class of perturbations of the input or of deep features should not affect the output too much.
Dropout (Hinton et al., 2012; Srivastava et al., 2014), DropConnect (Wan et al., 2013), and related methods | Noise | Extracting complementary (non-co-adapted) features is helpful. Non-co-adapted features are more informative and better disentangle the factors of variation. (We want to disentangle the factors of variation because they are entangled in different ways in inputs vs. in outputs.) When interpreted as ensemble learning: the usual assumptions of ensemble learning (predictions of weak learners carry complementary information and can be combined into a strong prediction).
Maxout units (Goodfellow et al., 2013) | Layer operation | Assumptions similar to Dropout, with a more accurate approximation of model averaging (when interpreted as ensemble learning).
Skip-connections (Long et al., 2015; Huang et al., 2016a) | Connections between layers | Certain lower-level features can be directly reused in a meaningful way at (several) higher levels of abstraction.
Linearly augmented feed-forward network (van der Smagt and Hirzinger, 1998) | Connections between layers | Skip-connections that share weights with the non-skip connections. Helps against vanishing gradients. Changes the learning algorithm rather than the network mapping.
Residual learning (He et al., 2016) | Connections between layers | Learning the additive difference of a mapping (or of its compositional parts) from the identity mapping is easier than learning the mapping itself. Meaningful deep features can be composed as a sum of lower-level and intermediate-level features.
Stochastic depth (Huang et al., 2016b), DropIn (Smith et al., 2015) | Connections between layers; noise | Similar to Dropout: extracting complementary (non-co-adapted) features across different levels of abstraction is helpful; implicit model ensemble. Similar to residual learning: meaningful deep features can be composed as a sum of lower-level and intermediate-level features, with the intermediate-level ones being optional, and leaving them out being meaningful data augmentation. Similar to mollifying networks: simplifying random parts of the mapping improves training.
Mollifying networks (Gülçehre et al., 2016b) | Connections between layers; noise | The mapping is easier to approximate by first fitting a simplified, more linear version of it and gradually decreasing the simplification during training.
Network information criterion (Murata et al., 1994); network growing and network pruning (see Bishop, 1995a, Sec. 9.5) | Model selection | Optimal generalization is reached by a network that has the right number of units (not too few, not too many).
Multi-task learning (see Caruana, 1998; Ruder, 2017) | * | Several tasks can help each other learn mutually useful feature extractors, as long as the tasks do not compete for resources (network capacity).
Assumptions about the mapping
An input–output mapping must have certain properties in order to fit the data well. Although it may be intractable to enforce the precise properties of an ideal mapping, it may be possible to approximate them by simplified assumptions about the mapping. These properties and assumptions can then be imposed upon model fitting in a hard or soft manner. This limits the search space of models and allows finding better solutions. An example is the decision about the number of layers and units, which allows the mapping to be neither too simple nor too complex (thus avoiding underfitting and overfitting). Other examples are certain invariances of the mapping, such as the locality and shift-equivariance of feature extraction hardwired in convolutional layers. Overall, the approach of imposing assumptions about the input–output mapping discussed in this section amounts to the selection of the network architecture $f$. The choice of architecture on the one hand hardwires certain properties of the mapping; on the other hand, in an interplay between $f$ and the optimization algorithm (Section 7), certain weight configurations are more likely to be accessible by optimization than others, further limiting the likely search space in a soft way. Complementary ways of imposing certain assumptions about the mapping are regularization terms (Section 6), as well as invariances present in the (augmented) dataset (Section 3).
Assumptions can be hardwired into the definition of the operation performed by certain layers, and/or into the connections between layers. This distinction is made in Table 3, where these and other methods are listed.
In Section 3 about data, we mentioned regularization methods that transform data in the hidden-feature space. They can be considered part of the architecture; in other words, they fit both Section 3 (data) and Section 4 (architecture). These methods are listed in Table 1 with hidden features as their transformation space.
Weight sharing
Reusing a certain trainable parameter in several parts of the network is referred to as weight sharing. This usually makes the model less complex than using separately trainable parameters. An example is convolutional networks (LeCun et al., 1989): here, weight sharing does not merely reduce the number of weights that need to be learned; it also encodes the prior knowledge about the shift-equivariance and locality of feature extraction. Another example is weight sharing in autoencoders.
Activation functions
Choosing the right activation function is quite important; for example, using rectified linear units (ReLUs) has improved the performance of many deep architectures, both in terms of training times and of accuracy (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011). The success of ReLUs can be attributed to the fact that they help avoid the vanishing gradient problem, but also to the fact that they provide more expressive families of mappings (the classical sigmoid nonlinearity can be approximated very well² with only two ReLUs, but it takes an infinite number of sigmoid units to approximate a ReLU), and their affine extrapolation to unknown regions of data space seems to provide better generalization in practice than the “stagnating” extrapolation of sigmoid units. Some activation functions were designed explicitly for regularization. For Dropout, Maxout units (Goodfellow et al., 2013) allow a more precise approximation of the geometric mean of the model-ensemble predictions at test time. Stochastic pooling (Zeiler and Fergus, 2013), on the other hand, is a noisy version of max-pooling. The authors claim that this allows modeling distributions of activations instead of taking just the maximum.

²In the sense of small integrated squared error and small integrated absolute error; a simple example is the piecewise-linear function $\mathrm{ReLU}(x + \tfrac{1}{2}) - \mathrm{ReLU}(x - \tfrac{1}{2})$.
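The two-ReLU approximation of a sigmoid mentioned above can be sketched directly; the particular piecewise-linear form below is our own illustrative choice.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def hard_sigmoid(x):
    # Difference of two shifted ReLUs: 0 for x <= -1/2,
    # x + 1/2 on [-1/2, 1/2], and 1 for x >= 1/2;
    # a piecewise-linear, sigmoid-shaped function.
    return relu(x + 0.5) - relu(x - 0.5)

xs = np.linspace(-3.0, 3.0, 601)
ys = hard_sigmoid(xs)  # monotone, saturating at 0 and 1
```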
Noisy models
Multi-task learning
A special type of regularization is multi-task learning (see Caruana, 1998; Ruder, 2017). It can be combined with semi-supervised learning to utilize unlabeled data on an auxiliary task (Rasmus et al., 2015). A similar concept of sharing knowledge between tasks is also utilized in meta-learning, where multiple tasks from the same domain are learned sequentially, using previously gained knowledge as bias for new tasks (Baxter, 2000), and in transfer learning, where knowledge from one domain is transferred into another domain (Pan and Yang, 2010).

Model selection
The best among several trained models (e.g. with different architectures) can be selected by evaluating the predictions on a validation set. It should be noted that this holds for selecting the best combination of all techniques (Sections 3–7), not just architecture; and that the validation set used for model selection in the “outer loop” should be different from the validation set used e.g. for Early stopping (Section 7), and different from the test set (Cawley and Talbot, 2010). However, there are also model selection methods that specifically target the selection of the number of units in a specific network architecture, e.g. using network growing and network pruning (see Bishop, 1995a, Sec. 9.5), or additionally do not require a validation set, e.g. the Network information criterion to compare models based on the training error and second derivatives of the loss function (Murata et al., 1994).
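As a minimal sketch (with hypothetical toy candidate models), validation-based model selection amounts to picking the candidate with the lowest validation error while keeping the test set untouched.

```python
import numpy as np

def select_model(models, validate):
    # Choose, among trained candidates, the one with the lowest
    # validation error; the test set must stay untouched here.
    val_errors = [validate(m) for m in models]
    return models[int(np.argmin(val_errors))], min(val_errors)

# Hypothetical candidates: functions mapping x -> prediction
models = [lambda x: 0.0 * x, lambda x: 1.0 * x, lambda x: 2.0 * x]
x_val = np.array([1.0, 2.0])
t_val = np.array([1.0, 2.0])

def validate(m):
    # Mean squared error on the held-out validation set
    return float(np.mean((m(x_val) - t_val) ** 2))

best, best_err = select_model(models, validate)  # the identity model wins
```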
5 Regularization via the error function
Ideally, the error function $E$ reflects an appropriate notion of quality and, in some cases, some assumptions about the data distribution. Typical examples are mean squared error and cross-entropy. The error function can also have a regularizing effect; an example is Dice coefficient optimization (Milletari et al., 2016), which is robust to class imbalance. Moreover, the overall form of the loss function can be different from Eq. (3). For example, in certain loss functions that are robust to class imbalance, the sum is taken over pairwise combinations of training samples (Yan et al., 2003) rather than over individual training samples. But such alternatives to Eq. (3) are rather rare, and similar principles apply. If additional tasks are added for a regularizing effect (multi-task learning (see Caruana, 1998; Ruder, 2017)), then the targets $t$ are modified to consist of several tasks, the mapping $f_w$ is modified to produce an according output, and $E$ is modified to account for the modified targets and outputs. Besides, there are regularization terms that depend on $E$. Since they depend on the targets $t$, in our definition they are considered part of $E$ rather than of $R$, but they are listed in Section 6 among the regularization terms (rather than here) for a better overview.
6 Regularization via the regularization term
Regularization can be achieved by adding a regularizer $R$ to the loss function. Unlike the error function $E$ (which expresses consistency of outputs with targets), the regularization term is independent of the targets $t$. Instead, it is used to encode other properties of the desired model, providing inductive bias (i.e. assumptions about the mapping other than consistency of outputs with targets). The value of $R$ can thus be computed for an unlabeled test sample, whereas the value of $E$ cannot.
The independence of $R$ from the targets $t$ has an important implication: it allows additionally using unlabeled samples (semi-supervised learning) to improve the learned model based on its compliance with some desired properties (Sajjadi et al., 2016). For example, semi-supervised learning with ladder networks (Rasmus et al., 2015) combines a supervised task with an unsupervised auxiliary denoising task in a “multi-task” learning fashion. (For alternative interpretations, see Appendix A.) Unlabeled samples are extremely useful when labeled samples are scarce. A Bayesian perspective on the combination of labeled and unlabeled data in a semi-supervised manner is offered by Lasserre et al. (2006).
A classical regularizer is weight decay (see Plaut et al., 1986; Lang and Hinton, 1990; Goodfellow et al., 2016, Chap. 7):

$R(w) = \frac{\lambda}{2} \|w\|_2^2$  (5)

where $\lambda$ is a weighting term controlling the importance of the regularization over the consistency. From the Bayesian perspective, weight decay corresponds to using a symmetric multivariate normal distribution as a prior for the weights: $p(w) = \mathcal{N}(w \mid 0, \sigma^2 I)$ (Nowlan and Hinton, 1992). Indeed, $-\log p(w) = \frac{1}{2\sigma^2} \|w\|_2^2 + \mathrm{const}$, which has the form of Eq. (5) with $\lambda = 1/\sigma^2$. Weight decay has gained great popularity and is used successfully; Krizhevsky et al. (2012) even observe a reduction of the error on the training set.

Another common prior assumption that can be expressed via the regularization term is “smoothness” of the learned mapping (see Bengio et al., 2013, Section 3.2): if $x_1 \approx x_2$, then $f_w(x_1) \approx f_w(x_2)$. It can be expressed by the following loss term:
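A minimal sketch of weight decay and its gradient, with illustrative values of $\lambda$ and the learning rate.

```python
import numpy as np

def weight_decay(w, lam=1e-2):
    # R(w) = (lambda / 2) * ||w||_2^2, the classical L2 penalty
    return 0.5 * lam * float(np.sum(w ** 2))

def weight_decay_grad(w, lam=1e-2):
    # Gradient lambda * w: each gradient step shrinks the weights
    # toward zero, hence the name "weight decay".
    return lam * w

w = np.array([3.0, -4.0])
penalty = weight_decay(w)               # 0.5 * 0.01 * 25 = 0.125
w_new = w - 0.1 * weight_decay_grad(w)  # weights move toward zero
```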
$R(w, x) = \lambda \left\| \frac{\partial f_w(x)}{\partial x} \right\|_F^2$  (6)

where $\|\cdot\|_F$ denotes the Frobenius norm and $\partial f_w(x) / \partial x$ is the Jacobian of the neural network input-to-output mapping for some fixed network weights $w$. This term penalizes mappings with large derivatives and is used in contractive autoencoders (Rifai et al., 2011c).

The domain of loss regularizers is very heterogeneous. We propose a natural way to categorize them: by their dependencies. We saw in Eq. (5) that weight decay depends only on the weights $w$, whereas the Jacobian penalty in Eq. (6) depends on $w$, $x$, and $f$. More precisely, the Jacobian penalty uses the derivative of the output $f_w(x)$ w.r.t. the input $x$. (We use vector-by-vector derivative notation from matrix calculus, i.e. $\partial f_w(x) / \partial x$ is the Jacobian of $f_w$ with fixed weights $w$.) We identify the following dependencies of $R$:
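A minimal numerical sketch of the Jacobian penalty of Eq. (6) on a toy mapping; the finite-difference estimate is for illustration only, as a real implementation would use automatic differentiation.

```python
import numpy as np

def f(w, x):
    # Toy smooth mapping standing in for the network f_w
    return np.tanh(w @ x)

def jacobian_penalty(w, x, eps=1e-6):
    # Squared Frobenius norm of the Jacobian df/dx (cf. Eq. (6)),
    # estimated by central finite differences.
    y = f(w, x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        e = np.zeros(x.size)
        e[j] = eps
        J[:, j] = (f(w, x + e) - f(w, x - e)) / (2.0 * eps)
    return float(np.sum(J ** 2))

w = np.eye(2)   # with these weights, the Jacobian at x = 0 is the identity
x = np.zeros(2)
penalty = jacobian_penalty(w, x)  # ||I||_F^2 = 2
```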

Dependence on the weights $w$

Dependence on the network output $f_w(x)$

Dependence on the derivative of the output w.r.t. the weights, $\partial f_w(x) / \partial w$

Dependence on the derivative of the output w.r.t. the input, $\partial f_w(x) / \partial x$

Dependence on the derivative of the error term w.r.t. the input, $\partial E / \partial x$ ($E$ depends on $t$, and according to our definition such methods belong to Section 5, but they are listed here for overview)
Table 4: Regularization terms and their (approximate) equivalences to methods from other taxonomy branches.

Method | Description | Equivalence
Weight decay (see Plaut et al., 1986; Lang and Hinton, 1990; Goodfellow et al., 2016, Chap. 7) | $L_2$ norm on network weights (not biases). Favors smaller weights, and thus for usual architectures tends to make the mapping less “extreme” and more robust to noise in the input. | –
Weight smoothing (Lang and Hinton, 1990) | Penalizes the norm of the gradients of learned filters, making them smooth. Not beneficial in practice. | –
Weight elimination (Weigend et al., 1991) | Similar to weight decay, but favors few stronger connections over many weak ones. | Goal similar to narrow and broad Gaussians
Soft weight-sharing (Nowlan and Hinton, 1992) | Mixture-of-Gaussians prior on weights; a generalization of weight decay. The weights are pushed to form a predefined number of groups with similar values. | –
Narrow and broad Gaussians (Nowlan and Hinton, 1992; Blundell et al., 2015) | The weights come from two Gaussians, a narrow and a broad one; a special case of soft weight-sharing. | Goal similar to weight elimination
Fast dropout approximation (Wang and Manning, 2013) | Approximates the loss that dropout minimizes; a weighted weight penalty. Only for shallow networks. | Dropout
Mutual exclusivity (Sajjadi et al., 2016) | Unlabeled samples push the decision boundaries to low-density regions of input space, promoting sharp (confident) predictions. | –
Segmentation with binary potentials (BenTaieb and Hamarneh, 2016) | Penalty on anatomically implausible image segmentations. | –
Flat minima search (Hochreiter and Schmidhuber, 1995) | Penalty for sharp minima, i.e. for weight configurations where a small weight perturbation leads to a high error increase. Flat minima have low minimum description length (i.e. exhibit an ideal balance between training error and model complexity) and thus should generalize better (Rissanen, 1986). | –
Tangent prop (Simard et al., 1992) | Penalty on the directional derivative of the mapping in predefined tangent directions that correspond to known input-space transformations. | Simple data augmentation
Jacobian penalty (Rifai et al., 2011c) | Penalty on the Jacobian of (parts of) the network mapping; a smoothness prior. | Noise injection on inputs (not exact; see An, 1996)
Manifold tangent classifier (Rifai et al., 2011a) | Like tangent prop, but the input “tangent” directions are extracted from a manifold learned by a stack of contractive autoencoders, followed by an SVD of the Jacobian at each input sample. | –
Hessian penalty (Rifai et al., 2011b) | A fast way to approximate a penalty on the Hessian of the mapping by penalizing the Jacobian with noisy input. | –
Tikhonov regularizers (Bishop, 1995b) | Penalty on (up to) the $k$-th derivative of the learned mapping w.r.t. the input. | For the penalty on the first derivative: noise injection on inputs (not exact; see An, 1996)
Loss-invariant backpropagation (Demyanov et al., 2015, Sec. 3.1; Lyu et al., 2015) | Norm of the gradient of the loss w.r.t. the input. Changes the mapping such that the loss becomes largely invariant to changes of the input. | Adversarial training
Prediction-invariant backpropagation (Demyanov et al., 2015, Sec. 3.2) | Norm of the directional derivative of the mapping w.r.t. the input, in the direction causing the largest increase in loss. | Adversarial training
A review of existing methods can be found in Table 4. Weight decay still seems to be the most popular of the regularization terms. Some of the methods are equivalent or nearly equivalent to methods from different taxonomy branches. For example, Tangent prop simulates minimal data augmentation (Simard et al., 1992); injection of small-variance Gaussian noise (Bishop, 1995b; An, 1996) is an approximation of the Jacobian penalty (Rifai et al., 2011c); and Fast dropout (Wang and Manning, 2013) is (in shallow networks) a deterministic approximation of Dropout. This is indicated in the Equivalence entries of Table 4.

7 Regularization via optimization
The last class of regularization methods in our taxonomy is regularization through optimization. Stochastic gradient descent (SGD) (see Bottou, 1998), along with its variants, is the most frequently used optimization algorithm in the context of deep neural networks and is the center of our attention. We also list some alternative methods below.

Stochastic gradient descent is an iterative optimization algorithm with the following update rule:
θ_{t+1} = θ_t − η g_t,   (7)

where g_t is the gradient of the loss evaluated on a mini-batch sampled from the training set and η is the learning rate. SGD is frequently used in combination with momentum and other tweaks improving the convergence speed (see Wilson et al., 2017). Moreover, the noise induced by the varying mini-batches helps the algorithm escape saddle points (Ge et al., 2015); this can be further reinforced by adding supplementary gradient noise (Neelakantan et al., 2015; Chaudhari and Soatto, 2015).
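As a toy illustration of update rule (7) and of supplementary gradient noise (Neelakantan et al., 2015), the following sketch (our hypothetical setup; the one-parameter least-squares objective, learning rate, and noise scale are assumptions, not values from any referenced work) runs mini-batch SGD with optional Gaussian noise added to the gradient:

```python
import random
random.seed(1)

def grad_minibatch(theta, batch):
    # Gradient of the mean squared loss L = mean((theta - d)^2) over a mini-batch.
    return sum(2 * (theta - d) for d in batch) / len(batch)

def sgd(data, lr=0.1, epochs=50, batch_size=4, noise_std=0.0):
    theta = 0.0
    for _ in range(epochs):
        random.shuffle(data)                      # varying mini-batches
        for i in range(0, len(data), batch_size):
            g = grad_minibatch(theta, data[i:i + batch_size])
            # Optional supplementary gradient noise (Neelakantan et al., 2015).
            g += random.gauss(0.0, noise_std)
            theta -= lr * g                       # update rule (7)
    return theta

# Fit a single scalar to noisy observations centered around 3.0.
data = [random.gauss(3.0, 0.5) for _ in range(64)]
theta_plain = sgd(list(data))                     # plain SGD
theta_noisy = sgd(list(data), noise_std=0.1)      # SGD with gradient noise
```

With the noise switched off the rule reduces exactly to Eq. (7); both variants converge to a neighborhood of the empirical minimizer (the sample mean).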
If the algorithm reaches a low training error in a reasonable time (linear in the size of the training set, allowing multiple passes through it), the solution generalizes well under certain mild assumptions; in that sense, SGD acts as an implicit regularizer: a short training time prevents overfitting even without any additional regularizer (Hardt et al., 2016). This is in line with Zhang et al. (2017), who find in a series of experiments that regularization (such as Dropout, data augmentation, and weight decay) is by itself neither necessary nor sufficient for good generalization.
We divide the methods into three groups, discussed in the following: initialization and warm-start methods, update methods, and termination methods.
Initialization and warm-start methods
These methods affect the initial selection of the model weights. Currently the most frequently used approach is sampling the initial weights from a carefully tuned distribution. There are multiple strategies, based on the architecture choice, that aim at keeping the variance of the activations in all layers around 1, thus preventing vanishing or exploding activations (and gradients) in deeper layers (Glorot and Bengio, 2010, Sec. 4.2; He et al., 2015).
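A minimal sketch of such variance-preserving initialization (our illustration in the spirit of Glorot and Bengio, 2010; the layer width, depth, and the 1/fan_in variance are assumptions for linear layers, whereas He et al., 2015 use 2/fan_in to compensate for ReLU halving the variance):

```python
import math
import random
random.seed(2)

def layer(x, fan_in, fan_out):
    # Draw weights with variance 1/fan_in so that, for inputs with unit
    # variance, the pre-activations also have variance close to 1.
    std = math.sqrt(1.0 / fan_in)
    W = [[random.gauss(0, std) for _ in range(fan_in)] for _ in range(fan_out)]
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def variance(v):
    m = sum(v) / len(v)
    return sum((x - m) ** 2 for x in v) / len(v)

# Propagate a unit-variance input through 10 linear layers of width 256:
# the activation variance stays around 1 instead of vanishing or exploding.
x = [random.gauss(0, 1) for _ in range(256)]
for _ in range(10):
    x = layer(x, 256, 256)
act_var = variance(x)
```

Initializing instead with, say, std = 0.01 or std = 1 would shrink or blow up the variance exponentially with depth, which is exactly what these schemes prevent.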
Another (complementary) option is pretraining on different data, with a different objective, or with a partially different architecture. This can prime the learning algorithm towards a good solution before the fine-tuning on the actual objective starts. Pretraining the model on a different task in the same domain may lead to learning useful features, making the primary task easier. However, pretrained models are also often misused as a lazy approach to problems where training from scratch, or thorough domain adaptation, transfer learning, or multi-task learning methods, would be worth trying. On the other hand, pretraining and similar techniques may be a useful part of such methods.
Finally, with some methods, such as Curriculum learning (Bengio et al., 2009), the transition between pretraining and fine-tuning is smooth. We refer to these as warm-start methods.
Update methods
This class of methods affects individual weight updates. There are two complementary subgroups: Update rules modify the form of the update formula; Weight and gradient filters are methods that affect the value of the gradient or weights, which are used in the update formula, e.g. by injecting noise into the gradient (Neelakantan et al., 2015).
Again, it is not entirely clear which of the methods only speed up the optimization and which actually help the generalization. Wilson et al. (2017) show that some of the methods, such as AdaGrad or Adam, even lose the regularization abilities of SGD.
Termination methods
There are numerous possible stopping criteria, and selecting the right moment to stop the optimization procedure may improve generalization by reducing the error caused by the discrepancy between the minimizers of expected and empirical risk: the network first learns general concepts that hold for all samples from the ground-truth distribution before fitting the specific samples and their noise (Krueger et al., 2017).

The most successful and popular termination methods put a portion of the labeled data aside as a validation set and use it to evaluate performance (validation error). The most prominent example is Early stopping (see Prechelt, 1998). In scenarios where training data are scarce, it is possible to resort to termination methods that do not use a validation set. The simplest case is fixing the number of passes through the training set.
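A minimal sketch of validation-based early stopping with a patience criterion (our illustration; the patience value and the synthetic validation curve are assumptions, not taken from Prechelt, 1998):

```python
def early_stopping(val_errors, patience=3):
    """Return the epoch selected by early stopping: stop once the validation
    error has not improved for `patience` consecutive epochs, and report the
    best epoch (and its error) seen so far."""
    best_epoch, best_err, bad_epochs = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            # New best model: remember it and reset the patience counter.
            best_epoch, best_err, bad_epochs = epoch, err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # no improvement for `patience` epochs -> terminate
    return best_epoch, best_err

# Typical U-shaped validation curve: improvement, then overfitting.
curve = [1.0, 0.6, 0.4, 0.35, 0.33, 0.36, 0.40, 0.47, 0.55]
epoch, err = early_stopping(curve)
```

In practice the weights of the best epoch are stored and restored upon termination; here only the epoch index is tracked for brevity.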
8 Recommendations, discussion, conclusions
We see the main benefits of our taxonomy to be twofold: Firstly, it provides an overview of the existing techniques to the users of regularization methods and gives them a better idea of how to choose the ideal combination of regularization techniques for their problem. Secondly, it is useful for development of new methods, as it gives a comprehensive overview of the main principles that can be exploited to regularize the models. We summarize our recommendations in the following paragraphs:
Recommendations for users of existing regularization methods
Overall, we advise exploiting the information contained in the data as well as prior knowledge as much as possible, while starting with popular methods; the following procedure can be helpful:

Common recommendations for the first steps:

Deep learning is about disentangling the factors of variation. An appropriate data representation should be chosen; known meaningful data transformations should not be outsourced to the learning algorithm. Redundantly providing the same information in several representations is acceptable.

Output nonlinearity and error function should reflect the learning goals.

Good starting points are techniques that usually work well (e.g. ReLU, successful architectures). Hyperparameters (and architecture) can be tuned jointly, but "lazily" (interpolating/extrapolating from experience instead of trying too many combinations).

Often it is helpful to start with a simplified dataset (e.g. fewer and/or easier samples) and a simple network and, after obtaining promising results, to gradually increase the complexity of both data and network while tuning hyperparameters and trying regularization methods.


Regularization via data:

When not working with nearly infinite/abundant data:

Gathering more real data (and using methods that take its properties into account) is advisable if possible:

Labeled samples are best, but unlabeled ones can also be helpful (compatible with semi-supervised learning).

Samples from the same domain are best, but samples from similar domains can also be helpful (compatible with domain adaptation and transfer learning).

Reliable highquality samples are best, but lowerquality ones can also be helpful (their confidence/importance can be adjusted accordingly).

Labels for an additional task can be helpful (compatible with multi-task learning).

Additional input features (from additional information sources) and/or data preprocessing (i.e. domain-specific data transformations) can be helpful (the network architecture needs to be adjusted accordingly).


Data augmentation (e.g. target-preserving hand-crafted domain-specific transformations) can compensate well for limited data. If natural ways to augment data (to mimic natural transformations sufficiently well) are known, they can be tried (and combined).

If natural ways to augment data are unknown or turn out to be insufficient, it may be possible to infer the transformation from data (e.g. learning image-deformation fields) if a sufficient amount of data is available for that.


Popular generic methods (e.g. advanced variants of Dropout) often also help.


Architecture and regularization terms:

Knowledge about possible meaningful properties of the mapping can be used to e.g. hardwire invariances (to certain transformations) into the architecture, or be formulated as regularization terms.


Optimization:

Initialization: Even though pretrained ready-made models greatly speed up prototyping, training from a good random initialization should also be considered.

Optimizers: Trying a few different ones, including advanced ones (e.g. Nesterov momentum, Adam, ProxProp), may lead to improved results. Correctly chosen parameters, such as learning rate, usually make a big difference.

Recommendations for developers of novel regularization methods
Getting an overview and understanding the reasons for the success of the best methods is a great foundation. Promising empty niches (certain combinations of taxonomy properties) exist that can be addressed. The assumptions to be imposed upon the model can have a strong impact on most elements of the taxonomy. Data augmentation is more expressive than loss terms: loss terms enforce properties only in an infinitesimally small neighborhood of the training samples, whereas data augmentation can use rich transformation-parameter distributions. Data and loss terms impose assumptions and invariances in a rather soft manner, and their influence can be tuned, whereas hardwiring the network architecture is a harsher way to impose assumptions. Different assumptions and options to impose them have different advantages and disadvantages.
Future directions for databased methods
There are several promising directions that in our opinion require more investigation: Adaptive sampling of data transformations might lead to lower errors and shorter training times (Fawzi et al., 2016) (in turn, shorter training times may additionally act as implicit regularization (Hardt et al., 2016); see also Section 7). Secondly, learning class-dependent transformations might lead to more plausible samples. Furthermore, the field of adversarial examples (and network robustness to them) is gaining increased attention after the recently sparked discussion on real-world adversarial examples and their robustness/invariance to transformations such as a change of camera position (Lu et al., 2017; Athalye and Sutskever, 2017). Countering strong adversarial examples may require better regularization techniques.
Summary
In this work we proposed a broad definition of regularization for deep learning, identified five main elements of neural network training (data, architecture, error term, regularization term, optimization procedure), described regularization via each of them, including a further, finer taxonomy for each, and presented example methods from these subcategories. Instead of attempting to explain referenced works in detail, we merely pinpointed their properties relevant to our categorization. Our work demonstrates some links between existing methods. Moreover, our systematic approach enables the discovery of new, improved regularization methods by combining the best properties of the existing ones.
Acknowledgements
We thank Antonij Golkov for valuable discussions. Grant support: ERC Consolidator Grant “3DReloaded”.
References
 Amari et al. (1997) Amari, S., Murata, N., Müller, K.-R., Finke, M., and Yang, H. H. (1997). Asymptotic statistical theory of overtraining and cross-validation. IEEE Transactions on Neural Networks, 8(5):985–996.
 An (1996) An, G. (1996). The effects of adding noise during backpropagation training on a generalization performance. Neural Computation, 8(3):643–674.
 Athalye and Sutskever (2017) Athalye, A. and Sutskever, I. (2017). Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397.
 Ba et al. (2016) Ba, J. L., Kiros, J. R., and Hinton, G. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
 Ba and Caruana (2014) Ba, L. J. and Caruana, R. (2014). Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (NIPS).
 Ba and Frey (2013) Ba, L. J. and Frey, B. (2013). Adaptive dropout for training deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 3084–3092.

 Baird (1990) Baird, H. S. (1990). Document image defect models. In Proceedings of the IAPR Workshop on Syntactic and Structural Pattern Recognition (SSPR), pages 38–46.
 Baxter (2000) Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12:149–198.
 Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8).
 Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NIPS), pages 153–160.
 Bengio et al. (2009) Bengio, Y., Louradour, J., Collobert, R., and Weston, J. (2009). Curriculum learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 41–48. ACM.
 BenTaieb and Hamarneh (2016) BenTaieb, A. and Hamarneh, G. (2016). Topology aware fully convolutional networks for histology gland segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 460–468. Springer International Publishing.
 Bishop (1995a) Bishop, C. M. (1995a). Neural Networks for Pattern Recognition. Oxford University Press.
 Bishop (1995b) Bishop, C. M. (1995b). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116.
 Blundell et al. (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural networks. In Proceedings of the International Conference on Machine Learning (ICML), pages 1613–1622.
 Bottou (1998) Bottou, L. (1998). Online algorithms and stochastic approximations. In Saad, D., editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK.
 Bouthillier et al. (2015) Bouthillier, X., Konda, K., Vincent, P., and Memisevic, R. (2015). Dropout as data augmentation. arXiv preprint arXiv:1506.08700.
 Bucilă et al. (2006) Bucilă, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model compression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 535–541. ACM.
 Caruana (1998) Caruana, R. (1998). Multitask learning. In Learning to Learn, pages 95–133. Springer.
 Cawley and Talbot (2010) Cawley, G. C. and Talbot, N. L. (2010). On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107.
 Chaudhari and Soatto (2015) Chaudhari, P. and Soatto, S. (2015). The effect of gradient noise on the energy landscape of deep networks. arXiv preprint arXiv:1511.06485.
 Ciresan et al. (2010) Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets excel on handwritten digit recognition. Neural Computation, 22(12):1–14.
 Demyanov et al. (2015) Demyanov, S., Bailey, J., Kotagiri, R., and Leckie, C. (2015). Invariant backpropagation: how to train a transformation-invariant neural network. arXiv preprint arXiv:1502.04434.
 DeVries and Taylor (2017) DeVries, T. and Taylor, G. W. (2017). Dataset augmentation in feature space. In Proceedings of the International Conference on Machine Learning (ICML), Workshop Track.
 Dieleman et al. (2015) Dieleman, S., Van den Oord, A., Korshunova, I., Burms, J., Degrave, J., Pigou, L., and Buteneers, P. (2015). Classifying plankton with deep neural networks. Technical report, Reservoir Lab, Ghent University, Belgium. http://benanne.github.io/2015/03/17/plankton.html.
 Dumoulin and Visin (2016) Dumoulin, V. and Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.
 Erhan et al. (2010) Erhan, D., Bengio, Y., Courville, A., Manzagol, P.A., Vincent, P., and Bengio, S. (2010). Why does unsupervised pretraining help deep learning? Journal of Machine Learning Research, 11:625–660.
 Fawzi et al. (2016) Fawzi, A., Horst, S., Turaga, D., and Frossard, P. (2016). Adaptive data augmentation for image classification. In Proceedings of the IEEE International Conference on Image Processing (ICIP), pages 3688–3692.
 Frerix et al. (2017) Frerix, T., Möllenhoff, T., Moeller, M., and Cremers, D. (2017). Proximal backpropagation. arXiv preprint arXiv:1706.04638.
 Fukushima and Miyake (1982) Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer.
 Gal and Ghahramani (2016) Gal, Y. and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning (ICML), volume 48, pages 1050–1059.
 Ge et al. (2015) Ge, R., Huang, F., Jin, C., and Yuan, Y. (2015). Escaping from saddle points—online stochastic gradient for tensor decomposition. In Proceedings of the Conference on Learning Theory (COLT), pages 797–842.
 Girosi et al. (1995) Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269.
 Glorot and Bengio (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256.
 Glorot et al. (2011) Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 315–323.
 Goodfellow et al. (2013) Goodfellow, I., WardeFarley, D., Mirza, M., Courville, A., and Bengio, Y. (2013). Maxout networks. In Proceedings of the International Conference on Machine Learning (ICML), volume 28, pages 1319–1327.
 Goodfellow et al. (2016) Goodfellow, I. J., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press.
 Gülçehre and Bengio (2016) Gülçehre, Ç. and Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1–32.
 Gülçehre et al. (2016a) Gülçehre, Ç., Moczulski, M., Denil, M., and Bengio, Y. (2016a). Noisy activation functions. In Proceedings of the International Conference on Machine Learning (ICML), pages 3059–3068.
 Gülçehre et al. (2016b) Gülçehre, Ç., Moczulski, M., Visin, F., and Bengio, Y. (2016b). Mollifying networks. arXiv preprint arXiv:1608.04980.
 Hardt et al. (2016) Hardt, M., Recht, B., and Singer, Y. (2016). Train faster, generalize better: stability of stochastic gradient descent. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of the International Conference on Machine Learning (ICML), volume 48, pages 1225–1234.
 Hauberg et al. (2016) Hauberg, S., Freifeld, O., Larsen, A. B. L., Fisher III, J. W., and Hansen, L. K. (2016). Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 342–350.
 He et al. (2015) He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1026–1034.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
 Hendrycks and Gimpel (2016) Hendrycks, D. and Gimpel, K. (2016). Generalizing and improving weight initialization. arXiv preprint arXiv:1607.02488.
 Hinton et al. (2012) Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
 Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.
 Hochreiter and Schmidhuber (1995) Hochreiter, S. and Schmidhuber, J. (1995). Simplifying neural nets by discovering flat minima. In Advances in Neural Information Processing Systems (NIPS), pages 529–536.
 Hoffer et al. (2016) Hoffer, E., Hubara, I., and Ailon, N. (2016). Deep unsupervised learning through spatial contrasting. arXiv preprint arXiv:1610.00243.
 Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. arXiv preprint arXiv:1705.08741.
 Huang et al. (2016a) Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. (2016a). Densely connected convolutional networks. arXiv preprint arXiv:1608.06993.
 Huang et al. (2016b) Huang, G., Sun, Y., Liu, Z., Sedra, D., and Weinberger, K. Q. (2016b). Deep networks with stochastic depth. In Proceedings of the European Conference on Computer Vision (ECCV), pages 646–661. Springer.
 Ioffe and Szegedy (2015) Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML), pages 448–456.
 Jarrett et al. (2009) Jarrett, K., Kavukcuoglu, K., LeCun, Y., et al. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the International Conference on Computer Vision (ICCV), pages 2146–2153. IEEE.
 Johnson et al. (2016) Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 694–711. Springer.
 Krähenbühl et al. (2015) Krähenbühl, P., Doersch, C., Donahue, J., and Darrell, T. (2015). Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856.
 Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.
 Krueger et al. (2017) Krueger, D., Ballas, N., Jastrzebski, S., Arpit, D., Kanwal, M. S., Maharaj, T., Bengio, E., Fischer, A., and Courville, A. (2017). Deep nets don’t learn via memorization. In Proceedings of the International Conference on Learning Representations (ICLR), Workshop Track.
 Lang and Hinton (1990) Lang, K. J. and Hinton, G. E. (1990). Dimensionality reduction and prior knowledge in E-set recognition. In Advances in Neural Information Processing Systems (NIPS), pages 178–185.
 Lasserre et al. (2006) Lasserre, J. A., Bishop, C. M., and Minka, T. P. (2006). Principled hybrids of generative and discriminative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, pages 87–94.
 Le et al. (2011) Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. (2011). On optimization methods for deep learning. In Proceedings of the International Conference on Machine Learning (ICML), pages 265–272.
 LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551.
 Liu and Nocedal (1989) Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1):503–528.
 Liu et al. (2008) Liu, Y., Starzyk, J. A., and Zhu, Z. (2008). Optimized approximation algorithm in neural networks without overfitting. IEEE Transactions on Neural Networks, 19(6):983–995.
 Long et al. (2015) Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440.
 Loosli et al. (2007) Loosli, G., Canu, S., and Bottou, L. (2007). Training invariant support vector machines using selective sampling. In Bottou, L., Chapelle, O., DeCoste, D., and Weston, J., editors, Large-Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA.
 Loshchilov and Hutter (2015) Loshchilov, I. and Hutter, F. (2015). Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343.
 Lowe (1999) Lowe, D. G. (1999). Object recognition from local scale-invariant features. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157.
 Lu et al. (2017) Lu, J., Sibai, H., Fabry, E., and Forsyth, D. (2017). No need to worry about adversarial examples in object detection in autonomous vehicles. arXiv preprint arXiv:1707.03501.
 Lyu et al. (2015) Lyu, C., Huang, K., and Liang, H.-N. (2015). A unified gradient regularization family for adversarial examples. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 301–309. IEEE.
 Maeda (2014) Maeda, S. (2014). A Bayesian encourages dropout. arXiv preprint arXiv:1412.7003.
 Martens (2010) Martens, J. (2010). Deep learning via Hessianfree optimization. In Proceedings of the International Conference on Machine Learning (ICML), pages 735–742.
 Milletari et al. (2016) Milletari, F., Navab, N., and Ahmadi, S. A. (2016). V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the International Conference on 3D Vision (3DV), pages 565–571. IEEE.
 Morerio et al. (2017) Morerio, P., Cavazza, J., Volpi, R., Vidal, R., and Murino, V. (2017). Curriculum dropout. arXiv preprint arXiv:1703.06229.
 Morgan and Bourlard (1990) Morgan, N. and Bourlard, H. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems (NIPS), pages 630–637.
 Murata et al. (1994) Murata, N., Yoshizawa, S., and Amari, S. (1994). Network information criterion—determining the number of hidden units for an artificial neural network model. IEEE Transactions on Neural Networks, 5(6):865–872.
 Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 807–814.
 Neelakantan et al. (2015) Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807.
 Nowlan and Hinton (1992) Nowlan, S. J. and Hinton, G. E. (1992). Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493.
 Pan and Yang (2010) Pan, S. J. and Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.
 Plaut et al. (1986) Plaut, D. C., Nowlan, S. J., and Hinton, G. E. (1986). Experiments on learning by back propagation. Technical report, CarnegieMellon Univ., Pittsburgh, Pa. Dept. of Computer Science.
 Prechelt (1998) Prechelt, L. (1998). Automatic early stopping using cross validation: quantifying the criteria. Neural Networks, 11(4):761–767.
 Rasmus et al. (2015) Rasmus, A., Berglund, M., Honkala, M., Valpola, H., and Raiko, T. (2015). Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems (NIPS), pages 3546–3554.
 Rifai et al. (2011a) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. (2011a). The manifold tangent classifier. In Advances in Neural Information Processing Systems (NIPS), pages 2294–2302.
 Rifai et al. (2011b) Rifai, S., Glorot, X., Bengio, Y., and Vincent, P. (2011b). Adding noise to the input of a model trained with a regularized objective. arXiv preprint arXiv:1104.3250.
 Rifai et al. (2011c) Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. (2011c). Contractive autoencoders: Explicit invariance during feature extraction. In Proceedings of the International Conference on Machine Learning (ICML), pages 833–840.
 Rissanen (1986) Rissanen, J. (1986). Stochastic complexity and modeling. The Annals of Statistics, 14:1080–1100.
 Ruder (2017) Ruder, S. (2017). An overview of multitask learning in deep neural networks. arXiv preprint arXiv:1706.05098.
 Rumelhart et al. (1986) Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructures of cognition. Volume 1: Foundations. MIT Press.
 Sajjadi et al. (2016) Sajjadi, M., Javanmardi, M., and Tasdizen, T. (2016). Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS), pages 1163–1171.
 Salamon and Bello (2017) Salamon, J. and Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3):279–283.
 Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
 Sermanet et al. (2013) Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
 Simard et al. (1992) Simard, P., Le Cun, Y., Denker, J., and Victorri, B. (1992). An efficient algorithm for learning invariance in adaptive classifiers. In Proceedings of the International Conference on Pattern Recognition (ICPR), pages 651–655. IEEE.
 Simard et al. (2003) Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional neural networks. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), volume 3, pages 958–962.
 Simonyan and Zisserman (2015) Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR).
 Smith et al. (2015) Smith, L. N., Hand, E. M., and Doster, T. (2015). Gradual DropIn of layers to train very deep neural networks. arXiv preprint arXiv:1511.06951.
 Sohl-Dickstein et al. (2014) Sohl-Dickstein, J., Poole, B., and Ganguli, S. (2014). Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the International Conference on Machine Learning (ICML), pages 604–612.
 Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
 Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826.
 Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
 Ulyanov et al. (2016) Ulyanov, D., Lebedev, V., Vedaldi, A., and Lempitsky, V. S. (2016). Texture networks: Feed-forward synthesis of textures and stylized images. In Proceedings of the International Conference on Machine Learning (ICML), pages 1349–1357.
 van der Smagt and Hirzinger (1998) van der Smagt, P. and Hirzinger, G. (1998). Solving the ill-conditioning in neural network learning. In Neural Networks: Tricks of the Trade, pages 193–206. Springer.
 Wan et al. (2013) Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proceedings of the International Conference on Machine Learning (ICML), pages 1058–1066.
 Wang and Principe (1999) Wang, C. and Principe, J. C. (1999). Training neural networks with additive noise in the desired signal. IEEE Transactions on Neural Networks, 10(6):1511–1517.
 Wang and Manning (2013) Wang, S. and Manning, C. (2013). Fast dropout training. In Proceedings of the International Conference on Machine Learning (ICML), pages 118–126.
 Weigend et al. (1991) Weigend, A. S., Rumelhart, D. E., and Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In Advances in Neural Information Processing Systems (NIPS), pages 875–882.
 Wilson et al. (2017) Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292.
 Wong et al. (2016) Wong, S. C., Gatt, A., Stamatescu, V., and McDonnell, M. D. (2016). Understanding data augmentation for classification: When to warp? In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA).
 Xu et al. (2015) Xu, B., Wang, N., Chen, T., and Li, M. (2015). Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
 Yaegger et al. (1996) Yaegger, L., Lyon, R., and Webb, B. (1996). Effective training of a neural network character classifier for word recognition. In Advances in Neural Information Processing Systems (NIPS), volume 9, pages 807–813.
 Yan et al. (2003) Yan, L., Dodier, R. H., Mozer, M., and Wolniewicz, R. H. (2003). Optimizing classifier performance via an approximation to the Wilcoxon-Mann-Whitney statistic. In Proceedings of the International Conference on Machine Learning (ICML), pages 848–855.
 Yu and Koltun (2015) Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
 Zeiler and Fergus (2013) Zeiler, M. and Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. In Proceedings of the International Conference on Learning Representations (ICLR).
 Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations (ICLR).
Appendix A Ambiguities in the taxonomy
Although our proposed taxonomy seems intuitive, there are some ambiguities: certain methods have multiple interpretations matching various categories. Viewed from the exterior, a neural network maps inputs x to outputs f_w(x). We formulate this as f_w(τ_θ(x)) for transformations τ_θ in input space (and similarly for hidden-feature space, where τ_θ is applied in between layers of the network f_w). However, how to split this mapping into “the f_w part” and “the τ_θ part”, and thus into Section 3 vs. Section 4, is ambiguous and up to one’s taste and goals. In our choices (marked with “☑” below), we attempt to use common notions and Occam’s razor.

Ambiguity of attributing noise to f_w, or to w, or to data transformations τ_θ:

Stochastic methods such as Stochastic depth (Huang et al., 2016b) can have several interpretations if stochastic transformations are allowed for f_w or w:

☑ Stochastic transformation of the architecture f_w (randomly dropping some connections), Table 3

Stochastic transformation of the weights w (setting some weights to 0 in a certain random pattern)

Stochastic transformation of data in hidden-feature space; dependence is θ ∼ p(θ); described in Table 1 for completeness
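These readings are not merely formal; for a toy residual block they yield identical outputs. The following minimal NumPy sketch is our own illustrative example (a single ReLU residual block with survival probability 0.5, names chosen for this sketch only):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
x = rng.normal(size=4)
b = rng.random() < 0.5  # random survival indicator of the residual block

# (a) architecture view: the block is removed from the computation graph
h_arch = x + W2 @ np.maximum(W1 @ x, 0) if b else x

# (b) weight view: the same pattern of weights is set to 0
W1b, W2b = b * W1, b * W2
h_weights = x + W2b @ np.maximum(W1b @ x, 0)

# (c) hidden-feature view: a stochastic transformation scales the block output
block = W2 @ np.maximum(W1 @ x, 0)
h_data = x + b * block

# all three interpretations compute the same forward pass
assert np.allclose(h_arch, h_weights) and np.allclose(h_arch, h_data)
```

Which interpretation one adopts thus determines the category, not the computation.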



Ambiguity of splitting τ_θ into τ and θ:

Dropout:

Parameters θ are the dropout mask; dependence is θ ∼ p(θ); transformation τ_θ applies the dropout mask θ to the hidden features

Parameters θ are the seed state of a pseudorandom number generator; dependence is θ ∼ p(θ); transformation τ_θ internally generates the random dropout mask from the random seed θ and applies it to the hidden features
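Both splits describe the same stochastic transformation and differ only in what is declared to be the parameters. A small NumPy sketch (hypothetical helper names tau_mask and tau_seed; standard inverted-dropout rescaling assumed) makes the equivalence explicit:

```python
import numpy as np

h = np.array([1.0, 2.0, 3.0, 4.0])  # hidden features
p = 0.5                             # dropout rate

# Split 1: the parameters are the mask itself, sampled outside of tau
rng = np.random.default_rng(42)
theta_mask = (rng.random(h.shape) >= p).astype(h.dtype)

def tau_mask(h, theta):
    # tau only applies the given mask (with inverted-dropout scaling)
    return theta * h / (1 - p)

# Split 2: the parameters are the PRNG seed; tau generates the mask internally
def tau_seed(h, theta):
    rng = np.random.default_rng(theta)
    mask = (rng.random(h.shape) >= p).astype(h.dtype)
    return mask * h / (1 - p)

# both parametrizations of tau_theta realize the same transformation
assert np.allclose(tau_mask(h, theta_mask), tau_seed(h, 42))
```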


Projecting dropout noise into input space (Bouthillier et al., 2015, Sec. 3) can fit our taxonomy in different ways by defining τ and θ accordingly. It can have similar interpretations as Dropout above (if τ_θ is generalized to allow for dependence on w), but we prefer the third interpretation without such generalizations:

Parameters θ are the dropout mask (to be applied in a hidden layer); dependence is θ ∼ p(θ); transformation τ_θ transforms the input to mimic the effect of the mask θ

Parameters θ are the seed state of a pseudorandom number generator; dependence is θ ∼ p(θ); transformation τ_θ internally generates the random dropout mask from the random seed θ and transforms the input to mimic the effect of the mask

☑ Parameters θ describe the transformation of the input in any formulation; dependence is θ ∼ p(θ | x, w); transformation τ_θ merely applies the transformation in input space



Ambiguity of splitting the network operation f_w into layers: There are several possibilities to represent a function f_w (a neural network) as a composition (or directed acyclic graph) of functions (layers).

The usage of a trainable parameter in several parts of the network is called weight sharing. However, some mappings can be expressed with two equivalent formulas such that a parameter appears only once in one formulation, and several times in the other.
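A familiar instance is convolution: written as a sliding kernel, each weight appears once; written as a dense matrix, the same weights appear in several entries. A short NumPy sketch (our own illustrative example, not from the main text) shows the two equivalent formulations:

```python
import numpy as np

x = np.arange(5.0)               # input signal
w = np.array([1.0, -2.0, 0.5])   # a single trainable kernel

# Formulation 1: cross-correlation -- each weight is written exactly once
# (np.convolve flips its kernel, so we pre-flip to get correlation)
y_conv = np.convolve(x, w[::-1], mode="valid")

# Formulation 2: dense matrix -- the same weights appear in several entries
W = np.zeros((3, 5))
for i in range(3):
    W[i, i:i + 3] = w            # shared weights repeated along the diagonal
y_dense = W @ x

# both formulations express the identical mapping
assert np.allclose(y_conv, y_dense)
```

Whether this mapping is counted as "one layer with weight sharing" or "a dense layer with repeated parameters" is exactly the ambiguity described above.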

Ambiguity of error term vs. regularization term: The auxiliary denoising task in ladder networks (Rasmus et al., 2015) and similar autoencoder-style loss terms can be interpreted in different ways:

Regularization term without given auxiliary targets

The ideal reconstructions can be considered as targets (if the definition of “targets” is slightly modified) and thus the denoising task becomes part of the error term

Appendix B Data-augmented loss function
To understand the success of target-preserving data augmentation methods, we consider the data-augmented loss function, which we obtain by replacing the training samples (x_i, t_i) in the empirical risk loss function (Eq. (3)) by augmented training samples (τ_θ(x_i), t_i):

(8)  L_A(w) = E_{θ∼p(θ)} [ (1/|D|) Σ_{i=1}^{|D|} g(τ_θ(x_i), t_i) ],

where we have replaced the inner part (ℓ and f_w) of the loss function by g(x, t) = ℓ(f_w(x), t) to simplify the notation. Moreover, L_A can be rewritten as

(9)  L_A(w) = E_{θ∼p(θ)} [ (1/|D|) Σ_{i=1}^{|D|} ∫∫ δ(x − τ_θ(x_i)) δ(t − t_i) g(x, t) dx dt ],

where δ is the Dirac delta function: δ(a) = 0 for a ≠ 0 and ∫ δ(a) da = 1; and p_A is defined as

(10)  p_A(x, t) = E_{θ∼p(θ)} [ (1/|D|) Σ_{i=1}^{|D|} δ(x − τ_θ(x_i)) δ(t − t_i) ].

Since p_A is non-negative and ∫∫ p_A(x, t) dx dt = 1, it is a valid probability density function inducing the distribution P_A of augmented data. Therefore,

(11)  L_A(w) = E_{(x,t)∼P_A} [ g(x, t) ].

When P_A equals the ground-truth data distribution P, Eq. (11) becomes the expected risk (2). We can show how this is related to importance sampling:

(12)  L_A(w) = ∫∫ p_A(x, t) g(x, t) dx dt = ∫∫ p(x, t) (p_A(x, t) / p(x, t)) g(x, t) dx dt = E_{(x,t)∼P} [ (p_A(x, t) / p(x, t)) g(x, t) ].

The difference between L_A and the expected risk is the reweighting term p_A(x, t)/p(x, t), identical to the one known from importance sampling (see Bishop, 1995a). The more similar p_A is to p (i.e. the closer P_A models the ground-truth distribution P), the more similar the augmented-data loss L_A is to the expected loss. We see that data augmentation tries to simulate the real distribution P by creating new samples from the training set D, bridging the gap between the expected and the empirical risk.
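The data-augmented loss can be checked numerically in a toy setting of our own construction (not from the main text): a fixed linear model with squared error and additive Gaussian noise as the augmentation, for which the expectation over the augmentation parameters has a closed form. The sketch estimates the augmented loss by Monte Carlo sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=20)   # training inputs x_i
T = X.copy()              # targets t_i (identity mapping)
w, sigma = 1.5, 0.3       # fixed model f_w(x) = w*x, augmentation noise scale

def g(x, t):
    # inner part of the loss: squared error of the linear model
    return (w * x - t) ** 2

# Monte Carlo estimate of the data-augmented loss:
# average g over many draws of augmentation parameters theta ~ N(0, sigma^2)
K = 200_000
thetas = rng.normal(scale=sigma, size=(K, 1))
L_A = g(X + thetas, T).mean()

# Closed form for this toy case: E_theta g(x + theta, t) = g(x, t) + w^2 sigma^2
L_A_exact = g(X, T).mean() + w ** 2 * sigma ** 2
assert abs(L_A - L_A_exact) < 1e-2
```

The closed form holds only for this linear/Gaussian toy case, but the Monte Carlo construction mirrors the expectation over θ in the augmented loss: sampling augmentation parameters is sampling from the induced distribution of augmented data.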