
Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyond

Training deep neural networks may be challenging on real-world data. Using models as black boxes, even with transfer learning, can result in poor generalization or inconclusive results when it comes to small datasets or specific applications. This tutorial covers the basic steps as well as more recent options to improve models, in particular, but not restricted to, supervised learning. It can be particularly useful for datasets that are not as well prepared as those in challenges, and also under scarce annotation and/or small data. We describe basic procedures such as data preparation, optimization and transfer learning, but also recent architectural choices such as the use of transformer modules, alternative convolutional layers, activation functions, and wide and deep networks, as well as training procedures including curriculum, contrastive and self-supervised learning.





I Introduction

Different fields were revolutionized in the last decade due to the huge investment in Deep Learning research. With the curation and availability of large datasets, as well as the popularization of graphics processing units, those methods became popular in the machine learning, pattern recognition, computer vision, natural language and signal/image processing communities [3]. After deep learning became pervasive more broadly in related fields such as engineering, computer science and applied math [1, 13, 24], a growing number of projects and papers saw its techniques adopted by practitioners from other fields in an attempt to solve particular problems [18, 12, 8]. Such hype raised concerns about the pitfalls in the use of machine and deep learning methods. A remarkable example is the study of Roberts et al. (2021) that, in a universe of over 2,000 papers using machine learning to detect and prognosticate for COVID-19 using medical imaging, found none of the models to be of potential clinical use due to methodological flaws and/or underlying biases [48].

In fact, training deep neural networks may be challenging on real-world data. Using models as black boxes, even with transfer learning (a popular and widely used technique in this context), can result in poor generalization or inconclusive results when it comes to small datasets or specific applications. In this paper, we focus on the main issues related to training deep networks, and describe recent methods and strategies to deal with different types of tasks and data. Basic definitions about machine learning, deep learning and deep neural networks are outside the scope of this paper. For those, please refer to [19, 44, 3] as good starting points.

Fig. 1: Main concepts related to training of Deep Networks

We cover the basic steps to avoid common pitfalls as well as more recent options to improve models, in particular, but not restricted to, supervised learning in visual content. Those guidelines can be particularly useful for datasets that are not as well prepared as those in challenges, and also under scarce annotation and/or small data. Figure 1 summarizes the main options to be considered. We describe the importance of basic procedures but also recent architectural choices such as the use of transformer modules, alternative convolutional layers, activation functions, and wide and deep networks, as well as training procedures including curriculum, contrastive and self-supervised learning.

I-A Notation

Let $x_i$ be examples from a training set $X$ containing $N$ instances, from which we may have target values or labels $y_i$. Such a training set can be used to train a deep neural network (DNN) with multiple processing layers. For simplicity we define such a neural network as a composition of functions $f_l(\cdot)$ (each related to some layer $l$) that has a set of parameters $\Theta_l$, takes as input a vector $x_l$ and outputs a vector $x_{l+1}$:

$$f(x) = f_L\left(\cdots f_2\left(f_1(x_1, \Theta_1), \Theta_2\right) \cdots, \Theta_L\right)$$

$x_1$ is the input data coming from the training set, and the functions $f_l(\cdot)$ can represent different layers such as: Dense (or Fully Connected), Convolutional, Pooling, Dropout, Recurrent, among others. $\Theta_l$ represents all learnable parameters of a given layer. For example, in dense layers those are matrices $W_l$ and bias values $b_l$, while in convolutional layers $\Theta_l$ represents the weights of the convolutional kernels/filters. Also, let $z_j$ be examples from a test set $Z$ used to evaluate the trained model.

There are also non-sequential networks with different branches containing independent or shared parameters, such as siamese or triplet networks, for which at least the output layer is shared among the branches. Other models operate using more than one independent network: a remarkable example is the Generative Adversarial Network, which contains a discriminator and a generator function.

II How to start and common issues

Common pitfalls and issues are due to overlooked details on the design of models. In this section we present a checklist, a kind of 7 Errors Game to begin with.

II-A Basic checklist (before trying anything else)

1. Input representation is fair and target patterns are present in the data. Make sure the input data is recognizable by a human or specialist, e.g., when undersampling and trimming an audio clip the expected patterns are still audible; when resizing images the objects to be recognized are still visible. For example, in Figure 2 we show two resized versions of an image to be used as input to a pre-trained neural network; one of them has clearly lost details of the cell that may be important for the task.

2. Input data is normalized accordingly. DNNs do not work well with arbitrary ranges of numerical values. Common choices are 0-1 scaling (by computing and storing minimum and maximum values), or z-score standardization (by computing and storing mean and standard deviation). For example, in Figure 4(a-b) we compare loss and accuracy curves using normalized and non-normalized versions of the training set.

3. Data has quality (data-centric AI). After a decade of model frenzy, there has been a resurgence of concerns around data, which should be defined consistently, cover all important cases and be sized appropriately. Most datasets (even benchmarks) have some wrong labels that hamper the design of the model. In such cases, recent work showed that models with lower capacity (stronger bias) may be more resilient [41].

4. Both loss function and evaluation metrics make sense:

- loss and evaluation must be adequate to the task and to the terms involved in their computation, e.g. in a multi-class classification task make sure you are comparing probabilities (vectors with unit sum). For regression tasks, error functions (such as mean squared error) are adequate, and for object detection the intersection over union (IoU) is to be considered. Note the loss function must be differentiable, i.e., have a derivative! Metrics such as accuracy, area under the curve (AUC), Jaccard and cosine distances have particular interpretations and it is paramount to understand their meaning for the task you want to learn before using them;

- check if the loss values are reasonable from the first to the last iteration, inspecting for issues such as overflow, e.g. the cross-entropy for 10 classes of a random classification result ($-\ln(1/10)$) should be no more than, approximately, $2.3$. Also, be sure your target (labels, range of values) matches what the network layer and its activation function output. For example, a sigmoid activation outputs values in the range $(0, 1)$ for every neuron, while the softmax function outputs values so that the sum of all neurons is $1$. See Figure 4(c) for the effects of using categorical vs binary cross-entropy in a binary classification network in which the last layer contained only one neuron with sigmoid activation;

- plot the loss curve for the training and validation (whenever possible) loss values along iterations (or epochs). The loss value along iterations should decrease (fairly) smoothly and converge to near zero. If not, or when the training and validation curves are too different, investigate optimization details or rethink the adequacy of the chosen model for the task.

5. Projected features have reasonable structure. It is worth visualizing the learned feature space, for example with t-SNE [55] or UMAP [36], by projecting the learned features onto a 2D plane, e.g. the output of the penultimate layer (often the one just before the output/prediction layer). This complements the loss curve, and may show whether such a space makes sense in terms of the application, or whether there was no actual convergence towards a useful representation. This is the case in Figure 3, in which a 10-class problem obtained a test accuracy of around 0.35, which is above random but still far from having learned a useful representation, as the projection of the same test set shows no class structure.

6. Model Tuning and Validation. The correct way to adjust a model is to use a validation set, never the test set. If you have to make any decision regarding data preparation, neural network design, training strategies, and others, such decisions have to consider only the training data available. In this scenario you may tune the model using metrics extracted, for example, via a $k$-fold cross validation on the training set. Only after all choices on network topology, training strategies and hyperparameters are made do you evaluate the final model on the test set. Otherwise, the results (even for the test set) are biased and cannot be generalized.

7. Use Internal and External Validation. In particular for computer-aided diagnostics or deployment for decision support, it is important to be extra careful with the data preparation, modeling and the conclusions. Methodological flaws and biases often lead to highly optimistic reported performance in models that fail to be useful in practice. For example, a recent study identified 2,212 studies on COVID-19 diagnosis with chest radiographs and CT scans, from which 415 were screened, all having methodological flaws and/or underlying biases [48]. We recommend reading and checking your study using PROBAST (tool to assess risk of bias and applicability of prediction model studies) [57] and/or CLAIM (checklist for artificial intelligence in medical imaging) [38], since they may be useful not only for health data but also to assess models for other applications.
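Item 2 of the checklist can be sketched as follows. This is a minimal illustration assuming NumPy and a tiny made-up feature matrix; the helper names (`fit_minmax`, `apply_minmax`, `fit_zscore`, `apply_zscore`) are hypothetical and not taken from any library. The point it shows is that the statistics are computed on the training set, stored, and then reused unchanged on the test set.

```python
import numpy as np

def fit_minmax(x_train):
    """Compute (and store) per-feature minimum and maximum on the training set."""
    return x_train.min(axis=0), x_train.max(axis=0)

def apply_minmax(x, lo, hi, eps=1e-12):
    """0-1 scaling with stored training statistics (eps avoids division by zero)."""
    return (x - lo) / (hi - lo + eps)

def fit_zscore(x_train):
    """Compute (and store) per-feature mean and standard deviation on the training set."""
    return x_train.mean(axis=0), x_train.std(axis=0)

def apply_zscore(x, mean, std, eps=1e-12):
    """z-score standardization with stored training statistics."""
    return (x - mean) / (std + eps)

# Two features with very different ranges (made-up values).
x_train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
x_test = np.array([[2.5, 200.0]])

lo, hi = fit_minmax(x_train)
x_train_01 = apply_minmax(x_train, lo, hi)
x_test_01 = apply_minmax(x_test, lo, hi)      # test data reuses the training min/max

mean, std = fit_zscore(x_train)
x_train_z = apply_zscore(x_train, mean, std)
```

Test examples may fall slightly outside the 0-1 range after scaling, which is expected when the training minimum and maximum are reused.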

After checking the items above, if results still need to be improved, we have to investigate: (1) how difficult the learning task is, and (2) what is the nature of the problem that makes it difficult and what options can be used to address it. Let us begin with difficult scenarios, as discussed in the next sections.

Fig. 2: Cell images resized to a size acceptable by a pre-trained network: left, still retaining structures of the cell; right, with insufficient details that would hamper learning.

Fig. 3: t-SNE projection of the test set of STL-10 image features, extracted from the penultimate layer of a neural network that reached 35% test accuracy, but for which the learned representations show poor class separability.

Fig. 4: Comparing loss curves (a) and accuracy (b) on the training set when training the same network with the same dataset, for which the instances were normalized to 0-1 (dashed red line) and not normalized (dotted blue line), and the accuracy when using different loss functions (c). Panels: (a) loss values for different normalization, (b) accuracy for different normalization, (c) accuracy for different losses.
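The loss sanity check from item 4 of the checklist can be made concrete with a few lines of plain Python. The sketch below computes the cross-entropy of a uniform (random-guess) prediction over 10 classes and checks whether a vector looks like a softmax output; the function names are hypothetical and the tolerance is an arbitrary assumption.

```python
import math

def cross_entropy(probs, target_index):
    """Cross-entropy of one prediction, given the index of the true class."""
    return -math.log(probs[target_index])

num_classes = 10
uniform = [1.0 / num_classes] * num_classes
# Loss of a random-guess classifier: -ln(1/10) = ln(10), about 2.3.
chance_loss = cross_entropy(uniform, target_index=3)

def looks_like_softmax_output(probs, tol=1e-6):
    """True if all values lie in [0, 1] and they sum to one (within tol)."""
    return all(0.0 <= p <= 1.0 for p in probs) and abs(sum(probs) - 1.0) < tol
```

An initial training loss far above `chance_loss` for a 10-class problem suggests a mismatch between targets, output activation and loss, such as the one illustrated in Figure 4(c).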

II-B Small datasets and poor convergence

Learning under scarce data is known to be an issue with deep networks. For images, considering coarse-grained or category-level data, i.e. the classes represent significantly different concepts such as in the clothing and accessories Fashion-MNIST dataset, studies indicate a minimum of 1500 instances per class to allow learning. For fine-grained scenarios, i.e. the differences between concepts are more subtle, as in the bird species CUB-200-2011 dataset (which has around 60 instances per category), the problem may become harder if only the visual data is used in training. Therefore, if you have small data, consider transfer learning or feature extraction using DNNs (see Section V for both), as well as architectures with less capacity (or reduced complexity). Data augmentation is also a possibility (see Section IV-E), but if the original data is unrepresentative, the augmented data will also be limited.

II-C Imbalance of target data in supervised tasks

Ideally, in classification tasks, the number of examples available for each class should be similar, and for regression, training examples uniformly covering the whole range of the target data should be provided. When such supervision is not balanced with respect to the target data, one may be easily fooled by the loss function and evaluation metrics. One possible strategy is to weigh the classes so that the instances related to less frequent patterns become more important in the training process. Also, make sure you use metrics that evaluate how good the model is along the whole space of target values. In addition, data augmentation can be investigated as a way to mitigate this imbalance (see Section IV-E).
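As a small illustration of both points, the sketch below computes class weights inversely proportional to class frequency and compares plain accuracy with a per-class averaged metric on a classifier that always predicts the majority class. The weighting formula N / (C * n_c) is one common heuristic, assumed here for illustration; the function names and the toy labels are hypothetical.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by N / (C * n_c), so that rare classes weigh more."""
    counts = Counter(labels)
    n, c = len(labels), len(counts)
    return {cls: n / (c * cnt) for cls, cnt in counts.items()}

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls; not dominated by the frequent class."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        recalls.append(sum(y_pred[i] == cls for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

labels = [0] * 90 + [1] * 10                   # 90% / 10% imbalance
weights = inverse_frequency_weights(labels)    # class 1 weighs 9 times more than class 0

always_majority = [0] * 100                    # a useless classifier
plain_acc = sum(p == y for p, y in zip(always_majority, labels)) / len(labels)
bal_acc = balanced_accuracy(labels, always_majority)
```

Here plain accuracy is 0.9 even though the minority class is never detected, while the per-class average is 0.5, which exposes the problem.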

II-D Complexity of models, overfitting and underfitting

Overfitting and underfitting represent undesired scenarios of learning and are related to the complexity of the models. Although it is not the scope of this paper to explain those phenomena (for a more complete explanation refer to [37]), it is important to know how to diagnose them.

Underfitting usually occurs when the chosen architecture and training procedure are not well adapted to the task and/or the difficulty of the dataset at hand. The first symptom of this effect is a loss curve that converges to a value far from zero, or when there is no convergence at all.

Overfitting is more common for deep neural nets since those are generally high capacity models, i.e. have a large number of trainable parameters that allow for a large space of admissible functions [37]. It occurs when the network is excessively adjusted to the training set, approaching a model that memorizes the training set. Because DNNs often produce (near) zero error in the training set, it is harder to evaluate their generalization for future data.

In an attempt to measure how deep networks may memorize the training set, [60] uniformly randomized the labels of examples in benchmark datasets and showed that, if the network has sufficient capacity, it is able to reach near zero loss (training error) by memorizing the entire training set. More recently, [41] showed that lower capacity models may be practically more useful than higher capacity ones in real-world datasets, which emphasizes the need for better data quality and better evaluation, in particular external validation, before finding a good balance between the complexity of the model and its performance on a particular dataset.

II-E Attacks

Deep networks learn features for a specific target task via a loss function that uses specific training data. Because of their low interpretability, it is difficult to know which patterns from the input data were used to minimize the loss. For example, when counting white cells in blood smear images, if the purple color is present in all images with white cells, the optimization process has a huge incentive to use the purple color alone as an indicator for white cells. Therefore, in future images, if there is purple dye in the blood smear medium (not in the actual cells), the classifier may use this to incorrectly, but with high confidence, classify the image as containing white cells. On the other hand, an image with white cells containing a different shade of purple may not be detected. The same can happen in soundscape ecology, for example when distinguishing different bird species by their singing patterns. If there is a background noise, e.g. a critter that usually sings at the same time of the day as some birds, the sound of the critter may be used by the network to detect the bird. In both scenarios, the features obtained after training are not the concept we wished to learn.

For example, in Figure 5 we show two test images, one without attack and the other containing a visible one-pixel attack. In this attack, images in the training set from a given class contain a white pixel in a fixed position, biasing the model to use that white pixel to predict the class while neglecting other visual concepts. In this case we deliberately included the pixel in a visible region, but one could include it in less obvious regions such as the border, or even add subtle features, such as a gradient, with similar effects [39].

Fig. 5: Example of a pixel attack in which the same network is given as input two testing images (not seen during the training stage), the first without attack and the second with a one-pixel attack (see the white dot on the car’s door), followed by the three most probable classes output by the network. In this image the pixel was included in a visible region to facilitate visualization.

III Architecture options

III-A Types of convolution units

Convolutions are of course the most important operation in CNNs, which means there are many studies in the literature bringing new ideas to this classic operation.

Let the kernel size, $k$, refer to the lateral size of each kernel in a convolutional layer, and consider all kernels to be square, so $k \times k$ in size. Each kernel is applied to one input channel to be accumulated for one output channel, which leads to each convolutional layer having a collection of $c_{in} \times c_{out}$ kernels, for a total of $k \times k \times c_{in} \times c_{out}$ learnable weights. The stride of a conv. layer refers to the step between each “application” of a kernel when it “slides” over the image; in the classic operation this step size is always one.

$1 \times 1$ convolution can be useful for reducing computations further into networks by combining values along the channels of a single pixel. These operations do not take into account any neighbourhood, but perform the role of weighing and collecting information for each pixel on all channels and outputting it at a new channel dimensionality (the number of output channels).
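To make the "per-pixel" nature of this operation concrete, the following sketch (assuming NumPy, a channels-last layout and arbitrary sizes chosen for illustration) shows that a 1x1 convolution is the same linear map over channels applied independently at every pixel, i.e. a dense layer shared across spatial positions. Bias terms are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, c_in, c_out = 4, 5, 8, 3                 # illustrative sizes
feature_map = rng.normal(size=(h, w, c_in))    # channels-last layout (assumption)
weights = rng.normal(size=(c_in, c_out))       # one scalar weight per (input, output) channel pair

# 1x1 convolution: mix the channels of each pixel, ignoring its neighbourhood.
out = np.einsum("hwc,cd->hwd", feature_map, weights)

# Equivalent view: flatten the pixels, apply a dense layer, restore the spatial shape.
out_dense = (feature_map.reshape(-1, c_in) @ weights).reshape(h, w, c_out)
```

The spatial size is unchanged while the channel dimensionality goes from 8 to 3, which is how this operation reduces computation for the layers that follow.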

Transposed Convolutions play the role of a learnable, weighted upsampling operation in generative networks, autoencoders and pixel-to-pixel models (e.g. segmentation tasks). The concept simulates a fractional stride, so before applying the kernels a feature map is padded with zeros between spatial positions. When using this operation it is important to choose the kernel size as an even number (more generally, one divisible by the stride) to avoid the “checkerboard effect”; see [42].

Spatially Separable Convolutions save on computation by breaking a larger convolution operation into two smaller operations. This is usually accomplished by turning a convolution with $k \times k$ kernels into a $k \times 1$ operation followed by a $1 \times k$ operation.
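The equivalence behind spatially separable convolutions can be checked numerically. The sketch below, assuming NumPy and a naive (slow) helper written only for illustration, applies a 3x3 kernel directly and then as a 3x1 pass followed by a 1x3 pass. The equality is exact only when the full kernel is the outer product of a column and a row vector, which is the case for the Sobel-like kernel used here.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation (the operation DNN libraries call convolution)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
col = np.array([[1.0], [2.0], [1.0]])     # k x 1 kernel
row = np.array([[-1.0, 0.0, 1.0]])        # 1 x k kernel
full = col @ row                          # the k x k kernel they factorize

direct = correlate2d_valid(image, full)                              # k * k = 9 weights
separable = correlate2d_valid(correlate2d_valid(image, col), row)    # k + k = 6 weights
```

Both paths produce the same 6x6 output, while the separable version uses 2k instead of k squared multiplications per output value.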

Depthwise Separable Convolutions follow a similar principle to spatially separable ones by also breaking the traditional operation into two more efficient ones. First, each channel of the feature map is convolved with its own $k \times k$ kernel but, instead of summing the resulting activations as usual, the resulting matrices go through a $1 \times 1$ convolution that maps the output to the desired number of feature maps. This yields the same output shape as the traditional operation, but at a fraction of the cost.
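The "fraction of the cost" can be quantified by counting learnable weights. The sketch below is plain Python with illustrative layer sizes (3x3 kernels, 64 input and 128 output channels, chosen arbitrarily); biases are ignored and the function names are hypothetical.

```python
def standard_conv_weights(k, c_in, c_out):
    """One k x k kernel for every (input channel, output channel) pair."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    """One k x k kernel per input channel, then a 1x1 convolution mixing channels."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

k, c_in, c_out = 3, 64, 128
standard = standard_conv_weights(k, c_in, c_out)           # 73728
separable = depthwise_separable_weights(k, c_in, c_out)    # 8768
ratio = separable / standard                               # equals 1/c_out + 1/k**2
```

For these sizes the separable layer needs roughly 12% of the weights of the standard one, and the ratio approaches 1/k² as the number of output channels grows.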

III-B Width, Depth and Resolution

Techniques for designing deep networks have evolved considerably since AlexNet [30] won the 2012 ImageNet challenge. One of the main fronts of discussion is around scaling networks up or down to find a balance between accuracy and memory/computational efficiency. Width, Depth and Resolution are scaling strategies with different pros and cons.

Wider Nets are easier to train and are able to capture finer details in images (such as background information). Increasing width, however, increases computational cost quadratically [58].

Deeper Nets perform better on “well-behaved” datasets [40], such as single-object classes with “clear” objects, while wider nets do better on classes that represent scenes (e.g. “bookshop”, “seashore”).

Scaling Depth, Width and Resolution Together yields the best results for a wide range of tasks and desired accuracies. Frameworks for scaling the three variables together were presented in EfficientNet [53] and MobileNet [49].

III-C Pooling

Pooling layers have been a staple of CNNs since their introduction. These downsampling operations are useful both for saving on computation and memory, and for summarising feature maps as networks get deeper.

Max Pooling is the most widely used method for classification as it enforces discriminative features within the network.

Average Pooling was the first pooling approach and is currently used in Generative Adversarial Networks, as in those models it better matches the upsampling layers of generators.

Strided Convolutions, with step size greater than one, are a way of implementing “learnable pooling”. They are less common in classification CNNs and more common in GAN designs.

Blur Pooling is a solution proposed by [61]. Their findings showed that current downsampling operations break the shift equivariance expected of CNNs, and they propose that pooling operations be first densely evaluated, then blurred by a low-pass filter and only then subsampled. This improves shift equivariance.
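For reference, the two classic pooling operations above can be sketched in a few lines. This is a minimal illustration assuming NumPy, a single 2D feature map and non-overlapping windows; the function name `pool2d` and the toy values are hypothetical. Blur pooling and strided convolutions are not covered by this sketch.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling of a 2D map whose sides are multiples of `size`."""
    h, w = x.shape
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

x = np.array([[1.0, 2.0, 0.0, 0.0],
              [3.0, 4.0, 0.0, 8.0],
              [0.0, 0.0, 5.0, 5.0],
              [0.0, 4.0, 5.0, 5.0]])

max_pooled = pool2d(x, mode="max")        # [[4, 8], [4, 5]]
avg_pooled = pool2d(x, mode="average")    # [[2.5, 2], [1, 5]]
```

Max pooling keeps the strongest response in each window (note the isolated value 8), whereas average pooling smooths it out, which relates to why the former is said to favour discriminative features.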

III-D Transformers and ViT

Transformers are a recent architecture created primarily for language tasks [56]. They rely on self-attention as the defining mechanism of their layers. Self-attention is very different from convolutions and from recurrent layers in that very little inductive bias is built into its mechanism.

In attention layers, the relevance of one item to all other items, including itself, is estimated so that each item becomes a weighted average of its most relevant counterparts. This is done by learning three projection matrices $W_Q$, $W_K$ and $W_V$ (similar to 3 dense layers) applied to the input items $X$:

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \qquad \mathrm{Attention}(Q, K, V) = \sigma\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

where $\sigma$ is the softmax function that makes the resulting attention weights (the result of $\sigma(Q K^\top / \sqrt{d})$) behave as a probability function that weights the values $V$. $d$ is the dimensionality of each item.
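The attention computation described above can be sketched with NumPy as follows. This is a single-head illustration under stated assumptions: random matrices stand in for the three learned projections, and the sizes (5 items of dimension 16) are arbitrary. It is not the implementation of any particular library.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over n items of dimension d."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))    # (n, n): relevance of every item to every item
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 5, 16
x = rng.normal(size=(n, d))
# Random stand-ins for the three learned projection matrices.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
out, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` sums to one, so every output item is a weighted average of the projected values of all items, including itself.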

The architecture was quickly applied to computer vision tasks as well, notable examples being ViT [17] for image recognition and iGPT [9] for image generation. While ongoing work with this architecture is exciting, the lack of inductive bias means that those models require much more training data than CNNs. ViT, for example, does not perform well when trained from scratch on the ImageNet dataset alone.

IV Improving optimization

Optimization choices, namely the algorithm, learning rate (or step size) and batch size, matter when using deep networks for learning representations. Using default options with arbitrary optimizers may lead to suboptimal results. Also, normalization, regularization and the sample size may significantly influence the optimization procedure.

IV-A Optimizer and batch size

The original Gradient Descent algorithm computes the gradient at each iteration using all training data. Stochastic Gradient Descent (SGD) is an approximation that allows calculating the gradient of the cost function based on a random example or a small subset of examples (minibatch). Regular SGD is a conservative but fair choice, as long as the learning rate and batch size are well defined. In fact, nearly every state-of-the-art computer vision model was trained using SGD, for example ResNet [23], SqueezeNet [26], Faster R-CNN [46] and Single Shot Detectors [32].

Adaptive methods such as Adam and RAdam are good alternatives, requiring smaller learning rate (LR) values (0.001 or lower) and larger batch sizes when compared to SGD, which is less sensitive to the batch size choice and for which the LR is often around 0.01. Momentum can be used to accelerate the convergence of regular SGD; however, it adds another hyperparameter (the velocity weight) to be set.
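The role of the velocity weight can be seen in a toy setting. The sketch below, in plain Python, applies the momentum update to a one-parameter quadratic objective; it uses the exact gradient rather than a minibatch estimate, so it illustrates only the update rule, not the stochastic part of SGD. The function name, the objective and the step count are assumptions made for illustration.

```python
def sgd_momentum(grad_fn, w0, lr=0.01, momentum=0.9, steps=500):
    """Gradient steps with momentum on a single scalar parameter."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        velocity = momentum * velocity - lr * grad_fn(w)   # accumulate past gradients
        w = w + velocity
    return w

# Toy objective (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3.
def grad(w):
    return 2.0 * (w - 3.0)

w_plain = sgd_momentum(grad, w0=0.0, momentum=0.0)      # regular updates, no momentum
w_momentum = sgd_momentum(grad, w0=0.0, momentum=0.9)   # same LR, with velocity weight 0.9
```

With the same learning rate of 0.01 and the same number of steps, the run with momentum ends much closer to the minimum, at the price of one more hyperparameter.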

IV-B Learning rate scheduling

A bad learning rate choice may ruin all other choices. Because the parameter adjustment is not uniform along the training process, learning rate/step adaptation using scheduling should always be considered:
Step Decay decreases the learning rate by some factor along the epochs or iterations, e.g. halving the value every 10 epochs;
Exponential Decay reduces the learning rate exponentially;
Cosine Annealing continuously decreases the step to a minimum value and then increases it again, which acts like a simulated restart of the learning process [34].
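The three schedules above can be written as simple functions of the epoch. This is a hedged sketch in plain Python: the default factor, decay rate and period are arbitrary illustrative values, and the cosine schedule is a simplified fixed-period version with restarts rather than the exact formulation of [34].

```python
import math

def step_decay(lr0, epoch, factor=0.5, every=10):
    """Multiply the learning rate by `factor` every `every` epochs."""
    return lr0 * factor ** (epoch // every)

def exponential_decay(lr0, epoch, rate=0.05):
    """Reduce the learning rate exponentially with the epoch."""
    return lr0 * math.exp(-rate * epoch)

def cosine_annealing(lr_max, epoch, period=10, lr_min=0.0):
    """Decrease from lr_max towards lr_min within each period, then restart at lr_max."""
    t = epoch % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / period))

# Learning rates at a few epochs, starting from 0.01.
schedule = [
    (e, step_decay(0.01, e), exponential_decay(0.01, e), cosine_annealing(0.01, e))
    for e in range(0, 21, 5)
]
```

In this sketch the cosine schedule returns to its maximum value at epochs 10 and 20, which is the simulated restart mentioned above.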

IV-C Normalization

Normalizing data is a staple of classic machine learning. Since deep models are composite functions, it is beneficial to keep intermediary feature maps within some range:

Batch Norm. (BN) is a widely known technique introduced by [27] to accelerate the training of deep networks. It works like a layer that standardizes each feature map across the inputs in a minibatch (hence the name). As learning progresses it also keeps a running mean and standard deviation across the dataset, which can then be used for single-sample inference. Santurkar et al. [50] showed that BN’s advantage comes from making the optimization landscape smoother.

Instance Norm. can also be designed as a layer, but instead of performing standardization across input samples, it does so for each channel of each individual sample. Its performance is worse than BN for recognition. It was designed specifically for generative and style transfer models [54].

Layer Norm. performs standardization for each individual sample but takes the mean and standard deviation from all feature maps. It was created [2] because BN cannot be applied in recurrent networks, since the concept of a batch is harder to define in that context. Layer Norm. is also used in most Transformer implementations.
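The difference between the three layers comes down to the axes over which the mean and standard deviation are taken. The sketch below assumes NumPy and a batch of feature maps in (N, C, H, W) layout; it omits the learnable scale and shift parameters and the running statistics that BN keeps for inference, so it illustrates only the standardization step.

```python
import numpy as np

def standardize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation computed over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# (batch N, channels C, height H, width W); the sizes are arbitrary.
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 6, 6))

batch_norm = standardize(x, axes=(0, 2, 3))     # per channel, across the minibatch
instance_norm = standardize(x, axes=(2, 3))     # per channel of each individual sample
layer_norm = standardize(x, axes=(1, 2, 3))     # per sample, across all feature maps
```

Only the first variant depends on the other samples in the minibatch, which is why it needs stored statistics for single-sample inference while the other two do not.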

IV-D Regularization via Dropout

Regularization comprises mechanisms that help find the best parameters while minimizing the loss function. During the convergence process of deep networks, several combinations of parameters can be found that correctly classify the training examples. Dropout [25] works by randomly deactivating a fraction of the neurons, mainly after dense layers. This prevents some neurons from over-specializing on, or memorizing, specific data. At each iteration of training, dropout provides a different subsample of activations, i.e. a different version of the network. Consequently, this mechanism helps to prevent overfitting during training. During inference dropout is turned off so that all neurons are active.
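A minimal dropout sketch, assuming NumPy, is given below. It uses the "inverted" variant, in which the surviving activations are rescaled during training so that nothing needs to change at inference; this rescaling is an implementation assumption not discussed in the text, and the drop rate of 0.5 is only an example.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a fraction `rate` of the units and rescale the rest."""
    if not training or rate == 0.0:
        return activations                      # inference: all neurons stay active
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
a = np.ones((1000, 100))                        # stand-in for the output of a dense layer
train_out = dropout(a, rate=0.5, rng=rng, training=True)
test_out = dropout(a, rate=0.5, rng=rng, training=False)
```

Every call in training mode draws a new random mask, so each iteration effectively trains a different subsample of the network.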

IV-E Data Augmentation

Unlike the other optimization techniques mentioned before, which work to improve performance by acting on the network structure, data augmentation techniques focus exclusively on increasing the size and variability of the training set [52]. Conceptually, they generate new instances derived from the original training set by manipulating the features and replicating the original label to the generated example. Thus the training set becomes larger and more variable. Data augmentation can also be used to balance datasets (see Subsection II-C), controlling one of the drawbacks of deep learning [31]. A recurrent concern in these techniques is to ensure that the transformation performed does not alter the concept (label) of the instance.
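As an illustration, the sketch below generates augmented copies of one image with a random horizontal flip followed by a random crop, replicating the original label. It assumes NumPy and an image stored as a height x width x channels array; the choice of transformations, the padding size and the function name `augment` are assumptions. Whether such transformations preserve the concept depends on the task (a horizontal flip would, for example, change the meaning of handwritten characters).

```python
import numpy as np

def augment(image, rng, pad=2):
    """Random horizontal flip followed by a random crop of the original size."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                                   # flip left-right
    h, w = image.shape[:2]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)             # random crop offset
    return padded[top:top + h, left:left + w]

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))      # stand-in for a training image
label = 7                            # hypothetical class label
augmented = [(augment(image, rng), label) for _ in range(4)]     # the label is replicated
```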

V Training procedures beyond the basics

The regular approach for training deep networks is to design the topology, define the training strategy, randomly initialize all parameters and then train from scratch. However, such networks are both data-hungry and highly sensitive to initialization. To overcome those issues, weight warm-up procedures were studied, such as first training an unsupervised autoencoder [7] and then using its encoder weights as initialization. In addition, a widespread approach is to download models pre-trained on a large dataset, such as ImageNet in the case of image classification [29]. This is called transfer learning, and assumes the model has generalization capability. Due to the hierarchical structure of deep networks, in which different layers provide different levels of attributes, even different image domains may benefit from pre-training [43, 14].

Transfer learning from pre-trained weights involves the following steps:

  1. remove the original output layer, design a new output layer and randomly initialize its weights;

  2. freeze the remaining layers, i.e. making the layers not trainable, by not allowing their parameters to be updated;

  3. train the last layer for a number of epochs.

Fine-tuning: after transfer learning, unfreeze and train a subset of layers using a small learning rate. As a rule of thumb, one starts by unfreezing the layers closer to the top (output) of the network and, the more data one has, the more layers can be made trainable. Use with care: if your dataset is small, beware not to overtrain.

Pre-trained nets can be used as feature extractors in scenarios with small sets of data, in which even transfer learning would be unfeasible. For this, perform a forward pass and take the activation maps of a given layer as a feature vector for the input data. Taking the output from the penultimate layer is a fair choice since this represents the input data globally [43]. However, one can also insert a global pooling layer just after a convolutional layer to summarize the data. Previous studies show that combinations of features coming from different layers improve the representation [16, 62].

As an alternative to the use of a global pooling layer, take all activation maps/values (often high dimensional) and carry out a separate dimensionality reduction, for instance using Principal Component Analysis (PCA). With the extracted features one can proceed with external methods such as classification, clustering, and even anomaly detection.
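The dimensionality reduction step can be sketched with NumPy alone. In the example below, a random low-rank matrix stands in for activation maps flattened into vectors (no network is actually run), and PCA is computed via the singular value decomposition; the function names and sizes are hypothetical. As with normalization, the mean and components would be fitted on training features and reused for new data.

```python
import numpy as np

def pca_fit(features, n_components):
    """Return the mean of the features and the top principal directions (via SVD)."""
    mean = features.mean(axis=0)
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_transform(features, mean, components):
    """Project features onto the stored principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
# Stand-in for extracted activations: 200 inputs, 512 dimensions, intrinsic rank 16.
features = rng.normal(size=(200, 16)) @ rng.normal(size=(16, 512))

mean, components = pca_fit(features, n_components=16)
reduced = pca_transform(features, mean, components)     # shape (200, 16)
reconstructed = reduced @ components + mean             # back to 512 dimensions
```

The `reduced` matrix is what would then be passed to an external classifier, clustering or anomaly detection method.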


In the next sections we will cover training strategies beyond the transfer learning approach.

V-A Curriculum Learning

This concept is based on the human strategy of creating a study script, in which a teacher defines the order in which a student learns, facilitating training [4, 21]. With the premise that part of the data (or the task) at hand is easier to learn than the rest, instead of trying to train the whole model at once with randomly sampled data, it is possible to define an order of instances or tasks. The basic technique works with instances by defining a scoring function and a pacing function. The scoring function is a metric to sort the training examples from the easiest to the most difficult. The pacing function dictates the speed at which more (difficult) examples are incorporated into the training set [21]. An important consideration when applying curriculum learning is to keep the available training set balanced; in originally unbalanced scenarios, this requirement can be an additional challenge. Another important factor is the learning rate [21]: setting it wrongly can degrade training performance, making it worse than conventional convolutional network training. However, when applied correctly, curriculum learning tends to improve convergence speed and final accuracy. Curriculum learning can also be applied as a sequence of tasks, where the easiest task is performed before the most difficult ones [15, 5].
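The interplay between the scoring and pacing functions can be sketched in plain Python. The example below is only illustrative: the linear pacing function, its starting fraction and the use of the absolute value as a difficulty score are assumptions (in practice the score could be, for instance, the loss of a reference model on each example), and no model is actually trained.

```python
def linear_pacing(step, total_steps, n_examples, start_fraction=0.2):
    """Number of (easiest) examples made available at a given training step."""
    fraction = start_fraction + (1.0 - start_fraction) * step / total_steps
    return max(1, min(n_examples, int(round(fraction * n_examples))))

def curriculum_subsets(examples, scoring_fn, total_steps):
    """Yield, for each step, the subset allowed by the pacing function (easiest first)."""
    ordered = sorted(examples, key=scoring_fn)      # low score means easy
    for step in range(total_steps + 1):
        yield ordered[:linear_pacing(step, total_steps, len(ordered))]

# Hypothetical data: the distance of a number from zero stands in for its difficulty.
examples = [5, -1, 9, 2, -7, 0, 3, -4, 8, 6]
subsets = list(curriculum_subsets(examples, scoring_fn=abs, total_steps=4))
```

Here the training pool grows from the 2 easiest examples to all 10 over five steps; minibatches would be sampled from the current subset at each step.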

V-B Contrastive Learning

Deep learning can be seen as a way of upending the traditional machine learning pipeline of “preprocessing-feature extraction-classification” with a model that learns to classify or regress while also extracting the features of most interest to the task. This is known as Representation Learning.

One of the ways of learning representations is through Contrastive Learning. In general terms, it is a collection of losses designed to learn representations in which samples can be semantically distinguished when compared under a pre-defined distance.

Applications of the Contrastive loss (and its variant, the Triplet loss) range from facial recognition [51] to content-based image retrieval [5] and, more recently, self-supervised learning, covered in the following section.
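To make the idea of comparing samples under a pre-defined distance concrete, here is a minimal sketch of one common pairwise contrastive formulation and of a triplet loss on squared Euclidean distances. The margin values, function names and toy embeddings are assumptions for illustration. Constant scale factors used in some formulations are omitted.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_loss(a, b, same_class, margin=1.0):
    """Pull same-class pairs together; push different-class pairs at least `margin` apart."""
    distance = squared_distance(a, b) ** 0.5
    if same_class:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Require the anchor to be closer to the positive than to the negative by `margin`."""
    return max(
        0.0,
        squared_distance(anchor, positive) - squared_distance(anchor, negative) + margin,
    )


# Toy 2-D embeddings (hypothetical values).
anchor, near, far = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]

print(contrastive_loss(anchor, far, same_class=True))    # large: similar pair is far apart
print(contrastive_loss(anchor, far, same_class=False))   # zero: dissimilar pair beyond margin
print(triplet_loss(anchor, positive=near, negative=far))  # zero: triplet already satisfied
print(triplet_loss(anchor, positive=far, negative=near))  # positive: triplet violated
```

Both losses are zero once the embedding respects the desired semantic ordering, so only violating pairs or triplets contribute gradients during training.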

V-C Self-supervised Learning

Given a task and enough labels, supervised learning can probably solve it really well. But large amounts of manually labeled data are often costly, time-consuming and complex to obtain. Sometimes real-world applications require categories that are not present in standard large-scale benchmark datasets, e.g. medical images. And in some cases, vast amounts of unlabeled images are readily available.

Self-supervision is a form of unsupervised learning where the data itself provides the supervision. It relies on pretext tasks that can be formulated using only unlabeled data. Those tasks produce surrogate labels, which are then used to guide the learning process. We can think of those models as predicting part of the data from other incomplete, transformed, distorted or corrupted parts. In order to solve those tasks, the models learn useful image representations, achieving state-of-the-art performance among methods that rely only on unlabeled images and benefiting almost all types of downstream tasks. Most models “learn to compare”, using some kind of contrastive learning strategy.
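As one illustration of how a pretext task produces surrogate labels from unlabeled data, the sketch below builds a rotation-prediction dataset: each image is rotated by 0, 90, 180 and 270 degrees, and the rotation index becomes the label. Rotation prediction is chosen here only as an example of a pretext task. The function names and the toy image are assumptions.

```python
def rotate90(image):
    """Rotate a 2-D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def make_rotation_pretext(images):
    """Return (rotated_image, surrogate_label) pairs; label k means a rotation of k * 90 degrees."""
    samples = []
    for image in images:
        rotated = [list(row) for row in image]
        for label in range(4):
            samples.append((rotated, label))
            rotated = rotate90(rotated)
    return samples


# A single unlabeled toy "image" (hypothetical pixel values).
unlabeled_images = [[[1, 2], [3, 4]]]
pretext_samples = make_rotation_pretext(unlabeled_images)

for rotated_image, label in pretext_samples:
    print(label, rotated_image)
```

A network trained to predict these labels never sees a human annotation, yet it must capture object structure to succeed, which is what makes the learned features reusable.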

Methods include prediction of relative positions, maximization of mutual information, cluster-based discrimination, image/video generation and instance discrimination. MoCo [22], SimCLR [10], BYOL [20], SimSiam [11], SwAV [6] and Barlow Twins [59] are important recent methods in the field of self-supervised learning.

For a deeper understanding of those models and an extensive view of other methods, we recommend the following surveys: [28] and [33].

VI Running the final mile to improve predictions

VI-A Activation Functions beyond ReLU

The Rectified Linear Unit (ReLU) became the standard activation function for hidden layers of deep networks. ReLU truncates negative values to zero, while keeping positive values as they are. ReLU is often better because, being unbounded above 0, it avoids saturation, which generally causes training to slow down drastically due to near-zero gradients. The problem with ReLU is that its gradient is zero for negative inputs and undefined at zero, so values in that region produce non-useful or bad estimates for the gradient. In practice, this may lead to “dead neurons”. A dead neuron is stuck completely on the negative side and always outputs zero for the training set. Therefore this neuron cannot recover or learn anymore in the following epochs. Swish and Mish functions were proposed to improve this.

Swish is a gated version of Sigmoid, defined as $f(x) = x \cdot \sigma(\beta x)$, where $\sigma$ is the Sigmoid function and $\beta$ is a hyper-parameter that can be adjusted arbitrarily or trained. Mish is defined as $f(x) = x \cdot \tanh(\ln(1 + e^{x}))$, which is bounded below and unbounded above, with a range of approximately $[-0.31, \infty)$. Small but consistent improvements were observed when using Swish and Mish instead of ReLU in hidden layers of the network, with a slight advantage for Mish.
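The three activation functions can be compared with a short scalar sketch. It uses only the standard library, and the numerically stable forms of the sigmoid and softplus are an implementation choice made here, not something prescribed by the text. Deep learning frameworks provide vectorized versions of these functions.

```python
import math


def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """ln(1 + e^x), computed in a numerically stable way."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x); beta may be fixed or learned."""
    return x * sigmoid(beta * x)


def mish(x):
    """Mish: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(value, relu(value), round(swish(value), 4), round(mish(value), 4))

# Smallest Mish output on a grid over [-10, 10], close to the lower bound of its range.
mish_minimum = min(mish(i / 100.0) for i in range(-1000, 1001))
print(round(mish_minimum, 4))
```

Unlike ReLU, both Swish and Mish return small non-zero values for negative inputs, so a gradient can still flow through units that would otherwise be inactive.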

VI-B Test Augmentation

The idea of data augmentation has also been widely applied to test sets, under the name test augmentation, to achieve more robust predictions [52]. This promising practice has been welcomed especially in critical scenarios, such as medical imaging diagnosis [35], where even small errors can have serious consequences. Essentially, the same techniques used to perform data augmentation on the training set can be chosen for the test set (see more details in Subsection IV-E). However, this resource is adopted for different reasons.

In the test stage, the objective is to improve the prediction for each test example: several augmented versions of the same test image are generated, and the final output is a combination (via average, majority voting or other combination rule [45]) of the predictions obtained for them. Another use of augmenting data in the validation or test set is to assess the robustness of the model. One can compare the evaluation of the model on the original validation/test set with that on an augmented version containing modified versions of the original examples, e.g. by translation, noise injection, etc. We expect the evaluation to be similar in those two scenarios. However, if a significant drop in performance is observed, this may be a sign of poor generalization.
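Both uses of test augmentation can be sketched with a toy example. Everything below is an assumption made for illustration: `toy_model` stands in for a trained network, the inputs are short 1-D signals instead of images, and the two augmentations (a circular shift and a constant offset in place of noise) are deterministic so that the result is reproducible.

```python
def toy_model(signal):
    """Stand-in for a trained classifier: returns [p(class 0), p(class 1)]."""
    score = sum(signal) / len(signal)
    p1 = min(1.0, max(0.0, 0.5 + score))
    return [1.0 - p1, p1]


def shift(signal):
    """Circular translation by one position."""
    return signal[1:] + signal[:1]


def add_offset(signal, delta=0.05):
    """Deterministic perturbation used here in place of random noise injection."""
    return [value + delta for value in signal]


def predict_with_tta(model, example, augmentations):
    """Average the predictions for the original example and its augmented versions."""
    views = [example] + [augment(example) for augment in augmentations]
    predictions = [model(view) for view in views]
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / len(predictions) for c in range(n_classes)]


def accuracy(model, examples, labels, transform=None):
    """Accuracy on the given set, optionally after transforming every example."""
    correct = 0
    for example, label in zip(examples, labels):
        if transform is not None:
            example = transform(example)
        probabilities = model(example)
        predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
        correct += int(predicted == label)
    return correct / len(examples)


test_examples = [[0.2, -0.1, 0.3, 0.0], [-0.3, -0.2, -0.3, -0.2]]
test_labels = [1, 0]

tta_probabilities = predict_with_tta(toy_model, test_examples[0], [shift, add_offset])
clean_accuracy = accuracy(toy_model, test_examples, test_labels)
augmented_accuracy = accuracy(toy_model, test_examples, test_labels, transform=add_offset)
print(tta_probabilities, clean_accuracy, augmented_accuracy)
```

Averaging is only one combination rule; majority voting over the predicted classes is a drop-in alternative. A large gap between `clean_accuracy` and `augmented_accuracy` would be the warning sign of poor generalization mentioned above.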

VII Conclusion

We believe this paper serves as a reference that allows researchers and practitioners to avoid major issues and to improve their models with advanced training techniques. While DNNs have high generalization capacity and allow significant transfer learning, there are important concepts that require attention to allow learning. The first requirement is to have a lot of training data, which is not always possible. Also, many factors must be taken into account, including data preparation and optimization, as well as different processing layers. We describe those as basic tools to achieve network convergence. When those techniques are still insufficient and more advanced concepts are needed to reach sufficiently good performance and better generalization, techniques such as curriculum, contrastive and self-supervised learning can be used. Those are still a subject for future investigation, to better bridge the gap between the performance obtained on benchmark datasets and that on real data.


  • [1] A. Arpteg, B. Brinne, L. Crnkovic-Friis, and J. Bosch (2018) Software engineering challenges of deep learning. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 50–59. Cited by: §I.
  • [2] L. J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. External Links: Link, 1607.06450 Cited by: §IV-C.
  • [3] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35 (8), pp. 1798–1828. Cited by: §I, §I.
  • [4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §V-A.
  • [5] T. Bui, L. S. F. Ribeiro, M. Ponti, and J. P. Collomosse (2018) Sketching out the details: sketch-based image retrieval using convolutional neural networks with multi-stage regression. Comput. Graph. 71, pp. 77–87. External Links: Link, Document Cited by: §V-A, §V-B.
  • [6] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. Cited by: §V-C.
  • [7] G. B. Cavallari, L. S. Ribeiro, and M. A. Ponti (2018) Unsupervised representation learning using convolutional and stacked auto-encoders: a domain and cross-domain feature space analysis. In 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 440–446. Cited by: §V.
  • [8] I. Chalkidis and D. Kampas (2019) Deep learning in law: early adaptation and legal word embeddings trained on large corpora. Artificial Intelligence and Law 27 (2), pp. 171–198. Cited by: §I.
  • [9] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever (2020) Generative pretraining from pixels. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 1691–1703. External Links: Link Cited by: §III-D.
  • [10] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §V-C.
  • [11] X. Chen and K. He (2021) Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15750–15758. Cited by: §V-C.
  • [12] S. Christin, É. Hervet, and N. Lecomte (2019) Applications for deep learning in ecology. Methods in Ecology and Evolution 10 (10), pp. 1632–1644. Cited by: §I.
  • [13] D. M. Dimiduk, E. A. Holm, and S. R. Niezgoda (2018) Perspectives on the impact of machine learning, deep learning, and artificial intelligence on materials, processes, and structures engineering. Integrating Materials and Manufacturing Innovation 7 (3), pp. 157–172. Cited by: §I.
  • [14] F. P. dos Santos, L. S. Ribeiro, and M. A. Ponti (2019) Generalization of feature embeddings transferred from different video anomaly detection domains. Journal of Visual Communication and Image Representation 60, pp. 407–416. Cited by: §V, §V.
  • [15] F. P. Dos Santos, C. Zor, J. Kittler, and M. A. Ponti (2020) Learning image features with fewer labels using a semi-supervised deep convolutional network. Neural Networks 132, pp. 131–143. Cited by: §V-A.
  • [16] F. P. dos Santos and M. A. Ponti (2019) Alignment of local and global features from multiple layers of convolutional neural network for image classification. In 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), pp. 241–248. Cited by: §V.
  • [17] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations, External Links: Link Cited by: §III-D.
  • [18] A. Esteva, A. Robicquet, B. Ramsundar, V. Kuleshov, M. DePristo, K. Chou, C. Cui, G. Corrado, S. Thrun, and J. Dean (2019) A guide to deep learning in healthcare. Nature medicine 25 (1), pp. 24–29. Cited by: §I.
  • [19] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §I.
  • [20] J. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, M. G. Azar, et al. (2020) Bootstrap your own latent: a new approach to self-supervised learning. arXiv preprint arXiv:2006.07733. Cited by: §V-C.
  • [21] G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pp. 2535–2544. Cited by: §V-A.
  • [22] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §V-C.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §IV-A.
  • [24] C. F. Higham and D. J. Higham (2019) Deep learning: an introduction for applied mathematicians. Siam review 61 (4), pp. 860–891. Cited by: §I.
  • [25] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §IV-D.
  • [26] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <1mb model size. CoRR abs/1602.07360. External Links: Link, 1602.07360 Cited by: §IV-A.
  • [27] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, pp. 448–456. Cited by: §IV-C.
  • [28] L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE transactions on pattern analysis and machine intelligence. Cited by: §V-C.
  • [29] S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2661–2671. Cited by: §V.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25, pp. 1097–1105. Cited by: §III-B.
  • [31] J. L. Leevy, T. M. Khoshgoftaar, R. A. Bauder, and N. Seliya (2018) A survey on addressing high-class imbalance in big data. Journal of Big Data 5 (1), pp. 1–30. Cited by: §IV-E.
  • [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Lecture Notes in Computer Science, Vol. 9905, pp. 21–37. External Links: Link, Document Cited by: §IV-A.
  • [33] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang (2021) Self-supervised learning: generative or contrastive. IEEE Transactions on Knowledge and Data Engineering. Cited by: §V-C.
  • [34] I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: §IV-B.
  • [35] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga (2017) Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble. arXiv preprint arXiv:1703.03108. Cited by: §VI-B.
  • [36] L. McInnes, J. Healy, N. Saul, and L. Großberger (2018) UMAP: uniform manifold approximation and projection. Journal of Open Source Software 3 (29), pp. 861. Cited by: §II-A.
  • [37] R. F. Mello and M. A. Ponti (2018) Machine learning: a practical approach on the statistical learning theory. Springer. Cited by: §II-D, §II-D.
  • [38] J. Mongan, L. Moy, and C. E. Kahn Jr (2020) Checklist for artificial intelligence in medical imaging (claim): a guide for authors and reviewers. Radiological Society of North America. Cited by: §II-A.
  • [39] A. Nguyen, J. Yosinski, and J. Clune (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 427–436. Cited by: §II-E.
  • [40] T. Nguyen, M. Raghu, and S. Kornblith (2021) Do wide and deep networks learn the same things? uncovering how neural network representations vary with width and depth. In International Conference on Learning Representations, External Links: Link Cited by: §III-B.
  • [41] C. G. Northcutt, A. Athalye, and J. Mueller (2021) Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749. Cited by: §II-A, §II-D.
  • [42] A. Odena, V. Dumoulin, and C. Olah (2016) Deconvolution and checkerboard artifacts. Distill. External Links: Link, Document Cited by: §III-A.
  • [43] M. A. Ponti, G. B. P. da Costa, F. P. Santos, and K. U. Silveira (2019) Supervised and unsupervised relevance sampling in handcrafted and deep learning features obtained from image collections. Applied Soft Computing 80, pp. 414–424. Cited by: §V, §V.
  • [44] M. Ponti, L. S.F. Ribeiro, T. S. Nazare, T. Bui, and J. Collomosse (2017) Everything you wanted to know about deep learning for computer vision but were afraid to ask. In SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T 2017), pp. 1–25. Cited by: §I.
  • [45] M. Ponti (2011) Combining classifiers: from the creation of ensembles to the decision fusion. In 2011 24th SIBGRAPI Conference on Graphics, Patterns, and Images Tutorials, pp. 1–10. Cited by: §VI-B.
  • [46] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 91–99. External Links: Link Cited by: §IV-A.
  • [47] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666. Cited by: §II-A.
  • [48] M. Roberts, D. Driggs, M. Thorpe, J. Gilbey, M. Yeung, S. Ursprung, A. I. Aviles-Rivero, C. Etmann, C. McCague, L. Beer, et al. (2021) Common pitfalls and recommendations for using machine learning to detect and prognosticate for covid-19 using chest radiographs and ct scans. Nature Machine Intelligence 3 (3), pp. 199–217. Cited by: §I, §II-A.
  • [49] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §III-B.
  • [50] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 2488–2498. Cited by: §IV-C.
  • [51] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815–823. Cited by: §V-B.
  • [52] C. Shorten and T. M. Khoshgoftaar (2019) A survey on image data augmentation for deep learning. Journal of Big Data 6 (1), pp. 1–48. Cited by: §IV-E, §VI-B.
  • [53] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. External Links: Link Cited by: §III-B.
  • [54] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. CoRR abs/1607.08022. External Links: Link, 1607.08022 Cited by: §IV-C.
  • [55] L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: §II-A.
  • [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §III-D.
  • [57] R. F. Wolff, K. G. Moons, R. D. Riley, P. F. Whiting, M. Westwood, G. S. Collins, J. B. Reitsma, J. Kleijnen, and S. Mallett (2019) PROBAST: a tool to assess the risk of bias and applicability of prediction model studies. Annals of internal medicine 170 (1), pp. 51–58. Cited by: §II-A.
  • [58] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, R. C. Wilson, E. R. Hancock, and W. A. P. Smith (Eds.), External Links: Link Cited by: §III-B.
  • [59] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny (2021) Barlow twins: self-supervised learning via redundancy reduction. arXiv preprint arXiv:2103.03230. Cited by: §V-C.
  • [60] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), Cited by: §II-D.
  • [61] R. Zhang (2019) Making convolutional networks shift-invariant again. In ICML, Cited by: §III-C.
  • [62] Y. Zheng, J. Huang, T. Chen, Y. Ou, and W. Zhou (2019) CNN classification based on global and local features. In Real-Time Image Processing and Deep Learning 2019, Vol. 10996, pp. 109960G. Cited by: §V.