Friendly Training: Neural Networks Can Adapt Data To Make Learning Easier

06/21/2021 · Simone Marullo et al. · Università di Siena, UNIFI

In the last decade, motivated by the success of Deep Learning, the scientific community proposed several approaches to make the learning procedure of Neural Networks more effective. When focussing on the way in which the training data are provided to the learning machine, we can distinguish between the classic random selection of stochastic gradient-based optimization and more involved techniques that devise curricula to organize data, and progressively increase the complexity of the training set. In this paper, we propose a novel training procedure named Friendly Training that, differently from the aforementioned approaches, involves altering the training examples in order to help the model to better fulfil its learning criterion. The model is allowed to simplify those examples that are too hard to be classified at a certain stage of the training procedure. The data transformation is controlled by a developmental plan that progressively reduces its impact during training, until it completely vanishes. In a sense, this is the opposite of what is commonly done in order to increase robustness against adversarial examples, i.e., Adversarial Training. Experiments on multiple datasets are provided, showing that Friendly Training yields improvements with respect to informed data sub-selection routines and random selection, especially in deep convolutional architectures. Results suggest that adapting the input data is a feasible way to stabilize learning and improve the generalization skills of the network.

I Introduction

The outstanding results yielded by Neural Networks in several real-world tasks in the last decade [11, 18, 3] have strongly motivated researchers to put even more effort into improving the design of neural architectures and to introduce tools to make the learning procedure more effective. In the framework of optimization-based learning, a number of improved methods are nowadays extremely popular across different application fields: regularization techniques [20], learnable normalization functions [8], powerful adaptive methods to estimate the appropriate learning rates [10], to name a few.

In this paper we propose an approach that introduces a novel perspective in the learning dynamics of Neural Networks, and is compatible with the aforementioned techniques. In particular, we consider the possibility of extending the search space of the learning algorithm, allowing the network not only to adapt its weights and biases, but also to adapt the training data on-the-fly, in order to facilitate the fulfilment of its learning criterion. Of course, such adaptation, which we also refer to as “simplification”, must be controlled and embedded into a precise developmental plan in which the machine is progressively constrained to reduce the amount of simplification, until it ends up handling the original training data as they are.

The key intuition behind this training strategy, which we named Friendly Training (FT), is that instead of exposing the network to an uncontrolled variety of data with heterogeneous properties over the input space, the learning can rather be guided by the information that the network has learnt to process so far. The aim of such a technique is to operate on noisy data, outliers, and, more generally, on whatever falls into the areas of the input space that the network finds hard to handle at a certain stage of the learning process. Such data are modified to mitigate the impact of the information that is inconsistent with what has been learnt so far. In a sense, FT alters data so that they are just slightly more complicated than what the learner is currently able to process, thus inducing the network to smoothly update the learnt function. This procedure is iterated until, at the end of training, the learnt function is consistent with the information carried by the original training data (see the toy example of Fig. 1). This strategy echoes the pedagogical approach formulated by the psychologist Lev Vygotsky [22], according to which children learn by a progressive reduction of the so-called Zone of Proximal Development, i.e., the space between what the learner can already do autonomously and what they cannot yet do alone, which contains tasks that the learner can accomplish if appropriately guided. In our approach, the network itself decides how to adapt the data to foster its learning process, without the need of extra information on the problem or additional scoring functions.

The idea of providing data to the network following an easy-data-first plan has been popularized in the last decade by Curriculum Learning (CL) [2]. In this case, the training set is progressively expanded by adding more difficult data, which are selected and ordered according to a scoring function that might come from further knowledge on the considered problem or other heuristics. A recent study [23] on the impact of CL showed the importance of CL when dealing with noisy settings or in case of limited time or computational resources. The idea of CL has been recently applied to the case of convolutional architectures with a progressive smoothing of the convolutional feature maps [19]. A related research area is the one of Self-Paced Learning [12], in which some examples are either excluded from the training set or their impact in the loss function is kept low if some conditions, modeled by a specific regularizer, are met [14]. Such conditions depend on the current state of the classifier, and the whole process is repeated multiple times. Differently from what we propose, these techniques do not alter the training data. From the algorithmic point of view, our approach can be seen as related to the so-called Adversarial Training strategies [15, 24], although oriented toward completely different goals. Such strategies exploit the idea of providing the network with examples specifically generated to be potentially capable of fooling the classifier. Such samples are then added to the original training data to make the classifier more robust to adversarial attacks. The price to pay is that this procedure might have a negative impact on the generalization skills of the classifier [17]. In this respect, it has been recently shown that it is better to limit the data alteration during the early stages of the learning process, in order to implement a Friendly Adversarial Training policy [24]. We partially borrowed this intuition, although focussing on the opposite direction, i.e., we let the classifier help itself by altering the data, instead of creating more difficult training conditions. The aim of our approach is to improve the network generalization capability by letting the network learn in a friendly environment.

In this manuscript, we provide an experimental analysis conducted on five datasets available in the related literature. Results suggest that Friendly Training is a feasible way to improve the generalization skills of the network, and to create a smooth evolution of the learning environment, avoiding the exposure of the network to uncontrolled information at the wrong time. The simplifications, on average, look more structured in (deep) convolutional architectures, leading to more effective improvements, while their impact in fully-connected networks is more spread over all the dimensions of the input space, resulting in less appreciable gains. In the following, we provide an in-depth analysis of the sensitivity of the system to the newly introduced hyper-parameters, and we compare Friendly Training with randomly ordered data and with curricula defined by ad-hoc criteria.

The contributions of this paper are the following: (1) we propose a novel training strategy, named Friendly Training (FT), that allows the machine to partially simplify the data by automatically determining how to alter them; (2) we propose a developmental plan that allows the effects of FT to progressively fade out, in order to create a smooth transition from the simplified data to the original ones; (3) we experimentally evaluate the FT approach on convolutional and fully connected neural architectures with different numbers of layers and considering five different datasets, analyzing the impact of the simplification.

This paper is organized as follows. FT, including the developmental plan, is described in Section II. The relations with existing work are analyzed in depth in Section III. Experiments and results are described in Section IV, while conclusions and suggestions for future work are drawn in Section V.

Fig. 1: Temporal evolution of the decision boundary developed by a single hidden layer network (5 neurons, tanh activation, Adam optimizer) when learning on the two-moon dataset. Each plot refers to a different training iteration (10, 300, 600, 1000), and reports the full training set (semi-transparent empty circles) and the data of the last processed mini-batch (filled circles). Top row: classic training. Bottom row: Friendly Training. Both experiments were initialized with the same weights and shared all the hyper-parameters. Those mini-batch examples (filled circles) that appear in different locations in the bottom row with respect to the top row have been temporarily altered by Friendly Training. As can be noticed, the decision boundary adapts to the training data more smoothly and coherently when Friendly Training is employed.

II Friendly Training

We consider a generic classification problem in which we are given a training set $\mathcal{X}$ composed of $n$ supervised pairs, $\mathcal{X} = \{(x_k, y_k),\ k = 1, \ldots, n\}$, being $x_k$ a training example labeled with $y_k$.¹ Given some input data $x$, we denote with $f(x, w)$ the function computed by a Neural Network-based classifier with all its weights and biases stored into vector $w$. When optimizing the model exploiting a mini-batch based stochastic gradient descent procedure, at each step of the training routine the following loss function $L$ is used for computing the gradient with respect to $w$ and updating the model parameters:

$$L(w) = \frac{1}{b} \sum_{k=1}^{b} \ell\big(f(x_k, w), y_k\big), \qquad (1)$$

where $\mathcal{B} \subset \mathcal{X}$ is a mini-batch of data of size $b$, $(x_k, y_k) \in \mathcal{B}$, and $\ell$ is the loss restricted to a single example. For simplicity, we assumed to aggregate the contributions of $\ell$ by averaging over the mini-batch data. The set $\mathcal{B}$ can be sampled with different strategies. In the most common case of stochastic gradient optimization, a set of mini-batches is randomly sampled at each training epoch, in order to cover the whole set $\mathcal{X}$ without intersections among the mini-batches.

¹We consider the case of classification mostly for the sake of simplicity. The proposed approach goes beyond classification problems.

Of course, the data that populate $\mathcal{X}$ might include examples with different properties, in a task-dependent manner. To give some examples, the distribution of the training data could be multi-modal, might include outliers, might span several disjoint manifolds, and so on. However, the aforementioned training procedure provides data to the machine independently of the state of the network and with no control on the information carried by such data. If additional information is available on the problem at hand, it might be exploited to devise curricula providing simple-examples-first, as in CL [2]. However, it is very unlikely to have information on the complexity of examples in advance and, even more importantly, the human-based criteria might not match the way in which examples are processed by the machine. The per-example loss $\ell(f(x_k, w), y_k)$ could be used as an indicator to estimate such simplicity, to exclude those examples with too large a loss or to reduce their contribution in Eq. (1), similarly to what is done in [12, 14].²

²We postpone to Section III a more detailed description of these strategies.

In this paper, instead, we propose an alternative approach that, differently from what has been described so far, analyzes the training data according to the state of the learner, allowing the network to modify such data, possibly discarding the parts of information that are too complex to be handled by the model at that moment, while preserving what sounds more coherent with the expectation of the current classifier. Notice that this is significantly different from deciding whether or not to keep a training example, from giving a small or big weight to it in Eq. (1), or from simply re-ordering the examples. Interestingly, FT is compatible with (and not necessarily an alternative to) the aforementioned existing strategies.

Formally, each example $x_k$ is altered to $\tilde{x}_k$ by adding a learnable perturbation $\delta_k$,

$$\tilde{x}_k = x_k + \delta_k, \qquad (2)$$

where $\delta_k$ has the same shape as $x_k$. Given a mini-batch $\mathcal{B}$, we indicate with $\Delta$ the matrix that collects the perturbations associated to the mini-batch examples. In detail, the $k$-th row of $\Delta$ is the perturbation associated with the $k$-th example in $\mathcal{B}$. For convenience in the notation, we avoid mentioning training epochs in what follows, and we describe the training procedure as the iterative processing of mini-batches of data, updating $w$ after each of them. Let us denote with $t$ the iteration index. We re-define the aggregated loss of Eq. (1) by providing the network with $\tilde{x}_k$ instead of $x_k$, introducing the dependency on $\Delta$, and by adding the iteration index $t$,

$$L(w, \Delta, t) = \frac{1}{b} \sum_{k=1}^{b} \ell\big(f(x_k + \delta_k, w), y_k\big), \qquad (3)$$

where $\tilde{x}_k = x_k + \delta_k$ is defined as in Eq. (2), $x_k \in \mathcal{B}_t$ (i.e., $\mathcal{B}_t$ is the mini-batch at iteration $t$) and $\delta_k$ is the $k$-th row of $\Delta$. Jointly optimizing Eq. (3) with respect to $w$ and $\Delta$ allows the network not only to adapt its weights and biases in order to better cope with the learning criterion, but also to alter the data in $\mathcal{B}_t$ by translating them in those space regions that can be more easily classified. However, the loss of Eq. (3) does not introduce any constraints on each $\delta_k$. Hence, the network is free to change the training data without any guarantees that the simplification amount will reduce while learning proceeds. Moreover, differently from $w$, the set of perturbations is specifically associated with the data in $\mathcal{X}$ (i.e., each training example is associated with its own perturbation), meaning that the number of variables of the optimization problem becomes a function of the size of the training data.
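For concreteness, the following PyTorch-style sketch shows how the loss of Eq. (3) can be computed on perturbed inputs, so that gradients are available with respect to both the weights and the perturbations; function and variable names are illustrative and not taken from the paper.

```python
# Sketch of Eqs. (2)-(3): the mini-batch loss is evaluated on x + delta, so that
# autograd can differentiate it w.r.t. both the model parameters and delta.
import torch
import torch.nn.functional as F

def perturbed_loss(model, x, y, delta):
    # x: mini-batch of examples, delta: learnable perturbations, one row per example
    x_tilde = x + delta                # Eq. (2): "simplified" examples
    logits = model(x_tilde)
    return F.cross_entropy(logits, y)  # Eq. (3): loss averaged over the mini-batch
```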

We frame Eq. (3) in the context of a developmental plan that solves both these issues. First, the system is enforced to reduce the perturbations as the number of training iterations increases. If $T$ is the maximum number of allowed iterations, we ensure that after $T_s \le T$ steps the data are not perturbed anymore. Secondly, we remove the dependence of the number of optimization variables on the size of the training set, introducing an Alternate Optimization scheme in which we decouple the optimization of perturbations and weights. In detail, we consider a single matrix $\Delta$ that is shared by all the mini-batches (the total number of rows in $\Delta$ is equal to the size of a single mini-batch). At the beginning of each training iteration, we keep $w$ fixed, we initialize $\Delta$ to zeros and then we estimate the appropriate perturbations for the current data in $\mathcal{B}_t$ by gradient descent over the variable $\Delta$ (second argument of Eq. (3)). We indicate with $s_t$ the number of iterative steps of such inner optimization. The value of $s_t$ controls the amount of alteration of the data. For small values of $s_t$ the network will only marginally simplify the data, while for a larger $s_t$ the data alteration will be more aggressive. The initial value $s_{\max}$ is a fixed hyper-parameter, while we considered a quadratic law to progressively reduce $s_t$ as a function of $t$,

(4)
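The text above fixes only the qualitative shape of this schedule: quadratic, starting at $s_{\max}$ and vanishing after $T_s$ iterations. The helper below is one plausible instantiation under those assumptions, not necessarily the exact law used in the paper.

```python
# One plausible quadratic developmental plan: s_t starts at s_max and decays to 0 at T_s.
def simplification_steps(t, s_max, T_s):
    if t >= T_s:
        return 0
    return int(round(s_max * (1.0 - t / T_s) ** 2))
```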

Afterwards, we update the values of $w$, given the perturbed data obtained at the end of the just described inner optimization. This developmental plan allows the system to adapt the data in order to make the learning procedure less disruptive, especially during the early stages of learning, introducing a smooth optimization path driven by the evolution of $s_t$. The detailed training procedure is formally reported in Algorithm 1,

0:  Training set $\mathcal{X}$, initial weights and biases $w$, batch size $b$, max learning steps $T$, max simplification steps $s_{\max}$, plan length $T_s$, learning rates $\alpha$ and $\alpha_\delta$, shared matrix $\Delta$ with $b$ rows and as many columns as the input dimension.
0:  The final $w$.
1:  for $t = 1$ to $T$ do
2:     Sample $\mathcal{B}_t$ of size $b$ from $\mathcal{X}$
3:     Compute $s_t$ following Eq. (4).
4:     Set all the entries of $\Delta$ to 0
5:     for $j = 1$ to $s_t$ do
6:        Compute $\partial L / \partial \Delta$, see Eq. (3)
7:        $\Delta \leftarrow \Delta - \alpha_\delta \, \partial L / \partial \Delta$
8:     end for
9:     Compute $\partial L / \partial w$, see Eq. (3)
10:     $w \leftarrow w - \alpha \, \partial L / \partial w$
11:  end for
12:  return  $w$
Algorithm 1 Friendly Training.

and in the following lines we provide some further details. Notice that while the weight update equation (line 10) can include any existing adaptive learning rate estimation procedure, in our current implementation $\Delta$ is updated with a fixed small learning rate $\alpha_\delta$ (line 7). While Algorithm 1 formally returns the weights after having completed the last training iteration, as usual, the best configuration of the classifier can be selected by measuring the performance on a validation set, when available. Another important fact to mention is that when the prediction on an example perfectly matches its target, line 6 will return a zero gradient, hence no simplification is performed. In our implementation we slightly relaxed this condition by zeroing the rows of the gradient of line 6 associated to those examples that are classified with a confidence above a given threshold $\rho$ (see Section IV), which implements a selective early-stopping condition on the inner optimization.
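Putting the pieces together, the sketch below mirrors the structure of Algorithm 1 in PyTorch. It is a reading aid under stated assumptions, not the authors' implementation: the optimizer for the weights, the inner learning rate, the confidence threshold $\rho$ and the quadratic schedule are plausible choices consistent with the description above.

```python
# Sketch of Algorithm 1 (Friendly Training). Assumptions: Adam for the weights (line 10),
# plain gradient descent with a fixed learning rate for Delta (line 7), a quadratic plan
# as in the earlier schedule sketch, and early stopping based on the confidence rho.
import torch
import torch.nn.functional as F

def friendly_training(model, loader, T, s_max, T_s, lr_w=1e-3, lr_delta=1e-2, rho=0.95):
    opt = torch.optim.Adam(model.parameters(), lr=lr_w)
    batches = iter(loader)
    for t in range(T):                                        # line 1
        try:
            x, y = next(batches)                              # line 2: sample a mini-batch
        except StopIteration:
            batches = iter(loader)
            x, y = next(batches)
        s_t = 0 if t >= T_s else int(round(s_max * (1.0 - t / T_s) ** 2))  # line 3, Eq. (4)
        delta = torch.zeros_like(x, requires_grad=True)       # line 4: shared Delta reset to 0
        for _ in range(s_t):                                  # lines 5-8: inner optimization
            logits = model(x + delta)
            loss = F.cross_entropy(logits, y)                 # Eq. (3)
            grad = torch.autograd.grad(loss, delta)[0]        # line 6
            with torch.no_grad():
                # selective early stop: examples already predicted with confidence > rho
                # on their target class are not perturbed any further
                conf = F.softmax(logits, dim=1).gather(1, y.view(-1, 1)).squeeze(1)
                grad[conf > rho] = 0.0
                delta -= lr_delta * grad                      # line 7
        opt.zero_grad()
        F.cross_entropy(model(x + delta.detach()), y).backward()  # line 9
        opt.step()                                            # line 10
    return model                                              # line 12
```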

We qualitatively show the behavior of the proposed training strategy in the toy example of Fig. 1. A very simple network with one hidden layer (5 neurons with hyperbolic tangent activation function) is trained on the popular two-moon dataset (two data distributions shaped as interleaving half-circles), optimized by Adam on mini-batches. When the network (having the same initial weights) is trained with the Friendly Training algorithm, it is less subject to oscillations in the learning process, resulting in a more controlled development of the classifier and leading to a final decision boundary that better fits the data distribution. With the Friendly Training approach, the mini-batch examples (filled dots) are altered during the first iterations (second row, first two pictures; compare their position with the corresponding pictures in the first row, which represent Classic Training). Notice that the mini-batch examples occupy their original positions during later stages (final stage of the developmental plan).

III Relationships with Existing Work

The idea of training neural networks with a learning methodology that “gradually” changes the learning environment (which proves to be appropriate not only in humans but also in animals [16]) traces back to almost three decades ago [4] and has been known for long as Curriculum Learning (CL) [2]. Taking inspiration from the common experience of learning in humans, CL aims at designing an optimal learning plan in which the learning agent is exposed to simple, easily-discernible examples at first, and later to gradually harder examples, also progressively increasing the size of the training set. CL proved to be successful in training deep networks on a variety of tasks [5, 7]. The concrete effect of CL has been recently revisited [23], showing how the positive impact of curriculum-based learning strategies is mostly evident in noisy setups (e.g., random permutation of labels) or large-data regimes with limited time availability. Nonetheless, recent evidence suggests that elementary curriculum-based tricks might help in vision-related tasks, at least when very deep and wide networks are used. Curriculum By Smoothing (CBS) [19] is built upon the idea of applying Gaussian-based low-pass filters to the feature maps of Convolutional Neural Networks (CNNs) that process image data, with a variance that progressively goes to zero as training proceeds. CL remains a somewhat controversial topic and has not been widely adopted by the machine learning community, but more research may be needed before ruling out this approach. In fact, there is a quite general consensus on the fact that what happens in the early stages of deep network training affects the network behavior at steady state [9]. For instance, the authors of [1] showed that inflicting visual impairments on deep CNNs in the first epochs leads to impaired visual skills, even when recovery time is conceded.

The purpose of Friendly Training, on the contrary, is to downplay difficult, confusing and outlier examples at the beginning, while letting them contribute to the generalization capability once the learner has already acquired basic skills. With respect to CL, Friendly Training does not alter the size of the dataset as a function of the number of training iterations, nor the relative ordering in which examples are presented to the network. Moreover, it does not assume any predefined complexity criteria. Conversely, Friendly Training alters every single example in an adaptive way, unless it is correctly predicted with a sufficiently high confidence, and only afterwards the example is used to update the network weights. In so doing, the network is exposed, at every training step, to a data population with quite large variability, although its distribution has been slightly adapted, in order to decrease the occurrence of abrupt weight changes. Differently from CBS, Friendly Training can be applied to any type of data and, interestingly, it is compatible with CBS, and not necessarily an alternative to it.

Self-Paced Learning (SPL) [12] is a technique inspired by CL, originally designed for Structural Support Vector Machines (SSVMs) with latent variables. Self-paced stands for the fact that the curriculum is determined by the pupil’s abilities (the classifier’s behaviour) rather than by a teacher’s plan. Since this direction proved to be effective when compared with standard algorithms (latent variable models correspond to hard optimization problems), researchers adapted this formulation to CNNs [14]. The basic idea consists of searching for suitable example-specific weighting coefficients in the loss computation. In contrast to this approach, what we propose is not about weighting the importance of training examples, but rather about temporarily altering them. However, we do embrace the idea that the state of the classifier is what can be used to determine how to deal with a particular input/target pair, gradually exposing the learner to more and more difficult examples, with an explicit temporal dynamics.

Another technique that is somehow related to Friendly Training is the so-called Adversarial Training (AT) [6, 15], which may be seen as the inverse learning technique of FT. Developed as an empirical defense strategy against adversarial attacks, which seriously affect neural networks operating on high-dimensional spaces [6], AT incorporates adversarial data into the training process. Most AT techniques rely on a minimax optimization problem [15], since the goal is to generate adversarial examples that strongly fool the classifier. Each generated example is an artificial element, lying very close to an original training data point of a certain class, but classified incoherently with respect to the label attached to such original point. Interestingly, in AT the system basically alters examples as in Eq. (2), even if the perturbation is computed with a different criterion and typically with no temporal dynamics.

Friendly Adversarial Training (FAT) [24] builds on the ideas of both CL and AT. Researchers noticed [21] that the adversarial formulation sometimes hurts generalization capabilities [17]. The FAT strategy provides a more gentle learning problem, in which the generation of adversarial data is early-stopped as soon as the datapoint is misclassified. The resulting learning dynamics is such that, as learning progresses (together with accuracy and robustness), more and more iterations are needed to generate (harder) adversarial data. Our Friendly Training algorithm shares some intuitions with the FAT algorithm (consider the comments on early-stopping right after Algorithm 1), even though the iterative process that generates the input data pursues the opposite goal. Moreover, FAT does not include explicit temporal dynamics.

IV Experiments

We describe our experimental activity by introducing the considered datasets in Section IV-A, the neural architectures, the parameter tuning procedure and the competitors in Section IV-B, and by reporting and discussing the numerical and qualitative results in Section IV-C, also including an analysis of the sensitivity to the key hyper-parameters.

IV-A Datasets

In order to assess the performance of the proposed learning algorithm, we employed the datasets presented in [13]. Some of these datasets are about 10-class digit recognition problems (28×28, grayscale), and they were explicitly designed to provide harder learning conditions with respect to the well-known MNIST dataset, keeping an affordable size (62k examples per dataset, already divided into training, validation and test set). The data distribution is the product of multiple factor distributions (i.e., rotation angle, background, etc., besides factors inherent to the original data, such as the handwriting style), making them a challenging benchmark. In detail:

  • mnist-rot: MNIST digits, rotated by a random angle $\theta$, so that $\theta \in [0, 2\pi)$.

  • mnist-back-image: MNIST digits, with the background replaced with patches extracted from random public-domain images.

  • mnist-rot-back-image: MNIST digits, with both the rotation and background factors of variation combined.

We also considered other types of data, still from [13], aimed at investigating how neural networks deal with geometrical shapes and learn their properties. They share the same resolution and almost the same size as the previously mentioned datasets, but they are used for geometry-based binary classification tasks. They are:

  • rectangles-image: white rectangles, wide or tall, with the inner and outer regions filled with patches taken from random images.

  • convex: convex or non-convex white regions on a black background.

IV-B Experimental Setup

We focused on four neural network architectures: two of them are feed-forward Fully-Connected multi-layer perceptrons, referred to as FC-A and FC-B, and the others are Convolutional Neural Networks named CNN-A and CNN-B. FC-A is a simple one-hidden-layer network with hyperbolic tangent activations (10 hidden neurons), while FC-B is inherited from [1] and has 5 hidden layers (2500-2000-1500-1000-500 neurons), batch normalization and ReLU activations. CNN-A consists of 2 convolutional layers, max pooling, dropout and 2 fully connected layers; CNN-B is deeper (4 convolutional layers). Both of them employ ReLU activation functions on the convolutional feature maps (32-64 filters in CNN-A, 32-48-64-64 filters in CNN-B) and on the fully connected layer activations (9216-128 neurons for CNN-A, 5184-128 neurons for CNN-B). Cross-entropy is used as loss function, while the Adam optimizer is exploited for the network weights and biases, using mini-batches of fixed size.
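As a concrete reference, the snippet below gives one plausible PyTorch instantiation of CNN-A that is consistent with the sizes listed above (two convolutions with 32 and 64 filters on 28×28 grayscale input, max pooling, dropout, 9216-128 fully connected units); kernel sizes and the dropout probability are our assumptions, not taken from the paper.

```python
# Plausible CNN-A sketch: sizes match the description (32-64 filters, 9216-128 FC units,
# 10 output classes); kernel sizes and dropout rate are assumed.
import torch.nn as nn

cnn_a = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),   # 1x28x28 -> 32x26x26
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),  # -> 64x24x24
    nn.MaxPool2d(2),                              # -> 64x12x12
    nn.Dropout(0.25),
    nn.Flatten(),                                 # 64 * 12 * 12 = 9216
    nn.Linear(9216, 128), nn.ReLU(),
    nn.Linear(128, 10),                           # 10 classes (2 for the binary tasks)
)
```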

We compared the test error rates of these architectures when trained under the following conditions:

  • Classic Training (CT): training of the neural network parameters with random selection of examples (no intervention on data), as is most commonly done nowadays.

  • Friendly Training (FT): the FT training procedure of Algorithm 1 is exploited.

  • Easy-Examples First (EEF): mini-batch examples are sorted by loss value in ascending order and only the first $n_t$ of them are used to calculate the gradients. At the beginning, $n_t = 1$, so that only one example per mini-batch is kept, then $n_t$ grows with the training iterations following the same dynamics of FT. The criterion implemented by EEF can be identified as a basic instance of CL, closely related to FT (i.e., it has the same temporal dynamics of FT, but examples are not altered, only sub-selected); a minimal sketch is given below.
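The following fragment illustrates the EEF baseline as described above; the eef_loss helper and its names are hypothetical, and the schedule for $n_t$ is assumed to mirror the one of FT.

```python
# Minimal sketch of the EEF baseline: keep only the n_t easiest examples of the mini-batch.
import torch
import torch.nn.functional as F

def eef_loss(model, x, y, n_t):
    per_example = F.cross_entropy(model(x), y, reduction="none")  # one loss per example
    easiest = torch.sort(per_example).values[: max(1, n_t)]       # ascending order of loss
    return easiest.mean()                                         # gradient flows only through these
```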

All the experiments are executed for 200 epochs (trivially, $T$ is the number of epochs multiplied by the number of mini-batches per epoch), which consistently proved to be sufficient to obtain convergence in all the learning problems. Moreover, the metrics we report are about the epoch with the lowest validation error. All the experiments referring to the same architecture share the same initialization weights, and examples are presented in the same order.

The hyper-parameters of FT and EEF were selected by grid search. We indicate with $\rho$ the threshold above which a prediction is considered confident enough to early stop the data transformation of FT (see the description right after Algorithm 1). In detail, we considered a grid of candidate values for $s_{\max}$, $\alpha_\delta$, $T_s$ and $\rho$, and the chosen intervals proved to be suitable in preliminary investigations. Each experiment was repeated 3 times (results are averaged), varying the initialization of the weights and biases (sharing them among the competitors).

IV-C Results and Discussion

We report the test error rates we obtained in Table I, along with some reference results by other authors [13, 14] (top portion of Table I). Such reference results are about Fully-Connected networks (NNet), Deep Belief Networks (DBNs, two different architectures) and Convolutional Autoencoders (CAEs, two different architectures); see [13] for further details. General Stochastic Network (GSN) is described in [25]. Finally, SPCN is an instance of Self-Paced Learning based on CNNs, described in [14].

Regardless of the learning algorithm, the convolutional architectures CNN-A and CNN-B consistently achieve lower error rates, as expected. Several of the reference results are beaten by the architectures we experimented with, also in the case of CT, due to the fact that we considered models with a larger number of parameters. Nonetheless, the general trend is coherent with the reference experiments, confirming the validity of our experimental setup (e.g., compare FC-A and FC-B with NNet, or SPCN with CNN-A and CNN-B).

When comparing CT, EEF, and FT, we notice that albeit EEF occasionally improves the baseline (CT) result (e.g., in the mnist-back-image task with convolutional networks), it degrades the performance on most of the datasets, confirming that simply keeping the low-loss examples of the training set, although injected in a progressive developmental plan, does not work as a trivial trick to improve performance. Differently, the FT algorithm clearly improves the test error rate when used on the convolutional architectures CNN-A and CNN-B, providing better results than the competitors in most of the cases. In the case of CNN-A, the error is systematically lowered (sometimes in a statistically significant manner), with the exception of the convex dataset. Regarding CNN-B, we still get improvements in four out of five datasets, even if not as evident as in CNN-A. These results confirm the validity of the proposed learning technique. Fully-Connected networks FC-A and FC-B, conversely, are usually not improved by FT, with a few exceptions. A further inspection of the way the system computes the perturbation offsets revealed important elements that explain these results, pointing out interesting facets of FT that we describe in the following.

Table I layout: columns are Classifier, mnist-back-image, mnist-rot-back-image, mnist-rot, rectangles-image, convex; rows are the reference results NNet, DBN-1, DBN-3, CAE-1, CAE-2, GSN, SPCN (some entries n.a.), followed by FC-A, FC-B, CNN-A and CNN-B, each trained with CT, EEF and FT.

TABLE I:

Performance comparison of classifiers with different architectures (FC-A, FC-B, CNN-A, CNN-B) and learning algorithms (CT, EEF, FT). Mean test error (smaller is better) is reported along with standard deviation. The first portion of rows is about reference results taken from existing work (see the paper text for more details). For each architecture, we report in bold those results that improve the baseline (CT) case.

We qualitatively evaluated the perturbation of Eq. (2), considering the CNN-A model of Table I. In Fig. 2 we report the perturbations at different steps of the first epoch, applied to four randomly selected examples. The perturbation on the left concerns an example altered at the very early stages of the training procedure. The noise level decreases as learning proceeds (left-to-right in Fig. 2), until we see that the system stops perturbing the background and some regions of the digit area are emphasized. This is coherent with the fact that, once the network learns to focus on the digit area to discriminate the data, there is no need to alter the background anymore.

Fig. 2: Randomly selected perturbation offsets ($\delta$) taken from different stages of the first epoch of CNN-A (mnist-back-image). Left: perturbations generated close to the beginning of training. Moving toward the right: perturbations generated closer to the end of the epoch. As the system evolves, the perturbation mostly involves the portions of the image covered by the digits.

Then, we inspected the differences among the perturbations applied by the considered neural architectures, showing results in Fig. 3. In order to compute the $\delta$'s, gradients are backpropagated by the FT algorithm across all the layers up to the input, and the network architecture plays an important role in shaping the perturbations. Interestingly enough, convolutional networks consistently provide more structured simplifications (Fig. 3(b), 3(d)), with slight manipulation of low-level local visual features in salient regions covered by the digits. On the other hand, the $\delta$'s obtained from the FC-A and FC-B models (Fig. 3(a), 3(c)) are very noisy, with a limited emergence of visual structure in the deeper model FC-B (Fig. 3(c)). However, most of the pixels still get altered.

(a) FC-A (b) CNN-A (c) FC-B (d) CNN-B
Fig. 3: Original data (first column), perturbation (second column - emphasized by normalization to make it more visible) and resulting “simplified” images (third column) for different networks (Table I) at the end of the first epoch. Some simplifications are hardly distinguishable by a human.

This noisy perturbation might be the reason behind the limited success of FT in the fully-connected networks, and suggests directions for future improvements. As a matter of fact, the simplified examples become very artificial and far from the distribution of the original training data. In Fig. 4, we report the evolution of the error rate during the training epochs (mnist-back-image, CNN-A), comparing FT and CT. We also report a black curve that is proportional to $s_t$, thus showing how the developmental plan reduces the impact of the perturbation (when it becomes zero, data are not altered anymore).

Fig. 4: Training and test error rates for FT and CT on a single run (mnist-back-image, CNN-A architecture). The black curve (plan) is proportional to $s_t$.

The small bump right before 100 epochs is due to the final transition from altered to original data. The test error of FT is higher than the one of CT when the data are altered, as expected, while it becomes lower when the simplification rate vanishes.

We also evaluated the sensitivity of the system to the main hyper-parameters of FT. In Fig. 5, we report the test error of CNN-A on the mnist-back-image dataset, for different configurations of $s_{\max}$, $\alpha_\delta$, $T_s$ and $\rho$, in a sample run that is pretty representative of the general trend we observed in the experiments. Larger values of $\rho$ reduce the frequency of early-stops in computing the perturbations, thus allowing FT to alter the data more extensively and improve the performance. The preference for the largest value of $s_{\max}$ suggests that an aggressive perturbation at the early stages of learning helps. Small learning rates $\alpha_\delta$ allow the system to avoid extreme alterations of the data, while a developmental plan covering a large fraction of the training steps seems more effective than shorter-term plans.

Fig. 5: Test error (single run) of CNN-A (mnist-back-image dataset) under different configurations of the FT hyper-parameters (see the paper text for a description).

V Conclusion and Future Work

We presented a novel training procedure named Friendly Training, that allows a network to alter the training data according to a developmental plan, in order to implicitly learn from more manageable data. Differently from related work, the network decides what information to discard from the data at different stages of the training procedure, leading to improved generalization skills and a smoother development of the decision boundaries. We plan to extend this approach by introducing a separate neural model to estimate the simplification that should be applied to the data, instead of directly optimizing the perturbations.

References

  • [1] A. Achille, M. Rovere, and S. Soatto (2018) Critical learning periods in deep networks. In International Conference on Learning Representations.
  • [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In International Conference on Machine Learning, pp. 41–48.
  • [3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems.
  • [4] J. L. Elman (1993) Learning and development in neural networks: the importance of starting small. Cognition 48 (1), pp. 71–99.
  • [5] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang (2016) Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing 25 (7), pp. 3249–3260.
  • [6] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • [7] G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pp. 2535–2544.
  • [8] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456.
  • [9] S. Jastrzebski, M. Szymczak, S. Fort, D. Arpit, J. Tabor, K. Cho, and K. Geras (2019) The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations.
  • [10] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
  • [12] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Proceedings of the 23rd International Conference on Neural Information Processing Systems, Volume 1, pp. 1189–1197.
  • [13] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio (2007) An empirical evaluation of deep architectures on problems with many factors of variation. In International Conference on Machine Learning, pp. 473–480.
  • [14] H. Li and M. Gong (2017) Self-paced convolutional neural networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2110–2116.
  • [15] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
  • [16] G. Peterson (2004) A day of great illumination: B. F. Skinner's discovery of shaping. Journal of the Experimental Analysis of Behavior 82, pp. 317–328.
  • [17] A. Raghunathan, S. M. Xie, F. Yang, J. C. Duchi, and P. Liang (2019) Adversarial training can hurt generalization. In International Conference on Machine Learning 2019 - Workshop on Deep Phenomena.
  • [18] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
  • [19] S. Sinha, A. Garg, and H. Larochelle (2020) Curriculum by smoothing. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958.
  • [21] D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. arXiv:1805.12152.
  • [22] L. S. Vygotsky (1987) Mind in Society: The Development of Higher Psychological Processes. Cambridge, MA: Harvard University Press.
  • [23] X. Wu, E. Dyer, and B. Neyshabur (2021) When do curricula work? In International Conference on Learning Representations.
  • [24] J. Zhang, X. Xu, B. Han, G. Niu, L. Cui, M. Sugiyama, and M. Kankanhalli (2020) Attacks which do not kill training make adversarial learning stronger. In International Conference on Machine Learning.
  • [25] M. Zöhrer and F. Pernkopf (2014) General stochastic networks for classification. Advances in Neural Information Processing Systems 27, pp. 2015–2023.