Statistical Measures For Defining Curriculum Scoring Function

02/27/2021
by Vinu Sankar Sadasivan, et al.

Curriculum learning is a training strategy that sorts the training examples by some measure of their difficulty and gradually exposes them to the learner to improve the network performance. In this work, we propose two novel curriculum learning algorithms, and empirically show their improvements in performance with convolutional and fully-connected neural networks on multiple real image datasets. Motivated by our insights from implicit curriculum ordering, we introduce a simple curriculum learning strategy that uses statistical measures such as standard deviation and entropy values to score the difficulty of data points for real image classification tasks. We also propose and study the performance of a dynamic curriculum learning algorithm. Our dynamic curriculum algorithm tries to reduce the distance between the network weight and an optimal weight at any training step by greedily sampling examples with gradients that are directed towards the optimal weight. Further, we also use our algorithms to discuss why curriculum learning is helpful.


1 Introduction

Stochastic Gradient Descent (SGD) (Robbins and Monro, 1951) is a simple yet widely used algorithm for machine learning optimization, and there have been many efforts to improve its performance. A number of such directions, such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012), and Adam (Kingma and Ba, 2015), improve upon SGD by fine-tuning its learning rate, often adaptively. However, Wilson et al. (2017) have shown that the solutions found by adaptive methods generalize worse even for simple overparameterized problems. Reddi et al. (2018) introduced AMSGrad hoping to solve this issue, yet a performance gap remains between AMSGrad and SGD in terms of the ability to generalize (Shirish Keskar and Socher, 2017). Hence, SGD remains one of the main workhorses of the machine learning optimization toolkit. SGD proceeds by making stochastic, unbiased estimates of the gradient on the full data (Zhao and Zhang, 2015). However, this approach does not match the way humans typically learn various tasks: we learn a concept faster if we are presented easy examples first and are then gradually exposed to examples of increasing complexity, based on a curriculum. An orthogonal extension to SGD (Weinshall et al., 2018) that shows some promise in improving its performance is to choose examples according to a specific strategy driven by cognitive science. This is curriculum learning (CL) (Bengio et al., 2009), wherein the examples are shown to the learner based on a curriculum.

Figure 1: Implicit curricula: top and bottom rows contain images that are learned at the beginning and end of training, respectively. Top rows: CIFAR-100; bottom rows: MNIST.

1.1 Related Works

Bengio et al. (2009) formalizes the idea of CL in a machine learning framework where the examples are fed to the learner in an order based on their difficulty. The notion of difficulty scoring of examples has not really been formalized, and various heuristics have been tried: Bengio et al. (2009) uses manually crafted scores, self-paced learning (SPL) (Kumar et al., 2010) uses the loss values with respect to the learner's current parameters, and CL by transfer learning (Hacohen and Weinshall, 2019) uses the loss values with respect to a pre-trained model to rate the difficulty of examples in a dataset. Among these works, what makes SPL distinctive is its dynamic CL strategy, i.e., the preferred ordering is determined dynamically over the course of the optimization. However, SPL does not really improve the performance of deep learning models, as noted in Fan et al. (2018). Similarly, Loshchilov and Hutter (2015) uses a function of rank based on the latest loss values for online batch selection to speed up neural network training. Katharopoulos and Fleuret (2018) and Chang et al. (2017) perform importance sampling to reduce the variance of stochastic gradients during training.

Graves et al. (2017) and Matiisen et al. (2019) propose teacher-guided automatic CL algorithms that employ various supervised measures to define dynamic curricula. The most recent works in CL show its advantages in reinforcement learning (Portelas et al., 2020; Zhang et al., 2020). The recent work by Weinshall et al. (2018) introduces the notion of an ideal difficulty score to rate the difficulty of examples based on their loss values with respect to a set of optimal hypotheses. They theoretically show that, for linear regression, the expected rate of convergence at a training step for an example monotonically decreases with its ideal difficulty score. This is practically validated by Hacohen and Weinshall (2019) by sorting the training examples based on the performance of a network trained through transfer learning. They also show that anti-curriculum learning, exposing the most difficult examples first, degrades network performance. However, there is a lack of theory showing that CL improves the performance of a completely trained network. Thus, while CL indicates that it is possible to improve the performance of SGD by a judicious ordering, both theoretical insights and concrete empirical guidelines for creating this ordering remain unclear. In contrast to CL (Hacohen and Weinshall, 2019), anti-curriculum learning (Kocmi and Bojar, 2017; Zhang et al., 2018, 2019) can be better than CL in certain settings. Hacohen et al. (2020) and Wu et al. (2021) investigate implicit curricula and observe that networks learn the examples in a dataset in a highly consistent order. Figure 1 shows the implicit order in which a convolutional neural network (CNN) learns data points from the MNIST and CIFAR-100 datasets. Wu et al. (2021) also shows that CL (an explicit curriculum) can be useful in scenarios with a limited training budget or noisy data. Mirzasoleiman et al. (2020) uses a coreset construction method to dynamically expose a subset of the dataset to robustly train neural networks against noisy labels. While previous CL works employ tedious methods to score the difficulty level of the examples, Hu et al. (2020) uses the number of audio sources to determine difficulty for audiovisual learning, and Liu et al. (2020) uses the norm of word embeddings as a difficulty measure for CL in neural machine translation. In light of these recent works, we discuss the idea of using statistical measures to score examples, making it easy to perform CL on real image datasets without the aid of any pre-trained network.

1.2 Our Contributions

Our work proposes two novel approaches for CL. We do a thorough empirical study of our algorithms and provide some more insights into why CL works. Our contributions are as follows:

  • We introduce a simple, novel, and practical CL approach for image classification tasks that orders examples in an unsupervised manner using statistical measures. Our insight is that statistical measures could be associated with the implicit curriculum ordering. We empirically analyze our argument for using statistical scoring measures (especially standard deviation) across combinations of multiple datasets and networks.

  • We propose a novel dynamic curriculum learning (DCL) algorithm to study the behaviour of CL. DCL is not a practical CL algorithm, since it requires knowledge of a reasonable local optimum to compute the gradients of the full data after every training epoch. DCL uses the gradient information to define a curriculum that minimizes the distance between the current weight and a desired local minimum after every epoch. However, this simplicity in the definition of DCL makes it easier to analyze its performance formally.

  • Our DCL algorithm generates a natural ordering for training the examples. Previous CL works have demonstrated that exposing a part of the data initially and then gradually exposing the rest is a standard way to set up a curriculum. We use two variants of our DCL framework to show that it is not just the subset of data that is exposed to the model that matters, but also the ordering within the exposed data partition.

  • We analyze how DCL is able to serve as a regularizer and improve the generalization of networks. Additionally, we study why CL based on standard deviation scoring works, using our DCL framework.

2 Preliminaries

At any training step $t$, SGD updates the current weight $w_t$ as $w_{t+1} = w_t - \eta_t\, g_t$, where $g_t = \nabla_w \ell(x_{i_t}, y_{i_t}; w_t)$ is the gradient of the loss of the sampled example with respect to the current weight. The learning rate and the data are denoted by $\eta_t$ and $\mathbb{X} = \{(x_i, y_i)\}_{i=1}^{N}$, respectively, where $x_i$ denotes an example and $y_i \in \{1, \dots, K\}$ its corresponding label for a dataset with $K$ classes. Without loss of generality, we assume that the dataset is normalized. We denote the learner as $f_w$. Generally, SGD is used to train $f_w$ by giving the model a sequence of mini-batches $[\mathbb{B}_1, \dots, \mathbb{B}_M]$, where each $\mathbb{B}_i \subseteq \mathbb{X}$ contains $b$ examples. Traditionally, each $\mathbb{B}_i$ is generated by uniformly sampling examples from the data. We denote this approach as vanilla. In CL, the curriculum is defined by two functions, namely the scoring function and the pacing function. The scoring function, $s(x_i, y_i)$, scores each example in the dataset and is used to sort $\mathbb{X}$ in ascending order of difficulty. A data point $(x_1, y_1)$ is said to be easier than $(x_2, y_2)$ if $s(x_1, y_1) < s(x_2, y_2)$, where both examples belong to $\mathbb{X}$. Unsupervised scoring measures do not use the data labels to determine the difficulty of data points. The pacing function, $\text{pace}(t)$, determines how much of the data is exposed to the learner at a training step $t$.

Figure 2: Images with the lowest standard deviation values (top row) and the highest standard deviation values (bottom row) in the CIFAR-100 dataset.
  Input: Data $\mathbb{X}$, batch size $b$, number of mini-batches $M$, scoring function $s$, and pacing function $\text{pace}$.
  Output: Sequence of mini-batches $[\mathbb{B}_1, \dots, \mathbb{B}_M]$.
  $\mathbb{X}' \leftarrow \mathbb{X}$ sorted according to $s$, in ascending order
  $result \leftarrow [\,]$
  for $i = 1$ to $M$ do
     $n_i \leftarrow \text{pace}(i)$
     $\mathbb{X}_i \leftarrow \mathbb{X}'[1, \dots, n_i]$
     $\mathbb{B}_i \leftarrow$ uniformly sample $b$ examples from $\mathbb{X}_i$; append $\mathbb{B}_i$ to $result$
  end for
  return $result$
Algorithm 1 Curriculum learning method.
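The following is a minimal NumPy sketch of Algorithm 1. The function and parameter names (score_fn, pace_fn) are illustrative, and sampling with replacement is our simplifying assumption.

```python
import numpy as np

def curriculum_minibatches(X, y, batch_size, num_batches, score_fn, pace_fn, seed=0):
    """Sketch of Algorithm 1: sample mini-batches from a growing, sorted prefix of the data.

    X, y are NumPy arrays of examples and labels; score_fn maps one example to a
    scalar difficulty; pace_fn maps a mini-batch index to the number of exposed examples.
    """
    rng = np.random.default_rng(seed)
    # Sort the data in ascending order of difficulty given by the scoring function.
    order = np.argsort([score_fn(x) for x in X])
    X_sorted, y_sorted = X[order], y[order]
    batches = []
    for i in range(1, num_batches + 1):
        # Only the pace_fn(i) lowest-scored examples are available at mini-batch i.
        n_exposed = min(pace_fn(i), len(X_sorted))
        idx = rng.choice(n_exposed, size=batch_size, replace=True)
        batches.append((X_sorted[idx], y_sorted[idx]))
    return batches
```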
Figure 3: Learning curves for Cases 3 (top row: CNN-8 + CIFAR-100) and 4 (bottom row: CNN-8 + ImageNet Cats). Error bars represent the standard error of the mean (STE) after 25 and 10 independent trials, respectively.

3 Statistical measures for defining curricula

In this section, we discuss our simple approach of using statistical measures to define curricula for real image classification tasks. Hacohen et al. (2020) shows that the orders in which a dataset is learned by various network architectures are highly correlated: a stronger learner first learns the examples learned by a weaker learner, and then continues to learn new examples. Can we design an explicit curriculum that sorts the examples according to the implicit order in which they are learned by a network? From Figure 1 it is clear that the CIFAR-100 images learned at the beginning of training have bright backgrounds or rich color shades, while the images learned at the end of training have monotonous colors. We observe that the CIFAR-100 images learned at the beginning of training have a higher mean standard deviation than those learned at the end of training. For MNIST, the mean standard deviation of images learned at the beginning of training is lower than that of images learned at the end of training. Motivated by this observation, we investigate the benefits of using standard deviation to define curriculum scoring functions in order to improve the generalization of the learner. We perform multiple experiments and validate our proposal over various image classification datasets with different network architectures. Standard deviation and entropy are informative statistical measures for images and are widely used in digital image processing (DIP) tasks (Kumar and Gupta, 2012; Arora, 1981). Mastriani and Giraldez (2016) uses standard deviation filters for effective edge-preserving smoothing of radar images. Natural images tend to have a higher standard deviation if they have many edges and/or a vibrant range of colors. Edges and colors are among the most important features that aid image classification at a higher level. Figure 2 shows images with the lowest and the highest standard deviations in the CIFAR-100 dataset. Entropy gives a measure of image information content and is used for various DIP tasks such as automatic image annotation (Jeon and Manmatha, 2004). We experiment with the standard deviation measure ($\sigma$), the Shannon entropy measure ($H$) (Shannon, 1951), and different norm measures as scoring functions for CL (see Algorithm 1). The performance improvement with norm measures is neither consistent nor significant over the experiments we perform (see Suppl. A for details). For a flattened image example represented as $x \in \mathbb{R}^d$, we define

$\sigma(x) = \sqrt{\tfrac{1}{d}\sum_{j=1}^{d}(x_j - \bar{x})^2}$ with $\bar{x} = \tfrac{1}{d}\sum_{j=1}^{d}x_j$, and $H(x) = -\sum_{v} p_v \log p_v$, where $p_v$ is the empirical frequency of pixel value $v$ in $x$.   (1)
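A small NumPy sketch of these two scoring measures; the 256-bin histogram used for the entropy estimate is our choice, not something fixed by the text.

```python
import numpy as np

def stddev_score(image):
    """Standard deviation of the flattened pixel values (the sigma measure)."""
    x = np.asarray(image, dtype=np.float64).ravel()
    return float(x.std())

def entropy_score(image, bins=256):
    """Shannon entropy of the pixel-intensity histogram (binning is an assumption)."""
    x = np.asarray(image, dtype=np.float64).ravel()
    hist, _ = np.histogram(x, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]  # drop empty bins so the logarithm is defined
    return float(-(p * np.log2(p)).sum())
```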

We use a fixed exponential pacing function that increases the amount of data exposed to the network exponentially after every fixed number of training steps. For a training step $t$, it is formally given as $\text{pace}(t) = \min\!\big(s_0 \cdot inc^{\lfloor t/\delta \rfloor},\, 1\big)\cdot N$, where $s_0$ is the fraction of the data that is exposed to the model initially, $inc$ is the exponential factor by which the pace function value increases after every $\delta$ training steps, and $N$ is the total number of examples in the data.
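A sketch of this pacing schedule with the parameter names used above; $s_0$, $inc$, and $\delta$ are our notation for the quantities described in the text.

```python
def exponential_pace(step, s0, inc, delta, num_examples):
    """Number of lowest-scored examples exposed at a given training step."""
    fraction = min(s0 * inc ** (step // delta), 1.0)
    return max(1, int(fraction * num_examples))
```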

3.1 Baselines

We use vanilla and CL by transfer learning (Hacohen and Weinshall, 2019), denoted as TL, as our baselines. We use the same hyperparameters and the code published by the authors (https://github.com/GuyHacohen/curriculum_learning) for running the TL experiments. TL works with the aid of an Inception network (Szegedy et al., 2016a) pre-trained on the ImageNet dataset (Deng et al., 2009). The activations of the penultimate layer of this Inception network are used as a feature vector for each of the images in the training data. These features are used to train a classifier (e.g., a support vector machine), and its confidence scores for the training images are used as the curriculum scores.

3.2 Experiments

We denote CL models that sort the examples in ascending order of $\sigma$ as stddev+, in descending order of $\sigma$ as stddev-, in ascending order of $H$ as entropy+, and in descending order of $H$ as entropy-. We employ three network architectures for our experiments: a) FCN, a fully-connected network with a single hidden layer of Exponential Linear Unit (ELU) nonlinearities (see Suppl. B.1); b) CNN-8 (Hacohen and Weinshall, 2019), a moderately deep CNN with convolutional and fully-connected layers; and c) ResNet-20 (He et al., 2016), a deep CNN. We use the following datasets: a) MNIST, b) Fashion-MNIST, c) CIFAR-10, d) CIFAR-100, e) Small Mammals (a super-class of CIFAR-100 (Krizhevsky et al., 2009)), and f) ImageNet Cats (a subset of 7 classes of cats in ImageNet, see Suppl. B.3). For our experiments, we use the same setup as Hacohen and Weinshall (2019). We use learning rates with an exponential step-decay schedule for the optimizers in all our experiments, as traditionally done (Simonyan and Zisserman, 2015; Szegedy et al., 2016b). In all our experiments, the models use fine-tuned hyperparameters for an unbiased comparison of model generalization on the test set. More experimental details are deferred to Suppl. B. While performing model training in practice, we prioritize class balance: although we do not follow the exact ordering provided by the curriculum scoring function, the ordering within each class is preserved (see the sketch below). We define 9 test cases. Cases 1–5 use CNN-8 to classify the Small Mammals, CIFAR-10, CIFAR-100, ImageNet Cats, and Fashion-MNIST datasets, respectively. Cases 6–8 use FCN to classify the MNIST, Fashion-MNIST, and CIFAR-10 datasets, respectively. Case 9 uses ResNet-20 to classify the ImageNet Cats dataset.

(a) Case 1
(b) Case 2
(c) Case 3
(d) Case 4
(e) Case 5
(f) Case 6
(g) Case 7
(h) Case 8
(i) Case 9
Figure 4: Bars represent the final mean top-1 test accuracy (in %) achieved by models in Cases 1–9. Error bars represent the STE after 25 independent trials for Cases 2, 3, 5–8, and 10 independent trials for Cases 1, 4, 9.

Figure 3 shows the improvement in network generalization of CNN-8 on the CIFAR-100 and ImageNet Cats datasets using the stddev CL algorithms. Figure 4 shows the results of all the test cases that we perform. From Figures 4(b) and 4(c) it is clear that $\sigma$ serves as a better scoring function than $H$. Further, we observe that the MNIST, Fashion-MNIST, and ImageNet Cats datasets best follow the curriculum variant stddev+, while CIFAR-100, CIFAR-10, and Small Mammals follow the curriculum defined by stddev-. As discussed earlier in this section, this trend is consistent with the stddev order in which the dataset images are implicitly learned by the network. In all the test cases, the appropriate stddev CL algorithm consistently improves the mean top-1 test accuracy over vanilla.


Dataset          stddev+   stddev-
MNIST             0.01      0.00
Fashion-MNIST     0.06      0.01
Small Mammals     0.03      0.08
CIFAR-10          0.02      0.06
CIFAR-100         0.06      0.09
ImageNet Cats     0.07      0.02
Table 1: stddev curriculum selection using median pixel distance values. For each dataset, the larger of the two distances corresponds to the curriculum variant that works best.

Let $\text{med}(\cdot)$ denote the median value of all the pixels in a set of image(s) (similar to the NumPy median function) and $m$ denote the median pixel value over the full set of training images. We denote $d_+$ and $d_-$ as the median pixel distances from $m$ of the first batches of examples sampled according to stddev+ and stddev-, respectively, where $b$ is the batch size. Interestingly, we notice that the stddev curriculum variant that works best for a dataset has the higher median pixel distance, as shown in Table 1.
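A sketch of this selection heuristic; the use of the absolute difference of medians and of a single leading batch are our reading of the text.

```python
import numpy as np

def median_pixel_distance(images_in_curriculum_order, global_median, batch_size):
    """|median pixel value of the first batch - median pixel value of the full data|."""
    head = np.asarray(images_in_curriculum_order[:batch_size], dtype=np.float64)
    return abs(np.median(head) - global_median)

def pick_stddev_variant(images, scores, batch_size):
    """Choose between stddev+ and stddev- by comparing median pixel distances (Table 1)."""
    images, scores = np.asarray(images, dtype=np.float64), np.asarray(scores)
    m = np.median(images)
    ascending = images[np.argsort(scores)]           # stddev+ ordering (low sigma first)
    d_plus = median_pixel_distance(ascending, m, batch_size)
    d_minus = median_pixel_distance(ascending[::-1], m, batch_size)
    return "stddev+" if d_plus > d_minus else "stddev-"
```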

(a) CIFAR-100
(b) ImageNet Cats
Figure 5: Bars represent the final mean top-1 test accuracy (in %) achieved by CNN-8. Error bars represent the STE after 25 and 10 independent trials, respectively.

We also test the robustness of our stddev algorithms to noisy labels. For this purpose, we design two test cases: CNN-8 classifying the CIFAR-100 and ImageNet Cats datasets with label noise. We add label noise by uniformly sampling a fraction of the data points and randomly changing their labels. Figure 5 shows that our CL algorithms work well in training settings with label noise, even with only coarse fine-tuning of the curriculum hyperparameters.
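A small sketch of this noise-injection step; the noise fraction is a free parameter, and a replacement label may coincide with the original one under this simple scheme.

```python
import numpy as np

def add_uniform_label_noise(labels, noise_fraction, num_classes, seed=0):
    """Uniformly pick a fraction of examples and assign them random labels."""
    rng = np.random.default_rng(seed)
    noisy_labels = np.asarray(labels).copy()
    n_noisy = int(noise_fraction * len(noisy_labels))
    idx = rng.choice(len(noisy_labels), size=n_noisy, replace=False)
    noisy_labels[idx] = rng.integers(0, num_classes, size=n_noisy)
    return noisy_labels
```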

4 Dynamic Curriculum Learning

For DCL algorithms (Kumar et al., 2010), examples are either scored and sorted or automatically selected (Graves et al., 2017; Matiisen et al., 2019) after every few training steps, since the scoring function changes dynamically with the learner as training proceeds. Hacohen and Weinshall (2019) and Bengio et al. (2009) use a fixed scoring function and pacing function for the entire training process. They empirically show that a curriculum helps the model learn faster in the initial phase of training. In this section, we propose our novel DCL algorithm for studying the behaviour of CL. Our DCL algorithm updates the difficulty scores of all the examples in the training data at every epoch using their gradient information. We hypothesize the following: given a weight initialization and a local minimum $w^{*}$ obtained by full training of vanilla SGD, the curriculum ordering determined by our DCL variant leads to convergence in fewer training steps than vanilla. We first describe the algorithm, then the underlying intuition, and finally validate the hypothesis using experiments. Our DCL algorithm iteratively works on reducing the L2 distance, $D_t = \lVert w_t - w^{*} \rVert_2$, between the weight parameters $w_t$ and $w^{*}$ at any training step $t$. Suppose $i_t$ is the index of the example sampled at training step $t$, and for any $T > t$, $P_{t:T}$ is the ordered set containing the indices of the training examples that are to be shown to the learner from training step $t$ through $T$. Let us define $g_t = \nabla_w \ell(x_{i_t}, y_{i_t}; w_t)$ and $\theta_t$ as the angle between $g_t$ and $w_t - w^{*}$. Then, using a geometrical argument (see Figure 6),

$D_{t+1}^2 = D_t^2 - 2\eta_t \lVert g_t \rVert_2\, D_t \cos\theta_t + \eta_t^2 \lVert g_t \rVert_2^2$.   (2)
Figure 6: A geometrical interpretation of gradient steps for the understanding of Equation (2).
  Input: Data $\mathbb{X}$, local minimum $w^{*}$, current weight $w$, batch size $b$, and pacing function $\text{pace}$.
  Output: Sequence of mini-batches $[\mathbb{B}_1, \dots, \mathbb{B}_M]$ for the next training epoch.
  $u \leftarrow (w - w^{*}) / \lVert w - w^{*} \rVert_2$
  $n \leftarrow$ number of examples to expose in this epoch, given by $\text{pace}$
  $M \leftarrow \lceil n / b \rceil$
  for $i = 1$ to $N$ do
     $score(x_i, y_i) \leftarrow -\,\nabla_w \ell(x_i, y_i; w)^{\top} u$
  end for
  $\mathbb{X}' \leftarrow \mathbb{X}$ sorted according to $score$, in ascending order
  $result \leftarrow [\,]$
  for $i = 1$ to $M$ do
     append $\mathbb{B}_i \leftarrow \mathbb{X}'[(i-1)\,b + 1, \dots, i\,b]$ to $result$
  end for
  return $result$
Algorithm 2 Dynamic curriculum learning (DCL+).

For a vanilla model, $P_{t:T}$ is generated by uniformly sampling indices from $\{1, \dots, N\}$ with replacement. Since finding an ordered set that minimizes $D_T$ is computationally expensive, we approximate the DCL algorithm (DCL+, see Algorithm 2) by neglecting the terms with coefficient $\eta_t^2$ in Equation (2). Algorithm 2 uses a greedy approach to approximately minimize $D$ by sampling examples at every epoch using the scoring function

$score(x_i, y_i) = -\lVert g_i \rVert_2 \cos\theta_i$, where $g_i = \nabla_w \ell(x_i, y_i; w)$ and $\theta_i$ is the angle between $g_i$ and $w - w^{*}$.   (3)

Let us denote the models that use the natural ordering of mini-batches greedily generated by Algorithm 2 as DCL+. DCL- uses the same sequence of mini-batches that DCL+ exposes to the network at any given epoch, but in reversed order. We empirically show that DCL+ achieves a faster and better convergence with various weight initializations.
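A NumPy sketch of the per-example DCL+ score under this reading, assuming per-example gradients have already been computed and flattened; normalizing by the distance to $w^{*}$ does not change the ordering.

```python
import numpy as np

def dcl_plus_scores(per_example_grads, w_current, w_star):
    """Score examples by how well their gradient steps point toward w_star.

    per_example_grads: (N, D) array of flattened per-example gradients at w_current.
    w_current, w_star: flattened weight vectors of shape (D,).
    Smaller score = gradient better aligned with (w_current - w_star), so sorting
    in ascending order yields the DCL+ mini-batch ordering; DCL- reverses it.
    """
    direction = w_current - w_star
    direction = direction / (np.linalg.norm(direction) + 1e-12)
    return -(per_example_grads @ direction)
```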

4.1 Experiments

(a) Experiment 1
(b) Experiment 2
Figure 7: Learning curves of experiments comparing DCL+, DCL-, and vanilla SGD. Error bars signify the standard error of the mean (STE) over independent trials.
Figure 8: Learning curves with varying $a$ for DCL+. The parameter $a$ needs to be finely tuned to improve the generalization of the network. A low value exposes only examples with little or no gradient noise to the network at every epoch, whereas a high value exposes most of the dataset, including examples with high gradient noise. A moderate value shows examples with low to moderate gradient noise. Here, a moderate $a$ generalizes the best.

In our experiments, the pacing function exposes a fixed fraction $a$ of the data in every epoch, where $a$ is a tunable hyperparameter. We use the FCN architecture to empirically validate our algorithms on a two-class subset of the MNIST dataset (Experiment 1). Since this is a very easy task (the vanilla model attains a very high training accuracy), we compare the test loss values across training steps in Figure 7(a) to see the behaviour of DCL on an easy task. DCL+ shows the fastest convergence, although all the networks achieve the same test accuracy, and it reaches vanilla's final test loss at an earlier training step. In Experiment 2, we use FCN to evaluate our DCL algorithms on the relatively difficult Small Mammals dataset. Figure 7(b) shows that DCL+ achieves a faster and better convergence than vanilla in Experiment 2, reaching vanilla's converged test accuracy at an earlier training step. Further experimental details are deferred to Suppl. B.1. Since DCL is computationally expensive, we perform DCL experiments only on small datasets. Fine-tuning of $a$ is crucial for improving the generalization of DCL+ on the test set (see Figure 8). We fine-tune $a$ by trial and error over the training accuracy score.

Figure 9: Relation between the $\sigma$ and DCL score values of examples over training epochs 1 (left), 5 (middle), and 100 (right) for Experiments 1 (top row) and 2 (bottom row), respectively. Dotted red lines fit the scattered points.

5 Why is a curriculum useful?

At an intuitive level, we can say that DCL+ converges faster than vanilla SGD because we greedily sample those examples whose gradient steps are most aligned towards an approximately optimal weight vector. In previous CL works, mini-batches are generated by uniformly sampling examples from a partition of the dataset obtained by thresholding the difficulty scores of the examples. Notice that our DCL algorithms generate mini-batches with a natural ordering at every epoch. We design DCL+ and DCL- to investigate an important question: can CL benefit from having a set of mini-batches with a specific order, or is it just the subset of data exposed to the learner that matters? Figure 7 shows that the ordering of mini-batches matters when comparing DCL+ and DCL-, which expose the same set of examples to the learner in any training epoch. Once the mini-batch sequence for an epoch is computed, DCL- provides mini-batches to the learner in decreasing order of gradient noise. This is why DCL- shows large discontinuities in the test loss curve after every epoch in Figure 7(a). With our empirical results, we argue that the ordering of mini-batches within an epoch does matter. Bengio et al. (2009) illustrates that removing examples that are misclassified by a Bayes classifier (noisy examples) provides a good curriculum for training networks. SPL tries to remove examples that might be misclassified during a training step by avoiding examples with high loss. TL avoids examples that are noisy with respect to an approximately optimal hypothesis in the initial phases of training. DCL+ and DCL- try to avoid examples with noisy gradients that might slow down convergence towards the desired optimal minimum. Guo et al. (2018) empirically shows that avoiding examples with label noise improves the initial learning of CNNs. According to their work, adding examples with label noise in later phases of training serves as a regularizer and improves the generalization capability of CNNs. DCL+ uses its pacing function to avoid highly noisy examples (in terms of gradients). In our DCL experiments, the parameter $a$ is chosen such that a few moderately noisy examples (those present in the last few mini-batches within an epoch) are included in training along with less noisy examples to improve the network's generalization. We also show the importance of tuning CL hyperparameters for achieving better network generalization (see Figure 8). Hence, the parameter $a$ in DCL+ serves as a regularizer and helps improve the generalization of networks.

5.1 Analyzing stddev with our DCL framework

We use our DCL framework to understand why stddev works as a scoring function. We analyze the relation between the standard deviation $\sigma$ and the DCL score values of examples over training epochs. Figure 9 plots the DCL score on the Y-axis against examples ranked by their $\sigma$ values (in ascending order) on the X-axis at various stages of training, showing the dynamics over the initial, intermediate, and final stages. The $\sigma$ and DCL score values are correlated after the first epoch in both Experiments 1 and 2, with small p-values for the test of non-correlation. In the initial stage of training, examples with high $\sigma$ tend to have high DCL score values; in the final stage of training, this trend reverses. This shows that $\sigma$ can be useful for removing noisy gradients from the initial phases of training and hence helps in defining a simple, good curriculum.
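A sketch of this correlation check using SciPy; the Pearson estimator is our choice, as the text does not name one.

```python
import numpy as np
from scipy import stats

def correlation_report(sigma_scores, dcl_scores):
    """Correlation between per-example stddev scores and DCL scores,
    with the p-value of the test for non-correlation."""
    r, p = stats.pearsonr(np.asarray(sigma_scores), np.asarray(dcl_scores))
    return r, p
```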

6 Conclusion

In this paper, we propose two novel CL algorithms that show improvements in network generalization over multiple image classification tasks with CNNs and FCNs. A fresh approach to defining curricula for image classification tasks based on statistical measures is introduced, motivated by our observations from the implicit curriculum ordering. This technique makes it easy to score examples in an unsupervised manner without the aid of any teacher network. We thoroughly evaluate our CL algorithms and find them beneficial in noisy settings and in improving network accuracy. We also propose a novel DCL algorithm for analyzing CL. We show that the ordering of mini-batches within training epochs and the fine-tuning of CL hyperparameters are important for achieving good results with CL. Further, we use our DCL framework to support our CL algorithm that uses $\sigma$ for scoring examples.

References

  • P. Arora (1981) On the shannon measure of entropy. Information Sciences 23 (1), pp. 1–9. Cited by: §3.
  • Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §1.1, §1, §4, §5.
  • H. Chang, E. Learned-Miller, and A. McCallum (2017) Active bias: training more accurate neural networks by emphasizing high variance samples. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. 1002–1012. Cited by: §1.1.
  • J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §3.1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization.. Journal of machine learning research 12 (7). Cited by: §1.
  • Y. Fan, F. Tian, T. Qin, X. Li, and T. Liu (2018) Learning to teach. In International Conference on Learning Representations, Cited by: §1.1.
  • A. Graves, M. G. Bellemare, J. Menick, R. Munos, and K. Kavukcuoglu (2017) Automated curriculum learning for neural networks. In International Conference on Machine Learning, pp. 1311–1320. Cited by: §1.1, §4.
  • S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang (2018) CurriculumNet: weakly supervised learning from large-scale web images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–150. Cited by: §5.
  • G. Hacohen, L. Choshen, and D. Weinshall (2020) Let’s agree to agree: neural networks share classification order on real datasets. In International Conference on Machine Learning, pp. 3950–3960. Cited by: §1.1, §3.
  • G. Hacohen and D. Weinshall (2019) On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pp. 2535–2544. Cited by: §B.1, §B.2, §1.1, §3.1, §3.2, §4.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 770–778. Cited by: §3.2.
  • D. Hu, Z. Wang, H. Xiong, D. Wang, F. Nie, and D. Dou (2020) Curriculum audiovisual learning. arXiv preprint arXiv:2001.09414. Cited by: §1.1.
  • J. Jeon and R. Manmatha (2004) Using maximum entropy for automatic image annotation. In International Conference on Image and Video Retrieval, pp. 24–32. Cited by: §3.
  • A. Katharopoulos and F. Fleuret (2018) Not all samples are created equal: deep learning with importance sampling. In International conference on machine learning, pp. 2525–2534. Cited by: §1.1.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §1.
  • T. Kocmi and O. Bojar (2017) Curriculum learning and minibatch bucketing in neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, Varna, Bulgaria, pp. 379–386. Cited by: §1.1.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: §3.2.
  • M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models.. In NIPS, Vol. 1, pp. 2. Cited by: §1.1, §4.
  • V. Kumar and P. Gupta (2012) Importance of statistical measures in digital image processing. International Journal of Emerging Technology and Advanced Engineering 2 (8), pp. 56–62. Cited by: Appendix A, §3.
  • X. Liu, H. Lai, D. F. Wong, and L. S. Chao (2020) Norm-based curriculum learning for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 427–436. Cited by: §1.1.
  • I. Loshchilov and F. Hutter (2015) Online batch selection for faster training of neural networks. ArXiv abs/1511.06343. Cited by: §1.1.
  • M. Mastriani and A. E. Giraldez (2016) Enhanced directional smoothing algorithm for edge-preserving smoothing of synthetic-aperture radar images. arXiv preprint arXiv:1608.01993. Cited by: §3.
  • T. Matiisen, A. Oliver, T. Cohen, and J. Schulman (2019) Teacher–student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems 31 (9), pp. 3732–3740. Cited by: §1.1, §4.
  • B. Mirzasoleiman, K. Cao, and J. Leskovec (2020) Coresets for robust training of deep neural networks against noisy labels. In Advances in Neural Information Processing Systems, Cited by: §1.1.
  • R. Portelas, C. Colas, K. Hofmann, and P. Oudeyer (2020) Teacher algorithms for curriculum learning of deep rl in continuously parameterized environments. In Conference on Robot Learning, pp. 835–853. Cited by: §1.1.
  • S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In International Conference on Learning Representations, Cited by: §1.
  • H. Robbins and S. Monro (1951) A stochastic approximation method. The annals of mathematical statistics, pp. 400–407. Cited by: §1.
  • C. E. Shannon (1951) Prediction and entropy of printed english. Bell system technical journal 30 (1), pp. 50–64. Cited by: §3.
  • N. Shirish Keskar and R. Socher (2017) Improving generalization performance by switching from adam to sgd. arXiv e-prints, pp. arXiv–1712. Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: §3.2.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016a) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.1.
  • C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016b) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §3.2.
  • T. Tieleman and G. Hinton (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4 (2), pp. 26–31. Cited by: §1.
  • D. Weinshall, G. Cohen, and D. Amir (2018) Curriculum learning by transfer learning: theory and experiments with deep networks. In International Conference on Machine Learning, pp. 5238–5246. Cited by: §1.1, §1.
  • A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The marginal value of adaptive gradient methods in machine learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4151–4161. Cited by: §1.
  • X. Wu, E. Dyer, and B. Neyshabur (2021) When do curricula work?. In International Conference on Learning Representations, Cited by: §1.1.
  • X. Zhang, G. Kumar, H. Khayrallah, K. Murray, J. Gwinnup, M. J. Martindale, P. McNamee, K. Duh, and M. Carpuat (2018) An Empirical Exploration of Curriculum Learning for Neural Machine Translation. arXiv e-prints. Cited by: §1.1.
  • X. Zhang, P. Shapiro, G. Kumar, P. McNamee, M. Carpuat, and K. Duh (2019) Curriculum learning for domain adaptation in neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1903–1915. Cited by: §1.1.
  • Y. Zhang, P. Abbeel, and L. Pinto (2020) Automatic curriculum learning through value disagreement. Advances in Neural Information Processing Systems 33. Cited by: §1.1.
  • P. Zhao and T. Zhang (2015) Stochastic optimization with importance sampling for regularized loss minimization. In International Conference on Machine Learning, pp. 1–9. Cited by: §1.

Supplementary Material

Appendix A Additional empirical results

(a) CIFAR-100
(b) CIFAR-10
Figure 10: Bars represent the final mean top-1 test accuracy (in %) achieved by CNN-8. Error bars represent the STE after 25 independent trials.

In Section 3, we study the performance of CL using $\sigma$ and $H$ as scoring measures. Other important statistical measures are the mode, the median, and norms (Kumar and Gupta, 2012). A high $\sigma$ for a real image could mean that the image has many edges and a wide range of colors. A low entropy could mean that an image is less noisy. The norm of an image could give information about its brightness. Intuitively, the norm is not a good measure for scoring images, as images with a low norm are very dark and images with a high norm are very bright. We experiment with different norm measures and find that they do not serve as good CL scoring measures, since they yield smaller improvements with higher variance over multiple trials when compared to stddev- on the CIFAR datasets. We use two norm measures:

$n(x) = \lVert x \rVert$ and $n_{c}(x) = \lVert x - \bar{x}_{c(x)} \rVert$,   (4)

where $x$ is an image in the dataset represented as a vector, and $\bar{x}_{c(x)}$ is the mean pixel value of all the images belonging to the class of $x$. In our experiments, all the orderings are performed based on the scoring function, and the examples are then arranged to avoid class imbalance within a mini-batch. Let us denote the models that sort examples in ascending order of $n$ as norm+, in descending order of $n$ as norm-, in ascending order of $n_c$ as class_norm+, and in descending order of $n_c$ as class_norm-. Figure 10 shows the results of our experiments on the CIFAR-100 and CIFAR-10 datasets with CNN-8 using these scoring functions. We find that norm-, the best among the models that use norm measures, yields a lower improvement than stddev-. Also, norm- has a higher STE when compared to both vanilla and stddev-. Hence, based on our results, we suggest that $\sigma$ is a more useful statistical measure than norm measures for defining curricula for image classification tasks.
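A sketch of these two norm measures, using the L2 norm and a per-class mean image; both of these specific choices are our assumptions.

```python
import numpy as np

def norm_score(image):
    """Norm of the flattened image."""
    return float(np.linalg.norm(np.asarray(image, dtype=np.float64).ravel()))

def class_norm_score(image, class_mean_image):
    """Norm of the difference between the image and its class mean."""
    x = np.asarray(image, dtype=np.float64).ravel()
    m = np.asarray(class_mean_image, dtype=np.float64).ravel()
    return float(np.linalg.norm(x - m))
```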

Appendix B Experimental Details

B.1 Network architectures

All the FCNs we use have a single hidden layer of neurons with ELU nonlinearities. Experiments 1 and 2 employ FCNs with no bias parameters, and the outputs from the last layer are fed into a softmax layer. Cases 6–8 employ FCNs with bias parameters. The batch sizes for Experiments 1–2 and Cases 1–9 are 50 and 100, respectively. We use one NVIDIA Quadro RTX 5000 GPU for our experiments. Average runtimes of our experiments vary from 1 hour to 3 days. For Cases 1–5 and 9, we use the CNN-8 architecture that is used in Hacohen and Weinshall (2019); the code is available in their GitHub repository. CNN-8 contains convolution layers with ELU nonlinearities; except for the last two convolution layers, all layers share the same filter size. Batch normalization is performed after every convolution layer, and max-pooling and dropout layers are present after every two convolution layers. The output from the CNN is flattened and fed into a fully-connected layer followed by a dropout layer. A softmax output layer follows the fully-connected layer, with as many neurons as there are classes in the dataset. All the CNNs and FCNs are trained using SGD with the cross-entropy loss. SGD uses an exponential step-decay learning rate scheduler (see the sketch below). Our code will be published on acceptance.
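A minimal sketch of an exponential step-decay schedule of the kind referred to above; the parameter names mirror the grid-search entries in Suppl. B.2 and are not taken from the released code.

```python
def exponential_step_decay(initial_lr, decay_factor, decay_step, step):
    """Learning rate after a given number of training steps."""
    return initial_lr * decay_factor ** (step // decay_step)
```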

B.2 Hyperparameter tuning

For a fair comparison of network generalization, the hyperparameters should be finely tuned, as mentioned in Hacohen and Weinshall (2019). We use a hyperparameter grid search to tune the models in our experiments. For vanilla models, the grid search is easier since they do not have a pacing function. For CL models, which have many hyperparameters, we follow a coarse two-step tuning process: we first tune the optimizer hyperparameters while fixing the other CL hyperparameters, and then fix the obtained optimizer parameters and tune the CL hyperparameters. For every case, the grid search covers a) the initial learning rate, b) the learning-rate exponential decay factor, c) the learning-rate decay step, and the curriculum hyperparameters of the pacing function. The hyperparameters are tuned to perform better on the training data.

B.3 Dataset details

We use the CIFAR-100, CIFAR-10, ImageNet Cats, Small Mammals, MNIST, and Fashion-MNIST datasets. CIFAR-100 and CIFAR-10 contain 50,000 training and 10,000 test images of shape 32×32×3 belonging to 100 and 10 classes, respectively. Small Mammals is a super-class of CIFAR-100 containing 5 classes: “Hamster”, “Mouse”, “Rabbit”, “Shrew”, and “Squirrel”. It has 500 training images and 100 test images per class. MNIST and Fashion-MNIST contain 60,000 training and 10,000 test gray-scale images of shape 28×28 belonging to 10 different classes. ImageNet Cats is a subset of the ImageNet ILSVRC 2012 dataset, with each class containing a fixed number of training and test images. The labels in the subset are “Tiger cat”, “Lesser panda, Red panda, Panda, Bear cat, Cat bear, Ailurus fulgens”, “Egyptian cat”, “Persian cat”, “Tabby, Tabby cat”, “Siamese cat, Siamese”, “Madagascar cat”, and “Ring-tailed lemur, Lemur catta”. The images in the dataset are resized to a fixed resolution. All the datasets are preprocessed before training to have zero mean and unit standard deviation.