1 Introduction
Image recognition is a fundamental task of computer vision, which aims at understanding semantic contents from raw pixels. This is often difficult, because the underlying connections (e.g., a mathematical function) between low-level pixels and high-level semantics are often complicated and fuzzy, e.g., there exist many elements in the data space which are either meaningless or ambiguous [24]. In the deep learning era, researchers design deep neural networks as hierarchical and composite functions [16][11]. However, the difficulty of training a network increases as its complexity [9] does. Despite some technical improvements designed to alleviate instability of training [23][34][14], the learned model still suffers from overfitting, which is arguably caused by the excessive complexity of the designed model, so that the limited amount of training data can be interpreted in some improper ways.
Instead of designing more complex models, an alternative solution lies in optimizing an existing model better. This paper investigates an algorithm along this direction. It is named curriculum learning [2], which measures the difficulty of each training sample and adjusts the data distribution during the training process, so that the model is optimized with gradually increasing complexity, and thus becomes more stable and less likely to overfit the data. However, this idea has only been applied to a limited range of vision tasks, because it is often hard to determine the difficulty level of training data and further partition them into different groups. The major contribution of this paper is to provide a new framework named progressive recurrent learning (PRL), which uses entropy to define the difficulty of the training data distribution, and gradually reduces the level of auxiliary cues to construct training data with varying difficulties.
PRL is built upon a straightforward idea named coarse-to-fine learning, in which we train two models, a coarse model and a fine model, simultaneously, with the former taking input data and outputting a coarse prediction, and the latter taking input data as well as an auxiliary cue and outputting a refined prediction. Here, the auxiliary cue is a function of a label, which is either the ground truth or the coarse prediction. This implies that the overall model is formulated in a recurrent form with the output being used as input repeatedly. In the training process, we control the difficulty by changing the probability of sampling the ground truth and the coarse prediction in computing the auxiliary cue, e.g., the probability of sampling the coarse prediction grows with the fraction of elapsed training iterations. That is to say, the ground-truth annotation is used to provide a warm start for training the fine model, but is gradually replaced by the coarse prediction so that training difficulty increases. At the end of training, only the coarse prediction is sampled, so that the fine model does not rely on any extra information and can be applied in the testing process as a refinement of the coarse model. In practice, PRL demonstrates twofold benefits. First, the gradually increasing training difficulty makes the optimization process more stable, i.e., it converges better. Second, the coarse-to-fine iteration pushes the entire model towards higher accuracy.
PRL is a generalized framework which can be applied to a wide range of vision problems. In this paper, we study two examples, namely, object localization and semantic segmentation. In these tasks, the desired output is either an object bounding box or a segmentation mask, both of which can be easily represented as a vector or a matrix. We start with providing perfect supervision (the ground truth) to the fine stage. To prevent it from directly taking this information and thus not learning any useful knowledge, we apply a transformation function to this cue before combining it with the input data. Experiments reveal consistent accuracy gain brought by PRL to the baseline models. Empirical analysis by comparing training and testing losses verifies our motivation, i.e., such improvement comes from alleviating the risk of overfitting.
2 Related Work
Deep learning [17], in particular deep convolutional neural networks, has been dominating the field of computer vision. The fundamental idea is to build a hierarchical structure to learn complicated visual patterns from a large-scale database [6]. As the number of network layers increases from tens [16][33][35] to hundreds [11][13], the network’s representation ability becomes stronger, but training these networks becomes more and more challenging. Various techniques have been proposed to improve numerical stability [23][14] and alleviate overfitting [34], but the transferability from training data to testing data is still unsatisfactory. It was pointed out that this issue is mainly caused by the excessive complexity of deep networks, so that the limited amount of training data can be interpreted in an unexpected way [24]. There exist two types of solutions, namely, curriculum learning and coarse-to-fine learning.
The basic idea of curriculum learning [2] is to gradually increase the difficulty of training data, so that the model can be optimized in a faster and/or more stable manner. This idea was first brought up by referring to how humans are taught to learn a concept, and was verified to be effective for computer algorithms as well [15]. It was later applied to a wide range of learning tasks, including visual recognition [30][36] and generation [31][19][27][44][40]. Curriculum learning was theoretically verified to be a good choice in transfer learning [39], multi-task learning [25] and sequential learning [1] scenarios, and there have been discussions on the principle of designing curricula towards better performance [46]. A similar idea (gradually increasing training difficulty) was also adopted in online hard example mining [32][5], but the latter often started with a regular data distribution which was gradually adjusted towards difficult training data. The major drawback of curriculum learning lies in the requirement of evaluating the difficulty of training data, which is not easy in general. This paper provides a framework to bypass this problem.
Another idea, named coarse-to-fine learning, was based on the observation that a vision model can rethink its prediction to amend errors [3]. Researchers designed several approaches for refining visual recognition in an iterative manner. These approaches can be explained using auto-context [37] or formulated into a fixed-point model [20]. Examples include the coarse-to-fine models for image classification [8], object detection [4], semantic segmentation [47], pose estimation [38], image captioning [10], etc. It was verified that joint optimization over coarse and fine stages boosts the performance [43], which raised an issue of the communication between coarse and fine stages in the training process: we desire feeding coarse-stage output to fine-stage input, but when the coarse model has not been well optimized, this can lead to unstable optimization. The method proposed in this paper can largely alleviate this issue.
3 Our Approach
This section describes the progressive recurrent learning (PRL) algorithm. We first briefly review the curriculum learning algorithm, the precursor of PRL. Based on the limitation of curriculum learning, we propose PRL and provide some theoretical analysis. The applications of PRL on two vision problems are illustrated in the next section.
3.1 Background: Curriculum Learning
Machine learning often starts with constructing a model $\mathbf{y} = f(\mathbf{x}; \boldsymbol{\theta})$, where $\mathbf{x}$ and $\mathbf{y}$ are input and output, and $\boldsymbol{\theta}$ denotes the parameters. In the context of deep learning, $f$ describes the network architecture, and $\boldsymbol{\theta}$ the learnable weights. The goal is to find the optimal $\boldsymbol{\theta}$ that best fits a training dataset $\mathcal{D}$. In the modern era, $\boldsymbol{\theta}$ often contains millions of parameters while the amount of training data, $|\mathcal{D}|$, can still be limited, e.g., thousands. Therefore, it is possible that the model is overfitted, i.e., it interprets the training data in $\mathcal{D}$ perfectly, but fails to generalize well to new, unseen testing data.
Instead of designing more complex models, there were efforts in optimizing an existing network in a safer way. One of them is named curriculum learning [2], which assumed that each training data to be sampled from a distribution with being the density function at . Then, the training process is parameterized using time . Each training sample is assigned with a difficulty measure , and a weighting term . The actual sampling distribution satisfies . The exact coefficient for is determined by the normalization rule . This process is named curriculum learning if the following two constraints are satisfied [2]:
(1) 
which means that for each , is monotonically increasing with respect to , and
(2) 
where denotes the entropy of the distribution. The general idea of curriculum learning is to measure the difficulty of training data. If is equally or more difficult than , then for each step . Eventually, when there is for all . This formulation leans towards sampling easy cases at first, and gradually converges to the real distribution, i.e., .
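The reweighting idea described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the formulation of [2]: the function names, the per-sample difficulty scores in [0, 1], and the linear weighting rule are all hypothetical. The sketch only shows a sampling distribution that favors easy samples early, whose weights never decrease over time, and which converges to the original distribution at the end of training.

```python
import math

def weights(difficulties, t):
    """Hypothetical weighting rule: a sample of difficulty 0 always has
    weight 1; harder samples ramp up linearly and reach weight 1 at t = 1."""
    return [1.0 - d * (1.0 - t) for d in difficulties]

def curriculum_distribution(p, difficulties, t):
    """Reweight the base distribution p and renormalize it to sum to 1."""
    w = weights(difficulties, t)
    unnorm = [wi * pi for wi, pi in zip(w, p)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def entropy(q):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in q if x > 0.0)

p = [0.25, 0.25, 0.25, 0.25]          # assumed base training distribution
difficulties = [0.0, 0.3, 0.6, 1.0]   # assumed per-sample difficulty scores

for t in (0.0, 0.5, 1.0):
    q = curriculum_distribution(p, difficulties, t)
    print(t, [round(x, 3) for x in q], round(entropy(q), 4))
```

In this toy setting the entropy of the sampling distribution grows with the time parameter, and the distribution at the final step coincides with the base distribution.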
3.2 Progressive Recurrent Learning
A major limitation of curriculum learning lies in its need for an explicit definition of the difficulty of each training sample, which is often hard to obtain as few datasets provide such information. To deal with this issue, we borrow the definition from information theory and use the entropy of the data distribution to define the difficulty of training data. Section 3.4 provides a theoretical analysis showing that the entropy in our training process indeed grows gradually with time.
To instantiate this general idea, we borrow a simple framework named coarse-to-fine learning and, by adding an auxiliary cue to the fine stage, we can easily sample training data with different difficulty levels by controlling how much this cue benefits from the ground-truth label.
Consider a training sample consisting of input data and its ground-truth label. Due to the difficulty of directly learning the model, we decompose it into two components, namely, a coarse model and a fine model. The coarse model obtains a rough prediction, and the fine model polishes the prediction by feeding it to the network again, together with the input data, as an auxiliary cue. We assume that the auxiliary cue is closely related to the ground truth, so that the fine model can take advantage of this additional information, becomes easier to train, and performs better in the testing process (in the testing process, we can first apply the coarse model to the input to obtain a coarse prediction, and then repeatedly execute the fine model to update the prediction until convergence or until some terminating condition is satisfied). The overall goal of optimization is:
(3) 
It remains to determine how to optimize the fine model (the coarse model simply follows a regular training strategy), in particular how to design the auxiliary cue. We expect the fine model to learn the knowledge carried by the cue, but do not want the cue to deliver so much information that the fine model largely relies on it and ignores the input data. In addition, note that the ground truth is not available in the testing process, i.e., it needs to be replaced by the coarse prediction. Considering these factors, we compute the auxiliary cue by applying a transformation function to a label that takes the value of either the ground truth or the coarse prediction, where the transformation function weakens the information delivered by the label. The probability of choosing one of the two labels over the other determines the fraction of oracle information provided to the fine model, which serves as a practical way of controlling the difficulty of training data.
Based on this, we apply a progressive learning process to gradually increase training difficulty (the term “progressive learning” was used in a few prior approaches [42][29][21], but with different motivations; our goal is to gradually increase the difficulty of training data). We set a sampling probability which is positively related to the fraction of elapsed iterations. In each iteration, the coarse prediction is selected and fed into the fine model with this probability, and in the remaining case we choose the ground truth instead:
(4) 
where the random number used for this choice is drawn from a uniform distribution. There are of course other options, e.g., computing a weighted average of the ground truth and the coarse prediction. In this paper, we investigate Eqn (4) and show its effectiveness.
In summary, in each training iteration, we take the ground-truth label and the output of the coarse model, compute the auxiliary cue based on Eqn (4) and feed it to the fine model. The overall loss function is:
(5) 
where the operator combining the input data with the auxiliary cue denotes concatenation at the channel dimension. When the coarse prediction is chosen as the label, the gradient of the second term involves the parameters of both models, and so the coarse and fine models are optimized jointly. We add a coarse loss term so that the coarse model is better optimized [35][18] and the entire model achieves a higher stability [43]. The complete training process is described in Algorithm 1.
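The label-selection step of the training procedure can be sketched as follows. This is a minimal illustration assuming that the coarse prediction is chosen with a probability that grows during training and the ground truth otherwise; the names `select_label`, `lam` and `count_coarse` are hypothetical, and the strings stand in for real label tensors.

```python
import random

def select_label(y_true, y_coarse, lam, rng=random):
    """Return the coarse prediction with probability lam, else the ground truth."""
    u = rng.random()  # uniform draw in [0, 1)
    return y_coarse if u < lam else y_true

def count_coarse(lam, trials=10000, seed=0):
    """Count how often the coarse prediction is selected over many draws."""
    rng = random.Random(seed)
    picks = [select_label("truth", "coarse", lam, rng) for _ in range(trials)]
    return picks.count("coarse")

for lam in (0.0, 0.3, 1.0):
    print(lam, count_coarse(lam) / 10000.0)
```

With `lam = 0` the fine model always receives the ground truth (the easiest setting), and with `lam = 1` it always receives the coarse prediction, which matches the setting used at test time.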
In the testing stage, we use the coarse model to compute the first output, and iteratively feed it to the fine model, i.e., each iteration produces a new prediction from the previous one, and this process continues until convergence or until a predefined maximum number of iterations is reached. Except for the first iteration, the fine model takes the output from itself rather than from the coarse model, which is not consistent with the training process. However, since both models are supervised by the ground truth, the difference between their outputs is relatively small, so this pipeline works well in practice. The testing process is illustrated in Algorithm 2.
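The testing loop can be sketched as below. This is not a transcription of Algorithm 2: the function name `refine`, the tolerance-based stopping rule and the scalar toy models are assumptions chosen so that the example runs without any deep learning framework.

```python
def refine(x, coarse_model, fine_model, max_iters=10, tol=1e-6):
    """Run the coarse model once, then feed the fine model its own output
    until the prediction stops changing or max_iters is reached."""
    y = coarse_model(x)
    for _ in range(max_iters):
        y_next = fine_model(x, y)
        # Scalar toy outputs; real outputs would need a suitable distance.
        if abs(y_next - y) < tol:
            return y_next
        y = y_next
    return y

# Toy stand-ins (assumptions): the prediction is a scalar, and the fine model
# moves the current estimate halfway towards a value determined by the input.
def coarse(x):
    return 0.0

def fine(x, y):
    return 0.5 * (y + x)

print(refine(8.0, coarse, fine, max_iters=50))
```

In the toy example, each iteration halves the remaining error, so the loop stops once consecutive predictions differ by less than the tolerance.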
3.3 Implementation Details
In all our experiments, both the coarse and fine models are deep neural networks that use the same architecture for simplicity but are allowed to have different parameters. The only difference lies in the input layer, where the coarse model takes the original input and the fine model takes both the input and the auxiliary cue.
In practice, the transformation module has two convolutional layers, each with the same number of output channels as the input image, and a ReLU activation layer to provide nonlinearity [23]. This transformation module, though containing only a small number of parameters, plays an important role in weakening the information provided by the label; otherwise the fine model can easily learn an identity input-output mapping, especially in the early training stage (when the provided label is very close to the ground truth). The parameters in the first convolutional layers of the coarse and fine models are adjusted according to the dimensionality of input data.
3.4 Theoretical Analysis
We analyze the property of PRL before applying it to different vision problems. First, we note that PRL is not a curriculum learning process, as PRL gradually reduces the probability of sampling easy cases during the training process, and eventually discards them – this does not align with curriculum learning. However, PRL shares a similar property with curriculum learning, that the entropy of data distribution keeps increasing during training.
Let be a training sample of , which is sampled from the coarse-prediction distribution :
(6) 
where is determined by the training set and is an isotropic Gaussian distribution, which degenerates to the Kronecker $\delta$ function when its variance approaches zero. Similarly, we define the ground-truth distribution :
(7) 
In the training process, changes with . Thus, the distribution of at time , denoted by , has the following formulation:
(8) 
Here we make a simple assumption, that the difference between and is relatively large. This can be explained as (i) in the early training stage, coarse prediction is often less accurate, i.e., is often far away from , while (ii) in the late training stage, coarse prediction becomes more accurate but also more deterministic, i.e., becomes very small. Thus, we can approximate the Shannon entropy of as:
(9) 
where , which has an upper bound of . On the other hand, is smaller than by a margin. So during the training process, is mostly increasing, which implies that training difficulty becomes larger.
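One way to build intuition for this argument is the following identity: when two discrete distributions have disjoint supports, the entropy of their mixture equals the weighted sum of the component entropies plus the binary entropy of the mixing weight, and the latter is at most log 2. The sketch below checks this numerically. The two distributions, the name `lam` for the mixing weight and the disjoint-support setting are assumptions made for illustration; they are not claimed to be the exact form of Eqn (9).

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def binary_entropy(lam):
    """Entropy of a Bernoulli variable with parameter lam."""
    return entropy([lam, 1.0 - lam])

def mixture_disjoint(p_truth, p_coarse, lam):
    """Mixture (1 - lam) * p_truth + lam * p_coarse over disjoint supports,
    represented by concatenating the two sets of bins."""
    return [(1.0 - lam) * q for q in p_truth] + [lam * q for q in p_coarse]

p_truth = [0.7, 0.2, 0.1]             # hypothetical low-entropy distribution
p_coarse = [0.25, 0.25, 0.25, 0.25]   # hypothetical higher-entropy distribution

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    h_mix = entropy(mixture_disjoint(p_truth, p_coarse, lam))
    h_sum = ((1.0 - lam) * entropy(p_truth) + lam * entropy(p_coarse)
             + binary_entropy(lam))
    print(lam, round(h_mix, 6), round(h_sum, 6))
```

The two printed columns agree for every mixing weight, and the extra binary-entropy term vanishes at both ends of the schedule.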
3.5 Discussions and Relationship to Prior Work
PRL provides a tradeoff between training stability and generalization ability. On the one hand, PRL allows a warm start in training the fine model. In the training process, especially in the early epochs, the coarse model is often less optimized, and thus the coarse prediction may be less stable and introduce noise to the fine model (according to the assumption of a coarse-to-fine approach [8][43], the fine model expects the coarse prediction to be “good enough”, otherwise the iteration process cannot guarantee a stable convergence). On the other hand, although learning from the ground truth is easier as it provides more accurate cues, it can easily lead to overfitting, as in the testing process we actually obtain the coarse prediction, which is much noisier. PRL alleviates this issue via a gradual transition from the ground truth to the coarse prediction.
In the previous literature, the most related work is [27] and [43]. [27] considered a sequence learning task in which each cell takes the output of the previous cell as input. In each training epoch, the first part of the training data is provided by the ground truth while the second part is provided by prediction, and the fraction was controlled by the elapsed training time. Differently, PRL allows the data distribution to change more smoothly and thus improves training stability. [43] proposed a coarse-to-fine framework for semantic segmentation, and used a weaker version of curriculum learning in which the distribution was changed from the ground truth to the coarse prediction all at once. This sudden change may cause the model to fail to converge. PRL instead gradually changes the distribution, leading to consistent convergence and accuracy gain in experiments (see Section 4.2).
Last but not least, we find that curriculum learning and coarsetofine learning have their own benefits, and PRL combines both of them. This also makes PRL applicable to a wide range of visual recognition tasks, and we study two of them in the next section.
4 Applications
In this section, we apply the theory of PRL to two popular vision problems, i.e., object localization (Section 4.1) and semantic segmentation (Section 4.2). We also show that the improvement brought by PRL can be directly transferred from object localization to object parsing.
4.1 Object Localization
4.1.1 Settings
In the first part, we apply PRL to object localization, which differs from object detection in that we do not need to predict the object class in both training and testing – for each input image, the desired output is a bounding box that indicates the object. While being less specific, this system can assist a wide range of vision tasks including object detection [26] and object parsing, i.e., detecting the semantic parts of an object [45]. Here, we assume that only one object exists in each image, but as shown in [45], this assumption can be easily taken out by applying simple techniques in the testing process.
We collect data from the ILSVRC2012 dataset [28], in which categories with the superclass of vehicle are chosen, because the original method provided reasonable prediction on rigid objects [26]. We only choose those training and testing images with exactly one bounding box annotated (in the ILSVRC2012 dataset, about half of the training images were not annotated with bounding boxes, but all testing images were). We ignore those images with more than one object annotated to avoid confusion. In total, there are around training and testing images.
Table 1: Object localization results of BL (stage N/A), IND (fine stage, F), JNT (coarse stage, C, and fine stage, F) and four PRL settings (C and F each). The columns are the stage (STG), scale distance (Scl.), location distance (Loc.), IOU and accuracy (Acc.); for IOU and Acc. a higher number is better, and for Scl. and Loc. a lower number is better. For detailed descriptions of these evaluation metrics, see Section 4.1.1.
We take ScaleNet [26] as our baseline. The entire image is rescaled into with its aspect ratio preserved (empty stripes are added if necessary) and fed into a layer deep residual network [11] (only the middle part with layers is actually used). The output consists of four floating point numbers, indicating the coordinates of the central pixel, the width and the height of the bounding box, respectively. These numbers are individually compared with the ground truth using the log-scale norm and summed up to the final loss. There are in total training epochs.
In the testing stage, we compute several evaluation metrics. Given the ground-truth center coordinates, width and height of the bounding box and the corresponding predicted values, we compute a scale distance, a location distance, the IOU between the ground-truth and predicted bounding boxes, and an accuracy indicating whether the IOU reaches a threshold. (In both training and testing, we run the coarse and fine models exactly once, without iteration, because iteration helps little in this problem; this is partly due to the limited amount of information provided by the bounding box. For an opposite example, see the next subsection for PRL on image segmentation, in which rich information is delivered and so iteration largely improves recognition accuracy.)
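For reference, the IOU between two boxes given in the (center x, center y, width, height) format can be computed as in the sketch below. The function names are hypothetical, and the accuracy threshold is left as an explicit argument; the value 0.5 in the usage lines is only an illustrative choice.

```python
def iou_center_format(box_a, box_b):
    """IOU of two axis-aligned boxes, each given as (cx, cy, w, h)."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0

    ax0, ay0, ax1, ay1 = corners(box_a)
    bx0, by0, bx1, by1 = corners(box_b)
    # Width and height of the overlap, clipped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def is_correct(box_pred, box_true, threshold):
    """Accuracy indicator: True if the IOU reaches the given threshold."""
    return iou_center_format(box_pred, box_true) >= threshold

truth = (5.0, 5.0, 10.0, 10.0)
pred = (10.0, 5.0, 10.0, 10.0)    # shifted right by half the box width
print(iou_center_format(pred, truth))
print(is_correct(pred, truth, threshold=0.5))   # illustrative threshold
```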
To study this task in a coarse-to-fine manner, we first construct a weighted map using the center coordinates, width and height predicted by the coarse stage. The values within the bounding box and those outside it are set to two different constants. This map is then passed through the transformation module, which is composed of two convolutional layers, and appended to the input image. Although this box only provides a limited amount of information, we shall see the improvement it brings to object localization.
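A minimal sketch of turning a predicted box into such a map is given below. The function name `box_to_map`, the pixel-inclusion convention and the default inside/outside values of 1 and 0 are assumptions made for illustration only.

```python
def box_to_map(cx, cy, w, h, height, width, inside=1.0, outside=0.0):
    """Rasterize a (cx, cy, w, h) box into a height x width map.
    The inside/outside values are assumed defaults, not taken from the text."""
    x0, x1 = cx - w / 2.0, cx + w / 2.0
    y0, y1 = cy - h / 2.0, cy + h / 2.0
    return [
        [inside if (x0 <= col < x1 and y0 <= row < y1) else outside
         for col in range(width)]
        for row in range(height)
    ]

# A 2 x 2 box centered at (2, 2) inside a 4 x 4 map.
for row in box_to_map(2, 2, 2, 2, height=4, width=4):
    print(row)
```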
4.1.2 Different Learning Options
We compare progressive recurrent learning (PRL) with four other training strategies. The baseline (BL) simply trains one single network (i.e., the coarse model). The individual learning (IND) and joint learning (JNT) methods train the coarse and fine models simultaneously, but in an individual and a joint manner, respectively. Here, by joint optimization we mean providing the coarse prediction to the fine model from the beginning of training. We study different options of PRL defined by the initial probability of sampling the coarse prediction (the value at the start of training) and the number of epochs after which this probability reaches its final value, and we assume that it always grows linearly with training time.
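A linear schedule of this kind can be written in a few lines. The sketch below assumes that the scheduled quantity is the probability of feeding the coarse prediction to the fine model and that its final value is 1; the names `linear_schedule`, `lam_start` and `epochs_to_full` are hypothetical.

```python
def linear_schedule(epoch, lam_start, epochs_to_full):
    """Probability of sampling the coarse prediction at a given epoch.
    Grows linearly from lam_start to 1 over epochs_to_full epochs."""
    if epochs_to_full <= 0:
        return 1.0  # the coarse prediction is used from the very beginning
    lam = lam_start + (1.0 - lam_start) * epoch / epochs_to_full
    return min(1.0, max(0.0, lam))

for epoch in (0, 5, 10, 20):
    print(epoch, linear_schedule(epoch, lam_start=0.5, epochs_to_full=10))
```

Under this reading, joint learning corresponds to a schedule that is constantly 1, and any epochs after `epochs_to_full` train the model in the same setting as testing.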
Results are summarized in Table 1. Two interesting phenomena are observed. First, starting training with a nonzero probability of sampling the coarse prediction often improves performance: when only the ground truth is sampled, the extra information is so strong that the fine model can be severely biased towards such “cheating” information and thus learns a weaker connection between image data and output label. Second, it is always better for the model to be trained on the coarse prediction alone (the same setting as in testing) for several epochs, so that the model can adjust to this scenario. Several examples showing how PRL works in localization, including a failure case, are shown in Figure 2.
In Figure 3, we compare the learning curves of IND, JNT and the best PRL, in terms of mean IOU, with and . We can see that the fine phase of IND achieves a very low training error by heavily overfitting training data, in particular, with the “cheating” information from . The loss of JNT is much higher at the beginning, because the fine stage is confused by the coarse stage. As training continues, the loss term becomes smaller because it starts fitting the coarse prediction. This does not bring benefit, because the potential errors in coarse prediction are not fixed. PRL alleviates this issue by starting with a relatively easy task in which part of data are assisted by groundtruth, and gradually moving onto the real distribution, during which groundtruth data are still provided to prevent the model from being impacted by inaccurate coarse predictions.
4.1.3 Application to Object Parsing
Finally, we apply the result of object localization to DeepVoting [45] for object parsing, i.e., detecting the so-called semantic parts in objects. Here, each semantic part refers to a verbally describable pattern in the object, e.g., the wheels of a car or the pedal of a bike. As DeepVoting required all training objects to have a fixed scale, accurate object localization (in both scale and location) can help a lot in the testing process.
We use the VehicleSemanticParts (VSP) dataset introduced in [45], which was created from the vehicle images in Pascal3D+ [41]. There are six types of vehicles, namely, airplane, bike, bus, car, motorbike and train. There are different numbers of semantic parts annotated for each class, and we directly use the trained model for these six classes individually, i.e., DeepVoting itself is not modified, and we only change the object localization module which aims at providing a better input for DeepVoting. In the testing stage, we also add random occlusion by extracting pixelwise masks from irrelevant objects (e.g., cat or dog) in the PascalVOC 2007 dataset [7] and pasting them to the input images. By controlling the number of occluders and the fraction of occlusion, we construct four levels of difficulties denoted by L0, L1, L2 and L3, with L0 indicating no occlusion, and L3 the heaviest occlusion.
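Table 2: Scale prediction accuracy (SC) and semantic part detection accuracy (SP) of BL and PRL at the four occlusion levels L0, L1, L2 and L3. The rows are airplane (ai.), bike (bi.), bus (bu.), car (ca.), motorbike (mo.), train (tr.) and the average over the six classes (avg).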
We train two scale prediction models BL and PRL (, ) on the training set of VSP (each image provides a bounding box for the only vehicle in it). On the testing set, we compute both scale prediction accuracy (measured by whether it differs from the ground truth by more than , which follows the original work [45]) and semantic part detection accuracy (measured by mAP). Results are summarized in Table 2. We can see that PRL generalizes well from ILSVRC2012 to Pascal3D+ for scale prediction, and the more accurate scale prediction indeed helps object parsing, i.e., the improvement of mAP, averaged over six classes, exceeds at all occlusion levels. This demonstrates the wide application of PRL.
4.2 Medical Imaging Segmentation
The second task is medical imaging segmentation, which serves as an important prerequisite for computer-assisted diagnosis (CAD). We investigate the scenario of CT scans which are easy to acquire yet raise the problem of organ segmentation. We follow [43] to evaluate the dataset containing organs and blood vessels in abdominal CT scans. These scans have different numbers of slices along the long axis of the body (the distance between neighboring slices is ), but the spatial resolution of each slice is the same (). We evaluate each organ individually, where cases are used for training and the remaining for testing. In both datasets, we measure the segmentation accuracy by computing the Dice-Sørensen coefficient (DSC) for each sample, and report the average and standard deviation over all tested cases.
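For reference, the Dice-Sørensen coefficient between a predicted mask and a ground-truth mask is twice the size of their overlap divided by the sum of their sizes. The sketch below computes it on flattened binary masks and then aggregates over cases; the function name `dice`, the convention of returning 1 for two empty masks, the use of the population standard deviation and the toy masks are assumptions.

```python
import statistics

def dice(pred, truth):
    """DSC between two flat binary masks (sequences of 0/1) of equal length."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 2.0 * inter / total if total > 0 else 1.0

# Toy cases (hypothetical): one (prediction, ground truth) pair per tested case.
cases = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),
    ([1, 1, 0, 0], [1, 0, 1, 0]),
    ([1, 1, 1, 0], [1, 1, 0, 0]),
]
scores = [dice(p, t) for p, t in cases]
print(scores)
print(statistics.mean(scores), statistics.pstdev(scores))
```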
The baseline model is RSTN [43], a coarse-to-fine approach which deals with each target individually. It is a 2D-based approach, which cuts each 3D volume into slices and processes each slice separately. Three viewpoints, i.e., the coronal, sagittal and axial views, are individually trained and tested, and finally combined into the final prediction. In RSTN, both the coarse and fine models are fully-convolutional networks (FCN) [22], and the transformation module contains two convolutional layers applied to the segmentation mask, blurring it into a saliency map and adding it to the original image. To filter out less useful input contents, a minimal bounding box is built to cover all pixels whose probability reaches a threshold, and the input image is cropped accordingly before being fed into the fine stage. In the original paper [43], to improve the stability of RSTN, the authors designed a stepwise training strategy which first feeds the ground-truth mask into the fine model, and changes it to the coarse prediction at a fixed point of the training process. However, it still consistently failed to converge on three out of targets (see Table 3), and sometimes failed on another five.
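The cropping step can be sketched as follows: find the smallest box that covers every pixel whose probability reaches a threshold, then crop the image to that box. This is a simplified illustration rather than the RSTN implementation; the function names are hypothetical, the threshold is an explicit argument, and no margin is added around the box.

```python
def min_bounding_box(prob_map, threshold):
    """Smallest (row0, row1, col0, col1) box, inclusive, covering all pixels
    with probability >= threshold; None if no pixel qualifies."""
    rows = [r for r, row in enumerate(prob_map)
            if any(v >= threshold for v in row)]
    cols = [c for row in prob_map
            for c, v in enumerate(row) if v >= threshold]
    if not rows:
        return None
    return min(rows), max(rows), min(cols), max(cols)

def crop(image, box):
    """Crop a 2D list to the inclusive box returned by min_bounding_box."""
    r0, r1, c0, c1 = box
    return [row[c0:c1 + 1] for row in image[r0:r1 + 1]]

prob_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.6, 0.0],
    [0.0, 0.2, 0.8, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
box = min_bounding_box(prob_map, threshold=0.5)   # illustrative threshold
print(box)
print(crop(prob_map, box))
```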
By applying PRL, we allow the supervision signal to change gradually from to , not suddenly. There are in total iterations with a minibatch size of . Because semantic segmentation is much more difficult than object localization, we use in the first iterations otherwise coarse prediction may provide a meaningless mask and thus totally confuse the fine stage. We change gradually from to in the next iterations, and set in the last iterations, which is learned from previous experiments. The learning rate starts with and is divided by after , and iterations. We also tried to change linearly throughout all iterations, but this achieved a worse success rate in convergence.
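The three-phase schedule described above (ground truth only, then a gradual transition, then coarse prediction only) can be sketched as a piecewise-linear function of the iteration index. The function name and the phase lengths `warm_iters` and `ramp_iters` are hypothetical parameters; the values used in the loop are for illustration only.

```python
def three_phase_schedule(it, warm_iters, ramp_iters):
    """Probability of feeding the coarse prediction to the fine stage.
    Phase 1: 0 (ground truth only) for the first warm_iters iterations.
    Phase 2: linear increase from 0 to 1 over the next ramp_iters iterations.
    Phase 3: 1 (coarse prediction only) for the remaining iterations."""
    if it < warm_iters:
        return 0.0
    if it >= warm_iters + ramp_iters:
        return 1.0
    return (it - warm_iters) / float(ramp_iters)

for it in (0, 10, 20, 30, 40):
    print(it, three_phase_schedule(it, warm_iters=10, ramp_iters=20))
```

Compared with a single linear ramp over all iterations, this schedule keeps the fine stage away from the coarse prediction until the coarse stage has had some time to train.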
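Table 3: Segmentation accuracy (DSC) of RSTN [43] and PRL at the coarse (C) and fine (F) stages, i.e., the columns are [43]-C, [43]-F, PRL-C and PRL-F. The rows are the targets aorta, adrenal gland, celiac a.a., colon, duodenum, gallbladder, inferior v.c., kidney left, kidney right, liver, pancreas, superior m.a., small bowel, spleen, stomach and veins, followed by the averages over organs, over vessels and over all targets.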
Results are summarized in Table 3. We can see that, after PRL is applied, RSTN achieves convergence on all targets, including the blood vessels which failed to converge. Among all targets, PRL saves of them from non-convergence, improves the segmentation accuracy of other , with of them over , of them over and of them over (small bowel). Slight accuracy drop () is reported on out of targets, and the maximal drop is (adrenal gland). Overall, the average accuracy over converged targets is boosted by over , which is significant given such a high baseline and that PRL merely changes the training strategy of RSTN.
5 Conclusions
In this paper, we generalize curriculum learning to a wide range of vision problems. Our approach, named progressive recurrent learning (PRL), is motivated by the observation that curriculum learning can be integrated with coarse-to-fine learning, so that the former provides a more stable training scheme, and the latter provides a natural way of constructing training data with varying difficulties. PRL is evaluated in two vision problems, namely, object localization and semantic segmentation. In both scenarios, PRL consistently improves the accuracy of visual recognition.
This paper leaves many topics for future research. For example, it remains unclear whether there exist specific strategies for each particular task which take full advantage of its properties. Also, the strategy of monotonically increasing the difficulty of training data may not be perfect, as some prior work [12] verified that disturbing the training process can lead to a better model ensemble. Studying these topics may provide new perspectives to understand machine learning, in particular deep learning methods.
References

 [1] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 2015.
 [2] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning, 2009.
 [3] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In International Conference on Computer Vision, 2015.
 [4] H. Chen, Q. Dou, X. Wang, J. Qin, and P. A. Heng. Mitosis detection in breast cancer histology images via deep cascaded networks. In AAAI Conference on Artificial Intelligence, 2016.
 [5] Q. Chen, W. Qiu, Y. Zhang, L. Xie, and A. Yuille. Sampleahead: Online classifier-sampler communication for learning from synthesized data. arXiv preprint arXiv:1804.00248, 2018.
 [6] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
 [7] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
 [8] S. Gangaputra and D. Geman. A design principle for coarse-to-fine classification. In Computer Vision and Pattern Recognition, 2006.
 [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In International Conference on Artificial Intelligence and Statistics, 2010.
 [10] J. Gu, J. Cai, G. Wang, and T. Chen. Stack-captioning: Coarse-to-fine learning for image captioning. 2018.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, 2016.
 [12] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger. Snapshot ensembles: Train 1, get m for free. In International Conference on Learning Representations, 2018.
 [13] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. In Computer Vision and Pattern Recognition, 2017.
 [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. 2015.
 [15] F. Khan, B. Mutlu, and X. Zhu. How do humans teach: On curriculum learning and teaching dimension. In Advances in Neural Information Processing Systems, 2011.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 [17] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.
 [18] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, 2015.
 [19] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541, 2016.
 [20] Q. Li, J. Wang, D. Wipf, and Z. Tu. Fixed-point model for structured labeling. In International Conference on Machine Learning, 2013.
 [21] C. Liu, B. Zoph, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In European Conference on Computer Vision, 2018.
 [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 2015.
 [23] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In International Conference on Machine Learning, 2010.
 [24] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition, 2015.
 [25] A. Pentina, V. Sharmanska, and C. H. Lampert. Curriculum learning of multiple tasks. In Computer Vision and Pattern Recognition, 2015.
 [26] S. Qiao, W. Shen, W. Qiu, C. Liu, and A. L. Yuille. Scalenet: Guiding object proposal generation in supermarkets and beyond. In International Conference on Computer Vision, 2017.
 [27] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. 2016.
 [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [29] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
 [30] N. Sarafianos, T. Giannakopoulos, C. Nikou, and I. A. Kakadiaris. Curriculum learning for multitask classification of visual attributes. arXiv preprint arXiv:1708.08728, 2017.
 [31] R. Sharma, S. Barratt, S. Ermon, and V. Pande. Improved training with curriculum gans. arXiv preprint arXiv:1807.09295, 2018.
 [32] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Computer Vision and Pattern Recognition, 2016.
 [33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
 [34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 [35] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, et al. Going deeper with convolutions. In Computer Vision and Pattern Recognition, 2015.
 [36] Y. Tang, X. Wang, A. P. Harrison, L. Lu, J. Xiao, and R. M. Summers. Attention-guided curriculum learning for weakly supervised classification and localization of thoracic diseases on chest radiographs. In International Workshop on Machine Learning in Medical Imaging, 2018.
 [37] Z. Tu. Auto-context and its application to high-level vision tasks. In Computer Vision and Pattern Recognition, 2008.
 [38] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In Computer Vision and Pattern Recognition, 2016.
 [39] D. Weinshall and G. Cohen. Curriculum learning by transfer learning: Theory and experiments with deep networks. In International Conference on Machine Learning, 2018.
 [40] Y. Wu and Y. Tian. Training agent for first-person shooter game with actor-critic curriculum learning. 2017.
 [41] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, 2014.
 [42] Y. Yang, Y. Wang, Q. M. J. Wu, X. Lin, and M. Liu. Progressive learning machine: A new approach for general hybrid system approximation. IEEE Transactions on Neural Networks and Learning Systems, 26(9):1855–1874, 2015.

 [43] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, and A. L. Yuille. Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation. In Computer Vision and Pattern Recognition, 2018.
 [44] W. Zaremba and I. Sutskever. Reinforcement learning neural turing machines - revised. arXiv preprint arXiv:1505.00521, 2015.
 [45] Z. Zhang, C. Xie, J. Wang, L. Xie, and A. L. Yuille. Deepvoting: An explainable framework for semantic part detection under partial occlusion. In Computer Vision and Pattern Recognition, 2018.
 [46] T. Zhou and J. Bilmes. Minimax curriculum learning: Machine teaching with desirable difficulties and scheduled diversity. In International Conference on Learning Representations, 2018.
 [47] Y. Zhou, L. Xie, W. Shen, Y. Wang, E. K. Fishman, and A. L. Yuille. A fixed-point model for pancreas segmentation in abdominal ct scans. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.