I Introduction
With the growth of mobile phones, cameras and social networks, a large number of photographs are rapidly created, especially those containing human faces. To interact with these photos, there is an increasing demand for intelligent systems (e.g., content-based personal photo search and sharing from either one's mobile albums or social networks) built on face recognition techniques [1, 2, 3]. Thanks to several recently proposed pose/expression normalization and alignment-free approaches [4, 5, 6], identifying faces in the wild has achieved remarkable progress. As for commercial products, the website “Face.com” once provided an API (application programming interface) to automatically detect and recognize faces in photos. The main problem in such scenarios is to identify individuals from images under a relatively unconstrained environment. Traditional methods usually handle this problem by supervised learning [7], but it is typically expensive and time-consuming to prepare a good set of labeled samples. Since only a small amount of data is labeled, semi-supervised learning [8] may be a good candidate for solving this problem. However, as pointed out in [9], directly using the unlabeled data may significantly reduce learning performance because of the large number of noisy samples and outliers.
This paper targets the challenge of incrementally learning a batch of face recognizers from the increasing face images of different individuals (project page: http://hcp.sysu.edu.cn/projects/aspl/). Here we assume that faces can be detected and localized by existing face detectors. However, building such a system is quite challenging in the following aspects.

Human faces have large appearance variations (see examples in Fig. 1 (a)) caused by diverse views and expressions as well as facial accessories (e.g., glasses and hats) and aging. Varying lighting conditions must also be considered in practice.

Only a few labeled samples may be accessible at first, and the changes of personal faces are rather unpredictable over time, especially in the current scenario where a large number of images flood onto the Internet every day.

Even though a few user interventions (e.g., labeling new samples) may be allowed, the user effort should be kept to a minimum over time.
Conventional incremental face recognition methods such as incremental subspace approaches [10, 11] often fail in complex and large-scale environments. Their performance can drop drastically when the initial training set of face images is either insufficient or inappropriate. In addition, most existing incremental approaches suffer from noisy samples or outliers during model updating. In this work, we propose a novel active self-paced learning framework (ASPL) to handle the above difficulties, which combines the strengths of two recently rising techniques: active learning (AL) [12, 13] and self-paced learning (SPL) [14, 15, 16]. In particular, our framework follows a “Cost-less, Earn-more” working manner: pursuing as high a performance as possible while reducing costs.
The basic approach of AL methods is to progressively select and annotate the most informative unlabeled samples to boost the model, in which user interaction is allowed. The sample selection criterion is the key in AL, and it is typically defined according to the classification uncertainty of samples. Specifically, samples of low classification confidence, together with other informative criteria such as diversity, are generally treated as good candidates for model retraining. On the other hand, SPL is a recently proposed learning regime that mimics the learning process of humans/animals by gradually incorporating samples into training from easy to more complex [17, 18], where an easy sample is actually one of high classification confidence under the currently trained model. Interestingly, the two categories of learning methods select samples with opposite criteria. This finding inspires us to investigate the connection between the two learning regimes and the possibility of making them complementary to each other. Moreover, as pointed out in [3, 19], learning-based features are considered to exploit information with better discriminative ability for face recognition than handcrafted features. We thus utilize a deep convolutional neural network (CNN) [20, 21] for feature extraction instead of handcrafted image features. In sum, we aim at designing a cost-effective and progressive learning framework, which is capable of automatically annotating new instances and incorporating them into training under weak expert recertification. In the following, we discuss the advantages of our ASPL framework in two aspects: “Cost less” and “Earn more”.
(I) Cost less: Our framework is capable of building effective classifiers with fewer labeled training instances and less user effort than other state-of-the-art algorithms. This property is achieved by combining active learning and self-paced learning in the incremental learning process. In a given feature space of model training, as Fig. 1 (b) illustrates, samples of low classification confidence are scattered and close to the classifier decision boundary, while high-confidence samples are distributed compactly in the intra-class regions. Our approach takes both categories of samples into consideration for classifier updating. The benefits of this strategy include: i) High-confidence samples can be automatically labeled and consistently added into model training throughout the learning process in a self-paced fashion, particularly when the classifier becomes more and more reliable at later learning iterations. This significantly reduces the burden of user annotation and makes the method scalable to large-scale scenarios. ii) The low-confidence samples are selected for active user annotation, which makes our approach pick up informative samples more efficiently, adapt better to practical variations and converge faster, especially in the early learning stage of training.
(II) Earn more: The mixture of self-paced learning and active learning effectively improves not only the classifier accuracy but also the classifier robustness against noisy samples. From the perspective of AL, extra high-confidence samples are automatically incorporated into the retraining without cost of human labor in each iteration, and faster convergence can thus be gained. These introduced high-confidence samples also help suppress noisy samples in learning, due to their compactness and consistency in the feature space. From the SPL perspective, allowing active user intervention generates reliable and diverse samples that keep the learning from being misled by outliers. In addition, utilizing the CNN helps to pursue a higher classification performance by learning the convolutional filters instead of relying on handcrafted feature engineering.
In brief, our ASPL framework includes two main phases. At the initial stage, we first learn a general face representation using a convolutional neural network architecture, and train a batch of classifiers with a very small set of annotated samples of different individuals. In the iterative learning stage, we rank the unlabeled samples according to how they relate to the current classifiers, and retrain the classifiers by selecting and annotating samples in either an active user-query or a self-paced manner. The CNN can also be fine-tuned based on the updated classifiers.
The key point in designing such an effective interactive learning system is an efficient division of labor between computers and human participants, i.e., we should feed computable and faithful tasks to computers wherever possible, and assign labor-saving and intelligent tasks to humans [22]. The proposed ASPL framework provides a rational realization of this idea by automatically distinguishing high-confidence samples, which can be easily and faithfully recognized by computers in a self-paced way, from low-confidence ones, which are resolved by requesting user annotation.
The main contributions of this work are fourfold. i) To the best of our knowledge, our work is the first to make a face recognition framework capable of automatically annotating high-confidence samples and involving them in training without extra human labor, in a purely self-paced manner under the weak recertification of active learning. In particular, as the learning process proceeds, we obtain more and more pseudo-labeled samples that facilitate learning at no annotation cost. Our framework is thus suitable for practical large-scale scenarios, and it can be easily extended to other similar visual recognition tasks. ii) We provide a concise optimization problem and theoretically show that the proposed ASPL is a rational implementation for solving this problem. iii) This work also advances the development of SPL by setting a dynamic curriculum variation. The new SPL setting complies better with the “instructor-student collaborative” learning mode in human education than previous models. iv) Extensive experiments on the challenging CACD and CASIA-WebFace datasets show that, with only a small fraction of samples annotated, our approach achieves performance competitive with or even better than that obtained with fully labeled data. A dramatic reduction () of user interaction is achieved over other state-of-the-art active learning methods.
The rest of the paper is organized as follows. Section II presents a brief review of related work. Section III overviews the pipeline of our framework, followed by a discussion of model formulation and optimization in Section IV. The experimental results, comparisons and component analysis are presented in Section V. Section VI concludes the paper.
II Related Work
In this section, we first review incremental face recognition, and then briefly introduce related developments in active learning and self-paced learning.
Incremental Face Recognition. There are two categories of methods addressing the problem of identifying faces with incremental data, namely incremental subspace methods and incremental classifier methods. The first category mainly includes the incremental versions of traditional subspace learning approaches such as principal component analysis (PCA) [23] and linear discriminant analysis (LDA) [11]. These approaches map facial features into a subspace, and keep the eigen representations (i.e., eigenfaces) updated by incrementally incorporating new samples. Face recognition is then commonly accomplished by nearest-neighbor-based feature matching, which is computationally expensive when a large number of samples are accumulated over time. On the other hand, the incremental classifier methods aim at updating the prediction boundary with the learned model parameters and new samples. Examples include the incremental support vector machine (ISVM) [24] and the online sequential forward neural network [25]. In addition, several attempts have been made to absorb the advantages of both categories. For example, Ozawa et al. [26] proposed to integrate incremental PCA with the resource allocation network in an iterative way. Although these approaches have made remarkable progress, they suffer from low accuracy compared with batch-based state-of-the-art face recognizers, and none of them has been successfully validated on large-scale datasets (e.g., more than 500 individuals). Moreover, these approaches are basically studied in the context of fully supervised learning, i.e., both initial and incremental data are required to be labeled.
Active Learning. This branch of work mainly focuses on actively selecting and annotating the most informative unlabeled samples, in order to avoid unnecessary and redundant annotation. The key part of active learning is thus the selection strategy, i.e., which samples should be presented to the user for annotation. One of the most common strategies is certainty-based selection [27, 28], in which the certainties are measured according to the predictions on new unlabeled samples obtained from the initial classifiers. For example, Lewis et al. [27] proposed to take the most uncertain instance as the one that has the largest entropy on the conditional distribution over its predicted labels. Several SVM-based methods [28] regard samples as uncertain when they are relatively close to the decision boundary. The sample certainty was also measured by applying a committee of classifiers in [29]. These certainty-based approaches usually ignore the large set of unlabeled instances, and are thus sensitive to outliers. A number of later methods introduce an information density measure that exploits the information of unlabeled data when selecting samples. For example, the informative samples are sequentially selected to minimize the generalization error of the trained classifier on the unlabeled data, based on a statistical approach [30] or prior information [31]. In [32, 33], instances are chosen to maximize the increase of mutual information between the candidate instances and the remaining ones based on Gaussian process models. The diversity of the selected instances over the unlabeled data has also been taken into consideration [34]. Recently, Elhamifar et al. [12] presented a general framework via convex programming, which considers both the uncertainty and diversity measures for sample selection. However, these active learning approaches usually emphasize low-confidence samples (e.g., uncertain or diverse samples) while ignoring the remaining majority of high-confidence samples. To enhance the discriminative capability, Wang et al. [8] proposed a unified semi-supervised learning framework, which incorporates the high-confidence coding vectors of unlabeled data into training through an effective iterative algorithm, and demonstrated its effectiveness in dictionary-based classification. Our work is inspired by this work, and also employs high-confidence samples to improve both the accuracy and the robustness of classifiers.
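To make the entropy-based certainty criterion mentioned above concrete, the following minimal Python sketch ranks unlabeled samples by the entropy of their predicted class distribution and returns the most uncertain ones. It is an illustration under stated assumptions rather than the procedure of any cited method: the function names `entropy` and `most_uncertain` and the toy posteriors are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def most_uncertain(pred_probs, k):
    """Return the indices of the k samples with the largest predictive entropy."""
    ranked = sorted(range(len(pred_probs)),
                    key=lambda i: entropy(pred_probs[i]),
                    reverse=True)
    return ranked[:k]

# Hypothetical class posteriors for four unlabeled samples (three classes).
pred_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous
    [0.70, 0.20, 0.10],  # moderately confident
    [0.34, 0.33, 0.33],  # nearly uniform, most ambiguous
]
query = most_uncertain(pred_probs, k=2)  # samples to present to the annotator
```

In this toy setting the nearly uniform sample is queried first, followed by the second most ambiguous one.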
Self-paced Learning. Inspired by the cognitive principle of humans/animals, Bengio et al. [17] introduced the concept of curriculum learning (CL), in which a model is learned by gradually including samples into training from easy to complex. To make it more implementable, Kumar et al. [18] substantially promoted this learning philosophy by formulating the CL principle as a concise optimization model named self-paced learning (SPL). The SPL model includes a weighted loss term on all samples and a general SPL regularizer imposed on the sample weights. By sequentially optimizing the model with a gradually increasing pace parameter on the SPL regularizer, more samples can be automatically discovered in a purely self-paced way. Jiang et al. [14, 35, 16] provided a more comprehensive understanding of the learning insight underlying SPL/CL, and formulated the learning model as a general optimization problem:
$$\min_{\mathbf{w},\, \mathbf{v} \in [0,1]^n} \; \sum_{i=1}^{n} v_i L(\mathbf{w}; \mathbf{x}_i, y_i) + f(\mathbf{v}; \lambda), \quad \text{s.t. } \mathbf{v} \in \Psi \tag{1}$$
where $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ corresponds to the training dataset, $L(\mathbf{w}; \mathbf{x}_i, y_i)$ denotes the loss function which calculates the cost between the objective label $y_i$ and the estimated one, $\mathbf{w}$ represents the model parameter inside the decision function, and $\mathbf{v} = [v_1, v_2, \ldots, v_n]^T$ denotes the weight variables reflecting the samples' importance. $\lambda$ is a parameter for controlling the learning pace, which is also referred to as the “pace age”.
In the model, $f(\mathbf{v}; \lambda)$ corresponds to a self-paced regularizer. Jiang et al. abstracted three necessary conditions it should satisfy [16, 14]: (1) $f(\mathbf{v}; \lambda)$ is convex with respect to $\mathbf{v} \in [0,1]^n$; (2) the optimal weight of each sample should be monotonically decreasing with respect to its corresponding loss; and (3) the optimal weight of each sample should be monotonically increasing with respect to the pace parameter $\lambda$.
In this axiomatic definition, Condition 2 indicates that the model inclines to select easy samples (with smaller errors) in favor of complex samples (with larger errors). Condition 3 states that when the model “age” $\lambda$ gets larger, it embarks on incorporating more, probably complex, samples to train a “mature” model. The convexity in Condition 1 further ensures that the model can find good solutions.
$\Psi$ is the so-called curriculum region that encodes the information of predetermined curriculums. Its axiomatic definition contains two conditions [14]: (1) it should be nonempty and convex; and (2) if $\mathbf{x}_i$ is ranked before $\mathbf{x}_j$ in the curriculum (i.e., is more important for the problem), the expectation $\int_{\Psi} v_i \, d\mathbf{v}$ should be larger than $\int_{\Psi} v_j \, d\mathbf{v}$. Condition 1 ensures the soundness of the calculation of this specific constraint, and Condition 2 indicates that samples to be learned earlier are supposed to have larger expected values. This constraint weakly implies a prior learning sequence of samples, where the expected value for the favored samples should be larger.
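The behavior required of a self-paced regularizer (weights that shrink as the loss grows and that admit more samples as the pace age grows) can be illustrated with a small Python sketch. It assumes the simple binary weighting commonly associated with the original SPL formulation [18], in which a sample is selected only when its loss is below the pace parameter; the function name `hard_weights` and the loss values are hypothetical.

```python
def hard_weights(losses, lam):
    """Binary SPL weights: weight 1 if the loss is below the pace parameter lam, else 0."""
    return [1.0 if loss < lam else 0.0 for loss in losses]

# Hypothetical per-sample losses, sorted from easy to hard.
losses = [0.1, 0.4, 0.9, 1.5]

young = hard_weights(losses, lam=0.5)   # early pace age: only easy samples are selected
mature = hard_weights(losses, lam=1.0)  # later pace age: a harder sample is admitted

# Within one pace age, the weight never increases as the loss increases;
# across pace ages, no weight decreases when lam grows.
non_increasing_in_loss = all(a >= b for a, b in zip(young, young[1:]))
non_decreasing_in_pace = all(m >= y for y, m in zip(young, mature))
```

Both flags evaluate to true for this toy input, matching the qualitative reading of the monotonicity conditions on the optimal weights.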
The SPL model (1) closely simulates the learning process of human education. Specifically, it builds an “instructor-student collaborative” paradigm, which on the one hand utilizes prior knowledge provided by instructors as guidance for curriculum design (encoded by the curriculum constraint), and on the other hand leaves certain freedom to students to adjust the actual curriculum according to their learning pace (encoded by the self-paced regularizer). Such a model not only includes all previous SPL/CL methods as its special cases, but also provides a general guideline for deriving a rational SPL implementation scheme for a given learning task. Based on this framework, multiple SPL variations have been recently proposed, such as SPaR [16], SPLD [15], SPMF [35] and SPCL [14].
SPL-related strategies have also been recently applied in a series of applications, such as specific-class segmentation learning [36], visual category discovery [37], long-term tracking [38], action recognition [15] and background subtraction [35]. In particular, the SPaR method, constructed based on the general formulation (1), was applied to the challenging SQ/000Ex task of the TRECVID MED/MER competition, and achieved the leading performance among all competing teams [39].
Complementarity between AL and SPL: It is interesting that the function of SPL is highly complementary to that of AL. The SPL methods emphasize easy samples in learning, which correspond to the high-confidence intra-class samples, while AL inclines to pick up the most uncertain and informative samples for the learning task, which are typically located in low-confidence areas near the classification boundaries. SPL is capable of easily attaining a large amount of faithful pseudo-labeled samples with little human labor (by the reranking technique [16]; details are introduced in Section IV), but tends to underestimate the role of the most informative samples that intrinsically configure the classification boundaries. On the contrary, AL inclines to obtain informative samples, but needs more human labor to annotate these samples carefully. We thus expect to effectively mix these two learning schemes to help incremental learning both improve efficiency with less human labor (i.e., Cost Less) and achieve better accuracy and robustness of the learned classifier against noisy samples (i.e., Earn More). This constitutes the basic motivation of our ASPL framework for face identification in large-scale scenarios.
III Framework Overview
In this section, we illustrate how our ASPL model works. As illustrated in Fig. 2, the main stages in our framework pipeline include: CNN pretraining for face representation, classifier updating, high-confidence sample pseudo-labeling in a self-paced fashion, low-confidence sample annotating by active users, and CNN fine-tuning.
CNN pretraining: Before running the ASPL framework, we need to pretrain a CNN for feature extraction on a pre-given face dataset. These images are selected separately and do not overlap with any of our experimental data. Since several publicly available CNN architectures [40, 41] have achieved remarkable success in visual recognition, our framework supports directly employing these architectures and their pretrained models as initialized parameters. In all our experiments, AlexNet [40] is utilized. Given the separately selected annotated samples, we further fine-tune the CNN to learn a discriminative feature representation.
Initialization: At the beginning, we randomly select a few images for each individual, extract their feature representations with the pretrained CNN, and manually annotate them as the starting point.
Classifier updating: In our ASPL framework, we use one-vs-all linear SVMs as our classifiers. At the beginning, only a small part of the samples are labeled, and we train an initial classifier for every individual using these samples. As the framework matures, the number of samples manually annotated by the AL and pseudo-labeled by the SPL grows, and we adopt them to retrain the classifiers.
High-confidence sample pseudo-labeling: We rank the unlabeled samples by their importance weights under the current classifiers, e.g., using the hinge loss of the classification prediction, and then assign pseudo-labels to the top-ranked samples of high confidence. This step is carried out automatically by our system.
Low-confidence sample annotating: Based on a certain AL criterion obtained under the current classifiers, we rank all unlabeled samples, select the top-ranked ones (the most informative and generally of low confidence), and then have them manually annotated by active users.
CNN fine-tuning: After several steps of interaction, we fine-tune the neural network by the back-propagation algorithm. All samples self-labeled by the SPL and manually annotated by the AL are fed into the network, and we utilize the softmax loss to optimize the CNN parameters via stochastic gradient descent.
IV Formulation and Optimization
In this section, we discuss the formulation of our proposed framework, and also provide a theoretical interpretation of its entire pipeline from the perspective of optimization. Specifically, we theoretically justify that the entire pipeline of this framework accords with the solving process of an active self-paced learning (ASPL) optimization model. Such a theoretical understanding helps deliver more insight into the intrinsic mechanism underlying the ASPL system.
IV-A Active Self-paced Learning
In the context of face identification, suppose that we have $n$ facial photos which are taken from $m$ subjects. Denote the training samples as $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{n} \subset \mathbb{R}^{d}$, where $\mathbf{x}_i$ is the $d$-dimensional feature representation for the $i$th sample. We have $m$ classifiers for recognizing each sample by the one-vs-all strategy.
Learned knowledge from data will be utilized to ameliorate our model after a period of pace increasing. Correspondingly, we denote the label set of $\mathbf{x}_i$ as $\mathbf{y}_i = \{y_i^{(j)} \in \{-1, 1\}\}_{j=1}^{m}$, where $y_i^{(j)}$ corresponds to the label of $\mathbf{x}_i$ for the $j$th subject. That is, if $y_i^{(j)} = 1$, this means that $\mathbf{x}_i$ is categorized as a face from the $j$th subject.
Two remarks on our problem setting are necessary. First, in the face identification problems we investigate, almost no data are labeled before the system runs; only a very small number of samples are annotated as the initialization. That is, most of the labels are unknown and need to be completed in the learning process. In our system, a minority of them are manually annotated by active users and the majority are pseudo-labeled in a self-paced manner. Second, the data might be fed into the system incrementally, which means that the data scale might keep growing.
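Via the proposed mechanism of combining SPL and AL, our ASPL model can adaptively handle both manually annotated and pseudo-labeled samples, and still progressively fit the consistently growing unlabeled data in an incremental manner. The ASPL is formulated as follows:
$$\min_{\mathbf{w}, \mathbf{b}, \mathbf{v},\, \{\mathbf{y}_i\}_{i \notin \Omega_\lambda}} \; \frac{1}{2} \sum_{j=1}^{m} \|\mathbf{w}^{(j)}\|_2^2 + C \sum_{i=1}^{n} \sum_{j=1}^{m} v_i^{(j)} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}) + \sum_{j=1}^{m} f(\mathbf{v}^{(j)}; \lambda_j), \quad \text{s.t. } \mathbf{v} \in \Psi_\lambda \tag{2}$$
where $\mathbf{w} = \{\mathbf{w}^{(j)}\}_{j=1}^{m} \subset \mathbb{R}^{d}$ and $\mathbf{b} = \{b^{(j)}\}_{j=1}^{m} \subset \mathbb{R}$ represent the weight and bias parameters of the decision functions for all $m$ classifiers. $C$ ($C > 0$) is the standard regularization parameter trading off the loss function and the margin, and we set $C$ to a fixed value in our experiments. $\mathbf{v} = \{\mathbf{v}^{(j)}\}_{j=1}^{m}$ with $\mathbf{v}^{(j)} = [v_1^{(j)}, \ldots, v_n^{(j)}]^T$ denotes the weight variables reflecting the training samples' importance, and $\lambda_j$ is a parameter (i.e., the pace age) for controlling the learning pace of the $j$th classifier. $f(\mathbf{v}^{(j)}; \lambda_j)$ is the self-paced regularizer controlling the learning scheme. We denote the index collection of all currently active annotated samples as $\Omega_\lambda = \bigcup_{j=1}^{m} \Omega_{\lambda_j}$, where $\Omega_{\lambda_j}$ corresponds to the set of the $j$th subject with the pace age $\lambda_j$. Here $\Omega_\lambda$ is introduced as a constraint on $\{\mathbf{y}_i\}$. $\Psi_\lambda = \bigcap_{i=1}^{n} \Psi_i^{\lambda}$ composes the curriculum constraint of the model at the classifiers' pace age $\lambda = \{\lambda_j\}_{j=1}^{m}$. In particular, we specify two alternative types of the curriculum constraint for each sample $\mathbf{x}_i$:

$\Psi_i^{\lambda} = [0, 1]$ is for the pseudo-labeled sample, i.e., $i \notin \Omega_\lambda$. Then, its importance weights with respect to all the classifiers need to be learned in the SPL optimization.

$\Psi_i^{\lambda} = \{1\}$ is for the sample annotated by the AL process, i.e., $i \in \Omega_\lambda$. Thus, its importance weights are deterministically set during the model training, i.e., $v_i^{(j)} = 1$ for all $j$.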
Each type of curriculum will be interpreted in detail in Section II. Note that, different from previous SPL settings, this curriculum can change dynamically with respect to the pace ages of all classifiers. This is the source of the superiority of our model, as we discuss at the end of this section.
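We then define the loss function on $\mathbf{x}_i$ as:
$$L(\mathbf{w}, \mathbf{b}; \mathbf{x}_i, \mathbf{y}_i) = \sum_{j=1}^{m} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}) = \sum_{j=1}^{m} \big(1 - y_i^{(j)} (\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)})\big)_+, \quad \text{s.t. } \sum_{j=1}^{m} |y_i^{(j)} + 1| \leq 2, \; y_i^{(j)} \in \{-1, 1\} \tag{3}$$
where $\big(1 - y_i^{(j)} (\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)})\big)_+$ is the hinge loss of $\mathbf{x}_i$ in the $j$th classifier. The cost term corresponds to the summarized loss of all classifiers, and the constraint term only allows two kinds of feasible solutions: i) for any $i$, there exists $y_i^{(j)} = 1$ while $y_i^{(k)} = -1$ for all other $k \neq j$; ii) $y_i^{(j)} = -1$ for all $j = 1, \ldots, m$ (i.e., background or an unknown person class). These samples will be added into the unknown sample set. It is easy to see that such a constraint complies with real cases, where a sample is either categorized into one prespecified subject or not classified into any of the current subjects.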
Referring to the known alternative search strategy, we can then solve this optimization problem. Specifically, the algorithm alternately updates the classifier parameters via one-vs-all SVM, the sample importance weights via the SPL, and the pseudo-labels via reranking. Along with the gradually increasing pace parameter, the optimization also updates: i) the curriculum constraint via AL and ii) the feature representation via CNN fine-tuning. In the following, we introduce the details of these optimization steps and give their physical interpretations. The correspondence of this algorithm to the practical implementation of the ASPL system will also be discussed at the end.
Initialization: As introduced in the framework, we initialize the system by using the pretrained CNN to extract feature representations of all samples. We set an initial pace parameter set for the classifiers, and initialize the curriculum constraint with the currently user-annotated samples and their corresponding labels and weights.
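Classifier Updating: This step aims to update the classifier parameters $\{\mathbf{w}, \mathbf{b}\}$ by one-vs-all SVM. Fixing $\mathbf{v}$ and $\{\mathbf{y}_i\}$, the original ASPL model in Eqn. (2) can be simplified into the following form:
$$\min_{\mathbf{w}, \mathbf{b}} \; \frac{1}{2} \sum_{j=1}^{m} \|\mathbf{w}^{(j)}\|_2^2 + C \sum_{i=1}^{n} \sum_{j=1}^{m} v_i^{(j)} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}),$$
which can be equivalently reformulated as solving the following independent sub-optimization problems for each classifier $j = 1, \ldots, m$:
$$\min_{\mathbf{w}^{(j)}, b^{(j)}} \; \frac{1}{2} \|\mathbf{w}^{(j)}\|_2^2 + C \sum_{i=1}^{n} v_i^{(j)} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}) \tag{4}$$
This is a standard one-vs-all SVM model with weights, which takes the samples of one class as positive and all others as negative. Specifically, when the weights $v_i^{(j)}$ only take values in $\{0, 1\}$, it corresponds to a simplified SVM model on the sampled instances with $v_i^{(j)} = 1$; otherwise, when $v_i^{(j)}$ takes values in $[0, 1]$, it corresponds to the weighted SVM model. Both can be readily solved by many efficient off-the-shelf solvers. Thus, this step can be interpreted as implementing one-vs-all SVM over the instances manually annotated by the AL and self-annotated by the SPL.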
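As an illustration of the weighted one-vs-all SVM step, the following minimal Python sketch evaluates a sample-weighted hinge-loss objective for a single classifier. It assumes the standard soft-margin form (half the squared norm of the weight vector plus C times the weighted hinge losses); the function names `hinge` and `weighted_svm_objective` and the toy data are hypothetical and not taken from the original system.

```python
def hinge(score, label):
    """Hinge loss of a decision score for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

def weighted_svm_objective(w, b, xs, ys, vs, C=1.0):
    """Margin term plus C times the importance-weighted hinge losses of one classifier."""
    margin = 0.5 * sum(wk * wk for wk in w)
    loss = 0.0
    for x, y, v in zip(xs, ys, vs):
        score = sum(wk * xk for wk, xk in zip(w, x)) + b
        loss += v * hinge(score, y)
    return margin + C * loss

# Hypothetical 2-D features with one-vs-all labels for a single subject.
w, b = [1.0, 0.0], 0.0
xs = [[2.0, 0.0], [0.5, 0.0], [0.5, 0.0]]
ys = [1, 1, -1]
vs = [1.0, 0.5, 0.0]  # weight 1: fully trusted; 0.5: soft pseudo-label; 0: excluded

objective = weighted_svm_objective(w, b, xs, ys, vs, C=1.0)
```

A sample with weight zero contributes nothing to the objective, which is how unselected samples drop out of training; restricting all weights to 0 or 1 recovers an ordinary SVM on the selected subset.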
High-confidence Sample Labeling: This step aims to assign pseudo-labels and corresponding importance weights to the top-ranked samples of high confidence.
We start by employing the SPL to rank the unlabeled samples according to their importance weights $\mathbf{v}$. Under fixed $\mathbf{w}$, $\mathbf{b}$ and $\{\mathbf{y}_i\}$, our ASPL model in Eqn. (2) can be simplified to optimize $\mathbf{v}$ as:
$$\min_{\mathbf{v} \in [0,1]} \; C \sum_{i=1}^{n} \sum_{j=1}^{m} v_i^{(j)} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}) + \sum_{j=1}^{m} f(\mathbf{v}^{(j)}; \lambda_j), \quad \text{s.t. } \mathbf{v} \in \Psi_\lambda \tag{5}$$
This problem then degenerates to a standard SPL problem as in Eqn. (1). Since both the self-paced regularizer $f(\mathbf{v}^{(j)}; \lambda_j)$ and the curriculum constraint $\Psi_\lambda$ are convex (with respect to $\mathbf{v}$), various existing convex optimization techniques, such as gradient-based or interior-point methods, can be used for solving it. Note that we have multiple choices for the self-paced regularizer, such as those built in [16, 15]. All of them comply with the three axiomatic conditions required for a self-paced regularizer, as defined in Section II.
Based on the second axiomatic condition for the self-paced regularizer, any of the above regularizers inclines to assign larger weights to high-confidence (i.e., easy) samples with smaller loss values and vice versa, which evidently equips the model with the “learning from easy to hard” insight. In all our experiments, we utilize the linear soft weighting regularizer due to its relatively easy implementation and good adaptability to complex scenarios. This regularizer penalizes the sample weights linearly in terms of the loss. Specifically, we have
$$f(\mathbf{v}^{(j)}; \lambda_j) = \lambda_j \Big( \frac{1}{2} \|\mathbf{v}^{(j)}\|_2^2 - \sum_{i=1}^{n} v_i^{(j)} \Big) \tag{6}$$
where $\lambda_j > 0$. Eqn. (6) is convex with respect to $\mathbf{v}^{(j)}$, and we can thus search for the global optimum by setting the partial gradient to zero. Considering $v_i^{(j)} \in [0, 1]$, we deduce the analytical solution for the linear soft weighting as
$$v_i^{(j)} = \begin{cases} -\dfrac{C \ell_{ij}}{\lambda_j} + 1, & C \ell_{ij} < \lambda_j \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
where $\ell_{ij} = \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)})$ is the loss of $\mathbf{x}_i$ in the $j$th classifier. Note that the derivation of Eqn. (7) is similar to that in [16], but our resulting solution is different since our ASPL model in Eqn. (2) is new.
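The linear soft weighting step can be sketched in a few lines of Python. The sketch assumes the closed form in which a sample's weight falls linearly from 1 to 0 as its loss, scaled by C, approaches the pace parameter, and is zero beyond it; this is the form commonly associated with the linear regularizer of [16], and the function name `linear_soft_weights` and the loss values are hypothetical.

```python
def linear_soft_weights(losses, lam, C=1.0):
    """Weight 1 - C*loss/lam when C*loss < lam, and 0 otherwise (assumed closed form)."""
    if lam <= 0:
        raise ValueError("the pace parameter lam must be positive")
    return [max(0.0, 1.0 - C * loss / lam) for loss in losses]

# Hypothetical hinge losses of four unlabeled samples under one classifier.
losses = [0.0, 0.25, 0.5, 2.0]

weights_early = linear_soft_weights(losses, lam=1.0)  # smaller pace age
weights_late = linear_soft_weights(losses, lam=4.0)   # larger pace age

# Rank the samples by weight in descending order; the top ones are high-confidence.
ranking = sorted(range(len(losses)), key=lambda i: weights_early[i], reverse=True)
```

With the smaller pace age the hardest sample receives weight zero, whereas with the larger pace age it obtains a positive weight, which mirrors how more complex samples are admitted as the model matures.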
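After obtaining the weights $\mathbf{v}$, we rank all unlabeled samples ($i \notin \Omega_\lambda$) according to the optimized $\mathbf{v}$ in descending order. We then regard the samples with larger importance weights than the others as high-confidence ones. We form these samples into a high-confidence sample set $\mathcal{S}$ and assign them pseudo-labels: fixing $\{\mathbf{w}, \mathbf{b}, \mathbf{v}\}$, we optimize $\mathbf{y}_i$ in Eqn. (2), which corresponds to solving
$$\min_{\mathbf{y}_i \in \{-1, 1\}^m,\, i \in \mathcal{S}} \; \sum_{i \in \mathcal{S}} \sum_{j=1}^{m} v_i^{(j)} \ell(\mathbf{w}^{(j)}, b^{(j)}; \mathbf{x}_i, y_i^{(j)}), \quad \text{s.t. } \sum_{j=1}^{m} |y_i^{(j)} + 1| \leq 2 \tag{8}$$
where $\mathbf{v}$ is fixed and can be treated as constant. When $\mathbf{x}_i$ belongs to a certain person class, Eqn. (8) has an optimum, which can be exactly extracted by Theorem 1. The proof is given in the supplementary material.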
Denote those $j$'s that satisfy $\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)} \neq 0$ and $v_i^{(j)} \in (0, 1]$ as a set $M$, and set $y_i^{(j)} = -1$ for all others by default (see Footnote 2). The solution of Eqn. (8) for $y_i^{(j)}$, $j \in M$, can be obtained by the following theorem.
Footnote 2: $v_i^{(j)} = 0$ actually implies that the $i$th sample has low confidence of being annotated as the $j$th class, and it is thus natural to pseudo-label it as a negative sample for the $j$th class. $\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)} = 0$ implies that the sample is located on the classification boundary of the $j$th class; it is therefore also a low-confidence sample and we directly annotate it as negative. In fact, pseudo-labeling these samples as positive or negative does not affect the value of the objective function of Eqn. (8). We annotate these low-confidence samples as negative because, under the constraint of Eqn. (8) (at most one positive class is allowed per sample), this does not influence the selection of a more rational positive class for each sample.
Theorem 1
(a) If $\forall j \in M$, $\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)} < 0$, Eqn. (8) has a solution: $y_i^{(j)} = -1$ for all $j = 1, \ldots, m$.
(b) When $\forall j \in M$ except $j^*$, $\mathbf{w}^{(j)T} \mathbf{x}_i + b^{(j)} < 0$, i.e., $\mathbf{w}^{(j^*)T} \mathbf{x}_i + b^{(j^*)} > 0$, then Eqn. (8) has a solution: $y_i^{(j)} = 1$ if $j = j^*$, and $y_i^{(j)} = -1$ otherwise.
Actually, only those high-confidence samples with positive weights, as calculated in the last updating step for the importance weights, are meaningful for the solution. This implies the physical interpretation of this optimization step: we iteratively find the high-confidence samples based on the current classifiers, and then enforce pseudo-labels on the top-ranked high-confidence ones. This is exactly the mechanism underlying the reranking technique [16].
The above optimization process can be understood as the self-learning manner of a student. The student tends to pick up the most high-confidence samples, which imply easier aspects and faithful knowledge underlying the data, to learn under the regularization of the predesigned curriculum. Such regularization inclines to rectify his/her learning process so as to prevent him/her from getting stuck at an unexpected overfitting point.
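The pseudo-labeling rule described above can be sketched in Python for a single sample. The sketch assumes hinge losses and the stated constraint that at most one class may be labeled positive: starting from all-negative labels, it flips the one class whose flip yields the largest reduction of the weighted loss, if any such class exists. The function names and the tie handling are illustrative assumptions, not the exact statement of Theorem 1.

```python
def hinge(score, label):
    """Hinge loss of a decision score for a label in {-1, +1}."""
    return max(0.0, 1.0 - label * score)

def pseudo_label(scores, weights):
    """Assign labels in {-1, +1} to one sample, with at most one positive class.

    scores[j]  : decision value of the j-th classifier on the sample
    weights[j] : importance weight of the sample for the j-th classifier
    """
    labels = [-1] * len(scores)
    best_j, best_gain = None, 0.0
    for j, (s, v) in enumerate(zip(scores, weights)):
        # Reduction of the weighted loss if class j is flipped from -1 to +1.
        gain = v * (hinge(s, -1) - hinge(s, +1))
        if gain > best_gain:
            best_j, best_gain = j, gain
    if best_j is not None:
        labels[best_j] = 1
    return labels

one_positive = pseudo_label([-1.2, 0.8, -0.3], [0.9, 0.6, 0.0])
all_negative = pseudo_label([-0.5, -2.0], [1.0, 1.0])
zero_weight = pseudo_label([0.5], [0.0])
```

A class with zero weight or with a decision value of exactly zero gives no gain and therefore stays negative, in line with the default handling of low-confidence classes described in the text.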
Low-confidence Sample Annotating: After pseudo-labeling high-confidence samples in such a self-paced uncertainty modeling, we employ AL to update the curriculum constraint in the model by supplementing more informative curriculums based on human knowledge. The AL process aims to select the most low-confidence unlabeled samples and to annotate them as either positive or negative by requesting user annotation. Our selection criteria are based on the classical uncertainty-based strategy [27, 28]. Specifically, given the current classifiers, we randomly collect a number of unlabeled samples, which are usually located in the low-confidence area near the classification boundaries.
1) Annotated Sample Verifying: Considering that the user annotation may contain outliers (incorrectly annotated samples), we introduce a verification step to correct the wrongly annotated samples. Assuming that labeled samples with lower prediction scores from the current classifiers have a higher probability of being incorrectly labeled, we ask the active user to verify their annotations on these samples. Specifically, in this step we first employ the current classifiers to obtain the prediction scores of all the annotated samples. Then we rerank them, select the top L ones with the lowest prediction scores, and ask the user to verify these selected samples, i.e., to double-check them. We can set L to a small number (L = 5 in our experiments), since we believe the chance of a human making mistakes is low. In sum, we improve the robustness of the AL process by further validating the top-L most uncertain samples with the user. In this way, we can reduce the effects of accumulated human annotation errors and enable the classifier to be trained in a robust manner.
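As a rough illustration of this verification step, the sketch below ranks annotated samples by prediction score and returns the lowest-scoring ones for double-checking. The function name `select_for_verification` and the dictionary layout are hypothetical; the default of 5 follows the value of L stated in the text.

```python
def select_for_verification(annotated_scores, num_to_verify=5):
    """Return ids of the annotated samples with the lowest prediction scores.

    annotated_scores maps a sample id to the score that the current
    classifier assigns to that sample for its user-given class
    (this data layout is an assumption made for the sketch).
    """
    ranked = sorted(annotated_scores, key=annotated_scores.get)
    return ranked[:num_to_verify]
```

For example, with scores `{"a": 0.9, "b": -0.3, "c": 0.1, "d": 1.4}` and `num_to_verify=2`, the samples `"b"` and `"c"` would be sent back to the user.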
2) Low-confidence Definition: When we utilize the current classifiers ( classifiers for discriminating object categories) to predict the labels of unlabeled samples, those predicted as more than two positive labels (i.e., predicted as the corresponding object category) are the samples that make the current classifiers ambiguous. We thus adopt them as so-called "low-confidence" samples and require the active user to manually annotate them. In this step, other "low-confidence" criteria can also be utilized. We employ this simple strategy because of its intuitive rationality and efficiency.
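A minimal sketch of this selection rule follows. It assumes a score above zero counts as a positive prediction, and it reads "more than two positive labels" literally as at least three; the threshold is exposed as the hypothetical parameter `min_positives` so that a looser reading (two or more) can be tried as well.

```python
def select_low_confidence(score_matrix, min_positives=3):
    """Return indices of samples accepted by at least `min_positives` classifiers.

    score_matrix[i][j] is the decision score of classifier j on sample i;
    a score above zero is counted as a positive prediction (an assumption).
    """
    selected = []
    for index, scores in enumerate(score_matrix):
        positives = sum(1 for s in scores if s > 0)
        if positives >= min_positives:
            selected.append(index)
    return selected
```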
After users perform manual annotation, we update the by additionally incorporating the newly annotated sample set into the current curriculum . For each annotated sample, our AL process includes the following two operations: i) Set its curriculum constraint, i.e., ; ii) Update its labels and add its index into the set of currently annotated samples . Such a specified curriculum still complies with the axiomatic conditions for the curriculum constraint as defined in [14]. For those annotated samples, the corresponding with expectation value over the whole set, while for others with expectation value . Thus the more informative samples still have a larger expectation than the others. Also, it is easy to see that is nonempty and convex. It thus complies with the traditional curriculum understanding.
New Class Handling: After the AL process, if the active user annotates the selected unlabeled samples with unseen person classes, new classifiers for these unseen classes need to be initialized without affecting the existing classifiers. Moreover, there is another difficulty: the samples of a new class are not enough for classifier training. Thanks to the proposed ASPL framework, we can employ the following steps to address the above-mentioned issues.

For each of these new class samples, search all the unlabeled samples and pick out its nearest neighbors from the unseen class set in the feature space;

Require the active user to annotate these selected neighbors to enrich the positive samples for these new person classes;

Initialize and update for these new person classes according to the above-mentioned iteration process of {initialization, classifier updating, high-confidence sample labeling, low-confidence sample annotating}.
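The neighbor search in the first step above can be sketched as follows. The text only says "nearest neighbors in the feature space", so the Euclidean metric, the function name `nearest_unlabeled` and the parameter `k` are assumptions made for illustration.

```python
import math


def nearest_unlabeled(query, unlabeled, k):
    """Return indices of the k unlabeled feature vectors closest to `query`.

    Euclidean distance is assumed here; the source does not name the metric.
    """
    order = sorted(range(len(unlabeled)),
                   key=lambda i: math.dist(query, unlabeled[i]))
    return order[:k]
```

The returned indices identify the unlabeled samples that would be shown to the user for annotation in the second step.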
This step corresponds to the instructor’s role in human education, which aims to guide a student to involve more informative curriculums in learning. Different from the fixed curriculum setting used throughout the learning process in previous SPL, here the curriculum is dynamically updated based on the self-paced learned knowledge of the model. Such an improvement better simulates the general learning process of a good student. As the learned knowledge of a student increases, his/her instructor should vary the curriculum settings imposed on him/her, from more in the early stage to less in the later stage. This learning manner should lead to a better learning effect that adapts well to the personal information of the student.
Feature Representation Updating: After several of the SPL and AL updating iterations of {}, we now aim to update the feature representation by fine-tuning the pre-trained CNN with all manually labeled samples from the AL and self-annotated ones from the SPL. These samples tend to deliver data knowledge into the network and improve the representation of the training samples. A better feature representation is thus expected to be extracted from this improved CNN.
This learning process simulates the updating of the knowledge structure of a human brain after a period of domain learning. Such updating helps a person grasp more effective features to represent newly coming samples from a certain domain and gives him/her a better learning performance. In our experiments, we generally conduct the CNN feature fine-tuning after around 5 rounds of the SPL and AL updating, and the learning rate is set as 0.001 for all layers.
Pace Parameter Updating: We utilize a heuristic strategy to update the pace parameters for the classifiers in our implementation. After multiple iterations of the ASPL, we set the pace parameter specifically for each individual classifier. For the th iteration, we compute the pace parameter for optimizing Eqn. (2) by:
(10) 
where is the average accuracy of the th classifier in the current iteration, and is a parameter which controls the pace increasing rate. In our experiments, we empirically set . Note that the updating of the pace parameters should stop when all training samples are with . Thus, we introduce an empirical threshold constraining that is only updated in early iterations, i.e., . is set as 12 in our experiments.
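The exact form of Eqn. (10) is not recoverable from this text, so the following is only a sketch of one schedule that is consistent with the description: the pace grows by the rate parameter times the classifier's average accuracy in early iterations and is frozen afterwards. The additive form, the function name `update_pace` and the default `rate=0.1` are assumptions; the threshold of 12 iterations is the value stated above.

```python
def update_pace(previous_pace, avg_accuracy, iteration, rate=0.1, stop_after=12):
    """Return the pace parameter of one classifier for the given iteration.

    Assumed schedule: additive growth scaled by the classifier's average
    accuracy while iteration <= stop_after, then no further change.
    """
    if iteration > stop_after:
        return previous_pace           # frozen in later iterations
    return previous_pace + rate * avg_accuracy
```

Under this assumed schedule, a more accurate classifier raises its pace faster and therefore admits more pseudo-labeled samples per round, which matches the role of the rate parameter described in the text.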
The entire algorithm is summarized in Algorithm 1. It is easy to see that this solving strategy for the ASPL model closely follows the pipeline of our framework.
Convergence Discussion: As illustrated in Algorithm 1, the ASPL algorithm alternately updates the following variables: the classifier parameters , (by weighted SVM), the pseudo-labels (closed-form solution by Theorem 1), the importance weight (by SPL), and the low-confidence sample annotations (by AL). For the first three, each update is a global optimum of a subproblem of the original model, and thus the objective function is guaranteed to decrease. However, as in other existing AL techniques, human efforts are involved in the loop of the AL stage, and thus the objective function cannot be guaranteed to decrease monotonically in this step. Nevertheless, as shown in Sect. V, as learning proceeds the model becomes more and more mature, and the labor of AL becomes less and less in the later learning stage. Thus, with gradually less involvement of the AL calculation in our algorithm, the monotonic decrease of the objective function over iterations tends to hold, and thus our algorithm tends to converge.
IV-B Relationship with Other SPL/AL Models
It is easy to see that the proposed ASPL model extends the previous AL/SPL models and includes all of them as special cases. When we fix the curriculum and the feature representations and only update the other parameters, it degenerates to the traditional SPL models by rationally setting the self-paced regularizer. When we fix the SPL parameters and the feature representations and do not involve pseudo-labels in learning, the model degenerates to a general AL regime. The improvement over both SPL and AL is expected to bring benefits to both regimes. On one hand, introducing more high-confidence samples in the self-paced fashion helps to reduce the burden of user annotations, particularly when the classifier becomes reliable at later learning iterations. On the other hand, the low-confidence samples selected for active user annotation tend to make our approach workable with fewer initial labeled samples than existing self-paced learning algorithms. All these benefits are substantiated by our experiments.

Table I

| Dataset | # images | # persons | # images/person |
|---|---|---|---|
| CACD | 56,138 | 500 | 79-306 |
| CASIA-WebFace-Sub | 181,901 | 925 | 100-804 |

V Experiments
In this section, we first introduce the datasets and implementation setting, and then discuss the experimental results and comparisons with other existing approaches.
V-A Datasets and Setting
We adopt two public datasets in our experiments: the Cross-Age Celebrity Dataset (CACD) [42] and the CASIA-WebFace-Sub dataset [43].
CACD is a large-scale and challenging dataset for evaluating face recognition and retrieval. It contains a batch of images of celebrities collected from the Internet, which vary in age, pose, illumination, and occlusion. Only a subset of celebrities are manually annotated by Chen et al. [42]. For a more convincing evaluation, we augment this subset by labeling extra individuals and obtain a set of images in total.
The CASIA-WebFace dataset [43] is a large-scale face recognition dataset with 10,575 subjects/persons and 494,414 images. CASIA-WebFace is extremely challenging because its images are all collected from the Internet with different viewpoints and illumination under different scenes. Though the total number of persons/subjects in CASIA-WebFace is very large, the number of samples per person, varying from 3 to 804, is heavily unbalanced. For those persons who have very few samples (say below 100), the experimental analysis cannot be performed. Hence, we select a subset of the CASIA-WebFace dataset by discarding the persons with fewer than 100 samples to form the CASIA-WebFace-Sub dataset. The CASIA-WebFace-Sub dataset has 181,901 images of 925 persons. The detailed information of the above-mentioned datasets is summarized in Table I.
Experiment setting. We detect the facial points using the method proposed in [44] and align the faces based on the eye locations. The experiments on both datasets are conducted in the following steps. We first randomly select images of each individual to form the unlabeled training set, and the remaining samples are used for testing, following the setting of the existing active learning method [12]. Then, we randomly annotate samples of each person in the training set to initialize the classifier. To reduce the influence of randomness, we average the results over times of execution with different sample selections. All of the experiments are conducted on a common desktop PC with an i7 3.4GHz CPU and an NVIDIA Titan X GPU.
On the two above-mentioned datasets, we evaluate the performance of incremental face identification in two aspects: the recognition accuracy and the amount of user annotation in the incremental learning process. The recognition accuracy is defined as the rank-one rate for face identification. We compare our ASPL framework with several existing active learning algorithms and baseline methods under the same setting: i) CPAL (Convex Programming based Active Learning) [12]: annotates a few samples in each step based on prediction uncertainty and sample diversity; ii) CCAL (Confidence-based Active Learning via SVMs) [28]: selects only the one sample having the lowest prediction confidence; iii) AL_RAND: randomly selects unlabeled samples to be annotated during the training phase. This method discards all active learning techniques and can be considered as the lower bound; and iv) AL_ALL: all unlabeled samples are annotated for training the classifier. This method can be regarded as the upper bound (the best performance the classifier can achieve). For fair comparison, all of these methods utilize the same feature representation as ours in the beginning. As the training iterations increase, active user annotation is applied to the selected most informative and representative samples. Then, CNN fine-tuning is also exploited to improve the feature extractor for ASPL, CPAL, CCAL, AL_RAND and AL_ALL.
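For reference, the rank-one rate used as the accuracy measure can be computed as in the sketch below: a test sample counts as correct when its highest-scoring class is the true identity. The function name `rank_one_rate` and the score-matrix layout are assumptions made for illustration.

```python
def rank_one_rate(score_matrix, true_labels):
    """Fraction of test samples whose top-scoring class is the true class.

    score_matrix[i][j] is the score of class j for test sample i, and
    true_labels[i] is the index of the correct class for that sample.
    """
    correct = 0
    for scores, label in zip(score_matrix, true_labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(true_labels)
```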
Details of CNN implementation. The architecture of AlexNet [40] is utilized in all our experiments. Thanks to the good pre-training, the CNN updating is implemented only a few times during the ASPL iterations in all our experiments, each time containing no more than 5 CNN updating steps. We generally conduct the CNN steps after around 5 rounds of the SPL and AL updating, and the learning rate is set as 0.001 for all layers. Equal importance is imposed on the previous training examples and the newly labeled examples, and the CNN is updated using stochastic gradient descent with momentum 0.9 and weight decay 0.0005.
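To make the stated hyperparameters concrete, the sketch below applies one SGD step with momentum and weight decay to a flat list of weights. It uses the velocity form of the update described for AlexNet [40]; whether the authors' training framework uses exactly this convention is an assumption, and the function name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.001, momentum=0.9, weight_decay=0.0005):
    """Apply one SGD update in place and return the updated weights.

    velocity <- momentum * velocity - lr * (grad + weight_decay * weight)
    weight   <- weight + velocity
    Defaults follow the learning rate, momentum and weight decay in the text.
    """
    for i, (w, g) in enumerate(zip(weights, grads)):
        velocity[i] = momentum * velocity[i] - lr * (g + weight_decay * w)
        weights[i] = w + velocity[i]
    return weights
```

The velocity list must persist between calls, since the momentum term carries a fraction of the previous update into the next one.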
V-B Experimental Comparisons
The results on the two datasets are reported in Fig. 3(a) and Fig. 3(b), respectively, where we can observe how the recognition accuracy changes as more unlabeled samples are incrementally incorporated. On the CACD dataset, to achieve the same recognition accuracy, the ASPL model requires fewer annotations of the unlabeled data. On the other hand, ASPL outperforms the competing methods in accuracy given the same amount of annotations. ASPL still has superior performance as the iterations go on. Similar results and phenomena can be observed on the CASIA-WebFace-Sub dataset. As one can see, ASPL requires only about 40% and 45% of the annotations to achieve state-of-the-art performance on the CACD and CASIA-WebFace-Sub datasets, respectively, while the compared methods AL_RAND, CCAL and CPAL all require about 81% and 65%, respectively. Hence, our ASPL can perform as well as AL_ALL with far fewer annotations.
Note that the performances of AL_RAND and CCAL are relatively close, and similar results were reported in [12]. According to the explanation in [12], this comes from the fact that many samples have low prediction confidences and are not densely distributed in the feature space. Thus, random sample selection achieves results similar to those of CCAL.
V-C Component Analysis
To further analyze how different components contribute to performance, we implement several variants of our framework: i) ASPL (w/o FT): allowing both active and self-paced sample selection during learning while disabling the CNN fine-tuning, i.e., the feature extractor is kept the same as the training iterations go on; ii) ASPL (w/o SPL): discarding high-confidence sample pseudo-labeling via self-paced learning; iii) ASPL (w/o AL): ignoring low-confidence samples for active user annotation; iv) AL_ALL: fine-tuning the CNN and training classifiers with all the labels of the training samples; and v) AL_ALL (w/o FT): training classifiers with all the labels of the training samples without fine-tuning. The full version of our proposed model is denoted as ASPL, which allows the convolutional nets to be fine-tuned during the training process. We further evaluate the ASPL variants in the following aspects.
Contribution of different ASPL components. Using AL_ALL and AL_ALL (w/o FT) as the baselines, we gradually add the AL, SPL and fine-tuning components to ASPL. These experiments are executed on the CASIA-WebFace dataset. Fig. 4 illustrates the accuracy obtained using ASPL, ASPL (w/o FT), ASPL (w/o AL) and ASPL (w/o SPL). One can observe that each of the three components is useful in improving the recognition accuracy. In particular, the additional SPL component can significantly improve the recognition accuracy and reduce the number of annotated samples by automatically exploiting the majority of high-confidence samples for feature learning.
We also observe that the CNN feature fine-tuning can dramatically improve the recognition accuracy in the early steps. This is mainly because the information gain (i.e., individual appearance diversity) decreases as new samples are progressively introduced to the neural nets.
Analysis on initial samples. In SPL [18], the classifier is first trained using the initial samples. With the current classifier, easy samples are preferentially selected in the early training steps, and thus the performance of SPL is expected to rely heavily on the initial samples. Fortunately, by incorporating active learning, ASPL can evidently alleviate this problem. To verify this, we compare the performance of ASPL and SPL on 20 randomly selected individuals of the CASIA-WebFace-Sub dataset. The result is shown in Fig. 5. Given the same initialized feature representations, we also conduct experiments to analyze the performance versus different initial portions to be handled by AL on this dataset. The results are illustrated in Fig. 6.
As one can see from Fig. 5, with different initial samples, ASPL reaches similar/stable results as the training continues, while SPL still varies a lot. This result indicates that the AL component is effective in handling the poor initialization. Fig. 6 illustrates that though poor performance is obtained at the beginning, the performance of our model increases during the training process. In summary, our model is insensitive to the diversity and quantity of initial samples.

Table II

| # Class Number | 300 | 600 | 925 |
|---|---|---|---|
| ASPL (ALL) | 88.3% | 81.0% | 76.0% |
| ASPL | 88.3% | 81.6% | 76.0% |

Table III

| # iteration | 5 | 10 | 15 | 20 | 25 |
|---|---|---|---|---|---|
| ASPL (w/o FT) | 8.2% | 6.9% | 5.1% | 5.0% | 4.9% |
| ASPL | 4.5% | 4.1% | 3.4% | 3.3% | 3.3% |

Performance with new classes. To justify the effectiveness of our ASPL in handling unseen new classes, we conduct the following experiment on the CASIA-WebFace-Sub dataset: we compare the performance of incrementally giving some classes (our ASPL) with that of directly giving all person classes. Specifically, given all person classes, we initialize all the classifiers at the beginning of the training and optimize them without handling unseen new classes. We denote this variant as ASPL (ALL). The experimental result is shown in Table II: our proposed ASPL can handle unseen new classes effectively, without a substantial performance drop or even with slightly better performance, compared with the all-classes-given version ASPL (ALL).
Annotation required for large scale dataset. To demonstrate that our ASPL can be adopted under large-scale scenarios, we analyze the training phase of ASPL on the large-scale CASIA-WebFace-Sub dataset. In Fig. 7, the x-axis denotes the number of training iterations and the y-axis denotes the amount of required user annotation. The curve in Fig. 7 demonstrates that our proposed ASPL model requires relatively more annotations when the training iteration number is small. As the training continues, the amount of required annotations decreases, because the model gradually matures as it is incrementally improved in the learning process. This observation indicates that the burden of user annotations is indeed relieved when the classifier becomes reliable at the later learning stage of the proposed ASPL method. Moreover, as illustrated in Table III, with the increase of user annotations over time, ASPL can automatically assign more reliable pseudo-labels to the unlabeled samples selected in the self-paced way.
Robustness analysis. We further analyze the robustness of ASPL when noisy images are deliberately included, in two experiments. (i) Ex1: () noisy images are added to the initial samples for each individual. (ii) Ex2: noise-free initials are used, but () impostors are deliberately annotated during the training process. These experiments are conducted on the CASIA-WebFace-Sub dataset. To validate the effectiveness of the proposed annotated sample verifying step, we disable the verifying step and denote this modification as “Noise w/o VF”.
Fig. 8(a) shows the result of Ex1, where ASPL is initialized with different numbers of noisy images. In the early steps of the iteration, noisy data have a huge adverse effect on test accuracy. As the iteration number increases, the genuine data gradually dominate the results. Fig. 8(b) illustrates the result of Ex2, where noisy images are added to the labeled training set at the 2nd step of the iteration. We can see a sharp decline in the recognition accuracy. However, as ASPL training evolves and the number of iterations increases, accuracy similar to that obtained on the original clean data can be reached. Comparing “Noise (10/30/50%)” with “Noise (10/30/50%) w/o VF” in Fig. 8(a), one can see that with the verifying step, ASPL recovers from noisy images slightly faster. This justifies the effectiveness of the proposed annotated sample verifying step.
VI Conclusions
In this paper, we have introduced, first, an effective framework to solve incremental face identification, which builds classifiers by progressively annotating and selecting unlabeled samples in an active self-paced way, and second, a theoretical interpretation of the proposed framework pipeline from the perspective of optimization. Third, we have evaluated our approach on challenging scenarios and shown very promising results.
In the future, we will extend the system to support several video-based vision applications, which require a large amount of user annotations. The proposed framework provides a rational realization for this task by automatically distinguishing high-confidence samples, which can be easily and faithfully recognized by computers in a self-paced way, from low-confidence ones, which can be discovered by requesting user annotation.
Proof for Theorem 1
Our aim is to solve the following optimization problem:
(11) 
where is the hinge loss of in the th classifier. Specifically, we define the hinge loss as:
The constraint term
(12) 
allows two cases of what can be for all classifiers: (i) all items of are negative, i.e., . In this case, the input region proposal is assumed to be the background by the classifiers in the current optimization. (ii) Among all items of , one is positive and all others are negative. In this case, is categorized into a certain object class.
Before giving the solution of Eqn. (11), we first introduce the two necessary lemmas as follows:
Lemma 1: The solution of
(13) 
is:
We discuss the solution in three cases:
(i) When , it is easy to see that
Thus the global solution of Eqn. (13) is .
(ii) When , similar to (i), one can easily prove that is the global solution in this case.
(iii) When , whether = or , will have the same value . Thus both and are the global solution of Eqn. (13).
Lemma 2: The solution of
(14) 
is:
For , since is a positive constant, the solution of Eqn. (14) is the same as that of Eqn. (13). While if , for both or , will have the same value . The conclusion is thus evident.
As one can easily see from Lemma 2, when or , the optimal for Eqn. (14) can be either or . Thus in all components of Eqn. (11) with or , we can easily assume that the corresponding solution is , i.e., , which will not affect the soundness and final values of the optimal solution of Eqn. (11).
Denote those s that satisfy and as a set and set all for others in default. The solution of Eqn. (11) for can be obtained by the following theorem.
Theorem 2
(a) If , , Eqn. (11) has a solution:
(b) When except , , i.e., , then Eqn. (11) has a solution:
In the cases and , it is easy to see that the provided is actually the solution of the unconstrained problem of Eqn. (11). Since this solution complies with the constraint, it is also the solution of the constrained problem.
In the case , there are more than two samples with positive confidence scores, i.e., . In this case, it is impossible that the final solution is
since if we let for any one sample satisfying , the value of the objective function will decrease with respect to .
Then there will be a unique where the final solution should have . We only need to pick the one at which the objective of Eqn. (11) attains its minimal value.
References
 [1] Fabio Celli, Elia Bruni, and Bruno Lepri, “Automatic personality and interaction style recognition from facebook profile pictures”, in ACM Conference on Multimedia, 2014.
 [2] Zak Stone, Todd Zickler, and Trevor Darrell, “Toward large-scale face recognition using social network context”, Proceedings of the IEEE, vol. 98, 2010.
 [3] Z. Lei, D. Yi, and S. Z. Li, “Learning stacked image descriptor for face recognition”, IEEE Transactions on Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1–1, 2015.
 [4] S. Liao, A. K. Jain, and S. Z. Li, “Partial face recognition: Alignment-free approach”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1193–1205, 2013.
 [5] D. Yi, Z. Lei, and S. Z. Li, “Towards pose robust face recognition”, in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, 2013, pp. 3539–3545.
 [6] Xiangyu Zhu, Z. Lei, Junjie Yan, D. Yi, and S. Z. Li, “High-fidelity pose and expression normalization for face recognition in the wild”, in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 787–796.
 [7] Yi Sun, Xiaogang Wang, and Xiaoou Tang, “Hybrid deep learning for face verification”, in Proc. of IEEE International Conference on Computer Vision, 2013.
 [8] X. Wang, X. Guo, and S. Z. Li, “Adaptively unified semi-supervised dictionary learning with active points”, in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1787–1795.
 [9] Yu-Feng Li and Zhi-Hua Zhou, “Towards making unlabeled data never hurt”, IEEE Trans. Pattern Anal. Mach. Intelligence, vol. 37, no. 1, pp. 175–188, 2015.
 [10] Haitao Zhao et al., “A novel incremental principal component analysis and its application for face recognition”, SMC, IEEE Transactions on, 2006.
 [11] Tae-Kyun Kim, Kwan-Yee Kenneth Wong, Björn Stenger, Josef Kittler, and Roberto Cipolla, “Incremental linear discriminant analysis using sufficient spanning set approximations”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [12] Ehsan Elhamifar, Guillermo Sapiro, Allen Yang, and S. Shankar Sastry, “A convex optimization framework for active learning”, in Proc. of IEEE International Conference on Computer Vision, 2013.
 [13] K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin, “Cost-effective active learning for deep image classification”, IEEE Transactions on Circuits and Systems for Video Technology, vol. PP, no. 99, pp. 1–1, 2016.
 [14] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander G Hauptmann, “Self-paced curriculum learning”, in Proc. of AAAI Conference on Artificial Intelligence, 2015.
 [15] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann, “Self-paced learning with diversity”, in Proc. of Advances in Neural Information Processing Systems, 2014.
 [16] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann, “Easy samples first: self-paced reranking for zero-example multimedia search”, in ACM Conference on Multimedia, 2014.
 [17] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston, “Curriculum learning”, in Proc. of IEEE International Conference on Machine Learning, 2009.
 [18] M Pawan Kumar et al., “Self-paced learning for latent variable models”, in Proc. of Advances in Neural Information Processing Systems, 2010.
 [19] Guosheng Hu, Yongxin Yang, Dong Yi, Josef Kittler, William Christmas, Stan Z. Li, and Timothy Hospedales, “When face recognition meets with deep learning: An evaluation of convolutional neural networks for face recognition”, in The IEEE International Conference on Computer Vision (ICCV) Workshops, 2015.
 [20] Yann LeCun, Koray Kavukcuoglu, and Clément Farabet, “Convolutional networks and applications in vision”, in ISCAS, 2010.
 [21] K. Wang, L. Lin, W. Zuo, S. Gu, and L. Zhang, “Dictionary pair classifier driven convolutional neural networks for object detection”, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016, pp. 2138–2146.
 [22] Chenqiang Gao, Deyu Meng, Wei Tong, Yi Yang, Yang Cai, Haoquan Shen, Gaowen Liu, Shicheng Xu, and Alexander Hauptmann, “Interactive surveillance event detection through mid-level discriminative representation”, in ACM International Conference on Multimedia Retrieval, 2014.
 [23] Lindsay I Smith, “A tutorial on principal components analysis”, Cornell University, USA, vol. 51, pp. 52, 2002.
 [24] Masayuki Karasuyama and Ichiro Takeuchi, “Multiple incremental decremental learning of support vector machines”, in Proc. of Advances in Neural Information Processing Systems, 2009.
 [25] Nan-Ying Liang et al., “A fast and accurate online sequential learning algorithm for feedforward networks”, Neural Networks, IEEE Transactions on, 2006.
 [26] Seiichi Ozawa et al., “Incremental learning of feature space and classifier for face recognition”, Neural Networks, vol. 18, 2005.
 [27] David D Lewis and William A Gale, “A sequential algorithm for training text classifiers”, in ACM SIGIR Conference, 1994.
 [28] Simon Tong and Daphne Koller, “Support vector machine active learning with applications to text classification”, The Journal of Machine Learning Research, vol. 2, 2002.
 [29] Andrew Kachites McCallum and Kamal Nigam, “Employing EM and pool-based active learning for text classification”, in Proc. of IEEE International Conference on Machine Learning, 1998.
 [30] Ajay J Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos, “Multiclass active learning for image classification”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2009.
 [31] Ashish Kapoor, Gang Hua, Amir Akbarzadeh, and Simon Baker, “Which faces to tag: Adding prior constraints into active learning”, in Proc. of IEEE International Conference on Computer Vision, 2009.
 [32] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell, “Active learning with gaussian processes for object categorization”, in Proc. of IEEE International Conference on Computer Vision, 2007.
 [33] Xin Li and Yuhong Guo, “Adaptive active learning for image classification”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [34] Klaus Brinker, “Incorporating diversity in active learning with support vector machines”, in Proc. of IEEE International Conference on Machine Learning, 2003.
 [35] Qian Zhao, Deyu Meng, Lu Jiang, Qi Xie, Zongben Xu, and Alexander G Hauptmann, “Self-paced learning for matrix factorization”, in Proc. of AAAI Conference on Artificial Intelligence, 2015.
 [36] M Pawan Kumar, Haithem Turki, Dan Preston, and Daphne Koller, “Learning specific-class segmentation from diverse data”, in Proc. of IEEE International Conference on Computer Vision, 2011.
 [37] Yong Jae Lee and Kristen Grauman, “Learning the easy things first: Self-paced visual category discovery”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 [38] JS Supancic and Deva Ramanan, “Self-paced learning for long-term tracking”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [39] S. Yu et al., “CMU-Informedia@TRECVID 2014 multimedia event detection”, in TRECVID Video Retrieval Evaluation Workshop, 2014.
 [40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “ImageNet classification with deep convolutional neural networks”, in Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012.
 [41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition”, in ICLR, 2015.
 [42] Bor-Chun Chen, Chu-Song Chen, and Winston H Hsu, “Cross-age reference coding for age-invariant face recognition and retrieval”, in ECCV, 2014.
 [43] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z. Li, “Learning face representation from scratch”, CoRR, vol. abs/1411.7923, 2014.
 [44] Xuehan Xiong and Fernando De la Torre, “Supervised descent method and its applications to face alignment”, in Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 2013.