have been attracting attention in machine learning and artificial intelligence. Both learning paradigms are designed by simulating the learning principle of humans and animals: learning starts from easier examples and gradually includes more complex ones in the training process. The CL regime Bengio2009Curriculum ; CLApp1 ; CLAPP3 was originally designed by manually setting a series of learning curriculums that rank samples from easy to hard. The SPL methodology Kumar2010Self was later proposed to make this easy-to-hard learning manner automatically implementable by imposing a regularization term on a general learning objective, which enables the learning machine to objectively evaluate the “easiness” of a sample and automatically learn the objective in an adaptive way. This learning paradigm has been empirically verified to help alleviate the local-minimum issue in non-convex optimization problems Zhao2015Self , and was later more comprehensively verified to make the learning method more robust to heavy noise and outliers Meng2015What . Recently, this learning regime has been applied to many practical problems, such as multimedia event detection Jiang2014Easy , neural network training avramova2015curriculum , matrix factorization Zhao2015Self , multi-view clustering Xu2015Multi , multi-task learning li2016self , boosting classification PiSelf , object tracking Supancic2013 , person re-identification reid2013 , face identification LinLiang2018 , and object segmentation Objectsegmentation , and some related mechanisms have been applied to weakly supervised learning liang2015towards ; wei2017stc ; liang2017learning . Furthermore, an intrinsically advanced version of CL/SPL, called self-paced curriculum learning (SPCL) Jiang2015Self , has been designed, which tends to inherit the advantages of both SPL and CL and to have broader applications jiang2015bridging . Besides, many variations of SPL realization schemes have also been constructed, such as self-paced reranking Jiang2014Easy , self-paced multiple instance learning SPMIL ; SPMIL-PAMI , self-paced learning with diversity SPLD , multi-objective self-paced learning multiobjective2016 , self-paced co-training cotraining2017 , etc.
To understand the working mechanism underlying the CL/SPL strategy, several theoretical investigations have been made. Meng et al. Meng2015What proved that the alternative search algorithm generally used to solve the self-paced learning problem is equivalent to a majorization-minimization algorithm implemented on a latent SPL objective function, which is closely related to the non-convex penalties used in statistics and machine learning Meng2015What . This yields a natural explanation for the intrinsic robustness of CL/SPL. More recently, they further proved that the SPL scheme converges to a critical point of the latent objective SPLConverge . Afterwards, Fan et al. Fan2016Self explored an implicit regularization perspective of self-paced learning, which leads to a similar understanding of the robustness of this learning regime. Recently, Li et al. Li2017Self proposed a general way to find the desired self-paced functions, which is beneficial for constructing more variations of SPL forms in practice.
However, these investigations explore the SPL theory mainly through the equivalence between the alternative search algorithm on the SPL objective and other algorithms implemented on some latent objective function, rather than through the SPL objective function and its self-paced regularizer themselves. This makes the theory not sufficiently insightful. For example, the intrinsic relationship between the self-paced regularizer and the weighting scheme that measures the importance of training samples in an SPL model is generally implicit and hard to explain intuitively. Besides, after a curriculum constraint is added to the SPL regime to form an SPCL model, current theories cannot attain the latent objective function as they do under the general SPL framework. The rationality of SPCL thus still rests on the intuitive level.
To alleviate these issues, this study makes the following contributions. Firstly, we establish a systematic theoretical framework under concave conjugacy theory for understanding CL/SPL/SPCL. We find that concave conjugacy theory fits the requirements of the SPL model remarkably well. Under this framework, the relationship among the self-paced regularizer, the latent SPL objective function and the sample weights can be clarified in a theoretically sound manner. Besides, by using this theory, the redundancy in the original SPL axioms can be removed, and the influence of the age parameter can be interpreted. Secondly, this theory yields a general approach for designing the SPL regime. Furthermore, one can easily embed the required prior knowledge directly into the sample weights under this framework so that it is properly used in specific applications. Thirdly, the latent objective of SPCL can be obtained under this theory. In particular, we discuss the form of the latent objective functions of SPCL under the partial order and group curriculums. This theory is thus meaningful for providing a generalizable explanation for more general CL/SPL variations.
The paper is organized as follows. Section 2 introduces the necessary concepts and theories on concave conjugacy. Section 3 proposes the concave conjugacy theory for understanding CL/SPL. Section 4 presents two general approaches for designing a specific SPL model. Section 5 provides the theoretical understanding for SPCL under this new theory, and discusses the latent objectives of two specific curriculums.
2 Related contents on concave conjugacy
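In the following we use bold lowercase letters to denote vectors and non-bold lowercase letters to denote scalars. For $\mathbf{x} \in \mathbb{R}^n$ and $y \in \mathbb{R}$, denote $(\mathbf{x}, y)$ as the vector in $\mathbb{R}^{n+1}$ obtained by arranging $y$ after the last position of $\mathbf{x}$. The inequality $\mathbf{x} \ge \mathbf{y}$ means that $x_i \ge y_i$ for $i = 1, \dots, n$; $\langle \mathbf{x}, \mathbf{y} \rangle$ denotes the inner product of $\mathbf{x}$ and $\mathbf{y}$. For a concave function, we assume that it takes $-\infty$ outside its domain; for a convex function, we assume that it takes $+\infty$ outside its domain. Before giving more related concepts, we first present the following definition.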
Definition 1 (Increasing Function).
A multivariate function $f$ is increasing if $f(\mathbf{x}) \ge f(\mathbf{y})$ for all $\mathbf{x} \ge \mathbf{y}$ lying in its domain, denoted by $\mathrm{dom}\, f$.
We first present some necessary concepts and their related properties on the conjugate theory.
Definition 2 (Hypograph).
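The hypograph associated with the function $f$ is the set of points lying on or below its graph: $\mathrm{hyp}\, f = \{(\mathbf{x}, t) : \mathbf{x} \in \mathbb{R}^n,\ t \in \mathbb{R},\ t \le f(\mathbf{x})\}$.
Property 1 (Hypograph Correspondence Rockafellar1970 ).
The function $f$ and its hypograph satisfy the following correspondence: $f(\mathbf{x}) = \sup\{t : (\mathbf{x}, t) \in \mathrm{hyp}\, f\}$.
Property 2 (Concave function).
$f$ is a concave function if and only if $\mathrm{hyp}\, f$ is a convex set.
Definition 3 (Closure of Function).
The closure of the function $f$ is the function $\mathrm{cl}\, f$ generated by the closure of its hypograph: $\mathrm{hyp}(\mathrm{cl}\, f) = \mathrm{cl}(\mathrm{hyp}\, f)$.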
Definition 4 (Concave Conjugate).
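The concave conjugate of a function $f$ is defined as follows: $f^*(\mathbf{v}) = \inf_{\mathbf{x}} \{\langle \mathbf{x}, \mathbf{v} \rangle - f(\mathbf{x})\}$.
Property 3 (Relation of Concave Conjugate and Convex Conjugate Rockafellar1970 ).
For a convex function $g$, it holds that: $(-g)^*(\mathbf{v}) = -g^{c}(-\mathbf{v})$,
where $g^{c}$ is the convex conjugate of $g$ defined as: $g^{c}(\mathbf{v}) = \sup_{\mathbf{x}} \{\langle \mathbf{x}, \mathbf{v} \rangle - g(\mathbf{x})\}$.
For notational convenience, in the following we also use conjugate to mean concave conjugate.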
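As a small numerical illustration of the concave conjugate, the following sketch evaluates it by brute force on a grid. It assumes the standard definition $f^*(v) = \inf_x \{xv - f(x)\}$ for the one-dimensional case; the test function $f(x) = -x^2/2$, the grid, and the helper name `concave_conjugate` are illustrative choices and are not taken from the paper. For this particular $f$, the conjugate has the closed form $f^*(v) = -v^2/2$.

```python
def concave_conjugate(f, xs, v):
    """Approximate f*(v) = inf_x { x*v - f(x) } over the sample points xs."""
    return min(x * v - f(x) for x in xs)

# Illustrative concave function (an assumption for this sketch): f(x) = -x^2 / 2.
f = lambda x: -0.5 * x * x

# Grid on [-5, 5]; the infimum is attained at x = -v, which lies on the grid
# for the values of v used below.
xs = [i / 100.0 for i in range(-500, 501)]

for v in (-1.0, 0.0, 2.0):
    # Compare the brute-force value with the closed form -v^2 / 2.
    print(v, concave_conjugate(f, xs, v), -0.5 * v * v)
```

The grid search is only meant to make the definition concrete; it is reliable only when the minimizer lies inside the sampled interval.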
Definition 5 (Proper Function).
A concave function $f$ is proper if it takes values in $[-\infty, +\infty)$ and there is at least one $\mathbf{x}$ such that $f(\mathbf{x}) > -\infty$.
Following the proof given by W. Fenchel fenchel1949conjugate regarding the property of the conjugate convex function, one can easily prove that if $f$ is proper, then $f^*$ is a closed concave function. The concave conjugacy inherits the following duality property of convex conjugacy as well.
Property 4 (Duality Rockafellar1970 ).
If $f$ is an upper semi-continuous, concave and proper function, then $(f^*)^* = f$.
It can be observed that the concave conjugate gives a one-to-one correspondence among all closed proper concave functions defined on $\mathbb{R}^n$.
2.2 Additive properties
The additive properties of concave conjugacy are also required to prove the related theory for SPL. We thus introduce the following necessary definitions and properties.
Definition 6 (Sup-Convolution).
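The sup-convolution of functions $f$ and $g$ is defined as: $(f \oplus g)(\mathbf{x}) = \sup_{\mathbf{y}} \{f(\mathbf{y}) + g(\mathbf{x} - \mathbf{y})\}$.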
The sup-convolution has the following properties:
Property 5 (Increasing and Concave Preserving).
Let $h$ denote the sup-convolution of $f$ and $g$. Then:
if $f$ and $g$ are increasing functions, so is $h$;
if $f$ and $g$ are concave functions, so is $h$.
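The concavity-preserving property can be checked numerically on a simple case. The sketch below assumes the one-dimensional sup-convolution $h(x) = \sup_y \{f(y) + g(x - y)\}$ and uses two illustrative concave quadratics, $f(x) = -x^2$ and $g(x) = -2x^2$ (not from the paper), for which $h(x) = -2x^2/3$ in closed form. The helper name `sup_convolution` is hypothetical.

```python
def sup_convolution(f, g, ys, x):
    """Approximate h(x) = sup_y { f(y) + g(x - y) } over the candidate points ys."""
    return max(f(y) + g(x - y) for y in ys)

# Two illustrative concave functions (assumptions for this sketch).
f = lambda x: -x * x
g = lambda x: -2.0 * x * x

# Candidate split points y on [-6, 6].
ys = [i / 100.0 for i in range(-600, 601)]

h0 = sup_convolution(f, g, ys, 0.0)
h_mid = sup_convolution(f, g, ys, 1.5)
h3 = sup_convolution(f, g, ys, 3.0)

# Midpoint concavity check: h(1.5) should be at least the average of h(0) and h(3).
print(h0, h_mid, h3, h_mid >= 0.5 * (h0 + h3))
```

A single midpoint inequality does not prove concavity; it only illustrates the statement on one instance.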
The relationship between the sup-convolution and the concave conjugate can be well illustrated by the following result.
Property 6 (Additive Property).
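Let $f_1, \dots, f_m$ be proper concave functions defined on $\mathbb{R}^n$, and let $\oplus$ denote the sup-convolution. Then we have: $(f_1 \oplus \cdots \oplus f_m)^* = f_1^* + \cdots + f_m^*$ and $(\mathrm{cl}\, f_1 + \cdots + \mathrm{cl}\, f_m)^* = \mathrm{cl}(f_1^* \oplus \cdots \oplus f_m^*)$.
If the relative interiors of the domains $\mathrm{dom}\, f_i$ have a point in common, the closure operation can be omitted from the second formula above, and $(f_1 + \cdots + f_m)^*(\mathbf{v}) = \sup \{ f_1^*(\mathbf{v}_1) + \cdots + f_m^*(\mathbf{v}_m) : \mathbf{v}_1 + \cdots + \mathbf{v}_m = \mathbf{v} \}$,
where for each $\mathbf{v}$ the supremum is attained.
The proof of this property can be found in Rockafellar1970 .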
2.3 Differential theory
The differential theory regarding the concave conjugate plays an important role in our SPL theory. Some necessary definitions and properties are thus introduced as follows.
Definition 7 (Subgradient).
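A vector $\mathbf{v}$ is a subgradient of a concave function $f$ at $\mathbf{x}$ if $f(\mathbf{z}) \le f(\mathbf{x}) + \langle \mathbf{v}, \mathbf{z} - \mathbf{x} \rangle$ for all $\mathbf{z}$.
The set of all subgradients of $f$ at $\mathbf{x}$ is called the subdifferential of $f$ at $\mathbf{x}$ and is denoted by $\partial f(\mathbf{x})$.
Correspondingly, a vector $\mathbf{v}$ is a subgradient of a convex function $g$ at $\mathbf{x}$ if $g(\mathbf{z}) \ge g(\mathbf{x}) + \langle \mathbf{v}, \mathbf{z} - \mathbf{x} \rangle$ for all $\mathbf{z}$.
The set of all subgradients of $g$ at $\mathbf{x}$ is called the subdifferential of $g$ at $\mathbf{x}$ and is denoted by $\partial g(\mathbf{x})$.
For a concave function $f$ and the convex function $g = -f$, the above subdifferentials have the following relation: $\partial g(\mathbf{x}) = -\partial f(\mathbf{x})$.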
Property 7 (Duality of Subdifferential Rockafellar1970 ).
For any closed proper concave function and any vector , the following conditions on a vector are equivalent to each other:
achieves its infimum in at ;
achieves its infimum in at .
Property 8 (Structure of Subdifferential Rockafellar1970 ).
Let be a closed proper concave function such that has a non-empty interior. Then
where is the normal cone to at and is the set of all limits of sequences such that is differentiable at and converges to .
Theorem 1 (Duality of essentially strictly convex and essentially smooth functions Rockafellar1970 ).
A closed proper convex function is essentially strictly convex if and only if its conjugate is essentially smooth.
If is a closed strictly convex function with bounded domain, then is a closed differentiable function on the whole space.
Since has a bounded domain, is co-finite, and hence is defined on the whole space Rockafellar1970 .
2.4 Indicator function
The following theory illustrates that a restriction imposed on feasible region can be viewed as the addition of an indicator function of the restricted feasible region to the objective function.
Definition 8 (Indicator Function).
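The indicator function of a convex set $C$ is defined by: $\delta(\mathbf{x} \mid C) = 0$ if $\mathbf{x} \in C$, and $\delta(\mathbf{x} \mid C) = -\infty$ otherwise.
The closure of $\delta(\cdot \mid C)$ satisfies $\mathrm{cl}\, \delta(\cdot \mid C) = \delta(\cdot \mid \mathrm{cl}\, C)$.
We call the conjugate of $\delta(\cdot \mid C)$ the support function of $C$: $\delta^*(\mathbf{v} \mid C) = \inf_{\mathbf{x} \in C} \langle \mathbf{x}, \mathbf{v} \rangle$.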
Based on the above definitions of the indicator function and the support function, the concave conjugate under a constraint can be interpreted in a new way. Specifically, suppose is an upper semi-continuous, proper, concave function, is a closed convex set, and the relative interiors of and have at least one point in common. Then we have
This implies that a concave conjugate with a domain constraint can be understood as the addition of two conjugate terms. This greatly helps in deducing the related theory for explaining SPCL. Details will be shown in Section 5.
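To make the support function concrete, the following sketch evaluates it for the unit hypercube. It assumes the concave sign convention, in which the indicator is $0$ on the set and $-\infty$ outside, so that its concave conjugate is $\inf_{\mathbf{x} \in C} \langle \mathbf{x}, \mathbf{v} \rangle$; this convention, the choice $C = [0,1]^3$, and the helper name `support_inf` are assumptions made for illustration. Because a linear function attains its infimum over a polytope at a vertex, it suffices to enumerate the vertices, and the result should match $\sum_i \min(v_i, 0)$.

```python
from itertools import product

def support_inf(vertices, v):
    """inf over the listed vertices of <x, v>; exact for a polytope with these vertices."""
    return min(sum(xi * vi for xi, vi in zip(x, v)) for x in vertices)

# Vertices of the unit hypercube [0, 1]^3 (illustrative curriculum-free feasible region).
cube = list(product((0.0, 1.0), repeat=3))

v = (0.5, -2.0, -0.25)
closed_form = sum(min(vi, 0.0) for vi in v)  # coordinate-wise minimisation over [0, 1]

print(support_inf(cube, v), closed_form)
```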
Theorem 3 (Monotone Conjugate).
If is a function defined on a closed set , then
is increasing on .
The proof of this theorem can be seen in Appendix A.
3 Concave conjugate theory for SPL
3.1 SPL Regime
We first give a short review of the commonly used SPL regime.
where represents the vector of weights imposed on all training samples, is called the self-paced regularizer, which encodes a learning procedure following the easy-to-hard principle, is the general regularizer on the model parameters that alleviates overfitting, and is a parameter that controls the learning pace and guarantees the easy-to-complex learning procedure. By gradually increasing this age parameter, more samples are automatically included into training with higher weights in a purely self-paced way.
is the decision function for the task, like a classifier or a regressor,
is the loss function (the function is generally parameterized by parameters and is then a function of and ). Let denote the loss vector . This leads to a brief expression for the model:
A common way to solve the SPL model is to alternately optimize the target function and the weight vector as follows:
Definition 10 (SP-regularizer).
is called an SP-regularizer if
is convex with respect to ;
decreases with respect to , and it holds that , and ;
increases with respect to , and it holds that , and ,
With an SP-regularizer defined in this way, SPL imposes larger weights on easier samples and smaller weights on harder ones, and gradually increases the sample weights as the age parameter increases.
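The alternating scheme and the easy-to-hard behaviour can be sketched on a toy problem. The example below is a minimal illustration under several assumptions that are not taken from this paper: the model is a single location parameter `w` with squared loss, the regularizer is the hard (binary) one $f(v, \lambda) = -\lambda v$, whose optimal weight is $1$ when the loss is below $\lambda$ and $0$ otherwise, and the age parameter grows by a fixed factor each round. The function name `spl_fit_mean` and all numeric settings are hypothetical.

```python
def spl_fit_mean(ys, lam, growth=1.5, rounds=5):
    """Alternate between weight updates and model updates for a 1-D location model."""
    w = sorted(ys)[len(ys) // 2]          # initialise the model at a median-like value
    v = [1.0] * len(ys)
    for _ in range(rounds):
        losses = [(y - w) ** 2 for y in ys]
        # Weight step: hard threshold, samples with loss below lam are selected.
        v = [1.0 if l < lam else 0.0 for l in losses]
        # Model step: weighted least squares, i.e. the mean of the selected samples.
        if sum(v) > 0:
            w = sum(vi * y for vi, y in zip(v, ys)) / sum(v)
        lam *= growth                      # increase the age parameter
    return w, v

ys = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]     # the last point acts as an outlier
w, v = spl_fit_mean(ys, lam=0.05)
print(w, v)
```

In this toy run the outlier keeps a zero weight because its loss stays far above the age parameter in every round, which is the kind of robustness the regime is intended to provide.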
3.2 Conjugate theory of SP-regularizer
We can prove the following conjugate result for an SP-regularizer:
Theorem 4 (Conjugate Equivalence).
For arbitrary function satisfying , let , and then
The proof is provided in Appendix B.
From the above theorem, it can be seen that there is redundancy in the definition of the SP-regularizer, which can be simplified as follows:
Theorem 5 (SP-regularizer Simplification).
is strictly convex in ;
is lower semi-continuous in ;
then it holds that :
decrease with respect to ; ;
If where satisfy the above condition in , then , increases with respect to , , ,
The proof is presented in Appendix C.
This theorem shows that the conditions in can be implied by conditions imposed directly on the SP-regularizer. Following the simplification theorem, and choosing one easily handled representative in each equivalence class, the following definition gives weaker conditions for an SP-regularizer.
Definition 11 (SP-regularizer simplification).
is called a self-paced regularizer with simplified conditions if:
is convex in ;
is lower semi-continuous in ;
3.3 Model Equivalence
Based on the concave conjugacy of SPL, its equivalent model can be derived as follows. For convenience, let , and then it holds that:
where . According to the property of the concave conjugate, is a proper closed concave function. Through this analysis, we can try to get more insights of SPL.
3.3.1 Latent SPL objective
In most cases, an SPL optimization model can be separated into multiple one-dimensional sub-problems:
Then, the optimization on can be reformulated as solving the following multiple subproblems on each of its component :
In Meng2015What , it is proved that the alternative search algorithm on the SPL objective is equivalent to the MM algorithm implemented on a latent objective
on l. We can obtain a similar result under the concave conjugate theory as follows.
Theorem 6 (Model Equivalence).
If satisfies the simplified conditions of SPL as defined in Definition 11 and is strictly convex, then the latent SPL objective can be written as:
where is a function in .
The proof is listed in Appendix D.
The following theorem clarifies the relations among the SP-regularizer , the latent objective , and the weight function .
If satisfies the simplified conditions of SPL, then we have:
Furthermore, if and are strictly convex in and , respectively, we can further obtain that
According to Theorem 7, one can easily derive the weight function from the SP-regularizer through a differentiation step followed by an inversion step, which is in practice more convenient than the arg-minimization analysis. We then discuss how to specify the age parameter in the model.
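The differentiate-and-invert route can be compared with direct arg-minimization on a concrete case. The sketch below assumes the one-dimensional sub-problem $\min_{v \in [0,1]} vl + f(v, \lambda)$ and the linear regularizer $f(v, \lambda) = \lambda(v^2/2 - v)$; both forms are stated here as assumptions for illustration, and the function names are hypothetical. Setting $l + f'(v) = 0$ with $f'(v) = \lambda(v - 1)$ and clipping to $[0,1]$ gives $v = \max(0, 1 - l/\lambda)$.

```python
def weight_by_argmin(l, lam, grid=1001):
    """Brute-force minimiser of v*l + lam*(v^2/2 - v) over a grid on [0, 1]."""
    vs = [i / (grid - 1) for i in range(grid)]
    return min(vs, key=lambda v: v * l + lam * (0.5 * v * v - v))

def weight_by_inversion(l, lam):
    """Solve l + f'(v) = 0 with f'(v) = lam*(v - 1), then clip to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - l / lam))

lam = 1.0
for l in (0.0, 0.25, 0.5, 2.0):
    print(l, weight_by_argmin(l, lam), weight_by_inversion(l, lam))
```

The two columns agree up to the grid resolution, and the weight drops to zero once the loss exceeds the age parameter.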
3.5 On age parameter
An easy way to construct an SP-regularizer is first to generate a regularizer, denoted by , satisfying the simplified conditions of SPL, and then to use as the SP-regularizer. The reason why this works can be interpreted as follows:
Let and let the concave conjugate of . Then we have:
For simplicity, we assume is strictly concave. As a result, is differentiable and the original , and then we have:
Thus, increases with respect to , and it holds that , and .
Besides, since , a change of stretches the shape of . In particular, if has a threshold, then shifts the threshold through , which reflects the change of the decision boundary regarding whether a sample is learned or not.
We then discuss how to specify a proper age parameter in the learning process.
Generally, the SP-regularizer has a data screening property, that is, there exists some such that . There are two common ways to specify the age parameter. The first, suggested by Kumar2010Self , is to first choose a such that around half of the examples receive positive weights, and then gradually increase to include more samples in training. The other strategy, suggested in Jiang2014Easy , is to first calculate the loss of each example and choose an age parameter such that a given portion of samples with smaller losses receives positive weights while the others receive zero weights, and then to increase this portion to implicitly increase the age parameter. Some other variations avramova2015curriculum have also been discussed and can be considered in applications.
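The portion-based strategy can be sketched as follows. The code assumes a thresholding weight scheme in which a sample receives a positive weight exactly when its loss is below the age parameter, and it picks the threshold halfway between two consecutive sorted losses; the function names `age_for_portion` and `hard_weights`, and the tie-free toy losses, are illustrative assumptions rather than part of the cited methods.

```python
import math

def age_for_portion(losses, portion):
    """Return a threshold such that about `portion` of the samples have loss below it."""
    ranked = sorted(losses)
    k = min(len(ranked), max(1, math.ceil(portion * len(ranked))))
    if k == len(ranked):
        return ranked[-1] + 1.0            # any value above the largest loss admits all samples
    return 0.5 * (ranked[k - 1] + ranked[k])  # halfway between the k-th and (k+1)-th losses

def hard_weights(losses, lam):
    """Positive (unit) weight for samples whose loss is below the age parameter."""
    return [1.0 if l < lam else 0.0 for l in losses]

losses = [0.1, 0.4, 0.2, 0.9, 0.3, 0.8]
for portion in (0.5, 0.75, 1.0):
    lam = age_for_portion(losses, portion)
    print(portion, lam, sum(hard_weights(losses, lam)))
```

With tied losses the selected count may differ from the requested portion, so a practical schedule would need its own tie-handling rule.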
4 Two methods for designing a SPL regime
By utilizing the aforementioned theoretical results, we can construct two methods for designing a general SPL regime in practice.
We call the first method the vFlR method. The procedure for a one-dimensional sub-problem is as follows:
Design satisfying decrease with respect to and
; ; .
If is given then and the other steps are the same.
We can then provide an example for designing SPL by using this method.
, whose component is computed by ;
; ; .
In this example, the linear SP-regularizer Jiang2014Easy is derived from the weight function that linearly weights the samples whose loss lies between and .
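The direction from weight function to regularizer can be traced numerically. The sketch below is an illustration under stated assumptions: the weight function is $v(l) = \max(0, 1 - l/\lambda)$, the latent objective is obtained by integrating it with the normalization $F(0) = 0$, and the regularizer is recovered as $f(v) = \max_l \{F(l) - vl\}$. Under these assumptions the closed form is $f(v) = \lambda(1 - v)^2/2$, which differs only by the constant $\lambda/2$ from the commonly written linear form $\lambda(v^2/2 - v)$. The function names and grids are hypothetical.

```python
def weight(l, lam):
    """Assumed linear weight function: decreases from 1 at l = 0 to 0 at l = lam."""
    return max(0.0, 1.0 - l / lam)

def latent(l, lam, steps=400):
    """Trapezoid-rule integral of the weight function from 0 to l, so latent(0) = 0."""
    if l == 0.0:
        return 0.0
    h = l / steps
    total = 0.5 * (weight(0.0, lam) + weight(l, lam))
    total += sum(weight(i * h, lam) for i in range(1, steps))
    return total * h

def regularizer(v, lam, ls):
    """Recover f(v) = max_l { F(l) - v*l } over the loss grid ls."""
    return max(latent(l, lam) - v * l for l in ls)

lam = 1.0
ls = [i * 0.01 for i in range(301)]        # loss grid on [0, 3]
for v in (0.0, 0.5, 1.0):
    print(v, regularizer(v, lam, ls), 0.5 * lam * (1.0 - v) ** 2)
```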
The second method is called the flvF method. Its main procedure for a one-dimensional sub-problem includes the following steps:
is convex and continuous;
; ; .
We also present an example for using this method to design SPL.
In this example, the weight function, which weights each sample by the minimum of 1 and times the reciprocal of its loss, is derived from the LOG-like SP-regularizer.
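The described weight function can be reproduced from a logarithmic regularizer by direct minimization. The sketch below assumes the form $f(v, \lambda) = -\lambda \log v$ on $(0, 1]$, which is one regularizer that yields the weight $\min(1, \lambda/l)$; the exact LOG-like regularizer used in the example may differ by additive terms, so this form, the grid, and the function names are assumptions for illustration.

```python
import math

def weight_by_argmin(l, lam, grid=10000):
    """Brute-force minimiser of v*l - lam*log(v) over a grid on (0, 1]."""
    vs = [(i + 1) / grid for i in range(grid)]
    return min(vs, key=lambda v: v * l - lam * math.log(v))

def weight_closed_form(l, lam):
    """Minimum of 1 and lam times the reciprocal of the loss (requires l > 0)."""
    return min(1.0, lam / l)

lam = 0.5
for l in (0.25, 1.0, 2.0):
    print(l, weight_by_argmin(l, lam), weight_closed_form(l, lam))
```

Unlike a thresholding scheme, this weight never reaches zero for finite losses; it only decays as the loss grows.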
5 Concave conjugate theory for SPCL
In the conventional SPCL strategies, a curriculum region needs to be specified and added to a general SPL optimization as a constraint Jiang2015Self . In this way, however, the latent objective of SPL deduced in the previous sections is changed and cannot be obtained by the previous theory. We thus discuss this point and provide explicit latent objective functions underlying SPCL for two specific curriculums. For notational convenience, in the following we omit in SPL functions.
5.1 Latent objective of SPCL
In the following theorem we propose the form of the latent objective underlying SPCL.
Suppose the self-paced regularizer satisfies the simplified conditions of SPL. Let denote the concave conjugate of in . Let be a closed convex set with , and let be its indicator function. Then
From the theorem, we know that the latent objective of SPCL under a certain curriculum region is the sup-convolution of the original SPL latent objective without this constraint and the support function of the region. This new objective has several properties.
If the conditions of Theorem 8 hold, then has the following properties.
It is upper semi-continuous and concave since it is a concave conjugate.
It is increasing according to Theorem 3.
due to the property of sup convolution and the fact that .
Moreover, if is strictly convex, it yields that
According to Corollary 2, is differentiable.
5.2 Curriculum function
Through the above discussion, we find that the curriculum region can be interpreted as a special family of curriculum functions.
Suppose we provide the SPL model by adding a curriculum function which is a closed convex function and satisfies . Then the new latent objective function can be obtained by the following:
It can be seen that the curriculum properties depend on the conjugate of the curriculum function and the sup-convolution step.
Suppose we have curriculum functions which are proper closed convex functions, and let denote . If they satisfy , then according to Property 6 the objective function of SPCL is
By introducing a new curriculum function into the model, a new latent objective is obtained as the sup-convolution of the original objective function and the conjugate of the curriculum function. The result can be viewed as the action of the new curriculum on the original latent objective. For convenience, we call this action the Curriculum Action in the following.
5.3 Basic curriculum region
Consider the following case that the feasible region of is and the SP-regularizer is , and then
which means that it takes finite value when the component of equals 0 and it takes on .
For all proper concave function , it holds that
We can then give the following definition related to curriculums:
Definition 12 (Basic Curriculum Region).
For the SPL model
we call the the basic curriculum region.
The commonly discussed SP-regularizers are defined on . Suppose the regularizer is a concave function being differentiable on , and it can be extended to an open set which contains . According to Property 8 the structure of subdifferential, we can obtain
where is the vertex of the hypercube , , represents the cone generated by with positive coefficients and represents all the vertices of .
By calculating the inverse of set-valued function , the weight set-valued function can be obtained.
5.3.1 Linear Regularizer
Definition 13 (Linear Regularizer).
linear regularizer for the SPL model
Once we select the linear regularizer, we can obtain:
According to the Property 7, we can obtain that
Hence, the domain of can be separated into parts, each taking the same value corresponding to a vertex of the hypercube .
5.4 Linear homogeneous curriculum
One of the most commonly used curriculums is the partial order curriculum. For instance, if one has the prior knowledge that example 1 is more important or reliable than example 2, it is reasonable to restrict the feasible region such that . We call a linear homogeneous curriculum. Generally, such knowledge comes as a series of linear inequalities, which we call a partial order curriculum. For simplicity, in the following we consider a single linear homogeneous curriculum; multiple curriculums can be treated one by one.
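The effect of such a partial order constraint on the weights can be seen in a two-sample toy problem. The sketch below assumes that the prior knowledge is encoded as $v_1 \ge v_2$ and that the linear regularizer $\lambda(v^2/2 - v)$ is applied to each weight; both choices, the grid search, and the function name `spcl_weights` are assumptions made for illustration. Without the constraint, losses $(0.8, 0.2)$ with $\lambda = 1$ would give weights $(0.2, 0.8)$; with the constraint, the solution moves to the boundary $v_1 = v_2$.

```python
def spcl_weights(l1, l2, lam, grid=101):
    """Grid search for min v1*l1 + v2*l2 + f(v1) + f(v2) subject to 1 >= v1 >= v2 >= 0."""
    best, best_v = float("inf"), None
    for i in range(grid):
        for j in range(i + 1):             # j <= i enforces v2 <= v1
            v1, v2 = i / (grid - 1), j / (grid - 1)
            obj = (v1 * l1 + v2 * l2
                   + lam * (0.5 * v1 * v1 - v1)
                   + lam * (0.5 * v2 * v2 - v2))
            if obj < best:
                best, best_v = obj, (v1, v2)
    return best_v

# Curriculum conflicts with the losses: the constraint becomes active.
print(spcl_weights(0.8, 0.2, 1.0))
# Curriculum agrees with the losses: the unconstrained weights are kept.
print(spcl_weights(0.2, 0.8, 1.0))
```

The grid search is used only for transparency; for larger problems a constrained convex solver would be the natural tool.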
In order to avoid dysfunctional curriculums and to make the analysis convenient, we make the following nonsingularity assumption on the curriculum region.
Assumption 1 (Assumption for Curriculum Region).
A curriculum region satisfies the following conditions:
Definition 14 (Linear Homogeneous Curriculum).
If , we call a linear homogeneous curriculum and the linear homogeneous curriculum direction.
We can then prove the following result: