Self-paced learning (SPL) is a recently proposed methodology designed by simulating the learning principle of humans and animals. A variety of SPL realization schemes have been designed and empirically shown to be effective in different computer vision and pattern recognition tasks, such as object detector adaptation [10], specific-class segmentation learning [2], visual category discovery [4], concept learning [5], long-term tracking [9], graph trimming [14], co-saliency detection [16], matrix factorization [17], face identification [6], and multimedia event detection [13].
To explain the mechanism underlying the effectiveness of SPL, Meng et al. [7] first provided some new theoretical understanding of the SPL scheme. Specifically, that work proved that the alternative optimization strategy (AOS) on SPL accords with a majorization-minimization (MM) algorithm implemented on an implicit objective function. Furthermore, the loss function contained in this implicit objective was found to have a configuration similar to that of a non-convex regularized penalty (NCRP), leading to a rational interpretation of the robustness of SPL.
However, this understanding is still not theoretically rigorous. The theory in [7] only guarantees that the implicit objective decreases monotonically during the iterations of the SPL solving process (i.e., the MM algorithm); it does not establish any convergence result for this implicit objective. Such a result is nevertheless critical to the soundness of the robustness explanation of SPL: it guarantees that the convergence points of the algorithm settle on the expected implicit objective, and thereby intrinsically relates the original SPL model to this implicit objective.
To address this theoretical issue, in this paper we prove that, under fairly weak conditions, the SPL solving process converges to critical points of the implicit objective. This result provides an affirmative answer to the conjecture that SPL intrinsically optimizes a robust implicit objective.
In what follows, we first introduce the related background of this research and then present the main theoretical results of this work.
2 Related work
In this section, we first briefly introduce the definition of SPL, and then describe its relationship to the implicit NCRP objective.
2.1 The SPL objective
Given a training data set, many machine learning problems require minimizing an objective function of the following form:
where is variables to be solved, is a regularizer parameter, is the loss function and is the parametrized learning machine, like a discriminative or a regression function.
To improve robustness, and in particular to avoid the negative influence of heavily corrupted outliers, SPL imposes additional importance weights on the loss functions of all samples, adjusted by a self-paced regularizer (SP-regularizer). Here, each weight represents the extent to which the corresponding sample is involved in the learning process. The self-paced objective can then be designed as:
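where $f(v, \lambda)$ is the SP-regularizer, satisfying the following conditions:
1. $f(v, \lambda)$ is convex in $v$ on $[0, 1]$;
2. if $v^*(\ell, \lambda) = \arg\min_{v \in [0,1]} \{ v\ell + f(v, \lambda) \}$, then $v^*(\ell, \lambda)$ is non-increasing in $\ell$, and $\lim_{\ell \to 0} v^*(\ell, \lambda) = 1$, $\lim_{\ell \to \infty} v^*(\ell, \lambda) = 0$;
3. $v^*(\ell, \lambda)$ is non-decreasing in $\lambda$, and $\lim_{\lambda \to 0} v^*(\ell, \lambda) = 0$, $\lim_{\lambda \to \infty} v^*(\ell, \lambda) \le 1$.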
Throughout this paper, we assume that the optimal weight is uniquely determined, so that it can be regarded as a real-valued function rather than a set-valued one.
The three conditions in the definition above provide basic principles for constructing an SP-regularizer. Condition 2 indicates that the model tends to select easy samples (with smaller losses) over complex samples (with larger losses). Condition 3 states that when the model “pace” (controlled by the pace parameter $\lambda$) gets larger, the model tends to incorporate more, and probably more complex, samples to train a “mature” model. The convexity in Condition 1 further ensures the soundness of this regularizer for optimization.
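To make the role of the weights concrete, the following sketch writes the SPL objective in an assumed notation and shows why the weight update decouples across samples. The symbols $\mathbf{w}$, $v_i$, $\ell_i$, $f$, $\lambda$ and $R$ are illustrative choices following the usual SPL formulation and are not necessarily the exact symbols of Eq. (1).

```latex
% Assumed notation: \ell_i(\mathbf{w}) is the loss of sample i, v_i \in [0,1] its weight,
% f the SP-regularizer, \lambda the pace parameter, R a regularizer on \mathbf{w}.
\begin{align*}
E(\mathbf{w}, \mathbf{v}; \lambda)
  &= \sum_{i=1}^{n} \bigl[ v_i\,\ell_i(\mathbf{w}) + f(v_i, \lambda) \bigr] + R(\mathbf{w}), \\
\min_{\mathbf{v} \in [0,1]^n} E(\mathbf{w}, \mathbf{v}; \lambda)
  &= \sum_{i=1}^{n} \min_{v_i \in [0,1]} \bigl[ v_i\,\ell_i(\mathbf{w}) + f(v_i, \lambda) \bigr] + R(\mathbf{w}), \\
v_i^{\mathrm{opt}} &= v^*\bigl(\ell_i(\mathbf{w}), \lambda\bigr),
\qquad v^*(\ell, \lambda) := \arg\min_{v \in [0,1]} \bigl[ v\,\ell + f(v, \lambda) \bigr].
\end{align*}
```

For fixed $\mathbf{w}$ the sum is separable in the $v_i$, so each weight depends only on the loss of its own sample and on $\lambda$. Conditions 2 and 3 are therefore statements about a single one-dimensional map.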
The existence of the SP-regularizer can be illustrated by the following example.
Let the SP-regularizer be
then it yields
It is easy to verify that satisfies the above conditions.
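As a numerical illustration, the sketch below checks Conditions 2 and 3 for one commonly used SP-regularizer, the linear soft-weighting form $f(v, \lambda) = \lambda(\tfrac{1}{2}v^2 - v)$. This choice is an assumption made for illustration and may differ from the example intended above; the function names are hypothetical.

```python
import numpy as np

def f_linear(v, lam):
    """Assumed example SP-regularizer: f(v, lam) = lam * (0.5 * v**2 - v)."""
    return lam * (0.5 * v ** 2 - v)

def v_star_closed_form(loss, lam):
    """Closed-form minimizer of v * loss + f_linear(v, lam) over v in [0, 1]."""
    return np.clip(1.0 - loss / lam, 0.0, 1.0)

def v_star_grid(loss, lam, n_grid=100001):
    """Brute-force minimizer over a fine grid of v in [0, 1], used only as a check."""
    v = np.linspace(0.0, 1.0, n_grid)
    return v[np.argmin(v * loss + f_linear(v, lam))]

losses = np.linspace(0.0, 5.0, 51)
lam = 2.0

# The closed form should agree with direct minimization up to the grid resolution.
closed = v_star_closed_form(losses, lam)
brute = np.array([v_star_grid(l, lam) for l in losses])
max_gap = float(np.max(np.abs(closed - brute)))

# Condition 2: weights do not increase as the loss grows.
non_increasing_in_loss = bool(np.all(np.diff(closed) <= 1e-12))

# Condition 3: for a fixed loss, weights do not decrease as the pace grows.
paces = np.linspace(0.1, 20.0, 200)
non_decreasing_in_pace = bool(np.all(np.diff(v_star_closed_form(1.0, paces)) >= -1e-12))
```

Under this assumed regularizer the weight is $\max(1 - \ell/\lambda, 0)$, so samples with loss above $\lambda$ receive zero weight and the remaining samples are down-weighted linearly in their loss.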
In the following, we shall write:
2.2 The implicit NCRP objective
Since the optimal weight function is non-increasing in the loss, the set of its discontinuity points is countable and consists only of jump discontinuities. Thus it is integrable, and its integral with respect to the loss is absolutely continuous and concave. We now define
as the implicit objective, where denotes that composed with . An interesting observation is that this implicit SPL objective is closely related to the NCRP widely investigated in machine learning and statistics, which helps explain the robustness of SPL [7].
The AOS algorithm originally used for solving the SPL problem performs coordinate descent on the SPL objective, i.e., it iterates the following process:
Specifically, given an initial parameter, suppose that $k$ steps have been completed; the AOS algorithm then solves the following two subproblems in turn:
Note that the first subproblem is feasible since we have assumed that can be uniquely determined. Indeed, using the notation of , we have
We then set
It is easy to deduce that is actually the first-order Taylor series of at . Based on the concavity of , we know that
constitutes an upper bound of the implicit objective (as defined in Eq. (2)), which provides a qualified surrogate function for the MM algorithm.
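The majorization step can be spelled out as follows. The notation is assumed for illustration: $F_\lambda$ denotes the integral of the optimal weight function, $\ell_i(\mathbf{w})$ the loss of sample $i$, $\mathbf{w}^k$ the current iterate, and the regularizer $R$ is assumed to enter both objectives unchanged.

```latex
% Assumed notation: F_\lambda(\ell) = \int_0^{\ell} v^*(\tau,\lambda)\,d\tau,
% \ell^k a reference loss value, \mathbf{w}^k the current iterate.
\begin{align*}
F_\lambda(\ell) &\le F_\lambda(\ell^k) + v^*(\ell^k, \lambda)\,(\ell - \ell^k)
  \qquad \text{(concavity of } F_\lambda\text{; } v^*(\ell^k,\lambda) \text{ is a supergradient at } \ell^k\text{)}, \\
Q(\mathbf{w} \mid \mathbf{w}^k) &:= \sum_{i=1}^{n} \Bigl[ F_\lambda\bigl(\ell_i(\mathbf{w}^k)\bigr)
  + v^*\bigl(\ell_i(\mathbf{w}^k), \lambda\bigr)\bigl(\ell_i(\mathbf{w}) - \ell_i(\mathbf{w}^k)\bigr) \Bigr] + R(\mathbf{w}), \\
Q(\mathbf{w} \mid \mathbf{w}^k) &\ge \sum_{i=1}^{n} F_\lambda\bigl(\ell_i(\mathbf{w})\bigr) + R(\mathbf{w}),
\qquad \text{with equality at } \mathbf{w} = \mathbf{w}^k .
\end{align*}
```

Dropping the terms that do not depend on $\mathbf{w}$, minimizing $Q(\cdot \mid \mathbf{w}^k)$ is the weighted problem $\min_{\mathbf{w}} \sum_i v_i^k\,\ell_i(\mathbf{w}) + R(\mathbf{w})$ with $v_i^k = v^*(\ell_i(\mathbf{w}^k), \lambda)$, which has the same form as the parameter update of AOS.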
One of the key results in [7] is that if a sequence is produced by the AOS algorithm on the SPL objective, then it can also be produced by performing the MM algorithm on the implicit objective, and vice versa. We prove one direction by induction; the other direction is entirely analogous. Suppose we have proved that the iterate can be produced by performing the MM algorithm on the implicit objective at the $k$-th step. For the $(k+1)$-th step,
Thus we have proved our claim that the two optimization algorithms (AOS and MM), applied to the two different objective functions (the SPL objective and the implicit objective, respectively), are intrinsically equivalent.
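The equivalence can be observed numerically on a toy problem. The sketch below is a minimal illustration under stated assumptions, not the setting analyzed in this paper: it assumes a squared loss, a ridge regularizer, and the linear soft-weighting SP-regularizer $f(v, \lambda) = \lambda(\tfrac{1}{2}v^2 - v)$, and all function names are hypothetical. It runs AOS and records the implicit objective at every iterate.

```python
import numpy as np

def v_star(loss, lam):
    """Optimal weights for the assumed regularizer f(v, lam) = lam * (0.5 * v**2 - v)."""
    return np.clip(1.0 - loss / lam, 0.0, 1.0)

def implicit_loss(loss, lam):
    """F_lam(l) = integral of v_star(t, lam) over t in [0, l], in closed form."""
    capped = np.minimum(loss, lam)
    return capped - capped ** 2 / (2.0 * lam)

def aos_spl(X, y, lam, ridge=1e-3, n_iter=30):
    """Alternate the closed-form v-update with a weighted ridge w-update."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    implicit_values = []
    for _ in range(n_iter):
        loss = (X @ w - y) ** 2
        # Value of the implicit objective at the current iterate.
        implicit_values.append(implicit_loss(loss, lam).sum() + ridge * (w @ w))
        # v-update: one-dimensional minimization per sample (closed form here).
        v = v_star(loss, lam)
        # w-update: exact minimizer of sum_i v_i * loss_i(w) + ridge * ||w||^2.
        A = X.T @ (v[:, None] * X) + ridge * np.eye(n_features)
        w = np.linalg.solve(A, X.T @ (v * y))
    return w, np.array(implicit_values)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)
y[:20] += 10.0  # a few grossly corrupted targets

w_hat, implicit_values = aos_spl(X, y, lam=4.0)
monotone = bool(np.all(np.diff(implicit_values) <= 1e-8))
```

Although the loop never evaluates the implicit objective when choosing its updates, the majorization argument implies that the recorded values are non-increasing up to rounding, which is what `monotone` checks.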
It remains to prove that every convergence point of the MM algorithm, or equivalently of the AOS algorithm on the SPL objective, is at least a critical point of the implicit objective.
3 The main convergence result
In fact, the convergence proof for the MM algorithm is essentially the same as that for the EM algorithm (see [12]), with only some obvious changes, as discussed in [11]. The convergence of EM and MM is in turn a corollary of the global convergence theorem of Zangwill (see [15]). We generalize the proof to the setting of variational analysis. Before doing so, we clarify some terminology, which can be found in [8].
A function is said to be lower semi-continuous or simply lsc if
is closed for any . is said to be level-bounded if is bounded for any . And is called coercive if . Note that coercive functions are level-bounded. A critical point of means that , where stands for the subdifferential.
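For reference, these notions can be written out in symbols as follows. The notation ($h$, $\alpha$, $\bar{x}$) is assumed for illustration, $h$ is taken to be a proper function, and the growth condition given for coercivity is one common convention in variational analysis [8] rather than a form confirmed by the text above.

```latex
% Assumed notation: h : \mathbb{R}^d \to (-\infty, +\infty] proper, \alpha \in \mathbb{R}.
\begin{align*}
\text{lsc:} \quad & \{ x : h(x) \le \alpha \} \text{ is closed for every } \alpha
  \iff \liminf_{x \to \bar{x}} h(x) \ge h(\bar{x}) \text{ for every } \bar{x}, \\
\text{level-bounded:} \quad & \{ x : h(x) \le \alpha \} \text{ is bounded for every } \alpha, \\
\text{coercive (assumed convention):} \quad & \liminf_{\|x\| \to \infty} \frac{h(x)}{\|x\|} = +\infty, \\
\text{critical point } \bar{x}: \quad & 0 \in \partial h(\bar{x}).
\end{align*}
```

Under this convention, coercivity forces $h(x) \to +\infty$ as $\|x\| \to \infty$, so every sublevel set is bounded, which is the implication noted above.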
The main theorem of this paper can then be stated as follows.
Suppose that the objective function of MM algorithm, , is lsc and level-bounded, and that the surrogate function at is , which is lsc as a function on , and satisfies
where is the partial subdifferential in . Then for any initial parameter , every cluster point of the produced sequence of MM algorithm is a critical point of .
See the appendix. ∎
For our problem, we can give a sufficient condition for convergence, which is easy to verify and is satisfied by most current SPL variants.
In the SPL objective as defined in Section 2.1, suppose is bounded below, is continuously differentiable, is continuous, and is coercive and lsc. Then for any initial parameter , every cluster point of the produced sequence , obtained by the AOS algorithm on solving Eq. (1), is a critical point of the implicit objective as defined in Eq. (2).
It is obvious that is lsc and level-bounded and is lsc as a function on with these assumptions. And the continuity of makes continuously differentiable. Then we have
Based on Theorem 1, for any initial parameter , every cluster point of the produced sequence is a critical point of . The proof is then completed. ∎
The theorem shows that every cluster point of the iterates generated by the AOS algorithm, as generally used for solving the SPL problem, is guaranteed to be a critical point of the implicit NCRP objective. The intrinsic relationship between the two objectives is thereby established.
Note that the above theorem requires every minimization step in the MM algorithm to attain exactly the minimum of the surrogate function, i.e.,
This is generally hard to achieve in real applications, especially for learning models without closed-form solutions. We therefore relax the condition to allow an inexact solution “with errors” when implementing the MM algorithm on the surrogate function. That is, we weaken condition (4) to:
where $\{\varepsilon_k\}$ is a non-negative summable sequence, i.e., $\sum_{k} \varepsilon_k < \infty$.
Under this relaxed condition, we can still prove the following convergence result for SPL.
In the SPL objective as defined in Section 2.1, suppose is bounded below, is continuously differentiable, is continuous, and is coercive and lsc. Let be an arbitrary initial parameter, and be the sequence obtained by the AOS algorithm on solving Eq. (1) with errors , that is,
Then every cluster point of is a critical point of the implicit objective as defined in Eq. (2).
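The effect of inexact updates can be illustrated numerically. The sketch below rests on several assumptions that are not taken from the text: the relaxed condition is read as "the surrogate value at the new iterate exceeds its minimum by at most $\varepsilon_k$", the errors are $\varepsilon_k = 1/k^2$, the loss is squared error with a ridge term, and the SP-regularizer is the linear soft-weighting form. All names are hypothetical.

```python
import numpy as np

def v_star(loss, lam):
    """Weights for the assumed regularizer f(v, lam) = lam * (0.5 * v**2 - v)."""
    return np.clip(1.0 - loss / lam, 0.0, 1.0)

def implicit_objective(w, X, y, lam, ridge):
    """Sum of F_lam(squared residual) plus a ridge term (assumed form of Eq. (2))."""
    capped = np.minimum((X @ w - y) ** 2, lam)
    return float(np.sum(capped - capped ** 2 / (2.0 * lam)) + ridge * (w @ w))

def inexact_aos(X, y, lam, ridge=1e-3, n_iter=40, seed=1):
    """AOS whose w-update misses the surrogate minimum by exactly eps_k = 1 / k**2."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    w = np.zeros(n_features)
    values, errors = [implicit_objective(w, X, y, lam, ridge)], []
    for k in range(1, n_iter + 1):
        v = v_star((X @ w - y) ** 2, lam)
        A = X.T @ (v[:, None] * X) + ridge * np.eye(n_features)
        w_exact = np.linalg.solve(A, X.T @ (v * y))
        # The surrogate is quadratic with Hessian 2A, so moving by `step * direction`
        # away from its minimizer raises it by step**2 * direction^T A direction.
        eps_k = 1.0 / k ** 2
        direction = rng.normal(size=n_features)
        step = np.sqrt(eps_k / (direction @ A @ direction))
        w = w_exact + step * direction
        errors.append(eps_k)
        values.append(implicit_objective(w, X, y, lam, ridge))
    return np.array(values), np.array(errors)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
y[:20] += 10.0  # a few grossly corrupted targets

values, errors = inexact_aos(X, y, lam=4.0)

# One-step bound: the implicit objective can rise by at most eps_k per iteration.
one_step_ok = bool(np.all(values[1:] <= values[:-1] + errors + 1e-8))

# Adding the tail sums of the errors gives a non-increasing sequence.
tails = np.append(np.cumsum(errors[::-1])[::-1], 0.0)
shifted = values + tails
shifted_monotone = bool(np.all(np.diff(shifted) <= 1e-8))
```

The implicit objective itself need not decrease at every step, but once the remaining error budget is added to it the resulting sequence is non-increasing. This is why summability of the errors is the natural requirement in this sketch.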
Based on the theorem, we can then confirm the intrinsic relationship between SPL and its implicit objective.
4 Conclusion
In this paper, we have proved that the learning process of the traditional SPL regime is guaranteed to converge to critical points of the corresponding implicit NCRP objective. This theory confirms the intrinsic relationship between SPL and this implicit objective, and thus supports the previous robustness analysis of SPL that was based on this relationship. In addition, the proof of convergence uses some new theoretical techniques, which may also benefit existing MM and EM convergence theories to a certain extent.
References
- [1] L. Jiang, D. Meng, T. Mitamura, and A. Hauptmann. Easy samples first: self-paced reranking for zero-example multimedia search. In ACM MM, 2014.
- [2] M. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.
- [3] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
- [4] Y. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
- [5] J. Liang, L. Jiang, D. Meng, and A. Hauptmann. Learning to detect concepts from webly-labeled video data. In IJCAI, 2016.
- [6] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang. Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
- [7] D. Meng, Q. Zhao, and L. Jiang. What objective does self-paced learning indeed optimize? arXiv e-prints, November 2015.
- [8] R. T. Rockafellar and R. J.-B. Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
- [9] J. Supančič III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.
- [10] K. Tang, V. Ramanathan, F. Li, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
- [11] F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831–840, 2005.
- [12] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, pages 95–103, 1983.
- [13] S. Yu, L. Jiang, Z. Mao, et al. CMU-Informedia@TRECVID 2014 multimedia event detection (MED). In TRECVID Video Retrieval Evaluation Workshop, 2014.
- [14] Z. Yue, D. Meng, J. He, and G. Zhang. Semi-supervised learning through adaptive Laplacian graph trimming. Image and Vision Computing, 2016.
- [15] W. I. Zangwill. Nonlinear programming: a unified approach. Prentice-Hall international series in management. Prentice-Hall, 1969.
- [16] D. Zhang, D. Meng, C. Li, L. Jiang, Q. Zhao, and J. Han. A self-paced multiple-instance learning framework for co-saliency detection. In IEEE International Conference on Computer Vision, pages 594–602, 2015.
- [17] Q. Zhao, D. Meng, L. Jiang, Q. Xie, Z. Xu, and A. G. Hauptmann. Self-paced learning for matrix factorization. In AAAI, pages 3196–3202, 2015.
Appendix A Proof of Theorem 1
Theorem 1 is actually a corollary of a stronger version of Zangwill’s global convergence theorem [15, page 91]. We first need to give the following lemmas. If is lsc, , and is non-increasing, then .
Suppose that is a finite-dimensional Euclidean space, is a set-valued mapping from to and that is produced by , which means
is a subset of that we are interested at, called the ”solution set” and satisfying
There is a compact subset , such that ,
is outer semicontinuous on , that is
There is a lsc function defined on , such that
then all the cluster points of are in , and , such that is non-increasing and convergent to . Note: we will repeatedly use the fact that is non-increasing. Without loss of generality, we can assume that when we take a subsequence of .
(1) Suppose is a cluster point of . The existence of is guaranteed by the compactness of . Thus there exists a subsequence , such that . Since is non-increasing, based on Lemma 3, it holds that
Denote , and then we prove that . This is because such that
There exists , such that , and thus
which indicates .
(2) If , take a subsequence
Since all lie in , there exists a subsequence , such that . Since is outer semicontinuous, . Based on Lemma 3, we know that . Due to the properties of ,
We now provide a proof of Theorem 1.
Proof of Theorem 1.
Let be the set of critical points of , and
By the descending property of MM algorithm, . Condition 3b is satisfied.
Condition 1: since is lsc and level-bounded, is closed and bounded, and thus compact. By the descending property of MM algorithm, all the parameters lie in .
Condition 2: suppose , and then , it holds that
Taking infimal limit on both sides when , we have
Thus , which means is outer semi-continuous.
Condition 3a: If , then
By the generalized Fermat theorem (see [8, 10.1]), is not a minima of , i.e., . Since ,
All the conditions of the preceding Lemma A are satisfied, which completes the proof. ∎
Appendix B Proof of Theorem 3
Similar to [8, 5.41], we give the following definition: A sequence of set-valued mappings converges outer semicontinuously to another set-valued mapping , if
Before giving proof of Theorem 3, we need to prove the following two lemmas.
Let be an Euclidean space with finite dimension, and be set-valued mappings from to itself. Suppose that converges outer semicontinuously to , and that is produced by , which means
Let be an arbitrary set, called the ”solution set”, satisfying
There is a compact set such that ,
There is a lsc defined on , such that
There is a sequence of non-negative numbers , that is , and
Then all the cluster points of lie in , and , such that converges to .
(1) Set , then , and
Thus is non-increasing.
(2) Let be a cluster point of , and then there exists a subsequence , such that . Since is lsc, we have
The second equality holds because
And we can prove in the same way as in (1) of Lemma A that
(3) We need to show that . Suppose not, take
Due to the compactness of , there is a subsequence of , such that , . We can argue in the same way as in (2) to show that . Since
and converges outer semicontinuously to , we have . Thus
The proof is then completed. ∎
We can now prove another lemma using the above result.
Let be the objective of MM algorithm. Suppose that is lsc and level-bounded, and that the surrogate function at is . In addition, suppose is lsc as a function defined on whose subgradient satisfies
where is the partial subdifferential with respect to . Then for any initial parameter , all the cluster points of the sequence produced by MM algorithm ”with errors” are still critical points of .
We prove by a direct application of Lemma B. Let be the set consisting of all the critical points of , , and
is the same as before:
We first need to show that converges outer semicontinuously to . Suppose , and then ,
Taking infimal limit on both sides when , we have
which means . Thus converges outer semicontinuously to .
Condition 1: is level-bounded, thus
is bounded. Since is also lsc, is closed and hence compact.
By (1) of Lemma B, all lie in .
Condition 2a: If , then
By the generalized Fermat theorem, is not a minima of , and hence . It follows that ,
Condition 2b: Let , and then
Therefore, all the conditions of Lemma B are satisfied and we have finished the proof. ∎
Just as in the proof of Theorem 2, Theorem 3 can be proved by directly applying the results obtained above from Lemma B. We omit the proof here.