# On Convergence Property of Implicit Self-paced Objective

Self-paced learning (SPL) is a recent methodology that simulates the learning principle of humans and animals: it starts with the easier aspects of a learning task and then gradually takes more complex examples into training. This learning regime has been empirically shown to be effective in various computer vision and pattern recognition tasks. Recently, it has been shown that the SPL regime is closely related to an implicit self-paced objective function. While this implicit objective provides a helpful interpretation of the effectiveness, and especially the robustness, of SPL, no strictly proved theoretical result has yet verified this relationship. To address this issue, we provide convergence results on this implicit objective of SPL. Specifically, we prove that the learning process of SPL always converges to critical points of this implicit objective under some mild conditions. This result verifies the intrinsic relationship between SPL and this implicit objective, and makes the previous robustness analysis of SPL complete and theoretically sound.

## 1 Introduction

Self-paced learning (SPL) is a recently proposed methodology designed by simulating the learning principle of humans and animals [3]. A variety of SPL realization schemes have been designed and empirically shown to be effective in different computer vision and pattern recognition tasks, such as object detector adaptation [10], specific-class segmentation learning [2], visual category discovery [4], concept learning [5], long-term tracking [9], graph trimming [14], co-saliency detection [16], matrix factorization [17], face identification [6], and multimedia event detection [13].

To explain the mechanism underlying the effectiveness of SPL, [7] first provided some new theoretical understanding of the SPL scheme. Specifically, that work proved that the alternative optimization strategy (AOS) on SPL coincides with a majorization-minimization (MM) algorithm implemented on an implicit objective function. Furthermore, the loss function contained in this implicit objective has a form similar to a non-convex regularized penalty (NCRP), leading to a rational interpretation of the robustness of SPL.

However, this understanding is not yet theoretically rigorous. The theory in [7] only guarantees that the implicit objective decreases monotonically during the iterations of the SPL solving process (i.e., the MM algorithm); it does not establish any convergence result for this implicit objective. Such a result is nevertheless critical to the soundness of the robustness explanation of SPL, since it guarantees that the convergence points of the algorithm settle on the expected implicit objective, and thereby intrinsically relates the original SPL model to this implicit objective.

To address this theoretical issue, in this paper we prove that the SPL solving process actually converges to critical points of the implicit objective under satisfactorily weak conditions. This result provides an affirmative answer to the conjecture that SPL intrinsically optimizes a robust implicit objective.

In what follows, we will first introduce some related background of this research, and then we provide the main theoretical result of this work.

## 2 Related work

In this section, we first briefly introduce the definition of SPL, and then describe its relationship to the implicit NCRP objective.

### 2.1 The SPL objective

Given a training data set $\{(x_i,y_i)\}_{i=1}^N$, many machine learning problems require minimizing an objective function of the following form:

$$J(w)=\phi_\lambda(w)+\sum_{i=1}^N L(y_i,g(x_i,w)),$$

where $w$ denotes the variables to be solved, $\phi_\lambda$ is a regularizer with parameter $\lambda$, $L$ is the loss function, and $g$ is the parametrized learning machine, such as a discriminative or a regression function.

To improve robustness, in particular to avoid the negative influence of large-noise outliers, SPL imposes additional importance weights $v=(v_1,\dots,v_N)$ on the loss functions of all samples, adjusted by a self-paced regularizer (SP-regularizer). Here, each $v_i\in[0,1]$ represents the extent to which the $i$-th sample is involved in the learning process. The self-paced objective can then be designed as [1]:

$$E(w,v;\lambda)=\phi_\lambda(w)+\sum_{i=1}^N\left[v_iL(y_i,g(x_i,w))+f_\lambda(v_i)\right], \tag{1}$$

where $f_\lambda(v)$ is the SP-regularizer, satisfying the following conditions:

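1. $f_\lambda(v)$ is convex on $[0,1]$;

2. Let

 $$v^*_\lambda(l)=\arg\min_{v\in[0,1]}\{vl+f_\lambda(v)\},$$

 then $v^*_\lambda(l)$ is non-increasing with respect to $l$, and

 $$\lim_{l\to0}v^*_\lambda(l)=1,\quad\lim_{l\to\infty}v^*_\lambda(l)=0;$$

3. $v^*_\lambda(l)$ is non-decreasing with respect to $\lambda$, and

 $$\lim_{\lambda\to0}v^*_\lambda(l)=0,\quad\lim_{\lambda\to\infty}v^*_\lambda(l)\le1.$$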

Throughout this paper, we shall assume that $v^*_\lambda(l)$ can be uniquely determined and thus can be seen as a real-valued function instead of a set-valued function.

The three conditions in the definition above provide basic principles for constructing an SP-regularizer. Condition 2 indicates that the model is inclined to select easy samples (with smaller losses) in preference to complex samples (with larger losses). Condition 3 states that when the model “pace” (controlled by the pace parameter $\lambda$) gets larger, it tends to incorporate more, and probably more complex, samples to train a “mature” model. The convexity in Condition 1 further ensures the soundness of this regularizer for optimization.

The existence of the SP-regularizer can be illustrated by the following example.

Let the SP-regularizer be

$$f_\lambda(v)=\lambda v(\log v-1),$$

then it yields

$$v^*_\lambda(l)=e^{-l/\lambda}.$$

It is easy to verify that $f_\lambda$ satisfies the above conditions.
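The closed form in this example can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper names (`f_sp`, `v_star_closed_form`, `v_star_grid`) and the values of the pace parameter and the losses are illustrative choices, not part of the original paper. It compares the closed-form weight with a brute-force minimization of $vl+f_\lambda(v)$ over a grid on $[0,1]$.

```python
import numpy as np

def f_sp(v, lam):
    """Example SP-regularizer f_lambda(v) = lam * v * (log v - 1); f(0) = 0 by continuity."""
    v = np.asarray(v, dtype=float)
    # Replace v = 0 by 1.0 inside the log only; the outer factor v zeroes the term anyway.
    safe_v = np.where(v > 0.0, v, 1.0)
    return lam * v * (np.log(safe_v) - 1.0)

def v_star_closed_form(l, lam):
    """Closed-form weight from the example: v*(l) = exp(-l / lam)."""
    return np.exp(-np.asarray(l, dtype=float) / lam)

def v_star_grid(l, lam, n_grid=200001):
    """Brute-force minimizer of v*l + f_lambda(v) over a uniform grid on [0, 1]."""
    v = np.linspace(0.0, 1.0, n_grid)
    return v[np.argmin(v * l + f_sp(v, lam))]

lam = 2.0                                       # illustrative pace parameter
losses = np.array([0.0, 0.1, 1.0, 5.0, 20.0])   # illustrative, increasing loss values
closed = v_star_closed_form(losses, lam)
grid = np.array([v_star_grid(l, lam) for l in losses])
print(np.max(np.abs(closed - grid)))            # discrepancy is on the order of the grid spacing
```

In this sketch, samples with larger losses receive smaller weights, and increasing `lam` raises the weight of every sample, which mirrors Conditions 2 and 3.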

In the following, we shall write

$$l_i(w)=L(y_i,g(x_i,w)),\quad i=1,\cdots,N$$

for simplicity.

### 2.2 The implicit NCRP objective

Let

$$F_\lambda(l)=\int_0^l v^*_\lambda(\tau)\,d\tau.$$

Since $v^*_\lambda$ is non-increasing, the set of its discontinuity points is countable and consists only of jump discontinuities. Thus $v^*_\lambda$ is integrable, and $F_\lambda$ is absolutely continuous and concave. We now define

$$G_\lambda(w)=\phi_\lambda(w)+\sum_{i=1}^N(F_\lambda\circ l_i)(w) \tag{2}$$

as the implicit objective, where $F_\lambda\circ l_i$ denotes $F_\lambda$ composed with $l_i$. An interesting observation is that this implicit SPL objective is closely related to the NCRP widely investigated in machine learning and statistics, which helps explain the robustness of SPL [7].
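For the example regularizer $f_\lambda(v)=\lambda v(\log v-1)$, the integral defining $F_\lambda$ can be evaluated by hand, giving $F_\lambda(l)=\lambda(1-e^{-l/\lambda})$. The sketch below, which assumes NumPy and uses illustrative function names and parameter values, compares this expression with a trapezoidal quadrature of $v^*_\lambda$. It also shows the saturation of $F_\lambda$ at $\lambda$, which is one way to read the robustness interpretation: the implicit loss of any single sample is bounded, however large its original loss.

```python
import numpy as np

def v_star(l, lam):
    """Weight function of the example regularizer: v*(l) = exp(-l / lam)."""
    return np.exp(-l / lam)

def F_closed_form(l, lam):
    """Hand-computed integral of v* from 0 to l: lam * (1 - exp(-l / lam))."""
    return lam * (1.0 - np.exp(-l / lam))

def F_numeric(l, lam, n=20001):
    """Trapezoidal approximation of the integral of v* over [0, l]."""
    tau = np.linspace(0.0, l, n)
    vals = v_star(tau, lam)
    return float(np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(tau)))

lam = 2.0                               # illustrative pace parameter
ls = np.array([0.5, 1.0, 4.0, 50.0])    # illustrative loss values
numeric = np.array([F_numeric(l, lam) for l in ls])
closed = F_closed_form(ls, lam)
print(np.max(np.abs(numeric - closed)))
print(closed)                           # every entry stays below lam
```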

The AOS algorithm originally used for solving the SPL problem performs coordinate descent on $E(w,v;\lambda)$, i.e., it iterates through the process

$$(w^{k-1},v^{k-1})\to(w^{k-1},v^{k})\to(w^{k},v^{k}).$$

Specifically, given an initial $w^0$, once $k-1$ steps have been finished, the AOS algorithm solves the following two subproblems in turn:

$$v^k=\arg\min_v E(w^{k-1},v;\lambda)=\arg\min_v\left\{\sum_{i=1}^N\left[v_il_i(w^{k-1})+f_\lambda(v_i)\right]\right\},$$

$$w^k\in\arg\min_w E(w,v^k;\lambda)=\arg\min_w\left\{\phi_\lambda(w)+\sum_{i=1}^N v^k_il_i(w)\right\}.$$

Note that the first subproblem is well defined since we have assumed that $v^*_\lambda(l)$ can be uniquely determined. Indeed, using the notation $v^*_\lambda$, we have

$$v^k_i=v^*_\lambda(l_i(w^{k-1})),\quad i=1,\cdots,N.$$

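We then set

$$Q(w|w^*)=\sum_{i=1}^N\left\{(F_\lambda\circ l_i)(w^*)+(v^*_\lambda\circ l_i)(w^*)\left[l_i(w)-l_i(w^*)\right]\right\}.$$

It is easy to see that each summand of $Q(w|w^*)$ is the first-order Taylor expansion of $F_\lambda$ at $l_i(w^*)$, evaluated at $l_i(w)$. Based on the concavity of $F_\lambda$, we know that

$$U(w|w^*)=\phi_\lambda(w)+Q(w|w^*)$$

constitutes an upper bound of $G_\lambda(w)$ (as defined in Eq. (2)), which provides a qualified surrogate function for the MM algorithm.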
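The majorization property can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the setting of the paper: it assumes a linear model with squared loss, a ridge regularizer $\phi(w)=\tfrac{\mu}{2}\|w\|^2$, the example weight function $v^*_\lambda(l)=e^{-l/\lambda}$, and synthetic data. All names and parameter values are hypothetical. It checks that the surrogate lies above the implicit objective at random trial points and touches it at the reference point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
lam, mu = 2.0, 0.1                      # illustrative pace and ridge parameters
X = rng.normal(size=(N, D))             # synthetic inputs (assumption)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

def losses(w):
    """Per-sample squared losses l_i(w) for a linear model (assumption)."""
    return (X @ w - y) ** 2

def phi(w):
    """Ridge regularizer (assumption)."""
    return 0.5 * mu * w @ w

def F(l):
    """Implicit loss for the example regularizer: lam * (1 - exp(-l / lam))."""
    return lam * (1.0 - np.exp(-l / lam))

def v_star(l):
    return np.exp(-l / lam)

def G(w):
    """Implicit objective: phi(w) + sum_i F(l_i(w))."""
    return phi(w) + np.sum(F(losses(w)))

def U(w, w_ref):
    """Surrogate: phi(w) + sum_i [F(l_i(w_ref)) + v*(l_i(w_ref)) * (l_i(w) - l_i(w_ref))]."""
    l_ref = losses(w_ref)
    return phi(w) + np.sum(F(l_ref) + v_star(l_ref) * (losses(w) - l_ref))

w_ref = rng.normal(size=D)
trial_points = 3.0 * rng.normal(size=(1000, D))
gaps = np.array([U(w, w_ref) - G(w) for w in trial_points])
print(gaps.min())                       # non-negative up to rounding
print(U(w_ref, w_ref) - G(w_ref))       # the surrogate touches G at the reference point
```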

One of the key results in [7] is that if $\{w^k\}$ is produced by the AOS algorithm on $E(w,v;\lambda)$, then it can also be produced by performing the MM algorithm on $G_\lambda$, and vice versa. We prove one direction by induction; the other direction is entirely analogous. Suppose we have proved that $w^k$ can be produced by performing the MM algorithm on $G_\lambda$ at the $k$-th step. At the $(k+1)$-th step,

$$\begin{aligned}
w^{k+1}&\in\arg\min_w E(w,v^{k+1};\lambda)\\
&=\arg\min_w\ \phi_\lambda(w)+\sum_{i=1}^N v^{k+1}_il_i(w)\\
&=\arg\min_w\ \phi_\lambda(w)+\sum_{i=1}^N v^*_\lambda(l_i(w^k))\cdot l_i(w)\\
&=\arg\min_w\ \phi_\lambda(w)+Q(w|w^k)=\arg\min_w U(w|w^k).
\end{aligned} \tag{3}$$

Thus we have proved our claim that these two optimization algorithms (AOS/MM), operating on the two different objective functions ($E$/$G_\lambda$), are intrinsically equivalent.

It remains to prove that every convergence point of the MM algorithm, or equivalently, that of the AOS algorithm on the SPL objective, is at least a critical point of $G_\lambda$.
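The AOS iteration and its monotone effect on the implicit objective can be sketched as follows. This is a hypothetical instance, not the authors' experimental setup: it assumes a linear model with squared loss, a ridge regularizer, the example weight function $v^*_\lambda(l)=e^{-l/\lambda}$, and synthetic data with a few injected outliers. Under these assumptions the $w$-step is a weighted ridge regression with a closed-form solution.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 60, 3
lam, mu = 2.0, 0.1                      # illustrative pace and ridge parameters
X = rng.normal(size=(N, D))             # synthetic inputs (assumption)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
y[:6] += 20.0                           # a few large-noise outliers

def losses(w):
    """Per-sample squared losses l_i(w) for a linear model (assumption)."""
    return (X @ w - y) ** 2

def v_star(l):
    """v-step in closed form for the example regularizer."""
    return np.exp(-l / lam)

def G(w):
    """Implicit objective with F(l) = lam * (1 - exp(-l / lam)) and a ridge regularizer."""
    return 0.5 * mu * w @ w + np.sum(lam * (1.0 - np.exp(-losses(w) / lam)))

def w_step(v):
    """Exact minimizer of 0.5 * mu * ||w||^2 + sum_i v_i * (x_i . w - y_i)^2."""
    A = mu * np.eye(D) + 2.0 * X.T @ (v[:, None] * X)
    b = 2.0 * X.T @ (v * y)
    return np.linalg.solve(A, b)

w = np.zeros(D)
G_hist = [G(w)]
for k in range(50):
    v = v_star(losses(w))               # v-step: weights from the current losses
    w = w_step(v)                       # w-step: weighted ridge regression
    G_hist.append(G(w))
G_hist = np.array(G_hist)
print(G_hist[0], G_hist[-1])
print(np.max(np.diff(G_hist)))          # non-positive up to rounding: G never increases
```

The alternating loop never evaluates $G_\lambda$ in order to choose its updates, yet the recorded values of $G_\lambda$ do not increase, which is the descent property that the equivalence with MM provides.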

## 3 The main convergence result

In fact, the proof of the convergence of the MM algorithm is basically the same as that of the EM algorithm (see [12]) with only some obvious changes, as discussed in [11]. The convergence of EM and MM is in turn a corollary of a global convergence theorem of Zangwill (see [15]). We can generalize the proof to the setting of variational analysis. Before that, we need to clarify some terminology, which can be found in [8].

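A function $f$ is said to be lower semi-continuous, or simply lsc, if

$$\operatorname{lev}_{\le\alpha}f:=\{x:f(x)\le\alpha\}$$

is closed for any $\alpha\in\mathbb{R}$. $f$ is said to be level-bounded if $\operatorname{lev}_{\le\alpha}f$ is bounded for any $\alpha\in\mathbb{R}$. And $f$ is called coercive if $f(x)\to+\infty$ as $\|x\|\to\infty$. Note that coercive functions are level-bounded. A critical point $x$ of $f$ means that $0\in\partial f(x)$, where $\partial f$ stands for the subdifferential [8].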

The main theorem of this paper can then be stated as follows.

Theorem 1. Suppose that the objective function of the MM algorithm, $G$, is lsc and level-bounded, and that the surrogate function at $w^*$ is $U(\cdot|w^*)$, which is lsc as a function on $\mathbb{R}^D\times\mathbb{R}^D$, and satisfies

$$\partial U(w|w)\subset\partial G(w),\quad\forall w\in\mathbb{R}^D,$$

where $\partial U(w|w)$ is the partial subdifferential with respect to the first argument. Then for any initial parameter $w^0$, every cluster point of the sequence $\{w^k\}$ produced by the MM algorithm is a critical point of $G$.

###### Proof.

See the appendix. ∎

For our problem, we can give a sufficient condition for convergence, which is easy to verify and is satisfied by most current SPL variations.

Theorem 2. In the SPL objective as defined in Section 2.1, suppose $L$ is bounded below, each $l_i$ is continuously differentiable, $v^*_\lambda$ is continuous, and $\phi_\lambda$ is coercive and lsc. Then for any initial parameter $w^0$, every cluster point of the produced sequence $\{w^k\}$, obtained by the AOS algorithm on solving Eq. (1), is a critical point of the implicit objective $G_\lambda$ as defined in Eq. (2).

###### Proof.

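It is obvious that, under these assumptions, $G_\lambda$ is lsc and level-bounded and $U(w|w^*)$ is lsc as a function on $\mathbb{R}^D\times\mathbb{R}^D$. Moreover, the continuity of $v^*_\lambda$ makes $F_\lambda$ continuously differentiable. Then we have

$$\begin{aligned}
\partial G_\lambda(w^*)&=\partial\phi_\lambda(w^*)+\sum_{i=1}^N F'_\lambda(l_i(w^*))\nabla l_i(w^*)\\
&=\partial\phi_\lambda(w^*)+\sum_{i=1}^N(v^*_\lambda\circ l_i)(w^*)\nabla l_i(w^*)\\
&=\partial U(w^*|w^*).
\end{aligned}$$

Based on Theorem 1, for any initial parameter $w^0$, every cluster point of the produced sequence $\{w^k\}$ is a critical point of $G_\lambda$. The proof is then completed. ∎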

From the theorem, we can see that the AOS algorithm generally used to solve the SPL problem is guaranteed to converge to critical points of the implicit NCRP objective $G_\lambda$. The intrinsic relationship between the two objectives is thereby established.

Note that the above theorem requires every minimization step of the MM algorithm to attain the minimum of the surrogate function $U(\cdot|w^k)$ exactly, i.e.,

$$U(w^{k+1}|w^k)=\min U(\cdot|w^k). \tag{4}$$

This is generally hard to achieve in real applications, especially for learning models without closed-form solutions. We thus want to relax the condition to allow a weaker solution “with errors” when implementing the MM algorithm on the surrogate function. That is, we can weaken condition (4) to

$$U(w^{k+1}|w^k)\le\min U(\cdot|w^k)+\epsilon_k,$$

where $\{\epsilon_k\}$ is a non-negative sequence satisfying $\sum_k\epsilon_k<\infty$, i.e., $\{\epsilon_k\}$ is summable.

Under this relaxed condition, we can still prove the convergence result of SPL, as stated in the following theorem.

Theorem 3. In the SPL objective as defined in Section 2.1, suppose $L$ is bounded below, each $l_i$ is continuously differentiable, $v^*_\lambda$ is continuous, and $\phi_\lambda$ is coercive and lsc. Let $w^0$ be an arbitrary initial parameter, and $\{w^k\}$ be the sequence obtained by the AOS algorithm on solving Eq. (1) with errors $\{\epsilon_k\}$, that is,

$$E(w^k,v^k;\lambda)\le\min E(\cdot,v^k;\lambda)+\epsilon_k,\quad\forall k\ge1.$$

Then every cluster point of $\{w^k\}$ is a critical point of the implicit objective $G_\lambda$ as defined in Eq. (2).

Based on this theorem, we can then confirm the intrinsic relationship between SPL and its implicit objective.
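The relaxed condition can be illustrated with an inexact $w$-step. The sketch below is a hypothetical construction, not taken from the paper: it assumes a linear model with squared loss, a ridge regularizer, the example weight function $v^*_\lambda(l)=e^{-l/\lambda}$, and synthetic data. Inexactness is simulated by perturbing the exact minimizer of the surrogate by a vector of norm $1/k^2$. Since the surrogate is quadratic in $w$ under these assumptions, the resulting errors $\epsilon_k$ decay like $1/k^4$ and are therefore summable.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 60, 3
lam, mu = 2.0, 0.1                      # illustrative pace and ridge parameters
X = rng.normal(size=(N, D))             # synthetic inputs (assumption)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
y[:6] += 20.0                           # a few large-noise outliers

def losses(w):
    return (X @ w - y) ** 2

def F(l):
    return lam * (1.0 - np.exp(-l / lam))

def v_star(l):
    return np.exp(-l / lam)

def G(w):
    """Implicit objective with a ridge regularizer (assumption)."""
    return 0.5 * mu * w @ w + np.sum(F(losses(w)))

def U(w, w_ref):
    """Surrogate of G at w_ref."""
    l_ref = losses(w_ref)
    return 0.5 * mu * w @ w + np.sum(F(l_ref) + v_star(l_ref) * (losses(w) - l_ref))

def exact_w_step(v):
    """Exact minimizer of the weighted ridge subproblem."""
    A = mu * np.eye(D) + 2.0 * X.T @ (v[:, None] * X)
    return np.linalg.solve(A, 2.0 * X.T @ (v * y))

w = np.zeros(D)
G_hist, eps_hist = [G(w)], []
for k in range(1, 101):
    v = v_star(losses(w))
    w_exact = exact_w_step(v)
    u = rng.normal(size=D)
    u /= np.linalg.norm(u)
    w_next = w_exact + u / k**2         # inexact solution: perturbation of norm 1/k^2
    eps_hist.append(U(w_next, w) - U(w_exact, w))   # achieved error epsilon_k
    w = w_next
    G_hist.append(G(w))

eps = np.array(eps_hist)
G_hist = np.array(G_hist)
print(eps.sum())                        # finite total error budget
print(G_hist[0], G_hist[-1])
```

With errors, the implicit objective is no longer guaranteed to decrease at every step, but each increase is bounded by the corresponding $\epsilon_k$, which is the inequality verified in Condition 2b of Appendix B.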

## 4 Conclusion

In this paper, we have proved that the learning process of the traditional SPL regime is guaranteed to converge to critical points of the corresponding implicit NCRP objective. This theory helps confirm the intrinsic relationship between SPL and this implicit objective, and thus verifies the previous robustness analysis of SPL built on this relationship. In addition, the proof of convergence uses some new theoretical techniques, which may to a certain extent also benefit the existing MM and EM convergence theories.

## References

• [1] L. Jiang, D. Meng, T. Mitamura, and A. Hauptmann. Easy samples first: self-paced reranking for zero-example multimedia search. In ACM MM, 2014.
• [2] M. Kumar, H. Turki, D. Preston, and D. Koller. Learning specific-class segmentation from diverse data. In ICCV, 2011.
• [3] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
• [4] Y. Lee and K. Grauman. Learning the easy things first: Self-paced visual category discovery. In CVPR, 2011.
• [5] Junwei Liang, Lu Jiang, Deyu Meng, and Alex Hauptmann. Learning to detect concepts from webly-labeled video data. In IJCAI, 2016.
• [6] L. Lin, K. Wang, D. Meng, W. Zuo, and L. Zhang. Active self-paced learning for cost-effective and progressive face identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
• [7] D. Meng, Q. Zhao, and L. Jiang. What Objective Does Self-paced Learning Indeed Optimize? ArXiv e-prints, November 2015.
• [8] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
• [9] J. Supančič III and D. Ramanan. Self-paced learning for long-term tracking. In CVPR, 2013.
• [10] K. Tang, V. Ramanathan, F. Li, and D. Koller. Shifting weights: Adapting object detectors from image to video. In NIPS, 2012.
• [11] Florin Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, pages 831–840, 2005.
• [12] CF Jeff Wu. On the convergence properties of the EM algorithm. The Annals of statistics, pages 95–103, 1983.
• [13] S. Yu, L. Jiang, Z. Mao, et al. CMU-Informedia@TRECVID 2014 multimedia event detection (MED). In TRECVID Video Retrieval Evaluation Workshop, 2014.
• [14] Zongsheng Yue, Deyu Meng, Juan He, and Gemeng Zhang. Semi-supervised learning through adaptive laplacian graph trimming. Image and Vision Computing, 2016.
• [15] W.I. Zangwill. Nonlinear programming: a unified approach. Prentice-Hall international series in management. Prentice-Hall, 1969.
• [16] Dingwen Zhang, Deyu Meng, Chao Li, Lu Jiang, Qian Zhao, and Junwei Han. A self-paced multiple-instance learning framework for co-saliency detection. In IEEE International Conference on Computer Vision, pages 594–602, 2015.
• [17] Qian Zhao, Deyu Meng, Lu Jiang, Qi Xie, Zongben Xu, and Alexander G Hauptmann. Self-paced learning for matrix factorization. In AAAI, pages 3196–3202, 2015.

## Appendix A Proof of Theorem 1

Theorem 1 is actually a corollary of a stronger version of Zangwill’s global convergence theorem [15, page 91]. We first need to give the following lemmas.

Lemma. If $f$ is lsc, $x_n\to x$, and $\{f(x_n)\}$ is non-increasing, then $\lim_{n\to\infty}f(x_n)=f(x)$.

###### Proof.

$$f(x)=\liminf_{n\to\infty}f(x_n)=\lim_{k\to\infty}\inf_{n\ge k}f(x_n)=\inf_{n\ge1}f(x_n)=\lim_{n\to\infty}f(x_n).$$

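Lemma A. Suppose that $X$ is a finite-dimensional Euclidean space, $M$ is a set-valued mapping from $X$ to $X$, and that $\{x_k\}$ is produced by $M$, which means

$$x_{k+1}\in M(x_k),\quad\forall k\ge0.$$

$\Gamma$ is a subset of $X$ that we are interested in, called the “solution set”, and satisfying

1. There is a compact subset $K\subset X$ such that $x_k\in K$, $\forall k$;

2. $M$ is outer semicontinuous on $X\setminus\Gamma$, that is

 $$x_k\to x\ \text{in}\ X\setminus\Gamma\implies M(x_k)\to M(x);$$

3. There is a lsc function $G$ defined on $X$, such that

 1. $G(y)<G(x)$, $\forall x\notin\Gamma$, $y\in M(x)$,

 2. $G(y)\le G(x)$, $\forall x\in X$, $y\in M(x)$,

then all the cluster points of $\{x_k\}$ are in $\Gamma$, and $\exists x^*\in\Gamma$ such that $\{G(x_k)\}$ is non-increasing and convergent to $G(x^*)$. Note: we will repeatedly use the fact that $\{G(x_k)\}$ is non-increasing. Without loss of generality, we can assume that when we take a subsequence of $\{x_k\}$.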

###### Proof.

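(1) Suppose $x^*$ is a cluster point of $\{x_k\}$. The existence of $x^*$ is guaranteed by the compactness of $K$. Thus there exists a subsequence $\{x_{n_k}\}$ such that $x_{n_k}\to x^*$. Since $\{G(x_{n_k})\}$ is non-increasing, based on the preceding lemma on lsc functions, it holds that

$$G(x^*)=\lim_{k\to\infty}G(x_{n_k}).$$

Denote $G^*=G(x^*)$, and then we prove that $\lim_{n\to\infty}G(x_n)=G^*$. This is because $\forall\epsilon>0$, $\exists k_0$ such that

$$G(x_{n_k})-G^*<\epsilon,\quad\forall k\ge k_0.$$

When $n\ge n_{k_0}$,

$$G(x_n)-G^*=G(x_n)-G(x_{n_{k_0}})+G(x_{n_{k_0}})-G^*<0+\epsilon=\epsilon.$$

There exists $k_1$ such that $n_{k_1}\ge n$, and thus

$$G(x_n)-G^*=G(x_n)-G(x_{n_{k_1}})+G(x_{n_{k_1}})-G^*\ge0+0=0.$$

Therefore,

$$0\le G(x_n)-G^*<\epsilon,\quad\forall n\ge n_{k_0},$$

which indicates $\lim_{n\to\infty}G(x_n)=G^*$.

(2) If $x^*\notin\Gamma$, take a subsequence

$$y_k=x_{n_k+1}\in M(x_{n_k}).$$

Since all $y_k$ lie in $K$, there exists a subsequence $\{y_{k_l}\}$ such that $y_{k_l}\to\bar{x}$. Since $M$ is outer semicontinuous, $\bar{x}\in M(x^*)$. Based on the preceding lemma on lsc functions, we know that $G(\bar{x})=\lim_{l\to\infty}G(y_{k_l})$. Due to the properties of $G$,

$$G(\bar{x})<G(x^*)=\lim_{n\to\infty}G(x_n)=\lim_{l\to\infty}G(y_{k_l})=G(\bar{x}),$$

which is a contradiction. Hence $x^*\in\Gamma$. ∎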

We then provide a proof of Theorem 1.

###### Proof of Theorem 1.

Let $\Gamma$ be the set of critical points of $G$, and

$$M(w^*)=\arg\min_w U(w|w^*).$$

By the descending property of the MM algorithm, $G(w)\le G(w^*)$ for all $w\in M(w^*)$. Condition 3b is satisfied.

Condition 1: since $G$ is lsc and level-bounded, $K=\{w:G(w)\le G(w^0)\}$ is closed and bounded, and thus compact. By the descending property of the MM algorithm, all the parameters $w^k$ lie in $K$.

Condition 2: suppose $w_k\to w^*$, $v_k\in M(w_k)$, and $v_k\to v^*$; then $\forall w$, it holds that

$$U(v_k|w_k)\le U(w|w_k).$$

Taking the limit inferior on both sides as $k\to\infty$, we have

$$U(v^*|w^*)=\liminf_{k\to\infty}U(v_k|w_k)\le\liminf_{k\to\infty}U(w|w_k)=U(w|w^*).$$

Thus $v^*\in M(w^*)$, which means $M$ is outer semicontinuous.

Condition 3a: If $w^*\notin\Gamma$, then

$$0\notin\partial G(w^*)\supset\partial U(w^*|w^*).$$

By the generalized Fermat theorem (see [8, 10.1]), $w^*$ is not a minimizer of $U(\cdot|w^*)$, i.e., $w^*\notin M(w^*)$. Hence, for every $w\in M(w^*)$,

$$G(w)\le U(w|w^*)<U(w^*|w^*)=G(w^*).$$

All the conditions of the preceding lemma are satisfied. The proof is then completed. ∎

## Appendix B Proof of Theorem 3

Similar to [8, 5.41], we give the following definition: A sequence of set-valued mappings $\{M_k\}$ converges outer semicontinuously to another set-valued mapping $M$ if

$$\limsup_{k\to\infty}M_k(x_k)\subset M(\bar{x}),\quad\forall x_k\to\bar{x},$$

that is,

$$x_k\to\bar{x},\ v_k\in M_k(x_k),\ v_k\to\bar{v}\implies\bar{v}\in M(\bar{x}).$$

Before giving the proof of Theorem 3, we need to prove the following two lemmas.

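Lemma B. Let $X$ be a finite-dimensional Euclidean space, and let $M_k$ and $M$ be set-valued mappings from $X$ to itself. Suppose that $\{M_k\}$ converges outer semicontinuously to $M$, and that $\{x_k\}$ is produced by $\{M_k\}$, which means

$$x_{k+1}\in M_k(x_k),\quad\forall k.$$

Let $\Gamma\subset X$ be an arbitrary set, called the “solution set”, satisfying

1. There is a compact set $K$ such that $x_k\in K$, $\forall k$,

2. There is a lsc function $\alpha$ defined on $X$, such that

 1. $\alpha(y)<\alpha(x)$, $\forall x\notin\Gamma$, $y\in M(x)$;

 2. There is a sequence of non-negative numbers $\{\epsilon_k\}$ that is summable, that is $\sum_k\epsilon_k<\infty$, and

 $$\alpha(y_{k+1})\le\alpha(x)+\epsilon_k,\quad\forall y_{k+1}\in M_k(x),\ \forall x,\ \forall k.$$

Then all the cluster points of $\{x_k\}$ lie in $\Gamma$, and $\exists x^*\in\Gamma$ such that $\{\alpha(x_k)\}$ converges to $\alpha(x^*)$.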

###### Proof.

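(1) Set $r_k=\sum_{j\ge k}\epsilon_j$, then $r_k\to0$, and

$$\alpha(x_{k+1})+r_{k+1}\le\alpha(x_k)+\epsilon_k+r_{k+1}=\alpha(x_k)+r_k.$$

Thus $\{\alpha(x_k)+r_k\}$ is non-increasing.

(2) Let $x^*$ be a cluster point of $\{x_k\}$, and then there exists a subsequence $\{x_{n_k}\}$ such that $x_{n_k}\to x^*$. Since $\alpha$ is lsc, we have

$$\alpha(x^*)=\liminf_k\alpha(x_{n_k})=\liminf_k\left(\alpha(x_{n_k})+r_{n_k}\right)=\lim_k\left(\alpha(x_{n_k})+r_{n_k}\right)=\lim_k\alpha(x_{n_k}).$$

The second equality holds because

$$\liminf_k\alpha(x_{n_k})\le\liminf_k\left(\alpha(x_{n_k})+r_{n_k}\right)\le\liminf_k\alpha(x_{n_k})+\limsup_k r_{n_k}=\liminf_k\alpha(x_{n_k}).$$

And we can prove in the same way as in (1) of Lemma A that

$$\lim_n\alpha(x_n)=\alpha(x^*).$$

(3) We need to show that $x^*\in\Gamma$. Suppose not, and take

$$y_k=x_{n_k+1}\in M_{n_k}(x_{n_k}).$$

Due to the compactness of $K$, there is a subsequence $\{y_{k_l}\}$ of $\{y_k\}$ such that $y_{k_l}\to\bar{x}$ as $l\to\infty$. We can argue in the same way as in (2) to show that $\alpha(\bar{x})=\lim_l\alpha(y_{k_l})$. Since

$$y_{k_l}=x_{n_{k_l}+1}\in M_{n_{k_l}}(x_{n_{k_l}}),$$

and $\{M_k\}$ converges outer semicontinuously to $M$, we have $\bar{x}\in M(x^*)$. Thus

$$\alpha(\bar{x})<\alpha(x^*)=\lim_n\alpha(x_n)=\lim_l\alpha(y_{k_l})=\alpha(\bar{x}),$$

which is a contradiction. The proof is then completed. ∎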

Now we can prove another lemma using the above theoretical result.

Let $F$ be the objective of the MM algorithm. Suppose that $F$ is lsc and level-bounded, and that the surrogate function at $w^*$ is $U(\cdot|w^*)$. In addition, suppose $U(w|w^*)$ is lsc as a function defined on $\mathbb{R}^D\times\mathbb{R}^D$ whose subgradient satisfies

$$\partial U(w|w)\subset\partial F(w),\quad\forall w\in\mathbb{R}^D,$$

where $\partial U(w|w)$ is the partial subdifferential with respect to the first argument. Then for any initial parameter $w^0$, all the cluster points of the sequence $\{w^k\}$ produced by the MM algorithm “with errors” are still critical points of $F$.

###### Proof.

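We prove by a direct application of Lemma B. Let $\Gamma$ be the set consisting of all the critical points of $F$, $\alpha=F$, and

$$M_k(w^*)=\{w:U(w|w^*)\le\min U(\cdot|w^*)+\epsilon_k\}.$$

$M$ is the same as before:

$$M(w^*)=\arg\min_w U(w|w^*).$$

We first need to show that $\{M_k\}$ converges outer semicontinuously to $M$. Suppose $w_k\to\bar{w}$, $v_k\in M_k(w_k)$, and $v_k\to\bar{v}$; then $\forall w$,

$$U(v_k|w_k)\le\min U(\cdot|w_k)+\epsilon_k\le U(w|w_k)+\epsilon_k.$$

Taking the limit inferior on both sides as $k\to\infty$, we have

$$U(\bar{v}|\bar{w})=\liminf_k U(v_k|w_k)\le\liminf_k\left(U(w|w_k)+\epsilon_k\right)\le\liminf_k U(w|w_k)+\limsup_k\epsilon_k=U(w|\bar{w}),$$

which means $\bar{v}\in M(\bar{w})$. Thus $\{M_k\}$ converges outer semicontinuously to $M$.

Condition 1: $F$ is level-bounded, thus

$$K(w^0)=\left\{w:F(w)\le F(w^0)+\sum_k\epsilon_k\right\}$$

is bounded. Since $F$ is also lsc, $K(w^0)$ is closed and hence compact. By (1) of Lemma B, all $w^k$ lie in $K(w^0)$.

Condition 2a: If $w^*\notin\Gamma$, then

$$0\notin\partial F(w^*)\supset\partial U(w^*|w^*).$$

By the generalized Fermat theorem, $w^*$ is not a minimizer of $U(\cdot|w^*)$, and hence $w^*\notin M(w^*)$. It follows that $\forall w\in M(w^*)$,

$$F(w)\le U(w|w^*)<U(w^*|w^*)=F(w^*).$$

Condition 2b: Let $v\in M_k(w)$, and then

$$U(v|w)\le\min U(\cdot|w)+\epsilon_k.$$

Thus,

$$F(v)\le U(v|w)\le\min U(\cdot|w)+\epsilon_k\le U(w|w)+\epsilon_k=F(w)+\epsilon_k.$$

Therefore, all the conditions of Lemma B are satisfied and we have finished the proof. ∎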

Just like the proof of Theorem 2, Theorem 3 can be easily proved by directly utilizing the results of the above Lemma B. We omit the proof here.