Inspired by the learning processes used by humans and animals [Bengio et al.2009], Self-Paced Learning (SPL) [Kumar et al.2010] considers training data in a meaningful order, from easy to hard, to facilitate the learning process. Unlike standard curriculum learning [Bengio et al.2009], which learns the data in a predefined curriculum design based on prior knowledge, SPL learns the training data in an order that is dynamically determined by feedback from the individual learner, which means it can be more extensively utilized in practice. In self-paced learning, given a set of training samples along with their labels, a parameter $\lambda$ is used to represent the “age” of the SPL in order to control the learning pace. When $\lambda$ is small, “easy” samples with small losses are considered. As $\lambda$ grows, “harder” samples with larger losses are gradually added to the training set. This type of learning process is modeled on the way human education and cognition function. For instance, students will start by learning easier concepts (e.g., Linear Equations) before moving on to more complex ones (e.g., Differential Equations) in the mathematics curriculum. Self-paced learning can also be naturally interpreted in terms of robust learning, where uncorrupted data samples are likely to be used for training earlier in the process than corrupted data.
In recent years, self-paced learning [Kumar et al.2010] has received widespread attention for various applications in machine learning, such as image classification [Jiang et al.2015], event detection [Jiang et al.2014a, Zhang:2017:SEF:3132847.3132996] and object tracking [Supancic and Ramanan2013, Zhang et al.2016]. A wide assortment of SPL-based methods [Pi et al.2016, Ma et al.2017a] have been developed, including self-paced curriculum learning [Jiang et al.2015], self-paced learning with diversity [Jiang et al.2014b], multi-view [Xu et al.2015] and multi-task [Li et al.2017, Keerthiram Murugesan2017] self-paced learning. In addition, several researchers have conducted theoretical analyses of self-paced learning. [Meng et al.2015] provides a theoretical analysis of the robustness of SPL, revealing that the implicit objective function of SPL has a similar configuration to a non-convex regularized penalty. Such regularization restricts the contributions of noisy examples to the objective, and thus enhances the learning robustness. [Ma et al.2017b] proved that the learning process of SPL always converges to critical points of its implicit objective under mild conditions, while [Fan et al.2017] studied a group of new regularizers, named self-paced implicit regularizers, that are derived from convex conjugacy.
Existing self-paced learning approaches typically focus on modeling the entire dataset at once; however, this may introduce a bottleneck in terms of memory and computation, as today’s fast-growing datasets are becoming too large to be handled integrally. For those seeking to address this issue, the main challenges can be summarized as follows: 1) Computational infeasibility of handling the entire dataset at once. Traditional self-paced learning approaches gradually grow the training set according to their learning pace. However, this strategy fails when the training set grows too large to be handled due to the limited capacity of the physical machines. Therefore, a scalable algorithm is required to extend the existing self-paced learning algorithm for massive datasets. 2) Existence of heterogeneously distributed “easy” data. Due to the unpredictability of data distributions, the “easy” data samples can be arbitrarily distributed across the whole dataset. Considering the entire dataset as a combination of multiple batches, some batches may contain a large number of “hard” samples. Thus, simply applying self-paced learning to each batch and averaging across the trained models is not an ideal approach, as some models will only focus on the “hard” samples and thus degrade the overall performance. 3) Dependency decoupling across different data batches. In self-paced learning, the instance weights depend on the model trained on the entire dataset. Also, the trained model depends on the weights assigned to each data instance. If we consider each data batch independently, a model trained on a “hard” batch can mistakenly mark some “hard” samples as “easy” ones. For example, in robust learning, the corrupted data samples are typically considered as “hard” samples, so more corrupted samples will tend to be included in the training process when we train data batches independently.
In order to simultaneously address all these technical challenges, this paper presents a novel Distributed Self-Paced Learning (DSPL) algorithm. The main contributions of this paper can be summarized as follows: 1) We reformulate the self-paced problem into a distributed setting. Specifically, an auxiliary variable is introduced to decouple the dependency of the model parameters for each data batch. 2) A distributed self-paced learning algorithm based on consensus ADMM is proposed to solve the SPL problem in a distributed setting. The algorithm optimizes the model parameters for each batch in parallel and consolidates their values in each iteration. 3) A theoretical analysis is provided for the convergence of our proposed DSPL algorithm. The proof shows that our new algorithm will converge under mild assumptions, e.g., the loss function can be non-convex. 4) Extensive experiments have been conducted utilizing both synthetic and real-world data based on a robust regression task. The results demonstrate that the proposed approaches consistently outperform existing methods for multiple data settings. To the best of our knowledge, this is the first work to extend self-paced learning to a distributed setting, making it possible to handle large-scale datasets.
The remainder of this paper is organized as follows. Section 2 gives a formal problem formulation. The proposed distributed self-paced learning algorithm is presented in Section 3, and Section 4 presents a theoretical analysis of the convergence of the proposed method. In Section 5, the experimental results are analyzed, and the paper concludes with a summary of our work in Section 6.
2 Problem Formulation
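In the context of distributed self-paced learning, we consider the samples to be provided in a sequence of mini batches as $\{(X^{(1)}, \mathbf{y}^{(1)}), \dots, (X^{(m)}, \mathbf{y}^{(m)})\}$, where $X^{(i)} \in \mathbb{R}^{p \times n_i}$ represents the sample data for the $i$-th batch, $\mathbf{y}^{(i)} \in \mathbb{R}^{n_i}$ is the corresponding response vector, and $n_i$ is the instance number of the $i$-th batch.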
The goal of the self-paced learning problem is to infer the model parameter $\mathbf{w}$ for the entire dataset, which can be formally defined as follows:
Here the problem includes a regularization term for the model parameters $\mathbf{w}$. Variable $\mathbf{v}_i$ represents the instance weight vector for the $i$-th batch and $v_{ij}$ is the weight of the $j$-th instance in the $i$-th batch. The objective function for each mini-batch is defined as follows:
We denote $\mathbf{x}_{ij}$ and $y_{ij}$ as the feature vector and its corresponding label for the $j$-th instance in the $i$-th batch. The loss function $\mathcal{L}$ is used to measure the error between label $y_{ij}$ and the estimated value from model $g$. The term $-\lambda \sum_{j} v_{ij}$ is the regularization term for instance weights $\mathbf{v}_i$, where parameter $\lambda$ controls the learning pace. The notations used in this paper are summarized in Table 1.
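For concreteness, the formulation described above can be written out as below. This is a sketch under assumed notation, since the displayed equations are not reproduced here: $m$ is the number of batches, $\mathbf{w}$ the shared model parameter, $\mathbf{v}_i$ (with entries $v_{ij}$) the instance weights of the $i$-th batch, $g(\mathbf{w}, \mathbf{x}_{ij})$ the model's estimate, $\mathcal{L}$ the loss, and $\lambda$ the pace parameter. The squared $\ell_2$ norm is only a stand-in for the regularizer on the model parameters.

```latex
\begin{align}
\min_{\mathbf{w},\,\mathbf{v}} \quad
  & \sum_{i=1}^{m} f_i(\mathbf{w}, \mathbf{v}_i; \lambda) + \|\mathbf{w}\|_2^2
  \qquad \text{s.t.}\ \ v_{ij} \in [0, 1],\ \forall i, j, \\
f_i(\mathbf{w}, \mathbf{v}_i; \lambda) \;=\;
  & \sum_{j=1}^{n_i} v_{ij}\, \mathcal{L}\bigl(y_{ij},\, g(\mathbf{w}, \mathbf{x}_{ij})\bigr)
    \;-\; \lambda \sum_{j=1}^{n_i} v_{ij}.
\end{align}
```

Under this form, each instance contributes $v_{ij}\,(\mathcal{L}_{ij} - \lambda)$ to the objective, so giving an instance a positive weight lowers the objective only when its loss is below $\lambda$. This is how $\lambda$ acts as the learning pace.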
The problem defined above is very challenging in the following three aspects. First, data instances for all batches can be too large to be handled simultaneously in their entirety, thus requiring the use of a scalable algorithm for large datasets. Second, the instance weight variable $\mathbf{v}_i$ of each batch depends on the optimization result for $\mathbf{w}$ shared by all the data, which means all the batches are inter-dependent and it is thus not feasible to run them in parallel. Third, the objective function is jointly non-convex in the variables $\mathbf{w}$ and $\mathbf{v}$, and it is an NP-hard problem to retrieve the global optimal solution [Gorski et al.2007]. In the next section, we present a distributed self-paced learning algorithm based on consensus ADMM to address all these challenges.
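Table 1: Notation.
|Notation|Explanation|
|---|---|
|$p$|feature number in data matrix $X^{(i)}$|
|$n_i$|instance number in the $i$-th data batch|
|$X^{(i)}$|data matrix of the $i$-th batch|
|$\mathbf{y}^{(i)}$|the response vector of the $i$-th batch|
|$\mathbf{w}$|model parameter of the entire dataset|
|$\mathbf{v}_i$|instance weight vector of the $i$-th batch|
|$v_{ij}$|weight of the $j$-th instance in the $i$-th batch|
|$\lambda$|parameter to control the learning pace|
|$\mathcal{L}$|loss function of estimated model|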
3 The Proposed Methodology
In this section, we propose a distributed self-paced learning algorithm based on the alternating direction method of multipliers (ADMM) to solve the problem defined in Section 2.
The problem defined in Equation (1) cannot be solved in parallel because the model parameter $\mathbf{w}$ is shared by all batches and the value of $\mathbf{w}$ affects the instance weight variable $\mathbf{v}_i$ of each batch. In order to decouple the relationships among all the batches, we introduce a separate model parameter $\mathbf{w}_i$ for each batch and use an auxiliary variable $\mathbf{z}$ to ensure the uniformity of all the model parameters. The problem can now be reformulated as follows:
where the function $f_i$ is defined as follows.
Unlike the original problem defined in Equation (1), here each batch has its own model parameter $\mathbf{w}_i$, and the constraint $\mathbf{w}_i = \mathbf{z}$ for $i = 1, \dots, m$ ensures that the model parameter $\mathbf{w}_i$ has the same value as the auxiliary variable $\mathbf{z}$. The purpose of the problem reformulation is to optimize the model parameters in parallel for each batch. It is important to note that the reformulation is tight because our new problem is equivalent to the original problem when the constraint is strictly satisfied.
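The consensus reformulation described above can be sketched as follows. The notation is assumed rather than taken from the original equations: $\mathbf{w}_i$ is the per-batch model parameter, $\mathbf{z}$ the auxiliary consensus variable, $\mathbf{v}_i$ the instance weights, $\lambda$ the pace parameter, and the squared $\ell_2$ norm on $\mathbf{z}$ is a placeholder for the model regularizer.

```latex
\begin{align}
\min_{\{\mathbf{w}_i\},\,\{\mathbf{v}_i\},\,\mathbf{z}} \quad
  & \sum_{i=1}^{m} f_i(\mathbf{w}_i, \mathbf{v}_i; \lambda) + \|\mathbf{z}\|_2^2 \\
\text{s.t.} \quad
  & v_{ij} \in [0, 1],\ \forall i, j, \qquad
    \mathbf{w}_i = \mathbf{z},\ \ i = 1, \dots, m, \nonumber \\
f_i(\mathbf{w}_i, \mathbf{v}_i; \lambda) \;=\;
  & \sum_{j=1}^{n_i} v_{ij}\, \mathcal{L}\bigl(y_{ij},\, g(\mathbf{w}_i, \mathbf{x}_{ij})\bigr)
    \;-\; \lambda \sum_{j=1}^{n_i} v_{ij}.
\end{align}
```

Substituting $\mathbf{w}_i = \mathbf{z}$ for every batch collapses this back to a problem in a single shared parameter, which is the sense in which the reformulation is tight.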
In the new problem, Equation (3) is a bi-convex optimization problem over $\mathbf{w}_i$ and $\mathbf{v}_i$ for each batch with fixed $\mathbf{z}$, which can be efficiently solved using the Alternate Convex Search (ACS) method [Gorski et al.2007]. With the variable $\mathbf{v}_i$ fixed, the remaining variables $\mathbf{w}_i$, $\mathbf{z}$ and $\mathbf{u}_i$ can be solved by consensus ADMM [Boyd et al.2011]. As the problem is NP-hard, so that finding the global optimum in polynomial time is not expected to be possible, we propose an alternating algorithm DSPL based on ADMM to handle the problem efficiently.
The augmented Lagrangian format of the optimization in Equation (3) can be represented as follows:
where $\mathbf{u}_i$ are the Lagrangian multipliers and $\rho$ is the step size of the dual step.
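As a point of reference, the standard (unscaled) augmented Lagrangian for a consensus problem of this kind, following the general template in [Boyd et al.2011], can be written as below. The symbols $\mathbf{w}_i$, $\mathbf{z}$, $\mathbf{u}_i$, $\rho$ and the squared $\ell_2$ regularizer on $\mathbf{z}$ are assumptions made for illustration, so the exact form used in Equation (5) may differ.

```latex
\begin{equation}
L_\rho\bigl(\{\mathbf{w}_i\}, \{\mathbf{v}_i\}, \mathbf{z}, \{\mathbf{u}_i\}; \lambda\bigr)
= \sum_{i=1}^{m} f_i(\mathbf{w}_i, \mathbf{v}_i; \lambda)
+ \|\mathbf{z}\|_2^2
+ \sum_{i=1}^{m} \mathbf{u}_i^{\top} (\mathbf{w}_i - \mathbf{z})
+ \frac{\rho}{2} \sum_{i=1}^{m} \|\mathbf{w}_i - \mathbf{z}\|_2^2 .
\end{equation}
```

The linear term prices violations of the constraint $\mathbf{w}_i = \mathbf{z}$, while the quadratic term penalizes them directly. Both vanish once consensus is reached.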
The process used to update the model parameter $\mathbf{w}_i$ for the $i$-th batch with the other variables fixed is as follows:
Specifically, if we choose the loss function $\mathcal{L}$ to be a squared loss and model $g$ to be a linear regression, we have the following analytical solution for $\mathbf{w}_i$:
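To illustrate the squared-loss case, the sketch below derives the closed form by setting the gradient of the per-batch subproblem to zero. It assumes the subproblem is $\sum_j v_j (y_j - \mathbf{x}_j^\top \mathbf{w})^2 + \mathbf{u}^\top(\mathbf{w} - \mathbf{z}) + \tfrac{\rho}{2}\|\mathbf{w} - \mathbf{z}\|^2$ with an unscaled dual variable. The function name `update_w` and the row-per-instance layout of `X` are hypothetical choices, so the constants may differ from the paper's Equation (7).

```python
import numpy as np

def update_w(X, y, v, z, u, rho):
    """Closed-form per-batch model update for squared loss and a linear model.

    Minimizes  sum_j v_j * (y_j - x_j^T w)^2 + u^T (w - z) + (rho / 2) * ||w - z||^2.
    X has shape (n, p) with one instance per row (an assumed layout).
    """
    p = X.shape[1]
    Xv = X * v[:, None]                      # scale each row by its instance weight
    A = 2.0 * Xv.T @ X + rho * np.eye(p)     # 2 X^T V X + rho I
    b = 2.0 * Xv.T @ y + rho * z - u         # 2 X^T V y + rho z - u
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                               # noise-free responses
v = np.ones(50)                              # every instance selected
w = update_w(X, y, v, z=w_true, u=np.zeros(3), rho=1.0)
```

When the consensus variable already equals the true coefficients and the data are noise-free, the update leaves the solution at `w_true`. Solving the linear system with `np.linalg.solve` avoids forming an explicit matrix inverse.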
The auxiliary variable $\mathbf{z}$ and Lagrangian multipliers $\mathbf{u}_i$ can be updated as follows:
The stop condition of consensus ADMM is determined by the (squared) norm of the primal and dual residuals of the $k$-th iteration, which are calculated as follows:
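The consolidation step and the stopping quantities can be sketched as below. This follows the generic consensus ADMM recipe in [Boyd et al.2011] and is not taken from the paper's Equations (8) and (9): the function names, the `reg` coefficient on $\|\mathbf{z}\|^2$, and the stacking of per-batch vectors as rows of `W` and `U` are all assumptions.

```python
import numpy as np

def update_z(W, U, rho, reg=1.0):
    """Consensus update.

    Minimizes reg * ||z||^2 + sum_i [u_i^T (w_i - z) + (rho / 2) * ||w_i - z||^2].
    W and U have shape (m, p), one row per batch.
    """
    m = W.shape[0]
    return (rho * W + U).sum(axis=0) / (2.0 * reg + m * rho)

def update_u(W, U, z, rho):
    """Dual ascent step for every batch: u_i <- u_i + rho * (w_i - z)."""
    return U + rho * (W - z)

def residuals(W, z, z_prev, rho):
    """Squared norms of the primal and dual residuals for consensus ADMM."""
    m = W.shape[0]
    primal = float(np.sum((W - z) ** 2))
    dual = float(m * rho ** 2 * np.sum((z - z_prev) ** 2))
    return primal, dual

W = np.array([[1.0, 1.0], [3.0, 3.0]])       # two batches, two features
U = np.zeros_like(W)
z = update_z(W, U, rho=1.0, reg=0.0)         # with reg=0 this is the plain average
U_new = update_u(W, U, z, rho=1.0)
primal, dual = residuals(W, z, z_prev=np.zeros(2), rho=1.0)
```

With `reg=0.0` and zero multipliers, the consensus variable is the average of the per-batch parameters. The multipliers then move in the direction of each batch's disagreement with that average.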
After the weight parameter $\mathbf{w}_i$ for each batch has been updated, the instance weight vector $\mathbf{v}_i$ for each batch will be updated based on the fixed $\mathbf{w}_i$ by solving the following problem:
For the above problem in Equation (10), we always obtain the following closed-form solution:
where $\mathbb{1}(\cdot)$ is the indicator function, whose value equals one when the condition is satisfied; otherwise, its value is zero.
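The indicator-based solution can be illustrated with a few lines of code. The sketch assumes the usual hard-threshold rule of self-paced learning, in which an instance receives weight one when its loss is below the pace parameter and zero otherwise. The name `update_v` and the sample loss values are hypothetical.

```python
import numpy as np

def update_v(losses, lam):
    """Hard-threshold self-paced weights: v_j = 1 if loss_j < lam, else 0."""
    return (np.asarray(losses) < lam).astype(float)

losses = np.array([0.1, 0.9, 0.4, 2.5])
v_small = update_v(losses, lam=0.5)   # only the two easiest instances are selected
v_large = update_v(losses, lam=1.0)   # a larger lam admits one more instance
```

Raising `lam` can only add instances to the selected set, which matches the easy-to-hard ordering described in the introduction.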
The details of algorithm DSPL are presented in Algorithm 1. In Lines 1-2, the variables and parameters are initialized. With the variables $\mathbf{v}_i$ fixed, the other variables are optimized in Lines 5-13 based on the result of consensus ADMM, in which both the model weights $\mathbf{w}_i$ and Lagrangian multipliers $\mathbf{u}_i$ can be updated in parallel for each batch. Note that if no closed-form solution can be found for Equation (6), the update of $\mathbf{w}_i$ is the most time-consuming operation in the algorithm. Therefore, updating $\mathbf{w}_i$ in parallel can significantly improve the efficiency of the algorithm. The variable $\mathbf{v}_i$ for each batch is updated in Line 14, with the variable $\mathbf{w}_i$ fixed. In Lines 15-18, the parameter $\lambda$ is enlarged to include more data instances in the training set. The growth of $\lambda$ is governed by a maximum threshold and a step size. The algorithm stops when the Lagrangian has converged in Line 20. The following two alternative methods can be applied to improve the efficiency of the algorithm: 1) Dynamically update the penalty parameter $\rho$ after Line 11. When , we can update . When , we have . 2) Move the update of variable $\mathbf{v}_i$ into the consensus ADMM step after Line 9. This ensures that the instance weights are updated every time the model is updated, so that the algorithm quickly converges. However, no theoretical convergence guarantee can be made for the two solutions, although in practice they always converge.
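The overall procedure can be sketched end to end as below. This is an illustrative reading of the description above, not the authors' implementation: it assumes squared loss with a linear model, a hard-threshold weight update, a multiplicative growth of $\lambda$ capped at a maximum, and a fixed number of outer iterations in place of the Lagrangian-based stopping test. The function name `dspl`, the parameter names and all default values are hypothetical.

```python
import numpy as np

def dspl(batches, lam=1.0, lam_max=10.0, mu=1.5, rho=1.0, reg=1.0,
         admm_iters=50, outer_iters=20, tol=1e-8):
    """Sketch of distributed self-paced learning (squared loss, linear model).

    batches: list of (X_i, y_i) with X_i of shape (n_i, p).
    All defaults are illustrative, not values from the paper.
    """
    m, p = len(batches), batches[0][0].shape[1]
    W, U, z = np.zeros((m, p)), np.zeros((m, p)), np.zeros(p)
    V = [np.ones(len(y)) for _, y in batches]    # start with every instance included
    for _ in range(outer_iters):
        # Consensus ADMM with the instance weights held fixed.
        for _ in range(admm_iters):
            for i, (X, y) in enumerate(batches):  # independent per batch, hence parallelizable
                Xv = X * V[i][:, None]
                A = 2.0 * Xv.T @ X + rho * np.eye(p)
                b = 2.0 * Xv.T @ y + rho * z - U[i]
                W[i] = np.linalg.solve(A, b)
            z_prev = z
            z = (rho * W + U).sum(axis=0) / (2.0 * reg + m * rho)
            U = U + rho * (W - z)
            primal = np.sum((W - z) ** 2)
            dual = m * rho ** 2 * np.sum((z - z_prev) ** 2)
            if primal < tol and dual < tol:
                break
        # Self-paced step: keep instances whose squared loss is below lam.
        V = [((y - X @ W[i]) ** 2 < lam).astype(float)
             for i, (X, y) in enumerate(batches)]
        lam = min(mu * lam, lam_max)              # grow the pace parameter up to its cap
    return z, V

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
batches, clean_masks = [], []
for _ in range(4):
    X = rng.normal(size=(200, 3))
    corrupted = rng.random(200) < 0.2
    y = X @ w_true + 20.0 * corrupted             # constant offset on corrupted responses
    batches.append((X, y))
    clean_masks.append(~corrupted)
z_hat, V = dspl(batches, lam=1.0, lam_max=10.0, rho=100.0)
```

In this toy setting the corrupted responses sit far from the regression surface, so they are excluded after the first thresholding step and the consensus variable settles close to `w_true`. The penalty `rho=100.0` is chosen to be on the same scale as the data term, which keeps the inner ADMM loop well conditioned. In a real deployment the inner `for i` loop would be dispatched to separate workers.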
4 Theoretical Analysis
In this section, we will prove the convergence of the proposed algorithm. Before we start to prove the convergence of Algorithm 1, we make the following assumptions regarding our objective function and penalty parameter $\rho$:
Assumption 1 (Gradient Lipschitz Continuity).
There exists a positive constant for the objective function of each batch with the following properties:
Assumption 2 (Lower Bound).
The objective function in problem (3) has a lower bound as follows:
Assumption 3 (Penalty Parameter Constraints).
For , the penalty parameter for each batch is chosen in accord with the following constraints:
For , the subproblem (6) of variable is strongly convex with modulus .
For , we have and .
Note that when $\rho$ increases, subproblem (6) will eventually become strongly convex with respect to variable $\mathbf{w}_i$. For simplicity, we will choose the same penalty parameter $\rho$ for all the batches with . Based on these assumptions, we can draw the following conclusions.
Assume that the augmented Lagrangian format of optimization problem (3) satisfies Assumption 1. Then the augmented Lagrangian has the following property:
Due to the gradient Lipschitz continuity assumption, we have the following optimality condition for the update step in Equation (6):
Combined with the update of the Lagrangian multipliers in Equation (8), we have
The augmented Lagrangian can be represented as:
Equation (a) follows from Equation (19) and the inequality (b) comes from the Lipschitz continuity property in Assumption 1. The inequality (c) follows from the lower bound property in Assumption 2.
Algorithm 1 converges when Assumptions 1-3 are all satisfied.
In Lemmas 1 and 2, we proved that the Lagrangian is monotonically decreasing and has a lower bound during the iterations of ADMM. Now we will prove that the same properties hold for the entire algorithm after updating variable $\mathbf{v}$ and parameter $\lambda$.
Inequality (a) follows from Lemma 1 and inequality (b) follows from the optimization step in Line 14 of Algorithm 1. Inequality (c) follows from the fact that $\lambda$ increases monotonically so that . As for some constant values of and has a lower bound , we can easily prove that , where is a constant and is the size of the entire dataset. Therefore, the Lagrangian is convergent. According to the stop condition for Algorithm 1, the algorithm converges when the Lagrangian has converged.
5 Experimental Results
In this section, the performance of the proposed algorithm DSPL is evaluated on both synthetic and real-world datasets in a robust regression task. After the experimental setup has been introduced in Section 5.1, we present the results for the regression coefficient recovery performance with different settings using synthetic data in Section 5.2, followed by house rental price prediction evaluation using real-world data in Section 5.3. All the experiments were performed on a 64-bit machine with an Intel(R) Core(TM) quad-core processor (i7 CPU @ 3.6GHz) and 32.0GB memory. Details of both the source code and the datasets used in the experiment can be downloaded from https://goo.gl/cis7tK.
5.1 Experimental Setup
5.1.1 Datasets and Labels
The datasets used for the experimental verification were composed of synthetic and real-world data. The simulation samples were randomly generated according to the model for each mini-batch, where represents the ground truth coefficients and the adversarial corruption vector. represents the additive dense noise for the batch, where . We sampled the regression coefficients as a random unit norm vector. The covariance data for each mini-batch was drawn independently and identically distributed from and the uncorrupted response variables were generated as . The corrupted response vector for each mini-batch was generated as , where the corruption vector was sampled from the uniform distribution. The set of uncorrupted points was selected as a uniformly random subset of for each batch.
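A data generator in the spirit of this description might look like the sketch below. The distribution parameters are not recoverable from the text, so every specific here is an assumption labeled as such: standard normal covariates, Gaussian noise with standard deviation `noise_std`, and corruption drawn uniformly from a symmetric interval scaled by `corruption_scale`. The names `make_batch`, `beta`, `u` and `eps` are likewise hypothetical.

```python
import numpy as np

def make_batch(n, beta, corruption_ratio, rng, noise_std=0.1, corruption_scale=5.0):
    """Generate one mini-batch y = X beta + u + eps with a sparse corruption vector u.

    noise_std, corruption_scale and the standard normal covariates are
    hypothetical choices, not values taken from the paper.
    """
    p = beta.shape[0]
    X = rng.normal(size=(n, p))                          # covariates, one instance per row
    y_clean = X @ beta                                   # uncorrupted responses
    eps = noise_std * rng.normal(size=n)                 # additive dense noise
    n_corrupt = int(round(corruption_ratio * n))
    idx = rng.choice(n, size=n_corrupt, replace=False)   # uniformly random corrupted subset
    u = np.zeros(n)
    bound = corruption_scale * np.max(np.abs(y_clean))
    u[idx] = rng.uniform(-bound, bound, size=n_corrupt)  # uniform corruption values
    return X, y_clean + u + eps, idx

rng = np.random.default_rng(0)
beta = rng.normal(size=5)
beta /= np.linalg.norm(beta)                             # random unit-norm coefficients
X, y, corrupted_idx = make_batch(1000, beta, corruption_ratio=0.3, rng=rng)
```

Returning the corrupted indices makes it possible to check afterwards which instances a robust method chose to ignore.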
The real-world datasets consisted of house rental transaction data from two cities, New York City and Los Angeles, listed on the Airbnb website (https://www.airbnb.com/) from January 2015 to October 2016. These datasets can be publicly downloaded from http://insideairbnb.com/get-the-data.html. For the New York City dataset, the first 321,530 data samples from January 2015 to December 2015 were used as the training data and the remaining 329,187 samples from January to October 2016 as the test data. For the Los Angeles dataset, the first 106,438 samples from May 2015 to May 2016 were used as training data, and the remaining 103,711 samples as test data. In each dataset, there were 21 features after data preprocessing, including the number of beds and bathrooms, location, and average price in the area.
5.1.2 Evaluation Metrics
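For the synthetic data, we measured the performance of the regression coefficient recovery using the averaged $\ell_2$ error $e = \|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^*\|_2$, where $\hat{\boldsymbol{\beta}}$ represents the recovered coefficients for each method and $\boldsymbol{\beta}^*$ represents the ground truth regression coefficients. For the real-world dataset, we used the mean absolute error (MAE) to evaluate the performance for rental price prediction. Defining $\hat{\mathbf{y}}$ and $\mathbf{y}$ as the predicted price and ground truth price, respectively, the mean absolute error between $\hat{\mathbf{y}}$ and $\mathbf{y}$ can be presented as $\mathrm{MAE}(\hat{\mathbf{y}}, \mathbf{y}) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$, where $y_i$ represents the label of the $i$-th data sample.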
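Both metrics are simple to compute, as the sketch below shows. It assumes the recovery error is the $\ell_2$ distance between coefficient vectors and reports a single-run value, so averaging over runs would be done by the caller. The function names and the sample numbers are hypothetical.

```python
import numpy as np

def recovery_error(beta_hat, beta_true):
    """L2 distance between recovered and ground-truth coefficients."""
    return float(np.linalg.norm(np.asarray(beta_hat) - np.asarray(beta_true)))

def mean_absolute_error(y_pred, y_true):
    """Average absolute difference between predicted and true values."""
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

err = recovery_error([0.6, 0.8], [0.0, 1.0])
mae = mean_absolute_error([100.0, 250.0, 80.0], [110.0, 240.0, 95.0])
```

The recovery error needs the ground truth coefficients and so applies only to synthetic data, whereas MAE needs only held-out labels.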
[Table 3: Mean absolute error of rental price prediction for the New York City and Los Angeles datasets at different corruption ratios.]
5.1.3 Comparison Methods
We used the following methods to compare the performance of the robust regression task. Torrent (Abbr. TORR) [Bhatia et al.2015] is a hard-thresholding based method that includes a parameter for the corruption ratio. As this parameter is hard to estimate in practice, we also used a variant, TORR25, whose corruption parameter is uniformly distributed across a range of 25% off the true value. We also used RLHH [Zhang et al.2017b] for the comparison, a recently proposed robust regression method based on heuristic hard thresholding with no additional parameters. This method computes the regression coefficients for each batch, and averages them all. The DRLR [Zhang et al.2017a] algorithm is a distributed robust learning method specifically designed to handle large scale data with a distributed robust consolidation. The traditional self-paced learning algorithm (SPL) [Kumar et al.2010] with the parameter and the step size was also compared in our experiment. For DSPL, we used the same settings as for SPL with the initial and . All the results from each of these comparison methods were averaged over 10 runs.
5.2 Robust Regression in Synthetic Data
5.2.1 Coefficient Recovery
Figure 1 shows the coefficient recovery performance for different corruption ratios in uniform distribution. Specifically, Figures 1(a) and 1(b) show the results for a different number of features with a fixed data size. Looking at the results, we can conclude: 1) Of the six methods tested, the DSPL method outperformed all the competing methods, including TORR, whose corruption ratio parameter uses the ground truth value. 2) DRLR turned in a competitive performance when the data corruption level was low; however, when the corruption ratio rose to over 40%, its recovery error increased dramatically. 3) The TORR method is highly dependent on the corruption ratio parameter. When the parameter was 25% different from the ground truth, the error for TORR25 was more than 50% higher than that of TORR, which uses the ground truth corruption ratio. 4) When the feature number was increased, the average error for the SPL algorithm increased by a factor of four. However, the results obtained for the DSPL algorithm varied consistently with the corruption ratio and feature number. The results presented in Figures 1(a) and 1(c) confirm that the DSPL method consistently outperformed the other methods for larger datasets, while still achieving a close recovery of the ground truth coefficient.
5.2.2 Performance on Different Corrupted Batches
The regression coefficient recovery performance for different numbers of corrupted batches is shown in Table 2, ranging from four to nine corrupted batches out of the total of 10 batches. Each corrupted batch used in the experiment contains 90% corrupted samples and each uncorrupted batch has 10% corrupted samples. The results are shown for the averaged error in 10 different synthetic datasets with randomly ordered batches. Looking at the results shown in Table 2, we can conclude: 1) When the ratio of corrupted batches exceeds 50%, DSPL outperforms all the competing methods with a consistent recovery error. 2) DRLR performs the best when 40% of the mini-batches are corrupted, although its recovery error increases dramatically when the number of corrupted batches increases. 3) SPL turns in a competitive performance for different levels of corrupted batches, but its error almost doubles when the number of corrupted batches increases from four to nine.
5.2.3 Analysis of Parameter $\lambda$
Figure 2 shows the relationship between the parameter $\lambda$ and the coefficient recovery error, along with the corresponding Lagrangian. This result is based on the robust coefficient recovery task for a 90% data corruption setting. Examining the blue line, as the parameter $\lambda$ increases, the recovery error continues to decrease until it reaches a critical point, after which it increases. These results indicate that the training process will keep improving the model until the parameter $\lambda$ becomes so large that some corrupted samples become incorporated into the training data. In the case shown here, the critical point is around . The red line shows the value of the Lagrangian in terms of different values of the parameter $\lambda$, leading us to conclude that: 1) The Lagrangian monotonically decreases as $\lambda$ increases. 2) The Lagrangian decreases much faster once $\lambda$ reaches a critical point, following the same pattern as the recovery error shown by the blue line.
5.3 House Rental Price Prediction
To evaluate the effectiveness of our proposed method in a real-world dataset, we compared its performance for rental price prediction for a number of different corruption settings, ranging from 10% to 90%. The additional data corruption was sampled from the uniform distribution [-0.5$y_i$, 0.5$y_i$], where $y_i$ denotes the price value of the $i$-th data point. Table 3 shows the mean absolute error for rental price prediction and its corresponding standard deviation over 10 runs for the New York City and Los Angeles datasets. The results indicate that: 1) The DSPL method outperforms all the other methods for all the different corruption settings except when the corruption ratio is less than 30%, and consistently produces the most stable results (smallest standard deviation). 2) Although the DRLR method performs the best when the corruption ratio is less than 30%, the results of all the methods are very close. Moreover, as the corruption ratio rises, the error for DRLR increases dramatically. 3) SPL has a very competitive performance for all the corruption settings but is still around 12% worse than the new DSPL method proposed here, which indicates that considering the data integrally produces a better performance than can be achieved by breaking up the data into batches and treating them separately.
6 Conclusion
In this paper, a distributed self-paced learning algorithm (DSPL) is proposed to extend the traditional SPL algorithm to its distributed version for large scale datasets. To achieve this, we reformulated the original SPL problem into a distributed setting and optimized the problem by treating different mini-batches in parallel based on consensus ADMM. We also proved that our algorithm converges under mild assumptions. Extensive experiments on both synthetic data and real-world rental price data demonstrated that the proposed algorithms are very effective, outperforming the other comparable methods over a range of different data settings.
- [Bengio et al.2009] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48. ACM, 2009.
- [Bhatia et al.2015] Kush Bhatia, Prateek Jain, and Purushottam Kar. Robust regression via hard thresholding. In Advances in Neural Information Processing Systems, pages 721–729, 2015.
- [Boyd et al.2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
- [Fan et al.2017] Yanbo Fan, Ran He, Jian Liang, and Bao-Gang Hu. Self-paced learning: an implicit regularization perspective. In AAAI, 2017.
- [Gorski et al.2007] Jochen Gorski, Frank Pfeuffer, and Kathrin Klamroth. Biconvex sets and optimization with biconvex functions: a survey and extensions. Mathematical Methods of Operations Research, 66(3):373–407, 2007.
- [Hong et al.2016] Mingyi Hong, Zhi-Quan Luo, and Meisam Razaviyayn. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM Journal on Optimization, 26(1):337–364, 2016.
- [Jiang et al.2014a] Lu Jiang, Deyu Meng, Teruko Mitamura, and Alexander G Hauptmann. Easy samples first: Self-paced reranking for zero-example multimedia search. In Proceedings of the 22nd ACM international conference on Multimedia, pages 547–556. ACM, 2014.
- [Jiang et al.2014b] Lu Jiang, Deyu Meng, Shoou-I Yu, Zhenzhong Lan, Shiguang Shan, and Alexander Hauptmann. Self-paced learning with diversity. In Advances in Neural Information Processing Systems, pages 2078–2086, 2014.
- [Jiang et al.2015] Lu Jiang, Deyu Meng, Qian Zhao, Shiguang Shan, and Alexander Hauptmann. Self-paced curriculum learning, 2015.
- [Keerthiram Murugesan2017] Keerthiram Murugesan and Jaime Carbonell. Self-paced multitask learning with shared knowledge. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 2522–2528, 2017.
- [Kumar et al.2010] M Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, pages 1189–1197, 2010.
- [Li et al.2017] Changsheng Li, Junchi Yan, Fan Wei, Weishan Dong, Qingshan Liu, and Hongyuan Zha. Self-paced multi-task learning. In AAAI, pages 2175–2181, 2017.
- [Ma et al.2017a] Fan Ma, Deyu Meng, Qi Xie, Zina Li, and Xuanyi Dong. Self-paced co-training. In International Conference on Machine Learning, pages 2275–2284, 2017.
- [Ma et al.2017b] Zilu Ma, Shiqi Liu, and Deyu Meng. On convergence property of implicit self-paced objective. arXiv preprint arXiv:1703.09923, 2017.
- [Meng et al.2015] Deyu Meng, Qian Zhao, and Lu Jiang. What objective does self-paced learning indeed optimize? arXiv preprint arXiv:1511.06049, 2015.
- [Pi et al.2016] Te Pi, Xi Li, Zhongfei Zhang, Deyu Meng, Fei Wu, Jun Xiao, and Yueting Zhuang. Self-paced boost learning for classification. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 1932–1938. AAAI Press, 2016.
- [Supancic and Ramanan2013] James S Supancic and Deva Ramanan. Self-paced learning for long-term tracking. In , pages 2379–2386, 2013.
- [Xu et al.2015] Chang Xu, Dacheng Tao, and Chao Xu. Multi-view self-paced learning for clustering. In IJCAI, pages 3974–3980, 2015.
- [Zhang et al.2016] Dingwen Zhang, Deyu Meng, Long Zhao, and Junwei Han. Bridging saliency detection to weakly supervised object detection based on self-paced curriculum learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 3538–3544. AAAI Press, 2016.
- [Zhang et al.2017a] X. Zhang, L. Zhao, A. P. Boedihardjo, and C. Lu. Online and distributed robust regressions under adversarial data corruption. In 2017 IEEE International Conference on Data Mining (ICDM), volume 00, pages 625–634, Nov. 2017.
- [Zhang et al.2017b] Xuchao Zhang, Liang Zhao, Arnold P. Boedihardjo, and Chang-Tien Lu. Robust regression via heuristic hard thresholding. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI’17. AAAI Press, 2017.