Multivariate Regression with Grossly Corrupted Observations: A Robust Approach and its Applications

This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or missing, and the magnitudes and locations of such occurrences are unknown a priori. To deal with this problem, we propose a new approach that explicitly considers the error source as well as its sparse nature. An interesting property of our approach is that it allows individual regression output elements or tasks to possess their own noise levels. Moreover, despite working with a non-smooth optimization problem, our approach is still guaranteed to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, our approach has been validated empirically with very promising results on two exemplar real-world applications: the first concerns the prediction of Big-Five personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and comparison methods as well as the involved datasets are made publicly available in support of the open-source and reproducible research initiatives.




1 Introduction

The multivariate linear model, also known as the general linear model, plays an important role in multivariate analysis (Anderson, 2003). It can be written as

$$Y = XW + Z. \tag{1}$$

Here $Y$ is a matrix containing a set of multivariate observations, $X$ is referred to as the design matrix with each row being a sample, $W$ is the regression coefficient matrix that needs to be estimated, and $Z$ is a matrix containing observation noise. One of the central problems in statistics is to precisely estimate the regression coefficient matrix $W$ from the design matrix $X$ and the noisy observations $Y$. It is typical to assume that the noise in $Y$ has bounded energy and can be well absorbed into the noise matrix $Z$, which is usually modelled as following a Gaussian-type distribution. This gives rise to the following regularized loss minimization framework in statistics, where $W$ is estimated by

$$\hat{W} = \arg\min_{W} \; L(Y - XW) + \lambda R(W). \tag{2}$$

Here $L(\cdot)$ denotes a loss function, $\lambda$ refers to a tuning parameter, and $R(\cdot)$ corresponds to a regularization term. For $L(\cdot)$, the least square loss is the most popular choice, and it has been shown to achieve the optimal rates of convergence under certain conditions (Lounici et al, 2011; Rohde and Tsybakov, 2011). It has also been applied in many applications including e.g. multi-task learning (Argyriou et al, 2008; Caruana, 1997).

Nonetheless, there exist real-life situations where certain entries of the observation matrix are corrupted by considerably larger errors than the “normal” ones that can be incorporated in the noise model considered above. Consider for example the following scenario: a few data entries could be severely contaminated due to careless or even malicious user annotations, while these errors are unfortunately difficult to identify in practice. This type of sparse yet large-magnitude noise might seriously damage the performance of the above-mentioned estimator.
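The following minimal sketch illustrates the sensitivity described above on toy data. The names `Y`, `X`, `W`, `Z` and `G` (observations, design, coefficients, dense noise, sparse gross errors), the problem sizes, the 10% corruption rate and the magnitude 50 are all assumptions chosen for illustration, and plain unregularized least squares stands in for the regularized estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 200, 10, 3                       # hypothetical sizes: samples, features, tasks
X = rng.standard_normal((n, d))            # design matrix
W_true = rng.standard_normal((d, m))       # ground-truth coefficient matrix
Z = 0.1 * rng.standard_normal((n, m))      # small dense noise
Y_clean = X @ W_true + Z

# Corrupt roughly 10% of the entries with large errors of random sign.
G = np.zeros((n, m))
mask = rng.random((n, m)) < 0.10
G[mask] = 50.0 * rng.choice([-1.0, 1.0], size=mask.sum())
Y_corrupt = Y_clean + G

def least_squares(X, Y):
    """Unregularized least square estimate of the coefficient matrix."""
    W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_hat

err_clean = np.linalg.norm(least_squares(X, Y_clean) - W_true)
err_corrupt = np.linalg.norm(least_squares(X, Y_corrupt) - W_true)
print(err_clean, err_corrupt)              # the second error is much larger
```

A handful of large entries is enough to dominate the squared residuals, which is why the estimate degrades so sharply in this toy setting.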

Here we tackle this problem of grossly corrupted observations explicitly by considering a sparse matrix $G$ whose nonzero entries have unknown locations and possibly very large magnitudes. This gives rise to the multivariate linear model

$$Y = XW + Z + G. \tag{3}$$

It thus enables us to restore those examples with gross errors instead of merely throwing them away as outliers. Note that the same model is also capable of dealing with the missing data problem, i.e. situations in which a subset of the observations in $Y$ is missing. More concretely, the missing observations can be imputed with zeros, and model (3) is then applied to this modified data. As a result, for each missing entry in $Y$ whose corresponding entry in $G$ is nonzero, its negative forms the predicted correction.
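The zero-imputation idea can be checked on a tiny example. The sketch below assumes an idealized correction matrix (here called `G_ideal`, a hypothetical name) that captures exactly the gap created by imputation; an actual estimate would only approximate it.

```python
import numpy as np

rng = np.random.default_rng(1)
Y_full = rng.standard_normal((5, 2))        # observations if nothing were missing
missing = np.zeros_like(Y_full, dtype=bool)
missing[1, 0] = missing[3, 1] = True        # two hypothetical missing entries

Y_imputed = np.where(missing, 0.0, Y_full)  # zero-imputation step

# If Y_imputed = (X W + Z) + G holds with X W + Z equal to Y_full, the ideal
# correction is zero on observed entries and -Y_full on the missing ones.
G_ideal = Y_imputed - Y_full

# The negative of the correction recovers the missing values.
Y_recovered = np.where(missing, -G_ideal, Y_imputed)
```

Only the missing positions carry a nonzero correction, so the correction matrix is as sparse as the missingness pattern.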

This naturally leads to the following optimization framework for estimating $W$ and $G$:

$$\min_{W,\,G} \; L(Y - XW - G) + \lambda R(W, G), \tag{4}$$

with $L(\cdot)$ being a loss function, $\lambda$ a trade-off parameter, and $R(\cdot)$ a regularization term. Further, rather than the usual least square loss, the $\ell_{2,1}$-norm, defined as the sum of the 2-norms of all columns, is considered here as the loss function due to its ability to deal with different noise levels in regression tasks. Additionally, we employ a group sparsity inducing norm for $W$ to enforce group-structured sparsity, and use the $\ell_1$-norm for $G$ to impose an element-wise sparsity constraint for detecting possible gross corruptions.
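As a rough illustration of the three ingredients just named, the sketch below evaluates a column-wise loss, a group penalty and an element-wise penalty. The function names, the two weights `lam1` and `lam2`, and the unweighted form of the group penalty are assumptions made for this example; the exact scaling used in the paper is not recoverable from the text here.

```python
import numpy as np

def l21_norm(A):
    """Sum of the Euclidean norms of the columns of A."""
    return np.linalg.norm(A, axis=0).sum()

def group_norm(W, groups):
    """Sum over groups of the Frobenius norm of the rows of W in each group."""
    return sum(np.linalg.norm(W[g, :]) for g in groups)

def l1_norm(G):
    """Sum of the absolute values of all entries of G."""
    return np.abs(G).sum()

def objective(Y, X, W, G, groups, lam1, lam2):
    """Column-wise loss on the residual plus the two sparsity penalties."""
    residual = Y - X @ W - G
    return l21_norm(residual) + lam1 * group_norm(W, groups) + lam2 * l1_norm(G)

A = np.array([[3.0, 0.0], [4.0, 0.0]])
print(l21_norm(A))   # column norms are 5 and 0
```

Because each column contributes its own unsquared norm, a task with a large residual does not dominate the loss the way it would under a squared Frobenius norm.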

This paper contains the following major contributions: (1) A new approach is proposed in the context of multivariate linear regression to explicitly model and recover the missing or grossly corrupted observations, while the adoption of the $\ell_{2,1}$-norm loss function also facilitates modeling the noise levels of individual outcome variables; (2) The induced non-smooth optimization problem is addressed by our proposed multi-block proximal alternating direction method of multipliers (ADMM) solver, which is shown in this paper to be efficient and globally convergent; (3) To demonstrate the general applicability of our approach, two interesting and distinct real-world applications are examined. The first involves the investigation of Big-Five personality from user online behaviors when interacting with social network sites (SNSs), an emerging problem in computational psychology. The second concerns the challenging computer-vision problem of depth-image based human hand pose estimation. These two problems exemplify the broad spectrum of applications where our approach could be applicable. Empirical evaluation is carried out with synthetic and real datasets for both applications, where our approach is shown to compare favorably with existing multivariate regression models as well as application-specific state-of-the-art methods; (4) Last but not least, to support open-source and reproducible practice, our implementations and related datasets are made publicly available (see footnote 1).

Footnote 1: Implementations of our approach as well as comparison methods, related datasets, and detailed information can be found at our dedicated project webpage.

We note in passing that part of this paper was presented in (Zhang et al, 2015). Meanwhile, this paper presents a substantial amount of new content compared to (Zhang et al, 2015). From the algorithmic aspect, our approach is presented in more detail and in a self-contained manner, with convergence analysis and proof as well as time complexity analysis. From the empirical point of view, our approach is systematically examined on a series of simulated data, and has additionally been validated on the interesting yet challenging application of depth-image based human hand pose estimation, where very competitive performance is obtained on both synthetic and real datasets. Besides, our code is made publicly available in support of the open-source research practice. From the presentation side, the paper has been significantly rewritten and reorganized to accommodate the new material, including e.g. a review of the hand pose estimation literature. Overall, the work presented in this paper is considerably more self-contained and better connected with real-world problems.

1.1 Related Work

In the line of work in machine learning and statistics, various methods (Bhatia et al, 2015; Li, 2012; Nguyen and Tran, 2013; Wright and Ma, 2010; Xu et al, 2013, 2012; Xu and Leng, 2012) have been proposed for linear regression with gross errors as in (3), among which (Nguyen and Tran, 2013) and (Xu and Leng, 2012) examine the univariate and multivariate outputs, respectively, of the optimization problem (4) with the standard least square loss. Nevertheless, as pointed out in (Liu et al, 2014), the least square loss has two drawbacks. First, all regression tasks share the same regularization trade-off parameter (each column of the observation or coefficient matrix in (1), which corresponds to an element of the multivariate regression output, can be regarded as a task). As a result, the varying noise levels contained in different regression tasks are unfortunately ignored. Second, to improve the finite-sample performance, it is often important to choose an optimal trade-off parameter that depends on an estimate of the unknown variance of the noise. Aiming to address these two issues, a calibrated multivariate regression (CMR) method has been proposed in (Liu et al, 2014), where the $\ell_{2,1}$-norm is employed as a novel loss function. It enjoys a number of desirable properties, including being tuning insensitive and being capable of calibrating tasks according to their individual noise levels. Theoretical and empirical evidence (Liu et al, 2014) has demonstrated the ability of CMR to deliver improved finite-sample performance. This inspires us to adopt the $\ell_{2,1}$-norm as the loss function in our approach. It is worth pointing out that our induced optimization problem, and subsequently our proposed solver, differ clearly from those of (Liu et al, 2014).

Related Work in Personality Prediction

So far, there have been relatively few research efforts toward personality prediction from SNS behaviors (Ma et al, 2011). One prominent theory in the study of personality is the Big-Five theory (Funder, 2001; Matthews et al, 2006), which describes personality traits by five disparate dimensions: Conscientiousness, Agreeableness, Extraversion, Openness, and Neuroticism. Traditionally, the most common way to predict an individual’s personality is to apply a self-report inventory (Domino and Domino, 2006), which relies on the subjects filling out questionnaires on their own behaviors; the answers are then summarized into a quantitative five-dimensional personality descriptor. However, this method has two disadvantages. First, it is practically infeasible to conduct self-report inventories on a large scale. Second, maliciously wrong answers might be supplied, or idealized answers or wishes might be provided instead of the real ones, which reduces the credibility of the annotations. These issues collectively lead to highly deviated or sometimes missing personality descriptors.

It has been widely accepted in psychology that an individual’s personality can be manifested by behaviors. In recent years, social networks including Facebook, Twitter, RenRen, and Weibo have drastically changed the way people live and communicate. There is evidence (Landers and Lounsbury, 2006) suggesting that social network behaviors are significantly correlated with real-world behaviors. On the one hand, the fast growing number of SNS users provides a large amount of data for social research (Jie, 2011; Reynol, 2011). On the other hand, a proper model that can fully exploit the data to perform personality prediction is still lacking. One research effort along this direction is that of Gosling et al. (Gosling et al, 2011), which proposes a mapping between personality and SNS behaviors. Specifically, they design 11 features, including friends count and weekly usage, based on self-reported Facebook usage and observable profile information, and investigate the correlation between personality and these features. However, these features rely entirely on statistical descriptions rather than on ones that explicitly reveal user behaviors. Moreover, their data collection requires considerable manual effort, since the procedure is based on self-reported Facebook usage and online profile information, making it unrealistic for practical purposes.

Related Work in Hand Pose Estimation from Single Depth Images

3D hand pose estimation (Erol et al, 2007; de La Gorce et al, 2011) refers to the problem of estimating the finger-level human hand joint locations in 3D. Vision-based hand interpretation has played important roles in diverse applications including humanoid animation (Sueda et al, 2008; Wang and Popović, 2009), robotic control (Gustus et al, 2012), and human-computer interaction (Hackenberg et al, 2011), among others. At its core lies the interesting yet challenging problem of 3D hand pose estimation, whose difficulty is owing mostly to the complex and dexterous nature of hand articulations (Gustus et al, 2012). Facilitated by the emerging commodity-level depth cameras, recent efforts such as (Keskin et al, 2012; Tang et al, 2014; Xu et al, 2016) have led to noticeable progress in the field. A binary latent tree model is used in (Tang et al, 2014) to guide the search for the 3D locations of hand joints, while (Oikonomidis et al, 2014) adopts an evolutionary optimization method to capture hand and object interactions. A dedicated random forest variant for the hand pose estimation problem is proposed in (Xu et al, 2016), with state-of-the-art empirical performance as well as theoretical consistency guarantees. For this emerging research topic, the NYU Hand Pose dataset (Tompson et al, 2014) is becoming the de facto benchmark on which new methods assess their performance on 3D hand pose estimation; it is also considered in the empirical evaluation section of our paper.

1.2 Notations and Definitions

Several notations and definitions are provided below. Given any scalar , we define , that is if and 0 otherwise. Given a and , we denote by the -norm and the -norm. A group is a subset of with cardinality , while denotes a set of groups, where each element corresponds to a group that potentially overlaps with others. Overall we have . denotes the subset of entries of with indices in . In a similar way, given a matrix , and refer to the rows and columns of indexed by , respectively. An identity matrix is used with its size being self-evident from the context. denotes the set of symmetric positive definite matrices of size -by-. In what follows we define three norms in a row, which are the Frobenius, spectral, and -norms: , , and . denotes the matrix rank in our context, and denotes the -th largest singular value of matrix . We also define the matrix -norm and -norm as and , respectively. Finally, the group lasso penalty associated with a group set is defined as .

1.3 Organization

The rest of this paper is organized as follows. In Section 2, we propose our robust multivariate regression model and compare its finite-sample performance with other multivariate regression models. In Section 3, we describe the derivation of algorithm CMRG, which solves the induced optimization problem of our model via proximal ADMM, and provide its convergence analysis and complexity analysis. In Section 4, we evaluate the effectiveness of the proposed method on both synthetic and real data, and apply our method to predict personality from user behaviours at SNSs as well as estimate hand pose from depth images. Finally, conclusions are drawn in Section 5.

2 The Proposed Model

In our context, is denoted as a group sparsity inducing norm. The stochastic noise considered in model (1) is assumed to follow the following law with a covariance matrix .

To start with, we consider the least square loss and its usage in the ordinary multivariate regression (OMR) model, which gives rise to the following convex optimization problem:


Theoretically, it has been shown in (Lounici et al, 2011) that, under the assumption that and suitable conditions on , let , if we choose for some , then with the following rate of convergence


we obtain the estimator of (5). Here denotes the number of nonzero rows in and means stochastic upper bound. However, as pointed out in (Liu et al, 2014), the empirical loss function of (5) has two potential drawbacks: (1) All the tasks are regularized by the same parameter , which introduces unnecessary estimation bias to some of the tasks with less noise in order to compensate for the other tasks having larger noise. (2) It is highly non-trivial to decide on a proper tuning parameter for a satisfactory result.

As a remedy, Liu et al. (Liu et al, 2014) advocate a calibrated multivariate regression (CMR) model that uses the $\ell_{2,1}$-norm as the loss function, where the regularization is dedicated to each regression task . In other words, it is calibrated toward the individual noise level . Mathematically, CMR considers the following optimization problem:


A re-interpretation of this CMR model in (7) comes from the following weighted regularized least squares problem:

where is the weight assigned to calibrate the -th task. When there is no prior knowledge on , we estimate it by letting , which can be considered as the error in the -th task. Theoretically, (Liu et al, 2014) has proven that the CMR model (7) achieves better finite-sample performance than the ordinary multivariate linear regression model (5) in the sense that estimator achieves the optimal rates of convergence in (6) if we choose , which is independent of . Therefore, the tuning parameter in the OMR model depends on the noise level ’s (through ), while the tuning parameter in the CMR model is insensitive to the noise level ’s.
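The weighted least squares view can be illustrated numerically. The sketch below assumes the weight of each task is the reciprocal of that task's residual norm (named `sigma` here for illustration), in which case the weighted sum of squared residuals coincides with the sum of unsquared column norms. The residual matrix `R` and its per-task scales are made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical residuals for 4 tasks with very different noise scales.
R = rng.standard_normal((50, 4)) * np.array([0.1, 1.0, 5.0, 20.0])

sigma = np.linalg.norm(R, axis=0)                  # per-task residual norm
l21_loss = sigma.sum()                             # sum of column 2-norms
weighted_ls = ((R ** 2).sum(axis=0) / sigma).sum() # squared losses weighted by 1/sigma

print(l21_loss, weighted_ls)                       # the two values agree
```

Each task's squared loss is divided by its own error level, so a noisy task is automatically down-weighted without a separately tuned per-task parameter.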

Unfortunately, neither the OMR model of (5) nor the CMR model of (7) addresses the gross error issue. A natural idea toward addressing this problem is to explicitly model the gross errors in the observations as stated in (3). Corrected observations can then be obtained by simply removing the gross errors from the observations, and are further used to estimate the regression coefficient matrix. More specifically, we combine the benefits of both the CMR model of (7) and the model of (3) by investigating the following optimization objective:


Here we have two regularization parameters, and . In particular, when letting , we get , and optimization problems (7) and (8) give the same solutions. In other words, when there is no gross error, our method reduces to the original CMR. In this regard, our method can be considered as an extension of CMR to deal with gross errors.

A related work is (Xu and Leng, 2012), which also looks at grossly corrupted observations but in the context of multi-task regression. Its associated optimization problem can be reformulated as:


Note as pointed out in (Liu et al, 2014), there are some limitations in the least square loss function in (9).

To illustrate the applicability of model (8), we use 3D hand pose estimation from depth images as an example. In this context, each instance is a depth image of a human hand and the associated observation is a vector containing the 3D coordinates of all hand joints. The total length of the observation vector depends on the number of joints. The entries in the observation vector are distinct and yet intrinsically connected, in the sense that they describe 3D coordinates of different joints while joints in the same finger are connected. Moreover, the noise levels of these entries are not necessarily the same. For example, finger tips are prone to large errors while other finger joints have smaller errors. This naturally suggests the use of a multi-task regression model with the CMR loss, which regularizes different tasks with different parameters. On the other hand, ground-truth observations are obtained from manual annotations, which are hard to make precise due to occlusions, and it is truly difficult to rule out the potential existence of gross errors from either careless or malicious user annotations.

3 Our CMRG Algorithm

Different from the ordinary multivariate linear regression problems in (5) and (9), the optimization problem in (8) is more challenging, as both the loss function and the regularization terms are non-smooth. For this we develop a dedicated proximal ADMM algorithm which is also in line with (Boyd et al, 2011; Fazel et al, 2013; Sun et al, 2015; Chen et al, 2016). We first adopt a variable splitting procedure as in (Chen et al, 2012; Qin and Goldfarb, 2012) to reformulate (8) as an equivalent linearly constrained problem, as follows. Let , and denote as the composition matrix from , which is constructed by copying the rows of whenever they are shared between two overlapping groups. That is, provided as the set of overlapping groups, we can construct a new set of non-overlapping groups by means of a disjoint partition of conforming to the identity below

It is clear that . Moreover, the linear system explicitly characterizes the relations between and . Here is defined as: if and otherwise. Note that here corresponds to a very sparse matrix. denotes a diagonal matrix where each of its diagonal entries equals the number of repetitions of the corresponding row in . In the special case of , its corresponding is composed of only non-overlapping groups.
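The row-copying construction can be sketched as follows. The names `composition_matrix`, `C`, `W_tilde` and `D`, the toy groups, and the dense storage are assumptions for illustration; in practice such a matrix would be stored in a sparse format.

```python
import numpy as np

def composition_matrix(groups, d):
    """0/1 matrix that stacks the rows of a d-row matrix group by group,
    duplicating any row that belongs to more than one group."""
    C = np.zeros((sum(len(g) for g in groups), d))
    r = 0
    for g in groups:
        for j in g:
            C[r, j] = 1.0      # stacked row r is a copy of original row j
            r += 1
    return C

groups = [[0, 1, 2], [2, 3]]            # hypothetical overlapping groups, row 2 is shared
d = 4
C = composition_matrix(groups, d)
W = np.arange(8.0).reshape(d, 2)        # toy coefficient matrix
W_tilde = C @ W                         # 5 rows: row 2 of W appears twice

# In the stacked matrix the groups no longer overlap.
disjoint_groups = [[0, 1, 2], [3, 4]]

# C^T C is diagonal and counts how often each original row was copied.
D = C.T @ C
print(np.diag(D))
```

Since every row of `C` has exactly one nonzero entry, the product `C.T @ C` is diagonal, which keeps linear systems involving it cheap to handle.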

Furthermore, (8) can be equivalently reformulated as the following optimization problem

s.t. (10)

when letting . It turns out to be in the exact form of a 3-block convex problem as follows:


Here the notations are simplified as follows: , and . , , , . , , , .

A natural algorithmic candidate to tackle the general 3-block convex optimization problem in (11), or its more concrete realization (10) in our context, is the multi-block ADMM, which is a direct extension of the ADMM for 2-block convex optimization problems (Boyd et al, 2011). It is unfortunately observed in (Chen et al, 2016) that, although the usual 2-block ADMM converges, its direct extension to multi-block ADMM might diverge. This non-convergence behavior of multi-block ADMM has attracted a number of research efforts toward convergent variants. The study of (Sun et al, 2015) empirically examines the existing convergent multi-block ADMM variants, and finds that collectively they substantially under-perform the direct multi-block ADMM extension that has no convergence guarantee. Fortunately, a proximal ADMM was recently developed in (Sun et al, 2015) that enjoys both a theoretical convergence guarantee and superior empirical performance over the direct ADMM extension. This inspires us to propose a similar algorithm, described below.
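For readers less familiar with the 2-block ADMM that the multi-block variants extend, here is a minimal sketch on the standard lasso problem in scaled dual form. It is not the CMRG solver; the function names, the penalty `rho` and the iteration count are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(V, t):
    """Element-wise shrinkage toward zero by t."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=500):
    """2-block ADMM for min 0.5*||A x - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    d = A.shape[1]
    x, z, u = np.zeros(d), np.zeros(d), np.zeros(d)
    # The system matrix never changes, so it is factored once up front.
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(d))
    Atb = A.T @ b
    for _ in range(iters):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # smooth block
        z = soft_threshold(x + u, lam / rho)               # non-smooth block
        u = u + x - z                                      # scaled dual update
    return z

# With an identity design the lasso solution is soft-thresholding of b.
b = np.array([3.0, -0.5, 1.2, -4.0])
z_hat = admm_lasso(np.eye(4), b, lam=1.0)
print(z_hat)
```

The two primal blocks are updated one after the other, followed by the multiplier; the 3-block setting discussed above adds a further primal block to this cycle.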

For optimization problem (10), the corresponding augmented Lagrangian function can be written as


where and are Lagrangian multipliers and is the barrier parameter. Similar to (Sun et al, 2015), by applying proximal ADMM to solve the optimization problem of (10), we obtain the following steps for updating variables and parameters at the th iteration:


Here we have tuning parameters , . Let , we choose initial values of , and so that


It turns out all the subproblems in (13)–(16) enjoy closed-form solutions. Subproblem (13) can be formulated as

where . Thus, the columns of are given by


Similarly, subproblem (13) can be formulated as

where , and the solution is given by


To solve the subproblem (14), we have


where we used equality which can be derived from equalities (17), (18) and initial conditions (19). The solution to subproblem (15) is given by


where , is the sign function, and denotes component-wise multiplication. To solve subproblem (16), we have


We are now ready to present Algorithm 1 for calibrated multivariate regression with grossly corrupted observations (CMRG), which incorporates the above-mentioned components. Note that, compared side by side with the direct 3-block ADMM extension, our proximal ADMM possesses an additional step to evaluate . The additional cost of evaluating is trivial, as the inverse of is usually easy to compute (e.g., when the Cholesky factorization of exists).

1:  Input: , , , , , and .
2:  Initialization: , , , , , such that and . .
3:  repeat
4:     Evaluate by (20).
5:     Evaluate by (21).
6:     Evaluate by (22).
7:     Evaluate by (23).
8:     Evaluate by (24).
9:     Evaluate as well as by (17).
10:     .
11:  until Convergence
12:  Output: , .
Algorithm 1 CMRG (Calibrated Multivariate Regression with Gross Errors)
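Closed-form updates of the kind used in Algorithm 1 typically reduce to shrinkage operators. The sketch below shows three standard ones: column-wise shrinkage for a sum-of-column-norms term, block shrinkage for non-overlapping row groups, and element-wise soft-thresholding built from the sign function and component-wise multiplication. The function names and threshold argument `t` are assumptions; the exact arguments and step sizes in updates (20)–(24) are not reproduced here.

```python
import numpy as np

def prox_l1(V, t):
    """Element-wise soft-thresholding: sign(V) * max(|V| - t, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def prox_l21_columns(V, t):
    """Shrink each column of V toward zero by t in Euclidean norm."""
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    scale = np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)
    return V * scale

def prox_group_rows(V, groups, t):
    """Shrink each (non-overlapping) group of rows by t in Frobenius norm."""
    out = np.array(V, dtype=float, copy=True)
    for g in groups:
        norm = np.linalg.norm(out[g, :])
        out[g, :] *= max(1.0 - t / max(norm, 1e-12), 0.0)
    return out

V = np.array([[3.0, 0.3], [4.0, 0.4]])
print(prox_l21_columns(V, 1.0))   # first column scaled by 0.8, second set to zero
```

Columns or groups whose norm falls below the threshold are set exactly to zero, which is how these penalties produce structured sparsity.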

For any optimal solution to problem (10), there exist optimal Lagrangian multipliers such that


where denotes subdifferential of convex functions. Performing the extra step is in fact crucial as it ensures the global convergence of the sequence generated by Algorithm 1 to an optimal solution satisfying (25). This is formally stated in Theorem 3.1.

Theorem 3.1

Under the condition , the sequence , , , , , generated by Algorithm 1 converges to a unique point satisfying (25), so that is an optimal solution to optimization problem (10) and is an optimal solution to the dual problem of (10).


The proof can be derived by applying Theorem 2.2(iii) in (Sun et al, 2015). It is easy to verify that the solution set of (10) is nonempty and the constraint qualification in Assumption 2.1 of (Sun et al, 2015) holds. Therefore, to complete our proof, it is sufficient to show that and in (11) are positive definite, which is obvious since and .

Algorithmic complexity

The complexity of Algorithm 1 consists of two main parts, corresponding to the computation of the inverse of and the updating of variables in each iteration. Since the coefficient matrix is the same for all , one has to compute only once before the iterations, which costs . Moreover, when , the cost can be further reduced by applying the Sherman-Morrison-Woodbury formula (Golub and Van Loan, 1996) and computing the inverse of the , which costs . Thus, the cost of computing is with . When both and are large, it might be impractical to compute the Cholesky factorization of or . In this circumstance, one can solve the linear systems (22) and (24) using an iterative solver such as the preconditioned conjugate gradient method (Golub and Van Loan, 1996). In each iteration, the cost of computing is dominated by that of computing , which is . Similarly, computing , , and costs , , and , respectively. Overall, the complexity of Algorithm 1 is , where denotes the number of iterations.
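The saving from the Sherman-Morrison-Woodbury formula can be illustrated on a matrix of the form `beta * I + X.T @ X`, which is an assumption about the shape of the system here rather than the exact matrix inverted in Algorithm 1. When there are fewer samples than features, only a small sample-by-sample matrix needs to be inverted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, beta = 20, 200, 0.7                 # hypothetical sizes with n much smaller than d
X = rng.standard_normal((n, d))

# Direct route: invert a d-by-d matrix.
direct = np.linalg.inv(beta * np.eye(d) + X.T @ X)

# Woodbury route: invert only an n-by-n matrix.
small = np.linalg.inv(beta * np.eye(n) + X @ X.T)
woodbury = (np.eye(d) - X.T @ small @ X) / beta

print(np.max(np.abs(direct - woodbury)))  # difference at rounding-error level
```

In an actual solver one would keep a factorization of the small matrix and apply it to right-hand sides instead of forming either inverse explicitly.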

4 Empirical Evaluations

The central piece of our approach is the model (8), or CMRG for short, which is also referred to as our approach when there is no risk of confusion. There are also three related models: the OMR model in (5) for ordinary multivariate regression, the CMR model in (7) for calibrated multivariate regression, as well as the OMRG model in (9) for ordinary multivariate regression with gross error, where is used instead of . Our proposed Algorithm 1 is employed to solve our proposed model, while standard ADMM is used to solve the remaining models in a similar manner.

To facilitate a better understanding of the inner workings as well as a systematic evaluation of the proposed approach, we first consider a series of experiments on simulated data, where we have full access to the ground truths, the gross errors, and the contaminated observations. This is followed by experiments on two exemplar real-world applications: Big-Five personality prediction from computational psychology, and 3D hand pose estimation from computer vision. Each of these experiments is described in detail in what follows.

4.1 Simulated Experiments

We first generate simulation datasets to systematically evaluate the finite-sample performance of our new model in controlled settings. The synthetic data are obtained following a similar scheme to that of (Liu et al, 2014), as follows. Each dataset has 400 examples for training, 400 examples for validation, as well as examples for testing. More concretely, the training examples are obtained as follows:

  1. Each individual row of is generated by independent sampling from a 1000-dimensional normal distribution, , with diagonals , and for all off-diagonal entries .

  2. Construct the structure of group sparsity as

    The regression coefficient matrix is obtained as follows: (1) for and and , we have ; (2) the remaining entries are set to .

  3. Construct the noise matrix as . Here each entry in is i.i.d. sampled from zero-mean identity variance Gaussian law, ; The matrix is a diagonal matrix, which is obtained by


    where implies that all regression tasks contain stochastic noises of the same magnitude while implies that regression tasks contain stochastic noises of different magnitude.

  4. Construct the gross error . The number of nonzero entries is controlled by a ratio (). The positions of these non-zero entries are randomly selected, while their magnitudes are set to , where is a scaling factor, and the signs are randomly assigned.

In the same way, we can generate validation samples and testing samples except that we do not add gross errors to the validation and testing samples.
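The four generation steps above can be sketched as below. Only the 400 training examples and the 1000-dimensional inputs come from the text; the number of tasks `m`, the equal off-diagonal correlation `corr`, the count of active rows, the coefficient range, the per-task noise scales, and the default `ratio` and `magnitude` are hypothetical placeholders, and a simple row-sparse pattern stands in for the group structure.

```python
import numpy as np

def make_training_set(n=400, d=1000, m=10, n_active_rows=20, corr=0.5,
                      task_scales=None, ratio=0.1, magnitude=5.0, seed=0):
    rng = np.random.default_rng(seed)

    # Step 1: rows of X drawn from a zero-mean normal law with unit
    # variances and a common off-diagonal covariance (assumed value).
    Sigma = np.full((d, d), corr)
    np.fill_diagonal(Sigma, 1.0)
    X = rng.standard_normal((n, d)) @ np.linalg.cholesky(Sigma).T

    # Step 2: coefficient matrix with only a few nonzero rows (assumed pattern).
    W = np.zeros((d, m))
    W[:n_active_rows] = rng.uniform(-1.0, 1.0, size=(n_active_rows, m))

    # Step 3: standard normal noise scaled per task by a diagonal matrix.
    if task_scales is None:
        task_scales = np.ones(m)
    Z = rng.standard_normal((n, m)) @ np.diag(task_scales)

    # Step 4: gross errors at random positions with random signs.
    k = int(round(ratio * n * m))
    G = np.zeros(n * m)
    idx = rng.choice(n * m, size=k, replace=False)
    G[idx] = magnitude * rng.choice([-1.0, 1.0], size=k)
    G = G.reshape(n, m)

    return X, X @ W + Z + G, W, Z, G

X, Y, W, Z, G = make_training_set()
print(X.shape, Y.shape, np.count_nonzero(G))
```

Validation and test sets would follow the same recipe with `ratio=0.0`, so that no gross errors are added to them.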

Empirical evaluations are carried out on datasets generated with different values of , and to evaluate the performance of CMRG. The regularization parameters and are obtained by


respectively, using 5-fold cross-validation. The optimal parameter is set by

Here refers to the estimation obtained from parameters , while and correspond to the design and observation matrices from the validation set.

The following metrics are used in our experiments:

which measures prediction error on the testing data , adjusted prediction error on the testing data, estimation error of and estimation error of , respectively. Throughout this experiment, all shown results are average results over 100 repetitions.

| Algorithm | Without gross error: Pre.Err. | Without gross error: Est.Err. (coefficients) | With gross error: Pre.Err. | With gross error: Est.Err. (coefficients) | With gross error: Est.Err. (gross errors) |
| OMR | 0.2196 ± 1.0e-2 | 0.2188 ± 1.0e-2 | 0.4400 ± 2.0e-2 | 0.4379 ± 2.0e-2 | |
| CMR | 0.2196 ± 1.1e-2 | 0.2190 ± 1.1e-2 | 0.4326 ± 1.9e-2 | 0.4313 ± 2.0e-2 | |
| OMRG | 0.2196 ± 1.0e-2 | 0.2188 ± 1.0e-2 | 0.4390 ± 1.9e-2 | 0.4360 ± 2.0e-2 | 0.9644 ± 2.9e-2 |
| CMRG | 0.2196 ± 1.1e-2 | 0.2190 ± 1.1e-2 | 0.3544 ± 2.1e-2 | 0.3534 ± 2.1e-2 | 0.4814 ± 1.3e-2 |
Table 1: Prediction and estimation error (in terms of mean ± standard deviation) of the comparison regression models OMR, CMR, OMRG and CMRG on synthetic data generated with and .
| Algorithm | Without gross error: Pre.Err. | Without gross error: Adj.Pre.Err. | Without gross error: Est.Err. (coefficients) | With gross error: Pre.Err. | With gross error: Adj.Pre.Err. | With gross error: Est.Err. (coefficients) | With gross error: Est.Err. (gross errors) |
| OMR | 0.1209 ± 5.3e-3 | 0.0866 ± 6.1e-3 | 0.1200 ± 5.4e-3 | 0.4109 ± 1.7e-2 | 0.4092 ± 2.1e-2 | 0.4083 ± 1.7e-2 | |
| CMR | 0.1115 ± 4.9e-3 | 0.0612 ± 4.4e-3 | 0.1106 ± 4.9e-3 | 0.4052 ± 1.8e-2 | 0.4039 ± 2.2e-2 | 0.4032 ± 1.8e-2 | |
| OMRG | 0.1175 ± 3.6e-3 | 0.0793 ± 3.9e-3 | 0.1120 ± 2.6e-3 | 0.4112 ± 1.7e-2 | 0.4078 ± 2.2e-2 | 0.4048 ± 2.1e-2 | 0.9800 ± 2.5e-2 |
| CMRG | 0.1115 ± 4.9e-3 | 0.0612 ± 4.4e-3 | 0.1106 ± 4.9e-3 | 0.2021 ± 1.4e-2 | 0.1305 ± 1.9e-2 | 0.2015 ± 1.3e-2 | 0.2645 ± 9.0e-3 |
Table 2: Prediction and estimation error (mean ± standard deviation) of four regression models: OMR, CMR, OMRG and CMRG on synthetic data generated with and .

We first study the effect of stochastic noise level in different tasks by letting and , and show results of four comparison models in Table 1 and Table 2, respectively. In Table 1, since metric Adj.Pre.Err. reduces to metric Pre.Err. when , we do not include the results related to Adj.Pre.Err. From Table 1 and Table 2, we make four observations: (1) When (that is, all regression tasks contain the same level of stochastic noise) and there is no gross error, all four models have the same performance. (2) When (that is, regression tasks contain different levels of stochastic noise), models adopting the $\ell_{2,1}$-norm as the loss function (i.e., CMR and CMRG) outperform the ones using the least square loss (i.e., OMR and OMRG) in terms of both prediction error on testing data and estimation error of . (3) In the presence of gross errors, regression models OMRG and CMRG that consider gross error perform consistently better than OMR and CMR, which do not. (4) CMRG usually delivers lower prediction error as well as lower estimation error of and when compared to OMRG. In summary, our newly proposed model CMRG achieves the best overall performance and outperforms the other models by a large margin when there are gross errors.

Next, we study the effect of while letting , and , and show results for , , in Table 3. Again, we observe that regression models with calibration perform better than their counterparts without calibration, and CMRG outperforms other models by a large margin.

Algorithm | Pre.Err. | Adj.Pre.Err. | Est.Err. | Est.Err.
OMR | 0.4109 ± 1.7e-2 | 0.4092 ± 2.1e-2 | 0.4083 ± 1.7e-2 | -
CMR | 0.4052 ± 1.8e-2 | 0.4039 ± 2.2e-2 | 0.4032 ± 1.8e-2 | -
OMRG | 0.4112 ± 1.7e-2 | 0.4078 ± 2.2e-2 | 0.4048 ± 2.1e-2 | 0.9800 ± 2.5e-2
CMRG | 0.2021 ± 1.4e-2 | 0.1305 ± 1.9e-2 | 0.2015 ± 1.3e-2 | 0.2645 ± 9.0e-3

OMR | 0.5288 ± 1.3e-2 | 0.5237 ± 1.7e-2 | 0.5255 ± 1.4e-2 | -
CMR | 0.5157 ± 1.6e-2 | 0.5108 ± 1.9e-2 | 0.5132 ± 1.6e-2 | -
OMRG | 0.6200 ± 1.8e-2 | 0.6123 ± 1.7e-2 | 0.6058 ± 1.8e-2 | 0.9895 ± 2.2e-2
CMRG | 0.2596 ± 1.1e-2 | 0.1700 ± 1.3e-2 | 0.2585 ± 1.2e-2 | 0.2529 ± 1.0e-2

OMR | 1.0882 ± 1.6e-2 | 1.0724 ± 2.3e-2 | 1.0812 ± 1.4e-2 | -
CMR | 0.7359 ± 2.3e-2 | 0.7319 ± 1.9e-2 | 0.7280 ± 2.3e-2 | -
OMRG | 1.2938 ± 4.5e-1 | 1.2672 ± 4.6e-1 | 1.2336 ± 5.0e-1 | 0.8554 ± 2.3e-1
CMRG | 0.3909 ± 2.0e-2 | 0.2717 ± 2.1e-2 | 0.3881 ± 2.0e-2 | 0.2183 ± 5.7e-3
Table 3: Effect of on the prediction and estimation error (mean ± standard deviation) of four regression models: OMR, CMR, OMRG and CMRG on synthetic data generated with , and .
Figure 1: Effect of the ratio () of grossly corrupted training observations with , , . Each panel shows one evaluation metric as a function of with value equal to 0, 0.2, ..., 1. Panels: (a) prediction error; (b) adjusted prediction error; (c) estimation error of ; (d) estimation error of .
Figure 2: Effect of the magnitude () of gross errors in the observations with , , . Each panel shows one evaluation metric as a function of with value equal to 5, 10, 50, and 100. Panels: (a) prediction error; (b) adjusted prediction error; (c) estimation error of ; (d) estimation error of .

We also study the effect of (the ratio of gross errors in the observations) and (the magnitude of gross errors). Figure 1 shows the results of the four regression models for equal to 0, 0.2, ..., 1, and Figure 2 shows the results for equal to 5, 10, 50, 100. From Figure 1, we observe that for all four models the prediction error and the estimation errors of and increase as more observations are grossly corrupted. Models with calibration perform better than models without calibration for all values of , and the advantage is more pronounced when is large. Moreover, our newly proposed model CMRG outperforms CMR when and performs similarly to CMR when (see Figure 1(a)). One reason may be that CMRG fails to identify gross errors in the observations when more than half of the observations are corrupted, as shown in Figure 1(d). This suggests the existence of a certain threshold, and our approach could recover the gross error successfully when is lower than the threshold. This topic is left for future research. From Figure 2, we see that CMRG is insensitive to the magnitude . On the other hand, OMRG has large deviations when is large, which also reflects the difficulty of selecting proper regularization parameters in OMRG.

4.2 Experiments on Personality Prediction from SNSs Behaviors

We used a new SNSs dataset built from the microblogging site Sina Weibo (the Chinese equivalent of Twitter). Subjects were recruited to log in to Weibo through our dedicated website (after filling in consent forms and with legal data privacy management), and their historical behavior data on Weibo were collected. For this Weibo dataset, 45 behavior features are constructed and arranged into 4 groups, namely social networking, profile, self-presentation, and security setting; in total, 630 subjects were recruited.

This set of data is further inspected to keep only active users, while rejecting those participants who either publish 512 blogs altogether, or publish zero blog during the past three months. This leads to the final dataset of 562 subjects (instances), which is further partitioned into 450 instances for training and 112 instances for testing. Each subject is also asked to complete a questionnaire, the well-known BPL (Berkeley Personality Lab) Big-Five inventory consisting of forty-four items. The inventory results are then summarized into a five-dimensional personality descriptor following the standard procedure. This gives rise to a vector with each element taking a value with .

Four comparison methods are employed here: the three closely related methods (OMR, CMR, and OMRG), as well as ridge regression (RR), which is in fact model (2) with the least-squares loss and the regularization term of . Similar to the simulated experiments in subsection 4.1, we choose parameter from , and pick parameter from the set . Finally, the optimal pair is selected by five-fold cross validation on the training set. The relative prediction error already used in the synthetic evaluations is again adopted here for performance evaluation.
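The parameter selection described above can be sketched as a plain grid search with five-fold cross validation. This is only a minimal illustration under stated assumptions, not the procedure actually used: the function names, the candidate grids, and the `cv_error` callback (assumed to fit a model on the training indices with the given parameter pair and return its relative prediction error on the held-out indices) are all hypothetical.

```python
import itertools
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def select_by_cv(n_samples, grid_a, grid_b, cv_error, k=5, seed=0):
    """Return (best_a, best_b, mean_error) over all candidate pairs.

    cv_error(train_idx, val_idx, a, b) is a hypothetical user-supplied
    callback: it fits a model on train_idx and returns the error on val_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    best = None
    for a, b in itertools.product(grid_a, grid_b):
        fold_errors = []
        for held_out in range(k):
            val_idx = folds[held_out]
            train_idx = [i for f in range(k) if f != held_out for i in folds[f]]
            fold_errors.append(cv_error(train_idx, val_idx, a, b))
        mean_error = sum(fold_errors) / k
        # Keep the pair with the lowest cross-validated error seen so far.
        if best is None or mean_error < best[2]:
            best = (a, b, mean_error)
    return best
```

With the 450 training instances of the Weibo data, each of the five folds would hold 90 instances.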

To begin with, we consider evaluations without the presence of gross error. The left half of Table 4 reports results averaged over 10 repetitions, where the rows Agreeableness, Conscientiousness, Extraversion, Neuroticism and Openness in the left-most column give the average prediction error for the corresponding personality. Further, Pre.Err. stands for the relative prediction error averaged over the 5 output personalities. As displayed in Table 4, empirically OMRG and CMRG (i.e., the two models that consider gross error) perform on par with OMR and CMR (i.e., the two methods that do not consider gross error at all) in the current dataset context, where there is no gross error. Furthermore, CMRG outperforms OMR and OMRG, but is slightly outperformed by RR.

Personality | Without gross error | | | | | Corrupted data (with 10% missing observations) | | | |
 | RR | OMR | CMR | OMRG | CMRG | RR | OMR | CMR | OMRG | CMRG
Agreeableness | 0.1784 | 0.1788 | 0.1783 | 0.1788 | 0.1783 | 0.2176 | 0.2146 | 0.2136 | 0.2055 | 0.1914
Conscientiousness | 0.2128 | 0.2226 | 0.2212 | 0.2226 | 0.2212 | 0.2332 | 0.2174 | 0.2170 | 0.2160 | 0.2109
Extraversion | 0.2147 | 0.2172 | 0.2152 | 0.2172 | 0.2152 | 0.2384 | 0.2340 | 0.2379 | 0.2310 | 0.2205
Neuroticism | 0.2262 | 0.2269 | 0.2271 | 0.2269 | 0.2271 | 0.2670 | 0.2676 | 0.2622 | 0.2594 | 0.2543
Openness | 0.1717 | 0.1830 | 0.1823 | 0.1830 | 0.1823 | 0.2088 | 0.1822 | 0.1814 | 0.1720 | 0.1641
Overall Pre.Err. | 0.1993 | 0.2046 | 0.2037 | 0.2046 | 0.2037 | 0.2320 | 0.2231 | 0.2221 | 0.2166 | 0.2076
Table 4: Performance of five competing methods on Weibo data.

In practice, there are situations where some entries in the multivariate output space are missing. To investigate such cases, we process our personality dataset so that entries (which amounts to out of the total output entries) in the observations are randomly replaced by zero (i.e., they are deleted). Experiments are then carried out on this corrupted dataset (for both training and testing). The right half of Table 4 presents the average prediction errors over ten repeats, where we clearly observe that CMRG significantly outperforms the remaining competitors. To further investigate the ability of our newly proposed model to identify gross errors in the observations, we introduce an additional metric, Rec.Rate., which quantifies the fraction of correctly restored positive/negative entry signs in . In other words, Rec.Rate. equals the number of entries in and that have the same sign divided by the number of entries in . Table 5 presents the comparison of CMRG vs. OMRG in terms of restoring those missing values. Empirically, CMRG is shown to be capable of accurately identifying most of the missing observations and performs much better than OMRG.

Methods | Est.Err. | Rec.Rate.
CMRG | 0.7130 ± 0.0052 | 0.9920 ± 0.00030
OMRG | 0.8859 ± 0.2552 | 0.8923 ± 0.0076
Table 5: Recovery accuracy of OMRG and CMRG on Weibo data with missing observations.
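The Rec.Rate. metric defined in the text, the fraction of entries whose sign agrees between the estimated and the reference gross-error matrix, might be computed as in the sketch below. The function names and the optional tolerance `tol` (magnitudes not exceeding it are treated as zero) are assumptions added for illustration; matrices are plain nested lists of equal shape.

```python
def sign(value, tol=0.0):
    """Return +1, -1, or 0; magnitudes not exceeding tol count as zero."""
    if value > tol:
        return 1
    if value < -tol:
        return -1
    return 0


def rec_rate(estimated, reference, tol=0.0):
    """Fraction of entries whose signs agree between two equal-shape matrices."""
    matches = 0
    total = 0
    for est_row, ref_row in zip(estimated, reference):
        for est, ref in zip(est_row, ref_row):
            matches += sign(est, tol) == sign(ref, tol)
            total += 1
    return matches / total
```

Note that an entry without gross error has sign zero, so a spurious non-zero estimate at such an entry counts as a mismatch in this sketch.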

In addition, since the averaged absolute distance (AAD) is widely used as an evaluation metric in the area of personality prediction, it is also applied here to measure the deviation of our predictions from the gold standard when the corrupted data are in use. Table 6 displays the results averaged over ten repeats, from which we see that CMRG consistently outperforms the other regression models.

Agreeableness 0.64 0.64 0.61 0.65 0.56
Conscientiousness 0.55 0.55 0.55 0.59 0.54
Extraversion 0.62 0.63 0.61 0.62 0.59
Neuroticism 0.68 0.66 0.65 0.68 0.64
Openness 0.53 0.53 0.50 0.61 0.48
Average 0.60 0.60 0.58 0.63 0.56
Table 6: AAD results of competing methods on Weibo data with missing observations.
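As a rough illustration of how per-trait AAD values such as those in Table 6 could be computed, the sketch below assumes that AAD is the mean absolute difference between the gold-standard trait score and the predicted score (the formula itself is not stated in this section). The function names are hypothetical; inputs are nested lists with one row per subject and one column per trait.

```python
def aad_per_trait(gold, predicted):
    """Mean absolute difference per output dimension (one value per trait)."""
    n_subjects = len(gold)
    n_traits = len(gold[0])
    return [
        sum(abs(gold[i][j] - predicted[i][j]) for i in range(n_subjects)) / n_subjects
        for j in range(n_traits)
    ]


def aad_average(gold, predicted):
    """Average of the per-trait values, analogous to the 'Average' row."""
    per_trait = aad_per_trait(gold, predicted)
    return sum(per_trait) / len(per_trait)
```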

4.3 Hand Pose Estimation from Depth Images

Vision-based hand pose estimation has plenty of applications in various areas including humanoid animation, human-computer interaction, and robotic control. The core problem here is 3D hand pose estimation (Erol et al, 2007; de La Gorce et al, 2011), which is difficult owing mostly to the complex and dexterous nature of hand articulations. Facilitated by the emerging commodity-level depth cameras such as Kinect and Softkinect, recent efforts on 3D hand pose estimation from depth images (Ye et al, 2013; Tang et al, 2014; Xu et al, 2016) have led to noticeable progress in the field. In this section, we apply our CMRG method to the problem of 3D hand pose estimation from depth images. We evaluate the performance of our method on a home-grown synthesized depth image dataset as well as the benchmark NYU Hand pose dataset (Tompson et al, 2014), which are described separately in what follows, and compare against state-of-the-art methods in this field.

Figure 3: An illustration of (a) the hand kinematic model and (b) examples of synthesized hand depth images with ground-truth annotations used in our synthetic dataset for performance evaluation. For each hand image, its annotation contains 20 joints represented as a vector of length 60, consisting of the 3D locations of the joints following a prescribed order.
Figure 4: Examples of hand depth images with ground-truth annotations from the NYU Hand pose dataset. For each hand image, its annotation contains 14 joints represented as a 42-dimensional vector.
Figure 5: Average joint error in millimeters on the synthetic dataset: (a) results using the original training data (no gross error in ); (b) results using the corrupted training data with entries in missing; (c) results using the corrupted training data with entries in missing.

Our synthetic dataset

To conduct quantitative analysis, we generate an in-house dataset of k synthesized hand depth images, of which k are used for training, k for validation, and the remaining k are reserved for testing. The 3D position, orientation, and hand gesture are randomly generated. The distance from a synthetic hand to the virtual camera varies within the range of 650mm to 800mm. The image size obtained from the virtual depth camera is , and the vertical field-of-view of the camera is 43 degrees. For each depth image, the output label of the hand pose is expressed in terms of the set of 3D coordinates of all 20 finger joints, as illustrated in Figure 3. We concatenate the coordinates of the joints to obtain a 60-dimensional vector.

To apply the proposed approach, Convolutional Neural Network (CNN) features are extracted from each depth image as the input in our context, and the corresponding 60-dimensional coordinate vector serves as the label . The CNN features are obtained as follows: the ImageNet-pretrained AlexNet (Krizhevsky et al, 2012) is adopted to learn a CNN model on our aforementioned training set. Note that to fulfill the input requirement of AlexNet, the depth values in each image are scaled to between 0 and 255, each image is resized properly, and each depth image is replicated three times to form a three-channel image. The MatConvNet deep learning library is adopted in this paper. The final CNN model is attained after 50 training epochs. Now, given a new image, after applying the learned CNN model, its CNN features are obtained by simply retrieving the output of the second-to-last fully connected layer, which is a 4096-dimensional vector.
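The three input-preparation steps mentioned above (scaling depth values to between 0 and 255, resizing, and replicating the depth map into three channels) can be sketched as follows. This is an illustrative stand-in rather than the MatConvNet pipeline used in the paper: min-max scaling, nearest-neighbour resizing, the channels-first layout, and all function names are assumptions, and the target size is left as a parameter because it is not specified here.

```python
def scale_depth(depth):
    """Linearly map depth values to [0, 255] (min-max scaling is an assumption)."""
    lo = min(min(row) for row in depth)
    hi = max(max(row) for row in depth)
    span = hi - lo
    if span == 0:
        # A constant image carries no contrast; map it to all zeros.
        return [[0.0 for _ in row] for row in depth]
    return [[255.0 * (v - lo) / span for v in row] for row in depth]


def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize; a stand-in for whatever resizing was used."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def to_three_channels(image):
    """Replicate a single-channel image three times (channels-first layout)."""
    return [[row[:] for row in image] for _ in range(3)]


def prepare_depth_image(depth, out_h, out_w):
    """Scale, resize, and replicate a depth map given as a list of rows."""
    return to_three_channels(resize_nearest(scale_depth(depth), out_h, out_w))
```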

NYU Hand pose dataset

The NYU Hand pose dataset (Tompson et al, 2014) contains 8,252 RGBD images in its test set and 72,757 in its training set, from which only the depth channel images are considered in our context. Some examples of depth images are displayed in Figure 4. As only 14 finger joints are annotated in the NYU dataset, the output label here becomes a 42-dimensional coordinate vector. Meanwhile, the input contains the CNN features extracted following the same protocol used for the synthetic dataset. For the competing regression models OMR, CMR, OMRG and CMRG, the original 72,757 training images are randomly partitioned into two subsets: 62,757 images for training and 10,000 images as a validation set to determine the internal parameters and .

Figure 6: Average joint error in millimeters on the NYU Hand pose dataset: (a) results using the original training data (no gross error in ); (b) results using the corrupted training data with entries in missing; (c) results using the corrupted training data with entries in missing.
Synthetic dataset
missing 3.1047 3.1032 3.0995 3.0817 6.5168 4.9900
missing 8.0142 8.0271 6.2471 3.2753 11.8555 8.9299
missing 14.9866 14.7223 14.7838 4.6401 17.5957 14.9712
NYU Hand pose dataset
missing 18.3859 18.3527 18.3871 18.3527 24.8189 18.9210
missing 22.3120 21.9279 22.3106 18.4887 30.2869 26.0876
missing 33.0263 32.7631 21.3270 18.8763 38.5659 38.1679
Table 7: Average joint error in millimeter (mm) of the whole hand.

Evaluation metric

Following the convention of the hand pose estimation literature, such as (Xu et al, 2016), our performance evaluation metric is based on the joint error, which is defined as the Euclidean distance between the ground-truth and predicted 3D joint locations. Formally, denote $y_{ij}$ and $\hat{y}_{ij}$ as the ground-truth and predicted locations of the $j$-th joint of the $i$-th testing sample. The mean joint error of the $j$-th joint is defined as $e_j = \frac{1}{N} \sum_{i=1}^{N} \| y_{ij} - \hat{y}_{ij} \|$, where $N$ is the number of testing examples and $\| \cdot \|$ denotes the Euclidean norm in 3D space. Moreover, the mean joint error of the whole hand is simply the average of all mean joint errors, that is, $e = \frac{1}{J} \sum_{j=1}^{J} e_j$, where $J = 20$ for the synthetic dataset and $J = 14$ for the NYU Hand pose dataset.
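The joint error metric described above (per-joint Euclidean distance averaged over test samples, then averaged over joints for the whole hand) can be written compactly as below. The function names and the nested-list data layout are illustrative assumptions; `math.dist` requires Python 3.8 or later.

```python
import math


def mean_joint_errors(truth, predicted):
    """Per-joint mean Euclidean error.

    truth[i][j] and predicted[i][j] are (x, y, z) locations of joint j
    in test sample i.
    """
    n_samples = len(truth)
    n_joints = len(truth[0])
    return [
        sum(math.dist(truth[i][j], predicted[i][j]) for i in range(n_samples)) / n_samples
        for j in range(n_joints)
    ]


def whole_hand_error(truth, predicted):
    """Average of the per-joint mean errors over all joints."""
    per_joint = mean_joint_errors(truth, predicted)
    return sum(per_joint) / len(per_joint)
```

If the coordinates are given in millimeters, both functions return errors in millimeters.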

Experimental Set-up

Four multivariate regression models, OMR, CMR, OMRG, and CMRG, are compared in the experiments. In addition, two dedicated hand pose estimation methods are considered. One is the recent work of DHand (Xu et al, 2016). For the other, given the recent dramatic progress of deep learning, it is sensible to include a CNN method based on the ImageNet-pretrained AlexNet (Krizhevsky et al, 2012) as described earlier in the paper, where right after the 4096-dimensional fully connected layer, the standard least-squares loss layer with loss term is used for multivariate regression of the 3D joint locations. Similar to the previous experiments on simulated data, for all four regression models, the optimal parameters and are chosen from based on the validation data. Moreover, to investigate how these methods behave when there are missing annotations in the training data, for both the synthetic and NYU datasets, two additional sets of training data are obtained with and entries being randomly deleted from the ground-truth , with the rest remaining the same. Note that for the deleted annotation entries, the original values are replaced by .
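The random deletion of annotation entries described above could be simulated as in the following sketch. The function name, the seed handling, and the default fill value of 0 are assumptions (a zero fill mirrors the personality experiment in subsection 4.2); the returned mask is an extra convenience for checking which entries were deleted.

```python
import random


def delete_entries(labels, fraction, fill=0.0, seed=0):
    """Return (corrupted_copy, mask) with a given fraction of entries deleted.

    labels is a list of equal-length rows; mask[r][c] is True where the
    entry was replaced by `fill` (the fill value is an assumption).
    """
    n_rows, n_cols = len(labels), len(labels[0])
    total = n_rows * n_cols
    n_delete = round(fraction * total)
    # Sample distinct flat positions so exactly n_delete entries are removed.
    chosen = random.Random(seed).sample(range(total), n_delete)
    corrupted = [list(row) for row in labels]
    mask = [[False] * n_cols for _ in range(n_rows)]
    for flat in chosen:
        r, c = divmod(flat, n_cols)
        corrupted[r][c] = fill
        mask[r][c] = True
    return corrupted, mask
```

The input labels are left untouched, so the clean and corrupted versions can be compared afterwards.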


Quantitative results of the competing methods on the synthetic dataset are presented in Figure 5, where the -axis of each plot shows the average joint error in millimeters (mm). The mean joint error over the entire hand for all participating methods is also provided in Table 7. From Figure 5 and Table 7, our approach (CMRG) clearly and consistently outperforms the remaining methods in the presence of gross errors, including domain-specific methods such as DHand as well as the deep learning baseline. More specifically, when training with the original non-contaminated data in Figure 5(a), all four multivariate regression methods deliver similar performance, and CMRG is only slightly better. Interestingly, all four methods perform better than DHand and AlexNet, which we attribute to the additional sparsity-inducing regularizer adopted by all four models to enforce feature selection, in comparison with AlexNet where only the least-squares empirical loss term is in use. Compared with DHand, these four regression models are fed with CNN features, which may provide a performance boost. In particular, as illustrated in Figure 5(b)-(c), our approach stands out in terms of robustness to an increasing number of missing entries, while the remaining methods produce noticeably larger errors.

Similar trends can also be observed on the NYU dataset, as presented in Figure 6 and Table 7. It is worth mentioning that the advantage of our CMRG over the comparison methods is less significant without gross error (Figure 6(a)), but with an increasing number of missing entries in , CMRG behaves much better than the remaining competitors by retaining a robust performance, as shown in Figure 6(b)-(c) and Table 7. Inspired by the surprisingly good performance of our approach, combining our model with CNN would be an interesting direction for future research in the area of 3D hand pose estimation.

5 Conclusions

We consider a new approach dedicated to the multivariate regression problem where some output labels are either corrupted or missing. The gross error is explicitly addressed in our model, which also allows distinct regression elements or tasks to adapt to their own noise levels. We further propose a proximal ADMM algorithm, which is globally convergent and efficient, and analyze its convergence and runtime properties. The model, combined with the specifically designed solver, enables our approach to tackle a diverse range of applications. This is practically demonstrated on two distinct applications, namely predicting personalities based on behaviors at SNSs and estimating 3D hand pose from single depth images. Empirical experiments on synthetic and real datasets have showcased the applicability of our approach in the presence of label noise. For future work, we plan to integrate our approach with more advanced deep learning techniques to better address more practical problems, including 3D hand pose estimation and beyond.


References

  • Anderson (2003) Anderson T (2003) An Introduction to Multivariate Statistical Analysis. Wiley
  • Argyriou et al (2008) Argyriou A, Evgeniou T, Pontil M (2008) Convex multi-task feature learning. Machine Learning 73:243–272
  • Bhatia et al (2015) Bhatia K, Jain P, Kar P (2015) Robust regression via hard thresholding. In: Advances in Neural Information Processing Systems 28
  • Boyd et al (2011) Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3:1–122
  • Caruana (1997) Caruana R (1997) Multitask learning. Machine Learning 28:41–75
  • Chen et al (2016) Chen C, He B, Ye Y, Yuan X (2016) The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Mathematical Programming 155(1–2):57–79
  • Chen et al (2012) Chen X, Lin Q, Kim S, Carbonell J, Xing E (2012) Smoothing proximal gradient method for general structured sparse regression. The Annals of Applied Statistics 6:719–752
  • Domino and Domino (2006) Domino G, Domino M (2006) Psychological testing: An introduction. Cambridge University Press
  • Erol et al (2007) Erol A, Bebis G, Nicolescu M, Boyle R, Twombly X (2007) Vision-based hand pose estimation: A review. Computer Vision and Image Understanding 108(1-2):52–73
  • Fazel et al (2013) Fazel M, Pong T, Sun D, Tseng P (2013) Hankel matrix rank minimization with applications to system identification and realization. SIAM Journal on Matrix Analysis and Applications 34(3):946–977
  • Funder (2001) Funder D (2001) Personality. Annual Review of Psychology 52:197–221
  • Golub and Van Loan (1996) Golub GH, Van Loan CF (1996) Matrix Computations, 3rd edn. The Johns Hopkins University Press
  • Gosling et al (2011) Gosling S, Augustine A, Vazire S, Holtzman N, Gaddis S (2011) Manifestations of personality in online social networks: Self-reported facebook-related behaviors and observable profile information. Cyberpsychology, Behavior, and Social Networking 14:483–488
  • Gustus et al (2012) Gustus A, Stillfried G, Visser J, Jorntell H, van der Smagt P (2012) Human hand modelling: kinematics, dynamics, applications. Biological Cybernetics 106(11-12):741–755
  • Hackenberg et al (2011) Hackenberg G, McCall R, Broll W (2011) Lightweight palm and finger tracking for real-time 3d gesture control. In: IEEE Virtual Reality Conference, pp 19–26
  • Jie (2011) Jie X (2011) Sequencing algorithm based on the social network real-time search engine. Science Technology and Engineering 28:1671–1815
  • Keskin et al (2012) Keskin C, Kirac F, Kara Y, Akarun L (2012) Hand pose estimation and hand shape classification using multi-layered randomized decision forests. In: ECCV
  • Krizhevsky et al (2012) Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems 25, pp 1097–1105
  • de La Gorce et al (2011) de La Gorce M, Fleet D, Paragios N (2011) Model-based 3d hand pose estimation from monocular video. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(9):1793–1805
  • Landers and Lounsbury (2006) Landers R, Lounsbury J (2006) An investigation of big-five and narrow personality traits in relation to internet usage. Computers in Human Behavior 22(2):283–293
  • Li (2012) Li X (2012) Compressed sensing and matrix completion with constant proportion of corruptions. Tech. rep., arXiv:1104.1041v2
  • Liu et al (2014) Liu H, Wang L, Zhao T (2014) Multivariate regression with calibration. In: Advances in Neural Information Processing Systems 27
  • Lounici et al (2011) Lounici K, Pontil M, van de Geer S, Tsybakov A (2011) Oracle inequalities and optimal inference under group sparsity. The Annals of Statistics 39:2164–2204
  • Ma et al (2011) Ma SQ, Jiao C, Zhang MQ (2011) Application of social network analysis in psychology. Advances in Psychological Science 19(5):755–764
  • Matthews et al (2006) Matthews G, Deary I, Whiteman M (2006) Personality Traits. Cambridge University Press
  • Nguyen and Tran (2013) Nguyen N, Tran T (2013) Robust Lasso with missing and grossly corrupted observations. IEEE Transactions on Information Theory 59:2036–2058
  • Oikonomidis et al (2014) Oikonomidis I, Lourakis M, Argyros A (2014) Evolutionary quasi-random search for hand articulations tracking. In: CVPR
  • Qin and Goldfarb (2012) Qin Z, Goldfarb D (2012) Structured sparsity via alternating direction methods. Journal of Machine Learning Research 13:1435–1468
  • Reynol (2011) Reynol J (2011) The relationship between frequency of Facebook use, participation in facebook activities, and student engagement. Computers and Education 58:162–171
  • Rohde and Tsybakov (2011) Rohde A, Tsybakov A (2011) Estimation of high-dimensional low-rank matrices. The Annals of Statistics 39:887–930
  • Sueda et al (2008) Sueda S, Kaufman A, Pai D (2008) Musculotendon simulation for hand animation. In: SIGGRAPH, pp 83:1–83:8
  • Sun et al (2015) Sun D, Toh KC, Yang L (2015) A convergent 3-block semi-proximal alternating direction method of multipliers for conic programming with 4-type constraints. SIAM Journal on Optimization 25(2):882–915
  • Tang et al (2014) Tang D, Tejani A, Chang H, Kim T (2014) Latent regression forest: Structured estimation of 3D articulated hand posture. In: CVPR
  • Tompson et al (2014) Tompson J, Stein M, Lecun Y, Perlin K (2014) Real-time continuous pose recovery of human hands using convolutional networks. ACM Transactions on Graphics 33(5):169:1–169:10
  • Wang and Popović (2009) Wang R, Popović J (2009) Real-time hand-tracking with a color glove. In: SIGGRAPH, pp 63:1–63:8
  • Wright and Ma (2010) Wright J, Ma Y (2010) Dense error correction via l1-minimization. IEEE Transactions on Information Theory 56:3540–3560
  • Xu et al (2016) Xu C, Nanjappa A, Zhang X, Cheng L (2016) Estimate hand poses efficiently from single depth images. International Journal of Computer Vision (IJCV) 116(1):21–45
  • Xu and Leng (2012) Xu H, Leng C (2012) Robust multi-task regression with grossly corrupted observations. In: Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)
  • Xu et al (2012) Xu H, Caramanis C, Sanghavi S (2012) Robust PCA via outlier pursuit. IEEE Transactions on Information Theory 58:3047–3064
  • Xu et al (2013) Xu H, Caramanis C, Mannor S (2013) Outlier-robust PCA: The high-dimensional case. IEEE Transactions on Information Theory 59:546–572
  • Ye et al (2013) Ye M, Zhang Q, Wang L, Zhu J, Yang R, Gall J (2013) Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications, Springer, chap A Survey on Human Motion Analysis from Depth Data
  • Zhang et al (2015) Zhang X, Cheng L, Zhu T (2015) Robust multivariate regression with grossly corrupted observations and its application to personality prediction. In: Holmes G, Liu TY (eds) Proceedings of the 7th Asian Conference on Machine Learning, vol 45, pp 112–126