Regression has been successfully applied to various computer vision tasks that require continuous outputs, such as head pose estimation [17, 13], object direction estimation [13, 30], human body pose estimation [2, 28, 18] and facial point localization [10, 5]. In regression, a mapping from an input space to a target space is learned from the training data. The learned mapping function is used to predict the target values for new data. In computer vision, the input space is typically the high-dimensional image feature space and the target space is a low-dimensional space which represents some high-level concepts present in the given image. Due to the complex input-target relationship, non-linear regression methods are usually employed for computer vision tasks.
Among several non-linear regression methods, regression forests [3] have been shown to be effective for various computer vision problems [28, 9, 10, 8]. The regression forest is an ensemble learning method which combines several regression trees [4] into a strong regressor. The regression trees define a recursive partitioning of the input space and each leaf node contains a model for the predictor. In the training stage, the trees are grown in order to reduce the empirical loss over the training data. In the regression forest, each regression tree is independently trained using a random subset of training data and prediction is done by finding the average/mode of outputs from all the trees.
As a node splitting algorithm, binary splitting is commonly employed for regression trees. However, it has limitations regarding how it partitions the input space. The biggest limitation of standard binary splitting is that the splitting rule at each node is selected by trial-and-error from a predefined set of splitting rules. To keep the search space manageable, simple thresholding operations on a single dimension of the input are typically chosen. Due to these limitations, the resulting trees are not necessarily efficient in reducing the empirical loss.
To overcome the above drawbacks of the standard binary splitting scheme, we propose a novel node splitting method and incorporate it into the regression forest framework. In our node splitting method, clusters of the training data which at least locally minimize the empirical loss are first found without being restricted to a predefined set of splitting rules. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. As a by-product, our procedure allows each node in the tree to have more than two child nodes, adding one more level of flexibility to the model. We also propose a way to adaptively determine the number of child nodes at each splitting. Unlike the standard binary splitting method, our splitting procedure enjoys more freedom in choosing the partitioning rules, resulting in more efficient regression tree structures. In addition to the method for the Euclidean target space, we present an extension which can naturally deal with a circular target space by the proper use of circular statistics.
We refer to regression forests (RF) employing our node splitting algorithm as KRF (K-clusters Regression Forest) and those employing the adaptive determination of the number of child nodes as AKRF. We test KRF and AKRF on the Pointing’04 dataset for head pose estimation (Euclidean target space) and the EPFL Multi-view Car Dataset for car direction estimation (circular target space) and observe that the proposed methods outperform the state of the art with a 38.5% error reduction on Pointing’04 and a 22.5% error reduction on the EPFL Multi-view Car Dataset. KRF and AKRF also significantly outperform other general regression methods, including regression forests with standard binary splitting.
2 Related work
A number of problems that are inherently regression problems, such as head pose estimation and body orientation estimation, have been addressed by classification methods by assigning a different pseudo-class label to each roughly discretized target value (e.g., [33, 20, 23, 1, 24]). Increasing the number of pseudo-classes allows more precise prediction, but the classification problem becomes more difficult. This becomes more problematic as the dimensionality of the target space increases. In general, discretization is conducted experimentally to balance the desired classification accuracy and precision.
The methods of [32, 29] apply k-means clustering to the target space to automatically discretize the target space and assign pseudo-classes. They then solve the classification problem by rule induction algorithms for classification. Though somewhat more sophisticated, these approaches still suffer from problems due to discretization. The difference between our method and the approaches discussed above is that in these approaches, pseudo-classes are fixed once determined, either by a human or by clustering algorithms, while in our approach, pseudo-classes are adaptively redetermined at each node splitting of regression tree training.
Similarly to our method, [11] converts node splitting tasks into local classification tasks by applying the EM algorithm to the joint input-output space. Since clustering is applied to the joint space, their method is not suitable for tasks with a high-dimensional input space. In fact, their experiments are limited to tasks with up to 20-dimensional input spaces, where their method performs poorly compared to baseline methods.
The work most similar to our method was proposed by Chou [7], who applied a k-means-like algorithm to the target space to find a locally optimal set of partitions for regression tree learning. However, this method is limited to the case where the input is a categorical variable. Although we limit ourselves to continuous inputs, our formulation is more general and can be applied to any type of input by choosing appropriate classification methods.
Regression has been widely applied to head pose estimation tasks. Haj et al. [17] used kernel partial least squares regression to learn a mapping from HOG features to head poses. Fenzi et al. [13] learned a set of local feature generative models using RBF networks and estimated poses using MAP inference.
A few works considered direction estimation tasks where the direction ranges from 0° to 360°. Herdtweck and Curio [19] modified regression forests so that the binary splitting minimizes a cost function specifically designed for direction estimation tasks. Torki and Elgammal [30] applied supervised manifold learning and used RBF networks to learn a mapping from a point on the learnt manifold to the target space.
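We denote a set of training data by $\{x_i, t_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^p$ is an input vector and $t_i \in \mathbb{R}^q$ is a target vector. The goal of regression is to learn a function $F^*(x)$ such that the expected value of a certain loss function $\Psi(t, F(x))$ is minimized:

$$F^* = \arg\min_{F} \; E\left[\Psi(t, F(x))\right]. \tag{1}$$

By approximating the above expected loss by an empirical loss and using the squared loss function, Eq.1 is reformulated as minimizing the sum of squared errors (SSE):

$$F^* = \arg\min_{F} \sum_{i=1}^{N} \| t_i - F(x_i) \|_2^2. \tag{2}$$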
However, other loss functions can also be used. In this paper we employ a specialized loss function for a circular target space (Sec.3.5).
In the following subsections, we first explain an abstracted regression tree algorithm, followed by the presentation of a standard binary splitting method normally employed for regression tree training. We then describe the details of our splitting method. An algorithm to adaptively determine the number of child nodes is presented, followed by a modification of our method for the circular target space, which is necessary for direction estimation tasks. Lastly, the regression forest framework for combining regression trees is presented.
3.1 Abstracted Regression Tree Model
Regression trees are grown by recursively partitioning the input space into a set of disjoint partitions, starting from a root node which corresponds to the entire input space. At each node splitting stage, a set of splitting rules and prediction models for each partition are determined so as to minimize a certain loss (error). A typical choice for a prediction model is a constant model which is determined as the mean target value of the training samples in the partition. However, higher order models such as linear regression can also be used. Throughout this work, we employ the constant model. After each partitioning, corresponding child nodes are created and each training sample is forwarded to one of the child nodes. Each child node is further split if the number of training samples belonging to that node is larger than a predefined number.
The essential component of regression tree training is an algorithm for splitting the nodes. Due to the recursive nature of the training stage, it suffices to discuss the splitting of the root node where all the training data are available. Subsequent splitting is done with the subset of the training data belonging to each node in exactly the same manner.
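Formally, we denote a set of $K$ disjoint partitions of the input space by $\mathcal{R} = \{r_1, \dots, r_K\}$, a set of constant estimates associated with each partition by $\mathcal{A} = \{a_1, \dots, a_K\}$ and the $K$ clusters of the training data by $\mathcal{S} = \{S_1, \dots, S_K\}$, where

$$S_k = \{\, i : x_i \in r_k \,\}. \tag{3}$$

In the squared loss case, a constant estimate, $a_k$, for the $k$-th partition is computed as the mean target vector of the training samples that fall into $r_k$:

$$a_k = \frac{1}{|S_k|} \sum_{i \in S_k} t_i. \tag{4}$$

The sum of squared errors (SSE) associated with each child node is computed as:

$$\mathrm{SSE}_k = \sum_{i \in S_k} \| t_i - a_k \|_2^2, \tag{5}$$

where $\mathrm{SSE}_k$ is the SSE for the $k$-th child node. Then the sum of squared errors on the entire training data is computed as:

$$\mathrm{SSE} = \sum_{k=1}^{K} \mathrm{SSE}_k = \sum_{k=1}^{K} \sum_{i \in S_k} \| t_i - a_k \|_2^2. \tag{6}$$

The aim of training is to find a set of splitting rules defining the input partitions which minimizes the SSE.
Assuming there is no further splitting, the regression tree is formally represented as

$$F(x; \mathcal{R}, \mathcal{A}) = \sum_{k=1}^{K} a_k \, \mathbb{1}(x \in r_k), \tag{7}$$

where $\mathbb{1}$ is an indicator function. The regression tree outputs one of the elements of $\mathcal{A}$ depending on which of the partitions $r_1, \dots, r_K$ the new data $x$ belongs to. As mentioned earlier, the child nodes are further split as long as the number of training samples belonging to the node is larger than a predefined number.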
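The recursive growth and the constant leaf model described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `grow`, `predict` and `median_split` are hypothetical, targets are scalars for brevity, and the median threshold is only a placeholder for a real splitting rule.

```python
def median_split(X, T):
    """Placeholder splitting rule: threshold the first input dimension at its median."""
    vals = sorted(x[0] for x in X)
    n = len(vals)
    thr = (vals[(n - 1) // 2] + vals[n // 2]) / 2.0
    return lambda x: int(x[0] > thr)


def grow(X, T, split_fn, min_samples=5):
    """Recursively grow a tree; every leaf stores the mean target (constant model)."""
    leaf = {"leaf": True, "value": sum(T) / len(T)}
    if len(T) <= min_samples:  # split only if more than min_samples remain
        return leaf
    rule = split_fn(X, T)
    groups = {}
    for x, t in zip(X, T):
        xs, ts = groups.setdefault(rule(x), ([], []))
        xs.append(x)
        ts.append(t)
    if len(groups) < 2:  # the rule failed to separate the samples
        return leaf
    children = {k: grow(xs, ts, split_fn, min_samples) for k, (xs, ts) in groups.items()}
    return {"leaf": False, "rule": rule, "children": children}


def predict(tree, x):
    """Follow the splitting rules down to a leaf and return its constant estimate."""
    while not tree["leaf"]:
        tree = tree["children"][tree["rule"](x)]
    return tree["value"]


X = [[float(i)] for i in range(8)]
T = [0.0] * 4 + [10.0] * 4
tree = grow(X, T, median_split, min_samples=4)
```

Passing the splitting rule as a callable keeps the recursion independent of how the rule is found, which is the part the following subsections vary.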
3.2 Standard Binary Node Splitting
In standard binary regression trees [4], $K$ is fixed at two. Each splitting rule is defined as a pair of the index of the input dimension and a threshold. Thus, each binary splitting rule corresponds to a hyperplane that is perpendicular to one of the axes. Among a predefined set of such splitting rules, the one which minimizes the overall SSE (Eq.6) is selected by trial-and-error.
The major drawback of the above splitting procedure is that the splitting rules are determined by exhaustively searching for the best splitting rule among the predefined set of candidate rules. Essentially, this is the reason why only simple binary splitting rules defined as thresholding on a single dimension are considered in the training stage. Since the candidate rules are severely limited, the selected rules are not necessarily the best among all possible ways to partition the input space.
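The trial-and-error search can be sketched as below. This is an illustration under assumptions, not code from the paper: targets are scalars, the candidate thresholds are midpoints between consecutive distinct feature values, and `best_binary_split` is a hypothetical name.

```python
def sse(targets):
    """Sum of squared errors of scalar targets around their mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets)


def best_binary_split(X, t):
    """Try every (dimension, threshold) pair and keep the lowest total SSE."""
    best = None  # (total_sse, dimension, threshold)
    for dim in range(len(X[0])):
        values = sorted({x[dim] for x in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [ti for xi, ti in zip(X, t) if xi[dim] <= thr]
            right = [ti for xi, ti in zip(X, t) if xi[dim] > thr]
            total = sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, dim, thr)
    return best


X = [[0.0, 5.0], [1.0, 3.0], [2.0, 4.0], [3.0, 1.0]]
t = [10.0, 10.0, 20.0, 20.0]
total, dim, thr = best_binary_split(X, t)
```

The cost grows with the number of candidate rules, which is why the candidate set is kept to axis-aligned thresholds.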
3.3 Proposed Node Splitting
In order to overcome the drawbacks of the standard binary splitting procedure, we propose a new splitting procedure which does not rely on trial-and-error. A graphical illustration of the algorithm is given in Fig.1. At each node splitting stage, we first find ideal clusters $\mathcal{T} = \{T_1, \dots, T_K\}$ of the training data associated with the node, which at least locally minimize the following objective function:

$$\min_{\mathcal{T}, \mathcal{A}} \sum_{k=1}^{K} \sum_{i \in T_k} \| t_i - a_k \|_2^2, \tag{8}$$

where $\mathcal{T} = \{T_1, \dots, T_K\}$ and $\mathcal{A} = \{a_1, \dots, a_K\}$. This minimization can be done by applying the k-means clustering algorithm in the target space with $K$ as the number of clusters. Note the similarity between the objective functions in Eq.8 and Eq.6. The difference is that in Eq.6, clusters in $\mathcal{S}$ are indirectly determined by the splitting rules defined in the input space while clusters in $\mathcal{T}$ are directly determined by the k-means algorithm without taking into account the input space.
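The clustering step can be sketched with plain Lloyd iterations on the target vectors only. This is a minimal sketch under assumptions: `kmeans_targets` is a hypothetical name, and the iteration cap and seed are illustrative. The initial centroids are randomly chosen data points.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_targets(T, K, max_iters=100, seed=0):
    """Lloyd's k-means on target vectors T; the inputs x_i are never used."""
    rng = random.Random(seed)
    centroids = [list(t) for t in rng.sample(T, K)]  # data points as initial centroids
    labels = None
    for _ in range(max_iters):
        # Assignment step: each target goes to its nearest centroid.
        new_labels = [
            min(range(K), key=lambda k: squared_distance(t, centroids[k])) for t in T
        ]
        if new_labels == labels:  # assignments stable: a local minimum is reached
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned targets.
        for k in range(K):
            members = [t for t, l in zip(T, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The returned labels play the role of the cluster IDs that the subsequent classification step tries to reproduce from the inputs.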
After finding $\mathcal{T}$, we find partitions $\mathcal{R}$ of the input space which preserve $\mathcal{T}$ as much as possible. This task is equivalent to a $K$-class classification problem which aims at determining the cluster ID of each training sample based on $x_i$. Although any classification method can be used, in this work, we employ L2-regularized L2-loss linear SVM with a one-versus-rest approach. Formally, we solve the following optimization for each cluster using LIBLINEAR [12]:

$$\min_{w_k} \; \frac{1}{2} \| w_k \|_2^2 + C \sum_{i=1}^{N} \left( \max\left(0,\; 1 - l_i^k \, w_k^{\top} x_i \right) \right)^2, \tag{9}$$

where $w_k$ is the weight vector for the $k$-th cluster, $l_i^k = 1$ if $i \in T_k$ and $-1$ otherwise, and $C > 0$ is a penalty parameter, which is kept fixed throughout the paper. Each training sample is forwarded to one of the child nodes by

$$k^* = \arg\max_{k \in \{1, \dots, K\}} w_k^{\top} x_i. \tag{10}$$

Unlike standard binary splitting, our splitting rules are not limited to hyperplanes that are perpendicular to one of the axes and the clusters are found without being restricted to a set of predefined splitting rules in the input space. Furthermore, our splitting strategy allows each node to have more than two child nodes by employing $K > 2$, adding one more level of flexibility to the model. Note that a larger $K$ generally results in a smaller value for Eq.8. However, since the subsequent classification problem becomes more difficult, a larger $K$ does not necessarily lead to better performance.
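The classification step and the forwarding rule can be sketched as follows. This is an illustration only: plain gradient descent stands in for the LIBLINEAR solver, and the values `C=1.0`, `lr=0.01`, `steps=500`, the appended constant bias feature and the function names are all assumptions made for the sketch.

```python
def train_ovr_l2svm(X, cluster_ids, K, C=1.0, lr=0.01, steps=500):
    """One weight vector per cluster, each minimizing
    0.5 * ||w||^2 + C * sum(max(0, 1 - y_i * w.x_i)^2) by gradient descent."""
    Xb = [list(x) + [1.0] for x in X]  # constant feature acts as a bias (assumption)
    dim = len(Xb[0])
    weights = []
    for k in range(K):
        # One-versus-rest labels: +1 for members of cluster k, -1 otherwise.
        y = [1.0 if c == k else -1.0 for c in cluster_ids]
        w = [0.0] * dim
        for _ in range(steps):
            grad = list(w)  # gradient of the regularizer
            for xi, yi in zip(Xb, y):
                margin = 1.0 - yi * sum(a * b for a, b in zip(w, xi))
                if margin > 0.0:  # only margin violators contribute
                    for j in range(dim):
                        grad[j] -= 2.0 * C * margin * yi * xi[j]
            w = [wj - lr * gj for wj, gj in zip(w, grad)]
        weights.append(w)
    return weights


def forward_to_child(x, weights):
    """Send x to the child whose classifier gives the largest score."""
    xb = list(x) + [1.0]
    scores = [sum(a * b for a, b in zip(w, xb)) for w in weights]
    return max(range(len(weights)), key=scores.__getitem__)


X = [[-2.0], [-1.0], [1.0], [2.0]]
cluster_ids = [0, 0, 1, 1]  # e.g. produced by k-means in the target space
weights = train_ovr_l2svm(X, cluster_ids, K=2)
```

Because the split is an arg-max over $K$ linear scores, the number of children follows directly from the number of clusters.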
3.4 Adaptive determination of $K$
Since $K$ is a parameter, we need to determine its value by a time-consuming cross-validation step. In order to avoid the cross-validation step while achieving comparable performance, we propose a method to adaptively determine $K$ at each node based on the sample distribution.
In this work we employ the Bayesian Information Criterion (BIC) [21, 27] as a measure to choose $K$. BIC was also used in [25] but with a different formulation. The BIC is designed to balance the model complexity and the likelihood. As a result, when a target distribution is complex, a larger $K$ is selected and when the target distribution is simple, a smaller $K$ is selected. This is in contrast to the non-adaptive method where a fixed $K$ is used regardless of the complexity of the distributions.
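As k-means clustering itself does not assume any underlying probability distribution, we assume that the data are generated from a mixture of isotropic weighted Gaussians with a shared variance. With $T_k$ denoting the $k$-th cluster found by k-means and $a_k$ its centroid, the unbiased estimate for the shared variance is computed as

$$\hat{\sigma}^2 = \frac{1}{N - K} \sum_{k=1}^{K} \sum_{i \in T_k} \| t_i - a_k \|_2^2. \tag{11}$$

We compute a point probability density for a data point $t_i$ belonging to the $k$-th cluster as follows:

$$p(t_i) = \frac{|T_k|}{N} \, \frac{1}{\left(2\pi\hat{\sigma}^2\right)^{q/2}} \exp\left( -\frac{\| t_i - a_k \|_2^2}{2\hat{\sigma}^2} \right). \tag{12}$$

Then after simple calculations, the log-likelihood of the data is obtained as

$$\ln \mathcal{L} = \sum_{k=1}^{K} |T_k| \ln |T_k| - N \ln N - \frac{qN}{2} \ln\left(2\pi\hat{\sigma}^2\right) - \frac{N - K}{2}. \tag{13}$$

Finally, the BIC for a particular value of $K$ is computed as

$$\mathrm{BIC}_K = -2 \ln \mathcal{L} + (K - 1 + qK + 1) \ln N. \tag{14}$$

At each node splitting stage, we run the k-means algorithm for each value of $K$ in a manually specified range and select the $K$ with the smallest BIC. Throughout this work, we select $K$ from a fixed candidate set.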
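The BIC score of a given clustering can be sketched as below. The exact formulas are assumptions where the text does not spell them out: the shared variance is taken as the total SSE divided by $N - K$, each point density is weighted by its cluster's share of the data, and the parameter count is $(K - 1) + qK + 1$ (mixing weights, centroid coordinates, one variance). The function name is hypothetical, and every cluster must be non-empty with a non-zero total SSE.

```python
import math


def bic_for_clustering(T, labels, K):
    """BIC of a hard clustering of target vectors under a shared-variance
    isotropic Gaussian mixture (formulas are assumptions, see lead-in)."""
    N, q = len(T), len(T[0])
    sizes = [labels.count(k) for k in range(K)]
    centroids = []
    for k in range(K):
        members = [t for t, l in zip(T, labels) if l == k]
        centroids.append([sum(col) / len(members) for col in zip(*members)])
    # Total within-cluster sum of squared errors.
    sse = sum(
        sum((a - b) ** 2 for a, b in zip(t, centroids[l])) for t, l in zip(T, labels)
    )
    var = sse / (N - K)  # shared variance estimate
    loglik = (
        sum(n * math.log(n) for n in sizes)
        - N * math.log(N)
        - 0.5 * N * q * math.log(2.0 * math.pi * var)
        - 0.5 * (N - K)
    )
    n_params = (K - 1) + q * K + 1
    return -2.0 * loglik + n_params * math.log(N)


T = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]
bic_one = bic_for_clustering(T, [0, 0, 0, 0, 0, 0], 1)
bic_two = bic_for_clustering(T, [0, 0, 0, 1, 1, 1], 2)
```

On this toy data the two-cluster model gets the smaller score, so it would be the one selected.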
3.5 Modification for a Circular Target Space
1D direction estimation of objects such as cars and pedestrians is unique in that the target variable is periodic, namely, 0° and 360° represent the same direction angle. Thus, the target space can be naturally represented as a unit circle, which is a 1D Riemannian manifold in $\mathbb{R}^2$. To deal with such a target space, special treatment is needed since the Euclidean distance is inappropriate. For instance, the distance between 10° and 350° should be shorter than that between 10° and 50° on this manifold.
In our method, such direction estimation problems are naturally addressed by modifying the k-means algorithm and the computation of BIC. The remaining steps are kept unchanged. The k-means clustering method consists of computing cluster centroids and hard assignment of the training samples to the closest centroid. Finding the closest centroid on a circle is trivially done by using the length of the shorter arc as a distance. Due to the periodic nature of the variable, the arithmetic mean is not appropriate for computing the centroids. A typical way to compute the mean of angles is to first convert each angle to a 2D point on a unit circle. The arithmetic mean is then computed on a 2D plane and converted back to the angular value. More specifically, given a set of direction angles $\{\theta_i\}_{i=1}^{N}$, the mean direction $\bar{\theta}$ is computed by

$$\bar{\theta} = \operatorname{atan2}\left( \frac{1}{N} \sum_{i=1}^{N} \sin\theta_i, \; \frac{1}{N} \sum_{i=1}^{N} \cos\theta_i \right). \tag{15}$$

It is known that $\bar{\theta}$ minimizes the sum of a certain distance defined on a circle,

$$\bar{\theta} = \arg\min_{\theta} \sum_{i=1}^{N} d(\theta_i, \theta), \tag{16}$$

where $d(\theta_i, \theta) = 1 - \cos(\theta_i - \theta)$. Thus, the k-means clustering using the above definition of means finds clusters of the training data that at least locally minimize the following objective function,

$$\min_{\mathcal{T}, \{\bar{\theta}_k\}} \sum_{k=1}^{K} \sum_{i \in T_k} d(\theta_i, \bar{\theta}_k). \tag{17}$$

Using the above k-means algorithm in our node splitting essentially means that we employ the distance $d$ as the loss function in Eq.1. Although the squared shorter arc length might be more appropriate for the direction estimation task, there is no constant time algorithm to find a mean which minimizes it. Also, as will be explained shortly, the above definition of the mean coincides with the maximum likelihood estimate of the mean of a certain probability distribution defined on a circle.
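The two ingredients of the circular k-means variant can be sketched as below. This assumes angles in degrees and the distance $1 - \cos(\cdot)$ for the minimization property, and the function names are hypothetical. It shows the shorter-arc distance used for assignment and the mean direction used for centroids.

```python
import math


def arc_distance(a, b):
    """Length in degrees of the shorter arc between two angles."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def circular_mean(angles):
    """Mean direction: average the unit vectors, then convert back to an angle."""
    s = sum(math.sin(math.radians(a)) for a in angles)
    c = sum(math.cos(math.radians(a)) for a in angles)
    return math.degrees(math.atan2(s, c)) % 360.0


def cosine_distance(a, b):
    """Distance 1 - cos(a - b), assumed here as the quantity the mean minimizes."""
    return 1.0 - math.cos(math.radians(a - b))


angles = [350.0, 10.0, 30.0]
mean_dir = circular_mean(angles)
cost_at_mean = sum(cosine_distance(a, mean_dir) for a in angles)
```

For the pair 350° and 10° the arithmetic mean is 180°, which points in the opposite direction, while `circular_mean` returns a direction at 0° (up to rounding).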
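As in the Euclidean target case, we can also adaptively determine the value for $K$ at each node using BIC. As a density function, the Gaussian distribution is not appropriate. A suitable choice is the von Mises distribution, which is a periodic continuous probability distribution defined on a circle,

$$p(\theta; \mu, \kappa) = \frac{1}{2\pi I_0(\kappa)} \exp\left( \kappa \cos(\theta - \mu) \right), \tag{18}$$

where $\mu$ and $1/\kappa$ are analogous to the mean and variance of the Gaussian distribution and $I_\lambda$ is the modified Bessel function of order $\lambda$. It is known that the maximum likelihood estimate of $\mu$ is computed by Eq.15 and that of $\kappa$ satisfies

$$\frac{I_1(\kappa)}{I_0(\kappa)} = \sqrt{ \left( \frac{1}{N} \sum_{i=1}^{N} \sin\theta_i \right)^2 + \left( \frac{1}{N} \sum_{i=1}^{N} \cos\theta_i \right)^2 } = \frac{1}{N} \sum_{i=1}^{N} \cos(\theta_i - \bar{\theta}). \tag{19}$$

Note that the middle expression is the Euclidean norm of the mean vector obtained by converting each angle to a 2D point on a unit circle.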
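Similar to the derivation for the Euclidean case, we assume that the data are generated from a mixture of weighted von Mises distributions with a shared $\kappa$. The mean $\bar{\theta}_k$ of the $k$-th von Mises distribution is the same as the mean of the $k$-th cluster obtained by the k-means clustering. The shared value for $\kappa$ is obtained by solving the following equation

$$\frac{I_1(\kappa)}{I_0(\kappa)} = \frac{1}{N} \sum_{k=1}^{K} \sum_{i \in T_k} \cos(\theta_i - \bar{\theta}_k). \tag{20}$$

Since there is no closed-form solution for the above equation, we obtain the estimate $\hat{\kappa}$ with a closed-form approximation proposed in the circular statistics literature.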
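The text relies on a closed-form approximation for the concentration parameter. As an alternative that needs no such formula, the sketch below solves $I_1(\kappa)/I_0(\kappa) = R$ numerically by bisection, using the power series of the modified Bessel functions. The function names, the search interval `[0, 100]` and the tolerance are assumptions of this sketch, not part of the original method.

```python
def bessel_i(order, x, tol=1e-16):
    """Modified Bessel function I_order(x), order 0 or 1, via its power series."""
    term = 1.0 if order == 0 else x / 2.0
    total, m = term, 0
    while term > tol * total:
        m += 1
        # Ratio of consecutive series terms: (x^2 / 4) / (m * (m + order)).
        term *= (x * x / 4.0) / (m * (m + order))
        total += term
    return total


def kappa_from_resultant(R, hi=100.0, iters=100):
    """Solve I_1(kappa) / I_0(kappa) = R for kappa in [0, hi] by bisection.
    The ratio increases monotonically from 0 towards 1 as kappa grows."""
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if bessel_i(1, mid) / bessel_i(0, mid) < R:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


ratio_at_two = bessel_i(1, 2.0) / bessel_i(0, 2.0)
kappa = kappa_from_resultant(ratio_at_two)
```

Here `R` is the right-hand side of the estimating equation, i.e. the average of $\cos(\theta_i - \bar{\theta}_k)$ over all samples. Values of `R` very close to 1 would require a larger `hi`.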
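Then, a point probability density for a data point $\theta_i$ belonging to the $k$-th cluster is computed as:

$$p(\theta_i) = \frac{|T_k|}{N} \, \frac{\exp\left( \hat{\kappa} \cos(\theta_i - \bar{\theta}_k) \right)}{2\pi I_0(\hat{\kappa})}. \tag{22}$$

After simple calculations, the log-likelihood of the data is obtained as

$$\ln \mathcal{L} = -N \ln\left( 2\pi I_0(\hat{\kappa}) \right) + \hat{\kappa} \sum_{k=1}^{K} \sum_{i \in T_k} \cos(\theta_i - \bar{\theta}_k) + \sum_{k=1}^{K} |T_k| \ln |T_k| - N \ln N. \tag{23}$$

Finally, the BIC for a particular value of $K$ is computed as

$$\mathrm{BIC}_K = -2 \ln \mathcal{L} + 2K \ln N, \tag{24}$$

where the last term is obtained by putting $q = 1$ into the last term of Eq.14.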
3.6 Regression Forest
We use the regression forest [3] as the final regression model. The regression forest is an ensemble learning method for regression which first constructs multiple regression trees from random subsets of the training data. Testing is done by computing the mean of the outputs from each regression tree. We refer to the fraction of training samples randomly drawn for each tree as the sample ratio. For the Euclidean target case, the arithmetic mean is used to obtain the final estimate and for the circular target case, the mean defined in Eq.15 is used.
For the regression forest with standard binary regression trees, additional randomness is typically injected. In finding the best splitting function at each node, only a randomly selected subset of the feature dimensions is considered. We refer to the fraction of randomly chosen feature dimensions as the feature ratio. For the regression forest with our regression trees, we always consider all feature dimensions. However, another form of randomness is naturally injected by randomly selecting the data points used as the initial cluster centroids in the k-means algorithm.
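The ensemble step can be sketched as follows for the Euclidean case with scalar targets. This is only an outline: `train_tree` is any callable returning a predictor, `mean_stub` is a stand-in for a real regression tree, and the value of `sample_ratio` and sampling without replacement are assumptions of the sketch.

```python
import random


def train_forest(X, T, train_tree, n_trees=20, sample_ratio=0.5, seed=0):
    """Train each tree independently on a random subset of the training data."""
    rng = random.Random(seed)
    n_sub = max(1, int(sample_ratio * len(X)))
    forest = []
    for _ in range(n_trees):
        # Subset drawn without replacement (assumption of this sketch).
        idx = rng.sample(range(len(X)), n_sub)
        forest.append(train_tree([X[i] for i in idx], [T[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Euclidean target case: arithmetic mean of the per-tree outputs."""
    outputs = [tree(x) for tree in forest]
    return sum(outputs) / len(outputs)


def mean_stub(X, T):
    """Stand-in for a real regression tree: always predicts the subset mean."""
    mean = sum(T) / len(T)
    return lambda x: mean


X = [[float(i)] for i in range(10)]
T = [float(i) for i in range(10)]
forest = train_forest(X, T, mean_stub)
prediction = predict_forest(forest, [3.0])
```

For a circular target, the final averaging would use the mean direction instead of the arithmetic mean.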
4.1 Head Pose Estimation
We test the effectiveness of KRF and AKRF for the Euclidean target space on the head pose estimation task. We adopt the Pointing’04 dataset [16]. The dataset contains head images of 15 subjects and for each subject there are two series of 93 images with different poses represented by pitch and yaw.
The dataset comes with manually specified bounding boxes indicating the head regions. Based on the bounding boxes, we crop the head regions, resize them to $64 \times 64$ pixel image patches and compute multiscale HOG features from each image patch with cell sizes 8, 16 and 32 and $2 \times 2$ cell blocks. The orientation histogram for each cell is computed with signed gradients for 9 orientation bins. The resulting HOG feature is 2124-dimensional.
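The reported feature length can be checked with a short calculation. The configuration below is an assumption chosen because it reproduces the stated 2124 dimensions: a 64-pixel square patch, 2 by 2 cell blocks that slide with a stride of one cell, and 9 bins per cell. The helper name `hog_dim` is hypothetical.

```python
def hog_dim(patch, cell, block=2, bins=9):
    """Length of a HOG descriptor for a square patch with overlapping blocks."""
    cells = patch // cell            # cells per side
    blocks = cells - block + 1       # block positions per side (stride of one cell)
    return blocks * blocks * block * block * bins


per_scale = [hog_dim(64, c) for c in (8, 16, 32)]
total = sum(per_scale)
```

The three scales contribute 1764, 324 and 36 dimensions respectively.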
First, we compare KRF and AKRF with other general regression methods using the same image features. We choose the standard binary regression forest (BRF) [3], kernel PLS [26] and SVR with RBF kernels [31], all of which have been widely used for various computer vision tasks. The first series of images from all subjects is used as the training set and the second series is used for testing. The performance is measured by the Mean Absolute Error (MAE) in degrees. For KRF, AKRF and BRF, we terminate node splitting once the number of training data associated with each leaf node is less than 5. The number of trees combined is set to 20. $K$ for KRF, the sample ratio for KRF, AKRF and BRF, and the feature ratio for BRF are all determined by 5-fold cross-validation on the training set. For kernel PLS, we use the implementation provided by the author of [26] and for SVR, we use the LIBSVM package [6]. All the parameters for kernel PLS and SVR are also determined by 5-fold cross-validation. As can be seen in Table 1, both KRF and AKRF work significantly better than the other regression methods. Our methods are also computationally efficient (Table 1). KRF and AKRF take only 7.7 msec and 8.7 msec, respectively, to process one image including feature computation with a single thread.
| Methods | yaw | pitch | average | testing time (msec) |
|---|---|---|---|---|
| Kernel PLS [26] | 7.35 | 7.02 | 7.18 | 86.2 |
Table 2 compares KRF and AKRF with prior art. Since the previous works report the 5-fold cross-validation estimate on the whole dataset, we also follow the same protocol. KRF and AKRF advance state-of-the-art with 38.5% and 29.7% reduction in the average MAE, respectively.
| Methods | yaw | pitch | average |
|---|---|---|---|
| Haj et al. [17] Kernel PLS | 6.56 | 6.61 | 6.59 |
| Haj et al. [17] PLS | 11.29 | 10.52 | 10.91 |
Fig.2 shows the effect of $K$ of KRF on the average MAE along with the average MAE of AKRF. In this experiment, the cross-validation process successfully selects the $K$ with the best performance. AKRF works better than KRF with the second best $K$. The overall training time is much shorter with AKRF since the cross-validation step for determining the value of $K$ is not necessary. To train a single regression tree, AKRF takes only 6.8 sec while KRF takes 331.4 sec for the cross-validation and 4.4 sec for training a final model. As a reference, BRF takes 1.7 sec to train a single tree. Finally, some estimation results by AKRF on the second sequence of person 13 are shown in Fig.3.
4.2 Car Direction Estimation
We test KRF and AKRF for the circular target space (denoted as KRF-circle and AKRF-circle, respectively) on the EPFL Multi-view Car Dataset [24]. The dataset contains 20 sequences of images of cars with various directions. Each sequence contains images of only one car instance. In total, there are 2299 images in the dataset. Each image comes with a bounding box specifying the location of the car and the ground truth for the direction of the car. The direction ranges from 0° to 360°. As input features, multiscale HOG features with the same parameters as in the previous experiment are extracted from image patches obtained by resizing the given bounding boxes.
The algorithm is evaluated by using the first 10 sequences for training and the remaining 10 sequences for testing. In Table 3, we compare KRF-circle and AKRF-circle with previous work. We also include the performance of BRF, kernel PLS and SVR with RBF kernels using the same HOG features. For BRF, we extend it to directly minimize the same loss function as KRF-circle and AKRF-circle (denoted by BRF-circle). For kernel PLS and SVR, we first map direction angles to 2D points on a unit circle and train regressors using the mapped points as target values. In the testing phase, a 2D point coordinate is first estimated and then mapped back to the angle with the two-argument arctangent. All the parameters are determined by leave-one-sequence-out cross-validation on the training set. The performance is evaluated by the Mean Absolute Error (MAE) measured in degrees. In addition, the MAE of the 90-th percentile of the absolute errors and that of the 95-th percentile are reported, following the convention from prior works.
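The angle-to-point mapping used for the baseline regressors and a circular MAE can be sketched as below. This assumes angles in degrees and that the absolute error between two directions is the shorter-arc difference. The function names are hypothetical.

```python
import math


def angle_to_point(deg):
    """Map a direction angle to a 2D point on the unit circle."""
    rad = math.radians(deg)
    return (math.cos(rad), math.sin(rad))


def point_to_angle(x, y):
    """Map an estimated 2D point back to an angle in [0, 360)."""
    return math.degrees(math.atan2(y, x)) % 360.0


def circular_mae(predicted, truth):
    """Mean absolute error where each error is the shorter-arc difference."""
    errors = []
    for p, t in zip(predicted, truth):
        d = abs(p - t) % 360.0
        errors.append(min(d, 360.0 - d))
    return sum(errors) / len(errors)


round_trip = point_to_angle(*angle_to_point(350.0))
mae = circular_mae([350.0, 10.0], [10.0, 350.0])
```

The estimated point does not need to lie exactly on the unit circle, since `atan2` depends only on its direction.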
As can be seen from Table 3, both KRF-circle and AKRF-circle work much better than existing regression methods. In particular, the improvement over BRF-circle is notable. Our methods also advance the state of the art with 22.5% and 20.7% reductions in MAE from the previous best method, respectively. In Fig.4, we show the MAE of AKRF-circle computed on each sequence in the testing set. The performance varies significantly among different sequences (car models). Fig.5 shows some representative results from the worst three sequences in the testing set (seq 16, 20 and 15). We notice that most of the failure cases are due to flipping errors (errors of approximately 180°), which mostly occur at particular intervals of directions. Fig.6 shows the effect of $K$ of KRF-circle. The performance of AKRF-circle is comparable to that of KRF-circle with $K$ selected by the cross-validation.
| Method | MAE (°) 90-th percentile | MAE (°) 95-th percentile | MAE (°) |
|---|---|---|---|
| Fenzi et al. [13] | 14.51 | 22.83 | 31.27 |
| Torki et al. [30] | 19.4 | 26.7 | 33.98 |
| Ozuysal et al. [24] | - | - | 46.48 |
In this paper, we proposed a novel node splitting algorithm for regression tree training. Unlike previous works, our method does not rely on a trial-and-error process to find the best splitting rules from a predefined set of rules, providing more flexibility to the model. Combined with the regression forest framework, our methods work significantly better than state-of-the-art methods on head pose estimation and car direction estimation tasks.
Acknowledgements. This research was supported by a MURI grant from the US Office of Naval Research under N00014-10-1-0934.
References
[1] Baltieri, D., Vezzani, R., Cucchiara, R.: People Orientation Recognition by Mixtures of Wrapped Distributions on Random Trees. ECCV (2012)
[2] Bissacco, A., Yang, M.H., Soatto, S.: Fast Human Pose Estimation using Appearance and Motion via Multi-dimensional Boosting Regression. CVPR (2007)
[3] Breiman, L.: Random Forests. Machine Learning (2001)
[4] Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A.: Classification and Regression Trees. Chapman and Hall/CRC (1984)
[5] Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by Explicit Shape Regression. CVPR (2012)
[6] Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines. ACM Transactions on Intelligent Systems and Technology (2011)
[7] Chou, P.A.: Optimal Partitioning for Classification and Regression Trees. PAMI (1991)
[8] Criminisi, A., Shotton, J.: Decision Forests for Computer Vision and Medical Image Analysis. Springer (2013)
[9] Criminisi, A., Shotton, J., Robertson, D., Konukoglu, E.: Regression Forests for Efficient Anatomy Detection and Localization in CT Studies. Medical Computer Vision (2010)
[10] Dantone, M., Gall, J., Fanelli, G., Van Gool, L.: Real-time Facial Feature Detection using Conditional Regression Forests. CVPR (2012)
[11] Dobra, A., Gehrke, J.: Secret: A scalable linear regression tree algorithm. SIGKDD (2002)
[12] Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A Library for Large Linear Classification. JMLR (2008)
[13] Fenzi, M., Leal-Taixé, L., Rosenhahn, B., Ostermann, J.: Class Generative Models based on Feature Regression for Pose Estimation of Object Categories. CVPR (2013)
[14] Fisher, N.I.: Statistical Analysis of Circular Data. Cambridge University Press (1996)
[15] Gaile, G.L., Burt, J.E.: Directional Statistics (Concepts and techniques in modern geography). Geo Abstracts Ltd. (1980)
[16] Gourier, N., Hall, D., Crowley, J.L.: Estimating Face Orientation from Robust Detection of Salient Facial Structures. ICPRW (2004)
[17] Haj, M.A., Gonzàlez, J., Davis, L.S.: On partial least squares in head pose estimation: How to simultaneously deal with misalignment. CVPR (2012)
[18] Hara, K., Chellappa, R.: Computationally Efficient Regression on a Dependency Graph for Human Pose Estimation. CVPR (2013)
[19] Herdtweck, C., Curio, C.: Monocular Car Viewpoint Estimation with Circular Regression Forests. Intelligent Vehicles Symposium (2013)
[20] Huang, C., Ding, X., Fang, C.: Head Pose Estimation Based on Random Forests for Multiclass Classification. ICPR (2010)
[21] Kashyap, R.L.: A Bayesian Comparison of Different Classes of Dynamic Models Using Empirical Data. IEEE Trans. on Automatic Control (1977)
[22] Mardia, K.V., Jupp, P.: Directional Statistics, 2nd edition. John Wiley and Sons Ltd. (2000)
[23] Orozco, J., Gong, S., Xiang, T.: Head Pose Classification in Crowded Scenes. BMVC (2009)
[24] Ozuysal, M., Lepetit, V., Fua, P.: Pose Estimation for Category Specific Multiview Object Localization. CVPR (2009)
[25] Pelleg, D., Moore, A.: X-means: Extending K-means with Efficient Estimation of the Number of Clusters. ICML (2000)
[26] Rosipal, R., Trejo, L.J.: Kernel Partial Least Squares Regression in Reproducing Kernel Hilbert Space. JMLR (2001)
[27] Schwarz, G.: Estimating the Dimension of a Model. The Annals of Statistics (1978)
[28] Sun, M., Kohli, P., Shotton, J.: Conditional Regression Forests for Human Pose Estimation. CVPR (2012)
[29] Torgo, L., Gama, J.: Regression by classification. Brazilian Symposium on Artificial Intelligence (1996)
[30] Torki, M., Elgammal, A.: Regression from local features for viewpoint and pose estimation. ICCV (2011)
[31] Vapnik, V.: Statistical Learning Theory. Wiley (1998)
[32] Weiss, S.M., Indurkhya, N.: Rule-based Machine Learning Methods for Functional Prediction. Journal of Artificial Intelligence Research (1995)
[33] Yan, Y., Ricci, E., Subramanian, R., Lanz, O., Sebe, N.: No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. ICCV (2013)