I. Introduction
The goal of object segmentation is to produce a pixel-level segmentation of different object categories. It is challenging as the objects may appear against various backgrounds and under different visual conditions. CRFs [19] model the conditional distribution of labels given observations, and represent the state-of-the-art in image/object segmentation [34, 32, 10, 24, 27, 15]. The max-margin principle has also been applied to predict structured outputs, including SSVMs [36] and max-margin Markov networks [35]. These three methods share similarities when viewed as optimization problems with different loss functions. Szummer et al. [34] proposed to learn linear coefficients of CRF potentials using SSVMs and graph cuts. To date, most of these methods assume a predefined parametric model for the potential functions, and typically only the linear coefficients of the parametric model are learned. This greatly limits the flexibility of CRFs, and thus calls for effective methods to incorporate nonlinear, nonparametric models for learning the potential functions in CRFs.
As in standard support vector machines (SVMs), nonlinearity can be achieved by introducing nonlinear kernels into SSVMs. However, the training time of nonlinear SVMs scales roughly quadratically to cubically with the number of training examples. This is especially problematic for SSVMs, where the number of constraints grows exponentially in the description length of the label. Moreover, nonlinear kernel functions can significantly slow down prediction at test time. For these reasons, most current SSVM applications use linear kernels (or linear parametric potential functions in CRFs), despite the fact that nonlinear functions usually deliver better prediction accuracy. In this work, we address this issue by combining CRFs with nonparametric decision trees. Both CRFs and decision trees have enjoyed tremendous success in computer vision. Decision trees are capable of modelling complex relations and generalize well on test data. Unlike kernel methods, decision trees are fast to evaluate and can be used to select informative features.
In this work, we propose to use ensembles of decision trees to map the image content to both the unary terms and the pairwise interaction values in CRFs. The proposed method is termed CRFTree. Specifically, we formulate both the unary and pairwise potentials as nonparametric forests (ensembles of decision trees), and learn the ensemble parameters and the trees in a single optimization framework. In this way, nonlinearity is easily introduced into CRF learning without confronting the kernel dilemma. Furthermore, we learn class-wise decision trees for each object. Due to the rich structure and flexibility of decision trees, our approach is powerful in modelling complex data likelihoods and label relationships. The resulting optimization problem is very challenging in the sense that it can involve exponentially or even infinitely many variables and constraints. We summarize our main contributions as follows.

We formulate the unary and pairwise potentials as ensembles of decision trees, and show how to jointly learn the ensemble parameters and the trees as a unified optimization problem within the large-margin framework. In this fashion, we achieve nonlinear potential learning on both the unary and pairwise terms.

We learn class-wise decision trees (potentials) for each object that appears in the image.

We show how to train the proposed CRFTree model efficiently. In particular, we combine column generation and cutting-plane techniques to approximately solve the resulting optimization problem, which can involve exponentially many variables and constraints.

We empirically demonstrate that CRFTree outperforms existing methods for image segmentation. On both binary and multi-class segmentation datasets we show the advantages of the learned nonlinear, nonparametric potentials of decision trees.
Related work. We briefly review recent works relevant to ours. A few attempts have been made to apply nonlinear kernels in SSVMs. Yu et al. [37] and Severyn et al. [29] developed sampled-cut based methods for training SSVMs with kernels. Sampled-cut methods were originally proposed for standard kernel SVMs; when applied to SSVMs, the performance is compromised [24]. In [5], image-mask pair kernels are designed to exploit image-level structural information for object segmentation. However, these kernels are restricted to the unary term. Although not in the large-margin framework, the kernel CRFs proposed in [20] incorporate kernels into CRF learning; the authors only demonstrated the efficacy of their method on a synthetic dataset and a small-scale protein dataset. To sum up, these approaches are hampered by heavy computational cost. Furthermore, it is not trivial to design appropriate kernels for structured problems. Recently, Lucchi et al. [24] proposed a two-step solution to tackle this problem. Specifically, they train linear SSVMs using kernelized feature vectors obtained from training a standard nonlinear kernel SVM model. They experimentally demonstrate that the kernel-transferred linear SVM model achieves performance similar to Gaussian SVMs. However, this approach is heuristic, and it cannot be shown theoretically that their formulation approximates a nonlinear SSVM model. Besides, their method consumes extra memory and training time, since the dimension of the transformed features equals the number of support vectors, which is linearly proportional to the size of the training data [33]. Moreover, compared to the above-mentioned works of [5] and [24], we achieve nonlinear learning on both the unary and the pairwise terms, while theirs is limited to nonlinear unary potential learning. The recent work of Shen et al. [31] generalizes standard boosting methods to structured learning, which shares similarities with our work here. However, our method bears critical differences from theirs: 1) We design a column generation method for nonlinear tree potential learning in CRFs directly from the SSVM formulation. Different from [31], which can directly derive a column generation method analogous to LPBoost [8], our derivation here is more challenging, because the most violated constraint cannot be obtained from the constraints of the dual problem, on which the column generation technique relies. We instead inspect the KKT conditions to seek the most violated constraint. This is an important difference from existing column generation techniques. 2) We develop a CRF learning method for multi-class semantic segmentation, while [31] only shows CRF learning for binary foreground/background segmentation. Our experiments on the MSRC21 dataset show that our method achieves state-of-the-art results. 3) We learn class-wise decision trees (potentials) for each object that appears in the image, which differs from [31]. The work on decision tree fields [28] is close to ours in that they also use decision trees to model the pairwise potentials. The major difference is that in [28], potential functions are constructed by directly summing the energy tables associated with the nodes visited when evaluating the decision trees. Their trees are generally deep, with depth 15 for the unary potential and depth 6 for the pairwise potential in their experiments. By contrast, we model the potential functions as ensembles of decision trees and learn them in the large-margin framework. In our method, the decision trees are shallow and simple, with binary outputs.
II. Learning Tree Potentials in CRFs
We present the details of our method in this section, first introducing the CRF model for segmentation, then formulating the energy functions and showing how to learn the decision tree potentials in the large-margin framework.
II-A. Segmentation using CRF models
Before presenting our method, we first revisit how CRF models perform image segmentation. Given an image instance $x$ and its corresponding labelling $y$, a CRF [19] models the conditional distribution of the form

P(y | x; w) = \frac{1}{Z(x; w)} \exp\big(-E(y, x; w)\big), \qquad (1)

where $w$ are the parameters and $Z$ is the normalization term. The energy $E$ of an image $x$ with segmentation labels $y$, defined over the nodes (superpixels) $\mathcal{N}$ and edges $\mathcal{S}$, takes the following form:

E(y, x; w) = \sum_{p \in \mathcal{N}} U(y_p, x; w) + \sum_{(p, q) \in \mathcal{S}} V(y_p, y_q, x; w). \qquad (2)

Here $U$ and $V$ are the unary and pairwise potentials, both of which depend on the observations $x$ as well as the parameter $w$. A CRF seeks an optimal labelling that achieves the maximum a posteriori (MAP), which mainly involves a two-step process [34]: 1) learning the model parameters $w$ from the training data; 2) inferring the most likely labelling for the test data given the learned parameters. The segmentation problem thus reduces to minimizing the energy (or cost) over $y$ with the learned parameters, i.e., $y^{\ast} = \operatorname{argmin}_{y} E(y, x; w)$. When the energy function is submodular, this inference problem can be efficiently solved via graph cuts [34].
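The energy minimization just described can be made concrete with a tiny brute-force example. The sketch below is ours and purely illustrative (names and values are invented, not from the paper): it enumerates all labellings of a three-node chain and returns the one minimizing a unary-plus-Potts energy; the paper of course uses graph cuts rather than enumeration.

```python
from itertools import product

def map_inference(unary, pairwise, edges, num_labels=2):
    """Exhaustive MAP inference on a tiny CRF: return the labelling y that
    minimizes E(y) = sum_p unary[p][y_p] + pairwise * #{(p,q): y_p != y_q}.
    Only feasible for a handful of nodes; graph cuts are used in practice."""
    n = len(unary)
    best_y, best_e = None, float("inf")
    for y in product(range(num_labels), repeat=n):
        e = sum(unary[p][y[p]] for p in range(n))
        e += sum(pairwise for p, q in edges if y[p] != y[q])
        if e < best_e:
            best_y, best_e = y, e
    return list(best_y), best_e

# Three superpixels in a chain; the middle one has a weakly wrong unary,
# and the smoothness term flips it to agree with its neighbours.
unary = [[0.0, 2.0], [0.6, 0.5], [0.0, 2.0]]  # unary[p][label] = cost
y_star, e_star = map_inference(unary, pairwise=0.4, edges=[(0, 1), (1, 2)])
print(y_star)  # [0, 0, 0]
```

Without the pairwise term the middle node would take label 1 (cost 0.5 < 0.6); the smoothness cost of disagreeing with both neighbours outweighs that gain.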
II-B. Energy Formulation
Given the energy function in Eqn. (2), we show how to construct the unary and pairwise potentials using decision trees. We denote by $x_p$ the features of superpixel $p$ ($p \in \mathcal{N}$), with label $y_p \in \{1, \dots, K\}$, where $K$ is the number of classes. Let $\mathscr{T}$ be a set of decision trees, which can be infinite. Each unary tree $t(\cdot)$ takes $x_p$ as input and outputs $t(x_p) \in \{0, 1\}$; each pairwise tree takes a pair $(x_p, x_q)$ as input and outputs $t(x_p, x_q) \in \{0, 1\}$. We introduce $K + 1$ groups of decision trees, of which $K$ groups are for the unary potential and one group is for the pairwise potential. For the unary potential, the $K$ groups of decision trees are denoted by $\mathscr{T}_1, \dots, \mathscr{T}_K$, corresponding to the $K$ categories; each $\mathscr{T}_c$ is associated with the $c$-th class. In other words, for each class we maintain its own unary feature mapping. The outputs of the $c$-th group of unary decision trees are stacked into a vector $H_c(x_p) = [t_{c,1}(x_p), t_{c,2}(x_p), \dots]^{\top}$. All decision trees of the unary potential are denoted by $\mathscr{T}_U$. Accordingly, the group of decision trees for the pairwise potential is denoted by $\mathscr{T}_V$, with $H_V(x_p, x_q)$ being the stacked outputs of all its trees. The whole set of decision trees is denoted by $\mathscr{T} = \mathscr{T}_U \cup \mathscr{T}_V$. We then construct the unary and pairwise potentials as
U(y_p, x; w_U) = \sum_{c=1}^{K} I(y_p = c)\, w_c^{\top} H_c(x_p), \qquad (3)

V(y_p, y_q, x; v) = I(y_p \neq y_q)\, v^{\top} H_V(x_p, x_q), \qquad (4)

where $I(\cdot)$ is an indicator function which equals 1 if the input is true and 0 otherwise, and $w_c$, $v$ are the ensemble weights to be learned. Then the energy function in Eqn. (2) can be written as:

E(y, x; w) = \sum_{p \in \mathcal{N}} \sum_{c=1}^{K} I(y_p = c)\, w_c^{\top} H_c(x_p) + \sum_{(p, q) \in \mathcal{S}} I(y_p \neq y_q)\, v^{\top} H_V(x_p, x_q). \qquad (5)
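A minimal sketch of how the potentials in Eqns. (3) and (4) evaluate, assuming hypothetical depth-1 trees (stumps) with binary outputs; all names, weights and thresholds below are illustrative, not from the paper.

```python
def unary_potential(y_p, x_p, trees, w):
    """U(y_p, x_p) = w_c^T H_c(x_p) for the class c = y_p: each class keeps
    its own group of binary-output trees, as in Eqn. (3)."""
    return sum(w_j * t(x_p) for w_j, t in zip(w[y_p], trees[y_p]))

def pairwise_potential(y_p, y_q, x_p, x_q, pair_trees, v):
    """V = [y_p != y_q] * v^T H_V(x_p, x_q): zero whenever the two labels
    agree, as in Eqn. (4)."""
    if y_p == y_q:
        return 0.0
    return sum(v_j * t(x_p, x_q) for v_j, t in zip(v, pair_trees))

# Hypothetical depth-1 "trees" with binary outputs, one group per class.
trees = {
    0: [lambda x: 1 if x[0] < 0.5 else 0],   # class-0 group
    1: [lambda x: 1 if x[0] >= 0.5 else 0],  # class-1 group
}
w = {0: [1.2], 1: [0.8]}
pair_trees = [lambda xp, xq: 1 if abs(xp[0] - xq[0]) > 0.3 else 0]
v = [0.5]

print(unary_potential(0, [0.2], trees, w))                    # 1.2
print(pairwise_potential(0, 1, [0.2], [0.9], pair_trees, v))  # 0.5
```

Note the class-wise structure: the weights and trees used for a superpixel depend on the label hypothesized for it, and the pairwise term vanishes for agreeing labels.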
Next we show how to learn these decision tree potentials in the large-margin framework.
II-C. Learning CRFs in the large-margin framework
Instead of directly minimizing the negative log-likelihood loss, we learn the CRF parameters in the large-margin framework, similar to [34]. Given a set of training examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, large-margin CRF learning solves the following optimization:

\min_{w, \, \xi \geq 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i, \quad \text{s.t.} \;\; \forall i, \, \forall y \neq y^{(i)}: \;\; E(y, x^{(i)}; w) - E(y^{(i)}, x^{(i)}; w) \geq \Delta(y^{(i)}, y) - \xi_i, \qquad (6)

where $\Delta(y^{(i)}, y)$ is a loss function associated with the prediction $y$ and the true label mask $y^{(i)}$. In general, we have $\Delta(y, y) = 0$ and $\Delta(y, y') > 0$ for any $y' \neq y$. Intuitively, the optimization in Eqn. (6) encourages the energy of the ground-truth labelling to be lower than that of any incorrect labelling by a margin of at least $\Delta(y^{(i)}, y)$ (up to the slack $\xi_i$).
To learn the potential functions proposed in §II-B in the large-margin framework, we introduce the following definitions. For the unary part, we define $\Psi_U(x, y) = [\Psi_1(x, y); \dots; \Psi_K(x, y)]$, where $[\cdot\,; \cdot]$ stacks vectors, and

\Psi_c(x, y) = \sum_{p \in \mathcal{N}} I(y_p = c)\, H_c(x_p). \qquad (7)

(This class-wise stacking can equivalently be expressed with the tensor product $\otimes$ of the label indicator vector and the tree responses.) Recall that $x_p$ denotes the $p$-th superpixel of the image $x$. Here, $\Psi_U$ acts as the unary feature mapping. Writing $w_U = [w_1; \dots; w_K]$, we clearly have:

\sum_{p \in \mathcal{N}} U(y_p, x; w_U) = w_U^{\top} \Psi_U(x, y). \qquad (8)

For the pairwise part, we define the pairwise feature mapping as:

\Psi_V(x, y) = \sum_{(p, q) \in \mathcal{S}} I(y_p \neq y_q)\, H_V(x_p, x_q). \qquad (9)

Then we have the following relation:

\sum_{(p, q) \in \mathcal{S}} V(y_p, y_q, x; v) = v^{\top} \Psi_V(x, y). \qquad (10)

We further define $w = [w_U; v]$, and the joint feature mapping as

\Psi(x, y) = [\Psi_U(x, y); \Psi_V(x, y)]. \qquad (11)

With the definitions of $w$ and $\Psi$, the energy function can then be written as:

E(y, x; w) = w^{\top} \Psi(x, y). \qquad (12)
Now we can apply the large-margin framework to learn CRFs with the proposed energy functions by rewriting the optimization problem in Eqn. (6) as:

\min_{w \geq 0, \, \xi \geq 0} \; \frac{1}{2}\|w\|^2 + \frac{C}{m} \sum_{i=1}^{m} \xi_i, \quad \text{s.t.} \;\; \forall i, \, \forall y: \;\; w^{\top} \big[ \Psi(x^{(i)}, y) - \Psi(x^{(i)}, y^{(i)}) \big] \geq \Delta(y^{(i)}, y) - \xi_i. \qquad (13)

Note that we add the constraint $w \geq 0$ to ensure the submodularity of our energy functions, which we discuss in detail in §II-E. We are now in a position to learn $w$ and the decision trees $\mathscr{T}$ in the single optimization problem of Eqn. (13), although it is not yet clear how. Next we demonstrate how to solve Eqn. (13) using column generation and cutting-plane techniques.
II-D. Learning tree potentials using column generation
We aim to learn a set of decision trees $\mathscr{T}$ and the potential parameter $w$ by solving the optimization problem in Eqn. (13). However, jointly learning $\mathscr{T}$ and $w$ is generally difficult. Here we propose to apply column generation techniques [8, 30] to alternately construct the set of decision trees and solve for $w$. From the point of view of column generation, the dimension of the primal variable $w$ is infinitely large; column generation iteratively selects (generates) variables for solving the optimization. In our case, the infinitely many dimensions of $w$ correspond to infinitely many decision trees, so we iteratively generate decision trees to solve the optimization.
Basically, we maintain a working set of decision trees (denoted $\mathscr{W}$). Each column generation iteration performs two steps. In the first step, we generate new decision trees and add them to $\mathscr{W}$. In the second step, we solve a restricted version of the optimization problem in Eqn. (13) on the current working set to obtain the solution for $w$. We repeat these two steps until convergence. Next we describe how to generate decision trees in a principled way using the dual solution of the optimization in Eqn. (13), similar to the conventional column generation technique. First we derive the Lagrange dual of Eqn. (13), which can be written as
\max_{\mu \geq 0, \, \nu \geq 0} \;\; \sum_{i, y} \mu(i, y)\, \Delta(y^{(i)}, y) - \frac{1}{2} \Big\| \sum_{i, y} \mu(i, y)\, \delta\Psi^{(i)}(y) + \nu \Big\|^2, \quad \text{s.t.} \;\; \forall i: \; \sum_{y} \mu(i, y) \leq \frac{C}{m}, \qquad (14)

where $\delta\Psi^{(i)}(y) = \Psi(x^{(i)}, y) - \Psi(x^{(i)}, y^{(i)})$. Here $\mu$ are the dual variables of the margin constraints (and $\nu$ those of the constraint $w \geq 0$). When using the column generation technique, one needs to find the most violated constraint in the dual. However, the constraints of the dual problem do not involve the decision trees $\mathscr{T}$. Instead of examining the dual constraints, we inspect the KKT conditions, which is an important difference from existing column generation techniques. According to the KKT conditions, at optimality the following condition holds for the primal solution $w$ and the current working set $\mathscr{W}$:

\forall t \in \mathscr{W}: \;\; \sum_{i, y} \mu(i, y)\, \big[ \Psi_t(x^{(i)}, y) - \Psi_t(x^{(i)}, y^{(i)}) \big] \leq w_t, \qquad (15)

where $\Psi_t$ denotes the component of $\Psi$ contributed by tree $t$, and $w_t = 0$ for any tree not yet selected. All trees already in $\mathscr{W}$ satisfy the above condition. Obviously, generating new decision trees that most violate this condition would contribute the most to the optimization of Eqn. (13). Hence the strategy for generating new decision trees is to solve the following problem:

t^{\star} = \operatorname*{argmax}_{t \in \mathscr{T}} \; \sum_{i, y} \mu(i, y)\, \big[ \Psi_t(x^{(i)}, y) - \Psi_t(x^{(i)}, y^{(i)}) \big]. \qquad (16)

Then $t^{\star}$ is added to the current working set $\mathscr{W}$. If $t^{\star}$ still satisfies the condition in Eqn. (15) (with $w_{t^{\star}} = 0$), the current solution of $w$ and $\mathscr{W}$ is already globally optimal.
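The alternation between generating the worst violator and re-solving the restricted problem can be sketched generically. The toy below is our own illustration, not the paper's algorithm: "trees" are candidate binary feature columns, the restricted solve is a small non-negative least-squares pass standing in for the QP over $w$, and the violation score plays the role of the search in Eqn. (16). The loop structure is the point.

```python
def column_generation(candidates, score_fn, solve_restricted, max_iters=10, tol=1e-6):
    """Generic column-generation loop mirroring Sec. II-D: repeatedly
    (1) pick the candidate most violating the optimality condition under
    the current duals, and (2) re-solve the restricted problem."""
    working_set, params = [], None
    for _ in range(max_iters):
        params, duals = solve_restricted(working_set)
        best = max(candidates, key=lambda t: score_fn(t, duals))
        if score_fn(best, duals) <= tol:
            break  # analogue of Eqn. (15) holding: no violated candidate left
        working_set.append(best)
    return working_set, params

# Toy instance: fit a target signal with non-negative weights over columns.
data = [([1, 0, 1], 1.0), ([1, 1, 0], 1.0), ([0, 1, 0], 0.0), ([0, 0, 1], 1.0)]
candidates = [0, 1, 2]  # candidate column indices

def solve_restricted(ws):
    # Restricted solve: coordinate descent for non-negative least squares
    # on the selected columns; the residual plays the role of the duals.
    w = {t: 0.0 for t in ws}
    for _ in range(100):
        for t in ws:
            num = sum((y - sum(w[u] * h[u] for u in ws if u != t)) * h[t]
                      for h, y in data)
            den = sum(h[t] * h[t] for h, _ in data)
            w[t] = max(0.0, num / den)
    residual = [y - sum(w[u] * h[u] for u in ws) for h, y in data]
    return w, residual

def score_fn(t, residual):
    # Violation score of a candidate column, the analogue of Eqn. (16).
    return sum(r * h[t] for (h, _), r in zip(data, residual))

selected, weights = column_generation(candidates, score_fn, solve_restricted)
print(selected)  # [0, 2, 1]
```

Columns enter the working set one at a time in order of how much they violate the optimality condition; once no candidate scores above the tolerance, the loop terminates.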
The optimization in Eqn. (16) for generating new decision trees decomposes independently into the unary part and the pairwise part, so the newly generated trees can be written as $\{t_1^{\star}, \dots, t_K^{\star}, t_V^{\star}\}$. For the unary part, we learn class-wise decision trees; namely, we generate $K$ decision trees corresponding to the $K$ categories at each column generation iteration. More specifically, according to the definition of $\Psi$ in Eqn. (11), we solve the following problems for $c = 1, \dots, K$:

t_c^{\star} = \operatorname*{argmax}_{t} \; \sum_{i, y} \mu(i, y) \sum_{p \in \mathcal{N}^{(i)}} \big[ I(y_p = c) - I(y_p^{(i)} = c) \big]\, t(x_p). \qquad (17)

To solve the above optimization problems, we train weighted decision tree classifiers. Specifically, when training decision trees for the $c$-th class, the training data is composed of those superpixels whose ground-truth label or predicted label equals the category label $c$. Since the output of a decision tree is in $\{0, 1\}$, the maximization in Eqn. (17) is achieved if $t$ outputs 1 for every superpixel with $I(y_p = c) = 1$ and outputs 0 for every superpixel with $I(y_p^{(i)} = c) = 1$. Therefore, superpixels carrying the predicted label $c$ are used as positive training examples, while superpixels with ground-truth label $c$ are used as negative training examples. The dual solution $\mu$ serves as the weighting of the training data. For the pairwise part, we generate one decision tree in each column generation iteration; the new pairwise tree is generated as:

t_V^{\star} = \operatorname*{argmax}_{t} \; \sum_{i, y} \mu(i, y) \sum_{(p, q) \in \mathcal{S}^{(i)}} \big[ I(y_p \neq y_q) - I(y_p^{(i)} \neq y_q^{(i)}) \big]\, t(x_p, x_q). \qquad (18)

Similar to the unary case, we train a weighted decision tree classifier with $\mu$ as the training example weightings: pairs of neighbouring superpixels whose predicted labels disagree serve as positive examples, and pairs whose ground-truth labels disagree serve as negative examples. Here $t(x_p, x_q)$ is the response of a decision tree applied to the pairwise features constructed from two neighbouring superpixels $(p, q)$, e.g., color differences or shared boundary lengths.
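The weighted tree learning step can be illustrated with a hand-rolled stump trainer. Everything below, including the data, is a made-up sketch; the paper trains depth-2 trees with the optimized learner of [3].

```python
def fit_weighted_stump(X, y, sample_weight):
    """Exhaustively fit a depth-1 decision tree (stump) with binary output
    on weighted data: a stand-in for the weighted tree learner used in the
    tree-generation step above."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted(set(x[f] for x in X)):
            # candidate stump: predict 1 when x[f] >= thr
            err = sum(wi for x, yi, wi in zip(X, y, sample_weight)
                      if (1 if x[f] >= thr else 0) != yi)
            if best is None or err < best[0]:
                best = (err, f, thr)
    _, f, thr = best
    return lambda x: 1 if x[f] >= thr else 0

# Toy data mimicking the construction above: superpixels assigned class c in
# the violating labelling are positives, ground-truth class-c superpixels
# are negatives, and the dual variables mu act as example weights.
X = [[0.9], [0.8], [0.2], [0.1]]
y = [1, 1, 0, 0]
mu = [1.0, 0.5, 1.0, 0.5]
stump = fit_weighted_stump(X, y, mu)
print([stump(x) for x in X])  # [1, 1, 0, 0]
```

The weighting matters: an example with a large dual value contributes more to the weighted error, so the stump prefers splits that classify it correctly.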
With the above analysis, we can now apply column generation to jointly learn the decision trees and $w$. The column generation (CG) procedure iterates the following two steps:
1) Solve Eqns. (17) and (18) with the current dual solution $\mu$ to generate new decision trees $\{t_1^{\star}, \dots, t_K^{\star}\}$ and $t_V^{\star}$.
2) Add $\{t_1^{\star}, \dots, t_K^{\star}\}$ and $t_V^{\star}$ to the working set $\mathscr{W}$ and re-solve for the primal solution $w$ and dual solution $\mu$.
We show two segmentation examples on the Oxford flower dataset produced by our method with different numbers of CG iterations in Fig. 1. As can be seen, our method refines the segmentation as the CG iterations increase. Since this dataset is relatively simple, a few CG iterations are enough to obtain satisfactory results.
Solving the primal problem in the second step involves a large number of constraints due to the large output space of $y$. We next show how to apply the cutting-plane technique [13] to solve this problem efficiently.
II-E. Speeding up optimization using cutting-plane
To apply cutting-planes to the optimization in Eqn. (13), we first derive its 1-slack formulation, following the 1-slack SSVM formulation introduced in [13]. The 1-slack formulation of our method can be written as:

\min_{w \geq 0, \, \xi \geq 0} \; \frac{1}{2}\|w\|^2 + C \xi, \quad \text{s.t.} \;\; \forall (y^{(1)\prime}, \dots, y^{(m)\prime}): \;\; \frac{1}{m}\, w^{\top} \sum_{i=1}^{m} \delta\Psi^{(i)}(y^{(i)\prime}) \geq \frac{1}{m} \sum_{i=1}^{m} \Delta(y^{(i)}, y^{(i)\prime}) - \xi, \qquad (19)

where $\delta\Psi^{(i)}(y) = \Psi(x^{(i)}, y) - \Psi(x^{(i)}, y^{(i)})$. Cutting-plane methods work by finding the most violated constraint for each example,

y^{\star} = \operatorname*{argmax}_{y} \; \Delta(y^{(i)}, y) - w^{\top} \delta\Psi^{(i)}(y), \qquad (20)

at every iteration and adding it to the constraint working set. The sketch of our method is summarized in Algorithm 1, which calls Algorithm 2 to solve the 1-slack optimization.
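Loss-augmented inference as in Eqn. (20) can be sketched by brute force on a toy problem (our own illustrative example; in practice graph cuts are used, as discussed below).

```python
from itertools import product

def most_violated(y_gt, energy, delta, num_nodes, num_labels=2):
    """Loss-augmented inference: among all labellings, find the one
    maximizing delta(y_gt, y) - E(y), i.e. a low-energy labelling that
    still incurs a large loss. Brute force here; with a Hamming loss the
    loss term folds into the unaries and graph cuts apply."""
    best_y, best_v = None, float("-inf")
    for y in product(range(num_labels), repeat=num_nodes):
        v = delta(y_gt, y) - energy(y)
        if v > best_v:
            best_y, best_v = list(y), v
    return best_y

hamming = lambda ya, yb: sum(a != b for a, b in zip(ya, yb))
unary = [[0.0, 0.2], [0.0, 1.5]]  # unary[p][label] = cost
energy = lambda y: sum(unary[p][y[p]] for p in range(2))
print(most_violated([0, 0], energy, hamming, num_nodes=2))  # [1, 0]
```

Here the first node is cheap to flip (loss gain 1 at energy cost 0.2), so it flips; the second is too expensive (cost 1.5), so it stays.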
Implementation details
To deal with the unbalanced appearance of different categories in the dataset, we define $\Delta$ as a weighted Hamming loss, which weighs errors for a given class inversely proportional to the frequency with which it appears in the training data, similar to [24]. In the inference problem of Eqn. (20), when using the Hamming loss as the label cost $\Delta$, the label cost term can be absorbed into the unary part. We can therefore apply graph cuts to efficiently solve Eqn. (20). For more complicated label cost functions, an efficient inference algorithm is proposed in [4]. During each CG iteration, our method first solves Eqns. (17) and (18) given the current $w$ and $\mu$, and then solves a quadratic programming (QP) problem given the working set. When solving Eqns. (17) and (18), we train weighted decision tree classifiers using the highly optimized decision tree training method of [3].
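The weighted Hamming loss just described can be sketched as follows (a minimal illustration; the function names are ours).

```python
from collections import Counter

def class_weights(train_label_masks):
    """Per-class weights inversely proportional to class frequency in the
    training data, as in [24], to counter class imbalance."""
    counts = Counter(l for mask in train_label_masks for l in mask)
    total = sum(counts.values())
    return {c: total / n for c, n in counts.items()}

def weighted_hamming(y_gt, y, weights):
    """Weighted Hamming loss: each mislabelled node is charged the weight
    of its ground-truth class, so errors on rare classes cost more."""
    return sum(weights[a] for a, b in zip(y_gt, y) if a != b)

wts = class_weights([[0, 0, 0, 1]])       # class 1 is three times rarer
print(wts[1] > wts[0])                    # True: the rare class weighs more
print(weighted_hamming([0, 1], [1, 0], wts))  # charged wts[0] + wts[1]
```

Because the loss decomposes over nodes, it can be folded into the unary terms during loss-augmented inference, which is what makes graph cuts applicable.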
Discussion on submodularity
It is known that if graph cuts are to be applied to achieve a globally optimal labelling in segmentation, the energy function must be submodular. For foreground/background segmentation, in which a (super)pixel label takes a value in $\{0, 1\}$, we show that our method retains this submodularity. It is well known that an energy function is submodular if its pairwise term $\theta_{pq}$ satisfies $\theta_{pq}(0, 1) + \theta_{pq}(1, 0) \geq \theta_{pq}(0, 0) + \theta_{pq}(1, 1)$. Recall that our pairwise energy is $\theta_{pq}(y_p, y_q) = I(y_p \neq y_q)\, v^{\top} H_V(x_p, x_q)$. Clearly we have $\theta_{pq}(0, 0) = \theta_{pq}(1, 1) = 0$ because of the indicator function $I(y_p \neq y_q)$. It remains to ensure $\theta_{pq}(0, 1) \geq 0$ and $\theta_{pq}(1, 0) \geq 0$. Given the non-negativity constraint we impose on $v$ in our model, and the fact that the decision tree outputs take values in $\{0, 1\}$, both quantities are non-negative. This completes the proof of the submodularity of our model. In the case of multi-object segmentation, inference is done by the expansion moves of graph cuts.
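The submodularity argument can be checked numerically; the sketch below uses made-up weights and tree outputs purely for illustration.

```python
def is_submodular(theta):
    """Binary submodularity test:
    theta(0,1) + theta(1,0) >= theta(0,0) + theta(1,1),
    the condition required for exact graph-cut inference."""
    return theta(0, 1) + theta(1, 0) >= theta(0, 0) + theta(1, 1)

# Our pairwise term: v^T H gated by [y_p != y_q], with v >= 0 and tree
# outputs H in {0, 1}. Hence theta(0,0) = theta(1,1) = 0 while
# theta(0,1), theta(1,0) >= 0, so the condition always holds.
v = [0.3, 0.7]  # non-negative ensemble weights
H = [1, 0]      # binary tree outputs on a fixed superpixel pair
theta = lambda yp, yq: 0.0 if yp == yq else sum(vi * hi for vi, hi in zip(v, H))
print(is_submodular(theta))  # True
```

A pairwise term that rewarded agreement with a positive bonus (i.e. positive diagonal entries) would fail this test, which is exactly what the constraint $w \geq 0$ rules out.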
Discussion on the non-negative constraint on $w$
Our learning framework aligns with boosting methods: we learn a non-negatively weighted ensemble of weak structured learners (constructed from decision trees), analogous to the weak learners in boosting methods such as AdaBoost and LPBoost [8], where non-negative weightings are commonly used. Further, a weak structured learner generated by our column generation method is expected to make a positive contribution to the learning objective; if it is of no use to the objective, its weight will approach zero. It is therefore reasonable to enforce the non-negativity constraint on $w$.
III. Experiments
To demonstrate the effectiveness of the proposed method, we first compare our model with the most closely related baseline methods: SVMs, AdaBoost and SSVMs. In Section III-C, we show that our method achieves state-of-the-art results by exploiting recent advances in feature learning [6, 16].
III-A. Experimental setup
The datasets evaluated here include three binary datasets (Weizmann horse, Oxford flower and Graz02) and two multi-class datasets (MSRC21 and PASCAL VOC 2012). The Weizmann horse dataset (http://www.msri.org/people/members/eranb/) consists of 328 horse images with various backgrounds, with a ground-truth mask available for each image. We use the same data split as in [5] and [17]. The Oxford 17-category flower dataset [26] is composed of 849 flower images. Those with too small a foreground are removed, which leaves 753 for segmentation purposes [26]. The data split stated in [26] is used for the evaluation. In our experiments, images of the Weizmann horse and the Oxford flower datasets are resized to 256×256. The Graz02 dataset (http://www.emt.tugraz.at/~pinz/) contains 3 categories (bike, car and people). This dataset is considered challenging as the objects appear against various backgrounds and in different poses. We follow the evaluation protocol in [25], using 150 images for training and 150 for testing in each category. The MSRC21 dataset [32] is a popular multi-class segmentation benchmark with 591 images containing objects from 21 categories. We follow the standard split to divide the dataset into training/validation/test subsets. The PASCAL VOC 2012 dataset (http://host.robots.ox.ac.uk/pascal/VOC/voc2012/) is a widely used benchmark for semantic segmentation, containing 2913 images in the trainval set and 1456 images in the test set, covering 21 categories. Unlike many state-of-the-art methods such as [38], we do not use any additional training data for this dataset.
We start by over-segmenting the images into superpixels using SLIC [1], with 700 superpixels generated per image. We extract dense SIFT descriptors and color histograms around each superpixel centroid with different block sizes (12×12, 24×24, 36×36). The dense SIFT descriptors are then quantized into bag-of-words features using nearest-neighbour search with a codebook size of 400. We construct four types of pairwise features, also using different block sizes, to enforce spatial smoothness: color difference in LUV space, color histogram difference, texture difference in terms of LBP operators, and shared boundary length [10]. The number of column generation iterations of our CRFTree is set to 50 based on a validation set. We learn tree potentials with the tree depth set to 2. Training on the MSRC21 dataset on a standard PC takes around 16 hours.
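The descriptor-to-bag-of-words quantization step can be sketched as follows; this is a toy two-word codebook with 2-D "descriptors" rather than the 400-word codebook over SIFT descriptors used above.

```python
def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codebook centre
    (squared Euclidean distance) and accumulate a bag-of-words histogram:
    a small stand-in for the SIFT-to-BoW quantization step."""
    hist = [0] * len(codebook)
    for d in descriptors:
        dists = [sum((a - b) ** 2 for a, b in zip(d, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist

codebook = [[0.0, 0.0], [1.0, 1.0]]           # toy two-word codebook
descs = [[0.1, 0.2], [0.9, 1.1], [1.0, 0.8]]  # toy 2-D "descriptors"
print(bow_histogram(descs, codebook))  # [1, 2]
```

In the actual pipeline this histogram would be accumulated per superpixel block and concatenated with the color histogram features.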
III-B. Comparing with baseline methods
We first compare CRFTree with conventional methods, namely linear SVMs, AdaBoost and SSVMs, to demonstrate the superiority of our method. For SVMs and AdaBoost, each superpixel is classified independently without CRFs. We mainly evaluate on the more challenging Graz02 and MSRC21 datasets in this part. The regularization parameter C of SVMs, SSVMs and our CRFTree is selected on a validation set. We use depth-2 decision trees for training AdaBoost and our CRFTree. The maximum number of AdaBoost iterations is chosen from {50, 100, 200}. For our method, we treat the foreground and background as two categories in the binary case to learn class-wise potentials.
Graz02
For a comprehensive evaluation, we use two measurements to quantify the performance on the Graz02 dataset: the intersection-over-union score and the pixel accuracy (over foreground and background). We report the results in Table I. As can be observed, AdaBoost based on depth-2 decision trees performs better than the linear SVMs. On the other hand, structured methods, which jointly consider local information and spatial consistency, significantly outperform the independent per-superpixel classifiers. By introducing nonlinear and class-wise potential learning, our method gains further improvement over SSVMs.
MSRC21
We learn class-wise potentials using our CRFTree for each of the 21 classes on the MSRC dataset. The comparison results are summarized in Table II (upper part). Similar conclusions can be drawn as on the Graz02 dataset, and our CRFTree again outperforms all its baseline competitors.
Category  bike  car  people 

intersection/union (foreground, background) (%)  
SVMs  67.8 (51.9, 83.8)  69.7 (46.8, 92.6)  65.0 (44.5, 85.5) 
AdaBoost  71.2 (57.6, 84.9)  71.0 (49.4, 92.6)  67.7 (48.7, 86.7) 
SSVMs  72.2 (58.6, 85.8)  76.9 (60.0, 94.2)  70.9 (53.8, 87.9) 
CRFTree  76.4 (65.0, 87.8)  79.5 (64.0, 95.0)  74.2 (58.7, 89.7) 
CRFTree (FL)  78.3 (67.7, 88.9)  83.0 (70.1, 95.9)  75.7 (61.0, 90.5) 
pixel accuracy (foreground, background) (%)  
SVMs  79.5 (67.4, 91.5)  77.3 (57.2, 97.3)  77.7 (63.8, 91.6) 
AdaBoost  83.8 (77.3, 90.3)  80.1 (63.5, 96.6)  80.5 (69.0, 91.9) 
SSVMs  83.8 (76.1, 91.6)  85.5 (73.8, 97.2)  83.9 (75.8, 92.1) 
CRFTree  87.8 (83.9, 91.8)  87.0 (76.4, 97.7)  85.9 (78.4, 93.4) 
CRFTree (FL)  89.1 (85.8, 92.4)  90.0 (82.1, 98.0)  86.9 (80.0, 94.0) 
building  grass  tree  cow  sheep  sky  aeroplane  water  face  car  bicycle  flower  sign  bird  book  chair  road  cat  dog  body  boat  Average  Global 
SVMs  54  92  73  41  54  80  51  67  51  41  59  41  28  8  64  17  75  41  23  20  7  47.0  63.7 
AdaBoost  68  92  83  48  58  87  43  69  58  43  64  41  32  14  70  28  79  47  22  41  6  52.0  68.6 
SSVMs  65  92  81  42  76  84  65  70  75  54  87  62  31  14  76  31  78  61  30  25  2  57.2  70.8 
CRFTree  53  87  85  59  84  90  77  82  81  54  90  57  62  22  81  59  80  71  26  49  15  64.9  73.9 
CRFTree (FL)  66  95  89  83  89  90  90  83  76  74  83  71  69  46  87  73  87  84  53  68  20  75.1  82.2 
CRFTree (CNN)  73  96  89  82  92  96  89  86  93  78  86  91  71  75  85  76  86  91  63  83  41  82.0  86.2 
Shotton et al. [32]  49  88  79  97  97  78  82  54  87  74  72  74  36  24  93  51  78  75  35  66  18  67  72 
Ladicky et al. [18]  80  96  86  74  87  99  74  87  86  87  82  97  95  30  86  31  95  51  69  66  9  75  86 
Gonfaus et al. [11]  60  78  77  91  68  88  87  76  73  77  93  97  73  57  95  81  76  81  46  56  46  75  77 
Lucchi et al. [24]  59  90  92  82  83  94  91  80  85  88  96  89  73  48  96  62  81  87  33  44  30  76  82 
Lucchi et al. [23]  67  89  85  93  79  93  84  75  79  87  89  92  71  46  96  79  86  76  64  77  50  78.9  83.7 
III-C. Comparing with state-of-the-art methods
Since features play a pivotal role in the performance of vision algorithms, we exploit recent advances in feature learning to pursue state-of-the-art results, i.e., unsupervised feature learning [6] and convolutional neural networks (CNNs) [16]. Specifically, for the unsupervised feature learning, we first learn a dictionary of size 400 with patch size 6×6 on the evaluated image dataset using K-means, and then use soft threshold coding [6] to encode patches extracted from each superpixel block. The final feature vectors (which we call encoding features here) are obtained by performing three-level max pooling over the superpixel block. For the CNN features, we use the AlexNet model [16] trained on ImageNet (http://imagenet.org) to generate features. These two versions of our method are denoted CRFTree (FL) and CRFTree (CNN), respectively. We only report the results of CRFTree (CNN) on the MSRC21 and PASCAL VOC 2012 datasets, since our method already performs very well with the encoding features on the three binary datasets.

Method  Sa  So 

Levin & Weiss [21]  95.5   
Cosegmentation [14]  80.1   
Bertelli et al. [5]  94.6  80.1 
Kuettel et al. [17]  94.7   
CRFTree (FL)  94.6  80.4 
Weizmann horse
We quantify the performance by the global pixel-wise accuracy $S_a$ and the foreground intersection-over-union score $S_o$, as in [5]. $S_a$ measures the percentage of pixels correctly classified, while $S_o$ directly reflects the segmentation quality of the foreground. The results are reported in Table III. Our method performs better than the kernel structural learning method of [5], which may result from the fact that they only introduce nonlinearity into the unary part, while our method achieves nonlinearity on both the unary and pairwise terms. The best $S_a$ score is obtained by [21]; however, their method relies on the assumption that a perfect bounding box of the horse is available for each test image, which is not practical. In contrast, we provide a principled and general way of nonlinearly learning CRF parameters. We show some segmentation examples of our method in Fig. 2.
Oxford flower
As in [5], we also use $S_a$ and $S_o$ to measure the performance on the Oxford flower dataset, and report the results in Table IV. Our method performs comparably to the original work of [26] on this dataset in terms of $S_a$, while again obtaining better results than the closely related state-of-the-art work of [5]. It is also worth noting that the method in [26] is highly domain-specific, relying on modelling the flower's shape (centre and petals), while ours is generally applicable.
Graz02
As in the works of [25], [9], [2] and [17], we also evaluate the F-score on the Graz02 dataset, besides the above-mentioned intersection-over-union score and pixel accuracy. The F-score is defined as $F = \frac{2PR}{P + R}$, where $P$ is the precision and $R$ is the recall. We summarize the results in Table V and Table I. From Table V, it can be seen that our method significantly outperforms all compared methods, which fully demonstrates the power of nonlinear and class-wise potential learning. Furthermore, we observe from Table I that, compared with the previous results, adding more features helps to improve the performance.

Method  bike  car  people  average 

Marszalek & Schmid [25]  61.8  53.8  44.1  53.2 
Fulkerson et al. [9]  66.4  54.7  51.4  57.5 
Aldavert et al. [2]  71.9  62.9  58.6  64.5 
Kuettel et al. [17]  63.2  74.8  66.4  68.1 
CRFTree (FL)  80.7  82.4  75.8  79.5 
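The F-score used in Table V can be computed as in the following sketch (masks flattened to label lists; the function name is ours).

```python
def f_score(pred, gt, positive=1):
    """F = 2PR / (P + R) on flattened predicted vs. ground-truth masks,
    with P = precision and R = recall for the foreground class."""
    tp = sum(p == positive and g == positive for p, g in zip(pred, gt))
    fp = sum(p == positive and g != positive for p, g in zip(pred, gt))
    fn = sum(p != positive and g == positive for p, g in zip(pred, gt))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

print(f_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5 (P = R = 0.5)
```

Unlike pixel accuracy, this measure ignores true-negative background pixels, so it is not inflated by images dominated by background.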
MSRC21
The comparison with state-of-the-art works is reported in the lower part of Table II. As can be seen, by incorporating more advanced features, our CRFTree gains significant improvements over its earlier results, which only use bag-of-words and color histogram features. It is worth noting that our method performs better than the closely related work of Lucchi et al. [24], which exploits nonlinear kernels. It should also be pointed out that we did not employ any global potentials (whereas in [24], adding global information improves their global and average per-category accuracies from 70 and 73 to 82 and 76, respectively). If global or higher-order potentials are incorporated into our model, further performance gains can be expected. We show some qualitative examples in Fig. 3.
PASCAL VOC 2012
We generate deep features for each superpixel by averaging the pixel-wise feature map scores within the superpixel, obtained from a pretrained FCN model [22]. We then train our CRFTree model on the standard PASCAL VOC 2012 training set with the generated deep features. Following the standard evaluation procedure of the PASCAL VOC challenge, we upload our segmentation results to the test server and use the average intersection over union as the evaluation metric. We compare against several state-of-the-art methods ([12], [7], [22], [38]) on the PASCAL VOC 2012 test set. The results are reported in Table VI. As seen from the table, our CRFTree beats Hypercolumn [12] and CFM [7], and outperforms FCN [22] by a notable margin. Although our method is surpassed by [38], it should be noted that their result is obtained using extra training data (11,685 images vs. the 1456 images used for training our CRFTree). Some qualitative examples of our method are illustrated in Fig. 4.

Method  intersection/union 

CFM [7]  61.8 
Hypercolumn [12]  62.6 
FCN8s [22]  62.2 
Zheng et al. [38]  72.0 
CRFTree  65.4 
IV. Conclusion
Nonlinear structured learning is a promising yet challenging topic in the community. In this work, we have proposed a nonlinear structured learning method of tree potentials for image segmentation. The unary and pairwise potentials are ensembles of class-wise trees, with the ensemble parameters and the trees jointly learned in a unified large-margin framework. In this way, nonlinearity is easily introduced into CRF learning. The resulting model involves an exponential number of variables and constraints. We therefore derive a novel algorithm combining a modified column generation method with the cutting-plane technique for efficient model training. We have demonstrated the superiority of the proposed nonlinear potential learning method by comparing against state-of-the-art methods on both binary and multi-class object segmentation datasets. A potential disadvantage of our method is that it is prone to overfitting due to its strong nonlinear learning capacity; this can be alleviated by using more training data. On the other hand, as shown in Table II, our method using pretrained CNN features achieves the best performance. It is therefore worth exploring further combinations of our method with deep learning techniques in future work.
References
 [1] R. Achanta, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012.
 [2] D. Aldavert, A. Ramisa, R. L. de Mántaras, and R. Toledo. Fast and robust object segmentation with the integral linear classifier. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
 [3] R. Appel, T. J. Fuchs, P. Dollár, and P. Perona. Quickly boosting decision trees: pruning underachieving features early. In Proceedings of International Conference on Machine Learning, 2013.
 [4] A. Bauer, S. Nakajima, and K.-R. Müller. Efficient exact inference with loss augmented objective in structured learning. IEEE Transactions on Neural Networks and Learning Systems, 2016.
 [5] L. Bertelli, T. Yu, D. Vu, and B. Gokturk. Kernelized structural SVM learning for supervised object segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 [6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proceedings of International Conference on Machine Learning, 2011.
 [7] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [8] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor. Linear programming boosting via column generation. Machine Learning, 2002.
 [9] B. Fulkerson, A. Vedaldi, and S. Soatto. Localizing objects with smart dictionaries. In Proceedings of European Conference on Computer Vision, 2008.
 [10] B. Fulkerson, A. Vedaldi, and S. Soatto. Class segmentation and object localization with superpixel neighborhoods. In Proceedings of IEEE International Conference on Computer Vision, 2009.
 [11] J. M. Gonfaus, X. B. Bosch, J. van de Weijer, A. D. Bagdanov, J. S. Gual, and J. G. Sabaté. Harmony potentials for joint classification and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
 [12] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [13] T. Joachims. Training linear SVMs in linear time. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2006.
 [14] A. Joulin, F. R. Bach, and J. Ponce. Discriminative clustering for image cosegmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2010.
 [15] M. Kim. Mixtures of conditional random fields for improved structured output prediction. IEEE Transactions on Neural Networks and Learning Systems, 2016.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, 2012.
 [17] D. Kuettel and V. Ferrari. Figure-ground segmentation by transferring window masks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2012.
 [18] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In Proceedings of IEEE International Conference on Computer Vision, 2009.
 [19] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of International Conference on Machine Learning, 2001.
 [20] J. Lafferty, X. Zhu, and Y. Liu. Kernel conditional random fields: Representation and clique selection. In Proceedings of International Conference on Machine Learning, 2004.
 [21] A. Levin and Y. Weiss. Learning to combine bottom-up and top-down segmentation. In Proceedings of European Conference on Computer Vision, 2006.
 [22] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [23] A. Lucchi, Y. Li, and P. Fua. Learning for structured prediction using approximate subgradient descent with working sets. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2013.
 [24] A. Lucchi, Y. Li, K. Smith, and P. Fua. Structured image segmentation using kernelized features. In Proceedings of European Conference on Computer Vision, 2012.
 [25] M. Marszalek and C. Schmid. Accurate object localization with shape masks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007.
 [26] M.-E. Nilsback and A. Zisserman. Delving deeper into the whorl of flower segmentation. Image and Vision Computing, 2010.
 [27] S. Nowozin, P. Gehler, and C. H. Lampert. On parameter learning in CRF-based approaches to object class image segmentation. In Proceedings of European Conference on Computer Vision, 2010.
 [28] S. Nowozin, C. Rother, S. Bagon, T. Sharp, B. Yao, and P. Kohli. Decision tree fields. In Proceedings of IEEE International Conference on Computer Vision, 2011.
 [29] A. Severyn and A. Moschitti. Fast support vector machines for structural kernels. In Proceedings of European Conference on Machine Learning & Knowledge Discovery in Databases, 2011.
 [30] C. Shen and Z. Hao. A direct formulation for totally-corrective multi-class boosting. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2011.
 [31] C. Shen, G. Lin, and A. van den Hengel. StructBoost: Boosting methods for predicting structured output variables. CoRR, 2013.
 [32] J. Shotton, M. Johnson, and R. Cipolla. Semantic texton forests for image categorization and segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2008.
 [33] I. Steinwart. Sparseness of support vector machines. Journal of Machine Learning Research, 2003.
 [34] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proceedings of European Conference on Computer Vision, 2008.
 [35] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In Proceedings of Advances in Neural Information Processing Systems, 2004.
 [36] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 2005.
 [37] C.-N. Yu and T. Joachims. Training structural SVMs with kernels using sampled cuts. In Proceedings of ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2008.
 [38] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of IEEE International Conference on Computer Vision, 2015.