Integrated Inference and Learning of Neural Factors in Structural Support Vector Machines

08/03/2015 ∙ by Rein Houthooft, et al. ∙ Ghent University

Tackling pattern recognition problems in areas such as computer vision, bioinformatics, speech or text recognition is often done best by taking into account task-specific statistical relations between output variables. In structured prediction, this internal structure is used to predict multiple outputs simultaneously, leading to more accurate and coherent predictions. Structural support vector machines (SSVMs) are nonprobabilistic models that optimize a joint input-output function through margin-based learning. Because SSVMs generally disregard the interplay between unary and interaction factors during the training phase, final parameters are suboptimal. Moreover, their factors are often restricted to linear combinations of input features, limiting their generalization power. To improve prediction accuracy, this paper proposes: (i) joint inference and learning by integrating back-propagation and loss-augmented inference into SSVM subgradient descent; (ii) extending SSVM factors to neural networks that form highly nonlinear functions of input features. Image segmentation benchmark results demonstrate improvements over conventional SSVM training methods in terms of accuracy, highlighting the feasibility of end-to-end SSVM training with neural factors.


1 Introduction

In traditional machine learning, the output consists of a single scalar, whereas in structured prediction, the output can be arbitrarily structured. These models have proven useful in tasks where output interactions play an important role. Examples are image segmentation, part-of-speech tagging, and optical character recognition, where taking into account contextual cues and predicting all output variables at once is beneficial. A widely used framework is the conditional random field (CRF), which models the statistical conditional dependencies between input and output variables, as well as between output variables mutually. However, many tasks only require 'most-likely' predictions, which has led to the rise of nonprobabilistic approaches. Rather than optimizing the Bayes risk, these models minimize a structured loss, allowing the optimization of performance indicators directly Nowozin:2011:SLP:2185833.2185834. One such model is the structural support vector machine (SSVM) tsochantaridis2005large, in which a generalization of the hinge loss to multiclass and multilabel prediction is used.

A downside of traditional SSVM training is the bifurcated training approach in which unary factors (dependencies of outputs on inputs) and interaction factors (mutual output dependencies) are trained sequentially: a unary classification model is optimized, while the interactions are trained post-hoc. However, this two-phase approach is suboptimal, because errors made during the training of the interaction factors cannot be accounted for during training of the unary classifier. Another limitation is that SSVM factors are linear feature combinations, restricting the SSVM's generalization power. We propose to extend these linearities to highly nonlinear functions by means of multilayer neural networks, to which we refer as neural factors. Towards this goal, subgradient descent is extended by combining loss-augmented inference with back-propagation of the SSVM objective error into both unary and interaction neural factors. This leads to better generalization and more synergy between both SSVM factor types, resulting in more accurate and coherent predictions.

Our model is empirically validated by means of the complex structured prediction task of image segmentation on the MSRC-21, KITTI, and SIFT Flow benchmarks. The results demonstrate that integrated inference and learning, and/or using neural factors, improves prediction accuracy over conventional SSVM training methods, such as cutting plane and subgradient descent optimization Nowozin:2011:SLP:2185833.2185834. Furthermore, we demonstrate that our model is able to perform on par with current state-of-the-art segmentation models on the MSRC-21 benchmark.

2 Related work

Although the combination of neural networks and structured or probabilistic graphical models dates back to the early '90s bottou1997global; NIPS1989_195, interest in this topic is resurging. Several recent works introduce nonlinear unary factors/potentials into structured models. For the task of image segmentation, Chen et al. Chen2015 train a convolutional neural network as a unary classifier, followed by the training of a dense random field over the input pixels. Similarly, Farabet et al. farabet-pami-13 combine the output maps of a convolutional network with a CRF for image segmentation, while Li and Zemel li2014high propose semi-supervised max-margin learning with nonlinear unary potentials. Contrary to these works, we trade the bifurcated training approach for integrated inference and training of unary and interaction factors. Several works collobert2011natural; morris2008conditional; Prabhavalkar2010; Yu2009 focus on linear-chain graphs, using an independently trained deep learning model whose output serves as unary input features. Contrary to these works, we focus on more general graphs. Other works suggest kernels towards nonlinear SSVMs lucchi; bertelli2011kernelized; we approach nonlinearity by representing SSVM factors by arbitrarily deep neural networks.

Do and Artières do2010neural propose a CRF in which potentials are represented by multilayer networks. The performance of their linear-chain probabilistic model is demonstrated on optical character and speech recognition, using two-hidden-layer neural network outputs as unary potentials. Furthermore, joint inference and learning in linear-chain models is also proposed by Peng et al. peng2009conditional; however, the application to more general graphs remains an open problem amullerthesis. Contrary to these works, we propose a nonprobabilistic approach for general graphs by also modeling nonlinear interaction factors. More recently, Schwing and Urtasun schwing2015fully train a convolutional network as a unary classifier jointly with a fully-connected CRF for the task of image segmentation, similar to Tompson2014; krahenbuhl2013parameter. Chen et al. Chen2014 advocate a joint learning and reasoning approach, in which a structured model is probabilistically trained using loopy belief propagation for the tasks of optical character recognition and image tagging. Other related work includes Domke domke2013structured, who uses relaxations for combined message-passing and learning.

Other related work aiming to improve conventional SSVMs includes that of Wang et al. wang2013incorporating and Lin et al. lin2015discriminatively, in which a hierarchical part-based model is proposed for multiclass object recognition and shape detection, focusing on model reconfigurability through compositional alternatives in And-Or graphs. Liang et al. liang2015deep propose the use of convolutional neural networks to model an end-to-end relation between input images and structured outputs in active template regression. Xu et al. xu2014compositional propose the learning of a structured model with multilayer deformable parts for action understanding, while Lu et al. lu2015human propose a hierarchical structured model for action segmentation.

Many of these works use probabilistic models trained by minimizing the negative log-likelihood, such as do2010neural; peng2009conditional. In contrast, this paper takes a nonprobabilistic approach, wherein an SSVM is optimized via subgradient descent. The algorithm is altered to back-propagate SSVM loss errors, based on the ground truth and a loss-augmented prediction, into the factors. Moreover, all factors are nonlinear functions, allowing the learning of complex patterns that originate from interaction features.

3 Methodology

In this section, essential SSVM background is introduced, after which integrated inference and back-propagation are explained for nonlinear unary factors. Finally, this notion is generalized into an SSVM model using only neural factors, which are optimized by an alteration of subgradient descent.

3.1 Background

Traditional classification models are based on a prediction function that outputs a scalar. In contrast, structured prediction models define a prediction function $g : \mathcal{X} \to \mathcal{Y}$ whose output can be arbitrarily structured. In this paper, this structure is represented by a vector in $\mathcal{Y} = \mathcal{L}^{V}$, with $\mathcal{L}$ a set of class labels and $V$ the number of output variables. Structured models employ a compatibility function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, parametrized by $w$. Prediction is done by solving the following maximization problem:

$$y^{*} = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y; w). \qquad (1)$$

This is called inference, i.e., obtaining the most-likely assignment of labels, which is similar to maximum-a-posteriori (MAP) inference in probabilistic models. Because of the combinatorial complexity of the output space $\mathcal{Y}$, the maximization problem in Eq. (1) is NP-hard Chen2014. Hence, it is important to impose on $f$ some kind of regularity that can be exploited for inference. This can be done by ensuring that $f$ corresponds to a nonprobabilistic factor graph, for which efficient inference techniques exist Nowozin:2011:SLP:2185833.2185834. In general, $f$ is linearly parametrized as a product of a weight vector $w$ and a joint feature function $\phi(x, y)$.

Commonly, $f$ decomposes as a sum of unary and interaction factors¹, i.e., $f(x, y; w) = f_u(x, y; w_u) + f_I(x, y; w_I)$. The functions $f_u$ and $f_I$ are then sums over all individual joint input-output features of the nodes and interactions of the corresponding factor graph Nowozin:2011:SLP:2185833.2185834; lucchi. For example, in the use case of Section 4, nodes are image regions, while interactions are connections between regions, each with their own joint feature vector. Data samples conform to this graphical structure, i.e., an input $x$ is composed of unary features $x^{(u)}_i$ per node and interaction features $x^{(I)}_{ij}$ per interaction. Moreover, the unary and interaction parameters are generally concatenated as $w = [w_u; w_I]$.

¹ Maximizing $f$ corresponds to minimizing the state of a nonprobabilistic factor graph, which factorizes into a product of factors. However, by operating in the log-domain, the state decomposes as a sum of factors.
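As an illustration of this decomposition, the following numpy sketch scores a labeling as the sum of unary and pairwise factor values over a factor graph. The tabular factor representation (precomputed score arrays) is an assumption made for clarity, not the paper's implementation.

```python
import numpy as np

def compatibility(unary_scores, pair_scores, edges, y):
    """Score a labeling y as the sum of unary and pairwise factor values.

    unary_scores: (V, L) array, unary factor value per node and label.
    pair_scores:  (E, L, L) array, pairwise factor value per edge and label pair.
    edges:        (E, 2) array of node index pairs.
    y:            (V,) array of label indices.
    """
    unary = unary_scores[np.arange(len(y)), y].sum()
    pair = pair_scores[np.arange(len(edges)), y[edges[:, 0]], y[edges[:, 1]]].sum()
    return float(unary + pair)
```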

In this formulation, the unary part of the joint features is defined as

$$\phi_u(x, y) = \sum_{i=1}^{V} e_{y_i} \otimes x^{(u)}_i, \qquad (2)$$

while the interaction features for second-order (edge) interactions are defined as

$$\phi_I(x, y) = \sum_{(i,j) \in E} \big(e_{y_i} \otimes e_{y_j}\big) \otimes x^{(I)}_{ij}, \qquad (3)$$

with $x^{(u)}_i$ the unary features corresponding to node $i$, $x^{(I)}_{ij}$ the interaction features corresponding to interaction (edge) $(i, j)$, and $e_l$ the indicator vector of label $l$. Similarly, higher-order interaction features can be incorporated by extending these joint features to higher-order combinations of nodes, according to the interactions. In the experiments of this paper, unary features are bag-of-words features corresponding to each superpixel. Interaction features are also bags-of-words, but this time corresponding to all connected superpixels.
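The sketch below shows one standard way to assemble such a joint feature vector from per-node and per-edge features, placing each feature vector in the slot of its label (pair); the exact layout used in the paper may differ.

```python
import numpy as np

def joint_features(x_unary, x_pair, edges, y, n_labels):
    """Build the joint feature vector phi(x, y) = [phi_u; phi_I].

    x_unary: (V, Du) unary feature vectors per node.
    x_pair:  (E, Dp) interaction feature vectors per edge.
    edges:   (E, 2) node index pairs.
    y:       (V,) label assignment.
    """
    phi_u = np.zeros((n_labels, x_unary.shape[1]))
    for i, xi in enumerate(x_unary):
        phi_u[y[i]] += xi                    # node features go to the row of their label
    phi_I = np.zeros((n_labels, n_labels, x_pair.shape[1]))
    for e, (i, j) in enumerate(edges):
        phi_I[y[i], y[j]] += x_pair[e]       # edge features go to the cell of their label pair
    return np.concatenate([phi_u.ravel(), phi_I.ravel()])
```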

In an SSVM the compatibility function is linearly parametrized as

$$f(x, y; w) = \langle w, \phi(x, y) \rangle$$

and optimized effectively by minimizing an empirical estimate of the regularized structured risk

$$\min_{w} \; R(w) + \frac{C}{N} \sum_{n=1}^{N} \Delta\big(y_n, y^{*}(x_n; w)\big), \qquad (4)$$

with $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ a structured loss function for which $\Delta(y, y') \geq 0$, $\Delta(y, y) = 0$, and $\Delta(y, y') = \Delta(y', y)$ hold; $R$ a regularization function; $C$ the inverse of the regularization strength; and $\{(x_n, y_n)\}_{n=1}^{N}$ a set of training samples that can each be decomposed into nodes and interactions. In this paper, we make use of $\ell_2$-regularization, hence $R(w) = \tfrac{1}{2}\lVert w \rVert_2^2$. Furthermore, in line with our image segmentation use case in Section 4, the loss function is the class-weighted Hamming distance between two label assignments, or

$$\Delta(y_n, y) = \sum_{i=1}^{V_n} c_{y_{n,i}} \, [\![\, y_{n,i} \neq y_i \,]\!], \qquad (5)$$

with $[\![\cdot]\!]$ the Iverson brackets, $c_l$ the weight of class $l$, and $V_n$ the number of nodes (i.e., inputs to the unary factors, which corresponds to the number of nodes in the underlying factor graph) in the $n$-th training sample. Contrary to maximum likelihood approaches do2010neural; Chen2014; krahenbuhl2013parameter, the Hamming distance allows us to directly maximize performance metrics regarding accuracy. By setting all class weights to one we can focus on node-wise accuracy, while setting them to the inverse class frequencies allows us to focus on class-mean accuracy.
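For concreteness, the class-weighted Hamming loss of Eq. (5) can be computed as in the following short sketch; the per-class weight vector is supplied by the caller.

```python
import numpy as np

def weighted_hamming(y_true, y_pred, class_weights):
    """Class-weighted Hamming distance of Eq. (5).

    y_true, y_pred: (V,) integer label arrays.
    class_weights:  (L,) per-class weights; all ones focuses on node-wise
                    accuracy, inverse class frequencies on class-mean accuracy.
    """
    mismatch = (y_true != y_pred).astype(float)
    return float(np.sum(class_weights[y_true] * mismatch))
```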

Due to the piecewise nature of the loss function $\Delta$, traditional gradient-based optimization techniques are ineffective for solving Eq. (4). However, according to Zhang zhang2004statistical, the equations

$$\min_{w} \; R(w) + \frac{C}{N} \sum_{n=1}^{N} \ell_n(w), \qquad (6)$$
$$\ell_n(w) = \max_{y \in \mathcal{Y}} \big[ \Delta(y_n, y) + f(x_n, y; w) \big] - f(x_n, y_n; w), \qquad (7)$$

define a continuous and convex upper bound for the actual structured risk in Eq. (4) that can be minimized effectively by solving Eq. (6) through numerical optimization Nowozin:2011:SLP:2185833.2185834; zhang2004statistical.
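Given the maximizer of the inner maximization in Eq. (7) (the loss-augmented prediction discussed in the next section), the hinge term $\ell_n(w)$ reduces to a few vector operations. A minimal sketch, assuming the joint feature function and the structured loss are supplied as callables:

```python
import numpy as np

def structured_hinge(w, phi, x, y_true, y_hat, loss):
    """Margin-rescaled hinge loss l_n(w) of Eq. (7).

    phi(x, y) returns the joint feature vector and loss(y_true, y_hat) the
    structured loss; y_hat is the loss-augmented prediction. The clamp at zero
    only matters when y_hat is an approximate maximizer.
    """
    margin = loss(y_true, y_hat) + w @ phi(x, y_hat) - w @ phi(x, y_true)
    return max(0.0, float(margin))
```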

3.2 Integrated back-propagation and inference

[Algorithm 1: subgradient descent SSVM training with integrated loss-augmented inference and back-propagation into the neural unary factors]

Traditional SSVM training methods optimize a joint parameter vector of the unary and interaction factors. However, they restrict these parameters to linear combinations of input features, or allow limited nonlinearity through the addition of kernels. The objective function in case of arbitrary nonlinear factors is often hard to optimize, as many numerical optimization methods require a convex objective function formulation. For example, cutting plane training requires the conversion of the max-operation in Eq. (7) to a set of linear constraints for its quadratic programming procedure joachims2009cutting; block-coordinate Frank-Wolfe SSVM optimization ICML2013_lacoste-julien13 assumes linear input dependencies; the structured perceptron similarly assumes a linear parametrization collins2002discriminative; and dual coordinate descent focuses on solving the dual of the linear loss formulation of SSVMs chang2013dual.

Subgradient descent minimization, as described in Nowozin:2011:SLP:2185833.2185834; book:shor1985, is a flexible tool for optimizing Eq. (6) as it naturally allows error back-propagation. This algorithm alternates between two steps. First,

$$\hat{y}_n = \operatorname*{argmax}_{y \in \mathcal{Y}} \big[ \Delta(y_n, y) + f(x_n, y; w) \big] \qquad (8)$$

is calculated for all training samples, which is called the loss-augmented inference or prediction step, derived from Eq. (7). In this paper, general inference for determining Eq. (1) is approximated via the $\alpha$-expansion boykov2001fast algorithm, whose effectiveness has been validated through extensive experiments Peng20131020. Loss-augmented prediction as in Eq. (8) is incorporated into this procedure by adding the loss term to the unary factors.
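As a concrete illustration of how the loss term can be folded into the unary factors, the sketch below augments a table of unary factor values with the class-weighted Hamming loss of Eq. (5); any inference routine that maximizes the total score over the augmented unaries then returns the loss-augmented prediction of Eq. (8). This is a minimal numpy sketch of the general idea, not the paper's implementation.

```python
import numpy as np

def loss_augmented_unaries(unary_scores, y_true, class_weights):
    """Fold the class-weighted Hamming loss into the unary factors.

    unary_scores:  (V, L) unary factor values.
    y_true:        (V,) ground-truth labels.
    class_weights: (L,) class weights of the Hamming loss.
    """
    V = unary_scores.shape[0]
    augmented = unary_scores + class_weights[y_true][:, None]  # bonus for every label...
    augmented[np.arange(V), y_true] -= class_weights[y_true]   # ...except the true one
    return augmented
```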

Second, these $\hat{y}_n$-values are used to calculate a subgradient² of Eq. (6) as $g_n = \phi(x_n, \hat{y}_n) - \phi(x_n, y_n)$ for each sample $n$, in order to update $w$. Traditional SSVMs assume that $f(x, y; w) = \langle w, \phi(x, y) \rangle$, in which $\phi$ is a predefined joint input-output feature function. Commonly, this joint function is made up of the outputs of a nonlinear 'unary' classifier, such that the unary part of $\phi$ becomes a function of these classifier outputs Houthooft-aaai-2016. This classifier is trained upfront, based on the different unary inputs corresponding to each node in the underlying factor graph. Due to the linear definition of $f$, the SSVM model is learning linear combinations of these classifier outputs as its unary factors. In general, the interaction factors are not trained through a separate classifier, and are thus linear combinations of the interaction features directly.

² $g$ is a subgradient of a function $h$ at a point $w_0$ if $h(w) - h(w_0) \geq \langle g, w - w_0 \rangle$ for all $w$. Due to its piecewise continuous nature, Eq. (6) is nondifferentiable in some points, hence we are forced to rely on subgradients.

We propose to replace the pretraining of a nonlinear unary classifier, and the transformation of its outputs through linear factors, by the direct optimization of nonlinear unary factors. In particular, the unary part of $f$ is represented by a sum of outputs of an adapted neural network which models factor values. To achieve this, the loss-augmented prediction step defined in Eq. (8) is altered to

$$\hat{y}_n = \operatorname*{argmax}_{y \in \mathcal{Y}} \big[ \Delta(y_n, y) + f_u(x_n, y; \theta_u) + \langle w_I, \phi_I(x_n, y) \rangle \big], \qquad (9)$$

in which $\phi_I$ represents the joint interaction feature function as described in Section 3.1 and Eq. (3). Eq. (9) is calculated similarly to Eq. (8) through $\alpha$-expansion by encoding the loss term into the unary factors.

The compatibility function thus becomes $f(x, y) = f_u(x, y; \theta_u) + \langle w_I, \phi_I(x, y) \rangle$. The calculation of the update direction, originally defined as the subderivative of the objective function in Eq. (6), remains unaltered. However, we can no longer assume that it conforms to the definition of a subgradient, due to the nonconvexity of $f_u$. Nevertheless, we can calculate the gradient

$$\nabla_{\theta_u} \sum_{n \in \mathcal{A}} \big[ f_u(x_n, \hat{y}_n; \theta_u) - f_u(x_n, y_n; \theta_u) \big], \qquad (10)$$

with $\mathcal{A}$ the set of indices corresponding to training samples for which $\ell_n > 0$ in Eq. (7), for a particular loss-augmented prediction $\hat{y}_n$. In case $\ell_n \leq 0$, the sample's contribution is set to zero. This gradient incorporates the loss-augmented prediction of Eq. (9) and is back-propagated through the underlying network to adjust each element of $\theta_u$. The altered subgradient descent method is shown in Algorithm 1. Herein, $O_n$ represents the objective function for the $n$-th training sample, i.e., the $n$-th term of Eq. (6).
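To make the interplay between loss-augmented inference and back-propagation concrete, the following numpy sketch performs one such update for a one-hidden-layer unary network combined with linear pairwise factors. It is a schematic illustration rather than the paper's Algorithm 1: regularization, momentum, and class weights are omitted, the ReLU hidden layer is an assumption, and `infer` stands in for an $\alpha$-expansion routine that maximizes the given scores.

```python
import numpy as np

def train_step(params, sample, infer, lr=1e-3):
    """One schematic subgradient/back-propagation step.

    params: dict with 'W1' (Du, H), 'b1' (H,), 'W2' (H, L), 'b2' (L,), 'w_pair' (L, L, Dp).
    sample: (x_unary (V, Du), x_pair (E, Dp), edges (E, 2), y_true (V,)).
    infer:  callable(unary_scores, pair_scores, edges) -> labeling maximizing the score.
    """
    x_unary, x_pair, edges, y_true = sample
    V, L = x_unary.shape[0], params['W2'].shape[1]

    # forward pass of the unary network (no softmax: raw factor values)
    h = np.maximum(0.0, x_unary @ params['W1'] + params['b1'])        # (V, H)
    unary = h @ params['W2'] + params['b2']                           # (V, L)
    # linear pairwise factors
    pair = np.einsum('ed,kld->ekl', x_pair, params['w_pair'])         # (E, L, L)

    # loss-augmented inference: fold an (unweighted) Hamming loss into the unaries
    aug = unary + 1.0
    aug[np.arange(V), y_true] -= 1.0
    y_hat = infer(aug, pair, edges)

    def score(y):  # compatibility f(x, y)
        return (unary[np.arange(V), y].sum()
                + pair[np.arange(len(edges)), y[edges[:, 0]], y[edges[:, 1]]].sum())

    # skip the update when the hinge term of Eq. (7) is inactive, cf. Eq. (10)
    if np.sum(y_hat != y_true) + score(y_hat) - score(y_true) <= 0:
        return params

    # gradient of f(x, y_hat) - f(x, y_true) w.r.t. the unary factor values
    g_unary = np.zeros_like(unary)
    g_unary[np.arange(V), y_hat] += 1.0
    g_unary[np.arange(V), y_true] -= 1.0

    # back-propagate into the unary network
    gh = (g_unary @ params['W2'].T) * (h > 0)
    params['W2'] -= lr * (h.T @ g_unary)
    params['b2'] -= lr * g_unary.sum(axis=0)
    params['W1'] -= lr * (x_unary.T @ gh)
    params['b1'] -= lr * gh.sum(axis=0)

    # linear pairwise update: phi_I(x, y_hat) - phi_I(x, y_true)
    g_pair = np.zeros_like(params['w_pair'])
    for e, (i, j) in enumerate(edges):
        g_pair[y_hat[i], y_hat[j]] += x_pair[e]
        g_pair[y_true[i], y_true[j]] -= x_pair[e]
    params['w_pair'] -= lr * g_pair
    return params
```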

[Algorithm 2: subgradient descent SSVM training with both neural unary and neural interaction factors]

In contrast to gradient descent, subgradient methods Nowozin:2011:SLP:2185833.2185834; book:shor1985 do not guarantee a decrease of the objective function value in each step. Therefore, the current best objective value is memorized in each iteration $t$, along with the corresponding parameter values, so that the reported objective value is nonincreasing, i.e., $O^{(t)}_{\text{best}} = \min_{s \leq t} O^{(s)}$. This update rule is omitted from Algorithm 1 to improve readability.
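A minimal sketch of this bookkeeping, assuming a hypothetical `step` callable that performs one pass of Algorithm 1 or 2 and returns the updated parameters together with the objective value:

```python
import numpy as np

def run_subgradient(step, params, n_iters):
    """Track the best iterate, since subgradient steps need not decrease the objective."""
    best_obj, best_params = np.inf, params
    for _ in range(n_iters):
        params, obj = step(params)
        if obj < best_obj:
            best_obj = obj
            best_params = {k: v.copy() for k, v in params.items()}
    return best_params, best_obj
```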

Because the loss terms in Eq. (7) are no longer affine input transformations due to the introduced nonlinearities of the neural network, we can no longer assume Eq. (6) to be convex, as is the case for conventional SSVMs. Although theoretical guarantees can be made for the convergence of (sub)gradient methods for convex functions nedic2001convergence, and for particular classes of nonconvex functions bagirov2013subgradient, no such guarantees can be made for arbitrary nonconvex functions ngiam2011optimization. The problem of optimizing highly nonconvex functions is studied extensively in the neural network gradient descent literature. However, it has been demonstrated that nonconvex objectives can be minimized effectively due to the high dimensionality of the neural network parameter space pascanu2014saddle. Dauphin et al. dauphin2014identifying show that saddle points are much likelier than local minima in multilayer neural network objective landscapes. In particular, the ratio of saddle points to local minima increases exponentially with the parameter dimensionality. Several methods exist to avoid these saddle points, e.g., momentum sutskever2013importance. Furthermore, Dauphin et al. dauphin2014identifying show, based on random matrix theory, that the existing local minima are very close to the global minimum of the objective function. This can be understood intuitively: the probability that all directions surrounding a local minimum lead upwards is very small, making local minima not an issue in general. The empirical results presented in Section 4.2 reinforce this belief by demonstrating that the regularized objective function can still be minimized effectively, as we achieve accurate predictions.

As described in Algorithm 1, the (sub)gradient is defined over whole data samples, which each consist of multiple nodes. The network thus models the unary part of the compatibility function $f$, which is a sum of the unary factors. Therefore, the function $f_u$ decomposes as a sum of neural unary factors

$$f_u(x, y; \theta_u) = \sum_{i=1}^{V} h_u\big(x^{(u)}_i; \theta_u\big)_{y_i}, \qquad (11)$$

with $x^{(u)}_i$ the unary features in $x$. The nonlinear function $h_u$ is a multiclass multilayer neural network parametrized by $\theta_u$, whose inputs are the features corresponding to the different nodes. It forms a template for the neural unary factors. In this network $h_u$, the softmax-function is removed from the output layer, such that it matches the unary factor range $\mathbb{R}$. The node label $y_i$ is used as an index to select a particular output unit.
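A minimal sketch of such a neural unary factor template, assuming a single ReLU hidden layer and a zero-initialized top layer (both assumptions; the architectures used in the experiments vary):

```python
import numpy as np

class UnaryFactorNet:
    """One-hidden-layer network whose L output units are unary factor values
    (softmax removed, so the range is all of R)."""

    def __init__(self, d_in, d_hidden, n_labels, rng=np.random):
        self.W1 = rng.randn(d_in, d_hidden) * np.sqrt(2.0 / d_in)
        self.b1 = np.zeros(d_hidden)
        self.W2 = np.zeros((d_hidden, n_labels))  # zero top layer: factors start at 0
        self.b2 = np.zeros(n_labels)

    def scores(self, x_unary):
        """x_unary: (V, d_in) node features -> (V, n_labels) factor values."""
        h = np.maximum(0.0, x_unary @ self.W1 + self.b1)
        return h @ self.W2 + self.b2

    def unary_term(self, x_unary, y):
        """f_u(x, y) of Eq. (11): sum of the factor values selected by the labels y."""
        s = self.scores(x_unary)
        return float(s[np.arange(len(y)), y].sum())
```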

3.3 Neural interaction factors

In this section we extend the notion of nonlinear factors beyond the integration of the training of a unary classifier. We now also replace the linear interaction part of the compatibility function by a function $f_I$ that decomposes as a sum of neural interaction factors

$$f_I(x, y; \theta_I) = \sum_{(i,j) \in E} h_I\big(x^{(I)}_{ij}; \theta_I\big)_{(y_i, y_j)}, \qquad (12)$$

with $x^{(I)}_{ij}$ the interaction features in $x$, $(y_i, y_j)$ the combination of node labels in the interaction, and $|E|$ the number of interactions in the training sample. The function $h_I$ is parametrized by $\theta_I$ and forms a template for the interaction factors. Herein, the label combination depends on the interaction order; in the Section 4 use case it is a pair of labels, as connections between nodes are then edges. Interaction factors are generally not trained upfront. However, neural interaction factors are useful as they can extract more complex interaction patterns, and thus transcend the limited generalization power of linear combinations. In image segmentation, for example, interaction features consisting of vertical gradients in combination with a particular edge angle can indicate that the two connected nodes belong to the same class. The loss-augmented inference step in Eq. (9) is now adapted to

$$\hat{y}_n = \operatorname*{argmax}_{y \in \mathcal{Y}} \big[ \Delta(y_n, y) + f_u(x_n, y; \theta_u) + f_I(x_n, y; \theta_I) \big], \qquad (13)$$

while the compatibility function becomes $f(x, y) = f_u(x, y; \theta_u) + f_I(x, y; \theta_I)$. The two distinct networks $h_u$ and $h_I$ are trained in a similar fashion to the method described in Algorithm 1, as depicted in Algorithm 2. Notice that this method can easily be adjusted for batch or online learning by moving the weight updates of the back-propagation step into the inner loop.

Like the unary network in Eq. (11), $h_I$ is a multiclass multilayer neural network in which the top softmax-function is removed; it is shared among all interaction factors. The output layer dimension matches the number of interaction label combinations, $|\mathcal{L}|^2$ for edges in the most general case. For a problem with symmetric edge features, as in our image segmentation experiments, the number of output units of $h_I$ reduces to the number of unordered label pairs, which all represent different states of a particular interaction factor (in this case the interactions are undirected edges, and the label combination consists of the labels of the edge's incident nodes).
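A corresponding sketch of a neural interaction factor template for undirected edges, with one output unit per unordered label pair (the symmetric case discussed above); the ReLU hidden layer and the zero top-layer initialization are again assumptions.

```python
import numpy as np

class InteractionFactorNet:
    """One-hidden-layer network whose outputs are interaction factor values,
    one unit per unordered label pair (symmetric edge features)."""

    def __init__(self, d_in, d_hidden, n_labels, rng=np.random):
        n_out = n_labels * (n_labels + 1) // 2            # unordered label pairs
        self.W1 = rng.randn(d_in, d_hidden) * np.sqrt(2.0 / d_in)
        self.b1 = np.zeros(d_hidden)
        self.W2 = np.zeros((d_hidden, n_out))
        self.b2 = np.zeros(n_out)
        # map an ordered label pair to its shared output unit
        self.pair_index = np.zeros((n_labels, n_labels), dtype=int)
        idx = 0
        for a in range(n_labels):
            for b in range(a, n_labels):
                self.pair_index[a, b] = self.pair_index[b, a] = idx
                idx += 1

    def scores(self, x_pair):
        """x_pair: (E, d_in) edge features -> (E, L, L) factor values."""
        h = np.maximum(0.0, x_pair @ self.W1 + self.b1)
        out = h @ self.W2 + self.b2                        # (E, n_out)
        return out[:, self.pair_index]                     # expand to a label-pair table

    def interaction_term(self, x_pair, edges, y):
        """f_I(x, y) of Eq. (12): sum of factor values selected by the edge label pairs."""
        s = self.scores(x_pair)
        return float(s[np.arange(len(edges)), y[edges[:, 0]], y[edges[:, 1]]].sum())
```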

The resulting structured predictor no longer requires two-phase training in which linear interaction factors are combined with the upfront training of a unary classifier, whose output is transformed linearly into unary factor values. It makes use of highly nonlinear functions for all SSVM factors, by way of multilayer neural networks, using an integration of loss-augmented inference and back-propagation in a subgradient descent framework. This allows the factors to generalize strongly while being able to mutually adapt to each other’s parameter updates, leading to more accurate predictions.

4 Experiments

Figure 1: Illustrative examples of the performance of SGD and int+nrl on several MSRC-21 test images (rows: ground truth (GT), SGD, int+nrl). Integrated training with neural factors improves classification accuracy over subgradient descent. The last column presents a case in which our model fails to outperform SGD.

Figure 2: Illustrative examples of the performance of SGD and int+nrl on several KITTI test images (rows: ground truth (GT), SGD, int+nrl). Integrated training with neural factors improves classification accuracy over subgradient descent. The last column presents a case in which our model fails to outperform SGD.

Figure 3: Illustrative examples of the performance of SGD and int+nrl on several SIFT Flow test images (rows: ground truth (GT), SGD, int+nrl). Integrated training with neural factors improves classification accuracy over subgradient descent. The last column presents a case in which our model fails to outperform SGD.

In this section, our model is analyzed on the task of image segmentation. Herein, the goal is to label different image regions with a correct class label. This is cast into a structured prediction problem by predicting all image region class labels simultaneously. There is one unary factor in the underlying SSVM graphical structure for every image region, while interactions represent edges between neighboring regions. First, our model is analyzed and its different variants are compared to conventional SSVM training schemes. Second, the best performing variant is compared with state-of-the-art segmentation approaches. Our model is implemented as an extension of PyStruct JMLR:v15:mueller14a, using Theano theano12 for GPU-accelerated neural factor optimization.

4.1 Experimental setup

Classes (left to right): building, grass, tree, cow, sheep, sky, aeropl., water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, boat; the final two columns are pixel-wise and class-mean accuracy.

unary               15 60 52 08 10 68 35 46 12 21 21 42 09 02 36 00 21 14 05 06 01   36.3 23.1
CP                  44 77 61 48 21 85 60 69 51 70 63 54 49 16 87 21 41 47 06 16 33   59.4 48.5
SGD                 49 67 71 39 64 80 81 67 35 74 60 42 19 02 88 51 53 38 04 31 26   59.2 49.6
int+lin             48 76 83 67 73 94 78 67 59 56 68 65 48 14 95 43 61 53 06 45 32   67.4 58.5
bif+nrl             46 74 79 51 51 92 83 64 76 64 67 50 53 09 83 34 42 42 00 47 22   62.7 53.7
int+nrl             53 77 86 61 73 95 83 60 87 77 72 69 77 27 85 29 67 46 00 57 26   70.1 62.3
int+lin (reduced)   46 67 80 47 69 83 79 60 35 66 63 53 10 02 89 43 66 62 04 45 17   61.2 51.7
3-layer (int+nrl)   62 76 87 68 77 94 81 66 84 65 75 53 69 33 81 51 67 58 30 64 25   71.6 65.1

Table 1: MSRC-21 class, pixel-wise, and class-mean test accuracy (in %) for different models. The reduced int+lin variant uses as many unary hidden units as there are classes (see Section 4.2).

Classes (left to right): sky, building, road, sidewalk, fence, vegetation, pole, car; the final two columns are pixel-wise and class-mean accuracy.

unary               75 63 59 29 08 71 0 38   53.8 42.8
CP                  84 76 75 11 05 75 0 48   61.5 46.7
SGD                 77 68 86 19 04 80 0 71   65.5 50.6
int+lin             86 76 82 42 23 81 6 67   70.2 57.8
bif+nrl             86 77 81 41 12 80 0 71   70.0 55.9
int+nrl             86 83 88 50 19 84 4 74   75.6 60.9
int+lin (reduced)   81 76 85 22 12 82 0 70   69.2 53.5
3-layer (int+nrl)   90 82 88 55 28 87 1 78   77.6 63.6

Table 2: KITTI class, pixel-wise, and class-mean test accuracy (in %) for different models. The reduced int+lin variant uses as many unary hidden units as there are classes (see Section 4.2).

                    pixel  class
unary                44.7    7.5
CP                   62.5   13.8
SGD                  65.9   15.3
int+lin              70.3   16.2
bif+nrl              68.8   16.1
int+nrl              71.3   17.0
int+lin (reduced)    70.2   15.6
3-layer (int+nrl)    71.5   17.2

Table 3: SIFT Flow pixel-wise and class-mean test accuracy (in %) for different models.

The model analysis experiments are executed on the widely-used MSRC-21 benchmark shotton2009textonboost, which provides separate training, validation, and testing images. This benchmark is sufficiently complex with its 21 classes and noisy labels, and focuses on object delineation as well as irregular background recognition. Furthermore, the experiments are executed on the KITTI benchmark ros:2015, consisting of separate training and testing images, augmented with the additional training images of Kundu et al. augm. This latter benchmark contains more classes, but we drop the least frequently-occurring ones as they are insufficiently represented in the dataset. Finally, the same experiment is repeated for a larger dataset, namely the SIFT Flow benchmark liu2011sift, consisting of 33 classes with separate training and testing images.

All image pixels are clustered into regions using the SLIC Achanta:2012:SSC:2377349.2377556 superpixel algorithm. For each region, gradient (DAISY Tola10) and color (in HSV-space) features are densely extracted. These features are transformed two times into separate bags-of-words via minibatch $k$-means clustering (once with 60 gradient and 30 color words, once with 10 and 5 words). The unary input vectors are 90-D concatenations of the first two bags-of-words. The model's connectivity structure links together all neighboring regions via edges. The edge/interaction input vectors are based on concatenations of the second set of bags-of-words: both 15-D input vectors of the edge's incident regions are concatenated into a 30-D vector. Moreover, two edge-specific features are added, namely the distance and angle between adjacent superpixel centers, leading to 32-D interaction feature vectors.
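A rough sketch of this kind of superpixel bag-of-words pipeline, using scikit-image and scikit-learn; it only covers the gradient (DAISY) words, and all parameter values are illustrative rather than the paper's.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.color import rgb2gray
from skimage.feature import daisy
from sklearn.cluster import MiniBatchKMeans

def region_bags_of_words(image, n_segments=300, n_words=60, step=4, radius=15):
    """Per-superpixel bag-of-words over densely extracted DAISY descriptors."""
    segments = slic(image, n_segments=n_segments, compactness=10)   # SLIC over-segmentation
    descs = daisy(rgb2gray(image), step=step, radius=radius)        # dense gradient descriptors
    gh, gw, d = descs.shape
    flat = descs.reshape(-1, d)
    km = MiniBatchKMeans(n_clusters=n_words, n_init=3).fit(flat)
    words = km.predict(flat)
    # map each descriptor back to the superpixel at its image location
    ys, xs = np.mgrid[0:gh, 0:gw]
    seg_ids = segments[radius + ys.ravel() * step, radius + xs.ravel() * step]
    bows = np.zeros((segments.max() + 1, n_words))
    for s, w in zip(seg_ids, words):
        bows[s, w] += 1
    bows /= np.maximum(bows.sum(axis=1, keepdims=True), 1)          # L1-normalize histograms
    return segments, bows
```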

Factors are trained with (regular) momentum, using a decaying learning rate curve governed by an initial learning rate and a decay parameter, with the current training iteration number as used in Algorithms 1 and 2. The regularization, learning rate, and momentum hyperparameter values are tuned using a validation set by means of a coarse- and fine-grained grid search over the parameter spaces, yielding separate settings for the unary and pairwise factors. The linear parameters are initialized to zero, while the neural factor parameters are initialized according to glorot2010understanding, except for the top-layer weights, which are also set to zero. The class weights in Eq. (5) are set to the inverse class frequencies to correct for class imbalance. The model is trained using CPU-parallelized loss-augmented prediction, while the neural factors are trained using GPU parallelism.
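The learning rate schedule in the sketch below is a generic inverse-decay stand-in rather than the paper's tuned curve; classical momentum is used as described above, and the default parameter values are placeholders.

```python
def momentum_step(w, velocity, grad, t, eta0=0.01, decay=1e-3, mu=0.9):
    """Classical momentum update with a decaying learning rate.

    The schedule eta_t = eta0 / (1 + decay * t) is an assumed stand-in for the
    learning rate curve; eta0, decay, and mu would be tuned on the validation
    set as described above.
    """
    eta = eta0 / (1.0 + decay * t)
    velocity = mu * velocity - eta * grad
    return w + velocity, velocity
```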

The following models are compared: unary-only (unary), cutting plane training (CP) with delayed constraint generation, subgradient descent (SGD)³, integrated training with neural unary and linear interaction factors (int+lin), bifurcated training with neural interaction factors (bif+nrl), and integrated training with neural unary and neural interaction factors (int+nrl).

³ SGD uses bifurcated training with linear interactions, hence it could be named bif+lin.

Multiclass logistic regression is used as the unary classifier, trained with gradient descent by cross-entropy optimization. All unary neural factors contain a single hidden layer with 256 units, for direct comparison of integrated learning with upfront logistic regression training. The interaction neural factors contain a single hidden layer of 512 units to elucidate the benefit of nonlinear factors, without overly increasing the model's capacity. The experiment is set up to highlight the benefit of integrated learning by restricting the unary factors to features that are insufficiently discriminative on their own. This deliberately leads to noisy unary classification, forcing the model to rely on contextual relations for accurate prediction. The interaction factors encode information about their incident region feature vectors to allow neural factors to extract meaningful patterns from gradient/color combinations. We deliberately encoded less information in the interaction features, such that the model cannot rely solely on interaction factors for accurate and coherent predictions.

4.2 Results and discussion

Accuracy results on the MSRC-21 shotton2009textonboost test images are presented in Table 1, while Figure 1 shows a handful of illustrative examples that compare segmentations attained by SGD with int+nrl. The results of the same experiment for the KITTI benchmark ros:2015, augmented with the additional training images of Kundu et al. augm, are shown in Table 2 and Figure 2. Qualitative results on the SIFT Flow liu2011sift dataset are shown in Figure 3, while accuracy results are shown in Table 3.

Figure 4: Visualization of the synergy between unary and interaction factors. In bifurcated training, the interactions make the unary factors redundant, as these cannot adapt to errors made by the interactions. In integrated training, combining both factor types leads to a higher accuracy, as they can mutually adapt to each other's weight updates.

Classes (left to right): building, grass, tree, cow, sheep, sky, aeropl., water, face, car, bicycle, flower, sign, bird, book, chair, road, cat, dog, body, boat; the final two columns are pixel-wise and class-mean accuracy.

neural factors                            76 94 94 92 97 92 94 85 93 88 94 95 70 78 97 87 88 91 78 88 63   88.9 87.4
Liu et al. Liu:2015:CLC:2796563.2796622   71 95 92 87 98 97 97 89 95 85 96 94 75 76 89 84 88 97 77 87 52   88.5 86.7
------------------------------------------------------------------------------------------------------------------------
Yao et al. yao2012describing              71 98 90 79 86 93 88 86 90 84 94 98 76 53 97 71 89 83 55 68 17   86.2 79.3
Lucchi et al. lucchi2013learning          67 89 85 93 79 93 84 75 79 87 89 92 71 46 96 79 86 76 64 77 50   83.7 78.9
Munoz et al. munoz2010stacked             63 93 88 84 65 89 69 78 74 81 84 80 51 55 84 80 69 47 59 71 24   78   71
Gonfaus et al. gonfaus2010harmony         60 78 77 91 68 88 87 76 73 77 93 97 73 57 95 81 76 81 46 56 46   77   75
Shotton et al. shotton2008semantic        49 88 79 97 97 78 82 54 87 74 72 74 36 24 93 51 78 75 35 66 18   72   67
Lucchi et al. lucchi2012structured        41 77 79 87 91 86 92 65 86 65 89 61 76 48 77 91 77 82 32 48 39   73   70

Table 4: State-of-the-art comparison: MSRC-21 per-class, class-mean, and global pixel-wise test accuracy (in %) for different models. Methods below the horizontal line are less closely related comparison methods (see text).

The results show that unary-only prediction is very inaccurate (pixel-wise/class-mean accuracy of 36.3/23.1% for the MSRC-21 dataset, 53.8/42.8% for the KITTI dataset, and 44.7/7.5% for the SIFT Flow dataset). The reason for this is that unary features are not sufficiently distinctive to allow for differentiation between classes due to their low dimensionality. Accurate predictions are only possible by taking into account contextual output relations, demonstrated by the increased accuracy of CP (MSRC-21: 59.4/48.5%; KITTI: 61.5/46.7%; SIFT Flow: 62.5/13.8%) as well as SGD (MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%). These structured predictors learn linear relations between image regions, which allows them to correct errors originating from the underlying unary classifier. However, the unary factor’s linear weights have only limited capability for error correction in the opposite direction, due to the fact that the SSVM cannot alter the unary classifier parameters post-hoc.

Using an integrated training approach such as int+lin, in which the SSVM is trained end-to-end, improves accuracy (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow: 70.3/16.2%) over the bifurcated procedures CP and SGD. Although neither the unary nor the interaction features are very distinctive, the integrated procedure updates parameters in such a way that both factor types have a unique discriminative focus. Their synergistic relationship ultimately results in higher accuracy. To better compare SGD (which uses 8, 21, and 33 logistic regression outputs as unary input features for the different benchmarks) with int+lin, we also report the accuracy (MSRC-21: 61.2/51.7%; KITTI: 69.2/53.5%; SIFT Flow: 70.2/15.6%) of a reduced int+lin model with only 8, 21, and 33 unary hidden units for the KITTI, MSRC-21, and SIFT Flow datasets, respectively, rather than 256 units. The 2.0/2.1% (MSRC-21), 3.7/2.9% (KITTI), and 4.3/0.3% (SIFT Flow) increases in accuracy over SGD further illustrate the benefit of integrated learning and inference over conventional bifurcated SSVM training.

Another insight gained from the results is that accuracy increases when the linear interaction factors of conventional SSVMs are replaced with neural factors, i.e., int+nrl (MSRC-21: 70.1/62.3%; KITTI: 75.6/60.9%; SIFT Flow: 71.3/17.0%) and bif+nrl (MSRC-21: 62.7/53.7%; KITTI: 70.0/55.9%; SIFT Flow: 68.8/16.1%) outperform int+lin (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow: 70.3/16.2%) and SGD (MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%), respectively. This increase can be attributed to the higher number of parameters, as well as to the added nonlinearities in combination with proper regularization. The model has greater generalization power, allowing the factors to extract more complex and meaningful interaction patterns. Neural factors offer great flexibility, as they can be stacked to arbitrary depths. This leads to even higher generalization, as indicated by the increased accuracy (MSRC-21: 71.6/65.1%; KITTI: 77.6/63.6%; SIFT Flow: 71.5/17.2%) of the deeper 3-layer (int+nrl) model. Herein, both unary and interaction factors are 3-hidden-layer neural networks consisting of 256 and 512 units per layer, respectively (rectified linear units for MSRC-21 and KITTI; a different activation for SIFT Flow). Our model can thus easily be extended, for example by letting neural factors represent the fully-connected layers in convolutional neural networks. As such, it serves as a foundation for more complex structured models.

All methods converge within 600 epochs, with one epoch taking approximately 12.62 seconds for the MSRC-21 dataset, 4.35 seconds for the KITTI dataset, and 197.27 seconds on the SIFT Flow dataset for the int+nrl algorithm. Since the implementation of our algorithm is not optimized for speed, these values can be further reduced by better exploitation of CPU parallelism.

Figure 4 illustrates the synergy between unary and interaction factors achieved through both integrated and bifurcated training, exercised on the MSRC-21 dataset. The bars depict model test accuracy when using only unary or only pairwise factors, obtained by setting the pairwise or the unary factors, respectively, to a zero factor value. Although the unary factors alone perform well in bifurcated training, nearly all accuracy can be attributed to the interactions. A possible explanation is that both types essentially learn the same information. The interactions correct errors of the underlying classifier and ultimately make the unary factors redundant. In integrated training, neither the unary nor the interaction factors alone attain a high accuracy, but the combination of both does.

We explain this synergistic relationship with an example. Suppose the unary factors assign, to a region of class A, the second-highest factor value to class A, the highest value to class B, and a low value to class C. The interactions also assign the second-highest value to class A, but the highest value to class C and a low value to class B. Independently, both factor types incorrectly predict the region of class A as belonging to class B or class C. However, when combined, they correctly assign the highest value to class A. In the figure, bifurcated training shows only limited signs of factor synergy, as the optimization procedure is insufficiently able to steer the unary and pairwise parameters in different directions, which causes them to have a similar discriminative focus. This observation leads us to believe that integrated learning and inference results in higher accuracy through synergistic unary/interaction factor optimization. Both factor types are no longer optimized for independent accuracy, but mutually adapt to each other's parameter updates, which results in enhanced predictive power.

In addition to the previous experiments, the viability of our neural factor model is shown through a comparison with the closely related work of Liu et al. Liu:2015:CLC:2796563.2796622 on the MSRC-21 dataset. Liu et al. make use of features extracted from square regions of varying size around each superpixel by means of a pretrained convolutional neural network. We compare our model with theirs by using OverFeat features sermanet2013overfeat in a similar fashion, trained on individual regions. Furthermore, the model settings have been altered with respect to the previous experiments. More specifically, 1,000 SLIC superpixels are utilized for the over-segmentation preprocessing step, enforcing superpixel connectivity and merging any superpixel with a surface area below a particular threshold. DAISY gradient and HSV color features are extracted on a regular lattice and clustered via minibatch $k$-means clustering. Next, the same type of features is extracted for each individual pixel, leading to the unary and pairwise factor feature vectors. Moreover, the (median-based) center position of the superpixel is included in the unary feature vectors, while the distance and angle between the two superpixel centers are encoded into the interaction feature vectors. The neural factors are represented by multilayer neural networks trained according to Algorithm 2, using conventional momentum and single image-sized batches per gradient update. Classes are balanced by weighing them with the inverse of the class frequency. The results are presented in Table 4, which indicates that our model is capable of performing on par with the current state of the art when used in conjunction with more advanced methods, e.g., OverFeat features. Moreover, similar to Liu et al. Liu:2015:CLC:2796563.2796622, we compare our model with other less closely related methods for completeness; those results are shown below the horizontal line in Table 4.

5 Conclusion

A structured prediction model is proposed that integrates back-propagation and loss-augmented inference into subgradient descent training of structural support vector machines (SSVMs). This model departs from the traditional bifurcated approach in which a unary classifier is trained independently from the structured predictor. Furthermore, the SSVM factors are extended to neural factors, which allows both unary and interaction factors to be highly nonlinear functions of the input features. Results on a complex image segmentation task show that end-to-end SSVM training, and/or the use of neural factors, leads to more accurate predictions than conventional subgradient descent and cutting plane training. The results also show that our model can serve as a foundation for more advanced structured models, e.g., by using latent variables, learned feature representations, or more complex connectivity structures.

Acknowledgments

Rein Houthooft is supported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO). Many thanks to Brecht Hanssens and Cedric De Boom for their insightful comments.

References