1 Introduction
In traditional machine learning, the output consists of a single scalar, whereas in structured prediction, the output can be arbitrarily structured. Structured models have proven useful in tasks where output interactions play an important role. Examples are image segmentation, part-of-speech tagging, and optical character recognition, where taking into account contextual cues and predicting all output variables at once is beneficial. A widely used framework is the conditional random field (CRF), which models the statistical conditional dependencies between input and output variables, as well as among the output variables themselves. However, many tasks only require ‘most-likely’ predictions, which led to the rise of non-probabilistic approaches. Rather than optimizing the Bayes risk, these models minimize a structured loss, allowing the direct optimization of performance indicators Nowozin:2011:SLP:2185833.2185834. One such model is the structural support vector machine (SSVM) tsochantaridis2005large, in which a generalization of the hinge loss to multiclass and multilabel prediction is used.
A downside to traditional SSVM training is the bifurcated training approach, in which unary factors (dependencies of outputs on inputs) and interaction factors (mutual output dependencies) are trained sequentially: a unary classification model is optimized first, while the interactions are trained post hoc. This two-phase approach is suboptimal, because the errors made during the training of the interaction factors cannot be accounted for during training of the unary classifier. Another limitation is that SSVM factors are linear feature combinations, restricting the SSVM’s generalization power. We propose to extend these linearities to highly nonlinear functions by means of multilayer neural networks, to which we refer as neural factors. Towards this goal, subgradient descent is extended by combining loss-augmented inference with backpropagation of the SSVM objective error into both unary and interaction neural factors. This leads to better generalization and more synergy between both SSVM factor types, resulting in more accurate and coherent predictions.
Our model is empirically validated on the complex structured prediction task of image segmentation, using the MSRC-21, KITTI, and SIFT Flow benchmarks. The results demonstrate that integrated inference and learning, and/or using neural factors, improve prediction accuracy over conventional SSVM training methods, such as slack cutting plane and subgradient descent optimization Nowozin:2011:SLP:2185833.2185834. Furthermore, we demonstrate that our model is able to perform on par with current state-of-the-art segmentation models on the MSRC-21 benchmark.
2 Related work
Although the combination of neural networks and structured or probabilistic graphical models dates back to the early ’90s bottou1997global; NIPS1989_195, interest in this topic is resurging. Several recent works introduce nonlinear unary factors/potentials into structured models. For the task of image segmentation, Chen et al. Chen2015 train a convolutional neural network as a unary classifier, followed by the training of a dense random field over the input pixels. Similarly, Farabet et al. farabetpami13 combine the output maps of a convolutional network with a CRF for image segmentation, while Li and Zemel li2014high propose semi-supervised max-margin learning with nonlinear unary potentials. Contrary to these works, we trade the bifurcated training approach for integrated inference and training of unary and interaction factors. Several works collobert2011natural; morris2008conditional; Prabhavalkar2010; Yu2009 focus on linear-chain graphs, using an independently trained deep learning model whose output serves as unary input features. Contrary to these works, we focus on more general graphs. Other works suggest kernels towards nonlinear SSVMs lucchi; bertelli2011kernelized; we approach nonlinearity by representing SSVM factors by arbitrarily deep neural networks.
Do and Artières do2010neural propose a CRF in which potentials are represented by multilayer networks. The performance of their linear-chain probabilistic model is demonstrated on optical character and speech recognition, using two-hidden-layer neural network outputs as unary potentials. Furthermore, joint inference and learning in linear-chain models is also proposed by Peng et al. peng2009conditional; however, the application to more general graphs remains an open problem amullerthesis. Contrary to these works, we propose a non-probabilistic approach for general graphs by also modeling nonlinear interaction factors. More recently, Schwing and Urtasun schwing2015fully train a convolutional network as a unary classifier jointly with a fully connected CRF for the task of image segmentation, similar to Tompson2014; krahenbuhl2013parameter. Chen et al. Chen2014 advocate a joint learning and reasoning approach, in which a structured model is probabilistically trained using loopy belief propagation for the task of optical character recognition and image tagging. Other related work includes Domke domke2013structured, who uses relaxations for combined message-passing and learning.
Other related works that aim to improve conventional SSVMs are those of Wang et al. wang2013incorporating and Lin et al. lin2015discriminatively, in which a hierarchical part-based model is proposed for multiclass object recognition and shape detection, focusing on model reconfigurability through compositional alternatives in And-Or graphs. Liang et al. liang2015deep propose the use of convolutional neural networks to model an end-to-end relation between input images and structured outputs in active template regression. Xu et al. xu2014compositional propose the learning of a structured model with multilayer deformable parts for action understanding, while Lu et al. lu2015human propose a hierarchical structured model for action segmentation.
Many of these works use probabilistic models that minimize the negative log-likelihood, such as do2010neural; peng2009conditional. In contrast, this paper takes a non-probabilistic approach, wherein an SSVM is optimized via subgradient descent. The algorithm is altered to backpropagate SSVM loss errors, based on the ground truth and a loss-augmented prediction, into the factors. Moreover, all factors are nonlinear functions, allowing the learning of complex patterns that originate from interaction features.
3 Methodology
In this section, essential SSVM background is introduced, after which integrated inference and backpropagation are explained for nonlinear unary factors. Finally, this notion is generalized into an SSVM model that uses only neural factors, which are optimized by an alteration of subgradient descent.
3.1 Background
Traditional classification models are based on a prediction function $f: \mathcal{X} \to \mathbb{R}$ that outputs a scalar. In contrast, structured prediction models define a prediction function $f: \mathcal{X} \to \mathcal{Y}$, whose output can be arbitrarily structured. In this paper, this structure is represented by a vector $y \in \mathcal{Y}$ whose entries take values in $\mathcal{L}$, a set of class labels. Structured models employ a compatibility function $g: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, parametrized by $w$. Prediction is done by solving the following maximization problem:
$$f(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \; g(x, y; w) \qquad (1)$$
This is called inference, i.e., obtaining the most-likely assignment of labels, which is similar to maximum-a-posteriori (MAP) inference in probabilistic models. Because of the combinatorial complexity of the output space $\mathcal{Y}$, the maximization problem in Eq. (1) is NP-hard Chen2014. Hence, it is important to impose on $g$ some kind of regularity that can be exploited for inference. This can be done by ensuring that $g$ corresponds to a non-probabilistic factor graph, for which efficient inference techniques exist Nowozin:2011:SLP:2185833.2185834. In general, $g$ is linearly parametrized as a product of a weight vector $w$ and a joint feature function $\varphi(x, y)$.
Commonly, $g$ decomposes as a sum of unary and interaction factors, in which $\varphi(x, y) = [\varphi_U(x, y)^\top, \varphi_I(x, y)^\top]^\top$. (Footnote 1: Maximizing $g$ corresponds to minimizing the state of a non-probabilistic factor graph, which factorizes into a product of factors. However, by operating in the log-domain, the state decomposes as a sum of factors.) The functions $\varphi_U$ and $\varphi_I$ are then sums over all individual joint input-output features of the nodes and interactions of the corresponding factor graph Nowozin:2011:SLP:2185833.2185834; lucchi. For example, in the use case of Section 4, nodes are image regions, while interactions are connections between regions, each with their own joint feature vector. Data samples conform to this graphical structure, i.e., $x$ is composed of unary features $x_U$ and interaction features $x_I$. Moreover, the unary and interaction parameters are generally concatenated as $w = [w_U^\top, w_I^\top]^\top$.
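The decomposition into unary and interaction factors can be made concrete with a small sketch. The code below scores a labeling as a sum of unary and pairwise factor values and solves Eq. (1) by exhaustive search. The function names (`compatibility`, `brute_force_inference`) and the toy score tables are illustrative assumptions, not part of the original model, and exhaustive search is only feasible for very small graphs.

```python
import itertools
import numpy as np

def compatibility(unary, pairwise, edges, y):
    """Sum of unary and pairwise factor values for the labeling y.

    unary:    (n_nodes, n_labels) array of unary factor values
    pairwise: (n_edges, n_labels, n_labels) array of edge factor values
    edges:    list of (i, j) node index pairs
    """
    score = sum(unary[i, y[i]] for i in range(len(y)))
    score += sum(pairwise[e, y[i], y[j]] for e, (i, j) in enumerate(edges))
    return float(score)

def brute_force_inference(unary, pairwise, edges):
    """Exhaustive argmax over all labelings (hypothetical helper, tiny graphs only)."""
    n_nodes, n_labels = unary.shape
    return max(itertools.product(range(n_labels), repeat=n_nodes),
               key=lambda y: compatibility(unary, pairwise, edges, y))

# Toy chain of three nodes with two labels (made-up numbers).
unary = np.array([[2.0, 0.0], [0.4, 0.6], [0.0, 2.0]])
edges = [(0, 1), (1, 2)]
smooth = np.array([[1.0, 0.0], [0.0, 1.0]])  # rewards equal neighboring labels
pairwise = np.stack([smooth, smooth])
y_hat = brute_force_inference(unary, pairwise, edges)
```

In this toy case the middle node follows its slightly preferred label, because either choice agrees with exactly one neighbor.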
In this formulation, the unary features are defined as
(2) 
while the interaction features for 2nd-order (edge) interactions are defined as
(3) 
with the unary features corresponding to node and the interaction features corresponding to interaction (edge) . Similarly, higher-order interaction features can be incorporated by extending this matrix into higher-order combinations of nodes, according to the interactions. In the experiments of this paper, unary features are bag-of-words features corresponding to each superpixel. Interaction features are also bag-of-words features, but this time corresponding to all connected superpixels.
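In an SSVM, the compatibility function is linearly parametrized as
$$g(x, y; w) = \langle w, \varphi(x, y) \rangle$$
and optimized effectively by minimizing an empirical estimate of the regularized structured risk
$$R(w) + \frac{C}{N} \sum_{n=1}^{N} \Delta(y^n, f(x^n)) \qquad (4)$$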
with
a structured loss function for which holds
, , and ; a regularization function; the inverse of the regularization strength; for a set of training samples that can be decomposed into nodes and interactions. In this paper, we make use of regularization, hence . Furthermore, in line with our image segmentation use case in Section 4, the loss function is the class-weighted Hamming distance between two label assignments, or
(5) 
with the Iverson brackets and the number of nodes (i.e., inputs to the unary factors, which corresponds to the number of nodes in the underlying factor graph) in the $n$-th training sample. Contrary to maximum likelihood approaches do2010neural; Chen2014; krahenbuhl2013parameter, the Hamming distance allows us to directly maximize performance metrics regarding accuracy. By setting we can focus on node-wise accuracy, while setting allows us to focus on class-mean accuracy.
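As an illustration of the class-weighted Hamming distance, the following minimal sketch counts mismatched nodes and weights each mismatch by the weight of its ground-truth class. The function name `weighted_hamming`, the division by the number of nodes, and the concrete weight values are assumptions made for this example; uniform weights are used here to stand for a node-wise focus and larger weights on a rare class for a class-mean focus.

```python
import numpy as np

def weighted_hamming(y_true, y_pred, class_weights):
    """Class-weighted Hamming distance between two label assignments.

    Each mismatched node contributes the weight of its ground-truth class.
    Dividing by the number of nodes is an assumption of this sketch.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    mismatch = (y_true != y_pred)                 # Iverson bracket per node
    weights = np.asarray(class_weights)[y_true]   # weight of each true class
    return float(np.sum(weights * mismatch) / len(y_true))

y_true = [0, 0, 0, 1]
y_pred = [0, 1, 0, 0]
uniform = weighted_hamming(y_true, y_pred, [1.0, 1.0])   # every node counts equally
balanced = weighted_hamming(y_true, y_pred, [1.0, 3.0])  # up-weights the rare class 1
```

With uniform weights both errors cost the same, whereas the second setting makes the error on the single class-1 node three times as expensive.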
Due to the piecewise nature of the loss function $\Delta$, traditional gradient-based optimization techniques are ineffective for solving Eq. (4). However, according to Zhang zhang2004statistical, the equations
$$\min_{w} \; R(w) + \frac{C}{N} \sum_{n=1}^{N} \ell(x^n, y^n; w) \qquad (6)$$
$$\ell(x^n, y^n; w) = \max_{y \in \mathcal{Y}} \left[ \Delta(y^n, y) + g(x^n, y; w) - g(x^n, y^n; w) \right] \qquad (7)$$
define a continuous and convex upper bound for the actual structured risk in Eq. (4) that can be minimized effectively by solving Eq. (6) through numerical optimization Nowozin:2011:SLP:2185833.2185834; zhang2004statistical.
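Why Eqs. (6) and (7) bound the structured risk from above can be seen in two steps. The derivation below is a sketch that assumes the standard margin-rescaled form of the per-sample loss; the symbols $\ell$, $g$, $\Delta$, and $\hat{y}$ are notation chosen here for illustration.

```latex
% Assumption: \hat{y} = f(x^n) = \arg\max_{y} g(x^n, y; w) is the prediction of Eq. (1),
% so that g(x^n, \hat{y}; w) - g(x^n, y^n; w) \ge 0.
\begin{aligned}
\Delta(y^n, \hat{y})
  &\le \Delta(y^n, \hat{y}) + g(x^n, \hat{y}; w) - g(x^n, y^n; w) \\
  &\le \max_{y \in \mathcal{Y}} \left[ \Delta(y^n, y) + g(x^n, y; w) - g(x^n, y^n; w) \right]
   = \ell(x^n, y^n; w).
\end{aligned}
```

When $g$ is linear in $w$, each term inside the maximum is affine in $w$, and a pointwise maximum of affine functions is convex, which is why the bound is convex for conventional SSVMs.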
3.2 Integrated backpropagation and inference
Traditional SSVM training methods optimize a joint parameter vector of the unary and interaction factors. However, they restrict these parameters to linear combinations of input features, or allow limited nonlinearity through the addition of kernels. The objective function in case of arbitrary nonlinear factors is often hard to optimize, as many numerical optimization methods require a convex objective function formulation. For example, slack cutting plane training requires the conversion of the max operation in Eq. (7) to a set of linear constraints for its quadratic programming procedure joachims2009cutting; block-coordinate Frank-Wolfe SSVM optimization ICML2013_lacostejulien13 assumes linear input dependencies; the structured perceptron similarly assumes linear parametrization collins2002discriminative; and dual coordinate descent focuses on solving the dual of the linear loss in SSVMs chang2013dual.
Subgradient descent minimization, as described in Nowozin:2011:SLP:2185833.2185834; book:shor1985, is a flexible tool for optimizing Eq. (6) as it naturally allows error backpropagation. This algorithm alternates between two steps. First,
$$\hat{y}^n = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \left[ \Delta(y^n, y) + g(x^n, y; w) \right] \qquad (8)$$
is calculated for all training samples, which is called the loss-augmented inference or prediction step, derived from Eq. (7). In this paper, general inference for determining Eq. (1) is approximated via the $\alpha$-expansion boykov2001fast algorithm, whose effectiveness has been validated through extensive experiments Peng20131020. Loss-augmented prediction as in Eq. (8) is incorporated into this procedure by adding the loss term to the unary factors.
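Because the Hamming loss decomposes over nodes, adding the loss term to the unary factors turns loss-augmented prediction into an ordinary inference problem. The sketch below illustrates this under stated assumptions: an unnormalized class-weighted Hamming loss, a single pairwise table shared by all edges, and an exhaustive search (`argmax_labeling`) standing in for an approximate solver such as alpha-expansion. All names and numbers are hypothetical.

```python
import itertools
import numpy as np

def score(unary, pairwise, edges, y):
    """Compatibility of labeling y: unary values plus shared pairwise values."""
    s = sum(unary[i, y[i]] for i in range(len(y)))
    return float(s + sum(pairwise[y[i], y[j]] for i, j in edges))

def argmax_labeling(unary, pairwise, edges):
    """Exhaustive stand-in for an approximate solver such as alpha-expansion."""
    n_nodes, n_labels = unary.shape
    return max(itertools.product(range(n_labels), repeat=n_nodes),
               key=lambda y: score(unary, pairwise, edges, y))

def loss_augmented_unaries(unary, y_true, class_weights):
    """Add the per-node Hamming loss term to the unary factor values."""
    augmented = unary.copy()
    for i, label in enumerate(y_true):
        augmented[i, :] += class_weights[label]      # penalty for every label ...
        augmented[i, label] -= class_weights[label]  # ... except the true one
    return augmented

rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))
pairwise = rng.normal(size=(3, 3))
edges = [(0, 1), (1, 2), (2, 3)]
y_true = (0, 2, 1, 1)
class_weights = np.array([1.0, 0.5, 2.0])

# Ordinary inference on the augmented unaries yields the loss-augmented prediction.
y_aug = argmax_labeling(loss_augmented_unaries(unary, y_true, class_weights),
                        pairwise, edges)
```

The inference routine itself is unchanged, which is what makes it possible to reuse an existing solver for the loss-augmented step.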
Second, these values are used to calculate a subgradient of Eq. (6) as for each sample , in order to update . (Footnote 2: is a subgradient of in a point if . Due to its piecewise continuous nature, Eq. (6) is non-differentiable in some points, hence we are forced to rely on subgradients.) Traditional SSVMs assume that in which is a predefined joint input-output feature function. Commonly, this joint function is made up of the outputs of a nonlinear ‘unary’ classifier , such that becomes Houthooftaaai2016. This classifier is trained upfront, based on the different unary inputs corresponding to each node in the underlying factor graph. Due to the linear definition of , the SSVM model is learning linear combinations of these classifier outputs as its unary factors. In general, the interaction factors are not trained through a separate classifier, and are thus linear combinations of the interaction features directly.
We propose to replace the pretraining of a nonlinear unary classifier, and the transformation of its outputs through linear factors, by the direct optimization of nonlinear unary factors. In particular, the unary part of is represented by a sum of outputs of an adapted neural network which models factor values. To achieve this, the loss-augmented prediction step defined in Eq. (8) is altered to
(9) 
in which represents the joint interaction feature function as described in Section 3.1 and Eq. (3). Eq. (9) is calculated similarly to Eq. (8) through $\alpha$-expansion by encoding the loss term into the unary factors.
The compatibility function thus becomes . The calculation of , originally defined as the subderivative of the objective function in Eq. (6), remains unaltered. However, we can no longer assume that conforms to the definition of a subgradient, because the objective is no longer convex. Nevertheless, we can calculate
(10) 
with the set of indices corresponding to training samples for which in Eq. (7), for a particular loss-augmented prediction . In case , we set . This gradient incorporates the loss-augmented prediction of Eq. (9) and is backpropagated through the underlying network to adjust each element of . The altered subgradient descent method is shown in Algorithm 1. Herein, represents the objective function for the $n$-th training sample, i.e., .
In contrast to gradient descent, subgradient methods Nowozin:2011:SLP:2185833.2185834; book:shor1985 do not guarantee the lowering of the objective function value in each step. Therefore, the current best value is memorized in each iteration , along with the corresponding parameter values . As such, the memorized objective value is non-increasing over the steps, as . This update rule is omitted from Algorithm 1 to improve readability.
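The alternation between loss-augmented prediction and a gradient step, together with the memorization of the best iterate, can be sketched as follows. This is a deliberately simplified illustration and not the authors' algorithm: it assumes a linear map `W` as a stand-in for the neural unary factor, no interaction factors (so loss-augmented prediction reduces to a per-node argmax), an unweighted and unnormalized Hamming loss, and an L2 regularizer. All function names and hyperparameter values are hypothetical.

```python
import numpy as np

def loss_augmented_prediction(scores, y_true):
    """Per-node argmax of factor value plus unit Hamming loss (no interactions)."""
    augmented = scores + 1.0                           # loss of 1 for every label ...
    augmented[np.arange(len(y_true)), y_true] -= 1.0   # ... except the true one
    return augmented.argmax(axis=1)

def objective_and_grad(W, samples, C):
    """Regularized structured hinge objective and a (sub)gradient w.r.t. W."""
    value = 0.5 * np.sum(W ** 2)                       # L2 regularizer (assumed)
    grad = W.copy()
    for x, y in samples:
        idx = np.arange(len(y))
        scores = x @ W                                 # linear stand-in for a neural factor
        y_hat = loss_augmented_prediction(scores, y)
        hinge = np.sum(y_hat != y) + scores[idx, y_hat].sum() - scores[idx, y].sum()
        err = np.zeros_like(scores)                    # error at the factor outputs
        err[idx, y_hat] += 1.0
        err[idx, y] -= 1.0
        value += C / len(samples) * hinge
        grad += C / len(samples) * (x.T @ err)         # backpropagated output error
    return float(value), grad

def train(samples, n_features, n_labels, C=10.0, lr=0.05, epochs=50):
    W = np.zeros((n_features, n_labels))
    best_value, best_W = np.inf, W.copy()
    for _ in range(epochs):
        value, grad = objective_and_grad(W, samples, C)
        if value < best_value:                         # memorize the best iterate
            best_value, best_W = value, W.copy()
        W = W - lr * grad
    return best_W, best_value

# Toy data: one-hot node features, six nodes per sample, three labels.
y = np.array([0, 0, 1, 1, 2, 2])
x = np.eye(3)[y]
samples = [(x, y)] * 4
initial_value, _ = objective_and_grad(np.zeros((3, 3)), samples, C=10.0)
best_W, best_value = train(samples, n_features=3, n_labels=3)
```

Returning the memorized parameters rather than the last iterate reflects the fact that individual steps may increase the objective.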
Because the loss terms in Eq. (7) are no longer affine input transformations due to the introduced nonlinearities of the neural network, we can no longer assume Eq. (6) to be convex, as is the case for conventional SSVMs. Although theoretical guarantees can be made for the convergence of (sub)gradient methods for convex functions nedic2001convergence, and particular classes of nonconvex functions bagirov2013subgradient, no such guarantees can be made for arbitrary nonconvex functions ngiam2011optimization. The problem of optimizing highly nonconvex functions is studied extensively in neural network gradient descent literature. However, it has been demonstrated that nonconvex objectives can be minimized effectively due to the high dimensionality of the neural network parameter space pascanu2014saddle. Dauphin et al. dauphin2014identifying show that saddle points are much more likely than local minima in multilayer neural network objective landscapes. In particular, the ratio of saddle points to local minima increases exponentially with the parameter dimensionality. Several methods exist to avoid these saddle points, e.g., momentum sutskever2013importance. Furthermore, Dauphin et al. dauphin2014identifying show, based on random matrix theory, that the existing local minima are very close to the global minimum of the objective function. This can be understood intuitively as the probability that all directions surrounding a local minimum lead upwards is very small, making local minima not an issue in general. The empirical results presented in Section 4.2 reinforce this belief by demonstrating that the regularized objective function can still be minimized effectively, as we achieve accurate predictions.
As described in Algorithm 1, the (sub)gradient is defined over whole data samples, which each consist of multiple nodes. thus models the unary part of the compatibility function , which is a sum of the unary factors. Therefore, the function decomposes as a sum of neural unary factors
(11) 
with the unary features in . The nonlinear function is a multiclass multilayer neural network parametrized by , whose inputs are features corresponding to the different nodes. It forms a template for the neural unary factors. In this network , the softmax function is removed from the output layer, such that it matches the unary factor range . The argument of the joint feature function is used as an index to select a particular output unit.
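A minimal sketch of such a neural unary factor is given below: a network without a softmax layer whose output unit is selected by the label of each node, together with the error signal that would be backpropagated for a given loss-augmented prediction. The single rectified linear hidden layer, the layer sizes, and the names `unary_network`, `unary_compatibility`, and `output_error` are assumptions for illustration only.

```python
import numpy as np

def unary_network(x, params):
    """One-hidden-layer network without a softmax; returns raw factor values."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, x @ W1 + b1)     # rectified linear hidden layer (assumed)
    return hidden @ W2 + b2                   # shape (n_nodes, n_labels)

def unary_compatibility(x, y, params):
    """Sum over nodes of the output unit indexed by each node's label."""
    outputs = unary_network(x, params)
    return float(outputs[np.arange(len(y)), y].sum())

def output_error(y_true, y_hat, n_labels):
    """Gradient of the unary score difference between y_hat and y_true
    with respect to the network outputs."""
    err = np.zeros((len(y_true), n_labels))
    err[np.arange(len(y_hat)), y_hat] += 1.0
    err[np.arange(len(y_true)), y_true] -= 1.0
    return err

rng = np.random.default_rng(0)
n_nodes, n_features, n_hidden, n_labels = 5, 4, 8, 3
params = (rng.normal(size=(n_features, n_hidden)), np.zeros(n_hidden),
          rng.normal(size=(n_hidden, n_labels)), np.zeros(n_labels))
x = rng.normal(size=(n_nodes, n_features))
y_true = np.array([0, 1, 2, 1, 0])
y_hat = np.array([0, 2, 2, 1, 1])   # stand-in for a loss-augmented prediction
err = output_error(y_true, y_hat, n_labels)
```

Nodes on which the loss-augmented prediction agrees with the ground truth receive a zero error row, so only disagreeing nodes drive the parameter update.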
3.3 Neural interaction factors
In this section we extend the notion of nonlinear factors beyond the integration of the training of a unary classifier. We now also replace the linear interaction part of the compatibility function with a function that decomposes as a sum of neural interaction factors
(12) 
with the interaction features in , the combination of node labels in the th interaction, and the number of interactions in the $n$-th training sample. The function is parametrized by , and forms a template for the interaction factors. Herein, depends on the interaction order, e.g., in the Section 4 use case as connections between nodes are then edges. Interaction factors are generally not trained upfront. However, neural interaction factors are useful as they can extract more complex interaction patterns, and thus transcend the limited generalization power of linear combinations. In image segmentation for example, interaction features consisting of vertical gradients and a angle can indicate that the two connected nodes belong to the same class. The loss-augmented inference step in Eq. (9) is now adapted to
(13) 
while the compatibility function becomes . The two distinct models and are trained in a similar fashion to the method described in Algorithm 1, as depicted in Algorithm 2. Notice that this method can easily be adjusted for batch or online learning by adapting the weight updates of the backpropagation step and moving them into the inner loop.
Like the unary function in Eq. (11), is a multiclass multilayer neural network in which the top softmax function is removed, shared among all interaction factors. The output layer dimension matches the number of interaction label combinations, in the most general case. For example in image segmentation, for a problem with symmetric edge features, the number of output units in is , which all represent different states for a particular interaction factor (in this case the interactions are undirected edges, thus consists of the th edge’s incident nodes).
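How a combination of node labels might select an output unit of the interaction network can be sketched as follows. Both indexing schemes are illustrative assumptions: one unit per ordered label pair in the general case, and one unit per unordered pair when the two incident nodes of an undirected edge are interchangeable. The function names are hypothetical, and which scheme the original implementation uses is not stated here.

```python
def ordered_pair_index(a, b, n_labels):
    """Output unit for the ordered label pair (a, b); n_labels**2 units in total."""
    return a * n_labels + b

def unordered_pair_index(a, b, n_labels):
    """Output unit for the unordered pair {a, b}; n_labels*(n_labels+1)//2 units."""
    lo, hi = min(a, b), max(a, b)
    # Rows 0..lo-1 of the upper triangle hold n_labels, n_labels-1, ... entries.
    return lo * n_labels - lo * (lo - 1) // 2 + (hi - lo)

n_labels = 21  # e.g., the number of MSRC-21 classes
n_ordered = n_labels ** 2
n_unordered = n_labels * (n_labels + 1) // 2
```

The unordered scheme roughly halves the output layer, at the cost of forcing the factor value to be identical for both orderings of an edge's labels.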
The resulting structured predictor no longer requires two-phase training in which linear interaction factors are combined with the upfront training of a unary classifier, whose output is transformed linearly into unary factor values. It makes use of highly nonlinear functions for all SSVM factors, by way of multilayer neural networks, using an integration of loss-augmented inference and backpropagation in a subgradient descent framework. This allows the factors to generalize strongly while being able to mutually adapt to each other’s parameter updates, leading to more accurate predictions.
4 Experiments
In this section, our model is analyzed on the task of image segmentation. Herein, the goal is to label different image regions with a correct class label. This is cast into a structured prediction problem by predicting all image region class labels simultaneously. There is one unary factor in the underlying SSVM graphical structure for every image region, while interactions represent edges between neighboring regions. First, our model is analyzed and its different variants are compared to conventional SSVM training schemes. Second, the best performing variant is compared with state-of-the-art segmentation approaches. Our model is implemented as an extension of PyStruct JMLR:v15:mueller14a, using Theano theano12 for GPU-accelerated neural factor optimization.
4.1 Experimental setup
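Table 1: Per-class, pixel-wise (pixel), and class-mean (class) accuracy (%) on the MSRC-21 test images.
method  building  grass  tree  cow  sheep  sky  aeropl.  water  face  car  bicycle  flower  sign  bird  book  chair  road  cat  dog  body  boat  pixel  class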
unary  15  60  52  8  10  68  35  46  12  21  21  42  9  2  36  0  21  14  5  6  1  36.3  23.1 
CP  44  77  61  48  21  85  60  69  51  70  63  54  49  16  87  21  41  47  6  16  33  59.4  48.5 
SGD  49  67  71  39  64  80  81  67  35  74  60  42  19  2  88  51  53  38  4  31  26  59.2  49.6 
int+lin  48  76  83  67  73  94  78  67  59  56  68  65  48  14  95  43  61  53  6  45  32  67.4  58.5 
bif+nrl  46  74  79  51  51  92  83  64  76  64  67  50  53  9  83  34  42  42  0  47  22  62.7  53.7 
int+nrl  53  77  86  61  73  95  83  60  87  77  72  69  77  27  85  29  67  46  0  57  26  70.1  62.3 
int+lin (21 unary hidden units)  46  67  80  47  69  83  79  60  35  66  63  53  10  2  89  43  66  62  4  45  17  61.2  51.7 
3-layer  62  76  87  68  77  94  81  66  84  65  75  53  69  33  81  51  67  58  30  64  25  71.6  65.1 
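Table 2: Per-class, pixel-wise (pixel), and class-mean (class) accuracy (%) on the KITTI benchmark.
method  sky  building  road  sidewalk  fence  vegetation  pole  car  pixel  class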
unary  75  63  59  29  8  71  0  38  53.8  42.8 
CP  84  76  75  11  5  75  0  48  61.5  46.7 
SGD  77  68  86  19  4  80  0  71  65.5  50.6 
int+lin  86  76  82  42  23  81  6  67  70.2  57.8 
bif+nrl  86  77  81  41  12  80  0  71  70.0  55.9 
int+nrl  86  83  88  50  19  84  4  74  75.6  60.9 
int+lin (8 unary hidden units)  81  76  85  22  12  82  0  70  69.2  53.5 
3-layer  90  82  88  55  28  87  1  78  77.6  63.6 
Table 3: Pixel-wise (pixel) and class-mean (class) accuracy (%) on the SIFT Flow benchmark.
method  pixel  class
unary  44.7  7.5 
CP  62.5  13.8 
SGD  65.9  15.3 
int+lin  70.3  16.2 
bif+nrl  68.8  16.1 
int+nrl  71.3  17.0 
int+lin (33 unary hidden units)  70.2  15.6 
3-layer  71.5  17.2 
The model analysis experiments are executed on the widely used MSRC-21 benchmark shotton2009textonboost, which consists of training, validation, and testing images. This benchmark is sufficiently complex with its 21 classes and noisy labels, and focuses on object delineation as well as irregular background recognition. Furthermore, the experiments are executed on the KITTI benchmark ros:2015 consisting of training and testing images, augmented with training images of Kundu et al. augm. This latter benchmark consists of classes, but we drop the least frequently occurring ones as they are insufficiently represented in the dataset. Finally, the same experiment is repeated for a larger dataset, namely the SIFT Flow benchmark liu2011sift, consisting of classes with training and testing images.
All image pixels are clustered into regions using the SLIC Achanta:2012:SSC:2377349.2377556 superpixel algorithm. For each region, gradient (DAISY Tola10) and color (in HSV space) features are densely extracted. These features are transformed two times into separate bags-of-words via minibatch $k$-means clustering (once into 60 gradient and 30 color words, once into 10 and 5 words). The unary input vectors form 90-D concatenations of the first two bags-of-words. The model’s connectivity structure links together all neighboring regions via edges. The edge/interaction input vectors are based on concatenations of the second set of bags-of-words. Both 15-D input vectors of the edge’s incident regions are concatenated into a 30-D vector. Moreover, two edge-specific features are added, namely the distance and angle between adjacent superpixel centers, leading to 32-D interaction feature vectors.
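The dimensionalities implied by these vocabulary sizes can be checked with a small sketch that assembles one unary and one interaction feature vector. The helper names, the zero-valued placeholder histograms, and the use of `arctan2` as the angle convention are assumptions of this example and are not taken from the original implementation.

```python
import numpy as np

N_GRAD_1, N_COLOR_1 = 60, 30   # first (large) bags-of-words vocabularies
N_GRAD_2, N_COLOR_2 = 10, 5    # second (small) bags-of-words vocabularies

def unary_features(grad_hist, color_hist):
    """Concatenate the large gradient and color histograms of one region."""
    return np.concatenate([grad_hist, color_hist])

def edge_features(region_a, region_b, center_a, center_b):
    """Concatenate both regions' small histograms, then append distance and angle."""
    delta = np.asarray(center_b, dtype=float) - np.asarray(center_a, dtype=float)
    distance = np.hypot(delta[0], delta[1])
    angle = np.arctan2(delta[1], delta[0])   # angle convention is an assumption
    return np.concatenate([region_a, region_b, [distance, angle]])

u = unary_features(np.zeros(N_GRAD_1), np.zeros(N_COLOR_1))
small = np.zeros(N_GRAD_2 + N_COLOR_2)
e = edge_features(small, small, (0.0, 0.0), (3.0, 4.0))
```

The unary vector thus has 60 + 30 entries, and the interaction vector has 2 x (10 + 5) + 2 entries.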
Factors are trained with (regular) momentum, using a learning rate curve , with and parameters, and the current training iteration number as used in Algorithms 1 and 2. The regularization, learning rate, and momentum hyperparameter values are tuned using a validation set by means of a coarse and fine-grained grid search over the parameter spaces, yielding separate settings for the unary and pairwise factors. The linear parameters are initialized to , while the neural factor parameters and are initialized according to glorot2010understanding, except for the top layer weights which are set to . The class weights in Eq. (5) are set to correct for class imbalance. The model is trained using CPU-parallelized loss-augmented prediction, while the neural factors are trained using GPU parallelism.
The following models are compared: unary-only (unary), slack cutting plane training (CP) with delayed constraint generation, subgradient descent (SGD), integrated training with neural unary and linear interaction factors (int+lin), bifurcated training with neural interaction factors (bif+nrl), and integrated training with neural unary and neural interaction factors (int+nrl). (Footnote 3: SGD uses bifurcated training with linear interactions, hence it could be named bif+lin.)
Multiclass logistic regression is used as unary classifier, trained with gradient descent by cross-entropy optimization. All unary neural factors contain a single hidden layer with 256 units, for direct comparison of integrated learning with upfront logistic regression training. The interaction neural factors contain a single hidden layer of 512 units to elucidate the benefit of nonlinear factors, without overly increasing the model’s capacity. The experiment is set up to highlight the benefit of integrated learning by restricting the unary factors to features insufficiently discriminative on their own. This deliberately leads to noisy unary classification, forcing the model to rely on contextual relations for accurate prediction. The interaction factors encode information about their incident region feature vectors to allow neural factors to extract meaningful patterns from gradient/color combinations. We deliberately encoded less information in the interaction features, such that the model cannot solely rely on interaction factors for accurate and coherent predictions.
4.2 Results and discussion
Accuracy results on the MSRC-21 shotton2009textonboost test images are presented in Table 1, while Figure 1 shows a handful of illustrative examples that compare segmentations attained by SGD with int+nrl. The results of the same experiment for the KITTI benchmark ros:2015, augmented with additional training images of Kundu et al. augm, are shown in Table 2 and Figure 2. Qualitative results on the SIFT Flow liu2011sift dataset are shown in Figure 3, while accuracy results are shown in Table 3.
method  building  grass  tree  cow  sheep  sky  aeropl.  water  face  car  bicycle  flower  sign  bird  book  chair  road  cat  dog  body  boat  pixel  class
neural factors  76  94  94  92  97  92  94  85  93  88  94  95  70  78  97  87  88  91  78  88  63  88.9  87.4 
Liu et al. Liu:2015:CLC:2796563.2796622  71  95  92  87  98  97  97  89  95  85  96  94  75  76  89  84  88  97  77  87  52  88.5  86.7 
Yao et al. yao2012describing  71  98  90  79  86  93  88  86  90  84  94  98  76  53  97  71  89  83  55  68  17  86.2  79.3 
Lucchi et al. lucchi2013learning  67  89  85  93  79  93  84  75  79  87  89  92  71  46  96  79  86  76  64  77  50  83.7  78.9 
Munoz et al. munoz2010stacked  63  93  88  84  65  89  69  78  74  81  84  80  51  55  84  80  69  47  59  71  24  78  71 
Gonfaus et al. gonfaus2010harmony  60  78  77  91  68  88  87  76  73  77  93  97  73  57  95  81  76  81  46  56  46  77  75 
Shotton et al. shotton2008semantic  49  88  79  97  97  78  82  54  87  74  72  74  36  24  93  51  78  75  35  66  18  72  67 
Lucchi et al. lucchi2012structured  41  77  79  87  91  86  92  65  86  65  89  61  76  48  77  91  77  82  32  48  39  73  70 
The results show that unary-only prediction is very inaccurate (pixel-wise/class-mean accuracy of 36.3/23.1% for the MSRC-21 dataset, 53.8/42.8% for the KITTI dataset, and 44.7/7.5% for the SIFT Flow dataset). The reason for this is that unary features are not sufficiently distinctive to allow for differentiation between classes due to their low dimensionality. Accurate predictions are only possible by taking into account contextual output relations, as demonstrated by the increased accuracy of CP (MSRC-21: 59.4/48.5%; KITTI: 61.5/46.7%; SIFT Flow: 62.5/13.8%) as well as SGD (MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%). These structured predictors learn linear relations between image regions, which allows them to correct errors originating from the underlying unary classifier. However, the unary factor’s linear weights have only limited capability for error correction in the opposite direction, because the SSVM cannot alter the unary classifier parameters post hoc.
Using an integrated training approach such as int+lin, in which the SSVM is trained end-to-end, improves accuracy (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow: 70.3/16.2%) over the bifurcated procedures CP and SGD. Although neither the unary nor the interaction features are very distinctive, the integrated procedure updates parameters in such a way that both factor types have a unique discriminative focus. Their synergistic relationship ultimately results in higher accuracy. To better compare SGD (which uses 8, 21, and 33 logistic regression outputs as unary input features for the different benchmarks) with int+lin, we also depict the accuracy (MSRC-21: 61.2/51.7%; KITTI: 69.2/53.5%; SIFT Flow: 70.2/15.6%) of a model (int+lin) with only 8, 21, and 33 unary hidden units for the KITTI, MSRC-21, and SIFT Flow datasets, rather than 256 units. The 2.0/2.1% (MSRC-21), 3.7/2.9% (KITTI), and 4.3/0.3% (SIFT Flow) increases in accuracy over SGD further illustrate the benefit of integrated learning and inference over conventional bifurcated SSVM training.
Another insight gained from the results is that accuracy increases when replacing linear interaction factors of conventional SSVMs with neural factors, i.e., int+nrl (MSRC-21: 70.1/62.3%; KITTI: 75.6/60.9%; SIFT Flow: 71.3/17.0%) and bif+nrl (MSRC-21: 62.7/53.7%; KITTI: 70.0/55.9%; SIFT Flow: 68.8/16.1%) outperform int+lin (MSRC-21: 67.4/58.5%; KITTI: 70.2/57.8%; SIFT Flow: 70.3/16.2%) and SGD (MSRC-21: 59.2/49.6%; KITTI: 65.5/50.6%; SIFT Flow: 65.9/15.3%), respectively. This increase can be attributed to the higher number of parameters, as well as the added nonlinearities in combination with correct regularization. The model has greater generalization power, allowing the factors to extract more complex and meaningful interaction patterns. Neural factors offer great flexibility as they can be stacked to arbitrary depths. This leads to even higher generalization, as indicated by the increased accuracy (MSRC-21: 71.6/65.1%; KITTI: 77.6/63.6%; SIFT Flow: 71.5/17.2%) of the deeper 3-layer (int+nrl) model. Herein, both unary and interaction factors are 3-hidden-layer neural networks consisting of 256 and 512 units (rectified linear units for MSRC-21 and KITTI and units for SIFT Flow) in each layer, respectively. Our model can thus easily be extended, for example by letting neural factors represent the fully connected layer in convolutional neural networks. As such, it serves as a foundation for more complex structured models.
All methods converge within 600 epochs, with one epoch taking approximately 12.62 seconds for the MSRC-21 dataset, 4.35 seconds for the KITTI dataset, and 197.27 seconds for the SIFT Flow dataset for the int+nrl algorithm. Since the implementation of our algorithm is not optimized for speed, these values can be further reduced by better exploitation of CPU parallelism.
Figure 4 illustrates the synergy between unary and interaction factors achieved through both integrated and bifurcated training, evaluated on the MSRC21 dataset. The bars depict model test accuracy when using only unary or only pairwise factors, obtained by setting the pairwise or unary factors, respectively, to a zero factor value. Although the unary factors alone perform well in bifurcated training, nearly all accuracy can be attributed to the interactions. A possible explanation is that both types essentially learn the same information. The interactions correct errors of the underlying classifier and ultimately make the unary factors redundant. In integrated training, neither the unary nor the interaction factors alone attain a high accuracy, but the combination of both does.
We explain this synergistic relationship with an example. For a region of class A, the unary factors assign the second-highest factor value to class A, the highest value to class B, and a low value to class C. The interactions also assign the second-highest value to class A, but the highest value to class C and a low value to class B. Independently, both factors incorrectly predict the region of class A as belonging to class B or class C. However, when combined, they correctly assign the highest value to class A. In the figure, bifurcated training shows only limited signs of factor synergy, as the optimization procedure is insufficiently able to steer unary and pairwise parameters in different directions, which causes them to have a similar discriminative focus. This observation leads us to believe that integrated learning and inference results in higher accuracy through synergistic unary/interaction factor optimization. Both factor types are no longer optimized for independent accuracy, but mutually adapt to each other's parameter updates, which results in enhanced predictive power.
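The three-class example above can be written out numerically. The factor values below are hypothetical numbers chosen only to match the described ordering, and the interaction contribution is simplified to a single per-class score for the region (in the actual model it depends on neighboring labels). Passing only one table to `predict` mimics setting the other factor type to zero.

```python
# Hypothetical factor values for one region whose true class is "A".
unary = {"A": 2.0, "B": 3.0, "C": 0.0}        # second-highest for A, highest for B
interaction = {"A": 2.0, "B": 0.0, "C": 3.0}  # second-highest for A, highest for C


def predict(*factor_tables):
    """Return the class with the highest summed factor value."""
    classes = factor_tables[0].keys()
    return max(classes, key=lambda c: sum(table[c] for table in factor_tables))


unary_only = predict(unary)              # interaction factors zeroed
interaction_only = predict(interaction)  # unary factors zeroed
combined = predict(unary, interaction)   # both factor types active

print(unary_only, interaction_only, combined)  # B C A
```

Each factor type alone picks a wrong class, while their sum (A: 4.0, B: 3.0, C: 3.0) selects the correct one.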
In addition to the previous experiments, the viability of our neural factor model is shown through comparison with the closely related work of Liu et al. Liu:2015:CLC:2796563.2796622 on the MSRC21 dataset. Liu et al. use features extracted from square regions of varying size around each superpixel by means of a pretrained convolutional neural network. We compare our model with theirs by using OverFeat features sermanet2013overfeat in a similar fashion, trained on individual regions. Furthermore, the model settings have been altered with respect to the previous experiments. More specifically, 1,000 SLIC superpixels are used for the oversegmentation preprocessing step, enforcing superpixel connectivity and merging any superpixel with a surface area below a particular threshold. DAISY gradient and HSV color features are extracted according to a regular lattice and clustered via minibatch k-means clustering. Next, the same type of features are extracted for each individual pixel, leading to unary and pairwise factor feature vectors. Moreover, the position of the superpixel (median-based) center is included in the unary feature vectors, while the distance and angle between the two superpixel centers are encoded into the interaction feature vectors. The neural factors are represented by multilayer neural networks using units, trained according to our Algorithm LABEL:algo:duplicate2, using conventional momentum and single image-sized batches per gradient update. Classes are balanced by weighting them with the inverse of the class frequency. The results, presented in Table 4, indicate that our model is capable of performing on par with the current state of practice when used in conjunction with more advanced methods, e.g., OverFeat features. Moreover, similar to Liu et al. Liu:2015:CLC:2796563.2796622, we have compared our model with other, less closely related methods for completeness; these results are shown below the horizontal line in Table 4.
5 Conclusion
A structured prediction model that integrates backpropagation and loss-augmented inference into subgradient descent training of structural support vector machines (SSVMs) is proposed. This model departs from the traditional bifurcated approach in which a unary classifier is trained independently from the structured predictor. Furthermore, the SSVM factors are extended to neural factors, which allows both unary and interaction factors to be highly nonlinear functions of the input features. Results on a complex image segmentation task show that end-to-end SSVM training, and/or using neural factors, leads to more accurate predictions than conventional subgradient descent and slack cutting plane training. The results also show that our model can serve as a foundation for more advanced structured models, e.g., by using latent variables, learned feature representations, or more complex connectivity structures.
Acknowledgments
Rein Houthooft is supported by a Ph.D. Fellowship of the Research Foundation - Flanders (FWO). Many thanks to Brecht Hanssens and Cedric De Boom for their insightful comments.