Convolutional neural networks with many layers have recently been shown to achieve excellent results on many high-level tasks such as image classification, object detection and, more recently, semantic segmentation. For semantic segmentation in particular, a two-stage procedure is often employed: convolutional networks are trained to provide good local pixel-wise features for a second step, which is traditionally a more global graphical model. In this work we unify this two-stage process into a single joint training algorithm. We demonstrate our method on the semantic image segmentation task and show encouraging results on the challenging PASCAL VOC 2012 dataset.
In the past few years, Convolutional Neural Networks (CNNs) have revolutionized computer vision. They have been shown to achieve state-of-the-art performance in a variety of vision problems, including image classification [KrizhevskyNIPS2013, SimonyanARXIV2014], object detection [Girshick2014RCNN], human pose estimation [tompson2014joint], stereo [ZbontarARXIV2014], and caption generation [KirosICML2014, MaoARXIV2014, VinyalsARXIV2014, DonahueARXIV2014, KarpathyARXIV2014, FangARXIV2014]. This is mainly due to their high representational power achieved by learning complex, non-linear dependencies.
It is only very recently that convolutional nets have also proven very effective for semantic segmentation [GuistiICIP2013, SermanetICLR2014, LongCVPR2014, ZhengARXIV2015, ChenARXIV2015b]. This is perhaps because the pooling operations performed to achieve invariance often reduce the dimensionality of the prediction. A Markov random field (MRF) is then used as a refinement step in order to obtain segmentations that respect segment boundaries well. The seminal work of [KrahenbuhlNIPS2011] showed that efficient inference in fully connected MRFs is possible if the smoothness potentials are Gaussian. Impressive performance was demonstrated in semantic segmentation with hand-crafted features. Later, [ChenARXIV2015b] extended the unary potentials to incorporate convolutional network features. However, these current approaches train the segmentation models in a piece-wise fashion, fixing the unary weights while learning the parameters of the pairwise terms which enforce smoothness.
In this paper we present an algorithm that jointly trains the parameters of the convolutional network defining the unary potentials as well as the smoothness terms, taking into account the dependencies between the random variables. We demonstrate the effectiveness of our approach using the dataset of the PASCAL VOC 2012 challenge [pascal-voc-2012].
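We begin by describing how to learn probabilistic deep networks which take into account correlations between multiple output variables $y = (y_1, \ldots, y_N)$ that are of interest to us. Moreover, a valid configuration $y \in \mathcal{Y} = \prod_{i=1}^{N} \mathcal{Y}_i$ is assumed to lie in the product space of the discrete variable domains $\mathcal{Y}_i = \{1, \ldots, |\mathcal{Y}_i|\}$.
For a given data sample $x \in \mathcal{X}$ and a parameter vector $w \in \mathbb{R}^A$, the score $F(x, y; w) \in \mathbb{R}$ of a configuration $y$ is generally modeled by the mapping $F : \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^A \to \mathbb{R}$.
The prediction task amounts to finding the configuration
$$y^* = \arg\max_{y \in \mathcal{Y}} F(x, y; w) \tag{1}$$
which maximizes the score $F(x, y; w)$. Note that the best scoring configuration $y^*$ is equivalently given as the maximizer of the probability distribution
$$p(y \mid x; w) = \frac{1}{Z(x, w)} \exp F(x, y; w), \qquad Z(x, w) = \sum_{\hat y \in \mathcal{Y}} \exp F(x, \hat y; w),$$
since the exponential function is monotonically increasing and the normalization constant $Z(x, w)$ is independent of the configuration $y$.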
The learning task is concerned with finding a parameter vector
$$w^* = \arg\max_{w} \prod_{(x, y) \in \mathcal{D}} p(y \mid x; w) \tag{2}$$
which maximizes the likelihood of a given training set $\mathcal{D}$. The training set consists of input-output pairs $(x, y)$ which are assumed to be independent and identically distributed. Note that maximizing the likelihood is equivalent to minimizing the cross entropy between the modeled distribution $p(y \mid x; w)$ and a target distribution which places all its mass on the ground-truth configuration $y$. Throughout this work we make no further assumptions about the dependence of the scoring function $F(x, y; w)$ on the parameter vector $w$, i.e., $F$ is generally neither convex nor smooth.
For problems where the output-space size $|\mathcal{Y}|$ is in the thousands, we can exactly solve the inference task given in Eq. (1) by searching over all possible output space configurations $\hat y \in \mathcal{Y}$. In such a setting, those different configurations are typically referred to as different classes. Similarly, we normalize the distribution $p(y \mid x; w)$ by summing up the exponentiated score $\exp F(x, \hat y; w)$ over all possibilities $\hat y \in \mathcal{Y}$. This is often referred to as a soft-max computation. The non-convexity and non-smoothness of the learning objective w.r.t. the parameters $w$ are typically addressed with stochastic gradient ascent. For efficiency, the gradient is often computed on a small subset of the training data, i.e., a mini-batch.
We summarize the resulting training algorithm in Fig. 1. On a high level it consists of four steps which are iterated until a stopping criterion is met: (i) the forward pass to compute the scoring function $F(x, \hat y; w)$ for all output space configurations $\hat y \in \mathcal{Y}$; (ii) normalizing the scoring function via a soft-max computation to obtain the probability distribution $p(\hat y \mid x; w)$; (iii) computation and back-propagation of the gradient of the loss function, i.e., often the log-likelihood or equivalently the cross-entropy; (iv) an update of the parameters.
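The four steps can be sketched for a single training sample as follows. This is a minimal illustration rather than the authors' implementation: the linear scorer `W @ x`, the learning rate `lr`, and the function name `softmax_training_step` are assumptions made purely for the example.

```python
import numpy as np

def softmax_training_step(W, x, y_true, lr=0.1):
    """One stochastic-gradient-ascent step on the log-likelihood of one sample.

    W      : (num_classes, dim) weights of a hypothetical linear scorer,
             standing in for an arbitrary network F(x, y; w).
    x      : (dim,) input features.
    y_true : index of the ground-truth class.
    Returns the updated weights and the log-likelihood before the update.
    """
    # (i) forward pass: one score per output configuration (class)
    scores = W @ x
    # (ii) soft-max normalisation (shifted by the maximum for stability)
    scores = scores - scores.max()
    p = np.exp(scores) / np.exp(scores).sum()
    # (iii) gradient of log p(y_true | x) w.r.t. the scores,
    #       back-propagated through the linear scorer
    target = np.zeros_like(p)
    target[y_true] = 1.0
    grad_W = np.outer(target - p, x)
    # (iv) parameter update (ascent, since the log-likelihood is maximised)
    return W + lr * grad_W, np.log(p[y_true])
```

Calling the function repeatedly on the same sample should increase the returned log-likelihood, which is a quick sanity check for the sign of the update.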
However, solving the inference task given in Eq. (1) or the learning problem stated in Eq. (2) is computationally challenging if we consider more complex output spaces $\mathcal{Y}$, e.g., those arising from tasks like image tagging. The situation is even more severe if we target image segmentation, where the exponential number of possible output space configurations prevents even storage of $F(x, \hat y; w)$ for all $\hat y \in \mathcal{Y}$. Note that this is required in the first line of the algorithm summarized in Fig. 1.
Given an exponential number of possible configurations $\hat y \in \mathcal{Y}$, how do we represent the scoring function $F(x, y; w)$ efficiently? Assuming we have an efficient representation, how can we effectively normalize the probability $p(y \mid x; w)$? One possible answer to those questions was given by Chen et al. [ChenARXIV2015], who discussed extending log-linear models, i.e., those with a scoring function of the form $F(x, y; w) = w^\top \phi(x, y)$, to the more general setting, i.e., an arbitrary dependence of the scoring function $F(x, y; w)$ on the parameter vector $w$.
In short, [ChenARXIV2015] assumed the global scoring function $F(x, y; w)$ to decompose into a sum of local scoring functions $f_r(x, y_r; w)$, each depending on a small subset of variables $y_r = (y_i)_{i \in r}$. All restrictions $r$ required to compute the global function via
$$F(x, y; w) = \sum_{r \in \mathcal{R}} f_r(x, y_r; w) \tag{3}$$
are subsumed in the set $\mathcal{R}$. If the size of each and every local restriction set $r \in \mathcal{R}$ is small, $F$ is efficiently representable.
To compute the gradient of the log-likelihood cost function, we require a properly normalized distribution, or more specifically its marginals $p(y_r \mid x; w)$ for each restriction $r \in \mathcal{R}$. To this end, message passing type algorithms were employed by [ChenARXIV2015]. Such an approach is exact if the distribution is of low tree-width. Otherwise the computational complexity is prohibitively large, and approximations like loopy belief propagation [Pearl1988], convex belief propagation [Weiss2007] or tree-reweighted message passing [Wainwright2003] are alternatives that were successfully applied.
The resulting iterative method of [ChenARXIV2015] is summarized in Fig. 2. In a first step the forward pass computes all outputs of every local scoring function. Afterwards, (approximate) marginals are obtained in a second step and utilized to compute the derivative of the (approximated) maximum likelihood cost function w.r.t. the parameters $w$. The following backward pass computes the gradient of the parameters by repeatedly applying the chain rule according to the definition of the scoring function $F$. The gradient is then utilized during the final parameter update.
The approach presented by [ChenARXIV2015] not only fails if the decomposition assumed in Eq. (3) is not available, it is also computationally challenging to obtain the required marginals if too many local functions are required. That is, computation is slow if the number of restrictions is large, e.g., when working with densely connected image segmentation models where every pixel is possibly correlated to every other pixel in the image.
Densely connected models were previously considered by [KrahenbuhlNIPS2011, VineetBMVC2012, VineetECCV2012, KrahenbuhlICML2013] and shown to yield impressive results for the image segmentation task. Learning the parameters of densely connected models was considered by Krähenbühl and Koltun [KrahenbuhlICML2013] in the context of the log-linear setting. Following [ChenARXIV2015] we aim at extending those fully connected log-linear models to the more general setting of an arbitrary function $F(x, y; w)$, e.g., a deep convolutional neural network. Note that a similar approach has recently been discussed by [ZhengARXIV2015] in independent work.
Let us consider within this section how to efficiently combine deep structured prediction [ChenARXIV2015] with densely connected probabilistic models [KrahenbuhlNIPS2011, VineetBMVC2012, VineetECCV2012, KrahenbuhlICML2013]. Before getting into the details we note that the presented approach trades the computational complexity of the general method of [ChenARXIV2015] for a restriction on the pairwise functions (i.e., the local functions $f_r$ with $|r| = 2$). Concretely, these local functions are assumed to be mixtures of kernels in a feature space, as detailed below. For simplicity we assume that local functions of order higher than two are not required to represent our global scoring function $F$. Generalizations have, however, been presented, e.g., by Vineet et al. [VineetECCV2012].
We begin our discussion by considering the inference task. To obtain a computationally efficient prediction algorithm we use a mean field approximation $q(y)$ of the model distribution $p(y \mid x; w)$ for every sample $x$. More formally, we assume our approximation to factor according to $q(y) = \prod_{i=1}^{N} q_i(y_i)$. Given some parameters $w$, we employ a forward pass to obtain our local function representations $f_r(x, y_r; w)$. Next we compute the single variable marginals $q_i(y_i)$ by minimizing the Kullback-Leibler (KL) divergence w.r.t. the assumed factorization of the mean field distribution, i.e.,
$$\min_{q \in \Delta} D_{\mathrm{KL}}\Big( \prod_{i=1}^{N} q_i(y_i) \,\Big\|\, p(y \mid x; w) \Big). \tag{4}$$
Hereby $q \in \Delta$ requires every $q_i$ to be a valid probability distribution. Due to non-convexity, only convergence to a stationary point of the KL divergence cost function is guaranteed for sequential block-coordinate updates [Wainwright2008, Koller2009]. More precisely, iterating until convergence through the variables $i \in \{1, \ldots, N\}$ using the closed form update
$$q_i(y_i) \propto \exp\Big( f_i(x, y_i; w) + \sum_{j \in \mathcal{N}(i)} \sum_{y_j \in \mathcal{Y}_j} q_j(y_j)\, f_{ij}(x, y_i, y_j; w) \Big), \tag{5}$$
which assumes all marginals but $q_i$ to be fixed, retrieves a stationary point for the cost function of the program given in Eq. (4). The set of variables neighboring $i$ is denoted $\mathcal{N}(i)$.
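The sequential block-coordinate update can be sketched naively as below. The dense array layout (`unary` of shape `(N, K)`, `pairwise` of shape `(N, N, K, K)`), the fixed number of sweeps instead of a convergence test, and the function name are assumptions for illustration; the explicit loop over all other variables is exactly the cost that becomes prohibitive for densely connected models.

```python
import numpy as np

def mean_field_sequential(unary, pairwise, num_sweeps=10):
    """Naive sequential mean-field updates for a pairwise model.

    unary    : (N, K) array, unary[i, a] is the unary score of label a at variable i.
    pairwise : (N, N, K, K) array, pairwise[i, j, a, b] is the pairwise score
               of labels (a, b) at variables (i, j); entries with i == j are ignored.
    Returns q of shape (N, K) whose rows are the single-variable marginals.
    """
    N, K = unary.shape
    q = np.full((N, K), 1.0 / K)          # start from uniform marginals
    for _ in range(num_sweeps):
        for i in range(N):
            # Expected pairwise score for every label of variable i,
            # holding all other marginals fixed.
            msg = np.zeros(K)
            for j in range(N):
                if j != i:
                    msg += pairwise[i, j] @ q[j]
            logits = unary[i] + msg
            logits = logits - logits.max()  # numerical stability
            q[i] = np.exp(logits) / np.exp(logits).sum()
    return q
```

Each update of one marginal touches all other variables, so one sweep over a fully connected model costs on the order of N squared kernel evaluations in this naive form.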
In the case of densely connected variables, the computational bottleneck arises from the second summand, which sums over all neighboring variables and all of their labels. With $N$ denoting the number of variables, the sum over neighbors ranges over $N - 1$ terms for densely connected structured models. Hence the complexity of an update for a single marginal is of $O(N)$, and updating all marginals therefore requires $O(N^2)$ operations, as also discussed by Krähenbühl and Koltun [KrahenbuhlICML2013].
Importantly, Krähenbühl and Koltun [KrahenbuhlNIPS2011] observed that a high dimensional Gaussian filter can be applied to concurrently update all marginals in $O(N)$, where $N$ is the number of variables. This is achievable when constraining ourselves to pairwise functions being mixtures of kernels in the feature space as mentioned before. Formally, we require
$$f_{ij}(x, y_i, y_j; w) = \sum_{m=1}^{M} \mu^{(m)}(y_i, y_j; w)\, k^{(m)}\big(\mathbf{f}_i(x) - \mathbf{f}_j(x)\big),$$
where $\mu^{(m)}$ is a label compatibility function, $k^{(m)}$ is a kernel function, and $\mathbf{f}_i(x)$ are features of variable $i$ depending on the data $x$.
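To make the structure of the parallel update concrete, the sketch below evaluates it with explicit dense kernel matrices. This is a quadratic-time stand-in and not the high dimensional Gaussian filtering used in the paper; the function names, the diagonal inverse covariance `inv_var`, and the matrix form of the compatibility functions are assumptions for the example.

```python
import numpy as np

def gaussian_kernel_matrix(features, inv_var):
    """Dense Gaussian kernel between all pairs of variables.

    features : (N, D) feature vector per variable.
    inv_var  : (D,) diagonal of the inverse covariance (assumed diagonal).
    Returns an (N, N) matrix with zero diagonal (no self-interaction).
    """
    diff = features[:, None, :] - features[None, :, :]
    k = np.exp(-0.5 * np.einsum('ijd,d,ijd->ij', diff, inv_var, diff))
    np.fill_diagonal(k, 0.0)
    return k

def parallel_mean_field(unary, kernels, compat, num_iters=5):
    """Update all marginals concurrently for a kernel-mixture pairwise model.

    unary   : (N, K) unary scores.
    kernels : list of (N, N) kernel matrices, one per mixture component.
    compat  : list of (K, K) label compatibility matrices, one per component.
    """
    logits = unary - unary.max(axis=1, keepdims=True)
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(num_iters):
        msg = np.zeros_like(unary)
        for k, mu in zip(kernels, compat):
            # (k @ q) is the step that filtering replaces; mu mixes the labels.
            msg += (k @ q) @ mu.T
        logits = unary + msg
        logits -= logits.max(axis=1, keepdims=True)
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

The product `k @ q` is the only part whose cost grows quadratically with the number of variables, which is why replacing it by an approximate Gaussian filter makes the whole update linear.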
However, to ensure convergence to a stationary point of the KL divergence cost function for this parallel update, further restrictions on the form of the pairwise functions apply. Formally, if the label compatibility functions are negative semi-definite and the kernels are positive definite, the KL divergence is readily given as the sum of a convex and a concave term [KrahenbuhlICML2013]. Hence the concave-convex procedure (CCCP) [Yuille2003] is directly applicable. We therefore proceed iteratively by first linearizing the concave term at the current location and second minimizing the resulting linearized but convex program.
As detailed by Krähenbühl and Koltun [KrahenbuhlICML2013], and as discussed above, finding the linearization is equivalently solved via filtering in time linear in the number of variables $N$. Solving the convex program in its original form requires solving a non-linear system of equations independently for each marginal $q_i$, e.g., via Newton's method. A further approximation to the cross-entropy term of the KL divergence relates the efficient filtering based mean field update of the marginals to the corresponding cost function for which a stationary point is found.
Having observed that mean-field inference can be efficiently addressed with Gaussian filtering, given restrictions on the pairwise functions, we now turn our attention to the learning task. As mentioned before, we aim at finding a parameter vector $w$ that maximizes the likelihood objective function. Since the exact likelihood is computationally expensive, we use the log-likelihood based on the mean-field marginals. Hence our surrogate loss function for a sample $x$ with corresponding annotated ground truth labeling $\bar y$ is given by
$$L(x, \bar y; w) = \sum_{i=1}^{N} \log q_i(\bar y_i). \tag{6}$$
To perform a parameter update step we need the gradient of the surrogate loss function w.r.t. the parameters, i.e.,
$$\frac{\partial L}{\partial w} = \sum_{i=1}^{N} \sum_{y_i \in \mathcal{Y}_i} \frac{\partial L}{\partial q_i(y_i)}\, \frac{\partial q_i(y_i)}{\partial w}. \tag{7}$$
The gradient of the surrogate loss function w.r.t. the marginals is easily obtained from Eq. (6). It is given by
$$\frac{\partial L}{\partial q_i(y_i)} = \frac{[y_i = \bar y_i]}{q_i(y_i)}, \tag{8}$$
where the Iverson bracket $[y_i = \bar y_i]$ equals one if $y_i = \bar y_i$, and returns zero otherwise.
To perform a gradient step during learning, we additionally require the derivatives of the marginals w.r.t. the parameters, i.e., $\partial q_i(y_i) / \partial w$.
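The surrogate loss and its gradient w.r.t. the marginals are simple enough to check numerically. The sketch below assumes marginals stored as an `(N, K)` array and integer ground-truth labels. The last function additionally assumes marginals that are a plain soft-max of the unary scores, without pairwise terms, purely to illustrate the "ground truth minus predicted marginals" form; all function names are hypothetical.

```python
import numpy as np

def softmax(scores):
    """Row-wise soft-max of an (N, K) score array."""
    z = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def surrogate_loss(q, labels):
    """Sum over variables of the log marginal at the ground-truth label."""
    return np.log(q[np.arange(len(labels)), labels]).sum()

def loss_grad_wrt_marginals(q, labels):
    """Entry (i, a) equals [a == labels[i]] / q[i, a]."""
    grad = np.zeros_like(q)
    idx = np.arange(len(labels))
    grad[idx, labels] = 1.0 / q[idx, labels]
    return grad

def loss_grad_wrt_unaries_softmax_only(q, labels):
    """Chain rule through q = softmax(unary), assuming no pairwise terms:
    one-hot ground truth minus predicted marginals."""
    onehot = np.zeros_like(q)
    onehot[np.arange(len(labels)), labels] = 1.0
    return onehot - q
```

Comparing both gradients against finite differences of `surrogate_loss` is a cheap way to validate a more elaborate implementation before adding the back-tracking through the mean-field iterations.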
Investigating the mean-field update given in Eq. (5) more carefully reveals a recursive definition. More concretely, the derivative of the marginal after $t$ iterations depends on the results from earlier iterations. Hence, we obtain the desired result by successively back-tracking through the mean-field iterations from the last iteration back to the first. This direct computation is, however, computationally expensive. Fortunately, back-substitution into the loss gradient yields an algorithm which requires a total of $T$ back-tracking steps, where $T$ is the number of mean-field iterations, independent of the number of parameters. We refer the interested reader to [KrahenbuhlICML2013] for additional details regarding the computation of the gradient.
In contrast to [KrahenbuhlICML2013], however, we no longer assume the unaries to be given by a logistic regression model. In contrast to [ChenARXIV2015b], we do not assume the unaries to be fixed during CRF parameter updates. Generalizing the gradient of the marginals w.r.t. the parameters to arbitrary unaries is straightforward, since the gradients are directly given by the marginals. Combined with the gradient of the log-likelihood loss function w.r.t. the marginals, given in Eq. (8), we obtain the gradient w.r.t. the unary scores as the difference between the ground-truth and the predicted marginals. This result is then used for back-propagation through any functional structure which provides the unary scoring functions, e.g., convolutional neural networks.
Derivatives w.r.t. the label compatibility and kernel shape parameters are readily given in [KrahenbuhlICML2013]. The resulting algorithm is summarized in Fig. 3. In short, we first obtain our functional representation via a forward pass through any functional network. Subsequently we compute our mean-field marginals via filtering. Afterwards we obtain the gradient of the loss function via efficient back-tracking. In the next step the gradient w.r.t. the parameters is computed by back-propagating the gradient of the loss function using the chain rule dictated by the definition of the scoring function. In a final step we update the parameters.
We evaluate our approach summarized in Fig. 3 on the dataset of the Pascal VOC 2012 challenge [pascal-voc-2012]. The task is semantic image segmentation of 21 object classes (including background). The original dataset contains training, validation and test images. In addition to this data we make use of the annotations provided by Hariharan et al. [HariharanICCV2011], resulting in a total of training instances. The reported performance is measured using the intersection-over-union metric. Note that we conduct our tests on the 1449 validation set images which were neither used during training nor for fine-tuning.
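For reference, the intersection-over-union metric can be computed from a confusion matrix as sketched below. This is a simplified illustration and not the official evaluation code: it assumes every label lies in `0..num_classes-1` (so any "void" pixels must be removed beforehand) and averages only over classes that occur in the prediction or the ground truth.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union between two integer label arrays.

    pred, gt : arrays of identical shape with labels in [0, num_classes).
    """
    # Confusion matrix: rows index the ground truth, columns the prediction.
    conf = np.bincount(
        num_classes * gt.ravel() + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                    # skip classes that never occur
    return (inter[valid] / union[valid]).mean()
```

For a dataset-level score, the confusion matrices of all images would be summed before the per-class ratios are taken.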
Our model setup follows [ChenARXIV2015b], i.e., we employ the 16 layer DeepNet model [SimonyanARXIV2014]. Just like [ChenARXIV2015b] we first convert the fully connected layers into convolutions as first discussed in [GuistiICIP2013, SermanetICLR2014]. This is useful since we are not interested in a single variable output prediction, but rather aim at learning probability masks. To obtain a larger probability mask we skip downsampling during the last two max-pooling operations. To take into account the skipped downsampling during subsequent convolutions we employ the 'à trous (with hole) algorithm' [Mallat1999]. It takes care of the fact that data is stored in an interleaved way, i.e., in our case convolutions sub-sample the input data by a factor of two or four respectively. To adapt to the 21 object classes we also replace the top layer of the DeepNet model to yield 21 classes for each pixel.
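The idea behind the à trous scheme can be illustrated in one dimension: the filter taps are applied to input samples that are `rate` positions apart, so the filter covers the same context it would have seen after downsampling by `rate`, while the output keeps the full resolution. The function below is a hedged sketch with hypothetical names, not the layer used in the paper.

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """'Valid' 1-D correlation with kernel taps spaced `rate` samples apart.

    signal : (n,) input array.
    kernel : (k,) filter taps.
    rate   : spacing between taps; rate == 1 is an ordinary correlation.
    """
    span = (len(kernel) - 1) * rate + 1   # receptive field of the filter
    out_len = len(signal) - span + 1
    out = np.zeros(out_len)
    for t, w in enumerate(kernel):
        # Tap t reads the input shifted by t * rate positions.
        out += w * signal[t * rate : t * rate + out_len]
    return out
```

With `rate = 2` the result equals an ordinary correlation with the kernel into which one zero has been inserted between neighbouring taps, but without ever multiplying by those zeros.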
Similar to [ChenARXIV2015b] we assume the input size of our network to be of dimension which results in a sized spatial output of the DeepNet which is in our case an intermediate result however.
Contrasting [ChenARXIV2015b], we jointly optimize for both unary and CRF parameters using the algorithm presented in Fig. 3. To this end, given images downsampled to a size of , our algorithm first performs a forward pass through the convolutional DeepNet to obtain the sized class probability maps in an intermediate
stage. These intermediate class probability maps are directly up-sampled to the original image dimension using a bi-linear interpolation layer. This yields the actual output of our augmented DeepNet network defining the scoring function. Note that the number of variables is therefore equal to the number of pixels of the original image.
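A bi-linear up-sampling of a single class score map can be sketched as follows. The corner-aligned sampling grid is an assumption for the example, since the text does not state which alignment convention the interpolation layer uses; the function name is hypothetical.

```python
import numpy as np

def bilinear_upsample(score_map, out_h, out_w):
    """Up-sample an (h, w) map to (out_h, out_w) by bi-linear interpolation.

    Assumes a corner-aligned grid: the four corners of input and output coincide.
    """
    h, w = score_map.shape
    ys = np.linspace(0, h - 1, out_h)     # fractional source rows
    xs = np.linspace(0, w - 1, out_w)     # fractional source columns
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]               # vertical interpolation weights
    wx = (xs - x0)[None, :]               # horizontal interpolation weights
    top = (1 - wx) * score_map[y0][:, x0] + wx * score_map[y0][:, x1]
    bot = (1 - wx) * score_map[y1][:, x0] + wx * score_map[y1][:, x1]
    return (1 - wy) * top + wy * bot
```

Because every output value is a fixed linear combination of at most four input values, the derivative w.r.t. the input is the same set of weights, which is what makes back-propagation through this layer straightforward.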
For the second step of our algorithm we perform 5 iterations of mean field updates to compute the marginals of the fully connected CRF. Those are then compared to the original groundtruth image segmentations, using as our loss function the sum of cross-entropy terms, i.e., the log-likelihood loss, as specified in Eq. (6). In the third step we back-track through the marginals to obtain a gradient of the loss function. Afterwards we back-propagate the derivatives w.r.t. the unary term through both the bi-linear interpolation and the 16-layer convolutional network. The shape and compatibility parameters of the CRF, detailed below, are updated directly.
It was shown independently by many authors [SimonyanARXIV2014, ChenARXIV2015] that successively increasing the number of parameters during training typically yields better performance due to better initialization of larger models. We therefore train our model in two stages. First, we assume no pairwise connections to be present, i.e., we fine-tune the weights obtained from the DeepNet ImageNet model [SimonyanARXIV2014, ILSVRCarxiv14] to the Pascal dataset [pascal-voc-2012]. Standard parameter settings for a momentum of , a weight decay of and learning rates of and for the top and all other layers are employed respectively. Due to the 12GB memory restrictions on the Tesla K40 GPU we use a mini-batch size of 20 images.
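In a second stage we jointly train the convolutional network parameters as well as the compatibility and shape parameters of the dense CRF arising from the pairwise functions
$$f_{ij}(x, y_i, y_j; w) = \sum_{m=1}^{2} \mu^{(m)}(y_i, y_j; w)\, k^{(m)}\big(\mathbf{f}_i(x) - \mathbf{f}_j(x)\big). \tag{9}$$
Hereby, we employ the Potts potential, i.e., a compatibility function $\mu^{(m)}(y_i, y_j; w)$ that depends only on whether the labels $y_i$ and $y_j$ agree and is scaled by a single parameter $\mu^{(m)}$, and the Gaussian kernels given by
$$k^{(m)}(\mathbf{d}) = \exp\Big( -\tfrac{1}{2}\, \mathbf{d}^\top \Sigma_m^{-1}\, \mathbf{d} \Big).$$
As indicated in Eq. (9), we use two kernels, both with diagonal covariance matrix $\Sigma_m$. One contains as features the two-dimensional pixel positions, the other contains as features the two-dimensional pixel positions as well as the three color channels. Hence we obtain a total of nine parameters, i.e., two compatibility parameters $\mu^{(1)}$ and $\mu^{(2)}$ and seven kernel shape parameters for the diagonal covariance matrices $\Sigma_1$ and $\Sigma_2$.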
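The two feature sets and the resulting parameter count can be made concrete with a small sketch. The feature layout (row, column, then the three color channels), the unscaled pixel coordinates, and the function names are assumptions for illustration.

```python
import numpy as np

def crf_features(image):
    """Build per-pixel features for the two kernels.

    image : (H, W, 3) color image.
    Returns position features of shape (H*W, 2) and
    position-plus-color features of shape (H*W, 5).
    """
    h, w, _ = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    pos = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    color = image.reshape(h * w, 3).astype(float)
    return pos, np.concatenate([pos, color], axis=1)

def gaussian_kernel(f_i, f_j, variances):
    """Gaussian kernel with diagonal covariance; `variances` holds its diagonal."""
    d = f_i - f_j
    return np.exp(-0.5 * np.sum(d * d / variances))
```

The first kernel has two diagonal covariance entries and the second has five, which together with the two compatibility weights gives the nine CRF parameters mentioned in the text.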
As mentioned before, all our results were computed on the validation set of the Pascal VOC dataset. This part of the data was neither used for training nor for fine-tuning.
Unary performance: We first investigate the performance of the first training stage of the proposed approach, i.e., fine-tuning of the 16 layer DeepNet parameters on the Pascal VOC data. The validation set accuracy is plotted over the number of iterations in Fig. 4 (a). We observe the performance to peak at around 4000 iterations with a mean intersection over union measure of . The result reported by [ChenARXIV2015b] for this experiment is , i.e., we outperform their unary model by .
Joint training: Next we illustrate the performance of the second step, i.e., joint training of both convolutional network parameters and CRF compatibility and shape parameters. In Fig. 4 (b) we indicate the best obtained unary performance from the first step and visualize the validation and training set performance over the number of iterations. We observe the results to peak quickly after around iterations and remain largely stable thereafter.
Details: In Tab. 1 we provide the training and test set accuracies for the 21 individual classes. We observe the ‘bike’ and ‘chair’ class to be particularly difficult. For both categories the validation set performance is roughly half of the training set accuracy.
Comparison to baseline: As provided in Tab. 1, the peak validation set performance of our approach is , which slightly outperforms the separate training result of reported by Chen et al. [ChenARXIV2015b].
Visual results: We illustrate visual results of our approach in Fig. 5. Our method successfully segments the objects if they are clearly apparent in the images. Noisy images and objects with many variations pose challenges to the presented approach, as visualized in Fig. 6. Also, we observe our learnt parameters to generally over-smooth results while being noisy on the boundaries.
We presented a first method that jointly trains convolutional neural networks and fully connected conditional random fields for semantic image segmentation. To this end we generalize [ChenARXIV2015b] to joint training. Note that a method along those lines has also been recently made publicly available in independent work [ZhengARXIV2015]. Whereas the latter combines dense conditional random fields [KrahenbuhlNIPS2011] with the fully convolutional networks presented by Long et al. [LongCVPR2014], we employ and modify the 16 layer DeepNet architecture presented in work by Simonyan and Zisserman [SimonyanARXIV2014].
Ideas along the lines of joint training were discussed within machine learning and computer vision as early as the 90's in work done by Bridle [BridleNIPS1990] and Bottou [BottouCVPR1997]. More recently, [collobert2011natural, PengNIPS2009, MaBioinformatics2012, do2010neural, prabhavalkar2010backpropagation, morris2008conditional] incorporate non-linearities into unary potentials but generally assume exact inference to be tractable. Even more recently, Li and Zemel [LiICML2014] investigate training with hinge-loss objectives using non-linear unaries, but the pairwise potentials remain fixed, i.e., no joint training. Domke [domke2013structured] decomposes the learning objective into logistic regressors, which will be computationally expensive in our setting. Tompson et al. [tompson2014joint] propose joint training for pose estimation based on a heuristic approximation which ignores the normalization constant of the model distribution. Joint training of conditional random fields and deep networks was also discussed recently by [ChenARXIV2015] for graphical models in general. Techniques based on convex and non-convex approximations were described for obtaining marginals in the general non-linear setting.
We discussed a method for semantic image segmentation that jointly trains convolutional neural networks and conditional random fields. Our approach combines techniques from deep convolutional neural networks with variational mean-field approximations from the graphical model literature. We obtain good results on the challenging Pascal VOC 2012 dataset.
In the future we plan to train our method on larger datasets. Additionally we want to investigate training with weakly labeled data.