We consider the task of pixel-wise semantic segmentation given a small set of labeled training images. Two of the most popular techniques to address this task are Random Forests (RF) and Neural Networks (NN). The main contribution of this work is to explore the relationship between two special forms of these techniques: stacked RFs and deep Convolutional Neural Networks (CNN). We show that there exists a mapping from stacked RF to deep CNN, and an approximate mapping back. This insight gives two major practical benefits: Firstly, deep CNNs can be intelligently constructed and initialized, which is crucial when dealing with a limited amount of training data. Secondly, it can be utilized to create a new stacked RF with improved performance. Furthermore, this mapping yields a new CNN architecture that is well suited for pixel-wise semantic labeling. We experimentally verify these practical benefits for two different application scenarios in computer vision and biology, where the layout of parts is important: Kinect-based body part labeling from depth images, and somite segmentation in microscopy images of developing zebrafish.
A central challenge in computer vision is the assignment of a semantic class label to every pixel in an image, a task known as semantic segmentation. A common strategy for semantic segmentation is to use pixel-level classifiers such as Random Forests (RF), which have the advantage of being easy to train and performing well on a wide range of tasks, even in the face of little training data. The use of stacked classifiers, such as in Auto-context, has been shown to improve performance on many tasks such as object-class segmentation, facade segmentation, and brain segmentation. However, this strategy has the limitation that the individual classifiers are trained greedily.
Recently, numerous groups have explored the use of Convolutional Neural Networks (CNNs) for semantic segmentation [7, 19, 6, 34], which has the advantage that it enables “end-to-end learning” of all model parameters. This trend is largely inspired by the success of deep CNNs on high-level computer vision tasks, such as image classification and object detection. However, training a deep CNN requires substantial experience and large amounts of labeled data, or availability of a pre-trained CNN for a similar task [2, 3]. Thus, there currently exists a divide between stacked classifiers and deep CNNs.
We propose an alternative solution, exploiting the fundamental connection between decision trees (DT) and NNs to bridge the gap between stacked classifiers and deep CNNs. This provides a novel approach with the strengths of stacked classifiers, namely robustness to limited training data, and the end-to-end learning capacity of NNs. Figure 1 depicts our proposed pipeline.
Contributions. We make the following contributions:
1. We show that a stacked RF with contextual features is a special case of a deep CNN with sparse convolutional kernels. We apply this successfully to semantic segmentation.
2. We describe an exact mapping of a stacked RF to our sparse, deep CNN. We utilize this mapping to initialize the CNN from a greedily trained stacked RF. This is important in the case of limited training samples. We show that this leads to superior results compared to alternative strategies.
3. We describe an approximate mapping of our sparse, deep CNN back to a stacked RF. We show that this improves the performance of a greedily trained stacked RF.
4. Due to our special CNN architecture, we are able to gain new insights into the activation patterns of internal layers with respect to semantic labels. In particular, we observe that the common smoothing strategy in stacked RFs is naturally learned by our CNN.
Our work relates to (i) global optimization of RF classifiers, (ii) mapping RF classifiers to neural networks, (iii) feature learning in stacked RF models, (iv) applying CNNs to the task of semantic segmentation, and (v) training CNNs with limited labeled data. We cover these areas in turn.
Global Optimization of RF Classifiers. One line of work learns a DT by the standard greedy construction, followed by a process called “fuzzification”, which replaces all threshold split decisions with smooth sigmoid functions that are interpreted as partial or “fuzzy” inheritance by the daughter nodes. The authors develop a back-propagation algorithm, which begins in the leaves and propagates up one layer at a time to the root node, re-optimizing all split parameters of the DT. A second line of work learns to combine the predictions from each DT so that the complementary information between multiple trees is optimally exploited. The authors identify a suitable loss function, and after training a standard RF, they retrain the distributions stored in the leaves, and prune the DTs to accomplish compression and avoid overfitting. However, the latter approach does not retrain the parameters of the internal split nodes of individual DTs, whereas the former does not retrain the combination of trees in the forest. Conceptually, our approach does both.
Mapping RFs to NNs. In both of the approaches above, RFs were initially trained in a greedy fashion, and then later refined. An alternative but related approach is to map the greedily trained RF to an NN with two hidden layers, and use this as a smart initialization for subsequent parameter refinement by back-propagation [27, 33]. This effectively “fuzzifies” threshold split decisions, and simultaneously enables training with respect to a final loss function on the output of the NN. Hence, as opposed to those approaches, all model parameters are learned simultaneously in an end-to-end fashion. Additional advantages are that (i) back-propagation has been widely studied in this form, and (ii) back-propagation is highly parallelized, and only needs to propagate over 2 hidden layers, compared to all tree levels in the fuzzification approach.
Our work builds upon [27, 33]: We extend their approach to a deep CNN, inspired by the Auto-context algorithm, for the purpose of semantic segmentation. Furthermore, we propose an approximate algorithm for mapping the trained CNN back to an RF with axis-aligned threshold split functions, for fast inference at test time.
Feature Learning in a RF Framework. The Auto-context algorithm attempts to capture pixel interdependencies in the learning process by iteratively learning a pixel-wise classifier, using the prediction of nearby pixels from the previous iteration as features. This process is closely related to feature learning, due to the introduction of new features during the learning process. Numerous works have generalized the initial approach of Auto-context. In Entangled Random Forests (ERFs), spatial dependencies are captured by “entanglement features” in each DT, without the need for stacking. Geodesic Forests additionally introduce image-aware geodesic smoothing to the class distributions, to be used as features by deeper nodes in the DT. However, despite the fact that ERFs use a soft sigmoid split function to obtain max-margin behaviour with a small number of trees, these approaches are still limited by greedy parameter optimization.
In a more traditional approach to feature learning, Neural Decision Forests mix RFs and NNs by using multi-layer perceptrons (MLP) as soft split functions, to jointly tackle the problem of data representation and discriminative learning. This approach can obtain superior results with smaller trees, at the cost of more complicated split functions; however, the MLPs in each split node are trained independently of each other. This limitation is addressed in subsequent work, which trains the entire system end-to-end. However, that work adopts a mixed framework, with both differentiable RFs and CNNs, that are trained in an alternating fashion, and applied to image classification. In contrast, we map to the CNN framework, which enables optimization with the popular back-propagation algorithm, and apply it to the task of semantic segmentation.
CNNs for Semantic Segmentation. While CNNs have proven very successful for high-level vision tasks, such as image classification, they are less popular for the task of dense semantic segmentation, due to their in-built spatial invariance. CNNs can be applied in a tile-based manner; however, this leads to pixel-independent predictions, which require additional measures to ensure spatial consistency [11, 21]. The Fully Convolutional Network (FCN) extends the tile-based approach to “whole-image-at-a-time” processing. It addresses the coarse-graining effect of the CNN by upsampling the feature maps in deconvolution layers, and combining fine-grained and coarse-grained features during prediction. This approach, combining down-sampling with subsequent up-sampling, is necessary to maintain a large receptive field without increasing the size of the convolution kernels, which otherwise become difficult to learn. A variant of the FCN called U-Net was also recently proposed. Another approach minimizes coarse-graining by skipping multiple sub-sampling layers, and avoids introducing additional parameters by using sparse convolutional kernels in the layers with large receptive fields; its output is additionally post-processed by a fully connected CRF. A further approach addresses coarse-graining by expressing mean-field inference in a dense CRF as a Recurrent Neural Network (RNN), and concatenating this RNN behind an FCN, for end-to-end training of all parameters. Notably, it demonstrates a significant boost in performance on the Pascal VOC 2012 segmentation benchmark.
In our work we propose a new CNN architecture for semantic segmentation. Contrary to the previous approaches, we avoid coarse-graining effects, which arise in large part due to pre-training a CNN for image classification on data provided by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Instead, we pre-train a stacked RF on a small set of densely labeled data. Our approach is related to the use of sparse kernels discussed above; however, we learn the non-zero element(s) of very sparse convolutional kernels during greedy construction of an RF stack. One advantage of this approach is that since the kernels have a very large receptive field, we do not need max-pooling and deconvolution layers, as, e.g., in the FCN. Additionally, in our approach the sparsity of the kernels can be specified by the number of features used in each RF split node, independently of the size of the receptive field.
Training CNNs with Limited Labelled Data. CNNs provide a powerful tool for feature learning; however, their performance relies on a large set of labeled training data. Unsupervised pre-training has been used successfully to leverage small labeled training sets [26, 22]; however, fully supervised training on large data sets still gives higher performance. Alternatively, transfer learning makes use of, e.g., pseudo-tasks or surrogate training data.
More recent practice is to train a CNN on a large training set, and then fine-tune the parameters on the target data. However, this requires a closely related task with a large labeled data set, such as ILSVRC. Another strategy to address the dependency on training data is to expand a small labeled training set through data augmentation. Alternatively, one can use companion objective functions at each hidden layer, as a form of regularization during training [18, 17]. However, this may in principle interfere with the deep network’s ability to learn the optimal internal representations, as noted by the authors.
We propose a novel strategy for addressing the challenge of training deep CNNs given limited training data. Similar in spirit to [10, 18], we employ greedy supervised pre-training, yet in a complementary model, namely the popular Auto-context model. We then map the resulting Auto-context model onto a deep CNN, and refine all weights using back-propagation.
In Section 3.1, we review the algorithm for mapping an RF onto an NN with two hidden layers [27, 33]. In Section 3.2, we introduce the relationship between RFs with contextual features and CNNs. In Section 3.3, we describe our main contribution, namely how to map a stack of RFs onto a deep CNN. In Section 3.4, we describe our second contribution, namely an algorithm for mapping our deep CNN back onto the original RF stack, with updated parameters.
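In the following, we review the existing works [27, 33]. A decision tree consists of a set of split nodes, $s \in \mathcal{N}^{\mathrm{Split}}$, and leaf nodes, $l \in \mathcal{N}^{\mathrm{Leaf}}$. Each split node $s$ processes the subset $X_s$ of the feature space $X$ that reaches it. Usually, $X = \mathbb{R}^F$, where $F$ is the number of features. Let $cl(s)$ and $cr(s)$ denote the left and right child node of a split node $s$. A split node $s$ partitions the set $X_s$ into two sets $X_{cl(s)}$ and $X_{cr(s)}$ by means of a split decision. For DTs using axis-aligned split decisions, the split is performed on the basis of a single feature whose index we denote by $f(s)$, and a respective threshold denoted as $\theta(s)$: $\forall x \in X_s: \; x \in X_{cl(s)} \iff x_{f(s)} < \theta(s)$.
For each leaf node $l$, there exists a unique path from root node $s_0$ to leaf $l$, $P(l) = \{s_i\}_{i=0}^{d}$, with $s_0, \dots, s_d \in \mathcal{N}^{\mathrm{Split}}$ and $X_l \subseteq X_{s_d} \subseteq \dots \subseteq X_{s_0}$. Thus, leaf membership can be expressed as follows:
$$x \in X_l \iff \forall s \in P(l): \begin{cases} x_{f(s)} < \theta(s) & \text{if } X_l \subseteq X_{cl(s)} \\ x_{f(s)} \geq \theta(s) & \text{if } X_l \subseteq X_{cr(s)} \end{cases} \tag{1}$$
Each leaf node $l$ stores votes for the semantic class labels, $y^l = (y^l_1, \dots, y^l_C)$, where $C$ is the number of classes. For a feature vector $x$, we denote the unique leaf of the tree that has $x \in X_l$ as $\mathrm{leaf}(x)$. The prediction of a DT for feature vector $x$ to be of class $c$ is given by:
$$p(c \mid x) = \frac{y_c^{\mathrm{leaf}(x)}}{\sum_{c'=1}^{C} y_{c'}^{\mathrm{leaf}(x)}} \tag{2}$$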
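The notation above can be made concrete with a minimal Python sketch of an axis-aligned DT. The toy tree, its thresholds and its votes are invented purely for illustration, and the names `SPLITS`, `VOTES`, `leaf_of` and `predict` are hypothetical; the sketch only assumes that a sample goes left when its selected feature is below the threshold, and that the prediction is the normalised vote vector of the reached leaf.

```python
import numpy as np

# Hypothetical toy tree (illustration only). Node 0 is the root.
# Split nodes: node id -> (feature index, threshold, left child, right child).
SPLITS = {0: (0, 0.5, 1, 2), 1: (1, -1.0, 3, 4)}
# Leaf nodes: node id -> raw class votes (two classes here).
VOTES = {2: np.array([1.0, 3.0]), 3: np.array([4.0, 0.0]), 4: np.array([2.0, 2.0])}

def leaf_of(x, node=0):
    """Follow axis-aligned split decisions from the root until a leaf is reached."""
    while node in SPLITS:
        f, theta, left, right = SPLITS[node]
        node = left if x[f] < theta else right
    return node

def predict(x):
    """Normalise the votes stored in the reached leaf into class probabilities."""
    votes = VOTES[leaf_of(x)]
    return votes / votes.sum()
```

For instance, `predict(np.array([0.9, 0.0]))` routes the sample to leaf 2 and normalises its votes `[1, 3]`.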
Using this notation, we now describe how to map a DT to a feed-forward NN, with two hidden layers. Conceptually, the NN separates the task of evaluating the split nodes and evaluating leaf membership into the first and second hidden layers, respectively. See Figure 2 for a sketch of the following description.
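Hidden Layer 1. The first hidden layer, $H_1$, is constructed with one neuron, $H_1(s)$, per split node $s$ in the corresponding DT. This neuron evaluates the split decision $x_{f(s)} \geq \theta(s)$ on feature index $f(s)$ with threshold $\theta(s)$, and encodes the outcome in its activity, $a(H_1(s))$. $H_1(s)$ is connected to the input layer with the following weights and biases: $w_{f(s),H_1(s)} = \mathrm{str}_{01}$ and $b_{H_1(s)} = -\mathrm{str}_{01} \cdot \theta(s)$. The global constant $\mathrm{str}_{01}$ sets how rapidly the neuron activation changes as its input crosses its threshold. All other weights in this layer are zero.
As activation function in $H_1$, $a(\cdot) = \tanh(\cdot)$ is used, with a large value for $\mathrm{str}_{01}$ to approximate thresholded split decisions. During training, $\mathrm{str}_{01}$ can be reduced to avoid the problem of diminishing gradients in back-propagation; however, for now we assume $\mathrm{str}_{01}$ is a large positive constant. Thus, the pattern of activations encodes leaf node membership as follows:
$$x \in X_l \iff \forall s \in P(l): \; a(H_1(s)) = \begin{cases} -1 & \text{if } X_l \subseteq X_{cl(s)} \\ +1 & \text{if } X_l \subseteq X_{cr(s)} \end{cases} \tag{3}$$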
Hidden Layer 2. The role of neurons in the second hidden layer, $H_2$, is to interpret the activation pattern a feature vector $x$ triggers in $H_1$, and thus identify the unique $\mathrm{leaf}(x)$. Therefore, for every leaf $l$ in the DT, one neuron is created, denoted as $H_2(l)$. Each such neuron is connected to all $H_1(s)$ with $s \in P(l)$, but no others. Weights are set as follows: $w_{H_1(s),H_2(l)} = -\mathrm{str}_{12}$ if $X_l \subseteq X_{cl(s)}$ and $w_{H_1(s),H_2(l)} = +\mathrm{str}_{12}$ if $X_l \subseteq X_{cr(s)}$. The sign of these weights matches the pattern of incoming activations iff $x \in X_l$, thus making the activation of $H_2(l)$ maximal. To distinguish leaf membership, the biases in $H_2$ are set as $b_{H_2(l)} = -\mathrm{str}_{12} \cdot (|P(l)| - 1)$. Thus the input to node $H_2(l)$ is equal to $\mathrm{str}_{12}$ if $x \in X_l$, and less than or equal to $-\mathrm{str}_{12}$ otherwise. Using $\tanh$ activation functions, linearly scaled to the range $[0, 1]$, and a large value for $\mathrm{str}_{12}$, the neurons approximately behave as binary switches that indicate leaf membership. I.e., $a(H_2(\mathrm{leaf}(x))) = 1$ and all other neurons are silent.
Output Layer. The output layer of the NN has $C$ neurons, one for every class label. This layer is fully connected; however, there are no bias nodes introduced. The weights store scaled votes from the leaves of the corresponding DT: $w_{H_2(l),c} = \mathrm{str}_{23} \cdot y^l_c$. A softmax activation function is applied, to ensure a probabilistic interpretation of the output after training:
$$p(c \mid x) = \frac{\exp\left(\sum_{l} w_{H_2(l),c} \, a(H_2(l))\right)}{\sum_{c'=1}^{C} \exp\left(\sum_{l} w_{H_2(l),c'} \, a(H_2(l))\right)} \tag{4}$$
Note that the softmax activation slightly perturbs the output distribution of the original RF (cf. Equation 2), making the mapping approximate. This can be tuned by the choice of $\mathrm{str}_{23}$, and in practice is a minor effect. Importantly, the softmax activation preserves the MAP solution.
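The three layers described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy tree, the values of the three strength constants (`STR01`, `STR12`, `STR23`) and the function names are all hypothetical, first-layer neurons use tanh, and second-layer neurons use tanh rescaled to the range 0 to 1.

```python
import numpy as np

# Hypothetical toy tree over 2 features (illustration only).
# Split nodes: (feature index, threshold).
splits = [(0, 0.5), (1, -1.0)]
# Leaves: (path, votes); path maps split index -> -1 (leaf in left subtree) or +1 (right).
leaves = [
    ({0: +1}, np.array([1.0, 3.0])),
    ({0: -1, 1: -1}, np.array([4.0, 0.0])),
    ({0: -1, 1: +1}, np.array([2.0, 2.0])),
]
STR01, STR12, STR23 = 100.0, 100.0, 1.0  # assumed values, not taken from the paper

def build_network(n_features):
    """Translate the tree into sparse weights for two hidden layers and an output layer."""
    n_classes = len(leaves[0][1])
    W1 = np.zeros((n_features, len(splits)))
    b1 = np.zeros(len(splits))
    for s, (f, theta) in enumerate(splits):
        W1[f, s] = STR01            # single non-zero input weight per split neuron
        b1[s] = -STR01 * theta      # bias encodes the threshold
    W2 = np.zeros((len(splits), len(leaves)))
    b2 = np.zeros(len(leaves))
    W3 = np.zeros((len(leaves), n_classes))
    for l, (path, votes) in enumerate(leaves):
        for s, sign in path.items():
            W2[s, l] = sign * STR12         # connect only to split nodes on the path
        b2[l] = -STR12 * (len(path) - 1)    # input is +STR12 only if all decisions match
        W3[l] = STR23 * votes               # output weights store the scaled votes
    return W1, b1, W2, b2, W3

def forward(x, params):
    W1, b1, W2, b2, W3 = params
    h1 = np.tanh(x @ W1 + b1)                   # about -1 (left) or +1 (right)
    h2 = 0.5 * (np.tanh(h1 @ W2 + b2) + 1.0)    # about 1 for the reached leaf, else 0
    z = h2 @ W3
    e = np.exp(z - z.max())
    return h2, e / e.sum()                      # leaf indicator, softmax output
```

With large strength constants the second hidden layer is effectively a one-hot leaf indicator, so the arg-max of the softmax output coincides with the arg-max of the votes in the reached leaf. Samples lying exactly on a threshold are ambiguous in this soft formulation.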
From a Tree to a Forest. Let the number of DTs in a forest be denoted as $T$. The prediction of a forest for feature vector $x$ to be of class $c$ is the normalised sum over the votes stored in the single leaf per tree $t$, denoted $\mathrm{leaf}_t(x)$:
$$p(c \mid x) = \frac{\sum_{t=1}^{T} y_c^{\mathrm{leaf}_t(x)}}{\sum_{c'=1}^{C} \sum_{t=1}^{T} y_{c'}^{\mathrm{leaf}_t(x)}} \tag{5}$$
Extending the DT-to-NN mapping described above to RFs is trivial: (i) replicate the basic NN design $T$ times, and (ii) fully connect to the output layer (see Figure 2(c)). This accomplishes summing over the leaf distributions from the different trees, before the softmax activation is applied.
One of the defining characteristics of CNNs is weight sharing across neurons corresponding to the same feature map. These neurons compute convolutions over a local window in their input, and their convolutional weights are constant across the entire feature map. Unsurprisingly, RFs work in the same way: A feature vector is pre-computed for each pixel in the image, and then fed through the same forest, or in the NN formulation given above, it traverses the identical NN.
A difference between RF and CNN is that in an RF, the first “convolutional layer” is pre-computed with a hand-selected filter bank, not learned as in a CNN. However, the subsequent operations of the RF can be broken down into two convolutions (corresponding to $H_1$ and $H_2$). The first of these two convolutions has depth equal to the number of filters in the filter bank, denoted F, and is very sparse. E.g., axis-aligned decision stumps correspond to a convolution kernel with a single non-zero element. The second convolution ($H_2$) is similarly very sparse, with the number of non-zero elements equal to the depth of the tree. Recall from Section 3.1 that each neuron in this layer combines the response of all split node neurons along the path $P(l)$. For instance, for a balanced tree of depth 10, $H_1$ creates $2^{10} - 1$ feature maps, where each neuron has a single input. $H_2$ creates $2^{10}$ feature maps, but each neuron combines 10 features from the previous layer.
In many applications such as body-pose estimation, medical image labeling, and scene labeling, contextual information is included in the form of contextual “offset features” that are selected from within a window defined by a maximum offset, $\Delta_{\max}$. In this case, neurons in $H_1$ compute sparse convolutions with width and height of $2\Delta_{\max} + 1$, and depth F. Again, it is conventional to have only a single non-zero element in this convolution kernel; however, in the case of medical imaging it is also common to use, e.g., the average intensity over an offset window.
Altogether, a RF with contextual features can be viewed as a special case of a CNN, with sparse convolutional kernels and no max pooling layers. As we shall see in the next section, stacked RFs iterate this architecture using the previous RF predictions as input features, thereby generating a deep CNN with sparse convolutional kernels.
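To illustrate the claim that a contextual offset feature is a sparse convolution, the following NumPy sketch computes the same feature in two ways: by directly reading the pixel at a given offset, and by sliding a square kernel with a single non-zero weight over the image. The function names, the zero-padding at the image border and the kernel width of twice the maximum offset plus one are assumptions made for this illustration.

```python
import numpy as np

def offset_feature(channel, dy, dx):
    """For every pixel (r, c), read the value at (r + dy, c + dx), zero-padded at borders."""
    h, w = channel.shape
    pad = max(abs(dy), abs(dx))
    padded = np.pad(channel, pad)
    return padded[pad + dy: pad + dy + h, pad + dx: pad + dx + w]

def sparse_kernel_response(channel, dy, dx, max_offset):
    """Same feature, computed with an explicit kernel that has one non-zero element."""
    k = 2 * max_offset + 1
    kernel = np.zeros((k, k))
    kernel[max_offset + dy, max_offset + dx] = 1.0   # the only non-zero weight
    h, w = channel.shape
    padded = np.pad(channel, max_offset)
    out = np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            # Cross-correlation of the kernel with the window centred on (r, c).
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out
```

Both functions return identical feature maps. In this view the position of the single non-zero weight plays the role of the offset selected during RF training, which is why the kernel can cover a very large window without adding parameters.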
In a stack of RFs, the modular architecture of a single RF is repeated. We map this architecture onto a deep CNN as follows: Each RF is mapped to a CNN, and then these CNNs are concatenated such that the layers corresponding to intermediate RF predictions become hidden layers, used as input to the next CNN in the sequence (see Figure 4). For a $K$-level RF stack, this generates a deep CNN with $3K - 1$ hidden layers. In the original Auto-context algorithm, each classifier can either select a feature from the output of the previous classifier, or from the set of input filter responses. Thus, we also introduce the input filter responses as bias nodes in the hidden layers that hold the intermediate RF predictions. Note that both addition of trees to the RF and/or growing trees to a greater depth results in a CNN with 2 hidden layers, but with greater width. However, stacking RFs naturally increases the depth of the CNN architecture.
An interesting question is what activation function to use on the layers that correspond to intermediate RF predictions, which are no longer prediction layers. We explored the following options: identity, tanh, class normalization (Equation 5), and softmax. Despite the fact that class normalization can in principle become undefined, due to the possibility of having negative weights, we found that it out-performed the other options. In particular, softmax was the most problematic, because it perturbs the prediction with regards to the original RF, and this error is compounded in a deep stack. This is consistent with class normalization performing the best, since it exactly matches the operation in the original RF stack. For the rest of the paper, we use class normalization activation functions on these layers. We apply softmax activation at the final output layer to convert to a probability.
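The difference between the two candidate activations can be seen on a single vote vector. The sketch below is illustrative only; the function names and the example votes are hypothetical. Class normalization reproduces the RF distribution exactly, whereas softmax keeps the arg-max but changes the probabilities.

```python
import numpy as np

def class_normalize(scores):
    """Divide per-class scores by their sum (last axis indexes classes).
    Undefined when the scores sum to zero, which can happen once weights go negative."""
    return scores / scores.sum(axis=-1, keepdims=True)

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

votes = np.array([6.0, 3.0, 1.0])   # hypothetical summed leaf votes for three classes
rf_like = class_normalize(votes)    # matches the normalised RF prediction
perturbed = softmax(votes)          # same arg-max, but a much more peaked distribution
```

Since the next level of the stack thresholds these values as features, a perturbation of the intermediate distribution can route pixels to different leaves, which is one way to read the compounding error mentioned above.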
In stacked RFs used for semantic segmentation, individual pixels cannot be run through the entire stack independently, but rather the complete image must be run through one level at a time, such that all features are available for the next level. This is similarly true for our deep CNN.
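The level-by-level evaluation can be sketched as follows. This is a hedged illustration rather than the paper's code: each level is represented by a hypothetical callable that maps an image of per-pixel features to non-negative per-pixel class scores, class normalization is applied between levels, and the input filter responses are concatenated with the previous prediction so that the next level can select from both.

```python
import numpy as np

def class_normalize(scores):
    return scores / scores.sum(axis=-1, keepdims=True)

def run_stack(filter_responses, levels):
    """filter_responses: (H, W, D) array; levels: list of callables (H, W, D') -> (H, W, C)."""
    features = filter_responses
    prediction = None
    for level in levels:
        # The whole image passes through one level before the next level starts,
        # because contextual features read predictions at neighbouring pixels.
        prediction = class_normalize(level(features))
        features = np.concatenate([filter_responses, prediction], axis=-1)
    return prediction

def make_toy_level(n_in, n_classes, seed):
    """Stand-in for a trained RF level (hypothetical): a fixed positive linear map."""
    rng = np.random.default_rng(seed)
    weights = rng.uniform(0.1, 1.0, size=(n_in, n_classes))
    return lambda feats: np.abs(feats) @ weights + 1e-6
```

A two-level stack over 3 filter responses and 2 classes would be built as `run_stack(x, [make_toy_level(3, 2, 0), make_toy_level(5, 2, 1)])`, where the second level sees 3 + 2 input channels.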
We are interested in mapping our deep CNN architecture back to a stacked RF, with axis-aligned split functions, for fast evaluation at test time. Given a CNN constructed from a K-level RF stack as described above, the weights $w_{H_1(s),H_2(l)}$ between the split-node layer and the leaf-node layer of every level manifest the correspondence of the CNN with the original tree structure. Thus, during training, keeping these weights and the corresponding biases, $b_{H_2(l)}$, fixed allows the CNN to be trivially mapped back to the original RF stack. For a single level stack, the mapping is: (i) $\theta(s) = -b_{H_1(s)} / w_{f(s),H_1(s)}$, (ii) $y^l_c = w_{H_2(l),c}$. We refer to this as “Map Back #1”. Finally, when evaluating this RF, a softmax activation function needs to be applied to the output distribution. For deeper stacks, the output of each RF must be post-processed with the corresponding activation function in the CNN, which in this paper is simple class normalization, but could be something different, such as softmax.
While the approach described above does map the CNN architecture back to the original RF stack, it may not make optimal use of the parameter refinement learned during back-propagation. Above, for a single level stack we assigned $y^l_c = w_{H_2(l),c}$, which is the correct thing to do if only a single leaf neuron fires in the network. However, after training by back-propagation, the activation pattern in $H_2$ may be distributed, with many neurons contributing to the prediction.
Here, we propose a strategy to capture the distributed activation of the CNN by updating the votes stored in the RF leaves. For feature vector $x$ and class $c$, we would ideally like to store in $y_c^{\mathrm{leaf}(x)}$ the inner product of the activation pattern in $H_2$ with the out-going weights, $\sum_{l'} w_{H_2(l'),c} \, a(H_2(l'))$.
This would elicit the identical output from the RF as from the CNN for input $x$. However, the activation pattern will vary for different training samples that end up in the same leaf, so this mapping cannot be satisfied simultaneously for the whole training set. In other words, DTs store distributions in their leaves that represent constant functions on the respective $X_l$, while the re-trained CNN allows for non-constant functions on $X_l$ (see Figure 5). As a compromise, we seek new vote distributions $\hat{y}^l$ for each leaf $l$ that minimise the following error, averaged over the finite set of training samples, $X_{\mathrm{train}}$:
$$\hat{y}^l_c = \arg\min_{y} \sum_{x \in X_{\mathrm{train}} \cap X_l} \left( y - \sum_{l'} w_{H_2(l'),c} \, a_x(H_2(l')) \right)^2 \tag{6}$$
where $a_x(H_2(l'))$ denotes the activation of $H_2(l')$ for input $x$. Equation 6 can be solved analytically, yielding the following result:
$$\hat{y}^l_c = \frac{1}{|X_{\mathrm{train}} \cap X_l|} \sum_{x \in X_{\mathrm{train}} \cap X_l} \sum_{l'} w_{H_2(l'),c} \, a_x(H_2(l')) \tag{7}$$
This is a simple average of the CNN output over all samples that end up in the same leaf of the corresponding DT. We refer to this as “Map Back #2”. In the trivial case where, for every sample, only one neuron fires in $H_2$, this is equivalent to “Map Back #1”.
To implement this algorithm in a stack, we must take one additional precaution. Since updating the votes as described in Equation 7 does not capture the output of the re-trained CNN exactly, we update the votes sequentially, from the first to the last level of the corresponding stack. E.g., for a 2 level stack, after updating the votes in the first RF using Equation 7, we pass the training data through it, determine the leaf that each training sample now reaches in the second RF, and use this to update the votes in the second RF. See Algorithm 1 for details.
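The vote update for a single level amounts to a per-leaf average, which the following NumPy sketch makes explicit. It is a minimal illustration under assumptions: the function name and argument layout are hypothetical, leaf assignments and second-hidden-layer activations are taken as given, and leaves that receive no training sample are simply left at zero (the text does not say how such leaves are handled).

```python
import numpy as np

def map_back_votes(leaf_ids, h2_activations, w_out, n_leaves):
    """
    leaf_ids:       (N,)   index of the DT leaf that each training sample reaches
    h2_activations: (N, L) activations of the leaf-neuron layer after re-training
    w_out:          (L, C) out-going weights from the leaf-neuron layer to the classes
    returns:        (n_leaves, C) updated votes, one row per leaf
    """
    # What the re-trained network feeds to the output layer for each sample.
    targets = h2_activations @ w_out
    votes = np.zeros((n_leaves, w_out.shape[1]))
    for l in range(n_leaves):
        members = leaf_ids == l
        if members.any():
            # The least-squares constant fit on a leaf is the mean over its samples.
            votes[l] = targets[members].mean(axis=0)
    return votes
```

When every sample activates exactly one leaf neuron, `targets` just selects rows of `w_out`, so the result equals the direct weight-to-vote assignment. In a stack, this function would be applied level by level, recomputing `leaf_ids` for the next level after each update.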
Experimental Setup. We applied our method to human body part classification from Kinect depth images, a domain where Random Forests have been highly successful. We use a recently provided data set, since there is no publicly available data set from the original paper. It contains 2000 training images and 500 testing images, each 320x240 pixels, containing 19 foreground classes and 1 background class (see Figure 6(a,b) for an example). We evaluate the pixel accuracy, averaged over all foreground classes, following previous work on this data set. Note that background is trivially classified.
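One plausible reading of this metric is the per-class pixel accuracy, averaged over the foreground classes. The sketch below implements that reading; the function name, the convention that label 0 is background, and the choice to skip classes absent from the ground truth are assumptions, not details confirmed by the text.

```python
import numpy as np

def mean_foreground_accuracy(pred, truth, background=0):
    """Average, over foreground classes present in `truth`, of the fraction of
    that class's pixels that were labeled correctly."""
    accuracies = []
    for c in np.unique(truth):
        if c == background:
            continue
        mask = truth == c
        accuracies.append(np.mean(pred[mask] == c))
    return float(np.mean(accuracies))
```

Averaging per class rather than per pixel keeps large body parts from dominating the score.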
Training Parameters. We first trained a two-level stacked RF, and then mapped the RF stack to a deep CNN with 5 hidden layers, as described in Section 3. We trained the CNN using back-propagation and stochastic gradient descent (SGD) with momentum. SGD training is applied by passing images through the network one at a time, and computing the gradient averaged over all pixels (i.e., batch size = 1 image). Thus, we do “whole-image-at-a-time” training, as in the FCN. We trained for 8000 iterations, which takes approximately hours in our CPU-based Matlab implementation. For a detailed list of the parameters, see Section 6.1.1.
Results. With our initial two-level stacked RF, we achieved a pixel accuracy of , comparable to the original result of 0.79  (See Figure 6(c)). After mapping to a deep CNN and re-training, we achieved a pixel accuracy of , corresponding to an relative improvement over the RF stack (see Figure 6(d)). This final result is comparable to the state-of-the-art result on this data set which aims to compress RFs by learning a better combination of their constituent trees . They achieve a class-balanced pixel accuracy of over all classes, including the background class, for a model size of 6.8MB. Our model is smaller, at 3.3MB, due to our use of fewer and shallower trees. Due to the different error metric, and their evaluation on a selected subset of pixels, the results are not directly comparable; however, they appear to be very similar.
The architecture of the deep CNN preserves the intermediate prediction layers of the RF stack, which generates one image for each class at the same resolution as the input image. This enables us to gain insights on internal CNN layers. However, due to back-propagation training, these images no longer represent probability distributions. In particular, the pixel values can now be negative. We visualized the internal layers to better understand how they changed during additional training in the CNN (Figure 7). Interestingly, we noticed that compared to the stacked RF, the internal activation layers in the CNN were less thresholded, and fired on adjacent body parts. A common strategy in stacked classification is to introduce smoothing between the layers of the stack (see e.g. [15, 13, 24]), and it appears that a similar strategy is naturally learned by the deep CNN.
Experimental Setup. We next applied our method to semantic segmentation of 21 somites and 1 background class in a data set of 32 images (800x950 pixels) of developing zebrafish. (Somites are the metameric units that give rise to muscle and bone, including vertebrae. This data set will be made publicly available upon acceptance of the manuscript.) Experts in biology manually created ground truth segmentations of these images. This data set poses multiple challenges for automated segmentation, due to the similar appearance of neighboring segments and the limited training data. The data set was split into 16 images for training and 16 images for test. Two additional training images were generated from each original training image by random rotation of the originals. We evaluated the resulting segmentation by means of the class-balanced Dice score.
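A class-balanced Dice score can be computed as the unweighted mean of the per-class Dice coefficients. The following sketch shows that computation; the function name, the inclusion of every class index from 0 to `n_classes - 1`, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration.

```python
import numpy as np

def class_balanced_dice(pred, truth, n_classes):
    """Mean over classes of 2 * |P ∩ T| / (|P| + |T|) for label images pred and truth."""
    scores = []
    for c in range(n_classes):
        p, t = pred == c, truth == c
        denom = p.sum() + t.sum()
        if denom == 0:
            continue  # class absent from both images: no score defined
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores))
```

Because every class contributes equally, a small somite that is missed entirely lowers the score as much as a large one would.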
Training Parameters. We first trained a three-level stacked RF, and then mapped the RF stack to a deep CNN with 8 hidden layers. The CNN was initialized and trained exactly as for the Kinect example; however, with different parameters (see Section 6.1.2).
Results. Segmentation of the test data by means of the resulting three-level stacked RF achieved an average Dice score of 0.60 (see Figure 8(c) and Table 1(RF)). The RF-initialized CNN achieved a Dice score of 0.66 after re-training, corresponding to a 10% relative improvement (see Figure 8(d) and Table 1(CNN)).
Next, we mapped the CNN back to the initial stacked RF architecture, albeit with updated parameters, for fast test-time evaluation. We first employed the trivial approach of mapping weights directly onto votes, similar to what was done in the RF to NN mapping; however, this reduced the Dice score to (see Figure 8(e) and Table 1(MB1)), worse than the performance of the initial RF. Next we applied Algorithm 1, which produces a result that is visually superior to the trivial mapping, and yields a final Dice score of (see Figure 8(f) and Table 1(MB2)). Thus, we achieve a relative improvement of our RF stack, which retains its exact tree structure, by mapping to a deep CNN, training all weights by back-propagation, and mapping back to the original RF stack with updated threshold and leaf distributions.
Above we described a method for training a deep CNN on relatively little training data, using a novel initialization from a stacked RF. As a comparison, we considered the task of training the same CNN architecture from a random initialization, using a similar SGD training routine (see Section 6.1.2 for parameters). We first attempted to train the network maintaining the sparsity of the weight layers. However, the energy quickly plateaued, and yielded a final Dice score of only . We then fully connected the layers corresponding to the tree connectivity, (i.e. ,,) and retrained with the same hyper-parameters. This network performed considerably better, reaching a final Dice score of .
We also compared our method with the Fully Convolutional Network (FCN), a state-of-the-art method for semantic segmentation using CNNs. This network was downloaded from Caffe’s Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn), and initialized with weights fine-tuned from the ILSVRC-trained VGG-16 model. Fine-tuning takes approximately day on a single Nvidia K-40 GPU (see Section 6.1.2 for details). We observed that the FCN network failed to train successfully, achieving a Dice score of only , likely because of the limited size of the training data set (see Figure 9).
Insights. In Figure 10 we discuss insights on the internal activation layers of this network.
We have exploited a new mapping between stacked RFs and deep CNNs, and demonstrated the practical benefits of this mapping for semantic segmentation. This is particularly important when dealing with a limited amount of training data. In contrast to common CNN architectures, our specific architecture produces internal activation images, one for each class, which are of the same dimension as the input image. This enables us to gain insights on the semantic behaviour of the internal layers.
There are many exciting avenues for future research. In the short term, we plan to refine the input convolution filters, which are currently fixed, during back-propagation. Another refinement is to incorporate drop-out regularization during training, which should lead to better generalization performance as has been shown for traditional CNN architectures. Also, the approximate mapping from a CNN architecture back to stacked RFs, and related test-time efficient architectures, may be further improved. In the midterm we are excited about extending our architecture and also merging it with existing CNN architectures. Since our internal activation images are directly interpretable, it is straightforward to incorporate differentiable model layers. It will be interesting to see how our specialized CNN behaves as part of a larger CNN, for instance by placing it directly after the feature extraction layers of a traditional CNN.
In Section 6.1.1, we describe the training parameters used to train the stacked RF and deep CNN for the Kinect example. In Section 6.1.2, we describe the training parameters used to train the stacked RF and deep CNN for the zebrafish example. We also describe the parameters used for training the equivalent deep CNN with random weight initialization.
Stacked RF. We trained a two-level stacked RF, with the following forest parameters at every level: 10 trees, maximum depth 12, stop node splitting if fewer than 25 samples. We selected 20 samples per class per image for training, and used the standard scale invariant offset features, with standard deviation $\sigma = 50$ in each dimension. Each split node selected the best from a random sample of 100 such features.
CNN. We mapped the RF stack to a deep CNN with 5 hidden layers, as described in Section 3.3. For efficient training, the initialization parameters were reduced such that the network could transmit a strong gradient via back-propagation. However, softening these parameters moves the deep CNN further from its initialization by the equivalent stacked RF. We evaluated a range of initialization parameters and found , , to be a good compromise.
We trained the CNN using back-propagation and stochastic gradient descent (SGD), with a cross-entropy loss function. During back-propagation, we maintained the sparse connectivity from RF initialization, allowing only the weights on pre-existing edges to change, corresponding to a previously proposed sparse training scheme.
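Maintaining the sparse connectivity can be implemented by masking the gradient so that only weights on pre-existing edges move. The sketch below shows one SGD step with classical momentum under that constraint; the function name, the momentum formulation and the example values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def masked_momentum_step(weights, grad, velocity, mask, lr, momentum):
    """One SGD-with-momentum update restricted to edges where mask == 1.
    Weights that start at zero outside the mask stay exactly zero."""
    velocity = momentum * velocity - lr * grad * mask
    return weights + velocity, velocity

# Hypothetical 2x3 weight matrix whose sparsity pattern comes from the RF initialization.
w0 = np.array([[1.5, 0.0, 0.0], [0.0, -2.0, 0.7]])
edge_mask = (w0 != 0).astype(float)
g = np.ones_like(w0)                  # placeholder gradient for the example
v0 = np.zeros_like(w0)
w1, v1 = masked_momentum_step(w0, g, v0, edge_mask, lr=0.1, momentum=0.9)
```

Because the mask multiplies the gradient before it enters the velocity, the velocity itself stays zero off the mask, so no weight can drift onto a new edge in later steps.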
Since the network is designed for whole-image inputs, we first cropped the training images around the region of foreground pixels, and then down-sampled them by 25x. Learning rate, , was set such that for the iteration of SGD, with hyper-parameters and iterations. Momentum, , was set according to the following schedule: , where .
We trained a three-level RF stack, with the following forest parameters at every level: 16 trees, maximum depth 12, stop node splitting if less than 25 samples. Features were extracted from the images using a standard filter bank, and then normalized to zero mean, unit variance. The number of random features tested in each node was set to the square root of the total number of input features. For each randomly selected feature, 10 additional contextual features were also considered, with X and Y offsets within a 129x129 pixel window. Training samples were generated by sub-sampling the training images 3x in each dimension and then randomly selectingof these samples for training.
CNN. We mapped the RF stack to a deep CNN with 8 hidden layers. The CNN was initialized and trained exactly as for the Kinect example, with the following exceptions: (i) We used a class-balanced cross-entropy loss function. (ii) Training samples were generated by sub-sampling the training images 9x in each dimension. (iii) Learning rate parameters were as follows: and iterations. (iv) Momentum was initialized to , and increased to after iterations. We observed convergence after only 1-2 passes through the training data, similar to what has been reported previously.
CNN from Random Initialization.
As discussed in Section 4.2 of the paper, for comparison to the RF-initialized weights described above, we also trained CNNs with the same architecture, but with random weight initialization. Weights were initialized according to a Gaussian distribution with zero mean and standard deviation,. We applied a similar SGD training routine, and re-tuned the hyper-parameters as follows: , iterations, momentum was initialized to 0.4 and increased to 0.99 after 96 iterations. Larger step-sizes failed to train. Networks were trained for 2500 iterations.
Fully Convolutional Network. As discussed in Section 4.2 of the paper, we also compared our method with the Fully Convolutional Network (FCN). This network was downloaded from Caffe’s Model Zoo (https://github.com/BVLC/caffe/wiki/Model-Zoo#fcn), and initialized with weights fine-tuned from the ILSVRC-trained VGG-16 model. We trained all layers of the network using SGD with a learning rate of , momentum of and weight decay of . See Figure 9(b) for an example of the resulting segmentation.
On the importance of initialization and momentum in deep learning. In ICML, 2013.