We consider the task of pixelwise semantic segmentation given a small set of labeled training images. Two of the most popular techniques for this task are Random Forests (RF) and Neural Networks (NN). The main contribution of this work is to explore the relationship between two special forms of these techniques: stacked RFs and deep Convolutional Neural Networks (CNN). We show that there exists a mapping from stacked RF to deep CNN, and an approximate mapping back. This insight gives two major practical benefits: Firstly, deep CNNs can be intelligently constructed and initialized, which is crucial when dealing with a limited amount of training data. Secondly, it can be utilized to create a new stacked RF with improved performance. Furthermore, this mapping yields a new CNN architecture that is well suited for pixelwise semantic labeling. We experimentally verify these practical benefits for two different application scenarios in computer vision and biology, where the layout of parts is important: Kinect-based body part labeling from depth images, and somite segmentation in microscopy images of developing zebrafish.
A central challenge in computer vision is the assignment of a semantic class label to every pixel in an image, a task known as semantic segmentation. A common strategy for semantic segmentation is to use pixel-level classifiers such as Random Forests (RF) [4], which have the advantage of being easy to train and performing well on a wide range of tasks, even in the face of little training data. The use of stacked classifiers, such as in Autocontext [32], has been shown to improve performance on many tasks such as object-class segmentation [29], facade segmentation [13], and brain segmentation [32]. However, this strategy has the limitation that the individual classifiers are trained greedily.

Recently, numerous groups have explored the use of Convolutional Neural Networks (CNNs) for semantic segmentation [7, 19, 6, 34], which has the advantage that it enables “end-to-end learning” of all model parameters. This trend is largely inspired by the success of deep CNNs on high-level computer vision tasks, such as image classification [16] and object detection [12]. However, training a deep CNN requires substantial experience and large amounts of labeled data, or availability of a pretrained CNN for a similar task [2, 3]. Thus, there currently exists a divide between stacked classifiers and deep CNNs.
We propose an alternative solution, exploiting the fundamental connection between decision trees (DT) and NNs [27] to bridge the gap between stacked classifiers and deep CNNs. This provides a novel approach with the strengths of stacked classifiers, namely robustness to limited training data, and the end-to-end learning capacity of NNs. Figure 1 depicts our proposed pipeline.

Contributions. We make the following contributions:
1. We show that a stacked RF with contextual features is a special case of a deep CNN with sparse convolutional kernels. We apply this successfully to semantic segmentation.
2. We describe an exact mapping of a stacked RF to our sparse, deep CNN. We utilize this mapping to initialize the CNN from a greedily trained stacked RF. This is important in the case of limited training samples. We show that this leads to superior results compared to alternative strategies.
3. We describe an approximate mapping of our sparse, deep CNN back to a stacked RF. We show that this improves the performance of a greedily trained stacked RF.
4. Due to our special CNN architecture we are able to gain new insights into the activation pattern of internal layers, with respect to semantic labels. In particular, we observe that the common smoothing strategy in stacked RFs is naturally learned by our CNN.
Our work relates to (i) global optimization of RF classifiers, (ii) mapping RF classifiers to neural networks, (iii) feature learning in stacked RF models, (iv) applying CNNs to the task of semantic segmentation, and (v) training CNNs with limited labeled data. We cover these areas in turn.
Global Optimization of RFs. The limitations of traditional greedy RF construction [4] have been addressed by numerous works. In [31], the authors learn a DT by the standard greedy construction, followed by a process they call “fuzzification”, replacing all threshold split decisions with smooth sigmoid functions that they interpret as partial or “fuzzy” inheritance by the daughter nodes. They develop a backpropagation algorithm, which begins in the leaves and propagates up one layer at a time to the root node, re-optimizing all split parameters of the DT. In [23], the authors learn to combine the predictions from each DT so that the complementary information between multiple trees is optimally exploited. They identify a suitable loss function, and after training a standard RF, they retrain the distributions stored in the leaves, and prune the DTs to accomplish compression and avoid overfitting. However, [23] does not retrain the parameters of the internal split nodes of individual DTs, whereas [31] does not retrain the combination of trees in the forest. Conceptually, our approach does both.

Mapping RFs to NNs. In both [31] and [23], RFs were initially trained in a greedy fashion, and then later refined. An alternative but related approach is to map the greedily trained RF to an NN with two hidden layers, and use this as a smart initialization for subsequent parameter refinement by backpropagation [27, 33]. This effectively “fuzzifies” threshold split decisions, and simultaneously enables training with respect to a final loss function on the output of the NN. Hence, as opposed to [31] and [23], all model parameters are learned simultaneously in an end-to-end fashion. Additional advantages are that (i) backpropagation has been widely studied in this form, and (ii) backpropagation is highly parallelized, and only needs to propagate over 2 hidden layers, compared to all tree levels as in [31].
Our work builds upon [27, 33]: We extend their approach to a deep CNN, inspired by the Autocontext algorithm [32], for the purpose of semantic segmentation. Furthermore, we propose an approximate algorithm for mapping the trained CNN back to an RF with axis-aligned threshold split functions, for fast inference at test time.
Feature Learning in an RF Framework. The Autocontext algorithm [32] attempts to capture pixel interdependencies in the learning process by iteratively learning a pixelwise classifier, using the prediction of nearby pixels from the previous iteration as features. This process is closely related to feature learning, due to the introduction of new features during the learning process. Numerous works have generalized the initial approach of Autocontext. In Entangled Random Forests (ERFs) [20], spatial dependencies are captured by “entanglement features” in each DT, without the need for stacking. Geodesic Forests [15] additionally introduce image-aware geodesic smoothing to the class distributions, to be used as features by deeper nodes in the DT. However, despite the fact that ERFs use a soft sigmoid split function to obtain max-margin behaviour with a small number of trees, these approaches are still limited by greedy parameter optimization.
In a more traditional approach to feature learning, Neural Decision Forests [5] mix RFs and NNs by using multi-layer perceptrons (MLP) as soft split functions, to jointly tackle the problem of data representation and discriminative learning. This approach can obtain superior results with smaller trees, at the cost of more complicated split functions; however, the MLPs in each split node are trained independently of each other. This limitation is addressed in [14], which trains the entire system end-to-end. However, they adopt a mixed framework, with both differentiable RFs and CNNs, that are trained in an alternating fashion, and applied to image classification. In contrast, we map to the CNN framework, which enables optimization with the popular backpropagation algorithm, and apply it to the task of semantic segmentation.

CNNs for Semantic Segmentation. While CNNs have proven very successful for high-level vision tasks, such as image classification, they are less popular for the task of dense semantic segmentation, due to their in-built spatial invariance. CNNs can be applied in a tile-based manner [7]; however, this leads to pixel-independent predictions, which require additional measures to ensure spatial consistency [11, 21]. In [19], the authors extend the tile-based approach to “whole-image-at-a-time” processing, in their Fully Convolutional Network (FCN). They address the coarse-graining effect of the CNN by upsampling the feature maps in deconvolution layers, and combining fine-grained and coarse-grained features during prediction. This approach, combining downsampling with subsequent upsampling, is necessary to maintain a large receptive field without increasing the size of the convolution kernels, which otherwise become difficult to learn. A variant of FCN called U-Net was recently proposed in [25]. In [6], the authors minimize coarse-graining by skipping multiple subsampling layers and avoid introducing additional parameters by using sparse convolutional kernels in the layers with large receptive fields. They additionally post-process with a fully connected CRF. In [34], the authors address coarse-graining by expressing mean-field inference in a dense CRF as a Recurrent Neural Network (RNN), and concatenating this RNN behind an FCN, for end-to-end training of all parameters. Notably, they demonstrate a significant boost in performance on the Pascal VOC 2012 segmentation benchmark.
In our work we propose a new CNN architecture for semantic segmentation. Contrary to the previous approaches, we avoid coarse-graining effects, which arise in large part due to pretraining a CNN for image classification on data provided by the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Instead, we pretrain a stacked RF on a small set of densely labeled data. Our approach is related to the use of sparse kernels in [6]; however, we learn the non-zero element(s) of very sparse convolutional kernels during greedy construction of an RF stack. One advantage of this approach is that since the kernels have a very large receptive field, we do not need max-pooling and deconvolution layers, as e.g. in the FCN. Additionally, in our approach the sparsity of the kernels can be specified by the number of features used in each RF split node, independently of the size of the receptive field.

Training CNNs with Limited Labeled Data. CNNs provide a powerful tool for feature learning; however, their performance relies on a large set of labeled training data. Unsupervised pretraining has been used successfully to leverage small labeled training sets [26, 22]; however, fully supervised training on large data sets still gives higher performance. Alternatively, transfer learning makes use of e.g. pseudo-tasks [1], or surrogate training data [9].

More recent practice is to train a CNN on a large training set, and then fine-tune the parameters on the target data [12]. However, this requires a closely related task with a large labeled data set, such as ILSVRC. Another strategy to address the dependency on training data is to expand a small labeled training set through data augmentation [25]. Alternatively, one can use companion objective functions at each hidden layer, as a form of regularization during training [18, 17]. However, this may in principle interfere with the deep network’s ability to learn the optimal internal representations, as noted by the authors.
We propose a novel strategy for addressing the challenge of training deep CNNs given limited training data. Similar in spirit to [10, 18], we employ greedy supervised pretraining, yet in a complementary model, namely the popular Autocontext model. We then map the resulting Autocontext model onto a deep CNN, and refine all weights using backpropagation.
In Section 3.1, we review the algorithm for mapping an RF onto an NN with two hidden layers [27, 33]. In Section 3.2, we introduce the relationship between RFs with contextual features and CNNs. In Section 3.3, we describe our main contribution, namely how to map a stack of RFs onto a deep CNN. In Section 3.4, we describe our second contribution, namely an algorithm for mapping our deep CNN back onto the original RF stack, with updated parameters.
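In the following, we review the existing works [27, 33]. A decision tree consists of a set of split nodes, $n \in N^{Split}$, and leaf nodes, $l \in N^{Leaf}$. Each split node $n$ processes the subset $X_n$ of the feature space $X$ that reaches it. Usually, $X = \mathbb{R}^F$, where $F$ is the number of features. Let $cl(n)$ and $cr(n)$ denote the left and right child node of a split node $n$. A split node $n$ partitions the set $X_n$ into two sets $X_{cl(n)}$ and $X_{cr(n)}$ by means of a split decision. For DTs using axis-aligned split decisions, the split is performed on the basis of a single feature whose index we denote by $f(n)$, and a respective threshold denoted as $\theta(n)$: $x \in X_{cl(n)} \iff x_{f(n)} < \theta(n)$.

For each leaf node $l$, there exists a unique path from root node $n_0$ to leaf $l$, $P(l) = \{n_i\}_{i=0}^{d}$, with $n_0, \ldots, n_d \in N^{Split}$ and $X_l \subseteq X_{n_i}$ for all $i$. Thus, leaf membership can be expressed as follows:

$$x \in X_l \iff \forall n \in P(l): \begin{cases} x_{f(n)} < \theta(n) & \text{if } X_l \subseteq X_{cl(n)} \\ x_{f(n)} \geq \theta(n) & \text{if } X_l \subseteq X_{cr(n)} \end{cases} \quad (1)$$

Each leaf node $l$ stores votes for the semantic class labels, $y^l = (y^l_1, \ldots, y^l_C)$, where $C$ is the number of classes. For a feature vector $x$, we denote the unique leaf of the tree that has $x \in X_l$ as $l(x)$. The prediction of a DT for feature vector $x$ to be of class $c$ is given by:

$$p(c \mid x) = \frac{y^{l(x)}_c}{\sum_{c'=1}^{C} y^{l(x)}_{c'}} \quad (2)$$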
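The leaf lookup and prediction rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy tree, its thresholds, and the names `SPLITS`, `VOTES`, `find_leaf` and `predict` are all hypothetical.

```python
import numpy as np

# Hypothetical toy tree (not from the paper): 2 split nodes, 3 leaves.
# Each split node stores a feature index, a threshold and its two children;
# each leaf stores unnormalised votes for 2 classes.
SPLITS = {
    0: {"f": 0, "theta": 0.5, "left": 1, "right": "C"},
    1: {"f": 1, "theta": 2.0, "left": "A", "right": "B"},
}
VOTES = {
    "A": np.array([8.0, 2.0]),
    "B": np.array([1.0, 9.0]),
    "C": np.array([5.0, 5.0]),
}

def find_leaf(x, node=0):
    """Follow axis-aligned split decisions from the root until a leaf is reached."""
    while node in SPLITS:
        s = SPLITS[node]
        # Go left iff the selected feature is below the threshold, else right.
        node = s["left"] if x[s["f"]] < s["theta"] else s["right"]
    return node

def predict(x):
    """Normalised votes of the leaf that x falls into (cf. Equation 2)."""
    votes = VOTES[find_leaf(x)]
    return votes / votes.sum()
```

For example, `predict(np.array([0.2, 1.0]))` goes left at both split nodes and returns the normalised votes of leaf `"A"`.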
Using this notation, we now describe how to map a DT to a feedforward NN, with two hidden layers. Conceptually, the NN separates the task of evaluating the split nodes and evaluating leaf membership into the first and second hidden layers, respectively. See Figure 2 for a sketch of the following description.



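Hidden Layer 1. The first hidden layer, $H_1$, is constructed with one neuron, $H_1(n)$, per split node $n$ in the corresponding DT. This neuron evaluates the split decision $x_{f(n)} \geq \theta(n)$, and encodes the outcome in its activity, $a(H_1(n))$. $H_1(n)$ is connected to the input layer with the following weights and biases: $w_{f(n),H_1(n)} = c_{01}$ and $b_{H_1(n)} = -c_{01}\,\theta(n)$. The global constant $c_{01}$ sets how rapidly the neuron activation changes as its input crosses its threshold. All other weights in this layer are zero.

As activation function in $H_1$, $a(\cdot) = \tanh(\cdot)$ is used, with a large value for $c_{01}$ to approximate thresholded split decisions. During training, $c_{01}$ can be reduced to avoid the problem of diminishing gradients in backpropagation; however, for now we assume $c_{01}$ is a large positive constant. Thus, the pattern of activations encodes leaf node membership as follows:

$$x \in X_l \iff \forall n \in P(l): a(H_1(n)) = \begin{cases} -1 & \text{if } X_l \subseteq X_{cl(n)} \\ +1 & \text{if } X_l \subseteq X_{cr(n)} \end{cases} \quad (3)$$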
Hidden Layer 2. The role of neurons in the second hidden layer, $H_2$, is to interpret the activation pattern a feature vector $x$ triggers in $H_1$, and thus identify the unique leaf $l(x)$. Therefore, for every leaf $l$ in the DT, one neuron is created, denoted as $H_2(l)$. Each such neuron is connected to all $H_1(n)$ with $n \in P(l)$, but no others. Weights are set as follows: $w_{H_1(n),H_2(l)} = -c_{12}$ if $X_l \subseteq X_{cl(n)}$ and $w_{H_1(n),H_2(l)} = +c_{12}$ if $X_l \subseteq X_{cr(n)}$. The sign of these weights matches the pattern of incoming activations iff $x \in X_l$, thus making the activation of $H_2(l)$ maximal. To distinguish leaf membership, the biases in $H_2$ are set as $b_{H_2(l)} = -c_{12}\,(|P(l)| - 1)$. Thus the input to node $H_2(l)$ is equal to $c_{12}$ if $x \in X_l$, and less than or equal to $-c_{12}$ otherwise. Using $\tanh$ activation functions, linearly scaled to the $[0, 1]$ range, and a large value for $c_{12}$, the neurons approximately behave as binary switches that indicate leaf membership, i.e., $a(H_2(l(x))) = 1$ and all other neurons are silent.
Output Layer. The output layer of the NN has $C$ neurons, one for every class label. This layer is fully connected; however, there are no bias nodes introduced. The weights store scaled votes from the leaves of the corresponding DT: $w_{H_2(l),c} = c_{23}\, y^l_c$. A softmax activation function is applied, to ensure a probabilistic interpretation of the output after training:

$$p(c \mid x) = \frac{\exp(z_c)}{\sum_{c'=1}^{C} \exp(z_{c'})}, \quad z_c = \sum_{l \in N^{Leaf}} w_{H_2(l),c}\, a(H_2(l)) \quad (4)$$

Note that the softmax activation slightly perturbs the output distribution of the original RF (cf. Equation 2), making the mapping approximate. This can be tuned by the choice of $c_{23}$, and in practice is a minor effect. Importantly, the softmax activation preserves the MAP solution.
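The three layers described above can be assembled into a small NumPy sketch. This is an illustration under stated assumptions rather than the authors' code: the toy tree, the array layout (`FEATURE`, `THETA`, `PATHS`, `VOTES`), the constants `C01`, `C12`, `C23` and the function names are hypothetical, and the second hidden layer uses tanh rescaled to [0, 1] as the switch-like activation.

```python
import numpy as np

# Hypothetical toy tree: split node 0 is the root, split node 1 its left child.
# Leaves: A = (left at 0, left at 1), B = (left at 0, right at 1), C = (right at 0).
FEATURE = np.array([0, 1])      # feature index tested by each split node
THETA = np.array([0.5, 2.0])    # threshold of each split node
# PATHS[l, n]: -1 if the path to leaf l goes left at n, +1 if right, 0 if n is off the path.
PATHS = np.array([[-1, -1],
                  [-1, +1],
                  [+1,  0]])
VOTES = np.array([[8.0, 2.0],
                  [1.0, 9.0],
                  [5.0, 5.0]])   # rows: leaves A, B, C; columns: classes

C01, C12, C23 = 100.0, 100.0, 1.0  # assumed constants, chosen for illustration

def build_network(n_features=2):
    n_splits = len(THETA)
    # Hidden layer 1: one neuron per split node, one non-zero weight each.
    w1 = np.zeros((n_splits, n_features))
    w1[np.arange(n_splits), FEATURE] = C01
    b1 = -C01 * THETA
    # Hidden layer 2: one neuron per leaf, connected to the split nodes on its path.
    w2 = C12 * PATHS
    b2 = -C12 * (np.abs(PATHS).sum(axis=1) - 1)
    # Output layer: scaled leaf votes, no bias.
    w3 = C23 * VOTES.T
    return w1, b1, w2, b2, w3

def forward(x, params):
    w1, b1, w2, b2, w3 = params
    a1 = np.tanh(w1 @ x + b1)                  # about -1 (left) or +1 (right)
    a2 = 0.5 * (np.tanh(w2 @ a1 + b2) + 1.0)   # about 1 for the reached leaf, 0 otherwise
    z = w3 @ a2
    e = np.exp(z - z.max())                    # numerically stable softmax
    return a1, a2, e / e.sum()
```

With these settings, an input that reaches leaf A activates only the first neuron of the second hidden layer. The softmax output differs from the normalised votes (roughly 0.998 versus 0.8 for the first class here) but has the same arg max, which illustrates why the mapping is approximate yet preserves the MAP label.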
From a Tree to a Forest. Let the number of DTs in a forest be denoted as $T$. The prediction of a forest for feature vector $x$ to be of class $c$ is the normalised sum over the votes stored in the single leaf per tree $t$, denoted $l_t(x)$:

$$p(c \mid x) = \frac{\sum_{t=1}^{T} y^{l_t(x)}_c}{\sum_{c'=1}^{C} \sum_{t=1}^{T} y^{l_t(x)}_{c'}} \quad (5)$$

Extending the DT-to-NN mapping described above to RFs is trivial: (i) replicate the basic NN design $T$ times, and (ii) fully connect $H_2$ to the output layer (see Figure 2(c)). This accomplishes summing over the leaf distributions from the different trees, before the softmax activation is applied.
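As a small illustration of the forest prediction, the votes of the single leaf reached in each tree are summed and then normalised over classes. The function name `forest_predict` and the example votes are hypothetical.

```python
import numpy as np

def forest_predict(leaf_votes):
    """Normalised sum of leaf votes over trees (cf. Equation 5).

    leaf_votes: array-like of shape (T, C), the votes stored in the one
    leaf reached in each of the T trees, for C classes.
    """
    total = np.asarray(leaf_votes, dtype=float).sum(axis=0)
    return total / total.sum()
```

For two trees whose reached leaves store votes `[8, 2]` and `[1, 9]`, the summed votes are `[9, 11]`, so the forest predicts `[0.45, 0.55]`.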
We now explain a new relationship which is crucial for our main contributions in Sections 3.3 and 3.4. The key concepts are summarized in Figure 3.



One of the defining characteristics of CNNs is weight sharing across neurons corresponding to the same feature map. These neurons compute convolutions over a local window in their input, and their convolutional weights are constant across the entire feature map. Unsurprisingly, RFs work in the same way: A feature vector is precomputed for each pixel in the image, and then fed through the same forest, or in the NN formulation given above, it traverses the identical NN.
A difference between RF and CNN is that in an RF, the first “convolutional layer” is precomputed with a hand-selected filter bank, not learned as in a CNN. However, the subsequent operations of the RF can be broken down into two convolutions (corresponding to $H_1$ and $H_2$). The first of these two convolutions has depth equal to the number of filters in the filter bank, denoted F (typically s), and is very sparse. E.g., axis-aligned decision stumps correspond to a convolution kernel with a single non-zero element. The second convolution ($H_2$) is similarly very sparse, with the number of non-zero elements equal to the depth of the tree. Recall from Section 3.1 that each neuron in this layer combines the response of all split node neurons along path $P(l)$. For instance, for a balanced tree of depth 10, $H_1$ creates $2^{10} - 1$ feature maps (one per split node), where each neuron has a single input. $H_2$ creates $2^{10}$ feature maps (one per leaf), but each neuron combines 10 features from the previous layer.

In many applications such as body-pose estimation [23], medical image labeling [20], and scene labeling [32], contextual information is included in the form of contextual “offset features” that are selected from within a window defined by a maximum offset, $\Delta_{max}$. In this case, neurons in $H_1$ compute sparse convolutions with width and height of $2\Delta_{max} + 1$, and depth F. Again, it is conventional to have only a single non-zero element in this convolution kernel; however, in the case of medical imaging it is also common to use e.g. average intensity over an offset window [20].

Altogether, an RF with contextual features can be viewed as a special case of a CNN, with sparse convolutional kernels and no max-pooling layers. As we shall see in the next section, stacked RFs iterate this architecture using the previous RF predictions as input features, thereby generating a deep CNN with sparse convolutional kernels.
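To make the link between offset features and sparse convolutions concrete, the sketch below checks that reading a filter response at a fixed pixel offset equals a zero-padded cross-correlation with a kernel that has a single non-zero element. The helper names, the zero-padding at the image border, and the use of cross-correlation (as in most CNN libraries) are assumptions made for this illustration.

```python
import numpy as np

def offset_feature(fmap, dy, dx):
    """Value of fmap at (row + dy, col + dx) for every pixel; zero outside the image."""
    h, w = fmap.shape
    out = np.zeros_like(fmap)
    ys, xs = np.mgrid[0:h, 0:w]
    sy, sx = ys + dy, xs + dx
    valid = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out[valid] = fmap[sy[valid], sx[valid]]
    return out

def sparse_kernel(max_offset, dy, dx):
    """Square kernel of side 2*max_offset+1 with one non-zero element at (dy, dx)."""
    k = np.zeros((2 * max_offset + 1, 2 * max_offset + 1))
    k[max_offset + dy, max_offset + dx] = 1.0
    return k

def correlate_same(fmap, kernel):
    """Naive zero-padded cross-correlation with output the same size as fmap."""
    r = kernel.shape[0] // 2
    padded = np.pad(fmap, r)  # constant zero padding on all sides
    h, w = fmap.shape
    out = np.zeros_like(fmap)
    for i in range(h):
        for j in range(w):
            window = padded[i:i + 2 * r + 1, j:j + 2 * r + 1]
            out[i, j] = np.sum(window * kernel)
    return out
```

The kernel size depends only on the maximum offset, while the number of non-zero entries stays at one, which is the sense in which the receptive field and the sparsity are decoupled.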
In a stack of RFs, the modular architecture of a single RF is repeated. We map this architecture onto a deep CNN as follows: Each RF is mapped to a CNN, and then these CNNs are concatenated such that the layers corresponding to intermediate RF predictions become hidden layers, used as input to the next CNN in the sequence (see Figure 4). For a $K$-level RF stack, this generates a deep CNN with $3K - 1$ hidden layers. In the original Autocontext algorithm [32], each classifier can either select a feature from the output of the previous classifier, or from the set of input filter responses. Thus, we also introduce the input filter responses as bias nodes in hidden layers $H_{3k}$, $k = 1, \ldots, K-1$. Note that adding trees to the RF and growing trees to a greater depth both result in a CNN with 2 hidden layers, but with greater width. However, stacking RFs naturally increases the depth of the CNN architecture.
An interesting question is what activation function to use on layers $H_{3k}$, which are no longer prediction layers. We explored the following options: identity, $\tanh$, class normalization (Equation 5), and softmax. Despite the fact that class normalization can in principle become undefined, due to the possibility of having negative weights, we found that it outperformed the other options. In particular, softmax was the most problematic, because it perturbs the prediction with regard to the original RF, and this error is compounded in a deep stack. This is consistent with class normalization performing the best, since it exactly matches the operation in the original RF stack. For the rest of the paper, we use class normalization activation functions on layers $H_{3k}$. We apply softmax activation at the final output layer to convert to a probability.
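The difference between class normalization and softmax on an intermediate prediction layer can be seen on a single vector of summed votes. The sketch below is illustrative only; the function names and example values are hypothetical.

```python
import numpy as np

def class_normalise(z, axis=0):
    """Divide by the sum over classes. Undefined if that sum is zero,
    which can happen once backpropagation makes some weights negative."""
    return z / z.sum(axis=axis, keepdims=True)

def softmax(z, axis=0):
    """Numerically stable softmax over the class axis."""
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

votes = np.array([8.0, 2.0])
rf_like = class_normalise(votes)   # matches the RF output: [0.8, 0.2]
perturbed = softmax(votes)         # much more peaked than the RF output
```

Both activations agree on the most likely class, but softmax changes the distribution itself, which is the perturbation that compounds when such layers feed the next level of a stack.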
In stacked RFs used for semantic segmentation, individual pixels cannot be run through the entire stack independently, but rather the complete image must be run through one level at a time, such that all features are available for the next level. This is similarly true for our deep CNN.
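The level-by-level evaluation can be sketched as a simple loop over whole-image feature stacks. Here `levels` is a hypothetical list of callables standing in for the per-level RF (or the corresponding CNN block), and the array shapes are assumptions made for this illustration.

```python
import numpy as np

def run_stack(filter_responses, levels):
    """Evaluate a stack one level at a time on a whole image.

    filter_responses: array (F, H, W) of precomputed input features.
    levels: list of callables, each mapping a feature stack (D, H, W)
            to class maps (C, H, W).
    """
    features = filter_responses
    prediction = None
    for level in levels:
        # Every pixel of the image is processed before the next level starts.
        prediction = level(features)
        # The next level sees the input filter responses together with the
        # class maps predicted by the previous level.
        features = np.concatenate([filter_responses, prediction], axis=0)
    return prediction
```

Because contextual features of the next level read predictions at neighbouring pixels, a single pixel cannot be pushed through all levels in isolation.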
We are interested in mapping our deep CNN architecture back to a stacked RF, with axis-aligned split functions, for fast evaluation at test time. Given a CNN constructed from a $K$-level RF stack as described above, the weights between the two hidden layers of each level, $w_{H_1(n),H_2(l)}$, manifest the correspondence of the CNN with the original tree structure. Thus, during training, keeping these weights and the corresponding biases fixed allows the CNN to be trivially mapped back to the original RF stack. For a single-level stack, the mapping is: (i) $\theta(n) = -b_{H_1(n)} / w_{f(n),H_1(n)}$, (ii) $y^l_c = w_{H_2(l),c}$. We refer to this as “Map Back #1”. Finally, when evaluating this RF, a softmax activation function needs to be applied to the output distribution. For deeper stacks, the output of each RF must be post-processed with the corresponding activation function in the CNN, which in this paper is simple class normalization, but could be something different, such as softmax.

While the approach described above does map the CNN architecture back to the original RF stack, it may not make optimal use of the parameter refinement learned during backpropagation. Above, for a single-level stack we assigned $y^l_c = w_{H_2(l),c}$, which is the correct thing to do if only a single leaf neuron fires in the network. However, after training by backpropagation, the activation pattern in $H_2$ may be distributed, with many neurons contributing to the prediction.

Here, we propose a strategy to capture the distributed activation of the CNN by updating the votes stored in the RF leaves. For feature vector $x$ and class $c$, we would ideally like to store in $y^{l(x)}_c$ the inner product of the activation pattern in $H_2$ with the outgoing weights, $w_{H_2(l),c}$:

$$y^{l(x)}_c = \sum_{l \in N^{Leaf}} a(H_2(l))\, w_{H_2(l),c} \quad (6)$$

This would elicit the identical output from the RF as from the CNN for input $x$. However, the activation pattern will vary for different training samples that end up in the same leaf, so this mapping cannot be satisfied simultaneously for the whole training set. In other words, DTs store distributions in their leaves that represent constant functions on the respective $X_l$, while the retrained CNN allows for non-constant functions on $X_l$ (see Figure 5). As a compromise, we seek new vote distributions $\hat{y}^l$, for each $l$, to minimise the following error, averaged over the finite set of training samples, $X_l^{train} \subset X_l$:

$$\hat{y}^l_c = \arg\min_{y} \frac{1}{|X_l^{train}|} \sum_{x \in X_l^{train}} \left( y^{l(x)}_c - y \right)^2 \quad (7)$$

The solution is a simple average of $y^{l(x)}_c$ (Equation 6) over all samples that end up in the same leaf of the corresponding DT. We refer to this as “Map Back #2”. In the trivial case where, for every sample, only one neuron fires in $H_2$, this is equivalent to “Map Back #1”.
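A minimal sketch of the leaf-wise averaging behind “Map Back #2”, for a single tree, is given below. The function name, the argument layout, and the choice to leave the votes of leaves that receive no training sample at their outgoing weights are assumptions made for this illustration, not details taken from the paper.

```python
import numpy as np

def map_back_votes(leaf_index, a2, w_out):
    """Average the network's pre-softmax output over the samples in each leaf.

    leaf_index: (N,) index of the leaf each training sample reaches in the DT.
    a2:         (N, L) activations of the second hidden layer per sample.
    w_out:      (L, C) outgoing weights of the second hidden layer.
    Returns an (L, C) array of updated votes.
    """
    target = a2 @ w_out                     # (N, C): distributed prediction per sample
    votes = np.array(w_out, dtype=float)    # fallback for leaves with no samples (assumption)
    for leaf in range(w_out.shape[0]):
        members = leaf_index == leaf
        if members.any():
            votes[leaf] = target[members].mean(axis=0)
    return votes
```

When every row of `a2` is one-hot at the sample's own leaf, the result equals `w_out`, which corresponds to the trivial case in which both map-back variants coincide.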



To implement this algorithm in a stack, we must take one additional precaution. Since updating the votes as described in Equation 7 does not capture the output of the retrained CNN exactly, we update the votes sequentially, from the first to the last level of the corresponding stack. E.g., for a 2-level stack, after updating the votes in the first RF using Equation 7, we pass the training data through the updated first RF and determine the new value of $l(x)$ in the second RF for each training sample, and use this to update the votes in the second RF. See Algorithm 1 for details.
Experimental Setup. We applied our method to human body part classification from Kinect depth images, a domain where Random Forests have been highly successful [28]. We use the recently provided data set in [8], since there is no publicly available data set from the original paper [28]. It contains 2000 training images, and 500 testing images, each 320x240 pixels, containing 19 foreground classes and 1 background class (see Figure 6(a,b) for an example). We evaluate the pixel accuracy, averaged over all foreground classes, as was done by [8]. Note that background is trivially classified.
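One plausible reading of “pixel accuracy, averaged over all foreground classes” is the per-class fraction of correctly labeled ground-truth pixels, averaged over the foreground classes only. The sketch below implements that reading; the function name and the decision to skip classes absent from the ground truth are assumptions, and the exact protocol of [8] may differ.

```python
import numpy as np

def mean_foreground_accuracy(pred, truth, foreground_classes):
    """Per-class pixel accuracy averaged over the given foreground classes."""
    accuracies = []
    for c in foreground_classes:
        mask = truth == c
        if mask.any():  # skip classes that do not occur in the ground truth
            accuracies.append((pred[mask] == c).mean())
    return float(np.mean(accuracies))
```

Averaging per class rather than per pixel keeps large body parts from dominating the score, and excluding the background class removes the trivially classified pixels.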
Training Parameters. We first trained a two-level stacked RF, and then mapped the RF stack to a deep CNN with 5 hidden layers, as described in Section 3. We trained the CNN using backpropagation and stochastic gradient descent (SGD) with momentum. SGD training is applied by passing images through the network one at a time, and computing the gradient averaged over all pixels (i.e., batch size = 1 image). Thus, we do “whole-image-at-a-time” training, as in [19]. We trained for 8000 iterations, which takes approximately hours in our CPU-based Matlab implementation. For a detailed list of the parameters, see Section 6.1.1.

Results. With our initial two-level stacked RF, we achieved a pixel accuracy of , comparable to the original result of 0.79 [8] (see Figure 6(c)). After mapping to a deep CNN and retraining, we achieved a pixel accuracy of , corresponding to an relative improvement over the RF stack (see Figure 6(d)). This final result is comparable to the state-of-the-art result on this data set, which aims to compress RFs by learning a better combination of their constituent trees [23]. They achieve a class-balanced pixel accuracy of over all classes, including the background class, for a model size of 6.8MB. Our model is smaller, at 3.3MB, due to our use of fewer and shallower trees. Due to the different error metric, and their evaluation on a selected subset of pixels, the results are not directly comparable; however, they appear to be very similar.



Insights. The architecture of the deep CNN preserves the intermediate prediction layers of the RF stack, which generates one image for each class at the same resolution as the input image. This enables us to gain insights on internal CNN layers. However, due to backpropagation training, these images no longer represent probability distributions. In particular, the pixel values can now be negative. We visualized the internal layers to better understand how they changed during additional training in the CNN (Figure 7). Interestingly, we noticed that compared to the stacked RF, the internal activation layers in the CNN were less thresholded, and fired on adjacent body parts. A common strategy in stacked classification is to introduce smoothing between the layers of the stack (see e.g. [15, 13, 24]), and it appears that a similar strategy is naturally learned by the deep CNN.

Experimental Setup. We next applied our method to semantic segmentation of 21 somites and 1 background class in a data set of 32 images (800x950 pixels) of developing zebrafish. Somites are the metameric units that give rise to muscle and bone, including vertebrae. This data set will be made publicly available upon acceptance of the manuscript. Experts in biology manually created ground truth segmentations of these images. This data set poses multiple challenges for automated segmentation, due to the similar appearance of neighboring segments and the limited training data. The data set was split into 16 images for training and 16 images for testing. Two additional training images were generated from each original training image by random rotation of the originals. We evaluated the resulting segmentation by means of the class-balanced Dice score.
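A class-balanced Dice score can be computed as the unweighted mean of per-class Dice coefficients. The sketch below takes that reading; the function name and the handling of classes that are absent from both prediction and ground truth are assumptions, since the text does not spell out these details.

```python
import numpy as np

def class_balanced_dice(pred, truth, n_classes):
    """Mean over classes of 2|P and T| / (|P| + |T|).

    Classes that appear in neither the prediction nor the ground truth
    are skipped rather than counted as perfect or failed.
    """
    scores = []
    for c in range(n_classes):
        p, t = pred == c, truth == c
        denom = p.sum() + t.sum()
        if denom > 0:
            scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return float(np.mean(scores))
```

Because each class contributes equally, a small somite counts as much as the large background region.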
Training Parameters. We first trained a three-level stacked RF, and then mapped the RF stack to a deep CNN with 8 hidden layers. The CNN was initialized and trained exactly as for the Kinect example, albeit with different parameters (see Section 6.1.2).
Results. Segmentation of the test data by means of the resulting three-level stacked RF achieved an average Dice score of 0.60 (see Figure 8(c) and Table 1(RF)). The RF-initialized CNN achieved a Dice score of 0.66 after retraining, corresponding to a 10% relative improvement (see Figure 8(d) and Table 1(CNN)).
Next, we mapped the CNN back to the initial stacked RF architecture, albeit with updated parameters, for fast test-time evaluation. We first employed the trivial approach of mapping weights directly onto votes, similar to what was done in the RF to NN mapping; however, this reduced the Dice score to 0.59 (see Figure 8(e) and Table 1(MB1)), worse than the performance of the initial RF. Next we applied Algorithm 1, which produces a result that is visually superior to the trivial mapping, and yields a final Dice score of 0.63 (see Figure 8(f) and Table 1(MB2)). Thus, we achieve a 5% relative improvement of our RF stack, which retains its exact tree structure, by mapping to a deep CNN, training all weights by backpropagation, and mapping back to the original RF stack with updated thresholds and leaf distributions.
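For reference, the relative changes with respect to the initial RF stack follow directly from the Dice scores reported in Table 1 (RF 0.60, CNN 0.66, MB1 0.59, MB2 0.63). The worked arithmetic below uses only those table values:

```latex
\frac{0.66 - 0.60}{0.60} = 0.10, \qquad
\frac{0.63 - 0.60}{0.60} = 0.05, \qquad
\frac{0.59 - 0.60}{0.60} \approx -0.017
```

That is, the retrained CNN gains about 10% over the RF stack, Map Back #2 retains about half of that gain, and Map Back #1 falls slightly below the initial RF.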



Method  RF  FCN  CNN  MB1  MB2 
Dice Score  0.60  0.18  0.66  0.59  0.63 
Above we described a method for training a deep CNN on relatively little training data, using a novel initialization from a stacked RF. As a comparison, we considered the task of training the same CNN architecture from a random initialization, using a similar SGD training routine (see Section 6.1.2 for parameters). We first attempted to train the network maintaining the sparsity of the weight layers. However, the energy quickly plateaued, and yielded a final Dice score of only . We then fully connected the layers corresponding to the tree connectivity, (i.e. ,,) and retrained with the same hyperparameters. This network performed considerably better, reaching a final Dice score of .
We also compared our method with the Fully Convolutional Network (FCN), a state-of-the-art method for semantic segmentation using CNNs [19]. This network was downloaded from Caffe’s Model Zoo (https://github.com/BVLC/caffe/wiki/ModelZoo#fcn), and initialized with weights fine-tuned from the ILSVRC-trained VGG-16 model. Fine-tuning takes approximately day on a single Nvidia K40 GPU (see Section 6.1.2 for details). We observed that the FCN failed to train successfully, achieving a Dice score of only 0.18 (see Table 1(FCN)), likely because of the limited size of the training data set (see Figure 9).



Insights. In Figure 10 we discuss insights on the internal activation layers of this network.

We have exploited a new mapping between stacked RFs and deep CNNs, and demonstrated the practical benefits of this mapping for semantic segmentation. This is particularly important when dealing with a limited amount of training data. In contrast to common CNN architectures, our specific architecture produces internal activation images, one for each class, which are of the same dimension as the input image. This enables us to gain insights on the semantic behaviour of the internal layers.
There are many exciting avenues for future research. In the short term, we plan to refine the input convolution filters, which are currently fixed, during backpropagation. Another refinement is to incorporate dropout regularization during training, which should lead to better generalization performance, as has been shown for traditional CNN architectures. Also, the approximate mapping from a CNN architecture back to stacked RFs, and related test-time efficient architectures, may be further improved. In the mid-term we are excited about extending our architecture and also merging it with existing CNN architectures. Since our internal activation images are directly interpretable, it is straightforward to incorporate differentiable model layers. It will be interesting to see how our specialized CNN behaves as part of a larger CNN, for instance by placing it directly after the feature extraction layers of a traditional CNN.
In Section 6.1.1, we describe the training parameters used to train the stacked RF and deep CNN for the Kinect example. In Section 6.1.2, we describe the training parameters used to train the stacked RF and deep CNN for the zebrafish example. We also describe the parameters used for training the equivalent deep CNN with random weight initialization.
Stacked RF. We trained a two-level stacked RF, with the following forest parameters at every level: 10 trees, maximum depth 12, stop node splitting if fewer than 25 samples. We selected 20 samples per class per image for training, and used the standard scale-invariant offset features from [28], with standard deviation σ = 50 in each dimension. Each split node selected the best from a random sample of 100 such features.

CNN. We mapped the RF stack to a deep CNN with 5 hidden layers, as described in Section 3.3. For efficient training, the initialization parameters were reduced such that the network could transmit a strong gradient via backpropagation. However, softening these parameters moves the deep CNN further from its initialization by the equivalent stacked RF. We evaluated a range of initialization parameters and found , , to be a good compromise.
We trained the CNN using backpropagation and stochastic gradient descent (SGD), with a cross-entropy loss function. During backpropagation, we maintained the sparse connectivity from RF initialization, allowing only the weights on pre-existing edges to change, corresponding to the sparse training scheme from [33].
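The sparse training scheme can be pictured as a masked gradient step: weights on edges that are absent after RF initialization stay at exactly zero. The sketch below is illustrative only and is not the authors' implementation. The function name `masked_sgd_step` is hypothetical, classical momentum is assumed, and the default learning rate and momentum are placeholders rather than the paper's settings.

```python
def masked_sgd_step(weights, grads, velocity, mask, lr=0.1, momentum=0.5):
    """One SGD-with-momentum step that updates only the weights where mask is 1.

    weights, grads, velocity and mask are equally sized lists of rows.
    Weights at masked-out positions are assumed to start at zero.
    """
    new_weights, new_velocity = [], []
    for w_row, g_row, v_row, m_row in zip(weights, grads, velocity, mask):
        # Multiplying by the mask zeroes the update on edges that do not exist,
        # so those weights can never become non-zero.
        v_new = [m * (momentum * v - lr * g)
                 for g, v, m in zip(g_row, v_row, m_row)]
        new_velocity.append(v_new)
        new_weights.append([w + v for w, v in zip(w_row, v_new)])
    return new_weights, new_velocity
```

Deriving the mask once from the non-zero pattern of the RF-initialized weights and reusing it at every step keeps the connectivity fixed for the whole of training.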
Since the network is designed for whole-image inputs, we first cropped the training images around the region of foreground pixels, and then downsampled them by 25x. Learning rate, , was set such that for the iteration of SGD, with hyperparameters and iterations. Momentum, , was set according to the following schedule: , where [30].
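The preprocessing described above can be sketched as a bounding-box crop followed by subsampling. This is an illustration under stated assumptions: background pixels are assumed to have value 0, plain strided subsampling stands in for whatever resampling method was actually used, and both function names are hypothetical.

```python
def crop_to_foreground(image, background=0):
    """Crop a 2D list-of-rows image to the bounding box of its non-background pixels."""
    rows = [r for r, row in enumerate(image)
            if any(v != background for v in row)]
    if not rows:
        return []  # the image contains no foreground pixels
    cols = [c for c in range(len(image[0]))
            if any(row[c] != background for row in image)]
    top, bottom = rows[0], rows[-1]
    left, right = cols[0], cols[-1]
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def downsample(image, factor):
    """Keep every `factor`-th pixel in each dimension (strided subsampling)."""
    return [row[::factor] for row in image[::factor]]
```

Cropping before downsampling means the retained pixels are spent on the foreground region rather than on empty background.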
Stacked RF. We trained a three-level RF stack, with the following forest parameters at every level: 16 trees, maximum depth 12, stop node splitting if fewer than 25 samples. Features were extracted from the images using a standard filter bank, and then normalized to zero mean and unit variance. The number of random features tested in each node was set to the square root of the total number of input features. For each randomly selected feature, 10 additional contextual features were also considered, with X and Y offsets within a 129x129 pixel window. Training samples were generated by subsampling the training images 3x in each dimension and then randomly selecting of these samples for training.

CNN. We mapped the RF stack to a deep CNN with 8 hidden layers. The CNN was initialized and trained exactly as for the Kinect example, with the following exceptions: (i) we used a class-balanced cross-entropy loss function; (ii) training samples were generated by subsampling the training images 9x in each dimension; (iii) learning rate parameters were as follows: and iterations; (iv) momentum was initialized to , and increased to after iterations. We observed convergence after only 12 passes through the training data, similar to what was reported by [12].
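One common way to class-balance a cross-entropy loss is to weight each pixel by the inverse frequency of its true class, so that rare classes are not swamped by frequent ones. The text above does not state which weighting scheme was used, so the inverse-frequency weights below are an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter


def class_balanced_cross_entropy(probs, labels):
    """Cross-entropy in which every class present contributes equally.

    probs:  one list of class probabilities per pixel.
    labels: the true class index of each pixel.
    Each pixel is weighted by 1 / (number of pixels of its class), and the
    weighted sum is then averaged over the classes that occur in `labels`.
    """
    counts = Counter(labels)
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for p, y in zip(probs, labels):
        loss += -math.log(max(p[y], eps)) / counts[y]
    return loss / len(counts)
```

With this weighting, a class that covers only a handful of pixels influences the loss as much as the background class does.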
CNN from Random Initialization. As discussed in Section 4.2 of the paper, for comparison to the RF-initialized weights described above, we also trained CNNs with the same architecture, but with random weight initialization. Weights were initialized according to a Gaussian distribution with zero mean and standard deviation, . We applied a similar SGD training routine, and re-tuned the hyperparameters as follows: , iterations, momentum was initialized to 0.4 and increased to 0.99 after 96 iterations. Larger step sizes failed to train. Networks were trained for 2500 iterations.

Fully Convolutional Network. As discussed in Section 4.2 of the paper, we also compared our method with the Fully Convolutional Network (FCN) [19]. This network was downloaded from Caffe's Model Zoo (https://github.com/BVLC/caffe/wiki/ModelZoo#fcn) and initialized with weights fine-tuned from the ILSVRC-trained VGG16 model. We trained all layers of the network using SGD with a learning rate of , momentum of and weight decay of . See Figure 9(b) for an example of the resulting segmentation.
On the importance of initialization and momentum in deep learning. In ICML, 2013.