Nearly all current deep learning methods rely on a vast number of floating point operations and rather simplistic combinations of trainable convolution filters and rectifying non-linearities. Another direction of machine learning, based on non-linear multidimensional mapping through ensembles of binary decision trees [Criminisi et al.(2012)Criminisi, Shotton, Konukoglu, et al.] or random ferns [Ozuysal et al.(2009)Ozuysal, Calonder, Lepetit, and Fua], has become less relevant because these approaches are difficult or impossible to embed into end-to-end trainable networks. However, random ferns have proven to be fast and efficient feature extractors in connection with separately trainable layers [Kim et al.(2019)Kim, Jeong, Lee, and Ko]. Designing differentiable decision boundaries in deeper binary trees is considered very challenging; previous work has therefore focused on combining CNNs with differentiable (neural) forests [Kontschieder et al.(2015)Kontschieder, Fiterau, Criminisi, and Rota Bulo]. In parallel, much recent research has been devoted to binary networks [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] that avoid memory- and computation-intensive floating point matrix multiplications. We propose a method to efficiently use random ferns within an end-to-end trainable architecture to replace convolutions without using floating point multiplications.
To explain the proposed method, we first revisit the concept of a random fern ensemble without optimisation. We then describe our proposed differentiable binary embedding with weighted sums and extend it to multi-layer and convolutional architectures.
For the most part, the procedure during a standard convolution is identical to our method (unfold im2col [Chetlur et al.(2014)Chetlur, Woolley, Vandermersch, Cohen, Tran, Catanzaro, and Shelhamer], matrix-multiplication, fold), since only the matrix-multiplication part is replaced by the differentiable random fern implementation. Therefore, our method is a potential drop-in replacement for convolutions using a Look-Up-Table (LUT) instead of multiplications.
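The unfold, matrix-multiplication, fold pipeline referred to above can be illustrated with a minimal plain-Python sketch. It assumes a single-channel image, stride 1 and no padding; the function names `im2col` and `conv_via_matmul` are illustrative and not taken from the original work. The matrix-multiplication step in the middle is the part the fern ensemble would replace.

```python
def im2col(image, k):
    """Unfold a 2-D image into rows of flattened k x k patches (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj] for di in range(k) for dj in range(k)])
    return rows


def conv_via_matmul(image, kernels, k):
    """Convolution (cross-correlation, as usual in deep learning) as unfold -> matmul -> fold."""
    ufm = im2col(image, k)  # unfolded feature matrix: one row per patch
    # Matrix multiplication: every patch row times every flattened kernel.
    out_cols = [[sum(p * w for p, w in zip(patch, kern)) for kern in kernels]
                for patch in ufm]
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    # Fold: rearrange the columns into one out_h x out_w map per kernel.
    return [[[out_cols[i * out_w + j][c] for j in range(out_w)]
             for i in range(out_h)]
            for c in range(len(kernels))]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[1, 0, 0, 1],   # flattened 2x2 filter
           [1, 1, 1, 1]]   # flattened 2x2 box filter
feature_maps = conv_via_matmul(image, kernels, 2)
print(feature_maps)  # [[[6, 8], [12, 14]], [[12, 16], [24, 28]]]
```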
To generate a classical random fern ensemble classifier, two sets are first drawn at random for every fern, each with a size corresponding to the depth . The first set consists of a random subset of the input feature dimensions and the second set contains a number of thresholds . In contrast to optimized decision trees, where different dimensions of a feature vector are examined along the path to a leaf node, traversing a fern always yields the same sequence of feature dimensions, whose contents are compared with the fixed thresholds (see orange and green lines in fig:fern_scheme at the unfolded feature matrix (UFM) as fixed dimensions of interest for the Ferns and ). Every comparison results in a binary output, and the resulting binary string encodes an index into the class-specific histograms of each fern.
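The index computation of a classical fern described above can be sketched as follows. This is a minimal illustration, not the original implementation: the helper names, the uniform threshold range of [-1, 1] and the strict `>` comparison are assumptions made for the example.

```python
import random


def make_fern(num_features, depth, rng):
    """Draw a random fern: `depth` feature dimensions and `depth` thresholds.

    The threshold range [-1, 1] is an assumption for this sketch.
    """
    dims = rng.sample(range(num_features), depth)
    thresholds = [rng.uniform(-1.0, 1.0) for _ in range(depth)]
    return dims, thresholds


def fern_index(row, fern):
    """Compare the fixed feature dimensions with their thresholds and
    read the resulting bit string as an integer histogram index."""
    dims, thresholds = fern
    index = 0
    for d, t in zip(dims, thresholds):
        bit = 1 if row[d] > t else 0
        index = (index << 1) | bit
    return index


# A hand-picked fern of depth 3 so the result can be checked by hand.
fern = ([2, 0, 3], [0.5, 0.0, -0.2])
row = [0.3, -1.0, 0.9, -0.7]
print(fern_index(row, fern))  # bits 1, 1, 0 -> 0b110 = 6

# A randomly drawn fern always inspects the same dimensions for every input.
random_fern = make_fern(num_features=4, depth=3, rng=random.Random(0))
```

A fern of depth 3 can address 2^3 = 8 histogram bins, so the index always lies between 0 and 7.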
In contrast to previous work, the output of each fern is in our case not a data-driven (normalised) histogram but is learned directly as a feature vector. Inspired by Natural Language Processing [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean], we use the EmbeddingBag layer to map a dictionary (here of size ) into an output space of different dimensionality, effectively implementing all class histograms of a fern ensemble as a single large LUT. Inside the red-dotted box in fig:fern_scheme, we follow the original fern algorithm by feeding rows of the UFM (generated with the im2col operator) through the random fern ensemble, except for the minor deviation of applying a tanh function after the threshold subtraction, resulting in a vector per fern and row . Taking the sign of , converting it to its decimal representation and adding an appropriate offset per fern gives the corresponding LUT index position and its embedding weights.
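The mapping from a UFM row to a position in the shared LUT can be sketched in plain Python as below. The function names are hypothetical, and two details are assumptions of this sketch rather than statements from the text: a tanh output of exactly zero is mapped to bit 0, and the per-fern offset is taken to be the fern number times 2^depth, so that each fern owns a contiguous block of LUT rows.

```python
import math


def soft_bits(row, dims, thresholds):
    """Threshold subtraction followed by tanh: one value in (-1, 1) per fern level."""
    return [math.tanh(row[d] - t) for d, t in zip(dims, thresholds)]


def lut_index(t, fern_id):
    """Sign of t -> bit string -> decimal index, shifted by a per-fern offset."""
    depth = len(t)
    index = 0
    for v in t:
        index = (index << 1) | (1 if v > 0 else 0)
    return fern_id * (2 ** depth) + index  # offset: assumed block layout


ferns = [([2, 0, 3], [0.5, 0.0, -0.2]),
         ([1, 3, 0], [0.0, -1.0, 0.1])]
row = [0.3, -1.0, 0.9, -0.7]

indices = [lut_index(soft_bits(row, dims, thr), fern_id)
           for fern_id, (dims, thr) in enumerate(ferns)]
print(indices)  # [6, 11]: fern 0 -> 0b110, fern 1 -> 8 + 0b011
```

Because tanh preserves the sign of its argument, the selected bits are the same as those of the hard threshold comparison in a classical fern; the tanh values only matter for the weighting that makes training possible.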
The key observation for making these discretely addressed LUT embeddings differentiable is that, while the feature indices based on the UFM provide no gradient to train networks, a scalar instance weight that measures the proximity of continuous-valued feature vectors to binary strings is well suited to enable end-to-end training. Here it is obtained by computing the mean distance of absolute entries to 1: . The weighted sums of these LUT entries form the unfolded output feature matrix containing , before a final col2im operation reshapes the data.
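A rough sketch of the weighted LUT lookup for a single UFM row is given below. The exact weight formula is not recoverable from the text here, so `instance_weight` implements a literal reading of "mean distance of absolute entries to 1", namely the mean of 1 - |t_d|; whether the original uses this quantity or its complement is an assumption left open. All names are illustrative. In PyTorch, a lookup of this kind would presumably map onto `torch.nn.EmbeddingBag` in sum mode with `per_sample_weights`; the sketch emulates that in plain Python and is only meant to show where the gradient path comes from, not the arithmetic cost.

```python
import math


def instance_weight(t):
    """Assumed form of the scalar weight: mean of 1 - |t_d| over the fern levels."""
    return sum(1.0 - abs(v) for v in t) / len(t)


def fern_layer_row(row, ferns, lut):
    """Weighted sum of the LUT rows selected by every fern for one UFM row."""
    out = [0.0] * len(lut[0])
    for fern_id, (dims, thresholds) in enumerate(ferns):
        t = [math.tanh(row[d] - thr) for d, thr in zip(dims, thresholds)]
        bits = 0
        for v in t:
            bits = (bits << 1) | (1 if v > 0 else 0)
        idx = fern_id * 2 ** len(dims) + bits  # discrete: carries no gradient
        w = instance_weight(t)                 # smooth in `row`: carries gradient
        for c in range(len(out)):
            out[c] += w * lut[idx][c]
    return out


# One fern of depth 1, two output channels, LUT with 2 rows.
ferns = [([0], [0.0])]
lut = [[1.0, 2.0],
       [3.0, 4.0]]
out = fern_layer_row([0.5], ferns, lut)  # selects LUT row 1, scaled by 1 - tanh(0.5)
```

The index `idx` is piecewise constant in the input, while `w` varies smoothly through the tanh, which is why the weight rather than the index can pass gradients back to earlier layers.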
3 Experiments & Results
We perform our experimental validation on the binary classification task of the Tumor Proliferation Assessment Challenge 2016 and show that networks of relatively shallow ferns with very few trainable weights can be learned and reach high classification accuracy. As baseline comparisons, we use a Vanilla CNN architecture and its conversion to an XNOR net [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi].
In all 3 experiments, we use the same network architecture: a 5-layer network defined by the following encoding scheme : 1) , 2) , 3) , 4) AdaptiveAvgPool, 5) . The Vanilla net uses ReLU activation functions after the first 3 layers, whereas the XNOR counterpart already achieves non-linearity through its input binarization. For the Fern net, we replace the backbone with our Fern-Ensemble layers, while the spatial operations (unfolding & folding) remain untouched. In every layer, we use 24 ferns with a depth of only 3. Here, index binarization provides the non-linearity. With only half the parameters, our approach falls just short of the Vanilla CNN implementation (tab:results) and outperforms the XNOR net. Regarding the energy consumption of processing a single input image according to [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio], our Fern net is by far the most efficient approach.
We presented a novel approach that enables the use of random ferns within an end-to-end trainable convolutional architecture and achieves classification results on par with state-of-the-art binary XNOR nets without using floating point multiplications. Spatial convolutions can be easily integrated into fern-like architectures by employing the im2col operator. In contrast to conventional ferns, which build class histograms in a purely data-driven way, we learn the embedding directly, following the end-to-end trainable paradigm of learning task-specific feature extractors and classifiers simultaneously.
This work was supported by the German Research Foundation (DFG) under grant number 320997906 (HE 7364/2-1). We gratefully acknowledge the support of the NVIDIA Corporation with their GPU donations for this research.
- [Chetlur et al.(2014)Chetlur, Woolley, Vandermersch, Cohen, Tran, Catanzaro, and Shelhamer] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759, 2014.
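- [Criminisi et al.(2012)Criminisi, Shotton, Konukoglu, et al.] Antonio Criminisi, Jamie Shotton, Ender Konukoglu, et al. Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends® in Computer Graphics and Vision, 7(2–3):81–227, 2012.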
- [Hubara et al.(2016)Hubara, Courbariaux, Soudry, El-Yaniv, and Bengio] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in neural information processing systems, pages 4107–4115, 2016.
- [Kim et al.(2019)Kim, Jeong, Lee, and Ko] Sangwon Kim, Mira Jeong, Deokwoo Lee, and Byoung Chul Ko. Deep coupling of random ferns. In
- [Kontschieder et al.(2015)Kontschieder, Fiterau, Criminisi, and Rota Bulo] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In Proceedings of the IEEE international conference on computer vision, pages 1467–1475, 2015.
- [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- [Ozuysal et al.(2009)Ozuysal, Calonder, Lepetit, and Fua] Mustafa Ozuysal, Michael Calonder, Vincent Lepetit, and Pascal Fua. Fast keypoint recognition using random ferns. IEEE transactions on pattern analysis and machine intelligence, 32(3):448–461, 2009.
- [Rastegari et al.(2016)Rastegari, Ordonez, Redmon, and Farhadi] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer, 2016.