Over the last ten years, computer-vision-based object perception has steadily moved away from hand-crafted features such as HOG [1] and SIFT [2], and significant effort has gone into feature learning. An important characteristic of state-of-the-art feature learning methods [3, 4, 5, 6] is the graduated complexity of the features as layers are added to the network, thus forming a hierarchy.
Two paradigms have emerged in the design of hierarchical models, which differ in the definition of the unit learned at each layer. The first paradigm is that of compositional models [7, 8]. We refer to a model as compositional when its features at a layer are modeled as an explicit combination of features in the lower layer. These models allow fast inference via inverted indexing, inherently produce region proposals, and offer straight-forward reconstruction from partial observations and visualization of features. However, the learning cost function is weakly defined, and learning is usually performed via co-occurrence statistics [7, 8]. While attempts have been made to discriminatively re-interpret the learned reconstructive parts [9, 10], learning discriminative parts directly remains an open issue. On the other hand, the paradigm of convolutional models has recently gained significant momentum in feature learning. These models define the feature units at each layer as filters, which afford learning by back-propagation. In particular, convolutional neural networks (CNNs) [4, 3] have emerged as a highly successful representative of this class of hierarchical models. A longstanding criticism of CNNs is the lack of precise spatial relationships between high-level parts, a reason advanced for moving towards viewpoint-invariant capsule-like systems [11]. The lack of structure in the basic units of CNNs also hinders robust handling of occlusions and missing parts [12, 10], and prohibits straight-forward visualization and understanding of networks. Approximate visualization techniques have been developed [13, 14, 15], but further attempts at understanding CNNs uncovered unintuitive behavior when small input perturbations were applied [16]. Recent efforts have also been made to approximate the learned filters in order to reduce the computational complexity of the learned CNNs [17], an issue typically addressed by a brute-force increase in CPU/GPU power.
In this paper we propose a novel form of the unit in a deep hierarchical model with an explicit compositional structure (Fig. 1). While stacked weights in deep networks can be considered as compositions, they lack an explicit structure that could further expose and leverage compositional properties. Our proposed unit exposes an explicit structure of compositions as a parametric model over spatial clustering of responses from the lower layer, and can directly be embedded into the learning framework used in CNNs. This allows learning of hierarchical compositional models with a well-defined, potentially discriminative, cost function similar to that used in convolutional neural networks, and retains the benefits of compositional models, such as a precise encoding of the spatial relationship between parts. We derive the necessary equations for back-propagation and propose a compositional model trained by a discriminative cost function, which is the major contribution of our paper. We experimentally evaluate the proposed model on the CIFAR-10 and PaCMan [19] datasets and show that our model achieves performance comparable to a standard CNN, while allowing simple compositional mean-reconstruction of parts. Since our units are separable by design, we also demonstrate a significant speed-up in inference compared to CNNs.
II Deep compositional network
We first provide notation for deep convolutional neural networks and then derive our compositional network. A neuron activation in a deep neural network is modeled with a linear function wrapped inside a specific non-linearity: $y = f\left(\sum_j w_j x_j + b\right)$, with output activation $y$, input activations $x_j$, bias $b$ and weights $w_j$ determining a linear combination of input activations further modified by a non-linear function $f$. In the image domain the neuron activations are organized in a 3-dimensional matrix $X$ of size $N \times M \times S$, where two dimensions, $N \times M$, represent the image or feature plane, and the third dimension, $S$, represents channels. With the introduction of weight-sharing along the 2D feature plane, the convolutional neural network models the neuron's activation function as:

$$Y = f\Big(\sum_{s} W_s * X_s + b\Big), \tag{1}$$

where $W_s * X_s$ is a convolution of $X_s$, the activation map from the $s$-th channel, with $W_s$, the weights for the $s$-th channel. The $W_s$ are the basic units in CNNs; they take the form of convolution filters and have to be learned from the data. The convolutions give an output map for each channel. Element-wise summation over the channels, followed by the non-linearity, gives the final activation map $Y$. Typically, several activation maps are created, and the network is organized in layers, such that the input $X_s$ of the $l$-th layer is an output activation map from layer $l-1$.
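As an illustration of the activation model in Eq. (1), the following minimal NumPy sketch sums per-channel convolutions, adds a bias and applies a ReLU. The function names `conv2d_same` and `activation_map`, the zero padding, and the choice of ReLU for the non-linearity are assumptions made for this example only; odd kernel sizes are assumed.

```python
import numpy as np

def conv2d_same(x, w):
    """True 2D convolution (flipped kernel), zero padded to keep the input size.
    Assumes an odd kernel size."""
    kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    wf = w[::-1, ::-1]  # flip the kernel: convolution rather than correlation
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * wf)
    return out

def activation_map(X, W, b):
    """Y = f(sum_s W_s * X_s + b) with f = ReLU.
    X has shape (S, N, M), W has shape (S, Kh, Kw), b is a scalar."""
    z = sum(conv2d_same(X[s], W[s]) for s in range(X.shape[0])) + b
    return np.maximum(z, 0.0)

# A unit impulse as input reproduces the (non-negative) filter around the impulse.
X = np.zeros((1, 5, 5))
X[0, 2, 2] = 1.0
W = np.arange(1.0, 10.0).reshape(1, 3, 3)
Y = activation_map(X, W, 0.0)
```

The explicit loops keep the sketch dependency-free; a practical implementation would use an optimized convolution routine instead.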
We propose a new basic unit that explicitly models the composition of a feature. We define it as a weighted Gaussian component:

$$U_k = w_k\, G(\theta_k),$$

with weight $w_k$ and Gaussian parameters $\theta_k = (\vec{\mu}_k, \sigma_k)$, containing mean $\vec{\mu}_k$ and variance $\sigma_k^2$. Multiple units $U_k$ can be applied to the same input channel $s$, but we omit this in the notation in the interest of clarity. Units take the form of convolution filters and have $K_w \times K_h$ elements. We therefore define $G(\theta_k)$ as a two dimensional matrix of the same size with 2-dimensional index $\vec{x}$ over elements:

$$G(\vec{x}; \theta_k) = \frac{1}{2\pi\sigma_k^2} \exp\left(-\frac{\|\vec{x}-\vec{\mu}_k\|^2}{2\sigma_k^2}\right).$$

We use two dimensional means but single dimensional variance for simplification. Commonly used unit sizes are only a few pixels wide; however, such small sizes can lead to significant discretization errors in $G(\theta_k)$. We avoid this by replacing the normalization factor computed in continuous space with one computed in the discretized space, leading to our final distribution function $G(\vec{x}; \theta_k)$:

$$G(\vec{x}; \theta_k) = \frac{1}{N(\theta_k)}\, g(\vec{x}; \theta_k),$$

where $g(\vec{x}; \theta_k)$ is a non-normalized Gaussian distribution and $N(\theta_k)$ is a sum over this non-normalized Gaussian distribution computed for a filter of size $K_w \times K_h$:

$$g(\vec{x}; \theta_k) = \exp\left(-\frac{\|\vec{x}-\vec{\mu}_k\|^2}{2\sigma_k^2}\right), \qquad N(\theta_k) = \sum_{\vec{x}} g(\vec{x}; \theta_k).$$

A new, compositional unit can be embedded into a CNN by grouping multiple instances applied to the same input channel and deriving the basic CNN unit $W_s$ from Eq. (1):

$$W_s = \sum_{k} w_k\, G(\theta_k). \tag{6}$$

The proposed model for each filter $W_s$ is similar to a standard Gaussian mixture model, but we do not enforce $\sum_k w_k = 1$, and component weights can take any value, $w_k \in (-\infty, \infty)$. Having negative weights is important to approximate edges as differences between neighboring components: positive components can be interpreted as requirements for the presence of a feature and negative components as requirements for the absence of a feature. In principle, the mixture constraint could be satisfied by normalizing the sum of absolute weights, $\sum_k |w_k| = 1$, but this would significantly complicate the gradient computation without any performance gain.
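To make the construction concrete, the sketch below builds one filter as a weighted sum of discretely normalized Gaussian components, following the description of Eq. (6). The function names, the (row, column) convention for the means, and all numeric values are assumptions chosen for illustration; they are not taken from the trained networks.

```python
import numpy as np

def gaussian_component(mu, sigma, size):
    """Discretely normalized Gaussian on a size[0] x size[1] grid.
    mu is (row, col); sigma is a single isotropic standard deviation."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    d2 = (ys - mu[0]) ** 2 + (xs - mu[1]) ** 2
    g = np.exp(-d2 / (2.0 * sigma ** 2))  # non-normalized Gaussian
    return g / g.sum()                    # normalize by the sum over the window

def compose_filter(weights, mus, sigmas, size):
    """Filter as a weighted sum of Gaussian components; weights may be negative."""
    return sum(w * gaussian_component(mu, s, size)
               for w, mu, s in zip(weights, mus, sigmas))

# Hypothetical edge-like filter: a positive and a negative blob side by side.
W = compose_filter([1.0, -1.0], [(3.0, 2.0), (3.0, 4.0)], [0.8, 0.8], (7, 7))
```

With one positive and one negative component of equal magnitude the filter sums to zero and changes sign across the middle column, which is the difference-of-blobs behavior that lets negative weights approximate edges.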
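Learning an individual unit consists of learning its parameters for the Gaussian distribution, mean $\vec{\mu}_k$ and variance $\sigma_k^2$, together with the weight $w_k$. The number of units, i.e., the number of Gaussian components per input channel, can be considered as a hyper-parameter. Learning can be performed in the same way as in convolutional networks via gradient descent. Parameters are optimized by computing the gradients w.r.t. the cost function $C$, which leads to three different types of gradients. By applying the chain rule we can define the gradient for the component weight $w_k$ as a dot-product of back-propagated error and the input feature $X_s$ convolved with the $k$-th Gaussian component:

$$\frac{\partial C}{\partial w_k} = \sum_{n,m} \frac{\partial C}{\partial Z} \cdot \big(X_s * G(\theta_k)\big),$$

where $Z = \sum_s W_s * X_s + b$, the sum runs over all positions $(n, m)$ of the feature plane, and $\frac{\partial C}{\partial Z}$ is back-propagated error. Note that only the $s$-th channel of input features is used, since the weight component $w_k$ appears only in $W_s$. The back-propagated error for the lower layer is computed the same as in a standard convolutional network:

$$\frac{\partial C}{\partial X_s} = \frac{\partial C}{\partial Z} * \operatorname{rot180}(W_s),$$

where the back-propagated error from the higher layer is convolved with the 180° rotated unit (a weight filter) $W_s$, which can be computed from Eq. (6). We can similarly apply the chain rule to obtain the gradient for the mean and the variance:

$$\frac{\partial C}{\partial \vec{\mu}_k} = \sum_{n,m} \frac{\partial C}{\partial Z} \cdot \Big(X_s * w_k \frac{\partial G(\theta_k)}{\partial \vec{\mu}_k}\Big), \qquad \frac{\partial C}{\partial \sigma_k} = \sum_{n,m} \frac{\partial C}{\partial Z} \cdot \Big(X_s * w_k \frac{\partial G(\theta_k)}{\partial \sigma_k}\Big),$$

where, with $G = g / N$ the discretely normalized Gaussian ($g$ the non-normalized Gaussian and $N$ its sum over the filter window), the derivatives of the Gaussian are:

$$\frac{\partial G}{\partial \vec{\mu}_k} = \frac{N \frac{\partial g}{\partial \vec{\mu}_k} - g \frac{\partial N}{\partial \vec{\mu}_k}}{N^2}, \qquad \frac{\partial G}{\partial \sigma_k} = \frac{N \frac{\partial g}{\partial \sigma_k} - g \frac{\partial N}{\partial \sigma_k}}{N^2},$$

$$\frac{\partial g}{\partial \vec{\mu}_k} = g\, \frac{\vec{x}-\vec{\mu}_k}{\sigma_k^2}, \qquad \frac{\partial g}{\partial \sigma_k} = g\, \frac{\|\vec{x}-\vec{\mu}_k\|^2}{\sigma_k^3}, \qquad \frac{\partial N}{\partial \vec{\mu}_k} = \sum_{\vec{x}} \frac{\partial g}{\partial \vec{\mu}_k}, \qquad \frac{\partial N}{\partial \sigma_k} = \sum_{\vec{x}} \frac{\partial g}{\partial \sigma_k}.$$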
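The derivatives of a discretely normalized Gaussian can be checked numerically. The sketch below assumes the form $G = g / N$, with $g$ the non-normalized Gaussian and $N$ its sum over the filter window, differentiates it with the quotient rule, and compares the result against central finite differences. The function names, the (row, column) convention for the mean and the test values are assumptions made for this example.

```python
import numpy as np

def gaussian(mu, sigma, size):
    """Non-normalized Gaussian g and the offsets (y - mu_y), (x - mu_x)."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]].astype(float)
    dy, dx = ys - mu[0], xs - mu[1]
    g = np.exp(-(dy ** 2 + dx ** 2) / (2.0 * sigma ** 2))
    return g, dy, dx

def G(mu, sigma, size):
    """Discretely normalized Gaussian G = g / N."""
    g, _, _ = gaussian(mu, sigma, size)
    return g / g.sum()

def dG(mu, sigma, size):
    """Analytic derivatives of G w.r.t. mu (row, col) and sigma (quotient rule)."""
    g, dy, dx = gaussian(mu, sigma, size)
    N = g.sum()
    dg = {"mu_y": g * dy / sigma ** 2,
          "mu_x": g * dx / sigma ** 2,
          "sigma": g * (dy ** 2 + dx ** 2) / sigma ** 3}
    # dN is the sum of dg over the window; apply (N dg - g dN) / N^2
    return {name: (N * v - g * v.sum()) / N ** 2 for name, v in dg.items()}

# Compare with central finite differences at an arbitrary test point.
mu, sigma, size, eps = np.array([2.3, 3.1]), 1.2, (7, 7), 1e-6
analytic = dG(mu, sigma, size)
numeric_sigma = (G(mu, sigma + eps, size) - G(mu, sigma - eps, size)) / (2 * eps)
numeric_mu_y = (G(mu + [eps, 0], sigma, size) - G(mu - [eps, 0], sigma, size)) / (2 * eps)
```

Because $G$ always sums to one over the window, each derivative map sums to zero, which is a convenient sanity check in addition to the finite-difference comparison.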
III Classification performance
The evaluation on both datasets is performed with a network containing three layers. The first two layers are convolutional/compositional and the third one is fully-connected. We use soft-max with multinomial logistic loss as the cost function. Either a three-channel RGB image (CIFAR-10) or a single-channel gray-scale image (PaCMan) is used as input data with zero-mean normalization. The data is not normalized to unit variance, since between each layer we use a ReLU non-linearity, which is less sensitive to data variance. Note that we use slightly bigger filters (basic units) in our network, but use fewer components to approximately match the number of parameters of the standard CNN model. We also restrict the components' positions and standard deviations to ensure that the derivatives of the resulting filters do not have non-zero values outside of the valid window. This also prevents collapsing to a single point and stalling. Positions are kept a minimum distance away from the borders of the valid window, and standard deviations are restricted as well. We apply AdaDelta [20] with momentum and no weight-decay to achieve proper behavior in gradient descent.
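One simple way to impose such restrictions is to project the Gaussian parameters back into the allowed region after every update. The sketch below is an assumption about how this could be done, not a description of the implementation used in the experiments; the function name `clamp_components`, the border margin, the lower bound on the standard deviation, and the choice of a lower bound only are all hypothetical.

```python
import numpy as np

def clamp_components(mus, sigmas, size, border, sigma_min):
    """Project Gaussian parameters back into the allowed region.
    mus: (K, 2) array of (row, col) means, sigmas: (K,) standard deviations,
    size: (rows, cols) of the valid window. All bounds are hypothetical."""
    hi = np.asarray(size, dtype=float) - 1.0 - border  # largest allowed (row, col)
    mus = np.clip(mus, border, hi)                     # keep means away from the borders
    sigmas = np.maximum(sigmas, sigma_min)             # avoid collapse to a single point
    return mus, sigmas

# Example with made-up values: one mean outside the window, one sigma too small.
mus = np.array([[-0.5, 3.0], [6.9, 2.0]])
sigmas = np.array([0.1, 1.5])
mus_c, sigmas_c = clamp_components(mus, sigmas, (7, 7), border=1.0, sigma_min=0.5)
```

Calling such a projection after each optimizer step keeps the constraint handling separate from the gradient computation.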
III-A Classification on CIFAR-10
The CIFAR-10 [18] dataset consists of 60,000 images split into 50,000 training and 10,000 testing images. We perform training with a mini-batch size of 100 images per iteration and run learning for 5000 iterations. A detailed configuration of the network used is shown in Table I. Comparing the performance of both models, shown in Fig. 2, we see the same accuracy, with both achieving slightly less than 70% on the testing set. The left-most two graphs reveal similar learning rates, with the standard CNN learning slightly faster, but both converge to a similar loss in the end.
III-B Classification on PaCMan database
The proposed compositional network was also evaluated on the PaCMan dataset [19]. This dataset contains gray-scale and depth images generated from 3D models of 20 categories of various kitchen objects, with each category containing 20 different instances of objects. Each object is captured at dense, regular viewpoint intervals, but we use only 28 different viewpoints, giving a total of around 50,000 images. We use only gray-scale images. The dataset was split into approximately 25,000 samples for training and 25,000 for testing. We ensure that all viewpoints of the same object are in the same split and that each category has proportionally the same number of objects in testing and training. The input images are resized to fit the network into GPU memory. The network configuration, shown in Table II, is slightly modified to accommodate higher resolution images.
Fig. 3 confirms that the proposed compositional network achieves discriminative performance on a par with the CNN – both models attain an accuracy of 64-67%.
We also visualize CNN filters and units in our network for features on the first and the second layer in Fig. 3. The units in our network applied to the same input channel are visualized in a single filter based on Eq. (6), while the individual components' means are plotted as small circles, and variances as large circles. Filters in the standard CNN have a certain structure but they are still noisy and incoherent. It would also be difficult to capture this structure without human interpretation. On the other hand, our unit with Gaussian models explicitly captures the spatial structure, as can be seen in the first-layer filters. Many components converge to the same location and the configuration of components directly points to different edge or blob detectors. On the second layer only a small set of components have high weights, indicating that most are irrelevant for the final classification, which offers further simplifications of the network. Compared to the filters from the standard CNN they are more compact. With Gaussian modeling the spatial position of a sub-feature is much clearer and easily determined, and, as we show in the next section, can be further utilized.
IV Utilizing compositional representation
In this section we demonstrate two advantages of having a rich, compositional representation. We propose a novel visualization of deep networks using mean reconstruction of compositions and demonstrate the inference speed-up due to inherent kernel separability in compositions.
IV-A Visualization by mean reconstruction
Feature visualization in standard deep networks is difficult due to lack of structure. Visualization is thus usually performed indirectly by deconvolution [13] or optimization of input pixels to maximize the output activations [14, 15]. Such visualization techniques are applicable to our model as well, but having explicit compositions enables exploration of more straight-forward visualization techniques.
A feature can be visualized by finding an image that produces a maximum output response for its unit. Based on Eq. (1) a maximum response is obtained when all sub-features match specific patterns defined in the weights $W_s$. In our model each $W_s$ consists of individual compositions $w_k G(\theta_k)$, thus the premise can further be extended to having a maximum response when individual sub-features are present at a specific position, i.e., at the mean of a Gaussian in our case. We propose to visualize a single feature by recursively projecting compositions top-down to image pixels by following (indexing) the corresponding means in network units. We term this process mean reconstruction, similar to visualization techniques for hierarchical compositions [21].
During each step of back-projection, several properties of the compositions need to be correctly accounted for: (a) expected uncertainty of the position of a sub-feature (variance), (b) the importance of a sub-feature (weight) and (c) requested presence or absence of a sub-feature (weight sign). The uncertainty is defined by a Gaussian variance and grows with each step of back-projection. We account for this by summing the variances along each step to arrive at the final uncertainty at the pixel level. The importance of a sub-feature is accounted for by multiplying the magnitude of weight at each back-projected step. We consider only sub-feature components with a positive weight, i.e., ones that request the presence of a sub-feature. Sub-feature components with negative weights are ignored. This is applied to all layers, except the first one, since negative values of features are truncated by the ReLU layer on all layers except for the first one. We therefore consider the first layer as a special case and use compositions with negative weights as well.
Positive and negative weights in the first layer in most cases define edges. But edges are defined indirectly, since means in Gaussians define positions of blobs, i.e., regions with low intensity or high intensity values. Edges can be inferred from neighboring blobs with opposing signs and are free to occur anywhere between blobs, either as a smooth transition or a sharp edge. Consequently, after all compositions are back-projected from the top to the bottom we are left with two sets of Gaussian distributions, ones with the positive sign and ones with the negative sign. They are visualized in the second and the fourth row of Fig. 4. This visualization is performed by summing over all positive and over all negative distributions for each pixel. We refer to a map obtained this way as a reconstructed-blobs map or distribution map.
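The back-projection rules described above (follow the means, sum the variances, multiply the weights, and keep negative weights only at the first layer) can be sketched for a two-layer network as follows. This is a simplified illustration under stated assumptions: means are given as offsets from the filter center, no pooling or stride is applied between the layers, and the data layout and function names are hypothetical.

```python
import numpy as np

def back_project(layer2, layer1):
    """layer2: list of (weight, offset, variance, s) components of one
    second-layer feature, where s indexes a first-layer feature.
    layer1: dict s -> list of (weight, offset, variance) components.
    Returns pixel-level blobs as (weight, offset, variance)."""
    blobs = []
    for w2, off2, var2, s in layer2:
        if w2 <= 0:  # above the first layer, negative components are ignored
            continue
        for w1, off1, var1 in layer1[s]:
            blobs.append((w2 * w1,              # importance multiplies, sign from layer 1
                          np.add(off2, off1),   # follow the means top-down
                          var2 + var1))         # positional uncertainty accumulates
    return blobs

def distribution_maps(blobs, size, center):
    """Sum positive and negative blobs separately into two maps."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    pos, neg = np.zeros(size), np.zeros(size)
    for w, off, var in blobs:
        d2 = (ys - center[0] - off[0]) ** 2 + (xs - center[1] - off[1]) ** 2
        target = pos if w > 0 else neg
        target += abs(w) * np.exp(-d2 / (2.0 * var))
    return pos, neg

# Made-up example: one first-layer edge-like feature used by a second-layer feature.
layer1 = {0: [(1.0, (0, -1), 0.5), (-1.0, (0, 1), 0.5)]}
layer2 = [(0.8, (-2, 0), 1.0, 0), (0.5, (2, 0), 1.0, 0), (-0.3, (0, 0), 1.0, 0)]
blobs = back_project(layer2, layer1)
pos, neg = distribution_maps(blobs, (9, 9), center=(4, 4))
```

In this example the negatively weighted second-layer component is dropped, while both signs of the first-layer components survive and end up in the positive and negative distribution maps respectively.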
We can further visualize Gaussian compositions by their boundaries that separate positive and negative components. We achieve this by translating the problem to a graph-cut problem, where pixels are represented as nodes of a graph, each connected to its neighbors and to a sink and a source. We use a sum over all positive distributions for one pixel as a cost for the sink and a sum over all negative distributions as a cost for the source. The cost between two neighboring pixels is taken as the squared difference between their values in the distribution maps. The optimal edge between them is obtained by finding a minimal cut that maximizes the flow in the graph. We highlight edges with strong borders between positive and negative distributions by multiplying with the difference of neighboring pixels in the distribution maps. The resulting image is finally resized and is visualized in the first and the third row of Fig. 4.
The proposed visualization is applied to the second-layer features trained on the PaCMan database. Most features still represent edges at different orientations, but some (e.g., the 4th and 8th feature) are compositions of edges in the form of corners, as is evident from the reconstructed boundaries. The features are visualized on a pruned model, i.e., we merged any overlapping compositions in each sub-feature to remove duplicate components, and compositions with weights below 2% of the maximal weight are discarded since they do not contribute to the final score. With pruning we removed or merged approximately 400 out of 1024 Gaussians at the second layer, while reducing the score by less than 1%. This pruning process is another benefit of our explicit compositions, which can lead to a network with significantly reduced complexity.
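A minimal sketch of such a pruning step is given below. The 2% relative threshold follows the text; the merge criterion (means and standard deviations closer than a fixed tolerance), the rule of adding the weights of merged duplicates, and the function name are assumptions made for illustration.

```python
import numpy as np

def prune_components(w, mu, sigma, rel_thresh=0.02, merge_tol=0.5):
    """Merge near-duplicate Gaussian components of one sub-feature, then drop
    those whose |weight| is below rel_thresh times the largest |weight|."""
    w, mu, sigma = map(np.asarray, (w, mu, sigma))
    merged = []  # each entry is [weight, mean, sigma]
    for wk, mk, sk in zip(w, mu, sigma):
        for m in merged:
            if np.linalg.norm(m[1] - mk) < merge_tol and abs(m[2] - sk) < merge_tol:
                m[0] += wk  # duplicate: add weights, keep the first mean and sigma
                break
        else:
            merged.append([float(wk), mk.astype(float), float(sk)])
    w_max = max(abs(m[0]) for m in merged)
    return [m for m in merged if abs(m[0]) >= rel_thresh * w_max]

# Made-up example: two overlapping components, one negligible one, one negative one.
kept = prune_components([1.0, 0.5, 0.01, -0.8],
                        [[1, 1], [1.1, 1.0], [3, 3], [2, 4]],
                        [1.0, 1.0, 1.0, 1.0])
```

Here the first two components are merged into one with weight 1.5, the component with weight 0.01 falls below the 2% threshold, and the negative component is kept because pruning uses the weight magnitude.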
IV-B Inference speed-up with separable filters
Another advantage of our deep compositional network is the ability to decompose the units into separate filters by leveraging the separability of Gaussians to speed up inference. In this case, the complexity of the forward pass at a single layer is reduced relative to the standard CNN, where the complexities are expressed in terms of the number of sub-features, the number of features, the number of Gaussian components per sub-feature, the feature map size, the kernel filter size, and an additional overhead in the separable filter implementation. An important factor in this separable implementation is the number of Gaussian components, but we can reduce this number with the same pruning process described in the previous section, where we reduced the number of components by almost a half.
Based on time complexities a separable implementation should gain a significant speed-up when the number of Gaussian components is small compared to the kernel size. We evaluate this separable implementation on a pruned PaCMan model considering different kernel sizes. Results are depicted in Fig. 5. We achieve a slight speed-up at the smallest kernel sizes, while a significant speed-up of 3-fold or more is achieved with larger kernels.
We use the Caffe implementation with convolution as matrix multiplication using the CBLAS library and implement separable convolution as multiple calls to AXPY methods using the same CBLAS library. The demonstrated speed-up factors are fairly conservative considering the Caffe implementation performs 2D convolution with a fully optimized single call while our implementation adds some overhead with multiple calls. We evaluated only a CPU implementation and enforced a single-core process. Since both implementations have a similar level of parallelism it is fair to assume that speed-up can be maintained in multicore CPU or GPU implementations as well.
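The separability that this speed-up relies on can be illustrated with a short NumPy sketch: an isotropic Gaussian component, including its discrete normalization, is the outer product of two 1D Gaussians, so a 2D convolution with it equals two successive 1D convolutions. This is a generic illustration and not the CBLAS/AXPY implementation used in our experiments; the function names and test values are assumptions.

```python
import numpy as np

def gauss1d(mu, sigma, size):
    """Discretely normalized 1D Gaussian of the given length."""
    g = np.exp(-(np.arange(size) - mu) ** 2 / (2.0 * sigma ** 2))
    return g / g.sum()

def conv2d_full(x, k):
    """Direct 2D 'full' convolution: about Kh * Kw multiplications per pixel."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for a in range(kh):
        for b in range(kw):
            out[a:a + H, b:b + W] += k[a, b] * x
    return out

def conv2d_separable(x, ky, kx):
    """Two 1D passes: about Kh + Kw multiplications per pixel."""
    rows = np.apply_along_axis(np.convolve, 1, x, kx)     # convolve every row with kx
    return np.apply_along_axis(np.convolve, 0, rows, ky)  # then every column with ky

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
gy, gx = gauss1d(2.3, 1.1, 7), gauss1d(4.0, 1.1, 7)
full = conv2d_full(x, np.outer(gy, gx))  # 2D Gaussian as an outer product
sep = conv2d_separable(x, gy, gx)
```

Since a filter is a weighted sum of such components, its response is the weighted sum of the separable responses, which is why the number of components per sub-feature directly affects the cost of the separable forward pass.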
V Conclusion
A new deep compositional network is introduced in this paper. The new network is based on a novel form of an element unit (a filter) that applies a parametric model. We demonstrated that parametrization with Gaussian distributions retains the spatial structure of compositions of features and affords learning by optimizing a well-defined cost function. We derived the necessary equations for back-propagation and embedded our model into a deep neural network framework to evaluate discriminative learning of compositional parts on the CIFAR-10 and PaCMan datasets. We showed that having a compositional representation is advantageous for deep networks by presenting a novel visualization and an inference speed-up. We performed visualization of deep network features using a mean reconstruction of parts. Other visualization techniques for deep networks typically rely on approximation with de-convolution [13] or a complex optimization [14], and need to process additional data. In contrast, the compositional representation allowed us to generate a representation of a feature with a fairly simple technique that uses only the model itself and no additional data. A simpler visualization and the 3-fold speed-up of inference speak to the advantages of using the new parametric units in deep networks with a convolutional layered architecture.
In the future we plan to explore other avenues opened by combining compositional and convolutional hierarchies, such as pre-training with co-occurrence statistics, performing generative and discriminative learning concurrently, and further leveraging inverted indexing of parts for inference.
References
[1] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” in CVPR, 2005, pp. 886–893.
[2] D. Lowe, “Object recognition from local scale-invariant features,” in Proceedings of the Seventh IEEE ICCV, 1999, pp. 1150–1157.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Advances In Neural Information Processing Systems 25, 2012, pp. 1097–1105.
[4] Y. Lecun, S. Chopra, R. Hadsell, M. A. Ranzato, and F. J. Huang, “A Tutorial on Energy-Based Learning,” pp. 1–59, 2006.
[5] A. Leonardis and S. Fidler, “Learning hierarchical representations of object categories for robot vision,” Robotics Research, vol. 66, pp. 99–110, 2011.
[6] L. L. Zhu, Y. Chen, and A. Yuille, “Learning a hierarchical deformable template for rapid deformable object parsing,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 6, pp. 1029–43, 6 2010.
[7] S. Fidler and A. Leonardis, “Towards Scalable Representations of Object Categories: Learning a Hierarchy of Parts,” in Computer Vision and Pattern Recognition. IEEE Computer Society, 2007, pp. 1–8.
[8] Z. Si and S.-c. Zhu, “Learning AND-OR Templates for Object Recognition and Detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 9, pp. 2189–2205, 2013.
[9] M. Kristan, M. Boben, D. Tabernik, and A. Leonardis, “Adding discriminative power to hierarchical compositional models for object class detection,” in 18th Scandinavian Conference on Image Analysis, 2013, pp. 1–12.
[10] D. Tabernik, A. Leonardis, M. Boben, D. Skočaj, and M. Kristan, “Adding discriminative power to a generative hierarchical compositional model using histograms of compositions,” Computer Vision and Image Understanding, vol. 138, pp. 102–113, 9 2015.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang, “Transforming auto-encoders,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6791, no. 1, pp. 44–51, 2011.
[12] A. Yuille and R. Mottaghi, “Complexity of Representation and Inference in Compositional Models with Part Sharing,” International Conference on Learning Representations, vol. 17, pp. 1–13, 2016.
[13] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in European Conference on Computer Vision, 11 2013, pp. 818–833. [Online]. Available: http://arxiv.org/abs/1311.2901
[14] A. Mahendran and A. Vedaldi, “Understanding Deep Image Representations by Inverting Them,” in Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[15] A. Mahendran and A. Vedaldi, “Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images,” 2016. [Online]. Available: http://arxiv.org/abs/1512.02017
[16] C. Szegedy, W. Zaremba, and I. Sutskever, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014, pp. 1–9.
[17] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up Convolutional Neural Networks with Low Rank Expansions,” in British Machine Vision Conference, 2014, p. 7.
[18] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” Science Department, University of Toronto, TechReport, pp. 1–60, 2009.
[19] “Synthetic RGB-D images of 400 objects from 20 classes. Generated from 3D mesh models.” [Online]. Available: http://www.pacman-project.eu/datasets/
[20] M. D. Zeiler, “ADADELTA: An Adaptive Learning Rate Method,” 2012. [Online]. Available: http://arxiv.org/abs/1212.5701
[21] S. Fidler, M. Boben, and A. Leonardis, “Optimization framework for learning a hierarchical shape vocabulary for object class detection,” in Proceedings of the British Machine Vision Conference 2009, 2009, pp. 1–93.