In the domain of image recognition, the convolutional layer of a CNN today is almost exclusively associated with a spatial convolution in the image domain. In this work we take a more signal-theoretic viewpoint of the convolutional operation and present an algorithm that also allows the processing of sparse input data. This work is inspired by the use of special data structures (Adams et al., 2010) for bilateral filters (Aurich & Weule, 1995; Smith & Brady, 1997; Tomasi & Manduchi, 1998) and generalizes them for use in convolutional architectures.
Although the approach presented here is more general, the following two scenarios are instructive. Consider that at training time we have access to full-resolution images to train a classifier, while at test time only a random subset of the pixels of the test image is available. In other words, we sample the signal differently during training and testing. For a traditional CNN this would require a pre-processing step, for example mapping the subset of pixels onto the dense grid that is the image. In our view nothing changes: we require neither a dense grid nor access to all pixels of the image, that is, the integration domain does not change. This is one example of sparsity; here we deal with a set of pixels whose values are RGB and whose features are position. Similarly, color information can be used to define the filtering operation as well. One can devise a convolution with a domain respecting color and location information (or color alone). From the image processing community this is the view of an edge-aware filter: the filter adapts to the color and/or gradient of the image. RGB values do not lie on a regular dense grid, therefore a direct extension of the spatial convolution is not applicable.
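As a minimal sketch of the sparse representation described above, the following hypothetical helper (`image_to_samples` is not part of the paper's implementation) turns a dense grayscale image into a set of (feature, value) tuples, where the feature is the pixel position and a random subset of pixels is kept, mimicking the test-time subsampling scenario:

```python
import random

def image_to_samples(image, keep_fraction=1.0, seed=0):
    """Turn a dense H x W grayscale image (list of lists) into a sparse
    set of (feature, value) pairs.  Features are (y, x) positions; a
    random subset of pixels is kept, mimicking sparse test-time input."""
    rng = random.Random(seed)
    samples = []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if rng.random() < keep_fraction:
                samples.append(((y, x), value))
    return samples

# With keep_fraction=1.0 every pixel survives; lower values drop pixels
# at random, but the remaining (position, value) pairs keep their meaning.
all_samples = image_to_samples([[0, 10], [20, 30]], keep_fraction=1.0)
```

A richer feature, e.g. position plus RGB, only changes the tuple stored per sample; the downstream lattice operations are unaffected.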
This approach falls into line with the view on encoding invariants (Mallat, 2012). The new way of looking at the data makes it possible to encode the invariants we know the data to possess. Encoded in a spatial convolution is the prior knowledge about translation invariance. How to encode rotation invariance, or similarity in color space? In the view we take here these are simply convolutions over different domains. A grid-based convolution cannot easily be used to work with sparse data (an interpolation might be needed), but the permutohedral lattice provides the right space and allows efficient implementations. The runtime is therefore comparable to that of spatial convolutions, depending on the size of the invariants to include, and the layer can simply be used as a replacement for the traditional ones.
2 Permutohedral Lattice Convolution
We propose a convolution operation over a $d$-dimensional input space that works entirely on a lattice. The input data is a set of tuples $(\mathbf{f}_i, v_i)$ of feature locations $\mathbf{f}_i \in \mathbb{R}^d$ and corresponding signal values $v_i$. Importantly, this does not assume the feature locations to be sampled on a regular grid; for example, $\mathbf{f}_i$ can be position and RGB value. We then map the input signal to a regular structure, the so-called permutohedral lattice. A convolution then operates on the constructed lattice and the result is mapped back to the output space. Hence, the entire operation consists of three stages (see fig. 1): splat (the mapping to the lattice space), convolution, and slice (the mapping back from the lattice). This strategy has already been used to implement fast Gaussian filtering (Paris & Durand, 2009; Adams et al., 2009; 2010). Here we generalize it to arbitrary convolutions.
The permutohedral lattice is the result of projecting the set $(d+1)\mathbb{Z}^{d+1}$ onto the hyperplane $H_d = \{\mathbf{x} \in \mathbb{R}^{d+1} : \mathbf{x} \cdot \mathbf{1} = 0\}$, the plane orthogonal to the vector $\mathbf{1} = (1, \ldots, 1)$. This $d$-dimensional plane is embedded into $\mathbb{R}^{d+1}$. The lattice points tessellate the subspace with regular cells. Given a point from the embedding space, it is efficient to find the enclosing simplex of its projection onto the plane. We will represent a sparse set of points from $\mathbb{R}^d$ by a sparse set of simplex corners in the lattice. Importantly, the number of corners does not grow exponentially with the dimension, as it would for an axis-aligned hyper-cubical representation. We continue by describing the different parts of the permutohedral convolution.
The splat and slice operations take the role of an interpolation between the two signal representations. All input samples $v_i$ that fall into a cell adjacent to a lattice point $l$ are summed up, weighted by their barycentric coordinates, to calculate the lattice value $v_l = \sum_i b_{l,i} v_i$. This is the splatting operation. The barycentric coordinates $b_{l,i}$ depend on both the corner point $l$ and the feature location $\mathbf{f}_i$. The reverse operation, slice, finds an output value $v'_j$ at feature location $\mathbf{f}_j$ by using its barycentric coordinates inside the lattice simplex and summing over the corner points: $v'_j = \sum_l b_{l,j} v'_l$.
The convolution is then performed on the permutohedral lattice. It uses a convolution kernel $w$ to compute $v'_l = \sum_{l' \in \mathcal{N}(l)} w_{l - l'}\, v_{l'}$. The convolution kernel is problem specific and its domain is restricted to the set of neighboring lattice points $\mathcal{N}(l)$. For bilateral filters the kernel is set to be a Gaussian; here we learn the kernel values using back-propagation.
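The three stages can be illustrated on a deliberately simplified 1-D analogue. The sketch below uses an integer lattice with linear (barycentric) interpolation instead of the permutohedral lattice, and a dictionary in place of the paper's hashed lattice storage; `splat`, `convolve`, and `slice_at` are hypothetical names:

```python
from collections import defaultdict

def splat(samples):
    """Splat continuous 1-D samples onto integer lattice points using
    the two enclosing corners, weighted by barycentric (linear) weights."""
    lattice = defaultdict(float)
    for f, v in samples:           # f: continuous position, v: value
        lo = int(f // 1)
        w = f - lo                 # barycentric weight of the upper corner
        lattice[lo] += (1 - w) * v
        lattice[lo + 1] += w * v
    return lattice

def convolve(lattice, kernel):
    """Apply a kernel over the lattice neighbourhood; kernel maps
    offset -> weight, e.g. {-1: 0.25, 0: 0.5, 1: 0.25}."""
    out = defaultdict(float)
    for p, v in lattice.items():
        for off, w in kernel.items():
            out[p + off] += w * v
    return out

def slice_at(lattice, f):
    """Read an output value at a continuous position via the same
    barycentric interpolation used for splatting."""
    lo = int(f // 1)
    w = f - lo
    return (1 - w) * lattice[lo] + w * lattice[lo + 1]

# Splat a single sample at 0.5, convolve with the identity kernel,
# and slice back at the original position.
lat = splat([(0.5, 1.0)])
result = slice_at(convolve(lat, {0: 1.0}), 0.5)
```

On the real permutohedral lattice the enclosing cell is a simplex with $d+1$ corners rather than an interval with two, but the splat-convolve-slice structure is the same.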
The size of the neighborhood takes a role similar to the filter size (spatial extent) of a grid-based CNN. A traditional convolutional kernel that considers $s$ sampled points to either side has $(2s+1)^d$ parameters. A comparable filter on the permutohedral lattice with a neighborhood of size $s$ has $(s+1)^{d+1} - s^{d+1}$ elements.
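The two parameter counts from the preceding paragraph are easy to compare numerically; the helper names below are chosen for illustration:

```python
def spatial_params(s, d):
    """Parameters of a grid filter covering s samples to either side
    in each of d dimensions: (2s+1)^d."""
    return (2 * s + 1) ** d

def permutohedral_params(s, d):
    """Lattice points within a neighborhood of size s on the
    d-dimensional permutohedral lattice: (s+1)^(d+1) - s^(d+1)."""
    return (s + 1) ** (d + 1) - s ** (d + 1)
```

For $d = 2$ and $s = 1$ the grid filter has 9 weights while the permutohedral filter has 7 (the hexagonal cell's center plus its six neighbors), and the gap widens with growing dimension.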
3 Sparse CNNs and Encoding Invariants
The permutohedral convolution can be used as a new building block in a CNN architecture. We will omit the derivation of the gradients of the output of such a new layer with respect to the filter elements due to space constraints. We will discuss two possible application scenarios.
First, as mentioned before, we are free to change the sampling of the input signal of a lattice-based convolution. The choice of the sampling is problem specific. Missing measurements or domain-specific sampling techniques that gather more information in highly discriminative areas are only two possible scenarios. Furthermore, as we will show in our experiments, the method is robust in cases where train-time sampling and test-time sampling do not coincide.
Second, the proposed method provides a tool to encode additional data invariances in a principled way. A common technique to include domain knowledge is to artificially augment the training set with deformations that leave the output signal invariant, such as translations, rotations, or noisy versions.
A feature mapping $\phi$ is invariant with respect to a transformation $T$ and a signal $v$ if $\phi(Tv) = \phi(v)$. In the case where $T$ belongs to a set of translations, a possible invariant feature is the convolution with a window function $w$ (given its support has the right size). The same idea can be applied to the more general case of a transformation group $G$, again calculating a mean of a base feature $\phi_0$ with the help of a window function: $\phi(v) = \int_G w(T)\, \phi_0(Tv)\, \mathrm{d}T$.
We can use the permutohedral convolution to encode invariances like rotation and translation. Approximating the above integral by a finite sum and using lattice points $T_l$ as integration samples, we arrive at $\phi(v) \approx \sum_l w(T_l)\, \phi_0(T_l v)$. We further approximate $T_l v$ with a look-up at a lattice point location.
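The group-averaging construction above can be made concrete with a toy finite group. The sketch below (hypothetical helper names, cyclic shifts standing in for the transformation group, and a uniform window) averages an arbitrary base feature over all transformed copies of a signal, which makes the result invariant by construction:

```python
def shifted(v, t):
    """Cyclic shift of a discrete signal -- our stand-in transformation T."""
    return v[t:] + v[:t]

def invariant_feature(v, g):
    """Average an arbitrary base feature g over the whole transformation
    group (all cyclic shifts): the finite-sum analogue of averaging with
    a uniform window function."""
    n = len(v)
    return sum(g(shifted(v, t)) for t in range(n)) / n

# Even a non-invariant base feature (the first entry) becomes
# shift-invariant after group averaging.
first_entry = lambda seq: seq[0]
feat = invariant_feature([1, 2, 3], first_entry)
```

For continuous groups such as rotations the sum is taken over sample points of the group, which is exactly the role of the lattice points in the approximation above.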
Consider the case of combined rotation and translation invariance. Intuitively, we stack rotated versions of the input image onto each other in a 3-dimensional space – two dimensions for the location of a sample and one dimension for the rotation of the image. A grid-based convolution would not work here because the rotated image points need not coincide with a grid anymore. Filtering in the permutohedral space naturally respects the augmented feature space.
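The stacking of rotated copies can be sketched as follows; `rotation_augmented_features` is a hypothetical helper that builds the 3-dimensional (x, y, theta) feature vectors for a set of rotation angles:

```python
import math

def rotation_augmented_features(points, angles):
    """Stack rotated copies of 2-D sample positions into a 3-D feature
    space (x, y, theta).  Rotated positions need not land on any grid,
    which is exactly why a lattice accepting continuous features helps."""
    feats = []
    for theta in angles:
        c, s = math.cos(theta), math.sin(theta)
        for x, y in points:
            # Standard 2-D rotation of the sample position.
            feats.append((c * x - s * y, s * x + c * y, theta))
    return feats

# One sample at (1, 0), rotated by 0 and 90 degrees.
feats = rotation_augmented_features([(1.0, 0.0)], [0.0, math.pi / 2])
```

A filter over this space can then pool information across rotations just as a spatial filter pools across translations.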
Table 1: (a) MNIST classification results comparing the CNN reference implementation that is part of Caffe (Jia et al., 2014) to the same network with the first layer replaced by a permutohedral convolution layer (PCNN). Both are trained on the original image resolution (first two rows). Three more PCNN and CNN models are trained with randomly subsampled images (100%, 60% and 20% of the pixels). An additional bilinear interpolation layer samples the input signal on a spatial grid for the CNN model. (b) PSNR results of a denoising task using the BSDS500 dataset (Arbeláez et al., 2011).

4 Experiments
We investigate the performance and flexibility of the proposed method on two sets of experiments. The first setup compares the permutohedral convolution with a spatial convolution that has been combined with a bilinear interpolation. The second part adds a denoising experiment to show the modelling strength of the permutohedral convolution.
It is natural to ask how a spatial convolution combined with an interpolation compares to a permutohedral convolutional neural network (PCNN). The proposed convolution is particularly advantageous in cases where samples are addressed in a higher-dimensional space. Nevertheless, a bilinear interpolation prior to a spatial convolution can be used for dense 2-dimensional positional features.
We take a reference implementation of LeNet (LeCun et al., 1998) that is part of the Caffe project (Jia et al., 2014) on the MNIST dataset as a starting point for the following experiments. The permutohedral convolutional layer is also implemented in this framework.
We first compare against LeNet in terms of test-time accuracy, substituting only the first convolutional layer with a (position only) permutohedral layer and leaving the rest identical. Table 1 shows that a similar performance is achieved, so model flexibility appears not to be lost. The network is trained with the training parameters of the reference implementation. Next, we randomly sample continuous points in the input image and use their interpolated values as signal and their continuous positions as features. Interestingly, we can train models with a different amount of sub-sampling than used at test time. The permutohedral representation is robust with respect to this sparse input signal. Table 1 shows experiments with different signal degradation levels. All sampling strategies have in common that the original input space of 28 by 28 pixels is densely covered. Hence, a bilinear interpolation prior to the first convolution allows us to compare against the original LeNet architecture. This baseline model performs similarly to a PCNN.
One of the strengths of the proposed method is that, unlike traditional convolution operators, it does not depend on a regular grid sampling. We highlight this feature with the following denoising experiment, changing the sampling space to be both sparse and 3-dimensional. The higher-dimensional space renders a bilinear interpolation followed by a spatial convolution increasingly infeasible due to the high number of corners of the hyper-cubical tessellation of the space. We compare the proposed permutohedral convolution to a spatial convolution in an illustrative denoising experiment. For bilateral filtering, which is one of the algorithmic use-cases of the permutohedral lattice, the input features contain both the coordinates of a data sample and the color information of the image; hence a 5-dimensional vector for color images and a 3-dimensional vector for gray-scale images. In contrast to a direct application of a bilateral filter to the noisy input, the filter of a bilateral layer of a PCNN can now be trained. All experiments compare the performance of a PCNN to a common CNN on images from the BSDS500 dataset (Arbeláez et al., 2011). Each image is transformed to gray-scale by taking the mean across channels, and noise is artificially added with samples from a Gaussian distribution.
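The data preparation just described (channel mean, additive Gaussian noise) can be sketched as follows; the helper names and the noise level are illustrative, not the paper's exact values:

```python
import random

def grayscale(rgb_image):
    """Convert an image (rows of per-pixel channel tuples) to gray-scale
    by taking the mean across channels, as in the experiment above."""
    return [[sum(px) / len(px) for px in row] for row in rgb_image]

def add_gaussian_noise(gray_image, sigma, seed=0):
    """Add i.i.d. Gaussian noise with standard deviation sigma; seeding
    keeps the corrupted images reproducible across runs."""
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in gray_image]

gray = grayscale([[(3, 6, 9)]])
noisy = add_gaussian_noise(gray, sigma=1.0, seed=1)
```

The denoising networks then learn to map `noisy` back to `gray`.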
The baseline network uses a spatial convolution (“CNN” in Table 1) with a kernel size of and predicts the scalar gray-scale value at each pixel ( filter weights). The layer is trained with a fixed learning rate of , momentum weight of and a weight decay of on the “train” set. In the second architecture the convolution is performed on the permutohedral lattice (“PCNN Gauss” and “PCNN Trained” in Table 1). We include the pixel’s gray value as an additional feature for the generalized operation and set the neighborhood size to ( filter weights). The filter weights are initialized with a Gaussian blur and are either applied directly to the noisy input (“PCNN Gauss”) or trained on the “train” set to minimize the Euclidean distance to the clean image with a learning rate of . We cross-validate the scaling of the input space on the “val” image set and reuse this setting for all experiments that operate on the permutohedral lattice. A third architecture that combines both spatial and permutohedral convolutions by summation (“CNN + PCNN”) is similarly trained and tested.
We evaluate the PSNR averaged over the images of the “test” set and observe a slightly better performance of the bilateral network with trained filters (“PCNN Trained”) in comparison to a bilateral filter (“PCNN Gauss”) and a linear filter (“CNN”), see Table 1. Combining both convolution operations further improves the performance, suggesting that they have complementary strengths. Admittedly this setup is rather simple, but it validates that the generalized filtering has an advantage.
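For reference, the PSNR metric used above is the standard definition; the sketch below assumes images flattened to lists of pixel values and a peak value of 255:

```python
import math

def psnr(clean, restored, peak=255.0):
    """Peak signal-to-noise ratio between two equally sized images,
    flattened to lists of pixel values: 10 * log10(peak^2 / MSE)."""
    mse = sum((c - r) ** 2 for c, r in zip(clean, restored)) / len(clean)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values indicate a reconstruction closer to the clean image, with identical images mapping to infinity.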
In the future we plan to investigate the use of the PCNN architecture for other computer vision problems, e.g. semantic segmentation, and modeling domain knowledge like rotation or scale invariance.
This paper presents a generalization of the convolutional operation to sparse input signals. We envision many consequences of this work. Consider signals that are naturally represented as measurements instead of images, like MRI scan readings. Permutohedral lattice filtering avoids the pre-processing step of assembling them into a dense image; it is possible to work on the measured sparse signal directly. Another promising use of this filter is to encode scale invariance: typically this is done by presenting multiple scaled versions of an image to several branches of a network. The convolution presented here can be defined on the continuous range of image scales without a finite subselection. In summary, this technique allows encoding prior knowledge about the observed signal to define the domain of the convolution. The typical spatial filter of CNNs is one particular type of such prior knowledge; we generalize it to sparse signals.
- Adams et al. (2009) Adams, Andrew, Gelfand, Natasha, Dolson, Jennifer, and Levoy, Marc. Gaussian kd-trees for fast high-dimensional filtering. In ACM SIGGRAPH 2009 Papers, SIGGRAPH ’09, pp. 21:1–21:12, New York, NY, USA, 2009.
- Adams et al. (2010) Adams, Andrew, Baek, Jongmin, and Davis, Myers Abraham. Fast high-dimensional filtering using the permutohedral lattice. Comput. Graph. Forum, 29(2):753–762, 2010.
- Arbeláez et al. (2011) Arbeláez, Pablo, Maire, Michael, Fowlkes, Charless, and Malik, Jitendra. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 33(5):898–916, May 2011.
- Aurich & Weule (1995) Aurich, Volker and Weule, Jörg. Non-linear Gaussian filters performing edge preserving diffusion. In Mustererkennung 1995, 17. DAGM-Symposium, Bielefeld, 13.-15. September 1995, Proceedings, pp. 538–545, 1995.
- Jia et al. (2014) Jia, Yangqing, Shelhamer, Evan, Donahue, Jeff, Karayev, Sergey, Long, Jonathan, Girshick, Ross, Guadarrama, Sergio, and Darrell, Trevor. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
- Mallat (2012) Mallat, Stéphane. Group invariant scattering. Communications in Pure and Applied Mathematics, 65(10):1331–1398, 2012.
- Paris & Durand (2009) Paris, Sylvain and Durand, Frédo. A fast approximation of the bilateral filter using a signal processing approach. International Journal of Computer Vision, 81(1):24–52, January 2009.
- Smith & Brady (1997) Smith, Stephen M. and Brady, J. Michael. SUSAN – a new approach to low level image processing. Int. J. Comput. Vision, 23(1):45–78, May 1997. ISSN 0920-5691.
- Tomasi & Manduchi (1998) Tomasi, Carlo and Manduchi, Roberto. Bilateral filtering for gray and color images. In Proceedings of the Sixth International Conference on Computer Vision, ICCV ’98, pp. 839–846, Washington, DC, USA, 1998. IEEE Computer Society.