Soatto & Chiuso (2014) define an optimal representation as a minimal sufficient statistic (of past data for the scene) and a maximal invariant (of future data to nuisance factors), and propose a measure of how “useful” (informative) a representation is, via the uncertainty of the prediction density. What constitutes a nuisance depends on the task, which includes decision and control actions regarding the surrounding environment, or scene, and its geometry (shape, pose), photometry (reflectance), dynamics (motion) and semantics (identities and relations of “objects” within).
We show that optimal management of nuisance variability due to occlusion is generally intractable, but can be approximated, leading to a composite (correspondence) hypothesis test that provides grounding for the use of “patches” or “receptive fields,” ubiquitous in practice. The analysis reveals that the size of the domain of the filters should be decoupled from the spectral characteristics of the image, an unintuitive consequence that runs contrary to what is traditionally taught in scale-space theory. This idea has been exploited by Dong & Soatto (2015) to approximate the optimal descriptor of a single image, under an explicit model of image formation (the Lambert-Ambient, or LA, model) and nuisance variability, leading to DSP-SIFT. Extensions to multiple training images, leading to MV-HoG and R-HoG, have been championed by Dong et al. (2013). Here, we apply domain-size pooling to the scattering transform of Bruna & Mallat (2011), leading to DSP-SC; to a convolutional neural network, leading to DSP-CNN; and to deformable part models Felzenszwalb et al. (2008), leading to DSP-DPM, in Sect. 2.2, 2.3 and 2.4 respectively.
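To make the domain-size pooling idea concrete, the following sketch (ours; the function names and default sizes are illustrative, not the DSP-SIFT implementation of Dong & Soatto (2015)) averages gradient-orientation histograms computed over several patch sizes around a fixed location, then normalizes the result:

```python
import numpy as np

def gradient_orientation_histogram(patch, n_bins=8):
    """Histogram of gradient orientations weighted by gradient magnitude
    (a single SIFT-like cell)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    return hist

def dsp_descriptor(image, center, sizes=(8, 12, 16), weights=None):
    """Domain-size pooled descriptor: average orientation histograms over
    several patch sizes around the same location, then L1-normalize."""
    weights = np.ones(len(sizes)) / len(sizes) if weights is None else np.asarray(weights)
    cy, cx = center
    acc = None
    for w, s in zip(weights, sizes):
        patch = image[cy - s:cy + s, cx - s:cx + s]
        h = gradient_orientation_histogram(patch)
        acc = w * h if acc is None else acc + w * h
    return acc / (acc.sum() + 1e-12)
```

The point of departure from scale-space practice is that the descriptor averages over domain sizes rather than selecting a single one.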
We treat images as random vectors and the scene as an (infinite-dimensional) parameter. An optimal representation is a function of past images that maximally reduces uncertainty on questions about the scene Geman et al. (2015), given images from it and regardless of nuisance variables. Soatto & Chiuso (2014) introduce the sampled orbit anti-aliased (SOA) likelihood, built from the joint likelihood, understood as a function of the parameter and nuisance for fixed data, together with an anti-aliasing measure with positive weights. The SOA likelihood is an optimal representation in the sense that, for any approximation tolerance, one can choose an anti-aliasing measure and a finite number of samples so that it approximates, to within that tolerance, a minimal sufficient statistic (of the data for the scene) that is maximally invariant to group transformations. This result is valid under the assumptions of the Lambert-Ambient (LA) model Dong & Soatto (2014), the simplest model known to capture the phenomenology of image formation, including scaling, occlusion, and rudimentary illumination.
2 Constructing Visual Representations
Theorem 1 (Contrast invariant).
Given a training image and a test image, the latter affected by noise that is independent in the gradient direction and magnitude, the maximal invariant of the test image to the group of contrast transformations is given by the gradient orientation.
Note that, other than for the gradient, the computations above can be performed point-wise under the assumption of the LA model, so (3) can be written at each pixel.
Note that (4) is invariant to contrast transformations of the test image, but not of the training image. Invariance to contrast transformations in the (single) training image can be achieved by normalizing the likelihood, which in turn can be done by simply dividing by its integral, i.e., the norm of the histogram across the entire image/patch,
which should be used instead of the customary normalization of Lowe (2004). Once invariance to contrast transformations is achieved, which can be done on a single image, we are left with nuisances that include general viewpoint changes and the occlusions they induce. These can be handled by computing the SOA likelihood with respect to viewpoint (Sect. 2.1) from a training sample, leading to (6).
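Theorem 1 can be checked numerically in a simple case: for an affine contrast change with positive gain (a special case of the monotonic contrast transformations considered here), the gradient orientation is unchanged. A minimal sketch of our own:

```python
import numpy as np

def grad_orientation(img):
    """Orientation of the image gradient at each pixel."""
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx)

rng = np.random.default_rng(0)
I = rng.random((32, 32))
J = 3.0 * I + 0.7  # affine contrast change with positive gain
# grad(a*I + b) = a * grad(I) with a > 0, so orientation is unchanged;
# compare angles modulo 2*pi to avoid wrap-around at +/- pi
diff = np.angle(np.exp(1j * (grad_orientation(I) - grad_orientation(J))))
assert np.abs(diff).max() < 1e-6
```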
Occlusion, or visibility, is arguably the single most critical aspect of visual representations. It enforces locality, since dealing with occlusion nuisances entails searching through, or marginalizing over, all possible (multiply-connected) subsets of the test image; this power set is clearly intractable even for very small images. Missed detections (treating a co-visible pixel as occluded) and false alarms (treating an occluded pixel as visible) have different costs: omitting a co-visible pixel decreases the likelihood by a factor corresponding to multiplication by a Gaussian evaluated at samples drawn from the same distribution; vice-versa, including an occluded pixel (false alarm) decreases the log-likelihood by a factor equal to multiplying by a Gaussian evaluated at points drawn from another distribution, such as a uniform one. So, testing for correspondence on subsets of the co-visible regions, assuming the region is sufficiently large, reduces the power, but not the validity, of the test. This observation can be used to fix the shape of the regions, leaving only their size to be marginalized, or searched over. This reasoning justifies the use of “patches” or “receptive fields” to seed image matching, but emphasizes that a search over different sizes Dong & Soatto (2015) is needed.
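The cost asymmetry between missed detections and false alarms can be illustrated with a toy Gaussian correspondence model (our own illustration; the noise level and the uniform occluder distribution are assumptions):

```python
import numpy as np

def gauss_loglik(x, mu, sigma=0.1):
    """Per-pixel Gaussian log-likelihood of intensities x given template mu."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
template = rng.random(100)                               # training intensities
covisible = template + 0.02 * rng.standard_normal(100)   # same scene + noise
occluded = rng.random(100)                               # occluder: uniform

ll_covisible = gauss_loglik(covisible, template).mean()  # lost by a missed detection
ll_occluded = gauss_loglik(occluded, template).mean()    # incurred by a false alarm
# co-visible pixels contribute far more log-likelihood than occluded ones,
# so dropping a few co-visible pixels is cheap relative to admitting occluded ones
assert ll_covisible > ll_occluded
```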
Together with the SOA likelihood, this also justifies the local marginalization of domain sizes, along with translation, as recently championed in Dong & Soatto (2015).
Corollary 1 (DSP-SIFT).
The assumptions underlying all local representations built from a single image break down when the scene is not flat and not moving parallel to the image plane. In this case, multiple views are necessary to manage nuisance variability due to general viewpoint changes.
2.1 General viewpoint changes
If a co-variant translation-scale and size sampling/anti-aliasing mechanism is employed, then around each sample the residual viewpoint variability is reduced to rotations.
In some cases, a consistent reference (canonical element) for both training and test images is available when scenes or objects are geo-referenced: the projection of the gravity vector onto the image plane Jones & Soatto (2011). In this case, the reference is the angle of the projection of gravity onto the image plane (well defined unless gravity is orthogonal to the image plane). Alternatively, multiple (principal) orientation references can be selected based on the norm of the directional derivative, as in Lowe (2004).
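A minimal sketch of principal-orientation selection in the spirit of Lowe (2004); the bin count and the 80% peak threshold follow common practice, and the implementation is illustrative rather than the reference one:

```python
import numpy as np

def principal_orientations(patch, n_bins=36, thresh=0.8):
    """Candidate reference orientations: peaks of the magnitude-weighted
    gradient-orientation histogram within `thresh` of the global maximum."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)
    hist, edges = np.histogram(ang, bins=n_bins, range=(0, 2 * np.pi), weights=mag)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[hist >= thresh * hist.max()]
```

A patch whose intensity is a linear ramp yields a single dominant orientation along the ramp direction.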
This leaves out-of-plane rotations to be managed. Dong et al. (2013) have proposed extensions of local descriptors to multiple views, based either on a sampling approximation of the likelihood function (MV-HoG) or on a point estimate of the scene (R-HoG). The estimated scene has a geometric component (shape) and a photometric component (radiance), inferred from the LA model as described in Dong & Soatto (2014). Once the effects of occlusions are considered (which force the representation to be local), and the effects of general viewpoint changes are accounted for (which creates the need for multiple training images of the same scene), a maximal contrast/viewpoint/occlusion invariant can be approximated: the SOA likelihood (6) becomes marginalized over viewpoint samples, in addition to domain-size pooling. The assumption that none of the existing multiple-view extensions of SIFT overcome is the conditional independence of the intensities at different pixels. This is discussed in Soatto & Chiuso (2014) for the case of convolutional deep architectures, and in the next section for Scattering Networks. Capturing the joint statistics of different components of the SOA likelihood is key to modeling intra-class variability of object or scene categories.
2.2 DSP-Scattering Networks
The scattering transform Bruna & Mallat (2011) convolves an image (or patch) with a Gabor filter bank at different rotations and dilations, takes the modulus of the responses, and applies an averaging operator to yield the scattering coefficients. This is repeated to produce coefficients at different layers in a scattering network. The first layer is equivalent to SIFT Bruna & Mallat (2011), in the sense that (3) can be implemented via convolution with a Gabor element at each orientation, followed by taking the modulus of the response. One could conjecture that domain-size pooling (DSP) applied to a scattering network would improve performance in tasks that involve changes of scale and visibility. We call the resulting method the DSP Scattering Transform (DSP-SC). Indeed, this is the case, as we show in the Appendix of Soatto et al. (2014), where we compare DSP-SC to the single-scale scattering transform (SC) on the datasets of Mikolajczyk & Schmid (2003) (Oxford) and Fischer et al. (2014).
Deep convolutional architectures can be understood as implementing successive approximations of an optimal representation by stacking layers of (conditionally) independent local representations of the form (8), which have been shown by Soatto & Chiuso (2014) to increasingly achieve invariance to large deformations, despite locally marginalizing only affine (or similarity) transformations. As Dong & Soatto (2015) did for SIFT, and as we did for the Scattering Transform above, we conjectured that pooling over domain size would improve the performance of a convolutional network. In the Appendix of Soatto et al. (2014), we report experiments to test the conjecture using a pre-trained network which is fine-tuned with domain-size pooling on benchmark datasets.
We have also developed domain-size pooling extensions of deformable part models (DPMs) Felzenszwalb et al. (2008): small trees of local HOG descriptors (“parts”), whereby local photometry is encoded in the latter (nodes), and geometry is encoded in their position on the image relative to the root node (edges). Intra-class shape variability is captured by the posterior density of edge values, learned from samples. Photometry is captured by a “HOG pyramid” where the size of each part is pre-determined and fixed relative to the root. One could therefore conjecture that performing anti-aliasing with respect to the size of the parts would improve performance. Experimental results, reported in the Appendix of Soatto et al. (2014), validate the conjecture.
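Anti-aliasing a part score over part sizes can be sketched as a weighted average of the part response at several sizes; `score_fn` is a hypothetical stand-in for the HOG-part response of a trained DPM, not an actual DPM API:

```python
import numpy as np

def size_pooled_part_score(score_fn, location, sizes, weights=None):
    """Anti-aliased part score: weighted average of a part's matching score
    evaluated at several part sizes around the nominal one (sketch)."""
    weights = (np.ones(len(sizes)) / len(sizes) if weights is None
               else np.asarray(weights, dtype=float))
    return float(sum(w * score_fn(location, s) for w, s in zip(weights, sizes)))
```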
We acknowledge discussions with Alessandro Chiuso, Joshua Hernandez, Arash Amini, Ying-Nian Wu, Taco Cohen, Virginia Estellers, Jonathan Balzer. Research supported by ONR N000141110863, NSF RI-1422669, and FA8650-11-1-7154.
- Bruna & Mallat (2011) Bruna, J. and Mallat, S. Classification with scattering operators. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2011.
- Dong & Soatto (2014) Dong, J. and Soatto, S. The Lambert-Ambient Shape Space and the Systematic Design of Feature Descriptors. R. Cipolla, S. Battiato, G.-M. Farinella (Eds), Springer Verlag, 2014.
- Dong & Soatto (2015) Dong, J. and Soatto, S. Domain-size pooling in local descriptors: DSP-SIFT. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2015.
- Dong et al. (2013) Dong, J., Karianakis, N., Davis, D., Hernandez, J., Balzer, J., and Soatto, S. Multi-view feature engineering and learning. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2015. (also ArXiv: 1311.6048, 2013).
- Felzenszwalb et al. (2008) Felzenszwalb, P., McAllester, D., and Ramanan, D. A discriminatively trained, multiscale, deformable part model. In CVPR, pp. 1–8, 2008.
- Fischer et al. (2014) Fischer, P., Dosovitskiy, A., and Brox, T. Descriptor matching with convolutional neural networks: a comparison to sift. ArXiv:1405.5769, 2014.
- Geman et al. (2015) Geman, D., Geman, S., Hallonquist, N., and Younes, L. Visual turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
- Jones & Soatto (2011) Jones, E. and Soatto, S. Visual-inertial navigation, localization and mapping: A scalable real-time large-scale approach. Intl. J. of Robotics Res., 2011.
- Lowe (2004) Lowe, D. G. Distinctive image features from scale-invariant keypoints. IJCV, 2(60):91–110, 2004.
- Mikolajczyk & Schmid (2003) Mikolajczyk, K. and Schmid, C. A performance evaluation of local descriptors. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2003.
- Soatto & Chiuso (2014) Soatto, S. and Chiuso, A. Visual scene representations: sufficiency, minimality, invariance and deep approximation. Proc. of the ICLR Workshop, 2015 (also ArXiv: 1411.7676, 2014).
- Soatto et al. (2014) Soatto, S., Dong, J., and Karianakis, N. Visual scene representation: scaling and occlusion in convolutional architectures. (Extended version of this manuscript.) Technical report UCLA CSD140024, 2014.