1 Introduction
Soatto & Chiuso (2014) define an optimal representation as a minimal sufficient statistic (of past data for the scene) and a maximal invariant (of future data to nuisance factors), and propose a measure of how “useful” (informative) a representation is, via the uncertainty of the prediction density. What constitutes a nuisance depends on the task, which includes decision and control actions concerning the surrounding environment, or scene, and its geometry (shape, pose), photometry (reflectance), dynamics (motion), and semantics (the identities and relations of “objects” within it).
We show that optimal management of nuisance variability due to occlusion is generally intractable, but can be approximated, leading to a composite (correspondence) hypothesis test, which provides grounding for the use of “patches” or “receptive fields,” ubiquitous in practice. The analysis also reveals that the size of the domain of the filters should be decoupled from the spectral characteristics of the image, an unintuitive consequence that runs contrary to what is traditionally taught in scale-space theory. This idea has been exploited by Dong & Soatto (2015) to approximate the optimal descriptor of a single image, under an explicit model of image formation (the Lambert-Ambient, or LA, model) and nuisance variability, leading to DSP-SIFT. Extensions to multiple training images, leading to MV-HoG and R-HoG, have been championed by Dong et al. (2013). Here, we apply domain-size pooling to the scattering transform Bruna & Mallat (2011), leading to DSP-SC, to a convolutional neural network, leading to DSP-CNN, and to deformable part models Felzenszwalb et al. (2008), leading to DSP-DPM, in Sect. 2.2, 2.3 and 2.4 respectively. We treat images as random vectors
and the scene as an (infinite-dimensional) parameter. An optimal representation is a function of past images that maximally reduces uncertainty on questions about the scene Geman et al. (2015) given images from it, regardless of nuisance variables. In Soatto & Chiuso (2014) the sampled orbit anti-aliased (SOA) likelihood is introduced as:
(1)
where
(2) 
and is the joint likelihood, understood as a function of the parameter and nuisance for fixed data , with an anti-aliasing measure with positive weights . The SOA likelihood is an optimal representation in the sense that, for any , it is possible to choose and a finite number of samples so that approximates, to within , a minimal sufficient statistic (of for ) that is maximally invariant to group transformations in . This result is valid under the assumptions of the Lambert-Ambient (LA) model Dong & Soatto (2014), the simplest model known to capture the phenomenology of image formation, including scaling, occlusion, and rudimentary illumination.
2 Constructing Visual Representations
Theorem 1 (Contrast invariant).
Given a training image and a test image , assuming the latter is affected by noise that is independent in the gradient direction and magnitude, the maximal invariant of to the group of contrast transformations is given by
(3) 
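The contrast invariance asserted by the theorem can be checked numerically. The following is a minimal sketch (implementation choices such as finite differences via `numpy.gradient` are ours): for an affine contrast change with positive slope, the per-pixel gradient orientation is unchanged.

```python
import numpy as np

def gradient_orientation(img):
    # Per-pixel gradient orientation (radians): the contrast invariant of (3)
    gy, gx = np.gradient(img.astype(float))
    return np.arctan2(gy, gx)

# An affine contrast change h(I) = a*I + b with a > 0 rescales the
# finite-difference gradient by a, leaving its direction unchanged.
rng = np.random.default_rng(0)
img = rng.random((8, 8))
theta_before = gradient_orientation(img)
theta_after = gradient_orientation(3.0 * img + 0.5)
```

Comparing the orientations through their sines and cosines avoids spurious wrap-around at the angular boundary.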
Note that, other than for the gradient, the computations above can be performed pointwise under the assumptions of the LA model, so we can write (3) at each pixel : if ,
(4) 
Note that (4) is invariant to contrast transformations of , but not of . Invariance to contrast transformations in the (single) training image can be achieved by normalizing the likelihood, which in turn can be done by simply dividing by the integral over , i.e., the norm of the histogram across the entire image/patch
(5) 
that should be used instead of the customary one of Lowe (2004). Once invariance to contrast transformations is achieved, which can be done on a single image , we are left with nuisances that include general viewpoint changes, including the occlusions they induce. These can be handled by computing the SOA likelihood with respect to of (Sect. 2.1) from a training sample , leading to
(6) 
Occlusion, or visibility, is arguably the single most critical aspect of visual representations. It enforces locality, as dealing with occlusion nuisances entails searching through, or marginalizing, all possible (multiply-connected) subsets of the test image. This power set is clearly intractable even for very small images. Missed detections (treating a co-visible pixel as occluded) and false alarms (treating an occluded pixel as visible) have different costs: omitting a co-visible pixel from decreases the likelihood by a factor corresponding to multiplication by a Gaussian evaluated at samples drawn from the same distribution; vice versa, including a pixel from (a false alarm) decreases the log-likelihood by a factor equal to multiplication by a Gaussian evaluated at points drawn from another distribution, such as a uniform one. So, testing for correspondence on subsets of the co-visible regions, assuming the regions are sufficiently large, reduces the power, but not the validity, of the test. This observation can be used to fix the shape of the regions, leaving only their size to be marginalized, or searched over. This reasoning justifies the use of “patches” or “receptive fields” to seed image matching, but emphasizes that a search over different sizes Dong & Soatto (2015) is needed.
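The cost asymmetry between missed detections and false alarms can be illustrated with a toy computation. In this sketch (noise level, sample size, and the uniform backdrop model are our assumptions), co-visible pixels score like a Gaussian evaluated at samples from the same distribution, while occluded pixels score like a Gaussian evaluated at unrelated uniform samples:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.05                                # assumed noise level (ours)
template = rng.random(1000)                 # "training" intensities
covisible = template + sigma * rng.standard_normal(1000)  # same distribution
occluded = rng.random(1000)                 # unrelated backdrop intensities

def gaussian_loglik(x, mu, s):
    # Log-density of N(mu, s^2) evaluated at x
    return -0.5 * ((x - mu) / s) ** 2 - np.log(s * np.sqrt(2.0 * np.pi))

avg_ll_covisible = gaussian_loglik(covisible, template, sigma).mean()
avg_ll_occluded = gaussian_loglik(occluded, template, sigma).mean()
# Including an occluded pixel (false alarm) drags the log-likelihood down
# sharply; omitting a co-visible pixel only forgoes a modest contribution.
```

The gap between the two averages grows as the noise level shrinks, which is why false alarms dominate the cost of the test.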
Together with the SOA likelihood, this also justifies the local marginalization of domain sizes, along with translation, as recently championed in Dong & Soatto (2015).
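The local marginalization of domain sizes can be sketched as follows, with the contrast invariant implemented as a normalized gradient-orientation histogram as in (5); the particular sizes, bin count, and uniform weights are our illustrative choices, not those of Dong & Soatto (2015):

```python
import numpy as np

def orientation_histogram(patch, bins=8):
    # Gradient-orientation histogram, magnitude-weighted, normalized per (5)
    gy, gx = np.gradient(patch.astype(float))
    theta = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, 2.0 * np.pi),
                           weights=mag)
    return hist / max(hist.sum(), 1e-12)

def dsp_descriptor(img, center, sizes=(8, 12, 16)):
    # Domain-size pooling: average histograms over several domain sizes
    r, c = center
    hists = []
    for s in sizes:
        h = s // 2
        hists.append(orientation_histogram(img[r - h:r + h, c - h:c + h]))
    return np.mean(hists, axis=0)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
desc = dsp_descriptor(img, (16, 16))
```

Since each histogram is normalized and the pooling weights sum to one, the pooled descriptor remains normalized.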
Corollary 1 (DSP-SIFT).
The assumptions underlying all local representations built using a single image break down when the scene is not flat and is not moving parallel to the image plane. In this case, multiple views are necessary to manage nuisance variability due to general viewpoint changes.
2.1 General viewpoint changes
If a covariant translation-scale and size sampling/anti-aliasing mechanism is employed, then around each sample the only residual variability due to viewpoint is reduced.
In some cases, a consistent reference (canonical element) for both training and test images is available, for instance when scenes or objects are geo-referenced: the projection of the gravity vector onto the image plane Jones & Soatto (2011). In this case, is the angle of the projection of gravity onto the image plane (well defined unless the two are orthogonal). Alternatively, multiple (principal) orientation references can be selected based on the norm of the directional derivative Lowe (2004):
(7) 
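Selecting principal orientations as in (7) amounts to finding peaks of a magnitude-weighted orientation histogram. A minimal sketch (the bin count and the 80% retention threshold follow common practice after Lowe (2004); the rest is our choice):

```python
import numpy as np

def principal_orientations(patch, bins=36, keep=0.8):
    # Peaks of the magnitude-weighted orientation histogram, retained when
    # within a fraction `keep` of the global maximum (cf. Lowe (2004))
    gy, gx = np.gradient(patch.astype(float))
    theta = np.mod(np.arctan2(gy, gx), 2.0 * np.pi)
    hist, edges = np.histogram(theta, bins=bins, range=(0.0, 2.0 * np.pi),
                               weights=np.hypot(gx, gy))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[hist >= keep * hist.max()]
```

On a patch with a single dominant gradient direction, a single reference orientation is returned; on more ambiguous patches, several candidates survive the threshold, each seeding its own descriptor.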
This leaves out-of-plane rotations to be managed. Dong et al. (2013) have proposed extensions of local descriptors to multiple views, based on a sampling approximation of the likelihood function , or on a point estimate of the scene , called MV-HoG and R-HoG respectively. The estimated scene has a geometric component (shape) and a photometric component (radiance) , inferred from the LA model as described in Dong & Soatto (2014). Once the effects of occlusions are considered (which force the representation to be local), and the effects of general viewpoint changes are accounted for (which creates the need for multiple training images of the same scene), a maximal contrast/viewpoint/occlusion invariant can be approximated: the SOA likelihood (6) becomes:
(8)
in addition to domain-size pooling. The assumption that no existing multiple-view extension of SIFT overcomes is the conditional independence of the intensities of different pixels. This is discussed in Soatto & Chiuso (2014) for the case of convolutional deep architectures, and in the next section for scattering networks. Capturing the joint statistics of different components of the SOA likelihood is key to modeling intra-class variability of object or scene categories.
2.2 DSP Scattering Networks
The scattering transform Bruna & Mallat (2011) convolves an image (or patch) with a Gabor filter bank at different rotations and dilations, takes the modulus of the responses, and applies an averaging operator to yield the scattering coefficients. This is repeated to produce coefficients at different layers of a scattering network. The first layer is equivalent to SIFT Bruna & Mallat (2011), in the sense that (3) can be implemented via convolution with a Gabor element at orientation , followed by taking the modulus of the response. One could conjecture that domain-size pooling (DSP) applied to a scattering network would improve performance in tasks that involve changes of scale and visibility. We call the resulting method the DSP Scattering Transform (DSP-SC). Indeed this is the case, as we show in the Appendix of Soatto et al. (2014), where we compare DSP-SC to the single-scale scattering transform (SC) on the datasets of Mikolajczyk & Schmid (2003) (Oxford) and Fischer et al. (2014).
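The first-layer pipeline, with domain-size pooling added, can be sketched as follows. This is a toy sketch, not the implementation of Bruna & Mallat (2011): the Gabor parameters, filter support, orientations, and domain sizes are all our illustrative choices, and the averaging is a plain mean over the valid response.

```python
import numpy as np

def gabor(size, theta, freq=0.25):
    # Small complex Gabor element at orientation `theta` (parameters ours)
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2.0 * (size / 4.0) ** 2))
    return envelope * np.exp(2j * np.pi * freq * u)

def correlate2_valid(img, k):
    # Plain 'valid' 2-D cross-correlation (adequate for this sketch)
    H, W = img.shape
    h, w = k.shape
    out = np.zeros((H - h + 1, W - w + 1), dtype=complex)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + h, j:j + w] * k)
    return out

def first_layer(patch, thetas=(0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
    # First scattering layer: modulus of Gabor responses, then averaging
    return np.array([np.abs(correlate2_valid(patch, gabor(7, t))).mean()
                     for t in thetas])

def dsp_sc(img, center, sizes=(12, 16, 20)):
    # DSP-SC sketch: average first-layer coefficients over domain sizes
    r, c = center
    return np.mean([first_layer(img[r - s // 2:r + s // 2,
                                    c - s // 2:c + s // 2]) for s in sizes],
                   axis=0)

rng = np.random.default_rng(0)
img = rng.random((32, 32))
coeffs = dsp_sc(img, (16, 16))
```

Deeper layers would repeat the modulus-then-average step on each first-layer response; the pooling over sizes is orthogonal to that depth.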
2.3 DSP-CNN
Deep convolutional architectures can be understood as implementing successive approximations of an optimal representation by stacking layers of (conditionally) independent local representations of the form (8), which have been shown by Soatto & Chiuso (2014) to increasingly achieve invariance to large deformations, despite locally marginalizing only affine (or similarity) transformations. As Dong & Soatto (2015) did for SIFT, and as we did for the scattering transform above, we conjectured that pooling over domain size would improve the performance of a convolutional network. In the Appendix of Soatto et al. (2014), we report experiments that test this conjecture using a pre-trained network that is fine-tuned with domain-size pooling on benchmark datasets.
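One simple way to pool a network's features over domain size is to extract crops of several sizes around a point, rescale each to the network's fixed input resolution, and average the resulting feature vectors. This is a hedged sketch, not the fine-tuning procedure of Soatto et al. (2014): `stub_net`, the crop sizes, and the nearest-neighbor resize are all hypothetical stand-ins.

```python
import numpy as np

def resize_nn(img, out=16):
    # Nearest-neighbor resize to out x out (stand-in for real interpolation)
    ri = (np.arange(out) * img.shape[0] / out).astype(int)
    ci = (np.arange(out) * img.shape[1] / out).astype(int)
    return img[np.ix_(ri, ci)]

def dsp_cnn_features(img, center, net, sizes=(16, 24, 32)):
    # Average `net` features over crops of several domain sizes, each
    # rescaled to the network's fixed input resolution
    r, c = center
    feats = [net(resize_nn(img[r - s // 2:r + s // 2, c - s // 2:c + s // 2]))
             for s in sizes]
    return np.mean(feats, axis=0)

# `stub_net` stands in for a pre-trained feature extractor (hypothetical)
stub_net = lambda x: np.array([x.mean(), x.std()])
rng = np.random.default_rng(0)
img = rng.random((64, 64))
feat = dsp_cnn_features(img, (32, 32), stub_net)
```

In practice `net` would be the truncated forward pass of a pre-trained convolutional network, and the averaging could be weighted rather than uniform.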
2.4 DSP-DPM
We have also developed domain-size pooling extensions of deformable part models (DPMs) Felzenszwalb et al. (2008): small trees of local HOG descriptors (“parts”), whereby local photometry is encoded in the latter (nodes), and geometry is encoded in their position on the image relative to the root node (edges). Intra-class shape variability is captured by the posterior density of edge values, learned from samples. Photometry is captured by a “HOG pyramid,” where the size of each part is pre-determined and fixed relative to the root. One could therefore conjecture that performing anti-aliasing with respect to the size of the parts would improve performance. Experimental results, reported in the Appendix of Soatto et al. (2014), validate the conjecture.
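The anti-aliasing over part size can be sketched as averaging a part's filter score over rescaled versions of its filter. This is our illustrative simplification, not the DPM implementation of Felzenszwalb et al. (2008): the feature map is a generic array rather than a HOG pyramid, the part filter is assumed square, and the scales and resampling are hypothetical choices.

```python
import numpy as np

def part_score(feat_map, filt):
    # Correlation score of a part filter at the center of the map (sketch)
    h, w = filt.shape
    r0 = (feat_map.shape[0] - h) // 2
    c0 = (feat_map.shape[1] - w) // 2
    return float(np.sum(feat_map[r0:r0 + h, c0:c0 + w] * filt))

def dsp_part_score(feat_map, base_filt, scales=(0.8, 1.0, 1.2)):
    # Anti-alias over part size: average the scores of rescaled part
    # filters (nearest-neighbor resampling of a square filter)
    scores = []
    for s in scales:
        n = max(1, int(round(base_filt.shape[0] * s)))
        idx = (np.arange(n) * base_filt.shape[0] / n).astype(int)
        scores.append(part_score(feat_map, base_filt[np.ix_(idx, idx)]))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
fmap = rng.standard_normal((16, 16))
filt = rng.standard_normal((5, 5))
pooled = dsp_part_score(fmap, filt)
```

With the single scale 1.0 the pooled score reduces to the ordinary part score, so the extension degrades gracefully to the standard model.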
Acknowledgments
We acknowledge discussions with Alessandro Chiuso, Joshua Hernandez, Arash Amini, Ying-Nian Wu, Taco Cohen, Virginia Estellers, and Jonathan Balzer. Research supported by ONR N00014-11-1-0863, NSF RI-1422669, and FA8650-11-1-7154.
References
 Bruna & Mallat (2011) Bruna, J. and Mallat, S. Classification with scattering operators. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2011.
 Dong & Soatto (2014) Dong, J. and Soatto, S. The Lambert-Ambient shape space and the systematic design of feature descriptors. R. Cipolla, S. Battiato, G.M. Farinella (Eds), Springer Verlag, 2014.
 Dong & Soatto (2015) Dong, J. and Soatto, S. Domain-size pooling in local descriptors: DSP-SIFT. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2015.
 Dong et al. (2013) Dong, J., Karianakis, N., Davis, D., Hernandez, J., Balzer, J., and Soatto, S. Multi-view feature engineering and learning. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2015. (also ArXiv: 1311.6048, 2013).
 Felzenszwalb et al. (2008) Felzenszwalb, P., McAllester, D., and Ramanan, D. A discriminatively trained, multiscale, deformable part model. In CVPR, pp. 1–8, 2008.
 Fischer et al. (2014) Fischer, P., Dosovitskiy, A., and Brox, T. Descriptor matching with convolutional neural networks: a comparison to SIFT. ArXiv:1405.5769, 2014.
 Geman et al. (2015) Geman, D., Geman, S., Hallonquist, N., and Younes, L. Visual Turing test for computer vision systems. Proceedings of the National Academy of Sciences, 112(12):3618–3623, 2015.
 Jones & Soatto (2011) Jones, E. and Soatto, S. Visual-inertial navigation, localization and mapping: A scalable real-time large-scale approach. Intl. J. of Robotics Res., 2011.
 Lowe (2004) Lowe, D. G. Distinctive image features from scale-invariant keypoints. IJCV, 2(60):91–110, 2004.
 Mikolajczyk & Schmid (2003) Mikolajczyk, K. and Schmid, C. A performance evaluation of local descriptors. 2003.
 Soatto & Chiuso (2014) Soatto, S. and Chiuso, A. Visual scene representations: sufficiency, minimality, invariance and deep approximation. Proc. of the ICLR Workshop, 2015 (also ArXiv: 1411.7676, 2014).

 Soatto et al. (2014) Soatto, S., Dong, J., and Karianakis, N. Visual scene representation: scaling and occlusion in convolutional architectures. Technical report UCLA CSD-140024, 2014. (Extended version of this manuscript).