On the Design and Analysis of Multiple View Descriptors

11/23/2013 · Jingming Dong, et al.

We propose an extension of popular descriptors based on gradient orientation histograms (HOG, computed in a single image) to multiple views. It hinges on interpreting HOG as a conditional density in the space of sampled images, where the effects of nuisance factors such as viewpoint and illumination are marginalized. However, such marginalization is performed with respect to a very coarse approximation of the underlying distribution. Our extension leverages the fact that multiple views of the same scene allow separating intrinsic from nuisance variability, and thus afford better marginalization of the latter. The result is a descriptor that has the same complexity as single-view HOG, and can be compared in the same manner, but exploits multiple views to better trade off insensitivity to nuisance variability with specificity to intrinsic variability. We also introduce a novel multi-view wide-baseline matching dataset, consisting of a mixture of real and synthetic objects with ground-truthed camera motion and dense three-dimensional geometry.


1 Introduction

Images of a particular object or scene depend on its intrinsic properties (e.g., shape, reflectance) and on nuisance factors (e.g., viewpoint, illumination, partial occlusion, sensor characteristics) that have no bearing on its intrinsic properties but nevertheless affect the data. A (feature) descriptor is a function of the data designed to remove or reduce nuisance variability while preserving intrinsic variability. The ideal descriptor would be the probability density of the data (images), conditioned on the intrinsic factors, with nuisance factors integrated out (marginalized). (That would be the likelihood function, which is a minimal sufficient statistic.) This would enable evaluating the likelihood that any image is generated by a given object or scene, which is key to many visual decisions (detection, recognition, localization, categorization) that hinge on evaluating the probability that (local regions in) two images correspond, i.e., are generated by the same portion of the underlying scene.

1.1 Contributions

The first observation we offer is that an extension of commonly used descriptors such as SIFT/HOG [24, 11] (and their many variants, which we refer to collectively as HOG), written as in (4), can be interpreted as such a class-conditional density, but with nuisance factors marginalized with respect to a coarse approximation of the underlying nuisance distribution (Claim 1). Such an approximation is necessarily poor, as a single image does not afford the ability to separate intrinsic from nuisance variability (Sec. 2.1). Thus, any descriptor constructed from a single view fails to be shape-discriminative (Claim 2).

The second observation is that multiple views of the same underlying scene (e.g., a video) allow us to attribute the variability in the data to nuisance factors, provided the underlying scene is static at the time-scale of capture. Thus, our working hypothesis is that one ought to be able to leverage multiple views to construct better descriptors, i.e., better approximations of the class-conditional density.

Our main contribution is to show how this can be done in two ways: using a sample approximation of the nuisance distribution, obtained via a tracker, leading to Multi-view HOG (MV-HOG, Sec. 3.1), or using a point estimate of the intrinsic factors, obtained via multi-view stereo or another reconstruction method, leading to Reconstructive HOG (R-HOG, Sec. 3.2). The two coincide under sufficient excitation conditions [5] on the samples. In either case, the result is a descriptor that has the same run-time and storage complexity as standard (single-view) HOG, but better separates intrinsic from nuisance variability. When only one “training image” is given, both reduce to single-view HOG.

Comparison with a (single or multiple) “test image” can be performed in three ways, corresponding to different interpretations of the descriptors (Sec. 4): as expectations (statistics), using a distance in the embedding (linear) space, as is customary; as distributions, using any distance or divergence between probability densities; or as likelihoods, by testing the likelihood of a (test) image under the probability model of the underlying scene encoded in the training images.

To empirically validate the proposed descriptors, we need a dataset that provides multiple training images (e.g., video), and a plurality of test images obtained under different viewpoint, illumination, partial occlusion, etc. To make the comparison of descriptors independent of the detection mechanism, we need ground-truth shape, to establish correspondence between training and test images. Despite our best efforts, we have not been able to identify such a dataset among those commonly used for benchmarking descriptors. Therefore, another contribution we offer is a new dataset for multi-view wide-baseline matching (Sec. 5.1), comprising a variety of real and synthetic objects, with ground-truth camera motion and object shape. Both the dataset and our code will be publicly released upon completion of the anonymous review process. Strengths and weaknesses of our method are discussed in Sec. 6.

1.2 Related Work

The literature offers a multitude of local single-view descriptors and empirical tests, e.g., [28] as of 2004. More recent examples include [4, 9, 8, 33, 38, 23, 1]; [6, 43] couple learning with optimization to minimize classification error; [10, 40] analyze important implementation details. There are relatively few examples of local multi-view descriptors: e.g., [12] combines spatial (averaged SIFT) and temporal information (co-visibility), [16] learns descriptors by feature selection from trajectories of key points, [26] uses kernel principal component analysis to learn variations among tracked patches for wide-baseline matching, and [22] learns the best template by taking the modes of distributions learned over time.

Technically, a descriptor is a statistic (a deterministic function of the data), designed to be insensitive to nuisance variability and yet retain intrinsic variability. The process of finding the optimal tradeoff between these two properties can be described probabilistically in a conditional model, and has been formalized by the Information Bottleneck (IB) [37], a generalization of the notion of sufficient statistics. Thus, ideally, we would like to find the sufficient statistics of images for the purpose of classification under changes of viewpoint, illumination and partial occlusions. Unfortunately, this problem is intractable. A recent trend of learning descriptors ab ovo [31, 27] is interesting, but the results are hard to analyze. At the opposite end of the modeling spectrum, statistical approaches prevalent in the nineties [17] have failed to translate into state-of-the-art methods. Some approaches [21, 7, 30, 18] are positioned in between, with some stages engineered and the rest learned. Nevertheless, some of the most fundamental questions on how to relate these approaches are still largely open.

2 Methodology

2.1 Notation and assumptions

Images of a static object (or “scene”) and their spatial gradients, obtained under different viewpoints and illuminations, can be represented as domain deformations (“warpings”) and range deformations (“contrast”) of a common radiance function (informally, a “texture map” or “albedo”) defined on the surface (“shape”) of the scene. With an abuse of notation, we can define the radiance on the image domain by back-projecting it onto the surface, i.e., by evaluating it at the point of first intersection with the surface of the pre-image of each pixel under perspective projection (the line through the optical center and the pixel). If the vantage point is represented by a Euclidean reference frame with position and orientation relative to some global frame, then the warping

(1)

entangles the shape (an intrinsic property of the scene) and the viewpoint (a nuisance factor in the data formation process); warpings compose in the usual way. The image is then generated via

(2)

up to a residual (“noise”) assumed white, zero-mean, homoscedastic and Gaussian with some variance, where IID stands for independent and identically distributed.

This data formation model is only valid for Lambertian scenes under constant diffuse illumination, away from occluding boundaries. We will therefore assume these conditions henceforth. The ideal descriptor would be the probability density of any image, conditioned on the intrinsic factors (shape and radiance), and marginalized with respect to the nuisances (viewpoint and contrast):

(3)

This is the likelihood function, which is a minimal sufficient statistic. Although the distribution of an image is extremely complex, conditioned on the scene and nuisances it is IID under the assumptions of the model (2).
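For concreteness, the following sketch illustrates the data formation model (2) in code: an “albedo” image undergoes a domain warp, a monotonic contrast change, and additive IID Gaussian noise. The smooth random displacement field, the gamma-style contrast curve, and all parameter values are illustrative assumptions, not part of the model above.

```python
# Minimal sketch of the data formation model (2): image = contrast(albedo ∘ warp) + noise.
# Assumes a fronto-parallel scene; the warp, contrast and noise settings are illustrative.
import numpy as np
from scipy.ndimage import map_coordinates, gaussian_filter

def synthesize_view(albedo, noise_std=0.02, warp_scale=3.0, gamma=0.8, seed=0):
    rng = np.random.default_rng(seed)
    H, W = albedo.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float64)
    # Smooth random displacement field standing in for a viewpoint-induced warp.
    dx = gaussian_filter(rng.standard_normal((H, W)), 8) * warp_scale
    dy = gaussian_filter(rng.standard_normal((H, W)), 8) * warp_scale
    warped = map_coordinates(albedo, [ys + dy, xs + dx], order=1, mode='nearest')
    # Monotonic contrast transformation (here a gamma curve) plus IID Gaussian noise.
    contrasted = np.clip(warped, 0, 1) ** gamma
    return contrasted + rng.normal(0.0, noise_std, size=(H, W))

if __name__ == "__main__":
    rho = np.clip(gaussian_filter(np.random.default_rng(1).random((128, 128)), 2), 0, 1)
    views = [synthesize_view(rho, seed=t) for t in range(5)]  # multiple views of one "scene"
    print(views[0].shape, float(views[0].std()))
```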

2.2 HOG as a conditional density

This section derives a continuous gradient orientation distribution, from which various descriptors such as HOG and SIFT are obtained by sampling (see the appendix). This immediately suggests a point-wise inversion formula, in the sense of maximum likelihood. Let the gradient magnitude be the norm of the image gradient, and let the gradient orientation be represented by a unit-norm vector whose angle with the abscissa defines the orientation. The (un-normalized) density of gradient orientations (DOG) is a function of orientation and location, with parameters representing the angular and spatial kernel widths:

(4)

where the angular kernel is an angular Gaussian [42]. Popular descriptors, such as [11, 24] (e.g., [24] quantizes orientation into a fixed number of bins and samples on a lattice, using a bi-linear kernel instead of a Gaussian), are sampled versions of (4) with different kernels. The measure captures the statistics of natural images: it is small almost everywhere, except near image boundaries, where it approaches an impulse. A normalized version of (4) yields

(5)
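As an illustration of the construction above, the sketch below computes a gradient-orientation density on the pixel grid using a Gaussian spatial kernel, a wrapped (angular) Gaussian kernel evaluated at a set of orientation bin centers, and gradient-magnitude weighting, followed by the normalization of (5). The specific kernel forms, bin count, and function names are assumptions made for illustration.

```python
# Sketch of an (un-normalized) density of gradient orientations and its normalized version.
# Gaussian spatial kernel, wrapped-Gaussian angular kernel, and gradient-magnitude weighting
# are illustrative choices.
import numpy as np
from scipy.ndimage import gaussian_filter

def orientation_density(image, n_bins=8, sigma_space=4.0, eps_angle=0.35):
    gy, gx = np.gradient(image.astype(np.float64))
    mag = np.hypot(gx, gy)                       # measure weighting each location
    ang = np.arctan2(gy, gx)                     # gradient orientation in (-pi, pi]
    centers = np.linspace(-np.pi, np.pi, n_bins, endpoint=False)
    h = np.empty((n_bins,) + image.shape)
    for k, c in enumerate(centers):
        d = np.angle(np.exp(1j * (ang - c)))     # wrapped angular difference
        w = mag * np.exp(-0.5 * (d / eps_angle) ** 2)   # angular kernel, magnitude-weighted
        h[k] = gaussian_filter(w, sigma_space)   # spatial Gaussian kernel
    return h                                     # un-normalized, cf. (4)

def normalize(h, tiny=1e-8):
    return h / (h.sum(axis=0, keepdims=True) + tiny)    # normalized version, cf. (5)
```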

Given a (“training”) image, since we cannot determine the contrast and warping, we can assume them to be the identity, in which case the likelihood of a (“test”) image having a given gradient orientation at a given location is

(6)

and HOG (4) is the corresponding (un-normalized) density,

(7)

with DOG its normalized version. Given a descriptor constructed from an image, one can then infer the most likely image, up to a contrast transformation, point-wise via

(8)

and then solve a boundary-value problem to integrate the resulting gradient field into an estimate of the image (Fig. 1).

Figure 1: HOG inversion. The classic Lena image is used to construct single-view descriptors at varying kernel widths; these are in turn used to reconstruct the gradient field, which is integrated to yield the reconstructed images.

Since this inversion formula is point-wise, it only provides a reconstruction where the descriptor is computed. A dense reconstruction can be obtained from sparsely sampled descriptors as shown in the appendix.
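The point-wise inversion can be sketched as follows: at each location, pick the most likely orientation under the descriptor, form a gradient field, and integrate it. The FFT-based least-squares integration below assumes periodic boundary conditions, a simplification of the boundary-value problem mentioned above; it takes as input a density array such as the one produced by the previous sketch.

```python
# Sketch of point-wise inversion of a gradient-orientation density followed by least-squares
# integration (periodic boundary conditions via FFT; a simplification for illustration).
import numpy as np

def invert_descriptor(h, magnitude=None):
    n_bins = h.shape[0]
    centers = np.linspace(-np.pi, np.pi, n_bins, endpoint=False)
    theta = centers[h.argmax(axis=0)]            # most likely orientation per pixel
    m = np.ones(theta.shape) if magnitude is None else magnitude
    return m * np.cos(theta), m * np.sin(theta)  # reconstructed gradient field (gx, gy)

def poisson_integrate(gx, gy):
    H, W = gx.shape
    fx, fy = np.fft.fft2(gx), np.fft.fft2(gy)
    u = 2j * np.pi * np.fft.fftfreq(W)[None, :]      # derivative operator along x
    v = 2j * np.pi * np.fft.fftfreq(H)[:, None]      # derivative operator along y
    denom = u * np.conj(u) + v * np.conj(v)
    denom[0, 0] = 1.0                                # the mean is unobservable; fix it
    F = (np.conj(u) * fx + np.conj(v) * fy) / denom  # least-squares solution in Fourier domain
    F[0, 0] = 0.0
    return np.real(np.fft.ifft2(F))
```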

2.3 Handling nuisance variability in HOG/DOG

If one interprets HOG/DOG as a probability density (7), it is worth understanding how it relates to the ideal descriptor (3). In particular, what measure does HOG/DOG use to marginalize nuisances?

Contrast transformations (monotonic continuous transformations of the range of the image), often used to model gross illumination changes, can be eliminated at the outset without marginalization. Despite being simplistic (they do not account for specularity, translucency, inter-reflection, etc.), they form an infinite-dimensional group, so defining a base measure, learning a distribution over it, and marginalizing (3) is problematic.

However, contrast transformations do not depend on intrinsic properties of the scene. Neglecting (spatial and range) quantization, the geometry of the level curves of the image is a complete contrast invariant [36], in the sense that it enables reconstructing an image that is equivalent to the original up to a contrast transformation [2]. Since the gradient orientation is everywhere orthogonal to the level curves, by replacing the intensity-based kernel with a gradient-orientation kernel [42] (we overload the notation by using the same symbol for an ordinary univariate Gaussian density and for an angular Gaussian) we can annihilate the effects of contrast changes at no loss of discriminative power [36]. Following (7), given the radiance and therefore its gradient orientation, the descriptor is either the likelihood of the albedo having a given orientation, or the likelihood of it having a given (possibly normalized) intensity value, at each location. Then (3) at a given location and orientation can be written as

(9)

This corresponds to HOG (7) if we restrict the nuisance variability to constant domain translations, independent of location:

(10)

after a change of variables and normalization. The latter is necessary to obtain a probability density, and also to achieve contrast invariance: unlike DOG, HOG is in fact not contrast invariant, as one can easily verify.

Claim 1 (HOG as a (poor approximation of the) class-conditional distribution).

HOG/DOG (5) is an approximation of the class-conditional distribution (3) of images of the same underlying scene, where only viewpoint changes that correspond to planar translations of the image are marginalized.

Similarly to contrast, we can eliminate low-dimensional group transformations of the domain at the outset through a co-variant detector, which determines a local reference frame that co-varies with the group; the data expressed in such a moving frame is then, by construction, invariant to it. Because co-variant detectors do not commute with non-invertible transformations such as occlusion or spatial quantization, it can be shown [34] that one has to restrict attention to local neighborhoods of co-variant detector frames. The literature offers a variety of location-scale co-variant detectors, e.g., extrema (in location and scale) of the Hessian-of-Gaussian, Laplacian-of-Gaussian, or Difference-of-Gaussian convolutions of the data, in addition to other “corner detectors” applied to any scale-space. Planar rotation can also be eliminated using the maximal gradient direction [24], or based on an inertial reference (e.g., the projection of gravity onto the image plane [20]) if the scene of interest is geo-referenced. Therefore, from now on we assume planar similarities are removed, and deal with the residual domain deformations. Even so, however, marginalization in HOG is relative to a distribution that is independent of the underlying scene:

Claim 2 (HOG is not shape-discriminative).

The HOG/DOG descriptor (5) is agnostic of changes in the shape of the underlying scene that keep the (one) image constant.

One should not confuse the shape of the scene, which depends on its three-dimensional geometry, with the (two-dimensional) shape of the intensity profile of the image, which is a photometric property encoded in the radiance. Clearly HOG depends on the latter, but it is insensitive to any deformation of the scene that would yield the same projection onto the (single) image. The two are related only at occluding boundaries, which, however, are excluded from our analysis, as well as from the typical use of local descriptors (although see [41, 3]).

3 Extensions to multiple views

In a sequence of images, consistent co-variant frames can be determined by a tracker (Sec. 5) and eliminated at the outset, together with contrast transformations, as in Sec. 2.3. The result is a sequence of (similarity-invariant) images/patches that sample the residual nuisance variability, since by assumption they all portray the same underlying scene.

3.1 Extension via Sampling

If we were given the radiance, multiple views would provide us with samples from the (class-specific) distribution of warpings, via dense correspondence (or optical flow). Ideally, one would use these samples to compute a Parzen-like estimate of the distribution, but this presents technical challenges due to the curse of dimensionality. However, one could use the (space-varying) samples to define a (space-varying) measure on constant displacement fields, via

(11)

where the dependency on location is through the samples. Using this measure (which is not a proper measure on the group of diffeomorphisms), under sufficient excitation conditions [5] we can approximate the marginalization in (3) with

(12)

This could be considered an extension of HOG/DOG to multiple views, if we knew the radiance. Unfortunately, we only measure the images, related to the radiance via (2). A further approximation can be made by assuming that the warping is smooth, so that its Jacobian is small. Applying a change of variables, we obtain

(13)

since the effects of the approximation can be absorbed by inflating the noise covariance. (An alternate derivation can be obtained via a Monte Carlo approximation; however, since the group of diffeomorphisms is infinite-dimensional, a finite collection of views cannot be a sufficiently exciting fair sample from it. To introduce some regularization, and to take into account residual translations from the tracker, one can perform spatial blurring and enforce the statistics of natural images, arriving at the same expression as (13).) This can be easily implemented given a collection of images, and can be used to compute the likelihood of a test image, or it can be interpreted as the class-conditional distribution in (3), assuming the sample is sufficiently exciting. Writing it explicitly as a function of location and angle, given a collection of images (patches), we have

(14)

which we call MV-HOG. Its contrast-insensitive version is obtained by normalization, as usual. We then have

(15)

asymptotically, as the number of views grows, under sufficient excitation conditions.
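A minimal sketch of how (14) can be realized from a tracked patch sequence: per-frame orientation densities are averaged over time and then renormalized over the orientation axis for contrast insensitivity. The density function is passed in as a parameter (e.g., the illustrative one from Sec. 2.2); the averaging and lattice-sampling choices are implementation assumptions.

```python
# Sketch of MV-HOG: average per-frame gradient-orientation densities over a tracked patch
# sequence, then renormalize over the orientation axis. `density_fn` maps a patch to an
# array of shape (n_bins, H, W), e.g. the orientation_density sketch above.
import numpy as np

def mv_hog(patches, density_fn):
    acc = None
    for patch in patches:                        # co-registered views of the same region
        h = density_fn(patch)                    # per-frame (un-normalized) density, cf. (4)
        acc = h if acc is None else acc + h      # pooling samples of the nuisance distribution
    acc /= len(patches)
    return acc / (acc.sum(axis=0, keepdims=True) + 1e-8)   # contrast-insensitive, cf. (14)-(15)

def flatten_on_lattice(h, stride=8):
    # Sample the density on a coarse lattice to obtain a fixed-length vector descriptor.
    return h[:, ::stride, ::stride].reshape(-1)
```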

3.2 Extension via reconstruction

To compute a better marginalization, given more than one image of the same scene, we can perform a (dense) reconstruction of shape, radiance and motion, for instance using [13, 19, 15], where the surface area of the shape is used as a regularizer (a regularizer for the radiance is usually not needed under the Lambertian assumption [35]). This yields a point estimate of the shape, from which an object-specific measure can be constructed using a uniform base measure on viewpoints. (Dense reconstruction is a difficult, non-convex, ill-posed, infinite-dimensional optimization problem that often fails to yield accurate solutions when the data is not “sufficiently exciting”; for instance, multiple images of a white scene do not enable reconstructing its shape. However, in our context, when conditions prevent us from accurately recovering shape, it means that any deformation is equally likely, so the resulting descriptor is already invariant to nuisance deformations.) The restriction from general diffeomorphisms to viewpoint changes is possible thanks to the elimination of planar similarities, leaving us to marginalize over a compact group with a proper Haar measure. If orientation is fixed by gravity, the measure can be further restricted to rotations about gravity. Note that this measure is not defined on the entire group of diffeomorphisms, but only on those generated by moving around an object with the reconstructed shape (a measure on viewpoints induces a measure on such diffeomorphisms via the push-forward). From this a more specific descriptor can be constructed, which we call R-HOG:

(16)

As usual, we achieve insensitivity to contrast by normalization, and

(17)

asymptotically, if the estimator is unbiased. Note that MV-HOG and R-HOG are informationally equivalent under sufficient excitation, but computationally they embody very different philosophies (Monte Carlo for MV-HOG vs. maximum likelihood for R-HOG) and yield different implementations. Both have the same complexity as HOG, and can be compared as we describe next.
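Under the same illustrative conventions, R-HOG can be sketched as an average of orientation densities over patches synthesized from sampled vantage points; the synthesis function is a placeholder for the reconstruction-based warping detailed in Sec. 5.2, and the hemisphere sampling is an assumption.

```python
# Sketch of R-HOG: marginalize over vantage points by averaging orientation densities of
# synthesized patches. `synthesize_fn(patch, azimuth, tilt)` stands in for the
# reconstruction-based warping of Sec. 5.2; `density_fn` as in the MV-HOG sketch.
import numpy as np

def sample_hemisphere(n_azimuth=12, tilts=(0.2, 0.4, 0.6)):
    return [(az, tilt) for tilt in tilts
            for az in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False)]

def r_hog(patch, density_fn, synthesize_fn):
    acc = None
    for az, tilt in sample_hemisphere():
        warped = synthesize_fn(patch, az, tilt)          # simulate a vantage point
        h = density_fn(warped)
        acc = h if acc is None else acc + h
    return acc / (acc.sum(axis=0, keepdims=True) + 1e-8)   # cf. (16)-(17)
```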

4 Evaluation and comparison of descriptors

The descriptors (14) or (16) can be interpreted as statistics, deterministic functions of the training set. In fact, we have

(18)

where the expectation is taken with respect to the corresponding nuisance measure: planar translations for HOG, the sample measure (11) for MV-HOG, and the reconstruction-induced measure for R-HOG.

Once sampled, each descriptor can be thought of as a vector having the dimension of the lattice where it is evaluated, times the number of bins into which the gradient orientation is quantized. It can then be compared to a descriptor computed from a single view as an element of the ambient linear space (even though descriptors do not live in a linear space), for instance using the norm of the difference or the correlation coefficient, as is customary [24].

Alternatively, they can be interpreted as class-conditional densities, (14) for MV-HOG or (16) for R-HOG, and compared using any distance or divergence between distributions, for instance Kullback-Leibler or Bhattacharyya.

Finally, they can be interpreted as likelihood functions; as such, they provide the likelihood that a given test image has a certain gradient orientation at each location, under the model implied by the training set.
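The three interpretations suggest three simple comparison rules, sketched below under the assumption that descriptors are stored as arrays of shape (bins, height, width) and normalized over the orientation axis: Euclidean distance on the flattened vectors, an average Bhattacharyya distance between per-location orientation distributions, and the average log-likelihood of the test patch's gradient orientations under the training descriptor.

```python
# Sketch of the three comparison rules: vector distance, distance between distributions,
# and likelihood of a test image under the model encoded by the (normalized) descriptor.
import numpy as np

def l2_distance(d1, d2):
    return float(np.linalg.norm(d1.ravel() - d2.ravel()))

def bhattacharyya(p, q, tiny=1e-8):
    # p, q: arrays of shape (n_bins, H, W), normalized over the orientation axis.
    bc = np.sqrt(p * q).sum(axis=0)             # per-location Bhattacharyya coefficient
    return float(-np.log(bc + tiny).mean())     # averaged Bhattacharyya distance

def avg_log_likelihood(descriptor, test_patch, tiny=1e-8):
    # Evaluate the test patch's gradient orientations under the training descriptor.
    n_bins = descriptor.shape[0]
    gy, gx = np.gradient(test_patch.astype(np.float64))
    ang = np.arctan2(gy, gx)
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    ii, jj = np.mgrid[0:test_patch.shape[0], 0:test_patch.shape[1]]
    probs = descriptor[bins, ii, jj]            # likelihood of the observed orientation
    weights = np.hypot(gx, gy)                  # weight by gradient magnitude
    return float((weights * np.log(probs + tiny)).sum() / (weights.sum() + tiny))
```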

Yet another alternative is to max out the nuisance, by evaluating the likelihood at the best-matching nuisance sample rather than averaging over samples. This maximum corresponds to the minimum of the least-squares residual

(19)

Experiments comparing the marginalized descriptor (16), the max-out descriptor (19), and the MV-HOG descriptor (14) are reported in Sect. 5.

5 Experiments and Evaluation

5.1 Dataset

Figure 2: Dataset. Sample objects in the dataset constructed for evaluation. In addition, synthetic samples are generated by texture-mapping random images onto solid models available in MeshLab.

Many datasets are available to test image-to-image matching, where both training and test sets are individual images, each of a different scene. Fewer are available for testing multi-view descriptors [29], and to the best of our knowledge none provide pixel-level correspondence that can be used for validation. Even more problematic, they do not provide a separate test set. We sought the objects from [29] to capture a new test set, but those are no longer available. We explored new multi-modal datasets [14], where ground-truth tracking and range are provided, but with no separate test set. We also explored using commercial ground imagery, but most of it is absent in Karlsruhe due to German privacy laws.

We have therefore constructed a new dataset, similar in spirit to [29], but with a separate test set and dense reconstruction for validation, using a combination of real and synthetic (rendered) objects. The latter are generated by texture-mapping random images onto surface models available in MeshLab. The former are household objects of the kind seen in Fig. 2: some have significant texture variability, others little; some have complex shape and topology, others are simple. In each case, a sequence of (training) images per object is obtained by moving around the object in a closed trajectory at varying distance. For real objects, the training trajectory circumnavigates them so as to reveal all visible surfaces; for synthetic ones the frames span a smaller orbit (Fig. 3).

Figure 3: Data Generation. Synthetic 3D object models are texture-mapped with random city texture (not shown in the modeling snapshot). The trajectory of the camera in the training set is shown as an ellipse attached to the orange frustum. Test images are generated from a camera at vantage points (black frusta) sufficiently far from the training trajectories.

Ground Truth: We compare descriptors built from the (training) video against single test frames, or descriptors built from them, after first selecting test images in which a sufficient co-visible area is present. To establish ground truth, we reconstruct a dense model of each (real) object using an RGB-D (structured light) range sensor and a variant of Kinect Fusion (YAS), for which code is available online. The reconstructed surface enables dense correspondence between co-visible regions in different images by back-projection. This is further validated with standard tools from multiple-view geometry, using epipolar RANSAC bootstrapped with SIFT features. Occlusions are determined using the range map.

5.2 Implementation Details

Detection and Tracking: Although our goal is to evaluate descriptors, and therefore we could do away with the detector, one has to select where the descriptor is computed. We use FAST [32] as a mechanism to (conservatively) eliminate regions that are expected to have non-discriminative descriptors, but this step could be forgone. Scale changes are handled in a discrete scale-space, i.e., images are downsampled by half up to 4 times and FAST is computed at each level. We select different brightness-difference thresholds in FAST to ensure roughly the same number of detections in each frame. A minimum distance between features is also enforced to avoid tracking ambiguities; short-baseline correspondence is established with standard MLK [25]. A conservative rejection threshold is chosen to favor long tracks. A sequence of image locations is returned by the tracker for each region, which is then sampled in a rectangular neighborhood at the scale of the detector. We report experiments on two window sizes, illustrative of the range of experiments conducted. The sequence of such windows is then used to compute the descriptors.
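A minimal OpenCV-based sketch of such a front end is given below; the FAST threshold, suppression radius, window size, and rejection threshold are illustrative values, and the discrete scale-space handling is omitted for brevity.

```python
# Sketch of the tracking front end: FAST corners, minimum-distance pruning, and
# pyramidal Lucas-Kanade short-baseline tracking. Parameter values are illustrative.
import numpy as np
import cv2

def detect_fast(gray, threshold=25, min_dist=8):
    kp = cv2.FastFeatureDetector_create(threshold=threshold).detect(gray, None)
    kp = sorted(kp, key=lambda k: -k.response)
    kept = []
    for k in kp:                                 # greedy minimum-distance suppression
        if all((k.pt[0] - q.pt[0]) ** 2 + (k.pt[1] - q.pt[1]) ** 2 >= min_dist ** 2
               for q in kept):
            kept.append(k)
    return np.float32([k.pt for k in kept]).reshape(-1, 1, 2)

def track(frames, max_err=8.0):
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    pts = detect_fast(prev)
    tracks = [pts]
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        nxt, status, err = cv2.calcOpticalFlowPyrLK(prev, gray, pts, None, winSize=(21, 21))
        good = (status.ravel() == 1) & (err.ravel() < max_err)   # conservative rejection
        pts = nxt[good]
        tracks = [t[good] for t in tracks] + [pts]
        prev = gray
    return tracks   # per-frame locations of the surviving tracks
```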

Descriptors: We use HOG from [39] as a baseline, computed on each patch at each frame as determined by the detector and tracker. We set the spatial division parameter to obtain the same number of cells for both patch sizes, which fixes the dimension of the resulting HOG. To be consistent with common practice, we perform no post-normalization. MV-HOG is implemented according to Sec. 3.1: for each pixel we build a histogram of gradient orientations at all locations within a neighborhood, weighted by the distance to the center of the neighborhood and by the gradient magnitude. We linearly interpolate orientation bins, with a cut-off, as an approximation of the angular Gaussian kernel. We choose the neighborhood to be a quarter of the patch size so as to have the same spatial weighting as HOG. The number of orientation bins is fixed, and temporal aggregation is computed incrementally. Finally, we post-normalize the descriptor according to (5), thus obtaining a quantized DOG.
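The incremental temporal aggregation mentioned above can be sketched as a running average that never stores past samples; the update rule below is one standard choice, and the density function is passed in as in the earlier sketches.

```python
# Sketch of incremental temporal aggregation: a running average of per-frame orientation
# densities, so past patches need not be stored. `density_fn` as in the earlier sketches.
class IncrementalMVHOG:
    def __init__(self, density_fn):
        self.density_fn = density_fn
        self.mean = None
        self.count = 0

    def update(self, patch):
        h = self.density_fn(patch)
        self.count += 1
        self.mean = h if self.mean is None else self.mean + (h - self.mean) / self.count
        return self

    def descriptor(self):
        return self.mean / (self.mean.sum(axis=0, keepdims=True) + 1e-8)  # quantized DOG
```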

Reconstruction and marginalized descriptors: To compute an approximation of the marginalized descriptor (Sec. 3.2), in lieu of a dense 3-D reconstruction from each tracked sequence we computed an approximate dense reconstruction of the entire object and compared it to the reconstruction obtained from YAS. Since the former had significant artifacts, to obtain a performance upper bound we used the restriction of the YAS reconstruction to the pre-image of the tracked patches to build the marginalized descriptors. Using the tracked point as the origin of the local reference frame, we back-project the regular lattice around the origin onto the surface, rotate it, and project it back onto the image plane to simulate different vantage points. We use keyframes sampled uniformly from the training sequence, and sample a viewing hemisphere of vantage points. We check the orientation of the surface via the inner product of its normal with the optical axis to ensure visibility. Fig. 4 shows sample synthesized patches; these are stored in the database. Max-out in Eqn. (19) amounts to searching through these samples.
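A simplified version of this synthesis can be sketched with a locally planar approximation: the plane through the tracked 3-D point, with the estimated normal, induces a homography for each simulated rotation about that point, and a normal test against the optical axis rejects grazing views. The calibration matrix (assumed expressed in patch coordinates), the rotation sampling, and the planarity assumption are illustrative simplifications of the back-projection procedure described above.

```python
# Sketch of viewpoint synthesis under a locally planar approximation: the plane through the
# tracked 3-D point P (at the given depth along the optical axis) with the estimated normal
# induces the homography H = K (R + t n^T / d) K^{-1}, with t chosen so that P stays fixed.
# Calibration, rotation sampling and the planarity assumption are illustrative.
import numpy as np
import cv2

def synthesize_views(patch, K, normal, depth, tilts=(0.2, 0.4), n_azimuth=8):
    h_img, w_img = patch.shape[:2]
    P = np.array([0.0, 0.0, depth])
    d = float(normal @ P)                               # plane: n^T X = d
    views = []
    for tilt in tilts:
        for az in np.linspace(0, 2 * np.pi, n_azimuth, endpoint=False):
            axis = np.array([np.cos(az), np.sin(az), 0.0])
            R = cv2.Rodrigues((axis * tilt).reshape(3, 1))[0]   # out-of-plane rotation
            if abs(float((R @ normal)[2])) < 0.2:       # normal vs. optical axis: reject grazing
                continue
            t = (np.eye(3) - R) @ P                     # keep the tracked point fixed
            H = K @ (R + np.outer(t, normal) / d) @ np.linalg.inv(K)
            views.append(cv2.warpPerspective(patch, H, (w_img, h_img)))
    return views
```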

Figure 4: Viewpoint Synthesis. A local patch (center) is used to synthesize patches from different vantage points. We restrict the synthesis to out-of-plane and in-plane rotations. A normal test is performed to reject patches (green) that are not visible. The artifacts in the lower-left are caused by inaccurate estimates of the normal vector.

Test descriptors: In order to minimize the effects of the detector in the evaluation (our goal is to evaluate descriptors), we use the ground-truth dense reconstruction to determine ground-truth correspondences (Sect. 5.1) between training and test images, including scale. We then extract a patch of the same size as used in training from the same pyramid level of the test image. The test descriptors are then computed on the neighborhood of the true corresponding location. For each object, this generates around a thousand positive test descriptors, after rejecting those that are not co-visible.

5.3 Evaluation and Comparison

Given a descriptor database, the simplest method to match a test query is via nearest-neighbor (NN) search. While not a sophisticated comparison method, this suffices for our purpose. We compare four combinations using the same NN search method: (1) SV-HOG, single-view HOG computed on a random image from the training sequence; (2) MV-HOG, the time-aggregated, densely computed histogram of gradient orientations using all tracked patches; (3) KeepALL-HOG, single-view HOG descriptors computed and stored at every frame; and (4) R-HOG, single-view HOG computed via the reconstruction, as described in Sect. 5.2. The score used is recognition rate, shown in Fig. 5 for all four methods.
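The protocol can be sketched as brute-force nearest-neighbor matching of flattened descriptors followed by a recognition-rate computation; the data layout (one row per descriptor, integer object labels) is an assumption for illustration.

```python
# Sketch of the evaluation protocol: nearest-neighbor matching of flattened descriptors
# and recognition rate (fraction of queries whose NN carries the correct label).
import numpy as np

def nearest_neighbor(database, queries):
    # database: (N, D), queries: (M, D); brute force, fine for a sketch.
    d2 = ((queries[:, None, :] - database[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def recognition_rate(database, db_labels, queries, query_labels):
    nn = nearest_neighbor(database, queries)
    return float((np.asarray(db_labels)[nn] == np.asarray(query_labels)).mean())
```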

When averaging over all real objects and multiple random trials, MV-HOG improves accuracy over SV-HOG (for both real and synthetic objects), and is only slightly worse than KeepALL-HOG, but at a fraction of the cost. R-HOG nearly matches the performance of KeepALL-HOG and improves on MV-HOG. Our implementation uses the range estimated by an RGB-D sensor, so it should be considered an upper bound of performance, but it only uses a subset of keyframes per object, so as to keep the max-out cost in check. Run-time cost is reported in Sect. 5.4.

The fact that MV-HOG and R-HOG perform similarly indicates that the training sequences in the dataset are sufficiently exciting, as the given views are informationally equivalent to the reconstruction. This would not be the case for short-baseline video, where the marginalized descriptor R-HOG can generate synthetic vantage points not represented in the training sequence, whereas MV-HOG would fail to capture a representative sample of the class-conditional distribution (Fig. 11).

Figure 5: Comparison. Recognition rate for each of the four descriptors and each test object in the real dataset (abscissa). The average accuracy score is shown in the legend, and distributions of recognition rate are shown on the right side. MV-HOG and R-HOG perform comparably to KeepALL-HOG, but at a cost comparable to SV-HOG.
Figure 6: Comparison (real dataset, one of the two patch sizes). MV-HOG outperforms SV-HOG in all cases and is slightly worse than the KeepALL strategy, but the latter induces significant computational cost at decision time. R-HOG is generated using a reconstruction computed from a subset of the frames in the training video. Its performance is comparable to MV-HOG and KeepALL-HOG, but slightly inferior due to the inaccuracy of the reconstruction. Unsurprisingly, where the reconstruction improves, so does overall performance; this is visible in Fig. 8, where the reconstruction from synthetic data is more accurate.
Figure 7: Performance of various descriptors on the real dataset, as in Fig. 6 but for the other patch size. Although the numerical scores differ, the relative performance rankings are largely unchanged.
Figure 8: Comparison on the synthetic dataset, for one of the two patch sizes. As on the real dataset, MV-HOG outperforms SV-HOG in all cases and is similar to KeepALL-HOG at a fraction of the computational cost. R-HOG is generated using a subset of the frames in the training video. Unlike Fig. 6, the reconstruction from synthetic data is more accurate, thus boosting performance of both the marginalized and max-out descriptors.
Figure 9: Comparison on the synthetic dataset, for the other patch size. Performance is qualitatively similar to Fig. 8, although the numerical scores differ.

Although the numerical scores change as we change the support regions (we show the extrema of the range of patch sizes used in our experiments), the conclusion holds across experiments: the performance of MV-HOG exceeds that of SV-HOG computed on a random image in the sequence, and is comparable to KeepALL-HOG, despite requiring a fraction of the storage and computational cost. R-HOG performs comparably to MV-HOG so long as the reconstruction is accurate, and improves with the latter. This is visible in the comparison of performance on the real and synthetic datasets.

The numerical scores increase with the size of the patches, which is to be expected and may induce one to favor large patches, all the way to the entire image (i.e., computing a single, dense HOG descriptor). This, however, is not viable, as the conditions under which the descriptors approximate the class-conditional densities require the domain of the descriptor not to straddle an occlusion. Since occlusions cannot be determined in a single image, one is left with guessing a domain size that trades off the probability of straddling an occlusion against discriminative power. This issue can be addressed by computing the descriptors at multiple scales, or by “growing” the size of the domain during the matching process; both approaches have been well explored in the literature and are therefore not further elaborated here.

5.4 Time Complexity

At test time, all descriptors have the same complexity. KeepALL-HOG, on the other hand, needs every instance seen in the training sequence, so its storage complexity grows linearly with the number of training frames, in addition to the number of features stored. If evaluation of MV-HOG and R-HOG is done with max-out (Sec. 4), the search approaches the complexity of KeepALL-HOG. As a limit test, one can store the entire collection of (contrast-normalized) patches and then search the entire dataset at test time. This can be done in approximate form using approximate nearest neighbors, by preprocessing them into a search-efficient structure. Fig. 10 shows the training time using the fast library for approximate nearest neighbors (FLANN) vs. MV-HOG on a commodity PC with 8GB memory and a Xeon E3-1200 processor. MV-HOG scales well and is more memory-efficient, while KeepALL-HOG (trained using FLANN) requires more time and occupies more than 60% of the available memory. Another advantage of MV-HOG is that the descriptor can be updated incrementally, and does not require storing processed samples for updates. Fig. 5 shows that the performance loss of MV-HOG compared to KeepALL-HOG is relatively small.
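For reference, a minimal sketch of such an approximate-NN search over a KeepALL-style database using OpenCV's FLANN wrapper is shown below; the index and search parameters are illustrative.

```python
# Sketch of approximate NN search over a KeepALL-style database with OpenCV's FLANN
# wrapper (KD-tree index; parameter values are illustrative). Descriptors must be float32.
import numpy as np
import cv2

def build_matcher(trees=4, checks=32):
    index_params = dict(algorithm=1, trees=trees)      # FLANN_INDEX_KDTREE = 1
    search_params = dict(checks=checks)                # more checks -> higher precision, slower
    return cv2.FlannBasedMatcher(index_params, search_params)

def keepall_query(matcher, database, queries):
    # database: (N, D) float32, one row per (feature, frame); queries: (M, D) float32.
    matches = matcher.knnMatch(queries.astype(np.float32), database.astype(np.float32), k=1)
    return np.array([m[0].trainIdx for m in matches])  # index of the best stored instance
```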

Figure 10: Complexity Comparison. Time complexity as a function of the number of features. FLANN precision is set to a fixed value; higher precision would further increase the computational load.

5.5 Sample Sufficiency

MV-HOG relies on a sufficiently exciting sample from the class-conditional distribution being available in the training set. Clearly, if a sequence of identical patches is given (video with no motion), the descriptor will fail to capture the representative variability of images generated by the underlying scene; in this case, MV-HOG reduces to HOG. In Fig. 11 we explore the relation between performance gain and the excitation level of the training sequence. As a proxy for the latter, we measure the variance of the patch intensities about the mean patch. The right plot shows that the variance reaches its maximum when most frames are seen. We normalize the variance so that the maximum value corresponds to maximum excitation. The left plot shows that accuracy increases with excitation. The fact that accuracy does not saturate is due to sufficient excitation only being reachable asymptotically.
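The excitation proxy can be sketched as follows, assuming the Euclidean distance to the mean patch and normalization by the value computed on the full sequence:

```python
# Sketch of the sufficient-excitation proxy: variance of the tracked patches about their
# mean, normalized by its value over the full sequence (Euclidean distance assumed).
import numpy as np

def excitation(patches):
    stack = np.stack([p.astype(np.float64).ravel() for p in patches])
    return float(((stack - stack.mean(axis=0)) ** 2).sum(axis=1).mean())

def normalized_excitation(prefix_patches, all_patches, tiny=1e-12):
    # 1.0 corresponds to the excitation of the full training sequence.
    return excitation(prefix_patches) / (excitation(all_patches) + tiny)
```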

Figure 11: Sufficient excitation. Left: accuracy as a function of a proxy of sufficient excitation (see text). Right: excitation as a function of the number of frames. All results are averaged over multiple runs using randomly selected subsets of frames.

6 Discussion

By interpreting HOG as the probability density of sampled images, conditioned on the underlying scene, with nuisances marginalized, and observing that a single image does not afford proper marginalization, we have been able to extend it using nuisance distributions learned from multiple training samples. The result is a multi-view extension of HOG that has the same memory and run-time complexity as its single-view counterpart, but better trades off insensitivity to nuisance variability against discriminative power, as shown empirically.

Our method has several limitations: it is restricted to static (or slowly deforming) objects; it requires correspondence across multiple views to be assembled (although it reduces to standard HOG if only one image is available), and is therefore sensitive to the performance of the tracking (MV-HOG) or reconstruction (R-HOG) algorithm. The former also requires sufficient excitation conditions to be satisfied, and the latter requires sufficiently informative data for multi-view stereo to operate; if this is not the case, then by definition the resulting descriptor is insensitive to nuisance factors, but it is also, of course, uninformative, so this case is of no major concern. R-HOG also requires the camera to be calibrated, but for the same reason calibration errors matter little: what counts is not that the reconstruction be correct in the Euclidean sense, but that it yield consistent reprojections.

Our empirical evaluation of R-HOG yields a performance upper bound, as we use the reconstruction from a structured-light sensor rather than multi-view stereo. As the quality (and speed) of the latter improves, the difference between the two will shrink.

References

  • [1] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In Computer Vision and Pattern Recognition, 2012.
  • [2] L. Alvarez, F. Guichard, P. L. Lions, and J. M. Morel. Axioms and fundamental equations of image processing. Arch. Rational Mechanics, 123, 1993.
  • [3] A. Ayvaci and S. Soatto. Detachable object detection: Segmentation and depth ordering from short-baseline video. IEEE Trans. on Patt. Anal. and Mach. Intell., 34(10):1942–1951, 2012.
  • [4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In Computer Vision–ECCV 2006, pages 404–417. Springer, 2006.
  • [5] R. Bitmead. Persistence of excitation conditions and the convergence of adaptive schemes. IEEE Trans. on Information Theory, 1984.
  • [6] M. Brown, G. Hua, and S. Winder. Discriminative learning of local image descriptors. PAMI, IEEE Transactions on, 33(1):43–57, 2011.
  • [7] J. Bruna and S. Mallat. Classification with scattering operators. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2011.
  • [8] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: binary robust independent elementary features. In European Conference on Computer Vision. 2010.
  • [9] Vijay Chandrasekhar, Gabriel Takacs, David Chen, Sam Tsai, Radek Grzeszczuk, and Bernd Girod. Chog: Compressed histogram of gradients a low bit-rate feature descriptor. In IEEE Computer Vision and Pattern Recognition, 2009.
  • [10] Ken Chatfield, Victor Lempitsky, Andrea Vedaldi, and Andrew Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference, 2011.
  • [11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., 2005.
  • [12] Elisabetta Delponte, N Noceti, F Odone, and A Verri. The importance of continuous views for real-time 3d object recognition. In ICCV07 Workshop on 3D Representation for Recognition, 2007.
  • [13] O. D. Faugeras and R. Keriven. Variational principles, surface evolution pdes, level set methods and the stereo problem. INRIA TR, 3021:1–37, 1996.
  • [14] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
  • [15] G. Graber, T. Pock, and H. Bischof. Online 3d reconstruction using convex optimization. In 1st Workshop on Live Dense Reconstruction From Moving Cameras, ICCV 2011, 2011.
  • [16] Michael Grabner and Horst Bischof. Object recognition based on local feature trajectories. In 1st Cognitive Vision Workshop, 2005.
  • [17] U. Grenander and M. I. Miller. Representation of knowledge in complex systems. J. Roy. Statist. Soc. Ser. B, 56:549–603, 1994.
  • [18] F. V. Hundelshausen and R. Sukthankar. D-nets: Beyond patch-based image descriptors. In CVPR, 2012 IEEE Conference on, pages 2941–2948. IEEE, 2012.
  • [19] H. Jin, S. Soatto, and A. Yezzi. Multi-view stereo reconstruction of dense shape and complex appearance. Intl. J. of Comp. Vis., 63(3):175–189, 2005.
  • [20] E. Jones and S. Soatto. Visual-inertial navigation, localization and mapping: A scalable real-time large-scale approach. Intl. J. of Robotics Res., Apr. 2011.
  • [21] Yann LeCun, Fu Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, 2004.
  • [22] T. Lee and S. Soatto. Video-based descriptors for object recognition. Image and Vision Computing, 2011.
  • [23] Stefan Leutenegger, Margarita Chli, and Roland Y Siegwart. Brisk: Binary robust invariant scalable keypoints. In Computer Vision (ICCV), 2011 IEEE International Conference on, 2011.
  • [24] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2(60):91–110, 2004.
  • [25] B. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Int. J. Conf. on Artificial Intell., 1981.
  • [26] J. Meltzer, M. Yang, R. Gupta, and S. Soatto. Multiple view feature descriptors from image sequences via kernel principal component analysis. In Proc. of the Eur. Conf. on Comp. Vision, pages 215–227, May 2004.
  • [27] Roland Memisevic and Geoffrey E. Hinton. Learning to represent spatial transformations with factored higher-order Boltzmann machines. Neural Computation, 22(6):1473–1492, 2010.
  • [28] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaffalitzky, T. Kadir, and L. Van Gool. A comparison of affine region detectors. IJCV, 1(60):63–86, 2004.
  • [29] P. Moreels and P. Perona. Evaluation of features detectors and descriptors based on 3d objects. International Journal of Computer Vision, 73(3):263–284, 2007.
  • [30] T. Poggio. How the ventral stream should work. Technical report, Nature Precedings, 2011.
  • [31] Marc’Aurelio Ranzato, Fu Jie Huang, Y-L Boureau, and Yann LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • [32] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In Proc. of the Europ. Conf. on Comp. Vision, volume 1, pages 430–443, May 2006.
  • [33] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. Orb: an efficient alternative to sift or surf. In ICCV, 2011 IEEE International Conference on, pages 2564–2571. IEEE, 2011.
  • [34] S. Soatto. Steps Toward a Theory of Visual Information. Technical Report UCLA-CSD100028, September 2010. http://arxiv.org/abs/1110.2053.
  • [35] S. Soatto, A. J. Yezzi, and H. Jin. Tales of shape and radiance in multiview stereo. In Intl. Conf. on Comp. Vision, pages 974–981, October 2003.
  • [36] G. Sundaramoorthi, P. Petersen, V. S. Varadarajan, and S. Soatto. On the set of images modulo viewpoint and contrast changes. In Proc. IEEE Conf. on Comp. Vision and Pattern Recogn., June 2009.
  • [37] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of the Allerton Conf., 2000.
  • [38] E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense descriptor applied to wide baseline stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
  • [39] A. Vedaldi and B. Fulkerson. Vlfeat: An open and portable library of computer vision algorithms. In Proceedings of the international conference on Multimedia, pages 1469–1472. ACM, 2010.
  • [40] A. Vedaldi, H. Ling, and S. Soatto. Computer Vision: Recognition, Registration and Reconstructions, chapter Knowing a good feature when you see it: ground truth and methodology to evaluate local features for recognition. R, Cipolla, S. Battiato, G.-M. Farinella (Eds), 2010.
  • [41] A. Vedaldi and S. Soatto. Viewpoint induced deformation statistics and the design of viewpoint invariant features: singularities and occlusions. In Eur. Conf. on Comp. Vision (ECCV), pages II–360–373, 2006.
  • [42] G. S. Watson. Statistics on spheres. Wiley, 1983.
  • [43] S. Winder and M. Brown. Learning local image descriptors. In Computer Vision and Pattern Recognition, 2007. CVPR ’07. IEEE Conference on, pages 1–8. IEEE, 2007.