One of the most widely used results in signal processing is the convolution theorem, which states that convolution in the spatial domain is equivalent to multiplication in the frequency domain. Combined with the availability of Fast Fourier Transform algorithms[Cooley:MC:1965, Frigo:IEEE:2005], it reduces the complexity of what would be a quadratic-time operation to a nearly linear-time computation (linearithmic). This, together with the closely-related correlation theorem, have enabled efficient algorithms for applications in many domains, including audio analysis and synthesis [Allen:ASSP:1977, Moorer:IEEE:1977]
, pattern recognition and compression of images[Kumar:2005, Wallace:CACM:1991], symmetry detection in 2D images [Keller:TIP:2006] and 3D models [Kazhdan:SGP:2004], reconstruction of 3D surfaces [Schall:GMP:2006], inversion of the Radon transform for medical imaging [Kak:2001, Naterrer:SIAM:2001], and solving partial differential [Iorio:2001] and fluid dynamic equations [Kass:SIGGRAPH:1990, Stam:JGT:2001].
Despite the pervasiveness of convolution in signal processing, it has an inherent limitation: when convolving a signal with a filter, the filter remains fixed throughout the convolution, and cannot adapt to spatial information.
Early Work on Spatially-Varying Filters
A simple approach to allowing spatial variation is to limit the number of different filters that are allowed. For example, if differently-rotated versions of a filter are required, it is possible to quantize the rotation angle, compute a (relatively) small number of standard convolutions, and select the closest-matching rotation at each pixel.
Motivated by the early research of Knutsson et al. on non-stationary anisotropic filtering Knutsson:TOC:1983, Freeman and Adelson Freeman:PAMI:1991 investigated the idea of steerable filters. The essential observation is that, for angularly band-limited filters, the results of spatially adaptive filtering with arbitrary per-pixel rotation can be computed from per-pixel linear combinations of a finite set of convolutions with rotated filters. Given appropriate conditions on the filter, different transformation groups can be accommodated in this framework [Freeman:PAMI:1991, Simoncelli:IP:1996, Teo:TPAMI:1999].
In this work, we provide a representation-theoretic interpretation of steerable filters, and explore generalizations enabled by this interpretation. Our key idea is to focus not on the properties of filters that allow “steerability,” but rather on the structure of the group from which transformations are drawn. Specifically, we show how the ability to perform efficient function steering is related to the decomposition of the space of filters into irreducible representations of the transformation group. The analysis permits us to answer key questions such as:
How many convolutions are required for spatially-adaptive filtering, given a particular transformation group and a particular filter?
Given a fixed budget of convolutions, what is the best way to approximate the spatially-adaptive filtering?
We are able to answer these questions not only for 2D rotation, but for a variety of transformations including scaling and non-commutative groups such as 3D rotation. Moreover, we show that it is possible to obtain significantly better approximation results than previous methods that attempt to discretize the space of transformations.
One of our main generalizations is to apply our results to both the convolution and correlation operations, for which the effect of a spatially varying filter has different natural interpretations. Extended convolution is naturally interpreted as a scattering operation, in which the influence of each point in the signal is distributed according to the transformed filter. In contrast, extended correlation has a natural interpretation as a gathering operation, in which each output pixel is the linear combination of input pixels weighted by the locally-transformed filter. We show that the two approaches are appropriate for different applications, in particular demonstrating that the spatially adaptive voting of the Generalized Hough Transform [Ballard:PR:1981] may be implemented using extended convolution with a specially-designed filter.
Figure 1 shows several applications of spatially adaptive filtering that are enabled using our extended correlation and convolution. At left, we show pattern matching that locates all rotated instances of the pattern (top) in the target image (far left). At center we demonstrate an image manipulation in which gradient directions are used to place anisotropic brushstrokes across the image. At right, we show the effect of denoising an image with a filter whose scale is controlled by gradient magnitude, which yields edge-preserving smoothing similar to anisotropic diffusion and bilateral filtering [Perona:PAMI:1990, Weickert:SSTCV:1997, Tomasi:ICCV:1998, Slate:TOG:2006]. For the first two applications, we use spatially adaptive scattering to detect the pattern and to distribute the brushstrokes, respectively. For the third, we use our generalization of function steering to support spatially adaptive (data-dependent) scaling of a smoothing filter.
Our approach is to leverage the fact that linear transformations of the filter can be realized as invertible matrices acting on a high-dimensional vector space (the space of functions, corresponding to filters). Choosing a basis for the space of functions, the transformation associated to a spatial location can be expressed in matrix form and spatially adaptive filtering can be implemented as a sum of standard convolutions over the matrix entries (Section 3). When the basis is chosen to conform to the irreducible representations of the transformation group, the matrix becomes block-diagonal with repeating diagonal blocks (Section 4), thereby reducing the total number of convolutions that need to be performed.
The generality of the method makes it capable of supporting a number of image processing and image analysis operations. In this paper, we highlight its versatility by describing several of these applications, including the use of the extended convolution in two different types of pattern matching applications (Sections 5, 7, and 6) and three different types of image filtering applications (Section 8). Additionally, we provide a discussion of how filter steering can be generalized to three dimensions, where the group of rotations is no longer commutative (Section 9).
2 Defining Adaptive Filtering
We begin by formalizing our definitions of spatially adaptive filtering. Following the nomenclature of stationary signal processing, we consider both the correlation and convolution of a signal with a filter. Though these operators are identical in the stationary case, up to reflection of the filter through the origin, they define different notions of spatially adaptive filtering.
For both we assume that we are given a spatial function , a filter , and a transformation field that defines how the filter is to be transformed at each point in the domain.
The correlation of with is defined at a point as:
Using the notation to denote the operator that translates functions by the vector :
we obtain an expression for the correlation as:
That is, the correlation of with can be thought of as a gathering operation in which the value at a point is defined by first translating to the point and then accumulating the values of the conjugate of , weighted by the values of .
We generalize this to spatially adaptive filtering, defining the value of the extended correlation at a point as the value obtained by first applying the transformation to the filter, translating the transformed filter to , and then accumulating the conjugated values of the transformed , weighted by :
Note that, if the transformations are linear, extended correlation maintains standard correlation’s properties of being linear in the signal and conjugate-linear in the filter.
Similarly, convolution can be expressed as:
In this case, the convolution of with can be thought of as a scattering operation, defined by iterating over all points in the domain, translating to each point , and distributing the values of weighted by the value of .
Again, we generalize this to spatially adaptive filtering, defining the extended convolution by iterating over all points in the domain, applying the transformation to the filter , translating the transformed filter to , and then distributing the values of the transformed , weighted by :
Note that, as with standard convolution, the extended convolution is linear in both the signal and the filter (if the transformations are linear).
2.3 A Theoretical Distinction
While similar in the case of stationary filters, these operators give rise to different types of image processing techniques in the context of spatially adaptive filtering. This distinction becomes evident if we consider the response of filtering a signal that is the delta function centered at the origin.
In the case of extended convolution, the response is the function , corresponding to the transformation of the filter by the transformation defined at the origin. In the case of the extended correlation, the response is more complicated: the value at point comes from the conjugate of the filter at the point(s) . Since the transformation field is changing, this implies that some of the values of the filter can be represented by multiple positions in the response, while others might not be represented at all.
Beyond thinking of “gathering” and “scattering,” another way of understanding the distinction between how correlation and convolution extend to varying filters is by considering the dependency of the transformation on the variables. In extended correlation, the filter’s transformation depends on the spatial variable of the result. In contrast, in extended convolution the transformation depends on the variable of integration. This provides another way of deciding which of the operations will be appropriate for a given problem.
2.4 A Practical Distinction
The distinction between the two is also evidenced in a more practical setting, if one compares using steerable filters [Freeman:PAMI:1991] with using the Generalized Hough Transform [Ballard:PR:1981] for pattern detection where a local frame can be assigned to each point.
For steerable filters, pattern detection is performed using extended correlation. The filter corresponds to an aligned pattern template, and detection is performed by iterating over the pixels in the image, aligning the filter to the local frame, and gathering the correlation into the pixel. The pixel with the largest correlation value is then identified as the pattern center.
Generalized Hough Transform
For the Generalized Hough Transform, pattern detection is performed using extended convolution. The filter corresponds to candidate offsets for the pattern center and detection is performed by iterating over the pixels in the image, aligning the filter to the local frame, and distributing votes into the candidate centers, weighted by the confidence that the current pixel lies on the pattern. The pixel receiving the largest number of votes is then identified as the pattern center.
3 A First Pass at Adaptive Filtering
In this section, we show that for linear transformations, extended correlations and convolutions can be performed by summing the results of several standard convolutions. If we do not restrict the space of possible transformations, little simplification is possible (either mathematically or in algorithm design) to the brute-force computation implied by Equations 1 and 2. Therefore, we restrict our filter functions to lie within an -dimensional space , spanned by (possibly complex-valued) basis functions . Moreover, we restrict the transformations to act linearly on functions, meaning that they can be represented with matrices (possibly with complex entries). This permits significant simplification.
We expand the filter as , and write each transformation as a matrix with entries . Thus we can express the transformation of by as the linear combination:
This, in turn, gives an expression for extended correlation as:
which can be obtained by taking the linear combination of standard correlations. Similarly, we get an expression for extended convolution as:
which can also be obtained by taking the linear combination of standard convolutions.
Note that both equations can be further simplified to reduce the total number of standard correlations (resp. convolutions) by leveraging the linearity of the correlation (resp. convolution) operator:
As a simple example, we consider the case in which we would like to correlate a signal with an adaptively rotating filter which is supported within the unit disk and has values:
In this case, rotating by an angle amounts to multiplying the filter by . Thus, the extended correlation at point can be computed by multiplying the filter by , where is the angle of rotation at , and then evaluating the correlation with the transformed filter at . However, since correlation is conjugate-linear in the filter, the value of the extended correlation can also be obtained by first performing a correlation of with the un-transformed , and only then multiplying the result at point by .
Next, we consider a more complicated example in which the filter resides within a three-dimensional space of functions, , with the basis defined as:
In this case, rotating by an angle amounts to multiplying the first component of the filter by , the second by , and the third by so the previous approach will not work. However, by linearity, the extended correlation with can be expressed as the sum of the separate extended correlations with , , and . Each of these can each be obtained by computing the standard correlations with , , and and then multiplying the values at point by , , and respectively. Thus, we can obtain the extended correlation by performing separate correlations and taking their linear combination.
With respect to the notation in Equation 3, rotating the filter by an angle of multiplies the coefficients by:
Thus, the extended correlation with can be computed by computing the standard correlations with the functions , multiplying the results of these correlations by the functions , and then taking the sum. However, since the functions are uniformly zero whenever , the standard correlations with become unnecessary for , and the extended correlation can be expressed using only standard correlations.
Example , revisited
Though the previous example shows that the extended correlation with can be computed efficiently, we now show that the efficiency is tied to the way in which we factored the filter. In particular, we show that if the wrong factorization is chosen, the cost of computing the extended correlation can increase. Consider the same filter as above, but now expressed as the linear combination of a different basis as , with:
Rotating such a filter by an angle of multiplies the coefficients by:
Thus, the extended correlation with can be computed by computing the standard correlations with the functions , multiplying the results of these correlations by the functions respectively, and then taking the sum. In this case, since the matrix entries are all non-zero, all standard correlations are required.
Of course, the above discussion was purely a strawman: using the grouping of terms in Equations 5 and 6, it is possible to avoid the need for correlations. However, focusing on the structure of the matrix and using the tools of representation theory to find a basis in which it has a particularly simple structure, we can bring the computational requirements even below correlations or convolutions.
4 Choosing a Basis
As hinted at in the previous section, the efficiency of the implementation of extended correlation (resp. convolution) is tied to the choice of basis. In this section we make this explicit by showing that by choosing the basis of functions appropriately, we obtain matrices that are sparse (with many zero entries) and have repeated elements. Each zero and repetition corresponds to a standard correlation (resp. convolution) that does not need to be computed.
We begin by considering the group of planar rotations. We show that there exists a basis of functions in which the transformation matrix becomes sparse, specifically diagonal. We then use results from representation theory to generalize this, and to establish limits on how sparse the matrix can be made. We conclude this section with a detailed discussion of the relation of our work to earlier work in steerable functions.
To motivate the result that the choice of basis is significant to the structure of the matrix , consider planar rotations and their effect on 2D functions. In this case, the structure of is most easily exposed by considering the filter in polar coordinates. In particular, rotations preserve radius: is necessarily mapped to . Thus, in polar coordinates the only nonzero entries in occur in blocks around the diagonal, one block for each . Starting with an -pixel image, transformation into polar coordinates will give a function sampled at radii and angles. Hence the nonzero entries in will occupy blocks of size .
To make even more sparse, we consider representing the functions at each radius in the frequency domain, rather than the spatial domain. That is, within a radius the basis functions are proportional to , for a fixed . Applying a rotation to such a single-frequency function preserves that frequency; it is, in fact, expressible by multiplying the function by , where is the angle of the rotation. Therefore, in this basis has been simplified to purely diagonal (with complex entries).
In moving from an arbitrary basis to polar and polar/Fourier bases, we have reduced the number of nonzero entries in from to to . Correspondingly, the number of standard correlations (resp. convolutions) that need to be computed is also reduced.
There is one more reduction we may obtain by considering repeated entries in . In particular, we observe that all the diagonal entries corresponding to a particular frequency , across different radii, will be the same: . Although the rotation angle may vary across the image, all of these entries will vary in lock-step, and the associated diagonal entries will be identical. Thus, we may perform all such correlations (resp. convolutions) at once by correlating (resp. convolving) with the sum of all , where has angular frequency . As a result, the number of distinct correlations (resp. convolutions) is reduced to .
Summarizing, to compute the extended correlation of a 2D signal and rotation field with filter :
We first decompose as the sum of functions with differing angular frequencies:
This can be done, for example, by first expressing in polar coordinates, and then running the 1D FFT at each radius to get the different frequency coefficients. 
Next, we compute the standard correlations of the signal with the functions :
This can be done by first evaluating the function on a regular grid and then using the 2D Fast Fourier Transform to perform the correlation. 
Finally, we take the linear combination of the correlation results:
weighting the contribution of the -th correlation to the pixel by the conjugate of the -th power of the unit complex number corresponding to the rotation at . 
The extended convolution can be implemented similarly, but in this case we need to pre-multiply the signal:
and only then sum the convolutions of with .
In implementing the extended correlation for rotations we have taken advantage of the fact that the space of filters could be expressed as the direct-sum of subspaces that (1) are fixed under rotation, and (2) could be grouped into subspaces on which rotations act in a similar manner.
The decomposition of a space of functions into such subspaces is a central task of representation theory, which tells us that any vector-space , acted upon by a group , can be decomposed into a sum of subspaces (e.g. [Serre:1977:SV]):
where is the frequency, indexing the subspace fixed under the action of the group, and is the multiplicity of the subspace. While we are only guaranteed that the subspace is one-dimensional when the group is commutative, the subspace is guaranteed to be as small as possible (i.e. irreducible) so that cannot be decomposed further into subspaces fixed under the action of .
Using the decomposition theorem, we know that if represents the space of filters and the transformations belong to a group , then we can decompose into irreducible representations of :
where indexes the sub-representation and, for a fixed , the sub-representations are all isomorphic.
Referring back to the discussion of rotation in Section 4.1, the group acting on the filters is (the group of rotations in the plane) and the sub-representations are just functions of constant radius and constant angular frequency.
4.2.1 Block-Diagonal Matrix
Using the decomposition in Equation 7, we can choose a basis for by choosing a basis for each subspace . Since for fixed the are all isomorphic, we can denote their dimension by and represent the basis for by:
Additionally, since is a sub-representation, we know that maps back into itself. This implies that we can represent the restriction of to by an matrix with -th entry . Thus, given , we can express the transformation of by as:
corresponding to a block-diagonal representation of by a matrix with blocks, where the -th block is of size . As before, this gives:
Using this decomposition, evaluating the extended correlation now requires the computation of standard correlations. Note that, since , the number of linear combinations will be smaller than if the space contains more than one irreducible representation.
4.2.2 Multiplicity of Representations
We further improve the efficiency of the extended correlation by using the multiplicity of the representations. Since the spaces correspond to the same representation, we can choose bases for them such that the matrix entries have the property that for all . As a result, the extended correlation simplifies to:
Thus, we only need to perform one standard correlation for each set of isomorphic representations, further reducing the number of standard correlations to .
While the previous discussion has focused on the extended correlation, an analogous argument shows that the same decomposition of the space of filters results in an implementation of the extended convolution that requires standard convolutions.
In practice, we approximate the extended correlation (resp. convolution) by only summing the contribution from the first frequencies, for some constant . This further reduces the number of standard correlations (resp. convolutions) to and is equivalent to band-limiting the filter prior to the computation of the extended convolution. For example, when rotating images sampled on a regular grid with pixels, this can reduce the complexity of extended correlation to by band-limiting the filter’s angular components.
4.4 Relation to Steerable Filters
Using the extended correlation, the method described above can be used to perform efficient steerable filtering. While the implementation differs from the one described in [Freeman:PAMI:1991], the complexity is identical, with both implementations running in time for images and filters with maximal angular frequency .
We briefly review Freeman and Adelson’s implementation of steerable filtering and discuss how it fits into our representation-theoretic framework. We defer the discussion of the limitations of the earlier implementation in the context of higher-dimensional steering to Section 9.
In the traditional implementation of steerable filters, the filter is used to define the steering basis. (Note that the original work of Freeman and Adelson Freeman:PAMI:1991 also proposes, but does not use, an interpretation based on alternative basis functions.) Specifically, when the filter is angularly band-limited with frequency , the steerable filtering is performed using the functions , where the function is the rotation of by an angle of .
Because the span of these functions is closed under rotation and because it contains the filter , the functions can be used for performing the extended correlation. In particular, one can compute the matrix describing how the rotation of a basis function can be expressed as a linear combination of the basis, and then take the linear combinations of the standard correlations of the signal with the functions weighted by the matrix entries .
While this interpretation of steerable filtering within the context of our representation-theoretic framework hints at an implementation requiring standard correlations (since the entries are non-zero) this is not actually the case. What makes the classical implementation of steerable filtering efficient is that the filter is one of the basis vectors, , so the decomposition of the filter as , has and for all . Thus, while all matrix entries are non-zero, only of the functions are non-zero, so the steerable filtering only requires that standard correlations be performed.
5 Application to Pattern Detection
We apply extended convolution to detect instances of a pattern within an image, even if the pattern occurs at different orientations. Recall that this approach may be thought of as an instance of the generalized Hough transform, such that image pixels vote for locations consistent with the presence of the pattern. Figure 1, left, shows an example application in which we search for instances of a pattern in Escher’s Heaven and Hell. In this example, all three rotated versions of the pattern give a high response.
5.1 Defining the Filter and Transformation Field
Our strategy will be to operate on the gradients of both the pattern and the target image . In particular, we take the signal to be
and the transformation field to be rotation by the angle , where
and atan2 is the usual two-argument arctangent function.
To design the filter , we consider what will happen during the extended convolution when we place at some pixel . The values of will be scattered, with weight proportional to the gradient magnitude at ; in other words, the filter will have its greatest effect at edges in the target image. Now, if were the only point with non-zero gradient magnitude, the optimal filter would simply be the distribution that scatters all of its mass to the single point – the pattern center relative to the coordinate frame at . When there are multiple points with non-zero gradient magnitude, we set to be the “consensus filter”, obtained by taking the linear combination of the different distributions, with weights given by the gradient magnitudes.
In practice, the filter is itself constructed by a voting operation. For example, consider Figure 2, which shows an example of constructing the optimal filter (right) for an ‘A’ pattern (left) with respect to its gradient field (middle) at the point . For each point in the vicinity of the pattern’s center, the gradient determines both the position of the bin and the weight of the vote with which contributes to the filter. For example, since the gradient at is interpreted as the -axis of a frame centered at , the position of relative to this frame will have negative - and -coordinates. The gradient at has a large magnitude, so the point contributes a large vote to bin . The keypoint has positive - and -coordinates relative to the frame at but since the gradient is small, it contributes a lesser vote to bin .
Iterating over all points in the neighborhood of the pattern’s center, we obtain the filter shown on the right. While the filter does not visually resemble the initial pattern, several properties of the pattern can be identified. For example, since the gradients along the outer left and right sides of the ‘A’ tend to be outward facing, points on these edges cast votes into bins with negative -coordinates, corresponding to the two vertical seams on the left side of the filter. Similarly, the gradients on the inner edges point inwards, producing the small wing-like structures on the right side of the filter.
Formally, we define the filter as:
where the transformation field is defined as rotation by the gradient directions of the pattern , and is the unit impulse, or Dirac delta function. This encapsulates the voting strategy described above. In the appendix, we show that the filter defined in this manner optimizes the response of the extended convolution at the origin.
5.2 Discussion: Band-Limiting Revisited
As we have seen, the extended convolution of an image with a rotating filter can be computed in time by computing standard convolutions. Though this is faster than the brute force approach, a similar form of pattern matching could be implemented in by generating rotations of the filter, performing a convolution of the image with each one, and setting the value of the response to be the maximum of the responses over all rotations.
The difference becomes apparent when we consider limiting the number of convolutions. As an example, Figure 3, top, shows the results of extended convolution-based pattern detection using low order frequencies. The band-limiting in the angular component gives blurred versions of the match-strength image, with the amount of blur reduced as the number of convolutions is increased. In contrast, convolving the image with multiple rotations of the pattern, as shown in the middle row, yields sub-sampled approximations to the response image, and more standard convolutions are required in order to reliably find all instances of the pattern. We can actually make a specific statement: the best way (in the least-squares sense, averaged over all possible rotations) to approximate the ideal extended convolution with a specific number of standard convolutions is to use the ones corresponding to the largest
: the most important projections onto the rotational-Fourier basis. Since in practice the lowest frequencies have the highest coefficients, simply using the lowest few bases is a useful heuristic.
6 Application to Contour Matching
As a second test, we apply extended convolution to the problem of matching complementary planar curves. To generate the signal and the rotation field, we rasterize the contour lines and their normals into a regular grid. We further convolve both the signal and the vector field with a narrow Gaussian to extend the support of the functions. Using these, we can define filters for queries and compute extended convolutions.
Our contour matching algorithm differs from standard pattern matching in two ways. First, we are searching for complementary shapes, not matching ones. Using the fact that complementary shapes have similar local signals, but oppositely oriented gradient fields, we define the filter using the negative of the query contour’s gradient field. Additionally, after finding the optimal aligning translation using extended convolution, we perform an angular correlation to find the rotation minimizing the -difference between query and target.
An example search is shown in Figure 4. The image on the left shows the query contour, with a black circle indicating the region of interest. The image on the right shows the top nine candidate matches returned using the extended convolution, sorted by retrieval rank from left to right and top to bottom. Blue dots show the best matching position as returned by the extended convolution, and the complete transformation is visualized by showing the registered position of the query in the coordinate system of the target. Note that even for pairs of contours that do not match, our algorithm still finds meaningful candidate alignments.
To evaluate our matching approach, we applied it to the contours of fragmented objects. Reconstructing broken artifacts from a large collection of fragments is a labor-intensive task in archeology and other fields, and the problem has motivated recent research in pattern recognition and computer graphics [McBride:2003, Huang:2006, Brown:2008:ASF]. As a basis for our experiments, we used the ceramic-3 test dataset that is widely distributed by Leitão and Stolfi Stolfi:TPAMI:2002. This dataset consists of 112 two-dimensional fragment contours that were generated by fracturing five ceramic tiles, and then digitizing the pieces on a flatbed scanner and extracting their boundary outlines.
Running the contour matching algorithm on each pair of fragments produces a sorted list of the top candidate matching configurations for each fragment pair. These candidate matches are reviewed to verify if they correspond to a true match. By using the same dataset, we can directly compare our algorithm’s performance against the multiscale dynamic programming sequence matching algorithm and results described in [Stolfi:TPAMI:2002]. We used the same contour sampling resolution as their finest multiresolution scale: fragments thus ranged from 690 to 4660 samples per contour. The numbers of true matches found within the first ranked candidate matches found by the two algorithms are compared in Figure 5.
The extended convolution matching algorithm outperforms the multiscale sequence matching algorithm, and finds 72% more correct matches among the top-ranked 277 candidates. At this level of matching precision, our algorithm requires 6 hours to process the entire dataset of 112 fragments on a desktop PC (3.2 GHz Pentium 4). By reducing the sampling rate of the contour line rasterization grid or increasing the step size along the contours between extended convolution queries, the running time can be reduced significantly while trading off some search precision. For collections with a large number of fragments, the matching algorithm can easily be executed in parallel on subsets of the fragment pairs.
7 Application to Image Matching
An image feature descriptor can be constructed from the discretization of the optimal filter , as defined in Equation (10), relative to the signal and frame field in Equations (8) and (9). We call this descriptor the Extended Convolution Descriptor (ECD).
We compare the ECD image descriptor against SIFT in the context of feature matching on a challenging, large-scale dataset. We choose to compare against SIFT for several reasons. Foremost, SIFT has stood the test of time. Despite its introduction over two decades ago, SIFT is arguably the premier detection and description pipeline and remains widely used across a number of fields, including robotics, vision, and medical imaging. Competing pipelines have generally emphasized computational efficiency and have yet to definitively outperform SIFT in terms of discriminative power and robustness[karami2017image, tareen2018comparative].
The advent of deep learning in imaging and vision has coincided with the introduction of a number of contemporaneous learned descriptors which have been shown to significantly outperform SIFT and other traditional methods in certain applications[mishchuk2017working, he2018local, luo2018geodesc, Zhang:2019:LLD]. However, the performance of learned descriptors is often domain-dependent and “deterministic" descriptors such as SIFT can provide either comparable or superior performance in specialized domains that learned descriptors are not specifically designed to handle[Zhang:2019:HLU]. More generally, “classical" methods for image alignment and 3D reconstruction, e.g. SIFT + RANSAC, may still outperform state-of-the-art learned approaches with the proper settings[schonberger2017comparative, jin2020IMW]. It is also worth noting that the extended convolution can be written as a convolution on (the group of planar rigid motions) by introducing Dirac delta functions to restrict contributions of integral over to those orientations defined by transformation field. Similar approaches have been shown to be an effective basis for image template matching [kyatkin1999computation, kyatkin1999pattern, chirikjian2016harmonic].
The scope of this work is limited to local image descriptors – we do not consider the related problem of feature detection. The SIFT pipeline integrates both feature detection and description in the sense that keypoints are chosen based on the distinctive potential of the surrounding area. As we seek to compare against the SIFT descriptor directly, we perform two sets of experiments. In the first, we replace the SIFT descriptor with ECD within the SIFT pipeline to compare practical effectiveness. The goal of the second experiment is to more directly evaluate our contribution with respect to the design of rotationally invariant descriptors. Specifically, we seek an answer to the following question: By having all points in the local region encode the keypoint relative to their own frames, do we produce a more robust and discriminating descriptor than one constructed relative to the keypoint’s frame?
7.1 Comparison Regime
In both sets of experiments, we evaluate ECD and SIFT in the context of descriptor matching using the publicly available photo-tourism dataset associated with the 2020 CVPR Image Matching Workshop[jin2020IMW]. The dataset consists of collections of images of international landmarks captured in a wide range of conditions using different devices. As such, we use the dataset to simultaneously evaluate descriptiveness and robustness. The dataset also includes 3D ground-truth information in the form of the camera poses and depth maps corresponding to each image. In all of our experiments, we use the implementation of SIFT in the OpenCV library[opencv_library] with the default parameters.
Due to the large size of the dataset, we restrict our evaluations to the image pools corresponding to six landmarks:
westminster_abbey, which we believe reflect the diversity of the dataset as a whole. Experiments are performed by evaluating the performance of the descriptors in matching a set of scene images to a smaller set of models.
For each landmark, five model images are chosen and removed from the pool. These images are picked such that their subjects overlap but differ significantly in terms of viewpoint and image quality. The scenes are those images in the remainder of the pool that best match the models.
Specifically, SIFT keypoints are computed for all model in each pool. Keypoints without a valid depth measure are discarded. For each landmark, images in the pool are assigned a score based on the number of keypoints that are determined to correspond to at least one keypoint from the five models originally drawn from the pool.
Keypoints are considered to be in correspondence if the distance between their associated 3D points is less than a threshold value . For each of the five models, all pixels with valid depth are projected into 3D using the ground-truth depth maps and camera poses. These points are used to compute a rough triangulation corresponding to the surface of the landmark. As in [guo2016comprehensive], we define the threshold value relative to the area of the mesh, ,
The top 15 images with the highest score from each pool are chosen as the scenes. The scaling factor in the value of was determined empirically; it provides a good balance between keypoint distinctiveness and ensuring each scene contributes approximately 1000 keypoints to the total.
Comparisons within the SIFT Pipeline
In our first experiment, we perform comparisons with keypoints selected using the SIFT keypoint detector to gauge ECD’s practical effectiveness. For each model, we compute SIFT keypoints and sort them in descending order by “contrast”[lowe2004distinctive]. Of these, we retain the first 1000 distinct keypoints having a valid depth measure, preventing models with relatively large numbers of SIFT keypoints from having an outsize influence in our comparisons.
For each scene, we compute SIFT keypoints and discard those without valid depth. Those that remain are sorted by contrast and only the first 1000 distinct keypoints that match at least one keypoint from the five corresponding models are retained.
Next, ECD and SIFT descriptors are computed at each keypoint for both the models and the scenes. Both descriptors are computed at the location in the Gaussian pyramid assigned to the keypoint in the SIFT pipeline. The support radius, , of the SIFT descriptor is determined by the scale associated with the keypoint in addition to the number of bins used in the histogram. The ECD descriptor uses more bins and we find that it generally exhibits better performance using a support radius 2.5 times larger that of the corresponding SIFT descriptor.
Comparisons using Randomized Keypoints
Our second set of experiments are performed in the same manner using the same collection of models and scenes. The only difference is that the keypoints are selected at random so as to avoid the influence of the SIFT feature detection algorithm on the results. Specifically, for each model, 1000 keypoints are randomly chosen out of the collection of points that have a valid depth measure. Then, for each scene, we randomly select keypoints with valid depth and keep only those that correspond to at least one keypoint from the five associated images in the models. This process is iterated until 1000 such points are obtained.
We use the ground-truth 3D information to provide an idealized estimation of the scale. That is, for a keypoint, the associated 3D point is first translated by in a direction perpendicular to the camera’s view direction and then projected into the image plane. For both descriptors, the distance between the 2D keypoint and the projected offset defines the support radius.
Evaluating Matching Performance
In both sets of experiments, we evaluate the matching performance of the SIFT and ECD descriptors by computing precision-recall curves for all keypoints in the scenes, an approach that has been demonstrated to be well-suited to this task[ke2004pcasift, mikolajczyk2005GLOH]. Given a scene keypoint, and corresponding descriptor , all model keypoints are sorted based on the descriptor distance, giving with
Some keypoints may be assigned multiple descriptors in the SIFT pipeline depending on the number of peaks in the local orientation histogram. In such cases we use the minimal distance over all of the keypoint’s descriptors.
Scene and model keypoints are considered to match if they correspond to the same landmark and the distance between their 3D positions is less than the threshold defined in Equation (11). We define to be the set of all model keypoints that are valid matches with . Following [shilane2004princeton], the precision and recall assigned to are defined as functions of the top model keypoints,
7.2 Results and Discussion
We aggregate the results by computing the mean precision and recall across all keypoints in the scenes. For the first set of experiments, we compute three curves for each descriptor corresponding to the top 200, 500, and 1000 keypoints in each scene as ranked by contrast. The resulting precision-recall curves are shown in Figure5(a). For the second set, we compute a single mean curve for each descriptor using all 1000 keypoints in each scene; these are shown in Figure 5(b).
Overall we see that ECD performs better than SIFT in our evaluations, though the difference is more pronounced when keypoint detection and scale estimation are decoupled from the SIFT pipeline as in our second set of experiments. In the former case, the precision of each descriptor decreases as the number of scene keypoints increases. This is not surprising as each successive keypoint added is of lower quality in terms of potential distinctiveness. Figure 7 shows a comparison of the valid matches found using the SIFT and ECD descriptors between two pairs of scene (top) and model (bottom) images in the randomized keypoint paradigm. We find that ECD tends to find slightly more valid matches than SIFT in less challenging scenarios, as in the case on the left where the scene and model image differ mainly in terms of a small change in the 3D position of the cameras. However, both descriptors perform similarly in more challenging scenarios as shown on the right.
We do not argue that the results presented here show that the ECD descriptor is superior. Rather, they demonstrate that the ECD descriptor is distinctive, repeatable, and robust in its own right and has the potential to be an effective tool in challenging image matching scenarios. However, it is important to note that effective implementations of the ECD descriptor may come at an increased cost. In our experiments, we find that ECD performs best with a descriptor radius of , which translates to a descriptor size of elements, roughly two times the elements in the standard implementation of SIFT.
The run-time of our proof-of-concept implementation of ECD does not compare favorably to the highly optimized implementation of SIFT in OpenCV. (SIFT runs up to a factor of ten times faster.) However, both approaches have the same complexity, requiring similar local voting operations to compute the descriptor, and we believe that ECD can be optimized in the future to be more competitive.
8 Application to Image Filtering
We apply extended convolution to the problem of adaptive image filtering, associating a scale or rotation to every pixel. For example, Figure 8 shows adaptive smoothing of a market-stall image (left) with a Gaussian filter transformed according to a checkerboard scaling mask (center). White and black tiles in the mask correspond to wide and narrow filters, respectively.
Guided by the principles outlined in section 4, we can decompose any filter into functions of the form
Note the similarity to the case of rotation, with the Fourier transform applied to instead of to . For the radially-symmetric Gaussian filter, of course, is just a constant and the implementation becomes even simpler.
The result of the extended convolution is shown at right, exhibiting the desired smoothing effect with points overlapped by the white regions in the transformation field blurred out and points overlapped by the dark region retaining sharp details.
A similar technique is used in Figure 1 (right), but with the scaling field obtained from the gradient magnitudes of the original image. As a result, the smoothing filter is scaled down at strong edges, preserving the detail near the boundaries and smoothing away from them, effectively acting similar to a bilateral filter [Tomasi:ICCV:1998, Slate:TOG:2006].
To apply the extended convolution to image smoothing, we need to modify the output of the extended convolution so that the value at every point is defined as the weighted average of its neighbors. Treating the value as the weighted sum of contributions from the neighbors of , we can do this by dividing the value at by the total sum of weights. That is, if we denote by the extended convolution with a signal whose value is everywhere, the adaptively smoothed signal can be defined as:
To localize the smoothing, we modify the signal. Specifically, using the fact that scaling the filter by scales its integral by , we normalize the signal , setting:
so that the extended convolution distributes the value to its neighbors, using a unit-integral distribution. Note that this modification is necessary only when the transformation field includes scaling; it is not needed when consists of rotations.
As an example, Figure 9, left, shows the results of the extended convolution for the market stall signal and checkerboard transformation mask, without a division by
. Because the filters in the black regions in the mask have smaller variance, the corresponding regions in the image accumulate less contribution and are darker.
Dividing by , we obtain Figure 9, right. The pixels now have the correct luminance, but because the filters used in the light portions are not normalized to have unit-integral, the adaptively smoothed image exhibits blur-bleeding across the mask boundaries. The correct result, with normalized , is shown in Figure 8, right.
An example of adaptive smoothing with a more complex scaling mask is shown in Figure 10. The image on the left shows a wire-frame visualization of a dragon model and the image on the right shows the results of adaptive smoothing applied to the visualization. For the scaling mask, we set:
where is the value of the -buffer at pixel , and are the pixel coordinates of the center of the dragon’s left eye. For the filter, we used the indicator function of a disk, smoothed along the radial directions. Smoothing was necessary to ensure that undesirable ringing artifacts did not arise when we approximated the extended convolution by using only the first 64 frequencies. This visualization simulates the depth-defocus (e.g. [Potmesil:SIG:1981, Demers:GPUGems:2004, Scheuermann:SX3:2004, Kass:PIXAR:2006]) resulting from imaging the dragon with a wide-aperture camera whose depth-of-field is set to the depth at the dragon’s left eye. Although the implementation does not take into account the depth-order of pixels, and hence does not provide a physically accurate simulation of the effects of depth-defocus, it generates convincing visualizations that can be used to draw the viewer’s eye to specific regions of interest.
The effectiveness of adaptive blurring is made possible by two properties: First, despite the band-limiting of the filter, adaptive blurring accurately reproduces fine detail, such as the single-pixel-width wire-frame lines in the left eye. Second, because the extended convolution is implemented as a scattering operation, it exhibits fewer of the edge-bleeding artifacts known to be difficult (e.g. [Kass:PIXAR:2006]) in “gathering” implementations.
As a further example of the effects achievable using adaptive filtering, we demonstrate the use of extended convolution with a rotational field to implement the Line Integral Convolution (LIC) technique for vector field visualization [Cabral:SG:1993]. We apply the extended convolution of a long, narrow anisotropic Gaussian kernel to a random noise image, using the given vector field’s angle at each pixel to determine the rotation to apply to the kernel. The result is shown in Figure 11, center, while at right we show the result produced when the normalization in (13) is not performed. The same technique was used to produce Figure 1, center: the gradient of the source image was used to define the rotational field, and noise was added to the image before applying extended convolution.
9 Function Steering in 3D
One of the contributions of our presentation is that it allows function steering to be generalized to higher dimensions. In this section, we discuss the limitations of using the classical formulation of function steering to adaptively rotate filters in 3D, and describe how such filtering can still be supported within our generalized, representation-theoretic framework.
As summarized in Section 4.4, classical steerable filtering with a filter is performed by using the functions as a steering basis, where is the rotation of the function by and is the maximal angular frequency of the filter.
The efficiency of this implementation is based on three properties. (1) The space spanned by the is the -dimensional space containing the orbit of , so the functions can be used to steer the filter. (2) The number of rotations, , is equal to the dimension of the space spanned by the orbit, so that the functions are the smallest set of functions required to steer . And, (3) the filter is one of the basis functions, , so that only of the functions are non-zero, and hence an implementation of extended correlation only requires standard correlations.
What limits the extension of this approach to 3D function steering is that it is impossible to generically choose a set of rotations such that is the identity and the functions are linearly independent. (See Appendix for more details.)
The inability to generalize classical steerable filtering to 3D has been observed before, and it has been suggested that an expansion into spherical harmonics might be used to accomplish this [Freeman:PAMI:1991]. Our generalized approach provides the details, showing how to compute the extended correlation (resp. convolution) in a manner analogous to the one used for 2D rotations in Section 4.1. In this discussion, we will consider the spherical parameterization of the filter where we are assuming that is the maximal angular frequency, so that the dimensionality of each spherical function is , and the radial resolution is .
We first decompose as the sum of functions with differing angular frequencies:
where the functions are spherical harmonics of frequency and index . This decomposition can be done by first expressing in spherical coordinates and then running the Fast Spherical Harmonic Transform [SpharmonicKit] at each radius to get the coefficients of the different frequency components. 
Next, we compute the standard correlations of the signal with the functions :
This can be done by first evaluating the function on a regular grid and then using the 3D Fast Fourier Transform to perform the correlation. 
Finally, we take the linear combination of the correlation results:
where are the Wigner-D functions, giving the coefficient of the -th spherical harmonic within a rotation of the -th spherical harmonic. 
Thus, our method provides a way for steering 3D functions, sampled on a regular grid with voxels, in time complexity . If, as in the 2D case, we assume that the angular frequency of the filter is much smaller than the resolution of the voxel grid, , the complexity becomes .
We have presented a novel method for extending the convolution and correlation operations, allowing for the efficient implementation of adaptive filtering. We have presented a general description of the approach, using principles from representation theory to guide the development of an efficient algorithm, and we discussed specific applications of the new operations to challenges in pattern matching and image processing.
In the future, we would like to apply extended convolutions using transformation fields consisting of both rotations and isotropic scales. We believe that this type of implementation opens the possibility for performing local shape-based matching over conformal parameterizations.
This work was performed under National Science Foundation grant IIS-1619050 and Office of Naval Research Award N00014-17-1-2142.