1 Introduction
Computing correspondences between related images is a central issue in many computer vision problems, ranging from scene recognition to optical flow estimation (Forsyth and Ponce, 2011; Szeliski, 2010). The goal of a matching algorithm is to discover shared visual content between two images, and to establish as many precise pointwise correspondences, called matches, as possible. An essential aspect of matching approaches is the amount of rigidity they assume when computing the correspondences. In fact, matching approaches range between two extreme cases: stereo matching, where matching hinges upon strong geometric constraints, and matching “in the wild”, where the set of possible transformations from the source image to the target one is large and the problem is almost unconstrained. Effective approaches have been designed for matching rigid objects across images in the presence of large viewpoint changes (Lowe, 2004; Barnes et al, 2010; HaCohen et al, 2011). However, the performance of current state-of-the-art matching algorithms for images “in the wild”, such as consecutive images in real-world videos featuring fast non-rigid motion, still calls for improvement (Xu et al, 2012; Chen et al, 2013). In this paper, we aim at tackling matching in such a general setting.
Matching algorithms for images “in the wild” need to accommodate several requirements that often turn out to be contradictory. On one hand, matching objects necessarily requires rigidity assumptions to some extent. It is also mandatory that these objects have sufficiently discriminative textures to make the problem well-defined. On the other hand, many objects or regions, like humans or animals, are not rigid. Furthermore, large portions of an image are usually occupied by weakly-to-no textured regions, often with repetitive textures, like sky or bucolic background.
Descriptor matching approaches, such as SIFT (Lowe, 2004) or HOG (Dalal and Triggs, 2005; Brox and Malik, 2011) matching, compute discriminative feature representations from rectangular patches. However, while these approaches succeed in the case of rigid motion, they fail to match regions with weak or repetitive textures, as local patches are poorly discriminative. Furthermore, matches are usually poor and imprecise in the case of non-rigid deformations, as these approaches rely on rigid patches. Discriminative power can be traded against increased robustness to non-rigid deformations. Indeed, propagation-based approaches, such as Generalized PatchMatch (Barnes et al, 2010) or Non-rigid Dense Correspondences (HaCohen et al, 2011), compute simple feature representations from small patches and propagate matches to neighboring patches. They yield good performance in the case of non-rigid deformations. However, matching repetitive textures remains beyond the reach of these approaches.
In this paper we propose a novel approach, called DeepMatching, that gracefully combines the strengths of these two families of approaches. DeepMatching is computed using a multi-layer architecture, which breaks down patches into a hierarchy of sub-patches. This architecture makes it possible to work at several scales and to handle repetitive textures. Furthermore, within each layer, local matches are computed assuming a restricted set of feasible rigid deformations. Local matches are then propagated up the hierarchy, which progressively discards spurious incorrect matches. We call our approach DeepMatching, as it is inspired by deep convolutional approaches.
In summary, we make three contributions:
Dense matching: we propose a matching algorithm, DeepMatching, that robustly determines dense correspondences between two images. It explicitly handles non-rigid deformations, with bounds on the deformation tolerance, and incorporates a multi-scale scoring of the matches, making it robust to repetitive or weak textures. Furthermore, our approach is based on gradient histograms, and is thus robust to appearance changes caused by illumination and color variations.
Fast, scale/rotation-invariant matching: we propose a computationally efficient version of DeepMatching, which performs almost as well as exact DeepMatching, but at a much lower memory cost. Furthermore, this fast version of DeepMatching can be extended to a scale and rotation-invariant version, making it an excellent competitor to state-of-the-art descriptor matching approaches.
Large-displacement optical flow: we propose an optical flow approach which uses DeepMatching in the matching term of the large displacement variational energy minimization of Brox and Malik (2011). We show that DeepMatching is a better choice than the HOG descriptor used by Brox and Malik (2011) and other state-of-the-art matching algorithms. The approach, named DeepFlow, obtains competitive results on public optical flow benchmarks.
This paper is organized as follows. After a review of previous works (Section 2), we start by presenting the proposed matching algorithm, DeepMatching, in Section 3. Then, Section 4 describes several extensions of DeepMatching. In particular, we propose an optical flow approach, DeepFlow, in Section 4.3. Finally, we present experimental results in Section 5.
A preliminary version of this article has appeared in Weinzaepfel et al (2013). This version adds (1) an in-depth presentation of DeepMatching; (2) an enhanced version of DeepMatching, which improves the match scoring and the selection of entry points for backtracking; (3) proofs on time and memory complexity of DeepMatching as well as its deformation tolerance; (4) a discussion on the connection between Deep Convolutional Neural Networks and DeepMatching; (5) a fast approximate version of DeepMatching; (6) a scale and rotation invariant version of DeepMatching; and (7) an extensive experimental evaluation of DeepMatching on several state-of-the-art benchmarks. The code for DeepMatching as well as DeepFlow is available at http://lear.inrialpes.fr/src/deepmatching/ and http://lear.inrialpes.fr/src/deepflow/. Note that we provide a GPU implementation in addition to the CPU one.
2 Related work
In this section we review related work on “general” image matching, that is matching without prior knowledge and constraints, and on matching in the context of optical flow estimation, that is matching consecutive images in videos.
2.1 General image matching
Image matching based on local features has been extensively studied in the past decade. It has been applied successfully to various domains, such as wide baseline stereo matching (Furukawa et al, 2010) and image retrieval (Philbin et al, 2010). It consists of two steps, i.e., extracting local descriptors and matching them. Image descriptors are extracted in rigid (generally square) local frames at sparse invariant image locations (Mikolajczyk et al, 2005; Szeliski, 2010). Matching then amounts to a nearest neighbor search between descriptors, followed by an optional geometric verification. Note that a confidence value can be obtained by computing the uniqueness of a match, i.e., by looking at the distance of its nearest neighbors (Lowe, 2004; Brox and Malik, 2011). While this class of techniques is well suited for well-textured rigid objects, it fails to match non-rigid objects and weakly textured regions.
In contrast, the proposed matching algorithm, called DeepMatching, is inspired by non-rigid 2D warping and deep convolutional networks (LeCun et al, 1998a; Uchida and Sakoe, 1998; Keysers et al, 2007). This family of approaches explicitly models non-rigid deformations. We employ a novel family of feasible warpings that enforces neither monotonicity nor continuity constraints, in contrast to traditional 2D warping (Uchida and Sakoe, 1998; Keysers et al, 2007). This makes the problem computationally much less expensive.
It is also worthwhile to mention the similarity with non-rigid matching approaches developed for a broad range of applications. Ecker and Ullman (2009) proposed a similar pipeline to ours (albeit more complex) to measure the similarity of small images. However, their method lacks a way of merging correspondences belonging to objects with contradictory motions, e.g., on different focal planes. For the purpose of establishing dense correspondences between images, Wills et al (2006) estimated a non-rigid matching by robustly fitting smooth parametric models (homography and splines) to local descriptor matches. In contrast, our approach is non-parametric and model-free.
Recently, fast algorithms for dense patch matching have taken advantage of the redundancy between overlapping patches (Barnes et al, 2010; Korman and Avidan, 2011; Sun, 2012; Yang et al, 2014). The insight is to propagate good matches to their neighborhood in a loose fashion, yielding dense non-rigid matches. In practice, however, the lack of a smoothness constraint leads to highly discontinuous matches. Several works have proposed ways to fix this. HaCohen et al (2011) reinforce neighboring matches using an iterative multi-scale expansion and contraction strategy, performed in a coarse-to-fine manner. Yang et al (2014) include a guided filtering stage on top of PatchMatch, which obtains smooth correspondence fields by locally approximating an MRF. Finally, Kim et al (2013) propose a hierarchical matching to obtain dense correspondences, using a coarse-to-fine (top-down) strategy. Loopy belief propagation is used to perform inference.
In contrast to these approaches, DeepMatching proceeds bottom-up and then top-down. Due to its hierarchical nature, DeepMatching is able to consider patches at several scales, thus overcoming the lack of distinctiveness that affects small patches. Yet, the multi-layer construction makes it possible to perform matching efficiently while allowing semi-rigid local deformations. In addition, DeepMatching can be computed efficiently, and can be further accelerated to satisfy low-memory requirements with negligible loss in accuracy.
2.2 Matching for flow estimation
Variational energy minimization is currently the most popular framework for optical flow estimation. Since the pioneering work of Horn and Schunck (1981), research has focused on alleviating the drawbacks of this approach. A series of improvements were proposed over the years (Black and Anandan, 1996; Werlberger et al, 2009; Bruhn et al, 2005; Papenberg et al, 2006; Baker et al, 2011; Sun et al, 2014b; Vogel et al, 2013a). The variational approach of Brox et al (2004) combines most of these improvements in a unified framework. The energy decomposes into several terms, namely the data-fitting and the smoothness terms. Energy minimization is performed by solving the Euler-Lagrange equations, reducing the problem to solving a sequence of large and structured linear systems.
More recently, the addition of a descriptor matching term in the energy to be minimized was proposed by Brox and Malik (2011). Following this idea, several papers (Tola et al, 2008; Brox and Malik, 2011; Liu et al, 2011; Hassner et al, 2012) show that dense descriptor matching improves performance. Strategies such as reciprocal nearest-neighbor verification (Brox and Malik, 2011) make it possible to prune most of the false matches. However, a variational energy minimization approach that includes such a descriptor matching term may fail at locations where matches are missing or wrong.
Related approaches tackle the problem of dense scene correspondence. SIFT-flow (Liu et al, 2011), one of the most famous methods in this context, also formulates the matching problem in a variational framework. Hassner et al (2012) improve over SIFT-flow by using multi-scale patches. However, this decreases performance in cases where scale invariance is not required. Xu et al (2012) integrate matching of SIFT (Lowe, 2004) and PatchMatch (Barnes et al, 2010) to refine the flow initialization at each level. Excellent results are obtained for optical flow estimation, yet at the cost of expensive fusion steps. Leordeanu et al (2013) extend sparse matches with locally affine constraints to dense matches and then use a total variation algorithm to refine the flow estimation. We present here a computationally efficient and competitive approach for large displacement optical flow by integrating the proposed DeepMatching algorithm into the approach of Brox and Malik (2011).
3 DeepMatching
This section introduces our matching algorithm DeepMatching. DeepMatching is a matching algorithm based on correlations at the patch level, that proceeds in a multi-layer fashion. The multi-layer architecture relies on a quadtree-like patch subdivision scheme, with an extra degree of freedom to locally re-optimize the positions of each quadrant. In order to enhance the contrast of the spatial correlation maps output by the local correlations, a non-linear transformation is applied after each layer.
We first give an overview of DeepMatching in Section 3.1 and show that it can be decomposed into a bottom-up pass followed by a top-down pass. We then present the bottom-up pass in Section 3.2 and the top-down one in Section 3.3. Finally, we analyze DeepMatching in Section 3.4.
3.1 Overview of the approach
A state-of-the-art approach for matching regions between two images is based on the SIFT descriptor (Lowe, 2004). SIFT is a histogram of gradients with $4 \times 4$ spatial and 8 orientation bins, yielding a robust descriptor $R \in \mathbb{R}^{4 \times 4 \times 8}$ that effectively encodes a square image region. Note that its $4 \times 4$ cell grid can also be viewed as 4 so-called “quadrants” of $2 \times 2$ cells, see Figure 1. We can then rewrite $R = [R_0, R_1, R_2, R_3]$ with $R_i \in \mathbb{R}^{2 \times 2 \times 8}$.
Let $R$ and $R'$ be the SIFT descriptors of the corresponding regions in the source and target image. In order to remove the effect of non-rigid motion, we propose to optimize the positions $p_i$ of the 4 quadrants of the target descriptor $R'$ (rather than keeping them fixed), in order to maximize
$$\mathrm{sim}(R, R') = \max_{\{p_i\}} \frac{1}{4} \sum_{i=0}^{3} \mathrm{sim}\left(R_i, R'_i(p_i)\right) \tag{1}$$
where $R'_i(p_i)$ is the descriptor of a single quadrant extracted at position $p_i$ and $\mathrm{sim}(\cdot,\cdot)$ a similarity function. Now, $\mathrm{sim}(R, R')$ is able to handle situations such as the one presented in Figure 1, where a region contains multiple objects moving in different directions. Furthermore, if the four quadrants can move independently (of course, within some extent), it can be calculated more efficiently as:
$$\mathrm{sim}(R, R') = \frac{1}{4} \sum_{i=0}^{3} \max_{p_i} \mathrm{sim}\left(R_i, R'_i(p_i)\right) \tag{2}$$
When applied recursively to each quadrant by subdividing it into 4 sub-quadrants until a minimum patch size is reached (atomic patches), this strategy allows for accurate non-rigid matching. Such a recursive decomposition can be represented as a quadtree, see Figure 2. Given an initial pair of two matching regions, retrieving atomic patch correspondences is then done in a top-down fashion (i.e. by recursively applying Eq. (2) to the quadrants' positions $p_i$).
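The equivalence between jointly optimizing the four quadrant positions and optimizing each quadrant on its own can be checked on a toy example. The sketch below is illustrative only: the descriptors are made-up 2-D vectors, `sim` is assumed to be a plain dot product, and the names `joint_score` and `independent_score` are hypothetical.

```python
from itertools import product


def sim(a, b):
    """Dot product between two descriptors given as equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def joint_score(src_quadrants, candidates):
    """Search every joint choice of the 4 positions, keep the best average."""
    best = float("-inf")
    for choice in product(*candidates):
        avg = sum(sim(r, c) for r, c in zip(src_quadrants, choice)) / 4.0
        best = max(best, avg)
    return best


def independent_score(src_quadrants, candidates):
    """Maximize each quadrant separately, then average the 4 maxima."""
    maxima = [max(sim(r, c) for c in cands)
              for r, cands in zip(src_quadrants, candidates)]
    return sum(maxima) / 4.0


# Toy data (hypothetical): 4 source quadrants and, for each of them,
# 3 candidate target descriptors (one per allowed quadrant position).
SRC = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
CANDS = [[[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]] for _ in range(4)]

print(joint_score(SRC, CANDS), independent_score(SRC, CANDS))
```

Both functions return the same score, but the joint search evaluates every combination of positions (81 here) while the independent version only needs one pass per quadrant (12 similarity evaluations), which is what makes the recursive scheme tractable.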
Nevertheless, in order to first determine the set of matching regions between the two images, we need to compute beforehand the matching scores (i.e. similarity) of all large-enough patches in the two images (as in Figure 1), and keep the pairs with maximum similarity. As indicated by Eq. (2), the score is formed by averaging the max-pooled scores of the quadrants. Hence, the process of computing the matching scores is bottom-up. In the following, we call correlation map the matching scores of a single patch from the first image at every position in the second image. Selecting matching patches then corresponds to finding local maxima in the correlation maps.
To sum up, the algorithm can be decomposed into two steps: (i) first, correlation maps are computed using a bottom-up algorithm, as shown in Figure 6. Correlation maps of small patches are first computed and then aggregated to form correlation maps of larger patches; (ii) next, a top-down method estimates the motion of atomic patches starting from matches of large patches.
3.2 Bottom-up correlation pyramid computation
Let $I$ and $I'$ be two images of resolution $W \times H$ and $W' \times H'$.
Bottom level.
We use patches of size $4 \times 4$ pixels as atomic patches. We split $I$ into non-overlapping atomic patches, and compute the correlation map with image $I'$ for each of them, see Figure 3. The score between two atomic patches $R$ and $R'$ is defined as the average pixel-wise similarity:
$$\mathrm{sim}(R, R') = \frac{1}{16} \sum_{i=0}^{3} \sum_{j=0}^{3} R_{i,j}^{\top} R'_{i,j} \tag{3}$$
where each pixel $R_{i,j}$ is represented as a histogram of oriented gradients pooled over a local neighborhood. We detail below how the pixel descriptor is computed.
Pixel descriptor $R_{i,j}$:
We rely on a robust pixel representation that is similar in spirit to SIFT and DAISY (Lowe, 2004; Tola et al, 2010). Given an input image $I$, we first apply a Gaussian smoothing of radius $\nu_1$ in order to denoise $I$ from potential artifacts caused for example by JPEG compression. We then extract the gradient at each pixel and compute its non-negative projection onto 8 orientations $\left(\cos \frac{i\pi}{4}, \sin \frac{i\pi}{4}\right)$, $i = 1, \ldots, 8$. At this point, we obtain 8 oriented gradient maps. We smooth each map with a Gaussian filter of radius $\nu_2$. Next we cap strong gradients using a sigmoid, to help cancel out the effects of varying illumination. We smooth gradients one more time for each orientation with a Gaussian filter of radius $\nu_3$. Finally, the descriptor for each pixel is obtained by the normalized concatenation of the 8 oriented gradients and a ninth small constant value $\mu$. Appending $\mu$ amounts to adding a regularizer that reduces the importance of small gradients (i.e. noise) and ensures that two pixels lying in areas without gradient information still correlate positively. Pixel descriptors are compared using the dot product, and the similarity function takes values in the interval $[0, 1]$. In Section 5.2.1, we evaluate the impact of the parameters of this pixel descriptor.
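The role of the ninth constant component can be illustrated with a deliberately simplified sketch that builds the descriptor of a single pixel from its gradient. It omits the three Gaussian smoothing stages and the sigmoid, and the default `mu=0.3` is an arbitrary placeholder rather than a value taken from the text; the function names are hypothetical.

```python
import math


def pixel_descriptor(gx, gy, mu=0.3):
    """Unit-norm 9-D descriptor for one pixel with gradient (gx, gy).

    Simplification: no smoothing and no sigmoid; mu is a placeholder value.
    """
    bins = []
    for k in range(8):
        angle = k * math.pi / 4.0
        # Non-negative projection of the gradient onto orientation k.
        bins.append(max(0.0, gx * math.cos(angle) + gy * math.sin(angle)))
    bins.append(mu)  # ninth, constant component
    norm = math.sqrt(sum(v * v for v in bins))
    return [v / norm for v in bins]


def similarity(d1, d2):
    """Dot product of two pixel descriptors."""
    return sum(a * b for a, b in zip(d1, d2))


flat = pixel_descriptor(0.0, 0.0)
right = pixel_descriptor(1.0, 0.0)
left = pixel_descriptor(-1.0, 0.0)
print(similarity(flat, flat), similarity(right, left))
```

Two gradient-free pixels reach the maximum similarity, and two pixels with opposite gradients still correlate slightly positively thanks to the constant component, while all similarities stay within $[0, 1]$ because every component is non-negative and the descriptors have unit norm.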
Bottom-level correlation map:
We can express the correlation map computation obtained from Eq. (3) more conveniently in a convolutional framework. Let $I_{N,p}$ be a patch of size $N \times N$ from the first image centered at $p$ ($N \geq 4$ is a power of 2). Let $G_4$ be a grid with step 4 pixels; $G_4$ is the set of the centers of the atomic patches. For each $p \in G_4$, we convolve the flipped patch $I^{F}_{4,p}$ over $I'$
$$C_{4,p} = I^{F}_{4,p} \star I' \tag{4}$$
to get the correlation map $C_{4,p}$, where $\cdot^{F}$ denotes a horizontal and vertical flip (this amounts to the cross-correlation of the patch and $I'$). For any pixel $p'$ of $I'$, $C_{4,p}(p')$ is a measure of similarity between $I_{4,p}$ and $I'_{4,p'}$. Examples of such correlation maps are shown in Figure 3 and Figure 4. Unsurprisingly, atomic patches are not discriminative. Recursive aggregation of patches in subsequent stages will be the key to creating discriminative responses.
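As a rough illustration of a bottom-level correlation map, the sketch below correlates one atomic patch with every valid position of a second descriptor grid by brute force. It is an assumption-laden toy: patches are indexed by their top-left corner instead of their center, only positions where the patch fits entirely are evaluated, and `toy_descriptors` generates made-up 2-D unit vectors instead of real pixel descriptors.

```python
import math


def toy_descriptors(height=8, width=8):
    """Hypothetical test data: one distinct 2-D unit vector per pixel."""
    grid = []
    for y in range(height):
        row = []
        for x in range(width):
            angle = math.radians(((y * width + x) * 37) % 90)
            row.append((math.cos(angle), math.sin(angle)))
        grid.append(row)
    return grid


def correlation_map(desc1, desc2, top, left, size=4):
    """Average pixel-wise dot product between the size x size patch of
    desc1 with top-left corner (top, left) and every same-sized patch
    of desc2. Entry [y][x] is the score for top-left corner (y, x)."""
    h2, w2 = len(desc2), len(desc2[0])
    out = []
    for y in range(h2 - size + 1):
        row = []
        for x in range(w2 - size + 1):
            total = 0.0
            for i in range(size):
                for j in range(size):
                    a = desc1[top + i][left + j]
                    b = desc2[y + i][x + j]
                    total += sum(u * v for u, v in zip(a, b))
            row.append(total / (size * size))
        out.append(row)
    return out


grid = toy_descriptors()
cmap = correlation_map(grid, grid, top=2, left=3)
```

Matching an image against itself, the map peaks at the patch's own location. A practical implementation would compute all maps at once with convolution routines rather than nested loops.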
Iteration.
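We then compute the correlation maps of larger patches by aggregating those of smaller patches. As shown in Figure 5, an $N \times N$ patch $I_{N,p}$ is the concatenation of 4 patches of size $N/2 \times N/2$:
$$I_{N,p} = \left[ I_{\frac{N}{2},\, p + \frac{N}{4} o_i} \right]_{i=0,\ldots,3} \quad \text{with} \quad o_0 = (-1,-1),\; o_1 = (-1,1),\; o_2 = (1,-1),\; o_3 = (1,1). \tag{5}$$
They correspond respectively to the bottom-left, top-left, bottom-right and top-right quadrants. The correlation map of $I_{N,p}$ can thus be computed using its children's correlation maps. For the sake of clarity, we define the shorthand notation $s_{N,i} = \frac{N}{4} o_i$ describing the positional shift of a child patch $i \in \{0, \ldots, 3\}$ relative to its parent patch (see Figure 5).
Using the above notations, we rewrite Eq. (2) by replacing $\mathrm{sim}(R, R')$ with $C_{N,p}(p')$ (i.e. assuming here that patch $R = I_{N,p}$ and that $R'$ is centered at $p' \in I'$). Similarly, we replace the similarity between children patches $\mathrm{sim}(R_i, R'_i(p_i))$ by $C_{\frac{N}{2},\, p + s_{N,i}}(p_i)$. For each child, we retain the maximum similarity over a small neighborhood $\Theta_i$ of width and height $\frac{N}{4}$ centered at $p' + s_{N,i}$. We then obtain:
$$C_{N,p}(p') = \frac{1}{4} \sum_{i=0}^{3} \max_{m' \in \Theta_i} C_{\frac{N}{2},\, p + s_{N,i}}(m'). \tag{6}$$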
We now explain how we can break down Eq. (6) into a succession of simple operations. First, let us assume that $N = 4 \times 2^{\ell}$, where $\ell \geq 1$ is the current iteration. During iteration $\ell$, we want to compute the correlation maps $C_{N,p}$ of every patch $I_{N,p}$ from the first image for which correlation maps of its children have been computed in the previous iteration. Formally, the positions of such patches are defined from the positions of the children patches according to Eq. (5):
$$G_N = \left\{ p \;\middle|\; p + s_{N,i} \in G_{\frac{N}{2}} \text{ for some } i \in \{0, \ldots, 3\} \right\}. \tag{7}$$
We observe that the larger a patch is (i.e. after several iterations), the smaller the spatial variation of its correlation map (see Figure 4). This is due to the statistics of natural images, in which low frequencies significantly dominate over high frequencies. As a consequence, we choose to subsample each map by a factor 2. We express this with an operator $\mathcal{S}$ acting on a correlation map $C$:
$$\mathcal{S} : C(p') \rightarrow C(2p'). \tag{8}$$
The subsampling reduces the area of the correlation maps by a factor 4 and, as a direct consequence, the computational requirements. Instead of computing the subsampling on top of Eq. (6), it is actually more efficient to propagate it towards the children maps and perform it jointly with max-pooling. It also makes the max-pooling domain independent of the patch size $N$ in the subsampled maps, as it exactly cancels out the effect of doubling $N$ at each iteration. We call $\mathcal{P}$ the max-pooling operator with the iteration-independent domain $\{-1, 0, 1\}^2$:
$$\mathcal{P} : C(p') \rightarrow \max_{m \in \{-1,0,1\}^2} C(p' + m). \tag{9}$$
For the same reason, the shift $s_{N,i} = \frac{N}{4} o_i$ applied to the correlation maps in the definition of the max-pooling neighborhood becomes simply $o_i$ after subsampling. Let $\mathcal{T}_t$ be the shift (or translation) operator on the correlation map:
$$\mathcal{T}_t : C(p') \rightarrow C(p' + t). \tag{10}$$
Finally, we incorporate an additional non-linear mapping at each iteration on top of Eq. (6) by applying a power transform $\mathcal{R}_\lambda$ (Malik and Perona, 1990; LeCun et al, 1998a):
$$\mathcal{R}_\lambda : C(p') \rightarrow C(p')^{\lambda}. \tag{11}$$
This step, commonly referred to as rectification, is added in order to better propagate high correlations after each level, or, in other words, to counterbalance the fact that max-pooling tends to retain only high scores. Indeed, its effect is to decrease the correlation values (which are in $[0, 1]$) as we use $\lambda > 1$. Such post-processing is commonly used in deep convolutional networks (LeCun et al, 1998b; Bengio, 2009). The value of $\lambda$ that yields good performance in practice is discussed in Section 5. The final expression of Eq. (6) is:
$$C_{N,p} = \mathcal{R}_\lambda \left( \frac{1}{4} \sum_{i=0}^{3} \left( \mathcal{T}_{o_i} \circ \mathcal{S} \circ \mathcal{P} \right) \left( C_{\frac{N}{2},\, p + s_{N,i}} \right) \right). \tag{12}$$
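One aggregation step (max-pooling, subsampling, shifting, averaging and rectification) can be sketched on small 2-D lists as follows. This is a minimal illustration under stated assumptions: maps are indexed as `c[y][x]`, the quadrant offsets are taken as `(dx, dy)` pairs with an assumed axis orientation, positions that fall outside a child map are simply skipped in the average, and `lam=2.0` is an arbitrary placeholder greater than 1.

```python
# Assumed quadrant offsets (dx, dy) for the four children.
OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def max_pool(c):
    """3x3 max-pooling: maximum over offsets in {-1, 0, 1}^2."""
    h, w = len(c), len(c[0])
    return [[max(c[y + dy][x + dx]
                 for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                 if 0 <= y + dy < h and 0 <= x + dx < w)
             for x in range(w)] for y in range(h)]


def subsample(c):
    """Keep every other row and column."""
    return [row[::2] for row in c[::2]]


def aggregate(children, lam=2.0):
    """Build a parent correlation map from its 4 children maps."""
    pooled = [subsample(max_pool(c)) for c in children]
    h, w = len(pooled[0]), len(pooled[0][0])
    parent = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = []
            for c, (dx, dy) in zip(pooled, OFFSETS):
                yy, xx = y + dy, x + dx
                if 0 <= yy < h and 0 <= xx < w:  # skip out-of-map entries
                    vals.append(c[yy][xx])
            mean = sum(vals) / len(vals) if vals else 0.0
            row.append(mean ** lam)  # rectification
        parent.append(row)
    return parent


def peak_map(x, y, size=8):
    """Hypothetical child map: zero everywhere except a 1.0 at (x, y)."""
    return [[1.0 if (i == x and j == y) else 0.0 for i in range(size)]
            for j in range(size)]


# Children peaking where a rigid match at parent position (2, 2) expects
# them, except child 0, which is displaced by one pixel.
children = [peak_map(3, 2), peak_map(2, 6), peak_map(6, 2), peak_map(6, 6)]
parent = aggregate(children)
```

In this toy case the parent map still reaches its maximum score at position (2, 2) although one child is displaced by a pixel, which is the tolerance that max-pooling provides.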
Boundary effects:
In practice, a patch can overlap with the image boundary, as long as its center remains inside the image (from Eq. (7)). For instance, a patch $I_{N,p_0}$ with center at $p_0 = (0,0)$ has only a single valid child (the one for which $i = 3$, as $p_0 + s_{N,3} \in I$). In such degenerate cases, the average sum in Eq. (12) is carried out on valid children only. For $I_{N,p_0}$, it thus only comprises one term weighted by 1 instead of $\frac{1}{4}$.
Note that Eq. (12) implicitly defines the set of possible displacements of the approach, see Figures 2 and 9. Given the position of a parent patch, each child patch can move only within a small extent, equal to the quarter of its own size. Figure 4 shows the correlation maps for patches of three increasing sizes. Clearly, correlation maps for larger patches are more and more discriminative, while still allowing non-rigid matching.
3.3 Top-down correspondence extraction
A score $C_{N,p}(p')$ in the multi-level correlation pyramid represents the deformation-tolerant similarity of two patches $I_{N,p}$ and $I'_{N,p'}$. Since this score is built from the similarity of 4 matching sub-patches at the lower pyramid level, we can recursively backtrack a set of correspondences to the bottom level (corresponding to matches of atomic patches). In this section, we first describe this backtracking. We then present the procedure for merging atomic correspondences backtracked from different entry points in the multi-level pyramid, which constitute the final output of DeepMatching.
Compared to our initial version of DeepMatching (Weinzaepfel et al, 2013), we have updated match scoring and entry point selection to optimize the execution time and the matching accuracy. A quantitative comparison is provided in Section 5.2.2.
Backtracking atomic correspondences.
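Given an entry point $C_{N,p}(p')$ in the pyramid (i.e. a match between two patches $I_{N,p}$ and $I'_{N,p'}$; note that $I'_{N,p'}$ only roughly corresponds to an $N \times N$ square patch of $I'$ centered at the full-resolution location of $p'$, due to subsampling and possible deformations), we retrieve atomic correspondences by successively undoing the steps used to aggregate correlation maps during the pyramid construction, see Figure 7. The entry patch $I_{N,p}$ is itself composed of four moving quadrants $I_{\frac{N}{2},\, p + s_{N,i}}$, $i = 0, \ldots, 3$. Due to the subsampling, the quadrant $I_{\frac{N}{2},\, p + s_{N,i}}$ matches with $I'_{\frac{N}{2},\, p'_i}$ where
$$p'_i = 2(p' + o_i) + \arg\max_{m \in \{-1,0,1\}^2} C_{\frac{N}{2},\, p + s_{N,i}}\left(2(p' + o_i) + m\right). \tag{13}$$
For the sake of clarity, we define the shorthand notation $p_i = p + s_{N,i}$ for the center of quadrant $i$, with $p'_i$ given by Eq. (13). Let $\mathcal{B}$ be the function that assigns to a tuple $(p, p', N, s)$, representing a correspondence between pixel $p$ and $p'$ for a patch of size $N$ with a score $s$, the set of the correspondences of children patches:
$$\mathcal{B}(p, p', N, s) = \begin{cases} \{(p, p', N, s)\} & \text{if } N = 4, \\ \left\{ \left(p_i,\, p'_i,\, \tfrac{N}{2},\, s + C_{\frac{N}{2},\, p_i}(p'_i)\right) \right\}_{i=0}^{3} & \text{otherwise.} \end{cases} \tag{14}$$
Given a set $\mathcal{M}$ of such tuples, let $\mathcal{B}(\mathcal{M})$ be the union of the sets $\mathcal{B}(c)$ for all $c \in \mathcal{M}$. Note that if all candidate correspondences correspond to atomic patches, then $\mathcal{B}(\mathcal{M}) = \mathcal{M}$.
Thus, the algorithm for backtracking correspondences is the following. Consider an entry match $(p, p', N, s)$. We repeatedly apply $\mathcal{B}$ on $\mathcal{M} = \{(p, p', N, s)\}$. After $n = \log_2(N/4)$ calls, we get one correspondence for each of the $4^n$ atomic patches. Furthermore, their score is equal to the sum of all patch similarities along their backtracking path.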
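The backtracking recursion can be sketched independently of how the pyramid is stored. In the hedged example below, the pyramid is abstracted as a hypothetical callable `corr(n, p, q)` returning the score of the `n x n` patch centered at `p` at map position `q`; the toy oracle `make_translation_corr` simulates a scene undergoing a pure translation and assumes that maps of `n x n` patches are subsampled by `n / 4`. The quadrant offsets use an assumed axis orientation.

```python
OFFSETS = [(-1, -1), (-1, 1), (1, -1), (1, 1)]  # assumed (dx, dy) per quadrant


def backtrack_once(match, corr):
    """Replace a match by the matches of its 4 children (atomic: unchanged)."""
    p, q, n, s = match
    if n == 4:
        return [match]
    out = []
    for ox, oy in OFFSETS:
        child_p = (p[0] + n // 4 * ox, p[1] + n // 4 * oy)
        base = (2 * (q[0] + ox), 2 * (q[1] + oy))
        # Undo the max-pooling: best position among the 3x3 candidates.
        cands = [(base[0] + mx, base[1] + my)
                 for mx in (-1, 0, 1) for my in (-1, 0, 1)]
        best_q = max(cands, key=lambda c: corr(n // 2, child_p, c))
        out.append((child_p, best_q, n // 2,
                    s + corr(n // 2, child_p, best_q)))
    return out


def backtrack(entry, corr):
    """Apply backtrack_once until only atomic (size 4) matches remain."""
    matches = [entry]
    while any(m[2] > 4 for m in matches):
        matches = [c for m in matches for c in backtrack_once(m, corr)]
    return matches


def make_translation_corr(t):
    """Hypothetical oracle: scores peak where a translation t (pixels) predicts."""
    def corr(n, p, q):
        step = n // 4  # assumed subsampling factor of the maps at size n
        ex, ey = (p[0] + t[0]) / step, (p[1] + t[1]) / step
        return 1.0 / (1.0 + (q[0] - ex) ** 2 + (q[1] - ey) ** 2)
    return corr


corr = make_translation_corr((4, 0))
entry = ((8, 8), (3, 2), 16, corr(16, (8, 8), (3, 2)))
atomic = backtrack(entry, corr)
```

Starting from one match of a 16 x 16 patch, two rounds of backtracking yield 16 atomic correspondences, each accumulating the scores met along its path.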
Merging correspondences.
We have shown how to retrieve atomic correspondences from a match between two deformable (potentially large) patches. Despite this flexibility, a single match is unlikely to explain the complex set of motions that can occur, for example, between two adjacent frames in a video: two objects moving independently with significantly different motions exceed the deformation range of DeepMatching. We quantitatively specify this range in the next subsection.
We thus merge atomic correspondences gathered from different entry points (matches) in the pyramid. In the initial version of DeepMatching (Weinzaepfel et al, 2013), entry points were local maxima over all correlation maps. This is now replaced by a faster procedure that starts with all possible matches in the top pyramid level (i.e. $\mathcal{M} = \{(p, p', N, C_{N,p}(p'))\}$ over all positions $p$ and $p'$, with $N$ the largest patch size). Using this level only results in significantly fewer entry points than starting from all maxima in the entire pyramid. We did not observe any impact on the matching performance, see Section 5.2.2. Because $\mathcal{M}$ contains a lot of overlapping patches, most of the computation during repeated calls to $\mathcal{B}$ can be factorized. In other words, as soon as two tuples in $\mathcal{B}(\mathcal{M})$ are equal in terms of $p$, $p'$ and $N$, the one with the lowest score is simply eliminated. We thus obtain a set of atomic correspondences $\mathcal{M}'$:
$$\mathcal{M}' = \left( \mathcal{B} \circ \cdots \circ \mathcal{B} \right)(\mathcal{M}) \tag{15}$$
that we filter with reciprocal match verification. The final set of correspondences $\mathcal{M}''$ is obtained as:
$$\mathcal{M}'' = \left\{ (p, p', s) \in \mathcal{M}' \;\middle|\; \mathrm{BestAt}(p) = \mathrm{BestAt}'(p') \right\} \tag{16}$$
where $\mathrm{BestAt}(p)$ (resp. $\mathrm{BestAt}'(p')$) returns the best match from $\mathcal{M}'$ in a small vicinity around $p$ in $I$ (resp. around $p'$ in $I'$).
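Reciprocal verification can be sketched as a filter that keeps a correspondence only when the best match found near its source position and the best match found near its target position agree. The sketch below is a naive quadratic-time illustration; the function name and the `radius=2` vicinity are assumptions, not values taken from the text.

```python
def reciprocal_filter(matches, radius=2):
    """Keep matches whose best neighbor in image 1 and in image 2 coincide.

    matches: list of (p, q, score) with p and q given as (x, y) tuples.
    radius: half-width of the square vicinity (placeholder value).
    """
    def best_near(point, index):
        near = [m for m in matches
                if abs(m[index][0] - point[0]) <= radius
                and abs(m[index][1] - point[1]) <= radius]
        return max(near, key=lambda m: m[2])

    return [m for m in matches
            if best_near(m[0], 0) == best_near(m[1], 1)]


# Toy matches (hypothetical): b competes with a in the first image,
# c competes with a in the second image.
a = ((10, 10), (20, 20), 0.9)
b = ((11, 10), (40, 40), 0.5)
c = ((50, 50), (21, 20), 0.4)
kept = reciprocal_filter([a, b, c])
```

Only `a` survives here: `b` and `c` each lose to `a` on one side, so their two neighborhoods disagree. A real implementation would index matches on a grid instead of scanning the whole list for every query.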
3.4 Discussion and Analysis of DeepMatching
Multi-size patches and repetitive textures.
During the bottom-up pass of the algorithm, we iteratively aggregate correlation maps of smaller patches to form the correlation maps of larger patches. Doing so, we effectively consider patches of different sizes ($N = 4, 8, 16, \ldots$), in contrast to most existing matching methods. This is a key feature of our approach when dealing with repetitive textures. As one moves up to upper levels, the matching problem gets less ambiguous. Hence, our method can correctly match repetitive patterns, see for instance Figure 8.
Quasi-dense correspondences.
Our method retrieves dense correspondences for every single match between large regions (i.e. entry point for the backtracking in the top-level correlation maps), even in weakly textured areas; this is in contrast to correspondences obtained when matching descriptors (e.g. SIFT). A quantitative assessment, which compares the coverage of matches obtained with several matching schemes, is given in Section 5.
Non-rigid deformations.
Our matching algorithm is able to cope with various sources of image deformations: object-induced or camera-induced. The set of feasible deformations, explicitly defined by Eq. (6), theoretically makes it possible to deal with a scaling factor in the range $[\frac{1}{2}, \frac{3}{2}]$ and rotations approximately in the range $[-26^\circ, 26^\circ]$. Note also that DeepMatching is translation-invariant by construction, thanks to the convolutional nature of the processing.
Proof
Given a patch of size $N = 4 \times 2^{\ell}$ located at level $\ell > 0$, Eq. (6) allows each of its children patches to move by at most $N/8$ pixels from their ideal location. By recursively summing the displacements at each level, the maximal displacement for an atomic patch is $d_N = \sum_{i=1}^{\ell} 2^{i-1} = 2^{\ell} - 1$. An example is given in Figure 9. Relative to $N$, we thus have $\lim_{N \to \infty} \frac{N + 2 d_N}{N} = \frac{3}{2}$ and $\lim_{N \to \infty} \frac{N - 2 d_N}{N} = \frac{1}{2}$. For a rotation, the rationale is similar, see Figure 9. ∎
Note that the displacement tolerance from Eq. (6) could be extended to $x \times N/8$ pixels with $x > 1$ (instead of $x = 1$). Then the above formula for computing the lower bound on the scale factor of DeepMatching generalizes to $LB(x) = \lim_{N \to \infty} \frac{N - 2 x d_N}{N} = 1 - \frac{x}{2}$. Hence, for $x \geq 2$ we obtain $LB(x) \leq 0$ instead of $LB(1) = \frac{1}{2}$. This implies that the deformation range is extended to a point where any patch can be matched to a single pixel, i.e., this results in unrealistic deformations. For this reason, we choose not to expand the deformation range of DeepMatching.
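The scale bounds can be checked numerically. The short sketch below assumes atomic patches of 4 pixels and a per-level child displacement of one eighth of the parent size, scaled by a hypothetical factor `x`; the function names are illustrative.

```python
def max_displacement(level):
    """Largest accumulated displacement of an atomic patch, in pixels."""
    return sum(2 ** (i - 1) for i in range(1, level + 1))


def scale_bounds(level, x=1):
    """Lower and upper scale factors reachable by a patch at this level."""
    n = 4 * 2 ** level                 # patch size at this level
    d = x * max_displacement(level)    # tolerance scaled by x
    return (n - 2 * d) / n, (n + 2 * d) / n


for level in (1, 3, 6, 12):
    print(level, scale_bounds(level), scale_bounds(level, x=2))
```

As the level grows, the bounds approach 1/2 and 3/2 for `x = 1`, whereas the lower bound for `x = 2` tends to 0, which corresponds to collapsing a patch onto a single pixel.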
Built-in smoothing.
Furthermore, correspondences generated through backtracking of a single entry point in the correlation maps are naturally smooth. Indeed, feasible deformations cannot be too “far” from the identity deformation. To verify this assumption, we conduct the following experiment. We artificially generate two types of correspondences between two images. The first one is completely random, i.e. for each atomic patch in the first image we randomly assign a match in the second image. The second one respects the backtracking constraints. Starting from a single entry point in the top level, we simulate the backtracking procedure from Section 3.3 by replacing in Eq. (13) the max operation by a random sampling over $\{-1, 0, 1\}^2$. By generating 10,000 sets of possible atomic correspondences, we simulate a set which respects the deformations allowed by DeepMatching. Figure 10 compares the smoothness of these two types of artificial correspondences. Smoothness is measured by interpreting the correspondences as flow and measuring the gradient flow norm, see Eq. (19). Clearly, the two types of warpings differ by orders of magnitude. Furthermore, the one which respects the built-in constraints of DeepMatching is close to the identity warping.
Relation to Deep Convolutional Neural Networks (CNNs).
DeepMatching relies on a hierarchical, multi-layer, correlational architecture designed for matching images and was inspired by deep convolutional approaches (LeCun et al, 1998a). In the following we describe the major similarities and differences.
Deep networks learn from data the weights of the convolutions. In contrast, DeepMatching does not learn any feature representations and instead directly computes correlations at the patch level. It uses patches from the first image as convolution filters for the second one. However, the bottom-up pipeline of DeepMatching is similar to CNNs. It alternates aggregating channels from the previous layer with channel-wise max-pooling and subsampling. As in CNNs, max-pooling in DeepMatching allows for invariance w.r.t. small deformations. Likewise, the algorithm propagates pairwise patch similarity scores through the hierarchy using non-linear rectifying stages in-between layers. Finally, DeepMatching includes a top-down pass which is not present in CNNs.
Time and space complexity.
DeepMatching has a complexity $O(LL')$ in memory and time, where $L = WH$ and $L' = W'H'$ are the number of pixels per image.
Proof
Computing the initial correlations is an $O(LL')$ operation. Then, at each level of the pyramid, the process is repeated while the complexity is divided by a factor 4 due to the subsampling step in the target image (since the cardinality of $G_N$ remains approximately constant). Thus, the total complexity of the correlation maps computation is, at worst, $O\left(\sum_{n=0}^{\infty} \frac{LL'}{4^n}\right) = O(LL')$. During the top-down pass, most backtracking paths can be pruned as soon as they cross a concurrent path with a higher score (see Section 3.3). Thus, all correlations will be examined at most once, and there are $O(LL')$ values in total. However, this analysis is worst-case. In practice, only correlations lying on maximal paths are actually examined. ∎
4 Extensions of DeepMatching
4.1 Approximate DeepMatching
As a consequence of its space complexity, DeepMatching requires an amount of RAM that is orders of magnitude above other state-of-the-art matching methods. This could correspond to several gigabytes for images of moderate size (800 × 600 pixels); see Section 5.2.3. This section introduces an approximation of DeepMatching that makes it possible to trade matching quality for reduced time and memory usage. As shown in Section 5.2.3, near-optimal results can be obtained at a fraction of the original cost.
Our approximation compresses the representation of atomic patches $I_{4,p}$. Atomic patches carry little information, and thus are highly redundant. For instance, in uniform regions, all patches are nearly identical (i.e., gradient-wise). To exploit this property, we index atomic patches with a small set of patch prototypes. We substitute each patch with its closest neighbor in a fixed dictionary of $K$ prototypes. Hence, we need to perform and store only $K$ convolutions at the first level, instead of one per atomic patch (with $K$ much smaller than the number of atomic patches). This significantly reduces both memory and time complexity. Note that higher pyramid levels also benefit from this optimization. Indeed, two parent patches at the second level have the exact same correlation map in case their children are assigned the same prototypes. The same reasoning also holds for all subsequent levels, but the gains rapidly diminish because the required condition becomes statistically unlikely. This is not really an issue, since the memory and computational cost mostly rests on the initial levels; see Section 3.4.
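A dictionary of prototypes for unit-norm descriptors can be obtained, for instance, with a k-means variant whose centroids are re-normalized after every update. The sketch below is one possible illustration of that idea, not the implementation used for the experiments; the function names, the iteration count and the random initialization are assumptions.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit norm (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_all(descs, centroids):
    """Index of the closest prototype for each descriptor.

    For unit vectors, the largest dot product is the smallest distance."""
    return [max(range(len(centroids)), key=lambda j: dot(d, centroids[j]))
            for d in descs]


def spherical_kmeans(descs, k, iters=10, seed=0):
    """Cluster unit-norm descriptors; centroids are kept on the unit sphere."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descs, k)]
    for _ in range(iters):
        assign = assign_all(descs, centroids)
        for j in range(k):
            members = [d for d, a in zip(descs, assign) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = normalize(mean)
    return centroids, assign_all(descs, centroids)
```

Once every atomic patch is replaced by the index of its prototype, correlation maps only need to be computed once per prototype and can be shared by all patches that carry the same index.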
In practice, we build the prototype dictionary using k-means, as it is designed to minimize the approximation error between input descriptors and resulting centroids (i.e. prototypes). Given a pair of images to match, we perform online clustering of all descriptors of atomic patches in the first image. Since the original descriptors lie on a hypersphere (each pixel descriptor has norm 1), we modify the k-means approach so as to project the estimated centroids onto the hypersphere at each iteration. We find experimentally that this is important to obtain good results.
4.2 Scale and rotation invariant DeepMatching
For a variety of tasks, objects to be matched can appear under image rotations or at different scales (Lowe, 2004; Mikolajczyk et al, 2005; Szeliski, 2010; HaCohen et al, 2011). As discussed above, DeepMatching (DM) is only robust to moderate scale changes and rotations. We now present a scale and rotation invariant version.
The approach is straightforward: we apply DM to several rotated and scaled versions of the second image. According to the invariance range of DM, we use steps of for image rotation and power of for scale changes. While iterating over all combinations of scale changes and rotations, we maintain a list of all atomic correspondences obtained so far, i.e. corresponding positions and scores. As before, the final output correspondences consist of the reciprocal matches in . Storing all matches and finally choosing the best ones based on reciprocal verification makes it possible to capture distinct motions occurring together in the same scene (e.g. one object could have undergone a rotation, while the rest of the scene did not move). The steps of the approach are described in Algorithm 2.
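The loop over rotations and scales can be sketched as follows. This is a simplified illustration under several assumptions: `match_fn` is a hypothetical stand-in for running DM against a transformed second image, the transformation is taken about the image origin, and reciprocal selection is reduced to a "best score for both endpoints" rule that stands in for the reciprocal verification of Eq. (16).

```python
import math

def map_back(q, angle, scale):
    """Undo x' = scale * R(angle) * x and round to the nearest pixel."""
    x, y = q[0] / scale, q[1] / scale
    c, s = math.cos(angle), math.sin(angle)
    return (round(c * x + s * y), round(-s * x + c * y))

def reciprocal_matches(matches):
    """Keep matches that have the best score for both of their endpoints."""
    best_for_p, best_for_q = {}, {}
    for m in matches:
        p, q, score = m
        if p not in best_for_p or score > best_for_p[p][2]:
            best_for_p[p] = m
        if q not in best_for_q or score > best_for_q[q][2]:
            best_for_q[q] = m
    return [m for m in best_for_p.values() if best_for_q[m[1]] == m]

def match_over_transforms(match_fn, angles, scales):
    """Accumulate matches over all (angle, scale) pairs, then filter them."""
    all_matches = []
    for angle in angles:
        for scale in scales:
            for p, q, score in match_fn(angle, scale):
                all_matches.append((p, map_back(q, angle, scale), score))
    return reciprocal_matches(all_matches)

def toy_match_fn(angle, scale):
    """Hypothetical matcher output as (position 1, position 2, score) triples."""
    if scale == 1.0:
        return [((0, 0), (5, 5), 0.9), ((4, 4), (8, 8), 0.3)]
    return [((4, 4), (12, 4), 0.8)]

result = match_over_transforms(toy_match_fn, angles=[0.0], scales=[1.0, 2.0])
```

In this toy run, pixel (4, 4) receives a weak match at scale 1 and a stronger one at scale 2; only the stronger one survives, which illustrates how different parts of a scene can be explained by different transformations.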
Since we iterate sequentially over a fixed list of rotations and scale changes, the space and time complexity of the algorithm remain unchanged (i.e. ). In practice, the runtime compared to DM is multiplied by a constant approximately equal to 25; see Section 5.2.4. Note that the algorithm is straightforward to parallelize.
4.3 DeepFlow
We now present our approach for optical flow estimation, DeepFlow. We adopt the method introduced by Brox and Malik (2011), where a matching term penalizes the differences between optical flow and input matches, and replace their matching approach by DeepMatching. In addition, we make a few minor modifications introduced recently in the state of the art: (i) we add a normalization in the data term to downweight the impact of locations with high spatial image derivatives (Zimmer et al, 2011); (ii) we use a different weight at each level to downweight the matching term at finer scales (Stoll et al, 2012); and (iii) the smoothness term is locally weighted (Xu et al, 2012).
Let be two consecutive images defined on with channels. The goal is to estimate the flow . We assume that the images are already smoothed using a Gaussian filter of standard deviation . The energy we optimize is a weighted sum of a data term , a smoothness term and a matching term : (17)
For the three terms, we use a robust penalizer with which has shown excellent results (Sun et al, 2014b).
Data term.
The data term is a separate penalization of the color and gradient constancy assumptions with a normalization factor as proposed by Zimmer et al (2011). We start from the optical flow constraint assuming brightness constancy: the spatiotemporal gradient. A basic way to build a data term is to penalize it, i.e. with the tensor defined by . As highlighted by Zimmer et al (2011), such a data term adds a higher weight in locations corresponding to high spatial image derivatives. We normalize it by the norm of the spatial derivatives plus a small factor to avoid division by zero, and to reduce a bit the influence in tiny gradient locations (Zimmer et al, 2011). Let be the normalized tensor with . We set in the following. To deal with color images, we consider the tensor defined for a channel denoted by upper indices and we penalize the sum over channels: . We consider images in the RGB color space.
We separately penalize the gradient constancy assumption (Bruhn et al, 2005). Let and be the derivatives of the images with respect to the and axis respectively. Let be the tensor for the channel including the normalization
The data term is the sum of two terms, balanced by two weights and :
(18) 
Smoothness term.
Matching term.
The matching term encourages the flow estimation to be similar to a precomputed vector field . To this end, we penalize the difference between and using the robust penalizer . Since the matching is not totally dense, we add a binary term which is equal to if and only if a match is available at .
We also multiply each matching penalization by a weight , which is low in uniform regions where matching is ambiguous and when matched patches are dissimilar. To that aim, we rely on , the minimum eigenvalue of the autocorrelation matrix multiplied by . We also compute the visual similarity between matches as . We then compute the score as a Gaussian kernel on weighted by with a parameter , experimentally set to . More precisely, we define at each point with a match as:
The matching term is then .
Minimization.
This energy objective is nonconvex and nonlinear. To solve it, we use a numerical optimization algorithm similar to that of Brox et al (2004). An incremental coarse-to-fine warping strategy is used with a downsampling factor . The remaining equations are still nonlinear due to the robust penalizers. We apply 5 inner fixed-point iterations in which the nonlinear weights and the flow increments are updated alternately, each while the other is held fixed. To approximate the solution of the linear system, we use 25 iterations of the Successive Over-Relaxation (SOR) method (Young and Rheinboldt, 1971).
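The innermost step, a fixed number of SOR sweeps on a linear system, can be sketched as below. This is a generic textbook SOR solver on a small dense system, not the solver used for DeepFlow: the relaxation factor `omega=1.5` and the toy system are assumptions chosen for illustration only.

```python
def sor_solve(A, b, omega=1.5, iterations=25):
    """Approximate the solution of A x = b with successive over-relaxation.

    A is a list of rows with a nonzero diagonal; omega (assumed value) is the
    relaxation factor, with omega = 1 reducing to Gauss-Seidel.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):
            # Use the most recent values of the other unknowns (in-place update).
            sigma = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gauss_seidel = (b[i] - sigma) / A[i][i]
            x[i] = (1.0 - omega) * x[i] + omega * gauss_seidel
    return x

# Toy symmetric, diagonally dominant system whose exact solution is (1, 2, 3).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]
x = sor_solve(A, b)
```

A fixed iteration budget gives an approximate rather than exact solution, which is sufficient here because the linear system is rebuilt at every fixed-point iteration anyway.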
5 Experiments
This section presents an experimental evaluation of DeepMatching and DeepFlow. The datasets and metrics used to evaluate DeepMatching and DeepFlow are introduced in Section 5.1. Experimental results are given in Sections 5.2 and 5.3 respectively.
5.1 Datasets and metrics
In this section we briefly introduce the matching and flow datasets used in our experiments. Since consecutive frames of a video are well suited to evaluating a matching approach, we use several optical flow datasets to evaluate both matching and flow quality, but we rely on different metrics.
The Mikolajczyk dataset
was originally proposed by Mikolajczyk et al (2005) to evaluate and compare the performance of keypoint detectors and descriptors. It is one of the standard benchmarks for evaluating matching approaches. The dataset consists of 8 sequences of 6 images each, viewing a scene under different conditions, such as illumination changes or viewpoint changes. The images of a sequence are related by homographies. During the evaluation, we comply with the standard procedure in which the first image of each scene is matched to the 5 remaining ones. Since our goal is to study the robustness of DeepMatching to geometric distortions, we follow HaCohen et al (2011) and restrict our evaluation to the 4 most difficult sequences with viewpoint changes: bark, boat, graf and wall.
The MPISintel dataset (Butler et al, 2012)
is a challenging evaluation benchmark for optical flow estimation, constructed from realistic computer-animated films. The dataset contains sequences with large motions and specular reflections. In the training set, more than of the pixels have a motion over pixels, approximately over pixels. We use the “final” version, featuring rendering effects such as motion blur, defocus blur and atmospheric effects. Note that ground-truth optical flows for the test set are not publicly available.
The Middlebury dataset (Baker et al, 2011)
has been extensively used for evaluating optical flow methods. The dataset contains complex motions, but most of the motions are small. Less than of the pixels have a motion over pixels, and no motion exceeds pixels (training set). Ground-truth optical flows for the test set are not publicly available.
The Kitti dataset (Geiger et al, 2013)
contains real-world sequences taken from a driving platform. The dataset includes non-Lambertian surfaces, different lighting conditions, a large variety of materials and large displacements. More than 16% of the pixels have motion over 20 pixels. Again, ground-truth optical flows for the test set are not publicly available.
Performance metric for matching.
Choosing a performance measure for matching approaches is delicate. Matching approaches typically do not return dense correspondences, but output varying numbers of matches. Furthermore, correspondences might be concentrated in different areas of the image.
Most matching approaches, including DeepMatching, are based on establishing correspondences between patches. Given a pair of matching patches, it is possible to obtain a list of pixel correspondences for all pixels within the patches. We introduce a measure based on the number of correctly matched pixels compared to the overall number of pixels. We define “accuracy@” as the proportion of “correct” pixels from the first image with respect to the total number of pixels. A pixel is considered correct if its pixel match in the second image is closer than pixels to the ground truth. In practice, we use a threshold of 10 pixels, as this represents a sufficiently precise estimation (about 1% of the image diagonal for all datasets), while allowing some tolerance in blurred areas that are difficult to match exactly. If a pixel belongs to several matches, we choose the one with the highest score to predict its correspondence. Pixels that do not belong to any patch have an infinite error.
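The measure can be written down compactly. The sketch below assumes that patch matches have already been expanded into per-pixel candidates, stored as a dictionary from first-image pixels to lists of (position, score) pairs; this data layout and the function name are illustrative choices, not part of the original evaluation code.

```python
import math

def accuracy_at(threshold, predicted, ground_truth):
    """Fraction of first-image pixels matched within `threshold` pixels.

    predicted: dict pixel -> list of (matched_position, score) candidates
    ground_truth: dict pixel -> true position in the second image
    """
    correct = 0
    for pixel, true_pos in ground_truth.items():
        candidates = predicted.get(pixel, [])
        if not candidates:
            continue  # unmatched pixel: infinite error, never counted as correct
        # If a pixel belongs to several matches, keep the highest-scoring one.
        best_pos, _ = max(candidates, key=lambda c: c[1])
        error = math.hypot(best_pos[0] - true_pos[0], best_pos[1] - true_pos[1])
        if error < threshold:
            correct += 1
    return correct / len(ground_truth)

# Toy example (assumed data): 4 pixels, one of them unmatched.
ground_truth = {(0, 0): (10, 0), (1, 0): (11, 0), (2, 0): (12, 0), (3, 0): (13, 0)}
predicted = {
    (0, 0): [((12, 3), 0.9)],                    # error about 3.6 px: correct
    (1, 0): [((40, 0), 0.2), ((11, 1), 0.7)],    # best-scoring candidate is 1 px off
    (2, 0): [((30, 0), 0.5)],                    # error 18 px: incorrect
}
acc = accuracy_at(10, predicted, ground_truth)
```

Because the denominator is the total number of pixels, a method that returns few but precise matches is penalized for the pixels it leaves uncovered.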
Performance metric for optical flow.
To evaluate optical flow, we follow the standard protocol and measure the average endpoint error over all pixels, denoted as “EPE”. The “s10-40” variant measures the EPE only for pixels with a ground-truth displacement between 10 and 40 pixels, and likewise for “s0-10” and “s40+”. In all cases, scores are averaged over all image pairs to yield the final result for a given dataset.
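As a small illustration of these metrics, the sketch below computes the EPE of one image pair, optionally restricted to a range of ground-truth displacement magnitudes. Flows are given as flat lists of (u, v) vectors; the half-open interval convention at the bin boundaries is an assumption.

```python
import math

def epe(flow, gt_flow, min_disp=0.0, max_disp=math.inf):
    """Average endpoint error over pixels whose ground-truth displacement
    magnitude lies in [min_disp, max_disp)."""
    errors = [
        math.hypot(u - gu, v - gv)
        for (u, v), (gu, gv) in zip(flow, gt_flow)
        if min_disp <= math.hypot(gu, gv) < max_disp
    ]
    return sum(errors) / len(errors) if errors else float("nan")

# Toy example (assumed data): ground-truth magnitudes 5, 20 and 50 pixels.
gt = [(3.0, 4.0), (12.0, 16.0), (30.0, 40.0)]
est = [(3.0, 5.0), (12.0, 13.0), (36.0, 48.0)]   # endpoint errors 1, 3 and 10

epe_all = epe(est, gt)
s0_10 = epe(est, gt, 0.0, 10.0)
s10_40 = epe(est, gt, 10.0, 40.0)
s40_plus = epe(est, gt, 40.0)
```

The dataset-level score would then be the mean of these per-pair values over all image pairs.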
5.2 Matching Experiments
In this section, we evaluate DeepMatching (DM). We present results for all datasets introduced above except Middlebury, which does not feature long-range motions, the main difficulty in image matching. When evaluating on the Mikolajczyk dataset, we employ the scale and rotation invariant version of DM presented in Section 4.2. For all the matching experiments reported in this section, we use the Mikolajczyk dataset and the training sets of MPISintel and Kitti.
5.2.1 Impact of the parameters
We optimize the different parameters of DM jointly on all datasets. To prevent overfitting, we use the same parameters across all datasets.
Pixel descriptor parameters:
We first optimize the parameters of the pixel representation (Section 3.2): , , (different smoothing stages), (sigmoid slope) and (regularization constant). After performing a grid search, we find that good results are obtained at , and across all datasets. Figure 11 shows the accuracy@10 in the neighborhood of these values for all parameters. Image presmoothing seems to be crucial for JPEG images (Mikolajczyk dataset), as it smooths out compression artifacts, whereas it slightly degrades performance for uncompressed PNG images (MPISintel and Kitti). As expected, similar findings are observed for the regularization constant since it acts as a regularizer that reduces the impact of small gradients (i.e. noise). In the following, we thus use low values of and when dealing with PNG images (we set and , other parameters are unchanged).
Nonlinear rectification:
We also evaluate the impact of the parameter of the nonlinear rectification obtained by applying power normalization, see Eq. (11). Figure 12 displays the accuracy@10 for various values of . We can observe that the optimal performance is achieved at for all datasets. We use this value in the remainder of our experiments.
5.2.2 Evaluation of the backtracking and scoring schemes
We now evaluate two improvements of DM with respect to the previous version published in Weinzaepfel et al (2013), referred to as DM*:

Backtracking (BT) entry points: in DM* we select as entry points local maxima in the correlation maps from all pyramid levels. The new alternative is to start from all possible points in the top pyramid level.

Scoring scheme: In DM* we scored atomic correspondences based on the correlation values of start and end point of the backtracking path. The new scoring scheme is the sum of correlation values along the full backtracking path.
R  New BT entry points  New scoring scheme  accuracy@10  memory usage  matching time
Mikolajczyk dataset  
1/4  0.620  0.9 GB  1.0 min  
1/2  0.848  5.5 GB  20 min  
1/2  ✓  0.864  5.5 GB  7.3 min  
1/2  ✓  ✓  0.878  4.4 GB  6.3 min 
MPISintel dataset (final)  
1/4  0.822  0.4 GB  2.4 sec  
1/2  0.880  6.3 GB  55 sec  
1/2  ✓  0.890  6.3 GB  16 sec  
1/2  ✓  ✓  0.892  4.6 GB  16 sec 
Kitti dataset  
1/4  0.772  0.4 GB  2.0 sec  
1/2  0.841  6.3 GB  39 sec  
1/2  ✓  0.855  6.3 GB  14 sec  
1/2  ✓  ✓  0.856  4.7 GB  14 sec 
We report results for the different variants in Table 1 on each dataset. The first two rows for each dataset correspond to the exact settings used for DM* (i.e. with an image resolution of 1/4 and 1/2). We observe a steady increase in performance on all datasets when we add the new scoring and backtracking approach. Starting from all possible entry points in the top pyramid level (i.e. considering all possible translations) yields slightly better results than starting from local maxima. This demonstrates that some ground-truth matches are not covered by any local maximum. By enumerating all possible patch translations from the top level, we instead ensure that the space of all possible matches is fully explored.
Furthermore, it is interesting to note that memory usage and runtime significantly decrease when using the new options. This is because (1) searching and storing local maxima (which are exponentially more numerous in lower pyramid levels) is no longer necessary, and (2) the new scoring scheme allows for further optimization, i.e. early pruning of backtracking paths (Section 3.3).
5.2.3 Approximate DeepMatching
We now evaluate the performance of approximate DeepMatching (Section 4.1) and report its runtime and memory usage. We evaluate and compare two different ways of reducing the computational load. The first one simply consists in downsizing the input images, and upscaling the resulting matches accordingly. The second option is the compression scheme proposed in Section 4.1.
We evaluate both schemes jointly by varying the input image size (expressed as a fraction R of the original resolution) and the size D of the prototype dictionary (i.e. parameter of k-means in Section 4.1). R = 1 corresponds to the original dataset image size (no downsizing). We display the results in terms of matching accuracy (accuracy@10) against memory consumption in Figure 13 and as a function of in Figure 14. Figure 13 shows that DeepMatching can be computed in an approximate manner for any given memory budget. Unsurprisingly, too low settings (e.g. , ) result in a strong loss of performance. It should be noted that we were unable to compute DeepMatching at full resolution (R = 1) for , as the memory consumption explodes. As a consequence, all subsequent experiments in the paper are done at R = 1/2. In Figure 14, we observe that good trade-offs are achieved for dictionary sizes comprised in . For instance, on MPISintel, at , 94% of the performance of the uncompressed case () is reached for half the computation time and one third of the memory usage. Detailed timings of the different stages of DeepMatching are given in Table 2. As expected, only the bottom-up pass is affected by the approximation, with a runtime of the different operations involved (patch correlations, max-pooling, subsampling, aggregation and nonlinear rectification) roughly proportional to (or to , the actual number of atomic patches, if ). The overhead of clustering the dictionary prototypes with k-means appears negligible, with the exception of the largest dictionary size () for which it induces a slightly longer runtime than in the uncompressed case. Overall, the proposed method for approximating DeepMatching is highly effective.
GPU Implementation.
We have implemented DM on GPU in the Caffe framework (Jia et al, 2014). Using existing Caffe layers like ConvolutionLayer and PoolingLayer, the implementation is straightforward for most layers. We had to specifically code a few layers which are not available in Caffe (e.g. the backtracking pass) [Footnote 3: Although the backtracking is conceptually close to the backpropagation training algorithm, it differs in terms of how the scores are accumulated for each path.]. For the aggregation layer, which consists in selecting and averaging 4 children channels out of many channels, we relied on the sparse matrix multiplication in the cuSPARSE toolbox. Detailed timings are given in Table 2 on a GeForce Titan X. Our code runs in about 0.2s for a pair of MPISintel images. As expected, the computation bottleneck essentially lies in the computation of bottom-level patch correlations and the backtracking pass. Note that computing patch descriptors takes significantly more time, in proportion, than on CPU: it takes about 0.024s = 11% of total time (not shown in table). This is because it involves a succession of many small layers (image smoothing, gradient extraction and projection, etc.), which causes overhead and is rather inefficient.
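To see why the aggregation layer maps naturally onto a sparse matrix product, consider the following CPU sketch. Each parent channel is the average of a few selected child channels, which is the same as multiplying the stack of child maps by a sparse matrix whose rows contain 0.25 at the selected columns. The sketch is a deliberate simplification: it ignores the spatial shifts applied to the children in DeepMatching, uses flattened maps, and the function name and toy data are hypothetical.

```python
def aggregate_channels(child_maps, children_of):
    """Average the selected child channels for each parent.

    child_maps: list of flattened correlation maps, one per child channel
    children_of: for each parent, the indices of its child channels
    This equals a sparse matrix product with weight 1/len(children) at the
    selected (parent, child) entries and zero elsewhere.
    """
    n = len(child_maps[0])
    parents = []
    for children in children_of:
        weight = 1.0 / len(children)
        parents.append([sum(weight * child_maps[c][i] for c in children)
                        for i in range(n)])
    return parents

# Toy example (assumed data): 5 child channels with 2 values each, 2 parents.
child_maps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 2.0]]
children_of = [[0, 1, 2, 3], [1, 2, 3, 4]]
parents = aggregate_channels(child_maps, children_of)
```

Since every row of the implied matrix has only 4 nonzero entries regardless of the number of channels, a sparse representation avoids touching the many channels that a given parent does not use.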
Table 2: Detailed timings of the different stages of DeepMatching, in seconds.
Proc. unit  R    D     Patch clustering  Patch correlations  Max-pooling + subsampling  Aggregation  Nonlinear rectification  Backtracking  Total time
CPU         1/2  64    0.3               0.2                 0.4                        0.9          0.8                      5.1           7.7
CPU         1/2  1024  1.3               0.7                 0.6                        1.0          1.3                      5.8           10.7
CPU         1/2  -     -                 4.3                 1.6                        1.0          3.2                      6.2           16.4
GPU         1/2  -     -                 0.084               0.012                      0.017        0.013                    0.053         0.213
5.2.4 Comparison to the state of the art
We compare DM with several baselines and stateoftheart matching algorithms, namely:

dense HOG matching, followed by nearest-neighbor matching with reciprocal verification as done in LDOF (Brox and Malik, 2011), referred to as HOGNN,

Generalized PatchMatch (GPM) (Barnes et al, 2010), with default parameters, 32x32 patches and 20 iterations (best settings in our experiments) [Footnote 5: We used the online code.],

Kdtree PatchMatch (KPM) (Sun, 2012), an improved version of PatchMatch based on better patch descriptors and kdtrees optimized for correspondence propagation,

NonRigid Dense Correspondences (NRDC) (HaCohen et al, 2011), an improved version of GPM based on a multiscale iterative expansion/contraction strategy [Footnote 6: We report results from the original paper.],

SIFTflow (Liu et al, 2011), a dense matching algorithm based on an energy minimization where pixels are represented as SIFT features and a smoothness term is incorporated to explicitly preserve spatial discontinuities,

Scaleless SIFT (SLS) (Hassner et al, 2012), an improvement of SIFTflow to handle scale changes (multiple sized SIFTs are extracted and combined to form a scaleinvariant pixel representation),

Deformable Spatial Pyramid Matching (DSP) (Kim et al, 2013), a dense matching approach based on a coarse-to-fine (top-down) strategy where inference is performed with (inexact) loopy belief propagation.
SIFTNN, HOGNN and DM output sparse matches, whereas the other methods output fully dense correspondence fields. SIFT keypoints, GPM, NRDC and DaisyFF are scale and rotation invariant, whereas HOGNN, KPM, SIFTflow, SLS and DSP are not. We, therefore, do not report results for these latter methods on the Mikolajczyk dataset which includes image rotations and scale changes.
Statistics about each method (average number of matches per image and their coverage) are reported in Table 3. Coverage is computed as the proportion of points on a regular grid with 10 pixel spacing for which there exists a correspondence (in the raw output of the considered method) within a 10 pixel neighborhood. Thus, it measures how well matches “cover” the image. Table 3 shows that DeepMatching outputs 2 to 7 times more matches than SIFTNN and a comparable number to HOGNN. Yet, the coverage for DM matches is much higher than for HOGNN and SIFTNN. This shows that DM matches are well distributed over the entire image, which is not the case for HOGNN and SIFTNN, as they have difficulties estimating matches in regions with weak or repetitive textures.
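The coverage statistic is easy to reproduce. The sketch below assumes a grid starting at the image origin and interprets the "10 pixel neighborhood" as a Euclidean radius; both choices, as well as the function name, are assumptions made for illustration.

```python
def coverage(match_points, width, height, spacing=10, radius=10):
    """Fraction of regular grid points with at least one match within `radius`."""
    grid = [(x, y)
            for y in range(0, height, spacing)
            for x in range(0, width, spacing)]
    covered = 0
    for gx, gy in grid:
        # Compare squared distances to avoid taking square roots.
        if any((px - gx) ** 2 + (py - gy) ** 2 <= radius ** 2
               for px, py in match_points):
            covered += 1
    return covered / len(grid)

# Toy example (assumed data): a 40x20 image has 4 x 2 = 8 grid points.
one_match = coverage([(5, 5)], width=40, height=20)
two_matches = coverage([(5, 5), (35, 5)], width=40, height=20)
```

A single match covers only the four grid points around it, so coverage rewards matches that are spread over the image rather than clustered in textured regions.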
Table 3: Average number of matches per image (#) and their coverage for each method.
Method     Mikolajczyk     MPISintel (final)  Kitti
           #     coverage  #     coverage     #     coverage
SIFTNN     2084  0.59      836   0.25         1299  0.38
HOGNN      -     -         4576  0.39         4293  0.34
KPM        -     -         446K  1            462K  1
GPM        545K  1         446K  1            462K  1
NRDC       545K  1         446K  1            462K  1
SIFTflow   -     -         446K  1            462K  1
SLS        -     -         446K  1            462K  1
DaisyFF    545K  1         446K  1            462K  1
DSP        -     -         446K  1            462K  1
DM (ours)  3120  0.81      5920  0.96         5357  0.88
Table 4: Matching accuracy, memory usage and matching time of the compared methods.
Method    R    D     accuracy@10  memory usage  matching time
Mikolajczyk dataset
SIFTNN    -    -     0.674        0.2 GB        1.4 sec
GPM       -    -     0.303        0.1 GB        2.4 min
NRDC      -    -     0.692        0.1 GB        2.5 min
DaisyFF   -    -     0.410        6.1 GB        16 min
DM        1/4  -     0.657        0.9 GB        38 sec
DM        1/2  1024  0.820        1.5 GB        4.5 min
DM        1/2  -     0.878        4.4 GB        6.3 min
MPISintel dataset (final)
SIFTNN    -    -     0.684        0.2 GB        2.7 sec
HOGNN     -    -     0.712        3.4 GB        32 sec
KPM       -    -     0.738        0.3 GB        7.3 sec
GPM       -    -     0.812        0.1 GB        1.1 min
SIFTflow  -    -     0.890        1.0 GB        29 sec
SLS       -    -     0.824        4.3 GB        16 min
DaisyFF   -    -     0.873        6.8 GB        12 min
DSP       -    -     0.853        0.8 GB        39 sec
DM        1/4  -     0.835        0.3 GB        1.6 sec
DM        1/2  1024  0.869        1.8 GB        10 sec
DM        1/2  -     0.892        4.6 GB        16 sec
Kitti dataset
SIFTNN    -    -     0.489        0.2 GB        1.7 sec
HOGNN     -    -     0.537        2.9 GB        24 sec
KPM       -    -     0.536        0.3 GB        17 sec
GPM       -    -     0.661        0.1 GB        2.7 min
SIFTflow  -    -     0.673        1.0 GB        25 sec
SLS       -    -     0.748        4.4 GB        17 min
DaisyFF   -    -     0.796        7.0 GB        11 min
DSP       -    -     0.580        0.8 GB        2.9 min
DM        1/4  -     0.800        0.3 GB        1.6 sec
DM        1/2  1024  0.812        1.7 GB        10 sec
DM        1/2  -     0.856        4.7 GB        14 sec
Quantitative results are listed in Table 4, and qualitative results in Figures 15, 16 and 17. Overall, DM significantly outperforms all other methods, even when reduced settings are used (e.g. for image resolution and prototypes). As expected, SIFTNN performs rather well in the presence of global image transformations (Mikolajczyk dataset), but yields the worst results for more complex motions (flow datasets). Figures 16 and 17 illustrate the reason: SIFT’s large patches are far too coarse to follow motion boundaries precisely. The same issue holds for HOGNN. Methods predicting dense correspondence fields return a more precise estimate, yet most of them (KPM, GPM, SIFTflow, DSP) are not robust to repetitive textures in the Kitti dataset (Figure 17), as they rely on weakly discriminative small patches. Despite this limitation, SIFTflow and DSP are still able to perform well on MPISintel, as this dataset contains few scale changes. Other dense methods, NRDC, SLS and DaisyFF, can handle patches of different sizes and thus perform better on Kitti. In turn, this comes at the cost of reduced performance on the MPISintel or Mikolajczyk datasets (qualitative results are in Figure 15). In conclusion, DM outperforms all other methods on the 3 datasets, including DSP, which also relies on hierarchical matching.
In terms of computing resources, DeepMatching with full settings (, ) is one of the most costly methods (only SLS and DaisyFF require the same order of memory and longer runtime). The scale and rotation invariant version of DM, used for the Mikolajczyk dataset, is slow compared to most other approaches, due to its sequential processing (i.e. treating each combination of rotation and scaling sequentially), yet yields near-perfect results. However, DM with reduced settings is very competitive with the other approaches. On MPISintel and Kitti, for instance, DM at quarter resolution has a runtime comparable to the fastest method, SIFTNN, with a reasonable memory usage, while still outperforming nearly all methods in terms of the accuracy@10 measure.
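[Figure 15: Qualitative matching results on the Mikolajczyk dataset. Columns: bark, boat, graf, wall. Rows: ground truth, SIFTNN, GPM, NRDC, DaisyFF, DeepMatching.]
[Figure 16: Qualitative matching results on MPISintel. Columns: correspondence field and image/error map for three image pairs. Rows: GT, SIFTNN, HOGNN, KPM, GPM, SIFTflow, SLS, DaisyFF, DSP, DM.]
[Figure 17: Qualitative matching results on Kitti. Columns: correspondence field and image/error map for two image pairs. Rows: GT, SIFTNN, HOGNN, KPM, GPM, SIFTflow, SLS, DaisyFF, DSP, DM.]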
5.3 Optical Flow Experiments
We now present experimental results for the optical flow estimation. Optical flow is predicted using the variational framework presented in Section 4.3 that takes as input a set of matches. In the following, we evaluate the impact of DeepMatching against other matching methods, and compare to the state of the art.
5.3.1 Optimization of the parameters
We optimize the parameters of DeepFlow on a subset of the MPISintel training set (20%), called “small” set, and report results on the remaining image pairs (80%, called “validation set”) and on the training sets of Kitti and Middlebury. Groundtruth optical flows for the three test sets are not publicly available, in order to prevent parameter tuning on the test set.
We first optimize the different flow parameters (, , , and ) by employing a gradient descent strategy with multiple initializations followed by a local grid search. For the data term, we find an optimum at , which is equivalent to removing the color constancy assumption. This can be explained by the fact that the “final” version contains atmospheric effects, reflections, blurs, etc. The remaining parameters are optimal at , , , . These parameters are used in the remainder of the experiments for DeepFlow, i.e. using matches obtained with DeepMatching, except when reporting results on the Kitti and Middlebury test sets in Section 5.3.3. In this case, the parameters are optimized on their respective training sets.
5.3.2 Impact of the matches on the flow
We examine the impact of different matching methods on the flow, i.e., different matches are used in DeepFlow, see Section 4.3. For all matching approaches evaluated in the previous section, we use their output as matching term in Eq. (17). Because these approaches may output matches with statistics different from DM, we separately optimize the flow parameters for each matching approach on the small training set of MPISintel [Footnote 7: Note that this systematically improves the endpoint error compared to using the raw dense correspondence fields as flow.].
Table 5: Endpoint error of the flow obtained with different matching methods.
Method    R    D     MPISintel  Kitti   Middlebury
No Match  -    -     5.863      8.791   0.274
SIFTNN    -    -     5.733      7.753   0.280
HOGNN     -    -     5.458      8.071   0.273
KPM       -    -     5.560      15.289  0.275
GPM       -    -     5.561      17.491  0.286
SIFTflow  -    -     5.243      12.778  0.283
SLS       -    -     5.307      10.366  0.288
DaisyFF   -    -     5.145      10.334  0.289
DSP       -    -     5.493      15.728  0.283
DM        1/2  1024  4.350      7.899   0.320
DM        1/2  -     4.098      4.407   0.328
Table 5 shows the endpoint error, averaged over all pixels. Clearly, a sufficiently dense and accurate matching like DM considerably improves the flow estimation on datasets with large displacements (MPISintel, Kitti). In contrast, none of the methods presented has a tangible effect on the Middlebury dataset, where the displacements are small.
The relatively small gains achieved by SIFTNN and HOGNN on MPISintel and Kitti are due to the fact that a lot of regions with large displacements are not covered by any matches, such as the sky or the blurred character in the first and second column of Figure 18. Hence, SIFTNN and HOGNN have only a limited impact on the variational approach. On the other hand, the gains are also small (or even negative) for the dense methods despite the fact that they output significantly more correspondences. We observe for these methods that the weight of the matching term tends to be small after optimizing the parameters, thus indicating that the matches are found unreliable and noisy during training. The cause is clearly visible in Figure 17, where large portions containing repetitive textures (e.g. road, trees) are incorrectly matched. The poor quality of these matches even leads to a significant drop in performance on the Kitti dataset.
In contrast, DeepMatching generates accurate matches that cover the image well, which boosts the optical flow accuracy in case of large displacements. Namely, we observe a relative improvement of 30% on MPISintel and of 50% on Kitti. It is interesting to observe that DM is able to effectively prune false matches arising in occluded areas (black areas in Figures 16 and 17). This is due to the reciprocal verification filtering incorporated in DM (Eq. (16)). When using the approximation with 1024 prototypes, however, a significant drop is observed on the Kitti dataset, while the performance remains good on MPISintel. This indicates that approximating DeepMatching can result in a significant loss of robustness when matching repetitive textures, which are more frequent in Kitti than in MPISintel.
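The quoted relative improvements can be checked from the endpoint errors in Table 5, assuming that the baseline is the “No Match” row and that DM refers to the row with R = 1/2 and no prototype compression. The helper name is illustrative.

```python
def relative_improvement(baseline_epe, new_epe):
    """Relative reduction of the endpoint error with respect to a baseline."""
    return (baseline_epe - new_epe) / baseline_epe

# Endpoint errors taken from Table 5 ("No Match" versus DM at R = 1/2).
sintel = relative_improvement(5.863, 4.098)   # about 0.30
kitti = relative_improvement(8.791, 4.407)    # about 0.50
```

Both values agree with the rounded figures of 30% and 50% given in the text.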
5.3.3 Comparison to the state of the art
In this section, we compare DeepFlow to the state of the art on the test sets of the MPISintel, Kitti and Middlebury datasets. For these datasets, the results are submitted to a dedicated server which performs the evaluation. Prior to submitting our results for the Kitti and Middlebury test sets, we optimized the parameters on the respective training set.
Results on MPISintel.
[Figure 18: Example results on MPISintel. Rows: images, ground truth, DeepMatching, DeepFlow, MDPFlow2, LDOF.]
Table 6 compares our method to state-of-the-art algorithms on the MPISintel test set. A comparison with the preliminary version of DeepFlow (Weinzaepfel et al, 2013), referred to as DeepFlow*, is also provided. In this early version, we used a constant smoothness weight instead of the local one used here (see Section 4.3) and used DM* as input matches. We can see that DeepFlow is among the best performing methods on MPISintel, particularly for large displacements. This is due to the use of a reliable matching term in the variational approach, and this property is shared by all top-performing approaches, e.g. (Revaud et al, 2015; Leordeanu et al, 2013). Furthermore, it is interesting to note that among the top performers on MPISintel, 3 methods out of 6 actually employ DeepMatching. In particular, the top-3 method EpicFlow (Revaud et al, 2015) relies on the output of DeepMatching to produce a piecewise affine flow, and SparseFlowFused (Timofte and Van Gool, 2015) combines matches obtained with DeepMatching and another algorithm.
We refer to the webpage of the MPISintel dataset for complete results including the “clean” version.
Table 6: Results on the MPISintel test set.
Method                                        EPE    EPEocc  s0-10  s10-40  s40+    Time
FlowFields (Bailer et al, 2015)               5.810  31.799  1.157  3.739   33.890  23s
DiscreteFlow (Menze et al, 2015)              5.810  31.799  1.157  3.739   33.890  180s
EpicFlow (Revaud et al, 2015)                 6.285  32.564  1.135  3.727   38.021  16.4s
TF+OFM (Kennedy and Taylor, 2015)             6.727  33.929  1.512  3.765   39.761  400s
DeepFlow                                      6.928  38.166  1.182  3.859   42.854  25s
SparseFlowFused (Timofte and Van Gool, 2015)  7.189  3.286   1.275  3.963   44.319  20s
DeepFlow* (Weinzaepfel et al, 2013)           7.212  38.781  1.284  4.107   44.118  19s
S2DMatching (Leordeanu et al, 2013)           7.872  40.093  1.172  4.695   48.782  2000s
LocalLayering (Sun et al, 2014a)              8.043  40.879  1.186  4.990   49.426  -
Classic+NLP (Sun et al, 2014b)                8.291  40.925  1.208  5.090   51.162  800s
MDPFlow2 (Xu et al, 2012)                     8.445  43.430  1.420  5.449   50.507  709s
NLTGVSC (Ranftl et al, 2014)                  8.746  42.242  1.587  4.780   53.860  -
LDOF (Brox and Malik, 2011)                   9.116  42.344  1.485  4.839   57.296  30s
Timings.
As mentioned before, DeepMatching at half resolution takes 15 seconds to compute on CPU and 0.2 seconds on GPU. The variational part requires 10 additional seconds on CPU. Note that implementing it on GPU could yield a significant speed-up as well. DeepFlow consequently takes 25 seconds in total on a single CPU core @ 3.6 GHz, or 10.2s with GPU+CPU. This is of the same order of magnitude as the fastest among the best competitors, EpicFlow (Revaud et al, 2015).
Results on Kitti.
Table 7 summarizes the main results on the Kitti benchmark (see the official website for complete results), with parameters optimized on the Kitti training set. EPE-noc is the EPE computed only in non-occluded areas. “Out 3” is the proportion of incorrect pixel correspondences for an error threshold of 3 pixels, i.e. the fraction of pixels whose endpoint error exceeds 3 pixels, and likewise “Out-Noc 3” for non-occluded areas. In terms of EPE-noc, DeepFlow is on par with the best approaches, but it performs somewhat worse in occluded areas. This is due to a specificity of the Kitti dataset, in which motion is mostly homographic (especially on the image borders, where most surfaces, like roads and walls, are planar). In such cases, flow is better predicted using an affine motion prior, which locally approximates homographies well (a constant motion prior is used in DeepFlow). As a matter of fact, all top-performing methods in terms of total EPE output piecewise affine optical flow, either due to affine regularizers (BTF-ILLUM (Demetz et al, 2014), NLTGV-SC (Ranftl et al, 2014), TGV2ADCSIFT (Braux-Zin et al, 2013)) or due to local affine estimators (EpicFlow (Revaud et al, 2015)).
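To make the four error measures concrete, here is a minimal sketch of how they could be computed from an estimated flow, a ground-truth flow and an occlusion mask. It is not the official Kitti evaluation code: the function name, the flat-list representation of the flow fields, the strict inequality used for the outlier test and the assumption that every pixel has ground truth are all illustrative choices.

```python
import math

def kitti_style_metrics(flow_est, flow_gt, occluded, threshold=3.0):
    """Return EPE, EPE-noc, Out and Out-Noc for flat lists of (u, v) vectors.

    occluded[i] is True when pixel i is occluded. A pixel counts as an outlier
    when its endpoint error is strictly greater than `threshold` (assumption).
    """
    errors = [math.hypot(ue - ug, ve - vg)
              for (ue, ve), (ug, vg) in zip(flow_est, flow_gt)]
    noc_errors = [e for e, occ in zip(errors, occluded) if not occ]

    def mean(values):
        return sum(values) / len(values) if values else None

    def outlier_ratio(values):
        if not values:
            return None
        return sum(1 for e in values if e > threshold) / len(values)

    return {
        "EPE": mean(errors),
        "EPE-noc": mean(noc_errors),
        "Out": outlier_ratio(errors),
        "Out-Noc": outlier_ratio(noc_errors),
    }

# Toy example: four pixels with errors 1, 0, 5 and 4; the last one is occluded.
flow_gt = [(0.0, 0.0)] * 4
flow_est = [(1.0, 0.0), (0.0, 0.0), (5.0, 0.0), (0.0, 4.0)]
occluded = [False, False, False, True]
metrics = kitti_style_metrics(flow_est, flow_gt, occluded)
```

In this toy example the occluded pixel raises the overall EPE (2.5) above EPE-noc (2.0), which mirrors the pattern discussed above: errors concentrated in occluded areas affect EPE and Out but leave the non-occluded measures untouched.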
Note that the learned parameters on Kitti and MPI-Sintel are close. In particular, running the experiments with the same parameters as for MPI-Sintel changes EPE-noc only marginally on the training set. This shows that our method does not suffer from overfitting.
Method                               EPE-noc  EPE  Out-Noc 3  Out 3   Time
DiscreteFlow (Menze et al, 2015)     1.3      3.6  5.77%      16.63%  180s
FlowFields (Bailer et al, 2015)      1.4      3.5  6.23%      14.01%  23s
DeepFlow                             1.4      5.3  6.61%      17.35%  22s
BTF-ILLUM (Demetz et al, 2014)       1.5      2.8  6.52%      11.03%  80s
EpicFlow (Revaud et al, 2015)        1.5      3.8  7.88%      17.08%  16s
TGV2ADCSIFT (Braux-Zin et al, 2013)  1.5      4.5  6.20%      15.15%  12s
DeepFlow* (Weinzaepfel et al, 2013)  1.5      5.8  7.22%      17.79%  17s
NLTGV-SC (Ranftl et al, 2014)        1.6      3.8  5.93%      11.96%  16s
DataFlow (Vogel et al, 2013b)        1.9      5.5  7.11%      14.57%  180s
TF+OFM (Kennedy and Taylor, 2015)    2.0      5.0  10.22%     18.46%  350s
Results on Middlebury.
We optimize the parameters on the Middlebury training set by minimizing the average angular error, with the same strategy as for MPI-Sintel. The learned weights for the matching term are close to zero, owing to the absence of large displacements. DeepFlow obtains an average endpoint error on the test set that is competitive with the state of the art.
6 Conclusion
We have introduced a dense matching algorithm, termed DeepMatching. The proposed algorithm gracefully handles complex non-rigid object deformations and regions with repetitive textures. DeepMatching yields state-of-the-art performance for image matching on the Mikolajczyk (Mikolajczyk et al, 2005), MPI-Sintel (Butler et al, 2012) and Kitti (Geiger et al, 2013) datasets. Integrated in a variational energy minimization approach (Brox and Malik, 2011), the resulting approach for optical flow estimation, termed DeepFlow, shows competitive performance on optical flow benchmarks.
Future work includes incorporating a weighting of the patches in Eq. (2) instead of weighting all patches equally to take into account that different parts of a large patch may belong to different objects. This could improve the performance of DeepMatching for thin objects, such as human limbs.
Acknowledgements.
This work was supported by the European integrated project AXES, the MSR/INRIA joint project, the LabEx PERSYVAL-Lab (ANR-11-LABX-0025), and the ERC advanced grant ALLEGRO.
References
 Bailer et al (2015) Bailer C, Taetz B, Stricker D (2015) Flow Fields: Dense Correspondence Fields for Highly Accurate Large Displacement Optical Flow Estimation
 Baker et al (2011) Baker S, Scharstein D, Lewis JP, Roth S, Black MJ, Szeliski R (2011) A database and evaluation methodology for optical flow. IJCV
 Barnes et al (2010) Barnes C, Shechtman E, Goldman DB, Finkelstein A (2010) The generalized PatchMatch correspondence algorithm. In: ECCV
 Bengio (2009) Bengio Y (2009) Learning deep architectures for AI. Foundations and Trends in Machine Learning
 Black and Anandan (1996) Black MJ, Anandan P (1996) The robust estimation of multiple motions: parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding
 Braux-Zin et al (2013) Braux-Zin J, Dupont R, Bartoli A (2013) A general dense image matching framework combining direct and feature-based costs. In: ICCV
 Brox and Malik (2011) Brox T, Malik J (2011) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans PAMI
 Brox et al (2004) Brox T, Bruhn A, Papenberg N, Weickert J (2004) High accuracy optical flow estimation based on a theory for warping. In: ECCV
 Bruhn et al (2005) Bruhn A, Weickert J, Feddern C, Kohlberger T, Schnörr C (2005) Variational optical flow computation in real time. IEEE Trans on Image Processing
 Butler et al (2012) Butler DJ, Wulff J, Stanley GB, Black MJ (2012) A naturalistic open source movie for optical flow evaluation. In: ECCV
 Chen et al (2013) Chen Z, Jin H, Lin Z, Cohen S, Wu Y (2013) Large displacement optical flow from nearest neighbor fields. In: CVPR
 Dalal and Triggs (2005) Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: CVPR
 Demetz et al (2014) Demetz O, Stoll M, Volz S, Weickert J, Bruhn A (2014) Learning brightness transfer functions for the joint recovery of illumination changes and optical flow. In: ECCV
 Ecker and Ullman (2009) Ecker A, Ullman S (2009) A hierarchical nonparametric method for capturing nonrigid deformations. Image and Vision Computing
 Forsyth and Ponce (2011) Forsyth D, Ponce J (2011) Computer Vision: A Modern Approach. Pearson Education, Limited
 Furukawa et al (2010) Furukawa Y, Curless B, Seitz SM, Szeliski R (2010) Towards Internet-scale multi-view stereo. In: CVPR
 Geiger et al (2013) Geiger A, Lenz P, Stiller C, Urtasun R (2013) Vision meets robotics: The KITTI dataset. IJRR
 HaCohen et al (2011) HaCohen Y, Shechtman E, Goldman DB, Lischinski D (2011) Non-rigid dense correspondence with applications for image enhancement. SIGGRAPH
 Hassner et al (2012) Hassner T, Mayzels V, Zelnik-Manor L (2012) On SIFTs and their scales. In: CVPR
 Horn and Schunck (1981) Horn BKP, Schunck BG (1981) Determining optical flow. Artificial Intelligence
 Jia et al (2014) Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093
 Kennedy and Taylor (2015) Kennedy R, Taylor CJ (2015) Optical flow with geometric occlusion estimation and fusion of multiple frames. In: EMMCVPR
 Keysers et al (2007) Keysers D, Deselaers T, Gollan C, Ney H (2007) Deformation models for image recognition. IEEE Trans PAMI
 Kim et al (2013) Kim J, Liu C, Sha F, Grauman K (2013) Deformable spatial pyramid matching for fast dense correspondences. In: CVPR
 Korman and Avidan (2011) Korman S, Avidan S (2011) Coherency sensitive hashing. In: ICCV
 LeCun et al (1998a) LeCun Y, Bottou L, Bengio Y, Haffner P (1998a) Gradient-based learning applied to document recognition. Proceedings of the IEEE
 LeCun et al (1998b) LeCun Y, Bottou L, Orr G, Muller K (1998b) Efficient backprop. In: Neural Networks: Tricks of the trade
 Leordeanu et al (2013) Leordeanu M, Zanfir A, Sminchisescu C (2013) Locally affine sparse-to-dense matching for motion and occlusion estimation. In: ICCV
 Liu et al (2011) Liu C, Yuen J, Torralba A (2011) SIFT flow: Dense correspondence across scenes and its applications. IEEE Trans PAMI
 Lowe (2004) Lowe DG (2004) Distinctive image features from scale-invariant keypoints. IJCV
 Malik and Perona (1990) Malik J, Perona P (1990) Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A: Optics, Image Science, and Vision
 Menze et al (2015) Menze M, Heipke C, Geiger A (2015) Discrete Optimization for Optical Flow. In: GCPR
 Mikolajczyk et al (2005) Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T, Gool LV (2005) A comparison of affine region detectors. IJCV
 Muja and Lowe (2009) Muja M, Lowe DG (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In: International Conference on Computer Vision Theory and Application (VISSAPP’09), INSTICC Press
 Papenberg et al (2006) Papenberg N, Bruhn A, Brox T, Didas S, Weickert J (2006) Highly accurate optic flow computation with theoretically justified warping. IJCV
 Philbin et al (2010) Philbin J, Isard M, Sivic J, Zisserman A (2010) Descriptor learning for efficient retrieval. In: ECCV
 Ranftl et al (2014) Ranftl R, Bredies K, Pock T (2014) Non-local total generalized variation for optical flow estimation. In: ECCV
 Revaud et al (2015) Revaud J, Weinzaepfel P, Harchaoui Z, Schmid C (2015) EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow. In: CVPR
 Stoll et al (2012) Stoll M, Volz S, Bruhn A (2012) Adaptive integration of feature matches into variational optical flow methods. In: ACCV
 Sun et al (2014a) Sun D, Liu C, Pfister H (2014a) Local layering for joint motion estimation and occlusion detection. In: CVPR
 Sun et al (2014b) Sun D, Roth S, Black M (2014b) A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV
 Sun (2012) Sun J (2012) Computing nearest-neighbor fields via propagation-assisted kd-trees. In: CVPR
 Szeliski (2010) Szeliski R (2010) Computer Vision: Algorithms and Applications. Springer
 Timofte and Van Gool (2015) Timofte R, Van Gool L (2015) Sparse flow: Sparse matching for small to large displacement optical flow. In: Applications of Computer Vision (WACV)
 Tola et al (2008) Tola E, Lepetit V, Fua P (2008) A fast local descriptor for dense matching. In: CVPR
 Tola et al (2010) Tola E, Lepetit V, Fua P (2010) DAISY: An Efficient Dense Descriptor Applied to Wide Baseline Stereo. IEEE Trans PAMI
 Uchida and Sakoe (1998) Uchida S, Sakoe H (1998) A monotonic and continuous two-dimensional warping based on dynamic programming. In: ICPR
 Vogel et al (2013a) Vogel C, Roth S, Schindler K (2013a) An evaluation of data costs for optical flow. In: GCPR
 Vogel et al (2013b) Vogel C, Schindler K, Roth S (2013b) Piecewise rigid scene flow. In: ICCV
 Wedel et al (2009) Wedel A, Cremers D, Pock T, Bischof H (2009) Structure- and motion-adaptive regularization for high accuracy optic flow. In: ICCV
 Weinzaepfel et al (2013) Weinzaepfel P, Revaud J, Harchaoui Z, Schmid C (2013) DeepFlow: Large displacement optical flow with deep matching. In: ICCV
 Werlberger et al (2009) Werlberger M, Trobin W, Pock T, Wedel A, Cremers D, Bischof H (2009) Anisotropic Huber-L1 optical flow. In: BMVC
 Wills et al (2006) Wills J, Agarwal S, Belongie S (2006) A feature-based approach for dense segmentation and estimation of large disparity motion. IJCV
 Xu et al (2012) Xu L, Jia J, Matsushita Y (2012) Motion detail preserving optical flow estimation. IEEE Trans PAMI
 Yang et al (2014) Yang H, Lin W, Lu J (2014) DAISY filter flow: A generalized discrete approach to dense correspondences. In: CVPR
 Young and Rheinboldt (1971) Young DM, Rheinboldt W (1971) Iterative solution of large linear systems. Academic Press, New York, NY
 Zimmer et al (2011) Zimmer H, Bruhn A, Weickert J (2011) Optic flow in harmony. IJCV