Matching neural paths: transfer from recognition to correspondence search

05/19/2017 ∙ by Nikolay Savinov, et al. ∙ ETH Zurich

Many machine learning tasks require finding per-part correspondences between objects. In this work we focus on low-level correspondences, a highly ambiguous matching problem. We propose to use a hierarchical semantic representation of the objects, coming from a convolutional neural network, to solve this ambiguity. Training such a network for low-level correspondence prediction directly might not be an option in some domains where the ground-truth correspondences are hard to obtain. We show how transfer from recognition can be used to avoid such training. Our idea is to mark parts as "matching" if their features are close to each other at all the levels of the convolutional feature hierarchy (neural paths). Although the overall number of such paths is exponential in the number of layers, we propose a polynomial algorithm for aggregating all of them in a single backward pass. The empirical validation is done on the task of stereo correspondence and demonstrates that we achieve competitive results among the methods which do not use labeled target domain data.

1 Introduction

Finding per-part correspondences between objects is a long-standing problem in machine learning. The level at which correspondences are established can go as low as pixels for images or millisecond timestamps for sound signals. Typically, it is highly ambiguous to match at such a low level: a pixel or a timestamp just does not contain enough information to be discriminative and many false positives will follow. A hierarchical semantic representation could help to solve the ambiguity: we could choose the low-level match which also matches at the higher levels. For example, a car contains a wheel which contains a bolt. If we want to check if this bolt matches the bolt in another view of the car, we should check if the wheel and the car match as well.

One possible hierarchical semantic representation could be computed by a convolutional neural network. The features in such a network are composed in a hierarchical manner: the lower-level features are used to compute higher-level features by applying convolutions, max-poolings and non-linear activation functions on them. Nevertheless, training such a convolutional neural network for correspondence prediction directly (e.g., Zbontar and LeCun (2016b), Choy et al. (2016)) might not be an option in some domains where the ground-truth correspondences are hard and expensive to obtain. This raises the question of scalability of such approaches and motivates the search for methods which do not require training correspondence data.

To address the training data problem, we could transfer the knowledge from the source domain where the labels are present to the target domain where no labels or few labeled data are present. The most common form of transfer is from classification tasks. Its promise is two-fold. First, classification labels are among the easiest to obtain, as classification is a natural task for humans. This allows the creation of huge recognition datasets like ImageNet Russakovsky et al. (2015). Second, the features from the low to mid-levels have been shown to transfer well to a variety of tasks Yosinski et al. (2014), Donahue et al. (2013), Razavian et al.

Although there has been huge progress in transfer from classification to detection Girshick (2015), Ren et al. (2015), Sermanet et al. (2013), Redmon et al. (2016), segmentation Long et al. (2015), Badrinarayanan et al. (2015) and other semantic reasoning tasks like single-image depth prediction Eigen and Fergus (2015), the transfer to correspondence search has been limited Long et al. (2014), Kim et al. (2017), Ham et al. (2016).

We propose a general solution to unsupervised transfer from recognition to correspondence search at the lowest level (pixels, sound millisecond timestamps). Our approach is to match paths of activations coming from a convolutional neural network, applied on two objects to be matched. More precisely, to establish matching on the lowest level, we require the features to match at all different levels of convolutional feature hierarchy. Those different-level features form paths. One such path would consist of neural activations reachable from the lowest-level feature to the highest-level feature in the network topology (in other words, the lowest level feature lies in the receptive field of the highest level). Since every lowest-level feature belongs to many paths, we do voting based on all of them.

Although the overall number of such paths is exponential in the number of layers and thus infeasible to compute naively, we prove that the voting is possible in polynomial time in a single backward pass through the network. The algorithm is based on dynamic programming and is similar to the backward pass for gradient computation in the neural network.

Empirical validation is done on the task of stereo correspondence on two datasets: KITTI 2012 Geiger et al. (2012) and KITTI 2015 Menze and Geiger (2015). We quantitatively show that our method is competitive among the methods which do not require labeled target domain data. We also qualitatively show that even dramatic changes in low-level structure can be handled reasonably by our method due to the robustness of the recognition hierarchy: we apply different style transfers Gatys et al. (2015) to corresponding images in KITTI 2015 and still successfully find correspondences.

2 Notation

Our method is generally applicable to the cases where the input data has a multi-dimensional grid topology layout. We will assume input objects to be from the set of $B$-dimensional grids and run convolutional neural networks on those grids. The per-layer activations from those networks will be contained in the set of $(B+1)$-dimensional grids. Both the input data and the activations will be indexed by a $(B+1)$-dimensional vector $\mathbf{x} = (x, y, \dots, c)$, where $x$ is a column index, $y$ is a row index, etc., and $c \in \{1, \dots, C\}$ is the channel index (we will assume $C = 1$ for the input data, which is a non-restrictive assumption as we will explain later).

We will search for correspondences between those grids, thus our goal will be to estimate shifts $\mathbf{d} \in D \subset \mathbb{Z}^B$ for all elements in the grid. The choice of the shift set $D$ is task-dependent. For example, for sound $B = 1$ and only 1D shifts can be considered. For images, $B = 2$ and $D$ could be a set of 1D shifts (usually called a stereo task) or a set of 2D shifts (usually called an optical flow task).
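
As a small illustration of how the shift set depends on the task, the sketch below enumerates candidate shifts for a stereo-style search and for an optical-flow-style search on images. The function names and the `max_shift` and `radius` parameters are illustrative assumptions and do not come from the method description.

```python
from typing import List, Tuple

Shift = Tuple[int, int]  # (horizontal, vertical) displacement in pixels


def stereo_shifts(max_shift: int) -> List[Shift]:
    """1D shift set for images: horizontal displacements 0..max_shift."""
    return [(d, 0) for d in range(max_shift + 1)]


def flow_shifts(radius: int) -> List[Shift]:
    """2D shift set for images: all displacements inside a square window."""
    return [(dx, dy)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)]


print(stereo_shifts(2))     # [(0, 0), (1, 0), (2, 0)]
print(len(flow_shifts(1)))  # 9
```

The stereo set grows linearly with the search range while the flow set grows quadratically, which is one reason the two tasks are usually treated separately.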

In this work, we will be dealing with convolutional neural network architectures, consisting of convolutions, max-poolings and non-linear activation functions (one example of such an architecture is a VGG-net Simonyan and Zisserman (2014), if we omit softmax which we will not use for the transfer). We assume every convolutional layer to be followed by a non-linear activation function throughout the paper and will not specify those functions explicitly.

The computational graph of these architectures is a directed acyclic graph $G = (A, E)$, where $A$ is a set of nodes, corresponding to neuron activations ($|A|$ denotes the size of this set), and $E$ is a set of arcs, corresponding to computational dependencies ($|E|$ denotes the size of this set). Each arc $e \in E$ is represented as a tuple $(i, j)$, where $i$ is the input (origin), $j$ is the output (endpoint). The node set consists of disjoint layers $A = \bigcup_{\ell=0}^{L} A_\ell$. The arcs are only allowed to go from the previous layer to the next one.

We will use the notation $A_\ell(\mathbf{x})$ for the node in the $\ell$-th layer at position $\mathbf{x}$; $I_\ell(\mathbf{x})$ for the set of origins of arcs entering layer $\ell$ at position $\mathbf{x}$ of the reference object; $O_\ell(\mathbf{x})$ for the set of endpoints of arcs exiting layer $\ell$ at position $\mathbf{x}$ of the reference object. Let $f_\ell$ be the mathematical operator which corresponds to forward computation in layer $\ell$ as $A_\ell = f_\ell(A_{\ell-1})$, $\ell = 1, \dots, L$ (with a slight abuse of notation, we use $A_\ell$ for both the nodes in the computational graph and the activation values which are computed in those nodes).

Figure 1: Four siamese paths are shown. Two of them (red) have the same origin and support the same shift hypothesis for this origin. The other two (green and pink) have different origins and each supports a shift hypothesis for its own origin.

3 Correspondence via path matching

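We will consider two objects, reference $o$ and searched $o'$, for which we want to find correspondences. After applying a CNN on them, we get graphs $G$ and $G'$ of activations. The goal is to establish correspondences between the input-data layers $A_0$ and $A'_0$. That is, every cell $A_0(\mathbf{x})$ in the reference object has a certain shift $\mathbf{d} \in D$ in the searched object $o'$, and we want to estimate $\mathbf{d}$.

Here comes the cornerstone idea of our method: we establish the matching of $A_0(\mathbf{x})$ with $A'_0(\mathbf{x} - \mathbf{d})$ for a shift $\mathbf{d}$ if there is a pair of “parallel” paths (we call this pair a siamese path), originating at those nodes and ending at the last layers $A_L$, $A'_L$, which match. This pair of paths must have the same spatial shift with respect to each other at all layers, up to subsampling, and go through the same feature channels with respect to each other (the shift acts on the spatial coordinates only and leaves the channel index unchanged). We take the subsampling into account by per-layer functions

$$k_\ell(\mathbf{d}) = \gamma_\ell\big(k_{\ell-1}(\mathbf{d})\big), \quad \ell = 1, \dots, L, \qquad \gamma_\ell(\mathbf{d}) = \left[ \frac{\mathbf{d}}{q_\ell} \right], \qquad k_0(\mathbf{d}) = \mathbf{d}, \tag{1}$$

where $k_\ell(\mathbf{d})$ is how the zero-layer shift $\mathbf{d}$ transforms at layer $\ell$, $q_\ell$ is the $\ell$-th layer spatial subsampling factor and $[\cdot]$ denotes rounding (note that rounding and division on vectors is done element-wise). Then a siamese path $P$ can be represented as

$$P = (p, p'), \qquad p = \big(A_0(\mathbf{x}_0), \dots, A_L(\mathbf{x}_L)\big), \qquad p' = \big(A'_0(\mathbf{x}_0 - k_0(\mathbf{d})), \dots, A'_L(\mathbf{x}_L - k_L(\mathbf{d}))\big), \tag{2}$$

where $\mathbf{x}_0 = \mathbf{x}$ and $\mathbf{x}_\ell$ denotes the position at which the path $P$ intersects layer $\ell$ of the reference activation graph. Such paths are illustrated in Fig. 1. The logic is simple: matching in a siamese path means that the recognition hierarchy detects the same features at different perception levels with the same shifts (up to subsampling) with respect to the currently estimated position $\mathbf{x}$, which allows for a confident prediction of match. The fact that a siamese path is “matched” can be established by computing the matching function (high if it matches, low if not)

$$M(P) = \bigodot_{\ell=0}^{L} m\big(A_\ell(\mathbf{x}_\ell), A'_\ell(\mathbf{x}_\ell - k_\ell(\mathbf{d}))\big), \tag{3}$$

where $m(\cdot, \cdot)$ is a matching function for individual neurons (prefers them both to be similar and non-zero at the same time) and $\odot$ is a logical-and-like operator. Both will be discussed later.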
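
The way an input-level shift transforms from layer to layer (Eq. 1) can be sketched for the 1D case as follows. This is a minimal sketch under two assumptions that the text does not settle: each layer divides the previous layer's shift by its own subsampling factor, and ties are rounded up. The names `round_half_up`, `layer_shifts` and `VGG_FACTORS` are hypothetical.

```python
import math
from typing import List


def round_half_up(value: float) -> int:
    """Round to the nearest integer, ties going up (assumed convention)."""
    return math.floor(value + 0.5)


def layer_shifts(d: int, factors: List[int]) -> List[int]:
    """Shift seen at every layer for an input-level shift d.

    factors[i] is the spatial subsampling factor of layer i + 1
    (1 for a stride-1 convolution, 2 for a 2x2 max-pooling).
    """
    shifts = [d]
    for q in factors:
        shifts.append(round_half_up(shifts[-1] / q))
    return shifts


# Layers 1..8 of a VGG-like stack: c c p c c p c c
VGG_FACTORS = [1, 1, 2, 1, 1, 2, 1, 1]
print(layer_shifts(5, VGG_FACTORS))  # [5, 5, 5, 3, 3, 3, 2, 2, 2]
```

After two poolings a 5-pixel shift at the input corresponds to a 2-cell shift in the deepest layer, which is what "the same shift up to subsampling" means for a siamese path.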

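Since we want to estimate the shift for a node $A_0(\mathbf{x})$, we will consider all possible shifts and vote for each of them. Let us denote a set of siamese paths, starting at $A_0(\mathbf{x})$ and $A'_0(\mathbf{x} - \mathbf{d})$ and ending at the last layer, as $\mathcal{P}(\mathbf{x}, \mathbf{d})$.

For every shift $\mathbf{d} \in D$ we introduce $U(\mathbf{x}, \mathbf{d})$ as the log-likelihood of the event that $\mathbf{d}$ is the correct shift, i.e. $A_0(\mathbf{x})$ matches $A'_0(\mathbf{x} - \mathbf{d})$. To collect the evidence from all possible paths, we “sum up” the matching functions for all individual paths, leading to

$$U(\mathbf{x}, \mathbf{d}) = \bigoplus_{P \in \mathcal{P}(\mathbf{x}, \mathbf{d})} M(P), \tag{4}$$

where the sum-like operator $\oplus$ will be discussed later.

The distribution $U(\mathbf{x}, \mathbf{d})$ can be used to either obtain the solution as $\mathbf{d}^*(\mathbf{x}) = \arg\max_{\mathbf{d} \in D} U(\mathbf{x}, \mathbf{d})$ or to post-process the distribution with any kind of spatial smoothing optimization and then again take the best-cost solution.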
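
The first of the two options, taking the best-scoring shift independently at every position, can be sketched as below. The container layout (one dictionary from shift to score per position) and the name `winner_takes_all` are assumptions made for illustration only.

```python
from typing import Dict, List


def winner_takes_all(scores: List[Dict[int, float]]) -> List[int]:
    """Pick, for every position, the shift with the highest aggregated score."""
    return [max(row, key=row.get) for row in scores]


scores = [
    {0: 0.1, 1: 0.7, 2: 0.2},   # position 0: shift 1 has the most support
    {0: 0.9, 1: 0.05, 2: 0.05}, # position 1: shift 0 has the most support
]
print(winner_takes_all(scores))  # [1, 0]
```

This per-position decision ignores neighbouring positions, which is why a spatial smoothing step is usually applied before the final choice.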

The obvious obstacle to using the distribution $U(\mathbf{x}, \mathbf{d})$ is the following.

Observation 1.

If $C$ is the minimal number of activation channels in all the layers of the network and $L$ is the number of layers, the number of paths considered in the computation of $U(\mathbf{x}, \mathbf{d})$ for a single originating node is $\Omega(C^L)$, i.e. at least exponential in the number of layers.

In practice, it is infeasible to compute $U(\mathbf{x}, \mathbf{d})$ naively. In this work, we prove that it is possible to compute $U(\mathbf{x}, \mathbf{d})$ in $O((|A| + |E|)|D|)$, thus linear in the number of layers, using the algorithm which will be introduced in the next section.
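
To get a feeling for how quickly the number of paths grows, the following sketch counts the forward paths leaving one input node in a simplified model. It assumes that every node feeds a fixed number of spatial positions in every channel of the next layer and it ignores image borders; the channel list is a hypothetical stack of 3x3 convolutions, not the exact network used in the experiments.

```python
from typing import List


def paths_from_one_node(out_channels: List[int], kernel_positions: int) -> int:
    """Number of forward paths leaving a single input node.

    out_channels[i] is the channel count of layer i + 1; every node is assumed
    to feed `kernel_positions` spatial positions in each of those channels.
    """
    total = 1
    for channels in out_channels:
        total *= kernel_positions * channels
    return total


channels = [64, 64, 128, 128, 256, 256]       # hypothetical convolution stack
exact = paths_from_one_node(channels, 9)      # 3x3 kernel -> 9 positions
lower_bound = min(channels) ** len(channels)  # the C^L bound of Observation 1
print(exact >= lower_bound)                   # True
```

Even the lower bound, 64 to the power of 6, is far too large to enumerate for every pixel and every candidate shift.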

4 Linear-time backward algorithm

Theorem 1.

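For any neuron matching function $m(\cdot, \cdot)$ and any pair of operators $\oplus$, $\odot$ such that $\odot$ is left-distributive over $\oplus$, i.e. $a \odot (b \oplus c) = (a \odot b) \oplus (a \odot c)$, we can compute $U(\mathbf{x}, \mathbf{d})$ for all $\mathbf{x}$ and $\mathbf{d}$ in $O((|A| + |E|)|D|)$.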

Proof

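Since there is distributivity, we can use a dynamic programming approach similar to the one developed for gradient backpropagation.

First, let us introduce subsampling functions $k^s_\ell(\mathbf{d}) = \gamma_\ell\big(k^s_{\ell-1}(\mathbf{d})\big)$, $k^s_s(\mathbf{d}) = \mathbf{d}$, $s \le \ell$, where $\gamma_\ell(\mathbf{d}) = [\mathbf{d} / q_\ell]$ is the rounded division by the subsampling factor of layer $\ell$. Note that $k_\ell = k^0_\ell$ as introduced in Eq. 1.

Then, let us introduce auxiliary variables $U_\ell(\mathbf{x}_\ell, \mathbf{d})$ for each layer $\ell$, which have the same definition as $U(\mathbf{x}, \mathbf{d})$ except for the fact that the paths, considered in them, start from the later layer $\ell$:

$$U_\ell(\mathbf{x}_\ell, \mathbf{d}) = \bigoplus_{P \in \mathcal{P}_\ell(\mathbf{x}_\ell, \mathbf{d})} \; \bigodot_{s=\ell}^{L} m\big(A_s(\mathbf{x}_s), A'_s(\mathbf{x}_s - k^\ell_s(\mathbf{d}))\big), \tag{5}$$

where $\mathcal{P}_\ell(\mathbf{x}_\ell, \mathbf{d})$ is the set of siamese paths starting at $A_\ell(\mathbf{x}_\ell)$ and $A'_\ell(\mathbf{x}_\ell - \mathbf{d})$ and ending at the last layer. Note that $U(\mathbf{x}, \mathbf{d}) = U_0(\mathbf{x}, \mathbf{d})$. The idea is to iteratively recompute $U_\ell(\mathbf{x}_\ell, \mathbf{d})$ based on known $U_{\ell+1}(\mathbf{x}_{\ell+1}, \gamma_{\ell+1}(\mathbf{d}))$ for all $\mathbf{x}_{\ell+1}$. Eventually, we will get to the desired $U_0(\mathbf{x}, \mathbf{d})$.

The first step is to notice that all the paths share the same prefix and write it out explicitly:

$$U_\ell(\mathbf{x}_\ell, \mathbf{d}) = \bigoplus_{P \in \mathcal{P}_\ell(\mathbf{x}_\ell, \mathbf{d})} \Big[ m\big(A_\ell(\mathbf{x}_\ell), A'_\ell(\mathbf{x}_\ell - \mathbf{d})\big) \odot \bigodot_{s=\ell+1}^{L} m\big(A_s(\mathbf{x}_s), A'_s(\mathbf{x}_s - k^\ell_s(\mathbf{d}))\big) \Big]. \tag{6}$$

Now, we want to pull the prefix out of the “sum”. For that purpose, we will need the set of endpoints $O_\ell(\mathbf{x}_\ell)$, introduced in the notation in Section 2. Since $k^\ell_s(\mathbf{d}) = k^{\ell+1}_s(\gamma_{\ell+1}(\mathbf{d}))$ for $s \ge \ell + 1$, the “sum” can be re-written in terms of those endpoints as

$$U_\ell(\mathbf{x}_\ell, \mathbf{d}) = \bigoplus_{\mathbf{x}_{\ell+1} \in O_\ell(\mathbf{x}_\ell)} \; \bigoplus_{P \in \mathcal{P}_{\ell+1}(\mathbf{x}_{\ell+1}, \gamma_{\ell+1}(\mathbf{d}))} \Big[ m\big(A_\ell(\mathbf{x}_\ell), A'_\ell(\mathbf{x}_\ell - \mathbf{d})\big) \odot \bigodot_{s=\ell+1}^{L} m\big(A_s(\mathbf{x}_s), A'_s(\mathbf{x}_s - k^{\ell+1}_s(\gamma_{\ell+1}(\mathbf{d})))\big) \Big]. \tag{7}$$

The last step is to use the left-distributivity of $\odot$ over $\oplus$ to pull the prefix out of the “sum”:

$$U_\ell(\mathbf{x}_\ell, \mathbf{d}) = m\big(A_\ell(\mathbf{x}_\ell), A'_\ell(\mathbf{x}_\ell - \mathbf{d})\big) \odot \bigoplus_{\mathbf{x}_{\ell+1} \in O_\ell(\mathbf{x}_\ell)} U_{\ell+1}\big(\mathbf{x}_{\ell+1}, \gamma_{\ell+1}(\mathbf{d})\big). \tag{8}$$

The detailed procedure is listed in Algorithm 1. We use the notation $k_\ell(D)$ for the set of subsampled shifts which is the result of applying function $k_\ell$ to every element of the set of initial shifts $D$.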

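procedure Backward($G$, $G'$)
    for $\mathbf{x} \in A_L$ do
        for $\mathbf{d} \in k_L(D)$ do
            $U_L(\mathbf{x}, \mathbf{d}) \leftarrow m\big(A_L(\mathbf{x}), A'_L(\mathbf{x} - \mathbf{d})\big)$    ▷ Initialize the last layer.
        end for
    end for
    for $\ell = L-1, \dots, 0$ do
        for $\mathbf{x} \in A_\ell$ do
            for $\mathbf{d} \in k_\ell(D)$ do
                $S \leftarrow 0$    ▷ Neutral element of $\oplus$.
                for $\mathbf{x}' \in O_\ell(\mathbf{x})$ do
                    $S \leftarrow S \oplus U_{\ell+1}\big(\mathbf{x}', \gamma_{\ell+1}(\mathbf{d})\big)$    ▷ $\gamma_{\ell+1}$ subsamples the shift from layer $\ell$ to layer $\ell+1$.
                end for
                $U_\ell(\mathbf{x}, \mathbf{d}) \leftarrow m\big(A_\ell(\mathbf{x}), A'_\ell(\mathbf{x} - \mathbf{d})\big) \odot S$
            end for
        end for
    end for
    return $U_0$    ▷ Return the distribution for the first layer.
end procedure
Algorithm 1 Backward pass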
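
The backward pass can be tried out on a toy problem. The sketch below is not the authors' implementation: it assumes 1D inputs, size-3 stride-1 convolutions only (so no shift subsampling), sum and product as the two operators, and a placeholder neuron matching function `match` that is merely chosen to be high for similar positive activations. It also checks the result against explicit enumeration of every path.

```python
import math
import random
from typing import Dict, Iterator, List, Tuple

Layer = List[List[float]]  # layer[x][c]: activation at position x, channel c
Node = Tuple[int, int]     # (position, channel)


def match(a: float, b: float) -> float:
    """Placeholder matching function (an assumption, not the paper's choice)."""
    return min(a, b) / max(a, b) if a > 0 and b > 0 else 0.0


def node_score(ref: List[Layer], srch: List[Layer], l: int, node: Node, d: int) -> float:
    """Match a reference node with the searched node displaced by d (0 if outside)."""
    x, c = node
    if not 0 <= x - d < len(srch[l]):
        return 0.0
    return match(ref[l][x][c], srch[l][x - d][c])


def endpoints(ref: List[Layer], l: int, node: Node) -> List[Node]:
    """Nodes of layer l + 1 fed by `node` through a size-3, stride-1 convolution."""
    width, channels = len(ref[l + 1]), len(ref[l + 1][0])
    return [(x2, c2) for x2 in (node[0] - 1, node[0], node[0] + 1)
            if 0 <= x2 < width for c2 in range(channels)]


def nodes(layer: Layer) -> List[Node]:
    return [(x, c) for x in range(len(layer)) for c in range(len(layer[0]))]


def backward(ref: List[Layer], srch: List[Layer], shifts: List[int]) -> Dict[Tuple[Node, int], float]:
    """Aggregate all siamese paths in one sweep from the last layer to the first."""
    last = len(ref) - 1
    U = {(n, d): node_score(ref, srch, last, n, d)
         for n in nodes(ref[last]) for d in shifts}
    for l in range(last - 1, -1, -1):
        U = {(n, d): node_score(ref, srch, l, n, d)
                     * sum(U[(n2, d)] for n2 in endpoints(ref, l, n))
             for n in nodes(ref[l]) for d in shifts}
    return U


def brute_force(ref: List[Layer], srch: List[Layer], node: Node, d: int) -> float:
    """Same quantity by explicit enumeration of every path (exponential cost)."""
    last = len(ref) - 1

    def paths(l: int, n: Node) -> Iterator[List[Node]]:
        if l == last:
            yield [n]
            return
        for n2 in endpoints(ref, l, n):
            for tail in paths(l + 1, n2):
                yield [n] + tail

    return sum(math.prod(node_score(ref, srch, l, n, d) for l, n in enumerate(p))
               for p in paths(0, node))


def random_net(rng: random.Random, width: int = 5, channels=(1, 2, 2)) -> List[Layer]:
    return [[[rng.random() for _ in range(c)] for _ in range(width)] for c in channels]


rng = random.Random(0)
ref, srch = random_net(rng), random_net(rng)
U = backward(ref, srch, shifts=[0, 1, 2])
print(all(abs(U[(n, d)] - brute_force(ref, srch, n, d)) < 1e-9
          for n in nodes(ref[0]) for d in [0, 1, 2]))  # True
```

The dynamic program touches every arc once per shift, whereas `brute_force` revisits shared path suffixes again and again; the two agree because product distributes over sum.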

5 Choice of neuron matching function and operators $\oplus$, $\odot$

For the convolutional layers, we use the matching function

(9)

For the max-pooling layers, the computational graph can be truncated to just one active connection (as only one element influences higher-level features). Moreover, max-pooling does not create any additional features, only passes/subsamples the existing ones. Thus it does not make sense to take into account the pre-activations for those layers as they are the same as activations (up to subsampling). For these reasons, we use

(10)

where the neighborhood is that of the max-pooling window covering the node, and the indicator function equals 1 if the condition holds and 0 otherwise.
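
The truncation of the max-pooling graph to one active connection can be illustrated in 1D. The sketch below only shows which input of each pooling window would be kept; the function name `active_pool_arcs`, the non-overlapping window layout and the tie-breaking rule (first maximum wins) are assumptions.

```python
from typing import Dict, List


def active_pool_arcs(row: List[float], size: int = 2) -> Dict[int, int]:
    """Map each pooled output index to the single input index that produced it."""
    arcs = {}
    for out in range(len(row) // size):
        window = range(out * size, (out + 1) * size)
        arcs[out] = max(window, key=lambda i: row[i])  # first maximum wins ties
    return arcs


print(active_pool_arcs([0.1, 0.7, 0.4, 0.4]))  # {0: 1, 1: 2}
```

Every other arc entering a pooled node is dropped, so pooling layers add no branching to the set of paths.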

In this paper, we use sum as $\oplus$ and product as $\odot$. Another possible choice would be $\max$ for $\oplus$ and $\min$ or product for $\odot$; theoretically, those combinations satisfy the conditions in Theorem 1. Nevertheless, we found the sum/product combination working better than the others. This could be explained by the fact that $\max$ as $\oplus$ would be taken over a huge set of paths, which is not robust in practice.
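
Left-distributivity of a candidate operator pair is easy to check numerically on a few sample values. The helper below is an illustrative sketch (its name and the sample values are assumptions); it confirms the property for sum/product, max/min and max/product on non-negative inputs, and shows one pair, sum with min, for which it fails.

```python
import itertools
import math
import operator
from typing import Callable, Iterable

BinOp = Callable[[float, float], float]


def is_left_distributive(otimes: BinOp, oplus: BinOp, values: Iterable[float]) -> bool:
    """Check a (x) (b (+) c) == (a (x) b) (+) (a (x) c) on every triple of values."""
    vals = list(values)
    return all(
        math.isclose(otimes(a, oplus(b, c)), oplus(otimes(a, b), otimes(a, c)),
                     abs_tol=1e-12)
        for a, b, c in itertools.product(vals, repeat=3)
    )


samples = [0.0, 0.5, 1.0, 2.0]  # non-negative, like matching scores
print(is_left_distributive(operator.mul, operator.add, samples))  # True
print(is_left_distributive(min, max, samples))                    # True
print(is_left_distributive(operator.mul, max, samples))           # True
print(is_left_distributive(min, operator.add, samples))           # False
```

The last case fails already for a = b = c = 1, since min(1, 1 + 1) = 1 while min(1, 1) + min(1, 1) = 2.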

6 Experiments

We validate our approach in the field of computer vision as our method requires a convolutional neural network trained on a large recognition dataset. Out of the vision correspondence tasks, we chose stereo matching to validate our method. For this task, the input data dimensionality is 2 and the shift set consists of horizontal shifts. We always convert images to grayscale before running CNNs, following the observation by Zbontar and LeCun (2016b) that color does not help.

As the pre-trained recognition CNN, we chose the VGG-16 network Simonyan and Zisserman (2014). This network is summarized in Table 1. We will further refer to layer indexes from this table. It is important to mention that we have not used the whole range of layers in our experiments. In particular, we usually started from layer 2 and finished at layer 8. As such, it is still necessary to consider multi-channel input. To extend our algorithm to this case, we create a virtual input layer with a single channel and virtual per-pixel arcs to all the real input channels. While starting from a later layer is an empirical observation which improves the results for our method, the advantage of finishing at an earlier layer was discovered by other researchers as well Gatys et al. (2015) (starting from some layer, the network activations stop being related to individual pixels). We will thus abbreviate our methods as “ours(s, t)” where “s” is the starting layer and “t” is the last layer.

Layer index      1    2    3    4    5    6    7    8
Layer type       c    c    p    c    c    p    c    c
Output channels  64   64   64   128  128  128  256  256
Table 1: Summary of the convolutional neural network VGG-16. We only show the part up to the 8-th layer as we do not use higher activations (they are not pixel-related enough). In the layer type row, c stands for 3x3 convolution with stride 1 followed by the ReLU non-linear activation function Krizhevsky et al. (2012) and p for 2x2 max-pooling with stride 2. The input to convolution is padded with the “same as boundary” rule.

6.1 Experimental setup

For the stereo matching, we chose the largest available datasets KITTI 2012 and KITTI 2015. All image pairs in these datasets are rectified, so correspondences can be searched in the same row. For each training pair, the ground-truth shift is measured densely per-pixel. This ground truth was obtained by projecting the point cloud from LIDAR on the reference image. The quality measure is the percentage of pixels whose predicted shift error is bigger than a threshold of $t$ pixels. We considered a range of thresholds $t = 1, \dots, 5$, while the main benchmark measure is $t = 3$. This measure is only computed for the pixels which are visible in both images from the stereo pair.
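
The quality measure described above can be written down in a few lines. This is a minimal sketch assuming flat lists of predicted and ground-truth shifts, with `None` marking pixels that have no ground truth or are not visible in both images; the name `bad_pixel_rate` is hypothetical.

```python
from typing import List, Optional


def bad_pixel_rate(pred: List[float], gt: List[Optional[float]], threshold: float) -> float:
    """Percentage of evaluated pixels whose shift error exceeds the threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if g is not None]
    if not valid:
        raise ValueError("no pixels with ground truth")
    bad = sum(1 for p, g in valid if abs(p - g) > threshold)
    return 100.0 * bad / len(valid)


pred = [10.0, 12.0, 20.0, 7.0]
gt = [10.0, 16.0, None, 5.0]
print(round(bad_pixel_rate(pred, gt, threshold=3), 2))  # 33.33
```

Only the second pixel is off by more than 3 pixels, and the third is skipped because it has no ground truth, so one of three evaluated pixels counts as erroneous.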

For comparison with the baselines, we used the setup proposed in Zbontar and LeCun (2016b), the seminal work which introduced deep learning for stereo matching and which currently remains one of the best methods on the KITTI datasets. Zbontar and LeCun (2016b) is an extensive study which has a representative comparison of learning-based and non-learning-based methods under the same setup, with open-source code Zbontar and LeCun (2016a) for this setup. The whole pipeline works as follows. First, we obtain the raw scores from Algorithm 1 for all shifts up to the maximal shift considered. Then we normalize the scores per-pixel by dividing them by the maximal score, thus turning them into the range $[0, 1]$, suitable for running the post-processing code Zbontar and LeCun (2016a). Finally, we run the post-processing code with exactly the same parameters as the original method Zbontar and LeCun (2016b) and measure the quality on the same validation images.
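
The per-pixel normalization step can be sketched as follows, assuming non-negative raw scores stored as one list over shifts per pixel. How an all-zero pixel should be handled is not stated in the text; mapping it to zeros is an assumption, as is the name `normalize_scores`.

```python
from typing import List


def normalize_scores(scores: List[List[float]]) -> List[List[float]]:
    """Divide each pixel's scores by that pixel's maximal score."""
    normalized = []
    for row in scores:
        top = max(row)
        normalized.append([s / top if top > 0 else 0.0 for s in row])
    return normalized


print(normalize_scores([[2.0, 4.0, 1.0], [0.0, 0.0, 0.0]]))
# [[0.5, 1.0, 0.25], [0.0, 0.0, 0.0]]
```

After this step the best shift of every pixel with any support has score 1, so scores from different pixels become comparable for the subsequent post-processing.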

6.2 Baselines

We have two kinds of baselines in our evaluation: those coming from Zbontar and LeCun (2016b) and our simpler versions of deep feature transfer similar to Long et al. (2014), which do not consider paths.

The first group of baselines from Zbontar and LeCun (2016b) are the following: the sum of absolute differences “sad”, the census transform “cens” Zabih and Woodfill (1994), the normalized cross-correlation “ncc”. We also included the learning-based methods “fst” and “acrt” Zbontar and LeCun (2016b) for completeness, although they use training data to learn features while our method does not.

For the second group of baselines, we stack up the activation volumes for the given layer range and up-sample the layer volumes if they have reduced resolution. Then we compute normalized cross-correlation of the stacked features. Those baselines are denoted “corr(s, t)” where “s” is the starting layer, “t” is the last layer. Note that we correlate the features before applying ReLU following what Zbontar and LeCun (2016b) does for the last layer. Thus we use the input to the ReLU inside the layers.
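
A 1D sketch of this stacked-feature baseline is given below. It assumes nearest-neighbour up-sampling of reduced-resolution layers and computes the correlation without mean subtraction (i.e. cosine similarity); neither choice is specified in the text, and the names `stack_features` and `ncc` are hypothetical.

```python
import math
from typing import List, Tuple

Features = List[List[float]]  # features[x] is the channel vector at position x


def stack_features(layers: List[Tuple[Features, int]], width: int) -> Features:
    """Concatenate per-layer features at input resolution.

    Each entry is (features, factor), where `factor` is the layer's
    downsampling relative to the input; up-sampling is nearest-neighbour.
    """
    return [[v for feats, factor in layers for v in feats[x // factor]]
            for x in range(width)]


def ncc(u: List[float], v: List[float]) -> float:
    """Normalized correlation of two feature vectors (0 if either is all-zero)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


layers = [
    ([[1.0], [2.0], [3.0], [4.0]], 1),  # full-resolution layer, 1 channel
    ([[1.0, 0.0], [0.0, 1.0]], 2),      # half-resolution layer, 2 channels
]
stacked = stack_features(layers, width=4)
print(stacked[1])  # [2.0, 1.0, 0.0]
```

A matching score for shift d at position x would then be `ncc(ref_stacked[x], srch_stacked[x - d])`, with no notion of paths between layers.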

All the methods, including ours, undergo the same post-processing pipeline. This pipeline consists of semi-global matching Hirschmuller (2005), left-right consistency check, sub-pixel enhancement by fitting a quadratic curve, median and bilateral filtering. We refer the reader to Zbontar and LeCun (2016b) for the full description. While the first group of baselines was tuned by Zbontar and LeCun (2016b) and we take the results from that paper, we had to tune the post-processing hyper-parameters of the second group of baselines to obtain the best results.
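
Of the post-processing steps listed above, the sub-pixel enhancement is the simplest to illustrate: a parabola is fitted through the score at the best integer shift and its two neighbours, and the vertex of the parabola gives the refined shift. The sketch below is a generic version of that idea and not the code of the referenced pipeline; the name `subpixel_shift` is hypothetical.

```python
from typing import List


def subpixel_shift(scores: List[float], d: int) -> float:
    """Refine integer shift d using a parabola through scores[d-1], scores[d], scores[d+1]."""
    if d == 0 or d == len(scores) - 1:
        return float(d)                  # no neighbour on one side
    a, b, c = scores[d - 1], scores[d], scores[d + 1]
    curvature = a - 2.0 * b + c
    if curvature == 0:
        return float(d)                  # flat neighbourhood, nothing to refine
    return d + (a - c) / (2.0 * curvature)


# Scores sampled from a parabola whose true peak is at 2.25
scores = [-(t - 2.25) ** 2 for t in range(5)]
best = max(range(len(scores)), key=scores.__getitem__)
print(subpixel_shift(scores, best))  # 2.25
```

The same vertex formula applies whether the curve is a score to maximize or a cost to minimize, since only the location of the extremum is used.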

6.3 KITTI 2012

The dataset consists of 194 training image pairs and 195 test image pairs. The reflective surfaces like windshields were excluded from the ground truth.

The results in Table 2 show that our method “ours(2, 8)” outperforms all the baselines which do not use learning. At the same time, its performance is lower than that of the learning-based methods from Zbontar and LeCun (2016b). The main promise of our method is scalability: while we test it on a task where huge effort was invested into collecting the training data, there are other important tasks without such extensive datasets.

Methods
Threshold  sad   cens  ncc   corr(1, 2)  corr(2, 2)  corr(2, 8)  ours(2, 8)  fst   acrt
1          -     -     -     20.6        20.4        20.7        17.4        -     -
2          -     -     -     10.5        10.4        8.14        6.40        -     -
3          8.16  4.90  8.93  7.58        7.52        5.23        3.94        3.02  2.61
4          -     -     -     6.19        6.13        4.02        2.99        -     -
5          -     -     -     5.40        5.36        3.42        2.49        -     -
Table 2: This table shows the percentages of erroneous pixels for thresholds of 1 to 5 pixels on the KITTI 2012 validation set from Zbontar and LeCun (2016b). Our method is denoted “ours(2, 8)”. The two right-most columns “fst” and “acrt” correspond to learning-based methods from Zbontar and LeCun (2016b). We give them for completeness, as all the other methods, including ours, do not use learning.

6.4 Ablation study on KITTI 2012

The goal of this section is to understand how important the deep hierarchy of features is compared to one or a few layers. We compared the following setups: “ours(2, 2)” uses only the second layer, “ours(2, 3)” uses the range from layer 2 to layer 3, “central(2, 8)” considers the full range of layers but with only the central arcs in the convolutions (connecting the same pixel positions between activations) taken into account in the backward pass, and “ours(2, 8)” is the full method. The results in Table 3 show that it is profitable to use the full hierarchy both in terms of depth and coverage of the receptive field: “ours(2, 8)” gives the lowest error at thresholds 2 to 5, and only at threshold 1 is “central(2, 8)” marginally better (17.3 vs. 17.4).

Methods
Threshold  ours(2, 2)  ours(2, 3)  central(2, 8)  ours(2, 8)
1          17.7        18.4        17.3           17.4
2          7.90        8.16        6.58           6.40
3          5.28        5.41        4.02           3.94
4          4.08        4.05        3.04           2.99
5          3.41        3.32        2.53           2.49
Table 3: KITTI 2012 ablation study.

6.5 KITTI 2015

The stereo dataset consists of 200 training image pairs and 200 test image pairs. The main difference to KITTI 2012 is that the images are colored and the reflective surfaces are present in the evaluation.

Similar conclusions to KITTI 2012 can be drawn from experimental results: our method provides a reasonable transfer, being inferior only to learning-based methods — see Table 4. We show our depth map results in Fig. 2.

Methods
Threshold  sad   cens  ncc   corr(1, 2)  corr(2, 2)  corr(2, 8)  ours(2, 8)  fst   acrt
1          -     -     -     26.6        26.5        29.6        26.2        -     -
2          -     -     -     10.9        10.8        11.2        9.27        -     -
3          9.44  5.03  8.89  6.68        6.63        6.16        4.78        3.99  3.25
4          -     -     -     5.05        5.03        4.42        3.36        -     -
5          -     -     -     4.22        4.20        3.60        2.72        -     -
Table 4: This table shows the percentages of erroneous pixels for thresholds of 1 to 5 pixels on the KITTI 2015 validation set from Zbontar and LeCun (2016b). Our method is denoted “ours(2, 8)”. The two right-most columns “fst” and “acrt” correspond to learning-based methods from Zbontar and LeCun (2016b). We give them for completeness, as all the other methods, including ours, do not use learning.

Figure 2: Results on KITTI 2015. Top to bottom: reference image, searched image, our depth result. The depth is visualized in the standard KITTI color coding (from close to far: yellow, green, purple, red, blue).

6.6 Style transfer experiment on KITTI 2015

The goal of this experiment is to show the robustness of the recognition hierarchy for the transfer to correspondence search, something we advocated in the introduction as the advantage of our approach. We apply the style transfer method Gatys et al. (2015), implemented in the Prisma app. We ran different style transfers on the left and right images. While the images are now very different at the pixel level, their higher-level descriptions remain the same, which allows our method to run successfully. The qualitative results in Fig. 3 show the robustness of our path-based method (see also Fig. 2 for visual comparison to normal data).

Figure 3: Results for the style transfer on KITTI 2015. Top to bottom: reference image, searched image, our depth result. The depth is visualized in the standard KITTI color coding (from close to far: yellow, green, purple, red, blue).

7 Conclusion

In this work, we have presented a method for transfer from recognition to correspondence search at the lowest level. For that, we re-use activation paths from deep convolutional neural networks and propose an efficient polynomial algorithm to aggregate an exponential number of such paths. The empirical results on the stereo matching task show that our method is competitive among methods which do not use labeled data from the target domain. It would be interesting to apply this technique to sound, which should become possible once a high-quality deep convolutional model becomes accessible to the public (e.g., van den Oord et al. (2016)).

Acknowledgements

We would like to thank Dmitry Laptev, Alina Kuznetsova and Andrea Cohen for their comments about the manuscript. We also thank Valery Vishnevskiy for running our code while our own cluster was down. This work is partially funded by the Swiss NSF project 163910 “Efficient Object-Centric Detection”.

References