Finding per-part correspondences between objects is a long-standing problem in machine learning. The level at which correspondences are established can go as low as pixels for images or millisecond timestamps for sound signals. Typically, it is highly ambiguous to match at such a low level: a pixel or a timestamp just does not contain enough information to be discriminative and many false positives will follow. A hierarchical semantic representation could help to solve the ambiguity: we could choose the low-level match which also matches at the higher levels. For example, a car contains a wheel which contains a bolt. If we want to check if this bolt matches the bolt in another view of the car, we should check if the wheel and the car match as well.
One possible hierarchical semantic representation could be computed by a convolutional neural network. The features in such a network are composed hierarchically: lower-level features are used to compute higher-level features by applying convolutions, max-poolings and non-linear activation functions to them. Nevertheless, training such a convolutional neural network for correspondence prediction directly (e.g., Zbontar and LeCun (2016b), Choy et al. (2016)) might not be an option in domains where ground-truth correspondences are hard and expensive to obtain. This raises the question of the scalability of such approaches and motivates the search for methods which do not require training correspondence data.
To address the training data problem, we could transfer knowledge from a source domain where labels are present to a target domain where few or no labels are available. The most common form of transfer is from classification tasks. Its promise is two-fold. First, classification labels are among the easiest to obtain, as classification is a natural task for humans. This makes it possible to create huge recognition datasets like Imagenet Russakovsky et al. (2015). Second, features from the low to mid levels have been shown to transfer well to a variety of tasks Yosinski et al. (2014), Donahue et al. (2013), Razavian et al. (2014).
Although there has been huge progress in transfer from classification to detection Girshick (2015), Ren et al. (2015), Sermanet et al. (2013), Redmon et al. (2016), segmentation Long et al. (2015), Badrinarayanan et al. (2015) and other semantic reasoning tasks like single-image depth prediction Eigen and Fergus (2015), transfer to correspondence search has been limited Long et al. (2014), Kim et al. (2017), Ham et al. (2016).
We propose a general solution to unsupervised transfer from recognition to correspondence search at the lowest level (pixels, sound millisecond timestamps). Our approach is to match paths of activations coming from a convolutional neural network applied to the two objects to be matched. More precisely, to establish matching at the lowest level, we require the features to match at all the different levels of the convolutional feature hierarchy. Those different-level features form paths. One such path consists of the neural activations reachable from the lowest-level feature to the highest-level feature in the network topology (in other words, the lowest-level feature lies in the receptive field of the highest-level one). Since every lowest-level feature belongs to many paths, we vote based on all of them.
Although the overall number of such paths is exponential in the number of layers and thus infeasible to compute naively, we prove that the voting is possible in polynomial time in a single backward pass through the network. The algorithm is based on dynamic programming and is similar to the backward pass for gradient computation in the neural network.
Empirical validation is done on the task of stereo correspondence on two datasets: KITTI 2012 Geiger et al. (2012) and KITTI 2015 Menze and Geiger (2015). We quantitatively show that our method is competitive among the methods which do not require labeled target domain data. We also qualitatively show that even dramatic changes in low-level structure can be handled reasonably by our method due to the robustness of the recognition hierarchy: we apply different style transfers Gatys et al. (2015) to corresponding images in KITTI 2015 and still successfully find correspondences.
Our method is generally applicable to cases where the input data has a multi-dimensional grid topology. We will assume input objects to be from the set of $d$-dimensional grids and run convolutional neural networks on those grids. The per-layer activations from those networks will be contained in the set of $(d+1)$-dimensional grids. Both the input data and the activations will be indexed by a $(d+1)$-dimensional vector $(x_1, \dots, x_d, c)$, where $x_1$ is a column index, $x_2$ is a row index, etc., and $c$ is the channel index (we will assume $c = 1$ for the input data, which is a non-restrictive assumption as we will explain later).
We will search for correspondences between those grids; thus our goal will be to estimate shifts $s(p) \in \mathcal{S}$ for all elements $p$ in the grid. The choice of the shift set $\mathcal{S}$ is task-dependent. For example, for sound $d = 1$ and only 1D shifts can be considered. For images, $d = 2$, and $\mathcal{S}$ could be a set of 1D shifts (usually called a stereo task) or a set of 2D shifts (usually called an optical flow task).
In this work, we will be dealing with convolutional neural network architectures consisting of convolutions, max-poolings and non-linear activation functions (one example of such an architecture is a VGG-net Simonyan and Zisserman (2014), if we omit the softmax, which we will not use for the transfer). We assume every convolutional layer to be followed by a non-linear activation function throughout the paper and will not specify those functions explicitly.
The computational graph of these architectures is a directed acyclic graph $G = (V, E)$, where $V = \{v_i\}_{i=1}^{n}$ is a set of nodes, corresponding to neuron activations ($n$ denotes the size of this set), and $E = \{e_j\}_{j=1}^{m}$ is a set of arcs, corresponding to computational dependencies ($m$ denotes the size of this set). Each arc is represented as a tuple $(u, v)$, where $u$ is the input (origin) and $v$ is the output (endpoint). The node set consists of disjoint layers: $V = V^0 \cup V^1 \cup \dots \cup V^\ell$. The arcs are only allowed to go from the previous layer to the next one.
We will use the notation $x^k(p)$ for the node in the $k$-th layer at position $p$; $\mathrm{In}^k(p)$ for the set of origins of arcs entering layer $k$ at position $p$ of the reference object; $\mathrm{Out}^k(p)$ for the set of endpoints of arcs exiting layer $k$ at position $p$ of the reference object. Let $f^k$ be the mathematical operator which corresponds to the forward computation in layer $k$, as $x^k(p) = f^k\big(\mathrm{In}^k(p)\big)$ (with a slight abuse of notation, we use $x^k(p)$ both for the nodes in the computational graph and for the activation values which are computed in those nodes).
3 Correspondence via path matching
We will consider two objects, reference $x$ and searched $y$, for which we want to find correspondences. After applying a CNN to them, we get graphs $G_x$ and $G_y$ of activations. The goal is to establish correspondences between the input-data layers $V_x^0$ and $V_y^0$. That is, every cell $x^0(p)$ in the reference object has a certain shift $s(p) \in \mathcal{S}$ in the searched object $y$, and we want to estimate $s(p)$.
Here comes the cornerstone idea of our method: we establish the matching of $x^0(p)$ with $y^0(p + s)$ for a shift $s \in \mathcal{S}$ if there is a pair of "parallel" paths (we call this pair a siamese path), originating at those nodes and ending at the last layers, which match. This pair of paths must have the same spatial shift with respect to each other at all layers, up to subsampling, and go through the same feature channels with respect to each other. We take the subsampling into account by per-layer functions
$$\sigma^k(s) = \left\lfloor s \Big/ \prod_{i=1}^{k} r^i \right\rfloor, \qquad \sigma^0(s) = s, \tag{1}$$
where $\sigma^k(s)$ is how the zero-layer shift $s$ transforms at layer $k$, and $r^i$ is the $i$-th layer spatial subsampling factor (note that rounding and division on vectors are done element-wise). Then a siamese path can be represented as
$$P = \Big( \big(x^0(p^0),\, y^0(p^0 + s)\big),\; \big(x^1(p^1),\, y^1(p^1 + \sigma^1(s))\big),\; \dots,\; \big(x^\ell(p^\ell),\, y^\ell(p^\ell + \sigma^\ell(s))\big) \Big),$$
where $p^0 = p$ and $p^k$ denotes the position at which the path intersects layer $k$ of the reference activation graph. Such paths are illustrated in Fig. 1. The logic is simple: matching along a siamese path means that the recognition hierarchy detects the same features at different perception levels with the same shifts (up to subsampling) with respect to the currently estimated position, which allows for a confident prediction of a match. The fact that a siamese path is "matched" can be established by computing the matching function (high if it matches, low if not)
$$m(P) = \bigotimes_{k=0}^{\ell} \mu\big(x^k(p^k),\, y^k(p^k + \sigma^k(s))\big),$$
where $\mu$ is a matching function for individual neurons (it prefers them both to be similar and non-zero at the same time) and $\otimes$ is a logical-and-like operator. Both will be discussed later.
Since we want to estimate the shift $s(p)$ for a node $x^0(p)$, we will consider all possible shifts $s \in \mathcal{S}$ and vote for each of them. Let us denote the set of siamese paths, starting at $x^0(p)$ and $y^0(p + s)$ and ending at the last layer, as $\Pi(p, s)$.
For every shift $s$ we introduce $M(p, s)$ as the log-likelihood of the event that $s$ is the correct shift, i.e. that $x^0(p)$ matches $y^0(p + s)$. To collect the evidence from all possible paths, we "sum up" the matching functions for all individual paths, leading to
$$M(p, s) = \bigoplus_{P \in \Pi(p, s)} m(P), \tag{2}$$
where the sum-like operator $\oplus$ will be discussed later.
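For intuition, the naive version of this aggregation enumerates every path explicitly. The sketch below assumes a toy graph in which consecutive layers are fully connected across channels (spatial positions already fixed), with sum as the path-combining operator, product as the along-path operator, and `min` as a stand-in per-neuron matching function; all names here are ours, not from the paper's notation:

```python
from itertools import product as cartesian

def naive_aggregate(ref, srch, mu=min):
    """Brute-force path voting: 'sum' (here +) over all channel paths
    of the 'product' (here *) of per-neuron matching scores.
    ref, srch: lists of layers, each a list of activation values
    (one value per channel) at the positions the path passes through."""
    total = 0.0
    # Every combination of one channel per layer is a path -- exponential in depth.
    for path in cartesian(*[range(len(layer)) for layer in ref]):
        score = 1.0
        for k, c in enumerate(path):
            score *= mu(ref[k][c], srch[k][c])
        total += score
    return total
```

Note that for this fully connected toy case the sum over products factorizes into a product of per-layer sums, which is exactly the structure the backward algorithm of the next section exploits for general graphs.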
The distribution $M(p, \cdot)$ can be used either to obtain the solution directly as $s^*(p) = \arg\max_{s \in \mathcal{S}} M(p, s)$, or to post-process the distribution with any kind of spatial smoothing optimization and then again take the best-cost solution.
The obvious obstacle to using the distribution $M(p, s)$ is the sheer number of paths. If $c$ is the minimal number of activation channels over all the layers of the network and $\ell$ is the number of layers, the number of paths considered in the computation of $M(p, s)$ for a single originating node is $\Omega(c^{\ell})$ — at least exponential in the number of layers. In practice, it is infeasible to compute $M(p, s)$ naively. In this work, we prove that it is possible to compute $M(p, s)$ for all $p$ and $s$ in $O\big((n + m)\,|\mathcal{S}|\big)$ — thus linear in the number of layers — using the algorithm which will be introduced in the next section.
4 Linear-time backward algorithm
Theorem 1. For any $\mu$ and any pair of operators $(\oplus, \otimes)$ such that $\otimes$ is left-distributive over $\oplus$, i.e. $a \otimes (b \oplus c) = (a \otimes b) \oplus (a \otimes c)$, we can compute $M(p, s)$ for all $p$ and $s$ in $O\big((n + m)\,|\mathcal{S}|\big)$.
Since we have distributivity, we can use a dynamic programming approach similar to the one developed for gradient backpropagation.
First, let us introduce per-layer subsampling functions $\delta^k(s) = \lfloor s / r^k \rfloor$. Note that $\sigma^k = \delta^k \circ \delta^{k-1} \circ \dots \circ \delta^1$, with $\sigma^k$ as introduced in Eq. 1.
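A tiny sketch confirming that composing the per-layer floor divisions reproduces the cumulative transform (the stride values are a hypothetical example, not the VGG configuration used later):

```python
def subsample_shift(s, strides):
    """Transform a zero-layer shift through successive layers by
    element-wise floor division with each layer's spatial stride."""
    for r in strides:
        s = tuple(c // r for c in s)
    return s

# A 13-pixel horizontal shift after three stride-2 subsamplings,
# equivalently floor(13 / 8):
subsample_shift((13, 0), [2, 2, 2])  # -> (1, 0)
```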
Then, let us introduce auxiliary variables $M^k(q, s)$ for each layer $k$, which have the same definition as $M(p, s)$ except for the fact that the paths considered in them start from the later layer $k$:
$$M^k(q, s) = \bigoplus_{P \in \Pi^k(q, s)} m(P),$$
where $\Pi^k(q, s)$ denotes the set of siamese paths starting at $x^k(q)$ and $y^k(q + s)$ (with the shift $s$ now given at layer-$k$ resolution) and ending at the last layer.
Note that $M(p, s) = M^0(p, s)$. The idea is to iteratively recompute $M^k$ based on the known $M^{k+1}$ for all positions and shifts. Eventually, we will get down to the desired $M^0$.
The first step is to notice that all the paths in $\Pi^k(q, s)$ share the same prefix, the node pair $\big(x^k(q), y^k(q + s)\big)$, and write it out explicitly:
$$M^k(q, s) = \bigoplus_{P \in \Pi^k(q, s)} \Big[ \mu\big(x^k(q),\, y^k(q + s)\big) \otimes m(P') \Big],$$
where $P'$ denotes the suffix of $P$ starting at the next layer.
Now, we want to pull the prefix out of the "sum". For that purpose, we will need the set of endpoints $\mathrm{Out}^k(q)$, introduced in the notation in Section 2. The "sum" can be re-written in terms of those endpoints as
$$M^k(q, s) = \bigoplus_{v \in \mathrm{Out}^k(q)} \; \bigoplus_{P' \in \Pi^{k+1}(v,\, \delta^{k+1}(s))} \Big[ \mu\big(x^k(q),\, y^k(q + s)\big) \otimes m(P') \Big].$$
The last step is to use the left-distributivity of $\otimes$ over $\oplus$ to pull the prefix out of the "sum":
$$M^k(q, s) = \mu\big(x^k(q),\, y^k(q + s)\big) \otimes \bigoplus_{v \in \mathrm{Out}^k(q)} M^{k+1}\big(v,\, \delta^{k+1}(s)\big).$$
The detailed procedure is listed in Algorithm 1. We use the notation $\delta^k(\mathcal{S})$ for the set of subsampled shifts which is the result of applying the function $\delta^k$ to every element of the set of initial shifts $\mathcal{S}$.
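A minimal runnable sketch of this backward pass, under simplifying assumptions of ours (single-channel 1-D layers, stride 1 so shifts need no subsampling, circular boundary handling, sum/product as the operators and `min` as a stand-in per-neuron matcher; the names are ours, not those of Algorithm 1):

```python
def backward_scores(ref, srch, shifts, radius=1, mu=min):
    """Aggregate the scores of all siamese paths in one backward pass.
    ref, srch: lists of 1-D sequences of activations, one per layer.
    An arc connects position p in layer k to position q in layer k+1
    iff |q - p| <= radius.  Returns a dict (p, s) -> M(p, s)."""
    L = len(ref) - 1
    n = lambda k: len(ref[k])
    # Base case: one-node paths in the last layer.
    M = {(p, s): mu(ref[L][p], srch[L][(p + s) % n(L)])
         for p in range(n(L)) for s in shifts}
    for k in range(L - 1, -1, -1):          # backward through the layers
        Mk = {}
        for p in range(n(k)):
            outs = [q for q in range(n(k + 1)) if abs(q - p) <= radius]
            for s in shifts:
                local = mu(ref[k][p], srch[k][(p + s) % n(k)])
                # Distributivity: local score times the 'sum' of suffix scores.
                Mk[(p, s)] = local * sum(M[(q, s)] for q in outs)
        M = Mk
    return M
```

The cost is one multiply-accumulate per arc and shift, i.e. linear in the number of arcs, while the number of paths implicitly summed over grows exponentially with depth.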
5 Choice of neuron matching function $\mu$ and operators $\oplus$, $\otimes$
For the convolutional layers, we use the matching function
$$\mu(a, b) = \min(a, b)$$
(since the activations after ReLU are non-negative, $\min$ is high only when both activations are simultaneously large).
For the max-pooling layers, the computational graph can be truncated to just one active connection (as only one element influences the higher-level features). Moreover, max-pooling does not create any additional features; it only passes/subsamples the existing ones. Thus it does not make sense to take the pre-activations into account for those layers, as they are the same as the activations (up to subsampling). For these reasons, we use
$$\mu\big(x^k(q),\, y^k(q')\big) = \mathbb{1}\Big[x^k(q) = \max_{u \in N(p)} x^k(u)\Big] \cdot \mathbb{1}\Big[y^k(q') = \max_{u \in N(p')} y^k(u)\Big],$$
where $N(p)$ is the neighborhood of max-pooling covering node $p$ (the pooling node the path passes through next), and $\mathbb{1}$ is the indicator function ($1$ if the condition holds, $0$ otherwise).
In this paper, we use sum as $\oplus$ and product as $\otimes$. Another possible choice would be $\max$ for $\oplus$ and $\min$ or product for $\otimes$ — theoretically, those combinations satisfy the conditions in Theorem 1. Nevertheless, we found the sum/product combination to work better than the others. This could be explained by the fact that $\max$ as $\oplus$ would be taken over a huge set of paths, which is not robust in practice.
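The distributivity requirement of Theorem 1 is easy to spot-check numerically; a small sketch (function and operator names are ours):

```python
import random

def left_distributive(oplus, otimes, trials=1000):
    """Check a (+)-like / (x)-like operator pair for left-distributivity,
    a (x) (b (+) c) == (a (x) b) (+) (a (x) c), on random non-negative inputs."""
    random.seed(0)
    for _ in range(trials):
        a, b, c = (random.uniform(0.001, 10) for _ in range(3))
        if abs(otimes(a, oplus(b, c)) - oplus(otimes(a, b), otimes(a, c))) > 1e-9:
            return False
    return True

add, mul = (lambda x, y: x + y), (lambda x, y: x * y)
left_distributive(add, mul)  # sum/product: True
left_distributive(max, min)  # max/min:     True
left_distributive(max, mul)  # max/product: True (on non-negative inputs)
```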
We validate our approach in the field of computer vision, as our method requires a convolutional neural network trained on a large recognition dataset. Out of the vision correspondence tasks, we chose stereo matching to validate our method. For this task, the input data dimensionality is $d = 2$ and the shift set $\mathcal{S}$ is represented by horizontal shifts. We always convert images to grayscale before running CNNs, following the observation by Zbontar and LeCun (2016b) that color does not help.
For the pre-trained recognition CNN, we chose the VGG-16 network Simonyan and Zisserman (2014). This network is summarized in Table 1; we will further refer to layer indexes from this table. It is important to mention that we have not used the whole range of layers in our experiments: in particular, we usually started from layer 2 and finished at layer 8. As such, it is still necessary to consider multi-channel input. To extend our algorithm to this case, we create a virtual input layer with $c = 1$ and virtual per-pixel arcs to all the real input channels. While starting from a later layer is an empirical observation which improves the results for our method, the advantage of finishing at an earlier layer was discovered by other researchers as well Gatys et al. (2015) (starting from some layer, the network activations stop being related to individual pixels). We will thus abbreviate our methods as "ours(s, t)", where "s" is the starting layer and "t" is the last layer.
In Table 1, convolution entries stand for 3x3 convolution with stride 1, followed by the ReLU non-linear activation function Krizhevsky et al. (2012), and pooling entries for 2x2 max-pooling with stride 2. The input to convolution is padded with the "same as boundary" rule.
6.1 Experimental setup
For stereo matching, we chose the largest available datasets, KITTI 2012 and KITTI 2015. All image pairs in these datasets are rectified, so correspondences can be searched within the same row. For each training pair, the ground-truth shift is measured densely, per-pixel. This ground truth was obtained by projecting the point cloud from a LIDAR onto the reference image. The quality measure is the percentage of pixels whose predicted shift error is bigger than a threshold of $\tau$ pixels. We considered a range of thresholds, while the main benchmark measure uses $\tau = 3$. This measure is only computed for the pixels which are visible in both images of the stereo pair.
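The quality measure can be written in a few lines (a sketch with hypothetical toy values; `valid` marks the pixels visible in both images):

```python
import numpy as np

def bad_pixel_rate(pred, gt, valid, tau=3.0):
    """Percentage of valid pixels whose predicted shift is off by more
    than tau pixels from the ground truth."""
    bad = (np.abs(pred - gt) > tau) & valid
    return 100.0 * bad.sum() / valid.sum()

pred  = np.array([10.0, 11.0, 20.0, 5.0])
gt    = np.array([10.5, 15.0, 20.0, 9.0])
valid = np.array([True, True, True, False])  # last pixel occluded
bad_pixel_rate(pred, gt, valid)  # -> 33.33...: 1 of 3 valid pixels off by > 3
```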
For comparison with the baselines, we used the setup proposed in Zbontar and LeCun (2016b) — the seminal work which introduced deep learning for stereo matching and which currently remains one of the best methods on the KITTI datasets. It is an extensive study which has a representative comparison of learning-based and non-learning-based methods under the same setup, with open-source code Zbontar and LeCun (2016a) for this setup. The whole pipeline works as follows. First, we obtain the raw scores from Algorithm 1 for all shifts up to the maximal disparity. Then we normalize the scores per-pixel by dividing them by the maximal score, thus bringing them into the range $[0, 1]$, suitable for running the post-processing code Zbontar and LeCun (2016a). Finally, we run the post-processing code with exactly the same parameters as the original method Zbontar and LeCun (2016b) and measure the quality on the same validation images.
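The per-pixel normalization step can be sketched as follows (assuming a non-negative raw score volume of shape (num_shifts, height, width); variable names are ours):

```python
import numpy as np

def normalize_scores(raw):
    """Divide each pixel's scores over all shifts by that pixel's
    maximal score, mapping them into [0, 1] (assumes a positive max)."""
    return raw / raw.max(axis=0, keepdims=True)

raw = np.array([[[2.0, 4.0]],
                [[1.0, 8.0]]])          # 2 shifts, 1x2 image
normalize_scores(raw)                   # -> [[[1., .5]], [[.5, 1.]]]
```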
We have two kinds of baselines in our evaluation: those coming from Zbontar and LeCun (2016b), and our simpler versions of deep feature transfer similar to Long et al. (2014), which do not consider paths.
The first group of baselines from Zbontar and LeCun (2016b) are the following: the sum of absolute differences “sad”, the census transform “cens” Zabih and Woodfill (1994), the normalized cross-correlation “ncc”. We also included the learning-based methods “fst” and “acrt” Zbontar and LeCun (2016b) for completeness, although they use training data to learn features while our method does not.
For the second group of baselines, we stack up the activation volumes for the given layer range, up-sampling the layer volumes if they have reduced resolution. Then we compute the normalized cross-correlation of the stacked features. Those baselines are denoted "corr(s, t)", where "s" is the starting layer and "t" is the last layer. Note that we correlate the features before applying ReLU, following what Zbontar and LeCun (2016b) do for the last layer; thus we use the input to the ReLU inside the layers.
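A sketch of this correlation baseline (the feature volumes are assumed already up-sampled to a common (channels, height, width) resolution; the epsilon guard and names are our additions):

```python
import numpy as np

def stacked_ncc(feats_left, feats_right):
    """Per-pixel normalized cross-correlation of stacked feature volumes:
    center and L2-normalize each pixel's channel vector across channels,
    then take the per-pixel dot product."""
    def normalize(f):
        f = f - f.mean(axis=0, keepdims=True)
        return f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-8)
    return (normalize(feats_left) * normalize(feats_right)).sum(axis=0)
```

Identical feature vectors correlate to 1 and orthogonal ones to 0, which makes the score directly comparable across pixels.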
All the methods, including ours, undergo the same post-processing pipeline. This pipeline consists of semi-global matching Hirschmuller (2005), left-right consistency check, sub-pixel enhancement by fitting a quadratic curve, median and bilateral filtering. We refer the reader to Zbontar and LeCun (2016b) for the full description. While the first group of baselines was tuned by Zbontar and LeCun (2016b) and we take the results from that paper, we had to tune the post-processing hyper-parameters of the second group of baselines to obtain the best results.
6.3 KITTI 2012
The dataset consists of 194 training image pairs and 195 test image pairs. Reflective surfaces like windshields were excluded from the ground truth.
The results in Table 2 show that our method "ours(2, 8)" performs better than the baselines. At the same time, its performance is lower than that of the learning-based methods from Zbontar and LeCun (2016b). The main promise of our method is scalability: while we test it on a task where a huge effort was invested into collecting training data, there are other important tasks without such extensive datasets.
|Threshold||sad||cens||ncc||corr(1, 2)||corr(2, 2)||corr(2, 8)||ours(2, 8)||fst||acrt|
6.4 Ablation study on KITTI 2012
The goal of this section is to understand how important the deep hierarchy of features is compared to using just one or a few layers. We compared the following setups: "ours(2, 2)" uses only the second layer; "ours(2, 3)" uses only the range from layer 2 to layer 3; "central(2, 8)" considers the full range of layers but takes only the central arcs of the convolutions (connecting the same pixel positions between activations) into account in the backward pass; "ours(2, 8)" is the full method. The results in Table 3 show that it is profitable to use the full hierarchy, both in terms of depth and of coverage of the receptive field.
|Threshold||ours(2, 2)||ours(2, 3)||central(2, 8)||ours(2, 8)|
6.5 KITTI 2015
The stereo dataset consists of 200 training image pairs and 200 test image pairs. The main differences to KITTI 2012 are that the images are colored and that reflective surfaces are present in the evaluation.
Conclusions similar to KITTI 2012 can be drawn from the experimental results: our method provides a reasonable transfer, being inferior only to the learning-based methods — see Table 4. We show our depth map results in Fig. 2.
|Threshold||sad||cens||ncc||corr(1, 2)||corr(2, 2)||corr(2, 8)||ours(2, 8)||fst||acrt|
6.6 Style transfer experiment on KITTI 2015
The goal of this experiment is to show the robustness of the recognition hierarchy for the transfer to correspondence search — something we advocated in the introduction as an advantage of our approach. We apply the style transfer method of Gatys et al. (2015), as implemented in the Prisma app, running different style transfers on the left and right images. While the images are now very different at the pixel level, their higher-level descriptions remain the same, which allows our method to still run successfully. The qualitative results in Fig. 3 show the robustness of our path-based method (see also Fig. 2 for a visual comparison to normal data).
In this work, we have presented a method for transfer from recognition to correspondence search at the lowest level. For that, we reuse activation paths from deep convolutional neural networks and propose an efficient polynomial-time algorithm to aggregate an exponential number of such paths. The empirical results on the stereo matching task show that our method is competitive among methods which do not use labeled data from the target domain. It would be interesting to apply this technique to sound, which should become possible once a high-quality deep convolutional model becomes accessible to the public (e.g., van den Oord et al. (2016)).
We would like to thank Dmitry Laptev, Alina Kuznetsova and Andrea Cohen for their comments about the manuscript. We also thank Valery Vishnevskiy for running our code while our own cluster was down. This work is partially funded by the Swiss NSF project 163910 “Efficient Object-Centric Detection”.
- Badrinarayanan et al.  Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561, 2015.
- Choy et al.  Christopher B Choy, JunYoung Gwak, Silvio Savarese, and Manmohan Chandraker. Universal correspondence network. In Advances in Neural Information Processing Systems, pages 2414–2422, 2016.
- Donahue et al.  J Donahue, Y Jia, O Vinyals, J Hoffman, N Zhang, E Tzeng, and T Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
- Eigen and Fergus  David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
- Gatys et al.  Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
- Geiger et al.  Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
- Girshick  Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
- Ham et al.  Bumsub Ham, Minsu Cho, Cordelia Schmid, and Jean Ponce. Proposal flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3475–3484, 2016.
- Hirschmuller  Heiko Hirschmuller. Accurate and efficient stereo processing by semi-global matching and mutual information. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 2, pages 807–814. IEEE, 2005.
- Kim et al.  Seungryong Kim, Dongbo Min, Bumsub Ham, Sangryul Jeon, Stephen Lin, and Kwanghoon Sohn. Fcss: Fully convolutional self-similarity for dense semantic correspondence. arXiv preprint arXiv:1702.00926, 2017.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Long et al.  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
- Long et al.  Jonathan L Long, Ning Zhang, and Trevor Darrell. Do convnets learn correspondence? In Advances in Neural Information Processing Systems, pages 1601–1609, 2014.
- Menze and Geiger  Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Razavian et al.  Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. arXiv preprint arXiv:1403.6382, 2014.
- Redmon et al.  Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- Ren et al.  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- Russakovsky et al.  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
- Sermanet et al.  Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
- Simonyan and Zisserman  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- van den Oord et al.  Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. CoRR abs/1609.03499, 2016.
- Yosinski et al.  Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- Zabih and Woodfill  Ramin Zabih and John Woodfill. Non-parametric local transforms for computing visual correspondence. In European conference on computer vision, pages 151–158. Springer, 1994.
- Zbontar and LeCun [2016a] Jure Zbontar and Yann LeCun. MC-CNN github repository. https://github.com/jzbontar/mc-cnn, 2016a.
- Zbontar and LeCun [2016b] Jure Zbontar and Yann LeCun. Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1-32):2, 2016b.