Learning Image Matching by Simply Watching Video

03/19/2016 ∙ by Gucan Long, et al. ∙ Australian National University

This work presents an unsupervised learning based approach to the ubiquitous computer vision problem of image matching. We start from the insight that the problem of frame-interpolation implicitly solves for inter-frame correspondences. This permits the application of analysis-by-synthesis: we first train and apply a Convolutional Neural Network for frame-interpolation, then obtain correspondences by inverting the learned CNN. The key benefit behind this strategy is that the CNN for frame-interpolation can be trained in an unsupervised manner by exploiting the temporal coherency that is naturally contained in real-world video sequences. The present model therefore learns image matching by simply watching videos. Besides promising to be more generally applicable, the presented approach achieves surprisingly good performance, comparable to traditional empirically designed methods.




1 Introduction

Figure 1: We train a deep convolutional network for frame interpolation, which can be done without manual supervision by exploiting the temporal coherency that is naturally contained in real-world video sequences. The learned CNN is then used to compute a sensitivity map for each output pixel. This sensitivity map, i.e. the gradients w.r.t. the input, indicates how much each input pixel influences a particular output pixel. The two input pixels (one per input frame) that have the maximum influence are considered as a match. Though indirect, the present model learns how to perform dense correspondence matching by simply watching video.

For human beings, vision and the way the brain uses visual information are learned skills. Meanwhile, the ultimate goal of computer vision research is to teach machines to understand the visual world. Obviously, we cannot teach everything by hand, i.e. via empirically designed, hand-crafted models. It would be more desirable and practicable if machines could learn vision by themselves. This work focuses on the fundamental problem of establishing 2D-2D correspondences across a pair of consecutive frames, and shows that a solution to this low-level vision problem can be obtained in an unsupervised way by relying only on natural video sequences.

Our key insight lies in the understanding that frame interpolation implicitly solves for dense correspondences between the input image pair. It is well known that dense matching can be regarded as a sub-problem of frame-interpolation, as the interpolation could be immediately generated by correspondence-based image warping once dense inter-frame matches are available. It then comes as no surprise that if we were able to train a deep neural network for frame interpolation, its application would implicitly also generate knowledge about dense image correspondences. Retrieving this knowledge is known as analysis by synthesis [1], a paradigm in which learning is described as the acquisition of a measurement synthesising model, and inference of generating parameters as model inversion once correct synthesis is achieved. In our context, synthesis simply refers to frame interpolation. We then, for the analysis part, show that the correspondences can be recovered from the network through gradient back-propagation, which produces sensitivity maps for each interpolated pixel. The procedure is summarised in Figure 1, explaining how the reciprocal mapping between frame-interpolation and dense correspondences is encoded in the forward and backward propagation through one and the same network architecture. We call our approach MIND, which stands for Matching by INverting a Deep neural network (here, inverting is understood as back-propagation through the given deep neural network).

The key benefit of MIND lies in the fact that the deep convolutional network for frame-interpolation can be trained from ordinary video sequences without any man-made ground truth signals. The training data in our case is given by triplets of images, each one consisting of two input images and one output image that represents the ground-truth interpolated frame. A correct example of a ground truth output image is an image that—when inserted in between the input pair of images—forms a temporally coherent sequence of frames. Such temporal coherency is naturally contained in regular video sequences, which allows us to simply use triplets of sequential images from almost arbitrary video streams for training our network. The first and the third frame of each triplet are used as inputs to the network, and the second frame as the ground truth interpolated frame. Most importantly, since the inversion of our network returns frame-to-frame correspondences, it therefore learns how to do image matching without any requirement for manually designed models or expensive ground truth correspondences. In other words, the presented approach learns image matching by simply “watching videos”.

The paper is organized as follows. Section 2 reviews relevant prior work. Section 3 explains the present analysis-by-synthesis approach, including both the analysis part of how MIND works and the synthesis part of the deep convolutional architecture for frame interpolation. Section 4 demonstrates the surprising performance of the present purely unsupervised learning approach, which is comparable to several traditional empirically designed methods. Section 5 finally discusses our contribution and provides an outlook on future work.

2 Related Work

Deep learning meets image matching. Image matching is a classical problem in computer vision. Here we limit the discussion to recent works that address image matching through learning based approaches. Roughly speaking, there exist two lines of research for this topic: the first one consists of making use of features or representations learned by deep neural networks, which are either originally trained for other tasks such as object recognition [2, 3], or specially designed and trained for the purpose of image matching [4, 5, 6]. The second major line of research employs deep neural networks to compute the similarity between image patches [7, 8, 9]. In contrast to our work, the cited contributions mainly address sub-modules of image matching (feature extraction or matching cost computation), rather than providing end-to-end solutions. An exception is given by FlowNet [10], which presents an interesting deep learning based approach for dense optical flow computation. It does, however, depend on ground truth flow for training the network.

It is also worth mentioning that the gated restricted Boltzmann machine model proposed by Memisevic and Hinton [11] and later extended by Taylor et al. [12] can also be trained in an unsupervised manner and applied to infer constrained image transforms such as flow fields for “shifting pixel”. However, this line of work mainly aims at learning motion features for understanding video data. It is similar to the work on temporal coherence learning mentioned below.

Temporal coherence learning. Unsupervised learning is a broad topic in the field of machine learning. Our discussion here focuses on works that exploit temporal coherency in natural videos, sometimes also called temporal coherence learning [13, 14, 15]. As a recent representative work, Wang et al. [16] exploit temporal coherency by visual tracking in videos, and report that the learned representation achieves competitive performance compared to some supervised alternatives. While temporal coherence learning mostly aims at learning features or representations, some recent works on reconstructing and predicting video frames in an unsupervised setting [17] are closely related to our work as well. Srivastava et al. [18] use an encoder LSTM to map input sequences into a fixed length representation, and use the latter for reconstructing the input or even predicting future frames. Goroshin et al. [19] consider videos as one-dimensional, time-parametrized trajectories embedded in a low dimensional manifold. They train deep feature hierarchies that linearise the transformations observed in natural video sequences for the purpose of frame prediction. Though related to our work, these works do not aim at image matching. It would be interesting to apply our concept of matching by inverting to the above models for temporal coherence learning.

Network inversion. Note that inverting a learned network is traditionally defined as reconstructing the input from the output of an artificial neural network [20]. Mahendran et al. [21] and Dosovitskiy et al. [22] apply this concept to understand what information is preserved by a network. In our context, inverting a network means back-propagation through a learned network in order to obtain the gradient map with respect to the input signals. Interestingly, the idea has already been introduced in the work of Simonyan et al. [23], emphasizing that the retrieved sensitivity maps may serve to identify image-specific class saliency. Similarly, Bach et al. [24] employ gradient maps as a measure for the contribution of single pixels to the decision of a nonlinear classifier, thus helping to explain how decisions are made.

3 Methodology

The analysis by synthesis approach for dense image matching is described in this section. We first explain the analysis part, i.e. how to obtain correspondences given the trained neural network and the interpolated image. For the synthesis part, we then describe the detailed architecture of the deep convolutional network designed for frame interpolation.

3.1 Matching by Inverting a Deep neural network

Assuming that we have a well trained deep neural network for frame interpolation at hand, the core technical question behind our work is how to recover the correspondences between the input pair of images from it. As explained previously, dense correspondence matching may be regarded as a sub-problem of frame-interpolation, which is why we should be able to trace back the matches starting from the interpolated frame generated during the forward-propagation through the trained network. Our task then consists of back-tracking each pixel in the output image to exactly one pixel in each of the two input images. Note that this back-tracking does not mean reconstructing the input images from the output one. Instead, we only need to find the pixels in each input image which have the maximum influence on each pixel of the output image.

We perform back-tracking by applying a technique similar to the one adopted by Simonyan et al. [23]. For each pixel in the output image, we compute the gradient of its value with respect to each input pixel, thus telling us how much it is under the influence of individual pixels at the input. The gradient is computed based on back-propagation, and leads to sensitivity or influence maps at the input of the network.

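From a more formal perspective, our approach may be explained as follows. Let $I_2 = \mathcal{F}(I_1, I_3)$ denote a non-linear function (i.e. the trained deep neural network) that describes the mapping from two input images $I_1$ and $I_3$ to an interpolated image $I_2$ lying approximately at the “center” of the input frames. Thinking of $\mathcal{F}$ as a vectorial mapping, it can be split up into $h \times w$ non-linear sub-functions (one per pixel of an $h \times w$ image), each one producing the corresponding pixel in the output image

$$\mathcal{F}(I_1, I_3) = \begin{bmatrix} f^{11}(I_1, I_3) & \cdots & f^{1w}(I_1, I_3) \\ \vdots & \ddots & \vdots \\ f^{h1}(I_1, I_3) & \cdots & f^{hw}(I_1, I_3) \end{bmatrix}.$$

In order to produce the sensitivity maps, we apply back-propagation to compute the Jacobian matrix with respect to each input image individually. The Jacobian with respect to the first image is given by

$$\frac{\partial \mathcal{F}}{\partial I_1} = \begin{bmatrix} \frac{\partial f^{11}}{\partial I_1} & \cdots & \frac{\partial f^{1w}}{\partial I_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f^{h1}}{\partial I_1} & \cdots & \frac{\partial f^{hw}}{\partial I_1} \end{bmatrix},$$

illustrating that this derivative results in one $h \times w$ matrix for each one of the $h \times w$ pixels at the output. The Jacobian with respect to $I_3$ is given in a similar way. Let us define the absolute gradients of the output point $(i, j)$ with respect to each one of the input images, evaluated for the concrete inputs $\hat{I}_1$ and $\hat{I}_3$. They are given by

$$G^{ij}_{1} = \mathrm{abs}\!\left( \left. \frac{\partial f^{ij}}{\partial I_1} \right|_{(\hat{I}_1, \hat{I}_3)} \right), \qquad G^{ij}_{3} = \mathrm{abs}\!\left( \left. \frac{\partial f^{ij}}{\partial I_3} \right|_{(\hat{I}_1, \hat{I}_3)} \right),$$

where $\mathrm{abs}(\cdot)$ replaces each entry of a matrix by its absolute value. The gradient maps produced in this way are the sought sensitivity or influence maps, which may now serve to derive the coordinates of each correspondence. We extract the most responsible point in each gradient map, and connect those two points in order to return the correspondence.

In the spirit of unsupervised learning, we opted for the simplest possible choice of taking the coordinates of the maximum entry in $G^{ij}_{1}$ and $G^{ij}_{3}$, respectively. Let us denote these points with $c^{ij}_{1}$ and $c^{ij}_{3}$. By computing the two gradient maps for each point in the output image and extracting each time the most responsible point, we thus obtain the following two lists of points

$$C_1 = \{ c^{ij}_{1} \}, \qquad C_3 = \{ c^{ij}_{3} \}, \qquad i = 1, \dots, h, \; j = 1, \dots, w.$$

The set of correspondences $S$ is then given by combining same-index elements from $C_1$ and $C_3$, eventually resulting in

$$S = \{ (c^{ij}_{1}, c^{ij}_{3}) \mid i = 1, \dots, h, \; j = 1, \dots, w \}.$$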
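The back-tracking procedure of this section can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_interpolator` is a hypothetical hand-written stand-in for the trained CNN, and the gradients are approximated by central finite differences instead of back-propagation, purely to keep the example free of dependencies. All function names are assumptions made for illustration.

```python
import math


def toy_interpolator(frame1, frame3):
    """Hypothetical stand-in for the trained CNN (not the authors' network).

    Each output pixel depends mostly on its left neighbour in frame1 and its
    right neighbour in frame3, mimicking a scene that moves one pixel to the
    right per frame. The tanh keeps the mapping non-linear.
    """
    h, w = len(frame1), len(frame1[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            left = frame1[i][max(j - 1, 0)]
            right = frame3[i][min(j + 1, w - 1)]
            out[i][j] = math.tanh(
                0.45 * left + 0.45 * right
                + 0.05 * frame1[i][j] + 0.05 * frame3[i][j]
            )
    return out


def sensitivity_map(net, frame1, frame3, out_pixel, wrt, eps=1e-4):
    """Absolute gradient of one output pixel w.r.t. every pixel of one input.

    `wrt` is 0 for frame1 and 1 for frame3. Central finite differences are
    used here in place of back-propagation.
    """
    i, j = out_pixel
    frames = [[row[:] for row in frame1], [row[:] for row in frame3]]
    target = frames[wrt]
    grad = [[0.0] * len(target[0]) for _ in target]
    for r in range(len(target)):
        for c in range(len(target[0])):
            original = target[r][c]
            target[r][c] = original + eps
            plus = net(frames[0], frames[1])[i][j]
            target[r][c] = original - eps
            minus = net(frames[0], frames[1])[i][j]
            target[r][c] = original
            grad[r][c] = abs((plus - minus) / (2 * eps))
    return grad


def argmax_2d(grid):
    """Coordinates (row, col) of the largest entry of a 2D list."""
    return max(
        ((r, c) for r in range(len(grid)) for c in range(len(grid[0]))),
        key=lambda rc: grid[rc[0]][rc[1]],
    )


def match_pixel(net, frame1, frame3, out_pixel):
    """Pair of most influential input pixels for one output pixel."""
    g1 = sensitivity_map(net, frame1, frame3, out_pixel, wrt=0)
    g3 = sensitivity_map(net, frame1, frame3, out_pixel, wrt=1)
    return argmax_2d(g1), argmax_2d(g3)


frame1 = [[0.1, 0.9, 0.2, 0.3, 0.1],
          [0.2, 0.8, 0.1, 0.4, 0.3],
          [0.3, 0.1, 0.5, 0.2, 0.6]]
frame3 = [[0.2, 0.1, 0.3, 0.9, 0.2],
          [0.1, 0.3, 0.2, 0.8, 0.4],
          [0.4, 0.2, 0.1, 0.3, 0.5]]
print(match_pixel(toy_interpolator, frame1, frame3, (1, 2)))
```

For the toy interpolator, the output pixel at row 1, column 2 is traced back to column 1 in the first frame and column 3 in the second, which is the one-pixel-per-frame motion built into the stand-in. With a real network, each sensitivity map would come from one backward pass.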


3.2 Deep neural network for Frame Interpolation

The architecture of our frame-interpolation network is inspired by FlowNetSimple as presented in Fischer et al. [10]. As illustrated in Figure 2, it consists of a Convolutional Part and a Deconvolutional Part. The two parts serve as “encoder” and “decoder” respectively, similar to the auto-encoder architecture presented by Hinton and Salakhutdinov [25]. The basic block within the Convolutional Part, denoted Convolution Block, follows the common pattern of the convolutional neural network architecture:

INPUT -> [CONV -> PRELU] × 3 -> POOL -> OUTPUT

The Parametric Rectified Linear Unit (PReLU) [26] is adopted in our work. Following the suggestions from VGG-Net [27], we set the size of the receptive field of all convolution filters to three (along with a stride and a padding of one) and duplicate [CONV -> PRELU] three times to better model the non-linearity.

The Deconvolutional Part consists of Deconvolution Blocks, each one including a convolution transpose layer [28] and two convolution layers. The convolution transpose layer has a receptive field of four, a stride of two, and a padding of one. The pattern of the Deconvolution Block follows:

INPUT -> [CONVT -> PRELU] -> [CONV -> PRELU] × 2 -> OUTPUT

In order to maintain fine-grained image details in the interpolated frame, we make a copy of the output features produced by Convolution Blocks 2, 3, and 4, and concatenate them as an additional input to the Deconvolution Blocks 4, 3, and 2, respectively. This concept is illustrated by the side arrows in Figure 2, and similar ideas have already been used in prior work [10, 29]. Recent works [30, 31] indicate that the ‘side arrows’ may also help to better train the deep network.

It is easy to notice that our network is a fully convolutional one, thus allowing us to feed it with images of different resolutions. This is an important advantage, as different datasets may use different height-to-width ratios. The output blob size for each block in our network is listed in Table 1.

Figure 2: Architecture of our network. The network takes 2 RGB images as an input to produce the interpolated RGB image. Please note that Dconv Block 4 takes the outputs from both Conv Block 2 and Dconv Block 5 as input. Dconv Block 3 and Dconv Block 2 have a similar input configuration.
Block   Input  Conv1  Conv2  Conv3  Conv4  Conv5  Dconv5  Dconv4  Dconv3  Dconv2  Dconv1  Output
Depth   6      96     96     128    128    128    128     128     128     96      96      3
Height  128    64     32     16     8      4      8       16      32      64      128     128
Width   384    192    96     48     24     12     24      48      96      192     384     384
Table 1: The table lists the output blob size of each block in our network. Note that we stack two RGB images into one input blob, and thus the depth is 6. The output of the network is an RGB image and thus the depth equals 3. The indicated widths are for the network trained on KITTI. The ones for the Sintel data are easily obtained, the only difference being that the input images are scaled to 256×128 rather than 384×128.
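The spatial sizes in Table 1 follow from the standard output-size formulas for convolution and transposed convolution. The sketch below recomputes them under two assumptions that are not stated explicitly in the text: each Convolution Block ends with a 2×2 down-sampling step of stride 2, and the blocks are numbered as in the table. The function names are illustrative only.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


def block_sizes(height, width, num_blocks=5):
    """Map each block name to its (height, width) output size."""
    sizes = {"Input": (height, width)}
    h, w = height, width
    for k in range(1, num_blocks + 1):
        for _ in range(3):  # three 3x3 convolutions, stride 1, padding 1
            h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
        # Assumed 2x2 down-sampling with stride 2 at the end of the block.
        h, w = conv_out(h, 2, 2, 0), conv_out(w, 2, 2, 0)
        sizes[f"Conv{k}"] = (h, w)
    for k in range(num_blocks, 0, -1):
        # Transposed convolution: kernel 4, stride 2, padding 1.
        h, w = deconv_out(h, 4, 2, 1), deconv_out(w, 4, 2, 1)
        for _ in range(2):  # two 3x3 convolutions, stride 1, padding 1
            h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
        sizes[f"Dconv{k}"] = (h, w)
    return sizes


kitti = block_sizes(128, 384)
sintel = block_sizes(128, 256)
print(kitti["Conv5"], sintel["Conv5"])
```

With a 384×128 input this gives the heights and widths listed in Table 1. With the 256×128 Sintel input only the widths change (for example, 8 instead of 12 at the bottleneck).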

4 Experiments

In this section, we first explain the implementation details behind MIND such as training data and loss function. We then present proof-of-concept examples for MIND, followed by a discussion of the generalization ability of the trained CNN. We finally evaluate MIND in terms of quantitative matching performance and compare it to traditional image matching methods.

4.1 Implementation Details

Training Data: Quantity and quality of training data are crucial for training a deep neural network. However, our case is particularly easy as we can simply use huge amounts of real-world videos. In this work, we focus on training with the KITTI RAW videos [32] and Sintel videos (Sintel, the Durian Open Movie Project, https://durian.blender.org/) and show that the resulting learned network performs reasonably well. The network is first trained with the KITTI RAW video sequences, which are captured by driving around the city of Karlsruhe, through rural areas and over highways. The dataset contains 56 image sequences with in total 16,951 frames. For each sequence, we take every three consecutive frames (both in forward and backward direction) as a training triplet, where the first and the third image serve as inputs to the network and the second image as the corresponding output. These images are then augmented by vertical flipping, horizontal flipping and a combination of both. The total number of sample triplets is 133,921. We then fine-tune the network on examples selected from the original Sintel movie. We manually collected 63 video clips with in total 5,670 frames from the movie. After grouping and data augmentation we finally obtain 44,352 sample triplets. Note that, compared to the KITTI sequences which are recorded with relatively uniform velocity, the Sintel sequences represent more difficult training examples in the context of our work, as they contain a lot of fast and unrealistic motion captured with a frame rate of only 24 fps. A significant portion of the Sintel samples therefore does not contain the required temporal coherence. We will discuss this issue further in Section 4.2.
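The triplet construction described above can be sketched as follows. This is an illustrative reading of the text, assuming a sliding window with a step of one frame and four variants per triplet (original, vertical flip, horizontal flip, both). The helper names are hypothetical, and images are represented as plain nested lists.

```python
def make_triplets(frames):
    """Sliding-window triplets (input, target, input) in both directions."""
    triplets = []
    for k in range(len(frames) - 2):
        first, second, third = frames[k], frames[k + 1], frames[k + 2]
        triplets.append((first, second, third))  # forward direction
        triplets.append((third, second, first))  # backward direction
    return triplets


def flip_vertical(image):
    return image[::-1]


def flip_horizontal(image):
    return [row[::-1] for row in image]


def augment(triplet):
    """Original triplet plus its vertical, horizontal and combined flips."""
    def both(image):
        return flip_horizontal(flip_vertical(image))
    return [
        triplet,
        tuple(flip_vertical(img) for img in triplet),
        tuple(flip_horizontal(img) for img in triplet),
        tuple(both(img) for img in triplet),
    ]


def count_triplets(total_frames, num_clips):
    """Triplet count under the assumptions above.

    A clip with n frames yields n - 2 windows, each used in 2 directions
    and 4 flip variants.
    """
    return (total_frames - 2 * num_clips) * 2 * 4


print(count_triplets(5670, 63))
```

Under these assumptions the formula reproduces the 44,352 Sintel triplets reported above. Applied to the KITTI figures it gives a slightly different number than the one reported, so the exact grouping used for KITTI may differ from this sketch.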

Loss Function: Several previous works [19, 16] mention that minimizing the L2 loss between the output frame and the training example may lead to unrealistic and blurry predictions. We have not been able to confirm this throughout our experiments, but found that the Charbonnier loss commonly employed for robust optical flow computation [33] leads to an improvement over the L2 loss. We employ it to train our network, with $\epsilon$ set to 0.1.
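For reference, a minimal sketch of the Charbonnier penalty in its common form, sqrt(x² + ε²), averaged over all pixel residuals. The exact variant and normalization used for training are not given in the text, so this form and the function names are assumptions.

```python
import math


def charbonnier_loss(prediction, target, eps=0.1):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over flattened pixel values."""
    residuals = [p - t for p, t in zip(prediction, target)]
    return sum(math.sqrt(d * d + eps * eps) for d in residuals) / len(residuals)


def l2_loss(prediction, target):
    """Mean squared error over flattened pixel values, for comparison."""
    residuals = [p - t for p, t in zip(prediction, target)]
    return sum(d * d for d in residuals) / len(residuals)


# A single large residual dominates the L2 loss but grows only linearly
# under the Charbonnier penalty.
print(charbonnier_loss([10.0, 0.0], [0.0, 0.0]), l2_loss([10.0, 0.0], [0.0, 0.0]))
```

For large residuals the penalty behaves like an absolute difference, while near zero it stays smooth, which is why it is considered more robust to outliers than the squared error.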

Training Details: The training is performed using Caffe [34] on a machine with two K40c GPUs. The weights of the network are initialized by Xavier’s approach [35] and optimized by the Adam solver [36] with a fixed momentum of 0.9. The initial learning rate is set to 1e-3 and then manually reduced whenever the loss stops decreasing. For training on the KITTI RAW data, the images are scaled to 384×128. For training on the Sintel dataset, the images are scaled to 256×128. The batch size is 16. We train on KITTI RAW from scratch for about 20 epochs, and then fine-tune on the Sintel movie images for 15 epochs. We did not observe over-fitting during training, and terminated the training after 5 days.

Execution time: MIND can be applied to different scenarios (e.g. sparse or dense matching). We focus here on semi-dense image matching in order to obtain a result comparable with other methods. We compute the correspondences across the input images for each corner of a predefined raster grid of 4 pixels width in the interpolated image. Note that MIND currently depends on a large amount of computational resources as it performs back-propagation through the entire network for every pixel that needs to be matched. For an image of size 384×128, each forward pass through our network takes 40ms on a PC with a K40c GPU, and each backward pass takes 158ms. For each image pair, we need to perform one forward pass to first obtain the interpolation. We then need to perform (384 × 128) / (4 × 4) = 3072 backward passes to find the correspondences, resulting in a total of about 486 seconds (8 minutes).
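The runtime estimate can be retraced with a few lines of arithmetic. The sketch only uses the figures quoted above (image size, grid width, and per-pass timings). The function name is illustrative.

```python
def matching_cost_ms(width, height, grid, forward_ms, backward_ms):
    """Number of grid points and total time for one image pair, in ms.

    One forward pass produces the interpolation; one backward pass is
    needed per grid point to obtain its sensitivity maps.
    """
    num_points = (width // grid) * (height // grid)
    return num_points, forward_ms + num_points * backward_ms


points, total_ms = matching_cost_ms(384, 128, 4, 40, 158)
print(points, total_ms / 1000.0)
```

The backward passes account for essentially all of the cost, so the runtime scales linearly with the number of pixels to be matched.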

4.2 Qualitative examples for Interpolation and Matching

We present here visual examples as proofs of concept for how the present approach works on both tasks of frame interpolation and image matching. We further discuss the generalization ability of the trained model.

Figure 3: Examples of frame interpolation (best viewed in colour). From left to right: example on KITTI, Sintel, ETH Multi-Person Tracking dataset [37] and Bonn Benchmark on Tracking [38], respectively. In each column, the first image is an overlay of the two input frames. The second one is the interpolated image obtained by our network. For the first example, we use the network trained on KITTI itself. For all others, we use the network fine-tuned on Sintel data.

Examples for frame interpolation: We show examples of frame interpolation in Figure 3. The first two columns show examples on KITTI and Sintel images, which are taken from the validation datasets originally collected for the purpose of monitoring the network training process. It can be seen that the trained CNNs cover the motion correctly for both KITTI and Sintel image pairs. It can also be noticed that some fine-grained details are not preserved well in both examples, even though we paid special attention to this when designing the convolutional architecture, cf. Section 3.2. Nevertheless, we would like to remind the reader that the goal of the present work is not to provide a state-of-the-art frame interpolation algorithm. For the goal of image matching, we will see that the preservation of perfect image details is in fact not necessary.

Examples for image matching: Here we present examples to demonstrate how MIND obtains correspondences given the trained CNNs for frame interpolation. The examples taken from KITTI and Sintel videos are shown in Figure 4. By computing the gradient of manually marked pixels in the interpolated image, MIND successfully obtains correct correspondences between the 2 input images. It can be seen that the correct correspondences are obtained even in some fast moving areas where fine-grained image details are lost, e.g. the area of the character’s shaking hand in the Sintel example.

We further show one failure example taken from Sintel images. In Figure 5, it can be observed that the interpolation fails as the motion of the small dragon and the character’s hand has not been covered correctly. It then comes as no surprise that MIND fails to extract correct matches for almost all of the selected points. However, it is worth noting that match No. 4 has better quality than the others, whose corresponding gradient maps are less distinctive. The matching score/confidence returned by MIND is inspired by this behaviour and defined as the ratio between the maximum gradient intensity and the mean gradient intensity within a small area around the maximal gradient location.
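A minimal sketch of this matching score is given below. It uses a 20×20 window as stated in the caption of Figure 4. How the window is centred and how it is clipped at the image border are not specified in the text, so those details, like the function name, are assumptions.

```python
def matching_score(grad_map, window=20):
    """Peak location and ratio of the peak to the local mean gradient.

    grad_map is a 2D list of non-negative gradient magnitudes. The window
    is centred on the peak and clipped at the image border (assumption).
    """
    h, w = len(grad_map), len(grad_map[0])
    peak_r, peak_c = max(
        ((r, c) for r in range(h) for c in range(w)),
        key=lambda rc: grad_map[rc[0]][rc[1]],
    )
    peak = grad_map[peak_r][peak_c]
    half = window // 2
    r0, r1 = max(peak_r - half, 0), min(peak_r + half, h)
    c0, c1 = max(peak_c - half, 0), min(peak_c + half, w)
    values = [grad_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    mean = sum(values) / len(values)
    score = peak / mean if mean > 0 else 0.0
    return (peak_r, peak_c), score


# A flat map is not distinctive at all; a single sharp peak is.
flat = [[1.0] * 30 for _ in range(30)]
sharp = [[0.0] * 30 for _ in range(30)]
sharp[15][15] = 1.0
print(matching_score(flat)[1], matching_score(sharp)[1])
```

A score close to 1 indicates a diffuse gradient map without a clear maximum, while a large score indicates a distinctive peak, which is the behaviour described for the failure example.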

Figure 4: Two matching examples for image pairs taken from the KITTI RAW video and the Sintel movie clip (best viewed in colour). For each example, the corresponding row of images shows input image 1, the interpolated image, and then input image 2 (from left to right). The red points mark five sample correspondences. The two rows below each example show the gradient/saliency maps for each match (from left to right) in each input image (maps for input image 1 on top, and maps for input image 2 at the bottom). The figures also indicate the coordinates of the maximal gradient location (P) along with the corresponding matching score (S). The matching score is defined as the ratio between the maximum gradient intensity and the mean gradient intensity within a 20×20 area around P.
Figure 5: Failure example of MIND for image pair taken from the Sintel movie clip (best viewed in colour). The gradient/saliency maps (from left to right) are for matches labelled as 1, 2, …, 5, respectively.

As illustrated in Section 4.3, the general performance of MIND, especially on KITTI images, is good. The failure example in Figure 5 outlines an extreme case in the Sintel sequences dominated by fast and highly non-rigid motion in the scene.

Generalization ability: Good generalization ability is essential for learning based approaches. Although MIND enjoys the benefit that it can learn image matching by just “watching videos” (i.e. it could first be fine-tuned on the given image sequences and then perform interpolation and matching), it is important to verify whether the present CNN indeed learns to interpolate frames and match images, rather than merely “remembering” KITTI-like or Sintel-like images.

We demonstrate the generalization ability of the trained CNN by applying it to images taken from the ETH Multi-Person Tracking dataset [37] and the Bonn Benchmark on Tracking [38], which have not been used for either training or fine-tuning. The results are shown in Figure 3, from which we can see that the trained CNN again covers the motion correctly. This provides evidence about what has been learned by “watching videos”.

Figure 6: Examples of MIND on DICOM images. There are two examples shown in different rows. For each example, the overlay of input image-pair, 1st input image, interpolation returned by the CNN and the 2nd input image are shown in the columns from left to right, respectively. The red points in columns 2, 3 and 4 indicate the matches obtained by MIND.

The generalization ability is further illustrated by applying MIND to DICOM images of coronary angiograms (the images are taken from a DICOM sample image set, http://www.osirix-viewer.com/datasets/, alias name GRUSELAMBIX). In Figure 6, it can be seen that these images are substantially different from natural ones. Though again failing to preserve perfect image details, the CNN, which is trained on natural images, performs impressively well on the DICOM images. The good generalization ability of the CNN is underlined by the results on both frame interpolation and image matching.

4.3 Quantitative Performance of Image Matching

Metric       MIND   DeepM  HoG    KLT
APE          4.695  3.442  9.680  8.157
Accuracy@5   0.716  0.835  0.455  0.702
Accuracy@10  0.915  0.953  0.805  0.826
Accuracy@20  0.981  0.987  0.929  0.903
Accuracy@30  0.993  0.993  0.959  0.938
Table 2: Matching performance on the KITTI 2012 flow training set. DeepM denotes DeepMatching. Metrics: Average Point Error (APE) (the lower the better), and Accuracy@T (the higher the better). Bold numbers indicate best performance, underlined numbers 2nd best.
Metric       MIND   DeepM  HoG    KLT
APE          5.838  3.240  7.856  8.836
Accuracy@5   0.719  0.875  0.688  0.808
Accuracy@10  0.876  0.951  0.875  0.864
Accuracy@20  0.948  0.977  0.947  0.906
Accuracy@30  0.967  0.986  0.964  0.927
Table 3: Matching performance on the MPI-Sintel training set (Final pass). DeepM denotes DeepMatching. Metrics: Average Point Error (APE) (the lower the better), and Accuracy@T (the higher the better). Bold numbers indicate best performance, underlined numbers 2nd best.

We compare the matches produced by MIND against those of several empirically designed methods: the classical Kanade–Lucas–Tomasi (KLT) feature tracker [39], HoG descriptor matching [40] (which is widely employed to boost dense optical flow computation), and the more recent DeepMatching approach [41], which relies on a multilayer convolutional architecture and achieves state-of-the-art performance. As observed in [41], comparing different matching algorithms is delicate because they usually produce different numbers of matches for different parts of the image. For the sake of a fair comparison, we adjust the parameters of each algorithm so that it produces as many matches as possible, distributed as homogeneously as possible across the input images. For DeepMatching, we use the default parameters. For MIND, we extract correspondences for each corner of a uniform grid of 4 pixels width. For KLT, we set minEigThreshold to 1e-9 to generate as many matches as possible. For HoG, we again set the pixel sampling grid width to 4. We then sort the matches according to a suitable metric for each algorithm and select the same number of “best” matches for each one. For DeepMatching, we sort the matches according to the matching score given by the open source code [41]. For KLT, the metric is the error returned by the OpenCV implementation [42]. For HoG, we use the matching score defined in [40]. For MIND, the matching score is defined as the ratio between the maximum gradient intensity and the mean gradient intensity within a 20×20 area around the maximal gradient location. In this way, the 4 algorithms produce the same number of matches with similar coverage over each input image.

The comparisons are performed on both KITTI [32] and MPI-Sintel [43] training sets, where ground truth correspondences can be extracted from the available ground truth flow fields. We perform all of our experiments at the same image resolution as the one used by our network. Ideally, both image matching and optical flow would be evaluated on the official KITTI and MPI-Sintel benchmarks. Because MIND is currently designed only for resolution-reduced images, we cannot process the benchmark datasets directly; instead, we apply all algorithms locally to the test datasets, followed by the standard evaluation and error metrics known from prior art. On KITTI, the images are scaled to 384×128, and for MPI-Sintel, to 256×128. We use the network trained on the KITTI RAW sequences for the matching experiment on the KITTI Flow 2012 training set. We then use the network fine-tuned on Sintel movie clips for the experiments on the MPI-Sintel Flow training set. The 4 algorithms are evaluated in terms of the Average Point Error (APE) and the Accuracy@T. The latter is defined as the proportion of “correct” matches from the first image with respect to the total number of matches [44]. A match is considered correct if its pixel match in the second image is closer than T pixels to ground-truth.
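The two matching metrics can be written down in a few lines. This is a sketch based on the definitions above, assuming the Euclidean distance as point error and a strict "closer than T" comparison. The function names and the sample points are made up for illustration.

```python
import math


def point_errors(predicted, ground_truth):
    """Euclidean distance between each predicted and ground-truth point."""
    return [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]


def average_point_error(predicted, ground_truth):
    errors = point_errors(predicted, ground_truth)
    return sum(errors) / len(errors)


def accuracy_at(predicted, ground_truth, threshold):
    """Fraction of matches whose error is strictly below the threshold."""
    errors = point_errors(predicted, ground_truth)
    return sum(1 for e in errors if e < threshold) / len(errors)


# Hypothetical matched locations in the second image and their ground truth.
predicted = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
ground_truth = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
print(average_point_error(predicted, ground_truth))
print(accuracy_at(predicted, ground_truth, 10))
```

Since the number of matches is equalised across algorithms before evaluation, both metrics are computed over the same number of points for each method.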

As can be observed in Table 2 and Table 3, DeepMatching produces matches with the highest quality in terms of all metrics and on both MPI-Sintel and KITTI sets. Notably, MIND performs very close to DeepMatching on KITTI and outperforms KLT tracking and HoG matching by a considerable amount in terms of Accuracy@10 and Accuracy@20. It is surprising to see that MIND—an unsupervised learning based approach—works so well. The performance on MPI-Sintel however drops a bit due to the difficulty of the contained unrealistic motion. Though the APE measure indicates better performance than HoG and KLT, it is only safe to conclude that MIND remains competitive in terms of overall performance on MPI-Sintel, which can be seen further in the next section.

4.4 Ability to Initialise Optical Flow Computation

To further understand the matching quality produced by MIND, we replace the DeepMatching part of DeepFlow [41] with MIND to see whether MIND matches are able to boost optical flow performance in a similar way as DeepMatching, HoG or KLT matches. Similar to the evaluation in [41], we feed DeepFlow with matches obtained by each matching method in the previous section. The parameters (e.g. the matching weight) of DeepFlow are tuned accordingly to make best use of the pre-obtained matches. Note that we scale down the input images to 384×128 for KITTI and 256×128 for MPI-Sintel. We then up-size the obtained flow field to the original resolution by bilinear interpolation, in order to compare results at full resolution.
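Up-sizing a flow field involves one detail that is easy to miss: besides interpolating the field spatially, the displacement values themselves have to be multiplied by the scale factors so that they are expressed in full-resolution pixels. The text does not spell out this step, so the sketch below is an assumption about how such an up-sizing is typically done, with a hypothetical function name and pixel-centre alignment chosen for illustration.

```python
def upscale_flow(flow, out_h, out_w):
    """Bilinearly resize a flow field and rescale its displacement vectors.

    flow is a list of rows, each row a list of (u, v) tuples in pixels of
    the low-resolution image.
    """
    in_h, in_w = len(flow), len(flow[0])
    sx, sy = out_w / in_w, out_h / in_h
    result = []
    for y in range(out_h):
        # Map the output pixel centre back into input coordinates.
        fy = min(max((y + 0.5) / sy - 0.5, 0.0), in_h - 1.0)
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for x in range(out_w):
            fx = min(max((x + 0.5) / sx - 0.5, 0.0), in_w - 1.0)
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            u = v = 0.0
            for yy, xx, weight in (
                (y0, x0, (1 - wy) * (1 - wx)),
                (y0, x1, (1 - wy) * wx),
                (y1, x0, wy * (1 - wx)),
                (y1, x1, wy * wx),
            ):
                u += weight * flow[yy][xx][0]
                v += weight * flow[yy][xx][1]
            # Displacements grow with the image size.
            row.append((u * sx, v * sy))
        result.append(row)
    return result


low_res = [[(1.0, 2.0)] * 3 for _ in range(2)]
full_res = upscale_flow(low_res, 4, 6)
print(len(full_res), len(full_res[0]), full_res[0][0])
```

In the example, a constant low-resolution flow of (1, 2) pixels becomes (2, 4) pixels after doubling both image dimensions.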

The results on the KITTI Flow 2012 training set are indicated in Table 4. It can be seen that using the matches obtained by any of the 4 algorithms improves the flow performance compared to the case where we use no matches for initialization. Notably, MIND again comes closest to DeepMatching in terms of all metrics, thus underlining the good matching quality obtained by MIND (better than KLT and HoG and comparable to DeepMatching). Table 5 shows the results obtained on the MPI-Sintel training dataset. As on KITTI, the pre-obtained matches indeed help to improve the optical flow results, especially in terms of the APE and s40+ metrics, while flow initialized by DeepMatching remains best overall. The results initialized from MIND matches however rank behind those initialized by HoG or KLT matches, which again suggests the importance of temporal coherency for training our network. The reason why KLT works better than in the evaluation presented in [41] is that we run KLT on the downscaled images rather than the full resolution ones, which helps KLT to better deal with large displacements.

From the quantitative evaluations of matching and flow performance, it can be concluded that MIND works well on the KITTI Flow training set and achieves performance comparable to the state of the art defined by DeepMatching. On the MPI-Sintel Flow training set, MIND still obtains performance comparable to the traditional HoG and KLT methods. The latter should still be interpreted as a good result, especially considering that the quality of the training data for the unrealistic Sintel images is insufficient.

| Metric | MIND | DeepM | HoG | KLT | No match |
|---|---|---|---|---|---|
| APE | 2.89 | 2.63 | 3.06 | 3.40 | 3.55 |
| out-2 | 17.70% | 17.09% | 17.89% | 18.34% | 18.49% |
| out-5 | 9.86% | 9.18% | 10.05% | 10.58% | 10.77% |
| out-10 | 6.45% | 5.84% | 6.66% | 7.20% | 7.40% |

Table 4: Flow performance on the KITTI 2012 flow training set (non-occluded areas). out-x refers to the percentage of pixels where the flow estimate has an error above x pixels.

| Metric | MIND | DeepM | HoG | KLT | No match |
|---|---|---|---|---|---|
| APE | 5.78 | 4.80 | 5.46 | 5.42 | 6.63 |
| s0-10 | 2.25 | 2.84 | 3.65 | 3.22 | 2.47 |
| s10-40 | 6.26 | 6.08 | 6.52 | 6.48 | 6.18 |
| s40+ | 19.03 | 18.79 | 17.38 | 17.44 | 23.16 |

Table 5: Flow performance on the MPI-Sintel flow training set. s0-10 is the APE for pixels with motions between 0 and 10 pixels; s10-40 and s40+ are defined analogously.
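The metrics reported in Tables 4 and 5 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the official KITTI or MPI-Sintel evaluation code: the function names are hypothetical, APE is assumed to be the mean Euclidean endpoint error, and the speed bins are assumed to be half-open intervals on the ground-truth motion magnitude.

```python
import numpy as np

def endpoint_error(flow_est, flow_gt):
    """Per-pixel Euclidean distance between estimated and ground-truth flow."""
    return np.linalg.norm(np.asarray(flow_est) - np.asarray(flow_gt), axis=-1)

def ape(flow_est, flow_gt, mask=None):
    """Mean endpoint error over the pixels selected by `mask` (all if None)."""
    err = endpoint_error(flow_est, flow_gt)
    if mask is None:
        mask = np.ones(err.shape, dtype=bool)
    return float(err[mask].mean())

def out_x(flow_est, flow_gt, x, mask=None):
    """Percentage of selected pixels whose endpoint error exceeds x pixels."""
    err = endpoint_error(flow_est, flow_gt)
    if mask is None:
        mask = np.ones(err.shape, dtype=bool)
    return 100.0 * float((err[mask] > x).mean())

def speed_binned_ape(flow_est, flow_gt, lo, hi=np.inf):
    """APE restricted to pixels with ground-truth speed in [lo, hi)."""
    speed = np.linalg.norm(np.asarray(flow_gt), axis=-1)
    return ape(flow_est, flow_gt, (speed >= lo) & (speed < hi))
```

Under these assumptions, `out_x(est, gt, 5, mask=non_occluded)` would correspond to out-5, and `speed_binned_ape(est, gt, 40)` to s40+. An empty bin yields NaN, so callers may want to check that a bin contains pixels before averaging across images.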

5 Conclusions

We have shown that the present work enables artificial neural networks to learn accurate matching from ordinary videos alone. Although the performance evaluation indicates that MIND works surprisingly well in the expected unsupervised manner, it fails to outperform the existing empirically designed methods, even on the resolution-reduced images. However, as stated at the very beginning, the aim of this work is to prove that it is possible to learn image matching without manual supervision, rather than to provide a more practicable algorithm for frame interpolation or image matching. Furthermore, we believe that the present unsupervised learning approach holds great potential for more natural solutions to similar low-level vision problems, such as optical flow, tracking and motion segmentation.

Our future work focuses on making the present approach more applicable in real-world scenarios, in terms of both computational efficiency and reliability. It is also our hope that the present work helps to promote the concept of analysis by synthesis towards broader acceptance.

Supplementary Material:
Quantitative evaluation of frame interpolation

We provide here a supplementary quantitative evaluation of frame interpolation. Its purposes are: 1. to verify that the trained CNN performs frame interpolation well in quantitative terms; and 2. to provide further evidence that the trained CNN generalizes well.

For the trained CNN (referred to as MIND below), the interpolated images are simply the outputs of the forward propagation through the CNN. We compare the results to the traditional interpolation method using state-of-the-art optical flow, i.e. DeepFlow [41] (initialized with DeepMatching). The interpolation algorithm used in the Middlebury benchmark [45] is employed to synthesize the in-between images, given the optical flow fields obtained by DeepFlow. For simplicity, this approach is referred to as DeepFlow below.

The quantitative evaluations are performed on four image sequences: a representative sequence from the KITTI RAW video [32], one Sintel movie clip (Sintel, the Durian Open Movie Project, https://durian.blender.org/), a DICOM image sequence (taken from a DICOM sample image set, http://www.osirix-viewer.com/datasets/, alias name GRUSELAMBIX) and the RubberWhale sequence from the Middlebury optical flow benchmark [45]. For each image sequence, MIND and DeepFlow are evaluated on each image triplet, where the first and third frames are taken as inputs and the second frame serves as the ground truth interpolated frame.
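The triplet protocol can be sketched as follows. The helper names are hypothetical, and the use of overlapping triplets (a sliding window with stride 1) is an assumption, since the text does not state how triplets are drawn from a sequence. `interpolate` stands for either method under test and `error_fn` for any image error measure.

```python
def frame_triplets(frames):
    """Yield (first, middle, third) for every run of three consecutive frames."""
    for i in range(len(frames) - 2):
        yield frames[i], frames[i + 1], frames[i + 2]

def evaluate_sequence(frames, interpolate, error_fn):
    """Interpolate from the outer frames and score against the middle frame."""
    return [
        error_fn(interpolate(first, third), middle)
        for first, middle, third in frame_triplets(frames)
    ]
```

Averaging the returned list gives one mean error per sequence and method.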

5.1 Sample images

We first show some sample results for each image sequence. In Figure 7 and Figure 8, it can be seen that both MIND and DeepFlow work correctly for the task of frame interpolation. Note that MIND works surprisingly well on the DICOM and RubberWhale images, although it has never been trained on similar images. Notably, in the second example of DICOM images shown in Figure 8, DeepFlow fails to capture the motion correctly, while MIND still obtains a good interpolated image.

Figure 7: Examples of frame interpolation (best viewed in colour). From left to right: examples on the KITTI, Sintel and RubberWhale sequences, respectively. In each column, the first image is an overlay of the two input frames and the second is the ground truth image. The third and fourth are the interpolated images obtained by DeepFlow and MIND, respectively. For the first example, we use the CNN trained on KITTI itself. For all others, we use the CNN fine-tuned on Sintel data.
Figure 8: Two examples of frame interpolation on DICOM images. In each example, the first row shows the overlay of the two input frames and the ground truth frame. The second row shows the results obtained by DeepFlow and MIND. The inputs are scaled to the resolution of . For MIND, we use the CNN fine-tuned on Sintel. Note that DeepFlow fails to capture the correct motion in the areas marked by the red squares.

5.2 Numerical results

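Following [45], the interpolation error (IE) is defined as the root-mean-square (RMS) difference between the ground-truth image $I_{GT}$ and the estimated interpolated image $I$:

$$\mathrm{IE} = \left[ \frac{1}{N} \sum_{(x,y)} \left( I(x,y) - I_{GT}(x,y) \right)^2 \right]^{1/2},$$

where $N$ is the number of pixels. For color images, we take the L2 norm of the vector of RGB color differences.

Furthermore, the normalized interpolation error (NE) between an interpolated image and a ground truth image is defined as

$$\mathrm{NE} = \left[ \frac{1}{N} \sum_{(x,y)} \frac{\left( I(x,y) - I_{GT}(x,y) \right)^2}{\left\| \nabla I_{GT}(x,y) \right\|^2 + \epsilon} \right]^{1/2},$$

where $\epsilon$ is a constant that keeps the denominator away from zero in textureless regions ([45] uses $\epsilon = 1.0$).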
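A minimal NumPy sketch of the two error measures is given below. It assumes the normalization used in the Middlebury benchmark [45], where the squared error at each pixel is divided by the squared gradient magnitude of the ground truth plus a constant `eps`. The function name, the default `eps=1.0`, the use of `np.gradient` for the image gradient, and the summation of squared gradients over colour channels are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def interpolation_errors(pred, gt, eps=1.0):
    """Return (IE, NE) for a predicted image and its ground truth.

    Inputs are (H, W) grayscale or (H, W, C) colour arrays of equal shape.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if gt.ndim == 2:  # treat grayscale as a single-channel image
        pred, gt = pred[..., None], gt[..., None]
    # Squared L2 norm of the per-pixel colour difference.
    sq_diff = ((pred - gt) ** 2).sum(axis=-1)
    # Squared gradient magnitude of the ground truth (assumed: summed over channels).
    gy, gx = np.gradient(gt, axis=(0, 1))
    grad_sq = (gx ** 2 + gy ** 2).sum(axis=-1)
    ie = np.sqrt(sq_diff.mean())
    ne = np.sqrt((sq_diff / (grad_sq + eps)).mean())
    return float(ie), float(ne)
```

The normalization down-weights errors at strongly textured pixels, where small misalignments produce large intensity differences, so NE is never larger than IE when `eps` is at least 1.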

The numerical results of the comparison between MIND and DeepFlow are reported in Table 6 and Figure 9. It can be seen that MIND works better than DeepFlow on KITTI images but worse on Sintel images, even though the CNN is fine-tuned using Sintel movie clips. This is consistent with the evaluations of image matching and optical flow performance reported in the main paper, which indicate that the current CNN cannot deal well with Sintel images, mainly because of the fast and complex motion they contain.

Regarding generalization ability, the quantitative results further illustrate that MIND indeed learns the ability to interpolate and match images, rather than only ‘remembering’ KITTI-like or Sintel-like images. MIND even does a better job than DeepFlow on DICOM images, which encourages us to explore further applications.

| Metric | KITTI | Sintel | DICOM | RubberWhale |
|---|---|---|---|---|
| IE (MIND) | 30.99 | 27.43 | 6.00 | 9.31 |
| IE (DeepFlow) | 33.30 | 26.40 | 6.23 | 9.00 |
| NE (MIND) | 7.29 | 9.36 | 1.65 | 1.91 |
| NE (DeepFlow) | 7.82 | 9.06 | 1.70 | 1.84 |

Table 6: Mean interpolation error (IE) and mean normalized interpolation error (NE) of the interpolated images obtained by MIND and DeepFlow on each selected image sequence.
Figure 9: Interpolation error (IE) and Normalized interpolation error (NE) of interpolated images obtained by MIND and DeepFlow.