Image registration is a classic vision problem that has been studied for decades [4, 28, 8]. It remains active both because it is difficult under certain circumstances, such as low-textured scenes, parallax, and dynamic objects, and because of its widespread applications, such as panorama creation, multi-frame HDR and denoising, multi-frame super-resolution, and video stabilization.
A variety of motion models have been proposed for image registration, among which the homography is the most popular given its simplicity and efficiency. A homography is typically estimated by matching image features between two images and rejecting false matches with RANSAC. One problem is that the quality of the estimated homography depends heavily on the quality of the matched features: an insufficient number of correct matches or an uneven distribution can easily degrade performance. Recently, deep homography methods have been proposed, which take two images as network input and output the homography [5, 23]. Compared with feature-based methods, deep homography is more robust in challenging cases such as low light, low texture, and high noise.
The other problem of the homography is its limited degrees of freedom. A homography can only describe planar motions or motions caused by camera rotations. Violating these assumptions produces incorrect alignments. For images with parallax, a global homography is usually used to estimate an initial alignment before more sophisticated models are applied [7, 18, 31, 15]. Mesh-based image warping can represent spatially varying motions caused by depth variations [18, 20]. Each mesh cell undergoes a local linear homography, and together the cells accumulate into a highly nonlinear representation. Igarashi et al. proposed as-rigid-as-possible image warping to enforce the local rigidity of each mesh triangle. Later, Liu et al. extended this idea by proposing content-preserving warps that constrain the rigidity of mesh cells according to the image content. Liu et al. proposed the MeshFlow motion model, which further simplifies the estimation of the mesh model. A meshflow is a sparse motion field with motions located only at the mesh vertices. These methods have proven sufficiently flexible for handling complex scene depth variations [14, 21, 20]. However, one common challenge faced by mesh-based methods is still the quality of image features: both the number of matched features and their distribution influence performance. Zaragoza et al. proposed an As-Projective-As-Possible (APAP) mesh deformation approach and Lin et al. proposed a spatially varying affine model to alleviate this feature dependence by interpolating from ideal features to non-ideal regions. However, both still require a certain number of qualified features to start with.
Optical flow, on the other hand, estimates per-pixel motion and can preserve finer motion details than mesh-interpolated motion fields. However, optical flow estimation is more computationally expensive than lightweight mesh representations. Synthesis-driven applications do not require physically accurate motion at every pixel, so estimating optical flow exceeds the requirement and is often unnecessary.
Figure 1 shows some examples. Figure 1(a) and (b) compare our method with the deep homography method. The source image is aligned to the target image, and the two images are blended for illustration. The scene contains multiple planes, e.g., the ground and the building facade. With deep homography, misalignments (highlighted in the zoom-in window) can be observed on the building facade, whereas our deep meshflow aligns multiple planes and is thus free from this problem. Figure 1(c) and (d) compare our method with the traditional MeshFlow. Feature detection is difficult in this example due to the poor texture, which causes MeshFlow to fail. In contrast, our deep meshflow is robust to textureless scenes.
In this work, we propose an unsupervised approach with a new architecture for content-adaptive deep meshflow estimation. We combine the advantage of deep homography, which is robust to textureless regions, with the advantage of meshflow, which is a lightweight representation of nonlinear motion. Specifically, our network takes the two images to be aligned as input and outputs a meshflow, a sparse motion field with motions located only at the mesh vertices. We learn a content mask to reject outlier regions, such as moving objects and large discontinuous foregrounds, which cannot be registered by the mesh deformation but can degrade the overall alignment quality. This content-adaptive capability plays a role similar to the RANSAC procedure used when estimating homographies or mesh warps [20, 19] in traditional approaches, and is realized by a novel triplet loss. Moreover, instead of directly outputting the mesh at the desired resolution, we first generate several intermediate meshes at different resolutions. We then choose the best combination among these meshes and assemble it into the final output. This idea is borrowed from the video codec x265, in which the block division of a frame can be non-uniform according to the image content. Likewise, our mesh division is non-uniform, based on both image content and motion. For regions that require more degrees of freedom, we choose finer scales for registration accuracy, while for regions that are relatively planar, we choose coarser scales for robustness. This flexibility is realized by the segmentation module in our pipeline, which proves more effective than simply choosing the finest scale.
In addition, we introduce a comprehensive meshflow dataset for training, whose testing set contains manually labeled ground-truth point matches for evaluation. We split the dataset into categories according to scene characteristics, including scenes with multiple dominant planes, scenes captured at night, scenes with low-textured regions, and scenes with small or large foregrounds. Experiments show that our method outperforms previous leading traditional mesh-based methods [19, 31] as well as recent deep homography methods [5, 23, 32]. Our contributions can be summarized as follows:
- A new unsupervised network structure for deep meshflow estimation, which outperforms previous state-of-the-art methods.
- A content-adaptive capability, in terms of both rejecting interfering regions and adaptively selecting the mesh scale.
- A comprehensive dataset containing various scene types for training and testing.
2 Related Works
Global parametric models.
Homography is a widely used parametric alignment model. It is a 3×3 matrix with 8 degrees of freedom, describing either planar motions in space or motions induced by pure camera rotations. Traditional methods require sparse feature matches [22, 1, 24] to estimate a homography. However, image features are unreliable in low-textured regions. Recently, deep learning based solutions have been proposed for improved robustness, such as the supervised approach that trains HomographyNet under the guidance of random homography proposals, or the unsupervised approach that directly minimizes the MSE distance after warping. On the other hand, the homography model is restricted by its motion assumptions, the violation of which can easily introduce misalignments, as in scenes consisting of multiple planes or discontinuous depth variations.
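As a minimal illustration of the homography model discussed above, the Python sketch below applies a 3×3 homography to a 2D point in homogeneous coordinates. The function name `apply_homography` and the example matrices are assumptions made for illustration; they are not taken from any of the cited methods.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    # Lift to homogeneous coordinates (x, y, 1) and multiply by H.
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the third coordinate to return to the image plane.
    return (u / w, v / w)


# Hypothetical example: a pure translation by (5, -2).
H_translate = [[1.0, 0.0, 5.0],
               [0.0, 1.0, -2.0],
               [0.0, 0.0, 1.0]]

# Scaling all 9 entries by a non-zero constant leaves the mapping unchanged,
# which is why only 8 of the 9 entries are free.
H_scaled = [[3.0 * value for value in row] for row in H_translate]
```

Because the final division removes any common scale factor, `H_translate` and `H_scaled` describe the same transformation.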
To handle depth parallax, mesh-based image warping is more popular. Liu et al. proposed Content-Preserving Warps (CPW) to encourage mesh cells to undergo a rigid motion. Li et al. proposed dual-feature warping, which considers not only image features but also line segments when warping low-textured regions. Lin et al. incorporated a curve-preserving term to preserve curve structures. Liu et al. introduced MeshFlow, a non-parametric warping method for video stabilization. Compared with dense optical flow, a meshflow is a sparse motion field with motions located only at the mesh vertices. It detects and tracks image features to estimate the meshflow model. In this work, we propose a deep solution, DeepMeshFlow, for a similar purpose, but with greatly improved robustness in scenes that suffer from feature detection and matching or tracking problems.
Optical flow estimates per-pixel dense motion between two images. Compared with global alignment methods, optical flow better preserves motion details. Traditional methods often adopt a coarse-to-fine variational optimization framework for flow estimation [9, 26, 2]. Recently, flow accuracy has been substantially improved by convolutional networks [29, 27, 11]. For some image and video editing applications, however, optical flow often requires a series of post-processing steps before use, such as occlusion detection, motion inpainting, and outlier filtering. For example, Liu et al. estimated a SteadyFlow from raw optical flow by rejecting and inpainting motion-inconsistent foregrounds. Our mesh-based representation, on the other hand, is free from such issues. It is lightweight and flexible for various applications, such as multi-frame HDR, burst denoising, and video stabilization [18, 20, 19].
MeshFlow is a motion model that describes non-linear warping between two image views. It has more degrees of freedom than a homography but a lower computational cost than optical flow. It is represented by a mesh of $H_g \times W_g$ grids, which contains $(H_g+1) \times (W_g+1)$ vertices in total. A 2D motion vector is defined at each vertex, so that each grid corresponds to one homography computed from the 4 motion vectors at its 4 corner vertices. With multiple homography matrices computed on the mesh grids, the entire image can be warped in a non-linear manner so as to fit multiple planes in the scene.
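The relation between vertex motions and per-grid homographies can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the direct linear solve with the last matrix entry fixed to 1, and the example mesh are all hypothetical.

```python
import numpy as np


def homography_from_corners(src, dst):
    """Solve the 3x3 homography that maps 4 src corners to 4 dst corners.

    src, dst: arrays of shape (4, 2). The entry H[2, 2] is fixed to 1,
    leaving 8 unknowns for the 8 equations given by 4 point pairs.
    """
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def mesh_flow_to_homographies(vertices, flow):
    """vertices, flow: (rows + 1, cols + 1, 2) vertex positions and motions.

    Returns an array of shape (rows, cols, 3, 3), one homography per grid.
    """
    rows, cols = vertices.shape[0] - 1, vertices.shape[1] - 1
    homographies = np.zeros((rows, cols, 3, 3))
    for r in range(rows):
        for c in range(cols):
            # The 4 corner vertices of grid (r, c) and their displaced positions.
            src = vertices[r:r + 2, c:c + 2].reshape(4, 2)
            dst = src + flow[r:r + 2, c:c + 2].reshape(4, 2)
            homographies[r, c] = homography_from_corners(src, dst)
    return homographies


# Hypothetical mesh of 2x2 grids (3x3 vertices) over a 100x100 image.
xs, ys = np.meshgrid(np.linspace(0, 100, 3), np.linspace(0, 100, 3))
vertices = np.stack([xs, ys], axis=-1)
flow = np.tile(np.array([3.0, 1.0]), (3, 3, 1))  # every vertex moves by (3, 1)
grid_homographies = mesh_flow_to_homographies(vertices, flow)
```

In this toy case every vertex moves by the same vector, so each grid homography reduces to the same translation. Varying the motion per vertex gives a different homography per grid, which is what makes the overall warp non-linear.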
3.1 Network Structure
Our method is built upon a convolutional neural network that takes two images as input and produces a mesh flow of size $(H_g+1) \times (W_g+1) \times 2$ as output, where $H_g$ and $W_g$ are the height and width of the mesh in grids, with a 2D motion vector defined at each vertex of the mesh. Given a mesh flow of this form, each of its grids can be represented by a homography matrix, solved from the 4 motion vectors at its 4 corners. The network consists of four modules: a feature extractor, a mask predictor, a scene segmentation network, and a multi-scale mesh flow estimator. The feature extractor and the mask predictor are fully convolutional networks that accept inputs of arbitrary size and produce a concatenation of feature maps. The mesh flow estimator then serves as a regressor that maps the features to mesh flows at multiple scales. Finally, the scene segmentation network produces a fusion mask that fuses the multi-scale mesh flows into the final output. Figure 4 illustrates the network structure. In this sub-section we briefly introduce the feature extractor and the mask predictor, and leave the mesh flow estimator to the next sub-section.
Unlike previous DNN-based methods that directly use pixel values as features, our network automatically learns features from the input for robust alignment. To this end, we build a fully convolutional network (FCN) that takes an image as input and produces a feature map. The feature extractor shares weights across the two input images and produces one feature map for each of them.
In non-planar scenes, especially those containing moving objects, no single homography can align the two views. Although a mesh flow contains multiple homography matrices, which partially addresses the non-planar issue, a single homography may still fail to align all the pixels within a local region. In traditional algorithms, RANSAC is widely applied to find inliers for homography estimation, so as to solve for the matrix that best approximates the scene alignment. Following a similar idea, we build a sub-network to automatically learn the positions of the inliers. Specifically, this sub-network learns to produce an inlier probability map, or mask, which highlights the content in the feature maps that contributes most to the homography estimation. The mask has the same size as the feature map. We use the masks to weight the features extracted by the feature extractor before feeding them to the estimator, obtaining two weighted feature maps, one for each image.
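The mask weighting described above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the element-wise product, the single-channel array shapes, and the names `weight_and_stack`, `feat_a`, and `mask_a` are hypothetical and not confirmed by the text.

```python
import numpy as np


def weight_and_stack(feat_a, mask_a, feat_b, mask_b):
    """Weight each feature map by its mask and stack the pair for the estimator.

    All four inputs are arrays of identical shape (H, W); mask values are
    assumed to lie in [0, 1], with values near 1 marking likely inliers.
    """
    weighted_a = feat_a * mask_a  # element-wise: suppressed where the mask is near 0
    weighted_b = feat_b * mask_b
    # Stack along a new leading channel axis, giving shape (2, H, W).
    return np.stack([weighted_a, weighted_b], axis=0)


# Hypothetical toy inputs: the mask keeps only the left half of image a.
feat_a = np.full((4, 4), 2.0)
feat_b = np.full((4, 4), 3.0)
mask_a = np.zeros((4, 4))
mask_a[:, :2] = 1.0
mask_b = np.ones((4, 4))
estimator_input = weight_and_stack(feat_a, mask_a, feat_b, mask_b)
```

Regions with a mask value near zero contribute almost nothing to the estimator input, which is how the mask plays a role comparable to inlier selection.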
The two weighted feature maps are concatenated and fed to the MeshFlow estimator, which produces mesh flows at different scales. These multi-scale mesh flows are then fused into one by a branch-selection scheme. This is achieved by training a scene segmentation network that segments the image into as many classes as there are branches, each class corresponding to one branch. The resulting segmentation mask has the same resolution as the finest-scale mesh flow.
3.2 MeshFlow Estimator
As mentioned above, the output of our network is a mesh flow with a 2D motion vector at each mesh vertex. Directly regressing the input, i.e. the two weighted feature maps, to this mesh flow is not straightforward, as too many degrees of freedom (DoF) are involved. To tackle this issue, we divide the mesh flow regression into several branches, each of which is responsible for a mesh flow at one scale. The intuition is that in complex scenes, different planes may differ in scale. A coarse-scale mesh flow aligns the two views more rigidly and tends to be easier to train than a fine-scale mesh flow with more DoF. The backbone follows a ResNet-34 structure, which contains 34 layers of strided convolutions, followed by the branches. Each branch starts with an adaptive pooling layer and generates a mesh flow of a specific size through an additional convolutional layer. In our experiments, we use 3 branches, each producing a mesh flow of a different size. The coarse-scale mesh flows are then upsampled to the scale of the finest one before being fused together.
With the multi-scale mesh flows computed in the previous steps, we finally fuse them into the output mesh flow using the segmentation mask: at each vertex coordinate on the mesh, the mask selects the branch whose motion vector is used. With this strategy, the output mesh flow conveys homography alignment at various scales for each local grid. It has enough DoF to align the two views while remaining easy to train.
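The upsampling and branch-selection fusion can be sketched as below. This is a hedged NumPy illustration, not the authors' code: bilinear interpolation of vertex motions is an assumed choice of upsampling, the segmentation mask is assumed to store one integer branch index per vertex, and all function names and toy sizes are hypothetical.

```python
import numpy as np


def upsample_mesh_flow(flow, out_rows, out_cols):
    """Bilinearly resample a (r, c, 2) vertex motion field to (out_rows, out_cols, 2)."""
    r, c, _ = flow.shape
    ys = np.linspace(0, r - 1, out_rows)
    xs = np.linspace(0, c - 1, out_cols)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, r - 1)
    x1 = np.minimum(x0 + 1, c - 1)
    wy = (ys - y0)[:, None, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :, None]  # horizontal interpolation weights
    top = flow[y0][:, x0] * (1 - wx) + flow[y0][:, x1] * wx
    bottom = flow[y1][:, x0] * (1 - wx) + flow[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def fuse_mesh_flows(flows, seg):
    """Pick, at every vertex, the motion vector of the branch indexed by seg.

    flows: list of K arrays of shape (R, C, 2), already at the finest scale.
    seg:   integer array of shape (R, C) with values in [0, K).
    """
    stacked = np.stack(flows, axis=0)  # (K, R, C, 2)
    rows, cols = np.indices(seg.shape)
    return stacked[seg, rows, cols]    # (R, C, 2)


# Hypothetical toy example with two branches on a 5x5 vertex mesh.
coarse = np.zeros((2, 2, 2))
coarse[:, 1, 0] = 4.0                           # x-motion of 4 on the right edge
coarse_up = upsample_mesh_flow(coarse, 5, 5)    # x-motion ramps 0, 1, 2, 3, 4
fine = np.full((5, 5, 2), 9.0)
seg = np.zeros((5, 5), dtype=int)
seg[:, 3:] = 1                                  # right columns use the fine branch
fused = fuse_mesh_flows([coarse_up, fine], seg)
```

Selecting one branch per vertex keeps the coarse, more rigid motion where it suffices and switches to the finer branch only where the segmentation asks for it.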
3.3 Triplet Loss for Training
With the mesh flow estimated, we compute a homography matrix for each of its grids. We then warp the source image with these homographies and extract the feature map of the warped image. Intuitively, for a local grid, if its homography matrix is accurate enough, the warped feature map should be well aligned with the feature map of the target image, yielding a low loss between them. Considering that in real scenes a single homography matrix cannot satisfy the transformation between the two views, we also normalize the loss by the two masks, i.e. the warped mask of the source image and the mask of the target image. The loss between the warped source image and the target image (Eq. 7) is accumulated over the pixel locations in the masks and feature maps. We use a spatial transformer network to implement the warping operation.
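A mask-normalized feature distance of the kind described above might look like the following NumPy sketch. The use of an absolute (L1) difference, the product of the two masks as the per-pixel weight, the small `eps` stabilizer, and all names are assumptions made for illustration; the text only states that the loss is normalized by the two masks.

```python
import numpy as np


def masked_feature_loss(feat_warped, feat_target, mask_warped, mask_target, eps=1e-8):
    """Mask-weighted mean feature distance between a warped view and the target.

    All inputs are arrays of identical shape (H, W).
    """
    weight = mask_warped * mask_target              # high only where both masks agree
    distance = np.abs(feat_warped - feat_target)    # per-pixel L1 distance (assumed)
    # Normalize by the total mask weight so the loss does not shrink
    # merely because the masks highlight fewer pixels.
    return float((weight * distance).sum() / (weight.sum() + eps))


# Hypothetical toy inputs: features differ by 2 everywhere,
# but the target mask keeps only the left half.
feat_warped = np.zeros((4, 4))
feat_target = np.full((4, 4), 2.0)
mask_warped = np.ones((4, 4))
mask_target = np.zeros((4, 4))
mask_target[:, :2] = 1.0
loss_value = masked_feature_loss(feat_warped, feat_target, mask_warped, mask_target)
```

Pixels where either mask is zero are ignored entirely, so misaligned content in rejected regions does not penalize the estimate.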
Directly minimizing Eq. 7 may easily lead to trivial solutions, in which the feature extractor produces all-zero maps. In this case, the learned features do indicate that the warped source image and the target image are well aligned, but they fail to reflect that the original images are misaligned. To this end, we introduce another loss between the features of the original, unwarped source image and those of the target image, and maximize it while minimizing Eq. 7. This strategy avoids the trivial solutions and enables the network to learn a discriminative feature map for image alignment.
In practice, we swap the features of the two images to produce a reversed mesh flow, and compute a homography matrix for each of its grids. Following Eq. 7, we add a loss between the warped target image and the source image. We also add a constraint that enforces the forward and reversed homographies of each grid to be inverse to each other. The overall optimization objective therefore combines these terms, weighted by two balancing hyper-parameters, where the inverse constraint is measured against the 3×3 identity matrix. The loss formulation is illustrated in Figure 4(b).
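One way to combine the terms described above is sketched below. This is an assumption-laden illustration rather than the paper's exact objective: the squared Frobenius penalty on the inverse constraint, the summation over grids, the argument names `lam` and `mu`, and the numeric values in the example are all hypothetical.

```python
import numpy as np


def triplet_objective(loss_ab, loss_ba, loss_unwarped, H_ab, H_ba, lam, mu):
    """Combine the two warped losses, the unwarped loss, and the inverse constraint.

    loss_ab:       loss between the warped source and the target (to minimize)
    loss_ba:       loss between the warped target and the source (to minimize)
    loss_unwarped: loss between the original, unwarped images (to maximize)
    H_ab, H_ba:    forward and reversed homographies, shape (..., 3, 3)
    lam, mu:       balancing hyper-parameters (hypothetical names and values)
    """
    identity = np.eye(3)
    # If H_ba inverts H_ab, their product equals the identity for every grid.
    inverse_penalty = np.sum((H_ab @ H_ba - identity) ** 2)
    return loss_ab + loss_ba - lam * loss_unwarped + mu * inverse_penalty


# Hypothetical example: a translation and its exact inverse.
H_forward = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]])
H_backward = np.array([[1.0, 0.0, -3.0], [0.0, 1.0, -1.0], [0.0, 0.0, 1.0]])
objective = triplet_objective(1.0, 2.0, 4.0, H_forward, H_backward, lam=0.5, mu=0.1)
```

The negative sign on the unwarped term is what discourages the all-zero feature solution: features that make the unaligned images look identical would no longer lower the objective.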
3.4 Unsupervised Content-Awareness Learning
As mentioned above, our network contains a sub-network that predicts an inlier probability map, or mask. This design makes our network content-aware in two ways. First, we use the masks to explicitly weight the features, so that only highlighted features are fully fed into the MeshFlow estimator. Second, the masks are implicitly involved in the normalized distance between each warped feature map and its counterpart from the other image, meaning that only regions that are truly suitable for alignment are taken into account. Areas containing low texture or moving foregrounds are indistinguishable or misleading for alignment, so they are naturally excluded from the local homography estimation in a grid when the triplet loss is optimized. This content-awareness is achieved entirely through unsupervised learning, without any ground-truth mask data as supervision.
To demonstrate the effectiveness of the mask, we show examples in Figures 5 and 6. In Figure 6, we visualize the mask learned when only one branch of mesh flow is used. For a coarse-scale mesh flow, each grid covers a larger area in which a single homography is less likely to represent the transformation, so fewer pixels are highlighted in the mask. Our DeepMeshFlow solution works at multiple scales, so the highlighted region in its mask is smaller than in a mask trained with a finer-scale mesh flow alone, but larger than in a mask trained with a coarser-scale mesh flow alone. Figure 5 shows mask examples generated in several scenarios. For example, in Figure 5(a) and (c), where the scenes contain dynamic objects, our network successfully rejects the moving objects, even when the movement is subtle, as with the water in (c). Such cases make it very difficult for RANSAC to find robust inliers. The most challenging case is Figure 5(a), in which the moving foregrounds are complex, including people and cars. Our method successfully locates the useful background for homography estimation. Figure 5(d) is a low-textured example in which the sky occupies half of the image. It is challenging for traditional methods because the sky provides no features and the sea causes matching ambiguities. Our predicted mask concentrates on the horizon, with sparse weights on the sea waves. Figure 5(e) is a low-light example, in which only the visible areas receive weight. We also conduct an ablation study to reveal the effect of disabling mask prediction. As shown in Table 1, the accuracy decreases significantly when the mask is removed.
4 Experimental Results
4.1 Dataset and Implementation Details
To train and evaluate our deep meshflow, we present a comprehensive dataset that contains a variety of scenes as well as marked point correspondences. We split the dataset into several categories to test performance under different scenarios. The categories are: scenes consisting of a single plane (SP), scenes consisting mainly of multiple dominant planes (MP), scenes with a large foreground (LF), scenes with low texture (LT), and scenes captured in low light (LL). The first three categories focus on the representation capability of the motion model, while the last two concentrate on the capability of feature extraction. Notably, the LT and LL categories contain all of the scene types SP, MP, and LF. Each category contains roughly the same number of image pairs. Figure 7 shows some examples.
For the testing set, we mark ground-truth point correspondences for quantitative evaluation. Figure 8 shows several examples of our annotated correspondences. For each pair, we carefully marked correspondences that are evenly distributed over the image. For the multi-plane (MP) category, we distribute the points equally across the different planes. For the low-texture (LT) category, we took extra care to ensure the correctness of the marked points. In total, we marked about 3,000 image pairs and nearly 30k pairs of matching points across all categories.
Our network is trained for 30k iterations with the Adam optimizer. The learning rate is multiplied by a constant decay factor at a fixed interval of iterations. The implementation is based on PyTorch, and training is performed on NVIDIA RTX 2080 Ti hardware. To augment the training data and avoid black boundaries appearing in the warped image, we randomly crop patches from the original images to form the input pair.
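The step-wise learning-rate decay mentioned above can be written as a one-line schedule. The sketch below is framework-independent; the function name and every numeric value in the example (base rate, interval, decay factor) are hypothetical placeholders, since the actual settings are not recoverable from the text.

```python
def step_decay_lr(base_lr, iteration, step_size, gamma):
    """Learning rate after multiplying base_lr by gamma once every step_size iterations."""
    return base_lr * gamma ** (iteration // step_size)


# Hypothetical settings, for illustration only.
schedule = [step_decay_lr(1e-3, it, step_size=1000, gamma=0.5)
            for it in (0, 999, 1000, 2500)]
```

In PyTorch, `torch.optim.lr_scheduler.StepLR` provides the same behaviour when attached to an optimizer, which may be the more convenient choice in a real training loop.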
4.2 Comparison with Existing Methods
We compare our method with several existing methods, including the classic traditional methods MeshFlow and As-Projective-As-Possible (APAP) mesh warping, and a deep method, supervised deep homography. The unsupervised deep homography method uses aerial images as training data, which ignores the effect of depth parallax. For a fairer comparison, we fine-tune that method on our training data.
The source image is warped to the target image, and the two images are blended for illustration. A clearer blended image indicates better alignment. For each method we show two examples in Figure 10. The first, second, and third rows show the comparisons with As-Projective-As-Possible (APAP), MeshFlow, and unsupervised deep homography, respectively, with our results shown in the second and fourth columns. We highlight some regions for clearer illustration.
We evaluate performance on the annotated points in the testing set, category by category. Specifically, we use the estimated mesh or homography to transform the source points to the target image, and record the average distance between the transformed points and the target points as the evaluation metric. We report the performance for each category as well as the overall average in Table 1. A smaller number indicates better alignment. In Table 1, ‘Eye’ refers to the identity matrix, indicating the original distances when no alignment is performed. As can be seen, the original distances are high. After alignment, all methods reduce the original score, indicating that the alignment takes effect. Among all candidates, our method achieves the best average score, surpassing both MeshFlow and unsupervised deep homography by a relatively large margin.
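The point-based evaluation described above can be sketched for the single-homography case as follows. The Euclidean distance and the function name `mean_point_error` are assumptions for illustration; for a mesh, the same computation would be applied with the homography of the grid that contains each source point.

```python
import numpy as np


def mean_point_error(H, src_pts, dst_pts):
    """Average Euclidean distance between H-transformed source points and targets.

    H: (3, 3) homography; src_pts, dst_pts: arrays of shape (N, 2).
    """
    ones = np.ones((len(src_pts), 1))
    projected = np.hstack([src_pts, ones]) @ H.T   # homogeneous coordinates
    warped = projected[:, :2] / projected[:, 2:3]  # back to pixel coordinates
    return float(np.linalg.norm(warped - dst_pts, axis=1).mean())


# Hypothetical annotated points: the targets are the sources shifted by (3, 4).
src_pts = np.array([[10.0, 20.0], [50.0, 60.0], [90.0, 15.0]])
dst_pts = src_pts + np.array([3.0, 4.0])
shift = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]])

error_identity = mean_point_error(np.eye(3), src_pts, dst_pts)  # no alignment
error_aligned = mean_point_error(shift, src_pts, dst_pts)       # correct shift
```

Passing the identity matrix reproduces the 'Eye' baseline: it measures how far apart the annotated points are before any alignment is applied.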
4.3 Ablation Studies
To verify the effectiveness of our content-adaptive design, we conduct two experiments: one with and without the mask, and one with fixed mesh resolutions.
We exclude the mask component from our pipeline and compare the result with the full model. The ‘w/o mask’ entry in Table 1 shows the result. As seen, the average performance drops without the mask, so the mask is important for meshflow estimation. The improvement is especially pronounced for the low-texture (LT) category, which indicates that the mask is particularly helpful there. For the other categories, the scores with the mask also improve to a certain extent.
We train several models with different fixed mesh resolutions to compare with our adaptive mesh resolution. Table 1 shows the results for three fixed resolutions. As can be seen, none of these fixed resolutions achieves results comparable to our adaptive mesh resolution. We further show visual comparisons against two of the fixed resolutions in Figure 9.
5 Conclusion
We have presented a network architecture for deep mesh flow estimation with content-aware abilities. Traditional feature-based methods rely heavily on the quality of image features, which are vulnerable to low-texture and low-light scenes. Large foregrounds also cause trouble for RANSAC outlier removal. Previous deep homography methods pay less attention to the depth disparity issue. They treat all image content equally and can therefore be influenced by non-planar structures and dynamic objects. Our network learns a mask during estimation to reject outlier regions for robust mesh flow estimation. In addition, we compute the loss on our learned deep features instead of directly comparing image content. Moreover, we have provided a comprehensive dataset for two-view alignment. The dataset is divided into 5 categories (regular, low-texture, low-light, small-foreground, and large-foreground) to evaluate estimation performance from different aspects. Comparisons with previous methods show the effectiveness of our method.
[1] (2006) Surf: speeded up robust features. In Proc. ECCV, pp. 404–417.
[2] (2015) Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In Proc. CVPR, pp. 120–130.
[3] (2003) Recognising panoramas. In Proc. ICCV, Vol. 3, pp. 1218.
[4] (2004) Image mosaicing. In Image Mosaicing and super-resolution, pp. 47–79.
[5] (2016) Deep image homography estimation. arXiv preprint arXiv:1606.03798.
[6] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
[7] (2011) Constructing image panoramas using dual-homography warping. In Proc. CVPR, pp. 49–56.
[8] Multiple view geometry in computer vision. Cambridge university press.
[9] (1981) Determining optical flow. 17, pp. 185–203.
[10] (2005) As-rigid-as-possible shape manipulation. In ACM Trans. Graphics (Proc. of SIGGRAPH), Vol. 24, pp. 1134–1141.
[11] (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proc. CVPR, pp. 2462–2470.
[12] (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025.
[13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] (2015) Dual-feature warping-based motion model estimation. In Proc. ICCV, pp. 4283–4291.
[15] (2016) Seagull: seam-guided local alignment for parallax-tolerant image stitching. In Proc. ECCV, pp. 370–385.
[16] (2016) Seamless video stitching from hand-held camera inputs. 35 (2), pp. 479–487.
[17] (2011) Smoothly varying affine stitching. In Proc. CVPR, pp. 345–352.
[18] (2009) Content-preserving warps for 3d video stabilization. In ACM Trans. Graphics (Proc. of SIGGRAPH), Vol. 28, pp. 44.
[19] (2016) Meshflow: minimum latency online video stabilization. In Proc. ECCV, pp. 800–815.
[20] (2013) Bundled camera paths for video stabilization. ACM Trans. Graphics (Proc. of SIGGRAPH) 32 (4), pp. 78.
[21] (2014) Fast burst images denoising. ACM Trans. Graphics (Proc. of SIGGRAPH) 33 (6), pp. 232.
[22] (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
[23] (2018) Unsupervised deep homography: a fast and robust homography estimation model. IEEE Robotics and Automation Letters 3 (3), pp. 2346–2353.
[24] (2011) ORB: an efficient alternative to sift or surf. In Proc. ICCV, Vol. 11, pp. 2564–2571.
[25] (2012) Overview of the high efficiency video coding (hevc) standard. IEEE Trans. on circuits and systems for video technology 22 (12), pp. 1649–1668.
[26] (2010) Secrets of optical flow estimation and their principles. In Proc. CVPR, pp. 2432–2439.
[27] (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, pp. 8934–8943.
[28] (2007) Image alignment and stitching: a tutorial. Foundations and Trends® in Computer Graphics and Vision 2 (1), pp. 1–104.
[29] (2013) DeepFlow: large displacement optical flow with deep matching. In Proc. CVPR, pp. 1385–1392.
[30] (2019) Handheld multi-frame super-resolution. ACM Trans. Graphics (Proc. of SIGGRAPH) 38 (4), pp. 28.
[31] (2013) As-projective-as-possible image stitching with moving dlt. In Proc. CVPR, pp. 2339–2346.
[32] (2019) Content-aware unsupervised deep homography estimation. arXiv preprint arXiv:1909.05983.
[33] Denoising vs. deblurring: hdr imaging techniques using moving cameras. In Proc. CVPR.