Geometric distortion is a common problem in digital imagery and occurs in a wide range of applications. It can be caused by the acquisition system (e.g., optical lens, imaging sensor), the imaging environment (e.g., motion of the platform or target, viewing geometry), and image processing operations (e.g., image warping). For example, camera lenses often suffer from optical aberrations, causing barrel distortion, common in wide-angle lenses, where the image magnification decreases with distance from the optical axis, and pincushion distortion, where it increases. While lens distortions are intrinsic to the camera, extrinsic geometric distortions such as rotation, shear, and perspective distortion may also arise from improper camera pose or camera movement. Furthermore, a large number of distortion effects, such as wave distortion, can be generated by image processing tools. We aim to design an algorithm that automatically corrects images with these distortions and generalizes easily to a wide range of distortions (see Figure 2).
Geometric distortion correction is highly desired in both photography and computer vision applications. First, lens distortion violates the pin-hole camera model assumption on which many algorithms rely. Second, remote sensing images usually contain geometric distortions and cannot be used directly with maps before correction [toutin2004geometric]. Third, skew detection and correction is an important pre-processing step in document analysis and has a direct effect on the reliability and efficiency of the segmentation and feature extraction stages [al2009skew]. Finally, photos often contain slanted buildings, walls, and horizon lines due to improper camera rotation, whereas our visual system expects man-made structures to be straight and horizon lines to be horizontal [lee2012automatic].
Completely blind geometric distortion correction is a challenging problem, which is under-constrained given that the input is only a single distorted image. Therefore, many correction methods rely on multiple images or additional information. Multi-view methods [barreto2005fundamental, hartley2007parameter, kukelova2011minimal] for radial lens distortion use point correspondences between two or more images. These methods can achieve impressive results. However, they cannot be applied when multiple images under camera motion are unavailable.
To address these limitations, distortion correction from a single image has also been explored. Methods for radial lens distortion based on the plumb line approach [wang2009simple, bukhari2013automatic, thormahlen2003robust] assume that radial lens distortion projects straight lines to circular arcs in the image plane. Accurate line detection is therefore critical to the robustness and flexibility of these methods. Correction methods for other distortions [gallagher2005using, lee2012automatic, chaudhury2015automatic, santana2017automatic] also rely on the detection of special low-level features such as vanishing points, repeated textures, and co-planar circles. However, these features are not always frequent enough for distortion estimation in some images, which greatly restricts the versatility of the methods. Moreover, each of these methods focuses on a specific distortion. To our knowledge, there is no general framework that can address different types of geometric distortion from a single image.
In this paper, we propose a learning-based method to achieve this goal. We use the displacement field between distorted images and corrected images to represent a wide range of distortions. The correction problem is then converted to the pixel-wise prediction of this displacement field, or flow, from a single image. Recently, CNNs have become a powerful method in many fields of computer vision and outperform many traditional methods, which motivated us to use a similar network structure for training. The predicted flow is then further improved by our model fitting methods which estimate the distortion parameters. Lastly, we use a modified resampling method to generate the output undistorted image from the predicted flow.
Overall, our learning-based method does not make strong assumptions on the input images while generating high-quality results with few visible artifacts as shown in Figure 1. Our main contribution is to propose the first learning-based methods to correct a wide range of geometric distortions blindly. More specifically, we propose:
A single-model network, which implicitly learns the distortion parameters given the distortion type.
A multi-model network, which performs type classification jointly with flow regression without knowing the distortion type, followed by an optional model fitting method to further improve the accuracy of the estimation.
A new resampling method based on an iterative search with faster convergence.
Extended applications that can directly use this framework, such as distortion transfer, distortion exaggeration, and co-occurring distortion correction.
2 Related Work
Geometric distortion correction.
For camera lens distortions, pre-calibration techniques have been proposed for correction with known distortion parameters [tardif2009calibration, duane1971close, tsai1987versatile, heikkila1997four, zhang1999flexible]. However, they are unsuitable for zoom lenses, and the calibration process is usually tedious. On the other hand, auto-calibration methods do not require special calibration patterns and automatically extract camera parameters from multi-view images [fitzgibbon2001simultaneous, hartley2007parameter, kukelova2011minimal, ramalingam2010generic, henrique2013radial]. But for many application scenarios, multiple images with different views are unavailable. To address these limitations, automatic distortion correction from a single image has gained more research interest recently. Fitzgibbon [fitzgibbon2001simultaneous] proposes a division model to approximate the radial distortion curve with higher accuracy and fewer parameters. Wang et al. [wang2009simple] studied the geometric properties of straight lines under the division model and proposed to estimate the distortion parameters through arc fitting. Since plumb line methods rely on robust line detection, Aleman-Flores et al. [aleman2014automatic] used an improved Hough Transform to improve the robustness, while Bukhari and Dailey [bukhari2013automatic] proposed a sampling method that robustly chooses the circular arcs and determines distortion parameters that are insensitive to outliers.
For other distortions, such as rotation and perspective, most image correction methods rely on the detection of low-level features such as vanishing points, repeated textures, and co-planar circles [gallagher2005using, lee2012automatic, chaudhury2015automatic, santana2017automatic]. Recently, Zhai et al. [zhai2016detecting] proposed to use deep convolutional neural networks to estimate the horizon line by aggregating the global image context with the clue of the vanishing point. Workman et al. [workman2016horizon] go further and directly estimate the horizon line from a single image. Unlike these specialized methods, our approach generalizes to multi-type distortion correction using a single network.
There has been recent work on automatic detection of geometric deformation or variations in a single image. Dekel et al. [dekel2015revealing] use a non-local variations algorithm to automatically detect and correct small deformations between repeating structures from a single image. Wadhwa et al. [wadhwa2015deviation] fit parametric models to compute the geometric deviations and exaggerate the departure from ideal geometries. Estimating deformations has also been studied in the context of texture images [kim2012symmetry, hays2006discovering, park2009deformed]. None of these techniques is learning-based, and most target specialized domains.
Neural networks for pixel-wise prediction.
Recently, convolutional neural networks have been used in many pixel-wise prediction tasks from a single image, such as semantic segmentation [long2015fully], depth estimation [eigen2014depth], and motion prediction [walker2015dense]. One of the main problems for dense prediction is how to combine multi-scale contextual reasoning with the full-resolution output. Long et al. [long2015fully] proposed fully convolutional networks, which popularized CNNs for dense prediction without fully connected layers. Some methods focus on dilated or atrous convolution [yu2015multi, chen2018deeplab], which supports exponential expansion of the receptive field and systematically aggregates multi-scale contextual information without losing resolution. Another strategy is to use the encoder-decoder architecture [noh2015learning, badrinarayanan2015segnet, ronneberger2015u]. The encoder gradually reduces the spatial dimension to increase the receptive field of each neuron, while the decoder maps the low-resolution feature maps to full input resolution maps. Noh et al. [noh2015learning] developed deconvolution and unpooling layers for the decoder part. Badrinarayanan et al. [badrinarayanan2015segnet] used pooling indices to connect the encoder and the corresponding decoder, making the architecture more memory efficient. Another popular network is U-net [ronneberger2015u], which uses skip connections to combine the contracting paths with the upsampled feature maps. Our networks use an encoder-decoder architecture with a residual connection design and achieve more accurate results.
3 Network Architectures
Geometrically distorted images usually exhibit unnatural structures that can serve as clues for distortion correction. As a result, we presume that the network can potentially recognize the geometric distortions by extracting the features from the input image. We therefore propose a network to learn the mapping from the image domain to the flow domain. The flow is the 2D vector field that specifies where pixels in the input image should move in order to get the corrected image. It defines a non-parametric transformation and is thus able to represent a wide range of distortions. Since the flow is a forward map from the distorted image to the corrected image, a resampling method is needed to produce the final result.
This strategy follows learning methods in other applications, which have observed that it is often simpler to predict the transformation from input to output than to predict the output directly (e.g., [gharbi2015transform, isola2017image]). Thus, we designed our architecture to learn an intermediate flow representation. Additionally, the forward mapping indicates where each pixel with a known color in the distorted image maps to. Therefore, every pixel in the input image has a distortion flow prediction directly associated with it. This would not be the case if we attempted to learn a backward mapping, where some input regions might have no correspondences, which can be a serious problem when the distortion changes the image shape greatly. Furthermore, the resampling method required to generate the final image is fast and accurate.
We propose two networks by considering whether the user has prior knowledge of the distortion type. Our networks are trained in a supervised manner. Therefore, we first introduce how the paired datasets have been constructed (Section 3.1) and then we introduce our two networks, for single-model and multi-model distortion estimation (Sections 3.2 and 3.3, respectively).
3.1 Dataset construction
We generate each distorted image and flow pair by warping an image with a given mapping, thus constructing a distorted image dataset and a corresponding distortion flow dataset whose elements are paired one to one.
We consider six geometric distortion models in our network. However, the architecture is not specialized to these types of distortion and can potentially be further extended. Each distortion type has a model, controlled by a distortion parameter, which defines the mapping from the distorted image lattice to the original one. Bilinear interpolation is used if the corresponding point in the original image is not on the integer grid. The flow is generated at the same time to record how each pixel in the distorted image should be moved to the corresponding point in the original image. The distortion parameter controls the distortion effect. For instance, in the rotation distortion model it is the rotation angle, while in the barrel and pincushion models it is the parameter of Fitzgibbon's single parameter division model [fitzgibbon2001simultaneous]. All distortion parameters in the different distortion models are randomly sampled from a uniform distribution within a specified range. As Figure 2 shows, the geometric distortions change the image shapes. Thus we crop the images and flows to remove empty regions.
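As a minimal sketch of how one such training flow could be generated, the following assumes Fitzgibbon's division model with the distortion center at the image center and coordinates normalized by the half-diagonal. The normalization, the sampling range, and all function names are illustrative assumptions, not details given in this paper.

```python
import random


def division_model_flow(width, height, lam):
    """Per-pixel flow (dx, dy) from each distorted pixel to its original location.

    Assumes the single-parameter division model p_u = p_d / (1 + lam * r_d^2),
    centered on the image and normalized by the half-diagonal (an assumption).
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    scale = (cx * cx + cy * cy) ** 0.5 or 1.0  # avoid dividing by zero for 1x1 images
    flow = []
    for y in range(height):
        row = []
        for x in range(width):
            xd, yd = (x - cx) / scale, (y - cy) / scale
            r2 = xd * xd + yd * yd
            factor = 1.0 / (1.0 + lam * r2)
            xu, yu = xd * factor, yd * factor
            # Convert the normalized displacement back to pixel units.
            row.append(((xu - xd) * scale, (yu - yd) * scale))
        flow.append(row)
    return flow


def sample_training_flow(width, height, lam_range=(-0.5, 0.5), rng=random):
    """Draw the parameter uniformly from a hypothetical range and build its flow."""
    lam = rng.uniform(*lam_range)
    return lam, division_model_flow(width, height, lam)
```

With normalized radii of at most 1, any parameter above -1 keeps the denominator positive, which is why the hypothetical range stays well inside that bound. Warping the image itself and cropping the empty regions are omitted here.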
3.2 Single-model distortion estimation
We first introduce a network that estimates the flow for distorted images of a known distortion type. It learns the mapping from the image domain to the flow domain using the sub-datasets that contain only the images and flows of that distortion type.
A possible architecture choice is to directly regress the distortion flow according to the ground truth with an auto-encoder-like structure. However, the network would only be optimized with the pixel-wise flow error, without taking advantage of global constraints imposed by the known distortion model. Instead, we design a network to first predict the model parameter directly. This parameter is then used to generate the flow in the network. Though the network should implicitly learn the distortion parameter, there is no explicit constraint for the network to do so exactly.
The network architecture, referred to as GeoNetS, is shown in Figure 3. It has three conv layers at the very beginning and five residual blocks [he2016deep] to gradually downsize the input image and extract the features. Each residual block contains two conv layers and has a shortcut connection from input to output. The shortcut connection helps ease the gradient flow, achieving a lower loss in our experiments. Downsampling in spatial resolution is achieved using conv layers with a stride of 2.
After the residual blocks, two conv layers are used to downsize the features further, and a fully-connected layer converts the 3D feature map into a 1D vector that represents the distortion parameter. Given this parameter, the corresponding distortion model analytically generates the distortion flow. The network is optimized with the pixel-wise flow error between the generated flow and the ground truth.
We train the network to minimize a loss that measures the distance between the estimated distortion flow and the ground-truth flow. The estimated flow is generated by the distortion model from the output of the sub-network that implicitly regresses the distortion parameter. Here we choose the endpoint error (EPE) as our loss function. The EPE is defined as the Euclidean distance between the predicted flow vector and the ground truth, averaged over all pixels. Because the estimated distortion flow is explicitly constrained by the distortion model, it is naturally smooth.
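The EPE definition above can be sketched in a few lines. The flow layout (a list of rows of `(u, v)` tuples) and the function name are assumptions made for illustration only.

```python
import math


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    Both flows are assumed to be lists of rows of (u, v) tuples of equal shape.
    """
    total, count = 0.0, 0
    for pred_row, gt_row in zip(pred_flow, gt_flow):
        for (pu, pv), (gu, gv) in zip(pred_row, gt_row):
            total += math.hypot(pu - gu, pv - gv)
            count += 1
    return total / count
```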
Since the geometric distortion models we consider are differentiable, the backward gradient of each layer can be computed using the chain rule.
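One way to write this chain rule out, using notation introduced here purely for illustration (L for the loss, \hat{F} for the generated flow, M for the distortion model, \rho for the regressed distortion parameter, N_\theta for the regression sub-network with weights \theta, and I for the input image), is the following sketch. It may differ in notation from the original formulation.

```latex
\hat{F} = M(\rho), \qquad \rho = N_\theta(I), \qquad
\frac{\partial L}{\partial \theta}
  = \frac{\partial L}{\partial \hat{F}}
    \cdot \frac{\partial \hat{F}}{\partial \rho}
    \cdot \frac{\partial \rho}{\partial \theta}
```

The middle factor is the derivative of the distortion model with respect to its parameter, which exists because the model is differentiable.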
Our trained network can estimate the distortion flow blindly from an input image for each distortion type and achieves performance comparable to that of traditional methods.
3.3 Multi-model distortion estimation
The GeoNetS network can only capture one specific distortion type, with its distortion model, at a time. For a new type, the entire network has to be retrained. Furthermore, the distortion type and model can be unknown in some cases. In view of these limitations, we designed a second network for multi-model distortion estimation. However, since the distortion model and its parameters can vary drastically across types, it is impossible to train a multi-model network with the model constraints. We therefore train a network to regress the distortion flow without model constraints and, at the same time, classify the distortion type. The network is illustrated in Figure 3. The multi-model network is jointly trained for two tasks. The first task estimates the distortion flow, learning the mapping from the image domain to the flow domain. The second task classifies the distortion type, learning the mapping from the image domain to the type domain.
The entire network adopts an encoder-decoder structure, which includes an encoder part, a decoder part, and a classification part. The input image is fed into an encoder to encode the geometric features and capture the unnatural structures. Then two branches follow: In the first branch, a decoder is used to regress the distortion flow, while in the second branch a classification subnet is used to classify the distortion type. The encoder part is the same as GeoNetS, and the decoder part is symmetric to the encoder. Downsampling/Upsampling in spatial resolution is achieved using conv/upconv layers with a stride of 2. The classification part also has two conv layers to downsize the features further, and a fully-connected layer converts the 3D feature map into a 1D score vector of each type.
We use the EPE loss in the flow regression branch and a cross-entropy loss in the classification branch. The two branches are jointly optimized by minimizing the total loss, a weighted combination of the two terms in which the weight provides a trade-off between flow prediction and distortion type classification.
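A minimal sketch of such a joint objective is given below. It assumes that the weight multiplies the classification term and that the classification branch outputs one raw score per distortion type; both are assumptions, as are the function names.

```python
import math


def cross_entropy(scores, true_type):
    """Softmax cross-entropy for one sample, computed with a stable log-sum-exp."""
    peak = max(scores)
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_sum - scores[true_type]


def total_loss(flow_loss, scores, true_type, weight):
    """Flow loss plus weighted classification loss (the weighting is an assumption)."""
    return flow_loss + weight * cross_entropy(scores, true_type)
```

Here `flow_loss` stands for the EPE of the regression branch, computed separately.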
We observe that jointly learning the distortion type helps reduce the flow prediction error as well. These two branches share the same encoder, and the classification branch helps the encoder learn the geometric features for different distortion types better. Please refer to Section 5 for direct comparisons.
Our multi-model network simultaneously predicts the flow and the distortion type from the input image. Based on this information, we can estimate the actual distortion parameters in the model and regenerate the flow to obtain a more accurate result.
The Hough Transform is a widely used technique to extract features in an image. It is robust to noise by eliminating the outliers in the flow using a voting procedure. Moreover, it is a non-iterative approach. Each data point is treated independently, and therefore parallel processing of all points is possible. This makes it more computationally efficient.
For an input image, given its distortion type and the distortion flow predicted by our network, we want to fit the corresponding distortion model and recover its distortion parameter. In our scenario, we map each data point of the flow, together with its position, to a point in the distortion parameter space. The transform is given by the inverse of the distortion model evaluated at that data point.
We assume the distortion parameter lies within a bounded range and split this range uniformly into cells. Each point falls into a cell according to its parameter value. The cell receiving the maximum number of votes determines the best fit, and the final result is the average of all the points in this cell. The number of cells is fixed in our experiments.
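The voting procedure can be sketched as follows, using rotation about the origin as a hypothetical example of a model that can be inverted at each point. The parameter range, the cell count, and the function names are illustrative assumptions rather than the settings used in our experiments.

```python
import math


def rotation_angle_from_flow(x, y, u, v):
    """Invert a rotation-about-the-origin model at one point.

    Returns the angle from (x, y) to its destination (x + u, y + v),
    wrapped to the interval (-pi, pi].
    """
    angle = math.atan2(y + v, x + u) - math.atan2(y, x)
    return math.atan2(math.sin(angle), math.cos(angle))


def hough_fit(estimates, lo, hi, num_cells):
    """Vote per-point parameter estimates into uniform cells over [lo, hi].

    Returns the mean of the estimates in the most populated cell.
    """
    cells = [[] for _ in range(num_cells)]
    cell_width = (hi - lo) / num_cells
    for value in estimates:
        if lo <= value <= hi:
            index = min(int((value - lo) / cell_width), num_cells - 1)
            cells[index].append(value)
    best = max(cells, key=len)
    if not best:
        raise ValueError("no estimates fall inside the parameter range")
    return sum(best) / len(best)
```

Because each point votes independently, the per-point inversion can be evaluated in parallel before the votes are accumulated.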
Once model fitting is completed, we have a refined and smoother flow. Model fitting also greatly improves the efficiency of correcting higher-resolution images, because we can estimate the flow and obtain the distortion parameter at a much smaller resolution and then generate the full-resolution flow directly from the distortion parameter.
[Figure 4. Resampling results. Panels, left to right: Distorted, IS (5), IS (10), IS (15), Ours (5).]
Given the distortion flow, we employ a pixel evaluation algorithm to determine the backward-mapping and resample the final undistorted image. The approach is inspired by the bidirectional iterative search algorithm in [Yang2011Bidirectional]. Unlike mesh rasterization approaches, this iterative method runs entirely independently and in parallel for each pixel, fetching the color from the appropriate location of the source image.
The traditional backward mapping algorithm of [Yang2011Bidirectional] seeks a point in the source image that maps to a given pixel of the output image. Since we only have the forward distortion flow, this method essentially inverts the mapping using an iterative search that runs until the location converges. The search relies on the computed forward flow from each source pixel to the undistorted image.
Since the application in this paper often involves large, smooth distortions, we propose a modification that significantly improves the convergence rate and quality. The traditional method initializes the source point using the flow at the output pixel, that is, it assumes that the flow at the unknown source point equals the flow at the output pixel. If the two flows are indeed nearly equal, the iterative search converges quickly. However, in the presence of large distortions, the flow is large, the source point and the output pixel are far apart, and the flows at the two locations can be very different, which makes this a poor initialization and decreases the convergence speed.
Instead of assuming that the flow at the source point and the flow at the output pixel are the same, we compute the local derivative of the flow at the output pixel using the finite difference method and use this derivative to estimate the flow at the source point. The horizontal and vertical flow components are treated separately. The horizontal derivative is computed from the flow at the output pixel and at its horizontal pixel neighbor, and it is then used to approximate the horizontal flow at the source point.
The vertical component is computed similarly, after which we proceed with the iterative search. Note that we only use this finite difference method in the first iteration to get a coarse initial estimate. The traditional, faster iterative search is then used to fine-tune the result until convergence.
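The effect of the initialization can be illustrated with a one-dimensional sketch. It assumes the fixed-point update "source = output position minus flow at the current source estimate" and derives the first-order initialization from a linearization of the flow around the output position. This derivation, the 1D simplification, and all names are assumptions for illustration and may not match the exact formulas used in our implementation.

```python
def sample(flow, pos):
    """Linearly interpolate a 1D flow array (length >= 2) at a clamped position."""
    pos = min(max(pos, 0.0), len(flow) - 1.0)
    i = min(int(pos), len(flow) - 2)
    t = pos - i
    return (1.0 - t) * flow[i] + t * flow[i + 1]


def initial_guess(flow, q, first_order):
    """Initial source estimate for output position q."""
    f_q = sample(flow, q)
    if not first_order:
        return q - f_q  # traditional: assume flow(source) == flow(q)
    # Finite difference of the flow at q, using the next pixel neighbor.
    d = sample(flow, q + 1.0) - f_q
    if abs(1.0 + d) < 1e-6:
        return q - f_q  # degenerate slope: fall back to the traditional guess
    # Linearizing flow(p) ~ flow(q) + d * (p - q) with p = q - flow(p)
    # gives flow(p) ~ flow(q) / (1 + d).
    return q - f_q / (1.0 + d)


def backward_map(flow, q, iterations, first_order=True):
    """Iterative search for the source position p such that p + flow(p) == q."""
    p = initial_guess(flow, q, first_order)
    for _ in range(iterations):
        p = q - sample(flow, p)
    return p
```

For a flow that grows linearly with position, such as a uniform magnification, the first-order guess is already exact, whereas the traditional guess only approaches the solution geometrically.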
In this section, we report the results of our work. We first analyze the results of our proposed networks in Section 5.1. Then we discuss the results of our resampling method in Section 5.2. In Section 5.3, we show qualitative and quantitative comparisons of our approach to previous methods for correcting specific distortion types. In Section 5.4, we show some applications of our method. CNN training details are given in the supplementary material.
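Table 1 (excerpt). Columns: method, training data, flow error for each of the six distortion types, and their average.
| GeoNetM w/o Clas | Single-type | 1.57 | 1.12 | 3.01 | 2.91 | 1.01 | 1.32 | 1.82 |
| GeoNetM w/o Clas | Multi-type | 3.07 | 2.24 | 3.75 | 4.99 | 3.35 | 1.73 | 3.19 |
| GeoNetM w/ Hou | Multi-type | 1.78 | 1.34 | 2.77 | 2.27 | 2.25 | 1.22 | 1.94 |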
To evaluate the performance of GeoNetS, we compare it with GeoNetM without the classification branch and train the two networks on the same dataset containing only single-type distortion. The first two rows in Table 1 show that, by explicitly considering the distortion model in the network, GeoNetS achieves better results. Moreover, the flow is globally smooth due to the restriction imposed by the distortion model.
Second, we examine how the joint learning with classification improves the distortion flow prediction. The third and fourth rows in Table 1 show that GeoNetM with joint learning using the classification branch has more accurate prediction than GeoNetM w/o classification training in the multi-type distortion dataset. The classification accuracy of GeoNetM is for these six kinds of distortion. Table 1 also shows that single-type achieves lower flow error than multi-type since additional information needs to be learned for the multi-type task.
We also examine whether the model fitting method improves prediction accuracy for GeoNetM. The last two rows in Table 1 show that the Hough transform based model fitting provides more accurate results. More results from GeoNetM are shown in Figure 5. For each example, the distorted image is shown on the left, the corrected output image in the middle, and the three flows (before fitting, after fitting, and ground truth) on the right. More real image results and a detailed discussion of model fitting, GeoNetS, and GeoNetM are given in the supplementary material.
Next we present the results of our resampling strategy. Figure 4 shows results applied to images with two different distortion levels. Note that, on the top row, it takes roughly 10 iterations to converge using the traditional iterative search approach (IS), whereas 5 iterations suffice when using our initialization. On the second row, with a more severe distortion, even after 15 iterations, the traditional method has not satisfactorily converged, whereas with our initialization the results also converge within 5 iterations.
Figure 6 demonstrates how our method converges more quickly to the ground truth (left), and how the vast majority of pixels already have an endpoint error lower than 1/5 of a pixel after just 5 iterations. A parallel version of our approach has been implemented on the GPU, using a machine with an Intel Xeon E5-2670 v3 2.3 GHz CPU and an Nvidia Tesla K80 GPU. It can resample an image in under 50 ms.
5.3 Comparison with previous techniques
Next, we compare our distortion correction technique to some existing methods that are specialized to some distortion types. Note that, unlike these methods, our learning-based approach is able to handle different distortion types.
[Figure 7. Panels, left to right: Source, Ours, Result of [aleman2014automatic], Result of [santana2015invertibility].]
Figure 7 compares our approach with [aleman2014automatic] and [santana2015invertibility], which are specialized for lens distortion. Note that for cases where the image has obvious distortion (e.g., first row), all methods can correct accurately. However, in cases where the distortion is more subtle or does not exhibit highly distorted lines (e.g., bottom two rows), our approach yields improved results. Figure 8 shows a quantitative comparison based on 50 randomly chosen images from the dataset of [jegou2008hamming]. These images include a variety of scene types (e.g., nature, man-made, water) and are distorted with random distortion parameters to generate our synthetic dataset. All of these methods use Fitzgibbon's single parameter division model [fitzgibbon2001simultaneous], so we can calculate the relative error of the estimated distortion parameters for comparison. Note that, with our approach, the number of sample images (y-axis) that lie within the error thresholds (x-axis) is significantly higher than with the other methods.
[Figure 9. Panels, left to right: Source, Ours, Result of [chaudhury2014auto].]
For the perspective distortion, we compare with [chaudhury2014auto]. Here we use angle deviation as the metric. We collect 30 building images under orthographic projection and distort them with different homography matrices, keeping the distorted vertical lines within a limited angular range. We then detect straight lines within this range in the correction results using the line segment detector [von2010lsd], assume that these lines should be vertical after correction, and calculate their average angle deviation. As shown in Table 2 and Figure 9, our method outperforms the previous approach [chaudhury2014auto].
In addition, we explored applications that can benefit from our distortion correction method directly.
Our system can detect the distortion in a reference image and transfer it to a target image. We estimate the forward flow from the reference image to its corrected version and then directly apply it to the target by bilinear interpolation. Figure 10 shows two examples of transferring distortion from a reference image to a target image, in order to accentuate the perspective of a house photograph (upper row) or apply aggressive barrel distortion to a portrait (lower row).
To achieve distortion exaggeration, we can reverse the direction of the estimated flow field to move each pixel further away from its undistorted position, and use our resampling approach to generate an output with exaggerated distortion. Figure 11 shows a building with a perspective effect. We can adjust the level of distortion, exaggerating or reducing the effect, by amplifying or reversing the flow.
Co-occurring distortion correction.
Sometimes an image could have more than one type of distortion. We can correct the distorted image simply by running our correction algorithm twice iteratively. For each iteration, it detects and corrects the most severe type of distortion that it encounters. See the supplementary material for some examples and results.
In conclusion, we present the first approach to blindly correct several types of geometric distortions from a single image. Our approach uses a deep learning method trained on several common distortions to detect the distortion flow and type. Our model fitting and parameter estimation approach then accurately predicts the distortion parameters. Finally, we present a fast parallel approach to resample the distortion-corrected images. We compare our techniques to recent specialized methods for distortion correction and present applications such as distortion transfer, distortion exaggeration, and co-occurring distortion correction.