1 Introduction
Underwater image analysis is a challenging and relatively under-explored area of computer vision. In particular, when a scene submerged in water is imaged by a camera in air, the scene exhibits severe spatial distortions due to dynamic refraction at the wavy water surface. Such distortions can interfere with higher-level tasks such as object recognition, tracking, motion analysis or segmentation, which are required in applications such as coral reef monitoring, surveillance of shallow riverbeds to observe vegetation
[31], or the study of visual perception in water-birds (see references in [1]). By the principle of reversibility of light, there are also applications in submarines, where a camera in water observes scenes in air [1].

Related Work: There exists a moderately sized body of literature on this topic. The earliest work, to our knowledge, is from [17], where frame-to-frame optical flow is estimated using a correlation-based method, and the underlying image is estimated from the centroid of the flow trajectory at each point. Such a method is expensive and error-prone due to ambiguities in optical flow (especially in the case of large motion) and reflective or blur artifacts. The work in [28] infers a set of 'water bases' from synthetic underwater scenes generated from the wave equation, and then expresses deformation fields within small patches as linear combinations of the water bases. The work in [18] performs non-rigid registration of blurred versions of all frames in the sequence with an evolving template (initialized to be the mean image), followed by a robust PCA step [3]. Both these methods are expensive and prone to local minima in the case of large motion, leaving behind some residual motion or geometric distortion. Along similar lines, [12] proposes a method to register all frames of a video sequence with a 'reference frame', chosen to be the frame with the least blur. The so-called 'lucky region' approach has been developed in [8], [5], [35] and [34]. In this approach, distortion-free patches, which correspond to portions of the image formed through a locally flat region of the water surface, are identified and then stitched together using graph embedding. In [24], the restoration is performed under the assumption that the water surface is a single unidirectional cyclic wave (UCW). The restoration process is framed as a blind deconvolution problem, the input to which is the average of the video frames (equivalent to a motion-blurred image due to a long camera exposure). The UCW assumption will not hold in some scenarios, for example if the water waves are formed from a superposition of different constituent waves due to multiple independent disturbances. Recently, [15] presented a deep learning framework to restore single distorted underwater images (as opposed to using video sequences). A neural network was trained on pairs of distortion-free and distorted images to infer geometric distortion and also apply photometric correction. This method does not account for the extra information, in the form of temporal redundancy, which is readily available in even short video sequences. To be effective, it also requires a large amount of training data. In contrast, our method is based on principles of basic physics and geometry. It does not require training data as in [15], representative distortion-free templates (which are generally hard to acquire) to drive the technique as in [29, 30], multiple illumination sources as in [31], or multiple viewpoints as in [20].

Overview:
In this paper, we present a novel method that exploits the inherent spatio-temporal redundancy of water waves. We note that the motion vector fields (MVFs), defined as the collection of displacement vectors at each point in every frame of the video sequence, have a sparse representation in the discrete Fourier basis. This emerges from the spatial smoothness of the flow, as well as its temporal smoothness and periodicity. We begin by tracking some salient feature points across all frames of the video to yield point-trajectories (PTs), and then convert these into displacement-trajectories (DTs). Given these DTs, we use a compressed sensing (CS) method to infer the DTs at all other points in the image domain. The specific manner in which we apply CS for motion estimation in this paper is a novel contribution (see Sec. 2.3.4 for a comparison to other approaches for sparsity-based motion estimation). We also observe that our method largely removes the non-rigid motion and outperforms state-of-the-art methods. Our second major contribution is the use of an existing optical flow method based on local polynomial image representations [9] for this particular task. Despite its simplicity, we show that this method outperforms the state of the art. Lastly, we show that a two-stage approach, with CS followed by the optical flow method, leads to better video stabilization as well as image structure preservation (visual and numerical) than the optical flow stage alone, at only slightly greater computational cost.

2 Theory
We first present a complete description of the various assumptions made in our restoration task, and compare them with those of other aforementioned methods.
2.1 Image Formation
We assume a static planar scene submerged at unknown depth below a clear, shallow water surface, imaged by an orthographic camera in air whose optical axis points downwards, perpendicular to the scene. This assumption is valid in practice (as also seen from our results on real sequences in Section 3) and has also been made in existing work such as [28, 24, 17]. Let $f$ be the original image (of size $H \times W$) as it would be formed in the absence of any wavy water surface. Then the distorted image at time $t$ due to the wavy water surface is given as $g_t(\mathbf{p} + \mathbf{v}(\mathbf{p}, t)) = f(\mathbf{p})$, where $\mathbf{v}(\mathbf{p}, t)$ is the displacement at point $\mathbf{p}$ (indexing into the undistorted image $f$) at time $t$. A precise relationship between $\mathbf{v}(\mathbf{p}, t)$ and the derivatives of the dynamic height $h$ of the water surface at the point of refraction has been derived in previous work [17]. Here, our aim is to estimate $\mathbf{v}(\mathbf{p}, t)$ for all $\mathbf{p}$ and $t$, and thereby $f$, given the distorted frames $\{g_t\}_{t=1}^{N}$. We assume that the video frames are largely blur-free, though moderate deviations from this assumption do not affect our theory or results. We ignore effects such as reflection of light from the water surface (which were found to be absent or rare in the real videos we gathered and in those from [28]).
2.2 Water Surface and Motion Vector Field Model
In our work, we essentially require the wavy water surface to be a smooth signal in space and time, and also temporally periodic. The assumption of spatio-temporal smoothness is common in the literature on this topic, for example [28, 18], and excludes turbulent flows. We do not require the water surface to be explicitly modelled as a linear combination of sinusoidal waves, though our method works very well even for such cases. The motion vector at point $\mathbf{p}$ (of the underlying undistorted video) at time instant $t$ is denoted as $\mathbf{v}(\mathbf{p}, t) = (v_x(\mathbf{p}, t), v_y(\mathbf{p}, t))$. The complete motion vector field (MVF) can be represented as two 3D signals of size $H \times W \times N$, containing the $x$- and $y$-components of the displacements at every pixel $\mathbf{p}$ and every time instant $t$. Due to their spatio-temporal smoothness (and thereby their band-limited nature), both will admit a sparse (or compressible) decomposition in the Fourier space. For computational reasons, we use the Discrete Fourier Transform (DFT) basis. Given the innate interdependence between $v_x$ and $v_y$ (since they emerge from the same wavy water surface), we instead work with a complex-valued vector field $v := v_x + j v_y$, where $j = \sqrt{-1}$. This is a natural way to exploit the interdependence. Moreover, if the video sequence is long enough that the actual MVF is temporally periodic, then that further contributes to the sparsity of the Fourier-domain representation. This is because, by definition, periodic signals are sparse in the Fourier domain, even more so if they are band-limited, which is a consequence of smoothness.
The assumption of sparsity of the MVF in the Fourier basis finds corroboration in the fluid mechanics literature. For example, in [22, 10, 36], the rapid convergence of the Fourier series of different types of time-periodic velocity vector fields arising in fluid flow has been demonstrated. Due to this, the water surface height $h$, and hence the MVFs (which are related to the derivatives of $h$), will also admit a sparse Fourier decomposition. In addition, in Sec. 3.3, we present an empirical verification of the Fourier sparsity of the MVF from real underwater video sequences.
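As a quick illustration of this argument, the following minimal Python sketch (for exposition only, with arbitrarily chosen grid size and wave parameters) builds a smooth, temporally periodic complex MVF from a handful of sinusoidal components and checks how much of its energy is captured by the largest 1% of its 3D-DFT coefficients.

```python
import numpy as np

# Build a smooth, temporally periodic complex MVF v = vx + j*vy on an
# (H x W x N) space-time grid as a superposition of a few sinusoidal waves.
H, W, N = 64, 64, 100
yy, xx, tt = np.meshgrid(np.arange(H), np.arange(W), np.arange(N), indexing='ij')
rng = np.random.default_rng(0)
v = np.zeros((H, W, N), dtype=complex)
for _ in range(5):                                # a few wave components
    kx, ky = rng.uniform(-0.2, 0.2, size=2)       # low spatial frequencies (smooth)
    om = 2 * np.pi * rng.integers(1, 6) / N       # temporal frequency (periodic in N)
    amp, phase = rng.uniform(0.5, 2.0), rng.uniform(0, 2 * np.pi)
    v += amp * np.exp(1j * (kx * xx + ky * yy + om * tt + phase))

# 3D-DFT and fraction of energy in the largest 1% of coefficients.
theta = np.fft.fftn(v, norm='ortho')
energy = np.sort(np.abs(theta).ravel() ** 2)[::-1]
k = int(0.01 * energy.size)
print('energy captured by top 1%% of DFT coefficients: %.4f'
      % (energy[:k].sum() / energy.sum()))
```

For a smooth and temporally periodic field the printed fraction comes out close to 1, which is precisely the compressibility that the CS step in Sec. 2.3.3 relies on.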
2.3 Method Overview
An overview of our method is summarized in pseudo-code in Alg. 1. The detailed steps are described in the following subsections.
2.3.1 Feature point detection and tracking
We begin with a salient feature point detection and tracking algorithm yielding a point-trajectory $\{(x_{it}, y_{it})\}_{t=1}^{N}$ for each salient feature point $i$ detected in the first frame. The coordinates $(x_{it}, y_{it})$ represent the position in frame $t$ of the point whose coordinates in a distortion-free frame are denoted as $(x_{iu}, y_{iu})$, where the subscript 'u' refers to an index in the undistorted image. Of course, $(x_{iu}, y_{iu})$ are unknown at the outset. Our salient feature point detection combines four algorithms: (1) the difference of Gaussians (DoG) detector used in SURF [2], (2) the FAST method [23], (3) the popular Harris corner method, and (4) the BRISK technique [14]. We take the union of the salient points detected in the first frame by all these methods. All points in this set are tracked in subsequent frames using the well-known Kanade-Lucas-Tomasi (KLT) tracker [16]. We obtain excellent results with the standard KLT tracker because it inherently takes care of locally affine motion (a first approximation to non-rigid motion). In some cases, however, we encounter tracking errors. Such trajectories are weeded out and not used in later steps if (1) they are considered invalid by the KLT tracker itself (which happens when the region around a salient feature point in a frame cannot be accurately expressed as an affine transformation of the corresponding region in a previous frame), or if (2) the centers of trajectory (COT), as defined in Sec. 2.3.2, computed over the first and last frames differ by more than a threshold of 3 pixels.
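A minimal OpenCV-based sketch of this step is given below; it is illustrative rather than the exact configuration used in our experiments (the detector settings are library defaults, the Harris corners are obtained via goodFeaturesToTrack, and the SURF detector from opencv-contrib is simply skipped if unavailable).

```python
import cv2
import numpy as np

def detect_salient_points(gray):
    """Union of salient points from several detectors (illustrative settings)."""
    pts = [k.pt for k in cv2.FastFeatureDetector_create().detect(gray)]
    pts += [k.pt for k in cv2.BRISK_create().detect(gray)]
    corners = cv2.goodFeaturesToTrack(gray, maxCorners=2000, qualityLevel=0.01,
                                      minDistance=5, useHarrisDetector=True)
    if corners is not None:
        pts += [tuple(c.ravel()) for c in corners]
    try:                                  # SURF requires opencv-contrib
        pts += [k.pt for k in cv2.xfeatures2d.SURF_create().detect(gray)]
    except (AttributeError, cv2.error):
        pass
    # Merge near-duplicate detections by rounding to integer pixel locations.
    return np.unique(np.round(pts), axis=0).astype(np.float32)

def track_points(frames):
    """Track the union set across all frames with the KLT tracker.
    frames: list of grayscale uint8 images; returns a (num_points, N, 2) array."""
    prev = frames[0]
    pts = detect_salient_points(prev).reshape(-1, 1, 2)
    valid = np.ones(len(pts), dtype=bool)
    positions = [pts.reshape(-1, 2).copy()]
    for cur in frames[1:]:
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev, cur, pts, None)
        valid &= status.ravel().astype(bool)     # weed out invalid KLT tracks
        positions.append(nxt.reshape(-1, 2).copy())
        pts, prev = nxt, cur
    traj = np.stack(positions, axis=1)           # point-trajectories (PTs)
    return traj[valid]
```

The additional COT-based pruning of criterion (2) is applied once the trajectories are converted to displacement-trajectories in the next subsection.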
We also trained a Siamese network following [26] to learn good feature descriptors. See supplemental material for further details about this. The Siamese network produced slightly better results than the KLT tracker on unseen synthetic and real data. However it did not perform as well as the KLT tracker if there was blur in the video frames. Hence we used the KLT tracker in all experiments. Examples of point tracking on real sequences are shown in the supplemental material folder ‘CS_MotionReduction’.
2.3.2 Displacement computation
Following the previous definitions of $(x_{iu}, y_{iu})$ and the point-trajectory $\{(x_{it}, y_{it})\}_{t=1}^{N}$, we approximate $(x_{iu}, y_{iu}) \approx (\bar{x}_i, \bar{y}_i) := \frac{1}{N}\sum_{t=1}^{N}(x_{it}, y_{it})$ (termed the 'center of trajectory' or COT), although more robust 'means' such as the median can also be considered. This approximation is well justified by the assumption of the local symmetry of water motion, due to which the average surface normal (across time) at any point on the water surface is close to the vertical [17]. Our experiments on synthetic and real sequences confirm that this is valid even for a moderate number of frames $N$. The supplemental material includes an illustrative example. With this, the set of displacements for the $i$-th salient feature point is given as $\{(x_{it} - \bar{x}_i,\, y_{it} - \bar{y}_i)\}_{t=1}^{N}$. We term these 'displacement-trajectories' (DTs).
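In code, the COT and DT computation reduces to two array operations on the trajectory array produced by the tracking sketch above (the traj name and shape are illustrative conventions):

```python
import numpy as np

def displacement_trajectories(traj):
    """traj: (num_points, N, 2) array of tracked positions (x_it, y_it).
    Returns the COTs (approximate undistorted positions) and the DTs."""
    cot = traj.mean(axis=1)            # center of trajectory per point;
                                       # the median is a more robust alternative
    dts = traj - cot[:, None, :]       # displacement-trajectories
    return cot, dts
```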
2.3.3 MVF Estimation using CS
The DTs can be regarded as sparse samples (in the space-time domain) of the 3D MVF signal $v$. The signal $v$ is sparse in the Fourier domain (see Sec. 2.2) and hence can be expressed as $v = \Psi\theta$, where $\Psi$ is the 3D-DFT basis matrix and $\theta$ is a sparse vector of Fourier coefficients. If the DTs are concatenated to form a complex-valued 'measurement vector' $y$ of $m$ elements, then we have the following model:

$$y = \Phi\Psi\theta + \eta, \qquad (1)$$

where $\Phi$ is a sampling matrix of size $m \times HWN$ and $\eta$ is a noise vector of $m$ elements indicating errors in the DTs obtained from the tracking algorithm. Note that $\Phi$ is a row-subsampled version of the identity matrix of size $HWN \times HWN$, and each row of $\Phi$ is a one-hot vector which indicates whether or not the displacement at some pixel (in the undistorted image) at some time frame was included in the set of DTs (and hence in the measurement vector $y$). The sensing matrix $\Phi$ and representation matrix $\Psi$ are an ideal combination, because they are highly incoherent with each other. This augurs well for the application of a CS algorithm for estimation of $\theta$ (and thereby $v$) from $y$, because CS theory states that $O(s\,\mu^2(\Phi,\Psi)\log(HWN))$ measurements are sufficient for accurate reconstruction of the $s$-sparse vector $\theta$ with very high probability [4]. Here $\mu(\Phi,\Psi)$ is the coherence between $\Phi$ and $\Psi$, defined as $\mu(\Phi,\Psi) := \sqrt{HWN}\,\max_{i,k}|\Phi^i\Psi_k|$, where $\Phi^i$ and $\Psi_k$ are the $i$-th row of $\Phi$ and the $k$-th column of $\Psi$ respectively. Given the choice of $\Phi$ and $\Psi$ for our task, $\mu$ reaches its lower bound of 1, thereby reducing the number of samples required for reconstruction guarantees. To account for the noise in $y$, we determine $\theta$ using an estimator (popularly called the LASSO) which minimizes the following objective function:

$$J(\theta) := \|y - \Phi\Psi\theta\|_2^2 + \lambda\|\theta\|_1. \qquad (2)$$

The regularization parameter $\lambda$ can be chosen by cross-validation [33] from a set of candidate values. That is, for every candidate $\lambda$, a candidate signal $\hat{\theta}(\lambda)$ is computed by the LASSO method using a subset $y_r$ containing only (say) 90% of the measurements from $y$. Following this, the validation error $\|y_v - \Phi_v\Psi\hat{\theta}(\lambda)\|_2$ is computed, where $y_v$ denotes the remaining measurements in $y$ and $\Phi_v$ the corresponding rows of $\Phi$. The value of $\lambda$ that minimizes this validation error is selected. Following this, $\theta$ is re-estimated by the LASSO method from all measurements in $y$ with the selected value of $\lambda$.
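To make Eqns. (1) and (2) concrete, the sketch below recovers the complex MVF on the full space-time grid from scattered DT samples using a plain ISTA solver for the LASSO, with FFT-based operators for $\Phi\Psi$. It is a simple stand-in for exposition: our experiments use the YALL1 solver (Sec. 3.2), and the placement of each DT at its rounded COT pixel location via a boolean mask, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

def solve_cs_mvf(y, mask, shape, lam, n_iter=300):
    """Estimate the complex MVF v = vx + j*vy from sparse displacement samples
    by minimizing ||y - Phi Psi theta||_2^2 + lam * ||theta||_1 (Eqn. 2) via ISTA.
    Psi: orthonormal 3D-DFT basis; Phi: subsampling at the True entries of
    `mask` (a boolean array of shape (H, W, N)); y: complex DT samples."""
    def A(theta):                          # forward operator Phi Psi theta
        return np.fft.ifftn(theta, norm='ortho')[mask]

    def At(res):                           # adjoint operator Psi^H Phi^T res
        full = np.zeros(shape, dtype=complex)
        full[mask] = res
        return np.fft.fftn(full, norm='ortho')

    theta = np.zeros(shape, dtype=complex)
    t = 0.5                                # safe step: the gradient is 2-Lipschitz
    for _ in range(n_iter):
        grad = 2.0 * At(A(theta) - y)      # gradient of the quadratic term
        z = theta - t * grad
        # complex soft-thresholding: proximal step for t * lam * ||theta||_1
        mag = np.abs(z)
        shrink = np.maximum(mag - t * lam, 0.0)
        theta = z * (shrink / np.maximum(mag, 1e-12))
    v = np.fft.ifftn(theta, norm='ortho')  # reconstructed MVF v = Psi theta
    return v, theta
```

Cross-validation of $\lambda$ then amounts to calling this routine with 90% of the `mask` entries switched on and scoring the prediction `np.fft.ifftn(theta, norm='ortho')[held_out_mask]` against the held-out DTs (with `held_out_mask` denoting the complementary boolean mask), exactly as described above.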
2.3.4 Comments Regarding MVF Estimation using CS
Note that our method is very different from the bispectral approach in [35], which chooses 'lucky' (i.e., least distorted) patches by comparing to a mean template. In that method, the Fourier transform is computed locally on small patches in the spatial domain to find similarity with corresponding patches from a mean image. In contrast, our Fourier decomposition is spatio-temporal and global. The idea of dense optical flow interpolation (not specific to underwater scenes) from a sparse set of feature point correspondences has been proposed in the so-called EpicFlow technique [21]. The interpolation there uses non-parametric kernel regression or a locally affine method. However, our method uses key properties of water waves (spatio-temporal smoothness and temporal periodicity), and thus considers temporal aspects of the MVFs. This aspect is missing in EpicFlow. Nevertheless, we present comparisons to EpicFlow in Sec. 3. The use of sparsity-based techniques for dense flow field estimation is not entirely new and has appeared earlier in [6, 25, 13]. However, besides the use of sparsity for underwater image restoration, there are key differences between our approach and the existing ones. (a) First, these papers use a sparse representation (e.g., wavelets [25], learned dictionaries [13], or low-rank and sparse models [6]) for optical flow in small patches, unlike our method which is more global. (b) Second, they compute the optical flow only between two frames, with a data fidelity term based on the brightness constancy equation (unlike our approach, which uses displacement trajectories); they do not consider spatio-temporal patches, and they do not account for temporal redundancy, which is a readily available and useful prior that our approach exploits.
2.3.5 Polynomial Image Expansions for Optical Flow
The classical optical flow method in [9] expresses small patches from the two images $f_1$ and $f_2$, between which the MVF has to be computed, as second-degree polynomials. This method can unduly smooth motion discontinuities, as mentioned in [9], but it is well suited to our problem due to the spatial smoothness of water waves. Consider the following:

$$f_1(\mathbf{x}) \approx \mathbf{x}^\top A_1 \mathbf{x} + \mathbf{b}_1^\top \mathbf{x} + c_1, \qquad f_2(\mathbf{x}) = f_1(\mathbf{x} - \mathbf{d}) \approx \mathbf{x}^\top A_2 \mathbf{x} + \mathbf{b}_2^\top \mathbf{x} + c_2, \qquad (3)$$

where $\mathbf{d}$ is the 2D displacement vector at the point $\mathbf{x}$. Consider small patches in the two images, centered around point $\mathbf{x}$. The polynomial coefficients $(A_1, \mathbf{b}_1, c_1)$ and $(A_2, \mathbf{b}_2, c_2)$ can be determined by local regression. This process is repeated in sliding-window fashion all through the image, and so these coefficients become functions of $\mathbf{x}$. Assuming a slowly changing MVF, the displacement can be computed in the following manner:

$$\mathbf{d}(\mathbf{x}) = \Big(\sum_{\Delta\mathbf{x}} A(\mathbf{x}+\Delta\mathbf{x})^\top A(\mathbf{x}+\Delta\mathbf{x})\Big)^{-1} \sum_{\Delta\mathbf{x}} A(\mathbf{x}+\Delta\mathbf{x})^\top \Delta\mathbf{b}(\mathbf{x}+\Delta\mathbf{x}), \qquad (4)$$

where $\Delta\mathbf{x}$ is a point in a neighborhood around $\mathbf{x}$, $A(\mathbf{x}) := \frac{1}{2}\big(A_1(\mathbf{x}) + A_2(\mathbf{x})\big)$ and $\Delta\mathbf{b}(\mathbf{x}) := -\frac{1}{2}\big(\mathbf{b}_2(\mathbf{x}) - \mathbf{b}_1(\mathbf{x})\big)$. Further details about this method can be found in [9]. We term this method polynomial expansion based optical flow (PEOF). Given $\mathbf{d}$, the image $f_2$ is warped, followed by polynomial fitting on the warped version of $f_2$ and re-estimation of $\mathbf{d}$. The procedure is repeated iteratively. In the present work, the PEOF method is used to find the MVF between each video frame and the mean image (the average of all video frames). The computed MVFs are applied to each frame to obtain the restored video. These restored images are then averaged to yield a final restored image. As shown in Sec. 3, this method outperforms all state-of-the-art methods in terms of image restoration quality as well as computation speed.
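Since our experiments use the OpenCV implementation of this method (Sec. 3.2), the PEOF restoration pipeline can be sketched as follows. The pyramid settings (3 levels, scale 0.5, 10 iterations) match those reported in Sec. 3.2, while the window and polynomial-fit parameters (winsize, poly_n, poly_sigma) and the conversion to 8-bit inputs are illustrative choices rather than values stated here.

```python
import cv2
import numpy as np

def peof_restore(frames):
    """Restore a distorted sequence with polynomial-expansion optical flow
    (the method of [9], available in OpenCV as calcOpticalFlowFarneback).
    frames: list of grayscale float32 images in [0, 1], all of the same size."""
    mean_img = np.mean(frames, axis=0).astype(np.float32)
    h, w = mean_img.shape
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    restored = []
    for frame in frames:
        # MVF between the mean image and the current frame.
        flow = cv2.calcOpticalFlowFarneback(
            (mean_img * 255).astype(np.uint8), (frame * 255).astype(np.uint8),
            None, pyr_scale=0.5, levels=3, winsize=15, iterations=10,
            poly_n=5, poly_sigma=1.1, flags=0)
        # Warp the frame back onto the mean image's coordinate grid.
        restored.append(cv2.remap(frame, gx + flow[..., 0], gy + flow[..., 1],
                                  interpolation=cv2.INTER_LINEAR))
    return np.mean(restored, axis=0), restored    # final image, restored video
```

In the combined CS+PEOF pipeline, the frames fed to this routine are those already partially stabilized by the CS stage.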
3 Experimental Results
In this section, we present an extensive suite of results on both synthetic and real video sequences. All image and video results are available in the supplemental material.
3.1 Description of Datasets
We created several synthetic 50 fps videos by simulating the refraction model from [17] on different images containing objects/text, for a scene depth of 25 cm below the water surface. The water surface was generated as a superposition of sinusoidal waves with randomly chosen parameters. We henceforth refer to this dataset as Synthetic. We also gathered real video sequences (with a 50 fps camera) of laminated posters kept at the bottom of a water tank, with waves generated by mechanical paddles. See the supplemental material for details of the acquisition. Visual inspection revealed that blur was occasionally present in some frames. We henceforth refer to this dataset as Real1. For ground truth, we also acquired a single image of the same posters under still water with the same camera settings. We also demonstrate results on three text sequences (at 125 fps) obtained from [28], for which ground truth was available. We henceforth refer to this dataset as Real2. Note that Real1 is a more challenging dataset than Real2 due to greater frame-to-frame motion; see Table 2 for the standard deviation values of the motion, computed over salient point trajectories.

3.2 Key Parameters and Comparisons
We compared restoration results for several algorithms: (1) our CS-based method (CS) from Sec. 2.3.3; (2) our PEOF method from Sec. 2.3.5; (3) the CS-based method followed by the PEOF method (CS+PEOF); (4) the two-stage method in [18] consisting of spline-based registration followed by RPCA (SBR-RPCA), which is considered state of the art for underwater image restoration; (5) the method from [28] using learned water bases (LWB); and (6) the deep learning (DL) approach from [15]. For SBR-RPCA and LWB, we used code provided by the authors with default parameters. For DL, we used the pre-trained network and code provided by the authors on each video frame separately and then computed the mean image. We performed all computation and quality assessment with each video frame resized to a common size (after suitable cropping to maintain aspect ratio), as required by the respective implementations. For CS, we used the well-known YALL1 (Basic) solver [37], which allows for $\ell_1$-norm optimization of complex-valued signals. We observed better and faster results in practice by downsampling the DTs comprising $y$ by a factor of 8 in the spatial directions (which is in tune with the band-limited nature of water waves), followed by CS reconstruction and subsequent upsampling to obtain the final reconstructed MVF. For PEOF, we used the OpenCV implementation with a multi-scale pyramidal approach with 3 levels, a pyramid scale of 0.5 and 10 iterations (i.e., the default parameters). For quality assessment with reference to ground truth, we used the following measures: (i) visual inspection of the restored video as well as the mean frame of the restored video, (ii) RMSE between the restored mean image and the image $f$ representing the undistorted static scene, (iii) normalized mutual information (NMI) between the restored mean image and $f$, and (iv) SSIM [32] between the restored mean image and $f$.
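A minimal sketch of these quality measures, assuming both images are scaled to [0, 1], is given below; the exact RMSE normalization is not reproduced here, so the plain root-mean-squared error is an illustrative stand-in, and the NMI/SSIM calls use scikit-image defaults.

```python
import numpy as np
from skimage.metrics import structural_similarity, normalized_mutual_information

def restoration_metrics(restored_mean, ground_truth):
    """Quality of the restored mean image w.r.t. the still-water ground truth.
    Both inputs are float arrays of the same shape, scaled to [0, 1]."""
    rmse = np.sqrt(np.mean((restored_mean - ground_truth) ** 2))
    nmi = normalized_mutual_information(ground_truth, restored_mean)
    ssim = structural_similarity(ground_truth, restored_mean, data_range=1.0)
    return {'RMSE': rmse, 'NMI': nmi, 'SSIM': ssim}
```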
We also compared PEOF to a very competitive optical flow algorithm, EpicFlow (EF) [21]. For EF, we used the authors' code to estimate the deformation of all video frames w.r.t. the mean frame of the video. We then applied the deformations to each frame, and the restored frames were averaged to yield the final image. Results comparing PEOF and EF are included in the supplemental material, and show that PEOF outperforms EF for underwater image restoration. Note that the EF method has so far not been applied to this task in the literature. We do not present results with state-of-the-art deep learning approaches for optical flow such as [7] or [27] here, for two reasons: (i) EF yielded superior results on our data compared to [7], and (ii) the results of PWC-Net from [27] show only a small improvement over EpicFlow on some datasets such as Sintel.
We did not compare with the work in [24] because it relies on a unidirectional wave motion assumption (whereas we assume more general wave models), and because publicly released code is unavailable. Similarly, we did not explicitly compare our results with the method in [12], for which publicly released code is also unavailable. However, we observed that CS+PEOF outperformed the method of [12] on Real2 (compare 'Middle', 'Small' and 'Tiny' in Table 1 to 'Large', 'Medium' and 'Small' respectively in Table 1 of [12]).
[Figure 1: Fourier sparsity of the estimated MVF for a real sequence. Figure 2: effect of the number of frames and of tracked points on reconstruction quality. Both are discussed in Sec. 3.3.]
3.3 Discussion of Results
The numerical results are presented in Table 1. The mean images of three real videos restored by the various methods are presented in Fig. 3. The supplemental material contains results on 14 videos (mean images and restored videos) for all methods. From these results, it is clear that our methods (CS, PEOF and CS+PEOF) yield results surpassing SBR-RPCA, LWB and DL on synthetic as well as real datasets, numerically and also in terms of visual quality. We also supplemented our method with a step involving RPCA [3] to remove sparse artifacts, which improved video stability but had very little impact on the quality of the mean image. Although PEOF produces superior numerical results to CS, we have observed that CS produces restored videos and mean images with superior visual quality compared to PEOF; see Fig. 3 (grid lines on 'Elephant', the words 'Imaging' and 'Distortion' in 'Middle') as well as the supplemental material.
Table 1: computation time and restoration quality (NMI, SSIM, RMSE) for all methods on all datasets (our methods below; competing methods in the continuation further down).

| | CS | PEOF | CS+PEOF |
Dataset | Time | NMI | SSIM | RMSE | Time | NMI | SSIM | RMSE | Time | NMI | SSIM | RMSE |
Real1 | ||||||||||||
Cartoon | 0m 42s | 1.227 | 0.902 | 0.065 | 0m 41s | 1.216 | 0.913 | 0.062 | 1m 23s | 1.255 | 0.928 | 0.057 |
Checker | 1m 9s | 1.206 | 0.884 | 0.104 | 0m 40s | 1.196 | 0.89 | 0.105 | 1m 49s | 1.22 | 0.892 | 0.104 |
Dices | 1m 20s | 1.172 | 0.937 | 0.067 | 0m 40s | 1.139 | 0.905 | 0.075 | 2m 1s | 1.188 | 0.956 | 0.059 |
Bricks | 1m 1s | 1.148 | 0.785 | 0.142 | 0m 34s | 1.151 | 0.803 | 0.121 | 1m 36s | 1.167 | 0.843 | 0.118 |
Elephant | 0m 28s | 1.128 | 0.801 | 0.141 | 0m 26s | 1.102 | 0.763 | 0.152 | 0m 55s | 1.132 | 0.808 | 0.143 |
Eye | 1m 22s | 1.266 | 0.961 | 0.052 | 0m 57s | 1.26 | 0.975 | 0.042 | 2m 19s | 1.303 | 0.982 | 0.037 |
Math | 1m 19s | 1.193 | 0.942 | 0.05 | 0m 37s | 1.163 | 0.929 | 0.053 | 1m 56s | 1.215 | 0.961 | 0.044 |
Synthetic | ||||||||||||
BlueTiles | 0m 28s | 1.141 | 0.792 | 0.256 | 0m 23s | 1.141 | 0.816 | 0.204 | 0m 52s | 1.161 | 0.871 | 0.182 |
BrickWall | 0m 23s | 1.094 | 0.667 | 0.144 | 0m 24s | 1.098 | 0.69 | 0.142 | 0m 47s | 1.1 | 0.703 | 0.141 |
Vision | 0m 29s | 1.181 | 0.938 | 0.09 | 0m 23s | 1.162 | 0.916 | 0.113 | 0m 52s | 1.211 | 0.972 | 0.066 |
HandWritten | 0m 37s | 1.123 | 0.878 | 0.081 | 0m 23s | 1.131 | 0.907 | 0.077 | 1m 0s | 1.156 | 0.938 | 0.075 |
Real2 | ||||||||||||
Middle | 0m 13s | 1.192 | 0.838 | 0.139 | 0m 7s | 1.211 | 0.85 | 0.165 | 0m 20s | 1.23 | 0.914 | 0.101 |
Small | 0m 9s | 1.169 | 0.763 | 0.164 | 0m 6s | 1.182 | 0.772 | 0.206 | 0m 16s | 1.195 | 0.849 | 0.133 |
Tiny | 0m 11s | 1.166 | 0.661 | 0.201 | 0m 7s | 1.176 | 0.698 | 0.263 | 0m 19s | 1.186 | 0.745 | 0.19 |
Table 1 (continued): competing methods.

| | SBR-RPCA [18] | LWB [28] | DL [15] |
Dataset | Time | NMI | SSIM | RMSE | Time | NMI | SSIM | RMSE | Time | NMI | SSIM | RMSE |
Real1 | ||||||||||||
Cartoon | 3h 2m | 1.173 | 0.843 | 0.111 | 0h 54m | 1.152 | 0.836 | 0.095 | 3s | 1.203 | 0.803 | 0.162 |
Checker | 4h 9m | 1.158 | 0.791 | 0.239 | 1h 37m | 1.105 | 0.66 | 0.322 | 3s | 1.129 | 0.544 | 0.384 |
Dices | 3h 58m | 1.1 | 0.758 | 0.17 | 1h 26m | 1.086 | 0.783 | 0.126 | 3s | 1.085 | 0.637 | 0.242 |
Bricks | 3h 43m | 1.128 | 0.686 | 0.192 | 1h 24m | 1.118 | 0.673 | 0.225 | 3s | 1.058 | 0.49 | 0.422 |
Elephant | 3h 7m | 1.075 | 0.516 | 0.257 | 0h 59m | 1.068 | 0.584 | 0.204 | 3s | 1.075 | 0.378 | 0.347 |
Eye | 4h 4m | 1.179 | 0.913 | 0.104 | 1h 22m | 1.155 | 0.903 | 0.089 | 3s | 1.141 | 0.804 | 0.191 |
Math | 4h 34m | 1.1 | 0.841 | 0.102 | 3h 0m | 1.067 | 0.766 | 0.1 | 3s | 1.073 | 0.678 | 0.139 |
Synthetic | ||||||||||||
BlueTiles | 2h 5m | 1.142 | 0.763 | 0.372 | 0h 55m | 1.104 | 0.72 | 0.204 | 3s | 1.091 | 0.365 | 1.067 |
BrickWall | 2h 30m | 1.093 | 0.666 | 0.158 | 1h 0m | 1.066 | 0.481 | 0.19 | 3s | 1.079 | 0.479 | 0.218 |
Vision | 3h 4 | 1.115 | 0.739 | 0.216 | 36m | 1.021 | 0.446 | 0.266 | 3s | 1.095 | 0.599 | 0.215 |
HandWritten | 0h 0m | 1.112 | 0.851 | 0.12 | 0h 52m | 1.073 | 0.678 | 0.147 | 3s | 1.074 | 0.546 | 0.177 |
Real2 | ||||||||||||
Middle | 1h 28m | 1.189 | 0.782 | 0.204 | 0h 54m | 1.163 | 0.761 | 0.194 | 3s | 1.122 | 0.512 | 0.307 |
Small | 1h 21m | 1.153 | 0.741 | 0.181 | 0h 33m | 1.151 | 0.688 | 0.198 | 3s | 1.114 | 0.418 | 0.323 |
Tiny | 1h 6m | 1.161 | 0.657 | 0.395 | 0h 34m | 1.167 | 0.654 | 0.238 | 3s | 1.144 | 0.492 | 0.306 |
Additionally, we observed that SBR-RPCA, DL and LWB did not preserve the image structure as well as our methods (see the grid lines in 'Elephant', the words 'Fluctuation' and 'Distortion' in 'Middle', and the letters E, T, Z and the large D in 'Eye' in Fig. 3). We believe the DL method [15] may yield improved results if its network were trained to restore multiple frames together, as opposed to single frames individually (as done by the current algorithm), which ignores the temporal aspect and leads to a loss in performance. All in all, our results show that exploiting the spatio-temporal properties of water waves for this task is indeed useful.
Computational Time: The computation times for all methods (measured on a 2.6 GHz Intel Xeon machine with 32 GB RAM) are presented in Table 1. The DL method is the fastest, whereas our methods are much faster than SBR-RPCA and LWB. Note, however, that for CS the YALL1 solver uses GPU support, which is unavailable in the authors' code for SBR-RPCA and LWB. We note that although cross-validation is an excellent way to pick the parameter $\lambda$ in Eqn. 2, we found that the optimal choice of this parameter did not change much across datasets. Also, small changes in $\lambda$ did not affect the performance much. Hence the time for cross-validation is not included in Table 1.
[Figure 3: mean images of three real videos restored by the compared methods (see Sec. 3.3); full-size images and additional results are in the supplemental material.]
Verification of Fourier Sparsity: Here, we demonstrate the sparsity of the MVFs from real underwater sequences. This is shown in Fig. 1 for the 'Elephant' sequence (similar plots can be generated for all other sequences). We note that the actual MVF is unknown and can only be estimated. However, we contend that the MVF estimated by our CS method is a good approximation to the actual one. This is evident when comparing the quality of the estimated mean image with the original undistorted image (gathered in still water). For further quantification, we also tracked the same points that were used for the CS algorithm in Sec. 2.3.3 (originally tracked in the distorted video) in the restored video produced by the CS step. This gave us a new set of DTs. We computed a measure of the motion reduction, given by the percentage decrease in the average DT magnitude from the distorted video to the restored video. We note that in most cases we achieve more than 85% motion reduction by the CS step; see Table 2. We have included a few videos in the supplemental material for visual comparison of the estimated MVF w.r.t. the ground truth MVF.
Dataset | |||
---|---|---|---|
Cartoon (Real1) | 1029 | 94.11% | 7.42 |
Checker (Real1) | 3149 | 85.25% | 8.5 |
Dices (Real1) | 2230 | 91.9% | 7.75 |
Bricks (Real1) | 1300 | 87.38% | 7.42 |
Elephant (Real1) | 3670 | 97.7% | 7.34 |
Eye (Real1) | 1647 | 81.66% | 7.84 |
Math (Real1) | 2309 | 96.12% | 5.64 |
BlueTiles (Synth.) | 2192 | 94.77% | 5.71 |
BrickWall (Synth.) | 3134 | 94.57% | 8.68 |
Vision (Synth.) | 5266 | 93.49% | 6.77 |
HandWritten (Synth.) | 3789 | 95.82% | 4.33 |
Middle (Real2) | 785 | 96.34% | 5.65 |
Small (Real2) | 993 | 97.26% | 4.22 |
Tiny (Real2) | 155 | 87.84% | 5.03 |
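The motion-reduction measure reported in Table 2 can be computed directly from the DTs of the distorted and CS-restored videos; the ratio-of-mean-magnitudes form below is an illustrative interpretation consistent with the percentages above.

```python
import numpy as np

def motion_reduction(dts_distorted, dts_restored):
    """Percentage decrease in mean displacement magnitude between the DTs
    tracked in the distorted video and those re-tracked in the restored
    video (both arrays of shape (num_points, N, 2))."""
    before = np.linalg.norm(dts_distorted, axis=-1).mean()
    after = np.linalg.norm(dts_restored, axis=-1).mean()
    return 100.0 * (1.0 - after / before)
```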
Effect of the number of frames $N$:
In the absence of attenuation, a large $N$ helps improve the performance of our algorithm, due to better Fourier sparsity and a better approximation of the COT. In practice, we observed on real datasets that just 100 frames were sufficient to yield good reconstruction results. Further increases in $N$ had an insignificant impact on the result quality. A graph showing the effect of $N$ on reconstruction from real sequences is shown in Fig. 2.
Effect of the number of tracked points: The number and accuracy of the DTs affect the performance of the CS algorithm. The number of DTs varied across datasets, depending on the number of available salient feature points, but was always small compared to the total number of pixels. The slightly lower performance of CS on the 'Tiny' sequence, for example (see Table 1), is due to the small number of available salient points (only 155; see Table 2). A graph showing the positive impact of an increase in the number of good-quality tracks (up to a point, beyond which the performance saturates) is shown in Fig. 2. We note that we ensured good quality of the trajectories for the later stages of our algorithms, as mentioned in Sec. 2.3.1. We considered global sparsity in this work, as opposed to sparsity of small spatial or spatio-temporal patches, since many patches may not contain any salient points.
4 Conclusion
We have presented two methods for correction of refractive deformations due to a wavy water surface, one based on a novel application of CS for interpolating MVFs starting from a small set of salient PTs (and their DTs), and the other based on polynomial image expansions. In both cases, we obtain results superior to the state of the art at low computational cost. Avenues for future work include (1) extending the CS algorithm to handle moving objects; (2) studying the effect of depth variation, perspective projection or wave attenuation on the results of our algorithms; and (3) exploring MVF sparsity in other bases instead of the DFT.
Acknowledgements: The authors wish to thank the Qualcomm Innovation Fellowship Program (India) for supporting this work, NVIDIA Corp. for donation of a Titan Xp GPU, and Prof. Manasa Ranjan Behera of the Civil Engineering Department at IITB, for the wave-flume facility to acquire real data.
The source code, datasets and supplemental material can be accessed at [11], [19].
References
- [1] (2013) Detecting motion through dynamic refraction. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), pp. 245–251.
- [2] (2008) SURF: speeded up robust features. Computer Vision and Image Understanding 110 (3), pp. 346–359.
- [3] (2011) Robust principal component analysis? J. ACM 58 (3), pp. 11:1–11:37.
- [4] (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory.
- [5] (2007) Improved reconstruction of images distorted by water waves. In Advances in Computer Graphics and Computer Vision.
- [6] (2014) Nonlocal sparse and low-rank regularization for optical flow estimation. IEEE Transactions on Image Processing 23 (10).
- [7] (2015) FlowNet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766.
- [8] (2004) Seeing through water. In NIPS, pp. 393–400.
- [9] (2003) Two-frame motion estimation based on polynomial expansion. In Image Analysis, pp. 363–370.
- [10] (1982) A Fourier method for solving nonlinear water-wave problems: application to solitary-wave interactions. Journal of Fluid Mechanics 118, pp. 411–443.
- [11] GitHub repository: https://github.com/jeringeo/CompressiveFlows
- [12] (2017) Correction of geometrically distorted underwater images using shift map analysis. J. Opt. Soc. Am. A 34 (4), pp. 666–673.
- [13] (2011) Optical flow estimation using learned sparse model. In CVPR.
- [14] (2011) BRISK: binary robust invariant scalable keypoints. In ICCV.
- [15] (2018) Learning to see through turbulent water. In WACV, pp. 512–520.
- [16] MATLAB implementation of the KLT tracker: https://in.mathworks.com/help/vision/ref/vision.pointtracker-system-object.html
- [17] (1992) Surface shape reconstruction of a nonrigid transparent object using refraction and motion. IEEE Trans. Pattern Anal. Mach. Intell. 14 (10), pp. 1045–1052.
- [18] (2011) A two-stage reconstruction approach for seeing through water. In CVPR, pp. 1153–1160.
- [19] Project page: https://www.cse.iitb.ac.in/~ajitvr/publications.html
- [20] (2018) Simultaneous 3D reconstruction for water surface and underwater scene. In ECCV, pp. 776–792.
- [21] (2015) EpicFlow: edge-preserving interpolation of correspondences for optical flow. In CVPR.
- [22] (1995) Utilization of Fourier decomposition for analyzing time-periodic flows. Computers and Fluids 24 (4), pp. 349–368.
- [23] (2010) Faster and better: a machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 105–119.
- [24] (2015) Deskewing of underwater images. IEEE Trans. Image Processing 24 (3), pp. 1046–1059.
- [25] (2010) Sparsity model for robust optical flow estimation at motion discontinuities. In CVPR.
- [26] (2015) Discriminative learning of deep convolutional feature point descriptors. In ICCV, pp. 118–126.
- [27] (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR.
- [28] (2009) Seeing through water: image restoration using model-based tracking. In ICCV, pp. 2303–2310.
- [29] (2012) Globally optimal estimation of nonrigid image distortion. International Journal of Computer Vision 98 (3), pp. 279–302.
- [30] (2015) Theory and practice of hierarchical data-driven descent for optimal deformation estimation. International Journal of Computer Vision 115 (1), pp. 44–67.
- [31] (2013) On observing underwater objects through a wavy water surface: a new algorithm for image correction and laboratory experiment. Izvestiya Atmosph. Ocean. Phys. 49 (3), pp. 339–345.
- [32] (2009) Mean squared error: love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117.
- [33] (2009) Compressed sensing with cross validation. IEEE Trans. Inf. Theory 55 (12), pp. 5773–5782.
- [34] (2009) Bicoherence: a new lucky region technique in anisoplanatic image restoration. Appl. Opt. 48 (32), pp. 6111–6119.
- [35] (2010) Bispectral analysis and recovery of images distorted by a moving water surface. Appl. Opt. 49 (33), pp. 6376–6384.
- [36] (2007) Real-time PDE-constrained optimization.
- [37] YALL1: Your Algorithms for L1. http://yall1.blogs.rice.edu/