1 Introduction
Underwater image analysis is a challenging and relatively less explored area of computer vision. In particular, if a scene submerged in water is imaged by a camera in air, the scene exhibits severe spatial distortions due to dynamic refraction at the wavy/dynamic water surface. Such distortions can interfere with higher level tasks such as object recognition, tracking, motion analysis or segmentation, which are required in applications such as coral reef monitoring, surveillance of shallow riverbeds to observe vegetation
[31], or studying of visual perception in waterbirds (see references in [1]). Applying the principle of reversibility of light, there also exist applications in submarines where a camera in water is observing scenes in the air [1].Related Work: There exists a mediumsized body of literature on this topic. The earliest work, to our knowledge, is from [17] where frametoframe optical flow is estimated using a correlationbased method, and the underlying image is estimated from the centroid of the flow trajectory at each point. Such a method is expensive and errorprone due to ambiguities in optical flow (especially in case of large motion), and reflective or blur artifacts. The work in [28] infers a set of ‘water bases’ from synthetic underwater scenes generated from the wave equation, and then expresses deformation fields within small patches as linear combinations of the water bases. The work in [18] performs nonrigid registration of blurred versions of all frames in the sequence with an evolving template (initialized to be the mean image) followed by a robust PCA step [3]. Both these methods are expensive and prone to local minima in case of large motion, leaving behind some residual motion or geometric distortion. Along similar lines, [12] proposes a method to register all frames of a video sequence with a ‘reference frame’, chosen to be a frame with the least blur. The socalled ‘lucky region approach’ has been developed in [8], [5], [35] and [34]. In this, distortionfree patches which correspond to patches of the image formed due to a locally flat portion of the water surface, are identified and then stitched together using graph embedding. In [24], the restoration is performed based on the assumption that the water surface is a single unidirectional cyclic wave (UCW). The restoration process is framed as a blind deconvolution problem, the input to which is the average of the video frames (equivalent to a motionblurred image due to long camera exposure). The UCW assumption will not hold in some scenarios, for example if the water waves are formed from a superposition of different constituent waves due to multiple independent disturbances. Recently, [15]
presented a deep learning framework to restore
singledistorted underwater images (as opposed to using video sequences). A neural network was trained on pairs of distortionfree and distorted images, to infer geometric distortion and also apply photometric correction. This method does not account for the extra information in the form of temporal redundancy, which is readily available in even short video sequences. To be effective, it also requires a large amount of training data. In contrast, our method is based on principles of basic physics/geometry. It also does not require training data as in
[15], representative distortionfree templates (which are generally hard to acquire) to drive the technique as in [29, 30], multiple illumination sources as in [31] or multiple viewpoints as in [20].Overview:
In this paper, we present a novel method that exploits the inherent spatiotemporal redundancy of water waves. We note that the motion vector fields (MVFs), defined as the collection of displacement vectors at each point in every frame of the video sequence, have a sparse representation in the discrete Fourier basis. This emerges from the spatial smoothness of the flow, and its temporal periodicity as well as smoothness. We begin by tracking some
salient feature points across all frames of the video, to yield pointtrajectories (PTs) , and then convert these into displacementtrajectories (DTs) . Given these DTs, we use a compressed sensing (CS) method to infer the DTs at all other points in the image domain. The specific manner in which we have applied CS for motion estimation in this paper, is a novel contribution (see Sec. 2.3.4 for a comparison to other approaches for sparsitybased motion estimation). We also observe that our method largely reduces the nonrigid motion and outperforms the state of the art methods. Our second major contribution is the usage of an existing optical flow method based on local polynomial image representations [9], for this particular task. Despite its simplicity, we show that this method outperforms the state of the art. Lastly, we show that a two stage approach with CS followed by the optical flow method leads to better video stabilization as well as image structure preservation (visual and numerical) than the optical flow stage alone, at only slightly greater computational cost.2 Theory
We first present a complete description of the various assumptions made in our restoration task, and compare them with those of other aforementioned methods.
2.1 Image Formation
We assume a static planar scene submerged at unknown depth below a clear, shallow water surface, imaged by an orthographic camera in air whose optical axis is pointing downwards perpendicular to the scene. This assumption is valid in practice (as also seen from our results on real sequences in Section 3) and has also been made in existing work such as [28, 24, 17]. Let be the original image (size ) as if it were formed in the absence of any wavy water surface. Then the distorted image due to the wavy water surface is given as
, where is the displacement at point (indexing into the undistorted image ) at time . A precise relationship between and the derivatives of the dynamic height of the water surface at the point of refraction, has been derived in previous work [17]. Here, our aim is to estimate for all given . We are assuming that the video frames are largely blurfree, though moderate deviations from this assumption do not affect our theory or results. We ignore effects such as reflection of light from the water surface (which were found to be absent or rare in real videos we gathered or those from [28]).
2.2 Water Surface and Motion Vector Field Model
In our work, we essentially require the wavy water surface to be a smooth signal in space and time, and also temporally periodic. The assumption of spatiotemporal smoothness is common in the literature on this topic, for example [28, 18], and excludes turbulent flows. We do not require the water surface to be explicitly modelled as a linear combination of sinusoidal waves, though our method works very well even for such cases. The motion vector at point (of the underlying undistorted video) at time instant is denoted as . The complete motion vector field (MVF) can be represented as two 3D signals in , containing the  and components of the displacements at every pixel and every time instant . Due to the spatiotemporal smoothness (and thereby their bandlimited nature), both will admit a sparse (or compressible) decomposition in the Fourier space. For computational reasons, we use the DiscreteFourier Transform (DFT) basis. Given the innate interdependence between (since they emerge from the same wavy water surface), we instead work with a complexvalued vector field where and . This is a natural way to exploit the interdependence. Moreover, if the video sequence is long enough so that the actual MVF is temporally periodic, then that further contributes to the sparsity of the Fourier domain representation. This is because by definition, periodic signals are sparse in the Fourier domain, even more so if they are bandlimited  which is a consequence of smoothness.
The assumption of sparsity of the MVF in the Fourier basis finds corroboration in the fluid mechanics literature. For example in [22, 10, 36], the rapid convergence of the Fourier series of different types of timeperiodic velocity vector fields arising in fluid flow, has been demonstrated. Due to this, the water surface height and hence the MVFs (which are related to derivatives of ) will also admit sparse Fourier decomposition. In addition, in Sec. 3.3, we present an empirical verification of the Fourier sparsity of the MVF from real underwater video sequences.
2.3 Method Overview
An overview of our method is summarized in a pseudocode in Alg. 1. The detailed steps are described further.
2.3.1 Feature point detection and tracking
We begin with a salient feature point detection and tracking algorithm yielding pointtrajectories for salient feature points detected in the first frame. The coordinates represent the position in frame of the point whose coordinates in a distortionfree frame are denoted as , where the subscript ‘’ refers to an index in the undistorted image. Of course, are unknown at the outset. Our salient feature point detection combines four algorithms: (1) difference of Gaussians (DoG) used by SURF[2], (2) the FAST method [23], (3) the popular Harris corner method, and (4) the BRISK technique [14]. Consider a unionset of salient points in the first frame, as detected by all these methods. All points in this set are tracked in subsequent frames using the wellknown KanadeLucasTomasi (KLT) tracker [16]. We obtain excellent results with the standard KLT tracker because it inherently takes care of locally affine motion (a first approximation to nonrigid motion). In some cases, however, we encounter tracking errors. Such trajectories are weeded out and not used in later steps, if (1) they are considered invalid by the KLT tracker itself (which happens when the region around a salient feature point in a frame cannot be accurately expressed as an affine transformation of the corresponding region in a previous frame), or if (2) the center of trajectory (COT), as defined in Sec. 2.3.2, computed over the first and last frames differ by a threshold of more than 3 pixels.
We also trained a Siamese network following [26] to learn good feature descriptors. See supplemental material for further details about this. The Siamese network produced slightly better results than the KLT tracker on unseen synthetic and real data. However it did not perform as well as the KLT tracker if there was blur in the video frames. Hence we used the KLT tracker in all experiments. Examples of point tracking on real sequences are shown in the supplemental material folder ‘CS_MotionReduction’.
2.3.2 Displacement computation
Following previous definitions of and the pointtrajectory , we approximate (termed ‘center of trajectory’ or COT), although more robust ‘means’ such as the median can also be considered. This approximation is well justified by the assumption of the local symmetry of water motion, due to which the average surface normal (across time) at any point on the water surface is close to the vertical line [17]. Our experiments for synthetic and real sequences confirm that this is valid even for moderate frames. The supplemental material includes an illustrative example. With this, our set of displacements for the salient feature point are given as . We term these as ‘displacementtrajectories’ (DTs).
2.3.3 MVF Estimation using CS
The DTs can be regarded as sparse samples (in the spacetime domain) of the 3D MVF signal . The signal is sparse in the Fourier domain (see Sec. 2.2) and hence can be expressed as where is the 3DDFT basis matrix and is a sparse vector of Fourier coefficients. If the DTs are concatenated to form a complexvalued ‘measurement vector’ of elements, then we have the following model:
(1) 
where is a sampling matrix of size and is a noise vector of elements indicating errors in the DTs obtained from the tracking algorithm. Note that
is the rowsubsampled version of the identity matrix of size
, and each row of is a onehot vector which indicates whether or not the displacement at some pixel (in the undistorted image) at some time frame was included in the set (and hence the measurement vector ). The sensing matrix and representation matrix are an ideal combination, because they are highly incoherent with each. This augurs well for the application of a CS algorithm for estimation of (and thereby ) from . This is because CS theory states that measurements are sufficient for accurate reconstruction of the sparse vectorwith very high probability
[4]. Here is the coherence between and and is defined as where and are the row of and column of respectively. Given the choice of for our task, reaches its lower bound of 1, thereby reducing the number of samples required for reconstruction guarantees. To account for the noise in , we determine using an estimator (popularly called the LASSO) which minimizes the following objective function:(2) 
The regularization parameter can be chosen by crossvalidation [33] from a set of candidate values. That is, for every , a candidate signal is computed by the LASSO method using a set of only (say) 90% of the measurements from . Following this, the value of is computed, where is the set of the remaining measurements in . The value of that minimizes can be selected. Following this, is reestimated by the LASSO method from all measurements in and with the selected value.
2.3.4 Comments Regarding MVF Estimation using CS
Note that our method is very different from the bispectral approach in [35]
which chooses ‘lucky’ (i.e. least distorted) patches, by comparing to a mean template. In that method, the Fourier transform is computed locally on small patches in the spatial domain for finding similarity with corresponding patches from a mean image. On the other hand, our Fourier decomposition is spatiotemporal and global. The idea of dense optical flow interpolation (not specific to underwater scenes) from a sparse set of feature point correspondences has been proposed in the socalled EpicFlow technique
[21]. The interpolation uses nonparametric kernel regression or a locally affine method. However our method uses key properties (spatiotemporal smoothness and temporal periodicity) of water waves, and thus considers temporal aspects of the MVFs. This aspect is missing in EpicFlow. Nevertheless, we present comparisons to EpicFlow in Sec. 3.The use of sparsity based techniques for dense flow field estimation is not entirely new and has been used earlier in [6, 25, 13]. However besides the usage of sparsity for underwater image restoration, there are key differences between our approach and the existing ones. (a) First, these papers use a sparse representation (eg., wavelets [25], learned dictionaries [13] or lowrank and sparse models [6]) for optical flow in small patches unlike our method which is more global. (b) Second, they compute the optical flow only between two frames with a data fidelity term based on the brightness constancy equation (unlike our approach which uses displacement trajectories), they do not consider spatiotemporal patches, and do not account for temporal redundancy, which is a readily available and useful prior that our approach exploits.
2.3.5 Polynomial Image Expansions for Optical Flow
The classical optical flow method in [9] expresses small patches from the two images and , between which the MVF has to be computed, as seconddegree polynomials. This method can unduly smooth motion discontinuities as mentioned in [9], but it is well suited to our problem, due to the spatial smoothness of water waves. Consider the following:
(3)  
where is the 2D displacement vector at the point . Consider small patches in the two images respectively, centered around point . The polynomial coefficients can be determined by local regression. This process is repeated in sliding window fashion all through the image, and so these coefficients become functions of . Assuming a slowly changing MVF, the displacement can be computed in the following manner:
(4) 
where is a point in a neighborhood around , and . Further details about this method can be found in [9]. We term this method as polynomial expansion based optical flow (PEOF). Given , the image is warped, followed by polynomial fitting in the warped version of , and reestimation of . The procedure is repeated iteratively. In the present work, the PEOF method is used to find the MVF between each video frame and the mean image (the average of all video frames). The computed MVFs are applied to each frame to obtain the restored video. These restored images are then averaged to yield a final restored image. As shown in Sec. 3, this method outperforms all state of the art methods in terms of image restoration quality as well as computation speed.
3 Experimental Results
In this section, we present an extensive suite of results on both synthetic and real video sequences. All image and video results are available in the supplemental material.
3.1 Description of Datasets
We created several synthetic 50 fps videos of size by simulating the refraction model from [17] on different images containing objects/text, for a scene depth of 25 cm below the water surface. The water surface was generated using superposition of sinusoidal waves with randomly chosen parameters. We henceforth refer to this dataset as Synthetic. We also gathered real video sequences (of size with a 50 fps camera) of laminated posters kept at the bottom of a watertank, with waves generated by mechanical paddles. See supplemental material for details of the acquisition. Visual inspection revealed that blur was occasionally present in some frames. We henceforth refer to this dataset as Real1. For ground truth (), we also acquired a single image of the same posters under still water with the same camera settings. We also demonstrate results on three text sequences (size at 125 fps) obtained from [28], for which ground truth was available. We henceforth refer to this dataset as Real2. Note that Real1 is a more challenging dataset than Real2 due to greater frametoframe motion  see Table 2
for the standard deviation values of the motion
, computed over salient point trajectories.3.2 Key Parameters and Comparisons
We compared restoration results for several algorithms: (1) our CSbased method (CS) from Sec. 2.3.3; (2) our PEOF method from Sec. 2.3.5; (3) the CSbased method followed by the PEOF method (CS+PEOF); (4) the twostage method in [18] consisting of splinebased registration followed by RPCA (SBRRPCA) which is considered state of the art for underwater image restoration; (5) the method from [28] using learned water bases (LWB); and (6) the deep learning (DL) approach from [15]. For SBRRPCA and LWB, we used code provided by the authors with default parameters. For DL, we used the pretrained network and code provided by the authors, on each video frame separately and then computed the mean image. We performed all computation and quality assessment with each video frame resized to (after suitable cropping to maintain aspectratio), as required by their specific implementation. For CS, we used the wellknown YALL1 (Basic) solver [37], which allows for norm optimization of complexvalued signals. We observed better and faster results in practice by downsampling the DTs comprising by a factor of 8 in directions (which is in tune with the bandlimited nature of water waves), followed by CS reconstruction and subsequent upsampling to obtain the final reconstructed MVF. For PEOF, we used the OpenCV implementation with a multiscale pyramidal approach with 3 levels, a pyramid scale of 0.5 and 10 iterations (i.e. the default parameters). For quality assessment referring to ground truth, we used the following measures: (i) visual inspection of the restored video as well as the meanframe of the restored video, (ii) RMSE computed as where is the image representing the undistorted static scene, (iii) normalized mutual information (NMI) between and , and (iv) SSIM [32] between and .
We also considered comparing PEOF to a very competitive optical flow algorithm: EpicFlow [21] (EF). For EF [21], we used the authors’ code to estimate the deformation of all video frames w.r.t. the mean frame of the video. We then applied the deformations to each frame to yield the final image. Results comparing PEOF and EF are included in the supplemental material, and show that PEOF outperforms EF for underwater image restoration. Note that the EF method has so far not been applied for this task in the literature. We do not present results with the state of the art deep learning approaches for optical flow such as [7] or [27] here, for two reasons: (i) EF yielded superior results on our data compared to [7], and (ii) the results of PWCnet from [27] show only a small improvement over EpicFlow on some datasets such as Sintel.
We did not compare with the work in [24] because it relies on a unidirectional wave motion assumption (whereas we assume more general wave models), and due to unavailability of publicly released code. Also, we did not explicitly compare our results with the method in [12] for which publicly released code is unavailable. However, we observed that CSPEOF outperformed the method of [12] on Real2 (compare ‘Middle’, ‘Small’ and ‘Tiny’ in Table 1 to ‘Large’, ‘Medium’ and ‘Small’ respectively in Table 1 of [12]).
3.3 Discussion of Results
The numerical results are presented in Table 1. The mean images of three real videos restored by various methods are presented in Figs. 3. The supplemental material contains results on 14 videos (mean images and restored videos) for all methods. From these results, it is clear that our methods (CS, PEOF and CSPEOF) yield results surpassing SBRRPCA, LWB and DL on synthetic as well as real datasets, numerically and also in terms of visual quality. We also supplemented our method with a step involving RPCA [3] to remove sparse artifacts, which improved video stability but had very little impact over the quality of the mean image. Although PEOF produces superior numerical results to CS, we have observed that CS produces restored videos and mean images with superior visual quality as compared to PEOF  see Fig. 3 (grid lines on ‘Elephant’, words ‘Imaging’, ‘Distortion’ in ‘Middle’) as well as supplemental material.
CS  PEOF  CS+PEOF  
Dataset  Time  NMI  SSIM  RMSE  Time  NMI  SSIM  RMSE  Time  NMI  SSIM  RMSE 
Real1  
Cartoon  0m 42s  1.227  0.902  0.065  0m 41s  1.216  0.913  0.062  1m 23s  1.255  0.928  0.057 
Checker  1m 9s  1.206  0.884  0.104  0m 40s  1.196  0.89  0.105  1m 49s  1.22  0.892  0.104 
Dices  1m 20s  1.172  0.937  0.067  0m 40s  1.139  0.905  0.075  2m 1s  1.188  0.956  0.059 
Bricks  1m 1s  1.148  0.785  0.142  0m 34s  1.151  0.803  0.121  1m 36s  1.167  0.843  0.118 
Elephant  0m 28s  1.128  0.801  0.141  0m 26s  1.102  0.763  0.152  0m 55s  1.132  0.808  0.143 
Eye  1m 22s  1.266  0.961  0.052  0m 57s  1.26  0.975  0.042  2m 19s  1.303  0.982  0.037 
Math  1m 19s  1.193  0.942  0.05  0m 37s  1.163  0.929  0.053  1m 56s  1.215  0.961  0.044 
Synthetic  
BlueTiles  0m 28s  1.141  0.792  0.256  0m 23s  1.141  0.816  0.204  0m 52s  1.161  0.871  0.182 
BrickWall  0m 23s  1.094  0.667  0.144  0m 24s  1.098  0.69  0.142  0m 47s  1.1  0.703  0.141 
Vision  29s  1.181  0.938  0.09  23s  1.162  0.916  0.113  52s  1.211  0.972  0.066 
HandWritten  0m 37s  1.123  0.878  0.081  0m 23s  1.131  0.907  0.077  1m 0s  1.156  0.938  0.075 
Real2  
Middle  0m 13s  1.192  0.838  0.139  0m 7s  1.211  0.85  0.165  0m 20s  1.23  0.914  0.101 
Small  0m 9s  1.169  0.763  0.164  0m 6s  1.182  0.772  0.206  0m 16s  1.195  0.849  0.133 
Tiny  0m 11s  1.166  0.661  0.201  0m 7s  1.176  0.698  0.263  0m 19s  1.186  0.745  0.19 
SBRRPCA[18]  LWB[28]  DL[15]  
Dataset  Time  NMI  SSIM  RMSE  Time  NMI  SSIM  RMSE  Time  NMI  SSIM  RMSE 
Real1  
Cartoon  3h 2m  1.173  0.843  0.111  0h 54m  1.152  0.836  0.095  3s  1.203  0.803  0.162 
Checker  4h 9m  1.158  0.791  0.239  1h 37m  1.105  0.66  0.322  3s  1.129  0.544  0.384 
Dices  3h 58m  1.1  0.758  0.17  1h 26m  1.086  0.783  0.126  3s  1.085  0.637  0.242 
Bricks  3h 43m  1.128  0.686  0.192  1h 24m  1.118  0.673  0.225  3s  1.058  0.49  0.422 
Elephant  3h 7m  1.075  0.516  0.257  0h 59m  1.068  0.584  0.204  3s  1.075  0.378  0.347 
Eye  4h 4m  1.179  0.913  0.104  1h 22m  1.155  0.903  0.089  3s  1.141  0.804  0.191 
Math  4h 34m  1.1  0.841  0.102  3h 0m  1.067  0.766  0.1  3s  1.073  0.678  0.139 
Synthetic  
BlueTiles  2h 5m  1.142  0.763  0.372  0h 55m  1.104  0.72  0.204  3s  1.091  0.365  1.067 
BrickWall  2h 30m  1.093  0.666  0.158  1h 0m  1.066  0.481  0.19  3s  1.079  0.479  0.218 
Vision  3h 4  1.115  0.739  0.216  36m  1.021  0.446  0.266  3s  1.095  0.599  0.215 
HandWritten  0h 0m  1.112  0.851  0.12  0h 52m  1.073  0.678  0.147  3s  1.074  0.546  0.177 
Real2  
Middle  1h 28m  1.189  0.782  0.204  0h 54m  1.163  0.761  0.194  3s  1.122  0.512  0.307 
Small  1h 21m  1.153  0.741  0.181  0h 33m  1.151  0.688  0.198  3s  1.114  0.418  0.323 
Tiny  1h 6m  1.161  0.657  0.395  0h 34m  1.167  0.654  0.238  3s  1.144  0.492  0.306 
Additionally, we observed that SBRRPCA, DL and LWB did not preserve the image structure as well as our method (see gridlines in ‘Elephant’, the words ‘Fluctuation’ or ‘Distortion’ in ‘Middle’ and alphabets E, T, Z, large D in ‘Eye’ in Fig. 3. We do believe the DL method [15] may yield improved results if their network were trained to restore multiple frames together, as opposed to single frames individually (as done by their current algorithm)  which ignores the temporal aspect leading to loss in performance. All in all, our results show that exploiting spatiotemporal properties of water waves for this task is indeed useful.
Computational Time: The compute time for all methods (measured on a 2.6GHz Intel Xeon machine with 32GB RAM) are presented in Table 1. The DL method is the fastest, whereas our methods are much faster than SBRRPCA and LWB. However for CS, the YALL1 solver uses GPU support, which is unavailable in the authors’ code for SBRPCA and LWB. We note that although crossvalidation is an excellent way to pick the parameter in Eqn. 2, we found that the optimal choice of this parameter did not change much across datasets. Also small changes in did not affect the performance much. Hence the time for crossvalidation is not included in Table 1.
Verification of Fourier Sparsity: Here, we demonstrate the sparsity of the MVFs from real underwater sequences. This is shown in Fig. 1 for the ‘Elephant’ sequence (similar plots can be generated for all other sequences). We note that the actual MVF can only be estimated. However, we contend that the MVF estimated by our CS method is a good approximation to the actual one. This is evident when comparing the quality of the estimated mean image with the original undistorted image (gathered in still water). For further quantification, we also tracked the same points (from the distorted video ) that were used for the CS algorithm in Sec. 2.3.3, in the restored video () produced by the CS step. This gave us new DTs . We computed a measure of the motion reduction given as . We note that in most cases, we achieve more than motion reduction by the CS step  see Table 2. We have included a few videos in the supplemental material for the visual comparison of the estimated MVF w.r.t. the ground truth MVF.
Dataset  

Cartoon (Real1)  1029  94.11%  7.42 
Checker (Real1)  3149  85.25%  8.5 
Dices (Real1)  2230  91.9%  7.75 
Bricks (Real1)  1300  87.38%  7.42 
Elephant (Real1)  3670  97.7%  7.34 
Eye (Real1)  1647  81.66%  7.84 
Math (Real1)  2309  96.12%  5.64 
BlueTiles (Synth.)  2192  94.77%  5.71 
BrickWall (Synth.)  3134  94.57%  8.68 
Vision (Synth.)  5266  93.49%  6.77 
HandWritten (Synth.)  3789  95.82%  4.33 
Middle (Real2)  785  96.34%  5.65 
Small (Real2)  993  97.26%  4.22 
Tiny (Real2)  155  87.84%  5.03 
Effect of number of frames :
In the absence of attenuation, a large helps improve the performance of our algorithm, due to better Fourier sparsity and better approximation of the COT. In practice, we observed on real datasets that just 100 frames were sufficient to yield good reconstruction results. Further increase in had an insignificant impact on the result quality. A graph showing the effect of on reconstruction from real sequences is shown in Fig. 2.
Effect of number of tracked points : The number and accuracy of DTs affects the performance of the CS algorithm. The number of DTs varied across datasets, depending on the number of available salient feature points, but was always less than . The slightly lower performance of CS on the ‘Tiny’ sequence for example (see Table 1) is due to the small number of available salient points, less than  see Table 2. A graph showing the positive impact of increase in the number of good quality tracks (upto a point, beyond which the performance saturates) is shown in Fig. 2. We note, that we have ensured good quality of the trajectories for further stages of our algorithms, as mentioned in Sec. 2.3.1. We considered global sparsity in this work, as opposed to sparsity of small spatial or spatiotemporal patches. Since, many patches without any salient points could exist.
4 Conclusion
We have presented two methods for correction of refractive deformations due to a wavy water surface, one based on a novel application of CS for interpolating MVFs starting from a small set of salient PTs (and their DTs), and the other based on polynomial image expansions. In both cases, we obtain results superior to the state of the art at low computational cost. Avenues for future work include (1) extending the CS algorithm to handle moving objects; (2) studying the effect of depth variation, perspective projection or wave attenuation on the results of our algorithms; and (3) exploring MVF sparsity in other bases instead of the DFT.
Acknowledgements: The authors wish to thank the Qualcomm Innovation Fellowship Program (India) for supporting this work, NVIDIA Corp. for donation of a Titan Xp GPU, and Prof. Manasa Ranjan Behera of the Civil Engineering Department at IITB, for the waveflume facility to acquire real data.
The source code, datasets and supplemental material can be accessed at [11], [19].
References
 [1] (2013) Detecting motion through dynamic refraction. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), pp. 245–251. Cited by: §1.
 [2] (2008) SURF: speeded up robust features. Computer Vision and Image Understanding 110 (3), pp. 346–359. Cited by: §2.3.1.

[3]
(2011)
Robust principal component analysis?
. J. ACM 58 (3), pp. 11:1–11:37. Cited by: §1, §3.3.  [4] (2006) Robust uncertainty prin ciples: exact signal reconstruction from highly incomplete. IEEE Trans. on Information Theory. Cited by: §2.3.3.
 [5] (2007) Improved reconstruction of images distorted by water waves. In Advances in Computer Graphics and Computer Vision, Cited by: §1.
 [6] (2014) Nonlocal sparse and lowrank regularization for optical flow estimation. IEEE Transactions on Image Processing 23 (10). Cited by: §2.3.4.
 [7] (2015) FlowNet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766. Cited by: §3.2.
 [8] (2004) Seeing through water. In NIPS, pp. 393–400. Cited by: §1.
 [9] (2003) Twoframe motion estimation based on polynomial expansion. In Image Analysis, pp. 363–370. Cited by: §1, §2.3.5.
 [10] (1982) A fourier method for solving nonlinear waterwave problems: application to solitarywave interactions. Journal of Fluid Mechanics 118, pp. 411–443. Cited by: §2.2.
 [11] Github Repository. Note: https://github.com/jeringeo/CompressiveFlows Cited by: §4.
 [12] (201704) Correction of geometrically distorted underwater images using shift map analysis. J. Opt. Soc. Am. A 34 (4), pp. 666–673. Cited by: §1, §3.2.
 [13] (2011) Optical flow estimation using learned sparse model. In CVPR, Cited by: §2.3.4.
 [14] (2011) BRISK: binary robust invariant scalable keypoints. In ICCV, Cited by: §2.3.1.
 [15] (2018) Learning to see through turbulent water. In WACV, pp. 512–520. Cited by: §1, Figure 3, §3.2, §3.3, Table 1.
 [16] MATLAB implementation for KLT tracker. Note: https://in.mathworks.com/help/vision/ref/vision.pointtrackersystemobject.html Cited by: §2.3.1.
 [17] (1992) Surface shape reconstruction of a nonrigid transport object using refraction and motion. IEEE Trans. Pattern Anal. Mach. Intell. 14 (10), pp. 1045–1052. Cited by: §1, §2.1, §2.3.2, §3.1.
 [18] (2011) A twostage reconstruction approach for seeing through water. In CVPR, pp. 1153–1160. Cited by: §1, §2.2, Figure 3, §3.2, Table 1.
 [19] Project Page. Note: https://www.cse.iitb.ac.in/~ajitvr/publications.html Cited by: §4.
 [20] (2018) Simultaneous 3d reconstruction for water surface and underwater scene. In ECCV, pp. 776–792. Cited by: §1.
 [21] (2015) EpicFlow: edgepreserving interpolation of correspondences for optical flow. In CVPR, Cited by: §2.3.4, §3.2.
 [22] (1995) Utilization of fourier decomposition for analyzing timeperiodic flows. Computers and Fluids 24 (4), pp. 349 – 368. Cited by: §2.2.

[23]
(2010)
Faster and better: a machine learning approach to corner detection
. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 105–119. Cited by: §2.3.1.  [24] (2015) Deskewing of underwater images.. IEEE Trans. Image Processing 24 (3), pp. 1046–1059. Cited by: §1, §2.1, §3.2.
 [25] (2010) Sparsity model for robust optical flow estimation at motion discontinuities. In CVPR, Cited by: §2.3.4.
 [26] (2015) Discriminative learning of deep convolutional feature point descriptors. In ICCV, pp. 118–126. Cited by: §2.3.1.
 [27] (2018) PWCNet: CNNs for optical flow using pyramid, warping, and cost volume. In CVPR, Cited by: §3.2.
 [28] (2009) Seeing through water: image restoration using modelbased tracking. In ICCV, pp. 2303–2310. Cited by: §1, §2.1, §2.2, Figure 3, §3.1, §3.2, Table 1.
 [29] (2012) Globally optimal estimation of nonrigid image distortion. International Journal of Computer Vision 98 (3), pp. 279–302. Cited by: §1.
 [30] (2015) Theory and practice of hierarchical datadriven descent for optimal deformation estimation. International Journal of Computer Vision 115 (1), pp. 44–67. Cited by: §1.
 [31] (2013) On observing underwater objects through a wavy water surface: a new algorithm for image correction and laboratory experiment. Izvestiya Atmosph. Ocean. Phys. 49 (3), pp. 339–345. Cited by: §1, §1.
 [32] (2009) Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine 26 (1), pp. 98–117. Cited by: §3.2.
 [33] (2009) Compressed sensing with cross validation. IEEE Trans. Inf. Theor. 55 (12), pp. 5773–5782. Cited by: §2.3.3.
 [34] (2009) Bicoherence: a new lucky region technique in anisoplanatic image restoration. Appl. Opt. 48 (32), pp. 6111–6119. Cited by: §1.
 [35] (2010) Bispectral analysis and recovery of images distorted by a moving water surface. Appl. Opt. 49 (33), pp. 6376–6384. Cited by: §1, §2.3.4.
 [36] (2007) Realtime PDEConstrained optimization. Cited by: §2.2.
 [37] YALL1: your algorithms for L1. Note: http://yall1.blogs.rice.edu/ Cited by: §3.2.