Video Reconstruction from a Single Motion Blurred Image using Learned Dynamic Phase Coding

12/28/2021
by   Erez Yosef, et al.
Tel Aviv University

Video reconstruction from a single motion-blurred image is a challenging problem, which can enhance existing cameras' capabilities. Recently, several works addressed this task using conventional imaging and deep learning. Yet, such purely-digital methods are inherently limited, due to direction ambiguity and noise sensitivity. Some works proposed to address these limitations using non-conventional image sensors, however, such sensors are extremely rare and expensive. To circumvent these limitations with simpler means, we propose a hybrid optical-digital method for video reconstruction that requires only simple modifications to existing optical systems. We use a learned dynamic phase-coding in the lens aperture during the image acquisition to encode the motion trajectories, which serve as prior information for the video reconstruction process. The proposed computational camera generates a sharp frame burst of the scene at various frame rates from a single coded motion-blurred image, using an image-to-video convolutional neural network. We present advantages and improved performance compared to existing methods, using both simulations and a real-world camera prototype.


1 Introduction

Modern cameras must satisfy two conflicting requirements: providing excellent imaging performance while reducing the space and weight of the system. To address this inherent contradiction, novel design methods attempt to harness fundamental imaging limitations and leverage them as a design advantage. One such example is motion blur, a known limitation in photography of dynamic scenes. It is caused by object movement during the exposure, whose duration is set according to the lighting conditions and noise requirements. As most scenes are dynamic, light from moving objects is accumulated by the sensor over several consecutive pixels along their trajectory, resulting in image blur. Although blur is an undesirable effect, it can be exploited for video generation from a single image.

In contrast to motion deblurring methods that aim at reconstructing a single sharp image, video generation exploits this 'artifact' to reconstruct a sharp burst of frames representing the scene at different times during the acquisition. Yet, since the signal averaging of the acquisition process eliminates the motion direction in the captured image, this task is highly ill-posed. The pioneering work of Jin et al. [Jin_2018_CVPR] suggests a pairwise frame-order-invariant loss to mitigate this ambiguity. Nevertheless, since the global motion direction is lost during acquisition, the processing stage can only assume a motion direction for the video reconstruction; it cannot truly resolve the ambiguity.

To overcome this deficiency, some works suggested capturing multiple frames with different exposures during the acquisition process [Rengarajan2020PhotosequencingOM], or alternatively replacing the sensor with coded two-bucket [Anupama, shedligeri2020unified] or event measurements [Pan_2019_CVPR]. Yet, these solutions either do not fit a standard optical system or require capturing multiple images.

(a) Coded-blurred image
(b) Motion-color cues
(c) Reconstructed video
Figure 1: Method demonstration. (a) A flower moving left, captured using our dynamic phase-coded camera, which embeds (b) color-motion cues in the intermediate image. These cues guide our image-to-video reconstruction CNN, resulting in (c) a sharp video of the scene (play the video by clicking (c) in Adobe Reader).
Figure 2: Overview of our suggested method. Acquiring a dynamic scene with our dynamically phase-coded camera provides an intermediate image that contains scene-dynamics cues in its coded motion blur. We reconstruct sharp video frames of the scene at desired timesteps from the single coded-blurred image using a time-dependent CNN. The optical coding parameters are jointly optimized with the reconstruction network weights using end-to-end learning.

Contribution. To overcome the limitations of conventional cameras in dynamic scene acquisition, we suggest a computational coded-imaging approach (see Figs. 1 and 2) that can be easily integrated into many conventional cameras (equipped with a focusing mechanism) by simply adding a phase-mask to their lens. The joint operation of the phase-mask and a focus variation during exposure generates a dynamic phase coding, which encodes scene motion information in the intermediate image as chromatic cues. The cues are generated by the nature of our solution's PSF (plotted in Fig. 4), which encodes the beginning of the movement in blue and the end in red; e.g., see the zoomed left and right edges of the moving flower in Fig. 1(b), enhanced for visualization. These cues serve as guidance for generating a video of the scene motion by post-processing the captured coded image (see Fig. 1(c)).

Our method is capable of generating a sharp frame at any user-controlled time within the exposure interval. Therefore, a video burst at any user-desired frame rate can be produced from a single coded image. The proposed coding and reconstruction approach is based on a learnable imaging layer and a Convolutional Neural Network (CNN), which are jointly optimized end-to-end; the learnable imaging layer simulates the physical image acquisition process by applying the coded spatiotemporal point spread function (PSF), and the CNN reconstructs the sharp frames from the coded image.

The main contributions of our method are:

  • A simple coding method, based on a conventional sensor and a lens with a focusing mechanism, equipped with a simple add-on optical phase-mask.

  • A flexible video-from-a-single-image reconstruction method, enabling a modular reconstruction process that is parametrically controlled to produce video at any desired frame rate.

  • An end-to-end optimization framework of the optical and digital processing parameters for dynamic scene acquisition.

  • Improved video-from-motion-blur reconstruction, with unambiguous directionality, higher accuracy, and lower noise sensitivity, validated in both simulation and real-world experiments.

2 Related Work

Given a motion-blurred image, various methods have attempted to reconstruct a sharp image of the scene from it. Some techniques were developed for conventional imaging, where the reconstruction is purely computational and, in recent years, usually based on training a neural network for the task [Zhang2018_Scene_Deblurring, tao2018srndeblur, Kupyn_2018_CVPR, Nah_2017_CVPR]. Other holistic design approaches utilize a computational imaging strategy to encode motion information in an intermediate image and recover the sharp image using corresponding post-processing methods [Raskar2006CodedEP, Levin_motionInvariant, Levin_orthParab, CAVE_0040, Srinivasan_2017_CVPR, R2017_ICCV].

Method | Acquisition | Input | Output size
Computational only [Jin_2018_CVPR, Purohit2019BringingAB, Zhang2020EveryMM] | conventional | image | fixed
Multiple exposures [Rengarajan2020PhotosequencingOM] | short-long-short exposures | 3 images | dynamic
Coded two-bucket [Anupama, shedligeri2020unified] | C2B sensor | two coded images | fixed
Event camera [Pan_2019_CVPR] | event sensor | 50-100 events vector | dynamic
Proposed method | dynamic phase coding | coded image | dynamic
Table 1: Overview of existing solutions for video reconstruction from a motion-blurred scene.

The problem of video reconstruction from a single image takes motion deblurring a step forward by attempting to reconstruct a frame burst of the dynamic scene that resulted in the blurred image, and not only the central sharp frame (see Tab. 1 for an overview). Some works are based on images taken using a conventional camera, and apply processing-only methods to obtain a frame burst of the scene. However, without optical coding, this problem is highly ill-posed as even if the edges and textures are reconstructed perfectly, various motion permutations can generate the same motion blurred image (e.g., see Fig. 2 in [Jin_2018_CVPR]). Thus, coded imaging approaches were proposed to acquire additional information about the scene dynamics and achieve higher quality results.

Conventional imaging based methods. Generating a video sequence from a single motion-blurred image is a challenging task: since the temporal order of the reconstructed frames is ambiguous, the problem is highly ill-posed. Jin et al. [Jin_2018_CVPR] address the temporal order ambiguity and present a pioneering approach for this task using several reconstruction networks and a novel pairwise frame-order-invariant loss. Their method iteratively generates seven sequential frames of the scene, starting from the central frame and proceeding to the edge frames of the dynamic scene using the preceding reconstruction results. The method's architecture limits the reconstruction to only seven frames within the exposure interval, and it uses three different trained models for the reconstruction process. Purohit et al. [Purohit2019BringingAB] present a solution for video reconstruction using motion representations of the scene learned by a recurrent video autoencoder network. Zhang et al. [Zhang2020EveryMM] suggested a detail-aware network based on a cascaded generator. All of these methods suffer from the inherent motion direction ambiguity, and their reconstruction performance is more sensitive to noise (as discussed in [Cossairt_compImagAnlz, OSA_compImgRev] and empirically shown in Sec. 4).

Coded imaging based methods. To handle the inherent limitations of conventional imaging, some works adopted computational photography methods for image deblurring and video frame recovery. Raskar et al. [Raskar2006CodedEP] introduced an amplitude coded-exposure technique using a fluttered shutter for motion deblurring. This method performs temporal binary amplitude coding, resulting in a wider frequency response, which is utilized for improved motion deblurring results. Levin et al. [Levin_motionInvariant, Levin_orthParab] presented a parabolic-motion camera with a motion-invariant PSF utilized for non-blind motion deblurring. Both of these approaches are limited to the reconstruction of a single image. Dynamic phase coding in the lens aperture for motion coding was presented by Elmalem et al. [Shay2020phase] for motion deblurring. This coding embeds motion cues in the intermediate image for improved deblurring performance. For video restoration from a single coded-blurred image, several approaches have been presented, such as using an event camera [Pan_2019_CVPR] or a coded two-bucket (C2B) sensor [Anupama, shedligeri2020unified], which both require a non-conventional sensor, or lensless imaging with the rolling shutter effect [Antipa2019VideoFS], which omits the lens and therefore changes the entire imaging concept even for static scenes. For simplicity, we adopt the coding method of Elmalem et al. [Shay2020phase], which is based on a commercial sensor and lens (with a focusing mechanism), equipped with a simple add-on optical element, and which allows unambiguous motion-cue encoding.

A closely related problem is the reconstruction of a sharp, high-frame-rate video from a motion-blurred, low-frame-rate video, using either processing of conventional camera videos [Jin_2019_CVPR, Rengarajan2020PhotosequencingOM] or computational imaging methods [FlutterShutterVideo, 6552198, Llull_Coded_aperture]. These methods require a video input (which enables resolving the direction ambiguity) and are not applicable to a single-image input.

Deep optics. As the end-to-end backpropagation-based optimization of deep models has proved very efficient for various tasks, its power has also been harnessed for optical design, either for a standalone optical system design process or jointly with a post-processing algorithm (for recent reviews on this topic see [compImagDL_revOsa, OpticsDL_revNature]). Specifically for enhanced optical imaging applications, this scheme has been presented for extended depth of field [EDOF_DL, Gordon_EDOF, Ugur_EDOF], depth estimation [Depth_2018, Yicheng_Depth, Gordon_Depth], high dynamic range [Gordon_HDR, Heide_HDR], and several microscopy applications [Waller_Illum, Waller_MiniscopeS, Shechtmann_MultiChan, Shechtman_rev], to name a few. Yet, it has not been considered for the problem of video from blur.

3 Method

As our goal is to reconstruct video frames from a motion blurred image of the scene, we engineer the camera’s PSF to encode cues in the motion blur of dynamic objects. The coded PSF is achieved using a spatiotemporal dynamic phase coding in the lens aperture, which results in motion-coded blur. The coded blur serves as prior information for the image-to-frames CNN, trained to generate sharp video frames from the coded image. Utilizing the end-to-end optimization ability, the optical coding process is modeled as a layer in the model, and its physical parameters are optimized along with the conventional CNN layers in a supervised manner. The learned optical coding is then implemented in a prototype camera, and images taken using it are processed using the digital processing layers of the CNN.

3.1 Camera Dynamic Phase Coding

Moving objects in a scene during exposure result in motion blur, as the light from a moving object is integrated in different pixels along the motion trajectory. In addition, both static and dynamic objects are blurred by the lens PSF, which is never perfect (due to aberrations, diffraction, etc.). This imaging process is formulated in Eq. 1: the two-dimensional PSF is spatially convolved with the instantaneous scene at every time $t$ and integrated during the exposure,

$$I_B = \int_{0}^{T_e} PSF * I_S(t)\, dt, \qquad (1)$$

where $I_B$ is the acquired blurred image (all images mentioned are in the linear regime, i.e. signal space, before any non-linear transformations such as gamma correction), $T_e$ is the exposure time, $I_S(t)$ and $PSF$ denote the instantaneous sharp scene and the PSF respectively, and $*$ denotes the spatial convolution operator (the spatial coordinates are omitted for ease of notation).
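For intuition, Eq. 1 can be approximated discretely by convolving densely sampled sharp frames with the lens PSF and averaging over the exposure. The sketch below is a minimal illustration of that approximation; the function and variable names are ours and not from the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def simulate_motion_blur(sharp_frames, psf):
    """Discrete approximation of Eq. 1: each instantaneous sharp frame
    (H, W, 3), in linear signal space, is convolved with the static lens
    PSF and the results are averaged over the exposure."""
    acc = np.zeros_like(sharp_frames[0], dtype=np.float64)
    for frame in sharp_frames:
        for c in range(frame.shape[-1]):  # spatial convolution per color channel
            acc[..., c] += convolve2d(frame[..., c], psf, mode="same", boundary="symm")
    return acc / len(sharp_frames)  # temporal averaging discards the motion direction
```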

The averaging nature of image sensors results in the loss of the motion direction, which introduces inherent ambiguity. Also, as every object moves independently from others, general motion blur is shift-variant. Thus, video reconstruction from undirected motion blur is a highly ill-posed task.

To address both issues, we implement a coded lens designed to embed motion cues in the acquired image, and the prior knowledge of the camera's time-variant behavior serves as guidance for the reconstruction of the video burst. We adopt dynamic phase coding in the lens aperture, similar to the motion deblurring method presented by Elmalem et al. [Shay2020phase], which is based on a spatiotemporally coded PSF that encodes motion information in the intermediate image. Such a PSF is generated using a conventional camera equipped with a simple add-on phase-mask; the temporal coding is achieved by the joint operation of the static phase-mask, designed to introduce color-focus cues, and a dynamic focus sweep performed during exposure (using a simple focusing mechanism). The phase-mask (originally designed for depth estimation [Depth_2018] and extended depth of field imaging [EDOF_DL]) introduces a predesigned chromatic aberration to the lens, generating a controlled dependence between the defocus condition and the color distribution of the PSF. To obtain a time-varying PSF, the defocus condition (denoted as $\psi$) is varied during exposure, yielding a temporally coded PSF (denoted as $PSF(\psi(t))$). The instantaneous scene is spatially convolved with the corresponding PSF, resulting in the motion-coded image:

$$I_C = \int_{0}^{T_e} PSF(\psi(t)) * I_S(t)\, dt. \qquad (2)$$

Using the proposed spatiotemporally coded imaging scheme, the dynamics of the scene are encoded in the intermediate image acquired by the camera. Moving objects are smeared in the image with color cues along their trajectories, based on the spatiotemporal PSF $PSF(\psi(t))$. The acquired coded image $I_C$ is then fed to the reconstruction network, which is trained to decode these cues as guidance for improved video reconstruction. Fig. 2 presents these steps visually.

To achieve optimal motion cues encoding in the intermediate image, the imaging process is modeled as a learnable layer (with corresponding forward and backward models), and the focus sweep parameters are optimized in the end-to-end training process, along with the CNN layers.
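A minimal PyTorch-style sketch of such a learnable imaging layer is given below, under the assumption of a differentiable optical model `psf_from_defocus` that maps a defocus value to a per-channel PSF; that interface, the defocus range, the number of sub-exposures, and all names are illustrative placeholders rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodedImagingLayer(nn.Module):
    """Sketch of a learnable imaging layer implementing a discrete version of
    Eq. 2: the focus-sweep values psi(t) are trainable parameters, so they can
    be optimized end-to-end together with the reconstruction CNN."""

    def __init__(self, psf_from_defocus, num_steps=49, psi_range=(-4.0, 4.0)):
        super().__init__()
        # learnable focus sweep, initialized linearly (range is a placeholder)
        self.psi = nn.Parameter(torch.linspace(psi_range[0], psi_range[1], num_steps))
        self.psf_from_defocus = psf_from_defocus  # assumed differentiable, returns (3, k, k)

    def forward(self, sharp_frames):
        # sharp_frames: (B, T, 3, H, W) sub-exposure frames in linear signal space
        coded = 0.0
        for i in range(sharp_frames.shape[1]):
            psf = self.psf_from_defocus(self.psi[i])       # (3, k, k), k assumed odd
            weight = psf.unsqueeze(1)                      # (3, 1, k, k) depthwise kernels
            frame = sharp_frames[:, i]                     # (B, 3, H, W)
            coded = coded + F.conv2d(frame, weight, padding=psf.shape[-1] // 2, groups=3)
        return coded / sharp_frames.shape[1]               # temporal integration (Eq. 2)
```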

Figure 3: Network architecture. Our CNN is based on the UNet [Unet2015] model, with the coded blurred image and a time parameter as inputs and the sharp reconstructed frame at the output (see Eq. 3). The decoder part is controlled by the time parameter (using AdaIN [Huang_2017_ICCV]), to set the relative time of the reconstructed frame.

3.2 Reconstruction network

Our proposed model for video frame reconstruction from a coded motion-blurred image is based on a single time-dependent convolutional neural network (CNN) with the AdaIN mechanism [Huang_2017_ICCV]. The model inputs are the coded-blurred intermediate image $I_C$ and a normalized time parameter $t$. The time parameter controls the relative time of the generated sharp frame within the normalized exposure interval. The output of the model is the estimated sharp scene frame at time $t$, denoted as $\hat{I}_S(t)$. Hence, the architecture is designed to reconstruct the scene at any desired instant in the exposure interval, and thus to create a video at any desired frame rate. We denote it by

$$\hat{I}_S(t) = F_{CNN}(I_C, t). \qquad (3)$$
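As a hypothetical usage example (the model and tensor names are placeholders, and we assume the exposure interval is normalized to [0, 1]), a burst at any frame rate is obtained by simply sweeping $t$ over the exposure interval:

```python
import torch

num_frames = 25                                    # any desired burst length
timesteps = torch.linspace(0.0, 1.0, num_frames)   # normalized exposure interval (assumed [0, 1])
with torch.no_grad():
    # model implements Eq. 3; coded_img is a single coded-blurred capture
    burst = [model(coded_img, t.view(1, 1)) for t in timesteps]
video = torch.stack(burst, dim=1)                  # (B, num_frames, 3, H, W)
```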

Our reconstruction CNN (presented in Fig. 3) is based on the UNet architecture [Unet2015], consisting of a four-level encoder-decoder structure with skip connections between the encoder and the decoder at each level. The double-convolution blocks of the original UNet architecture are improved by adding skip-connections, turning them into dense blocks [Huang_2017_CVPR]. The output of the last layer is added to the input image, such that the network learns only the residual correction required to reconstruct the desired frame.
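A minimal sketch of such a densely connected double-convolution block is shown below; the exact layer sizes, activation, and naming are our assumptions and not taken from the paper.

```python
import torch
import torch.nn as nn

class DenseDoubleConv(nn.Module):
    """Double-convolution UNet block with a dense skip-connection: the second
    convolution sees both the block input and the first convolution's output."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels, out_channels, 3, padding=1), nn.LeakyReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(in_channels + out_channels, out_channels, 3, padding=1), nn.LeakyReLU())

    def forward(self, x):
        y1 = self.conv1(x)
        y2 = self.conv2(torch.cat([x, y1], dim=1))  # dense connection
        return y2
```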

The time parameter is used to reconstruct the frame corresponding to the desired normalized time in the exposure interval. It controls the network both through the AdaIN mechanism [Huang_2017_ICCV] and by being concatenated to the input as an additional channel. To bridge between the shift-invariant convolutional operations of the CNN and the shift-variant (and scene-dependent) motion blur of our target application, we leverage positional encoding to add image-position dependency to the model. We provide additional details on the architecture and these changes in the following.

Positional encoding. Assuming a general scene in which every object may move with a different direction and velocity, an intermediate image captured using our proposed coded lens will contain a shift-variant blur kernel, which is a composition of the color-temporal PSF coding and the spatial movement of the objects. Since convolutions are shift-invariant, we add a position dependency to the model, such that it can utilize the local information of the coding in the surrounding area that relates to the same object, with the same motion characteristics and blurring profile. We adopt Fourier features to obtain a better representation of the position coordinates [tancik2020fourfeat]. Similar to Metzer et al. [Metzer2021Z2PIR], we add a positional dependency to the model by concatenating the Fourier features of the pixel coordinates as additional channels to the input. Five log-linearly spaced frequencies $f_i$ were sampled in the range [1,20], generating 20 positional features in total for each pixel of the image. Each frequency contributes the following four positional features for the normalized pixel coordinates $(x, y) \in [0,1]^2$:

$$\big[\sin(2\pi f_i x),\ \cos(2\pi f_i x),\ \sin(2\pi f_i y),\ \cos(2\pi f_i y)\big]. \qquad (4)$$
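A small sketch of this positional encoding is given below; the $2\pi$ scaling inside the sinusoids is our assumption, while the log-linear frequency range and the 20-channel output follow the description above.

```python
import numpy as np

def positional_features(height, width, num_freqs=5):
    """Fourier-feature positional encoding (Eq. 4): five log-linearly spaced
    frequencies in [1, 20]; each frequency contributes four features per pixel,
    giving a (20, H, W) tensor concatenated to the CNN input."""
    freqs = np.geomspace(1.0, 20.0, num_freqs)
    y, x = np.meshgrid(np.linspace(0.0, 1.0, height),
                       np.linspace(0.0, 1.0, width), indexing="ij")
    feats = []
    for f in freqs:
        feats += [np.sin(2 * np.pi * f * x), np.cos(2 * np.pi * f * x),
                  np.sin(2 * np.pi * f * y), np.cos(2 * np.pi * f * y)]
    return np.stack(feats, axis=0)
```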

Time encoding. To achieve a time-dependent CNN, the batch normalization layers in the UNet architecture are replaced with AdaIN layers [Huang_2017_ICCV] controlled by a normalized time parameter. The exposure time interval is normalized such that the middle of the range corresponds to the middle of the exposure time. The time parameter $t$ is mapped to a higher-dimensional vector $w$ using an MLP consisting of two sequential blocks of a linear layer followed by a leaky-ReLU activation. The encoded time-representation vector $w$ is shared across all AdaIN layers and controls the mean and standard deviation of the features in each AdaIN layer.

In each AdaIN layer with an input $x$ of feature channels, the target mean $\mu_t$ and standard deviation $\sigma_t$ are obtained from $w$ by a designated MLP mapping network with two layers of the same structure mentioned above. The AdaIN transformation (Eq. 5) is performed along the feature dimension, where $\mu(x)$ and $\sigma(x)$ are computed across the spatial dimensions (instance normalization):

$$\mathrm{AdaIN}(x, t) = \sigma_t(w)\,\frac{x - \mu(x)}{\sigma(x)} + \mu_t(w). \qquad (5)$$
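A compact sketch of such a time-controlled AdaIN layer is given below; the per-layer MLP size and all names are our assumptions.

```python
import torch
import torch.nn as nn

class TimeAdaIN(nn.Module):
    """AdaIN layer controlled by a shared time embedding w (Eq. 5): instance
    statistics of the features are replaced by a per-channel mean and std
    predicted from w by a small mapping MLP."""

    def __init__(self, num_channels, time_dim):
        super().__init__()
        self.to_stats = nn.Sequential(
            nn.Linear(time_dim, time_dim), nn.LeakyReLU(),
            nn.Linear(time_dim, 2 * num_channels),
        )

    def forward(self, x, w):
        mu = x.mean(dim=(2, 3), keepdim=True)             # instance statistics over space
        sigma = x.std(dim=(2, 3), keepdim=True) + 1e-5
        mu_t, sigma_t = self.to_stats(w).chunk(2, dim=1)  # time-dependent target stats
        mu_t = mu_t[..., None, None]
        sigma_t = sigma_t[..., None, None]
        return sigma_t * (x - mu) / sigma + mu_t
```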

As our scheme is designed to utilize the optically encoded motion cues to generate a sharp frame at a relative time $t$, the encoder part of the UNet is generic, and we apply the temporally controlled AdaIN only in the decoder part of the architecture (as in Fig. 3). We set the encoder part of the UNet to be time-independent by performing instance normalization followed by a learnable affine transformation instead of the AdaIN blocks. In this setting, the encoder is optimized to encode more general information about the image and scene dynamics, regardless of the normalized time parameter. The generic encoder and time-specific decoder design enables the network to converge better. Note though that we concatenate the time parameter to the input channels, which contain the input image and the positional encoding features; this improves reconstruction performance, as shown in the ablation in Sec. 4.3.

Dataset. To train our network and evaluate its performance quantitatively, we used the REDS dataset [Nah_2019_CVPR_Workshops], consisting of scenes captured at 240 frames per second (FPS). To achieve a smoother motion-blur simulation, we used frame interpolation with the DAIN method [DAIN] (similarly to the process presented in [Nah_2019_CVPR_Workshops]) to obtain video frames at 1920 FPS. To simulate the acquisition of a dynamic scene by our coded camera, the spatiotemporal PSF was applied to 49 consecutive frames, which were then averaged along the time axis as in Eq. 2. For the performance comparison with Jin et al. [Jin_2018_CVPR], conventional camera images were simulated by temporal averaging only of the same input frames (i.e., without the coded PSF). Due to the applied frame interpolation, not all of the 49 frames are true images; therefore only the seven real (non-interpolated) frames are used as our GT images for the training/validation/test metrics. For improved generalization, we add additive white Gaussian noise (AWGN) to the simulated blurred images in the signal space, which partially simulates the imaging process noise and improves the robustness of our model and its generalization to the camera prototype (different noise levels were set according to the application, as discussed in Sec. 4).

Loss Functions. We use a linear combination of three losses for training: a pixel-wise smooth-L1 loss ($L_{pix}$), a perceptual loss ($L_{per}$) using VGG features [perceptual2016], and a video-consistency perceptual loss ($L_{vid}$). Thus, our loss is

$$L = \alpha_{1} L_{pix} + \alpha_{2} L_{per} + \alpha_{3} L_{vid}. \qquad (6)$$

The perceptual loss is a known practice for image reconstruction tasks [perceptual2016]. In this loss, we compute the smooth-L1 distance between the VGG [Simonyan2015VeryDC] features of the reconstructed image and the ground truth image.

To improve temporal consistency and perceptual quality across consecutive reconstructed video frames, we developed a video loss using a 3D convolutional network over the video space-time volume. We use 3D-ResNet [Tran2018ACL], a spatiotemporal convolutional network for video action recognition, and compare the network-extracted feature maps of the reconstructed and GT videos.
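A sketch of this loss combination is given below. The VGG feature extractor, the loss weights $\alpha_1..\alpha_3$, and the use of only the stem features of torchvision's r3d_18 (standing in for the 3D-ResNet of [Tran2018ACL]) are our simplifications, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torchvision.models.video import r3d_18

video_backbone = r3d_18(weights="DEFAULT").eval()
for p in video_backbone.parameters():
    p.requires_grad_(False)                       # frozen feature extractor

def video_perceptual_loss(pred_video, gt_video):
    # inputs: (B, 3, T, H, W) reconstructed / ground-truth clips
    pred_feat = video_backbone.stem(pred_video)
    gt_feat = video_backbone.stem(gt_video)
    return F.smooth_l1_loss(pred_feat, gt_feat)

def total_loss(pred, gt, pred_video, gt_video, vgg_features, alphas=(1.0, 1.0, 1.0)):
    """Eq. 6: weighted sum of pixel, VGG-perceptual, and video-consistency losses.
    vgg_features is an assumed callable returning VGG feature maps."""
    l_pix = F.smooth_l1_loss(pred, gt)
    l_per = F.smooth_l1_loss(vgg_features(pred), vgg_features(gt))
    l_vid = video_perceptual_loss(pred_video, gt_video)
    a1, a2, a3 = alphas
    return a1 * l_pix + a2 * l_per + a3 * l_vid
```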

4 Experiments

As experimental validation of our proposed approach, we first train our system (optical coding layer and reconstruction network) and evaluate the results quantitatively (with the optical coding process simulated), comparing the performance to the previous work of Jin et al. [Jin_2018_CVPR]. Following the satisfactory simulation results, we built a prototype camera implementing our spatiotemporal coding and examined our method qualitatively (as pixel-wise GT sharp frame bursts are almost impossible to acquire). Lastly, we present an ablation study of our architecture and design choices. Some of the results are presented below, and additional results are provided in the supplementary material.

Training details. We train our model on a training set of 9,680 scenes for 40 epochs, with a batch size of 72 samples of 128x128x3 patches. We used the Adam optimizer [Kingma2015AdamAM], with the loss weightings defined as in Eq. 6. An additional 2,460 scenes are dedicated to validation/testing, such that the quantitative reconstruction performance (Sec. 4.1) was evaluated on 1,968 scenes dedicated to testing. In the optical coding layer, we define a learnable focus-sweep vector initialized linearly following [Shay2020phase] (as presented in Fig. 4(a)). We optimize the focus sweep parameters of the camera, which define the camera's response in time as presented in Eq. 2. To improve robustness, we apply flip augmentations and add AWGN to the input image (1% as in [Jin_2018_CVPR] for Sec. 4.1, and 3% for Sec. 4.2).

The optimized focus sweep parameters (of the imaging simulation layer) result in the coding demonstrated in Fig. 4(b). In this example, the motion blur of a white dot moving right is simulated with coding based on either a linear (as in [Shay2020phase]) or a learned focus sweep. Compared to the white trace that would have been captured by a conventional camera, the color coding of the motion profile is clearly visible. The learned pattern provides improved coding for video reconstruction, thanks to the end-to-end optimization with the image-to-video CNN. The coding is also validated experimentally on a moving point source (Fig. 4(c)).

(a) linear PSF
(b) simulated learned PSF
(c) experimental PSF
Figure 4: PSF coding. The spatiotemporal PSF coding of (a) the linear focus sweep [Shay2020phase], (b) the learned focus variation (simulation), and (c) the same PSF in the experiment. The PSF visualizations represent the blur of a point light source moving horizontally (left to right) during the exposure time. The joint effect of the phase-mask and the focus variation during exposure results in a different wavelength (color) being in or out of focus as the point moves.

4.1 Simulative experiment

Figure 5: Per-frame performance evaluation. PSNR and SSIM reconstruction performance averaged per frame for a 7-frame burst, for our method and Jin et al. [Jin_2018_CVPR]. Since the motion blur of a conventional camera is undirected, we also evaluate the reversed order of the frames reconstructed by [Jin_2018_CVPR] (compared to the ground truth) for each input scene, and take the higher result for the 'best order' evaluation.
Figure 6: Noise sensitivity analysis. Averaged PSNR results vs. noise level (as percent of the image dynamic range) of our method and Jin et al. [Jin_2018_CVPR] (in both predicted and best order). Our method has better noise robustness, due to the optically embedded cues.

To evaluate the reconstruction results we used a test dataset consisting of simulated motion-blurred images (both conventional and coded). We evaluate our model with respect to the GT sharp scene images using PSNR and the structural similarity index measure (SSIM) [SSIM]. We compare our results to those of Jin et al. [Jin_2018_CVPR], who presented a method for video reconstruction from conventional (uncoded) motion-blurred images. (The comparison is made only to [Jin_2018_CVPR], as other related works did not publish their code for evaluation.)


Figure 7: Reconstruction performance (simulation). (top row) GT image and zoom-in for a 7-frames burst, (middle row) conventional blur and Jin et al. [Jin_2018_CVPR] results, and (bottom row) our coded input and reconstruction results. Our method achieves improved results along the entire burst and also provides a higher frame rate video. Click on the blurred input images (left) to play the result videos.

A visual example of our reconstruction performance is presented in Fig. 7, where improved results along the entire frame burst can be clearly seen. Figure 5 presents the per-frame performance in PSNR and SSIM (for a 7-frame burst, as Jin et al. [Jin_2018_CVPR] is limited to this burst length) averaged over all test scenes. Tab. 2 shows the overall statistics of the evaluated reconstruction metrics. Since the motion direction is lost in conventional motion blur, the frames of [Jin_2018_CVPR] may be reconstructed in reversed order, i.e. in the opposite motion direction. Thus, each reconstructed scene was compared to the GT in both the predicted order and the reverse order, and the higher one (PSNR-wise) was selected for the 'best order' average. Note that in a considerable fraction of the cases higher performance is achieved in the reversed order, which shows that the order ambiguity is prominent. Since the coded blur in our camera is designed to provide direction cues, our method is expected to reconstruct the frames in the correct order; therefore, we do not need to reverse the order for it.
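A minimal sketch of this 'best order' evaluation for the baseline is shown below (the function names and per-frame PSNR averaging are our assumptions):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio as psnr

def best_order_psnr(pred_frames, gt_frames):
    """Compare a reconstructed burst with the GT in both temporal orders and
    keep the higher average PSNR; our coded camera does not need this step,
    since its reconstruction is expected to follow the correct direction."""
    forward = np.mean([psnr(g, p, data_range=1.0) for g, p in zip(gt_frames, pred_frames)])
    backward = np.mean([psnr(g, p, data_range=1.0) for g, p in zip(gt_frames, pred_frames[::-1])])
    return max(forward, backward)
```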

Method | PSNR mean | PSNR std | SSIM mean | SSIM std
Jin et al. [Jin_2018_CVPR] | 22.6 | 3.78 | 0.654 | 0.143
Best order [Jin_2018_CVPR] | 23.85 | 3.45 | 0.69 | 0.128
Ours | 26.08 | 3.01 | 0.737 | 0.081
Table 2: Quantitative comparison. Averaged PSNR/SSIM metrics on the entire test set for our method compared to Jin et al. [Jin_2018_CVPR].

To assess the noise-robustness benefit of the encoded optical cues, a noise sensitivity analysis is carried out by evaluating the reconstruction results of our method vs. the work in [Jin_2018_CVPR] for different noise levels (Fig. 6). Similar to the performance analysis in Fig. 5, the reconstruction performance of [Jin_2018_CVPR] is evaluated both in the predicted order and in the best order. The prominent gap is achieved thanks to the optically encoded motion information, which allows reconstruction with much better noise robustness.

4.2 Prototype Camera Results

To assess our method on real-world scenes, a prototype camera with dynamic phase coding was implemented. The color-focus phase-mask is incorporated in the lens aperture, and the lens defocus is set to vary during exposure following the learned code. The joint operation of the phase-mask and the focus variation temporally manipulates the PSF as presented in Sec. 3.1 (additional details on the prototype camera are provided in the supplementary material). Several dynamic scenes were captured using the prototype camera and processed using our image-to-frames CNN for different values of the time parameter $t$, thus creating short videos of the moving scenes. For comparison, we took motion-blurred images of the same scenes with a conventional camera (i.e., with constant focus and a clear aperture). The results are presented in Figs. 1 and 8. Note how the truck moves and its back wheel rotates (the front wheel is fixed) in Fig. 8. Our method provides sharp results and a higher frame rate video. Note also that Jin et al. reconstruct the motion in the opposite direction.


(a) blurred input - click to play the reconstructed videos
(b) Zoom-ins on 7 frames
Figure 8: Real-world results. (a) Blurred image from (top) a conventional camera and (bottom) our coded camera; click on the blurred images to play the output videos. (b) Zoom-ins on 7 reconstructed frames of (top) Jin et al. [Jin_2018_CVPR] and (bottom) our results. Our method achieves improved results along the entire burst, reconstructs the correct motion direction, and also provides a higher frame rate video.

4.3 Ablation Study

We conduct an ablation study of the proposed method and architecture to evaluate the contribution of each component in our system. Tab. 3 presents the tested configurations. First, we started from a UNet architecture controlled by a time parameter using AdaIN modules as described in Sec. 3.2, where the input is the blurred image only and the output is the sharp frame at the desired relative time in the exposure interval, trained without the video-perceptual loss. Keeping the encoder part of the UNet uncontrolled by the time parameter (using instance normalization instead of AdaIN) enables better reconstruction results (config-a in Tab. 3) compared to the full AdaIN network applied in both encoder and decoder (config-0 in Tab. 3). The following configurations add the image-coordinate positional-encoding features (config-b) and the time parameter concatenated to the input image (config-c). These features improve PSNR while the similarity measure slightly decreases; however, when testing the models on the prototype camera images we noticed better generalization to real-world images with these additions. Adding the video-frames perceptual loss (config-d), we obtain an improvement in both PSNR and SSIM. To quantify the contribution of our optics and computational imaging method, we train our best network (config-d) on uncoded images (i.e., temporal averaging only) and evaluate the results (config-e). Without the phase coding we observe a significant performance degradation, which validates the benefit of the optical coding to the reconstruction ability. Using the learned temporal coding, we gain an improvement in both reconstruction metrics (config-f).

Method | PSNR | SSIM
(0) time-dependent encoder | 21.5 | 0.69
(a) UNet | 25.06 | 0.73
(b) + positional encoding | 25.16 | 0.728
(c) + time concatenation | 25.70 | 0.705
(d) + video perceptual loss | 25.93 | 0.735
(e) (d) w/o phase coding | 22.96 | 0.645
(f) (d) with learned PSF | 26.08 | 0.737
Table 3: Ablation study. To assess the contribution of each feature of our method, we performed a gradual performance evaluation.

Limitations. Despite the improved performance achieved, our method still has several limitations. The most prominent one is object acceleration: since our coding is a composition of the dynamic phase coding and the object movement, there is an implicit assumption that this movement (and specifically its acceleration) is not too extreme. Otherwise, the resulting coded information becomes too obscure, with limited benefit. Another limitation relates to the imaging scenario; since the temporal part of the coding is a focus variation, the underlying assumption in such a design is that the entire scene is in the same focus condition (either in- or out-of-focus). This limits our solution to infinite-conjugate lenses (e.g., GoPro cameras).

5 Conclusion

A spatiotemporally coded camera for video reconstruction from motion blur is proposed and analyzed. Motivated by the ongoing requirement to improve the imaging capabilities of cameras, the motion blur limitation is turned into an advantage: it is used to encode motion cues that allow reconstruction of a frame burst from a single coded image. The coding is performed using a phase-mask and a learnable focus variation, resulting in color-motion cues encoded in the acquired image. This image, along with a relative time parameter $t$, is fed to a CNN trained to reconstruct a sharp frame at time $t$ within the exposure. By choosing a sequence of $t$ values, a frame burst of the scene is reconstructed. Simulation and real-world results are presented, showing improved performance compared to existing methods based on conventional imaging, both in reconstruction quality and in handling the inherent direction ambiguity.

Our method can assist in balancing the various trade-offs that a camera designer has to handle. For example, the promising results hold potential for extending the method to convert low-frame-rate blurred video into high-frame-rate sharp video, achieved with a lower sampling rate and improved light efficiency. This can extend existing photography capabilities with simple and minor hardware changes.

References

Appendix A Appendix

In addition to the details presented in the paper, the supplementary material includes: (1) a detailed description of our prototype implementing the dynamic phase-coding, (2) additional experimental results, (3) a video containing a description of the system and various additional results (in the CMT in low resolution and in an anonymous link in high resolution), and (4) evaluation code of our reconstruction CNN. The trained network and several test coded images are shared in the anonymous link as well.

Appendix B Dynamic phase-coded camera prototype

Our method is based on a dynamic phase-coded camera, designed to embed color-motion cues in the intermediate image, and a corresponding CNN trained to decode these cues and reconstruct a sharp frame burst. After achieving satisfactory simulation results (i.e., with simulated coded images), we assembled a prototype camera implementing our proposed dynamic phase coding. As mentioned in the paper, our coding method is relatively simple and based mostly on conventional commercial parts. As such, it can be easily integrated into any camera equipped with a focusing mechanism.

Our prototype camera (see Figs. 9 and 10) is based on a standard C-mount lab camera (IDS UI-3590CP) equipped with a 4912 x 3684 pixel color CMOS sensor [UI-3590CP]. The camera is mounted with a fixed focal length C-mount lens whose focusing mechanism is based on a liquid lens (Edmund Cx C-mount lens #33-632 [liqlens]).

The coding is achieved jointly using a phase-mask in the lens aperture and a focus sweep performed during the exposure time. Following the work in [Shay2020phase], we use a similar phase-mask comprised of two phase rings, whose phase shifts are specified with respect to the peak wavelength of the camera's blue channel. The phase-mask is fabricated using a conventional photo-lithography and wet etching process.

The dynamic PSF encoding is achieved by applying the learned focus variation during the exposure time. The focus change is performed electronically using the camera's focusing mechanism, controlled by a dedicated micro-controller (Arduino Nano) [ArduinoNano]. The micro-controller stores the learned focus sweep parameters and triggers the required coding in synchronization with the exposure (utilizing the camera flash signal, designed to indicate the start of the exposure). Note that although various components were used in our implementation, the coding can also be implemented easily on existing cameras (assuming the availability of an API for the focusing and exposure mechanisms).

Figure 9: Prototype Camera. The dynamic phase-coded camera prototype is based on a commercial camera and a lens with a focusing mechanism, where our phase-mask is incorporated in the lens aperture. The camera flash signal is utilized to trigger the focus variation, controlled by the micro-controller (located near the camera).
Figure 10: Prototype Camera Diagram. The flash signal from the camera initiates the learned focus variation during the exposure using a micro-controller, such that the designed dynamic phase coding is performed and a motion-coded image is acquired.

Appendix C Results

In addition to the results presented in the paper, we present further video results in the supplementary video. The reconstructed videos were generated using 25 frames, since any number of frames can be chosen with our time-dependent CNN. The frame-rate difference compared to [Jin_2018_CVPR] (which is limited to 7 frames only) is clearly noticeable in the supplementary video. A comparison between the frames in our results and [Jin_2018_CVPR] is presented in Fig. 11, Fig. 12 and Fig. 13. Our method achieves improved results along the entire frame burst.

Figure 11: Reconstruction performance (simulation) for seven frames. (top row) GT image and zoom-in for a 7-frames burst, (middle row) conventional blur and Jin et al. [Jin_2018_CVPR] results, and (bottom row) our coded input and reconstruction results. The full result videos are presented in the supplementary video.
Figure 12: Reconstruction performance (simulation) for seven frames. (top row) GT image and zoom-in for a 7-frames burst, (middle row) conventional blur and Jin et al. [Jin_2018_CVPR] results, and (bottom row) our coded input and reconstruction results. The full result videos are presented in the supplementary video.
Figure 13: Reconstruction performance (simulation) for seven frames. (left column) GT image and zoom-in for a 7-frames burst, (middle column) conventional blur and Jin et al. [Jin_2018_CVPR] results, and (right column) our coded input and reconstruction results. The full result videos are presented in the supplementary video.