Today’s commercially available high-frame-rate cameras are expensive as their sensors need to be highly light-sensitive and should be able to handle large data bandwidth. For a given bandwidth and cost there is a trade-off between spatial and temporal resolution [liu2013efficient]. One way to overcome this trade-off is to temporally upsample the frames captured from the low frame-rate camera using computational methods. Recently, learning-based methods like [purohit2019bringing, jin2018learning] take in as input a long-exposure (or blurred) image and computationally decompose it into multiple frames of a video sequence. Recovering multiple video frames from a single image is a highly ill-posed problem. The recovered video sequences obtained from [purohit2019bringing, jin2018learning] are not completely deblurred and also contain several motion artifacts. These methods also suffer from motion ambiguity as the input frame has lost the information about motion direction due to temporal averaging.
This motion ambiguity can be overcome by using imaging systems such as flutter shutter [holloway2012flutter, raskar2006coded] and per-pixel exposure coding [reddy2011p2c2, liu2013efficient]. Exposure systems such as [reddy2011p2c2, liu2013efficient] use a per-pixel binary code to temporally multiplex the information into a single coded frame. These systems compress only the temporal information at each pixel into a single value. The acquisition process of these imaging architectures forms a better-posed linear inverse problem, and the compressed high-frame-rate video signal can be recovered with high fidelity. However, one shortcoming of these encoding techniques is that they are light inefficient [baraniuk2017compressive]: they throw away a significant amount of light. This inefficiency can be overcome by the use of a blurred image that integrates the light over the whole exposure. We propose and investigate an algorithm to extract a video sequence from a complementary system consisting of a coded exposure image and a fully-exposed image. We show that, when using coded-blurred image pairs, we can better recover the static parts of the scene from the blurred image and the dynamic parts of the scene from the coded image. Overall, this leads to a higher-fidelity reconstruction than recovering the video from only the blurred image or only the coded image.
In this work, we propose a learning-based framework to extract a sequence of video frames from a coded and blurred image pair. Such an acquisition system already exists in the form of the recently proposed multi-bucket sensor architecture [sarhangnejad20195]. These Coded-2-Bucket (C2B) sensors contain two buckets per pixel. While the first bucket outputs a temporally multiplexed frame encoded using the binary code C, the second bucket outputs a coded frame multiplexed using the complementary code. By adding the two coded exposure frames, we obtain the fully exposed frame. We use the coded and fully exposed image pair, together with the known forward model specified by the code C, to solve for a low-spatial-resolution video sequence. A shallow, shared encoder network is used to extract features from the low-resolution video and the fully-exposed frame. These features are then used to compute an attention map which helps in fusing the information from the coded frames and the fully exposed frame. The dynamic regions are better recovered from the coded frame, while the static regions are better recovered from the fully-exposed image. The features fused using the attention map are fed to a deep neural network which outputs the full-resolution video sequence. We train the network end-to-end on data simulated from videos captured using a high-frame-rate camera. In summary, we make the following contributions:
Static parts of a scene are best captured by the fully exposed image and the dynamic parts by the coded exposure image. We design a deep learning architecture that exploits this complementary information to reconstruct high spatial and temporal resolution video.
Video reconstruction from just a blurred image suffers from the motion ambiguity problem. We are able to resolve this ambiguity by using the coded image.
We show that we obtain better video reconstructions than when using only the fully exposed image or only the coded image.
II Related Work
High speed imaging systems: A hybrid system consisting of multiple conventional image sensors for high speed imaging was proposed in [wilburn2004high, shechtman2002increasing]. Another hybrid system, combining an intensity sensor and an event sensor, was proposed in [pan2019bringing] for extracting a video sequence from a blurry image and information from the event sensor. Event-based sensors have also been used to design low-power high-speed cameras [scheerlinck2018continuous, reinbacher2016real, rebecq2019events, shedligeri2018photorealistic]. Other methods of high speed imaging involve temporal super-resolution of video sequences captured from a low-frame-rate camera [karim2003low]. Some methods interpolate multiple frames between successive frames of a low-frame-rate video using optical flow [kaviani2015frame], auto-regressive models [zhang2009spatio], kernel regression [takeda2009super], or learning-based methods [jiang2018super], among others. Recently, a few works have also explored the possibility of decomposing a single blurred frame into a sequence of video frames [purohit2019bringing, jin2018learning].
Coded exposure imaging: A global pixel coding system known as flutter shutter was proposed in [raskar2006coded] for image deblurring. A similar system was proposed in [veeraraghavan2010coded], which used coded strobing photography for compressive sensing of high speed videos. The flutter shutter camera was further extended to recover a video sequence in [holloway2012flutter]. Pixel-wise coded exposure imaging systems, which use a DMD [reddy2011p2c2] or modified conventional CMOS sensors [liu2013efficient], have been very popular. Several algorithms have been proposed [yoshida2018joint, gupta2010flexible, iliadis2020deepbinarymask, park2009multiscale] which utilize this imaging architecture for compressive high speed imaging. Recently, a multi-bucket sensor called the Coded2Bucket camera [sarhangnejad20195, wei2018coded], capable of controlling the exposure of individual pixels, has also been used [li2020endtoend].
III Video from coded-blurred image pair
Fig. 2 shows the overview of our proposed algorithm, which takes as input a pair of coded and fully exposed images and outputs a sequence of video frames. The algorithm can be broadly divided into three stages. In the first stage, we obtain low spatial resolution video sequences from each of the input pair of frames, as explained in Sec. III-A. The second stage uses an attention module to fuse features that are extracted from the two video sequences. The third stage consists of a deep neural network which predicts the full-resolution video sequence from the fused features.
The predicted video from our proposed algorithm has consecutive frames each of spatial resolution . This output is obtained from only two input frames: the coded-exposure frame and the fully exposed frame. The coded exposure frame is obtained by multiplexing the temporal scene information with a predetermined binary code C. This multiplexing is done by first dividing the full exposure of the sensor into equal sub-exposures. In each of these sub-exposures, the code can be either 0 (block incoming light) or 1 (integrate incoming light), acting as a shutter for each of the pixels individually. The fully-exposed frame is obtained by integrating the light over all the sub-exposures. The process of obtaining the coded image and the fully exposed image can be written as,
where is the element-wise multiplication, are the intensity frames and is the binary code. Each of the and have the same spatial resolution .
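In assumed notation, the acquisition model described above can be sketched as follows. The symbols are illustrative choices and not necessarily the original ones: $X_t$ is the $t$-th intensity frame, $C_t$ the $t$-th temporal slice of the binary code, $T$ the number of sub-exposures, $\odot$ the element-wise multiplication, and $I_c$ and $I_b$ the coded and fully exposed images.

```latex
\begin{equation}
I_c = \sum_{t=1}^{T} C_t \odot X_t ,
\qquad
I_b = \sum_{t=1}^{T} X_t .
\end{equation}
```

Under the same notation, the frame collected with the complementary code is $\sum_{t} (1 - C_t) \odot X_t = I_b - I_c$, which is why adding the two bucket outputs gives the fully exposed image.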
In our algorithm we consider a code of size where . This code is then repeated spatially to obtain the full code C of size . As the code is repeating, we divide the input coded and blurred frames of size into tiles of size and explain our algorithm on an individual tile. This process can be repeated over all the tiles to obtain the final reconstructed video sequence of size .
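The acquisition with a spatially repeated code can be mimicked in simulation. The following is a minimal sketch, assuming the video is stored as nested Python lists indexed as `video[t][y][x]` and the code tile as `code_tile[t][i][j]`; the function name `simulate_pair` and this data layout are hypothetical choices made only for illustration.

```python
def simulate_pair(video, code_tile):
    """Simulate coded, complementary and fully exposed images.

    video:     T x H x W nested lists of intensities.
    code_tile: T x s x s binary code, repeated spatially over H x W.
    Returns (coded, complement, blurred), each an H x W nested list.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    s = len(code_tile[0])
    coded = [[0.0] * W for _ in range(H)]
    complement = [[0.0] * W for _ in range(H)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                # Tiling: the code value depends only on the position
                # of the pixel inside its s x s tile.
                c = code_tile[t][y % s][x % s]
                coded[y][x] += c * video[t][y][x]
                complement[y][x] += (1 - c) * video[t][y][x]
    # The two buckets together integrate light over every sub-exposure.
    blurred = [[coded[y][x] + complement[y][x] for x in range(W)]
               for y in range(H)]
    return coded, complement, blurred
```

Plain lists keep the sketch free of dependencies; an array library would be the natural choice for full-size frames.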
III-A Initial low-resolution video reconstruction
The very first part of our proposed algorithm consists of obtaining two low-res videos and from the coded image and the fully-exposed image, respectively. We obtain by solving the linear inverse problem and by re-arranging the pixels in the fully-exposed image into a video sequence.
First we wish to obtain a video sequence of size from the coded image of size by inverting the linear system shown in Eq. (1). However, this is a highly ill-posed problem with unknowns but only observed quantities. In our case, we have and hence we have times more unknown quantities than observed quantities. To make this inversion better posed, we assume uniform intensities in a local spatial neighborhood of as shown in Fig. 3, based on the local spatial correlation in natural images. In our experiments we choose , hence our assumption of uniform intensity in the small spatial neighborhood remains valid. With this assumption, the number of unknowns reduces to from and the video can be recovered by inverting the system of equations. Hence, from the tile of size of the coded image, we obtain the video sequence of size , where we have .
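The per-tile inversion can be illustrated with a small sketch. It assumes that the number of pixels in a tile equals the number of frames and that the resulting system matrix is invertible, so that each tile yields one intensity per frame. The helper names `solve_linear` and `lowres_from_coded_tile` are hypothetical, and the elimination routine is a generic textbook solver rather than the solver used in the original work.

```python
def solve_linear(A, b):
    """Solve A x = b for a square, invertible A by Gauss-Jordan elimination."""
    n = len(A)
    # Augmented matrix [A | b] in floating point.
    M = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Partial pivoting: pick the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("system matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def lowres_from_coded_tile(coded_tile, code_tile):
    """Recover one intensity per frame from an s x s coded tile.

    coded_tile: s x s observed coded intensities.
    code_tile:  T x s x s binary code with T == s * s.
    Under the uniform-intensity assumption, the pixel at (i, j) observes
    sum_t code_tile[t][i][j] * x[t], which gives s * s equations in T unknowns.
    """
    s, T = len(coded_tile), len(code_tile)
    if T != s * s:
        raise ValueError("this sketch needs as many frames as tile pixels")
    A = [[code_tile[t][i][j] for t in range(T)]
         for i in range(s) for j in range(s)]
    b = [coded_tile[i][j] for i in range(s) for j in range(s)]
    return solve_linear(A, b)
```

For a code in which every pixel is open during exactly one sub-exposure, the system matrix is a permutation matrix and the inversion reduces to rearranging the pixels of the tile.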
Next, we consider the Pixel Shuffle block in Fig. 2 where we reshape a (, in our case) tile from the blurred image into a low spatial resolution video. For this, we vectorize the tile and the index of this vector represents the frame number in the low-res video where we restrict . This process is repeated for both the coded and blurred images to obtain the low spatial resolution video and each of size .
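The rearrangement of tile pixels into frames can be sketched as a space-to-depth operation. The example below assumes a row-major vectorization of each tile, so that the pixel at offset `(i, j)` inside its tile goes to frame `i * s + j`; the ordering and the function name `space_to_depth` are assumptions made for illustration.

```python
def space_to_depth(image, s):
    """Rearrange an H x W image into an (s*s) x (H/s) x (W/s) video.

    Each s x s tile contributes one pixel to every frame; the frame index
    is the row-major index of the pixel inside its tile.
    """
    H, W = len(image), len(image[0])
    if H % s or W % s:
        raise ValueError("image size must be a multiple of the tile size")
    return [
        [[image[ty * s + i][tx * s + j] for tx in range(W // s)]
         for ty in range(H // s)]
        for i in range(s) for j in range(s)
    ]
```

Applied to a fully exposed image, every resulting frame is a subsampled copy of the same blurred content, so the output carries spatial detail but no motion ordering.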
III-B High resolution video reconstruction
Our objective here is to obtain the full-resolution video from the given two input videos of size . The video sequence gives us the information about scene motion direction which otherwise would have been lost due to temporal averaging. The video sequence contains valuable spatial information for parts of the scene which remain mostly static. We use an attention mechanism to fuse the information from these two inputs. The fused features are then input to a deep neural network to predict the final full-resolution video of size .
First, we use a shallow encoder network to extract features and from the two input video sequences. We then compute as the cosine similarity along the channel dimension between the feature maps and . As and share the same encoder network , similar (or dissimilar) features correspond to similar (or dissimilar) regions in the input videos and . The low-resolution video sequences and are similar in the static regions of the scene and dissimilar in the dynamic regions of the scene. This is because the video has good motion information and is obtained by merely rearranging the input pixels. Since the attention map measures the cosine similarity, it has higher values for static regions and lower values for the dynamic regions of the scene. We normalize the cosine similarity map to be between and obtain the attention map . As shown in Fig. 2, we obtain the combined feature map by concatenating the scaled feature maps and . The attention map learns to attend to dynamic regions in and to the static regions in . The combined feature map is then fed to a U-Net-like architecture which outputs the full-resolution video corresponding to the ground truth video . The network is trained end-to-end using the cost function defined as,
where denotes the finite difference spatial gradient operator.
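One form of the cost function consistent with this description combines a reconstruction term with a penalty on spatial gradients. The sketch below is only an assumed form: it uses $\ell_1$ norms and the illustrative symbols $\hat{X}$ (predicted video), $X$ (ground truth video), $\nabla$ (finite difference spatial gradient operator) and $\lambda$ (regularizer weight). The norms and the exact argument of the gradient term in the original may differ.

```latex
\begin{equation}
\mathcal{L}(\hat{X}, X) = \lVert \hat{X} - X \rVert_1
  + \lambda \, \lVert \nabla \hat{X} - \nabla X \rVert_1
\end{equation}
```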
III-C Architecture details
Fig. 2 depicts the architecture of our proposed network. The shared encoder block consists of three convolution layers of sizes andand consist of channels and the same spatial dimension as the inputs and . The attention map is obtained by computing normalized inner product between features and along the channel dimension, then scaled to the range . The feature maps and are multiplied by and respectively and concatenated to form a fused feature map of channels. The fused feature map is then passed to the U-Net which follows a similar architecture as [ronneberger2015u]. It consists of three contracting encoder blocks each followed by a Maxpool layer, a bottleneck block, two expanding decoder blocks and a final decoder block. The final layer in our network is a Pixel-Shuffle layer [shi2016real] which increases the spatial resolution times and hence outputs the video sequence at the same resolution as the ground truth video.
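The attention-based fusion can be sketched as follows. The example assumes that the normalized inner product is mapped from [-1, 1] to [0, 1] by `(cos + 1) / 2`, that the fully-exposed features are scaled by the attention map and the coded features by its complement, and that features are stored as nested lists indexed `[channel][y][x]`. These choices, and the function names, are assumptions for illustration; only the overall scheme (similarity along channels, scaling, concatenation) is taken from the description above.

```python
import math


def cosine_attention(feat_coded, feat_blur, eps=1e-8):
    """Per-pixel normalized inner product along channels, mapped to [0, 1]."""
    C = len(feat_coded)
    H, W = len(feat_coded[0]), len(feat_coded[0][0])
    attention = []
    for y in range(H):
        row = []
        for x in range(W):
            vc = [feat_coded[k][y][x] for k in range(C)]
            vb = [feat_blur[k][y][x] for k in range(C)]
            dot = sum(a * b for a, b in zip(vc, vb))
            norm = (math.sqrt(sum(a * a for a in vc))
                    * math.sqrt(sum(b * b for b in vb)))
            cos = dot / (norm + eps)        # in [-1, 1]
            row.append(0.5 * (cos + 1.0))   # assumed rescaling to [0, 1]
        attention.append(row)
    return attention


def fuse_features(feat_coded, feat_blur, attention):
    """Scale both feature maps and concatenate them along the channel axis."""
    C = len(feat_coded)
    H, W = len(attention), len(attention[0])
    # High attention (similar features, static content) favors the blurred branch.
    scaled_coded = [[[(1.0 - attention[y][x]) * feat_coded[k][y][x]
                      for x in range(W)] for y in range(H)] for k in range(C)]
    scaled_blur = [[[attention[y][x] * feat_blur[k][y][x]
                     for x in range(W)] for y in range(H)] for k in range(C)]
    return scaled_coded + scaled_blur       # 2 * C channels
```

With this convention, pixels where the two feature vectors agree draw mostly on the fully-exposed branch, and pixels where they disagree draw mostly on the coded branch.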
|Input|Method (first and last frames of extracted videos)|Sequence 1 (PSNR; SSIM)|Sequence 2 (PSNR; SSIM)|Sequence 3 (PSNR; SSIM)|
|Blurred image|[jin2018learning]|26.06 dB; 0.940|15.70 dB; 0.709|18.82 dB; 0.848|
|Blurred image|[purohit2019bringing]|26.71 dB; 0.957|13.92 dB; 0.639|18.05 dB; 0.840|
|Coded image|GMM [yang2014video]|30.96 dB; 0.962|24.24 dB; 0.822|30.67 dB; 0.953|
|Coded image|Our method|32.72 dB; 0.973|29.56 dB; 0.934|33.11 dB; 0.970|
|Pair of coded-blurred images|GMM [yang2014video]|32.93 dB; 0.975|25.52 dB; 0.857|32.52 dB; 0.966|
|Pair of coded-blurred images|Our method|34.66 dB; 0.981|30.18 dB; 0.940|34.94 dB; 0.979|
The last row of the figure shows the first and last frames of the ground truth videos.
|Input||Blurred image||Coded image||Coded+Blurred|
IV Experimental Results
We use the GoPro dataset [Nah_2017_CVPR] consisting of video sequences with a frame rate of fps and a spatial resolution of . This data is split into train sequences and test sequences following the split proposed in [Nah_2017_CVPR]. The first sharp frames of each training sequence are taken to form our training set. We simulate our coded exposure images using 9 consecutive frames from each video sequence. We train our network on non-overlapping patches of size extracted from the training set. Hence, the ground truth video for our model is of size during training.
For all our experiments we set the code C to be a tensor of size , where is the spatial extent and is the temporal extent of the code. For our proposed method we fix the code to be a sequential impulse code as shown in Fig. 5. The code sequentially samples all the spatial locations over the 9 temporal frames exactly once. Each of the pixels is sampled at one of the temporal frames. This code is then repeated spatially to span the entire spatio-temporal volume of the input frames. The coded images are obtained by multiplexing the ground-truth videos with the above tiled code according to Eq. (1), and the blurred or fully exposed images are obtained by simply adding the frames of the ground truth videos.
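A sequential impulse code of this kind can be constructed in a few lines. The sketch below assumes a square tile of side `s` with `s * s` temporal frames (for example `s = 3` for 9 frames) and a row-major sampling order; both the ordering and the function name are assumptions, since the exact layout is given only in the figure.

```python
def sequential_impulse_code(s):
    """Build an (s*s) x s x s binary code with exactly one open pixel per frame.

    Frame t opens the pixel whose row-major index inside the tile equals t,
    so every pixel is sampled exactly once over the exposure.
    """
    T = s * s
    return [[[1 if i * s + j == t else 0 for j in range(s)]
             for i in range(s)]
            for t in range(T)]


code = sequential_impulse_code(3)

# Each spatial location is sampled in exactly one of the 9 sub-exposures.
samples_per_pixel = [[sum(code[t][i][j] for t in range(9)) for j in range(3)]
                     for i in range(3)]
```

Because every pixel integrates light during only one of the sub-exposures, the coded frame alone collects a fraction `1 / (s * s)` of the available light, which is the inefficiency the fully exposed frame compensates for.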
As we use a fully-convolutional model, the network can take as input coded exposure frames of any spatial resolution that is a multiple of . The regularizer weight is set to in Eq. (2) for all our experiments. We train the network using the Adam optimizer [kingma2014adam] with a learning rate of , for epochs of the training set and with a batch size of . We use PyTorch [paszke2019pytorch] to build our entire network architecture. During testing, we use the code used during training to simulate the coded exposure frames from consecutive video frames, each of size . These coded exposure frames are then used to predict the video sequence, which is compared against the ground truth video sequence.
IV-A Video reconstruction
In this section, we compare the fidelity of video reconstruction for different inputs: a single blurred frame; a single coded frame; and a pair of coded-blurred images.
For evaluation of video extraction from a single blurred frame we choose the algorithms [purohit2019bringing, jin2018learning], which take as input a blurred frame obtained by averaging consecutive frames from the test set of the GoPro dataset [Nah_2017_CVPR]. These algorithms predict the consecutive video frames from which the blurred image was formed. For evaluation of video reconstruction from a single coded image, we use the data-driven Gaussian Mixture Model (GMM) based algorithm proposed in [yang2014video]. The GMM is trained with 20 components using video patches extracted from the training dataset of GoPro [Nah_2017_CVPR]. We then use the trained GMM to predict a video of 9 frames from an input coded exposure image. We also modify our proposed algorithm to predict a video sequence from a single coded frame as input. This modification only acts as another strong baseline in our experiments and is not our proposed method. In our modified architecture, we first extract the low-res video sequence from the single coded image as described in Sec. III-A. This low-res video sequence is then given as input to the U-Net [ronneberger2015u] model, which outputs the full-resolution video sequence with frames. As this model takes only a single coded exposure frame as input, we do not use any attention map. This modified architecture is also used to predict the video sequence from a single blurred frame as input. We also evaluate our proposed algorithm, which takes as input a pair of coded-blurred image frames and predicts the corresponding video sequence of frames. The GMM-based algorithm [yang2014video] is also modified to provide the output video sequence of frames from an input pair of coded-blurred frames. Notice that algorithms such as [jin2018learning, purohit2019bringing] extract only frames from the input, while GMM [yang2014video] and our proposed algorithms extract frames from the input. Although the comparison is not entirely fair, we only wish to highlight the extreme ill-posedness of extracting video from a single blurred frame compared to a coded frame.
For each of the cases, we compute the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [wang2004image] of the predicted frames against the original frames. In Table II we provide quantitative evaluation on the entire GoPro test set. For blurred frame video extraction, we consider the algorithm proposed in [jin2018learning] and our architecture modified to take a single blurred frame as input. We compare the reconstruction from the single coded frame input and coded-blurred frame input variants of our proposed method. As the code for [purohit2019bringing] is not open source, we requested the authors to provide the predicted frames for randomly selected blurred frames from the test set. In Table II, we provide quantitative comparison on this set of test sequences for all the above algorithms. We also provide qualitative comparison for some of the sequences in Fig. 4. From the comparisons it can be seen that extracting video from a coded frame is a much better-posed problem than extracting video from a single blurred frame. In Fig. 4, we see that the motion direction of the car is reversed when the video is extracted from a single blurred frame. The qualitative results can be better seen as video in the accompanying supplementary material.
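For reference, PSNR follows the standard definition 10 log10(peak^2 / MSE). The minimal sketch below assumes intensities normalized to [0, 1] (peak value 1.0) and frames stored as nested lists; the function name `psnr` is an illustrative choice, and SSIM is omitted because it requires a windowed computation.

```python
import math


def psnr(pred, target, peak=1.0):
    """Peak signal-to-noise ratio in dB between two H x W frames."""
    sq_errors = [(p - q) ** 2
                 for row_p, row_q in zip(pred, target)
                 for p, q in zip(row_p, row_q)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```

For a video, the metric can be averaged over the per-frame values; whether the original evaluation averages per frame or over the whole volume is not stated here.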
|Input|Method|Sequence 1 (PSNR; SSIM)|Sequence 2 (PSNR; SSIM)|Sequence 3 (PSNR; SSIM)|Sequence 4 (PSNR; SSIM)|
|Blurred image|Deblurred images using [tao2018srndeblur]|30.99 dB; 0.971|29.98 dB; 0.959|30.43 dB; 0.967|24.08 dB; 0.864|
|Single coded image|Middle frames extracted using our method|34.64 dB; 0.981|34.16 dB; 0.967|33.95 dB; 0.969|29.64 dB; 0.933|
|Pair of coded-blurred images|Middle frames extracted using our method|36.05 dB; 0.986|35.88 dB; 0.977|35.71 dB; 0.978|30.27 dB; 0.940|
The last row of the figure shows the ground truth middle frames.
|Input||Blurred image||Coded image||Coded+Blurred|
IV-B Middle frame extraction
Algorithms that extract video from a single blurred image [purohit2019bringing, jin2018learning] suffer from motion ambiguity due to the loss of temporal information. However, for these algorithms, deblurring the image or extracting the middle frame of the video sequence is a slightly better-posed problem. Comparing only the fidelity of the deblurred images instead of the entire video sequence gives a slightly fairer chance to algorithms which use only the blurred frame as input. We also use a state-of-the-art learning-based image deblurring algorithm [tao2018srndeblur] as an additional comparison. To add another, stronger baseline, we include the quantitative results from GMM [yang2014video] and our modified architecture that takes only a single coded frame as input. In Table III, we compare the PSNR and SSIM [wang2004image] values for the predicted middle frame. From Table III, we observe an average gain of dB PSNR when using a single coded image versus a blurred image for extracting the middle frame. We observe a further gain of dB PSNR when the additional fully exposed frame is used as an input. This indicates that motion ambiguity has a large impact on the performance of [jin2018learning, purohit2019bringing]. Our proposed method outperforms all the previous works as it exploits the complementary information provided by the fully exposed and coded frames. We show some of the extracted middle frames in Fig. 6 for some of the test sequences used for the evaluation. More results can be found in the accompanying supplementary material. This experiment reiterates the advantage, in terms of fidelity of video reconstruction, of using the additional information from the coded image.
|Sequences from [jodoin2017extensive]||Sequences from [ferryman2009pets2009]|
|Learned attention maps|
|Middle frame of videos extracted using our method|
|PSNR 34.11 dB||36.44 dB||32.80 dB||34.57 dB|
IV-C Visualizing attention maps
As the inputs to our framework provide complementary information, we use an attention mechanism to attend to different parts of the scene from different inputs. The attention map helps to combine features extracted from the two inputs, the coded frames and the fully-exposed frame. We use the datasets proposed in [jodoin2017extensive, ferryman2009pets2009] to visualize our attention maps. These videos contain a static background and a few dynamic objects in the foreground, and are captured using a static camera. We simulate the two coded frames and the fully-exposed frames from these videos and provide them as input to our trained model. We visualize the attention maps learned from these inputs in Fig. 7. We observe that the attention maps make a clear distinction between the static parts and dynamic parts of the scene. This experiment confirms that the trained network attends to different regions of the scene from different inputs: to the static regions of the scene in the fully-exposed frame and to the dynamic regions of the scene in the coded exposure frames.
In this work, we propose a method to extract a video sequence from a coded exposure image and a fully exposed image. We propose an attention mechanism to fuse the complementary spatial and temporal information from the input images. We observe a trade-off between imaging hardware complexity and the fidelity of the recovered video sequence. A blurred image can be easily acquired, but recovering the video sequence from this blurred image is highly ill-posed and the current algorithms require further innovation. Acquiring a coded image requires significant hardware modification, but provides a large improvement in the fidelity of the recovered video. However, sensors that can acquire coded images are still not commercially available. Recently proposed C2B sensors show promise in this direction, as they allow per-pixel exposure control and are also light efficient. We show that the additional information from the fully-exposed image that can be obtained from these sensors provides a further improvement in the fidelity of the reconstructed video. In the future, we would also like to obtain a prototype C2B camera and test our proposed method on real data acquired from the sensor.