Bringing Blurry Alive at High Frame-Rate with an Event Camera

03/12/2019 ∙ by Liyuan Pan, et al. ∙ Australian National University

Event-based cameras can measure intensity changes (called 'events') with microsecond accuracy under high-speed motion and challenging lighting conditions. With the active pixel sensor (APS), an event camera can simultaneously output intensity frames. However, these images are captured at a relatively low frame-rate and often suffer from motion blur. A blurry image can be regarded as the integral of a sequence of latent images, while the events indicate the changes between the latent images. Therefore, we are able to model the blur-generation process by associating event data with a latent image. Based on the abundant event data and the low frame-rate, easily blurred images, we propose a simple and effective approach to reconstruct a high-quality and high frame-rate sharp video. Starting with a single blurry frame and its event data, we propose the Event-based Double Integral (EDI) model. We then extend it to the multiple Event-based Double Integral (mEDI) model to obtain smoother and more convincing results based on multiple images and their events. We also provide an efficient solver to minimize the proposed energy model. By optimizing the energy model, we achieve significant improvements in removing general blur and reconstructing high temporal-resolution video. The video generation is based on solving a simple non-convex optimization problem in a single scalar variable. Experimental results on both synthetic and real images demonstrate the superiority of our mEDI model and optimization method in comparison to the state-of-the-art.


1 Introduction

Fig. 1: Deblurring and reconstruction results of our method compared with state-of-the-art methods on our real blurry event dataset. (a) The input blurry image. (b) The corresponding event data. (c) A clean image of the sweater, captured as ground truth. (d) Deblurring result of Tao et al. [1]. (e) Deblurring result of Jin et al. [2]; Jin et al. train a supervised deblurring model on video data, and such video carries information similar to the event data. (f)-(g) Reconstruction results of Scheerlinck et al. [3], (f) from only events, (g) from combining events and frames. (h) Reconstruction result of Pan et al. [4] (our EDI) from combining events and a single blurry frame. (i) Our mEDI reconstruction result from combining events and multiple blurry frames. Our result preserves richer and more faithful texture and the consistency of a natural image. (Best viewed on screen).

Event cameras (such as the Dynamic Vision Sensor (DVS) [5] and the Dynamic and Active-pixel Vision Sensor (DAVIS) [6]) are sensors that asynchronously measure the intensity changes at each pixel independently with microsecond temporal resolution (if nothing moves in the scene, no events are triggered). The event stream encodes the motion information by measuring the precise pixel-by-pixel intensity changes. Event cameras are more robust to low lighting and highly dynamic scenes than traditional cameras since they are not affected by under/over exposure or motion blur associated with a synchronous shutter.

Due to the inherent differences between event cameras and standard cameras, existing computer vision algorithms designed for standard cameras cannot be applied to event cameras directly. Although the DAVIS [6] can provide simultaneous output of intensity frames and the event stream, there still exist major limitations with current event cameras:

  • Low frame-rate intensity images: In contrast to the high temporal resolution of event data (microsecond resolution), current event cameras only output intensity images at a low frame-rate (a time resolution on the order of tens of milliseconds).

  • Inherent blurry effects: When recording highly dynamic scenes, motion blur is a common issue due to the relative motion between the camera and the scene. The output of the intensity image from the APS tends to be blurry.

To address the above challenges, various methods have been proposed that reconstruct high frame-rate videos. The existing methods can in general be categorized as:

  1. Event data only solutions [7, 8, 9], where the results tend to lack the texture and consistency of natural videos, as they fail to use the complementary information contained in the low frame-rate intensity image;

  2. Low frame-rate intensity-image-only solutions [2], where an end-to-end learning framework has been proposed to learn regression between a single blurry image and a video sequence, whereas the rich event data are not used;

  3. Jointly exploiting event data and intensity images [3, 10, 11, 12], building upon the interaction between both sources of information. However, these methods fail to address the blur issue associated with the captured image frame. Therefore, the reconstructed high frame-rate videos can be degraded by blur.

In contrast to existing methods, which treat blur as a degradation of image quality, we offer an alternative insight into the problem. Although blurry frames cause undesired image degradation, they also encode the relative motion between the camera and the observed scene. Taking full advantage of this encoded motion information benefits the reconstruction of high frame-rate videos.

To alleviate these problems, in our previous work [4] we proposed an Event-based Double Integral (EDI) model that reconstructs a high frame-rate video from a single (even blurred) image and its event sequence, where the blur effects have been reduced in each reconstructed frame. Our EDI model naturally relates the desired high frame-rate sharp video, the captured intensity frame and the event data. Based on the EDI model, high frame-rate video generation is as simple as solving a non-convex optimization problem in a single scalar variable.

As the EDI model is based on only a single image, errors from the event data accumulate and reduce the quality of the reconstructed video. Though we integrate over small time intervals from the centre of the exposure time to mitigate this error, flickering (un-smoothness) sometimes occurs at the two endpoints of the video, usually when the objects and the camera undergo a large relative motion. Therefore, in this paper, we propose an extended multiple Event-based Double Integral (mEDI) model that handles these discontinuities in the reconstructed videos and further improves their quality significantly.

In this paper, we extend the previous work significantly in the following ways:

  1. We propose a simple and effective model, named the Event-based Double Integral (EDI) model, and extend it to the mEDI model to restore a better high frame-rate sharp video from multiple images (even blurred ones) and their corresponding event data.

  2. Using our proposed formulation of mEDI, we propose a stable and general method to generate a sharp video under various types of blur by solving a single variable non-convex optimization problem, especially in low lighting and complex dynamic conditions.

  3. The frame rate of our reconstructed video can theoretically be as high as the event rate (200 times greater than the original frame rate in our experiments). With multiple images as input, the reconstructed video preserves richer texture and the consistency of a natural image.

2 Related Work

Event cameras such as the DAVIS and DVS [6, 5] report log intensity changes, inspired by human vision. The result is a continuous, asynchronous stream of events that encodes non-redundant information about local brightness changes. Estimating image intensity from events is important because it grants computer vision researchers a readily available high-temporal-resolution, high-dynamic-range imaging platform that can be used for tasks such as face detection [9], SLAM [13, 14, 15], or optical flow estimation [7]. Although several works try to explore the advantages of the high temporal resolution provided by event cameras [16, 17, 18, 19, 20, 21], how to make the best use of the event camera has not yet been fully investigated.

Event-based image reconstruction. Kim et al. [14] reconstruct high-quality images from an event camera under the strong assumption that the only movement is pure camera rotation, and later extend their work to handle 6-degree-of-freedom motion and depth estimation [15]. Bardow et al. [7] simultaneously optimise optical flow and intensity estimates within a fixed-length, sliding spatio-temporal window using the primal-dual algorithm [22]. Taking a spatio-temporal window of events imposes a latency cost at minimum equal to the length of the time window, and choosing a time interval (or event batch size) that works robustly for all types of scenes is not trivial. Reinbacher et al. [8] integrate events over time while periodically regularising the estimate on a manifold defined by the timestamps of the latest events at each pixel. Barua et al. [9] generate image gradients by dictionary learning and obtain a logarithmic intensity image via Poisson reconstruction. However, the intensity images reconstructed by these approaches suffer from obvious artifacts as well as a lack of texture due to the spatial sparsity of the event data.

To achieve more image detail in the reconstructed images, several methods that combine events with intensity images have been proposed. The DAVIS [6] uses a shared photo-sensor array to simultaneously output events (DVS) and intensity images (APS). Scheerlinck et al. [3] propose an asynchronous event-driven complementary filter to combine APS intensity images with events, and obtain continuous-time image intensities. Brandli et al. [12] combine image frames and the event stream from the DAVIS camera to create inter-frame intensity estimates by dynamically estimating the contrast threshold (temporal contrast) of each event. Each new image frame resets the intensity estimate, preventing excessive growth of integration error, but also discarding important accumulated event information. Shedligeri et al. [11] first exploit two intensity images to estimate depth. Then, they use the event data only to reconstruct a pseudo-intensity sequence (using [8]) between the two intensity images, and use the pseudo-intensity sequence to estimate ego-motion using visual odometry. Using the estimated 6-DOF pose and depth, they directly warp the intensity image to the intermediate location. Liu et al. [23] assume the scene has a static background; thus, their method needs an extra sharp static foreground image as input, and the event data are used to align the foreground with the background. To alleviate these problems, in our previous work [4] we proposed the EDI model, which reconstructs a high frame-rate video from a single (even blurred) image and its event sequence, with the blur reduced in each reconstructed frame. The EDI model naturally relates the desired high frame-rate sharp video, the captured intensity frame and the event data, and reduces high frame-rate video generation to a non-convex optimization problem in a single scalar variable.

Image deblurring. Traditional deblurring methods usually make assumptions about the scene (such as a static scene) or exploit multiple images (such as stereo, or video) to solve the deblurring problem. Significant progress has been made in the field of single-image deblurring. Methods using gradient-based regularizers, such as the Gaussian scale mixture [24], the l1/l2 norm [25], edge-based patch priors [26] and the l0-norm regularizer [27], have been proposed. Non-gradient-based priors such as the color-line-based prior [28] and the extreme-channel (dark/bright channel) prior [29, 30] have also been explored. Another family of image deblurring methods tends to use multiple images [31, 32, 33, 34, 35].

Driven by the success of deep neural networks, Sun et al. [36] propose a convolutional neural network (CNN) to estimate locally linear blur kernels. Gong et al. [37] learn optical flow from a single blurry image through a fully-convolutional deep neural network; the blur kernel is then obtained from the estimated optical flow to restore the sharp image. Nah et al. [38] propose a multi-scale CNN that restores latent images in an end-to-end manner without assuming any restricted blur-kernel model. Tao et al. [1] propose a light and compact network, SRN-DeblurNet, to deblur the image. However, deep deblurring methods generally need a large dataset to train the model and usually require sharp images provided as supervision. In practice, blurry images do not always have corresponding ground-truth sharp images.

Blurry image to sharp video. Recently, two deep-learning-based methods [2, 39] have been proposed to restore a video from a single blurry image with a fixed sequence length. However, their reconstructed videos do not obey the 3D geometry of the scene and the camera motion. Although deep-learning-based methods achieve impressive performance in various scenarios, their success heavily depends on the consistency between the training and testing datasets, which hinders their generalization to real-world applications.

Fig. 2: The event data and our reconstructed result, where (a) and (b) are the inputs of our method. (a) The intensity image from the event camera. (b) Events from the event camera plotted in 3D space-time (blue: positive event; red: negative event). (c) The first integral of several events during a small time interval. (d) The second integral of events during the exposure time. (e) Reconstructed latent images using different values of $c$: low (0.10), proper (around 0.23) and high (0.60). (Best viewed on screen).

3 Formulation

In this section, we first introduce our EDI model and then develop the mEDI model of the relationships between the events, the latent images and the blurry images. Our goal is to reconstruct a high frame-rate, sharp video from a single image or multiple images and their corresponding events. This model can tackle various blur types and work stably in highly dynamic scenes and low lighting conditions.

3.1 Event Camera Model

Event cameras are bio-inspired sensors that asynchronously report logarithmic intensity changes [6, 5]. Unlike conventional cameras that produce the full image at a fixed frame-rate, event cameras trigger events whenever the change in intensity at a given pixel exceeds a preset threshold. Event cameras do not suffer from the limited dynamic ranges typical of sensors with synchronous exposure time, and are able to capture high-speed motion with microsecond accuracy.

Inherent in the theory of event cameras is the concept of the latent image $L_{x,y}(t)$, denoting the instantaneous intensity at pixel $(x, y)$ at time $t$, related to the rate of photon arrival at that pixel. The latent image is not directly output by the camera. Instead, the camera outputs a sequence of events, denoted by $(x, y, t, \sigma)$, which record changes in the intensity of the latent image. Here, $(x, y)$ are image coordinates, $t$ is the time the event takes place, and polarity $\sigma \in \{-1, +1\}$ denotes the direction (increase or decrease) of the intensity change at that pixel and time. Polarity is given by,

$$\sigma = \mathcal{T}\big(\log L_{x,y}(t) - \log L_{x,y}(t_{\mathrm{ref}}),\ c\big), \tag{1}$$

where $\mathcal{T}(\cdot,\cdot)$ is a truncation function,

$$\mathcal{T}(d, c) = \begin{cases} +1, & d \ge c,\\ 0, & d \in (-c, c),\\ -1, & d \le -c. \end{cases}$$

Here, $c$ is a threshold parameter determining whether an event should be recorded or not, $L_{x,y}(t)$ is the intensity at pixel $(x, y)$ in the image and $t_{\mathrm{ref}}$ denotes the timestamp of the previous event at that pixel. When an event is triggered, $L_{x,y}(t_{\mathrm{ref}})$ at that pixel is updated to a new intensity level.
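To make the triggering rule above concrete, the sketch below simulates per-pixel event generation under Eq. (1). It is an illustrative toy, not the camera's asynchronous hardware: the latent signal is assumed to be regularly sampled, and the function name `simulate_events` and its default threshold are our own choices.

```python
import numpy as np

def simulate_events(intensity, timestamps, c=0.23):
    """Toy per-pixel event generator following the truncation rule of Eq. (1).

    intensity, timestamps: regularly sampled values of the latent signal at
    one pixel (a real sensor works asynchronously in hardware).
    Returns a list of (t, polarity) tuples.
    """
    log_i = np.log(np.maximum(intensity, 1e-6))   # work in log-intensity space
    ref = log_i[0]                                # reference level at t_ref
    events = []
    for t, li in zip(timestamps[1:], log_i[1:]):
        d = li - ref
        while abs(d) >= c:                        # may fire several events per sample
            polarity = 1 if d > 0 else -1
            events.append((t, polarity))
            ref += polarity * c                   # update the reference level
            d = li - ref
    return events
```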

3.2 Intensity Image Formation

In addition to the event sequence, event cameras can provide a full-frame grey-scale intensity image, at a much slower rate than the event sequence. The grey-scale images may suffer from motion blur due to their long exposure time. A general model of image formation is given by,

$$B = \frac{1}{T}\int_{f - T/2}^{f + T/2} L(t)\,dt, \tag{2}$$

where $B$ is a blurry image, equal to the average value of the latent image $L(t)$ during the exposure time $[f - T/2,\ f + T/2]$. This equation applies to each pixel independently, and subscripts denoting pixel location are often omitted henceforth.
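As a sanity check of Eq. (2), a blurry frame can be synthesised by averaging sharp latent frames over the exposure window (this is also how synthetic blurry images are later built from GoPro videos in Section 6.1). The helper below is a minimal sketch assuming the latent frames are supplied as a NumPy stack; the function name is ours.

```python
import numpy as np

def synthesize_blur(latent_frames):
    """Approximate Eq. (2): B is the temporal average of the latent images
    L(t) sampled within one exposure window [f - T/2, f + T/2].

    latent_frames: array of shape (num_frames, H, W) holding sharp frames.
    Returns the blurry image of shape (H, W).
    """
    return latent_frames.mean(axis=0)
```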

Fig. 3: Examples of our results based on real event data, showing three blurry input images and the corresponding deblurred images. The threshold $c$ is estimated automatically from the three images and their events based on our mEDI model. (Best viewed on screen).

3.3 Event-based Double Integral Model

We aim to recover a sequence of latent intensity images by exploiting both the blur model and the event model. We define $e(t)$ as a function of continuous time such that

$$e(t) = \sigma\,\delta_{t_0}(t)$$

whenever there is an event $(x, y, t_0, \sigma)$. Here, $\delta_{t_0}(t)$ is an impulse function, with unit integral, at time $t_0$, and the sequence of events is turned into a continuous-time signal, consisting of a sequence of impulses. There is such a function for every point $(x, y)$ in the image. Since each pixel can be treated separately, we omit the subscripts $x, y$.

During an exposure period $[f - T/2,\ f + T/2]$, we define $E(t)$ as the sum of events between time $f$ and $t$ at a given pixel,

$$E(t) = \int_{f}^{t} e(s)\,ds,$$

which represents the proportional change in intensity between time $f$ and $t$. Except under extreme conditions, such as glare and no-light conditions, the latent image sequence is expressed as,

$$L(t) = L(f)\,\exp\big(c\,E(t)\big).$$

In particular, an event is triggered when the intensity of a pixel increases or decreases by an amount $c$ at time $t_0$. We put a tilde on top of things to denote logarithm, e.g. $\tilde{L}(t) = \log\big(L(t)\big)$, so that

$$\tilde{L}(t) = \tilde{L}(f) + c\,E(t). \tag{3}$$

Given a sharp frame, we can reconstruct a high frame-rate video from the sharp starting point by using Eq. (3). When the input image is blurry, a trivial solution would be to first deblur the image with an existing deblurring method and then to reconstruct the video using Eq. (3) (see Fig.8 for details). However, in this way, the event data between intensity images is not fully exploited, thus resulting in inferior performance. Instead, we propose to reconstruct the video by exploiting the inherent connection between event and blur, and present the following model.

As for the blurred image,

$$B = \frac{1}{T}\int_{f - T/2}^{f + T/2} L(t)\,dt = \frac{L(f)}{T}\int_{f - T/2}^{f + T/2}\exp\big(c\,E(t)\big)\,dt. \tag{4}$$

In this manner, we construct the relation between the captured blurry image and the latent image through the double integral of the events. We name Eq. (4) the Event-based Double Integral (EDI) model.

We denote

$$J(c) = \frac{1}{T}\int_{f - T/2}^{f + T/2}\exp\big(c\,E(t)\big)\,dt.$$

Taking the logarithm on both sides of Eq. (4) and rearranging yields

$$\tilde{L}(f) = \tilde{B} - \log\big(J(c)\big), \tag{5}$$

which shows a linear relation between the blurry image, the latent image and the integral of the events in the log space.
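A per-pixel sketch of Eqs. (3)-(5): given the blurry value B, the events inside the exposure and a candidate threshold c, it evaluates the double integral numerically, recovers L(f) via Eq. (5), and propagates it to any queried time via Eq. (3). The function names and the simple trapezoidal discretisation are our own illustrative choices; a practical implementation would vectorise this over all pixels.

```python
import numpy as np

def edi_deblur(B, event_times, event_polarities, f, T, c, n_samples=200):
    """Per-pixel EDI model (Eqs. (3)-(5)).

    B: blurry intensity at this pixel; event_times / event_polarities are
    NumPy arrays of the events within [f - T/2, f + T/2].
    Returns L(f), the latent intensity at the reference time f.
    """
    ts = np.linspace(f - T / 2.0, f + T / 2.0, n_samples)

    def E(t):
        # Signed sum of events between f and t (negative count if t < f).
        if t >= f:
            mask = (event_times >= f) & (event_times <= t)
            return event_polarities[mask].sum()
        mask = (event_times >= t) & (event_times < f)
        return -event_polarities[mask].sum()

    # Double integral term J(c) = (1/T) * integral of exp(c * E(t)) over the exposure.
    J = np.trapz(np.exp(c * np.array([E(t) for t in ts])), ts) / T
    log_Lf = np.log(max(B, 1e-6)) - np.log(J)          # Eq. (5)
    return np.exp(log_Lf)

def latent_at(Lf, event_times, event_polarities, f, t, c):
    """Eq. (3): propagate L(f) to any time t using the events between f and t."""
    if t >= f:
        mask = (event_times >= f) & (event_times <= t)
        E_t = event_polarities[mask].sum()
    else:
        mask = (event_times >= t) & (event_times < f)
        E_t = -event_polarities[mask].sum()
    return Lf * np.exp(c * E_t)
```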

4 Using More Than One Frame

If two or more blurry images are available, then we can solve the equations directly, in a least-squares sense, or by some other method for solving an over-constrained system. In particular, if there are $n$ frames, each with $N$ pixels, then this gives an over-determined set of equations in the unknown log intensities at each pixel, assuming that the threshold $c$ is known. If $c$ is not known, there are still sufficiently many equations, but they are not linear in $c$.

4.1 mEDI Model

Consider now an event camera supplying a continuing sequence of events, and also $n$ images $B_i$, for $i = 1, \dots, n$, called the 'blurry images'. Suppose that the image $B_i$ is captured between time $f_i - T/2$ and $f_i + T/2$. Therefore, we write the event sum and the double integral for the $i$-th exposure as $E_i(t) = \int_{f_i}^{t} e(s)\,ds$ and $J_i(c) = \frac{1}{T}\int_{f_i - T/2}^{f_i + T/2}\exp\big(c\,E_i(t)\big)\,dt$.

Then the EDI model in Section 3 becomes:

$$\tilde{L}(f_i) = \tilde{B}_i - \log\big(J_i(c)\big), \qquad \tilde{L}_i(t) = \tilde{L}(f_i) + c\,E_i(t). \tag{6}$$

Let $\tilde{L}_i = \tilde{L}(f_i)$ denote the (log of the) $i$-th frame in the video reconstructed from the $i$-th blurry image. We use $a_i$ and $b_i$ to represent the inter-frame event integral and the intra-exposure double integral, respectively, in the following sections:

$$a_i = c\int_{f_i}^{f_{i+1}} e(t)\,dt, \qquad b_i = \log\big(J_i(c)\big). \tag{7}$$

Then we have the new mEDI model based on multiple images and their events:

$$\begin{cases}\ \tilde{L}_i = \tilde{B}_i - b_i, & i = 1, \dots, n,\\ \ \tilde{L}_{i+1} = \tilde{L}_i + a_i, & i = 1, \dots, n-1. \end{cases} \tag{8}$$

The knowns are the $\tilde{B}_i$, $a_i$ and $b_i$, whereas the unknowns are the $\tilde{L}_i$ and $c$. For a fixed $c$, there is a linear set of equations in the $\tilde{L}_i$; the equations become

$$A\,\tilde{\mathbf{L}} = \mathbf{w}, \tag{9}$$

where $\tilde{\mathbf{L}} = (\tilde{L}_1, \dots, \tilde{L}_n)^{\top}$ and the rows of $A$ and $\mathbf{w}$ stack the two sets of relations in Eq. (8).

If $c$ is not known, this does not quite work any more. The constants $a_i$ and $b_i$ depend on $c$, and $b_i$ in particular depends on $c$ in a non-linear way. If (9) is denoted by $A\mathbf{x} = \mathbf{w}$, the least-squares solution is given by solving $A^{\top}\!A\,\mathbf{x} = A^{\top}\mathbf{w}$.

Expanding this out explicitly gives

$$\begin{pmatrix} 2 & -1 & & & \\ -1 & 3 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 3 & -1\\ & & & -1 & 2 \end{pmatrix}\begin{pmatrix}\tilde{L}_1\\ \tilde{L}_2\\ \vdots\\ \tilde{L}_{n-1}\\ \tilde{L}_n\end{pmatrix} = A^{\top}\mathbf{w}. \tag{10}$$

This is a particularly easy set of equations to solve. Since it has to be solved for each pixel, it is important to do it efficiently.

4.2 Decomposition

The best way is to take the LU decomposition of the left-hand-side matrix, which has a particularly simple form. More precisely, if the Fibonacci sequence is 1, 1, 2, 3, 5, 8, … and $F_k$ denotes the $k$-th entry of this sequence, then (with an appropriate reordering of rows) the top line of the left-hand factor consists of the even-numbered entries of the Fibonacci sequence, and the entry at the bottom left of the right-hand factor is the next odd-numbered Fibonacci number, which is also the determinant of the original matrix. Solving the equations by LU decomposition and back-substitution is particularly simple in this case: one first solves the lower-triangular system by a straightforward forward pass, and then obtains the solution of the upper-triangular system by back-substitution from the bottom.
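Assuming the per-pixel normal equations of Eq. (10) take the tridiagonal form shown above, each pixel can be solved in O(n) with the classic Thomas algorithm (forward elimination followed by back-substitution from the bottom). The sketch below is a generic tridiagonal solver, not the exact Fibonacci-based factorisation described in the paper.

```python
import numpy as np

def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: LU-style forward elimination plus back-substitution
    for a tridiagonal system, O(n) per pixel.

    lower: sub-diagonal (length n-1), diag: main diagonal (length n),
    upper: super-diagonal (length n-1), rhs: right-hand side (length n).
    """
    n = len(diag)
    d = diag.astype(float)
    b = rhs.astype(float)
    for i in range(1, n):                 # forward elimination (the "LU" step)
        m = lower[i - 1] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = np.empty(n)
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):        # back-substitution from the bottom
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x
```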

Fig. 4: An example of our reconstruction result using different methods to estimate $c$, from the real dataset [40]. (a) The blurry image. (b) Deblurring result of [1]. (c) Our result where $c$ is chosen by manual inspection. (d) Our result where $c$ is computed automatically by our proposed energy minimization (12).
Fig. 5: Deblurring performance plotted against the value of $c$. The reconstructed image is clearer at higher PSNR values.

4.3 High Frame-Rate Video Generation

The right-hand side of Eq. (8) is known, apart from perhaps the value of the contrast threshold $c$: the first term comes from the grey-scale image and the second term from the event sequence. It is therefore possible to compute $\tilde{L}_i$, and hence $L_i$ by exponentiation. Subsequently, from Eq. (6), the latent image at any time may be computed.

To avoid accumulating errors when constructing a video from many blurred frames, it is preferable to reconstruct each frame using its closest blurred frame.

Theoretically, we could generate a video with a frame-rate as high as the DVS's eps (events per second). However, as each event carries little information and is subject to noise, several events must be processed together to yield a reasonable image. We generate a reconstructed frame every 50-100 events, so for our experiments the frame-rate of the reconstructed video is usually 200 times greater than that of the input low frame-rate video. Furthermore, as indicated by Eq. (8), the challenging blind motion deblurring problem has been reduced to a single-variable optimization problem of finding the best value of the contrast threshold $c$. In the following section, we use $L(c)$ to denote the latent sharp image reconstructed with a particular value of $c$.
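The generation loop described above can be sketched as follows: the event stream is consumed in batches of roughly 50-100 events and one latent frame is produced per batch from the closest blurred frame. The helpers edi_deblur and latent_at are the illustrative per-pixel functions sketched in Section 3.3, and the batch size is a tunable assumption.

```python
import numpy as np

def reconstruct_video(B, event_times, event_polarities, f, T, c, batch=50):
    """Generate high frame-rate frames (per pixel) by stepping through the
    event stream in batches of ~50-100 events, as described above."""
    Lf = edi_deblur(B, event_times, event_polarities, f, T, c)  # latent value at f
    frames, times = [], []
    for k in range(batch, len(event_times), batch):
        t_k = event_times[k]                                    # timestamp of this batch
        frames.append(latent_at(Lf, event_times, event_polarities, f, t_k, c))
        times.append(t_k)
    return np.array(times), np.array(frames)
```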

5 Optimization

The unknown contrast threshold $c$ represents the minimum change in log intensity required to trigger an event. By choosing an appropriate $c$ in Eq. (8), we can generate a sequence of sharper images. Here, we propose two different methods to estimate the unknown variable $c$: manually chosen and automatically optimized.

Average result of the deblurred images on the dataset of [38]:
Method:     Pan [29]  Sun [36]  Gong [37]  Jin [2]  Tao [1]  Zhang [41]  Nah [38]  Ours
PSNR (dB):  23.50     25.30     26.05      26.98    30.26    29.18       29.08     29.06
SSIM:       0.8336    0.8511    0.8632     0.8922   0.9342   0.9306      0.9135    0.9430

Average result of the reconstructed videos on the dataset of [38]:
Method:     Baseline 1 ([1]+[3])  Baseline 2 ([3]+[1])  Scheerlinck [3]  Jin [2]  EDI (Ours)
PSNR (dB):  25.52                 26.34                 25.84            25.62    28.49
SSIM:       0.7685                0.8090                0.7904           0.8556   0.9199

TABLE I: Quantitative comparisons on the synthetic dataset [38]. This dataset provides videos that can be used to generate not only blurry images but also event data. All methods are tested under the same blurry condition, where methods [38, 2, 1, 41] use the GoPro dataset [38] to train their models. Jin [2] achieves their best performance when the image is down-sampled to 45%, as mentioned in their paper.
Fig. 6: An example of the reconstructed result on our synthetic event dataset based on the GoPro dataset [38], which provides the videos used to generate the blurry images and event data. (a) The blurry image; the red close-up frame is for (b)-(e), the yellow close-up frame is for (f)-(g). (b) The deblurring result of Jin et al. [2]. (c) Our deblurring result. (d) A crop of their reconstructed images, where the frame number is fixed at 7; Jin et al. [2] use the GoPro dataset plus 20 additional scenes as training data, and their model is supervised by 7 consecutive sharp frames. (e) A crop of our reconstructed images. (f) A crop of the images reconstructed by Reinbacher et al. [8] from only events. (g) A crop of the image reconstructed by Scheerlinck et al. [3], who use both events and the intensity image. For (e)-(g), the frames shown are chosen examples, and the length of the reconstructed video depends on the number of events. (Best viewed on screen).

Fig. 7: Examples of reconstruction results on the real event dataset. (a) The intensity image from the event camera. (b) Reconstruction result of Pan et al. [4] from combining events and a single blurry frame. (c) Our reconstruction result from combining events and multiple blurry frames. The new mEDI model is based on multiple images and obtains better results than our previous model based on a single image. (Best viewed on screen).
Fig. 8: Deblurring and reconstruction results on our real blurry event dataset. (a) Input blurry image. (b) Deblurring result of [2]. (c) Baseline 1 for our method: we first use the state-of-the-art video-based deblurring method [2] to recover a sharp image, and then use the sharp image as input to a state-of-the-art reconstruction method [3] to obtain the intensity image. (d) Baseline 2 for our method: we first use method [3] to reconstruct an intensity image, and then use a deblurring method [2] to recover a sharp image. (e) Samples from our reconstructed video. (Best viewed on screen).

5.1 Manually Chosen

According to our mEDI model in Eq. (8), given a value for $c$ we can obtain a sharp image. Therefore, we develop a method for deblurring by manually inspecting the visual effect of the deblurred image. In this way, we incorporate human perception into the reconstruction loop, and the deblurred images should satisfy human observation. In Fig. 4, we give an example of manually chosen and automatically optimized results on a dataset from [40].

5.2 Automatically Chosen

To automatically find the best $c$, we need to build an evaluation metric (energy function) that can evaluate the quality of the deblurred image $L(c)$. Specifically, we propose to exploit different prior knowledge for sharp images and the event data.

5.2.1 Energy function

For an unknown parameter $c$, the values on the right-hand side of Eq. (8) are dependent on $c$. In particular, we may write

$$\mathbf{w}(c) = \big(\tilde{B}_1 - b_1(c),\ \dots,\ \tilde{B}_n - b_n(c),\ a_1(c),\ \dots,\ a_{n-1}(c)\big)^{\top}. \tag{11}$$

Referring to Eq. (9), we denote the system as $A\mathbf{x} = \mathbf{w}(c)$; the least-squares solution $\hat{\mathbf{x}}(c)$ is given by solving $A^{\top}\!A\,\mathbf{x} = A^{\top}\mathbf{w}(c)$. The optimal $c$ can be estimated by solving Eq. (12),

$$c^{*} = \arg\min_{c}\ \big\|A\,\hat{\mathbf{x}}(c) - \mathbf{w}(c)\big\|_2^2. \tag{12}$$

Examples show that, as a function of $c$, the residual error in solving the equations is not guaranteed to be convex. However, in most cases (empirically) it appears to be convex, or at least to have a single minimum.

5.2.2 Fibonacci search

Finding the minimum of a function along a single line is easy if that function has a single minimum. In the case of least-squares minimization problems, various strategies for determining the line-search direction are currently used, such as conjugate gradient methods, gradient descent, Levenberg-Marquardt.

The minimum may be reliably located using Fibonacci search [42]. In this work, we use Fibonacci search to find the value of $c$ that gives the least error. In our case, it reduces the running time by roughly a factor of three compared with gradient descent. In Fig. 5, we illustrate the sharpness of the reconstructed image against the value of $c$, together with the PSNR of the corresponding reconstructed image. As demonstrated in the figure, our proposed reconstruction metric properly locates the best-deblurred image with peak PSNR.
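A minimal sketch of the Fibonacci search over c: it assumes the supplied energy(c) callable evaluates the residual of Eq. (12) (or any unimodal image-quality energy) on a bracketing interval [a, b], and shrinks that interval using Fibonacci ratios so that only one new energy evaluation is needed per iteration.

```python
def fibonacci_search(energy, a, b, n=30):
    """Locate the minimiser of a unimodal energy(c) on [a, b] with ~n evaluations.

    energy: callable mapping a candidate threshold c to a scalar residual,
    e.g. the least-squares residual of Eq. (12) for that c (an assumption here).
    Returns the midpoint of the final bracketing interval.
    """
    fib = [1, 1]
    for _ in range(n):                    # precompute Fibonacci numbers F_0..F_{n+1}
        fib.append(fib[-1] + fib[-2])
    k = n
    x1 = a + (fib[k - 2] / fib[k]) * (b - a)
    x2 = a + (fib[k - 1] / fib[k]) * (b - a)
    f1, f2 = energy(x1), energy(x2)
    while k > 2:
        if f1 > f2:                       # minimum lies in [x1, b]
            a = x1
            x1, f1 = x2, f2               # reuse the previous evaluation
            k -= 1
            x2 = a + (fib[k - 1] / fib[k]) * (b - a)
            f2 = energy(x2)
        else:                             # minimum lies in [a, x2]
            b = x2
            x2, f2 = x1, f1
            k -= 1
            x1 = a + (fib[k - 2] / fib[k]) * (b - a)
            f1 = energy(x1)
    return 0.5 * (a + b)
```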

Fig. 9: Examples of reconstruction results on our real blurry event dataset in low lighting and complex dynamic conditions. (a) Input blurry images. (b) The event data. (c) Deblurring results of [29]. (d) Deblurring results of Yan et al. [30]. (e) Deblurring results of [1]. (f) Deblurring results of [38]. (g) Deblurring results of [2], which uses video as training data. (h) Reconstruction result of our EDI [4] from combining events and frames. (i) Reconstruction result of [8] from only events. (j)-(k) Reconstruction results of [3]: (j) from only events, (k) from combining events and frames. (l) Our mEDI reconstruction result. Results in (c)-(g) show that real high-dynamic and low-light conditions remain challenging for deblurring methods. Results in (i)-(j) show that, while the intensity information of a scene is retained in an event camera recording, color and fine texture information cannot be recovered. (Best viewed on screen).

6 Experiment

6.1 Experimental Setup

Synthetic dataset. In order to provide a quantitative comparison, we build a synthetic dataset based on the GoPro blurry dataset [38]. It supplies ground-truth videos, which are used to generate the blurry images. Similarly, we employ the ground-truth images to generate event data based on the event camera model of Section 3.1.
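Our event-synthesis code is not described in detail here; one straightforward way to generate synthetic events from the ground-truth videos is to run the per-pixel event generator sketched in Section 3.1 (the illustrative simulate_events helper) over every pixel of the grayscale video, as in the sketch below.

```python
import numpy as np

def video_to_events(frames, timestamps, c=0.23):
    """Sketch of synthetic event generation from a sharp ground-truth video,
    assuming the illustrative per-pixel simulate_events helper of Section 3.1.

    frames: (num_frames, H, W) grayscale ground-truth video.
    Returns a list of (x, y, t, polarity) tuples, sorted by time.
    """
    _, H, W = frames.shape
    events = []
    for y in range(H):
        for x in range(W):
            for t, p in simulate_events(frames[:, y, x], timestamps, c):
                events.append((x, y, t, p))
    events.sort(key=lambda e: e[2])       # order the stream by timestamp
    return events
```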

Real dataset. We evaluate our method on a public Event-Camera dataset [40], which provides a collection of sequences captured by an event camera for high-speed robotics. Furthermore, we present our real blurry event dataset (to be released with the code), where each real sequence is captured with the DAVIS [6] under different conditions, such as indoor and outdoor scenery, low lighting conditions, and different motion patterns (e.g., camera shake, object motion) that naturally introduce motion blur into the APS intensity images.

Implementation details. For all our real experiments, we use the DAVIS, which shares a photosensor array to simultaneously output events (DVS) and intensity images (APS). The framework is implemented in MATLAB with C++ wrappers. It takes around 1.5 seconds to process one image on a single i7 core running at 3.6 GHz.

6.2 Experimental Results

We compare our proposed approach with state-of-the-art blind deblurring methods, including conventional deblurring methods [29, 30], deep-learning-based dynamic scene deblurring methods [38, 2, 1, 41, 36], and event-based image reconstruction methods [8, 3]. Moreover, Jin et al. [2] can restore a video from a single blurry image based on a deep network, where the middle frame in the restored odd-numbered sequence is the best.

In order to prove the effectiveness of our model, we show some baseline comparisons in Fig. 8 and Table I. For baseline 1, we first apply a state-of-the-art deblurring method [1] to recover a sharp image, and the recovered image is then fed to a reconstruction method [3]. For baseline 2, we first use the video reconstruction method to construct a sequence of intensity images, and then apply the deblurring method to each frame. As seen in Table I, our approach obtains higher PSNR and SSIM in comparison to both baseline 1 and baseline 2. This also implies that our approach better exploits the event data to not only recover sharp images but also reconstruct high frame-rate videos.

In Table I and Fig. 6, we show the quantitative and qualitative comparisons with the state-of-the-art image deblurring approaches [36, 29, 37, 2, 1, 41, 38], and the video reconstruction method [3] on our synthetic dataset, respectively.

As indicated in Table I, our approach achieves the best performance on SSIM and competitive results on PSNR compared to the state-of-the-art methods, and attains significant performance improvements on high frame-rate video reconstruction.

In Fig. 6, we compare our generated video frames with the state-of-the-art deblurring methods [29, 2, 1, 38] qualitatively. Furthermore, image reconstruction methods [8, 3] are also included for comparisons. Fig. 6 shows that our method can generate more frames from a single blurry image and the recovered frames are much sharper.

We also report our reconstruction results on the real dataset, including text images and low-lighting images, in Fig. 1, Fig. 3, Fig. 4 and Fig. 9. Compared with state-of-the-art deblurring methods, our method achieves superior results. In comparison to existing event-based image reconstruction methods [8, 3], our reconstructed images are not only more realistic but also contain richer details. (For more deblurring results and high temporal-resolution videos, please visit our home page.)

7 Limitation

Though an event camera records a continuous, asynchronous stream of events that encodes abundant information for our EDI model, there still remain some limitations:

  1. Extreme lighting changes, such as suddenly turning the light on/off, or moving from dark indoor scenes to outdoor scenes;

  2. Accumulated event errors reduce the quality of the reconstructed images. However, we integrate over small time intervals from the centre of the exposure time to mitigate this error.

8 Conclusion

In this paper, we propose a multiple Event-based Double Integral (mEDI) model to naturally connect intensity images and event data captured by the event camera, which also takes the blur generation process into account. In this way, our model can be used to not only recover latent sharp images but also reconstruct intermediate frames at high frame-rate. We also propose a simple yet effective method to solve our mEDI model. Due to the simplicity of our optimization process, our method is efficient as well. Extensive experiments show that our method can generate high-quality high frame-rate videos efficiently under different conditions, such as low lighting and complex dynamic scenes.

Acknowledgments

This research was supported in part by the Australian Research Council through the “Australian Centre of Excellence for Robotic Vision” CE140100016, the Natural Science Foundation of China grants (61871325, 61420106007, 61671387, 61603303) and the Australian Research Council (ARC) grants (DE140100180, DE180100628).

References

  • [1] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scale-recurrent network for deep image deblurring,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), June 2018.
  • [2] M. Jin, G. Meishvili, and P. Favaro, “Learning to extract a video sequence from a single motion-blurred image,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), June 2018.
  • [3] C. Scheerlinck, N. Barnes, and R. Mahony, “Continuous-time intensity estimation using event cameras,” arXiv e-prints, Nov. 2018. [Online]. Available: https://arxiv.org/abs/1811.00386
  • [4] L. Pan, C. Scheerlinck, X. Yu, R. Hartley, M. Liu, and Y. Dai, “Bringing a blurry frame alive at high frame-rate with an event camera,” arXiv preprint arXiv:1811.10180, 2018.
  • [5] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 μs latency asynchronous temporal contrast vision sensor,” IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008.
  • [6] C. Brandli, R. Berner, M. Yang, S.-C. Liu, and T. Delbruck, “A 240×180 130 dB 3 μs latency global shutter spatiotemporal vision sensor,” IEEE Journal of Solid-State Circuits, vol. 49, no. 10, pp. 2333–2341, 2014.
  • [7] P. Bardow, A. J. Davison, and S. Leutenegger, “Simultaneous optical flow and intensity estimation from an event camera,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2016, pp. 884–892.
  • [8] C. Reinbacher, G. Graber, and T. Pock, “Real-time intensity-image reconstruction for event cameras using manifold regularisation,” in British Machine Vis. Conf. (BMVC), 2016.
  • [9] S. Barua, Y. Miyatani, and A. Veeraraghavan, “Direct face detection and video reconstruction from event cameras,” in IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2016, pp. 1–9.
  • [10] C. Scheerlinck, N. Barnes, and R. Mahony, “Computing spatial image convolutions for event cameras,” IEEE Robotics and Automation Letters, 2019.
  • [11] P. A. Shedligeri, K. Shah, D. Kumar, and K. Mitra, “Photorealistic image reconstruction from hybrid intensity and event based sensor,” arXiv preprint arXiv:1805.06140, 2018.
  • [12] C. Brandli, L. Muller, and T. Delbruck, “Real-time, high-speed video decompression using a frame- and event-based DAVIS sensor,” in IEEE Int. Symp. Circuits Syst. (ISCAS), Jun. 2014, pp. 686–689.
  • [13] M. Cook, L. Gugelmann, F. Jug, C. Krautz, and A. Steger, “Interacting maps for fast visual interpretation,” in Int. Joint Conf. Neural Netw. (IJCNN), 2011, pp. 770–776.
  • [14] H. Kim, A. Handa, R. Benosman, S.-H. Ieng, and A. J. Davison, “Simultaneous mosaicing and tracking with an event camera,” in British Machine Vis. Conf. (BMVC), 2014.
  • [15] H. Kim, S. Leutenegger, and A. J. Davison, “Real-time 3D reconstruction and 6-DoF tracking with an event camera,” in Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 349–364.
  • [16] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, and D. Scaramuzza, “Semi-dense 3d reconstruction with a stereo event camera,” arXiv preprint arXiv:1807.07429, 2018.
  • [17] H. Rebecq, T. Horstschäfer, G. Gallego, and D. Scaramuzza, “EVO: A geometric approach to event-based 6-DOF parallel tracking and mapping in real-time,” IEEE Robot. Autom. Lett., vol. 2, 2017.
  • [18] A. Z. Zhu, N. Atanasov, and K. Daniilidis, “Event-based visual inertial odometry,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2017, pp. 5816–5824.
  • [19] A. Zhu, L. Yuan, K. Chaney, and K. Daniilidis, “Ev-flownet: Self-supervised optical flow estimation for event-based cameras,” in Proceedings of Robotics: Science and Systems, Pittsburgh, Pennsylvania, June 2018.
  • [20] D. Gehrig, H. Rebecq, G. Gallego, and D. Scaramuzza, “Asynchronous, photometric feature tracking using events and frames,” in Eur. Conf. Comput. Vis. (ECCV), 2018.
  • [21] B. Kueng, E. Mueggler, G. Gallego, and D. Scaramuzza, “Low-latency visual odometry using event-based feature tracks,” in IEEE/RSJ Int. Conf. Intell. Robot. Syst. (IROS), Daejeon, Korea, Oct. 2016, pp. 16–23.
  • [22] C. Posch, D. Matolin, and R. Wohlgenannt, “A QVGA 143dB dynamic range asynchronous address-event PWM dynamic image sensor with lossless pixel-level video compression,” in IEEE Intl. Solid-State Circuits Conf. (ISSCC), Feb. 2010, pp. 400–401.
  • [23] H.-C. Liu, F.-L. Zhang, D. Marshall, L. Shi, and S.-M. Hu, “High-speed video generation with an event camera,” The Visual Computer, vol. 33, no. 6-8, pp. 749–759, 2017.
  • [24] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman, “Removing camera shake from a single photograph,” ACM Trans. Graph., vol. 25, pp. 787–794, 2006.
  • [25] D. Krishnan, T. Tay, and R. Fergus, “Blind deconvolution using a normalized sparsity measure,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2011, pp. 233–240.
  • [26] L. Sun, S. Cho, J. Wang, and J. Hays, “Edge-based blur kernel estimation using patch priors,” in IEEE Int. Conf. Comput. Photography (ICCP), 2013.
  • [27] L. Xu, S. Zheng, and J. Jia, “Unnatural l0 sparse representation for natural image deblurring,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2013, pp. 1107–1114.
  • [28] W.-S. Lai, J.-J. Ding, Y.-Y. Lin, and Y.-Y. Chuang, “Blur kernel estimation using normalized color-line prior,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2015, pp. 64–72.
  • [29] J. Pan, D. Sun, H. Pfister, and M.-H. Yang, “Deblurring images via dark channel prior,” IEEE Trans. Pattern Anal. Mach. Intell., 2017.
  • [30] Y. Yan, W. Ren, Y. Guo, R. Wang, and X. Cao, “Image deblurring via extreme channels prior,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), July 2017.
  • [31] S. Cho, J. Wang, and S. Lee, “Video deblurring for hand-held cameras using patch-based synthesis,” ACM Transactions on Graphics (TOG), vol. 31, no. 4, p. 64, 2012.
  • [32] T. Hyun Kim and K. Mu Lee, “Generalized video deblurring for dynamic scenes,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2015, pp. 5426–5434.
  • [33] A. Sellent, C. Rother, and S. Roth, “Stereo video deblurring,” in Eur. Conf. Comput. Vis. (ECCV).   Springer, 2016, pp. 558–575.
  • [34] L. Pan, Y. Dai, M. Liu, and F. Porikli, “Simultaneous stereo video deblurring and scene flow estimation,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), July 2017.
  • [35] ——, “Depth map completion by jointly exploiting blurry color images and sparse depth maps,” in Applications of Computer Vision (WACV), 2018 IEEE Winter Conference on.   IEEE, 2018, pp. 1377–1386.
  • [36] J. Sun, W. Cao, Z. Xu, and J. Ponce, “Learning a convolutional neural network for non-uniform motion blur removal,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2015, pp. 769–777.
  • [37] D. Gong, J. Yang, L. Liu, Y. Zhang, I. Reid, C. Shen, A. van den Hengel, and Q. Shi, “From motion blur to motion flow: A deep learning solution for removing heterogeneous motion blur,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), 2017, pp. 2319–2328.
  • [38] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), July 2017.
  • [39] K. Purohit, A. Shah, and A. Rajagopalan, “Bringing alive blurred moments!” arXiv preprint arXiv:1804.02913, 2018.
  • [40] E. Mueggler, H. Rebecq, G. Gallego, T. Delbruck, and D. Scaramuzza, “The event-camera dataset and simulator: Event-based data for pose estimation, visual odometry, and slam,” The International Journal of Robotics Research, vol. 36, no. 2, pp. 142–149, 2017.
  • [41] J. Zhang, J. Pan, J. Ren, Y. Song, L. Bao, R. W. Lau, and M.-H. Yang, “Dynamic scene deblurring using spatially variant recurrent neural networks,” in IEEE Conf. Comput. Vis. Pattern Recog. (CVPR), June 2018.
  • [42] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C. Cambridge University Press, 1988.