Motion-from-Blur: 3D Shape and Motion Estimation of Motion-blurred Objects in Videos

11/29/2021
by Denys Rozumnyi, et al.
Google
ETH Zurich

We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video. To this end, we model the blurred appearance of a fast moving object in a generative fashion by parametrizing its 3D position, rotation, velocity, acceleration, bounces, shape, and texture over the duration of a predefined time window spanning multiple frames. Using differentiable rendering, we are able to estimate all parameters by minimizing the pixel-wise reprojection error to the input video via backpropagating through a rendering pipeline that accounts for motion blur by averaging the graphics output over short time intervals. For that purpose, we also estimate the camera exposure gap time within the same optimization. To account for abrupt motion changes like bounces, we model the motion trajectory as a piece-wise polynomial, and we are able to estimate the specific time of the bounce at sub-frame accuracy. Experiments on established benchmark datasets demonstrate that our method outperforms previous methods for fast moving object deblurring and 3D reconstruction.



1 Introduction

3D object reconstruction from 2D images is one of the key tasks in computer vision [sfm, nerf, schoenberger2016sfm, schoenberger2016mvs]. It allows better modeling of the underlying 3D world. Applications of 3D object reconstruction are broad, ranging from robotic mapping [han2021reconstructing] to augmented reality [Xu_2018_CVPR_Workshops]. Even though some recent methods deal with the extreme and under-constrained case of reconstructing 3D objects from a single 2D image [pixelnerf, what3d_cvpr19], most methods take advantage of a multi-view setting [nerf, sfm, schoenberger2016sfm, schoenberger2016mvs]. However, all generic 3D object reconstruction methods assume that the object moves slowly compared to the camera frame rate, resulting in sharp 2D images. The task of 3D object reconstruction becomes much more challenging when the object moves fast during the camera exposure time, resulting in a motion-blurred 2D image. The Shape-from-Blur (SfB) method [sfb] tackled this challenging scenario to extract 3D shape and motion from a single motion-blurred image of the object. This scenario is difficult because motion blur makes the input image noisier, and many high-frequency details are lost. On the other hand, even a single image potentially provides several views of the object, which are averaged by motion blur into one frame. SfB [sfb] explicitly modeled this phenomenon and successfully exploited it.

Figure 1: Reconstructing 3D shape and motion of a motion-blurred falling key (panels: input, estimated shape & motion, novel views, temporal super-resolution). We jointly optimize over multiple input frames to estimate a single 3D textured mesh and corresponding motion model (blue: observed trajectory, yellow: the exposure gap). Temporal super-resolution (TSR) is one of the applications of the proposed Motion-from-Blur method.

In this paper, we go beyond previous methods by estimating the 3D object’s shape and its motion from a series of motion-blurred video frames. To achieve this, we optimize all parameters jointly over multiple input frames (i.e. the object’s 3D shape and texture, as well as its 3D motion). We constrain the object’s 3D shape and texture to be constant over all frames. Due to the longer time intervals involved, we must model more complex object motions (3D translation and 3D rotation) than necessary for a single motion-blurred frame [sfb], e.g. the acceleration of a falling object (Fig. 1), or a ball bouncing against a wall (Fig. 3). Using multiple frames also comes with an additional challenge: the camera shutter opens and closes at set time intervals, leading to a gap in the object’s visible trajectory and appearance. To properly succeed in our task, we must also recover this exposure gap. For a single frame only (as in [sfb]), the motion direction (forward vs. backward motion along the estimated axis) is ambiguous. For instance, in Fig. 1, the key could be translating from top to bottom or vice-versa, both resulting in the same input image. Since we consider multiple frames jointly, the motion direction is no longer ambiguous and can always be recovered. Moreover, for rotating objects, we can reconstruct a more complete 3D model as we can integrate more observations covering its entire surface. In contrast, previous single-frame work [sfb] produces strong artifacts on unseen object parts. An example of our method’s output and an application to temporal super-resolution is shown in Fig. 1.
To summarize, we make the following contributions:

  1. We propose a method called Motion-from-Blur (MfB) that jointly estimates the 3D motion, 3D shape, and texture of motion-blurred objects in videos by optimizing over multiple blurred frames. Motion-from-Blur is the first method to optimize over a video sequence instead of a single frame.

  2. Our multi-frame optimization enables the estimation of the motion direction as well as more complex object motions such as acceleration and abrupt direction changes, e.g. bounces, for both 3D translation and 3D rotation. Moreover, compared to single-frame approaches, our estimates are more consistent over time, always recover the correct motion direction, and yield more complete 3D shape reconstructions.

  3. As a requirement to model multiple frames, we estimate the exposure gap as part of the proposed optimization.

The code and models will be made publicly available.

2 Related work

Many methods have been proposed for generic deblurring, e.g. [Chi_2021_CVPR, Li_2021_CVPR, Pan_2020_CVPR, Zhang_2020_CVPR, Suin_2020_CVPR, Kaufman_2020_CVPR, Kupyn_2018_CVPR, Kupyn_2019_ICCV]. A related task of frame interpolation or temporal super-resolution is studied in [Gui_2020_CVPR, Niklaus_CVPR_2020, Shen_2020_CVPR, Jin_2019_CVPR, Pan_2020_CVPR, Ding_2021_CVPR, Siyao_2021_CVPR, Jin_2018_CVPR]. However, none of the generic deblurring methods work on extremely motion-blurred objects, as shown in [fmo], and specific methods are required.

We focus on deblurring and 3D reconstruction of highly motion-blurred objects. These are called fast moving objects as defined in [fmo]: objects that move over distances larger than their size within the exposure time of one image. Detection and tracking of such objects are usually done by classical image processing methods [fmo, tbd, tbd_ijcv] or, more recently, by deep learning [fmodetect, fmo_segmentation].

Single-frame deblurring of fast moving objects. The first methods for fast moving object deblurring [fmo, kotera2018] assumed an object with a constant 2D appearance $F$ and 2D shape mask $M$. Hence, the object was represented by a single 2D image patch that could only be rigidly translated and rotated in 2D. They defined the image formation model for such objects as the blending of the blurred object appearance and the background $B$:

$I = H \ast F + \big(1 - H \ast M\big)\, B,$     (1)

where the motion blur is modeled by the convolution of the sharp object appearance with its trajectory, defined by the blur kernel $H$. Several follow-up methods [tbd, tbd_ijcv, tbdnc, sroubek2020, tbd3d, kotera2020, fmodetect] were proposed to solve for $F$, $M$, and $H$ given the input image $I$ and background $B$. They approximate the solution in a least-squares sense by energy minimization with suitable regularizers, summarized by a function $R$:

$\min_{F, M, H} \big\| H \ast F + (1 - H \ast M)\, B - I \big\|_2^2 + R(F, M, H).$     (2)

As common in blind deblurring problems [kotera2018], they deploy alternating minimization w.r.t. the object $(F, M)$ and the trajectory $H$ separately in a loop. Optimization is made possible thanks to many regularizers such as appearance total variation, blur kernel sparsity [kotera2018, tbd, tbd_ijcv], a motion blur prior for curves [sroubek2020], and appearance and mask rotational symmetry [tbd3d], among others. All of these methods share the same drawback that stems from the underlying image formation model (1), which assumes a constant 2D object appearance.
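To make this classical formation model concrete, the following is a minimal NumPy sketch of Eq. (1), assuming the object patch, mask, and blur kernel have already been placed on a full-size canvas; the function and variable names are illustrative and not taken from any released codebase.

```python
import numpy as np
from scipy.signal import fftconvolve

def fmo_compose(F, M, H, B):
    """Classic fast-moving-object formation model (Eq. 1):
    I = H * F + (1 - H * M) B, with '*' denoting 2D convolution.

    F: (h, w, 3) sharp object appearance on a full-size canvas
    M: (h, w)    object mask on the same canvas
    H: (h, w)    blur kernel (2D trajectory), non-negative, summing to 1
    B: (h, w, 3) background image
    """
    blurred_F = np.stack(
        [fftconvolve(F[..., c], H, mode="same") for c in range(3)], axis=-1)
    blurred_M = fftconvolve(M, H, mode="same")
    return blurred_F + (1.0 - blurred_M)[..., None] * B
```

Alternating minimization of Eq. (2) then updates $(F, M)$ with $H$ fixed and vice versa, each step regularized as described above.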

TbD-3D [tbd3d] extended the image formation model to support fast moving objects with a piece-wise constant 2D appearance as

$I = \sum_i H_i \ast F_i + \Big(1 - \sum_i H_i \ast M_i\Big)\, B,$     (3)

where the trajectory is split into several pieces $H_i$, assuming that along each piece the object appearance $F_i$ and mask $M_i$ are constant. All unknowns are again estimated by energy minimization with additional problem-specific priors, e.g. that the object appearance in neighboring pieces is similar.

Later, DeFMO [defmo] was the first learning-based method for fast moving object deblurring, and it generalized the image formation model further to objects with a 2D appearance that can change arbitrarily along the trajectory:

$I = \int_0^1 \big[\, M_t\, F_t + (1 - M_t)\, B \,\big]\, dt,$     (4)

where the object appearance $F_t$ and mask $M_t$ at each time $t$ are modeled by an encoder-decoder network. The network places $F_t$ and $M_t$ at the right image location, directly encoding the object trajectory. Although trained on synthetic ShapeNet data [shapenet2015], DeFMO was shown to generalize to real-world images.


Figure 2: Overview of Motion-from-Blur (MfB). For a video of a motion-blurred object, we estimate its 3D motion, 3D shape, and texture. From right to left, the pipeline can be interpreted as a generative model: starting from all parameters for an object and its motion, we render high-frame-rate videos with the object appearance (foreground) and its silhouette. Together with the known background, we generate a motion-blurred video of the object that should match the input video as closely as possible. At test time, we optimize all object parameters (and the exposure gap) of this inverse problem by backpropagating the image differences through the differentiable renderer (left to right). We initialize the optimization using the DeFMO method [defmo], which provides rough silhouettes of the blurred object. MfB models a piece-wise smooth motion path to allow for a motion discontinuity like a bounce. Video source: YouTube.

Single-frame 3D reconstruction of fast moving objects. The only prior work capable of 3D reconstruction of fast moving objects is Shape-from-Blur [sfb]. Instead of merely recovering the 2D object projections $F_t, M_t$, it reconstructs the object’s 3D shape mesh $\mathcal{M}$ as well as its 3D motion. The latter is represented by the 3D translation $T_0$ and 3D rotation $R_0$, defining the object’s pose at the beginning of the exposure time ($t = 0$), and the offsets $\Delta T$ and $\Delta R$, moving the object to its pose at the end of the exposure time ($t = 1$). With these definitions, the image formation model becomes

$I = \int_0^1 \Big[\, \mathcal{R}_S(\mathcal{M}_t)\, \mathcal{R}_F(\mathcal{M}_t) + \big(1 - \mathcal{R}_S(\mathcal{M}_t)\big)\, B \,\Big]\, dt, \qquad \mathcal{M}_t = \tau\big(\mathcal{M},\, T_0 + t\,\Delta T,\, R_0 + t\,\Delta R\big),$     (5)

where the function $\tau$ transforms the mesh by the given 3D translation and 3D rotation. Energy minimization is constructed from (5) to find the mesh and motion parameters that re-render the input image as closely as possible. To make minimization feasible, mesh rendering is made differentiable using Differentiable Interpolation-Based Rendering [dibr], denoted by $\mathcal{R}_F$ for the appearance and $\mathcal{R}_S$ for the 2D object silhouette. To differentiate them from 2D masks $M_t$, silhouettes denote real renderings of a 3D object mesh. In contrast to Shape-from-Blur, our method models more complex trajectories, estimates the exposure gap, and takes several frames into account jointly, thereby allowing temporally consistent predictions and more completely reconstructed 3D shape models.

3D shape from sharp images. Many methods for 3D reconstruction have been proposed, both for the single-frame [pixelnerf, pixel2mesh, what3d_cvpr19, Richter_2018_CVPR, Fan_2017_CVPR] and the multi-frame setting [nerf, sfm, schoenberger2016sfm, schoenberger2016mvs]. But these methods assume sharp objects in the scene (the methods listed in the previous paragraphs are the only ones dedicated to fast moving objects). In other words, they assume that the object moves slowly compared to the camera frame rate (or, equivalently, that the camera moves slowly).

3 Method

When images are captured by a conventional camera, the camera opens its shutter to allow the right amount of light to reach the camera sensor. Then, the shutter closes, and the whole process is repeated until the required number of frames is captured. This physical reality of the camera capturing process leads to two phenomena, which we model and exploit in our optimization. The first one is the motion blur that appears when the object moves while the shutter is open. The second one is the exposure gap that makes the camera 'blind' when the shutter is closed, thus not observing the moving object for some parts of its motion.

We assume the input is a video stream of $N$ RGB images $I_1, \dots, I_N$ depicting a fast moving object. The desired output of our method is a single textured 3D object mesh $\mathcal{M}$, its motion parameters consisting of a continuous 3D translation $T(t)$ and 3D rotation $R(t)$ at every point in time during the video duration, and the exposure gap $\epsilon$ (a real-valued parameter). Sec. 3.1 introduces these parameters and a video formation model that generates video frames for given parameters: if we knew the true values of all parameters, we could re-render the input video. Then, in Sec. 3.2, we show how to optimize these parameters to re-render the input video frames as closely as possible.

3.1 Modeling

Mesh modeling. The mesh parameters consist of an index to a prototype mesh, vertex offsets from its initial vertex positions to deform the mesh, and the texture map. We use a set of prototype meshes to account for varying mesh complexity and different genus numbers. Our set of prototype meshes contains a torus and two spheres with a different number of vertices. The texture mapping from vertices to the 2D location on the texture map is assumed to be fixed. Similarly, the mesh triangular faces consist of fixed sets of edges that connect vertices.
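As a rough illustration of this parameterization (not the released implementation), the mesh can be thought of as a prototype with learnable per-vertex offsets and a learnable texture map, while faces and UV mapping stay fixed; the class and argument names below are hypothetical.

```python
import torch

class DeformableMesh(torch.nn.Module):
    """Sketch of the mesh parameterization: prototype vertices + learnable
    per-vertex offsets + learnable texture map; faces and UVs stay fixed."""

    def __init__(self, proto_vertices, proto_faces, tex_size=256):
        super().__init__()
        self.register_buffer("proto_vertices", proto_vertices)  # (V, 3), fixed prototype
        self.register_buffer("faces", proto_faces)               # (F, 3), fixed connectivity
        self.offsets = torch.nn.Parameter(torch.zeros_like(proto_vertices))
        self.texture = torch.nn.Parameter(torch.ones(3, tex_size, tex_size))  # init white

    def vertices(self):
        v = self.proto_vertices + self.offsets
        # keep the mesh in canonical space: zero mean, unit variance (cf. Sec. 3.2)
        v = v - v.mean(dim=0, keepdim=True)
        return v / v.std().clamp(min=1e-8)
```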

Motion modeling. The object motion is composed of continuous 3D translations $T(t)$ and 3D rotations $R(t)$, the latter represented by quaternions. Both translations and rotations are viewed from the camera perspective, which is assumed to be static. We assume that they are defined at all points in time $t$ spanning the duration of the entire input video. We implement the functions $T(t)$ and $R(t)$ as piece-wise polynomials, and their parameters are the polynomial coefficients. More precisely, we use piece-wise quadratic functions with two connected pieces, which are able to model one bounce as well as accelerating motions (e.g. a falling object).
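A minimal sketch of such a two-piece quadratic trajectory is given below, with the two pieces constrained to meet at a (possibly learnable) bounce time so the path stays connected while its velocity may change abruptly; the function and variable names are illustrative only, and the same parameterization can be used for the translation and, after normalization, for the rotation quaternion coefficients.

```python
import torch

def piecewise_quadratic(t, coeffs_a, coeffs_b, t_bounce):
    """Evaluate a two-piece quadratic trajectory at times t in [0, 1].

    coeffs_a, coeffs_b: (D, 3) polynomial coefficients [c0, c1, c2] per output
    dimension, so x(t) = c0 + c1*t + c2*t^2 on each piece. The second piece is
    shifted so both pieces agree at t_bounce: the path is continuous, but its
    derivative may jump there (a bounce)."""
    t = t.reshape(-1, 1)                                        # (K, 1)
    powers = torch.cat([torch.ones_like(t), t, t * t], dim=-1)  # (K, 3)
    piece_a = powers @ coeffs_a.T                               # (K, D)
    tb = torch.tensor([1.0, t_bounce, t_bounce ** 2])
    offset = tb @ coeffs_a.T - tb @ coeffs_b.T                  # continuity at t_bounce
    piece_b = powers @ coeffs_b.T + offset
    return torch.where(t < t_bounce, piece_a, piece_b)
```

When there is no bounce, the optimization can simply drive both sets of coefficients towards the same polynomial, recovering a single smooth quadratic trajectory.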

Exposure modeling. We denote the exposure gap by a real-valued parameter $\epsilon$ that represents the fraction of the duration of a frame during which the camera shutter is closed. In other words, it is the duration of the closed shutter divided by the duration of one shutter cycle. A hypothetical full-exposure camera that never closes its shutter would have $\epsilon = 0$. In most cases, conventional cameras set the exposure gap close to 0 in dark environments to gather as much light as possible and close to 1 in very bright environments to avoid overexposure. Typically, smaller exposure gaps lead to more motion blur in the image.

Video formation model. The video formation model is the core of our method. It renders a video frame for a given set of all above-mentioned parameters:

$\hat{I}_i = \frac{1}{t_i^c - t_i^o} \int_{t_i^o}^{t_i^c} \Big[\, \mathcal{R}_S\big(\tau(\mathcal{M}, T(t), R(t))\big)\, \mathcal{R}_F\big(\tau(\mathcal{M}, T(t), R(t))\big) + \Big(1 - \mathcal{R}_S\big(\tau(\mathcal{M}, T(t), R(t))\big)\Big)\, B \,\Big]\, dt,$     (6)

where the interval bounds for frame $i$ go from the beginning of its exposure time, when the shutter opens at time $t_i^o$, to the end of its exposure time, when the shutter closes at time $t_i^c = t_i^o + (1 - \epsilon)$ (in units of one frame duration). Consequently, the object is not observed between $t_i^c$ and $t_{i+1}^o$. As defined previously, the function $\tau$ first rotates the mesh by the 3D rotation $R(t)$ and then moves it by the 3D translation $T(t)$. Mesh rendering is implemented by Differentiable Interpolation-Based Rendering [dibr], denoted by $\mathcal{R}_F$ for the appearance and by $\mathcal{R}_S$ for the silhouette. Like all previous methods for fast moving object deblurring, we compute the background $B$ as the median of all frames in the input video. Note that our modeling is a strict generalization of SfB [sfb], which corresponds to the case of a single frame and linear motion.
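The following is a hedged sketch of how the formation model (6) can be discretized in practice (we use 8 sub-frame samples per frame, see Sec. 3.2, Implementation); the renderer calls stand in for the DIB-R appearance and silhouette renderers, and all function and variable names are placeholders rather than the actual implementation.

```python
import torch

def render_blurred_frame(i, eps, mesh, T, R, render_rgb, render_sil, background, n_sub=8):
    """Discretized video formation model (Eq. 6) for frame i (0-based).

    Frame i spans global time [i, i + 1); the shutter is open on [i, i + 1 - eps].
    T(t), R(t) give the pose at time t; render_rgb / render_sil are stand-ins for
    the differentiable appearance and silhouette renderers (e.g. DIB-R)."""
    t_open, t_close = float(i), float(i) + 1.0 - eps
    times = torch.linspace(t_open, t_close, n_sub)        # sub-frame samples of the exposure
    acc = torch.zeros_like(background)
    for t in times:
        posed = mesh.transformed(T(t), R(t))              # tau(M, T(t), R(t))
        rgb, sil = render_rgb(posed), render_sil(posed)
        acc = acc + sil * rgb + (1.0 - sil) * background  # composite over the background
    return acc / n_sub                                    # average over the open-shutter interval

# As in prior FMO work, the background can be taken as the per-pixel median of the
# input frames, e.g.: background = frames.median(dim=0).values  (frames: (N, 3, H, W))
```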

3.2 Model fitting

This section presents an optimization method to fit the introduced model to the given input video.

Loss function. The main driving force of the proposed approach is the video reconstruction loss

$\mathcal{L}_I = \sum_{i=1}^{N} \big\| \hat{I}_i - I_i \big\|.$     (7)

This loss is low if the frames $\hat{I}_i$ rendered by our model via Eq. (6) closely resemble the input frames $I_i$.

In order to make the optimization easier and well-behaved, we apply auxiliary loss terms and regularizers, similar to [sfb]. We briefly summarize them here and refer to [sfb] for details. The silhouette consistency loss $\mathcal{L}_S$ helps localize the object in the image faster and serves as initialization for estimating the 3D mesh and its translation. First, we run DeFMO [defmo] and use its estimated masks as an approximate object location. To synchronize the motion direction (forward vs. backward) of the DeFMO masks across frames, we minimize the distance between consecutive masks in adjacent frames. Then, $\mathcal{L}_S$ is defined via the intersection over union (IoU) between the DeFMO masks and the 2D mesh silhouettes rendered by our method:

$\mathcal{L}_S = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \Big(1 - \mathrm{IoU}\big(\hat{S}_t, M_t\big)\Big),$     (8)

where $\hat{S}_t$ are the silhouettes rendered by our method and $M_t$ the DeFMO masks at the sampled time instants $\mathcal{T}$.

Furthermore, we add the commonly employed [pixel2mesh, dibr, sfb, tbd_ijcv, kotera2020] total variation and Laplacian regularizers. Total variation on the texture map encourages the model to produce smooth textures, and the Laplacian regularizer promotes smooth meshes. Finally, the joint loss $\mathcal{L}$ is a weighted sum of all four loss terms:

$\mathcal{L} = \mathcal{L}_I + \lambda_S\, \mathcal{L}_S + \lambda_{TV}\, \mathcal{L}_{TV} + \lambda_L\, \mathcal{L}_L.$     (9)
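A sketch of how the joint loss (9) can be assembled is shown below; the reconstruction and IoU terms follow the definitions above, the total-variation and Laplacian terms are the standard regularizers cited in the text, and the exact norms, helper names, and weights (other than the Laplacian weight of 1000 stated in Sec. 3.2) are illustrative assumptions.

```python
import torch

def joint_loss(rendered, inputs, sil, defmo_masks, texture, laplacian_term,
               lam_sil=1.0, lam_tv=1.0, lam_lap=1000.0):
    """Sketch of the joint loss in Eq. (9): reconstruction + silhouette IoU
    + texture total variation + mesh Laplacian (weights illustrative except lam_lap)."""
    # video reconstruction loss, Eq. (7): pixel-wise error between re-rendered and input frames
    loss_rec = (rendered - inputs).abs().mean()
    # silhouette consistency, Eq. (8): 1 - IoU between rendered silhouettes and DeFMO masks
    inter = (sil * defmo_masks).sum(dim=(-2, -1))
    union = (sil + defmo_masks - sil * defmo_masks).sum(dim=(-2, -1))
    loss_sil = (1.0 - inter / union.clamp(min=1e-6)).mean()
    # total variation on the texture map -> smooth textures
    loss_tv = (texture[..., 1:, :] - texture[..., :-1, :]).abs().mean() + \
              (texture[..., :, 1:] - texture[..., :, :-1]).abs().mean()
    # Laplacian regularizer on mesh vertices -> smooth meshes (computed elsewhere)
    return loss_rec + lam_sil * loss_sil + lam_tv * loss_tv + lam_lap * laplacian_term
```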
[Figure 3, rows top to bottom: Inputs, DeFMO [defmo], SfB [sfb], MfB (ours), GT.]
Figure 3: Estimating 3D shape and motion of a motion-blurred volleyball, shown as temporal super-resolution. The proposed Motion-from-Blur (MfB) method is the first to use multiple video frames during optimization and the first to model complex trajectories with bounces, accounting for the exposure gap. The previous methods for FMO deblurring (DeFMO) and single-frame 3D reconstruction (SfB) have difficulties reconstructing the bounce as they get confused by the ball’s shadow due to the lack of multi-frame optimization.

Optimization. Fig. 2 shows an overview of the pipeline. We backpropagate the joint loss up to the mesh $\mathcal{M}$, the motion parameters, and the exposure gap $\epsilon$. Optimization is done with the ADAM optimizer [adam]. In the beginning, we run a pre-optimization for at most 100 iterations with the video reconstruction loss disabled, thus also omitting texture map updates. Pre-optimization stops when the silhouette loss drops below a threshold, meaning that the mesh silhouettes reach a sufficiently high average IoU with the DeFMO masks. This pre-optimization phase is required since the 3D translation has to put the mesh at approximately the right location in the image to get a training signal for the video reconstruction loss, which in turn drives the estimation of the texture map, 3D object rotation, and 3D shape. The more video frames are used, the more important this step becomes because the object's 2D location varies more across the frames; experimentally, for larger numbers of input frames the optimization never converges without pre-optimization. We optimize over the mesh prototypes by running the optimization for each prototype and choosing the best one based on the lowest value of the video reconstruction loss (7). During optimization, the mesh is always kept in canonical space by normalizing the vertices to zero mean and unit variance. The main optimization is run for 1000 iterations using the full loss (9). The hyperparameter $\lambda_L$ of the Laplacian regularizer is set to 1000 experimentally. Both the texture total variation and silhouette consistency losses are left at their default weight, which worked well in our experiments.

Initialization. The mesh parameters are initialized to the prototype shape with zero vertex offsets and a white texture map. The motion parameters are initialized such that the object is placed in the middle of the image with zero rotation. Finally, the exposure gap is initialized to a fixed default value.
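The two-stage fitting described above can be summarized by the following sketch; the learning rate, the stopping threshold of the pre-optimization, and the function names are illustrative assumptions rather than the exact values used in our implementation.

```python
import torch

def fit_model(params, silhouette_loss, full_loss, lr=1e-2,
              pre_iters=100, main_iters=1000, sil_threshold=0.2):
    """Sketch of the two-stage optimization: silhouette-only pre-optimization,
    then joint refinement with the full loss (9). lr and sil_threshold are illustrative."""
    opt = torch.optim.Adam(params, lr=lr)
    # Stage 1: coarse 2D localization via the silhouette consistency loss only
    for _ in range(pre_iters):
        opt.zero_grad()
        loss = silhouette_loss()
        loss.backward()
        opt.step()
        if loss.item() < sil_threshold:   # mesh silhouettes roughly overlap the DeFMO masks
            break
    # Stage 2: joint refinement of shape, texture, motion, and exposure gap
    for _ in range(main_iters):
        opt.zero_grad()
        full_loss().backward()
        opt.step()
    return params
```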

Implementation. We use PyTorch [pytorch] with Kaolin [kaolin] for differentiable rendering. All integrals in each frame are discretized by splitting the time interval into 8 evenly-spaced pieces. All experiments are run on an Nvidia GTX 1080Ti GPU, with the average runtime per frame measured in seconds.

4 Experiments

We evaluate our method’s accuracy by measuring the deblurring quality on 3 real-world datasets from the fast moving object deblurring benchmark [defmo]. Since there are no real image datasets of fast moving objects with associated ground-truth 3D shapes and motion, we follow the protocol of [sfb] and evaluate the quality of reconstructed 3D meshes, 3D translations, and 3D rotations on a synthetic dataset.

Method                        | Falling Objects [kotera2020] | TbD-3D Dataset [tbd3d] | TbD Dataset [tbd]
                              | TIoU   PSNR   SSIM           | TIoU   PSNR   SSIM     | TIoU   PSNR   SSIM
Jin et al. [Jin_2018_CVPR]    | N/A    23.54  0.575          | N/A    24.52  0.590    | N/A    24.90  0.530
DeblurGAN [Kupyn_2019_ICCV]   | N/A    23.36  0.588          | N/A    23.58  0.603    | N/A    24.27  0.537
TbD [tbd]                     | 0.539  20.53  0.591          | 0.598  18.84  0.504    | 0.542  23.22  0.605
TbD-3D [tbd3d]                | 0.539  23.42  0.671          | 0.598  23.13  0.651    | 0.542  25.21  0.674
DeFMO [defmo]                 | 0.684  26.83  0.753          | 0.879  26.23  0.699    | 0.550  25.57  0.602
SfB [sfb]                     | 0.701  27.18  0.760          | 0.921  26.54  0.722    | 0.610  25.66  0.659
MfB (ours)                    | 0.772  27.54  0.765          | 0.927  26.57  0.728    | 0.614  26.63  0.678
Table 1: Fast moving object deblurring benchmark. We compare the proposed MfB method to generic deblurring methods [Kupyn_2019_ICCV, Jin_2018_CVPR] (no trajectory output, thus TIoU is undefined) and to methods specifically designed for fast moving object deblurring [tbd, tbd3d, defmo, sfb].
Subset | Method      | TIoU   PSNR   SSIM
full   | SfB [sfb]   | 0.921  26.54  0.722
       | MfB (ours)  | 0.927  26.57  0.728
bnc    | SfB [sfb]   | 0.892  21.77  0.628
       | MfB (ours)  | 0.902  25.01  0.643
bnc    | SfB [sfb]   | 0.863  20.77  0.595
       | MfB (ours)  | 0.889  24.57  0.620
Table 2: Deblurring quality at bounces. We compare scores on the full TbD-3D dataset [tbd3d] and on two subsets: frames at bounces (bnc), and additionally frames that are immediately before and after the bounce. The proposed multi-frame MfB is significantly more accurate at bounces (Fig. 3) than the single-frame SfB, especially on the deblurring metric PSNR.

Fast moving object deblurring benchmark. It consists of 3 datasets of varying difficulty. The easiest one is TbD [tbd], which contains mostly spherical objects with uniform color (12 sequences, 471 frames in total). A more difficult dataset is TbD-3D [tbd3d], which contains mostly spherical objects with complex textures that move with significant 3D rotation (10 sequences, 516 frames in total). The most difficult dataset is Falling Objects [kotera2020], with objects of various shapes and complex textures (6 sequences, 94 frames in total). The ground truth for these datasets was recorded by a high-speed camera capturing the moving object without motion blur, so there are 8 high-speed frames for each frame input to our method. We measure the deblurring quality by reconstructing the high-speed camera footage as temporal super-resolution. For that, we apply the video formation model (6) at an 8 times finer temporal resolution, using the object parameters estimated by optimization on the input slow-speed frames. Then, the reconstructed high-speed camera frames and the ground-truth ones are compared by the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) metrics. Additionally, these datasets contain ground-truth 2D object trajectories and 2D object masks. Therefore, we also measure the trajectory intersection over union (TIoU), defined as the IoU between the ground-truth mask placed at the ground-truth 2D location and at the reconstructed 2D location (averaged over time). We reconstruct the 2D object location for our method as the center of mass of the projected mesh silhouette at each high-speed frame.
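For clarity, the TIoU metric described above can be sketched as follows, assuming for brevity that both placements fall fully inside the image canvas; all names are illustrative.

```python
import numpy as np

def tiou(gt_mask, gt_centers, pred_centers, canvas_hw):
    """Trajectory IoU (sketch): place the ground-truth mask at the ground-truth and
    at the predicted 2D locations, compute their IoU, and average over time.
    Assumes both placements fit fully inside the canvas."""
    h, w = gt_mask.shape

    def place(center):
        canvas = np.zeros(canvas_hw, dtype=bool)
        y, x = int(round(center[0] - h / 2)), int(round(center[1] - w / 2))
        canvas[y:y + h, x:x + w] = gt_mask
        return canvas

    ious = []
    for c_gt, c_pred in zip(gt_centers, pred_centers):
        a, b = place(c_gt), place(c_pred)
        union = np.logical_or(a, b).sum()
        ious.append(np.logical_and(a, b).sum() / union if union else 1.0)
    return float(np.mean(ious))
```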

We compare to various state-of-the-art methods: the generic deblurring method DeblurGAN-v2 [Kupyn_2019_ICCV], a generic method for temporal super-resolution [Jin_2018_CVPR], and methods designed for fast moving object deblurring [tbd, tbd3d, defmo, sfb]. All compared methods use each video frame independently, whereas our method is the first to exploit multiple frames simultaneously. We run MfB in a temporal sliding-window approach with a fixed window size if not mentioned otherwise. For each frame, we always choose the window for which the video reconstruction loss (7) is the lowest, measured only on this frame (similar to the best prototype selection).

Table 1 presents the results. MfB outperforms all other methods on all three datasets and for all three metrics. Qualitatively, the estimated temporal super-resolution is more consistent compared to single-frame approaches since MfB explains all frames by a single 3D object mesh and texture (Fig. 5). Novel view synthesis is also considerably better, as the object outline is accurate from all viewpoints, and even the sharp angles of the box (Fig. 5, novel views) are clear. Interestingly, the previous state-of-the-art single-frame 3D reconstruction approach [sfb] produces several artifacts and inconsistencies, and an entirely incorrect 3D shape for object parts that are not visible in a single input frame. Moreover, DeFMO [defmo] and SfB [sfb] fail in the presence of shadows and specularities, whereas MfB better reconstructs the object thanks to additional constraints from neighboring frames (Fig. 5).

Method      | Translation err. | Rotation err. | Shape err.
(rotation up to 90° over 3 frames)
SfB [sfb]   | 37.8 %           | 10.9°         | 3.0 %
MfB (ours)  | 20.0 %           | 6.4°          | 2.7 %
(rotation up to 30°)
SfB [sfb]   | 12.8 %           | 4.8°          | 2.3 %
MfB (ours)  | 8.8 %            | 3.7°          | 2.2 %
Table 3: Evaluating 3D translation, 3D rotation, and 3D shape on a synthetic dataset. First block: dataset with at most 90° rotation over 3 frames; second block: at most 30° rotation. MfB is almost twice as accurate as the single-frame SfB on the large-rotation dataset when measuring 3D translation and 3D rotation errors, and is still significantly better on the small-rotation dataset.

Evaluating at bounces. A unique new feature of our approach is its ability to model bounces, which results in better deblurring in those cases. Here, we evaluate this effect explicitly. To this end, we manually annotate the frames in which a bounce happens in the TbD-3D dataset [tbd3d] (the only dataset with relatively frequent bounces). Overall, we found 38 bounces among the 516 frames of its 10 sequences, which amounts to roughly a 7% chance of a bounce. Since the frames immediately before and after a bounce are usually affected too (e.g. due to a shadow as in Fig. 3), we also evaluate them, yielding a total of 114 frames (roughly 22%). As shown in Table 2, MfB significantly outperforms SfB at bounces, especially in terms of the deblurring quality metric PSNR. The performance gap is still significant when evaluating on frames that are adjacent to the bounce, but is relatively small when averaged over the whole dataset. This indicates that bounces are significantly more difficult than other parts of the dataset, as shown qualitatively in Fig. 4 and Fig. 3, and our method successfully reconstructs such frames as well. For single-frame approaches, the difficulty comes mainly from the trajectory non-linearity, slight object deformation, and shadows near the bounce point. Motion-from-Blur is robust to these difficulties since the optimization is additionally constrained by easier frames before and after the bounce, and the trajectory is explicitly modeled with a bounce. On frames that are far from the bounce, the difference in deblurring quality between the single-frame and multi-frame approaches is marginal on the TbD-3D dataset. Note that our model is generic and estimates continuously connected trajectories also when there is no bounce.

Synthetic 3D dataset. We construct a synthetic dataset of fast moving objects with ground-truth 3D models and 3D motions for evaluation. We sample random 3D models from the ShapeNet dataset [shapenet2015], random linear 3D translations and 3D rotations (for a fair comparison with SfB [sfb], which reconstructs only linear motions), and random consecutive frames from the VOT [VOT_TPAMI] tracking dataset as backgrounds. The 3D translation is randomly chosen in the interval between 1 and 5 object sizes, and the 3D rotation is randomly chosen up to 90° (first subset) or 30° (second subset) over the video duration. Then, we apply the video formation model (6) to create two subsets, each consisting of 30 short videos. We report the mesh error as the average bidirectional distance between the closest vertices of the ground-truth and the estimated mesh, both placed at the ground-truth and predicted initial 6D poses respectively, and divided by the object size. For the translation error, we compute the norm of the difference vector between the predicted and ground-truth translation offsets, divided by the object size. Thus, both the mesh and translation errors are reported as a fraction of the object size. For the rotation error, we compute the average angle between the estimated rotation change (the relative rotation between the first and the last pose) and the ground-truth one.
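A sketch of these three error measures is given below, assuming unit quaternions for the rotation changes and vertex arrays already placed at their respective initial 6D poses; the helper names are illustrative.

```python
import numpy as np

def mesh_error(verts_pred, verts_gt, object_size):
    """Average bidirectional nearest-vertex distance between predicted and
    ground-truth meshes, divided by the object size."""
    d = np.linalg.norm(verts_pred[:, None, :] - verts_gt[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean()) / object_size

def translation_error(delta_t_pred, delta_t_gt, object_size):
    """Norm of the difference between predicted and GT translation offsets, over object size."""
    return np.linalg.norm(delta_t_pred - delta_t_gt) / object_size

def rotation_error(q_delta_pred, q_delta_gt):
    """Angle (degrees) between estimated and GT rotation changes, given as unit quaternions."""
    dot = abs(np.dot(q_delta_pred, q_delta_gt))
    return np.degrees(2.0 * np.arccos(np.clip(dot, -1.0, 1.0)))
```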

We compare to the only other method that can reconstruct a 3D object and its motion from motion-blurred input (SfB [sfb]). Our method is applied jointly to all three video frames of each video, whereas SfB is applied to them individually and the scores are averaged (w.r.t. one video frame). As shown in Table 3, on the synthetic dataset with up to 90° rotation, our method is almost twice as accurate as SfB in terms of 3D translation and 3D rotation estimation. For smaller rotations of up to 30°, the difference is smaller but still significant. This highlights that multi-frame optimization is especially beneficial for complex objects (such as those from ShapeNet) with non-negligible rotations.

Applications. MfB can be used for imitating high-speed cameras or for multiplying their capabilities by creating temporal super-resolution from motion-blurred videos. MfB can also perform 3D reconstruction of blurred objects that are almost unidentifiable by humans, e.g. for image forensics from surveillance cameras. Applications further include 6D object tracking and reconstruction in sports, e.g. football, tennis, or basketball.

Figure 4: Reconstructing 2D object trajectories with bounces. For each video, we reconstruct the 3D object and its motion (blue: observed trajectory, yellow: the exposure gap). We visualize the trajectory of the center of mass of the mesh silhouettes and further render the first and last pose of the object (right-most image). Top row: scene from the TbD [tbd] dataset; center row: TbD-3D [tbd3d] scene; bottom row: YouTube scene from Fig. 2.
[Figure 5, rows top to bottom: Inputs, DeFMO [defmo], SfB [sfb], MfB (ours), GT; right: novel views.]
Figure 5: 3D reconstruction and temporal super-resolution of a falling box from the Falling Objects dataset [kotera2020]. Our method produces more consistent results over the input frames than previous methods and does not suffer from artifacts on frames with shadows. The final 3D reconstruction is also more complete and accurate than that of the single-frame approach SfB [sfb], as shown in the novel views.

5 Limitations

Static camera. MfB assumes that the video is captured by a nearly static camera. A moving camera adds even more ambiguity, since the observed blur could stem from both camera and object motion. Moreover, the blur induced by the camera motion would also have to be compensated, making the whole problem much more difficult. Since all previous methods for fast moving object deblurring and 3D reconstruction [tbd, tbd3d, tbd_ijcv, defmo, sfb] also assume a static camera, tackling this problem remains challenging future work.

Changing and rolling shutter. Currently, we assume that the shutter timing, and hence the exposure gap, is constant. However, some cameras have an adjustable shutter that changes the exposure gap based on lighting conditions, e.g. a shorter exposure for bright scenes and a longer exposure for dark scenes. Nevertheless, this transition is smooth in most cases, and our sliding-window approach should be reasonably robust in such cases. Most digital cameras, such as those in mobile devices, have a rolling shutter that captures a frame line by line. Thus, the motion blur and exposure gap differ for each line in the frame, depending on object speed and location. Modeling a rolling shutter is beyond the scope of this paper. However, we observed that the rolling-shutter effect is small, and optimizing our video formation model without rolling shutter still leads to satisfactory results on many real-world videos.

Texture-less objects. Reconstructing 3D objects that lack noticeable texture is a challenge even for generic 3D reconstruction methods, since no distinctive geometric features are observable and the correspondences are ambiguous. In this case, detecting any 3D rotation is almost infeasible. As observed on the TbD dataset [tbd], which contains mostly uniformly textured objects, our method mostly reports zero rotation for such objects, even if they do rotate, since the rotation is imperceptible. Yet, the reconstructed object translation is mostly correct, with deblurring results outperforming other methods (cf. Table 1).

Non-rigid objects. We assume that the object is rigid, i.e. its 3D model is constant for the video duration. Such an assumption is invalid for deforming objects, and deformation often occurs during a bounce. However, since these deformations are usually small and last only a very short time, our modeling still handles such cases well.

Acknowledgements. This work was supported by a Google Focused Research Award, Innosuisse grant No. 34475.1 IP-ICT, and a research grant by the International Federation of Association Football (FIFA).

6 Conclusion

We presented the first method for estimating the textured 3D shape and complex motion of motion-blurred objects in videos. By optimizing over multiple input frames, we are able to correctly recover the 3D object shape, its motion and motion direction, and the camera exposure gap. Various experiments have shown that our method produces sharper and more consistent results than other methods for fast moving object deblurring. Compared to single-image 3D shape and motion estimation [sfb], which is a special instance of our approach, we recover more complete shapes and significantly more precise motion estimates.

References