3D object reconstruction from 2D images is one of the key tasks in computer vision [sfm, nerf, schoenberger2016sfm, schoenberger2016mvs]. It allows better modeling of the underlying 3D world. Applications of 3D object reconstruction are broad, ranging from robotic mapping [han2021reconstructing] to augmented reality [Xu_2018_CVPR_Workshops]. Even though some recent methods deal with the extreme and under-constrained case of reconstructing 3D objects from a single 2D image [pixelnerf, what3d_cvpr19], most methods take advantage of a multi-view setting [nerf, sfm, schoenberger2016sfm, schoenberger2016mvs]. However, all generic 3D object reconstruction methods assume that the object moves slowly compared to the camera frame rate, resulting in sharp 2D images. The task becomes much more challenging when the object moves fast during the camera exposure time, resulting in a motion-blurred 2D image. The Shape-from-Blur (SfB) method [sfb] tackled this scenario to extract 3D shape and motion from a single motion-blurred image of the object. This scenario is difficult because motion blur makes the input image noisier, and many high-frequency details are lost. On the other hand, even a single image potentially contains several views of the object, which are averaged by motion blur into one frame. SfB [sfb] explicitly modeled this phenomenon and successfully exploited it.
Figure 1 (panels: Input, Shape & Motion, Novel views, TSR). Temporal super-resolution (TSR) is one of the applications of the proposed Motion-from-Blur method.
In this paper, we go beyond previous methods by estimating the 3D object’s shape and its motion from a series of motion-blurred video frames.
To achieve this, we optimize all parameters jointly over multiple input frames (i.e. the object’s 3D shape and texture, as well as its 3D motion).
We constrain the object’s 3D shape and texture to be constant across all frames.
Due to the longer time intervals involved, we must model more complex object motions (3D translation and 3D rotation) than necessary for a single motion-blurred frame [sfb], e.g. the acceleration of a falling object (Fig. 1), or a ball bouncing against a wall (Fig. 3).
Using multiple frames also comes with an additional challenge: the camera shutter opens and closes in set time intervals, leading to a gap in the object’s visible trajectory and appearance.
To properly succeed in our task, we must also recover this exposure gap.
For a single frame only (as in [sfb]), the motion direction (forward vs. backward motion along the estimated axis) is ambiguous.
For instance, in Fig. 1, the key could be translating from top to bottom or vice-versa, both resulting in the same input image.
Since we consider multiple frames jointly, the motion direction is no longer ambiguous and can always be recovered.
Moreover, for rotating objects, we can reconstruct a more complete 3D model as we can integrate more observations covering its total surface.
In contrast, previous single-frame work [sfb] produces strong artifacts on unseen object parts.
An example of our method’s output and an application to temporal super-resolution is shown in Fig. 1.
To summarize, we make the following contributions:
We propose a method called Motion-from-Blur (MfB) that jointly estimates the 3D motion, 3D shape, and texture of motion-blurred objects in videos by optimizing over multiple blurred frames. Motion-from-Blur is the first method to optimize over a video sequence instead of a single frame.
Our multi-frame optimization enables the estimation of the motion direction as well as more complex object motions such as acceleration and abrupt direction changes, e.g. bounces, for both 3D translation and 3D rotation. Moreover, compared to single-frame approaches, our estimates are also more consistent over time, with always correct motion direction, and more complete 3D shape reconstruction.
As a requirement to model multiple frames, we estimate the exposure gap as part of the proposed optimization.
The code and models will be made publicly available.
2 Related work
Many methods have been proposed for generic deblurring, e.g. [Chi_2021_CVPR, Li_2021_CVPR, Pan_2020_CVPR, Zhang_2020_CVPR, Suin_2020_CVPR, Kaufman_2020_CVPR, Kupyn_2018_CVPR, Kupyn_2019_ICCV]. A related task of frame interpolation or temporal super-resolution is studied in [Gui_2020_CVPR, Niklaus_CVPR_2020, Shen_2020_CVPR, Jin_2019_CVPR, Pan_2020_CVPR, Ding_2021_CVPR, Siyao_2021_CVPR, Jin_2018_CVPR]. However, none of the generic deblurring methods work on extremely motion-blurred objects, as shown in [fmo], and specific methods are required.
We focus on deblurring and 3D reconstruction of highly motion-blurred objects. These are called fast moving objects, defined in [fmo] as objects that move over distances larger than their size within the exposure time of one image. Detection and tracking of such objects are usually done by classical image processing methods [fmo, tbd, tbd_ijcv] or, more recently, by deep learning [fmodetect, fmo_segmentation].
Single-frame deblurring of fast moving objects. The first methods for fast moving object deblurring [fmo, kotera2018] assumed an object with a constant 2D appearance $F$ and 2D shape mask $M$. Hence, the object was represented by a single 2D image patch that could only be rigidly translated and rotated in 2D. They defined the image formation model for such objects as the blending of the blurred object appearance and the background $B$:
$$I = H * F + (1 - H * M)\,B, \tag{1}$$
where the motion blur is modeled by the convolution of the sharp object appearance and its trajectory, defined by the blur kernel $H$. Several follow-up methods [tbd, tbd_ijcv, tbdnc, sroubek2020, tbd3d, kotera2020, fmodetect] were proposed to solve for $\{H, F, M\}$ given the input image $I$ and background $B$. They approximate the solution in a least-squares sense by energy minimization with suitable regularizers summarized by the function $R$:
$$\min_{F, M, H} \; \tfrac{1}{2}\,\big\| H * F + (1 - H * M)\,B - I \big\|_2^2 + R(F, M, H). \tag{2}$$
As is common in blind deblurring problems [kotera2018], they deploy alternating minimization w.r.t. the object $\{F, M\}$ and the trajectory $H$ separately in a loop. Optimization is made possible thanks to many regularizers such as appearance total variation, blur kernel sparsity [kotera2018, tbd, tbd_ijcv], motion blur prior for curves [sroubek2020], appearance and mask rotational symmetry [tbd3d], among others. All of these methods share the same drawback that stems from the underlying image formation model (1), which assumes a constant 2D object appearance.
TbD-3D [tbd3d] extended the image formation model to support fast moving objects with a piece-wise constant 2D appearance as
$$I = \sum_i H_i * F_i + \Big(1 - \sum_i H_i * M_i\Big) B, \tag{3}$$
where the trajectory is split into several pieces $H_i$, assuming that along each piece the object appearance $F_i$ and mask $M_i$ are constant. All unknowns are again estimated by energy minimization with additional problem-specific priors, e.g. object appearance in neighboring pieces is similar.
Later, DeFMO [defmo] was the first learning-based method for fast moving object deblurring, and it generalized the image formation model further to objects with a 2D appearance that can change arbitrarily along the trajectory:
$$I = \int_0^1 F_\tau M_\tau \, d\tau + \Big(1 - \int_0^1 M_\tau \, d\tau\Big) B, \tag{4}$$
where the object appearance $F_\tau$ and mask $M_\tau$ are modeled by an encoder-decoder network. The network places $(F_\tau, M_\tau)$ at the right image location, directly encoding the object trajectory. Although trained on synthetic ShapeNet data [shapenet2015], DeFMO was shown to generalize to real-world images.
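To make the constant-appearance blending model concrete, the following is a minimal 1D sketch. It assumes the standard formulation in which the observed image is the blurred appearance plus the background attenuated by the blurred mask. The function names `convolve_same` and `form_image`, the 1D setting, and the toy values are illustrative assumptions and are not taken from any of the cited implementations.

```python
def convolve_same(signal, kernel):
    """1D convolution cropped to len(signal); the kernel is centred."""
    half = len(kernel) // 2
    out = [0.0] * len(signal)
    for i in range(len(signal)):
        for j, weight in enumerate(kernel):
            src = i + half - j  # sample of `signal` hit by kernel tap j
            if 0 <= src < len(signal):
                out[i] += signal[src] * weight
    return out


def form_image(appearance, mask, kernel, background):
    """Blend the blurred appearance over the background (assumed model)."""
    blurred_f = convolve_same(appearance, kernel)
    blurred_m = convolve_same(mask, kernel)
    return [f + (1.0 - m) * b
            for f, m, b in zip(blurred_f, blurred_m, background)]


# Toy example: a one-pixel object of brightness 0.8 moving over 3 pixels.
mask = [0, 0, 0, 1, 0, 0, 0]
appearance = [0.8 * m for m in mask]   # appearance premultiplied by the mask
kernel = [1 / 3, 1 / 3, 1 / 3]         # trajectory kernel, sums to 1
background = [0.2] * 7
image = form_image(appearance, mask, kernel, background)
```

In this toy case the object is visible at each of the three covered pixels for a third of the exposure, so those pixels mix one third of the object brightness with two thirds of the background.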
Single-frame 3D reconstruction of fast moving objects. The only prior work capable of 3D reconstruction of fast moving objects is Shape-from-Blur [sfb]. Instead of merely recovering the 2D object projections, they reconstruct the object’s 3D shape mesh $\Theta$ as well as its 3D motion. The latter is represented by the 3D translation $t$ and 3D rotation $r$, defining the object’s pose at the beginning of the exposure time ($\tau = 0$), and the offsets $\Delta t$ and $\Delta r$, moving the object to its pose at the end of the exposure time ($\tau = 1$). With these definitions, the formation model for the image $I$ over the background $B$ becomes
$$I = \int_0^1 \mathcal{R}_F\big(\mathcal{M}(\Theta,\, r + \tau\,\Delta r,\, t + \tau\,\Delta t)\big)\, d\tau + \Big(1 - \int_0^1 \mathcal{R}_S\big(\mathcal{M}(\Theta,\, r + \tau\,\Delta r,\, t + \tau\,\Delta t)\big)\, d\tau\Big) B, \tag{5}$$
where the function $\mathcal{M}$ transforms the mesh by the given 3D translation and 3D rotation. Energy minimization is constructed from (5) to find the mesh and motion parameters that re-render the input image as closely as possible. To make minimization feasible, mesh rendering is made differentiable using Differentiable Interpolation-Based Rendering [dibr], denoted by $\mathcal{R}_F$ and $\mathcal{R}_S$ for the appearance and 2D object silhouette, respectively. Unlike 2D masks, silhouettes are actual renderings of a 3D object mesh. In contrast to Shape-from-Blur, our method models more complex trajectories, estimates the exposure gap, and takes into account several frames jointly, thereby allowing temporally consistent predictions and more completely reconstructed 3D shape models.
3D shape from sharp images. Many methods for 3D reconstruction have been proposed, both for single-frame [pixelnerf, pixel2mesh, what3d_cvpr19, Richter_2018_CVPR, Fan_2017_CVPR] and multi-frame settings [nerf, sfm, schoenberger2016sfm, schoenberger2016mvs]. However, these methods assume sharp objects in the scene (the methods listed in the previous paragraphs are the only ones dedicated to fast moving objects). In other words, they assume that the object moves slowly compared to the camera frame rate (or, equivalently, that the camera moves slowly).
When images are captured by a conventional camera, the camera opens its shutter to allow the right amount of light to reach the camera sensor. Then, the shutter closes, and the whole process is repeated until the required number of frames is captured. This physical reality of the capturing process leads to two phenomena, which we model and exploit in our optimization. The first one is the motion blur that appears when the object moves while the shutter is open. The second one is the exposure gap that makes the camera ‘blind’ while the shutter is closed, so the moving object is not observed for some parts of its motion.
We assume the input is a video stream of RGB images depicting a fast moving object. The desired output of our method is a single textured 3D object mesh, its motion parameters consisting of a continuous 3D translation and 3D rotation at every point in time during the video, and the exposure gap (a real-valued parameter). Sec. 3.1 introduces these parameters and a video formation model that generates video frames from given parameters. If we knew the true values of all parameters, we could re-render the input video. Then, in Sec. 3.2, we show how to optimize these parameters to re-render the input video frames as closely as possible.
Mesh modeling. The mesh parameters consist of an index to a prototype mesh, vertex offsets from its initial vertex positions to deform the mesh, and the texture map. We use a set of prototype meshes to account for varying mesh complexity and different genus numbers. Our set of prototype meshes contains a torus and two spheres with a different number of vertices. The texture mapping from vertices to the 2D location on the texture map is assumed to be fixed. Similarly, the mesh triangular faces consist of fixed sets of edges that connect vertices.
Motion modeling. The object motion is composed of continuous 3D translations and 3D rotations, the latter represented by quaternions. Both translations and rotations are expressed from the perspective of the camera, which is assumed to be static. We assume that they are defined at all points in time spanning the duration of the entire input video. We implement both functions as piece-wise polynomials whose coefficients are the motion parameters. More precisely, we use piece-wise quadratic functions with two connected pieces, which can model one bounce as well as accelerating motions (e.g. a falling object).
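As an illustration of a piece-wise quadratic trajectory with two connected pieces, here is a minimal sketch for a single scalar coordinate. The parametrization (a split time plus coefficients, with the second piece starting from the end value of the first so the curve stays continuous) is an assumption made for this example; the paper applies the idea to 3D translations and quaternion rotations, which would additionally require handling three coordinates and quaternion normalization.

```python
def piecewise_quadratic(tau, split, first, second):
    """Evaluate a continuous two-piece quadratic at time tau.

    first  = (a0, a1, a2): value, velocity and curvature of piece one.
    second = (b1, b2): velocity and curvature of piece two, measured
             from the joint so that both pieces meet at `split`.
    """
    a0, a1, a2 = first
    if tau <= split:
        return a0 + a1 * tau + a2 * tau * tau
    joint = a0 + a1 * split + a2 * split * split
    b1, b2 = second
    d = tau - split
    return joint + b1 * d + b2 * d * d


# Hypothetical example: height of an object that accelerates downwards,
# hits the ground at tau = 0.5 and bounces back up.
fall = (1.0, 0.0, -4.0)
rebound = (3.0, -4.0)
heights = [piecewise_quadratic(t / 8, 0.5, fall, rebound) for t in range(9)]
```

The velocity is allowed to jump at the joint (here from negative to positive), which is what lets a two-piece model represent a bounce, while a single piece with non-zero curvature already covers an accelerating fall.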
Exposure modeling. We denote the exposure gap as a real-valued parameter that represents the fraction of the duration of a frame during which the camera shutter is closed. In other words, it is the duration of the closed shutter divided by the duration of one shutter cycle. A hypothetical full-exposure camera that never closes its shutter would have an exposure gap of 0. In most cases, conventional cameras set their exposure gap close to 0 in dark environments to collect as much light as possible, and close to 1 in very bright environments to avoid overexposure. Typically, smaller exposure gaps lead to more motion blur in the image.
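The following sketch shows how an exposure gap translates into the time interval during which a frame actually observes the object. It assumes, purely for illustration, that each frame occupies one unit of time, that the shutter opens at the start of the frame, and that the integral over the open interval is sampled at midpoints of 8 evenly spaced pieces; the function names are hypothetical.

```python
def exposure_interval(frame_index, exposure_gap, frame_duration=1.0):
    """Return (open_time, close_time) of the shutter for one frame."""
    if not 0.0 <= exposure_gap < 1.0:
        raise ValueError("exposure gap must lie in [0, 1)")
    open_time = frame_index * frame_duration
    close_time = open_time + (1.0 - exposure_gap) * frame_duration
    return open_time, close_time


def sample_times(open_time, close_time, pieces=8):
    """Midpoints of `pieces` evenly spaced sub-intervals of the exposure."""
    step = (close_time - open_time) / pieces
    return [open_time + (k + 0.5) * step for k in range(pieces)]


# With a gap of 0.25, frame 2 observes the object during [2.0, 2.75]
# and is blind during (2.75, 3.0).
open_time, close_time = exposure_interval(2, 0.25)
times = sample_times(open_time, close_time)
```

A gap of 0 makes consecutive exposure intervals touch, so the observed trajectory has no holes; larger gaps shorten each observed segment and leave unobserved stretches between frames.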
Video formation model. The video formation model is the core of our method. It renders a video frame for a given set of all above-mentioned parameters:
where the interval bounds for each frame go from the beginning of its exposure time, when the shutter opens, to the end of its exposure time, when the shutter closes. Consequently, the object is not observed between the closing of the shutter and its next opening. As defined previously, the mesh transformation function first rotates the mesh by the 3D rotation and then moves it by the 3D translation. Mesh rendering is implemented by Differentiable Interpolation-Based Rendering [dibr], which produces both the object appearance and its silhouette. Like all previous methods for fast moving object deblurring, we compute the background as the median of all frames in the input video. Note that our model strictly generalizes SfB [sfb], which corresponds to a special case with linear motion.
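A discretized version of such a video formation model can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: images are 1D lists, `render` stands in for a differentiable renderer returning an appearance and a silhouette at a given time, and the renderings are averaged (rather than summed) over the sampled exposure times. The per-pixel median background follows the description above.

```python
from statistics import median


def median_background(frames):
    """Per-pixel median over all frames of the video."""
    return [median(pixel_values) for pixel_values in zip(*frames)]


def form_frame(render, times, background):
    """Average renderings over the exposure and blend with the background."""
    n = len(times)
    mean_f = [0.0] * len(background)
    mean_s = [0.0] * len(background)
    for tau in times:
        appearance, silhouette = render(tau)
        for p in range(len(background)):
            mean_f[p] += appearance[p] / n
            mean_s[p] += silhouette[p] / n
    return [mean_f[p] + (1.0 - mean_s[p]) * background[p]
            for p in range(len(background))]


def toy_render(tau, width=4):
    """Hypothetical renderer: a bright one-pixel object sweeping rightwards."""
    position = min(int(tau * width), width - 1)
    silhouette = [1.0 if p == position else 0.0 for p in range(width)]
    appearance = list(silhouette)  # brightness 1.0 inside the silhouette
    return appearance, silhouette


times = [0.125, 0.375, 0.625, 0.875]
frame = form_frame(toy_render, times, [0.4] * 4)
background = median_background([[0.0, 1.0], [2.0, 1.0], [9.0, 1.0]])
```

Because the object covers each pixel for a quarter of the exposure, every pixel of `frame` mixes a quarter of the object brightness with three quarters of the background.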
3.2 Model fitting
This section presents an optimization method to fit the introduced model to the given input video.
Loss function. The main driving force of the proposed approach is the video reconstruction loss
This loss is low if the frames rendered by our model via Eq. (6) closely resemble the input frames.
To make the optimization easier and well-behaved, we apply auxiliary loss terms and regularizers, similar to [sfb]. We briefly summarize them here and refer to [sfb] for details. The silhouette consistency loss helps localize the object in the image faster and serves as initialization for estimating the 3D mesh and its translation. First, we run DeFMO [defmo] and use its estimated masks as an approximate object location. To synchronize the motion direction (forward vs. backward) of the DeFMO masks across frames, we minimize the distance between consecutive masks in adjacent frames. Then, the silhouette consistency loss (8) is defined via the intersection over union (IoU) between the DeFMO masks and the 2D mesh silhouettes rendered by our method.
Furthermore, we add the commonly employed [pixel2mesh, dibr, sfb, tbd_ijcv, kotera2020] total variation and Laplacian regularizers. Total variation on texture maps encourages the model to produce smooth textures, and the Laplacian regularizer promotes smooth meshes. Finally, the joint loss (9) is a weighted sum of all four loss terms.
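As an illustration of two of the auxiliary terms, the sketch below computes an IoU-based silhouette loss and a total variation penalty. Several choices here are assumptions made for the example: the loss is taken as one minus the average IoU, the IoU uses a min/max "soft" form so that it also works for non-binary masks, masks are flattened lists, and the total variation is the anisotropic (absolute difference) variant.

```python
def soft_iou(mask_a, mask_b):
    """IoU for masks with values in [0, 1] (min/max generalization)."""
    intersection = sum(min(a, b) for a, b in zip(mask_a, mask_b))
    union = sum(max(a, b) for a, b in zip(mask_a, mask_b))
    return intersection / union if union > 0 else 1.0


def silhouette_loss(reference_masks, rendered_silhouettes):
    """One minus the average IoU between reference masks and silhouettes."""
    ious = [soft_iou(m, s)
            for m, s in zip(reference_masks, rendered_silhouettes)]
    return 1.0 - sum(ious) / len(ious)


def total_variation(texture):
    """Sum of absolute differences between neighbouring texels."""
    rows, cols = len(texture), len(texture[0])
    horizontal = sum(abs(texture[r][c + 1] - texture[r][c])
                     for r in range(rows) for c in range(cols - 1))
    vertical = sum(abs(texture[r + 1][c] - texture[r][c])
                   for r in range(rows - 1) for c in range(cols))
    return horizontal + vertical
```

With this convention, a silhouette loss of 0 means perfect overlap with the reference masks, and a constant texture has zero total variation.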
Optimization. Fig. 2 shows an overview of the pipeline. We backpropagate the joint loss to the mesh, motion parameters, and exposure gap. Optimization is done with Adam [adam] using a learning rate of . In the beginning, we run pre-optimization for at most 100 iterations, omitting the video reconstruction loss and texture map updates. Pre-optimization stops when the silhouette loss becomes , meaning that the mesh silhouettes have average IoU with the DeFMO masks. This pre-optimization phase is required since the 3D translation has to put the mesh at approximately the right location in the image to get a training signal for the video reconstruction loss to estimate the texture map, 3D object rotation, and 3D shape. The more video frames are used, the more important this step becomes because the object’s 2D location varies more across the frames. Experimentally, for the optimization never converges without pre-optimization. We optimize over the mesh prototypes by running the optimization for each prototype and choosing the best one based on the lowest value of the video reconstruction loss (7). During optimization, the mesh is always kept in canonical space by normalizing the vertices to zero mean and unit variance. The main optimization is run for 1000 iterations using the full loss (9). The hyperparameter of the Laplacian regularizer is set to 1000 experimentally. Both the texture total variation and silhouette consistency losses have no weights since the default value of worked well in our experiments.
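Two steps of the procedure above lend themselves to a short sketch: keeping the mesh in a canonical space and picking the best prototype. The details are assumptions for illustration: vertices are centred per axis and scaled by a single standard deviation computed over all coordinates (which preserves the aspect ratio), and the prototype names and loss values are hypothetical.

```python
import math


def normalize_vertices(vertices):
    """Centre vertices per axis and scale them to unit overall variance."""
    n = len(vertices)
    mean = [sum(v[axis] for v in vertices) / n for axis in range(3)]
    centered = [[v[axis] - mean[axis] for axis in range(3)]
                for v in vertices]
    variance = sum(c * c for v in centered for c in v) / (3 * n)
    if variance == 0:
        raise ValueError("degenerate mesh: all vertices coincide")
    std = math.sqrt(variance)
    return [[c / std for c in v] for v in centered]


def select_prototype(final_losses):
    """Pick the prototype whose fit reached the lowest reconstruction loss."""
    return min(final_losses, key=final_losses.get)


vertices = [[0, 0, 0], [2, 0, 0], [0, 2, 0], [0, 0, 2]]
canonical = normalize_vertices(vertices)
best = select_prototype(
    {"sphere_coarse": 0.12, "sphere_fine": 0.09, "torus": 0.30})
```

Applying such a normalization after every update removes the ambiguity between the mesh's own position and scale and the motion parameters, since any drift of the vertices would otherwise be indistinguishable from a change in translation.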
Initialization. The mesh parameters are initialized to the prototype shape with zero vertex offsets and a white texture map. The motion parameters are initialized such that the object is placed in the middle of the image with zero rotation. Finally, the exposure gap is initialized to .
We use PyTorch [pytorch] with Kaolin [kaolin] for differentiable rendering. All integrals in each frame are discretized by splitting time intervals into 8 evenly spaced pieces. All experiments are run on an Nvidia GTX 1080Ti GPU with seconds average runtime per frame.
We evaluate our method’s accuracy by measuring the deblurring quality on 3 real-world datasets from the fast moving object deblurring benchmark [defmo]. Since there are no real image datasets of fast moving objects with associated ground-truth 3D shapes and motion, we follow the protocol of [sfb] and evaluate the quality of reconstructed 3D meshes, 3D translations, and 3D rotations on a synthetic dataset.
|Method||Falling Objects [kotera2020]: TIoU||PSNR||SSIM||TbD-3D Dataset [tbd3d]: TIoU||PSNR||SSIM||TbD Dataset [tbd]: TIoU||PSNR||SSIM|
|Jin et al. [Jin_2018_CVPR]||N/A||23.54||0.575||N/A||24.52||0.590||N/A||24.90||0.530|
|DeblurGAN [Kupyn_2019_ICCV]||N/A||23.36||0.588||N/A||23.58||0.603||N/A||24.27||0.537|
Fast moving object deblurring benchmark. It consists of 3 datasets of varying difficulty. The easiest one is TbD [tbd], which contains mostly spherical objects with uniform color (12 sequences, 471 frames in total). A more difficult dataset is TbD-3D [tbd3d], which contains mostly spherical objects with complex textures that move with significant 3D rotation (10 sequences, 516 frames in total). The most difficult dataset is Falling Objects [kotera2020], with objects of various shapes and complex textures (6 sequences, 94 frames in total). The ground truth for these datasets was recorded by a high-speed camera capturing the moving object without motion blur. Therefore, we have 8 high-speed frames for each frame input to our method. We measure the deblurring quality by reconstructing the high-speed camera footage as temporal super-resolution. For that, we apply the video formation model (6) at an 8 times finer temporal resolution, using the object parameters estimated by optimization on the input slow-speed frames. Then, the reconstructed high-speed camera frames and the ground-truth ones are compared by the Peak Signal to Noise Ratio (PSNR) and Structural Similarity (SSIM) metrics. Additionally, these datasets contain ground-truth 2D object trajectories and 2D object masks. Therefore, we also measure the trajectory intersection over union (TIoU), defined as the IoU between the ground-truth mask placed at the ground-truth 2D location and at the reconstructed 2D location (averaged over time). We reconstruct the 2D object location for our method as the center of mass of the projected mesh silhouette at each high-speed frame.
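For reference, two of the metrics can be sketched in a few lines. The PSNR follows its usual definition for intensities in a known range. The TIoU sketch follows the description above, with simplifying assumptions made for this example: the mask is a set of integer pixel offsets and the locations are integer pixel coordinates.

```python
import math


def psnr(prediction, target, max_value=1.0):
    """Peak Signal to Noise Ratio in dB for flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(target)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


def trajectory_iou(mask_offsets, gt_locations, estimated_locations):
    """Average IoU of the same mask placed at true and estimated locations."""
    ious = []
    for (gx, gy), (ex, ey) in zip(gt_locations, estimated_locations):
        at_gt = {(gx + dx, gy + dy) for dx, dy in mask_offsets}
        at_est = {(ex + dx, ey + dy) for dx, dy in mask_offsets}
        ious.append(len(at_gt & at_est) / len(at_gt | at_est))
    return sum(ious) / len(ious)


# A 2x2 mask; the second estimated location is off by one pixel in x.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
tiou = trajectory_iou(square, [(0, 0), (5, 5)], [(0, 0), (6, 5)])
quality = psnr([0.5, 0.5], [0.5, 0.6])
```

Since the same ground-truth mask is used at both locations, TIoU measures only the accuracy of the recovered trajectory, independently of the quality of the reconstructed appearance.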
We compare to various state-of-the-art methods: a generic deblurring method DeblurGAN-v2 [Kupyn_2019_ICCV], a generic method for temporal super-resolution [Jin_2018_CVPR], and methods designed for fast moving object deblurring [tbd, tbd3d, defmo, sfb]. All compared methods use each video frame independently, whereas our method is the first to exploit multiple frames simultaneously. We run MfB in a temporal sliding window approach with if not mentioned otherwise. For each frame, we always choose the window for which the video reconstruction loss (7) is the lowest, measured only on this frame (similar to the best prototype selection).
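The sliding-window selection can be sketched as follows. The window size of 3 and the toy loss are placeholders for this example; in the setting described above, `frame_loss(window, frame)` would be the video reconstruction loss measured on `frame` after fitting the model to `window`.

```python
def sliding_windows(num_frames, window_size):
    """All windows of consecutive frame indices of the given size."""
    return [tuple(range(start, start + window_size))
            for start in range(num_frames - window_size + 1)]


def best_window_per_frame(num_frames, window_size, frame_loss):
    """For each frame, keep the containing window with the lowest loss."""
    best = {}
    for window in sliding_windows(num_frames, window_size):
        for frame in window:
            loss = frame_loss(window, frame)
            if frame not in best or loss < best[frame][0]:
                best[frame] = (loss, window)
    return {frame: window for frame, (loss, window) in best.items()}


def toy_loss(window, frame):
    """Hypothetical loss: frames near the window centre are fit best."""
    centre = window[len(window) // 2]
    return abs(frame - centre)


choice = best_window_per_frame(5, 3, toy_loss)
```

Frames at the start and end of the video are covered by fewer windows, so they have fewer candidates to choose from than frames in the middle.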
Table 1 presents the results. MfB outperforms all other methods on all three datasets and for all three metrics. Qualitatively, the estimated temporal super-resolution is more consistent than that of single-frame approaches since MfB explains all frames by a single 3D object mesh and texture (Fig. 5). Novel view synthesis is also considerably better, as the object outline is accurate from all viewpoints, and even sharp angles of the box (Fig. 5, novel views) are clear. In contrast, the previous state-of-the-art single-frame 3D reconstruction approach [sfb] produces several artifacts and inconsistencies, as well as an entirely incorrect 3D shape for object parts that are not visible in a single input frame. Moreover, DeFMO [defmo] and SfB [sfb] fail in the presence of shadows and specularities, whereas MfB reconstructs the object better due to additional constraints from neighboring frames (Fig. 5).
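|Method||Translation error||Rotation error||Mesh error|
|Subset with larger rotations|
|SfB [sfb]||37.8 %||10.9||3.0 %|
|MfB (ours)||20.0 %||6.4||2.7 %|
|Subset with smaller rotations|
|SfB [sfb]||12.8 %||4.8||2.3 %|
|MfB (ours)||8.8 %||3.7||2.2 %|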
Evaluating at bounces. A unique new feature of our approach is its ability to model bounces, which results in better deblurring in those cases. Here, we evaluate this effect explicitly. To this end, we manually annotate the frames in which a bounce happens in the TbD-3D dataset [tbd3d] (the only dataset with relatively frequent bounces). Overall, we found 38 bounces in 516 frames from 10 sequences, which amounts to a roughly 7% chance of a bounce in any given frame. Since the frames immediately before and after a bounce are usually affected too (e.g. due to a shadow as in Fig. 3), we also evaluate them, yielding a total of 114 frames (about 22%). As shown in Table 2, MfB significantly outperforms SfB at bounces, especially in terms of the deblurring quality metric PSNR. The performance gap is still significant on frames adjacent to a bounce but is relatively small when averaged over the whole dataset. This indicates that bounces are significantly more difficult than other parts of the dataset, as shown qualitatively in Fig. 3 and Fig. 4, and that our method successfully reconstructs such frames as well. For single-frame approaches, the difficulty comes mainly from the trajectory non-linearity, slight object deformation, and shadows near the bounce point. Motion-from-Blur is robust to these difficulties since the optimization is additionally constrained by the easier frames before and after the bounce, and the trajectory is explicitly modeled with a bounce. On frames that are far from a bounce, the difference in deblurring quality between the single-frame and multi-frame approaches is marginal on the TbD-3D dataset. Note that our model is generic and also estimates continuously connected trajectories when there is no bounce.
Synthetic 3D dataset. We construct a synthetic dataset of fast moving objects with ground-truth 3D models and 3D motions for evaluation. We sample random 3D models from the ShapeNet dataset [shapenet2015], random linear 3D translations and 3D rotations (for a fair comparison with SfB [sfb], which reconstructs only linear motions), and random consecutive frames from the VOT [VOT_TPAMI] tracking dataset as backgrounds. 3D translation is randomly chosen in the interval between 1 and 5 object sizes, and 3D rotation is randomly chosen up to (first subset) or (second subset) during the video duration. Then, we apply the video formation model (6) with to create two subsets, each consisting of 30 short videos. We report the mesh error as the average bidirectional distance between the closest vertices of the ground-truth and the estimated mesh, placed at the ground-truth and the predicted initial 6D pose, respectively, and divided by the object size. For evaluating the translation error, we compute the norm of the difference vector between the predicted and ground-truth translation offset, divided by the object size. Thus, these two scores (mesh error and translation error) are reported as a fraction of the object size. For evaluating the rotation error, we compute the average angle between the estimated rotation change (rotation between and ) and the ground-truth one.
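The three error measures can be sketched as below. This is an illustrative reading of the definitions above under stated assumptions: rotations are given as unit quaternions and the rotation error is returned in degrees, the mesh vertices are assumed to be already placed at their respective poses, and the bidirectional distance is the mean of the two one-way nearest-vertex averages.

```python
import math


def translation_error(predicted_offset, gt_offset, object_size):
    """Norm of the offset difference as a fraction of the object size."""
    return math.dist(predicted_offset, gt_offset) / object_size


def rotation_error_deg(q_predicted, q_gt):
    """Angle (degrees) between two rotations given as unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_predicted, q_gt)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))


def mesh_error(vertices_a, vertices_b, object_size):
    """Average bidirectional closest-vertex distance over the object size."""
    def one_way(source, destination):
        return sum(min(math.dist(s, d) for d in destination)
                   for s in source) / len(source)
    both = 0.5 * (one_way(vertices_a, vertices_b)
                  + one_way(vertices_b, vertices_a))
    return both / object_size


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
angle = rotation_error_deg(identity, quarter_turn_z)
```

Taking the absolute value of the quaternion dot product accounts for the fact that a quaternion and its negation describe the same rotation.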
We compare to the only other method that can reconstruct a 3D object and its motion from the motion-blurred input (SfB [sfb]). Our method is applied to all three video frames in each video, whereas SfB is applied to them individually, and the scores are averaged (w.r.t. one video frame). As shown in Table 3, on the synthetic dataset with up to rotation, our method is almost twice as accurate as SfB in terms of 3D translation and 3D rotation estimation. For smaller rotations up to , the difference is smaller but is still significant. This highlights that multi-frame optimization is especially beneficial for complex objects (as from ShapeNet) with non-negligible rotations.
Applications. MfB can be used for imitating high-speed cameras or multiplying their capabilities by creating temporal super-resolution from motion-blurred videos. MfB can perform 3D reconstruction of blurred objects that are almost unidentifiable by humans, e.g. image forensics of surveillance cameras. Applications also include 6D object tracking and reconstruction in sports, e.g. football, tennis, basketball.
Static camera. MfB assumes that the video is captured by a nearly static camera. A moving camera adds even more ambiguity to the observed blur that could stem from both camera and object motion blur. Moreover, motion blur also has to be compensated by the camera motion, and the whole problem would become much more difficult. Since all previous methods for fast moving object deblurring and 3D reconstruction [tbd, tbd3d, tbd_ijcv, defmo, sfb] also assume a static camera, tackling this problem remains challenging future work.
Changing and rolling shutter. Currently, we assume that the shutter is constant. However, some cameras have an adjustable shutter that changes the exposure gap based on lighting conditions, e.g. less exposure for bright scenes and more exposure for dark scenes. Nevertheless, this transition is smooth in most cases, and our sliding window approach should be reasonably robust in such cases. Most digital cameras, like in mobile devices, have a rolling shutter that captures a frame line by line. Thus, the motion blur and exposure gap are different for each line in the frame, depending on object speed and location. Modeling a rolling shutter is beyond the scope of this paper. However, we observed that the rolling shutter effect is small, and our optimization of the video formation model without rolling shutter still leads to satisfactory results on many real-world videos.
Texture-less objects. Reconstructing 3D objects that lack noticeable texture is a challenge even for generic 3D reconstruction methods since no distinctive geometry features are observable, and the correspondences are ambiguous. In this case, detecting any 3D rotation is almost infeasible. As observed on the TbD dataset [tbd] that has mostly uniformly textured objects, our method mainly reports zero rotation for such objects, even if they have imperceptible rotation. Yet, the reconstructed object translation is mostly correct, with deblurring results outperforming other methods (cf. Table 1).
Non-rigid objects. We assume that the object is rigid, i.e. its 3D model is constant for the video duration. This assumption is invalid for deforming objects, and deformation often happens during a bounce. However, since these deformations are often insignificant and last only for a very short time, our model still handles such cases well.
We presented the first method for estimating textured 3D shapes and complex motions of motion-blurred objects in videos. By optimizing over multiple input frames, we are able to correctly recover 3D object shape and motion, its motion direction, and the camera exposure gap. Various experiments have shown that our method produces sharper and more consistent results compared to other methods for fast moving object deblurring. Compared to single-image 3D shape and motion estimation [sfb], which is a special instance of our approach, we recover more complete shapes and significantly more precise motion estimation.