Occlusion-Robust MVO: Multimotion Estimation Through Occlusion Via Motion Closure

05/13/2019 ∙ by Kevin M. Judd, et al. ∙ University of Oxford

Visual motion estimation is an integral and well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation, which is especially challenging in highly dynamic environments. Such environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Previous work in multiple object tracking focuses on maintaining the integrity of object tracks but usually relies on specific appearance-based descriptors or constrained motion models. These approaches are very effective in specific applications but do not generalize to the full multimotion estimation problem. This paper extends the multimotion visual odometry (MVO) pipeline to estimate multiple motions through occlusion, including the camera egomotion, by employing physically founded motion priors. This allows the pipeline to consistently estimate the full trajectory of every motion in a scene and recognize when temporarily occluded motions become unoccluded. The estimation performance of the pipeline is evaluated on real-world data from the Oxford Multimotion Dataset.


I Introduction

The ability to safely navigate through a dynamic environment is a crucial task in autonomous robotics. Visual odometry (VO) is widely used to estimate the egomotion of a camera by isolating the static parts of a scene [1]. It is more challenging to segment multiple motions within a complex dynamic scene, and recent work has focused on addressing this multimotion estimation problem [2]. Such highly dynamic scenes not only pose difficult motion estimation challenges but also tend to include significant amounts of occlusion.

Fig. 1: Motion segmentation produced by our occlusion-robust multimotion visual odometry (MVO) system. The egomotion of the camera is estimated from the static points in the scene shown in black. The motions of the swinging block (4, magenta) and the moving block tower (1, cyan) are segmented and estimated simultaneously with the egomotion.

Occlusions represent any lack of direct observations of parts of a scene. Direct occlusions are caused when an object obscures another or when it leaves the field of view of the sensor. Occlusions can also be caused indirectly by sensor limitations or algorithmic failure, such as when motion blur or lighting changes corrupt feature matching or object detection. Consistently estimating multiple, continuous motions in the presence of both direct and indirect occlusions is necessary for autonomous navigation in complex dynamic environments.

Multiple object tracking (MOT) focuses on the challenge of tracking through occlusion in highly dynamic scenes. These approaches often employ appearance- and motion-based techniques to both predict and recover from partial and full occlusions. Most MOT approaches focus on consistently tracking the target objects in Cartesian or image space, often from a stationary camera [3]. They employ application-specific object or motion models that do not generalize well to other domains [4, 5, 6, 7]. These assumptions limit their ability to track general objects and estimate the full pose of each object.

Our multimotion visual odometry (MVO) pipeline [2] addresses the multimotion estimation problem by applying multimodel fitting techniques to the traditional VO pipeline (Fig. 1). MVO simultaneously estimates the full trajectory of every motion in a scene, including the egomotion, without a priori assumptions about object appearance.

Fig. 2: Trajectory estimates produced by our occlusion-robust MVO system before (a), during (b), and after (c) an occlusion in the occlusion_2_unconstrained segment of the Oxford Multimotion Dataset [8]. The trajectory of the swinging block (4, magenta) is directly estimated when it is visible in (a) and (c) and is extrapolated using the constant-velocity motion prior (dashed line) when the block is occluded by the moving tower (1, cyan) in (b). When the block becomes unoccluded in (c), it is rediscovered through motion closure and the estimates are interpolated to match the directly estimated trajectory.

This paper extends the MVO framework to estimate multiple motions through occlusion by exploiting a physically founded motion prior (Fig. 2). This prior is used to extrapolate previously observed motion estimates until the object becomes visible again. Extrapolated estimates are used in motion closure to recover tracking when objects reappear in the predicted location. This not only maintains trajectory consistency without relying on appearance models but also improves the estimates of the occluded motion trajectory via interpolation. The full trajectory of every motion in the scene is estimated through both direct and indirect occlusions. Estimation accuracy is demonstrated on ground-truth data from the Oxford Multimotion Dataset [8].

The rest of the paper is organized as follows: Section II summarizes existing approaches to the multimotion estimation and multiobject tracking problems. Section III explores continuous motion priors and their applications to estimating through occlusion. The occlusion-robust MVO pipeline is detailed in Section IV, and Section V presents the performance of our approach in a dynamic environment with significant, repeated occlusions using ground-truth trajectory data. Sections VI and VII discuss the performance results, as well as the limitations of the pipeline and plans for future work.

Fig. 3: An illustration of the occlusion-aware MVO pipeline, which extends the original MVO pipeline to accurately estimate trajectories through occlusions. Given a set of tracklets, the multimotion-fitting section of the pipeline both segments the tracklets according to their motion and estimates the egocentric trajectories that explain that motion. After the segmentation converges, a motion label is chosen to represent the camera egomotion and used to estimate the geocentric trajectories of all other objects in the scene. In motion closure, a white-noise-on-acceleration motion prior is used to extrapolate occluded trajectories and determine if newly discovered motions can be explained by the reappearance of an occluded object.

II Background

Motion estimation and object tracking are integral to a wide range of computer vision applications. Tracking and estimating the motion of an individual object through a scene has been widely explored, but doing so in complex, dynamic scenes is significantly more difficult due to frequent occlusions. Extending this to the multimotion estimation problem requires the ability to both track and estimate multiple motions in the presence of direct and indirect occlusions.

II-A Multimotion Estimation

Many multimotion estimation approaches only solve a subset of the rigid multimotion estimation problem by applying application-specific simplifying constraints and assumptions. This limits their applicability to real-world multimotion estimation challenges.

Costeira and Kanade [9] use the affine model and matrix decomposition to determine the motion and shape of each dynamic object. This factorization usually requires points to be tracked for the entirety of the estimation window, which is difficult due to direct and indirect occlusions. Some techniques allow for missing data points [10] but are not designed for many short feature tracks, as is commonly encountered in practice.

Torr [11] uses a recursive RANSAC framework to find and remove dominant motion models from the remaining feature points. This framework is efficient at finding the dominant models in a scene, but the ability to sample consistent models decreases as models are removed and the signal-to-noise ratio of the remaining points decreases. Sabzevari and Scaramuzza [12] improve the probability of sampling consistent models by applying geometric and kinematic constraints specific to driving scenarios. These constraints do not generalize well to other applications.

Ozden et al. [13] consider many practical challenges in multimotion estimation, such as incomplete feature tracks, and propose a model selection framework that relies on separate egomotion estimation. While this technique explicitly models the merging and splitting of motions, it does not address direct occlusions.

Our previous MVO pipeline [2] addresses the multimotion estimation problem by applying multimodel fitting techniques to the traditional VO pipeline. MVO simultaneously estimates the full trajectory of every motion in a scene, including the egomotion, without a priori assumptions about object appearance. The original pipeline relies on direct observations and can estimate through some partial occlusions, but it is unable to handle significant observation dropouts.

II-B Multiple Object Tracking

Most visual tracking techniques follow the tracking-by-detection paradigm. They use a variety of specific, appearance-based object models to detect targets in each frame. The tracking problem then focuses on accurately associating present and past detections [14].

Target detectors often use bounding-box representations rather than the full target pose, so objects are usually tracked in image or Cartesian space using simple motion models. These simplifications limit their ability to track general objects and estimate the full pose of each object.

Data association is often performed using recursive filters or global energy minimizations. Kalman [14, 15] and particle [16] filters use simple motion models to recursively predict the location of the target and update the current state based on current observations. Tracking-by-detection techniques are limited by the quality of the detectors they use, and recursive methods often fail due to occlusions and appearance changes.

Energy-based techniques incorporate object appearance, motion, and interaction models in a cost functional. The functional is defined over a graph where vertices represent detections and edges represent transitions between frames, and it is minimized using flow-based techniques [17]. Byeon et al. [18] include 3D reconstruction and object interactions in their cost function to track objects from multiple static cameras. The problem of assigning tracks to new detections or other tracks can also be solved using the Hungarian algorithm [19] or other greedy alternatives [16, 4]. These specialized approaches are dependent on defining representative cost functionals and do not generalize well to other applications.

II-C Tracking Through Occlusion

Accurate data association is more difficult in highly dynamic environments with significant occlusion. Direct occlusions can be predicted by modeling object overlaps [5] or using scene understanding [20], which can help to avoid misassociated detections. Yang et al. [21] propose a learning-based conditional random field model that considers the interdependence of observed motions, especially in the presence of occlusion. These prediction methods can be used for direct occlusions, but indirect occlusions are more difficult to predict.

Even partial occlusions are challenging for appearance-based techniques because they change the observed shape of the occluded object. Feature-based techniques track targets through partial occlusions when a sufficient number of feature points can be tracked [22], but grouping features into distinct objects is difficult if their bulk motion is similar. Other techniques define specific, part-based appearance models to infer the position of the entire object from the portions that are visible [4, 23, 6].

Full occlusions are often overcome by using motion priors to extrapolate trajectories in the absence of direct observations. Zhang et al. [17] generate occlusion hypotheses that are explicitly incorporated into their flow minimization. This hypothesize-and-test paradigm works well in the presence of short or partial occlusions but fails under long occlusions as there is no information available to prune hypotheses [14]. Ryoo et al. [24] avoid this impractical growth with their observe-and-explain paradigm, which avoids hypothesizing occluded motions until an unoccluded detection is observed near the source of occlusion. Likewise, Mitzel et al. [5] extrapolate unobserved target trajectories for a set number of frames to allow for reassociation when the target becomes unoccluded. The applicability of these occlusion models is limited by the effectiveness of the target detectors and the fidelity of their motion models to the object motions in a scene.

This paper extends our previous work directly tracking and estimating motions [2] by introducing a continuous motion prior into the MVO framework. This prior is used to extrapolate motions through both direct and indirect occlusions and allows for motion closure to reacquire motions and improve their occluded estimates. The full trajectory of every motion in the scene is estimated through direct and indirect occlusions. This approach is evaluated on an occlusion dataset containing ground-truth trajectories for all motions in the scene.

III Motion Priors

A motion model is a simplified representation of the many complex motions encountered in the world. The choice of motion model is integral to the accuracy of the trajectory estimation and is often dependent on the intended application. Motion models should accurately describe the motions in the environment while not overcomplicating the estimation. The rigid-motion assumption reduces the complex space of motion trajectories to $SE(3)$ while maintaining fidelity with many dynamic motions in the world. Many approaches further simplify this by reducing the estimation space to planar [12, 5], Cartesian [18, 17], or image [24] space, or by designing specific, high-dimensional models for applications such as human tracking [4, 6, 7].

Models can be defined discretely or continuously. Discrete models represent a trajectory as a sparse set of states, which is well-suited for synchronized sensors such as globally shuttered cameras. Continuous models smoothly represent a trajectory at all times using an assumption, or prior, about the motion of objects [25]. These models incorporate a smooth prior directly into the representation, which is preferable for scanning or high-rate sensors. This prior can also be exploited to intelligently estimate occluded trajectories.

Motions can also be defined in different frames, and simple motions in one frame may become complex when expressed in another. Two bodies, each moving according to some known prior relative to some static reference frame, do not, in general, exhibit the same type of motion relative to each other. A model that is expressed egocentrically may be appropriate for estimating the egomotion of a camera relative to its static environment but not relative to other dynamic objects. It is therefore often necessary to express models in an inertial or quasi-inertial (e.g., geocentric) frame.

III-A White-Noise-on-Acceleration Motion Prior

This paper employs the white-noise-on-acceleration (i.e., locally constant-velocity) motion prior described by Anderson et al. [25]. This prior effectively penalizes the trajectory’s deviation from a constant body-centric velocity. It is physically founded because objects tend to move smoothly throughout their environment.

The continuous-time trajectory of a motion, $\mathbf{x}(t) = \{\mathbf{T}(t), \boldsymbol{\varpi}(t)\}$, is defined as both the poses, $\mathbf{T}(t) \in SE(3)$, and the local, body-centric velocities, $\boldsymbol{\varpi}(t) \in \mathbb{R}^6$. The trajectory state is assumed to vary smoothly over time in the Lie algebra, $\mathfrak{se}(3)$. This prior takes the form

$$\dot{\mathbf{T}}(t) = \boldsymbol{\varpi}(t)^{\wedge}\,\mathbf{T}(t), \qquad \dot{\boldsymbol{\varpi}}(t) = \mathbf{w}(t), \qquad \mathbf{w}(t) \sim \mathcal{GP}\!\left(\mathbf{0},\, \mathbf{Q}_C\,\delta(t - t')\right), \tag{1}$$

where $\mathbf{w}(t)$ is a zero-mean, white-noise Gaussian process with power spectral density matrix, $\mathbf{Q}_C$, and $(\cdot)^{\wedge}$ is the $\mathfrak{se}(3)$ representation of $\boldsymbol{\varpi}(t)$ as defined in [26].

This continuous-time trajectory can be estimated at a collection of discrete time steps, $t_1 < t_2 < \cdots < t_K$, such that

$$\mathbf{x}_k := \left\{\mathbf{T}_k, \boldsymbol{\varpi}_k\right\},$$

where $\mathbf{T}_k := \mathbf{T}(t_k)$ and $\boldsymbol{\varpi}_k := \boldsymbol{\varpi}(t_k)$. These time steps correspond to observation times when measurements of the scene are collected.

The system in (1) is nonlinear, and finding a numerical solution is costly. If the motion between measurement times is small, then the system can be recast as a set of local, linear time-invariant stochastic differential equations of the form

$$\dot{\boldsymbol{\gamma}}_k(t) = \mathbf{A}\,\boldsymbol{\gamma}_k(t) + \mathbf{B}\,\mathbf{u}(t) + \mathbf{L}\,\mathbf{w}(t),$$

where $\boldsymbol{\gamma}_k(t)$ is the local GP state, $\mathbf{u}$ is the exogenous input, and $\mathbf{w}$ is defined similarly to $\mathbf{w}(t)$ in (1). The local state is defined as

$$\boldsymbol{\gamma}_k(t) := \begin{bmatrix} \boldsymbol{\xi}_k(t) \\ \boldsymbol{\mathcal{J}}\!\left(\boldsymbol{\xi}_k(t)\right)^{-1} \boldsymbol{\varpi}(t) \end{bmatrix}, \tag{2}$$

where $\boldsymbol{\mathcal{J}}$ is the left Jacobian of $SE(3)$, and $\boldsymbol{\xi}_k(t) := \ln\!\left(\mathbf{T}(t)\,\mathbf{T}_k^{-1}\right)^{\vee}$.

With zero exogenous input, $\mathbf{u}(t) = \mathbf{0}$, we have the solution

$$\check{\boldsymbol{\gamma}}_k(t) = \boldsymbol{\Phi}(t, t_k)\,\boldsymbol{\gamma}_k(t_k), \tag{3}$$

where $\check{\boldsymbol{\gamma}}_k(t)$ is the local GP prior mean, and $\boldsymbol{\Phi}(t, t_k)$ is the state transition function from $t_k$ to $t$,

$$\boldsymbol{\Phi}(t, t_k) = \begin{bmatrix} \mathbf{1} & (t - t_k)\,\mathbf{1} \\ \mathbf{0} & \mathbf{1} \end{bmatrix}.$$

Applying this prior locally at each time step represents the global nonlinear system as a piecewise sequence of linear, time-invariant systems.
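To make this piecewise structure concrete, the following minimal sketch (our own illustration, not the authors' implementation) builds the constant-velocity transition function of (3) for a 6-DOF local state and propagates a local prior mean forward in time; the variable names and values are assumptions for the example.

```python
import numpy as np

def transition(dt: float, dof: int = 6) -> np.ndarray:
    """Constant-velocity state transition Phi(t, t_k) for a stacked local
    state [xi; xi_dot] of dimension 2 * dof, as in (3)."""
    Phi = np.eye(2 * dof)
    Phi[:dof, dof:] = dt * np.eye(dof)  # pose block advances by dt * velocity
    return Phi

# Local state at t_k: zero pose offset and a constant body-centric velocity.
dof = 6
gamma_k = np.concatenate([np.zeros(dof),
                          np.array([0.1, 0.0, 0.0, 0.0, 0.0, 0.05])])
gamma_pred = transition(dt=0.5) @ gamma_k  # predicted local state at t_k + 0.5 s
```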

IV Methodology

The original MVO pipeline [2] extends VO to multimodel segmentation and estimation. As with traditional stereo VO pipelines, a set of tracklets is generated by matching salient image points across rectified stereo image pairs and temporally across consecutive stereo frames. The motion segmentation and estimation of these tracklets are then cast as a multilabeling problem where a label, $\ell$, represents a motion hypothesis, $\mathbf{T}_{\ell}$, calculated from a subset of tracklets, $\mathcal{P}_{\ell}$. The labeling is found using CORAL [27], a convex optimization approach to the multilabeling problem. All motion hypotheses are initially treated as egocentric and potentially belonging to the static portions of the scene (i.e., the camera’s egomotion). Geocentric trajectories are found in a final step where a label is selected to represent the motion of the camera.

This paper extends the MVO pipeline (Fig. 3) to handle occlusions by using a continuous, physically founded motion prior. The prior is used both to estimate directly observed trajectories and to extrapolate occluded motions. As with the original pipeline, all motion hypotheses are treated egocentrically until the segmentation converges (Section IV-A). Unlike the original pipeline, a label is selected to represent the motion of the camera before performing a full-batch estimation of each trajectory in a geocentric frame (Section IV-A2). The motion prior is used to extrapolate previously estimated trajectories that are not found in the current frame due to occlusion or estimation failure (Section IV-B). These extrapolated trajectories are then used in motion closure to determine if any recently discovered trajectory is similar in both location and velocity (Section IV-C). Trajectories found to belong to the same motion are used to correct occluded estimates through interpolation (Section IV-C).

IV-A Continuous-Time Geocentric Estimation

The original MVO pipeline estimates motion trajectories egocentrically without making any assumptions as to which motion represented that of the camera [2]. The white-noise-on-acceleration prior is not valid when both the estimated and reference frames are moving because two bodies moving with constant velocity relative to a geocentric frame do not generally have zero acceleration relative to each other. The camera egomotion is estimated from the static background of the scene, which means the prior is appropriate, but this does not hold for the other dynamic motions in the scene. The camera egomotion must therefore be estimated first and then used to estimate the other trajectories in a geocentric frame.

The egomotion label, $\ell_{\text{ego}}$, is chosen using prior information or heuristics. As in VO, it can be initialized as the label with the largest support,

$$\ell_{\text{ego}} := \operatorname*{arg\,max}_{\ell}\ \left|\mathcal{P}_{\ell}\right|,$$

after which it can be propagated forward in time by choosing the label that maximizes the overlap in support with the previous egomotion label,

$$\ell_{\text{ego},k} := \operatorname*{arg\,max}_{\ell}\ \left|\mathcal{P}_{\ell,k} \cap \mathcal{P}_{\ell_{\text{ego}},k-1}\right|.$$
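As an illustration, this label selection can be written as the following minimal sketch (our own simplification, with hypothetical variable names; the support sets would come from the converged segmentation):

```python
def select_egomotion_label(support, prev_support=None):
    """Choose the egomotion label.

    support: dict mapping each motion label to the set of tracklet ids it explains.
    prev_support: tracklet ids explained by the previous egomotion label, if any.
    With no previous label, pick the label with the largest support; otherwise
    pick the label whose support overlaps most with the previous label's support.
    """
    if prev_support is None:
        return max(support, key=lambda l: len(support[l]))
    return max(support, key=lambda l: len(support[l] & prev_support))

# First window: the static background (label 0) has the most tracklets.
support_t0 = {0: {1, 2, 3, 4, 5}, 1: {6, 7}, 2: {8, 9, 10}}
ego = select_egomotion_label(support_t0)                   # -> 0
# Next window: propagate by overlap with the previous egomotion support.
support_t1 = {4: {2, 3, 5, 11}, 5: {6, 7, 12}}
ego = select_egomotion_label(support_t1, support_t0[ego])  # -> 4
```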

IV-A1 Egomotion Estimation

The egomotion of the camera is estimated using the approach described in [25]. The system state, $\mathbf{x} = \left\{\mathbf{T}_k, \boldsymbol{\varpi}_k, \mathbf{p}_j\right\}$, comprises the estimated pose transforms and body-centric velocities, $\left\{\mathbf{T}_k, \boldsymbol{\varpi}_k\right\}$, and the associated landmark points, $\mathbf{p}_j$. The estimated state, $\mathbf{x}^{*}$, is found by minimizing an objective function, $J(\mathbf{x}) = J_{\text{meas}}(\mathbf{x}) + J_{\text{prior}}(\mathbf{x})$, consisting of the measurement and prior terms.

The measurement term, $J_{\text{meas}}$, constrains the trajectory and landmark estimates with the observations. The measurement model, $\mathbf{h}(\cdot)$, applies the sensor model, $\mathbf{g}(\cdot)$, derived from the perspective camera model, to landmark points transformed by the transform model, $\mathbf{z}(\cdot)$. Each observation, $\mathbf{y}_{kj}$, of point $j$ at pose $k$ is modeled as

$$\mathbf{y}_{kj} = \mathbf{h}\!\left(\mathbf{T}_k, \mathbf{p}_j\right) + \mathbf{n}_{kj} = \mathbf{g}\!\left(\mathbf{z}\!\left(\mathbf{T}_k, \mathbf{p}_j\right)\right) + \mathbf{n}_{kj}, \tag{4}$$

where $\mathbf{z}\!\left(\mathbf{T}_k, \mathbf{p}_j\right) = \mathbf{T}_k\,\mathbf{p}_j$, and $\mathbf{n}_{kj}$ is additive Gaussian noise with zero mean and covariance $\mathbf{R}_{kj}$. The least-squares cost function relating the poses and observations is defined as the difference between the measurement model and the observations,

$$J_{\text{meas}} = \frac{1}{2} \sum_{k,j} \mathbf{e}_{\text{meas},kj}^{T}\, \mathbf{R}_{kj}^{-1}\, \mathbf{e}_{\text{meas},kj},$$

where,

$$\mathbf{e}_{\text{meas},kj} = \mathbf{y}_{kj} - \mathbf{g}\!\left(\mathbf{z}\!\left(\mathbf{T}_k, \mathbf{p}_j\right)\right).$$
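As a concrete illustration of this measurement error, the sketch below evaluates a rectified-stereo sensor model of the kind described in [26] and the corresponding residual for one landmark; the intrinsics, baseline, and variable names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def stereo_model(p_cam, fu=400.0, fv=400.0, cu=320.0, cv=240.0, b=0.12):
    """Rectified stereo sensor model g(.): a 3D point in the camera frame to
    [u_left, v_left, u_right, v_right]. Intrinsics and baseline are examples."""
    x, y, z = p_cam
    return np.array([fu * x / z + cu, fv * y / z + cv,
                     fu * (x - b) / z + cu, fv * y / z + cv])

def measurement_error(y_obs, T_k, p_world):
    """e = y - g(z(T_k, p)), with z(.) transforming the point into camera k."""
    p_cam = (T_k @ np.append(p_world, 1.0))[:3]  # transform model z(.)
    return y_obs - stereo_model(p_cam)

T_k = np.eye(4)                    # camera pose used to transform the landmark
p = np.array([0.5, -0.2, 4.0])     # landmark position
y = stereo_model(p) + 0.5          # a (synthetically perturbed) observation
print(measurement_error(y, T_k, p))
```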

The prior term, $J_{\text{prior}}$, constrains the current trajectory estimate by the previous velocity,

$$J_{\text{prior}} = \frac{1}{2} \sum_{k} \mathbf{e}_{\text{prior},k}^{T}\, \mathbf{Q}_k^{-1}\, \mathbf{e}_{\text{prior},k},$$

where the inverse covariance matrix is

$$\mathbf{Q}_k^{-1} = \begin{bmatrix} 12\,\Delta t_k^{-3}\,\mathbf{Q}_C^{-1} & -6\,\Delta t_k^{-2}\,\mathbf{Q}_C^{-1} \\ -6\,\Delta t_k^{-2}\,\mathbf{Q}_C^{-1} & 4\,\Delta t_k^{-1}\,\mathbf{Q}_C^{-1} \end{bmatrix}, \qquad \Delta t_k := t_{k+1} - t_k.$$

The error term penalizes deviation from the constant-velocity prior,

$$\mathbf{e}_{\text{prior},k} = \boldsymbol{\gamma}_k(t_{k+1}) - \boldsymbol{\Phi}(t_{k+1}, t_k)\,\boldsymbol{\gamma}_k(t_k), \tag{5}$$

which (2) and (3) simplify to

$$\mathbf{e}_{\text{prior},k} = \begin{bmatrix} \ln\!\left(\mathbf{T}_{k+1}\,\mathbf{T}_k^{-1}\right)^{\vee} - \Delta t_k\,\boldsymbol{\varpi}_k \\ \boldsymbol{\mathcal{J}}\!\left(\ln\!\left(\mathbf{T}_{k+1}\,\mathbf{T}_k^{-1}\right)^{\vee}\right)^{-1} \boldsymbol{\varpi}_{k+1} - \boldsymbol{\varpi}_k \end{bmatrix}.$$

The total cost, $J = J_{\text{meas}} + J_{\text{prior}}$, is minimized using Gauss-Newton by linearizing the error functions about an operating point, $\mathbf{x}_{\text{op}}$. The operating point is perturbed according to the transform perturbations, $\boldsymbol{\epsilon}_k$, velocity perturbations, $\boldsymbol{\psi}_k$, and landmark perturbations, $\boldsymbol{\zeta}_j$, which are stacked to form the full state perturbation, $\delta\mathbf{x}$.

Linearizing the cost function requires linearizing both (4) and (5). Using the Jacobians of the measurement error function, $\mathbf{G}_{kj}$, and the prior error function, $\mathbf{E}_{k}$, the linearized cost is given by

$$J\!\left(\mathbf{x}_{\text{op}} + \delta\mathbf{x}\right) \approx J\!\left(\mathbf{x}_{\text{op}}\right) - \mathbf{b}^{T} \delta\mathbf{x} + \frac{1}{2}\,\delta\mathbf{x}^{T} \mathbf{A}\,\delta\mathbf{x}, \tag{6}$$

where,

$$\mathbf{A} = \sum_{k,j} \mathbf{P}_{kj}^{T}\, \mathbf{G}_{kj}^{T} \mathbf{R}_{kj}^{-1} \mathbf{G}_{kj}\, \mathbf{P}_{kj} + \sum_{k} \mathbf{P}_{k}^{T}\, \mathbf{E}_{k}^{T} \mathbf{Q}_{k}^{-1} \mathbf{E}_{k}\, \mathbf{P}_{k},$$

$$\mathbf{b} = \sum_{k,j} \mathbf{P}_{kj}^{T}\, \mathbf{G}_{kj}^{T} \mathbf{R}_{kj}^{-1}\, \mathbf{e}_{\text{meas},kj} + \sum_{k} \mathbf{P}_{k}^{T}\, \mathbf{E}_{k}^{T} \mathbf{Q}_{k}^{-1}\, \mathbf{e}_{\text{prior},k}.$$

The indicator matrices, $\mathbf{P}_{kj}$ and $\mathbf{P}_{k}$, are defined such that $\delta\mathbf{x}_{kj} = \mathbf{P}_{kj}\,\delta\mathbf{x}$ and $\delta\mathbf{x}_{k} = \mathbf{P}_{k}\,\delta\mathbf{x}$, i.e., they select the elements of the full perturbation associated with each error term.

The Jacobian of the measurement function is given by

$$\mathbf{G}_{kj} = \left.\frac{\partial \mathbf{g}}{\partial \mathbf{z}}\right|_{\mathbf{z}_{\text{op}}} \begin{bmatrix} \left(\mathbf{T}_{\text{op},k}\,\mathbf{p}_{\text{op},j}\right)^{\odot} & \mathbf{T}_{\text{op},k}\,\mathbf{D} \end{bmatrix}, \tag{7}$$

where

$$\mathbf{z}_{\text{op}} = \mathbf{T}_{\text{op},k}\,\mathbf{p}_{\text{op},j},$$

and the $(\cdot)^{\odot}$ operator and the dilation matrix, $\mathbf{D}$, are defined in [26].

The Jacobian of the prior error function, $\mathbf{E}_k$, is found by perturbing (5) about the operating point, where $\boldsymbol{\mathcal{T}}_{\text{op}}$, the adjoint of $\mathbf{T}_{\text{op},k+1}\,\mathbf{T}_{\text{op},k}^{-1}$, appears in the resulting blocks [25, 26].

The optimal perturbation, $\delta\mathbf{x}^{*}$, to minimize the linearized cost, (6), is the solution to $\mathbf{A}\,\delta\mathbf{x}^{*} = \mathbf{b}$. Each element of the operating point is then updated according to

$$\mathbf{T}_{\text{op},k} \leftarrow \exp\!\left(\boldsymbol{\epsilon}_k^{*\wedge}\right)\mathbf{T}_{\text{op},k}, \qquad \boldsymbol{\varpi}_{\text{op},k} \leftarrow \boldsymbol{\varpi}_{\text{op},k} + \boldsymbol{\psi}_k^{*}, \qquad \mathbf{p}_{\text{op},j} \leftarrow \mathbf{p}_{\text{op},j} + \boldsymbol{\zeta}_j^{*},$$

and the cost is relinearized about the updated operating point. The process iterates until the state converges, and $\mathbf{x}^{*} := \mathbf{x}_{\text{op}}$. See [26] for more detail.
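The iteration itself follows the standard linearize-solve-update loop. The sketch below is a minimal, generic Gauss-Newton solver on a Euclidean state with numerical Jacobians, intended only to illustrate that loop structure; the pipeline instead uses analytic Jacobians and updates poses on $SE(3)$ via the exponential map.

```python
import numpy as np

def gauss_newton(residual, x0, iters=20, tol=1e-9, eps=1e-6):
    """Minimize 0.5 * ||residual(x)||^2 by repeated linearization about x_op."""
    x_op = np.array(x0, dtype=float)
    for _ in range(iters):
        e = residual(x_op)
        # Numerical Jacobian of the stacked error about the operating point.
        J = np.column_stack([(residual(x_op + eps * dx) - e) / eps
                             for dx in np.eye(len(x_op))])
        A = J.T @ J                      # Gauss-Newton approximation of the Hessian
        b = -J.T @ e                     # negative gradient
        dx_star = np.linalg.solve(A, b)  # optimal perturbation
        x_op = x_op + dx_star            # update the operating point
        if np.linalg.norm(dx_star) < tol:
            break
    return x_op

# Toy example: fit (a, b) in y = a * exp(b * t) to sampled data.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
print(gauss_newton(lambda x: x[0] * np.exp(x[1] * t) - y, x0=[1.0, 0.0]))
```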

IV-A2 Third-Party Estimation

The geocentric motions of the other objects in the scene are calculated using the estimated egomotion. As in Section IV-A, each label’s system state, $\mathbf{x}^{\ell}$, comprises the estimated pose transforms and body-centric velocities, $\left\{\mathbf{T}_k^{\ell}, \boldsymbol{\varpi}_k^{\ell}\right\}$, and the landmark points, $\mathbf{p}_j^{\ell}$.

The transform model, $\mathbf{z}(\cdot)$, used in the measurement model, (4), is adjusted to transform egocentrically observed points through a geocentrically estimated state. The adjusted model composes the geocentric object trajectory with the object deformation matrix (identity for rigid bodies) and the camera egomotion as estimated in Section IV-A. The transform from the camera to the object centroid is given by

$$\mathbf{T}_{C\ell} = \begin{bmatrix} \mathbf{C} & \mathbf{r} \\ \mathbf{0}^{T} & 1 \end{bmatrix}. \tag{8}$$

The rotation, $\mathbf{C}$, is arbitrary and initially assumed to be identity. The translation, $\mathbf{r}$, is assumed to be the centroid of all points belonging to the motion, $\ell$, observed in the first frame,

$$\mathbf{r} = \frac{1}{\left|\mathcal{P}_{\ell}\right|} \sum_{j \in \mathcal{P}_{\ell}} \mathbf{p}_j.$$
In a sliding-window pipeline, these estimates are updated for each estimation window.
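As a small illustration of (8), the initial camera-to-object transform can be assembled directly from the observed points; this is our own sketch with example values, with the rotation set to identity as described above.

```python
import numpy as np

def initial_object_pose(points_cam):
    """Initial camera-to-object transform of (8): identity rotation and a
    translation equal to the centroid of the motion's observed points."""
    T = np.eye(4)
    T[:3, 3] = np.mean(points_cam, axis=0)  # centroid in the camera frame
    return T

points = np.array([[1.0, 0.2, 3.0], [1.2, 0.1, 3.1], [0.9, 0.3, 2.9]])
print(initial_object_pose(points))
```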

The motion model part of the measurement Jacobian is correspondingly given by a block-row vector obtained by perturbing this adjusted transform model. This Jacobian is used in (7), and (6) is then minimized to estimate the continuous-time geocentric trajectory of every third-party motion in the scene.

IV-B Trajectory Extrapolation

Motion priors are used to extrapolate motions in the presence of occlusions. The accuracy of these extrapolated estimates is dependent on the fidelity of the motion priors to the true motions of the objects in the scene. The local state, $\boldsymbol{\gamma}_k(t_k)$, at time $t_k$ can be used in (3) to estimate the extrapolated state, $\check{\boldsymbol{\gamma}}_k(\tau) = \left[\check{\boldsymbol{\xi}}_k(\tau)^{T}\ \ \dot{\check{\boldsymbol{\xi}}}_k(\tau)^{T}\right]^{T}$, at time $\tau$ [25]. The extrapolated state is then transformed to the global state, consisting of the extrapolated transform, $\check{\mathbf{T}}(\tau)$, and velocity, $\check{\boldsymbol{\varpi}}(\tau)$, via

$$\check{\mathbf{T}}(\tau) = \exp\!\left(\check{\boldsymbol{\xi}}_k(\tau)^{\wedge}\right)\mathbf{T}_k, \qquad \check{\boldsymbol{\varpi}}(\tau) = \boldsymbol{\mathcal{J}}\!\left(\check{\boldsymbol{\xi}}_k(\tau)\right)\dot{\check{\boldsymbol{\xi}}}_k(\tau),$$

which inverts the local state definition in (2).
Estimates can be extrapolated forward or backward in time. As the length of the extrapolation grows, the estimates will diverge from the true motion of the occluded object, especially if it exhibits significant changes in velocity.
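A minimal sketch of this constant-velocity extrapolation is shown below, using a basic $\mathfrak{se}(3)$ exponential map; it assumes the last directly estimated pose and body-centric velocity are available, and all names and values are illustrative.

```python
import numpy as np

def hat(v):
    """3x3 skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

def se3_exp(xi):
    """Exponential map from a 6-vector [rho; phi] (translation, rotation) to SE(3)."""
    rho, phi = xi[:3], xi[3:]
    theta = np.linalg.norm(phi)
    if theta < 1e-12:
        C, V = np.eye(3), np.eye(3)
    else:
        a, A = phi / theta, hat(phi / theta)
        C = (np.cos(theta) * np.eye(3) + (1 - np.cos(theta)) * np.outer(a, a)
             + np.sin(theta) * A)
        V = ((np.sin(theta) / theta) * np.eye(3)
             + (1 - np.sin(theta) / theta) * np.outer(a, a)
             + ((1 - np.cos(theta)) / theta) * A)
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = C, V @ rho
    return T

def extrapolate(T_k, varpi_k, dt):
    """Constant-velocity extrapolation: hold the body-centric velocity fixed
    and propagate the pose by exp((dt * varpi)^) composed with T_k."""
    return se3_exp(dt * varpi_k) @ T_k, varpi_k

T_k = np.eye(4)                                     # last directly estimated pose
varpi_k = np.array([0.2, 0.0, 0.0, 0.0, 0.0, 0.1])  # [translational; rotational] velocity
T_pred, _ = extrapolate(T_k, varpi_k, dt=0.5)       # predicted pose 0.5 s later
```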

IV-C Motion Closure

The extrapolated estimates of previously seen motions can be used to identify unoccluded motions through motion closure. A newly discovered trajectory is compared to an occluded motion’s extrapolated trajectory at the time of its reappearance using a motion-based threshold incorporating both position and velocity.

Each newly discovered trajectory is estimated as a new motion with identity rotation in (8). Upon successful motion closure, the trajectory is reestimated using the extrapolated estimate of the transform from the camera to the object centroid (Section IV-A2). The corrected trajectory is then estimated by applying a correction transform to the extrapolated trajectory, where the correction is identity before the occlusion because the corrected and extrapolated trajectories are equivalent there.

The correction transform uses the observed centroid of the newly discovered trajectory to adjust the position of the extrapolated trajectory, while its rotational component comes from the extrapolated trajectory itself. It is difficult to determine the true rotation of the object after the occlusion without using appearance-based metrics, so the rotational correction is assumed to be identity.

The corrected pose and velocity at the end of the occlusion are then used to interpolate from the beginning of the occlusion at time $t_k$ and correct the extrapolated estimates. The occluded trajectory state at a time $\tau \in (t_k, t_{k+1})$ can be interpolated between the bounding states, $\boldsymbol{\gamma}_k(t_k)$ and $\boldsymbol{\gamma}_k(t_{k+1})$, according to

$$\boldsymbol{\gamma}_k(\tau) = \boldsymbol{\Lambda}(\tau)\,\boldsymbol{\gamma}_k(t_k) + \boldsymbol{\Psi}(\tau)\,\boldsymbol{\gamma}_k(t_{k+1}),$$

where

$$\boldsymbol{\Psi}(\tau) = \mathbf{Q}_{\tau}\,\boldsymbol{\Phi}(t_{k+1}, \tau)^{T}\,\mathbf{Q}_{k+1}^{-1} \qquad \text{and} \qquad \boldsymbol{\Lambda}(\tau) = \boldsymbol{\Phi}(\tau, t_k) - \boldsymbol{\Psi}(\tau)\,\boldsymbol{\Phi}(t_{k+1}, t_k)$$

[25]. This interpolated estimation can explain the occluded motion of the object better than extrapolation because it includes direct estimates on both sides of the occlusion.
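A minimal sketch of the motion-closure test itself is given below: the newly discovered motion's pose and velocity at the reappearance time are compared to the occluded motion's extrapolation using position and velocity thresholds. The threshold values and names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def motion_closure(T_new, v_new, T_extrap, v_extrap, pos_tol=0.3, vel_tol=0.2):
    """Return True if a newly discovered motion is explained by an occluded one.

    T_new, T_extrap: 4x4 poses at the reappearance time (new estimate vs.
    constant-velocity extrapolation); v_new, v_extrap: 6-vector body velocities.
    The test uses both position and velocity, as described in Section IV-C.
    """
    pos_err = np.linalg.norm(T_new[:3, 3] - T_extrap[:3, 3])
    vel_err = np.linalg.norm(v_new - v_extrap)
    return pos_err < pos_tol and vel_err < vel_tol

# Example: the reappearing block is near the extrapolated position and moving
# with a similar velocity, so motion closure succeeds.
T_e = np.eye(4); T_e[:3, 3] = [1.0, 0.0, 3.0]
T_n = np.eye(4); T_n[:3, 3] = [1.1, 0.05, 3.0]
v = np.array([0.2, 0.0, 0.0, 0.0, 0.0, 0.1])
print(motion_closure(T_n, v + 0.01, T_e, v))  # True
```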

V Experiments and Results

The performance of occlusion-robust MVO is evaluated using the Oxford Multimotion Dataset [8]. The results (Fig. 4) were produced from a -frame sequence of Bumblebee XB3 stereo camera data from the occlusion_2_unconstrained segment. Estimation is performed as a 16-frame sliding window and the Gauss-Newton minimization was performed analytically with Ceres [28]. The transforms between the Vicon frames and our estimated frames are arbitrary, so the first frames of the estimates are used to calibrate these transforms [29]. All errors are reported for geocentric trajectory estimates.

(a) Camera Egomotion
(b) Block 1 (Moving Tower)
(c) Block 4 (Swinging Block)
Fig. 4: The translational and rotational errors for the estimated motion of the camera (a), the moving block tower (b), and the swinging block (c) for the occlusion_2_unconstrained segment of the Oxford Multimotion Dataset [8]. Grey regions represent times when the swinging block was occluded by the tower, or when the tower was stationary and effectively part of the static background. Dashed lines represent the error in extrapolation and the solid lines represent the error in the direct or interpolated estimates. Each object is compared to ground-truth trajectory data over a -frame section of the segment. Errors are reported in an arbitrary geocentric frame with the z-axis up and arbitrary x- and y-axes.

The grey regions represent times when the swinging block was occluded by the moving tower (Fig. 4c), or when the tower was stationary and effectively part of the static background (Fig. 4b). In these regions, the dashed lines represent the error in extrapolation and the solid lines represent that of the interpolated estimates. Elsewhere, the solid lines represent the errors in the directly estimated trajectories.

The newly calculated centroid of a motion after an occlusion does not always match that of the original motion. This discrepancy can cause jumps in the trajectory, as the original centroid will be projected forward to a different location than is calculated in the current frame. This is a major source of error in the estimates of the block tower (Fig. 4b), as it is often partially outside the view of the camera, which changes its observable centroid.

The camera egomotion (Fig. 4a) exhibited a maximum total drift of m (over a m path) and a maximum rotational error of , , and in roll-pitch-yaw, respectively. This error is reasonable compared to the level of drift in other comparable, camera-only VO systems [30].

The interpolated error of the block tower and the swinging block was generally worse than the extrapolated error due to the shifting centroid estimates used in motion closure. The block tower exhibited a maximum total drift of m (over a m path). It exhibited a maximum rotational error of , , and in roll-pitch-yaw, respectively. The swinging block exhibited a maximum total drift of m (over a m path). It exhibited a maximum rotational error of , , and in roll-pitch-yaw, respectively.

VI Discussion

The pipeline consistently segmented the motions of the camera and the blocks while also estimating the trajectories through occlusions; however, it is still a sparse, feature-based technique. The observable shape (and centroid) of an object changes as it moves, affecting the geocentric estimate of its trajectory. As the object becomes more occluded, the quality of the trajectory estimation will degrade. It is unlikely that an object will become occluded or unoccluded instantaneously, but this can be mitigated by predicting occlusions as in [5].

The accuracy of the object centroid, and therefore the trajectory, depends on the distribution of observed features. The centroid shift can be drastic when an object is partially occluded or when it is rediscovered after being fully occluded. This is particularly significant in the estimation of the block tower when it moves after being stationary (Fig. 4b). A large portion of the tower is outside the camera view, which drastically changes the object centroid location. The original MVO pipeline partially mitigated this through a rolling-average calculation of the centroid, but this is difficult in geocentric estimation as the centroid is required for the calculation of third-party trajectories. More robust centroid calculation remains an area of ongoing work, including considering part-to-whole extrapolation techniques [4, 23, 6].

The reliance of the pipeline on motion means that objects that temporarily have the same motion will be given the same label, such as when a dynamic object becomes stationary. This is often desirable, as it implicitly handles trajectory merging, but many applications might require a form of motion permanence. This could be introduced by using appearance-based object descriptors or by explicitly modeling trajectory merging.

The applicability of the white-noise-on-acceleration motion prior is limited in scenes where objects change direction or speed. Future work will focus on introducing a white-noise-on-jerk prior [31], which is more applicable to motions with smoothly varying velocities.

VII Conclusion

This paper extends the multimotion visual odometry (MVO) pipeline to address the challenges posed by occlusion in highly dynamic environments. The occlusion-robust MVO pipeline uses a white-noise-on-acceleration motion prior to extrapolate occluded trajectories until they are observed again. A motion-based similarity threshold incorporating both position and velocity is used to determine if a newly discovered motion belongs to an occluded object.

The performance of occlusion-robust MVO is evaluated on a challenging segment from the Oxford Multimotion Dataset [8] exhibiting significant occlusion and highly dynamic motions. We are currently exploring extensions to the pipeline as discussed in Section VI, as well as its applicability to other sensor modalities, such as RGB-D and event cameras.

References

  • [1] H. P. Moravec, “Obstacle avoidance and navigation in the real world by a seeing robot rover,” Ph.D. dissertation, Stanford, CA, USA, 1980.
  • [2] K. M. Judd, J. D. Gammell, and P. Newman, “Multimotion visual odometry (MVO): Simultaneous estimation of camera and third-party motions,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 3949–3956.
  • [3] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM Computing Surveys, vol. 38, no. 4, p. 13, 2006.
  • [4] B. Wu and R. Nevatia, “Detection and tracking of multiple, partially occluded humans by Bayesian combination of edgelet based part detectors,” International Journal of Computer Vision, vol. 75, no. 2, pp. 247–266, 2007.
  • [5] D. Mitzel, E. Horbert, A. Ess, and B. Leibe, “Multi-person tracking with sparse detection and continuous segmentation,” in European Conference on Computer Vision.    Springer, 2010, pp. 397–410.
  • [6] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition.    IEEE, 2012, pp. 1815–1821.
  • [7] B. Yang and R. Nevatia, “Online learned discriminative part-based appearance models for multi-human tracking,” in European Conference on Computer Vision.    Springer, 2012, pp. 484–498.
  • [8] K. M. Judd and J. D. Gammell, “The Oxford Multimotion Dataset: Multiple SE(3) motions with ground truth,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 800–807, 2019.
  • [9] J. P. Costeira and T. Kanade, “A multibody factorization method for independently moving objects,” International Journal of Computer Vision, vol. 29, no. 3, pp. 159–179, 1998.
  • [10] R. Vidal and R. Hartley, “Motion segmentation with missing data using power factorization and GPCA,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2004, pp. 310–316.
  • [11] P. H. Torr, “Geometric motion segmentation and model selection,” Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, vol. 356, no. 1740, pp. 1321–1340, 1998.
  • [12] R. Sabzevari and D. Scaramuzza, “Multi-body motion estimation from monocular vehicle-mounted cameras,” IEEE Transactions on Robotics, vol. 32, no. 3, pp. 638–651, 2016.
  • [13] K. E. Ozden, K. Schindler, and L. V. Gool, “Multibody structure-from-motion in practice,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 6, pp. 1134–1141, 2010.
  • [14] D. Reid, “An algorithm for tracking multiple targets,” IEEE Transactions on Automatic Control, vol. 24, no. 6, pp. 843–854, 1979.
  • [15] Z. Khan, T. Balch, and F. Dellaert, “An MCMC-based particle filter for tracking multiple interacting targets,” in European Conference on Computer Vision.    Springer, 2004, pp. 279–290.
  • [16] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. V. Gool, “Robust tracking-by-detection using a detector confidence particle filter,” in IEEE International Conference on Computer Vision, 2009, pp. 1515–1522.
  • [17] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition.    IEEE, 2008, pp. 1–8.
  • [18] M. Byeon, H. Yoo, K. Kim, S. Oh, and J. Y. Choi, “Unified optimization framework for localization and tracking of multiple targets with multiple cameras,” Computer Vision and Image Understanding, vol. 166, pp. 51–65, 2018.
  • [19] J. Xing, H. Ai, and S. Lao, “Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, pp. 1200–1207.
  • [20] R. Kaucic, A. G. Amitha Perera, G. Brooksby, J. Kaufhold, and A. Hoogs, “A unified framework for tracking through occlusions and across sensor gaps,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2005, pp. 990–997 vol. 1.
  • [21] B. Yang, C. Huang, and R. Nevatia, “Learning affinities and dependencies for multi-target tracking using a CRF model,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2011, pp. 1233–1240.
  • [22] D. Sugimura, K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Using individuality to track individuals: Clustering individual trajectories in crowds using local appearance and frequency trait,” in IEEE International Conference on Computer Vision.    2009, pp. 1467–1474.
  • [23] W. Hu, X. Li, W. Luo, X. Zhang, S. Maybank, and Z. Zhang, “Single and multiple object tracking using log-Euclidean Riemannian subspace and block-division appearance model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 12, pp. 2420–2440, 2012.
  • [24] M. S. Ryoo and J. K. Aggarwal, “Observe-and-explain: A new approach for multiple hypotheses tracking of humans and objects,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition.    IEEE, 2008, pp. 1–8.
  • [25] S. Anderson and T. D. Barfoot, “Full STEAM ahead: Exactly sparse Gaussian process regression for batch continuous-time trajectory estimation on SE(3),” in IEEE/RSJ International Conference on Intelligent Robots and Systems.    IEEE, 2015, pp. 157–164.
  • [26] T. D. Barfoot, State Estimation for Robotics.    Cambridge University Press, 2017.
  • [27] P. Amayo, P. Piniés, L. M. Paz, and P. Newman, “Geometric Multi-Model Fitting with a Convex Relaxation Algorithm,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [28] S. Agarwal, K. Mierle, and Others, “Ceres Solver.”
  • [29] Z. Zhang and D. Scaramuzza, “A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry,” in IEEE/RSJ International Conference on Intelligent Robots and Systems.    2018, pp. 7244–7251.
  • [30] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2012, pp. 3354–3361.
  • [31] T. Y. Tang, D. J. Yoon, and T. D. Barfoot, “A white-noise-on-jerk motion prior for continuous-time trajectory estimation on SE(3),” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 594–601, 2019.