DEFT: Detection Embeddings for Tracking
Most modern multiple object tracking (MOT) systems follow the
tracking-by-detection paradigm, consisting of a detector followed by a method
for associating detections into tracks. There is a long history in tracking of
combining motion and appearance features to provide robustness to occlusions
and other challenges, but typically this comes with the trade-off of a more
complex and slower implementation. Recent successes on popular 2D tracking
benchmarks indicate that top scores can be achieved using a state-of-the-art
detector and relatively simple associations relying on single-frame spatial
offsets – notably outperforming contemporary methods that leverage learned
appearance features to help re-identify lost tracks. In this paper, we propose
an efficient joint detection and tracking model named DEFT, or "Detection
Embeddings for Tracking." Our approach relies on an appearance-based object
matching network jointly learned with an underlying object detection network.
An LSTM is also added to capture motion constraints. DEFT has comparable
accuracy and speed to the top methods on 2D online tracking leaderboards while
having significant advantages in robustness when applied to more challenging
tracking data. DEFT raises the bar on the nuScenes monocular 3D tracking
challenge, more than doubling the performance of the previous top method. Code
is publicly available.
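
To make the association step of tracking-by-detection concrete, the following is a minimal sketch of matching existing tracks to new detections by appearance-embedding similarity. It is not the DEFT implementation: the function names (cosine_similarity, associate), the min_similarity threshold, and the greedy matching strategy are all assumptions chosen for illustration, and the LSTM motion constraints mentioned in the abstract are not modeled here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def associate(track_embeddings, detection_embeddings, min_similarity=0.5):
    """Greedily pair tracks with detections, most similar pairs first.

    Hypothetical helper for illustration only. Returns a tuple
    (matches, unmatched_tracks, unmatched_detections), where matches is a
    list of (track_index, detection_index) pairs.
    """
    # Score every track/detection pair and keep those above the threshold.
    candidates = []
    for t, t_emb in enumerate(track_embeddings):
        for d, d_emb in enumerate(detection_embeddings):
            sim = cosine_similarity(t_emb, d_emb)
            if sim >= min_similarity:
                candidates.append((sim, t, d))

    # Accept pairs in descending similarity, using each index at most once.
    candidates.sort(key=lambda c: c[0], reverse=True)
    matches, used_tracks, used_dets = [], set(), set()
    for _, t, d in candidates:
        if t in used_tracks or d in used_dets:
            continue
        matches.append((t, d))
        used_tracks.add(t)
        used_dets.add(d)

    unmatched_tracks = [t for t in range(len(track_embeddings))
                        if t not in used_tracks]
    unmatched_dets = [d for d in range(len(detection_embeddings))
                      if d not in used_dets]
    return matches, unmatched_tracks, unmatched_dets


# Toy example: two tracks, three detections with 2-D embeddings.
tracks = [[1.0, 0.0], [0.0, 1.0]]
detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
matches, lost_tracks, new_detections = associate(tracks, detections)
```

In this toy example each track pairs with the detection whose embedding points in nearly the same direction, and the third detection is left unmatched, which a tracker would typically treat as a candidate new track. A globally optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop without changing the interface.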