DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

03/27/2023
by Sauradip Nag, et al.

We propose a new formulation of temporal action detection (TAD) based on denoising diffusion, DiffTAD for short. Given a long untrimmed video and random temporal proposals as input, it produces accurate action proposals. This offers a generative modeling perspective, in contrast to previous discriminative learning approaches. The capability is achieved by first diffusing the ground-truth proposals into random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we implement the denoising process in a Transformer decoder (e.g., DETR), introducing a temporal location query design that converges faster in training. We further propose a cross-step selective conditioning algorithm to accelerate inference. Extensive evaluations on ActivityNet and THUMOS show that DiffTAD achieves top performance compared with prior art. The code will be made available at https://github.com/sauradip/DiffusionTAD.
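
The forward/noising process described above can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine noise schedule, the normalized (start, end) proposal encoding, and the names `cosine_alpha_bar` and `diffuse_proposals` are all assumptions chosen for illustration, following the standard denoising-diffusion formulation x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps. The learned reverse (denoising) step, which the abstract places in a Transformer decoder, is not shown.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t under a cosine schedule.

    The schedule is an assumption; the abstract does not specify one.
    Returns 1.0 at t = 0 (no noise) and a value near 0 at t = num_steps.
    """
    def f(u):
        return math.cos(((u / num_steps) + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def diffuse_proposals(proposals, t, num_steps, rng):
    """Noise ground-truth proposals: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    `proposals` is a list of (start, end) pairs, assumed here to be
    normalized to [0, 1] by the video duration (a hypothetical encoding).
    """
    a = cosine_alpha_bar(t, num_steps)
    noisy = []
    for start, end in proposals:
        noisy.append(tuple(
            math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for v in (start, end)
        ))
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    ground_truth = [(0.10, 0.35), (0.50, 0.90)]
    # Small t keeps proposals close to the ground truth; large t makes
    # them nearly pure Gaussian noise, i.e. random proposals.
    for step in (0, 250, 1000):
        print(step, diffuse_proposals(ground_truth, step, 1000, rng))
```

At step 0 the proposals are returned unchanged, while at the final step almost no ground-truth signal remains. This matches the abstract's description of inference starting from random temporal proposals.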

research
11/17/2022

DiffusionDet: Diffusion Model for Object Detection

We propose DiffusionDet, a new framework that formulates object detectio...
research
08/14/2023

DiffSED: Sound Event Detection with Denoising Diffusion

Sound Event Detection (SED) aims to predict the temporal boundaries of a...
research
12/06/2022

DiffusionInst: Diffusion Model for Instance Segmentation

Recently, diffusion frameworks have achieved comparable performance with...
research
07/14/2022

Proposal-Free Temporal Action Detection via Global Segmentation Mask Learning

Existing temporal action detection (TAD) methods rely on generating an o...
research
03/31/2023

Diffusion Action Segmentation

Temporal action segmentation is crucial for understanding long-form vide...
research
04/06/2023

Boundary-Denoising for Video Activity Localization

Video activity localization aims at understanding the semantic content i...
research
12/08/2021

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

Today's VidSGG models are all proposal-based methods, i.e., they first g...
