Human visual recognition is a sparse process, where only a few salient v...
Traditional video action detectors typically adopt the two-stage pipelin...
Traditional object detectors employ the dense paradigm of scanning over
...
The classification and regression head are both indispensable components...
Spatial downsampling layers are favored in convolutional neural networks...