Learning Discriminative Representations for Skeleton Based Action Recognition
Human action recognition aims to classify the category of human action from a video segment. Recently, researchers have focused on designing GCN-based models to extract features from skeletons for this task, because skeleton representations are more efficient and robust than other modalities such as RGB frames. However, when using skeleton data alone, important cues such as interacting objects are discarded, leaving some ambiguous actions hard to distinguish and prone to misclassification. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), consisting of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative skeleton representations. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head can be imposed on different stages of GCNs to build a multi-level refinement for stronger supervision. Extensive experiments are conducted on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. Our proposed models obtain competitive results compared with state-of-the-art methods and can help to discriminate those ambiguous samples.
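The core idea of discovering ambiguous samples and calibrating them in feature space can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the paper's FR Head: it marks a sample as ambiguous when its nearest and second-nearest class prototypes are within a `margin` of each other, then pulls it toward its true-class prototype by a fraction `step`. All names and hyperparameters here are illustrative.

```python
import numpy as np

def refine_features(feats, labels, margin=0.2, step=0.5):
    """Toy sketch of ambiguous-sample calibration (not the paper's code).

    feats:  (N, D) feature vectors from some network stage.
    labels: (N,) integer class labels.
    A sample is 'ambiguous' if the gap between its distances to the
    nearest and second-nearest class prototypes is below `margin`;
    such samples are pulled toward their true-class prototype.
    """
    classes = np.unique(labels)                       # sorted class ids
    protos = np.stack([feats[labels == c].mean(axis=0) for c in classes])
    # (N, C) distance from every sample to every class prototype
    d = np.linalg.norm(feats[:, None, :] - protos[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    idx = np.arange(len(feats))
    gap = d[idx, order[:, 1]] - d[idx, order[:, 0]]
    ambiguous = gap < margin
    # calibrate: move ambiguous samples toward their true-class prototype
    true_idx = np.searchsorted(classes, labels)
    refined = feats.copy()
    refined[ambiguous] += step * (protos[true_idx[ambiguous]] - feats[ambiguous])
    return refined, ambiguous
```

In the paper this refinement is driven by a contrastive loss and applied at multiple GCN stages; the sketch only shows the geometric intuition of detecting and pulling back borderline samples.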