Boosting Continuous Sign Language Recognition via Cross Modality Augmentation

10/11/2020
by   Junfu Pu, et al.

Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since WER is not differentiable, we usually optimize the learning model instead with the connectionist temporal classification (CTC) loss, which maximizes the posterior probability over the sequential alignment. Because of this optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., substitution, deletion and insertion, on both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms that minimize the cross modality distance between the video and its ground-truth label and make the network distinguish between real and pseudo modalities. The proposed framework can be easily extended to other existing CTC-based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed method.
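
To make the WER discussion above concrete, the following is a minimal Python sketch, not the authors' implementation. It computes WER as a length-normalized edit distance between a reference and a hypothesis gloss sequence, and it generates a pseudo label by applying one random substitution, deletion or insertion to the reference. The function names (`edit_distance`, `wer`, `pseudo_label`), the toy vocabulary and the single-edit policy are all assumptions made for illustration. The sketch covers the text side only; the abstract states that the corresponding video is edited as well, which is not modeled here.

```python
import random


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def pseudo_label(ref, vocab, rng):
    """Return a copy of ref with one random edit (hypothetical policy).

    Assumes ref is non-empty and vocab holds at least two distinct words.
    """
    out = list(ref)
    op = rng.choice(["sub", "del", "ins"])
    pos = rng.randrange(len(out))
    if op == "sub":
        out[pos] = rng.choice([w for w in vocab if w != out[pos]])
    elif op == "del" and len(out) > 1:
        del out[pos]
    else:
        # Insertion; also the fallback when deleting would empty the label.
        out.insert(pos, rng.choice(vocab))
    return out


if __name__ == "__main__":
    vocab = ["RAIN", "SUN", "WIND", "NORTH", "TOMORROW"]  # toy glosses
    ref = ["TOMORROW", "NORTH", "RAIN"]
    fake = pseudo_label(ref, vocab, random.Random(0))
    print(fake, wer(ref, fake))
```

Because each pseudo label differs from the reference by exactly one edit operation, its edit distance to the reference is 1 by construction, which is the sense in which the augmentation mimics the WER calculation.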


research
05/18/2023

Cross-modality Data Augmentation for End-to-End Sign Language Translation

End-to-end sign language translation (SLT) aims to convert sign language...
research
03/21/2023

Self-Sufficient Framework for Continuous Sign Language Recognition

The goal of this work is to develop self-sufficient framework for Contin...
research
11/25/2022

XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning

We present XKD, a novel self-supervised framework to learn meaningful re...
research
01/07/2022

Sign Language Video Retrieval with Free-Form Textual Queries

Systems that can efficiently search collections of sign language videos ...
research
10/07/2020

Universal Weighting Metric Learning for Cross-Modal Matching

Cross-modal matching has been a highlighted research topic in both visio...
research
06/24/2021

Towards Automatic Speech to Sign Language Generation

We aim to solve the highly challenging task of generating continuous sig...
research
05/03/2023

SeqAug: Sequential Feature Resampling as a modality agnostic augmentation method

Data augmentation is a prevalent technique for improving performance in ...
