Multi-Modal Zero-Shot Sign Language Recognition

09/02/2021, by Razieh Rastgoo, et al.

Zero-Shot Learning (ZSL) has rapidly advanced in recent years. Towards overcoming the annotation bottleneck in Sign Language Recognition (SLR), we explore the idea of Zero-Shot Sign Language Recognition (ZS-SLR) with no annotated visual examples, leveraging the textual descriptions of the sign classes. To this end, we propose a multi-modal ZS-SLR model harnessing the complementary capabilities of deep features fused with skeleton-based ones. A Transformer-based model and a C3D model are used for hand detection and deep feature extraction, respectively. To make a trade-off between the dimensionality of the skeleton-based and deep features, we use an Auto-Encoder (AE) on top of a Long Short Term Memory (LSTM) network. Finally, a semantic space is used to map the visual features to the lingual embedding of the class labels, obtained via the Bidirectional Encoder Representations from Transformers (BERT) model. Results on four large-scale datasets, RKS-PERSIANSIGN, First-Person, ASLVID, and isoGD, show the superiority of the proposed model compared to state-of-the-art alternatives in ZS-SLR.


I Introduction

Advances in Sign Language Recognition (SLR) have been predominantly driven by the advent of accurate capturing technologies. Furthermore, deep neural networks have achieved the best performance in most vision tasks [1, 2, 3, 4], such as SLR [5, 6, 7, 8, 9, 10, 11], video classification [12], action recognition [13], and gesture recognition [14, 15]. However, the need for a huge number of labelled training samples makes such models inherently biased towards the seen classes at prediction time, and less useful where there is not enough labeled data for all classes. In addition, conventional deep learning methods for SLR are not able to recognize samples from a new class/concept. This is where Zero-Shot Learning (ZSL) comes into play. To this end, ZSL evaluates the effectiveness of the embedding space, which is built from the input data and the auxiliary information.

Although previous SLR methods have achieved state-of-the-art performance [7, 8, 9, 10, 11], they suffer from the annotation bottleneck and do not work well when faced with a sample from a class unseen during training. To address these weaknesses, we explore the idea of ZS-SLR with no annotated visual examples. We propose a ZSL model for SLR from multi-modal data and hybrid features. One of the input modalities used in the proposed model is 3D skeleton data, which has received considerable attention in recent years [7, 8, 9, 10, 11]. Generally, skeleton representation is beneficial because it is compact and can robustly separate the action subject (the human) from the background. Furthermore, the availability of large-scale annotated skeleton datasets has given researchers room to develop different approaches to skeleton-based SLR. Considering these advantages, the proposed model employs several skeleton-based features.

While ZSL plays an effective role in getting closer to real-life applications, it faces some major challenges, including domain shift, bias, cross-domain knowledge transfer, semantic loss, and hubness. To tackle these challenges, two main methodologies are used in ZSL-based tasks: embedding-based and generative-based models. While embedding-based models aim to map the visual features and semantic attributes into a common embedding space, generative-based models aim to generate visual features for unseen classes from the semantic attributes. In this paper, we propose an embedding-based model using a Transformer [16], a 3D Convolutional Neural Network (3DCNN) [17], an Auto-Encoder (AE) [18], a Long Short Term Memory (LSTM) network [19], and the Bidirectional Encoder Representations from Transformers (BERT) model [20], operating on multi-modal inputs and hybrid features for ZS-SLR.

Our main contributions can be listed as follows:

ZS-SLR: Towards overcoming the annotation bottleneck in SLR, we formulate the problem of ZS-SLR with no annotated visual examples, leveraging the textual descriptions of the sign classes.

Hand detection: Since hand detection is an important step for hand sign language recognition, we configure a Transformer-based model, as a fast and accurate hand detector, to tackle the challenges of current hand detection models. To the best of our knowledge, this is the first time that a Transformer-based model is configured for hand detection.

Hybrid feature representation: We propose an end-to-end hybrid model for ZS-SLR that combines handcrafted and deep features. We use skeleton-based features, including distances, angles, and singular values of the human hand joints, fused with deep features. To the best of our knowledge, this is the first time that skeleton-based features and deep features are combined for ZS-SLR.

Multi-modal inputs: We include three modalities in the model: skeleton, RGB video, and text. Exploiting the complementary capabilities of these three modalities is beneficial for our model. The proposed model is the first multi-modal model for the ZS-SLR task.

Performance: We perform a detailed analysis of the proposed model using two evaluation protocols. Results on four large-scale datasets, RKS-PERSIANSIGN, First-Person, ASLVID, and isoGD, show the superiority of the proposed model compared to state-of-the-art alternatives in ZS-SLR.

The remainder of this paper is organized as follows. Section 2 briefly reviews recent works in ZS-SLR, Zero-Shot Gesture Recognition (ZS-GR), and Zero-Shot Action Recognition (ZS-AR). The proposed model is presented in detail in Section 3. Results are reported in Section 4. Sections 5 and 6 analyze the results and conclude the work with comments on possible lines of future research, respectively.

II Related work

The ZSL idea was initially proposed by Palatucci et al. [25] and Larochelle et al. [26]. Since there is only one prior work on ZS-SLR [27], we briefly review recent works in the related areas, especially ZS-GR and ZS-AR (summarized in Table I). The ZSL scenario typically needs to map the visual features to the semantic embedding obtained from the unseen data [28, 29, 30]. While the semantic representation can be either class-related attributes or the embedding of the class labels [31, 32], the visual representation is either handcrafted features [33, 34, 35] based on the Improved Dense Trajectories (IDT) method [36], or deep features [37, 32, 29] extracted with pre-trained models such as the C3D network [17]. While the IDT features of a video can be represented with a single vector, deep features are extracted from video segments of a predefined length. Different techniques can be applied to the semantic and visual domains to obtain a powerful discriminative capability in ZSL models coping with unseen data. Generally, we can categorize the reviewed works into inductive and transductive ZSL:

Inductive ZSL:

In this group, only the labeled training data from seen classes are accessible and the test data is completely unknown at training time. Bishay et al. proposed a model, entitled Temporal Attentive Relation Network (TARN), for ZS-AR. A combination of a C3D network and a Bidirectional Gated Recurrent Unit (Bi-GRU) is used to obtain a single vector from the embedding module. Finally, the encoded video is mapped into the Word2Vec embedding. Results on the UCF101 and HMDB-51 datasets show that the model performs comparably to state-of-the-art alternatives in ZS-AR [38]. Hahn et al. proposed a model, entitled Action2Vec, combining linguistic embeddings of the class labels with spatio-temporal features obtained from the video inputs. The architecture of this model includes a C3D model for visual feature extraction and a two-layer hierarchical LSTM network for temporal features. Results on the UCF101, HMDB-51, and Kinetics datasets confirm the superiority of Action2Vec, achieving state-of-the-art performance with 1.3%, 4.38%, and 7.75% relative margins, respectively [39]. Gupta et al. proposed a generative model for ZS-AR from skeleton data. A Variational Auto-Encoder (VAE) is used as the base architecture to learn the generative space of the latent representations. The latent visual representations of skeleton-based human actions are fused with the syntactic information obtained from the textual descriptions. Results on the NTU-60 and NTU-120 datasets show that the model outperforms state-of-the-art models in ZS-AR with 4.34% and 3.16% relative improvements, respectively [1]. Bilge et al. proposed a model for ZS-SLR using a combination of I3D and BERT. The body and hand regions are used to obtain the visual features through 3D-CNNs, and longer temporal relationships are captured via a Bidirectional Long Short-Term Memory (BLSTM) network. Relying on textual descriptions, they extended the current ASL data into a new dataset, ASL-Text, which includes 250 signs and the corresponding sign descriptions. Results on this dataset show that the model can provide a basis for further exploration of ZSL in SLR [27]. Madapana and Wachs proposed a bi-linear Auto-Encoder (AE) model for ZS-GR that jointly optimizes reconstruction and classification errors. Furthermore, they analyzed three feature extraction techniques, raw features, handcrafted features, and deep features, and conducted experiments to compare the unseen-class accuracies obtained with each. Results on a dataset composed of subsets of two public datasets show that the model achieves results comparable to state-of-the-art models in ZS-GR [40]. Wu et al. proposed an attribute-based model for zero-shot dynamic hand gesture recognition using a BLSTM network. The skeletal joint data obtained by a Leap Motion Controller (LMC) is used in the model. Relying on the features extracted by the BLSTM network and the semantic attributes, a Semantic Auto-Encoder (SAE) is used to learn a mapping from the feature space to the semantic space. Results on their own dataset show the effectiveness of the proposed model [41].
Transductive ZSL:

In this group, the unlabeled test data are available during training. Alexiou et al. proposed a ZSL model using the Fisher Vector (FV) method, as a low-complexity and unsupervised method, to obtain the visual representation of the input video. A Support Vector Machine (SVM) regressor is used to map the visual domain to the word-vector domain. Furthermore, they employ a search algorithm to find a set of words with semantic meanings similar to the class labels. Results on two large-scale datasets, UCF101 and HMDB-51, show a recognition accuracy improvement compared to state-of-the-art models in ZS-AR [42]. Mishra et al. provided a probabilistic generative model for ZS-AR by representing each class via a Gaussian distribution. This model can synthesize new examples for any unseen class by sampling from the class distribution. The C3D network is used for visual feature extraction. Results on three datasets, UCF101, HMDB-51, and Olympic, show that the approach achieves performance comparable to state-of-the-art methods in the field [32]. Wang and Chen proposed a ZS-AR model using textual descriptions as a word vector representation. Furthermore, they employed action-related still images for the semantic representation of video-based human actions in ZSL. The visual features are extracted using a CNN model. Results on the UCF101 and HMDB-51 datasets show the effectiveness of the proposed semantic representations [41].

Method | Ref. | Model | Dataset | Task | Year
Inductive | [38] | C3D, Bi-GRU, Word2Vec | UCF101, HMDB-51 | AR | 2019
Inductive | [39] | C3D, LSTM | UCF101, HMDB-51 | AR | 2019
Inductive | [1] | VAE | NTU-60, NTU-120 | AR | 2021
Inductive | [27] | I3D, BLSTM, BERT | ASL-Text | SLR | 2019
Inductive | [40] | BLSTM | Subset of CGD 2013 dataset and MSRC-12 | GR | 2020
Transductive | [42] | FV, SVM | UCF101, HMDB-51 | AR | 2016
Transductive | [32] | C3D, Word2Vec | UCF101, HMDB-51, Olympic | AR | 2018
Transductive | [41] | CNN | UCF101, HMDB-51 | AR | 2017
TABLE I: A summary of the reviewed works.

III Proposed model

In this section, we present the details of the proposed model. Figure 1 shows an overview of the proposed architecture.

Fig. 1:

Proposed model for ZS-SLR, including six main blocks: (1) Transformer-based hand detection, (2) Hand pose estimation, (3) Lingual, skeleton-based, and deep feature extraction, (4) Features fusion, (5) Semantic space, and (6) Classification.

III-A Problem definition

Let $\mathcal{D}_S=\{(x_{S,i}, y_{S,i})\}_{i=1}^{N}$ denote the set of $N$ pairs of video $x_S$ and the corresponding class label $y_S$ of the seen data during training, with the subscript $S$ standing for seen data. On similar lines, let $\mathcal{D}_U=\{(x_{U,j}, y_{U,j})\}_{j=1}^{M}$ denote the set of $M$ pairs of video $x_U$ and the corresponding class label $y_U$ of the unseen data during testing, with the subscript $U$ standing for unseen data. Suppose the seen and unseen label sets are disjoint, $\mathcal{Y}_S \cap \mathcal{Y}_U = \emptyset$. Also, let $\hat{y}$ represent the test-time class prediction. For ZS-SLR, we have $\hat{y} \in \mathcal{Y}_U$. ZSL classifiers require generalization capability to unseen test classes. One way to obtain this capability is nearest-neighbor search in a semantic space. To make an inference, given a video $x_U$, we infer the corresponding semantic embedding $z=\phi(x_U)$ and classify $x_U$ as the nearest neighbor of $z$ in the embedding of the test classes. Finally, a trained classification model $M(\cdot)$ outputs

$$\hat{y} = M(x_U) = \underset{y \in \mathcal{Y}_U}{\arg\min}\; D\big(\phi(x_U), \psi(y)\big), \qquad (1)$$

where $D(\cdot,\cdot)$ is the cosine distance and the semantic embedding is computed using BERT [20], i.e., $\psi(y)=\mathrm{BERT}(y)$. The classification function is thus a combination of the visual and lingual encodings.
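As a minimal illustration of Eq. (1), the snippet below classifies a projected visual embedding by nearest-neighbor search over BERT class-label embeddings under cosine distance; the embeddings here are random placeholders rather than the trained components of the model.

```python
import numpy as np

def cosine_distance(a, b):
    # D(a, b) = 1 - cos(a, b)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def zero_shot_predict(visual_embedding, class_embeddings):
    """Eq. (1): return the unseen class whose label embedding is the
    nearest neighbor of the projected visual embedding."""
    distances = {label: cosine_distance(visual_embedding, emb)
                 for label, emb in class_embeddings.items()}
    return min(distances, key=distances.get)

# Toy usage with random placeholder embeddings (1024-d, as in the paper).
rng = np.random.default_rng(0)
unseen_classes = {"sister": rng.normal(size=1024),
                  "brother": rng.normal(size=1024),
                  "sailboat": rng.normal(size=1024)}
z = rng.normal(size=1024)          # stands in for phi(x_U)
print(zero_shot_predict(z, unseen_classes))
```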

III-B Model details

Different blocks of the proposed model are described in the following.

III-B1 Inputs

Three input modalities are used in the model: skeleton, video, and text. Two input modalities, skeleton and video, are used for visual feature extraction. The text modality is employed for lingual embedding. The visual features and lingual embedding are used as the input and output of the semantic space, respectively.

III-B2 Transformer-based hand detection

Hand detection is an important step of the proposed model. While different models are used for hand detection, such as Faster-RCNN [21], the Single Shot Detector (SSD) [22], or You Only Look Once (YOLO) v3 [23], the performance of these models depends heavily on hand-designed components such as the Non-Maximum Suppression (NMS) procedure or anchor generation, which need to explicitly encode prior knowledge about the task. To tackle these challenges, we configure the DEtection TRansformer (DETR), a Transformer-based object detection model developed by Facebook AI [16], for hand detection. DETR provides an end-to-end architecture that eliminates such customized components when predicting bounding boxes. This model is configured for hand detection in the proposed model to benefit from the capabilities of the Transformer.
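For illustration, a COCO-pretrained DETR model can be loaded through torch.hub as sketched below; fine-tuning it on hand bounding boxes (as required for the hand detector described here) and the 0.9 confidence threshold are assumptions made only for this example.

```python
import torch
from PIL import Image
import torchvision.transforms as T

# COCO-pretrained DETR from the official Facebook AI repository.
# For hand detection it would first be fine-tuned on hand boxes (assumed here).
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

preprocess = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def detect_boxes(image_path, score_threshold=0.9):
    """Return normalized (cx, cy, w, h) boxes whose confidence exceeds the threshold."""
    img = Image.open(image_path).convert("RGB")
    inputs = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        outputs = model(inputs)
    # DETR predicts a fixed set of queries; drop the trailing "no object" class.
    probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]
    keep = probs.max(-1).values > score_threshold
    return outputs["pred_boxes"][0, keep]
```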

III-B3 Hand pose estimation

After hand detection, the 3D hand keypoints are obtained using the OpenPose model [24], which estimates 21 3D keypoints for each hand. The pixel regions of the detected hands and the estimated 3D hand keypoints are processed in the feature extraction block.

III-B4 Features extraction

The features used in the proposed model fall into visual and lingual categories:
Visual features: In this block, the visual features are extracted in two main categories:

  • Skeleton-based features:

    Three feature types are extracted from skeleton data: distances, angles, and singular values from Singular Value Decomposition (SVD).


    Distance features: After exploring the relations between different joints of the human hands, some joint distances have been included: for each selected pair of joints, the norm of the vector between the two joints is used as a feature. The middle part of Figure 2 shows some samples of these features.
    Angle features: Relying on the relations between different joints of the hand fingers, we selected two consecutive triples of hand keypoints in each finger to obtain a more accurate description of it. In more detail, let $\{p_1, p_2, p_3, p_4\}$ denote the set of four 3D keypoints of a hand finger. We compute two angles, $\theta_1$ and $\theta_2$, as the angles formed by the two consecutive triples $(p_1, p_2, p_3)$ and $(p_2, p_3, p_4)$ of these keypoints. The left part of Figure 2 shows a sample of this feature.
    SVD: Based on the relations between different joints of the human hands, we selected sets of human hand keypoints (such as the four keypoints of each finger; see Table II) to obtain the SVD features. In more detail, arranging the 3D coordinates of a selected set of keypoints in a matrix $A$ and considering its decomposition $A = U\Sigma V^{T}$, we calculate the SVD and include the singular values on the diagonal of $\Sigma$ in the model features. The right part of Figure 2 shows a sample of this feature.

  • Pixel-level features: The C3D model is employed to extract pixel-level features from every 16-frame snippet of the video. The features of all snippets are then passed to an LSTM layer on top of the C3D network to obtain the temporal features. Furthermore, to obtain a compact representation and to balance the dimensionality between the deep and skeleton-based features, an Auto-Encoder (AE) consisting of two Fully Connected (FC) layers is applied to the LSTM output as a dimension-reduction step (a sketch of this branch is given below).
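A minimal PyTorch sketch of this video branch follows; the 4096-dimensional C3D snippet feature and the use of the final LSTM hidden state are assumptions, while the 1024 hidden neurons and the 510-dimensional AE code follow the implementation details given later in Section IV-A.

```python
import torch
import torch.nn as nn

class SnippetTemporalEncoder(nn.Module):
    """LSTM over per-snippet C3D features, followed by a two-FC-layer
    auto-encoder that compresses the video representation to 510 dims
    (the 4096-d C3D output dimension is an assumption)."""
    def __init__(self, c3d_dim=4096, hidden=1024, latent=510):
        super().__init__()
        self.lstm = nn.LSTM(c3d_dim, hidden, batch_first=True)
        # AE: encoder FC + decoder FC; the 510-d code is the deep feature.
        self.encoder = nn.Linear(hidden, latent)
        self.decoder = nn.Linear(latent, hidden)

    def forward(self, snippet_feats):
        # snippet_feats: (batch, num_snippets, c3d_dim), one row per 16-frame snippet
        _, (h_n, _) = self.lstm(snippet_feats)
        video_feat = h_n[-1]                 # (batch, hidden)
        code = torch.relu(self.encoder(video_feat))
        recon = self.decoder(code)           # reconstruction used for the AE loss
        return code, recon, video_feat

# Toy usage: 2 videos, 5 snippets each, with placeholder C3D features.
model = SnippetTemporalEncoder()
feats = torch.randn(2, 5, 4096)
deep_feature, recon, target = model(feats)
print(deep_feature.shape)   # torch.Size([2, 510])
```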

Lingual features: We use the sentence BERT model [20] to obtain a 1024-dimensional embedding of each class label.
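As an illustration, the sketch below obtains 1024-dimensional label embeddings with the sentence-transformers library; the specific checkpoint is an assumption (any BERT-large based sentence encoder yields 1024-dimensional vectors), and the class labels are examples taken from the evaluated datasets.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# A BERT-large based sentence encoder yields 1024-d embeddings; the exact
# checkpoint used in the paper is not specified, so this choice is an assumption.
encoder = SentenceTransformer("bert-large-nli-mean-tokens")

class_labels = ["sister", "brother", "alligator", "sailboat"]
label_embeddings = encoder.encode(class_labels)     # shape: (4, 1024)
print(label_embeddings.shape)

# Normalized embeddings can be reused directly for the cosine-based matching in Eq. (1).
normalized = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
```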

III-B5 Features fusion

The three visual feature types are fused and fed to the semantic space.

III-B6 Semantic space

The main goal of the semantic space is to map the visual features to the lingual embedding using a projection function learned with deep networks. During training, the projection function is learned using the seen data. Since deep networks can be used as function approximators, the learned model is employed at the inference phase to predict the lingual embedding of the unseen data. The embedding predicted by the projection network is used to obtain the degree of similarity to the unseen class embeddings. A simple similarity measure uses the embedding of the class labels. The similarity between the predicted embedding and an unseen class is defined as follows:

$$\mathrm{sim}\big(\hat{z}, y_{U,i}\big) = \frac{\hat{z}\cdot \psi(y_{U,i})}{\lVert \hat{z}\rVert\,\lVert \psi(y_{U,i})\rVert}, \qquad (2)$$

where $y_{U,i}$ is the class label corresponding to the $i$-th class from the unseen data, $\psi(y_{U,i})$ is its lingual embedding, and $\hat{z}$ is the class embedding predicted by the trained projection network. This prediction is compared to all class embeddings obtained from the unseen data to select the closest one.
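The following sketch shows a two-FC-layer projection network together with the cosine similarity of Eq. (2) and the softmax classification of the next subsection; the hidden width and activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticProjection(nn.Module):
    """Two-FC-layer projection from fused visual features (1024-d) to the
    BERT label-embedding space (1024-d); the hidden width is an assumption."""
    def __init__(self, visual_dim=1024, hidden_dim=1024, lingual_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, lingual_dim),
        )

    def forward(self, fused_visual):
        return self.net(fused_visual)

def classify(projection, fused_visual, unseen_label_embeddings):
    """Eq. (2) followed by the softmax classification step."""
    z_hat = projection(fused_visual)                              # (batch, 1024)
    sims = F.cosine_similarity(z_hat.unsqueeze(1),
                               unseen_label_embeddings.unsqueeze(0), dim=-1)
    probs = F.softmax(sims, dim=-1)                               # softmax over similarity degrees
    return probs.argmax(dim=-1), probs

# Toy usage with placeholder tensors.
proj = SemanticProjection()
visual = torch.randn(2, 1024)
labels = torch.randn(5, 1024)       # 5 unseen class embeddings
pred, probs = classify(proj, visual, labels)
print(pred.shape, probs.shape)      # torch.Size([2]) torch.Size([2, 5])
```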

III-B7 Classification

In this phase, the final class label is recognized by applying a Softmax layer to the similarity degrees obtained in the previous phase.

Fig. 2: Visualization of the skeleton-based features: (a) Angle, (b) Distance, and (c) SVD.

IV Results

Here, the details of the model implementation and obtained results are presented.

IV-A Implementation details

The proposed model is evaluated on an Intel(R) Xeon(R) CPU E5-2699 machine (2 processors) with 90 GB of RAM, running Microsoft Windows 10 and Python, with 10 NVIDIA GPUs. The PyTorch library is used for implementation. Four datasets are used for evaluation, and all results are reported on the unseen test data. The proposed model is optimized using Adam with a learning rate of 1e-3 for a total of 300 training epochs. Following the C3D architecture, we split each video into snippets of 16 frames. For each video frame, we include three skeleton-based features: the distances, angles, and singular values from the SVD. Since each sign can involve one or two hands, we define these skeleton-based features not only for each hand but also for the relations between the two hands. For signs containing only one hand, we simply repeat the features to be compatible with the two-hand case. Details of these features are as follows (see Table II):
Distance features: We include 20 distance features for each hand in a sign. Furthermore, 21 distance features corresponding to the peer-to-peer distances between the joints of the two hands are added to the per-hand distances. In total, we have 61 distance features for each video sample.
Angle features: For each hand finger, we include two angle features, giving 10 angle features per hand and 20 angle features for the two hands.
SVD features: While 21 3D hand keypoints are obtained for each hand, we use twenty of them and ignore the palm keypoint. Considering four keypoints per finger, the 20 3D hand keypoints are placed in a matrix of shape 4x15. Furthermore, three singular values are obtained from the 21 3D hand keypoints. In this way, we have a total of seven singular values per hand. Considering the relation between the two hands, we also apply the SVD to the concatenated joints of the two hands in four different shapes, 21x6, 42x3, 8x15, and 4x30, giving 21 additional singular values. In total, we have 35 singular values for each video sample (a sketch of this computation is given below).
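As a sketch of how these skeleton-based features can be computed from the 21 estimated 3D keypoints per hand: the helpers below are hypothetical, assuming an OpenPose-style ordering with joint 1 as the palm and four joints per finger, and the exact joint arrangement of the 8x15 and 4x30 matrices is not specified in the text, so the layout used here (palm excluded) is one reading chosen to make the stated shapes consistent.

```python
import numpy as np

def distance_feature(hand, i, j):
    """Norm of the vector between joints i and j (1-based indices as in Table II)."""
    return np.linalg.norm(hand[i - 1] - hand[j - 1])

def angle_feature(hand, i, j, k):
    """Angle at joint j formed by the triple (i, j, k)."""
    u, v = hand[i - 1] - hand[j - 1], hand[k - 1] - hand[j - 1]
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def svd_features(hand_left, hand_right):
    """35 singular values following the matrix shapes described above."""
    def finger_matrix(hand):
        # Drop the palm, keep 5 fingers x 4 joints, laid out as a 4x15 matrix.
        return np.hstack(hand[1:].reshape(5, 4, 3))

    def sv(m):
        return list(np.linalg.svd(m, compute_uv=False))

    feats = []
    for hand in (hand_left, hand_right):
        feats += sv(finger_matrix(hand))                  # 4 values (4x15)
        feats += sv(hand)                                 # 3 values (21x3) -> 7 per hand
    fl, fr = finger_matrix(hand_left), finger_matrix(hand_right)
    feats += sv(np.hstack([hand_left, hand_right]))       # 21x6 -> 6 values
    feats += sv(np.vstack([hand_left, hand_right]))       # 42x3 -> 3 values
    feats += sv(np.vstack([fl, fr]))                      # 8x15 -> 8 values
    feats += sv(np.hstack([fl, fr]))                      # 4x30 -> 4 values
    return np.array(feats)                                # 35 values in total

left, right = np.random.rand(21, 3), np.random.rand(21, 3)
print(distance_feature(left, 5, 9), angle_feature(left, 2, 3, 4))
print(svd_features(left, right).shape)                    # (35,)
```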

Considering all of the skeleton-based features, we have 116 features. The features extracted from the video snippets are fed to the LSTM, which has 1024 hidden neurons. Using an AE, we obtain a latent feature vector of size 510. To balance the skeleton-based and deep features, we repeat the skeleton-based features before fusing them with the deep features. Since the singular values play an important role among the skeleton-based features, we assign a higher weight to them; to this end, we repeat the distance, angle, and SVD features four, three, and six times, respectively. Since the pixel-level feature vector has size 510, the final fusion of the skeleton-based and deep features yields a feature vector of size 1024. The lingual embedding is a 1024-dimensional vector. In the semantic space, we use a deep model with two Fully Connected (FC) layers to map the visual features to the lingual embedding. We use two evaluation protocols to analyze the results; these protocols are compatible with the current works in the field to allow a fair comparison with state-of-the-art models.
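Assuming the fusion is a simple concatenation of the repeated skeleton-based features with the 510-dimensional deep feature, the dimensions stated above add up as follows (a quick sanity check, not code from the paper):

```python
# Skeleton features per video: 61 distances + 20 angles + 35 singular values = 116.
distances, angles, svd_vals = 61, 20, 35
assert distances + angles + svd_vals == 116

# Repetition weights before fusion: distances x4, angles x3, SVD x6.
skeleton_dim = distances * 4 + angles * 3 + svd_vals * 6   # 244 + 60 + 210 = 514
deep_dim = 510                                             # AE latent code from the video branch
assert skeleton_dim + deep_dim == 1024                     # fused feature fed to the semantic space
```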


First evaluation protocol: In this protocol, we randomly select 80% of the classes for training and the remaining classes for testing. This protocol provides the model with more seen data.
Second evaluation protocol: In this protocol, we randomly select 50% of the classes for training and the remaining classes for testing. This more challenging protocol assigns equal weight to seen and unseen classes.
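A minimal sketch of the random class split used by both protocols (class names are placeholders):

```python
import random

def split_classes(class_names, seen_ratio=0.8, seed=0):
    """Randomly split class labels into seen (train) and unseen (test) sets,
    as in the two evaluation protocols (seen_ratio = 0.8 or 0.5)."""
    rng = random.Random(seed)
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    n_seen = int(len(shuffled) * seen_ratio)
    return shuffled[:n_seen], shuffled[n_seen:]

# Example: protocol 1 on a 100-class dataset such as RKS-PERSIANSIGN.
classes = [f"sign_{i}" for i in range(100)]
seen, unseen = split_classes(classes, seen_ratio=0.8)
print(len(seen), len(unseen))   # 80 20
```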

Feature | One-hand | Two-hands
Distance | (5,9), (9,13), (13,17), (17,21), (5,1), (9,1), (13,1), (17,1), (21,1), (5,2), (9,6), (13,10), (17,14), (21,18), (9,4), (9,12), (13,8), (13,16), (17,12), (17,20) | All joints
Angle | (2,3,4), (3,4,5), (6,7,8), (7,8,9), (10,11,12), (11,12,13), (14,15,16), (15,16,17), (18,19,20), (19,20,21) | -
SVD | (2,3,4,5), (6,7,8,9), (10,11,12,13), (14,15,16,17), (18,19,20,21), (2,6,10,14,18), (3,7,11,15,19), (4,8,12,16,20), (5,9,13,17,21), (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21) | All joints
TABLE II: Details of the skeleton-based features. Each tuple shows a collection of the joints considered for a skeleton-based feature.

IV-B Datasets

Four datasets [11, 43, 44, 45] are used in our evaluations; details can be found in Table III. Only the RGB samples of the First-Person and isoGD datasets are used. Furthermore, some samples of these datasets are shown in Figure 3.

Fig. 3: Dataset samples: (a) RKS-PERSIANSIGN, (b) First-Person, (c) ASLVID, (d) isoGD.
Dataset | Total samples | Total CN | First protocol (Train CN / Test CN) | Second protocol (Train CN / Test CN)
RKS-PERSIANSIGN [11] | 10000 | 100 | 80 / 20 | 50 / 50
First-Person [43] | 10000 | 45 | 36 / 9 | 22 / 22
ASLVID [44] | 9,800 | 250 | 200 / 50 | 125 / 125
isoGD [45] | 47933 | 249 | 199 / 50 | 124 / 124
TABLE III: Details of the dataset samples used in the train and test. In this table, CN is used for Class Number.

IV-C Experimental results

Here, we report the results of the proposed model using the two protocols mentioned in the previous subsection.

IV-C1 Ablation analysis

We analyze the impact of different configurations of the proposed model.
Different visual features: We use both skeleton-based and deep features in the proposed model. To analyze the impact of each feature, we performed a detailed per-feature analysis, shown in Table IV. As this table shows, the proposed model achieves its highest performance using the fused features.

Visual modality | Method | RKS-PERSIANSIGN (P1 / P2) | First-Person (P1 / P2) | ASLVID (P1 / P2) | isoGD (P1 / P2)
Skeleton | Distances + BERT | 52.4 / 50.8 | 50.2 / 48.6 | 54.0 / 46.0 | 46.5 / 40.0
Skeleton | Angles + BERT | 52.8 / 51.2 | 51.6 / 49.2 | 54.2 / 46.1 | 46.8 / 40.2
Skeleton | SVD + BERT | 53.4 / 52.0 | 52.5 / 50.1 | 54.8 / 46.3 | 47.2 / 41.0
Skeleton | Distances + Angles + BERT | 53.1 / 51.8 | 52.1 / 49.8 | 55.1 / 47.8 | 47.0 / 40.8
Skeleton | SVD + Angles + BERT | 54.0 / 52.3 | 53.7 / 50.7 | 57.0 / 49.4 | 48.1 / 41.6
Skeleton | SVD + Distances + BERT | 54.2 / 52.6 | 54.1 / 51.1 | 58.2 / 50.8 | 48.8 / 42.5
Skeleton | Distances + Angles + SVD + BERT | 55.4 / 53.1 | 54.9 / 51.9 | 58.8 / 51.4 | 49.4 / 43.6
RGB | C3D + LSTM + BERT | 59.7 / 55.2 | 55.5 / 52.5 | 60.1 / 52.6 | 50.2 / 44.2
Skeleton+RGB | Distances + C3D + LSTM + BERT | 64.2 / 57.8 | 57.6 / 53.2 | 61.5 / 53.6 | 50.9 / 44.9
Skeleton+RGB | Angles + C3D + LSTM + BERT | 65.4 / 59.3 | 58.8 / 53.9 | 62.2 / 54.8 | 51.8 / 45.8
Skeleton+RGB | SVD + C3D + LSTM + BERT | 68.1 / 62.1 | 61.4 / 54.7 | 63.1 / 55.7 | 52.5 / 46.6
Skeleton+RGB | Angles + SVD + C3D + LSTM + BERT | 68.8 / 62.8 | 61.8 / 55.0 | 63.6 / 55.8 | 53.8 / 46.8
Skeleton+RGB | Distances + SVD + C3D + LSTM + BERT | 69.4 / 63.4 | 62.2 / 55.2 | 64.0 / 56.1 | 54.1 / 47.2
Skeleton+RGB | Distances + Angles + C3D + LSTM + BERT | 66.2 / 61.1 | 60.8 / 54.8 | 61.6 / 55.6 | 53.6 / 46.6
Skeleton+RGB | Distances + Angles + SVD + C3D + LSTM + BERT | 74.6 / 69.2 | 67.2 / 61.7 | 68.8 / 60.5 | 60.2 / 52.1
TABLE IV: Recognition accuracy of the proposed model using different feature combinations.

Different input modalities: We initially included only the skeleton modality in the model. The skeleton representation is beneficial since it is compact, has low complexity, and robustly separates the subject (the human) from the background. However, obtaining high performance using only one modality is hard, so we also included the pixel information corresponding to the RGB video modality. The deep visual features are obtained using the C3D model followed by the LSTM network. Using two modalities in the visual embedding provides complementary capabilities to the model. Results of the model using these modalities are shown in Table IV. As this table shows, the proposed model obtains the highest performance using the fused features of the skeleton and video modalities.

LSTM network: Using different numbers of hidden neurons in the LSTM network can change the model performance. We evaluated different hidden sizes and report the results in Table V. As this table shows, the proposed model obtains its best performance using 1024 hidden neurons in the LSTM network.

Model | RKS-PERSIANSIGN (P1 / P2) | First-Person (P1 / P2) | ASLVID (P1 / P2) | isoGD (P1 / P2)
LSTM-256 | 72.7 / 67.1 | 66.1 / 59.8 | 67.1 / 59.1 | 58.3 / 50.6
LSTM-512 | 73.4 / 68.8 | 66.3 / 60.2 | 67.9 / 59.9 | 59.4 / 51.4
LSTM-1024 | 74.6 / 69.2 | 67.2 / 61.7 | 68.8 / 60.5 | 60.2 / 52.1
TABLE V: Recognition accuracy of the proposed model using different numbers of hidden neurons in the LSTM network. In this table, LSTM-N indicates an LSTM network with N hidden neurons.

Deep network in the semantic space: To project the visual features into the lingual embedding, Deep Neural Networks (DNNs) can be used due to their function approximation capabilities. Relying on the learning process, DNNs can learn the mapping patterns from the visual features to the lingual embedding. After training, this network is used to predict the lingual embedding for unseen visual features. We evaluated different deep networks with different settings and finally selected a deep model including two dense layers (see Table VI).
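For concreteness, a minimal training sketch for such a projection network is shown below; the cosine embedding loss is an assumption (the training objective of the projection network is not stated here), while Adam with a learning rate of 1e-3 follows the implementation details.

```python
import torch
import torch.nn as nn

# A two-FC-layer projection network, as sketched in Section III-B6.
proj = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
optimizer = torch.optim.Adam(proj.parameters(), lr=1e-3)
criterion = nn.CosineEmbeddingLoss()    # pulls predicted and true label embeddings together

def train_step(fused_visual, label_embeddings):
    """One optimization step on seen-class data:
    fused_visual     -- (batch, 1024) fused skeleton + deep features
    label_embeddings -- (batch, 1024) BERT embeddings of the ground-truth labels"""
    optimizer.zero_grad()
    predicted = proj(fused_visual)
    target = torch.ones(fused_visual.size(0))   # +1: maximize cosine similarity
    loss = criterion(predicted, label_embeddings, target)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with placeholder tensors.
print(train_step(torch.randn(8, 1024), torch.randn(8, 1024)))
```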

Model | RKS-PERSIANSIGN (P1 / P2) | First-Person (P1 / P2) | ASLVID (P1 / P2) | isoGD (P1 / P2)
LSTM-256-1 | 72.3 / 66.8 | 65.6 / 59.2 | 66.4 / 58.8 | 57.9 / 50.1
LSTM-256-2 | 72.7 / 67.1 | 66.1 / 59.8 | 67.1 / 59.1 | 58.3 / 50.6
Ours (LSTM-512-1) | 73.1 / 68.4 | 66.0 / 60.0 | 67.4 / 59.5 | 59.1 / 51.0
Ours (LSTM-512-2) | 73.4 / 68.8 | 66.3 / 60.2 | 67.9 / 59.9 | 59.4 / 51.4
Ours (LSTM-1024-1) | 73.8 / 69.0 | 66.8 / 60.9 | 68.1 / 60.1 | 59.8 / 51.9
Ours (LSTM-1024-2) | 74.6 / 69.2 | 67.2 / 61.7 | 68.8 / 60.5 | 60.2 / 52.1
TABLE VI: Recognition accuracy of the proposed model using different configurations of the semantic space. In this table, LSTM-N-M indicates an LSTM network with N hidden neurons connected to a DNN with M FC layers.

Different hand detection models: Due to the high importance of the hand detection module in SLR, we analyze the most-used models for hand detection, Faster-RCNN, SSD, YOLO, and the Transformer, within the proposed model. As Table VII shows, our model achieves higher accuracy using the Transformer-based model for hand detection.

Hand detection model | RKS-PERSIANSIGN (P1 / P2) | First-Person (P1 / P2) | ASLVID (P1 / P2) | isoGD (P1 / P2)
Faster-RCNN | 70.2 / 67.2 | 65.1 / 58.8 | 66.1 / 58.3 | 58.2 / 49.2
SSD | 71.9 / 68.6 | 66.2 / 60.1 | 67.2 / 59.3 | 59.1 / 50.9
YOLO | 72.1 / 68.9 | 66.9 / 60.8 | 67.9 / 59.9 | 59.6 / 51.3
Transformer | 74.6 / 69.2 | 67.2 / 61.7 | 68.8 / 60.5 | 60.2 / 52.1
TABLE VII: Comparison of different hand detection models used in the proposed model.

IV-C2 Comparison with state-of-the-art models

We compare our results with the state-of-the-art alternative in ZS-SLR (see Table VIII). We followed the evaluation protocols described in the previous subsections. Furthermore, we report the results of the proposed model averaged over ten runs; in each run, we randomly select the training and testing classes. Since there is only one prior work on ZS-SLR, we only compare the proposed model with Bilge et al. [27]. As Table VIII shows, the proposed model outperforms the state-of-the-art model in ZS-SLR.

Model | RKS-PERSIANSIGN (P1 / P2) | First-Person (P1 / P2) | ASLVID (P1 / P2) | isoGD (P1 / P2)
[27] | - / - | - / - | 51.4 / - | - / -
Ours | 74.6 / 69.2 | 67.2 / 61.7 | 68.8 / 60.5 | 60.2 / 52.1
TABLE VIII: Comparison with state-of-the-art models

V Discussion

We analyze the proposed model as follows:
ZS-SLR: Although many models have been proposed and have obtained state-of-the-art performance in SLR [7, 8, 9, 10, 11, 46, 47], they suffer from the annotation bottleneck and do not work efficiently for unseen classes. To overcome this weakness, we formulated ZS-SLR with no annotated visual examples. We performed different analyses of the proposed model to provide a basis for further exploration of the ZS-SLR problem.

Hand detection: Since hand detection is an important step in SLR, we analyzed some of the most-used models for hand detection, such as Faster-RCNN, SSD, and YOLO. The performance of these models depends heavily on hand-designed components, such as the NMS method or anchor generation, which need to explicitly encode prior knowledge about the task. To tackle these challenges, we configured the DEtection TRansformer (DETR), a Transformer-based object detection model, for hand detection in our model and obtained higher performance compared to the other hand detection models.

Hybrid feature representation: We fused skeleton-based features, including the distances, angles, and singular values from the SVD, with deep features to obtain more discriminative visual features. The step-by-step analysis of the different features used in the model showed that the model obtains its highest performance using the fused features (see Table IV). In the first analysis, only one type of skeleton-based feature was used in the model. As the first three rows of Table IV show, the model performs best with the SVD features; this motivated assigning a higher weight to the SVD features compared to the other skeleton-based features. In another analysis, the pixel-level features were included: we first analyzed the effect of the pixel-level features alone, and then fused one or more skeleton-based features with them. As the last seven rows of Table IV show, the proposed model is more accurate when relying on the complementary capabilities of the fused features.

Multi-modal inputs: Two visual modalities, the pixel-level information of the RGB video input and the compact representation of the hand skeleton, were fused to build a more powerful projection from the visual features into the lingual embedding obtained from the third modality. Relying on the results, our model successfully benefited from the complementary capabilities of the fused features from the multi-modal inputs and improved the state-of-the-art results in ZS-SLR.

Performance: The step-by-step analysis of the proposed model shows that it outperforms state-of-the-art alternatives in ZS-SLR. The proposed model reduces false recognitions by using the complementary capabilities of the hybrid features and multiple modalities: if the model does not have a strong capability in one of the features or modalities, it can still boost performance by relying on the others. This makes the model more robust and useful, because it is not biased towards a particular feature or modality. However, there is still much room to improve recognition accuracy in ZS-SLR. Since our main focus in this work is on sign language, we provide further analysis of the falsely recognized samples in the SLR datasets, RKS-PERSIANSIGN and ASLVID. Analyzing the results on these datasets shows that the proposed model obtains higher performance on RKS-PERSIANSIGN, where the recognition accuracy of each sign is above 0.7. Figures 4, 5, and 6 show some samples of false and true recognition in the sign datasets. As one can see in Fig. 4, some signs, such as 'Sister', 'Brother', 'Alligator', and 'Sailboat', are challenging and difficult to discriminate from the other signs. Analyzing the false recognitions showed that these signs are similar to some other signs in the datasets; therefore, increasing the number of samples of these signs could reduce misclassifications and help the model learn more discriminative patterns for these categories. Recognizing the complex patterns of different signs in real environments using data unseen during training is hard and challenging. Compared to related areas such as action recognition, in SLR even a slight variation in motion and/or handshape can change the whole meaning. Thus, dedicated models are indispensable for ZS-SLR.

Fig. 4: False recognition samples from the RKS-PERSIANSIGN (a, c) and ASLVID datasets (b, d): (a) Predicted: Sister, True label: Brother, (b) Predicted: Bathing suit, True label: Alligator, (c) Predicted: Brother, True label: Sister, (d) Predicted: Algebra, True label: Sailboat.

Fig. 5: True recognition samples from the ASLVID dataset: (a) Bathing suit, (b) Alligator, (c) Old testament, (d) Sailboat.

Fig. 6: True recognition samples from the RKS-PERSIANSIGN dataset: (a) Love, (b) Woman, (c) Book, (d) Kind.

VI Conclusion and future work

In this work, we proposed a multi-modal deep learning-based model for ZS-SLR. In this model, a Transformer-based model was configured for hand detection. The proposed model successfully benefited from the complementary capabilities of deep features fused with skeleton-based features. Three skeleton-based features, namely the distances, angles, and singular values obtained from the SVD, were fused. The pixel-level features obtained from the C3D model were fed to an LSTM network to obtain the deep features, and an AE was applied to the LSTM output to reduce the dimensionality of the deep features and balance it against the skeleton-based features. The fused features were used in the semantic space to map the visual features to the lingual embedding obtained from the BERT model. We performed a detailed analysis of the model and reported the results. While the proposed model achieved state-of-the-art results in ZS-SLR on four large-scale datasets, RKS-PERSIANSIGN, First-Person, ASLVID, and isoGD, some challenges of zero-shot learning still need to be addressed. A powerful discriminative model for the semantic space is a key part of a zero-shot model. As future work, we would like to investigate the effectiveness of substituting the CNN model with a Transformer model, motivated by the limited receptive field that a CNN can capture.

References

  • [1] P. Gupta, D. Sharma, and R. Kiran Sarvadevabhatla, Syntactically guided generative embeddings for zero-shot skeleton action recognition, IEEE International Conference on Image Processing (ICIP), Anchorage, Alaska, USA, 2021.
  • [2] N. Majidi, K. Kiani, and R. Rastgoo, A deep model for super-resolution enhancement from a single image, Journal of AI and Data Mining 8 (2020) 451–460.

  • [3] K. Kiani, R. Hematpour, R. Rastgoo, “Automatic Grayscale Image Colorization using a Deep Hybrid Model,” Journal of AI and Data Mining, doi: 10.22044/JADM.2021.9957.2131, 2021.
  • [4] R. Rastgoo, K. Kiani, “Face recognition using fine-tuning of Deep Convolutional Neural Network and transfer learning,” Journal of Modeling in Engineering, Vol. 17, pp. 103-111.

  • [5] R. Rastgoo, K. Kiani, S. Escalera, and M. Sabokrou, “Sign Language Production: A Review,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3451-3461, 2021.

  • [6] R. Rastgoo, K. Kiani, and S. Escalera, “Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine,” Entropy, Vol. 20, No. 809, 2018.

  • [7] R. Rastgoo, K. Kiani, and S. Escalera, Real-time Isolated Hand Sign Language Recognition Using Deep Networks and SVD, Journal of Ambient Intelligence and Humanized Computing (2021).
  • [8] R. Rastgoo, K. Kiani, and S. Escalera, Sign language recognition: a deep survey, Expert Systems with Applications 164 (2021) 113794.
  • [9] R. Rastgoo, K. Kiani, and S. Escalera, Hand pose aware multi-modal isolated sign language recognition, Multimedia Tools and Applications 80 (2021) 127–163.
  • [10] R. Rastgoo, K. Kiani, and S. Escalera, Video-based isolated hand sign language recognition using a deep cascaded model, Multimedia Tools and Applications 79 (2020) 22965–22987.
  • [11] R. Rastgoo, K. Kiani, and S. Escalera, Hand sign language recognition using multi-view hand skeleton, Expert Systems with Applications 150 (2020) 113336.
  • [12] G. Huang and A.G. Bors, Video Classification with FineCoarse Networks, arXiv:2103.15584v1, 2021.
  • [13] M.E. Kalfaoglu, S. Kalkan, and A. Alatan, Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition, arXiv:2008.01232v3, 2020.
  • [14] Ch. Li, X. Zhang, L. Liao, L. Jin, W. Yang, Skeleton-based Gesture Recognition Using Several Fully Connected Layers with Path Signature Features and Temporal Transformer Module, Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 8585-8593.

  • [15] Y. Li, X. Wang, W. Liu, and B. Feng, Deep attention network for joint hand gesture localization and recognition using static RGB-D images, Information Sciences 441 (2018) 66-78.
  • [16] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, End-to-End Object Detection with Transformers, ECCV, 2020, pp. 213-229.
  • [17] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, Learning Spatio-temporal Features with 3D Convolutional Networks, ICCV ’15: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
  • [18] D. Bank, N. Koenigstein, and R. Giryes, Autoencoders, arXiv:2003.05991v2, 2021.
  • [19] S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
  • [20] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv:1810.04805, 2018.
  • [21] S. Ren, K. He, R.B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (2015) 1137-1149.
  • [22] W. Liu, D. Anguelov, D. Erhan, Ch. Szegedy, S. Reed, Ch. Y. Fu, and A.C. Berg, SSD: Single Shot MultiBox Detector, ECCV, 2016, pp 21-37.
  • [23] J. Redmon and A. Farhadi, YOLOv3: An Incremental Improvement, arXiv:1804.02767, 2018.
  • [24] Zh. Cao, G. Hidalgo, T. Simon, Sh.E. Wei, and Y. Sheikh, OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields, IEEE Transactions on Pattern Analysis and Machine Intelligence 4 (2019) 172-186.

  • [25] M. Palatucci, D. Pomerleau, G.E. Hinton, and T.M. Mitchell, Zero-shot learning with semantic output codes, In Advances in Neural Information Processing Systems 22 (2009) 1410–1418.
  • [26] H. Larochelle, D. Erhan, and Y. Bengio, Zero-data learning of new tasks, In Proceedings of the 23rd National Conference on Artificial Intelligence: AAAI'08, 2008, pp. 646–651.
  • [27] Y.C. Bilge, N. Ikizler-Cinbis, and R. Gokberk Cinbis, Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?, BMVC, 2019.
  • [28] E. Kodirov, T. Xiang, Zh. Fu, and Sh. Gong, Unsupervised domain adaptation for zero-shot learning, In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2452–2460.
  • [29] Q. Wang and K. Chen, Zero-shot visual recognition via bidirectional latent embedding, Int. J. Comput. Vision 124 (2017) 356–383.
  • [30] X. Xu, T. Hospedales, and Sh. Gong, Semantic embedding space for zero-shot action recognition, In 2015 IEEE International Conference on Image Processing (ICIP), 2015, pp. 63–67.
  • [31] J. Liu, B. Kuipers, and S. Savarese, Recognizing human actions by attributes, In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, 2011, pp. 3337–3344.
  • [32] A. Mishra, V. Kumar Verma, M. Shiva, K. Reddy, S. Arulkumar, P. Rai, and A. Mittal, A generative approach to zero-shot and few-shot action recognition, In 2018 IEEE Winter Conference on Applications of Computer Vision, 2018, pp. 372–380.
  • [33] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang, Zero-shot action recognition with error-correcting output codes, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2833–2842.
  • [34] X. Xu, T. M Hospedales, and Sh. Gong, Multi-task zero-shot action recognition with prioritised data augmentation, In European Conference on Computer Vision, 2016, pp. 343–359.
  • [35] Y. Zhu, Y. Long, Y. Guan, Sh. Newsam, and L. Shao, Towards universal representation for unseen action recognition, In 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [36] H. Wang and C. Schmid, Action recognition with improved trajectories, In Proceedings of the IEEE international conference on computer vision, 2013, pp. 3551–3558.
  • [37] Ch. Gan, T. Yang, and B. Gong, Learning attributes equals multi-source domain generalization, In 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 87–97.
  • [38] M. Bishay, G. Zoumpourlis, I. Patras, TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition, arXiv:1907.09021v1, 2019.
  • [39] M. Hahn, A. Silva and J.M. Rehg, Action2Vec: A Crossmodal Embedding Approach to Action Learning, arXiv:1901.00484v1, 2019.
  • [40] N. Madapana and J. Wachs, Feature Selection for Zero-Shot Gesture Recognition, 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020.

  • [41] J. Wu, K. Li, X. Zhao, and M. Tan, Unfamiliar Dynamic Hand Gestures Recognition Based on Zero-Shot Learning, ICONIP, 2018, pp. 244–254.
  • [42] I. Alexiou, T. Xiang, Sh. Gong, Exploring synonyms as context in zero-shot action recognition, IEEE International Conference on Image Processing (ICIP), 2016.
  • [43] G. Garcia-Hernando, Sh. Yuan, S. Baek, and T. Kim, First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations, CVPR, 2018, pp. 409–419.
  • [44] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, and A. Thangali, The American Sign Language Lexicon Video Dataset, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2008.

  • [45] Y. Zhao, Sh. Zhou, I. Guyon, S. Escalera, and S.Z. Li, ChaLearn Looking at People RGB-D Isolated and Continuous Datasets for Gesture Recognition, CVPR Workshop, 2016.
  • [46] M. Nguyen, W. Qi-Yan, and H. Ho, Sign Language Recognition from Digital Videos Using Deep Learning Methods, Geometry and Vision (2021) 108–118.
  • [47] D. Li, Ch. Xu, X. Yu, K. Zhang, B. Swift, H. Suominen, H. Li, TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation, NIPS, 2020.