ZS-SLR: Zero-Shot Sign Language Recognition from RGB-D Videos

08/23/2021, by Razieh Rastgoo et al.

Sign Language Recognition (SLR) is a challenging research area in computer vision. To tackle the annotation bottleneck in SLR, we formulate the problem of Zero-Shot Sign Language Recognition (ZS-SLR) and propose a two-stream model from two input modalities: RGB and Depth videos. To benefit from the capabilities of vision Transformers, we use two vision Transformer models, for human detection and visual feature representation. We configure a transformer encoder-decoder architecture as a fast and accurate human detection model to overcome the challenges of current human detection models. Considering the human keypoints, the detected human body is segmented into nine parts. A spatio-temporal representation of the human body is obtained using a vision Transformer and an LSTM network. A semantic space maps the visual features to the lingual embedding of the class labels via a Bidirectional Encoder Representations from Transformers (BERT) model. We evaluated the proposed model on four datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, and NTU-60, outperforming state-of-the-art ZS-SLR models.

I Introduction

Sign Language Recognition (SLR) deals with a structured form of human gestures, in which the visual motion of different parts of the human body transfers information between the deaf and hearing communities. While many applications benefit from SLR [1], it remains a challenging research area in computer vision. One challenge is the change in a sign's meaning caused by small variations in hand shape, orientation, movement, hand location, body posture, and non-manual features such as facial expressions. In addition, viewpoint changes and the evolution of sign language over time are other challenges in this area. Like other research areas, SLR also faces the challenge of data annotation. This challenge is clearer in recent SLR models based on deep learning. While deep learning approaches have obtained state-of-the-art performance in most vision tasks [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], they require a huge number of labelled training samples. To address the annotation bottleneck in SLR, we explore the idea of recognizing unseen signs without annotated visual samples by using their textual information. Accordingly, we propose a Zero-Shot Learning (ZSL) model for Sign Language Recognition from two input modalities. ZSL-based models aim to recognize unseen classes that are not available during training. While traditional models use the same classes for training and testing, ZSL models use disjoint class sets for training and testing [31].

Generally, there are two main methodologies in ZSL-based tasks: embedding-based and generative-based models. Projecting the visual features into a semantic attribute space and generating samples of the unseen classes are the main goals of these two methodologies, respectively. In this paper, we propose an embedding-based model for ZS-SLR from two input modalities, using a Transformer [15], a Long Short Term Memory (LSTM) network [18], and Bidirectional Encoder Representations from Transformers (BERT) [19].

Although classical SLR models have obtained high performance [5, 1, 6, 7, 8], they do not work effectively with unseen classes. To address this challenge, we propose a two-stream Zero-Shot Sign Language Recognition (ZS-SLR) model from two input modalities. Two vision Transformer models are used for human detection and visual feature representation. To the best of our knowledge, this is the first time that the RGB and Depth modalities are used in a two-stream architecture for ZS-SLR. Relying on multi-modality and accurate features, the proposed model works effectively on public datasets, even when coping with unseen classes.

Our main contributions are as follows: (i) we formulate the problem of ZS-SLR with no annotated visual examples and propose a two-stream model combining vision Transformer models, an LSTM network, and a BERT model; (ii) we configure a transformer encoder-decoder architecture as a fast and accurate human detection model to tackle the challenges of current human detection models; (iii) we propose the first two-stream model for ZS-SLR, in which three modalities, RGB, Depth, and text, are used to benefit from their complementary capabilities; (iv) a step-by-step analysis of the proposed model is presented on four datasets using two evaluation protocols, and our model outperforms the state-of-the-art models in ZS-SLR, Zero-Shot Gesture Recognition (ZS-GR), and Zero-Shot Action Recognition (ZS-AR).

The remainder of this paper is organized as follows. Section 2 briefly reviews recent work in ZS-SLR, ZS-GR, and ZS-AR. The proposed model is presented in detail in Section 3. Results are reported in Section 4. Sections 5 and 6 analyze the results and conclude the work, respectively.

II Literature Review

In this section, we briefly review recent work in ZS-AR, ZS-GR, and ZS-SLR.

ZS-AR: Gupta et al. proposed a deep learning-based model for ZS-AR from skeleton data. The generative space of the latent representations is learned using a Variational Auto Encoder (VAE), and a fusion of the visual features and the embeddings of the textual descriptions is used in the model. Results on the NTU-60 and NTU-120 datasets show that the model outperforms the state-of-the-art models in ZS-AR with 4.34% and 3.16% relative improvements, respectively [2]. Mishra et al. proposed a probabilistic generative model for ZS-AR that represents each class by a Gaussian distribution. New samples of unseen classes are synthesized by sampling from the class distribution. Results on three datasets, UCF101, HMDB-51, and Olympic, show performance comparable with state-of-the-art methods in ZS-AR [36]. Wang and Chen proposed a video-based ZS-AR model that employs textual descriptions and action-related still images for the semantic representation of human actions; a CNN is used for visual feature extraction. Results on the UCF101 and HMDB-51 datasets confirm the effectiveness of the proposed semantic representations [47]. Gupta et al. proposed a generative-based model, entitled SynSE, for ZS-AR. This model progressively learns to refine the generative embedding spaces of the visual and lingual modalities. Results of SynSE on the NTU-60 and NTU-120 skeleton action datasets show relative accuracy improvements of 0.92% and 3.73% compared to state-of-the-art models, respectively [53]. Schonfeld et al. proposed a generative-based model, namely CADA-VAE, where a shared latent space of image features and class embeddings is learned by modality-specific aligned Variational Auto Encoders (VAEs). During training, the distributions learned from images and from class labels are aligned to construct the multi-modal latent features. Results on several datasets show that the model obtains state-of-the-art performance in ZS-AR [52]. Wray et al. proposed a model, entitled JPoSE, for ZS-AR by disentangling Parts-Of-Speech (PoS) in the video captions. A separate multi-modal embedding space is considered for each PoS tag, and the outputs of the multiple PoS embeddings are input to an integrated multi-modal space for action retrieval. Results on the EPIC and MSR-VTT datasets show the effectiveness of the JPoSE model for ZS-AR [51]. Hahn et al. presented a model, namely Action2Vec, that incorporates the linguistic embedding of the class labels with features extracted from the video inputs. A C3D model is combined with a two-layer LSTM network for spatio-temporal feature extraction. Results on the UCF101, HMDB-51, and Kinetics datasets show that Action2Vec achieves state-of-the-art performance in ZS-AR with 1.3%, 4.38%, and 7.75% relative margins, respectively [43]. Bishay et al. designed a model, entitled Temporal Attentive Relation Network (TARN), for ZS-AR. The TARN model contains a C3D network combined with a Bidirectional Gated Recurrent Unit (Bi-GRU). A single vector is obtained from the embedding module and mapped into the Word2Vec embedding. Results on the UCF101 and HMDB-51 datasets show a competitive performance of the model compared to state-of-the-art alternatives in ZS-AR [42].

ZS-GR: Madapana et al. proposed a deep learning-based framework for ZS-GR. They combined two datasets, CGD 2013 and MSRC-12, to develop a database of gesture attributes covering a range of categories. A deep learning-based model, including a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN), is proposed to optimize the semantic and classification losses. Results show the effectiveness of the proposed model, obtaining a recognition accuracy of 38% for ZS-GR [45, 44]. Madapana et al. also proposed an adaptive learning paradigm to indicate the amount of transfer learning carried out by the algorithm. In addition, this model aims to show how much knowledge is necessary for an unseen gesture to be recognized. To this end, they used user-provided semantic descriptors to improve the performance of the ZS-GR model. Results on their own data show the effectiveness of the proposed solution for ZS-GR [46].

ZS-SLR: Bilge et al. proposed a model for ZS-SLR using a combination of I3D and BERT. The body and hand regions are used to obtain visual features through 3D-CNNs, and longer temporal relationships are captured via a Bidirectional Long Short-Term Memory (BLSTM) network. Relying on textual descriptions, they introduced an extended ASL dataset, namely ASL-Text, which includes 250 signs and the corresponding sign descriptions. Results on this dataset show that the model can provide a basis for further exploration of ZSL in SLR [31]. Building on this line of work, we propose a ZS-SLR model from two input modalities.

III Proposed Model

Details of the proposed model are presented in this section. An overview of the model is shown in Fig. 1.

Fig. 1: The proposed model, including five main steps: human detection, pose estimation, human body segmentation, spatio-temporal visual and lingual feature extraction, and Zero-Shot classification.

III-A Problem Definition

ZS-SLR uses two sources of information: the visual and lingual domains. While the former contains the sign videos, the latter includes the textual information corresponding to the sign videos. The sign videos along with the corresponding class labels are available only for the seen classes. At test time, the goal is to correctly classify examples of the unseen classes.

More concretely, consider a training set $S = \{(x_i, y_i)\}_{i=1}^{N}$ containing $N$ video samples, where each sample includes a video sign ($x_i$) and the corresponding class label ($y_i$). The model is trained on these samples, but the training class labels are not available during the test phase. The goal is therefore to learn a Zero-Shot classifier that correctly classifies an unseen video sign based on the textual information of its class label. Given a video sample $x$ at test time, the trained model predicts the corresponding semantic embedding, and the nearest neighbor of the predicted embedding determines the classification output. Finally, the output of the trained model, $F(\cdot)$, is:

$$\hat{y} = \underset{y \in \mathcal{Y}_u}{\arg\min}\; D\big(F(x), E(y)\big), \qquad (1)$$

where $D(\cdot,\cdot)$ is the cosine distance, $\mathcal{Y}_u$ is the set of unseen class labels, and the semantic embedding $E(y)$ is computed using BERT [19]. The function $F(\cdot)$ is a combination of the visual and lingual encoding.
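As a minimal illustration of the inference rule in Eq. (1), the Python sketch below performs the nearest-neighbor search over the unseen class embeddings. It assumes the visual projection F(x) and the BERT embeddings E(y) are already computed; the function names are illustrative, not part of the released implementation.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance D(a, b) = 1 - cos(a, b)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def zero_shot_classify(predicted_embedding, unseen_class_embeddings):
    """Return the unseen class whose BERT embedding is nearest to F(x), as in Eq. (1).

    predicted_embedding:      F(x), the projection of the visual features.
    unseen_class_embeddings:  dict mapping class label -> BERT embedding E(y).
    """
    distances = {label: cosine_distance(predicted_embedding, emb)
                 for label, emb in unseen_class_embeddings.items()}
    return min(distances, key=distances.get)
```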

III-B Details of the Proposed Model

Different blocks of the proposed model are presented in the following.

III-B1 Inputs

Three input modalities, RGB video, Depth video, and text, are used in the model. The visual features are extracted from the RGB and Depth video modalities. The text modality is used for lingual embedding. The visual features and lingual embedding are the input and output of the semantic space, respectively.

III-B2 Human Detection

There are several widely used deep learning-based models for object detection, such as Faster-RCNN [20], the Single Shot Detector (SSD) [21], and You Only Look Once (YOLO) v3 [22]. However, the detection accuracy of these models depends heavily on hand-crafted procedures such as Non-Maximum Suppression (NMS) or anchor generation, which require explicitly encoding prior knowledge about the task. To overcome these challenges, we configure a Transformer-based model, namely the DEtection TRansformer (DETR) [15], for human detection. DETR, developed by Facebook AI, is an end-to-end architecture that exploits the Transformer capabilities.
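The paper does not detail the DETR configuration, so the following is only a hedged sketch of person detection with the publicly released pretrained DETR checkpoint from the Hugging Face transformers library; the checkpoint name and the 0.9 score threshold are assumptions, not values stated in the paper.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Pretrained COCO DETR checkpoint (an assumption; the paper does not state the exact weights).
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

def detect_person(frame: Image.Image, score_threshold: float = 0.9):
    """Return the highest-scoring 'person' box (x0, y0, x1, y1) in a video frame, or None."""
    inputs = processor(images=frame, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([frame.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=score_threshold)[0]
    person_id = model.config.label2id["person"]
    best_box, best_score = None, 0.0
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        if label.item() == person_id and score.item() > best_score:
            best_box, best_score = box.tolist(), score.item()
    return best_box
```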

III-B3 Human Pose Estimation

We use the OpenPose model [27] for human pose estimation, obtaining 13 keypoints from the human body. These keypoints are used in the next step.
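As an illustration, the snippet below reduces an OpenPose BODY_25 output to 13 keypoints. The paper does not list which keypoints are kept, so the index set here is purely hypothetical.

```python
import numpy as np

# Hypothetical subset of OpenPose BODY_25 keypoint indices; the paper does not
# specify which 13 keypoints are kept, so this selection is only illustrative.
KEYPOINT_INDICES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16]  # nose, neck, arms, hips, eyes

def select_keypoints(body25_keypoints: np.ndarray) -> np.ndarray:
    """Reduce OpenPose BODY_25 output (25, 3) -> (13, 3) array of (x, y, confidence)."""
    assert body25_keypoints.shape == (25, 3)
    return body25_keypoints[KEYPOINT_INDICES]
```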

III-B4 Human Body Segmentation

Considering the skeleton representation of the estimated human keypoints, the human body is segmented into nine sections, as Fig. 2 shows. These segments are fed to the Transformer model to obtain the visual feature representation.

Fig. 2: The human body segmentation step in the proposed model. Nine parts are obtained.
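Since the paper does not define the nine parts beyond Fig. 2, the sketch below shows one plausible way to crop nine fixed-size patches around groups of the 13 keypoints; the groupings and margin are assumptions, while the 384×384 patch size matches the input resolution reported in Section IV-A.

```python
import numpy as np
import cv2  # OpenCV, used here only for cropping and resizing

# Hypothetical grouping of the 13 selected keypoints (indices refer to the reduced
# 13-keypoint array from the previous sketch) into nine body parts; the paper only
# states that nine parts are produced (Fig. 2).
PART_KEYPOINTS = {
    "head":          [0, 11, 12],   # nose and eyes
    "torso":         [1, 2, 5, 8],  # neck, shoulders, mid-hip
    "right_upper":   [2, 3],        # right shoulder-elbow
    "right_forearm": [3, 4],        # right elbow-wrist
    "right_hand":    [4],
    "left_upper":    [5, 6],
    "left_forearm":  [6, 7],
    "left_hand":     [7],
    "hips":          [8, 9, 10],
}

def segment_body(frame: np.ndarray, keypoints: np.ndarray, size: int = 384, margin: int = 40):
    """Crop nine patches around keypoint groups and resize each to size x size."""
    h, w = frame.shape[:2]
    patches = []
    for indices in PART_KEYPOINTS.values():
        pts = keypoints[indices, :2]
        x0, y0 = np.maximum(pts.min(axis=0) - margin, 0).astype(int)
        x1, y1 = np.minimum(pts.max(axis=0) + margin, [w, h]).astype(int)
        crop = frame[y0:y1, x0:x1] if (y1 > y0 and x1 > x0) else frame
        patches.append(cv2.resize(crop, (size, size)))
    return np.stack(patches)  # shape: (9, size, size, 3)
```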

III-B5 Features Representation

We use two feature types in the model: visual and lingual features.

Visual features: We use the vision Transformer, developed by the Google Brain team [28], to obtain discriminative visual features. To this end, the nine body segments are input to the encoder-decoder Transformer along with an LSTM network for temporal learning (see Fig. 3). The body segments are preprocessed into nine same-size patches before being input to the vision Transformer model.

Fig. 3: The vision Transformer model used in the proposed model.
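The following is a hedged PyTorch sketch of one visual stream: a ViT backbone (instantiated here through the timm library, which is an assumption; the paper uses the Google ViT [28]) embeds each of the nine segments per frame, the nine segment embeddings are averaged (the exact aggregation is not stated), and an LSTM with 1024 hidden units models the temporal dimension.

```python
import timm
import torch
import torch.nn as nn

class SpatioTemporalStream(nn.Module):
    """One visual stream (RGB or Depth): ViT per body segment + LSTM over time.

    Averaging the nine segment embeddings is an assumption; the paper only states
    that the segments feed a vision Transformer and an LSTM network.
    """

    def __init__(self, hidden_size: int = 1024, pretrained: bool = False):
        super().__init__()
        # ViT backbone with the classification head removed (num_classes=0 -> pooled features).
        self.vit = timm.create_model("vit_base_patch16_384", pretrained=pretrained, num_classes=0)
        self.lstm = nn.LSTM(self.vit.num_features, hidden_size, batch_first=True)

    def forward(self, segments: torch.Tensor) -> torch.Tensor:
        # segments: (batch, time, 9, 3, 384, 384)
        b, t, s, c, h, w = segments.shape
        feats = self.vit(segments.reshape(b * t * s, c, h, w))   # per-segment ViT features
        feats = feats.reshape(b, t, s, -1).mean(dim=2)           # average the 9 segments
        _, (h_n, _) = self.lstm(feats)                           # temporal modeling over t frames
        return h_n[-1]                                           # (batch, 1024)
```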

Lingual features: We use the sentence-level BERT model [19] to obtain a 1024-dimensional embedding of each class label.
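A possible realization of this step, assuming the bert-large-uncased checkpoint (which outputs 1024-dimensional hidden states, matching the embedding size stated in the paper) and mean pooling over tokens; the exact checkpoint and pooling strategy are not specified in the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# bert-large-* checkpoints produce 1024-dimensional hidden states; the checkpoint
# and the mean-pooling strategy below are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
bert = AutoModel.from_pretrained("bert-large-uncased").eval()

def class_label_embedding(label: str) -> torch.Tensor:
    """Mean-pooled BERT embedding E(y) of a class label, e.g. 'brushing teeth'."""
    inputs = tokenizer(label, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state   # (1, tokens, 1024)
    return hidden.mean(dim=1).squeeze(0)            # (1024,)
```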

III-B6 Features Fusion

The two visual feature representations obtained from the RGB and Depth modalities are fused and input to the semantic space.

III-B7 Semantic Space

The semantic space aims to map the visual features to the lingual embedding by employing a projection function learned using deep networks. To this end, the similarity between the predicted embedding and each unseen class embedding is calculated as follows:

$$sim_i = \cos\big(\hat{e}, E(y_i^u)\big), \qquad (2)$$

where $y_i^u$ is the class label of the $i$-th unseen class, $E(\cdot)$ is the lingual embedding, and $\hat{e}$ is the class embedding predicted by the trained projection network. This prediction is compared to all class embeddings of the unseen data to select the closest one.
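A hedged sketch of the fusion and semantic space: the two stream features are concatenated (the fusion operator is an assumption), projected by two FC layers into the 1024-dimensional lingual space, and compared to the unseen class embeddings with cosine similarity as in Eq. (2); the layer widths are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticSpace(nn.Module):
    """Two FC layers projecting the fused visual features into the 1024-d lingual space."""

    def __init__(self, visual_dim: int = 2 * 1024, hidden_dim: int = 1024, embed_dim: int = 1024):
        super().__init__()
        self.projection = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, rgb_feat, depth_feat):
        fused = torch.cat([rgb_feat, depth_feat], dim=-1)  # feature fusion by concatenation
        return self.projection(fused)                      # predicted class embedding

def classify_unseen(predicted_embedding, unseen_embeddings):
    """Cosine similarity (Eq. 2) against every unseen class embedding; returns the argmax index."""
    sims = F.cosine_similarity(predicted_embedding.unsqueeze(0), unseen_embeddings, dim=-1)
    return int(sims.argmax())
```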

III-B8 Classification

In this step, a Softmax layer is used to recognize the final class label.

IV Results

Details of the model implementation and obtained results are presented in the following.

IV-A Implementation Details

Our evaluation was performed on an Intel(R) Xeon(R) CPU E5-2699 (two processors) with 90 GB RAM, running Microsoft Windows 10, Python, and the PyTorch library on 10 NVIDIA Tesla K80 GPUs. Results on four datasets are reported on the unseen test data. The Adam optimizer with a learning rate of 1e-3 is used for 300 epochs. The input resolution is 384×384×3. We select 13 body keypoints to create the body segmentation and feed the segments to the vision Transformer model for visual feature representation. An LSTM network with 1024 hidden neurons is used for temporal learning. The lingual embedding is a 1024-dimensional vector. A deep model, including two Fully Connected (FC) layers, is used in the semantic space. Two evaluation protocols are used for analyzing the results; these protocols are configured such that we can make a fair comparison with state-of-the-art models.
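A training-loop skeleton using the stated configuration (Adam, learning rate 1e-3, 300 epochs); the loss function and data pipeline are not specified in the paper, so the cosine-embedding loss and the placeholder names below are assumptions.

```python
import torch
import torch.nn as nn

# model, train_loader, and class_embedding are placeholders for the components sketched
# above; only the stated hyper-parameters (Adam, lr=1e-3, 300 epochs) come from the paper.
def train(model, train_loader, class_embedding, device="cuda"):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CosineEmbeddingLoss()     # an assumed loss for embedding regression
    target = torch.ones(1, device=device)    # +1: pull predictions toward the gold embedding
    for epoch in range(300):
        for rgb, depth, labels in train_loader:
            optimizer.zero_grad()
            pred = model(rgb.to(device), depth.to(device))
            gold = class_embedding(labels).to(device)
            loss = criterion(pred, gold, target.expand(pred.size(0)))
            loss.backward()
            optimizer.step()
```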

First evaluation protocol: In this protocol, approximately 90% of the classes are randomly selected for training and the remainder is reserved for testing. This protocol assigns more classes to the seen set.

Second evaluation protocol: In this protocol, approximately 70% of the classes are randomly selected for training and the remainder is reserved for testing. This protocol is more challenging than the first.
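Both protocols split at the class level rather than the sample level. A minimal sketch of such a split, as repeated for each of the ten evaluation runs reported later, is given below; the seeding scheme is an assumption.

```python
import numpy as np

def split_classes(class_labels, seen_fraction=0.9, seed=0):
    """Randomly split class labels into seen (training) and unseen (testing) sets.

    seen_fraction=0.9 corresponds to the first protocol, 0.7 to the second.
    """
    rng = np.random.default_rng(seed)
    labels = np.array(class_labels)
    rng.shuffle(labels)
    n_seen = int(round(seen_fraction * len(labels)))
    return labels[:n_seen].tolist(), labels[n_seen:].tolist()

# Example: the 60 NTU-60 classes under the second protocol (70% seen, 30% unseen).
seen, unseen = split_classes([f"class_{i}" for i in range(60)], seen_fraction=0.7)
```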

IV-B Datasets

Montalbano II [23], MSR Daily Activity 3D [24], CAD-60 [25], and NTU-60 [26] datasets are used for model evaluation. Details of these datasets can be found in Table I.

Dataset Classes Samples Type
Montalbano II 20 12,575 RGB-D
MSR Daily Activity 3D 16 320 RGB-D
CAD-60 12 60 RGB-D
NTU-60 60 56,000 RGB-D
TABLE I: Details of the datasets used for evaluation.

IV-C Experimental Results

Results of the proposed model using the two protocols are reported in the following.

IV-C1 Ablation Analysis

We analyze the effect of different configurations of the proposed model.

Different input modalities: To evaluate the impact of the input modalities, we first include only one modality (RGB or Depth video) in the model to benefit from the pixel-level or distance information of that modality. After that, we include both modalities in a two-stream architecture to benefit from their complementary capabilities. The deep visual features of the two input modalities are extracted using a compositional model comprising two Transformers, an LSTM network, and a BERT model. Results of the proposed model using one or two input modalities are shown in Table II. As the table shows, the proposed model obtains the highest performance using both input modalities.

Visual Modality Montalbano II MSR Daily Activity 3D CAD-60 NTU-60
P1 P2 P1 P2 P1 P2 P1 P2
RGB 50.3 43.2 49.1 41.1 49.2 43.2 59.3 54.7
Depth 40.3 36.4 42.3 36.4 38.3 40.2 46.6 46.2
RGB and Depth 62.4 56.3 59.2 52.8 60.6 52.5 79.2 66.4
TABLE II: Recognition accuracy of the proposed model using different input modalities.

LSTM network: We evaluate the impact of the number of hidden neurons in the LSTM network in Table III. As one can see in this table, the highest performance is obtained using 1024 hidden neurons in the LSTM network.

Model Montalbano II MSR Daily Activity 3D CAD-60 NTU-60
P1 P2 P1 P2 P1 P2 P1 P2
LSTM-256 58.6 52.4 53.3 50.2 56.8 49.1 66.8 65.8
LSTM-512 60.2 54.2 54.1 50.9 57.4 50.5 67.2 66.0
LSTM-1024 62.4 56.3 59.2 52.8 60.6 52.5 79.2 66.4
TABLE III: Recognition accuracy of the proposed model using different numbers of hidden neurons for the LSTM network. In this table, LSTM-N indicates an LSTM network with N hidden neurons.

Semantic space: A semantic space, including two FC layers, is used to project the visual features into the lingual embedding. Analyzing different DNNs showed that a two-layer model had the highest performance (see Table IV).

Model Montalbano II MSR Daily Activity 3D CAD-60 NTU-60
P1 P2 P1 P2 P1 P2 P1 P2
LSTM-256-1 56.2 51.3 52.2 49.0 55.5 48.6 66.0 64.9
LSTM-256-2 58.6 52.4 53.3 50.2 56.8 49.1 66.8 65.8
Ours (LSTM-512-1) 59.1 53.0 53.2 49.3 56.5 49.6 66.4 65.1
Ours (LSTM-512-2) 60.2 54.2 54.1 50.9 57.4 50.5 67.2 66.0
Ours (LSTM-1024-1) 61.3 55.5 58.6 51.6 59.2 51.7 69.2 65.6
Ours (LSTM-1024-2) 62.4 56.3 59.2 52.8 60.6 52.5 79.2 66.4
TABLE IV: Recognition accuracy of the proposed model using different configurations of the semantic space. In this table, LSTM-N-M indicates an LSTM network with N hidden neurons connected to a DNN with M FC layers.

Different human detection models: Several widely used object detection models, such as Faster-RCNN, SSD, and YOLO, have been employed for human detection. Comparing the results of these models with the Transformer-based detector shows that the proposed model achieves higher accuracy using the Transformer model for human detection (see Table V).

Model Montalbano II MSR Daily Activity 3D CAD-60 NTU-60
P1 P2 P1 P2 P1 P2 P1 P2
Faster-RCNN 56.1 51.6 55.2 49.8 56.6 48.4 65.6 61.4
SSD 58.3 53.1 58.0 51.3 58.8 50.8 67.5 63.5
YOLO 59.6 54.2 58.4 51.9 59.4 51.2 68.1 64.2
Vision Transformer 62.4 56.3 59.2 52.8 60.6 52.5 79.2 66.4
TABLE V: Comparison of different human detection models used in the proposed model.

IV-C2 Comparison with State-of-the-Art Models

Using the two evaluation protocols, we compare our results with the state-of-the-art models in ZS-SLR, ZS-GR, and ZS-AR. Results of the proposed model are reported after averaging over ten runs; in each run, we randomly select the training and testing classes. As Table VI shows, the proposed model outperforms the state-of-the-art models in ZS-SLR, ZS-GR, and ZS-AR.

Model Montalbano II MSR Daily Activity 3D CAD-60 NTU-60
P1 P2 P1 P2 P1 P2 P1 P2
ZSGL [44] - - 38.1 - - - - -
ReViSE [50] - - - - - - 29.22 31.16
JPoSE [51] - - - - - - 56.49 30.75
CADA-VAE [52] - - - - - - 65.37 35.41
GZSL [53] - - - - - - 75.81 -
Ours 62.4 56.3 59.2 52.8 60.6 52.5 79.2 66.4
TABLE VI: Comparison with state-of-the-art models

V Discussion

We analyze the proposed model as follows:

ZS-SLR: Recently, deep learning-based models have obtained state-of-the-art performance in SLR [5, 1, 6, 7, 8, 48, 49]. However, they face the annotation bottleneck and do not work effectively for unseen classes. To tackle the annotation bottleneck, we formulated ZS-SLR with no annotated visual examples and proposed the first two-stream model for this task, including two vision Transformer models, an LSTM network, and a BERT model. A detailed analysis of the proposed model on four datasets has been performed to provide a basis for further exploration of ZS-SLR.

Human detection: We analyzed some of the widely used object detection models, such as Faster-RCNN, SSD, and YOLO. The performance of these models is highly dependent on hand-crafted components, such as NMS or anchor generation, that require explicitly encoding prior knowledge about the task. To overcome these challenges, we configured the DEtection TRansformer (DETR) model for human detection and obtained higher performance compared to the other human detection models.

Two-stream modality: Two visual modalities, RGB and Depth videos, were included in the model to benefit from their complementary capabilities. Harnessing these modalities, the proposed model outperformed the state-of-the-art models in ZS-SLR, ZS-GR, and ZS-AR.

Performance: We performed a step-by-step analysis of the proposed model on four datasets and showed that it outperformed the state-of-the-art models in ZS-SLR, ZS-GR, and ZS-AR. The false recognition rate was reduced by relying on multi-modality and the accurate features obtained from the deep blocks. Multi-modality makes the model more robust and effective because it is not biased toward a single pre-defined modality. Furthermore, using the vision Transformer capabilities, the proposed model processes the human body segments in parallel. While the proposed model outperformed state-of-the-art models in ZS-SLR by a high margin, there is still much room to improve the recognition accuracy. A detailed analysis of the false recognition samples showed that the proposed model obtained a higher performance on the Montalbano II dataset, with a recognition accuracy higher than 0.5 for each action. The confusion bars and some samples of false and true recognition can be found in Fig. 4 and Fig. 5. As these figures show, some signs/actions, such as "shake head", "rub two hands together", "rinsing mouth", "brushing teeth", "eat", "drink", "diving signals", and "surgeon signals", are difficult to discriminate from the other signs/actions. This stems from the high similarity between these signs/actions. As a result, increasing the number of samples of these actions can decrease the misclassification error and help the model learn complex discriminative patterns.

Fig. 4: Confusion bars of the proposed model on the four datasets used for evaluation.

Fig. 5: Samples of false recognition of the proposed model on the four datasets used for evaluation: (a) NTU-60: Predicted: "shake head", Ground truth: "rub two hands together", (b) CAD-60: Predicted: "rinsing mouth", Ground truth: "brushing teeth", (c) MSR Daily Activity 3D: Predicted: "eat", Ground truth: "drink", (d) Montalbano II: Predicted: "OK", Ground truth: "Non me ne frega niente".
Class num. | Montalbano II | CAD-60 | MSR Daily Activity 3D | NTU-60
Class 1 | Vattene | rinsing mouth | drink | brushing teeth
Class 2 | Perfetto | brushing teeth | eat | brushing hair
Class 3 | E un furbo | wearing contact lenses | read book | clapping
Class 4 | Non me ne frega niente | – | call cellphone | reading
Class 5 | OK | drinking water | write on a paper | writing
Class 6 | Cosa ti farei | opening pill container | use laptop | cross hands in front (say stop)
Class 7 | Non ce ne piu | cooking (chopping) | use vacuum cleaner | rub two hands together
Class 8 | Ho fame | cooking (stirring) | cheer up | nod head/bow
Class 9 | Buonissimo | talking on couch | sit still | shake head
Class 10 | Sono stufo | relaxing on couch | toss paper | wipe face
TABLE VII: Class labels per dataset used for evaluation in the confusion bars.

VI Conclusion and Future Work

In this work, we proposed a two-stream deep learning-based model for ZS-SLR from two input modalities: RGB and Depth videos. In this model, two vision Transformer models were configured for human detection and visual feature representation. Considering the human keypoints, the detected human body was segmented into nine parts. A spatio-temporal representation of the human body was obtained using a vision Transformer and an LSTM network. Finally, the visual features were mapped into the lingual embedding of the class labels, obtained via the BERT model. Results on four datasets, Montalbano II, MSR Daily Activity 3D, CAD-60, and NTU-60, show performance improvements of the proposed model compared to state-of-the-art models in ZS-SLR, ZS-GR, and ZS-AR.

References

  • [1] R. Rastgoo, K. Kiani, and S. Escalera, “Sign language recognition: a deep survey,” Expert Systems with Applications, Vol. 164, 113794, 2021.
  • [2] P. Gupta, D. Sharma, and R. Kiran Sarvadevabhatla, “Syntactically guided generative embeddings for zero-shot skeleton action recognition,” IEEE International Conference on Image Processing (ICIP), Anchorage, Alaska, USA, 2021.
  • [3] N. Majidi, K. Kiani, and R. Rastgoo, “A deep model for super-resolution enhancement from a single image,” Journal of AI and Data Mining, Vol. 8, pp. 451–460, 2020.

  • [4] R. Rastgoo, K. Kiani, S. Escalera, and M. Sabokrou, “Sign Language Production: A Review,” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3451-3461, 2021.

  • [5] R. Rastgoo, K. Kiani, and S. Escalera, “Real-time Isolated Hand Sign Language Recognition Using Deep Networks and SVD,” Journal of Ambient Intelligence and Humanized Computing, 2021.
  • [6] R. Rastgoo, K. Kiani, and S. Escalera, “Hand pose aware multi-modal isolated sign language recognition,” Multimedia Tools and Applications, Vol. 80, pp. 127–163, 2021.
  • [7] R. Rastgoo, K. Kiani, and S. Escalera, “Video-based isolated hand sign language recognition using a deep cascaded model,” Multimedia Tools and Applications, Vol. 79, pp. 22965–22987, 2020.
  • [8] R. Rastgoo, K. Kiani, and S. Escalera, “Hand sign language recognition using multi-view hand skeleton,” Expert Systems with Applications, 150, 113336, 2020.
  • [9] R. Rastgoo, K. Kiani, and S. Escalera, “Multi-modal deep hand sign language recognition in still images using restricted Boltzmann machine,” Entropy, Vol. 20, No. 809, 2018.

  • [10] G. Huang and A.G. Bors, “Video Classification with FineCoarse Networks,” arXiv:2103.15584v1, 2021.
  • [11] K. Kiani, R. Hematpour, R. Rastgoo, “Automatic Grayscale Image Colorization using a Deep Hybrid Model,” Journal of AI and Data Mining, doi: 10.22044/JADM.2021.9957.2131, 2021.
  • [12] R. Rastgoo, K. Kiani, “Face recognition using fine-tuning of Deep Convolutional Neural Network and transfer learning,” Journal of Modeling in Engineering, Vol. 17, pp. 103-111.

  • [13] M.E. Kalfaoglu, S. Kalkan, and A. Alatan, “Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition,” arXiv:2008.01232v3, 2020.
  • [14] Ch. Li, X. Zhang, L. Liao, L. Jin, W. Yang, “Skeleton-based Gesture Recognition Using Several Fully Connected Layers with Path Signature Features and Temporal Transformer Module,” Proceedings of the AAAI Conference on Artificial Intelligence, pp. 8585-8593, 2019.

  • [15] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” ECCV, pp. 213-229, 2020.
  • [16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning Spatio-temporal Features with 3D Convolutional Networks,” Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4489–4497.
  • [17] D. Bank, N. Koenigstein, and R. Giryes, “Autoencoders,” arXiv:2003.05991v2, 2021.
  • [18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, Vol. 9, pp. 1735–1780, 1997.
  • [19] J. Devlin, M.W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv:1810.04805, 2018.
  • [20] S. Ren, K. He, R.B. Girshick, J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” PAMI, pp. 1137-1149, 2015.
  • [21] W. Liu, D. Anguelov, D. Erhan, Ch. Szegedy, S. Reed, Ch. Y. Fu, and A.C. Berg, “SSD: Single Shot MultiBox Detector,” ECCV, pp 21-37, 2016.
  • [22] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv:1804.02767, 2018.
  • [23] S. Escalera, J. Gonzalez, X. Baro, M. Reyes, O. Lopes, I. Guyon, V. Athitsos, H. Escalante, “Multi-modal gesture recognition challenge 2013: dataset and results,” In Proceedings of the 15th ACM on International conference on multi-modal interaction, pp. 445–452, 2013.
  • [24] J. Wang, Z. Liu, Y. Wu, J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 1290–1297, 2012.
  • [25] J. Sung, C. Ponce, B. Selman, A. Saxena, “Unstructured human activity detection from RGBD images,” IEEE International Conference on Robotics and Automation, 2012.
  • [26] A. Shahroudy, J. Liu, T.T. Ng, G. Wang, “NTU RGB+D: A large scale dataset for 3D human activity analysis,” Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1010–1019, 2016.
  • [27] Zh. Cao, G. Hidalgo, T. Simon, Sh.E. Wei, and Y. Sheikh, “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 4, pp. 172-186, 2019.
  • [28] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, Th. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, “An image is worth 16x16 words: transformers for image recognition at scale,” ICLR, 2021.
  • [29] M. Palatucci, D. Pomerleau, G.E. Hinton, and T.M. Mitchell, “Zero-shot learning with semantic output codes,” In Advances in Neural Information Processing Systems, Vol. 22, pp. 1410–1418, 2009.
  • [30] H. Larochelle, D. Erhan, and Y. Bengio, “Zero-data learning of new tasks,” In Proceedings of the 23rd National Conference on Artificial Intelligence: AAAI’08, pp. 646–651, 2008.
  • [31] Y.C. Bilge, N. Ikizler-Cinbis, and R. Gokberk Cinbis, “Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?,” BMVC, 2019.
  • [32] E. Kodirov, T. Xiang, Zh. Fu, and Sh. Gong, “Unsupervised domain adaptation for zero-shot learning,” In Proceedings of the IEEE International Conference on Computer Vision, pp. 2452–2460, 2015.
  • [33] Q. Wang and K. Chen, “Zero-shot visual recognition via bidirectional latent embedding,” Int. J. Comput. Vision, Vol. 124, pp. 356–383, 2017.
  • [34] X. Xu, T. Hospedales, and Sh. Gong, “Semantic embedding space for zero-shot action recognition,” In 2015 IEEE International Conference on Image Processing (ICIP), pp. 63–67, 2015.
  • [35] J. Liu, B. Kuipers, and S. Savarese, “Recognizing human actions by attributes,” In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’11, pp. 3337–3344, 2011.
  • [36] A. Mishra, V. Kumar Verma, M. Shiva, K. Reddy, S. Arulkumar, P. Rai, and A. Mittal, “A generative approach to zero-shot and few-shot action recognition,” In 2018 IEEE Winter Conference on Applications of Computer Vision, pp. 372–380, 2018.
  • [37] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang, “Zero-shot action recognition with error-correcting output codes,” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2833–2842, 2017.
  • [38] X. Xu, T. M Hospedales, and Sh. Gong, “Multi-task zero-shot action recognition with prioritised data augmentation,” In European Conference on Computer Vision, pp. 343–359, 2016.
  • [39] Y. Zhu, Y. Long, Y. Guan, Sh. Newsam, and L. Shao, “Towards universal representation for unseen action recognition,” In 2018 IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [40] H. Wang and C. Schmid, “Action recognition with improved trajectories,” In Proceedings of the IEEE international conference on computer vision, pp. 3551–3558, 2013.
  • [41] Ch. Gan, T. Yang, and B. Gong, “Learning attributes equals multi-source domain generalization,” In 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 87–97, 2016.
  • [42] M. Bishay, G. Zoumpourlis, I. Patras, “TARN: Temporal Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition,” arXiv:1907.09021v1, 2019.
  • [43] M. Hahn, A. Silva and J.M. Rehg, “Action2Vec: A Crossmodal Embedding Approach to Action Learning,” arXiv:1901.00484v1, 2019.
  • [44] N. Madapana and J.P. Wachs, “Feature Selection for Zero-Shot Gesture Recognition,” 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020.

  • [45] N. Madapana and J.P. Wachs, “Zero-Shot Learning for Gesture Recognition,” 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020.
  • [46] N. Madapana and J.P. Wachs, “A Semantical and Analytical Approach for Zero Shot Gesture Learning,” 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), 2020.
  • [47] J. Wu, K. Li, X. Zhao, and M. Tan, “Unfamiliar Dynamic Hand Gestures Recognition Based on Zero-Shot Learning,” ICONIP, pp. 244–254, 2018.
  • [48] M. Nguyen, W. Qi-Yan, and H. Ho, “Sign Language Recognition from Digital Videos Using Deep Learning Methods,” Geometry and Vision, pp. 108–118, 2021.
  • [49] D. Li, Ch. Xu, X. Yu, K. Zhang, B. Swift, H. Suominen, H. Li, “TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation,” NIPS, 2020.
  • [50] Y. H Tsai, L. K. Huang, and R. Salakhutdinov, “Learning robust visual-semantic embeddings,” in Proceedings of the IEEE International Conference on Computer Vision, pp. 3571–3580, 2017.
  • [51] M. Wray, D. Larlus, G. Csurka, and D. Damen, “Fine-grained action retrieval through multiple partsof-speech embeddings,” in IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [52] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata, “Generalized zero-shot learning via aligned variational auto-encoders,” CVPR, pp. 8247–8255, 2019.
  • [53] P. Gupta, D. Sharma, R. Kiran Sarvadevabhatla, “Syntactically guided generative embeddings for zero-shot skeleton action recognition,” arXiv:2101.11530v1, 2021.