Sign language recognition (SLR) is one of the open problems in computer vision, with several challenges remaining to be addressed. For instance, while the definitions of the signs are typically clear and structural, the meaning of a sign can change based on the shape, orientation, movement, and location of the hand, body posture, and non-manual features like facial expressions [wu1999vision, stokoe2005sign]. Even the well-known hand shapes in sign languages can be difficult to discriminate and annotate due to viewpoint changes [neidle2012challenges]. In addition, like other natural languages, sign languages change and embrace variations over time. Therefore, scalable methods that can deal with such variations and challenges are needed.
The existing SLR approaches typically require a large number of annotated examples for each sign class of interest [cihan2018neural, camgoz2017subunets, koller2015continuous, koller2016deep, stoll2018sign]. Towards overcoming the annotation bottleneck in scaling up SLR, we explore the idea of recognizing sign language classes with no annotated visual examples, by leveraging their textual descriptions. To this end, we introduce the problem of zero-shot sign language recognition (ZSSLR). Unlike traditional supervised SLR, where training and test classes are the same, in ZSSLR the aim is to recognize sign classes that are not seen during training. Compared to the commonly studied ZSL problems, where most seen (training) classes have a large number of samples per class [patterson2012sun, wah2011caltech, lampert2014attribute, farhadi2009describing], ZSSLR takes ZSL to a new extreme where most seen classes have only a few training examples. This challenging situation corresponds to a hard zero-shot learning problem [madapana2018hard].
In order to realize seen-to-unseen class transfer, we use textual descriptions of sign classes taken from a sign language dictionary. Using a sign language dictionary to obtain the class representations has two major advantages: the descriptions are (i) readily available and (ii) prepared in detail by sign language experts. In this manner, we construct our ZSSLR approach on highly informative class descriptions without requiring any ad-hoc annotations, unlike possible alternative approaches such as attribute-based ZSL [lampert2014attribute, farhadi2009describing].
To study ZSSLR, we introduce a new benchmark dataset with 250 sign classes and their textual definitions. Our benchmark dataset is constructed from the ASLLVD corpus [neidle2012challenges], where the top 250 sign classes with the most in-class variance are selected and the corresponding descriptions are gathered from the Webster American Sign Language Dictionary [costello1999random]. The dataset is available at: https://ycbilge.github.io/zsslr.html.
We propose an embedding-based framework for ZSSLR that consists of two main components. The first component models the visual data with extra attention to temporal and spatial structure, using 3D-CNNs and LSTMs. These networks operate over the body and hand regions of the video in conjunction, since hands are important focal points of signs. The second component, the ZSL component, learns an embedding of the visual representation to the closest text description. We rigorously evaluate our proposed approach on the ASL-Text dataset and show its advantages.
To summarize, the main contributions of the paper are as follows: (i) we formulate the problem of zero-shot sign language recognition (ZSSLR), (ii) we define a new benchmark dataset for ZSSLR called ASL-Text, (iii) we propose a spatio-temporal representation that focuses on hand and full body regions via 3D-CNNs and LSTMs and learn it in an end-to-end manner for ZSSLR, and (iv) we present the benchmark results with detailed analyses.
2 Related Work
Sign Language Recognition (SLR). SLR has been studied for more than three decades [tamura1988recognition]. The mainstream SLR approaches can be grouped into two categories: (i) isolated SLR [wang2016isolated] and (ii) continuous SLR [cihan2018neural]. Our work belongs to the isolated SLR category, as we aim to recognize individual sign instances.
Early SLR methods mostly use hand-crafted features in combination with a classifier, such as support vector machines. Hidden Markov Models (HMM), Conditional Random Fields, and neural network based approaches have also been explored to model the temporal patterns [grobel1997isolated, huang1998sign]. Recently, several deep learning based SLR approaches have been proposed [huang2015sign, pigou2016sign, molchanov2016online, koller2016deep, camgoz2017subunets, cui2017recurrent, stoll2018sign, cihan2018neural, narayana2018gesture].
Despite the relative popularity of the topic, the problem of annotated data sparsity has seldom been addressed in SLR research. Farhadi and Forsyth [farhadi2006aligning] were the first to study the alignment of sign language video subtitles and signs in action to overcome the annotation difficulty. Their approach [farhadi2007transfer] was based on transferring large amounts of labelled avatar data with few labelled human signer videos to spot words in videos. Buehler et al. [buehler2009learning] also try to reduce the annotation effort by using the subtitles of British Sign Language TV broadcasts. They apply Multiple Instance Learning (MIL) to recognize signs out of TV broadcast subtitles. Kelly et al. [kelly2011weakly] and Pfister et al. [pfister2013large] also use subtitles of TV broadcasts. Pfister et al. [pfister2013large] differ from the two aforementioned MIL studies as they track the co-occurrences of lip and hand movements to reduce the search space for visual and textual content mapping. Nayak et al. [nayak2009automated] propose to locate signs in continuous sign language sentences using iterated conditional modes. Pfister et al. [pfister2014domain] define each sign class with one strongly supervised example and train an SVM-based detector from these one-shot examples. The resulting detector is then used to acquire more training samples from another weakly-labeled dataset. Koller et al. [koller2016deep] propose a combined CNN and HMM approach to train a model with large but noisy data. None of the aforementioned models approach the problem of annotated data sparsity from a zero-shot learning perspective.
Zero-Shot Learning. ZSL has been a focus of interest in vision and learning research in recent years, especially following the pioneering works of Lampert et al. [lampert2009learning] and Farhadi et al. [farhadi2009describing]. The main idea is learning to generalize a recognition model to identify unseen classes. Most ZSL approaches rely on transferring semantic information from seen to unseen classes. For this purpose, semantic attributes are widely used in the literature [ferrari2008learning, farhadi2009describing, liu2011recognizing, parikh2011relative, fu2014learning, lampert2014attribute, jain2015objects2action]. Semantic word/sentence vectors and concept ontologies are also studied in this context [rohrbach2011evaluating, norouzi2013zero, elhoseiny2013write, frome2013devise, socher2013zero, mensink2014costa, lei2015predicting]. Label embedding models are explored to connect seen and unseen classes via semantic representations [akata2013label, frome2013devise, norouzi2013zero, fu2014transductive, romera2015embarrassingly, qin2017zero, sumbul2018fine]. Akata et al. [akata2013label] propose a method to learn a compatibility function from the visual to the semantic feature space. As opposed to models that learn to map to a semantic space, there are also studies that learn to map to a common embedding space [romera2015embarrassingly, fu2014transductive].
Recently, ZSL has also been explored in the context of action recognition. Liu et al. [liu2011recognizing] were the first to propose an attribute-based model for recognizing novel actions. Jain et al. [jain2015objects2action] propose a semantic embedding based approach using commonly available textual descriptions, images, and object names. Xu et al. [Xu2015SemanticES] propose a regression-based method to embed actions and labels into a common embedding space. Xu et al. [xu2017transductive] also use word vectors as a semantic embedding space in transductive settings. Wang et al. [wang2017alternative] exploit human actions via related textual descriptions and still images. Their aim is to improve word vector semantic representations of human actions with additional information. Habibian et al. [habibian2017video2vec] also propose to learn semantic representations of videos from freely available videos and relevant descriptions. Qin et al. [qin2017zero] use error-correcting output codes to overcome the disadvantages of attributes and/or semantic word embeddings for information transfer. Compared to action recognition, in SLR even a subtle change in motion and/or handshape can change the entire meaning. Therefore, we argue that specialized methods are required for zero-shot recognition in SLR.
There are a couple of recent methods that introduce ZSL to gesture recognition. However, these methods are mostly limited to either robot interactions with single held-out classes ([thomason2016recognizing]), or based on attributes with limited datasets ([madapana2018hard]). We argue that attribute based semantic representations can be subjective and there is a high chance of missing beneficial attributes when annotating attributes manually. As also noted by [zhu2018towards], attribute based semantic representations are difficult to scale up as defining the attributes of even a single class can require a laborious amount of human effort. In this work, we work over an extensive dataset of classes for sign language and present an approach that does not require any manual attribute annotations.
Figure 2: Example textual descriptions from the ASL-Text dataset.
|Move both S hands in alternating forward circles, palms facing down, in front of each side of the body.|Beginning with the bent thumb and middle finger of the right 5 hand touching the chest, palm facing in, bring the hand forward while closing the fingers to form an 8 hand.|With the right index finger extended up, move the right hand, palm facing back, in a small repeated circle in front of the right shoulder.|Beginning with the fingertips of both F hands touching in front of the chest, palms facing each other, bring the hands away from each other in outward arcs while turning the palms in, ending with the little fingers touching.|
|Bring the fingertips of the right flattened O hand, palm facing in, to the lips with a repeated movement.|Move the right H hand, palm facing left and fingers pointing forward, from in front of the right side of the chest upward to near the right side of the head.|Strike the knuckles of the right A hand, palm facing in, against the extended left index finger held up in front of the chest, palm facing right.|Move the right L hand, palm facing forward, in a circle in front of the right shoulder.|
3 ASL-Text Dataset
To facilitate ZSSLR research, we use the ASLLVD dataset [neidle2012challenges], which is, to the best of our knowledge, the largest isolated sign language recognition dataset available. We select the top 250 sign classes from ASLLVD, ranked by the number of samples per class and the signer variance within each class, and augment this dataset with the textual definitions of the signs from the Webster American Sign Language Dictionary [costello1999random]. We refer to this new benchmark dataset as ASL-Text. Example frames and their textual descriptions for the ASL-Text dataset are presented in Figure 2.
The textual descriptions include the detailed instructions of a sign with emphasis on four basic parts: hand-shape, orientation of the palms (forward, backward, etc.), movements of the hands (right, left, etc.), and the location of hands with respect to the body (in front of the chest, each side of the body, right shoulder, etc.). Moreover, some descriptions also include non-manual cues such as the facial expressions, head movements and body posture. Hand shapes are described with specialized vocabulary including the terms F-hand, A-hand, S-hand, 5-hand, 8-hand, 10-hand, open-hand, bent-v hand, flattened-o hand [costello1999random]. Such a specialized vocabulary highlights the fact that ASL is a language on its own. From the example hand shapes shown in Figure 2, it can be seen that the textual sign language descriptions are indeed quite indicative of the ongoing gesture.
In the ASL-Text dataset, there are 1598 videos (54151 frames) in total for the 250 sign classes. The number of frames per video ranges from 6 to 116, and the average sequence length is 33 frames. For ZSL purposes, we split the dataset into three class-disjoint sets (train, validation, and test). The train set includes 170 classes, and the validation and test sets include 30 and 50 classes, respectively. The classes with the most signer variation and in-class samples are assigned to the training set. The remaining classes, which have relatively fewer visual examples, are allocated to the validation and test sets. This is done to reflect the real-world case: it is harder to train classifiers for classes that are rarely seen, so we train with the classes that have relatively more examples and test on the rare classes. Overall, we have 1188, 151, and 259 video samples in the training, validation, and test sets, respectively. The average length of the textual descriptions is 30 words, and the total vocabulary includes 154 distinct words.
The average number of instances per class is 7 for the training classes and 5 for the validation and test classes. Note that the number of examples per class, even for training, is still considerably lower than in the commonly studied ZSL datasets, e.g., AWA-2 [xian2017zero] and SUN Attribute [patterson2012sun], on which hundreds of per-class examples are used for training.
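The class-based split described above can be sketched as follows. This is only an illustration under stated assumptions: `samples_per_class` is a hypothetical mapping from class name to video count, classes are ranked by sample count alone (the text also mentions signer variation), and the order in which the remaining classes are divided between validation and test is a guess.

```python
def split_classes(samples_per_class, n_train=170, n_val=30, n_test=50):
    """Split classes into disjoint train/val/test sets by sample count.

    samples_per_class: dict mapping class name -> number of videos
    (hypothetical input; signer variation is ignored in this sketch).
    """
    if n_train + n_val + n_test != len(samples_per_class):
        raise ValueError("split sizes must cover all classes exactly")
    # Most frequent classes first; ties are broken by name so that
    # the resulting split is deterministic.
    ranked = sorted(samples_per_class,
                    key=lambda c: (-samples_per_class[c], c))
    train = ranked[:n_train]
    # Assumption: the rarer classes are divided in rank order.
    val = ranked[n_train:n_train + n_val]
    test = ranked[n_train + n_val:]
    return train, val, test
```

Because the split is made over classes rather than videos, no test class contributes any video to training, which is what makes the evaluation zero-shot.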
In this section, we first give a formal definition of the problem, and then explain the components of the proposed approach, an overview of which is given in Figure 1. The implementation details can be found in Section 5.1.
Problem definition. In ZSSLR, there are two sources of information: (i) the visual domain $\mathcal{V}$, which consists of sign videos, and (ii) the textual domain $\mathcal{T}$, which includes the textual sign descriptions. At training time, the videos, labels, and sign descriptions are available only for the seen classes, $\mathcal{C}_s$. At test time, our goal is to correctly classify the examples of novel unseen classes, $\mathcal{C}_u$, which are distinct from the seen classes.
The training set $\mathcal{D}_{tr} = \{(v_i, c_i)\}_{i=1}^{N}$ consists of $N$ samples, where $v_i \in \mathcal{V}$ is the $i$-th training video and $c_i \in \mathcal{C}_s$ is the corresponding sign class label. We assume that we have access to a textual description of each class $c$, represented by $\tau(c) \in \mathcal{T}$. The goal is to learn a zero-shot classifier that can correctly assign each test video to a class in $\mathcal{C}_u$, based on the textual descriptions.
In our approach, we aim to construct a label embedding based zero-shot classification model. For this purpose, we define the compatibility function $F(v, c)$ as a mapping from an input video and class pair to a score representing the confidence that the input video $v$ belongs to the class $c$. Given the compatibility function $F$, the test-time zero-shot classification function $f: \mathcal{V} \rightarrow \mathcal{C}_u$ is defined as:
$$f(v) = \arg\max_{c \in \mathcal{C}_u} F(v, c)$$
In this way, we leverage the compatibility function to recognize novel signs at test time.
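To make the test-time rule concrete, the following minimal sketch picks the unseen class with the highest compatibility score. The toy class vectors and the dot-product scoring function are hypothetical placeholders, not the learned model.

```python
def zero_shot_classify(compatibility, video, unseen_classes):
    """Return the unseen class whose compatibility score is highest."""
    return max(unseen_classes, key=lambda c: compatibility(video, c))


# Toy stand-ins (assumptions for illustration only): each class is a
# 2-d vector and the compatibility score is a plain dot product.
toy_class_vectors = {"AHEAD": [1.0, 0.0], "INSULT": [0.0, 1.0]}


def toy_compatibility(video_vec, cls):
    return sum(a * b for a, b in zip(video_vec, toy_class_vectors[cls]))
```

Only the unseen classes are passed as candidates, so seen classes never compete at test time in this setting.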
The performance of the resulting zero-shot sign recognition model directly depends on three factors: (i) the video representation, (ii) the class representation, and (iii) the model used as the compatibility function $F$. The following three sections provide the corresponding details.
4.1 Spatio-temporal video embedding
We aim to obtain an effective video representation by extracting short-term spatio-temporal features using ConvNet features of the video snippets, and then capturing longer-term dynamics through recurrent models. We additionally improve our representation by extracting features in two separate streams: the full frames and hand regions only. The details are given in the following paragraphs.
Short-term spatio-temporal representation. We obtain our basic spatio-temporal representation by first splitting each video into 8-frame snippets and then extracting their features using a pre-trained I3D model [carreira2017quo], a state-of-the-art 3D-ConvNet architecture. The I3D model is obtained by adapting a pre-trained Inception model [szegedy2015going] to the video domain and then fine-tuning on the Kinetics dataset. We obtain our most basic video representation by average pooling the resulting snippet features.
Modeling longer-term dependencies. Average pooling the 3D-CNN features is a well-performing technique for the recognition of non-complex (singleton) actions. Signs, on the contrary, portray more complicated patterns that are composed of sequences of multiple basic gestures. In order to capture the transition dynamics and longer-term dependencies across the snippets of a video, we use recurrent network models that take the I3D representation sequence as input and provide an output embedding. For this purpose, we propose to use the bidirectional LSTM (bi-LSTM) [graves2005framewise] model, and compare it against average pooling, LSTM [hochreiter1997long], and GRU [cho2014learning] models.
Two-stream video representation. Hands play a central role in expressing signs. In order to encode the details of the hand-area information in isolation from the overall body movements, we detect and crop the hand regions using OpenPose [cao2018openpose] and form a hand-only sequence corresponding to each video snippet. We define two separate streams, each including I3D and bi-LSTM networks, over these video inputs and then concatenate the resulting features to obtain the final video representation (Figure 1). When using recurrent networks, we train both streams together with the compatibility function in an end-to-end fashion.
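As a rough illustration of the two-stream fusion, the sketch below pools the snippet features of each stream and concatenates the results. For brevity, average pooling stands in for the recurrent (bi-LSTM) encoder, and the function names are hypothetical.

```python
def average_pool(snippet_features):
    """Average a list of equal-length snippet feature vectors."""
    n = len(snippet_features)
    return [sum(column) / n for column in zip(*snippet_features)]


def two_stream_representation(body_snippets, hand_snippets):
    """Concatenate the pooled body-stream and hand-stream features.

    Average pooling is a simplification used here in place of the
    recurrent encoder applied to each stream.
    """
    return average_pool(body_snippets) + average_pool(hand_snippets)
```

With 1024-d features per stream, the concatenated representation would be 2048-d; the toy vectors used to exercise this sketch can be of any matching length.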
4.2 Text-based class embeddings
We extract contextualized language embeddings from textual sign descriptions using the state-of-the-art language representation model BERT [devlin2018bert]. The BERT architecture consists of a stack of encoders; specifically, multi-layer bidirectional Transformers [vaswani2017attention]. Its main advantage over word2vec [mikolov2013distributed] and GloVe [pennington2014glove] representations is that BERT is contextual: the extracted representation of a word changes with respect to the other words in the sentence.
Figure 3 shows the t-SNE visualization of all sign class BERT embeddings. A close inspection of this feature space reveals that classes that appear close in the t-SNE embedding indeed have similar descriptions. For instance, the friend and hamburger signs are composed of similar motions with different hand shapes; the obscure and most signs have similar hand movements but different hand shapes and directions; and the comb and boss signs include the same repeated movement with different hand shapes and locations with respect to the body.
4.3 Zero-shot learning model
In our work, we adapt a label embedding [akata2013label, weston2010large] based formulation to tackle the ZSSLR problem. More specifically, we use a bilinear compatibility function that associates the video and class representations:
$$F(v, c) = \theta(v)^\top W \phi(\tau(c))$$
where $\theta(v)$ is the $d$-dimensional embedding of the video $v$, $\phi(\tau(c))$ is the $m$-dimensional BERT embedding of the textual description $\tau(c)$ for the class $c$, and $W$ is the $d \times m$ compatibility matrix. In order to learn this matrix, we use the cross-entropy loss with $\ell_2$-regularization:
$$\min_{W} \; -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp F(v_i, c_i)}{\sum_{c \in \mathcal{C}_s} \exp F(v_i, c)} + \lambda \lVert W \rVert^2$$
where $(v_i, c_i)$ are the $N$ training pairs, $\mathcal{C}_s$ is the set of seen classes, and $\lambda$ is the regularization weight. This core formulation is also used in [sumbul2018fine], in a completely different ZSL problem. Since the objective function is analogous to the logistic regression classifier, we refer to this approach as logistic label embedding (LLE).
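The bilinear compatibility score and the regularized cross-entropy objective can be sketched in plain Python as below. This is a hedged illustration rather than the training code: embeddings are ordinary lists, averaging the loss over samples and using the squared L2 norm of the matrix are assumptions, and the function names are hypothetical.

```python
import math


def bilinear_score(video_emb, W, class_emb):
    """Bilinear compatibility: video_emb^T W class_emb, with W of size d x m."""
    return sum(
        video_emb[i] * W[i][j] * class_emb[j]
        for i in range(len(video_emb))
        for j in range(len(class_emb))
    )


def lle_loss(videos, labels, class_embs, W, reg_weight):
    """Softmax cross-entropy over seen classes plus a squared L2 penalty on W.

    labels[i] is the index into class_embs of the true class of videos[i].
    """
    total = 0.0
    for v, y in zip(videos, labels):
        scores = [bilinear_score(v, W, e) for e in class_embs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[y]
    penalty = sum(w * w for row in W for w in row)
    return total / len(videos) + reg_weight * penalty
```

With an all-zero matrix every class receives the same score, so the loss reduces to the logarithm of the number of seen classes, which is a convenient sanity check before training.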
In addition to LLE, we also adapt the embarrassingly simple zero-shot learning (ESZSL) [romera2015embarrassingly] and semantic auto-encoder (SAE) [kodirov2017semantic] formulations as baselines. We omit the details of their formulations here for brevity.
5.1 Implementation Details
We fix the number of frames of each sign video to 32 by temporally down-scaling or up-scaling the video. For every consecutive 8 frames, we extract 1024-d features from the last average pooling layer of the I3D model using a stride of 4. When modeling the longer temporal context, we set the initial hidden and cell states of the LSTM or bi-LSTM to the average-pooled feature of each sequence during training. Hence, the hidden size equals the size of the average-pooled feature vector, which is 1024. For representing text, we use the $\text{BERT}_{\text{BASE}}$ model [devlin2018bert] and extract 768-dimensional sentence-based features. Following the description in [devlin2018bert], we concatenate the features from the last four layers of the pretrained Transformer of $\text{BERT}_{\text{BASE}}$ and $\ell_2$-normalize them.
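The frame bookkeeping above can be sketched as follows. The windowing follows the stated settings (32 frames, 8-frame snippets, stride 4); the resampling rule, nearest-frame selection at evenly spaced positions, is an assumption, since the exact scheme is not specified, and both function names are hypothetical.

```python
def resample_indices(num_frames, target=32):
    """Frame indices that stretch or shrink a video to `target` frames.

    Assumption: nearest-frame sampling at evenly spaced positions.
    """
    if num_frames < 1 or target < 1:
        raise ValueError("num_frames and target must be positive")
    if target == 1:
        return [0]
    step = (num_frames - 1) / (target - 1)
    return [round(i * step) for i in range(target)]


def snippet_windows(num_frames=32, length=8, stride=4):
    """(start, end) frame ranges of the sliding snippets, end exclusive."""
    return [(start, start + length)
            for start in range(0, num_frames - length + 1, stride)]
```

Under these settings a 32-frame video is covered by 7 overlapping snippets, each sharing 4 frames with its neighbor, so the recurrent model sees a sequence of 7 feature vectors per stream.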
We measure normalized accuracy, i.e. the mean accuracy per class, in all experiments. We run each experiment 5 times and report the average. Top-1, top-2, and top-5 accuracies for the random baseline are calculated by averaging over 10000 runs.
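A minimal sketch of the evaluation protocol, assuming each prediction is given as a ranked list of class labels. The random baseline here draws a uniformly random ranking per sample; the function names and the default number of runs are illustrative choices, not taken from the paper.

```python
import random
from collections import defaultdict


def normalized_topk_accuracy(ranked_predictions, labels, k=1):
    """Mean over classes of the per-class top-k accuracy.

    ranked_predictions: for each sample, class labels sorted by
    descending score; labels: the true class of each sample.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for ranking, label in zip(ranked_predictions, labels):
        counts[label] += 1
        hits[label] += int(label in ranking[:k])
    return sum(hits[c] / counts[c] for c in counts) / len(counts)


def random_baseline(labels, classes, k=1, runs=1000, seed=0):
    """Average normalized top-k accuracy of uniformly random rankings."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        rankings = [rng.sample(classes, len(classes)) for _ in labels]
        total += normalized_topk_accuracy(rankings, labels, k)
    return total / runs
```

For a uniformly random ranking over C classes, the expected top-k accuracy is k/C, so with 50 test classes the simulated baseline should approach 2% for top-1 and 10% for top-5.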
Table 1: Comparison of ZSL methods. Columns: Method | Val (30 Classes): top-1 | Test (50 Classes): top-1, top-2, top-5.
5.2 Experimental Results
We first evaluate the ZSL component of our framework. In this context, we explore three different ZSL approaches, namely SAE [kodirov2017semantic], ESZSL [romera2015embarrassingly], and LLE. In this set of experiments, we average-pool the extracted 3D-CNN features over the whole frame. Table 1 shows the corresponding results, where the top-1 validation accuracy and the top-1, top-2, and top-5 test accuracies are reported.
We observe that SAE [kodirov2017semantic] performs poorly compared to the other approaches. We think that this is due to the auto-encoder structure of the SAE method. The model learns a linear embedding from the video space to the semantic space with the additional goal of reconstructing the video representation from the semantic space. This idea might not work well since we do not have many in-class samples for reconstruction. In addition, as stated earlier, intra-class variance is very high among signers.
Next, we evaluate the performance of the two-stream spatio-temporal representation of the framework. Specifically, we carry out an ablation study, where body denotes the full-frame input stream, hand denotes the hand videos, and body+hand is the case where these two streams are used in conjunction. The corresponding results are given in Table 2. The hand stream provides additional cues and increases the performance on the validation classes for both methods. On the test classes, ESZSL [romera2015embarrassingly] does not perform well on the hand stream alone; however, its performance increases when both streams are used in conjunction. Similarly, LLE benefits from the introduction of the hand stream and outperforms ESZSL when the two streams are used together. Overall, we observe that the proposed framework based on the LLE formulation works better, especially regarding top-1 and top-2 accuracies.
Table 2: Ablation over the body and hand streams.
|Method|visual rep.|Val (30 Classes) top-1|Test (50 Classes) top-1|Test (50 Classes) top-2|Test (50 Classes) top-5|
| |body + hand|14.6|17.1|25.7|43.0|
| |body + hand|16.2|18.0|27.4|43.8|
We further evaluate the effect of longer temporal modeling with different RNN architectures. We experiment with three different RNN models, namely LSTM [hochreiter1997long], GRU [cho2014learning], and bi-LSTM [graves2005framewise] units, using LLE over both hand and body streams. Table 3 presents these results. We observe that, compared to average pooling of the streams, the framework benefits from longer temporal modeling with all architectures and performs best with bi-LSTMs. This illustrates the importance of visual representations for ZSSLR. Our overall proposed framework reaches a top-1 normalized accuracy of 20.9% and a top-5 normalized accuracy of 51.4%, well above the top-1 and top-5 accuracies of the random baseline (2.0% and 10.0%, respectively).
|Correctly Predicted Label: STRANGE Move the right C hand from near the right side of the face, palm facing left, downward in an arc in front of the face, ending near the left side of the chin, palm facing down.|
|Correctly Predicted Label: AHEAD Beginning with the palm sides of both A hands together, move the right hand forward in a small arc.|
|Correctly Predicted Label: INSULT Move the extended right index finger from in front of the right side of the body, palm facing left and finger pointing forward, forward and upward sharply in an arc.|
|Correctly Predicted Label: GET-UP Place the fingertips of the right bent V hand, palm facing in and fingers pointing down, on the upturned palm of the left open hand held in front of the body.|
|Predicted Label: BREAK-DOWN Beginning with the fingertips of both curved 5 hands touching in front of the chest, palms facing each other, allow the fingers to loosely drop, ending with the palms facing down. Correct Label: MEETING Beginning with both open hands in front of the chest, palms facing each other and fingers pointing up, close the fingers with a double movement into flattened O hands while moving the hands together.|
|Predicted Label: AVERAGE Brush the little-finger side of the right open hand, palm facing left, back and forth with a short repeated movement on the index-finger side of the left open hand, palm angled right. Correct Label: GRAB-CHANCE Bring the right curved 5 hand from in front of the right side of the body, palm facing left and fingers pointing forward, in toward the body in a downward arc while changing into an S hand, brushing the little-finger side of the right S hand across the palm of the left open hand, palm facing up in front of the chest.|
Figure 4 shows examples of correctly and incorrectly classified test sequences. We observe that, for the classes that are confused with each other, either the textual descriptions or the visual aspects are very similar. This indicates that the problem domain can benefit from more detailed analyses and representations that focus on nuances, in both the visual and the textual domain, which can be explored as a future direction.
This paper introduces and explores the problem of ZSSLR. We present a benchmark dataset for this novel problem by augmenting a large ASL dataset with sign language dictionary descriptions. Our proposed framework builds upon the idea of using these auxiliary texts as an additional source of information to recognize unseen signs. We propose an end-to-end trainable ZSSLR method that focuses on hand and full body regions via 3D-CNNs+LSTMs and learns a compatibility function via label embedding. Overall, the experimental results indicate that zero-shot recognition of signs based on textual descriptions is possible. Nevertheless, the accuracy levels achieved are quite low compared to other ZSL domains, pointing to a substantial need for further exploration in this direction.
This work was supported in part by TUBITAK Career Grant 116E445.