Sign language (SL) is a native language of people with hearing impairments. As a visual language, it consists of hand gestures, movements, facial expressions, transitions, etc. Sign Language Recognition (SLR) and Sign Language Translation (SLT) aim at converting video-based sign language into sign gloss sequences and spoken language sentences, respectively. Most previous works in this field focus on continuous SLR with gloss supervision [27, 26, 8, 37, 43, 9, 38, 36, 18, 7, 2, 53, 20], while few attempts have been made at SLT [3, 4, 31, 5]. The main difference is that gloss labels are in the same order as the sign gestures, so gloss annotations significantly ease syntactic alignment in SLR methods. However, the word ordering rules of natural language are distinct from their counterparts in video-based sign languages. Moreover, sign videos are composed of continuous sign gestures represented by sub-video clips without explicit boundaries. Therefore, directly learning the mapping between frame-wise signs and natural language words is challenging.
To achieve better translation performance, a promising line of research is to build a joint sign language recognition and translation model, which recognizes glosses and translates natural language sentences simultaneously [3, 4]. With gloss supervision, the model can better understand sign videos, which brings significant benefits to sign language translation. Along this line, Camgöz et al. propose a joint model, the Sign Language Transformer (SLTR), which is based on the vanilla Transformer. They learn recognition and translation simultaneously and achieve state-of-the-art results thanks to the Transformer's advantage in sequence modeling tasks. However, there are still some inherent flaws that limit the capabilities of the Transformer model when solving the SLR and SLT tasks:
(a) The self-attention mechanism aggregates temporal sign visual features in a frame-wise manner. This mechanism neglects the temporal structure of sign gestures represented by sub-videos, leading to substantial ambiguity in recognition and translation.
(b) The attention mechanism is permutation-insensitive. Position encoding is therefore essential to inject position information for sequence learning, e.g., sign video learning and sentence learning. However, the absolute position encoding used in the vanilla Sign Language Transformer (SLTR) has been shown to be unaware of distance and direction [40, 48], which limits its performance.
To remedy the first shortcoming (a), an intuitive idea is to gather neighboring temporal features to enhance the frame-wise sign representation. However, it is difficult to determine the boundaries of a sign gesture and to select the surrounding neighbors precisely. In this paper, we propose a Content-aware and Position-aware Temporal Convolution (CPTcn) to learn robust sign representations. We first propose a content-aware neighborhood gathering method to adaptively select the surrounding neighbors. Specifically, we leverage the local consistency of sign gestures: adjacent frames that belong to the same sign gesture share similar semantics. Accordingly, we dynamically select neighboring features based on their similarities. Then we aggregate the selected features with temporal convolution layers. However, temporal convolution with a limited receptive field is insufficient to capture the position information of the features in the selected region. To alleviate this drawback, we inject position awareness into the convolution layers with Relative Position Encoding (RPE). By aggregating neighboring similar features, our CPTcn module obtains discriminative sign representations, thus improving the recognition and translation results.
To solve the second issue (b), we inject relative position information into the learning of sign videos and target sentences. Furthermore, we consider the relative position between sign frames and target words. To the best of our knowledge, we are the first to model the position relationship between the source sequence and the target sequence in sequence-to-sequence architectures. There are several existing methods to endow the self-attention mechanism with relative position information [40, 10, 17, 39, 24]. In this paper, we adopt the Disentangled Relative Position Encoding (DRPE) in our video-based sign language learning, target sentence learning, and their mapping learning. Note that, unlike the RPE mentioned above, DRPE models the correlations between relative positions and sign features, which has been shown to bring improvements [17, 49]. With the distance and direction awareness learned from DRPE, our improved Transformer model learns better feature representations, thus gaining significant improvements.
We call our approach PiSLTRc for “Position-informed Sign Language TRansformer with content-aware convolution”. The overview of our model can be seen in Figure 1. The main technical contributions of our work are summarized as follows:
We propose a content-aware and position-aware CPTcn module to learn neighborhood-enhanced sign features. Specifically, we first introduce a novel neighborhood gathering method based on semantic similarities. Then we aggregate the selected features with position-informed temporal convolution layers.
We endow the Transformer model with relative position information. Compared with absolute position encoding, relative position encoding performs better for sign video and natural sentence learning. Furthermore, we are the first to consider the relative position relationship between sign frames and target words.
Equipped with the two proposed techniques, our model achieves state-of-the-art translation accuracy on the largest public dataset, RWTH-PHOENIX-Weather 2014T. We also obtain significant improvements in recognition accuracy compared with other RGB-based models on both the PHOENIX-2014 and PHOENIX-2014-T datasets.
The remainder of this paper is organized as follows. Section II reviews related works in sign language and position encoding. Section III introduces the architecture of our proposed PiSLTRc model. Section IV provides implementation details on our model, presents a quantitative analysis that provides some intuition as to why our proposed techniques work, and finally presents the experimental results compared with several baseline models.
II Related Work
II-A Sign Language Recognition
Most previous sign language works focus on continuous sign language recognition (cSLR), which is a weakly supervised sequence labeling problem. cSLR aims at transcribing video-based sign language into gloss sequences. With the release of larger-scale cSLR datasets, numerous studies have addressed sign language recognition in an end-to-end manner [27, 26, 8, 9, 37, 43, 38, 36, 18, 7, 2, 53, 20]. Gloss annotations are in the same order as the signs, and this monotonic relationship significantly eases syntactic alignment in cSLR methods. However, the relationship between gloss sequences and spoken natural language is non-monotonic, so it is infeasible to realize SLT with cSLR methods alone. Fortunately, the knowledge learned by cSLR can be transferred to SLT models and improve their performance.
II-B Sign Language Translation
Sign Language Translation (SLT) is much more challenging because learning the alignment between frame-wise sign gestures and natural language words is difficult. Camgöz et al. first introduce an end-to-end SLT model that uses a Convolutional Neural Network (CNN) backbone to capture spatial features and an attention-based encoder-decoder model to learn the mapping between sign videos and natural language sentences. Building on this work, Camgöz et al. replace the sequence-to-sequence structure with the Transformer architecture, which is the state-of-the-art model in Neural Machine Translation (NMT). Furthermore, they jointly learn sign language recognition and translation with a shared Transformer encoder and demonstrate that joint training provides significant benefits. Our work is built upon their joint sign language Transformer model: we improve the Transformer with our proposed CPTcn module and endow it with relative position information.
II-C Position Encoding in Convolution
Temporal convolutional neural networks are a common method to model sequential information [33, 47, 14, 51]. Convolution layers have been shown to implicitly learn absolute position information from the commonly used padding operation. However, this is insufficient to learn powerful representations that encode sequential information, especially with a limited receptive field. Explicitly encoding absolute position information has been shown to be effective for learning image features. Building on this hypothesis, we apply relative position encoding (RPE) to the temporal convolution layers, aiming to model the positional correlations between the current feature and its surrounding neighbors.
II-D Position Encoding in Self-attention
The Transformer relies entirely on the attention mechanism, which does not explicitly model position information. To remedy this drawback, sinusoidal absolute position encoding and learnable absolute position encoding were proposed to endow the model with position information. Afterward, relative position encoding was proposed to model long sequences and to provide the model with relation awareness [48, 40]. In our work, we reuse the disentangled position encoding to exploit distance and direction awareness with relative position encoding. Moreover, we also explore the position relationship between the sign video and the target sentence. Note that, unlike RPE in convolution, DRPE in the attention mechanism considers the relationship between content and position features, which has been shown to be effective in previous works [49, 17]. Our experiments indicate that relative position information is vital for sequence-to-sequence mapping learning.
III-A Preliminaries and Model Overview
Figure 1 illustrates the overall architecture of our proposed model, which jointly learns to recognize and translate sign videos into gloss annotations and spoken language sentences. In the following subsections, we will first revisit the sign language Transformer structure and then give detailed descriptions about our proposed two methods: content-aware and position-aware temporal convolution (CPTcn), and self-attention with disentangled relative position encoding (DRPE).
III-B Joint Sign Language Transformer Structure
Given a series of video frames, the vanilla sign language Transformer (SLTR) model first adopts a CNN backbone to extract frame-wise spatial features and uses a word embedding to map one-hot natural language words into dense vectors. Then a Transformer-based encoder-decoder model is used to learn SLR and SLT simultaneously. For SLR, the encoder outputs the learned temporal sign features, and a Connectionist Temporal Classification (CTC) loss is applied to learn the mapping between gloss annotations and sign features. For SLT, the decoder decomposes the sequence-level conditional probability in an autoregressive manner, and the cross-entropy loss is calculated for each word. The SLR and SLT branches share the Transformer encoder.
The vanilla Transformer is a sequence-to-sequence structure consisting of several Transformer blocks. Each block contains a multi-head self-attention layer and a fully connected feed-forward network. Given a sequence of frame-wise features, and taking single-head attention as an example, the standard self-attention (Equation 1) projects the input with learned projection matrices into queries, keys, and values, computes the similarity between each query and key, and normalizes the similarities into attention weights.
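For reference, standard single-head self-attention can be sketched as follows. The symbol names ($X$ for the input features, $W_Q, W_K, W_V$ for the projection matrices, $A$ for the similarity matrix, $\alpha$ for the attention weights, $d$ for the feature dimension) are notation assumed here for illustration and may differ from the original Equation 1.

```latex
\begin{aligned}
Q,\ K,\ V &= X W_Q,\ X W_K,\ X W_V \\
A &= Q K^{\top} \\
\alpha &= \operatorname{softmax}\!\left(\frac{A}{\sqrt{d}}\right) \\
\operatorname{Attention}(X) &= \alpha V
\end{aligned}
```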
Our work concentrates on improving the self-attention mechanism to understand sign video and target sentences better. To focus on our main contributions, we omit the detailed architecture and refer readers to  for reference.
III-C Content-aware and Position-aware Temporal Convolution
As shown in Figure 2, we propose content-aware and position-aware temporal convolution (CPTcn) to learn local temporal semantics, aiming at obtaining more discriminative sign representations. In this section, we first introduce a content-aware neighborhood gathering method, which adaptively selects surrounding neighbors. Secondly, we elaborate on the detail of endowing the temporal convolution with relative position information, which models the relationship between surrounding features and the current feature. Finally, we incorporate the proposed CPTcn module with the self-attention mechanism.
III-C1 Content-aware neighborhood gathering Method
In sign videos, we observe that each sign gesture usually lasts about 0.5-0.6 seconds (16 frames). However, the vanilla Sign Language Transformer (SLTR) model aggregates sign features in a frame-wise manner, thus neglecting the local temporal structure of sign gestures. Unlike their work, we develop a content-aware neighborhood gathering method that adaptively selects the relevant surrounding features, which lie around a specific feature in a contiguous region. As shown in Figure 3, we obtain the clip-level feature with neighboring features via three steps:
1). Given the sequential representations from the CNN backbone model, we apply an outer tensor product to get a similarity matrix:
where the diagonal elements in represent similarities towards the features themselves.
2). To ensure that nearby neighbors are selected rather than far-away features, we only consider a limited range around a specific feature to keep local semantic consistency. We then replace the similarity scores outside this range, and at the current feature itself, with -inf. Mathematically, the selection criterion becomes:
where represents the maximum distance among the considering features from the current feature. Then we apply the softmax function to obtain the masked distribution in the local region around the current feature :
Note that the weight at the current feature is zero, thus the summation of the weights before and after the current feature is 1.
It is hard to determine the size and boundaries of the local region. Fortunately, the normalized distribution of similarities obtained in Equation 4 indicates the location of similar neighbors. Therefore, we use the weights of the normalized distribution before and after the current feature to adaptively determine the size of the selected region. We define the sizes of the region before and after the current feature, respectively, as:
is a hyperparameter to control the size of selected region, and the size of the region is. We define the final selected contiguous region as (Locally Similar Region) for a specific feature :
Finally, we adaptively obtain the clip-level features which are in a contiguous local region:
where denotes the content-aware neighborhood gathering method, and denotes the current feature with its surrounding neighbors. The clip-level features with temporal surrounding neighbors can be computed using Algorithm 1.
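The gathering steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the names `max_dist` (the local range) and `gamma` (the size hyperparameter) are hypothetical, and the rule that splits the region into `gamma * weight_before` frames on the left and the remainder on the right is an assumed reading of the text.

```python
import numpy as np

def content_aware_gather(x, max_dist=8, gamma=15):
    """Select one contiguous, content-dependent region per frame.

    x: (T, D) frame-wise features with T >= 2.
    Returns the masked softmax weights (T, T) and the inclusive
    start/end frame indices of the region chosen for each frame.
    """
    T = x.shape[0]
    sim = x @ x.T                                   # pairwise similarity matrix
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])
    # Mask far-away frames and the current frame itself with -inf.
    sim = np.where((dist > max_dist) | (dist == 0), -np.inf, sim)
    sim = sim - sim.max(axis=1, keepdims=True)      # numerically stable softmax
    w = np.exp(sim)
    w = w / w.sum(axis=1, keepdims=True)
    # Weight mass located before the current frame (strictly lower triangle).
    before = np.tril(w, k=-1).sum(axis=1)
    left = np.rint(gamma * before).astype(int)      # assumed size rule
    right = gamma - left
    start = np.clip(idx - left, 0, T - 1)
    end = np.clip(idx + right, 0, T - 1)
    return w, start, end

rng = np.random.default_rng(0)
feats = rng.normal(size=(40, 16))
weights, start, end = content_aware_gather(feats)
clips = [feats[s:e + 1] for s, e in zip(start, end)]  # clip-level features
```

Because the weight at the current frame is zero, the mass before and after it sums to one, so the left and right sizes always add up to `gamma` before clipping at the sequence boundaries.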
III-C2 Position-aware Temporal Convolution
Temporal convolution is a common method to aggregate sequential features. However, convolution layers with a limited receptive field are insufficient to capture position information, which is important for sign gesture understanding. More specifically, the recognition of sign language is sensitive to the frame order. Absolute position encoding, used in previous methods [4, 31], is a promising approach to encode position information. However, it has been shown to be unaware of direction and distance. Inspired by recent work on language modelling, we infuse relative position information into the clip-level feature. We first compute the relative position matrix between the frame-wise feature and the current feature:
Then we represent the relative position indices in a learnable embedding space and obtain the position embeddings. Adding these embeddings to the clip-level features yields the position-informed clip-level representation:
Lastly, we aggregate the clip-level features with position information to compressed features , and apply a residual function:
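The relative position step can be sketched as below. This is an illustrative sketch only: the function name, the clipping bound `max_dist`, and the randomly initialized table `rel_emb` (standing in for a learned embedding) are assumptions, and the subsequent convolutional aggregation and residual connection are not shown.

```python
import numpy as np

def position_informed_clip(clip, center, rel_emb, max_dist):
    """Add relative position embeddings to one clip.

    clip:    (L, D) features of the selected region.
    center:  index of the current feature inside the clip.
    rel_emb: (2 * max_dist + 1, D) embedding table, one row per signed offset.
    """
    rel = np.arange(clip.shape[0]) - center            # signed offsets, 0 at the current frame
    rows = np.clip(rel, -max_dist, max_dist) + max_dist  # shift offsets to valid row indices
    return clip + rel_emb[rows], rel

rng = np.random.default_rng(0)
clip = rng.normal(size=(16, 8))
table = rng.normal(size=(2 * 16 + 1, 8))                # stand-in for learned embeddings
informed, offsets = position_informed_clip(clip, center=7, rel_emb=table, max_dist=16)
```

Frames before and after the current one receive different embeddings because their offsets have opposite signs, which is what makes the encoding direction-aware.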
III-C3 Self-attention with CPTcn
Similar to the vanilla Transformer model, we feed the aggregated feature to the self-attention mechanism. Note that, as shown in Figure 2, we only set , and keep as original frame-wise representation . The reason for this design is to maintain the difference within adjacent features in . Experimental result demonstrates that this network design performs better than .
III-D Self-Attention with DRPE
As shown in Figure 1, we further inject relative position information into the attention mechanism for sign video learning, target sentence learning, and the mapping learning between them. Most existing approaches for endowing the attention mechanism with relative position information are based on pairwise distance. They have been explored in machine translation, music generation, and language modelling [10, 17]. Here, we adopt a disentangled relative position encoding (DRPE).
Different from the RPE used in Section III-C2, DRPE considers the correlations between relative positions and content features, which have been shown to improve performance [17, 49]. Specifically, we separate the content features and the relative position encoding when computing attention weights. The first line of projection in Equation 1 is reparameterized as:
where represent the content feature. represent query, key and value content vectors which are obtained with projection matrices . represents created learned relative position embedding, where is the max relative distance. represent the projected position embedding with projection matrices , respectively.
Following this, we generate the attention weights with the relative position bias. The pairwise content-content term is computed in the same way as in standard self-attention, generating the content-based content vectors. The pairwise content-position terms, in contrast, differ from standard self-attention: we first create a relative position distance matrix and then generate the position-based content vectors. The lines computing the attention weights in Equation 1 are reparameterized as:
where represents the unnormalized attention score matrix and represents the score computed by query at position and key at the position . represents the relative distance matrix computed by the positions of query and key. lies in the -th of , and represents the relative distance between -th query and -th key. and are computed in similar ways. Note that and are opposite numbers thus providing our model with directional information.
Moreover, in the first line of the above equation, the first item represents content-to-content, i.e., the content-based content vectors. The second and third items represent content-to-position and position-to-content, respectively, which are relative-position-based content vectors. The last item represents position-to-position, which is omitted in vanilla DRPE. However, in our experiments, we find that it brings improvements in both recognition and translation. Therefore, we keep the position-to-position item. In Section IV-C3, we analyze the impact of the different items in the first line of Equation 12.
Finally, in the last two lines, we apply the scaling factor and the softmax function to obtain the normalized scaled attention weights.
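A single-head version of attention with the four items can be sketched as follows. The bucketed relative distance follows the DeBERTa convention, where offsets are shifted by `k` and clipped to `[0, 2k)`. The exact form of the position-to-position item and the scaling factor `sqrt(4 * d)` are assumptions made for this sketch, as are all function and parameter names.

```python
import numpy as np

def relative_index(n, k):
    """Bucketed relative distance delta(i, j), shifted by k and clipped to [0, 2k)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def drpe_self_attention(h, rel_emb, w, k):
    """h: (n, d) content features; rel_emb: (2k, d) relative position embeddings;
    w: dict of (d, d) projection matrices (hypothetical key names)."""
    n, d = h.shape
    q_c, k_c, v_c = h @ w["q_c"], h @ w["k_c"], h @ w["v_c"]
    q_r, k_r = rel_emb @ w["q_r"], rel_emb @ w["k_r"]
    d_ij = relative_index(n, k)          # delta(i, j)
    d_ji = d_ij.T                        # delta(j, i), the opposite direction
    c2c = q_c @ k_c.T                                        # content-to-content
    c2p = np.take_along_axis(q_c @ k_r.T, d_ij, axis=1)      # q_c[i] . k_r[delta(i, j)]
    p2c = np.take_along_axis(k_c @ q_r.T, d_ij, axis=1).T    # k_c[j] . q_r[delta(j, i)]
    p2p = (q_r @ k_r.T)[d_ij, d_ji]                          # assumed form of position-to-position
    scores = (c2c + c2p + p2c + p2p) / np.sqrt(4 * d)        # assumed scaling factor
    weights = softmax(scores)
    return weights @ v_c, weights

rng = np.random.default_rng(0)
n, d, k = 6, 8, 4
proj = {name: rng.normal(size=(d, d)) for name in ("q_c", "k_c", "v_c", "q_r", "k_r")}
out, attn = drpe_self_attention(rng.normal(size=(n, d)), rng.normal(size=(2 * k, d)), proj, k)
```

Dropping individual terms from `scores` gives a direct way to reproduce the kind of item-wise ablation discussed for Equation 12.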
In total, there are two differences between the DRPE method applied in our architecture and DeBERTa. The first is that we consider the position-to-position information, which is omitted in DeBERTa. Experimental results in Table V show the effectiveness of this item. The second difference is that DeBERTa uses DRPE only on text for language modeling, whereas in our proposed model, as seen in Figure 1, we apply the relative position method to text-only target sentence learning, image-only sign video learning, and even the cross-modal interaction between the video sequence and the target sentence. Experimental results in Table IV show the effectiveness of our improvements. Note that we are the first to consider the relative position relationship between sign frames and target words.
In summary, equipped with the CPTcn module and DRPE in the self-attention layers, the core module of the Transformer model, we arrive at our proposed PiSLTRc model.
IV-A Dataset and Metrics
PHOENIX-2014 is a publicly available German Sign Language dataset, which is the most popular benchmark for continuous SLR. The corpus was recorded from broadcast news about weather. It contains videos of 9 different signers with a vocabulary size of 1295. The split of videos for Train, Dev, and Test is 5672, 540, and 629, respectively.
PHOENIX-2014-T is the benchmark dataset of sign language recognition and translation. It is an extension of the PHOENIX14 dataset . Parallel sign language videos, gloss annotations, and spoken language translations are available in PHOENIX14T, which makes it feasible to learn SLR and SLT tasks jointly. The corpus is curated from a public television broadcast in Germany, where all signers wear dark clothes and perform sign language in front of a clean background. Specifically, the corpus contains 7096 training samples (with 1066 different sign glosses in gloss annotations and 2887 words in German spoken language translations), 519 validation samples, and 642 test samples.
CSL is a Chinese Sign Language dataset, which is also a widely used benchmark for continuous SLR. The videos were recorded in a laboratory environment, using a Microsoft Kinect camera with a resolution of 1280 × 720 and a frame rate of 30 FPS. The corpus contains 100 sentences, and each sentence is signed five times by 50 signers (25,000 videos in total). As no official split is provided, we split the dataset ourselves, assigning 20,000 samples to the training set and 5,000 samples to the testing set. When splitting the dataset, we ensure that the sentences in the training and testing sets are the same, but the signers are different.
We evaluate our model on the performance of SLR and SLT as follows:
Sign2gloss aims to transcribe sign language videos to sign glosses. It is evaluated using word error rate (WER), a widely used metric for cSLR, defined as the number of substitutions, deletions, and insertions divided by the number of words in the reference.
Sign2text aims to directly translate sign language videos to spoken language translation without intermediary representation. It is evaluated using BLEU, which is widely used for machine translation.
Sign2(gloss+text) aims to learn continuous SLR and SLT jointly. This approach is currently state-of-the-art in SLT performance, since the training of cSLR benefits sign video understanding and thus improves translation.
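As a concrete illustration of the WER metric used for Sign2gloss, the sketch below computes the word-level edit distance (substitutions, deletions, insertions) and divides it by the reference length. The gloss strings in the usage line are made-up examples, not taken from the dataset.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions
    needed to turn the token list `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent for whitespace-tokenized gloss strings."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)

score = wer("MORGEN REGEN NORD WIND", "MORGEN SONNE NORD")  # one substitution, one deletion
```

Lower is better for WER, whereas higher is better for BLEU, which is worth keeping in mind when reading the ablation tables.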
IV-B Implementation and Evaluation Details
IV-B1 Network Details
We apply the improved Transformer network to learn SLR and SLT simultaneously. Its settings follow Camgöz et al. Specifically, we use 512 hidden units, 8 heads, 6 layers, and a dropout rate of 0.1.
In our proposed CPTcn module, the size of the selected contiguous locally similar region is set to 16 (about 0.5-0.6 seconds), which is the average time needed to complete a gloss. We analyze the impact of the size in Section IV-C1.
The setting of the two temporal convolution layers is F3-S1-P0-F3-S1-P0, where F, S, and P denote the kernel filter size, stride, and padding size, respectively. The analysis of the different modules of the position-informed convolution is presented in Section IV-C2.
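To make the F-S-P notation concrete, the short sketch below parses such a setting string and applies the standard output-length formula for a 1D convolution to a 16-frame region. The helper names are hypothetical, and how the remaining time steps are reduced to a single compressed feature is not specified here.

```python
def parse_setting(setting):
    """Parse a string such as 'F3-S1-P0-F3-S1-P0' into (kernel, stride, padding) triples."""
    values = [int(token[1:]) for token in setting.split("-")]
    return [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]

def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1D convolution: floor((L + 2P - F) / S) + 1."""
    return (length + 2 * padding - kernel) // stride + 1

lengths = [16]                                    # size of the selected region
for kernel, stride, padding in parse_setting("F3-S1-P0-F3-S1-P0"):
    lengths.append(conv_out_len(lengths[-1], kernel, stride, padding))
```

With no padding and stride 1, each layer shortens the sequence by two frames, so the two stacked layers have a combined receptive field of five frames.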
In the self-attention and cross-attention mechanisms, we apply DRPE to inject relative position information. We set the max relative distance to 32 in our experiments. The analysis of DRPE is conducted in Section IV-C3.
Besides, we train the SLR and SLT simultaneously. Thus we set and as the weight of recognition loss and translation loss.
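The joint objective can be written as a weighted sum of the two losses. The symbols below ($\lambda_R$ and $\lambda_T$ for the weights, $\mathcal{L}_R$ for the CTC recognition loss, $\mathcal{L}_T$ for the cross-entropy translation loss) are notation assumed for this sketch.

```latex
\mathcal{L} = \lambda_R \, \mathcal{L}_R + \lambda_T \, \mathcal{L}_T
```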
We use the Adam optimizer to optimize our model. We adopt a warmup schedule that increases the learning rate from 0 to 6.8e-4 within the first 4000 steps and then gradually decays it in proportion to the inverse square root of the number of training steps. We train the model on one NVIDIA TITAN RTX GPU and average 5 checkpoints for the final results.
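The schedule described above can be sketched as follows. The peak rate and warmup length come from the text, while the linear shape of the warmup and the function name are assumptions.

```python
import math

def learning_rate(step, peak_lr=6.8e-4, warmup_steps=4000):
    """Warmup to peak_lr, then inverse-square-root decay.

    The linear warmup shape is an assumption; the text only states
    that the rate rises from 0 to the peak within the warmup steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * math.sqrt(warmup_steps / step)

schedule = [learning_rate(s) for s in (0, 2000, 4000, 16000)]
```

Under this form the rate at step 16000 is half the peak, since the decay factor is the square root of 4000/16000.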
During inference, we adopt CTC beam search decoder with a beam size of 5 for SLR decoding. Meanwhile, we also utilize the beam search with the width of 5 for SLT decoding, and we apply a length penalty  with values ranging from 0 to 2.
IV-C Ablation Study
|Centered NG ()||23.64||24.17||20.73||21.23|
|Sparse NG ()||23.06||23.52||21.83||22.08|
|Content-aware NG ()||22.23||23.01||23.17||23.40|
|Size of LSR||SLR(WER)||SLT(BLEU-4)|
IV-C1 Analysis of content-aware neighborhood gathering method
In our proposed CPTcn module, we introduce a content-aware neighborhood gathering method to select the relevant surrounding neighbors dynamically. Three potential questions about this method are: 1) How much improvement does the content-aware method bring? 2) Must the selected features be contiguous in position? 3) What is the appropriate size of the selected region?
In Table I, we compare three methods to answer the first two questions, i.e., whether the selected region needs to be content-aware and contiguous. For notation, w/o NG means no neighborhood gathering method. Centered NG means directly selecting k features centered around the current feature. Sparse NG means dynamically selecting the k features with the highest similarity, which may be discontinuous in position. Content-aware NG means adaptively selecting k contiguous features based on similarity using our proposed content-aware neighborhood gathering method. We can see that all neighborhood gathering methods effectively improve the performance. Sparse NG substantially outperforms Centered NG. This gap suggests that content awareness is critical for feature selection. Moreover, Content-aware NG performs better than the other methods. This indicates that our content-aware contiguous feature aggregation is more suitable for capturing sign gesture representations.
In Table II, we explore the appropriate size of the selected locally similar region (LSR). Our model performs best when the size of LSR is 16. This is consistent with the finding that 16 frames (about 0.5-0.6 seconds) is the average time needed to complete a gloss. Besides, when gathering wider regions (for example, 20 frames), we observe slight performance degradation. This is because 20 frames (about 1 second) usually contain more than one gesture, which lowers the performance.
|module in CPTcn||SLR (WER)||SLT (BLEU-4)|
IV-C2 Analysis of position-aware Temporal Convolution
In the first two lines of Table III, we study the relative position encoding in the CPTcn module. Experimental results show that position information is crucial for aggregating the sequential features. Furthermore, compared with absolute position encoding (APE), relative position encoding (RPE) improves both the BLEU and WER scores on the test set. This result supports the conjecture of Yan et al. that RPE provides direction and distance awareness for sequence modeling, which APE lacks.
IV-C3 Analysis of DRPE in self-attention
We further conduct comparative experiments to analyze the effectiveness of disentangled relative position encoding (DRPE) in the attention mechanism. As shown in Figure 1, we replace absolute position encoding (APE) with DRPE in three places: encoder self-attention, decoder self-attention, and encoder-decoder cross attention. For notation, in Table IV, “Enc-SA” means self-attention in the encoder module, “Dec-SA” means self-attention in the decoder module, and “Enc-Dec-CA” means cross attention between the encoder and decoder. In the first lines of Table IV, we can see that DRPE brings significant improvements when used in either the encoder or the decoder. This further demonstrates that relative position encoding provides direction and distance awareness for sequence representation learning. In addition, we find that DRPE used only in the encoder performs better than DRPE used only in the decoder. This phenomenon suggests that direction and distance information are more critical for sign video learning than for sentence representation learning.
|Method||SLR (WER)||SLT (BLEU-4)|
|Enc-SA w/ DRPE||22.89||23.76||22.35||22.47|
|Dec-SA w/ DRPE||23.29||23.84||21.89||21.27|
|Enc-SA & Dec-SA w/ DRPE||22.54||22.89||22.78||22.90|
|Enc-Dec-CA w/ DRPE||23.74||23.93||21.59||21.41|
|All w/ DRPE||22.23||23.01||23.17||23.40|
|Item in DRPE||SLR (WER)||SLT (BLEU-4)|
|+ c2p & p2c||22.57||23.26||22.84||22.79|
As we move to the fourth line in Table IV, the results show that DRPE in encoder-decoder attention also improves the performance. This phenomenon shows that even if the word order in natural language is inconsistent with that of the sign language glosses, the relative position information still benefits their mapping learning.
Different from the DRPE used in DeBERTa, we further explore the effectiveness of the different items mentioned in Equation 12 in our task. The experimental results in Table V show that the correlations between content and position features bring significant improvement. Moreover, the position-to-position item also benefits our model. This result is consistent with the conclusion in Ke et al. Accordingly, we adopt all four items in our disentangled relative position encoding.
|GT:||in der nacht sinken die temperaturen auf vierzehn bis sieben grad .|
|(at night the temperatures drop to fourteen to seven degrees .)|
|SLTR:||heute nacht werte zwischen sieben und sieben grad .|
|(tonight values between seven and seven degrees .)|
|PiSLTRc:||heute nacht kühlt es ab auf vierzehn bis sieben grad .|
|(tonight it’s cooling down to fourteen to seven degrees .)|
|GT:||an der saar heute nacht milde sechzehn an der elbe teilweise nur acht grad .|
|(on the saar tonight a mild sixteen on the elbe sometimes only eight degrees .)|
|SLTR:||südlich der donau morgen nur zwölf am oberrhein bis zu acht grad .|
|(south of the danube tomorrow only twelve on the upper rhine up to eight degrees .)|
|PiSLTRc:||am oberrhein heute nacht bis zwölf am niederrhein nur kühle acht grad .|
|(on the upper rhine tonight until twelve on the lower rhine only a cool eight degrees .)|
|GT:||am tag von schleswig holstein bis nach vorpommern und zunächst auch in brandenburg gebietsweise länger andauernder regen .|
|(during the day from schleswig holstein to vorpommern and initially also in brandenburg longer lasting rain in some areas .)|
|SLTR:||am mittwoch in schleswig holstein nicht viel regen .|
|(not much rain on wednesday in schleswig holstein .)|
|PiSLTRc:||am donnerstag erreicht uns dann morgen den ganzen tag über brandenburg bis zum teil dauerregen .|
|(on thursday we will reach us tomorrow the whole day over brandenburg until partly constant rain.)|
|GT:||im süden gibt es zu beginn der nacht noch wolken die hier und da auch noch ein paar tropfen fallen lassen sonst ist es meist klar oder nur locker bewölkt .|
|(In the south there are still clouds at the beginning of the night that drop a few drops here and there, otherwise it is mostly clear or only slightly cloudy .)|
|SLTR:||im süden tauchen im süden teilweise dichtere wolken auf sonst ist es verbreitet klar .|
|(in the south there are sometimes denser clouds in the south otherwise it is widely clear .)|
|PiSLTRc:||im süden tauchen auch mal dichtere wolken auf sonst ist es gebietsweise klar oder nur locker bewölkt .|
|(In the south, denser clouds sometimes appear, otherwise it is partly clear or only slightly cloudy .)|
|STMC (RGB) ||-||25.0||-||-|
Iv-C4 Qualitative Analysis on SLR
In Figure 4
, we show two examples with different methods on the SLR task. Equipped with proposed approaches, our PiSLTRc model learns accurate sign gesture recognition and thus achieving significant improvements. Furthermore, we find that the model trained based on CTC loss function tends to predict ”peak” on the continuous gestures. And our proposed CPTcn model is adequate to alleviate this situation. As shown in Figure4, the recognition of adjacent frames in a contiguous region is more precise.
IV-C5 Qualitative Analysis on SLT
Comparing the translation results of the first example in Table VI, we see that “vierzehn (fourteen)” is mistranslated as “sieben (seven)” by the SLTR model, whereas it is translated correctly by our PiSLTRc model. In the second example, “heute nacht (tonight)” is mistranslated as “morgen (tomorrow)” by the SLTR model, and it is again translated correctly by our PiSLTRc model. Specific numbers and named entities are challenging because there is no grammatical context to distinguish one from another. In these two examples, however, our model translates them more precisely, which indicates that the proposed model has a stronger ability to understand sign videos.
Turning to the third and fourth examples in Table VI, we see that our model generates complete sentences with less under-translation. For instance, in the third example, “gebietsweise länger andauernder regen (rain lasting longer in some areas)” is under-translated by the SLTR model, while it is correctly translated as “bis zum teil dauerregen (partly constant rain)” by our PiSLTRc model.
In summary, our proposed model performs better than the previous SLTR model on specific numbers and named entities, which are challenging to translate because there is no grammatical context to distinguish one from another. Moreover, the sentences it produces follow standard grammar. Nevertheless, the translation quality for long sentences leaves room for improvement in future work.
We leverage neighboring similar features to enhance the sign representation. However, the selected features lie in a fixed-size region, which is not consistent with the characteristics of sign language: the number of frames corresponding to different sign gestures varies.
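The limitation can be illustrated with a small sketch. It assumes, purely for illustration, that neighbors are chosen by cosine similarity to the current frame inside a window of fixed half-width; the function names, the `window` and `threshold` parameters, and the toy features are hypothetical and do not reproduce the exact gathering procedure of the CPTcn module.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def gather_neighbors(features, t, window=2, threshold=0.9):
    """Indices within [t - window, t + window] whose features resemble frame t."""
    lo = max(0, t - window)
    hi = min(len(features) - 1, t + window)
    return [j for j in range(lo, hi + 1)
            if cosine(features[t], features[j]) >= threshold]


# Toy features: frames 0-4 belong to one gesture, frames 5-6 to another.
features = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 2

print(gather_neighbors(features, t=4))  # similarity stops at the gesture boundary
print(gather_neighbors(features, t=0))  # frames 3 and 4 are out of reach
```

For frame 0, frames 3 and 4 belong to the same toy gesture but fall outside the fixed window, so they are never considered. A gesture longer than the window is therefore only partially covered.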
IV-D Comparison Against Baselines
In this section, we compare our model with several state-of-the-art models to demonstrate the effectiveness of our work. Similar to Camgöz et al., we organize the comparison between our proposed model and the baseline models around three tasks: sign2gloss, sign2text, and sign2(gloss+text).
We evaluate the sign2gloss task on three datasets: PHOENIX-2014-T, PHOENIX-2014, and CSL.
In Table VII, we compare our model with several methods for the sign2gloss task on the PHOENIX-2014-T dataset. DNF adopts an iterative optimization approach to tackle the weakly supervised problem: it first trains an end-to-end recognition model to produce alignment proposals, and then uses these proposals to tune the feature extractor. CNN-LSTM-HMM embeds powerful CNN-LSTM models in multi-stream HMMs and combines them with intermediate synchronization constraints among the streams. Vanilla SLTR-R uses a backbone pretrained with the CNN-LSTM-HMM setup and then employs a two-layered Transformer encoder model. FCN is built upon an end-to-end fully convolutional neural network for cSLR; it further introduces Gloss Feature Enhancement (GFE) to enhance the frame-wise representation, where GFE is trained to provide a set of alignment proposals for the frame feature extractor. STMC (RGB) proposes a spatial-temporal multi-cue network to learn the video-based sequence. For a fair comparison, we only select the RGB-based model of STMC, which does not leverage the additional information of hand, face, and body pose. PiSLTRc-R is our model trained with the weight of the translation loss set to zero. Similar to vanilla SLTR-R, our work extracts features from the CNN-LSTM-HMM backbone. As shown in this table, our proposed PiSLTRc-R surpasses the vanilla SLTR model on both the Dev and Test sets. Furthermore, among the RGB-based models, we achieve state-of-the-art performance on the sign2gloss task.
In Table VIII, we also evaluate our PiSLTRc-R model on the PHOENIX-2014 dataset. Compared with existing baseline models, our proposed model achieves comparable results. Note that no results were reported for the vanilla SLTR-R on the PHOENIX-2014 dataset, so we implement it ourselves. Compared with SLTR-R, our PiSLTRc-R model improves on both the Dev and Test sets.
In Table XI, we conduct experiments on the CSL dataset, where our proposed PiSLTRc-R model achieves state-of-the-art performance. Compared with the SLTR-R model, our PiSLTRc-R model gains improvements on the Test set (5,000 examples split by ourselves).
In Table IX, we compare our approach with several sign2text methods on the PHOENIX-2014-T dataset. The RNN-based model adopts full-frame features from Re-sign. TSPNet utilizes I3D to extract the spatial features and further fine-tunes I3D on two WSLR datasets [30, 23]. Multi-channel allows both the inter- and intra-contextual relationships between different asynchronous channels to be modelled within the Transformer network itself. PiSLTRc-T is our model trained with the weight of the recognition loss set to zero. As in sign2gloss, SLTR-T and our PiSLTRc-T model utilize the pretrained features from CNN-LSTM-HMM. Experimental results show that our proposed model achieves state-of-the-art performance and surpasses the vanilla SLTR-T model in BLEU-4 score.
In Table X, we compare our model on the sign2(gloss+text) task, in which sign language recognition and translation are learned jointly. Namely, the weights of the recognition loss and the translation loss are both set to non-zero values. Note that different weight settings yield different results; we choose the setting by weighing up recognition and translation performance in our experiments. Compared with vanilla SLTR, our model gains significant improvements on both tasks. These experiments demonstrate that our proposed techniques bring significant improvements in recognition and translation quality over the sign language Transformer model.
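The relation between the three task settings can be sketched as follows, assuming the joint objective is a weighted sum of the recognition loss and the translation loss. The function name, the loss values, and the non-zero weights are hypothetical placeholders chosen for illustration; they are not the settings used in our experiments.

```python
def joint_loss(recognition_loss, translation_loss,
               recognition_weight, translation_weight):
    """Weighted sum of the recognition and translation losses (assumed form)."""
    return (recognition_weight * recognition_loss
            + translation_weight * translation_loss)


# Hypothetical loss values for one training batch.
l_rec, l_trans = 2.0, 3.0

# sign2gloss (PiSLTRc-R): the translation weight is zero.
sign2gloss = joint_loss(l_rec, l_trans, 1.0, 0.0)
# sign2text (PiSLTRc-T): the recognition weight is zero.
sign2text = joint_loss(l_rec, l_trans, 0.0, 1.0)
# sign2(gloss+text): both weights are non-zero (values assumed here).
joint = joint_loss(l_rec, l_trans, 1.0, 1.0)

print(sign2gloss, sign2text, joint)
```

Under this assumption, the single-task models are special cases of the joint model, which is why changing the two weights trades recognition quality against translation quality.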
In this paper, we identify two drawbacks of the sign language Transformer (SLTR) model for sign language recognition and translation. The first shortcoming is that self-attention aggregates sign visual features in a frame-wise manner, thus neglecting the temporal semantic structure of sign gestures. To overcome this problem, we propose a CPTcn module that generates neighborhood-enhanced sign features by leveraging the temporal semantic consistency of sign gestures. Specifically, we introduce a novel content-aware neighborhood gathering method to select relevant features dynamically, and then apply position-informed temporal convolution layers to aggregate them.
The second disadvantage is the absolute position encoding used in the vanilla SLTR model, which has been shown to be unable to capture direction and distance information, both of which are critical for sign video understanding and sentence learning. Therefore, we inject relative position information into the SLTR model with the disentangled relative position encoding (DRPE) method. Extensive experiments on two large-scale sign language datasets demonstrate the effectiveness of our PiSLTRc framework.
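As a minimal illustration of what relative position information provides, the sketch below builds the matrix of signed offsets between every query frame and every key frame. The sign encodes direction and the magnitude encodes distance. Clipping the offsets at `max_distance` is an assumption of this sketch, in the spirit of common relative position schemes, and the sketch shows only the offsets themselves, not the DRPE method.

```python
def relative_positions(length, max_distance):
    """Signed offsets j - i for each query i and key j, clipped to +/- max_distance."""
    return [[max(-max_distance, min(max_distance, j - i))
             for j in range(length)]
            for i in range(length)]


offsets = relative_positions(length=4, max_distance=2)
for row in offsets:
    print(row)
```

An absolute encoding assigns each frame only its own index, so the model has to infer both direction and distance indirectly; here they are explicit inputs for each pair of frames.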
-  (2016) Layer normalization. ArXiv abs/1607.06450. Cited by: §III-C2.
-  (2017) SubUNets: end-to-end hand shape and continuous sign language recognition. 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3075–3084. Cited by: §I, §II-A, TABLE VIII.
-  (2018) Neural sign language translation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7784–7793. Cited by: §I, §I, §II-B, §IV-A, §IV-A, §IV-D2, TABLE X, TABLE IX.
-  (2020) Sign language transformers: joint end-to-end sign language recognition and translation. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10020–10030. Cited by: item (b), §I, §I, §II-B, §III-C2, §IV-B1, §IV-C5, §IV-D1, §IV-D, TABLE X, TABLE VII, TABLE IX.
-  (2020) Multi-channel transformers for multi-articulatory sign language translation. In European Conference on Computer Vision (ECCV), pp. 301–319. Cited by: §I, §IV-D2, TABLE X, TABLE IX.
-  (2017) Quo vadis, action recognition? a new model and the kinetics dataset. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4733. Cited by: §IV-D2.
-  (2020) Fully convolutional networks for continuous sign language recognition. In European Conference on Computer Vision (ECCV), pp. 697–714. Cited by: §I, §II-A, §IV-D1, TABLE XI, TABLE VII, TABLE VIII.
-  (2017) Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1610–1618. Cited by: §I, §II-A, TABLE VIII.
-  (2019) A deep neural framework for continuous sign language recognition by iterative training. IEEE Transactions on Multimedia 21, pp. 1880–1891. Cited by: §I, §II-A, §IV-D1, TABLE VII, TABLE VIII.
- Transformer-xl: attentive language models beyond a fixed-length context. In Proceedings of the 57th Conference of the Association for Computational Linguistics (ACL), Cited by: §I, §III-D.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (NAACL-HLT), Cited by: §II-D.
-  (2014) Extensions of the sign language recognition and translation corpus rwth-phoenix-weather. In Proceedings of the Ninth International Conference on Language Resources and Evaluation, (LREC), Cited by: §II-A, §IV-A, §IV-A.
- Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Machine Learning, Proceedings of the Twenty-Third International Conference, (ICML), Cited by: §III-B.
-  (2018) Fully convolutional network for multiscale temporal action proposals. IEEE Transactions on Multimedia 20 (12), pp. 3428–3438. Cited by: §II-C.
- Dense temporal convolution network for sign language translation. International Joint Conferences on Artificial Intelligence Organization (IJCAI), Cited by: TABLE XI.
-  (2018) Hierarchical lstm for sign language translation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: TABLE XI.
-  (2021) DeBERTa: decoding-enhanced bert with disentangled attention. In International Conference on Learning Representations (ICLR), Cited by: §I, §II-D, §III-D, §III-D, §III-D, §III-D, §IV-C3.
-  (2020) Global-local enhancement network for nmfs-aware sign language recognition. ArXiv abs/2008.10428. Cited by: §I, §II-A.
-  (2019) Music transformer: generating music with long-term structure. In International Conference on Learning Representations (ICLR), Cited by: §III-D.
-  (2019) Attention-based 3d-cnns for large-vocabulary sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology 29 (9), pp. 2822–2832. Cited by: §I, §II-A.
-  (2018) Video-based sign language recognition without temporal segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Cited by: §IV-A, TABLE XI.
-  (2020) How much position information do convolutional neural networks encode?. In International Conference on Learning Representations (ICLR), Cited by: §I, §II-C, §III-C2.
-  (2019) MS-asl: a large-scale data set and benchmark for understanding american sign language. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: §IV-D2.
-  (2021) Rethinking positional encoding in language pre-training. In International Conference on Learning Representations (ICLR), Cited by: §I, §IV-C3.
-  (2015) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §IV-B2.
- Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, pp. 2306–2320. Cited by: §I, §II-A, §IV-B1, §IV-D1, TABLE VII, TABLE VIII.
-  (2016) Deep hand: how to train a cnn on 1 million hand images when your data is continuous and weakly labelled. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3793–3802. Cited by: §I, §II-A, TABLE VIII.
-  (2016) Deep sign: hybrid cnn-hmm for continuous sign language recognition. In Proceedings of the British Machine Vision Conference (BMVC), Cited by: TABLE VIII.
-  (2017) Re-sign: re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3416–3424. Cited by: TABLE VIII.
-  (2020) Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1448–1458. Cited by: §IV-D2.
-  (2020) TSPNet: hierarchical feature learning via temporal semantic pyramid for sign language translation. In Annual Conference on Neural Information Processing Systems (NeurIPS), Cited by: §I, §III-C2, §IV-D2, TABLE IX.
-  (2015) Effective approaches to attention-based neural machine translation. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §II-B.
-  (2020) Lipreading using temporal convolutional networks. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6319–6323. Cited by: §II-C.
-  (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Cited by: §IV-A.
-  (2018) The syntax of sign language agreement: common ingredients, but unusual recipe. Glossa 3, pp. 1–46. Cited by: §I.
-  (2020) Boosting continuous sign language recognition via cross modality augmentation. Proceedings of the 28th ACM International Conference on Multimedia. Cited by: §I, §II-A.
-  (2018) Dilated convolutional network with iterative optimization for continuous sign language recognition. In International Joint Conferences on Artificial Intelligence Organization (IJCAI), pp. 885–891. Cited by: §I, §II-A.
-  (2019) Iterative alignment network for continuous sign language recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4160–4169. Cited by: §I, §II-A.
- Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: §I, §III-C2.
-  (2018) Self-attention with relative position representations. In 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, (NAACL-HLT), pp. 464–468. Cited by: item (b), §I, §II-D, §III-D.
-  (2017) Attention is all you need. In Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 5998–6008. Cited by: §I, §II-B, §II-D, §III-B.
-  (2015) Sequence to sequence – video to text. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 4534–4542. Cited by: TABLE XI.
-  (2019) A novel sign language recognition framework using hierarchical grassmann covariance matrix. IEEE Transactions on Multimedia 21, pp. 2806–2814. Cited by: §I, §II-A.
-  (2018) Connectionist temporal fusion for sign language translation. Proceedings of the 26th ACM international conference on Multimedia (ACM MM). Cited by: TABLE XI.
- Semantic boundary detection with reinforcement learning for continuous sign language recognition. IEEE Transactions on Circuits and Systems for Video Technology 31, pp. 1138–1149. Cited by: TABLE VIII.
-  (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. ArXiv abs/1609.08144. Cited by: §IV-B3.
-  (2020) Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Transactions on Multimedia 22 (3), pp. 626–640. Cited by: §II-C.
- TENER: adapting transformer encoder for named entity recognition. ArXiv abs/1911.04474. Cited by: item (b), §II-D, §III-C2, §IV-C2.
-  (2019) XLNet: generalized autoregressive pretraining for language understanding. In Annual Conference on Neural Information Processing Systems (NeurIPS), Cited by: §I, §II-D, §III-D.
-  (2019) SF-net: structured feature network for continuous sign language recognition. ArXiv abs/1908.01341. Cited by: TABLE XI.
-  (2018) Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multimedia 20 (6), pp. 1576–1590. Cited by: §II-C.
-  (2021) Spatial-temporal multi-cue network for sign language recognition and translation. IEEE Transactions on Multimedia, pp. 1–1. Cited by: TABLE X.
-  (2020) Spatial-temporal multi-cue network for continuous sign language recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vol. 34, pp. 13009–13016. Cited by: §I, §II-A, §IV-D1, TABLE VII, TABLE VIII.