ASL Recognition with Metric-Learning based Lightweight Network

04/10/2020 ∙ by Evgeny Izutov, et al. ∙ Intel

Over the past decades, the set of human tasks solved by machines has expanded dramatically. From simple image classification problems, researchers have moved toward more sophisticated and vital problems, such as autonomous driving and language translation. The case of language translation includes the challenging area of sign language translation, which incorporates both image and language processing. We make a step in that direction by proposing a lightweight network for ASL gesture recognition with performance sufficient for practical applications. The proposed solution demonstrates impressive robustness on the MS-ASL dataset and in live mode for the continuous sign gesture recognition scenario. Additionally, we describe how to combine action recognition model training with metric learning to train the network on a database of limited size. The training code is available as part of Intel OpenVINO Training Extensions.


I Introduction

Humanity has put artificial intelligence into service for a wide range of applied tasks. Nonetheless, for a number of problems we are still trying to approach human-level performance. One such challenge is sign language translation, which can help to overcome the communication barrier between large groups of people.

There are millions of people around the world who use one of several dozen sign languages (e.g. ASL in the United States and most of Anglophone Canada, RSL in Russia and neighboring countries, CSL in China, etc.). A sign language is a natural language that uses the visual-manual modality to convey meaning through manual articulations. It goes without saying that a sign language differs from the spoken language of the same country in grammar and lexicon - it is not just a literal translation of the individual words in a sentence. In addition, the sign language of a certain country can have different dialects in different regions. The latter aspect significantly complicates the sign language recognition problem due to the need for a large and diverse database.

To tackle this challenge, researchers have tried to use methods from the adjacent action recognition area, like 3D convolution networks [40], two-stream networks with an additional depth or flow stream [37], and skeleton-based action recognition [13]. Unfortunately, the aforementioned approaches don't work well on the small datasets we are dealing with in the sign language recognition space. Another issue is inference speed: the network needs to run in real time to be useful in live usage scenarios.

To solve the listed problems we propose several architectural choices (namely, applying the 2D depth-wise framework to the 3D case) and a training procedure that combines a metric-learning paradigm with continuous-stream action recognition. Summarizing all of the above, our contributions are as follows:

  • Extending the family of efficient 3D networks for processing continuous video stream by merging S3D framework [48] with lightweight edge-oriented MobileNet-V3 [14] backbone architecture.

  • Introducing residual spatio-temporal attention module with auxiliary loss to control the sharpness of the mask by using Gumbel sigmoid [17].

  • Using metric-learning techniques to deal with the limited size of ASL datasets and reach robustness.

Finally, the model trained on the MS-ASL dataset [19] is prepared for inference with the Intel OpenVINO™ toolkit (https://software.intel.com/en-us/openvino-toolkit) and is available as a part of the Intel OpenVINO™ OMZ (https://github.com/opencv/open_model_zoo), where you can find sample code showing how to run the model in demo mode. In addition, we release the training framework (https://github.com/opencv/openvino_training_extensions) that can be used to re-train or fine-tune our model on a custom database.

II Related Work

Action Recognition

Recent developments in deep learning have helped to make a step from well-studied image-level problems (e.g. classification, detection, segmentation) to video-level problems (forecasting, action recognition, temporal segmentation). The domain difference comes from the extra temporal dimension. The first solutions incorporated motion information directly by processing motion fields in a two-stream network [37]. Another approach was based on the simple idea of extending common 2D convolutions to the 3D case: C3D [40], I3D [4]. The main disadvantage of the aforementioned methods was the inability to train deep 3D networks from scratch because of over-fitting on target datasets (note that collecting a dataset close to ImageNet [32] in size and impact for video-oriented problems is still challenging). The next steps focused on reducing the number of parameters, and thereby over-fitting, by using separable 3D convolutions (the P3D [31] and R(2+1)D [41] networks) and by investigating how to mix 2D and 3D building blocks inside a backbone in an optimal way [48]. We follow the same concept as S3D [48] but use depth-wise convolutions as in the original 2D MobileNet-V3 architecture [14].

Other research directions are based on the ideas of using appearance from appropriate (key) frames rather than any kind of motion information [45], mixing motion information at the feature level by shifting channels [22], or incorporating relational reasoning over frames in videos [55]. Unfortunately, as shown in [19], the appearance- and late-fusion-based [56] methods are not able to recognize quick gestures like sign language due to insufficient information at the single-frame level. Most hand gestures are, essentially, a quick movement of fingers, and it is impossible to recognize them by inspecting any single image from the video sequence - the sequence should be considered as a whole.

ASL Recognition

Following the success of CNNs in action recognition, the first sign language recognition approaches tried to reuse 3D convolutions [29], or to fuse frame-level [36] or skeleton-based [8], [6], [13] features with recurrent networks [35] or graph convolution networks [47]. Other approaches use multi-stream and multi-modal architectures to capture the motion of each hand and the head independently [50], or mix depth and flow streams [18].

The aforementioned methods address the sign-level recognition problem rather than sentence translation. To solve the translation problem, another kind of language model is trained: [30], [8].

Unfortunately, most of these methods were developed on small dictionaries and do not allow working in real sign language translation systems. In this paper we focus on building a sign-level rather than a sentence-level recognition model, but with the ability to learn a number of signs sufficient for communication. In contrast to [19], we develop the model for continuous-stream sign language recognition (instead of clip-level recognition).

The main obstacle to building a gesture recognition (all the more so, translation) system is the limited number of public datasets. The available datasets are recorded with a small number of signers and gestures, so the list of datasets appropriate for training deep networks is mostly limited to RWTH-PHOENIX-Weather [9] and MS-ASL [19].

An insufficient amount of data causes over-fitting and limits model robustness to changes in background, viewpoint and signer dialect. To overcome the limitations of available databases, we reuse the best practices from the metric-learning area [39].

Sp. size | Temp. size | Operator | Sp. kernel | Temp. kernel | Exp size | Num out | SE | NL | Sp. stride | Temp. stride | Dropout
224 | 16 | conv3d | 3 | 1 | - | 16 | - | HS | 2 | 1 | -
112 | 16 | bneck | 3 | 5 | 16 | 16 | - | RE | 1 | 1 | ✓
112 | 16 | bneck | 3 | 3 | 64 | 24 | - | RE | 2 | 2 | ✓
56 | 8 | bneck | 3 | 3 | 72 | 24 | - | RE | 1 | 1 | ✓
56 | 8 | bneck | 5 | 3 | 72 | 40 | ✓ | RE | 2 | 1 | ✓
28 | 8 | bneck | 5 | 3 | 120 | 40 | ✓ | RE | 1 | 1 | ✓
28 | 8 | bneck | 5 | 5 | 120 | 40 | ✓ | RE | 1 | 1 | ✓
28 | 8 | bneck | 3 | 5 | 240 | 80 | - | HS | 2 | 1 | ✓
14 | 8 | bneck | 3 | 3 | 200 | 80 | - | HS | 1 | 1 | ✓
14 | 8 | bneck | 3 | 3 | 184 | 80 | - | HS | 1 | 1 | ✓
14 | 8 | bneck | 3 | 5 | 184 | 80 | - | HS | 1 | 1 | ✓
14 | 8 | attention | 3 | 3 | - | 80 | - | - | 1 | 1 | -
14 | 8 | bneck | 3 | 3 | 480 | 112 | ✓ | HS | 1 | 2 | ✓
14 | 4 | bneck | 3 | 3 | 672 | 112 | ✓ | HS | 1 | 1 | ✓
14 | 4 | bneck | 5 | 3 | 672 | 160 | ✓ | HS | 2 | 1 | ✓
7 | 4 | attention | 3 | 3 | - | 160 | - | - | 1 | 1 | -
7 | 4 | bneck | 5 | 3 | 960 | 160 | ✓ | HS | 1 | 1 | ✓
7 | 4 | bneck | 5 | 3 | 960 | 160 | ✓ | HS | 1 | 1 | ✓
7 | 4 | conv3d | 1 | 1 | - | 960 | - | HS | 1 | 1 | -
TABLE I: Specification of the S3D MobileNet-V3-Large backbone with residual spatio-temporal attention modules.

III Methodology

Our goal is to predict one of the hand gestures for each frame of a continuous input stream. To do that, we process a fixed-size sliding window of input frames. Experimentally, we have chosen to set the number of input frames to 16 at a constant frame rate of 15. This captures roughly 1 second of live video and covers the duration of the majority of ASL gestures (according to the statistics of the MS-ASL dataset). The extracted sequence of frames is cropped according to the maximal (the maximum is taken over all frames in the sequence) bounding box of the person's face and both hands (only raised hands are taken into account). Finally, the cropped sequence is resized to 224×224, producing a network input of shape 3×16×224×224.

Unlike other solutions, we don't split the network input into independent streams for the head and both hands [18]. Instead, we use a single RGB stream of a cropped region that includes the face and both hands of the signer, to provide real-time performance.
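For illustration, the input preparation described above could be sketched as follows; the per-frame ROI format and the detection step producing it are assumptions made for the example (the released demo uses its own person detector and tracker):

```python
import cv2
import numpy as np

def prepare_clip(frames, rois, out_size=224):
    """Crop a 16-frame window around the union of per-frame ROIs
    (face + raised hands) and resize it to the network input resolution.

    frames: list of HxWx3 uint8 images (e.g. 16 frames sampled at 15 FPS)
    rois:   list of (x1, y1, x2, y2) boxes, one per frame -- a hypothetical
            output of the detection/tracking step, not part of the model
    """
    # Maximal bounding box over the whole window, as described above.
    x1 = int(min(r[0] for r in rois))
    y1 = int(min(r[1] for r in rois))
    x2 = int(max(r[2] for r in rois))
    y2 = int(max(r[3] for r in rois))

    clip = [cv2.resize(f[y1:y2, x1:x2], (out_size, out_size)) for f in frames]

    # (T, H, W, C) -> (C, T, H, W): the layout commonly used by 3D CNNs in PyTorch.
    clip = np.stack(clip).astype(np.float32)
    return np.ascontiguousarray(clip.transpose(3, 0, 1, 2))
```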

You can find our demo application in the Intel OpenVINO™ OMZ (https://github.com/opencv/open_model_zoo). It employs a person detector, a tracker module and the ASL recognition network itself, along with all the necessary processing.

III-A Backbone Design

Instead of designing a custom lightweight backbone adapted for inference on a video stream, we reuse a 2D backbone developed for efficient computing at the edge. The logic behind this is the assumption that a network efficient for 2D image processing will be a solid starting point after being extended with an additional temporal dimension, due to the high correlation between neighboring frames. We have selected MobileNet-V3 [14] as the base architecture.

To extend the 2D backbone to the 3D case, we follow the practices from the S3D network [48]: spatially and temporally separable 3D convolutions and a top-heavy network design. According to the latter paradigm, we remove temporal kernels from the very first convolution of the 3D backbone.

The default MobileNet-V3 bottleneck consists of three consecutive convolutions: 1×1, depth-wise k×k, and 1×1. To convert it into a 3D bottleneck following the concept of separable convolutions, the last convolution is replaced with a k_t×1×1 one, where k_t is the temporal kernel size. In SE-blocks, we carry out average pooling independently for each temporal position, so the shape of the attention mask is C×T×1×1, where T is the temporal feature size. Following the original MobileNet-V3 architecture, we use temporal kernels of sizes 3 and 5, but at positions contrasting with the spatial ones.

Unlike spatial kernels, we don't use convolutions with a stride greater than one for temporal kernels. To reduce the temporal size of a feature map, a temporal average pooling operator with the appropriate kernel and stride sizes is used. Note that the positions of the temporal pooling operations differ from the spatial ones.
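A minimal PyTorch sketch of such a converted bottleneck is shown below; the exact layer ordering, the omission of the SE block and the fixed H-Swish activation (earlier blocks use ReLU, see Table I) are simplifications of our reading, not the released implementation:

```python
import torch
import torch.nn as nn

class SeparableBottleneck3D(nn.Module):
    """Sketch of an S3D-style MobileNet-V3 bottleneck: 1x1x1 expansion,
    spatial depth-wise convolution, and a temporal k_t x 1 x 1 projection."""

    def __init__(self, cin, cexp, cout, k_sp=3, k_tm=3, sp_stride=1, tm_pool=1):
        super().__init__()
        self.pw_in = nn.Sequential(              # 1x1x1 expansion
            nn.Conv3d(cin, cexp, 1, bias=False),
            nn.BatchNorm3d(cexp), nn.Hardswish())
        self.dw_spatial = nn.Sequential(         # depth-wise 1 x k x k (spatial only)
            nn.Conv3d(cexp, cexp, (1, k_sp, k_sp),
                      stride=(1, sp_stride, sp_stride),
                      padding=(0, k_sp // 2, k_sp // 2),
                      groups=cexp, bias=False),
            nn.BatchNorm3d(cexp), nn.Hardswish())
        self.pw_temporal = nn.Sequential(        # projection fused with the k_t x 1 x 1 temporal kernel
            nn.Conv3d(cexp, cout, (k_tm, 1, 1),
                      padding=(k_tm // 2, 0, 0), bias=False),
            nn.BatchNorm3d(cout))
        # Temporal size is reduced only by average pooling, never by strided temporal convolutions.
        self.tm_pool = (nn.AvgPool3d((tm_pool, 1, 1), stride=(tm_pool, 1, 1))
                        if tm_pool > 1 else nn.Identity())
        self.use_res = cin == cout and sp_stride == 1

    def forward(self, x):
        y = self.pw_temporal(self.dw_spatial(self.pw_in(x)))
        if self.use_res:
            y = y + x
        return self.tm_pool(y)


# Example roughly matching the fifth bneck row of Table I (56x8 -> 28x8, expansion 72, output 40).
block = SeparableBottleneck3D(24, 72, 40, k_sp=5, k_tm=3, sp_stride=2)
out = block(torch.randn(1, 24, 8, 56, 56))   # -> (1, 40, 8, 28, 28)
```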

One more change to the original MobileNet-V3 architecture is the addition of two residual spatio-temporal attention modules after bottlenecks 9 and 12. See Table I for more details about the S3D MobileNet-V3 backbone (the original table from the MobileNet-V3 paper is supplemented with temporal dimension-related columns).

III-B Spatio-Temporal Attention

Fig. 1: Block-scheme of residual spatio-temporal attention module. ”Spatial Pool” block carries out global average pooling over spatial dimensions.

Inspired by [43] and [7], we reuse the paradigm of residual attention due to the possibility of inserting it inside a pre-trained network for training on a target task. One more advantage is the ideology of consecutive filtering of appearance-irrelevant spatial regions and motion-poor temporal segments. Unlike the above solutions, we are interested not only in the unsupervised behavior of the extra blocks but also in feature-level self-supervised learning (originally the term refers to the unsupervised pre-training problem, as in [12] and [20]).

To efficiently incorporate the attention module in a 3D framework, the original single-stream block design is replaced with a two-stream design with independent temporal and spatial branches. Each branch uses separable 3D convolutions like in the bottleneck proposed above: consecutive depth-wise and 1×1×1 convolutions with BN [15] and an intermediate H-Swish activation function [14] for the spatial stream (the only difference for the temporal stream is in the first convolution, which is a temporal depth-wise one). Then both streams are added up and normalized by a sigmoid function during the inference stage (during the training stage the mask is sampled - see the next section). For more details see Figure 1.
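As a rough illustration, a minimal PyTorch sketch of such a block might look as follows; the channel widths, the exact residual form x·(1 + mask), and returning the mask for the auxiliary loss below are assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

def gumbel_sigmoid(logits, tau=1.0):
    """Sample a relaxed binary mask from sigmoid logits (Gumbel-sigmoid trick)."""
    u = torch.rand_like(logits).clamp(1e-6, 1.0 - 1e-6)
    g = torch.log(u) - torch.log1p(-u)            # logistic noise
    return torch.sigmoid((logits + g) / tau)

class ResidualSTAttention(nn.Module):
    """Sketch of the residual spatio-temporal attention block: independent
    spatial and temporal branches built from separable 3D convolutions,
    summed and squashed to a soft mask (sampled during training)."""

    def __init__(self, c, k=3):
        super().__init__()
        self.spatial = nn.Sequential(             # depth-wise 1 x k x k, BN, H-Swish, 1x1x1
            nn.Conv3d(c, c, (1, k, k), padding=(0, k // 2, k // 2), groups=c, bias=False),
            nn.BatchNorm3d(c), nn.Hardswish(),
            nn.Conv3d(c, 1, 1))
        self.temporal = nn.Sequential(            # depth-wise k x 1 x 1, BN, H-Swish, 1x1x1
            nn.Conv3d(c, c, (k, 1, 1), padding=(k // 2, 0, 0), groups=c, bias=False),
            nn.BatchNorm3d(c), nn.Hardswish(),
            nn.Conv3d(c, 1, 1))

    def forward(self, x):
        # Temporal branch operates on spatially pooled features ("Spatial Pool" in Fig. 1).
        pooled = x.mean(dim=(3, 4), keepdim=True)           # (N, C, T, 1, 1)
        logits = self.spatial(x) + self.temporal(pooled)    # broadcast over H, W
        mask = gumbel_sigmoid(logits) if self.training else torch.sigmoid(logits)
        return x * (1.0 + mask), mask                       # residual attention + mask for the TV loss
```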

The main drawback of using an attention module in an unsupervised manner is the weak discriminative ability of the learnt features (take a look at Figure 2, where the attention masks in the second row are too noisy to extract robust features). As a result, even attention-augmented networks cannot fix an incorrect prediction, and no significant benefit from using attention mechanisms can be observed. In our opinion, this is because no extra information is provided during training to force the network to fix the prediction by focusing on the most relevant spatio-temporal regions, rather than softly tuning all model parameters (a kind of "Divide and Conquer" principle).

We propose to encourage spatio-temporal homogeneity by using the total variation (TV) loss [25] over the spatio-temporal confidences. In addition, to force the attention mask to be sharp, the TV-loss is modified to work with hard targets (0 and 1 values):

L_{tv} = \frac{1}{|\Omega|} \sum_{(s,t) \in \Omega} \sum_{(s',t') \in N(s,t)} \left| c_{s,t} - \mathbb{1}\left[ c_{s',t'} > 0.5 \right] \right|,   (1)

where c_{s,t} is the confidence score at spatial position s and temporal position t of the spatio-temporal confidence map \Omega of shape H×W×T, N(s,t) is the set of neighboring spatio-temporal positions of element (s,t), and \mathbb{1}[\cdot] is an indicator function. Note that we use the TV-loss over spatio-temporal confidences rather than logits.
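A minimal sketch of how such a hard TV penalty could be computed on the attention confidences (shape (N, 1, T, H, W)) is given below; the 6-connected neighbourhood and the normalization are our reading of Eq. (1), and the authors' exact implementation may differ:

```python
import torch

def hard_tv_loss(conf):
    """Pull every confidence towards the binarized (0/1) values of its
    neighbours along the temporal and spatial axes of a (N, 1, T, H, W) map."""
    hard = (conf > 0.5).float().detach()   # hard targets, no gradient through binarization
    loss = conf.new_zeros(())
    for dim in (2, 3, 4):                  # temporal, vertical, horizontal neighbours
        n = conf.size(dim) - 1
        # forward direction: conf[i] vs hard[i + 1]
        loss = loss + (conf.narrow(dim, 0, n) - hard.narrow(dim, 1, n)).abs().mean()
        # backward direction: conf[i + 1] vs hard[i]
        loss = loss + (conf.narrow(dim, 1, n) - hard.narrow(dim, 0, n)).abs().mean()
    return loss / 6.0
```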

Another drawback of attention modules is the tendency to get stuck in local minima (e.g. a network can learn to mask only a central image region, regardless of the input features). To overcome this problem, we propose to learn the distribution of masks, sample one during training (the idea is related to energy-based learning, as in [42]) and use the expected value during inference. For this purpose, we reuse the Gumbel-Softmax trick [17], but for the sigmoid function [33]. As Figure 2 shows, the proposed method allows us to train a much sharper and more robust attention mask.

Fig. 2: Examples of spatio-temporal attention masks. Rows from top to bottom: the original sequence of RGB frames, the corresponding attention maps without the auxiliary loss, and the attention maps after training with the proposed self-supervised loss.

III-C Metric-Learning Approach

The default approach to training an action recognition network is to use the cross-entropy classification loss. This method works fine for large datasets, and there is no reason to change it. Unfortunately, if we are limited in available data or the data is significantly imbalanced, then more sophisticated losses are needed.

We are inspired by the success of the metric-learning approach for training networks on limited-size datasets to solve the person re-identification problem. So, we follow the practice of using the AM-Softmax loss [44] (originally designed for the face verification problem, it has become standard for several adjacent tasks) with some auxiliary losses to shape the manifold according to the view of an ideal geometrical structure of such a space. Similar to [53], we replace the constant scale for logits with a straightforward schedule: a gradual descent from 30 to 5 during 40 epochs.

Additionally, to prevent over-fitting on the simplest samples, we follow the technique proposed in [1] and regularize the cross-entropy loss by adding a max-entropy term:

L = L_{am} - \lambda H(\hat{p}),   (2)

where \hat{p} is the predicted distribution, H(\cdot) is the entropy of the input distribution and \lambda weights the regularizer.
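For illustration, a sketch combining AM-Softmax with the scheduled scale and the entropy regularizer might look as follows; the margin value, the schedule shape and the weight of the entropy term are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Sketch of AM-Softmax with a scheduled logit scale and a max-entropy regularizer."""

    def __init__(self, num_classes, embed_dim=256, margin=0.35, lam=0.1):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, embed_dim))  # class centers
        self.margin, self.lam = margin, lam

    def scale(self, epoch, start=30.0, end=5.0, total=40):
        # Gradual descent of the logit scale from 30 to 5 over 40 epochs.
        t = min(epoch, total) / total
        return start + (end - start) * t

    def forward(self, embeddings, labels, epoch):
        emb = F.normalize(embeddings, dim=1)
        centers = F.normalize(self.centers, dim=1)
        cos = emb @ centers.t()                                   # cosine similarities
        m = F.one_hot(labels, cos.size(1)).float() * self.margin  # margin on the target class
        logits = self.scale(epoch) * (cos - m)
        ce = F.cross_entropy(logits, labels)
        # Max-entropy term (Eq. 2): discourage over-confident predictions.
        p = F.softmax(logits, dim=1)
        entropy = -(p * p.clamp_min(1e-9).log()).sum(dim=1).mean()
        return ce - self.lam * entropy
```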

As mentioned in [16], the AM-Softmax loss forms the global structure of the manifold, but the decision boundary between particular classes is also defined by local interactions between neighboring samples. So, the PushPlus loss between samples of different classes in a batch is used, too. In a similar manner, a push loss is introduced between the centers of classes to prevent the collapse of close clusters. The final loss is the sum of all the losses mentioned above.

Further, the metric-learning approach allows us to train networks that are close to large networks in terms of quality, but are much lighter and thereby faster [16].

III-D Network Architecture

The sign gesture recognition network consists of the S3D MobileNet-V3 backbone, a reduction spatio-temporal module and a metric-learning based classification head. The backbone outputs a feature map of size 960×4×7×7 (the number of channels is unchanged from MobileNet-V3 and equals 960), thereby reducing the input by 32 times in the spatial dimensions and by 4 times in the temporal one. Then, the spatio-temporal module reduces the final feature map by applying global average pooling. Lastly, the obtained vector is convolved with a 1×1×1 kernel to align the number of channels with the target embedding size of 256 (we have experimented with different embedding sizes, but the best trade-off between speed and accuracy is obtained with that value). The output embedding vector is L2-normalized. Also, we use a BN layer before the normalization stage, like in [49] and [23].
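A minimal sketch of such a head, assuming the 960×4×7×7 backbone output discussed above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingHead(nn.Module):
    """Sketch of the head: global average pooling over (T, H, W), a 1x1x1
    convolution down to a 256-D embedding, BN before normalization, then L2 norm."""

    def __init__(self, cin=960, embed_dim=256):
        super().__init__()
        self.reduce = nn.Conv3d(cin, embed_dim, 1, bias=False)
        self.bn = nn.BatchNorm1d(embed_dim)

    def forward(self, feat):                        # feat: (N, 960, 4, 7, 7)
        v = feat.mean(dim=(2, 3, 4), keepdim=True)  # global spatio-temporal pooling
        v = self.reduce(v).flatten(1)               # (N, 256)
        v = self.bn(v)
        return F.normalize(v, dim=1)                # unit-length embedding


head = EmbeddingHead()
emb = head(torch.randn(2, 960, 4, 7, 7))            # -> (2, 256), L2-normalized
```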

III-E Model Training

To improve model robustness against appearance clutter and motion shift, a number of image- and video-level augmentation techniques are used: brightness, contrast, saturation and hue image augmentations, plus random crop erasing [54] and mixup [51] (with a random image from ImageNet [32] blended into the gesture clip without mixing the labels). All the mentioned augmentations are sampled once per clip and applied identically to each frame in the clip. Additionally, to force the model to make a guess about the action from a partially presented sign gesture sequence, we use temporal jitter of the action's temporal limits. During training, we set the minimal intersection between the ground-truth and augmented temporal limits to 0.6.
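Two of these video-level augmentations could be sketched as follows; the mixup blending-weight range and the (T, H, W, C) clip layout are assumptions, while the 0.6 overlap threshold comes from the text:

```python
import random
import numpy as np

def mixup_clip(clip, distractor, alpha_range=(0.1, 0.3)):
    """Blend every frame of a (T, H, W, C) gesture clip with the same random
    still image (e.g. from ImageNet); the gesture label is kept unchanged."""
    alpha = random.uniform(*alpha_range)
    return (1.0 - alpha) * clip.astype(np.float32) + alpha * distractor.astype(np.float32)

def jitter_window(gt_start, gt_end, min_overlap=0.6):
    """Shift the sampled window so that its intersection with the annotated
    segment stays at least min_overlap of the segment length."""
    length = gt_end - gt_start
    max_shift = int(length * (1.0 - min_overlap))
    shift = random.randint(-max_shift, max_shift)
    return gt_start + shift, gt_end + shift
```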

Additionally, to prevent over-fitting, we augment training at the network level by adding a continuous dropout [34] layer inside each bottleneck (instead of a single one on top of the network), as it was originally proposed in [27].
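A possible sketch of such a continuous dropout layer, assuming a Gaussian multiplicative-noise parameterization with unit mean (the exact distribution used by [34] may be configured differently):

```python
import torch
import torch.nn as nn

class ContinuousDropout(nn.Module):
    """Multiply activations by continuous noise instead of Bernoulli masks."""

    def __init__(self, sigma=0.2):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if not self.training or self.sigma == 0:
            return x
        noise = torch.randn_like(x) * self.sigma + 1.0   # mean 1 keeps the expectation unchanged
        return x * noise
```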

The network training procedure cannot converge when started from scratch, so we use a two-stage pre-training scheme: at the first stage, the 2D MobileNet-V3 backbone is trained on the ImageNet [32] dataset. Then the S3D MobileNet-V3 network, equipped with residual spatio-temporal attention modules and metric-learning losses, is trained on the Kinetics-700 [3] dataset. The only change before starting the main training stage is replacing the centers of classes (the weight matrix by which an embedding vector is multiplied) with randomly initialized ones according to the configuration of the MS-ASL dataset with 1000 classes (unlike the mentioned paper, we didn't see any benefit from training directly on 100 classes due to fast over-fitting).

The final network has been trained on two GPUs with 14 clips per GPU, using the SGD optimizer with weight decay regularization in the PyTorch framework [28]. The learning rate is dropped once after the 25th epoch. Additionally, warm-up [10] from a reduced learning rate is used during the first 5 epochs, and PR-Product [46] (for the last inner product only) is used to enable parameter tuning around the convergence point. Note that, due to significant over-fitting, we use early stopping after approximately 30 epochs.

IV Experiments

IV-A Data

Sign language databases, and American Sign Language (ASL) ones in particular, are hard to collect due to the need for capable signers. The first attempt to build a large-scale database was made by [2], who published the ASLLVD database. We have experimented with this dataset, but the resulting model suffers from significant domain shift and cannot be run on a video with an arbitrary signer or a cluttered background, even though it achieves nearly maximal quality on the train-val split. This is because the database has been collected with a limited number of signers (fewer than ten) and a constant background, so for appearance-based solutions this database is not very useful.

A major leap was made when the MS-ASL [19] dataset was published. It includes more than 25000 clips from 222 signers and covers the 1000 most frequently used ASL gestures. Additionally, the dataset has a predefined split into train, val and test subsets. To utilize the maximal number of the scarce sign gesture samples, we train the network on the full 1000-class train subset, but our goal is high metrics on the 100-class subset. Unlike the previously mentioned paper, we didn't see a benefit from using the 100-class subset directly for training. Moreover, we observed significant over-fitting even for our much smaller network in comparison with the I3D baseline from that paper. Nonetheless, we use the MS-ASL dataset to train and validate the proposed ASL recognition model.

Note that the original paper proposes to test models (and provides baselines) on the MS-ASL dataset under the clip-level setup, which implies knowledge of the start and end times of the sign gesture sequence. In this paper, we are focused on developing a continuous-stream action recognition model which should work on an unaligned (unknown start and end) sequence of sign gestures. So, the baselines from the original paper and from the current paper are not directly comparable due to the more complicated scenario that we consider (we hope future models will be compared under the more suitable continuous recognition scenario).

IV-B Test Protocol

To better model the scenario of action recognition on a continuous video stream, we follow the testing protocol below. From each annotated sign gesture sequence we select the central temporal segment with length equal to the network input (if the sequence is shorter than the network input, the first frame is duplicated as many times as required). After that, the sequence of frames is cropped according to the mean bounding box of the person (it includes the head and both hands of the signer). The predicted score on this sequence is considered the prediction for the input sample (no over-sampling or other test-time techniques for metric boosting are used).
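A sketch of the evaluation window selection described above (padding by duplicating the first frame and taking the central segment):

```python
def select_eval_window(frames, num_frames=16):
    """Take the central num_frames-long segment of an annotated gesture;
    if the gesture is shorter, pad at the front by duplicating the first frame."""
    frames = list(frames)
    if len(frames) < num_frames:
        frames = [frames[0]] * (num_frames - len(frames)) + frames
    start = (len(frames) - num_frames) // 2
    return frames[start:start + num_frames]
```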

We measure mean top-1 accuracy and mAP metrics. Unlike the original MS-ASL paper, we don't use the top-5 metric to smooth out annotation noise in the dataset (incorrect labels, mismatched temporal limits), due to the weak correlation between model robustness and a high value of this metric (our experiments showed that a model with a high top-5 metric can demonstrate low robustness in the live-mode scenario).

Note that, as mentioned in the original paper [19], the data includes significant annotation noise. Likewise, we observed many mismatches in annotated sign gestures, so it is expected that the real model performance is higher than the metric values suggest; this was confirmed indirectly by the impressive model accuracy in live mode.

IV-C Ablation Study

Here we present the ablation study (see Table II). The baseline model is trained in the continuous scenario with the default AM-Softmax loss and the scheduled scale for logits. As the table shows, the first solution scores much lower than the best one due to weakly learnt features, even though it uses the metric-learning approach from the very beginning.

Method top-1 mAP
AM-Softmax 73.93 76.01
+ temporal jitter 76.49 78.63
+ dropout in each block 77.11 79.45
+ continuous dropout 77.35 79.85
+ extra ml losses 80.75 82.34
+ pr-product 80.77 82.67
+ mixup 82.61 86.40
+ spatio-temporal attention 83.20 87.54
+ hard TV-loss 85.00 87.79
TABLE II: Ablation study on MS-ASL test-100 dataset

The first thing that should be fixed is the weak annotation, which mostly consists of incorrect temporal segmentation of gestures. To fix it, we loosen the condition for matching the ground-truth temporal segment to a network input. The proposed change improves both metrics by a decent margin.

Then, the issue of an insufficiently large and diverse dataset should be handled. To do that, we follow the practice of using dropout regularization inside each bottleneck. At the expense of a reduction in model capacity, the model enhances collective decision making [38] by suppressing a kind of "grandmother cell" [11]. One more small step is to replace the default Bernoulli distribution with a continuous Gaussian distribution, like in [34]. After this, the model improves both metrics, but still suffers from the domain shift problem [24].

To overcome the above issue, we proposed to go deeper into metric-learning solutions by introducing local structure losses [16]. As you can see, this allows us to score higher than 80 percent on both metrics. Additionally, PR-Product is used to enable learning near zero-gradient regions. Note that in our experiments the usage of PR-Product was justified only together with the extra metric-learning losses.

Another improvement comes from increasing the variety of appearance by mixing video clips with random images (see the description of the implemented mixup-like augmentation in III-E). The size of the accuracy increase indicates the importance of appearance diversity for neural network training.

The last leap is provided by using the residual spatio-temporal attention module with the proposed self-supervised loss.

IV-D Results

As mentioned earlier, we cannot compare the proposed solution with the previous one on the MS-ASL dataset because we have changed the testing protocol from the clip-level to the continuous-stream paradigm. The final metrics on the MS-ASL dataset (test split) are presented in Table III. Note that, as mentioned in the Data section, the quality of the provided annotation doesn't allow us to measure the real power of the trained network even after manual filtering of the data (we carried out simple filtering to exclude empty or incorrectly cut gesture sequences).

MS ASL split top-1 mAP
100 signs 85.00 87.79
200 signs 79.66 83.06
500 signs 63.36 70.01
1000 signs 45.65 55.58
TABLE III: Results of continuous ASL recognition model on MS-ASL dataset

The final model takes 16 frames of size 224×224 as input at a constant 15 frame-rate and outputs an embedding vector of 256 floats. As far as we know, the proposed solution is the fastest ASL recognition model (according to our measurements on an Intel CPU) with competitive metric values on the MS-ASL dataset. The model has only 4.13 MParams and 6.65 GFlops. For more details, see Table IV.

Spatial size | Temporal size | Embd size | MParams | GFlops
224×224 | 16 | 256 | 4.13 | 6.65
TABLE IV: Continuous ASL recognition model specification

IV-E Discussion

Presently, graph-based approaches [26], [5], [21] are gaining popularity for action recognition tasks. The aforementioned methods rely on modeling the interactions between objects in a frame through time. It seems the idea from [52] could be transferred to the gesture recognition challenge, but in practice the addition of residual attention modules with a simple global average pooling reduction operator shows similar quality without the need for extra computation. In our opinion, the most plausible explanation of this behavior is that a sign gesture has a fixed spatial (placement of the two hands and face) and temporal (transition of fingers through time) structure, which can easily be captured by a 3D neural network with a sufficient spatio-temporal receptive field. However, incorporating the low-level design of graph-based approaches directly into the feature extractor could give a fresh view on the proposed solution, and we hope it will be done in the future.

V Conclusion

In this paper we have proposed a lightweight ASL gesture recognition model that is trained within a metric-learning framework and allows us to recognize ASL signs in a live stream. Besides that, for better model robustness to appearance changes, we propose to use residual spatio-temporal attention with an auxiliary self-supervised loss. The results show that the proposed gesture recognition model can be used in real use cases of ASL sign recognition.

References

  • [1] G. Adaimi, S. Kreiss, and A. Alahi (2019) Rethinking person re-identification with confidence. CoRR abs/1906.04692. External Links: Link, 1906.04692 Cited by: §III-C.
  • [2] V. Athitsos, C. Neidle, S. Sclaroff, J. Nash, A. Stefan, Q. Yuan, and A. Thangali (2008) The american sign language lexicon video dataset. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8. Cited by: §IV-A.
  • [3] J. Carreira, E. Noland, C. Hillier, and A. Zisserman (2019) A short note on the kinetics-700 human action dataset. CoRR abs/1907.06987. External Links: Link, 1907.06987 Cited by: §III-E.
  • [4] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. CoRR abs/1705.07750. External Links: Link, 1705.07750 Cited by: §II.
  • [5] B. Chen, B. Wu, A. Zareian, H. Zhang, and S. Chang (2020) General partial label learning via dual bipartite graph autoencoder. External Links: 2001.01290 Cited by: §IV-E.
  • [6] C. C. de Amorim, D. Macêdo, and C. Zanchettin (2019) Spatial-temporal graph convolutional networks for sign language recognition. CoRR abs/1901.11164. External Links: Link, 1901.11164 Cited by: §II.
  • [7] N. Dhingra and A. Kunz (2019-09) Res3ATN - deep 3d residual attention network for hand gesture recognition in videos. 2019 International Conference on 3D Vision (3DV). External Links: ISBN 9781728131313, Link, Document Cited by: §III-B.
  • [8] B. Fang, J. Co, and M. Zhang (2018) DeepASL: enabling ubiquitous and non-intrusive word and sentence-level sign language translation. CoRR abs/1802.07584. Note: Withdrawn. External Links: Link, 1802.07584 Cited by: §II, §II.
  • [9] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney (2014-05) Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-weather. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 1911–1916. External Links: Link Cited by: §II.
  • [10] A. Gotmare, N. S. Keskar, C. Xiong, and R. Socher (2018) A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. CoRR abs/1810.13243. External Links: Link, 1810.13243 Cited by: §III-E.
  • [11] C. G. Gross (2002) Genealogy of the ”grandmother cell”. The Neuroscientist 8 (5), pp. 512–518. Cited by: §IV-C.
  • [12] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. CoRR abs/1906.12340. External Links: Link, 1906.12340 Cited by: footnote 5.
  • [13] A. A. Hosain, P. S. Santhalingam, P. Pathak, J. Kosecka, and H. Rangwala (2019) Sign language recognition analysis using multimodal data. External Links: 1909.11232 Cited by: §I, §II.
  • [14] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for mobilenetv3. CoRR abs/1905.02244. External Links: Link, 1905.02244 Cited by: 1st item, §II, §III-A, §III-B.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167. External Links: Link, 1502.03167 Cited by: §III-B.
  • [16] E. Izutov (2018) Fast and accurate person re-identification with rmnet. CoRR abs/1812.02465. External Links: Link, 1812.02465 Cited by: §III-C, §III-C, §IV-C.
  • [17] E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. External Links: 1611.01144 Cited by: 2nd item, §III-B.
  • [18] L. Jing, E. Vahdani, M. Huenerfauth, and Y. Tian (2019) Recognizing american sign language manual signs from RGB-D videos. CoRR abs/1906.02851. External Links: Link, 1906.02851 Cited by: §II, §III.
  • [19] H. R. V. Joze and O. Koller (2018) MS-ASL: A large-scale data set and benchmark for understanding american sign language. CoRR abs/1812.01053. External Links: Link, 1812.01053 Cited by: §I, §II, §II, §II, §IV-A, §IV-B.
  • [20] A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. CoRR abs/1901.09005. External Links: Link, 1901.09005 Cited by: footnote 5.
  • [21] Z. Liang, Y. Guan, and J. Rojas (2020) Visual-semantic graph attention network for human-object interaction detection. External Links: 2001.02302 Cited by: §IV-E.
  • [22] J. Lin, C. Gan, and S. Han (2018) Temporal shift module for efficient video understanding. CoRR abs/1811.08383. External Links: Link, 1811.08383 Cited by: §II.
  • [23] H. Luo, W. Jiang, Y. Gu, F. Liu, X. Liao, S. Lai, and J. Gu (2019) A strong baseline and batch normalization neck for deep person re-identification. CoRR abs/1906.08332. External Links: Link, 1906.08332 Cited by: §III-D.
  • [24] Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang (2018) Taking A closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. CoRR abs/1809.09478. External Links: Link, 1809.09478 Cited by: §IV-C.
  • [25] A. Mahendran and A. Vedaldi (2014) Understanding deep image representations by inverting them. CoRR abs/1412.0035. External Links: Link, 1412.0035 Cited by: §III-B.
  • [26] J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell (2019) Something-else: compositional action recognition with spatial-temporal interaction networks. External Links: 1912.09930 Cited by: §IV-E.
  • [27] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello (2016) ENet: A deep neural network architecture for real-time semantic segmentation. CoRR abs/1606.02147. External Links: Link, 1606.02147 Cited by: §III-E.
  • [28] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §III-E.
  • [29] L. Pigou, M. Van Herreweghe, and J. Dambre (2017-10) Gesture and sign language recognition with temporal residual networks. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §II.
  • [30] J. Pu, W. Zhou, and H. Li (2019-06) Iterative alignment network for continuous sign language recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II.
  • [31] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. CoRR abs/1711.10305. External Links: Link, 1711.10305 Cited by: §II.
  • [32] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. Cited by: §II, §III-E, §III-E.
  • [33] C. Shen, G. Qi, R. Jiang, Z. Jin, H. Yong, Y. Chen, and X. Hua (2018) Sharp attention network via adaptive sampling for person re-identification. CoRR abs/1805.02336. External Links: Link, 1805.02336 Cited by: §III-B.
  • [34] X. Shen, X. Tian, T. Liu, F. Xu, and D. Tao (2019) Continuous dropout. External Links: 1911.12675 Cited by: §III-E, §IV-C.
  • [35] A. Sherstinsky (2018) Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network. CoRR abs/1808.03314. External Links: Link, 1808.03314 Cited by: §II.
  • [36] B. Shi, A. M. D. Rio, J. Keane, D. Brentari, G. Shakhnarovich, and K. Livescu (2019-10) Fingerspelling recognition in the wild with iterative visual attention. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §II.
  • [37] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. CoRR abs/1406.2199. External Links: Link, 1406.2199 Cited by: §I, §II.
  • [38] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. External Links: Link Cited by: §IV-C.
  • [39] J. Suárez, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms and software. CoRR abs/1812.05944. External Links: Link, 1812.05944 Cited by: §II.
  • [40] D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2014) C3D: generic features for video analysis. CoRR abs/1412.0767. External Links: Link, 1412.0767 Cited by: §I, §II.
  • [41] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2017) A closer look at spatiotemporal convolutions for action recognition. CoRR abs/1711.11248. External Links: Link, 1711.11248 Cited by: §II.
  • [42] R. Turner, J. Hung, E. Frank, Y. Saatci, and J. Yosinski (2018) Metropolis-hastings generative adversarial networks. External Links: 1811.11357 Cited by: footnote 6.
  • [43] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. CoRR abs/1704.06904. External Links: Link, 1704.06904 Cited by: §III-B.
  • [44] F. Wang, W. Liu, H. Liu, and J. Cheng (2018) Additive margin softmax for face verification. CoRR abs/1801.05599. External Links: Link, 1801.05599 Cited by: §III-C.
  • [45] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool (2017) Temporal segment networks for action recognition in videos. CoRR abs/1705.02953. External Links: Link, 1705.02953 Cited by: §II.
  • [46] Z. Wang, W. Zou, and C. Xu (2019) PR product: A substitute for inner product in neural networks. CoRR abs/1904.13148. External Links: Link, 1904.13148 Cited by: §III-E.
  • [47] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596. External Links: Link, 1901.00596 Cited by: §II.
  • [48] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2017) Rethinking spatiotemporal feature learning for video understanding. CoRR abs/1712.04851. External Links: Link, 1712.04851 Cited by: 1st item, §II, §III-A.
  • [49] F. Xiong, Y. Xiao, Z. Cao, K. Gong, Z. Fang, and J. T. Zhou (2018) Towards good practices on building effective CNN baseline model for person re-identification. CoRR abs/1807.11042. External Links: Link, 1807.11042 Cited by: §III-D.
  • [50] Z. Yang, Z. Shi, X. Shen, and Y. Tai (2019) SF-net: structured feature network for continuous sign language recognition. External Links: 1908.01341 Cited by: §II.
  • [51] H. Zhang, M. Cissé, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. CoRR abs/1710.09412. External Links: Link, 1710.09412 Cited by: §III-E.
  • [52] J. Zhang, F. Shen, X. Xu, and H. T. Shen (2019) Temporal reasoning graph for activity recognition. External Links: 1908.09995 Cited by: §IV-E.
  • [53] X. Zhang, R. Zhao, Y. Qiao, X. Wang, and H. Li (2019) AdaCos: adaptively scaling cosine logits for effectively learning deep face representations. CoRR abs/1905.00292. External Links: Link, 1905.00292 Cited by: §III-C.
  • [54] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017) Random erasing data augmentation. CoRR abs/1708.04896. External Links: Link, 1708.04896 Cited by: §III-E.
  • [55] B. Zhou, A. Andonian, and A. Torralba (2017) Temporal relational reasoning in videos. CoRR abs/1711.08496. External Links: Link, 1711.08496 Cited by: §II.
  • [56] M. Zolfaghari, K. Singh, and T. Brox (2018) ECO: efficient convolutional network for online video understanding. CoRR abs/1804.09066. External Links: Link, 1804.09066 Cited by: §II.