LIGAR: Lightweight General-purpose Action Recognition

08/30/2021 ∙ by Evgeny Izutov, et al. ∙ Intel

The growing number of practical tasks in the video understanding domain poses a great challenge: designing a universal solution that is available to the broad masses and suitable for demanding edge-oriented inference. In this paper we focus on designing a network architecture and a training pipeline to tackle these challenges. Our architecture takes the best from previous ones and adds the ability to be successful not only in appearance-based action recognition tasks but in motion-based problems too. Furthermore, the induced label noise problem is formulated and an Adaptive Clip Selection (ACS) framework is proposed to deal with it. Together this makes the LIGAR framework a general-purpose action recognition solution. We also report an extensive analysis on general and gesture datasets to show the excellent trade-off between performance and accuracy in comparison to state-of-the-art solutions. Training code is available at: https://github.com/openvinotoolkit/training_extensions. For efficient edge-oriented inference all trained models can be exported to the OpenVINO format.

I Introduction

Nowadays Action Recognition (AR) plays a key role in many real-world applications, including human-robot interaction, behavior analysis, and gesture translation. It aims to solve the classification problem in the video domain. The recent success of Deep Learning (DL) based solutions has brought the ability to solve the above-mentioned problems close to the human level. Unfortunately, there is a big gap between a working solution developed for academic purposes and an edge-oriented network design.

To connect business goals and academic achievements, researchers have proposed several solutions for related tasks, like MobileNets [MN-V1, MN-V2, MN-V3] for 2D image classification and the OSNet architecture [OSNet] for the Re-Identification problem. The AR domain still lacks fast and edge-oriented solutions. So, we developed the lightweight action recognition network architecture called LIGAR (LIghtweight General-purpose Action Recognition) to fill the gap between accurate and real-time solutions.

Motivation. Current SOTA Action Recognition approaches mostly use a 3D-like backbone design [I3D, P3D, R(2+1)D, S3D, X3D] to capture the motion component paired with the appearance one. It allows them to ingest huge datasets (e.g. Kinetics [Kinetics] and YouTube-8M [Ytb8m]) and demonstrates impressive accuracy at an admissible robustness level. Unluckily, the resulting 3D-like AR networks are too big for edge-oriented inference and cannot be used anywhere except benchmarks.

The opposite segment of fast AR networks is represented by 2D network architectures [TSN, TSM, TRN], which split the video processing pipeline into independent per-image processing and late feature fusion. The 2D networks work fast enough for real-time inference. The main drawback of these 2D solutions is their inability to capture complex motions, like hand movements in the sign language recognition problem (e.g. American Sign Language – ASL).

Summarizing, the development of a general-purpose network requires capturing the motion component as well as the appearance one. In this paper we focus on merging representatives of the 2D and 3D worlds of lightweight architectures – MobileNet-V3 [MN-V3] and X3D [X3D] respectively. The obtained architecture is mostly suitable for motion capturing, so to enhance processing of the appearance component the model is supplemented with an additional global branch (proposed in the LGD paper [LGD]).

Our contributions are as follows:

  • Extending the family of efficient 3D networks for general-purpose action recognition by merging the X3D [X3D] and LGD [LGD] video processing frameworks with the lightweight edge-oriented MobileNet-V3 [MN-V3] backbone architecture.

  • Supplementing the training procedure with a bag of tricks (augmentations, multi-head pre-training, feature regularization, a metric-learning based head, a learning rate scheduler) to deal with the limited size of AR datasets and reach robustness.

  • Proposing the Adaptive Clip Selection (ACS) module to tackle weak temporal annotation in common AR datasets.

In addition, we release the training framework (https://github.com/openvinotoolkit/training_extensions) that can be used to re-train or fine-tune our model with a custom database. The final model is inference-ready to use with the Intel OpenVINO™ toolkit (https://software.intel.com/en-us/openvino-toolkit); sample code showing how to run the model in demo mode is available in the Intel OpenVINO™ Open Model Zoo (https://github.com/openvinotoolkit/open_model_zoo).

Index terms: action recognition, network regularization, lightweight network, edge-oriented inference, metric learning head, label noise suppression.

II Related Work

i Action Recognition

In the Action Recognition problem we concentrate on solving the classification problem on a sequence of images, which can be derived from a live video stream or an untrimmed long video. First attempts to deal with it were based on extending 2D classification networks to the 3D use case by adding an extra temporal dimension inside the CNN’s kernels, like in the C3D [C3D] and I3D [I3D] papers. The resulting networks were too large for the existing datasets and suffered from strong over-fitting. At the same time, purely 2D solutions perform independent image processing and the extracted features are then merged at the decision [TwoStream, TSN] or feature [TRN] level. This design allows a 2D network to be the fastest video processing framework but not the most accurate one.

The later improvements were driven by the publication of sizable datasets, like Kinetics [Kinetics] and YouTube-8M [Ytb8m], which are suitable for pre-training purposes. The next steps focused on reducing the number of parameters of 3D-based networks (e.g. R(2+1)D [R(2+1)D], S3D [S3D]), extending 2D approaches with Graph Convolutional Networks (GCNs) [TRG] and merging 2D and 3D networks into a single one (e.g. SlowFast [SlowFast], AssembleNet [AssembleNet], LGD [LGD]) to achieve a trade-off between the accuracy of 3D networks and the low computation budget of 2D ones. We pursue the same paradigm and design an LGD-based architecture.

Other researchers pay attention to integrating specialized attention modules into the backbone [RCCA] or the feature aggregation module [NL], or designing a specialized head for better spatio-temporal feature aggregation [Bert]. Unfortunately, the methods listed above lead to a growth in computation and are thereby less suitable for edge-oriented inference.

ii Edge-oriented inference

After the resounding success of DL-based approaches in solving a wide range of complex problems, the necessity arose to bring solutions closer to the end users by adapting networks for inference on the edge. The first 2D solutions are represented by the MobileNet family [MN-V1, MN-V2, MN-V3], the ShuffleNet architecture [ShuffleNet-V1, ShuffleNet-V2] and, later, EfficientNets [EfficientNet].

The attempts to bring 3D networks closer to business use mostly include the X3D [X3D] and MoViNet [MoViNet] network families. The latter is designed automatically by a NAS-based approach. In this paper we adhere to the X3D framework and merge it with the above-mentioned edge-oriented 2D backbone – the MobileNet-V3 architecture.

Figure 1: Overall network design. The model consists of Local and Global paths with intermediate fusion modules as in the LGD network [LGD]. The Local path is built upon the X3D MobileNet-V3-Large backbone architecture. The Global path follows the original LGD design but with Global Context (GC) blocks instead of the Global Average Pooling (GAP) operator. The lightweight head is represented by a GAP operator and a convolution module. The network output is normalized to enable metric-learning losses.

iii Network regularization

Another question addressed to researchers is the ability to train a network on a limited amount of data without significant over-fitting. The most common way to tackle it is to use advanced augmentation techniques, like MixUp [MixUp], AugMix [AugMix], CrossNorm [CrossNorm] and so on.

A different point of view is related to using feature-level regularization methods, like the Dropout family [Dropout, ContinuousDropout, InfoDrop, FocusedDropout], mutual learning [MutualLearning] and self-feature regularization [SFR]. We follow the mutual learning paradigm and blend it with the above-mentioned AugMix strategy.

Further, network regularization can be performed by using more sophisticated losses instead of the default one (the pair of Softmax normalization and the CrossEntropy loss): transitioning to the Metric-Learning (ML) paradigm [ASLNet] and using the AM-Softmax loss [AM-Softmax] as the main target together with some auxiliary losses from the Re-Identification task, like the PushPlus [RMNet] and CenterPush [ASLNet] losses. We act in accordance with the same practices and use an ML head with the aforementioned auxiliary losses.

III Network design

i Motion-appearance trade-off

Originally, the data sources for AR can be divided by data representation into continuous and sparse. The latter is most common for general-purpose scenarios and is described mostly by appearance rather than motion. In other words, a significant amount of data samples from that category can be viewed as a sequence of key frames, where each frame reflects a change in the global appearance. For a better understanding imagine the "covering something" atomic action – to recognize this class it is enough to extract a sequence of only two frames – "uncovered object" and then "covered by something". As can be seen, the motion component in the above sample is rudimentary. To recognize similar classes it is enough to model the relationship between key frames with a 2D network, like in TSN [TSN] or TRN [TRN].

The continuous segment of data representations consists of gesture recognition scenarios, like the ASL (American Sign Language) recognition task. It is complicated by complex hand and finger movements. The attempt to compress an image sequence into a sequence of key frames (compatible with 2D approaches) fails [MSASL] due to the unchanging appearance component (the person or background in a video) on the one hand and the rapidly changing motion component (fingers) on the other hand. The only way to model the motion component is to use fully 3D networks.

In this paper we accept the common paradigm of "inflating" 2D network architectures into 3D ones, while keeping the original 2D weight initialization, as proposed in I3D [I3D]. In terms of edge-oriented network design there are two main inflating frameworks: S3D [S3D] and the recently published X3D [X3D]. The choice between the S3D and X3D frameworks can be viewed as the choice between motion- and appearance-preferable architectures respectively, due to the different placement of the temporal convolution kernel – inside the depth-wise kernel (X3D) or after it (S3D). As shown in [ASLNet], the S3D framework is most suitable for heavy gesture recognition scenarios, like the ASL gesture recognition problem. But our extensive experiments showed that S3D is worse than the X3D framework in general scenarios (e.g. the UCF-101 [UCF101] or ActivityNet [ActivityNet] datasets). Eventually, we decided to use the X3D framework.

ii Multi-path architecture

The recently proposed Local and Global Diffusion (LGD) framework [LGD] achieves SOTA results on several general-purpose datasets. The main idea of the mentioned method is to split the regular single-path backbone design into two paths that separately process the local and global context. The framework also offers between-path communication modules. We find the proposed framework preferable in terms of the motion-appearance trade-off due to the possibility of moving the lion’s share of the appearance-processing work into the lightweight global path. This also allows us to compensate for the loss on gesture recognition issues when choosing X3D instead of the S3D framework.

In our paper we implemented the same LGD-based backbone architecture with several changes:

  • We do not use the kernel function to merge the global and local branches. Instead, the local branch is used as the backbone output and the global branch is shorter by a single LGD block. We did not benefit from using any complex merging scheme because we follow the end-to-end training paradigm instead of the separate path pre-training proposed originally (see section i).

  • Basically, the local-to-global diffusion module uses the Global Average Pooling (GAP) operator to merge spatio-temporal representations into a single feature vector. Experimentally we have found that a more accurate way is to use an attention module over the spatio-temporal dimensions to extract the most relevant features. Having a choice of possible attention module architectures, we have focused on the Global Context (GC) block [GCN] (a minimal sketch follows this list).
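
As a rough illustration, the following PyTorch sketch shows how a GC-style attention pooling over the spatio-temporal dimensions could replace GAP in the local-to-global diffusion step. It is only a sketch under assumptions: the exact block configuration used in LIGAR is not reproduced, the reduction ratio is illustrative, and in the original GC block [GCN] the transformed context is added back to the feature map, whereas here the context vector itself is used as the pooled descriptor.

```python
import torch
import torch.nn as nn

class GlobalContext3D(nn.Module):
    """GC-style attention pooling over the spatio-temporal dimensions.

    Instead of plain GAP, spatio-temporal positions are weighted by a learned
    softmax attention map before being aggregated into a single feature vector.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.attn = nn.Conv3d(channels, 1, kernel_size=1)        # context modeling
        self.transform = nn.Sequential(                          # feature transform
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.LayerNorm([channels // reduction, 1, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, x):                                        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        weights = torch.softmax(self.attn(x).view(b, 1, -1), dim=-1)      # (B, 1, THW)
        context = torch.bmm(x.view(b, c, -1), weights.transpose(1, 2))    # (B, C, 1)
        return self.transform(context.view(b, c, 1, 1, 1))                # (B, C, 1, 1, 1)
```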

iii Overall architecture

The suggested edge-oriented backbone architecture is based on the X3D framework but with the MobileNet-V3 2D skeleton architecture instead of the original MobileNet-V2. Following the S3D framework, the first convolution has only spatial dimensions (a purely spatial kernel) and the temporal strides are placed at different positions than the spatial ones. All the above changes allow us to design an efficient 3D network architecture.

The mentioned 3D architecture constitutes the local path in the LGD network design described earlier. The global path is implemented in the same way as in the original paper but with a GC block instead of the simple GAP operator. The output of the merged backbone is kept the same as in the X3D+MobileNet-V3 architecture without the addition of a kernel fusion module.

The backbone output is followed by a simple spatio-temporal reduction module implemented by a GAP operator. On top of the network a Metric-Learning based head is placed. It consists of two consecutive convolutions with batch norms [BN] and forms the output of the network as a 256-dimensional feature vector. Note that the network output is normalized. For more details see Figure 1.
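
A minimal PyTorch sketch of the described head is given below, assuming 1x1x1 convolutions and an illustrative hidden width; the exact layer settings of LIGAR may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetricLearningHead(nn.Module):
    """Sketch of the head: GAP -> two 1x1x1 convolutions with batch norm ->
    L2-normalized 256-dimensional embedding (widths are illustrative)."""
    def __init__(self, in_channels, hidden_channels=512, embedding_size=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)                      # spatio-temporal GAP
        self.conv1 = nn.Conv3d(in_channels, hidden_channels, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm3d(hidden_channels)
        self.conv2 = nn.Conv3d(hidden_channels, embedding_size, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm3d(embedding_size)

    def forward(self, x):                                        # x: (B, C, T, H, W)
        x = self.pool(x)                                         # (B, C, 1, 1, 1)
        x = torch.relu(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x)).flatten(1)                   # (B, 256)
        return F.normalize(x, dim=1)                             # unit-norm embedding
```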

IV Network regularization

i Multi-head pre-training

The recent success of the transfer learning approach [TransferLearning] through pre-training the network on tremendous general datasets (e.g. ImageNet [ImageNet] for 2D image classification and Kinetics [Kinetics] for the 3D use case) plays an important role in network regularization. It allows us to train the final model on a limited amount of data by using (i.e. transferring) the original pre-trained weights.

Another point of view on the pre-training stage is formulated as increasing the number of targets [ImageNet-21k] to reach richer semantics of the learned classes. Unfortunately, the data collection process for video datasets is time consuming and the existing datasets cannot boast a sufficient number of classes. A solution has been proposed in [SplitML]: simultaneous training on the merged set of available datasets. The idea is to sample the batch from the joint set of samples but split the network heads according to the target datasets. In such a framework each head spawns an independent decision space and there is no possible conflict between similar classes in different datasets.

The above-mentioned method allows us to use the original annotation of the existing datasets but train the network on the full set of samples simultaneously. The proposed network architecture is trained according to the same idea on the joint set of the Kinetics and YouTube-8M-Segments datasets but with ML heads (see section ii).
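
The following sketch illustrates how such dataset-specific heads could be arranged on top of a shared embedding; the dataset names and class counts are placeholders, and a plain cross-entropy is used for brevity instead of the ML losses applied in the actual pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadClassifier(nn.Module):
    """Shared backbone embedding, one classification head per source dataset.

    Each sample contributes only to the loss of its own dataset's head, so
    similar classes from different datasets never compete in one softmax.
    """
    def __init__(self, embedding_size, num_classes_per_dataset):
        super().__init__()
        self.dataset_names = list(num_classes_per_dataset)
        self.heads = nn.ModuleList(
            [nn.Linear(embedding_size, num_classes_per_dataset[n]) for n in self.dataset_names]
        )

    def forward(self, embeddings, dataset_ids, labels):
        """embeddings: (B, D); dataset_ids, labels: (B,) LongTensors."""
        total, used = embeddings.new_zeros(()), 0
        for idx, head in enumerate(self.heads):
            mask = dataset_ids == idx
            if mask.any():
                logits = head(embeddings[mask])
                total = total + F.cross_entropy(logits, labels[mask], reduction='sum')
                used += int(mask.sum())
        return total / max(used, 1)

# Usage sketch: two heads for a joint batch (class counts are illustrative).
model = MultiHeadClassifier(256, {'kinetics': 400, 'ytb8m_segments': 1000})
emb = torch.randn(8, 256)
ds = torch.randint(0, 2, (8,))
lbl = torch.where(ds == 0, torch.randint(0, 400, (8,)), torch.randint(0, 1000, (8,)))
loss = model(emb, ds, lbl)
```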

ii Feature regularization

Unfortunately, a good network initialization is not enough for training a robust and accurate final model due to possible over-fitting on a limited-size target dataset. To overcome the foregoing issue we propose to use a bag of regularization methods during the training stage.

The paper [ASLNet] advocates the idea of combining a regular 3D classification network with a Metric-Learning (ML) based head design and appropriate ML losses. It adds a two-layer head on top of the backbone and normalizes the output 256-dimensional feature vector. Additionally, to control the structure of the learned manifold extra losses are used: the AM-Softmax loss [AM-Softmax] instead of the default CE loss, CenterPush and LocalPush losses to model sample-sample interactions, and a confidence penalty to decrease the impact of overconfident samples. We use the same approach; for more details see the above-mentioned paper.
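
For reference, a minimal sketch of the AM-Softmax loss applied to L2-normalized embeddings is shown below; the scale and margin values are illustrative, and the auxiliary CenterPush/LocalPush losses and the confidence penalty are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMSoftmaxLoss(nn.Module):
    """Additive-margin softmax over cosine similarities between L2-normalized
    embeddings and L2-normalized class centroids (s and m are illustrative)."""
    def __init__(self, embedding_size, num_classes, s=30.0, m=0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, embedding_size))
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        cosine = F.linear(F.normalize(embeddings, dim=1),
                          F.normalize(self.weight, dim=1))            # (B, num_classes)
        # Subtract the margin from the target-class cosine only.
        margin = torch.zeros_like(cosine).scatter_(1, labels.view(-1, 1), self.m)
        return F.cross_entropy(self.s * (cosine - margin), labels)
```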

The ML-based regularization is explicit: it is performed by enforcing extra properties of the decision space. A different (implicit) way to carry out regularization is to restrict the expressiveness of features by some rules. Our experiments led us to believe that there are several outstanding methods which are universal enough for regularization purposes. The first one is an advanced version of the well-known dropout – the Representation Self-Challenging (RSC) module [RSC]. It drops the most relevant filters and thereby forces the network to find a diverse set of features.

The other regularization method is part of the AugMix [AugMix] augmentation. It applies different augmentation pipelines to the same sample in the batch and adds an auxiliary loss to pull the predictions for the same instance towards each other. Thereby the network tries to learn an augmentation-independent representation. We treat the method as a kind of self-mutual learning [SFR] and use it as a way to fill the network capacity with useful task-relevant filters.
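
A sketch of such a consistency term, written as the Jensen-Shannon loss from the AugMix paper, is shown below; whether LIGAR uses exactly this formulation is an assumption.

```python
import torch
import torch.nn.functional as F

def js_consistency_loss(logits_clean, logits_aug1, logits_aug2):
    """Jensen-Shannon divergence between predictions for three views of the
    same clips, encouraging an augmentation-independent representation."""
    p_clean = F.softmax(logits_clean, dim=1)
    p_aug1 = F.softmax(logits_aug1, dim=1)
    p_aug2 = F.softmax(logits_aug2, dim=1)
    # Mixture distribution, clamped for numerical stability of the log.
    log_p_mix = ((p_clean + p_aug1 + p_aug2) / 3.0).clamp(1e-7, 1.0).log()
    return (F.kl_div(log_p_mix, p_clean, reduction='batchmean') +
            F.kl_div(log_p_mix, p_aug1, reduction='batchmean') +
            F.kl_div(log_p_mix, p_aug2, reduction='batchmean')) / 3.0
```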

V Implementation details

i Augmentations

As described in the previous section, the AugMix [AugMix] augmentation is used to effectively regularize the network during the final training on limited-size datasets. Unlike the original paper, we have chosen a fixed set of operations for the augmentation pipeline: random rotation, crop, horizontal flip and CrossNorm [CrossNorm] augmentations. Furthermore, we have found it useful to integrate the random selection of a time segment (aka temporal crop during sampling a clip from the full input video) inside the augmentation pipeline. The last finding significantly increases the difficulty of the auxiliary task.

The other direction of augmentation consists of extending the background diversity of input frames by mixing the source video clips with some external spatial information. In our experiments the most impressive result was shown by the MixUp augmentation [MixUp] adapted for video input (originally proposed in [ASLNet]). It selects a randomly chosen image from a predefined set of images (commonly from the ImageNet dataset [ImageNet]) and mixes it with the full sequence of selected frames in a video clip.

The same model quality can also be achieved by the recently proposed CrossNorm [CrossNorm] augmentation. To carry it out, only a source of pairs of clip mean and variance is needed. Finally, we have chosen the CrossNorm augmentation for our pipeline due to its simplicity for the end user.
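
A minimal sketch of clip-level CrossNorm is given below, assuming per-channel statistics computed over the whole clip; the actual granularity of the statistics in our pipeline may differ.

```python
import torch

def crossnorm_clip(clip, ref_mean, ref_std, eps=1e-5):
    """Re-normalize a clip's per-channel statistics with externally supplied
    mean/std pairs (e.g. taken from another clip).

    clip: (C, T, H, W) float tensor; ref_mean, ref_std: (C,) tensors.
    """
    mean = clip.mean(dim=(1, 2, 3), keepdim=True)
    std = clip.std(dim=(1, 2, 3), keepdim=True) + eps
    normalized = (clip - mean) / std
    return normalized * ref_std.view(-1, 1, 1, 1) + ref_mean.view(-1, 1, 1, 1)

# Usage sketch: swap statistics between two clips in a batch.
clip_a, clip_b = torch.rand(3, 16, 224, 224), torch.rand(3, 16, 224, 224)
aug_a = crossnorm_clip(clip_a, clip_b.mean(dim=(1, 2, 3)), clip_b.std(dim=(1, 2, 3)))
```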

ii Label noise suppression

One more problem for training AR models is related to induced label noise – it arises when a network has to be trained on an untrimmed dataset, like ActivityNet [ActivityNet]. In this case there is no accurate temporal segmentation of the target action classes and the clip sampling procedure can select background frames from the video, thereby producing a sample with an incorrect label (aka induced label noise – see Figure 2). According to our observations, the resulting label noise can be substantial in the worst case.

Figure 2: Example of invalid clip sampling: the source video, the ground-truth borders of the target action clip, and an incorrectly sampled action clip (according to the commonly used uniform sampling strategy).

Many solutions have been developed to tackle it, like MARVEL [MARVEL] and PRISM [PRISM]. Unfortunately, the mentioned methods reduce the quality on relatively clean datasets (e.g. UCF-101 [UCF101]) due to their inability to distinguish a noisy sample from a difficult one. This does not allow us to use them in a universal framework suitable for all general use cases.

In our opinion, a more elegant way to tackle the mentioned problem is to design a procedure for careful clip selection instead of fixing the label of an already sampled clip. In light of this, we have designed the Adaptive Clip Selection (ACS) procedure. The main idea of the method is to continuously collect the prediction accuracy of each sampled clip and then sample a new one according to the probability of a temporal segment being a correct carrier of the target action category.

More formally, let $F = \{1, \dots, N\}$ be the set of frame indices in some video and $N$ the number of frames in it. The frame sampling procedure aims to select some continuous subset of frames of length $L$. We can split $F$ into continuous segments $S_1, \dots, S_K$, where $K = \lceil N / L \rceil$. Additionally, let us associate each frame $f$ in a video with some positive score $w_f$. Having the probabilities $p_i$ of selecting each segment, we can sample a segment from the defined multinomial distribution. In this way $p_i$ can be expressed as follows:

$$p_i = \frac{\sum_{f \in S_i} w_f}{\sum_{f \in F} w_f} \qquad (1)$$

During the training procedure we store the predicted class scores after each forward pass. Let us assume the segment $S_i$ has been predicted as class $\hat{c}$ with probability $\hat{p}$, while the true label of the segment is $y$. To enable the collection of statistics, we update the scores associated with the frames of $S_i$ according to the following equation:

$$w_f \leftarrow \hat{p} \cdot [\hat{c} = y], \quad \forall f \in S_i \qquad (2)$$

where $[\cdot]$ denotes the indicator function.

In practice we enable ACS for each video sample in the dataset independently and do it only after collecting sufficient frame statistics – the number of frames with a non-zero score should be not less than some threshold (70% in our experiments). Before that, ACS is disabled and clip sampling is performed uniformly. Note that collecting statistics is performed from the beginning to the end of training (except the very first epoch, to omit the potential volatility of the beginning). We have also experimented with labeling a sample as ignore (or negative) in case all frames in a video are predicted as incorrect (i.e. all frame scores are zero), but the quality was worse because the method faced the same problem as the previous label correction methods – the above-mentioned dilemma of sample difficulty versus label noise.
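
A minimal sketch of the ACS bookkeeping is given below. The segment probabilities follow Eq. (1); the score update mirrors the reconstruction in Eq. (2) and is therefore an assumption, as are the class and method names.

```python
import numpy as np

class AdaptiveClipSelector:
    """Per-video bookkeeping for Adaptive Clip Selection (ACS).

    Frames carry positive scores; a clip is sampled from the segment chosen by
    a multinomial distribution over normalized segment scores. Scores are
    refreshed from the model's predictions after each forward pass; until
    enough frames have non-zero scores, sampling stays uniform.
    """
    def __init__(self, num_frames, clip_len, readiness=0.7):
        self.clip_len = clip_len
        self.readiness = readiness
        self.num_segments = int(np.ceil(num_frames / clip_len))
        self.scores = np.zeros(num_frames, dtype=np.float32)

    def _segment_scores(self):
        return np.array([self.scores[i * self.clip_len:(i + 1) * self.clip_len].sum()
                         for i in range(self.num_segments)], dtype=np.float64)

    def sample_segment(self, rng=np.random):
        seg_scores = self._segment_scores()
        ready = (self.scores > 0).mean() >= self.readiness
        if not ready or seg_scores.sum() <= 0:
            return rng.randint(self.num_segments)        # uniform fallback
        probs = seg_scores / seg_scores.sum()            # Eq. (1)
        return rng.choice(self.num_segments, p=probs)

    def update(self, segment_idx, predicted_class, predicted_prob, true_label):
        # Assumed update rule: a correct prediction keeps the confidence as the
        # score, an incorrect one zeroes the segment out (cf. Eq. (2)).
        value = predicted_prob if predicted_class == true_label else 0.0
        start = segment_idx * self.clip_len
        self.scores[start:start + self.clip_len] = value
```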

iii Training

The network training procedure is performed in a multi-stage manner to preserve the pre-trained weights (see section i) and to speed up training overall. The training stages are as follows:

  1. Training the network head only. During this stage we freeze all backbone parameters (all batch norms in the backbone are switched to inference mode) and train the head only with a high learning rate. This stage allows us to roughly optimize the class centroids and prevents a future gradient explosion after enabling gradient propagation for all model parameters.

  2. At the next stage we enable learning rate WarmUp [WarmUp]. It gradually increases the learning rate following a cosine schedule and allows the gradients of the head and of the rest of the network to be stitched together.

  3. Finally the training procedure is switched to the default one.

The main training procedure is performed with a cosine learning rate schedule. We use a slightly modified version of the cosine schedule to increase the fraction of time spent at a high learning rate. In practice this prevents strong over-fitting in the late phase of training in case of limited-size datasets. The learning rate at the $t$-th iteration is given by the following expression:

$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\pi \left(\tfrac{t}{T}\right)^{q}\right)\right) \qquad (3)$$

where $\eta_{min}$ and $\eta_{max}$ are the lower and upper bounds of the learning rate respectively, $t$ is the current iteration and $T$ is the maximum number of iterations. The parameter $q$ allows us to control the fraction of the high learning rate phase; in our experiments it is fixed to a single value.
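
A small sketch of such a stretched cosine schedule is shown below; the $(t/T)^q$ warping is an assumed form that matches the description above rather than a verified reproduction of Eq. (3), and the q value is illustrative.

```python
import math

def stretched_cosine_lr(t, t_max, lr_min, lr_max, q=2.0):
    """Cosine annealing warped by an exponent q > 1 so that the learning rate
    stays close to lr_max for a larger fraction of the run (q is illustrative)."""
    progress = (t / t_max) ** q
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

# Usage sketch with a PyTorch optimizer:
# for t in range(t_max):
#     for group in optimizer.param_groups:
#         group['lr'] = stretched_cosine_lr(t, t_max, 1e-5, 1e-2, q=2.0)
```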

Furthermore, it was found that training with the recently proposed Adaptive Gradient Clipping (AGC) [AGC] allows us not only to speed up training but also to increase the model accuracy. In all experiments below it is enabled for our network. The final network is trained on two GPUs with 12 clips per node using the SGD optimizer with weight decay regularization in the PyTorch framework [PyTorch].
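
For illustration, a parameter-wise variant of adaptive gradient clipping in the spirit of [AGC] is sketched below; the original method uses unit-wise norms and a specific threshold, so both the granularity and the clipping factor here are assumptions.

```python
import torch

def adaptive_gradient_clip_(parameters, clip=0.01, eps=1e-3):
    """Rescale each parameter's gradient so that ||g|| / max(||p||, eps)
    does not exceed `clip` (parameter-wise variant for brevity)."""
    for p in parameters:
        if p.grad is None:
            continue
        p_norm = p.detach().norm().clamp_min(eps)
        g_norm = p.grad.detach().norm()
        max_norm = clip * p_norm
        if g_norm > max_norm:
            p.grad.mul_(max_norm / g_norm.clamp_min(1e-12))

# Usage sketch inside a training step:
# loss.backward()
# adaptive_gradient_clip_(model.parameters(), clip=0.01)
# optimizer.step()
```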

Sampler     | UCF-101               | ActivityNet-200       | Jester-27
            | 1 segment | 10 segm.  | 1 segment | 10 segm.  | 1 segment | 10 segm.
sparse      | 93.79     | 94.71     | 59.94     | 74.32     | 95.25     | 95.68
continuous  | 93.63     | 94.85     | 64.19     | 75.27     | 95.43     | 95.56

Table 1: The pivot table of different frame sampling strategies and test protocols. For each combination the Top-1 accuracy is reported.

VI Experimental Results

i Data

We conduct experiments on several commonly-used benchmarks for general action and gesture recognition. The datasets listed below allow us to validate the proposed architecture in different scenarios: trimmed and untrimmed general-purpose action recognition and continuous hand gesture recognition. As described earlier, the first category checks the ability to learn appearance-based action recognition while the latter checks the ability to model the motion component of actions.

  • UCF-101 [UCF101] is a middle-size general-purpose action recognition dataset of trimmed action videos, split into 101 action categories. The total number of samples is 13320 and three train-val splits are provided. As many other researchers do, we report results on the first split only to reduce training time.

  • ActivityNet-200 [ActivityNet] is a middle-size general-purpose action recognition dataset. Unlike the previous dataset, the source videos are untrimmed and split into 200 action categories. Note that the dataset is designed mostly to solve the temporal action detection/segmentation task, but we adapt it for action recognition purposes (see the next section for more details). The total number of video instances available for download is 17196.

  • Jester-27 [Jester] is a large-scale dataset of labeled video clips showing humans performing pre-defined hand gestures. The dataset consists of 148092 unique video clips split into 27 gesture categories.

ii Evaluation Protocol

The main protocol to measure the performance of action recognition algorithms consists of top-1 and top-5 accuracies. The latter metric is used to smooth out perturbations in case of noisy labels in the annotation. However, as mentioned in the paper [ASLNet], the transition to ML-based heads allows us to measure a more noise-resistant metric, the rank mAP (for more details see the above-mentioned paper). In this paper we measure it as well.

Another question is related to the choice of a procedure for clip selection from an input video during the testing phase. A commonly used approach is to split the input video into 10 temporal segments and apply the AR network to each segment independently. Sometimes the AR network is run for several crops inside each temporal segment. For the end user it means that the reported performance metrics in terms of GFLOPs should be multiplied by 30 (10 segments × 3 crops). The mentioned method is suitable for the academic community (to show the best quality) but not for business use. To report fair performance-accuracy pairs we follow the reduced protocol – a single central spatio-temporal crop for each input video (for comparison purposes we report metrics for both protocols in section iii). Note that the ActivityNet data is untrimmed, so a single temporal crop may not reflect the target class due to a temporal mismatch with the ground truth (see section ii). So, the 10-temporal-crop protocol is considered the primary one for the ActivityNet benchmark.

One more possible difference in measurements is related to the frame sampling procedure. For general-purpose datasets it is reasonable to sample frames inside a segment uniformly to achieve the best temporal coverage. Unfortunately, the discussed method is not suitable for real-time applications (e.g. hand gesture recognition) which process frames on-the-fly and do not have access to future frames. In the paper [ASLNet] the continuous protocol has been proposed – sampling frames during the train and test phases with a fixed frame rate. In this paper we follow the same continuous protocol (the comparison between protocols can be found in section iii).
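
The two sampling strategies can be illustrated with the following sketch; the stride and window parameters are illustrative.

```python
import numpy as np

def sparse_indices(video_len, num_frames):
    """Uniformly spread `num_frames` indices over the whole video."""
    return np.linspace(0, video_len - 1, num_frames).round().astype(int)

def continuous_indices(video_len, num_frames, stride, start=0):
    """Take `num_frames` consecutive indices with a fixed temporal stride,
    as in the continuous (live-stream friendly) protocol."""
    idx = start + stride * np.arange(num_frames)
    return np.clip(idx, 0, video_len - 1)

# Example: a 300-frame video, 16 network input frames.
print(sparse_indices(300, 16))                            # covers the full video
print(continuous_indices(300, 16, stride=2, start=120))   # local temporal window
```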

Method                   | Top-1 | Top-5 | mAP
from scratch             | 64.05 | 87.97 | 57.33
ImageNet (2D init)       | 71.90 | 90.67 | 77.74
Kinetics (3D init)       | 92.81 | 98.76 | 96.88
Kinetics+Ytb8M (3D init) | 93.63 | 99.10 | 97.82

Table 2: The ablation study of the backbone initialization on the UCF-101 dataset.

iii Ablation study

Below we present the ablation study of the proposed framework in terms of the frame sampling strategy (comparison between the sparse and continuous protocols during inference), the pre-training method and the influence of the introduced ACS module.

Method   | UCF-101                | ActivityNet-200
         | Top-1 | Top-5 | mAP    | Top-1 | Top-5 | mAP
w/o ACS  | 93.52 | 99.07 | 97.50  | 74.89 | 92.77 | 73.33
w/ ACS   | 93.63 | 99.10 | 97.82  | 75.27 | 92.70 | 73.81

Table 3: The comparison of using the ACS method on the UCF-101 and ActivityNet-200 datasets. Note that the metrics on the ActivityNet dataset are reported for the testing protocol with 10 temporal segments.

Frame sampling strategy. First, we would like to compare the frame sampling strategies. Depending on the target use case there are two possible solutions described earlier: sparse sampling and fixed frame rate sampling (i.e. sampling with a fixed temporal stride). The first strategy assumes a low impact of the ratio between the video length and the network input length, but is not suitable for the live demo mode due to its inability to see future frames. Recently the sparse sampling method [MoViNet] allowed the authors to show impressive performance with a single temporal crop. Our experiments (see Table 1) demonstrate that a two-path network with a separate global context branch is able to reduce the impact of the sampling strategy. Moreover, we see an improvement in case of a single crop for the ActivityNet dataset. In our opinion this is because the target clip with a valid action is significantly shorter than the full video, and as a result the sparse sampling strategy collects only a scarce number of valid frames. Generally speaking, the LIGAR framework allows us to close the question of the frame sampling strategy for live demo applications.

Additionally, we have compared the testing protocols here, especially single- versus multi-view video prediction. As expected, increasing the number of views per video improves the accuracy metric by smoothing the impact of a possible temporal mismatch between the unknown ground truth and the tested central temporal crop. Unfortunately, most papers report the multi-view metrics to get the best results, which is not suitable for the live demo scenario. Regarding the reported results, the difference between the two testing protocols is acceptable, with the exception of the ActivityNet dataset (the reason for this is described earlier).

Name                         | Input frames | Views | GFLOPs (single view) | MParams | UCF-101                | Jester-27
                             |              |       |                      |         | Top-1 | Top-5 | mAP    | Top-1 | Top-5 | mAP
R(2+1)D-BERT [Bert]          | 64           | -     | 152.97               | 66.67   | 98.69 | -     | -      | -     | -     | -
LGD-3D RGB [LGD]             | 16           | 15    | -                    | -       | 97.00 | -     | -      | -     | -     | -
PAN ResNet101 [PAN]          | 32           | 2     | 251.7                | -       | -     | -     | -      | 97.40 | 99.90 | -
STM [STM]                    | 16           | 10    | 66.5                 | 22.4    | 96.20 | -     | -      | 96.70 | 99.90 | -
3D-MobileNetV2 1.0x [3DMob]  | 16           | 10    | 0.45                 | 3.12    | 81.60 | -     | -      | 94.59 | -     | -
LIGAR (ours)                 | 16           | 1     | 4.74                 | 4.47    | 93.63 | 99.10 | 97.82  | 95.43 | 99.45 | 96.95
LIGAR (ours)                 | 16           | 10    | 4.74                 | 4.47    | 94.85 | 99.50 | 98.61  | 95.56 | 99.52 | 97.33

Table 4: Comparison of the proposed LIGAR framework with SOTA solutions on the UCF-101 and Jester-27 datasets. For fairness we report the number of views for each measurement where it is specified.

Network initialization. Another important question concerns the network initialization and the value of a pre-training stage. Originally, pre-training is designed to transfer knowledge from big datasets to a small one, thereby reducing the harmful effect of over-fitting. For 3D-based networks the initialization has two possible sources: 2D initialization of the backbone only, before the inflating procedure (see the I3D [I3D] paper for more details), and direct 3D initialization of the full network. In Table 2 we have summarized the possible strategies of network initialization. Note that the last line differs from the previous one by enabling the earlier-mentioned multi-head pre-training on the merged dataset. Overall, the model behavior reflects the intuition behind it – the method with more data during the pre-training stage surpasses the previous one in terms of all measured metrics.

Induced noise suppression. The last question is related to the suppression of the induced label noise described earlier. The main benchmark to measure the importance of the proposed Adaptive Clip Selection (ACS) module is the ActivityNet dataset. As mentioned before, we do not follow the original ActivityNet protocol and measure action recognition metrics instead of localization ones. Unfortunately, we cannot measure the real performance of the model on this dataset due to the lack of an accurate temporal annotation of the target actions, and it is expected that the real quality is higher than reported. Nevertheless, in Table 3 we can see the improvement over the baseline from using the proposed ACS module. Furthermore, the improvement is observed even for an initially clean dataset like UCF-101. In our opinion, this is because the impact of induced label noise is much stronger than commonly believed, even for clean popular datasets.

iv Comparison with the State-of-the-Arts

We further demonstrate the advantages of our proposed LIGAR framework in comparison with state-of-the-art methods for general-purpose action recognition. For a fair comparison all methods use the RGB modality as input. We report results for both possible testing protocols – 1 and 10 temporal crops (spatial crops did not bring benefits in our experiments). In Table 4 we have collected the best solutions in terms of accuracy on two sufficiently different datasets.

Results on UCF-101. We first verify the appearance modeling ability on UCF-101. The model achieves a superior accuracy-to-complexity trade-off compared with other methods that are computationally much more expensive. For example, the BERT-like solution [Bert] is over an order of magnitude more expensive in terms of GFLOPs, while its advantage in the top-1 metric is only a few percentage points (see Table 4). In the case of the equal testing protocol (10 views per video) the gap is slightly smaller. In comparison to the solution with a similar computation budget [3DMob], the accuracy of the reported approach is significantly better.

Results on Jester-27. We also compare with other methods on Jester-27 to verify the model's ability to make predictions based on the motion component. Compared with the results on the general dataset, the gap between heavy solutions and the proposed one is even smaller. Specifically, we can observe that our proposed method drops less than 2 percentage points in comparison to the SOTA solution [PAN]. As for the previous dataset, the advantage in terms of computation budget is more than 53 times.

VII Conclusion

This paper has presented an extension of the LGD framework for solving the action recognition problem more efficiently and accurately across a wide range of applications, such as general (appearance-based) and gesture (motion-based) recognition. Moreover, the paper proposed a novel clip selection module to tackle the induced label noise issue. The described training pipeline allows us to train a robust DL-based solution which is able to solve most real-world action recognition problems in a fast and accurate manner. The reported results give us hope that DL-based solutions will continue to advance on the remaining vital challenges of humanity.

References