SMART: Skeletal Motion Action Recognition aTtack

Adversarial attack has inspired great interest in computer vision, by showing that classification-based solutions are prone to imperceptible attack in many tasks. In this paper, we propose a method, SMART, to attack action recognizers which rely on 3D skeletal motions. Our method involves an innovative perceptual loss which ensures the imperceptibility of the attack. Empirical studies demonstrate that SMART is effective in both white-box and black-box scenarios. Its generalizability is evidenced on a variety of action recognizers and datasets. Its versatility is shown in different attacking strategies. Its deceitfulness is proven in extensive perceptual studies. Finally, SMART shows that adversarial attack on 3D skeletal motion, one type of time-series data, is significantly different from traditional adversarial attack problems.




1 Introduction

Adversarial attack has invoked a new wave of research interest recently. On the one hand, it shows that deep learning models, as powerful as they are, are vulnerable to attack, leading to security and safety concerns [29]; on the other hand, it has been proven useful for improving the robustness of existing models [14]. Starting from object recognition, the list of target tasks for adversarial attack has been rapidly expanding, and now includes face recognition [25], point clouds [38], and 3D meshes [40].

While adversarial attack on static data (images, geometries, etc.) has been explored, attacks on time-series data have only been attempted under general settings [9]. In computer vision, video-based attacks have targeted recognition tasks [36]. In this paper, we attack another type of time-series data: 3D skeletal motions, for action recognition tasks.

Skeletal motion has been widely used in action recognition [3]. It can greatly improve recognition accuracy by mitigating issues such as lighting, occlusion and posture ambiguity. In this paper we show that 3D skeletal motions are also vulnerable to adversarial attack, a vulnerability that can cause serious concerns.

Adversarial attack on 3D skeletal motion faces two unique and related challenges which are significantly different from other attack problems: low redundancy and perceptual sensitivity. When attacking images/videos, it is possible to perturb some pixels without causing too much visual distortion. This largely depends on the redundancy in image space [31]. Unlike images, which have thousands of Degrees of Freedom (DoFs), a motion frame (or a pose) in 3D skeletal motions is usually parameterized by fewer than 100 DoFs (in our experiments, we use 25 joints, equivalent to 25*3=75 DoFs). This not only restricts the space of possible attacks, but also has severe consequences for the imperceptibility of the adversarial examples: a small perturbation on a single joint can be easily noticed. Furthermore, coordinated perturbations on multiple joints in only one frame can hardly work either, because similar constraints apply in the temporal domain. Any sparsity-based perturbation (on single joints or individual frames) will greatly affect the dynamics (causing jittering or bone-length violations) and will be very obvious to an observer.

We propose an adversarial attack method, SMART, based on an optimization framework that explicitly considers motion dynamics and skeletal structures. The optimization finds perturbations by balancing between classification goals and perceptual distortions, formulated as classification loss and perceptual loss. Varying the classification loss leads to different attacking strategies. The new perceptual loss fully utilizes the dynamics of the motions and bone structures. We empirically show that SMART is effective in both white-box and black-box settings, on several state-of-the-art models, across a variety of datasets.

Formally, our contributions include:

  • A novel perceptual loss function for adversarial attack on action recognition based on 3D skeletal motions. The new perceptual loss captures the perceptual realism and fully exploits the motion dynamics.

  • Empirical evidence that 3D skeletal motions are vulnerable to attack under multiple settings and attacking strategies, by extensive experiments and user studies.

  • Insights into the role of dynamics in the imperceptibility of adversarial attack, based on comprehensive perceptual studies. This result differs significantly from widely accepted approaches, which use $\ell_p$-norms with $\epsilon$-ball constraints on static data such as images.

2 Related Work

2.1 Skeleton-based Action Recognition

Action recognition is crucial for many important applications, such as visual surveillance, human-robot interaction and entertainment. Recent advances in 3D sensing and pose estimation motivate the use of clean skeleton data to robustly classify human actions, overcoming the biases of raw RGB video due to body occlusion, cluttered background, lighting variation, etc. Unlike conventional approaches that are limited to handcrafted skeletal features [32, 5, 2], recent methods taking advantage of features learned by deep models have achieved state-of-the-art performance. According to the representation of the skeleton data used for training, deep learning based methods can be classified into three categories: sequence-based, image-based, and graph-based methods.

Sequence-based methods represent skeleton data as a chronological sequence of postures, each of which consists of the coordinates of all the joints; an RNN-based architecture is then employed to perform the classification [3, 23, 17, 28, 46]. Image-based methods represent skeletal motion as a pseudo-image, a 2D tensor where one dimension corresponds to time and the other stacks all joints from a single skeleton. Such a representation enables CNN-based image classification to be applied in the action recognition context [18, 10]. Different from the previous two categories, which mainly rely on skeleton geometry represented by the joint coordinates, graph-based methods use a graph representation to naturally capture the skeleton topology (i.e., joint connectivity) encoded by the bones connecting neighboring joints; graph neural networks (GNNs) are used to train the classifier and recognize the action [43, 30, 26]. Based on the code released by the original authors, we perform adversarial attacks on the two most representative categories (RNN- and GNN-based), demonstrating the vulnerability of different types of neural networks.

2.2 Adversarial Attacks

Despite their success in vision tasks such as classification and recognition, deep neural networks are vulnerable to carefully crafted adversarial attacks, as first pointed out in [29]. In other words, delicately designed neural networks with high performance can be easily fooled by an unnoticeable perturbation of the original data. Motivated by this concern, researchers have extensively investigated adversarial attacks on different data types, including 2D images [6, 22, 19, 39, 41], videos [37, 35], 3D shapes [15, 45, 40, 38] and physical objects [12, 1, 4], while little attention has been paid to 3D skeletal motions.

Adversarial attacks in the context of action recognition are much less explored. Inkawhich et al. [8] attack optical-flow based action recognition classifiers; their approach is mainly inspired by image classifier attacks and differs from our work in terms of the input data.

In recent work contemporaneous to our own [16], adversarial attack is applied to a GNN network for action classification from skeletal motion [43]. The loss function used for the attack minimizes the acceleration of joint positions, for which there is a qualitative demonstration of imperceptibility. In our work, we demonstrate improved results using a perceptual loss that minimizes acceleration relative to the original skeletal motion, thereby preserving the large accelerations intrinsic to actions such as running and jumping. We also perform a perceptual study to validate the imperceptibility of the perturbed skeletal motions and the effectiveness of our choice of perceptual loss.

We demonstrate successful attacks on a range of network architectures, including RNN and GNN based methods, using three datasets. Finally, we present results for three different attack objectives, including the novel objective of placing the correct action beneath the first n actions in a ranked classification, for a given n.

3 Methodology

SMART is formulated as an optimization problem, where the minimizer is an adversarial example, for a given motion, that minimizes the perceptual distortion while fooling the target model. The optimization has three variants constructed for three different attacking strategies: Anything-but Attack, Anything-but-N Attack and Specified Attack. They are used in white-box and black-box scenarios.

3.1 Optimization for Attack

In an action recognition task, given a motion $q = \{q_1, q_2, \dots, q_T\}$, where $q_t$ is the frame at time $t$ and consists of stacked 3D joint locations, a trained classifier predicts its class label $y = \arg\max_i \hat{y}_i$, where $\hat{y} = C(q)$ is the predicted distribution over class labels, $C$ is a deep neural network whose last layer is usually a softmax, and $y$ is the predicted label. We aim to find a perturbed example $\tilde{q}$ for $q$ such that $\arg\max_i C(\tilde{q})_i \neq y$.

Without any constraints, it is trivial to find $\tilde{q}$. So normally, it is required that the difference between $q$ and $\tilde{q}$ is not perceptible. This can be formulated as a generic optimization problem:

$$\tilde{q} = \arg\min_{\tilde{q}} \; L_c(\tilde{q}) + \lambda L_p(q, \tilde{q})$$

where $L_c$ and $L_p$ are the classification loss and the perceptual loss respectively and $\lambda$ is a weight, fixed across our experiments. Intuitively, there are two forces governing $\tilde{q}$: $L_c$ is the classification loss, through which we can design different attacking strategies; $L_p$ is the perceptual loss, which dictates that $\tilde{q}$ should be visually indistinguishable from $q$.

To optimize for $\tilde{q}$, we make only one mild assumption: we can compute the gradient $\nabla_{\tilde{q}} L$ of the total loss. This way, we can compute $\tilde{q}$ iteratively by $\tilde{q}_{s+1} = \tilde{q}_s + \alpha \, g(\nabla_{\tilde{q}} L)$, where $\tilde{q}_s$ is $\tilde{q}$ at step $s$, $g$ computes the updates and $\alpha$ is the learning rate. We set $\tilde{q}_0 = q$ and use Adam [11] for $g$.
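As a concrete illustration, the iterative scheme can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation: a toy linear-softmax classifier stands in for the target network, a plain $\ell_2$ term stands in for the perceptual loss, and the names (`attack`, `lam`) are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attack(q, W, y_true, lam=0.1, lr=0.01, steps=400):
    """Anything-but attack on a toy linear-softmax classifier C(q) = softmax(W q).
    Minimizes  log p_y + lam * ||q~ - q||^2  with Adam, starting from q~ = q
    (the l2 term stands in for the perceptual loss of Section 3.2)."""
    q_adv = q.copy()                                  # q~_0 = q
    m, v = np.zeros_like(q), np.zeros_like(q)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for s in range(1, steps + 1):
        p = softmax(W @ q_adv)
        onehot = np.zeros_like(p)
        onehot[y_true] = 1.0
        g = W.T @ (onehot - p)                        # grad of log p_y w.r.t. q~
        g += 2.0 * lam * (q_adv - q)                  # grad of the l2 stand-in
        m = b1 * m + (1 - b1) * g                     # Adam moment estimates
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** s)
        v_hat = v / (1 - b2 ** s)
        q_adv -= lr * m_hat / (np.sqrt(v_hat) + eps)  # descend on the total loss
    return q_adv
```

Swapping in the gradient of a real action recognizer and the perceptual loss recovers the SMART optimization; the structure of the loop is unchanged.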

3.2 Perceptual Loss

Imperceptibility is a hard constraint in adversarial attacks: humans should not be able to easily distinguish adversarial examples from real data. Many existing approaches on images and videos achieve imperceptibility by computing image-wise or frame-wise minimal changes, measured by an $\ell_p$-norm such as $\ell_2$ or $\ell_\infty$. However, this does not work for motions, because such measures do not consider dynamics.

To fully represent the dynamics of a motion, we need its derivatives from zero-order (joint location) and first-order (joint velocity) up to $n$th-order. One common approximation is to use the first few terms, e.g. up to the second order. When it comes to imperceptibility of motions, the perceived motion naturalness is vital and not all derivatives are equally important [33]. Inspired by work in character animation [33, 34], we propose a new perceptual loss:

$$L_p(q, \tilde{q}) = L_B(\tilde{q}) + \beta L_D(q, \tilde{q})$$

where $\beta$ is a weight, set to 0.3 in our experiments. $L_B$ penalizes bone-length deviations in every frame, where $T$ is the total frame number. Theoretically, the bone lengths do not change over time. However, due to motion capture errors, they do change across frames even in the original motions. This is why $L_B$ is designed as a frame-wise bone-length loss term.

$L_D$ is the dynamics loss. We use a strategy called derivative matching: a weighted (by $w$) sum of the frame-wise distances between $q^{(n)}$ and $\tilde{q}^{(n)}$, where $q^{(n)}$ and $\tilde{q}^{(n)}$ are the $n$th-order derivatives and can be computed by forward differencing. $w$ is a 25-dimensional vector of joint weights. Although $n$ can in principle go up to infinity, in practice we explored up to $n = 4$, which includes joint position, velocity, acceleration, jerk and snap. After exhaustive experiments, we found that a good compromise is to keep only the second-order (acceleration) term and set the weights of the other orders to 0. Matching the second-order profiles of two motions is critical: for skeletal motions, small location deviations can still generate large acceleration differences, resulting in two distinctive motions. More often, they generate severe jittering and thus totally unnatural motions. An alternative way of regulating the dynamics is to purely smooth the motion, e.g. by minimizing the acceleration, but this damps highly dynamic motions such as jumping [33]. Also, considering more derivatives makes the optimization harder to solve and outweighs their benefits.

Finally, we fix the values of the joint weights $w$. Based on our preliminary studies, we found that the perceived motion naturalness is not affected by all joints equally; jittering on the torso has a higher impact. So we use higher weights on the spinal joints. For all our experiments, the skeleton has 25 joints. We use 0.04 for the 5 spinal joints and 0.02 for the rest.
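Putting the two terms together, the perceptual loss can be sketched as follows. This is a numpy sketch under stated assumptions: motions are arrays of shape (T, J, 3), bones are defined by a parent-index array, only the acceleration term of the dynamics loss is kept as described above, and the exact norms and the placement of the 0.3 weight are illustrative.

```python
import numpy as np

def bone_length_loss(q_adv, q_orig, parents):
    """Frame-wise deviation of bone lengths from the original motion.
    parents[j] is the parent joint of joint j (-1 for the root)."""
    child = np.array([j for j, p in enumerate(parents) if p >= 0])
    par = np.array([p for p in parents if p >= 0])
    def lengths(q):                                   # (T, #bones)
        return np.linalg.norm(q[:, child] - q[:, par], axis=-1)
    return np.mean((lengths(q_adv) - lengths(q_orig)) ** 2)

def dynamics_loss(q_adv, q_orig, joint_w):
    """Derivative matching, keeping only the 2nd-order (acceleration)
    term; accelerations come from forward differencing along time."""
    acc = lambda q: np.diff(q, n=2, axis=0)           # (T-2, J, 3)
    per_joint = np.sum((acc(q_adv) - acc(q_orig)) ** 2, axis=-1)
    return np.mean(per_joint * joint_w)               # joint-weighted

def perceptual_loss(q_adv, q_orig, parents, joint_w, beta=0.3):
    return (bone_length_loss(q_adv, q_orig, parents)
            + beta * dynamics_loss(q_adv, q_orig, joint_w))
```

Note that a rigid translation of the whole motion changes neither bone lengths nor accelerations, so it incurs zero loss; jittering a single joint, by contrast, is punished through its acceleration profile.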

3.3 White-box Attack

With the perceptual loss fixed, varying the formulation of classification loss allows us to form different attacking strategies. We present three strategies.

Anything-but Attack (AB). Anything-but Attack aims to fool the classifier so that $\arg\max_i C(\tilde{q})_i \neq y$. This can be achieved by maximizing the cross entropy between the ground-truth label and the prediction:

$$L_c = -CE(y, C(\tilde{q}))$$

where $CE$ is the cross entropy; minimizing $L_c$ maximizes it. Comparatively, AB is the easiest optimization problem among the three strategies: $C(\tilde{q})$ could peak on any class label but the ground truth, or even become just flat.

Anything-but-N Attack (ABN). Anything-but-N Attack is a generalization of AB. It aims to confuse the classifier so that it has similar confidence levels in multiple classes. ABN is more suitable for confusing classifiers evaluated by top-N accuracy. In addition, we found that it performs better in black-box attacks by transferability, which will be detailed in the experiments.

Instead of simply applying AB to the top N classes, we propose a simpler loss function that maximizes the entropy of the predicted distribution $\hat{y} = C(\tilde{q})$:

$$L_c = \sum_{i \in G_n} \hat{y}_i \log \hat{y}_i$$

where $G_n$ is the set of top $n$ class labels in the predictive distribution $\hat{y}$. By minimizing $L_c$, we actually maximize the entropy of $\hat{y}$ over $G_n$, i.e. force it to be flat over these class labels and thus reduce the confidence of the classifier in any particular class. We stop the optimization once the ground-truth label falls beyond the top $n$ classes. ABN is a harder optimization problem than AB because it needs the predictive distribution to be as flat as possible.

3.3.1 Specified Attack (SA)

Different from AB and ABN, sometimes it is useful to fool the classifier into predicting a specific class label. Given a fake label $y' \neq y$, we can compute its corresponding class label distribution and minimize the cross entropy:

$$L_c = CE(y', C(\tilde{q}))$$

However, this is the most difficult scenario because success is highly related to the similarity between the source and target labels. Turning a ‘clapping over the head’ motion into a ‘raising two hands’ motion is easy and causes minimal visual changes, while turning a ‘running’ motion into a ‘squat’ motion is impossible without noticeable visual changes.
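The three classification losses differ only in a few lines. A numpy sketch over a predicted distribution `p_hat` (the top-n selection and the stopping criterion live outside the loss, as described above; the small epsilon guard is an implementation detail, not part of the formulation):

```python
import numpy as np

EPS = 1e-12  # numerical guard for log(0)

def loss_ab(p_hat, y_true):
    """Anything-but: minimizing log p_y maximizes the cross
    entropy with the ground-truth label."""
    return np.log(p_hat[y_true] + EPS)

def loss_abn(p_hat, n):
    """Anything-but-N: minimizing sum p log p over the top-n classes
    maximizes the entropy there, flattening the prediction."""
    top = np.argsort(p_hat)[-n:]
    return np.sum(p_hat[top] * np.log(p_hat[top] + EPS))

def loss_sa(p_hat, y_fake):
    """Specified Attack: cross entropy with the chosen fake label."""
    return -np.log(p_hat[y_fake] + EPS)
```

For example, a flat prediction scores lower (better) under `loss_abn` than a confident one, which is exactly the flattening pressure ABN applies.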

3.4 Black-box Attack

Black-box attack assumes that no information about the target classifier is accessible. Under such circumstances, we use attack by transferability [31]: first train a surrogate classifier, then compute adversarial examples by attacking the surrogate, and finally use those adversarial examples to attack the target classifier. In this paper, we do not construct our own surrogate model. Instead, we use an existing classifier as the surrogate to attack the others. In the experiments, we attack several state-of-the-art models; to test the transferability and generalizability of our method, we use every model in turn as the surrogate model and attack the others.

4 Experimental Results

We first introduce the datasets (Section 4.1) and models (Section 4.2) for our experiments. Then we present our white-box (Section 4.3) and black-box (Section 4.4) attack results. Finally, we present our perceptual studies on imperceptibility (Section 4.5).

Since we attack multiple models on multiple datasets, we first use the source code shared by the authors if available or implement the models ourselves. Then we train them strictly following the protocols in their papers. Next, we test the models and collect the data samples that the trained classifiers can successfully recognize, to create our adversarial attack datasets. Finally, we compute the adversarial samples using different attacking strategies.

4.1 Datasets

When choosing datasets, our criteria were: 1. the dataset needs to be widely used and contain 3D skeletal motions; 2. the motion quality needs to be as high as possible, because it is tightly related to our perceptual study. We chose 3 benchmark datasets:

HDM05 dataset [20] is a 3D motion database captured with a Mocap system. It contains 2337 sequences of 130 actions performed by 5 non-professional actors. The 3D joint locations of the subjects are provided in each frame.

Berkeley MHAD dataset [21] is captured using a multi-modal acquisition system. It consists of 11 actions performed by 12 subjects, where 5 repetitions are performed for each action, resulting in 659 sequences. In each frame the 3D joint positions are extracted based on the 3D marker trajectory.

NTU RGB+D dataset [24] is captured by Kinect v2 and is currently one of the largest publicly available datasets for 3D action recognition. It is composed of more than 56,000 action sequences. A total of 60 action classes are performed by 40 subjects. The videos are captured from 80 distinct viewpoints. The 3D coordinates of joints are provided by the Kinect data. Due to the huge number of samples and the large intra-class and viewpoint variations, the NTU RGB+D dataset is very challenging and is highly suitable to validate the effectiveness and generalizability of our approach.

4.2 Target Models

We selected 5 state-of-the-art methods: HRNN [44], ST-GCN [42], AS-GCN [13], DGNN [26] and 2s-AGCN [27], covering both RNN- and GNN-based models. We implemented HRNN following the paper and used the code shared online for the other four methods.

We also followed their protocols in data pre-processing. Specifically, we preprocess the HDM05 dataset and Berkeley MHAD dataset as in [44], and the NTU RGB+D dataset as in [27]. Their respective class numbers are 65, 11 and 60. We also map different skeletons to a standard 25-joint skeleton as in [33] (see Figure 1). Please refer to relevant papers for details.

Figure 1: Skeletal structure with 25 joints and their labels.

The five target models require different inputs, but they can easily be unified. HRNN, ST-GCN and AS-GCN all take joint positions in each frame as input. DGNN and 2s-AGCN additionally require bones. Although bones are taken as an independent input, they can be computed from joint positions. So we add another input layer before the original model to compute bones from joint positions. As this layer only computes bones and introduces no new variables, it does not change the behaviour of the original models.
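The extra layer amounts to a fixed difference of joint positions. A numpy sketch (the 3-joint parent array used in the example is illustrative, not the actual 25-joint hierarchy):

```python
import numpy as np

def bones_from_joints(joints, parents):
    """Fixed, parameter-free input layer: bone j = joint j - parent(j).
    joints: (T, J, 3); parents[j] = -1 marks the root (zero bone).
    Since no trainable variables are introduced, prepending this layer
    leaves the wrapped model's behaviour unchanged."""
    bones = np.zeros_like(joints)
    for j, p in enumerate(parents):
        if p >= 0:
            bones[:, j] = joints[:, j] - joints[:, p]
    return bones
```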

4.3 White-box Attack

In this section, we qualitatively and quantitatively evaluate the performance of SMART on the aforementioned three datasets. We use a learning rate between 0.005 and 0.0005 and a maximum of 300 iterations. The setting for AB and ABN is straightforward. In SA, the number of experiments needed will be prohibitively large if we attack every motion with every other label but the ground-truth. Instead, we randomly select a fake label to attack. Since the number of motions attacked is large, the results are adequately representative. Note that this is a very strict test as most of the motions are rather distinctive.

To perform the attack, we first train the target models using the settings in the original papers to ensure similar training results. We then test the models with the testing dataset and record the motions that can be successfully recognized. Lastly, we attack these motions and record the success rates. For simplicity, we only show representative results in the paper. For more comprehensive results, please refer to the supplementary material and video.

4.3.1 Attack Results

We show the quantitative results of AB in Table 1. High success rates are universally achieved across different datasets and different target models, demonstrating the generalizability of SMART.

Model HDM05 MHAD NTU
HRNN 100 100 99.56
ST-GCN 99.57 99.96 100
AS-GCN 99.36 92.84 97.43
DGNN 96.09 94.46 92.51
2s-AGCN 99.18 95.97 100
mean 98.84 96.65 97.9
Table 1: Success rate (%) of Anything-but (AB) Attack.
Figure 2: Normalized confusion matrix of 2s-AGCN on HDM05 (Left, 65 classes), MHAD (Middle, 11 classes) and NTU (Right, 60 classes). The darker the cell, the higher the value.

In addition, we show the normalized confusion matrices of 2s-AGCN on all datasets using AB (Figure 2). High-resolution results are available in the supplementary material. In the AB attack, we found that semantically similar motions can be easily confused, such as the grab motions (‘Grab_XX’) vs. deposit motions (‘Deposit_XX’) in HDM05, ‘Jumping_in_place’ vs ‘Jumping_jacks’ in MHAD and ‘wear_a_shoe’ vs ‘take_off_a_shoe’ in NTU. Some of them mainly differ in temporal order, and some mainly differ in spatial variations. These classes are hard to distinguish in action recognition and are thus prone to being attacked.

Next, we show the results of ABN in Table 2. As a generalization of AB, we show two variations: AB3 and AB5, suited to attacking classifiers evaluated by top-N accuracy. Although the overall performance is still good, the success rates are relatively lower compared with AB. This verifies our qualitative analysis in Section 3.3: ABN is harder than AB, and AB5 is harder than AB3. In addition, the results on MHAD are worse than on the other two datasets. This is because MHAD has only 11 classes, as opposed to 65 and 60 in the other two; excluding the ground-truth label from the top 5 out of 11 classes is harder than out of 65 or 60 classes.

Model (AB3/AB5) HDM05 MHAD NTU
HRNN 100/100 100/100 99.84/99.62
ST-GCN 93.30/90.28 76.86/70.5 95.86/91.32
AS-GCN 91.46/82.83 42.07/22.34 91.18/82.47
DGNN 93.55/86.32 87.54/74.27 98.73/97.62
2s-AGCN 83.40/75.2 55.9/32.08 100/100
mean 92.34/86.93 72.47/59.84 97.12/94.21
Table 2: Success rate of Anything-but-N Attack. The results are when n = 3 (AB3) and 5 (AB5).

Next, Table 3 shows the results of SA, which are not as good as AB or ABN. Again, this is consistent with our expectation of the relative difficulty of the tasks. SA is the most difficult because randomly selected class labels often come from significantly different action classes. Although it is relatively easy to confuse the model between a deposit motion and a grab motion, it is extremely difficult to do so for a jumping motion and a wear-a-shoe motion. Even under such circumstances, SMART still succeeds in more than 70% of cases on average.

Model HDM05 MHAD NTU
HRNN 67.19 57.41 49.17
ST-GCN 74.95 66.93 100
AS-GCN 64.62 40.18 99.48
DGNN 97.26 96.13 99.99
2s-AGCN 96.72 97.53 100
mean 80.15 71.64 89.73
Table 3: Success rate (%) of Specified Attack.

Finally, we also show one motion in NTU attacked using all three strategies in Figure 3.

Figure 3: AS-GCN on NTU. Four rows from top to bottom: Original, AB, AB5 and SA. ‘brushing_teeth’ is the ground-truth label.

4.3.2 Attack Behavioural Analysis

We also analyze the behaviour of SMART by looking at which joints or joint groups are vulnerable. Initially, we hypothesized that if some joints tend to be attacked together, the correlation between the displacements of these joints should be high. So we computed the norms of joint displacements after the attack and their Pearson correlations. We show the results of HDM05 and MHAD on 2s-AGCN and DGNN respectively, using AB, in Figure 4 Left. Although some local high correlations between joints 2 and 3, 6 and 7, 9 and 10, and 20 and 21 can be found, they are not universal (please see other results in the supplementary material). We then checked whether across-joint correlations are action-dependent, but no universal conclusion was found either.

Figure 4: Joint correlations: displacement-displacement (Left), displacement-speed (Middle) and displacement-acceleration (Right). The brighter the color is, the bigger the value is. Top: 2S-AGCN on HDM05. Bottom: DGNN on MHAD.

Finally, the displacement-speed and displacement-acceleration correlations give a consistent description of SMART, shown in Figure 4 Middle and Right. The correlations are computed between the joint displacements and the original velocities and accelerations, respectively. These two correlations reveal the behaviour of SMART: the higher the speed/acceleration, the more the joint is attacked (shown by the high values along the main diagonal). In addition, they also reveal some consistent across-joint correlations (shown by red squares). Note that the joints in a red square belong to one part of the body (four limbs and the trunk). Finally, this also suggests that joints with high velocity and acceleration are important features for the target models, because these joints are attacked the most.
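The displacement-acceleration correlation can be computed along the following lines (a numpy sketch; motions are (T, J, 3) arrays, per-joint scalar profiles are norms over the 3D components, and the simple frame alignment of displacement and acceleration is a simplifying assumption):

```python
import numpy as np

def displacement_accel_corr(q_orig, q_adv):
    """Pearson correlations between per-joint displacement norms after
    the attack and per-joint acceleration norms of the original motion.
    Returns a (J, J) matrix: entry [i, j] correlates the displacement of
    joint i with the original acceleration of joint j across frames."""
    disp = np.linalg.norm(q_adv - q_orig, axis=-1)               # (T, J)
    acc = np.linalg.norm(np.diff(q_orig, n=2, axis=0), axis=-1)  # (T-2, J)
    T, J = acc.shape[0], disp.shape[1]
    stacked = np.concatenate([disp[:T], acc], axis=1)            # align lengths
    return np.corrcoef(stacked, rowvar=False)[:J, J:]
```

High values on the diagonal of this matrix correspond to the behaviour reported above: joints moving with high acceleration attract the largest perturbations.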

4.4 Black-box Attack

In the black-box setting, we need a surrogate model to fool target models. To this end, we use three models: AS-GCN, DGNN and 2s-AGCN, and one dataset: NTU. These models are the latest state-of-the-art methods, their original implementations are available online, and the NTU dataset is used in all three papers. To test the generalizability of SMART, we take every model in turn as the surrogate model and produce adversarial examples using AB and AB5. Then we use the adversarial examples to attack the other two models. Results are shown in Table 4.

Surrogate \ Target DGNN 2s-AGCN AS-GCN
DGNN (AB/AB5) n/a 90.6 (90.99) 7.24 (7.63)
2s-AGCN (AB/AB5) 98.37 (98.46) n/a 98.10 (98.96)
AS-GCN (AB/AB5) 10.90 (12.97) 91.17 (91.99) n/a
Table 4: Success rate (%) of black-box attack.

Firstly, AB5 results are in general better than AB. We speculate there are two factors. First, the predictive class distribution of AB5 is likely to be flatter than that of AB. This flatness improves transferability, because a target model with similar decision boundaries will also produce a similarly flat predictive distribution and is thus more likely to be fooled. Second, since the ground-truth label is pushed out of the top 5 classes in the surrogate model, it is also likely to be far from the top in the target model.

We also notice that the transferability is not universally successful. DGNN and AS-GCN cannot easily fool one another, while 2s-AGCN can fool and be fooled by both of them. Since transferability can be described by distances between decision boundaries [31], our speculation is that 2s-AGCN's boundary structure overlaps significantly with those of both DGNN and AS-GCN, while the other two overlap little. The theoretical reason is hard to identify, as formal analysis of transferability has only just emerged for static data [31, 47]. Theoretical analysis for time-series data is beyond the scope of this paper and is left for future work.

4.5 Perceptual Study

Imperceptibility is a requirement for adversarial attack. Although it is possible to do qualitative visual comparisons on image-based attack, more rigorous evaluations need to be done for more complex data [40]. This is especially the case for motions which are time-series data.

To investigate the imperceptibility, we conducted three user studies (Deceitfulness, Naturalness and Indistinguishability). Since our sample space is huge (5 models × 3 datasets × 3 attacking strategies), we chose the most representative model and data. We chose 2s-AGCN as our model, because it is one of the latest state-of-the-art methods, with AB as the attacking strategy. We used the HDM05 and MHAD datasets because the NTU dataset has motion jittering and generally lower visual quality (see the video for details). In total, we recruited 37 subjects (aged between 18 and 37).

Deceitfulness. In each user study, we randomly chose 100 motions, with their ground-truth and attacked labels, for 100 trials. In each trial, the video was played for 6 seconds, then the user was asked to choose which label best describes the motion, with no time limit. This tests whether SMART visually changes the semantics of the motion.

Naturalness. Since unnatural motions can easily be identified as the result of an attack, we performed ablation tests on different loss term combinations. We designed four settings: l2, l2-acc, l2-bone and SMART. l2 uses only the error of joint locations, l2-acc is l2 plus the acceleration profile loss, l2-bone is l2 plus the bone-length loss, and SMART is our proposed perceptual loss. We first show qualitative comparisons in Figure 5; video comparisons are available in the supplementary video. Visually, SMART is the best. Even from static postures, one can easily see the artifacts caused by joint displacements, with the spinal joints being the most obvious: the joint displacements cause unnatural zig-zag bending in l2, l2-acc and l2-bone.

Figure 5: Visual ablation test between different loss terms.

Next, we conducted user studies. In each study, we randomly selected 50 motions. For each motion, we made two trials. The first includes one motion attacked by SMART and one randomly selected from l2, l2-acc and l2-bone. The second includes two motions randomly drawn from l2, l2-acc and l2-bone. The first trial aims to evaluate our results against the alternatives, and the second gives insight into the impact of different perceptual loss terms. In each of the 100 trials, the two motions were played together for 6 seconds twice, then the user was asked to choose which motion looks more natural, or to indicate that they cannot tell the difference, with no time limit.

Indistinguishability. In this study, we ran a very strict test of whether users can tell if a motion has been changed at all. In each experiment, 100 pairs of motions were randomly selected. In each trial, the left motion is always the original and the user is told so. The right one can be the original (sensitivity) or attacked (perceivability). Each video was played for 6 seconds, then the user was asked to decide, with no time limit, whether the right motion is a changed version of the left. This user study serves two purposes: perceivability is a direct test of the indistinguishability of the attack, while sensitivity screens out subjects who tend to give random choices. Most users were able to recognize whether two motions are the same (close to 100% accuracy), but a few answered more randomly. We discard the data of any user who falls below 80% accuracy on the sensitivity test.

4.5.1 Results

The success rate of Deceitfulness is 93.32% overall, which means that most of the time SMART does not change the semantics of the motions. Looking at the success rates on different datasets, SMART achieved 86.77% on HDM05 and 96.38% on MHAD. So we investigated in which motions SMART did change the visual meaning, and discovered that the confusion was caused by motion ambiguity in the original data and labels. For instance, when a ‘Hopping’ motion is attacked into a ‘Jumping’ motion, some users chose ‘Jumping’. Similar situations occur for ‘Waving one hand’ vs ‘Throwing a ball’ and ‘Walking forward’ vs ‘Walking LC’. All these motions have small spatial variations and do not distinguish well.

Next, Figure 6 shows the results of Naturalness. Users’ preferences over the different losses, from most to least natural, are SMART > l2-acc > l2 > l2-bone. SMART leads to the most natural results, as expected.

Figure 6: Normalized user preferences on Naturalness. our: SMART. bone: l2-bone. acc: l2-acc. The vertical axis is the percentage of user preference for a particular loss.

Dynamics in Imperceptibility. To investigate the benefit of exploiting dynamics over joint-only perturbation, we further analysed the SMART-vs-l2 trials in which users chose SMART over l2. We first computed the respective joint-wise deviations of the two attacks from the original motions, shown in Figure 7. In general, the perturbations of SMART are larger than those of l2 and have larger standard deviations, yet users still preferred SMART. This indicates that, with proper exploitation of dynamics, larger joint deviations can produce even more visually desirable results. This is significantly different from static data (e.g. images), where lp-norms are believed to be tightly tied to imperceptibility [7] (under an ε-ball constraint).

Figure 7: The mean (Top) and standard deviation (Bottom) of the joint-wise deviations of SMART and l2.
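The joint-wise deviation statistics plotted in Figure 7 can be sketched as below. This is an illustrative reconstruction rather than the authors' code; the motion shape (frames × joints × 3) and the 25-joint skeleton are assumptions for the example.

```python
import numpy as np

# Per-joint deviation of an attacked motion from the original:
# Euclidean distance per joint per frame, then mean/std over time,
# as in Figure 7 (mean on top, standard deviation on the bottom).

def joint_deviation(original, attacked):
    """original, attacked: arrays of shape (frames, joints, 3).

    Returns per-joint (mean, std) of the Euclidean deviation over time.
    """
    d = np.linalg.norm(attacked - original, axis=-1)  # (frames, joints)
    return d.mean(axis=0), d.std(axis=0)

rng = np.random.default_rng(0)
orig = rng.standard_normal((60, 25, 3))       # a 60-frame, 25-joint motion
attacked = orig + 0.01 * rng.standard_normal((60, 25, 3))
mean_dev, std_dev = joint_deviation(orig, attacked)
assert mean_dev.shape == (25,) and std_dev.shape == (25,)
```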

Finally, we ran the Indistinguishability test. The success rate is 81.9% on average, 80.83% on HDM05 and 83.97% on MHAD. Note that this is a side-by-side comparison and thus very harsh on SMART: the users were asked to find any visual difference they could. To avoid situations where motions are too fast for differences to be spotted (e.g. kicking and jumping motions), we also played the motions at one third of the original speed. Even under such harsh conditions, SMART still fools humans most of the time.

5 Discussion

Imperceptibility is vital in adversarial attack. When it comes to skeletal motions, perceptual studies are essential because existing metrics (e.g. lp-norms) cannot fully reflect perceived realism/naturalness/quality. In addition, they help us uncover a unique feature of attacking skeletal motions: losses based solely on joint location deviations are often overly conservative. This is understandable, because such losses were designed for static data and thus cannot fully exploit the dynamics.

Next, imposing the joint deviation as a hard constraint [16] is not the best strategy for our problem. First, a threshold must be chosen, and it is unclear how to set it. Second, restricting the joint deviation assumes that it is the sole dominant factor in imperceptibility, whereas our perceptual study shows that larger joint deviations are acceptable when the dynamics are exploited properly.
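The hard-constraint strategy discussed above can be sketched as a projection step: after each optimisation update, the perturbation is projected back onto an ε-ball around the original joints, so the deviation can never exceed the preset threshold. This is a minimal sketch under assumed settings (an l2 ball and ε = 0.05 are arbitrary illustrative choices); choosing ε is exactly the open question raised in the text.

```python
import numpy as np

# Project a perturbation back onto an l2 epsilon-ball: if its norm
# exceeds eps, rescale it so the norm equals eps; otherwise leave it.

def project_l2_ball(perturbation, eps=0.05):
    """Return the perturbation clipped to l2 norm at most eps."""
    norm = np.linalg.norm(perturbation)
    if norm > eps:
        perturbation = perturbation * (eps / norm)
    return perturbation

delta = np.ones(10)                     # l2 norm = sqrt(10) ≈ 3.16
clipped = project_l2_ball(delta, eps=0.05)
assert np.isclose(np.linalg.norm(clipped), 0.05)
```

A perceptual loss, by contrast, penalises unnatural dynamics softly instead of capping the deviation, which is what allows the larger-but-natural perturbations observed in the user study.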

Lastly, we could also use joint angles as representations. However, in practice, we find that perturbing joint angles causes jittering and makes the system hard to optimize, similar to [33].

6 Conclusion and Future Work

In this paper, we proposed a method, SMART, to attack action recognizers based on 3D skeletal motions. Through comprehensive qualitative and quantitative evaluations, we show that SMART generalizes across multiple state-of-the-art models on various benchmark datasets. Moreover, SMART is versatile, as it can deliver both white-box and black-box attacks with multiple attacking strategies. Finally, SMART is deceitful, as verified in extensive perceptual studies. We hope this work will lead to further investigation and countermeasures that improve the robustness of action recognizers.

In the future, we would like to theoretically investigate why transferability varies between different models under black-box attack. Although there has been such research on static data, dynamic data has not yet been investigated; this involves testing more target models and developing a method to describe the structures of class boundaries. We will also investigate what kinds of attack fall in people’s perceptual blind regions. Our experiments indicate that there may be a blind-region sub-space of motions in which changes are not perceivable to humans; adversarial attacks confined to this sub-space would yield even better results with respect to human perception.


  • [1] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2017) Synthesizing robust adversarial examples. arXiv abs/1707.07397. Cited by: §2.2.
  • [2] M. Devanne, H. Wannous, S. Berretti, P. Pala, M. Daoudi, and A. Del Bimbo (2015) 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold. IEEE Transactions on Cybernetics 45 (7), pp. 1340–1352. Cited by: §2.1.
  • [3] Y. Du, W. Wang, and L. Wang (2015) Hierarchical recurrent neural network for skeleton based action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1110–1118. Cited by: §1, §2.1.
  • [4] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song (2017) Robust physical-world attacks on machine learning models. arXiv abs/1707.08945. Cited by: §2.2.
  • [5] B. Fernando, E. Gavves, M. José Oramas, A. Ghodrati, and T. Tuytelaars (2015) Modeling video evolution for action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5378–5387. Cited by: §2.1.
  • [6] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. Vol. abs/1412.6572. Cited by: §2.2.
  • [7] Q. Huang, I. Katsman, H. He, Z. Gu, S. J. Belongie, and S. Lim (2019) Enhancing adversarial example transferability with an intermediate level attack. CoRR abs/1907.10823. External Links: Link, 1907.10823 Cited by: §4.5.1.
  • [8] N. Inkawhich, M. Inkawhich, Y. Chen, and H. Li (2018) Adversarial attacks for optical flow-based action recognition classifiers. arXiv abs/1811.11875. Cited by: §2.2.
  • [9] F. Karim, S. Majumdar, and H. Darabi (2019) Adversarial attacks on time series. arXiv abs/1902.10755. Cited by: §1.
  • [10] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid (2017) A new representation of skeleton sequences for 3d action recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4570–4579. Cited by: §2.1.
  • [11] D. P. Kingma and J. L. Ba (2014) Adam: a method for stochastic optimization. arXiv abs/1412.6980v9. Cited by: §3.1.
  • [12] A. Kurakin, I. J. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv abs/1607.02533. Cited by: §2.2.
  • [13] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian (2019-06) Actional-structural graph convolutional networks for skeleton-based action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2.
  • [14] F. Liao, M. Liang, Y. Dong, T. Pang, X. Hu, and J. Zhu (2018-06) Defense against adversarial attacks using high-level representation guided denoiser. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [15] H. D. Liu, M. Tao, C. Li, D. Nowrouzezahrai, and A. Jacobson (2019) Beyond pixel norm-balls: parametric adversaries using an analytically differentiable renderer. In International Conference on Learning Representations, Cited by: §2.2.
  • [16] J. Liu, N. Akhtar, and A. Mian (2019) Adversarial attack on skeleton-based human action recognition. arXiv abs/1909.06500. Cited by: §2.2, §5.
  • [17] J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 816–833. Cited by: §2.1.
  • [18] M. Liu, H. Liu, and C. Chen (2017) Enhanced skeleton visualization for view invariant human action recognition. Pattern Recogn. 68 (C), pp. 346–362. Cited by: §2.1.
  • [19] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. Cited by: §2.2.
  • [20] M. Müller, T. Röder, M. Clausen, B. Eberhardt, B. Krüger, and A. Weber (2007-06) Documentation mocap database hdm05. Technical report Technical Report CG-2007-2, Universität Bonn. External Links: ISSN 1610-8892 Cited by: §4.1.
  • [21] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy (2013-01) Berkeley mhad: a comprehensive multimodal human action database. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), Vol. , pp. 53–60. External Links: Document, ISSN Cited by: §4.1.
  • [22] N. Papernot, P. D. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2015) The limitations of deep learning in adversarial settings. arXiv abs/1511.07528. Cited by: §2.2.
  • [23] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU rgb+d: a large scale dataset for 3d human activity analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1010–1019. Cited by: §2.1.
  • [24] A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016-06) NTU rgb+d: a large scale dataset for 3d human activity analysis. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.1.
  • [25] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter (2016) Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 23rd ACM SIGSAC Conference on Computer and Communications Security, Cited by: §1.
  • [26] L. Shi, Y. Zhang, J. Cheng, and H. Lu (2019-06) Skeleton-based action recognition with directed graph neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7912–7921. Cited by: §2.1, §4.2.
  • [27] L. Shi, Y. Zhang, J. Cheng, and H. Lu (2019-06) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, §4.2.
  • [28] S. Song, C. Lan, J. Xing, W. Zeng, and J. Liu (2017) An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, pp. 4263–4270. Cited by: §2.1.
  • [29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. arXiv abs/1312.6199. Cited by: §1, §2.2.
  • [30] Y. Tang, Y. Tian, J. Lu, P. Li, and J. Zhou (2018) Deep progressive reinforcement learning for skeleton-based action recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 5323–5332. Cited by: §2.1.
  • [31] F. Tramèr, N. Papernot, I. J. Goodfellow, D. Boneh, and P. D. McDaniel (2017) The space of transferable adversarial examples. arXiv abs/1704.03453. Cited by: §1, §3.4, §4.4.
  • [32] R. Vemulapalli, F. Arrate, and R. Chellappa (2014) Human action recognition by representing 3d skeletons as points in a lie group. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 588–595. Cited by: §2.1.
  • [33] H. Wang, E. S. L. Ho, H. P. H. Shum, and Z. Zhu (2019) Spatio-temporal manifold learning for human motions via long-horizon modeling. IEEE Transactions on Visualization and Computer Graphics (), pp. 1–1. External Links: Document, ISSN 1077-2626 Cited by: §3.2, §3.2, §4.2, §5.
  • [34] H. Wang, K. A. Sidorov, P. Sandilands, and T. Komura (2013) Harmonic parameterization by electrostatics. ACM Transactions on Graphics (TOG) 32 (5), pp. 155. Cited by: §3.2.
  • [35] J. Wang and A. Cherian (2018) Learning discriminative video representations using adversarial perturbations. In Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), pp. 716–733. Cited by: §2.2.
  • [36] X. Wei, J. Zhu, and H. Su (2018) Sparse adversarial perturbations for videos. arXiv abs/1803.02536. Cited by: §1.
  • [37] X. Wei, J. Zhu, S. Yuan, and H. Su (2018) Sparse adversarial perturbations for videos. In AAAI, Cited by: §2.2.
  • [38] C. Xiang, C. R. Qi, and B. Li (2019-06) Generating 3d adversarial point clouds. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9136–9144. Cited by: §1, §2.2.
  • [39] C. Xiao, B. Li, J. Zhu, W. He, M. Liu, and D. Song (2018) Generating adversarial examples with adversarial networks. In IJCAI, pp. 3905–3911. Cited by: §2.2.
  • [40] C. Xiao, D. Yang, B. Li, J. Deng, and M. Liu (2019) MeshAdv: adversarial meshes for visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6898–6907. Cited by: §1, §2.2, §4.5.
  • [41] C. Xiao, J. Zhu, B. Li, W. He, M. Liu, and D. Song (2018) Spatially transformed adversarial examples. In International Conference on Learning Representations, Cited by: §2.2.
  • [42] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, Cited by: §4.2.
  • [43] S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI’18. Cited by: §2.1, §2.2.
  • [44] Yong Du, W. Wang, and L. Wang (2015-06) Hierarchical recurrent neural network for skeleton based action recognition. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 1110–1118. External Links: Document, ISSN Cited by: §4.2, §4.2.
  • [45] X. Zeng, C. Liu, Y. Wang, W. Qiu, L. Xie, Y. Tai, C. Tang, and A. L. Yuille (2019) Adversarial attacks beyond the image space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4302–4311. Cited by: §2.2.
  • [46] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng (2019) View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1963–1978. Cited by: §2.1.
  • [47] C. Zhao, P. Fletcher, M. Yu, Y. Peng, G. Zhang, and C. Shen (2019-07) The adversarial attack and detection under the fisher information metric. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 5869–5876. External Links: Document Cited by: §4.4.