Towards Understanding the Adversarial Vulnerability of Skeleton-based Action Recognition

05/14/2020 ∙ by Tianhang Zheng, et al.

Skeleton-based action recognition has attracted increasing attention due to its strong adaptability to dynamic circumstances and its potential for broad applications such as autonomous and anonymous surveillance. With the help of deep learning techniques, it has witnessed substantial progress and currently achieves around 90% accuracy in benign environments. On the other hand, research on the vulnerability of skeleton-based action recognition under different adversarial settings remains scant, which may raise security concerns about deploying such techniques in real-world systems. Filling this research gap is challenging due to the unique physical constraints of skeletons and human actions. In this paper, we conduct a thorough study towards understanding the adversarial vulnerability of skeleton-based action recognition. We first formulate the generation of adversarial skeleton actions as a constrained optimization problem by representing or approximating the physiological and physical constraints with mathematical formulations. Since the primal optimization problem with equality constraints is intractable, we propose to solve it by optimizing its unconstrained dual problem using ADMM. We then specify an efficient plug-in defense, inspired by recent theories and empirical observations, against adversarial skeleton actions. Extensive evaluations demonstrate the effectiveness of the attack and the defense method under different settings.




1. Introduction

Action recognition is an important task in computer vision, motivated by many downstream applications such as video surveillance and indexing, and human-machine interaction (Cheng et al., 2015). It is also a very challenging task, since it requires capturing long-term spatial-temporal context and understanding the semantics of actions. One method proposed by the community is to learn action recognition from the human skeleton information collected by cameras or sensors, where an action is represented by a time series of human joint locations. Compared with video streams, the skeleton representation is more robust to variance in background conditions, and also easier to handle for machine learning models due to its compactness. Recent advances in deep learning have boosted the performance of this method. Currently, a variety of deep learning model structures have been applied to skeleton-based action recognition, including convolutional neural networks (Ke et al., 2017; Li et al., 2018b), recurrent neural networks (Li et al., 2018; Si et al., 2019), and graph neural networks (Yan et al., 2018; Shi et al., 2019; Liu et al., 2020).

Figure 1. The targeted setting: misleading the model into recognizing "kicking people" as "drinking water" (a normal action) by perturbing the skeleton action. To launch the attack in a real-world scenario (e.g., under a surveillance camera), the adversarial skeleton action should satisfy certain constraints. The figure is drawn based on (Shahroudy et al., 2016).

On the other hand, existing work has demonstrated the vulnerability of deep learning techniques to adversarial examples in many application domains. This phenomenon gives us good reason to suspect that the DNNs for skeleton-based action recognition might also be vulnerable to adversarial skeleton examples, despite achieving high accuracy in benign environments. Note that a thorough study on the adversarial vulnerability of action-recognition models is indispensable before deploying them to real-world applications such as surveillance systems, because otherwise potential adversaries might easily deceive those systems by generating and imitating specific adversarial actions. However, the study of adversarial skeleton examples is scant and non-trivial (the only parallel work is detailed in Section 2.3), due to the fundamental differences between the properties of adversarial skeleton actions and other adversarial examples. The differences are mainly caused by the bones between joints and the joint angles, which impose unique spatial constraints on skeleton data (Shahroudy et al., 2016). Specifically, in the generated adversarial skeleton actions, the lengths of the bones must remain the same, and the joint angles cannot violate certain physiological structures. Otherwise, the adversarial actions are not reproducible by the individuals who perform the original actions. Also, considering the physical conditions of human beings, the speeds of the motions in the adversarial actions should be constrained as well.

To address the above issues, in this paper we propose an optimization-based method for generating adversarial skeleton actions. Specifically, we formulate the generation of adversarial skeleton actions as a constrained optimization problem by representing those constraints with mathematical equations. Since the primal constrained problem is intractable, we instead solve its dual problem. Moreover, since all the constraints are represented by equality constraints, both primal and dual variables are unrestricted in the dual problem. We further specify an efficient algorithm based on ADMM to solve the unconstrained dual problem, in which the internal minimization objective is optimized by an Adam optimizer, and the external maximization objective is optimized by one-step gradient ascent. We show that this algorithm can find an adversarial skeleton action within 200 internal steps.

Beyond the attack, we further propose an efficient defense against adversarial skeleton actions based on previous theories and empirical observations. Our defense consists of two core steps, i.e., adding Gaussian noise and applying Gaussian filtering to the action data. The first step, adding Gaussian noise, is inspired by recent advances in certified defenses. Specifically, adding Gaussian noise to the input is proved to be a certified defense, which means that additive Gaussian noise on an adversarial example can guarantee a correct model prediction (with high probability), as long as the adversarial perturbation is restricted within a certain radius in the neighborhood of the original data sample. Note that there are also several other methods to certify model robustness, such as the dual approach, interval analysis, and abstract interpretations (Dvijotham et al., 2018; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). We adopt the Gaussian noise method because it is simple, effective, and, more importantly, scalable to complicated models. Note that skeleton-based action recognition models are generally more complicated than the common ConvNets certified by (Dvijotham et al., 2018; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). The second step is to smooth the skeleton frames along the temporal axis using a Gaussian filter. This step does not affect the robustness certified by the first step, due to the post-processing property (Lecuyer et al., 2018; Li et al., 2018a; Cohen et al., 2019), but in practice it filters out a certain amount of adversarial perturbation and random noise, thus making our defense applicable to normally trained models.
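As a rough sketch, the two defense steps might look like the following (the noise level and kernel width here are illustrative choices, not the paper's exact settings; the skeleton clip is assumed to be a T×N×3 array):

```python
import numpy as np

def defend_preprocess(x, sigma=0.01, kernel_size=5, rng=None):
    """Two-step plug-in defense sketch: add Gaussian noise, then smooth
    each joint coordinate along the temporal axis with a Gaussian filter."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = x + rng.normal(0.0, sigma, size=x.shape)   # step 1: additive noise

    # step 2: temporal Gaussian filtering (applied per joint, per coordinate)
    t = np.arange(kernel_size) - kernel_size // 2
    kernel = np.exp(-t**2 / 2.0)
    kernel /= kernel.sum()                             # normalized kernel
    smoothed = np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), axis=0, arr=noisy)
    return smoothed

# a zero clip of 64 frames, 25 joints, 3D coordinates
x = np.zeros((64, 25, 3))
out = defend_preprocess(x, sigma=0.01)
assert out.shape == x.shape
```

The temporal filter shrinks the variance of the injected (and any adversarial) perturbation, which is exactly why the second step helps normally trained models.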

Our proposed attack and defense are evaluated on two open-source models, i.e., 2s-AGCN and HCN. We select these two models because the authors have released the code and hyperparameters on GitHub, so that we can correctly reproduce their results; these two models also achieve fairly good performance. Extensive evaluations show that our attack achieves a high attack success rate with almost no violation of the constraints. Moreover, the visualization results, including images and videos, demonstrate that the difference between the original and adversarial skeleton actions is imperceptible. Extensive evaluations also show that our defense is effective and efficient. Specifically, our defense substantially improves the empirical accuracy of normally trained models against adversarial skeleton actions under different settings.

To summarize, our main contributions are four-fold:

  1. We identify the constraints that need to be considered in adversarial skeleton actions, and formulate the problem of generating adversarial skeleton actions as a constrained optimization problem by representing those constraints as mathematical equations.

  2. We propose to solve the primal constrained problem by optimizing its dual problem using ADMM, which is the first attempt at generating adversarial actions with ADMM and yields outstanding performance.

  3. We propose an efficient two-step plug-in defense against adversarial skeleton actions, and specify the defense in both inference and certification stages.

  4. We conduct extensive evaluations, and provide several interesting observations regarding adversarial skeleton actions based on the experimental results.

2. Preliminaries

2.1. Definitions and Notations

Let X and y respectively denote a data sample and its label, where y ∈ {1, …, K} and K is the number of all possible classes. For an image, X is a 2D matrix. For a skeleton action studied in this paper, X = {x_{t,j} : 1 ≤ t ≤ T, 1 ≤ j ≤ N}, where x_{t,j} ∈ R^3 denotes the position (coordinates) of the j-th joint of the t-th skeleton frame in an action sequence, with N and T denoting the number of joints in a skeleton and the number of skeleton frames in an action sequence, respectively. The corresponding adversarial skeleton action is denoted by X̃. We take the skeletons in the largest dataset, i.e., the NTU RGB+D dataset, as an example. As shown in Figure 2, there are in total 25 joints in a skeleton frame, and thus N = 25. The number of frames T differs for each skeleton action, and usually, we subsample a constant number of frames from each sequence or pad zeros after each sequence to endow all the skeleton actions with the same T. Let f_θ denote a classification network, where θ represents the network weights. The logit output on X is denoted by Z(X), with elements Z_i(X) (i = 1, …, K). f_θ correctly classifies X iff argmax_i Z_i(X) = y. The goal of adversarial attacks is to find an adversarial sample X̃, which satisfies several pre-defined constraints, such that argmax_i Z_i(X̃) ≠ y (untargeted) or argmax_i Z_i(X̃) = y′, where y′ is the target label. A commonly-used constraint is that X̃ should be close to the original sample X according to some distance metric.

Figure 2. Skeleton Representation

2.2. DNNs for Skeleton-based Action Recognition

In the following, we briefly introduce the two DNNs used to evaluate our proposed attack method in this project. HCN is a CNN-based end-to-end hierarchical network for learning global co-occurrence features from skeleton data (Li et al., 2018b). HCN is designed to learn different levels of features from both the raw skeleton and the skeleton motion. The joint-level features are learned by a multi-layer CNN, and the global co-occurrence features are learned from the fused joint-level features. Finally, the co-occurrence features are fed to a fully-connected network for action classification. 2s-AGCN is one of the state-of-the-art GCN-based models for skeleton-based action recognition. In contrast to the earliest GCN-based model (i.e., ST-GCN), 2s-AGCN learns the appropriate graph topology of every skeleton action rather than predefining the graph topology. This enables 2s-AGCN to capture the implicit connections between joints in certain actions, such as the connection between hand and face in the "wiping face" action. Besides, 2s-AGCN also adopts a two-stream framework to learn from both static and motion information. Overall, 2s-AGCN improves the accuracy of ST-GCN significantly, by nearly 7%.

2.3. Adversarial Attacks

Since the discovery of adversarial examples, the community has developed hundreds of attack methods for generating adversarial samples. In the following, we introduce four attack methods plus a parallel work, and discuss the differences between our proposed method and these attacks.

Fast Gradient Sign Method (FGSM)

FGSM is a typical one-step adversarial attack algorithm proposed by (Goodfellow et al., 2014). The algorithm updates a benign sample along the direction of the gradient of the loss w.r.t. the sample. Formally, FGSM follows the update rule

X̃ = clip(X + ε · sign(∇_X J(X, y))),

where ε controls the maximum perturbation of the adversarial samples, and the clip function clips its input into the valid element-wise value range.
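A minimal sketch of the FGSM update on a toy input (the gradient is supplied directly here; in practice it comes from backpropagation through the model):

```python
import numpy as np

def fgsm(x, grad, eps, x_min=0.0, x_max=1.0):
    """One-step FGSM: move along the sign of the loss gradient,
    then clip back into the valid element-wise value range."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, x_min, x_max)

x = np.array([0.2, 0.5, 0.8])
grad = np.array([0.3, -1.2, 0.0])     # gradient of the loss w.r.t. x
x_adv = fgsm(x, grad, eps=0.1)
assert np.allclose(x_adv, [0.3, 0.4, 0.8])
```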

Projected Gradient Descent (PGD)

PGD (Kurakin et al., 2016; Madry et al., 2017) is a strong iterative version of FGSM, which executes the FGSM update for multiple steps with a smaller step size and then projects the updated adversarial example into a pre-defined ℓp-norm ball. Specifically, in each step, PGD updates the sample by

X̃^{k+1} = Π_ε(X̃^k + α · sign(∇_X J(X̃^k, y))),

where the projection Π_ε is a clip function for ℓ∞-norm balls, and an ℓ2 normalizer for ℓ2-norm balls.
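A corresponding sketch of the PGD loop for ℓ∞-norm balls (the toy `grad_fn` stands in for backpropagation through a model):

```python
import numpy as np

def pgd_linf(x0, grad_fn, eps, alpha, steps, x_min=0.0, x_max=1.0):
    """Iterative FGSM with projection onto the l_inf ball of radius eps
    around the original sample x0."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # projection for l_inf balls
        x = np.clip(x, x_min, x_max)         # stay in the valid value range
    return x

# toy "loss": push x away from the origin, so the gradient is x itself
x0 = np.array([0.5, 0.5])
x_adv = pgd_linf(x0, grad_fn=lambda x: x, eps=0.1, alpha=0.05, steps=10)
assert np.max(np.abs(x_adv - x0)) <= 0.1 + 1e-9   # never leaves the eps-ball
```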

Carlini and Wagner Attack

(Carlini and Wagner, 2017) proposes an attack called the C&W attack, which generates ℓp-norm adversarial samples by optimizing the C&W loss:

minimize D(X, X̃) + c · f(X̃),

where D(X, X̃) represents some distance metric between the benign sample X and the adversarial sample X̃ (the metrics used in (Carlini and Wagner, 2017) include ℓ0, ℓ2, and ℓ∞ distances), and f is a customized margin loss. It is worth noting that our proposed attack is completely different from PGD or the C&W attack. For PGD, C&W, and many other attacks, the simple constraints on pixel values can be resolved by projection functions or naturally incorporated into the objective by a clip/tanh function. However, in our scenario, the constrained optimization problem is much more complicated, and thus has to be solved by more advanced methods.
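A C&W-style objective on toy logits might be sketched as follows (the logits `z`, the distance value, and the trade-off constant `c` are illustrative; the margin loss becomes zero once the target logit dominates all others by kappa):

```python
import numpy as np

def cw_objective(z, y_target, dist, c=1.0, kappa=0.0):
    """C&W-style targeted objective: distance term plus c times a margin
    loss, max(max_{i != target} z_i - z_target + kappa, 0)."""
    others = np.delete(z, y_target)
    margin = max(others.max() - z[y_target] + kappa, 0.0)
    return dist + c * margin

z = np.array([2.0, 5.0, 1.0])                 # logits of a hypothetical model
assert cw_objective(z, y_target=1, dist=0.3) == 0.3          # already target class
assert cw_objective(z, y_target=0, dist=0.3, c=2.0) == 6.3   # 0.3 + 2 * (5 - 2)
```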

ADMM-based Adversarial Attack

(Zhao et al., 2018) also proposes a framework based on ADMM to generate adversarial examples. However, our proposed attack is completely different from (Zhao et al., 2018) in two aspects. First, the constraints we consider are more complicated than ℓp-norm constraints, which makes ADMM more appropriate here than the other attack algorithms. Second, we formulate the problem in a completely different manner: (Zhao et al., 2018) follows the ADMM framework to break a problem defined like Eq. 3 into two sub-problems, while our attack formulates a completely different problem with indispensable equality constraints, to which ADMM is naturally an appropriate solution.

Adversarial Attack on Skeleton Action

Note that (Liu et al., 2019) is a parallel work that proposes an attack based on FGSM and BIM (PGD) to generate adversarial skeleton actions. Specifically, (Liu et al., 2019) adapts FGSM and BIM to skeleton-based action recognition by using a clipping function and an alignment operation to impose the bone and joint constraints on the updated adversarial skeleton actions in each iteration. However, that method is very different from our work. First, the joint constraint considered in (Liu et al., 2019) is not the constraint on joint angles mentioned before. Second, the alignment operation might corrupt the perturbation learned in each iteration. In contrast to (Liu et al., 2019), we formulate adversarial skeleton action generation as a constrained optimization problem with equality constraints. Reformulating the equality constraints with Lagrangian multipliers yields an unconstrained dual optimization problem, which does not need any complicated additional operation in the optimization process. Third, we propose to solve the dual optimization problem by ADMM, which is a more appropriate algorithm for optimizing complicated constrained problems. As a result, our attack achieves better performance than (Liu et al., 2019), as detailed in Section 6.1. Finally, we specify a defense method against adversarial skeleton actions based on state-of-the-art theories and our observations.

2.4. Alternating Direction Method of Multipliers (ADMM)

Alternating Direction Method of Multipliers (ADMM) is a powerful optimization algorithm for handling large-scale statistical tasks in diverse application domains. It blends the decomposability of dual ascent with the good convergence properties of the method of multipliers. ADMM plays a significant role in solving statistical problems such as support vector machines (Forero et al., 2010), trace-norm-regularized least-squares minimization (Yang et al., 2013), and constrained sparse regression (Bioucas-Dias and Figueiredo, 2010). Beyond convex problems, ADMM is also a widely used solution for some nonconvex problems, whose objective functions may be nonconvex, nonsmooth, or both. (Wang et al., 2019) shows that ADMM is able to converge as long as the objective has a smooth part, while the remaining part can be coupled or nonconvex, or include separable nonsmooth functions. Applications of ADMM to nonconvex problems include network inference (Miksik et al., 2014), global conformal mapping, and noisy color image restoration (Lai and Osher, 2014).

2.5. Adversarial Defenses

Both learning and security communities have developed many defensive methods against adversarial examples. Among them, adversarial training and several certified defenses attract the most attention due to their outstanding/guaranteed performance against strong attacks (He et al., 2017; Uesato et al., 2018; Athalye et al., 2018). In the following, we briefly introduce adversarial training and several certified defenses, including the randomized smoothing method adopted in this paper.

Adversarial Training

Adversarial training is one of the most successful empirical defenses of the past few years (Goodfellow et al., 2014; Madry et al., 2017; Zhang et al., 2019). The intuition of adversarial training is to improve model robustness by training the model on adversarial examples. Although adversarial training achieves tremendous success against many strong attacks (Zheng et al., 2019; Andriushchenko et al., 2019; Tashiro et al., 2020), its performance is not theoretically guaranteed and thus might be compromised in the future. Besides, adversarial training requires much more computational resources than standard training, making it less scalable to complicated models.

Certified Defenses

A defense with a theoretical guarantee on its defensive performance is considered a certified defense. In general, there are three main approaches to designing certified defenses. The first approach is to formulate the certification problem as an optimization problem and bound it by the dual approach and convex relaxations (Dvijotham et al., 2018; Raghunathan et al., 2018; Wong and Kolter, 2018). The second approach approximates a convex set that contains all the possible outputs of each layer to certify an upper bound on the range of the final output (Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). The third is the randomized smoothing method used in this paper. The only essential operation of this method is to add Gaussian/Laplace noise to the inputs, which is simple and applicable to any deep learning model. (Lecuyer et al., 2018) first proves that randomized smoothing is a certified defense using theories from differential privacy. (Li et al., 2018a) improves the certified bound using a lemma on Rényi divergence. Cohen et al. (Cohen et al., 2019) prove a tight bound on the robust radius certified by adding Gaussian noise using the Neyman-Pearson lemma. (Jia et al., 2019) further extends the approach of (Cohen et al., 2019) to the top-k classification setting. Since the bound proved by (Cohen et al., 2019) is the tightest, we use it for certification; for inference, we adopt the approach in (Lecuyer et al., 2018) due to its efficiency in practice.

3. Threat Model

3.1. Adversary Knowledge: White-box Setting

In this paper, we follow the white-box setting, where the adversary has full access to the model architecture and parameters. We make this assumption because (i) it is a safe, conservative, and realistic assumption, since we might never know the knowledge of potential adversaries about the model (Carlini and Wagner, 2017), which varies among adversaries and also changes over time; and (ii) for systems/devices equipped with an action recognition model, recognition is more likely to be done locally, or on a local cloud, making it easy for an adversary to acquire the model parameters from his own system/device. Note that although most of the experiments on the proposed attack and defense are done under the white-box setting, we also include several experiments evaluating the transferability of our attack.

3.2. Adversary Goal: Targeted & Untargeted Setting

Under the targeted setting, the goal of an adversary is to mislead the recognition model into predicting the adversarial skeleton action as a targeted label pre-defined by the adversary. For instance, suppose the adversary is "kicking" someone under a surveillance camera equipped with an action recognition model. The adversary may launch a targeted attack to mislead the model into recognizing this violent action as a normal one such as "drinking water". Under the untargeted setting, an adversary only aims to disable the recognition, and is thus considered successful as long as the model makes a wrong prediction rather than a specific targeted prediction. In this paper, we propose two objectives suitable for the above two settings respectively, which will be detailed in Section 4.4.

3.3. Imperceptibility & Reproducibility

Besides the aforementioned adversary goals, the adversary also requires the adversarial perturbation to be both imperceptible and reproducible. Here "imperceptibility" means it should be difficult for human vision to detect the adversarial perturbation, i.e., the difference between the original and adversarial skeleton actions. This is not only a common requirement in previous attacks, but also a useful one in our scenario. Note that it is natural to schedule periodic examinations of an autonomous surveillance system by human labor to check whether the system works well. If the system has been fooled by a seemingly "normal" adversarial skeleton action, the mistake might be attributed to the system itself rather than to the adversary who performs the adversarial skeleton action during the examination process. Here "reproducibility" is an additional requirement specific to our scenario. As mentioned in the introduction, an adversarial skeleton action only becomes a real threat when it can be reproduced under a real-world system. Thus, to make our attack a real-world threat, the generated adversarial skeleton actions should satisfy three concrete constraints to be reproducible, which will be detailed in Section 4.

4. Adversarial Skeleton Action

In this section, we present our proposed attack, i.e., ADMM attack. We first introduce how to formulate the three constraints into mathematical equations. Then we formulate the constrained optimization problem to generate adversarial skeleton actions under both targeted and untargeted settings. Finally, we elaborate on how to solve the optimization problem by ADMM.

4.1. Bone Constraints

We again take the skeletons in the NTU RGB+D dataset as an example. As shown in Fig. 2, a skeleton has in total 25 joints, forming a total of 24 bones. While the bones are not explicitly considered in modeling, they strictly connect the 25 joints, thus imposing 24 bone-length constraints, i.e., the distance between the joints at the two ends of a bone should remain the same in adversarial skeleton actions. To mathematically represent the 24 bones, we associate each joint with its preceding joint, the two forming the ends of a bone. As a result, the 24 preceding joints of joint-2–joint-25 are denoted by {p_2, …, p_25}, and the corresponding joint indices of these elements are {1, 21, 3, 21, 5, 6, 7, 21, 9, 10, 11, 1, 13, 14, 15, 1, 17, 18, 19, 2, 8, 8, 12, 12}. We define the i-th bone's length in the t-th frame as l_{t,i} = ||x_{t,i} − x_{t,p_i}||_2. In this regard, the bone constraints can be represented as l̃_{t,i} = l_{t,i}. Due to the measurement errors in the NTU dataset itself, we also tolerate a very small difference between l̃_{t,i} and l_{t,i}. Therefore, we can finally formulate the bone constraints as

|l̃_{t,i} − l_{t,i}| ≤ ε_b,

where ε_b is set to a small tolerance. Note that inequality constraints in the primal problem would impose inequality constraints on the corresponding Lagrangian variables in the dual problem. To avoid this, we reformulate the above inequality constraints as equality constraints, i.e., the above is equivalent to

max(|l̃_{t,i} − l_{t,i}| − ε_b, 0) = 0.
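As a sketch, the bone-length constraint residual for a single NTU frame can be computed from the preceding-joint indices listed above (the tolerance `eps_b` is illustrative):

```python
import numpy as np

# Preceding joint (1-indexed, NTU skeleton) for joints 2..25.
PRECEDING = [1, 21, 3, 21, 5, 6, 7, 21, 9, 10, 11, 1,
             13, 14, 15, 1, 17, 18, 19, 2, 8, 8, 12, 12]

def bone_lengths(frame):
    """frame: (25, 3) joint coordinates -> (24,) bone lengths."""
    child = frame[1:]                               # joints 2..25
    parent = frame[np.array(PRECEDING) - 1]         # their preceding joints
    return np.linalg.norm(child - parent, axis=1)

def bone_violation(frame, adv_frame, eps_b=1e-2):
    """Equality-constraint residual max(|l_adv - l| - eps_b, 0) per bone;
    all zeros means the bone constraints hold."""
    diff = np.abs(bone_lengths(adv_frame) - bone_lengths(frame))
    return np.maximum(diff - eps_b, 0.0)

rng = np.random.default_rng(0)
frame = rng.normal(size=(25, 3))
assert bone_lengths(frame).shape == (24,)
assert np.all(bone_violation(frame, frame) == 0.0)  # identical skeleton: no violation
```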
4.2. Joint Angle Constraints

Besides the bone-length constraints, we also need to impose constraints on the rotations of the joint angles according to the physiological structure of human beings. Let us also use the NTU dataset as an example. Each joint angle corresponds to the angle between two bones, and thus can be represented by the three joint locations of those two corresponding bones, as illustrated on the right of Fig. 2. Note that a natural way to compute the joint angle shown in Fig. 2 is to first compute its cosine value and then feed that value into the arccos function. However, the gradient of the arccos function is likely to explode, causing large numerical errors when the joint angle is close to 0 or π. To deal with this issue, we derive an approximate upper bound on the change of the joint angle value, which avoids computing the arccos function and its gradient. Again taking the right of Fig. 2 as an example, the angle change caused by the displacement of joint-9 can be approximated by the ratio between the displacement magnitude and the length of the adjacent bone; in particular, when the angle change is small, its sine is almost the same as the angle change itself, so this small-angle approximation is accurate. The total angle change is upper bounded by the sum of the changes caused by the displacements of joint-9, joint-10, and joint-11. Although this representation looks more complicated than the arccos function, its gradient can be computed efficiently and accurately. Given such an approximation, the joint angle constraints can be similarly represented as

max(Δθ̂_{t,i} − ε_a, 0) = 0,

where Δθ̂_{t,i} denotes the upper bound on the change of the i-th joint angle in the t-th frame, and ε_a is set to a small angle.
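For illustration, the joint-angle change can be checked directly via arccos (fine for verification; as discussed above, the attack itself avoids arccos when computing gradients). The tolerance `eps_a` is an illustrative value:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b formed by bones (b->a) and (b->c), in radians."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_violation(orig, adv, eps_a=0.05):
    """max(|angle change| - eps_a, 0); orig/adv are (3, 3) arrays holding
    the three joint positions that define one angle."""
    change = abs(joint_angle(*adv) - joint_angle(*orig))
    return max(change - eps_a, 0.0)

tri = np.array([[1.0, 0, 0], [0, 0, 0], [0, 1.0, 0]])  # a right angle
assert abs(joint_angle(*tri) - np.pi / 2) < 1e-9
assert angle_violation(tri, tri) == 0.0                # unchanged: no violation
```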

4.3. Speed Constraints

According to the physical conditions of human beings, we should consider one more type of constraint, i.e., temporal smoothness constraints. With these constraints, we restrict the speeds of the motions in the generated adversarial skeleton actions. Specifically, the speeds of the motions can be approximated by the displacements between two consecutive temporal frames, i.e., v_{t,j} = x_{t+1,j} − x_{t,j}. Then, similar to Eq. 5, we bound the change of speeds by

max(||ṽ_{t,j} − v_{t,j}|| − ε_s, 0) = 0,

where ε_s is set to a small value.
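A sketch of the speed-constraint residual on a whole sequence (the tolerance `eps_s` is illustrative):

```python
import numpy as np

def speed_violation(x, x_adv, eps_s=1e-2):
    """Approximate motion speed by frame-to-frame joint displacements and
    penalize speed changes larger than eps_s.
    x, x_adv: (T, N, 3) skeleton sequences -> (T-1, N) residuals."""
    v = np.linalg.norm(np.diff(x, axis=0), axis=-1)        # (T-1, N) speeds
    v_adv = np.linalg.norm(np.diff(x_adv, axis=0), axis=-1)
    return np.maximum(np.abs(v_adv - v) - eps_s, 0.0)

x = np.zeros((10, 25, 3))
assert speed_violation(x, x).shape == (9, 25)
assert np.all(speed_violation(x, x) == 0.0)   # identical sequence: no violation
```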

4.4. Constrained Primal Problem Formulation

In this subsection, we introduce the main objectives used under the untargeted setting and targeted setting.

Untargeted Setting

Under the untargeted setting, the adversary achieves its goal as long as the DNN makes a prediction other than the ground-truth label, i.e., argmax_i Z_i(X̃) ≠ y. This holds iff max_{i≠y} Z_i(X̃) > Z_y(X̃). Therefore, we define the objective as minimizing max(Z_y(X̃) − max_{i≠y} Z_i(X̃), −κ), where κ is the desired confidence margin of the DNN on the wrong prediction. Note that if the objective is equal to −κ, we have max_{i≠y} Z_i(X̃) ≥ Z_y(X̃) + κ.

Targeted Setting

The goal of the adversary is to render the prediction result the attack target y′, i.e., argmax_i Z_i(X̃) = y′. Therefore, the primal objective is defined as minimizing the cross entropy between the prediction and y′, or minimizing max(max_{i≠y′} Z_i(X̃) − Z_{y′}(X̃), −κ), following the logic of the untargeted setting.

We can also adopt other objectives for our purpose; however, the two objectives above are the most commonly used in previous work (Kurakin et al., 2016; Madry et al., 2017; Carlini and Wagner, 2017). For simplicity, we denote the main loss by L(X̃). The constrained primal problem can then be formulated as

minimize_{X̃} L(X̃)
subject to B(X̃) = 0, A(X̃) = 0, S(X̃) = 0,

where B, A, and S collect the bone, joint angle, and speed equality constraints, respectively.
Figure 3. Evolution of the averaged loss terms and the constraint violations

4.5. Dual Optimization by ADMM

Note that our constrained primal problems are in general intractable. Instead of searching for a solution to the constrained primal problem, we propose to formulate and optimize its unconstrained dual problem via ADMM. The algorithm is illustrated in Alg. 1. Specifically, we first define the augmented Lagrangian of the constrained primal problem, as shown in Alg. 1. The additional quadratic penalty term, which is commonly used in ADMM (for nonconvex problems), aims to further penalize any violation of the equality constraints. We note that a larger penalty coefficient ρ usually leads to smaller violation but a larger final main objective (which decreases the attack success rate).

Specifically, given the Lagrangian (defined in Alg. 1), the dual problem is to maximize over the multipliers the minimum of the Lagrangian over X̃. Since the Lagrangian is an affine function w.r.t. the dual variables, we can simply use single-step gradient ascent with a large step size (usually set as ρ in ADMM) to update those dual variables. However, the Lagrangian is an extremely complicated nonconvex function w.r.t. the adversarial sample X̃. Therefore, in most cases, we can only guarantee local optima for the internal minimization problem. Fortunately, it turns out that even local optima can reliably fool the DNNs. To find a local optimum efficiently, we adopt the Adam optimizer instead of vanilla stochastic gradient descent (SGD), since Adam converges faster. In practice, a local minimum is attained because the Adam optimizer stops updating the variables when the gradients are (close to) zero. Next, we look into the evolution of the loss during the optimization process. As shown in Fig. 3, at the very beginning (i.e., the first stage), the internal minimization finds adversarial samples with large violation of the constraints. The large violation causes the Lagrangian multipliers to increase rapidly, and thus significantly increases the bone loss, joint loss, and speed loss terms. As a result, the algorithm proceeds into the second stage, where the Adam optimizer focuses more on diminishing the constraint violations when optimizing the Lagrangian. Finally, the algorithm proceeds into a relatively stable stage, where we can stop it. According to Fig. 3, our algorithm is very efficient in the sense that it only needs about 200 (internal) iterations to enter the final stable stage.


  Input: loss function L, hyper-parameter ρ, Adam optimizer for the adversarial skeleton action X̃, maximum number of iterations K.
  Define Constraints: the bone, joint angle, and speed equality constraints, with vector representations B(X̃), A(X̃), and S(X̃).
  Define Lagrangian Variables: λ_b, λ_a, and λ_s (corresponding to B, A, and S).
  Define Augmented Lagrangian: L_ρ(X̃, λ) = L(X̃) + λ_b^T B(X̃) + λ_a^T A(X̃) + λ_s^T S(X̃) + (ρ/2)(||B(X̃)||² + ||A(X̃)||² + ||S(X̃)||²).
  for k = 1 to K do
     Update X̃: fix the multipliers; X̃ is updated by the Adam optimizer.
     Update Multipliers: compute B(X̃), A(X̃), and S(X̃) based on the updated X̃;
     λ_b ← λ_b + ρB(X̃); λ_a ← λ_a + ρA(X̃); λ_s ← λ_s + ρS(X̃).
  end for
Algorithm 1 Generating Adversarial Skeleton Actions
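To illustrate the alternating primal/dual structure of Alg. 1, here is the same augmented-Lagrangian loop on a toy constrained problem (plain gradient steps stand in for the Adam update; this is a sketch, not the paper's implementation):

```python
import numpy as np

def admm_attack_sketch(outer=200, inner=50, rho=1.0, lr=0.05):
    """Augmented-Lagrangian loop in the spirit of Alg. 1 on a toy problem:
       minimize (x0-2)^2 + (x1-2)^2  subject to  g(x) = x0 + x1 - 2 = 0."""
    x, lam = np.zeros(2), 0.0
    for _ in range(outer):
        # primal update: minimize the augmented Lagrangian w.r.t. x
        for _ in range(inner):
            g = x.sum() - 2.0
            grad = 2.0 * (x - 2.0) + (lam + rho * g) * np.ones(2)
            x -= lr * grad
        # dual update: one-step gradient ascent on the multiplier
        lam += rho * (x.sum() - 2.0)
    return x, lam

x_opt, lam_opt = admm_attack_sketch()
# the constrained optimum is (1, 1); the loop satisfies g(x) = 0 at convergence
assert np.allclose(x_opt, [1.0, 1.0], atol=1e-3)
```

The same structure carries over to the attack: the "inner" loop plays the role of the Adam minimization of the Lagrangian, and the "outer" multiplier step is the one-step gradient ascent on the dual variables.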
Figure 4. The top six frames represent a “kicking (another person)” skeleton action, and the bottom six frames are the corresponding frames from the adversarial skeleton action generated by our attack under the targeted setting (optimizing the first person). The generated adversarial skeleton action is recognized as “drinking water” by the 2s-AGCN.

5. Defense against Adversarial Skeleton Actions

The methods proposed in (Li et al., 2018a; Cohen et al., 2019) can certify larger robust radii than (Lecuyer et al., 2018). However, the sample complexity of computing the confidence intervals in (Li et al., 2018a; Cohen et al., 2019) leads to considerable computational overhead in the inference stage. Therefore, we only use the method in (Cohen et al., 2019) in the certification process. In the inference stage, we modify the method in (Lecuyer et al., 2018) to build a relatively efficient defense against adversarial skeleton actions, as shown in Alg. 2. In general, our proposed defense consists of two steps: adding Gaussian noise and temporal filtering with a Gaussian kernel. In the following, we detail these two steps and explain why we include them in the defense.

5.1. Additive Gaussian Noise

Our first step is adding Gaussian noise to the skeleton actions. In the inference stage, we follow (Lecuyer et al., 2018) to make the prediction as g(x) = argmax_c P(f(h(x + ε)) = c), ε ∼ N(0, σ²I), given input x, where the randomized mechanism adds Gaussian noise ε and applies a post-processing function h before the neural network f. In order to estimate g(x), we sample N noisy samples from N(x, σ²I) and feed them into the post-processing function h and the neural network f. P(f(h(x + ε)) = c) is estimated by the fraction of noisy samples classified as c, and according to the Chernoff bound (Boucheron et al., 2013), the error of this estimation is at most √(log(2/δ) / (2N)) with probability at least 1 − δ.

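To make the sampling-error bound concrete, a small helper can evaluate it; the function name and the Hoeffding form of the Chernoff-type bound are our choices for this sketch.

```python
import math

def estimation_error(n, confidence=0.999):
    """With probability >= confidence, the Monte Carlo estimate p_hat of
    p = P(f(h(x + noise)) = c) from n noisy samples satisfies
    |p_hat - p| <= sqrt(log(2 / (1 - confidence)) / (2 * n))."""
    return math.sqrt(math.log(2.0 / (1.0 - confidence)) / (2.0 * n))
```

For example, with n = 1000 samples the error at 99.9% confidence is about 0.06, and it shrinks as 1/√n when more samples are drawn.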
In the certification stage, we rely on the main theorem from (Cohen et al., 2019), which gives the currently tightest bound:

Lemma 5.1.

Denote a mechanism randomized by Gaussian noise N(0, σ²I) by g, and the ground-truth label by y. Define c_A = argmax_c P(f(h(x + ε)) = c) with runner-up class c_B, and write p_A and p_B for the two corresponding probabilities. Suppose bounds pA_lb & pB_ub satisfy p_A ≥ pA_lb ≥ pB_ub ≥ p_B; then g(x + δ) = c_A for all ||δ||₂ < R, where the robust radius is R = (σ/2)(Φ⁻¹(pA_lb) − Φ⁻¹(pB_ub)).

Lemma 5.1 indicates that as long as ||δ||₂ < R and c_A = y, we have g(x + δ) = y, i.e., the prediction is correct. The algorithm using the above lemma for certification is detailed in Algorithm 3. In the next subsection, we will detail the post-processing function h mentioned before.
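The radius in Lemma 5.1 can be computed directly with the standard normal inverse CDF; a minimal helper using only Python's standard library (the function name is ours) might look like:

```python
from statistics import NormalDist

def certified_radius(p_a_lower, p_b_upper, sigma):
    """Certified L2 radius from Lemma 5.1 (Cohen et al., 2019):
    R = sigma / 2 * (Phi^{-1}(p_a_lower) - Phi^{-1}(p_b_upper)),
    where p_a_lower lower-bounds the top-class probability and
    p_b_upper upper-bounds the runner-up probability."""
    phi_inv = NormalDist().inv_cdf  # standard normal inverse CDF
    return 0.5 * sigma * (phi_inv(p_a_lower) - phi_inv(p_b_upper))
```

For instance, with p_a_lower = 0.8, p_b_upper = 0.2, and σ = 0.5, the certified radius is roughly 0.42.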

0:  Neural network f, standard deviation σ of the additive Gaussian noise, skeleton action x (possibly adversarial), number of noisy samples N for inference of g(x).
  Sample N noisy samples from N(x, σ²I)
  Smooth each noisy sample along the temporal axis by a Gaussian filter
  Feed the smoothed samples into the network f and output the majority vote
Algorithm 2 Defense (Inference)
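A compact sketch of this inference procedure is given below, assuming the skeleton action is an array of shape (frames, joints, 3) and f is any classifier returning an integer label; the shape convention, the 5-tap kernel, and the function names are our assumptions.

```python
import numpy as np

def gaussian_kernel(size=5, std=1.0):
    """Normalized 1-D Gaussian kernel for temporal smoothing."""
    t = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-t ** 2 / (2 * std ** 2))
    return k / k.sum()

def smoothed_predict(f, x, sigma=0.1, n=100, ksize=5, rng=None):
    """Sketch of Algorithm 2: add Gaussian noise to the skeleton action x
    (shape: frames x joints x 3), smooth each noisy copy along the temporal
    axis with a Gaussian kernel, classify all copies with f, and return
    the majority vote."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = x[None] + sigma * rng.standard_normal((n,) + x.shape)
    k = gaussian_kernel(ksize)
    # convolve every joint coordinate along the time axis (axis 1)
    smooth = np.apply_along_axis(
        lambda v: np.convolve(v, k, mode="same"), 1, noisy)
    votes = np.array([f(s) for s in smooth])
    return np.bincount(votes).argmax()
```

In practice f would be the trained HCN or 2s-AGCN model, and n would match the N used in Alg. 2.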
0:  Neural network f, standard deviation σ of the additive Gaussian noise, original and adversarial skeleton actions x & x_adv, number of noisy samples N for inference of g(x), a predefined confidence value p for the hypothesis test.
  Recognition: sample N noisy samples from N(x_adv, σ²I)
  Smooth each noisy sample along the temporal axis by a Gaussian filter
  Feed the smoothed samples into the (normally trained) network f
  Confidence Interval: compute the counts n_A and n_B of the top-two class indices in the predictions
  Compute the lower bound pA_lb for p_A and the upper bound pB_ub for p_B by the method in (Goodman, 1965) with confidence p.
  Certification: compute the certified radius R = (σ/2)(Φ⁻¹(pA_lb) − Φ⁻¹(pB_ub)).
  Output R if the top class corresponds to the ground-truth label, else output 0.
  Compare R with ||x_adv − x||₂, and if R is larger, output the index corresponding to the top class.
Algorithm 3 Defense (Certification)
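The certification step can be sketched as follows; for simplicity this sketch substitutes a simpler (and looser) Hoeffding bound for the Goodman (1965) simultaneous confidence intervals used in Alg. 3, so its radii are more conservative, and the function name is ours.

```python
from statistics import NormalDist
import math

def certify(n_top, n_runner, n_total, sigma, conf=0.99):
    """Sketch of the certification step in Algorithm 3. n_top and n_runner
    are the counts of the two most-voted classes among n_total noisy
    samples. A Hoeffding bound (our simplification, in place of the
    Goodman intervals) converts the counts into a lower bound on the
    top-class probability and an upper bound on the runner-up."""
    eps = math.sqrt(math.log(2.0 / (1.0 - conf)) / (2.0 * n_total))
    p_top = min(max(n_top / n_total - eps, 1e-9), 1 - 1e-9)
    p_runner = min(max(n_runner / n_total + eps, 1e-9), 1 - 1e-9)
    if p_top <= p_runner:
        return None  # cannot certify: abstain
    phi_inv = NormalDist().inv_cdf
    return 0.5 * sigma * (phi_inv(p_top) - phi_inv(p_runner))
```

A confident vote (e.g., 950 of 1000 samples on the top class) yields a positive radius, while a near-tie makes the procedure abstain.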

5.2. Temporal Filtering by Gaussian Kernel

After adding Gaussian noise to the skeleton actions, we propose to further smooth the action along the temporal axis with a Gaussian filter. The intuition is that adjacent frames in a skeleton action sequence are very similar to each other, and can therefore serve as references to rectify the adversarial perturbations. Although this additional operation does not improve the certification results, we observe that it makes our defense more compatible with a normally trained model than the original randomized smoothing method in (Lecuyer et al., 2018; Cohen et al., 2019). We also note that this simple operation is rarely used in previous work because it does not suit the image recognition domain, where no adjacency information along a temporal axis is available.
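A quick numerical illustration of this intuition: i.i.d. per-frame noise shrinks under temporal Gaussian filtering while a smooth trajectory is almost unchanged. The trajectory, noise level, and kernel width below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 3.0, 100))           # a smooth joint trajectory
perturbed = clean + 0.1 * rng.standard_normal(100)   # i.i.d. per-frame noise

t = np.arange(5) - 2.0                               # 5-tap Gaussian kernel
k = np.exp(-0.5 * t ** 2)
k /= k.sum()

filtered = np.convolve(perturbed, k, mode="same")
err_before = np.abs(perturbed - clean).mean()
err_after = np.abs(filtered - clean).mean()          # noticeably smaller
```

Because the clean trajectory varies slowly across frames while the perturbation does not, averaging adjacent frames suppresses the perturbation much more than the signal.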

White-box, untargeted        NTU CV                                   NTU CS
                             Succ.   ΔB      ΔJ     ΔE     dist.      Succ.   ΔB      ΔJ     ΔE     dist.
HCN                          100%    2.64%   0.132  4.52%  0.396      100%    2.17%   0.111  3.17%  0.347
                             100%    1.92%   0.099  1.65%  0.330      100%    1.62%   0.086  1.30%  0.290
                             92.8%   1.50%   0.085  1.25%  0.270      92.4%   1.25%   0.073  0.98%  0.241
2s-AGCN                      100%    2.17%   0.112  1.62%  0.653      100%    1.97%   0.107  2.20%  0.614
                             100%    1.70%   0.094  0.59%  0.528      100%    1.46%   0.086  0.57%  0.496
                             99.0%   1.37%   0.083  0.39%  0.428      98.8%   1.19%   0.078  0.34%  0.413
White-box, targeted          NTU CV                                   NTU CS
                             Succ.   ΔB      ΔJ     ΔE     dist.      Succ.   ΔB      ΔJ     ΔE     dist.
HCN                          100%    3.60%   0.165  7.75%  0.673      100%    3.55%   0.165  6.68%  0.723
                             99.7%   3.24%   0.156  4.69%  0.630      100%    3.16%   0.155  4.24%  0.674
                             22.3%   2.27%   0.115  2.83%  0.444      26.9%   2.14%   0.112  2.50%  0.462
2s-AGCN                      100%    1.66%   0.090  0.55%  0.569      100%    1.67%   0.091  0.71%  0.649
                             100%    1.61%   0.091  0.42%  0.556      100%    1.56%   0.090  0.49%  0.615
                             97.2%   1.54%   0.089  0.38%  0.512      97.9%   1.47%   0.087  0.40%  0.552
Table 1. The empirical performance of our proposed method: success rate (Succ.), averaged bone-length difference between original and adversarial skeletons (ΔB), averaged joint angle difference (upper bound) (ΔJ), kinetic energy difference (ΔE), and distance between original and adversarial actions (dist.).

6. Experiments

6.1. Attack Performance

Main Results

The main results of our attack are shown in Table 1. As we can see, our proposed attack can achieve 100% success rates with very small violations of the constraints: the averaged normalized bone-length difference is only a few percent, and the violation of the joint angles is also small. Considering that skeleton data is usually noisy, such subtle violations are common in real-world data. We also provide more experimental results in the supplementary material (e.g., on Kinetics).

We also note that adversarial-sample generation under the untargeted setting is usually easier than under the targeted setting, since a targeted adversarial sample is guaranteed to be an untargeted adversarial sample, but not vice versa. This fact is also reflected by the results in Table 1. Furthermore, in Figure 4, we show the visualization of an adversarial skeleton action (recognized as the normal action “drinking water”) generated by our attack, which is almost visually indistinguishable from the original skeleton action (“kicking”). Figures of more adversarial actions are attached in the appendix.

Source     Target     Dataset    Success Rate
HCN(1)     HCN(2)     NTU CV     24.7%   26.0%
                      NTU CS     28.5%   32.6%
HCN(1)     2s-AGCN    NTU CV     17.6%   20.4%
                      NTU CS     17.3%   19.6%
Table 2. Attack success rates of adversarial examples transferred between models (the two columns correspond to the two enlarged perturbation settings used for generation).


In order to shed light on the transferability of our attack, we feed the adversarial skeleton actions generated on one HCN model to another HCN model and to 2s-AGCN, respectively. To boost the transferability, we enlarge the perturbation budget when generating the adversarial skeleton actions. The attack success rates are given in Table 2. Similar to 3D adversarial point clouds (Xiang et al., 2018), the transferability of adversarial skeleton actions is somewhat limited compared with adversarial images.

Comparison with C&W Attack

We use the C&W attack as an example to illustrate the difference between our attack and existing attacks. The C&W attack has been demonstrated to be a successful optimization-based adversarial attack in many application domains. However, since the C&W attack mainly minimizes the distance between the original and adversarial skeletons, it easily violates the physical constraints, as shown in our simple case study (Table 3).

Untargeted    Succ.   ΔB      ΔJ     ΔE     dist.
NTU CV        100%    4.67%   0.241  13.0%  0.278
NTU CS        100%    4.09%   0.211  10.2%  0.244
Targeted      Succ.   ΔB      ΔJ     ΔE     dist.
NTU CV        100%    8.82%   0.468  38.1%  0.510
NTU CS        100%    9.45%   0.507  36.8%  0.520
Table 3. Adversarial skeleton actions generated by the C&W attack on HCN (the distance is smaller than in Table 1, but the constraint violations are much larger).

6.2. Defense Performance

Empirical Results

We demonstrate the performance of the defense for inference in Table 4. For Alg. 2, we use a number of noisy samples N that is much smaller than the number required for certification but already achieves good empirical performance, as shown in Table 4. It is much easier to defend against adversarial skeleton actions under the targeted setting than under the untargeted setting. Note that the clean accuracies of HCN on NTU-CV and NTU-CS are reported in (Li et al., 2018b), and those of 2s-AGCN in (Shi et al., 2019).

Model      Setting       NTU CV           NTU CS
HCN        Untargeted    62.0%   62.3%    50.6%   51.4%
           Targeted      79.4%   70.8%    67.1%   58.3%
2s-AGCN    Untargeted    51.0%   42.2%    42.1%   40.2%
           Targeted      60.8%   50.5%    42.2%   44.1%
Table 4. Empirical performance (model accuracy) of our proposed defense on normally trained models.

Certified Results

Due to the high computational cost of the certification method (N = 1000), we mainly evaluate the certification algorithm on HCN. The certified accuracy achieved under different levels of noise is shown in Fig. 5. Note that we train the model with the same level of noise as used for certification. As we can see, at the cost of some accuracy on clean samples, the method achieves a nontrivial certified accuracy.

Figure 5. Certification accuracy on HCN

7. Conclusion

We study the adversarial vulnerability of skeleton-based action recognition. We first identify and formulate three main constraints that adversarial skeleton actions should satisfy. Since the corresponding constrained optimization problem is intractable, we propose to optimize its dual problem by ADMM, a generic approach to generating constrained adversarial examples. To defend against adversarial skeleton actions, we further specify an efficient defensive inference algorithm and a certification algorithm. The effectiveness of the attack and defense is demonstrated on two open-source models, and the results yield several interesting observations that deepen our understanding of adversarial skeleton actions.


  • M. Andriushchenko, F. Croce, N. Flammarion, and M. Hein (2019) Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049. Cited by: §2.5.
  • A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §2.5.
  • J. M. Bioucas-Dias and M. A. Figueiredo (2010) Alternating direction algorithms for constrained sparse regression: application to hyperspectral unmixing. In 2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, pp. 1–4. Cited by: §2.4.
  • S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford university press. Cited by: §5.1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39–57. Cited by: §2.3, §3.1, §4.4.
  • G. Cheng, Y. Wan, A. N. Saudagar, K. Namuduri, and B. P. Buckles (2015) Advances in human action recognition: a survey. arXiv preprint arXiv:1501.05964. Cited by: §1.
  • J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918. Cited by: §1, §2.5, §5.1, §5.2, §5.
  • K. Dvijotham, R. Stanforth, S. Gowal, T. A. Mann, and P. Kohli (2018) A dual approach to scalable verification of deep networks.. In UAI, pp. 550–559. Cited by: §1, §2.5.
  • P. A. Forero, A. Cano, and G. B. Giannakis (2010) Consensus-based distributed support vector machines. Journal of Machine Learning Research 11 (May), pp. 1663–1707. Cited by: §2.4.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §2.3, §2.5.
  • L. A. Goodman (1965) On simultaneous confidence intervals for multinomial proportions. Technometrics 7 (2), pp. 247–254. Cited by: 5.
  • S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715. Cited by: §1, §2.5.
  • W. He, J. Wei, X. Chen, N. Carlini, and D. Song (2017) Adversarial example defense: ensembles of weak defenses are not strong. In 11th USENIX Workshop on Offensive Technologies (WOOT 17). Cited by: §2.5.
  • J. Jia, X. Cao, B. Wang, and N. Z. Gong (2019) Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. arXiv preprint arXiv:1912.09899. Cited by: §2.5.
  • Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid (2017) A new representation of skeleton sequences for 3d action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3288–3297. Cited by: §1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §2.3, §4.4.
  • R. Lai and S. Osher (2014) A splitting method for orthogonality constrained problems. Journal of Scientific Computing 58 (2), pp. 431–449. Cited by: §2.4.
  • M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2018) Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471. Cited by: §1, §2.5, §5.1, §5.2, §5.
  • B. Li, C. Chen, W. Wang, and L. Carin (2018a) Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113. Cited by: §2.5, §5.
  • C. Li, Q. Zhong, D. Xie, and S. Pu (2018b) Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055. Cited by: §1, §1, §2.2, §6.2.
  • S. Li, W. Li, C. Cook, C. Zhu, and Y. Gao (2018) Independently recurrent neural network (indrnn): building a longer and deeper rnn. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • J. Liu, N. Akhtar, and A. Mian (2019) Adversarial attack on skeleton-based human action recognition. arXiv preprint arXiv:1909.06500. Cited by: §2.3.
  • Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang (2020) Disentangling and unifying graph convolutions for skeleton-based action recognition. arXiv preprint arXiv:2003.14111. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §2.3, §2.5, §4.4.
  • O. Miksik, V. Vineet, P. Pérez, P. H. Torr, and F. C. Sévigné (2014) Distributed non-convex admm-inference in large-scale random fields. In British Machine Vision Conference (BMVC), Vol. 2. Cited by: §2.4.
  • M. Mirman, T. Gehr, and M. Vechev (2018) Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575–3583. Cited by: §1, §2.5.
  • A. Raghunathan, J. Steinhardt, and P. Liang (2018) Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344. Cited by: §2.5.
  • A. Shahroudy, J. Liu, T. Ng, and G. Wang (2016) NTU rgb+d: a large scale dataset for 3d human activity analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1010–1019. Cited by: Figure 1, §1.
  • L. Shi, Y. Zhang, J. Cheng, and H. Lu (2019) Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035. Cited by: §1, §6.2.
  • C. Si, W. Chen, W. Wang, L. Wang, and T. Tan (2019) An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236. Cited by: §1.
  • Y. Tashiro, Y. Song, and S. Ermon (2020) Output diversified initialization for adversarial attacks. arXiv preprint arXiv:2003.06878. Cited by: §2.5.
  • J. Uesato, B. O’Donoghue, P. Kohli, and A. Oord (2018) Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5032–5041. Cited by: §2.5.
  • S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana (2018) Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, pp. 6367–6377. Cited by: §1, §2.5.
  • Y. Wang, W. Yin, and J. Zeng (2019) Global convergence of admm in nonconvex nonsmooth optimization. Journal of Scientific Computing 78 (1), pp. 29–63. Cited by: §2.4.
  • E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292. Cited by: §1, §2.5.
  • C. Xiang, C. R. Qi, and B. Li (2018) Generating 3d adversarial point clouds. arXiv preprint arXiv:1809.07016. Cited by: §6.1.
  • S. Yan, Y. Xiong, and D. Lin (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §1.
  • A. Y. Yang, Z. Zhou, A. G. Balasubramanian, S. S. Sastry, and Y. Ma (2013) Fast ℓ1-minimization algorithms for robust face recognition. IEEE Transactions on Image Processing 22 (8), pp. 3234–3246. Cited by: §2.4.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573. Cited by: §2.5.
  • P. Zhao, S. Liu, Y. Wang, and X. Lin (2018) An admm-based universal framework for adversarial attacks on deep neural networks. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1065–1073. Cited by: §2.3.
  • T. Zheng, C. Chen, and K. Ren (2019) Distributionally adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2253–2260. Cited by: §2.5.

Appendix A Appendix

Additional visualization results

Here we provide more visualization results. We use “drinking water” as the attack target because “drinking water” is a normal action that looks completely different from violent/abnormal actions such as throwing, kicking, pushing, and punching. Despite the obvious visual difference between “drinking water” and those abnormal actions, our attack can still fool the state-of-the-art models into recognizing those abnormal actions as “drinking water” with imperceptible and reproducible perturbations, which indicates that our attack is very powerful. In Fig. 6, we show that our attack fools the HCN model into recognizing the “throwing”, “pushing”, and “kicking” actions as the normal action “drinking water” via imperceptible adversarial perturbations. Similarly, in Fig. 7, we show that our attack fools the 2s-AGCN model into recognizing the “throwing” and “pushing” actions as “drinking water”. These visualization results, along with the quantitative results in Table 1 (in the paper), demonstrate that the perturbations are indeed imperceptible and reproducible.

Kinetics Dataset

Besides the NTU dataset, we also evaluate our attack on another popular dataset, Kinetics-400, under both the untargeted and targeted settings. As shown in Table 5, under the untargeted setting, our attack achieves 100% (or near-100%) attack success rates with very small violations of the constraints, similar to its performance on the NTU dataset. However, under the targeted setting, it is much more difficult for our attack to find targeted adversarial skeleton actions with very small constraint violations. This is because Kinetics-400 has 400 classes of actions, whereas the original NTU dataset only has 60. We also argue that the targeted results on Kinetics do not devalue our attack, since even for most clean testing samples from Kinetics, it is difficult for the state-of-the-art models to predict the ground-truth labels (targets).

Figure 6. The adversarial skeleton actions generated by our attack under the targeted setting. The generated adversarial skeleton actions are recognized as “drinking water” by the HCN.
Figure 7. The adversarial skeleton actions generated by our attack under the targeted setting. The generated adversarial skeleton action is recognized as “drinking water” by the 2s-AGCN.
Untargeted    Kinetics-400
              Succ.   ΔB      ΔJ     ΔE     dist.
HCN           100%    2.60%   0.082  1.66%  0.150
              100%    2.58%   0.080  1.52%  0.162
              98.8%   2.49%   0.078  1.21%  0.145
2s-AGCN       100%    0.91%   0.053  0.58%  0.331
              100%    0.77%   0.047  0.53%  0.298
              100%    0.75%   0.046  0.52%  0.287
Targeted      Kinetics-400
              Succ.   ΔB      ΔJ     ΔE     dist.
HCN           90.2%   5.22%   0.220  11.2%  1.864
              67.2%   2.79%   0.124  4.86%  1.350
              17.2%   1.44%   0.073  2.36%  0.763
2s-AGCN       99.2%   5.25%   0.167  1.20%  0.725
              98.8%   5.04%   0.159  1.21%  0.722
              98.4%   4.89%   0.153  1.03%  0.677
Table 5. The performance of our proposed attack on Kinetics: success rate (Succ.), averaged bone-length difference between original and adversarial skeletons (ΔB), averaged joint angle difference (upper bound) (ΔJ), kinetic energy difference (ΔE), and distance between original and adversarial actions (dist.).