1. Introduction
Action recognition is an important task in computer vision, motivated by many downstream applications such as video surveillance, video indexing, and human-machine interaction
(Cheng et al., 2015). It is also a very challenging task, since it requires capturing long-term spatial-temporal context and understanding the semantics of actions. One method proposed by the community is to perform action recognition on the human skeleton information collected by cameras or sensors, where an action is represented by a time series of human joint locations. Compared with video streams, the skeleton representation is more robust to the variance of background conditions, and also easier to handle for machine learning models due to its compactness. Recent advances in deep learning techniques have boosted the performance of this method. Currently, a variety of deep learning model structures have been applied to skeleton-based action recognition, including convolutional neural networks
(Ke et al., 2017; Li et al., 2018b; Li et al., 2018; Si et al., 2019), and graph neural networks (Yan et al., 2018; Shi et al., 2019; Liu et al., 2020).

On the other hand, existing work has demonstrated the vulnerability of deep learning techniques to adversarial examples in many application domains. This phenomenon gives us a good reason to suspect that the DNNs for skeleton-based action recognition might also be vulnerable to adversarial skeleton examples, despite achieving high accuracy in a benign environment. A thorough study of the adversarial vulnerability of action-recognition models is indispensable before deploying them to real-world applications such as surveillance systems; otherwise, potential adversaries might easily deceive those systems by generating and imitating specific adversarial actions. However, the study of adversarial skeleton examples is scant and non-trivial (the only parallel work is detailed in section 2.3), due to the fundamental differences between the properties of adversarial skeleton actions and other adversarial examples. The differences are mainly caused by the bones between joints and the joint angles, which impose unique spatial constraints on skeleton data (Shahroudy et al., 2016). Specifically, in the generated adversarial skeleton actions, the lengths of the bones must remain the same, and the joint angles cannot violate certain physiological structures. Otherwise, the adversarial actions are not reproducible by the individuals who perform the original actions. Also, considering the physical conditions of human beings, the speeds of the motions in the adversarial actions should be constrained as well.
To address the above issues, in this paper, we propose an optimization-based method for generating adversarial skeleton actions. Specifically, we formulate the generation of adversarial skeleton actions as a constrained optimization problem by representing those constraints as mathematical equations. Since the primal constrained problem is intractable, we turn to its dual problem. Moreover, since all the constraints are represented by equations, both the primal and dual variables are unconstrained in the dual problem. We further specify an efficient algorithm based on ADMM to solve the unconstrained dual problem, in which the internal minimization objective is optimized by an Adam optimizer, and the external maximization objective is optimized by one-step gradient ascent. We show that this algorithm can find an adversarial skeleton action within 200 internal steps.
Beyond the attack, we further propose an efficient defense against adversarial skeleton actions based on previous theories and empirical observations. Our defense consists of two core steps, i.e., adding Gaussian noise and applying Gaussian filtering to the action data. The first step, adding Gaussian noise, is inspired by recent advances in certified defenses. Specifically, adding Gaussian noise to the input has been proved to be a certified defense, which means that additive Gaussian noise on the adversarial examples can guarantee a correct model prediction (with high probability), as long as the adversarial perturbation is restricted within a certain radius in the neighborhood of the original data sample. Note that there are also several other methods to certify model robustness, such as the dual approach, interval analysis, and abstract interpretations (Dvijotham et al., 2018; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). We adopt the Gaussian noise method because it is simple, effective, and, more importantly, scalable to complicated models. Note that skeleton-based action recognition models are typically more complicated than the common ConvNets certified by (Dvijotham et al., 2018; Wong and Kolter, 2018; Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). The second step is to smooth the skeleton frames along the temporal axis using a Gaussian filter. This step does not affect the robustness certified by the first step, according to the post-processing property (Lecuyer et al., 2018; Li et al., 2018b; Cohen et al., 2019), but in practice it always filters out a certain amount of adversarial perturbation and random noise, thus making our defense applicable to normally trained models.

Our proposed attack and defense are evaluated on two open-source models, i.e., 2s-AGCN and HCN (we select these two models because the authors have released the code and hyperparameters on GitHub, so that we can correctly reproduce the results; these two models also achieve fairly good performance). Extensive evaluations show that our attack can achieve high (up to 100%) success rates with almost no violation of the constraints. Moreover, the visualization results, including images and videos, demonstrate that the difference between the original and adversarial skeleton actions is imperceptible. Extensive evaluations also show that our defense is effective and efficient: it can substantially improve the empirical accuracy of normally trained models against adversarial skeleton actions under different settings.

To summarize, our main contribution is fourfold:

We identify the constraints that need to be considered for adversarial skeleton actions, and formulate the generation of adversarial skeleton actions as a constrained optimization problem by representing those constraints as mathematical equations.

We propose to solve the primal constrained problem by optimizing the dual problem via ADMM, which is the first attempt at generating adversarial actions with ADMM, and yields outstanding performance.

We propose an efficient two-step plug-in defense against adversarial skeleton actions, and specify the defense in both the inference and certification stages.

We conduct extensive evaluations, and provide several interesting observations regarding adversarial skeleton actions based on the experimental results.
2. Preliminaries
2.1. Definitions and Notations
Let x and y respectively denote a data sample and its label, with y ∈ {1, …, K}, where K is the number of all possible classes. For an image, x is a 2D matrix. For a skeleton action studied in this paper, x = {x_{t,j}}, where x_{t,j} denotes the position (coordinates) of the j-th joint in the t-th skeleton frame of an action sequence, with N and T denoting the number of joints in a skeleton and the number of skeleton frames in an action sequence, respectively. The corresponding adversarial skeleton action is denoted by x_adv. We take the skeletons in the largest dataset, i.e., the NTU RGB+D dataset, as an example. As shown in figure 2, there are in total 25 joints in a skeleton frame, and thus N = 25. The number of frames T differs for each skeleton action; usually, we subsample a constant number of frames from each sequence, or pad zeros after each sequence, to endow all the skeleton actions with the same T. Let f(·; θ) denote a classification network, where θ represents the network weights. The logit output on x is denoted by f(x; θ) with elements f_k(x; θ) (k = 1, …, K). The network correctly classifies x iff argmax_k f_k(x; θ) = y. The goal of adversarial attacks is to find an adversarial sample x_adv, which satisfies several predefined constraints, such that argmax_k f_k(x_adv; θ) ≠ y (untargeted) or argmax_k f_k(x_adv; θ) = y_t, where y_t is the target label. A commonly-used constraint is that x_adv should be close to the original sample x according to some distance metric.

2.2. DNNs for Skeleton-based Action Recognition
In the following, we briefly introduce the two DNNs used for the evaluation of our proposed attack method. HCN is a CNN-based end-to-end hierarchical network for learning global co-occurrence features from skeleton data (Li et al., 2018b). HCN is designed to learn different levels of features from both the raw skeleton and the skeleton motion. The joint-level features are learned by a multi-layer CNN, and the global co-occurrence features are learned from the fused joint-level features. At the end, the co-occurrence features are fed to a fully-connected network for action classification. 2s-AGCN is one of the state-of-the-art GCN-based models for skeleton-based action recognition. In contrast to the earliest GCN-based model (i.e., ST-GCN), 2s-AGCN learns the appropriate graph topology of every skeleton action rather than predefining the graph topology. This enables 2s-AGCN to capture the implicit connections between joints in certain actions, such as the connection between hand and face in the “wiping face” action. Besides, 2s-AGCN also adopts a two-stream framework to learn from both static and motion information. Overall, 2s-AGCN improves the accuracy of ST-GCN by nearly 7%.
2.3. Adversarial Attacks
After the discovery of adversarial examples, the community has developed hundreds of attack methods to generate adversarial samples. In the following, we mainly introduce four attack methods plus a parallel work, with a discussion on the difference between our proposed method and these attacks.
Fast Gradient Sign Method (FGSM)
FGSM is a typical one-step adversarial attack algorithm proposed by (Goodfellow et al., 2014). The algorithm updates a benign sample along the direction of the gradient of the loss w.r.t. the sample. Formally, FGSM follows the update rule

x_adv = clip_{[x_min, x_max]}( x + ε · sign(∇_x L(f(x; θ), y)) ),    (1)

where L denotes the classification loss, ε controls the maximum perturbation of the adversarial samples, [x_min, x_max] is the valid element-wise value range, and the clip function clips its input into that range.
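The update in Eq. 1 can be sketched in a few lines of numpy; the gradient is assumed to be supplied by the model, and the function name and value range are illustrative:

```python
import numpy as np

def fgsm(x, grad, eps, x_min=0.0, x_max=1.0):
    """One-step FGSM update (Eq. 1): move x along the sign of the
    loss gradient, then clip back into the valid value range."""
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, x_min, x_max)
```

For example, a sample at 0.5 with a positive gradient and ε = 0.1 moves to 0.6, while a sample at 0.95 is clipped back to the upper bound 1.0.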
Projected Gradient Descent (PGD)
PGD (Kurakin et al., 2016; Madry et al., 2017) is a strong iterative version of FGSM, which executes the update in Eq. 1 for multiple steps with a smaller step size and then projects the updated adversarial examples into a predefined norm ball. Specifically, in each step, PGD updates the sample by

x_adv^{k+1} = Π_ε( x_adv^k + α · sign(∇_x L(f(x_adv^k; θ), y)) ),    (2)

where α is the step size. The projection function Π_ε is a clip function for ℓ∞ norm balls, and an ℓ2 normalizer for ℓ2 norm balls.
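For the ℓ∞ case, one PGD iteration of Eq. 2 reduces to a small FGSM-like step followed by two clips; this is a minimal sketch with illustrative names:

```python
import numpy as np

def pgd_step(x_adv, x, grad, alpha, eps, x_min=0.0, x_max=1.0):
    """One PGD iteration (Eq. 2) for an l_inf ball: a small signed
    gradient step, then projection onto the eps-ball around x."""
    x_adv = x_adv + alpha * np.sign(grad)
    x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the l_inf ball
    return np.clip(x_adv, x_min, x_max)       # stay in the valid value range
```

With α larger than ε, the step is immediately clipped back to the ball boundary, which is why PGD typically uses a step size smaller than the ball radius.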
Carlini and Wagner Attack
(Carlini and Wagner, 2017) proposes an attack called the C&W attack, which generates norm-bounded adversarial samples by optimizing the C&W loss:

minimize  D(x, x_adv) + c · g(x_adv).    (3)

In the C&W loss, D(x, x_adv) represents some distance metric between the benign sample x and the adversarial sample x_adv, and the metrics used in (Carlini and Wagner, 2017) include the ℓ0, ℓ2, and ℓ∞ distances; g is a customized margin loss. It is worth noting that our proposed attack is completely different from the PGD or C&W attacks. For PGD, C&W, and many other attacks, the simple constraints on the pixel values can be resolved by projection functions or naturally incorporated into the objective by a clip or tanh function. However, in our scenario, the constrained optimization problem is much more complicated, and thus has to be solved by more advanced methods.
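A minimal sketch of the objective in Eq. 3 for a targeted attack, using an ℓ2 distance and the standard margin form of g; the function names and the default weights are illustrative, not the authors' exact implementation:

```python
import numpy as np

def cw_objective(x, x_adv, logits, target, c=1.0, kappa=0.0):
    """C&W-style objective (Eq. 3): an l2 distance term plus a weighted
    margin loss g that becomes zero once the target logit dominates all
    other logits by at least kappa."""
    dist = np.sum((x_adv - x) ** 2)
    other = np.max(np.delete(logits, target))      # best non-target logit
    g = max(other - logits[target] + kappa, 0.0)   # margin loss
    return dist + c * g
```

Once the target class already wins (g = 0), only the distance term remains, which is what pulls the adversarial sample back toward the benign one.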
ADMMbased Adversarial Attack
(Zhao et al., 2018) also proposes a framework based on ADMM to generate adversarial examples. However, our proposed attack is completely different from (Zhao et al., 2018) in two aspects. First, the constraints we consider are more complicated than norm constraints, which makes ADMM more appropriate than the other attack algorithms here. Second, we formulate the problem in a completely different manner: (Zhao et al., 2018) follows the ADMM framework to break a problem defined like Eq. 3 into two subproblems, while our attack formulates a completely different problem with indispensable equality constraints, to which ADMM is naturally an appropriate solution.
Adversarial Attack on Skeleton Action
Note that (Liu et al., 2019) is a parallel work that proposes an attack based on FGSM and BIM (PGD) to generate adversarial skeleton actions. Specifically, (Liu et al., 2019) adapts FGSM and BIM to skeleton-based action recognition by using a clipping function and an alignment operation to impose the bone and joint constraints on the updated adversarial skeleton actions in each iteration. However, that method is very different from our work. First, the joint constraint considered in (Liu et al., 2019) is not the constraint on joint angles mentioned before. Second, the alignment operation might corrupt the perturbation learned in each iteration. In contrast, we formulate adversarial skeleton action generation as a constrained optimization problem with equality constraints. Reformulating the equality constraints with Lagrangian multipliers yields an unconstrained dual optimization problem, which does not need any complicated additional operation in the optimization process. Third, we propose to solve the dual optimization problem by ADMM, which is a more appropriate algorithm for optimizing complicated constrained problems. As a result, our attack achieves better performance than (Liu et al., 2019), which will be detailed in section 6.1. Finally, we specify a defense method against adversarial skeleton actions based on state-of-the-art theories and our observations.
2.4. Alternating Direction Method of Multipliers (ADMM)
Alternating Direction Method of Multipliers (ADMM) is a powerful optimization algorithm for handling large-scale statistical tasks in diverse application domains. It blends the decomposability of dual ascent with the good convergence properties of the method of multipliers. ADMM plays a significant role in solving statistical problems such as support vector machines (Forero et al., 2010), trace norm regularized least squares minimization (Yang et al., 2013), and constrained sparse regression (Bioucas-Dias and Figueiredo, 2010). Beyond convex problems, ADMM is also a widely used solution to some nonconvex problems, whose objective functions could be nonconvex, nonsmooth, or both. (Wang et al., 2019) shows that ADMM is able to converge as long as the objective has a smooth part, while the remaining part can be coupled or nonconvex, or include separable nonsmooth functions. Applications of ADMM to nonconvex problems include network inference (Miksik et al., 2014), global conformal mapping, and noisy color image restoration (Lai and Osher, 2014).

2.5. Adversarial Defenses
Both learning and security communities have developed many defensive methods against adversarial examples. Among them, adversarial training and several certified defenses attract the most attention due to their outstanding/guaranteed performance against strong attacks (He et al., 2017; Uesato et al., 2018; Athalye et al., 2018). In the following, we briefly introduce adversarial training and several certified defenses, including the randomized smoothing method adopted in this paper.
Adversarial Training
Adversarial training is one of the most successful empirical defenses of the past few years (Goodfellow et al., 2014; Madry et al., 2017; Zhang et al., 2019). The intuition of adversarial training is to improve model robustness by training the model with adversarial examples. Although adversarial training achieves tremendous success against many strong attacks (Zheng et al., 2019; Andriushchenko et al., 2019; Tashiro et al., 2020), its performance is not theoretically guaranteed and thus might be compromised in the future. Besides, adversarial training always requires much more computational resources than standard training, making it less scalable to complicated models.
Certified Defenses
A defense with a theoretical guarantee on its defensive performance is considered a certified defense. In general, there are three main approaches to designing certified defenses. The first approach formulates the certification problem as an optimization problem and bounds it with the dual approach and convex relaxations (Dvijotham et al., 2018; Raghunathan et al., 2018; Wong and Kolter, 2018). The second approach approximates a convex set that contains all the possible outputs of each layer to certify an upper bound on the range of the final output (Mirman et al., 2018; Gowal et al., 2018; Wang et al., 2018). The third is the randomized smoothing method used in this paper. The only essential operation for this method is to add Gaussian/Laplace noise to the inputs, which is simple and applicable to any deep learning model. (Lecuyer et al., 2018) first proves that randomized smoothing is a certified defense using theories from differential privacy. (Li et al., 2018a) improves the certified bound using a lemma on Renyi divergence. (Cohen et al., 2019) proves a tight bound on the robust radius certified by adding Gaussian noise using the Neyman-Pearson lemma. (Jia et al., 2019) further extends the approach of (Cohen et al., 2019) to the top-k classification setting. Since the bound proved by (Cohen et al., 2019) is the tightest, we use it for certification, while we adopt the approach in (Lecuyer et al., 2018) for inference due to its efficiency in practice.
3. Threat Model
3.1. Adversary Knowledge: White-box Setting
In this paper, we follow the white-box setting, where the adversary has full access to the model architecture and parameters. We make this assumption because (i) it is a safe, conservative, and realistic assumption, since we might never know a potential adversary's knowledge about the model (Carlini and Wagner, 2017), which varies among adversaries and also changes over time; and (ii) for systems/devices equipped with an action recognition model, recognition is more likely to be done locally, or on a local cloud, making it easy for the adversary to acquire the model parameters from his own system/device. Note that although most of the experiments on the proposed attack and defense are done under the white-box setting, we also conduct several experiments evaluating the transferability of our attack.
3.2. Adversary Goal: Targeted & Untargeted Settings
Under the targeted setting, the goal of an adversary is to mislead the recognition model into predicting the adversarial skeleton action as a targeted label predefined by the adversary. For instance, suppose the adversary is “kicking” someone under a surveillance camera equipped with an action recognition model. The adversary may launch a targeted attack to mislead the model into recognizing this violent action as a normal one such as “drinking water”. Under the untargeted setting, an adversary only aims to disable the recognition, and is thus considered successful as long as the model makes a wrong prediction instead of a specific targeted prediction. In this paper, we propose two objectives suitable for the above two settings respectively, which will be detailed in section 4.4.
3.3. Imperceptibility & Reproducibility
Beyond the aforementioned adversary goals, the adversary also requires the adversarial perturbation to be both imperceptible and reproducible. Here, “imperceptibility” means it should be difficult for human vision to detect the adversarial perturbation, i.e., the difference between the original and adversarial skeleton actions. This is not only a common requirement in previous attacks, but also a useful one in our scenario. Note that it is natural to schedule periodic examinations of an autonomous surveillance system by human labor to check whether the system works well. If the system has been fooled by a seemingly “normal” adversarial skeleton action, the mistake might be attributed to the system itself rather than to the adversary who performs the adversarial skeleton action during the examination process. Here, “reproducibility” is an additional requirement specific to our scenario. As mentioned in the introduction, an adversarial skeleton action becomes a real threat only when it can be reproduced under a real-world system. Thus, to make our attack a real-world threat, the generated adversarial skeleton actions should satisfy three concrete constraints, which will be detailed in section 4.
4. Adversarial Skeleton Action
In this section, we present our proposed attack, i.e., ADMM attack. We first introduce how to formulate the three constraints into mathematical equations. Then we formulate the constrained optimization problem to generate adversarial skeleton actions under both targeted and untargeted settings. Finally, we elaborate on how to solve the optimization problem by ADMM.
4.1. Bone Constraints
We again take the skeletons in the NTU RGB+D dataset as an example. As shown in Fig. 2, there are in total 25 joints in a skeleton, forming a total of 24 bones. While the bones are not explicitly considered in modeling, they strictly connect the 25 joints, thus imposing 24 bone-length constraints, i.e., the distance between the joints at the two ends of a bone should remain the same in adversarial skeleton actions. To mathematically represent the 24 bones, we associate each joint with its preceding joint, forming the two ends of a bone. As a result, the preceding joints of joint-2 through joint-25 are given by the index list P = {1, 21, 3, 21, 5, 6, 7, 21, 9, 10, 11, 1, 13, 14, 15, 1, 17, 18, 19, 2, 8, 8, 12, 12}. We define the i-th bone's length in frame t as B_{t,i}(x) = ||x_{t,i} − x_{t,P(i)}||_2. In this regard, the bone constraints can be represented as B_{t,i}(x_adv) = B_{t,i}(x). Due to the measurement errors in the NTU dataset itself, we also tolerate a very small difference between B_{t,i}(x_adv) and B_{t,i}(x). Therefore, we can finally formulate the bone constraints as

|B_{t,i}(x_adv) − B_{t,i}(x)| ≤ ε_b,    (4)

where ε_b is a small tolerance. Note that inequality constraints in the primal problem would impose inequality constraints on the corresponding Lagrangian variables in the dual problem. To avoid this, we reformulate the above inequality constraints as equations, i.e., (4) is equivalent to

max( |B_{t,i}(x_adv) − B_{t,i}(x)| − ε_b, 0 ) = 0.    (5)
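The residual of Eq. 5 can be sketched as follows; the 5-joint `PRECEDING` list and the tolerance value are illustrative stand-ins for the NTU skeleton's 24 preceding-joint indices:

```python
import numpy as np

# Hypothetical preceding-joint list for a 5-joint toy skeleton
# (joint i is connected to joint PRECEDING[i]; joint 0 is the root).
PRECEDING = [0, 0, 1, 2, 3]

def bone_lengths(x, preceding):
    """x: (T, N, 3) joint coordinates. Returns (T, N-1) bone lengths."""
    idx = np.arange(1, x.shape[1])
    return np.linalg.norm(x[:, idx] - x[:, [preceding[i] for i in idx]], axis=-1)

def bone_violation(x, x_adv, preceding, eps_b=1e-4):
    """Equality-constraint residual of Eq. 5: zero everywhere iff every
    bone length changed by at most eps_b."""
    diff = np.abs(bone_lengths(x_adv, preceding) - bone_lengths(x, preceding))
    return np.maximum(diff - eps_b, 0.0)
```

An unperturbed action trivially yields an all-zero residual; during the attack, the residual is what the Lagrangian multipliers penalize.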
4.2. Joint Angle Constraints
Beyond the bone-length constraints, we also need to impose constraints on the rotations of the joint angles according to the physiological structure of human beings. Let us again use the NTU dataset as an example. Each joint angle corresponds to the angle between two bones, and thus can be represented by the three joint locations of those two bones, as illustrated on the right of Fig. 2. A natural way to compute the joint angle is to first compute its cosine value and then feed the value into the arccos function. However, the gradient of the arccos function is likely to explode, causing large numerical errors when the cosine of the joint angle is close to ±1. To deal with this issue, we derive an approximate upper bound on the change of the joint angle that avoids computing the arccos function and its gradient. Taking the right of Fig. 2 as an example, the angle change caused by the displacement of joint-9 can be approximated by the norm of the displacement divided by the corresponding bone length (a small-angle approximation); when the angle change is small, this approximation is almost exact. The total angle change is then upper bounded by the sum of the changes caused by the displacements of joint-9, joint-10, and joint-11. Although this representation looks more complicated than the arccos function, its gradient can be computed efficiently and accurately. Given such an approximation, the joint angle constraints can be represented similarly to Eq. 5 as

max( ΔA_j(x_adv) − ε_a, 0 ) = 0,    (6)

where ΔA_j denotes the upper bound on the change of the j-th joint angle, and ε_a is a small tolerance.
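To illustrate the motivation, the following sketch contrasts the exact arccos-based angle with a linearized displacement bound; `angle_change_bound` is our small-angle reading of the approximation described above, not necessarily the authors' exact formula:

```python
import numpy as np

def angle_exact(a, b, c):
    """Joint angle at b between bones (b->a) and (b->c) via arccos;
    the arccos gradient explodes as the cosine approaches +-1."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def angle_change_bound(delta, bone_vec):
    """Small-angle surrogate: displacing one bone endpoint by `delta`
    changes the angle by at most roughly ||delta|| / ||bone||."""
    return np.linalg.norm(delta) / np.linalg.norm(bone_vec)
```

For a right angle between unit bones, displacing one endpoint by 0.1 changes the angle by just under 0.1 radians, so the surrogate is a valid (and nearly tight) bound in the small-perturbation regime the attack operates in.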
4.3. Speed Constraints
According to the physical conditions of human beings, we should consider one more type of constraint, i.e., temporal smoothness constraints. With these constraints, we restrict the speeds of the motions in the generated adversarial skeleton actions. Specifically, the speed of a joint can be approximated by its displacement between two consecutive frames, i.e., v_{t,j}(x) = x_{t+1,j} − x_{t,j}. Then, similar to Eq. 5, we bound the change of the speeds by

max( ||v_{t,j}(x_adv) − v_{t,j}(x)||_2 − ε_s, 0 ) = 0,    (7)

where ε_s is usually set to a small value.
4.4. Constrained Primal Problem Formulation
In this subsection, we introduce the main objectives used under the untargeted setting and targeted setting.
Untargeted Setting
Under the untargeted setting, the adversary achieves its goal as long as the DNN makes a prediction other than the ground-truth label, i.e., argmax_k f_k(x_adv; θ) ≠ y. This holds iff max_{k≠y} f_k(x_adv; θ) > f_y(x_adv; θ). Therefore, we define the objective as minimizing max( f_y(x_adv; θ) − max_{k≠y} f_k(x_adv; θ) + κ, 0 ), where κ is the desired confidence margin of the DNN on the wrong prediction. Note that if the objective is equal to 0, we have max_{k≠y} f_k(x_adv; θ) ≥ f_y(x_adv; θ) + κ.
Targeted Setting
The goal of the adversary is to render the prediction result equal to the attack target y_t, i.e., argmax_k f_k(x_adv; θ) = y_t. Therefore, the primal objective is defined as minimizing the cross entropy between the prediction on x_adv and y_t, or minimizing max( max_{k≠y_t} f_k(x_adv; θ) − f_{y_t}(x_adv; θ) + κ, 0 ) following the logic of the untargeted setting.
We can also adopt other objectives for our purpose; however, the above two are the most commonly-used ones in previous work (Kurakin et al., 2016; Madry et al., 2017; Carlini and Wagner, 2017). For simplicity, we denote the main loss by L_main(x_adv). The constrained primal problem can then be formulated as
minimize_{x_adv}  L_main(x_adv)    (8)
subject to  Eqs. (5), (6), and (7),

where L_main denotes the main loss defined above.
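Equality-constrained problems of this form can be optimized by alternating between minimizing an augmented Lagrangian over the primal variable and taking a gradient-ascent step on the multipliers. A toy scalar sketch, where f and c are illustrative stand-ins for the attack objective and the constraint residuals of Eqs. 5-7:

```python
import numpy as np

def solve_toy_admm(rho=2.0, inner_steps=50, outer_steps=30, lr=0.1):
    """Toy dual optimization: minimize f(x) = (x-2)^2 subject to
    c(x) = x - 1 = 0.  Inner loop: gradient descent on the augmented
    Lagrangian (the full attack uses Adam here); outer loop: one-step
    dual ascent on lambda with step size rho."""
    x, lam = 0.0, 0.0
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            # d/dx [ (x-2)^2 + lam*(x-1) + (rho/2)*(x-1)^2 ]
            grad = 2 * (x - 2) + lam + rho * (x - 1)
            x -= lr * grad
        lam += rho * (x - 1)  # single-step gradient ascent on the dual
    return x, lam

x_opt, lam_opt = solve_toy_admm()
```

The iterates converge to the constrained optimum x = 1 with multiplier λ = 2, mirroring how the attack drives the constraint residuals to zero while minimizing the main loss.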
4.5. Dual Optimization by ADMM
Note that our constrained primal problem is in general intractable. Instead of searching for a solution to the constrained primal problem, we propose to formulate and optimize its unconstrained dual problem via ADMM. The algorithm is illustrated in Alg. 1. Specifically, we first define the augmented Lagrangian L(x_adv, λ) of the constrained primal problem as shown in Alg. 1. The additional quadratic penalty term (ρ/2)||c(x_adv)||², where c(·) collects the equality-constraint residuals, is commonly used in ADMM (for nonconvex problems) and further penalizes any violation of the equality constraints. We note that a larger ρ usually leads to smaller violation but a larger final main objective (which decreases the attack success rate).

Given the Lagrangian (defined in Alg. 1), the dual problem is max_λ min_{x_adv} L(x_adv, λ). Since the internal function is affine w.r.t. the dual variables λ, we can simply use single-step gradient ascent with a large step size (usually set to ρ in ADMM) to update the dual variables. However, L is an extremely complicated nonconvex function w.r.t. the adversarial sample x_adv. Therefore, in most cases, we can only guarantee local optima for the internal minimization problem. Fortunately, it turns out that even the local optima can reliably fool the DNNs. To find a local optimum efficiently, we adopt the Adam optimizer instead of vanilla stochastic gradient descent (SGD), since Adam usually converges faster. A local minimum is expected because the Adam optimizer stops updating the variables when the gradients are (close to) 0. Next, we further look into the evolution of the loss during the optimization process. As shown in Fig. 3, at the very beginning (i.e., the first stage), the internal minimization problem finds adversarial samples with large violation of the constraints. The large violation causes the Lagrangian multipliers to increase rapidly, and thus significantly increases the bone, joint-angle, and speed loss terms. As a result, the algorithm proceeds into the second stage, where the Adam optimizer focuses more on diminishing the constraint violations when optimizing x_adv. Finally, the algorithm proceeds into a relatively stable stage where we can stop the algorithm. According to Fig. 3, our algorithm is very efficient in the sense that it only needs 200 (internal) iterations to enter the final stable stage.

5. Defense against Adversarial Skeleton Actions
Although the methods proposed in (Li et al., 2018a; Cohen et al., 2019) can certify larger robust radii than (Lecuyer et al., 2018), the sample complexity required to compute the confidence intervals in (Li et al., 2018a; Cohen et al., 2019) leads to considerable computational overhead in the inference stage. Therefore, we only use the method in (Cohen et al., 2019) in the certification process. In the inference stage, we modify the method in (Lecuyer et al., 2018) to build a relatively efficient defense against adversarial skeleton actions, as shown in Alg. 2. In general, our proposed defense consists of two steps: adding Gaussian noise and temporal filtering with a Gaussian kernel. In the following, we detail these two steps and explain why we include them in the defense.

5.1. Additive Gaussian Noise
Our first step is adding Gaussian noise to the skeleton actions. In the inference stage, we follow (Lecuyer et al., 2018) to make the prediction as argmax_k P(f(A(x)) = k), where A is a randomized mechanism that perturbs the input with Gaussian noise and applies a post-processing function. To estimate this probability, we sample N noisy copies of x from N(x, σ²I) and feed them through the post-processing function and the neural network. The probability of each class is estimated by its empirical frequency among the N predictions, and according to the Chernoff bound (Boucheron et al., 2013), the error of this estimation is bounded with high probability.
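The Monte Carlo estimate above can be sketched as follows; the base classifier, seed, and sample count are illustrative, and the post-processing step is folded into `classify`:

```python
import numpy as np

def smoothed_predict(classify, x, sigma, n_samples=100, rng=None):
    """Randomized-smoothing inference: add Gaussian noise N(0, sigma^2 I)
    to x, query the base classifier, and return the most frequent class
    together with its empirical frequency."""
    rng = rng or np.random.default_rng(0)
    counts = {}
    for _ in range(n_samples):
        k = classify(x + rng.normal(0.0, sigma, size=x.shape))
        counts[k] = counts.get(k, 0) + 1
    label = max(counts, key=counts.get)
    return label, counts[label] / n_samples

# Hypothetical 1-D base classifier: thresholds the mean coordinate.
toy = lambda z: int(z.mean() > 0)
label, p_hat = smoothed_predict(toy, np.full(8, 2.0), sigma=0.5)
```

For an input far from the decision boundary, nearly all noisy copies agree, so the empirical frequency is close to 1; near the boundary, the frequency drops and the prediction becomes less certain.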
In the certification stage, we rely on the main theorem from (Cohen et al., 2019), which gives the currently tightest bound:
Lemma 5.1.
Denote a mechanism randomized by Gaussian noise N(0, σ²I) by A, and the ground-truth label by y. Define p_A = P(f(A(x)) = y) and p_B = max_{k≠y} P(f(A(x)) = k). Suppose the estimated bounds p_A_lb ≤ p_A and p_B_ub ≥ p_B satisfy

p_A_lb ≥ p_B_ub,    (9)

then the robust ℓ2 radius is R = (σ/2)( Φ^{-1}(p_A_lb) − Φ^{-1}(p_B_ub) ), where Φ^{-1} is the inverse of the standard Gaussian CDF.
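The radius formula of Lemma 5.1 is a one-liner given the estimated class probabilities; this sketch uses scipy's inverse Gaussian CDF, with the function name being our own:

```python
from scipy.stats import norm

def certified_radius(sigma, p_a_lower, p_b_upper):
    """Robust l2 radius from (Cohen et al., 2019):
    R = sigma/2 * (Phi^{-1}(p_A_lb) - Phi^{-1}(p_B_ub)),
    valid only when the bounds satisfy p_A_lb >= p_B_ub."""
    if p_a_lower < p_b_upper:
        return 0.0  # condition of Eq. 9 violated: nothing is certified
    return 0.5 * sigma * (norm.ppf(p_a_lower) - norm.ppf(p_b_upper))
```

For instance, with σ = 1, a lower bound of 0.9 on the top class and an upper bound of 0.1 on the runner-up certify a radius of about 1.28; the radius grows with σ and with the gap between the two probabilities.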
5.2. Temporal Filtering by Gaussian Kernel
After adding Gaussian noise to the skeleton actions, we propose to further smooth the action along the temporal axis with a Gaussian filter. The intuition is that adjacent frames in a skeleton action sequence are very similar to each other, and thus can serve as references to rectify the adversarial perturbations. Although this additional operation does not improve the certification results, we observe that it makes our defense more compatible with a normally trained model than the original randomized smoothing method in (Lecuyer et al., 2018; Cohen et al., 2019). We also argue that this simple operation is not usually used in previous work because it is not suitable in the image recognition domain, where no adjacency information (along a temporal axis) is available.
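The temporal filtering step amounts to a 1-D Gaussian convolution applied only along the frame axis; a minimal sketch with an illustrative kernel width:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def temporal_smooth(x, sigma=1.0):
    """Smooth a skeleton sequence of shape (T, N, 3) along the temporal
    axis only, leaving the spatial layout of each frame untouched."""
    return gaussian_filter1d(x, sigma=sigma, axis=0, mode='nearest')
```

A constant sequence passes through unchanged, while high-frequency perturbations (adversarial or random) are attenuated, which is exactly the behavior the defense relies on.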
White-box       NTU CV                                   NTU CS
Untargeted      Success Rate                             Success Rate
HCN             100%    2.64%  0.132  4.52%  0.396       100%    2.17%  0.111  3.17%  0.347
                100%    1.92%  0.099  1.65%  0.330       100%    1.62%  0.086  1.30%  0.290
                92.8%   1.50%  0.085  1.25%  0.270       92.4%   1.25%  0.073  0.98%  0.241
2s-AGCN         100%    2.17%  0.112  1.62%  0.653       100%    1.97%  0.107  2.20%  0.614
                100%    1.70%  0.094  0.59%  0.528       100%    1.46%  0.086  0.57%  0.496
                99.0%   1.37%  0.083  0.39%  0.428       98.8%   1.19%  0.078  0.34%  0.413
White-box       NTU CV                                   NTU CS
Targeted        Success Rate                             Success Rate
HCN             100%    3.60%  0.165  7.75%  0.673       100%    3.55%  0.165  6.68%  0.723
                99.7%   3.24%  0.156  4.69%  0.630       100%    3.16%  0.155  4.24%  0.674
                22.3%   2.27%  0.115  2.83%  0.444       26.9%   2.14%  0.112  2.50%  0.462
2s-AGCN         100%    1.66%  0.090  0.55%  0.569       100%    1.67%  0.091  0.71%  0.649
                100%    1.61%  0.091  0.42%  0.556       100%    1.56%  0.090  0.49%  0.615
                97.2%   1.54%  0.089  0.38%  0.512       97.9%   1.47%  0.087  0.40%  0.552
6. Experiments
6.1. Attack Performance
Main Results
The main results of our attack are shown in Table 1. As we can see, our proposed attack can achieve 100% success rates with very small violation of the constraints: both the averaged normalized bone-length difference and the violation of the joint angles remain small. Considering that skeleton data is usually noisy, such subtle violations are very common in the real world. We also provide more experimental results in the supplementary material (e.g., on Kinetics).
We also note that adversarial-sample generation under the untargeted setting is usually easier than under the targeted setting, since a targeted adversarial sample is guaranteed to be an untargeted adversarial sample, but not vice versa. This fact is also reflected by the results in Table 1. Furthermore, in Figure 4, we show the visualization of an adversarial skeleton action (recognized as the normal action “drinking water”) generated by our attack, which is almost visually indistinguishable from the original skeleton action (“kicking”). Figures of more adversarial actions are attached in the appendix.
Source (Model) → Target    Dataset    Transfer success rates
HCN(1) → HCN(2)            NTU CV     24.7%   26.0%
                           NTU CS     28.5%   32.6%
HCN(1) → 2s-AGCN           NTU CV     17.6%   20.4%
                           NTU CS     17.3%   19.6%
Transferability
To shed light on the transferability of our attack, we feed the adversarial skeleton actions generated on one HCN model to another HCN model and to 2s-AGCN, respectively. To boost transferability, we set to or to generate adversarial skeleton actions with larger perturbations. The attack success rates are given in Table 2. Similar to 3D adversarial point clouds (Xiang et al., 2018), the transferability of adversarial skeleton actions is somewhat limited compared with that of adversarial images.
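Measuring transferability amounts to re-classifying the adversarial actions with the target model. A minimal sketch, assuming the target model is a callable that returns per-class scores (the function name and signature are ours, not the paper's):

```python
import numpy as np

def transfer_success_rate(adv_batch, labels, target_model):
    """Untargeted transfer rate: fraction of adversarial actions, crafted
    on a source model, that a *different* target model also misclassifies.

    adv_batch: (N, ...) adversarial inputs; labels: (N,) true labels;
    target_model: callable mapping the batch to (N, C) class scores.
    """
    preds = np.argmax(target_model(adv_batch), axis=-1)
    return float(np.mean(preds != labels))
```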
Comparison with C&W Attack
We use the C&W attack as an example to illustrate the difference between our attack and existing attacks. The C&W attack has been demonstrated to be a successful optimization-based adversarial attack in many application domains. However, since the C&W attack mainly minimizes the distance between the original and adversarial skeletons, it easily violates the constraints, as shown in our simple case study (Table 3).
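For reference, the C&W objective trades off perturbation size against a classification margin loss, with no term for the skeleton constraints, which is exactly why it can violate them. A minimal numpy sketch of the targeted objective (our own illustration, not the original implementation):

```python
import numpy as np

def cw_margin_loss(logits, target, kappa=0.0):
    """C&W targeted margin loss: f(x') = max(max_{i != t} Z_i - Z_t, -kappa).
    Non-positive once the target class t leads by at least kappa."""
    z_t = logits[target]
    z_other = np.max(np.delete(logits, target))
    return max(z_other - z_t, -kappa)

def cw_objective(delta, logits, target, c=1.0, kappa=0.0):
    """Full C&W objective: squared l2 size of the perturbation delta plus a
    weighted margin loss. Note the absence of any bone-length, joint-angle,
    or speed terms -- the constraints our attack enforces."""
    return float(np.sum(delta ** 2) + c * cw_margin_loss(logits, target, kappa))
```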
Untargeted    Success Rate   (constraint-violation statistics; headers lost in extraction)
NTU CV        100%           4.67%  0.241  13.0%  0.278
NTU CS        100%           4.09%  0.211  10.2%  0.244

Targeted      Success Rate
NTU CV        100%           8.82%  0.468  38.1%  0.510
NTU CS        100%           9.45%  0.507  36.8%  0.520
6.2. Defense Performance
Empirical Results
We demonstrate the performance of the defense for inference in Table 4. We set to generate adversarial examples, and set (Alg. 2), which is much smaller than the number of samples required for certification but achieves good empirical performance, as shown in Table 4. It is much easier to defend against adversarial skeleton actions under the targeted setting than under the untargeted setting. Note that the accuracy of HCN on NTU-CV and NTU-CS is and , respectively (Li et al., 2018b), and the accuracy of 2s-AGCN is and , respectively (Shi et al., 2019).
Model     Setting       NTU CV          NTU CS
HCN       Untargeted    62.0%  62.3%    50.6%  51.4%
          Targeted      79.4%  70.8%    67.1%  58.3%
2s-AGCN   Untargeted    51.0%  42.2%    42.1%  40.2%
          Targeted      60.8%  50.5%    42.2%  44.1%

(Two sub-columns per dataset; the sub-column headers were lost in extraction.)
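Defensive inference of this kind follows the randomized-smoothing recipe: classify many Gaussian-perturbed copies of the input and take a majority vote. A sketch under our own assumptions about the model interface (a callable mapping a batch to per-class scores); this is an illustration of the general technique, not the paper's Alg. 2 verbatim.

```python
import numpy as np

def smoothed_predict(model, x, sigma, n, seed=None):
    """Defensive inference sketch: add i.i.d. Gaussian noise of scale sigma
    to n copies of the skeleton sequence x, classify each noisy copy, and
    return the majority-vote label.

    model: callable mapping a batch of shape (n, ...) to (n, C) scores.
    """
    rng = np.random.default_rng(seed)
    noisy = x[None] + sigma * rng.standard_normal((n,) + x.shape)
    preds = np.argmax(model(noisy), axis=-1)
    return int(np.bincount(preds).argmax())
```

A small n suffices for empirical defense, while certification (next subsection) needs far more samples to bound the vote probabilities.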
Certified Results
Due to the high computational cost of the certification method (N = 1000), we mainly evaluate the certification algorithm on HCN. The certified accuracy achieved under different levels of noise is shown in Fig. 5. Note that we train the model with the same level of noise as is used for certification. As the figure shows, at the cost of some accuracy on clean samples, the method achieves about certified accuracy ().
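The certification step converts the N noisy-vote counts into a certified l2 radius. A minimal sketch, using a Hoeffding-style lower confidence bound as a simpler stand-in for the Clopper-Pearson interval used by Cohen et al. (2019); the function name and parameters are our own:

```python
import math
from statistics import NormalDist

def certified_radius(n_correct, n, sigma, alpha=0.001):
    """Smoothing-certificate sketch: lower-bound the probability of the
    top class under noise via a Hoeffding concentration bound, then
    convert it to a certified l2 radius R = sigma * Phi^{-1}(p_lower).
    Returns 0.0 when the bound cannot certify (abstain)."""
    p_hat = n_correct / n
    # Hoeffding: with prob. >= 1 - alpha, true p >= p_hat - sqrt(ln(1/alpha)/(2n))
    p_lower = p_hat - math.sqrt(math.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return 0.0  # majority not certifiably above 1/2: abstain
    return sigma * NormalDist().inv_cdf(p_lower)
```

This makes the cost of N = 1000 concrete: the bound tightens only at rate 1/sqrt(N), so large sample counts are needed for a nontrivial radius.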
7. Conclusion
We study the adversarial vulnerability of skeleton-based action recognition. We first identify and formulate three main constraints that adversarial skeleton actions should satisfy. Since the corresponding constrained optimization problem is intractable, we propose to optimize its dual problem by ADMM, a generic method, first proposed in this paper, for generating constrained adversarial examples. To defend against adversarial skeleton actions, we further present an efficient defensive inference algorithm and a certification algorithm. The effectiveness of the attack and the defense is demonstrated on two open-source models, and the results yield several interesting observations that help us better understand adversarial skeleton actions.
References
 Square attack: a query-efficient black-box adversarial attack via random search. arXiv preprint arXiv:1912.00049.
 Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.
 Alternating direction algorithms for constrained sparse regression: application to hyperspectral unmixing. In 2010 2nd Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing, pp. 1–4.
 Concentration inequalities: a non-asymptotic theory of independence. Oxford University Press.
 Towards evaluating the robustness of neural networks. In Security and Privacy (SP), 2017 IEEE Symposium on, pp. 39–57.
 Advances in human action recognition: a survey. arXiv preprint arXiv:1501.05964.
 Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918.
 A dual approach to scalable verification of deep networks. In UAI, pp. 550–559.
 Consensus-based distributed support vector machines. Journal of Machine Learning Research 11 (May), pp. 1663–1707.
 Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
 On simultaneous confidence intervals for multinomial proportions. Technometrics 7 (2), pp. 247–254.
 On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
 Adversarial example defense: ensembles of weak defenses are not strong. In 11th USENIX Workshop on Offensive Technologies (WOOT 17).
 Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. arXiv preprint arXiv:1912.09899.
 A new representation of skeleton sequences for 3D action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3288–3297.
 Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236.
 A splitting method for orthogonality constrained problems. Journal of Scientific Computing 58 (2), pp. 431–449.
 Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471.
 Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113.
 Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. arXiv preprint arXiv:1804.06055.
 Independently recurrent neural network (IndRNN): building a longer and deeper RNN. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
 Adversarial attack on skeleton-based human action recognition. arXiv preprint arXiv:1909.06500.
 Disentangling and unifying graph convolutions for skeleton-based action recognition. arXiv preprint arXiv:2003.14111.
 Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
 Distributed non-convex ADMM-inference in large-scale random fields. In British Machine Vision Conference (BMVC), Vol. 2.
 Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning, pp. 3575–3583.
 Certified defenses against adversarial examples. arXiv preprint arXiv:1801.09344.
 NTU RGB+D: a large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1010–1019.
 Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12026–12035.
 An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1227–1236.
 Output diversified initialization for adversarial attacks. arXiv preprint arXiv:2003.06878.
 Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, pp. 5032–5041.
 Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, pp. 6367–6377.
 Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing 78 (1), pp. 29–63.
 Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292.
 Generating 3D adversarial point clouds. arXiv preprint arXiv:1809.07016.
 Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
 Fast ℓ1-minimization algorithms for robust face recognition. IEEE Transactions on Image Processing 22 (8), pp. 3234–3246.
 Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573.
 An ADMM-based universal framework for adversarial attacks on deep neural networks. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1065–1073.
 Distributionally adversarial attack. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 2253–2260.
Appendix A Appendix
Additional visualization results
Here we provide more visualization results. We use "drinking water" as the attack target because it is a normal action that looks completely different from violent/abnormal actions such as throwing, kicking, pushing, and punching. Despite the obvious visual difference between "drinking water" and those abnormal actions, our attack can still fool state-of-the-art models into recognizing the abnormal actions as "drinking water" through imperceptible and reproducible perturbations, which indicates that our attack is very powerful. In Fig. 6, we show that our attack fools the HCN model into recognizing the "throwing", "pushing", and "kicking" actions as the normal action "drinking water" via imperceptible adversarial perturbations. Similarly, in Fig. 7, we show that our attack fools the 2s-AGCN model into recognizing the "throwing" and "pushing" actions as "drinking water". These visualization results, along with the quantitative results in Table 1 (in the paper), demonstrate that the perturbations are indeed imperceptible and reproducible.
Kinetics Dataset
In addition to the NTU dataset, we also evaluate our attack on another popular dataset, the Kinetics-400 dataset, under both the untargeted and targeted settings. As shown in Table 5, under the untargeted setting, our attack achieves 100% attack success rates with very small violations of the constraints, similar to its performance on the NTU dataset. Under the targeted setting, however, it is much more difficult for our attack to find targeted adversarial skeleton actions with very small constraint violations. This is because Kinetics-400 has 400 action classes, while the NTU dataset has only 60. We also argue that the targeted results on Kinetics do not devalue our attack, since even for most clean test samples from Kinetics, it is difficult for state-of-the-art models to predict the ground-truth labels (targets).
Untargeted — Kinetics-400. Columns: attack success rate followed by four constraint-violation statistics (headers lost in extraction).

Model     Success Rate
HCN       100%   2.60%  0.082  1.66%  0.150
          100%   2.58%  0.080  1.52%  0.162
          98.8%  2.49%  0.078  1.21%  0.145
2s-AGCN   100%   0.91%  0.053  0.58%  0.331
          100%   0.77%  0.047  0.53%  0.298
          100%   0.75%  0.046  0.52%  0.287

Targeted — Kinetics-400

Model     Success Rate
HCN       90.2%  5.22%  0.220  11.2%  1.864
          67.2%  2.79%  0.124  4.86%  1.350
          17.2%  1.44%  0.073  2.36%  0.763
2s-AGCN   99.2%  5.25%  0.167  1.20%  0.725
          98.8%  5.04%  0.159  1.21%  0.722
          98.4%  4.89%  0.153  1.03%  0.677