# Fast and Scalable Adversarial Training of Kernel SVM via Doubly Stochastic Gradients

## 1 Introduction

Machine learning models have long been shown to be vulnerable to adversarial attacks, which apply subtle perturbations to the inputs that lead to incorrect outputs. The perturbed inputs are called adversarial examples, and the perturbations that cause misclassification are often imperceptible. This serious threat has recently led to a large influx of contributions on adversarial attacks, especially for deep neural networks (DNNs). These attack methods include FGSM Goodfellow et al. (2014), PGD Madry et al. (2017), CW Carlini and Wagner (2017), ZOO Chen et al. (2017) and so on. We give a brief review of them in the related work section.

The topic of adversarial attacks has also attracted much attention in the field of SVM. Dalvi et al. (2004) and later Lowd and Meek (2005a, b) studied the task of spam filtering, showing that a linear SVM could be easily tricked by a few carefully crafted changes in the content of spam emails, without affecting their readability. Other attacks, such as the label flipping attack Biggio et al. (2011); Xiao et al. (2012), the poisoning attack Biggio et al. (2012); Xiao et al. (2015) and the evasion attack Biggio et al. (2013), have also demonstrated the vulnerability of SVM to adversarial examples.

Due to the serious threat of these attacks, defense techniques that improve the adversarial robustness of learning models are crucial for secure machine learning. Most defensive strategies nowadays focus on DNNs, such as defensive distillation Papernot et al. (2016), gradient regularization Ross and Doshivelez (2018) and adversarial training Madry et al. (2017), among which adversarial training has been demonstrated to be the most effective Athalye et al. (2018). This method focuses on a min-max problem, where the inner maximization finds the most aggressive adversarial examples and the outer minimization finds model parameters that minimize the loss on those adversarial examples. Many forms of adversarial training on DNNs have since been proposed which further improve robustness and training efficiency compared with standard adversarial training Shafahi et al. (2019); Carmon et al. (2019); Miyato et al. (2019); Wang et al. (2020).

Since SVM is a classical and important learning model in machine learning, improving its security and robustness is also critical. However, to the best of our knowledge, the only work on adversarial training for SVM is limited to linear SVMs. Specifically, Zhou et al. (2012) formulated a convex adversarial training formula for linear SVMs, in which the constraint is defined over the sample space based on two kinds of attack models. Datasets with complex structures can hardly be classified by linear SVMs, but can be easily handled by kernel SVMs. We give a brief review of adversarial training strategies of DNNs and SVMs in Table 1. From this table, it is easy to see that improving the robustness of kernel SVMs against adversarial examples is still an unstudied problem.

To fill this gap, in this paper, we focus on kernel SVMs and propose adv-SVM to improve their adversarial robustness via adversarial training. To the best of our knowledge, this is the first work devoted to fast and scalable adversarial training of kernel SVMs. Specifically, we first build connections of perturbations between the original and kernel spaces, i.e., $\phi(x+\delta)$ and $\phi(x)+\delta_\phi$, where $\delta$ is the perturbation added to the normal example in the original space, $\delta_\phi$ is the perturbation in the kernel space and $\phi(\cdot)$ is the corresponding feature mapping. Then we construct the simplified and equivalent form of the inner maximization and transform the min-max objective function into a convex minimization problem based on the connection of the perturbations. However, directly optimizing this minimization problem is still difficult since the kernel function, which is necessary for the optimization, needs $O(n^2 d)$ operations to be computed, where $n$ is the number of training examples and $d$ is the dimension. This high computational complexity hinders its application to large-scale datasets.

To further improve its scalability, we apply the doubly stochastic gradient (DSG) algorithm Dai et al. (2014) to solve our objective. Specifically, in each iteration, we randomly select one training example and one random feature parameter to approximate the value of a kernel function instead of computing it directly. The DSG algorithm effectively reduces the computational complexity of solving kernel methods from $O(n^2 d)$ to $O(td)$, where $t$ is the number of iterations Gu and Huo (2018).

The main contributions of this paper are summarized as follows:

• We develop an adversarial training strategy, adv-SVM, for kernel SVM based on the relationship of perturbations between the original and kernel spaces and transform it into an equivalent convex optimization problem, then apply the DSG algorithm to solve it in an efficient way.

• We provide a comprehensive theoretical analysis to prove that our proposed adv-SVM converges to the optimal solution at the rate of $O(1/t)$ with either a constant stepsize or a diminishing stepsize, even though it is based on the approximation principle.

• We investigate the performance of adv-SVM under typical white-box and black-box attacks. The empirical results suggest that our proposed algorithm can complete the training process in a relatively short time and stay robust in the face of various attack strategies at the same time.

## 2 Related Work

During the development of adversarial machine learning, a wide range of attacking methods have been proposed for the crafting of adversarial examples. Here we mention some frequently used ones which are useful for generating adversarial examples in our experiments.

• Fast Gradient Sign Method (FGSM). FGSM, which is a white-box attack, perturbs normal examples for one step by the amount $\epsilon$ along the sign of the gradient Goodfellow et al. (2014).

• Projected Gradient Descent (PGD). PGD, which is also a white-box attack method, perturbs normal examples for a number of steps with a smaller step size and keeps the adversarial examples in the $\epsilon$-ball, where $\epsilon$ is the maximum allowable perturbation Madry et al. (2017).

• CW. CW is also a white-box attack method, which aims to find the minimally-distorted adversarial examples. This method is acknowledged to be one of the strongest attacks to date Carlini and Wagner (2017).

• Zeroth Order Optimization (ZOO). ZOO is a black-box attack based on coordinate descent. It assumes that the attackers only have access to the prediction confidence from the victim classifier’s outputs. This method has been shown to be as effective as the CW attack Chen et al. (2017).

It should be noted that although these methods are proposed for DNN models, they are also applicable to other learning models. We can apply them to generate adversarial examples for SVM models.
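As a rough illustration of how the two gradient-based attacks above carry over to an SVM, the following minimal sketch applies FGSM and PGD to a linear SVM with hinge loss. It assumes an $L_\infty$ perturbation budget and omits clipping to the valid pixel range; the function names, the toy weights and the step sizes are hypothetical choices for the example, not the experimental setup of this paper.

```python
import numpy as np

def hinge_grad_x(w, b, x, y):
    """Gradient of the hinge loss [1 - y (w.x + b)]_+ with respect to the input x."""
    margin = y * (w @ x + b)
    return -y * w if margin < 1.0 else np.zeros_like(w)

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps (L-infinity FGSM)."""
    return x + eps * np.sign(hinge_grad_x(w, b, x, y))

def pgd(w, b, x, y, eps, step, n_steps):
    """Repeated signed-gradient steps, each projected back onto the eps-ball around x."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(hinge_grad_x(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the L-infinity ball
    return x_adv

# Toy linear SVM and a correctly classified point (hypothetical values).
w = np.array([1.0, -2.0])
b = 0.0
x = np.array([0.3, 0.1])
y = 1.0

x_fgsm = fgsm(w, b, x, y, eps=0.2)
x_pgd = pgd(w, b, x, y, eps=0.2, step=0.05, n_steps=10)
print("clean score:", w @ x + b)
print("FGSM score :", w @ x_fgsm + b)
print("PGD score  :", w @ x_pgd + b)
```

For a linear model the loss gradient does not change direction between steps, so PGD ends at the same corner of the $\epsilon$-ball as FGSM; for a kernel SVM the gradient depends on the current point and the two attacks generally differ.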

## 3 Background

In this section, we give a brief review of adversarial training on linear SVM and the random feature approximation algorithm.

### 3.1 Adversarial Training on Linear SVM

We assume that SVM has been trained on a 2-class dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^d$ as a normal example in the $d$-dimensional input space and $y_i \in \{+1, -1\}$ as its label.

The adversarial training process aims to train a robust learning model using adversarial examples. As illustrated in Zhou et al. (2012), it is formulated as solving a min-max problem. The inner maximization problem simulates the behavior of an attacker which constructs adversarial examples leading to the maximum output distortion:

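$$\max_{x_i'} \; \big[1 - y_i (w^T x_i' + b)\big]_+ \quad \text{s.t.} \quad \|x_i' - x_i\|_2 \le \epsilon \tag{1}$$

where $x_i'$ is the adversarial example of $x_i$, $\epsilon$ denotes the limited range of perturbations, and $w$ and $b$ are the parameters of SVM. The loss function here is the commonly used hinge loss and we express it as $[\cdot]_+ = \max\{0, \cdot\}$.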

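The outer minimization problem aims to find the parameters that minimize the loss caused by the inner maximization:

$$\min_{w, b} \; \frac{1}{2}\|w\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \max_{x_i'} \big[1 - y_i (w^T x_i' + b)\big]_+ \quad \text{s.t.} \quad \|x_i' - x_i\|_2 \le \epsilon \tag{2}$$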

Due to the limited application of linear SVMs, we aim to extend the adversarial training strategy to kernel SVMs.
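For the linear case, the inner maximization (1) has a simple closed form: by the Cauchy-Schwarz inequality the worst-case hinge loss is $[1 - y_i(w^T x_i + b) + \epsilon\|w\|_2]_+$, attained at $x_i' = x_i - y_i \epsilon w / \|w\|_2$. The sketch below checks this numerically on random data; all names and values are hypothetical and chosen only for illustration.

```python
import numpy as np

def hinge(w, b, x, y):
    """Hinge loss [1 - y (w.x + b)]_+ of a linear SVM at a single point."""
    return max(0.0, 1.0 - y * (w @ x + b))

def worst_case_hinge(w, b, x, y, eps):
    """Closed form of max over ||x' - x||_2 <= eps of hinge(w, b, x', y)."""
    return max(0.0, 1.0 - y * (w @ x + b) + eps * np.linalg.norm(w))

def worst_case_input(w, x, y, eps):
    """A maximiser of the inner problem: move x by eps against the label along w."""
    return x - y * eps * w / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.normal(size=5)
b = 0.1
x = rng.normal(size=5)
y = -1.0
eps = 0.3

closed_form = worst_case_hinge(w, b, x, y, eps)
attained = hinge(w, b, worst_case_input(w, x, y, eps), y)

# Random points on the eps-sphere should never exceed the closed form.
directions = rng.normal(size=(2000, 5))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
sampled_max = max(hinge(w, b, x + eps * d, y) for d in directions)

print(closed_form, attained, sampled_max)
```

The same argument, carried out in the kernel space instead of the input space, is what later yields the regularized hinge loss of the kernel objective.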

### 3.2 Random Feature Approximation

Random feature approximation is a powerful technique used in DSG to make kernel methods scalable. This method approximates the kernel function by mapping the decision function to a lower dimensional random feature space. Its theoretic foundation relies on the intriguing duality between kernels and random processes Geng et al. (2019).

Specifically, the Bochner theorem Rudin (1962) provides a relationship between the kernel function and a random process $\omega$ with measure $p$: for any stationary and continuous kernel $k(x, x') = k(x - x')$, there exists a probability measure $p$, such that $k(x, x') = \int_{\mathbb{R}^d} e^{i\omega^T (x - x')} \, dp(\omega)$. In this way, the value of the kernel function can be approximated by explicitly computing random features $\phi_\omega(x)$, i.e.,

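$$k(x, x') \approx \frac{1}{m} \sum_{i=1}^{m} \phi_{\omega_i}(x)\, \phi_{\omega_i}^T(x') = \phi_\omega(x)\, \phi_\omega^T(x') \tag{3}$$

where $m$ is the number of random features, $\phi_\omega(x)$ denotes $\frac{1}{\sqrt{m}}[\phi_{\omega_1}(x), \dots, \phi_{\omega_m}(x)]$ and $\phi_\omega(x')$ denotes $\frac{1}{\sqrt{m}}[\phi_{\omega_1}(x'), \dots, \phi_{\omega_m}(x')]$. For the detailed derivation process, please refer to Ton et al. (2018).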

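Clearly, the feature mappings are $\phi_\omega: \mathbb{R}^d \to \mathbb{R}^2$, where $\phi_\omega(x) = [\cos(\omega^T x), \sin(\omega^T x)]$. To further alleviate computational costs, $\phi_\omega(x)$ can be expressed as $\sqrt{2}\cos(\omega^T x + b)$, where $\omega$ is drawn from $p(\omega)$ and $b$ is drawn uniformly from $[0, 2\pi]$.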
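The approximation in Eq. (3) can be illustrated with a short sketch for the RBF kernel $k(x, x') = \exp(-\gamma\|x - x'\|_2^2)$, whose spectral measure is Gaussian with covariance $2\gamma I$. The helper names, the value of $\gamma$ and the number of features below are assumptions made for the example only.

```python
import numpy as np

def rff_map(X, W, b):
    """Map each row of X to m random Fourier features sqrt(2/m) * cos(W x + b)."""
    m = W.shape[0]
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)

def rbf_kernel(x, z, gamma):
    """Exact RBF kernel exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
d, m, gamma = 3, 20000, 0.5

# For the RBF kernel, omega ~ N(0, 2 * gamma * I) and b ~ Uniform[0, 2 * pi].
W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(m, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=m)

x = rng.normal(size=d)
z = x + 0.3 * rng.normal(size=d)

Z = rff_map(np.stack([x, z]), W, b)
approx = float(Z[0] @ Z[1])          # inner product of the random features
exact = float(rbf_kernel(x, z, gamma))
print("exact:", exact, "approx:", approx)
```

The approximation error shrinks at the usual Monte Carlo rate of roughly $1/\sqrt{m}$, which is why a moderate number of random features is often sufficient in practice.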

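It is known that most kernel methods can be expressed as convex optimization problems in a reproducing kernel Hilbert space (RKHS) Dai et al. (2014). An RKHS $\mathcal{H}$ has the reproducing property, i.e., $\forall x \in \mathcal{X}$, $k(x, \cdot) \in \mathcal{H}$ and $\forall f \in \mathcal{H}$, $\langle f(\cdot), k(x, \cdot) \rangle_{\mathcal{H}} = f(x)$. Thus we have $\nabla f(x) = k(x, \cdot)$ and $\nabla \|f\|_{\mathcal{H}}^2 = 2f$ Dang et al. (2020).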

## 4 Adversarial Training on Kernel SVM

In this section, we extend the objective function (2) of linear SVM to kernel SVM, where the difficulty lies in the uncontrollable shape of the perturbations once they are mapped into the kernel space.

### 4.1 Kernelization

Firstly, we discuss the kernelization of the perturbations. When constructing an adversarial example, we first add perturbations to the normal example in the original space, constrained by $\|\delta\|_2 \le \epsilon$ as shown in Figure 1(a) and 1(b). But once the adversarial example is mapped into the kernel space, the perturbation becomes unpredictable, as in Figure 1(c). Such irregular perturbations greatly increase the difficulty of computation and of obtaining a closed-form solution.

Fortunately, the following theorem provides a tight connection between perturbations in the original space and the kernel space.

###### Theorem 1.

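Xu et al. (2009) Suppose that the kernel function has the form $k(x, x') = f(\|x - x'\|)$, with $f: \mathbb{R}^+ \to \mathbb{R}$ a decreasing function. Denote by $\mathcal{H}$ the RKHS space of $k(\cdot, \cdot)$ and by $\phi(\cdot)$ the corresponding feature mapping. Then we have for any $x \in \mathbb{R}^d$, $w \in \mathcal{H}$ and $c > 0$,

$$\sup_{\|\delta\| \le c} \langle w, \phi(x + \delta) \rangle \le \sup_{\|\delta_\phi\|_{\mathcal{H}} \le \sqrt{2f(0) - 2f(c)}} \langle w, \phi(x) + \delta_\phi \rangle.$$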

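Since the perturbation range of $\phi(x) + \delta_\phi$ tightly covers that of $\phi(x + \delta)$, which is also intuitively illustrated in Figure 1(d), we use $\phi(x) + \delta_\phi$ in the following computation, which makes the problem linear in the kernel space. Thus, the inner maximization problem (1) in the kernel space can be expressed as

$$\max_{x_i'} \; \big[1 - y_i (w^T \Phi(x_i') + b)\big]_+ \quad \text{s.t.} \quad \|\Phi(x_i') - \phi(x_i)\|_2 \le \epsilon' \tag{4}$$

where $\Phi(x_i')$ denotes $\phi(x_i) + \delta_\phi$ and $\epsilon'$ is $\sqrt{2f(0) - 2f(\epsilon)}$.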
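To make the kernel-space radius concrete, consider the RBF kernel, for which $f(r) = \exp(-\gamma r^2)$ and hence $\epsilon' = \sqrt{2 - 2\exp(-\gamma \epsilon^2)}$. The sketch below (hypothetical helper names, arbitrary $\gamma$ and $\epsilon$) uses the kernel trick to confirm that an input perturbation of norm $\epsilon$ moves the mapped point by exactly $\epsilon'$ in the RKHS, and a smaller perturbation by less.

```python
import numpy as np

def rbf(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_space_radius(eps, gamma):
    """eps' = sqrt(2 f(0) - 2 f(eps)) with f(r) = exp(-gamma * r^2)."""
    return np.sqrt(2.0 - 2.0 * np.exp(-gamma * eps ** 2))

def rkhs_distance(x, z, gamma):
    """||phi(x) - phi(z)||_H computed via the kernel trick."""
    sq = rbf(x, x, gamma) - 2.0 * rbf(x, z, gamma) + rbf(z, z, gamma)
    return np.sqrt(max(sq, 0.0))

rng = np.random.default_rng(1)
gamma, eps = 0.5, 0.4
x = rng.normal(size=4)

delta = rng.normal(size=4)
delta_on_sphere = eps * delta / np.linalg.norm(delta)  # ||delta|| = eps
delta_inside = 0.5 * delta_on_sphere                   # ||delta|| = eps / 2

eps_prime = kernel_space_radius(eps, gamma)
dist_on_sphere = rkhs_distance(x, x + delta_on_sphere, gamma)
dist_inside = rkhs_distance(x, x + delta_inside, gamma)
print(eps_prime, dist_on_sphere, dist_inside)
```

The kernel-space ball of radius $\epsilon'$ therefore contains every mapped adversarial example, although it also contains points that are not the image of any input, which is why the bound in Theorem 1 is an inequality.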

### 4.2 Construction of the Equivalent Form

(Footnote: There is an error here in the AAAI 2021 version; please refer to this version as the correct one.)

In this part, we aim to get the simplified and equivalent form of Eq. (4) via the following theorem, then the min-max optimization problem in the kernel space can be transformed into a minimization problem.

###### Theorem 2.

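With the constraint $\|\Phi(x_i') - \phi(x_i)\|_2 \le \epsilon'$, the maximization problem $\max_{x_i'} [1 - y_i(w^T \Phi(x_i') + b)]_+$ is equivalent to the regularized loss function $[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$.

The detailed proof of Theorem 2 is provided in our appendix. Then, the original min-max objective function can be rewritten as the following minimization problem:

$$\min_{w \in \mathcal{H},\, b} \; \frac{1}{2}\|w\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \big[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b\big]_+ \tag{5}$$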

## 5 Learning Strategy of adv-SVM

In this section, we extend the DSG algorithm to solve the objective minimization problem (5), since DSG has been proved to be a powerful technique for scalable kernel learning Dai et al. (2014).

For easy expression, we substitute for in Eq. (5), as , which is accessible to kernels which satisfy , such as RBF and Laplacian kernels Hajiaboli et al. (2011), then the objective function can be expressed as follows:

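$$\min_{f \in \mathcal{H}} R(f) = \min_{f \in \mathcal{H}} \; \frac{1}{2}\|f\|_2^2 + \frac{C}{n} \sum_{i=1}^{n} \big[1 - y_i f(x_i) + \epsilon' \|f\|_2 - y_i b\big]_+ \tag{6}$$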

In this part, we use the DSG algorithm to update the solution of Eq. (6). For convenience, here we only discuss the case when the hinge loss is greater than 0.

To iteratively update $f$ in a stochastic manner, we need to sample a data point $(x, y)$ in each iteration from the data distribution. The stochastic functional gradient for $R(f)$ is

$$\nabla R(f) = f(\cdot) + C\Big[-y\, k(x, \cdot) + \epsilon' \frac{f(\cdot)}{\|f\|_2}\Big] \tag{7}$$

Note that $\nabla R(f)$ is the derivative w.r.t. $f$. Since computing the kernel functions directly is still too costly, we next apply the random feature approximation introduced earlier to approximate the kernel values.
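For readers who want the intermediate step, Eq. (7) can be obtained term by term from the per-sample objective with an active hinge. The following is a short sketch under the assumptions $f \neq 0$ and $1 - y f(x) + \epsilon'\|f\|_2 - y b > 0$; the symbol $r(f)$ for the per-sample objective is introduced here only for illustration.

```latex
\begin{aligned}
r(f) &= \tfrac{1}{2}\|f\|_2^2 + C\big(1 - y f(x) + \epsilon'\|f\|_2 - y b\big), \\
\nabla \tfrac{1}{2}\|f\|_2^2 &= f(\cdot)
  && \text{(from } \nabla\|f\|^2 = 2f\text{)}, \\
\nabla f(x) &= k(x,\cdot)
  && \text{(reproducing property)}, \\
\nabla \|f\|_2 &= \nabla \sqrt{\|f\|_2^2} = \frac{2 f(\cdot)}{2\|f\|_2} = \frac{f(\cdot)}{\|f\|_2}
  && \text{(chain rule)}, \\
\nabla r(f) &= f(\cdot) + C\Big[-y\,k(x,\cdot) + \epsilon'\frac{f(\cdot)}{\|f\|_2}\Big].
\end{aligned}
```

The bias term $-yb$ does not depend on $f$ and therefore contributes nothing to the functional gradient.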

#### Random Feature Approximation.

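According to Eq. (3), when sampling the random process $\omega$ from its probability distribution $p(\omega)$, we can further approximate Eq. (7) as

$$\nabla \hat{R}(f) = f(\cdot) + C\Big[-y\, \phi_\omega(x)\, \phi_\omega(\cdot) + \epsilon' \frac{f(\cdot)}{\|f\|_2}\Big] \tag{8}$$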

#### Update Rules.

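According to the principle of the SGD method, the update rule for $f$ in the $t$-th iteration is

$$f_{t+1}(\cdot) = f_t(\cdot) - \gamma_t \nabla \hat{R}(f) = \sum_{i=1}^{t} a_t^i\, \zeta_i(\cdot) \tag{9}$$

where $\gamma_t$ is the stepsize of the $t$-th iteration, $\zeta_i(\cdot)$ denotes $-C y_i\, \phi_\omega(x_i)\, \phi_\omega(\cdot)$ and the initial value is $f_1(\cdot) = 0$. The value of $a_t^i$ can be easily inferred as $a_t^i = -\gamma_i \prod_{j=i+1}^{t} \big(1 - \gamma_j (1 + \frac{\epsilon' C}{\|f_j\|_2})\big)$. (Footnote: The value of $a_t^i$ is obtained by expanding the middle term of Eq. (9) iteratively with the definition of $\nabla \hat{R}(f)$.)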

Note that if we compute the value of kernels explicitly instead of using random features, the update rule for $h$ is

$$h_{t+1}(\cdot) = h_t(\cdot) - \gamma_t \nabla R(f) = \sum_{i=1}^{t} a_t^i\, \xi_i(\cdot) \tag{10}$$

where $\xi_i(\cdot) = -C y_i\, k(x_i, \cdot)$. Our algorithm applies the update rule (9) instead, which reduces the cost of kernel computation.
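The expansion of the update rule into a weighted sum can be checked numerically. The sketch below replaces the functions $\zeta_i(\cdot)$ by random vectors (a finite-dimensional stand-in, purely an assumption for illustration), runs the recursion $f_{t+1} = (1 - \gamma_t c_t) f_t - \gamma_t \zeta_t$ with $c_t = 1 + C\epsilon'/\|f_t\|$, and compares the result with the closed-form coefficients $a_t^i = -\gamma_i \prod_{j>i}(1 - \gamma_j c_j)$. The stepsize $\gamma_t = \theta/t$ and the convention that the norm term is dropped when $f_t = 0$ are also assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dim = 6, 3
C, eps_prime, theta = 1.0, 0.1, 0.5

zetas = rng.normal(size=(T, dim))      # stand-ins for zeta_1, ..., zeta_T
gammas = theta / np.arange(1, T + 1)   # diminishing stepsize gamma_t = theta / t

# Run the recursion f_{t+1} = (1 - gamma_t * c_t) * f_t - gamma_t * zeta_t from f_1 = 0.
f = np.zeros(dim)
shrink = np.empty(T)                   # shrink[t] stores 1 - gamma_t * c_t
for t in range(T):
    norm_f = np.linalg.norm(f)
    # c_t = 1 + C * eps' / ||f_t||; the norm term is dropped at f_t = 0 (assumption).
    c_t = 1.0 + (C * eps_prime / norm_f if norm_f > 0 else 0.0)
    shrink[t] = 1.0 - gammas[t] * c_t
    f = shrink[t] * f - gammas[t] * zetas[t]

# Closed-form coefficients a_t^i = -gamma_i * prod_{j > i} (1 - gamma_j * c_j).
coeffs = np.array([-gammas[i] * np.prod(shrink[i + 1:]) for i in range(T)])
f_closed = coeffs @ zetas

print(np.allclose(f, f_closed))
```

Because every earlier coefficient is rescaled in each iteration, an implementation only needs to store one scalar per iteration alongside the sampled data point and random feature.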

#### Detailed Algorithm.

Based on Eq. (9) above, we propose the training and prediction algorithms for the adversarial training of kernel SVM in Algorithm 1 and 2 respectively.

A crucial step of DSG in Algorithms 1 and 2 is sampling $\omega_i$ with seed $i$. As the seeds are aligned between the training and prediction processes in the same iteration Shi et al. (2019), we only need to save the seeds instead of the whole random features, which is memory friendly.
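The seed-alignment idea can be sketched as follows: the random feature of iteration $i$ is regenerated from seed $i$ whenever it is needed, so prediction only requires the stored coefficients. The function names, the RBF-style sampling distribution and the coefficient values are hypothetical; this is an illustration of the principle, not the algorithm listing of the paper.

```python
import numpy as np

def sample_feature(seed, dim, gamma):
    """Regenerate the random feature parameters (omega_i, b_i) of iteration i from its seed."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=np.sqrt(2.0 * gamma), size=dim)
    offset = rng.uniform(0.0, 2.0 * np.pi)
    return omega, offset

def feature(x, omega, offset):
    """Single random Fourier feature sqrt(2) * cos(omega.x + b)."""
    return np.sqrt(2.0) * np.cos(omega @ x + offset)

def predict(x, alphas, dim, gamma):
    """Evaluate f(x) = sum_i alpha_i * phi_{omega_i}(x), replaying seeds 0, 1, ..."""
    total = 0.0
    for seed, alpha in enumerate(alphas):
        omega, offset = sample_feature(seed, dim, gamma)
        total += alpha * feature(x, omega, offset)
    return total

dim, gamma = 4, 0.5
alphas = [0.3, -0.1, 0.25]  # hypothetical coefficients stored during training
x = np.ones(dim)

# What training would have used if it had kept every random feature in memory.
stored = [sample_feature(s, dim, gamma) for s in range(len(alphas))]
f_stored = sum(a * feature(x, om, off) for a, (om, off) in zip(alphas, stored))

# Prediction regenerates the same features from the seeds alone.
f_replayed = predict(x, alphas, dim, gamma)
print(f_stored, f_replayed)
```

The memory cost is then one scalar per iteration rather than one $d$-dimensional vector, at the price of regenerating the features at prediction time.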

Unlike the original version of DSG Dai et al. (2014), which uses a diminishing stepsize, our algorithm supports both diminishing and constant stepsize strategies (line 5 of Algorithm 1). The process of gradient descent is composed of a transient phase followed by a stationary phase. In the diminishing stepsize case, the transient phase is relatively long and can be impractical if the stepsize is misspecified Toulis et al. (2017), but once it enters the stationary phase, the iterate converges smoothly to the optimal solution. In the constant stepsize case, the transient phase is much shorter and less sensitive to the stepsize Chee and Toulis (2018), but the iterate may oscillate in a region around the optimal solution during the stationary phase.

## 6 Convergence Analysis

In this section, we aim to prove that adv-SVM converges to the optimal solution at the rate of $O(1/t)$ based on the framework of Dai et al. (2014), where $t$ is the number of iterations. We first provide some assumptions.

###### Assumption 1.

(Bound of kernel function) There exists $\kappa > 0$, such that $k(x, x') \le \kappa$.

###### Assumption 2.

(Bound of random feature norm) There exists $\phi > 0$, such that $|\phi_\omega(x)\, \phi_\omega(x')| \le \phi$.

###### Assumption 3.

The spectral radius of a function has a lower bound that

, where a spectral radius is the maximum modulus of eigenvalues

Gurvits et al. (2007), i.e., .

For Assumption 6, it is known that we can find eigenvalues for a matrix in space and the spectral radius of matrix is defined as the maximum modulus of the eigenvalues of Gurvits et al. (2007), i.e., . Similar to matrix case, in RKHS space, a function can be viewed as an infinite matrix, then infinite eigenvalues

and infinite eigenfunctions

can be found Iii (2004). Treat as a set of orthogonal basis, then can be represented as the linear combination of the basis, i.e., . Similar to the definition of spectral radius in matrix, for function , .

We update the solution $f$ through random features and random data points according to (9). As a result, $f_{t+1}$ may be outside of the RKHS $\mathcal{H}$, making it hard to directly evaluate the error between $f_{t+1}$ and the optimal solution $f^*$. In this case, we utilize $h_{t+1}$ as an intermediate value to decompose the difference between $f_{t+1}$ and $f^*$ Shi et al. (2020):

$$|f_{t+1}(x) - f^*|^2 \le \underbrace{2\,|f_{t+1}(x) - h_{t+1}(x)|^2}_{\text{error due to random features}} + \underbrace{2\kappa\, \|h_{t+1} - f^*\|_2^2}_{\text{error due to random data}} \tag{11}$$
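The decomposition in Eq. (11) follows from two elementary inequalities. A brief sketch, writing $f^*(x)$ for the value of the optimal solution at $x$ and assuming $k(x, x) \le \kappa$ as in Assumption 1:

```latex
\begin{aligned}
|f_{t+1}(x) - f^*(x)|^2
 &= \big|\big(f_{t+1}(x) - h_{t+1}(x)\big) + \big(h_{t+1}(x) - f^*(x)\big)\big|^2 \\
 &\le 2\,|f_{t+1}(x) - h_{t+1}(x)|^2 + 2\,|h_{t+1}(x) - f^*(x)|^2
   && \text{since } (a+b)^2 \le 2a^2 + 2b^2, \\
|h_{t+1}(x) - f^*(x)|^2
 &= \big|\langle h_{t+1} - f^*,\, k(x,\cdot)\rangle_{\mathcal{H}}\big|^2
   && \text{(reproducing property)} \\
 &\le \|k(x,\cdot)\|_{\mathcal{H}}^2\, \|h_{t+1} - f^*\|_2^2
   && \text{(Cauchy-Schwarz)} \\
 &= k(x,x)\, \|h_{t+1} - f^*\|_2^2 \;\le\; \kappa\, \|h_{t+1} - f^*\|_2^2 .
\end{aligned}
```

The second step relies on $h_{t+1}$ and $f^*$ both lying in $\mathcal{H}$, which is exactly why the intermediate function $h_{t+1}$ is introduced.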

We introduce our main lemmas and theorems as below. All the detailed proofs are provided in our appendix.

### 6.1 Convergence Analysis on Diminishing Stepsize

We first prove that the convergence rate of our algorithm with diminishing stepsize is $O(1/t)$.

###### Lemma 1.

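(Error due to random features) For any $x \in \mathcal{X}$,

$$\mathbb{E}_{\mathcal{D}^t, \omega^t}\big[|f_{t+1}(x) - h_{t+1}(x)|^2\big] \le \frac{1}{t^2}\, C^2 \theta^2 (\kappa + \phi)^2$$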
###### Lemma 2.

(Error due to random data) Let be the optimal solution to our target problem, we set with such that , then we have

$$\mathbb{E}_{\mathcal{D}^t, \omega^t}\big[\|h_{t+1} - f^*\|_2^2\big] \le \frac{Q_1^2}{t} \tag{12}$$

where , , is a constant value and .

###### Theorem 3.

(Convergence in expectation) When with , ,

$$\mathbb{E}_{\mathcal{D}^t, \omega^t}\big[|f_{t+1}(x) - f^*|^2\big] \le \frac{2Q_0^2}{t} + \frac{2\kappa Q_1^2}{t} \tag{13}$$

where .

###### Remark 1.

According to Eq. (11), the error caused by the doubly stochastic approximation can be bounded by combining Lemmas 1 and 2, and we prove in Theorem 3 that it converges at the rate of $O(1/t)$.

### 6.2 Convergence Analysis on Constant Stepsize

In this part, we provide a novel theoretical analysis to prove that adv-SVM with constant stepsize converges to the optimal solution at a rate near $O(1/t)$.

Notice that the diminishing stepsize contributes the factor $1/t$ to the convergence rate, while in the case of constant stepsize, the stepsize makes the analysis more challenging.

###### Lemma 3.

(Error due to random features) For any $x \in \mathcal{X}$,

 EDt,ωt[|ft+1(x)−ht+1(x)|2]≤C2ηc(κ+ϕ)2
###### Lemma 4.

(Error due to random data) Let be the optimal solution to our target problem, set and , with for , we will reach after

$$T \ge \frac{B \log(2 e_1 / \epsilon)}{\vartheta \epsilon} \tag{14}$$

iterations, where and .

###### Theorem 4.

(Convergence in expectation) Set , and , , with where , we will reach after

$$T \ge \frac{4\kappa B \log(8\kappa e_1 / \epsilon)}{\vartheta \epsilon} \tag{15}$$

iterations, where $B$ and $\vartheta$ are defined in Lemma 4.

###### Remark 2.

Based on Theorem 4, $f_{t+1}$ will converge to the optimal solution at a rate near $O(1/t)$ if the $\log(1/\epsilon)$ factor is ignored. This rate is nearly the same as that of the diminishing stepsize, even though the stepsize of our algorithm stays constant.

## 7 Experiments

In this section, we conduct comprehensive experiments to show the effectiveness and efficiency of adv-SVM.

### 7.1 Experimental Setup

The models compared in the experiments are: Natural, the normal DSG algorithm Dai et al. (2014); adv-linear-SVM, the adversarial training of linear SVM proposed by Zhou et al. (2012); adv-SVM(C), our proposed adversarial training algorithm with constant stepsize; and adv-SVM(D), our proposed adversarial training algorithm with diminishing stepsize.

The four attack methods of constructing adversarial samples we applied cover both white-box and black-box attacks and are already introduced in the section of related work. For FGSM and PGD, the maximum perturbation is set as and the step size for PGD is . We use the version of CW to generate adversarial examples. For ZOO, we use the ZOO-ADAM algorithm and set the step size , ADAM parameters , .

Implementation. We perform experiments on an Intel Xeon E5-2696 machine with 48GB RAM. Our model is implemented based on the DSG framework Dai et al. (2014) (the DSG code is available at https://github.com/zixu1986/Doubly_Stochastic_Gradients). For the sake of efficiency, we use a mini-batch setting in the experiments. The random features used in DSG are sampled according to pseudo-random number generators. The RBF kernel is used for the natural DSG and adv-SVM algorithms, the number of random features is set as and the batch size is 500. 5-fold cross validation is used to choose the optimal hyper-parameters (the regularization parameter $C$ and the step size $\gamma$). The parameters $C$ and $\gamma$ are searched in the region . For the adv-linear-SVM algorithm, we use its free-range training model and set the hyper-parameter as 0.1 according to their analysis. This algorithm is implemented in CVX, a package for specifying and solving convex programs Grant and Boyd (2014). The stopping criterion for all experiments is one pass over each entire dataset. All results are the average of 10 trials.

Datasets.

We evaluate the robustness of adv-SVM on two well-known datasets, MNIST Lecun and Bottou (1998) and CIFAR10 Krizhevsky and Hinton (2009). Since we focus on binary classification with kernel SVM, we select two similar classes from each dataset. Each pixel value is normalized into $[0, 1]$ by dividing it by 255. Table 4 summarizes the 6 datasets used in our experiments. Due to the page limit, we only show the results of CIFAR10 automobile vs. truck and MNIST8m 0 vs. 4 here. The results on the other datasets are provided in the appendix.

### 7.2 Experimental Results

We explore the defensive capability of our model against PGD attack in terms of the attack steps (Fig. 1(a), 1(b)) and the maximum allowable perturbation (Fig. 1(c), 1(d)). For Fig. 1(a) and 1(b), the maximum allowable perturbation is fixed as , for Fig. 1(c) and 1(d), the attack step is fixed as 10.

It can be seen clearly that the PGD attack strengthens as either the number of attack steps or the maximum allowable perturbation $\epsilon$ increases. Meanwhile, increasing $\epsilon$ has a greater impact on test accuracy than increasing the number of attack steps. However, a large allowable perturbation range also increases the risk that adversarial examples are detected, since such perturbed examples are no longer very similar to the original ones, which explains why our algorithm has a better defensive capability for large $\epsilon$.

We evaluate robustness of the 4 competing methods against 4 types of attacks introduced earlier plus the clean datasets (Normal). Here the attack strategy for PGD is 10 steps with max perturbation . From Table 2 and 3, we can see that on both datasets, the natural model achieves the best accuracy on normal test images, but it’s not robust to adversarial examples. Among four attacks, CW and ZOO have the strongest ability to trick models. Although PGD and FGSM belong to the same type attack method, PGD has stronger attack ability and is more difficult to defend since it’s a multi-step iterative attack method rather than a single-step one. According to the results of adv-linear-SVM, we can see that this algorithm is not only time-consuming in training examples, but also vulnerable to strong attacks like CW and ZOO, which even gets higher test error than unsecured algorithm (natural DSG). In comparison, our proposed adv-SVM can finish tasks in just a few minutes and can defend both white-box and black-box attacks.

Fig. 3 shows test error vs. iterations for three models against four attacks. The results indicate that adv-SVM converges quickly. Moreover, compared with adv-SVM(D), adv-SVM(C) enjoys a faster convergence rate and lower test error, although it may oscillate slightly in the stationary phase, which is consistent with our analysis.

## 8 Conclusion

To alleviate the fragility of SVMs to adversarial examples, we propose an adversarial training strategy named adv-SVM which is applicable to kernel SVM. The DSG algorithm is also applied to improve its scalability. Although we use the principle of approximation, the theoretical analysis shows that our algorithm converges to the optimal solution. Moreover, comprehensive experimental results demonstrate its efficiency in adversarial training and its robustness against various attacks.

## Acknowledgments

B. Gu was partially supported by the National Natural Science Foundation of China (No. 62076138), the Qing Lan Project (No. R2020Q04), the Six talent peaks project (No. XYDXX-042) and the 333 Project (No. BRA2017455) in Jiangsu Province.

## Appendix A Proof of Theorem 2

###### Theorem 5.

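With the constraint $\|\Phi(x_i') - \phi(x_i)\|_2 \le \epsilon'$, the maximization problem $\max_{x_i'} [1 - y_i(w^T \Phi(x_i') + b)]_+$ is equivalent to the regularized loss function $[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$.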

###### Proof.

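Since $\Phi(x_i') = \phi(x_i) + \delta_\phi^i$, the constraint can also be written as $\|\delta_\phi^i\|_2 \le \epsilon'$. We define $T = \{\delta_\phi^i : \|\delta_\phi^i\|_2 \le \epsilon'\}$. To prove the theorem, we first prove $\max_{\delta_\phi^i \in T} [1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)]_+ \ge [1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$, and then prove $\max_{\delta_\phi^i \in T} [1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)]_+ \le [1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$. In the following, we give the details of these two sub-conclusions.

Step 1: We first prove $\max_{\delta_\phi^i \in T} [1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)]_+ \ge [1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$.

Since $y_i \in \{+1, -1\}$, we define a subset of $T$ as $T' = \{\delta_\phi^i : \delta_\phi^i = -y_i \epsilon' w / \|w\|_2\}$.

Hence,

$$\begin{aligned} \max_{\delta_\phi^i \in T'} \big[1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)\big]_+ &= \max_{\delta_\phi^i \in T'} \big[1 - y_i w^T \phi(x_i) - y_i w^T \delta_\phi^i - y_i b\big]_+ \\ &= \big[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b\big]_+ \end{aligned}$$

Since $T' \subseteq T$, the first sub-conclusion is proved.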

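Step 2: Next we prove $\max_{\delta_\phi^i \in T} [1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)]_+ \le [1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b]_+$.

$$\begin{aligned} \max_{\delta_\phi^i \in T} \big[1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)\big]_+ &= \max_{\delta_\phi^i \in T} \big[1 - y_i w^T \phi(x_i) - y_i w^T \delta_\phi^i - y_i b\big]_+ \\ &\le \max_{\delta_\phi^i \in T} \big[1 - y_i w^T \phi(x_i) + |y_i|\, \|w\|_2 \cdot \|\delta_\phi^i\|_2 - y_i b\big]_+ \\ &\le \big[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b\big]_+ \end{aligned}$$

The first inequality is due to the Cauchy-Schwarz inequality. The second inequality holds since $|y_i| = 1$ and $\|\delta_\phi^i\|_2 \le \epsilon'$. Hence the second sub-conclusion holds.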

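Step 3: Combining these two steps, we have:

$$\max_{\|\delta_\phi^i\|_2 \le \epsilon'} \big[1 - y_i(w^T(\phi(x_i) + \delta_\phi^i) + b)\big]_+ = \big[1 - y_i w^T \phi(x_i) + \epsilon' \|w\|_2 - y_i b\big]_+ \tag{16}$$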

## Appendix B Detailed Proof of Convergence Rate

###### Assumption 4.

(Bound of kernel function) The kernel function is bounded, i.e., there exists $\kappa > 0$, such that $k(x, x') \le \kappa$.

###### Assumption 5.

(Bound of random feature norm) There exists $\phi > 0$, such that $|\phi_\omega(x)\, \phi_\omega(x')| \le \phi$.

###### Assumption 6.

The spectral radius of a function has a lower bound that , where a spectral radius is the maximum modulus of eigenvaluesGurvits et al. (2007), i.e., .

### B.1 Convergence Analysis on Diminishing Stepsize

In this section, we aim to prove that our algorithm with diminishing stepsize converges to the optimal solution at a rate of $O(1/t)$.

###### Lemma 5.

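(Error due to random features) For any $x \in \mathcal{X}$,

$$\mathbb{E}_{\mathcal{D}^t, \omega^t}\big[|f_{t+1}(x) - h_{t+1}(x)|^2\big] \le C^2 \sum_{i=1}^{t} |a_t^i|^2 (\kappa + \phi)^2$$

###### Proof.

$$\begin{aligned} f_{t+1}(\cdot) - h_{t+1}(\cdot) &= \sum_{i=1}^{t} a_t^i\, \zeta_i(\cdot) - \sum_{i=1}^{t} a_t^i\, \xi_i(\cdot) \\ &= \sum_{i=1}^{t} a_t^i \big[-C y_i\, \phi_\omega(x_i)\, \phi_\omega(\cdot) + C y_i\, k(x_i, \cdot)\big] \\ &= C y_i \sum_{i=1}^{t} a_t^i \big[k(x_i, \cdot) - \phi_\omega(x_i)\, \phi_\omega(\cdot)\big] \\ &\le C y_i \sum_{i=1}^{t} a_t^i \big(k(x_i, \cdot) + \phi_\omega(x_i)\, \phi_\omega(\cdot)\big) \\ &\le C y_i \sum_{i=1}^{t} a_t^i (\kappa + \phi) \end{aligned} \tag{17}$$

Thus we can get

$$|f_{t+1} - h_{t+1}|^2 \le C^2 y_i^2 \sum_{i=1}^{t} |a_t^i|^2 (\kappa + \phi)^2 = C^2 \sum_{i=1}^{t} |a_t^i|^2 (\kappa + \phi)^2 \tag{18}$$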

###### Lemma 6.

Suppose , then we have and .

###### Proof.

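The proof of DSG first gives the upper bound of $|a_t^t|$ and proves that $|a_t^i|$ is monotonically increasing in $i$. According to the definition of $a_t^i$, we have

$$\begin{aligned} |a_t^{i+1}| - |a_t^i| &= \Big(|\gamma_{i+1}| - \Big|\gamma_i \Big(1 - \gamma_{i+1}\Big(1 + \frac{\epsilon' C}{\|f_{i+1}\|}\Big)\Big)\Big|\Big) \cdot \Big|\Big(1 - \gamma_{i+2}\Big(1 + \frac{\epsilon' C}{\|f_{i+2}\|}\Big)\Big) \cdots \Big(1 - \gamma_t\Big(1 + \frac{\epsilon' C}{\|f_t\|}\Big)\Big)\Big| \\ &\ge \frac{\theta}{i+1}\Big(1 - \Big|1 - \frac{\theta}{i+1}\Big(1 + \frac{\epsilon' C}{\|f_{i+1}\|}\Big)\Big|\Big) \cdot \Big|\Big(1 - \gamma_{i+2}\Big(1 + \frac{\epsilon' C}{\|f_{i+2}\|}\Big)\Big) \cdots \Big(1 - \gamma_t\Big(1 + \frac{\epsilon' C}{\|f_t\|}\Big)\Big)\Big| \end{aligned}$$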

Then according to Assumption 6 and the theorem that is the lower bound of any matrix norm of that , we can get that , thus we have

$$|a_t^{i+1}| - |a_t^i| \ge 0$$

In this way, we conclude that $|a_t^i|$ is monotonically increasing in $i$. Since $|a_t^t| = \gamma_t = \theta / t$, we get $|a_t^i| \le \theta / t$. ∎

###### Lemma 7.

(Error due to random data) Let be the optimal solution to our target problem, we set with such that , then we have

$$\mathbb{E}_{\mathcal{D}^t, \omega^t}\big[\|h_{t+1} - f^*\|_2^2\big] \le \frac{Q_1^2}{t} \tag{19}$$

where , , is a constant value and ,

###### Proof.

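To simplify notation, we first denote the following two gradient terms:

$$\begin{aligned} g_t &= p_t + h_t = -C y_t\, k(x_t, \cdot) + C \epsilon' \frac{f(\cdot)}{\|f\|} + h_t \\ \bar{g}_t &= \mathbb{E}_{\mathcal{D}_t}[g_t] = \mathbb{E}_{\mathcal{D}_t}\Big[-C y_t\, k(x_t, \cdot) + C \epsilon' \frac{f(\cdot)}{\|f\|}\Big] + h_t \end{aligned}$$

Note that by our previous definition, we have $h_{t+1} = h_t - \gamma_t g_t$, $\forall t \ge 1$.

Denote $A_t = \|h_t - f^*\|^2$. Then we have

$$\begin{aligned} A_{t+1} &= \|h_t - f^* - \gamma_t g_t\|^2 \\ &= A_t + \gamma_t^2 \|g_t\|^2 - 2\gamma_t \langle h_t - f^*, g_t \rangle \end{aligned}$$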