1 Introduction
Driven by the difficulty of labeling data, graph-based semi-supervised learning (GSSL) Zhu and Ghahramani (2002); Zhu et al. (2003); Chapelle et al. (2009) has been widely used to boost the quality of models using easily accessible unlabeled data. The core idea is that labeled and unlabeled data lie on the same manifold. For instance, in the transductive setting, label propagation Zhu and Ghahramani (2002) transfers label information from labeled nodes to neighboring nodes according to their proximity. In the inductive case, a graph-based manifold regularizer can be added to many existing supervised learning models to enforce smoothness of predictions on the data manifold Belkin et al. (2006); Sindhwani et al. (2005). GSSL has received a lot of attention, and many of its applications are safety-critical, such as drug discovery Zhang et al. (2015) and social media mining Speriosu et al. (2011).
We aim to develop systematic and efficient data poisoning methods for GSSL models. Our idea is partially motivated by recent research on the robustness of machine learning models to adversarial examples Goodfellow et al. (2015); Szegedy et al. (2013). These works mostly show that carefully designed, slightly perturbed inputs, also known as adversarial examples, can substantially degrade the performance of many machine learning models. We distinguish this problem from our setting: adversarial attacks are performed during the testing phase and applied to test data Carlini and Wagner (2017); Chen et al. (2018); Athalye et al. (2018); Cheng et al. (2018); Papernot et al. (2016); Cheng et al. (2019), whereas data poisoning attacks are conducted during the training phase Mei and Zhu (2015); Koh and Liang (2017); Xiao et al. (2015); Li et al. (2016); Zhao et al. (2018), and perturbations are added to the training data only. In other words, data poisoning concerns how to imperceptibly change the training data so as to affect testing performance. This setting is more challenging than test-time adversarial attacks because the perturbation has to propagate through a sophisticated training algorithm.
Despite the efforts made on studying poisoning attacks against supervised models Mei and Zhu (2015); Koh and Liang (2017); Xiao et al. (2015); Li et al. (2016); Zhao et al. (2018), the robustness of semi-supervised algorithms has seldom been studied, and many related questions remain unsolved. For instance, are semi-supervised learning algorithms sensitive to small perturbations of labels? And how do we formally measure the robustness of these algorithms?
In this paper, we present the first systematic study of data poisoning attacks against GSSL. We mainly cover the widely used label propagation algorithm, but similar ideas can be applied to poisoning manifold-regularization-based SSL as well (see Appendix D.2). To poison semi-supervised learning algorithms, we can change either the training labels or the training features. For label poisoning, we show that the attack is a constrained quadratic minimization problem, and depending on whether the task is regression or classification, we use a continuous or a discrete optimization method. For feature poisoning, we conduct gradient-based optimization with group Lasso regularization to enforce group sparsity (shown in Appendix D.2). Using the proposed algorithms, we answer the questions mentioned above with several experiments. Our contributions can be summarized as follows:


We propose a framework for data poisoning attacks on GSSL that 1) includes both classification and regression cases, 2) works under various kinds of constraints, and 3) covers both complete and incomplete knowledge of the algorithm user (also called the “victim”).

For label poisoning in the regression task, which is a non-convex trust region problem, we design a specialized solver that finds a global minimum with an asymptotically linear convergence rate.

For label poisoning in the classification task, which is an NP-hard integer programming problem, we propose a novel probabilistic solver that works in combination with a gradient descent optimizer. Empirical results show that our method works much better than classical greedy methods.

We design comprehensive experiments using the proposed poisoning algorithms on a variety of problems and datasets.
In what follows, we refer to the party running the poisoning algorithm as the attacker, and the party doing the learning and inference work as the victim.
2 Related Work
Adversarial attacks have been extensively studied recently. Many recent works consider the test-time attack, where the model is fixed and the attacker slightly perturbs a test example to change the model output completely Szegedy et al. (2013). The attack is often formulated as an optimization problem Carlini and Wagner (2017), which can be solved in the white-box setting. In this paper, we consider a different setting called data poisoning, where the attack happens at training time: an attacker carefully modifies (or adds/removes) data in the training set so that the model trained on the poisoned data either has significantly degraded performance Xiao et al. (2015); Mei and Zhu (2015) or has some desired properties Chen et al. (2017); Li et al. (2016). As mentioned above, this is usually harder than test-time attacks since the model is not predetermined. Poisoning attacks have been studied in several applications, including multi-task learning Zhao et al. (2018), image classification Chen et al. (2017), matrix factorization for recommendation systems Li et al. (2016) and online learning Wang and Chaudhuri (2018). However, these works do not cover semi-supervised learning, and the resulting algorithms are quite different from ours.
To the best of our knowledge, Dai et al. (2018); Zügner et al. (2018); Wang et al. (2018) are the only related works on attacking semi-supervised learning models. They conduct test-time attacks on Graph Convolutional Networks (GCN). Their contributions differ from ours in several aspects: 1) the GCN algorithm is quite different from the classical SSL algorithms considered in this paper (e.g. label propagation and manifold regularization). Notably, we only use feature vectors, and the graph is constructed manually with a kernel function. 2) Their works are restricted to test-time attacks by assuming the model is learned and fixed, and the goal of the attacker is to find a perturbation that fools the established model. Although there are some experiments in Zügner et al. (2018) on poisoning attacks, the perturbation is still generated from a test-time attack, and they did not design task-specific algorithms for poisoning at training time. In contrast, we consider the data poisoning problem, which happens before the victim trains a model.
3 Data Poisoning Attack to GSSL
3.1 Problem setting
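We consider the graph-based semi-supervised learning (GSSL) problem. The input includes labeled data $X_l \in \mathbb{R}^{n_l \times d}$ and unlabeled data $X_u \in \mathbb{R}^{n_u \times d}$; we define the whole feature matrix as $X = [X_l; X_u]$. Denoting the labels of $X_l$ as $y_l$, our goal is to predict the labels $y_u$ of the test data $X_u$. The learner applies an algorithm $\mathcal{A}$ to predict $y_u$ from the available data $\{X_l, y_l, X_u\}$. Here we restrict $\mathcal{A}$ to the label propagation method, where we first generate a graph with adjacency matrix $S$ from a Gaussian kernel: $S_{ij} = \exp(-\gamma \|x_i - x_j\|^2)$, where $x_i$ denotes the $i$-th row of $X$. Then the graph Laplacian is calculated by $L = D - S$, where $D = \mathrm{diag}(\sum_k S_{ik})$ is the degree matrix. The unlabeled data is then predicted through the energy minimization principle Zhu et al. (2003)
$$\min_{\hat{y}} \ \frac{1}{2} \sum_{i,j} S_{ij} (\hat{y}_i - \hat{y}_j)^2 = \hat{y}^\top L \hat{y}, \quad \text{s.t. } \hat{y}_l = y_l. \tag{1}$$
The problem has a simple closed-form solution $\hat{y}_u = (D_{uu} - S_{uu})^{-1} S_{ul} y_l$, where $D_{uu}$ and $S_{uu}$ are the blocks of $D$ and $S$ indexed by the unlabeled nodes, and $S_{ul}$ is the block of $S$ with rows indexed by unlabeled nodes and columns indexed by labeled nodes. Now we consider an attacker who wants to greatly change the prediction $\hat{y}_u$ by perturbing the training data $\{X_l, y_l\}$ by small amounts $\{\Delta_x, \delta_y\}$ respectively, where $\Delta_x \in \mathbb{R}^{n_l \times d}$ is a perturbation matrix and $\delta_y \in \mathbb{R}^{n_l}$ is a vector. This seems to be a simple problem at first glance; however, we will show that finding the optimal perturbation is often intractable, and therefore provable and effective algorithms are needed. To sum up, the problem has several degrees of freedom: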


Learning algorithm: Among all graph-based semi-supervised learning algorithms, we primarily focus on the label propagation method; however, we also discuss the manifold regularization method in Appendix D.2.

Task: We treat the regression task and the classification task differently because the former is inherently a continuous optimization problem while the latter can be transformed into integer programming.

Knowledge of attacker: Ideally, the attacker knows every aspect of the victim, including training data, testing data, and training algorithms. However, we also discuss the incomplete knowledge scenario; for example, the attacker may not know the exact values of the hyperparameters.

What to perturb: We assume the attacker can perturb the labels or the features, but not both. This assumption simplifies our discussion and should not affect our findings.

Constraints: We also assume the attacker has limited capability, so that (s)he can only make small perturbations, measured by a norm or by sparsity.
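As an illustration of the label propagation predictor described in this section, the following is a minimal NumPy sketch, assuming a dense Gaussian-kernel graph and the harmonic closed-form solution of Zhu et al. (2003). The function names, the toy data, and the kernel width are hypothetical choices made for the example, not part of the original experiments.

```python
import numpy as np

def gaussian_similarity(X, gamma):
    """Dense similarity matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def propagation_matrix(X_l, X_u, gamma):
    """Return K such that the harmonic prediction is y_u_hat = K @ y_l."""
    n_l = X_l.shape[0]
    S = gaussian_similarity(np.vstack([X_l, X_u]), gamma)
    D = np.diag(S.sum(axis=1))                 # degree matrix
    L_uu = D[n_l:, n_l:] - S[n_l:, n_l:]       # unlabeled block of the Laplacian
    S_ul = S[n_l:, :n_l]                       # unlabeled-to-labeled similarities
    return np.linalg.solve(L_uu, S_ul)

# Hypothetical toy data: two labeled points, five unlabeled points around each.
rng = np.random.default_rng(0)
X_l = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_l = np.array([-1.0, 1.0])
X_u = np.vstack([
    X_l[0] + 0.3 * rng.standard_normal((5, 2)),
    X_l[1] + 0.3 * rng.standard_normal((5, 2)),
])

K = propagation_matrix(X_l, X_u, gamma=1.0)
y_u_hat = K @ y_l
print(np.round(y_u_hat, 3))
```

Each row of `K` sums to one, so every prediction is a weighted average of the labeled values; this is why a perturbation of `y_l` enters the prediction linearly.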
3.2 Toy Example
We show a toy example in Figure 1 to motivate data poisoning attacks on graph-based semi-supervised learning (we focus on label propagation in this toy example). In this example, the shaded region is very close to one labeled node and yet quite far from the other labeled nodes. After running label propagation, all nodes inside the shaded area are predicted to have the same label as that node. This gives the attacker a chance to manipulate the decision for all unlabeled nodes in the shaded area at the cost of flipping just one label. For example, in Figure 1, if we change that node’s label from positive to negative, the predictions in the shaded area containing three nodes also change from positive to negative.
Besides changing the labels, another way to attack is to perturb the features so that the graph structure changes subtly (recall that the graph is constructed from pairwise distances). For instance, we can change the features so that the labeled node is moved away from the shaded region, while more negatively labeled points are moved towards the shaded area. Then, with label propagation, the labels of the shaded region change from positive to negative as well. We examine both cases in the following sections.
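The label-flipping scenario of the toy example can be reproduced numerically. The sketch below assumes a Gaussian-kernel graph and the harmonic label propagation solution; the coordinates, the kernel width, and the helper name `propagation_matrix` are hypothetical and only mimic the geometry described above (they are not the data behind Figure 1).

```python
import numpy as np

def propagation_matrix(X_l, X_u, gamma):
    """Harmonic label propagation: y_u_hat = K @ y_l with a Gaussian-kernel graph."""
    n_l = X_l.shape[0]
    X = np.vstack([X_l, X_u])
    sq = np.sum(X ** 2, axis=1)
    S = np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    D = np.diag(S.sum(axis=1))
    return np.linalg.solve(D[n_l:, n_l:] - S[n_l:, n_l:], S[n_l:, :n_l])

# Labeled nodes: index 0 sits next to the "shaded region"; the others are far away.
X_l = np.array([[0.0, 0.0], [5.0, 0.0], [-3.0, 0.0], [-3.0, 1.0]])
y_l = np.array([1.0, 1.0, -1.0, -1.0])
# Three unlabeled nodes forming the shaded region around labeled node 0.
X_u = np.array([[0.3, 0.2], [0.2, -0.3], [-0.2, 0.3]])

K = propagation_matrix(X_l, X_u, gamma=1.0)
before = np.sign(K @ y_l)

y_poisoned = y_l.copy()
y_poisoned[0] = -1.0                      # flip a single label
after = np.sign(K @ y_poisoned)

print("before:", before, "after:", after)
```

In this configuration a single flipped label changes the predicted sign of all three unlabeled nodes, because their propagation weights are concentrated on the nearby labeled node.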
3.3 A unified framework
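The goal of a poisoning attack is to modify the data points so as to maximize the error rate (for classification) or the RMSE (for regression); thus we write the objective as
$$\min_{\delta_y \in \mathcal{R}_1,\ \Delta_x \in \mathcal{R}_2} \ -\frac{1}{2} \left\| g\big( (D'_{uu} - S'_{uu})^{-1} S'_{ul} (y_l + \delta_y) \big) - h(y_u) \right\|_2^2 \quad \text{s.t. } \{D', S'\} = \mathrm{Ker}_\gamma(X_l + \Delta_x, X_u), \tag{2}$$
where $h(y_u)$ denotes the labels the attacker compares against: the true labels $y_u$ when they are known, and the estimate $\hat{y}_u$ otherwise. To see the flexibility of Eq. (2) in modeling different tasks, different knowledge levels of attackers, or different budgets, we decompose it into the following parts that are changeable in real applications:

$\mathcal{R}_1$ / $\mathcal{R}_2$ are the constraints on $\delta_y$ and $\Delta_x$. For example, $\mathcal{R}_1 = \{\|\delta_y\|_2 \le d_{\max}\}$ restricts the perturbation $\delta_y$ to be no larger than $d_{\max}$, while $\mathcal{R}_1 = \{\|\delta_y\|_0 \le c_{\max}\}$ makes the solution have at most $c_{\max}$ nonzeros. As to the choices of $\mathcal{R}_2$, besides $\ell_p$ regularization, we can also enforce a group sparsity structure, where each row of $\Delta_x$ could be all zeros.

$g(\cdot)$ is the task-dependent squeeze function: for classification we set $g(x) = \mathrm{sign}(x)$ since the labels are discrete and we evaluate the accuracy; for regression it is the identity function $g(x) = x$, and the $\ell_2$ loss is used.

$\mathrm{Ker}_\gamma$ is the kernel function parameterized by $\gamma$; we choose the Gaussian kernel throughout.

Similar to $S$, the new similarity matrix $S'$ is generated by the Gaussian kernel with parameter $\gamma$, except that it is now calculated upon the poisoned data $X_l + \Delta_x$.

Although not included in this paper, we can also formulate the targeted poisoning attack problem by changing min to max and letting $h(y_u)$ be the target.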
There are two obstacles to solving Eq. (2) that make our algorithms nontrivial. First, the problem is naturally non-convex, making it hard to determine whether a specific solution is globally optimal; second, in classification tasks, where our goal is to maximize the test-time error rate, the objective is non-differentiable over a discrete domain. Besides, even with only hundreds of labeled data points, the search space can be too large for brute-force search, and yet greedy search is too myopic to find a good solution (as we will see in the experiments).
In the next parts, we show how to tackle these two problems separately. Specifically, in the first part, we propose an efficient solver designed for data poisoning attacks on the regression problem under various constraints. Then we proceed to solve the discrete, non-differentiable poisoning attack on the classification problem.
3.4 Regression task, (un)known label
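We first consider the regression task where only label poisoning is allowed. This simplifies Eq. (2) to
$$\text{(estimated label)} \quad \min_{\|\delta_y\|_2 \le d_{\max}} \ -\frac{1}{2} \|K(y_l + \delta_y) - \hat{y}_u\|_2^2 = -\frac{1}{2} \|K \delta_y\|_2^2 \tag{3a}$$
$$\text{(true label)} \quad \min_{\|\delta_y\|_2 \le d_{\max}} \ -\frac{1}{2} \|K(y_l + \delta_y) - y_u\|_2^2 \tag{3b}$$
Here we used the fact that $\hat{y}_u = K y_l$, where we define $K = (D_{uu} - S_{uu})^{-1} S_{ul}$. We can solve Eq. (3a) by SVD: the optimal solution is $\delta_y = \pm d_{\max} v_1$, where $v_1$ is the top right singular vector in the decomposition $K = U \Sigma V^\top$. However, (3b) is less straightforward; in fact it is a non-convex trust region problem, which can be generally formulated as
$$\min_{\|z\|_2 \le 1} \ f(z) = \frac{1}{2} z^\top H z + g^\top z, \quad H \text{ indefinite.} \tag{4}$$
Our case (3b) can thus be described, after the rescaling $\delta_y = d_{\max} z$ and dropping the constant term, by $H = -d_{\max}^2 K^\top K$ and $g = d_{\max} K^\top (y_u - K y_l)$ (here $g$ is a vector, not the squeeze function of Eq. (2)). Recently, Hazan and Koren (2016) proposed a linear-time algorithm for approximately solving such problems. Here we propose an algorithm with an asymptotically linear convergence rate based purely on gradient information, which is stated in Algorithm 1 and Theorem 6. Algorithm 1 has two phases; in the following theorems, we show that phase I ends within a finite number of iterations and that phase II converges at an asymptotically linear rate. We postpone the proofs to Appendix A.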
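The SVD solution of the estimated-label case can be checked numerically. The sketch below is an illustration under stated assumptions: the matrix `K` is a random stand-in for the propagation matrix, and `d_max` is an arbitrary hypothetical budget. It verifies that a perturbation of norm `d_max` along the top right singular vector changes the prediction by `d_max` times the largest singular value, and that random perturbations of the same norm do no better.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((30, 8))      # hypothetical stand-in for the propagation matrix
d_max = 0.5                           # hypothetical perturbation budget

# Top right singular vector of K gives the most damaging label perturbation.
_, singular_values, Vt = np.linalg.svd(K, full_matrices=False)
delta_star = d_max * Vt[0]
best_change = np.linalg.norm(K @ delta_star)

# Compare against random perturbations with the same norm.
trials = rng.standard_normal((1000, K.shape[1]))
trials = d_max * trials / np.linalg.norm(trials, axis=1, keepdims=True)
random_best = np.linalg.norm(trials @ K.T, axis=1).max()

print(best_change, random_best)
```

The sign of `delta_star` is irrelevant for the estimated-label objective, which matches the plus-or-minus ambiguity of the singular vector.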
Theorem 1 (Convergent).
Suppose the operator norm , by choosing a step size with initialization , . Then iterates generated from Algorithm 1 converge to the global minimum.
Lemma 1 (Finite phase I).
Since is indefinite, , and
is the corresponding eigenvector. Denote
is the projection of any onto , let be number of iterations in phase I of Algorithm 1, then:(5) 
Theorem 2 (Asymptotic linear rate).
Let be an infinite sequence of iterates generated by Algorithm 1, suppose it converges to (guaranteed by Theorem 3), let and
be the smallest and largest eigenvalues of
. Assume that is a local minimizer then and given in the interval with , , are line search parameters. There exists an integer such that:for all .
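To make the trust region formulation concrete, the following sketch runs plain projected gradient descent on a small quadratic over the unit ball and then checks the standard global optimality conditions of a trust region problem (stationarity with a multiplier that makes the shifted Hessian positive semidefinite). It is a simplified illustration and not the paper's Algorithm 1: there is no line search and no explicit two-phase logic, and the matrix `H`, the vector `g`, the step size, and the initialization scale `alpha` are hypothetical choices.

```python
import numpy as np

def projected_gradient_trs(H, g, step, alpha=0.5, n_iters=2000):
    """Minimise 0.5 * z^T H z + g^T z over the unit ball by projected gradient descent."""
    z = -alpha * g / np.linalg.norm(g)     # start inside the ball, pointing against g
    for _ in range(n_iters):
        z = z - step * (H @ z + g)         # plain gradient step
        norm = np.linalg.norm(z)
        if norm > 1.0:                     # project back onto the unit ball
            z = z / norm
    return z

# Small indefinite example (hypothetical values).
H = np.diag([-2.0, -1.0, 0.5, 1.0])
g = np.array([0.5, 1.0, -1.0, 0.3])
beta = np.max(np.abs(np.linalg.eigvalsh(H)))   # operator norm of H
z = projected_gradient_trs(H, g, step=0.8 / beta)

# Global optimality check: H z + g = -lam * z with lam >= -lambda_min(H).
grad = H @ z + g
lam = -z @ grad
kkt_residual = np.linalg.norm(grad + lam * z)
print(np.round(z, 4), round(float(lam), 4), kkt_residual)
```

The final check is what distinguishes a global minimizer from other stationary points on the sphere: the multiplier `lam` must be at least the magnitude of the most negative eigenvalue of `H`.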
3.5 Classification task
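As mentioned above, data poisoning attacks on classification problems are more challenging, as we can only flip an unnoticeable fraction of the training labels. This is inherently a combinatorial optimization problem. For simplicity, we restrict the scope to binary classification so that $y_l \in \{-1, +1\}^{n_l}$, and the labels are perturbed as $\tilde{y}_l = y_l \odot \delta_y$, where $\odot$ denotes the Hadamard product and $\delta_y \in \{-1, +1\}^{n_l}$. To restrict the amount of perturbation, we replace the norm constraint in Eq. (3a) with the integer constraint $\sum_{i=1}^{n_l} \mathbb{1}\{\delta_y[i] = -1\} \le c_{\max}$, where $c_{\max}$ is a user-defined constant. In summary, the final objective function has the following form
$$\min_{\delta_y \in \{-1,+1\}^{n_l}} \ -\frac{1}{2} \left\| g\big(K (y_l \odot \delta_y)\big) - h(y_u) \right\|_2^2 \quad \text{s.t. } \sum_{i=1}^{n_l} \mathbb{1}\{\delta_y[i] = -1\} \le c_{\max}, \tag{6}$$
where we define $g(x) = \mathrm{sign}(x)$ and $h(y_u) \in \{-1, +1\}^{n_u}$ (either $y_u$ or its estimate $\hat{y}_u$), so the objective function directly relates to the error rate. Notice that the feasible set contains around $\sum_{k=0}^{c_{\max}} \binom{n_l}{k}$ solutions, making an exhaustive search almost impossible. A simple alternative is greedy search: first initialize $\delta_y = [+1, \dots, +1]$, then at each step select the index $i$ whose flip $\delta_y[i] = +1 \to -1$ decreases the objective function (6) the most, and set $\delta_y[i] = -1$. We repeat this process until the budget in (6) is exhausted.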
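The greedy search described above can be sketched as follows. This is a minimal illustration, assuming the harmonic label propagation predictor with a Gaussian kernel and using the clean predictions as the target labels; the toy coordinates and all function names are hypothetical. Since the objective equals minus twice the number of mismatched predictions, the code simply counts mismatches.

```python
import numpy as np

def propagation_matrix(X_l, X_u, gamma):
    """Harmonic label propagation: y_u_hat = K @ y_l with a Gaussian-kernel graph."""
    n_l = X_l.shape[0]
    X = np.vstack([X_l, X_u])
    sq = np.sum(X ** 2, axis=1)
    S = np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0))
    D = np.diag(S.sum(axis=1))
    return np.linalg.solve(D[n_l:, n_l:] - S[n_l:, n_l:], S[n_l:, :n_l])

def greedy_label_flips(K, y_l, y_target, c_max):
    """Flip up to c_max labels, each time choosing the flip that causes most mismatches."""
    d = np.ones_like(y_l)
    flipped = []
    for _ in range(c_max):
        best_index, best_errors = None, -1
        for i in range(len(y_l)):
            if d[i] < 0:
                continue                       # already flipped
            d[i] = -1.0                        # try flipping label i
            errors = int(np.sum(np.sign(K @ (y_l * d)) != y_target))
            d[i] = 1.0                         # undo the trial flip
            if errors > best_errors:
                best_index, best_errors = i, errors
        if best_index is None:
            break                              # nothing left to flip
        d[best_index] = -1.0
        flipped.append(best_index)
    return d, flipped

X_l = np.array([[0.0, 0.0], [5.0, 0.0], [-3.0, 0.0], [-3.0, 1.0]])
y_l = np.array([1.0, 1.0, -1.0, -1.0])
X_u = np.array([[0.3, 0.2], [0.2, -0.3], [-0.2, 0.3], [5.2, 0.1], [-3.2, 0.2]])

K = propagation_matrix(X_l, X_u, gamma=1.0)
y_target = np.sign(K @ y_l)                    # estimated labels of the unlabeled nodes
d, flipped = greedy_label_flips(K, y_l, y_target, c_max=1)
n_errors = int(np.sum(np.sign(K @ (y_l * d)) != y_target))
print(flipped, n_errors)
```

Each greedy round costs one prediction per candidate label, which is what makes the method cheap but also myopic: it never evaluates combinations of flips.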
Doubtlessly, the greedy solver is myopic: it cannot explore flipping actions that appear suboptimal in the current context, even though some of them might be better in the long run. Inspired by the bandit model, we can view this problem as a multi-armed bandit with $n_l$ arms in total, and apply a strategy similar to $\epsilon$-greedy: each time we assign a high probability to the best action but still leave nonzero probabilities to the other actions. We call the new strategy the probabilistic method. Specifically, we model each action as a Bernoulli random variable $z_i \sim \mathcal{B}(1, \alpha_i)$, where $\alpha_i$ is the probability of flipping label $i$. The new loss function is just an expectation over the Bernoulli variables
(7) 
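Here we replace the integer constraint in Eq. (6) with a regularizer on the flipping probabilities; the original constraint is met by selecting a proper regularization weight. Once problem (7) is solved, we craft the actual perturbation by flipping label $i$ if its flipping probability is among the $c_{\max}$ largest, where $c_{\max}$ is the flipping budget.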
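To solve Eq. (7), we need to find a good gradient estimator. Before that, we replace the sign function $g(x) = \mathrm{sign}(x)$ with $\tanh(x)$ to get a continuously differentiable objective. We borrow the idea of the “reparameterization trick” Figurnov et al. (2018); Tucker et al. (2017) to approximate the Bernoulli vector $z$ (with flipping probabilities $\alpha$) by a continuous random vector
$$\tilde{z} = \frac{1}{1 + \exp\!\left(-\frac{1}{\tau}\left(\log\frac{\alpha}{1-\alpha} + G_1 - G_2\right)\right)}, \tag{8}$$
where $G_1$ and $G_2$ are two independent standard Gumbel random variables and $\tau$ is the temperature controlling the steepness of the sigmoid function: as $\tau \to 0$, the sigmoid function converges pointwise to a step function. Plugging (8) into (7), the new loss function becomes
(9) 
Therefore, we can easily obtain an unbiased, low-variance gradient estimator via Monte Carlo sampling of the Gumbel noise, specifically
(10) 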
Based on that, we can apply many stochastic optimization methods, including SGD and Adam Kingma and Ba (2014), to finalize the process. In the experimental section, we will compare the greedy search with our probabilistic approach on real data.
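The continuous relaxation used by the probabilistic method can be illustrated in isolation. The sketch below assumes the standard Gumbel-based relaxation of a Bernoulli variable (a sigmoid applied to the logit plus the difference of two Gumbel samples, divided by a temperature); the function name, the probabilities, and the temperatures are hypothetical. It does not implement the full attack objective or its gradient.

```python
import numpy as np

def relaxed_bernoulli(alpha, tau, rng, n_samples):
    """Draw relaxed Bernoulli samples in (0, 1) with flip probabilities alpha."""
    logits = np.log(alpha) - np.log1p(-alpha)
    g1 = rng.gumbel(size=(n_samples, alpha.size))
    g2 = rng.gumbel(size=(n_samples, alpha.size))
    x = (logits + g1 - g2) / tau
    return 0.5 * (1.0 + np.tanh(0.5 * x))    # numerically stable sigmoid

rng = np.random.default_rng(0)
alpha = np.array([0.1, 0.5, 0.9])            # hypothetical flipping probabilities

soft = relaxed_bernoulli(alpha, tau=0.5, rng=rng, n_samples=200_000)
sharp = relaxed_bernoulli(alpha, tau=0.01, rng=rng, n_samples=200_000)

# Thresholding at 0.5 recovers Bernoulli(alpha) samples.
hard_frequency = (sharp > 0.5).mean(axis=0)
# At low temperature almost all samples are close to 0 or 1.
near_binary = np.mean((sharp < 0.05) | (sharp > 0.95))
print(np.round(hard_frequency, 3), round(float(near_binary), 3))
```

Because the samples are a smooth function of `alpha` for any positive temperature, a loss evaluated on them can be differentiated with respect to `alpha`, which is what allows optimizers such as SGD or Adam to be applied.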
4 Experiments
In this section, we show the effectiveness of our proposed data poisoning attack algorithms for regression and classification tasks on graph-based SSL.
Name  Task  n (samples)  d (features)  γ
cadata  Regression  8,000  8  1.0
E2006  Regression  19,227  150,360  1.0
mnist17  Classification  26,014  780  0.6
rcv1  Classification  20,242  47,236  0.1
4.1 Experimental settings and baselines
We conduct experiments on two regression and two binary classification datasets (publicly available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The meta-information can be found in Table 1. We use a Gaussian kernel with width $\gamma$ to construct the graph. For each dataset, we randomly choose $n_l$ samples as the labeled set, and the rest are unlabeled. We normalize the feature vectors using the sample mean $\mu$ and the sample variance $\sigma$. For regression data, we also rescale the output. To evaluate the performance of label propagation models, we use the root mean squared error (RMSE) between the predicted and true labels of the unlabeled nodes for regression tasks, and the error rate for classification tasks. Since this is the first work on data poisoning attacks against GSSL, we propose several baselines for comparison, based on graph centrality measures. The first baseline is random perturbation, where we randomly add Gaussian noise (for regression) or Bernoulli noise (for classification) to the labels. The other two baselines, based on graph centrality scores, are more challenging; such scores are widely used to find the “important” nodes in a graph. Intuitively, we need to perturb “important” nodes to attack the model, and we decide the importance by node degree or PageRank. We explain the baselines in more detail in the appendix.
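One possible reading of the degree-based baseline is sketched below: rank the labeled nodes by their weighted degree in the Gaussian-kernel graph and flip the labels of the top-ranked ones. This is an assumed interpretation for illustration only (the exact baseline is described in the paper's appendix, which is not reproduced here); the function names, toy data, and budget are hypothetical.

```python
import numpy as np

def gaussian_similarity(X, gamma):
    """Dense similarity matrix with entries exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def degree_baseline_flips(X_l, X_u, y_l, gamma, c_max):
    """Flip the labels of the c_max labeled nodes with the largest weighted degree."""
    n_l = X_l.shape[0]
    S = gaussian_similarity(np.vstack([X_l, X_u]), gamma)
    np.fill_diagonal(S, 0.0)                   # ignore self-similarity
    degree = S.sum(axis=1)[:n_l]               # degrees of the labeled nodes only
    chosen = np.argsort(-degree)[:c_max]
    y_poisoned = y_l.copy()
    y_poisoned[chosen] *= -1.0
    return y_poisoned, chosen

X_l = np.array([[0.0, 0.0], [5.0, 0.0], [-3.0, 0.0], [-3.0, 1.0]])
y_l = np.array([1.0, 1.0, -1.0, -1.0])
X_u = np.array([[0.3, 0.2], [0.2, -0.3], [-0.2, 0.3], [5.2, 0.1], [-3.2, 0.2]])

y_poisoned, chosen = degree_baseline_flips(X_l, X_u, y_l, gamma=1.0, c_max=2)
print(chosen, y_poisoned)
```

Unlike the optimization-based attacks, this heuristic never looks at the predictions, so a high-degree node is flipped even when doing so changes few unlabeled nodes.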
4.2 Effectiveness of data poisoning to GSSL
In this experiment, we consider the white-box setting where the attacker knows not only the ground truth labels $y_u$ but also the correct hyperparameter $\gamma^*$. We thus apply our proposed label poisoning algorithms from Sections 3.4 and 3.5 to attack regression and classification tasks, respectively. In particular, we apply a norm constraint on the perturbation in the regression task and use the greedy method in the classification task. The results are shown in Figure 2. For both regression and classification problems, small perturbations can lead to vast differences: for instance, on cadata, the RMSE increases markedly under a carefully designed perturbation that is very small compared with the norm of the labels; more surprisingly, on mnist17, the accuracy drops sharply after flipping just a few nodes. This phenomenon indicates that current graph-based SSL, especially the label propagation method, can be very fragile to data poisoning attacks. On the other hand, with the baselines (shown in Figure 2, bottom row), the accuracy does not decline much, which indicates that our proposed attack algorithms are more effective than centrality-based algorithms.
Moreover, the robustness of label propagation is strongly related to the number of labeled data points $n_l$: for all datasets shown in Figure 2, we notice that models with larger $n_l$ tend to be more resistant to poisoning attacks. This phenomenon arises because, during the learning process, label information propagates from labeled nodes to unlabeled ones. Therefore, even if a few nodes are “contaminated” by a poisoning attack, it is still possible to recover the label information from other labeled nodes. Hence this experiment can be regarded as another instance of the “no free lunch” theory in adversarial learning Tsipras et al. (2018).
4.3 Comparing poisoning with and without truth labels
We compare the effectiveness of poisoning attacks with and without the ground truth labels $y_u$. Recall that if an attacker does not hold $y_u$, (s)he will need to replace it with the estimated values $\hat{y}_u$. Thus we expect a degradation of effectiveness due to this replacement, especially when $\hat{y}_u$ is not a good estimate of $y_u$. The result is shown in Figure 3. Surprisingly, we did not observe such a phenomenon: for the regression tasks on cadata and E2006, the two curves are closely aligned, with attacks without ground truth labels being only slightly worse. For the classification tasks on mnist17 and rcv1, we cannot observe any difference; the choices of which nodes to flip are exactly the same (except for one case on rcv1). This experiment implies that hiding the ground truth labels cannot protect SSL models, because attackers can use the estimated labels $\hat{y}_u$ instead.
4.4 Comparing greedy and probabilistic method
In this experiment, we compare the performance of two approximate solvers for problem (6) in Section 3.5, namely the greedy and probabilistic methods. We choose the rcv1 data as opposed to the mnist17 data because rcv1 is much harder for the poisoning algorithm: reaching a comparable error rate requires considerably more label flips on rcv1 than on mnist17. For hyperparameters, we set , , . The results are shown in Figure 4. We can see that for a larger flipping budget, the greedy method easily gets stuck in local optima and is inferior to our probabilistic algorithm.
4.5 Sensitivity analysis of hyperparameter
Since we use the Gaussian kernel to construct the graph, there is an important hyperparameter $\gamma$ (the kernel width) that controls the structure of the graph defined in (1), and which is often chosen empirically by the victim through validation. It is thus interesting to see how the effectiveness of the poisoning attack degrades when the attacker has an imperfect estimate of $\gamma$. To this end, we suppose the victim runs the model at the optimal hyperparameter $\gamma^*$, determined by validation, while the attacker only has a very rough estimate $\gamma_{\mathrm{adv}} \approx \gamma^*$. We conduct this experiment on cadata for the cases where the attacker knows or does not know the ground truth labels $y_u$; the result is shown in Figure 5. It shows that when the adversary does not have exact information about $\gamma$, the attack incurs some penalty in performance (in RMSE or error rate). However, it is safe to choose a smaller $\gamma_{\mathrm{adv}}$ because the performance decays slowly: in Figure 5, for example, the RMSE drops only slightly even when $\gamma_{\mathrm{adv}}$ is considerably smaller than $\gamma^*$. On the other hand, if $\gamma_{\mathrm{adv}}$ is too large, the nodes become more isolated, and thus the perturbations are harder to propagate to neighbors.
5 Conclusion
We conduct the first comprehensive study of data poisoning attacks on GSSL algorithms, including label propagation and manifold regularization (in the appendix). The experimental results for regression and classification tasks demonstrate the effectiveness of our proposed attack algorithms. In the future, it will be interesting to study poisoning attacks on deep semi-supervised learning models.
Acknowledgement
Xuanqing Liu and Cho-Jui Hsieh acknowledge the support of NSF IIS-1719097, Intel faculty award, Google Cloud and Nvidia. Zhu acknowledges NSF 1545481, 1561512, 1623605, 1704117, 1836978 and the MADLab AF COE FA9550-18-1-0166.
References
[1] Athalye et al. (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420.
[2] Belkin et al. (2006) Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7 (Nov), pp. 2399–2434.
[3] Cadima and Jolliffe (1995) Loading and correlations in the interpretation of principle components. Journal of Applied Statistics 22 (2), pp. 203–214.
[4] Carlini and Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
[5] Chapelle et al. (2009) Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks 20 (3), pp. 542–542.
[6] Chen et al. (2018) EAD: elastic-net attacks to deep neural networks via adversarial examples. In Thirty-Second AAAI Conference on Artificial Intelligence.
[7] Chen et al. (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.
[8] Cheng et al. (2019) Query-efficient hard-label black-box attack: an optimization-based approach. In ICLR.
[9] Cheng et al. (2018) Seq2Sick: evaluating the robustness of sequence-to-sequence models with adversarial examples. arXiv preprint arXiv:1803.01128.
[10] d'Aspremont et al. (2008) Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research 9 (Jul), pp. 1269–1294.
[11] Dai et al. (2018) Adversarial attack on graph structured data. In ICML.
[12] Figurnov et al. (2018) Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498.
[13] Goodfellow et al. (2015) Explaining and harnessing adversarial examples. In ICLR.
[14] Hazan and Koren (2016) A linear-time algorithm for trust region problems. Mathematical Programming 158 (1-2), pp. 363–381.
[15] Kingma and Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] Koh and Liang (2017) Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730.
[17] Li et al. (2016) Data poisoning attacks on factorization-based collaborative filtering. In Advances in Neural Information Processing Systems, pp. 1885–1893.
[18] Mei and Zhu (2015) Using machine teaching to identify optimal training-set attacks on machine learners. In AAAI.
[19] Papernot et al. (2016) Crafting adversarial input sequences for recurrent neural networks. In Military Communications Conference, MILCOM 2016 - 2016 IEEE, pp. 49–54.
[20] Sindhwani et al. (2005) Linear manifold regularization for large scale semi-supervised learning.
[21] Speriosu et al. (2011) Twitter polarity classification with label propagation over lexical links and the follower graph. In Proceedings of the First Workshop on Unsupervised Learning in NLP, pp. 53–63.
[22] Szegedy et al. (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
[23] Tsipras et al. (2018) Robustness may be at odds with accuracy.
[24] Tucker et al. (2017) REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In NIPS.
[25] Wang et al. (2018) Attack graph convolutional networks by adding fake nodes. arXiv preprint arXiv:1810.10751.
[26] Wang and Chaudhuri (2018) Data poisoning attacks against online learning. arXiv preprint arXiv:1808.08994.
[27] Xiao et al. (2015) Support vector machines under adversarial label contamination. Neurocomputing 160, pp. 53–62.
[28] Zhang et al. (2015) Label propagation prediction of drug-drug interactions based on clinical side effects. Scientific Reports 5, pp. 12339.
[29] Zhao et al. (2018) Data poisoning attacks on multi-task relationship learning.
[30] Zhu et al. (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 912–919.
[31] Zhu and Ghahramani (2002) Learning from labeled and unlabeled data with label propagation.
[32] Zou et al. (2006) Sparse principal component analysis. Journal of Computational and Graphical Statistics 15 (2), pp. 265–286.
[33] Zügner et al. (2018) Adversarial attacks on classification models for graphs. In KDD.
Appendix A Proof of Convergence
We show that our gradient based nonconvex trust region solver is able to find a global minimum efficiently. First recall the objective function
(11) 
Suppose has decomposition and rank . We focus only on the “easy case”, in which and is the corresponding eigenvector of . In contrast, the hard case is rarely seen in practice due to rounding error, and to be safe we can also add a small Gaussian noise to . To see the structure of the solution, suppose the solution is , then by the KKT conditions, we obtain the conditions for global optimality
(12)  
By condition , further if , then and . Because , which implies . Immediately we know and is the solution of .
As an immediate application of (12), we can conclude the following lemma:
Lemma 2.
When and , among all stationary points if then is the global minimum.
Proof.
We prove the claim by contradiction. Suppose is a stationary point and , according to (12) if is not a global minimum then the third condition in (12) should be violated, implying that . Furthermore, for stationary point , we know the gradient of Lagrangian is zero: . Projecting this equation onto we get
(13) 
By condition we know ; multiply both sides of Eq. 13 by we get , which is in contradiction to . ∎
We now consider the projected gradient descent update rule , with following assumptions:
Assumption 1.
(Bounded step size) Step size , where .
Assumption 2.
(Initialize) , .
Under these assumptions, we next show proximal gradient descent converges to global minimum
Theorem 3.
Proof.
By careful analysis, we can divide the convergence process into two stages. In the first stage, the iterates stay inside the unit ball ; in the second stage the iterates stay on the unit sphere . Furthermore, we can show that the first stage ends within a finite number of iterations. Before that, we introduce the following lemma:
Lemma 3.
Considering the first stage, when iterates are inside unit sphere , i.e. under the update rule , and under assumption that , we will have (recall we define as the operator norm of ).
Proof.
We first define , then by iteration rule , projecting both sides on ,
dividing both sides by , we get
(14) 
solving this geometric series, we get:
(15) 
suppose at th iteration we have , after plugging in Eq. (15) and noticing
(16) 
Furthermore, from Assumption 2 we know that , if we have , equivalently then by Eq. (16) we know leading to a contradiction, so it must hold that
At the same time, the eigenvalues are nondecreasing, for , which means
(17) 
Also recalling the initialization condition implies , subtracting both sides by and noticing
(18) 
Combining Eq. (17) with Eq. (18), we can conclude that if holds, then such relation also holds for index
(19) 
Consider at any iteration time , suppose is the smallest coordinate index such that , and hence holds for all . By Eq. (19) we know and for any (such a may not exist, but it doesn’t matter). By analyzing the sign of we know:
the second equality is true due to Eq. (14), we know for all and .
We complete the proof by following inequalities:
(20)  
Where the last inequality follows from assumption in this lemma. ∎
By applying this lemma on the iterates that are still inside the sphere, we will eventually conclude that monotone increases. In fact, we have the following theorem:
Theorem 4.
Suppose is in the region , such that proximal gradient update equals to plain GD: , then under this update rule, is monotone increasing.
Proof.
We prove by induction. First of all, notice , to prove , it remains to show . For we note that
(21) 
where the last inequality follows from Assumption 2. Now suppose and by update rule we know:
From Lemma 4 we know and recall is the operator norm of , we have , combining them together:
(22) 
by choosing we proved .
Due to induction rule we know that holds for all and moreover, is monotone increasing. ∎
We can easily improve the results above, to show that phase I (where