1 Introduction
We consider stochastic composite optimization problems of the form

(1) $\min_{x\in\mathbb{R}^d}\ \bigl\{\Phi(x) := f\bigl(\mathbb{E}_\xi[g_\xi(x)]\bigr) + r(x)\bigr\},$

where $f:\mathbb{R}^p\to\mathbb{R}$ is a smooth and possibly nonconvex function, $\xi$ is a random variable, each $g_\xi:\mathbb{R}^d\to\mathbb{R}^p$ is a smooth vector mapping, and $r$ is convex and lower-semicontinuous. A special case we will consider separately is when $\xi$ is a discrete random variable with uniform distribution over $\{1,\ldots,n\}$. In this case the problem is equivalent to the deterministic optimization problem

(2) $\min_{x\in\mathbb{R}^d}\ \Bigl\{\Phi(x) := f\Bigl(\frac{1}{n}\sum_{i=1}^n g_i(x)\Bigr) + r(x)\Bigr\}.$
The formulations in (1) and (2) cover a broader range of applications than classical stochastic optimization and empirical risk minimization (ERM) problems, where each $g_\xi$ is a scalar function ($p = 1$) and $f$ is the scalar identity map. A well-known example is policy evaluation in reinforcement learning (RL) (e.g., Sutton and Barto, 1998). With linear value-function approximation, it can be formulated as
$$\min_w\ \bigl\|\mathbb{E}[A]\,w - \mathbb{E}[b]\bigr\|^2,$$
where $A$ and $b$ are a random matrix and a random vector generated by a Markov decision process (MDP) (e.g., Dann et al., 2014). Here we have $g_\xi(w) = A_\xi w - b_\xi$, $f(y) = \|y\|^2$ and $r \equiv 0$.

Another interesting application is risk-averse optimization (e.g., Rockafellar, 2007; Ruszczyński, 2013), which has many applications in RL and financial mathematics. We consider a general formulation of the mean-variance trade-off:

(3) $\min_x\ \bigl\{-\mathbb{E}_\xi[h_\xi(x)] + \lambda\operatorname{Var}_\xi[h_\xi(x)] + r(x)\bigr\},$

where each $h_\xi$ is a reward function (such as the total return of a portfolio) and $\lambda > 0$. The goal of problem (3) is to maximize the average reward with a penalty on the variance, which captures the potential risk. It can be cast in the form of (1) by using the mappings

(4) $g_\xi(x) = \bigl(h_\xi(x),\ h_\xi(x)^2\bigr) \in \mathbb{R}^2, \qquad f(y_1, y_2) = -y_1 + \lambda\bigl(y_2 - y_1^2\bigr).$

Here the intermediate dimension is very low, namely $p = 2$. This leads to very little computational overhead compared with stochastic optimization without composition.
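To make the reduction concrete, here is a minimal numerical sketch (ours, not from the paper) of the mappings in (4), assuming for illustration linear rewards $h_\xi(x) = \mathrm{rew}_\xi^T x$; `lam` is the variance-penalty weight.

```python
import numpy as np

# Minimal sketch (ours) of the mappings in (4), assuming for illustration
# linear rewards h_xi(x) = rew_xi . x; lam is the variance-penalty weight.
def g_xi(x, rew):
    """Inner mapping g_xi(x) = (h_xi(x), h_xi(x)^2), so p = 2."""
    h = rew @ x
    return np.array([h, h * h])

def f(y, lam):
    """Outer function f(y1, y2) = -y1 + lam * (y2 - y1^2)."""
    return -y[0] + lam * (y[1] - y[0] ** 2)

# f applied to the average of g_xi recovers -mean(h) + lam * var(h):
rng = np.random.default_rng(0)
R = rng.normal(size=(1000, 5))            # 1000 sampled reward vectors
x = np.ones(5) / 5                        # uniform allocation
y_bar = np.mean([g_xi(x, rew) for rew in R], axis=0)
h = R @ x
assert np.isclose(f(y_bar, 0.5), -h.mean() + 0.5 * h.var())
```

The assertion checks that $f$ applied to the sample average of the $g_\xi$ values recovers exactly the empirical mean-variance objective.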
Besides these applications, the composition structure in (1) and (2) is of independent interest for research on stochastic and randomized algorithms. For ease of notation, we define

(5) $g(x) := \mathbb{E}_\xi[g_\xi(x)], \qquad F(x) := f\bigl(g(x)\bigr),$

so that $\Phi(x) = F(x) + r(x)$. In addition, let $f'$ and $F'$ denote the gradients of $f$ and $F$ respectively, and let $g'(x)$ denote the Jacobian matrix of $g$ at $x$. Then we have
$$F'(x) = g'(x)^T f'\bigl(g(x)\bigr).$$
In practice, computing $g(x)$ and its Jacobian exactly can be very costly, if not impossible. A common strategy is to use stochastic approximation: we randomly sample a subset $\mathcal{B}$ of realizations of $\xi$ from its distribution and let

(6) $\tilde g(x) = \frac{1}{|\mathcal{B}|}\sum_{\xi\in\mathcal{B}} g_\xi(x), \qquad \tilde g'(x) = \frac{1}{|\mathcal{B}|}\sum_{\xi\in\mathcal{B}} g'_\xi(x), \qquad \widetilde{F'}(x) = \tilde g'(x)^T f'\bigl(\tilde g(x)\bigr).$

However, $\widetilde{F'}(x)$ is always a biased estimate of $F'(x)$, unless one can replace $\tilde g(x)$ with the full expectation $g(x)$. This is in great contrast to the classical stochastic optimization problem

(7) $\min_{x\in\mathbb{R}^d}\ \mathbb{E}_\xi[f_\xi(x)] + r(x),$

where the sample average $\frac{1}{|\mathcal{B}|}\sum_{\xi\in\mathcal{B}} f'_\xi(x)$ is always an unbiased gradient estimator for the smooth part. Using biased gradient estimators causes various difficulties in constructing and analyzing randomized algorithms, but is often inevitable when dealing with objective functions more complex than the empirical risk (see, e.g., Chaudhari et al., 2016; Hazan et al., 2016; Gulcehre et al., 2016; Mobahi and Fisher, 2015). As one of the simplest such models, the analysis of randomized algorithms for (1) may provide insights for solving more challenging problems.
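The bias of the plug-in estimator in (6) can be seen in a toy one-dimensional example (ours, not from the paper): with $f(y) = y^2$, squaring the inner sample average inflates the expectation by the variance of that average, which decays only at rate $1/|\mathcal{B}|$.

```python
import numpy as np

# Toy illustration (ours, not from the paper) of the bias in (6): with the
# scalar composition f(y) = y^2 and g_xi(x) = x + xi, xi ~ N(0, 1), the
# plug-in estimate f(g_tilde(x)) overshoots f(g(x)) = x^2 by Var(g_tilde),
# which decays only at rate 1/|B| as the inner batch size B grows.
rng = np.random.default_rng(1)
x, B, trials = 2.0, 10, 200_000
noise = rng.normal(size=(trials, B))
g_tilde = x + noise.mean(axis=1)     # unbiased estimate of g(x) = x
plug_in = g_tilde ** 2               # f(g_tilde(x)), biased estimate of x^2
bias = plug_in.mean() - x ** 2
assert abs(bias - 1.0 / B) < 0.02    # empirical bias close to Var(xi)/B = 0.1
```

Averaging more inner samples shrinks the bias but never removes it, which is exactly why the analysis in this paper must handle biased estimators directly.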
In this paper, we develop an efficient stochastic composite gradient method called CIVR (Composite Incremental Variance Reduction) for solving problems of the forms (1) and (2). We measure efficiency by the sample complexity of the individual mappings $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated at some point, in order to find an approximate solution. For nonconvex functions, an $\epsilon$-approximate solution is a random output $\bar x$ of the algorithm that satisfies $\mathbb{E}\bigl[\|\mathcal{G}(\bar x)\|^2\bigr] \le \epsilon^2$, where $\mathcal{G}$ is the proximal gradient mapping of the objective function (see details in Section 2). If $r \equiv 0$, then $\mathcal{G} = F'$ and the criterion becomes $\mathbb{E}\bigl[\|F'(\bar x)\|^2\bigr] \le \epsilon^2$. If the objective is convex, we require $\mathbb{E}\bigl[\Phi(\bar x) - \Phi^*\bigr] \le \epsilon$, where $\Phi^* = \min_x \Phi(x)$. For smooth and convex functions these two notions are compatible, meaning that the dependence of the sample complexity on $\epsilon$ is of the same order under both notions.
Table 1: Sample complexities of CIVR under different assumptions. Common assumptions: $f$ and each $g_\xi$ are Lipschitz and smooth, so that $F$ is smooth. The rows correspond to problems (1) and (2), and the columns to four settings: $F$ nonconvex, $F$ gradient dominant, $F$ convex with $r$ convex, and $\Phi$ optimally strongly convex; the precise complexity bounds are stated in Sections 3 and 4.
Table 1 summarizes the sample complexities of the CIVR method obtained in this paper under the different assumptions. For gradient-dominant and optimally strongly convex functions we can define a suitable condition number $\kappa$; in terms of $\kappa$, the complexities for (1) and (2) become linear-convergence bounds carrying a $\log(1/\epsilon)$ factor. In order to better position our contributions, we next discuss related work and then put these results into context.
1.1 Related Work
We first discuss the nonconvex stochastic optimization problem (7), which is a special case of (1). When $r \equiv 0$ and $f_\xi$ is smooth, Ghadimi and Lan (2013) developed a randomized stochastic gradient method with iteration complexity $O(\epsilon^{-4})$. Allen-Zhu (2018) obtained $O(\epsilon^{-3.25})$ with an additional second-order guarantee. There are also many recent works on solving its finite-sum version

(8) $\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^n f_i(x) + r(x),$
which is a special case of (2). By extending the variance-reduction techniques SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014) to nonconvex optimization, Allen-Zhu and Hazan (2016) and Reddi et al. (2016a, b, c) developed randomized algorithms with sample complexity $O(n + n^{2/3}\epsilon^{-2})$. Under additional assumptions of gradient dominance or strong convexity, they obtained sample complexity $O\bigl((n + \kappa n^{2/3})\log(1/\epsilon)\bigr)$, where $\kappa$ is a suitable condition number. Allen-Zhu (2017) and Lei et al. (2017) obtained further improvements in the regime where $n$ is large.
Based on a new variance-reduction technique called SARAH (Nguyen et al., 2017), Nguyen et al. (2019) and Pham et al. (2019) developed nonconvex extensions that attain sample complexities $O(\epsilon^{-3})$ and $O(n + n^{1/2}\epsilon^{-2})$ for the expectation and finite-sum cases respectively. Fang et al. (2018) introduced another variance-reduction technique called Spider, which can be viewed as a more general variant of SARAH. They obtained the same sample complexities for the two cases, but require small step sizes proportional to $\epsilon$. Wang et al. (2018) extended Spider to obtain the same complexities with constant step sizes, as well as improved complexity under the gradient-dominant condition. In addition, Zhou et al. (2018) obtained similar results using a nested SVRG approach.
In addition to the above works on special cases of (1) and (2), there is also considerable recent work on a more general, two-layer stochastic composite optimization problem

(9) $\min_{x\in\mathbb{R}^d}\ \mathbb{E}_\zeta\Bigl[f_\zeta\bigl(\mathbb{E}_\xi[g_\xi(x)]\bigr)\Bigr] + r(x),$

where the outer function is parametrized by another random variable $\zeta$, which is independent of $\xi$. When $r \equiv 0$, Wang et al. (2017a) derived algorithms to find an approximate solution with explicit sample complexities for the smooth nonconvex, smooth convex, and smooth strongly convex cases respectively. For nontrivial convex $r$, Wang et al. (2017b) obtained improved sample complexities for the same three cases.
As a special case of (9), the following finite-sum problem has also received significant attention:

(10) $\min_{x\in\mathbb{R}^d}\ \frac{1}{m}\sum_{j=1}^m f_j\Bigl(\frac{1}{n}\sum_{i=1}^n g_i(x)\Bigr) + r(x).$

When $r \equiv 0$ and the overall objective function is strongly convex, Lian et al. (2017) derived two algorithms based on the SVRG scheme that attain linear convergence, with sample complexities depending on a suitably defined condition number $\kappa$. Huo et al. (2018) also used the SVRG scheme to obtain improved complexity for the smooth nonconvex case and for strongly convex problems with nonsmooth $r$. More recently, Zhang and Xiao (2019) proposed a composite randomized incremental gradient method based on the SAGA estimator (Defazio et al., 2014), which matches the best known complexity when $f$ is smooth and nonconvex, and obtained improved complexities under either the gradient-dominant or strongly convex assumption. When applied to the special cases (1) and (2) that we focus on in this paper, these results are strictly worse than ours in Table 1.
1.2 Contributions and Outline
We develop the CIVR method by extending the variance-reduction techniques of SARAH (Nguyen et al., 2017, 2019; Pham et al., 2019) and Spider (Fang et al., 2018; Wang et al., 2018) to the composite optimization problems (1) and (2). The complexities of CIVR in Table 1 match the best known results for solving the non-composite problems (7) and (8), despite the additional outer composition and the fact that the composite gradient estimator is always biased. Our results thus indicate that the additional smooth composition in (1) and (2) does not incur higher complexity compared with (7) and (8), despite the difficulty of dealing with biased estimators. We believe these results can also be extended to the two-layer problems (9) and (10), with the finite-sum complexities adjusted to account for both sums; such extensions require quite different techniques, however, and we will address them in a separate paper.
The rest of this paper is organized as follows. In Section 2, we introduce the CIVR method. In Section 3, we present convergence results of CIVR for solving the composite optimization problems (1) and (2), together with the required parameter settings. Better complexities of CIVR under the gradient-dominant and optimally strongly convex conditions are given in Section 4. In Section 5, we present numerical experiments for solving a risk-averse portfolio optimization problem (3) on real-world datasets.
2 The composite incremental variance reduction (CIVR) method
Algorithm 1: CIVR (Composite Incremental Variance Reduction)

Input: initial point $x_0^1$, step size $\eta > 0$, epoch lengths $\{T_t\}$, sample sizes $\{S_t\}$ and $\{B_t\}$.
for $t = 1, 2, \ldots, K$:
  sample a set $\mathcal{S}_t$ of size $S_t$ and let
    $y_0^t = \frac{1}{S_t}\sum_{\xi\in\mathcal{S}_t} g_\xi(x_0^t)$, $\quad z_0^t = \frac{1}{S_t}\sum_{\xi\in\mathcal{S}_t} g'_\xi(x_0^t)$;
  for $i = 0, 1, \ldots, T_t - 1$:
    (11) $x_{i+1}^t = \operatorname{prox}_{\eta r}\bigl(x_i^t - \eta\,(z_i^t)^T f'(y_i^t)\bigr)$;
    sample a set $\mathcal{B}_i^t$ of size $B_t$ and update
    (12) $y_{i+1}^t = y_i^t + \frac{1}{B_t}\sum_{\xi\in\mathcal{B}_i^t}\bigl(g_\xi(x_{i+1}^t) - g_\xi(x_i^t)\bigr)$,
    (13) $z_{i+1}^t = z_i^t + \frac{1}{B_t}\sum_{\xi\in\mathcal{B}_i^t}\bigl(g'_\xi(x_{i+1}^t) - g'_\xi(x_i^t)\bigr)$;
  set $x_0^{t+1} = x_{T_t}^t$.
Output: $\bar x$ chosen randomly from the iterates $\{x_i^t\}$.
With the notation in (5), we can write the composite stochastic optimization problem (1) as

(14) $\min_{x\in\mathbb{R}^d}\ \Phi(x) := F(x) + r(x),$

where $F$ is smooth and $r$ is convex. The proximal operator of $r$ with parameter $\eta > 0$ is defined as

(15) $\operatorname{prox}_{\eta r}(v) := \arg\min_x\Bigl\{r(x) + \frac{1}{2\eta}\|x - v\|^2\Bigr\}.$
We assume that $r$ is relatively simple, meaning that its proximal operator has a closed-form solution or can be computed efficiently. The proximal gradient method (e.g., Nesterov, 2013; Beck, 2017) for solving problem (14) is

(16) $x_{k+1} = \operatorname{prox}_{\eta r}\bigl(x_k - \eta F'(x_k)\bigr),$

where $\eta$ is the step size. The proximal gradient mapping of $\Phi$ is defined as

(17) $\mathcal{G}_\eta(x) := \frac{1}{\eta}\Bigl(x - \operatorname{prox}_{\eta r}\bigl(x - \eta F'(x)\bigr)\Bigr).$

As a result, the proximal gradient method (16) can be written as $x_{k+1} = x_k - \eta\,\mathcal{G}_\eta(x_k)$. Notice that when $r \equiv 0$, $\operatorname{prox}_{\eta r}$ becomes the identity mapping and we have $\mathcal{G}_\eta(x) = F'(x)$ for any $\eta > 0$.
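As a concrete example, here is a short sketch (our illustration) of the proximal operator (15) and the gradient mapping (17) for the $\ell_1$ regularizer $r(x) = \lambda\|x\|_1$, the choice used in the experiments of Section 5:

```python
import numpy as np

# Sketch (our illustration) of (15) and (17) for r(x) = lam * ||x||_1,
# whose proximal operator is the soft-thresholding map; eta is the step size.
def prox_l1(v, eta, lam):
    """prox_{eta r}(v) = argmin_x { r(x) + ||x - v||^2 / (2 * eta) }."""
    return np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)

def grad_mapping(x, grad_F, eta, lam):
    """Proximal gradient mapping: (x - prox_{eta r}(x - eta * F'(x))) / eta."""
    return (x - prox_l1(x - eta * grad_F, eta, lam)) / eta

# With lam = 0 the prox is the identity and the mapping reduces to F'(x):
x = np.array([0.5, -1.0])
gF = np.array([2.0, -3.0])
assert np.allclose(grad_mapping(x, gF, 0.1, 0.0), gF)
```

The final assertion checks the remark above: with $r \equiv 0$ the gradient mapping coincides with $F'(x)$ regardless of the step size.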
Suppose $\bar x$ is generated by a randomized algorithm. We call $\bar x$ an $\epsilon$-stationary point in expectation if

(18) $\mathbb{E}\bigl[\|\mathcal{G}_\eta(\bar x)\|^2\bigr] \le \epsilon^2.$

(We assume that $\eta$ is a constant that does not depend on $\epsilon$.) As we mentioned in the introduction, we measure the efficiency of an algorithm by its sample complexity in terms of the $g_\xi$ and their Jacobians $g'_\xi$, i.e., the total number of times they need to be evaluated in order to find a point satisfying (18). Our goal is to develop a randomized algorithm with low sample complexity.
We present in Algorithm 1 the Composite Incremental Variance Reduction (CIVR) method. The method employs a two-timescale variance-reduced estimator for both the inner function value $g(x)$ and its Jacobian $g'(x)$. At the beginning of each outer iteration $t$ (each called an epoch), we construct relatively accurate estimates $y_0^t$ of $g(x_0^t)$ and $z_0^t$ of $g'(x_0^t)$, using a relatively large sample size $S_t$. During each inner iteration $i$ of the $t$-th epoch, we construct estimates $y_i^t$ of $g(x_i^t)$ and $z_i^t$ of $g'(x_i^t)$, using a smaller sample size $B_t$ and incremental corrections from the previous iterations. Note that the epoch length $T_t$ and the sample sizes $S_t$ and $B_t$ are all adjustable for each epoch $t$. Therefore, besides setting a constant set of parameters, we can also adjust them gradually in order to obtain better theoretical properties and practical performance.

This variance-reduction technique was first proposed as part of SARAH (Nguyen et al., 2017), where it is called recursive variance reduction. It was also proposed in (Fang et al., 2018) in the form of a Stochastic Path-Integrated Differential EstimatoR (Spider). Here we simply call it incremental variance reduction. A distinct feature of this incremental estimator is that the inner-loop estimates $y_i^t$ and $z_i^t$ are biased, i.e.,

(19) $\mathbb{E}\bigl[y_i^t\bigr] \ne g(x_i^t), \qquad \mathbb{E}\bigl[z_i^t\bigr] \ne g'(x_i^t) \qquad \text{for } i \ge 1.$
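For concreteness, one epoch of the inner-loop updates can be sketched as follows (a schematic illustration under assumed interfaces, not the authors' reference implementation):

```python
import numpy as np

# Schematic sketch of one CIVR epoch under assumed interfaces (ours, not the
# authors' reference implementation): g(x, idx) returns the stacked values
# g_xi(x) for the sampled indices, Jg(x, idx) the stacked Jacobians,
# f_grad the gradient of f, and prox the proximal operator of eta * r.
def civr_epoch(x, g, Jg, f_grad, prox, eta, S_idx, B_batches):
    y = g(x, S_idx).mean(axis=0)        # anchor estimate of g(x), sample size S
    z = Jg(x, S_idx).mean(axis=0)       # anchor estimate of g'(x)
    for idx in B_batches:               # inner iterations, batch size B
        grad = z.T @ f_grad(y)          # composite gradient estimate (biased)
        x_new = prox(x - eta * grad)
        # incremental (recursive) corrections, as in SARAH / Spider:
        y = y + (g(x_new, idx) - g(x, idx)).mean(axis=0)
        z = z + (Jg(x_new, idx) - Jg(x, idx)).mean(axis=0)
        x = x_new
    return x
```

Even though the anchor estimates are unbiased, after the first incremental correction the estimates $y$ and $z$, and hence the composite gradient estimate $z^T f'(y)$, are biased.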
This is in contrast to two other popular variance-reduction techniques, SVRG (Johnson and Zhang, 2013) and SAGA (Defazio et al., 2014), whose gradient estimators are always unbiased. Note, however, that unbiased estimators of $g$ and $g'$ are not essential here, because the composite gradient estimator $(z_i^t)^T f'(y_i^t)$ is always biased anyway.

3 Convergence Analysis
In this section, we present theoretical results on the convergence properties of CIVR (Algorithm 1) when the composite function $F$ is smooth. More specifically, we make the following assumptions.
Assumption 1.
The following conditions hold concerning problems (1) and (2):

$f$ is smooth and $L_f$-Lipschitz, and its gradient $f'$ is $\ell_f$-Lipschitz.

Each $g_\xi$ is smooth and $L_g$-Lipschitz, and its Jacobian $g'_\xi$ is $\ell_g$-Lipschitz. Consequently, $g$ in (5) is $L_g$-Lipschitz and its Jacobian $g'$ is $\ell_g$-Lipschitz.

$r$ is a convex and lower-semicontinuous function.

The overall objective function $\Phi$ is bounded below, i.e., $\Phi^* := \inf_x \Phi(x) > -\infty$.
Assumption 2.
For problem (1), we further assume that there exist constants $\sigma$ and $\delta$ such that, for all $x$,

(20) $\mathbb{E}_\xi\bigl[\|g_\xi(x) - g(x)\|^2\bigr] \le \sigma^2, \qquad \mathbb{E}_\xi\bigl[\|g'_\xi(x) - g'(x)\|^2\bigr] \le \delta^2.$
As a result of Assumption 1, $F$ is smooth and $F'$ is Lipschitz continuous with constant $L_F = L_g^2\ell_f + L_f\ell_g$ (see the proof in the supplementary materials). For convenience, we also define two constants
(21) 
These constants appear in the step-size conditions below, where the step size $\eta$ will be chosen inversely proportional to them.
In the next two subsections, we present the complexity analysis of CIVR for solving problems (1) and (2) respectively. Due to space limitations, all proofs are provided in the supplementary materials.
3.1 The composite expectation case
The following results for solving problem (1) are stated in the notation defined in (5), (17) and (21).
Theorem 1.
Note that in the above scheme, the epoch lengths and all the batch sizes are set to constants (depending on a prefixed precision $\epsilon$) without regard to the epoch index $t$. Intuitively, we do not need as many samples in the early stage of the algorithm as in the later stage. In addition, it is useful in practice to have a variant of the algorithm that can adaptively choose the epoch lengths and batch sizes across epochs without dependence on a prefixed precision. This is done in the following theorem.
3.2 The composite finite-sum case
In this section, we consider the composite finite-sum optimization problem (2). In this case, the random variable $\xi$ has a uniform distribution over the finite index set $\{1, \ldots, n\}$. At the beginning of each epoch in Algorithm 1, we use the full sample size $S_t = n$ to compute $y_0^t$ and $z_0^t$. Therefore $y_0^t = g(x_0^t)$ and $z_0^t = g'(x_0^t)$ for all $t$, and equation (12) in Algorithm 1 becomes

(24) $y_{i+1}^t = y_i^t + \frac{1}{B_t}\sum_{\xi\in\mathcal{B}_i^t}\bigl(g_\xi(x_{i+1}^t) - g_\xi(x_i^t)\bigr), \qquad y_0^t = g(x_0^t),$

and similarly for $z_{i+1}^t$ with $z_0^t = g'(x_0^t)$.
Also in this case, we no longer need Assumption 2.
Theorem 3.
Suppose Assumption 1 holds. Let the parameters in Algorithm 1 be set as $S_t = n$ and $T_t = B_t = \lceil\sqrt{n}\,\rceil$ for all $t$. Then, as long as the step size $\eta$ is sufficiently small, we have for any number of epochs,
(25) 
As a result, obtaining an $\epsilon$-approximate solution requires $O\bigl(1 + 1/(\sqrt{n}\,\epsilon^2)\bigr)$ epochs and a total sample complexity of $O\bigl(n + \sqrt{n}\,\epsilon^{-2}\bigr)$.
Similar to the previous section, we can also choose the epoch lengths and sample sizes adaptively to save sampling cost in the early stage of the algorithm. However, due to the finite-sum structure of the problem, once the batch size reaches $n$ we take the full batch at the beginning of each epoch to obtain the exact $g(x_0^t)$ and $g'(x_0^t)$. This leads to the following theorem.
Theorem 4.
Suppose Assumption 1 holds. For some positive constants $a$ and $\rho > 1$, denote by $t_0$ the first epoch with $a\rho^{t_0} \ge n$. When $t < t_0$, we set the parameters to $S_t = \lceil a\rho^t\rceil$ and $T_t = B_t = \lceil\sqrt{S_t}\,\rceil$; when $t \ge t_0$, we set $S_t = n$ and $T_t = B_t = \lceil\sqrt{n}\,\rceil$. Then, as long as the step size $\eta$ is sufficiently small,
(26) 
As a result, the total sample complexity of Algorithm 1 for obtaining an $\epsilon$-approximate solution is $\tilde O\bigl(n + \sqrt{n}\,\epsilon^{-2}\bigr)$, where $\tilde O$ hides logarithmic factors.
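The adaptive schedule just described can be sketched as follows (the geometric growth rate `rho` and initial size `s0` are illustrative assumptions, not the exact constants of the theorem):

```python
# Sketch of the adaptive sampling schedule behind Theorem 4 (the geometric
# growth rate rho and the initial size s0 are illustrative assumptions, not
# the exact constants of the theorem): anchor sample sizes grow until they
# reach n, after which every epoch starts from the exact full-batch g and g'.
def adaptive_schedule(n, epochs, s0=4, rho=2.0):
    return [min(int(s0 * rho ** t), n) for t in range(epochs)]

# e.g. n = 100: sizes 4, 8, 16, 32, 64, then capped at the full batch 100
assert adaptive_schedule(100, 8) == [4, 8, 16, 32, 64, 100, 100, 100]
```

Early epochs are cheap, and once the full batch is reached the method behaves exactly like the constant-parameter scheme of Theorem 3.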
4 Fast convergence rates under stronger conditions
In this section we consider two cases where fast linear convergence can be guaranteed for CIVR.
4.1 Gradient-dominant functions
The first case is when $r \equiv 0$ and $F$ is $\nu$-gradient dominant, i.e., there is some $\nu > 0$ such that

(27) $F(x) - F^* \le \nu\,\|F'(x)\|^2 \qquad \text{for all } x \in \mathbb{R}^d,$

where $F^* = \min_x F(x)$. Note that a $\mu$-strongly convex function is gradient dominant with $\nu = 1/(2\mu)$. Hence strong convexity is a special case of the gradient-dominant condition, which in turn is a special case of the Polyak-Łojasiewicz condition with the Łojasiewicz exponent equal to 2 (see, e.g., Karimi et al., 2016).
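As a quick numeric sanity check of this claim (our illustration), a strongly convex quadratic satisfies (27) with $\nu = 1/(2\mu)$, since $\|F'(x)\|^2 = x^T A^2 x \ge \lambda_{\min}(A)\,x^T A x$:

```python
import numpy as np

# Numeric check (our illustration) that a strongly convex quadratic is
# gradient dominant: for F(x) = 0.5 * x^T A x with A positive definite,
# F(x) - F* <= ||F'(x)||^2 / (2 * mu) with mu = lambda_min(A), since
# ||A x||^2 = x^T A^2 x >= lambda_min(A) * x^T A x and F* = 0 at x* = 0.
rng = np.random.default_rng(3)
M = rng.normal(size=(4, 4))
A = M @ M.T + np.eye(4)                 # positive definite
mu = np.linalg.eigvalsh(A).min()        # strong convexity parameter
for _ in range(100):
    x = rng.normal(size=4)
    gap = 0.5 * x @ A @ x               # F(x) - F*
    grad_sq = (A @ x) @ (A @ x)         # ||F'(x)||^2
    assert gap <= grad_sq / (2 * mu) + 1e-9
```

Gradient dominance, unlike strong convexity, does not require convexity, which is why it is the weaker condition used in this section.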
In order to solve (1) to a prefixed precision $\epsilon$, we use the periodic restart strategy described below.
Theorem 5.
Consider (1) with $r \equiv 0$. Suppose Assumptions 1 and 2 hold and $F$ is $\nu$-gradient dominant. Given any $\epsilon > 0$, let the epoch length, sample sizes and step size be set appropriately as functions of $\nu$, $\epsilon$ and the constants in Assumptions 1 and 2. Then,
(28) 
Therefore, we can periodically restart Algorithm 1 after a fixed number of epochs (using the output of the previous period as input to the new period); the expected optimality gap then converges linearly, decreasing by a constant factor per period. As a result, the sample complexity for finding an $\epsilon$-approximate solution carries only a $\log(1/\epsilon)$ factor on top of the per-period cost.
The restart strategy also applies to the finitesum case.
4.2 Optimally strongly convex functions
In this part, we assume an optimal strong convexity condition on the objective $\Phi$, i.e., there exists $\mu > 0$ such that

(30) $\Phi(x) - \Phi(x^*) \ge \frac{\mu}{2}\|x - x^*\|^2 \qquad \text{for all } x,$

where $x^*$ is a minimizer of $\Phi$.
We have the following two results for solving problems (1) and (2) respectively.
Theorem 7.
Theorem 8.
If we define a condition number $\kappa = L_F/\mu$, then since $\mu \le L_F$ we have $\kappa \ge 1$, and the above complexities can be written compactly in terms of $\kappa$.
5 Numerical Experiments
In this section, we present numerical experiments on a risk-averse portfolio optimization problem. Suppose there are $d$ assets that one can invest in during $n$ time periods, labeled $1, \ldots, n$. Let $R_{t,j}$ be the return or payoff per unit of asset $j$ at time $t$, and let $R_t \in \mathbb{R}^d$ be the vector consisting of $R_{t,1}, \ldots, R_{t,d}$. Let $x \in \mathbb{R}^d$ be the decision variable, where each component $x_j$ represents the amount of investment, or the percentage of the total investment, allocated to asset $j$. The same allocation is repeated over the $n$ time periods. We would like to maximize the average return over the $n$ periods, but with a penalty on the variance of the returns across the periods (in other words, we would like different periods to have similar returns).
This problem can be formulated as a finite-sum version of problem (3), with $\xi$ uniform over $\{1, \ldots, n\}$ and $h_t(x) = R_t^T x$ for $t = 1, \ldots, n$. The function $r$ can be chosen as the indicator function of an $\ell_1$ ball, or as a soft $\ell_1$ regularization term; we choose the latter in our experiments to obtain a sparse asset allocation. Using the mappings defined in (4), the problem can be further transformed into the composite finite-sum form (2), hence readily solved by the CIVR method. For comparison, we implement the C-SAGA algorithm (Zhang and Xiao, 2019) as a benchmark. As another benchmark, the problem can also be formulated as a two-layer composite finite-sum problem (10), as was done in (Huo et al., 2018) and (Lian et al., 2017); we solve this two-layer formulation with ASC-PG (Wang et al., 2017b) and VRSC-PG (Huo et al., 2018). Finally, we also implement CIVR-adp, the adaptive-sampling variant described in Theorem 4.
We test these algorithms on three real-world portfolio datasets, which contain 30, 38 and 49 industrial portfolios respectively, from the Kenneth R. French Data Library (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). For the three datasets, the daily data of the most recent 24,452, 10,000 and 24,400 days, respectively, are extracted to conduct the experiments. We fix the risk-penalty parameter $\lambda$ in (3) and use an $\ell_1$ regularization term for $r$. The experimental results are shown in Figure 1. The curves are averaged over 20 runs and are plotted against the number of samples of the component functions (the horizontal axis).
Throughout the experiments, the VRSC-PG and C-SAGA algorithms use the batch sizes dictated by their complexity theory, while CIVR uses the batch size $\lceil\sqrt{n}\,\rceil$ dictated by ours. CIVR-adp employs the adaptive batch sizes of Theorem 4. For the Industrial-30 dataset, VRSC-PG, C-SAGA, CIVR and CIVR-adp all use the same step size $\eta$, chosen by experiment from a grid of candidate values as the one that works best for all four methods simultaneously. The step sizes for the Industrial-38 and Industrial-49 datasets are chosen in the same way. For ASC-PG, we hand-tune its two step-size parameters (see details in Wang et al., 2017b) to ensure that it converges fast among a range of tested settings. Overall, CIVR and CIVR-adp outperform the other methods.
References
 Allen-Zhu [2017] Zeyuan Allen-Zhu. Natasha: Faster nonconvex stochastic optimization via strongly nonconvex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 89–97, Sydney, Australia, 2017.
 Allen-Zhu [2018] Zeyuan Allen-Zhu. Natasha 2: Faster nonconvex optimization than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018.
 Allen-Zhu and Hazan [2016] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 699–707, 2016.
 Beck [2017] Amir Beck. FirstOrder Methods in Optimization. MOSSIAM Series on Optimization. SIAM, 2017.
 Chaudhari et al. [2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropysgd: Biasing gradient descent into wide valleys. arXiv preprint, arXiv:1611.01838, 2016.
 Dann et al. [2014] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.
 Defazio et al. [2014] Aaron Defazio, Francis Bach, and Simon LacosteJulien. SAGA: A fast incremental gradient method with support for nonstrongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
 Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Nearoptimal nonconvex optimization via stochastic pathintegrated differential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
 Ghadimi and Lan [2013] Saeed Ghadimi and Guanghui Lan. Stochastic first and zerothorder methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Gulcehre et al. [2016] Caglar Gulcehre, Marcin Moczulski, Francesco Visin, and Yoshua Bengio. Mollifying networks. arXiv preprint, arXiv:1608.04980, 2016.
 Hazan et al. [2016] Elad Hazan, Kfir Yehuda Levy, and Shai ShalevShwartz. On graduated optimization for stochastic nonconvex problems. In International conference on machine learning, pages 1833–1841, 2016.

 Huo et al. [2018] Zhouyuan Huo, Bin Gu, Ji Liu, and Heng Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018.
 Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
 Karimi et al. [2016] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases - European Conference, Proceedings, pages 795–811, 2016.
 Lei et al. [2017] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Nonconvex finitesum optimization via SCSG methods. In Advances in Neural Information Processing Systems 30, pages 2348–2358. Curran Associates, Inc., 2017.
 Li and Li [2018] Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 5564–5574. Curran Associates, Inc., 2018.
 Lian et al. [2017] Xiangru Lian, Mengdi Wang, and Ji Liu. Finitesum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.

 Mobahi and Fisher [2015] Hossein Mobahi and John W. Fisher. On the link between Gaussian homotopy continuation and convex envelopes. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 43–56. Springer, 2015.
 Nesterov [2013] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
 Nguyen et al. [2017] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research (PMLR), pages 2613–2621, Sydney, Australia, 2017.
 Nguyen et al. [2019] Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with SARAH. arXiv preprint, arXiv:1901.07648, 2019.
 Pham et al. [2019] Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc TranDinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. arXiv preprint, arXiv:1902.05679, 2019.
 Reddi et al. [2016a] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 2016a.
 Reddi et al. [2016b] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016b.
 Reddi et al. [2016c] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finitesum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016c.
 Rockafellar [1970] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
 Rockafellar [2007] R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. INFORMS TutORials in Operations Research, 2007.
 Ruszczyński [2013] Andrzej Ruszczyński. Advances in riskaverse optimization. INFORMS TutORials in Operation Research, 2013.
 Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
 Wang et al. [2017a] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expectedvalue functions. Mathematical Programming, 161(12):419–449, 2017a.
 Wang et al. [2017b] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1–23, 2017b.
 Wang et al. [2018] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variancereduced algorithms for nonconvex optimization. arXiv preprint, arXiv:1810.10690, 2018.
 Xiao and Zhang [2014] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
 Zhang and Xiao [2019] Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. In Proceedings of the 36th International Conference on Machine Learning (ICML), number 97 in Proceedings of Machine Learning Research (PMLR), Long Beach, California, 2019.
 Zhou et al. [2018] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3921–3932. Curran Associates, Inc., 2018.
Appendix A Convergence analysis for the composite expectation case

In this section, we focus on the convergence analysis of CIVR for solving the stochastic composite optimization problem (1), and prove Theorems 1 and 2.

First, we show that under Assumption 1, the composite function $F$ is smooth with Lipschitz constant $L_F = L_g^2\ell_f + L_f\ell_g$. For any $x, \bar x \in \mathbb{R}^d$,

$\|F'(x) - F'(\bar x)\| = \bigl\|g'(x)^T f'(g(x)) - g'(\bar x)^T f'(g(\bar x))\bigr\| \le \|g'(x)\|\,\bigl\|f'(g(x)) - f'(g(\bar x))\bigr\| + \|g'(x) - g'(\bar x)\|\,\bigl\|f'(g(\bar x))\bigr\| \le L_g\ell_f\,\|g(x) - g(\bar x)\| + \ell_g L_f\,\|x - \bar x\| \le \bigl(L_g^2\ell_f + L_f\ell_g\bigr)\|x - \bar x\|,$

where we used $\|g'(x)\| \le L_g$ (so that $g$ is $L_g$-Lipschitz) and $\|f'(\cdot)\| \le L_f$, which are implied by the Lipschitz conditions on $g$ and $f$ respectively.
Although the incremental estimators used in CIVR are biased, as shown in (19), we can still bound their squared distances from the targets. This is given in the following lemma.
Lemma 1.
Suppose Assumption 1 holds. Let $y_i^t$ and $z_i^t$ be constructed according to (12) and (13) in Algorithm 1. For any $t \ge 1$ and $0 \le i \le T_t$, we have the following mean squared error (MSE) bounds:

(33) $\mathbb{E}\bigl[\|y_i^t - g(x_i^t)\|^2\bigr] \le \frac{L_g^2}{B_t}\sum_{j=1}^{i}\mathbb{E}\bigl[\|x_j^t - x_{j-1}^t\|^2\bigr] + \mathbb{E}\bigl[\|y_0^t - g(x_0^t)\|^2\bigr],$

and the analogous bound for $\mathbb{E}\bigl[\|z_i^t - g'(x_i^t)\|^2\bigr]$ with $L_g$ replaced by $\ell_g$ and $y$ replaced by $z$.
Proof.
We first state a fact that allows us to decompose the MSE into a squared bias term and a variance term: for an arbitrary random vector $y$ and a constant vector $c$, we have

(34) $\mathbb{E}\bigl[\|y - c\|^2\bigr] = \|\mathbb{E}[y] - c\|^2 + \mathbb{E}\bigl[\|y - \mathbb{E}[y]\|^2\bigr],$

where the cross term vanishes because $\mathbb{E}\bigl[y - \mathbb{E}[y]\bigr] = 0$. As a result, we can apply this decomposition conditioned on the history $\mathcal{F}_{i-1}$ up to inner iteration $i-1$, with $y = y_i^t$ and $c = g(x_i^t)$. For the bias term, the recursion (12) gives $\mathbb{E}[y_i^t \mid \mathcal{F}_{i-1}] - g(x_i^t) = y_{i-1}^t - g(x_{i-1}^t)$, i.e., the conditional bias equals the error of the previous estimate. For the variance term, we have

$\mathbb{E}\Bigl[\bigl\|y_i^t - \mathbb{E}[y_i^t \mid \mathcal{F}_{i-1}]\bigr\|^2 \,\Big|\, \mathcal{F}_{i-1}\Bigr] \le \frac{1}{B_t}\,\mathbb{E}_\xi\bigl[\|g_\xi(x_i^t) - g_\xi(x_{i-1}^t)\|^2\bigr] \le \frac{L_g^2}{B_t}\,\|x_i^t - x_{i-1}^t\|^2,$

where the first inequality is due to the fact that $g(x_i^t) - g(x_{i-1}^t)$ is a constant conditioned on $\mathcal{F}_{i-1}$, and in the last inequality we used the Lipschitz continuity of $g_\xi$. Consequently,

$\mathbb{E}\bigl[\|y_i^t - g(x_i^t)\|^2\bigr] \le \mathbb{E}\bigl[\|y_{i-1}^t - g(x_{i-1}^t)\|^2\bigr] + \frac{L_g^2}{B_t}\,\mathbb{E}\bigl[\|x_i^t - x_{i-1}^t\|^2\bigr].$

Recursively applying the above inequality yields

(35) $\mathbb{E}\bigl[\|y_i^t - g(x_i^t)\|^2\bigr] \le \frac{L_g^2}{B_t}\sum_{j=1}^{i}\mathbb{E}\bigl[\|x_j^t - x_{j-1}^t\|^2\bigr] + \mathbb{E}\bigl[\|y_0^t - g(x_0^t)\|^2\bigr].$

Similarly, the bound on $\mathbb{E}\bigl[\|z_i^t - g'(x_i^t)\|^2\bigr]$ can be shown by using the Lipschitz continuity of $g'_\xi$.
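The decomposition (34) that the proof starts from can be verified numerically; with the empirical mean in place of the expectation, it holds as an exact algebraic identity for any sample of vectors:

```python
import numpy as np

# Numerical check of the bias-variance decomposition (34): with the
# empirical mean in place of E[y], the identity holds exactly for any sample.
rng = np.random.default_rng(4)
c = np.array([1.0, -2.0])
y = rng.normal(loc=[0.5, 0.0], scale=1.0, size=(100_000, 2))
mse = ((y - c) ** 2).sum(axis=1).mean()             # E||y - c||^2
bias_sq = ((y.mean(axis=0) - c) ** 2).sum()         # ||E[y] - c||^2
var = ((y - y.mean(axis=0)) ** 2).sum(axis=1).mean()  # E||y - E[y]||^2
assert abs(mse - (bias_sq + var)) < 1e-6
```

The cross term cancels because the sample deviations from the sample mean sum to zero, mirroring the argument in the proof.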