1 Introduction
Large-scale machine learning problems have a large number of data points, a large number of features in each data point, or both. This leads to high per-iteration complexity of iterative learning algorithms and results in slow training of models. Thus, large-scale learning, i.e., learning on big data, is one of the major challenges in machine learning today Chauhan et al. (2017); Zhou et al. (2017). To tackle this challenge, recent research has focused on stochastic optimization Chauhan et al. (2018c), coordinate descent Wright (2015), proximal algorithms Parikh and Boyd (2014), parallel and distributed algorithms Yang et al. (2016) (as discussed in Chauhan et al. (2018a)), and momentum-accelerated algorithms Allen-Zhu (2017). Stochastic approximation introduces variance between the deterministic gradient and the noisy gradients calculated from sampled data points, which affects the convergence of learning algorithms. There are several approaches to deal with this stochastic noise, the most important of which (as discussed in Csiba and Richtárik (2016)) are: (a) mini-batching Chauhan et al. (2018b), (b) decreasing learning rates Shalev-Shwartz et al. (2007), (c) variance reduction Le Roux et al. (2012), and (d) importance sampling Csiba and Richtárik (2016). To deal with large-scale learning problems, we use mini-batching and variance reduction in this paper.
1.1 Optimization Problem
In this paper, we consider the composite convex optimization problem given below:

(1) $\min_{w \in \mathbb{R}^d} F(w) = f(w) + g(w), \quad \text{where } f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w),$

where $f(w)$ is a finite average of component functions $f_i(w)$, $i = 1, 2, \dots, n$, which are convex and smooth, and $g(w)$ is a relatively simple convex but possibly non-differentiable function (also referred to as a regularizer and sometimes as a proximal function). This kind of optimization problem is found in operations research, data science, signal processing, statistics and machine learning. For example, the regularized Empirical Risk Minimization (ERM) problem, which is an average of losses on the training dataset, is a common problem in machine learning. In ERM, the component function $f_i(w)$
denotes the value of the loss function at one data point; e.g., in binary classification it can be the logistic loss, $f_i(w) = \log\left(1 + \exp\left(-y_i x_i^{\top} w\right)\right)$, where $\{(x_1, y_1), \dots, (x_n, y_n)\}$ is the collection of training data points, or the hinge loss, $f_i(w) = \max\left(0, 1 - y_i x_i^{\top} w\right)$; for regression problems, it can be the least-squares loss, $f_i(w) = \frac{1}{2}\left(x_i^{\top} w - y_i\right)^2$. The regularizer $g(w)$ can be $\lambda_1 \|w\|_1$ ($\ell_1$-regularizer), $\frac{\lambda_2}{2}\|w\|^2$ ($\ell_2$-regularizer) or $\lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|^2$ (elastic-net regularizer), where $\lambda_1$ and $\lambda_2$ are regularization coefficients. Thus, problems like logistic regression, SVM, ridge regression and lasso fall under ERM.
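To make the ERM components above concrete, the following sketch implements the losses and regularizers in Python; the function names and the toy interface are ours, not the paper's (labels $y_i \in \{-1, +1\}$):

```python
import numpy as np

# Illustrative ERM components; names and interfaces are ours, not the paper's.
def logistic_loss(w, x, y):
    return np.log1p(np.exp(-y * x.dot(w)))

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * x.dot(w))

def squared_loss(w, x, y):          # least squares (y real-valued)
    return 0.5 * (x.dot(w) - y) ** 2

def l2_reg(w, lam):                 # smooth regularizer
    return 0.5 * lam * w.dot(w)

def l1_reg(w, lam):                 # non-smooth regularizer
    return lam * np.abs(w).sum()

def elastic_net(w, lam1, lam2):     # combination of the two
    return l1_reg(w, lam1) + l2_reg(w, lam2)

def objective(w, X, y, lam):
    """F(w) = (1/n) sum_i f_i(w) + g(w), here with logistic loss and l2-regularizer."""
    n = X.shape[0]
    return np.mean([logistic_loss(w, X[i], y[i]) for i in range(n)]) + l2_reg(w, lam)
```

Any of the losses can be swapped into `objective` to obtain the corresponding ERM instance.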
1.2 Solution Techniques for Optimization Problem
The simplest first-order method to solve problem (1) was given by Cauchy in his seminal work of 1847 and is known as the Gradient Descent (GD) method Cauchy (1847); for iteration $k$, it is given as:

(2) $w_{k+1} = w_k - \eta \nabla F(w_k),$
where $\eta$ is the learning rate (also known as the step size in optimization). For a non-smooth regularizer, i.e., when $g$ is non-smooth, typically a proximal step is calculated after the gradient step, and the method is called Proximal Gradient Descent (PGD), as given below:

(3) $w_{k+1} = \mathrm{prox}_{\eta g}\left(w_k - \eta \nabla f(w_k)\right),$

where $\mathrm{prox}_{\eta g}(z) = \arg\min_{y} \left\{ \frac{1}{2\eta}\|y - z\|^2 + g(y) \right\}$. GD converges linearly for strongly-convex and smooth problems, and (3) converges at a rate of $O(1/k)$ for non-strongly convex differentiable problems, where $k$ is the number of iterations. The per-iteration complexity of the GD and PGD methods is $O(nd)$. Since for large-scale learning problems the values of $n$ (number of data points) and/or $d$ (number of features in each data point) are very large, the per-iteration complexity of these methods is very high. Each iteration becomes computationally expensive, and might even be infeasible for a machine with limited capacity, which leads to slow training of machine learning models. Thus, the challenge is to develop efficient algorithms to deal with large-scale learning problems Chauhan et al. (2017); Zhou et al. (2017).
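As an illustration of the PGD iteration (a gradient step on the smooth part followed by the proximal step on the regularizer), here is a minimal sketch for a lasso-type least-squares problem, where the proximal operator of the $\ell_1$-regularizer has the well-known soft-thresholding closed form; the problem instance and function names are ours:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gradient_descent(X, y, lam, eta, iters=500):
    """PGD sketch for (1/2n)||Xw - y||^2 + lam*||w||_1 (a lasso instance).

    Each iteration: gradient step on the smooth part f,
    then prox step on the non-smooth regularizer g = lam*||.||_1.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T.dot(X.dot(w) - y) / n              # gradient of smooth part
        w = soft_threshold(w - eta * grad, eta * lam)  # proximal step
    return w
```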
To tackle this challenge, stochastic approximation is one of the most popular approaches. It was first introduced by Robbins and Monro in their seminal work back in 1951, and it makes each iteration independent of the number of data points Kiefer and Wolfowitz (1952); Robbins and Monro (1951). Based on this approach, we have the Stochastic Gradient Descent (SGD) method Bottou (2010), given below, to solve problem (1) for the smooth case:

(4) $w_{k+1} = w_k - \eta_k \nabla f_{i_k}(w_k),$
where $i_k$ is selected uniformly at random from $\{1, 2, \dots, n\}$ and $\eta_k$ is the learning rate at iteration $k$. The per-iteration complexity of SGD is $O(d)$, and it is very effective for problems with a large number of data points. Since $\mathbb{E}\left[\nabla f_{i_k}(w)\right] = \nabla f(w)$, $\nabla f_{i_k}(w)$ is an unbiased estimator of $\nabla f(w)$, but the variance between these two values necessitates decreasing learning rates, which leads to slow convergence of learning algorithms. Recently, a lot of research has focused on reducing the variance between $\nabla f_{i_k}(w)$ and $\nabla f(w)$. The first variance reduction method, called SAG (Stochastic Average Gradient), was introduced by Le Roux et al. (2012); some other common and more recent methods are SVRG (Stochastic Variance Reduced Gradient) Johnson and Zhang (2013), Prox-SVRG Xiao and Zhang (2014), S2GD (Semi-Stochastic Gradient Descent) Konečný and Richtárik (2013); Yang et al. (2018), SAGA Defazio et al. (2014), Katyusha Allen-Zhu (2017), VR-SGD (Variance Reduced Stochastic Gradient Descent) Fanhua et al. (2018), and SAAG-I and II Chauhan et al. (2017). Like GD, these methods utilize the full gradient and, like SGD, they calculate gradients over one or a few data points during each iteration. Thus, just like GD, they converge linearly for strongly convex and smooth problems and, like SGD, they have low per-iteration complexity. In this way, variance reduction methods enjoy the best of both GD and SGD. Please refer to Bottou et al. (2016) for a review of optimization methods for solving large-scale machine learning problems. In this paper, we propose new variants of SAAG-I and II, called SAAG-III and IV, as variance reduction techniques.

1.3 Research Contributions
The research contributions of this paper are summarized below:

Novel variants of SAAG-I and II are proposed, called SAAG-III and SAAG-IV, respectively. Unlike SAAG-I, for SAAG-III the starting point of each epoch (except the first one) is set to the average of the iterates of the previous epoch, $w_0^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$, where $m$ is the number of inner iterations. Unlike SAAG-II, for SAAG-IV the starting point and the snap point of each epoch (except the first one) are set to the last iterate and the average of the iterates of the previous epoch, $w_0^{s+1} = w_m^s$ and $\tilde{w}^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$, respectively.

Theoretical results prove linear convergence of SAAG-IV, in expectation, for all four combinations of smoothness (smooth/non-smooth regularizer) and strong convexity (strongly/non-strongly convex problems).

Finally, empirical results prove the efficacy of the proposed methods against state-of-the-art methods in terms of convergence and accuracy, measured against training time, number of epochs and number of gradient evaluations.
2 Notations and Related Work
This section discusses the notations used in the paper and the related work.
2.1 Notations
The training dataset is represented as $\{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the number of data points and $d$ is the number of features. $w \in \mathbb{R}^d$ denotes the parameter vector and $\lambda$ denotes the regularization parameter. $\|\cdot\|$ denotes the Euclidean norm, also called the $\ell_2$-norm, and $\|\cdot\|_1$ denotes the $\ell_1$-norm. $L$ and $\mu$ are used to denote the smoothness and strong-convexity constants of the problem, respectively. $\eta$ denotes the learning rate, $s$ denotes the epoch number and $S$ is the total number of epochs. $b$ denotes the mini-batch size and $m$ denotes the number of inner iterations, s.t. $m = n/b$. The value of the loss function at one data point is denoted by the component function $f_i$, and $F^*$ is the optimal objective function value, sometimes denoted as $F(w^*)$.

2.2 Related Work
Emerging technologies and the availability of different data sources have led to a rapid expansion of data in all science and engineering domains. On one side, this massive data has the potential to uncover more fine-grained patterns and enable timely and accurate decisions; on the other side, it creates a lot of challenges in making sense of it, like slow training and poor scalability of models, because of the inability of traditional technologies to process this huge data. The term "Big Data" was coined to highlight this data explosion and the need for new technologies to process it. Big data is a vast subject in itself; it is mainly characterized using three Vs, namely Volume, Velocity and Variety, but recently several other Vs have been introduced. When one deals with the 'volume' aspect of big data in machine learning, the problem is called a large-scale machine learning or big data problem. Large-scale learning problems have a large number of data points, a large number of features in each data point, or both, which leads to high per-iteration complexity of learning algorithms and ultimately to slow training of machine learning models. Thus, one of the major challenges in machine learning is to develop efficient and scalable learning algorithms Chauhan et al. (2017, 2018a); Zhou et al. (2017).
To solve problem (1) for a smooth regularizer, a simple method is GD, given by Cauchy (1847), which converges linearly for strongly-convex and smooth problems. For a non-smooth regularizer, typically a proximal step is applied after the GD step; the resulting PGD method converges at a rate of $O(1/k)$ for non-strongly convex problems. The per-iteration complexity of GD and PGD is $O(nd)$, which is very large for large-scale learning problems and results in slow training of models. Stochastic approximation is one approach to tackle this challenge. It was first introduced by Robbins and Monro Robbins and Monro (1951) and is very effective for problems with a large number of data points, because each iteration uses only one (or a few) data points, as in SGD Bottou (2010); Zhang (2004). Each iteration of SGD is $n$ times faster than an iteration of GD, as their per-iteration complexities are $O(d)$ and $O(nd)$, respectively. SGD needs decreasing learning rates, e.g., $\eta_k = O(1/k)$ for iteration $k$, because of the variance in the gradients, so it converges more slowly than GD, with a sublinear convergence rate even for strongly convex problems Rakhlin et al. (2012). There are several approaches to deal with this stochastic noise, the most important of which (as discussed in Csiba and Richtárik (2016)) are: (a) mini-batching Yang et al. (2018), (b) decreasing learning rates Shalev-Shwartz et al. (2007), (c) variance reduction Le Roux et al. (2012), and (d) importance sampling Csiba and Richtárik (2016).
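For concreteness, the sketch below shows plain SGD with the decreasing learning rates discussed above, on a least-squares toy instance of our own construction (not the paper's setup):

```python
import numpy as np

def sgd_least_squares(X, y, eta0=1.0, epochs=50, seed=0):
    """Plain SGD on f(w) = (1/2n)||Xw - y||^2 with decreasing steps
    eta_k = eta0 / (1 + k), as required for convergence under
    stochastic noise.  Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    k = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad_i = (X[i].dot(w) - y[i]) * X[i]   # gradient of one component
            w -= eta0 / (1 + k) * grad_i
            k += 1
    return w
```

The decreasing step sizes suppress the gradient noise, but they are also exactly what makes plain SGD slow compared with the variance reduction methods discussed next.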
The first variance reduction technique, called SAG, was introduced by Le Roux et al. (2012); it converges linearly, like GD, for strongly convex and smooth problems, while using one randomly selected data point during each iteration, like SGD. SAG thus enjoys the benefits of both GD and SGD: it converges linearly for the strongly convex and smooth case, like GD, but has the per-iteration complexity of SGD. Later, a lot of variance reduction methods were proposed, like SVRG Johnson and Zhang (2013), SAGA Defazio et al. (2014), S2GD Konečný and Richtárik (2013), SDCA Shalev-Shwartz and Zhang (2013), SPDC Zhang and Xiao (2015), Katyusha Allen-Zhu (2017), Catalyst Lin et al. (2015), SAAG-I, II Chauhan et al. (2017) and VR-SGD Fanhua et al. (2018). These variance reduction methods can use a constant learning rate and can be divided into three categories (as discussed in Fanhua et al. (2018)): (a) primal methods, which are applied to the primal optimization problem, like SAG, SAGA and SVRG, (b) dual methods, which are applied to dual problems, like SDCA, and (c) primal-dual methods, which involve both primal and dual variables, like SPDC.
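To illustrate the variance reduction idea, here is a minimal sketch of SVRG (Johnson and Zhang, 2013) for $\ell_2$-regularized logistic regression; this is the classical SVRG scheme, shown for intuition, and is not the paper's SAAG method:

```python
import numpy as np

def svrg_logistic(X, y, lam, eta, epochs=20, seed=0):
    """SVRG sketch for l2-regularized logistic regression.

    Each epoch stores a snap point and its full gradient; inner
    iterations use the variance-reduced estimator
    v = grad_i(w) - grad_i(w_snap) + full_grad(w_snap)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape

    def grad_i(w, i):                       # gradient of one regularized component
        s = -y[i] / (1.0 + np.exp(y[i] * X[i].dot(w)))
        return s * X[i] + lam * w

    def full_grad(w):                       # full gradient
        s = -y / (1.0 + np.exp(y * X.dot(w)))
        return X.T.dot(s) / n + lam * w

    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()                   # snap point
        mu = full_grad(w_snap)              # full gradient at snap point
        for _ in range(n):                  # inner loop, m = n iterations
            i = rng.integers(n)
            v = grad_i(w, i) - grad_i(w_snap, i) + mu   # variance-reduced gradient
            w -= eta * v
    return w
```

Near the optimum the estimator `v` has vanishing variance, which is why a constant learning rate suffices, in contrast to plain SGD.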
In this paper, we propose novel variants of SAAG-I and II, named SAAG-III and SAAG-IV, respectively. Unlike SAAG-I, for SAAG-III the starting point of each epoch (except the first one) is set to the average of the iterates of the previous epoch, $w_0^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$, where $m$ is the number of inner iterations. Unlike SAAG-II, for SAAG-IV the starting point and the snap point of each epoch (except the first one) are set to the last iterate and the average of the iterates of the previous epoch, $w_0^{s+1} = w_m^s$ and $\tilde{w}^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$, respectively. Chauhan et al. (2017) proposed the Batch Block Optimization Framework (BBOF) to tackle the big data (large-scale learning) challenge in machine learning, along with two variance reduction methods, SAAG-I and II. BBOF combines the best of stochastic approximation (SA) and coordinate descent (CD) (another approach which is very effective for large-scale learning problems, especially problems with high dimensions). Techniques based on the best features of the SA and CD approaches are also used in Wang and Banerjee (2014); Xu and Yin (2015); Zebang Shen (2017); Zhao et al. (2014), and Zebang Shen (2017) calls this setting doubly stochastic, since both data points and coordinates are sampled during each iteration. It is observed that for ERM it is difficult to realize the advantage of BBOF in practice, because plain CD or SGD is faster than the BBOF setting: BBOF needs extra computations while sampling and updating a block of coordinates. When one block of coordinates is updated and we move to another block, the partial gradients need the dot product of the parameter vector $w$ and the data points ($x_i^{\top} w$, as in logistic regression). Since each update changes $w$, each block update needs to recalculate the dot products; on the other hand, if all coordinates are updated at once, as in SGD, the dot product needs to be calculated only once. Although the Gauss-Seidel-style update of parameters helps in faster convergence, the overall gain is small because of the extra computational load.
Moreover, SAAG-I and II were proposed to work in the BBOF (mini-batch and block-coordinate) setting as well as in the mini-batch setting (considering all coordinates). Since BBOF is not very helpful for ERM, SAAG-III and IV are proposed for the mini-batch setting only. SAAGs (I, II, III and IV) can be extended to the stochastic setting (considering one data point during each iteration), but SAAG-I and II are unstable in the stochastic setting, and SAAG-III and IV could not beat the existing methods in that setting. SAAGs have been extended to deal with both smooth and non-smooth regularizers, as we use two different update rules, like Fanhua et al. (2018) (see Section 3 for details).
3 SAAG-I, II and Proximal Extensions
Chauhan et al. (2017) originally proposed SAAG-I and II for smooth problems; we extend SAAG-I and II to non-smooth problems. Unlike proximal methods, which use a single update rule for both smooth and non-smooth problems, we use two different update rules and introduce a proximal step for the non-smooth problem. For a mini-batch $B_k$ of size $b$, epoch $s$ and inner iteration $k$, SAAG-I and II are given below:
SAAG-I:
(5) 
where the gradient terms are as defined in Chauhan et al. (2017).
SAAG-II:
(6) 
Unlike SVRG and VR-SGD, and like SAG, SAAGs are biased gradient estimators, because the expectation of the gradient estimator is not equal to the full gradient, i.e., $\mathbb{E}\left[\widetilde{\nabla} f(w)\right] \neq \nabla f(w)$, as detailed in Lemma 6.
The SAAG-I algorithm, represented by Algorithm 1, divides the dataset into $m$ mini-batches of equal size (say $b$) and takes the number of epochs $S$ as input. During each inner iteration, it randomly selects one mini-batch $B_k$ of $b$ data points, calculates the gradient over the mini-batch, updates the total gradient value and performs stochastic backtracking-Armijo line search (SBAS) over $B_k$. Then the parameters are updated using Option I for a smooth regularizer and using Option II for a non-smooth regularizer. The inner iterations are run $m$ times, where $m = n/b$, and then the last iterate is used as the starting point of the next epoch.
The SAAG-II algorithm, represented by Algorithm 2, takes as input the number of epochs ($S$) and the number of mini-batches ($m$) of equal size (say $b$). It initializes the starting point and the snap point. During each inner iteration, it randomly selects one mini-batch $B_k$, calculates two gradients over $B_k$, at the last iterate and at the snap point, updates the total gradient and performs stochastic backtracking-Armijo line search (SBAS) over $B_k$. Then the parameters are updated using Option I for a smooth regularizer and using Option II for a non-smooth regularizer. After $m$ inner iterations, it uses the last iterate to set the snap point and the starting point for the next epoch.
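The SBAS step used by all four algorithms can be sketched as a mini-batch Armijo backtracking search; the interface, the backtracking factor and the fallback to 0.0 after the maximum number of trials follow the behaviour described in the experiments section, while the remaining details are our assumptions:

```python
import numpy as np

def sbas_step(f_batch, grad_batch, w, eta0=1.0, beta=0.5, c=1e-4, max_iters=10):
    """Stochastic Backtracking-Armijo line Search (SBAS) sketch:
    backtrack on the mini-batch objective until the Armijo condition
    holds; after max_iters trials, return the current step size only
    if it still decreases the objective, otherwise 0.0."""
    g = grad_batch(w)
    fw = f_batch(w)
    eta = eta0
    for _ in range(max_iters):
        if f_batch(w - eta * g) <= fw - c * eta * g.dot(g):  # Armijo condition
            return eta
        eta *= beta                                          # backtrack
    return eta if f_batch(w - eta * g) < fw else 0.0
```

Because the search is performed only on the sampled mini-batch, it costs $O(bd)$ per trial rather than $O(nd)$.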
4 SAAG-III and IV Algorithms
The SAAG-III algorithm, represented by Algorithm 3, divides the dataset into $m$ mini-batches of equal size (say $b$) and takes the number of epochs $S$ as input. During each inner iteration, it randomly selects one mini-batch $B_k$ of $b$ data points, calculates the gradient over the mini-batch, updates the total gradient value and performs stochastic backtracking-Armijo line search (SBAS) over $B_k$. Then the parameters are updated using Option I for a smooth regularizer and using Option II for a non-smooth regularizer. The inner iterations are run $m$ times, where $m = n/b$, and then the average of the iterates is calculated and used as the starting point of the next epoch, $w_0^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$.
The SAAG-IV algorithm, represented by Algorithm 4, takes as input the number of epochs ($S$) and the number of mini-batches ($m$) of equal size (say $b$). It initializes the starting point and the snap point. During each inner iteration, it randomly selects one mini-batch $B_k$, calculates two gradients over $B_k$, at the last iterate and at the snap point, updates the total gradient and performs stochastic backtracking-Armijo line search (SBAS) over $B_k$. Then the parameters are updated using Option I for a smooth regularizer and using Option II for a non-smooth regularizer. After $m$ inner iterations, it calculates the average of the iterates to set the snap point and uses the last iterate as the starting point of the new epoch, i.e., $\tilde{w}^{s+1} = \frac{1}{m}\sum_{k=1}^{m} w_k^s$ and $w_0^{s+1} = w_m^s$, respectively.
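The epoch-boundary bookkeeping that distinguishes SAAG-III and IV from SAAG-I and II can be sketched as follows; only this bookkeeping is shown, since the inner update rules are the paper's own and are not reproduced here:

```python
import numpy as np

def epoch_transition_saag3(iterates):
    """SAAG-III: the next epoch starts from the average of the
    previous epoch's iterates (bookkeeping sketch only)."""
    w_start = np.mean(iterates, axis=0)    # starting point
    return w_start

def epoch_transition_saag4(iterates):
    """SAAG-IV: starting point = last iterate, snap point = average
    of the previous epoch's iterates (bookkeeping sketch only)."""
    w_start = iterates[-1]                 # starting point
    w_snap = np.mean(iterates, axis=0)     # snap point
    return w_start, w_snap
```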
A comparative study of the SAAGs is presented in Figure 1 for a smooth problem ($\ell_2$-regularized logistic regression), which compares accuracy and suboptimality against training time (in seconds), gradient evaluations and epochs. The results are reported on the Adult dataset with mini-batches of 32 data points. It is clear from all six criteria plots that the results for SAAG-III and IV are much more stable than those for SAAG-I and II, respectively, because of the averaging of iterates. SAAG-IV performs better than SAAG-II, and SAAG-III performs comparably to SAAG-I but more stably. Moreover, SAAG-I and SAAG-II stabilize as the mini-batch size increases, but the performance of all methods decreases with the mini-batch size (see the Appendix for the effect of mini-batch size on SAAGs).
5 Analysis
In general, SAAG-IV gives better results for large-scale learning problems than SAAG-III, as shown by the empirical results presented in Figs. 3 and 4 with the news20 and rcv1 datasets and the results in Appendix A.3, 'Effect of mini-batch size'. So, in this section, we provide convergence rates for SAAG-IV, considering all combinations of smoothness and strong convexity. The analysis of SAAG-III is more involved, due to the biased nature of the gradient estimator and the fact that the full gradient is incrementally maintained rather than being calculated at a fixed point as in SAAG-IV; so, the analysis of SAAG-III is left open. The convergence rates for all the different combinations of smoothness and strong convexity are given below:
Theorem 1.
Under the assumptions of Lipschitz continuity with a smooth regularizer, the convergence of SAAG-IV is given below:
(7) 
where the quantities involved, including the constant term, are defined in the proof in Appendix B.
Theorem 2.
Under the assumptions of Lipschitz continuity and strong convexity with a smooth regularizer, the convergence of the SAAG-IV method is given below:
(8) 
where the quantities involved, including the constant term, are defined in the proof in Appendix B.
Theorem 3.
Under the assumptions of Lipschitz continuity with a non-smooth regularizer, the convergence of SAAG-IV is given below:
(9) 
where the quantities involved, including the constant term, are defined in the proof in Appendix B.
Theorem 4.
Under the assumptions of Lipschitz continuity and strong convexity with a non-smooth regularizer, the convergence of SAAG-IV is given below:
(10) 
where the quantities involved, including the constant term, are defined in the proof in Appendix B.
All the proofs are given in Appendix B, and these results prove linear convergence (as per the definition of linear convergence) of SAAG-IV for all four combinations of smoothness and strong-convexity, up to some initial error due to the constant terms in the results. SAAGs are based on intuitions from practice Chauhan et al. (2017): they give more importance to the latest gradient values than to older gradient values, which makes them biased techniques and results in this extra constant term. The constant term signifies that SAAGs converge to a region close to the solution, which is very practical, because machine learning algorithms are used to solve problems approximately and we never find an exact solution, owing to computational difficulty Bottou and Bousquet (2007). Moreover, the constant term arises from the mini-batch gradient value at the optimal point, i.e., $\nabla f_{B}(w^*)$. If the size of the mini-batch increases and eventually becomes equal to the dataset, then this quantity becomes equal to the full gradient and the constant vanishes, i.e., $\nabla f_{B}(w^*) \rightarrow \nabla f(w^*) = 0$ (for the smooth case).
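This behaviour can be checked numerically. In the sketch below (a ridge-regression toy of our own construction), the expected squared norm of the mini-batch gradient at the optimum $w^*$ shrinks as the mini-batch grows, and is exactly zero for the full batch:

```python
import numpy as np

# Toy ridge-regression instance: (1/2n)||Xw - y||^2 + (lam/2)||w||^2.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X.dot(rng.normal(size=d)) + 0.1 * rng.normal(size=n)
lam = 0.1
# closed-form optimum w*
w_star = np.linalg.solve(X.T.dot(X) / n + lam * np.eye(d), X.T.dot(y) / n)

def batch_grad(idx):
    """Mini-batch gradient at w* over the index set idx."""
    B = len(idx)
    return X[idx].T.dot(X[idx].dot(w_star) - y[idx]) / B + lam * w_star

def expected_sq_norm(b, trials=2000):
    """Monte-Carlo estimate of E||grad_B(w*)||^2 for batch size b."""
    return np.mean([np.sum(batch_grad(rng.choice(n, b, replace=False)) ** 2)
                    for _ in range(trials)])

v32, v128 = expected_sq_norm(32), expected_sq_norm(128)
full = np.sum(batch_grad(np.arange(n)) ** 2)   # full batch: exactly zero
```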
Linear convergence, for all combinations of strong convexity and smoothness of the regularizer, is the best rate exhibited by first-order methods without curvature information. SAG, SVRG, SAGA and VR-SGD also exhibit linear convergence for the strongly convex and smooth problem, but, except for VR-SGD, they do not cover all the cases; e.g., SVRG does not cover the non-strongly convex cases. The theoretical results for VR-SGD prove linear convergence for the strongly convex cases, like our results, but VR-SGD provides only sublinear convergence for the non-strongly convex cases, unlike our linear convergence results.
6 Experimental Results
In this section, we present the experimental results; they can be reproduced using the code available at https://drive.google.com/open?id=1Rdp_pmHLQAA9OBxBtHzz6FCduCypAzhd. SAAG-III and IV are compared against the most widely used variance reduction method, SVRG, and one of the latest methods, VR-SGD, which has been proved to outperform existing techniques. The results are reported in terms of suboptimality and accuracy against time, epochs and gradient evaluations. The SAAGs can be applied to strongly and non-strongly convex problems with smooth or non-smooth regularizers, but the results are reported on strongly convex problems, with and without smoothness, because problems can easily be converted to strongly convex problems by adding an $\ell_2$-regularization term.
6.1 Experimental Setup
The experiments are reported using six different criteria, which plot suboptimality (objective minus the best value) versus epochs (where one epoch refers to one pass through the dataset), suboptimality versus gradient evaluations, suboptimality versus time, accuracy versus time, accuracy versus epochs and accuracy versus gradient evaluations. The x-axis and y-axis data are represented on linear and log scales, respectively. The experiments use the following binary datasets: rcv1 (20,242 data points, 47,236 features), news20 (19,996 data points, 1,355,191 features), real-sim (72,309 data points, 20,958 features) and Adult, also called a9a (32,561 data points, 123 features), which are available from the LibSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). All the datasets are divided 80%/20% into training and test datasets, respectively. The regularization parameter is set to the same value for all the algorithms. The parameters of the Stochastic Backtracking-Armijo line Search (SBAS) are kept the same across methods, as is the initial learning rate. The number of inner iterations is set as in Fanhua et al. (2018). Moreover, in SBAS, the algorithms try at most 10 line-search iterations, after which the current value of the learning rate is returned if it reduces the objective value, and 0.0 otherwise; this is done to avoid the algorithm getting stuck because of the stochastic line search. All the experiments were conducted on a MacBook Air (8 GB 1600 MHz DDR3 RAM, 1.6 GHz Intel Core i5 and 256 GB SSD) using MEX files.
6.2 Results with Smooth Problem
The results are reported for the $\ell_2$-regularized logistic regression problem, given below:

(11) $\min_{w} \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i x_i^{\top} w\right)\right) + \frac{\lambda}{2}\|w\|^2.$
Figure 2 presents the comparative study of SAAG-III, IV, SVRG and VR-SGD on the real-sim dataset. As is clear from the first row of the figure, SAAG-III and IV give better accuracy and attain these results faster than the other methods. From the second row of the figure, it is clear that the SAAGs converge faster than SVRG and VR-SGD. Moreover, SAAG-III performs better than SAAG-IV, and VR-SGD performs slightly better than SVRG, as established in Fanhua et al. (2018). Figure 3 reports the results on the news20 dataset; as depicted in the figure, the results are similar to those on the real-sim dataset (Fig. 2). The SAAGs give better accuracy and converge faster than the SVRG and VR-SGD methods, with SAAG-IV giving the best results. This is because, as the mini-batch size or the dataset size increases, SAAG-II and SAAG-IV perform better (as reported in Chauhan et al. (2017)).
6.3 Results with Non-smooth Problem
The results are reported for the elastic-net-regularized logistic regression problem (non-smooth regularizer), given below:

(12) $\min_{w} \frac{1}{n}\sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i x_i^{\top} w\right)\right) + \lambda_1 \|w\|_1 + \frac{\lambda_2}{2}\|w\|^2,$

where $\lambda_1$ and $\lambda_2$ are the regularization coefficients.
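The proximal step required by the non-smooth Option II for this elastic-net regularizer has a standard closed form, namely soft-thresholding followed by shrinkage; the sketch below shows it as an illustration (the function name is ours):

```python
import numpy as np

def prox_elastic_net(z, eta, lam1, lam2):
    """Proximal operator of eta*(lam1*||w||_1 + (lam2/2)*||w||^2):
    soft-thresholding by eta*lam1, then shrinkage by 1/(1 + eta*lam2)."""
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam1, 0.0) / (1.0 + eta * lam2)
```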
Figure 4 presents the comparative study of SAAG-III, IV, SVRG and VR-SGD on the rcv1 dataset. As is clear from the figure, on all six criteria plots, SAAG-III and IV outperform SVRG and VR-SGD, providing better accuracy and faster convergence. SAAG-IV gives the best results in terms of suboptimality, but in terms of accuracy SAAG-III and IV perform closely, except for accuracy versus gradient evaluations, where SAAG-III gives better results because it calculates gradients at the last iterate only, unlike SAAG-IV, which calculates gradients at both the snap point and the last iterate. Figure 5 reports results on the Adult dataset; as is clear from the plots, SAAG-III outperforms all the methods on all six criteria plots. SAAG-IV lags behind because the dataset and/or mini-batch size is small. Some results with SVM are also given in Appendix A.1.
6.4 Effect of Regularization Constant
Figure 6 studies the effect of the regularization coefficient on SAAG-III, IV, SVRG and VR-SGD for the smooth problem ($\ell_2$-regularized logistic regression) using the rcv1 dataset, considering three values of the regularization coefficient. As is clear from the plots, all the methods are affected by the largest regularization coefficient value and have lower accuracy, but for sufficiently small values the methods are not much affected. In the suboptimality plots, all methods converge more slowly for large regularization coefficients, and convergence improves as the regularization decreases, because decreasing the regularization increases the overfitting. The results for the non-smooth problem are similar, so they are given in Appendix A.5.
6.5 Effect of Mini-batch Size
Figure 7 studies the effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for the smooth problem ($\ell_2$-regularized logistic regression) using the rcv1 dataset, considering mini-batches of 32, 64 and 128 data points. As is clear from the plots, except for the results against training time, the performance of SAAG-III, IV and SVRG falls as the mini-batch size increases, while for VR-SGD the performance first improves slightly and then falls slightly. For results against training time, the performance of SAAG-III and IV falls with the mini-batch size for suboptimality and remains almost the same for accuracy, while the performance of VR-SGD and SVRG improves, because they train quickly with large mini-batches. Similar results are obtained for the effect of mini-batch size on the non-smooth problem, so those results are given in Appendix A.3.
7 Conclusion
We have proposed novel variants of SAAG-I and II, called SAAG-III and IV, respectively, which use the average of the iterates of the previous epoch as the starting point for SAAG-III, and the average of the iterates and the last iterate of the previous epoch as the snap point and the starting point, respectively, for SAAG-IV, for every epoch except the first one. The SAAGs (I, II, III and IV) are also extended to solve non-smooth problems, by using two different update rules and introducing a proximal step for the non-smooth case. Theoretical results prove, in expectation, linear convergence of SAAG-IV for all four combinations of smoothness and strong-convexity, up to some initial error. The empirical results prove the efficacy of the proposed methods against existing variance reduction methods in terms of accuracy and suboptimality, measured against training time, epochs and gradient evaluations.
Acknowledgements.
The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.

Appendix A More Experiments
A.1 Results with Support Vector Machine (SVM)
This subsection compares the SAAGs against SVRG and VR-SGD on the SVM problem with the mushroom and gisette datasets. All methods use the stochastic backtracking line search to find the step size. Fig. 8 presents the results, comparing suboptimality against training time (in seconds). The results are similar to the experiments with logistic regression, but are not as smooth. The SAAGs outperform the other methods on the mushroom dataset (first row) and the gisette dataset (second row) for suboptimality against training time and for accuracy against time, but all methods give almost similar results for accuracy versus training time on the mushroom dataset. SAAG-IV outperforms the other methods, and SAAG-III sometimes lags behind the VR-SGD method. It is also observed that the results with logistic regression are better than the results with the SVM problem. The optimization problem for SVM is given below:
(13) $\min_{w} \frac{1}{n}\sum_{i=1}^{n} \max\left(0, 1 - y_i x_i^{\top} w\right) + \frac{C}{2}\|w\|^2,$

where $C$ is the regularization coefficient (also called the penalty parameter), which balances the trade-off between margin size and error Chauhan et al. (2018a).
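Since the hinge loss is non-differentiable at the margin, methods for this problem work with subgradients; the sketch below (our own toy, in the ERM form used throughout the paper) shows the objective and one valid subgradient:

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """(1/n) sum_i max(0, 1 - y_i x_i^T w) + (lam/2)||w||^2; hinge-loss
    SVM in ERM form (the exact constants here are our assumptions)."""
    margins = 1.0 - y * X.dot(w)
    return np.mean(np.maximum(margins, 0.0)) + 0.5 * lam * w.dot(w)

def svm_subgradient(w, X, y, lam):
    """A subgradient of the objective above: -y_i x_i for each point
    that violates the margin, plus the regularizer's gradient."""
    active = (1.0 - y * X.dot(w)) > 0.0            # points inside the margin
    g = -(X[active].T.dot(y[active])) / X.shape[0]
    return g + lam * w
```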
A.2 Comparison of SAAGs (I, II, III and IV) for the Non-smooth Problem
A comparison of the SAAGs for the non-smooth problem is depicted in Figure 9, using the Adult dataset with mini-batches of 32 data points. As is clear from the figure, just like for the smooth problem, the results with SAAG-III and IV are stable and better than or equal to those with SAAG-I and II.
A.3 Effect of Mini-batch Size on SAAG-III, IV, SVRG and VR-SGD for the Non-smooth Problem
The effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem is depicted in Figure 10, using the rcv1 binary dataset with mini-batches of 32, 64 and 128 data points. As for the smooth problem, the proposed methods outperform the SVRG and VR-SGD methods. SAAG-IV gives the best results in terms of time and epochs, but in terms of gradients/n, SAAG-III gives the best results.
A.4 Effect of Mini-batch Size on SAAGs (I, II, III, IV) for the Smooth Problem
The effect of mini-batch size on the SAAGs (I, II, III, IV) for the smooth problem is depicted in Figure 11, using the Adult dataset with mini-batch sizes of 32, 64 and 128 data points. The results are similar to those for the non-smooth problem.
A.5 Effect of Regularization Coefficient for the Non-smooth Problem
Figure 12 depicts the effect of the regularization coefficient on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem using the rcv1 dataset, considering three values of the regularization coefficient. The results are similar to those for the smooth problem. As is clear from the figure, for the largest value all the methods perform poorly, but once the coefficient is sufficiently small it does not make much difference; in all cases our proposed methods outperform SVRG and VR-SGD.
Appendix B Proofs
The following assumptions are considered in the paper:
Assumption 1 (Smoothness).
Suppose the function $f: \mathbb{R}^d \rightarrow \mathbb{R}$ is convex and differentiable, and that its gradient is Lipschitz-continuous with Lipschitz constant $L > 0$; then we have,

(14) $\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|, \quad \forall x, y \in \mathbb{R}^d,$

(15) $f(y) \le f(x) + \nabla f(x)^{\top}(y - x) + \frac{L}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d.$

Assumption 2 (Strong Convexity).
Suppose the function $f$ is $\mu$-strongly convex for $\mu > 0$ and $F^*$ is the optimal value of $F$; then we have,

(16) $f(y) \ge f(x) + \nabla f(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d,$

(17) $F(x) - F^* \ge \frac{\mu}{2}\|x - w^*\|^2, \quad \forall x \in \mathbb{R}^d.$
Assumption 3 (Assumption 3 in Fanhua et al. (2018)).
For all , the following inequality holds
(18) 
where is a constant.
We derive our proofs taking motivation from Fanhua et al. (2018) and Xiao and Zhang (2014). Before providing the proofs, we state certain lemmas, as given below:
Lemma 1 (3-Point Property Lan (2012)).
Let $\hat{z}$ be the optimal solution of the following problem: $\min_{z} \frac{1}{2\eta}\|z - z_0\|^2 + \psi(z)$, where $\psi(z)$ is a convex function (but possibly non-differentiable). Then, for any $z$, the following inequality holds,

(19) $\psi(z) + \frac{1}{2\eta}\|z - z_0\|^2 \ge \psi(\hat{z}) + \frac{1}{2\eta}\|\hat{z} - z_0\|^2 + \frac{1}{2\eta}\|z - \hat{z}\|^2.$
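The 3-point property can be verified numerically; in the sketch below (our own construction) we take $\psi = \|\cdot\|_1$, for which the minimizer $\hat{z}$ is given by soft-thresholding, and check that the inequality gap is non-negative at random points:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

eta = 0.7
z0 = rng.normal(size=5)
# z_hat = argmin_z ||z||_1 + (1/(2*eta))||z - z0||^2
z_hat = soft_threshold(z0, eta)

def three_point_gap(z):
    """LHS minus RHS of the 3-point inequality; should be >= 0."""
    psi = lambda u: np.abs(u).sum()
    q = lambda u, v: np.sum((u - v) ** 2) / (2 * eta)
    return (psi(z) + q(z, z0)) - (psi(z_hat) + q(z_hat, z0) + q(z, z_hat))

gaps = [three_point_gap(rng.normal(size=5)) for _ in range(100)]
```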
Lemma 2 (Theorem 4 in Konečný et al. (2016)).
For non-smooth problems, the mini-batch gradient estimator is unbiased, and its variance satisfies the following inequality,

(20) 

where the expectation is taken with respect to the sampled mini-batch.
Following Lemma 2 for non-smooth problems, one can easily prove the following result for smooth problems.

Lemma 3.
For smooth problems, the mini-batch gradient estimator is unbiased, and its variance satisfies the following inequality,

(21) 

where the expectation is taken with respect to the sampled mini-batch.
Lemma 4 (Extension of Lemma 3.4 in Xiao and Zhang (2014) to mini-batches).
Under Assumption 1, for a smooth regularizer, we have

(22) $\mathbb{E}\left[\left\|\nabla f_{B}(w) - \nabla f_{B}(w^*)\right\|^2\right] \le 2L\left[F(w) - F(w^*)\right],$

where $\nabla f_{B}(w) = \frac{1}{b}\sum_{i \in B} \nabla f_i(w)$ for a mini-batch $B$ of size $b$.

Proof.
Given any mini-batch $B$, consider the function
$\phi_{B}(w) = f_{B}(w) - f_{B}(w^*) - \nabla f_{B}(w^*)^{\top}(w - w^*).$
It is straightforward to check that $\nabla \phi_{B}(w^*) = 0$, and hence, by convexity, $\min_{w} \phi_{B}(w) = \phi_{B}(w^*) = 0$. Since $\nabla \phi_{B}$ is Lipschitz-continuous with constant $L$, we have,
$\frac{1}{2L}\left\|\nabla \phi_{B}(w)\right\|^2 \le \phi_{B}(w) - \min_{u} \phi_{B}(u) = f_{B}(w) - f_{B}(w^*) - \nabla f_{B}(w^*)^{\top}(w - w^*).$
Taking expectation with respect to the mini-batch $B$, we have

(23) $\frac{1}{2L}\,\mathbb{E}\left[\left\|\nabla f_{B}(w) - \nabla f_{B}(w^*)\right\|^2\right] \le f(w) - f(w^*) - \nabla f(w^*)^{\top}(w - w^*).$

By the optimality of $w^*$ with a smooth regularizer, $\nabla F(w^*) = 0$, i.e., $\nabla f(w^*) = -\nabla g(w^*)$, and by the convexity of $g$, $-\nabla f(w^*)^{\top}(w - w^*) = \nabla g(w^*)^{\top}(w - w^*) \le g(w) - g(w^*)$. Substituting this into (23) gives (22). This proves the required lemma. ∎
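A bound of this type can be sanity-checked numerically; the sketch below checks the $b = 1$ case on a ridge-regression toy of our own construction, with $L$ taken as the largest component smoothness constant $\max_i \|x_i\|^2$:

```python
import numpy as np

# Toy instance: f_i(w) = 0.5*(x_i^T w - y_i)^2, g(w) = (lam/2)||w||^2.
rng = np.random.default_rng(0)
n, d, lam = 100, 4, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w_star = np.linalg.solve(X.T.dot(X) / n + lam * np.eye(d), X.T.dot(y) / n)
L = np.max(np.sum(X ** 2, axis=1))   # largest component smoothness constant

def F(w):
    return 0.5 * np.mean((X.dot(w) - y) ** 2) + 0.5 * lam * w.dot(w)

def lemma_gap(w):
    """2L[F(w) - F(w*)] minus (1/n) sum_i ||grad f_i(w) - grad f_i(w*)||^2;
    should be >= 0 if the bound holds."""
    diffs = (X.dot(w) - X.dot(w_star))[:, None] * X   # rows: grad f_i(w) - grad f_i(w*)
    lhs = np.mean(np.sum(diffs ** 2, axis=1))
    return 2 * L * (F(w) - F(w_star)) - lhs

gaps = [lemma_gap(rng.normal(size=d)) for _ in range(50)]
```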
Lemma 5 (Extension of Lemma 3.4 in Xiao and Zhang (2014) to mini-batches).
Under Assumption 1, for a non-smooth regularizer, we have

(24) $\mathbb{E}\left[\left\|\nabla f_{B}(w) - \nabla f_{B}(w^*)\right\|^2\right] \le 2L\left[F(w) - F(w^*)\right].$