The problem of regularized loss minimization (e.g. Hastie et al. (2009)) is often described as
$$\min_{\boldsymbol{w}\in\mathbb{R}^d} P(\boldsymbol{w}), \qquad P(\boldsymbol{w}) := \frac{\lambda}{2}\|\boldsymbol{w}\|^2 + \frac{1}{n}\sum_{i=1}^{n}\phi_i(\langle \boldsymbol{w},\boldsymbol{x}_i\rangle), \qquad (1)$$
aiming to obtain a linear predictor $\langle \boldsymbol{w},\boldsymbol{x}\rangle$ for an unknown input $\boldsymbol{x}\in\mathbb{R}^d$. Therein, the second term is a loss function which is the sum of convex losses $\phi_i$ for the $n$ examples $\boldsymbol{x}_1,\dots,\boldsymbol{x}_n\in\mathbb{R}^d$.
In this study, we pose sign constraints (Lawson and Hanson, 1995) on the entries of the model parameter $\boldsymbol{w}$ in the unconstrained minimization problem (1). We divide the index set of entries into three exclusive subsets, $\mathcal{I}_{+}$, $\mathcal{I}_{-}$, and $\mathcal{I}_{0}$, as $\{1,\dots,d\} = \mathcal{I}_{+}\cup\mathcal{I}_{-}\cup\mathcal{I}_{0}$, and impose on the entries in $\mathcal{I}_{+}$ and $\mathcal{I}_{-}$,
$$\forall j\in\mathcal{I}_{+}:\ w_j \ge 0, \qquad \forall j\in\mathcal{I}_{-}:\ w_j \le 0. \qquad (2)$$
Sign constraints can introduce prior knowledge directly into learning machines. For example, consider a binary classification task. If the $j$-th explanatory variable is positively correlated with the binary class label, then a positive weight coefficient $w_j$ is expected to achieve better generalization performance than a negative coefficient, because without sign constraints the corresponding entry of the optimal solution might turn out negative due to the small-sample problem. Conversely, if the variable is negatively correlated with the class label, a negative weight coefficient would yield better prediction. If sign constraints are explicitly imposed, such inadequate signs of coefficients can be avoided.
The strategy of sign constraints for generic learning problems has rarely been discussed so far, although there are extensive reports on non-negative least squares regression supported by many successful applications, including sound source localization (Lin et al., 2004), tomographic imaging (Ma, 2013), spectral analysis (Zhang et al., 2007), hyperspectral image super-resolution (Dong et al., 2016), microbial community pattern detection (Cai et al., 2017), face recognition (Ji et al., 2009; He et al., 2013), and non-negative image restoration (Henrot et al., 2013; Landi and Piccolomini, 2012; Wang and Ma, 2007; Shashua and Hazan, 2005). In most of these, non-negative least squares regression is used as an important ingredient of larger methods such as non-negative matrix factorization (Lee and Seung, 2001; Wang et al., 2017; Kimura et al., 2016; Févotte and Idier, 2011; Ding et al., 2006).
Several efficient algorithms for non-negative least squares regression have been developed. The active set method by Lawson and Hanson (1995) has been widely used for many years, and several works (Kim et al., 2010, 2007; Bierlaire et al., 1991; Portugal et al., 1994; Moré and Toraldo, 1991; Lin and Moré, 1999; Morigi et al., 2007) have accelerated optimization by combining the active set method with the projected gradient approach. Interior point methods (Bellavia et al., 2006; Heinkenschloss et al., 1999; Kanzow and Klug, 2006) have been proposed as alternative algorithms for non-negative least squares regression. However, none of them can be applied to generic regularized loss minimization problems.
In this paper, we present two algorithms for the sign-constrained regularized loss minimization problem with generic loss functions. A surge of algorithms for unconstrained regularized empirical loss minimization has been developed, such as SAG (Roux et al., 2012; Schmidt et al., 2016), SVRG (Johnson and Zhang, 2013), Prox-SVRG (Xiao and Zhang, 2014), SAGA (Defazio et al., 2014a), Kaczmarz (Needell et al., 2015), EMGD (Zhang et al., 2013), and Finito (Defazio et al., 2014b). This study focuses on two popular algorithms, Pegasos (Shalev-Shwartz et al., 2011) and SDCA (Shalev-Shwartz and Zhang, 2013). A prominent characteristic of these two algorithms is that no step size needs to be chosen by hand. Some other optimization algorithms guarantee convergence to the optimum only under the assumption of a small step size, which is often too small to be used in practice. In contrast, the theory of Pegasos has been developed with a step size large enough to be adopted in practice, and SDCA needs no step size at all. The two new algorithms developed in this study for the sign-constrained problems are simple modifications of Pegasos and SDCA.
The contributions of this study are summarized as follows.
Sign constraints are introduced to generic regularized loss minimization problems.
Two optimization algorithms for sign-constrained regularized loss minimization, called SC-Pega and SC-SDCA, are developed by simply inserting the sign correction step, introduced in Section 3, into the original Pegasos and SDCA.
Our theoretical analysis ensures that both SC-Pega and SC-SDCA do not degrade the convergence rates of the original algorithms.
Two attractive applications, where sign-constrained learning is effective, are presented. One is the exploitation of prior information about correlations between explanatory variables and a target variable. The other is the introduction of sign constraints into the SVM-Pairwise method (Liao and Noble, 2003).
Experimental results demonstrate significant improvements in generalization performance from introducing sign constraints in both applications.
2 Problem Setting
The feasible region can be expressed simply as
$$S := \{\boldsymbol{w}\in\mathbb{R}^d \mid \boldsymbol{\sigma}\odot\boldsymbol{w} \ge \boldsymbol{0}\}, \qquad (3)$$
where $\boldsymbol{\sigma}\in\{-1,0,+1\}^d$ and each entry $\sigma_j$ is given by
$$\sigma_j := \begin{cases} +1 & \text{for } j\in\mathcal{I}_{+},\\ -1 & \text{for } j\in\mathcal{I}_{-},\\ 0 & \text{for } j\in\mathcal{I}_{0}. \end{cases}$$
Using $S$, the optimization problem discussed in this paper can be expressed as
$$\min_{\boldsymbol{w}\in S} P(\boldsymbol{w}). \qquad (4)$$
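As a purely illustrative sketch of the sign-constrained problem above (not the authors' code; all function and variable names are ours), the objective and the feasibility test can be written in a few lines of Python:

```python
def objective(w, X, losses, lam):
    """P(w) = (lam/2)*||w||^2 + (1/n) * sum_i loss_i(<w, x_i>)."""
    n = len(X)
    reg = 0.5 * lam * sum(wj * wj for wj in w)
    emp = sum(loss(sum(wj * xj for wj, xj in zip(w, x)))
              for loss, x in zip(losses, X)) / n
    return reg + emp

def feasible(w, sigma, tol=1e-12):
    """Sign constraints: sigma_j = +1 forces w_j >= 0, sigma_j = -1 forces
    w_j <= 0, and sigma_j = 0 leaves w_j unconstrained."""
    return all(s * wj >= -tol for s, wj in zip(sigma, w) if s != 0)
```

For instance, with `sigma = [1, -1, 0]`, the vector `[0.5, -0.2, -3.0]` is feasible while `[-0.5, 0.0, 0.0]` is not.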
Throughout this paper, the following assumptions are used:
(a) each $\phi_i$ is a convex function; (b) $\forall i$, $\|\boldsymbol{x}_i\| \le R$; (c) $\forall i, \forall u$, $\phi_i(u) \ge 0$; (d) $\forall i$, $\phi_i(0) \le 1$.
Table 1: Examples of loss functions.

|Classical hinge loss|1-Lipschitz|
|Smoothed hinge loss|(1/γ)-smooth|
|Square error loss|2-smooth|
|Absolute error loss|1-Lipschitz|
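For reference, the losses in Table 1 can be implemented directly. The definitions below are the standard forms (with the label folded into the margin argument $u$ for the hinge variants) and are our assumption, since the table's formulas were lost in extraction:

```python
# Standard forms of the losses in Table 1, written as functions of the
# margin (hinge variants) or of prediction u and target y (regression losses).

def hinge(u):                      # 1-Lipschitz
    return max(0.0, 1.0 - u)

def smoothed_hinge(u, gamma=1.0):  # (1/gamma)-smooth
    if u >= 1.0:
        return 0.0
    if u <= 1.0 - gamma:
        return 1.0 - u - gamma / 2.0
    return (1.0 - u) ** 2 / (2.0 * gamma)

def square_error(u, y):            # 2-smooth
    return (u - y) ** 2

def absolute_error(u, y):          # 1-Lipschitz
    return abs(u - y)
```

The smoothed hinge interpolates quadratically between the flat region and the linear region of the classical hinge, which is what makes it (1/γ)-smooth.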
Most widely used loss functions satisfy the above assumptions. Several examples of such loss functions are described in Table 1. If the hinge loss is chosen, the learning machine is a well-known instance called the support vector machine; if the square error loss is chosen, the learning machine is called ridge regression. We denote the optimal solution to the constrained problem by $\boldsymbol{w}^{\star} := \arg\min_{\boldsymbol{w}\in S} P(\boldsymbol{w})$. We assume two types of loss functions: $L$-Lipschitz continuous functions and $\gamma$-smooth functions. A function $\phi$ is said to be an $L$-Lipschitz continuous function if
$$\forall u, u'\in\mathbb{R}: \quad |\phi(u) - \phi(u')| \le L|u - u'|.$$
Such functions are often referred to simply as $L$-Lipschitz in this paper. A function $\phi$ is $\gamma$-smooth if its derivative is $\gamma$-Lipschitz. For an index subset $\mathcal{I}$ and a vector $\boldsymbol{v}$, let $\boldsymbol{v}_{\mathcal{I}}$ be the subvector of $\boldsymbol{v}$ containing the entries corresponding to $\mathcal{I}$. Let $X_{\mathcal{I}}$ be the sub-matrix of $X := [\boldsymbol{x}_1,\dots,\boldsymbol{x}_n]$ containing the columns corresponding to $\mathcal{I}$.
3 Sign-Constrained Pegasos
In the original Pegasos algorithm (Shalev-Shwartz et al., 2011), $\phi_i$ is assumed to be the classical hinge loss function (see Table 1 for the definition). Each iteration consists of three steps: the mini-batch selection step, the gradient step, and the projection-onto-ball step. The mini-batch selection step chooses a subset $\mathcal{I}_t$ of $k$ examples from the $n$ examples at random. The gradient step computes the gradient of
$$P_t(\boldsymbol{w}) := \frac{\lambda}{2}\|\boldsymbol{w}\|^2 + \frac{1}{k}\sum_{i\in\mathcal{I}_t}\phi_i(\langle\boldsymbol{w},\boldsymbol{x}_i\rangle),$$
which approximates the objective function $P(\boldsymbol{w})$. The current solution is moved in the direction opposite to the gradient as
$$\boldsymbol{w}_{t+1/2} := \boldsymbol{w}_t - \eta_t\nabla P_t(\boldsymbol{w}_t), \qquad \text{with } \eta_t := \frac{1}{\lambda t}.$$
At the projection-onto-ball step, the norm of the solution is shortened to $r$ if the norm exceeds $r$:
$$\boldsymbol{w}_{t+1} := \min\!\left(1, \frac{r}{\|\boldsymbol{w}_{t+1/2}\|}\right)\boldsymbol{w}_{t+1/2}.$$
The projection-onto-ball step plays an important role in obtaining a smaller upper bound on the norm of the gradient of the regularization term in the objective, which eventually reduces the number of iterations needed to attain an $\epsilon$-approximate solution (i.e. $\mathbb{E}[P(\boldsymbol{w})] - P(\boldsymbol{w}^{\star}) \le \epsilon$).
In the algorithm developed in this study, we simply insert between those two steps a new step that corrects the sign of each entry in the current solution as
$$\forall j\in\mathcal{I}_{+}:\ w_j \leftarrow \max(0, w_j), \qquad \forall j\in\mathcal{I}_{-}:\ w_j \leftarrow \min(0, w_j),$$
leaving the entries in $\mathcal{I}_{0}$ unchanged. This can be rewritten equivalently, entry-wise for $j\in\mathcal{I}_{+}\cup\mathcal{I}_{-}$, as $w_j \leftarrow \sigma_j[\sigma_j w_j]_{+}$, where the operator $[\cdot]_{+}$ is defined as $[x]_{+} := \max(x, 0)$. The sign correction is exactly the Euclidean projection onto the feasible region $S$.
The algorithm is summarized as Algorithm 1. Here, the loss function is not limited to the classical hinge loss. In the projection-onto-ball step, the solution is projected onto an $r$-ball whose radius is chosen to contain $\boldsymbol{w}^{\star}$, instead of the $(1/\sqrt{\lambda})$-ball, to handle more general settings. Recall that $\|\boldsymbol{w}^{\star}\| \le 1/\sqrt{\lambda}$ if $\phi_i$ is the hinge loss employed in the original Pegasos. It can be shown that the objective gap is bounded as follows.
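For concreteness, one pass of SC-Pega for the hinge loss can be sketched in Python. This is our illustrative reading of Algorithm 1, not the authors' implementation; the toy setting, names, and the step size $\eta_t = 1/(\lambda t)$ follow the description above, and everything else is an assumption:

```python
import math
import random

def sc_pega(X, y, sigma, lam, T, k, r):
    """Sketch of SC-Pega for the hinge loss: mini-batch subgradient step,
    sign correction, then projection onto the r-ball."""
    d = len(X[0])
    w = [0.0] * d
    for t in range(1, T + 1):
        batch = random.sample(range(len(X)), k)
        eta = 1.0 / (lam * t)                      # Pegasos step size
        # Subgradient of (lam/2)||w||^2 + (1/k) * sum hinge(y_i <w, x_i>).
        g = [lam * wj for wj in w]
        for i in batch:
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            if margin < 1.0:
                for j in range(d):
                    g[j] -= y[i] * X[i][j] / k
        w = [wj - eta * gj for wj, gj in zip(w, g)]
        # Sign correction: clip entries violating their sign constraint.
        w = [max(0.0, wj) if s > 0 else (min(0.0, wj) if s < 0 else wj)
             for wj, s in zip(w, sigma)]
        # Projection onto the ball of radius r.
        norm = math.sqrt(sum(wj * wj for wj in w))
        if norm > r:
            w = [wj * r / norm for wj in w]
    return w
```

On a toy two-example dataset with `sigma = [1, 0]`, the first learned coefficient is non-negative by construction, and the final iterate always lies inside the $r$-ball.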
Theorem 1. Consider Algorithm 1. If $\phi_1,\dots,\phi_n$ are $L$-Lipschitz continuous, the expected objective gap satisfies $\mathbb{E}[P(\bar{\boldsymbol{w}}_T)] - P(\boldsymbol{w}^{\star}) = O\!\left(\frac{1+\log T}{\lambda T}\right)$, matching the rate of the original Pegasos.
4 Sign-Constrained SDCA
The original SDCA is a framework for the unconstrained problem (1). In SDCA, a dual problem is solved instead of the primal problem. Namely, the dual objective is maximized in an iterative fashion with respect to the dual variables $\boldsymbol{\alpha} := [\alpha_1,\dots,\alpha_n]^\top$. The problem dual to the unconstrained problem (1) is given by
$$\max_{\boldsymbol{\alpha}\in\mathbb{R}^n} D(\boldsymbol{\alpha}), \qquad D(\boldsymbol{\alpha}) := \frac{1}{n}\sum_{i=1}^{n}-\phi_i^{*}(-\alpha_i) - \frac{\lambda}{2}\left\|\frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i\boldsymbol{x}_i\right\|^2,$$
where $\phi_i^{*}$ denotes the convex conjugate of $\phi_i$.
To find the maximizer of $D(\boldsymbol{\alpha})$, a single example $i$ is chosen at random at each iteration $t$, and the single dual variable $\alpha_i$ is optimized with the other variables frozen. If we denote by $\boldsymbol{\alpha}^{(t-1)}$ the value of the dual vector at the previous iteration, the dual vector is updated as $\boldsymbol{\alpha}^{(t)} := \boldsymbol{\alpha}^{(t-1)} + \Delta\alpha_i\boldsymbol{e}_i$, where $\Delta\alpha_i$ is determined so that the dual objective is maximized with respect to the $i$-th coordinate alone, and the primal variable associated with $\boldsymbol{\alpha}$ is $\boldsymbol{w}(\boldsymbol{\alpha}) := \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i\boldsymbol{x}_i$.
In the case of the hinge loss, the maximizer of the one-variable subproblem can be found in closed form. The primal variable can also be maintained efficiently via $\boldsymbol{w}^{(t)} := \boldsymbol{w}^{(t-1)} + \frac{\Delta\alpha_i}{\lambda n}\boldsymbol{x}_i$.
Now let us move on to the sign-constrained problem. In addition to Algorithm 1, which is derived from Pegasos, we present another algorithm based on SDCA for minimizing $P(\boldsymbol{w})$ subject to the sign constraints. Like Algorithm 1, which was designed by inserting the sign correction step into the original Pegasos iteration, the new algorithm is developed by simply adding the sign correction step to each SDCA iteration. The resultant algorithm is described in Algorithm 2.
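One possible reading of Algorithm 2 for the hinge loss can be sketched as follows. The closed-form dual update is the standard SDCA one; the combination with the sign correction on the primal variable is our illustrative interpretation, not the authors' exact code:

```python
import random

def sc_sdca_hinge(X, y, sigma, lam, T, seed=0):
    """Sketch of SC-SDCA for the hinge loss: standard SDCA closed-form dual
    update, primal maintenance, then the sign correction step."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    alpha = [0.0] * n
    w = [0.0] * d
    for _ in range(T):
        i = rng.randrange(n)
        xi, yi = X[i], y[i]
        sq = sum(xj * xj for xj in xi)
        if sq == 0.0:
            continue
        margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
        # Closed-form maximizer of the SDCA subproblem for the hinge loss.
        delta = yi * max(0.0, min(1.0, (1.0 - margin) * lam * n / sq
                                  + alpha[i] * yi)) - alpha[i]
        alpha[i] += delta
        w = [wj + delta * xj / (lam * n) for wj, xj in zip(w, xi)]
        # Sign correction step.
        w = [max(0.0, wj) if s > 0 else (min(0.0, wj) if s < 0 else wj)
             for wj, s in zip(w, sigma)]
    return w
```

Note that the sign correction deliberately breaks the exact relation $\boldsymbol{w} = \boldsymbol{w}(\boldsymbol{\alpha})$ maintained by plain SDCA; the point of the analysis in this section is that the convergence guarantee survives this correction.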
For some loss functions, the maximization at step 5 in Algorithm 2 cannot be carried out in closed form. Alternatively, step 4 can be replaced with an approximate update.
We have established the following theorem, which states the number of iterations required to guarantee an expected primal objective gap below a threshold under the sign constraints.
5 Multiclass Classification
In this section, we extend our algorithms to the multiclass classification setting with $m$ classes. Here, the model parameter is a matrix $W := [\boldsymbol{w}_1,\dots,\boldsymbol{w}_m]\in\mathbb{R}^{d\times m}$ instead of a vector. The loss function for each example takes an $m$-dimensional score vector as its argument. Prediction is done by taking the class with the maximal score among $\langle\boldsymbol{w}_1,\boldsymbol{x}\rangle,\dots,\langle\boldsymbol{w}_m,\boldsymbol{x}\rangle$. Without loss of generality, the set of class labels is given by $\{1,\dots,m\}$. Several loss functions are used for multiclass classification, as follows.
Therein, $y_i$ is the true class label of the $i$-th example.
Top-$k$ hinge loss (Lapin et al., 2015):
Therein, $s_{[k]}$ denotes the $k$-th largest value in a vector $\boldsymbol{s}$.
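The paper's exact formula for this loss was lost in extraction; one common form of the top-$k$ hinge loss of Lapin et al. (2015), stated here as an assumption, averages the $k$ largest margin violations:

```python
def top_k_hinge(scores, y, k):
    """One common form of the top-k hinge loss (our assumption): the average
    of the k largest entries of c_j = 1{j != y} + s_j - s_y, clipped at 0."""
    c = [(0.0 if j == y else 1.0) + s - scores[y]
         for j, s in enumerate(scores)]
    top = sorted(c, reverse=True)[:k]
    return max(0.0, sum(top) / k)
```

With $k = 1$ this reduces to the usual multiclass hinge loss of Crammer and Singer, which penalizes only the strongest competing class.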
The objective function for learning is defined as
$$P(W) := \frac{\lambda}{2}\|W\|_F^2 + \frac{1}{n}\sum_{i=1}^{n}\phi_i(W^\top\boldsymbol{x}_i).$$
The learning problem discussed here is minimization of $P(W)$ with respect to $W$ subject to the sign constraints
$$\forall (j,c)\in\mathcal{I}_{+}:\ W_{j,c} \ge 0, \qquad \forall (j,c)\in\mathcal{I}_{-}:\ W_{j,c} \le 0,$$
with two exclusive sets $\mathcal{I}_{+}$ and $\mathcal{I}_{-}$ such that $\mathcal{I}_{+}\cap\mathcal{I}_{-} = \emptyset$. As in the single-vector case, the feasible region can be expressed as
$$S := \{W\in\mathbb{R}^{d\times m} \mid \Sigma\odot W \ge O\},$$
where $\Sigma\in\{-1,0,+1\}^{d\times m}$. The goal here is to develop algorithms that find
$$W^{\star} := \arg\min_{W\in S} P(W).$$
In the Pegasos-based extension, $X_{\mathcal{I}_t}$ denotes the horizontal concatenation of the columns of $X$ selected by a mini-batch $\mathcal{I}_t$. We use the following assumptions, analogous to those in Section 2: each $\phi_i$ is a convex function; $\forall i$, $\|\boldsymbol{x}_i\| \le R$; $\forall i$ and every score vector $\boldsymbol{s}$, $\phi_i(\boldsymbol{s}) \ge 0$; $\forall i$, $\phi_i(\boldsymbol{0}) \le 1$.
The SDCA-based learning algorithm can also be developed for the multiclass classification task. In this algorithm, the dual variables are represented as a matrix $A := [\boldsymbol{\alpha}_1,\dots,\boldsymbol{\alpha}_n]$. At each iteration $t$, one of the $n$ columns, $\boldsymbol{\alpha}_i$, is chosen at random, instead of a single dual variable, to update the matrix as $A^{(t)} := A^{(t-1)} + \Delta\boldsymbol{\alpha}_i\boldsymbol{e}_i^\top$, where we have used the iteration number as the superscript of $A$. To determine the value of $\Delta\boldsymbol{\alpha}_i$, the following auxiliary function is introduced:
6 Experiments
In this section, experimental results are reported in order to illustrate the effects of the sign constraints on classification and to demonstrate the convergence behavior.
6.1 Prediction Performance
The pattern recognition performance of sign-constrained learning was examined on two tasks: Escherichia coli (E. coli) prediction and protein function prediction.
E. coli Prediction
The first task is to predict E. coli counts in river water. The E. coli count has been used as an indicator of fecal contamination in the water environment in many parts of the world (Scott et al., 2002). In this experiment, the data points with E. coli counts over 500 most probable number (MPN)/100 mL are assigned to the positive class, and the others are negative. Hydrological and water quality monitoring data are used for predicting whether the E. coli count is positive or negative.
For ensuring microbial safety in water usage, it is meaningful to predict E. coli counts on a real-time basis. The concentration of E. coli in water, which is measured by culture-dependent methods (Kobayashi et al., 2013), has been used to monitor fecal contamination in the water environment, and has proven effective in preventing waterborne infectious diseases in varied styles of water usage. On the other hand, real-time monitoring of E. coli counts has not yet been achieved. It takes at least ten hours to obtain E. coli counts by culture-dependent methods, and at least several hours are needed to measure the concentration of E. coli by culture-independent methods (Ishii et al., 2014b,a), such as polymerase chain reaction. Since some of the hydrological and water quality data can be measured with real-time sensors, real-time prediction of E. coli counts will become possible if the hydrological and water quality data can be exploited for E. coli count prediction.
Many training examples are required to obtain better generalization performance. A serious issue, however, is that measuring the concentration of E. coli is time-consuming and the reagents are expensive. We here demonstrate that this issue can be mitigated by exploiting the domain knowledge accumulated in the field of water engineering.
The hydrological and water quality data contain nine explanatory variables: WT, pH, EC, SS, DO, BOD, TN, TP, and flow rate. The explanatory variable pH is divided into two variables. It is well known in the field of water engineering that E. coli increases as WT, EC, SS, BOD, TN, and TP become larger, and as the two pH-derived variables, DO, and the flow rate become smaller. From this fact, we restrict the signs of the entries in the predictor parameter as follows.
Coefficients of six explanatory variables, WT, EC, SS, BOD, TN, and TP must be non-negative.
Coefficients of four explanatory variables (the two pH-derived variables, DO, and flow rate) must be non-positive.
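The sign-constraint vector implied by the domain knowledge above can be built mechanically; the feature ordering and the names of the two pH-derived variables below are hypothetical, since the original names were lost:

```python
# Building sigma from domain knowledge (illustrative; names are assumptions).
NON_NEGATIVE = {"WT", "EC", "SS", "BOD", "TN", "TP"}
NON_POSITIVE = {"pH_low", "pH_high", "DO", "flow_rate"}  # names hypothetical

def sign_vector(feature_names):
    """sigma_j = +1 (coefficient forced >= 0), -1 (forced <= 0), 0 (free)."""
    return [1 if f in NON_NEGATIVE else (-1 if f in NON_POSITIVE else 0)
            for f in feature_names]
```

Any feature outside both sets is left unconstrained, so the same helper works when additional sensors are added.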
We measured the concentration of E. coli 177 times from December 5th, 2011 to April 17th, 2013, obtaining 177 data points comprising 88 positives and 89 negatives. We chose ten examples out of the 177 data points at random for training, and the remaining 167 examples were used for testing. The prediction performance was evaluated by the precision-recall break-even point (PRBEP) (Joachims, 2005) and the ROC score. We compared the classical SVM with the sign-constrained SVM (SC-SVM) to examine the effects of sign constraints. We repeated this procedure 10,000 times and obtained 10,000 PRBEP values and 10,000 ROC scores.
[Figure 1: (a) PRBEP, (b) ROC score]
SC-SVM achieved significant improvements compared to the classical SVM: PRBEP and ROC scores of 0.808 and 0.863 on average over the 10,000 trials, whereas those of the classical SVM were 0.757 and 0.810, respectively. The difference from the classical SVM on each trial is plotted in the histograms of Figure 1. Positive improvements in ROC score were obtained in 8,932 out of 10,000 trials, whereas the ROC score decreased in only 796 trials. For PRBEP, positive improvements were obtained in 7,349 trials, whereas deteriorations were observed in only 1,069 trials.
[Figure: (a) Covtype, (b) W8a, (c) Phishing]
Protein Function Prediction
Table 2: ROC scores (standard deviations in parentheses) for the 12 protein function prediction tasks.

|Task|SC-SVM|SVM|
|1|0.751 (0.011)|0.730 (0.010)|
|2|0.740 (0.016)|0.680 (0.015)|
|3|0.753 (0.011)|0.721 (0.011)|
|4|0.762 (0.010)|0.734 (0.010)|
|5|0.769 (0.012)|0.691 (0.013)|
|6|0.690 (0.014)|0.614 (0.014)|
|7|0.713 (0.024)|0.618 (0.022)|
|8|0.725 (0.019)|0.667 (0.019)|
|9|0.655 (0.024)|0.578 (0.023)|
|10|0.743 (0.016)|0.710 (0.014)|
|11|0.535 (0.019)|0.492 (0.018)|
|12|0.912 (0.011)|0.901 (0.011)|
In the field of molecular biology, understanding the functions of proteins is a key step toward elucidating cellular mechanisms. Sequence similarities have been a major means of predicting the function of an unannotated protein. At the beginning of this century, prediction accuracy was improved by combining sequence similarities with discriminative learning. The method, named SVM-Pairwise (Liao and Noble, 2003), uses a feature vector that contains pairwise similarities to annotated protein sequences. Several other studies (Liu et al., 2014; Ogul and Mumcuoglu, 2006; Lanckriet et al., 2004b,a) have also provided empirical evidence that the SVM-Pairwise approach is a powerful framework. Basically, if $n$ proteins are in the training dataset, the feature vector has $n$ entries. If we suppose that the first $n_{+}$ proteins in the training set are in the positive class and the remaining $n_{-} := n - n_{+}$ are negative, then the first $n_{+}$ similarities are sequence similarities to positive examples, and the rest are similarities to negative examples. The $n$-dimensional feature vectors are fed to an SVM to obtain the weight coefficients $w_1,\dots,w_n$. The prediction score of a target protein with similarities $s_1,\dots,s_n$ is then expressed as $\sum_{i=1}^{n} w_i s_i$.
The input protein sequence is predicted to have the particular cellular function if the score is over a threshold. It is preferable that the first $n_{+}$ weight coefficients be non-negative and that the remaining weight coefficients be non-positive. The SVM-Pairwise approach does not ensure these requirements. Meanwhile, our approach is capable of explicitly imposing the constraints
$$\forall i \le n_{+}:\ w_i \ge 0, \qquad \forall i > n_{+}:\ w_i \le 0.$$
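The construction of the sign pattern for the SVM-Pairwise features is mechanical once the training proteins are ordered with positives first; a small sketch (names are ours):

```python
def pairwise_sign_vector(n_pos, n_neg):
    """Sign constraints for SVM-Pairwise features: coefficients on
    similarities to positive training proteins are forced non-negative,
    those on similarities to negative proteins non-positive."""
    return [1] * n_pos + [-1] * n_neg

def pairwise_score(w, sims):
    """Prediction score: weighted sum of pairwise similarities."""
    return sum(wi * si for wi, si in zip(w, sims))
```

The resulting vector can be passed directly as the `sigma` argument of a sign-constrained solver.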
This approach was applied to predict protein functions in Saccharomyces cerevisiae (S. cerevisiae). The annotations of the protein functions are provided in the MIPS Comprehensive Yeast Genome Database (CYGD). The dataset contains 3,583 proteins. The Smith-Waterman similarities available from https://noble.gs.washington.edu/proj/sdp-svm/ were used as sequence similarities among the proteins. The number of functional categories was 12. Some proteins have multiple cellular functions; indeed, 1,343 proteins in the dataset have more than one function. For this reason, we pose 12 independent binary classification tasks instead of a single multiclass classification task. The 3,583 proteins were randomly split in half to obtain two datasets, one for training and the other for testing. For each of the 12 classification tasks, we repeated this procedure 100 times and obtained 100 ROC scores.
Table 2 reports the ROC scores averaged over the 100 trials and the standard deviations for the 12 binary classification tasks. The sign-constrained SVM significantly surpassed the classical training for all 12 tasks. Surprisingly, we observed that the ROC score of SC-SVM was larger than that of the classical SVM in every single trial.
6.2 Convergence Behavior
We carried out an empirical evaluation of the proposed optimization methods, the sign-constrained Pegasos (SC-Pega) and the sign-constrained SDCA (SC-SDCA), in order to illustrate the convergence of our algorithms to the optimum. For SC-Pega, two mini-batch sizes were examined. In these experiments, we used the smoothed hinge loss. We used three datasets for binary classification, Covtype, W8a, and Phishing, all available from the LIBSVM web site (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
7 Conclusions
In this paper, we presented two new algorithms for minimizing regularized empirical loss subject to sign constraints. The two algorithms are based on Pegasos and SDCA, both of which have solid theoretical guarantees of convergence. The sign-constrained versions, named SC-Pega and SC-SDCA, respectively, enjoy the same convergence rates as the corresponding original algorithms. The algorithms were demonstrated in two applications: posing sign constraints according to domain knowledge, and improving the SVM-Pairwise method by sign constraints.
Acknowledgments
TK was supported by JSPS KAKENHI Grant Numbers 26249075 and 40401236.
- Bellavia et al. (2006) Bellavia, S., Macconi, M., and Morini, B. (2006). An interior point Newton-like method for non-negative least-squares problems with degenerate solution. Numerical Linear Algebra with Applications, 13(10), 825–846. doi: 10.1002/nla.502.
- Bierlaire et al. (1991) Bierlaire, M., Toint, P., and Tuyttens, D. (1991). On iterative algorithms for linear least squares problems with bound constraints. Linear Algebra and its Applications, 143, 111–143. doi: 10.1016/0024-3795(91)90009-l.
- Cai et al. (2017) Cai, Y., Gu, H., and Kenney, T. (2017). Learning microbial community structures with supervised and unsupervised non-negative matrix factorization. Microbiome, 5(1), 110.
- Defazio et al. (2014a) Defazio, A., Bach, F., and Lacoste-julien, S. (2014a). Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc.
- Defazio et al. (2014b) Defazio, A. J., Caetano, T. S., and Domke, J. (2014b). Finito: A faster, permutable incremental gradient method for big data problems. arXiv:1407.2710.
- Ding et al. (2006) Ding, C., Li, T., Peng, W., and Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD06. ACM Press. doi:10.1145/1150402.1150420.
- Dong et al. (2016) Dong, W., Fu, F., Shi, G., Cao, X., Wu, J., Li, G., and Li, G. (2016). Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Trans Image Process, 25(5), 2337–52.
- Févotte and Idier (2011) Févotte, C. and Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456. doi:10.1162/neco_a_00168.
- Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning – Data Mining, Inference, and Prediction. Springer, 2nd edition.
- Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 169–192. doi:10.1007/s10994-007-5016-8.
- He et al. (2013) He, R., Zheng, W. S., Hu, B. G., and Kong, X. W. (2013). Two-stage nonnegative sparse representation for large-scale face recognition. IEEE Trans Neural Netw Learn Syst, 24(1), 35–46.
- Heinkenschloss et al. (1999) Heinkenschloss, M., Ulbrich, M., and Ulbrich, S. (1999). Superlinear and quadratic convergence of affine-scaling interior-point Newton methods for problems with simple bounds without strict complementarity assumption. Mathematical Programming, 86(3), 615–635. doi: 10.1007/s101070050107.
- Henrot et al. (2013) Henrot, S., Moussaoui, S., Soussen, C., and Brie, D. (2013). Edge-preserving nonnegative hyperspectral image restoration. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. doi: 10.1109/icassp.2013.6637926.
- Ishii et al. (2014a) Ishii, S., Kitamura, G., Segawa, T., Kobayashi, A., Miura, T., Sano, D., and Okabe, S. (2014a). Microfluidic quantitative pcr for simultaneous quantification of multiple viruses in environmental water samples. Appl Environ Microbiol, 80(24), 7505–11.
- Ishii et al. (2014b) Ishii, S., Nakamura, T., Ozawa, S., Kobayashi, A., Sano, D., and Okabe, S. (2014b). Water quality monitoring and risk assessment by simultaneous multipathogen quantification. Environ Sci Technol, 48(9), 4744–9.
- Ji et al. (2009) Ji, Y., Lin, T., and Zha, H. (2009). Mahalanobis distance based non-negative sparse representation for face recognition. In 2009 International Conference on Machine Learning and Applications. IEEE. doi:10.1109/icmla.2009.50.
- Joachims (2005) Joachims, T. (2005). A support vector method for multivariate performance measures. In L. D. Raedt and S. Wrobel, editors, Proceedings of the 22nd International Conference on Machine Learning (ICML-05), pages 377–384.
- Johnson and Zhang (2013) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26: Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 315–323.
- Kanzow and Klug (2006) Kanzow, C. and Klug, A. (2006). On affine-scaling interior-point Newton methods for nonlinear minimization with bound constraints. Computational Optimization and Applications, 35(2), 177–197. doi:10.1007/s10589-006-6514-5.
- Kim et al. (2007) Kim, D., Sra, S., and Dhillon, I. S. (2007). Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 343–354. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972771.31.
- Kim et al. (2010) Kim, D., Sra, S., and Dhillon, I. S. (2010). Tackling box-constrained optimization via a new projected quasi-newton approach. SIAM Journal on Scientific Computing, 32(6), 3548–3563. doi:10.1137/08073812x.
- Kimura et al. (2016) Kimura, K., Kudo, M., and Tanaka, Y. (2016). A column-wise update algorithm for nonnegative matrix factorization in bregman divergence with an orthogonal constraint. Machine Learning, 103(2), 285–306. doi:10.1007/s10994-016-5553-0.
- Kobayashi et al. (2013) Kobayashi, A., Sano, D., Hatori, J., Ishii, S., and Okabe, S. (2013). Chicken- and duck-associated bacteroides-prevotella genetic markers for detecting fecal contamination in environmental water. Appl Microbiol Biotechnol, 97(16), 7427–37.
- Lanckriet et al. (2004a) Lanckriet, G. R., Deng, M., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004a). Kernel-based data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput, 300–311.
- Lanckriet et al. (2004b) Lanckriet, G. R., Bie, T. D., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004b). A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626–35.
- Landi and Piccolomini (2012) Landi, G. and Piccolomini, E. L. (2012). NPTool: a Matlab software for nonnegative image restoration with Newton projection methods. Numerical Algorithms, 62(3), 487–504. doi: 10.1007/s11075-012-9602-x.
- Lapin et al. (2015) Lapin, M., Hein, M., and Schiele, B. (2015). Top-k multiclass svm. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 325–333. Curran Associates, Inc.
- Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611971217.
- Lee and Seung (2001) Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in neural information processing systems, pages 556–562.
- Liao and Noble (2003) Liao, L. and Noble, W. S. (2003). Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol, 10(6), 857–68.
- Lin and Moré (1999) Lin, C.-J. and Moré, J. J. (1999). Newton’s method for large bound-constrained optimization problems. SIAM Journal on Optimization, 9(4), 1100–1127. doi: 10.1137/s1052623498345075.
- Lin et al. (2004) Lin, Y., Lee, D., and Saul, L. (2004). Nonnegative deconvolution for time of arrival estimation. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE. doi:10.1109/icassp.2004.1326273.
- Liu et al. (2014) Liu, B., Zhang, D., Xu, R., Xu, J., Wang, X., Chen, Q., Dong, Q., and Chou, K. C. (2014). Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics, 30(4), 472–479.
- Ma (2013) Ma, J. (2013). Algorithms for non-negatively constrained maximum penalized likelihood reconstruction in tomographic imaging. Algorithms, 6(1), 136–160. doi: 10.3390/a6010136.
- Moré and Toraldo (1991) Moré, J. J. and Toraldo, G. (1991). On the solution of large quadratic programming problems with bound constraints. SIAM Journal on Optimization, 1(1), 93–113. doi: 10.1137/0801008.
- Morigi et al. (2007) Morigi, S., Reichel, L., Sgallari, F., and Zama, F. (2007). An iterative method for linear discrete ill-posed problems with box constraints. Journal of Computational and Applied Mathematics, 198(2), 505–520. doi: 10.1016/j.cam.2005.06.053.
- Needell et al. (2015) Needell, D., Srebro, N., and Ward, R. (2015). Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. Mathematical Programming, 155(1-2), 549–573. doi:10.1007/s10107-015-0864-7.
- Ogul and Mumcuoglu (2006) Ogul, H. and Mumcuoglu, E. U. (2006). Svm-based detection of distant protein structural relationships using pairwise probabilistic suffix trees. Comput Biol Chem, 30(4), 292–299.
- Portugal et al. (1994) Portugal, L. F., Júdice, J. J., and Vicente, L. N. (1994). A comparison of block pivoting and interior-point algorithms for linear least squares problems with nonnegative variables. Mathematics of Computation, 63(208), 625–625. doi:10.1090/s0025-5718-1994-1250776-4.
- Rockafellar (1970) Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ.
- Roux et al. (2012) Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc.
- Schmidt et al. (2016) Schmidt, M., Roux, N. L., and Bach, F. (2016). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2), 83–112. doi:10.1007/s10107-016-1030-6.
- Scott et al. (2002) Scott, T. M., Rose, J. B., Jenkins, T. M., Farrah, S. R., and Lukasik, J. (2002). Microbial source tracking: current methodology and future directions. Appl Environ Microbiol, 68(12), 5796–803.
- Shalev-Shwartz and Zhang (2013) Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1), 567–599.
- Shalev-Shwartz and Zhang (2016) Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1), 105–145.
- Shalev-Shwartz et al. (2011) Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1), 3–30.
- Shashua and Hazan (2005) Shashua, A. and Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd international conference on Machine learning - ICML’05. ACM Press. doi:10.1145/1102351.1102451.
- Wang et al. (2017) Wang, J., Tian, F., Yu, H., Liu, C. H., Zhan, K., and Wang, X. (2017). Diverse non-negative matrix factorization for multiview data representation. IEEE Trans Cybern.
- Wang and Ma (2007) Wang, Y. and Ma, S. (2007). Projected barzilai–borwein method for large-scale nonnegative image restoration. Inverse Problems in Science and Engineering, 15(6), 559–583. doi: 10.1080/17415970600881897.
- Xiao and Zhang (2014) Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075. doi:10.1137/140961791.
- Zhang et al. (2013) Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 980–988, USA. Curran Associates Inc.
- Zhang et al. (2007) Zhang, Q., Wang, H., Plemmons, R., and Pauca, V. P. (2007). Spectral unmixing using nonnegative tensor factorization. In Proceedings of the 45th annual southeast regional conference on ACM-SE 45. ACM Press. doi: 10.1145/1233341.1233449.
A Proofs and Derivations
A.1 Proof of Theorem 1
Shalev-Shwartz et al. (2011) used the following lemma to obtain the bound.
Lemma A.1 (Hazan et al. (2007)).
Let $f_1,\dots,f_T$ be a sequence of $\lambda$-strongly convex functions. Let $S$ be a closed convex set and define $\Pi_S(\boldsymbol{w}) := \arg\min_{\boldsymbol{u}\in S}\|\boldsymbol{w}-\boldsymbol{u}\|$. Let $\boldsymbol{w}_1,\dots,\boldsymbol{w}_{T+1}$ be a sequence of vectors such that $\boldsymbol{w}_1\in S$ and, for $t\ge 1$, $\boldsymbol{w}_{t+1} := \Pi_S(\boldsymbol{w}_t - \eta_t\nabla_t)$, where $\nabla_t\in\partial f_t(\boldsymbol{w}_t)$ and $\eta_t := 1/(\lambda t)$. Assume that $\forall t$, $\|\nabla_t\| \le G$. Then, for every $\boldsymbol{u}\in S$, it holds that
$$\frac{1}{T}\sum_{t=1}^{T} f_t(\boldsymbol{w}_t) \le \frac{1}{T}\sum_{t=1}^{T} f_t(\boldsymbol{u}) + \frac{G^2(1+\log T)}{2\lambda T}.$$
Let $B_r$ be the $r$-ball defined as
$$B_r := \{\boldsymbol{w}\in\mathbb{R}^d \mid \|\boldsymbol{w}\| \le r\},$$
and let $S$ be the set defined in (3). Then the intersection $B_r\cap S$ of the two sets is closed and convex, and the norms of the subgradients $\nabla_t$ are bounded for all $t$. Furthermore, the optimal solution is in the intersection of the two sets; namely, $\boldsymbol{w}^{\star}\in B_r\cap S$.
A.2 Proof of Lemma A.2
Lemma A.2 states the following three claims.
Proof of (35)
To prove (35), it suffices to show that the projection of a point onto the set is given by
The projection problem can be expressed as
With non-negative dual variables and , the Lagrangian function is given by
Let be the saddle point of . Then, . At the saddle point, it holds that , yielding
The dual objective is written as
This implies that is maximized when . Note that this does not depend on the value of . Substituting this into (41), we have
where the last equality follows from the fact that , which can be shown as follows. The Lagrangian function for the problem of projection of onto is given by , and, with a similar derivation, the dual objective is , which is maximized at , yielding .
Next, we find the optimal . The dual objective is
with the derivative
Setting the derivative to zero and noting that the dual variable is non-negative, we get
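The closed-form projection derived above (sign correction followed by rescaling onto the ball) can be checked numerically. This is our sketch with hypothetical names, validated against randomly sampled feasible points:

```python
import math
import random

def project(v, sigma, r):
    """Candidate closed-form projection onto {w : sign constraints hold,
    ||w|| <= r}: sign-correct each entry, then rescale onto the r-ball."""
    w = [max(0.0, vj) if s > 0 else (min(0.0, vj) if s < 0 else vj)
         for vj, s in zip(v, sigma)]
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [wj * r / norm for wj in w] if norm > r else w

# Sanity check: no randomly sampled feasible point is closer to v than the
# candidate projection p.
rng = random.Random(0)
sigma, r = [1, -1, 0], 1.0
v = [1.5, 0.7, -2.0]
p = project(v, sigma, r)
dist_p = math.dist(v, p)
for _ in range(20000):
    u = [rng.uniform(-2.0, 2.0) for _ in range(3)]
    u = project(u, sigma, r)   # map the sample into the feasible set
    assert math.dist(v, u) >= dist_p - 1e-9
```

The check exploits the fact that `project` always returns a feasible point, so any counterexample to optimality of `p` would trip the assertion.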