1 Introduction
The problem of regularized loss minimization (e.g. Hastie et al. (2009)) is often described as

  min_{w ∈ R^d} P(w) := (λ/2)‖w‖² + L(w),   (1)

where λ > 0 is a regularization constant, aiming to obtain a linear predictor ⟨w, x⟩ for an unknown input x ∈ R^d. Therein, L : R^d → R is a loss function which is the sum of convex losses for n examples: L(w) := (1/n) Σ_{i=1}^{n} φ_i(⟨x_i, w⟩) for training inputs x_1, …, x_n ∈ R^d. This problem covers a large class of machine learning algorithms, including the support vector machine, logistic regression, support vector regression, and ridge regression.
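To make the setting concrete, the following is a minimal sketch of the objective in (1), assuming the usual form P(w) = (λ/2)‖w‖² + (1/n) Σᵢ φᵢ(⟨xᵢ, w⟩) instantiated with the classical hinge loss; the function name `objective` and all variable names are ours, not the paper's notation.

```python
import numpy as np

def objective(w, X, y, lam):
    """Regularized hinge-loss objective: (lam/2)*||w||^2 + mean hinge loss.

    X holds one example per row; y holds labels in {-1, +1}."""
    margins = y * (X @ w)
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0.0, 1.0 - margins))
```

For instance, at w = 0 every hinge loss equals 1 and the regularizer vanishes, so the objective is exactly 1.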
In this study, we impose sign constraints (Lawson and Hanson, 1995) on the entries of the model parameter w in the unconstrained minimization problem (1). We divide the index set of entries, [d] := {1, …, d}, into three exclusive subsets, I₊, I₋, and I₀, as [d] = I₊ ∪ I₋ ∪ I₀, and impose on the entries in I₊ and I₋ the constraints

  ∀ j ∈ I₊: w_j ≥ 0,  ∀ j ∈ I₋: w_j ≤ 0.   (2)
Sign constraints can introduce prior knowledge directly into learning machines. For example, let us consider a binary classification task. If the j-th explanatory variable is positively correlated with the binary class label, then a positive weight coefficient w_j is expected to achieve better generalization performance than a negative coefficient, because, without sign constraints, that entry of the optimal solution might be negative due to the small-sample problem. On the other hand, if the variable is negatively correlated with the class label, a negative weight coefficient would yield better prediction. If sign constraints are explicitly imposed, such inadequate signs of coefficients can be avoided.
The strategy of sign constraints for generic learning problems has rarely been discussed so far, although there are extensive reports on nonnegative least squares regression, supported by many successful applications including sound source localization (Lin et al., 2004), tomographic imaging (Ma, 2013), spectral analysis (Zhang et al., 2007), hyperspectral image super-resolution (Dong et al., 2016), microbial community pattern detection (Cai et al., 2017), face recognition (Ji et al., 2009; He et al., 2013), and nonnegative image restoration (Henrot et al., 2013; Landi and Piccolomini, 2012; Wang and Ma, 2007; Shashua and Hazan, 2005). In most of them, nonnegative least squares regression is used as an important ingredient of larger methods such as nonnegative matrix factorization (Lee and Seung, 2001; Wang et al., 2017; Kimura et al., 2016; Févotte and Idier, 2011; Ding et al., 2006). Several efficient algorithms for nonnegative least squares regression have been developed. The active set method by Lawson and Hanson (1995) has been widely used for many years, and several works (Kim et al., 2010, 2007; Bierlaire et al., 1991; Portugal et al., 1994; Moré and Toraldo, 1991; Lin and Moré, 1999; Morigi et al., 2007) have accelerated optimization by combining the active set method with the projected gradient approach. Interior point methods (Bellavia et al., 2006; Heinkenschloss et al., 1999; Kanzow and Klug, 2006) have been proposed as an alternative for nonnegative least squares regression. However, none of them can be applied to generic regularized loss minimization problems.
In this paper, we present two algorithms for the sign-constrained regularized loss minimization problem with generic loss functions. A surge of algorithms for unconstrained regularized empirical loss minimization has been developed, such as SAG (Roux et al., 2012; Schmidt et al., 2016), SVRG (Johnson and Zhang, 2013), Prox-SVRG (Xiao and Zhang, 2014), SAGA (Defazio et al., 2014a), Kaczmarz (Needell et al., 2015), EMGD (Zhang et al., 2013), and Finito (Defazio et al., 2014b). This study focuses on two popular algorithms, Pegasos (Shalev-Shwartz et al., 2011) and SDCA (Shalev-Shwartz and Zhang, 2013). A prominent characteristic of the two algorithms is that no step size needs to be chosen. Some other optimization algorithms guarantee convergence to the optimum under the assumption of a small step size, although that step size is often too small to be used in practice. Meanwhile, the theory of Pegasos has been developed with a step size that is large enough to be adopted in practice, and SDCA needs no step size at all. The two new algorithms developed in this study for the sign-constrained problems are simple modifications of Pegasos and SDCA.
The contributions of this study are summarized as follows.

Sign constraints are introduced to generic regularized loss minimization problems.

Two optimization algorithms for the sign-constrained regularized loss minimization, called SC-Pega and SC-SDCA, were developed by simply inserting the sign correction step, introduced in Section 3, into the original Pegasos and SDCA.

Our theoretical analysis ensures that neither SC-Pega nor SC-SDCA degrades the convergence rate of the original algorithm.

Two attractive applications, in which sign-constrained learning is effective, are presented. One is the exploitation of prior information about the correlation between explanatory variables and a target variable. The other is the introduction of sign constraints into the SVM-Pairwise method (Liao and Noble, 2003).

Experimental results demonstrate significant improvements of generalization performance by introducing sign constraints in both applications.
2 Problem Setting
The feasible region can be expressed simply as

  S := { w ∈ R^d | σ ⊙ w ≥ 0_d },   (3)

where σ ∈ {−1, 0, +1}^d, ⊙ denotes the entrywise product, and each entry of σ is given by

  σ_j := +1 for j ∈ I₊,  σ_j := −1 for j ∈ I₋,  σ_j := 0 for j ∈ I₀.   (4)

Using S, the optimization problem discussed in this paper can be expressed as

  min_{w ∈ S} P(w).   (5)
Assumption 2.1.
Throughout this paper, the following assumptions are used:
(a)  ∀ i ∈ [n], φ_i is a convex function.  (b)  λ > 0.
(c)  ∀ i ∈ [n], ‖x_i‖ ≤ 1.  (d)  ∀ i ∈ [n], φ_i(0) ≤ 1.
Table 1: Examples of loss functions satisfying Assumption 2.1.

Name                  Definition                                                      Label        Type
Classical hinge loss  φ_i(s) := max(0, 1 − y_i s)                                     y_i ∈ {±1}   Lipschitz
Smoothed hinge loss   φ_i(s) := 0 if y_i s ≥ 1;  1 − y_i s − γ/2 if y_i s ≤ 1 − γ;    y_i ∈ {±1}   smooth
                      (1 − y_i s)² / (2γ) otherwise
Logistic loss         φ_i(s) := log(1 + exp(−y_i s))                                  y_i ∈ {±1}   smooth
Square error loss     φ_i(s) := (s − y_i)² / 2                                        y_i ∈ R      smooth
Absolute error loss   φ_i(s) := |s − y_i|                                             y_i ∈ R      Lipschitz
Most widely used loss functions satisfy the above assumptions. Several examples of such loss functions are described in Table 1. If the hinge loss is chosen, the learning machine is a well-known instance called the support vector machine. If the square error loss is chosen, the learning machine is called ridge regression. We denote the optimal solution to the constrained problem by w⋆. We assume two types of loss functions: Lipschitz continuous functions and smooth functions. A function φ is said to be L-Lipschitz continuous if

  ∀ s, t ∈ R: |φ(s) − φ(t)| ≤ L |s − t|.   (6)

Such functions are often said, shortly, to be Lipschitz in this paper. A function φ is (1/γ)-smooth if its derivative is (1/γ)-Lipschitz continuous. For an index subset I ⊆ [d] and a vector v, let v_I be the subvector of v containing the entries corresponding to I. Let X_I be the submatrix of X containing the columns corresponding to I. Finally, the quantity used in the analysis below is defined as
(7) 
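As a concrete reference for Table 1, the loss functions can be written in code as follows. This is a hedged sketch in our own notation (s is the predicted score, y the label); in particular, the smoothing parameter `gamma` of the smoothed hinge is our assumption.

```python
import numpy as np

def hinge(s, y):
    # classical hinge loss (1-Lipschitz in s), y in {-1, +1}
    return max(0.0, 1.0 - y * s)

def smoothed_hinge(s, y, gamma=1.0):
    # smoothed hinge loss ((1/gamma)-smooth); gamma is an assumed parameter
    m = y * s
    if m >= 1.0:
        return 0.0
    if m <= 1.0 - gamma:
        return 1.0 - m - gamma / 2.0
    return (1.0 - m) ** 2 / (2.0 * gamma)

def logistic(s, y):
    # logistic loss (smooth)
    return float(np.log1p(np.exp(-y * s)))

def square_error(s, y):
    # square error loss (smooth); target y is real-valued
    return 0.5 * (s - y) ** 2

def absolute_error(s, y):
    # absolute error loss (1-Lipschitz)
    return abs(s - y)
```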
3 Sign-Constrained Pegasos
In the original Pegasos algorithm (Shalev-Shwartz et al., 2011), each φ_i is assumed to be the classical hinge loss function (see Table 1 for the definition). Each iterate consists of three steps: the mini-batch selection step, the gradient step, and the projection-onto-ball step. The mini-batch selection step chooses a subset I_t ⊆ [n] of examples at random, whose cardinality is predefined as |I_t| = k. The gradient step computes the gradient of

  P_t(w) := (λ/2)‖w‖² + (1/k) Σ_{i ∈ I_t} φ_i(⟨x_i, w⟩),   (8)

which approximates the objective function P(w). The current solution is moved in the direction opposite to the gradient:

  w' := w_t − η_t ∇P_t(w_t),  with step size η_t := 1/(λt).   (9)

At the projection-onto-ball step, the norm of the solution is shortened to r if the norm exceeds r:

  w_{t+1} := min(1, r/‖w'‖) w'.   (10)

The projection-onto-ball step plays an important role in obtaining a smaller upper bound on the norm of the gradient of the regularization term in the objective, which eventually reduces the number of iterates needed to attain an approximate solution.
In the algorithm developed in this study, we simply insert, between those two steps, a new step that corrects the sign of each entry in the current solution as

  ∀ j ∈ I₊: w_j ← max(0, w_j),  ∀ j ∈ I₋: w_j ← min(0, w_j),   (11)

which can be rewritten equivalently as w ← Π_S(w), where the operator Π_S performs the entrywise clipping in (11).
The algorithm can be summarized as Algorithm 1. Here, the loss function is not limited to the classical hinge loss. In the projection-onto-ball step, the solution is projected onto a ball of radius r, instead of the ball of radius 1/√λ, to handle more general settings; recall that the radius is 1/√λ if φ_i is the hinge loss employed in the original Pegasos. It can be shown that the objective gap is bounded as follows.
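The iterate just described (mini-batch gradient step, sign correction, projection onto the ball) can be sketched compactly as follows, instantiated for the classical hinge loss. The names `lam`, `batch_size`, and the radius choice 1/√λ are our assumptions; this is an illustrative sketch, not the paper's exact Algorithm 1.

```python
import numpy as np

def sign_correct(w, sigma):
    """Zero out entries whose sign disagrees with the constraint vector sigma
    (sigma[j] = +1 nonnegative, -1 nonpositive, 0 unconstrained)."""
    w = w.copy()
    w[(sigma > 0) & (w < 0)] = 0.0
    w[(sigma < 0) & (w > 0)] = 0.0
    return w

def sc_pegasos(X, y, sigma, lam=0.1, n_iters=1000, batch_size=8, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    radius = 1.0 / np.sqrt(lam)          # assumed ball radius for the hinge loss
    for t in range(1, n_iters + 1):
        eta = 1.0 / (lam * t)            # Pegasos step size, no tuning needed
        batch = rng.choice(n, size=batch_size, replace=False)
        margins = y[batch] * (X[batch] @ w)
        viol = margins < 1.0             # examples with a nonzero hinge subgradient
        grad = lam * w - (y[batch][viol, None] * X[batch][viol]).sum(0) / batch_size
        w = w - eta * grad               # gradient step
        w = sign_correct(w, sigma)       # inserted sign-correction step
        norm = np.linalg.norm(w)
        if norm > radius:                # projection onto the ball
            w *= radius / norm
    return w
```

On toy data whose label is the sign of the first feature, constraining the first weight to be nonnegative and the second nonpositive keeps the signs valid at every iterate while the classifier is being learned.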
Theorem 1.
Consider Algorithm 1. If the losses φ_1, …, φ_n are Lipschitz continuous, it holds that
(12) 
4 Sign-Constrained SDCA
The original SDCA is a framework for the unconstrained problem (1). In SDCA, a dual problem is solved instead of the primal problem. Namely, the dual objective is maximized in an iterative fashion with respect to the dual variables α = (α_1, …, α_n) ∈ R^n. The problem dual to the unconstrained problem (1) is given by

  max_{α ∈ R^n} D(α),   (13)

where

  D(α) := (1/n) Σ_{i=1}^{n} −φ_i*(−α_i) − (λ/2) ‖w(α)‖²,   (14)

and φ_i* denotes the convex conjugate of φ_i. To find the maximizer of D, a single example i is chosen at random at each iterate t, and the single dual variable α_i is optimized with the other variables α_j, j ≠ i, frozen. If we denote by α^(t−1) the value of the dual vector at the previous iterate, the dual vector is updated as α^(t) := α^(t−1) + Δα_i e_i, where Δα_i is determined so that the one-variable dual objective is maximized, with

  w(α) := (1/(λn)) Σ_{i=1}^{n} α_i x_i.   (15)
In the case of the hinge loss, the maximizer of the one-variable subproblem can be found within O(d) computation. The primal variable can also be maintained within O(d) computation by w^(t) := w^(t−1) + Δα_i x_i/(λn).
Now let us move on to the sign-constrained problem. In addition to Algorithm 1, which is derived from Pegasos, we present another algorithm, based on SDCA, for finding the minimizer of P subject to the sign constraint w ∈ S. Like Algorithm 1, which was designed by inserting the sign correction step into the original Pegasos iterate, the new algorithm has been developed by simply adding the sign correction step to each SDCA iterate. The resultant algorithm is described in Algorithm 2.
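The iterate just described can be sketched as follows for the hinge loss, combining the standard SDCA closed-form coordinate update (Shalev-Shwartz and Zhang, 2013) with the sign-correction step. All names are our own, and the code is an illustrative sketch rather than the paper's exact Algorithm 2.

```python
import numpy as np

def sign_correct(w, sigma):
    """Clip entries whose sign violates sigma (+1 nonnegative, -1 nonpositive)."""
    w = w.copy()
    w[(sigma > 0) & (w < 0)] = 0.0
    w[(sigma < 0) & (w > 0)] = 0.0
    return w

def sc_sdca_hinge(X, y, sigma, lam=0.01, n_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)                  # dual variables, one per example
    w = np.zeros(d)                      # primal iterate, maintained incrementally
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            xi, yi = X[i], y[i]
            # closed-form maximizer of the one-variable dual subproblem (hinge loss)
            delta = yi * max(0.0, min(1.0,
                lam * n * (1.0 - yi * (xi @ w)) / (xi @ xi) + alpha[i] * yi)) - alpha[i]
            alpha[i] += delta
            w = w + delta * xi / (lam * n)   # O(d) primal maintenance
            w = sign_correct(w, sigma)       # inserted sign-correction step
    return w
```

Note that no step size appears anywhere: each coordinate update is an exact maximization, which is the characteristic of SDCA emphasized in the introduction.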
For some loss functions, the maximization at step 5 in Algorithm 2 cannot be given in closed form. Alternatively, step 4 can be replaced with

  (16)

See Subsection A.4 for the derivation of (16) and for the definitions of the auxiliary quantities appearing therein.
The following theorem states the number of iterates required to guarantee that the expected primal objective gap falls below a threshold under the sign constraints.
Theorem 2.
5 Multiclass Classification
In this section, we extend our algorithms to the multiclass classification setting with m classes. Here, the model parameter is a matrix W := [w_1, …, w_m] ∈ R^{d×m} instead of a vector. The loss function for each example is a function of an m-dimensional score vector. The prediction is done by taking the class with the maximal score among ⟨w_1, x⟩, …, ⟨w_m, x⟩. Without loss of generality, the set of class labels is given by [m] := {1, …, m}. Several loss functions are used for multiclass classification, as follows.

Softmax loss:

  φ_i(s) := log( Σ_{y=1}^{m} exp(s_y) ) − s_{y_i},   (20)

where y_i is the true class label of the i-th example.

Max-hinge loss:

  φ_i(s) := max(0, max_{y ≠ y_i} (1 + s_y) − s_{y_i}).   (21)
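The two losses above can be sketched as follows for a single example, assuming a score vector s = Wᵀx with one entry per class; the exact parameterization in (20) and (21) may differ from this reading.

```python
import numpy as np

def softmax_loss(scores, y_true):
    """Softmax loss: log-sum-exp of the scores minus the true-class score."""
    z = scores - scores.max()                 # shift for numerical stability
    return float(np.log(np.exp(z).sum()) - z[y_true])

def max_hinge_loss(scores, y_true):
    """Max-hinge loss: largest margin violation over the wrong classes."""
    margins = 1.0 + scores - scores[y_true]
    margins[y_true] = 0.0                     # exclude the true class
    return float(max(0.0, margins.max()))
```

For example, with two classes and equal scores, the softmax loss is log 2, while the max-hinge loss is 1 because the margin of the wrong class is not yet satisfied.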
The objective function for learning is defined as

  P(W) := (λ/2)‖W‖_F² + (1/n) Σ_{i=1}^{n} φ_i(Wᵀ x_i).   (23)
The learning problem discussed here is the minimization of P(W) with respect to W, subject to the sign constraints

  ∀ (j, y) ∈ I₊: W_{j,y} ≥ 0,  ∀ (j, y) ∈ I₋: W_{j,y} ≤ 0,   (24)

with two exclusive sets I₊ and I₋ such that

  I₊ ∩ I₋ = ∅,  I₊ ∪ I₋ ⊆ [d] × [m].   (25)
Introducing Σ ∈ {−1, 0, +1}^{d×m} as

  Σ_{j,y} := +1 for (j, y) ∈ I₊,  −1 for (j, y) ∈ I₋,  0 otherwise,   (26)

the feasible region can be expressed as

  S := { W ∈ R^{d×m} | Σ ⊙ W ≥ O }.   (27)
The goal here is to develop algorithms that find

  W⋆ := argmin_{W ∈ S} P(W).   (28)
Define the approximate objective as

  (29)

where X_{I_t} is the horizontal concatenation of the columns of X selected by a mini-batch I_t. We use assumptions analogous to Assumption 2.1: each φ_i is a convex function; λ > 0; ∀ i ∈ [n], ‖x_i‖ ≤ 1; and ∀ i ∈ [n], φ_i(0) ≤ 1.
By extending Algorithm 1, an algorithm for the minimization of P(W) subject to the sign constraints can be developed, as described in Algorithm 3.
The SDCA-based learning algorithm can also be developed for the multiclass classification task. In this algorithm, the dual variables are represented as a matrix A ∈ R^{m×n}. At each iterate t, one of the n columns, α_i, is chosen at random, instead of a single dual variable, to update the matrix as A^(t) := A^(t−1) + Δα_i e_iᵀ, where we have used the iterate number as the superscript of A. To determine the value of Δα_i, the following auxiliary function is introduced:
(30) 
6 Experiments
In this section, experimental results are reported in order to illustrate the effects of the sign constraints on classification and to demonstrate the convergence behavior.
6.1 Prediction Performance
The pattern recognition performance of the sign-constrained learning was examined on two tasks: Escherichia coli (E. coli) count prediction and protein function prediction.

E. coli Prediction
The first task is to predict E. coli counts in river water. The E. coli count has been used as an indicator of fecal contamination in the water environment in many parts of the world (Scott et al., 2002). In this experiment, the data points with E. coli counts over 500 most probable number (MPN)/100 mL are assigned to the positive class, and the others are negative. The hydrological and water quality monitoring data are used to predict whether the E. coli count is positive or negative.

For ensuring microbial safety in water usage, it is meaningful to predict E. coli counts on a real-time basis. The concentration of E. coli in water, which is measured by culture-dependent methods (Kobayashi et al., 2013), has been used to monitor fecal contamination in the water environment, and has proved effective in preventing waterborne infectious diseases in varied styles of water usage. On the other hand, real-time monitoring of E. coli counts has not yet been achieved. It takes at least ten hours to obtain E. coli counts by culture-dependent methods, and at least several hours are needed to measure the concentration of E. coli by culture-independent methods (Ishii et al., 2014b, a), such as polymerase chain reaction. Since some of the hydrological and water quality data can be measured with real-time sensors, real-time prediction of E. coli counts will be realized if the hydrological and water quality data can be used for the E. coli count prediction.
Many training examples are required to obtain better generalization performance. A serious issue, however, is that measuring the concentration of E. coli is time-consuming and the cost of reagents is expensive. We here demonstrate that this issue can be mitigated by exploiting the domain knowledge accumulated in the field of water engineering.
The hydrological and water quality data contain nine explanatory variables: WT, pH, EC, SS, DO, BOD, TN, TP, and flow rate. The explanatory variable pH is divided into two variables. It is well known, in the field of water engineering, that E. coli increases as WT, EC, SS, BOD, TN, and TP become larger, and as the two pH-derived variables, DO, and the flow rate become smaller. From this fact, we restrict the signs of the entries in the predictor parameter as follows.

Coefficients of the six explanatory variables WT, EC, SS, BOD, TN, and TP must be nonnegative.

Coefficients of the remaining four explanatory variables (the two pH-derived variables, DO, and flow rate) must be nonpositive.
We measured the concentration of E. coli 177 times from December 5th, 2011 to April 17th, 2013, obtaining 177 data points comprising 88 positives and 89 negatives. We chose ten examples out of the 177 data points at random for training, and the other 167 examples were used for testing. The prediction performance is evaluated by the precision-recall break-even point (PRBEP) (Joachims, 2005) and the ROC score. We compared the classical SVM with the sign-constrained SVM (SC-SVM) to examine the effects of the sign constraints. We repeated this procedure 10,000 times and obtained 10,000 PRBEP values and 10,000 ROC scores.
Figure 1: Improvement over the classical SVM on each trial: (a) PRBEP; (b) ROC score.
SC-SVM achieved a significant improvement over the classical SVM: PRBEP and ROC scores of 0.808 and 0.863 on average over the 10,000 trials, whereas those of the classical SVM were 0.757 and 0.810, respectively. The difference from the classical SVM on each trial is plotted in the histograms of Figure 1. Positive improvements of the ROC score were obtained in 8,932 out of 10,000 trials, whereas the ROC score decreased in only 796 trials. For PRBEP, positive improvements were obtained in 7,349 trials, whereas deteriorations were observed in only 1,069 trials.
Figure 2: (a) Covtype; (b) W8a; (c) Phishing.
Protein Function Prediction
Table 2: ROC scores (with standard deviations) of SC-SVM and the classical SVM for the 12 binary classification tasks.

Category  SC-SVM  SVM
1  0.751 (0.011)  0.730 (0.010) 
2  0.740 (0.016)  0.680 (0.015) 
3  0.753 (0.011)  0.721 (0.011) 
4  0.762 (0.010)  0.734 (0.010) 
5  0.769 (0.012)  0.691 (0.013) 
6  0.690 (0.014)  0.614 (0.014) 
7  0.713 (0.024)  0.618 (0.022) 
8  0.725 (0.019)  0.667 (0.019) 
9  0.655 (0.024)  0.578 (0.023) 
10  0.743 (0.016)  0.710 (0.014) 
11  0.535 (0.019)  0.492 (0.018) 
12  0.912 (0.011)  0.901 (0.011) 
In the field of molecular biology, understanding the functions of proteins is positioned as a key step toward the elucidation of cellular mechanisms. Sequence similarities have been a major means of predicting the function of an unannotated protein. At the beginning of this century, the prediction accuracy was improved by combining sequence similarities with discriminative learning. The method, named SVM-Pairwise (Liao and Noble, 2003), uses a feature vector that contains pairwise similarities to annotated protein sequences. Several other studies (Liu et al., 2014; Ogul and Mumcuoglu, 2006; Lanckriet et al., 2004b, a) have also provided empirical evidence that the SVM-Pairwise approach is a powerful framework. Basically, if n proteins are in the training dataset, the feature vector has n entries. If we suppose that the first n₊ proteins in the training set are in the positive class and the rest are negative, then the first n₊ entries are sequence similarities to positive examples, and the remaining n − n₊ entries are similarities to negative examples. The n-dimensional vectors are fed to an SVM to obtain the weight coefficients w_1, …, w_n. Then, the prediction score of the target protein is expressed as
  f(x) := Σ_{j=1}^{n} w_j s_j(x),   (31)

where s_j(x) denotes the sequence similarity of the target protein x to the j-th training protein.
The input protein sequence is predicted to have the particular cellular function if the score is over a threshold. It should be preferable that the first n₊ weight coefficients be nonnegative and that the rest be nonpositive. The SVM-Pairwise approach does not ensure those requirements. Meanwhile, our approach is capable of explicitly imposing the constraints

  w_1, …, w_{n₊} ≥ 0,  w_{n₊+1}, …, w_n ≤ 0.   (32)
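A minimal sketch of this construction follows; `pairwise_sign_constraints` and `prediction_score` are hypothetical helpers of ours, illustrating how the constraint vector in (32) lines up with the similarity features.

```python
import numpy as np

def pairwise_sign_constraints(n_pos, n_neg):
    """Constraint vector for SVM-Pairwise features: +1 for entries measuring
    similarity to positive training proteins (weights forced nonnegative),
    -1 for entries measuring similarity to negatives (weights forced nonpositive)."""
    return np.concatenate([np.ones(n_pos), -np.ones(n_neg)])

def prediction_score(w, sims):
    """Score of a query protein from its similarities to the training proteins."""
    return float(sims @ w)
```

This constraint vector can then be passed to a sign-constrained solver such as the SC-Pega sketch above in place of an unconstrained SVM.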
This approach was applied to the prediction of protein functions in Saccharomyces cerevisiae (S. cerevisiae). The annotations of the protein functions are provided in the MIPS Comprehensive Yeast Genome Database (CYGD). The dataset contains 3,583 proteins. The Smith-Waterman similarities available from https://noble.gs.washington.edu/proj/sdpsvm/ were used as the sequence similarities among the proteins. The number of categories was 12. Some proteins have multiple cellular functions; indeed, 1,343 proteins in the dataset have more than one function. For this reason, we pose 12 independent binary classification tasks instead of a single multiclass classification task. The 3,583 proteins were randomly split in half to obtain two datasets; one was used for training, and the other for testing. For each of the 12 classification tasks, we repeated this procedure 100 times and obtained 100 ROC scores.
Table 2 reports the ROC scores averaged over the 100 trials and the standard deviations for the 12 binary classification tasks. The sign-constrained training significantly surpassed the classical training for all 12 tasks. Surprisingly, we observed that the ROC score of SC-SVM was larger than that of the classical SVM in every single trial.
6.2 Convergence
We carried out an empirical evaluation of the proposed optimization methods, the sign-constrained Pegasos (SC-Pega) and the sign-constrained SDCA (SC-SDCA), in order to illustrate the convergence of our algorithms to the optimum. For SC-Pega, we tested two settings of the mini-batch size. In these experiments, we used the smoothed hinge loss. We used three datasets, Covtype, W8a, and Phishing, all of which are binary classification datasets available from the LIBSVM web site (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).
7 Conclusions
In this paper, we presented two new algorithms for minimizing a regularized empirical loss subject to sign constraints. The two algorithms are based on Pegasos and SDCA, both of which have solid theoretical support for convergence. The sign-constrained versions, named SC-Pega and SC-SDCA, respectively, enjoy the same convergence rates as the corresponding original algorithms. The algorithms were demonstrated in two applications: one poses sign constraints according to domain knowledge, and the other improves the SVM-Pairwise method with sign constraints.
Acknowledgements
TK was supported by JSPS KAKENHI Grant Numbers 26249075 and 40401236.
References
 Bellavia et al. (2006) Bellavia, S., Macconi, M., and Morini, B. (2006). An interior point Newton-like method for nonnegative least-squares problems with degenerate solution. Numerical Linear Algebra with Applications, 13(10), 825–846. doi:10.1002/nla.502.
 Bierlaire et al. (1991) Bierlaire, M., Toint, P., and Tuyttens, D. (1991). On iterative algorithms for linear least squares problems with bound constraints. Linear Algebra and its Applications, 143, 111–143. doi:10.1016/0024-3795(91)90009-L.
 Cai et al. (2017) Cai, Y., Gu, H., and Kenney, T. (2017). Learning microbial community structures with supervised and unsupervised non-negative matrix factorization. Microbiome, 5(1), 110.
 Defazio et al. (2014a) Defazio, A., Bach, F., and Lacoste-Julien, S. (2014a). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1646–1654. Curran Associates, Inc.
 Defazio et al. (2014b) Defazio, A. J., Caetano, T. S., and Domke, J. (2014b). Finito: A faster, permutable incremental gradient method for big data problems. arXiv:1407.2710.
 Ding et al. (2006) Ding, C., Li, T., Peng, W., and Park, H. (2006). Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '06. ACM Press. doi:10.1145/1150402.1150420.
 Dong et al. (2016) Dong, W., Fu, F., Shi, G., Cao, X., Wu, J., Li, G., and Li, X. (2016). Hyperspectral image super-resolution via non-negative structured sparse representation. IEEE Trans Image Process, 25(5), 2337–52.
 Févotte and Idier (2011) Févotte, C. and Idier, J. (2011). Algorithms for nonnegative matrix factorization with the β-divergence. Neural Computation, 23(9), 2421–2456. doi:10.1162/neco_a_00168.
 Hastie et al. (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning – Data Mining, Inference, and Prediction. Springer, 2nd edition.
 Hazan et al. (2007) Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3), 169–192. doi:10.1007/s10994-007-5016-8.
 He et al. (2013) He, R., Zheng, W. S., Hu, B. G., and Kong, X. W. (2013). Two-stage nonnegative sparse representation for large-scale face recognition. IEEE Trans Neural Netw Learn Syst, 24(1), 35–46.
 Heinkenschloss et al. (1999) Heinkenschloss, M., Ulbrich, M., and Ulbrich, S. (1999). Superlinear and quadratic convergence of affine-scaling interior-point Newton methods for problems with simple bounds without strict complementarity assumption. Mathematical Programming, 86(3), 615–635. doi:10.1007/s101070050107.
 Henrot et al. (2013) Henrot, S., Moussaoui, S., Soussen, C., and Brie, D. (2013). Edge-preserving nonnegative hyperspectral image restoration. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE. doi:10.1109/icassp.2013.6637926.
 Ishii et al. (2014a) Ishii, S., Kitamura, G., Segawa, T., Kobayashi, A., Miura, T., Sano, D., and Okabe, S. (2014a). Microfluidic quantitative pcr for simultaneous quantification of multiple viruses in environmental water samples. Appl Environ Microbiol, 80(24), 7505–11.
 Ishii et al. (2014b) Ishii, S., Nakamura, T., Ozawa, S., Kobayashi, A., Sano, D., and Okabe, S. (2014b). Water quality monitoring and risk assessment by simultaneous multipathogen quantification. Environ Sci Technol, 48(9), 4744–9.
 Ji et al. (2009) Ji, Y., Lin, T., and Zha, H. (2009). Mahalanobis distance based nonnegative sparse representation for face recognition. In 2009 International Conference on Machine Learning and Applications. IEEE. doi:10.1109/icmla.2009.50.
 Joachims (2005) Joachims, T. (2005). A support vector method for multivariate performance measures. In L. D. Raedt and S. Wrobel, editors, Proceedings of the 22nd International Conference on Machine Learning (ICML05), pages 377–384.

 Johnson and Zhang (2013) Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323.
 Kanzow and Klug (2006) Kanzow, C. and Klug, A. (2006). On affine-scaling interior-point Newton methods for nonlinear minimization with bound constraints. Computational Optimization and Applications, 35(2), 177–197. doi:10.1007/s10589-006-6514-5.
 Kim et al. (2007) Kim, D., Sra, S., and Dhillon, I. S. (2007). Fast Newton-type methods for the least squares nonnegative matrix approximation problem. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 343–354. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611972771.31.
 Kim et al. (2010) Kim, D., Sra, S., and Dhillon, I. S. (2010). Tackling box-constrained optimization via a new projected quasi-Newton approach. SIAM Journal on Scientific Computing, 32(6), 3548–3563. doi:10.1137/08073812x.
 Kimura et al. (2016) Kimura, K., Kudo, M., and Tanaka, Y. (2016). A column-wise update algorithm for nonnegative matrix factorization in Bregman divergence with an orthogonal constraint. Machine Learning, 103(2), 285–306. doi:10.1007/s10994-015-5553-0.
 Kobayashi et al. (2013) Kobayashi, A., Sano, D., Hatori, J., Ishii, S., and Okabe, S. (2013). Chicken and duckassociated bacteroidesprevotella genetic markers for detecting fecal contamination in environmental water. Appl Microbiol Biotechnol, 97(16), 7427–37.
 Lanckriet et al. (2004a) Lanckriet, G. R., Deng, M., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004a). Kernelbased data fusion and its application to protein function prediction in yeast. Pac Symp Biocomput, (), 300–11.
 Lanckriet et al. (2004b) Lanckriet, G. R., Bie, T. D., Cristianini, N., Jordan, M. I., and Noble, W. S. (2004b). A statistical framework for genomic data fusion. Bioinformatics, 20(16), 2626–35.
 Landi and Piccolomini (2012) Landi, G. and Piccolomini, E. L. (2012). NPTool: a Matlab software for nonnegative image restoration with Newton projection methods. Numerical Algorithms, 62(3), 487–504. doi: 10.1007/s110750129602x.
 Lapin et al. (2015) Lapin, M., Hein, M., and Schiele, B. (2015). Top-k multiclass SVM. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 325–333. Curran Associates, Inc.
 Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J. (1995). Solving Least Squares Problems. Society for Industrial and Applied Mathematics. doi:10.1137/1.9781611971217.
 Lee and Seung (2001) Lee, D. D. and Seung, H. S. (2001). Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, pages 556–562.
 Liao and Noble (2003) Liao, L. and Noble, W. S. (2003). Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. J Comput Biol, 10(6), 857–68.
 Lin and Moré (1999) Lin, C.J. and Moré, J. J. (1999). Newton’s method for large boundconstrained optimization problems. SIAM Journal on Optimization, 9(4), 1100–1127. doi: 10.1137/s1052623498345075.
 Lin et al. (2004) Lin, Y., Lee, D., and Saul, L. (2004). Nonnegative deconvolution for time of arrival estimation. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE. doi:10.1109/icassp.2004.1326273.
 Liu et al. (2014) Liu, B., Zhang, D., Xu, R., Xu, J., Wang, X., Chen, Q., Dong, Q., and Chou, K. C. (2014). Combining evolutionary information extracted from frequency profiles with sequencebased kernels for protein remote homology detection. Bioinformatics, 30(4), 472–479.
 Ma (2013) Ma, J. (2013). Algorithms for nonnegatively constrained maximum penalized likelihood reconstruction in tomographic imaging. Algorithms, 6(1), 136–160. doi: 10.3390/a6010136.
 Moré and Toraldo (1991) Moré, J. J. and Toraldo, G. (1991). On the solution of large quadratic programming problems with bound constraints. SIAM Journal on Optimization, 1(1), 93–113. doi: 10.1137/0801008.
 Morigi et al. (2007) Morigi, S., Reichel, L., Sgallari, F., and Zama, F. (2007). An iterative method for linear discrete illposed problems with box constraints. Journal of Computational and Applied Mathematics, 198(2), 505–520. doi: 10.1016/j.cam.2005.06.053.
 Needell et al. (2015) Needell, D., Srebro, N., and Ward, R. (2015). Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Mathematical Programming, 155(1-2), 549–573. doi:10.1007/s10107-015-0864-7.
 Ogul and Mumcuoglu (2006) Ogul, H. and Mumcuoglu, E. U. (2006). SVM-based detection of distant protein structural relationships using pairwise probabilistic suffix trees. Comput Biol Chem, 30(4), 292–299.
 Portugal et al. (1994) Portugal, L. F., Júdice, J. J., and Vicente, L. N. (1994). A comparison of block pivoting and interior-point algorithms for linear least squares problems with nonnegative variables. Mathematics of Computation, 63(208), 625–625. doi:10.1090/s0025-5718-1994-1250776-4.
 Rockafellar (1970) Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ.
 Roux et al. (2012) Roux, N. L., Schmidt, M., and Bach, F. R. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc.
 Schmidt et al. (2016) Schmidt, M., Roux, N. L., and Bach, F. (2016). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2), 83–112. doi:10.1007/s10107-016-1030-6.
 Scott et al. (2002) Scott, T. M., Rose, J. B., Jenkins, T. M., Farrah, S. R., and Lukasik, J. (2002). Microbial source tracking: current methodology and future directions. Appl Environ Microbiol, 68(12), 5796–803.
 Shalev-Shwartz and Zhang (2013) Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1), 567–599.
 Shalev-Shwartz and Zhang (2016) Shalev-Shwartz, S. and Zhang, T. (2016). Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, 155(1), 105–145.

 Shalev-Shwartz et al. (2011) Shalev-Shwartz, S., Singer, Y., Srebro, N., and Cotter, A. (2011). Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1), 3–30.
 Shashua and Hazan (2005) Shashua, A. and Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05. ACM Press. doi:10.1145/1102351.1102451.
 Wang et al. (2017) Wang, J., Tian, F., Yu, H., Liu, C. H., Zhan, K., and Wang, X. (2017). Diverse non-negative matrix factorization for multiview data representation. IEEE Trans Cybern.
 Wang and Ma (2007) Wang, Y. and Ma, S. (2007). Projected barzilai–borwein method for largescale nonnegative image restoration. Inverse Problems in Science and Engineering, 15(6), 559–583. doi: 10.1080/17415970600881897.
 Xiao and Zhang (2014) Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075. doi:10.1137/140961791.
 Zhang et al. (2013) Zhang, L., Mahdavi, M., and Jin, R. (2013). Linear convergence with condition number independent access of full gradients. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 980–988, USA. Curran Associates Inc.
 Zhang et al. (2007) Zhang, Q., Wang, H., Plemmons, R., and Pauca, V. P. (2007). Spectral unmixing using nonnegative tensor factorization. In Proceedings of the 45th Annual Southeast Regional Conference, ACM-SE 45. ACM Press. doi:10.1145/1233341.1233449.
A Proofs and Derivations
A.1 Proof of Theorem 1
Shalev-Shwartz et al. (2011) used the following lemma to obtain their bound.
Lemma A.1 (Hazan et al. (2007)).
Let f_1, …, f_T be a sequence of λ-strongly convex functions. Let C be a closed convex set and define Π_C(w) := argmin_{v ∈ C} ‖v − w‖. Let w_1, …, w_{T+1} be a sequence of vectors such that w_1 ∈ C and, for t ≥ 1, w_{t+1} := Π_C(w_t − η_t ∇_t), where ∇_t ∈ ∂f_t(w_t) and η_t := 1/(λt). Assume that ‖∇_t‖ ≤ G for all t. Then, for every u ∈ C, it holds that

  (1/T) Σ_{t=1}^{T} f_t(w_t) ≤ (1/T) Σ_{t=1}^{T} f_t(u) + G²(1 + log T)/(2λT).   (33)
We, too, have used Lemma A.1 to obtain Theorem 1 for our sign-constrained learning problem (5). To this end, we establish the following lemma.
Lemma A.2.
Let B_r be the ball defined as

  B_r := { w ∈ R^d | ‖w‖ ≤ r },   (34)

and let S be the set defined in (3). Then the intersection of the two sets, S ∩ B_r, is closed and convex, and it holds that

  Π_{S ∩ B_r}(w) = Π_{B_r}(Π_S(w))   (35)

for every w ∈ R^d. Furthermore, the optimal solution is in the intersection of the two sets; namely,

  w⋆ ∈ S ∩ B_r.   (36)
A.2 Proof of Lemma A.2
Lemma A.2 states three claims: that S ∩ B_r is closed and convex, the projection identity (35), and the membership (36). Apparently, S ∩ B_r is a closed and convex set because both sets are closed and convex. We shall show (35) and then (36).
Proof of (35)
To prove (35), it suffices to show that the projection of a point v onto the set S ∩ B_r is given by

  (38)
The projection problem can be expressed as

  min_w (1/2)‖w − v‖²  subject to  σ ⊙ w ≥ 0_d,  ‖w‖² ≤ r².   (39)
With nonnegative dual variables μ ∈ R₊^d and ν ∈ R₊, the Lagrangian function is given by

  L(w, μ, ν) := (1/2)‖w − v‖² − ⟨μ, σ ⊙ w⟩ + (ν/2)(‖w‖² − r²).   (40)

Let (w̄, μ̄, ν̄) be the saddle point of L. Then w̄ = Π_{S ∩ B_r}(v). At the saddle point, it holds that ∂L/∂w = 0, yielding

  w̄ = (v + σ ⊙ μ̄)/(1 + ν̄).   (41)
The dual objective is written as

  (42)

This implies that the dual objective is maximized when μ̄ = max(0_d, −σ ⊙ v). Note that this value does not depend on the value of ν. Substituting this into (41), we have

  w̄ = (v + σ ⊙ μ̄)/(1 + ν̄) = Π_S(v)/(1 + ν̄),   (43)
where the last equality follows from the fact that v + σ ⊙ μ̄ = Π_S(v), which can be shown as follows. The Lagrangian function for the problem of projecting v onto S is given by L_S(w, μ) := (1/2)‖w − v‖² − ⟨μ, σ ⊙ w⟩, and, with a similar derivation, its dual objective is maximized at μ̄ = max(0_d, −σ ⊙ v), yielding Π_S(v) = v + σ ⊙ max(0_d, −σ ⊙ v).
Next, we find the optimal ν̄. The dual objective, as a function of ν, is
(44) 
with the derivative
(45) 
Setting the derivative to zero and noting that ν̄ is a nonnegative variable, we get
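Under the reading of the derivation above, the projection onto S ∩ B_r can be computed by first clipping the wrong-signed entries (the projection onto S) and then shrinking onto the ball; a sketch under that assumption:

```python
import numpy as np

def project(v, sigma, r):
    """Projection onto {w : sigma * w >= 0} intersected with the ball of radius r,
    computed as clip-then-scale per the composition suggested by (35)."""
    w = v.copy()
    w[(sigma > 0) & (w < 0)] = 0.0       # projection onto S: clip violating signs
    w[(sigma < 0) & (w > 0)] = 0.0
    norm = np.linalg.norm(w)
    if norm > r:                         # then project onto the ball
        w *= r / norm
    return w
```

Points already satisfying both constraints are left unchanged, as a projection must be.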