A class of algorithms originated for minimizing or maximizing a function , while satisfying some constraints . In the history of optimization, the function and the constraints
were linear and the problem was known as linear programming (LP). One of the early published algorithms for solving the linear programming was given by Dantzig, popularly known as Simplex method. As the number of dimensions and constraints increased, solving the LP using simplex method became hard. The inability of simplex method was that it could not solve LP in polynomial time. Khachiyan proposed ellipsoid algorithm as an alternative to simplex method and proved that it could reach the solution iteratively in polynomial time . The practical infeasible condition of ellipsoid algorithm led to the evolution of several interior or barrier point methods. One of the well known interior point methods is Karmarkar’s method proposed by Narendra Karmarkar .
Binary classification is one of the active research areas in machine learning[1, 11]. There are several ways to train a binary classifier. The class labels of a data set can be stored and retrieved during classification using the approach of nearest neighbor . A hyperplane is learnt for classification by training a neural network, which may not always be optimal 
. Vapnik and others formulated the problem of classification as optimization. This method is known as support vector machines (SVMs). Sequential minimal optimization (SMO) is a technique which solves the optimization problem in SVMs 
. Decision trees, Bagging, and Boosting techniques are also used in binary classification[11, 12, 14, 22].
Nearest neighbor method does not involve any modeling to reduce the storage of the training data set . On the other hand, neural network and SVM model the data with an objective function to estimate a hyperplane, which is used in classification. In neural network approach, the objective function is a least squares, which is quadratic in nature and is minimized for the given data set. The hyperplane obtained from a neural network may or may not be optimal, since it depends on the number of layers and weights used to train the network. SVM uses quadratic programming formulation with linear constraints for minimizing the objective function . Several variants of Even though the objective function used in neural network or SVM is a quadratic programming problem, the constraints are linear.
If there is a way to model linear constraints as quadratic constraints, then the objective function becomes quadratically constrained quadratic programming (QCQP). In this paper, binary classification is posed as a QCQP problem and a novel solution is proposed using particle swarm optimization (PSO). One of the advantages of this approach is that it solves the QCQP problem without the need for gradient estimation.
The paper is organized as follows: QCQP and PSO are described in the background section and the solution for quadratically constrained quadratic programming using particle swarms is described in Sec. 3. The proposed method is compared with Khachiyan’s and Karmarkar’s algorithms for linear programming and with neural networks and SVM for quadratic programming in Sec. 4 on experiments and results. Section 5 concludes the paper with suggestions for future work.
The formulation of QCQP and PSO are described in the subsections below.
2.1 Quadratically Constrained Quadratic Programming
2.2 Particle Swarm Optimization
. PSO evolved by simulating bird flocking and fish schooling. The advantages of PSO are that it is simple in conception and easy to implement. Particles are deployed in search space and each particle is evaluated against an optimization function. The best particle is chosen as a directing agent for the rest. The velocity of each particle is controlled by both the particle’s personal best and the global best. During the movement of the particles, a few of them may reach the global best. The movement of particles is adapted from the genetic algorithms or evolutionary programming.
Let X = be the particles deployed in the search space of the optimization function, where is the number of particles and V = are the velocities of the respective particles. for all the particles. A simple PSO update is as follows,
Velocity update equation
where is the weight for the previous velocity; , are constants and , are random values varied in each iteration. is the personal best value for particle and is the global best value among all the particles. is the updated velocity of the particle in the iteration and is the velocity value in the iteration. is the position of the particle after the iteration.
For position update, the updated velocity is added to the existing position of the particle. The position update equation is
3 Proposed method
Our interest lies in determining the shortest path between the two non-intersecting ellipsoids. The shortest path between the two ellipsoids in can be formulated as a convex QCQP problem.
Arriving at the two end points of the shortest path, one on each ellipsoid, can be posed as minimization of in a quadratic form and formulated as follows.
where and , are non-intersecting regions in , . and are the matrices depicting the ellipsoids used in the optimization.
In case the Euclidean distance metric is used for minimization of the path length, then
is the identity matrix. The modified equation is,
Suppose one end point in the shortest path is known and fixed as . Then, Eq (5) can be reformulated as,
Figure 1 shows the ellipsoid with the covariance matrix with points (inside), (on the boundary) and (outside). The boundary point is nearest to the point outside the ellipsoid. We need to determine the unknown by the minimization of .
3.2 Solution using PSO
The novelty of this paper is the application of particle swarms for solving the QCQP problem. Particle swarms are deployed within the ellipsoid to determine . The function is evaluated for each particle in the search space. The particle with the minimum is considered as the closest to the point . Only one particle of the swarm is shown in Figure 1 for ease of representation.
PSO is a stochastic evolutionary algorithm, which takes several generations to reach the optimal value and its performance depends on the initialization. The velocity update equation of the PSO algorithm is modified by including the function. The addition of evaluation function restricts the particle from moving away from the actual course towards the global best position. This addition provides an advantage in computation. However, it also constrains the particle to move in a particular direction. To counter this effect, we add a craziness term in the velocity update equation. The modified velocity update equation is:
where and are constants, and is a random value that is varied during each iteration. The value of is chosen such that the term does not take the particle outside the ellipsoid. is a random point on the surface of a sphere of dimension with randomly varying radius. forms the craziness term.
The velocity vector of particle should be directed towards the minimization of .
The function needs to be evaluated. We use instead of using as the objective function. The value of is unknown and needs to be arrived at by minimizing the function .
The value of the function for the particle at the iteration is evaluated as
Let the present position of the particle be split into the previous position and the velocity vector of the particle.
Here, is fixed and the particle position in the objective function is dependent on the velocity vector . So, one possible option for minimizing the function is by changing the direction of the velocity vector of the particle as follows:
Thus, the function is minimized if the direction of the velocity vector is as given by Eq (10). ∎
The velocity vector update equation (2) does not have any term relating to minimization, but the modified Eq (7) includes the direction for minimization. It improves the convergence rate and thus reduces the computation time. The arrow (for representation purpose) shown in Figure 1 is in the direction of minimizing the function given by Eq (6).
On the assumption that one end point is known in the shortest path, particle swarms are placed in the search space of the other region and the other end point of the shortest path is determined. In order to evaluate the objective function in Eq (5), we need to determine one end point from region and the other from region . Two sets of particle swarms, one for each region, are placed within the search spaces of the respective regions. The objective function is evaluated based on the particles present in both the regions. In every iteration, the best position of a particle in one region is used as the known end point in the shortest path in the velocity update equation of the particles of the other region. The objective function reaches its optimal value after several generations.
In the process of optimization, some particles may often go out of the search space. To limit the particles within the search space, we inspect after every tentative position update as to whether the particle is lying within the search space by carrying out a check on the QCQP constraints. If the intended new position of a particle is going to violate the constraint, then its position is not updated. In other words, the particles likely to go out of the search space are redeployed back to their previous positions.
The proposed solution may be used in control system problems such as optimization of sensor networks or collaborative data mining, which are based on multiple agents or gradient projection . General consensus problem may be solved using the proposed method, where multiple agents need to reach a common goal [23, 24, 25]. Consensus or distributed optimization is discussed in  as ‘consensus and sharing’ using alternating direction method of multipliers (ADMoM). ADMoM uses one agent for each constraint. Such solvers are used for solving SVM in distributed format .
Table 1 presents a pseudo code for the algorithm. The parameters ,,,, and are set to fixed values and the randomly varying parameters ,,, and are updated in each iteration. The position, velocity, personal best, and global best of each particle are stored. The maximum number of iterations for the algorithm is specified as in the experiments and the size of swarm used in the algorithm is ‘10’. The proposed algorithm is implemented in MATLAB.
|Inputs: , and ; set|
|and initialize parameters|
|Outputs: Global best value|
|Function evaluation step:|
|Calculate the function for|
|Velocity update step:|
|Randomly choose values for in the range ‘0’ and ‘1’.|
|Then update the velocity of each particle as in Eq (7).|
|Position update step:|
|Add updated velocity to existing position.|
|Check constraint on the particles .|
|for = 1 to|
4 Experiments and Results
In this section, different optimization problems with quadratic constraints are solved using the proposed algorithm. The reliability of our algorithm is tested on linear and quadratic programming problems.
4.1 Linear programming with quadratic constraints
A LP problem with quadratic constraints is chosen in order to compare the convergence performance of our method with Khachiyan’s and Karmarkar’s methods. The LP problem is given below:
where X is the region, satisfying the constraint.
This LP problem is reformulated as a QCQP problem:
where vector = , = and A is the identity matrix of size 2 (positive definite).
The solution for this LP problem is with . This LP problem is solved using Khachiyan’s ellipsoid algorithm, Karmarkar’s algorithm and our method. Figure 2 shows the value of the function for each iteration for a typical independent run of the experiment with 50 iterations. The length vector in Karmarkar’s method is scaled by a variable . Two values are used for namely 0.05 and 0.50, for the evaluation of the function. As the value of increases, the algorithm reaches the optimal value of in less number of iterations. Table 2 shows the error value of function for all the algorithms after the iteration as shown in Figure 2. The error values for the ellipsoid and our methods are less than x.
|Algorithm||Ellipsoid||Karmarkar||Karmarkar||Our method||Our method|
In our algorithm, the values of the variables , and are set to . These values are small so that the vector addition to position keeps the particles within the region X. Only the value of is varied since it scales the neighboring search space of a particle. The results for two different values of are shown in Figure 2. We carried out 50 independent runs for this LP problem with two different values of
. The average, minimum, maximum and standard deviation of error value are tabulated in Table3. The error values for are less, compared to , indicating that as the neighboring search space is scaled to a higher value, the error value of the fitness function becomes better in fewer number of iterations.
4.2 Binary classification of simulated data
Several variants of PSO have been applied for classification problems . Generally PSO is deployed in rule-based classification. A suitable classifier is chosen for classification; for example, a neural network . The parameters like network weights are tuned using particle swarms. There are discrete versions of swarms, which can take a finite set of values . Here, swarms learn the rule for classifying the test samples. In our method, binary classification is posed as QCQP and swarms are used to solve this QCQP problem. The input feature space is considered as the particle swarm space.
shows a simulated data set for two classes synthesized using two Gaussian distributions with meansand the same covariance matrix . A generic decision boundary for a binary classification problem is a hypersurface. The classes in the data are linearly separable thus reducing a hypersurface to a hyperplane. The equation of a hyperplane for this kind of data is given by,
where , are weights for the individual features and is the bias of the hyperplane. This hyperplane equation is used for classification:
where and are the regions for the classes 1 and 2, respectively.
The MATLAB programming platform is used to implement and test our method and also a SVM and neural network for comparison. We trained the SVM with a linear kernel and a neural network on this synthetic data. The hyperplanes learnt by the SVM and the neural network are shown in Figure 3
. A single layer perceptron algorithm with 100 epochs is used for training the neural network and SMO algorithm is used for training the SVM with linear kernel.
We now explain the determination of the optimal hyperplane for binary classification by our method. The values of sample mean and covariance for the two classes are calculated from the simulated samples. Mahalanobis distance  is determined from the mean value of a class to the data points of the other class and the closest data point of the other class label is found. This closest point is used to fix the reference boundary of the search space (ellipsoid) of the other class. This process is repeated from the other class and the boundary of the first class is also determined. With the boundaries, ellipsoidal regions are formed with estimated means of classes as the centers. This is posed as a QCQP problem formulated in Eq (5) with the estimated covariance matrices normalized to boundary points as and . Our algorithm is implemented by placing particle swarms near the mean value of the classes and evaluating the optimization function. The shortest path and the closest point on each boundary are estimated.
Eq (15) is optimized using the proposed algorithm. The intersection between the two ellipsoidal regions is assumed to be zero to eliminate a situation where particles reach to different solutions. Once the closest points on the boundaries are determined, the perpendicular bisector of the line joining the closest points is the hyperplane. The calculated hyperplane for binary classification is shown in Figure 3. We can observe that the optimal hyperplane calculated by SMO algorithm and our method are closely placed, while other hyperplanes are not optimal. The estimated weight and bias values for each method are tabulated in Table 4. The estimated weight values can be normalized as
The normalized weights of SVM and our method are equal and they are and .
4.3 Performance on real datasets
Our algorithm is also tested on some of the datasets available from UCI ML repository . The datasets used in our experiments are Iris, Wine, Pima and Thyroid. These four datasets have been chosen based on the consideration of minimal number of datasets with the maximum coverage of the different types of attributes (namely, binary, categorical value as integer, and real values). Further, the number of classes in each case is 2 or 3. So, the maximum number of hyperplanes to be obtained for any of these datasets is limited to three. The main characteristics of these datasets are tabulated in Table 5. The cross-validation performance on these datasets is compared with those of a SVM with linear and RBF kernel, a neural network, and GSVM. GSVM  estimates a classification hyperplane similar to the proposed approach and is developed on the fundamentals of optimizing the SVM problem. GSVM modifies the bias (
) obtained from SVM by moving the hyperplane such that the geometric mean of recall and precision is improved.
|Dataset||No. of||No. of||No. of||Nature of|
|Wine||178||13||3||real & integer|
|Pima||768||8||2||real & integer|
|Thyroid||215||5||3||real & binary|
4.3.1 Iris Dataset
The Iris dataset consists of four different measurements on 150 samples of iris flower. There are 50 samples of three different species of iris forming the dataset. The features, whose values are available from the dataset, are length and width of leaves and petals of different iris plants. Out of the three species, two are not linearly separable from each other, whereas the third is linearly separable from the rest of the species. The classification task is to determine the species of the iris plant, given the 4-dimensional feature vector.
4.3.2 Wine Dataset
The Wine dataset contains the different physical and chemical properties of three different types of wines derived from three different strains of plants. Some of the physical properties such as hue and colour intensity have integer values, whereas chemical properties such as ash or phenol content have real values. The feature vector has 13 dimensions and there are a total of 178 samples. The classification task is to determine the type of wine, given the values of the content of the thirteen physical and chemical components.
4.3.3 Pima Indians Diabetes Dataset
The Pima dataset contains eight different parameters measured from 768 adult females of Pima Indian heritage. Once again, some of them are integer valued, such as age and number of pregnancies. Certain other parameters, such as serum insulin, are real valued. It is a two-class classification problem of identifying normal and diabetic subjects, given the 8-dimensional feature vector as input.
4.3.4 Thyroid Disease Dataset
This dataset contains ten distinct databases of different dimensions. The particular database chosen for our study contains 5 different parameters measured from 215 individuals. Some of the variables have binary values, while others have real values. The classification task is to assign an individual to one of 3 classes, given the 5-dimensional feature vector as input.
4.3.5 Data projection
We perform a preprocessing step on these real datasets. This step is necessary since,
The number of samples available in the datasets is also less.
To overcome these two problems, we perform the two steps given below.
Eigen value decomposition is performed on dataset covariance matrix. The dataset is projected on to these eigen vectors. Scaling is performed in such a way that each component in the new projected dataset has unit variance.
This step is performed independently on each of the subsets used in the cross-validation stage. We project the subset of samples into two dimensions. First is the direction of the vector joining the sample means of the classes. The equation for this vector is,
where is the projection vector, and are the sample means of classes-1 and 2, respectively. The second direction is that of the eigen vector corresponding to the largest eigen value of the covariance matrix of the subset.
The projected samples are used in the estimation of new sample means and covariance matrices. The estimated hyperplane is used to classify the test samples.
These datasets do not have separate training and test samples; hence we perform cross-validation. In cross-validation, a small subset is used for testing while the remaining are used as training samples. We use ten fold cross-validation: split the datasets into ten subsets and use one of them for testing and the others for training at a time and then rotate the subsets. Equation (15) is used in the estimation of the hyperplane, which in turn is used to classify the test samples. Ten trials are performed and the average cross-validation errors are reported in Table 6.
|Dataset||CV error by||CV error||CV error in||CV error in||CV error|
|Our method||in SVM with||Neural||SVM with RBF||in GSVM|
|linear kernel||Network||kernel |||
The performance of our method is close to that of the variants of SVM and is superior to that of the neural network. We notice that in Iris and Wine datasets, the cross-validation error obtained by our technique is better than those achieved by SVM with linear and RBF kernels and also the neural network. In Pima and Thyroid datasets, the cross-validation error is high, due to the high degree of correlation.
5 Conclusion and Future work
We have developed a classification method and optimization algorithm for solving QCQP problems. The novelty in this method is the application of particle swarms, an evolutionary technique, for optimization. The results indicate that our approach is a possible method in solving general QCQP problems without gradient estimation. We have shown the results of our algorithm under quadratic constraints by evaluating different optimization functions. The issue with PSO based methods is their computational complexity and the need for parameter tuning. The number of function evaluations linearly increases with the number of particles employed and the number of iterations carried out.
In future, we intend to learn multiple hyperplanes by placing multiple kernels in each class and evaluating the performance against multiple-kernel learning algorithms. The hyperplanes estimated for different kernels may reduce the cross-validation error for the Pima and Thyroid datasets.
Bishop, C.M: Pattern recognition and Machine Learning, Springer, Berlin (2006)
-  Bishop, C.M: Neural Networks for Pattern Recognition, Oxford university press (1995)
-  Bomze, I.M.: On Standard Quadratic Optimization Problems, Journal of Global Optimization 13: p.369-387 (1998)
-  Bomze, I.M., Schachinger, W.: Multi-Standard Quadratic Optimization: interior point methods and cone programming reformulation, Computation Optimization Applications, 45: p.237-256 (2010)
-  Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press. http://www.stanford.edu/ boyd/cvxbook/ (2004)
-  Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends in Machine Learning, volume 3, issue 1, p.1-122 (2010)
Cervantes, A., Galvan, I.M., Isasi, P.: A comparison between the Pittsburgh and Michigan approaches for the binary PSO algorithm, Congress on Evolutionary Computation, p.290-297 (2005)
-  Cortes, C., Vapnik, V.: Support-Vector Networks, Machine Learning, volume 20, number 3, p.273-297 (1995)
-  Dantzig, G.B.: Linear Programming, University Press, Princeton, NJ (1963)
-  Derrac, J., García, S., Herrera, F.: Fuzzy nearest neighbor algorithms: Taxonomy, experimental analysis and prospects. Information Sciences, volume 260, p.98-119 (2014) doi: http://dx.doi.org/10.1016/j.ins.2013.10.038
-  Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification, Wiley (2001)
Fernández, A., López, V., Galar, M., Jesus, M.J., Herrera, F.: Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-Based Systems, p.97-110 (2013)
-  Forero, P.A., Cano, A., Giannakis, G.B.: Consensus-based distributed support vector machines. Journal of Machine Learning Research, volume 11, p.1663-1707 (2010)
-  Galar, M., Fernández, A., Barrenechea, E., Herrera, F.: EUSBoost: Enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recognition (2013) doi: http://dx.doi.org/10.1016/j.patcog.2013.05.006
-  Gonzalez-Abril, L., Velasco, F., Angulo, C., Ortega, J.A.: A study on output normalization in multiclass SVMs. Pattern Recognition Letters, p.344-348 (2013)
-  Gonzalez-Abril, L., Nuñez, H., Angulo, C., Velasco, F.: GSVM : An SVM for handling imbalanced accuracy between classes in bi-classification problems. Applied Soft Computing, volume 17, p.23-31 (2014) doi: http://dx.doi.org/10.1016/j.asoc.2013.12.013
-  Karmarkar, N.: A new polynomial-time algorithm for linear programming. Combinatorica 4, p.373-395 (1984)
-  Kennedy, J., Eberhart, R.C.: Particle swarm optimization. Proceedings of IEEE International Conference on Neural Networks. IV. p.1942-1948 (1995)
-  Kennedy, J., Eberhart, R.C., Shi, Y.: Swarm intelligence. Morgan Kaufmann Publishers, San Francisco (2001)
-  Kennedy, J., Eberhart, R.C.: A discrete binary version of the particle swarm algorithm. Proceedings of the World Multiconference on Systems, Cybernetics and Informatics, p.4104-4109 (1997)
-  Khachiyan, L.G.: A polynomial algorithm in linear programming, Doklady Akademia Nauk SSSR 244:S p.1093-1096, translated in Soviet Mathematics Doklady 20:1, p.191-194 (1979)
-  López, V., Fernández, A., García, S., Palade, V., Herrera, F.: An Insight into Classification with Imbalanced Data: Empirical Results and Current Trends on Using Data Intrinsic Characteristics, Information Sciences (2013) doi: http://dx.doi.org/10.1016/j.ins.2013.07.007
-  Matei, I., Baras, J.S.: A performance comparison between two consensus-based distributed optimization algorithms. 3rd IFAC Workshop on Distributed Estimation and Control in Networked Systems, p.168-173, Santa Barbara, CA, USA (2012)
-  Nedi, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, volume 54, issue 1, p.48-61 (2009)
-  Nedi, A., Ozdaglar, A., Parrilo, P.A.: Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, volume 55, issue 4, p.922-938 (2010)
-  Platt, J.: Fast training of support vector machines using sequential minimal optimization. In B. Schoelkopf and C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning (1998)
-  Spadoni, M., Stefanini, L.: A differential evolution algorithm to deal with box, linear and quadratic-convex constraints for boundary optimization Journal of Global Optimization 52(1), p.171-192 (2012)
-  UC Irvine Machine Learning Repository. http://archive.ics.uci.edu/ml/
-  Zanella, F., Varagnolo, D., Cenedese, A., Pillonetto, G., Schenato, L.: Multidimensional Newton-Raphson consensus for distributed convex optimization. The 2012 American Control Conference (ACC), p.1079-1084, Montreal, QC (2012)
-  Zhan, Z-H., Zhang, J., Li, Y., Chung, H.S-H.: Adaptive particle swarm optimization, IEEE Transactions on Systems, Man, and cybernetics - Part B: Cybernetics, 39(6), p.1362-1381 (2009)