Exploring the Entire Regularization Path for the Asymmetric Cost Linear Support Vector Machine

10/12/2016 · Daniel Wesierski, et al.

We propose an algorithm for exploring the entire regularization path of asymmetric-cost linear support vector machines. Empirical evidence suggests that the predictive power of support vector machines depends on the regularization parameters of the training algorithms. Algorithms that explore the entire regularization path have been proposed for single-cost support vector machines, providing complete knowledge of the behavior of the trained model over the hyperparameter space. Considering the problem in a two-dimensional hyperparameter space enables our algorithm to maintain greater flexibility in dealing with special cases and sheds light on problems encountered by algorithms that build paths in one-dimensional spaces. We demonstrate two-dimensional regularization paths for linear support vector machines trained on synthetic and real data.


1 Introduction

Support Vector Machines (Boser et al., 1992) belong to the core machine learning techniques for binary classification. Given a large number of training samples characterized by a large number of features, a linear SVM is often the go-to approach in many applications. A handy collection of software packages, e.g., LIBLINEAR (Fan et al., 2008), Pegasos (Shalev-Shwartz et al., 2011), SVMperf (Joachims, 2006), and Scikit-learn (Pedregosa et al., 2011), provides practitioners with efficient algorithms for fitting linear models to datasets. Finding optimal hyperparameters for model selection is nonetheless crucial for good performance at test time.

A vanilla cross-validated grid search is the most common approach to choosing satisfactory hyperparameters. However, grid search scales exponentially with the number of hyperparameters, and the choice of sampling scheme over the hyperparameter space impacts model performance (Bergstra & Bengio, 2012). Linear SVMs typically require setting a single hyperparameter that uniformly weights the training loss of misclassified data. (Klatzer & Pock, 2015) propose bi-level optimization for searching over several hyperparameters of linear and kernel SVMs, and (Chu et al., 2015) use warm-start techniques to efficiently fit an SVM to large datasets, but both approaches explore the regularization hyperparameter space only partially.

The algorithm proposed in (Hastie et al., 2004) builds the entire regularization path for linear and kernel SVMs that use a single, symmetric cost for misclassifying negative and positive data. (Solution path algorithms are related to parametric programming techniques, which have traditionally been applied in optimization and control theory and have seen an independent revival in machine learning (Gartner et al., 2012).) The stability of the algorithm was improved in (Ong et al., 2010) by augmenting the search space of feasible event updates from a one- to a multi-dimensional hyperparameter space. In this paper, we also show that a one-dimensional path following method can diverge to a solution that is suboptimal with respect to the KKT conditions.

Many problems require setting multiple hyperparameters (Karasuyama et al., 2012). They arise especially when dealing with imbalanced datasets (Japkowicz & Stephen, 2002) and require training an SVM with two cost hyperparameters asymmetrically attributed to positive and negative examples. (Bach et al., 2006) build a pencil of one-dimensional regularization paths for asymmetric-cost SVMs. On the other hand, (Karasuyama et al., 2012) build a one-dimensional regularization path, but in a multidimensional hyperparameter space.

In contrast to algorithms that build one-dimensional paths in higher-dimensional hyperparameter spaces, we describe a solution path algorithm that explores the entire regularization path of asymmetric-cost linear SVMs. Hence, our path is a two-dimensional path in the two-dimensional hyperparameter space.

Our main contributions include:

  • construction of the entire regularization path for the asymmetric-cost linear support vector machine (AC-LSVM),

  • initialization of the algorithm at an arbitrary location in the hyperparameter space,

  • a computationally and memory-efficient algorithm amenable to local parallelization.

2 Problem formulation

Our binary classification task takes a fixed input set of training examples, annotated with corresponding binary labels denoting either class. The objective is then to learn a decision function that will allow its associated classifier to predict the label of a new sample at test time.

The AC-LSVM learns the parameters of the decision function by solving the following primal quadratic program (QP):

(1)
(2)

where we include the scalar-valued bias term in the weight vector and augment the data points by some constant:

(3)

where the augmentation constant is defined by the user (Hsieh, 2008). The above formulation should learn to assign scores above the margin to positive examples and below the margin to negative examples. As the data may be inseparable, the objective function (1) penalizes violations of the constraints (2) with slack variables, asymmetrically weighted by the two cost constants.
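A standard asymmetric-cost primal of this kind, with the bias absorbed into the weight vector as described above, can be written as follows (a reference form in our notation, assumed to match the intent of (1)–(3)):

```latex
\min_{\mathbf{w},\,\boldsymbol{\xi}}\;\; \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
  \;+\; C_{+}\sum_{i:\,y_i=+1}\xi_i \;+\; C_{-}\sum_{i:\,y_i=-1}\xi_i
\qquad\text{s.t.}\quad y_i\,\mathbf{w}^{\top}\tilde{\mathbf{x}}_i \ge 1-\xi_i,\;\; \xi_i \ge 0,
\qquad \tilde{\mathbf{x}}_i = [\mathbf{x}_i^{\top},\, B]^{\top},
```

with B the user-defined augmentation constant.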

Active sets

Solving the primal QP (1) is often approached with the help of Lagrange multipliers, one associated with each constraint in (2). The dual problem then takes the familiar form:

(4)
(5)
(6)

An immediate consequence of applying the Lagrange multipliers is an expression for the LSVM parameters in terms of the multipliers, yielding the decision function.
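Under the same assumed notation, and with the bias folded into the weight vector so that no equality constraint appears, the dual takes the usual box-constrained form, with the upper bounds set asymmetrically by label:

```latex
\max_{\boldsymbol{\alpha}}\;\; \sum_{i=1}^{n}\alpha_i
  \;-\; \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j\,y_i y_j\,\tilde{\mathbf{x}}_i^{\top}\tilde{\mathbf{x}}_j
\qquad\text{s.t.}\quad 0 \le \alpha_i \le C_{+}\;(y_i=+1),\qquad 0 \le \alpha_i \le C_{-}\;(y_i=-1),
```

with the weights recovered as \(\mathbf{w} = \sum_i \alpha_i y_i \tilde{\mathbf{x}}_i\).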

The optimal solution of the dual problem is characterized by the usual Karush-Kuhn-Tucker (KKT) conditions. Notably, the KKT conditions can be algebraically rearranged to give rise to the following active sets:

(7)
(8)
(9)

Firstly, the sets (7)–(9) cluster the data points into those on the margin, to the left of the margin, and to the right of the margin, along with their associated scores. Secondly, the sets indicate the range within the hyperparameter space over which the Lagrange multipliers are allowed to vary, thereby giving rise to a convex polytope in that space.
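Assuming the convention of (Hastie et al., 2004) for (7)–(9), the three active sets would be the margin, left, and right sets:

```latex
\mathcal{M} = \{\, i : y_i\,\mathbf{w}^{\top}\tilde{\mathbf{x}}_i = 1,\; 0 \le \alpha_i \le C_{y_i} \,\},\qquad
\mathcal{L} = \{\, i : y_i\,\mathbf{w}^{\top}\tilde{\mathbf{x}}_i < 1,\; \alpha_i = C_{y_i} \,\},\qquad
\mathcal{R} = \{\, i : y_i\,\mathbf{w}^{\top}\tilde{\mathbf{x}}_i > 1,\; \alpha_i = 0 \,\},
```

where \(C_{y_i}\) denotes \(C_{+}\) or \(C_{-}\) according to the label of point \(i\).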

Convex polytope

A unique region of the hyperparameter space satisfying a particular configuration of the sets is bounded by a convex polytope. The first task in path exploration is thus to obtain the boundaries of this convex polytope. Following (Hastie et al., 2004), we obtain linear inequality constraints from (7)–(9) (a similar derivation appeared in (Bach et al., 2006)):

(10)
(11)
(12)
(13)
(14)

where the first matrix is the orthogonal projector onto the orthogonal complement of the spanned subspace, and the second is the Moore-Penrose pseudoinverse, which applies when the corresponding matrix has full column rank.

Specifically, we collect the constraints (10)–(14) into a single matrix:

(15)

Then, the boundaries of the convex polytope in the hyperparameter space are indicated by the subset of active constraints in this matrix, i.e., those that hold with equality at some point of the region. The boundaries can be determined in linear time with efficient convex hull (CH) routines (Avis et al., 1997).
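As an illustration only, a minimal sketch of identifying the bounding constraints is given below. It replaces the convex hull routine cited above with a simpler redundancy test based on small linear programs; the matrix Z and its row encoding [a, b, c], meaning a*Cp + b*Cn + c <= 0, are assumptions made for the sketch.

```python
import numpy as np
from scipy.optimize import linprog

def facet_constraints(Z, tol=1e-9):
    """Return indices of non-redundant rows of Z.

    Each row [a, b, c] of Z encodes the half-plane a*Cp + b*Cn + c <= 0
    in the (Cp, Cn) hyperparameter plane (assumed encoding).  A row is a
    facet (an edge of the convex polytope) iff dropping it enlarges the
    feasible region, which is tested here with one small LP per row.
    """
    A, rhs = Z[:, :2], -Z[:, 2]
    keep = []
    for i in range(len(Z)):
        mask = np.arange(len(Z)) != i
        # maximize A[i] @ z subject to the remaining constraints and Cp, Cn >= 0
        res = linprog(-A[i], A_ub=A[mask], b_ub=rhs[mask],
                      bounds=[(0, None), (0, None)], method="highs")
        if res.status == 3 or (res.status == 0 and -res.fun > rhs[i] + tol):
            keep.append(i)   # constraint can be violated without row i -> it bounds the polytope
    return keep
```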

Now, in order to grow the entire regularization path, the sets have to be updated at each step such that the KKT conditions continue to hold, thereby determining the next convex polytope. An active polytope constraint of the first type indicates that a point has to move from the margin set to one of the other sets in order to satisfy the KKT conditions; the remaining constraint types likewise indicate the converse transitions between the margin set and the left and right sets. These set transitions are usually called events, while the activated constraints are called breakpoints.

Therefore, at a breakpoint, we determine the event for the corresponding point by a function that updates the current set configuration to the next one as:

(16)

where the direction of the transition depends on the current set configuration.

Following (Hastie et al., 2004), our algorithm requires the current set configuration in order to proceed to the next one by (16). However, unlike (Hastie et al., 2004), the constraints (10)–(14) are independent of previously computed parameters. This has several implications. Firstly, our algorithm does not accumulate potential numerical errors in these parameters. Secondly, the algorithm can be initialized from an arbitrary location in the hyperparameter space.
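A minimal sketch of the set update (16), assuming the three sets are the margin, left, and right sets and that each single event moves exactly one point index between the margin set and one of the other two (the function name and event encoding are illustrative):

```python
def update_sets(sets, idx, event):
    """Apply one event to the active-set configuration (a sketch of (16)).

    sets  : dict with keys 'M', 'L', 'R' holding disjoint sets of point indices
            (assumed to correspond to the margin, left and right sets).
    idx   : index of the point whose constraint became active (the breakpoint).
    event : one of 'M->L', 'M->R', 'L->M', 'R->M'.
    """
    src, dst = event.split("->")
    sets[src].remove(idx)
    sets[dst].add(idx)
    return sets

# example: point 3 leaves the margin set and hits its upper bound
sets = {"M": {1, 3}, "L": {0}, "R": {2, 4}}
update_sets(sets, 3, "M->L")
```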

3 Proposed method

The evolution of the Lagrange multipliers is continuous and piecewise linear in the hyperparameter space (Bach et al., 2006). An immediate consequence is that the active constraints have to flip during the set update (16).

Flipping constraints

Suppose a single event occurs in which a point moves from one set to another, forcing the corresponding set update rule. Then, the constraint associated with this point in (10) can be rearranged using the matrix inversion lemma as:

(17)

This constraint is equal to its corresponding, sign-flipped counterpart in (13) at the breakpoint:

(18)

The same argument holds for the opposite update type.

Furthermore, (Hastie et al., 2004) express the evolution of the solution with a single cost parameter. This case is equivalent to setting both costs equal, which yields the identity line in the hyperparameter space. (Ong et al., 2010) observe that one-dimensional path exploration over a line can lead to incorrect results and resort to searching for alternative set updates over a higher-dimensional hyperparameter space. Notably, when two points hit the margin at the same time, the matrix updated by both points does not necessarily become singular. However, the sets can be incorrectly updated. We formalize this by introducing the notion of joint events that may co-occur at some point on the line. In our setting of 2D path exploration, this is always the case when a vertex of a polytope coincides with the line in the hyperparameter space.

Joint events

At the vertices of the convex polytope, at least two events occur concurrently. In this case, the set configuration can be updated twice. Hence, a vertex calls for different updates of the set configuration, i.e., two single updates corresponding to the two edges and one joint update.

Note that the piecewise continuous 2D path of the solution also implies piecewise continuous 1D paths of the events. Moreover, as each vertex is surrounded by different set configurations, two events at the vertex have to satisfy the following vertex loop property:

(19)

stating that the events have to flip at the vertex such that a sequence of single updates reaches each set configuration from any other set configuration associated with that vertex.

3.1 AC-LSVMPath algorithm

We now describe our algorithm. We represent the entire regularization path of the AC-LSVM by sets of vertices, edges, and facets. Then:

(20)
(21)
(22)

where each element in these sets is described by an attribute and a connectivity.

Ordering

The sets admit the following connectivity structure. Let the vertices, edges, and facets be partitioned into subsets that admit a sequential ordering, such that edges determine the adjacency of facet pairs (this is also known as the facet-to-facet property in the parametric programming literature (Spjotvold, 2008)), while vertices determine the intersection of edges. In effect, our algorithm orders facets into a layer-like structure.

We define a vertex as an open vertex when its edge connectivity is not yet complete and it does not lie on either axis; otherwise, the vertex is closed. A vertex lying on either axis is closed with a smaller number of incident edges. Similarly, an edge is called an open edge when its vertex connectivity is incomplete and a closed edge otherwise. Then, a facet is called an open facet when the first and last edge in its connectivity are unequal; otherwise, it is a closed facet. Finally, vertices, edges, and facets are called single when they are unique and replicated otherwise.
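One possible concrete representation of the vertex, edge, and facet records in (20)–(22), with attribute and connectivity fields as described above (field names are illustrative, not the paper's):

```python
from dataclasses import dataclass, field

@dataclass
class Facet:
    attr: dict                                     # set configuration valid inside the facet
    edges: list = field(default_factory=list)      # facet-edge connectivity

@dataclass
class Edge:
    attr: tuple                                    # event (point index, transition type) along the edge
    vertices: list = field(default_factory=list)   # edge-vertex connectivity
    facets: list = field(default_factory=list)     # the facets the edge separates

@dataclass
class Vertex:
    attr: tuple                                    # location in (Cp, Cn) and/or the joint event
    edges: list = field(default_factory=list)      # vertex-edge connectivity

# the path is explored layer by layer
layers = []   # layers[k] = {"facets": [...], "edges": [...], "vertices": [...]}
```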

We propose to explore the AC-LSVM regularization path in a sequence of layers. We require that the facet attributes of the initial layer are given at the beginning. Each layer is then composed of four successive steps.

1. Closing open edges and facets - CEF

For each set configuration attributed to an open facet, the algorithm separately calls a convex hull routine CH. The routine uses (15) to compute the linear inequality constraints creating a convex polytope for that facet. The ordered set of edges of the facet, where the first and last edge are open, serves as the initial, mandatory constraints in the CH routine. After completion, the routine augments the edge set with closed edges and the vertex set with open vertices.

2. Merging closed edges and open vertices - MEV

As the CH routine runs for each facet separately, some edges and/or vertices may be replicated.

Notably, a vertex is replicated when another vertex (or other vertices) has the same attribute. However, we argue that merging vertices into a single vertex based on the distance between them in some metric space may affect the numerical stability of the algorithm.

On the other hand, a closed edge is replicated by another closed edge when both edges connect a pair of vertices that are both replicated. Replicated edges cannot be merged solely by comparing their event attributes: as the event paths are piecewise continuous in the hyperparameter space, the attributes are not unique. Similarly to vertices, though, the edges might be merged by numerically comparing their associated linear constraints, which are only sign-flipped versions of each other, as shown in (17)–(18). However, this again raises concerns about the potential numerical instability of such a merging procedure.

In view of this, we propose a sequential merging procedure that leverages the set configurations, which are both unique and discrete. To this end, we first introduce two functions that act on the attributes and connectivity of objects.

The first is an indexing function that groups objects by assigning them labels based on their attributes:

(23)

The second is a relabeling function that assigns new labels to previously labeled objects:

(24)

The algorithm commences the merging procedure by populating an initially empty set with facets that are obtained by separately updating (16) the facets of the preceding layer through the events attributed to each edge. Note, however, that replicated edges will produce facet attributes that replicate facet attributes from the preceding layer. Moreover, single edges may also produce replicated facet attributes in the current layer.

In order to group facets into single and replicated ones, the algorithm indexes the facet attributes based on their equality. Then, relabeling the facet-edge connectivities of edges allows for indexing these connectivities, also based on their equality. Having indicated single and replicated edges, the algorithm relabels the edge-vertex connectivities of vertices.
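Reading (23)–(24) as a grouping step and a relabeling step, a rough sketch under that reading (function names are illustrative) could look as follows:

```python
def index_by_equality(attrs):
    """Assign the same label to equal attributes (sketch of (23)).

    attrs : list of hashable attributes (e.g. frozen set configurations).
    Returns a list of integer labels, one per object.
    """
    label_of = {}
    labels = []
    for a in attrs:
        if a not in label_of:
            label_of[a] = len(label_of)
        labels.append(label_of[a])
    return labels

def relabel(connectivity, label_map):
    """Rewrite connectivity lists through a label map (sketch of (24))."""
    return [[label_map[j] for j in conn] for conn in connectivity]

# facets with equal set configurations receive the same label and can be merged
facet_labels = index_by_equality([frozenset({1, 3}), frozenset({2}), frozenset({1, 3})])
# -> [0, 1, 0]: the first and third facet attributes are replicated
```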

Note that there are two general vertex replication schemes. Vertices, which indicate the occurrence of joint events, can be replicated when two facets connect (i) through an edge (i.e., the vertices share a replicated edge) or (ii) through a vertex (i.e., the vertices connect only to their respective single edges). At this point of the merging procedure, a vertex is associated indirectly through edges with two facets when it lies on either axis, or with three facets otherwise.

Two vertices lying on, say, one axis are replicated when their respective edges share a facet and are attributed events that refer to the same negative point yet have opposite event types. This condition follows directly from (10)–(13). Conversely, when vertices lie on the other axis, they are replicated when their edges have events referring to the same positive point. Then, two vertices lying on neither axis are replicated when their respective edges are associated with two common facets and equal joint events. Hence, vertices are indexed based on the equality of their edge-facet connectivities along with the edge attributes. Alternatively, should joint events be unique, the vertices could be merged solely by comparing these events. Showing whether joint events are unique, i.e., whether two 1D event paths can intersect only at a single point in the entire space, is interesting future work.

Having grouped facets, edges, and vertices, the algorithm can now merge the facet-edge connectivities of replicated facets, prune replicated edges, and merge the vertex-edge connectivities of replicated vertices.

Left with only single facets, edges, and vertices, the algorithm relabels the edge-vertex connectivities of single open edges and single closed edges that intersect them. Finally, the algorithm relabels the facet-edge connectivities of facets from the preceding and current layers.

3. Closing open vertices - CV

In this step, the algorithm closes the vertices by attaching open edges. Specifically, by exploiting the piecewise continuity of events at vertices, the algorithm populates the edge set with open edges, such that a vertex now connects either to (i) its required number of edges when it lies on one of the axes and connects to an event edge with an associated point of the corresponding class, or to (ii) a larger number of edges when it lies on neither axis.

Using the vertex loop property (19), the algorithm then augments the facet set with additional facets, such that the closed vertex now connects indirectly through its edges to the existing facets and additionally to up to one new facet.

There are several advantages to generating open edges. Firstly, augmenting the initialization list of edges during the call to the CH routine reduces the number of points to process and thus the computational load. Secondly, each vertex generates up to two single open edges. However, two single vertices can generate the same open edge, thereby merging the 1D path of an event. In this case, both open edges are merged into a single closed edge and the facet is closed without processing it with the CH routine. This merging step is described next.

4. Merging open edges and facets - MEF

As the open edges and their facets generated in step 3 can also be single or replicated, step 4 proceeds similarly to step 2.

The algorithm indexes the additional facets and relabels the open edge connectivities. Then, the algorithm indexes these connectivities and merges the edge-vertex and facet-edge connectivities. Finally, the algorithm relabels the facet-edge connectivity of all facets in the current layer and returns to step 1.

Termination

The algorithm terminates at the layer in which the CH routine produces open polytopes in the hyperparameter space for the set configurations of all remaining facets.

Special cases

As mentioned in (Hastie et al., 2004), two special-case events may occur after closing a facet and updating the sets by (16). When (i) replicated data points exist in the dataset and enter the margin, or (ii) several single points project onto the margin simultaneously, the relevant matrix becomes singular and thus not invertible, yielding non-unique paths for some Lagrange multipliers. In contrast to (Hastie et al., 2004), note that case (ii) is likely to occur in the considered LSVM formulation (1)–(3), as the positive and negative data points span a subspace of bounded dimension after being affine transformed.

In the context of our algorithm, both cases (i)–(ii) are detected when the matrix formed from the constraints (10)–(13) associated with these points becomes degenerate, either producing multiple events at an edge, denoted by constraints that are identical up to a positive scale factor, or producing multiple joint events at a vertex, denoted by constraints that intersect at the same point.
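A simple way to detect constraints that coincide up to a positive scale factor, which signals the edge case above, is to normalize the constraint rows and compare them; a sketch under the same assumed row encoding of the constraint matrix:

```python
import numpy as np

def duplicated_up_to_positive_scale(Z, tol=1e-8):
    """Return pairs (i, j) of rows of Z that are equal up to a positive factor.

    Such pairs signal the special cases: replicated points entering the margin,
    or distinct points projecting onto the margin simultaneously.
    """
    norms = np.linalg.norm(Z, axis=1)
    U = Z / norms[:, None]                          # scale-invariant representatives
    pairs = []
    for i in range(len(U)):
        for j in range(i + 1, len(U)):
            if np.linalg.norm(U[i] - U[j]) < tol:   # same direction and positive scale
                pairs.append((i, j))
    return pairs
```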

We propose the following procedure for handling both special cases. Namely, when some facets close with edges having multiple events or with vertices having multiple joint events that would lead to cases (i)–(ii), the algorithm moves to step 2, as it can still obtain facet updates in these special cases. However, it skips step 3 for these particular facets. While we empirically observed that such vertices close with edges having multiple joint events, it is an open issue how to generate open edges in this case. Instead, during successive layers, step 2 augments the list of facets, edges, and vertices with the ones associated with (i)–(ii), indexing and relabeling them with respect to successive ones that become replicated in further layers. In effect, our algorithm 'goes around' these special-case facets and attempts to close them by computing adjacent facets. However, the path of the Lagrange multipliers is not unique in these cases and remains unexplored. Nevertheless, our experiments suggest that the unexplored regions occupy a negligibly small area of the hyperparameter space.

When the algorithm starts with all points in one set and either case (i) or (ii) occurs at the initial layers, the exploration of the path may halt due to the piecewise continuity of the (multiple) events. (The path exploration halts when these cases occur for the data points that are the first to enter the margin; and, more generally, when the 1D multiple-event paths referring to these cases go to both axes, instead of to one axis and to infinity.) A workaround can then be to run a regular LSVM solver at a yet unexplored point, obtain the sets, and extract the convex polytope to restart the algorithm.

Our future work will focus on improving our tactics for the special cases. We posit that one worthy challenge in this regard is to efficiently build the entire regularization path in a higher-dimensional hyperparameter space.

Computational complexity

Let the average size of the margin set and the average number of facets per layer be given. Then, the complexity of our algorithm is governed by the cost of solving (10) (without inverse updating/downdating (Hastie et al., 2004)), where we hide a constant factor related to the convex hull computation. Typically, however, the margin set is much smaller than the dataset. In addition, we empirically observed that the number of layers approximates the dataset size (but cf. (Gartner et al., 2012)). Our algorithm is sequential over layers but parallel within a layer. Therefore, the complexity of a parallel implementation of the algorithm can drop accordingly. Finally, at each facet it is necessary to evaluate (10), but the evaluation of the constraints (11)–(14) can then be computed in parallel as well. While this would further reduce the computational burden, memory transfer remains the main bottleneck on modern computer architectures.

Our algorithm partitions the sets of vertices, edges, and facets into a layer-like structure such that our two-step merging procedure requires access to objects only from consecutive layer pairs and not from earlier layers. (When the algorithm encounters special cases, it requires access to the objects related to these cases even after several layers, but the number of such objects is typically small.) In effect, the algorithm only requires enough memory to cache the sets of two consecutive layers, whose size is governed by the average edge and vertex subset sizes per layer.

Figure 1: Visualization of the entire regularization path for the AC-LSVM. Experiments (i)–(iv) are shown in counterclockwise order. In (i) we show a portion of the entire regularization path, where red dots indicate facet means. In (ii) we show intertwined layers of facets up to some layer (blue and green) and 1D event paths of several points (cyan and red denote the two event types). In (iii) and (iv) we show the entire regularization path.

Figure 2: Visualization of the entire regularization paths for several Lagrange multipliers for experiment (iv).

4 Numerical experiments

In this section, we evaluate our AC-LSVMPath algorithm described in Section 3. We conduct three numerical experiments exploring the two-dimensional path of asymmetric-cost LSVMs on synthetic data. We generate samples from a Gaussian distribution for (i) a small dataset with a large number of features, (ii) a large dataset with a small number of features, and (iii) a moderate-size dataset with a moderate number of features.

We also build the two-dimensional regularization path when the input features are sparse (iv). We use an off-the-shelf algorithm for training a flexible mixtures-of-parts model (Yang & Ramanan, 2013), which uses positive examples from the Parse dataset and negative examples from the INRIA Person dataset (Dalal & Triggs, 2005). The model is iteratively trained with hundreds of positive examples and millions of hard-mined negative examples. We keep the original settings. The cost hyperparameters are set asymmetrically to compensate for imbalanced training (Akbani et al., 2004).

For experiments (i)–(iv), the dataset sizes and numbers of features are chosen accordingly, and one setting is kept fixed across all experiments, as in (Yang & Ramanan, 2013). The results are shown in Fig. 1 and Fig. 2.

5 Conclusions

This work proposed an algorithm that explores the entire regularization path of asymmetric-cost linear support vector machines. Events in which data points concurrently project onto the margin are usually treated as special cases when building one-dimensional regularization paths, while they happen repeatedly in the two-dimensional setting. To address this, we introduced the notion of joint events and illustrated the set update scheme with the vertex loop property to efficiently exploit their occurrence during our iterative path exploration. As we structure the path into successive layers of sets, our algorithm has modest memory requirements and can be locally parallelized at each layer of the regularization path. Finally, we posit that extending our algorithm to the entire higher-dimensional regularization path would facilitate the handling of further special cases.

References

Japkowicz, N., Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429-449

Bergstra, J., Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1), 281-305

Bach, F. R., Heckerman, D., Horvitz, E. (2006). Considering cost asymmetry in learning classifiers. The Journal of Machine Learning Research, 1713-1741

Hastie, T., Rosset, S., Tibshirani, R., Zhu, J. (2004). The entire regularization path for the support vector machine. The Journal of Machine Learning Research, 1391-1415

Hsieh, C. J., Chang, K. W., Lin, C. J., Keerthi, S. S., Sundararajan, S. (2008). A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine learning, 408-415

Spjotvold, J. (2008). Parametric programming in control theory. PhD Thesis

Fan, R. E., Chang, K. W., Hsieh, C. J., Wang, X. R., Lin, C. J. (2008). LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9, 1871-1874

Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A. (2011). Pegasos: Primal estimated sub-gradient solver for SVM. Mathematical Programming, 127(1), 3-30

Joachims, T. (2006). Training linear SVMs in linear time. In ACM SIGKDD International Conference on Knowledge Discovery and Data mining, 217-226

Gartner, B., Jaggi, M., Maria, C. (2012). An exponential lower bound on the complexity of regularization paths. Journal of Computational Geometry, 3(1), 168-195

Karasuyama, M., Harada, N., Sugiyama, M., Takeuchi, I. (2012). Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine learning, 88(3), 297-330

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al., Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830

Klatzer, T., Pock, T. (2015). Continuous Hyper-parameter Learning for Support Vector Machines. In Computer Vision Winter Workshop

Chu, B. Y., Ho, C. H., Tsai, C. H., Lin, C. Y., Lin, C. J. (2015). Warm Start for Parameter Selection of Linear Classifiers. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 149-158

Snoek, J., Larochelle, H., Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, 2951-2959

Ong, C. J., Shao, S., Yang, J. (2010). An improved algorithm for the solution of the regularization path of support vector machine. IEEE Transactions on Neural Networks, 21(3), 451-462

Akbani, R., Kwek, S., Japkowicz, N. (2004). Applying support vector machines to imbalanced datasets. In Machine Learning: ECML, 39-50

Dalal, N., Triggs, B. (2005). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 886-893

Boser, B. E., Guyon, I. M., Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, 144-152

Avis, D., Bremmer, D., Seidel, R. (1997). How good are convex hull algorithms?. Computational Geometry, 7(5), 265-301