under the banner of relevance reasoning. However, the better part of recent work has focused exclusively on feature selection [3, 4]. With increased processing power, training run time is feasible even for datasets once considered large. Additionally, dimensionality (dimensions) dominates dataset size (numOfPoints) in the algorithmic complexities of learning algorithms. In the training phase, fewer data points mean weaker generalization guarantees. However, as we move into the era of big data, even the fastest classification algorithms take infeasible time to train models. When data sources are abundant, it is fitting to separate data based on relevance to the learning task. This has led to a renewed interest in the once famous problem statement of relevance reasoning [5, 6]. Reasoning on relevance to improve the scalability of classification algorithms is currently explored on graphical/network data, and learned models.
One research area where training set selection has received attention is support vector machines (SVM). Generally, these selection methods can be divided into two types. Methods of the first type aim to modify the SVM formulation so that it can be applied to large datasets. Many approaches have worked successfully in the past, including sequential minimal optimization (SMO) [9, 10], and genetic programming. Methods of the first type, however, do not benefit from reducing the size of training data because they only deal with data handling. Reduction of data size is the quintessential advantage of the second kind of methods [12, 13, 14, 15, 16, 17, 18, 19, 20]. These methods focus on segregating the data points by relevance to the classification task. Computation time is reduced by actively reducing numOfPoints. During the mid-2000s, the central theme of research was handling data rather than reducing it. Despite the apparent advantage of data reduction, researchers devoted more effort to methods of the first type [9, 10, 11, 21] than to the latter type. For instance, the formulations implemented in LIBSVM, which is a widely used benchmark library for SVM methods, are of the first type.
The few existing works of the second type are limited in one respect or another. For instance, the clustering based SVM (CB-SVM) work by Yang et al., and others [13, 12], which resulted in huge speed-ups, are limited to linear kernels only. A geometric approach of minimum enclosing ball by Cervantes et al. requires two stages of SVM training. A similar method by Li et al. suffers from a random selection of data: they report 92.0% prediction accuracy for a separable dataset on which LIBSVM achieves 99.9%.
The presented approach also falls under the second type. As will be shown, the presented selection scheme is deterministic in prediction accuracy. A model trained with the presented heuristic achieves prediction accuracy close to or better than that of full data training. One recurring problem with the second type of methods is the poor scalability of space searching [15, 16, 17, 18, 19]. It becomes worse for high dimensional data, where the heuristic takes more time than training itself. This issue is addressed by the use of state-of-the-art approximate nearest neighbor (ANN) methods, which are highly scalable.
In LIBSVM, the SMO decomposition method of Fan et al. is available for the classification task. Training set selection compares in principle to working set selection (WSS) in the context of SMO or similar decomposition methods. Furthermore, a shrinking technique is available for these formulations to remove bounded components during iterations, effectively reducing the optimization problem. As will be shown, data selection in the presented heuristic only depends on the underlying classification patterns, giving it an essential advantage of generic applicability to the majority of classification algorithms, including the SMO formulations of LIBSVM. For these reasons, the state-of-the-art shrinking heuristic of LIBSVM is compared to the presented heuristic.
The presented heuristic augments clustering based approaches [20, 13, 12] by constructing an (approximate) information graph out of the clustered data. This graph acts as a proxy for reducing the training set. A novel edge weight scheme captures the underlying classification patterns in the graph. The graph is then pruned via filtering on the edge weights to select a relevant dataset that can be used for the training task. Furthermore, a graph coarsening approach is presented to break the selected/reduced set into further partitions that are independently available for training, leading to an approximate learning scheme. Both the methods lead to reduction in the number of training data points, which reduces complexity of the training algorithm, giving performance advantages.
Most of the existing methods of both types are limited to the SVM class of algorithms [12, 13, 14, 15, 16, 17, 18, 19, 20, 9, 10, 11, 21]. Generic applicability to a majority of classification algorithms is another advantage of the presented heuristic. It allows the heuristic to be used as a pre-processing tool, separate from the classification algorithm. Since the data points are selected based on their relevance to the classification task, the resultant reduced training set is much more balanced in size across the target classes. In other words, the formulation addresses the problem of class imbalance, which is a topic of current research in big data.
The remainder of this paper is organized as follows. The ‘Methodology’ section describes the formulation of the proposed heuristic in detail. The heuristic is evaluated with a number of tests and datasets in the ‘Results’ section. Finally, the ‘Conclusions’ section summarizes concluding remarks and ideas for future work on the heuristic formulation.
The heuristic procedure divides naturally into the following steps:
Clustering step
Graph knitting scheme
Graph shedding scheme
Graph clubbing scheme
Pre-processor for the testing phase
The training phase proceeds after the first four steps, whereas the testing phase follows the last step of the heuristic formulation. The clustering step is used to get a computationally feasible resolution of the underlying data. A weighted graph is constructed next in the graph knitting scheme, using a three-part algorithm and an edge weight scheme that captures the classification patterns completely. Nodes of this graph that are significant with respect to the classification task are determined in the graph shedding step. Finally, the graph clubbing step divides these nodes into partitions that can be trained independently, using a directional aspect of the graph coarsening objective achieved via another three-part algorithm. Because the multiple data partitions translate into as many classifiers, there is a need to determine which classifier to choose for testing a data point. This is achieved by the pre-processor for the testing phase. Lastly, a network application is designed which distributes the obtained partitions in a load-balanced and communication-free manner.
From a computational point of view, the run-time profile of the first four steps for a typical run case (numOfPoints = 100k, dimensions = 2, 350 clusters) is:
Clustering step - 750 ms or 98%
Graph knitting scheme - 10 ms or 1%
Graph shedding, and graph clubbing schemes - 2 ms (under 1%)
On the other hand, the pre-processor for the testing phase takes 5% of the testing phase time. The clustering step dominates the subsequent steps, with a run-time complexity in the order of the number of data points (numOfPoints), whereas that of all the other steps is in the order of the number of clusters (numOfClusters). The input parameter to the clustering step, numOfClusters, controls the granularity of data representation. Typically, the ratio of numOfPoints to numOfClusters, which is also known as the nominal VC dimension, is 10 to 300, explaining the run-time dominance of the clustering step. Among the subsequent steps, the graph knitting scheme is the most computation-intensive because it involves heavy space searching. Note that the run-time percentage profile can vary considerably depending on the nominal VC dimension or the input parameter numOfClusters.
The clustering step is used to lower the resolution of the underlying data. In principle, the presented heuristic does not require this step. However, it is not computationally feasible to execute the subsequent steps with a run-time complexity in the order of numOfPoints instead of numOfClusters. Additionally, skipping this step would affect the generic applicability of the heuristic, which is discussed later in the ‘Graph shedding scheme’ step.
The step consists of a standard K-means++ clustering algorithm, and a metric to store classification patterns of the original data. K-means++ provides improved initial seeding of clusters over the traditional K-means method. This choice allows the clustering algorithm to be run for a nominal number of iterations, typically 5. In the current implementation, every cluster center maintains the target class through a weighted average calculated over all its data points. However, advanced metrics can be constructed to unearth more characteristics of patterns from the clustered data.
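As an illustration of the class bookkeeping described above, the following Python sketch averages the labels of the points assigned to each cluster. It is a minimal sketch under assumptions: the function name `cluster_target_classes`, the toy data, and the use of equal weights in the average are hypothetical and are not taken from the authors' C/C++ implementation.

```python
from collections import defaultdict

def cluster_target_classes(assignments, labels):
    """Average the class labels of the points assigned to each cluster."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cluster_id, label in zip(assignments, labels):
        sums[cluster_id] += label
        counts[cluster_id] += 1
    # Equal weights are an assumption; any per-point weighting could be used.
    return {cid: sums[cid] / counts[cid] for cid in sums}

# Hypothetical toy data: six points, two clusters, class labels in {1, 2}.
assignments = [0, 0, 0, 1, 1, 1]
labels = [1, 1, 2, 2, 2, 2]
targets = cluster_target_classes(assignments, labels)
```

Under this assumption, a center whose value lies strictly between the two class labels represents a mixed cluster, whereas a value equal to a label represents a pure one.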
Although this step serves only for coarsening the data representation, it dominates the computation cost of the first four steps. Two state-of-the-art K-means++ implementations were tested: K-MeansRex, and the scalable mlpack package. For a test run with the data points in the range of 1 to 100K ( = 2, = 100), K-means++ from mlpack was times faster on average in execution than K-MeansRex’s implementation. Therefore, the K-means++ from mlpack is chosen as the standard clustering algorithm in this work.
Given the vast research literature available for clustering methods, there are other implementations available. One such improvement would be the scalable K-means++ by Bahmani et al., which is shown to be considerably faster than native K-means++. Another practical option is to exploit K-means implementations from a proven distributed computing platform.
Graph knitting scheme
From this step onwards, the presented heuristic departs from most of the existing geometric approaches of the second type, primarily because of the choice of a graph to represent the classification dataset, and the use of seminal works in neighbor searching methods. First, the choice of a weighted graph opens the possibility of using well researched work on graph coarsening, which is the foundation of the graph clubbing step. Second, most of the existing approaches could not benefit from seminal works in neighbor searching methods, which have contributed profoundly to the success of computer vision. The fast library for approximate nearest neighbors (FLANN) search engine is used in this work.
An information graph is constructed once a reasonable representation of the underlying data is obtained. The two major challenges are the determination of neighbors, and capturing the classification patterns in edge weights. First, neighbors are determined such that the whole hypothesis is covered while passively enforcing regularity and planarity in the graph. The neighbors are determined in two steps, superficial search and exclusive search, presented in Part I and Part II of Algorithm 1, respectively. Part III of the algorithm controls skewness of the graph. Second, a two-fold pattern capturing edge weight scheme is presented in Eq. 2.
The first challenge of neighbor determination is addressed in two stages, superficial neighbor search (SNS) and exclusive neighbor search (ENS), presented in Part I and Part II, respectively. In the algorithm, the number of neighbors is nominally controlled by an input parameter , the number of desired neighbors. For every node, variable in Line 1 of Part I is used to track the size of the neighbor list.
However, neighbor list for the node can be terminated before adding neighbors by updating the variable to TRUE in Line 4 of Part I. By limiting the number of neighbors in this way, construction of the graph can be controlled.
Part I is used to look for nearest neighbors, regardless of the target class. The algorithm takes as input an empty graph, , which is formed with the set of the cluster centers, , after the clustering step. This set is the search space passed to the FLANN space indexing utility in Lines 8 and 9. The input constant is explained later in Eq. 1. The objective of this part is to fill the graph, , with a set of edges. This is similar to the construction of a K-nearest neighbor graph (KNNG). However, an input parameter, MAX_SAME_CLASS_NEIGH, is used to limit the number of same class neighbors in Line 13, so that the remainder of the edges for the node, , are constructed with nodes of the opposite target class. The set of nodes for which all neighbors are found will vary depending on the characteristics of the data; these nodes are excluded from the next parts. An additional quantity, reach, is computed in Line 15 and maintained for every node. The metric presented in Eq. 1 is similar to the Hausdorff distance. The distance utility of FLANN, in Line 10, is used to compute the summation in Eq. 1, whereas a scaling constant () controls the reach according to
where r is the scaling constant, r_i is the reach of node i, x_i is the position of node i, x_j is the position of node j, and n_i denotes the set of same class neighbors of node i.
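The superficial neighbor search can be illustrated with a small Python sketch. It is a sketch under stated assumptions, not the authors' Algorithm 1: an exact brute-force search stands in for FLANN, the names `superficial_neighbor_search`, `num_neigh`, and `max_same_class` are hypothetical, and the reach is assumed to be the scaling constant times the plain sum of distances to the same class neighbors, since the exact form of Eq. 1 is not reproduced here.

```python
import math

def superficial_neighbor_search(points, classes, num_neigh, max_same_class, r):
    """Brute-force stand-in for the FLANN-based search of Part I."""
    neighbors, reach = {}, {}
    for i, xi in enumerate(points):
        # Exact search: sort all other nodes by distance to node i.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(xi, points[j]))
        chosen, same = [], []
        for j in others[:num_neigh]:
            if classes[j] == classes[i]:
                if len(same) >= max_same_class:
                    continue  # cap reached: slot is left for an opposite class node
                same.append(j)
            chosen.append(j)
        neighbors[i] = chosen
        # Assumed form of the reach: r times the summed same class distances.
        reach[i] = r * sum(math.dist(xi, points[j]) for j in same)
    return neighbors, reach

# Hypothetical toy data: three class 1 nodes on a line and one class 2 node.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
classes = [1, 1, 1, 2]
neighbors, reach = superficial_neighbor_search(
    points, classes, num_neigh=3, max_same_class=1, r=1.0)
```

In this toy case node 0 keeps one same class neighbor and one opposite class neighbor, and is left with an unfilled slot that an exclusive search over the opposite class could fill.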
In order to capture the classification patterns completely, it is necessary to make edges along the hypothesis of the classification data. So, Part II extends neighbor searching exclusively to nodes of the opposite target class. The node is considered only if there is a remainder requirement of neighbors, , where . In this part, nodes of class 2 form the search space in Line 3, in which neighbors are searched for class 1 nodes. This step along with the vice-versa case forms one iteration of the ENS procedure. For each iteration, a search space of the opposite target class is constructed in Line 3.
Inclusion of a neighbor after space searching in Line 5 is stringent compared to SNS. An input parameter, NEIGH_LIMIT, and the computed reach are used to limit the availability of node as a prospective neighbor in Line 9. For every node added as a neighbor, the variable , initiated in Line 6, is incremented. Reach is used to further update the boolean for node , even if in Line 13. Such a node tends to be an internal node of a target class. In other words, reach only encourages the convex hull nodes of one class to choose neighbors among nodes of the opposite class, aiding in planar construction of the graph.
counter is required in Part II to avoid a node that might habitually come up as a prospect neighbor despite not being very representative of a target class. It otherwise leads to a skewed graph concerning node degree. in Line 3 of Part III is used to reduce the search space of every target class, controlling skewness of the constructed graph.
The second challenge is to capture the classification patterns in edge weights, for which a two-fold edge weight scheme is designed. First, each node measures its internal pattern as the absolute difference from one of the two target classes. Second, every pair of nodes in an edge measures external pattern by the relative difference of their target class. Individual contributions are added via the power scheme in Eq. 2 to weigh the edge according to
where w_ij is the weight of the edge between nodes i and j, tc_i denotes the target class of node i, c_i and c_e are constants for internal and external classification patterns, respectively, and quantities within vertical bars are absolute values.
The importance of using a state-of-the-art implementation like FLANN for neighbor searching cannot be overemphasized. For instance, in a typical run case ( and ), the approximate neighbor searching method of FLANN was 1000 times faster than the exact algorithm for nearest neighbor searching, which involves, for every node, computing distances to all the remaining nodes and then sorting them to determine the nearest neighbors. The run-time advantage is clear when comparing the complexity of exact graph construction, , to that of the approximate methods offered by FLANN.
Graph shedding scheme
Once a weighted graph is obtained, the edge cut based filtering presented in Algorithm 2 separates the training dataset into relevant and non-relevant parts. For every node, the neighbor list is iterated to check for a significant edge in Line 5, and when found, that node is added to in Line 11. This leads to the training set selection that the second type of approach aims to achieve. Note that the result of this step depends on the characteristics of the graph, such as how well connected the graph is, and how well the underlying classification patterns have been captured. Algorithm 1 with the edge weight scheme addresses these issues.
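The filtering idea can be sketched in a few lines of Python. This is a minimal sketch rather than Algorithm 2 itself: it assumes that an edge is significant when its weight is at least the edge cut, and the adjacency layout, the function name `graph_shedding`, and the toy weights are hypothetical.

```python
def graph_shedding(adjacency, edge_cut):
    """Return the nodes that own at least one edge weighing edge_cut or more."""
    relevant = []
    for node, neighbor_weights in adjacency.items():
        # Iterate the neighbor list and stop at the first significant edge.
        if any(weight >= edge_cut for weight in neighbor_weights.values()):
            relevant.append(node)
    return relevant

# Hypothetical weighted graph: node -> {neighbor: edge weight}.
adjacency = {
    "a": {"b": 0.9, "c": 0.1},
    "b": {"a": 0.9},
    "c": {"a": 0.1, "d": 0.2},
    "d": {"c": 0.2},
}
relevant_nodes = graph_shedding(adjacency, edge_cut=0.5)
```

The data points belonging to the returned cluster nodes would then form the reduced training set.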
The role of the nominal VC dimension carries over to the pruned graph as well. Since it controls the granularity of the underlying data, it also controls the granularity of data selection. That gives the heuristic an essential advantage in terms of limiting data shrinkage while selecting the relevant data points. Because the selection is done via clusters, and each cluster typically represents many data points (a nominal VC dimension well above one), many data points that are not very close to the hypothesis boundary of the classification patterns are also selected. That gives an extra buffer of data points to which another selection method of either type can be applied. The majority of classification algorithms, for example neural methods, Gaussian processes, et cetera, can use the heuristic. Until this step, the aim of the presented heuristic matches that of the existing approaches of the first and second type. For comparison purposes, the heuristic up to and including this step is referred to as GSH for the remainder of this work. The edge cut for GSH is referred to as the GS edge cut.
Graph clubbing scheme
This step extends the problem statement of training set selection by further breaking the reduced training set into a few partitions or critical chunks, each of which can be trained independently by virtue of Part I to III of Algorithm 3. The main aim is to design an approximate formulation that is theoretically faster, even for serial execution. The algorithm divides the training set into a few partitions such that the number of computations is reduced significantly in the training phase. Consider that the order of complexity of most classification algorithms is higher than linear in numOfPoints. The graph clubbing scheme does not change the order; however, it results in a significant reduction in total computations. For example, for a classification algorithm with quadratic complexity, let c · numOfPoints^2, where c is a constant, be the original number of computations, and suppose that the graph clubbing scheme yields four equal sized partitions/critical chunks. The number of computations is then 4 · c · (numOfPoints/4)^2 = c · numOfPoints^2 / 4, or a quarter of the original number.
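The arithmetic of this example can be checked with a short Python sketch. It assumes a quadratic cost model, which is the model under which four equal partitions give exactly a quarter of the computations; the function names and the cubic comparison are illustrative additions.

```python
def training_cost(num_points, order=2, constant=1.0):
    """Number of computations for a cost model constant * num_points ** order."""
    return constant * num_points ** order

def partitioned_cost(num_points, num_partitions, order=2, constant=1.0):
    """Total cost when equal sized partitions are trained independently."""
    per_partition = training_cost(num_points / num_partitions, order, constant)
    return num_partitions * per_partition

# Quadratic model, four partitions: a quarter of the original computations.
ratio_quadratic = partitioned_cost(100_000, 4) / training_cost(100_000)

# For a cubic model the same split would leave one sixteenth.
ratio_cubic = partitioned_cost(100_000, 4, order=3) / training_cost(100_000, order=3)
```

In general, k equal partitions under a cost model of order p leave a fraction k^(1-p) of the computations, so the saving grows with the order of the classification algorithm.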
Independence during the training phase of each obtained partition is mainly because of the directional aspect of the algorithm, which is achieved via the edge weight scheme. The directional aspect is responsible for two objectives, namely obtaining equally-sized partitions, and ensuring orthogonality of the hypothesis boundary with neighboring partitions’ boundary. The edge weight scheme is leveraged for the directional aspect in two ways: in the priority aspect of the partial weighted matching (PWM) algorithm, Part I, and in the re-assessment aspect of the coarsening formulation, Part II. The graph clubbing algorithm, Part III, ties Part I and Part II together in an iterative scheme. Each of the obtained partitions can now be trained independently, giving further leverage for a nominal number of worker processes.
The partial weighted matching (PWM), Part I, is designed for the weighted graph obtained from the graph knitting step. The obtained matching is partial because edges weighing less than EDGE_CUT (input parameter) are filtered out in Line 4. So only cluster points closer to the hypothesis boundary are chosen for training. Sorting in Line 6, before ordered matching in Line 9, adds the weighted aspect to the matching. It enables the heaviest edges to be picked earliest for contraction, subtly addressing both the main objectives of the directional aspect. Since the heaviest edges cover the classification patterns, prioritized selection of them results in uniform size of partitions. Prioritized selection also means that the most significant patterns are given preference, which conversely means the least significant patterns are avoided. So the hypothesis boundary, along which the least significant patterns reside, is orthogonal to the contracted edges, where the most significant patterns reside. Higher prediction accuracies are obtained because of the preference of contraction of the heaviest edges.
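The filter, sort, and ordered matching sequence can be sketched as follows. This is a minimal Python sketch of a greedy matching in the spirit of Part I, not the authors' algorithm: the edge representation, the function name, and the toy weights are hypothetical, and edge contraction itself is left out.

```python
def partial_weighted_matching(edges, edge_cut):
    """Greedy matching over edges (u, v, weight), heaviest edges first."""
    # Partial: edges weighing less than the edge cut are filtered out.
    kept = [edge for edge in edges if edge[2] >= edge_cut]
    # Weighted: sorting lets the heaviest edges be picked earliest.
    kept.sort(key=lambda edge: edge[2], reverse=True)
    matched, used = [], set()
    for u, v, weight in kept:
        # Matching: each node may take part in at most one selected edge.
        if u not in used and v not in used:
            matched.append((u, v, weight))
            used.update((u, v))
    return matched

# Hypothetical path graph a-b-c-d-e with decreasing edge weights.
edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.7), ("d", "e", 0.1)]
matching = partial_weighted_matching(edges, edge_cut=0.5)
```

Here the edge between b and c is skipped because b is already matched, and the edge between d and e never enters the matching because it weighs less than the cut.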
The coarsening formulation of Part II applies the directional aspect, as edge contraction occurs in this part. Note that the coarsening formulation differs in aim from the commonly researched formulations. Most of the popular formulations are intended for reducing communication cost or preserving the global structure while getting a low-cost representation of data. Furthermore, unlike Kernighan-Lin and other matching based coarsening objectives, the presented optimization objective is deterministic in execution.
In this part, re-assessment of the target class and the edge list in Lines 6 and 10, respectively, for newly contracted nodes augments the standard coarsening step in Line 3. By using different values of c_i and c_e for initial versus re-assessment edge weights, the precedence of the kind of edges is established in the matching scheme. Original heavy edges are prioritized over re-assessed edges of newly contracted nodes, achieving the orthogonality property. The transition of coarsening from original edges to newly contracted nodes can be captured by virtue of a drastic decrease in the graph cost metric, presented in Eq. 3,
where c_g is the cost of the graph, e is an edge, E_p denotes the subset of E (the set of all edges) such that , and w_e is the weight of edge e.
One technical choice is to use MIN in Line 4, for identifying the new node that is a result of the contraction of edge between nodes , and .
Part I and Part II are executed in the iterations of Part III in Lines 5 and 6, respectively. In Line 7, kink detection in the graph cost is used as a termination condition of the iterative algorithm. However, a maximum number of coarsening iterations, i.e. MAX_NUM_OF_COARSENING_ITER in Line 4, is used in the majority of tests. This step concludes the formulation of the heuristic. It is referred to as GCH for the remainder of this work. Similarly, its edge cut is referred to as the GC edge cut.
The implementation of a few optimizations improved the run-time. First, the starting nodes in Line 2, Part I, are the ones that are identified as relevant in GSH. After each iteration, half of the nodes that belong to contracted edges are removed in the update of the relevant nodes list ( in Algorithm 2). As a result, the complexity of the matching algorithm reduces with the coarsening iterations. Second, the neighbor list of a node is kept sorted. As a result, edge contraction computation is linear (in complexity) to . It is computationally equivalent to the sorted union of two lists. Lastly, the usual numerical optimizations are used, such as masking to avoid dynamic memory allocations, and indexing (at the expense of memory) for searching.
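The sorted union mentioned above is a standard linear-time merge. The Python sketch below shows it for two sorted neighbor lists of integer node indices; the function name and the toy lists are hypothetical, and bookkeeping specific to contraction (such as dropping the two contracted nodes from the result) is omitted.

```python
def merge_neighbor_lists(a, b):
    """Union of two sorted neighbor lists in O(len(a) + len(b)) steps."""
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            merged.append(a[i])
            i += 1
        elif a[i] > b[j]:
            merged.append(b[j])
            j += 1
        else:
            # Shared neighbor: keep a single copy and advance both lists.
            merged.append(a[i])
            i += 1
            j += 1
    # At most one of the two tails is non-empty at this point.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged

# Hypothetical neighbor lists of the two end nodes of a contracted edge.
merged_neighbors = merge_neighbor_lists([1, 3, 5], [2, 3, 6])
```

Because both inputs are sorted, the result is sorted as well, so the merged list can be reused directly in later contractions.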
An event-driven, multi-process algorithm is designed to distribute the partitions obtained after GCH in a communication-free manner. In the network application, processes assume a (single) master or (multiple) worker role. Part I, and II of Algorithm 4, respectively describe master’s, and worker’s side of event-handling design.
For the master process, which executes Part I, one partition from the list of partitions, , is communicated to the requesting worker process (the one which issues the DATA_REQUEST event in Line 3 of Part II) by triggering the DATA event in Line 6. The worker process, upon receiving the DATA event, proceeds to training with the received data partition in Line 5 of Part II. After collecting acknowledgements of the completion of training of all data partitions in Line 9, the master process terminates every worker by issuing the TERM_TRAIN event. The design implements a round-robin scheme, which balances the load of the network queries. That is accomplished by having a queue data structure for recording DATA_REQUEST events in Line 3. Note that a conflict between two simultaneous entries is resolved by time stamps, making the queue fair with respect to a worker’s request.
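The queue-based scheduling can be illustrated with a single-process Python simulation. It is only a sketch under assumptions and involves no networking: requests are modeled as hypothetical (time stamp, worker) pairs, every worker is assumed to request new data as soon as it has been served, and the function name `serve_requests` is not part of Algorithm 4.

```python
from collections import deque

def serve_requests(partitions, initial_requests):
    """Hand out partitions to workers in the order of their requests."""
    # Earlier time stamps are served first, which resolves simultaneous entries.
    queue = deque(worker for _, worker in sorted(initial_requests))
    schedule = []
    for partition in partitions:
        worker = queue.popleft()
        schedule.append((worker, partition))
        # Assumption: the worker issues a new request once it has been served.
        queue.append(worker)
    return schedule

# Hypothetical run: five partitions, two workers; w0 asked slightly earlier.
partitions = ["p0", "p1", "p2", "p3", "p4"]
requests = [(0.2, "w1"), (0.1, "w0")]
schedule = serve_requests(partitions, requests)
```

Under these assumptions the workers are served alternately, which is the round-robin behavior described above.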
Instead of directly using TCP channels for message communication, the distributed messaging API ZeroMQ is used. It provides essential safety and liveness properties on network channels. However, apart from message guarantees such as the liveness, and safety property of ‘once only message delivery’, ZeroMQ is rudimentary compared to higher level message passing libraries. This gives an opportunity to design and optimize various aspects of the architecture. One such aspect is the messaging protocol. A couple of messaging protocols are designed as shown in Figure 1. Protocol 2 implements a single float/double entry (in character array or CA) messaging scheme, whereas protocol 1 first requires marshaling all entries of a data point before messaging it.
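The difference between the two protocols can be sketched in Python without any networking. The wire format used here (ASCII text with comma separators) and the function names are assumptions made for illustration; the actual layout is defined in Figure 1 and in the authors' implementation.

```python
def protocol_1(point):
    """Marshal all entries of a data point into one message."""
    return [",".join(repr(value) for value in point).encode("ascii")]

def protocol_2(point):
    """Send every entry as its own character-array message."""
    return [repr(value).encode("ascii") for value in point]

def decode(messages):
    """Rebuild the data point from either message stream."""
    values = []
    for message in messages:
        for token in message.decode("ascii").split(","):
            values.append(float(token))
    return values

# Hypothetical three-dimensional data point.
point = [0.5, -1.25, 3.0]
messages_1 = protocol_1(point)
messages_2 = protocol_2(point)
```

The trade-off in this sketch is between fewer, larger messages that need a marshaling pass (protocol 1) and more, smaller messages that need none (protocol 2).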
Another aspect that was tested is the connection time of the network. Connection time was measured on the master process, and included the following steps:
Start of TCP channels (wrapped in the API)
Initialize a hash table
Receive connection request from all worker processes
Send connection confirmation to all worker processes
The connection procedure requires step 2 for maintaining worker processes’ information, giving an opportunity to optimize the step as needed. A light-weight hash table () and hash key are designed, which generate unique keys for worker processes. The design helps to reduce the overhead of starting and running the multi-process application.
Pre-processor for the testing phase
Unlike the application of GSH, which results in a single training set, a few data partitions are obtained after GCH, and the training set is the union of these partitions. This means that there are as many classifiers as partitions. Hence, it is necessary to determine which classifier to choose for predicting a point from the testing dataset, . Algorithm 5, nearest hypothesis search (NHS), is used for this task. A search space is formed consisting of nodes of the coarsened graph in Line 1. An ANN search for the nearest hypothesis follows in Line 5.
Once the nearest hypothesis is determined, prediction of the target class for testing data points follows. This added step before the testing phase only takes about 5 - 7% of the run-time of the testing phase for SVM class of algorithms, as will be shown in the ‘Results’ section.
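The selection of a classifier for a test point can be sketched as a nearest-node lookup. The Python sketch below is not Algorithm 5: an exact linear scan stands in for the ANN search, and the node layout (position paired with a partition identifier), the function name, and the toy coordinates are hypothetical.

```python
import math

def nearest_hypothesis(test_point, coarse_nodes):
    """Return the partition of the coarsened-graph node closest to test_point."""
    _, partition = min(
        coarse_nodes,
        key=lambda node: math.dist(test_point, node[0]),
    )
    return partition

# Hypothetical coarsened graph: (position, partition identifier) per node.
coarse_nodes = [((0.0, 0.0), 0), ((1.0, 1.0), 0), ((5.0, 5.0), 1)]
chosen_near_origin = nearest_hypothesis((0.2, 0.1), coarse_nodes)
chosen_far = nearest_hypothesis((4.0, 4.5), coarse_nodes)
```

The returned identifier would then index the classifier trained on that partition, which predicts the target class of the test point.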
Results are presented in two major divisions, first with tests on parameter space of the heuristic, and second for gauging performance of the heuristic. All the tests were conducted on a variety of datasets.
Parameter space of heuristic
|GS edge cut|
|GC edge cut|
Node reach, and ENS
Tests in this section present heuristic tools that capture original classification data into the weighted graph. These tools are designed to handle real datasets, which vary diversely in characteristics. A mix of real, and synthetic datasets that mimic varying characteristics are considered.
A timeline of the ENS procedure with is presented in Figure 3. It is based on the dataset of Table 1. However, class 2 data points were intentionally translated to create separation, which is very typical for real data. It is evident that the connectivity of the graph increases with more iterations. Skewness control of the constructed graph, explained in Part III of Algorithm 1, was carried out at the end of the iterations of the ENS procedure, resulting in a reduction of the nodes available for search, as shown in Table 2.
|# iterations||# class 1 nodes||# class 2 nodes|
A second way to control connectivity is fine tuning of the reach equation, presented in Eq. 1. In the next test, the scaling constant is varied, and results are presented in Figure 4. Three cases are shown in succession: under-reach (a), ideal-reach (b), and over-reach (c). Even in the over-reach case, inner nodes are not able to make opposite class neighbors, enforcing planar construction of the graph.
The ENS procedure was next run on a real dataset, the cuff-less blood pressure estimation dataset from the UCI machine learning repository [32, 33]. It is a real, multivariate dataset with three attributes and 12000 instances. Results are presented in Figure 5, with Subfigure (a) showing the dataset, (b) showing cluster centers, and (c) and (d) showing the constructed graph without and with the ENS procedure, respectively. The latter graph is connected because the necessary edges are present between opposite class nodes, covering the entire hypothesis. Class imbalance is analysed next, and results are presented in Table 3. Column 2 shows the original number of data points for each class, and column 3 shows the number of data points after ENS, as presented in the graph of Figure 5(d). Imbalance of target classes, quantified by standard deviation (SD), is significantly reduced after ENS.
|Procedure||# class 1 points||# class 2 points||SD|
The aim of this test is to exemplify the two-fold edge weight scheme, shown in Eq. 2. It shows the role of edge weights in the outcome of GSH, summarized in Figure 6. As classification patterns in the underlying data get more confused in succession in Subfigures (a) and (c), more edges are weighted significantly, resulting in the selection of a bigger training set by GSH.
This test exemplifies the formulation of the graph clubbing scheme. Although the edge weight scheme is important in the application of GSH, it is primarily designed to play a crucial role in the priority/directional aspect of the coarsening objective. Constants for initial edge weights were such that was kept significantly higher than , as shown in Table 4. Two cases of re-assessments were considered, namely case I, and case II.
|Case I||Case II|
Since initial constants were kept significantly higher than their re-assessment counterparts, original nodes were contracted first in both cases, as can be seen in Figure 7(a), and (b). Transition to contraction of re-assessed type nodes is reflected in the graph cost characteristic in Figure 8 for case I. After the iteration, the slope magnitude decreases significantly because of contraction of all original nodes, that have significantly higher edge weights dictated by heavier initial constants. Edge contraction continues to aggressively club the graph for case I, including the re-assessed nodes. This results in fewer partitions compared to case II in Figure 7(b). Clubbing is almost shut off in case II because of trivial re-assessment constants. Note that in general, the more the coarsening iterations, the fewer the partitions.
Several tests were conducted to evaluate performance of the presented methods, including GSH, serial GCH, and distributed GCH. In addition, the network architecture for distributed GCH is evaluated.
Dataset I was a synthetically constructed two-dimensional, near-linearly separable dataset with parameters shown in Table 5. Dataset II was a similar dataset, but in place of a (nearly) straight separating hyperplane, a spherical separating boundary of radius 0.2 was employed. Otherwise, it uses the same parameters as Dataset I. The next dataset, called Dataset III for the remainder, is the skin segmentation dataset from the UCI machine learning repository [33, 34]. This classification dataset has four numeric features, and 245k observation instances. The nominal VC dimension is . Table 6 summarizes other relevant parameters.
|Range of # data points||testing||3k - 300k|
|training||1k - 100k|
|clustering||3 - 5|
|Range of||10 - 1000|
|1.0 - 2.0|
|# data points||test||122529|
|1.0 - 2.0|
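For readers who want to reproduce data of the same flavour as Datasets I and II, a minimal generator is sketched below. Only the two-dimensional setting and the radius of 0.2 come from the text. The unit-square domain, the boundary `y = x`, the circle centre `(0.5, 0.5)`, and the label-flip model for near-separability are all assumptions, as is the name `make_dataset`.

```python
import random


def make_dataset(n_points, boundary="linear", noise=0.0, seed=0):
    """Return a list of ((x, y), label) pairs drawn from the unit square.

    boundary="linear" labels points by the line y = x (assumed);
    any other value labels points by a circle of radius 0.2 centred
    at (0.5, 0.5) (centre assumed). A fraction `noise` of labels is
    flipped to make the classes only nearly separable (assumed model).
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_points):
        x, y = rng.random(), rng.random()
        if boundary == "linear":
            label = 1 if y > x else 0
        else:
            inside = (x - 0.5) ** 2 + (y - 0.5) ** 2 <= 0.2 ** 2
            label = 1 if inside else 0
        if rng.random() < noise:
            label = 1 - label  # flip to break exact separability
        data.append(((x, y), label))
    return data


# Testing data three times the size of the training data, as in the tests.
train = make_dataset(1000, boundary="linear", noise=0.02, seed=1)
test = make_dataset(3000, boundary="linear", noise=0.02, seed=2)
```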
Note that the computation setup was kept consistent. All code for which computation time was measured is written in C/C++. That includes every step of the presented heuristic, and the classification algorithms. Lastly, time stamping was carried out on an otherwise idle system.
The main aim of the next set of tests is to compare the training phase using GSH against the state-of-the-art shrinking heuristic of LIBSVM. Classification algorithms used in these tests were all variants of LIBSVM’s SMO implementation, which is a method of the first type.
Two testing variables are evaluated. First, computation run-time of GSH along with the classification algorithm on reduced training data is compared against that of the same classification algorithm but augmented with the shrinking heuristic. Second, the prediction accuracy of the learned models obtained from both the above setups are compared.
Results are presented in the following way. Each of Figures 9 - 15 is for an SVM classifier, that is, C-SVM or nu-SVM with a linear, polynomial, or rbf kernel. Subfigure (a) is used to present computation run-time results, and (b) is used for prediction accuracy comparison. Training dataset size was geometrically varied from 1k to 100k, and testing dataset size was kept three times that of the training. Accordingly, the parameter was varied geometrically in the range given in Table 5. Note that training time is included in the ‘LIBSVM shrinking heuristic’, and ‘GSH’ plots in Subfigure (a), but is not explicitly written for convenience. A third line-plot denoted ‘SVM training (post GSH)’ is presented in Subfigure (a) to separate computation run-time of GSH from subsequent training.
C-SVM with linear kernel was the first classification algorithm tested, as presented in Figure 9. GSH scales even for a low number of points (10k), as shown in Subfigure (a). However, scaling becomes more evident with more training data points. Furthermore, it is visible that GSH took a small fraction of the total time, as indicated by the separation between the second, and third line plots. This highlights the scalability issue of the SVM formulation. In Subfigure (b), the prediction accuracy of reduced training closely follows that of full training, which is clearly visible after 20k testing data points.
C-SVM with polynomial kernel was the second classifier tested, as presented in Figure 10. Observations here follow those of the earlier case. One noticeable difference, however, is a better run-time profile than in the previous case. This is because of the higher run-time complexity of the polynomial kernel classifier than that of the linear kernel presented in Figure 9.
For the second SMO implementation, nu-SVM, Figures 11 and 12 summarize the linear and polynomial kernel cases, respectively. Note that the training phase with nu-SVM in Figures 11(a) and 12(a) is significantly slower compared to C-SVM in Figures 9(a) and 10(a). Furthermore, it is observed that reduced training can result in improved prediction accuracy, as can be seen in Figure 12(b). For the polynomial kernel, GSH performed better than native training by about 6%.
For Dataset II, C-SVM with linear kernel was again the first classification algorithm tested, as presented in Figure 13. However, prediction accuracies are in a lower range than before, as shown in Subfigure (b). Nevertheless, both setups are practically identical in prediction accuracy. Similar run-time improvements are observed in Subfigure (a). For nu-SVM with linear kernel, presented in Figure 14, scaling observations are similar to the corresponding Figure 11, which used Dataset I. Note that the prediction accuracies are only about 50%, which is the expected value for random selection between two target classes. However, the main argument here is not the absolute prediction accuracy, but rather its closeness for both setups. To improve the prediction accuracy, results were obtained for a radial basis function (rbf) kernel, which is widely considered a robust kernel type in SVM classifiers, and are presented in Figure 15. This resulted in slower learning, as observed in Subfigure (a), but near perfect prediction accuracy, as shown in Subfigure (b).
The next set of tests was conducted on Dataset III to evaluate the performance of the heuristic on real data, and bar plots in Figure 16 are used to present the findings. Computation run-time was scaled to accommodate different classification algorithms, namely C-SVM, and nu-SVM with linear, polynomial, and rbf kernels. For all six cases, the reported prediction accuracy is low, as shown in Table 7, but close for both setups. Scaling improvements are consistent with the results reported earlier for Datasets I and II.
Overall, scalability improvements are observed consistently over a number of classifiers for Datasets I, II, and III. Furthermore, the prediction accuracy either closely follows original training or even outperforms it in a few instances.
The next set of tests is aimed at comparing the run-time of serial GCH to that of GSH. More precisely, serial execution of GCH, which represents approximate learning, is compared against (reduced training data) GSH learning. In GCH, the graph clubbing step was an added cost over GSH. Results are presented for Dataset I.
The number of iterations was set to 10, and used as the termination condition for the graph clubbing algorithm. The pre-processor was used, and the collected prediction accuracies were weight-averaged over all reported partitions. Edge weight constants given in Table 8 were used throughout.
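The weight-averaging of per-partition accuracies can be sketched as follows. The text does not state which weights were used; this sketch assumes each partition is weighted by the number of test points evaluated on it, and the function name `weighted_accuracy` is hypothetical.

```python
def weighted_accuracy(partitions):
    """Combine per-partition accuracies into one figure.

    partitions is a list of (num_test_points, accuracy) pairs, with
    accuracy given as a fraction in [0, 1]. Weighting by the number of
    test points is an assumption of this sketch.
    """
    total_points = sum(count for count, _ in partitions)
    if total_points == 0:
        raise ValueError("no test points in any partition")
    weighted_sum = sum(count * acc for count, acc in partitions)
    return weighted_sum / total_points


# Two partitions of different size: the larger one dominates the average.
overall = weighted_accuracy([(100, 0.9), (300, 0.8)])
```

Under this assumption the result equals the pooled accuracy over all test points, so a small partition with a poor model lowers the reported figure only in proportion to its size.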
Only GSH, and GCH are compared along with respective line-plots of subsequent training for all the classifiers, presented in Figures 19, and 20. However, for the first classifier, C-SVM with linear kernel, shown in Figure 17, computation run-time of LIBSVM with shrinking heuristic is also plotted. An accuracy plot, Subfigure (b), and a plot for the number of partitions in Figure 18 are also presented.
In Figure 17(a), a clear scalability hierarchy is observed: serial GCH scales better than GSH, which in turn scales better than the LIBSVM shrinking heuristic. In the accuracy plot, shown in Subfigure (b), approximate learning closely follows native learning from about 8k training data points onward. However, for fewer points, the accuracy of the model trained after GCH, which is 95%, is not as good as that of native training. Between 2 and 5 partitions were obtained, as shown in Figure 18. For polynomial kernel classifiers, presented in Figures 19 and 20, scaling advantages of serial GCH over GSH are observed even for a lower number of data points (5k).
The main aim here was to demonstrate run-time advantage with worker processes for the communication-free training scheme discussed earlier. The scheme only aims for coarse grain parallelization, such that individual data chunks/partitions can be distributed to worker processes.
Results are presented in Figure 21 for the C-SVM with polynomial kernel classifier on Dataset I. The distribution invariably depends on the number of obtained partitions. Therefore, scaling results in Subfigure (a) cannot be interpreted without the plot for the number of partitions, which is shown in Subfigure (b).
Significant improvements are observed with more workers. However, sub-optimal distribution with the number of workers 2 can be seen in Figure 22. The mean number of data points per chunk/partition is plotted as a function of the total number of data points. Using the number of partitions/chunks from Figure 21(b), it is straightforward to interpret the monotonic increase of the plot. In a hypothetical case where the partitions are all equally sized, the standard deviation (SD) would be zero. However, as the total number of data points increases, the error bars for SD are observed to be commensurate with the mean number of data points. Unequal partition sizes are the apparent reason for the sub-optimal distribution.
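The effect of unequal partition sizes on the distribution can be made concrete with a small sketch. It assumes that the training cost of a partition is proportional to its number of points and that partitions are assigned greedily to the least loaded worker. Neither assumption is taken from the implementation described here, and the function names are hypothetical.

```python
import heapq
import statistics


def partition_stats(sizes):
    """Mean and population standard deviation of the partition sizes."""
    return statistics.mean(sizes), statistics.pstdev(sizes)


def makespan(sizes, num_workers):
    """Largest per-worker load under greedy longest-first assignment.

    Load is measured in data points, which stands in for training time
    (an assumption; SVM training is generally superlinear in size).
    """
    loads = [0] * num_workers
    heapq.heapify(loads)
    for size in sorted(sizes, reverse=True):
        lightest = heapq.heappop(loads)  # least loaded worker so far
        heapq.heappush(loads, lightest + size)
    return max(loads)


def imbalance(sizes, num_workers):
    """Ratio of the achieved makespan to the ideal, perfectly even load."""
    ideal = sum(sizes) / num_workers
    return makespan(sizes, num_workers) / ideal


skewed = [700, 100, 100, 100]   # SD comparable to the mean
even = [250, 250, 250, 250]     # SD of zero
```

For the `skewed` sizes, adding workers beyond two does not reduce the makespan, because the largest partition alone determines it. For the `even` sizes the imbalance ratio stays at one.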
The main motivation for this set of tests was to evaluate various implementation aspects of the network that implemented distributed GCH. The first aspect is the connection time of the distributed application. Figure 23 shows connection time measurements with the number of workers.
The connection time is very small (120 ms) when compared to the run-time of the training phase. The connection procedure was replicated on another distributed messaging API, MPICH3.3, and connecting with ZeroMQ was observed to be 1.53 times faster than with MPICH3.3.
The next aspect of the network is the messaging protocols. The primary motivation behind the protocol design is to utilize the shape of the data point, which has an n-dimensional feature vector and a 1-dimensional target class value (tc). Two such protocols, as previously discussed, were compared in run-time. A couple of opposing factors are at play between these protocols. Data marshaling is an extra step in protocol 1 (before actual messaging) when compared to protocol 2. On the other hand, message count/traffic increases (multiple times, governed by the dimensionality of the data) in protocol 2 compared to protocol 1. Overall, protocol 1 performed better, as can be seen in Figure 24. Although increased marshaling may result in even better scaling, a larger memory footprint on the channel is inadvisable as it can result in an overflow scenario.
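The trade-off between the two protocols can be illustrated without any messaging library. The sketch below follows only the description in this section: protocol 1 marshals a whole data point into one message, and protocol 2 sends one message per scalar. The byte layout (little-endian doubles followed by a 32-bit integer) and all function names are assumptions, not the wire format used in this work.

```python
import struct


def protocol1_messages(points):
    """One marshaled message per data point: n doubles, then one int32."""
    messages = []
    for features, tc in points:
        layout = "<%ddi" % len(features)  # assumed byte layout
        messages.append(struct.pack(layout, *features, tc))
    return messages


def protocol2_messages(points):
    """One message per scalar: each feature, then the target class."""
    messages = []
    for features, tc in points:
        messages.extend(struct.pack("<d", value) for value in features)
        messages.append(struct.pack("<i", tc))
    return messages


def unmarshal1(message, n):
    """Recover (features, tc) from a protocol 1 message of dimension n."""
    values = struct.unpack("<%ddi" % n, message)
    return list(values[:n]), values[n]


points = [([0.1, 0.2, 0.3, 0.4], 1), ([0.5, 0.6, 0.7, 0.8], 0)]
few_large = protocol1_messages(points)
many_small = protocol2_messages(points)
```

For data of dimension n, protocol 2 produces n + 1 messages per point where protocol 1 produces one, while protocol 1 pays for the packing step and for a larger buffer per message.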
Messaging efficiency was tested in the context of the prediction/testing phase of the distributed implementation. As reported in Figure 25, messaging is very efficient, taking 1% of the prediction time. It is found to be even more efficient for the training phase, at 0.5%. Pre-processing run-time is at 7%, as can be seen in Figure 25.
Note that external libraries are used in the formulation, including a clustering method, an approximate nearest neighbor implementation, two messaging APIs, and many classification algorithms. The heuristic execution is not very sensitive to the parameters of these external implementations. For instance, only 3 - 5 iterations of clustering were needed, and the input parameters for the approximate nearest neighbor implementation and the classification algorithms were kept at their defaults. However, the heuristic is sensitive to the parameters which are part of the heuristic steps.
In this work, a three-part algorithm, and an edge weight scheme are designed to construct a weighted graph that effectively captures the classification patterns of a dataset. Another three-part graph coarsening algorithm with a directional aspect is designed that divides the reduced dataset into partitions that can be trained independently using the designed communication-free network.
GSH provides an evident serial scalability advantage, and its generic applicability held for every classifier tested. This encourages extrapolating the argument to claim that GSH will hold for a multitude of classification algorithms. Approximate learning is the reason behind serial GCH’s added advantage over GSH. A better model training throughput can be achieved via the use of both of these approaches.
Distribution of chunks/partitions of the training set to worker processes results in further scaling improvements. Even for a nominal number of worker processes (say four), available in almost all current computing platforms, parallel performance benefits can be achieved. However, unequal coarsening, and control over the number of partitions still need to be further investigated in the current implementation of GCH.
First, a possible direction of investigation is the directional aspect of the objective. That is, if the partitioning were better directed, better orthogonality between the contour of the underlying classification pattern and the partition boundaries would be obtained. This condition is necessary for correctness (in prediction accuracy) of this approximate procedure. The two-fold pattern measure scheme should be designed to get the desired control over the direction of coarsening.
A second possible investigation is modifying the objective function of the current graph coarsening scheme. Part of traditional coarsening objectives can be added to the current objective. Since popular objectives predominantly optimized size division of partitions, it will address the un-equal coarsening problem in the presented objective. Furthermore, the number of partitions is directly controlled in these objectives.
All approaches of the presented heuristic, namely GSH, serial GCH, and distributed GCH, are effective in reducing the training run-time of a classification algorithm. Furthermore, the scaling benefits come without a compromise in prediction accuracy.
Availability of data and materials
Datasets I and II used during the current study are available from the corresponding author on reasonable request. Furthermore, the two real datasets used are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/skin+segmentation; https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation. Other resources with respect to the manuscript are shared at https://sumedhyadav.github.io/projects/graph_heuristic.
Tests were performed with the following computing specifications:
Operating system(s): Ubuntu 16.04 LTS
Programming language: C/C++, python (for visualization)
Visualization tools: matplotlib, and networkx modules in python
External Libraries: FLANN, mlpack, LIBSVM
Version requirements: g++ 5.4.0, Python 2.7.12, matplotlib 1.5.1, networkx 2.2
The authors declare that they have no competing interests.
This work is joint research of SY and MB. SY did all implementation, and ran all test cases. All authors read, and approved the final manuscript.
SY acknowledges Gautam Kumar’s help with respect to preparing the manuscript.
-  Levy AY, Fikes RE, Sagiv Y (1997) Speeding up inferences using relevance reasoning: A formalism and algorithms. Artif. Intell. 97:83-136. doi:10.1016/S0004-3702(97)00049-0.
-  Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif. Intell. 97:245-271. doi:10.1016/S0004-3702(97)00063-5.
-  Weng J, Young DS (2017) Some dimension reduction strategies for the analysis of survey data. Journal of Big Data 4:1:43. doi:10.1186/s40537-017-0103-6.
-  Guyon I, Gunn S, Nikravesh M, Zadeh LA (2008) Feature extraction: Foundations and Applications. In:Studies in Fuzziness and Soft Computing, vol 207. Springer, Springer-Verlag Berlin Heidelberg. doi:10.1007/978-3-540-35488-8.
-  Fayed H (2018) A data reduction approach using hyperspherical sectors for support vector machine. In:DSIT ’18:Proceedings of the 2018 International Conference on Data Science and Information Technology, Singapore, Singapore. ACM, New York, NY, USA. doi:10.1145/3239283.3239317.
-  Coleman C, Mussmann S, Mirzasoleiman B, Bailis P, Liang P, Leskovec J, Zaharia M (2019) Select via proxy: Efficient data selection for training deep networks. https://openreview.net/forum?id=ryzHXnR5Y7. Accessed on 1 Feb 2019.
-  Loukas A, Vandergheynst P (2018) Spectrally approximating large graphs with smaller graphs. CoRR abs/1802.07510.
-  Weinberg AI, Last M (2019) Selecting a representative decision tree from an ensemble of decision-tree models for fast big data classification. Journal of Big Data 6:1:23. doi:10.1186/s40537-019-0186-3.
-  Chen PH, Fan RE, Lin CJ (2006) A study on SMO-type decomposition methods for support vector machines. Trans. Neur. Network 17:893-908. doi:10.1109/TNN.2006.875973.
-  Fan RE, Chen PH, Lin CJ (2005) Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 6:1889-1918.
-  Nalepa J, Kawulok M (2014) A memetic algorithm to select training data for support vector machines. In:GECCO ’14:Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, Vancouver, BC, Canada. ACM, New York, NY, USA. doi:10.1145/2576768.2598370.
-  Salvador S, Chan P (2004) Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In:ICTAI ’04:Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence. IEEE Computer Society, Washington, DC, USA. doi:10.1109/ICTAI.2004.50.
-  Awad M, Khan L, Bastani F, Yen IL (2004) An effective support vector machines (SVMs) performance using hierarchical clustering. In:16th IEEE International Conference on Tools with Artificial Intelligence, pp 663-667. doi:10.1109/ICTAI.2004.26.
-  Cervantes J, Li X, Yu W, Li K (2008) Support vector machine classification for large data sets via minimum enclosing ball clustering. Neurocomputing 71:611-619. doi:10.1016/j.neucom.2007.07.028.
-  Li X, Cervantes J, Yu W (2007) Two-stage svm classification for large data sets via randomly reducing and recovering training data. In:IEEE International Conference on Systems, Man and Cybernetics, October 2007. doi:10.1109/ICSMC.2007.4413814.
-  Wang J, Neskovic P, Cooper LN (2006) A minimum sphere covering approach to pattern classification. In:ICPR’06:18th International Conference on Pattern Recognition, August 2006, 3:433-436. doi:10.1109/ICPR.2006.102.
-  Mavroforakis ME, Theodoridis S (2006) A geometric approach to support vector machine (SVM) classification. Trans. Neur. Network 17:671-682. doi:10.1109/TNN.2006.873281.
-  Fung G, Mangasarian OL (2000) Data selection for support vector machine classifiers. In:KDD ’00:Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, Massachusetts, USA. ACM, New York, NY, USA. doi:10.1145/347090.347105.
-  Wang J, Neskovic P, Cooper LN (2005) Training data selection for support vector machines. In:ICNC’05:Proceedings of the First International Conference on Advances in Natural Computation - Volume Part I, Changsha, China. Springer-Verlag, Berlin, Heidelberg, Germany. doi:10.1007/11539087_71.
-  Yu H, Yang J, Han J (2003) Classifying large data sets using SVMs with hierarchical clusters. In:KDD ’03:Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA. ACM, New York, NY, USA. doi:10.1145/956750.956786.
-  Chang CC, Lin CJ (2011) LIBSVM: A Library for support vector machines. ACM Trans. Intell. Syst. Technol. 2:27:1-27:27. doi:10.1145/1961189.1961199.
-  Chau LA, Li X, Yu W (2013) Convex and concave hulls for classification with support vector machine. Neurocomputing 122:198-209. doi:10.1016/j.neucom.2013.05.040.
-  Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 36:2227-2240. doi:10.1109/TPAMI.2014.2321376.
-  Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N (2018) A survey on addressing high-class imbalance in big data. Journal of Big Data 5:1:42. doi:10.1186/s40537-018-0151-6.
-  Arthur D, Vassilvitskii S (2007) K-means++: The advantages of careful seeding. In:SODA ’07:Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, Louisiana. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA.
-  Hughes M (2018) KMeansRex: Fast, vectorized C++ implementation of K-Means using the Eigen matrix template library. https://github.com/michaelchughes/KMeansRex. Accessed on 1 Dec 2018.
-  Curtin RR, Edel M, Lozhnikov M, Mentekidis Y, Ghaisas S, Zhang S (2018) mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software 3:726. doi:10.21105/joss.00726.
-  Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc. VLDB Endow. 5:622-633. doi:10.14778/2180912.2180915.
-  Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I (2016) Apache Spark: A unified engine for big data processing. Commun. ACM. 59:56-65. doi:10.1145/2934664.
-  Dhillon IS, Guan Y, Kulis B (2007) Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence 29:1944-1957. doi:10.1109/TPAMI.2007.1115.
-  Akgul F (2013) ZeroMQ. Packt Publishing, Birmingham, United Kingdom.
-  Kachuee M, Kiani M, Mohammadzade H, Shabany M (2015) Cuff-less high-accuracy calibration-free blood pressure estimation using pulse transit time dataset. https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation. Accessed on 1 Dec 2018.
-  Dua D, Graff C (2019) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed on 1 Dec 2018.
-  Bhatt R, Dhall A (2012) Skin segmentation dataset. http://archive.ics.uci.edu/ml/datasets/skin+segmentation. Accessed on 1 Dec 2018.
-  MPICH3.3 (2018) A high performance and widely portable implementation of the message passing interface (MPI) standard. https://www.mpich.org/. Accessed 1 February 2019.