A graphical heuristic for reduction and partitioning of large datasets for scalable supervised training

07/24/2019 · by Sumedh Yadav, et al. · RWTH Aachen University

A scalable graphical method is presented for selecting, and partitioning datasets for the training phase of a classification task. For the heuristic, a clustering algorithm is required to keep its computation cost in reasonable proportion to the task itself. This step is followed by the construction of an information graph of the underlying classification patterns using approximate nearest neighbor methods. The presented method consists of two approaches: one for reducing a given training set, and another for partitioning the selected/reduced set. The heuristic targets large datasets, since the primary goal is a significant reduction in training run-time without compromising prediction accuracy. Test results show that both approaches significantly speed up the training task when compared against the state-of-the-art shrinking heuristic available in LIBSVM, while closely following, or even outperforming, it in prediction accuracy. A network design is also presented for the partitioning-based distributed training formulation. Added speed-up in training run-time is observed when compared to a serial implementation of the approaches.


Introduction

Two decades ago, some of the most seminal works in machine learning were done on training set selection [1, 2] under the banner of relevance reasoning. However, the better part of recent work has been directed almost exclusively towards feature selection [3, 4]. With increased processing power, the run-time of training is feasible even for datasets erstwhile considered large. Additionally, dimensionality (dimensions) dominates dataset size (numOfPoints) in the algorithmic complexities of learning algorithms. In the training phase, fewer data points mean weaker generalization guarantees; however, as we move into the era of big data, even the fastest classification algorithms are taking infeasible time to train models. When data sources are abundant, it is befitting to separate data based on relevance to the learning task. This has led to a renewed interest in the once famous problem statement of relevance reasoning [5, 6]. Reasoning on relevance to obtain improved scalability of classification algorithms is currently explored on graphical/network data [7], and on learned models [8].

One research area where training set selection has been given attention is support vector machines (SVM). Generally, these selection methods can be divided into two types. The first type aims to modify the SVM formulation so that it can be applied to large datasets. Many approaches have worked successfully in the past, including sequential minimal optimization (SMO) [9, 10], and genetic programming [11]. The first type of methods, however, does not benefit from reducing the size of the training data, because it only deals with data handling. Reduction of data size is the quintessential advantage of the second type of methods [12, 13, 14, 15, 16, 17, 18, 19, 20]. These methods focus on segregating the data points based on relevance to the classification task; computation time is reduced by actively reducing numOfPoints. Handling data, rather than reducing it, was the central theme of research during the mid-2000s. Despite the apparent advantage of data reduction, researchers put more effort into methods of the first type [9, 10, 11, 21] than into the latter type. For instance, the formulations implemented in LIBSVM [21], which is a widely used benchmark library for SVM methods, are of the first type.

The few existing works of the second type are limited in one respect or another. For instance, the clustering-based SVM (CB-SVM) work by Yang et al. [20], and others [13, 12], which resulted in huge speed-ups, are limited to linear kernels only. A geometric approach of minimum enclosing ball by Cervantes et al. [14] requires two stages of SVM training. A similar method by Li et al. [15] suffers from a random selection of data; they report 92.0% prediction accuracy for a separable dataset for which LIBSVM reproduces 99.9%.

The presented approach also falls under the second type. As will be shown, the presented selection scheme is highly deterministic in prediction accuracy: a model trained with the presented heuristic achieves prediction accuracy close to, or better than, that of full data training. One recurring problem with the second type of methods is poor scalability of the space searching [15, 16, 17, 18, 19]. It becomes worse for high dimensional data, where the heuristic takes more time than the training itself [22]. This issue is addressed by the use of state-of-the-art approximate nearest neighbor (ANN) methods [23], which are highly scalable.

In LIBSVM, the SMO decomposition method of Fan et al. [10] is available for the classification task. Training set selection compares in principle to working set selection (WSS) in the context of SMO or similar decomposition methods. Furthermore, a shrinking technique is available for these formulations to remove bounded components during iterations, effectively reducing the optimization problem [10]. As will be shown, data selection in the presented heuristic depends only on the underlying classification patterns, giving it the essential advantage of generic applicability to the majority of classification algorithms, including the SMO formulations of LIBSVM. For these reasons, the state-of-the-art shrinking heuristic of LIBSVM is compared to the presented heuristic.

The presented heuristic augments clustering-based approaches [20, 13, 12] by constructing an (approximate) information graph out of the clustered data. This graph acts as a proxy for reducing the training set. A novel edge weight scheme captures the underlying classification patterns in the graph. The graph is then pruned via filtering on the edge weights to select a relevant dataset that can be used for the training task. Furthermore, a graph coarsening approach is presented to break the selected/reduced set into further partitions that are independently available for training, leading to an approximate learning scheme. Both methods reduce the number of training data points, which lowers the complexity of the training algorithm and yields performance advantages.

Most of the existing methods of both types are limited to the SVM class of algorithms [12, 13, 14, 15, 16, 17, 18, 19, 20, 9, 10, 11, 21]. Generic applicability to a majority of classification algorithms is another advantage of the presented heuristic: it allows the heuristic to be used as a pre-processing tool, separate from the classification algorithm. Since the data points are selected based on their relevance to the classification task, the resulting reduced training set is much more balanced in size across the target classes. In other words, the formulation also addresses the problem of class imbalance, which is a topic of current research in big data [24].

The remainder of this paper is organized as follows: the ‘Methodology’ section describes the formulations of the proposed heuristic in detail; the heuristic is evaluated with a number of tests, and datasets in the ‘Results’ section; finally, the ‘Conclusions’ section summarizes concluding remarks, and ideas for future work on the heuristic formulation.

Methodology

The heuristic procedure organically divides into the following steps:

  • Clustering step

  • Graph knitting scheme

  • Graph shedding scheme

  • Graph clubbing scheme

  • Pre-processor for the testing phase

The training phase proceeds after the first four steps, whereas the testing phase follows the last step of the heuristic formulation. The clustering step is used to obtain a computationally feasible resolution of the underlying data. A weighted graph is constructed next in the graph knitting scheme, using a three-part algorithm, and an edge weight scheme, which captures the classification patterns completely. Significant nodes of this graph with respect to the classification task are determined in the graph shedding step. Finally, the graph clubbing step divides these nodes into partitions that can be trained independently, using a directional aspect of the graph coarsening objective achieved via another three-part algorithm. Because the multiple data partitions translate into as many classifiers, it must be determined which classifier to use when testing a data point; this is achieved by the pre-processor for the testing phase. Lastly, a network application is designed which distributes the obtained partitions in a load-balancing, and communication-free manner.

From a computational point of view, the run-time profile of the first four steps for a typical run case (numOfPoints ≈ 100k, dimensions = 2, 350 clusters) is:

  • Clustering step - 750 ms or 98%

  • Graph knitting scheme - 10 ms or 1%

  • Graph shedding, and graph clubbing schemes - 2 ms (< 1%)

On the other hand, the pre-processor for the testing phase takes 5% of the testing phase time. The clustering step is predominant over the subsequent steps, with a run-time complexity in the order of numOfPoints, whereas that of all the other steps is in the order of the number of clusters (numOfClusters). The input parameter to the clustering step, numOfClusters, controls the granularity of the data representation. Typically, the ratio of numOfPoints to numOfClusters, which is also known as the nominal VC dimension, is 10 to 300, explaining the run-time dominance of the clustering step. Among the subsequent steps, the graph knitting scheme is the most computation-intensive, because it involves heavy space searching. It is to be noted that the run-time percentage profile can vary a lot depending on the nominal VC dimension, or equivalently the input parameter numOfClusters.

Clustering step

The clustering step is used to lower the resolution of the underlying data. In principle, the presented heuristic does not require this step. However, it is not computationally feasible to execute the subsequent steps with a run-time complexity in the order of numOfPoints instead of numOfClusters. Additionally, skipping it would also affect the generic applicability of the heuristic, which is discussed later in the ‘Graph shedding scheme’ step.

The step consists of a standard K-means++ [25] clustering algorithm, and a metric to store the classification patterns of the original data. K-means++ provides improved initial seeding of clusters over the traditional K-means method. This choice allows the clustering algorithm to be run for a nominal number of iterations, typically 5. In the current implementation, every cluster center maintains its target class through a weighted average calculated over all its data points. However, advanced metrics can be constructed to unearth more characteristics of the patterns from the clustered data.

Although this step serves only to coarsen the data representation, it dominates the computation cost of the first four steps. Two state-of-the-art K-means++ implementations were tested: K-MeansRex [26], and the scalable mlpack package [27]. For a test run with the number of data points in the range of 1 to 100K (dimensions = 2, numOfClusters = 100), K-means++ from mlpack was faster on average in execution than K-MeansRex’s implementation. Therefore, the K-means++ from mlpack is chosen as the standard clustering algorithm in this work.
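As an illustration of this step, the sketch below clusters the points with mlpack's K-means and derives the per-cluster fractional target class as a weighted average over member points. It is a minimal sketch, not the paper's code: it assumes the classic mlpack 3.x headers and namespace (newer releases expose mlpack::KMeans directly), and the K-means++ seeding used in this work would additionally require selecting an appropriate initial-partition policy.

// Minimal sketch: K-means clustering plus per-cluster fractional target class.
#include <mlpack/core.hpp>
#include <mlpack/methods/kmeans/kmeans.hpp>
#include <armadillo>

void clusterAndLabel(const arma::mat& points,       // dims x numOfPoints
                     const arma::Row<size_t>& tc,   // target class per point (0 or 1)
                     const size_t numOfClusters,
                     arma::mat& centroids,          // dims x numOfClusters (output)
                     arma::rowvec& clusterTc)       // fractional class per cluster (output)
{
  arma::Row<size_t> assignments;
  mlpack::kmeans::KMeans<> kmeans(5);               // a nominal number of iterations
  kmeans.Cluster(points, numOfClusters, assignments, centroids);

  clusterTc.zeros(numOfClusters);
  arma::rowvec counts(numOfClusters, arma::fill::zeros);
  for (size_t i = 0; i < points.n_cols; ++i)
  {
    clusterTc(assignments[i]) += static_cast<double>(tc[i]);
    counts(assignments[i]) += 1.0;
  }
  for (size_t c = 0; c < numOfClusters; ++c)
    if (counts(c) > 0.0)
      clusterTc(c) /= counts(c);                    // weighted-average target class
}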

Given the vast research literature available for clustering methods, there are other implementations available. One such improvement would be the scalable K-means++ by Bahmani et al. [28], which is shown to be considerably faster than native K-means++. Another practical option is to exploit K-means implementations from a proven distributed computing platform [29].

Graph knitting scheme

From this step onwards, the presented heuristic digresses from most of the existing geometric approaches of the second type, primarily because of the choice of a graph to represent the classification dataset, and the use of seminal works in neighbor searching methods. First, the choice of a weighted graph opens the possibility of using well-researched work on graph coarsening, which is the foundation of the graph clubbing step. Second, most of the existing approaches could not benefit from seminal works in neighbor searching methods, which have contributed profoundly to the success of computer vision. The fast library for approximate nearest neighbors (FLANN) search engine [23] is used in this work.
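For reference, a minimal sketch of the kind of FLANN usage assumed throughout the graph knitting step is given below: an approximate index is built over the cluster centers, and every node queries its k nearest neighbors. The index and search parameters (tree count, number of checks) and the row-major layout are illustrative choices, not the paper's settings.

// Minimal sketch: approximate k-NN queries over the cluster centers with FLANN.
#include <flann/flann.hpp>
#include <vector>

void knnOverClusters(std::vector<float>& centers,   // row-major: numOfClusters x dims
                     size_t numOfClusters, size_t dims, size_t k,
                     std::vector<int>& nnIndices,   // output: numOfClusters x k
                     std::vector<float>& nnDists)   // output: numOfClusters x k
{
  flann::Matrix<float> data(centers.data(), numOfClusters, dims);
  flann::Index<flann::L2<float> > index(data, flann::KDTreeIndexParams(4));
  index.buildIndex();

  nnIndices.assign(numOfClusters * k, -1);
  nnDists.assign(numOfClusters * k, 0.0f);
  flann::Matrix<int> indices(nnIndices.data(), numOfClusters, k);
  flann::Matrix<float> dists(nnDists.data(), numOfClusters, k);

  // Each node queries its own k nearest neighbors; the first hit is the node
  // itself and is skipped when the edge list is built.
  index.knnSearch(data, indices, dists, k, flann::SearchParams(128));
}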

An information graph is constructed once a reasonable representation of the underlying data is obtained. The two major challenges are the determination of neighbors, and the capture of the classification patterns in edge weights. First, neighbors are determined such that the whole hypothesis is covered while passively enforcing regularity, and planarity in the graph. The neighbors are determined in two steps, superficial search, and exclusive search, presented in Parts I, and II of Algorithm 1, respectively. Part III of the algorithm controls the skewness of the graph. Second, a two-fold pattern-capturing edge weight scheme is presented in Eq. 2.

The first challenge of neighbor determination is addressed in two stages, superficial neighbor search (SNS), and exclusive neighbor search (ENS), presented in Parts I, and II, respectively. In the algorithm, the number of neighbors is nominally controlled by an input parameter, the number of desired neighbors. For every node, a counter in Line 1 of Part I is used to track the size of the neighbor list. However, the neighbor list of a node can be terminated before all desired neighbors are added, by updating a boolean flag to TRUE in Line 4 of Part I. By limiting the number of neighbors in this way, the construction of the graph can be controlled.

Part I is used to look for nearest neighbors, regardless of the target class. The algorithm takes as input an empty graph, formed from the set of cluster centers obtained after the clustering step. This set is the search space passed to the FLANN space indexing utility in Lines 8, and 9. The input scaling constant is explained later in Eq. 1. The objective of this part is to fill the graph with a set of edges, similar to the construction of a K-nearest neighbor graph (KNNG). However, an input parameter, MAX_SAME_CLASS_NEIGH, is used to limit the number of same-class neighbors in Line 13, so that the remainder of the edges for a node are constructed towards nodes of the opposite target class. Nodes for which all neighbors are found vary depending on the characteristics of the data; they are excluded from the next parts. An additional quantity, the reach, is computed in Line 15 and maintained for every node. The metric presented in Eq. 1 is similar to the Hausdorff distance [19]. The distance utility of FLANN, in Line 10, is used to compute the summation in Eq. 1, whereas a scaling constant (r) controls the reach according to

r_i = r \sum_{j \in n_i} \lVert x_i - x_j \rVert     (1)

where r is the scaling constant, r_i is the reach of the i-th node, x_i and x_j are the positions of the i-th and j-th nodes, and n_i denotes the set of same-class neighbors of the i-th node.

Input : G, number of desired neighbors, scaling constant r, MAX_SAME_CLASS_NEIGH

Output : G - with E, the set of edges, stored in it

1: 
2:
3:
4: neighbor list is NULL for each
5:
6:
7: 
8:
9:
10: FLANN instance with all nodes
11:
12: ANN search
13: 
14:for all  do
15:     for all  do
16:         if  and  then
17:               add to neighbor list of
18:               update reach of
19:              
20:                        
21:         if not  then
22:              
23:                             
24:     
25:     if  then
26:          neighbor searching finished for      
Algorithm 1, Part I : Superficial Neighbor Search

In order to capture the classification patterns completely, it is necessary to make edges along the hypothesis of the classification data. So, Part II extends neighbor searching exclusively to nodes of the opposite target class. A node is considered only if it still has a remainder requirement of neighbors, i.e. its neighbor list is not yet full. In this part, nodes of class 2 form the search space in Line 3, in which neighbors are searched for class 1 nodes. This step, along with the vice-versa case, forms one iteration of the ENS procedure. For each iteration, a search space of the opposite target class is constructed in Line 3.

Input : G, class_1_nodes, class_2_nodes, reach[], NEIGH_LIMIT

Output : G - updated graph with the set of new edges added to E

1: 
2:
3:
4: FLANN instance with only class 2 nodes
5:
6: ANN search
7: 
8:
9:for all  not  do
10:     for all  do
11:         if  and  then
12:               add to neighbor list of
13:              
14:                        
15:         if  then
16:               doesn’t belong to convex hull               
Algorithm 1, Part II : Exclusive (other class) Neighbor Search

Inclusion of a neighbor after the space search in Line 5 is more stringent than in SNS. An input parameter, NEIGH_LIMIT, and the computed reach are used to limit the availability of a node as a prospective neighbor in Line 9. For every node added as a neighbor, a counter, initiated in Line 6, is incremented. The reach is used to further update the boolean flag of a node in Line 13, even if its neighbor list is not yet full. Such a node tends to be an internal node of a target class. In other words, the reach only encourages the convex hull nodes of one class to choose neighbors among nodes of the opposite class, aiding the planar construction of the graph.

The counter is required in Part II to avoid a node that might habitually come up as a prospective neighbor despite not being very representative of a target class; such a node otherwise leads to a graph that is skewed with respect to node degree. The filter in Line 3 of Part III is used to reduce the search space of every target class, controlling the skewness of the constructed graph.

Input : G, counter[], NEIGH_LIMIT

Output :

1: 
2:
3:for all  do
4:     if  then
5:          added in new nodes list      
6:
Algorithm 1, Part III : Search/Indexing Space Reduction

The second challenge is to capture the classification patterns in the edge weights, for which a two-fold edge weight scheme is designed. First, each node measures its internal pattern as the absolute difference of its target class from one of the two target classes. Second, every pair of nodes forming an edge measures the external pattern as the relative difference of their target classes. The individual contributions are added via the power scheme in Eq. 2 to weigh the edge according to

(2)

where w_ij is the weight of the edge between nodes i and j, tc_i denotes the target class of the i-th node, c_i, and c_e are constants for the internal, and external classification patterns, respectively, and quantities within |·| are absolute values.
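The snippet below sketches one plausible instantiation of the two-fold measure; the simple additive combination and the reference class value are assumptions made for illustration, since the paper combines the internal and external terms through its own power scheme with the constants c_i and c_e.

// Illustrative sketch of a two-fold edge weight in the spirit of Eq. 2.
#include <cmath>

double edgeWeight(double tcI, double tcJ,   // fractional target classes of nodes i and j
                  double classRef,          // one of the two target class values (assumption)
                  double ci, double ce)     // internal and external pattern constants
{
  // Internal pattern: how far each node's averaged class is from a pure class.
  const double internal = std::fabs(tcI - classRef) + std::fabs(tcJ - classRef);
  // External pattern: relative difference of the two nodes' target classes.
  const double external = std::fabs(tcI - tcJ);
  return ci * internal + ce * external;     // additive combination (assumption)
}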

The use of a state-of-the-art implementation like FLANN for neighbor searching cannot be overemphasized. For instance, in a typical run case, the approximate neighbor searching method of FLANN was 1000 times faster than the exact algorithm for nearest neighbor searching, which involves, for every node, computing distances to all the remaining nodes, and then sorting them to determine the nearest neighbors. The run-time advantage is clear when comparing the quadratic complexity of exact graph construction to that of the approximate methods offered by FLANN [23].

Graph shedding scheme

Once a weighted graph is obtained, an edge cut based filtering presented in Algorithm 2 separates the training dataset into relevant, and non-relevant parts. For every node, the neighbor list is iterated to check for a significant edge in Line 5, and when one is found, that node is added to the relevant list in Line 11. This leads to the training set selection that the second type of approach aims to achieve. Note that the result of this step depends on the characteristics of the graph, such as how well connected it is, and how well the underlying classification patterns have been captured; Algorithm 1 together with the edge weight scheme addresses these issues.

Input : G, EDGE_CUT

Output :

1: 
2:
3:for all  do
4:     
5:     for all  do
6:         if  then
7:              
8:                             
9:     if not  then
10:          remove the node from
11:     else
12:          add to potential critical list      
Algorithm 2 : Graph Shedding
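A minimal C++ sketch of this filter is given below; the adjacency-list representation (per-node lists of (neighbor, weight) pairs) is illustrative rather than the paper's data structure.

// Sketch of graph shedding: keep a node when at least one incident edge
// weighs at least EDGE_CUT.
#include <utility>
#include <vector>

std::vector<int> graphShedding(
    const std::vector<std::vector<std::pair<int, double>>>& adj,  // per-node edge list
    double edgeCut)
{
  std::vector<int> relevant;
  for (int v = 0; v < static_cast<int>(adj.size()); ++v)
  {
    for (const auto& [nbr, w] : adj[v])
    {
      if (w >= edgeCut)           // significant edge found
      {
        relevant.push_back(v);    // add v to the potential critical list
        break;
      }
    }
  }
  return relevant;
}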

The role of the nominal VC dimension carries over to the pruned graph as well. Since it controls the granularity of the underlying data, it also controls the granularity of the data selection. That gives the heuristic an essential advantage in terms of limiting data shrinkage while selecting the relevant data points. Because the selection is done via clusters, and the ratio of numOfPoints to numOfClusters is typically large, many data points that are not very close to the hypothesis boundary of the classification patterns are also selected. That gives an extra buffer of data points upon which another selection method of either type can be applied. The majority of classification algorithms, for example neural methods, Gaussian processes, et cetera, can use the heuristic. Up to and including this step, the presented heuristic's aim matches that of the existing approaches of the first, and second type. For comparison purposes, the heuristic up to and including this step is referred to as GSH for the remainder of this work. The edge cut used for GSH is referred to as the GS edge cut.

Graph clubbing scheme

Formulation

This step extends the problem statement of training set selection to further breaking the reduced training set into a few partitions or critical chunks, each of which can be trained independently by virtue of Parts I to III of Algorithm 3. The main aim is to design an approximate formulation that is theoretically faster, even in serial execution. The algorithm divides the training set into a few partitions such that the number of computations in the training phase is reduced significantly. Consider that the order of complexity of most classification algorithms is higher than linear in the number of training points. The graph clubbing scheme does not change this order; however, it results in a significant reduction in total computations. For example, for a classification algorithm with quadratic complexity, the original number of computations is proportional to the square of the training set size; if the graph clubbing scheme yields four equal-sized partitions/critical chunks, the number of computations drops to a quarter of the original number.
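For concreteness, assuming a classifier whose training cost is c n^2 for n training points, splitting the reduced set into four equal partitions gives

4 \cdot c \left(\frac{n}{4}\right)^{2} \;=\; \frac{c\,n^{2}}{4},

i.e. a quarter of the original count; more partitions, or a complexity order above quadratic, increase the saving further.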

The independence of each obtained partition during the training phase is mainly due to the directional aspect of the algorithm, which is achieved via the edge weight scheme. The directional aspect is responsible for two objectives, namely obtaining equally-sized partitions, and ensuring orthogonality of the hypothesis boundary with the neighboring partitions' boundaries. The two ways in which the edge weight scheme is leveraged for the directional aspect are the priority aspect of the partial weighted matching (PWM) algorithm, Part I, and the re-assessment aspect of the coarsening formulation, Part II. The graph clubbing algorithm, Part III, ties Parts I, and II together in an iterative scheme. Each of the obtained partitions can now be trained independently, giving further leverage for a nominal number of worker processes.

The partial weighted matching (PWM), Part I, is designed for the weighted graph obtained from the graph knitting step. The obtained matching is partial because edges weighing less than EDGE_CUT (input parameter) are filtered out in Line 4. So only cluster points closer to the hypothesis boundary are chosen for training. Sorting in Line 6, before ordered matching in Line 9, adds the weighted aspect to the matching. It enables the heaviest edges to be picked earliest for contraction, subtly addressing both the main objectives of the directional aspect. Since the heaviest edges cover the classification patterns, prioritized selection of them results in uniform size of partitions. Prioritized selection also means that the most significant patterns are given preference, which conversely means the least significant patterns are avoided. So the hypothesis boundary, along which the least significant patterns reside, is orthogonal to the contracted edges, where the most significant patterns reside. Higher prediction accuracies are obtained because of the preference of contraction of the heaviest edges.

Input : G, EDGE_CUT

Output : M - matching of the graph G

1: 
2:
3:for all  do
4:     for all  do
5:         if  then
6:                             
7: sort all collected edges
8:
9:
10:for all  do
11:     if not and not  then
12:          add edge to matching
13:         
14:               
Algorithm 3, Part I : Partial Weighted Matching
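The sketch below illustrates the matching idea: edges below the cut are discarded, the rest are sorted by decreasing weight, and endpoints are matched greedily so that the heaviest edges are offered for contraction first. The edge and matching types are illustrative.

// Sketch of partial weighted matching (partial: below-cut edges dropped;
// weighted: heaviest edges matched first).
#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

using Edge = std::tuple<double, int, int>;  // (weight, u, v)

std::vector<std::pair<int, int>> partialWeightedMatching(
    std::vector<Edge> edges, int numNodes, double edgeCut)
{
  // Partial: drop edges below the cut, so only nodes near the hypothesis remain.
  edges.erase(std::remove_if(edges.begin(), edges.end(),
                             [edgeCut](const Edge& e) { return std::get<0>(e) < edgeCut; }),
              edges.end());
  // Weighted: the heaviest edges are offered for contraction first.
  std::sort(edges.begin(), edges.end(),
            [](const Edge& a, const Edge& b) { return std::get<0>(a) > std::get<0>(b); });

  std::vector<bool> matched(numNodes, false);
  std::vector<std::pair<int, int>> matching;
  for (const auto& [w, u, v] : edges)
  {
    if (!matched[u] && !matched[v])
    {
      matching.emplace_back(u, v);
      matched[u] = matched[v] = true;
    }
  }
  return matching;
}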

The coarsening formulation of Part II applies the directional aspect, as edge contraction occurs in this part. It is to be noted that the coarsening formulation differs in aim from other researched formulations. Most of the popular formulations are intended for reducing communication cost, or for preserving the global structure while obtaining a low-cost representation of the data [30]. Furthermore, unlike Kernighan-Lin, and other matching-based coarsening objectives, the presented optimization objective is deterministic in execution.

In this part, the re-assessment of the target class, and of the edge list, in Lines 6, and 10, respectively, for newly contracted nodes augments the standard coarsening step in Line 3. By using different values of c_i, and c_e for initial versus re-assessment edge weights, a precedence among the kinds of edges is established in the matching scheme. Original heavy edges are prioritized over re-assessed edges of newly contracted nodes, achieving the orthogonality property. The transition of coarsening from original edges to newly contracted nodes can be captured by virtue of a drastic decrease in the graph cost metric, presented in Eq. 3,

c_g = \sum_{e \in E_p} w_e     (3)

where c_g is the cost of the graph G, e is an edge, E_p denotes the relevant subset of E (the set of all edges), and w_e is the weight of edge e.

One technical choice is to use MIN in Line 4 for identifying the new node that results from the contraction of an edge, i.e. the contracted node takes the smaller of the two endpoint node indices.

Input :

Output : coarsened graph with re-assessed edge weights

1: 
2:
3:for all  do
4:      standard graph coarsening step
5:      add new critical node
6:     
7:      re-evaluating target class of new critical node
8:for all  do
9:     for all  do
10:         if  then
11:               re-assess edge weights               
Algorithm 3, Part II : Graph Coarsening (with edge weight reassessment)
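A minimal sketch of one contraction pass is shown below. The MIN rule for the new node and the re-evaluation of its target class follow the description above; the size-weighted average and the data structures are illustrative assumptions, and the merge of edge lists with re-assessed weights is only indicated in a comment.

// Sketch of one coarsening pass over a matching.
#include <algorithm>
#include <utility>
#include <vector>

struct Node {
  double tc    = 0.0;  // fractional target class
  double size  = 1.0;  // number of original clusters merged into this node
  bool   alive = true; // false once merged into another node
};

void contractMatching(std::vector<Node>& nodes,
                      const std::vector<std::pair<int, int>>& matching)
{
  for (const auto& [u, v] : matching)
  {
    const int keep = std::min(u, v);   // MIN rule for the new critical node
    const int drop = std::max(u, v);
    Node& a = nodes[keep];
    Node& b = nodes[drop];
    // Re-evaluate the target class of the new critical node (size-weighted average).
    a.tc   = (a.tc * a.size + b.tc * b.size) / (a.size + b.size);
    a.size += b.size;
    b.alive = false;
    // The edge list of 'drop' would be merged into 'keep' here, with the
    // weights of the affected edges re-assessed using the re-assessment
    // constants (c_i, c_e).
  }
}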

Parts I, and II are executed in the iterations of Part III, in Lines 5, and 6, respectively. In Line 7, kink detection in the graph cost is used as a termination condition of the iterative algorithm. However, a maximum number of coarsening iterations, i.e. MAX_NUM_OF_COARSENING_ITER in Line 4, is used in the majority of tests. This step concludes the formulation of the heuristic, which is referred to as GCH for the remainder of this work. Similarly, its edge cut is referred to as the GC edge cut.

Input : G, EDGE_CUT

Output : iteratively coarsened graph

1: 
2: graph cost equation, Eq. 3
3:
4:
5:while  do
6:      Part I
7:      Part II
8:      current graph cost
9:     if  then backward difference gradient
10:         
11:     else
12:          march forward
13:               
Algorithm 3, Part III : Graph Clubbing

A few implementation optimizations improved the run-time. First, the starting nodes in Line 2 of Part I are the ones identified as relevant in GSH. After each iteration, half of the nodes belonging to contracted edges are removed in the update of the relevant nodes list (from Algorithm 2). As a result, the complexity of the matching algorithm reduces with the coarsening iterations. Second, the neighbor list of a node is kept sorted. As a result, edge contraction is linear (in complexity) in the sizes of the neighbor lists, being computationally equivalent to the sorted union of two lists. Lastly, the usual numerical optimizations, such as masking to avoid dynamic memory allocations, and indexing (at the expense of memory) for searching, are used.
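The sorted-union observation corresponds directly to std::set_union, as the small sketch below shows; the int node identifiers are illustrative.

// Merging two sorted neighbor lists during edge contraction: linear in their
// combined length, exactly the sorted union of two lists.
#include <algorithm>
#include <iterator>
#include <vector>

std::vector<int> mergeNeighborLists(const std::vector<int>& a,  // sorted
                                    const std::vector<int>& b)  // sorted
{
  std::vector<int> merged;
  merged.reserve(a.size() + b.size());
  std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(merged));
  return merged;  // sorted; a neighbor common to both lists appears once
}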

Network design

An event-driven, multi-process algorithm is designed to distribute the partitions obtained after GCH in a communication-free manner. In the network application, processes assume a (single) master or (multiple) worker role. Parts I, and II of Algorithm 4 describe the master's, and the worker's side of the event-handling design, respectively.

The master process, which executes Part I, communicates one partition from the list of partitions to the requesting worker process (the one which issues the DATA_REQUEST event in Line 3 of Part II) by triggering the DATA event in Line 6. The worker process, upon receiving the DATA event, proceeds to training with the received data partition in Line 5 of Part II. After collecting acknowledgements of the completion of training of all data partitions in Line 9, the master process terminates every worker by issuing the TERM_TRAIN event. The design implements a round-robin scheme, which balances the load of the network queries. That is accomplished by having a queue data structure for recording DATA_REQUEST events in Line 3. It is to be noted that a conflict of two simultaneous entries is resolved by time stamps, making the queue fair with respect to a worker's request.

Input :

Output : number of trained hypotheses

1: 
2:
3:
4:upon event DATA_REQUEST do
5:     if  then
6:         
7:         trigger event DATA  sending a part to the requesting worker               
8:upon event DONE_TRAINING do
9:     
10:     if  then
11:         for all  do
12:              trigger event TERM_TRAIN                              
Algorithm 4, Part I : Master Process
1:
2:while not  do
3:     trigger event DATA_REQUEST       
4:     upon event DATA do
5:          training of received part      
6:     trigger event DONE_TRAINING       
7:     upon event TERM_TRAIN do
8:               
Algorithm 4, Part II : Worker Process

Instead of directly using TCP channels for message communication, the distributed messaging API ZeroMQ [31] is used. It provides essential safety, and liveness properties on the network channels. However, apart from message guarantees such as liveness, and the safety property of ‘once only message delivery’, ZeroMQ is rudimentary compared to higher-level message passing libraries. This gives an opportunity to design, and optimize various aspects of the architecture. One such aspect is the messaging protocol. A couple of messaging protocols are designed, as shown in Figure 1. Protocol 2 implements a single float/double entry (in a character array, or CA) messaging scheme, whereas protocol 1 first requires marshaling all entries of a data point before messaging it.
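The sketch below shows the request/reply pattern underlying the worker's side of this exchange, assuming the cppzmq binding (zmq.hpp); the endpoint, the message strings, and the way the next request doubles as the training acknowledgement are illustrative simplifications of the event design in Algorithm 4.

// Minimal worker-side sketch over ZeroMQ (cppzmq).
#include <zmq.hpp>
#include <string>

void workerLoop(const std::string& masterEndpoint)   // e.g. "tcp://localhost:5555"
{
  zmq::context_t ctx(1);
  zmq::socket_t sock(ctx, zmq::socket_type::req);
  sock.connect(masterEndpoint);

  while (true)
  {
    // Ask the master for a partition; in this sketch the next DATA_REQUEST
    // also serves as the DONE_TRAINING acknowledgement of the previous one.
    const std::string request = "DATA_REQUEST";
    sock.send(zmq::buffer(request), zmq::send_flags::none);

    zmq::message_t reply;
    const auto ok = sock.recv(reply, zmq::recv_flags::none);
    if (!ok)
      break;
    const std::string msg(static_cast<const char*>(reply.data()), reply.size());
    if (msg == "TERM_TRAIN")
      break;                          // master has no partitions left
    // Otherwise 'msg' carries a serialized data partition: train a classifier
    // on it before requesting the next one.
  }
}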

Figure 1: Messaging protocols. Protocol 2 (single entry) versus protocol 1 (marshaling) for the data of a point.

Another aspect that was tested is the connection time of the network. Connection time is measured on the master process, and includes the following steps:

  • Start of TCP channels (wrapped in the API)

  • Initialize a hash table

  • Receive connection request from all worker processes

  • Send connection confirmation to all worker processes

The connection procedure requires step 2 for maintaining worker processes' information, giving an opportunity to optimize this step as per the need. A lightweight hash table, and hash key are designed which generate unique keys for the worker processes. The design helps to reduce the overhead of starting, and running the multi-process application.

Pre-processor for the testing phase

Unlike the application of GSH, which results in a single training set, a few data partitions are obtained after GCH, and the training set is the union of these partitions. This means that there are as many classifiers as partitions. Hence, it is necessary to determine which classifier to choose for predicting a point from the testing dataset. Algorithm 5, nearest hypothesis search (NHS), is used for this task. A search space is formed consisting of the nodes of the coarsened graph in Line 1. The ANN search for the nearest hypothesis follows in Line 5.

Input : testing dataset, graph after GCH

Output : nearest hypothesis for every point in the testing dataset

1: 
2:
3: 
4:
5: FLANN instance
6:
7: only nearest hypothesis
Algorithm 5 : Nearest Hypothesis Search
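A minimal sketch of this pre-processor is given below: each test point queries its single nearest node of the coarsened graph via FLANN and inherits that node's partition id, which selects the classifier to use. The partitionOfNode lookup and the row-major layout are assumptions for illustration.

// Sketch: map each test point to the partition of its nearest coarsened node.
#include <flann/flann.hpp>
#include <vector>

std::vector<int> nearestHypothesis(std::vector<float>& coarseNodes,   // numNodes x dims (row-major)
                                   size_t numNodes, size_t dims,
                                   std::vector<float>& testPoints,    // numTest x dims (row-major)
                                   size_t numTest,
                                   const std::vector<int>& partitionOfNode)
{
  flann::Matrix<float> data(coarseNodes.data(), numNodes, dims);
  flann::Index<flann::L2<float> > index(data, flann::KDTreeIndexParams(4));
  index.buildIndex();

  std::vector<int> nn(numTest);
  std::vector<float> dist(numTest);
  flann::Matrix<float> queries(testPoints.data(), numTest, dims);
  flann::Matrix<int> nnMat(nn.data(), numTest, 1);
  flann::Matrix<float> distMat(dist.data(), numTest, 1);
  index.knnSearch(queries, nnMat, distMat, 1, flann::SearchParams(64));  // only the nearest hypothesis

  std::vector<int> partition(numTest);
  for (size_t i = 0; i < numTest; ++i)
    partition[i] = partitionOfNode[nn[i]];   // classifier to use for test point i
  return partition;
}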

Once the nearest hypothesis is determined, prediction of the target class for testing data points follows. This added step before the testing phase only takes about 5 - 7% of the run-time of the testing phase for SVM class of algorithms, as will be shown in the ‘Results’ section.

Results

Results are presented in two major divisions: first, tests on the parameter space of the heuristic, and second, tests gauging the performance of the heuristic. All tests were conducted on a variety of datasets.

Parameter space of heuristic

In this set of tests, the focus is on working details, and exemplifying steps of the heuristic. Datasets similar to that shown in Figure 2, with parameters summarized in Table 1 are extensively used.

Figure 2: Near linearly-separable dataset.
Parameter Value
numOfPoints 30000
dimensions 2
numOfClusters 300
GS edge cut
GC edge cut
Table 1: Dataset parameters.

Node reach, and ENS

Tests in this section present heuristic tools that capture original classification data into the weighted graph. These tools are designed to handle real datasets, which vary diversely in characteristics. A mix of real, and synthetic datasets that mimic varying characteristics are considered.

A timeline of the ENS procedure is presented in Figure 3. It is based on the dataset of Table 1; however, class 2 data points were intentionally translated to create a separation, which is very typical of real data. It is evident that the connectivity of the graph increases with more iterations. Skewness control of the constructed graph, explained in Part III of Algorithm 1, was carried out at the end of the ENS iterations, resulting in a reduction of the nodes available for search, as shown in Table 2.

Figure 3: Exclusive neighbor search. The number of ENS iterations is 0, 2, and 4 in (a), (b), and (c), respectively. Zoomed view of the dataset in Figure 2.
# iterations # class 1 nodes # class 2 nodes
0 143 157
2 143 157
4 139 153
Table 2: Search space reduction.

A second way to control connectivity is fine-tuning of the reach equation, presented in Eq. 1. In the next test, the scaling constant is varied, and the results are presented in Figure 4. Three cases are shown in succession: under-reach (a), ideal-reach (b), and over-reach (c). Even in the over-reach case, inner nodes are not able to make opposite-class neighbors, enforcing the planar construction of the graph.

Figure 4: Reach controlled graph knitting. Three values of the scaling constant are used in (a), (b), and (c), respectively. Zoomed view of the dataset in Figure 2.

The ENS procedure was next run on a real dataset, the cuff-less blood pressure estimation dataset from the UCI machine learning repository [32, 33]. It is a three-attribute, 12000-instance, real-valued, multivariate dataset. Results are presented in Figure 5, with Subfigure (a) showing the dataset, (b) the cluster centers, and (c), and (d) the constructed graph without, and with the ENS procedure, respectively. The latter graph is connected because the necessary edges are present between opposite-class nodes, covering the entire hypothesis. Class imbalance is analysed next, and the results are presented in Table 3. Column 2 shows the original number of data points for each class, and column 3 shows the number of data points after ENS, as presented in the graph of Figure 5(d). The imbalance of target classes, quantified by the standard deviation (SD), is significantly reduced after ENS.

Procedure # class 1 points # class 2 points SD
None 1743 10256 4256.5
GSH 1661 2001 170.0
Table 3: Class imbalance.
Figure 5: Graph knitting procedure. Scaled UCI dataset, clustered data, constructed graph with no, and 1 ENS iteration in (a), (b), (c), and (d), respectively. Dashed edges in (d) are formed after ENS procedure.

GSH

The aim of this test is to exemplify the two-fold edge weight scheme, shown in Eq. 2, and the role of the edge weights in the outcome of GSH, summarized in Figure 6. As the classification patterns in the underlying data get more confused, in succession in Subfigures (a), and (c), more edges are weighted significantly, resulting in the selection of a bigger training set by GSH.

Figure 6: Reduction of training set. Classification data is more confused in (c) compared to (a). (b), and (d) show cluster nodes that were selected for (a), and (c) cases, respectively.

GCH

This test exemplifies the formulation of the graph clubbing scheme. Although the edge weight scheme is important in the application of GSH, it is primarily designed to play a crucial role in the priority/directional aspect of the coarsening objective. The constants for the initial edge weights were kept significantly higher than their re-assessment counterparts, as shown in Table 4. Two cases of re-assessment were considered, namely case I, and case II.

Parameter Initial Re-assessment (Case I) Re-assessment (Case II)
Table 4: Pattern measure constants.

Since the initial constants were kept significantly higher than their re-assessment counterparts, original nodes were contracted first in both cases, as can be seen in Figures 7(a), and (b). The transition to contraction of re-assessed type nodes is reflected in the graph cost characteristic in Figure 8 for case I. After a certain iteration, the slope magnitude decreases significantly because all original nodes, which have significantly higher edge weights dictated by the heavier initial constants, have been contracted. Edge contraction continues to aggressively club the graph for case I, including the re-assessed nodes. This results in fewer partitions compared to case II in Figure 7(b). Clubbing is almost shut off in case II because of the trivial re-assessment constants. Note that, in general, the more coarsening iterations, the fewer the partitions.

Figure 7: Partitioning of training set. Number of coarsening iterations is 2, and 10 for (a), and (b), respectively. In each Subfigure, the first plot is for the re-assessment case I, whereas the overlaying plot with shifted origin is for the case II.
Figure 8: Graph cost function with coarsening iterations.

Performance evaluations

Several tests were conducted to evaluate performance of the presented methods, including GSH, serial GCH, and distributed GCH. In addition, the network architecture for distributed GCH is evaluated.

Dataset I was a synthetically constructed two-dimensional, near-linearly separable dataset with the parameters shown in Table 5. Dataset II was a similar dataset, but in place of a (nearly) straight separating hyperplane, a spherical separating hyperplane of radius 0.2 was employed; otherwise, it uses the same parameters as Dataset I. A dataset from the UCI machine learning repository, the skin segmentation dataset [33, 34], is the next dataset, called Dataset III for the remainder. This classification dataset has four numeric features, and 245k observation instances. Table 6 summarizes the nominal VC dimension, and other relevant parameters.

Parameter Values
Range of # data points testing 3k - 300k
training 1k - 100k
# iterations coarsening 10
clustering 3 - 5
ENS 0
4
3
Range of 10 - 1000
1.0 - 2.0
Table 5: Dataset I & II parameters.
Parameter Values
# data points test 122529
training 122528
# iterations coarsening 0
clustering 5
ENS 0
5
4
500
1.0 - 2.0
Table 6: UCI dataset, called Dataset III parameters.

Note that the computation setup was kept consistent. The programming language of all codes is C/C++, for which the computation time was measured; that includes every step of the presented heuristic, and the classification algorithms. Lastly, time stamping was carried out on an otherwise idle system.

GSH

The main aim of the next set of tests is to compare the training phase using GSH against the state-of-the-art shrinking heuristic of LIBSVM. The classification algorithms used in these tests were all variants of LIBSVM's SMO implementation [10], which is a method of the first type.

Two testing variables are evaluated. First, the computation run-time of GSH together with the classification algorithm on the reduced training data is compared against that of the same classification algorithm augmented with the shrinking heuristic. Second, the prediction accuracy of the learned models obtained from both setups is compared.

Results are presented in the following way. Each of Figures 9 - 15 is for an SVM classifier; these include C-SVM, and nu-SVM with linear, polynomial, and rbf kernels. Subfigure (a) is used to present the computation run-time results, and (b) is used for the prediction accuracy comparison. The training dataset size was geometrically varied from 1k to 200k, and the testing dataset size was kept at three times that of the training. Accordingly, the corresponding parameter was varied geometrically in the range given in Table 5. Note that the training time is included in the ‘LIBSVM shrinking heuristic’, and ‘GSH’ plots in Subfigure (a), but is not explicitly labeled, for convenience. A third line-plot, denoted ‘SVM training (post GSH)’, is presented in Subfigure (a) to separate the computation run-time of GSH from the subsequent training.

Figure 9: C-SVM with linear kernel.

C-SVM with linear kernel was the first classification algorithm tested, as presented in Figure 9. GSH scales even for a low number of points (10k), as shown in Subfigure (a); however, the scaling becomes more evident with more training data points. Furthermore, it is visible that GSH took only a small fraction of the total time, as indicated by the separation between the second, and third line plots. This highlights the scalability issue of the SVM formulation. In Subfigure (b), the prediction accuracy of reduced training closely follows that of full training, clearly visible beyond 20k testing data points.

Figure 10: C-SVM with polynomial kernel.

C-SVM with polynomial kernel was the second classifier tested, as presented in Figure 10. The observations here follow those of the earlier case. One noticeable difference, however, is a better run-time profile than in the previous case, because of the higher run-time complexity of the polynomial kernel classifier compared to the linear kernel presented in Figure 9.

Figure 11: nu-SVM with linear kernel.

For the second SMO implementation, nu-SVM, Figures 11, and 12 summarize linear, and polynomial kernel cases, respectively. Note that the training phase with nu-SVM in Figures 11(a), and 12(a) is significantly slower compared to C-SVM in Figures 9(a), and 10(a). Furthermore, it is observed that reduced training can result in improved prediction accuracy, as can be seen in Figure 12(b). For the polynomial kernel, GSH performed better than native training by about 6%.

Figure 12: nu-SVM with polynomial kernel.

Dataset I was used in all the tests until now, and near-perfect accuracy plots support that it is an ideal dataset. A more realistic case, Dataset II, was considered for Figures 13 - 15.

Figure 13: C-SVM with linear kernel.

Again, C-SVM with linear kernel was the first classification algorithm tested, as presented in Figure 13. However, the prediction accuracies are in a lower range than before, as shown in Subfigure (b). Nevertheless, both setups are practically identical in prediction accuracy, and similar run-time improvements are observed in Subfigure (a). For nu-SVM with linear kernel, presented in Figure 14, the scaling observations are similar to those of the corresponding Figure 11, which used Dataset I. Note that the prediction accuracies are only about 50%, which is the expected value for random selection between two target classes. However, the main argument here is not the absolute prediction accuracy, but rather its closeness for both setups. To improve the prediction accuracy, results were obtained for a radial basis function (rbf) kernel, which is widely considered a robust kernel type for SVM classifiers, and are presented in Figure 15. This resulted in slower learning, as observed in Subfigure (a), but near perfect prediction accuracy, as shown in Subfigure (b).

Figure 14: nu-SVM with linear kernel.
Figure 15: nu-SVM with rbf kernel.

The next set of tests was conducted on Dataset III, to evaluate the performance of the heuristic on real data, and the barplots in Figure 16 are used to present the findings. The computation run-time was scaled to accommodate the different classification algorithms, namely C-SVM, and nu-SVM with linear, polynomial, and rbf kernels. For all six cases, the reported prediction accuracy is low, as shown in Table 7, but close for both setups. The scaling improvements are consistent with the previous reports for Datasets I, and II.

Figure 16: Fraction of LIBSVM training time. (a), (b), and (c) are for C-SVM, whereas (d), (e), and (f) are for nu-SVM, with linear, polynomial, and rbf kernels, respectively.
Classification algorithm LIBSVM GSH
C-SVM linear 58 58
polynomial 58 58
radial 59 59
nu-SVM linear 52 58
polynomial 58 57
radial 55 48
Table 7: Prediction accuracy (%).

Overall, scalability improvements are observed consistently over a number of classifiers for Dataset I, II, and III. Furthermore, the prediction accuracy either closely follows original training or even outperforms it in a few instances.

Serial GCH

The next set of tests aims to compare the run-time of serial GCH to that of GSH. More precisely, serial execution of GCH, which represents approximate learning, is compared against (reduced training data) GSH learning. In GCH, the graph clubbing step is an added cost over GSH. Results are presented for Dataset I.

The number of iterations was set to 10, and used as the termination condition for the graph clubbing algorithm. The pre-processor was used, and the collected prediction accuracies were weight-averaged over all reported partitions. The edge weight constants given in Table 8 were used throughout.

Parameter Initial Re-assessment
Table 8: Pattern measure constants for GCH.

Only GSH, and GCH are compared, along with the respective line-plots of the subsequent training, for all the classifiers presented in Figures 19, and 20. However, for the first classifier, C-SVM with linear kernel, shown in Figure 17, the computation run-time of LIBSVM with the shrinking heuristic is also plotted. An accuracy plot, Subfigure (b), and a plot of the number of partitions, in Figure 18, are also presented.

Figure 17: C-SVM with linear kernel.
Figure 18: Partitions after GCH.

In Figure 17(a), a clear scalability hierarchy is observed, with serial GCH faster than GSH, which in turn is faster than the LIBSVM shrinking heuristic. In the accuracy plot, shown in Subfigure (b), approximate learning closely follows native learning from 8k training data points onwards. However, for fewer points, the accuracy of the model trained after GCH, which is 95%, is not as good as that of native training. Between 2 and 5 partitions were obtained, as shown in Figure 18. For the polynomial kernel classifiers, presented in Figures 19, and 20, scaling advantages of serial GCH over GSH are observed even for lower numbers of data points (5k).

Figure 19: C-SVM with polynomial kernel.
Figure 20: nu-SVM with polynomial kernel.

Distributed GCH

The main aim here was to demonstrate the run-time advantage with worker processes for the communication-free training scheme discussed earlier. The scheme only aims for coarse-grained parallelization, such that individual data chunks/partitions can be distributed to worker processes.

Results are presented in Figure 21 for the C-SVM with polynomial kernel classifier on Dataset I. The distribution invariably depends on the number of obtained partitions. Therefore, the scaling results in Subfigure (a) cannot be interpreted without the plot of the number of partitions, which is shown in Subfigure (b).

Figure 21: Scalability with workers (a), and obtained partitions (b).

Significant improvements are observed with more workers. However, the distribution is sub-optimal, as can be seen in Figure 22, where the mean number of data points per chunk/partition is plotted as a function of the total number of data points. Using the number of partitions/chunks from Figure 21(b), it is straightforward to interpret the monotonic increase of the plot. In a hypothetical case where the partitions are all equally sized, the SD would be zero. However, as the total number of data points increases, the SD error bars are observed to be commensurate with the mean number of data points. The unequal sizes of the partitions are the apparent reason for the sub-optimal distribution.

Figure 22: Mean size of partitions with SD errorbars.

Network design

The main motivation for this set of tests was to evaluate various implementation aspects of the network that implemented distributed GCH. The first aspect is the connection time of the distributed application. Figure 23 shows connection time measurements with the number of workers.

Evidently, the connection time is very small (120 ms) when compared to the run-time of the training phase. The connection procedure was replicated on another distributed messaging API, MPICH 3.3 [35], and the connection time with ZeroMQ was observed to be 1.53 times faster than with MPICH 3.3.

Figure 23: Performance of the designed connection procedure

The next aspect of the network is the messaging protocols. The primary motivation behind the protocol design is to utilize the shape of a data point, which has an n-dimensional feature vector, and a one-dimensional target class value (tc). The two protocols discussed previously were compared in run-time. A couple of opposing factors are at play between these protocols. Data marshaling is an extra step in protocol 1 (before the actual messaging) when compared to protocol 2. On the other hand, the message count/traffic increases (multiple times, governed by the dimensionality of the data) in protocol 2 compared to protocol 1. Overall, protocol 1 performed better, as can be seen in Figure 24. Although increased marshaling may result in even better scaling, a larger memory footprint on the channel is inadvisable, as it can result in an overflow scenario.
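A sketch of the marshaling idea behind protocol 1 is shown below: the n feature values and the target class of one point are packed into a single character buffer and sent as one message, instead of one entry per message as in protocol 2. The buffer layout is illustrative.

// Pack all feature entries and the target class of one data point into a
// single character-array message (protocol 1).
#include <cstring>
#include <vector>

std::vector<char> marshalPoint(const std::vector<double>& x, double tc)
{
  std::vector<char> buf((x.size() + 1) * sizeof(double));
  std::memcpy(buf.data(), x.data(), x.size() * sizeof(double));
  std::memcpy(buf.data() + x.size() * sizeof(double), &tc, sizeof(double));
  return buf;  // the receiver unpacks n doubles followed by tc
}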

Figure 24: Evaluation of messaging protocols. Protocol 1 is used in this work.

Messaging efficiency was tested in the context of the prediction/testing phase of the distributed implementation. As reported in Figure 25, it is very efficient, at 1% of the prediction time; it is found to be even more efficient, at 0.5%, for the training phase. The pre-processing run-time is at 7%, as can also be seen in Figure 25.

Figure 25: Messaging, and pre-processing. Run-time of message communications, and pre-processing measured in percentage of the testing phase.

Note that external libraries are used in the formulation, including a clustering method, an approximate nearest neighbor implementation, two messaging APIs, and many classification algorithms. The heuristic execution is not very sensitive to the parameters of these external implementations. For instance, only 3 - 5 iterations of clustering were needed, and the input parameters for the approximate nearest neighbor, and classification algorithms were kept at their defaults. However, the heuristic is sensitive to the parameters which are part of the heuristic steps.

Conclusions

In this work, a three-part algorithm, and an edge weight scheme are designed to construct a weighted graph that effectively captures the classification patterns of a dataset. Another three-part graph coarsening algorithm with a directional aspect is designed, which divides the reduced dataset into partitions that can be trained independently using the designed communication-free network.

GSH provides an evident serial scalability advantage, and its generic applicability holds for every classifier tested. This encourages extrapolating the argument, and claiming that GSH will hold for a multitude of classification algorithms. Approximate learning is the reason behind serial GCH's added advantage over GSH. A better model training throughput can be achieved via the use of both of these approaches.

Distribution of the chunks/partitions of the training set to worker processes results in further scaling improvements. Even for a nominal number of worker processes (say four), available on almost all current computing platforms, parallel performance benefits can be achieved. However, unequal coarsening, and control over the number of partitions still need to be further investigated in the current implementation of GCH.

First, a possible direction of investigation is the directional aspect of the objective. That is, if the partitioning were better directed, better orthogonality between the contour of the underlying classification pattern, and the partition boundaries would be obtained. This condition is necessary for the correctness (in prediction accuracy) of this approximate procedure. The two-fold pattern measure scheme should be designed to obtain the desired control over the direction of coarsening.

A second possible investigation is modifying the objective function of the current graph coarsening scheme. Parts of traditional coarsening objectives can be added to the current objective. Since popular objectives predominantly optimize the size division of partitions, this would address the unequal coarsening problem in the presented objective. Furthermore, the number of partitions is directly controlled in these objectives.

All approaches of the presented heuristic, namely GSH, and serial, and distributed GCH, are effective in reducing the training run-time of a classification algorithm. Furthermore, the scaling benefits come with no compromise in prediction accuracy.


Declarations

Availability of data and materials

Dataset I, and II used during the current study are available from the corresponding author on a reasonable request. Furthermore, two real datasets used are available in the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/skin+segmentation; https://archive.ics.uci.edu/ml/datasets/Cuff-Less+Blood+Pressure+Estimation. Other resources with respect to the manuscript are shared at https://sumedhyadav.github.io/projects/graph_heuristic.

Tests were performed with the following computing specifications:

  • Operating system(s): Ubuntu 16.04 LTS

  • Programming language: C/C++, python (for visualization)

  • Visualization tools: matplotlib, and networkx modules in python

  • External Libraries: FLANN, mlpack, LIBSVM

  • Version requirements: g++ 5.4.0, Python 2.7.12, matplotlib 1.5.1, networkx 2.2

Competing interests

The authors declare that they have no competing interests.

Funding

Not applicable

Author’s contributions

This work is joint research of SY and MB. SY did all the implementation, and ran all test cases. All authors read, and approved the final manuscript.

Acknowledgements

SY acknowledges Gautam Kumar’s help with respect to preparing the manuscript.

Authors’ information

Not applicable

References