A Network-Based High-Level Data Classification Algorithm Using Betweenness Centrality

09/16/2020 ∙ by Esteban Vilca, et al. ∙ Universidade de São Paulo

Data classification is a major machine learning paradigm, which has been widely applied to solve a large number of real-world problems. Traditional data classification techniques consider only physical features (e.g., distance, similarity, or distribution) of the input data. For this reason, they are called low-level classification techniques. On the other hand, the human (animal) brain performs both low and high orders of learning, and it easily identifies patterns according to the semantic meaning of the input data. Data classification that considers not only physical attributes but also the pattern formation is referred to as high-level classification. Several high-level classification techniques have been developed, which make use of complex networks to characterize data patterns and have obtained promising results. In this paper, we propose a pure network-based high-level classification technique that uses the betweenness centrality measure. We test this model on nine different real datasets and compare it with nine other traditional and well-known classification models. The results show competitive classification performance.


1 Introduction

Machine learning can be defined as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict the future data, or perform other kinds of decision making under uncertainty [10].

Usually, machine learning is divided into three main paradigms: supervised learning, unsupervised learning, and semi-supervised learning [9]. Supervised learning uses labeled data to detect patterns and predict future cases. According to the kind of label, the prediction task is called classification for categorical labels and regression for numerical labels. Unsupervised learning algorithms explore the data in search of structures that can be used to tag it. For example, on social media, users do not necessarily provide certain information, such as political preferences; however, an unsupervised algorithm may detect it from the collected data [9]. Semi-supervised learning is a combination of supervised and unsupervised learning. The quantity of labeled data is usually small because tagging is expensive, so semi-supervised algorithms use a small amount of labeled data to predict the untagged data [13].

Data classification is one of the most important topics in supervised learning. It aims at generating a map from the input data to the corresponding desired output, for a given training set. The constructed map, called a classifier, is used to predict new input instances.

Many algorithms use only physical features (e.g., distance, similarity, or distribution) for classification. These are called low-level classification algorithms. Such algorithms can achieve good classification results if the training and testing data are well-behaved, for example, when the data follow a normal distribution, but they have difficulty classifying data with complex structures. On the other hand, a high-level algorithm treats the data as an interacting system, exploiting the structure of the data to capture patterns. In this way, it can perform classification according to the pattern formation of the data, such as cycles, high link density, assortativity, network communication (betweenness), and so on, instead of just measuring physical features like the Euclidean distance.

In order to capture the structure and properties of the data, we propose to work with complex networks, which are defined as large-scale graphs with nontrivial connection patterns [1]. Many network measures have been developed, and each one of them characterizes the network structure from a particular viewpoint. In the category of degree-related measures, we have the density, which represents how strongly the nodes are connected [4], and the assortativity, which represents the attraction between nodes of similar degree (degree correlation) [12]. In the category of centrality measures, we have the betweenness centrality, which measures the importance of a node for communication on the network [8], the closeness vitality, which measures the impact on network communication when a node is removed [4], and so on.
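
For illustration, the snippet below computes these measures with networkx on a standard example graph; the library and the graph are our choices and are not part of the paper's implementation.

import networkx as nx

# Network measures mentioned above, computed on a small example graph.
G = nx.karate_club_graph()

print(nx.density(G))                           # how strongly the nodes are connected
print(nx.degree_assortativity_coefficient(G))  # degree correlation (assortativity)
print(nx.betweenness_centrality(G)[0])         # importance of node 0 for communication
print(nx.closeness_vitality(G)[0])             # impact on communication if node 0 is removed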

Several high-level algorithms have been proposed that use network measures to perform classification. The impact measure approach tries to reduce the variation of a measure once a new node is inserted into a network [5]; the link prediction approach uses a meta-class node to represent each label and performs the classification using link prediction techniques [7]; and the importance measure approach exploits the PageRank algorithm for classification [3].

The technique proposed in this work captures the structure of the data using just one measure: betweenness centrality. This measure captures the importance of a node for the communication of the graph. Nodes with low betweenness tend to lie on the periphery; on the contrary, nodes with high betweenness tend to be focal points [4]. Instead of focusing on the node-insertion impact or on the preservation of the structure through many network measures, we focus on the structure generated once a new node is inserted and identify whether this inserted node presents features similar to the other nodes in the new network. Also, unlike other methodologies that require a classical algorithm such as SVM (Support Vector Machine) to complete the high-level classification [17], our methodology uses pure network measures to classify. This approach shows good performance, avoids the double calculation of the impact measure method, reduces the number of properties to be used, and does not require combination with other classical techniques.

2 Model Description

In this section, we describe the working mechanism of our model. First, we give an overview of the training and classification phases of the model. Then we provide details about each step of the algorithm. Finally, we describe how the betweenness measure is used in the model.

2.1 Overview of the Model

Each complex network consists of a set of nodes (or vertices) and a set of links (or edges) between pairs of nodes. The input data set of $n$ elements contains two parts: the attributes and the labels.

Formally, the data set is $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i$ represents the attributes and $y_i$ the label of instance $i$. Each $y_i \in L$, where $L$ is the set of possible labels. The goal of the classifier is to predict the label $y_i$ of an instance $x_i$; this can be seen as a function approximation problem, where the function is our algorithm. To evaluate the model, the data is split into a training set $\mathcal{D}_{train}$, used to build the model, and a testing set $\mathcal{D}_{test}$, used for evaluation.

In the training phase, we build a complex network using the training data set. The instances of the data set become the nodes, and the links represent the similarity between these nodes. Therefore, we have $G = (V, E)$, where $V$ is the set of nodes and $E$ is the set of links of the complex network $G$. The links can be created using the $\epsilon$-radius and $k$NN criteria, or using personalized relation metrics such as friendship in social data, flight routes, or city connections.

The network $G$ is built using $\mathcal{D}_{train}$ to produce the nodes $V$, with the $\epsilon$-radius and $k$NN criteria as the relation metric for the links $E$. Then, we remove the links between nodes with different labels. Following this strategy, we obtain one network component $G_j$ for each label in $L$.

In the testing phase, we insert a node from $\mathcal{D}_{test}$ into each component $G_j$ following the same $\epsilon$-radius and $k$NN rules of the training phase. Then, we calculate the betweenness centrality of this node in each $G_j$. This measure is compared to the betweenness of the other nodes of each network component $G_j$, and the differences are saved in a separate list for each $G_j$.

Finally, we take the average of the $b$ lowest values of each list and classify the new node into the component $G_j$ with the lowest average. Then, we remove this node from the other components. In the case that the average differences of two or more lists are equal, we use the number of links connecting the new node to each component as a tie-breaking measure.

2.2 Network-Based High-Level Classification Algorithm Using Betweenness Centrality (NBHL-BC)

The proposed high-level classification algorithm, which will be referred to as NBHL-BC, has four parameters: $k$, $p$, $b$, and $\lambda$, where $k$ is the number of neighbors used in the $k$NN criterion, $p$ is the percentile of the distances in $\mathcal{D}_{train}$ used to calculate the $\epsilon$-radius, $b$ is the number of nodes with similar betweenness used for classification, and $\lambda$ is the weight that balances the number of links and the betweenness centrality.

During the training phase, we build the network $G = (V, E)$ using $\mathcal{D}_{train}$. Each node in $V$ corresponds to one instance of $\mathcal{D}_{train}$, and each link in $E$ is defined following these two techniques:

$$
\mathcal{N}(x_i, y_i) =
\begin{cases}
\epsilon\text{-radius}(x_i, y_i), & \text{if } |\epsilon\text{-radius}(x_i, y_i)| > k\\
k\text{NN}(x_i, y_i), & \text{otherwise}
\end{cases}
\qquad (1)
$$

Where $(x_i, y_i)$ represents a pair formed by a data instance $x_i$ and its corresponding label $y_i$. For each instance $x_i$, $\mathcal{N}(x_i, y_i)$ is the set of nodes to be connected to it, i.e., its neighborhood. $\epsilon\text{-radius}(x_i, y_i)$ returns the set of nodes whose similarity with $x_i$ is beyond a predefined value $\epsilon$ and that have the same class label, where similarity is measured by a distance function $d(\cdot, \cdot)$ such as the Euclidean distance. $k\text{NN}(x_i, y_i)$ returns the set containing the $k$ nearest neighbors of $x_i$ with the same label. The value of $\epsilon$ is the $p$ percentile of the distances in the sub graph of the corresponding class. Note that the $\epsilon$-radius criterion is used for dense regions ($|\epsilon\text{-radius}(x_i, y_i)| > k$), while the $k$NN criterion is employed for sparse regions. With this mechanism, it is expected that each label will have an independent sub graph [17] [18] [5].
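
As an illustration, the following sketch implements this connection rule in Python, assuming a Euclidean distance and an $\epsilon$ already computed from the $p$ percentile of the within-class distances; the function and variable names are ours, not taken from the paper.

import numpy as np

def neighborhood(i, X, y, eps, k):
    """Connection rule of equation (1): eps-radius for dense regions,
    kNN fallback for sparse regions. Only same-label nodes are eligible."""
    same = [j for j in range(len(X)) if j != i and y[j] == y[i]]
    dist = {j: np.linalg.norm(X[i] - X[j]) for j in same}
    radius = [j for j in same if dist[j] <= eps]
    if len(radius) > k:                        # dense region: keep the eps-radius set
        return radius
    return sorted(same, key=dist.get)[:k]      # sparse region: k nearest same-label nodes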

Figure 1: Classification of a new instance (dark node) into the Iris dataset graph. Panels (a)-(c) show the node inserted into each class component, and panel (d) shows the final classification.

In Figure 1(d), we can see the graph where the sub graphs $G_1$, $G_2$, and $G_3$ are represented by the red, green, and blue nodes. In the testing phase, we insert each instance (dark node) into each component following the same rule of equation (1), assuming that the node belongs to that sub graph.

For example, in Figure 1 there are three network components and the node to be tested is inserted into each one. In Figures 1(a), 1(b), and 1(c), the node connects to its nearest neighbors with the same label. In this case, with $k = 5$, there are 4 neighboring nodes in $G_1$, 1 in $G_2$, and 0 in $G_3$. The number of nodes returned by the $\epsilon$-radius (with $\epsilon$ given by the median of the distances) is less than 5, because the inserted node presents a sparse behavior; for this reason, we use just the $k$NN criterion. Moreover, due to the same-label condition, the algorithm produces one sub graph for each possible label.

Now we calculate the betweenness centrality of the inserted node in each component $G_j$. Following this rule, the inserted node will have a different betweenness value for each component.

The betweenness centrality is a mixed (global and local) measure that captures how often a given node lies on the shortest paths between other nodes [4]. This measure captures the influence of a node on the communication of the network [11]. It captures not only the characteristics of a node but also the behavior of its neighborhood, so we have a metric that provides both local and global network characteristics. This metric is defined in equation 2.

$$
B(v) = \sum_{s \neq v \neq t} \frac{\mathbb{1}_{st}(v)}{\sigma_{st}}
\qquad (2)
$$

where $\mathbb{1}_{st}(v)$ is 1 when node $v$ is part of the geodesic (shortest) path from $s$ to $t$ and 0 otherwise, and $\sigma_{st}$ is the total number of shortest paths between $s$ and $t$.
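
For illustration, the snippet below computes this measure with networkx on a small path graph (an illustrative choice, not part of the paper's experiments):

import networkx as nx

# Betweenness centrality as in equation (2) on a 5-node path graph: the middle
# node lies on the most shortest paths. networkx also offers a normalized version.
G = nx.path_graph(5)
bc = nx.betweenness_centrality(G, normalized=False)
print(bc)   # {0: 0.0, 1: 3.0, 2: 4.0, 3: 3.0, 4: 0.0}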

Then, we calculate the difference of this measure between the inserted node and the other nodes in each component $G_j$. This step is shown on line 14 of Algorithm 2. Algorithm 1 shows how the node is inserted, which is why it presents a different betweenness value in each sub graph.

1:function NodeInsertion(x, G, k, p)
2:     index ← |V(G)| + 1                         ▷ index is the number of nodes in the graph + 1
3:     ε ← percentile p of the distances between the nodes of G
4:     N ← ∅                                      ▷ neighborhood of the new node
5:     for v ∈ V(G) do
6:         if d(x, v) ≤ ε then                    ▷ ε-radius criterion
7:             N ← N ∪ {v}
8:         end if
9:     end for
10:     if |N| ≤ k then N ← kNN(x, G) end if      ▷ sparse region: fall back to kNN (equation 1)
11:     add node index to G with links to every node in N
12:     return G
13:end function
Algorithm 1 Node Insertion

These differences are inserted into an independent list for each component $G_j$, and we calculate the average of the $b$ lowest values of each list. On line 19 of Algorithm 2, we can see how we take just the $b$ lowest elements of the list, previously sorted on line 16. The results are stored in an array where each entry $\bar{B}_j$ represents the average difference of the $b$ nearest betweenness values in the sub graph $G_j$. This process is represented on lines 27 and 28 of Algorithm 2.

$$
\hat{B}_j = 1 - \frac{\bar{B}_j}{\sum_{i=1}^{|L|} \bar{B}_i}
\qquad (3)
$$

where $\hat{B}_j$ is the normalized version of $\bar{B}_j$. In order to avoid ties between probabilities with the same value, we also calculate the number of links between the inserted node and each sub graph $G_j$, stored in the array $E$, where $E_j$ is the number of links to sub graph $G_j$. Then, we follow a process similar to equation 3 for normalization. This process is represented on line 29 of Algorithm 2.

$$
\hat{E}_j = \frac{E_j}{\sum_{i=1}^{|L|} E_i}
\qquad (4)
$$

Finally, once these values are normalized, we calculate the weighted sum of $\hat{B}_j$ and $\hat{E}_j$ and apply a final normalization.

$$
P_j = \frac{\lambda \hat{B}_j + (1-\lambda)\,\hat{E}_j}{\sum_{i=1}^{|L|} \left[ \lambda \hat{B}_i + (1-\lambda)\,\hat{E}_i \right]}
\qquad (5)
$$

where $P_j$ represents the probability of the node being inserted into the sub graph $G_j$, and $\lambda$ controls the weight between structural information and number of links. If $\lambda = 1$, we capture only information from the betweenness centrality, and if $\lambda = 0$, we capture only information about the number of links. The full procedure is described in Algorithm 2.
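
As an illustration, the sketch below combines the per-component average betweenness differences and link counts following our reading of equations (3)-(5); the function name and the exact normalization are assumptions, not the authors' code.

import numpy as np

def combine_scores(avg_diff, n_links, lam):
    """Combine the per-component betweenness differences and link counts.
    Lower average difference -> higher score (our reading of equations (3)-(5))."""
    avg_diff = np.asarray(avg_diff, dtype=float)
    n_links = np.asarray(n_links, dtype=float)
    b_hat = 1.0 - avg_diff / avg_diff.sum()    # equation (3)
    e_hat = n_links / n_links.sum()            # equation (4)
    p = lam * b_hat + (1.0 - lam) * e_hat      # weighted sum
    return p / p.sum()                         # equation (5): final normalization

# combine_scores([0.01, 0.20, 0.35], [4, 1, 0], lam=1.0) -> highest probability for class 0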

1:function Classification(x, G, k, p, b, λ)
2:     n ← |V(G)|                              ▷ n is the number of nodes in G
3:     G ← NodeInsertion(x, G, k, p)
4:     Bavg ← empty array                      ▷ average betweenness difference per sub graph
5:     E ← empty array                         ▷ number of links of x to each sub graph
6:     for Gj ∈ G do                           ▷ where each Gj is a subgraph
7:         NB ← empty list                     ▷ NB is a list of node betweenness differences
8:         compute B(v) for every node v of Gj
9:         bx ← B(x)                           ▷ betweenness of the inserted node in Gj
10:        for v ∈ V(Gj) do
11:            if v = x then
12:                continue
13:            end if
14:            append |bx − B(v)| to NB        ▷ B is betweenness centrality
15:        end for
16:        Sort(NB)                            ▷ NB has the differences between the nodes in Gj and the new node
17:        sum ← 0
18:        i ← 1
19:        while i ≤ b do
20:            sum ← sum + NB[i]
21:            i ← i + 1
22:        end while
23:        Bavg[j] ← sum / b
24:        E[j] ← number of links between x and Gj
25:        remove x and its links from Gj
26:    end for
27:    S ← sum of Bavg[i] over all i
28:    Bhat[j] ← 1 − Bavg[j] / S for each j    ▷ equation (3)
29:    Ehat[j] ← E[j] / (sum of E[i]) for each j   ▷ equation (4)
30:    P[j] ← λ·Bhat[j] + (1 − λ)·Ehat[j] for each j
31:    P ← P / (sum of P[i])                   ▷ equation (5)
32:    return MaxIndexValue(P)                 ▷ P has each class probability
33:end function
Algorithm 2 Classification Algorithm
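
For concreteness, the following Python sketch reimplements the main steps of Algorithms 1 and 2 with NumPy and networkx. It inserts the test node into each class component, averages the $b$ lowest betweenness differences, and combines the normalized scores with $\lambda$; the function names, the per-component insertion, and the normalization details follow our reading of the text and are not the authors' implementation.

import numpy as np
import networkx as nx

def build_component(X_class, k, p):
    """One per-class component built with the eps-radius / kNN rule (equation (1))."""
    n = len(X_class)
    g = nx.Graph()
    g.add_nodes_from(range(n))
    dists = np.linalg.norm(X_class[:, None, :] - X_class[None, :, :], axis=-1)
    eps = np.percentile(dists[np.triu_indices(n, k=1)], 100 * p)
    for i in range(n):
        neigh = [j for j in range(n) if j != i and dists[i, j] <= eps]
        if len(neigh) <= k:                                 # sparse region: kNN fallback
            neigh = [j for j in np.argsort(dists[i]) if j != i][:k]
        g.add_edges_from((i, int(j)) for j in neigh)
    return g, X_class, eps

def classify(x, components, k, b, lam):
    """Index of the component with the highest combined probability (Algorithm 2)."""
    diffs, links = [], []
    for g, X_class, eps in components:
        gi = g.copy()
        new = gi.number_of_nodes()                          # id of the inserted node
        d = np.linalg.norm(X_class - x, axis=1)
        neigh = [j for j in range(len(X_class)) if d[j] <= eps]
        if len(neigh) <= k:
            neigh = list(np.argsort(d)[:k])
        gi.add_node(new)
        gi.add_edges_from((new, int(j)) for j in neigh)
        bc = nx.betweenness_centrality(gi)                  # betweenness with the new node
        nb = sorted(abs(bc[new] - bc[v]) for v in gi.nodes if v != new)
        diffs.append(np.mean(nb[:b]))                       # average of the b lowest differences
        links.append(len(neigh))
    diffs = np.asarray(diffs)
    links = np.asarray(links, dtype=float)
    b_hat = 1.0 - diffs / diffs.sum()                       # lower difference -> higher score
    e_hat = links / links.sum()
    return int(np.argmax(lam * b_hat + (1.0 - lam) * e_hat))

# Usage sketch on a labeled training set (X_train, y_train) and one test point x_test:
# comps = [build_component(X_train[y_train == c], k=3, p=0.5) for c in np.unique(y_train)]
# label = np.unique(y_train)[classify(x_test, comps, k=3, b=3, lam=1.0)]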

3 Performance Tests on Toy Datasets

In this section, we present the classification performance of our algorithm on toy datasets and compare the results with those of other algorithms, using Python as the programming language and the Scikit-learn library for the algorithms [14]. Specifically, we test our algorithm against the Multi-Layer Perceptron (MLP) [15], the Decision Tree C4.5 (DT) [16], and the Random Forest (RF) [2]. The algorithms are tested using 10-fold cross validation, executed 10 times, and we use a grid search to select the hyper-parameters that give the best accuracy for each algorithm.

The toy datasets are Moons and Circle, with Gaussian noise of standard deviation 0.0 and 0.25 added to the data (Figure 2). The NBHL-BC parameter values are shown in Table 1, and the classification accuracy results are shown in Table 2. These datasets were used because they present clear data patterns on which traditional algorithms lose effectiveness. In the case of the Decision Tree, we use the Gini index as the quality measure and no pruning method. In the case of the Random Forest, we use the Gini index as the split criterion and 100 trees. In the case of the MLP, we use 2 hidden layers with 10 nodes each, with 100 iterations for the datasets without noise and 500 iterations for the datasets with noise.
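
A sketch of how this comparison can be set up with scikit-learn is given below; the sample size, the random seeds, and the omission of the grid search are our simplifications.

from sklearn.datasets import make_moons, make_circles
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data as in Figure 2; the sample size and random seeds are our choices.
X, y = make_moons(n_samples=500, noise=0.25, random_state=0)
# X, y = make_circles(n_samples=500, noise=0.25, factor=0.5, random_state=0)

models = {
    "MLP": MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=500, random_state=0),
    "DT": DecisionTreeClassifier(criterion="gini"),          # no pruning, as stated above
    "RF": RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=cv).mean())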

Figure 2: Synthetic datasets for testing: (a) Moons, noise 0.0; (b) Moons, noise 0.25; (c) Circle, noise 0.0; (d) Circle, noise 0.25.
Dataset       k    p    b    λ    Accuracy (%)
Moons 0.00    5   0.5   5   1.0   100.0
Moons 0.25    8   0.0  10   1.0    97.0
Circle 0.00   1   0.5   1   1.0   100.0
Circle 0.25   5   0.5   1   1.0    64.0
Moons 0.00    5   0.5   5   0.5   100.0
Moons 0.25    9   0.0  10   0.5    96.0
Circle 0.00   1   0.0   1   0.5   100.0
Circle 0.25   5   0.5   1   0.5    65.0
Moons 0.00    5   0.5   5   0.0   100.0
Moons 0.25    9   0.0  10   0.0    96.0
Circle 0.00   1   0.0   1   0.0   100.0
Circle 0.25   5   0.5   1   0.0    64.0
Table 1: Parameter values used by our algorithm (NBHL-BC) in Toy Datasets

We use $\lambda = 1.0$ in the first group because we want to evaluate just the structural methodology using the betweenness. In the second group, we combine both strategies with $\lambda = 0.5$ and obtain a small improvement on the Circle 0.25 dataset. In the last group ($\lambda = 0.0$), we use just the number of links and obtain similar results, but with a reduction of the accuracy on Moons 0.25. In some cases, we need to remove the $\epsilon$-radius criterion (setting $p = 0$) and increase the number of neighbors $k$, as for Moons with 0.25 noise. The number of similar nodes $b$ was kept the same across the three groups for each dataset because other values reduce the accuracy.

Dataset       MLP    DT     RF     NBHL-BC
Moons 0.00    94.0   95.0   98.0   100.0
Moons 0.25    84.0   85.0   91.0    97.0
Circle 0.00   90.0   92.0   91.0   100.0
Circle 0.25   62.0   56.0   56.0    65.0
Table 2: Classification accuracy of the NBHL-BC compared to Multi Layer Perceptron (MLP), Decision Tree (DT), and Random Forest (RF) in Toy Datasets

In these simulations, our algorithm presents the best results on all the datasets. Especially on the Circle dataset with noise 0.25, which is the most difficult case, our algorithm presents better classification accuracy than the other techniques under comparison.

4 Experimental Results on Real Datasets

In this section, we present the results of the NBHL-BC technique on UCI classification datasets [6] and compare them with those of other algorithms. We tested our algorithm against the Multi-Layer Perceptron (MLP) [15], the Decision Tree C4.5 (DT) [16], the Random Forest (RF) [2], and the Network-Based High-Level technique (NBHL) [5].

The algorithms are tested by splitting each dataset into two subsets, for training and testing, with proportions of 75% and 25% respectively, following stratified sampling, using Python as the programming language and the Scikit-learn library for the algorithms.
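
A minimal sketch of this evaluation protocol with scikit-learn, shown here on the Iris data (the random seed is our choice):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Stratified 75% / 25% split as described above.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)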

The datasets used are shown in Table 3, with their numbers of instances, attributes, and classes. These datasets were selected because the previous high-level algorithm (NBHL) used them. The NBHL-BC parameter values are given in Table 4, and the classification accuracy results are presented in Table 5.

Dataset    Instances   Attributes   Classes
Glass       214          9            6
Iris        150          4            3
Pima        768          8            2
Teaching    151          5            3
Wine        178         13            3
Yeast      1484          8           10
Zoo         101         16            7
Table 3: Information about the UCI classification datasets used in this project
Dataset     k    p    b    λ
Glass       1   0.0   1   1.0
Iris        7   0.0   3   1.0
Pima        8   0.0   4   1.0
Teaching    5   0.0   5   1.0
Wine       12   0.0   5   1.0
Yeast      14   0.0   3   0.5
Zoo         1   0.0   1   1.0
Table 4: Parameter values used by our algorithm (NBHL-BC) in UCI datasets
Dataset    MLP      DT       RF       NBHL     NBHL-BC
Glass      69.231   63.077   75.385   66.700   69.231
Iris       93.333   93.333   93.333   97.400   95.556
Pima       74.892   69.264   77.056   73.400   77.056
Teaching   52.174   52.174   60.870   55.300   65.217
Wine       96.296   92.593   98.148   80.000   98.148
Yeast      59.641   48.430   61.883   36.700   54.036
Zoo        96.774   96.774   96.774   100.00   100.00
Table 5: Classification accuracy results of the NBHL-BC compared to Multi-Layer Perceptron (MLP), Decision Tree C4.5 (DT), Random Forest (RF), and Network-Based High-Level Classification (NBHL) using the testing dataset.

Our algorithm presents good performance on all the datasets compared to the other algorithms. In four cases, our algorithm presents the best results. Only in the case of the Iris dataset is the other high-level classification algorithm, NBHL, better than the proposed one.

Moreover, the parameter $\lambda$, which regulates the weight between the betweenness measure and the number of links, is 1.0 in 6 of the 7 datasets, which means that the algorithm uses only the betweenness. For the Yeast dataset, $\lambda = 0.5$ was required, which gives the same importance to the betweenness and the number of links. In Table 6, we test the UCI Wine dataset [6] using 10-fold cross validation with different values of $\lambda$. The accuracy with only the number of links ($\lambda = 0.0$) is lower than with only the betweenness ($\lambda = 1.0$), and the best result is obtained by mixing both techniques with $\lambda = 0.4$. The parameter $b$, which gives the number of nodes with the lowest betweenness difference with respect to the inserted node, was kept constant.

Dataset   k    p    b    λ     Accuracy (%)
Wine      8   0.5   5   0.0    95.492
Wine      8   0.5   5   0.1    96.619
Wine      8   0.5   5   0.2    96.619
Wine      8   0.5   5   0.3    96.619
Wine      8   0.5   5   0.4    97.175
Wine      8   0.5   5   0.5    96.619
Wine      8   0.5   5   0.6    96.063
Wine      8   0.5   5   0.7    96.048
Wine      8   0.5   5   0.8    95.508
Wine      8   0.5   5   0.9    96.619
Wine      8   0.5   5   1.0    96.048
Table 6: Results of 10-fold cross validation on the UCI Wine dataset using the training data.

5 Conclusions

In this paper, we describe a new technique for high-level classification using the betweenness centrality property. We propose that the nodes with the most similar betweenness determine the component to which a new untagged instance belongs. This measure provides the importance of each node for the communication of its sub graph. We exploit this property to classify a new node into the sub graph that presents the most similar communication structure. We tested this algorithm on 4 toy datasets and 7 real datasets, and the results are promising.

As future work, we think that procedures are needed to reduce the noisy instances and attributes that could produce disconnected sub graphs. Also, a way to detect the best values of the parameters $k$, $p$, $b$, and $\lambda$ is needed, perhaps following an optimization approach such as particle swarm optimization.

References

  • [1] R. Albert and A. Barabási (2002-01) Statistical mechanics of complex networks. Rev. Mod. Phys. 74, pp. 47–97. Cited by: §1.
  • [2] L. Breiman (2001) Random forests. Machine Learning 45 (1), pp. 5–32. Cited by: §3, §4.
  • [3] M. Carneiro and L. Zhao (2018) Organizational data classification based on the importance concept of complex networks. IEEE Transactions on Neural Networks and Learning Systems 29, pp. 3361–3373. Cited by: §1.
  • [4] T. Christiano Silva and L. Zhao (2016) Machine learning in complex networks. Springer International Publishing. Cited by: §1, §1, §2.2.
  • [5] T. Colliri, D. Ji, H. Pan, and L. Zhao (2018-07) A network-based high level data classification technique. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §1, §2.2, §4.
  • [6] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §4, §4.
  • [7] S. A. Fadaee and M. A. Haeri (2019) Classification using link prediction. Neurocomputing 359, pp. 395 – 407. Cited by: §1.
  • [8] L. C. Freeman (1977) A set of measures of centrality based upon betweenness. Sociometry 40, pp. 35–41. Cited by: §1.
  • [9] A. Géron (2017) Hands-on machine learning with Scikit-Learn and TensorFlow: concepts, tools, and techniques to build intelligent systems. 1st edition, O'Reilly Media, Inc. Cited by: §1.
  • [10] K. P. Murphy (2013) Machine learning : a probabilistic perspective. MIT Press, Cambridge, Mass. [u.a.]. Cited by: §1.
  • [11] M. Needham and A.E. Hodler (2019) Graph algorithms: practical examples in apache spark and neo4j. O’Reilly Media, Incorporated. External Links: ISBN 9781492047681, LCCN 2019275313, Link Cited by: §2.2.
  • [12] M. E. J. Newman (2003-02) Mixing patterns in networks. Phys. Rev. E 67 (2), pp. 026126. Cited by: §1.
  • [13] A. A. Patel (2019) Hands-on unsupervised learning using python. Cited by: §1.
  • [14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §3.
  • [15] M. Riedmiller and H. Braun (1993) A direct adaptive method for faster backpropagation learning: the RPROP algorithm. In IEEE International Conference on Neural Networks, pp. 586–591, vol. 1. Cited by: §3, §4.
  • [16] J. Shafer, R. Agrawal, and M. Mehta (2000-08) SPRINT: a scalable parallel classifier for data mining. In VLDB. Cited by: §3, §4.
  • [17] T. C. Silva and L. Zhao (2012-06) Network-based high level data classification. IEEE Transactions on Neural Networks and Learning Systems 23 (6), pp. 954–970. Cited by: §1, §2.2.
  • [18] T. C. Silva and L. Zhao (2015) High-level pattern-based classification via tourist walks in networks. Information Sciences 294, pp. 109 – 126. Note: Innovative Applications of Artificial Neural Networks in Engineering Cited by: §2.2.