Improve High Level Classification with a More Sensitive metric and Optimization approach for Complex Network Building

10/23/2021
by   Josimar Chire, et al.
Universidade de São Paulo

Complex Networks are a good approach to find internal relationships and represent the structure of the classes in a dataset; they are then used for High Level Classification. Previous works use K-Nearest Neighbors to build each Complex Network considering all the available samples. This paper introduces a different way of building Complex Networks, considering only the samples that belong to each class. A metric is used to analyze the structure of the Complex Networks, and an optimization approach to improve the performance is presented. Experiments are executed with a cross-validation process, and the optimization is performed using grid search and a Genetic Algorithm; this process can improve the results by up to 10%.

I Introduction

Complex Networks (CN) can represent the internal relationships of complex systems [Barabasi99, barabasi2016network, newman10, chiresaire2020new]. While a CN is being built, the process can capture existing connections between entities/objects of one class. Once the network is built, it is possible to check whether an element belongs to a class by means of a closeness metric. This is an advantage over traditional Machine Learning algorithms for classification tasks, e.g. Multilayer Perceptron or Support Vector Machine, which need an optimization process to adjust an error function.

In classification tasks it is common to assume that each feature, or variable, contributes to the final classification label in a different proportion. It is then possible to assign a weight to each column or variable, and the search for these weights can be rewritten as an optimization problem whose objective is to maximize or minimize a metric. This problem can be solved using Evolutionary Algorithms, e.g. Genetic Algorithm (GA), which are well known to be useful for this kind of problem.

A Complex Network can also be built by other methods; the state of the art includes b-matching [Jebara2009], linear neighborhood [Wang2008] and methods based on single linkage [Cupertino2013].

This paper proposes a new way of building a Complex Network, adding an optimization criterion that considers the weight contribution of each feature. The contributions of this work are:

  • New approach to build a Complex Network considering class independence

  • New metric considering shortest path in a graph

  • Improving Complex Network Building using an Optimization scheme

II Methods

This paper presents the building process of the Complex Networks and an optimization approach that tunes their structure to improve the performance. Fig. 1 summarizes these steps.

Fig. 1: Proposal methodology

II-A Preparing dataset

The variables or columns of the dataset can have different scales, so a scaling step is performed to bring them to comparable ranges. An oversampling step is then used to balance the dataset so that each class has a similar number of samples.
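The preparation step can be sketched as follows. The text does not say which scaling or oversampling technique is used, so min-max scaling and random oversampling with replacement are assumptions made only for illustration, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def min_max_scale(rows):
    """Rescale every column to [0, 1] (constant columns map to 0.0)."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate random samples until every class matches the largest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Toy data: two columns on very different scales, three samples vs. one.
rows = [[1.0, 200.0], [2.0, 400.0], [3.0, 300.0], [5.0, 600.0]]
labels = [0, 0, 0, 1]
scaled = min_max_scale(rows)
balanced_rows, balanced_labels = random_oversample(scaled, labels)
```

After this step both columns lie in [0, 1] and each class has three samples, so no single variable or class dominates the Euclidean distances used later.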

II-B Building Complex Networks

The dataset is split into two parts following the classic train-test process, although this proposal has no training process; only a building process is carried out. The samples are separated per class, and a Complex Network is created for each class: each sample is a node, and every pair of nodes is joined by an edge weighted with their Euclidean distance, which yields a fully connected graph.

After the creation of each Complex Network (CN), a pruning process can be executed on its adjacency matrix. This step removes the edges/connections whose weight is higher than a constant factor times the median of the weights, in order to tune the structure. The value of this factor was set after several experiments.
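A minimal sketch of the building and pruning steps, under stated assumptions: the edge-dictionary representation, the function names and the default `factor=1.0` are hypothetical choices for illustration, not values taken from the paper.

```python
import math
from statistics import median


def build_class_network(samples, factor=1.0):
    """Return {(i, j): distance} for one class, pruned at factor * median."""
    edges = {
        (i, j): math.dist(samples[i], samples[j])
        for i in range(len(samples))
        for j in range(i + 1, len(samples))
    }
    if not edges:
        return edges
    threshold = factor * median(edges.values())
    # Keep only the edges that are not heavier than the threshold.
    return {pair: w for pair, w in edges.items() if w <= threshold}


def build_networks(rows, labels, factor=1.0):
    """Build one independent network per class label."""
    per_class = {}
    for row, label in zip(rows, labels):
        per_class.setdefault(label, []).append(row)
    return {label: build_class_network(s, factor) for label, s in per_class.items()}


# Three close points and one far away: only the three short edges survive.
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
network = build_class_network(samples, factor=1.0)
```

In this toy case the distant point loses all of its edges, which suggests that an aggressive factor can disconnect the graph; any later step that relies on a spanning tree would have to cope with that.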

II-C Evaluation of Structure

Previous proposals used traditional CN metrics, e.g. neighborhood degree, assortativity and others. Their results and discussion leave room for a more sensitive metric. Therefore, after the CNs are built, their Minimum Spanning Trees are calculated.

Consider an undirected graph $G = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges that connect them, i.e. $(u, v) \in E$, with an associated cost $w(u, v)$ for each edge. A subset of edges $T \subseteq E$ that connects all the vertices without cycles and whose total weight is minimum is called a minimum spanning tree (MST) [cormenbook]:

$$ w(T) = \sum_{(u, v) \in T} w(u, v) \tag{1} $$

The MST is thus the minimum-weight set of edges that joins all the nodes, and the sum of the edge weights that compose the MST is used as the metric that represents the structure of the CN. This step is necessary to measure the current structure of each CN.
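The metric can be illustrated with a short sketch. Kruskal's algorithm is used here as one standard way to obtain the MST weight; the paper does not state which MST algorithm it uses, and `mst_weight` is a hypothetical helper name.

```python
def mst_weight(num_nodes, edges):
    """Kruskal's algorithm: total weight of a minimum spanning tree.

    edges maps (u, v) node-index pairs to non-negative weights. If the graph
    is disconnected, the result is the weight of a minimum spanning forest.
    """
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    total = 0.0
    for (u, v), w in sorted(edges.items(), key=lambda item: item[1]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle
            parent[root_u] = root_v
            total += w
    return total


# Triangle with one expensive edge: the tree keeps the two cheap edges.
edges = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0}
print(mst_weight(3, edges))  # 3.0
```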

II-D Insertion and Classification

The test set of the original dataset is used to decide whether a sample/element belongs to a class, following this reasoning: "an element that belongs to a class will produce less impact after insertion than an element of another class".

Each element is therefore inserted into each CN, and the distances between the new element and all the elements of that CN are calculated. After the insertion, a new tree MST' is calculated for each CN. The difference between MST and MST' is then calculated for each CN, and the minimum value determines the class to which the element belongs.
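The insertion rule can be sketched as below. To keep the example short it works on fully connected graphs, so the pruning step is omitted; that simplification, the use of Prim's algorithm and the function names are assumptions made for illustration.

```python
import math


def mst_weight(points):
    """Prim's algorithm on the fully connected Euclidean graph of points."""
    if len(points) < 2:
        return 0.0
    # best[i] is the cheapest known edge from the growing tree to node i.
    best = {i: math.dist(points[0], points[i]) for i in range(1, len(points))}
    total = 0.0
    while best:
        nearest = min(best, key=best.get)
        total += best.pop(nearest)
        for i in best:
            best[i] = min(best[i], math.dist(points[nearest], points[i]))
    return total


def classify(sample, class_points):
    """Assign the sample to the class whose MST weight changes the least."""
    impact = {
        label: mst_weight(points + [sample]) - mst_weight(points)
        for label, points in class_points.items()
    }
    return min(impact, key=impact.get), impact


class_points = {
    "a": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "b": [[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]],
}
label, impact = classify([0.5, 0.5], class_points)
print(label)  # a
```

The new point sits among the samples of class "a", so its insertion adds only about 0.12 to that MST, while it adds more than 6 to the MST of class "b".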

An additional step to improve the performance is the optimization of the structure.

II-E Optimization of the Building

Since features contribute at different levels or have different importance, the proposal assigns a weight to each feature and finds these levels of importance through an optimization approach. The objective is to maximize the precision of the proposal.

(2)

In this proposal, a Genetic Algorithm was used to find these weights.
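A minimal sketch of how a Genetic Algorithm could search for feature weights. The operators (tournament selection, uniform crossover, Gaussian mutation, elitism), the hyperparameters and the toy fitness function are all assumptions; in the paper's setting the fitness would be the precision obtained with the weighted features.

```python
import random


def genetic_search(fitness, n_features, pop_size=20, generations=30, seed=0):
    """Tiny GA with tournament selection, uniform crossover and mutation."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]

    def tournament(scored):
        # The fitter of two randomly drawn individuals becomes a parent.
        first, second = rng.sample(scored, 2)
        return first[1] if first[0] >= second[0] else second[1]

    def mutate(gene):
        # With probability 0.2 add Gaussian noise, then clip to [0, 1].
        if rng.random() < 0.2:
            gene = min(1.0, max(0.0, gene + rng.gauss(0.0, 0.1)))
        return gene

    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population), reverse=True)
        next_gen = [scored[0][1]]  # elitism: the best individual survives as is
        while len(next_gen) < pop_size:
            mother, father = tournament(scored), tournament(scored)
            child = [m if rng.random() < 0.5 else f for m, f in zip(mother, father)]
            next_gen.append([mutate(g) for g in child])
        population = next_gen
    return max(population, key=fitness)


def toy_fitness(weights):
    # Hypothetical stand-in for the precision of the classifier: it peaks
    # when the first feature has weight 1 and the second has weight 0.
    return 1.0 - (weights[0] - 1.0) ** 2 - weights[1] ** 2


best = genetic_search(toy_fitness, n_features=2)
```

Because of elitism the best fitness never decreases from one generation to the next, so the returned weights are at least as good as the best individual of the initial random population.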

III Results

III-A Artificial Dataset

An artificial dataset is created from several figures, e.g. spirals and stars. In addition, the classes overlap in order to challenge the classification algorithms; see Fig. 2.

Fig. 2: Visualization of Dataset (7 classes) and results

A comparison is performed with classical Machine Learning algorithms, e.g. Multilayer Perceptron, Decision Tree, Logistic Regression, Naive Bayes, Gradient, bagging and boosting.

Fig. 3: Experiment Results

Following the approach to optimize the building of the CNs, a grid search is performed. It is important to highlight the time that this process can take.
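The cost of the grid search can be made concrete with a small sketch. The grid of three candidate weights per feature and the toy fitness function are assumptions for illustration; the point is only that the number of evaluations grows exponentially with the number of features.

```python
from itertools import product


def grid_search(fitness, n_features, grid=(0.0, 0.5, 1.0)):
    """Evaluate every weight combination; cost is len(grid) ** n_features."""
    best_weights, best_score, evaluations = None, float("-inf"), 0
    for weights in product(grid, repeat=n_features):
        score = fitness(weights)
        evaluations += 1
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score, evaluations


def toy_fitness(weights):
    # Hypothetical stand-in for the precision of the classifier.
    return 1.0 - (weights[0] - 1.0) ** 2 - weights[1] ** 2


best_weights, best_score, evaluations = grid_search(toy_fitness, n_features=2)
print(best_weights, evaluations)  # (1.0, 0.0) 9
```

With two features the grid needs only 9 evaluations, but with 36 variables the same three-value grid would need 3 ** 36, roughly 1.5e17, evaluations, which is why a heuristic search such as a Genetic Algorithm becomes attractive.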

Fig. 4: Optimization results vs previous results

III-B Datasets

The chosen datasets are available in [romano2021pmlb]. A search selected datasets with at least three hundred samples per class, numerical data and at least two classes: magic, satimage, sleep and phoneme. A brief description of the datasets is presented in Tab. I. The website provides filter bars to select datasets, but it was necessary to go through the pages and descriptions of the datasets to find suitable ones for the experiments.

Dataset PMLB
Name      Samples per class                                      Variables  Classes
magic     0: 12332, 1: 6688                                      10         2
satimage  1: 1533, 2: 703, 3: 1358, 4: 626, 5: 707, 7: 1508      36         7
sleep     0: 21359, 1: 9052, 2: 52698, 3: 10832, 5: 11967        13         5
phoneme   0: 3818, 1: 586                                        5          2
TABLE I: Dataset Description

The experiments use cross-validation with k=10, and the results are presented in Fig. 5 for the proposal both without and with optimization.
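For reference, a k-fold split with k=10 can be sketched as follows. The text does not say whether the folds are shuffled or stratified, so the shuffled, non-stratified split below is an assumption, and `k_fold_indices` is a hypothetical helper name.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each sample appears in exactly one test fold. In this proposal the "train" part of every split would be used to build the per-class networks, and the "test" part would supply the elements to insert.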

Fig. 5:

III-C Analysis of the results

Considering the results of Section III-A, the proposal outperforms the other algorithms if only the minimum values are checked. In addition, its median is higher and the limits of its boxplot are better. These results show that the proposal can perform well. After adding the optimization approach, the limits of the boxplot narrow and the proposal performs better than all the other algorithms. This is related to the contribution of each variable to the classification task.

On the other hand, the results of Subsection III-B show that the proposal obtained the best values on the satimage and phoneme datasets, where the optimization approach can improve the results by up to 10%. On the magic and sleep datasets, however, the proposal is among the last three positions. In spite of these results, the optimization approach can still improve them by up to 10%.

IV Conclusion

In conclusion, separating the classes and creating an independent CN for each one can give good results, and the experiments support this claim. The metric introduced, based on the MST of each CN, helps the global performance. Finally, the optimization approach can improve the results of the original proposal by up to 10%.

References