Complex Networks (CN) can represent the internal relationships of complex systems [Barabasi99, barabasi2016network, newman10, chiresaire2020new]. During the building of a CN, the process can capture the existing connections between entities/objects of one class. Then, after creation, it is possible to check whether an element belongs to a class, considering a closeness metric. This is an advantage over traditional Machine Learning algorithms for classification tasks, e.g., the Multilayer Perceptron and the Support Vector Machine, which need an optimization process to adjust an error function.
In classification tasks it is frequent to consider that each feature, or variable, contributes to the final classification label in a different proportion. A weight can therefore be assigned to each column or variable, and this can be rewritten as an optimization problem: the objective is to obtain a higher (or lower) metric by searching over these weights. This process can be solved using Evolutionary Algorithms, e.g., the Genetic Algorithm (GA), which are well known to be useful for this kind of problem.
The process of building a Complex Network can also be performed by other methods; in the state of the art are: b-matching [Jebara2009], linear neighborhood [Wang2008], and methods based on single linkage [Cupertino2013].
The proposal of this paper is to present a new way of building a Complex Network, adding an optimization criterion that considers the weight contribution of each feature. The contributions of this work are:
A new approach to build a Complex Network considering class independence
A new metric based on the shortest path in a graph
An improvement of the Complex Network building using an optimization scheme
This paper presents the building process of Complex Networks and an optimization approach to tune the structure and improve the performance. Figure 1 summarizes these steps.
II-A Preparing the Dataset
The dataset can have different scales for the variables or columns involved, so a scaling step is performed so that they have similar distributions. An oversampling step is then used to balance the dataset and obtain a similar number of samples per class.
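A minimal sketch of this preparation step, assuming min-max scaling and random oversampling of minority classes (the paper does not specify which scaler or oversampler was used, so both choices here are illustrative):

```python
import random

def min_max_scale(rows):
    """Scale each column to [0, 1] so all variables share a similar range."""
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]

def oversample(rows, labels, seed=0):
    """Duplicate random minority-class samples until all classes
    have as many samples as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(v) for v in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(y)
    return out_rows, out_labels
```

Any scaler that equalizes column ranges (e.g., standardization) would serve the same purpose here.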
II-B Building the Complex Networks
The dataset is split in two parts following the classic train-test process, but this proposal does not have a training process; only one building process is done. The samples are separated per class, and a Complex Network is created for each class, considering each sample as a node and using the Euclidean distance to join all the nodes and assign the corresponding weight, producing a fully connected graph.
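This per-class building step can be sketched as computing, for each class, a full Euclidean distance matrix over its samples (function names are illustrative):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_full_graph(samples):
    """Adjacency matrix of a fully connected graph: one node per sample,
    edge weight = Euclidean distance between the two samples."""
    n = len(samples)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = euclidean(samples[i], samples[j])
            adj[i][j] = adj[j][i] = d
    return adj

def build_class_networks(rows, labels):
    """One Complex Network per class, each built independently."""
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    return {y: build_full_graph(s) for y, s in by_class.items()}
```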
After the creation of each Complex Network (CN), a pruning process can be executed using the adjacency matrix of each CN. This step excludes edges/connections with weights higher than a multiple of the median of the weights, to tune the structure. The proper value of this multiplier was determined after several experiments.
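A sketch of the pruning step, with the multiplier left as a parameter `alpha`, since its exact experimental value is not legible in this copy of the text:

```python
import statistics

def prune(adj, alpha=1.0):
    """Drop edges whose weight exceeds alpha * median(edge weights).
    Removed edges are marked with 0.0 (no connection)."""
    n = len(adj)
    weights = [adj[i][j] for i in range(n) for j in range(i + 1, n) if adj[i][j] > 0]
    threshold = alpha * statistics.median(weights)
    pruned = [row[:] for row in adj]
    for i in range(n):
        for j in range(n):
            if i != j and pruned[i][j] > threshold:
                pruned[i][j] = 0.0
    return pruned
```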
II-C Evaluation of the Structure
A previous proposal used traditional CN metrics, e.g., neighborhood degree, assortativity, and others. Following their results and discussion, a more sensitive metric is proposed here. Therefore, after the building of the CNs, their Minimum Spanning Trees are calculated.
Consider an undirected graph G = (V, E), where V is the set of vertices and E is the set of edges connecting vertices, i.e., (u, v) ∈ E, with an associated cost w(u, v) for each edge. A subgraph that connects all the vertices with minimum total weight and no cycles is named a minimum spanning tree (MST) [cormenbook].
Then, the MST represents the shortest way to join all the nodes, and the sum of the distances that compose the MST is calculated as a metric to represent the structure of the CN. It is important to highlight that this step is necessary to measure the actual structure.
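This structural metric can be sketched with Prim's algorithm over the adjacency matrix; the sum of the MST edge weights is the score assigned to the CN:

```python
def mst_weight(adj):
    """Total edge weight of a minimum spanning tree (Prim's algorithm).
    Off-diagonal entries of 0.0 are treated as missing edges."""
    n = len(adj)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest edge connecting each node to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v] and 0 < adj[u][v] < best[v]:
                best[v] = adj[u][v]
    return total
```

Note that if pruning disconnects the graph, the returned total is infinite; a production version would need to handle that case.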
II-D Insertion and Classification
The test set of the original dataset is used to determine whether a sample/element belongs to a class, following this reasoning: "an element that belongs to a class will, after insertion, produce less impact than an element of another class".
Then, each element is inserted into each CN, and the distances between the new element and all the elements of that CN are calculated. After the insertion, a new MST' is calculated for each CN. The difference between MST and MST' is then calculated for each CN, and the minimum value determines to which class the element belongs.
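The classification rule above can be sketched as follows: recompute the MST of each class's complete Euclidean graph with and without the test element, and pick the class with the smallest increase (the helper functions here are self-contained sketches, not the paper's exact implementation):

```python
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def _mst_weight(samples):
    """Prim's algorithm on the complete Euclidean graph over `samples`."""
    n = len(samples)
    in_tree = [False] * n
    best = [float("inf")] * n
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            d = _dist(samples[u], samples[v])
            if not in_tree[v] and d < best[v]:
                best[v] = d
    return total

def classify(element, class_samples):
    """Assign `element` to the class whose network's MST grows the least
    after inserting it (class_samples: dict label -> list of rows)."""
    deltas = {}
    for label, samples in class_samples.items():
        deltas[label] = _mst_weight(samples + [element]) - _mst_weight(samples)
    return min(deltas, key=deltas.get)
```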
An additional step to improve the performance is the optimization of the structure.
II-E Optimization of the Building
Considering that features contribute at different levels, or have different importance, the proposal includes a weight for each feature, capturing its level of importance or contribution through an optimization approach. The objective is to maximize the precision of the proposal.
In this proposal, a Genetic Algorithm was used to find these weights.
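A minimal Genetic Algorithm sketch for the weight search; the population size, mutation rate, and toy fitness function below are illustrative, since the paper does not report its GA configuration (in the actual proposal, the fitness would be the classifier's precision under a given weight vector):

```python
import random

def genetic_search(fitness, n_features, pop_size=20, generations=30,
                   mutation_rate=0.1, seed=0):
    """Evolve feature-weight vectors in [0, 1]^n_features to maximize `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]              # selection: keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features) if n_features > 1 else 0
            child = a[:cut] + b[cut:]                # one-point crossover
            for i in range(n_features):              # random-reset mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.random()
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)

# Toy fitness: prefer weight vectors close to (1, 0, 0.5).
target = [1.0, 0.0, 0.5]
best = genetic_search(lambda w: -sum((x - t) ** 2 for x, t in zip(w, target)), 3)
```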
III-A Artificial Dataset
An artificial dataset was created considering some figures, e.g., spirals and stars. Besides, overlapping is present to pose a challenge to the classification algorithms; see Fig. 2.
Following the approach of optimizing the building of the CN, a grid search is performed. It is important to highlight the time that this process can take.
The chosen datasets are available in [romano2021pmlb]; after a search, datasets were selected with at least three hundred samples per class, numerical data, and at least two classes. Therefore, the selected datasets were: magic, satimage, sleep, and phoneme. A brief description of the datasets is presented in Tab. I. The website provides filter bars to select the datasets, but it was necessary to go through the sites and dataset descriptions to find the proper ones for the experiments.
|Dataset||Samples per class||Features||Classes|
|magic||0: 12332, 1: 6688||10||2|
|phoneme||0: 3818, 1: 586||5||2|
After the experiments, using cross validation with k = 10, the results are presented in Fig. 5, which distinguishes the proposal without optimization from the proposal with optimization.
III-C Analysis of the Results
Considering the results of Section III-A, the proposal outperforms the other algorithms if only the minimum values are checked. Besides, the median is higher and the limits of the boxplot are better. These results show that the proposal can have good performance. After adding the optimization approach, the proposal reduces the limits of the boxplot and achieves better performance than all the other algorithms. This is related to the contribution of each variable to the classification task.
On the other hand, the results of Subsection III-B show that the proposal obtained the best values with the satimage and phoneme datasets, and the optimization approach can improve the results by up to 10%. However, with the magic and sleep datasets, the proposal falls in the last three positions. In spite of these results, the optimization approach can still improve them by up to 10%.
In conclusion, the proposal of separating the classes and creating an independent CN for each one can present good results, and the experiments support this affirmation. The introduced MST-based metric, which captures the shortest structure joining all nodes of the CN, helps the global performance. Finally, the optimization approach can improve the results of the original proposal by up to 10%.