Artificial Intelligence has become one of the most actively pursued research areas in recent years [Russell:16]. There is a growing trend toward automating tasks [Acemoglu:18], which also fosters the development of algorithms that require minimal human interaction, an area known as Machine Learning.
Machine learning research consists of developing algorithms that do not need explicit instructions, relying instead on patterns and inference [Bishop:06]. They are designed to assist humans in decision-making tasks or even to automate daily activities, such as data retrieval [Manning:08], intelligent gadgets [Li:15], and self-driving cars [Shalev:17], among others. In past decades, most machine learning algorithms were developed as symbolic and knowledge-based models due to the difficulty of dealing with probabilistic models at that time [Shavlik:91]. Nevertheless, with the growth of computational power, probabilistic models were put in the spotlight, as the availability of digitized information was no longer a problem [Langley:11]. Hence, most of today's algorithms rely on mathematical models and data sampling, i.e., models that are capable of learning hidden patterns in training data and predicting unseen data, and are applied to a wide variety of tasks, such as computer vision [Sebe:05][Indurkhya:10].
Machine learning algorithms can be divided into two types of learning: supervised learning [Kotsiantis:07] and unsupervised learning [Hastie:09], as depicted in Figure 1. Supervised learning algorithms, used in tasks such as classification and regression, aim at building mathematical models from labeled data, i.e., data that contains the input features and the possible outputs (classes), and at performing predictions on unseen data. Unsupervised learning algorithms aim at building mathematical models that are capable of aggregating sets of data with common characteristics, known as clusters. In other words, unsupervised learning is capable of discovering patterns in data and grouping them into categories without knowing their actual labels.
Furthermore, it is crucial to observe that the machine learning area is closely related to other fields, such as data mining, optimization, and statistics. Regarding data mining, while machine learning focuses on predicting information based on already-known properties of the data, data mining focuses on finding unknown properties in the data and transforming them into usable knowledge for further application in machine learning algorithms [Friedman:98]. Concerning optimization, most machine learning models are formulated as optimization problems, where some loss function is minimized over a set of training data. Essentially, the loss function expresses the discrepancy between the model's predictions and the actual samples, assisting the algorithm in learning the data's patterns and in predicting unseen information [Sra:12]. Finally, regarding statistics, the field focuses on drawing inferences from samples, while machine learning focuses on finding generalizable prediction patterns [Bzdok:18]. Additionally, due to their close relationship, some researchers combined machine learning and statistical methods in a new field of study, known as statistical learning [James:13].
Recently, a new graph-based classifier proposed by Papa et al. [Papa:09], known as Optimum-Path Forest (OPF), attempts to fill a gap in the literature with a parameterless classifier, which is effective during the learning step and efficient when performing new predictions. Several works demonstrated the capacity of OPF and its state-of-the-art performance, being comparable to the well-known Support Vector Machines (SVM) [Chang:11] in supervised [Papa:12] and unsupervised learning [Rosa:14] tasks. Additionally, it provides tools, such as graph-cutting and K-Nearest Neighbors (KNN) graphs [PapaKNN:09], to reduce the training set size with negligible effects on the accuracy of the classification. Nevertheless, there is only one official implementation, written in the C language, which makes it difficult to integrate with other well-known machine learning frameworks. Furthermore, the machine learning community has been trending toward Python.
In this paper, we propose an open-source Python Optimum-Path Forest classification library, called OPFython (https://github.com/gugarosa/opfython). Mainly, the idea is to provide a user-friendly environment to work with Optimum-Path Forest classifiers by creating high-level methods and classes, removing from the user the burden of programming at a mathematical level. The main contributions of this paper are threefold: (i) to introduce an Optimum-Path Forest classification library in the Python language, (ii) to provide an easy-to-use implementation and user-friendly framework, and (iii) to fill the lack of research regarding Optimum-Path Forest classifiers.
The remainder of this paper is organized as follows. Section 2 presents a literature review and related works concerning Optimum-Path Forest classifier frameworks. Section 3 introduces a theoretical background concerning the supervised and unsupervised Optimum-Path Forest classifiers. Section 4 introduces the OPFython library, including its architecture and an overview of the included packages. Section 5 provides more details about the library, such as how to install it, how to understand its documentation, some pre-included examples, and how to perform unit tests. Furthermore, Section 6 presents the usage of the library, i.e., how to run pre-defined examples and how to model a new experiment. Finally, Section 7 states conclusions and future works.
2 Literature Review and Related Works
Optimum-Path Forest classifiers have arisen as a new approach to tackle supervised and unsupervised problems. They offer a parameterless graph-based implementation that is capable of executing an effective learning procedure while being extremely efficient when performing new predictions. They have been used in a wide range of applications, such as feature selection [Rodrigues:14], image segmentation [Miranda:09, Cappabianco:12], and signal classification [Nunes:14, Luz:13]. For instance, Iliev et al. applied an Optimum-Path Forest classifier using glottal features for spoken emotion recognition, achieving state-of-the-art results comparable to the SVM classifier. Moreover, Ramos et al. [Ramos:11] applied an OPF-based classification for detecting non-technical energy losses, achieving results comparable to state-of-the-art artificial intelligence techniques. Furthermore, Fernandes et al. [Fernandes:19] proposed a probabilistic-driven OPF classifier for detecting non-technical energy losses, improving the baselines obtained by the standard OPF.
Even though numerous works in the literature foster Optimum-Path Forests, there are gaps regarding frameworks or open-source libraries. There is only one official implementation, provided by Papa et al. [LibOPF:15] and denoted as LibOPF, which does not provide straightforward tools for users to design new experiments or integrate with other frameworks. Additionally, it lacks documentation and test suites, which would assist users in understanding the code and implementing new methods and classes. Moreover, the library is implemented in the C language, making it difficult to integrate with other frameworks or packages, primarily because the machine learning community is turning its attention to the Python language.
Therefore, OPFython attempts to fill the gaps concerning Optimum-Path Forest frameworks. It is implemented purely in Python and provides comprehensive documentation, test suites, and several pre-loaded examples. Furthermore, every line of code is commented, continuous integration tests run for every new push to its repository, a detailed readme explains how to get started with the library, and the library receives full-time maintenance and support.
3 Theoretical Foundation
Before diving into OPFython’s library, we present a theoretical foundation about the Optimum-Path Forest. In the next subsections, we mathematically explain how the supervised and unsupervised classifiers work.
3.1 Supervised Optimum-Path Forest
The Optimum-Path Forest is a multi-class classifier developed by Papa et al. [Papa:09], being efficient in the training step and effective in the testing stage. Its foremost ability is to segment the feature space without requiring massive volumes of data. Essentially, the OPF classifier is a graph with two possible adjacency relations: a complete graph or a $k$-nearest neighbors ($k$-NN) graph. The two methods differ in the adjacency relation, the methodology to estimate the prototypes (master nodes that represent a specific class and conquer other nodes), and the path cost function.
The principal idea behind the supervised OPF is to construct a complete graph, where any two samples are connected. In this case, the nodes represent the samples' feature vectors and the edges connect all nodes. The prototypes are chosen through Minimum Spanning Trees (MST), i.e., subgraphs that connect all nodes within the same set using the minimum possible cost, in order to find the nearest samples from different classes, namely, by selecting samples located at the class frontiers (regions more prone to classification mistakes). After the prototypes are defined, they compete with each other to conquer adjacent nodes while trying to find the best path (lowest cost) defined by the path cost function, creating Optimum-Path Trees (OPT). Finally, during the testing phase, OPF inserts each new sample into the graph and finds the prototype that offers the minimum-cost path (class labeling).
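Let $\mathcal{Z}$ be a dataset, where $\mathcal{Z} = \mathcal{Z}_1 \cup \mathcal{Z}_2$, and $\mathcal{Z}_1$ and $\mathcal{Z}_2$ represent the training and testing sets, respectively. Each sample $s \in \mathcal{Z}$ can be represented by its feature vector $\vec{v}(s)$. The OPF graph is represented by $G = (\mathcal{V}, \mathcal{A})$, where $\mathcal{A}$ refers to the set of edges that connect all node pairs and $\mathcal{V}$ is the set of feature vectors $\vec{v}(s)$, $\forall s \in \mathcal{Z}$. In addition, let $\lambda(s)$ be a function that assigns a real label to each sample $s$ in $\mathcal{Z}$.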
3.1.1 Training Step
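Let the graph $G_1 = (\mathcal{V}_1, \mathcal{A}_1)$ be induced from the training set, where $\mathcal{V}_1$ holds all feature vectors from samples belonging to the training set $\mathcal{Z}_1$. The first objective of the training phase is to obtain a set of prototypes $\mathcal{S}$, where $\mathcal{S} \subset \mathcal{Z}_1$.
Let $\pi_s$ be a path in $G_1$ ending in $s$, and let $f(\pi_s)$ be a function that associates a value to this path. For a prototype to conquer adjacent samples, the purpose is to minimize $f(\pi_s)$ through the path cost function $f_{\max}$ given by the following equation:
$$
\begin{aligned}
f_{\max}(\langle s \rangle) &= \begin{cases} 0 & \text{if } s \in \mathcal{S}, \\ +\infty & \text{otherwise,} \end{cases} \\
f_{\max}(\pi_s \cdot \langle s, t \rangle) &= \max \{ f_{\max}(\pi_s), d(s,t) \},
\end{aligned}
\tag{1}
$$
where $f_{\max}(\pi_s \cdot \langle s, t \rangle)$ computes the maximum distance between adjacent samples along the path $\pi_s \cdot \langle s, t \rangle$, and $d(s,t)$ is the distance between samples $s$ and $t$. A path $\pi_s$ is referred to as optimum if $f(\pi_s) \leq f(\tau_s)$ for any other path $\tau_s$ ending in $s$.
The minimization of $f_{\max}$ assigns to each sample $t \in \mathcal{Z}_1$ an optimum path $P^{\ast}(t)$, whose minimum cost $C(t)$ is given by the following equation:
$$
C(t) = \min_{\forall \pi_t \in G_1} \{ f_{\max}(\pi_t) \}. \tag{2}
$$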
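The training step described above can be illustrated with a minimal pure-Python sketch. It is not OPFython's implementation: the function names (`find_prototypes`, `fit`) are hypothetical, the Euclidean distance is an assumed choice, and the MST is built with a simple Prim procedure. Prototypes are taken as the endpoints of MST edges that join samples of different classes, and costs are then propagated with the $f_{\max}$ rule in a Dijkstra-like competition.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_prototypes(X, y):
    """Build an MST over the complete graph (Prim) and return the indices of
    samples that sit on MST edges linking two different classes."""
    n = len(X)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest edge weight linking each node to the tree
    parent = [-1] * n
    best[0] = 0.0
    prototypes = set()
    for _ in range(n):
        # Pick the cheapest node that is not in the tree yet.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1 and y[parent[u]] != y[u]:
            prototypes.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(X[u], X[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return prototypes


def fit(X, y):
    """Propagate f_max costs from the prototypes to every training sample.
    Returns the cost, propagated label, and predecessor of each sample."""
    n = len(X)
    prototypes = find_prototypes(X, y)
    cost = [0.0 if i in prototypes else math.inf for i in range(n)]
    label = [y[i] if i in prototypes else None for i in range(n)]
    pred = [-1] * n
    done = [False] * n
    heap = [(0.0, i) for i in sorted(prototypes)]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue  # outdated heap entry
        done[s] = True
        for t in range(n):
            if done[t]:
                continue
            # f_max: the cost of a path is its largest arc weight.
            new_cost = max(cost[s], euclidean(X[s], X[t]))
            if new_cost < cost[t]:
                cost[t], label[t], pred[t] = new_cost, label[s], s
                heapq.heappush(heap, (new_cost, t))
    return cost, label, pred
```

Note that when the training set holds a single class, no MST edge crosses a class frontier, so this sketch finds no prototypes and leaves every cost at infinity; a complete implementation would need to handle that case.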
3.1.2 Testing Step
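The testing set graph is composed of samples $t \in \mathcal{Z}_2$. Each sample $t$ is connected to a sample $s \in \mathcal{Z}_1$, making $t$ part of the original graph. The objective is to find an optimum path $P^{\ast}(t)$ from $\mathcal{S}$ to $t$, which assigns to $t$ the class $\lambda(R(t))$ of its prototype $R(t) \in \mathcal{S}$. Afterward, the sample $t$ is removed from the graph. This path can be identified by evaluating the optimum cost value $C(t)$:
$$
C(t) = \min_{\forall s \in \mathcal{Z}_1} \{ \max \{ C(s), d(s,t) \} \}. \tag{3}
$$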
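As a rough illustration of the testing step, the sketch below assumes the standard OPF rule in which a test sample receives the label of the training sample that minimizes the larger of that sample's training cost and its distance to the test sample. The function name `classify`, its arguments, and the Euclidean distance are illustrative assumptions, not OPFython's API; the training costs and labels are taken as given inputs.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(t, train_X, train_cost, train_label):
    """Return (label, cost) for the test sample t.

    For every training sample s, the candidate cost is max(C(s), d(s, t));
    the sample offering the smallest candidate cost conquers t.
    """
    best_cost, best_label = math.inf, None
    for s, c_s, l_s in zip(train_X, train_cost, train_label):
        candidate = max(c_s, euclidean(s, t))
        if candidate < best_cost:
            best_cost, best_label = candidate, l_s
    return best_label, best_cost
```

Since the test sample is only evaluated against the trained graph and never stored in it, each prediction is independent of the others.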
3.2 Unsupervised Optimum-Path Forest
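Let $\mathcal{Z}$ be a dataset such that for every sample $s \in \mathcal{Z}$ there is a feature vector $\vec{v}(s)$. Additionally, let $d(s,t)$ be the distance between samples $s$ and $t$ in the feature space, which is described by $d(s,t) = \| \vec{v}(t) - \vec{v}(s) \|$.
A graph $(\mathcal{Z}, \mathcal{A}_k)$ is defined by arcs $(s,t)$ that connect $k$-nearest neighbors in the feature space. The arcs are weighted by $d(s,t)$ and the nodes $s \in \mathcal{Z}$ are weighted by a density value $\rho(s)$, given by Equation 4:
$$
\rho(s) = \frac{1}{\sqrt{2 \pi \sigma^2} \, |\mathcal{A}_k(s)|} \sum_{\forall t \in \mathcal{A}_k(s)} \exp \left( \frac{-d^2(s,t)}{2 \sigma^2} \right), \tag{4}
$$
where $|\mathcal{A}_k(s)| = k$, $\sigma = \frac{d_f}{3}$, and $d_f$ is the maximum arc weight in $(\mathcal{Z}, \mathcal{A}_k)$. Using these parameters, all nodes are considered for density computation, since a Gaussian function covers most samples within $d(s,t) \in [0, 3\sigma]$.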
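The density computation can be sketched in a few lines of pure Python. The snippet assumes the usual Gaussian-kernel estimate over the $k$-nearest neighbors, with $\sigma$ set to one third of the maximum arc weight; the function name `knn_density` and the Euclidean distance are illustrative choices, and the sketch further assumes that the maximum arc weight is strictly positive (i.e., the samples are not all identical).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_density(X, k):
    """Return (rho, neighbours), where rho[i] is the density of sample i and
    neighbours[i] lists its k nearest neighbours as (distance, index) pairs."""
    n = len(X)
    neighbours = []
    for i in range(n):
        dists = sorted((euclidean(X[i], X[j]), j) for j in range(n) if j != i)
        neighbours.append(dists[:k])
    # d_f is the largest arc weight in the k-NN graph.
    d_f = max(d for nb in neighbours for d, _ in nb)
    sigma = d_f / 3.0
    norm = math.sqrt(2.0 * math.pi * sigma ** 2) * k
    rho = [
        sum(math.exp(-(d ** 2) / (2.0 * sigma ** 2)) for d, _ in nb) / norm
        for nb in neighbours
    ]
    return rho, neighbours
```

Samples in crowded regions have short arcs and therefore higher densities, while isolated samples receive values close to zero.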
A standard method to compute a probability density function (p.d.f.) is the Parzen-window. Equation (4) provides a Parzen-window estimation based on an isotropic Gaussian kernel when the arcs are defined by $(s,t) \in \mathcal{A}$ if $d(s,t) \leq d_f$. Nevertheless, this approximation is sensitive to differences in scale and sample concentration.
An interesting way to solve this problem is to choose $d_f$ based on a particular region of the feature space [Comaniciu:03]. By considering the $k$-nearest neighbors, it is possible to handle different concentrations and transform the scale problem into finding the best value of $k$ within $[1, k_{\max}]$, for $1 \leq k_{\max} \leq |\mathcal{Z}|$. An approach proposed by Rocha et al. [RochaIJIST:09] considers the minimum graph cut (best $k$) according to a measure suggested by Shi et al. [Shi:00].
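Moreover, let a path $\pi_t$ be the sequence of adjacent samples starting from a root $R(t)$ and ending at a sample $t$, with $\pi_t = \langle t \rangle$ being a trivial path and $\pi_s \cdot \langle s, t \rangle$ the concatenation of $\pi_s$ and the arc $(s,t)$. Among all possible paths $\pi_t$ with roots on the maxima of the p.d.f., the problem lies in finding the path whose lowest density value is maximum. Each such path defines an influence zone (cluster) by selecting strongly connected samples. Mathematically speaking, Equation 5 maximizes $f_{\min}(\pi_t)$ for all $t \in \mathcal{Z}$, where:
$$
\begin{aligned}
f_{\min}(\langle t \rangle) &= \begin{cases} \rho(t) & \text{if } t \in \mathcal{R}, \\ \rho(t) - \delta & \text{otherwise,} \end{cases} \\
f_{\min}(\pi_s \cdot \langle s, t \rangle) &= \min \{ f_{\min}(\pi_s), \rho(t) \},
\end{aligned}
\tag{5}
$$
where $\delta = \min_{\forall (s,t) \in \mathcal{A}_k \mid \rho(t) \neq \rho(s)} |\rho(t) - \rho(s)|$ and $\mathcal{R}$ is a root set with one element for each maximum of the p.d.f. One can see that higher values of $\delta$ reduce the number of maxima. Additionally, this library uses a fixed value of $\delta$ and a fixed range for the density values $\rho(t)$.
Finally, the OPF algorithm maximizes $f_{\min}(\pi_t)$ such that the optimum paths compose an Optimum-Path Forest, i.e., a predecessor map $P$ with no cycles, which assigns to each sample $t \notin \mathcal{R}$ its predecessor $P(t)$ in the optimum path from $\mathcal{R}$, or a marker $nil$ when $t \in \mathcal{R}$. Each p.d.f. maximum (prototype) is the root of an OPT, commonly known as a cluster. Furthermore, the collection of all OPTs is the so-called Optimum-Path Forest.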
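A simplified sketch of the clustering procedure is given below, assuming the usual formulation in which every sample starts with the value $\rho(t) - \delta$, samples are processed in decreasing order of value, and a sample reached without a predecessor becomes the root of a new cluster. The function `cluster`, its arguments, and the default `delta=1.0` are illustrative assumptions; the densities and a symmetric adjacency list are taken as inputs, and the special handling of density plateaus found in complete implementations is omitted.

```python
import heapq


def cluster(rho, adjacency, delta=1.0):
    """Group samples by maximizing the f_min path value.

    rho:       density of each sample.
    adjacency: adjacency[i] lists the neighbours of sample i (assumed symmetric).
    Returns (label, pred): the cluster label and predecessor of each sample.
    """
    n = len(rho)
    value = [r - delta for r in rho]  # initial value of the trivial paths
    pred = [None] * n
    label = [None] * n
    done = [False] * n
    # heapq is a min-heap, so values are negated to pop the largest first.
    heap = [(-value[i], i) for i in range(n)]
    heapq.heapify(heap)
    n_clusters = 0
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue  # outdated heap entry
        done[s] = True
        if pred[s] is None:
            # No sample conquered s: it is a density maximum (a new root).
            label[s] = n_clusters
            n_clusters += 1
            value[s] = rho[s]
        for t in adjacency[s]:
            if done[t]:
                continue
            # f_min: the value of a path is the lowest density along it.
            candidate = min(value[s], rho[t])
            if candidate > value[t]:
                value[t], pred[t], label[t] = candidate, s, label[s]
                heapq.heappush(heap, (-candidate, t))
    return label, pred
```

For example, on a chain of six samples with two density peaks, the sketch produces two clusters, each rooted at one peak.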
OPFython is distributed among several packages, each one being accountable for particular classes and methods. Figure 2 summarizes OPFython's architecture, while the next sections present each of its packages in more detail.
The core package serves as the origin of all OPFython's sub-classes. It acts as a base for implementing more specific structures that one may require when creating an Optimum-Path Forest-based classifier. As portrayed in Figure 3, four modules compose the core package, as follows:
Heap: The heap assists OPF in storing nodes according to their costs and subsequently removing them to build the subgraph;
Node: When working with graph-based structures, each of their pieces is represented by a node. In OPFython, we use the node structure to store valuable information of a sample, such as its features, label and other information that OPF might need;
OPF: The OPF class serves as the classifier itself. It implements some basic methods that are common to its children, as well as some methods that assist users in saving and loading pre-trained models;
Subgraph: The subgraph is one of the most fundamental structures of the OPF classifier. As it is a graph-based classifier, it uses nodes and arcs to build the optimum-path costs and find the prototype nodes, which conquer the remaining samples and propagate their labels.
In order to ease the user’s life, OPFython offers a mathematical package, containing low-level math implementations, illustrated by Figure 4. Naturally, some repeated functions that are used throughout the library are represented in this package, as follows:
Distance: A distance metric is used to calculate the cost between nodes. Hence, we offer a variety of distance metrics to suit the needs of different tasks;
General: Common-use functions that do not have a special division are defined in this module;
Random: Lastly, some methods might use random numbers for sampling or setting a heuristic. This module can generate uniform and Gaussian random numbers.
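To make the role of a distance module concrete, the sketch below shows one possible way to organize a few common metrics behind a name-based lookup. The function names and the `DISTANCES` dictionary are hypothetical and do not reproduce OPFython's actual module; they only illustrate how a classifier could select a metric by name.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    """Euclidean distance without the square root (cheaper, same ordering)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical registry: a classifier could receive a metric name as a string
# and look up the corresponding function here.
DISTANCES = {
    "euclidean": euclidean,
    "squared_euclidean": squared_euclidean,
    "manhattan": manhattan,
}


def get_distance(name):
    """Return the distance function registered under the given name."""
    try:
        return DISTANCES[name]
    except KeyError:
        raise ValueError(f"Unknown distance metric: {name}") from None
```

Keeping the metrics in a single registry means a new metric only has to be written once and registered under a name to become available everywhere.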
There are several approaches to designing an Optimum-Path Forest classifier, such as supervised, unsupervised, and semi-supervised, among others. Therefore, the models package provides classes and methods that compose these high-level abstractions and implement the classification strategies. Currently, OPFython offers four types of classifiers, which are illustrated by Figure 5 and described as follows:
KNNSupervisedOPF [PapaKNN:09]: A supervised Optimum-Path Forest classifier that uses a KNN-based subgraph, providing a more effective way to build up the connectivity subgraph;
SemiSupervisedOPF [Amorim:14]: A semi-supervised Optimum-Path Forest classifier, which is extremely useful in labeling unknown samples;
SupervisedOPF [Papa:09]: The classical supervised Optimum-Path Forest classifier, which is suitable for training on labeled datasets and performing new predictions;
UnsupervisedOPF [Rocha:09]: The standard unsupervised Optimum-Path Forest classifier, which is suitable for clustering unlabeled datasets.
The stream package deals with every pre-processing step of the input data. Essentially, it is responsible for loading the data, parsing it into samples and labels, and splitting it into new sets, such as training, validation, and testing. Figure 6 depicts its modules, which are briefly described as follows:
Loader: A loading module that assists users in pre-loading datasets. Currently, it is possible to load files in .txt, .csv and .json formats;
Parser: After loading the files, it is necessary to parse the pre-loaded arrays into samples and labels;
Splitter: Finally, if necessary, one can split the loaded and parsed dataset into new sets, such as training, validation, and testing.
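The load, parse, and split steps can be sketched with the standard library alone. The functions below are hypothetical stand-ins rather than OPFython's stream API, and the column layout (sample identifier, label, then features) is an assumption made only for this illustration.

```python
import csv
import io
import random


def load_csv(text):
    """Read comma-separated numeric rows from a string into lists of floats."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def parse(rows):
    """Separate rows into samples and labels.

    ASSUMPTION: column 0 is a sample id, column 1 is the label, and the
    remaining columns are the features.
    """
    samples = [row[2:] for row in rows]
    labels = [int(row[1]) for row in rows]
    return samples, labels


def split(samples, labels, percentage=0.5, seed=0):
    """Shuffle the data and split it into two sets; `percentage` is the
    fraction of samples placed in the first set."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * percentage)
    first, second = indices[:cut], indices[cut:]
    return (
        [samples[i] for i in first],
        [samples[i] for i in second],
        [labels[i] for i in first],
        [labels[i] for i in second],
    )
```

Fixing the seed makes the split reproducible, which is convenient when comparing classifiers on the same partition.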
As mentioned before, the subgraph is one of the essential structures of the classification process. Nevertheless, distinct classifiers might need distinct subgraphs. Therefore, we offer additional subgraph implementations, as portrayed by Figure 7 and described as follows:
KNNSubgraph: When dealing with KNN-based classifiers, it is crucial to use a KNN-based subgraph, as it implements some additional functions that the classifier might need.
A utility package implements standard tools that are shared over the library, as it is a better approach to implement once and re-use them across other modules, as shown in Figure 8. This package implements the subsequent modules:
Constants: Constants are fixed numbers that do not alter throughout the code. For the sake of easiness, they are implemented in the same module;
Converter: Most OPF users are familiar with the specific file format it uses. Hence, we implement a dedicated module that is capable of converting .opf files into .txt, .csv, and .json;
Exception: In order to assist users, the exception module implements common errors and exceptions that might happen when invalid arguments are used in OPFython classes and methods;
Logging: Every method that is invoked in the library is logged onto a log file. One can watch the log in order to detect potential errors, essential warnings, or even success messages throughout the classification procedure.
5 Library Usage
In this section, we describe how to install the OPFython library, as well as the first steps to start using it. Essentially, one can study its documentation or make use of the already-included examples. Besides, there are implemented methods that conduct unit tests and verify whether everything is operating as expected.
First of all, we believe that everything should be smooth, without being tricky or daunting. Therefore, OPFython is designed to be easy to adopt, from the very first installation to its further usage. Just execute the following command under your preferred Python environment (standard, conda, virtualenv):
pip install opfython
Alternatively, it is possible to use the bleeding-edge version by cloning its repository and installing it:
git clone https://github.com/gugarosa/opfython.git
cd opfython
pip install .
Note that there is no other requirement to use OPFython. As its single dependency is the NumPy package, it can be installed anywhere, regardless of the machine's operating system.
One might be interested in mastering the concepts and strategies behind OPFython. Hence, we provide a fully documented reference (https://opfython.readthedocs.io) containing everything that the library offers. From elementary classes to more complex methods, OPFython's documentation is the reference to learn how the library was developed or even to improve it with contributions.
5.3 Classes and Methods Examples
Additionally, in the examples/ folder, we provide example scripts for all packages that the library implements.
Each example is constituted of high-level explanations of how to use predefined classes and methods. One can observe that it provides a standard description of how to instantiate each class and decide which arguments should be employed.
5.4 Test Suites
OPFython is prepared with tests to give a more in-depth analysis of the code. Also, the intention behind any test is to check whether everything is running as demanded or not. Thus, there are two main methods in order to execute the tests:
PyTest: The first method is to run the single command pytest tests/, as depicted by Figure 9. It will execute all the implemented tests and return an output indicating whether they succeeded or failed;
Coverage: An interesting extension to PyTest is the coverage module. Besides producing the same outputs as PyTest, it also presents a report that states how much of the code the tests cover, as illustrated by Figure 10. Its usage is also straightforward:
coverage run -m pytest tests/ and
coverage report -m.
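For readers unfamiliar with PyTest, a test module is simply a Python file whose functions start with `test_` and contain plain assertions. The sketch below is a hypothetical example of such a file (it could be saved, for instance, as tests/test_distance_sketch.py); the function under test is a local stand-in written for this illustration, not one of OPFython's own functions.

```python
import math


def euclidean_distance(x, y):
    """Stand-in function under test: Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def test_distance_is_zero_for_identical_vectors():
    assert euclidean_distance([1.0, 2.0], [1.0, 2.0]) == 0.0


def test_distance_is_symmetric():
    x, y = [0.0, 1.0], [2.0, 5.0]
    assert euclidean_distance(x, y) == euclidean_distance(y, x)


def test_distance_matches_known_value():
    # Classic 3-4-5 right triangle.
    assert euclidean_distance([0.0, 0.0], [3.0, 4.0]) == 5.0
```

PyTest discovers these functions automatically and reports each one as passed or failed.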
In this section, we explain how to perform a classification task with OPFython, as well as briefly describe the seven pre-loaded applications that are included with the library.
6.1 Getting Started
After installing the OPFython library, it is straightforward to use its packages, where seven elementary examples show key features that the library implements. One can refer to the examples/applications folder and examine the following files:
KNN-based supervised OPF training:
Supervised OPF agglomerative learning:
Supervised OPF learning:
Supervised OPF pruning:
Supervised OPF training:
Unsupervised OPF clustering:
Each example is comprised of the following pipeline: loading the dataset, parsing the dataset, splitting the dataset, instantiating a classifier, fitting the training data, predicting the validation/test data, and calculating the classifier’s performance. Finally, after performing the classification process, it is possible to save the model in a disk-file for further inspection. Figure 11 illustrates the output logs generated by an OPFython classification.
The provided scripts differ in the type of classifier. While supervised classification attempts to learn a set of classes from particular samples that represent them, unsupervised classification tries to aggregate samples into clusters, i.e., dense regions where samples share similar traits. For now, we offer three supervised classifiers, namely, KNN-based supervised OPF, semi-supervised OPF, and supervised OPF, as well as one unsupervised classifier, the unsupervised OPF. Additionally, we offer three extensions of the supervised OPF, namely, supervised OPF agglomerative learning (learns from mistakes over the validation set), supervised OPF learning (learns the best classifier over a validation set), and supervised OPF pruning (prunes nodes while maintaining the accuracy).
6.2 Modeling a New Classification
In order to model a new classification, some conventions need to be understood. First of all, the data should be loaded and parsed; in this case, we load a common dataset known as Boat:
Furthermore, if necessary, we can split the data into new sets, such as training and testing, as follows:
Afterward, we can instantiate an OPF classifier:
Finally, we can fit the classifier and perform new predictions:
After predicting new samples, it is possible to evaluate the classifier’s performance:
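As an illustration of the evaluation step, the sketch below computes the plain classification accuracy, i.e., the fraction of predictions that match the true labels. It is a generic helper with a hypothetical name (`accuracy`) and is not necessarily the measure that OPFython itself reports.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute the accuracy of an empty set")
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


def per_class_errors(y_true, y_pred):
    """Count the misclassified samples of each true class."""
    errors = {}
    for truth, guess in zip(y_true, y_pred):
        errors.setdefault(truth, 0)
        if truth != guess:
            errors[truth] += 1
    return errors
```

Looking at the per-class error counts alongside the overall accuracy helps to spot cases where a classifier performs well only on the most frequent class.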
In this article, we introduce an open-source Python library for handling Optimum-Path Forest classifiers, known as OPFython. Based upon an object-oriented paradigm, OPFython provides a modern yet straightforward implementation, allowing users to prototype new OPF-based classifiers swiftly.
The library implements a wide variety of Optimum-Path Forest classifiers, such as supervised, semi-supervised, and unsupervised ones, as well as auxiliary functions that assist the classifiers' workflow, e.g., distance functions, classification metrics, data processing, and error logging. Additionally, as OPFython is heavily inspired by the original LibOPF, it is possible to use the same loading format (OPF file format) and methods that are available in the original package. Furthermore, OPFython provides a model-saving method, which can be used to pre-train classifiers and retrieve insightful information about the classification procedure.
Regarding future works, we intend to make more OPF-based classifiers available, as well as a visualization package, which will allow users to feed in their saved models and produce charts. Furthermore, we aim at improving our implementations by distributing the calculations, i.e., employing parallel computing, which will hopefully reduce the computational burden.
The authors are grateful to the São Paulo Research Foundation (FAPESP) for grants #2013/07375-0, #2014/12236-1, #2017/25908-6, #2018/15597-6, and #2019/02205-5, as well as to CNPq for grants #307066/2017-7 and #427968/2018-6.