1 Introduction
Artificial Intelligence has become one of the most active research areas in recent years [Russell:16]. The growing trend toward task automation [Acemoglu:18] has also fostered the development of algorithms that require minimal human interaction, known as Machine Learning.
Machine learning research consists of developing algorithms that do not need explicit instructions, relying instead on patterns and inference [Bishop:06]. They are designed to assist humans in decision-making tasks or to automate daily activities, such as data retrieval [Manning:08], intelligent gadgets [Li:15], and self-driving cars [Shalev:17], among others. In past decades, most machine learning-based algorithms were developed as symbolic and knowledge-based models due to the difficulty of dealing with probabilistic models at that time [Shavlik:91]. Nevertheless, with the growth of computational power, probabilistic models were put in the spotlight, as the availability of digitized information was no longer a problem [Langley:11]. Hence, most of today's algorithms rely on mathematical models and data sampling, i.e., models that are capable of learning hidden patterns in training data and predicting unseen data, and are applied to a wide variety of tasks, such as computer vision [Sebe:05] and natural language processing [Indurkhya:10].
Machine learning algorithms can be divided into two types of learning: supervised learning [Kotsiantis:07] and unsupervised learning [Hastie:09], as depicted in Figure 1. Supervised learning algorithms, used in tasks such as classification and regression, aim at building mathematical models from labeled data, i.e., data that contains the input features and the possible outputs (classes), and at performing predictions on unseen data. Unsupervised learning algorithms aim at building mathematical models that are capable of aggregating sets of data with common characteristics, known as clusters. In other words, unsupervised learning is capable of discovering patterns in data and grouping them into categories without knowing their actual labels.
Furthermore, it is crucial to observe that the machine learning area is closely related to other fields, such as data mining, optimization, and statistics. Regarding data mining, while machine learning focuses on predicting information based on already-known properties of the data, data mining focuses on finding unknown properties in the data and transforming unknown knowledge into real knowledge for further application in machine learning algorithms [Friedman:98]. Concerning optimization, most machine learning models are formulated as optimization problems, where some loss function is minimized over a set of training data. Essentially, the loss function expresses the discrepancy between the model's predictions and the actual samples, assisting the algorithm in learning the data's patterns and in predicting unseen information [Sra:12]. Finally, regarding statistics, the field focuses on drawing inferences from samples, while machine learning focuses on finding generalizable prediction patterns [Bzdok:18]. Additionally, due to their intimate relationship, some researchers have combined machine learning and statistical methods in a new field of study, known as statistical learning [James:13].
Recently, a new graph-based classifier proposed by Papa et al. [Papa:09], known as Optimum-Path Forest (OPF), attempts to fill a gap in the literature with a parameterless classifier, which is effective during the learning step and efficient when performing new predictions. Several works have demonstrated the capacity of OPF and its state-of-the-art performance, comparable to the well-known Support Vector Machines (SVM) [Chang:11] in supervised [Papa:12] and unsupervised learning [Rosa:14] tasks. Additionally, it provides tools, such as graph-cutting and K-Nearest Neighbors (KNN) graphs [PapaKNN:09], to reduce the training set size with negligible effects on the classification accuracy. Nevertheless, there is only one official implementation, written in the C language, which makes it difficult to integrate with other well-known machine learning frameworks. Furthermore, there is a Python-based trend in the machine learning community.
In this paper, we propose an open-source Python Optimum-Path Forest classification library, called OPFython (https://github.com/gugarosa/opfython). The main idea is to provide a user-friendly environment to work with Optimum-Path Forest classifiers by creating high-level methods and classes, relieving the user of the burden of programming at a mathematical level. The main contributions of this paper are threefold: (i) to introduce an Optimum-Path Forest classification library in the Python language, (ii) to provide an easy-to-use implementation and user-friendly framework, and (iii) to help fill the gap in research regarding Optimum-Path Forest classifiers.
The remainder of this paper is organized as follows. Section 2 presents a literature review and related works concerning Optimum-Path Forest classifier frameworks. Section 3 introduces a theoretical background concerning the supervised and unsupervised Optimum-Path Forest classifiers. Section 4 introduces the OPFython library, including its architecture and an overview of the included packages. Section 5 provides more details about the library, such as how to install it, how to read its documentation, some pre-included examples, and how to run unit tests. Furthermore, Section 6 presents the usage of the library, i.e., how to run predefined examples and how to model a new experiment. Finally, Section 7 states conclusions and future works.
2 Literature Review and Related Works
Optimum-Path Forest classifiers have arisen as a new approach to tackle supervised and unsupervised problems. They offer a parameterless graph-based implementation that is capable of executing an effective learning procedure while being extremely efficient when performing new predictions. It is possible to find its usage in a wide range of applications, such as feature selection [Rodrigues:14], image segmentation [Miranda:09, Cappabianco:12], and signal classification [Nunes:14, Luz:13]. For instance, Iliev et al. applied an Optimum-Path Forest classification using glottal features for spoken emotion recognition, achieving state-of-the-art results comparable to the SVM classifier. Moreover, Ramos et al. [Ramos:11] applied an OPF-based classification for detecting non-technical energy losses, achieving outstanding results comparable to state-of-the-art artificial intelligence techniques. Furthermore, Fernandes et al. [Fernandes:19] proposed a probabilistic-driven OPF classifier for detecting non-technical energy losses, improving the baselines obtained by the standard OPF.
Even though numerous works in the literature foster Optimum-Path Forests, there are some gaps regarding frameworks or open-source libraries. There is only one official implementation, provided by Papa et al. [LibOPF:15] and denoted as LibOPF, which does not provide straightforward tools for users to design new experiments or integrate with other frameworks. Additionally, it lacks documentation and test suites, which would assist users in understanding the code and implementing new methods and classes. Moreover, the library is implemented in the C language, making it extremely difficult to integrate with other frameworks or packages, primarily because the machine learning community is turning its attention to the Python language.
Therefore, OPFython attempts to fill the gaps concerning Optimum-Path Forest frameworks. It is implemented purely in Python and provides comprehensive documentation, test suites, and several pre-loaded examples. Furthermore, every line of code is commented, continuous integration tests run for every new push to its repository, a detailed readme explains how to get started with the library, and the library receives ongoing maintenance and support.
3 Theoretical Foundation
Before diving into OPFython's library, we present a theoretical foundation about the Optimum-Path Forest. In the next subsections, we mathematically explain how the supervised and unsupervised classifiers work.
3.1 Supervised Optimum-Path Forest
The Optimum-Path Forest is a multi-class classifier developed by Papa et al. [Papa:09], being efficient in the training step and effective in the testing stage. Its foremost ability is to segment the feature space without requiring massive volumes of data. Essentially, the OPF classifier is a graph with one of two possible adjacency relations: a complete graph or a $k$-nn graph. The two methods differ in the adjacency relation, in the methodology to estimate the prototypes (master nodes that represent a specific class and conquer other nodes), and in the path cost function.
The principal idea behind the supervised OPF is to construct a complete graph, where any two samples are connected. In this case, the nodes represent the samples' feature vectors and the edges connect all nodes. The prototypes are chosen through Minimum Spanning Trees (MSTs, i.e., subgraphs that connect all nodes within the same set using the minimum possible cost) in order to find the nearest samples from different classes, namely, by selecting samples located at the class frontiers (regions more prone to classification mistakes). Once the prototypes are defined, they compete with each other to conquer adjacent nodes while trying to find the best path (lowest cost) defined by the path cost function, creating Optimum-Path Trees (OPTs). Finally, during the testing phase, OPF inserts each new sample into the graph and finds the prototype that offers the minimum-cost path, which defines the class label.
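Let $\mathcal{Z}$ be a dataset, where $\mathcal{Z} = \mathcal{Z}_1 \cup \mathcal{Z}_2$, and $\mathcal{Z}_1$ and $\mathcal{Z}_2$ represent the training and testing sets, respectively. Each sample $s \in \mathcal{Z}$ can be represented by its feature vector $\vec{v}(s)$. The OPF graph is represented by $G = (\mathcal{V}, \mathcal{A})$, where $\mathcal{A}$ refers to the set of edges that connect all node pairs and $\mathcal{V}$ is the set of feature vectors $\vec{v}(s)$, $\forall s \in \mathcal{Z}$. In addition, let $\lambda(s)$ be a function that assigns the true label to each sample in $\mathcal{Z}$.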
3.1.1 Training Step
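Let the graph $G_1 = (\mathcal{V}_1, \mathcal{A}_1)$ be induced from the training set, where $\mathcal{V}_1$ holds all feature vectors from samples belonging to the training set. The first objective of the training phase is to obtain a set of prototypes $\mathcal{S}_1$, where $\mathcal{S}_1 \subset \mathcal{Z}_1$.
Let $\pi_s$ be a path in $G_1$ ending in $s$, and let $f(\pi_s)$ be a function that associates a value to this path. In order for a prototype to conquer adjacent samples, the purpose is to minimize $f(\pi_s)$ through a path cost function $f_{max}$ given by the following equation:
$$
f_{max}(\langle s \rangle) = \begin{cases} 0 & \text{if } s \in \mathcal{S}_1 \\ +\infty & \text{otherwise} \end{cases}, \qquad f_{max}(\pi_s \cdot \langle s, t \rangle) = \max\{ f_{max}(\pi_s), d(s,t) \} \tag{1}
$$
where $f_{max}(\pi_s \cdot \langle s, t \rangle)$ computes the maximum distance between adjacent samples along the path $\pi_s \cdot \langle s, t \rangle$, and $d(s,t)$ denotes the distance between samples $s$ and $t$. A path $\pi_s$ is referred to as optimum if $f(\pi_s) \leq f(\tau_s)$ for any other path $\tau_s$.
The minimization of $f_{max}$ assigns to each sample $t \in \mathcal{Z}_1$ an optimum path $P^{*}(t)$, whose minimum cost $C(t)$ is given by the following equation:
$$
C(t) = \min_{\forall \pi_t \in (G_1)} \{ f_{max}(\pi_t) \} \tag{2}
$$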
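The training step can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not OPFython's implementation: the function names (`find_prototypes`, `train_opf`), the Euclidean distance, and the use of Prim's algorithm over a dense complete graph are all choices made here for brevity. It also assumes the training set contains at least two classes, otherwise no prototype is found and every cost stays infinite.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_prototypes(X, y):
    """Build an MST of the complete graph with Prim's algorithm and mark as
    prototypes the endpoints of every MST edge joining different classes."""
    n = len(X)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest edge linking each node to the tree
    parent = [None] * n
    best[0] = 0.0
    prototypes = set()
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None and y[u] != y[parent[u]]:
            prototypes.update((u, parent[u]))
        for v in range(n):
            if not in_tree[v]:
                d = euclidean(X[u], X[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return prototypes


def train_opf(X, y):
    """Minimise f_max (Equation 1) starting from the prototypes.

    Returns, per training sample, its optimum cost C(t), the label
    propagated by its conquering prototype, and the prototype index.
    """
    n = len(X)
    prototypes = find_prototypes(X, y)
    cost = [0.0 if i in prototypes else math.inf for i in range(n)]
    label = list(y)
    root = list(range(n))
    heap = [(0.0, i) for i in sorted(prototypes)]
    heapq.heapify(heap)
    done = [False] * n
    while heap:
        c, s = heapq.heappop(heap)
        if done[s]:
            continue               # stale heap entry
        done[s] = True
        for t in range(n):
            if done[t]:
                continue
            new_cost = max(c, euclidean(X[s], X[t]))   # f_max extension
            if new_cost < cost[t]:
                cost[t], label[t], root[t] = new_cost, label[s], root[s]
                heapq.heappush(heap, (new_cost, t))
    return cost, label, root
```

On a toy one-dimensional set with two well-separated classes, the two samples facing each other across the class frontier become the prototypes, and every other sample receives the largest arc weight on its path back to the prototype of its own class.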
3.1.2 Testing Step
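The testing set graph is composed of samples $t \in \mathcal{Z}_2$. Each sample $t$ is connected to the samples $s \in \mathcal{V}_1$, making $t$ part of the original graph. The objective is to find an optimum path $P^{*}(t)$ from $\mathcal{S}_1$ to $t$, which assigns to $t$ the class $\lambda(R(t))$ of its prototype $R(t) \in \mathcal{S}_1$. Afterward, the sample $t$ is removed from the graph. This path can be identified by evaluating the optimum cost value $C(t)$:
$$
C(t) = \min_{\forall s \in \mathcal{Z}_1} \{ \max\{ C(s), d(s,t) \} \} \tag{3}
$$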
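Equation 3 translates almost directly into code. The sketch below is illustrative only: `classify` is a hypothetical helper, the distance is assumed to be Euclidean, and the trained costs and labels are passed in as plain lists rather than taken from an OPFython model.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(t, X_train, cost, label):
    """Evaluate Equation 3 for a test sample t.

    For every training sample s, the candidate cost is max(C(s), d(s, t));
    the sample offering the smallest candidate conquers t and hands over
    the label it received during training.
    """
    best_cost, best_label = math.inf, None
    for x_s, c_s, l_s in zip(X_train, cost, label):
        candidate = max(c_s, euclidean(x_s, t))
        if candidate < best_cost:
            best_cost, best_label = candidate, l_s
    return best_label, best_cost
```

Because the test sample is only compared against the training samples and never stored, it is effectively "removed from the graph" once the function returns.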
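3.2 Unsupervised Optimum-Path Forest
Let $\mathcal{N}$ be a dataset such that for every sample $s \in \mathcal{N}$ there is a feature vector $\vec{v}(s)$. Additionally, let $d(s,t)$ be the distance between samples $s$ and $t$ in the feature space, which is described by $d(s,t) = \|\vec{v}(t) - \vec{v}(s)\|$.
A graph $(\mathcal{N}, \mathcal{A})$ is defined by arcs $(s,t)$ that connect $k$-nearest neighbors in the feature space. The arcs are weighted by $d(s,t)$ and the nodes $s \in \mathcal{N}$ are weighted by a density value $\rho(s)$, given by Equation 4:
$$
\rho(s) = \frac{1}{\sqrt{2\pi\sigma^2}\,|\mathcal{A}(s)|} \sum_{\forall t \in \mathcal{A}(s)} \exp\left( \frac{-d^2(s,t)}{2\sigma^2} \right) \tag{4}
$$
where $|\mathcal{A}(s)| = k$, $\sigma = d_f / 3$, and $d_f$ is the maximum arc weight in $(\mathcal{N}, \mathcal{A})$. Using these parameters, all nodes are considered for density computation, since a Gaussian function covers most samples within $d(s,t) \in [0, 3\sigma]$.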
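As a rough sketch of Equation 4, the density of each node can be computed from its $k$ nearest neighbors as follows. The function name `knn_density`, the brute-force neighbor search, and the Euclidean distance are assumptions made for this illustration; the sketch also assumes that not all samples coincide, so that the maximum arc weight is non-zero.

```python
import math


def knn_density(X, k):
    """Density of every sample from its k nearest neighbours (Equation 4).

    Returns the densities and, for each sample, its list of
    (distance, neighbour index) arcs.
    """
    n = len(X)

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Brute-force k-nn arcs, excluding the sample itself.
    neighbours = []
    for s in range(n):
        arcs = sorted((dist(X[s], X[t]), t) for t in range(n) if t != s)
        neighbours.append(arcs[:k])

    d_f = max(d for arcs in neighbours for d, _ in arcs)   # maximum arc weight
    sigma = d_f / 3.0
    norm = math.sqrt(2.0 * math.pi * sigma ** 2) * k       # |A(s)| = k

    density = [
        sum(math.exp(-d ** 2 / (2.0 * sigma ** 2)) for d, _ in arcs) / norm
        for arcs in neighbours
    ]
    return density, neighbours
```

Samples lying in crowded regions have short arcs and therefore a higher density, while an isolated sample, whose nearest neighbor is far away, receives a value close to zero.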
A standard method to compute a probability density function (p.d.f.) is the Parzen window. Equation 4 provides a Parzen-window estimation based on an isotropic Gaussian kernel when the arcs are defined by $(s,t) \in \mathcal{A}$ if $d(s,t) \leq d_f$. Nevertheless, this approximation suffers from differences in scale and sample concentration.
An interesting way to solve this problem is to choose $d_f$ based on a particular region of the feature space [Comaniciu:03]. By considering the $k$-nearest neighbors, it is possible to handle different concentrations and transform the scale problem into finding the best value of $k$ within $[k_{min}, k_{max}]$, for $1 \leq k_{min} < k_{max} \leq |\mathcal{N}|$. An approach proposed by Rocha et al. [RochaIJIST:09] considers the minimum graph cut (best $k$) according to a measure suggested by Shi et al. [Shi:00].
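Moreover, let a path $\pi_t$ be the sequence of adjacent samples starting from a root $R(t)$ and ending at a sample $t$, with $\pi_t = \langle t \rangle$ being a trivial path and $\pi_s \cdot \langle s, t \rangle$ the concatenation of $\pi_s$ and the arc $(s,t)$. Among all possible paths $\pi_t$ with roots on the maxima of the p.d.f., the problem lies in finding the path whose lowest density value is as high as possible. Each path defines an influence zone (cluster) by selecting strongly connected samples. Mathematically speaking, Equation 5 defines the function $f(\pi_t)$ that is maximized for all $t \in \mathcal{N}$, where:
$$
f(\langle t \rangle) = \begin{cases} \rho(t) & \text{if } t \in \mathcal{R} \\ \rho(t) - \delta & \text{otherwise} \end{cases}
$$
and
$$
f(\pi_s \cdot \langle s, t \rangle) = \min\{ f(\pi_s), \rho(t) \} \tag{5}
$$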
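where $\delta = \min_{\forall (s,t) \in \mathcal{A} \mid \rho(t) \neq \rho(s)} |\rho(t) - \rho(s)|$ and $\mathcal{R}$ is a root set with one element for each maximum of the p.d.f. One can see that higher values of $\delta$ reduce the number of maxima. Additionally, in this library, we are using $\delta = 1.0$ and $\rho(t) \in [1, 1000]$.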
Finally, the OPF algorithm maximizes $f(\pi_t)$ such that the optimum paths compose an Optimum-Path Forest, i.e., a predecessor map $P$ with no cycles, which assigns to each sample $t \notin \mathcal{R}$ its predecessor $P(t)$ from the optimum path $P^{*}(t)$, or a marker $nil$ when $t \in \mathcal{R}$. Each p.d.f. maximum (prototype) is the root of an OPT, commonly known as a cluster. Furthermore, the collection of all OPTs is the so-called Optimum-Path Forest.
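The maximization described above can be sketched with a priority queue. This is an illustrative simplification rather than OPFython's code: `cluster_by_density` is a hypothetical name, the densities are given as a precomputed list, and the adjacency lists are assumed to be symmetric (the $k$-nn relation is not symmetric in general). The default `delta` is just a placeholder value for the example.

```python
import heapq


def cluster_by_density(density, adjacency, delta=1.0):
    """Maximise the path value f of Equation 5 over all samples.

    density   : list with one density value per sample
    adjacency : list of neighbour-index lists (assumed symmetric here)
    delta     : amount subtracted from non-root trivial paths

    Returns the cluster label and the predecessor (None for roots) of
    every sample.
    """
    n = len(density)
    value = [rho - delta for rho in density]   # trivial-path values
    pred = [None] * n
    label = [None] * n
    done = [False] * n
    # heapq is a min-heap, so values are negated to pop the maximum first.
    heap = [(-value[s], s) for s in range(n)]
    heapq.heapify(heap)
    n_clusters = 0
    while heap:
        _, s = heapq.heappop(heap)
        if done[s]:
            continue                           # stale heap entry
        done[s] = True
        if pred[s] is None:                    # nobody conquered s: a maximum
            label[s] = n_clusters
            n_clusters += 1
            value[s] = density[s]
        for t in adjacency[s]:
            if not done[t] and value[t] < value[s]:
                candidate = min(value[s], density[t])
                if candidate > value[t]:
                    value[t], pred[t], label[t] = candidate, s, label[s]
                    heapq.heappush(heap, (-candidate, t))
    return label, pred
```

A sample that is popped without a predecessor was never conquered, so it is a maximum of the p.d.f. and starts a new cluster. Raising `delta` lets a neighboring, denser region absorb shallow local maxima, which is how larger values reduce the number of clusters.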
4 OPFython
OPFython is distributed among several packages, each one being accountable for particular classes and methods. Figure 2 represents a summary of OPFython's architecture, while the next sections present each of its packages in more detail.
4.1 Core
The core package serves as the origin of all OPFython's subclasses. It acts as a building base for implementing the more specific structures that one may require when creating an Optimum-Path Forest-based classifier. As portrayed in Figure 3, four modules compose the core package, as follows:

Heap: The heap assists OPF in stacking nodes according to their costs and later unstacking them to build the subgraph;

Node: When working with graph-based structures, each of their pieces is represented by a node. In OPFython, we use the node structure to store valuable information about a sample, such as its features, label, and other information that OPF might need;

OPF: The OPF class serves as the classifier itself. It implements some basic methods that are common to its children, as well as some methods that assist users in saving and loading pre-trained models;

Subgraph: The subgraph is one of the most fundamental structures of the OPF classifier. As it is a graph-based classifier, it uses nodes and arcs to build the optimum-path costs and find the prototype nodes, which conquer the remaining samples and propagate their labels.
4.2 Math
In order to ease the user's life, OPFython offers a mathematical package containing low-level math implementations, illustrated by Figure 4. Naturally, functions that are used repeatedly throughout the library are collected in this package, as follows:

Distance: A distance metric is used to calculate the cost between nodes. Hence, we offer a variety of distance metrics to suit the needs of different tasks;

General: Common-use functions that do not belong to a more specific module are defined here;

Random: Lastly, some methods might use random numbers for sampling or setting a heuristic. This module can generate uniform and Gaussian random numbers.
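To illustrate the idea of selecting a distance metric by name, the sketch below defines three common metrics and a small lookup table. All names here (`DISTANCES`, `get_distance`, and the metric keys) are hypothetical and chosen for the example; they are not taken from OPFython's distance module.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean_distance(x, y):
    """Euclidean distance without the square root (cheaper, same ordering)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan_distance(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical registry mapping a metric name to its function.
DISTANCES = {
    "euclidean": euclidean_distance,
    "squared_euclidean": squared_euclidean_distance,
    "manhattan": manhattan_distance,
}


def get_distance(name):
    """Look up a metric by name, failing loudly on unknown names."""
    try:
        return DISTANCES[name]
    except KeyError:
        raise ValueError(f"unknown distance metric: {name!r}") from None
```

Keeping the metrics behind a single lookup lets a classifier receive the metric as a string argument, so switching metrics does not require touching the classifier code.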
4.3 Models
There are several approaches to designing an Optimum-Path Forest classifier, such as supervised, unsupervised, and semi-supervised, among others. Therefore, the models package provides classes and methods that compose these high-level abstractions and implement the classifying strategies. Currently, OPFython offers four types of classifiers, which are illustrated by Figure 5 and described as follows:

KNNSupervisedOPF [PapaKNN:09]: A supervised Optimum-Path Forest classifier that uses a KNN-based subgraph, providing a more effective way to build up the connectivity subgraph;

SemiSupervisedOPF [Amorim:14]: A semi-supervised Optimum-Path Forest classifier, which is extremely useful in labeling unknown samples;

SupervisedOPF [Papa:09]: The classical supervised Optimum-Path Forest classifier, which is suitable for training on labeled datasets and performing new predictions;

UnsupervisedOPF [Rocha:09]: The standard unsupervised Optimum-Path Forest classifier, which is suitable for clustering unlabeled datasets.
4.4 Stream
The stream package deals with every pre-processing step of the input data. Essentially, it is responsible for loading the data, parsing it into samples and labels, and splitting it into new sets, such as training, validation, and testing. Figure 6 depicts its modules, which are briefly described as follows:

Loader: A loading module that assists users in pre-loading datasets. Currently, it is possible to load files in .txt, .csv and .json formats;

Parser: After loading the files, it is necessary to parse the pre-loaded arrays into samples and labels;

Splitter: Finally, if necessary, one can split the loaded and parsed dataset into new sets, such as training, validation, and testing.
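The three stages above can be sketched with the standard library alone. The helper names (`load_csv_text`, `parse_rows`, `split`) and the row layout (sample identifier, label, then features) are assumptions made for this example and are not guaranteed to match OPFython's loader, parser, or splitter.

```python
import csv
import io
import random


def load_csv_text(text):
    """Read comma-separated text into rows of floats."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def parse_rows(rows):
    """Separate rows into samples and labels.

    Assumed layout (hypothetical): [identifier, label, feature_1, ...].
    """
    X = [row[2:] for row in rows]
    y = [int(row[1]) for row in rows]
    return X, y


def split(X, y, percentage=0.5, seed=0):
    """Shuffle with a fixed seed and cut the data into two sets."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * percentage)
    first, second = indices[:cut], indices[cut:]
    return (
        [X[i] for i in first],
        [X[i] for i in second],
        [y[i] for i in first],
        [y[i] for i in second],
    )
```

Calling `split` twice, once on the full data and once on one of the resulting halves, yields training, validation, and testing sets; fixing the seed keeps the partition reproducible between runs.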
4.5 Subgraphs
As mentioned before, the subgraph is one of the essential structures of the classification process. Nevertheless, distinct classifiers might need distinct subgraphs. Therefore, we offer additional subgraph implementations, as portrayed by Figure 7 and described as follows:

KNNSubgraph: When dealing with KNN-based classifiers, it is crucial to use a KNN-based subgraph, as it implements some additional functions that the classifier might need.
4.6 Utils
The utility package implements standard tools that are shared across the library, as it is better to implement them once and reuse them in other modules, as shown in Figure 8. This package implements the following modules:

Constants: Constants are fixed numbers that do not change throughout the code. For convenience, they are implemented in the same module;

Converter: Most OPF users are familiar with the specific file format it uses. Hence, we implement a dedicated module that is capable of converting .opf files into .txt, .csv, and .json;

Exception: In order to assist users, the exception module implements common errors and exceptions that might happen when invalid arguments are used in OPFython classes and methods;

Logging: Every method that is invoked in the library is logged onto a log file. One can watch the log in order to detect potential errors, essential warnings, or even success messages throughout the classification procedure.
5 Library Usage
In this section, we describe how to install the OPFython library, as well as the first steps to start using it. Essentially, one can study its documentation or make use of the already-included examples. Besides, there are implemented methods that conduct unit tests and verify whether everything is operating as expected.
5.1 Installation
Installation is meant to be smooth rather than tricky or daunting, so that OPFython remains a go-to package from the very first installation to its further usage. Simply execute the following command under your preferred Python environment (standard, conda, virtualenv):
pip install opfython
Alternatively, it is possible to use the bleeding-edge version by cloning its repository and installing it:
git clone https://github.com/gugarosa/opfython.git
cd opfython
pip install .
Note that there is no other requirement to use OPFython. As its single dependency is the NumPy package, it can be installed anywhere, regardless of the machine's operating system.
5.2 Documentation
One might be interested in mastering the concepts and strategies behind OPFython. Hence, we provide a fully documented reference (https://opfython.readthedocs.io) containing everything that the library offers. From elementary classes to more complex methods, OPFython's documentation is the reference to learn how the library was developed or even to improve it with contributions.
5.3 Classes and Methods Examples
Additionally, in the examples/ folder, we provide example scripts for all packages that the library implements, such as:

Core: create_heap.py, create_node.py, create_opf.py, create_subgraph.py;

Math: calculate_distances.py, general_purpose.py, generate_random_numbers.py;

Models: create_knn_supervised_opf.py, create_semi_supervised_opf.py, create_supervised_opf.py, create_unsupervised_opf.py;

Stream: load_file.py, parse_loaded_file.py, split_data.py;

Subgraphs: create_knn_subgraph.py;

Utils: convert_from_opf.py.
Each example consists of high-level explanations of how to use predefined classes and methods. It provides a standard description of how to instantiate each class and which arguments should be employed.
5.4 Test Suites
OPFython ships with tests that allow a more in-depth analysis of the code. The intention behind any test is to check whether everything is running as expected. There are two main methods to execute the tests:

PyTest: The first method is running the single command pytest tests/, as depicted by Figure 9. It will run all the implemented tests and return an output indicating whether they succeeded or failed;

Coverage: An interesting extension to PyTest is the coverage module. Besides providing the same outputs as PyTest, it also presents a report that states how much of the code the tests cover, as illustrated by Figure 10. Its usage is also straightforward: coverage run -m pytest tests/ and coverage report -m.
6 Applications
In this section, we explain how to perform a classification task with OPFython, as well as briefly describe the seven preloaded applications that are included with the library.
6.1 Getting Started
After installing the OPFython library, it is straightforward to use its packages, where seven elementary examples show key features that the library implements. One can refer to the examples/applications folder and examine the following files:

KNN-based supervised OPF training: knn_supervised_opf_training.py;

Semi-supervised OPF: semi_supervised_opf_training.py;

Supervised OPF agglomerative learning: supervised_opf_agglomerative.py;

Supervised OPF learning: supervised_opf_learning.py;

Supervised OPF pruning: supervised_opf_pruning.py;

Supervised OPF training: supervised_opf_training.py;

Unsupervised OPF clustering: unsupervised_opf_clustering.py.
Each example comprises the following pipeline: loading the dataset, parsing the dataset, splitting the dataset, instantiating a classifier, fitting the training data, predicting the validation/test data, and calculating the classifier's performance. Finally, after performing the classification process, it is possible to save the model to a file on disk for further inspection. Figure 11 illustrates the output logs generated by an OPFython classification.
The difference between the provided scripts lies in the type of classifier. While supervised classification attempts to learn a set of classes from particular samples that represent them, unsupervised classification tries to aggregate samples into clusters, i.e., dense regions where samples share some similar traits. For now, we offer three supervised classifications, namely KNN-based supervised OPF, semi-supervised OPF, and supervised OPF, as well as one unsupervised classification, the unsupervised OPF. Additionally, we offer three extensions of the supervised OPF, namely supervised OPF agglomerative learning (learns from mistakes over the validation set), supervised OPF learning (learns the best classifier over a validation set), and supervised OPF pruning (prunes nodes while maintaining the accuracy).
6.2 Modeling a New Classification
In order to model a new classification, a few conventions need to be understood. First of all, the data should be loaded and parsed; in this case, we load a common dataset known as Boat:
Furthermore, if necessary, we can split the data into new sets, such as training and testing, as follows:
Afterward, we can instantiate an OPF classifier:
Finally, we can fit the classifier and perform new predictions:
After predicting new samples, it is possible to evaluate the classifier’s performance:
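As a rough, self-contained illustration of the five steps above, the following sketch runs the same pipeline with the standard library only. It deliberately does not call OPFython: `NearestSampleClassifier` is a hypothetical stand-in that merely shares the fit/predict shape of a classifier, the in-memory rows stand in for the Boat dataset file, the row layout (identifier, label, features) is assumed, and the stratified split is a choice made here so that every class appears in the training set.

```python
import random


class NearestSampleClassifier:
    """Hypothetical stand-in classifier (not an OPF): labels each sample
    with the label of its nearest training sample."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        preds = []
        for x in X:
            nearest = min(range(len(self.X)), key=lambda i: sq_dist(x, self.X[i]))
            preds.append(self.y[nearest])
        return preds


def stratified_split(X, y, percentage, seed):
    """Split each class separately so both sets contain every class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(y)):
        members = [i for i, lab in enumerate(y) if lab == cls]
        rng.shuffle(members)
        cut = max(1, int(len(members) * percentage))
        train_idx += members[:cut]
        test_idx += members[cut:]
    return (
        [X[i] for i in train_idx], [X[i] for i in test_idx],
        [y[i] for i in train_idx], [y[i] for i in test_idx],
    )


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels."""
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


def run_pipeline(rows, percentage=0.5, seed=0):
    # 1. Parse the loaded rows into samples and labels.
    X = [row[2:] for row in rows]
    y = [row[1] for row in rows]
    # 2. Split the data into training and testing sets.
    X_train, X_test, y_train, y_test = stratified_split(X, y, percentage, seed)
    # 3. Instantiate the classifier.
    clf = NearestSampleClassifier()
    # 4. Fit on the training set and predict the testing set.
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    # 5. Evaluate the predictions.
    return accuracy(y_test, preds)
```

With OPFython installed, the stand-in pieces would be replaced by the library's own loader, parser, splitter, classifier, and accuracy function; the order of the steps stays the same.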
7 Conclusions
In this article, we introduce an open-source Python-inspired library for handling Optimum-Path Forest classifiers, known as OPFython. Based upon an object-oriented paradigm, OPFython provides a modern yet straightforward implementation, allowing users to prototype new OPF-based classifiers swiftly.
The library implements a wide variety of Optimum-Path Forest classifiers, such as supervised, semi-supervised, and unsupervised ones, as well as auxiliary functions that assist the classifiers' workflow, i.e., distance functions, classification metrics, data processing, and error logging. Additionally, as OPFython is closely inspired by the original LibOPF, it is possible to use the same loading format (OPF file format) and methods that are available in the original package. Furthermore, OPFython provides a model-saving method, which can be used to pre-train classifiers and retrieve insightful information about the classification procedure.
Regarding future works, we intend to make more OPF-based classifiers available, as well as a visualization package, which will allow users to feed in their saved models and produce charts. Furthermore, we aim at improving our implementations by distributing the calculations, i.e., employing parallel computing, which will hopefully reduce the computational burden.
Acknowledgments
The authors are grateful to the São Paulo Research Foundation (FAPESP) grants #2013/07375-0, #2014/12236-1, #2017/25908-6, #2018/15597-6, and #2019/02205-5, as well as CNPq grants #307066/2017-7 and #427968/2018-6.