A good representation of the encoded knowledge in a machine learning model is fundamental to its success. Several data structures have been used for this purpose, for instance, matrices of weights, trees, and graphs [1, 2]. Some learning models, however, lack a dedicated data structure to store knowledge and store the input as is.
In recent years, several works have been using complex networks for data representation and analysis [4, 5, 6]. Complex networks are graphs with a nontrivial topology that represent the interactions of a dynamical system. Advances in the science of complex systems provide several tools to understand such systems.
In [8], we describe how to map a dataset with binary features into a bipartite complex network. Such a network is called a feature–sample network. Using that representation, we solve a semi-supervised learning task called positive-unlabeled (PU) learning.
When dealing with machine learning problems, we often need to pre-process the input data. Feature learning transforms or enhances raw data into structures that are effectively exploited by learning models. Autoencoders and manifold learning are examples of feature learning methods.
In this paper, we propose a feature learning process to add information to feature–sample networks. In summary, we add to the network a limited number of new vertices based on a nonlinear function of the preexisting ones. The set of new vertices is determined by a multi-objective optimization problem, in which the goal is to maximize the number of new features while maintaining some properties of the original data. Two multi-objective approaches are designed: a lexicographic genetic algorithm (LGA) and an implementation of the improved strength Pareto evolutionary algorithm (SPEA2).
We show that enhanced feature–sample networks improve the performance of learning methods in the major machine learning paradigms: supervised and unsupervised learning. We also expose the pros and cons of each optimization approach.
The rest of this paper is organized as follows. Sections II and III describe how to enhance feature–sample networks as an optimization problem. In Section IV, computer simulations illustrate the optimization process and assess the performance improvements in machine learning tasks. Finally, we conclude this paper in Section V.
II Enhanced feature–sample networks
In this section, we describe how we enhance a feature–sample network by adding to the network a constrained number of new features. Connections between the samples and each new feature depend on a nonlinear function of a combination of preexisting features. The chosen set of new features is the result of an optimization process that not only maximizes the number of new features but also distributes them along the samples. This process is similar to the projection of the associated data into a higher-dimensional space preserving some important characteristics of the data.
In the following subsections, we first review the feature–sample networks, then describe the creation of new features. Moreover, we elaborate an optimization problem to enhance feature–sample networks.
II-A Feature–sample network review
Assume as input the dataset $\mathcal{X} = \{x_1, \ldots, x_n\}$ whose members $x_i \in \{0, 1\}^m$ are binary feature vectors. The feature vectors are sparse, that is, the number of elements with value 1 is much lower than the dimension $m$.
The feature–sample network is the bipartite complex network whose edges connect samples and features of the dataset $\mathcal{X}$. A simple, unweighted, undirected graph represents such a network. The vertex set comprises a vertex $s_i$ for each sample and a vertex $f_j$ for each feature, and an edge exists between sample $s_i$ and feature $f_j$ if $x_{ij} = 1$.
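As an illustration, this mapping can be sketched as follows; the adjacency-list representation and the function name are our own assumptions, not the paper's implementation.

```python
def feature_sample_network(X):
    """Map a binary dataset into a bipartite feature-sample network.

    X is a list of binary feature vectors (rows are samples).
    Returns an adjacency list: sample i is linked to feature j
    whenever X[i][j] == 1.
    """
    edges = {("s", i): set() for i in range(len(X))}
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if value == 1:
                edges[("s", i)].add(("f", j))
    return edges

# Tiny example: 3 samples, 4 binary features.
X = [[1, 0, 0, 1],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
net = feature_sample_network(X)
```

Since the vectors are sparse, an adjacency list stores far fewer entries than a dense matrix.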
II-B And-features definition
According to Cover's theorem [3], given a not-densely populated space of a classification problem, the chance of it being linearly separable increases when it is cast nonlinearly into a higher-dimensional space.
Since we assume the input feature–sample network is sparse, we can synthesize features nonlinearly to exploit the properties of this theorem. One way to produce new features is using the and operator, which is a nonlinear Boolean function.
We call and-feature the feature that links to all samples connected to a given set of two or more preexisting features.
Given a feature–sample network with $n$ samples and $m$ features, we can produce an and-feature for each combination $c$ of features such that $c \subseteq \{f_1, \ldots, f_m\}$ and $|c| \geq 2$. We call $o = |c|$ the order of the and-feature. Thus, the number of possible and-features is
$$2^m - m - 1.$$
In the rest of this paper, we index every possible and-feature by a binary vector $b \in \{0, 1\}^m$ such that $\sum_j b_j \geq 2$. Each element $b_j$ indicates a feature in the network. The feature $f_j$ is part of the combination if, and only if, $b_j = 1$. Thus, the set is $c = \{f_j : b_j = 1\}$.
Using this notation, we say that the and-feature $b$ connects to the sample $s_i$ if, and only if, $x_{ij} = 1$ holds for every $j$ such that $b_j = 1$.
From this discussion, we realize that enumerating every combination has exponential cost. Moreover, since the network is sparse, we expect that many of the and-features have no connections.
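A minimal sketch of the and-feature construction (function names are ours): the connections of an and-feature are the logical and of the chosen feature columns, and the number of possible and-features grows as $2^m - m - 1$.

```python
def and_feature(X, b):
    """Connections of the and-feature indexed by the binary mask b.

    Sample i is connected iff X[i][j] == 1 for every j with b[j] == 1.
    """
    assert sum(b) >= 2, "an and-feature combines two or more features"
    return [int(all(row[j] == 1 for j in range(len(b)) if b[j] == 1))
            for row in X]

def possible_and_features(m):
    """Number of feature subsets of size >= 2 among m features."""
    return 2 ** m - m - 1

X = [[1, 0, 1],
     [1, 1, 1],
     [0, 1, 1]]
col = and_feature(X, [1, 0, 1])   # and of features 0 and 2
```

In this toy network only the first two samples hold both features 0 and 2, so `col` is `[1, 1, 0]`; the count already reaches `possible_and_features(10) == 1013` for a mere ten features.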
II-C Optimization problem definition
The problem of enhancing the network can be viewed as an optimization problem. Given an input feature–sample network $G$ with $n$ samples and $m$ features, we denote by $G_S$ the network enhanced from the original by adding every and-feature $b \in S$. The number of features of the enhanced network, excluding the and-features that do not connect, is given by $m_S$. Thus, $m \leq m_S \leq m + |S|$.
Let $q$ be the maximum allowed number of generated features; we want to
$$\max_{S} \; m_S - m \quad \text{subject to} \quad m_S - m \leq q.$$
The disadvantage of this approach is that the and-features might not be well distributed. Thus, while some samples may have few new features, others may have many. To overcome this limitation, we introduce the disproportion between the network $G$ and its enhanced version $G_S$.
The disproportion is zero if the number of new connections in each sample is proportional to its initial sparsity. In this way, while the sparsity of each sample might change, we keep the same shape of the degree distribution of the samples.
Let $d_i$ be the degree of the sample $s_i$ in $G$ and $d_i^S$ its degree in the enhanced network $G_S$; the disproportion between the two networks is
$$\delta(G, G_S) = \sigma\!\left( \frac{d_1^S}{d_1}, \frac{d_2^S}{d_2}, \ldots, \frac{d_n^S}{d_n} \right),$$
where $\sigma$ is the standard deviation of the arguments.
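Assuming the disproportion is the standard deviation of the per-sample ratios between enhanced and original degrees (our reading of the definition), it can be sketched as:

```python
from statistics import pstdev

def disproportion(original_degrees, enhanced_degrees):
    """Standard deviation of the per-sample degree growth ratios.

    Zero when every sample gains connections proportionally to its
    original degree, i.e., when all ratios are equal.
    """
    ratios = [e / d for d, e in zip(original_degrees, enhanced_degrees)]
    return pstdev(ratios)

# Proportional growth (every degree doubled) gives zero disproportion.
zero = disproportion([2, 4, 6], [4, 8, 12])
```

Any unbalanced growth, e.g. `disproportion([2, 2], [4, 2])`, yields a strictly positive value.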
Using the disproportion in our optimization problem, the goal is to
$$\max_{S} \; m_S - m \quad \text{and} \quad \min_{S} \; \delta(G, G_S), \quad \text{subject to} \quad m_S - m \leq q.$$
III Multi-objective optimization
In this section, we study the multi-objective problem stated in the previous section and describe how we solve it.
III-A Problem study
In Section II-B, we see that the number of possible and-features scales exponentially in the number of features $m$. As a consequence, storing every possible and-feature is not feasible.
Furthermore, the number of candidate solutions is also exponential in the number of possible new features. Precisely, there are at most
$$2^{\,2^m - m - 1}$$
solutions to explore. We cannot limit the size of the set $S$ by $q$ since many and-features may have no connection.
The three common approaches for solving multi-objective optimization problems are weighted-formula, lexicographic, and Pareto. The first strategy transforms the problem into a single-objective one, usually by weighting each objective and adding them up. The lexicographic approach assigns a priority to each objective and then optimizes the objectives in that order. Hence, when comparing two candidate solutions, the highest-priority objective is compared and, if equivalent, the second objective is compared. If the second objective is also equivalent between the solutions, the third one is used, and so on. Both the weighted-formula and the lexicographic strategies return only one solution for the problem. The Pareto methods use different mathematical tools to evaluate the candidate solutions, finding a set of non-dominated solutions. A solution is said to be non-dominated if it is not worse than any other solution concerning all the criteria.
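For our two objectives (more connected and-features, less disproportion), the non-dominated check can be sketched as follows; this is a generic textbook formulation, not code from the paper.

```python
def dominates(a, b):
    """True if solution a dominates b: a is at least as good in both
    objectives and strictly better in at least one.

    Solutions are pairs (number_of_and_features, disproportion);
    the first objective is maximized, the second minimized.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better

def non_dominated(solutions):
    """Pareto front: solutions not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions)]

front = non_dominated([(10, 0.2), (12, 0.3), (8, 0.1), (9, 0.4)])
```

Here `(9, 0.4)` is dominated by `(10, 0.2)` and drops out, while the other three solutions trade off the two objectives and stay on the front.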
Since the difference of scale between our objective functions is not trivial, we opted to use only a lexicographic and a Pareto method.
We design two population-based optimization algorithms. Specifically, we consider the use of two metaheuristics:
a lexicographic genetic algorithm (GA); and
the improved strength Pareto evolutionary algorithm (SPEA2) [12].
Although the methods are different, both approaches share many properties – individual representation, population initialization, operators of mutation, recombination and selection – which are explained in Section III-D. The main difference between them is the fitness evaluation.
In the GA, the individuals are ordered lexicographically, that is, primarily by the first objective function and, in case of a tie, by the second objective. SPEA2, however, considers not only the Pareto front but also the density of the solutions.
III-B Lexicographic genetic algorithm review
A GA has the following steps [13]:
In words, a random population of candidate solutions is generated as the first step. Then, while the stop condition is not met, the next generation of individuals comprises the best individuals of the previous generation and the individuals originated by recombining and mutating parents selected from the previous generation.
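The loop just described can be sketched generically; the operator implementations, names, and the toy fitness below are placeholders of ours, not the paper's code.

```python
import random

def genetic_algorithm(init, evaluate, select, recombine, mutate,
                      elite_size, generations):
    """Generic elitist GA skeleton: keep the best individuals and fill
    the rest of the population with mutated offspring of selected parents."""
    population = init()
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        next_gen = ranked[:elite_size]                  # elitism
        while len(next_gen) < len(population):
            a, b = select(population), select(population)
            for child in recombine(a, b):
                next_gen.append(mutate(child))
        population = next_gen[:len(population)]
    return max(population, key=evaluate)

# Toy usage: maximize the number of ones in a 10-bit string.
random.seed(0)
best = genetic_algorithm(
    init=lambda: [[random.randint(0, 1) for _ in range(10)]
                  for _ in range(20)],
    evaluate=sum,
    select=lambda pop: max(random.sample(pop, 2), key=sum),  # binary tournament
    recombine=lambda a, b: ([x if random.random() < 0.5 else y
                             for x, y in zip(a, b)],),       # uniform crossover
    mutate=lambda c: [bit ^ (random.random() < 0.1) for bit in c],
    elite_size=2,
    generations=50,
)
```

Because the elite is never mutated, the best fitness is non-decreasing across generations.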
In the lexicographic approach, the only difference is in the evaluation of the candidate solutions, where the best individuals are decided lexicographically.
In our specific problem, a solution $S_1$ is better than $S_2$ if $m_{S_1} > m_{S_2}$, or if $m_{S_1} = m_{S_2}$ and $\delta(G, G_{S_1}) < \delta(G, G_{S_2})$.
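This comparison (first maximize the feature count; on ties, minimize the disproportion) can be sketched as:

```python
def lexicographic_better(s1, s2):
    """Solutions are pairs (number_of_and_features, disproportion).
    The count has priority; the disproportion breaks ties."""
    if s1[0] != s2[0]:
        return s1[0] > s2[0]     # more and-features wins
    return s1[1] < s2[1]         # tie: lower disproportion wins

assert lexicographic_better((12, 0.3), (10, 0.1))  # count has priority
assert lexicographic_better((10, 0.1), (10, 0.3))  # tie broken by disproportion
```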
III-C Improved Strength Pareto Evolutionary Algorithm review
SPEA2 works similarly to a GA. The major difference is that it keeps an archive with the candidate solutions for the Pareto set. If the number of non-dominated solutions is greater than the limit of the archive, some solutions are discarded. Such operation is called truncation. The truncation operator tries to maintain the candidate solutions uniformly distributed along the Pareto front.
As indicated by the authors, we select individuals by employing binary tournament. Also, letting $N$ be the population size and $\bar{N}$ the archive size, we fix the parameter $k = \sqrt{N + \bar{N}}$ of the $k$-th nearest neighbor used in the truncation operator [12].
III-D Metaheuristic design
The common implementation characteristics of our metaheuristics are described in the following items.
III-D1 Individual representation
In our problem, each solution is a set of zero or more and-features. If we enumerate every possible and-feature, the solution can also be viewed as a binary vector with entries 1 for the present and-features.
III-D2 Population initialization
Given the limit $q$, we sample, without replacement, random and-features to compose each candidate solution $S$ such that $|S| \leq q$. And-features are sampled so that the probability of having order $o$ decreases with the order, since, in a sparse network, high-order and-features rarely keep any connection.
III-D3 Recombination operator
We use the uniform crossover operator with two parents generating two children. In Section III-D6, we show how to implement it efficiently using our set representation.
III-D4 Selection operator
The binary tournament method is chosen to select the parents that go to the recombination step.
III-D5 Mutation operator
We formulated the following specific mutation operator to exploit the characteristics of our problem.
Given a candidate solution $S$, we apply random changes to the individual. For each change, there is an equal probability of either
trying to add a new and-feature; or
trying to remove an and-feature; or
trying to modify an and-feature.
When trying to add a new feature, an and-feature $b$ is sampled and the solution is updated to $S \cup \{b\}$. Note that the candidate solution may not change if the and-feature was previously present.
If trying to remove a feature, each and-feature $b \in S$ has the same probability of being selected. In this case, the candidate solution is updated to $S \setminus \{b\}$. Note that, when $S$ is empty, the individual does not change.
Finally, in the last case, an and-feature $b \in S$ is selected uniformly to be modified. Once the and-feature is selected, a modified and-feature $b'$ will be produced. Two cases may occur: a) with chance $1/2$, an index $j$ is selected uniformly and the element $b'_j$ is set to 1; b) otherwise, two indices $j$ and $j'$ are selected uniformly and the values $b'_j$ and $b'_{j'}$ are swapped.
The first case will include one more term into the and-feature if $b_j = 0$. The second one swaps two elements of $b$ and, effectively, takes effect when $b_j \neq b_{j'}$. The candidate solution is then updated to $(S \setminus \{b\}) \cup \{b'\}$. The size $|S|$ is never increased, and the candidate solution will be preserved if $b' = b$.
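A sketch of the three mutation moves on the set representation; the helper names, tuple masks, and uniform choices are our assumptions, not the paper's code.

```python
import random

def mutate(solution, sample_and_feature, m):
    """Apply one random change to a set-represented candidate solution.

    And-features are tuples of 0/1 flags of length m (the mask b);
    `sample_and_feature` draws a random candidate and-feature.
    """
    move = random.choice(["add", "remove", "modify"])
    if move == "add":
        return solution | {sample_and_feature()}
    if move == "remove" and solution:
        victim = random.choice(sorted(solution))
        return solution - {victim}
    if move == "modify" and solution:
        orig = random.choice(sorted(solution))
        b = list(orig)
        if random.random() < 0.5:
            b[random.randrange(m)] = 1                  # include one more term
        else:
            j, jp = random.randrange(m), random.randrange(m)
            b[j], b[jp] = b[jp], b[j]                   # swap two flags
        return (solution - {orig}) | {tuple(b)}
    return solution                                     # empty set: no change

# Exercise the operator on a tiny solution over m = 3 features.
random.seed(42)
s = {(1, 0, 1), (0, 1, 1)}
for _ in range(200):
    s = mutate(s, lambda: (1, 1, 0), 3)
```

Setting a flag to 1 and swapping flags never reduce the order below two, so every surviving mask remains a valid and-feature.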
III-D6 Performance considerations
Although we can view both solutions and and-features as binary vectors, the set representation is more practical because of the high space complexity of the problem. Moreover, there is no need to store entries for and-features that lack connections. Instead of just ignoring them, we exploit the evaluation step to determine which and-features are useless and remove them from the set.
Also, using the set representation, the crossover of the candidate solutions $S_1$ and $S_2$ can be implemented efficiently with $O(|S_1| + |S_2|)$ steps: the and-features in $S_1 \cap S_2$ are copied to both children, while each and-feature in $S_1 \mathbin{\triangle} S_2$ goes to one of the two children with equal probability, where $\mathbin{\triangle}$ stands for the symmetric difference operator.
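The symmetric-difference implementation can be sketched as follows (a hypothetical helper, not the paper's code):

```python
import random

def uniform_crossover(s1, s2):
    """Uniform crossover on set-represented solutions: shared
    and-features go to both children; each exclusive one goes to a
    random child."""
    common = s1 & s2
    child_a, child_b = set(common), set(common)
    for feature in s1 ^ s2:              # symmetric difference
        (child_a if random.random() < 0.5 else child_b).add(feature)
    return child_a, child_b

a, b = uniform_crossover({(1, 1, 0), (1, 0, 1)},
                         {(1, 1, 0), (0, 1, 1)})
```

The shared mask `(1, 1, 0)` always ends up in both children, while the exclusive masks are split between them without duplication.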
Finally, to conform to the requirements in Section III-D2, one can sample each candidate solution efficiently with $O(q)$ steps, where and-features are sampled without replacement.
IV Experimental results
In this section, we present applications of our feature learning technique in the two major categories in machine learning: supervised and unsupervised learning.
First, we illustrate the optimization process in the famous Iris dataset. In this example, we also conduct community detection in the original feature–sample network obtained from the dataset and in the enhanced one.
IV-A Enhanced community detection and clustering
The UCI Iris dataset [15] contains 150 samples that describe the sepals and petals of individual Iris flowers. The flowers are either Iris setosa, Iris virginica, or Iris versicolor. In [8], we construct a feature–sample network from this dataset by discretizing the features.
Figure 1 shows the generated network. Each color represents a different class. Circles represent samples and squares, features. We use that same network as an input to both algorithms, SPEA2 and the lexicographic GA.
IV-A1 Evolution of the candidate solutions
We execute the optimization process once for each strategy. We fix the population in individuals. For SPEA2, the archive has size and, for the lexicographic GA, we keep the best solutions over the generations. In the initial population, we use and . The recombination rate is and we apply random changes in each generated individual.
Figures 2 and 3 describe the obtained results. Both the disproportion and the number of discovered and-features are depicted along the generations. Solid lines are the average result in the population and shadows cover the range – from minimum to maximum – of each measurement. The results include only the non-dominated solutions in SPEA2 and the best solutions in the lexicographic GA.
Using both strategies, we could reach the optimal solution: every and-feature with at least one connection and zero disproportion. However, the optimization strategies differ as to how they achieve this.
SPEA2 tries to find as many new and-features as possible while keeping the ones with the lowest values of disproportion. When a larger set of and-features with zero disproportion is discovered, such a solution dominates every other solution found so far. Thus, we can observe “steps” in the evolution of the number of and-features.
In the lexicographic GA, the disproportion is only considered when the number of discovered and-features is the same. As a result, the algorithm greedily produces and-features disregarding the disproportion until it cannot find more new features to add. This enables a faster convergence in this case, but it may find only solutions with high disproportion when it is unfeasible to reach the maximum number of and-features – which is very common in practice. To mitigate this issue in larger problems, one can set the limit $q$ on the number of discovered features.
The optimal enhanced feature–sample network for this dataset is shown in Figure 4.
IV-A2 Community detection
Applying a greedy community detection method [16] in both networks, we observe that the enhanced network has higher modularity than the input network.
The enhanced network can also improve clustering tasks. Comparing the expected classes with the obtained communities, the enhanced version achieves a higher Jaccard index than the original network.
IV-B Performance enhancement in supervised learning
We also apply our proposal in 4 classification tasks from UCI [15]. Table I presents the datasets along with the number of samples $n$, features $m$, and possible and-features. We highlight that it is unfeasible to list every possible combination among the features even for small datasets. The networks are generated as shown in [8], discretizing the features into bins.
The optimization process is executed times for each strategy – SPEA2 and GA. We fix the population size, the archive size, and the number of elitism solutions. For the initial population, we use and . The recombination rate is and mutations are performed for each candidate solution. We limit the number of and-features by $q$. The execution is stopped after the 1000th generation.
Table II summarizes the number of and-features and disproportion obtained by the optimization process. For the lexicographic GA, we show the average and the standard deviation of the measurements among the best individuals. For SPEA2, only the non-dominated solutions are considered.
As expected from the previous study (Section IV-A), the lexicographic GA achieved a higher count of and-features – the maximum allowed – but worse values of disproportion. The candidate solutions of SPEA2 present wide variation, but consistently lower disproportion.
| Dataset     | Lowest disproportion | Highest number of and-features |
| Breast 2010 | 513, 0.000 (SPEA2)   | 2700, 0.129 (LGA)              |
| Ecoli       | 166, 0.011 (SPEA2)   | 1900, 0.114 (LGA)              |
| Glass       | 416, 0.019 (SPEA2)   | 2500, 0.130 (LGA)              |
| Wine        | 413, 0.015 (SPEA2)   | 3900, 0.226 (LGA)              |
[Table IV reports, for each dataset, the classification results using the original network, the lowest-disproportion solution, and the solution with the highest number of and-features.]
For each one of the datasets, we take the candidate solution with the highest count of and-features and the one with the lowest disproportion among every solution produced. (We ignored a few solutions with less than 100 and-features.) Such candidate solutions, and the strategy that found them, are indicated in Table III.
We solve the classification problems for each one of the datasets using the $k$-nearest neighbors method. As inputs, we use the interaction matrices of the original network and the enhanced ones from the selected candidate solutions. We performed split validations with and of labeled samples for each case. We also varied $k$.
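A minimal sketch of this evaluation step, assuming a Hamming distance between binary interaction-matrix rows (the paper does not state its distance; this is only an illustration):

```python
from collections import Counter

def knn_predict(train_rows, train_labels, row, k):
    """Classify a binary row by majority vote among the k nearest
    training rows under Hamming distance."""
    dist = lambda a, b: sum(x != y for x, y in zip(a, b))
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: dist(train_rows[i], row))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Toy interaction matrix: two well-separated classes.
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
labels = ["a", "a", "b", "b"]
pred = knn_predict(rows, labels, [1, 1, 1, 0], k=3)
```

Adding well-chosen and-feature columns to the rows changes these distances, which is how the enhanced networks can alter the classification outcome.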
The best results for each configuration are shown in Table IV. Improvements are highlighted in bold. Using the solution with the highest number of and-features, we improved the classification results in all cases. Using fewer and-features but with lower disproportion, we see improvements almost as good as with the higher and-feature count.
V Conclusions
In this paper, we presented an unsupervised feature learning mechanism that works on datasets with binary features. First, the dataset is mapped into a feature–sample network. Then, a multi-objective optimization process selects a set of new vertices that correspond to new features to produce an enhanced version of the network.
We show that the enhanced network contains more information and can be exploited to improve the performance of machine learning methods.
To solve the optimization problem, we opted to design population-based metaheuristics. We used both a lexicographic GA and the SPEA2 algorithm to find the candidate solutions.
From the experiments, we conclude that the GA produces more new features in fewer generations. However, candidate solutions in SPEA2, besides having fewer new features, also improved the performance of machine learning methods.
This fact suggests that the disproportion is a good measurement of the quality of the selected set of and-features. In future works, we will correlate improvement and disproportion of the solutions with the same number of features.
Furthermore, the learning techniques used – fast-greedy community detection and $k$-nearest neighbors – do not take full advantage of the new features. In subsequent studies, we will elaborate learning models that exploit the enhanced feature–sample network explicitly.
This research was supported by the São Paulo State Research Foundation (FAPESP) and the Brazilian National Research Council (CNPq).
- [1] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.
- [2] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
- [3] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, 1965.
- [4] T. C. Silva, L. Zhao, and T. H. Cupertino, “Handwritten data clustering using agents competition in networks,” Journal of Mathematical Imaging and Vision, vol. 45, no. 3, pp. 264–276, 2013.
- [5] T. C. Silva and L. Zhao, Machine Learning in Complex Networks, 1st ed. Springer, 2016.
- [6] F. A. N. Verri, P. R. Urio, and L. Zhao, “Network unfolding map by vertex-edge dynamics modeling,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2016.
- [7] M. E. J. Newman, Networks: An Introduction, 1st ed. New York, NY: Oxford University Press, 2010.
- [8] F. A. N. Verri and L. Zhao, “Random walk in feature-sample networks for semi-supervised classification,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016, pp. 235–240.
- [9] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, “Semisupervised one-class support vector machines for classification of remote sensing data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 8, pp. 3188–3197, 2010.
- [10] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- [11] A. A. Freitas, “A critical review of multi-objective optimization in data mining: A position paper,” SIGKDD Explorations Newsletter, vol. 6, no. 2, pp. 77–86, 2004.
- [12] E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength Pareto evolutionary algorithm for multiobjective optimization,” in Evolutionary Methods for Design, Optimization and Control with Applications to Industrial Problems. Athens, Greece: International Center for Numerical Methods in Engineering, 2001, pp. 95–100.
- [13] M. Srinivas and L. M. Patnaik, “Genetic algorithms: A survey,” Computer, vol. 27, no. 6, pp. 17–26, 1994.
- [14] C. A. Coello Coello, D. A. Van Veldhuizen, and G. B. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2002.
- [15] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
- [16] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, pp. 1–6, 2004.