I Introduction
A good representation of the encoded knowledge in a machine learning model is fundamental to its success. Several data structures have been used for this purpose, for instance, matrices of weights, trees, and graphs [1, 2]. Sometimes, learning models lack a data structure to store knowledge, storing the input as is [3].
In recent years, several works have been using complex networks for data representation and analysis [4, 5, 6]. Complex networks are graphs with a nontrivial topology that represent the interactions of a dynamical system [7]. Advances in the science of complex systems bring several tools to understand such systems.
In [8], we describe how to map a dataset with binary features into a bipartite complex network, called a feature–sample network. Using that representation, we solve a semi-supervised learning task called positive-unlabeled (PU) learning [9].
When dealing with machine learning problems, we often need to preprocess the input data. Feature learning transforms or enhances raw data into structures that are effectively exploited by learning models. Autoencoders and manifold learning are examples of feature learning methods [10].
In this paper, we propose a feature learning process that adds information to feature–sample networks. In summary, we include in the network a limited number of new vertices based on a nonlinear function of the preexisting ones. The set of new vertices is determined by a multiobjective optimization problem, in which the goal is to maximize the number of features while maintaining some properties of the original data. Two multiobjective approaches are designed: a lexicographic genetic algorithm (LGA) and an implementation of the improved strength Pareto evolutionary algorithm (SPEA2).
We show that enhanced feature–sample networks improve the performance of learning methods in the major machine learning paradigms: supervised and unsupervised learning. We also expose the pros and cons of each optimization approach.
The rest of this paper is organized as follows. Section II describes how to enhance feature–sample networks via an optimization problem, and Section III presents the methods to solve it. In Section IV, computer simulations illustrate the optimization process and assess the performance improvements in machine learning tasks. Finally, we conclude this paper in Section V.
II Enhanced feature–sample networks
In this section, we describe how we enhance a feature–sample network by adding to the network a constrained number of new features. Connections between the samples and each new feature depend on a nonlinear function of a combination of preexisting features. The chosen set of new features is the result of an optimization process that not only maximizes the number of new features but also distributes them along the samples. This process is similar to projecting the associated data into a higher-dimensional space while preserving some important characteristics of the data.
In the following subsections, we first review the feature–sample networks, then describe the creation of new features. Moreover, we elaborate an optimization problem to enhance feature–sample networks.
II-A Feature–sample network review
Assume as input a dataset whose members are binary feature vectors. The feature vectors are sparse, that is, the number of elements with value 1 is much lower than the dimension. The feature–sample network is the bipartite complex network whose edges connect samples and features of the dataset. A simple, unweighted, undirected graph represents such a network: the vertex set comprises one vertex per sample and one per feature, and an edge exists between a sample and a feature if that feature has value 1 in the sample.
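The mapping above can be sketched in a few lines; this is a minimal sketch with illustrative names (the paper does not publish code):

```python
def feature_sample_network(X):
    """Build the edge set of the bipartite feature–sample network:
    sample vertex i is linked to feature vertex j iff X[i][j] == 1.
    A minimal sketch; the vertex naming is illustrative."""
    edges = set()
    for i, x in enumerate(X):
        for j, value in enumerate(x):
            if value == 1:
                edges.add((i, j))
    return edges

# Two samples over three binary features:
X = [[1, 0, 1],
     [0, 1, 0]]
print(sorted(feature_sample_network(X)))   # [(0, 0), (0, 2), (1, 1)]
```

Keeping only the edges of nonzero entries is what makes the representation practical for sparse feature vectors.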
II-B And-feature definition
According to Cover's theorem [3], given a not densely populated space of a classification problem, the chance of it being linearly separable increases as one casts it nonlinearly into a high-dimensional space.
Since we assume the input feature–sample network is sparse, we can synthesize features nonlinearly to exploit the properties of this theorem. One way to produce new features is using the AND operator, which is a nonlinear Boolean function.
We call an and-feature a feature that links to all samples connected to a given set of two or more preexisting features.
Given a feature–sample network with N samples and D features, we can produce an and-feature for each combination of two or more of the D features. We call the number of combined features the order of the and-feature. Thus, the number of possible and-features is M = 2^D − D − 1.
In the rest of this paper, we index every possible and-feature by a binary vector of length D: each element of the vector indicates a feature of the network, and a feature is part of the combination if, and only if, its element equals 1. Using this notation, we say that an and-feature connects to a sample if, and only if, the sample is connected to every feature in the combination.
From this discussion, we realize that enumerating every combination has exponential cost. Moreover, since the network is sparse, we expect many of the and-features to have no connections.
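The counting argument can be checked directly: every subset of two or more features is a candidate, giving 2^D − D − 1 and-features, and the samples linked to an and-feature are the intersection of its features' neighborhoods. A hedged sketch, with illustrative names:

```python
def count_possible_and_features(D):
    """Number of subsets of two or more of the D features:
    2**D - D - 1 (all subsets minus the empty set and the singletons)."""
    return 2**D - D - 1

def and_feature_links(edges, n_samples, combo):
    """Samples connected to the and-feature built from the feature
    indices in `combo`: a sample links iff it links to every feature.
    `edges` is the (sample, feature) edge set; names are illustrative."""
    neighbors = lambda j: {i for i in range(n_samples) if (i, j) in edges}
    return set.intersection(*(neighbors(j) for j in combo))

# The counts match Table I, e.g. Ecoli has D = 19 features:
print(count_possible_and_features(19))   # 524268
```

The exponential count is exactly why the optimization in Section II-C never materializes the full list of and-features.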
II-C Optimization problem definition
The problem of enhancing the network can be viewed as an optimization problem. Given an input feature–sample network with N samples and D features, we build an enhanced network from the original by adding a set of and-features. The number of features of the enhanced network excludes the and-features that do not connect to any sample.
Given a maximum allowed number of generated features, we want to maximize the number of connected and-features in the enhanced network, subject to that limit.
The disadvantage of this approach is that the and-features might not be well distributed. Thus, while some samples may have few new features, others may have many. To overcome this limitation, we introduce the disproportion between the network and its enhanced version.
The disproportion is zero if the number of new connections in each sample is proportional to its initial sparsity. In this way, while the sparsity of each sample might change, we keep the same shape of the degree distribution of the samples.
Letting the degree of a sample be its number of connections, the disproportion between the two networks is defined in terms of the standard deviation, over the samples, of the relative degree growth.
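One way to realize a measure with these properties is the standard deviation of the per-sample degree growth ratios; this sketch is an assumption consistent with the description above, not necessarily the paper's exact formula:

```python
from statistics import pstdev

def disproportion(deg_original, deg_enhanced):
    """One plausible reading of the definition: the standard deviation
    of the per-sample degree growth ratios.  It is zero exactly when
    every sample gains connections in proportion to its original
    degree.  This formula is an assumption, not the paper's code."""
    ratios = [e / d for d, e in zip(deg_original, deg_enhanced)]
    return pstdev(ratios)

# Proportional growth (every degree doubled) yields zero disproportion:
print(disproportion([2, 4, 6], [4, 8, 12]))   # 0.0
```

Any uneven growth, e.g. `disproportion([2, 4], [4, 4])`, gives a strictly positive value.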
Using the disproportion in our optimization problem, the goal becomes to maximize the number of connected and-features and to minimize the disproportion, subject to the limit on generated features.
III Methods
In this section, we study the multiobjective problem stated in the previous section and describe how we solve it.
III-A Problem study
In Section II-B, we saw that the number of possible and-features scales exponentially with the number of features D. As a consequence, storing every possible and-feature is not feasible.
Furthermore, the number of candidate solutions is also exponential in the number of possible new features: since a solution is a subset of the M possible and-features, there are at most 2^M solutions to explore. We cannot reduce this count beforehand, since many and-features may have no connection.
The three common approaches for solving multiobjective optimization problems are weightedformula, lexicographic, and Pareto [11]. The first strategy transforms the problem into a singleobjective one, usually by weighting each objective and adding them up. The lexicographic approach assigns a priority to each objective and then optimizes the objectives in that order. Hence, when comparing two candidate solutions, the highestpriority objective is compared and, if equivalent, the second objective is compared. If the second objective is also equivalent between the solutions, the third one is used, and so on. Both the weightedformula and the lexicographic strategies return only one solution for the problem. The Pareto methods use different mathematical tools to evaluate the candidate solution, finding a set of nondominated solutions. A solution is said to be nondominated if it is not worse than any other solution concerning all the criteria [11].
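The notion of dominance used by the Pareto methods can be made concrete for our two objectives (maximize the and-feature count, minimize the disproportion); the function and tuple layout below are illustrative:

```python
def dominates(a, b):
    """Pareto dominance for our two objectives: a larger and-feature
    count (first element) and a smaller disproportion (second element)
    are better.  `a` and `b` are (count, disproportion) pairs; the
    representation is an assumption for illustration."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

print(dominates((10, 0.1), (8, 0.2)))   # True
print(dominates((10, 0.3), (8, 0.2)))   # False: worse disproportion
```

A solution is nondominated when no other solution dominates it under this test.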
Since the difference in scale between our objective functions is not trivial, we opted to use only a lexicographic and a Pareto method.
We design two population-based optimization algorithms. Specifically, we consider the use of two metaheuristics:

- a lexicographic genetic algorithm (GA); and

- the improved strength Pareto evolutionary algorithm (SPEA2) [12].
Although the methods are different, both approaches share many properties – individual representation, population initialization, and the mutation, recombination, and selection operators – which are explained in Section III-D. The main difference between them is the fitness evaluation.
In the GA, the individuals are ordered lexicographically, that is, primarily by the first objective function and, in case of a tie, by the second objective. SPEA2, however, considers not only the Pareto front but also the density of the solutions.
III-B Lexicographic genetic algorithm review
A GA proceeds as follows [13]: a random population of candidate solutions is generated as the first step. Then, while the stop condition is not met, the next generation of individuals comprises the best individuals of the previous generation together with individuals originated by recombining and mutating parents selected from the previous generation.
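The loop described above can be sketched generically; the parameter names and the toy objective are assumptions, not the paper's implementation:

```python
import random

def genetic_algorithm(init, fitness, crossover, mutate,
                      pop_size=20, elite=2, generations=50):
    """Generic GA loop: elitism keeps the best individuals, and the
    rest of the next generation comes from mutated offspring of
    tournament-selected parents.  A sketch under assumed parameter
    names, not the paper's code."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        next_gen = population[:elite]                  # elitism
        while len(next_gen) < pop_size:
            # binary tournament selection of each parent
            p1 = max(random.sample(population, 2), key=fitness)
            p2 = max(random.sample(population, 2), key=fitness)
            next_gen.append(mutate(crossover(p1, p2)))
        population = next_gen
    return max(population, key=fitness)

# Toy usage: maximize the number of ones in a 10-bit vector.
random.seed(0)
best = genetic_algorithm(
    init=lambda: [random.randint(0, 1) for _ in range(10)],
    fitness=sum,
    crossover=lambda a, b: [random.choice(p) for p in zip(a, b)],
    mutate=lambda x: [b ^ (random.random() < 0.05) for b in x])
print(sum(best))
```

The lexicographic variant changes only the `fitness` comparison, as discussed next.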
In the lexicographic approach, the only difference is during the evaluation of the candidate solution [14], where the best individuals are decided lexicographically.
In our specific problem, a solution is better than another if

- it has more connected and-features; or

- it has the same number of connected and-features and a lower disproportion.
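This comparison amounts to sorting by a lexicographic key; a minimal sketch with illustrative values:

```python
def lex_key(count_and_features, disproportion):
    """Sort key for the lexicographic GA: more connected and-features
    first and, on ties, lower disproportion (hence the negation)."""
    return (count_and_features, -disproportion)

# Sorting in descending order by this key puts the lexicographically
# best individuals first (names and numbers are illustrative):
pop = [("a", 5, 0.2), ("b", 5, 0.1), ("c", 7, 0.9)]
ranked = sorted(pop, key=lambda s: lex_key(s[1], s[2]), reverse=True)
print([name for name, *_ in ranked])   # ['c', 'b', 'a']
```

Note that "c" ranks first despite its high disproportion: the second objective only breaks ties.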
III-C Improved Strength Pareto Evolutionary Algorithm review
SPEA2 works similarly to a GA. The major difference is that it keeps an archive with the candidate solutions for the Pareto set. If the number of nondominated solutions is greater than the limit of the archive, some solutions are discarded. Such an operation is called truncation. The truncation operator tries to keep the candidate solutions uniformly distributed along the Pareto front [12].
As indicated by the authors, we select individuals by binary tournament. Also, given the archive size, we fix the corresponding parameter of the truncation operator as recommended [12].
III-D Metaheuristic design
The common implementation characteristics of our metaheuristics are presented in the following items.
III-D1 Individual representation
In our problem, each solution is a set of zero or more and-features. If we enumerate every possible and-feature, the solution can also be viewed as a binary vector with entries equal to 1 for the and-features present.
III-D2 Population initialization
Given an initial solution size, we sample, without replacement, random and-features to compose each candidate solution. And-features are sampled with a probability that depends on their order.
III-D3 Recombination operator
We use the uniform crossover operator with two parents generating two children. In Section III-D6, we show how to implement it efficiently using our set representation.
III-D4 Selection operator
The binary tournament method is chosen to select the parents that go to the recombination step.
III-D5 Mutation operator
We formulated the following specific mutation operator to exploit the characteristics of our problem.
Given a candidate solution, we apply random changes to the individual. For each change, there is an equal probability of either

- trying to add a new and-feature;

- trying to remove an and-feature; or

- trying to modify an and-feature.
When trying to add a new feature, an and-feature is sampled and included in the solution. Note that the candidate solution may not change if the and-feature was previously present.
If trying to remove a feature, each and-feature in the solution has the same probability of being selected and removed from the candidate solution. Note that there is a chance the individual does not change.
Finally, in the last case, an and-feature is selected uniformly to be modified, and a modified and-feature is produced from it. Two cases may happen: a) with some probability, an index is selected uniformly and included in the combination; b) otherwise, two elements of the combination are swapped. The first case includes one more term in the and-feature if the index was absent. The second one effectively takes effect only when exactly one of the swapped elements was present. The candidate solution is then updated by replacing the original and-feature with the modified one. The size of the solution is never increased, and the candidate solution is preserved if the modification produces an and-feature already present.
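The three mutation branches can be sketched as follows; the branch probabilities, the order of newly added and-features, and the toggle-based modification are simplifying assumptions made for illustration:

```python
import random

def mutate(solution, D, rng=random):
    """One mutation step on a solution (a set of and-features, each a
    frozenset of feature indices in range(D)).  The equal-probability
    branches follow the text; the details of each branch are
    simplifying assumptions, not the paper's exact operator."""
    op = rng.choice(("add", "remove", "modify"))
    solution = set(solution)
    if op == "add":
        # try to add a new and-feature (here, always of order 2)
        solution.add(frozenset(rng.sample(range(D), 2)))
    elif op == "remove" and solution:
        solution.discard(rng.choice(sorted(solution, key=sorted)))
    elif op == "modify" and solution:
        old = rng.choice(sorted(solution, key=sorted))
        solution.discard(old)
        new = set(old)
        new.symmetric_difference_update({rng.randrange(D)})  # toggle one index
        if len(new) >= 2:            # and-features need order >= 2
            solution.add(frozenset(new))
    return solution
```

Repeated application preserves the invariants of the representation: every element stays a valid and-feature of order at least two.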
III-D6 Performance considerations
Although we can view both solutions and and-features as binary vectors, the set representation is more practical because of the high space complexity of the problem. Moreover, there is no need to store entries for and-features that lack connections. Instead of just ignoring them, we exploit the evaluation step to determine which and-features are useless and remove them from the set.
Also, using the set representation, the crossover of two candidate solutions can be implemented efficiently, with a number of steps proportional to the size of their symmetric difference.
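Under the set representation, uniform crossover can be written so that only the symmetric difference is traversed; a sketch with illustrative names:

```python
import random

def uniform_crossover(x, y, rng=random):
    """Uniform crossover over the set representation: and-features in
    both parents are inherited by both children; each and-feature in
    the symmetric difference goes to one child or the other with equal
    probability.  Cost is proportional to |x ^ y|, the symmetric
    difference.  A sketch, not the paper's implementation."""
    common = x & y
    c1, c2 = set(common), set(common)
    for feat in x ^ y:                  # symmetric difference
        (c1 if rng.random() < 0.5 else c2).add(feat)
    return c1, c2

a, b = uniform_crossover({1, 2, 3}, {2, 3, 4})
print(a & b >= {2, 3}, a | b == {1, 2, 3, 4})   # True True
```

Shared and-features never need to be copied bit by bit, which matters when solutions hold thousands of and-features.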
Finally, to conform to the requirements of Section III-D2, one can sample each candidate solution efficiently, with the and-features drawn without replacement.
IV Experimental results
In this section, we present applications of our feature learning technique in the two major categories in machine learning: supervised and unsupervised learning.
First, we illustrate the optimization process in the famous Iris dataset. In this example, we also conduct community detection in the original feature–sample network obtained from the dataset and in the enhanced one.
Then, we observe the increase in accuracy obtained by the nearest neighbors classifier [1] on four other UCI datasets.
IV-A Enhanced community detection and clustering
The UCI Iris dataset [15] contains 150 samples that describe the sepals and petals of individual Iris flowers. The flowers are either Iris setosa, Iris virginica, or Iris versicolor. In [8], we construct a feature–sample network from this dataset by discretizing the features.
Figure 1 shows the generated network. Each color represents a different class. Circles represent samples; squares, features. We use that same network as input to both algorithms, SPEA2 and the lexicographic GA.
IV-A1 Evolution of the candidate solutions
We execute the optimization process once for each strategy with a fixed population size. For SPEA2, we set the archive size; for the lexicographic GA, we keep the best solutions over the generations. We also fix the initial-population parameters, the recombination rate, and the number of random changes applied to each generated individual.
Figures 2 and 3 describe the obtained results. Both the disproportion and the number of discovered and-features are depicted along the generations. Solid lines are the average result in the population and shadows cover the range – from minimum to maximum – of each measurement. The results include only the nondominated solutions in SPEA2 and the best solutions in the lexicographic GA.
Using both strategies, we could reach the optimal solution: every and-feature with at least one connection and zero disproportion. However, the optimization strategies differ as to how they achieve this.
SPEA2 tries to find as many new and-features as possible while keeping the ones with the lowest values of disproportion. When a larger set of and-features with zero disproportion is discovered, such a solution dominates every other solution found so far. Thus, we can observe "steps" in the evolution of the number of and-features.
In the lexicographic GA, the disproportion is only considered when the number of discovered and-features is the same. As a result, the algorithm greedily produces and-features disregarding the disproportion until it cannot find more new features to add. This enables a faster convergence in this case, but it may find only solutions with high disproportion when it is unfeasible to reach the maximum number of and-features – which is very common in practice. To solve this issue in larger problems, one can set a limit on the number of discovered features.
The optimal enhanced feature–sample network for this dataset is in Figure 4.
IV-A2 Community detection
Applying a greedy community detection method [16] to both networks, we observe that the enhanced network has a higher modularity than the input network.
The enhanced network can also improve clustering tasks. Comparing the expected classes and the obtained communities, the enhanced version achieves a higher Jaccard index.
IV-B Performance enhancement in supervised learning
Dataset      N    D   M
Breast 2010  106  27  134,217,700
Ecoli        336  19  524,268
Glass        214  25  33,554,406
Wine         178  39  549,755,813,848
We also apply our proposal to 4 classification tasks from UCI [15]. Table I presents the datasets along with the number of samples (N), features (D), and possible and-features (M). We highlight that it is unfeasible to list every possible combination of features even for small datasets. The networks are generated as shown in [8] by binning the features.
The optimization process is executed several times for each strategy – SPEA2 and the lexicographic GA. We fix the population size, the archive size, and the number of elite solutions, as well as the initial-population parameters, the recombination rate, and the number of mutations performed per candidate solution. We also limit the number of generated and-features. The execution is stopped after the 1000th generation.
Dataset  LGA  SPEA2
Table II summarizes the number of andfeatures and disproportion obtained by the optimization process. For the lexicographic GA, we show the average and the standard deviation of the measurements among the best individuals. For SPEA2, only the nondominated solutions are considered.
As expected from the previous study (Section IV-A), the lexicographic GA achieved a better count of and-features – the maximum allowed – but worse values of disproportion. The candidate solutions of SPEA2 present wide variation, but consistently lower disproportion.
Dataset      Lowest disproportion  Highest number of and-features
Breast 2010  513, 0.000 (SPEA2)    2700, 0.129 (LGA)
Ecoli        166, 0.011 (SPEA2)    1900, 0.114 (LGA)
Glass        416, 0.019 (SPEA2)    2500, 0.130 (LGA)
Wine         413, 0.015 (SPEA2)    3900, 0.226 (LGA)
Dataset  Original  Lowest disproportion  Highest number of and-features
(results reported for 70% and 80% labeled samples)
For each of the datasets, we take the candidate solution with the highest count of and-features and the one with the lowest disproportion among all solutions produced. (We ignored a few solutions with fewer than 100 and-features.) These candidate solutions, and the strategy that found them, are indicated in Table III.
We solve the classification problems for each of the datasets using the nearest neighbors method. As inputs, we use the interaction matrices of the original network and of the enhanced ones built from the selected candidate solutions. We performed split validations with 70% and 80% of labeled samples for each case. We also varied the number of neighbors.
The best results for each configuration are shown in Table IV. Improvements are highlighted in bold. Using the solution with the highest number of and-features, we improved the classification results in all cases. Using fewer and-features but with lower disproportion, we see improvements almost as good as with the higher and-feature count.
V Conclusion
In this paper, we presented an unsupervised feature learning mechanism that works on datasets with binary features. First, the dataset is mapped into a feature–sample network. Then, a multiobjective optimization process selects a set of new vertices that correspond to new features to produce an enhanced version of the network.
We show that the enhanced network contains more information and can be exploited to improve the performance of machine learning methods.
To solve the optimization problem, we opted to design population-based metaheuristics. We used both a lexicographic GA and the SPEA2 algorithm to find candidate solutions.
From the experiments, we conclude that the GA produces more new features in fewer generations. However, the candidate solutions of SPEA2, besides having fewer new features, also improved the performance of the machine learning methods.
This fact suggests that the disproportion is a good measurement of the quality of a selected set of and-features. In future works, we will correlate the improvement and the disproportion of solutions with the same number of features.
Furthermore, the learning techniques used – fast greedy community detection and nearest neighbors – do not take full advantage of the new features. In subsequent studies, we will elaborate learning models that exploit the enhanced feature–sample network explicitly.
Acknowledgments
This research was supported by the São Paulo State Research Foundation (FAPESP) and the Brazilian National Research Council (CNPq).
References
[1] C. M. Bishop, Pattern Recognition and Machine Learning. Springer-Verlag New York, 2006.
[2] X. Zhu and A. B. Goldberg, “Introduction to Semi-Supervised Learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
[3] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, 1965.
 [4] T. C. Silva, L. Zhao, and T. H. Cupertino, “Handwritten data clustering using agents competition in networks,” Journal of Mathematical Imaging and Vision, vol. 45, no. 3, pp. 264–276, 2013.
 [5] T. C. Silva and L. Zhao, Machine Learning in Complex Networks, 1st ed. Springer, 2016.

[6] F. A. N. Verri, P. R. Urio, and L. Zhao, “Network unfolding map by vertex-edge dynamics modeling,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2016.
[7] M. E. J. Newman, Networks: An Introduction, 1st ed. New York, NY: Oxford University Press, 2010.
[8] F. A. N. Verri and L. Zhao, “Random walk in feature–sample networks for semi-supervised classification,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016, pp. 235–240.

[9] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, “Semi-supervised one-class support vector machines for classification of remote sensing data,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 8, pp. 3188–3197, 2010.
[10] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
 [11] A. A. Freitas, “A critical review of multiobjective optimization in data mining: A position paper,” SIGKDD Explor. Newsl., vol. 6, no. 2, pp. 77–86, 2004.
 [12] E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength pareto evolutionary algorithm for multiobjective optimization,” in Evolutionary Methods for Design Optimization and Control with Applications to Industrial Problems. Athens, Greece: International Center for Numerical Methods in Engineering, 2001, pp. 95–100.
 [13] M. Srinivas and L. M. Patnaik, “Genetic algorithms: a survey,” Computer, vol. 27, no. 6, pp. 17–26, 1994.
 [14] C. A. Coello Coello, D. A. Van Veldhuizen, and G. B. Lamont, Evolutionary Algorithms for Solving MultiObjective Problems. Secaucus, NJ, USA: SpringerVerlag New York, Inc., 2002.
 [15] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
 [16] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, pp. 1–6, 2004.