Feature learning in feature-sample networks using multi-objective optimization

10/25/2017 ∙ by Filipe Alves Neto Verri, et al. ∙ Universidade de São Paulo

Data and knowledge representation are fundamental concepts in machine learning. The quality of the representation impacts the performance of the learning model directly. Feature learning transforms or enhances raw data to structures that are effectively exploited by those models. In recent years, several works have been using complex networks for data representation and analysis. However, no feature learning method has been proposed for such category of techniques. Here, we present an unsupervised feature learning mechanism that works on datasets with binary features. First, the dataset is mapped into a feature–sample network. Then, a multi-objective optimization process selects a set of new vertices to produce an enhanced version of the network. The new features depend on a nonlinear function of a combination of preexisting features. Effectively, the process projects the input data into a higher-dimensional space. To solve the optimization problem, we design two metaheuristics based on the lexicographic genetic algorithm and the improved strength Pareto evolutionary algorithm (SPEA2). We show that the enhanced network contains more information and can be exploited to improve the performance of machine learning methods. The advantages and disadvantages of each optimization strategy are discussed.



I Introduction

A good representation of the encoded knowledge in a machine learning model is fundamental to its success. Several data structures have been used for this purpose, for instance, matrices of weights, trees, and graphs [1, 2]. Sometimes, learning models lack a data structure to store knowledge, storing the input as is [3].

In recent years, several works have been using complex networks for data representation and analysis [4, 5, 6]. Complex networks are graphs with a nontrivial topology that represent the interactions of a dynamical system [7]. Advances in the science of complex systems bring several tools to understand such systems.

In [8], we describe how to map a dataset with binary features into a bipartite complex network, called a feature–sample network. Using that representation, we solve a semi-supervised learning task called positive-unlabeled (PU) learning [9].

When dealing with machine learning problems, we often need to preprocess the input data. Feature learning transforms or enhances raw data into structures that are effectively exploited by learning models. Autoencoders and manifold learning are examples of feature learning methods [10].

In this paper, we propose a feature learning process to add information to feature–sample networks. In summary, we add to the network a limited number of new vertices based on a nonlinear function of the preexisting ones. The set of new vertices is determined by a multi-objective optimization problem, in which the goal is to maximize the number of new features while maintaining some properties of the original data. Two multi-objective approaches are designed: a lexicographic genetic algorithm (LGA) and an implementation of the improved strength Pareto evolutionary algorithm (SPEA2).

We show that enhanced feature–sample networks improve the performance of learning methods in the major machine learning paradigms: supervised and unsupervised learning. We also expose the pros and cons of each optimization approach.

The rest of this paper is organized as follows. Sections II and III describe how to enhance feature–sample networks by solving an optimization problem. In Section IV, computer simulations illustrate the optimization process and assess the performance improvements in machine learning tasks. Finally, we conclude this paper in Section V.

II Enhanced feature–sample networks

In this section, we describe how we enhance a feature–sample network by adding to the network a constrained number of new features. Connections between the samples and each new feature depend on a nonlinear function of a combination of preexisting features. The chosen set of new features is the result of an optimization process that not only maximizes the number of new features but also distributes them along the samples. This process is similar to the projection of the associated data into a higher-dimensional space preserving some important characteristics of the data.

In the following subsections, we first review feature–sample networks and then describe the creation of new features. Moreover, we formulate an optimization problem to enhance feature–sample networks.

II-A Feature–sample network review

Assume as input a dataset whose members are binary feature vectors. The feature vectors are sparse, that is, the number of elements with value 1 is much lower than the dimension.

The feature–sample network is the bipartite complex network whose edges connect samples and features of the dataset. A simple, unweighted, undirected graph represents such a network. The vertex set contains one vertex per sample and one per feature, and an edge exists between a sample and a feature if, and only if, that feature has value 1 in the sample.
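As a concrete illustration of this mapping, the sketch below builds the edge set of a feature–sample network from a binary data matrix. The representation and names are illustrative, not the paper's implementation.

```python
# Build a feature-sample network from a binary data matrix.
# Rows of X are samples, columns are features; names here are illustrative.

def build_feature_sample_network(X):
    """Sample i links to feature j if, and only if, X[i][j] == 1."""
    edges = set()
    for i, row in enumerate(X):
        for j, value in enumerate(row):
            if value == 1:
                edges.add((("sample", i), ("feature", j)))
    return edges

X = [
    [1, 0, 1],  # sample 0 has features 0 and 2
    [0, 1, 0],  # sample 1 has feature 1
]
network = build_feature_sample_network(X)
print(len(network))  # 3 edges, one per 1-entry in X
```

A set of pairs is a convenient stand-in for the bipartite graph; any graph library with bipartite support would serve equally well.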

II-B And-features definition

According to Cover's theorem [3], a sparsely populated classification problem is more likely to become linearly separable when it is cast nonlinearly into a higher-dimensional space.

Since we assume the input feature–sample network is sparse, we can synthesize features nonlinearly to exploit the properties of this theorem. One way to produce new features is using the and operator, which is a nonlinear Boolean function.

We call and-feature the feature that links to all samples connected to a given set of two or more preexisting features.

Given a feature–sample network with $N$ samples and $D$ features, we can produce an and-feature for each combination of $k$ features such that $2 \le k \le D$. We call $k$ the order of the and-feature. Thus, the number of possible and-features is

$M = \sum_{k=2}^{D} \binom{D}{k} = 2^D - D - 1.$

In the rest of this paper, we index every possible and-feature using a binary parameter vector $c = (c_1, \dots, c_D)$ such that $\sum_j c_j \ge 2$. Each element indicates a feature in the network: feature $j$ is part of the combination if, and only if, $c_j = 1$. Thus, the set of combined features is $\{\, j : c_j = 1 \,\}$.

Using this notation, we say that the and-feature $c$ connects to a sample if, and only if, the sample is connected to every feature $j$ with $c_j = 1$.

From this discussion, we realize that enumerating every combination has exponential cost. Moreover, since the network is sparse, we expect that many of the and-features have no connections.
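A minimal sketch of the two facts above: an and-feature connects only to the samples holding all of its features, and the count of possible and-features grows exponentially with the number of features. Function names are illustrative.

```python
# An and-feature for a set C of feature indices connects to sample i
# iff sample i is connected to every feature in C.

def and_feature_links(X, C):
    return [i for i, row in enumerate(X) if all(row[j] == 1 for j in C)]

def count_possible_and_features(D):
    # all 2^D subsets of features, minus the D singletons and the empty set
    return 2 ** D - D - 1

X = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
print(and_feature_links(X, {0, 1}))    # [0, 1]: only these share features 0 and 1
print(count_possible_and_features(3))  # 4: {0,1}, {0,2}, {1,2}, {0,1,2}
```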

II-C Optimization problem definition

The problem of enhancing the network can be viewed as an optimization problem. Given an input feature–sample network with $N$ samples and $D$ features, we obtain the enhanced network from the original by adding a selected set of and-features. The number of features of the enhanced network excludes the and-features that do not connect to any sample.

Let $L$ be the maximum allowed number of generated features; we want to maximize the number of connected and-features, subject to the number of selected and-features not exceeding $L$.

The disadvantage of this approach is that the and-features might not be well distributed. Thus, while some samples may gain few new features, others may gain many. To overcome this limitation, we introduce the disproportion between the network and its enhanced version.

The disproportion is zero if the number of new connections in each sample is proportional to its initial sparsity. In this way, while the sparsity of each sample might change, we keep the same shape of the degree distribution of the samples.

Let $d_i$ be the degree of sample $i$ in the original network and $d'_i$ its degree in the enhanced one. The disproportion between the two networks is

$\sigma\!\left( \frac{d'_1}{d_1}, \dots, \frac{d'_N}{d_N} \right),$

where $\sigma$ is the standard deviation of its arguments.
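The sketch below assumes the disproportion is the standard deviation of the per-sample ratios between enhanced and original degrees, which matches the property stated above: it is zero exactly when new connections are proportional to each sample's initial degree.

```python
from statistics import pstdev

def disproportion(deg_original, deg_enhanced):
    """Standard deviation of per-sample degree ratios (assumed formulation)."""
    ratios = [d_new / d_old for d_old, d_new in zip(deg_original, deg_enhanced)]
    return pstdev(ratios)

# Doubling every degree keeps the shape of the degree distribution.
print(disproportion([2, 4, 6], [4, 8, 12]))      # 0.0
# Uneven growth among samples yields a positive disproportion.
print(disproportion([2, 4, 6], [6, 4, 6]) > 0)   # True
```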

Using the disproportion in our optimization problem, the goal is to maximize the number of connected and-features and minimize the disproportion, subject to the number of selected and-features not exceeding $L$.

III Methods

In this section, we study the multi-objective problem stated in the previous section and describe how we solve it.

III-A Problem study

In Section II-B, we see that the number of possible and-features scales exponentially with the number of features. As a consequence, storing every possible and-feature is not feasible.

Furthermore, the number of candidate solutions is also exponential in the number of possible new features: there are at most $2^M$ solutions to explore. We cannot limit the size of a candidate set by $L$, since many and-features may have no connections and thus do not count toward the limit.

The three common approaches for solving multi-objective optimization problems are the weighted formula, the lexicographic approach, and Pareto methods [11]. The first strategy transforms the problem into a single-objective one, usually by weighting each objective and adding them up. The lexicographic approach assigns a priority to each objective and then optimizes the objectives in that order. Hence, when comparing two candidate solutions, the highest-priority objective is compared first and, if equivalent, the second objective is compared. If the second objective is also equivalent between the solutions, the third one is used, and so on. Both the weighted-formula and the lexicographic strategies return only one solution to the problem. Pareto methods use different mathematical tools to evaluate the candidate solutions, finding a set of non-dominated solutions. A solution is said to be non-dominated if it is not worse than any other solution concerning all the criteria [11].
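The dominance test underlying Pareto methods can be sketched for our two objectives as follows; the (count, disproportion) tuples are illustrative candidate solutions, not values from the paper.

```python
# Pareto dominance for our two objectives: maximize the and-feature count,
# minimize the disproportion.

def dominates(a, b):
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def non_dominated(solutions):
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t is not s)]

solutions = [(100, 0.2), (100, 0.1), (80, 0.05), (90, 0.3)]
print(non_dominated(solutions))  # [(100, 0.1), (80, 0.05)]
```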

Since the difference in scale between our objective functions is not trivial, we opted to use only a lexicographic and a Pareto method.

We design two population-based optimization algorithms. Specifically, we consider the use of two metaheuristics:

  • a lexicographic genetic algorithm (GA); and

  • the improved strength Pareto evolutionary algorithm (SPEA2) [12].

Although the methods are different, both approaches share many properties – individual representation, population initialization, and the mutation, recombination, and selection operators – which are explained in Section III-D. The main difference between them is the fitness evaluation.

In the GA, the individuals are ordered lexicographically, that is, primarily by the first objective function and, in case of a tie, by the second. SPEA2, however, considers not only the Pareto front but also the density of the solutions.

III-B Lexicographic genetic algorithm review

A GA has the following steps [13]:

1: P ← InitialPopulation()
2: while StopCondition() = false do
3:     Evaluate(P)
4:     P′ ← Elitism(P)
5:     while |P′| < |P| do
6:         parents ← Select(P)
7:         children ← Recombine(parents)
8:         Mutate(children)
9:         P′ ← P′ ∪ children
10:     end while
11:     P ← P′
12: end while

In words, a random population of candidate solutions is generated as the first step. Then, while the stop condition is not met, the next generation of individuals comprises the best individuals of the previous generation and the individuals originated by recombining and mutating parents selected from the previous generation.

In the lexicographic approach, the only difference is in the evaluation of the candidate solutions [14], where the best individuals are decided lexicographically.

In our specific problem, a solution A is better than a solution B if

  • A has more connected and-features than B; or

  • A and B have the same number of connected and-features and A has lower disproportion.
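The comparison above can be sketched as a small comparator; the (count, disproportion) pairs are illustrative stand-ins for evaluated individuals.

```python
# Lexicographic comparison used by the GA: the connected and-feature count
# has priority; disproportion only breaks ties.

def lexicographic_better(a, b):
    count_a, disp_a = a
    count_b, disp_b = b
    if count_a != count_b:
        return count_a > count_b   # primary objective: more and-features
    return disp_a < disp_b         # tie-break: lower disproportion

print(lexicographic_better((120, 0.5), (100, 0.0)))  # True: count decides
print(lexicographic_better((100, 0.1), (100, 0.2)))  # True: tie broken by disproportion
```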

III-C Improved Strength Pareto Evolutionary Algorithm review

SPEA2 works similarly to a GA. The major difference is that it keeps an archive with the candidate solutions for the Pareto set. If the number of non-dominated solutions is greater than the limit of the archive, some solutions are discarded. Such an operation is called truncation. The truncation operator tries to maintain the candidate solutions uniformly distributed along the Pareto front [12].

As indicated by the authors, we select individuals by binary tournament. Also, given the archive size, we fix the density-estimation parameter of the truncation operator as recommended in [12].

III-D Metaheuristic design

The implementation characteristics common to both metaheuristics are described in the following items.

III-D1 Individual representation

In our problem, each solution is a set of zero or more and-features. If we enumerate every possible and-feature, the solution can also be viewed as a binary vector with entries 1 for the present and-features.

III-D2 Population initialization

Given a size limit, we sample, without replacement, random and-features to compose each candidate solution. And-features are sampled so that the probability of drawing one depends on its order.

III-D3 Recombination operator

We use the uniform crossover operator with two parents generating two children. In Section III-D6, we show how to implement it efficiently using our set representation.

III-D4 Selection operator

The binary tournament method is chosen to select the parents that go to the recombination step.

III-D5 Mutation operator

We formulated the following specific mutation operator to exploit the characteristics of our problem.

Given a candidate solution, we apply random changes to the individual. For each change, there is an equal probability of either

  • trying to add a new and-feature;

  • trying to remove an and-feature; or

  • trying to modify an and-feature.

When trying to add a new feature, an and-feature is sampled and added to the solution. Note that the candidate solution may not change if the and-feature was already present.

If trying to remove a feature, each and-feature in the solution has the same probability of being selected and removed. Note that, with some probability, the individual does not change.

Finally, in the last case, an and-feature is selected uniformly to be modified, and a modified and-feature is produced. Two cases may happen: a) with some chance, an index is selected uniformly and set in the combination; b) otherwise, two indexes are chosen uniformly and their values are swapped. The first case includes one more term in the and-feature if the index was not already set. The second one swaps two elements of the parameter vector and, effectively, takes effect only when their values differ. The candidate solution is then updated by replacing the original and-feature with the modified one. The size of the solution is never increased, and the candidate solution is preserved if the modification has no effect.
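The three-way mutation above can be sketched as follows. The and-feature representation (a frozenset of feature indices) and the helper `sample_and_feature` are illustrative stand-ins, not the paper's notation.

```python
import random

def mutate(solution, num_features, sample_and_feature):
    """Apply one random change: add, remove, or modify an and-feature."""
    choice = random.randrange(3)
    if choice == 0:                               # try to add a new one
        return solution | {sample_and_feature()}
    if not solution:                              # nothing to remove or modify
        return solution
    target = random.choice(list(solution))
    if choice == 1:                               # try to remove
        return solution - {target}
    feat = set(target)                            # try to modify
    if random.random() < 0.5:
        feat.add(random.randrange(num_features))  # may include one more term
    else:                                         # swap two index slots
        i, j = random.sample(range(num_features), 2)
        has_i, has_j = i in feat, j in feat
        feat.discard(i)
        feat.discard(j)
        if has_i:
            feat.add(j)
        if has_j:
            feat.add(i)
    return (solution - {target}) | {frozenset(feat)}

random.seed(1)
sol = {frozenset({0, 1}), frozenset({2, 3, 4})}
mutated = mutate(sol, 5, lambda: frozenset({1, 4}))
print(all(len(f) >= 2 for f in mutated))  # True: orders stay >= 2
```

Note that every branch preserves the invariant that each and-feature combines at least two features.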

III-D6 Performance considerations

Although we can view both solutions and and-features as binary vectors, the set representation is more practical because of the high space complexity of the problem. Moreover, there is no need to store entries for and-features that lack connections. Instead of just ignoring them, we exploit the evaluation step to determine which and-features are useless and remove them from the set.

Also, using the set representation, the crossover of two candidate solutions A and B can be implemented efficiently with the steps

1: C1 ← A ∩ B; C2 ← A ∩ B
2: for each and-feature f ∈ A △ B do
3:     if SampleUniform(0, 1) < 1/2 then
4:         C1 ← C1 ∪ {f}
5:     else
6:         C2 ← C2 ∪ {f}
7:     end if
8: end for

where △ stands for the symmetric difference operator.
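A minimal Python sketch of this set-based uniform crossover, with illustrative frozenset and-features:

```python
import random

# Shared and-features go to both children; each and-feature in the
# symmetric difference goes to one child or the other with equal chance.

def uniform_crossover(a, b):
    shared = a & b
    child1, child2 = set(shared), set(shared)
    for feature in a ^ b:          # ^ is the symmetric difference
        if random.random() < 0.5:
            child1.add(feature)
        else:
            child2.add(feature)
    return child1, child2

a = {frozenset({0, 1}), frozenset({1, 2})}
b = {frozenset({1, 2}), frozenset({3, 4})}
c1, c2 = uniform_crossover(a, b)
print(frozenset({1, 2}) in c1 and frozenset({1, 2}) in c2)  # True: shared gene
print((c1 | c2) == (a | b))  # True: every parent gene lands in some child
```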

Finally, to conform to the requirements in Section III-D2, one can sample each candidate solution efficiently with the steps

1: S ← {SampleAndFeature()}
2: loop
3:     f ← SampleAndFeature()
4:     S ← S ∪ {f}
5:     if SampleUniform(0, 1) < p then
6:         break
7:     end if
8: end loop

where p controls the expected solution size and and-features are sampled without replacement.

IV Experimental results

In this section, we present applications of our feature learning technique in the two major categories in machine learning: supervised and unsupervised learning.

First, we illustrate the optimization process in the famous Iris dataset. In this example, we also conduct community detection in the original feature–sample network obtained from the dataset and in the enhanced one.

Then, we observe the increase in accuracy obtained by the k-nearest neighbors classifier [1] on four other UCI datasets.

IV-A Enhanced community detection and clustering

The UCI Iris dataset [15] contains 150 samples that describe the sepals and petals of individual Iris flowers. The flowers are either Iris setosa, Iris virginica, or Iris versicolor. In [8], we construct a feature–sample network from this dataset by discretizing the features.

Figure 1: Feature–sample network for Iris dataset. Circles are vertices associated with samples and squares with features. Colors represent the classes.

Figure 1 shows the generated network. Each color represents a different class. Circles represent samples and squares represent features. We use that same network as the input to both algorithms, SPEA2 and the lexicographic GA.

Figure 2: Evolution of the number of and-features and disproportion along the generations of the SPEA2 algorithm with Iris dataset as the input network. Solid lines are the average disproportion and number of and-features of the non-dominated solution at a given generation. Shadows cover the range of the measurements.
Figure 3: Evolution of the number of and-features and disproportion along the generations of the lexicographic GA with Iris dataset as the input network. Solid lines are the average disproportion and number of and-features of the best solutions (elitism) at a given generation. Shadows cover the range of the measurements.

IV-A1 Evolution of the candidate solutions

We execute the optimization process once for each strategy, with a fixed population size. For SPEA2, the archive has a fixed size and, for the lexicographic GA, we keep the best solutions over the generations. The initial-population parameters, the recombination rate, and the number of random changes applied to each generated individual are also fixed.

Figures 2 and 3 describe the obtained results. Both the disproportion and the number of discovered and-features are depicted along the generations. Solid lines are the average result in the population and shadows cover the range – from minimum to maximum – of each measurement. The results include only the non-dominated solutions in SPEA2 and the best solutions in the lexicographic GA.

Using both strategies, we could reach the optimal solution: all 128 and-features with at least one connection and zero disproportion. However, the optimization strategies differ in how they achieve this.

SPEA2 tries to find as many new and-features as possible while keeping the ones with the lowest values of disproportion. When a larger set of and-features with no higher disproportion is discovered, such a solution dominates every other solution found so far. Thus, we can observe “steps” in the evolution of the number of and-features.

In the lexicographic GA, the disproportion is only considered when the number of discovered and-features is the same. As a result, the algorithm greedily produces and-features, disregarding the disproportion, until it cannot find more new features to add. This enables faster convergence in this case, but it may find only solutions with high disproportion when it is unfeasible to reach the maximum number of and-features – which is very common in practice. To mitigate this issue in larger problems, one can set the limit $L$ on the number of discovered features.

Figure 4: Feature–sample network for Iris dataset with all possible 128 and-features (excluding those with no connections). Circles are vertices associated with samples and squares with features. Colors represent the classes.

The optimal enhanced feature–sample network for this dataset is in Figure 4.

IV-A2 Community detection

Applying a greedy community detection method [16] to both networks, we observe that the enhanced network has higher modularity than the input network.

The enhanced network can also improve clustering tasks. Comparing the expected classes with the obtained communities, the enhanced version achieves a higher Jaccard index than the original network.

IV-B Performance enhancement in supervised learning

Dataset N D M
Breast 2010 106 27 134,217,700
Ecoli 336 19 524,268
Glass 214 25 33,554,406
Wine 178 39 549,755,813,848
Table I: UCI datasets along with the number of samples N, features D, and possible and-features M.
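The M column of Table I can be reproduced from D alone, assuming the count of order-2-or-higher combinations from Section II-B; this short check confirms the closed form against the table.

```python
# Check that the and-feature counts M in Table I follow M = 2^D - D - 1
# (all feature subsets minus the D singletons and the empty set).

def possible_and_features(D):
    return 2 ** D - D - 1

table = {"Breast 2010": 27, "Ecoli": 19, "Glass": 25, "Wine": 39}
for name, D in table.items():
    print(name, possible_and_features(D))
# Breast 2010 134217700
# Ecoli 524268
# Glass 33554406
# Wine 549755813848
```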

We also apply our proposal to four classification tasks from UCI [15]. Table I presents the datasets along with the number of samples N, features D, and possible and-features M. We highlight that it is unfeasible to list every possible combination among the features even for small datasets. The networks are generated as shown in [8] by discretizing the features into bins.

The optimization process is executed several times for each strategy – SPEA2 and the lexicographic GA. The population size, the archive size, and the number of elitism solutions are fixed, as are the initial-population parameters, the recombination rate, and the number of mutations performed on each candidate solution. We limit the number of and-features to a maximum $L$. The execution is stopped after the 1000th generation.

Table II: Number of and-features and disproportion obtained by the optimization process for each dataset under both strategies (LGA and SPEA2). Average and standard deviation are shown for each measurement.

Table II summarizes the number of and-features and disproportion obtained by the optimization process. For the lexicographic GA, we show the average and the standard deviation of the measurements among the best individuals. For SPEA2, only the non-dominated solutions are considered.

As expected from the previous study (Section IV-A), the lexicographic GA achieved a higher count of and-features – the maximum allowed – but worse values of disproportion. The candidate solutions of SPEA2 present wide variation but consistently lower disproportion.

Dataset Lowest disproportion Highest number of and-features
Breast 2010 513, 0.000 (SPEA2) 2700, 0.129 (LGA)
Ecoli 166, 0.011 (SPEA2) 1900, 0.114 (LGA)
Glass 416, 0.019 (SPEA2) 2500, 0.130 (LGA)
Wine 413, 0.015 (SPEA2) 3900, 0.226 (LGA)
Table III: Number of and-features, disproportion, and strategy of the results with lowest disproportion and highest number of and-features.
Dataset Original Lowest disproportion Highest number of and-features
70% labeled
80% labeled
Table IV: Accuracy of the k-NN method on the UCI datasets using three different input networks – original, lowest disproportion, and highest number of and-features. For each setting, the best value of k is shown.

For each of the datasets, we take the candidate solution with the highest count of and-features and the one with the lowest disproportion among all solutions produced. (We ignored a few solutions with fewer than 100 and-features.) These candidate solutions, and the strategy that found them, are indicated in Table III.

We solve the classification problems for each of the datasets using the k-nearest neighbors method. As inputs, we use the interaction matrices of the original network and of the enhanced ones from the selected candidate solutions. We performed split validations with 70% and 80% of labeled samples for each case. We also varied k.

The best results for each configuration are shown in Table IV. Improvements are highlighted in bold. Using the solution with the highest number of and-features, we improved the classification results in all cases. Using fewer and-features but with lower disproportion, we see improvements almost as good as with the higher and-feature count.

V Conclusion

In this paper, we presented an unsupervised feature learning mechanism that works on datasets with binary features. First, the dataset is mapped into a feature–sample network. Then, a multi-objective optimization process selects a set of new vertices that correspond to new features to produce an enhanced version of the network.

We show that the enhanced network contains more information and can be exploited to improve the performance of machine learning methods.

To solve the optimization problem, we opted to design population-based metaheuristics. We used both a lexicographic GA and the SPEA2 algorithm to find the candidate solutions.

From the experiments, we conclude that the GA produces more new features in fewer generations. However, the candidate solutions of SPEA2, besides having fewer new features, also improved the performance of the machine learning methods.

This fact suggests that the disproportion is a good measure of the quality of the selected set of and-features. In future works, we will correlate the improvement and the disproportion of solutions with the same number of features.

Furthermore, the learning techniques used – fast greedy community detection and k-nearest neighbors – do not take full advantage of the new features. In subsequent studies, we will elaborate learning models that exploit the enhanced feature–sample network explicitly.

Acknowledgments

This research was supported by the São Paulo State Research Foundation (FAPESP) and the Brazilian National Research Council (CNPq).

References

  • [1] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer-Verlag New York, 2006.
  • [2] X. Zhu and A. B. Goldberg, “Introduction to Semi-Supervised Learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 3, no. 1, pp. 1–130, 2009.
  • [3] T. M. Cover, “Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 3, pp. 326–334, 1965.
  • [4] T. C. Silva, L. Zhao, and T. H. Cupertino, “Handwritten data clustering using agents competition in networks,” Journal of Mathematical Imaging and Vision, vol. 45, no. 3, pp. 264–276, 2013.
  • [5] T. C. Silva and L. Zhao, Machine Learning in Complex Networks, 1st ed.   Springer, 2016.
  • [6] F. A. N. Verri, P. R. Urio, and L. Zhao, “Network unfolding map by vertex-edge dynamics modeling,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, no. 99, pp. 1–14, 2016.
  • [7] M. E. J. Newman, Networks: An Introduction, 1st ed.   New York, NY: Oxford University Press, 2010.
  • [8] F. A. N. Verri and L. Zhao, “Random walk in feature-sample networks for semi-supervised classification,” in 2016 5th Brazilian Conference on Intelligent Systems (BRACIS), 2016, pp. 235–240.
  • [9] J. Muñoz-Marí, F. Bovolo, L. Gómez-Chova, L. Bruzzone, and G. Camps-Valls, “Semisupervised one-class support vector machines for classification of remote sensing data,” IEEE Trans. Geosci. Remote Sens., vol. 48, no. 8, pp. 3188–3197, 2010.
  • [10] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
  • [11] A. A. Freitas, “A critical review of multi-objective optimization in data mining: A position paper,” SIGKDD Explor. Newsl., vol. 6, no. 2, pp. 77–86, 2004.
  • [12] E. Zitzler, M. Laumanns, and L. Thiele, “SPEA2: Improving the strength pareto evolutionary algorithm for multiobjective optimization,” in Evolutionary Methods for Design Optimization and Control with Applications to Industrial Problems.   Athens, Greece: International Center for Numerical Methods in Engineering, 2001, pp. 95–100.
  • [13] M. Srinivas and L. M. Patnaik, “Genetic algorithms: a survey,” Computer, vol. 27, no. 6, pp. 17–26, 1994.
  • [14] C. A. Coello Coello, D. A. Van Veldhuizen, and G. B. Lamont, Evolutionary Algorithms for Solving Multi-Objective Problems.   Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2002.
  • [15] M. Lichman, “UCI machine learning repository,” 2013. [Online]. Available: http://archive.ics.uci.edu/ml
  • [16] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, pp. 1–6, 2004.