Particle Competition and Cooperation for Semi-Supervised Learning with Label Noise

02/12/2020 ∙ by Fabricio Aparecido Breve, et al. ∙ Universidade de São Paulo unesp Unife 14

Semi-supervised learning methods are usually employed in the classification of data sets where only a small subset of the data items is labeled. In these scenarios, label noise is a crucial issue, since the noise may easily spread to a large portion or even the entire data set, leading to major degradation in classification accuracy. Therefore, the development of new techniques to reduce the harmful effects of label noise in semi-supervised learning is a vital issue. Recently, a graph-based semi-supervised learning approach based on particle competition and cooperation was developed. In this model, particles walk in the graphs constructed from the data sets. Competition takes place among particles representing different class labels, while cooperation occurs among particles with the same label. This paper presents a new particle competition and cooperation algorithm, specifically designed to increase robustness to label noise. Unlike other methods, the proposed one does not require a separate technique to deal with label noise. It performs classification of unlabeled nodes and reclassification of the nodes affected by label noise in a single process. Computer simulations show the classification accuracy of the proposed method when applied to some artificial and real-world data sets, in which we introduce increasing amounts of label noise. The classification accuracy is compared to those achieved by previous particle competition and cooperation algorithms and other representative graph-based semi-supervised learning methods using the same scenarios. Results show the effectiveness of the proposed method.

1 Introduction

Label noise is an important issue in machine learning and, more specifically, in data classification. A classifier usually learns from a set of labeled samples to predict the classes of new samples. However, many real-world data sets contain noise, and learning from them may lead to many potential negative consequences Frenay and Verleysen (2013). Noise may be of two different types: feature noise and class noise Frenay and Verleysen (2013); Zhu and Wu (2004). Feature noise affects the observed values of the data features. For example, sensors may introduce some Gaussian noise during data feature measurement. On the other hand, class noise alters the labels assigned to data instances. For instance, a specialist may mistakenly assign the wrong class to some samples Hickey (1996), especially when the labeling task is subjective, as in medical applications Malossini et al. (2006). In this paper, we focus on class noise, which is potentially the more harmful type of noise Frenay and Verleysen (2013); Zhu and Wu (2004); Sáez et al. (2014).

The reliability of class labels is important in supervised learning algorithms Slonim (1996); Krishnan (1988), but in semi-supervised learning it is a crucial issue. Semi-supervised learning is usually applied to problems where only a small subset of labeled samples is available, together with a large amount of unlabeled samples Zhu (2005); Chapelle et al. (2006); Abney (2008). This is a common situation nowadays, as the size of the data sets being treated is constantly increasing, which makes labeling enough samples for supervised approaches prohibitive. The labeling task is time-consuming and usually requires the work of human experts. Therefore, class noise is a major problem in semi-supervised learning, due to the smaller proportion of labeled data in the whole data set. In these scenarios, errors may easily affect the classification of a large portion or even the entire data set Breve and Zhao (2012), leading to major degradation in classification accuracy, which is the most frequently reported consequence of label noise Frenay and Verleysen (2013). Therefore, it is vital to develop techniques to reduce the harmful effects of label noise in the semi-supervised learning process.

There are three broad approaches to handling label noise in classification Teng (2001); Frenay and Verleysen (2013); Yin and Dong (2011): robust algorithms, filtering, and correction. Robust algorithms are designed to naturally tolerate a certain amount of label noise, so they do not need any special treatment. Filtering means that some label noise cleaning strategy is used to identify and discard noisy labels before the training process. Finally, correction means that the noisy labels are identified, but instead of being eliminated, they are repaired or handled properly. However, it is not always clear whether an approach belongs to one category or another Frenay and Verleysen (2013). Usually, a mixed strategy combining the above-mentioned categories is used to deal with the label noise problem.

Recently, a particle competition and cooperation approach was used to realize graph-based semi-supervised learning Breve et al. (2012). The data set is converted into a graph, where samples are nodes with edges between similar samples. Each labeled node is associated with a labeled particle. Particles walk through the graph and cooperate with identically labeled particles to classify unlabeled samples, while competing against particles with different labels. The main advantages of the particle competition and cooperation method over most other semi-supervised learning algorithms can be summarized as follows: we have proved that it has lower computational complexity Breve et al. (2012) due to its local propagation nature; at the same time, extensive numerical studies show that the method can achieve high classification precision; and it is similar to many natural or biological processes, such as resource competition by animals, territory exploration by humans or animals, election campaigns, etc. For this reason, we believe that the particle competition and cooperation method can in turn be used to model those natural or biological systems. The original competition and cooperation process generates much useful information, which is saved in the dominance level vector of each node. Such information can be used to solve other relevant problems beyond the standard machine learning tasks. For example, it can help with detecting data class overlap, fuzzy classification, and outlier detection by analyzing the distribution of the dominance vectors Breve and Zhao (2013). In this paper, we modify and further improve the original method to treat an important issue in semi-supervised learning: learning with label noise or wrong labels.

Taking the interesting features of the particle competition and cooperation approach into account, further improvements to increase the robustness of the method have been pursued. Some preliminary results were presented in Breve and Zhao (2012). The improved algorithm raised classification accuracy in the presence of label noise. However, some drawbacks have been identified, such as large differences in node degree between labeled and unlabeled nodes and the lack of connection between labeled particles and their corresponding labeled nodes. As a consequence, the particles spend considerably more time on labeled nodes than on unlabeled ones, which demands a higher number of iterations to converge. Moreover, under conditions where the amount of label noise is critical, a team of particles may switch territory with another team. This happens because particles are not strongly attracted to their corresponding nodes and they may be attracted to nodes with label noise which are in the territory of another class. This territory switching phenomenon always involves all particles from two or more classes, and therefore it leads to a major loss of classification accuracy.

In this paper, we further improve the robustness of the particle competition and cooperation method to label noise. We address the problems of the preliminary version by enhancing graph generation and leveling node degrees, thus lowering execution times. The territory switching phenomenon is also addressed by changes in the graph generation, changes in the calculation of the particles’ distance tables, and periodic resets of particles and nodes. These improvements allow the new model to keep the particles closer to their neighborhood, increase the attraction between particles and their corresponding labeled nodes, and bring particles back after a while if territory switching still occurs.

The proposed algorithm falls somewhere near the boundary between the robust algorithm approach and the correction approach mentioned above. It may be seen as a robust algorithm since the original algorithm has some natural tolerance to label noise, although it was not designed to handle this specific problem. In addition, it was improved to dynamically discover and relabel nodes affected by label noise, thus stopping the noise propagation and allowing the algorithm to achieve higher classification accuracy. In this sense, it may also be seen as belonging to the correction approach. It is important to notice that this correction is a built-in feature: labeling unlabeled nodes and fixing label noise run together in a single process.

Computer simulations presented in this paper show the effectiveness and robustness of the improved algorithm in the presence of high amounts of label noise. The classification accuracy achieved by the proposed method is compared with those achieved by all three previous versions and also with those achieved by some other representative graph-based semi-supervised learning methods Zhou et al. (2004); Zhu and Ghahramani (2002); Wang and Zhang (2008). Both artificially generated and real-world data sets were used. Label noise was introduced in these data sets with increasing levels to discover how much label noise each algorithm can handle until the classification accuracy seriously drops.

This paper is organized as follows. An overview of the particle competition and cooperation approach is shown in Section 2. The proposed model is described in Section 3. In Section 4, we present computer simulations. Finally, in Section 5 we draw some conclusions.

2 Particle Competition and Cooperation Overview

In this section, we present an overview of the previous particle competition and cooperation models Breve et al. (2012, 2010); Breve and Zhao (2012). First, the vector-based data set is converted to a non-weighted and undirected graph. Each data instance becomes a graph node. Edges connecting the nodes are created according to the distance between the nodes in the data feature space. This graph generating process is described in Subsection 2.1. Then, a particle is created for each labeled node. Particles with the same label belong to the same team and cooperate among themselves. On the other hand, particles with different labels compete against each other. When the system runs, the particles walk in the graph, selecting the next node to visit according to the rules described in Subsection 2.3. Each node has a set of domination levels, one level for each class of the problem. When a particle visits a node, it increases its class domination level on that node and, at the same time, decreases the domination levels of the other classes. Each particle possesses a strength level, which is lowered or raised according to the domination level of its class in the node being visited. Particles also have a distance table which they update dynamically as they walk on the graph. Node and particle dynamics are described in Subsection 2.4. The stop criterion is described in Subsection 2.5. At the end of the iterative process, each data item is labeled after the class with the highest domination level on it.

2.1 Graph Construction

Consider a vector-based data set with numerical attributes and the corresponding label set. The first points of the data set are labeled, and the remaining points are unlabeled. We define a graph whose set of nodes contains one node for each sample, together with a set of edges between pairs of nodes.

In Breve et al. (2012) and Breve et al. (2010), two nodes are connected if the distance (usually the Euclidean distance) between the corresponding samples is below a given threshold. Since the threshold may be hard to define, another option is to connect two nodes if either one is among the k-nearest neighbors of the other. Otherwise, the two nodes are disconnected. In Breve and Zhao (2012), two nodes are connected if either one is among the k-nearest neighbors of the other, or if they are both labeled instances with the same label. Otherwise, they are disconnected. This last rule was introduced to provide an easy and fast escape path to particles starting in nodes representing label noise samples. However, there is a side effect in this strategy, which will be discussed in Section 3.1.
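The k-nearest neighbors rule with reciprocity can be illustrated with a minimal Python sketch. The function name `knn_graph`, the tuple representation of samples, and the adjacency-set representation of the graph are assumptions made here for illustration and are not taken from the cited implementations.

```python
import math

def knn_graph(points, k):
    """Build an undirected, unweighted k-nearest-neighbors graph.

    Returns a list of neighbor sets, one per sample (hypothetical layout).
    """
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        # Rank all other samples by Euclidean distance to sample i.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in ranked[:k]:
            # "Or vice-versa": the edge exists if either endpoint selects the other.
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

Because edges are added in both directions, a node may end up with more than k neighbors, which is the reciprocity effect mentioned in the text.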

2.2 Particles and Nodes Initialization

For each labeled node in the graph, which corresponds to a labeled data point, there is a particle which has that node as its initial position. This node is called the home node of the particle.

Each particle holds two variables. The first one is the particle strength level, which indicates how much the particle is able to change the levels of the visited node at a given time. The second variable is a distance table, i.e., a vector in which each element corresponds to the distance dynamically measured between the particle’s home node and another node. Each particle is created with its initial strength level set to the maximum. Particles begin their journey knowing only the distance to their corresponding labeled nodes, which is set to zero. Other distances are set to the largest possible value, which is the number of nodes minus one if the graph is a single component.

Each node has a vector variable in which each element corresponds to the domination level of a team (class) over that node. For each node, the sum of the domination levels is always constant. The initial domination levels are set differently for labeled nodes and unlabeled nodes. Labeled nodes begin fully dominated by the corresponding team (class). On the other hand, unlabeled nodes have the domination levels of all teams (classes) set equally. Therefore, for each node, the initial levels of the domination vector are set as follows:

(1)
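As an illustration of this initialization, the following minimal Python sketch assumes that the domination levels of each node are normalized to sum to one, that classes are indexed from 0, and that unlabeled nodes are marked with `None`. These conventions and the function name `initial_domination` are assumptions made here for illustration.

```python
def initial_domination(labels, num_classes):
    """Return one domination vector per node (assumed to sum to 1)."""
    levels = []
    for y in labels:
        if y is None:
            # Unlabeled node: all teams start with the same level.
            levels.append([1.0 / num_classes] * num_classes)
        else:
            # Labeled node: fully dominated by the team of its label.
            levels.append([1.0 if c == y else 0.0 for c in range(num_classes)])
    return levels
```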

2.3 Random-Greedy Walk

Particles walk in the graph trying to dominate as many nodes as possible, while preventing enemy particles from invading their territory. Two rules determine which node a particle will visit next: random walk and greedy walk. In the random walk, a particle randomly chooses any neighbor to visit, regardless of domination levels or distance from its home node. This rule is useful for exploration and acquisition of new nodes. Meanwhile, in the greedy walk, a particle prefers visiting nodes that have already been dominated by its own team and that are closer to its home node. This rule is useful for defending its team’s territory. Particles must exhibit both movements in order to achieve an equilibrium between exploratory and defensive behavior.

Therefore, in the random walk the particle moves to any node with the probabilities defined as:

(2)

Here, only nodes connected by an edge to the current node of the particle can be chosen. In the greedy movement, the particle moves to a neighbor with probabilities defined according to its team’s domination level on that neighbor and the inverse of the distance from that neighbor to its home node, by the following expression:

(3)

where is the index of the current node of particle and , where is the class label of particle .

At each iteration, each particle takes the greedy movement with a given probability and the random movement with the complementary probability. Once the random rule or greedy rule is determined, the neighbor node to be visited is chosen with probabilities defined by Eq. (2) or Eq. (3), respectively.
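The combined effect of the two rules can be sketched as follows. The greedy weight used here, the team's domination level divided by the square of one plus the distance, is an assumed functional form chosen only to be consistent with the verbal description (higher domination and shorter distance are preferred); it is not claimed to be the exact expression of Eq. (3). The function name `move_probabilities` and its arguments are hypothetical.

```python
def move_probabilities(neighbors, domination, distance, team, p_greedy):
    """Probability of visiting each neighbor under the mixed random-greedy walk.

    neighbors  -- list of neighbor node indices of the current node
    domination -- domination[v][team] is the team's level on node v
    distance   -- distance[v] is the particle's table entry for node v
    p_greedy   -- probability of taking the greedy movement
    """
    uniform = 1.0 / len(neighbors)
    # Assumed greedy weight: favors dominated nodes close to the home node.
    weights = [domination[v][team] / (1.0 + distance[v]) ** 2 for v in neighbors]
    total = sum(weights)
    probs = {}
    for v, w in zip(neighbors, weights):
        # Fall back to the uniform choice if the team dominates no neighbor.
        greedy = w / total if total > 0 else uniform
        probs[v] = (1.0 - p_greedy) * uniform + p_greedy * greedy
    return probs
```

A neighbor can then be drawn with `random.choices(list(probs), weights=list(probs.values()))`.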

When a particle visits a node, it updates the domination level on that node, its own strength and its distance table, as we will see later. But, after that, the particle only stays in the visited node until the next iteration if its team (class) domination level on that node is higher than those from all other teams (classes); otherwise, a shock happens and the particle is pushed back to its previous node until the next iteration.

In Breve and Zhao (2012), the equations (2) and (3) were replaced by a single random-greedy equation:

(4)

where is the index of the node currently being visited by particle . This new random-greedy equation balances exploratory and defensive behavior.

2.4 Nodes and Particles Dynamics

As mentioned before, at each iteration, each particle chooses a neighbor node to visit. During this visit, the particle updates the domination levels of the neighbor node as follows:

(5)

where a parameter controls the rate of change of the domination levels. The update consists of the particle increasing the domination level of its own team on the visited node while decreasing the domination levels of the other teams. In Breve et al. (2012), there is an exception: the domination levels of labeled nodes are always fixed, assuming that their respective labels are always reliable.
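A minimal sketch of such an update is given below. It assumes that each rival team loses an equal share of a step proportional to the particle strength, that levels are clipped at zero, and that the visiting team gains exactly what the others lose, so the sum of the levels stays constant. This specific form, the function name `update_domination`, and the parameter name `delta_v` are assumptions made for illustration.

```python
def update_domination(levels, team, strength, delta_v):
    """Return the new domination vector of a node visited by a particle.

    levels   -- current domination levels of the node, one per team
    team     -- index of the visiting particle's team
    strength -- current strength of the visiting particle
    delta_v  -- assumed name for the rate-of-change parameter
    """
    c = len(levels)
    new = list(levels)
    gained = 0.0
    for k in range(c):
        if k != team:
            # Rival teams lose an equal share of the step, never below zero.
            new[k] = max(0.0, levels[k] - delta_v * strength / (c - 1))
            gained += levels[k] - new[k]
    # The visiting team receives exactly what the rivals lost.
    new[team] = levels[team] + gained
    return new
```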

When visiting a neighbor node, a particle gets weaker or stronger according to the domination level of its team in that node, after Eq. (5) is applied. Therefore, at each iteration, the strength of a particle is updated:

(6)

evaluated at the node being visited. In Breve et al. (2010), there is also a parameter to control the amplitude of the particle strength change, so Eq. (6) becomes

(7)

In Breve et al. (2010), there are also accumulated domination levels, stored in a vector of the same size as the domination level vector, which holds the domination level accumulated by each team over each node. At each iteration, for each selected node (in random movement), the accumulated domination level is updated as follows:

(8)

where is the class label of particle .

During the visit the particle also updates its distance table as follows:

(9)

where and are the distances to home node from the previous node and from the visited node, respectively.

Distance calculation is a dynamic process: particles have limited knowledge of the network, i.e., they do not know the connection pattern of the nodes. Therefore, each particle initially assumes that every node can be reached from its home node only with a number of steps as high as the total amount of nodes minus one. Every time a particle chooses a neighbor node to visit, it checks the distance to that node in its distance table. If the distance in the table is higher than the distance of the previous node plus 1, it updates the table. In other words, unknown distances are calculated on the fly and updated as particles naturally find shorter paths while they walk. In Breve and Zhao (2012), all particles from the same team share the same distance table.
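The distance update described above can be sketched in a few lines of Python. The list-based table and the function names are assumptions made here for illustration.

```python
def new_distance_table(num_nodes, home):
    """All distances start at the largest possible value, except the home node."""
    table = [num_nodes - 1] * num_nodes
    table[home] = 0
    return table

def update_distance(table, previous, visited):
    """Shorten the stored distance to `visited` if the path through
    `previous` is shorter than the one currently known."""
    if table[previous] + 1 < table[visited]:
        table[visited] = table[previous] + 1
    return table
```

For a particle with home node 0 in a four-node graph, walking 0, 1, 2 records distances 1 and 2; a later direct move from node 0 to node 2 lowers the entry for node 2 to 1.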

2.5 Stop Criterion

In most scenarios, after a sufficient amount of iterations, most nodes will be locally dominated by a single team. Let us call this the equilibrium state. At this point, most nodes are unlikely to have major changes in their domination levels. However, some special nodes, such as nodes in frontier regions and nodes with label noise together with their closest neighbors, are less stable and more susceptible to changes in their domination levels even after the equilibrium is reached. Therefore, we cannot expect full convergence of all node labels every time. Instead, we monitor the average maximum domination level of the nodes and keep track of the highest level it has achieved. This measure usually increases quickly at the beginning, then slows down and oscillates around the maximum point. When there is no increase in this highest level for a given amount of iterations, the algorithm is stopped. These iterations are needed because those special nodes usually require more iterations than others to provide more reliable labels. In other words, the particle competition and cooperation algorithms spend only a small portion of all the iterations to classify most nodes, and then they spend most of the remaining iterations to classify the few remaining nodes. Using the algorithms proposed in Breve et al. (2012), Breve et al. (2010), and Breve and Zhao (2012), we usually set , where is the network size, is the amount of labeled nodes (particles), and is a constant. In this paper, we use . In our experiments, values lower than that usually lead to lower classification accuracy, while values higher than that usually lead to no improvement in classification, but to higher execution time.
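The monitoring part of this stop criterion can be sketched as below. The class name `StopMonitor` and the argument `patience`, which stands for the number of iterations without improvement, are hypothetical names; how that number is derived from the network size and the amount of particles is not reproduced here.

```python
class StopMonitor:
    """Track the best average maximum domination level seen so far."""

    def __init__(self, patience):
        self.patience = patience      # iterations allowed without improvement
        self.best = float("-inf")     # highest level achieved so far
        self.since_best = 0           # iterations since the last improvement

    def update(self, domination):
        """Feed the current domination vectors; return True when it is time to stop."""
        mean_max = sum(max(levels) for levels in domination) / len(domination)
        if mean_max > self.best:
            self.best = mean_max
            self.since_best = 0
        else:
            self.since_best += 1
        return self.since_best >= self.patience
```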

When the algorithm stops completely, in Breve et al. (2012) and Breve and Zhao (2012), each node is labeled or relabeled (only in Breve and Zhao (2012)) by the class which has the highest level of domination in it:

(10)

On the other hand, Breve et al. (2010) uses the accumulated domination levels to label (or relabel) each node:

(11)

3 The Proposed Model

In this section, we present the features introduced in the particle competition and cooperation approach to minimize the effects of label noise. The new graph construction steps are described in Subsection 3.1. The changes in particle and node initialization are described in Subsection 3.2. The new particle and node dynamics are described in Subsection 3.3. A rule to reset particles and nodes is discussed in Subsection 3.4. Finally, an overview of the proposed algorithm is presented in Subsection 3.5.

3.1 Graph Construction

In Section 2.1 we described how the vector-based data set is converted to a non-weighted undirected graph in Breve et al. (2012), Breve et al. (2010), and Breve and Zhao (2012). Remember that Breve and Zhao (2012) introduced a rule to provide an easy and fast path for particles corresponding to nodes with class noise to escape to their own neighborhood. However, there is a side effect in this strategy. The degrees of labeled nodes increase with the amount of labeled nodes of the same class, while the degrees of unlabeled nodes depend only on the value of k (from the k-nearest neighbors connections) and on reciprocal connections. This may lead to graphs in which labeled nodes have much higher degree than unlabeled ones. In this scenario, particles spend too much time walking only on labeled nodes, which delays the stopping of the algorithm and may also affect its classification accuracy. Therefore, in the proposed method we fix this problem by using a different strategy to connect labeled and unlabeled nodes. Here, each unlabeled node is connected to its k-nearest neighbors, no matter whether these neighbors are labeled or unlabeled (as in Breve et al. (2012)). On the other hand, labeled nodes preferentially connect to the k nearest labeled nodes of the same class. Only if k is larger than the amount of labeled nodes of the same class are the remaining connections made to the nearest neighbors, no matter whether they are unlabeled nodes or labeled nodes from other classes. Of course, the connections are still reciprocal, as the network is undirected. Therefore, labeled nodes will still be connected to their closest neighbors by reciprocity, but their degree will not be much larger than the degree of the unlabeled nodes.
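The connection rules of the proposed method can be sketched as follows. This is a minimal illustration under stated assumptions: samples are tuples, unlabeled samples carry the label `None`, the graph is a list of neighbor sets, and the function name `build_graph` is hypothetical.

```python
import math

def build_graph(points, labels, k):
    """Undirected graph where labeled nodes prefer same-class labeled nodes."""
    n = len(points)
    adj = [set() for _ in range(n)]
    for i in range(n):
        # All other nodes, nearest first.
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        if labels[i] is None:
            # Unlabeled node: plain k-nearest neighbors.
            chosen = ranked[:k]
        else:
            # Labeled node: nearest same-class labeled nodes come first;
            # any remaining slots go to the nearest other nodes.
            same = [j for j in ranked if labels[j] == labels[i]]
            rest = [j for j in ranked if labels[j] != labels[i]]
            chosen = (same + rest)[:k]
        for j in chosen:
            # Connections are reciprocal because the graph is undirected.
            adj[i].add(j)
            adj[j].add(i)
    return adj
```

Each node selects at most k neighbors, so the degree of a labeled node no longer grows with the number of labeled nodes of its class.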

3.2 Particles and Nodes Initialization

Particle and node initialization is similar to the algorithm in Breve et al. (2012), as described in Section 2.2. However, in the proposed method we introduce the overall domination levels, which have the same structure as the domination levels described in Section 2.2, but are intended to keep overall information about the domination of the nodes. These levels are all initially set to the minimum:

(12)

and they are updated when the reset rule is triggered, as it will be explained in Subsection 3.4.

3.3 Nodes and Particles Dynamics

In Section 2.4, we mentioned that in Breve and Zhao (2012) all particles from the same team share the same distance table. This feature makes the particle’s distance to any labeled node of the same class zero, which minimizes the importance of the home node and the particle’s “desire” to go back to it. In other words, this rule allows particles to completely abandon their home nodes when those nodes suffer from label noise. However, their home nodes may be legitimately labeled nodes, in which case abandoning them may lead to the territory switching phenomenon, where teams of particles switch territory with one another, leading to a major loss of classification accuracy. To avoid this problem, we adopt individual distance tables, i.e., each particle has its own table. Particles may still leave their home nodes faster than in Breve et al. (2012) and Breve et al. (2010), due to the changes proposed in the graph construction step, which connect labeled nodes even when they are not so close. But the individual distance table of each particle creates a stronger tie to the home node, minimizing occurrences of the territory switching phenomenon.

3.4 Reset Rule

One of the problems observed in Breve and Zhao (2012) is the territory switching phenomenon, where a team of particles may switch territory with another team. This phenomenon leads to a major loss of classification accuracy, as two or more classes are usually almost entirely misclassified in these scenarios. In this paper, we made some enhancements to minimize this problem, including changes in graph construction and in the particles’ distance tables, which were described in Subsections 3.1 and 3.3, respectively. Another novelty here is the introduction of a reset rule, which is used to reset all nodes’ domination levels and all particles’ positions, strengths, and distance tables from time to time. The reset is triggered when the highest level achieved by the average maximum domination level of the nodes has not increased for a given number of iterations. Here we change the definition of this number of iterations by introducing a new term:

(13)

where the terms are the network size, the amount of labeled nodes (particles), a constant, and the amount of resets that will be performed. Notice that as the amount of resets increases, the number of iterations without improvement required to trigger a reset decreases. Remember that, in Section 2.5, we explained that the amount of iterations needed to reach the equilibrium state (before those iterations without improvement take place) is usually much smaller than the number of iterations without improvement itself. Therefore, the impact of the amount of resets on execution time is negligible, and the execution time of the proposed method is nearly the same as that of the previous versions of the algorithm, given the same data set and parameters. In this paper, we set the constant to the same value used for the previous versions Breve et al. (2012, 2010); Breve and Zhao (2012). Increasing the amount of resets minimizes the territory switching effect, increasing classification accuracy, but it also decreases the number of iterations allowed without improvement, which may lead to lower classification accuracy in individual runs (between each reset). Therefore, the amount of resets must be carefully chosen. In our experiments, the chosen amount of resets provided a good shield against the territory switching effect without affecting the classification accuracy of individual runs, so this value was used in all the experiments in this paper.

Each time the reset rule is triggered, all nodes’ current domination levels are added to the nodes overall domination levels:

(14)

The overall domination levels are increased just before each reset. When the reset rule has been triggered the planned number of times, the algorithm stops completely.

Thus, each node is labeled (or relabeled) by the class which has the highest overall level of domination in it:

(15)
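The accumulation performed at each reset and the final labeling can be sketched as follows. The list-of-lists layout and the function names are assumptions made here for illustration; ties are broken in favor of the lowest class index, which is a choice of this sketch.

```python
def accumulate(overall, current):
    """Add the current domination levels of every node to its overall levels."""
    for i, levels in enumerate(current):
        for c, value in enumerate(levels):
            overall[i][c] += value

def final_labels(overall):
    """Label each node with the class of highest overall domination level."""
    return [max(range(len(levels)), key=levels.__getitem__) for levels in overall]
```

For example, two nodes whose runs end with levels `[0.7, 0.3]` and `[0.2, 0.8]` after every reset are labeled with classes 0 and 1, respectively.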

3.5 The Algorithm

Overall, the proposed algorithm can be outlined as described in Algorithm 1.

Build the graph using the rules described in Subsection 3.1;
Set nodes’ overall domination levels by using Eq. (12);
for each of the resets to be performed do
       Set nodes’ domination levels by using Eq. (1);
       Set particles’ initial position, strength and distance tables by the rules described in Subsection 2.2;
       repeat
              for each particle do
                     Select a neighbor node to visit by using Eq. (4);
                     Update the visited node domination levels by using Eq. (5);
                     Update particle strength by using Eq. (6);
                     Update particle distance tables by using Eq. (9);
       until the reset rule is triggered;
       Update overall domination levels by using Eq. (14);
Label each data item by using Eq. (15)
Algorithm 1 The Particle Competition and Cooperation Algorithm Enhanced to Minimize Effects of Label Noise

4 Computer Simulations

In this section, we present computer simulation results to show the effectiveness and robustness of the proposed method in the presence of label noise. We measure the classification accuracy of the proposed method when applied to artificial and real-world data sets, in which we introduce increasing amounts of label noise. The results of the proposed method, which we called the Label Noise Robust Particle Competition and Cooperation method (LNR-PCC), are compared to those achieved by three other representative graph-based semi-supervised learning methods: Local and Global Consistency (LGC) Zhou et al. (2004), Label Propagation (LP) Zhu and Ghahramani (2002), and Linear Neighborhood Propagation (LNP) Wang and Zhang (2008). We also include the results achieved by three previous versions of the Particle Competition and Cooperation method: PCC-1 Breve et al. (2012), PCC-2 Breve et al. (2010), and PCC-3 Breve and Zhao (2012).

Regarding the parameters used in the algorithms in this experimental study, the following configuration is set. For the LGC and LNP methods, we have fixed , as done in Zhou et al. (2004) and Wang and Zhang (2008), respectively. For PCC-1 and PCC-2 methods, we have fixed , as done in Breve et al. (2010). Finally, is kept fixed in PCC-1, PCC-2, PCC-3, and LNR-PCC, as done in Breve et al. (2012, 2010); Breve and Zhao (2012). The most sensitive parameters, which are of the LGC and the LP methods, and of LNP, PCC-1, PCC-2, PCC-3 and LNR-PCC methods, are all optimized using the genetic algorithm available in the Global Optimization Toolbox of MATLAB, aiming to minimize the classification error.

In all the simulations presented in this paper, we randomly select a subset of elements to be presented to the algorithm with their labels (considered as labeled data instances), while the other elements in the data set are presented to the algorithm without labels (considered as unlabeled data instances). The only exception is the g241c data set, in which we use the labeled subsets shown in Chapelle et al. (2006), instead of random selection. In order to test robustness to label noise, we randomly choose elements from the labeled subset to have their labels changed to any of the other classes, chosen randomly for each sample, thus producing label noise. These label noise subsets are generated with increasing sizes until the labeled subset is no better than a randomly labeled subset. For instance, in a four-class problem with equiprobable classes, one can expect 25% classification accuracy if the samples are labeled randomly. Therefore, there is no point in using a labeled subset in this scenario if the label noise amount is higher than 75%.
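The label noise generation procedure can be illustrated with the following minimal Python sketch. The function name `inject_label_noise`, the `fraction` argument (share of the labeled subset to corrupt), and the use of integer class indices are assumptions made here for illustration; the experiments reported in this paper were not run with this code.

```python
import random

def inject_label_noise(labels, num_classes, fraction, rng):
    """Return a copy of `labels` with a fraction of entries changed
    to a different class chosen uniformly at random."""
    noisy = list(labels)
    q = round(fraction * len(labels))       # size of the label noise subset
    for i in rng.sample(range(len(labels)), q):
        # Pick any class other than the original one.
        others = [c for c in range(num_classes) if c != labels[i]]
        noisy[i] = rng.choice(others)
    return noisy
```

A call such as `inject_label_noise(labels, 3, 0.3, random.Random(0))` corrupts 30% of the labeled subset; passing a seeded generator keeps the noisy subsets reproducible across methods.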

Figure 1 and Table 1 show the classification accuracy comparison when the semi-supervised learning graph-based methods are applied to the Iris Data Set Bache and Lichman (2013), which has elements distributed in classes. data items are randomly chosen to compose the labeled subset. When the label noise subset size is small (), the proposed method is outperformed only by PCC-3. When label noise reaches critical levels (), the proposed algorithm performs better than all the others.

Figure 1: Classification error rate in the Iris data set Bache and Lichman (2013) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.
Q size LGC LP LNP PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.0353 0.0340 0.0498 0.0290 0.0286 0.0279 0.0345
0.05 0.0360 0.0364 0.0522 0.0346 0.0314 0.0273 0.0355
0.10 0.0495 0.0562 0.0555 0.0488 0.0431 0.0357 0.0427
0.15 0.0627 0.0667 0.0738 0.0629 0.0551 0.0386 0.0481
0.20 0.0655 0.0762 0.0862 0.0634 0.0543 0.0420 0.0499
0.25 0.0769 0.0807 0.1156 0.0770 0.0641 0.0427 0.0523
0.30 0.0922 0.0920 0.0887 0.0926 0.0784 0.0520 0.0603
0.35 0.1202 0.1264 0.1080 0.0954 0.0845 0.0618 0.0705
0.40 0.1565 0.1675 0.1769 0.1184 0.1114 0.0775 0.0902
0.45 0.2056 0.2078 0.1880 0.1415 0.1333 0.1004 0.0989
0.50 0.2629 0.2749 0.2211 0.2042 0.1982 0.1575 0.1512
0.55 0.3393 0.3515 0.2824 0.2851 0.2862 0.2593 0.2549
0.60 0.3953 0.4213 0.3400 0.4180 0.4198 0.4169 0.4008
0.65 0.4955 0.5171 0.4133 0.5110 0.5162 0.5419 0.5271
Table 1: Classification error rate in the Iris data set Bache and Lichman (2013) with different label noise subset () sizes. Each value is the average of executions with different and subsets.

Figure 2 and Table 2 show the classification accuracy comparison when the methods are applied to the Wine Data Set Bache and Lichman (2013), which has samples distributed in classes. samples are randomly chosen to compose the labeled subset. When all the samples in the labeled subset are correctly labeled, the proposed algorithm already performs better than all the others. As the mislabeled subset increases, this difference grows because the classification error rates of the other algorithms increase more quickly than that of the proposed method. Interestingly, the proposed algorithm seems not to be affected by up to of label noise. The classification accuracy begins to drop from label noise and beyond. However, this is the region where the proposed method increases its advantage over the others. When of the labeled subset is affected by label noise, the proposed method made only around one third of the classification mistakes made by the LGC, LP, and LNP methods.

Figure 2: Classification error rate in the Wine data set Bache and Lichman (2013) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.
Q size LGC LP LNP PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.0562 0.0459 0.0699 0.0288 0.0289 0.0317 0.0274
0.05 0.0659 0.0490 0.0919 0.0346 0.0331 0.0343 0.0296
0.10 0.0630 0.0555 0.0968 0.0346 0.0326 0.0308 0.0284
0.15 0.0739 0.0697 0.1067 0.0444 0.0426 0.0372 0.0329
0.20 0.0838 0.0717 0.1168 0.0502 0.0464 0.0385 0.0351
0.25 0.1114 0.0975 0.1738 0.0525 0.0490 0.0383 0.0345
0.30 0.1084 0.1001 0.1754 0.0545 0.0506 0.0404 0.0360
0.35 0.1672 0.1522 0.2091 0.0665 0.0629 0.0518 0.0448
0.40 0.2046 0.2054 0.2257 0.0789 0.0763 0.0546 0.0495
0.45 0.2167 0.2171 0.2251 0.0876 0.0845 0.0641 0.0567
0.50 0.3178 0.3152 0.2983 0.1300 0.1301 0.1156 0.1017
0.55 0.3883 0.3942 0.3362 0.2682 0.2735 0.2568 0.2415
0.60 0.4690 0.4945 0.3745 0.4091 0.4215 0.4512 0.3843
0.65 0.4984 0.5445 0.4458 0.5294 0.5275 0.5628 0.5117
Table 2: Classification error rate in the Wine data set Bache and Lichman (2013) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.

Figure 3 and Table 3 show the classification accuracy comparison when the semi-supervised learning graph-based methods are applied to an artificial data set with elements equally divided into 4 normally distributed classes (Gaussian distribution). This data set was generated with the function gauss from PRTools Duin et al. (2007). samples are randomly chosen to build the labeled subset. When there is no label noise, all the methods have similar classification accuracy, except LNP. As the label noise subset increases, the classification error rates begin to rise slowly. The PCC methods show their advantage by keeping nearly the same classification accuracy from to label noise, while the classification accuracy of the other methods drops earlier (). In the range from to label noise, the proposed method outperformed all the others.

Figure 3: Classification error rate in the data set with 4 normally distributed classes with different label noise subset sizes. Each data point is the average of executions with different and subsets.
Q size LGC LP LNP PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.0457 0.0419 0.0503 0.0447 0.0442 0.0460 0.0453
0.05 0.0465 0.0435 0.0539 0.0471 0.0462 0.0457 0.0460
0.10 0.0444 0.0427 0.0511 0.0466 0.0454 0.0438 0.0439
0.15 0.0492 0.0484 0.0629 0.0490 0.0467 0.0462 0.0445
0.20 0.0463 0.0444 0.0559 0.0475 0.0448 0.0449 0.0433
0.25 0.0557 0.0544 0.0799 0.0505 0.0473 0.0450 0.0479
0.30 0.0597 0.0683 0.0768 0.0541 0.0494 0.0452 0.0462
0.35 0.0751 0.0751 0.1109 0.0542 0.0495 0.0483 0.0464
0.40 0.1346 0.1387 0.1631 0.0692 0.0627 0.0544 0.0499
0.45 0.1446 0.1622 0.1875 0.0575 0.0538 0.0591 0.0480
0.50 0.1942 0.2061 0.2311 0.0844 0.0813 0.0982 0.0745
0.55 0.2936 0.2978 0.3133 0.1840 0.1833 0.1781 0.1853
0.60 0.4193 0.4603 0.4263 0.2224 0.2215 0.2958 0.2296
0.65 0.5586 0.5551 0.5365 0.4406 0.4377 0.5051 0.4548
0.70 0.5973 0.6229 0.6126 0.5912 0.5905 0.6236 0.6240
0.75 0.7112 0.6999 0.6981 0.7241 0.7233 0.7330 0.7147
Table 3: Classification error rate in the data set with 4 normally distributed classes with different label noise subset sizes. Each data point is the average of executions with different and subsets.

Figure 4 and Table 4 show the classification accuracy comparison when the learning methods are applied to the g241c data set Chapelle et al. (2006). The g241c data set is composed of samples divided into classes. There are different labeled subsets, each of them containing samples, as provided in Chapelle et al. (2006). From the analysis of Figure 4, we see that the proposed method is better than all the others in the presence of to label noise.

Figure 4: Classification error rate in the g241c data set Chapelle et al. (2006) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.
Q size LGC LP LNP PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.4364 0.3362 0.4451 0.2372 0.2478 0.3362 0.2378
0.05 0.4455 0.3468 0.4615 0.2230 0.2187 0.3208 0.2028
0.10 0.4510 0.3810 0.4636 0.2482 0.2637 0.3560 0.2465
0.15 0.4470 0.3870 0.4476 0.2357 0.2471 0.3413 0.2213
0.20 0.4714 0.3913 0.4782 0.2299 0.2257 0.3540 0.1968
0.25 0.4612 0.4075 0.4621 0.2320 0.2350 0.3585 0.2031
0.30 0.4714 0.4276 0.4701 0.2313 0.2247 0.3586 0.2077
0.35 0.4649 0.4371 0.4716 0.2369 0.2372 0.3583 0.1913
0.40 0.4864 0.4586 0.4708 0.3196 0.3368 0.4260 0.2743
0.45 0.4800 0.4786 0.4901 0.3868 0.4122 0.4547 0.3656
0.50 0.4848 0.4734 0.4911 0.3878 0.4060 0.4603 0.4057
Table 4: Classification error rate in the g241c data set Chapelle et al. (2006) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.

Figure 5 and Table 5 show the classification accuracy comparison when the learning methods are applied to the Semeion Handwritten Digit data set [1, 2], which has samples distributed in classes. samples are randomly chosen to compose the labeled subset. Only LGC is better than the proposed method when label noise affects or less of the labeled data. However, when to of the labeled subset is affected by label noise, the proposed method is better than all the others.

Figure 5: Classification error rate in the Semeion Handwritten Digit data set [1, 2] with different label noise subset () sizes. Each data point is the average of executions with different and subsets.
Q size LGC LP LNP PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.1238 0.2050 0.3307 0.1475 0.1459 0.1492 0.1414
0.05 0.1408 0.2443 0.3449 0.1658 0.1594 0.1596 0.1511
0.10 0.1582 0.2683 0.3722 0.1836 0.1763 0.1704 0.1628
0.15 0.1788 0.2912 0.3692 0.1994 0.1883 0.1786 0.1670
0.20 0.1945 0.3176 0.3747 0.2197 0.2080 0.1951 0.1809
0.25 0.2099 0.3354 0.3897 0.2236 0.2119 0.1944 0.1900
0.30 0.2381 0.3763 0.4185 0.2365 0.2245 0.2116 0.2003
0.35 0.2712 0.3996 0.4377 0.2343 0.2243 0.2183 0.2062
0.40 0.2910 0.4189 0.4623 0.2629 0.2486 0.2359 0.2241
0.45 0.3385 0.4488 0.4605 0.2930 0.2813 0.2641 0.2509
0.50 0.3654 0.4812 0.4946 0.2992 0.2831 0.2650 0.2502
0.55 0.4424 0.5509 0.5344 0.3610 0.3460 0.3211 0.3032
0.60 0.4846 0.5808 0.5747 0.4178 0.4028 0.3803 0.3639
0.65 0.5553 0.6328 0.6275 0.4609 0.4425 0.4108 0.3982
0.70 0.5934 0.6721 0.6661 0.5375 0.5240 0.4977 0.4849
0.75 0.6888 0.7393 0.7065 0.6472 0.6356 0.6258 0.6239
0.80 0.7473 0.7749 0.7415 0.7434 0.7343 0.7282 0.7151
0.85 0.8062 0.8262 0.7892 0.8144 0.8109 0.8053 0.7931
0.90 0.8748 0.8813 0.8343 0.8836 0.8803 0.8857 0.8810
Table 5: Classification error rate in the Semeion Handwritten Digit data set [1, 2] with different label noise subset () sizes. Each data point is the average of executions with different and subsets.

Finally, Figure 6 and Table 6 show the classification accuracy comparison when the learning methods are applied to the Optical Recognition of Handwritten Digits data set Bache and Lichman (2013), which has samples distributed in classes. samples are randomly chosen to compose the labeled subset. LGC, LP, and LNP methods were not applied to this data set due to the prohibitive execution time they would take. The proposed method is better than all previous versions in the presence of to label noise.

Figure 6: Classification error rate in the Optical Recognition of Handwritten Digits data set Bache and Lichman (2013) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.
Q size PCC-1 PCC-2 PCC-3 LNR-PCC
0.00 0.0292 0.0320 0.0295 0.0308
0.05 0.0445 0.0378 0.0349 0.0384
0.10 0.0591 0.0427 0.0419 0.0438
0.15 0.0725 0.0505 0.0478 0.0490
0.20 0.0838 0.0577 0.0544 0.0548
0.25 0.0931 0.0585 0.0611 0.0616
0.30 0.1038 0.0646 0.0672 0.0706
0.35 0.1105 0.0663 0.0753 0.0722
0.40 0.1211 0.0722 0.0840 0.0802
0.45 0.1347 0.0770 0.0883 0.0775
0.50 0.1379 0.0744 0.0999 0.0805
0.55 0.1573 0.0858 0.1029 0.0750
0.60 0.1716 0.0992 0.1157 0.0825
0.65 0.2158 0.1369 0.1425 0.0922
0.70 0.2823 0.1979 0.1662 0.1074
0.75 0.3598 0.2878 0.2135 0.1512
0.80 0.5136 0.4603 0.3383 0.3509
0.85 0.7434 0.7227 0.7000 0.6808
0.90 0.8810 0.8775 0.8604 0.8541
Table 6: Classification error rate in the Optical Recognition of Handwritten Digits data set Bache and Lichman (2013) with different label noise subset () sizes. Each data point is the average of executions with different and subsets.

Each data point (each size of the label noise subset ) in the curves of Figures 1 to 6 is the average value from to executions (depending on the data set) with different elements in both the labeled subset and the label noise subset . Notice that the parameter optimization procedure is executed for each of these executions. Thus, each average value obtained by LGC, LP, and LNP is actually the average of the to best values obtained from the corresponding optimization process. On the other hand, for the PCC methods, the best value in each optimization process is discarded. Instead, the optimized parameters are used in another executions, since those are stochastic algorithms and the best classification accuracy obtained during the optimization process might be too optimistic. Therefore, for the PCC methods, each data point in the figure curves is actually the average value of to executions ( to configurations of elements in subsets and , and repetitions on each specific configuration).
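The reason for discarding the best value found during optimization can be illustrated with a small Python sketch. It rests on several assumptions: `toy_stochastic_error` is a hypothetical stand-in for one run of a stochastic classifier, a plain search over a candidate list replaces the MATLAB genetic algorithm used in the paper, and the number of repetitions is an arbitrary illustrative value. The sketch selects the best parameter, throws away the (possibly lucky) error observed during the search, and reports the mean error over fresh runs with that parameter.

```python
import random
import statistics


def toy_stochastic_error(param, rng):
    """Hypothetical stand-in for one stochastic classifier run; returns an error rate."""
    return abs(param - 0.3) + rng.uniform(0.0, 0.05)


def optimize_param(candidates, run, rng):
    """Pick the candidate with the lowest observed error.

    A simple exhaustive search, used here in place of a genetic algorithm.
    """
    scored = [(run(p, rng), p) for p in candidates]
    best_error, best_param = min(scored)
    return best_param, best_error


def evaluate_stochastic(candidates, run, n_repeats, rng):
    """Optimize, discard the optimistic best error, then average fresh runs."""
    best_param, _optimistic_error = optimize_param(candidates, run, rng)
    errors = [run(best_param, rng) for _ in range(n_repeats)]
    return best_param, statistics.mean(errors)


rng = random.Random(42)
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
best_param, mean_error = evaluate_stochastic(
    candidates, toy_stochastic_error, n_repeats=20, rng=rng
)
```

For a deterministic method the error seen during the search can be reported directly, which is why only the stochastic methods need the extra repetitions.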

The aforementioned parameters and from LNR-PCC were inherited from its previous versions Breve et al. (2012, 2010); Breve and Zhao (2012), in which they were extensively studied. LNR-PCC also introduces the new parameter . In this paper, we fixed , as explained in Section 3.4. Two different scenarios from the experiments above were selected to show how the parameter affects the classification accuracy. Figures 7 and 8 show the classification error rate and standard deviation when the LNR-PCC method is applied to the g241c and the Semeion Handwritten Digit data sets, respectively. Notice that is equivalent to not applying the reset rule. In the first scenario (Figure 7), it is clear that the parameter is important to decrease classification error and that it has an optimal value, beyond which the classification error starts to increase again. In the second scenario (Figure 8), there is a larger range of optimal values, but it is still clear that the reset rule is important because produces the worst result.

Figure 7: Classification error rate and standard deviation with LNR-PCC using different values for parameter, applied to the g241c data set Chapelle et al. (2006) with labeled nodes, of which are incorrectly labeled. Each data point is the average of executions, on each of the labeled subsets defined by Chapelle et al. (2006).
Figure 8: Classification error rate and standard deviation with LNR-PCC using different values for parameter, applied to the Semeion Handwritten Digit data set [1, 2] with labeled nodes, half of which are incorrectly labeled. Each data point is the average of executions, on each of randomly selected labeled subsets.

5 Conclusions

In this paper we have proposed a new particle competition and cooperation method for semi-supervised classification in the presence of label noise. Particles walk through the graph generated from the data set. Each particle cooperates with other particles of the same label and competes against particles of different labels to classify unlabeled samples. The new algorithm is specifically designed to address the problem of label noise by discovering and re-labeling mislabeled nodes, employing novel graph construction rules and new particle dynamics. These built-in features lead to an increased robustness to label noise, preventing noise propagation to a large extent and, therefore, achieving better classification accuracy. Unlike other methods, which require separate steps for filtering label noise and classifying unlabeled nodes, the proposed method performs the classification of unlabeled data and the reclassification of labeled data together in a single process.

The improvements over the original particle competition and cooperation approach were developed mostly to address the phenomenon that we call territory switching, where two or more teams of particles almost completely move to territories that belong to another class, leaving their own territory to be taken by enemies as well. The new graph construction steps reduce the connectivity of the network, thus preventing particles from taking long trips and keeping them around their home nodes. The individual distance tables also keep particles closer to their home nodes, as they slightly increase the nominal distance of teammates' home nodes, an effect that is also enhanced by the new graph construction steps. Finally, the reset rule periodically brings particles back to their home nodes, so that even if territory switching occasionally occurs, it will not ruin the classification.

Computer simulations were performed using some artificial and real-world data sets with increasing amounts of label noise. The experimental results indicate that the proposed model is robust to the presence of label noise. In comparison with other representative graph-based semi-supervised methods, including previous particle competition and cooperation models, the proposed method presents better classification accuracy in most of the analyzed scenarios.

Acknowledgment

The authors would like to thank the São Paulo State Research Foundation (FAPESP) and the National Council of Technological and Scientific Development (CNPq) for the financial support.

References

  • [1] Note: Semeion Research Center of Sciences of Communication, via Sersale 117, 00128 Rome, Italy Cited by: Figure 5, Figure 8, Table 5, §4.
  • [2] Note: Tattile Via Gaetano Donizetti, 1-3-5,25030 Mairano (Brescia), Italy Cited by: Figure 5, Figure 8, Table 5, §4.
  • S. Abney (2008) Semisupervised learning for computational linguistics. CRC Press. External Links: ISBN 1584885599, 9781584885597 Cited by: §1.
  • K. Bache and M. Lichman (2013) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: Figure 1, Figure 2, Figure 6, Table 1, Table 2, Table 6, §4, §4, §4.
  • F. A. Breve, L. Zhao, and M. G. Quiles (2010) Semi-supervised learning from imperfect data through particle cooperation and competition. In Neural Networks (IJCNN), The 2010 International Joint Conference on, pp. 1–8. External Links: Document, ISSN 1098-7576 Cited by: §2.1, §2.4, §2.4, §2.5, §2.5, §2, §3.1, §3.3, §3.4, §4, §4, §4.
  • F. Breve, L. Zhao, M. Quiles, W. Pedrycz, and J. Liu (2012) Particle competition and cooperation in networks for semi-supervised learning. Knowledge and Data Engineering, IEEE Transactions on 24 (9), pp. 1686–1698. External Links: Document, ISSN 1041-4347 Cited by: §1, §2.1, §2.4, §2.5, §2.5, §2, §3.1, §3.2, §3.3, §3.4, §4, §4, §4.
  • F. Breve and L. Zhao (2012) Particle competition and cooperation to prevent error propagation from mislabeled data in semi-supervised learning. In Neural Networks (SBRN), 2012 Brazilian Symposium on, pp. 79–84. External Links: Document, ISSN 1522-4899 Cited by: §1, §1, §2.1, §2.3, §2.4, §2.5, §2.5, §2, §3.1, §3.3, §3.4, §4, §4, §4.
  • F. Breve and L. Zhao (2013) Fuzzy community structure detection by particle competition and cooperation. Soft Computing 17 (4), pp. 659–673 (English). External Links: ISSN 1432-7643, Document, Link Cited by: §1.
  • O. Chapelle, B. Schölkopf, and A. Zien (Eds.) (2006) Semi-Supervised Learning. Adaptive Computation and Machine Learning, The MIT Press, Cambridge, MA. Cited by: §1, Figure 4, Figure 7, Table 4, §4, §4.
  • R.P.W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D.M.J. Tax, and S. Verzakov (2007) PRTools4.1, a matlab toolbox for pattern recognition. Delft University of Technology. External Links: Link Cited by: §4.
  • B. Frenay and M. Verleysen (2013) Cited by: §1, §1, §1.
  • R. J. Hickey (1996) Noise modelling and evaluating learning from examples. Artificial Intelligence 82 (1–2), pp. 157–179. External Links: ISSN 0004-3702, Document, Link Cited by: §1.
  • T. Krishnan (1988) Efficiency of learning with imperfect supervision. Pattern Recogn. 21 (2), pp. 183–188. External Links: ISSN 0031-3203, Document Cited by: §1.
  • A. Malossini, E. Blanzieri, and R. T. Ng (2006) Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics 22 (17), pp. 2114–2121. External Links: Document, Link, http://bioinformatics.oxfordjournals.org/content/22/17/2114.full.pdf+html Cited by: §1.
  • José A. Sáez, M. Galar, J. Luengo, and F. Herrera (2014) Analyzing the presence of noise in multi-class problems: alleviating its influence with the one-vs-one decomposition. Knowledge and Information Systems 38 (1), pp. 179–206 (English). External Links: ISSN 0219-1377, Document, Link Cited by: §1.
  • D. K. Slonim (1996) Learning from imperfect data in theory and practice. Technical report Massachusetts Institute of Technology, Cambridge, MA, USA. Cited by: §1.
  • C. Teng (2001) A comparison of noise handling techniques.. In Proceedings of the 14th International Florida Artificial Intelligence Res. Soc. Conf., pp. 269–273. Cited by: §1.
  • F. Wang and C. Zhang (2008) Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering 20 (1), pp. 55–67. External Links: Document, ISSN 1041-4347 Cited by: §1, §4, §4.
  • H. Yin and H. Dong (2011) The problem of noise in classification: past, current and future work. In Communication Software and Networks (ICCSN), 2011 IEEE 3rd International Conference on, pp. 412–416. External Links: Document Cited by: §1.
  • D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf (2004) Learning with local and global consistency. In Advances in Neural Information Processing Systems, Vol. 16, pp. 321–328. Cited by: §1, §4, §4.
  • X. Zhu and Z. Ghahramani (2002) Learning from labeled and unlabeled data with label propagation. Technical report Technical Report CMU-CALD-02-107, Carnegie Mellon University, Pittsburgh. Cited by: §1, §4.
  • X. Zhu (2005) Semi-supervised learning literature survey. Technical report Technical Report 1530, Computer Sciences, University of Wisconsin-Madison. Cited by: §1.
  • X. Zhu and X. Wu (2004) Class noise vs. attribute noise: a quantitative study. Artificial Intelligence Review 22 (3), pp. 177–210 (English). External Links: ISSN 0269-2821, Document, Link Cited by: §1.