A Semi-Supervised Self-Organizing Map for Clustering and Classification

07/01/2019, by Pedro H. M. Braga et al., UFPE

There has been an increasing interest in semi-supervised learning in recent years because of the growing number of datasets that contain a large amount of unlabeled data but only a few labeled samples. Semi-supervised learning algorithms can work with both types of data, combining them to obtain better performance for both clustering and classification. Moreover, these datasets commonly have a high number of dimensions. This article presents a new semi-supervised method based on self-organizing maps (SOMs) for clustering and classification, called Semi-Supervised Self-Organizing Map (SS-SOM). The method can dynamically switch between supervised and unsupervised learning during training, according to the availability of the class label for each pattern. Our results show that SS-SOM outperforms other semi-supervised methods when only a small number of labeled samples is available, and it also achieves good results when all samples are labeled.


I Introduction

In recent years, research on Artificial Neural Networks with supervised learning algorithms has made great advances, often appearing in technology news with increasingly impressive practical applications in diverse areas, such as Robotics [1], Genomics [2], and Natural Language Processing [3].

Despite these advances, the fact that these methods require a large amount of properly labeled data for training (sometimes on the order of thousands of patterns per class) makes their use in many applications impractical. In certain areas, such as the medical field, it is extremely difficult and expensive to obtain balanced labeled datasets. In other areas, such as robotics, the dynamics of the environment make it impractical to obtain labels in real time. In addition, in certain problems, new categories of elements may frequently arise, making it infeasible to create a comprehensive, previously labeled training dataset.

Therefore, at the current stage of research, it is of great importance to put forward methods that can benefit both from the (frequently large amounts of) unlabeled data available and from the smaller amounts of labeled data, which would expand the current range of machine learning applications.

In order to achieve performance improvements, Semi-Supervised Learning (SSL) methods take advantage of both unlabeled and labeled data [4]. Hence, SSL is halfway between supervised and unsupervised learning, being applied for both classification and clustering tasks [5].

In semi-supervised classification, the training process tries to exploit additional information (often available as class labels) together with the unlabeled data to achieve a more accurate classification function. In semi-supervised clustering, this prior information is used to obtain a better clustering performance [5, 6]. Prototype-based methods such as K-Means [5] and Self-Organizing Maps (SOM) [7, 8] are examples that have been successfully applied in this area.

Kohonen proposed two very influential prototype-based methods. SOM [7] is an unsupervised learning method frequently applied for clustering, and Learning Vector Quantization (LVQ) [9] is a supervised learning method that shares many similarities with SOM and is frequently applied for classification. Therefore, these methods are good candidates for developing a hybrid approach for SSL.

Various modifications of LVQ and SOM have been proposed to improve their performance on more challenging datasets with thousands of dimensions, commonly found in areas such as data mining [10] and bioinformatics [2]. In this context, traditional distance metrics often applied in prototype-based methods may become meaningless due to the curse of dimensionality [11], in which objects may appear approximately equidistant from each other, a problem aggravated by the presence of irrelevant dimensions in the dataset. SOM- and LVQ-based methods usually deal with such problems by applying weights to the input dimensions, which has been shown to provide significant performance improvements.

Following this path, in this paper we propose a new method called Semi-Supervised Self-Organizing Map (SS-SOM), an extension of the Local Adaptive Receptive Field Dimension Selective Self-Organizing Map (LARFDSSOM) [8] with important modifications to incorporate semi-supervised learning.

In order to evaluate SS-SOM, we compared it with other supervised and semi-supervised methods. The performance of SS-SOM was evaluated under different conditions of label availability, ranging from 1% to 100% of labeled samples in the dataset. The proposed method presents promising results when applied to real-world datasets, even with a low percentage of labeled data, reaching an accuracy similar to that of traditional supervised learning methods.

The rest of this article is structured as follows: Section II defines the machine learning approaches considered in this article. Section III presents a review of important and prominent classification and clustering methods from different learning approaches. Section IV describes the proposed method in detail. Section V presents the experimental setup, the methodology, and the obtained results and comparisons. Finally, in Section VI we discuss the obtained results and indicate future directions.

II Machine Learning Approaches

In a broad sense, learning processes are traditionally categorized into two fundamentally different types of tasks: learning with and without a supervisor [12, 13].

In the first, called supervised learning, which involves only labeled data, the goal is to learn a mapping from X to Y, given a training set made of pairs (x_i, y_i), where y_i ∈ Y are the labels of the samples x_i ∈ X. The latter, involving only unlabeled data, can be divided into two subcategories: 1) unsupervised learning, where the goal is to find interesting structure in the data X by estimating a density p(x) which is likely to have generated X; and 2) reinforcement learning, where an input-output mapping is learned through continued interaction with the environment in order to minimize some kind of cost function [12, 13].

In the past years, there has been a growing interest in a hybrid setting, called semi-supervised learning (SSL). SSL lies halfway between supervised and unsupervised learning. In many learning tasks, there is a large supply of unlabeled data but few labeled samples, since labels can be expensive and hard to generate. The basic idea of SSL is to take advantage of both labeled and unlabeled data during training, combining them to improve the performance of the models [6, 14, 13, 5].

Moreover, SSL can be further classified into semi-supervised classification and semi-supervised clustering [6]. In semi-supervised classification, the training set is given in two parts, S and U, where S and U are the labeled and unlabeled data, respectively. It is possible to consider a traditional supervised scenario using only S to build a classifier. However, the unsupervised estimation of the probability function p(x) of the input set can take advantage of both S and U. Hence, classification tasks can reach a higher performance through the use of SSL as a combination of supervised and unsupervised learning [6]. Many semi-supervised classification algorithms have been developed in the past decades and, according to Zhu [15], they can be structured into the following categories: 1) self-training; 2) SSL with generative models; 3) semi-supervised Support Vector Machines (S3VM), or transductive SVM; 4) SSL with graphs; and 5) SSL with committees.

Secondly, in semi-supervised clustering, the aim is to group the data into an unknown number of groups, relying on some kind of similarity or distance measure in combination with objective functions. Clustering is a more difficult and challenging problem than classification, and the nature of the data can make clustering tasks even harder, so any kind of additional prior information about the data can be useful to obtain a better performance. Therefore, the general idea behind semi-supervised clustering is to integrate some type of prior information into the process, for example, a subset of labeled data or constraints on pairs of patterns in the form of must-link and cannot-link relations [6, 15]. Prototype-based algorithms (e.g., K-means and SOMs), Hidden Markov Random Fields (HMRFs), Expectation Maximization (EM) and Label Propagation (LP) are examples that have been successful in this area [6, 15, 5, 14].

III Related Work

Several techniques have been developed to deal with high-dimensional data in different learning contexts. In this section, we describe unsupervised (Section III-A), supervised (Section III-B), and semi-supervised (Section III-C) methods and discuss how they are connected with the motivating problem. Some of these methods are further compared in Sections V and VI.

III-A Unsupervised Methods

Unsupervised learning techniques can address the problem imposed by high-dimensional and unlabeled data. In this context, we can cite the Self-Organizing Map (SOM), first introduced by Kohonen [9]. SOM is used in several applications, including clustering data without knowledge of the labels. SOM also provides a topology-preserving mapping from the high-dimensional input space to the map units, preserving the relations between the points.

The general task of clustering involves not only grouping the data but also identifying the subsets of the input dimensions that are relevant to characterize each cluster. One way to achieve this is by applying local relevances to the input dimensions. The identification of which dimensions are relevant is an important feature when working with high-dimensional data [2]. In this context, subspace clustering methods have been proposed, aiming to determine clusters in subspaces of the input dimensions of a given dataset [10]. Moreover, in subspace clustering problems, a sample may belong to more than one cluster as a result of taking into account different subsets of the input dimensions [8]. In projected clustering problems, on the other hand, each sample belongs to a single cluster.

Therefore, some variations of the original SOM were developed to improve the performance of clustering tasks, and LARFDSSOM is an example. It uses a time-varying structure, a neighborhood defined by connecting nodes that have similar subspaces of the input dimensions, and a local receptive field that is adjusted for each node as a function of its local variance. Hence, LARFDSSOM showed good results in the motivating problem for both subspace and projected clustering [8].

III-B Supervised Methods

Some supervised methods for classification were proposed to deal with high-dimensional data. According to Hammer [16], some Learning Vector Quantization (LVQ) methods are good options, since they have been shown to be a valuable alternative to Support Vector Machines (SVMs) [17]. Even so, SVMs and the Multilayer Perceptron (MLP) [12] are also alternatives.

Like SOM, LVQ was also proposed by Kohonen [9]. It is a family of algorithms for statistical pattern classification that uses prototypes to represent class regions [18]. These regions are defined by hyperplanes between prototypes, resulting in Voronoi partitions. Various modifications of LVQ exist to ensure faster convergence, a better adaptation of the receptive fields, and an adaptation to complex data structures [19].

Generalized Relevance Learning Vector Quantization (GRLVQ) is a member of this family. The algorithm was inspired by GLVQ and was proposed to deal with high-dimensional datasets by using a relevance vector able to identify the irrelevant dimensions and/or noise commonly present in real datasets. GRLVQ adapts a weight for each input dimension by incorporating an additional update rule [19].

III-C Semi-supervised Methods

K-means is one of the most popular and simplest clustering algorithms. Despite the fact that K-means was proposed over 50 years ago, it is still widely used, and many variations have been proposed. Semi-supervised K-means-based methods have been very successful, demonstrating their advantages over standard approaches. One of them is called Seeded-KMeans [5]. It can be viewed as an instance of the EM algorithm, where labeled data provide prior information about the conditional distribution of hidden category labels, working as a guide for the clustering process.

Given a dataset X, K-means clustering partitions X into k groups. Let S ⊆ X be the seed set, a subset of data points on which supervision is provided as follows: for each sample in S, a group is created according to the partition to which it belongs. By the end of the process, the partitions of the seed set S form the seed clustering, which is used to guide the K-means algorithm [5].

In Seeded-KMeans, the seed clustering is used only to initialize the K-means algorithm. Hence, instead of initializing from k random means, the mean of the i-th cluster is initialized with the mean of the i-th partition of the seed set.
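For illustration, the sketch below shows how such a seeded initialization can be wired into scikit-learn's KMeans. The function name seeded_kmeans and the fallback used when there are fewer seed classes than clusters are our own choices for this example, not part of the original Seeded-KMeans description.

import numpy as np
from sklearn.cluster import KMeans

def seeded_kmeans(X, seed_X, seed_y, k, random_state=0):
    """Initialize the i-th mean with the mean of the i-th seed partition,
    then run standard (unconstrained) K-means on the whole dataset X."""
    rng = np.random.default_rng(random_state)
    classes = np.unique(seed_y)
    init = np.empty((k, X.shape[1]))
    for i, c in enumerate(classes[:k]):
        init[i] = seed_X[seed_y == c].mean(axis=0)   # seed clustering -> initial mean
    if len(classes) < k:                             # fallback: random data points
        extra = rng.choice(len(X), size=k - len(classes), replace=False)
        init[len(classes):] = X[extra]
    return KMeans(n_clusters=k, init=init, n_init=1, random_state=random_state).fit(X)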

Label propagation (LP) is another promising approach for SSL [20]. LP methods operate on proximity graphs or connected structures to spread and propagate information about the class to nearby nodes according to a similarity matrix. It is based on the assumption that nearby entities should belong to the same class, in contrast to far away entities [4, 20].

For LP purposes, each node is assigned a label vector. A label vector contains the probabilistic membership degrees of an input sample to the available clusters. The nodes propagate their label vectors to all adjacent nodes according to a defined similarity matrix W. Nodes belonging to a pre-classified input sample have fixed label vectors [20].

A similar alternative to LP is called Label Spreading (LS) [21]. It differs from LP in how the similarity matrix is handled: LP uses the raw similarity matrix constructed from the data with no changes, whereas LS minimizes a loss function with regularization properties, which often makes it more robust to noise.

IV Proposed Method

SS-SOM (available at https://github.com/phbraga/SS-SOM) is a semi-supervised hybrid SOM, based on LARFDSSOM [8], with a time-varying structure [22] and two different ways of learning. As in LARFDSSOM, the nodes of SS-SOM can consider different relevances for the input dimensions and adapt their receptive fields during the self-organization process.

Moreover, our model is a prototype-based method that can learn in a supervised or unsupervised way. The SS-SOM can switch between these two ways during the self-organization process according to the availability of the information about the class label for each input pattern. To achieve this, we modified the LARFDSSOM to include concepts from the standard LVQ [9] when the class label of some input pattern is given. The operations of the map consist of three phases: 1) organization (Alg. 1); 2) convergence; and 3) clustering or classification.

1 Initialize parameters a_t, lp, β, age_wins, e_b, e_n, s, minwd, push_rate and the number of epochs;
2 Initialize the map with one node, with its center vector c_j initialized at the first input pattern x, ω_j ← 1, δ_j ← 0, wins(j) ← 0, and class(j) ← noClass, or class(x) if available;
3 Initialize the variable nwins ← 1;
4 for t ← 0 to the maximum number of iterations do
5        Choose a random input pattern x;
6        Compute the activation of all nodes (Eq. 2);
7        Find the winner s1 with the highest activation a(s1) (Eq. 1);
8        if x has a label then
9               Run the SupervisedMode(x, s1) (Alg. 3);
10       else
11              Run the UnsupervisedMode(x, s1) (Alg. 2);
12       if nwins = age_wins then
13              Remove nodes with wins(j) < lp × age_wins;
14              Update the connections of the remaining nodes (Eq. 7);
15              Reset the number of wins of the remaining nodes: wins(j) ← 0;
16              nwins ← 0;
17       nwins ← nwins + 1;
18 Run the Convergence Phase;
Algorithm 1 Hybrid Mode

In the organization phase, after the network initialization, the nodes compete to form clusters of randomly chosen input patterns. There are two different ways to decide the winner of a competition, which nodes need to be updated, and when a new node needs to be inserted. If the class label of the input pattern is provided, the supervised mode is used (Section IV-B); otherwise, the unsupervised mode is employed (Section IV-A). The model could also be trivially modified to incorporate reinforcement learning. The neighborhood of SS-SOM is formed by connecting nodes to other nodes of the same class label or to unlabeled nodes; in both cases, the connected nodes must take into account a similar subset of the input dimensions. The competition, adaptation and cooperation steps are repeated for a limited number of epochs. Furthermore, as in LARFDSSOM, the nodes that do not win for a minimum number of patterns are removed from the map every time a certain number of competitions (the age_wins parameter) is reached.

The convergence phase starts after the organization phase. Here, the nodes are also updated and removed when necessary, in a similar way to the first phase. The difference is that no new nodes are inserted. Moreover, this phase finishes the cycle left by the organization phase and runs another one to ensure convergence.

After finishing the convergence phase, the map can cluster and classify input patterns. Depending on the amount and distribution of labeled input patterns presented to the network during training, after the convergence phase the map may have: 1) all nodes labeled; 2) some nodes labeled; or 3) no nodes labeled. In the first case, clustering and classification are straightforward: each test pattern is associated with the label of the node with the highest activation. In the second case, if the node with the highest activation has no class, we continue looking for another node with a defined class label and an activation above the threshold a_t. In the third and final case, we can identify the clusters of the input patterns, but not their classes.
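The classification rule above can be summarized by the following minimal sketch, in which the activation threshold is written a_t and NO_CLASS is an illustrative marker for unlabeled nodes (both names are ours, not the reference implementation's).

import numpy as np

NO_CLASS = -1  # illustrative marker for a node without a class label

def classify(activations, node_classes, a_t):
    """Sketch of the SS-SOM classification rule described above."""
    order = np.argsort(activations)[::-1]       # nodes from most to least activated
    best = order[0]
    # Case 1: the most activated node is labeled -> use its class.
    if node_classes[best] != NO_CLASS:
        return node_classes[best]
    # Case 2: look for the next labeled node still activated above a_t.
    for j in order[1:]:
        if activations[j] < a_t:
            break
        if node_classes[j] != NO_CLASS:
            return node_classes[j]
    # Case 3: only the cluster (node `best`) is known, not the class.
    return NO_CLASS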

It is important to mention that in subspace clustering an input pattern may belong to more than one cluster. However, in this work, we considered only the task of projected clustering, in which each input pattern is assigned to a single cluster.

The next sections describe the operation in the unsupervised and supervised modes.

IV-A Unsupervised Mode

Given an unlabeled input pattern, we look for a winner node disregarding the class labels. Therefore, as in Eq. 1, the winner of a competition is the node that is most activated according to a radial basis function with the receptive field adjusted as a function of its relevance vector. In other words, the winner s(x) is the node with the highest activation value (Section IV-C2) for the input pattern:

s(x) = arg max_{j ∈ N} ac(D_ω(x, c_j), ω_j),   (1)

where ac is the activation function explained in Section IV-C2, c_j is the center vector and ω_j is the relevance vector of the node j.

Similarly to LARFDSSOM, SS-SOM has an activation threshold a_t. If the activation of the winner is lower than a_t, a new node is inserted into the map at the input pattern position, because the winner is not close enough. Otherwise, the winner and its neighbors are updated to get closer to the input pattern (Section IV-C3). For that, we consider two fixed learning rates: 1) e_b for the winner node; and 2) e_n for its neighbors, where e_n < e_b. Alg. 2 presents this procedure.

Input : Input pattern x and the first winner s1;
1 if a(s1) < a_t and N < N_max then
2        Create a new node j and set: c_j ← x, ω_j ← 1, δ_j ← 0, wins(j) ← 0 and class(j) ← noClass;
3        Connect j to the other nodes as per Eq. 7;
4 else
5        Update the winner node and its neighbors: UpdateNode(s1, e_b), UpdateNode(neighbors(s1), e_n) (Alg. 4);
6        Set wins(s1) ← wins(s1) + 1;
Algorithm 2 Unsupervised Mode
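A minimal Python sketch of this decision is shown below. The Node record and the function signature are our own simplification for illustration (the reference implementation is linked at the beginning of Section IV); the node update itself is sketched in Section IV-C3.

from dataclasses import dataclass, field
from typing import List
import numpy as np

NO_CLASS = -1

@dataclass
class Node:
    c: np.ndarray                        # center (prototype) vector
    w: np.ndarray                        # relevance vector, one weight per dimension
    d: np.ndarray                        # moving average of |x - c|
    wins: int = 0
    label: int = NO_CLASS
    neighbors: List["Node"] = field(default_factory=list)

def new_node(x, label=NO_CLASS):
    return Node(c=x.copy(), w=np.ones_like(x), d=np.zeros_like(x), label=label)

def unsupervised_step(x, nodes, winner, a_winner, a_t, n_max, e_b, e_n, update_node):
    """If the winner is not activated enough, insert a node at x; otherwise pull
    the winner (rate e_b) and its neighbors (rate e_n < e_b) towards x."""
    if a_winner < a_t and len(nodes) < n_max:
        nodes.append(new_node(x))        # connections would be recomputed here (Eq. 7)
    else:
        update_node(winner, x, e_b)
        for nb in winner.neighbors:
            update_node(nb, x, e_n)
        winner.wins += 1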

IV-B Supervised Mode

In order to incorporate the supervised learning mode, each node in the map can be associated with a class label. Hence, when a labeled input pattern is given, we treat it differently. The Alg. 3 presents this procedure.

Input : Input pattern x and the first winner s1;
1 if class(s1) = class(x) or class(s1) = noClass then
2        if a(s1) < a_t and N < N_max then
3               Create a new node j and set: c_j ← x, ω_j ← 1, δ_j ← 0, wins(j) ← 0 and class(j) ← class(x);
4               Connect j to the other nodes as per Eq. 7;
5        else if a(s1) ≥ a_t then
6               Update the winner node and its neighbors: UpdateNode(s1, e_b), UpdateNode(neighbors(s1), e_n) (Alg. 4);
7               Set class(s1) ← class(x);
8               Update the connections as per Eq. 7;
9               Set wins(s1) ← wins(s1) + 1;
10 else
11       Try to find a new winner s2 with noClass or the same class as x and activation a(s2) ≥ a_t;
12       if s2 exists then
13              Update the new winner node, its neighbors and the previous wrong winner: UpdateNode(s2, e_b), UpdateNode(neighbors(s2), e_n) and UpdateNode(s1, -push_rate) (Alg. 4);
14              Set wins(s2) ← wins(s2) + 1;
15       else if N < N_max then
16              Create a new node j and set: c_j ← x, ω_j ← 1, δ_j ← 0, wins(j) ← 0 and class(j) ← class(x);
17              Connect j to the other nodes as per Eq. 7;
Algorithm 3 Supervised Mode

In order to obtain performance improvements from the labeled patterns, we take the labels into account when looking for a winner. Here, unlike the unsupervised mode, which considers only the activation, if the most activated node has the same class as the input pattern or no defined class (line 1 in Alg. 3), a procedure very similar to the unsupervised mode (Section IV-A) is run (lines 2 to 9). The difference, in this case, is that it is necessary to set the winner's class to the class of the input pattern x, as well as to update its connections. Otherwise, we search for another winner matching the following conditions (line 11): 1) it needs to have the same class as the input pattern or an unspecified class; and 2) its activation must be higher than a_t.

If some node fulfills these conditions (line 12 in Alg. 3), a new winner s2 has been found, and it and its neighbors are updated as in the unsupervised mode (Section IV-A). However, the fact that s1 was the wrong winner leads to the possibility of pushing it away from the input pattern. Therefore, similarly to LVQ, we push s1 away from the input pattern with a fixed learning rate push_rate. This procedure is presented in lines 13 and 14 of Alg. 3. Otherwise, if the maximum number of nodes in the map has not been reached, a new node is inserted into the map at the same position and with the same class as the input pattern x (lines 16 and 17 of Alg. 3).
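The supervised branch can be sketched in the same style, reusing the Node record, the new_node helper and the NO_CLASS marker from the sketch in Section IV-A; ranked is assumed to be the list of (node, activation) pairs sorted by decreasing activation, and all parameter names are the ones introduced above.

def supervised_step(x, x_label, nodes, ranked, a_t, n_max,
                    e_b, e_n, push_rate, update_node):
    """Sketch of the supervised mode (Alg. 3)."""
    winner, a_winner = ranked[0]
    if winner.label in (x_label, NO_CLASS):
        # Same treatment as the unsupervised mode, plus label assignment.
        if a_winner < a_t and len(nodes) < n_max:
            nodes.append(new_node(x, label=x_label))
        elif a_winner >= a_t:
            update_node(winner, x, e_b)
            for nb in winner.neighbors:
                update_node(nb, x, e_n)
            winner.label = x_label               # the winner adopts the pattern's class
            winner.wins += 1
        return
    # The most activated node has a conflicting class: search for another
    # winner with a compatible class and activation above a_t.
    for node, act in ranked[1:]:
        if act < a_t:
            break
        if node.label in (x_label, NO_CLASS):
            update_node(node, x, e_b)
            for nb in node.neighbors:
                update_node(nb, x, e_n)
            update_node(winner, x, -push_rate)   # push the wrong winner away (LVQ-like)
            node.wins += 1
            return
    if len(nodes) < n_max:                       # no suitable winner: insert a new node
        nodes.append(new_node(x, label=x_label))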

IV-C Common Operations for Both Modes

IV-C1 Node Structure

In SS-SOM, each node j in the map represents a cluster and is associated with three m-dimensional vectors, where m is the number of input dimensions: c_j is the center vector, which represents the prototype of the cluster j in the input space; ω_j is the relevance vector, in which each component ω_ji represents the estimated relevance, a weighting factor within [0, 1], that the node j applies to the i-th input dimension; and δ_j is the distance vector, which stores a moving average of the observed distances between the input patterns x and the center vector c_j. The vector δ_j is used solely to compute the relevance vector, as in [8].

IV-C2 Node Activation

The activation of a node in SS-SOM is calculated as a radial basis function of the weighted distance, with the receptive field adjusted as a function of its relevance vector. The activation grows as the weighted distance decreases and as the relevances increase. Eq. 2 gives the activation ac(D_ω(x, c_j), ω_j):

(2)

where ε is a small value added to avoid division by zero and D_ω(x, c_j) is the weighted distance function used in LARFDSSOM:

(3)
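Since the exact expressions of Eqs. 2 and 3 are inherited from LARFDSSOM [8], the sketch below only illustrates one plausible form consistent with the description above: a relevance-weighted Euclidean distance and a radial-basis-style activation. It is an assumption made for illustration, not the paper's exact formula.

import numpy as np

EPS = 1e-10  # small constant to avoid division by zero

def weighted_distance(x, c, w):
    # Distance in which each squared difference is scaled by the relevance the
    # node assigns to that dimension (in the spirit of Eq. 3).
    return np.sqrt(np.sum(w * (x - c) ** 2))

def activation(x, c, w):
    # Radial-basis-style activation: grows as the weighted distance shrinks and
    # as the relevances grow (in the spirit of Eq. 2); an assumed form, not the
    # paper's exact expression.
    return 1.0 / (1.0 + weighted_distance(x, c, w) / (np.sum(w) + EPS))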

IV-C3 Node Update

In SS-SOM, in order to update the vectors associated with a node (the winner, its neighbors, or the winner of a wrong class), a fixed learning rate is used, which depends on the ongoing procedure (Alg. 2 or Alg. 3).

Input : Node s, Learning Rate lr
1 Function UpdateNode(s, lr):
2        Update the distance vector of s (Eq. 5); Update the relevance vector of s (Eq. 6); Update the center vector of s (Eq. 4);
Algorithm 4 Node Update

Alg. 4 shows how the update occurs in SS-SOM. Given a learning rate, a node is updated as in LARFDSSOM. The update of the center vector is given by Eq. 4:

c_j(t+1) = c_j(t) + e (x − c_j(t)),   (4)

where e is the learning rate.

To compute the relevance vectors, we estimate the average distance of each node to the input patterns that it clusters. As in LARFDSSOM, the distance vectors are updated through a moving average of the observed distance between the input pattern and the current center vector:

δ_j(t+1) = (1 − e β) δ_j(t) + e β |x − c_j(t)|,   (5)

where e is the learning rate, β ∈ ]0,1[ controls the rate of change of the moving average, and |·| denotes the component-wise absolute value, not the norm [8].

After updating the distance vector, each component of the relevance vector is calculated by an inverse logistic function of the distances, as in Eq. 6:

ω_ji = 1 / (1 + exp((δ_ji − δ_j^mean) / (s (δ_j^max − δ_j^min)))) if δ_j^max ≠ δ_j^min, and ω_ji = 1 otherwise,   (6)

where δ_j^max, δ_j^min and δ_j^mean are the maximum, the minimum, and the mean of the components of the distance vector δ_j, respectively. The parameter s > 0 controls the slope of the logistic function [8].
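Putting the three updates together, the node update of Alg. 4 can be sketched as follows, reusing the Node record from the sketch in Section IV-A. The default values chosen here for beta (relevance rate) and s (relevance smoothness) are arbitrary placeholders within the ranges of Table V, and the formulas follow the reconstructions above.

import numpy as np

def update_node(node, x, e, beta=0.1, s=0.05):
    """Sketch of Alg. 4: distance vector (Eq. 5), relevance vector (Eq. 6),
    then center vector (Eq. 4), all with the same learning rate e."""
    # Eq. 5: moving average of the component-wise absolute differences.
    node.d = (1 - e * beta) * node.d + e * beta * np.abs(x - node.c)
    # Eq. 6: inverse logistic of the distances -- dimensions whose distance is
    # below the node's mean get a relevance close to 1, the others close to 0.
    d_max, d_min, d_mean = node.d.max(), node.d.min(), node.d.mean()
    if d_max > d_min:
        node.w = 1.0 / (1.0 + np.exp((node.d - d_mean) / (s * (d_max - d_min))))
    else:
        node.w = np.ones_like(node.d)
    # Eq. 4: move the center towards the input pattern; a negative rate, as used
    # for the wrong winner in the supervised mode, pushes it away instead.
    node.c = node.c + e * (x - node.c)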

IV-C4 Node Removal

In SS-SOM, each node j in the map stores a variable wins(j) that represents the number of victories of the node since the last reset. Whenever age_wins competitions are reached, a reset occurs (see the reset step in Alg. 1): any node that did not win at least the minimum percentage of the competitions, lp × age_wins, is removed. After the reset, the number of victories of the remaining nodes is set to zero.

IV-C5 Neighborhood Update

When a reset occurs and nodes have been removed, the connections between the remaining nodes must be updated. In SS-SOM, the neighborhood is formed by nodes with the same class, or by unlabeled nodes, that apply similar relevances to the input dimensions, so that a connection between two nodes means that they cluster patterns of the same class or at least in similar subspaces. Eq. 7 compares the relevance vectors of every pair of nodes (against the connection threshold minwd) to control this behavior:

(7)

IV-D SS-SOM Parameters Summary

SS-SOM inherits all parameters from LARFDSSOM and includes a new parameter, push_rate, which provides a specific learning rate for the update of wrong winners, as described in Section IV-B. This means that we have 11 parameters to set up. Despite this being a high number of parameters, a sensitivity analysis presented in [8] revealed that only a_t and lp have a high impact on the results. SS-SOM keeps this characteristic, with the addition of e_b as a new sensitive parameter. Therefore, we can keep the other parameter values fixed inside the ranges defined in Table V, given their marginal influence, including the number of epochs. The parameter a_t, however, is crucial: since it defines the receptive field of the nodes during training, it affects the number of nodes inserted in the map, as well as the number of patterns regarded as outliers during the clustering and classification phase. The parameter lp defines the minimum percentage of input patterns that a node has to cluster in order not to be removed from the map. This parameter is dataset dependent and has a substantial impact on the results. Finally, the parameter e_b is the learning rate of the winner node; it defines the update step and also depends on the dataset. After a_t and lp are well adjusted, e_b starts to impact the results, but it is not as significant as the other two. A short description of the other parameters can be found in [8].

V Experiments

In order to evaluate the classification capabilities of SS-SOM, we compare it with traditional supervised methods such as MLP [12], SVM [17], and GRLVQ [19]. We also compared SS-SOM with the following semi-supervised methods: Label Spreading [21] and Label Propagation [4]. Finally, we used seven real-world datasets from the OpenSubspace framework [23]. It provides real-world datasets adapted from the UCI machine learning repository [24] as well as an extensive number of synthetic datasets. A detailed description of the datasets can be found in [23].

In Section V-A, we present the methodology and the experimental setup; then, in Section V-B, we present the results and the analysis that support the final conclusions.

V-A Experimental Setup

For all the algorithms, on each dataset, we used 3-times 3-fold cross-validation. Each method was trained and tested 500 times for each fold, with different parameter values sampled from the parameter ranges presented in Tables I to V according to a Latin Hypercube Sampling (LHS) [25], while the best accuracy achieved by each method in each fold was recorded for each dataset. This comprises a total of 752,000 experiments. After that, we calculated the mean and the standard deviation of the best results for each dataset separately. The LHS guarantees full coverage of the range of each parameter: in our case, the range of each parameter is divided into 500 intervals of equal probability, and a single value is randomly selected from each interval [8].
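For reference, the same kind of sampling can be reproduced with SciPy's quasi-Monte Carlo module; the snippet below draws 500 parameter sets over two of the SS-SOM ranges of Table V (a_t and lp) and is only an illustration of the procedure, not the authors' tooling.

from scipy.stats import qmc

lower = [0.80, 0.001]                  # a_t and lp lower bounds (Table V)
upper = [0.999, 0.01]                  # a_t and lp upper bounds (Table V)

sampler = qmc.LatinHypercube(d=len(lower), seed=42)
unit_samples = sampler.random(n=500)   # one point per equal-probability interval
param_sets = qmc.scale(unit_samples, lower, upper)

for a_t, lp in param_sets[:3]:         # e.g., feed each set to one training run
    print(f"a_t={a_t:.3f}, lp={lp:.4f}")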

To study the effects of different levels of supervision, i.e., the percentage of labeled data, the semi-supervised methods were trained with the following percentages: 1%, 5%, 10%, 25%, 50%, 75% and 100%. The parameter ranges for the supervised methods are shown in Tables I to III, and the parameter ranges for both semi-supervised methods can be seen in Table IV. Finally, the ranges for SS-SOM are shown in Table V. The maximum number of nodes for SS-SOM (N_max) was set to the size of the training set. A detailed description of the parameters of the compared methods can be found in [19], [12], [17], [4], and [21].

Parameters min max
Number of nodes 10 30
Positive learning rate 0.4 0.5
Negative learning rate 0.01 0.05
Weights learning rate 0.15 0.2
Learning Decay 0.000001 0.00002
Number of epochs 5000 10000
TABLE I: Parameter Ranges for GRLVQ
Parameters min max
C 0.1 10
Kernel Function 1 4
Degree of polynomial kernel function 3 5
Gamma of kernel functions 2, 3 and 4 0.1 1
Independent term in kernel functions 2 and 3 0.01 1
  • 1: linear, 2: poly, 3: rbf and 4: sigmoid.

TABLE II: Parameter Ranges for SVM
Parameters min max
Number of neurons in each layer 1 100
Number of hidden layers 1 3
Learning rate 0.001 0.1
Momentum 0.85 0.95
Epochs 100 200
Optimizer 1 3
Activation function 1 3
Learning Decay 1 3
  • Optimizer: 1: lbfgs; 2: sgd; 3: adam. Activation function: 1: logistic; 2: tanh; 3: relu. Learning Decay: 1: constant; 2: invscaling; 3: adaptive.

TABLE III: Parameter Ranges for MLP
Parameters min max
Kernel Function 1 2
Gamma (for RBF kernel) 10 30
Number of Neighbors (for KNN kernel) 1 100
Alpha* 0 1
Number of epochs 20 100
  • 1: RBF and 2: KNN. * Alpha is only used for Label Spreading.

TABLE IV: Parameter Ranges for Label Spreading and Label Propagation

We considered a projected clustering problem, where each sample should be assigned to a single cluster, and SS-SOM was set to operate in this mode. For classification purposes, we use the node class, when available, as the predicted class; otherwise, the prediction is straightforwardly counted as an error. The next section presents the obtained results and their analysis.

Parameters min max
Activation threshold (a_t) 0.80 0.999
Lowest cluster percentage (lp) 0.001 0.01
Relevance rate (β) 0.001 0.5
Max competitions (age_wins)
Winner learning rate (e_b) 0.001 0.2
Wrong winner learning rate (push_rate)
Neighbors learning rate (e_n)
Relevance smoothness (s) 0.01 0.1
Connection threshold (minwd) 0 0.5
Number of epochs 1 100
  • * S is the number of input patterns in the dataset.

TABLE V: Parameter Ranges for SS-SOM

V-B Experimental Results and Analysis

Fig. 1: Best mean accuracy and standard deviation as a function of the percentage of supervision for each dataset

Fig. 1 shows the results of SS-SOM in comparison with Label Propagation and Label Spreading on the real-world datasets as a function of the percentage of labeled data. In all datasets, the performance of the proposed method is superior to the other semi-supervised methods for supervision rates between 1% and 75%, whereas at higher percentages (100%) the difference is smaller, but SS-SOM still outperforms them or obtains comparable results. These results show the robustness of the proposed method in situations where only a small number of labeled samples is available.

Table VI shows the results of SS-SOM and the other semi-supervised methods using 100% of the labeled data, allowing a fair comparison with supervised methods such as GRLVQ, MLP, and SVM. Our method shows a performance comparable to the other semi-supervised methods, with the biggest difference occurring for Vowel. Also, SS-SOM appears as the best overall among the semi-supervised methods (the first three in the table), as does MLP among the supervised methods (the last three in the table).

Considering all methods at 100% of supervision, MLP outperforms all the others in four of the seven datasets. Our method presented the best result for the Shape dataset, outperforming all the other methods. Whereas Label Spreading and Label Propagation are the best ones for Vowel, SS-SOM showed better results than two of the three supervised methods, MLP and GRLVQ, with the latter showing a low accuracy value. Also, SVM appears as the best for Pendigits. Besides that, in all the other datasets, SS-SOM showed results close to the best, even though this was not the primary objective of this work.

Accuracy Breast Diabetes Glass Liver Pendigits Shape Vowel
SS-SOM 0.832 (0.044) 0.776 (0.016) 0.714 (0.033) 0.748 (0.025) 0.978 (0.004) 0.935 (0.029) 0.876 (0.017)
Label Propagation 0.805 (0.063) 0.730 (0.031) 0.663 (0.044) 0.623 (0.036) 0.994 (0.003) 0.925 (0.036) 0.948 (0.012)
Label Spreading 0.805 (0.066) 0.729 (0.031) 0.663 (0.044) 0.640 (0.031) 0.994 (0.003) 0.925 (0.036) 0.948 (0.012)
MLP 0.854 (0.032) 0.791 (0.017) 0.746 (0.031) 0.766 (0.031) 0.993 (0.001) 0.923 (0.034) 0.874 (0.033)
SVM 0.850 (0.037) 0.788 (0.020) 0.718 (0.028) 0.746 (0.054) 0.997 (0.001) 0.931 (0.030) 0.909 (0.022)
GRLVQ 0.830 (0.049) 0.772 (0.020) 0.676 (0.027) 0.699 (0.022) 0.915 (0.004) 0.823 (0.061) 0.515 (0.027)
  • In bold, the best results for each dataset on each category: semi-supervised and supervised methods. The underlined results indicate the global best.

TABLE VI: Accuracy Results for Real-World Datasets with 100% of the labeled data

VI Conclusion and Future Work

This article presented an approach for classification and clustering with semi-supervised learning. SS-SOM was shown to provide significant improvements in classification results for small amounts of labeled data, establishing its position as a good option when dealing with such problems, which is the central point of this article. The proposed method showed its robustness under this condition, performing better than other semi-supervised models and achieving impressive results even with only 1% of labeled data. Furthermore, despite the fact that SS-SOM has 11 parameters, only three of them (a_t, lp and e_b) present important effects on the results.

Also, in all datasets, using 100% of the labels, SS-SOM showed results better than or at least close to the best found in comparison with other supervised and semi-supervised methods, even though this was not the objective of this work.

It is important to mention that, in the current implementation, the self-organizing process runs for a number of epochs sampled from the LHS, which is usually greater than necessary for convergence, even within the defined interval. An adequate stopping criterion is an object of study for future versions, in order to reduce the training time.

Notice that LARFDSSOM presented good results for subspace clustering [8], and when no labeled sample is available, SS-SOM works exactly like LARFDSSOM, inheriting its characteristics and performance. However, when labeled samples are given, the results can be even better. Moreover, with a small change, SS-SOM could also incorporate reinforcement learning, thus becoming capable of switching between three different learning approaches to exploit several forms of available information, which is left for future work.

Acknowledgments

The authors would like to thank the Brazilian National Council for Technological and Scientific Development (CNPq) for supporting this research study.

References

  • [1] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” CoRR, 2016.
  • [2] F. R. B. Araujo, H. F. Bassani, and A. F. R. Araujo, “Learning vector quantization with local adaptive weighting for relevance determination in Genome-Wide association studies,” in The 2013 International Joint Conference on Neural Networks.   IEEE, aug 2013, pp. 1–8.
  • [3] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, “Deep recurrent models with fast-forward connections for neural machine translation,” CoRR, 2016.
  • [4] X. Zhu and Z. Ghahramani, “Learning from labeled and unlabeled data with label propagation,” 2002.
  • [5] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in In Proceedings of 19th International Conference on Machine Learning, 2002.
  • [6] F. Schwenker and E. Trentin, “Pattern classification and clustering: A review of partially supervised learning approaches,” Pattern Recognition Letters, vol. 37, pp. 4–14, 2014.
  • [7] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
  • [8] H. F. Bassani and A. F. Araujo, “Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering,” IEEE transactions on neural networks and learning systems, vol. 26, no. 3, pp. 458–471, 2015.
  • [9] T. Kohonen, “Learning vector quantization,” in Self-Organizing Maps.   Springer, 1995, pp. 175–189.
  • [10] H.-P. Kriegel, P. Kröger, and A. Zimek, “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 1, p. 1, 2009.
  • [11] M. Köppen, “The curse of dimensionality,” in 5th Online World Conference on Soft Computing in Industrial Applications, 2000, pp. 4–8.
  • [12] S. Haykin, Neural Networks and Learning Machines, 3rd ed.   Prentice-Hall, 2008.
  • [13] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
  • [14] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition letters, vol. 31, no. 8, pp. 651–666, 2010.
  • [15] X. Zhu, “Semi-supervised learning literature survey,” Computer Science, University of Wisconsin-Madison, vol. 2, no. 3, p. 4, 2006.
  • [16] B. Hammer, M. Strickert, and T. Villmann, “Relevance LVQ versus SVM,” in International Conference on Artificial Intelligence and Soft Computing.   Springer, 2004, pp. 592–597.
  • [17] C. Cortes and V. Vapnik, “Support vector machine,” Machine learning, vol. 20, no. 3, pp. 273–297, 1995.
  • [18] D. Nova and P. A. Estévez, “A review of learning vector quantization classifiers,” Neural Computing and Applications, vol. 25, no. 3-4, pp. 511–524, 2014.
  • [19] B. Hammer and T. Villmann, “Generalized relevance learning vector quantization,” Neural Networks, vol. 15, no. 8, pp. 1059–1068, 2002.
  • [20] L. Herrmann and A. Ultsch, “Label propagation for semi-supervised learning in self-organizing maps,” Proceedings of the 6th WSOM, 2007.
  • [21] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328.
  • [22] A. F. Araujo and R. L. Rego, “Self-organizing maps with a time-varying structure,” ACM Computing Surveys, vol. 46, no. 1, p. 7, 2013.
  • [23] E. Müller, S. Günnemann, I. Assent, and T. Seidl, “Evaluating clustering in subspace projections of high dimensional data,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1270–1281, 2009.
  • [24] A. Asuncion and D. Newman, “UCI machine learning repository,” 2007.
  • [25] J. C. Helton, F. Davis, and J. D. Johnson, “A comparison of uncertainty and sensitivity analysis results obtained with random and latin hypercube sampling,” Reliability Engineering & System Safety, vol. 89, no. 3, pp. 305–330, 2005.