A Semi-Supervised Self-Organizing Map with Adaptive Local Thresholds

07/01/2019, by Pedro H. M. Braga, et al. (UFPE)

In recent years, there has been growing interest in semi-supervised learning, since, in many learning tasks, there is a plentiful supply of unlabeled data but insufficient labeled data. Hence, semi-supervised learning models can benefit from both types of data to improve performance. It is also important to develop methods that are easy to parameterize in a way that is robust to the different characteristics of the data at hand. This article presents a new method based on the Self-Organizing Map (SOM) for clustering and classification, called the Adaptive Local Thresholds Semi-Supervised Self-Organizing Map (ALTSS-SOM). It can dynamically switch between two forms of learning at training time, according to the availability of labels, as in previous models, and can automatically adjust itself to the local variance observed in each data cluster. The results show that ALTSS-SOM surpasses the performance of other semi-supervised methods in terms of classification, and of other pure clustering methods when no labels are available, while also being less sensitive than previous methods to parameter values.


I Introduction

Over the last few years, the use of machine-learning technology has driven many aspects of modern society. Recent research on Artificial Neural Networks with supervised learning has shown great advances; it is the most common form of machine learning [1]. It is not unusual to see in the news practical applications in diverse areas [1, 2]. A key to the success of supervised learning, especially deep supervised learning, is the availability of sufficiently large amounts of labeled training data. Unfortunately, creating properly labeled data with enough examples of each class is not easy. As a result, supervised learning methods become impractical in many applications, such as the medical field, where it is extremely difficult and expensive to obtain balanced labeled data.

On the other hand, due to advances in technology that have produced datasets of increasing size, in terms of both the number of samples and of features, unlabeled data can usually be obtained easily. Therefore, it is of great importance to put forward methods that can combine both types of data in order to benefit from the information each of them provides in its own way [3]. An approach typically applied in such a scenario is Semi-Supervised Learning (SSL). It lies halfway between supervised and unsupervised learning and can be used for both clustering and classification tasks [3, 4].

We point out that prototype-based methods have been successfully applied to both tasks. Methods based on Self-Organizing Maps (SOM) [5, 6, 2] and K-Means [7] can be highlighted as examples, as well as deep learning techniques [8, 9, 10]. The SOM is an unsupervised learning method, frequently applied for clustering, while Learning Vector Quantization (LVQ) [5], its supervised counterpart that shares many similarities with it, is normally used for classification. Both were proposed by Kohonen, and since then, various modifications have been introduced, including semi-supervised versions [2], to deal with more challenging problems.

Recent SOM-based methods employ a threshold defining the minimum level of activation for an input pattern to be considered associated with a cluster prototype. This threshold level is a parameter of the model that is shared by all prototypes [6, 2]; thus, the regions that a prototype can represent are either not learned at all or are normally estimated using supervised approaches, as in [11].

In this context, the main idea of this paper is to introduce the concept of local adaptive thresholds through the use of the local variances observed by the prototypes for each dimension. Such variances are calculated using a bias-corrected moving average with an exponentially weighted decay rate [12]. This concept was derived from the idea of rejection options, introduced early on by Chow [13]. It is related to the conditions for taking a classification or recognition decision for a particular point or data region. In the case of SOM-based methods, these decisions are associated with the nodes in the map (i.e., when they must accept an input pattern as part of their representation pool). Such rejection options define the first step towards an adaptation of the model complexity tailored to data regions with a high degree of uncertainty [14]. So far, most models that use rejection options deal with just a single threshold, and most of them can handle only binary classification [11].

In this article, we propose a new model called ALTSS-SOM, an extension of SSSOM created by introducing important modifications that incorporate the ability to estimate local rejection options as a function of both the local variance and the relevance of the input dimensions for each node in the map. To evaluate ALTSS-SOM, we compare it with other semi-supervised approaches that do not use adaptive reject options. We also compare ALTSS-SOM with its predecessor, which used a parameter to define the threshold region for pattern rejection decisions. Moreover, since we introduce an entirely new learning procedure, it is necessary to compare ALTSS-SOM not only in terms of classification performance but also in terms of clustering performance. This is done by taking into consideration the methods that provided the ideas for the development of the proposed method, as well as other conventional methods in the literature. Finally, as our parameter sensitivity analysis shows, the sensitivity of the model to its parameters is significantly reduced in comparison with the previous version.

The rest of this article is organized as follows: Section II presents a short review of the background related to the areas in which this paper is situated. Section III discusses related work in the literature. Section IV describes the proposed method in detail. Section V presents the experimental setup, methodology, obtained results, and comparisons. Finally, Section VI discusses the obtained results, draws the conclusions of this paper, and indicates future directions.

II Background

High-dimensional data poses different challenges for clustering tasks. In particular, similarity measures used in traditional clustering techniques may become meaningless due to the curse of dimensionality [15]. In this context, subspace clustering and projected clustering appear as common choices. They aim to determine clusters in subspaces of the input dimensions. This task involves not only the clustering itself but also identifying relevant subsets in the input dimensions for each cluster [16]. One way to achieve this is by applying local relevances to the input dimensions.

Moreover, as in [2], this paper introduces a model that is able not only to cluster but also to classify samples. In this context, we aim to introduce the concept of reject options. According to Chow [13], the uncertainties and noise inherent in any pattern recognition task result in a certain amount of unavoidable errors. Uncertainty normally has two causes: points being outliers or points located in ambiguous regions [17]. The option to reject is introduced to avoid an excessive misrecognition rate by converting errors into rejections.

We adapt this idea to consider local reject options based on the variances estimated for each node in the map (discussed in more detail in Section IV). The main idea is to give nodes the ability to reject an input pattern x if it lies outside a region of the space defined by the estimated variances in each dimension around the centroid of each node. This results in an adaptive local thresholding approach, similar to the one found in [18, 19]. However, this variance-based approach provides a threshold adjusted for each dimension of each node in the map during the semi-supervised learning process.

Finally, it is worth mentioning that SSL fits all of the aforementioned problems and techniques well. Because of that, and considering the growing interest in semi-supervised learning in recent years, such combined approaches may arise more often. It is also worth pointing out that interest in SSL is growing in the machine-learning community [7, 20, 2] as well as in the deep learning context [21, 22].

III Related Work

This section briefly summarizes related work in diverse contexts. First, in the purely semi-supervised machine-learning context, we highlight K-means-based methods. K-means is one of the most popular and simplest clustering algorithms; it was proposed over 50 years ago but is still widely used in diverse applications [23, 24].

Continuing with prototype-based methods, SSSOM [2] is the inspiration for the model presented in this paper. SSSOM is a semi-supervised method that essentially inherits characteristics from its predecessor, LARFDSSOM [6], but also introduces elements of supervision (i.e., supervised learning) to create a hybrid environment where SSL can be applied. LARFDSSOM [6] uses a time-varying structure, a neighborhood defined by connecting nodes that have similar subspaces of the input dimensions, and a local receptive field that is adjusted for each node as a function of its local variance. Both LARFDSSOM and SSSOM were developed to deal with high-dimensional data, considering different learning contexts. The latter carries the ability to perform not only clustering but also classification. Note that they work exactly the same when no labeled samples are available.

Label Propagation methods [7] are another approach to SSL. Essentially, they operate on proximity graphs or connected structures to spread and propagate class information to nearby nodes according to a similarity matrix. They are based on the assumption that nearby entities should belong to the same class, in contrast to far-away entities. Label Spreading [20] methods are very similar; the difference lies in modifications to the similarity matrix. The former uses the raw similarity matrix constructed from the data with no changes, whereas the latter minimizes a loss function with regularization properties, which often makes it more robust to noise.
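For concreteness, the sketch below illustrates this family of graph-based SSL methods using scikit-learn's LabelPropagation and LabelSpreading, which implement [7] and [20]. It is illustrative only, not part of the proposed method; the dataset, kernel choice, and hyperparameters are assumptions, not settings used in this paper.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.semi_supervised import LabelPropagation, LabelSpreading

X, y = load_digits(return_X_y=True)
rng = np.random.RandomState(0)

# Hide 90% of the labels: scikit-learn marks unlabeled samples with -1.
y_semi = y.copy()
unlabeled = rng.rand(len(y)) < 0.9
y_semi[unlabeled] = -1

for model in (LabelPropagation(kernel="knn", n_neighbors=7),
              LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)):
    model.fit(X, y_semi)  # propagates labels over the similarity graph
    acc = (model.transduction_[unlabeled] == y[unlabeled]).mean()
    print(type(model).__name__, "accuracy on unlabeled samples:", round(acc, 3))

Both models transduce labels to the unlabeled samples; the alpha parameter of LabelSpreading controls how much the propagated labels may deviate from the initial ones, which is the regularization discussed above.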

Furthermore, some state-of-the-art strategies for rejection options can be found in the literature [14]. Considering both local and global rejection, with the latter being the most common form, they can be divided into three distinct categories [11]: 1) probabilistic approaches; 2) turning deterministic measures into probabilities; and 3) deterministic approaches. In addition, some adaptive local thresholding techniques can also be found, as in [18].

Moreover, [16] provides a review of methods that work well for clustering problems in high-dimensional data. Considering this, and also the detailed comparison of subspace and projected clustering methods performed by [6], we highlight PROCLUS [25], DOC [26], and the LARFDSSOM models due to their good performance. DOC [26] is a cell-based method that searches for sets of grid cells containing more than a certain number of objects, using a Monte Carlo-based approach that computes, with high probability, a good approximation of an optimal projective cluster. PROCLUS [25] is a clustering-oriented algorithm that aims to find clusters in small projected subspaces by optimizing an objective function of the entire set of clusters, such as the number of clusters, average dimensionality, or other statistical properties.

IV Proposed Method

ALTSS-SOM (available at: https://github.com/phbraga/alt-sssom) is a SOM with Adaptive Local Thresholds [18] based on SSSOM. Hence, being based on SSSOM, ALTSS-SOM can also learn in a supervised or unsupervised way depending on the availability of labels, and it maintains the general characteristics of its predecessors. However, it introduces new supervised and unsupervised behaviors to allow better usage, and consequently a better understanding, of the data statistics. By doing this, ALTSS-SOM aims at overcoming the problems presented by SSSOM, such as the high sensitivity to parameters and the low sample efficiency. Additionally, with the proper changes, ALTSS-SOM targets better results for both classification and clustering tasks.

Therefore, the parameterized activation threshold used in both previous methods is replaced by an adaptive thresholding technique that takes the local variance into account, providing the model with the ability to learn the receptive field of each node. The objective is to estimate optimal local regions in the space with respect to the distribution of the input patterns for each node in the map. To do so, inspired by the Adam algorithm, a method for efficient stochastic optimization that only requires first-order gradients with little memory requirement [12], ALTSS-SOM updates exponential moving averages of the distances between each input pattern and the centroid of the nodes for each dimension (the distance vector in the algorithms). In SSSOM and LARFDSSOM, this estimate depended not only on the decay-rate parameter but also on the learning rate. ALTSS-SOM, however, uses solely the decay-rate parameter to control the exponential decay of the moving averages.

The moving averages themselves are estimates of the first moment (the mean) of the distances between the input patterns and the centroids of the nodes. Because of that, such means can be used as estimates of the uncentered variance of the nodes in each dimension. However, these moving averages are initialized as vectors of zeros, leading to moment estimates that are biased towards zero, especially during the initial steps and when the decay rate is small (i.e., close to 1) [12]. Still, according to [12], this initialization bias can be counteracted, resulting in a bias-corrected estimate. During the learning process, this bias-corrected estimate, together with the relevance vector, can be used as a reject option [13], determining whether or not an input pattern lies in the receptive field of a winner node.
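To make the bias correction concrete, the following sketch assumes the Adam-style form of [12], with a decay-rate parameter beta close to 1, node timestep t_j, distance vector delta_j, node center c_j, and input pattern x; the exact notation and constants of the original paper may differ.

% Assumed Adam-style update and bias correction (sketch, not the authors' exact equations):
\delta_j \leftarrow \beta\,\delta_j + (1-\beta)\,\lvert \mathbf{x} - \mathbf{c}_j \rvert
\qquad
\hat{\delta}_j = \frac{\delta_j}{1 - \beta^{\,t_j}}

With this form, the correction factor approaches 1 as t_j grows, so it only matters during the first updates of each node, exactly where the zero initialization biases delta_j towards zero.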

The overall operation of the map comprises three phases: 1) organization (Algorithm 1); 2) convergence; and 3) clustering or classification.

1 Initialize the parameters lp, age_wins, minwd, the decay rate, the learning rates, and the remaining parameters listed in Table I;
2 Initialize the map with one node whose center vector is set to the first input pattern, with its distance and corrected-distance vectors set to zero, its counters reset, and its class set to noClass (or to class(x) if a label is available);
3 Initialize the variable nwins ← 1;
4 for epoch ← 0 to the number of epochs do
5        Choose a random input pattern x; Compute the activation of all nodes (Eq. 2); Find the winner s with the highest activation (Eq. 1); if x has a label then
6               Run SupervisedMode(x, s) (Algorithm 5);
7        else
8               Run UnsupervisedMode(x, s) (Algorithm 4);
9        if nwins = age_wins then
10               Remove the nodes with fewer than lp · age_wins wins (Section IV-H); Update the connections of the remaining nodes; Reset the number of wins of the remaining nodes to 0; nwins ← 0;
11        nwins ← nwins + 1;
12 Run the Convergence Phase;
Algorithm 1 ALTSS-SOM

In the organization phase, the network is initialized, and the nodes start to compete to form clusters of randomly chosen input patterns. The first node of the map is created at the position of the first input pattern. As in SSSOM, there are two distinct ways to define the winner of a competition, to decide when a new node must be inserted, and to decide when the nodes need to be updated. However, in ALTSS-SOM, before a node is updated, it is necessary to decide whether the update will affect the whole node structure or just the weighted averages and the relevance vectors. If the class label of the input pattern is provided, this is done in the supervised mode (Section IV-G); otherwise, it is done in the unsupervised mode (Section IV-F).

The neighborhood of ALTSS-SOM is defined in the same way as in SSSOM. It determines the nodes that are adjusted together with the winner, thus outlining the cooperation step. The competition and cooperation steps are repeated for a limited number of epochs, and during this process, some nodes are removed periodically, according to a removal rule requiring that a node win at least a minimum number of competitions to remain in the map, as in SSSOM.

The convergence phase starts right after the organization process. In this phase, the nodes are also updated and removed when required, as in the first phase, but with a slight difference: no new nodes are inserted. Finally, when the convergence phase finishes, the map clusters and classifies input patterns. At this stage, as in SSSOM, there are three possible scenarios: 1) all of the nodes have a defined representing class; 2) a mixed scenario, with some nodes labeled and others not; and 3) none of the nodes labeled. The first scenario allows both classification and clustering tasks to be executed straightforwardly. The second adds one more step to the process, because if the most activated node does not have a defined class, the algorithm keeps searching for the next most activated node with a defined class. The last scenario only provides the ability to cluster.

IV-A Structure of the Nodes

In ALTSS-SOM, each node j in the map represents a cluster and is associated with four m-dimensional vectors, where m is the number of input dimensions. The first three vectors are the same as defined in [2]: the center vector, the relevance vector, and the distance vector, which stores moving averages of the observed distances between the input patterns x and the center vector for each dimension. Note, however, that in SSSOM and ALTSS-SOM the distance vector can be seen as a biased first-moment estimate. Because of that, ALTSS-SOM introduces a fourth vector, the bias-corrected first-moment estimate, which the algorithm computes to counteract the bias of the distance vector towards zero, especially at the initial steps. The corrected vector is used to compute the relevance vector, and both of them are used to approximate the variance of each node, taking into account how significant each dimension is. Such variance is used to define local reject options during the learning process every time a new input pattern is presented to the map.
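As an illustration only (not the authors' code), the per-node state described above could be represented as follows; the field names are hypothetical, and initializing the relevance vector to ones is an assumption borrowed from LARFDSSOM.

import numpy as np
from dataclasses import dataclass

@dataclass
class Node:
    c: np.ndarray          # center vector (m,)
    omega: np.ndarray      # relevance vector (m,)
    delta: np.ndarray      # biased moving average of |x - c| per dimension (m,)
    delta_hat: np.ndarray  # bias-corrected first-moment estimate (m,)
    wins: int = 0          # victories since the last reset
    t: int = 0             # number of updates, used for the bias correction
    label: int = -1        # -1 stands for "noClass"

def new_node(x: np.ndarray, label: int = -1) -> "Node":
    """Create a node centered at input pattern x, with zero-initialized moving averages."""
    m = x.shape[0]
    return Node(c=x.copy(), omega=np.ones(m),
                delta=np.zeros(m), delta_hat=np.zeros(m), label=label)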

IV-B Competition

ALTSS-SOM chooses the winner of a competition as the most activated node given an input pattern x, except in certain cases discussed in Section IV-G, when the label is available. In ALTSS-SOM, as in SSSOM, the most activated node s(x) is defined as per Eq. 1:

(1)

where the maximization is over the activation of each node j, computed by the activation function of Eq. 2 from the weighted distance (Eq. 3) and the relevance vector of node j.

As in SSSOM, the activation function in ALTSS-SOM is calculated according to a radial basis function with the receptive field adjusted as a function of the relevance vector of the node, as shown in Eq. 2:

(2)

where a small constant is added to avoid division by zero, and the distance is computed with the weighted distance function of Eq. 3:

(3)
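Since the rendered formulas were lost in extraction, the following is a hedged reconstruction of Eqs. (1)-(3), written to follow the forms used in LARFDSSOM/SSSOM [6, 2]; the exact expressions in the original paper may differ (in particular, the precise activation in Eq. (2) is an assumption).

% Hedged reconstruction (assumed forms, following LARFDSSOM/SSSOM [6, 2]):
s(\mathbf{x}) = \arg\max_{j}\; ac\!\left(D_{\omega}(\mathbf{x}, \mathbf{c}_j), \boldsymbol{\omega}_j\right)                                  % (1)
ac\!\left(D_{\omega}(\mathbf{x}, \mathbf{c}_j), \boldsymbol{\omega}_j\right) = \frac{1}{1 + D_{\omega}(\mathbf{x}, \mathbf{c}_j)\,/\,\big(\sum_{i}\omega_{ji} + \varepsilon\big)}   % (2)
D_{\omega}(\mathbf{x}, \mathbf{c}_j) = \sqrt{\textstyle\sum_{i}\,\omega_{ji}\,(x_i - c_{ji})^2}                                              % (3)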

IV-C Estimating Bias-Corrected Moving Averages

In ALTSS-SOM, the procedure that updates the distance vectors, as well as the relevance vectors, is shown in Algorithm 2.

Input: Input pattern x, Node s
1 Function UpdateRelevances(x, s):
2        Increment the timestep of s; Update the distance vector (Eq. 4); Update the corrected distance vector (Eq. 5); Update the relevance vector (Eq. 6);
Algorithm 2 Update Relevances

The distance vectors are initialized as vectors of zeros and updated through a moving average of the observed distances between the input pattern and the current center vector, as per Eq. 4:

(4)

where the decay-rate parameter controls the rate of change of the moving average (i.e., the exponential decay rate), and the absolute value is applied element-wise to the vectors.

In order to correct the bias of the moving averages towards zero at the initial timesteps, caused by their initialization with zeros, ALTSS-SOM divides them by a correction term that depends on the current timestep of each node j, as in the Adam algorithm [12]. In sum, the bias-corrected moving-average vectors are updated at every node timestep according to Eq. 5:

(5)

To obtain accurate information about the relevance of each dimension for a given node, an update of the relevance vectors must follow every moving-average update. The relevance is calculated by an inverse logistic function of the bias-corrected estimated distances, as per Eq. 6:

(6)

where the maximum, the minimum, and the mean of the components of the bias-corrected moving-average vector are used, and a smoothness parameter controls the slope of the logistic function [6]. This function is very similar to the one used in SSSOM; however, instead of using the biased moving average, ALTSS-SOM uses the bias-corrected estimate in order to obtain a more accurate, unbiased value.
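Putting Eqs. 4-6 together, the sketch below gives one possible reading of Algorithm 2 in Python, assuming the Adam-style convention used earlier (decay rate beta close to 1) and the inverse-logistic relevance of LARFDSSOM [6]; the names and exact forms are assumptions, not the authors' code.

import numpy as np

def update_relevances(x: np.ndarray, node, beta: float, s_slope: float) -> None:
    """Sketch of Algorithm 2: update distance, corrected distance, and relevance vectors."""
    node.t += 1
    # Eq. 4 (assumed form): exponential moving average of per-dimension distances.
    node.delta = beta * node.delta + (1.0 - beta) * np.abs(x - node.c)
    # Eq. 5 (assumed form): Adam-style correction of the zero-initialized average.
    node.delta_hat = node.delta / (1.0 - beta ** node.t)
    # Eq. 6 (assumed form): inverse logistic of the corrected distances.
    d_max, d_min, d_mean = node.delta_hat.max(), node.delta_hat.min(), node.delta_hat.mean()
    if np.isclose(d_max, d_min):
        node.omega = np.ones_like(node.delta_hat)   # all dimensions equally relevant
    else:
        node.omega = 1.0 / (1.0 + np.exp((node.delta_hat - d_mean) / (s_slope * (d_max - d_min))))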

IV-D Local Thresholds

The corrected distance vectors represent the bias-corrected moving averages of the observed distances between the input patterns and the center vector of each node j in the map. As a result, they can be regarded as estimates of the variances of the nodes, as stated before.

In addition, the relevance vectors express how important each dimension is for each node, which indicates the subspaces of the input dimensions of a given dataset. This information contributes to the definition of a local threshold, together with the estimated variance given by the corrected distance vectors.

Combining them, it is possible to define a local region around each node center that acts as a reject option for some input patterns. If only the variances were used, unimportant dimensions with low variance could misguide the process when a similar input pattern lies outside the acceptance region of a node j only in dimensions that are not relevant to it. Therefore, a relaxed variance is defined to act as a local threshold and rejection option to mitigate such problems:

(7)

When a dimension has a high relevance for the node, it will not impact its variance value. However, when a dimension has a small relevance, ALTSS-SOM relaxes the constraints to allow a better definition of the subspaces. Therefore, the general acceptance rule is defined by Eq. 8, where the idea is to approximate an optimal rule.

(8)

where x and the center vector are, respectively, the input pattern and the center of the node, and the relaxed variance vector is computed as per Eq. 7.
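The exact relaxation of Eq. 7 is not reproduced here; the sketch below only illustrates the stated idea (inflating the threshold of low-relevance dimensions so they cannot cause a rejection on their own), using a hypothetical relaxation factor that is not the authors' formula.

import numpy as np

def relaxed_variance(delta_hat: np.ndarray, omega: np.ndarray) -> np.ndarray:
    # Hypothetical relaxation: high-relevance dimensions keep roughly delta_hat,
    # low-relevance dimensions get a larger (relaxed) threshold.
    return delta_hat * (2.0 - omega)

def accepts(x: np.ndarray, node) -> bool:
    # Assumed reading of Eq. 8: accept x if it falls within the relaxed variance
    # region around the node center in every dimension.
    return bool(np.all(np.abs(x - node.c) <= relaxed_variance(node.delta_hat, node.omega)))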

IV-E Node Update

As in LARFDSSOM, ALTSS-SOM updates the winner node and its neighbors using two distinct learning rates, one for the winner and one for its neighbors. Algorithm 3 shows how the whole update occurs.

Input: Input pattern x, Node s, Learning rate lr
1 Function UpdateNode(x, s, lr):
2        UpdateRelevances(x, s) (Algorithm 2); Update the center vector (Eq. 9);
Algorithm 3 Update Node

The relevances and weighted moving averages are updated as shown in Section IV-C, and the center vector, given a learning rate lr, is updated as follows:

(9)
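A hedged reconstruction of Eq. (9), assuming the standard SOM-style center update used by LARFDSSOM [6], where lr is the winner or neighbor learning rate depending on the node being updated:

% Assumed form of Eq. (9):
\mathbf{c}_j \leftarrow \mathbf{c}_j + lr\,\left(\mathbf{x} - \mathbf{c}_j\right)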

IV-F Unsupervised Mode

Given an unlabeled input pattern, the most activated node is considered the winner, regardless of class labels. ALTSS-SOM then verifies whether the condition expressed by Eq. 8 is satisfied. If so, the winner and its neighbors are updated towards the input pattern. Otherwise, a new node is inserted into the map at the input pattern position. However, the original winner still improves its knowledge about the region where it is located by updating its moving averages and relevances, but not its center. This mechanism gives the nodes the ability to learn about the region in which they are inserted. An additional case is handled when the map has reached the maximum number of nodes. In this case, aiming not to lose the information that the input pattern can provide, as in previous models, and to improve sample efficiency, ALTSS-SOM updates the moving-average and relevance vectors of the winning node. Algorithm 4 illustrates this procedure.

Input: Input pattern x and the first winner s;
1 if the acceptance rule (Eq. 8) is satisfied for s and the map is not full then
2        Update the winner node and its neighbors: UpdateNode(x, s, winner learning rate), UpdateNode(x, neighbors(s), neighbor learning rate) (Algorithm 3); Increment the win count of s;
3 else if the acceptance rule (Eq. 8) is not satisfied then
4        Create a new node j at the position of x, with its vectors and counters initialized as in Algorithm 1 and its class set to noClass; Connect j to its neighbors; Update the relevance vectors of s: UpdateRelevances(x, s) (Algorithm 2);
5 else
6        Update the relevance vectors of s: UpdateRelevances(x, s) (Algorithm 2);
Algorithm 4 Unsupervised Mode

IV-G Supervised Mode

Algorithm 5 shows how the supervised procedure is conducted. In this procedure, unlike the unsupervised mode, the labels are taken into account when looking for a winner. If the most activated node has the same class as the input pattern, or a class not yet defined, an approach very similar to the unsupervised mode is applied. The difference is that the class of the winner is set to the class of the given input pattern x, and its connections are updated. Otherwise, ALTSS-SOM tries to find a new winner with the same class as the input pattern or with a class not yet defined.

If another node takes the place of the original winner as a new winner, the acceptance criterion expressed by Eq. 8 is verified. If it is satisfied and the map is not full, the new winner and its neighbors are updated. Otherwise, only the moving averages and relevance vector of the new winner are updated, giving it the same chance received by the original winner to improve its knowledge about the surrounding area.

If there is no node to replace the original winner, and the map is not full, the winner is duplicated, preserving its moving-average vectors, its center vector, and its relevance vector. However, the class of the new, duplicated node is set to the class of the input pattern. The other parameters are set as usual. If none of the above conditions are fulfilled, ALTSS-SOM solely updates the moving averages and relevance vector of the first defined winner.

Input: Input pattern x and the first winner s;
1 if class(s) = class(x) or class(s) = noClass then
2        if the acceptance rule (Eq. 8) is not satisfied and the map is not full then
3               Create a new node j at the position of x, with its vectors and counters initialized as in Algorithm 1 and its class set to class(x); Connect j to its neighbors; Update the relevance vectors of s: UpdateRelevances(x, s) (Algorithm 2);
4        else if the acceptance rule (Eq. 8) is satisfied then
5               Update the winner node and its neighbors: UpdateNode(x, s, winner learning rate), UpdateNode(x, neighbors(s), neighbor learning rate) (Algorithm 3); Set class(s) ← class(x); Update connections; Increment the win count of s;
6        else
7               Update the relevance vectors of s: UpdateRelevances(x, s)
8 else
9        Try to find a new winner as the next node with the highest activation whose class is noClass or the same as class(x); if such a winner exists then
10               if the acceptance rule (Eq. 8) is satisfied for the new winner and the map is not full then
11                      Update the new winner node and its neighbors: UpdateNode(x, new winner, winner learning rate) and UpdateNode(x, neighbors(new winner), neighbor learning rate) (Algorithm 3);
12               else
13                      Update the relevance vectors of the new winner: UpdateRelevances(x, new winner)
14               Increment the win count of the new winner;
15        else if the map is not full then
16               Create a new node j by duplicating s, preserving its center, moving-average, and relevance vectors, with its win count initialized as usual and its class set to class(x); Connect j to its neighbors;
17        else
18               Update the relevance vectors of s: UpdateRelevances(x, s)
Algorithm 5 Supervised Mode

IV-H Node Removal

In ALTSS-SOM, as in SSSOM and LARFDSSOM, each node j stores a variable that counts its number of victories since the last reset. Whenever nwins reaches the age_wins value, a reset occurs: nodes that did not win at least the minimum percentage of the competitions are removed from the map. After a reset, the number of victories of the remaining nodes is set back to zero. Finally, to avoid the removal of recently created nodes, when a new node is inserted, its number of wins is initialized as a function of nwins rather than zero, where nwins indicates the number of competitions that have occurred since the last reset.
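A minimal sketch of this reset step, assuming a LARFDSSOM-style removal threshold of lp · age_wins; the exact threshold and the initialization of new-node wins are assumptions, not the authors' code.

def maybe_reset(nodes: list, nwins: int, age_wins: int, lp: float) -> int:
    """Periodically remove under-performing nodes and reset win counters (sketch)."""
    if nwins >= age_wins:
        # Assumed threshold: nodes must have won at least lp * age_wins competitions.
        nodes[:] = [n for n in nodes if n.wins >= lp * age_wins]
        for n in nodes:
            n.wins = 0          # reset the remaining nodes
        return 0                # nwins restarts from zero
    return nwins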

IV-I Parameters Summary

ALTSS-SOM removes two parameters from its predecessor, SSSOM. First, the activation threshold parameter, which has a great impact on the results, as shown by [6]. It was replaced by the adaptive local threshold technique introduced by ALTSS-SOM (Section IV-D), which can define and learn, during training, the region of the space that a node can represent. Second, another parameter was removed because it became irrelevant to the learning process once the activation threshold was gone; as a result, ALTSS-SOM has nine parameters to be set. More precisely, a sensitivity analysis revealed that no remaining parameter has a high impact on the results. This method seeks to establish a good level of self-adjustment, so that the parameter values can be kept fixed inside predefined ranges.

V Experiments

The experiments are divided into three parts. First, in order to evaluate the classification performance of ALTSS-SOM, we replicated the experiments conducted in [2], adding the proposed method to the comparison. Second, we compare the performance of DOC, PROCLUS, LARFDSSOM/SSSOM, and ALTSS-SOM. Note that the first two methods originally come from the data mining area. This choice of comparison methods takes into consideration the analysis provided by [6], where LARFDSSOM presented the best results overall, and DOC and PROCLUS appeared as the two best options on average among subspace approaches across a range of data mining applications. Also, we refer to LARFDSSOM and SSSOM together because they are equivalent when used solely for clustering tasks. Third, we performed a sensitivity analysis to show that, with ALTSS-SOM, the same range of parameters works well for a variety of datasets, whereas no such range exists for the previous methods (LARFDSSOM and SSSOM).

For all experiments, the seven real-world datasets provided by the OpenSubspace framework [27] were used, rescaled to the [0, 1] interval. Section V-A presents the experimental setup. Next, in Section V-B, we present the results and the analysis necessary to support the conclusions drawn.

V-A Experimental Setup

For the first set of experiments, on every dataset, we used 3-times 3-fold cross-validation to measure the classification rate, as in [2]. Still following [2], to study the effect of different percentages of labeled data, the semi-supervised methods were trained with 1%, 5%, 10%, 25%, 50%, 75%, and 100% of the labels. In the second group of experiments, we chose the Clustering Error (CE) metric, as in [6], to evaluate the clustering assignments. For that, we considered all dimensions as relevant in the target clusterings used to calculate the CE. Also as in [6], we considered a projected clustering problem, where each sample is assigned to a single cluster. Finally, we perform a sensitivity analysis with LARFDSSOM and ALTSS-SOM to elucidate the gains obtained in terms of how much each parameter influences the final result, establishing ALTSS-SOM as a more robust and self-controlled model.

For all the experiments, we sampled the parameter ranges according to Latin Hypercube Sampling (LHS) [28], which guarantees full coverage of the range of each parameter. In this sense, we gathered 500 different parameter settings, i.e., the range of each parameter was divided into 500 intervals of equal probability, and a single value was randomly selected from each interval [28]. The ranges used for ALTSS-SOM are defined in Table I, whereas the ranges of the other methods were the same as those used in [6, 2]. We set the maximum number of nodes for ALTSS-SOM to 200. A small sketch of this sampling procedure is given after Table I.

Parameters min max
Lowest cluster percentage (lp) 0.001 0.002
Relevance rate () 0.90 0.95
Max competitions ()
Winner learning rate () 0.001 0.2
Neighbors learning rate ()
Relevance smoothness () 0.01 0.1
Connection threshold () 0 0.5
Number of epochs () 1 100
  • * S is the number of input patterns in the dataset.

TABLE I: Parameter Ranges for ALTSS-SOM
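As an illustration of the sampling procedure (not the authors' code), the sketch below draws 500 Latin Hypercube settings over the ranges of Table I using SciPy's QMC module; the dictionary keys are hypothetical parameter names, and the parameters whose ranges are missing from Table I are omitted.

from scipy.stats import qmc

ranges = {                       # (min, max) taken from Table I
    "lp":     (0.001, 0.002),    # lowest cluster percentage
    "beta":   (0.90, 0.95),      # relevance rate (name assumed)
    "e_b":    (0.001, 0.2),      # winner learning rate (name assumed)
    "s":      (0.01, 0.1),       # relevance smoothness (name assumed)
    "minwd":  (0.0, 0.5),        # connection threshold
    "epochs": (1, 100),          # number of epochs (rounded to integers in practice)
}
sampler = qmc.LatinHypercube(d=len(ranges), seed=0)
unit = sampler.random(n=500)                             # 500 settings in the unit hypercube
lows = [lo for lo, _ in ranges.values()]
highs = [hi for _, hi in ranges.values()]
settings = qmc.scale(unit, lows, highs)                  # rescale to the Table I ranges
print(settings.shape)                                    # (500, 6)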

V-B Experimental Results and Analysis

Fig. 1 shows the results of ALTSS-SOM in comparison with SSSOM, Label Propagation, and Label Spreading on the real-world datasets. The results are shown as a function of the percentage of labels used. Overall, ALTSS-SOM improved on the performance of SSSOM, except for the Diabetes dataset (Fig. 1(b)), where the results obtained were slightly worse but still comparable. The flexibility provided by the estimation of the representation region allowed such results. Moreover, the standard deviation for all datasets at all supervision levels was also reduced, which indicates another positive aspect of the proposed method: it is more robust to variations in both datasets and parameters. The other semi-supervised methods surpassed SSSOM on the Pendigits and Vowel datasets; however, ALTSS-SOM achieved a consistent improvement by outperforming both Label Propagation and Label Spreading in those cases. In all other situations of this experiment, ALTSS-SOM outperformed the compared models.

Fig. 1: Best mean accuracy and standard deviation as a function of the percentage of supervision for each dataset: (a) Breast, (b) Diabetes, (c) Glass, (d) Shape, (e) Pendigits, (f) Vowel.

Second, Table II shows the CE results obtained with the compared methods. No method achieved the best result on all real-world datasets. Even though ALTSS-SOM presented an overall result better than all the others, it achieved the same result as DOC and LARFDSSOM on the Breast dataset. On the Glass dataset, again, it matched the result of LARFDSSOM. Also, ALTSS-SOM was not the best on Diabetes, a behavior similar to the one observed in the first set of experiments, which considered the classification rate. In a general comparison, ALTSS-SOM presents the best results on average. It is worth mentioning that the similarity between the results of ALTSS-SOM and LARFDSSOM can be attributed to the absence of labeled noise samples in the datasets, to the lack of information about the irrelevant dimensions, and to the intrinsic characteristics inherited by ALTSS-SOM from LARFDSSOM. Also, since DOC does not have a direct way to control the number of clusters, it has some difficulty finding the correct value. Moreover, PROCLUS presented good results when the parameter controlling the number of clusters was set close to the optimum. The good results obtained by LARFDSSOM are directly related to an excellent choice of its activation threshold and lp parameters, which significantly impact its results. ALTSS-SOM achieves good results without needing a highly accurate definition of parameters, and it is not necessary to define the number of clusters a priori due to its time-varying structure.

CE Breast Diabetes Glass Liver Pendigits Shape Vowel Average STD
DOC 0.763 0.654 0.439 0.580 0.566 0.419 0.142 0.509 0.201
PROCLUS 0.702 0.647 0.528 0.565 0.615 0.706 0.253 0.574 0.156
LARFDSSOM 0.763 0.727 0.575 0.580 0.737 0.719 0.317 0.631 0.158
ALTSS-SOM 0.763 0.697 0.575 0.603 0.741 0.738 0.319 0.633 0.156
TABLE II: CE Results for Real-World Datasets. Best results for each dataset are shown in bold

Third, Fig. 2 shows scatter plots of the accuracy as a function of the parameter lp for the datasets trained with 50% of labels, to illustrate a scenario where both forms of learning impact the outcome. Note that, for all datasets, lp did not show a significant impact on the results. The linear fit to the data, represented by the red line, highlights this by not exhibiting any trend. It is also worth mentioning that the plots in Fig. 2 combine each parameter value with each cross-validation set. In previous versions, the two most critical parameters were the activation threshold and lp. The activation threshold played a role of great importance because even a small change in its value had a high impact on the results. In ALTSS-SOM, lp is the most important parameter because it most clearly defines the behavior of the algorithm; even so, it does not significantly impact the results.

Fig. 2: Scatter plots of the accuracy obtained with ALTSS-SOM as a function of its parameter lp for the datasets trained with 50% of labels: (a) Breast, (b) Diabetes, (c) Pendigits, (d) Vowel.

Fig. 3 also shows scatter plots for another parameter. Here, we selected the Breast and Pendigits datasets to illustrate how the parameter ranges were chosen. We first started with a wide range, from 0.001 to 0.4 (Fig. 3(a) and Fig. 3(b)). However, the linear fit was mostly horizontal, once again not indicating any trend. We then shrank the range to 0.001 to 0.2 and reran the experiments. The results were as expected, remaining stable, as shown in Fig. 3(c) and Fig. 3(d). These experiments show how robust the model is to parameter changes, which is why we kept the same scale in the plots. The two parameters studied in these figures were chosen due to their semantic importance; the other parameters presented similar behavior, with none of them playing the role that the activation threshold and lp played in the aforementioned versions.

Fig. 3: Scatter plots of the accuracy obtained with ALTSS-SOM as a function of one of its parameters for the datasets trained with 50% of labels, illustrating how the parameter ranges were defined: (a) Breast, (b) Pendigits (wide range); (c) Breast, (d) Pendigits (narrowed range).

VI Conclusion and Future Work

This paper presented our second approach to semi-supervised self-organizing maps applied to clustering and classification tasks. ALTSS-SOM was shown to improve on its predecessor, SSSOM, not only in terms of classification rate but also in clustering performance, consolidating its position as a good choice in situations where only a small portion of the data is labeled. The clustering task also achieved a significant improvement with the proposed changes. ALTSS-SOM reduced the number of parameters by two while improving the performance measured by both classification and clustering metrics.

Also, probably one of the most important contributions of this paper is the parametric robustness shown by the third and final experiment. It is of great importance to reduce the dependency of models on specific parameter values and the resulting variability, and ALTSS-SOM achieves this.

The use of a relaxed estimated variance region allows the method to better explore the information available in the data, improving sample efficiency. Also, the modifications proposed in ALTSS-SOM allow the model not merely to discard data in certain cases but to keep exploiting its characteristics in order to establish a better estimate of its statistics.

Finally, we leave for future work some promising directions: defining a better stopping criterion, using the unsupervised error to build a model with more than one layer, and developing a hierarchical approach to better exploit the data statistics.

Acknowledgments

The authors would like to thank CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), Brazil, for supporting this research study, and FACEPE (Fundação de Amparo à Ciência e Tecnologia do Estado de Pernambuco), Brazil, for financial support on the project #APQ-0880-1.03/14.

References

  • [1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
  • [2] P. H. Braga and H. F. Bassani, “A semi-supervised self-organizing map for clustering and classification,” in 2018 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2018, pp. 1–8.
  • [3] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning,” IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.
  • [4] F. Schwenker and E. Trentin, “Pattern classification and clustering: A review of partially supervised learning approaches,” Pattern Recognition Letters, vol. 37, pp. 4–14, 2014.
  • [5] T. Kohonen, “The self-organizing map,” Proceedings of the IEEE, vol. 78, no. 9, pp. 1464–1480, 1990.
  • [6] H. F. Bassani and A. F. Araujo, “Dimension selective self-organizing maps with time-varying structure for subspace and projected clustering,” IEEE transactions on neural networks and learning systems, vol. 26, no. 3, pp. 458–471, 2015.
  • [7] Z. Xiaojin and G. Zoubin, “Learning from labeled and unlabeled data with label propagation,” Tech. Rep., Technical Report CMU-CALD-02–107, Carnegie Mellon University, 2002.
  • [8] H. Dozono, G. Niina, and S. Araki, “Convolutional self organizing map,” in CSCI.   IEEE, 2016, pp. 767–771.
  • [9] L. Chen, S. Yu, and M. Yang, “Semi-supervised convolutional neural networks with label propagation for image classification,” in 2018 24th International Conference on Pattern Recognition (ICPR).   IEEE, 2018, pp. 1319–1324.
  • [10] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, “Semi-supervised learning with ladder networks,” in Advances in Neural Information Processing Systems, 2015, pp. 3546–3554.
  • [11] L. Fischer, B. Hammer, and H. Wersing, “Optimal local rejection for classifiers,” Neurocomputing, vol. 214, pp. 445–457, 2016.
  • [12] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [13] C. Chow, “On optimum recognition error and reject tradeoff,” IEEE Transactions on information theory, vol. 16, no. 1, pp. 41–46, 1970.
  • [14] L. Fischer, B. Hammer, and H. Wersing, “Rejection strategies for learning vector quantization.” in ESANN, 2014.
  • [15] M. Köppen, “The curse of dimensionality,” in 5th Online World Conference on Soft Computing in Industrial Applications, 2000, pp. 4–8.
  • [16] H.-P. Kriegel, P. Kröger, and A. Zimek, “Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering,” ACM Transactions on Knowledge Discovery from Data, vol. 3, no. 1, p. 1, 2009.
  • [17] A. Vailaya and A. Jain, “Reject option for vq-based bayesian classification,” in ICPR.   IEEE, 2000, p. 2048.
  • [18] X. Jiang and D. Mojon, “Adaptive local thresholding by verification-based multithreshold probing with application to vessel detection in retinal images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 1, pp. 131–137, 2003.
  • [19] T. R. Singh, S. Roy, O. I. Singh, T. Sinam, K. Singh et al., “A new local adaptive thresholding technique in binarization,” arXiv preprint arXiv:1201.5227, 2012.
  • [20] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, 2004, pp. 321–328.
  • [21] Z. Hailat, A. Komarichev, and X. Chen, “Deep semi-supervised learning,” in 2018 24th International Conference on Pattern Recognition (ICPR), Aug 2018, pp. 2154–2159.
  • [22] N. Liu, J. Wang, and Y. Gong, “Deep self-organizing map for visual classification,” in Neural Networks (IJCNN), 2015 International Joint Conference on.   IEEE, 2015, pp. 1–6.
  • [23] S. Basu, A. Banerjee, and R. Mooney, “Semi-supervised clustering by seeding,” in In Proceedings of 19th International Conference on Machine Learning, 2002.
  • [24] A. K. Jain, “Data clustering: 50 years beyond k-means,” Pattern recognition letters, vol. 31, no. 8, pp. 651–666, 2010.
  • [25] C. C. Aggarwal, J. L. Wolf, P. S. Yu, C. Procopiuc, and J. S. Park, “Fast algorithms for projected clustering,” in ACM SIGMoD Record, vol. 28, no. 2.   ACM, 1999, pp. 61–72.
  • [26] C. M. Procopiuc, M. Jones, P. K. Agarwal, and T. Murali, “A monte carlo algorithm for fast projective clustering,” in Proceedings of the 2002 ACM SIGMOD international conference on Management of data.   ACM, 2002, pp. 418–427.
  • [27] E. Müller, S. Günnemann, I. Assent, and T. Seidl, “Evaluating clustering in subspace projections of high dimensional data,” Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 1270–1281, 2009.
  • [28] J. C. Helton, F. Davis, and J. D. Johnson, “A comparison of uncertainty and sensitivity analysis results obtained with random and latin hypercube sampling,” Reliability Engineering & System Safety, vol. 89, no. 3, pp. 305–330, 2005.