An Efficient and Adaptive Granular-ball Generation Method in Classification Problem

Granular-ball computing is an efficient, robust, and scalable learning method for granular computing. The basis of granular-ball computing is the granular-ball generation method. This paper proposes a method for accelerating granular-ball generation that uses a single division in place of k-means. It greatly improves the efficiency of granular-ball generation while keeping accuracy similar to that of the existing method. In addition, a new adaptive method for granular-ball generation is proposed that takes into account the elimination of overlap between granular-balls and some other factors. This makes the granular-ball generation process parameter-free and completely adaptive in the true sense. This paper is also the first to provide mathematical models for granular-ball covering. Experimental results on some real data sets demonstrate that the two proposed granular-ball generation methods achieve accuracies similar to those of the existing method while realizing adaptiveness or acceleration.


I Introduction

Cognitive computing, combined with the human cognitive mechanism, makes the decision-making process more reliable, efficient and understandable. It is an important means of achieving reliable governance of information space and an important direction for the development of artificial intelligence. Academician Chen's research results, published in Science in 1982, pointed out that human cognition is characterized by large-scale priority [chen1982topological]. As shown in Fig. 1, the large outline letters are seen first, followed by the specific small letters within the outline letters. Based on this cognitive characteristic, granular computing can achieve efficient, scalable and robust learning processes. Zadeh, a famous American cybernetics expert, put forward the problem of information granulation and the concept of granular computing [zadeh1979fuzzy, zadeh1997toward]. After decades of continuous research by scholars at home and abroad, fuzzy sets [4backer1981clustering, 5kosko1986counting, 6zhang2020multi, 7liu2019combinatorial] and rough sets [8xia2020gbnrs, 9liang2012group, 10qian2010positive, 11hu2017large] have been developed. Academician Bo Zhang proposed the quotient space theory [12zhang2004quotient, 13ling2003theory], academician Deyi Li proposed the cloud model theory [14li1998uncertainty, 15li2009new], and Guoyin Wang and Shuyin Xia developed granular-ball computing [16xia2020fast, 17xia2019granular] and other models and methods.

Fig. 1: Human cognition - coarse grained large range is preferred.

In fuzzy sets, data elements are described by membership degrees over fuzzy information of different granularity. Rough set theory and quotient space theory use equivalence relations and equivalence classes to construct granules of different sizes. The universe of rough set theory is a point set of objects, and the topological relations between elements are not considered. Quotient space theory, in contrast, is studied under the condition that there are topological relations between the elements of the universe. The quotient space and the cloud model are also two important granular computing methods. Among these methods, rough sets have been the most widely studied, and many scholars have done outstanding work in this field. Guoyin Wang found the difference between the algebraic form and the information entropy form of rough sets [18wang2003rough]. Duoqian Miao and others found the equivalence between fuzzy soft sets and fuzzy information systems [19pei2005soft]. Yuhua Qian, Jiye Liang and others reduced the amount of positive region computation under the combinatorial explosion of attributes by means of positive region reduction and core attributes [10qian2010positive]. Zeshui Xu and others established the topological structure of the covering rough set model [20xu2005properties]. Yiyu Yao pointed out that label noise has an obvious interference effect on the calculation of the upper and lower approximations [21yao2007decision]. Qinghua Hu and others designed a robust classification algorithm based on the fuzzy lower approximation [22hu2011robust], and proposed a data-distribution-aware fuzzy rough set model that incorporates data distribution information into the calculation of the fuzzy approximations [23an2015data]. Weizhi Wu and others proposed a multi-scale decision table [24wu2011theory]. Tianrui Li put forward a rough-set-based incremental method for updating approximations in dynamic maintenance environments [25chen2011rough]. Shuyin Xia and Guoyin Wang proposed a parameter-free rough set method that can process continuous data without relying on a membership function [8xia2020gbnrs].

In granular computing, the larger the granularity, the higher the efficiency and the better the robustness to noise; however, details are more likely to be neglected and accuracy lost. The smaller the granularity, the more attention is paid to details, but efficiency may drop and robustness to noise may deteriorate. Selecting different granularities for different scenarios allows multi-granularity learning methods to perform better. Although multi-granularity computing has a long research history, as a cognitive computing science it also faces some new challenges and needs new development. Consider, for example, the classifier, one of the most widely used methods in artificial intelligence. As shown in Fig. 2(a), the input of most existing classifiers is the finest-grained sample points or pixels [26salehi2015synergistic, 27cover1967nearest, 28loh2011classification], so coarse-grained characterization is lacking. Some researchers have proposed classification algorithms based on multi-granularity ideas. For example, Dick S and others assumed that connection weights are all linguistic variables with different granulations [29dick2001granular]. Weight updating is realized by adding "linguistic hedges", but this method sacrifices some accuracy. Leite D, Gupta MM, Wang FY and others used fuzzy neurons to build interpretable multi-size local models [30leite2013evolving, 31gupta1990fuzzy, 32wang1995implementing], which can learn fuzzy rules; the output space can be processed as membership information, which can be used to process fuzzy data. In a few cases in machine learning and data mining, such as [33syeda2002parallel], fuzzy rules are extracted for credit card fraud detection. The purpose of these works is to use neural networks to process fuzzy data and apply them to fuzzy control. They do not use multi-granularity ideas to improve the scalability, efficiency, or robustness of the classifier, and in essence they remain point-input methods. Park HS and others constructed information granules by feature selection in the input space, which is essentially a feature preprocessing method that does not change the learning mode of the neural network [34park2009granular]. Tang Y and others introduced a method of sampling and mapping information granules in a support vector machine [35tang2012granular]. The works [34park2009granular, 35tang2012granular] focus more on interpreting some existing research through the concept of multi-granularity. Pedrycz W and others systematically proposed a neural network granulation framework covering the input and output layers [36pedrycz2001granular], in which both rough set and fuzzy methods can granulate the input space. However, this work does not realize a specific multi-granularity neural network or examine its performance advantages. Therefore, how to implement the multi-granularity classifier shown in Fig. 2(b) is an important challenge. In the multi-granularity classifier in Fig. 2(b), the input is no longer the finest-grained point, but a universal feature with adjustable granularity. The design of this universal feature should offer high-dimensional scalability, that is, it should require no complicated calculations in high-dimensional space; otherwise high-dimensional problems cannot be handled. For this reason, Guoyin Wang and Shuyin Xia proposed using a ball as the "granule" to represent this universal feature and proposed a granular-ball computing method [17xia2019granular]. The reason is that the geometry of a ball is completely symmetrical, and only two quantities are needed to characterize it in any dimension: the center and the radius. It is therefore convenient to apply to high-dimensional data. At the same time, they also proposed an efficient and adaptive method to generate granular-balls.

Fig. 2: Comparison between the multi-granularity classifier and the traditional classifier. (a) Traditional classifier method; (b) Classifier method with coarse-grained input.

Further, granular-ball computing was introduced into classifiers: the framework of the granular-ball computing classifier was proposed, the original model of the granular-ball support vector machine (GBSVM) was derived, and the k-nearest neighbor algorithm of the granular-ball (GBNN) was proposed [17xia2019granular]. The efficiency of GBNN is hundreds of times higher than that of the existing kNN algorithm, especially on large-scale data. In addition, GBNN does not need to select the parameter $k$, and it helps to alleviate the performance degradation on unbalanced data, which existing k-nearest neighbor algorithms cannot do. Due to the robustness of granular-ball computing, GBNN has higher accuracy than exact kNN on many data sets. In addition, granular-ball computing was introduced into the neighborhood rough set, and a new rough set method, the "granular-ball neighborhood rough set (GBNRS)", was developed [8xia2020gbnrs]. GBNRS is the first parameter-free rough set algorithm that processes continuous data without prior knowledge (i.e., without setting a membership function), and it is more efficient than NRS. Since GBNRS can adaptively select the neighborhood radius, it can also obtain higher classification accuracy than NRS in many cases. In addition, granular-ball computing was introduced into the k-means algorithm, and a simple and fast k-means clustering method, "ball k-means", was developed [16xia2020fast]. Ball k-means is dozens of times more efficient than similar algorithms, especially in the challenging large-$k$ clustering problem. Granular-ball computing is efficient, robust and scalable [17xia2019granular]. However, there are still many challenges in granular-ball generation, such as the optimization of the purity threshold and the improvement of its efficiency. The main contributions of this paper are as follows:

  1. An acceleration granular-ball generation method is proposed that uses a single division in place of k-means. It accelerates granular-ball generation by several times to dozens of times while achieving similar accuracy.

  2. A new adaptive method for granular-ball generation is proposed that takes into account the elimination of overlap between granular-balls and some other factors. This makes the granular-ball generation process parameter-free and completely adaptive in the true sense.

  3. This paper is the first to provide mathematical models for granular-ball covering.

II Related Work

II-A Granular-ball Computing

Building on the theoretical basis of traditional granular computing and on Chen's result, published in Science in 1982, that "human cognition has the characteristics of large-scale priority" [chen1982topological], Wang put forward multi-granularity cognitive computing [56]. Based on it, granular-ball computing is a new, efficient and robust granular computing method proposed by Xia and Wang [17xia2019granular], the core idea of which is to use "granular-balls" to cover or partially cover the sample space. A granular-ball is written as $GB = \{x_i, i = 1, \dots, N\}$, where $x_i$ represents the objects in $GB$ and $N$ is the number of objects in $GB$. The center $c$ and radius $r$ of $GB$ are respectively represented as follows:

$$c = \frac{1}{N} \sum_{i=1}^{N} x_i, \tag{1}$$
$$r = \frac{1}{N} \sum_{i=1}^{N} \lVert x_i - c \rVert. \tag{2}$$

This means that the radius is equal to the average distance from all objects in the granular-ball to its center. The radius can also be set to the maximum distance. The "granular-ball" with its center and radius is used as the input of the learning method or as an accurate measurement to represent the sample space, which achieves multi-granularity learning characteristics (that is, scalability, multiple scales, etc.) and an accurate characterization of the sample space. The basic process of granular-ball generation for classification problems in granular-ball computing is shown in Fig. 3.
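The two definitions above can be illustrated with a minimal sketch. It assumes that the center is the coordinate-wise mean of the objects and that distances are Euclidean; the function name `center_and_radius` and the `use_max` flag are hypothetical and introduced only for illustration.

```python
import math

def center_and_radius(points, use_max=False):
    """Return (center, radius) of a granular-ball given as a list of points."""
    n = len(points)
    dim = len(points[0])
    # Center: coordinate-wise mean of the objects in the ball.
    center = [sum(p[j] for p in points) / n for j in range(dim)]
    # Euclidean distance from every object to the center.
    dists = [math.dist(p, center) for p in points]
    # Radius: average distance, or the maximum distance if requested.
    radius = max(dists) if use_max else sum(dists) / n
    return center, radius

ball = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center, radius = center_and_radius(ball)
print(center, radius)
```

With `use_max=True` the ball covers all of its objects, which is the variant used later by the acceleration method; the average radius gives a tighter ball that is less sensitive to outliers.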

Fig. 3: Process of the existing granular-ball generation in granular-ball computing.

As shown in Fig. 3, to simulate the large-scale priority of human cognition, the whole data set is regarded as a single granular-ball at the beginning of the algorithm. At this time, the purity of the granular-ball is at its worst and cannot describe any distribution characteristics of the data. The "purity" is used to measure the quality of a granular-ball [17xia2019granular] in step 3 in Fig. 3. It is equal to the proportion of samples in the granular-ball that carry its majority label. Then, the number of different classes in the granular-ball is counted and denoted as $k$; the granular-ball is split into $k$ child granular-balls in step 2. In step 3, the purity of each granular-ball is calculated; if a granular-ball does not reach the purity threshold, it needs to be split. As the splitting process advances, the purity of the granular-balls increases and the decision boundary becomes increasingly clear; the boundary is clearest, and the algorithm converges, when the purity of all granular-balls meets the requirement. It can be concluded from [17xia2019granular] that, for a data set with any distribution, its decision boundary can be described by enough granular-balls.
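The purity test that drives the splitting can be written in a few lines. The following is a minimal sketch under the definition given in the text (purity is the share of the majority label); the names `purity` and `needs_split` are hypothetical.

```python
from collections import Counter

def purity(labels):
    """Return (majority label, fraction of samples carrying that label)."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)

def needs_split(labels, threshold):
    """A granular-ball is split further while its purity is below the threshold."""
    return purity(labels)[1] < threshold

labels = ["a", "a", "a", "b"]
label, p = purity(labels)
print(label, p, needs_split(labels, 0.9))
```

The number of child balls produced by a split is the number of distinct labels in the ball, which here would be `len(set(labels))`.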

Fig. 4: The granular-ball splitting generation process of the existing method on the data set fourclass. The colors of the two granular-balls in the figure (corresponding to the two sample point colors) respectively represent the two types of category labels. (a) The initial granular-ball, the whole data set can be seen as a granular-ball to participate in subsequent iterations; (b) Granular-balls generated in the first iteration; (c) Granular-balls generated in the second iteration; (d) Stop splitting results; (e) Results after stopping splitting; (f) Granular-balls extracted.

An example of granular-ball generation on the data set fourclass is shown in Fig. 4. At the beginning of the algorithm, as shown in Fig. 4(a), the whole data set is seen as one granular-ball. As fourclass contains two classes of points, $k$ in step 2 in Fig. 3 is equal to 2, and two heterogeneous points are randomly selected as the initial centers of two child granular-balls. The result is shown in Fig. 4(b). However, the granular-balls are too coarse and their quality is not high enough, i.e., their purities do not reach the purity threshold. So, the decision boundary of the granular-balls is inconsistent with that of the data set. As the splitting process progresses, as shown in Fig. 4(c)-(d), the granular-balls become finer, and the purity of each granular-ball increases until it reaches the purity threshold or another quality measure. As shown in Fig. 4(e), each granular-ball reaches the purity threshold and is fine enough. At this time, the decision boundary is very consistent with that of the data set. Fig. 4(f) shows the extracted granular-balls when the points are removed.

Granular-ball computing has given rise to granular-ball classifiers [17xia2019granular], granular-ball clustering [16xia2020fast], the granular-ball neighborhood rough set [8xia2020gbnrs] and granular-ball sampling methods [37xia2021granular].

II-B Granular-ball k-Nearest Neighbor

The kNN algorithm has many attractive characteristics: it is simple, handles multi-class problems naturally, needs no training, can be used for both classification and regression, and is easy to parallelize and implement. It is one of the most widely used artificial intelligence algorithms. In kNN, the basic principle of finding the $k$ nearest neighbors of a query point is to compute the Euclidean distance from the query point to all data points, and to use the values (or labels) of these neighbors to predict or classify the query point. This method is called the Full Search Algorithm (FSA). FSA has the following common problems: the value of $k$ needs to be optimized, and optimizing the value of $k$ requires quadratic time complexity, so it is very time-consuming. From the perspective of multi-granularity, the problem of optimizing $k$ arises from excessive attention to fine granularity. For this reason, in [17xia2019granular], we introduced the granular-ball into kNN and proposed an efficient nearest neighbor algorithm that does not require optimizing the value of $k$, the granular-ball nearest neighbors method (GBNN). The basic idea is very easy to implement: based on granular-ball computing, a single query sample point is a granular-ball with a very small radius, and its predicted label is equal to the label of the nearest granular-ball, which is determined by the majority label in that granular-ball. Therefore, the common feature of GBNN and traditional kNN is that the label of the query point is determined by many points. The first important advantage of GBNN is that there is no need to optimize the parameter $k$, and the label of the query point is determined by adaptively generated nearest neighbor granular-balls of different coarseness. The second advantage of GBNN is that the number of granular-balls is much smaller than the number of sample points, so finding the nearest granular-ball for a query point requires much less computation than kNN, which makes GBNN more efficient than traditional kNN. The third advantage is that the decision of traditional kNN is affected by label noise, whereas GBNN can obtain higher accuracy because of its robustness, especially on noisy data sets. These three points are important advantages of GBNN.
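The prediction rule described above can be sketched as follows. The text does not spell out how the distance between a query point and a granular-ball is measured; this sketch assumes the distance to the ball's surface, i.e. the Euclidean distance to the center minus the radius. The function name `predict` and the tuple layout of a ball are hypothetical.

```python
import math

def predict(query, balls):
    """Return the label of the granular-ball nearest to the query point.

    Each ball is a (center, radius, label) tuple. The point-to-ball distance
    used here (distance to the center minus the radius) is an assumption.
    """
    def ball_distance(ball):
        center, radius, _ = ball
        return math.dist(query, center) - radius

    return min(balls, key=ball_distance)[2]

balls = [((0.0, 0.0), 1.0, "red"), ((5.0, 0.0), 3.0, "blue")]
print(predict((2.0, 0.0), balls))
print(predict((0.5, 0.0), balls))
```

No parameter $k$ appears anywhere: the number of points that effectively vote for the query is the size of the nearest ball, which the generation process chose adaptively.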

II-C Granular-ball Sampling

The purpose of granular-ball sampling (GBS) is to decrease the size of a data set in classification by introducing the idea of granular-ball computing. The GBS method uses some adaptively generated balls to cover the data space, and the points near the boundary of each granular-ball constitute the sampled results [37xia2021granular]. Fig. 5 shows the basic idea of GBS. Fig. 5(a) shows the original data set and its decision boundary; Fig. 5(b) shows the original data set after being covered by the granular-balls. In Fig. 5(c), we find the intersections between the granular-ball and the coordinate axes that take the granular-ball center as the origin. Within the granular-ball, among the points with the same label as the ball, the points closest to these intersections constitute the sampled result. These points are near the boundary of the granular-ball. Requiring the same label filters out the effect of label noise points. For example, for granular-ball A in Fig. 5(c), the intersection points of the ball with the coordinate axes centered at the ball center are marked in the figure, and the points in A that have the same label as the ball and are closest to these intersections are the sampling results in A. They are also the best points to describe the boundary of the granular-ball A. Fig. 5(d) shows the sampling results of the whole data set. Comparing Fig. 5(a) and Fig. 5(e), it can be observed that the boundary curve in Fig. 5(e) is very consistent with the boundary curve in Fig. 5(a). At the same time, the samples in Fig. 5(e) are fewer and sparser than those in Fig. 5(a). In contrast, random sampling can easily lose boundary information. Therefore, as shown in Fig. 5(e) and Fig. 5(f), the boundary generated by GBS is closer to the boundary of the original data set than the boundary generated by random sampling.

Fig. 5: [37xia2021granular] The schematic diagram of GBS. (a) The original data set and its boundaries; (b) The granular-balls generated; (c) The intersection points, whose number is 2 times the number of dimensions, and the points closest to the intersection points in the granular-ball A; (d) The granular-balls used to sample the original data; (e) The final data set and its boundaries; (f) The sampling results of random sampling.

For an intersection point $p$ on a granular-ball, its sampled point $s$, the closest homogeneous point of $p$, has the same label as the granular-ball. The intersection point can be expressed as the point reached when the center point vector moves along the specified coordinate axis by a length $r$. The center point vector can move in a positive and a negative direction, so one coordinate axis corresponds to two intersection points. Specifically, for a $d$-dimensional data set $D$, let the center point vector of a granular-ball $GB$ generated on $D$ be $c = (c_1, c_2, \dots, c_d)$ and its radius be $r$. The two intersection points in the positive and negative directions of the $j$-th coordinate axis can be expressed as

$$p_j^{+} = (c_1, \dots, c_j + r, \dots, c_d), \tag{3}$$
$$p_j^{-} = (c_1, \dots, c_j - r, \dots, c_d). \tag{4}$$

In a data set, the sampled result for the granular-ball $GB$ can be modeled as follows:

$$S_{GB} = \bigcup_{j=1}^{d} \Big\{ \arg\min_{x \in GB,\ y(x) = y(GB)} \lVert x - p_j^{+} \rVert,\ \arg\min_{x \in GB,\ y(x) = y(GB)} \lVert x - p_j^{-} \rVert \Big\}, \tag{5}$$

where $S_{GB}$ is the sampled result for the granular-ball $GB$, $y(\cdot)$ denotes the label, and the label of every point in $S_{GB}$ is consistent with that of $GB$. For an unbalanced data set, the minority class points are retained, and the sampling process is performed only on the majority class points. So, in Equ. (5), $GB$ needs to be changed to $GB_{maj}$, which represents the majority granular-balls.
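The sampling rule for a single granular-ball can be sketched as below. This is an illustration under stated assumptions rather than the reference implementation: intersections are taken as the center shifted by plus or minus the radius along each axis, distances are Euclidean, and the ball is assumed to contain at least one point with the ball's label. The function name `sample_ball` is hypothetical.

```python
import math

def sample_ball(points, labels, center, radius, ball_label):
    """Pick, for each axis intersection, the closest point with the ball's label."""
    # Only points that share the ball's label are candidates (filters label noise).
    same = [p for p, y in zip(points, labels) if y == ball_label]
    sampled = []
    for j in range(len(center)):
        for sign in (1.0, -1.0):
            # Intersection point: move the center along axis j by +/- radius.
            target = list(center)
            target[j] += sign * radius
            nearest = min(same, key=lambda p: math.dist(p, target))
            if nearest not in sampled:
                sampled.append(nearest)
    return sampled

points = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (0.1, 0.1), (0.9, 0.1)]
labels = ["a", "a", "a", "a", "a", "b"]
print(sample_ball(points, labels, (0.0, 0.0), 1.0, "a"))
```

At most 2 times the number of dimensions points are kept per ball, so interior points such as `(0.1, 0.1)` are dropped and the differently labeled point `(0.9, 0.1)` is never selected.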

In classification with label noise, GBS can both reduce a data set and improve its data quality. Besides, GBS is also effective for undersampling in unbalanced classification. In addition, the time complexity of GBS is O($n$), where $n$ is the number of samples, so it can speed up most classifiers [38zhou2009novel].

III Granular-ball Covering Model

At present, the granular-ball covering of granular-ball computing lacks a mathematical model. In this section, we establish the basic model of granular-ball covering. The specific description is as follows: given a data set $D = \{x_i, i = 1, 2, \dots, n\}$, where $n$ is the number of samples in $D$, granular-balls $GB_1, GB_2, \dots, GB_m$ are used to cover and represent the data set $D$. The main factors that measure the coverage in the optimization problem of granular-ball generation are:

  1. Coverage degree. When other factors remain unchanged, the higher the coverage, the less sample information is lost and the more accurate the characterization. Suppose the number of samples in the $j$-th granular-ball $GB_j$ is expressed as $|GB_j|$; then the coverage degree can be expressed as $\sum_{j=1}^{m} |GB_j| / n$.

  2. The number of granular-balls. When other factors remain unchanged, the number of granular-balls is related to the size of the granular-balls. Minimizing this factor makes the granular-balls as coarse as possible. The fewer the granular-balls, the coarser they are and the stronger the coarse-granularity characteristics: the granular-ball calculation is more efficient and more robust.

  3. Quality. Under different problems, in order to serve the optimization goal of the problem at hand, the quality of each granular-ball needs to be no lower than a given threshold $T$ of the given evaluation method. This factor is also related to the lower limit of the size of the granular-balls, so that the granular-balls are "fine" enough to describe the problem accurately. The threshold can be given in advance, obtained by grid search, or obtained adaptively, which is what we pursue.

Taking the reciprocal of the coverage degree so that it can be minimized, the optimization goal of the granular-balls can be expressed as

$$\min\ \lambda_1 \cdot \frac{n}{\sum_{j=1}^{m} |GB_j|} + \lambda_2 \cdot m, \quad \text{s.t.}\ quality(GB_j) \ge T,\ j = 1, 2, \dots, m, \tag{6}$$

where $\lambda_1$ and $\lambda_2$ are the corresponding weight coefficients and $m$ is the number of granular-balls.
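To make the trade-off between the three factors concrete, the following sketch evaluates a covering under one reading of Equ. (6): a weighted sum of the reciprocal coverage degree and the number of granular-balls, subject to a quality threshold. The function names and the default weights are assumptions made for illustration only.

```python
def covering_objective(ball_sizes, n_samples, lambda1=1.0, lambda2=1.0):
    """Weighted sum of the reciprocal coverage degree and the number of balls."""
    coverage = sum(ball_sizes) / n_samples
    return lambda1 / coverage + lambda2 * len(ball_sizes)

def is_feasible(qualities, threshold):
    """Every granular-ball must reach the quality (e.g. purity) threshold."""
    return all(q >= threshold for q in qualities)

# Two candidate coverings of a data set with 100 samples.
coarse = covering_objective([50, 50], 100)
fine = covering_objective([25, 25, 25, 25], 100)
print(coarse, fine, is_feasible([0.95, 1.0], 0.9))
```

With equal coverage, the covering with fewer balls has the lower objective value, so the threshold constraint is what prevents the optimum from collapsing to a single impure ball.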

The definition of a granular-ball's quality differs with the setting, but it can basically be defined in terms of sample labels under a certain approximate (or equivalence) relation. For example, in the classification problem, we often use the nearest neighbor to describe this equivalence relation. All three factors are indispensable. It is obviously unreasonable to rely only on factor 1, the coverage degree, or factor 2, the number of granular-balls. For example, in the extreme case shown in Fig. 6(a), only one granular-ball is used to represent the data set. In this case, the quality of the granular-ball is poor, and one granular-ball cannot describe the distribution of the data set (i.e., the data boundary). If factor 1, the coverage degree, is not considered, as shown in Fig. 6(b), the granular-balls may cover only a small part of the data set, and it is again impossible to describe the data set. If factor 2, the number of granular-balls (that is, the size of the granular-balls), is not considered and only the quality and the coverage degree are considered, the granular-balls can be divided down to the finest granularity, i.e., each granular-ball contains only one sample point. In that case, coarse granularity no longer plays any role. Therefore, all of the above factors are indispensable. On the whole, while factor 1 guarantees a certain coverage, factors 2 and 3 together yield granular-balls of appropriate size; when factors 1 and 2 remain unchanged, the smaller the threshold of factor 3, the more easily the quality requirement is satisfied (i.e., the coarser the granular-balls, the fewer the granular-balls, and the better the computational efficiency). The control of the threshold in factor 3 gives granular-ball generation its scalability. In fact, the existing granular-ball generation method, as shown in Fig. 3, provides a heuristic optimization strategy.

Fig. 6: Invalid coverage of sample space by granular-balls. (a) Granular-ball coverage results without considering the quality of the granular-ball; (b) Granular-ball coverage results without considering the rate of the coverage of the granular-balls.

IV An Acceleration Granular-ball Generation Method

IV-A Motivation

The existing granular-ball generation method uses the k-means algorithm to split each granular-ball, so granular-ball generation cannot be more efficient than k-means. k-means produces a stable splitting result in each iteration of granular-ball generation. However, stability in the intermediate process is not needed; all that is required is to generate granular-balls that fulfill Equ. (6), for example by ensuring that the lower bound on quality is met.

IV-B The Process of the Acceleration Granular-ball Generation Method

As stability in the intermediate process is not needed, as shown in step 2 in Fig. 7, we use one division, i.e., one iteration of k-means, to split a granular-ball instead of a whole run of the k-means algorithm. Besides, different from the existing method shown in Fig. 3, a global division is added at the end to improve the overall distribution of the final granular-balls. In the global division, a division is performed based on all the division points. The specific process is shown in Fig. 7. In order to describe the process of the acceleration granular-ball generation more clearly, we first define the parent ball and the child ball.

Fig. 7: Process of the acceleration granular-ball generation in granular-ball computing.
Definition 1

Given granular-balls and , suppose that and . Then and are the parent ball and the child ball, respectively.

A k-means algorithm consists of $t$ iterations, where $t$ denotes the number of iterations. Each iteration is a division, in which all points are divided into $k$ clusters according to their distances to the center points. In step 2 in Fig. 7, the k-means used for splitting a granular-ball is replaced with a single $k$-division, where $k$ denotes the number of classes in the granular-ball. So, the computation cost is decreased directly. Besides, as shown in step 2, taking the splitting of granular-ball $A$ as an example, the center point of $A$, denoted as $a$, is retained as the center point of one child ball of $A$, so only $k$-1 points are selected as centers of the new child balls of $A$. The distances from the sample points to the original center point $a$ do not need to be calculated again because they have been calculated before. Consequently, the computation cost is further decreased.

Fig. 8: Granular-ball splitting using the acceleration granular-ball generation method. (a) Granular-ball $A$ with center $a$; (b) $A$ is split into two child balls $A_1$ and $A_2$, whose centers are $a$ and $b$ respectively.

As shown in Fig. 8, a granular-ball $A$ with center $a$ is split into two child balls $A_1$ and $A_2$, whose centers are $a$ and $b$ respectively. The radius of a granular-ball is taken as the furthest distance from the data points in it to its center, so that the ball covers all of its data points. The center $a$ of the ball $A$ in Fig. 8(a) is retained as the center of child ball $A_1$ in Fig. 8(b). In this splitting process, the distances from the data points in the ball to the center $a$ do not need to be computed again; only the distances from the data points to the center $b$ of child ball $A_2$ in Fig. 8(b) are computed. Finally, all data points are divided between the two granular-balls based on their distances to the two centers $a$ and $b$.
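The two-class split described above can be sketched as follows. This is an illustration under assumptions: distances are Euclidean, ties stay with the retained center, and the function name `split_once` and its argument layout are hypothetical.

```python
import math

def split_once(points, parent_center, parent_dists, new_center):
    """Split a ball into two children with one division.

    parent_dists[i] is the already known distance from points[i] to
    parent_center, so only distances to new_center are computed here.
    """
    keep, keep_d, move, move_d = [], [], [], []
    for p, d_old in zip(points, parent_dists):
        d_new = math.dist(p, new_center)  # the only new distance computation
        if d_new < d_old:
            move.append(p)
            move_d.append(d_new)
        else:
            keep.append(p)
            keep_d.append(d_old)
    # Radius: furthest distance to the center, so every point is covered.
    child_a = (parent_center, max(keep_d, default=0.0), keep)
    child_b = (new_center, max(move_d, default=0.0), move)
    return child_a, child_b

points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
parent_center = (0.0, 0.0)
parent_dists = [math.dist(p, parent_center) for p in points]
child_a, child_b = split_once(points, parent_center, parent_dists, (5.0, 0.0))
print(child_a)
print(child_b)
```

Each child already holds the distances of its points to its own center, so those values can be reused when the child is split again.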

It is worth noting that, in the granular-ball splitting process of the acceleration method, the center of a granular-ball is its division point instead of the center computed using Equ. (1). As shown in Fig. 8(a), the center of the granular-ball $A$ is the division point $a$ instead of the center of the data points in $A$, which is computed using Equ. (1). However, as shown in step 7 in Fig. 7, the global division, i.e., a division on all division points, is performed so that each division point is close to the center of the corresponding granular-ball. For example, Fig. 9 shows the comparison between the conventional granular-ball generation method and our proposed acceleration method. Fig. 9(a) shows the granular-ball generation result using the conventional method. Fig. 9(b) shows the result using the proposed acceleration method before the global division is performed. It can be seen from Fig. 9(c) that, after the global division is performed, in comparison with Fig. 9(b), the division center of a granular-ball, i.e., its division point, becomes closer to its true center computed using Equ. (1), so the points in each granular-ball become more tightly and uniformly distributed and the decision boundary is clearer. In addition, Fig. 9(d) shows the result when some label noise points are added. It can be observed from Fig. 9(d) that the granular-balls generated using the acceleration method fit the division boundary well on the noisy data because of the robustness of granular-ball computing. The algorithm design of the granular-ball generation acceleration method is given in Algorithm 1.

Fig. 9: Comparison of the distribution of granular-balls before and after global division. (a) The result of granular-ball generation using k-means; (b) The result of the acceleration granular-ball generation method without using the global division; (c) The distribution of granular-balls using the global division; (d) The experimental result using the proposed acceleration method on a noisy data set, where the noise points are colored blue.
Input: A data set and the purity threshold
Output: The granular-balls
1:  Treat the whole data set as a single granular-ball, and initialize the iteration counter to 1;
2:   were initialized to ;
3:  Randomly select k-1 points on the data set as the initial division centers, and compute the distances from all points to the division centers;
4:  repeat
5:     for each granular-ball do
6:        Compute the purity of the granular-ball as the percentage of majority samples in it;
7:        if the purity is lower than the purity threshold then
8:           Randomly select k-1 heterogeneous points as the new division centers, where k represents the number of different labels in the ball;
9:           Compute the distances from the points in the ball to the new division centers;
10:           Based on the distances in step 9, generate the child granular-balls;
11:           
12:           ;
13:        end if
14:     end for
15:  until the number of granular-balls does not increase
16:  Perform a global division.
Algorithm 1 Granular-ball generation acceleration method
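
The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `purity` and `generate_balls`, the use of a single random starting division point, and the rule of drawing one random point from each class other than the center's class are all assumptions made for this example. The sketch caches each point's distance to its current division point, so a split only computes distances to the newly selected centers, and it ends with a global division that reassigns every point to its nearest division point.

```python
import numpy as np

def purity(labels):
    """Fraction of samples that belong to the majority class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()

def generate_balls(X, y, threshold, seed=0):
    """Division-based granular-ball generation (illustrative sketch).

    Returns a list of (division_point_index, member_indices) pairs.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(X)
    first = int(rng.integers(n))
    # dist[i] caches the distance from point i to the division point of its ball
    dist = np.linalg.norm(X - X[first], axis=1)
    balls = [(first, np.arange(n))]
    grew = True
    while grew:  # stop once a full pass adds no new ball
        grew = False
        next_balls = []
        for center, members in balls:
            labels = y[members]
            classes = np.unique(labels)
            if len(classes) < 2 or purity(labels) >= threshold:
                next_balls.append((center, members))
                continue
            # Assumption: one random point per class other than the center's class
            new_centers = [int(rng.choice(members[labels == c]))
                           for c in classes if c != y[center]]
            # Only distances to the new centers are computed; the rest is cached
            rows = [dist[members]] + [np.linalg.norm(X[members] - X[c], axis=1)
                                      for c in new_centers]
            D = np.vstack(rows)
            nearest = D.argmin(axis=0)
            children = []
            for j, c in enumerate([center] + new_centers):
                mask = nearest == j
                if mask.any():
                    children.append((c, members[mask], D[j, mask]))
            if len(children) < 2:  # degenerate split: keep the parent ball
                next_balls.append((center, members))
                continue
            for c, child, d in children:
                dist[child] = d
                next_balls.append((c, child))
            grew = True
        balls = next_balls
    # Global division: reassign every point to its nearest division point
    centers = np.array([c for c, _ in balls])
    all_d = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2)
    owner = all_d.argmin(axis=1)
    result = []
    for j, c in enumerate(centers):
        members = np.flatnonzero(owner == j)
        if members.size:
            result.append((int(c), members))
    return result
```

With a purity threshold of 1.0 and two well-separated classes, the sketch stops after a single split with one pure ball per class.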

IV-C Time Complexity

The time complexity of k-means is O(Nkt) [38], where N is the number of samples, k represents the number of clusters, and t represents the number of iterations. k-means converges quickly, so its cost can be considered approximately linear. When generating granular-balls, the acceleration granular-ball generation method only needs to compute the distances between the data in the current ball and its new division centers. Assuming a data set with k classes of data, the first round of splitting requires about N(k-1) distance computations, since the distances from all N points to the k-1 initial division centers are computed.

In the second round, new division centers are newly generated, and computation times are approximately

(7)

In the third round, division centers are newly generated, and computation times is approximately

(8)

In the fourth round, division centers are newly generated, and computation times is approximately

(9)

Assuming a total of iterations, the time complexity of the last global division is O(), where is the number of granular-balls, and the total time complexity is O(). However, it is worth noting that granular-balls stop splitting as soon as the splitting conditions are no longer met, and most granular-balls stop splitting partway through. Therefore, the actual time complexity of the acceleration granular-ball generation method is much lower than O(). The time complexity of the acceleration method therefore remains linear, and unnecessary calculations are avoided.
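
As a rough check of the linear-cost claim, the following worked estimate counts distance computations per round. It is a sketch under stated assumptions, not a reconstruction of the paper's equations: N is the number of samples, every ball contains k classes, each split is perfectly balanced so that a ball in round r holds about N / k^(r-1) samples, and every ball keeps splitting for all t rounds.

```latex
\begin{align*}
\text{round } 1:&\quad N\,(k-1) \\
\text{round } r:&\quad
  \underbrace{k^{\,r-1}}_{\text{balls}} \cdot
  \underbrace{\frac{N}{k^{\,r-1}}}_{\text{points per ball}} \cdot
  \underbrace{(k-1)}_{\text{new centers per ball}}
  \;=\; N\,(k-1) \\
\text{after } t \text{ rounds}:&\quad \sum_{r=1}^{t} N\,(k-1) \;=\; t\,N\,(k-1)
\end{align*}
```

Under these assumptions every round costs the same N(k-1) computations, so the splitting cost grows linearly in N. A final global division over m division points would add about N m further distance computations, and balls that stop splitting early only lower the total.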

V An Adaptive Granular-ball Generation Method

V-A Motivation

Given a purity threshold as an input parameter, the existing method generates granular-balls that meet this threshold. Its main problem is that a fixed purity threshold cannot adapt to the data distribution of each data set, and it is difficult to find a splitting criterion that matches the distribution of every data set. To solve this problem, we propose a purity-adaptive granular-ball generation method based on the acceleration granular-ball generation method. Purity adaptation is important because it makes the granular-ball generation process completely parameter-free; as a result, a completely parameter-free classifier, GBNN, has been developed.

V-B The Adaptive Conditions of Granular-ball Splitting

In this section, we propose three adaptive conditions to realize the adaptive generation of granular-balls: whether the weighted purity sum of the child balls of each granular-ball increases, whether any pair of heterogeneous granular-balls overlaps, and whether each granular-ball reaches the lower bound of purity, i.e., the purity of the initial granular-ball covering the whole data set. The detailed designs are as follows.

V-B1 Weighted purity sum of child balls

Purity measures the quality of a granular-ball. A direct idea is therefore to design an indicator that measures the purity of the child balls; whether a granular-ball should be split is then determined by whether the purity of its child granular-balls is larger than its own purity. Since a ball containing more samples is more important, we design the weighted purity sum of child balls to measure the child balls' purity, as shown in Definition 2.

Definition 2

Given granular-balls and its child balls , where and . denotes the number of elements in a set. denotes the number of classes in . denotes a set consisting of those samples whose label is equal to , and , i.e., , represents the set consisting of those samples in the majority class in the . The weighted purity sum of the child balls of can be defined as

(10)

Based on Definition 2, Theorem 1 describes the condition under which a granular-ball should be split.

Theorem 1

Given a granular-ball , whose label is denoted by and purity by , and its child granular-ball , where is the number of the child granular-balls. . represents the weighted purity sum of .

[1] , if , then ;

[2] , if , then .

Proof:

[1] When and , the majority samples in are also the majority samples in all child balls, so we have

(11)
(12)

At the same time, the number of majority samples in the parent ball is equal to the sum of the majority samples in all the child balls. Combining Equ. 11 and Equ. 12, we get

(13)

From Equ. 13 and Definition 2, we can easily get

(14)

So,

(15)

[2] When and , the majority sample in are not necessarily the majority sample in all child balls. Assuming has a different label with , we can get

(16)

In addition, since the number of samples with a given label in the parent ball is equal to the sum of those in all child balls, we have

(17)

Combining Equ. 16 and Equ. 17, we get

(18)

From Equ. 18 and Definition 2, we can easily get

(19)

So,

(20)

When the weighted purity sum of the child balls is greater than the purity of the parent ball, the label of some child balls differs from that of the parent ball, i.e., the minority samples in the parent ball become the majority samples in some child balls. It can be concluded that, in this case, the weighted purity sum of the child balls is greater than the purity of the parent ball and the number of correctly classified samples increases. When the two values are equal, it represents the special case in which the parent ball and all child balls have the same label.
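
The two cases of Theorem 1 can be checked numerically with a small Python sketch. It assumes, following the description before Definition 2, that each child's purity is weighted by its share of the parent's samples; the function names `purity` and `weighted_purity_sum` are hypothetical and introduced only for this example.

```python
from collections import Counter

def purity(labels):
    """Fraction of samples that belong to the majority class of a ball."""
    return max(Counter(labels).values()) / len(labels)

def weighted_purity_sum(children):
    """Size-weighted sum of the child purities.

    Weighting each child's purity by |child| / |parent| reduces to the total
    number of majority samples over all children divided by the parent size.
    """
    total = sum(len(child) for child in children)
    majority = sum(max(Counter(child).values()) for child in children)
    return majority / total

# Parent ball: six samples of class 0 and four of class 1 (purity 0.6)
parent = [0] * 6 + [1] * 4

# Case 1: both children keep the parent's label, so the sum equals the purity
same_label = [[0, 0, 1], [0, 0, 0, 0, 1, 1, 1]]

# Case 2: class 1 becomes the majority in one child, so the sum is larger
new_label = [[0, 0, 0, 0, 1], [0, 0, 1, 1, 1]]
```

Here `weighted_purity_sum(same_label)` equals `purity(parent)`, while `weighted_purity_sum(new_label)` exceeds it, which is the situation in which a split is worth keeping.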

Fig. 10: The situation before and after splitting using the adaptive granular-ball generation method. (a) The parent ball before splitting; (b) The child balls after splitting; (c) The situation after de-overlap; (d) The algorithm convergence result using the proposed adaptive method; (e) The experimental result using the proposed adaptive method in a noisy data set where the noise points are colored with blue.

As shown in Fig. 10, the parent ball in (a) is split into two child balls in (b) using Theorem 1, and the labels of the two child balls are different. At this point, the minority class of the parent ball becomes the majority class of one child ball, so the weighted purity sum of the two child balls is greater than the purity of the parent ball. From the perspective of GBNN, the number of correctly classified samples also increases. However, premature convergence can still occur in this case, because the accuracy does not increase monotonically. Fig. 10(b) shows a simple example of premature convergence when only the condition in Section V-B1 is used: heterogeneous granular-balls still overlap.

To this end, we introduce the second condition that there can be no overlap between heterogeneous granular-balls.

V-B2 De-overlap between heterogeneous granular-balls

To handle overlapping granular-balls, it is necessary to further detect, on the basis of Section V-B1, whether heterogeneous granular-balls overlap, and to further split and refine the overlapping granular-balls so that the decision boundary becomes clearer. To improve efficiency, the next round of overlap detection only needs to traverse the child granular-balls of the granular-balls that overlapped. The boundary overlap problem of heterogeneous granular-balls can be defined as the following model:

(21)

In this model, each granular-ball is described by its center and its radius, and the model ranges over the total number of granular-balls. In addition, the condition that restricts the model to pairs of granular-balls with different labels focuses it on the boundary overlap between heterogeneous granular-balls and reduces the cost of computing and analyzing the problem, since overlap between granular-balls of the same class does not affect the decision boundary. As shown in Fig. 10(c), compared to Fig. 10(b), the granular-balls after de-overlap fit the data distribution better.
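
The overlap check between heterogeneous granular-balls can be illustrated with the short Python sketch below. It assumes the usual geometric criterion that two balls overlap when the distance between their centers is smaller than the sum of their radii; the paper's model may state the condition differently, and the function name `heterogeneous_overlaps` is hypothetical.

```python
import numpy as np

def heterogeneous_overlaps(centers, radii, labels):
    """Return index pairs (i, j) of overlapping balls that have different labels."""
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if labels[i] == labels[j]:
                continue  # same-class overlap does not affect the decision boundary
            # Assumed criterion: center distance smaller than the sum of the radii
            if np.linalg.norm(centers[i] - centers[j]) < radii[i] + radii[j]:
                pairs.append((i, j))
    return pairs
```

Only the balls that appear in the returned pairs would need to be split further; balls of the same class are skipped even if they intersect.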

V-B3 An adaptive purity lower bound

In addition, the purity of the granular-balls should have an adaptive lower bound. The lower bound is the proportion of the initial majority samples in the whole sample set, that is, the purity of the initial granular-ball. As shown in Fig. 10(c), without such a bound the method can adaptively generate granular-balls whose purity is lower than the initial purity. The quality of such granular-balls is too low, which reduces the classification accuracy of the final GBNN. For minority samples, the proportion of incorrectly classified samples can be considered as the noise rate. Therefore, the purity of all granular-balls must be greater than the purity of the initial granular-ball.

Fig. 10(d) shows the result of granular-ball generation using the adaptive method. The granular-balls in the two colors represent the two classes of data. In addition, Fig. 10(e) shows the result of granular-ball generation using the adaptive method when some label noise points are added. The blue points in the figure represent noise data, and the points in the other two colors represent the two original classes. The noise points are generated by randomly changing the labels of samples in the data set. The optimization goal of the adaptive granular-ball generation method can be expressed as

(22)

where and are the corresponding weight coefficients and , and represents the center and radius of respectively. denote the adaptive purity lower bound of granular-balls, and represents the child granular-balls of . In addition, and are mentioned above.

V-C Method Design

The basic idea of the adaptive granular-ball generation method is shown in Fig. 11.

Fig. 11: The basic idea of the adaptive granular-ball generation method.

In step 2 of Fig. 11, based on the acceleration granular-ball generation method, k-division is used to split the granular-ball, where k denotes the number of classes in the granular-ball, so the computation cost is directly reduced. In addition, as shown in step 3, when the weighted purity sum of the child balls is greater than the purity of their parent ball and the purity of the granular-ball reaches the lower bound, the child balls are retained, and overlap between heterogeneous granular-balls is then detected. As shown in Fig. 10(d), the boundary of the granular-balls at convergence is highly consistent with that of the data set.

The adaptive granular-ball generation method is given in Algorithm 2.

Input: Data set
Output: The granular-balls
1:  Treat the whole data set as a single granular-ball, and initialize the iteration counter to 1;
2:   were initialized to ;
3:  repeat
4:     for each granular-ball do
5:        Apply the acceleration granular-ball generation method to the granular-ball to pre-generate its child granular-balls;
6:        Compute the purity of the granular-ball;
7:        Compute the weighted purity sum of its child balls;
8:        if  or  then
9:           
10:           ;
11:        end if
12:     end for
13:     De-overlap between heterogeneous granular-balls;
14:  until the number of granular-balls does not increase
15:  Perform a global division.
Algorithm 2 Granular-ball generation adaptive method

VI Experiments

To demonstrate the feasibility and effectiveness of the acceleration granular-ball generation method and the adaptive granular-ball generation method, we compare them with NN and two popular or state-of-the-art methods based on granular computing, GBNN [17] and GBS [37]. Because granular-balls are robust, our experiments are carried out on both the raw data sets and noisy data sets. We verify the accuracy of both the acceleration and the adaptive granular-ball generation methods, and the efficiency of the acceleration method. We randomly selected ten real data sets from the UCI benchmark data sets, as shown in the following tables. Experimental hardware environment: a PC with an Intel Core i7-10700 CPU @ 2.90 GHz and 32 GB of RAM. Experimental software environment: Python 3.9.

VI-A Experiments on Raw Data Sets

In this section, we split each data set into ten parts, take one part for testing, and use the test accuracy as the evaluation index to verify the effectiveness of the acceleration granular-ball generation method and the adaptive granular-ball generation method. Since granular-ball generation still has a certain randomness, we run each method ten times and compare the average classification accuracy over the ten runs. The NN method uses the ten-fold cross-validation result.

Table I shows the classification accuracy under noise-free conditions. “Acc”, “Adp”, “Origin”, and “NN” represent the acceleration granular-ball generation method, the adaptive granular-ball generation method, the existing granular-ball generation method, and NN, respectively. “mean” and “max” indicate that the average distance and the maximum distance, respectively, are used as the radius of the granular-ball. On average, the two proposed methods obtain higher accuracy than the existing method and NN. The decision boundary obtained using the existing method is still not clear enough, so when measuring the NN accuracy, the granular-ball closest to a test point is likely to give an inaccurate label. In the proposed methods, the global division is performed after the splitting of the granular-balls has stopped. Therefore, the two proposed methods obtain higher NN accuracy than the existing method on the raw data sets.
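
The “mean” and “max” radius conventions can be illustrated with a small Python sketch. It assumes that the center of a ball is the mean of its points, which is how the text describes the center computed by Equ. (1); in the acceleration method the division point would take the place of this center during splitting. The function name `ball_center_and_radius` is hypothetical.

```python
import numpy as np

def ball_center_and_radius(points, mode="mean"):
    """Center (mean of the points) and radius of a granular-ball.

    mode="mean" uses the average distance to the center as the radius,
    mode="max" uses the maximum distance.
    """
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    points = np.asarray(points, dtype=float)
    center = points.mean(axis=0)
    distances = np.linalg.norm(points - center, axis=1)
    radius = distances.mean() if mode == "mean" else distances.max()
    return center, radius
```

A “max” radius makes the ball enclose all of its points, while a “mean” radius gives a tighter ball that leaves the farthest points outside.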

Data Acc(mean) Acc(max) Adp(mean) Adp(max) Origin(mean) Origin(max) NN
fourclass 0.990 0.987 0.988 0.973 0.990 0.958 0.999
svmguide1 0.959 0.965 0.930 0.970 0.960 0.796 0.960
diabetes 0.824 0.838 0.834 0.824 0.718 0.697 0.748
breastcancer 0.950 0.972 0.977 0.965 0.982 0.962 0.973
creditApproval 0.770 0.769 0.722 0.714 0.669 0.599 0.659
votes 0.906 0.917 0.868 0.922 0.871 0.722 0.875
svmguide3 0.831 0.828 0.812 0.818 0.786 0.776 0.788
sonar 0.895 0.886 0.833 0.855 0.833 0.757 0.831
splice 0.745 0.807 0.765 0.796 0.605 0.595 0.681
mushrooms 0.994 1.000 1.000 1.000 0.993 0.715 1.000
Average 0.886 0.897 0.873 0.884 0.841 0.758 0.851
TABLE I: Comparison of average test accuracy (raw data sets)
Data Origin-GBS Acc-GBS Adp-GBS NN
fourclass 0.9890 0.9902 0.9942 0.9971
svmguide1 0.9558 0.9612 0.9587 0.9596
diabetes 0.7331 0.7494 0.7448 0.7312
breastcancer 0.9644 0.9585 0.9696 0.9644
creditApproval 0.6855 0.6725 0.6609 0.6623
votes 0.8884 0.9000 0.9029 0.8870
svmguide3 0.7835 0.7803 0.7863 0.7807
sonar 0.8048 0.8476 0.8262 0.8048
splice 0.6964 0.7265 0.7061 0.6750
mushrooms 0.9994 1.0000 1.0000 1.0000
Average 0.8500 0.8586 0.8550 0.8462
TABLE II: Comparison of average test accuracy after sampling with GBS (raw data sets)

Columns 1-3 of Table II are based on the GBS method. First, the three granular-ball generation methods (Origin, Acc, and Adp) are used to generate the granular-balls; then the GBS method is used to sample from the generated granular-balls; finally, NN is used to classify the sampled result. The last column represents directly classifying the raw data set with NN. Following [37], the purity is set from 0.54 to 1.0 with a step size of 0.2 in the GBS algorithm. It can be seen that our two methods achieve higher accuracy than the other two on most data sets.

TABLE III: Comparison of the running time of the acceleration granular-ball generation method and the existing method
Data fourclass svmguide1 diabetes breastcancer creditApproval votes svmguide3 sonar splice mushrooms Average
balls 31 533 394 2 426 61 597 69 517 14 264
balls+ 31 390 348 15 364 60 524 73 506 39 235
TABLE IV: Comparison of the number of granular-balls generated by the acceleration granular-ball generation method and the existing method

To show the efficiency of the acceleration granular-ball generation method, we choose the existing granular-ball generation method as the comparison method. Table III shows the running time of the two methods on the raw data sets, where “time+” and “time” denote the acceleration method and the existing granular-ball generation method, respectively. Table IV compares the number of granular-balls generated by the two methods on the raw data sets, where “balls+” and “balls” denote the acceleration method and the existing method, respectively. As Tables I-IV show, compared with the existing granular-ball generation method, the acceleration method has higher accuracy and efficiency on most data sets while generating a similar number of granular-balls.

VI-B Experiments on Noise Data Sets

In this section, each data set is tested at four class noise rates: 10%, 20%, 30%, and 40%. Noise is generated by changing the labels of randomly selected samples in a data set. Tables V-VIII show the highest average GBS test accuracy obtained by optimizing the purity of the existing method under the different noise rates, together with the highest average GBS accuracy of the adaptive and the acceleration granular-ball generation methods. The purity is again set from 0.54 to 1.0 with a step size of 0.2 in the GBS algorithm. The acceleration method and the adaptive method select heterogeneous sample points as the new division centers when splitting a granular-ball. This makes the algorithm converge faster, but it reduces the accuracy on noisy data sets. The experimental results on the noisy data sets also show that the acceleration and adaptive methods follow a similar pattern to the existing granular-ball generation method: the larger the noise rate, the more obvious the advantage over the original NN. However, the adaptive method still shows slightly lower accuracy than the existing method on the noisy data sets. A possible reason is that the adaptive purity lower bound is too low, so that some poor-quality granular-balls are generated, which affects the overall accuracy. Nevertheless, the adaptive method improves on the existing method by making it adaptive.
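
The label-noise procedure described above can be sketched in Python as follows. The text only states that the labels of randomly selected samples are changed; drawing the replacement label uniformly from the other classes and the function name `add_label_noise` are assumptions of this example.

```python
import numpy as np

def add_label_noise(y, rate, seed=0):
    """Flip the labels of a random fraction `rate` of the samples.

    Returns the noisy label array and the indices of the flipped samples.
    Requires at least two distinct classes in y.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(y).copy()
    classes = np.unique(noisy)
    n_flip = int(round(rate * len(noisy)))
    flipped = rng.choice(len(noisy), size=n_flip, replace=False)
    for i in flipped:
        # Assumption: the new label is drawn uniformly from the other classes
        others = classes[classes != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy, flipped
```

For a noise rate of 20% on 100 samples, exactly 20 labels differ from the originals, since every selected sample is guaranteed to receive a different label.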

Data Origin-GBS Acc-GBS Adp-GBS NN
fourclass 0.8815 0.8792 0.8763 0.8769
svmguide1 0.8523 0.8625 0.8461 0.8428
diabetes 0.6721 0.6935 0.6701 0.6578
breastcancer 0.8711 0.8504 0.8593 0.8393
creditApproval 0.6442 0.6225 0.6283 0.6123
votes 0.8188 0.8029 0.7957 0.7957
svmguide3 0.7297 0.7285 0.7088 0.6964
sonar 0.7571 0.7357 0.7452 0.7571
splice 0.6449 0.6485 0.6388 0.6173
mushrooms 0.8926 0.8781 0.8358 0.8740
Average 0.7764 0.7702 0.7604 0.7570
TABLE V: Comparison of average test accuracy after sampling with GBS (noise rate 10%)
Data Origin-GBS Acc-GBS Adp-GBS NN
fourclass 0.7370 0.7948 0.7583 0.7046
svmguide1 0.7602 0.7677 0.7098 0.7156
diabetes 0.6214 0.6390 0.6065 0.5994
breastcancer 0.7652 0.7941 0.7815 0.6852
creditApproval 0.6210 0.5775 0.5695 0.5696
votes 0.7217 0.7072 0.7014 0.6725
svmguide3 0.6663 0.6775 0.6261 0.6108
sonar 0.6786 0.6238 0.6452 0.6786
splice 0.5929 0.5857 0.5740 0.5699
mushrooms 0.7830 0.7584 0.6985 0.7314
Average 0.6947 0.6926 0.6671 0.6537
TABLE VI: Comparison of average test accuracy after sampling with GBS (noise rate 20%)
Data Origin-GBS Acc-GBS Adp-GBS NN
fourclass 0.6711 0.6659 0.6653 0.6156
svmguide1 0.6543 0.6809 0.6025 0.6019
diabetes 0.5669 0.5877 0.5370 0.5266
breastcancer 0.7052 0.6800 0.6467 0.6178
creditApproval 0.5667 0.5688 0.5355 0.5355
votes 0.6435 0.6362 0.6232 0.5928
svmguide3 0.6116 0.6096 0.5735 0.5434
sonar 0.5690 0.5786 0.5667 0.5690
splice 0.5592 0.5500 0.5342 0.5245
mushrooms 0.6881 0.6495 0.5921 0.6151
Average 0.6236 0.6207 0.5877 0.5742
TABLE VII: Comparison of average test accuracy after sampling with GBS (noise rate 30%)
Data Origin-GBS Acc-GBS Adp-GBS NN
fourclass 0.5775 0.5468 0.5740 0.5312
svmguide1 0.5711 0.5807 0.5340 0.5316
diabetes 0.5117 0.5468 0.5026 0.4890
breastcancer 0.5904 0.5778 0.5615 0.5230
creditApproval 0.5080 0.5435 0.5145 0.4819
votes 0.5841 0.5652 0.5362 0.5319
svmguide3 0.5627 0.5498 0.5171 0.5104
sonar 0.4881 0.5405 0.5690 0.4833
splice 0.5321 0.5179 0.5184 0.5281
mushrooms 0.5860 0.5641 0.5255 0.5338
Average 0.5511 0.5533 0.5353 0.5144
TABLE VIII: Comparison of average test accuracy after sampling with GBS (noise rate 40%)

VII Conclusions and Future Work

This paper proposes a method for accelerating granular-ball generation, which greatly improves the efficiency of granular-ball generation while maintaining accuracy. In addition, a new granular-ball clustering method, the adaptive granular-ball generation method, is proposed. This adaptive method removes the need to manually set the purity threshold parameter and makes the generation process of the granular-balls completely adaptive. Experiments show that the acceleration method performs better than the adaptive method on both noisy and noise-free data, and that the accuracy of the adaptive method is slightly lower than that of the existing method. The results show that our method is effective; however, other adaptive strategies, for example ones based on the consistency of the internal distribution of granular-balls, may lead to a more effective adaptive granular-ball optimization method. Since the proposed methods exhibit lower accuracy than the existing method in some cases, we will study how to improve their accuracy in future work.

VIII Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, the Natural Science Foundation of Chongqing under Grant No. cstc2019jcyj-cxttX0002 and by NICE: NRT for Integrated Computational Entomology, US NSF award 1631776.

References