I Introduction
COGNITIVE computing combined with the human cognitive mechanism makes the decision-making process more reliable, efficient and understandable. It is an important means to achieve reliable governance of the information space and an important direction for the development of artificial intelligence. Academician Chen's research results published in Science in 1982 pointed out that human cognition is characterized by large-scale priority
[chen1982topological]. As shown in Fig. 1, the large outline letters are seen first, followed by the specific small letters within the outline letters. Based on this cognitive characteristic, granular computing can achieve efficient, scalable and robust learning processes. Zadeh, a famous American cybernetics expert, put forward the problem of information granulation and the concept of granular computing [zadeh1979fuzzy, zadeh1997toward]. After decades of continuous research by scholars at home and abroad, fuzzy sets [4backer1981clustering, 5kosko1986counting, 6zhang2020multi, 7liu2019combinatorial] and rough sets [8xia2020gbnrs, 9liang2012group, 10qian2010positive, 11hu2017large] have been developed. Academician Bo Zhang proposed the quotient space theory [12zhang2004quotient, 13ling2003theory], academician Deyi Li proposed the cloud model theory [14li1998uncertainty, 15li2009new], and Guoyin Wang and Shuyin Xia developed granular-ball computing [16xia2020fast, 17xia2019granular], among other model methods. In fuzzy sets, data elements are described by membership degrees of fuzzy information at different granularities. Rough set theory and quotient space theory use equivalence relations and equivalence classes to construct granules of different sizes. The universe of rough set theory is the point set of objects, and the topological relations between elements are not considered, whereas quotient space theory is studied under the condition that there are topological relations between the elements in the universe. The quotient space and the cloud model are also two important granular computing methods. Among them, rough sets have been the most widely studied, and many scholars have done outstanding work in this field. Guoyin Wang found the difference between the algebraic form and the information-entropy form of rough sets [18wang2003rough]. Duoqian Miao and others found the equivalence between fuzzy soft sets and fuzzy information systems [19pei2005soft]. Yuhua Qian, Jiye Liang and others reduced the amount of calculation of the positive region under attribute combination explosion through positive-region reduction and core attributes [10qian2010positive]. Zeshui Xu and others established the topological structure of the covering rough set model [20xu2005properties]. Yiyu Yao pointed out that label noise has an obvious interference effect on upper and lower approximation calculations [21yao2007decision]. Qinghua Hu and others designed a robust classification algorithm based on the fuzzy lower approximation [22hu2011robust], and proposed a data-distribution-aware fuzzy rough set model that incorporates data distribution information into the fuzzy approximation calculation [23an2015data]. Weizhi Wu and others proposed the multi-scale decision table [24wu2011theory]. Tianrui Li put forward a rough-set-based incremental approach for updating approximations under dynamic maintenance environments [25chen2011rough]. Shuyin Xia and Guoyin Wang proposed a parameter-free rough set method that can process continuous data without relying on a membership function [8xia2020gbnrs].
In granular computing, the larger the granularity, the higher the efficiency and the better the robustness to noise; however, a large granularity is also more likely to neglect details and lose accuracy. The smaller the granularity, the more attention is paid to details, but efficiency may decrease and robustness to noise may deteriorate. Selecting different granularities according to different scenes allows a multi-granularity learning method to perform at its best. Although multi-granularity computing has a long research history, as a cognitive computing science it also faces new challenges and needs new development. For example, consider the "classifier", one of the most widely used tools of artificial intelligence: as shown in Fig.
2(a), the input of most existing classifiers is the finest-grained sample points or pixels [26salehi2015synergistic, 27cover1967nearest, 28loh2011classification], so coarse-grained characterization is lacking. Some researchers have proposed classification algorithms based on multi-granularity ideas. For example, Dick S and others assumed that connection weights are all linguistic variables and granulated the connection weights at different levels [29dick2001granular]. Weight updating is realized by adding "linguistic hedges", but this method sacrifices a certain amount of accuracy. Leite D, M. M. Gupta, F. Y. Wang and others used fuzzy neurons to build interpretable multi-size local models
[30leite2013evolving, 31gupta1990fuzzy, 32wang1995implementing], which can learn fuzzy rules; the output space can be processed as membership information, which can be used to handle fuzzy data. In a few cases in machine learning and data mining, such as
[33syeda2002parallel], fuzzy rules are extracted for credit card fraud detection. The purpose of these works is to use neural networks to process fuzzy data and to apply them to fuzzy control. They are not based on multi-granularity ideas to improve the scalability, efficiency, or robustness of the classifier, and their essence is still a point-input method. Park HS and others constructed information granules by feature selection in the input space, which is essentially a feature preprocessing method that does not change the learning mode of the neural network
[34park2009granular]. Tang Y and others introduced a method of sampling and mapping information granules in a support vector machine
[35tang2012granular]. The research in [34park2009granular, 35tang2012granular] focuses more on understanding some existing work using the concept of multi-granularity. Pedrycz W and others systematically proposed a neural network granulation framework covering both the input and output layers [36pedrycz2001granular]; both rough set and fuzzy methods can granulate the input space. However, that work did not realize a specific multi-granularity neural network or examine its performance advantages. Therefore, how to implement the multi-granularity classifier shown in Fig.
2(b) is an important challenge. In the multi-granularity classifier in Fig. 2(b), the input is no longer the finest-grained point, but a universal feature with adjustable granularity. The design of this universal feature should have high-dimensional scalability, that is, no complicated calculations should be required in high-dimensional space; otherwise high-dimensional problems cannot be dealt with. For this reason, Guoyin Wang and Shuyin Xia proposed using a ball as the "granule" to represent this universal feature and proposed the granular-ball computing method [17xia2019granular]. The reason is that the geometry of a ball is completely symmetrical, and only two quantities are needed to characterize it in any dimension, the center and the radius, so it can conveniently be applied to high-dimensional data. At the same time, they also proposed an efficient and adaptive method to generate granular-balls.
Further, granular-ball computing was introduced into the classifier, the framework of the granular-ball computing classifier was proposed, the original model of the granular-ball support vector machine (GBSVM) was derived, and the
granular-ball k-nearest neighbor algorithm (GBkNN) was proposed [17xia2019granular]. The efficiency of GBkNN is hundreds of times higher than that of the existing kNN algorithm, especially on large-scale data. In addition, GBkNN does not need to select the parameter k, and it helps to alleviate the performance degradation on unbalanced data, which existing nearest neighbor algorithms cannot do. Due to the robustness of granular-ball computing, GBkNN achieves higher accuracy than exact kNN on many data sets. In addition, granular-ball computing was introduced into the neighborhood rough set, and a new rough set method, the granular-ball neighborhood rough set (GBNRS), was developed [8xia2020gbnrs]. GBNRS is the first parameter-free rough set algorithm that processes continuous data without prior knowledge (i.e., setting a membership function), and it is more efficient than NRS. Since GBNRS can adaptively select the neighborhood radius, it can also obtain higher classification accuracy than NRS in many cases. Furthermore, granular-ball computing was introduced into the k-means algorithm, and a simple and fast k-means clustering method, ball k-means, was developed [16xia2020fast]. Ball k-means is dozens of times more efficient than similar algorithms, especially on challenging large-k clustering problems. Granular-ball computing is efficient, robust and scalable [17xia2019granular]. However, there are still many challenges in granular-ball generation, such as the optimization of the purity threshold and the improvement of its efficiency. The main contributions of this paper are as follows:

An accelerated granular-ball generation method is proposed, using k-division to replace k-means. It accelerates granular-ball generation by several times to dozens of times while achieving similar accuracy.

A new adaptive method for granular-ball generation is proposed by considering the elimination of overlap between granular-balls and some other factors. This makes the granular-ball generation process parameter-free and completely adaptive in the true sense.

This paper provides, for the first time, a mathematical model of granular-ball covering.
II Related Work
II-A Granular-Ball Computing
Combining the theoretical basis of traditional granular computing with the research results published by Chen in Science in 1982, which pointed out that "human cognition has the characteristic of large-scale priority" [chen1982topological], Wang put forward granular cognitive computing [56]. Based on granular cognitive computing, granular-ball computing is a new, efficient and robust granular computing method proposed by Xia and Wang [17xia2019granular], whose core idea is to use "granular-balls" to cover or partially cover the sample space. A granular-ball $GB = \{x_i, i = 1, \dots, n\}$, where $x_i$ represents the objects in $GB$ and $n$ is the number of objects in $GB$. $GB$'s center $c$ and radius $r$ are respectively represented as follows
$$c = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (1)$$
$$r = \frac{1}{n}\sum_{i=1}^{n} \|x_i - c\| \qquad (2)$$
This means that the radius is equal to the average distance from all objects in $GB$ to its center. The radius can also be set to the maximum distance. The "granular-ball" with center $c$ and radius $r$ is used as the input of the learning method or as an accurate measurement to represent the sample space, achieving multi-granularity learning characteristics (that is, scalability, multiple scales, etc.) and an accurate characterization of the sample space. The basic process of granular-ball generation for classification problems in granular-ball computing is shown in Fig. 3.
As shown in Fig. 3, to simulate the "large-scale priority" characteristic of human cognition, at the beginning of the algorithm the whole data set is regarded as one granular-ball. At this time, the purity of the granular-ball is the worst and cannot describe any distribution characteristics of the data. The "purity" is used to measure the quality of a granular-ball [17xia2019granular] in step 3 in Fig. 3; it is equal to the proportion of the majority-class samples in the granular-ball. Then, the number of different classes in the granular-ball is counted and denoted as $k$; the granular-ball is split into $k$ child granular-balls in step 2. In step 3, the purity of each granular-ball is calculated; if a granular-ball does not reach the purity threshold, it needs to be split further. As the splitting process advances, the purity of the granular-balls increases and the decision boundary becomes increasingly clear; the boundary is clearest, and the algorithm converges, when the purity of all granular-balls meets the requirement. It can be concluded from [17xia2019granular] that for a data set, no matter what distribution its data has, its decision boundary can be described by enough granular-balls.
An example of granular-ball generation on the data set fourclass is shown in Fig. 4. At the beginning of the algorithm, as shown in Fig. 4(a), the whole data set is seen as one granular-ball. As fourclass contains two classes of points, the $k$ in step 2 in Fig. 3 is equal to 2, and two heterogeneous points are randomly selected as the initial centers of the two child granular-balls. The result is shown in Fig. 4(b). However, the granular-balls are too coarse, and their qualities are not high enough, i.e., their purities do not reach the purity threshold. So, the decision boundary of the granular-balls is inconsistent with that of the data set. As the splitting process progresses, as shown in Fig. 4(c)(d), the granular-balls become finer, and the purity of each granular-ball increases until it reaches the purity threshold or satisfies another quality measurement. As shown in Fig. 4(e), each granular-ball reaches the purity threshold and is fine enough. At this time, the decision boundary is very consistent with that of the data set. Fig. 4(f) shows the extracted granular-balls when the points are removed.
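For illustration, the splitting process of Fig. 3 can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the cited works; the names GranularBall and generate_granular_balls, the use of scikit-learn's KMeans for the k-split, and the default thresholds are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans

class GranularBall:
    """A granular-ball: labeled samples with a center, radius, purity and majority label."""
    def __init__(self, X, y):
        self.X, self.y = X, y
        self.center = X.mean(axis=0)                                   # Eq. (1)
        self.radius = np.linalg.norm(X - self.center, axis=1).mean()   # Eq. (2)
        labels, counts = np.unique(y, return_counts=True)
        self.label = labels[np.argmax(counts)]
        self.purity = counts.max() / counts.sum()        # proportion of the majority class

def generate_granular_balls(X, y, purity_threshold=0.95, min_samples=2):
    """Split balls with k-means (k = number of classes in the ball) until every
    ball reaches the purity threshold, following the process of Fig. 3."""
    balls, queue = [], [GranularBall(X, y)]
    while queue:
        ball = queue.pop()
        k = len(np.unique(ball.y))
        if ball.purity >= purity_threshold or k == 1 or len(ball.y) <= min_samples:
            balls.append(ball)                           # quality reached: keep the ball
            continue
        parts = KMeans(n_clusters=k, n_init=1).fit_predict(ball.X)
        masks = [parts == c for c in range(k) if np.any(parts == c)]
        if len(masks) < 2:                               # degenerate split: stop refining
            balls.append(ball)
            continue
        for m in masks:
            queue.append(GranularBall(ball.X[m], ball.y[m]))
    return balls
```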
Granular-ball computing has been developed into granular-ball classifiers [17xia2019granular], granular-ball clustering [16xia2020fast], the granular-ball neighborhood rough set [8xia2020gbnrs] and granular-ball sampling methods [37xia2021granular].
II-B Granular-Ball k-Nearest Neighbor
The kNN algorithm has many desirable characteristics: it is simple, responds naturally to multi-class problems, requires no training, can be used for both classification and regression, and is easy to parallelize and implement. It is one of the most widely used artificial intelligence algorithms. In kNN, the basic principle is to find the k nearest neighbors of a query point by computing the Euclidean distance from the query point to all data points, and to use the values (or labels) of these neighbors to predict or classify the query point. This method is called the Full Search Algorithm (FSA). FSA has the following common problem: it needs to optimize the value of k, and the optimization of k requires quadratic time complexity, so it is very time-consuming. From the perspective of multi-granularity, the source of the k-optimization problem is excessive attention to fine granularity. For this reason, in [17xia2019granular], we introduced the granular-ball into kNN and proposed an efficient nearest neighbor algorithm without optimizing k, the granular-ball k-nearest neighbors method (GBkNN). The basic idea is very easy to implement: based on granular-ball computing, a single query sample point is a granular-ball with a very small radius, and its predicted label is equal to the label of the nearest granular-ball, which is determined by the majority labels in that granular-ball. Therefore, the common feature of GBkNN and traditional kNN is that the label of the query point is determined by many points. The first important advantage of GBkNN is that there is no need to optimize the parameter k: the label of the query point is determined by an adaptively generated nearest neighbor granular-ball of a certain coarse grain. The second advantage is that the number of granular-balls is much smaller than the number of sample points, so the amount of computation for finding the granular-ball nearest to the query point is much less than for kNN, resulting in higher efficiency of GBkNN than traditional kNN. The third advantage is that the decision of traditional kNN is affected by label noise, but GBkNN can obtain higher accuracy because of its robustness, especially on noisy data sets. These three points are important advantages of GBkNN.
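As a hedged illustration of the GBkNN decision rule described above (not the authors' reference code), a query point simply takes the majority label of its nearest granular-ball. The sketch below assumes the balls expose .center, .radius and .label as in the earlier sketch, and it measures the distance to a ball from the ball's boundary, one possible convention; the original GBkNN definition may differ in this detail.

```python
import numpy as np

def gbknn_predict(balls, queries):
    """GBkNN rule: each query takes the label of its nearest granular-ball."""
    centers = np.stack([b.center for b in balls])
    radii = np.array([b.radius for b in balls])
    labels = np.array([b.label for b in balls])
    preds = []
    for q in np.atleast_2d(queries):
        # distance from the query to each ball's boundary (0 if the query lies inside)
        d = np.maximum(np.linalg.norm(centers - q, axis=1) - radii, 0.0)
        preds.append(labels[np.argmin(d)])
    return np.array(preds)
```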
II-C Granular-Ball Sampling
The purpose of granular-ball sampling (GBS) is to decrease the size of a data set for classification by introducing the idea of granular-ball computing. The GBS method uses some adaptively generated balls to cover the data space, and the points near the boundary of each granular-ball constitute the sampled result [37xia2021granular]. Fig. 5 shows the basic idea of GBS. Fig. 5(a) shows the original data set and its decision boundary; Fig. 5(b) shows the original data set after being covered by the granular-balls. In Fig. 5(c), we find the intersections of each granular-ball with the coordinate axes whose origin is the granular-ball center. In the granular-ball, among the points with the same label as the ball, the points closest to these intersections constitute the sampled result. These points are near the boundary of the granular-ball. Requiring the same label filters out the effect of label noise points. For example, for a granular-ball in Fig. 5(c), the intersection points of the ball with the coordinate axes centered at the ball's center lie on the ball's surface, and the points with the same label as the ball that are closest to these intersections are the sampling results for that ball; they are also the best points to describe the boundary of that granular-ball. Fig. 5(d) shows the sampling results of the whole data set. Comparing Fig. 5(a) and Fig. 5(e), it can be observed that the boundary curve in Fig. 5(e) is very consistent with the boundary curve in Fig. 5(a). At the same time, the samples in Fig. 5(e) are fewer and sparser than those in Fig. 5(a). In contrast, random sampling very easily causes the loss of boundary information. Therefore, as shown in Fig. 5(e) and Fig. 5(f), the boundary generated by GBS is closer to the boundary of the original data set than the boundary generated by random sampling.
For an intersection point on the granular-ball, its sampled point, the closest homogeneous point to the intersection, has the same label as the granular-ball. The intersection point can be expressed as the point obtained by moving the center point vector along a specified coordinate axis by a length $r$. The moving direction includes a positive direction and a negative direction, so one coordinate axis corresponds to two intersection points. Specifically, for a $d$-dimensional data set $D$, the center point vector of a granular-ball $GB$ generated on $D$ is $c$, and its radius is $r$. The two intersection points in the positive and negative directions of the $j$-th coordinate axis can be expressed as
$$p_j^{+} = c + r \cdot e_j \qquad (3)$$
$$p_j^{-} = c - r \cdot e_j \qquad (4)$$
where $e_j$ is the unit vector along the $j$-th coordinate axis.
For a data set, the sampled result for granular-ball $GB_i$ can be modeled as follows:
$$S_i = \bigcup_{j=1}^{d}\Big\{\operatorname*{arg\,min}_{x \in GB_i,\, y(x)=y(GB_i)} \|x - p_j^{+}\|,\ \operatorname*{arg\,min}_{x \in GB_i,\, y(x)=y(GB_i)} \|x - p_j^{-}\|\Big\} \qquad (5)$$
where $S_i$ is the sampled result for granular-ball $GB_i$, $y(\cdot)$ denotes the label, and the label of each sampled point is consistent with that of $GB_i$. For an unbalanced data set, the minority-class points are all retained, and the sampling process is performed only on the majority-class points. So, in Eq. (5), $GB_i$ needs to be restricted to the majority-class granular-balls.
In classification with label noise, GBS can both reduce the size of a data set and improve its data quality. Besides, GBS is also effective for under-sampling in unbalanced classification. In addition, the time complexity of GBS is approximately linear in the number of samples, so it can speed up most classifiers [38zhou2009novel].
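The per-ball sampling rule of Eqs. (3)–(5) can be sketched as follows. This is an illustrative reading of GBS under the equations above rather than the reference implementation, and it assumes balls expose .X, .y, .center, .radius and .label as in the earlier sketches.

```python
import numpy as np

def gbs_sample_ball(ball):
    """Sampled points of one granular-ball: for each coordinate axis, take the
    homogeneous point closest to the two boundary intersections c +/- r*e_j."""
    homo = ball.X[ball.y == ball.label]          # only points sharing the ball's label
    if len(homo) == 0:
        return np.empty((0, ball.X.shape[1]))
    d = ball.X.shape[1]
    picked = set()
    for j in range(d):
        e = np.zeros(d); e[j] = 1.0
        for sign in (+1.0, -1.0):
            p = ball.center + sign * ball.radius * e           # intersection point, Eqs. (3)-(4)
            picked.add(int(np.argmin(np.linalg.norm(homo - p, axis=1))))
    return homo[sorted(picked)]

def gbs_sample(balls):
    """Union of the per-ball samples over the whole data set (Eq. (5))."""
    return np.vstack([gbs_sample_ball(b) for b in balls])
```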
III Granular-Ball Covering Model
At present, the granular-ball covering in granular-ball computing lacks a mathematical model. In this section, we establish the basic model of granular-ball covering. The specific description is as follows: given a data set $D$, where $N$ is the number of samples in $D$, granular-balls $GB_1, GB_2, \dots, GB_m$ are used to cover and represent $D$. The goal of the optimization problem of granular-ball generation is to cover the data set with granular-balls, and the main factors measuring the covering are: 1. Coverage degree: when other factors remain unchanged, the higher the coverage, the less sample information is lost and the more accurate the characterization. Suppose the number of samples in granular-ball $GB_j$ is expressed as $|GB_j|$; then the coverage degree can be expressed as $\sum_{j=1}^{m}|GB_j| / N$. 2. The number of granular-balls $m$: when other factors remain unchanged, the number of granular-balls is related to their size. Minimizing this factor makes the granular-balls as coarse as possible; the fewer the granular-balls, the coarser they are and the more coarse-granularity characteristics they have, so the more efficient the granular-ball computation is and the better the robustness is. 3. Quality: in addition, for different problems, in order to correspond to the relevant optimization goal, the quality of each granular-ball needs to be higher than a given threshold of the given evaluation method. This factor is also related to the lower limit of the size of the granular-balls, so that the granular-balls are "fine" enough to accurately describe the problem. The threshold can be given directly, obtained by a grid search, or, as we pursue, determined in an adaptive way. Taking the reciprocal of the coverage degree so that its minimum can be optimized, the optimization goal of the granular-balls can be expressed as
$$\min\ \lambda_1 \cdot \frac{N}{\sum_{j=1}^{m}|GB_j|} + \lambda_2 \cdot m, \qquad s.t.\ \ quality(GB_j) \ge T,\ \ j = 1, \dots, m \qquad (6)$$
where $\lambda_1$ and $\lambda_2$ are the corresponding weight coefficients and $T$ is the quality (e.g., purity) threshold.
The definition of a granular-ball's quality differs according to the environment, but it can basically be defined through a certain approximation (or equivalence) relation on the sample labels. For example, in classification problems, we often use the nearest neighbor relation to describe this equivalence relation. These factors are all indispensable: it is obviously unreasonable to rely only on factor 1, the coverage degree, or factor 2, the number of granular-balls. For example, in the extreme case shown in Fig. 6(a), only one granular-ball is used to represent the data set. At this time, the quality of the granular-ball is poor, and one granular-ball cannot describe the distribution of the data set (i.e., the data boundary). If factor 1, the coverage degree, is not considered, as shown in Fig. 6(b), the granular-balls may cover only a small part of the data set, and it is likewise impossible to describe the data set. If factor 2, the number of granular-balls (that is, the size of the granular-balls), is not considered, and only the quality and the coverage degree are considered, the granular-balls can be split into the finest granular-balls, i.e., each granular-ball contains only one sample point; in that case, coarse granularity does not make any sense. Therefore, none of the above factors can be dispensed with. On the whole, when factor 1 guarantees a certain coverage, factors 2 and 3 serve to obtain granular-balls of appropriate size; when factors 1 and 2 remain unchanged, the smaller the threshold in factor 3, the more easily the quality requirement of the granular-balls is satisfied (i.e., the coarser the granular-balls, the fewer their number, and the more efficient the computation). The control of the threshold in factor 3 exhibits the scalability of granular-ball generation. In fact, the existing granular-ball generation method, as shown in Fig. 3, provides a heuristic optimization strategy for this model.
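Purely for illustration, the objective of Eq. (6) can be evaluated for a candidate set of granular-balls as below; the weighting scheme and the handling of the quality constraint as a hard feasibility test are assumptions consistent with the description above rather than a prescribed implementation.

```python
def covering_objective(balls, n_samples, lam1=0.5, lam2=0.5, purity_threshold=1.0):
    """Value of Eq. (6): reciprocal coverage plus the number of balls,
    with the purity constraint checked as a hard feasibility test."""
    covered = sum(len(b.y) for b in balls)
    if covered == 0 or any(b.purity < purity_threshold for b in balls):
        return float("inf")                       # empty covering or constraint violated
    return lam1 * (n_samples / covered) + lam2 * len(balls)
```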
IV An Accelerated Granular-Ball Generation Method
IV-A Motivation
The existing granular-ball generation method uses the k-means algorithm to split a granular-ball; consequently, granular-ball generation is no more efficient than k-means, which is run to convergence so that a stable splitting result is produced in each iteration of granular-ball generation. However, stability in the intermediate process is not needed; what is required is only to generate granular-balls fulfilling Eq. (6), e.g., ensuring the quality lower bound.
IV-B The Process of the Accelerated Granular-Ball Generation Method
As the stability of the intermediate process is not needed, as shown in step 2 in Fig. 7, we use one k-division, i.e., one iteration of k-means, to split a granular-ball instead of a complete k-means run. Besides, different from the existing method shown in Fig. 3, a global division is added at the end to improve the overall distribution of the final granular-balls. In the global division, a division is performed based on all the division points. The specific process is shown in Fig. 7. In order to describe the process of the accelerated granular-ball generation more clearly, we first give the definitions of the parent ball and the child ball.
Definition 1
Given granular-balls $GB_i$ and $GB_j$, suppose that $GB_j \subset GB_i$ and $GB_j$ is generated by splitting $GB_i$. Then $GB_i$ and $GB_j$ are the parent ball and the child ball, respectively.
A k-means algorithm consists of $t$ iterations, where $t$ denotes the number of iterations, and each iteration is a k-division, in which all points are divided into $k$ clusters according to their distances to the $k$ center points. In step 2 in Fig. 7, the k-means used for splitting a granular-ball is replaced with a single k-division, where $k$ denotes the number of classes in the granular-ball; so, the computation cost is decreased directly. Besides, as shown in step 2, taking the splitting of granular-ball $GB_i$ as an example, the center point of $GB_i$ is retained as the center point of one of its child balls, so only $k-1$ points are selected as the centers of the other new child balls of $GB_i$. The distances from the sample points to the original center point do not need to be computed again, because they have been computed before. Consequently, the computation cost is further decreased.
As shown in Fig. 8, a granular-ball $GB$ with center $c$ is split into two child balls $GB_1$ and $GB_2$ whose centers are $c_1$ and $c_2$, respectively. Here the radius of a granular-ball is represented by the furthest distance from the data points in it to its center, in order to cover all the data points in the ball. The center $c$ of the ball in Fig. 8(a) is retained as the center $c_1$ of the child ball $GB_1$ in Fig. 8(b). In this splitting process, the distances from the data points in the ball to the center $c$ do not need to be computed again, and only the distances from the data points to the center $c_2$ of the child ball $GB_2$ in Fig. 8(b) are computed. Finally, all data points are divided into the two granular-balls based on the above distances to the two centers $c_1$ and $c_2$.
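The single k-division that replaces a full k-means run can be sketched as follows. The choice of one random point from each non-majority class as the k-1 new centers, and the recomputation of distances to the retained parent center for self-containment (in practice these distances would be cached), are assumptions made for this illustration.

```python
import numpy as np

def split_once(X, y, parent_center, rng=None):
    """One k-division of a granular-ball (step 2 in Fig. 7): the parent center is
    kept as one child center, and k-1 heterogeneous points become the new centers.
    Returns a list of index arrays, one per non-empty child ball."""
    rng = rng or np.random.default_rng()
    classes = np.unique(y)
    k = len(classes)
    if k < 2:
        return [np.arange(len(y))]                 # nothing to split
    dist_old = np.linalg.norm(X - parent_center, axis=1)[:, None]
    maj = classes[np.argmax([np.sum(y == c) for c in classes])]
    new_centers = np.stack([X[rng.choice(np.flatnonzero(y == c))]
                            for c in classes if c != maj])
    dist_new = np.linalg.norm(X[:, None, :] - new_centers[None, :, :], axis=2)
    assign = np.argmin(np.hstack([dist_old, dist_new]), axis=1)
    return [np.flatnonzero(assign == c) for c in range(k) if np.any(assign == c)]
```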
It is worth noting that, in the granular-ball splitting process of the accelerated method, the center of a granular-ball is its division point instead of the center computed using Eq. (1). As shown in Fig. 8(a), the center of the granular-ball is the division point rather than the center of the data points in it computed using Eq. (1). However, as shown in step 7 in Fig. 7, the global division, i.e., a division based on all division points, is performed so that each division point moves close to the center of the corresponding granular-ball. For example, Fig. 9 compares the conventional granular-ball generation method and our proposed accelerated method. Fig. 9(a) shows the granular-ball generation result using the conventional method. Fig. 9(b) shows the result of the proposed accelerated method before the global division is performed. It can be seen from Fig. 9(c) that, after the global division is performed, in comparison with Fig. 9(b), the division center of each granular-ball, i.e., its division point, becomes closer to its true center computed using Eq. (1), so the points in each granular-ball become more tightly and uniformly distributed and the decision boundary is clearer. In addition, Fig. 9(d) shows the result when some label noise points are added. It can be observed from Fig. 9(d) that the granular-balls generated by the accelerated method fit the decision boundary well on the noisy data because of the robustness of granular-ball computing. The algorithm design of the accelerated granular-ball generation method is given in Algorithm 1.
IV-C Time Complexity
The time complexity of k-means is O($Nkt$) [38zhou2009novel], where $N$ is the number of samples, $k$ represents the number of clusters, and $t$ represents the number of iterations. The convergence speed of k-means is fast and can be considered approximately linear. The accelerated granular-ball generation method only needs to compute the distances between the data in a cluster and the newly generated division centers each time a granular-ball is split. Assuming a data set with $k$ classes, in the first round of splitting the number of distance computations is $N(k-1)$.
In the second round, $k(k-1)$ new division centers are generated, and the number of computations is approximately
$$k(k-1) \cdot \frac{N}{k} = N(k-1) \qquad (7)$$
In the third round, $k^2(k-1)$ new division centers are generated, and the number of computations is approximately
$$k^2(k-1) \cdot \frac{N}{k^2} = N(k-1) \qquad (8)$$
In the fourth round, $k^3(k-1)$ new division centers are generated, and the number of computations is approximately
$$k^3(k-1) \cdot \frac{N}{k^3} = N(k-1) \qquad (9)$$
…
Assuming a total of $t$ iterations, the time complexity of the final global division is O($Nm$), where $m$ is the number of granular-balls, and the total time complexity is O($N(k-1)t + Nm$). However, it is worth noting that a granular-ball stops splitting when the splitting conditions are no longer met, and most granular-balls stop splitting partway through. Therefore, the actual time complexity of the accelerated granular-ball generation method is much lower than O($N(k-1)t + Nm$). The time complexity of the accelerated method is still linear, which avoids unnecessary calculations.
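As a quick arithmetic sanity check of the counts above, under the idealized assumption that every ball splits evenly and no ball stops early, the number of distance computations stays $N(k-1)$ in every round:

```python
def distance_computations(N, k, rounds):
    """Idealized per-round count: k**(i-1) balls, N/k**(i-1) points per ball,
    and k-1 new centers per ball, so each round costs N*(k-1) computations."""
    per_round = []
    for i in range(1, rounds + 1):
        balls = k ** (i - 1)
        points_per_ball = N / balls
        per_round.append(balls * points_per_ball * (k - 1))
    return per_round

print(distance_computations(N=10000, k=2, rounds=4))  # [10000.0, 10000.0, 10000.0, 10000.0]
```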
V An Adaptive Granular-Ball Generation Method
V-A Motivation
Given an input purity threshold parameter, the existing method can generate granular-balls that meet the purity threshold. The main problem of the existing method is that a fixed purity threshold cannot adapt to the data distribution of every data set, and it is difficult to find a splitting standard that matches the distribution of each data set. To solve this problem, we propose a purity-adaptive granular-ball generation method based on the accelerated granular-ball generation method. Purity adaptation is important for granular-ball generation: it makes the generation process completely parameter-free, and on this basis the completely parameter-free classifier GBkNN can be developed.
V-B The Adaptive Conditions of Granular-Ball Splitting
In this section, we propose three adaptive conditions to realize the adaptive generation of granular-balls: whether the weighted purity sum of the child balls of each granular-ball increases, whether there is overlap between any pair of heterogeneous granular-balls, and whether each granular-ball reaches the lower bound of purity, i.e., the purity of the initial granular-ball containing the whole data set. The specific designs are detailed as follows.
V-B1 Weighted purity sum of child balls
The purity is designed for measuring the quality of a granular-ball. So, a direct idea is to design an indicator to measure the purity of the child balls; then, whether a granular-ball should be split is determined by whether its child granular-balls' purity becomes larger than its own. Considering the fact that the more samples a ball contains, the more important it is, we design the weighted purity sum of the child balls to measure the child balls' purity, as shown in Definition 2.
Definition 2
Given a granular-ball $GB$ and its child balls $GB_1, GB_2, \dots, GB_k$, where $GB_i \subset GB$ and $\bigcup_{i=1}^{k} GB_i = GB$. $|\cdot|$ denotes the number of elements in a set. $k$ denotes the number of classes in $GB$. $GB^{(l)}$ denotes the set consisting of those samples in $GB$ whose label is equal to $l$, and $GB^{maj}$, i.e., the $GB^{(l)}$ with the largest $|GB^{(l)}|$, represents the set consisting of the samples of the majority class in $GB$ (and similarly $GB_i^{maj}$ for each child ball). The weighted purity sum of the child balls of $GB$ can be defined as
$$P_w(GB) = \sum_{i=1}^{k} \frac{|GB_i|}{|GB|} \cdot \frac{|GB_i^{maj}|}{|GB_i|} = \frac{1}{|GB|}\sum_{i=1}^{k} |GB_i^{maj}| \qquad (10)$$
Based on Definition 2, Theorem 1 is proposed to describe the condition under which a granular-ball should be split.
Theorem 1
Given a granular-ball $GB$, whose label is denoted by $y(GB)$ and purity by $P(GB)$, and its child granular-balls $GB_1, \dots, GB_k$, where $k$ is the number of child granular-balls, let $P_w(GB)$ represent the weighted purity sum of the child balls of $GB$.
[1] $\forall GB_i$, if $y(GB_i) = y(GB)$, then $P_w(GB) = P(GB)$;
[2] $\exists GB_i$, if $y(GB_i) \neq y(GB)$, then $P_w(GB) > P(GB)$.
Proof:
[1] When $y(GB_i) = y(GB)$ for all $1 \le i \le k$, the majority samples in $GB$ are also the majority samples in all child balls, so we have
$$GB_i^{maj} = GB_i \cap GB^{maj}, \quad i = 1, \dots, k \qquad (11)$$
$$|GB^{maj}| = \sum_{i=1}^{k} |GB_i \cap GB^{maj}| \qquad (12)$$
At the same time, the majority samples in $GB$ are equal to the sum of the majority samples in all the child balls. Combining Eq. (11) and Eq. (12), we get
$$|GB^{maj}| = \sum_{i=1}^{k} |GB_i^{maj}| \qquad (13)$$
From Eq. (13) and Definition 2, we can easily get
$$P_w(GB) = \frac{1}{|GB|}\sum_{i=1}^{k} |GB_i^{maj}| = \frac{|GB^{maj}|}{|GB|} \qquad (14)$$
So,
$$P_w(GB) = P(GB) \qquad (15)$$
[2] When $\exists GB_i$ such that $y(GB_i) \neq y(GB)$, the majority samples in $GB$ are not necessarily the majority samples in all child balls. Assuming $GB_j$ has a different label from $GB$, we can get
$$|GB_j^{maj}| > |GB_j \cap GB^{maj}| \qquad (16)$$
In addition, since the samples labeled $y(GB)$ in the parent ball are equal to the sum of those in all child balls, we have
$$|GB^{maj}| = \sum_{i=1}^{k} |GB_i \cap GB^{maj}| \qquad (17)$$
Combining Eq. (16) and Eq. (17), we get
$$\sum_{i=1}^{k} |GB_i^{maj}| > \sum_{i=1}^{k} |GB_i \cap GB^{maj}| = |GB^{maj}| \qquad (18)$$
From Eq. (18) and Definition 2, we can easily get
$$P_w(GB) = \frac{1}{|GB|}\sum_{i=1}^{k} |GB_i^{maj}| > \frac{|GB^{maj}|}{|GB|} \qquad (19)$$
So,
$$P_w(GB) > P(GB) \qquad (20)$$
When $P_w(GB)$ is greater than $P(GB)$, it means the label of some child balls is different from that of the parent ball, i.e., the minority samples in the parent ball become the majority samples in some child balls. It can be concluded that the weighted purity sum of the child balls will then be greater than the purity of the parent ball, and the number of correctly classified samples will increase. When $P_w(GB)$ equals $P(GB)$, it represents the special case in which the parent ball and all child balls have the same label.
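A direct computation of the weighted purity sum in Eq. (10) is sketched below, assuming the same ball interface as the earlier sketches; the toy check at the end illustrates case [2] of Theorem 1, where a heterogeneous split raises the weighted purity sum above the parent's purity.

```python
import numpy as np

def purity(y):
    """Proportion of the majority class among labels y."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def weighted_purity_sum(parent_y, child_ys):
    """Eq. (10): child purities weighted by child size relative to the parent."""
    n = len(parent_y)
    return sum(len(cy) / n * purity(cy) for cy in child_ys)

# toy check of Theorem 1, case [2]
parent = np.array([0, 0, 0, 1, 1])
children = [np.array([0, 0, 0]), np.array([1, 1])]
assert weighted_purity_sum(parent, children) > purity(parent)   # 1.0 > 0.6
```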
As shown in Fig. 10, the ball in Fig. 10(a) is split into two child balls in Fig. 10(b) using Theorem 1, and the labels of the two child balls are different. At this time, the minority class in the parent ball becomes the majority class in one of the child balls, so the weighted purity sum of the two child balls is greater than the purity of the parent ball. From the perspective of GBkNN, the number of correctly classified samples also increases. But in this case there can still be premature convergence, because the accuracy does not increase monotonically. Fig. 10(b) shows a simple example of premature convergence when only the condition in Section V-B1 is used: there is overlap between heterogeneous granular-balls.
To this end, we introduce the second condition: there can be no overlap between heterogeneous granular-balls.
V-B2 De-overlap between heterogeneous granular-balls
For the granular-ball overlap problem, it is necessary, on the basis of Section V-B1, to further detect whether there is overlap between heterogeneous granular-balls, and to further split and refine the overlapping granular-balls to make the decision boundary clearer. In order to improve efficiency, the next round of overlap detection only needs to traverse the child granular-balls of the granular-balls that overlapped. The boundary overlap problem of heterogeneous granular-balls can be defined as the following model:
$$\|c_i - c_j\| < r_i + r_j, \quad s.t.\ \ y(GB_i) \neq y(GB_j),\ \ 1 \le i < j \le m \qquad (21)$$
Here $c_i$ represents the center of granular-ball $GB_i$, $r_i$ represents its radius, and $m$ is the total number of granular-balls. In addition, the effect of the condition $y(GB_i) \neq y(GB_j)$ is to restrict the boundary overlap problem to heterogeneous granular-balls, reducing the cost of computation and analysis; overlap between granular-balls of the same class does not affect the decision boundary. As shown in Fig. 10(c), compared to Fig. 10(b), the granular-balls after de-overlapping fit the data distribution better.
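The overlap test behind Eq. (21) reduces to comparing center distances with the sum of radii for heterogeneous pairs; the sketch below is one straightforward reading of that condition, again assuming balls with .center, .radius and .label attributes.

```python
import numpy as np
from itertools import combinations

def heterogeneous_overlaps(balls):
    """Return index pairs of differently labeled balls whose boundaries overlap,
    i.e. ||c_i - c_j|| < r_i + r_j with y(GB_i) != y(GB_j) (cf. Eq. (21))."""
    pairs = []
    for i, j in combinations(range(len(balls)), 2):
        bi, bj = balls[i], balls[j]
        if bi.label == bj.label:
            continue                  # same-class overlap does not blur the boundary
        if np.linalg.norm(bi.center - bj.center) < bi.radius + bj.radius:
            pairs.append((i, j))
    return pairs
```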
V-B3 An adaptive purity lower bound
In addition, the purity of the granular-balls should have an adaptive lower bound. The lower bound is the proportion of the initial majority-class samples in the whole data set, that is, the purity of the initial granular-ball. As shown in Fig. 10(c), without this condition the splitting process may adaptively generate granular-balls with a purity lower than the initial purity. The quality of such granular-balls is too low, which reduces the classification accuracy of the final GBkNN. For the minority samples in such a ball, the proportion of incorrectly classified samples can be regarded as a noise rate. Therefore, the purity of every granular-ball must be greater than the purity of the initial granular-ball.
Fig. 10(d) shows the result of granular-ball generation using the adaptive method. The two colored granular-balls in the figure represent two classes of data. In addition, Fig. 10(e) shows the result of granular-ball generation using the adaptive method when some label noise points are added. The blue points in the figure represent noise data, and the points of the other two colors represent the two original classes. The noise points are generated by randomly changing the labels of samples in the data set. The optimization goal of the adaptive granular-ball generation method can be expressed as
$$\begin{aligned}\min\ & \lambda_1 \cdot \frac{N}{\sum_{j=1}^{m}|GB_j|} + \lambda_2 \cdot m,\\ s.t.\ \ & P_w(GB_j) \ge P(GB_j),\\ & \|c_i - c_j\| \ge r_i + r_j,\ \ \forall\, y(GB_i) \neq y(GB_j),\\ & P(GB_j) \ge P_{lb},\ \ j = 1, \dots, m \end{aligned} \qquad (22)$$
where $\lambda_1$ and $\lambda_2$ are the corresponding weight coefficients, $c_j$ and $r_j$ represent the center and radius of $GB_j$ respectively, $P_{lb}$ denotes the adaptive purity lower bound of the granular-balls (the purity of the initial granular-ball), and $P_w(GB_j)$ is computed over the child granular-balls of $GB_j$. In addition, $N$ and $m$ are as defined above.
V-C Method Design
The basic idea of the adaptive granular-ball generation method is shown in Fig. 11.
In step 2 in Fig. 11, based on the accelerated granular-ball generation method, a k-division is used to split the granular-ball, where $k$ denotes the number of classes in the granular-ball; so, the computation cost is decreased directly. In addition, as shown in step 3, when the weighted purity sum of the child balls is greater than the purity of their parent ball and the purity of the granular-ball reaches the lower bound, the child balls are retained and it is detected whether there is overlap between heterogeneous granular-balls. As shown in Fig. 10(d), the boundary of the granular-balls when the algorithm converges is very consistent with that of the data set.
The algorithm design of the adaptive granular-ball generation method is given in Algorithm 2.
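Putting the three conditions of Section V-B together, the per-ball splitting decision in Algorithm 2 can be sketched as follows; the helper name and the exact ordering of the checks are assumptions made for illustration, and overlapping heterogeneous balls detected by Eq. (21) are refined in a separate pass.

```python
def should_split(parent, children, initial_purity):
    """Adaptive splitting test: force a split while the parent is still below the
    adaptive purity lower bound (condition 3); otherwise keep the children only
    if their weighted purity sum exceeds the parent's purity (condition 1)."""
    if parent.purity < initial_purity:
        return True
    weighted = sum(len(c.y) / len(parent.y) * c.purity for c in children)
    return weighted > parent.purity
```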
VI Experiments
To demonstrate the feasibility and effectiveness of the accelerated granular-ball generation method and the adaptive granular-ball generation method, we compared them with kNN and two popular state-of-the-art methods based on granular computing, GBkNN [17xia2019granular] and GBS [37xia2021granular]. Because of the robustness of the granular-ball, the experiments are carried out both on raw data sets and on noisy data sets. We verify the accuracy of the accelerated and adaptive granular-ball generation methods, and the efficiency of the accelerated method. We randomly selected ten real data sets from the UCI benchmark data sets, as shown in the following tables. Experimental hardware environment: a PC with an Intel Core i7-10700 CPU @ 2.90 GHz and 32 GB RAM. Experimental software environment: Python 3.9.
VI-A Experiments on Raw Data Sets
In this section, we split each data set into ten parts, take one part for testing, and use the test accuracy as the evaluation index to verify the effectiveness of the accelerated granular-ball generation method and the adaptive granular-ball generation method. Since granular-ball generation still has a certain randomness, we run each method ten times and take the average classification accuracy of the ten experiments for comparison. The kNN method uses the ten-fold cross-validation result.
Table I shows the experimental classification accuracy under noise-free conditions. "Acc", "Adp", "Origin", and kNN represent the accelerated granular-ball generation method, the adaptive granular-ball generation method, the existing granular-ball generation method, and kNN, respectively. "mean" and "max" denote the accuracy obtained using the average distance and the maximum distance as the granular-ball radius. The two proposed methods, the accelerated and the adaptive granular-ball generation method, obtain better accuracy than the existing method and kNN. The decision boundary obtained by the existing method is still not clear enough; therefore, when measuring the accuracy, the granular-ball closest to a test point is likely to be inaccurate. In our methods, the global division is performed after the splitting of the granular-balls has stopped. Therefore, the two proposed methods can obtain higher accuracy than the existing method on the raw data sets.
Data  Acc-mean  Acc-max  Adp-mean  Adp-max  Origin-mean  Origin-max  kNN
fourclass  0.990  0.987  0.988  0.973  0.990  0.958  0.999 
svmguide1  0.959  0.965  0.930  0.970  0.960  0.796  0.960 
diabetes  0.824  0.838  0.834  0.824  0.718  0.697  0.748 
breastcancer  0.950  0.972  0.977  0.965  0.982  0.962  0.973 
creditApproval  0.770  0.769  0.722  0.714  0.669  0.599  0.659 
votes  0.906  0.917  0.868  0.922  0.871  0.722  0.875 
svmguide3  0.831  0.828  0.812  0.818  0.786  0.776  0.788 
sonar  0.895  0.886  0.833  0.855  0.833  0.757  0.831 
splice  0.745  0.807  0.765  0.796  0.605  0.595  0.681 
mushrooms  0.994  1.000  1.000  1.000  0.993  0.715  1.000 
Average  0.886  0.897  0.873  0.884  0.841  0.758  0.851 
Data  Origin-GBS  Acc-GBS  Adp-GBS  kNN
fourclass  0.9890  0.9902  0.9942  0.9971 
svmguide1  0.9558  0.9612  0.9587  0.9596 
diabetes  0.7331  0.7494  0.7448  0.7312 
breastcancer  0.9644  0.9585  0.9696  0.9644 
creditApproval  0.6855  0.6725  0.6609  0.6623 
votes  0.8884  0.9000  0.9029  0.8870 
svmguide3  0.7835  0.7803  0.7863  0.7807 
sonar  0.8048  0.8476  0.8262  0.8048 
splice  0.6964  0.7265  0.7061  0.6750 
mushrooms  0.9994  1.0000  1.0000  1.0000 
Average  0.8500  0.8586  0.8550  0.8462 
The first three method columns in Table II are based on the GBS method. Firstly, the first three methods in Table II are used to generate the granular-balls; then the GBS method is used to sample the generated granular-balls; finally, kNN is used to classify the sampled result. The last column represents directly classifying the raw data set with kNN. Following [37xia2021granular], the purity threshold is set from 0.54 to 1.0 with a step size of 0.2 in the GBS algorithm. It can be seen that our two methods achieve higher accuracy on most data sets than the other two methods.
Data  fourclass  svmguide1  diabetes  breastcancer  creditApproval  votes  svmguide3  sonar  splice  mushrooms  Average 
balls  31  533  394  2  426  61  597  69  517  14  264 
balls+  31  390  348  15  364  60  524  73  506  39  235 
In order to show the efficiency of the accelerated granular-ball generation method, we choose the existing granular-ball generation method as the comparison method. Table III shows the running time of the two methods on the raw data sets, where "time+" and "time" denote the accelerated method and the existing granular-ball generation method, respectively. Table IV shows the comparison of the number of granular-balls generated by the accelerated method and the existing method on the raw data sets, where "balls+" and "balls" denote the accelerated method and the existing method, respectively. Comparing with the existing granular-ball generation method across Tables I–IV, the accelerated method has higher accuracy and efficiency on most data sets, while generating a similar number of granular-balls.
VI-B Experiments on Noise Data Sets
In this section, each data set is corrupted with four label-noise rates, namely 10%, 20%, 30% and 40%. Noise is generated by changing the labels of randomly selected samples in a data set. Tables V–VIII show, under the different noise rates, the highest average GBS test accuracy obtained by purity optimization for the existing method, together with the highest average GBS accuracy of the adaptive and accelerated granular-ball generation methods. The purity threshold is again set from 0.54 to 1.0 with a step size of 0.2 in the GBS algorithm. When splitting a granular-ball, the accelerated method and the adaptive method adopt the strategy of selecting heterogeneous sample points as the new clustering centers. This makes the algorithm converge faster, but it reduces accuracy when dealing with noisy data sets. It can also be seen from the experimental results on the noise data sets that the accelerated and adaptive granular-ball generation methods show a law similar to the existing granular-ball generation method on noisy data, that is, the larger the noise rate in the data set, the more obvious the advantage over the original kNN. However, it can also be seen that the adaptive method still shows slightly lower accuracy than the existing method on the noise data sets. It is possible that the adaptive purity lower bound of the granular-balls in the adaptive method is too low, so that some granular-balls with poor quality are generated, which affects the overall accuracy. Nevertheless, the adaptive method significantly improves the existing method by making it adaptive.
Data  Origin-GBS  Acc-GBS  Adp-GBS  kNN
fourclass  0.8815  0.8792  0.8763  0.8769 
svmguide1  0.8523  0.8625  0.8461  0.8428 
diabetes  0.6721  0.6935  0.6701  0.6578 
breastcancer  0.8711  0.8504  0.8593  0.8393 
creditApproval  0.6442  0.6225  0.6283  0.6123 
votes  0.8188  0.8029  0.7957  0.7957 
svmguide3  0.7297  0.7285  0.7088  0.6964 
sonar  0.7571  0.7357  0.7452  0.7571 
splice  0.6449  0.6485  0.6388  0.6173 
mushrooms  0.8926  0.8781  0.8358  0.8740 
Average  0.7764  0.7702  0.7604  0.7570 
Data  Origin-GBS  Acc-GBS  Adp-GBS  kNN
fourclass  0.7370  0.7948  0.7583  0.7046 
svmguide1  0.7602  0.7677  0.7098  0.7156 
diabetes  0.6214  0.6390  0.6065  0.5994 
breastcancer  0.7652  0.7941  0.7815  0.6852 
creditApproval  0.6210  0.5775  0.5695  0.5696 
votes  0.7217  0.7072  0.7014  0.6725 
svmguide3  0.6663  0.6775  0.6261  0.6108 
sonar  0.6786  0.6238  0.6452  0.6786 
splice  0.5929  0.5857  0.5740  0.5699 
mushrooms  0.7830  0.7584  0.6985  0.7314 
Average  0.6947  0.6926  0.6671  0.6537 
Data  Origin-GBS  Acc-GBS  Adp-GBS  kNN
fourclass  0.6711  0.6659  0.6653  0.6156 
svmguide1  0.6543  0.6809  0.6025  0.6019 
diabetes  0.5669  0.5877  0.5370  0.5266 
breastcancer  0.7052  0.6800  0.6467  0.6178 
creditApproval  0.5667  0.5688  0.5355  0.5355 
votes  0.6435  0.6362  0.6232  0.5928 
svmguide3  0.6116  0.6096  0.5735  0.5434 
sonar  0.5690  0.5786  0.5667  0.5690 
splice  0.5592  0.5500  0.5342  0.5245 
mushrooms  0.6881  0.6495  0.5921  0.6151 
Average  0.6236  0.6207  0.5877  0.5742 
Data  Origin-GBS  Acc-GBS  Adp-GBS  kNN
fourclass  0.5775  0.5468  0.5740  0.5312 
svmguide1  0.5711  0.5807  0.5340  0.5316 
diabetes  0.5117  0.5468  0.5026  0.4890 
breastcancer  0.5904  0.5778  0.5615  0.5230 
creditApproval  0.5080  0.5435  0.5145  0.4819 
votes  0.5841  0.5652  0.5362  0.5319 
svmguide3  0.5627  0.5498  0.5171  0.5104 
sonar  0.4881  0.5405  0.5690  0.4833 
splice  0.5321  0.5179  0.5184  0.5281 
mushrooms  0.5860  0.5641  0.5255  0.5338 
Average  0.5511  0.5533  0.5353  0.5144 
VII Conclusions and Future Work
This paper proposes a method for accelerating granular-ball generation, which greatly improves the efficiency of granular-ball generation while maintaining accuracy. At the same time, a new adaptive granular-ball generation method is proposed. This adaptive method avoids the need of the existing method to manually set the purity threshold parameter, and makes the generation process of the granular-balls completely adaptive. Experiments show that the accelerated method performs better than the adaptive method on both noisy and noise-free data. At the same time, experiments show that the accuracy of the adaptive method is slightly lower than that of the existing method. This demonstrates that our methods are effective; nevertheless, other adaptive criteria, such as the consistency of the internal distribution of granular-balls, may lead to more effective adaptive granular-ball optimization methods. Since the proposed methods exhibit lower accuracy than the existing method in some cases, we will study how to improve their accuracy in future work.
VIII Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, the Natural Science Foundation of Chongqing under Grant No. cstc2019jcyjcxttX0002 and by NICE: NRT for Integrated Computational Entomology, US NSF award 1631776.