I Introduction
With the advent of the era of big data, both the size of available datasets and the dimensionality of the data have grown dramatically [1, 2]. In machine learning and related fields, high dimensionality slows the training of models and heightens the difficulty of the learning task. High dimensionality can lead to overfitting, which reduces the generalizability of a model [3, 4]. At high dimensions, Euclidean distance fails as a usable metric, limiting the application range of models that rely on Euclidean distance [5, 6, 7, 8, 9]. Feature engineering was developed in response to these substantial problems [10, 11]. Feature selection is one of the most well-known methods in feature engineering; it aims to select a subset of the feature set that can replace the original feature set. Feature selection has significant remedial effects: it can reduce the feature dimension, increasing the training speed of a model; it can prevent overfitting, improving the generalizability of a model; and it can increase the correlation between features and predictions, making the model more interpretable [12, 13]. Feature selection methods can be divided into three categories: filter methods, wrapper methods, and embedded methods. Filter methods score features according to evaluation criteria, then sort the features in descending order of their assigned scores. Evaluation criteria usually fall into one of four categories: distance measures, information measures, dependence measures, and consistency measures [14]
. Wrapper methods treat feature selection as a feature subset optimization process, and use a classifier to evaluate feature subsets. Since each candidate feature subset requires training a classifier, most wrapper methods are inefficient, and research into wrapper methods therefore usually focuses on the optimization process. Embedded methods embed feature selection into the training process of the learning algorithm to screen out the features that are important for model training
[15, 16]. Feature selection algorithms based on rough set theory rely on attribute reduction, which classes them as filter methods. Rough set theory was proposed by the Polish scientist Zdzisław Pawlak in 1982 [17]. It is an effective mathematical tool for processing uncertain, inconsistent, and incomplete data, and it has been widely applied in data mining, machine learning, decision support systems, and other application fields [18, 19, 20, 21, 22, 23, 24]. The classical rough set is also called the Pawlak rough set. Pawlak rough set theory achieves this utility by using an equivalence relation to divide samples into several equivalence classes, and then defining upper and lower approximation sets using unions of the equivalence classes [17]. The upper and lower approximations are used to describe and approximate uncertain concepts, and samples are divided into a positive region, a boundary region, and a negative region during this process. The number of samples in the positive region is used to measure the dependence of a label on the feature set; that is, the score of the feature set. Heuristic attribute reduction algorithms based on rough set theory can effectively reduce the time complexity of high-dimensional problems, making this a rich field of research in recent decades [25, 26, 27, 28, 29, 30, 31]. Pawlak rough set (PRS) and neighborhood rough set (NRS) are the two most popular rough set theories. In the feature selection process, PRS granulates a dataset based on equivalence classes. An equivalence class consists of a set of attributes and a set of objects, and can describe certain knowledge, so it provides good interpretability. However, this also means that PRS can only process discrete data. Data in the real world is mostly continuous, and its discretization will inevitably cause loss of information, thus presenting a serious hindrance to the development and application of rough set theory. To solve this problem, Hu, Yu, and Xie proposed the NRS model [32] based on the idea of Lin's neighborhood model [33]. NRS uses a neighborhood relation instead of an equivalence relation to granulate datasets, thus enabling NRS to process continuous data directly. However, in the NRS model, the upper and lower approximations consist of sample points instead of equivalence classes, so NRS loses interpretability. Moreover, the inconsistency between PRS and NRS makes rough set theory less concise than it could be.
Our main contributions are as follows:


We propose a novel rough set model, called the granular-ball rough set, for unifying PRS and NRS by introducing granular-balls into rough set theory.

The proposed granular-ball rough set is the first rough set model that can naturally process continuous data while retaining the interpretability of equivalence classes.

Owing to the combination of the robustness and adaptability of granular-ball computing, the learning accuracy of the granular-ball rough set is greatly improved compared with PRS and NRS. The granular-ball rough set also outperforms seven other popular or state-of-the-art feature selection methods.

As GBRS can use equivalence classes to represent the upper and lower approximations while processing continuous data, we further propose a granular-ball rough concept tree. This makes GBRS a strong mining tool that can process continuous data and realize feature selection, knowledge representation, and classification at the same time.
The rest of this paper is organized as follows: we introduce related work in Section II. The theoretical basis of the granular-ball rough set is presented in Section III. Section IV details our newly proposed granular-ball rough set (GBRS) model, and experimental results and analysis are presented in Section V. We present our conclusions in Section VI.
II Related Work
Rough set theory is mainly used for feature selection. So, in this section, we present a more detailed discussion of the prior work in the three categories of feature selection methods, as well as in rough set theory.
II-A Filter Methods
The essence of filter methods is to use statistical indicators to score features, such as the Pearson correlation coefficient, the Gini coefficient, the Kullback-Leibler divergence, the Fisher score, similarity measures, and so forth. Since filter methods only use the dataset itself and do not rely on specific classifiers, they are very versatile and easy to extend. Compared with wrapper and embedded methods, filter methods generally have a lower algorithmic complexity. At the same time, the classification accuracy of filter methods is usually the lowest among the three types of methods. Filter methods also score only a single feature, rather than an entire feature subset, and thus the feature subsets generated by filter methods usually have high redundancy.
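As a concrete and deliberately minimal illustration of how a filter method assigns each feature an individual score, the sketch below computes the classical Fisher score with NumPy. The function name and the toy data are our own and are not taken from any of the cited methods.

```python
import numpy as np

def fisher_scores(X, y):
    """Fisher criterion per feature: between-class scatter of the class
    means divided by the (weighted) within-class variance."""
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    numer = np.zeros(X.shape[1])
    denom = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        numer += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        denom += len(Xc) * Xc.var(axis=0)
    return numer / (denom + 1e-12)  # epsilon guards constant features

# Feature 0 separates the two classes; feature 1 is near-constant noise.
X = np.array([[0.1, 5.0], [0.2, 4.8], [0.9, 5.1], [1.0, 4.9]])
y = np.array([0, 0, 1, 1])
scores = fisher_scores(X, y)
print(scores.argmax())  # 0: the discriminative feature ranks first
```

Note that each feature is scored in isolation, which is exactly why feature subsets selected this way can be redundant.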
Gu, Li, and Han proposed a generalized Fisher score feature selection method, aiming to find a feature subset that maximizes the lower bound of the Fisher score [34]. This method transforms feature selection into a quadratically constrained linear program, and uses a cutting plane algorithm to solve the problem. Roffo and Melzi proposed a graph-based feature selection method, ranking the most important features by treating them as arbitrary sets of cues [35]. This method maps feature selection onto an affinity graph by assigning features to nodes, and then evaluates the importance of each node via eigenvector centrality. In a later work, Roffo proposed the Inf-FS feature selection method, which assigns features to the nodes of a graph and views feature subsets as paths in the graph [36]. The power series property of matrices is used to evaluate the paths, and the computational complexity is reduced by adding paths until the length reaches infinity.
II-B Wrapper Methods
Wrapper methods use learning algorithms to evaluate features, and the classification accuracy of wrapper methods is often higher than that of filter methods. At the same time, the classifier used for evaluation limits the method, and the feature subset obtained by wrapper methods tends to have lower versatility. For each feature subset, a wrapper method needs to train a classifier, resulting in a high computational complexity which depends on the search strategy of the feature subset. However, wrapper methods do evaluate the entire feature subset rather than a single feature and take into account the dependency between features, so the redundancy of the resulting feature subset is often lower than that of filter methods.
The support vector machine (SVM) is a commonly used learning algorithm in wrapper methods. Guyon, Weston, Barnhill, and Vapnik proposed a feature selection method using SVM in combination with recursive feature elimination [37]. The method constructs the ranking coefficients of features according to the weight vector generated by the SVM during training. In each iteration, the feature with the smallest ranking coefficient is removed, and finally a ranking of all features in descending order is obtained. Guo, Kong, and He proposed a feature selection method based on clustering, which uses a triplet-based ordinal locality-preserving loss function to capture the local structures of the original data [38]. The method defines an alternating optimization algorithm based on half-quadratic minimization to speed up the optimization process. Guo and Zhu developed another wrapper method, Dependence Guided Unsupervised Feature Selection (DGUFS), to overcome the single-feature-scoring problem of filter methods, using a joint learning framework for feature selection and clustering [39]. DGUFS is a projection-free feature selection model based on norm equality constraints and two defined dependence-guided terms which increase the correlation between the original data, cluster labels, and the selected features.
II-C Embedded Methods
Embedded methods embed feature selection into the learning algorithm, and the feature subset can be obtained when the training process of the learning algorithm has completed. This type of method is similar to filter methods, but the score of each feature is determined through model training. The idea behind these methods is to select those features important to the training of the model during the process of determining the model. Embedded methods are a compromise between filter methods and wrapper methods. Compared to filter methods, embedded methods can achieve a higher classification accuracy; compared to wrapper methods, embedded methods have lower algorithm complexity and are not as prone to overfitting.
Bradley and Mangasarian proposed an embedded feature selection method based on concave minimization and SVM [40]. This method finds a separating plane which distinguishes two point sets in the n-dimensional feature space while using as few features as possible. This method not only minimizes the weighted sum of the distances between incorrectly classified points and the boundary plane, but also maximizes the distance between the two boundary planes of the separating plane. Embedded methods are often based on regression learning algorithms. Nie, Huang, Cai, and Ding proposed an efficient and robust feature selection method using a loss function based on $\ell_{2,1}$-norms to remove outliers [41]. This method adopts joint $\ell_{2,1}$-norm minimization on both the loss function and the regularization, and proposes an effective algorithm to solve the joint norm minimization problem. Yang et al. proposed a feature selection method, Unsupervised Discriminative Feature Selection (UDFS), which also uses the $\ell_{2,1}$-norm [42]. UDFS optimizes an $\ell_{2,1}$-norm regularized minimization loss function, which uses discriminative information and the local structure of the data distribution.
II-D Rough Set Theory
Feature selection methods based on rough set theory belong to the category of filter methods. These methods use the positive region from rough set theory to score features. PRS granulates a dataset based on an equivalence relation, which provides good interpretability, but this also means it can only process discrete data, while data in the real world is mostly continuous. Much research effort has been devoted to overcoming the inability of rough set theory to process continuous data. This research can be roughly divided into two categories: discretizing continuous data, or proposing improved rough set models. For decades, rough set models based on data discretization have proliferated [43, 44, 45]. But the discretization of data inevitably leads to loss of information, and the discretization results change with the discretization method. In light of this, some researchers have proposed improved rough set models that can directly process continuous data. Dubois and Prade combined rough sets with another concept, fuzzy sets [46], and proposed fuzzy rough sets [47], which replace the equivalence relation of classic rough sets with a fuzzy similarity relation, so that fuzzy rough sets can process continuous data. However, fuzzy rough set models need a membership function set in advance using a priori knowledge of the dataset, which reduces the generality of fuzzy rough sets.
In contrast to fuzzy rough sets, NRS [32] uses a neighborhood relation to describe the relationships between samples. This neighborhood relation is completely derived from the data distribution and does not require any a priori knowledge. At the same time, NRS can also process continuous data directly. Because of these advantages, the field of NRS has been under continuous study and development. Li and Xie proposed a method to accelerate NRS based on an incremental attribute subset [48]. Gao, Liu, and Ji used a matrix to preserve measurement calculation results, requiring only one dimension-measurement calculation after a dimension increase and thereby reducing the amount of calculation required to find the positive region [49]. In NRS, the neighborhood radius is a parameter that has a large impact on the reduction results and must be set manually; how this parameter is chosen is also a frequent subject of research. Peng, Liu, and Ji designed a fitness function that combines the properties of datasets and classifiers to select the optimal neighborhood radius from a given neighborhood radius interval [50]. Xia et al. proposed an adaptive NRS model by combining granular-ball computing with NRS, which can automatically optimize the neighborhood radius [51]. The above NRS methods use a neighborhood relation instead of an equivalence relation to granulate datasets, thus enabling NRS to process continuous data directly. However, in the NRS model, the upper and lower approximations consist of sample points instead of equivalence classes, so NRS loses interpretability. Moreover, the inconsistency between PRS and NRS makes rough set theory less concise than it could be. In this paper, we propose a novel rough set model named granular-ball rough set (GBRS), which can unify PRS and NRS. It not only has the interpretability of equivalence classes but can also process continuous data naturally.
III The Theoretical Basis of Granular-ball Rough Set
In this section, in order to lay a foundation for our theorem and proof, we review some of the basic concepts of PRS and NRS, which have been presented in our previous work [51]. In addition, granular-ball computing is the main basis of the proposed method, so we also introduce it in this section.
III-A Pawlak Rough Set
We first introduce the information system and the indiscernibility relation.
Definition 1. [51] Let a quaternion $IS = (U, A, V, f)$ represent an information system, where:
$U = \{x_1, x_2, \ldots, x_n\}$ denotes a nonempty finite set of objects; $U$ is called the universe;
$A = \{a_1, a_2, \ldots, a_m\}$ denotes a nonempty finite set of attributes;
$V = \bigcup_{a \in A} V_a$ denotes the set of all attribute values, where $V_a$ denotes the value range of attribute $a$;
$f$ denotes a mapping function: $f: U \times A \rightarrow V$, $\forall a \in A,\ x \in U,\ f(x, a) \in V_a$.
This information system is called a decision system if the set of attributes $A$ in the information system above satisfies $A = C \cup D$, $C \cap D = \emptyset$, and $C, D \neq \emptyset$, where $C$ is the condition attribute set and $D$ is the decision attribute set.
Definition 2. [51] Let $IS = (U, A, V, f)$ be an information system. $\forall B \subseteq A$ and $x, y \in U$, the indiscernibility relation of the attribute subset $B$ is defined as

$IND(B) = \{(x, y) \in U \times U \mid \forall a \in B,\ f(x, a) = f(y, a)\}.$  (1)

In PRS algorithms, $f(x, a)$ represents $x$'s value on the attribute $a$. So, $f(x, a) = f(y, a)$ represents that the sample $x$ has the same value as the sample $y$ on the attribute $a$. In fact, $(x, y) \in IND(B)$ shows that the values of samples $x$ and $y$ are the same under every attribute in the subset $B$; that is, under the description of the attribute subset $B$, samples $x$ and $y$ are indiscernible.
$IND(B)$ is symmetric, reflexive, and transitive; that is, $\forall B \subseteq A$, $IND(B)$ is an equivalence relation on $U$ (abbreviated as $R_B$). $R_B$ creates a partition of $U$, denoted $U/IND(B)$ and abbreviated as $U/B$. The characteristics of $U/B$ are as follows: suppose $U/B = \{X_1, X_2, \ldots, X_k\}$; then $X_i \neq \emptyset$, $X_i \cap X_j = \emptyset$ ($i \neq j$), and $\bigcup_{i=1}^{k} X_i = U$; that is, $U$ is divided into $k$ disjoint parts by $B$. An element $X_i$ in $U/B$ is called an equivalence class. This leads us to our next set of definitions, approximations based on the equivalence relation $R_B$.
Definition 3. [51] Let $IS = (U, A, V, f)$ be an information system. $\forall B \subseteq A$, there is a corresponding equivalence relation $R_B$ on $U$. Then, $\forall X \subseteq U$, the upper and lower approximations of $X$ with respect to $B$ are defined as follows:

$\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\},$  (2)

$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\},$  (3)

where $[x]_B$ denotes the equivalence class of $x$ under $R_B$. The lower approximation $\underline{B}(X)$ represents the set of samples in $U$ that are determined to belong to $X$ according to the equivalence relation $R_B$. It essentially reflects the ability of the equivalence relation $R_B$ to approximately describe the knowledge contained in $X$ by a partition of the knowledge of the universe $U$. It is also commonly called the positive region of $X$ in $U$, which is abbreviated as $POS_B(X)$.
Definition 4. [51] Let $DS = (U, C \cup D, V, f)$ be a decision system. We notate the partition of the universe $U$ by the decision attribute set $D$ into equivalence classes by $U/D = \{Y_1, Y_2, \ldots, Y_k\}$. $\forall B \subseteq C$, there is a corresponding equivalence relation $R_B$ on $U$. The upper and the lower approximations of $D$ with respect to $B$ are respectively defined as

$\overline{B}(D) = \bigcup_{i=1}^{k} \overline{B}(Y_i),$  (4)

$\underline{B}(D) = \bigcup_{i=1}^{k} \underline{B}(Y_i).$  (5)
Definition 5. [51] Let $DS = (U, C \cup D, V, f)$ be a decision system. $\forall B \subseteq C$, the positive region and boundary region of $D$ with respect to $B$ are respectively defined as:

$POS_B(D) = \underline{B}(D),$  (6)

$BND_B(D) = \overline{B}(D) - \underline{B}(D).$  (7)
The size of the positive region reflects the separability of the classification problem in a given attribute space. The larger the positive region, the more precisely the classification problem can be described using this attribute set. We find it useful to describe this mathematically: the dependence of $D$ on $B$ is defined as

$\gamma_B(D) = \frac{|POS_B(D)|}{|U|},$  (8)

where $|\cdot|$ is the cardinality of a set and $0 \leq \gamma_B(D) \leq 1$. Obviously, the larger the positive region, the stronger the dependence of $D$ on $B$.
The dependency function defines the contribution of conditional attributes to a classification, so it can be used as an evaluation index for the importance of the attribute set.
Definition 6. [51] Given a decision system $DS = (U, C \cup D, V, f)$, $B \subseteq C$ and $\forall a \in C - B$, the importance of $a$ relative to $B$ is defined as

$Sig(a, B, D) = \gamma_{B \cup \{a\}}(D) - \gamma_B(D).$  (9)

Rough set theory uses the measurement in (9) to select attributes in a forward way. The selection result $red$ is initialized to $\emptyset$; in each round, the attribute $a \in C - red$ with the largest value of $Sig(a, red, D)$, which should be larger than 0, is selected into $red$. This process is repeated until no $Sig(a, red, D)$ is greater than 0.
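To make the forward selection procedure concrete, the following minimal Python sketch (our own illustration, not the authors' implementation) builds the equivalence classes of $IND(B)$, computes the dependence $\gamma_B(D)$ of (8) as the fraction of objects lying in label-pure equivalence classes, and greedily adds the attribute with the largest positive significance (9):

```python
from collections import defaultdict

def partition(U, attrs):
    """Group object indices into the equivalence classes of IND(attrs)."""
    blocks = defaultdict(list)
    for i, x in enumerate(U):
        blocks[tuple(x[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependence(U, labels, attrs):
    """gamma_B(D) = |POS_B(D)| / |U|: fraction of objects whose
    equivalence class is pure with respect to the decision."""
    if not attrs:
        return 0.0
    pos = sum(len(block) for block in partition(U, attrs)
              if len({labels[i] for i in block}) == 1)
    return pos / len(U)

def forward_reduce(U, labels, cond_attrs):
    """Greedy forward attribute reduction driven by significance (Eq. 9)."""
    red = []
    while True:
        base = dependence(U, labels, red)
        best, best_gain = None, 0.0
        for a in cond_attrs:
            if a in red:
                continue
            gain = dependence(U, labels, red + [a]) - base
            if gain > best_gain:
                best, best_gain = a, gain
        if best is None:       # no attribute raises the dependence
            return red
        red.append(best)

# Toy discrete decision table: attribute 1 alone determines the label.
U = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 0, 1]
print(forward_reduce(U, labels, [0, 1]))  # [1]
```

On the toy table, attribute 1 alone determines the label, so the greedy procedure stops after selecting it.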
III-B Neighborhood Rough Set
After introducing NRS somewhat loosely, we now drill down into the details, defining the basic spaces we are operating in, the neighborhoods we are working with in NRS, and the positive region we have mentioned, which is key to the operation of these methods.
Definition 7. [51] Let $\Delta: \Omega \times \Omega \rightarrow \mathbb{R}$ be a function generated on a set $\Omega$. $(\Omega, \Delta)$ is known as a metric space if $\Delta$ satisfies:
(1) $\Delta(x_1, x_2) \geq 0$, and $\Delta(x_1, x_2) = 0$ if and only if $x_1 = x_2$;
(2) $\Delta(x_1, x_2) = \Delta(x_2, x_1)$;
(3) $\Delta(x_1, x_3) \leq \Delta(x_1, x_2) + \Delta(x_2, x_3)$.
In this case, $\Delta$ is known as a metric.
Definition 8. [51] Let $U$ be a nonempty finite set of real space. $\forall x_i \in U$, the $\delta$-neighborhood of $x_i$ is defined as:

$\delta(x_i) = \{x \in U \mid \Delta(x, x_i) \leq \delta\},$  (10)

where $\delta \geq 0$.
Definition 9. [51] Let $NDS = (U, C \cup D, V, f)$ be a neighborhood decision system. The decision attribute set $D$ divides $U$ into $N$ equivalence classes: $X_1, X_2, \ldots, X_N$. $\forall B \subseteq C$, the lower approximation and the upper approximation of the decision attribute set $D$ with respect to the condition attribute set $B$ are respectively defined as:

$\underline{N_B}(D) = \bigcup_{i=1}^{N} \underline{N_B}(X_i),$  (11)

$\overline{N_B}(D) = \bigcup_{i=1}^{N} \overline{N_B}(X_i),$  (12)

where $\underline{N_B}(X_i) = \{x_j \mid \delta_B(x_j) \subseteq X_i,\ x_j \in U\}$ and $\overline{N_B}(X_i) = \{x_j \mid \delta_B(x_j) \cap X_i \neq \emptyset,\ x_j \in U\}$, and its positive region and boundary region are respectively defined as $POS_B(D) = \underline{N_B}(D)$ and $BND_B(D) = \overline{N_B}(D) - \underline{N_B}(D)$.
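The following short sketch (illustrative only; the names and the toy data are ours) computes the NRS positive region directly from Definition 9: an object belongs to the positive region when its $\delta$-neighborhood is label-pure. It also shows how strongly the result depends on the choice of $\delta$:

```python
import numpy as np

def nrs_positive_region(X, y, delta):
    """Objects whose delta-neighborhood (Euclidean metric over the chosen
    attributes) is label-pure belong to the positive region."""
    pos = []
    for i in range(len(X)):
        # delta-neighborhood of x_i: all samples within distance delta
        dists = np.linalg.norm(X - X[i], axis=1)
        neigh = np.where(dists <= delta)[0]
        if np.all(y[neigh] == y[i]):
            pos.append(i)
    return pos

X = np.array([[0.0], [0.1], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])
print(nrs_positive_region(X, y, delta=0.2))  # [0, 1, 2, 3]
print(nrs_positive_region(X, y, delta=1.0))  # []
```

With a small radius every neighborhood is pure, while a radius comparable to the class gap empties the positive region entirely, which is the radius sensitivity discussed in the surrounding text.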
III-C Granular-ball Computing
Building on the theory of traditional granular computing, and based on the research results published by Chen in Science in 1982, which pointed out that "human cognition has the characteristic of large-scale priority" [52], Wang put forward granular cognitive computing [53]. Based on granular cognitive computing, granular-ball computing is a new, efficient and robust granular computing method proposed by Xia and Wang [54], the core idea of which is to use "granular-balls" to cover or partially cover the sample space. Given a granular-ball $GB = \{x_1, x_2, \ldots, x_k\}$, where $x_i$ represents the objects in $GB$ and $k$ is the number of objects in $GB$, $GB$'s center $c$ and radius $r$ are respectively represented as follows:

$c = \frac{1}{k} \sum_{i=1}^{k} x_i,$  (13)

$r = \frac{1}{k} \sum_{i=1}^{k} \Delta(x_i, c).$  (14)

This means that the radius is equal to the average distance from all objects in $GB$ to its center. The radius can also be set to the maximum distance. The granular-ball with center $c$ and radius $r$ is used as the input of the learning method, or as an accurate measurement to represent the sample space, achieving multi-granularity learning characteristics (that is, scalability, multiple scales, etc.) and an accurate characterization of the sample space. The basic process of granular-ball generation for classification problems in granular-ball computing is shown in Figure 1.
As shown in Figure 1, to simulate "the characteristic of large-scale priority of human cognition", at the beginning of the algorithm the whole dataset is regarded as one granular-ball. At this time, the purity of the granular-ball is the worst, and it cannot describe any distribution characteristics of the data. The "purity" is used to measure the quality of a granular-ball [54]; it is equal to the proportion of the majority label in the granular-ball. Then, the number of different class labels in the granular-ball is counted, and the granular-ball can be split into that many granular-balls. The next step is to calculate the purity of each granular-ball. This is the key step, because purity is the criterion for deciding whether a granular-ball needs to continue to split. As the splitting process advances, the purity of the granular-balls increases and the decision boundary becomes increasingly clear; when the purity of all granular-balls meets the requirement, the boundary is clearest and the algorithm converges. Granular-ball computing has been developed into granular-ball classifiers [54], granular-ball clustering [55], the granular-ball neighborhood rough set [51], and granular-ball sampling methods [56].
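The splitting loop just described can be sketched in a few dozen lines of Python. This is a simplified, hedged re-implementation under our own assumptions (a binary 2-means split, purity threshold 1, a minimum ball size, and an arbitrary fallback split when 2-means degenerates), not the authors' code:

```python
import numpy as np

def purity(y):
    """Fraction of the majority label within one granular-ball."""
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / counts.sum()

def two_means_split(X, seed=0):
    """A tiny 2-means: returns a boolean mask selecting one of two halves."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(20):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for k in (0, 1):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(axis=0)
    return assign == 0

def generate_granular_balls(X, y, purity_threshold=1.0, min_size=2):
    """Start from one ball covering the whole dataset; keep splitting any
    ball whose purity is below the threshold and whose size allows it."""
    queue, balls = [(X, y)], []
    while queue:
        bx, by = queue.pop()
        if purity(by) >= purity_threshold or len(bx) <= min_size:
            balls.append((bx, by))
            continue
        mask = two_means_split(bx)
        if mask.all() or (~mask).all():
            mid = len(bx) // 2          # degenerate 2-means: halve arbitrarily
            queue += [(bx[:mid], by[:mid]), (bx[mid:], by[mid:])]
        else:
            queue += [(bx[mask], by[mask]), (bx[~mask], by[~mask])]
    return balls

# Two well-separated Gaussian blobs with 20 samples per class.
X = np.r_[np.random.default_rng(1).normal(0, 0.1, (20, 2)),
          np.random.default_rng(2).normal(1, 0.1, (20, 2))]
y = np.array([0] * 20 + [1] * 20)
balls = generate_granular_balls(X, y)
print(all(purity(by) == 1.0 or len(by) <= 2 for _, by in balls))  # True
```

Each resulting ball can then be summarized by its center and radius as in (13) and (14).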
IV Granular-ball Rough Sets
IV-A Motivation
The main difference between the upper and lower approximations in the PRS and NRS models is that, as shown in Definition 3 and Definition 9 respectively, the former consist of equivalence classes, which can be used to represent knowledge and are interpretable, whereas the latter consist of sample points, which have no interpretability. If we want to use equivalence classes to describe the upper and lower approximations of NRS, a straightforward approach is to treat all objects within a neighborhood radius as an equivalence class. However, we find that this may make two equivalence classes with different decision labels equal. We call this phenomenon "heterogeneous transmission". It is illustrated in detail in Figure 2. As shown in Figure 2, according to Definition 9, the objects whose neighborhoods are label-pure belong to the positive region, while an object whose neighborhood contains both classes belongs to the boundary region. Heterogeneous transmission appears in the intersecting area of the neighborhoods of two positive-region objects with different labels; this intersecting area is called the "transmission area". When we define the objects belonging to a given neighborhood as an equivalence class, the label of one neighborhood equivalence class is "+1" and the label of the other is "-1". However, a new object in the transmission area is equivalent to both neighborhood equivalence classes at the same time. This makes two equivalence classes with different labels equivalent, which is obviously harmful for learning. To avoid the heterogeneous transmission phenomenon, one method is to set the neighborhood radius small enough. However, this may make most or all objects always belong to the positive region, so that the positive region cannot be effectively used for measuring feature importance or other learning tasks. Overall, the heterogeneous transmission phenomenon is caused by the overlap between positive-region neighborhoods with different labels in NRS.
IV-B Granular-ball Rough Set
As shown in Fig. 3, because the neighborhood radii are adaptively different, the overlap between positive-region neighborhoods does not exist in granular-ball computing. Therefore, it is possible to use equivalence classes to represent the upper and lower approximations by introducing granular-ball computing to represent neighborhoods. The granular-ball-computing-based rough set is called the "granular-ball rough set", and its model is defined and described as follows.
Definition 10. Let $U$ be a nonempty finite set of real space. $\forall x \in U$, a granular-ball is defined as:

$GB = \{x \in U \mid \Delta(x, c) \leq r\},$  (15)

where $c$ and $r$ denote the center and the radius of $GB$ respectively. The larger the $r$, the coarser the granular-ball $GB$; otherwise, the finer the granular-ball $GB$.
Definition 11. Let $IS = (U, A, V, f)$ be an information system. $\forall B \subseteq A$ and $x, y \in U$, the indiscernible granular-ball relation of the attribute subset $B$ is defined as

$IND'(B) = \{(x, y) \in U \times U \mid \exists GB_i,\ x \in GB_i \wedge y \in GB_i\}.$  (16)

If $(x, y) \in IND'(B)$, the relationship between $x$ and $y$ is denoted as $x\,IND'(B)\,y$.
In the granular-ball rough set, each object belongs to exactly one granular-ball. So, $(x, y) \in IND'(B)$ represents that $x$ and $y$ belong to the same granular-ball under the given attribute set $B$. $\forall B \subseteq A$, $IND'(B)$ is an equivalence relation on $U$ (abbreviated as $R'_B$). Because the granular-balls do not overlap, $R'_B$ can also create a partition of $U$, denoted $U/IND'(B)$ and abbreviated as $U/B$. An element in $U/B$ is an equivalence class generated by granular-ball computing.
Definition 12. Given a decision system, if two granular-balls $GB_i$ and $GB_j$ ($i \neq j$) have the same decision label, then $GB_i$ and $GB_j$ belong to the same equivalence class.
As no overlap exists between granular-balls with different labels in GBRS, Definition 12 means that granular-balls with the same label belong to one equivalence class. The overlap between positive-region neighborhoods, i.e., granular-balls, with the same label is not considered in this method because it does not lead to heterogeneous transmission and does not affect decisions; besides, considering this overlap in the algorithm design would increase the computation cost.
Properties of GBRS. Given an information system $IS = (U, A, V, f)$, $\forall B \subseteq A$ and $\forall x, y, z \in U$, $IND'(B)$ represents the indiscernible granular-ball relation of the attribute subset $B$ on $U$. The indiscernible granular-ball relation obviously has the following properties:

Symmetry: if $(x, y) \in IND'(B)$, then $(y, x) \in IND'(B)$;

Reflexivity: $(x, x) \in IND'(B)$;

Transitivity: if $(x, y) \in IND'(B)$ and $(y, z) \in IND'(B)$, then $(x, z) \in IND'(B)$.
In summary, similar to $IND(B)$ in PRS, $IND'(B)$ is symmetric, reflexive, and transitive, and is completely consistent with $IND(B)$ in PRS.
Based on the equivalence classes in $U/IND'(B)$, the definitions of the positive region and the upper and lower approximations are the same as those in PRS. Therefore, the GBRS model is consistent with the PRS model. Their specific definitions are as follows:
Definition 13. Let $IS = (U, A, V, f)$ be an information system. $\forall B \subseteq A$, there is a corresponding equivalence relation $R'_B$ on $U$. Then, $\forall X \subseteq U$, the upper and lower approximations of $X$ with respect to $B$ are defined as follows:

$\overline{B}(X) = \{x \in U \mid [x]'_B \cap X \neq \emptyset\},$  (17)

$\underline{B}(X) = \{x \in U \mid [x]'_B \subseteq X\},$  (18)

where $[x]'_B$ denotes the granular-ball equivalence class of $x$.
Definition 14. Let $DS = (U, C \cup D, V, f)$ be a decision system. We notate the partition of the universe $U$ by the decision attribute set $D$ into equivalence classes by $U/D = \{Y_1, Y_2, \ldots, Y_k\}$. $\forall B \subseteq C$, there is a corresponding equivalence relation $R'_B$ on $U$. The upper and the lower approximations of $D$ with respect to $B$ are respectively defined as

$\overline{B}(D) = \bigcup_{i=1}^{k} \overline{B}(Y_i),$  (19)

$\underline{B}(D) = \bigcup_{i=1}^{k} \underline{B}(Y_i).$  (20)
According to Definition 14, a granular-ball whose purity is equal to 1, i.e., in which all samples have the same decision label, belongs to the lower approximation (i.e., the positive region described in Definition 15) in a decision system.
Definition 15. Let $DS = (U, C \cup D, V, f)$ be a decision system. $\forall B \subseteq C$, the positive region and boundary region of $D$ with respect to $B$ are respectively defined as:

$POS_B(D) = \underline{B}(D),$  (21)

$BND_B(D) = \overline{B}(D) - \underline{B}(D).$  (22)
Exactly as in the Pawlak rough set, the size of the positive region reflects the separability of the classification problem in a given attribute space: the larger the positive region, the more precisely the classification problem can be described using this attribute set. The dependence of $D$ on $B$ and the attribute importance are also defined in the same way as in the Pawlak rough set.
In the GBRS model, when the radius of each granular-ball is set to an infinitely small positive number, GBRS is transformed into PRS. When the PRS algorithm is designed from the perspective of GBRS, it is called the granular-ball PRS (GBPRS). GBPRS and PRS have the same experimental results; however, their algorithm designs are different, and the former generates equivalence classes using granular-ball computing. When the radius of each granular-ball is not set to zero, GBRS is transformed into the granular-ball NRS (GBNRS). As GBNRS not only can use equivalence classes to represent knowledge, but is also much more efficient than the traditional NRS, which contains many overlaps, GBNRS can completely replace the traditional NRS. In other words, GBNRS is the representative method of neighborhood rough sets. In summary, GBRS is a unified model of GBPRS and GBNRS.
In addition, as shown in Fig. 5(f), GBNRS can flexibly fit the data distribution using granular-balls with varying radii, which is obviously better than methods using a fixed radius, such as PRS and the traditional NRS. So, GBNRS can achieve a higher accuracy than those two algorithms. Moreover, the combination of the robustness and adaptability of granular-ball computing helps GBNRS perform well in accuracy. This robustness of GBNRS is reflected in the fact that, since a noise point will fall into a small granular-ball, the characteristics of a large neighborhood, i.e., whether it belongs to the positive region or not, will not be affected by the noise point. This robustness does not exist in most other methods, such as the traditional NRS, which uses a fixed radius. These points will be demonstrated in the experiments.
IV-C Implementation of GBNRS
As GBNRS shares a unified model with PRS, as shown in Fig. 4, its whole algorithm process is completely the same as that of PRS. The only difference between GBRS and PRS is the way the positive region is generated, shown in step 2 of Fig. 4: GBRS generates the positive region using granular-balls. For GBNRS, in the granular-ball generation, to fulfill Definition 14, the purity threshold is set to 1. Besides, referring to [57], the lower bound of the size of a granular-ball, i.e., the number of samples in it, is optimized from $2m$ down to 2 with a step of 1, where $m$ denotes the number of conditional attributes in the dataset. When the size of a granular-ball is lower than this bound or its purity reaches 1, the granular-ball stops splitting. According to Definition 15, a granular-ball whose purity is equal to 1 belongs to the positive region, and a granular-ball whose purity is lower than 1 belongs to the boundary region. Besides, to decrease the randomness in the granular-ball generation and to make the positive regions of the granular-balls in the attribute selection process more comparable, for each granular-ball, the samples with the smallest indexes in it are selected as the initial centroids in the splitting process. The flowchart is shown in Fig. 4. The specific process is mainly composed of five steps.
Fig. 5 shows the granular-ball generation process of GBNRS. The red points and red granular-balls are labeled '+1', and the green points and green granular-balls are labeled '-1'. Firstly, the whole dataset is regarded as one granular-ball; it is divided into two granular-balls using 2-means because it contains samples of two different classes, as shown in Fig. 5(a). Figs. 5(b), (c), (d) and (e) are the intermediate iteration results. A granular-ball continues to be split if its purity is below the threshold and its size is above the lower bound. Fig. 5(e) contains the phenomenon of heterogeneous transmission. It is eliminated by splitting the heterogeneously overlapped granular-balls to remove the overlap of heterogeneous balls; the results are shown in Fig. 5(f). In Fig. 5(f), no pair of heterogeneous granular-balls contains any common samples. Besides, as shown in Fig. 5(f), the granular-balls containing both green and red sample points, i.e., the black granular-balls, belong to the boundary region.
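Once the granular-balls are generated, the positive region and the dependence function follow directly from Definition 15: a ball of purity 1 contributes all of its samples to the positive region. A minimal sketch (the helper name and the toy splitting result below are hypothetical, for illustration only):

```python
import numpy as np

def gb_dependence(balls, n_total):
    """Dependence gamma_B(D): fraction of samples covered by granular-balls
    of purity 1 (the positive region, per Definition 15)."""
    pos = 0
    for labels in balls:
        _, counts = np.unique(labels, return_counts=True)
        if counts.max() == counts.sum():    # purity == 1: positive region
            pos += counts.sum()
    return pos / n_total

# Hypothetical splitting result: the label arrays of three granular-balls.
balls = [np.array([0, 0, 0]), np.array([1, 1]), np.array([0, 1, 1])]
print(gb_dependence(balls, n_total=8))  # 0.625: 5 of 8 samples in pure balls
```

This dependence value can then drive the same forward attribute selection loop as in PRS, with granular-ball generation replacing the equivalence-class partition.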
IV-D Granular-ball Rough Concept Tree for Knowledge Representation and Classification
As the GBRS can use equivalence classes to represent the upper and lower approximations while processing continuous data, it can represent knowledge well. In this section, we further propose the granular-ball rough concept tree (GBRCT) by combining the GBRS with the rough concept tree (RCT) [58], which was proposed based on the concept lattice. The RCT can not only be used to organize and describe the knowledge rules obtained by a rough set based forward attribute reduction algorithm, but can also be used for classification decisions. So, the GBRCT makes the GBRS a strong mining tool that realizes not only feature selection, but also knowledge representation and classification at the same time.
In the RCT, each node is also called a "knowledge point" or "concept node", consisting of two parts: a sequence of attributes and their values, called the "intent", and its corresponding equivalence classes, called the "extent". The "concept", consisting of "intent" and "extent", is borrowed from philosophy for knowledge representation. The GBRS strictly divides the dataset by granular-balls, so in the GBRCT of this article, the representation of the first row of each "knowledge point" has two parts: the attribute value, and the center and radius of the corresponding granular-ball.
In the GBRCT, differently from the RCT, the intent is described using a granular-ball equivalence class consisting of its center and radius instead of a sequence of attribute values. The GBRCT generated on the discrete dataset zoo is shown in Fig. 6, and that on the continuous dataset wine is shown in Fig. 7. Because the dataset zoo is known to be discrete in advance, a neighborhood radius smaller than an infinitely small positive value is used as the termination condition of granular-ball splitting, and the GBRS degenerates to the GBPRS. In Figs. 6 and 7, an orange node represents a granular-ball equivalence class belonging to the positive region, which can certainly describe knowledge, while a blue node containing "?" represents the boundary region. The result of the GBRCT is very similar to that of the RCT except for the representation of the intent, i.e., the first row of each node: the RCT represents the intent using a sequence of attribute values, whereas the GBRCT uses a granular-ball consisting of a center and a neighborhood radius, whose value is equal to zero in Fig. 6; on the contrary, as shown in Fig. 7, the neighborhood radius on continuous data is larger than zero. This also indicates that the GBRS realizes a unified description of the PRS and NRS. In a real scene, when it is not known in advance whether the dataset is discrete, an infinitely small positive value can be considered as an option to be optimized as the neighborhood radius in the GBRS.
Similarly to the RCT, the number on the right of a layer of the GBRCT shows which attributes the concept nodes in that layer are generated on. Besides, the positive-region concept nodes containing the largest number of extent samples have the strongest representation ability for knowledge and are the most valuable, such as the second node in the second layer and the second node in the third layer in Fig. 6. In addition, as described in [58], the GBRCT can also be directly used for classification.
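As a rough illustration of how a "knowledge point" could be held in code, the sketch below stores the intent as a center-radius pair and the extent as sample indexes, and walks the tree for classification in the spirit of [58]. The class and field names are hypothetical, not the authors' implementation.

```python
# Hypothetical GBRCT node layout; field names are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConceptNode:
    center: List[float]          # intent, part 1: granular-ball center
    radius: float                # intent, part 2: neighborhood radius (0 on discrete data)
    extent: List[int]            # indexes of the samples covered by the ball
    label: Optional[int] = None  # decision class; None plays the role of "?" (boundary region)
    children: List["ConceptNode"] = field(default_factory=list)

    def is_positive(self):
        """Positive-region nodes (purity 1) can state a certain rule."""
        return self.label is not None

def classify(node, x, dist):
    """Descend into the child whose ball covers x; a None result means
    the sample falls in the boundary region."""
    for child in node.children:
        if dist(x, child.center) <= child.radius:
            return classify(child, x, dist)
    return node.label
```

A usage example: build a two-leaf tree and route a query point to the leaf ball that covers it.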
IV-E Algorithm Design
The only difference between the GBNRS and the PRS is the way the positive region is generated; the feature selection process is the same because the GBNRS and the PRS have a unified representation model. So, we only discuss the algorithm design of granular-ball generation for the positive region in this section, which is shown in Algorithm 1. Algorithm 1 mainly consists of two parts: initial granular-ball generation and overlap removal. In Step 16, there is overlap between two granular-balls if their boundary distance is smaller than zero, i.e., the distance between their centers is smaller than the sum of their radii. The splitting process in Step 17 is similar to that in Step 6. In the output, those granular-balls with purity equal to 1 belong to the positive region. The lower bound of the size of a granular-ball, i.e., the number of samples in it, is optimized from 2*m to 2 with a step of 1, where m denotes the number of conditional attributes in the dataset.
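The overlap test in Step 16 can be sketched as follows, assuming each ball is held as a (center, radius, label) triple; the function names are illustrative, not the paper's notation.

```python
# Illustrative sketch of the heterogeneous-overlap check in Algorithm 1:
# two balls overlap when the distance between their centers is smaller
# than the sum of their radii, i.e., the boundary distance is negative.
import numpy as np

def boundary_distance(c1, r1, c2, r2):
    """Distance between the two ball boundaries; negative means overlap."""
    return np.linalg.norm(np.asarray(c1) - np.asarray(c2)) - (r1 + r2)

def heterogeneous_overlaps(balls):
    """Return index pairs of overlapping balls with different labels.
    In Algorithm 1, such pairs are split further to remove the overlap."""
    pairs = []
    for i in range(len(balls)):
        for j in range(i + 1, len(balls)):
            (c1, r1, l1), (c2, r2, l2) = balls[i], balls[j]
            if l1 != l2 and boundary_distance(c1, r1, c2, r2) < 0:
                pairs.append((i, j))
    return pairs
```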
V Experiment
To demonstrate the feasibility and effectiveness of the GBRS, we selected some popular or state-of-the-art algorithms for comparison. As the experimental results using the GBPRS are the same as those using the PRS, the GBNRS is selected for comparison. As the PRS can only process discrete data, we also conduct experiments on some discrete datasets for comparison with the PRS. As shown in Table I, we randomly selected fifteen real datasets, including continuous and discrete ones, to demonstrate the performance of the GBNRS. Among them, the first six are discrete datasets and the last nine are continuous datasets. Experimental hardware environment: a PC with an Intel Core i7-10700 CPU @ 2.90 GHz and 32 GB RAM. Experimental software environment: Python 3.7.
NO.  Dataset  Samples  Continuous attributes  Discrete attributes  Classes
1  lymphography  148  0  18  4  
2  primarytumor  336  0  15  2  
3  mushroom  7535  0  22  2  
4  mushroom1  8124  0  22  2  
5  zoo  101  0  16  7  
6  backuplarge  307  0  36  4  
7  iono  351  34  0  2  
8  Diabetes  768  8  0  2  
9  wdbc  569  30  0  2  
10  audit_risk  772  21  0  2  
11  electrical  10000  13  0  2  
12  Parkinson_Multiple_Sound_Recording  1040  27  0  2
13  wine  178  13  0  3  
14  spambase  4601  57  0  2  
15  htru2  17898  8  0  2  
The lower bound of the size of a granular-ball, i.e., the number of samples in it, is optimized from 2*m to 2 with a step of 1, where m denotes the number of conditional attributes in the dataset. The experiments are designed along the lines of Xia et al. [51]: as the quality of the reduced attribute set is not related to the testing classifier used, only a common testing classifier, the nearest neighbor algorithm, is used to verify the quality of the reduced attribute set. Therefore, we use the common classifier kNN in our experiments with 5-fold cross-validation.
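A minimal sketch of this evaluation protocol, assuming scikit-learn and using the wine dataset (dataset 13 in Table I); the attribute subset below is a placeholder for whatever the reduction algorithm returns, not a result from the paper.

```python
# Score a reduced attribute set with 1-NN under 5-fold cross-validation,
# as in the paper's experimental protocol. The selected indexes are
# placeholders, not an actual reduction result.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
selected = [0, 6, 9, 12]                      # placeholder reduced attribute subset
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                         X[:, selected], y, cv=5)
print(round(scores.mean(), 4), round(scores.std(), 4))  # mean and std, as in Tables II-III
```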
V-A Comparison with the PRS on Discrete Data
The experimental results on the first six discrete datasets in Table I are shown in Table II, where the "Original" column represents the classification accuracy obtained on the original unreduced dataset. The "NO." column in Table II corresponds to the "NO." column in Table I. It can be seen from Table II that the classification accuracy of the GBNRS is higher than that of the PRS in most cases, except on the 4th dataset, where the accuracies of the two algorithms are the same. Considering the average classification accuracy, the original accuracy and the PRS's accuracy are 0.8906 and 0.8618 respectively, while that of the GBNRS is 0.8958; in comparison with these two results, the GBNRS achieves enhancements of 0.52 and 3.4 percentage points respectively. The reason is that the GBNRS can flexibly fit the data distribution using granular-balls with various radii, which is obviously better than methods using a fixed radius, such as the PRS and the traditional NRS. So, the GBNRS can achieve a higher accuracy than the two algorithms. In summary, on discrete datasets, the GBNRS achieves higher classification accuracy than both the PRS and the original unreduced datasets.
NO.  Original  PRS  GBNRS 
1  0.8101±0.0966  0.7489±0.0627  0.8161±0.0584
2  0.6955±0.0290  0.6686±0.0414  0.6955±0.0290
3  0.9111±0.1034  0.9202±0.0753  0.9328±0.0746
4  1±0  0.9889±0.0026  1±0
5  0.95±0.0499  0.9±0.079  0.95±0.0353
6  0.9771±0.0274  0.9444±0.0526  0.9804±0.0269
Average  0.8906  0.8618  0.8958 
V-B Comparison with Various Feature Selection Methods
In this section, we select the nine continuous datasets whose indexes are 7 to 15 in Table I, and nine popular or state-of-the-art algorithms for comparison, including NRS [59], GBNRS-old [51], Cfs [37], Ilfs [1], Laplacian [60], Lasso [61], Mrmr [15], and WNRS [62]. The neighborhood radius is gradually increased from 0.01 to 0.5 with a step size of 0.01, which is the common setting for the NRS and WNRS. The method GBNRS-old [51] also introduces granular-ball computing to decrease the overlap in the traditional NRS, resulting in improved efficiency; however, it does not realize the equivalence representation, so we name it with "old" as a suffix, i.e., GBNRS-old, to distinguish it from our method GBNRS. As there is randomness in GBNRS-old, we run it ten times and take the highest classification accuracy among the ten results for comparison. The experimental results are shown in Table III. It can be seen from Table III that, in comparison with the other algorithms, the GBNRS achieves the highest classification accuracy on seven datasets. The reason is the combination of the robustness and adaptability of granular-ball computing: the adaptability makes the GBNRS flexibly fit different data distributions using granular-balls with various radii, resulting in a good accuracy performance.
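The radius search described above can be sketched as a simple grid scan; `evaluate` is a placeholder for the reduction-plus-kNN pipeline, and the function name is an assumption for illustration.

```python
# Grid search over the neighborhood radius, 0.01 to 0.5 in steps of 0.01,
# keeping the radius with the best score. `evaluate` stands in for the
# reduction-plus-kNN cross-validation pipeline.
import numpy as np

def best_radius(evaluate, start=0.01, stop=0.5, step=0.01):
    radii = np.arange(start, stop + step / 2, step)   # inclusive grid
    scores = [evaluate(r) for r in radii]
    i = int(np.argmax(scores))
    return float(radii[i]), float(scores[i])
```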
NO.  Cfs  Ilfs  Laplacian  Lasso  Mrmr  Original  NRS  GBNRS-old  WNRS  GBNRS
1  0.8234±0.0589  0.8029±0.0551  0.7963±0.1125  0.7687±0.0441  0.8239±0.0715  0.8101±0.0966  0.7566±0.1094  0.8035±0.0575  0.7489±0.0627  0.8161±0.0584
2  0.7045±0.0591  0.7224±0.0467  0.6626±0.0479  0.7105±0.0467  0.6388±0.0771  0.6955±0.0209  0.6686±0.0414  0.6716±0.0606  0.6687±0.0414  0.6955±0.0290
3  0.8997±0.1235  0.9040±0.1420  0.9146±0.1080  0.9247±0.1029  0.9588±0.0677  0.9111±0.1035  0.9729±0.0260  0.9289±0.0755  0.9745±0.0571  0.9328±0.0746
4  1±0  1±0  1±0  1±0  1±0  1±0  1±0  1±0  1±0  1±0
5  0.8900±0.0418  0.9300±0.0273  0.9200±0.0274  0.9200±0.0570  0.9600±0.0547  0.9500±0.0499  0.9000±0.079  0.9100±0.0418  0.9000±0.0790  0.9500±0.0353
6  0.9771±0.0274  0.9804±0.0269  0.9771±0.0274  0.9738±0.034  0.9771±0.0274  0.9771±0.0274  0.9705±0.0408  0.964±0.0549  0.9771±0.0249  0.9804±0.0269
7  0.9028±0.0340  0.8800±0.0372  0.8514±0.0312  0.8657±0.0423  0.9142±0.0319  0.8486±0.0458  0.8886±0.0275  0.8771±0.0186  0.9146±0.0211  0.9000±0.0484
8  0.6897±0.0136  0.7431±0.0250  0.7327±0.0315  0.6897±0.0412  0.7210±0.0463  0.7471±0.0348  0.7471±0.0348  0.7471±0.0348  0.7471±0.0348  0.7471±0.0348
9  0.9666±0.0273  0.9701±0.0192  0.9718±0.0266  0.9718±0.0266  0.9718±0.0259  0.9683±0.0253  0.9701±0.0192  0.9665±0.0243  0.9613±0.022  0.9718±0.0218
10  0.9377±0.6070  0.9468±0.0360  0.9559±0.0451  0.9624±0.0213  0.9688±0.0202  0.9377±0.0571  0.9195±0.0698  0.9416±0.5728  0.9235±0.0718  0.9922±0.0071
11  0.8594±0.0081  0.9721±0.0020  0.8594±0.0081  0.8594±0.0081  0.9147±0.0068  0.9145±0.0068  0.9721±0.0058  0.9145±0.0068  0.9782±0.0022  0.9969±0.0015
12  0.9057±0.0455  0.8249±0.0686  0.8913±0.0522  0.8970±0.0512  0.9259±0.5930  0.8249±0.6860  1±0  0.8797±0.0053  1±0  1±0
13  0.9435±0.0282  0.9662±0.0235  0.9719±0.0340  0.9660±0.0310  0.9775±0.0233  0.9605±0.0427  0.9830±0.0155  0.9438±0.0478  0.9721±0.0196  0.9773±0.0238
14  0.8715±0.0562  0.8609±0.0582  0.8661±0.0559  0.8806±0.2870  0.8830±0.0603  0.8646±0.0639  0.8648±0.0650  0.8689±0.0634  0.8646±0.0646  0.8889±0.0531
15  0.9772±0.0021  0.9772±0.0021  0.9772±0.0021  0.9777±0.0015  0.9772±0.0021  0.9772±0.0021  0.9772±0.0021  0.9772±0.0021  0.9778±0.0021  0.9785±0.0021
VI Conclusion
This paper presents a unified model for the two most popular rough set models, the Pawlak rough set and the neighborhood rough set. The unified model can not only express knowledge with equivalence classes, but also deal with both continuous and discrete data. In comparison with nine popular or state-of-the-art feature selection methods on fifteen real datasets, the experiments show that the proposed model achieves better accuracy. However, the optimization of the lower bound of the granular-ball size is inefficient; if some incremental strategies can be developed, it can be accelerated. In future work, we will extend the proposed model to other rough set models and improve their performance.
VII Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 62176033 and 61936001, the Natural Science Foundation of Chongqing under Grant No. cstc2019jcyjcxttX0002 and by NICE: NRT for Integrated Computational Entomology, US NSF award 1631776.
References
[1] Giorgio Roffo, Simone Melzi, Umberto Castellani, and Alessandro Vinciarelli. Infinite latent feature selection: A probabilistic latent graph-based ranking approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 1398–1406, 2017.
[2] T. Y. Lin. Granular computing: Practices, theories, and future directions. Computational Complexity, 2009(1):4339–4355, 2009.
 [3] Giorgio Roffo, Simone Melzi, and Marco Cristani. Infinite feature selection. In Proceedings of the IEEE International Conference on Computer Vision, pages 4202–4210, 2015.
[4] Zhao Zhang, Mingbo Zhao, and Tommy WS Chow. Binary- and multi-class group sparse canonical correlation analysis for feature extraction and classification. IEEE Transactions on Knowledge and Data Engineering, 25(10):2192–2205, 2012.
[5] Xiaodong Fan, Weida Zhao, Changzhong Wang, and Yang Huang. Attribute reduction based on max-decision neighborhood rough set model. Knowledge-Based Systems, 151:16–23, 2018.
 [6] Diwakar Tripathi, Damodar Reddy Edla, and Ramalingaswamy Cheruku. Hybrid credit scoring model using neighborhood rough set and multilayer ensemble classification. Journal of Intelligent & Fuzzy Systems, 34(3):1543–1549, 2018.
[7] Xiaoli Chu, Bingzhen Sun, Xue Li, Keyu Han, JiaQi Wu, Yan Zhang, and Qingchun Huang. Neighborhood rough set-based three-way clustering considering attribute correlations: An approach to classification of potential gout groups. Information Sciences, 535:28–41, 2020.
[8] Zhanhui Li, Jiancong Fan, Yande Ren, and Leiyu Tang. A novel feature extraction approach based on neighborhood rough set and PCA for migraine rs-fMRI. Journal of Intelligent & Fuzzy Systems, 38(5):5731–5741, 2020.
 [9] Sheng Luo, Duoqian Miao, Zhifei Zhang, Yuanjian Zhang, and Shengdan Hu. A neighborhood rough set model with nominal metric embedding. Information Sciences, 520:373–388, 2020.
[10] Zhao Zhang, Tommy WS Chow, and Mingbo Zhao. Trace ratio optimization-based semi-supervised nonlinear dimensionality reduction for marginal manifold visualization. IEEE Transactions on Knowledge and Data Engineering, 25(5):1148–1161, 2012.
 [11] Yan Zhang, Zhao Zhang, Sheng Li, Jie Qin, Guangcan Liu, Meng Wang, and Shuicheng Yan. Unsupervised nonnegative adaptive feature extraction for data representation. IEEE Transactions on Knowledge and Data Engineering, 31(12):2423–2440, 2018.
 [12] Yanyan Yang, Degang Chen, and Hui Wang. Active sample selection based incremental algorithm for attribute reduction with rough sets. IEEE Transactions on Fuzzy Systems, 25(4):825–838, 2017.
[13] Changzhong Wang, Qinghua Hu, Xizhao Wang, Degang Chen, Yuhua Qian, and Zhe Dong. Feature selection based on neighborhood discrimination index. IEEE Transactions on Neural Networks and Learning Systems, 29(7):2986–2999, 2018.
[14] Manoranjan Dash and Huan Liu. Feature selection for classification. Intelligent Data Analysis, 1(1-4):131–156, 1997.
[15] Lin Sun, Tengyu Yin, Weiping Ding, Yuhua Qian, and Jiucheng Xu. Feature selection with missing labels using multi-label fuzzy neighborhood rough sets and maximum relevance minimum redundancy. IEEE Transactions on Fuzzy Systems, PP(99):1–1, 2021.
[16] Binbin Sang, Hongmei Chen, Lei Yang, Tianrui Li, Weihua Xu, and Chuan Luo. Feature selection for dynamic interval-valued ordered data based on fuzzy dominance neighborhood rough set. Knowledge-Based Systems, 227:107223, 2021.
 [17] Zdzisław Pawlak. Rough sets. International journal of computer & information sciences, 11(5):341–356, 1982.
 [18] JinMao Wei, ShuQin Wang, and XiaoJie Yuan. Ensemble rough hypercuboid approach for classifying cancers. IEEE transactions on knowledge and data engineering, 22(3):381–391, 2009.
 [19] WP Ma, YY Huang, H Li, et al. Image segmentation based on rough set and differential immune fuzzy clustering algorithm. Journal of Software, 25:2675–2689, 2014.
[20] Yuhua Qian, Hang Xu, Jiye Liang, Bing Liu, and Jieting Wang. Fusing monotonic decision trees. IEEE Transactions on Knowledge and Data Engineering, 27(10):2717–2728, 2015.
[21] Yuefeng Li, Libiao Zhang, Yue Xu, Yiyu Yao, Raymond Yiu Keung Lau, and Yutong Wu. Enhancing binary classification by modeling uncertain boundary in three-way decisions. IEEE Transactions on Knowledge and Data Engineering, 29(7):1438–1451, 2017.
 [22] Yuhua Qian, Xinyan Liang, Qi Wang, Jiye Liang, Bing Liu, Andrzej Skowron, Yiyu Yao, Jianmin Ma, and Chuangyin Dang. Local rough set: a solution to rough data analysis in big data. International Journal of Approximate Reasoning, 97:38–63, 2018.
 [23] KA Vidhya and TV Geetha. Entity resolution framework using rough set blocking for heterogeneous web of data. Journal of Intelligent & Fuzzy Systems, 34(1):659–675, 2018.
[24] Mengjun Hu and Yiyu Yao. Structured approximations as a basis for three-way decisions in rough set theory. Knowledge-Based Systems, 165:92–109, 2019.
[25] Yiyu Yao. Decision-theoretic rough set models. In International Conference on Rough Sets and Knowledge Technology, pages 1–12, 2007.
 [26] Wojciech Ziarko. Variable precision rough set model. Journal of computer and system sciences, 46(1):39–59, 1993.
 [27] Neil Parthaláin, Qiang Shen, and Richard Jensen. A distance measure approach to exploring the rough set boundary region for attribute reduction. IEEE Transactions on Knowledge and Data Engineering, 22(3):305–317, 2009.
 [28] Degang Chen, Suyun Zhao, Lei Zhang, Yongping Yang, and Xiao Zhang. Sample pair selection for attribute reduction with rough set. IEEE Transactions on Knowledge and Data Engineering, 24(11):2080–2093, 2012.
 [29] Jiye Liang, Feng Wang, Chuangyin Dang, and Yuhua Qian. A group incremental approach to feature selection applying rough set technique. IEEE Transactions on Knowledge and Data Engineering, 26(2):294–308, 2012.
[30] Hongmei Chen, Tianrui Li, Chuan Luo, Shi-Jinn Horng, and Guoyin Wang. A rough set-based method for updating decision rules on attribute values' coarsening and refining. IEEE Transactions on Knowledge and Data Engineering, 26(12):2886–2899, 2014.
 [31] Pradipta Maji. A rough hypercuboid approach for feature selection in approximation spaces. IEEE Transactions on Knowledge and Data Engineering, 26(1):16–29, 2012.
 [32] QingHua Hu, DaRen Yu, ZongXia Xie, et al. Numerical attribute reduction based on neighborhood granulation and rough approximation. Journal of software, 19(3):640–649, 2008.
 [33] Tsau Young Lin et al. Granular computing on binary relations i: Data mining and neighborhood systems. Rough sets in knowledge discovery, 1(1):107–121, 1998.
 [34] Quanquan Gu, Zhenhui Li, and Jiawei Han. Generalized fisher score for feature selection. arXiv preprint arXiv:1202.3725, 2012.
 [35] G Roffo and S Melzi. Ranking to learn: Feature ranking and selection via eigenvector centrality. new frontiers in mining complex patterns. In Fifth International workshop, nfMCP2016, 2017.
[36] Giorgio Roffo, Simone Melzi, Umberto Castellani, Alessandro Vinciarelli, and Marco Cristani. Infinite feature selection: a graph-based feature filtering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(12):4396–4410, 2020.
 [37] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1):389–422, 2002.
 [38] Jun Guo, Yanqing Guo, Xiangwei Kong, and Ran He. Unsupervised feature selection with ordinal locality. In 2017 IEEE international conference on multimedia and expo (ICME), pages 1213–1218, 2017.
[39] Jun Guo and Wenwu Zhu. Dependence guided unsupervised feature selection. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[40] Paul S Bradley and Olvi L Mangasarian. Feature selection via concave minimization and support vector machines. In ICML, volume 98, pages 82–90, 1998.
[41] Feiping Nie, Heng Huang, Xiao Cai, and Chris Ding. Efficient and robust feature selection via joint l2,1-norms minimization. Advances in Neural Information Processing Systems, 23, 2010.
[42] Yi Yang, Heng Tao Shen, Zhigang Ma, Zi Huang, and Xiaofang Zhou. L2,1-norm regularized discriminative feature selection for unsupervised learning. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
 [43] G Wang, J Liu, and F Hu. Cut’s discriminability based continuous attributes discretization in rough set [j]. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), (4):257–261, 2010.
[44] Rahman Ali, Muhammad Hameed Siddiqi, and Sungyoung Lee. Rough set-based approaches for discretization: a compact review. Artificial Intelligence Review, 44(2):235–263, 2015.
[45] Feng Jiang and Yuefei Sui. A novel approach for discretization of continuous attributes in rough set theory. Knowledge-Based Systems, 73:324–334, 2015.
 [46] Lotfi A Zadeh. Fuzzy sets. In Fuzzy sets, fuzzy logic, and fuzzy systems: selected papers by Lotfi A Zadeh, pages 394–432. 1996.
[47] Didier Dubois and Henri Prade. Rough fuzzy sets and fuzzy rough sets. International Journal of General System, 17(2-3):191–209, 1990.
 [48] Nan Li and JY Xie. A feature subset selection algorithm based on neighborhood rough set for incremental updating datasets. Computer Technology and Development, 21(11):149–155, 2011.
[49] Yang Gao, Zunyi Liu, and Jun Ji. Neighborhood rough set attribute reduction algorithm based on matrix reservation strategy. Application Research of Computers, 12, 2019.
 [50] Xiaoran Peng, Zunyi Liu, and Jun Ji. Adaptable method for determining neighborhood size of neighborhood rough set. Application Research of Computers, (1):4, 2019.
 [51] Shuyin Xia, Zhao Zhang, Wenhua Li, Guoyin Wang, Elisabeth Giem, and Zizhong Chen. Gbnrs: A novel rough set algorithm for fast adaptive attribute reduction in classification. IEEE Transactions on Knowledge and Data Engineering, 2020.
 [52] Lin Chen. Topological structure in visual perception. Science, 218(4573):699–700, 1982.
 [53] Guoyin Wang. Dgcc: datadriven granular cognitive computing. Granular Computing, 2(4):343–355, 2017.
 [54] Shuyin Xia, Yunsheng Liu, Xin Ding, Guoyin Wang, Hong Yu, and Yuoguo Luo. Granular ball computing classifiers for efficient, scalable and robust learning. Information Sciences, 483:136–152, 2019.
[55] Shuyin Xia, Daowan Peng, Deyu Meng, Changqing Zhang, Guoyin Wang, Elisabeth Giem, Wei Wei, and Zizhong Chen. A fast adaptive k-means with no bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[56] Shuyin Xia, Shaoyuan Zheng, Guoyin Wang, Xinbo Gao, and Binggui Wang. Granular ball sampling for noisy label classification or imbalanced classification. IEEE Transactions on Neural Networks and Learning Systems, 2021.
 [57] Shuyin Xia, Shaoyuan Zheng, Guoyin Wang, Xinbo Gao, and Binggui Wang. Granular ball sampling for noisy label classification or imbalanced classification. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[58] Shuyin Xia, Xinyu Bai, and Guoyin Wang. An efficient and accurate rough set for feature selection, classification and knowledge representation. arXiv preprint arXiv: 2112.96551, 2021.
 [59] Qinghua Hu, Daren Yu, Jinfu Liu, and Congxin Wu. Neighborhood rough set based heterogeneous feature subset selection. Information sciences, 178(18):3577–3594, 2008.
 [60] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. Advances in neural information processing systems, 18, 2005.
 [61] Sara A Van de Geer. Highdimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645, 2008.
 [62] Meng Hu, Eric CC Tsang, Yanting Guo, Degang Chen, and Weihua Xu. A novel approach to attribute reduction based on weighted neighborhood rough sets. KnowledgeBased Systems, 220:106908, 2021.