I Introduction
LABELS of data are laborious or expensive to obtain, while unlabeled data are generated or sampled at a tremendous scale in the big data era. This is why semi-supervised learning (SSL) is drawing increasing interest and attention from the machine learning community. Among the many streams of SSL models, graph-based SSL (GSSL) has the reputation of being easy to understand through visual representation and convenient for improving learning performance by exploiting the corresponding matrix computations. Therefore, there has been a lot of research in this regard, e.g., [1], [2], [3]. However, the existing GSSL models have two apparent limitations. One is that these models usually need to solve an optimization problem in an iterative fashion, hence their low efficiency. The other is that they have difficulty delivering labels for a new bunch of data, because the solution for the unlabeled data is derived specifically for the given graph. With newly included data, the graph changes, and the whole iterative optimization process must run once again.
We ponder the possible reasons for these limitations and argue that the crux is that these models treat the relationship among neighboring data points as "peer-to-peer". Because the data points are considered equally significant in representing their class, most GSSL objective functions try to optimize over each data point with equal priority. However, this "peer-to-peer" relationship is questionable in many situations. For example, if a data point $x_i$ lies at the central location of the region of its class, then it has more representative power than another point $x_j$ that diverges farther from the central location, even if $x_i$ and $x_j$ are in the same KNN or $\epsilon$-NN neighborhood.
This paper is grounded on the partial-order-relation assumption: the neighboring data points are not of equal status, and the label of the leader (or parent) is the contribution of its followers (or children). The assumption is intuitively reasonable, since there is an old saying: "a man is known by the company he keeps". The labels of the peripheral data may change with the model or parameter selection, but the labels of the core data are much more stable. Fig. 1 illustrates this idea.
This paper proposes a non-iterative label propagation algorithm that takes our previous research work, namely local-density-based optimal granulation (LoDOG), as its starting point. In LoDOG, the input data are organized as an optimal number of subtrees. Every non-center node in the subtrees is led by its parent to join the micro-cluster its parent belongs to. In [4], these subtrees are called leading trees. The proposed method, Label Propagation on Optimal Leading Forest (LaPOLeaF), performs label propagation on the structure of the relatively independent subtrees in the forest, rather than on the traditional nearest-neighbor graph.
Therefore, LaPOLeaF exhibits several advantages when compared with other GSSL methods:
(a) the propagation is performed on the subtrees, so the edges under consideration are much sparser than those of a nearest-neighbor graph;
(b) the subtrees are relatively independent of each other, so the massive label propagation computation is easy to parallelize when the sample size is huge;
(c) LaPOLeaF performs label propagation in a non-iterative fashion, so it is highly efficient.
Overall, the LaPOLeaF algorithm is formulated in a simple way, and the empirical evaluations show promising accuracy and very high efficiency.
The rest of the paper is organized as follows. Section II briefly reviews the related work. The LaPOLeaF model is presented in detail in Section III. Section IV describes how to scale LaPOLeaF for big data. Section V analyzes the computational complexity and discusses the relationship to other research, and Section VI describes the experimental study. We conclude in Section VII.
II Related studies
II-A Graph-based semi-supervised learning (GSSL)
Suppose an undirected graph is denoted as $G=(V,E,W)$, where $V$ is the set of vertices, $E$ the set of edges, and $W$ the mapping from an edge to a real number (usually defined as the similarity between the two end points). GSSL takes the input data as the vertices of the graph, and places an edge between two vertices if they are similar or correlated. The basic idea of GSSL is to propagate the labels of the labeled samples to the unlabeled ones through the constructed graph. The propagation strength between $v_i$ and $v_j$ on each edge is proportional to the weight $w_{ij}$.
Almost all the existing GSSL models rest on two fundamental assumptions. One is the "clustering assumption", meaning that samples in the same cluster should have the same label; the clustering assumption is usually applied to the labeled sample set. The other is the "manifold assumption", meaning that similar (or neighboring) samples should have similar labels; the manifold assumption is applied to both the labeled and the unlabeled data.
Starting from the two assumptions, GSSL usually aims at optimizing an objective function with two terms. However, the concrete components in different GSSL models vary. For example, in [5] the objective function is
$Q(F)=\frac{1}{2}\sum_{i,j=1}^{N} w_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right\|^2+\mu\sum_{i=1}^{N}\|F_i-Y_i\|^2 \qquad (1)$
where $F$ is the label indication matrix; $D_{ii}$ is the sum of the $i$th row of $W$; $Y$ holds the labels of the labeled data.
Liu et al. proposed an Anchor Graph Regularization (AGR) approach to predict the label of each data point as a locally weighted average of the labels of anchor points [1]. In AGR, the objective function is
$Q(A)=\|Z_L A-Y_L\|_F^2+\gamma\,\mathrm{tr}(A^{\top}\tilde{L}A) \qquad (2)$
where $Z$ is the regression matrix that describes the relationship between the raw samples and the anchors ($Z_L$ is its sub-matrix for the labeled samples); $A$ is the soft label matrix of the anchors. $\tilde{L}=Z^{\top}LZ$ is the reduced Laplacian matrix.
Wang et al. proposed a hierarchical AGR method to address the granularity dilemma in AGR, by adding a series of intermediate granular anchor layers between the finest original data and the coarsest anchor layer [3].
One can see that the underlying philosophy is still the two assumptions. Slightly differently, Ni et al. proposed a novel concept, graph harmoniousness, which integrates feature learning and label learning into one framework (Framework of Learning by Propagability, FLP) [2]. The objective function in FLP has only one term, yet it also needs to reach a local optimum by alternately running an iterative optimization procedure on two variables.
II-B Optimal leading forest
$X=\{x_1,\ldots,x_N\}$ denotes the dataset. $I=\{1,\ldots,N\}$ is the index set of $X$. $d_{ij}$ is the distance (under any metric) between $x_i$ and $x_j$.
Definition 1.
Local density [6]. The local density of $x_i$ is computed as $\rho_i=\sum_{j\in I,\,j\neq i} e^{-\left(d_{ij}/d_c\right)^2}$, where $d_c$ is the cutoff distance or bandwidth parameter.
Definition 2.
Leading node and $\delta$-distance. If $x_j$ is the nearest neighbor with higher local density to $x_i$, then $x_j$ is called the leading node of $x_i$. Formally, $x_j=\arg\min_{x_k:\,\rho_k>\rho_i} d_{ik}$, denoted as $x_j=LN(x_i)$ for short. $\delta_i=d_{ij}$ is called the $\delta$-distance of $x_i$, or simply $\delta_i$.
We store all the $LN(x_i)$ in an array named LN.
Definition 3.
Leading tree (LT) [4]. If $x_r=\arg\max_{x_i\in X}\rho_i$, then $x_r$ has no leading node, and $\delta_r$ is defined as $\max_j d_{rj}$. Let an arrow start from each $x_i$, $x_i\neq x_r$, and end at $LN(x_i)$. Then $X$ and the arrows form a tree $T$. Each node $x_i$ in $T$ (except $x_r$) tends to be led by $LN(x_i)$ to join the cluster $LN(x_i)$ belongs to, unless $x_i$ itself makes a center. Such a tree is called a leading tree.
Definition 4.
$\sigma$ operator [7]. For any non-root node $x$ in an LT, there is a leading node $LN(x)$ for $x$. This mapping is denoted as $\sigma: x\mapsto LN(x)$.
We denote $\sigma(\sigma(x))$ as $\sigma^2(x)$ for short, and likewise $\sigma^k(x)$ for $k$ successive applications.
Definition 5.
Partial order in an LT [7]. Suppose $x$ and $y$ are nodes of an LT; we say $x\prec y$ iff $\exists k\in\mathbb{Z}^{+}$ such that $y=\sigma^{k}(x)$.
Definition 6.
Center potential. Let $\gamma_i$ denote the potential of $x_i$ to be selected as a center; $\gamma_i$ is computed as $\gamma_i=\rho_i\times\delta_i$.
Intuitively, if an object $x_i$ has a large $\rho_i$ (it has many near neighbors) and a large $\delta_i$ (it is relatively far from another object of larger $\rho$), then $x_i$ has a great chance to be the center of a collection of data.
Pedrycz proposed the principle of justifiable granularity, indicating that a good information granule (IG) should have sufficient experimental evidence and specific semantics [8], [9], [10]. That is, there should be as many data points as possible included in an IG, and the closure of the IG should be compact and tight from a geometric perspective.
Following this principle, we proposed a local-density-based optimal granulation (LoDOG) method to build justifiable granules accurately and efficiently [11]. In LoDOG, we construct the optimal IGs of $X$ by disconnecting the corresponding leading tree into an optimal number of subtrees. The optimal number $N_g^{*}$ is derived by minimizing the objective function:
$Q(N_g)=\alpha H(N_g)+\sum_{i=1}^{N_g}\mathrm{DistCost}(\Omega_i) \qquad (3)$
where $\mathrm{DistCost}(\Omega_i)=\frac{1}{|\Omega_i|}\sum_{x\in\Omega_i} d(x,r_i)$.
Here, $N_g$ is the number of IGs; $\alpha$ is the parameter striking a balance between the experimental evidence and the semantics; $\Omega_i$ is the set of points included in the $i$th granule; $|\cdot|$ returns the cardinality of a set; $H(\cdot)$ is a strictly monotonically increasing function used to adjust the magnitude of $N_g$ to well match that of the distance-cost term. This function can be automatically selected from a group of common functions such as logarithmic, linear, power, and exponential functions; $r_i$ is the root of the $i$th granule as a leading tree.
We use LoDOG to construct the optimal leading forest (OLF) from the dataset. Readers are referred to [11] for more details of LoDOG.
Definition 7.
Optimal leading forest (OLF). $N_g^{*}$ leading trees can be constructed from the dataset using the LoDOG method. All these leading trees are collectively called the optimal leading forest.
The concept of the OLF is used to determine the localized ranges of label propagation on the whole leading tree of $X$. That is, the OLF indicates where to stop propagating the label of a labeled datum to its neighbors.
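Given the leading-node array and the center potentials, splitting a leading tree into a forest amounts to promoting the highest-$\gamma$ nodes to roots and letting every other node follow its leading node. A schematic sketch with a hypothetical array-based interface (LoDOG's actual selection of the number of granules via (3) is not reproduced here):

```python
def cut_into_forest(leader, gamma, n_granules):
    """Split a leading tree into a forest: the n_granules nodes with the
    highest center potential gamma become granule roots, and every other
    node joins the granule its leading node belongs to.
    leader[i] is the index of the leading node of i (-1 for the tree root);
    this array-based interface is a simplification for illustration."""
    n = len(leader)
    roots = sorted(range(n), key=lambda i: -gamma[i])[:n_granules]
    granule = [-1] * n
    for g, r in enumerate(roots):
        granule[r] = g

    def granule_of(i):
        while granule[i] == -1:     # climb toward the nearest root above i
            i = leader[i]
        return granule[i]

    return [granule_of(i) for i in range(n)]
```

Cutting a chain-shaped tree at its two highest-potential nodes, for instance, yields two granules whose membership is decided purely by the leading-node links.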
III Label Propagation on Optimal Leading Forest (LaPOLeaF)
LaPOLeaF first performs a global optimization to construct the OLF, and then performs label propagation on each of the subtrees. Following the aforementioned partial-order-relation assumption, the relationship between the children and their parent is formulated as (4), and each stage of the label propagation in LaPOLeaF is guided by this formula.
$y_P=\frac{\sum_{i=1}^{m} N_i\, y_{C_i}}{\sum_{i=1}^{m} N_i} \qquad (4)$
where $y_P\in\mathbb{R}^C$ is the label vector of the parent for a $C$-class classification problem, and $y_{C_i}$ is the label vector of the $i$th of the $m$ children w.r.t. the current parent. That the $j$th element equals one and all others equal zero represents a class label of the $j$th class, $1\le j\le C$. For regression problems, $y_P$ and $y_{C_i}$ are simply scalar values. $N_i$ is the population of the raw data points merged in the $i$th fat node of the subtree, if the node is derived as an information granule after some granulation method such as locality-sensitive hashing (LSH) (e.g., [12], [13]) or others. If no granulation is performed before the LT construction, all $N_i$ are assigned the constant 1.
LaPOLeaF consists of three stages after the OLF has been constructed, namely, from children to parent (C2P), from root to root (R2R), and from parent to children (P2C). The idea of these stages is illustrated in Fig. 2.
To decide the layer number of each node, one can easily design a hierarchical traversal algorithm (see Appendix) for the sub-leading-tree.
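A queue-based breadth-first traversal suffices to assign the layer numbers, in the spirit of the appendix algorithm; this sketch assumes a simple children-list representation of a subtree.

```python
from collections import deque

def layer_index(children, root):
    """Assign a layer number to every node of a (sub-)leading tree by a
    breadth-first traversal with a queue; children[v] lists the child
    nodes of v. O(n) overall, since each node is enqueued once."""
    layer = {root: 0}
    q = deque([root])
    while q:
        v = q.popleft()                    # DeQue
        for c in children.get(v, []):
            layer[c] = layer[v] + 1
            q.append(c)                    # EnQue
    return layer
```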
III-A Three key stages of label propagation in LaPOLeaF
III-A1 From children to parent
Definition 8.
unlabeled (labeled) node. A node in the subtree of the OLF is an unlabeled node (or the node is unlabeled), if its label vector is a zero vector. Otherwise, i.e., if its label vector has at least one element greater than zero, the node is called a labeled node (or the node is labeled).
Definition 9.
Unlabeled (labeled) subtree. A subtree in the OLF is called an unlabeled subtree (or the subtree is unlabeled) if every node in it is unlabeled. Otherwise, i.e., if the subtree contains at least one labeled node, it is called a labeled subtree (or the subtree is labeled).
Since the label of a parent is regarded as the contribution of its children, the propagation process is required to start from the bottom of each subtree. The label vector of an unlabeled child is initialized as the zero vector, so it does not contribute to the label of its parent. Once the layer index of every node is ready, the bottom-up propagation can execute in parallel on the labeled subtrees.
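One bottom-up C2P sweep might look as follows. The sketch assumes our reading of (4), i.e., that a parent's label is the population-weighted combination of its children's labels; the interface (children lists, weight and label arrays) is illustrative.

```python
def c2p(children, weight, label, parents_bottom_up):
    """One child-to-parent (C2P) sweep. label[v] is a C-dimensional list
    (all zeros = unlabeled); parents_bottom_up lists parent nodes from the
    deepest layer up to the root, so every child is finalized before its
    parent. Unlabeled children contribute nothing; a parent's label is the
    weight-normalized sum of its labeled children's labels."""
    for p in parents_bottom_up:
        if any(label[p]):
            continue                       # initially labeled nodes are kept
        kids = children.get(p, [])
        total = sum(weight[c] for c in kids if any(label[c]))
        if total == 0:
            continue                       # all children unlabeled: stay unlabeled
        agg = [0.0] * len(label[p])
        for c in kids:
            for k, y in enumerate(label[c]):
                agg[k] += weight[c] * y
        label[p] = [a / total for a in agg]
    return label
```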
Proposition 1.
After C2P propagation, the root of a labeled subtree must be labeled.
Proof.
According to the definitions of a labeled node and a labeled subtree, and the procedure of C2P propagation, a parent is labeled if at least one of its children is labeled after the corresponding round of propagation. The propagation progresses sequentially in the bottom-up direction, and the root is the parent at the top layer. Therefore, the proposition holds. ∎
III-A2 From root to root
If the labeled data are rare or unevenly distributed, there may be some unlabeled subtrees. In such a case, we must borrow label information from the labeled subtrees. Because the label of a root is more stable than those of the other nodes, the root $r$ of an unlabeled subtree should borrow label information from the root $s$ of a labeled subtree. However, there are requirements on $s$. To keep consistent with our partial-order assumption, $s$ is required to be superior to $r$ (i.e., of higher local density) and to be the nearest such root to $r$. Formally,
$s^{*}=\arg\min_{s\in R_L,\ \rho_s>\rho_r} d(r,s) \qquad (5)$
where $R_L$ is the set of labeled roots.
If there exists no such $s^{*}$ for a particular $r$, we can conclude that the root $x_{root}$ of the whole leading tree constructed from $X$ (before splitting into a forest) is not labeled. So, to guarantee that every unlabeled root can successfully borrow a label, we only need to guarantee that $x_{root}$ is labeled.
If $x_{root}$ is unlabeled after the C2P propagation, we apply the label-borrowing trick to $x_{root}$ as well. However, no other root satisfies $\rho_s>\rho_{x_{root}}$, so we modify (5) a little to borrow a label for $x_{root}$:
$s^{*}=\arg\min_{s\in R_L} d(x_{root},s) \qquad (6)$
The R2R propagation is executed for the unlabeled roots in representation-power-ascending order.
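The R2R stage can be sketched as below, processing unlabeled roots in density-ascending order so that borrowed labels can cascade; the fallback branch mirrors (6) for the case where no labeled root of higher density exists. The interface is illustrative.

```python
def r2r(roots, rho, dist, label):
    """Root-to-root (R2R) borrowing sketch: each unlabeled root takes the
    label of the nearest labeled root of higher local density, cf. (5);
    if no such root exists (e.g., for the global density peak), the density
    condition is dropped, cf. (6). dist(a, b) is any distance metric."""
    for r in sorted(roots, key=lambda i: rho[i]):   # density-ascending order
        if any(label[r]):
            continue
        donors = [s for s in roots if any(label[s]) and rho[s] > rho[r]]
        if not donors:                              # fallback of (6)
            donors = [s for s in roots if any(label[s]) and s != r]
        if donors:
            label[r] = label[min(donors, key=lambda s: dist(r, s))][:]
    return label
```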
III-A3 From parent to children
After the previous two stages, all root nodes of the subtrees are labeled. In the P2C propagation, the labels are propagated in a top-down fashion, i.e., they are sequentially propagated from the top layer to the bottom layer, and this process can be parallelized over the independent subtrees.
We need to consider two situations. a) For a parent $P$, all children $C_i$, $1\le i\le m$, are unlabeled. Here, we simply assign $y_{C_i}=y_P$, because this assignment directly satisfies (4) no matter what value each $N_i$ takes. b) For a parent $P$, without loss of generality, assume the first $l$ children are labeled and the remaining $m-l$ children are unlabeled. In this situation, we generate a virtual parent $VP$ to replace the original $P$ and the labeled children. Using (4), we have
$y_{VP}=\frac{\left(\sum_{i=1}^{m} N_i\right) y_P-\sum_{i=1}^{l} N_i\, y_{C_i}}{\sum_{i=l+1}^{m} N_i} \qquad (7)$
Then, the unlabeled children can be assigned the label $y_{VP}$ as in the first situation. The concept of the virtual parent is illustrated in Fig. 3.
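The virtual-parent trick can be sketched as follows, again under our weighted-average reading of (4); `None` marks an unlabeled child, and all names are illustrative.

```python
def p2c_virtual_parent(y_parent, w_children, y_children):
    """Parent-to-children (P2C) step for a parent with some labeled
    children: a virtual parent replaces the real parent and its labeled
    children, and every unlabeled child (marked None) receives the virtual
    parent's label, so that the weighted-average relation still holds."""
    w_all = sum(w_children)
    w_unlabeled = sum(w for w, y in zip(w_children, y_children) if y is None)
    if w_unlabeled == 0:
        return y_children                       # nothing to propagate
    y_vp = [w_all * v for v in y_parent]        # total-weight-scaled parent label
    for w, y in zip(w_children, y_children):
        if y is not None:                       # subtract labeled children
            for k in range(len(y_vp)):
                y_vp[k] -= w * y[k]
    y_vp = [v / w_unlabeled for v in y_vp]
    return [y if y is not None else y_vp[:] for y in y_children]
```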
III-B LaPOLeaF algorithm
We present the overall algorithm of LaPOLeaF here, including some basic information about OLF construction.
III-C An example on the double-moon dataset
We generate a double-moon dataset of 600 data points, 300 in each moon, to illustrate the main stages of LaPOLeaF and help the reader build an intuitive impression of the method. Five labeled points are randomly selected from each moon. In the first step, the OLF is constructed using the steps described in Part 1 of Algorithm 1 (Fig. 4a). The root of each subtree is marked with a yellow face and a red edge. It is easily observed that the edges appearing in the OLF are much sparser than those in other GSSL methods based on nearest neighbors [2], [5].
In the C2P propagation (Fig. 4b), the nodes in the subtrees are first tagged with a layer index. The subtree of greatest height has 14 layers. After the bottom-up label propagation, the root of each labeled subtree becomes labeled, and the other nodes on the path from an initially labeled node to its corresponding root are labeled as well. There are 44 labeled nodes now, while the unlabeled subtrees remain unchanged.
Fig. 4c shows the R2R propagation stage, in which each unlabeled root borrows the label from its nearest neighboring root of higher density. The green arrows show the label borrowing, with the arrowhead indicating the label owner.
III-D Deriving the label for a new datum
A salient advantage of LaPOLeaF is that it can obtain the label for a new datum (let us denote this task as LXNew) in $O(N)$ time. This is because (a) the leading tree structure can be incrementally updated in $O(N)$ time and the LoDOG algorithm can find $N_g^{*}$ in $O(N)$ time, so the OLF can be updated in $O(N)$ time; and (b) the label propagation on the OLF takes $O(N)$ time.
The interested reader can refer to our previous work [7], in which we provided a detailed description of the algorithm for incrementally updating the fat-node leading tree, along with a proof of its correctness.
IV Scalability of LaPOLeaF
To scale LaPOLeaF to the big data context, we propose two approaches. One uses a parallel computing platform and the divide-and-conquer strategy to obtain an exact solution; the other is an approximate approach based on locality-sensitive hashing (LSH).
IV-A Divide-and-conquer approach
With the divide-and-conquer strategy, three aspects of the problem need to be addressed. (a) Computing the distance matrix, which has $O(N^2)$ time complexity. (b) The computation of $\rho_i$ and $\delta_i$ needs access to a whole row of the distance matrix, so computing $\rho$ and $\delta$ for all data also has $O(N^2)$ complexity. (c) The distances between the centers must be prepared in advance for the R2R propagation stage, since the memory of a single computer cannot accommodate the whole distance matrix of a large dataset, and the distances between centers cannot be retrieved directly from it. Apart from these three parts, the other steps of LaPOLeaF are linear in $N$ and usually can run on a single machine.
IV-A1 Computing the distance matrix in parallel
The distance matrix puts a considerable burden on both computation time and memory capacity. For example, the memory required for the distance matrix of 100,000 samples is over 37 GB, even when each distance is stored as a 4-byte float.
Here, we propose a divide-and-conquer method for exactly (not approximately) computing a large distance matrix, whose idea is illustrated in Fig. 5.
Although computing the distance matrix has $O(N^2)$ complexity, the positive message is that mainstream CPU manufacturers (such as Intel and AMD) and scientific computing software (such as Matlab and R) have made great efforts to accelerate matrix operations. For the $L_2$-norm distance metric, instead of computing the distances between object pairs one by one, we formulate the distances for the full connections between the two parts of a bipartite graph as in Theorem 2. For a small dataset of 1,000 instances and 8 attributes on Matlab 2014a, the matrix computation runs about 20 times faster than the pairwise distance computation.
Theorem 2.
The Euclidean distance matrix for the full connections within a complete bipartite graph is given by the element-wise square root of $D^{(2)}$. $D^{(2)}$ is computed via
$D^{(2)}=(A\circ A)\,\mathbf{1}_{d\times n}+\mathbf{1}_{m\times d}\,(B\circ B)^{\top}-2AB^{\top} \qquad (8)$
where $A$ and $B$ are the matrices formed with the data points (as row vectors) in the two parts of the bipartite graph, whose sizes are $m\times d$ and $n\times d$, respectively. $A\circ A$ is the element-wise square of $A$, and $\mathbf{1}$ denotes an all-ones matrix of the indicated size.
Proof.
Considering an element $d^{(2)}_{ij}$ of $D^{(2)}$, we can write
$d^{(2)}_{ij}=\|a_i-b_j\|^2=\sum_{k=1}^{d}a_{ik}^2+\sum_{k=1}^{d}b_{jk}^2-2\sum_{k=1}^{d}a_{ik}b_{jk}, \qquad (9)$
which is exactly the $(i,j)$ element of the right-hand side of (8).
Thus, the theorem is proven. ∎
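Theorem 2 translates directly into vectorized code; a sketch with NumPy (the clipping of tiny negative round-off values is a practical addition, not part of the theorem):

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances for all pairs across a complete
    bipartite graph via three matrix operations (cf. Theorem 2):
    D2[i, j] = ||a_i||^2 + ||b_j||^2 - 2 a_i . b_j."""
    sqA = np.sum(A * A, axis=1, keepdims=True)     # m x 1 squared row norms
    sqB = np.sum(B * B, axis=1, keepdims=True)     # n x 1 squared row norms
    D2 = sqA + sqB.T - 2.0 * (A @ B.T)
    return np.maximum(D2, 0.0)   # clip tiny negatives caused by round-off

A = np.array([[0.0, 0.0], [1.0, 0.0]])             # part one, m = 2 points
B = np.array([[0.0, 3.0], [4.0, 0.0]])             # part two, n = 2 points
D = np.sqrt(pairwise_sq_dists(A, B))               # element-wise square root
```

The broadcasting of the two norm vectors plays the role of the all-ones matrices in (8).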
Ideally, the distance matrix of arbitrarily sized data can be computed in this way, provided there are enough computers. However, if the available computers are not adequate for the dataset at hand, one can turn to the second approach, LSH.
IV-A2 Computing $\rho$, $\delta$, and nneigh in parallel
Because of the additive nature of the local density $\rho$, the whole vector $\rho$ can be computed in a fully parallel fashion, when the whole distance matrix is split into belts of rows and stored separately on different computing nodes. Suppose there are $K$ blocks $D^{(1)},\ldots,D^{(K)}$ of the distance matrix for the $N$ samples; then we have
$\rho=\sum_{k=1}^{K}\rho^{(k)}, \qquad (10)$
$\rho^{(k)}_j=\sum_{i\in I_k} e^{-\left(d^{(k)}_{ij}/d_c\right)^2}, \qquad (11)$
where $\rho^{(k)}$ is the local density vector of $N$ elements w.r.t. the distance matrix block $D^{(k)}$, $I_k$ is the set of row indices held in block $k$, and $d^{(k)}_{ij}$ is the $(i,j)$ element in the $k$th distance matrix block.
Unlike on a single computer, where $\delta$ can be computed with the guidance of the sorted $\rho$, computing each $\delta_i$ in parallel has to access the whole vector $\rho$ and all $d_{ij}$ with $\rho_j>\rho_i$.
IV-A3 Preparing the distance matrix for the centers
The R2R propagation stage needs access to the distance between any pair of centers (roots of the subtrees in the OLF), denoted as $D_C$. If the distance matrix is stored on a centralized computer, $D_C$ can be extracted directly from the whole distance matrix $D$. However, when $D$ is stored in a distributed system, it is divided into $K$ blocks, denoted as $D^{(1)},\ldots,D^{(K)}$. Each $D^{(k)}$ is stored on a different computing node, and the index range of the instances whose distances are held in $D^{(k)}$ is $[(k-1)N/K+1,\ kN/K]$. Usually the blocks are of equal size, except possibly the last one.
Therefore, to extract $D_C$ from the distributed $D$, one first sorts the centers according to their indices in $X$ in ascending order, and then gets the distance entry between center $c_i$ and center $c_j$ via
$D_C(i,j)=D^{(q)}\!\left(I(c_i)-(q-1)\frac{N}{K},\ I(c_j)\right),\quad q=\left\lceil\frac{I(c_i)}{N/K}\right\rceil \qquad (12)$
where $I(\cdot)$ returns the index of a center in $X$.
By sorting the centers, each distance matrix block needs to be accessed only once to obtain $D_C$.
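The block lookup of (12) is a one-line index translation; a sketch with 0-based indices and equally sized row blocks:

```python
def center_distance(i_idx, j_idx, blocks, rows_per_block):
    """Look up the distance between two centers from row-partitioned
    distance-matrix blocks: block q holds the global rows
    [q * rows_per_block, (q + 1) * rows_per_block). 0-based indices."""
    q = i_idx // rows_per_block              # which block holds row i_idx
    return blocks[q][i_idx - q * rows_per_block][j_idx]
```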
IV-B Approximate approach with LSH
As mentioned above, if there are not enough computers to run exact LaPOLeaF on a given large dataset, it is reasonable to merge closely neighboring data points into one bucket by employing LSH techniques [14], [12], [13]. The basic idea of LSH is that the nearest neighbors have a high probability of sharing the same hash code (viz., colliding with each other), while faraway data points are unlikely to collide.
For different distance metrics, we need different hash functions. For the $L_2$ norm, the hash function is [12]
$h_{\mathbf{a},b}(\mathbf{v})=\left\lfloor\frac{\mathbf{a}\cdot\mathbf{v}+b}{w}\right\rfloor \qquad (13)$
where $\mathbf{a}$ is a random vector with i.i.d. Gaussian entries, $w$ is the bucket width, and $b$ is a random real number sampled uniformly from the interval $[0,w)$. For angular similarity, the hash function can be [14]
$h_{\mathbf{a}}(\mathbf{v})=\mathrm{sgn}(\mathbf{a}\cdot\mathbf{v}) \qquad (14)$
where $\mathbf{a}$ is a random vector, and $\mathrm{sgn}(\cdot)$ is the sign function.
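Both hash families are a few lines each; this sketch uses deterministic toy vectors instead of random draws so the collision behavior is easy to see.

```python
import math

def l2_hash(v, a, b, w):
    """p-stable LSH for the L2 norm, cf. (13): h(v) = floor((a.v + b) / w).
    In practice a has i.i.d. Gaussian entries and b ~ Uniform[0, w)."""
    return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def sign_hash(v, a):
    """Random-projection hash for angular similarity, cf. (14): sgn(a.v)."""
    return 1 if sum(ai * vi for ai, vi in zip(a, v)) >= 0 else -1

# Deterministic toy projection direction: nearby points share a bucket,
# a faraway point does not.
a = [1.0, 0.0]
near_codes = (l2_hash([2.0, 0.0], a, 0.0, 4.0), l2_hash([2.1, 0.0], a, 0.0, 4.0))
far_code = l2_hash([10.0, 0.0], a, 0.0, 4.0)
```

In a real deployment, several independent draws of $\mathbf{a}$ and $b$ are concatenated so that only genuinely close points collide on all of them.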
Ji et al. improved the work in [14] by introducing the Gram-Schmidt orthogonalization process to the random vector group, forming a representing unit named a "super-bit" [13].
After running the LSH algorithm on the original dataset, the instances are put into many buckets. Each bucket is then treated as a fat node by LaPOLeaF, and the number of data points lying in the $i$th bucket is the $N_i$ in (4).
V Time complexity and relationship to related works
V-A Complexity analysis
By investigating each step of Algorithm 1, we find that, except for the calculation of the distance matrix, which requires $N(N-1)/2$ basic distance computations, all other steps of LaPOLeaF have time complexity linear in the size of $X$. Compared with LLGC [5], FLP [2], AGR [1], and HAGR [3], LaPOLeaF is much more efficient, as listed in Table I. In Table I, $N$ is the size of $X$; $t$ is the number of iterations; $C$ is the number of classes; and $n_i$ is the number of points on the $i$th layer. The empirical evaluation in Section VI verifies this analysis.
Please note that although we write $O(N^3)$ for the straightforward computation of the matrix inverse, this complexity can be reduced, e.g., to about $O(N^{2.373})$ [15], [16].
Methods  Graph construction  Label propagation 

LLGC  
FLP  
AGR  
HAGR  
LaPOLeaF  
It is also worthwhile to compare the efficiency of LXNew. For LaPOLeaF, LXNew requires only linear time w.r.t. the size of the existing OLF. For the traditional GSSL methods, however, LXNew requires a running time as long as that of labeling all data points in $X$.
V-B Relationship discussion
The label propagation in LaPOLeaF is a heuristic algorithm without an explicit optimization objective, so it offers no mathematical guarantee of reaching the best solution at this stage. However, we argue that the optimization has been moved forward to the OLF construction stage. Since we have obtained an optimal partially ordered structure of the whole dataset, we believe that an iterative optimization which regards the data as being in a peer-to-peer relation is no longer compulsory. This is the essential difference between LaPOLeaF and other GSSL methods.
Meanwhile, LaPOLeaF can be regarded as an improved version of KNN. In KNN, the $K$ nearest neighbors are considered as a spherical information granule, and the unlabeled data are assigned labels with a voting strategy. The parameter $K$ is set by the user, and the result is quite sensitive to the choice of $K$. By contrast, in LaPOLeaF the information granules are arbitrarily shaped leading trees, and the size of each tree is automatically decided by the data and LaPOLeaF, so the sizes usually differ. Because the OLF better captures the nature of the data distribution, and the label propagation is reasonably designed, LaPOLeaF consistently outperforms KNN.
VI Experimental studies
The efficiency and effectiveness of LaPOLeaF are evaluated on five real-world datasets, among which three are small datasets from the UCI machine learning repository and the other two are larger in scale. The information on the datasets is shown in Table II. The three small datasets are used to demonstrate the effectiveness of LaPOLeaF, and the other two are used to show its scalability through parallel computing and locality-sensitive hashing (LSH).
The experiments on the small datasets and the Activity data were conducted on a personal computer with an Intel i5-2430M CPU and 16 GB DDR3 memory. The MNIST data is learned both on the PC and on a Spark cluster of eight workstations.
Dataset  # Instances  # Attributes  # Classes 

Iris  150  4  3 
Wine  178  13  3 
Yeast  1,484  8  8 
MNIST  70,000  784  10 
Activity  43,930,257  16  6 
VI-A UCI small datasets
With the three small UCI datasets, namely Iris, Wine, and Yeast, it is shown that LaPOLeaF achieves competitive accuracy with much higher efficiency, compared with the classical semi-supervised learning methods Linear Discriminant Analysis (LDA) [17], Neighborhood Component Analysis (NCA) [18], Semi-supervised Discriminant Analysis (SDA) [19], and the Framework of Learning by Propagability (FLP) [2]. The parameter configurations and some experimental details, such as the chosen distance metric and the preprocessing method, for all five datasets are listed in Table III.
Dataset  percent  Preprocessing  Distance  
Iris  2  0.25  8  6  Zscore  Euclidean  
Wine  2  0.4  8  6  NCA DR  Euclidean  
Yeast  5  0.1  7  16  Zscore  Cosine  
MNIST  10  0.3  405  100  none  Euclidean  
Activity  8  0.3  347  16  see Fig. 7  Cosine  
: DR here is the abbreviation of dimensionality reduction. 
The accuracies of the competing models on the three datasets are shown in Table IV, from which one can see that LaPOLeaF achieves the best accuracy twice, and its accuracy is comparable even on the Wine dataset, where FLP wins.
Method  Iris  Wine  Yeast 
LDA  66.91±25.29  62.05  19.00±8.70 
NCA  92.28±3.24  83.10±9.70  32.76±6.32 
SDA  89.41±5.40  90.89±5.39  37.00±6.89 
FLP  93.45±3.09  93.13±3.32  40.03±5.40 
LaPOLeaF  94.86±4.57  90.68±5.32  42.28±2.36 
The main purpose of LaPOLeaF is not to improve the accuracy of GSSL, but to improve its efficiency by getting rid of the paradigm of iteratively optimizing an objective function. LaPOLeaF exhibits very high efficiency; for example, it completes the whole SSL process for the Iris dataset within 0.27 seconds on the aforementioned personal computer.
VI-B MNIST dataset
The MNIST dataset contains 70,000 handwritten digit images in total, about 7,000 for each digit ('0'-'9'). To emphasize the effectiveness of LaPOLeaF itself, we directly use the original pixel data as the learning features, as in [20]. Since the distance matrix is oversized, we applied the divide-and-conquer technique described in Section IV-A. The whole dataset is equally divided into 7 subsets, so the size of each distance matrix block is 10,000×70,000.
After computing the matrix blocks and the vector group {$\rho$, $\delta$, nneigh}, the OLF can be constructed by running the LoDOG algorithm on a single machine. The parameters and some intermediate results for the two datasets MNIST and Activity are detailed in the last two rows of Table III. The objective function values for choosing $N_g^{*}$ on the MNIST data are shown in Fig. 6.
Ten labeled samples are randomly chosen for each digit, and the accuracies achieved by LaPOLeaF and the state-of-the-art method Hierarchical Anchor Graph Regularization (HAGR) [20] are listed in Table V.
Method  HAGR  HAGR  LaPOLeaF 

Accuracy  79.17±1.39  88.66±1.23  84.92±2.35 
One can see that LaPOLeaF achieves competitive accuracy on the MNIST data. However, the highlight of LaPOLeaF is its efficiency: it completes the whole learning process, including the OLF construction and the three stages of label propagation, within 48 minutes on the personal computer. The time consumption is detailed in Table VI.
Stage  LoDOG  LP  
Time(s)  952  1483  436  15  3 
: LP here is the abbreviation of label propagation. 
VI-C Activity dataset
The Activity dataset comes from the domain of human activity recognition (HAR) [21]. It includes monitoring data sampled by the accelerometers and gyroscopes built into smartphones and smartwatches. Since there are many different models of phones and watches, whose sampling frequencies and accuracies differ, the collected data are heterogeneous. The dataset contains 43,930,257 observations with 16 attributes each.
VI-C1 Preprocessing
Because the raw data are separated into 4 comma-separated values (.csv) files, we first perform the preprocessing shown in Fig. 7.
i) The records from the four .csv files are aligned and merged into one file. We use the subject ID and the equipment ID to align the data from different files, and the differences in sampling frequency are dealt with by interpolation.
ii) The empirical cumulative distribution function (ECDF) feature has been reported to outperform FFT and PCA features in HAR tasks [22], so we compute ECDF features from the original data and use them in the subsequent learning. Conventionally, the time frame is set to 1 second and the overlap ratio to 50%. Since the dominant sampling frequency is 200 Hz, we include 200 observations in one frame to compute the ECDF, and move forward 100 observations after each row of ECDF features is computed. In this way, the time granularity of activity recognition is half a second. The number of segments is set to 5, so the resulting dimensionality of the feature is 6×5=30. ECDF reduces the size of the Activity data to 439,302 rows (about 1% of the original).
iii) Min-max normalization is applied to every column of the data.
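The ECDF descriptor of step ii) can be realized per axis as values of the inverse empirical CDF taken at equally spaced probabilities; the exact variant used in [22] may differ (e.g., whether the per-axis mean is appended), so the sketch below is an assumption.

```python
def ecdf_features(frame_cols, n_seg=5):
    """Per-axis ECDF descriptor: n_seg values of the inverse empirical CDF
    taken at equally spaced mid-point probabilities. frame_cols is a list
    of per-axis sample lists; the output has len(frame_cols) * n_seg values."""
    feats = []
    for col in frame_cols:
        s = sorted(col)
        m = len(s)
        for k in range(n_seg):
            p = (k + 0.5) / n_seg            # probabilities in (0, 1)
            feats.append(s[min(int(p * m), m - 1)])
    return feats
```

With 6 sensor axes and 5 segments per axis, a 200-observation frame yields the 30-dimensional feature row described above.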
iv) Because of the large numbers of samples and features, we employ LSH (specifically, SB-LSH [13]) to tackle both problems at the same time. With the SB-LSH method, we empirically set the depth of a super-bit to 5 and the number of super-bits to 6. LSH further reduces the number of data rows by merging the collided samples into the same bucket. For example, the resulting number of hash buckets is 771 for the subset of subject_a and smartphone model Nexus4_1, compared with 3,237 rows of ECDF features. The number of samples contained in each hash bucket is treated as the weight when computing $\rho$ in the OLF construction.
VI-C2 LaPOLeaF for the Activity data
After the preprocessing stage, the Activity data have been transformed into 84,218 rows of categorical features. Each row consists of a sequence of hash bucket numbers, and the weight of each row is the number of ECDF rows sharing the same hash bucket number sequence. The distance matrix is computed based on the ECDF features in the bucket sequence rather than the hash codes themselves, and the cosine distance is used. The OLF parameters are set as listed in Table III; this configuration leads to 347 subtrees in the OLF.
With the constructed OLF, we run the C2P, R2R, and P2C propagations sequentially, taking 120 randomly selected labeled ECDF features (20 per class) as the labeled data. The final accuracy achieved by LaPOLeaF is 86.27±3.36.
VII Conclusions
The existing GSSL methods have two weaknesses. One is the low efficiency caused by the iterative optimization process, and the other is the inconvenience of predicting labels for newly arrived data. This paper first made the assumption that neighboring data points are not of equal status but lie in a partial-order relation, and that the label of a center can be regarded as the contribution of its followers. Based on this assumption and our previous work LoDOG, a new non-iterative semi-supervised approach called LaPOLeaF was proposed. LaPOLeaF exhibits two salient advantages: a) it has much higher efficiency than the state-of-the-art models while keeping the accuracy comparable; b) it can deliver labels for a few newly arrived data with a time complexity of $O(N)$, where $N$ is the number of old data points forming the optimal leading forest (OLF). To enable LaPOLeaF to accommodate big data, we proposed an exact divide-and-conquer approach and an approximate locality-sensitive hashing (LSH) approach. Theoretical analysis and empirical validation have shown the effectiveness and efficiency of LaPOLeaF. We plan to extend LaPOLeaF in two directions: one is to apply it to real-world big data mining problems with tight time constraints, and the other is to improve its accuracy while keeping the high efficiency unchanged.
We provide the algorithm for determining the layer index of each node in a tree using a queue data structure.
The time complexity of Algorithm 2 is O(n), because the basic operations for each node are a single EnQue() and a single DeQue().
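A minimal sketch of this queue-based layer assignment (the function name and the tree representation via child lists are our assumptions, not the paper's pseudocode) might look like:

```python
from collections import deque

def layer_indices(children, root=0):
    """Breadth-first assignment of a layer index (depth) to every node of
    a tree. `children[i]` lists the children of node i. Each node is
    enqueued and dequeued exactly once, hence the O(n) total cost."""
    layer = {root: 0}
    q = deque([root])           # EnQue(root)
    while q:
        node = q.popleft()      # DeQue()
        for child in children[node]:
            layer[child] = layer[node] + 1
            q.append(child)     # EnQue(child)
    return layer
```

Because every node passes through the queue exactly once, the loop body executes n times in total, matching the stated complexity.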
Acknowledgment
This work has been supported by the National Key Research and Development Program of China under grants 2016QY01W0200 and 2016YFB1000905, and the National Natural Science Foundation of China under grant 61572091.
References
 [1] W. Liu, J. He, and S.-F. Chang, “Large graph construction for scalable semi-supervised learning,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 679–686, 2010.
 [2] B. Ni, S. Yan, and A. Kassim, “Learning a propagable graph for semi-supervised learning: Classification and regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 114–126, 2012.
 [3] M. Wang, W. Fu, S. Hao, H. Liu, and X. Wu, “Learning on big graph: Label inference and regularization with anchor hierarchy,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 5, pp. 1101–1114, 2017.
 [4] J. Xu, G. Wang, and W. Deng, “DenPEHC: Density peak based efficient hierarchical clustering,” Information Sciences, vol. 373, pp. 200–218, 2016.
 [5] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in Neural Information Processing Systems, pp. 321–328, 2004.
 [6] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
 [7] J. Xu, G. Wang, T. Li, W. Deng, and G. Gou, “Fat node leading tree for data stream clustering with density peaks,” Knowledge-Based Systems, vol. 120, pp. 99–117, 2017.
 [8] W. Pedrycz and W. Homenda, “Building the fundamentals of granular computing: A principle of justifiable granularity,” Applied Soft Computing, vol. 13, no. 10, pp. 4209–4218, 2013.
 [9] W. Pedrycz, G. Succi, A. Sillitti, and J. Iljazi, “Data description: A general framework of information granules,” Knowledge-Based Systems, vol. 80, pp. 98–108, 2015.
 [10] X. Zhu, W. Pedrycz, and Z. Li, “Granular data description: Designing ellipsoidal information granules,” IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2016.2612226, 2016.
 [11] J. Xu, G. Wang, T. Li, and W. Pedrycz, “Local density-based optimal granulation and manifold information granule description,” IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2017.2750481, 2017.
 [12] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the Twentieth Annual Symposium on Computational Geometry, pp. 253–262, ACM, 2004.
 [13] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensitive hashing,” in Advances in Neural Information Processing Systems, pp. 108–116, 2012.
 [14] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 380–388, ACM, May 2002.
 [15] V. V. Williams, “Breaking the Coppersmith–Winograd barrier,” 2011.
 [16] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition. MIT Press, 2009.
 [17] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 711–720, Jul 1997.
 [18] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighborhood component analysis,” Advances in Neural Information Processing Systems, pp. 513–520, 2004.
 [19] D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” in 2007 IEEE 11th International Conference on Computer Vision, pp. 1–7, Oct 2007.
 [20] M. Wang, W. Fu, S. Hao, H. Liu, and X. Wu, “Learning on big graph: Label inference and regularization with anchor hierarchy,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, pp. 1101–1114, May 2017.
 [21] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition,” in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pp. 127–140, ACM, 2015.
 [22] N. Y. Hammerla, R. Kirkham, P. Andras, and T. Ploetz, “On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution,” in Proceedings of the 2013 International Symposium on Wearable Computers, pp. 65–68, ACM, 2013.