Non-iterative Label Propagation on Optimal Leading Forest

09/25/2017 ∙ by Ji Xu, et al. ∙ IEEE

Graph-based semi-supervised learning (GSSL) has an intuitive representation and can be improved by exploiting matrix calculation. However, it has to perform iterative optimization to achieve a preset objective, which usually leads to low efficiency. Another inconvenience of GSSL is that when new data arrive, the graph construction and the optimization have to be conducted all over again. We propose a sound assumption, arguing that neighboring data points are not in a peer-to-peer relation, but in a partially ordered relation induced by the local density and the distance between the data; and that the label of a center can be regarded as the contribution of its followers. Starting from this assumption, we develop a highly efficient non-iterative label propagation algorithm based on a novel data structure named the optimal leading forest (LaPOLeaF). The major weaknesses of traditional GSSL are addressed by this study. We further scale LaPOLeaF to accommodate big data by utilizing a block distance matrix technique, parallel computing, and Locality-Sensitive Hashing (LSH). Experiments on large datasets have shown promising results for the proposed methods.




I Introduction

Labels of data are laborious or expensive to obtain, while unlabeled data are generated or sampled in tremendous volume in the big data era. This is why semi-supervised learning (SSL) is drawing increasing interest and attention from the machine learning community. Among the many streams of SSL models, graph-based SSL (GSSL) has the reputation of being easily understood through visual representation, and its learning performance is convenient to improve by exploiting the corresponding matrix calculation. Therefore, there has been a lot of research in this regard, e.g., [1], [2], [3].

However, the existing GSSL models have two apparent limitations. One is that the models usually need to solve an optimization problem in an iterative fashion, hence the low efficiency. The other is that these models have difficulty in delivering labels for a new batch of data, because the solution for the unlabeled data is derived specifically for the given graph. With newly included data, the graph changes and the whole iterative optimization process has to be run once again.

We ponder the possible reasons for these limitations and argue that the crux is that these models treat the relationship among neighboring data points as “peer-to-peer”. Because the data points are considered equally significant in representing their class, most GSSL objective functions try to optimize on each data point with equal priority. However, this “peer-to-peer” relationship is questionable in many situations. For example, if a data point $x_i$ lies at the central location of the space of its class, then it has more representative power than another point $x_j$ that diverges more from the central location, even if $x_i$ and $x_j$ are in the same K-NN (or $\epsilon$-NN) neighborhood.

This paper is grounded on the partial-order-relation assumption: the neighboring data points are not in equal status, and the label of the leader (or parent) is the contribution of its followers (or children). The assumption is intuitively reasonable since there is an old saying: “a man is known by the company he keeps”. The labels of the peripheral data may change because of the model or parameter selection, but the labels of the core data are much more stable. Fig. 1 illustrates this idea.

Fig. 1: partial-order-relation assumption: the label of the center can be regarded as the contribution from the labels of its followers. Therefore, one can safely infer herein that the left unlabeled point is a triangle and the right one is a pentagram.

This paper proposes a non-iterative label propagation algorithm that takes our previous research work, namely local density based optimal granulation (LoDOG), as its starting point. In LoDOG, the input data are organized as an optimal number of subtrees. Every non-center node in the subtrees is led by its parent to join the microcluster the parent belongs to. In [4], these subtrees are called leading trees. The proposed method, Label Propagation on Optimal Leading Forest (LaPOLeaF), performs label propagation on the structure of the relatively independent subtrees in the forest, rather than on the traditional nearest neighbor graph.

Therefore, LaPOLeaF exhibits several advantages when compared with other GSSL methods: (a) the propagation is performed on the subtrees, so the edges under consideration are much sparser than those of a nearest neighbor graph;
(b) the subtrees are relatively independent of each other, so the massive label propagation computation is easier to parallelize when the number of samples is huge;
(c) LaPOLeaF performs label propagation in a non-iterative fashion, so it is highly efficient.
Overall, the LaPOLeaF algorithm is formulated in a simple way, and the empirical evaluations show promising accuracy and very high efficiency.

The rest of the paper is organized as follows. Section II briefly reviews the related work. The model of LaPOLeaF is presented in detail in Section III. Section IV describes the method to scale LaPOLeaF for big data. Section V analyzes the computational complexity and discusses the relationship to other research, and Section VI describes the experimental study. We reach a conclusion in Section VII.

II Related studies

II-A Graph-based semi-supervised learning (GSSL)

Suppose an undirected graph is denoted as $G = (V, E, W)$, where $V$ is the set of the vertices, $E$ the set of edges, and $W: E \to \mathbb{R}$ is the mapping from an edge to a real number (usually defined as the similarity between the two ending points). GSSL takes the input data as the vertices $V$ of the graph, and places an edge $e_{i,j}$ between two vertices $(v_i, v_j)$ if they are similar or correlated. The basic idea of GSSL is propagating the labels of the labeled samples to the unlabeled ones over the constructed graph. The propagation strength between $v_i$ and $v_j$ on each edge is in proportion to the weight $W_{i,j}$.

Almost all existing GSSL methods rest on two fundamental assumptions. One is called the “clustering assumption”, meaning that the samples in the same cluster should have the same labels. The clustering assumption is usually used for the labeled sample set. The other is called the “manifold assumption”, which means that similar (or neighboring) samples should have similar labels. The manifold assumption is used for both the labeled and unlabeled data.

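Starting from the two assumptions, GSSL usually aims at optimizing an objective function with two terms. However, the concrete components in different GSSL models vary. For example, in [5] the objective function is

$$Q(F) = \frac{1}{2}\left(\sum_{i,j=1}^{n} W_{ij}\left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \left\| F_i - Y_i \right\|^2\right) \tag{1}$$

where $F$ is the label indication matrix; $D_{ii}$ is the sum of the $i$-th row of $W$; $Y_i$ is the label of the labeled data.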

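Liu proposed an Anchor Graph Regularization (AGR) approach to predict the label for each data point as a locally weighted average of the labels of anchor points [1]. In AGR, the function is

$$Q(A) = \frac{1}{2}\left\| Z_l A - Y_l \right\|_F^2 + \frac{\gamma}{2}\,\mathrm{tr}\left(A^{\top} \tilde{L} A\right) \tag{2}$$

where $Z$ is the regression matrix that describes the relationship between raw samples and anchors; $\tilde{L} = Z^{\top} L Z$. $L$ is the Laplacian matrix.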

Wang proposed a hierarchical AGR method to address the granularity dilemma in AGR, by adding a series of intermediate granular anchor layers between the finest original data and the coarsest anchor layer [3].

One can see that the underlying philosophy is still the two assumptions. Slightly different from the two assumptions, Ni proposed a novel concept, graph harmoniousness, which integrates feature learning and label learning into one framework (framework of Learning by Propagability, FLP) [2]. The objective function has only one term in FLP, yet it also needs to obtain a locally optimal solution by alternately running an iterative optimization procedure on two variables.

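II-B Optimal leading forest

$\mathcal{X} = \{x_1, x_2, \ldots, x_n\}$ denotes the dataset. $I = \{1, 2, \ldots, n\}$ is the index set of $\mathcal{X}$. $d_{i,j}$ is the distance (under any metric) between $x_i$ and $x_j$.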

Definition 1.

Local density [6]. The local density of $x_i$ is computed as $\rho_i = \sum_{j \in I \setminus \{i\}} \exp\left(-\left(d_{i,j}/d_c\right)^2\right)$, where $d_c$ is the cut-off distance or band-width parameter.

Definition 2.

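Leading node and $\delta$-distance. If $x_{l_i}$ is the nearest neighbor of $x_i$ with higher local density, then $x_{l_i}$ is called the leading node of $x_i$. Formally, $l_i = \arg\min_{j}\{d_{i,j} \mid \rho_j > \rho_i\}$, denoted as $x_{l_i} = LN(x_i)$ for short. $d_{i,l_i}$ is called the $\delta$-distance of $x_i$, or simply $\delta_i$.

We store all the $x_{l_i}$ in an array named LN.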

Definition 3.

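Leading tree (LT) [4]. If $\rho_r = \max_{1 \le i \le n}\{\rho_i\}$, then $x_{l_r} = \emptyset$. Let an arrow start from $x_i$, $i \in I \setminus \{r\}$, and end at $x_{l_i}$. Thus, $\mathcal{X}$ and the arrows form a tree $T$. Each node $x_i$ in $T$ (except $x_r$) tends to be led by $x_{l_i}$ to join the same cluster $x_{l_i}$ belongs to, unless $x_i$ itself makes a center. Such a tree is called a leading tree.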

Definition 4.

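$\eta$ operator [7]. For any non-root node $x$ in an LT, there is a leading node $p$ for $x$. This mapping is denoted as $\eta(x) = p$.

We denote $\eta(\eta(\cdots\eta(x)\cdots))$ ($m$ applications of $\eta$) as $\eta^m(x)$ for short.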

Definition 5.

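Partial order in LT [7]. Suppose $x_i, x_j \in \mathcal{X}$; we say $x_i \prec x_j$ if and only if there exists $m \in \mathbb{N}^{+}$ such that $x_j = \eta^m(x_i)$.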

Definition 6.

Center potential. Let $\gamma_i$ denote the potential of $x_i$ to be selected as a center; $\gamma_i$ is computed as $\gamma_i = \rho_i \cdot \delta_i$.

Intuitively, if an object $x_i$ has a large $\rho_i$ (meaning it has many near neighbors) and a large $\delta_i$ (meaning it is relatively far from another object of larger $\rho$), then $x_i$ has a great chance to be the center of a collection of data.
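Definitions 1, 2 and 6 can be tried out on a toy distance matrix with the following minimal sketch. It assumes a Gaussian-kernel form of the local density, and the function name `leading_structure`, the use of -1 for "no leading node", and the convention of giving the densest point the largest distance in its row as its δ are illustrative choices, not taken from the paper.

```python
import math

def leading_structure(D, dc):
    """Return (rho, leader, delta, gamma) for a symmetric distance matrix D.

    Assumptions of this sketch: Gaussian-kernel density with bandwidth dc,
    leader[i] == -1 marks a point with no denser neighbor (the root), and the
    root's delta is set to the largest distance in its row.
    """
    n = len(D)
    # Definition 1 (assumed form): rho_i = sum_{j != i} exp(-(d_ij / dc)^2)
    rho = [sum(math.exp(-(D[i][j] / dc) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    leader = [-1] * n
    delta = [0.0] * n
    for i in range(n):
        # Definition 2: nearest neighbor among the points of higher density.
        denser = [j for j in range(n) if rho[j] > rho[i]]
        if denser:
            leader[i] = min(denser, key=lambda j: D[i][j])
            delta[i] = D[i][leader[i]]
        else:
            delta[i] = max(D[i])
    # Definition 6: center potential gamma_i = rho_i * delta_i
    gamma = [r * d for r, d in zip(rho, delta)]
    return rho, leader, delta, gamma

# Four points on a line; the point at 1 sits in the densest region.
points = [0, 1, 2, 10]
D = [[abs(a - b) for b in points] for a in points]
rho, leader, delta, gamma = leading_structure(D, dc=2.0)
```

Points with exactly equal density would each lack a denser neighbor here, so a real implementation would need a tie-breaking rule; the sketch leaves that out.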

Pedrycz proposed the principle of justifiable granularity, indicating that a good information granule (IG) should have sufficient experimental evidence and specific semantics [8], [9], [10]. That is, as many data points as possible should be included in an IG, and the closure of the IG should be compact and tight from a geometric perspective.

Following this principle, we have proposed a local density based optimal granulation (LoDOG) method to build justifiable granules accurately and efficiently [11]. In LoDOG, we construct the optimal IGs of $\mathcal{X}$ by disconnecting the corresponding leading tree into an optimal number of subtrees. The optimal number $N_g^{opt}$ is derived by minimizing the objective function:

$$\min_{N_g} Q(N_g) = \alpha \cdot H(N_g) + (1 - \alpha) \sum_{i=1}^{N_g} DistCost(\Omega_i) \tag{3}$$

where $DistCost(\Omega_i) = \sum_{j=1}^{|\Omega_i| - 1} \{\delta_j \mid x_j \in \Omega_i \setminus R(\Omega_i)\}$.

Here, $N_g$ is the number of IGs; $\alpha$ is the parameter striking a balance between the experimental evidence and the semantics; $\Omega_i$ is the set of points included in the $i$-th granule; $|\cdot|$ returns the cardinality of a set; $H(\cdot)$ is a strictly monotonically increasing function used to adjust the magnitude of $N_g$ to match that of $\sum_{i=1}^{N_g} DistCost(\Omega_i)$. This function can be automatically selected from a group of common functions such as logarithm functions, linear functions, power functions, and exponential functions; $R(\Omega_i)$ is the root of the granule $\Omega_i$ as a leading tree.

We used LoDOG to construct the Optimal Leading Forest (OLF) from the dataset. The readers are referred to [11] for more details of LoDOG.

Definition 7.

Optimal leading forest (OLF). $N_g^{opt}$ leading trees can be constructed from the dataset $\mathcal{X}$ by using the LoDOG method. All the leading trees are collectively called the optimal leading forest.

The concept of OLF is used to determine the localized ranges of label propagation on the whole leading tree of $\mathcal{X}$. That is, the OLF indicates where to stop propagating the label of a labeled datum to its neighbors.

III Label Propagation on Optimal Leading Forest (LaPOLeaF)

LaPOLeaF first makes a global optimization to construct the OLF, then performs label propagation on each of the subtrees. Following the aforementioned partial-order-relation assumption, the relationship between the children and their parent is formulated as (4), and each stage of the label propagation of LaPOLeaF is guided by this formula.

$$L_p = \frac{\sum_i W_i \cdot L_i}{\sum_i W_i}, \quad \text{where } W_i = \frac{pop_i}{dist(i, p)} \tag{4}$$

$L_p$ is the label vector of the parent for a $K$-classification problem. $L_i$ is the label vector of the $i$-th child w.r.t. the current parent. A vector whose $k$-th element equals one and whose other elements all equal zero represents the class label of the $k$-th class, $1 \le k \le K$. For regression problems, $L_i$ and $L_p$ are simply scalar values. $pop_i$ is the population of the raw data points merged in the fat node $x_i$ in the subtree, if the node is derived as an information granule by some granulation method such as locality-sensitive hashing (LSH) (e.g., [12][13]) or others. If no granulation is performed before the LT construction, all $pop_i$ are assigned the constant 1.

LaPOLeaF is designed to consist of three stages after the OLF has been constructed, namely, from children to parent (C2P), from root to root (R2R), and from parent to children (P2C). The idea of these stages is illustrated in Figure 2.

Fig. 2: Diagram of non-iterative label propagation on the subtrees of an FNLT. (a) A parent gets its label as the weighted summation of the labels of its children, and the label of its own parent is computed likewise in a cascade fashion. (b) The root of an unlabeled subtree has to borrow its label from the root of another subtree. If that root is not labeled either, this “borrow” operation is carried out transitively (see Section III-A2 for details). (c) After the previous two stages, all roots of the subtrees are guaranteed to be labeled. Then, also under the guidance of (4), all the unlabeled children get their label information in a top-down fashion.

To decide the layer number for each node, one can easily design a hierarchical traverse algorithm (see Appendix) for the sub-leading-tree.
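A breadth-first traversal is one simple way to obtain the layer numbers. The sketch below is illustrative and not the algorithm from the Appendix; it assumes the tree is given as a hypothetical `parent` array in which a root is marked by -1, and it numbers the root layer as 1.

```python
from collections import deque

def layer_indices(parent):
    """Return the layer number of every node (roots are on layer 1)."""
    children = [[] for _ in parent]
    roots = []
    for node, p in enumerate(parent):
        if p < 0:
            roots.append(node)
        else:
            children[p].append(node)
    layer = [0] * len(parent)
    queue = deque((r, 1) for r in roots)
    while queue:
        node, depth = queue.popleft()
        layer[node] = depth
        # Children sit one layer below their parent.
        queue.extend((c, depth + 1) for c in children[node])
    return layer

# Node 1 is the root, nodes 0 and 2 are its children, node 3 is a child of 2.
print(layer_indices([1, -1, 1, 2]))
```

Each node is visited once, so the traversal is linear in the number of nodes.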

III-A Three key stages of label propagation in LaPOLeaF

III-A1 From children to parent

Definition 8.

unlabeled (labeled) node. A node in the subtree of the OLF is an unlabeled node (or the node is unlabeled), if its label vector is a zero vector. Otherwise, i.e., if its label vector has at least one element greater than zero, the node is called a labeled node (or the node is labeled).

Definition 9.

unlabeled (labeled) subtree. A subtree in OLF is called an unlabeled subtree (or the subtree is unlabeled), if every node in this tree is not labeled. Otherwise, i.e., if the leading tree contains at least one labeled node, this tree is called a labeled subtree (or the subtree is labeled).

Since the label of a parent is regarded as the contribution of its children, the propagation process is required to start from the bottom of each subtree. The label vector of an unlabeled child is initialized as the zero vector $\mathbf{0}$, therefore it does not contribute to the label of its parent. Once the layer index of each node is ready, the bottom-up propagation can be executed in parallel for the labeled subtrees.

Proposition 1.

After C2P propagation, the root of a labeled subtree must be labeled.


Proof. According to the definitions of labeled node and labeled subtree, and the procedure of C2P propagation, a parent is labeled if at least one of its children is labeled after the corresponding round of the propagation. The propagation progresses sequentially in the bottom-up direction, and the root is the parent at the top layer. Therefore, the proposition holds. ∎
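The bottom-up pass can be sketched as follows, under stated assumptions: formula (4) is read as a weighted average of the children's label vectors with weight equal to population divided by distance to the parent, unlabeled nodes carry zero vectors (Definition 8), an already labeled node keeps its label, and the names `c2p`, `parent` and `dist_to_parent` are hypothetical. Distances are assumed to be strictly positive.

```python
def c2p(parent, dist_to_parent, labels, pop=None):
    """Children-to-parent propagation on one subtree (illustrative sketch)."""
    n = len(parent)
    pop = pop or [1] * n
    children = [[] for _ in range(n)]
    for node, p in enumerate(parent):
        if p >= 0:
            children[p].append(node)

    def depth(node):
        d = 0
        while parent[node] >= 0:
            node = parent[node]
            d += 1
        return d

    out = [list(v) for v in labels]
    # Deepest nodes first, so every child is final before its parent is computed.
    for p in sorted(range(n), key=depth, reverse=True):
        if any(out[p]) or not children[p]:
            continue  # labeled nodes keep their label; leaves have no children
        weights = [pop[c] / dist_to_parent[c] for c in children[p]]
        total = sum(weights)
        out[p] = [sum(w * out[c][k] for w, c in zip(weights, children[p])) / total
                  for k in range(len(out[p]))]
    return out

# Node 1 is the root; nodes 0 and 3 are labeled, nodes 1 and 2 are not.
parent = [1, -1, 1, 2]
dist_to_parent = [1.0, 0.0, 1.0, 8.0]   # the root's entry is unused
labels = [[0, 1], [0, 0], [0, 0], [1, 0]]
result = c2p(parent, dist_to_parent, labels)
```

In this toy tree node 2 inherits the label of its only child, and the root ends up with an even mixture of the two classes, which illustrates Proposition 1: the root of a labeled subtree is labeled after the pass.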

III-A2 From root to root

If the labeled data are rare or unevenly distributed, there will be some unlabeled subtrees. In such a case, we must borrow label information from other labeled subtrees. Because the label of the root is more stable than that of other nodes, the root $r_u$ of an unlabeled subtree should borrow label information from the root $r_l$ of a labeled subtree. However, there must be some requirements for $r_l$. To keep consistent with our partial-order assumption, $r_l$ is required to be superior to $r_u$ and to be the nearest such root to $r_u$. Formally,

$$r_l = \arg\min_{r_i \in R_L} \{ dist(r_u, r_i) \mid r_u \prec r_i \} \tag{5}$$

where $R_L$ is the set of labeled roots.

If there exists no such $r_l$ for a particular $r_u$, we can conclude that the root $r_T$ of the whole leading tree constructed from $\mathcal{X}$ (before splitting into a forest) is not labeled. So, to guarantee that every unlabeled root can successfully borrow a label, we only need to guarantee that $r_T$ is labeled.

If $r_T$ is unlabeled after C2P propagation, we consider the label-borrowing trick for $r_T$ as well. However, there is no other root $r_l$ satisfying $r_T \prec r_l$, so we modify (5) a little to borrow a label for $r_T$ from its nearest labeled root:

$$r_l = \arg\min_{r_i \in R_L} \{ dist(r_T, r_i) \} \tag{6}$$
The R2R propagation is executed for the unlabeled roots in the representation-power-ascending order.
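A possible reading of the R2R stage is sketched below. Several details are assumptions of the sketch rather than statements of the paper: "superior" is tested by walking up the leading-node array of the whole leading tree, the root of the whole tree is handled first so that every other root is guaranteed a labeled superior, a root that has just borrowed a label becomes available to later borrowers, and the names `r2r`, `leader` and `power` are hypothetical. At least one root is assumed to be labeled.

```python
def r2r(roots, leader, D, labels, power):
    """Root-to-root label borrowing (illustrative sketch).

    roots  : indices of the subtree roots in the OLF
    leader : leading node of every point in the whole leading tree (-1 = none)
    D      : distance lookup, D[i][j]
    labels : dict mapping each root to its label vector (zero vector = unlabeled)
    power  : representation power of every point, used only for ordering
    """
    def superiors(node):
        found = set()
        while leader[node] >= 0:
            node = leader[node]
            found.add(node)
        return found

    out = {r: list(v) for r, v in labels.items()}
    labeled = {r for r in roots if any(out[r])}
    top = next(r for r in roots if leader[r] < 0)   # root of the whole tree
    if top not in labeled:
        # Modified rule: the top root borrows from its nearest labeled root.
        source = min(labeled, key=lambda r: D[top][r])
        out[top] = list(out[source])
        labeled.add(top)
    pending = sorted((r for r in roots if r not in labeled), key=lambda r: power[r])
    for r in pending:
        # Basic rule: nearest labeled root among the superiors of r.
        candidates = superiors(r) & labeled
        source = min(candidates, key=lambda s: D[r][s])
        out[r] = list(out[source])
        labeled.add(r)
    return out

coords = [0, 5, 1, 6, 2]
D = [[abs(a - b) for b in coords] for a in coords]
leader = [-1, 0, 0, 1, 2]
borrowed = r2r([0, 1, 3], leader, D, {0: [1, 0], 1: [0, 1], 3: [0, 0]}, power=[5, 3, 0, 1, 0])
```

Here root 3 has two labeled superiors (roots 1 and 0) and borrows from root 1 because it is nearer.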

III-A3 From parent to children

After the previous two stages, all root nodes of the subtrees are labeled. In this P2C propagation, the labels are propagated in a top-down fashion, i.e., the labels are sequentially propagated from the top layer to the bottom layer, and this process can be parallelized on the independent subtrees.

We need to consider two situations: a) for a parent $x_p$, all $m$ children $x_i$, $1 \le i \le m$, are unlabeled. Here, we simply assign $L_i = L_p$, because this assignment directly satisfies (4) no matter what value each $W_i$ takes. b) for a parent $x_p$, without loss of generality, assume the first $l$ children are labeled and the other $m - l$ children are unlabeled. In this situation, we generate a virtual parent $x_{p'}$ to replace the original $x_p$ and the $l$ labeled children. Using (4), we have

$$L_{p'} = \frac{\sum_{i=1}^{m} W_i \cdot L_p - \sum_{j=1}^{l} W_j \cdot L_j}{\sum_{k=l+1}^{m} W_k} \tag{7}$$
Then, the unlabeled children can be assigned with the label like in the first situation. The concept of virtual parent is illustrated in Fig. 3.
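The virtual-parent step for a single parent can be sketched as below. The sketch assumes the reading that the parent's label is the weighted average of all its children's labels, so the common label of the unlabeled children is whatever remains after subtracting the labeled children's weighted contribution; the function name `p2c_children` and the zero-vector convention for unlabeled children are illustrative.

```python
def p2c_children(parent_label, weights, child_labels):
    """Assign labels to the unlabeled children of one parent (illustrative sketch).

    child_labels uses a zero vector for an unlabeled child. Every unlabeled
    child receives the label of the virtual parent.
    """
    unlabeled = [i for i, lab in enumerate(child_labels) if not any(lab)]
    if not unlabeled:
        return [list(lab) for lab in child_labels]
    total_w = sum(weights)
    unlabeled_w = sum(weights[i] for i in unlabeled)
    virtual = []
    for k, lp in enumerate(parent_label):
        known = sum(weights[i] * child_labels[i][k]
                    for i in range(len(weights)) if i not in unlabeled)
        # Weighted mass the unlabeled children must supply, renormalized.
        virtual.append((total_w * lp - known) / unlabeled_w)
    return [list(virtual) if i in unlabeled else list(child_labels[i])
            for i in range(len(weights))]

# A parent with 4 equally weighted children; the first two are labeled.
new_labels = p2c_children([0.75, 0.25], [1, 1, 1, 1],
                          [[1, 0], [1, 0], [0, 0], [0, 0]])
```

When every child is unlabeled the subtracted term vanishes and the virtual parent coincides with the parent, which is situation a). In the example the two unlabeled children receive [0.5, 0.5], and averaging all four children then reproduces the parent's [0.75, 0.25].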

Fig. 3: Illustration of the virtual parent idea in the P2C propagation stage. (a) A parent and 4 children (the first two are labeled, and the last two are unlabeled). (b) To compute the labels for the two unlabeled children, the labeled nodes (the parent and the two labeled children) are replaced by a virtual parent.

III-B LaPOLeaF algorithm

We present the overall algorithm of LaPOLeaF here, including some basic information about OLF construction.

Algorithm 1: LaPOLeaF algorithm
Input: Dataset $\mathcal{X}$ (labeled and unlabeled points)
Output: Labels for the unlabeled points of $\mathcal{X}$
Part 1: Preparing the OLF
  Compute the distance matrix for $\mathcal{X}$;
  Compute the local density $\rho$ (Definition 1);
  Compute the leading nodes LN and the $\delta$-distance (Definition 2);
  Compute the representation power $\gamma$ (Definition 6);
  Split the LT into the OLF using objective function (3); return the roots and the node set of each subtree;
  Build the adjacency list for each subtree;
Part 2: Label propagation on the OLF
  Decide the layer number for each node using a hierarchical traversal approach (see Appendix);
  C2P propagation using (4);
  R2R propagation using (5) and (6);
  P2C propagation using (7);
Return the labels for the unlabeled points of $\mathcal{X}$

III-C An example on the double-moon dataset

We generate a double-moon dataset of 600 data points, 300 in each moon, to illustrate the main stages of LaPOLeaF and help readers build an intuitive impression of the method. 5 labeled points are randomly selected in each moon. In the first step, the OLF was constructed using the steps described in Part 1 of Algorithm 1 (Fig. 4a). The root of each subtree is marked with a yellow face and a red edge. It is easily observed that the edges appearing in the OLF are much sparser than those in other GSSL methods based on nearest neighbors [2] [5].

In the C2P propagation (Fig. 4b), the nodes in the subtrees are firstly tagged by layer index. The sub-tree with greatest height has 14 layers. After the bottom-up label propagation, the root of each labeled subtree becomes labeled. And other nodes on the path from the initial labeled node to its corresponding root are labeled as well. There are 44 nodes labeled now. Yet the unlabeled subtrees remain unchanged.

Fig. 4c shows that in the R2R propagation stage, each unlabeled root borrowed the label from its nearest neighboring root with higher density. The green arrows show the label borrowing information, with the arrow head indicating the label owner.

The P2C propagation can be fully parallelized because all the roots in OLF are labeled and the propagation within each subtree is independent from others. Using the discussion in Section III-A3, all the unlabeled non-root nodes are labeled (as in Fig. 4d).

Fig. 4: An illustrative example of LaPOLeaF on the double-moon dataset. (a) The OLF constructed from the dataset. (b) C2P propagation. (c) R2R propagation. The green arrows indicate the borrower and the owner when an unlabeled root borrows a label from another root. All the roots are labeled after this stage. (d) P2C propagation. The color saturation reflects the value of the maximal element in a label vector. The closer this value is to 1, the higher the saturation of the color.

III-D Deriving the label for a new datum

A salient advantage of LaPOLeaF is that it can obtain the label for a new datum (let us denote this task as LXNew) in $O(n)$ time. This is because (a) the leading tree structure can be incrementally updated in $O(n)$ time and the LoDOG algorithm can find $N_g^{opt}$ in $O(n)$ time, so the OLF can be updated in $O(n)$ time; and (b) the label propagation on the OLF takes $O(n)$ time.

The interested reader can refer to our previous work [7], in which we provided a detailed description of the algorithm for incrementally updating the fat node leading tree, together with a proof of the correctness of the algorithm.

IV Scalability of LaPOLeaF

To scale the model LaPOLeaF for big data context, we propose two approaches. One uses parallel computing platform and divide-and-conquer strategy to obtain an exact solution, and the other is an approximate approach based on Locality-Sensitive Hashing (LSH).

IV-A Divide and conquer approach

The divide-and-conquer strategy has to address three aspects of the problem. (a) Computing the distance matrix has $O(n^2)$ time complexity. (b) The computation of $\rho_i$ and $\delta_i$ needs access to a whole row of elements in the distance matrix, so computing $\rho$ and $\delta$ for all data also has a complexity of $O(n^2)$. (c) The distances between the centers should be prepared in advance for the R2R propagation stage, since the memory of a single computer cannot accommodate the whole distance matrix of a large dataset, and the distances between centers cannot be retrieved directly from the whole distance matrix. Apart from these three parts, the other steps in LaPOLeaF are all linear in $n$ and can usually run on a single machine.

IV-A1 Compute the distance matrix in parallel

The distance matrix puts a considerable burden on both the computation time and the memory capacity. As an example, we have computed that the memory required for the distance matrix of 100,000 samples is over 37GB, even when each distance is stored as a 4-byte float.
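The figure quoted above can be checked with a few lines of arithmetic. The helper name `distance_matrix_gib` is illustrative, and the result is expressed in GiB (2^30 bytes), which is the unit under which the number matches.

```python
def distance_matrix_gib(n_samples, bytes_per_entry=4):
    """Memory needed for a full n x n distance matrix, in GiB (2**30 bytes)."""
    return n_samples * n_samples * bytes_per_entry / 2 ** 30

full = distance_matrix_gib(100_000)   # about 37.25 GiB for 100,000 samples
print(round(full, 2))
```

The same helper shows why splitting into blocks helps: the requirement grows quadratically with the number of samples, so halving the rows held on one machine halves that machine's share.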

Here, we propose a divide-and-conquer method for exactly (not approximately) computing large distance matrix, whose idea is illustrated in Fig. 5.

Fig. 5: (a) If the whole dataset is divided into two subsets, then the distances between all pairs of points can be computed in three parts. The first two parts correspond to the full connections within the two subgraphs respectively (green and red curves), and the third part corresponds to the full connections within the complete bipartite graph (black lines). (b) The number of subsets is generalized from 2 to a larger number. Note that although we try to balance the size of each subset, the subset sizes are not necessarily all equal. Therefore, while a diagonal block is always a square matrix, an off-diagonal block may NOT be a square matrix.

Although computing the distance matrix is of $O(n^2)$ complexity, the positive message is that the mainstream CPU manufacturers (such as Intel and AMD) and scientific computing software (such as Matlab and R) have made great efforts to accelerate matrix operations. For the $\ell_2$-norm distance metric, instead of computing the distance between the objects pair by pair, we formulate the distances for the full connections between the two parts of a bipartite graph as in Theorem 2. For small datasets of 1,000 instances and 8 attributes on Matlab 2014a, matrix computing runs about 20 times faster than pairwise distance computing.

Theorem 2.

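The Euclidean distance matrix for the full connections within a complete bipartite graph is given by the element-wise square root of $D^2$. $D^2$ is computed via

$$D^2 = P^{.2}\,\mathbf{1}_{d \times n_2} + \mathbf{1}_{n_1 \times d}\,\left(Q^{.2}\right)^{\top} - 2\,P\,Q^{\top}$$

where $P$ and $Q$ are the matrices formed with the data points (as row vectors) in each part of the bipartite graph, whose sizes are $n_1 \times d$ and $n_2 \times d$, respectively. $P^{.2}$ is the element-wise square of $P$.

Proof. Considering an element $D^2_{i,j}$, we can write

$$D^2_{i,j} = \sum_{k=1}^{d} p_{i,k}^2 + \sum_{k=1}^{d} q_{j,k}^2 - 2\sum_{k=1}^{d} p_{i,k}\,q_{j,k} = \sum_{k=1}^{d} \left(p_{i,k} - q_{j,k}\right)^2 = \left\| p_i - q_j \right\|_2^2$$

Thus, the theorem is proven. ∎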
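The identity behind Theorem 2 can be checked numerically with a small sketch. It is written in plain Python for self-containment, so it shows the decomposition (squared norms plus a cross term) rather than the speed-up that an optimized matrix library would give; the function name `bipartite_distances` is illustrative.

```python
import math

def bipartite_distances(P, Q):
    """Euclidean distances between every row of P and every row of Q.

    Uses ||p - q||^2 = ||p||^2 + ||q||^2 - 2 p.q, i.e. the three terms of the
    theorem, instead of subtracting the vectors pair by pair.
    """
    p_sq = [sum(v * v for v in row) for row in P]   # row sums of P.^2
    q_sq = [sum(v * v for v in row) for row in Q]   # row sums of Q.^2
    result = []
    for i, p in enumerate(P):
        row = []
        for j, q in enumerate(Q):
            cross = sum(a * b for a, b in zip(p, q))        # entry of P Q^T
            squared = p_sq[i] + q_sq[j] - 2 * cross
            row.append(math.sqrt(max(squared, 0.0)))        # guard against tiny negatives
        result.append(row)
    return result

P = [[0, 0], [3, 4]]
Q = [[0, 0], [6, 8], [3, 0]]
D = bipartite_distances(P, Q)
```

The two parts need not have the same number of points, so the result is in general a rectangular block, as in the off-diagonal blocks of Fig. 5.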

Ideally, the distance matrix of arbitrarily-sized data can be computed in this way provided there are enough computers. However, if the computers are not adequate for the dataset at hand, one can turn to the second approach LSH.

IV-A2 Computing $\rho$, $\delta$ and nneigh in parallel

Because of the additive characteristic of the local density $\rho$, the whole $\rho$ vector can be computed in a fully parallel fashion on each computing node, when the whole distance matrix is split into belts that are stored separately on different computing nodes. Suppose the distance matrix of the samples is divided into blocks, then we have


where, is the local density vector of elements w.r.t. distance matrix block , and is the element in the th distance matrix block.

Unlike on a single computer, where $\delta$ can be computed with the guidance of the sorted $\rho$, computing each $\delta_i$ in parallel has to access the whole $\rho$ vector and all $d_{i,j}$ for $j$ with $\rho_j > \rho_i$.

IV-A3 Prepare the distance matrix for centers

The R2R propagation stage needs to access the distances between any pair of centers (roots of the subtrees in the OLF). If the distance matrix is stored on a centralized computer, these distances can be extracted directly from the whole distance matrix. However, when the distance matrix is stored in a distributed system, it is divided into blocks. Each block is stored on a different computing node and covers its own index range of instances. Usually the blocks have the same number of rows, except for the last matrix block.

Therefore, to extract the center distances from the distributed blocks, one first sorts the centers by index in ascending order, and then reads the distance entry between each pair of centers from the block whose index range contains the first center of the pair.


By sorting the centers, each distance matrix needs to be accessed only once to get .
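The extraction can be sketched as follows, assuming that each block is a horizontal belt holding the complete rows of a contiguous index range, and that `starts` lists the first global index of every belt. Both the layout and the names `center_distances`, `blocks` and `starts` are assumptions of this sketch.

```python
from bisect import bisect_right

def center_distances(blocks, starts, centers):
    """Collect the center-to-center distances from row-belt blocks.

    blocks[b] holds the full rows of the instances starts[b], starts[b] + 1, ...
    Returns the sorted center indices and their pairwise distance matrix.
    """
    centers = sorted(centers)
    rows = []
    for c in centers:
        b = bisect_right(starts, c) - 1     # belt whose index range contains c
        row = blocks[b][c - starts[b]]      # local row offset inside the belt
        rows.append([row[j] for j in centers])
    return centers, rows

coords = [0, 1, 2, 10, 20]
D = [[abs(a - b) for b in coords] for a in coords]
blocks = [D[0:2], D[2:4], D[4:5]]           # three belts; the last one is smaller
order, Dc = center_distances(blocks, [0, 2, 4], centers=[4, 1, 3])
```

Because the centers are visited in ascending index order, the belts are touched one after another and none has to be revisited.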

IV-B Approximate approach with LSH

As mentioned above, if there are not enough computers to run exact LaPOLeaF on a given large dataset, then it is reasonable to merge closely neighboring data points into one bucket by employing LSH techniques [14, 12, 13]. The basic idea of LSH is that closely located neighbors have a high probability of sharing the same hash code (viz. colliding with each other), while far away data points are not likely to collide.

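For different distance metrics, we need different hash functions. For the $\ell_2$-norm, the hash function is [12]

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

where $a$ is a random vector and $b$ is a random real number sampled uniformly from the interval $[0, r]$. For angular similarity, the hash function could be [14]

$$h_{w}(v) = \mathrm{sgn}\left(w^{\top} v\right)$$

where $w$ is a random vector, and $\mathrm{sgn}(\cdot)$ is the sign function.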
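The two hash families can be sketched in a few lines. This is an illustrative sketch, not the implementation used in the experiments: the random vectors are drawn with Gaussian entries, several hash values are concatenated into one bucket key, and the names `make_l2_hash`, `make_sign_hash` and `bucketize` are hypothetical.

```python
import math
import random

def make_l2_hash(dim, r, rng):
    """Hash for the l2-norm: floor((a . v + b) / r) with random a and b in [0, r]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)
    return lambda v: math.floor((sum(x * y for x, y in zip(a, v)) + b) / r)

def make_sign_hash(dim, rng):
    """Hash for angular similarity: one bit from the sign of a random projection."""
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return lambda v: 1 if sum(x * y for x, y in zip(w, v)) >= 0 else 0

def bucketize(points, hashes):
    """Group point indices by their tuple of hash values."""
    buckets = {}
    for index, point in enumerate(points):
        key = tuple(h(point) for h in hashes)
        buckets.setdefault(key, []).append(index)
    return buckets

rng = random.Random(0)
hashes = [make_l2_hash(2, 4.0, rng) for _ in range(3)]
points = [[1.0, 2.0], [1.0, 2.0], [100.0, -50.0]]
buckets = bucketize(points, hashes)
populations = {key: len(members) for key, members in buckets.items()}
```

The bucket sizes in `populations` are what a fat-node version of the method would use as the population of each merged node.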

Ji improved the work in [14] by introducing the Gram-Schmidt orthogonalization process to the random vector group to form a representing unit named as “super-bit”.

After running the LSH algorithm on the original dataset, the instances are put into many buckets. Then each bucket is treated as a fat node by LaPOLeaF, and the number of data points lying in the $i$-th bucket is the population term $pop_i$ in (4).

V Time complexity and relationship to related works

V-A Complexity analysis

By investigating each step in Algorithm 1, we find that, except for the calculation of the distance matrix, which requires $O(n^2)$ basic operations, all other steps in LaPOLeaF have time complexity linear in the size of $\mathcal{X}$. When compared to LLGC [5], FLP [2], AGR [1], and HAGR [3], LaPOLeaF is much more efficient, as listed in Table I. The symbols in Table I denote the size of $\mathcal{X}$, the number of iterations, the number of classes, and the number of points on each layer. The empirical evaluation in Section VI verified this analysis.

Please note that although we write $O(n^3)$ for the straightforward computation of the matrix inverse, the complexity could be reduced further [15][16].

Methods Graph construction Label propagation
TABLE I: Complexity comparison.

It is also worthwhile to compare the efficiency of LXNew. For LaPOLeaF, LXNew only requires time linear in the size of the existing OLF. However, for the traditional GSSL methods, LXNew requires as much running time as labeling all the data points in $\mathcal{X}$.

V-B Relationship discussion

The label propagation in LaPOLeaF is a heuristic algorithm that lacks an optimization objective. Hence it offers no mathematical guarantee of achieving the best solution in this stage. However, we argue that the optimization has been moved forward to the OLF construction stage. Since we have obtained an optimal partially ordered structure of the whole dataset, we believe that an iterative optimization which regards the data as being in a peer-to-peer relation is no longer compulsory. This is the difference between LaPOLeaF and other GSSL methods.

Meanwhile, LaPOLeaF can be regarded as an improved version of $K$-NN. In $K$-NN, the $K$ nearest neighbors are considered as a spherical-shaped information granule, and the unlabeled data are assigned labels with a voting strategy. The parameter $K$ is set by the user, and the result is quite sensitive to the choice of $K$. By contrast, in LaPOLeaF, the information granules are arbitrarily shaped leading trees, and the size of each tree is automatically decided by the data and LaPOLeaF, so the sizes usually differ. Because the OLF better captures the nature of the data distribution, and the label propagation is reasonably designed, LaPOLeaF consistently outperforms $K$-NN.

VI Experimental studies

The efficiency and effectiveness of LaPOLeaF are evaluated on five real-world datasets, among which three are small datasets from the UCI machine learning repository and the other two are of larger scale. The information about the datasets is shown in Table II. The 3 small datasets are used to demonstrate the effectiveness of LaPOLeaF, and the other two datasets are used to show the scalability of LaPOLeaF through parallel computing and Locality-Sensitive Hashing (LSH).

The experiment for the small datasets and Activity data is conducted on a personal computer with an Intel i5-2430M CPU and 16GB DDR3 memory. MNIST data is learned both on the PC and a Spark cluster of eight work stations.

Dataset     # Instances    # Attributes    # Classes
Iris        150            4               3
Wine        178            13              3
Yeast       1,484          8               8
MNIST       70,000         784             10
Activity    43,930,257     16              6
TABLE II: Information of the datasets in the experiments

VI-A UCI small datasets

With the 3 small UCI datasets, namely Iris, Wine, and Yeast, it is shown that LaPOLeaF achieves competitive accuracy with much higher efficiency when compared with the classical semi-supervised learning methods Linear Discriminant Analysis (LDA) [17], Neighborhood Component Analysis (NCA) [18], Semi-supervised Discriminant Analysis (SDA) [19], and Framework of Learning by Propagability (FLP) [2]. The parameter configuration and some experimental details, such as the chosen distance metric and the preprocessing method, are listed for all 5 datasets in Table III.

Dataset     percent    $\alpha$    $N_g$    # labeled    Preprocessing    Distance
Iris        2          0.25        8        6            Z-score          Euclidean
Wine        2          0.4         8        6            NCA DR*          Euclidean
Yeast       5          0.1         7        16           Z-score          Cosine
MNIST       10         0.3         405      100          none             Euclidean
Activity    8          0.3         347      16           see Fig. 7       Cosine
*: DR here is the abbreviation of dimensionality reduction.
TABLE III: Parameters configuration for the 5 datasets

The accuracies for the competing models on the 3 datasets are shown in Table IV, from which one can see that LaPOLeaF achieves the best accuracy twice and the accuracy is comparable even when the other method FLP wins on the dataset Wine.

Method      Iris           Wine          Yeast
LDA         66.91±25.29    62.05         19.00±8.70
NCA         92.28±3.24     83.10±9.70    32.76±6.32
SDA         89.41±5.40     90.89±5.39    37.00±6.89
FLP         93.45±3.09     93.13±3.32    40.03±5.40
LaPOLeaF    94.86±4.57     90.68±5.32    42.28±2.36
TABLE IV: Accuracy comparison on the 3 small datasets.

The main purpose of LaPOLeaF is not to improve the accuracy of GSSL, but to improve its efficiency by getting rid of the paradigm of iterative optimization to achieve the minimum value of an objective function. LaPOLeaF exhibits very high efficiency, for example, it completes the whole SSL process within 0.27 seconds for Iris dataset, on the mentioned personal computer.

VI-B MNIST dataset

The MNIST dataset contains 70,000 handwritten digit images in total, about 7,000 for each digit (’0’-’9’). To emphasize the effectiveness of LaPOLeaF itself, we directly use the original pixel data as the learning features, as in [20]. Since the distance matrix is oversized, we applied the divide-and-conquer technique described in Section IV-A. The whole dataset is equally divided into 7 subsets, so the size of each distance matrix block is 10,000 × 70,000.

After computing the distance matrix blocks and the vector group {$\rho$, $\delta$, nneigh}, the OLF of $\mathcal{X}$ can be constructed by running the LoDOG algorithm on a single machine. The parameters and some intermediate results for the two datasets MNIST and Activity are detailed in the last two rows of Table III. The objective function values for choosing $N_g$ with the MNIST data are shown in Fig. 6.

Fig. 6: Objective function values vs. the number of information granules in the LoDOG method for MNIST data.

Ten labeled samples are randomly chosen from each digit, and the accuracies achieved by LaPOLeaF and the state-of-the-art method Hierarchical Anchor Graph Regularization (HAGR) [20] are listed in Table V.

Accuracy 79.17±1.39 88.66±1.23 84.92±2.35
TABLE V: Accuracies by HAGR and LaPOLeaF on MNIST

One can see that LaPOLeaF achieves competitive accuracy on the MNIST data. However, the highlight of LaPOLeaF is its efficiency. LaPOLeaF completes the whole learning process, including the OLF construction and the three stages of label propagation, in about 48 minutes on the personal computer. The time consumption of each stage is detailed in Table VI.

Stage LoDOG LP
Time(s) 952 1483 436 15 3
Note: LP here is the abbreviation of label propagation.
TABLE VI: Running time of each stage on MNIST

VI-C Activity dataset

The Activity dataset comes from the domain of Human Activity Recognition (HAR) [21]. It includes monitoring data sampled by the accelerometers and gyroscopes built into smart phones and smart watches. Since there are many different models of phones and watches, whose sampling frequencies and accuracies differ, the collected data are heterogeneous. The dataset contains 43,930,257 observations with 16 attributes each.

VI-C1 Preprocessing

Because the raw data is separated into 4 comma-separated values (.csv) files, we first perform the preprocessing shown in Fig. 7.

Fig. 7: Preprocessing flow chart of the Activity data.

i) The records from the four .csv files are aligned and merged into one file. We use the subject ID and equipment ID to align the data from different files, and the differences in sampling frequency are handled by interpolation.

ii) Since the empirical cumulative distribution function (ECDF) feature has been reported to outperform FFT and PCA in the HAR task [22], we compute the ECDF feature from the original data and use it in the subsequent learning. Conventionally, the time frame is set to 1 second and the overlapping ratio to 50%. Since the major sampling frequency is 200 Hz, we include 200 observations in one frame to compute the ECDF, and move forward 100 observations each time one row of ECDF features has been computed. In this way, the time granularity of activity recognition is half a second. The segment number is set to 5, so the resulting dimensionality of the feature is 6 × 5 = 30. ECDF reduces the size of the Activity data to 439,302 rows (about 1% of the original).

iii) Use min-max normalization to normalize every column of the data.

iv) Because of the large number of samples and features, we employ LSH (specifically, SB-LSH [13]) to tackle the two problems at the same time. For SB-LSH, we empirically set the depth of a super-bit to 5 and the number of super-bits to 6. LSH further reduces the number of data rows by merging collided samples into the same bucket. For example, for the subset of subject_a and smart phone model Nexus4_1, the 3,237 ECDF feature rows are reduced to 771 hash buckets. The number of samples contained in each hash bucket is used as a weight in the OLF construction.
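Steps ii) and iii) can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the ECDF feature of a frame is the inverse ECDF (quantiles) of each channel sampled at `n_segments` points, and the choice of mid-bin probabilities is an assumption, as are the function names. With 6 channels and 5 segments, each output row has 6 × 5 = 30 values.

```python
import numpy as np

def ecdf_features(signal, frame_len=200, step=100, n_segments=5):
    """Sliding-window ECDF features: one row per frame,
    n_segments quantiles per channel."""
    signal = np.asarray(signal, dtype=float)
    n_rows, n_channels = signal.shape
    # Assumption: sample the inverse ECDF at the midpoints of
    # n_segments equal-probability bins.
    probs = (np.arange(n_segments) + 0.5) / n_segments
    rows = []
    for start in range(0, n_rows - frame_len + 1, step):
        frame = signal[start:start + frame_len]
        q = np.quantile(frame, probs, axis=0)  # (n_segments, n_channels)
        rows.append(q.T.ravel())               # channel-major layout
    return np.array(rows).reshape(len(rows), n_channels * n_segments)

def min_max_normalize(features):
    """Scale every column to [0, 1]; constant columns map to 0."""
    features = np.asarray(features, dtype=float)
    lo = features.min(axis=0)
    span = features.max(axis=0) - lo
    span[span == 0] = 1.0  # avoid division by zero
    return (features - lo) / span
```

With `frame_len=200` and `step=100`, consecutive frames overlap by 50%, so roughly one feature row is produced per 100 raw observations, which matches the reduction to about 1% of the original row count.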

VI-C2 LaPOLeaF for Activity data

After the preprocessing stage, the Activity data has been transformed into 84,218 rows of categorical features. Each row consists of a sequence of hash bucket numbers, and the weight of each row is the number of ECDF rows that share the same hash bucket number sequence. The distance matrix is computed from the ECDF features in the bucket sequence rather than from the hash codes themselves, and cosine distance is used. The OLF parameters are set as listed in the last row of Table III. With this configuration, the OLF contains 347 subtrees.
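The bucket merging and weighting can be sketched as below. This is only an approximation of the idea, not SB-LSH as implemented in [13] or by the authors: it assumes sign-random-projection hashing in which the random directions are orthonormalized within each group of `depth` vectors, and it treats rows with identical bit codes as one bucket. The function names and the use of QR factorization for the orthonormalization are assumptions.

```python
import numpy as np

def super_bit_projections(dim, depth=5, n_super_bits=6, seed=0):
    """Random directions, orthonormalized within each group of `depth`."""
    if depth > dim:
        raise ValueError("depth must not exceed the feature dimension")
    rng = np.random.default_rng(seed)
    groups = []
    for _ in range(n_super_bits):
        q, _ = np.linalg.qr(rng.normal(size=(dim, depth)))  # (dim, depth)
        groups.append(q)
    return np.hstack(groups)  # (dim, depth * n_super_bits)

def merge_into_buckets(features, projections):
    """Return (unique hash codes, bucket index of each row, bucket weights)."""
    X = np.asarray(features, dtype=float)
    bits = (X @ projections >= 0).astype(np.uint8)  # one bit per direction
    codes, inverse, counts = np.unique(
        bits, axis=0, return_inverse=True, return_counts=True)
    return codes, inverse.ravel(), counts
```

Here `counts[k]` is the number of feature rows that fall into bucket `k`, which plays the role of the per-row weight described above.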

With the constructed OLF, we run the C2P, R2R, and P2C propagation stages sequentially, taking 120 randomly selected ECDF feature rows (20 per class) as the labeled data. The final accuracy achieved by LaPOLeaF is 86.27±3.36.

VII Conclusions

The existing GSSL methods have two weaknesses. One is low efficiency due to the iterative optimization process, and the other is the inconvenience of predicting the labels of newly arrived data. This paper first makes a sound assumption that neighboring data points are not in equal positions, but lie in a partial-ordered relation, and that the label of a center can be regarded as the contribution of its followers. Based on this assumption and our previous work named LoDOG, a new non-iterative semi-supervised approach called LaPOLeaF is proposed. LaPOLeaF exhibits two salient advantages: a) It has much higher efficiency than the state-of-the-art models while keeping the accuracy comparable. b) It can deliver the labels for a few newly arrived data with a time complexity determined by the number of old data forming the optimal leading forest (OLF). To enable LaPOLeaF to accommodate big data, we proposed an exact divide-and-conquer approach and an approximate locality-sensitive hashing (LSH) method. Theoretical analysis and empirical validation have shown the effectiveness and efficiency of LaPOLeaF. We plan to extend LaPOLeaF in two directions: one is to apply it to real-world big data mining problems with tight time constraints, and the other is to improve the accuracy while keeping the high efficiency unchanged.

We provide the algorithm for deciding the layer index of each node in a tree, using a queue data structure.

Input: The root T and the adjacency list AL of the tree.
Output: The layer indices for the nodes, LayerInd[n].
Initialize an empty queue theQue;
EnQue(theQue, T);
LayerInd[T] = 1;
while !IsEmpty(theQue) do
      QueHead = DeQue(theQue);
      for each child c in AL[QueHead] do
            EnQue(theQue, c);
            LayerInd[c] = LayerInd[QueHead] + 1;
      end for
end while
Return LayerInd[];
Algorithm 2 Decide the layer index of nodes in a tree.

The time complexity of Algorithm 2 is O(n), where n is the number of nodes in the tree, because each node is enqueued and dequeued exactly once by EnQue() and DeQue().
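Algorithm 2 can also be written as a short runnable sketch. It assumes the adjacency list is a dictionary mapping each node to a list of its children (a hypothetical representation; leaves may be absent from the dictionary) and visits every child of the dequeued node; the function name is illustrative.

```python
from collections import deque

def layer_indices(root, children):
    """Breadth-first traversal: the root gets layer 1 and every
    child gets its parent's layer plus 1."""
    layer = {root: 1}
    queue = deque([root])
    while queue:
        head = queue.popleft()
        for child in children.get(head, ()):
            layer[child] = layer[head] + 1
            queue.append(child)
    return layer
```

For a tree with edges 0→1, 0→2, 1→3, and 3→4, calling `layer_indices(0, {0: [1, 2], 1: [3], 3: [4]})` assigns layer 1 to node 0, layer 2 to nodes 1 and 2, layer 3 to node 3, and layer 4 to node 4.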


This work has been supported by the National Key Research and Development Program of China under grants 2016QY01W0200 and 2016YFB1000905, and by the National Natural Science Foundation of China under grant 61572091.


  • [1] W. Liu, J. He, and S.-F. Chang, “Large graph construction for scalable semi-supervised learning,” in Proceedings of the 27th international conference on machine learning (ICML-10), pp. 679–686, 2010.
  • [2] B. Ni, S. Yan, and A. Kassim, “Learning a propagable graph for semisupervised learning: Classification and regression,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 1, pp. 114–126, 2012.
  • [3] M. Wang, W. Fu, S. Hao, H. Liu, and X. Wu, “Learning on big graph: Label inference and regularization with anchor hierarchy,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 5, pp. 1101–1114, 2017.
  • [4] J. Xu, G. Wang, and W. Deng, “DenPEHC: Density peak based efficient hierarchical clustering,” Information Sciences, vol. 373, pp. 200–218, 2016.
  • [5] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, “Learning with local and global consistency,” in Advances in neural information processing systems, pp. 321–328, 2004.
  • [6] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
  • [7] J. Xu, G. Wang, T. Li, W. Deng, and G. Gou, “Fat node leading tree for data stream clustering with density peaks,” Knowledge-Based Systems, vol. 120, pp. 99–117, 2017.
  • [8] W. Pedrycz and W. Homenda, “Building the fundamentals of granular computing: A principle of justifiable granularity,” Applied Soft Computing, vol. 13, no. 10, pp. 4209–4218, 2013.
  • [9] W. Pedrycz, G. Succi, A. Sillitti, and J. Iljazi, “Data description: A general framework of information granules,” Knowledge-Based Systems, vol. 80, pp. 98–108, 2015.
  • [10] X. Zhu, W. Pedrycz, and Z. Li, “Granular data description: Designing ellipsoidal information granules,” IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2016.2612226, 2016.
  • [11] J. Xu, G. Wang, T. Li, and W. Pedrycz, “Local density-based optimal granulation and manifold information granule description,” IEEE Transactions on Cybernetics, DOI: 10.1109/TCYB.2017.2750481, 2017.
  • [12] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proceedings of the twentieth annual symposium on Computational geometry, pp. 253–262, ACM, 2004.
  • [13] J. Ji, J. Li, S. Yan, B. Zhang, and Q. Tian, “Super-bit locality-sensitive hashing,” in Advances in Neural Information Processing Systems, pp. 108–116, 2012.
  • [14] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 380–388, ACM, May 2002.
  • [15] V. V. Williams, “Breaking the Coppersmith-Winograd barrier,” 2011.
  • [16] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition. MIT Press, 2009.
  • [17] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, pp. 711–720, Jul 1997.
  • [18] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov, “Neighborhood component analysis,” Advances in Neural Information Processing Systems, pp. 513–520, 2004.
  • [19] D. Cai, X. He, and J. Han, “Semi-supervised discriminant analysis,” in 2007 IEEE 11th International Conference on Computer Vision, pp. 1–7, Oct 2007.
  • [20] M. Wang, W. Fu, S. Hao, H. Liu, and X. Wu, “Learning on big graph: Label inference and regularization with anchor hierarchy,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, pp. 1101–1114, May 2017.
  • [21] A. Stisen, H. Blunck, S. Bhattacharya, T. S. Prentow, M. B. Kjærgaard, A. Dey, T. Sonne, and M. M. Jensen, “Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition,” in Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems, pp. 127–140, ACM, 2015.
  • [22] N. Y. Hammerla, R. Kirkham, P. Andras, and T. Ploetz, “On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution,” in Proceedings of the 2013 International Symposium on Wearable Computers, pp. 65–68, ACM, 2013.