1 Introduction
The main idea behind clustering is to group datapoints into clusters based on their features, i.e. their properties. How clusters are generated varies by application (Xu and Wunsch, 2005), because it depends on which factors are taken into consideration when forming a cluster. The goal of every clustering algorithm nevertheless remains the same: to place similar datapoints in a common cluster. What differs is how this goal is achieved, since different algorithms use different concepts to measure similarity among datapoints. There are many popular clustering algorithms (Hartigan, 1975) that group datapoints using various strategies to define the similarity measure between them. Centroid-based, connectivity-based, density-based, and graph-based strategies are among the most commonly used.
Commonly used algorithms such as k-means, hierarchical clustering, and DBSCAN require the full set of datapoints to be available beforehand. Without fine adjustments, these algorithms cannot cluster datapoints in real-time. Hence, a real-world problem arises whenever datapoints arriving at a system must be grouped in real-time. Moreover, an algorithm like k-means requires the number of clusters to be specified in advance and cannot identify clusters of arbitrary shape.
Taking these limitations into consideration, this paper proposes a different kind of machine learning algorithm that can group incoming datapoints without the number of clusters having to be specified. The main motivation for proposing this algorithm is to address problems that require datapoints to be clustered based on some level-of-similarity and, most importantly, in real-time. Without the number of clusters having to be specified, the algorithm is capable of:

grouping incoming datapoints in real-time,

dealing with n features (high dimensionality),

generating clusters based on a level-of-similarity (a similarity measure),

identifying clusters of arbitrary shape.
There have been a few approaches to generating clusters of data in real-time. D-Stream (Chen and Tu, 2007) used a density-based approach to generate and adjust clusters in real-time. Aggarwal et al. (2003a) divided the clustering process into two components, online and offline: the online component periodically stored detailed summary statistics, while the offline component allowed an analyst to use a variety of inputs, including the time horizon and the number of clusters to be formed. CluStream (Aggarwal et al., 2003b) clustered large evolving data streams by viewing the stream as a process that changes over time. Guha et al. (2000) maintained a clustering of a sequence of datapoints observed over a period of time.
The main focus of our work is to let an analyst decide the level-of-similarity as required and leave the rest to the algorithm. The rest of the paper is organized as follows. Section 2 describes the proposed algorithm, including its core logic and building blocks. Section 3 walks through an implementation of the algorithm, Section 4 concludes the work, and Section 5 outlines future directions.
2 The Proposed Algorithm
This algorithm requires cluster strictness to be declared initially. Cluster strictness is the lowest permitted level of similarity between a datapoint and the centroid of a cluster. Here, the cluster centroid is the average of the features of the datapoints present in a cluster, and it is an effective way of representing that cluster. If cluster strictness is set to 60, then the variability between a cluster centroid (C) and a datapoint (D) may be at most 40 if D is to be associated with cluster C.
Hence, this algorithm is capable of generating clusters based on a level-of-similarity. If one needs clusters with very low variance among the datapoints, the cluster strictness can be adjusted accordingly, perhaps to around 80-95%. Cluster strictness therefore plays a significant role in generating clusters, and its value depends entirely on the application for which the clustering is being done.
2.1 Datapoint & Features
A datapoint can have multiple features (n-features). These features are the characteristics that collectively define a datapoint. A stream of datapoints can be grouped into various clusters based on their features. Below is an example of a datapoint with 7 features.
A datapoint: 5 10 15 20 25 30 35
In the implementation of the proposed algorithm, this pattern is used to represent a datapoint, and the similarity measure between a datapoint and each cluster is calculated accordingly. The algorithm takes datapoints with n-features, one datapoint at a time, and requires all datapoints to have the same fixed number of features.
When the algorithm runs for the very first time and is waiting for a new datapoint, the number of clusters is 0. Cluster strictness needs to be defined beforehand, and its value can be adjusted based on requirements. The higher the value of cluster strictness, the lower the variance between the datapoints in a cluster. Depending on the datapoints, a higher value of cluster strictness may result in a larger number of clusters. This is more likely to happen if the incoming datapoints are less similar to each other. After assigning some value to cluster strictness, say 70%, the number of features that should be matched between a datapoint and a cluster is calculated using the formula,
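$$\text{minimum\_matched\_features} = \frac{\text{cluster\_strictness}}{100} \times n \qquad (1)$$
where n is the number of features of a datapoint.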
Suppose the datapoints that are to be clustered have 20 features. A cluster strictness of 70% then means that at least 14 features must be matched for a datapoint to be associated with a particular cluster.
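As a rough illustration of formula 1, the calculation can be sketched in Python. The function name `min_matched_features` is hypothetical, and rounding up when the product is not a whole number is an assumption; the text only shows cases that divide evenly.

```python
import math


def min_matched_features(cluster_strictness: float, n_features: int) -> int:
    """Minimum number of features that must match (formula 1).

    Rounding up for non-integer results is an assumption, not taken
    from the paper.
    """
    return math.ceil(cluster_strictness * n_features / 100)


print(min_matched_features(70, 20))  # 70% strictness, 20 features
print(min_matched_features(60, 10))  # 60% strictness, 10 features
```

With the two settings used in this paper, the sketch yields 14 and 6 respectively.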
2.2 Shape & Size of Clusters
It is not possible to illustrate datapoints with n-features (where n > 2) in a 2D plane. To better understand the size and shape of the clusters formed by the proposed algorithm, let us consider incoming datapoints with only one feature. A centroid with a small value has a smaller coverage area than a centroid with a larger value. Suppose there are two clusters with centroids 10 and 100. With cluster strictness set to 60, the cluster with centroid 10 accepts a new datapoint whose feature lies in the range 6-14, while the cluster with centroid 100 accepts a new datapoint whose feature lies in the range 60-140. This shows that cluster size varies with the value of the centroid.
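The one-feature coverage described above can be checked with a short sketch. The helper name `accepted_range` is hypothetical, and the sketch assumes a positive centroid value.

```python
def accepted_range(centroid: float, cluster_strictness: float) -> tuple[float, float]:
    """Range of feature values a one-feature cluster accepts.

    A value v is accepted when 100 * v / centroid lies between
    cluster_strictness and 200 - cluster_strictness.
    Assumes centroid > 0.
    """
    lower = centroid * cluster_strictness / 100
    upper = centroid * (200 - cluster_strictness) / 100
    return lower, upper


print(accepted_range(10, 60))   # cluster with centroid 10
print(accepted_range(100, 60))  # cluster with centroid 100
```

The width of the accepted range grows linearly with the centroid, which is why a cluster around 100 covers ten times the span of a cluster around 10.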
Whenever a new datapoint arrives, its similarity is checked against the centroids of all the clusters. Since the algorithm is centroid-based, adding a new datapoint to a cluster shifts the value of its centroid. The centroid of a cluster shifts in the direction where the datapoints are more densely concentrated. Hence, the algorithm is able to identify arbitrarily shaped clusters as well.
2.3 Qualifying Feature
A feature qualifies to be counted as a matched feature when its similarity measure lies in the range between cluster_strictness and (100 + (100 - cluster_strictness)). That is, the valid range is 70-130 when cluster strictness is 70. Suppose the variability of 5 and 13 is to be checked with respect to 9. Simple arithmetic suffices: 9 with respect to 9 gives a similarity measure of 100, 5 with respect to 9 gives a similarity measure of 55.56, and 13 with respect to 9 gives a similarity measure of 144.44. If the allowed variability is 50, then the similarity measures of both 5 and 13 fall within the range 50-150, and both are considered to be associated with 9 by at least a measure of 50.
For datapoints, the following formula is used to calculate the similarity measure.
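$$\text{similarity\_measure}(C_i, F_j) = \frac{D_j}{C_{i,j}} \times 100 \qquad (2)$$
where i is an index for clusters, j is an index for features, D_j is the j-th feature of the incoming datapoint, and C_{i,j} is the j-th feature of the centroid of cluster i.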
As an example, in formula 2, i=3 and j=7 mean that cluster 3 is being considered, and that the 7th feature of a datapoint and the 7th feature of the centroid of cluster 3 are being used to compute the level of similarity between them. If the resulting similarity measure lies in the range between cluster_strictness and (100 + (100 - cluster_strictness)), then the 7th feature is said to be matched, and the matched feature counter for cluster 3 is incremented by 1. This similarity calculation for an incoming datapoint is carried out against every cluster. If the matched feature counter for a cluster reaches the minimum number of features that should be matched between a datapoint and a cluster, that cluster is placed into the list of qualified clusters.
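The qualifying check and the matched feature counter described above can be sketched as follows. The function names are hypothetical, the boundaries of the range are treated as inclusive, and the sketch assumes that no centroid feature is zero, a case the text does not discuss.

```python
def similarity_measure(datapoint_feature: float, centroid_feature: float) -> float:
    """Formula 2: the datapoint feature as a percentage of the centroid feature.

    Assumes centroid_feature is non-zero.
    """
    return 100.0 * datapoint_feature / centroid_feature


def is_qualifying(similarity: float, cluster_strictness: float) -> bool:
    """True when the similarity lies in [strictness, 100 + (100 - strictness)]."""
    return cluster_strictness <= similarity <= 200 - cluster_strictness


def count_matched_features(datapoint, centroid, cluster_strictness: float) -> int:
    """Matched feature counter for one datapoint against one cluster centroid."""
    return sum(
        is_qualifying(similarity_measure(d, c), cluster_strictness)
        for d, c in zip(datapoint, centroid)
    )


# The scalar example from the text: 5 and 13 checked against 9.
print(round(similarity_measure(5, 9), 2), is_qualifying(similarity_measure(5, 9), 50))
print(round(similarity_measure(13, 9), 2), is_qualifying(similarity_measure(13, 9), 50))
```

A cluster would then enter the qualified list when `count_matched_features` is at least the minimum given by formula 1.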
2.4 Qualified Cluster List
This list contains the cluster(s) that satisfy the condition of having at least a minimum number of features that have matched with an incoming datapoint. After calculation of similarity measure for a datapoint with all the cluster(s), there might arise any one of the following three situations.
2.4.1 Qualified cluster list is empty
Qualified list being empty indicates that no similar cluster(s) were found within the provided level of similarity. Hence, a new cluster is generated and the datapoint is assigned to the newly formed cluster. The datapoint is the centroid of this cluster.
2.4.2 Qualified cluster list contains exactly 1 cluster
In this case, the datapoint is simply assigned to the cluster which is present in the list. And, the centroid of the cluster is recalculated.
2.4.3 Qualified cluster list contains more than 1 cluster
Case 1: If the list contains more than 1 cluster, the cluster with the maximum number of matched features is identified, and the datapoint is assigned to that cluster. Case 2: Sometimes two or more clusters come up with the same highest number of matched features. In this case, the cluster with the maximum average of qualifying similarity measures is identified, and the datapoint is assigned to that cluster.
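The three situations above reduce to a small selection rule, sketched below. The function name `choose_cluster` and the dictionary layout are hypothetical, and the numbers in the usage lines are made up for illustration. The text does not say what happens if the averages also tie; this sketch simply keeps the first such cluster, which is an assumption.

```python
from typing import Optional


def choose_cluster(qualified: dict[str, tuple[int, float]]) -> Optional[str]:
    """Pick a cluster from the qualified list.

    `qualified` maps a cluster name to (matched_features, average of
    qualifying similarity measures). Returns None when the list is empty,
    in which case the caller creates a new cluster.
    """
    if not qualified:
        return None
    # Tuples compare by matched features first, then by the average.
    return max(qualified, key=lambda name: qualified[name])


print(choose_cluster({}))                                  # empty list
print(choose_cluster({"A": (7, 80.0)}))                    # exactly one cluster
print(choose_cluster({"A": (9, 70.0), "B": (8, 95.0)}))    # more matched features wins
print(choose_cluster({"A": (8, 85.0), "B": (8, 92.5)}))    # tie broken by the average
```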
2.5 Conflicting Clusters
Whenever multiple clusters in the qualified cluster list share the highest number of matched features while the nearest similar cluster is being identified, those clusters are said to be conflicting. In such a case, the average of the qualifying similarity measures is calculated for every cluster in the qualified list that has that same number of matched features, and the cluster with the maximum average is taken to be the cluster with which the datapoint is associated.
When calculating the average, only the qualifying similarity measures are considered. It is also taken into account that 60 and 140 are equivalent with respect to 100, since both have a variability measure of 40. Hence, if a similarity measure lies within the valid range but is greater than 100, it is reflected to the other side of 100 using formula 3.
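The reflection around 100 and the resulting average can be sketched as below. The function names are hypothetical, and returning 0.0 when no similarity measure qualifies is an assumption made only to keep the sketch total.

```python
def fold_similarity(similarity: float) -> float:
    """Reflect a similarity above 100 to the other side of 100 (formula 3)."""
    return 100 - (similarity - 100) if similarity > 100 else similarity


def average_qualifying_similarity(similarities, cluster_strictness: float) -> float:
    """Average of the folded similarity measures that lie in the valid range."""
    qualifying = [
        fold_similarity(s)
        for s in similarities
        if cluster_strictness <= s <= 200 - cluster_strictness
    ]
    if not qualifying:
        return 0.0  # assumption: no qualifying feature gives an average of 0
    return sum(qualifying) / len(qualifying)


print(fold_similarity(140))  # treated the same as 60
print(fold_similarity(60))
```

Because 140 folds to 60, the two values contribute equally to the average, matching the observation that both have a variability measure of 40.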
3 Implementation of the Algorithm
To illustrate the implementation of the proposed algorithm, let us consider the following 10 datapoints, each with 10 features, which are to be clustered in real-time. The datapoints are input to the algorithm one after another.
Datapoint 1: 10 15 20 25 30 35 40 45 50 55
Datapoint 2: 9 35 18 45 10 32 60 41 10 20
Datapoint 3: 18 13 18 27 30 38 38 41 49 57
Datapoint 4: 20 20 18 5 15 34 50 43 10 50
Datapoint 5: 17 17 18 15 22 35 44 43 10 53
Datapoint 6: 10 32 20 45 12 55 40 55 9 25
Datapoint 7: 22 15 20 7 10 40 25 50 50 60
Datapoint 8: 200 300 400 500 600 250 350 450 550 650
Datapoint 9: 250 350 450 550 550 200 400 200 100 300
Datapoint 10: 10 20 25 40 50 40 50 60 20 50
Let us suppose the level-of-similarity for a cluster is set to 60. This means that a variability measure of 40 between a datapoint and a cluster centroid is acceptable. With cluster_strictness set to 60, formula 1 gives the minimum number of features that must be matched as 6.
For Datapoint 1
Initially, when Datapoint 1 is input to the algorithm, no clusters exist, so a new cluster C1 is created, Datapoint 1 is assigned to it, and the features of Datapoint 1 are taken as the centroid of C1.
C1 = Datapoint 1
C1 centroid: 10 15 20 25 30 35 40 45 50 55
For Datapoint 2
With cluster C1
similarity_measure(C1, F1) = 90
similarity_measure(C1, F2) = 233.33
similarity_measure(C1, F3) = 90
similarity_measure(C1, F4) = 180
similarity_measure(C1, F5) = 33.33
similarity_measure(C1, F6) = 91.43
similarity_measure(C1, F7) = 150
similarity_measure(C1, F8) = 91.11
similarity_measure(C1, F9) = 20
similarity_measure(C1, F10) = 36.36
Number of qualified features came out to be 4 which is less than 6. C1 cannot be added to list of qualified cluster.
Since there are no clusters in the list of qualified clusters, a new cluster C2 is generated and Datapoint 2 is assigned to C2.
C2 = Datapoint 2
C2 centroid: 9 35 18 45 10 32 60 41 10 20
For Datapoint 3
With cluster C1
similarity_measure(C1, F1) = 180
similarity_measure(C1, F2) = 86.67
similarity_measure(C1, F3) = 90
similarity_measure(C1, F4) = 108
similarity_measure(C1, F5) = 100
similarity_measure(C1, F6) = 108.57
similarity_measure(C1, F7) = 95
similarity_measure(C1, F8) = 91.11
similarity_measure(C1, F9) = 98
similarity_measure(C1, F10) = 103.64
Number of qualified features came out to be 9 which is greater than 6. C1 is added to list of qualified cluster.
For Datapoint 3
With cluster C2
similarity_measure(C2, F1) = 200
similarity_measure(C2, F2) = 37.14
similarity_measure(C2, F3) = 100
similarity_measure(C2, F4) = 60
similarity_measure(C2, F5) = 300
similarity_measure(C2, F6) = 118.75
similarity_measure(C2, F7) = 63.33
similarity_measure(C2, F8) = 100
similarity_measure(C2, F9) = 490
similarity_measure(C2, F10) = 285
Number of qualified features came out to be 5 which is less than 6. C2 cannot be added to list of qualified cluster.
Since C1 is the only cluster in the list of qualified clusters, Datapoint 3 is assigned to C1.
C1 = Datapoint 1, Datapoint 3
C1 centroid: 14 14 19 26 30 36.5 39 43 49.5 56
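The centroid recalculation above can be done without storing every member of the cluster. The sketch below uses an incremental mean update, which is one possible way to compute the same average; the paper only states that the centroid is the average of the member features, and the function name `update_centroid` is hypothetical.

```python
def update_centroid(centroid, count, datapoint):
    """Return (new_centroid, new_count) after adding one datapoint.

    `centroid` is the current feature-wise mean of `count` members.
    The update c + (d - c) / (count + 1) gives the mean of count + 1 members.
    """
    new_count = count + 1
    new_centroid = [c + (d - c) / new_count for c, d in zip(centroid, datapoint)]
    return new_centroid, new_count


datapoint_1 = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
datapoint_3 = [18, 13, 18, 27, 30, 38, 38, 41, 49, 57]

# C1 holds only Datapoint 1, so its centroid is Datapoint 1 and count is 1.
centroid, count = update_centroid(datapoint_1, 1, datapoint_3)
print(centroid, count)
```

Applied to Datapoint 1 and Datapoint 3, this reproduces the C1 centroid listed above.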
For Datapoint 4
With cluster C1
similarity_measure(C1, F1) = 142.86
similarity_measure(C1, F2) = 142.86
similarity_measure(C1, F3) = 94.74
similarity_measure(C1, F4) = 19.23
similarity_measure(C1, F5) = 50
similarity_measure(C1, F6) = 93.15
similarity_measure(C1, F7) = 128.21
similarity_measure(C1, F8) = 100
similarity_measure(C1, F9) = 20.20
similarity_measure(C1, F10) = 89.29
Number of qualified features came out to be 5 which is less than 6. C1 cannot be added to list of qualified cluster.
For Datapoint 4
With cluster C2
similarity_measure(C2, F1) = 222.22
similarity_measure(C2, F2) = 57.14
similarity_measure(C2, F3) = 100
similarity_measure(C2, F4) = 11.11
similarity_measure(C2, F5) = 150
similarity_measure(C2, F6) = 106.25
similarity_measure(C2, F7) = 83.33
similarity_measure(C2, F8) = 104.88
similarity_measure(C2, F9) = 100
similarity_measure(C2, F10) = 250
Number of qualified features came out to be 5 which is less than 6. C2 cannot be added to list of qualified cluster.
Since there are no clusters in the list of qualified clusters, a new cluster C3 is generated and Datapoint 4 is assigned to C3.
C3 = Datapoint 4
C3 centroid: 20 20 18 5 15 34 50 43 10 50
For Datapoint 5
With cluster C1
similarity_measure(C1, F1) = 121.43
similarity_measure(C1, F2) = 121.43
similarity_measure(C1, F3) = 94.74
similarity_measure(C1, F4) = 57.69
similarity_measure(C1, F5) = 73.33
similarity_measure(C1, F6) = 95.89
similarity_measure(C1, F7) = 112.82
similarity_measure(C1, F8) = 100
similarity_measure(C1, F9) = 20.20
similarity_measure(C1, F10) = 94.64
Number of qualified features came out to be 8 which is greater than 6. C1 is added to list of qualified cluster.
For Datapoint 5
With cluster C2
similarity_measure(C2, F1) = 188.89
similarity_measure(C2, F2) = 48.57
similarity_measure(C2, F3) = 100
similarity_measure(C2, F4) = 33.33
similarity_measure(C2, F5) = 220
similarity_measure(C2, F6) = 109.38
similarity_measure(C2, F7) = 73.33
similarity_measure(C2, F8) = 104.88
similarity_measure(C2, F9) = 100
similarity_measure(C2, F10) = 265
Number of qualified features came out to be 5 which is less than 6. C2 cannot be added to list of qualified cluster.
For Datapoint 5
With cluster C3
similarity_measure(C3, F1) = 85
similarity_measure(C3, F2) = 85
similarity_measure(C3, F3) = 100
similarity_measure(C3, F4) = 300
similarity_measure(C3, F5) = 146.67
similarity_measure(C3, F6) = 102.94
similarity_measure(C3, F7) = 88
similarity_measure(C3, F8) = 100
similarity_measure(C3, F9) = 100
similarity_measure(C3, F10) = 106
Number of qualified features came out to be 8 which is greater than 6. C3 is added to list of qualified cluster. Since C1 and C3 are both in the list of qualified clusters and have the same number of qualified features, the average of the qualifying similarity measures is calculated for both clusters. Qualifying similarity measures above 100 are brought below 100 using the conversion formula:
$$\text{converted\_similarity\_measure} = 100 - (\text{similarity\_measure} - 100) \qquad (3)$$
Average of qualifying feature (C1) = [(100 - (121.43 - 100)) + (100 - (121.43 - 100)) + 94.74 + 73.33 + 95.89 + (100 - (112.82 - 100)) + 100 + 94.64] / 8 = 87.87
Average of qualifying feature (C3) = [85 + 85 + 100 + (100 - (102.94 - 100)) + 88 + 100 + 100 + (100 - (106 - 100))] / 8 = 93.63
Since the average of qualifying features for C3 is greater, Datapoint 5 is assigned to C3.
C3 = Datapoint 4, Datapoint 5
C3 centroid: 18.5 18.5 18 10 18.5 34.5 47 43 10 51.5
For Datapoint 6
With cluster C1
similarity_measure(C1, F1) = 71.43
similarity_measure(C1, F2) = 228.57
similarity_measure(C1, F3) = 105.26
similarity_measure(C1, F4) = 173.08
similarity_measure(C1, F5) = 40
similarity_measure(C1, F6) = 150.68
similarity_measure(C1, F7) = 102.56
similarity_measure(C1, F8) = 127.91
similarity_measure(C1, F9) = 18.18
similarity_measure(C1, F10) = 44.64
Number of qualified features came out to be 4 which is less than 6. C1 cannot be added to list of qualified cluster.
For Datapoint 6
With cluster C2
similarity_measure(C2, F1) = 111.11
similarity_measure(C2, F2) = 91.43
similarity_measure(C2, F3) = 111.11
similarity_measure(C2, F4) = 100
similarity_measure(C2, F5) = 120
similarity_measure(C2, F6) = 171.88
similarity_measure(C2, F7) = 66.67
similarity_measure(C2, F8) = 134.15
similarity_measure(C2, F9) = 90
similarity_measure(C2, F10) = 125
Number of qualified features came out to be 9 which is greater than 6. C2 is added to list of qualified cluster.
For Datapoint 6
With cluster C3
similarity_measure(C3, F1) = 54.05
similarity_measure(C3, F2) = 172.97
similarity_measure(C3, F3) = 111.11
similarity_measure(C3, F4) = 450
similarity_measure(C3, F5) = 64.86
similarity_measure(C3, F6) = 159.42
similarity_measure(C3, F7) = 85.11
similarity_measure(C3, F8) = 127.91
similarity_measure(C3, F9) = 90
similarity_measure(C3, F10) = 48.54
Number of qualified features came out to be 5 which is less than 6. C3 cannot be added to list of qualified cluster.
Since C2 is the only cluster in the list of qualified clusters, Datapoint 6 is assigned to C2.
C2 = Datapoint 2, Datapoint 6
C2 centroid: 9.5 33.5 19 45 11 43.5 50 48 9.5 22.5
For Datapoint 7
With cluster C1
similarity_measure(C1, F1) = 157.14
similarity_measure(C1, F2) = 107.14
similarity_measure(C1, F3) = 105.26
similarity_measure(C1, F4) = 26.92
similarity_measure(C1, F5) = 33.33
similarity_measure(C1, F6) = 109.59
similarity_measure(C1, F7) = 64.10
similarity_measure(C1, F8) = 116.28
similarity_measure(C1, F9) = 101.01
similarity_measure(C1, F10) = 107.14
Number of qualified features came out to be 7 which is greater than 6. C1 is added to list of qualified cluster.
For Datapoint 7
With cluster C2
similarity_measure(C2, F1) = 231.58
similarity_measure(C2, F2) = 44.78
similarity_measure(C2, F3) = 105.26
similarity_measure(C2, F4) = 15.56
similarity_measure(C2, F5) = 90.91
similarity_measure(C2, F6) = 91.95
similarity_measure(C2, F7) = 50
similarity_measure(C2, F8) = 104.17
similarity_measure(C2, F9) = 526.32
similarity_measure(C2, F10) = 266.67
Number of qualified features came out to be 4 which is less than 6. C2 cannot be added to list of qualified cluster.
For Datapoint 7
With cluster C3
similarity_measure(C3, F1) = 118.92
similarity_measure(C3, F2) = 81.08
similarity_measure(C3, F3) = 111.11
similarity_measure(C3, F4) = 70
similarity_measure(C3, F5) = 54.05
similarity_measure(C3, F6) = 115.94
similarity_measure(C3, F7) = 53.19
similarity_measure(C3, F8) = 116.28
similarity_measure(C3, F9) = 500
similarity_measure(C3, F10) = 116.50
Number of qualified features came out to be 7 which is greater than 6. C3 is added to list of qualified cluster.
Since C1 and C3 are both in the list of qualified clusters and have the same number of qualified features, the average of the qualifying similarity measures is calculated for both clusters.
Average of qualifying feature (C1) = [(100 - (107.14 - 100)) + (100 - (105.26 - 100)) + (100 - (109.59 - 100)) + 64.10 + (100 - (116.28 - 100)) + (100 - (101.01 - 100)) + (100 - (107.14 - 100))] / 7 = 88.24
Average of qualifying feature (C3) = [(100 - (118.92 - 100)) + 81.08 + (100 - (111.11 - 100)) + 70 + (100 - (115.94 - 100)) + (100 - (116.28 - 100)) + (100 - (116.50 - 100))] / 7 = 81.76
Since the average of qualifying features for C1 is greater, Datapoint 7 is assigned to C1.
C1 = Datapoint 1, Datapoint 3, Datapoint 7
C1 centroid: 16.67 14.33 19.33 19.67 23.33 37.67 34.33 45.33 49.67 57.33
For Datapoint 8
With cluster C1
similarity_measure(C1, F1) = 1200
similarity_measure(C1, F2) = 2093.02
similarity_measure(C1, F3) = 2068.97
similarity_measure(C1, F4) = 2542.37
similarity_measure(C1, F5) = 2571.43
similarity_measure(C1, F6) = 663.72
similarity_measure(C1, F7) = 1019.42
similarity_measure(C1, F8) = 992.65
similarity_measure(C1, F9) = 1107.38
similarity_measure(C1, F10) = 1133.72
Number of qualified features came out to be 0 which is less than 6. C1 cannot be added to list of qualified cluster.
For Datapoint 8
With cluster C2
similarity_measure(C2, F1) = 2105.26
similarity_measure(C2, F2) = 895.52
similarity_measure(C2, F3) = 2105.26
similarity_measure(C2, F4) = 1111.11
similarity_measure(C2, F5) = 5454.55
similarity_measure(C2, F6) = 574.71
similarity_measure(C2, F7) = 700
similarity_measure(C2, F8) = 937.5
similarity_measure(C2, F9) = 5789.47
similarity_measure(C2, F10) = 2888.89
Number of qualified features came out to be 0 which is less than 6. C2 cannot be added to list of qualified cluster.
For Datapoint 8
With cluster C3
similarity_measure(C3, F1) = 1081.08
similarity_measure(C3, F2) = 1621.62
similarity_measure(C3, F3) = 2222.22
similarity_measure(C3, F4) = 5000
similarity_measure(C3, F5) = 3243.24
similarity_measure(C3, F6) = 724.64
similarity_measure(C3, F7) = 744.68
similarity_measure(C3, F8) = 1046.51
similarity_measure(C3, F9) = 5500
similarity_measure(C3, F10) = 1262.14
Number of qualified features came out to be 0 which is less than 6. C3 cannot be added to list of qualified cluster. Since there are no clusters in the list of qualified clusters, a new cluster C4 is generated and Datapoint 8 is assigned to C4.
C4 = Datapoint 8
C4 centroid: 200 300 400 500 600 250 350 450 550 650
For Datapoint 9
With cluster C1
similarity_measure(C1, F1) = 1500
similarity_measure(C1, F2) = 2441.86
similarity_measure(C1, F3) = 2327.59
similarity_measure(C1, F4) = 2796.61
similarity_measure(C1, F5) = 2357.14
similarity_measure(C1, F6) = 530.97
similarity_measure(C1, F7) = 1165.05
similarity_measure(C1, F8) = 441.18
similarity_measure(C1, F9) = 201.34
similarity_measure(C1, F10) = 523.26
Number of qualified features came out to be 0 which is less than 6. C1 cannot be added to list of qualified cluster.
For Datapoint 9
With cluster C2
similarity_measure(C2, F1) = 2631.58
similarity_measure(C2, F2) = 1044.78
similarity_measure(C2, F3) = 2368.42
similarity_measure(C2, F4) = 1222.22
similarity_measure(C2, F5) = 5000
similarity_measure(C2, F6) = 459.77
similarity_measure(C2, F7) = 800
similarity_measure(C2, F8) = 416.67
similarity_measure(C2, F9) = 1052.63
similarity_measure(C2, F10) = 1333.33
Number of qualified features came out to be 0 which is less than 6. C2 cannot be added to list of qualified cluster.
For Datapoint 9
With cluster C3
similarity_measure(C3, F1) = 1351.35
similarity_measure(C3, F2) = 1891.89
similarity_measure(C3, F3) = 2500
similarity_measure(C3, F4) = 5500
similarity_measure(C3, F5) = 2972.97
similarity_measure(C3, F6) = 579.71
similarity_measure(C3, F7) = 851.06
similarity_measure(C3, F8) = 465.12
similarity_measure(C3, F9) = 1000
similarity_measure(C3, F10) = 582.52
Number of qualified features came out to be 0 which is less than 6. C3 cannot be added to list of qualified cluster.
For Datapoint 9
With cluster C4
similarity_measure(C4, F1) = 125
similarity_measure(C4, F2) = 116.67
similarity_measure(C4, F3) = 112.5
similarity_measure(C4, F4) = 110
similarity_measure(C4, F5) = 91.67
similarity_measure(C4, F6) = 80
similarity_measure(C4, F7) = 114.29
similarity_measure(C4, F8) = 44.44
similarity_measure(C4, F9) = 18.18
similarity_measure(C4, F10) = 46.15
Number of qualified features came out to be 7 which is greater than 6. C4 is added to list of qualified cluster. Since C4 is the only cluster in the list of qualified clusters, Datapoint 9 is assigned to C4.
C4 = Datapoint 8, Datapoint 9
C4 centroid: 225 325 425 525 575 225 375 325 325 475
For Datapoint 10
With cluster C1
similarity_measure(C1, F1) = 60
similarity_measure(C1, F2) = 139.53
similarity_measure(C1, F3) = 129.31
similarity_measure(C1, F4) = 203.39
similarity_measure(C1, F5) = 214.29
similarity_measure(C1, F6) = 106.19
similarity_measure(C1, F7) = 145.63
similarity_measure(C1, F8) = 132.35
similarity_measure(C1, F9) = 40.27
similarity_measure(C1, F10) = 87.21
Number of qualified features came out to be 6. C1 is added to list of qualified cluster.
For Datapoint 10
With cluster C2
similarity_measure(C2, F1) = 105.26
similarity_measure(C2, F2) = 59.70
similarity_measure(C2, F3) = 131.58
similarity_measure(C2, F4) = 88.89
similarity_measure(C2, F5) = 454.55
similarity_measure(C2, F6) = 91.95
similarity_measure(C2, F7) = 100
similarity_measure(C2, F8) = 125
similarity_measure(C2, F9) = 210.53
similarity_measure(C2, F10) = 222.22
Number of qualified features came out to be 6. C2 is added to list of qualified cluster.
For Datapoint 10
With cluster C3
similarity_measure(C3, F1) = 54.05
similarity_measure(C3, F2) = 108.11
similarity_measure(C3, F3) = 138.89
similarity_measure(C3, F4) = 400
similarity_measure(C3, F5) = 270.27
similarity_measure(C3, F6) = 115.94
similarity_measure(C3, F7) = 106.38
similarity_measure(C3, F8) = 139.53
similarity_measure(C3, F9) = 200
similarity_measure(C3, F10) = 97.09
Number of qualified features came out to be 6. C3 is added to list of qualified cluster.
For Datapoint 10
With cluster C4
similarity_measure(C4, F1) = 4.44
similarity_measure(C4, F2) = 6.15
similarity_measure(C4, F3) = 5.88
similarity_measure(C4, F4) = 7.62
similarity_measure(C4, F5) = 8.70
similarity_measure(C4, F6) = 17.78
similarity_measure(C4, F7) = 13.33
similarity_measure(C4, F8) = 18.46
similarity_measure(C4, F9) = 6.15
similarity_measure(C4, F10) = 10.53
Number of qualified features came out to be 0 which is less than 6. C4 cannot be added to list of qualified cluster. Since C1, C2 and C3 are all in the list of qualified clusters and have the same number of qualified features, the average of the qualifying similarity measures is calculated for all three clusters.
Average of qualifying feature (C1) = [60 + (100 - (139.53 - 100)) + (100 - (129.31 - 100)) + (100 - (106.19 - 100)) + (100 - (132.35 - 100)) + 87.21] / 6 = 73.31
Average of qualifying feature (C2) = [(100 - (105.26 - 100)) + (100 - (131.58 - 100)) + 88.89 + 91.95 + 100 + (100 - (125 - 100))] / 6 = 86.5
Average of qualifying feature (C3) = [(100 - (108.11 - 100)) + (100 - (138.89 - 100)) + (100 - (115.94 - 100)) + (100 - (106.38 - 100)) + (100 - (139.53 - 100)) + 97.09] / 6 = 81.37
Since the average of qualifying features for C2 is the greatest, Datapoint 10 is assigned to C2.
C2 = Datapoint 2, Datapoint 6, Datapoint 10
C2 centroid: 9.67 29 21 43.33 24 42.33 50 52 13 31.67
The datapoints were successfully grouped into 4 clusters based on the similarity measure of their features. A cluster strictness of 60 was used, hence the permitted variability measure between a datapoint and a cluster centroid was 40.
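The walkthrough above can be replayed with a compact sketch of the whole procedure. This is an illustrative reading of Sections 2 and 3, not the authors' implementation: the class name `StrictnessClusterer`, its attributes, the tolerance `EPS`, and the choice to keep running feature sums are all assumptions. The tolerance is there because a similarity that is exactly 60 on paper (Datapoint 10, feature 1, against C1) can land just below 60 in floating-point arithmetic. Centroid features are assumed to be non-zero.

```python
import math

EPS = 1e-9  # assumption: tolerance for values that sit on a range boundary


class StrictnessClusterer:
    """Hypothetical one-pass clusterer driven by cluster strictness."""

    def __init__(self, cluster_strictness, n_features):
        self.low = cluster_strictness
        self.high = 200 - cluster_strictness
        self.min_matched = math.ceil(cluster_strictness * n_features / 100)
        self.sums = []    # per-cluster running feature sums
        self.counts = []  # per-cluster number of members

    def centroid(self, index):
        return [s / self.counts[index] for s in self.sums[index]]

    def _score(self, datapoint, centroid):
        """Return (matched feature count, average folded similarity)."""
        folded = []
        for d, c in zip(datapoint, centroid):
            similarity = 100.0 * d / c  # assumes non-zero centroid feature
            if self.low - EPS <= similarity <= self.high + EPS:
                folded.append(200 - similarity if similarity > 100 else similarity)
        average = sum(folded) / len(folded) if folded else 0.0
        return len(folded), average

    def add(self, datapoint):
        """Assign one datapoint and return the index of its cluster."""
        best = None  # (matched, average, cluster index)
        for index in range(len(self.sums)):
            matched, average = self._score(datapoint, self.centroid(index))
            if matched >= self.min_matched and (
                best is None or (matched, average) > best[:2]
            ):
                best = (matched, average, index)
        if best is None:  # qualified list is empty: start a new cluster
            self.sums.append(list(datapoint))
            self.counts.append(1)
            return len(self.sums) - 1
        index = best[2]
        self.sums[index] = [s + d for s, d in zip(self.sums[index], datapoint)]
        self.counts[index] += 1
        return index


DATAPOINTS = [
    [10, 15, 20, 25, 30, 35, 40, 45, 50, 55],
    [9, 35, 18, 45, 10, 32, 60, 41, 10, 20],
    [18, 13, 18, 27, 30, 38, 38, 41, 49, 57],
    [20, 20, 18, 5, 15, 34, 50, 43, 10, 50],
    [17, 17, 18, 15, 22, 35, 44, 43, 10, 53],
    [10, 32, 20, 45, 12, 55, 40, 55, 9, 25],
    [22, 15, 20, 7, 10, 40, 25, 50, 50, 60],
    [200, 300, 400, 500, 600, 250, 350, 450, 550, 650],
    [250, 350, 450, 550, 550, 200, 400, 200, 100, 300],
    [10, 20, 25, 40, 50, 40, 50, 60, 20, 50],
]

clusterer = StrictnessClusterer(cluster_strictness=60, n_features=10)
labels = [clusterer.add(p) for p in DATAPOINTS]
print(labels)  # index 0 corresponds to C1, index 1 to C2, and so on
```

Run on the 10 example datapoints, the sketch forms 4 clusters and places the datapoints in the same clusters as the worked example. Each call to `add` compares the datapoint with every existing centroid, so the cost per datapoint grows with the number of clusters.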


Graphical representation of how 10 example datapoints were clustered
4 Conclusion
We presented the theoretical aspects of the proposed algorithm and applied it to 10 datapoints, each with 10 features, on the assumption that they are input to the algorithm one after another, like a serial stream of data. We also discussed how the formation of clusters can be controlled simply by increasing or decreasing the level-of-similarity. The algorithm is applicable in real-world scenarios where the task is to group a stream of datapoints arriving at a system, based on their similarity to the centroid (the average of the features) of the datapoints in each previously formed cluster. If none of the existing clusters satisfies the similarity measure, a new cluster is formed and the datapoint is assigned to it.
5 Future Directions
In our literature review, we did not find any real-time algorithm that resembles the way this algorithm works. It would not be meaningful to compare the performance of this algorithm with existing real-time clustering algorithms that collect data for a period of time and then perform clustering analysis on the collected data. Hence, evaluating the performance of this algorithm relative to another clustering algorithm that groups datapoints one at a time is the main future direction.
When the number of clusters is relatively large, the cost of the algorithm also increases proportionately. Hence, it would be an interesting problem to reduce the number of similarity checks performed when a new datapoint arrives. Instead of checking the similarity of a datapoint against all existing clusters, an approach could be devised in which only a certain number of clusters are considered for the similarity check. In this way, the number of required computations could be reduced.
We would like to gratefully acknowledge Intel AI DevCloud for providing cloud compute for computational needs. Also, special thanks to Amit Sharma for his support in drafting this paper in LaTeX, and Rohan Negi for his healthy discussions during the overall period of this research work.
References
 Aggarwal et al. (2003a) Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for clustering evolving data streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB ’03, pages 81–92. VLDB Endowment, 2003a. ISBN 0127224424. URL http://dl.acm.org/citation.cfm?id=1315451.1315460.
 Aggarwal et al. (2003b) Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. A framework for clustering evolving data streams. In VLDB, 2003b.
 Chen and Tu (2007) Yixin Chen and Li Tu. Density-based clustering for real-time stream data. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, pages 133–142, New York, NY, USA, 2007. ACM. ISBN 9781595936097. doi: 10.1145/1281192.1281210. URL http://doi.acm.org/10.1145/1281192.1281210.
 Guha et al. (2000) S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. In Proceedings of the 41st Annual Symposium on Foundations of Computer Science, FOCS ’00, pages 359–, Washington, DC, USA, 2000. IEEE Computer Society. ISBN 0769508502. URL http://dl.acm.org/citation.cfm?id=795666.796588.
 Hartigan (1975) John A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., New York, NY, USA, 99th edition, 1975. ISBN 047135645X.
 Xu and Wunsch (2005) Rui Xu and D. Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, May 2005. ISSN 1045-9227. doi: 10.1109/TNN.2005.845141.