I Motivation
The concept of Internet of Things (IoT) [1, 2] is intimately related to records of certain events, e.g. a network-attached device capturing weather information to be broadcast to other devices for processing. Given that a) the transmitter sends out such information every time interval $\Delta T$, and b) the receiving device keeps track of the timestamps when data was transmitted/recorded, the time series stores information on failures of recording/sending/transmission/receiving.
If we cluster the one-dimensional data such that consecutive events within a cluster are not more than $\Delta T$ apart, we can infer periods in time where data might be missing. Upon detection, corresponding action such as retransmission, data interpolation, etc. can be performed. Moreover, the characteristics of the intervals of no data (relative frequency, duration, …) might help to diagnose the sanity of the communication network.
Since general-purpose, multi-dimensional clustering methods such as Fisher's discriminant [3], k-means [4] or more generally EM [5] do not exploit the special property of ordering in one dimension, we aim at a simpler approach that does not need knowledge of the number of clusters/intervals and avoids density estimation such as with denclue [6]. Regarding cluster classification our approach is close to the conceptual notion dbscan [7]
introduces: clusters of points and outliers/noise points. However, we exploit the fact that the sequence of timestamps is naturally ordered^1 and thus reduce computational complexity by a factor of $\log N$. Of course, if we had to sort first, e.g. using heapsort [8], we would be back at an asymptotic runtime of $O(N \log N)$.
Our main contribution here is to adapt the concept of dbscan to event clustering for application in IoT service quality characterization. We present an algorithm with linear runtime complexity $O(N)$ which asymptotically outperforms native dbscan operating at an overall average runtime complexity of $O(N \log N)$. Of course, while dbscan can be applied to any number of spatial dimensions, our approach is limited to the one-dimensional case.
^1 The number of seconds passed since some defined event (for UNIX epoch time this is Jan 1, 1970 UTC) monotonically increases; thus records of consecutive events $e_1, e_2, \dots$ have ordered timestamps $t_1 \le t_2 \le \dots$
II One-Dimensional Clustering
II-A Problem Formulation
Given a set of $N$ ordered timestamps $t_i$, i.e.

$t_1 \le t_2 \le \dots \le t_N$ ,   (1)

and an expected time interval $\Delta T$, provide time intervals $[\tau^-_j, \tau^+_j]$ such that

$t_{i+1} - t_i \le \Delta T$ for all $t_i, t_{i+1} \in [\tau^-_j, \tau^+_j]$ .   (2)

Note that $\Delta T$ might be an external parameter to the algorithm that provides the solution, or it is defined by the time series itself, e.g. through the mean gap $\Delta T = \frac{1}{N-1}\sum_i \delta t_i$. As we will discuss, the $\delta t_i$ need to be computed anyway, and therefore $\Delta T$ is efficiently determined along the lines.
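Such a data-driven choice of $\Delta T$ can be sketched in a few lines of Python (toy timestamps; taking the mean of the consecutive gaps is just one plausible default):

```python
# Derive the clustering threshold from the series itself.
t = [0.0, 1.0, 2.1, 2.9, 4.0]                      # toy, ordered timestamps

dt = [t[i + 1] - t[i] for i in range(len(t) - 1)]  # consecutive gaps
dT = sum(dt) / len(dt)                             # mean gap as Delta T
```

Since the gaps are needed by the clustering step anyway, this default comes at no extra asymptotic cost.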
II-B Central Idea
In order to fulfill eq. 2, of course, we need to compute at least the $N-1$ time intervals

$\delta t_i = t_{i+1} - t_i$ .   (3)

Whenever a new time series point gets (randomly) added, there is no a priori way of determining from the existing $\delta t_i$ whether $\Delta T$ got exceeded.
To classify the $t_i$ as interval bounds we note that the binary sequence

$b_i = \theta(\delta t_i - \Delta T)$   (4)

switches from 1 to 0 for an opening interval bound $\tau^-$, and from 0 to 1 for a closing interval bound $\tau^+$, only. $\theta$ denotes the Heaviside function with $\theta(x) = 1$ for $x > 0$ and $\theta(x) = 0$ otherwise. Hence the quantity

$B_i = b_i - b_{i-1}$   (5)

yields the desired association

$t_i = \tau^-$ if $B_i = -1$ , and $t_i = \tau^+$ if $B_i = +1$ .   (6)
Per requirement, a) eq. 1, the $t_i$ are ordered, and b) the binary (discrete) function $b_i$ implies the alternating property

$B_{i_k} = -B_{i_{k+1}}$ for consecutive nonzero values $B_{i_k}, B_{i_{k+1}}$ .   (7)

Thus, linearly scanning through the $t_i$ and their corresponding $B_i$ results in the two ordered sets

$\{\tau^-_j\} = \{t_i : B_i = -1\}$ , $\{\tau^+_j\} = \{t_i : B_i = +1\}$   (8)

such that we can simply interleave these to obtain the corresponding time intervals as our solution, eq. 2.
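As a minimal sketch of eqs. 3 to 5 (toy data and threshold; boundary handling is deferred to the next section):

```python
# Toy timestamps and threshold, chosen for illustration only.
t = [1.0, 2.0, 2.5, 10.0, 11.0]
dT = 1.0

dt = [t[i + 1] - t[i] for i in range(len(t) - 1)]   # gaps between events
b = [1 if g > dT else 0 for g in dt]                # 1 marks an excessive gap
B = [b[i] - b[i - 1] for i in range(1, len(b))]     # -1: opening, +1: closing
```

Here B classifies the interior points: t[2] = 2.5 closes the cluster starting at 1.0, and t[3] = 10.0 opens the cluster [10, 11]. The first and last points receive no B value yet, which motivates the boundary treatment below.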
II-C Boundary Conditions
However, there are a couple of options for how exactly to interleave the $\tau^\pm_j$, which depend on the boundary conditions. More specifically, let us assume the sequence starts e.g. with intervals that are smaller than $\Delta T$. In this case the first nonzero $B_i$ equals $+1$, and one needs to manually add a $\tau^-_1 = t_1$ to construct the intervals

$[t_1, \tau^+_1], [\tau^-_2, \tau^+_2], \dots$   (9)

A corresponding issue might happen at the end of the time series, depending on whether $|\{\tau^-_j\}|$ is equal or not equal^2 to $|\{\tau^+_j\}|$.
^2 Note that by virtue of eq. 7 the difference is at most 1. Actually, it is already obvious from the fact that there exist only $N-1$ gaps $\delta t_i$ for $N$ timestamps that one needs to manually add boundary values: imagine the case where each $t_i$ is a boundary value, but we are two of the $B_i$ short to classify all $t_i$.
In order to prevent manually dealing with all the (four) different boundary condition scenarios, we might want to (virtually) add the following timestamps from the outset:

$t_0 = t_1 - 2\Delta T$ , $t_{N+1} = t_N + 2\Delta T$ .   (10)

Hence, we obtain $b_0 = b_N = 1$, and therefore

$\sum_i B_i = b_N - b_0 = 0$ ,   (11)

which yields a $B_i$ corresponding to each $t_i$ for classification such that we always have

$[\tau^-_1, \tau^+_1], [\tau^-_2, \tau^+_2], \dots, [\tau^-_M, \tau^+_M]$   (12)

from

$\{\tau^-_j\} = \{t_i : B_i = -1\}$ , $\{\tau^+_j\} = \{t_i : B_i = +1\}$   (13)

with $|\{\tau^-_j\}| = |\{\tau^+_j\}| = M$.
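The effect of the virtual end points can be sketched as follows (toy data; the series deliberately starts and ends inside a cluster, the worst case for the boundary handling):

```python
# Virtual end points remove all boundary special cases.
t = [1.0, 2.0, 2.5]        # series starting and ending inside a cluster
dT = 1.0

tp = [t[0] - 2 * dT] + t + [t[-1] + 2 * dT]     # padded series, eq. (10)
b = [1 if tp[i + 1] - tp[i] > dT else 0
     for i in range(len(tp) - 1)]               # now b[0] = b[-1] = 1
B = [b[i] - b[i - 1] for i in range(1, len(b))] # one value per real t_i
```

B now classifies every real point: t[0] opens and t[2] closes the single cluster [1, 2.5], without any manual patching of the bound sets.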
II-D Isolated Points
Given the solution eq. 12, due to the ordering of the $t_i$, we can simply form the open intervals

$(\tau^+_j, \tau^-_{j+1})$   (14)

that we associate with time intervals of failure. Note that $\tau^+_j < \tau^-_{j+1}$. However, these intervals do not imply

$t_i \notin (\tau^+_j, \tau^-_{j+1})$ for all $i$ ,   (15)

i.e., informally, it is not true that no event happens during the intervals $(\tau^+_j, \tau^-_{j+1})$, but we certainly have

$\delta t_{i-1}, \delta t_i > \Delta T$ for any $t_i \in (\tau^+_j, \tau^-_{j+1})$ ,   (16)

where we refer to such a $t_i$ as an isolated event. In terms of dbscan these timestamps form the noise, while all $\tau^\pm_j$ are border points.
Isolated events have $B_i = 0$, and since they are not interval boundary points they need to have $b_{i-1} = b_i = 1$. This way we can use $b_i$ and $B_i$ to classify isolated events according to

$\mathcal{I} = \{t_i : B_i = 0 \wedge b_i = 1\}$ ,   (17)

where $\mathcal{I}$ denotes the set of isolated timestamps. Likewise, we can define clustered timestamps as

$\mathcal{C} = \{t_i : B_i \ne 0 \vee b_i = 0\}$ .   (18)

Since $b_i$ is binary and eq. 5 holds for $B_i$, all $t_i$ are uniquely classified, i.e. $\mathcal{I} \cap \mathcal{C} = \emptyset$ and $\mathcal{I} \cup \mathcal{C} = \{t_i\}$. It is rather straightforward to convince oneself that there is the association in the sense that all $t_i \in \mathcal{C}$ are within an interval of eq. 12 and all $t_i \in \mathcal{I}$ within an interval of eq. 14.
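The classification into isolated and clustered timestamps can be sketched directly from the adjacent gap flags (toy data; an event is isolated exactly when both of its neighboring gaps are excessive):

```python
# Classify isolated vs. clustered timestamps via adjacent gap flags.
t = [1.0, 5.0, 5.5, 6.0, 20.0]
dT = 1.0

tp = [t[0] - 2 * dT] + t + [t[-1] + 2 * dT]     # virtual end points
b = [1 if tp[i + 1] - tp[i] > dT else 0 for i in range(len(tp) - 1)]

# t[i] is isolated iff both neighboring gaps exceed dT.
isolated = [t[i] for i in range(len(t)) if b[i] == 1 and b[i + 1] == 1]
clustered = [t[i] for i in range(len(t)) if not (b[i] == 1 and b[i + 1] == 1)]
```

Every timestamp lands in exactly one of the two lists, mirroring the disjoint classification above.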
II-E Implementation & Computational Complexity
The listing lst:PseudoCodeImpl provides an example implementation of the method from sections II-B and II-C in pseudocode for demonstration purposes. E.g. the call of cluster_events(t, dT) on

t=[-20,-18,1,2,2.9,10,11,100,200,202,202,203]

given dT as

-1, 0, 1, 10, 100, and the mean of the elements of t

returns output equivalent to

[], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203]
[[202, 202]], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 203]
[(1, 2.9), (10, 11), (202, 203)], [-20, -18, 100, 200]
[(-20, -18), (1, 11), (200, 203)], [100]
[(-20, 203)], []
[(-20, 11), (200, 203)], [100]

respectively.
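Since the pseudocode listing itself is not reproduced here, a compact Python sketch consistent with the example call above (function and variable names chosen to match; clusters returned as (τ⁻, τ⁺) tuples) might look as follows:

```python
def cluster_events(t, dT):
    """Single-pass 1-D clustering of ordered timestamps t.

    Gaps larger than dT separate clusters; returns (clusters, isolated)
    with clusters as a list of (tau_minus, tau_plus) tuples.
    """
    n = len(t)
    # b[i] == 1 iff the gap preceding t[i] is excessive; virtual gaps
    # before t[0] and after t[-1] make every point classifiable.
    b = [1] + [1 if t[i + 1] - t[i] > dT else 0 for i in range(n - 1)] + [1]
    clusters, isolated = [], []
    tau_minus = None
    for i in range(n):
        if b[i] == 1 and b[i + 1] == 1:        # both adjacent gaps excessive
            isolated.append(t[i])
        elif b[i] == 1 and b[i + 1] == 0:      # opening bound tau^-
            tau_minus = t[i]
        elif b[i] == 0 and b[i + 1] == 1:      # closing bound tau^+
            clusters.append((tau_minus, t[i]))
    return clusters, isolated
```

E.g. cluster_events([-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203], 10) returns ([(-20, -18), (1, 11), (200, 203)], [100]).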
The procedure presented in sections II-B to II-D and the listing lst:PseudoCodeImpl uses $O(N)$ algebraic operations for the $\delta t_i$, $O(N)$ logical operations for the $b_i$, and again $O(N)$ algebraic operations for the $B_i$ which determine the interval boundary classification, i.e. a total of $O(N)$ operations. The final loop in the listing to interleave the tauPlus and tauMinus lists is just for the user's convenience.
A naive approach would compute the two adjacent time intervals $\delta t_{i-1}$ and $\delta t_i$ for each $t_i$ and perform two logical operations comparing those against $\Delta T$ to determine the classification, hence about $4N$ computations. Note that due to the given linear ordering in one-dimensional space, our algorithm's runtime bound is exact: in particular, it is fully deterministic when the number of timestamps $N$ is fixed.
Moreover, the required memory for our approach is linear in $N$. Only the event series list t of size $N$ and the lists x, tauMinus, and tauPlus with a total size of at most $N$ timestamps need to be stored. The lists dt, b, and B can be computed on the fly, occupying storage $O(1)$.
To confirm our analytical findings we performed a numerical experiment, presented in fig. 1. It evaluates the speedup of our algorithm compared to a vanilla implementation of dbscan. Within the observed error boundaries, a scaling factor of $\log N$ is plausible for large $N$ wrt. the speedup factor

$s(N) = T_{\mathrm{dbscan}}(N) / T_{\mathrm{lin}}(N)$   (19)

with $T_{\mathrm{dbscan}}$ and $T_{\mathrm{lin}}$ the individual runtimes of dbscan and our linear approach, respectively.
II-F Application
We observe that for a given, fixed event series with total time interval $T$, the quantity

$C(\Delta T) = \frac{1}{T} \sum_{j=1}^{M} (\tau^+_j - \tau^-_j)$ ,   (20)

where

$T = t_N - t_1$ ,   (21)

computes the fraction of time with no failure in operation. $\Delta T$ is fixed by the expected, logarithmic, and normalized event frequency $f$, i.e. $f = 0$ represents the scale of frequency where all timestamps are equally spaced within the time series interval $T$. $f > 0$ corresponds to smaller scales, $f < 0$ to larger ones.
Scanning $C$ by varying $f$ provides a characteristic that quantifies the reliability of e.g. an IoT service. It is rather straightforward to show that $C$ is monotonically decreasing with increasing $f$.^3
^3 The larger $\Delta T$, the more the clusters cover the whole time series. Due to eq. 2 clusters never shrink in size for increasing $\Delta T$; they either grow or merge into bigger clusters, letting the overall cover increase.
In the case where the time series is generated by a single, periodic data stream, we get a unit step function, i.e. $C = 1$ for $f \le f_0$ and $C = 0$ for $f > f_0$, with $f_0$ set by the stream's period. Nevertheless, similar information could be obtained by simply checking a histogram $h(\delta t)$, cf. eq. 3, that counts the number of $\delta t_i$ in some binning interval (number density). In the case above we would observe a single peak in $h$. Note that $h$ contains similar information to $C$.
However, our clustering output provides information that $h$ is blind to, because $h$ does not account for the ordering of the $\delta t_i$. In particular,

$m(\Delta T) = \frac{M - 1 + \delta_{0M}}{\lfloor N/2 \rfloor - 1}$   (22)

provides a normed measure of the number of clusters $M$.^4 While $C$ just quantifies the total coverage of $T$ by the clusters, $m$ provides insight whether the coverage is established by a number of patches or a single/a few intervals with data frequency of at least $1/\Delta T$. This way we might draw conclusions on e.g. the reliability of an IoT service. Ideally we want $m = 0$.
^4 The Kronecker delta $\delta_{xy}$ is $1$ for $x = y$, $0$ else. It forces $M = 0$ to result in $m = 0$.
Last but not least, we might consider the normalized number of isolated events

$n(\Delta T) = |\mathcal{I}| / N$   (23)

as an additional indicator of reliability, since it is orthogonal to the information contained in $C$ and $m$. We might classify isolated events as an indicator of loose IoT service quality, and thus $n$ should stay close to zero until it quickly increases towards one for large $f$.
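For a fixed $\Delta T$, these measures can be sketched as follows (assuming clusters and isolated events have already been extracted; the normalization of the cluster-count measure follows the convention above and is an assumption):

```python
def quality_measures(t, clusters, isolated):
    """Coverage C, normed cluster count m, isolated fraction n (sketch)."""
    T = t[-1] - t[0]                       # total observation interval
    M, N = len(clusters), len(t)
    C = sum(hi - lo for lo, hi in clusters) / T          # covered fraction
    # One plausible normalization: at most floor(N/2) clusters of >= 2 points.
    m = (M - 1 + (1 if M == 0 else 0)) / (N // 2 - 1)
    n = len(isolated) / N                                # isolated fraction
    return C, m, n
```

E.g. for t = [0.0, 1.0, 2.0, 7.0, 10.0] with the single cluster (0.0, 2.0) and isolated events 7.0 and 10.0, this yields C = 0.2, m = 0.0, n = 0.4.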
Figure 2 illustrates these applications by plotting $C$, $m$, and $n$ for an event series generated from uniformly random samples joined by equidistant samples. We observe little variance in $C$ at small $f$, indicating that there is no single dominant event frequency. Moreover, there is a step in $C$ that covers 90% of its range and refers to a dominant event frequency one order of magnitude lower than the equal-spacing scale $f = 0$. Since $m$ stays small there, we conclude this frequency to be present along major time intervals within $T$. Also, $n$ rapidly drops; therefore, the existence of isolated events vanishes at time scales larger than this, such that we have a clean signal.
In contrast, $m$ increases for larger $f$. Thus, due to the randomness we introduced in our sample, for high-frequency events increasing coverage $C$ is achieved by a number of separate clusters (random nature of the signal!). Finally, for frequencies 3 orders of magnitude larger than the equal-spacing scale, $C \approx 0$, i.e. no more clustering of events is present.
Figure 3 depicts a sample data flow and processing pipeline where the discussed method can be employed to rate and monitor e.g. an IoT device or the data availability of satellite imagery in the big geospatial data platform IBM PAIRS [11, 12]. Given that this information service is expected to send data packages at frequency $1/\Delta T$, an event cluster engine records and stores the timestamps $t_i$ for further analysis. At the same time a frequency detector might dynamically adjust $\Delta T$, e.g. by computing the mean of the $\delta t_i$ over a given time window. The event clustering engine is coupled to a user interface that might be interacted with by a RESTful API [13] served by e.g. Python Flask [14] to trigger the execution of the listing lst:PseudoCodeImpl in order to return the sets of clustered and isolated timestamps. Once the clustering has been performed, the quantities $C$, $m$, and $n$ can be computed and analyzed by a cluster measure engine which itself feeds derived service quality indicators to a monitoring system such as e.g. Ganglia [15] or Nagios [16]. These might then release alerts by an appropriate messaging service such as plain email or a system such as Apache Kafka [17].
III Conclusion
We discussed and implemented a one-dimensional, one-parameter clustering method with linear complexity in input size and memory usage. It might be the preferred choice over the more general approach dbscan takes when clustering ordered timestamps. Based on the algorithm's output we suggested measures that have useful application in the domain of IoT to quantify data availability or to indicate the reliability/stability of an IoT device connecting to the network. In particular, the presented approach is part of the data availability RESTful service of IBM's big geospatial database PAIRS.
References
 [1] ITU-T, “Overview of the Internet of things,” 2012. [Online]. Available: http://handle.itu.int/11.1002/1000/11559
 [2] M. Chiang and T. Zhang, “Fog and IoT: An Overview of Research Opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, p. 854, 2016. [Online]. Available: http://www.downloadpaper.com/wpcontent/uploads/2017/01/2016ieeeFogandIoTAnOverviewofResearchOpportunities.pdf
 [3] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, p. 179, 1936. [Online]. Available: http://www.comp.tmu.ac.jp/morbier/R/Fisher1936Ann._Eugen.pdf
 [4] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, p. 129, 1982. [Online]. Available: http://www.cs.toronto.edu/~roweis/csc25152006/readings/lloyd57.pdf
 [5] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the royal statistical society. Series B (methodological), p. 1, 1977. [Online]. Available: http://www.eng.auburn.edu/~troppel/courses/7970%202015A%20AdvMobRob%20sp15/literature/paper%20W%20refs/dempster%20EM%201977.pdf
 [6] A. Hinneburg and D. A. Keim, “An efficient approach to clustering in large multimedia databases with noise,” in KDD, vol. 98, 1998, pp. 58–65. [Online]. Available: http://www.aaai.org/Papers/KDD/1998/KDD98009.pdf
 [7] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in KDD, vol. 96, 1996, p. 226. [Online]. Available: http://www.aaai.org/Papers/KDD/1996/KDD96037.pdf
 [8] I. Wegener, “The worst case complexity of McDiarmid and Reed’s variant of BOTTOM-UP HEAPSORT is less than n log n + 1.1 n,” Information and Computation, vol. 97, no. 1, p. 86, 1992. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089054019290005Z
 [9] S. v. d. Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011. [Online]. Available: https://arxiv.org/pdf/1102.1523

 [10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011. [Online]. Available: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
 [11] L. J. Klein, F. J. Marianno, C. M. Albrecht, M. Freitag, S. Lu, N. Hinds, X. Shao, S. Bermudez Rodriguez, and H. F. Hamann, “PAIRS: A scalable geospatial data analytics platform,” in Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015, pp. 1290–1298. [Online]. Available: http://researcher.watson.ibm.com/researcher/files/uskleinl/IEEE_BigData_final_klein.pdf
 [12] S. Lu, X. Shao, M. Freitag, L. J. Klein, J. Renwick, F. J. Marianno, C. Albrecht, and H. F. Hamann, “IBM PAIRS curated big data service for accelerated geospatial data analytics and discovery,” in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 2672–2675. [Online]. Available: https://static.aminer.org/pdf/fa/bigdata2016/S09208.pdf
 [13] R. T. Fielding, “Architectural Styles and the Design of Network-based Software Architectures,” Ph.D. dissertation, 2000. [Online]. Available: http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
 [14] Wikipedia, “Flask (web framework),” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)
 [15] ——, “Ganglia (software),” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Ganglia_(software)
 [16] ——, “Nagios,” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Nagios
 [17] ——, “Apache Kafka,” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Kafka

 [18] S. E. Haupt, “A demonstration of coupled receptor/dispersion modeling with a genetic algorithm,” Atmospheric Environment, vol. 39, no. 37, pp. 7181–7189, Dec. 2005. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S1352231005007685
 [19] J. D. Albertson, T. Harvey, G. Foderaro, P. Zhu, X. Zhou, S. Ferrari, M. S. Amin, M. Modrak, H. Brantley, and E. D. Thoma, “A Mobile Sensing Approach for Regional Surveillance of Fugitive Methane Emissions in Oil and Gas Production,” Environmental Science & Technology, vol. 50, no. 5, pp. 2487–2497, Mar. 2016. [Online]. Available: http://pubs.acs.org/doi/full/10.1021/acs.est.5b05059