The concept of the Internet of Things (IoT) [1, 2] is intimately related to records of certain events, e.g. a network-attached device capturing weather information to be broadcast to other devices for processing. Given that a) the transmitter sends out such information every time interval $\delta t$, and b) the receiving device keeps track of the timestamps $t_i$ when data was transmitted/recorded, the time series $t_1, t_2, \dots$ stores information on failures of recording/sending/transmission/receiving.
If we cluster the one-dimensional data such that consecutive events are not more than $\delta t$ apart, we can infer periods in time where data might be missing. Upon detection, corresponding action such as retransmission, data interpolation, etc. can be performed. Moreover, the characteristics of intervals of no data (relative frequency, duration, …) might help to diagnose the sanity of the communication network.
Since general-purpose clustering approaches do not exploit the special property of ordering in one dimension, we aim at a simpler approach that does not need knowledge of the number of clusters/intervals, and that avoids density estimation such as with denclue [6].
Regarding cluster classification our approach is close to the conceptual notion dbscan [7] introduces: clusters of points and outliers/noise points. However, we exploit the fact that the sequence of timestamps is naturally ordered¹ and thus reduce the computational complexity by a factor of $\log N$. Of course, if we would have to sort the timestamps first, e.g. using heapsort [8], we are back to an asymptotic runtime of $O(N \log N)$.

¹ The number of seconds passed since some defined event (for UNIX epoch time this is Jan 1, 1970 UTC) monotonically increases; thus records of consecutive events $e_1, e_2, \dots$ have ordered timestamps $t_1 \le t_2 \le \dots$
Our main contribution here is to adapt the concept of dbscan to event clustering for application in IoT service quality characterization. We present an algorithm with linear runtime complexity $O(N)$ which asymptotically outperforms native dbscan operating at an overall average runtime complexity of $O(N \log N)$. Of course, while dbscan can be applied to any number of spatial dimensions, our approach is limited to the one-dimensional case.
II One-Dimensional Clustering
II-A Problem Formulation
Given a set of $N$ ordered timestamps $t_i$, i.e.

$t_1 \le t_2 \le \dots \le t_N$ , (1)

and an expected time interval $\delta t$, provide $M$ time intervals

$I_k = [\tau^-_k, \tau^+_k]$ , $k = 1, \dots, M$ , (2)

such that consecutive timestamps within each $I_k$ differ by at most $\delta t$ and the $I_k$ are maximal in this respect. Note that $\delta t$ might be an external parameter to the algorithm that provides the solution, or it is defined by the $t_i$ themselves, e.g. through the mean of the consecutive differences $\Delta t_i$. As we will discuss, the $\Delta t_i$ need to be computed anyway, and therefore $\delta t$ is efficiently determined along the way.
II-B Central Idea
In order to fulfill eq. 2, of course, we need to compute at least the $N-1$ time intervals

$\Delta t_i = t_{i+1} - t_i$ , $i = 1, \dots, N-1$ . (3)

Whenever a new time series point $t_{N+1}$ gets (randomly) added, there is no a priori way of determining whether $\delta t$ got exceeded from the existing $\Delta t_i$.
To classify the $t_i$ as interval bounds we note that the binary sequence

$b_i = \theta(\Delta t_i - \delta t) \in \{0, 1\}$ (4)

switches from 1 to 0 for an opening interval bound $\tau^-$, and from 0 to 1 for a closing interval bound $\tau^+$, only. $\theta$ denotes the Heaviside step function with $\theta(x) = 1$ for $x > 0$ and $\theta(x) = 0$ for $x \le 0$. Hence the quantity

$B_i = b_i - b_{i-1} \in \{-1, 0, 1\}$ (5)

yields the desired association

$B_i = -1 \Leftrightarrow t_i = \tau^-$ , $B_i = +1 \Leftrightarrow t_i = \tau^+$ . (6)
Per requirement, a) eq. 1, the $t_i$ are ordered, and b) the binary (discrete) nature of $b_i$ implies the alternating property:

the non-zero values of the sequence $B_1, B_2, \dots$ alternate between $-1$ and $+1$ . (7)

Thus, linearly scanning through the $t_i$ and their corresponding $B_i$ results in the two ordered sets

$\{\tau^-_k\}$ and $\{\tau^+_k\}$ (8)

such that we can simply interleave these to obtain the corresponding time intervals as our solution, eq. 2.
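The scan can be sketched in a few lines of plain Python (the function name `scan_bounds` and all variable names are illustrative, not the paper's reference listing; it computes the quantities $\Delta t_i$, $b_i$, and $B_i$ of this section):

```python
# Sketch of the central idea of section II-B: classify interior timestamps
# as opening/closing interval bounds via the switch detector B_i.
def scan_bounds(t, dT):
    """Return (opening bounds tau-, closing bounds tau+) among t[1:-1]."""
    dt = [t[i + 1] - t[i] for i in range(len(t) - 1)]  # gaps between events
    b = [1 if gap > dT else 0 for gap in dt]           # 1 iff gap exceeds dT
    B = [b[i] - b[i - 1] for i in range(1, len(b))]    # B_i = b_i - b_{i-1}
    tau_minus = [t[i + 1] for i, Bi in enumerate(B) if Bi == -1]
    tau_plus = [t[i + 1] for i, Bi in enumerate(B) if Bi == +1]
    return tau_minus, tau_plus
```

Note that the outermost bounds $t_1$ and $t_N$ are not detected by this scan; handling them is the subject of section II-C.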
II-C Boundary Conditions
However, there are a couple of options for how to exactly interleave the $\tau^\pm_k$, and they depend on the boundary conditions. More specifically, let us assume the sequence starts, e.g., with intervals that are smaller than $\delta t$. In this case, $t_1$ is an opening bound that is not detected by eq. 6, and one needs to manually add $\tau^-_1 = t_1$ to construct the intervals of eq. 2. A corresponding issue might happen at the end of the time series, depending on whether $|\{\tau^-_k\}|$ is equal or not equal² to $|\{\tau^+_k\}|$.

² Note that by virtue of eq. 7 the difference of the set sizes is at most one. Actually, it is already obvious from the fact that there are $N$ timestamps but only $N-1$ gaps $\Delta t_i$ that one needs to manually add boundary values: imagine the case where each $t_i$ is a boundary value, but we have two of the $B_i$ missing to classify all $t_i$.
In order to prevent manually dealing with all the (four) different boundary condition scenarios, we might want to (virtually) add the following timestamps from the outset:

$t_0 = t_1 - 2\delta t$ , $t_{N+1} = t_N + 2\delta t$ . (10)

Hence, we obtain $b_0 = b_N = 1$, and therefore

$B_1 = b_1 - b_0$ , $B_N = b_N - b_{N-1}$ , (11)

which yields a $B_i$ corresponding to each $t_i$ for classification, such that we always have

$|\{\tau^-_k\}| = |\{\tau^+_k\}| = M$ with $\tau^-_k \le \tau^+_k < \tau^-_{k+1}$ . (12)
II-D Isolated Points
Given the solution eq. 12, due to the ordering of the $\tau^\pm_k$, we can simply form the open intervals

$J_k = (\tau^+_k, \tau^-_{k+1})$ , $k = 1, \dots, M-1$ , (13)

that we associate with time intervals of failure. Note that $\tau^+_k < \tau^-_{k+1}$. However, these intervals do not imply $J_k \cap \{t_i\} = \emptyset$, i.e., informally, it is not true that no event happens during the intervals $J_k$, but we certainly have

$t_{i+1} - t_i > \delta t$ for consecutive events $t_i, t_{i+1} \in J_k$ , (14)

where we refer to a $t_i \in J_k$ as an isolated event. In terms of dbscan these timestamps form the noise, while all $\tau^\pm_k$ are border points.
Isolated events have $B_i = 0$, and since they are not interval boundary points they need to have $b_{i-1} = b_i = 1$. This way we can use $b_i$ and $B_i$ to classify isolated events according to

$T_{\rm iso} = \{ t_i : B_i = 0 \wedge b_i = 1 \}$ , (15)

where $T_{\rm iso}$ denotes the set of isolated timestamps. Likewise, we can define the clustered timestamps as

$T_{\rm clust} = \{ t_i : B_i = 0 \wedge b_i = 0 \}$ . (16)

Since $b_i$ is binary and eq. 5 holds for $B_i$, all $t_i$ are uniquely classified, i.e. $T_{\rm iso} \cap T_{\rm clust} = \emptyset$. It is rather straightforward to convince oneself that there is the association in the sense that all $t \in T_{\rm clust}$ lie within an interval $[\tau^-_k, \tau^+_k]$ and all $t \in T_{\rm iso}$ within an interval $(\tau^+_k, \tau^-_{k+1})$.
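This classification can be sketched as follows, combining the virtually padded series of section II-C with the rules above (all names are illustrative; the padding constant is chosen slightly larger than in the text so that the sketch also tolerates non-positive dT):

```python
def classify(t, dT):
    """Split the ordered timestamps t into (isolated, clustered) events."""
    pad = 2 * abs(dT) + 1                 # virtual boundary gap, always > dT
    tp = [t[0] - pad] + list(t) + [t[-1] + pad]
    b = [1 if tp[i + 1] - tp[i] > dT else 0 for i in range(len(tp) - 1)]
    isolated, clustered = [], []
    for i, ti in enumerate(t):
        Bi = b[i + 1] - b[i]              # B_i for event t_i
        if Bi == 0:                       # t_i is not an interval bound
            (isolated if b[i + 1] == 1 else clustered).append(ti)
    return isolated, clustered
```

For instance, with `t = [0, 1, 2, 50, 100, 101]` and `dT = 1`, the event 50 is isolated (large gap on both sides), while 1 sits inside the cluster spanned by 0 to 2.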
II-E Implementation & Computational Complexity
For the example event series

t = [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203],

given dT as -1, 0, 1, 10, 100, and the mean of the elements of t, the listing returns output equivalent to the pairs (clusters, isolated events):

dT = -1:      [], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203]
dT = 0:       [(202, 202)], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 203]
dT = 1:       [(1, 2.9), (10, 11), (202, 203)], [-20, -18, 100, 200]
dT = 10:      [(-20, -18), (1, 11), (200, 203)], [100]
dT = 100:     [(-20, 203)], []
dT = mean(t): [(-20, 11), (200, 203)], [100]
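A compact NumPy sketch reproducing these values might read as follows; it follows sections II-B to II-D, uses the convention that a gap of exactly dT keeps events clustered, and enlarges the virtual boundary gaps slightly so that non-positive dT also works (all names are illustrative, not the paper's reference listing):

```python
import numpy as np

def cluster_1d(t, dT):
    """Linear-time 1D event clustering: returns (clusters, isolated events)."""
    t = np.asarray(t, dtype=float)
    pad = 2.0 * abs(dT) + 1.0                       # virtual boundary gap > dT
    tp = np.concatenate(([t[0] - pad], t, [t[-1] + pad]))
    dt = np.diff(tp)                                # gaps Delta t_i
    b = (dt > dT).astype(int)                       # 1 iff gap exceeds dT
    B = np.diff(b)                                  # -1: opening, +1: closing bound
    tauMinus = t[B == -1].tolist()                  # opening bounds tau^-
    tauPlus = t[B == +1].tolist()                   # closing bounds tau^+
    isolated = t[(B == 0) & (b[1:] == 1)].tolist()  # large gap on both sides
    return list(zip(tauMinus, tauPlus)), isolated
```

For example, `cluster_1d(t, 1)` with the series above yields the clusters [(1, 2.9), (10, 11), (202, 203)] and the isolated events [-20, -18, 100, 200].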
The procedure presented in sections II-B, II-C and II-D and in the listing above uses $N+1$ algebraic operations for the $\Delta t_i$, $N+1$ logical operations for the $b_i$, and again $N$ algebraic operations for the $B_i$ which determine the interval boundary classification, i.e. a total of $O(N)$ operations. The final loop in the listing to interleave the tauPlus and tauMinus lists is just for the user's convenience. The naive approach would compute two time intervals for each $t_i$ and perform two logical operations of those against $\delta t$ to determine the classification, hence $O(N)$ computations as well. Note that due to the given linear ordering in one-dimensional space, our algorithm's operation count is exact; in particular, it is fully deterministic when the number of timestamps $N$ is fixed.
Moreover, the required memory for our approach is linear in $N$: only the event series list t of size $N$ and the lists x, tauMinus, and tauPlus with a total size of at most $N$ timestamps need to be stored. The lists dt, b, and B can be computed on the fly, occupying storage $O(1)$.
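The $O(1)$ auxiliary-storage observation can be made concrete by a streaming variant that holds only the previous timestamp and a cluster flag, emitting clusters and isolated events as soon as they are complete (a sketch with illustrative names, not the paper's listing):

```python
def stream_clusters(timestamps, dT):
    """Yield ('cluster', lo, hi) or ('isolated', t) with O(1) state."""
    it = iter(timestamps)
    try:
        prev = next(it)
    except StopIteration:
        return                      # empty series: nothing to emit
    lo, in_cluster = prev, False    # tentative opening bound tau^-
    for ti in it:
        if ti - prev > dT:          # large gap closes the open segment
            yield ('cluster', lo, prev) if in_cluster else ('isolated', prev)
            lo, in_cluster = ti, False
        else:
            in_cluster = True
        prev = ti
    yield ('cluster', lo, prev) if in_cluster else ('isolated', prev)
```

This form is convenient when timestamps arrive one at a time from a network device rather than as a stored list.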
To confirm our analytical findings we performed a numerical experiment, presented in fig. 1. It evaluates the speedup of our algorithm compared to a vanilla implementation of dbscan. Within the observed error boundaries, a scaling factor of $\log N$ is plausible for large $N$ with respect to the speedup factor $s = T_{\rm dbscan} / T_{\rm lin}$, with $T_{\rm dbscan}$ and $T_{\rm lin}$ the individual runtimes of dbscan and our linear approach, respectively.
We observe that for a given, fixed event series with total time interval $T = t_N - t_1$, the quantity

$c = \frac{1}{T} \sum_{k=1}^{M} (\tau^+_k - \tau^-_k)$ (17)

computes the fraction of time with no failure in operation. $\delta t$ is fixed by the expected, logarithmic, and normalized event frequency $\nu$, i.e. $\nu = 0$ represents the scale of frequency where all timestamps are equally spaced within the time series interval; $\nu > 0$ corresponds to smaller scales, $\nu < 0$ to larger ones.

Scanning by varying $\nu$ provides a characteristic $c(\nu)$ that quantifies the reliability of e.g. an IoT service. It is rather straightforward to show that $c$ is monotonically decreasing with increasing³ $\nu$.

³ The larger $\delta t$, the more the clusters cover the whole time series. Due to eq. 2 clusters never shrink in size for increasing $\delta t$; they either grow or merge into bigger clusters, letting the overall cover increase.
In the case where the time series is generated by a single, periodic data stream of period $\delta t_0$, we get a unit step function, i.e. $c = 1$ for $\delta t \ge \delta t_0$ and $c = 0$ for $\delta t < \delta t_0$. Nevertheless, similar information could be obtained by simply checking a histogram $h$, cf. eq. 3, that counts the number of $\Delta t_i$ in some binning interval (number density). In the case above we would observe a single peak in $h$. Note that $h$ contains information similar to $c$.
However, our clustering output provides information that $h$ is blind to, because $h$ does not account for the ordering of the $\Delta t_i$. In particular, the number of clusters $M$, normed⁴ to the unit interval, provides a measure $m$ of the fragmentation of the coverage. While $c$ just quantifies the total coverage of $T$ by the clusters, $m$ provides insight whether the coverage is established by a large number of patches or by a single/a few intervals with a data frequency of at least $1/\delta t$. This way we might draw conclusions on e.g. the reliability of an IoT service. Ideally we want $m$ close to zero.

⁴ The Kronecker delta $\delta_{ij}$ is $1$ for $i = j$, $0$ else. It forces the degenerate case of no clusters at all, $M = 0$, to result in a well-defined value of $m$.
Last but not least, we might consider the fraction $n$ of isolated events as an additional indicator of reliability, since it is orthogonal to the information contained in $c$ and $m$. Isolated events indicate loose IoT service quality, and thus $n$ should stay close to zero until it quickly increases to one below some critical time scale.
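Given the clustering output, the three indicators can be computed directly; the sketch below uses plausible normalizations (the cluster count normed by its maximum $N/2$, the isolated count by $N$), which are assumptions of this sketch and not necessarily the exact definitions behind the figures:

```python
def service_measures(clusters, isolated, t):
    """Coverage c, normalized cluster count m, isolated fraction n."""
    T = t[-1] - t[0]                             # total observation span
    c = sum(hi - lo for lo, hi in clusters) / T  # fraction of failure-free time
    N = len(t)
    m = 2.0 * len(clusters) / N                  # at most N/2 clusters of >= 2 events
    n = len(isolated) / N                        # fraction of isolated (noise) events
    return c, m, n
```

Sweeping these three quantities over a range of $\delta t$ values yields the reliability characteristics discussed above.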
Figure 2 illustrates these applications by plotting $c$, $m$, and $n$ for an event series generated from uniformly random samples on one part of the total time interval, joined by equi-distant samples on the remainder. We observe that at small time scales there is little variation in $c$, indicating that there is no single dominant event frequency there. Moreover, there is a step in $c$ that covers 90% of its range, which refers to a dominant event frequency about one order of magnitude below the equi-spaced reference scale $\nu = 0$. Since $c$ jumps close to one there, we conclude this frequency to be present along major time intervals within the series. Also, $n$ rapidly drops. Therefore, the existence of isolated events vanishes at time scales larger than this dominant one, such that we have a clean signal.

In contrast, $m$ stays sizable for $\nu > 0$. Thus, due to the randomness we introduced in our sample, for high-frequency events, increasing coverage $c$ is achieved by a number of separate clusters (the random nature of the signal!). Finally, for frequencies about three orders of magnitude above the dominant one, $c$ vanishes, i.e. no more clustering of events is present.
Figure 3 depicts a sample data flow and processing pipeline where the discussed method can be employed to rate and monitor e.g. an IoT device or the data availability of satellite imagery in the big geo-spatial data platform IBM PAIRS [11, 12]. Given that this information service is expected to send data packages at frequency $1/\delta t$, an event cluster engine records and stores the timestamps $t_i$ for further analysis. At the same time a frequency detector might dynamically adjust $\delta t$, e.g. by computing the mean of the $\Delta t_i$ over a given time window. The event clustering engine is coupled to a user interface that might be interacted with by a RESTful API [13] served by e.g. Python Flask [14] to trigger the execution of the clustering in order to return the sets of clusters and isolated events. Once the clustering has been performed, the quantities $c$, $m$, and $n$ can be computed and analyzed by a cluster measure engine which itself feeds derived service quality indicators to a monitoring system such as e.g. Ganglia [15] or Nagios [16]. These might then release alerts through an appropriate messaging service such as plain e-mail or a system such as Apache Kafka [17].
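As a minimal illustration of the cluster measure engine's hand-off to the monitoring system, the derived indicators might be serialized as a JSON payload like the following (all field names and alert thresholds are hypothetical choices of this sketch; a real deployment would serve this via the RESTful API, e.g. with Flask):

```python
import json

def quality_report(c, m, n, dT):
    """Serialize the service quality indicators for a monitoring system."""
    payload = {
        'expected_interval': dT,       # the frequency detector's current delta t
        'coverage': c,                 # fraction of failure-free time
        'cluster_measure': m,          # patchiness of the coverage
        'isolated_fraction': n,        # noise level of the event series
        'alert': c < 0.9 or n > 0.1,   # hypothetical alerting thresholds
    }
    return json.dumps(payload)
```

A monitoring system such as Nagios could then poll this endpoint and raise alarms whenever the alert flag is set.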
We discussed and implemented a one-dimensional, single-parameter clustering method with linear complexity in input size and memory usage. It might be the preferred choice over the more general approach dbscan takes when clustering ordered timestamps. Based on the algorithm's output we suggested measures that have useful application in the domain of IoT to quantify data availability or to indicate the reliability/stability of an IoT device connecting to the network. In particular, the presented approach is part of the data availability RESTful service of IBM's big geo-spatial database PAIRS.
-  ITU-T, “Overview of the Internet of things,” 2012. [Online]. Available: http://handle.itu.int/11.1002/1000/11559
-  M. Chiang and T. Zhang, “Fog and IoT: An Overview of Research Opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, p. 854, 2016. [Online]. Available: http://www.download-paper.com/wp-content/uploads/2017/01/2016-ieee-Fog-and-IoT-An-Overview-of-Research-Opportunities.pdf
-  R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of eugenics, vol. 7, no. 2, p. 179, 1936. [Online]. Available: http://www.comp.tmu.ac.jp/morbier/R/Fisher-1936-Ann._Eugen.pdf
-  S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, no. 2, p. 129, 1982. [Online]. Available: http://www.cs.toronto.edu/~roweis/csc2515-2006/readings/lloyd57.pdf
-  A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the royal statistical society. Series B (methodological), p. 1, 1977. [Online]. Available: http://www.eng.auburn.edu/~troppel/courses/7970%202015A%20AdvMobRob%20sp15/literature/paper%20W%20refs/dempster%20EM%201977.pdf
-  A. Hinneburg and D. A. Keim, “An efficient approach to clustering in large multimedia databases with noise,” in KDD, vol. 98, 1998, pp. 58–65. [Online]. Available: http://www.aaai.org/Papers/KDD/1998/KDD98-009.pdf
-  M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise.” in KDD, vol. 96, 1996, p. 226. [Online]. Available: http://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
-  I. Wegener, “The worst case complexity of McDiarmid and Reed’s variant of BOTTOM-UP HEAPSORT is less than n log n+ 1.1 n,” Information and Computation, vol. 97, no. 1, p. 86, 1992. [Online]. Available: http://www.sciencedirect.com/science/article/pii/089054019290005Z
-  S. van der Walt, S. C. Colbert, and G. Varoquaux, “The NumPy array: a structure for efficient numerical computation,” Computing in Science & Engineering, vol. 13, no. 2, pp. 22–30, 2011. [Online]. Available: https://arxiv.org/pdf/1102.1523
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, and V. Dubourg, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011. [Online]. Available: http://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html
-  L. J. Klein, F. J. Marianno, C. M. Albrecht, M. Freitag, S. Lu, N. Hinds, X. Shao, S. Bermudez Rodriguez, and H. F. Hamann, “PAIRS: A scalable geo-spatial data analytics platform,” in Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 2015, pp. 1290–1298. [Online]. Available: http://researcher.watson.ibm.com/researcher/files/us-kleinl/IEEE_BigData_final_klein.pdf
-  S. Lu, X. Shao, M. Freitag, L. J. Klein, J. Renwick, F. J. Marianno, C. Albrecht, and H. F. Hamann, “IBM PAIRS curated big data service for accelerated geospatial data analytics and discovery,” in Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 2016, pp. 2672–2675. [Online]. Available: https://static.aminer.org/pdf/fa/bigdata2016/S09208.pdf
-  R. T. Fielding, “Architectural Styles and the Design of Network-based Software Architectures,” Ph.D. dissertation, University of California, Irvine, 2000. [Online]. Available: http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
-  Wikipedia, “Flask (web framework),” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Flask_(web_framework)
-  ——, “Ganglia (software),” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Ganglia_(software)
-  ——, “Nagios,” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Nagios
-  ——, “Apache Kafka,” 2017. [Online]. Available: https://en.wikipedia.org/wiki/Apache_Kafka
-  S. E. Haupt, “A demonstration of coupled receptor/dispersion modeling with a genetic algorithm,” Atmospheric Environment, vol. 39, no. 37, pp. 7181–7189, Dec. 2005. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S1352231005007685
-  J. D. Albertson, T. Harvey, G. Foderaro, P. Zhu, X. Zhou, S. Ferrari, M. S. Amin, M. Modrak, H. Brantley, and E. D. Thoma, “A Mobile Sensing Approach for Regional Surveillance of Fugitive Methane Emissions in Oil and Gas Production,” Environmental Science & Technology, vol. 50, no. 5, pp. 2487–2497, Mar. 2016. [Online]. Available: http://pubs.acs.org/doi/full/10.1021/acs.est.5b05059