Event Clustering & Event Series Characterization on Expected Frequency

04/05/2020 · by Conrad M. Albrecht, et al. · IBM

We present an efficient clustering algorithm applicable to one-dimensional data such as a series of timestamps. Given an expected frequency ΔT^-1, we introduce an O(N)-efficient method of characterizing N events represented by an ordered series of timestamps t_1, t_2, ..., t_N. In practice, the method proves useful, e.g., to identify time intervals of "missing" data or to locate "isolated events". Moreover, we define measures to quantify a series of events by varying ΔT in order to, e.g., determine the quality of an Internet of Things service.


I Motivation

The concept of Internet of Things (IoT) [1, 2] is intimately related to records of certain events, e.g. a network-attached device capturing weather information to be broadcasted to other devices for processing. Given that a) the transmitter periodically sends out such information every time interval ΔT, and b) the receiving device keeps track of the timestamps at which data was transmitted/recorded, the time series t_1, t_2, ... stores information on failure of recording/sending/transmission/receiving.

If we cluster the one-dimensional data such that consecutive events are not more than ΔT apart, we can infer periods in time where data might be missing. Upon detection, corresponding action such as retransmission, data interpolation, etc. can be performed. Moreover, the characteristics of intervals of no data (relative frequency, duration, ...) might help to diagnose the sanity of the communication network.

Since general-purpose, multi-dimensional clustering methods such as Fisher's discriminant [3], k-means [4] or, more generally, EM [5] do not exploit the special property of ordering in one dimension, we aim at a simpler approach that does not need knowledge of the number of clusters/intervals and avoids density estimation such as with DENCLUE [6].

Regarding cluster classification our approach is close to the conceptual notion DBSCAN [7] introduces: clusters of points and outliers/noise points. However, we exploit the fact that the sequence of timestamps is naturally ordered¹ and thus reduce the computational complexity by a factor of log N. Of course, if we had to sort first, e.g. using heapsort [8], we would be back to an asymptotic runtime of O(N log N).

¹ The number of seconds passed since some defined event (for UNIX epoch time this is Jan 1, 1970 UTC) monotonically increases; thus records of consecutive events e_1, e_2, ... have ordered timestamps t_1 ≤ t_2 ≤ ...

Our main contribution here is to adapt the concept of DBSCAN to event clustering for application in IoT service quality characterization. We present an algorithm with linear runtime complexity O(N) which asymptotically outperforms native DBSCAN operating at an overall average runtime complexity of O(N log N). Of course, while DBSCAN can be applied to any number of spatial dimensions, our approach is limited to the one-dimensional case.

II One-Dimensional Clustering

II-A Problem Formulation

Given a set of N ordered timestamps t_i, i.e.

t_1 ≤ t_2 ≤ ... ≤ t_N ,    (1)

and an expected time interval ΔT, provide time intervals [τ^-_k, τ^+_k] such that

t_{i+1} - t_i ≤ ΔT for all consecutive t_i, t_{i+1} ∈ [τ^-_k, τ^+_k] ,    (2)

i.e. each interval covers a maximal run of events whose consecutive gaps do not exceed ΔT.

Note that ΔT might be an external parameter to the algorithm that provides the solution, or it is defined by the series itself, e.g. through the mean of the time differences Δt_i. As we will discuss, the Δt_i need to be computed anyway and therefore such a ΔT is efficiently determined along the way.

II-B Central Idea

In order to fulfill eq. 2 we, of course, need to compute at least the N-1 time intervals

Δt_i = t_i - t_{i-1} ,  i = 2, 3, ..., N .    (3)

Whenever a new time series point gets (randomly) added, there is no a priori way of determining whether ΔT got exceeded from the existing Δt_i.

To classify the t_i as interval bounds we note that the binary sequence

b_i = 1 - θ(ΔT - Δt_i)    (4)

switches from 1 to 0 for an opening interval bound τ^-, and from 0 to 1 for a closing interval bound τ^+, only. θ denotes the Heaviside step function and we set θ(0) = 1, i.e. b_i = 1 exactly if Δt_i > ΔT. Hence the quantity

B_i = b_{i+1} - b_i    (5)

yields the desired association

B_i = -1 ⇒ t_i = τ^-_k ,   B_i = +1 ⇒ t_i = τ^+_k .    (6)

Per requirement, a) eq. 1, the t_i are ordered, and b) the binary (discrete) function b_i implies the alternating property

B_i = B_j = ±1 with i < j  ⇒  there exists i < k < j with B_k = ∓1 .    (7)

Thus, linearly scanning through the t_i and their corresponding B_i results in the two sets

{τ^-_k} = {t_i : B_i = -1}  and  {τ^+_k} = {t_i : B_i = +1} ,    (8)

such that we can simply interleave these to obtain the corresponding time intervals as our solution, eq. 2.
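For illustration, a minimal numpy sketch of the interior classification of eqs. 4 to 6 and 8 reads as follows (variable names are ours; boundary handling is deferred to section II-C):

import numpy as np

t = np.array([1.0, 2.0, 2.9, 10.0, 11.0, 100.0])   # ordered timestamps
dT = 1.0                                            # expected time interval

dt = np.diff(t)                  # eq. 3: gaps t_i - t_{i-1}
b = (dt > dT).astype(int)        # eq. 4: 1 where a gap exceeds dT ("failure")
B = np.diff(b)                   # eq. 5: B_i = b_{i+1} - b_i

# eq. 6/8: interior opening/closing bounds; the very first and last cluster
# bounds are still missing here, cf. section II-C
tau_minus = t[1:-1][B == -1]     # b switches 1 -> 0: opening bounds tau^-
tau_plus  = t[1:-1][B == +1]     # b switches 0 -> 1: closing bounds tau^+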

II-C Boundary Conditions

However, there are a couple of options how to exactly interleave the τ^-_k and τ^+_k which depend on the boundary conditions. More specifically, let us assume the sequence starts e.g. with intervals that are smaller than ΔT. In this case the first bound detected by eq. 6 is a closing bound, and one needs to manually add t_1 as an opening bound to construct the intervals

[t_1, τ^+_1], [τ^-_1, τ^+_2], [τ^-_2, τ^+_3], ...    (9)

A corresponding issue might happen at the end of the time series, depending on whether the number of opening bounds τ^-_k is equal or not equal² to the number of closing bounds τ^+_k.

² Note that by virtue of eq. 7 the difference is at most one. Actually, it is already obvious from the fact that only N-2 interior values B_i exist for N timestamps that one needs to manually add bounds – imagine the case where each t_i is a boundary value, but we have two of the B_i missing to classify all t_i.

algorithm cluster_events is
    input:  list t of N ordered timestamps t[0], ..., t[N-1]
            float variable dT of expected time interval (inverse frequency)
    output: list tau of cluster intervals [tau^-, tau^+]
            list x of isolated timestamps

    define lists tauMinus, tauPlus, tau, x
    define lists dt, b, B

    N ← length of t
    b[0] ← 1
    b[N] ← 1

    for each i in 1, 2, ..., N-1 do
        dt[i] ← t[i] - t[i-1]
        if dt[i] > dT then b[i] ← 1
        else b[i] ← 0

    for each i in 0, 1, ..., N-1 do
        B[i] ← b[i+1] - b[i]
        if B[i] = -1 then append t[i] to tauMinus
        else if B[i] = 1 then append t[i] to tauPlus
        else if b[i+1] = 1 then append t[i] to x

    for each i in 0, 1, ..., (length of tauMinus) - 1 do
        add interval [tauMinus[i], tauPlus[i]] to tau

    return tau, x
Listing 1: Sample implementation of our clustering procedure as pseudo-code.

In order to prevent manually dealing with all the (four) different boundary condition scenarios, we might want to (virtually) add the following timestamps from the outset:

t_0 := t_1 - 2ΔT  and  t_{N+1} := t_N + 2ΔT    (10)

(any offset larger than ΔT would do). Hence, we obtain b_1 = b_{N+1} = 1, and therefore

B_i = b_{i+1} - b_i  for all i = 1, 2, ..., N ,    (11)

which yields one B_i corresponding to each t_i for classification such that we always have

{ [τ^-_k, τ^+_k] }_{k=1,...,M}    (12)

from

{τ^-_k}_{k=1,...,M} and {τ^+_k}_{k=1,...,M}    (13)

with M = |{τ^-_k}| = |{τ^+_k}| the number of clusters. This is how the listing handles the boundaries by setting b[0] and b[N] to 1.

II-D Isolated Points

Given the solution eq. 12, due to the ordering of the τ^±_k, we can simply form the open intervals

(τ^+_k, τ^-_{k+1}) ,  k = 1, 2, ..., M-1 ,    (14)

that we associate with time intervals of failure. Note that τ^+_k < τ^-_{k+1}. However, these intervals do not imply

t_i ∉ (τ^+_k, τ^-_{k+1}) for all i ,    (15)

i.e., informally, it is not true that no event happens during the intervals (τ^+_k, τ^-_{k+1}), but we certainly have

t_i ∈ (τ^+_k, τ^-_{k+1})  ⇒  Δt_i > ΔT and Δt_{i+1} > ΔT ,    (16)

where we refer to such a t_i as an isolated event. In terms of DBSCAN these timestamps form the noise, while all timestamps within the intervals of eq. 12 are core or border points.

Isolated events have B_i = 0, and since they are not interval boundary points they need to have b_i = b_{i+1} = 1. This way we can use b_i and b_{i+1} to classify isolated events according to

X = { t_i : b_i = b_{i+1} = 1 } ,    (17)

where X denotes the set of isolated timestamps. Likewise, we can define clustered timestamps as

C = { t_i : b_i = 0 or b_{i+1} = 0 } .    (18)

Since b_i is binary and eq. 5 holds for all i = 1, ..., N, all t_i are uniquely classified, i.e. X ∩ C = ∅ and X ∪ C = {t_1, ..., t_N}. It is rather straightforward to convince oneself that there is the association in the sense that all t_i ∈ C are within an interval of eq. 12 and all t_i ∈ X within an interval of eq. 14.
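As a small illustration of eqs. 17 and 18, both sets can be read off the padded indicator b directly; a minimal sketch in numpy (our own variable names, boundary padding as in section II-C):

import numpy as np

t = np.array([1.0, 2.0, 2.9, 10.0, 11.0, 100.0])
dT = 1.0

b = np.ones(len(t) + 1, dtype=int)           # virtual "failure" gaps before/after the series
b[1:-1] = (np.diff(t) > dT).astype(int)      # eq. 4 for the interior gaps

isolated  = t[(b[:-1] == 1) & (b[1:] == 1)]  # eq. 17: both adjacent gaps exceed dT
clustered = t[(b[:-1] == 0) | (b[1:] == 0)]  # eq. 18: at least one adjacent gap within dT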

II-E Implementation & Computational Complexity

Listing 1 provides an example implementation of the method from sections II-B and II-C in pseudo-code for demonstration purposes. E.g. the call of cluster_events(t, dT) on

t=[-20,-18,1,2,2.9,10,11,100,200,202,202,203]

given dT as

-1, 0, 1, 10, 100, and mean of the elements of t

returns output equivalent to
 [], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203]
 [[202, 202]], [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 203]
 [(1, 2.9), (10, 11), (202, 203)], [-20, -18, 100, 200]
 [(-20, -18), (1, 11), (200, 203)], [100]
 [(-20, 203)], []
 [(-20, 11), (200, 203)], [100]
respectively.
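For reference, the following minimal Python re-implementation of Listing 1 (a sketch, not the authors' reference code) can be checked against the outputs listed above:

def cluster_events(t, dT):
    """Cluster ordered timestamps t; gaps larger than dT separate clusters.

    Returns (tau, x): a list of cluster intervals (t_min, t_max) and a list
    of isolated timestamps.
    """
    N = len(t)
    # per-gap failure indicator, eq. 4; the virtual gaps before the first and
    # after the last event count as failures, cf. section II-C
    b = [1] + [1 if t[i] - t[i - 1] > dT else 0 for i in range(1, N)] + [1]

    tauMinus, tauPlus, x = [], [], []
    for i in range(N):
        B = b[i + 1] - b[i]          # eq. 5
        if B == -1:                  # failure gap before, none after: opening bound
            tauMinus.append(t[i])
        elif B == 1:                 # no failure gap before, one after: closing bound
            tauPlus.append(t[i])
        elif b[i + 1] == 1:          # failure gaps on both sides: isolated event
            x.append(t[i])

    tau = list(zip(tauMinus, tauPlus))
    return tau, x


t = [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203]
print(cluster_events(t, 1))   # ([(1, 2.9), (10, 11), (202, 203)], [-20, -18, 100, 200])
print(cluster_events(t, 10))  # ([(-20, -18), (1, 11), (200, 203)], [100])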

The procedure presented in sections II-B to II-D and Listing 1 uses N-1 algebraic operations for the Δt_i, N-1 logical operations for the b_i, and again N algebraic operations for the B_i which determine the interval boundary classification, i.e. a total of O(N) operations. The final loop in Listing 1 to interleave the tauPlus and tauMinus lists is just for the user's convenience.

The naive approach would compute two time intervals for each t_i and perform two logical operations comparing those against ΔT to determine the classification, hence about 4N computations. Note that due to the given linear ordering in one-dimensional space, our algorithm's runtime is exact. In particular, it is fully deterministic when the number N of timestamps is fixed.

Moreover, the required memory for our approach is linear in N: only the event series list t of size N and the lists x, tauMinus, and tauPlus with a total size of at most N timestamps need to be stored. The lists dt, b, and B can be computed on the fly, occupying O(1) storage.

Fig. 1: Numerical speedup analysis comparing a Python implementation of our algorithm using the module numpy [9], v1.13.0, to the vanilla implementation of DBSCAN, DBSCAN(), from the package sklearn [10], v0.18.1, module cluster, with parameter settings min_samples=2 and metric='l1'. We plot the speedup s (cf. blue dots) from eq. 19 versus the number N of timestamps to cluster.
Timing measurements were performed with the standard Python module timeit on commodity hardware with sufficient RAM to prevent swapping. Experiments have been repeated 40 times to aggregate statistics for error estimation using error propagation by first order Taylor expansion.

The event series was generated by a white noise random distribution of timestamps and has been rescaled. For the experiments, dT was set to two values, 1 and 1e-4, corresponding to the parameter eps of DBSCAN(). While we observed an approximately linear speedup for small N (cf. gray, dashed fit line), an overall logarithmic speedup (cf. red, solid fit line) is plausible, which supports the analytical result O(N log N)/O(N) ~ log N.

To confirm our analytical findings we performed a numerical experiment which is presented in fig. 1. It evaluates the speedup of our algorithm compared to a vanilla implementation of DBSCAN. Within the observed error boundaries, a scaling factor of log N is plausible for large N wrt. the speedup factor

s = t_DBSCAN / t_lin    (19)

with t_DBSCAN and t_lin the individual runtimes of DBSCAN and our linear approach, respectively.
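A sketch of how such a timing comparison can be set up with the standard timeit module and sklearn's DBSCAN is given below; sample sizes, repetition count and the random seed are placeholders rather than the settings of fig. 1, and cluster_events refers to the Python sketch above.

import timeit
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dT = 1e-4                                   # also used as eps for DBSCAN

for N in (10**3, 10**4, 10**5):
    t = np.sort(rng.random(N))              # white-noise event series, already ordered

    t_lin = timeit.timeit(lambda: cluster_events(t, dT), number=10)
    t_dbscan = timeit.timeit(
        lambda: DBSCAN(eps=dT, min_samples=2, metric='l1').fit(t.reshape(-1, 1)),
        number=10)

    print(N, t_dbscan / t_lin)              # speedup s of eq. 19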

II-F Application

We observe that for a given, fixed event series with total time interval T = t_N - t_1, the quantity

c(δ) = (1/T) Σ_k (τ^+_k - τ^-_k) ,    (20)

where

ΔT = (T/N) · 10^δ ,    (21)

computes the fraction of time with no failure in operation. ΔT is fixed by the expected, logarithmic, and normalized event frequency scale δ, i.e. δ = 0 represents the scale where all N timestamps would be equally spaced within the time series interval; δ < 0 corresponds to smaller scales, δ > 0 to larger ones.

Scanning c by varying δ provides a characteristic that quantifies the reliability of e.g. an IoT service. It is rather straightforward to show that c(δ) is monotonically non-decreasing with increasing δ³.

³ The larger ΔT, the more the clusters cover the whole time series. Due to eq. 2, clusters never shrink in size for increasing ΔT; they either grow or merge into bigger clusters, letting the overall cover increase.

In the case where the time series is generated by a single, periodic data stream, we get a unit step function c(δ), i.e. c = 0 for δ < 0 and c = 1 for δ ≥ 0. Nevertheless, similar information could be obtained by simply checking a histogram of the Δt_i, cf. eq. 3, that counts the number of Δt_i in some binning interval (number density). In the case above we would observe a single peak in this histogram. Note that such a histogram contains information similar to c(δ).
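Such a gap histogram is readily computed, e.g. with numpy (the bin count below is chosen arbitrarily for illustration):

import numpy as np

t = np.array([-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203], dtype=float)
counts, edges = np.histogram(np.diff(t), bins=20)   # number density of the gaps dt_i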

Fig. 2: Sample plot of IoT service quality measures for an event series: the quantities of eqs. 20, 22 and 23 (red, blue and gray) obtained by varying δ. The series consists of a burst of random events that covers approx. 10% of the total time range T. During the rest of the time the events are periodic at a fixed rate.

However, our clustering output provides information that such a histogram is blind to, because the histogram does not account for the ordering of the Δt_i. In particular,

(22)

provides a normed measure of the number M of clusters⁴. While eq. 20 just quantifies the total coverage of T by the clusters, eq. 22 provides insight whether the coverage is established by a large number of patches or by a single or a few intervals with data frequency of at least ΔT^-1. This way we might draw conclusions on e.g. the reliability of an IoT service. Ideally we want a single cluster covering the full time interval.

⁴ The Kronecker delta δ_{ij} is 1 for i = j and 0 otherwise; it keeps the normalization of eq. 22 well defined in the limiting cases.

Last but not least, we might consider the (normalized) number of isolated events

|X|/N    (23)

as an additional indicator of reliability, since it is orthogonal to the information contained in eqs. 20 and 22. We might classify isolated events as an indicator of loose IoT service quality, and thus eq. 23 should stay close to zero, only quickly increasing towards one below some (small) δ.
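The following sketch computes the three measures over a sweep of δ by re-using the cluster_events function from above; note that the exact normalizations of eqs. 22 and 23 are not reproduced here, so the divisions by ⌊N/2⌋ and N below are our own assumptions.

import numpy as np

def quality_measures(t, deltas):
    """Coverage fraction, normalized cluster count and isolated-event fraction
    (in the spirit of eqs. 20, 22 and 23) for a sweep of the scale delta."""
    t = np.asarray(t, dtype=float)
    N = len(t)
    T = t[-1] - t[0]                          # total time interval
    coverage, clusters, isolated = [], [], []
    for delta in deltas:
        dT = (T / N) * 10.0**delta            # eq. 21: delta = 0 <=> mean spacing
        tau, x = cluster_events(t, dT)
        coverage.append(sum(hi - lo for lo, hi in tau) / T)   # eq. 20
        clusters.append(len(tau) / (N // 2))                  # cf. eq. 22 (assumed norm)
        isolated.append(len(x) / N)                           # cf. eq. 23 (assumed norm)
    return coverage, clusters, isolated

t = [-20, -18, 1, 2, 2.9, 10, 11, 100, 200, 202, 202, 203]
c, m, q = quality_measures(t, np.linspace(-3, 3, 25))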

Figure 2 illustrates these applications by plotting the measures discussed above for an event series generated from uniformly random samples joined by equi-distant samples, cf. the caption of fig. 2. We observe that around δ = 0 there is little variance in the measures, indicating that there is no single dominant event frequency at the mean-spacing scale. Moreover, there is a step in the coverage c(δ) that spans 90% of its range and refers to a dominant event frequency one order of magnitude lower than the mean event frequency N/T. Since this step accounts for most of the coverage, we conclude this frequency to be present along major time intervals within T. Also, the fraction of isolated events, eq. 23, rapidly drops there. Therefore, the existence of isolated events vanishes at time scales larger than the period of the equi-distant part of the signal, such that we have a clean signal.

In contrast, the normalized cluster count, eq. 22, stays sizable for δ < 0. Thus, due to the randomness we introduced in our sample, for high-frequency events, increasing coverage of T is achieved by a number of separate clusters (random nature of the signal!). Finally, for frequencies three orders of magnitude larger than the mean event frequency, no more clustering of events is present.

Fig. 3: Sample data and processing flow of an implementation of the IoT quality service employing the procedure of Listing 1 as well as the measures of eqs. 20, 22 and 23. The main text provides details. Note that here x_1, x_2, ... reference isolated timestamps of the set X, not to be confused with the full set of timestamps t_i indexed by i = 1, ..., N.

Figure 3 depicts a sample data flow and processing pipeline where the discussed method can be employed to rate and monitor e.g. an IoT device or the data availability of satellite imagery in the big geo-spatial data platform IBM PAIRS [11, 12]. Given that this information service is expected to send data packages at frequency ΔT^-1, an event cluster engine records and stores the timestamps t_i for further analysis. At the same time a frequency detector might dynamically adjust ΔT, e.g. by computing the mean of the Δt_i over a given time window. The event clustering engine is coupled to a user interface that might be interacted with through a RESTful API [13] served by e.g. Python Flask [14] to trigger the execution of Listing 1 in order to return the cluster intervals and the set of isolated timestamps. Once the clustering has been performed, the quantities of eqs. 20, 22 and 23 can be computed and analyzed by a cluster measure engine which itself feeds derived service quality indicators to a monitoring system such as e.g. Ganglia [15] or Nagios [16]. These might then release alerts through an appropriate messaging service such as plain e-mail or a system like Apache Kafka [17].
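A minimal, hypothetical Flask endpoint along these lines could look as follows; the route, payload format and the mean-gap default are illustrative assumptions (this is not the PAIRS API), and cluster_events is the Python sketch from above.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/cluster", methods=["POST"])
def cluster():
    payload = request.get_json()
    t = sorted(payload["timestamps"])                 # enforce eq. 1 (assumes >= 2 timestamps)
    gaps = [b - a for a, b in zip(t[:-1], t[1:])]
    dT = float(payload.get("dT", sum(gaps) / len(gaps)))  # default: mean gap ("frequency detector")
    tau, x = cluster_events(t, dT)
    return jsonify({"clusters": tau, "isolated": x})

if __name__ == "__main__":
    app.run(port=5000)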

III Conclusion

We discussed and implemented a one-dimensional, one-parameter clustering method with complexity linear in the input size for both runtime and memory usage. It might be the preferred choice over the more general approach DBSCAN takes when clustering ordered timestamps. Based on the algorithm's output we suggested measures that have useful application in the domain of IoT to quantify data availability or to indicate the reliability/stability of an IoT device connecting to the network. In particular, the presented approach is part of the data availability RESTful service of IBM's big geo-spatial database PAIRS.

The clustering method might be useful in other domains as well; applications that need to characterize peaks of measurements can benefit. One of them is the problem of geo-locating leaks through a network of detector sensors [18], such as in the field of industrial pollution detection [19].

References