Many online systems, especially those with billions of users, generate large streams of logs (Dietz and Pernul, 2018) recording user dynamics in the systems, e.g. users (un)follow other users, "(dis)like" objects, enter (or exit) live video channels, and click on objects. This raises a question:
How can we efficiently track the most popular objects (including users), i.e. the mode, the top-K popular ones, and even the frequency distribution, in a fast and large log stream at any time?
Mathematically, these questions reduce to calculating and updating the statistics of a dynamic array of finite values. The existing fast algorithms for such statistics are as follows:
Mode of an array. The mode of an array and its corresponding frequency can be calculated by sorting the array (if it is of numeric values) and scanning the sorted array in O(n log n) time, where n is the length of the array (Dobkin and Munro, 1980). Notice that by checking the frequency of the mode we can solve the element distinctness problem, which was proven to require Ω(n log n) time (Steele and Yao, 1982; Lubiw and Rácz, 1991). Therefore, calculating the mode of an array has an Ω(n log n) lower bound as well. If the elements of the array can only take finitely many values, the complexity of calculating the mode can be reduced. Suppose they can only take m values. One can use m buckets to store the frequency of each distinct element. Then the mode can be calculated in O(n + m) time by scanning the buckets.
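As a concrete illustration of the bucket approach (the function name is ours), the following sketch computes the mode of an array whose elements lie in {0, ..., m−1}:

```cpp
#include <vector>

// Mode of an array over a bounded value domain {0, ..., m-1}:
// O(n + m) time and O(m) space via m frequency buckets.
int modeOfBounded(const std::vector<int>& a, int m) {
    std::vector<int> cnt(m, 0);
    for (int x : a) ++cnt[x];          // O(n): fill the frequency buckets
    int mode = 0;
    for (int v = 1; v < m; ++v)        // O(m): scan buckets for the maximum
        if (cnt[v] > cnt[mode]) mode = v;
    return mode;
}
```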
The problem of range mode query, which calculates the mode of a sub-array A[i..j] for a given array A and a pair of indices (i, j), has also been investigated (Krizanc et al., 2005; Petersen and Grabowski, 2009; Chan et al., 2014). Arrays over finite values were considered. With a static data structure, the range mode query can be answered in O(√(n/log n)) time (Chan et al., 2014).
Majority and frequency approximation. The majority is the element whose frequency is more than n/2, i.e. more than half of the array length. An algorithm was proposed to find the majority in O(n) time and O(1) space (Boyer and Moore, 1991). Much work on statistics such as frequency counts and quantiles assumes a sliding-window setting (Arasu and Manku, 2004; Datar et al., 2002; Babcock et al., 2002; Gibbons and Tirthapura, 2002; Lin et al., 2004). These works consider only the most recently observed data elements (those within the window) and calculate statistics over them. Space-efficient algorithms were proposed to maintain such statistics over a sliding window on a stream.
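The Boyer–Moore majority vote can be sketched as follows (function name ours); the returned candidate equals the majority whenever one exists, and a second O(n) pass confirms it:

```cpp
#include <vector>

// Boyer-Moore majority vote: O(n) time, O(1) extra space.
// Returns a candidate; if a majority element exists, it is this candidate.
int majorityCandidate(const std::vector<int>& a) {
    int cand = -1, cnt = 0;
    for (int x : a) {
        if (cnt == 0) { cand = x; cnt = 1; }  // adopt a new candidate
        else if (x == cand) ++cnt;            // same element: vote up
        else --cnt;                           // different element: cancel out
    }
    return cand;
}
```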
However, these existing works are slower than necessary because they do not exploit the fact that object frequencies in log streams always increase or decrease by exactly 1 at a time. We therefore propose an algorithm, S-Profile, to keep profiling the dynamic array. With such a profile, we can answer queries on the statistics: the mode, top-K and frequency distributions.
In summary, S-Profile has the following advantages:
Optimal efficiency: S-Profile needs O(1) time per update and O(m) space to profile dynamic arrays, where m is the maximum number of objects.
Querying Statistics: With the profiling, we have sorted frequency-object pairs, and can simply answer the queries on mode, top-K, majority and other statistics in O(1) time.
Applicable: Our S-Profile can be plugged into most log streams in many systems to profile the objects of interest.
In experiments, we compare S-Profile with existing methods in various settings of dynamic arrays, demonstrating its performance and robustness.
2. An O(1)-Complexity Algorithm for Updating the Mode and Statistics
We define tuples (s_i, a_i) as a log stream, where s_i and a_i are the object id and the action of the i-th tuple. The action a_i can be either "add" or "remove", which can indicate, for example, that object s_i is "liked" or "disliked", or that user s_i is followed or unfollowed. Conceptually, we can imagine a dynamic array A of objects associated with the log stream: object s_i is appended to A if a_i is "add", and deleted from A if a_i is "remove". The dynamic array A is not necessarily generated and stored; it is defined only for convenient description of our algorithm.
Therefore, our problem can be described as follows:
Problem 1 (Profiling dynamic array).
Given: a log stream of tuples adding and removing an object each time,
To find: a fast profiling of the dynamic array A of objects at any time,
Such that: answering queries on the mode, top-K and other statistics of the objects is trivial and fast.
Let m be the maximum number of distinct objects in a log stream or dynamic array A. Without loss of generality, we assume the ids lie in {1, ..., m}, i.e. they are integers between 1 and m. Any m distinct objects can be mapped to the integers from 1 to m as ids.
We can use m buckets to store the frequency of each distinct object. Let c be such a frequency array with length m; c[s] is the frequency of the object with id s. With c, most statistics of A can be calculated without visiting A itself. For example, the mode of A corresponds to the positions in c holding the maximum value. Although updating c with each tuple of a log stream is trivial and costs O(1), finding the maximum value in c at each time step is still time consuming.
Therefore, we first introduce the proposed data structure of a profile, named the "block set", which can answer the statistical queries at trivial cost. We then show that such a profile can be maintained with O(1) time complexity per update and O(m) space complexity.
2.1. Proposed data structure for profiling
In order to find the mode of A, we just need to track the maximum in c. If integers are only added to A, the maximum element of c can be easily updated. However, Problem 1 also allows removing integers, which complicates the calculation of the mode and other statistics of A. So a sorted version of c must be maintained. To facilitate the queries, it can be implemented as a binary tree. The heap and the balanced tree are two kinds of binary trees widely used for efficiently maintaining a sorted array; both can be updated in O(log m) time upon a modification of c. The root node of a heap is the array element with the extreme value, which means a heap is only suitable for producing the element of c with either the maximum frequency (the mode) or the minimum frequency. The balanced tree is good at answering the median query on c, and can also output the mode, the top-K elements, etc. It should be pointed out that these general algorithms do not exploit the particularity of Problem 1: the modification of c is restricted to plus 1 or minus 1. Note also that no algorithm can maintain a sorted array under arbitrary modifications in time below O(log m) per update, because doing so could be used to sort, and comparison sorting has an Ω(m log m) lower bound.
We use Figure 1 as an example to illustrate the proposed data structure for maintaining c. Suppose c' holds the frequency values of c in ascending order. In order to locate the i-th element of c' in c and vice versa, two conversion arrays are defined: idx, which maps a position in c' to the corresponding position in c (the object id), and pos, its inverse. In other words, we have c'[i] = c[idx[i]] and pos[idx[i]] = i. Here, we use both the subscript and the bracket notation to specify an element of an array. As shown in Figure 1(c), we can partition c' into nonoverlapping segments according to its elements. Each such segment is called a block here and is represented by an integer triple (st, en, v), where st and en are the starting and ending indices, respectively, and v is the common element value (frequency). So a block always satisfies: c'[st] = c'[st+1] = ... = c'[en] = v, with c'[st-1] < v if st > 1 and v < c'[en+1] if en < m.
There are at most m blocks, which form a block set. The block set fully captures the information in the sorted array c'. An array of pointers, called blk, is also needed to link each element of c' to its block. According to the definition of a block, we always have: blk[i].st ≤ i ≤ blk[i].en and blk[i].v = c'[i].
Here we use "blk[i].v" to denote the member v of the block blk[i], and so on.
The block set represents the sorted frequency array c'; together with the arrays idx, pos and blk, we no longer need to store c or c'. These proposed data structures profile the dynamic array A well. What remains is to maintain them and answer the statistical queries on A in an efficient manner.
2.2. S-Profile: the O(1)-Complexity Updating Algorithm
We first consider the situation where an integer is added to A. As shown in Figures 1(a) and 1(b), a brute-force approach to updating c' is to swap the updated frequency with its right-hand neighbor one step at a time, until c' is in the appropriate order again. Instead, with the proposed blk array and the block it points to, we can directly determine the index in c' that is the destination of the updated frequency. Then we can update the relevant two blocks and the pointer arrays (see Figure 1(d)).
Now, based on the situation shown in Figure 1(d), we assume a "4" is removed from A. As shown in Figure 2, we first locate the updated element in c'.
Then, with the information in its corresponding block, we know with which element it should be swapped. We further check whether the updated frequency value already exists in c'. If it does not, we need to create a new block (the case in Figure 2(b)); otherwise another existing block is modified.
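Under our reading of the update rules above, a self-contained sketch of the profile could look as follows. The class, array and member names are ours (0-based ids and ranks), and Algorithm 1's pseudocode remains the authoritative version; orphaned blocks are simply left in an owning pool for brevity.

```cpp
#include <memory>
#include <utility>
#include <vector>

// Block-set profile for m objects with ids 0..m-1.
// All frequencies start at 0; add()/remove() each run in O(1).
struct Block { int st, en, v; };       // ranks [st, en] share frequency v

class SProfile {
    int m;
    std::vector<int> pos;              // object id -> rank in ascending order
    std::vector<int> idx;              // rank -> object id
    std::vector<Block*> blk;           // rank -> its block
    std::vector<std::unique_ptr<Block>> pool;   // owns all blocks

    Block* newBlock(int st, int en, int v) {
        pool.push_back(std::make_unique<Block>(Block{st, en, v}));
        return pool.back().get();
    }
    void swapRanks(int i, int j) {     // swap two entries of the sorted order
        std::swap(idx[i], idx[j]);
        pos[idx[i]] = i;
        pos[idx[j]] = j;
    }
public:
    explicit SProfile(int m_) : m(m_), pos(m_), idx(m_), blk(m_) {
        for (int i = 0; i < m; ++i) { pos[i] = i; idx[i] = i; }
        Block* b = newBlock(0, m - 1, 0);   // one block: all frequencies 0
        for (int i = 0; i < m; ++i) blk[i] = b;
    }
    int freq(int s) const { return blk[pos[s]]->v; }
    int modeFreq() const { return blk[m - 1]->v; }   // largest frequency
    int modeObj()  const { return idx[m - 1]; }      // an object holding it

    void add(int s) {                  // frequency of s: v -> v + 1
        Block* b = blk[pos[s]];
        int e = b->en;
        swapRanks(pos[s], e);          // move s to the right end of its block
        if (b->st == e) {              // s was the only member of its block
            if (e + 1 < m && blk[e + 1]->v == b->v + 1) {
                blk[e] = blk[e + 1];   // join the block above
                blk[e]->st = e;
            } else {
                b->v += 1;             // reuse the block in place
            }
        } else {
            b->en = e - 1;             // shrink the old block
            if (e + 1 < m && blk[e + 1]->v == b->v + 1) {
                blk[e] = blk[e + 1];   // join the block above
                blk[e]->st = e;
            } else {
                blk[e] = newBlock(e, e, b->v + 1);
            }
        }
    }
    void remove(int s) {               // frequency of s: v -> v - 1
        Block* b = blk[pos[s]];
        int st = b->st;
        swapRanks(pos[s], st);         // move s to the left end of its block
        if (b->en == st) {             // s was the only member of its block
            if (st - 1 >= 0 && blk[st - 1]->v == b->v - 1) {
                blk[st] = blk[st - 1]; // join the block below
                blk[st]->en = st;
            } else {
                b->v -= 1;
            }
        } else {
            b->st = st + 1;            // shrink the old block
            if (st - 1 >= 0 && blk[st - 1]->v == b->v - 1) {
                blk[st] = blk[st - 1]; // join the block below
                blk[st]->en = st;
            } else {
                blk[st] = newBlock(st, st, b->v - 1);
            }
        }
    }
};
```

Each call touches one swap and at most two blocks, which is where the O(1) bound comes from.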
The full details of the algorithm for updating the data structure and returning the mode of A are described in Algorithm 1. We assume the data structures (the conversion arrays, the pointer array and the block set) have been initialized; Algorithm 1 then responds to an event in the log stream and returns the updated mode and its frequency.
As the proposed data structure maintains the sorted frequency array, it can also be utilized to find the object with the minimum frequency (possibly a negative number). We just need to replace Steps 29 and 30 of Algorithm 1 with the following steps.
We can observe that the time complexity of the S-Profile algorithm is O(1), as there is no iteration at all. The space complexity is O(m), where m is the maximum number of objects in the log stream. More precisely, it needs a few length-m integer arrays for the conversion and pointer arrays, plus additional storage for the block set. In the worst case the block set includes m blocks, but usually the number is much smaller than m.
Other queries on the statistics of the objects can also be answered. For example, the top-K element is the one whose frequency is the K-th largest. We can just use position m−K+1 of the sorted order to locate the block. Then the frequency and the object id can be obtained from the block's member v and the idx array. In particular, the median frequency can be located with the m/2-th element of the sorted order.
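If the ascending frequencies are materialized in a plain array (which the block set represents implicitly), such queries are direct index lookups; the function names below are ours:

```cpp
#include <vector>

// With frequencies kept in ascending order, order statistics are O(1)
// index lookups. K is 1-based: K = 1 is the largest frequency.
int kthLargestFreq(const std::vector<int>& sortedFreq, int K) {
    return sortedFreq[sortedFreq.size() - K];
}
int medianFreq(const std::vector<int>& sortedFreq) {
    return sortedFreq[sortedFreq.size() / 2];   // upper median
}
```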
2.3. Possible Applications
For some mission-critical tasks (e.g., fraud detection) in big graphs, the efficiency of making decisions and inferring interesting patterns is crucial. As a result, recent years have witnessed increasing interest in heuristic "shaving" algorithms with low computational complexity (Hooi et al., 2016; Shin et al., 2017). A critical step of these algorithms is to repeatedly find low-degree nodes as nodes are shaved from the graph. Thus, S-Profile can be plugged into such algorithms for further speedup, by treating a node as an object and its degree as its frequency.
Furthermore, S-Profile can also handle a sliding window on a log stream, by letting every tuple (s, a) that falls out of the window be a new incoming tuple (s, ā), where ā is the opposite action of a.
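This window trick can be sketched as follows; `apply` stands in for any point-update profile such as S-Profile, and the wrapper and its names are our illustration:

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>

// A width-W sliding window over (id, action) tuples: each tuple leaving the
// window is replayed into the profile with the opposite action, so a
// point-update profile automatically reflects only the window's contents.
void slideWindow(std::deque<std::pair<int, std::string>>& win, std::size_t W,
                 int id, const std::string& act,
                 const std::function<void(int, const std::string&)>& apply) {
    apply(id, act);                        // feed the new tuple
    win.emplace_back(id, act);
    if (win.size() > W) {                  // oldest tuple falls out
        auto old = win.front();
        win.pop_front();
        apply(old.first, old.second == "add" ? "remove" : "add");
    }
}
```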
3. Experimental Results
We have implemented the proposed S-Profile algorithm and its counterparts in C++, and tested them with randomly generated log streams. The streams are produced as follows. We first randomly generate an "add" or "remove" action, with 70% and 30% probability, respectively. For each "add" action we randomly choose an object id according to a probability distribution (called posPDF); for each "remove" action another distribution (called negPDF) is used to randomly choose an object id. With this procedure, we obtain three test log streams:
Stream1: both posPDF and negPDF are the uniform distribution on [1, m].
Stream2: both posPDF and negPDF are normal distributions (with different parameters).
Stream3: posPDF is a normal distribution, while negPDF is a lognormal distribution.
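The generation procedure above can be sketched for Stream1 (uniform posPDF and negPDF); the generator name and seed handling are ours:

```cpp
#include <random>
#include <utility>
#include <vector>

// Stream generator sketch: 70% "add" / 30% "remove" (true = add),
// with ids drawn uniformly from [1, m] as in Stream1.
std::vector<std::pair<int, bool>> genStream(int n, int m, unsigned seed) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution isAdd(0.7);     // action: add vs. remove
    std::uniform_int_distribution<int> id(1, m); // posPDF = negPDF = uniform
    std::vector<std::pair<int, bool>> s;
    s.reserve(n);
    for (int i = 0; i < n; ++i) {
        bool add = isAdd(rng);
        s.emplace_back(id(rng), add);
    }
    return s;
}
```

Stream2 and Stream3 only swap in `std::normal_distribution` / `std::lognormal_distribution` (rounded and clamped to [1, m]) for the id draws.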
In the following subsections, we first compare the proposed S-Profile with the heap-based approach for updating the mode and its frequency. Then a comparison with the balanced tree is presented for calculating the median. All experiments are carried out on a Linux machine with Intel Xeon E5-2630 CPUs (2.30 GHz). The CPU times (in seconds) of the different algorithms are reported.
3.1. Comparison with the Heap
A heap is a kind of binary tree where the value of a parent node must be larger than or equal to the values of its children (in a max-heap). Using it to maintain the sorted frequency array, it is easy to obtain the mode (the root has the largest frequency). Note that the balanced tree is inferior to the heap for calculating the mode. In Figure 3,
we show the CPU time consumed for updating the mode with the heap-based method and with our S-Profile. The x-axis is the number of processed tuples (n). From the results we see that our method is at least 2.2X faster than the heap-based method. Another experiment is carried out fixing n while varying m. The results shown in Figure 4 also reveal that our S-Profile is at least 2X faster.
For different kinds of log streams, the performance of the heap-based method varies a lot. In the worst case, updating the heap needs O(log m) time, although this rarely happens in our tested streams. In contrast, S-Profile needs only O(1) time for updating the data structure. This advantage is verified by the rather flat trend shown in Figure 5.
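For reference, the heap baseline can be approximated with a lazy max-heap (our simplified stand-in, not necessarily the exact implementation compared above): updates push a fresh entry, and stale entries are discarded only when the mode is queried.

```cpp
#include <queue>
#include <utility>
#include <vector>

// Lazy max-heap of (frequency, id) pairs: update() pushes a fresh entry
// in O(log size), mode() pops entries whose frequency is out of date.
struct HeapMode {
    std::priority_queue<std::pair<int, int>> pq;   // (freq, id), max on top
    std::vector<int> freq;
    explicit HeapMode(int m) : freq(m, 0) {
        for (int s = 0; s < m; ++s) pq.push({0, s});
    }
    void update(int s, int delta) {    // delta: +1 ("add") or -1 ("remove")
        freq[s] += delta;
        pq.push({freq[s], s});         // stale copies stay until queried
    }
    std::pair<int, int> mode() {       // (mode frequency, an object with it)
        while (pq.top().first != freq[pq.top().second]) pq.pop();
        return pq.top();
    }
};
```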
It should be emphasized that, in addition to the speedup over the heap-based method, our S-Profile has the advantage of wider applicability. Our method is not restricted to calculating the mode and its frequency: since it fully profiles the sorted frequency array, answering queries on top-K and other statistics of the objects is trivial and fast.
3.2. Comparison with the Balanced Tree
The proposed S-Profile can also calculate the median of the dynamic array. We compare it with the balanced-tree-based method implemented in the GNU C++ PBDS (Tavory et al., [n. d.]), which is more efficient than our own implementation of a balanced tree. The trends of CPU time are shown in Figure 6. They show that the runtime of the proposed S-Profile increases much more slowly than that of the balanced-tree-based method as n increases. We can observe that the time of S-Profile depends linearly on n, the number of modifications of array A, and hardly varies with different m. In contrast, the balanced-tree-based method exhibits superlinear increase with both n and m. Overall, the test results show that S-Profile is 13X to 452X faster than the balanced-tree-based method at updating the median of the dynamic array.
We propose an accurate algorithm, S-Profile, to quickly and continuously profile dynamic arrays from online systems. It has the following advantages:
Optimal efficiency: S-Profile needs O(1) time for every update of a dynamic array, and linear O(m) memory in total.
Querying Statistics: With the profiling, we can answer statistical queries in a trivial and fast way.
Applicable: S-Profile can be plugged into most log streams, as well as into heuristic graph mining algorithms.
- Arasu and Manku (2004) A. Arasu and G. S. Manku. 2004. Approximate counts and quantiles over sliding windows. In Proceedings of the 23rd ACM Symposium on Principles of Database Systems (PODS). 286–296.
- Babcock et al. (2002) B. Babcock, M. Datar, and R. Motwani. 2002. Sampling from a moving window over streaming data. In Proceedings of the 13th annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 633–634.
- Boyer and Moore (1991) R. S. Boyer and J. S. Moore. 1991. MJRTY—A fast majority vote algorithm. In Automated Reasoning. Springer, 105–117.
- Chan et al. (2014) T. M. Chan, S. Durocher, K. G. Larsen, J. Morrison, and B. T. Wilkinson. 2014. Linear-space data structures for range mode query in arrays. Theory of Computing Systems 55, 4 (2014), 719–741.
- Datar et al. (2002) M. Datar, A. Gionis, P. Indyk, and R. Motwani. 2002. Maintaining stream statistics over sliding windows. SIAM J. Comput. 31, 6 (2002), 1794–1813.
- Dietz and Pernul (2018) M. Dietz and G. Pernul. 2018. Big log data stream processing: Adapting an anomaly detection technique. In International Conference on Database and Expert Systems Applications. 159–166.
- Dobkin and Munro (1980) D. Dobkin and J. I. Munro. 1980. Determining the mode. Theoretical Computer Science 12, 3 (1980), 255–263.
- Gibbons and Tirthapura (2002) P. B. Gibbons and S. Tirthapura. 2002. Distributed streams algorithms for sliding windows. In Proceedings of the 14th annual ACM Symposium on Parallel Algorithms and Architectures (SPAA). 63–72.
- Hooi et al. (2016) B. Hooi, H. A. Song, A. Beutel, N. Shah, K. Shin, and C. Faloutsos. 2016. Fraudar: Bounding graph fraud in the face of camouflage. In Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD). 895–904.
- Krizanc et al. (2005) D. Krizanc, P. Morin, and M. Smid. 2005. Range mode and range median queries on lists and trees. Nordic Journal of Computing 12, 1 (2005), 1–17.
- Lin et al. (2004) X. Lin, H. Lu, J. Xu, and J. X. Yu. 2004. Continuously maintaining quantile summaries of the most recent n elements over a data stream. In Proceedings of the 20th International Conference on Data Engineering (ICDE). 362.
- Lubiw and Rácz (1991) A. Lubiw and A. Rácz. 1991. A lower bound for the integer element distinctness problem. Information and Computation 94, 1 (1991), 83–92.
- Petersen and Grabowski (2009) H. Petersen and S. Grabowski. 2009. Range mode and range median queries in constant time and sub-quadratic space. Inform. Process. Lett. 109, 4 (2009), 225–228.
- Shin et al. (2017) K. Shin, B. Hooi, J. Kim, and C. Faloutsos. 2017. DenseAlert: Incremental dense-subtensor detection in tensor streams. In Proceedings of International Conference on Knowledge Discovery and Data Mining (KDD). 1057–1066.
- Steele and Yao (1982) J. M. Steele and A. C. Yao. 1982. Lower bounds for algebraic decision trees. Journal of Algorithms 3, 1 (1982), 1–8.
- Tavory et al. ([n. d.]) A. Tavory, V. Dreizin, and B. Kosnik. [n. d.]. Policy-Based Data Structures. https://gcc.gnu.org/onlinedocs/libstdc++/ext/pb_ds/