Attention to time series analytics is bound to increase in the IoT era, as cheap sensors can now deliver vast volumes of many types of measurements. The size of the data is also bound to increase. For example, an IoT-ready oil drilling rig produces about TB of operational data in one day (https://wasabi.com/storage-solutions/internet-of-things/). One way to cope with this growth is to spend more on computing and storage in order to catch up. However, in many domains the data size is expected to grow faster than computing capacity, which makes this approach unattractive (Galakatos et al., 2017; Chaudhuri et al., 2017). Another solution is approximate analytics over compressed time series.
|function family||error guarantees on aligned time series||AI||Tight||error guarantees on misaligned time series||AI||Tight|
Approximate analytics enables fast computation over historical time series data. For example, consider the database in Figure 1, which has a Temperature table and a Pressure table. Each table contains (i) one Timeseries column containing time series data, as a UDT (Eisenberg and Melton, 2002), and (ii) several other “dimension” attributes, such as geographic locations and other properties of the sensors that delivered the time series. The Plato SQL query in Figure 1(c) “returns the top-10 temperature/pressure 5-second cross-correlation scores among all the (temperature, pressure) pairs satisfying a (not detailed in the example) condition over the dimension attributes”. Notice that the first argument of the TSA UDF is a time series analytic expression (in red italics). We could simply write ‘CCorr(t.timeseries, p.timeseries, 5)’, as there is a built-in cross-correlation expression CCorr. Instead, the example writes the equivalent expression using more basic functions (such as the average, the standard deviation and the time shifting) to exhibit the ability of Plato to process expressions that are compositions of well-known arithmetic operators, vector operators, aggregation and time shifting. Either way, computing the accurate cross-correlations would cost more than minutes. However, Plato reduces the running time to within one second by computing the approximate correlations. It also delivers deterministic error guarantees. (In SQL, the result is a string concatenation of the approximate answer and the error guarantee. The functions approximateAnswer and guarantee extract the respective pieces.)
The success of approximate querying on IoT time series data is based on an important beneficial property of time series data: the points in the sequence of values normally depend on the previous points and exhibit continuity. For example, a temperature sensor is very unlikely to report a 100-degree increase within a second. Therefore, in the signal processing and data mining communities (Keogh et al., 2001a; Keogh, 1997; Faloutsos et al., 1994; Chan and Fu, 1999), time series data is usually modeled and compressed by continuous functions in order to reduce its size. For instance, the Piecewise Aggregate Approximation (PAA) (Keogh et al., 2001a) and the Piecewise Linear Representation (PLR) (Keogh, 1997) adopt polynomial functions (0-degree in PAA and 1-degree in PLR) to compress the time series; (Pan et al., 2017) uses Gaussian functions; (Tobita, 2016) applies natural logarithmic functions and natural exponential functions to compress time series. Plato is open to any existing time series compression technique. Notice that there is no one-size-fits-all function family that can best model all kinds of time series data. For example, polynomials and ARMA models are better at modeling data from physical processes such as temperature (Mei and Moura, 2017; Choi, 2012), while Gaussian functions are better for modeling relatively randomized data (Kim, 2003) such as stock prices. How to choose the best function family has been widely studied in prior work (Philo, 1997; Wiscombe and Evans, 1977; Denison et al., 1998; Kovács et al., 2002), and recent efforts even attempt to automate the process (Kumar et al., 2015). We assume that Plato users make a proper selection of how to model/compress the time series data, and we do not discuss this issue further.
Architecture. Figure 2 shows the high-level architecture. At insertion time, the provided time series is compressed. In particular, a compression function family (e.g., 2nd-degree polynomials) is chosen by the user. Internally, in a simple version, each time series is first segmented (partitioned) into equal-length segments. Then, for each segment, the system finds the best estimation function, which is the member of the function family that best approximates the values in this segment. The most common definition of “best” is the minimization of the reconstruction error, i.e., the minimization of the Euclidean distance between the original and the estimated values. This is also the definition that Plato assumes. The compressed database stores the parameters of the estimation function for each segment, which take much less space than the original time series data. In the more sophisticated version, segmentation and estimation are interleaved (Koski et al., 1995; Keogh et al., 2001b) to achieve better compression. The result is that the time series is partitioned into variable-length segments.
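The simple, equal-length variant of this pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not Plato's implementation: it assumes the 1-degree polynomial family (as in PLR), uses the closed-form least-squares line as the "best" estimation function, and the names `fit_line` and `compress_equal_length` are hypothetical.

```python
from typing import List, Tuple

def fit_line(values: List[float], start: int) -> Tuple[float, float]:
    """Least-squares line slope*t + intercept over positions start, start+1, ..."""
    n = len(values)
    ts = [start + i for i in range(n)]
    t_mean = sum(ts) / n
    v_mean = sum(values) / n
    var_t = sum((t - t_mean) ** 2 for t in ts)
    if var_t == 0:  # single-point segment: fall back to a constant
        return 0.0, v_mean
    slope = sum((t - t_mean) * (v - v_mean) for t, v in zip(ts, values)) / var_t
    return slope, v_mean - slope * t_mean

def compress_equal_length(series: List[float], seg_len: int, start: int = 1):
    """Partition into equal-length segments and keep only the line parameters.

    Returns a list of (seg_start, seg_end, slope, intercept) tuples.
    """
    segments = []
    for off in range(0, len(series), seg_len):
        chunk = series[off:off + seg_len]
        seg_start = start + off
        slope, intercept = fit_line(chunk, seg_start)
        segments.append((seg_start, seg_start + len(chunk) - 1, slope, intercept))
    return segments

series = [1.0, 2.0, 3.0, 4.0, 10.0, 8.0, 6.0, 4.0]
compressed = compress_equal_length(series, seg_len=4)
```

Each segment is reduced to two coefficients plus its boundaries, which is where the space saving comes from; the variable-length methods cited above would replace the fixed `seg_len` loop.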
Consequently, given a query with TSA UDF calls, the database quickly computes an approximate answer for each TSA call by using the compressed data. (We focus on aggregation queries whose results are single scalar values, so the approximate answers are also scalar values.) Note that a TSA may combine multiple time series, e.g., in a correlation or a cross-correlation.
Example.
Consider a room temperature time series and an air pressure time series in Figure 1 and consider the TSA(‘Ccorr(, )’, , ) where ‘Ccorr()’ refers to the 60-second cross-correlation of and (see definition in Table 4). Both and have data points at 1-second resolution and are segmented by variable length segmentation methods and compressed by PLR (1-degree polynomial functions). The precise answer is . But instead of accessing the () original data points, Plato produces the approximate answer (error is ) by accessing just the function parameters (), () for and (), () for in the compressed database. (For reasons of computational efficiency, as explained in Section 4.2.2, Plato does not actually store the parameters (), () and (), () in their standard basis; rather, it stores coefficients in an orthonormal basis.)
The well-known downside of approximate querying is that errors are introduced. When the example’s user receives the approximate answer, she cannot tell how far this answer is from the true answer, i.e., the precise answer. The novelty of Plato is the provision of tight (i.e., lower bound) deterministic error guarantees for the answers, even when the time series expressions combine multiple series. In Example 1, Plato guarantees that the true answer is within of the approximate answer with confidence. (Indeed, is within of .) It produces these guarantees by utilizing error measures associated with each segment.
Scope of Queries and Error Guarantees. Plato supports the time series analytic expressions formally defined in Table 3 (Section 2). They are composed of vector operators (, , , Shift), arithmetic operators, the aggregation operator Sum that turns its input vector into a scalar, and the Constant operator that turns its input scalar into a vector. As such, Plato queries can express not only statistics that involve one time series (e.g., average, variance, and n-th moment) but also statistics that involve multiple time series, such as correlation and cross-correlation.
The error guarantee framework is also general. It allows efficient error guarantee computation for all possible estimation function families, as long as the error measures of Table 2 are computed in advance. (We will show that in certain cases one or two measures suffice.) Figure 1 shows the error measures (in blue) for each segment of the example. With the help of the error measures, no matter whether a time series is compressed by trigonometric functions or polynomial functions or some other family, Plato is able to give tight deterministic error guarantees for queries involving the compressed time series.
|-norm of the estimation errors|
|-norm of the estimated values|
|Absolute reconstruction error|
Function Family Groups Producing Practical Error Guarantees. Plato produces tight error guarantees for any function family that may have been used in the compression. In addition, our theoretical and experimental analysis identifies which families lead to high quality guarantees.
The formulas of Table 1 provide error guarantees for characteristic, simple expressions and exhibit the difference in guarantee quality. Any other expression, e.g., one of the statistics of Table 4, is also given an error guarantee by composing the error measures and guarantees of its subexpressions (as shown in the paper), and the same quality characterizations apply to it inductively.
This is how to interpret the results of Table 1: Three function family groups have been identified: (1) The Linear Scalable Family group (LSF), (2) the Vector Space (VS), which includes the LSF and (3) ANY, which, according to its name, includes everything. Given the function family used in the compression, we first categorize in one of LSF or VS/LSF (i.e., VS excluding LSF) or ANY/VS. For example, if is the 2-degree polynomials, then belongs to LSF. See Figure 4 for other examples. Next, we consider whether the segments of the involved compressed time series are aligned or misaligned and finally we look at the error guarantee formula for the expression.
The specifics of interpreting the table’s results and the specifics of their efficient computation require the detailed discussion of the paper. (E.g., the summation index corresponds to the optimal segment combination; see Section 4.2.1.) Nevertheless, a clear and general high-level lesson about the practicality of the error guarantees emerges from the table’s summary: some function families allow for much higher quality error guarantees than other function families. The typical characteristic of “higher quality” is Amplitude Independence (AI). If an error guarantee is AI, then it is not influenced by the measure, i.e., it is not affected by the amplitude of the values of the estimation functions and, thus, it is not affected by the amplitude of the original data. An AI error guarantee is only affected by the reconstruction errors caused by the estimation functions, which intuitively implies that AI error guarantees are close to the actual error.
These guarantees are tight in the following sense. Given (a) the function family categorization into LSF, VS/LSF or ANY/VS and (b) segments with the error measures of Table 2, the formula provided by Table 1 produces an error guarantee that is as small as possible. That is, for this superfamily and for the given error measures, any attempt to create a better (i.e., smaller) error guarantee will fail because there are provably time series and at least one time series analytics expression where the true error is exactly as large as the error guarantee.
The experimental results, where we tried data sets with different characteristics and different compression methods, verified the above intuition: AI error guarantees were order(s) of magnitude smaller than their amplitude dependent counterparts. Indeed, AI ones over variable-length compressions were invariably small enough to be practically meaningful, while non-AI guarantees were too large to be practically useful.
Particularly interesting are the analytics that combine multiple vectors, such as correlation and cross-correlation, by vector multiplication. Then the amplitude independence of the error guarantees does not apply generally. Rather the dichotomy illustrated in Figure 4 emerges: (i) for compressions with aligned time series segments, the error guarantee is AI when the used function family forms a Vector Space (VS) in the conventional sense (Halmos, 2012); and (ii) for compressions with misaligned time series segments, which are the more common case, choosing a VS family is not enough for AI guarantees. The family must be a Linear Scalable Family (LSF), which is a property that we define in this paper (Section 3.1).
The contributions are summarized as follows.
We deliver tight deterministic error guarantees for a wide class of analytics over compressed time series. The key challenge is analytics (e.g., correlation and cross-correlation) that combine multiple time series when it is not known in advance which time series may be combined. Thus, each time series has been compressed individually, long before a query arrives. The reconstruction errors of the individual time series’ compressions cannot provide, by themselves, decent guarantees for queries that multiply time series. To make the problem harder, time series segmentations are generally misaligned. (Misalignment happens because the most effective compressions use variable-length segmentations. But even if the segmentations were fixed length, queries such as cross-correlation and cross-autocorrelation time shift one of their time series, thus producing misalignment with the second time series.)
The provided guarantees apply regardless of the specifics of the segmentation and estimation function family used during the compression, thus making the provided deterministic error guarantees applicable to any prior work on segment-based compression (e.g., variable-sized histograms). The only requirement is the common assumption that the estimation function minimizes the Euclidean distance between the actual values and the estimates.
We identify broad estimation function family groups (namely, the already defined Vector Space family and the presently defined Linear Scalable Family) that lead to theoretically and practically high quality guarantees. The theoretical aspect of high quality is crisply captured by the Amplitude Independence (AI) property. Furthermore, the error guarantees are computed very efficiently, in time proportional to the number of segments.
The results broadly apply to analytics involving composition of the typical operators, which is powerful enough to express common statistics, such as variance, correlation, cross-correlation and others, in any time range.
We conduct an extensive empirical evaluation on four real-life datasets to evaluate the error guarantees provided by Plato and the importance of the VS and LSF properties on error estimation. The results show that the AI error guarantees are very narrow and thus practical. Furthermore, we compare to sampling-based approximation and show experimentally that Plato delivers deterministic (100% confidence) error guarantees using fewer data than it takes to produce probabilistic error guarantees with 95% and 99% confidence via sampling.
2. Time Series and Expressions
|Time Series Analytic (TSA)|
|Arithmetic Expression (Ar)|
|Ar||literal value in|
|Aggregation Expression (Agg)|
|Time Series Expression (TSE)|
|input time series|
Time Series. A time series , , , is a sequence of data points observed from start time to end time (). Following the assumptions in (Morse and Patel, 2007; Chen and Ng, 2004; Vlachos et al., 2002) we assume that time is discrete and the resolution of any two time series is the same. Equivalently, we say is fully defined in the integer time domain . We assume a domain is the global domain, meaning that all the time series are defined within subsets of this domain. When the domain of a time series is implied by the context, then can be simplified as .
Example.
Assume the global domain is . Consider two time series and . Then and are fully defined in domains [1,5] and [3,6] respectively. refers to the data point of at the -th position in the global domain.
|TSA Expression||Definition||Equivalent TSA Expression||Usage of error measures|
|Standard Deviation ‘()’|
|Correlation ‘Corr(,)’||,,, ,,|
|Cross-correlation ‘CCorr(,,)’||,,, ,,|
Time Series Analytic (TSA) Expressions. Table 3 shows the formal definition of the time series analytic (called TSA). The TSAs supported are expressions composed of linear algebra operators and arithmetic operators. Typically, the TSA has subexpressions that compose one or more linear algebra operators over multiple time series vectors as defined below.
Given a numeric value and two integers and , Constant. For example, Constant() produces .
Given a time series and an integer value , Shift()=. Notice Shift() = for all . Figure 5(a) visualizes the Shift operator. Consider the time series , then Shift() is .
Given two time series and , where and . (Setting and ensures all the data points in are defined.) For example, given and then . Similarly, we define and .
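The operators above can be sketched in a few lines of Python. The exact definitions did not survive in this copy of the text, so the sketch rests on labeled assumptions: `shift(ts, k)` is assumed to move the data point at position i to position i + k, and a binary operator is assumed to be defined on the intersection of the two input domains (so that every output point is defined, as the note on domain bounds suggests). The class and function names are hypothetical.

```python
from dataclasses import dataclass
from operator import add, mul
from typing import Callable, List

@dataclass
class TimeSeries:
    start: int            # first position in the global domain
    values: List[float]   # values at positions start, start+1, ...

    @property
    def end(self) -> int:
        return self.start + len(self.values) - 1

    def at(self, i: int) -> float:
        return self.values[i - self.start]

def constant(v: float, a: int, b: int) -> TimeSeries:
    """Vector holding the scalar v at every position of [a, b]."""
    return TimeSeries(a, [v] * (b - a + 1))

def shift(ts: TimeSeries, k: int) -> TimeSeries:
    """ASSUMED convention: the point at position i moves to position i + k."""
    return TimeSeries(ts.start + k, list(ts.values))

def combine(t1: TimeSeries, t2: TimeSeries,
            op: Callable[[float, float], float]) -> TimeSeries:
    """Element-wise operator on the intersection of the two domains."""
    a, b = max(t1.start, t2.start), min(t1.end, t2.end)
    if a > b:
        raise ValueError("domains do not overlap")
    return TimeSeries(a, [op(t1.at(i), t2.at(i)) for i in range(a, b + 1)])

t1 = TimeSeries(1, [1.0, 2.0, 3.0, 4.0, 5.0])   # domain [1, 5]
t2 = TimeSeries(3, [10.0, 20.0, 30.0, 40.0])    # domain [3, 6]
product = combine(t1, t2, mul)                  # defined on [3, 5]
shifted = shift(t1, 2)                          # domain [3, 7]
total = combine(t1, t2, add)
ones = constant(1.0, 1, 3)
```

Note how shifting one operand changes the overlap with the other, which is one way segment misalignment arises even under fixed-length segmentation.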
A time series analytic (TSA) is an arithmetic expression of the form , where are the standard arithmetic operators () and is either an arithmetic literal or an aggregation over a time series expression. An aggregation expression Sum computes the summation of the data points of in the domain , i.e., Sum= where can be an input time series or a derived time series computed by time series expressions (TSEs). (Note that, when the time series expressions involve time shifting, we assume that the aggregation only operates on the valid data points, that is, the data points in the defined range.) When the bounds of and are implied from the context, we simplify to .
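To illustrate how a statistic reduces to arithmetic over Sum aggregations, the sketch below computes the correlation of two equally long vectors using only sums of element-wise products and scalar arithmetic. It uses the textbook identities for covariance and variance; the paper's own equivalent expressions (Table 4) are not recoverable from this copy, so this decomposition is an assumption and may differ from the authors' rewriting. The function names are hypothetical.

```python
import math
from typing import List

def agg_sum(xs: List[float]) -> float:
    """Stand-in for the Sum aggregation over a (derived) time series."""
    return sum(xs)

def times(xs: List[float], ys: List[float]) -> List[float]:
    """Stand-in for the element-wise vector product."""
    return [x * y for x, y in zip(xs, ys)]

def corr_via_sums(x: List[float], y: List[float]) -> float:
    n = agg_sum([1.0] * len(x))          # Sum over a Constant(1) vector
    sx, sy = agg_sum(x), agg_sum(y)
    sxy = agg_sum(times(x, y))
    sxx, syy = agg_sum(times(x, x)), agg_sum(times(y, y))
    cov = sxy / n - (sx / n) * (sy / n)
    var_x = sxx / n - (sx / n) ** 2
    var_y = syy / n - (sy / n) ** 2
    return cov / math.sqrt(var_x * var_y)

corr_up = corr_via_sums([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
corr_down = corr_via_sums([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])
```

Every aggregation here is a Sum over an input series or a product of series, which is why bounding the error of Sum over products is the central problem for multi-series statistics.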
3. Internal, Compressed Time Series Representation
When a user inserts a time series into the database, Plato physically stores the compressed time series representation instead of the raw time series. More precisely, the user provides (i) a time series , (ii) the identifier of a segmentation algorithm, which is chosen from a list provided by Plato, and (iii) the identifier of a function family, which is selected from a list provided by Plato. Internally, Plato uses the chosen segmentation algorithm and the chosen compression function family to partition into a list of disjoint segments . For each segment , instead of storing its original data points , Plato stores a compressed segment representation , where is the start position, is the end position, is the function representation of , where is the estimation function chosen from the identified function family and is a set of (two to three depending on the function family) error measures.
Overall, for a time series , Plato physically stores (i) the list =(), and (ii) one token (which can simply be an integer) as the function family identifier. (It is not necessary for Plato to physically store a token for the segmentation algorithm identifier, as a time series stored in Plato has already been partitioned.)
We comment on the prior state-of-the-art segmentation / compression algorithms that Plato uses in Appendix A. Next, we introduce the selection of the estimation function and the computation of error measures.
3.1. Estimation Function Selection
Choosing an estimation function for a time series segment has two steps: (i) the user identifies the function family, and (ii) Plato selects the best function in the family, i.e., the function that minimizes the Euclidean distance between the original values and the estimated values produced by the function.
Step 1: Function family selection. Table 8 gives example function family identifiers, which the user may select, and the corresponding function expressions. For example, =“” means that the chosen function family is the “second-degree polynomial function family” and the corresponding function family expression is .
Step 2: Estimation function selection. Any function in the chosen function family is a candidate estimation function. Following the prior work (Lazaridis and Mehrotra, 2003; Aghabozorgi et al., 2015), Plato selects the candidate estimation function that minimizes the Euclidean distance between the original values and the estimated values produced by the function to be the final estimation function. More precisely,
Example.
Given a time series , assume the function family identifier is “” (i.e., “first-degree polynomial function family”). Functions and are two candidate estimation functions. Finally, Plato selects as the estimation function since it produces the minimal Euclidean error, i.e., .
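The selection criterion can be made concrete with a small sketch. It compares the Euclidean error of two candidate first-degree polynomials on a three-point segment. The data values and the candidates are invented for illustration (the numbers of the example above are not recoverable here), and in practice the minimizer over the whole family is found in closed form by least squares rather than by enumerating candidates.

```python
import math
from typing import Callable, List

def euclidean_error(f: Callable[[int], float], values: List[float],
                    start: int = 1) -> float:
    """Euclidean distance between the original and the estimated values."""
    return math.sqrt(sum((v - f(start + i)) ** 2 for i, v in enumerate(values)))

values = [1.0, 2.0, 4.0]                      # positions 1, 2, 3 (made-up data)

candidates = {
    "f1(t) = t": lambda t: float(t),
    "f2(t) = 1.5t - 2/3": lambda t: 1.5 * t - 2.0 / 3.0,   # least-squares line
}

errors = {name: euclidean_error(f, values) for name, f in candidates.items()}
best = min(errors, key=errors.get)
```

Here `f2` is the least-squares line for the three points, so no other first-degree polynomial can achieve a smaller Euclidean error on this segment.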
Function Representation (Physical) vs. Function (Logical). Once an estimation function is selected, Plato stores the corresponding function representation , which includes (i) the coefficients of the function , and (ii) the function family identifier . (All the segments in the same time series share one token .) For example, the function representation of the estimation function in Example 1 is = ( , ) where is a function family identifier indicating that the function family is the “1-degree polynomial function family”.
When we talk about the function itself logically, it can be regarded as a vector that maps time series: given a domain , the vector maps a value to each position in the domain . For example, consider the estimation function in Example 1. Then .
3.2. Error Measures
In addition to the estimation function, Plato stores extra error measures for each time series segment (defined in domain ) where , , and are defined in Table 2.
Example.
Consider the time series in Example 1 again. is the estimation function. Thus , , and .
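A sketch of how the three per-segment measures listed in Table 2 might be computed at insertion time is given below. The symbols and exact definitions are missing from this copy, so the sketch assumes that both norms are L2 (Euclidean) norms, consistent with the Euclidean criterion used for function selection, and that the absolute reconstruction error is the absolute value of the summed estimation errors. The field names and data are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ErrorMeasures:
    err_norm: float    # ASSUMED: L2 norm of the estimation errors
    est_norm: float    # ASSUMED: L2 norm of the estimated values
    recon_err: float   # ASSUMED: |sum of (original - estimated)|

def error_measures(values: List[float], f: Callable[[int], float],
                   start: int) -> ErrorMeasures:
    est = [f(start + i) for i in range(len(values))]
    err = [v - e for v, e in zip(values, est)]
    return ErrorMeasures(
        err_norm=math.sqrt(sum(x * x for x in err)),
        est_norm=math.sqrt(sum(x * x for x in est)),
        recon_err=abs(sum(err)),
    )

# Made-up segment on positions 1..3 with the (non-optimal) estimate f(t) = t.
measures = error_measures([1.0, 2.0, 4.0], lambda t: float(t), start=1)
```

Under these assumptions only `est_norm` grows with the amplitude of the data, which is why guarantees that avoid it are called amplitude independent.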
4. Error guarantee computation
Error Guarantee Definition. Given a TSA involving time series , let be the accurate answer of by executing directly on the original data points of . Let be the approximate answer of by executing on the compressed time series representations. Then is the true error of . Notice that is unknown since is unknown. An upper bound () of the true error is called a deterministic error guarantee of . With the help of , we know that the accurate answer is within the range with confidence. Plato provides tight deterministic error guarantees for time series expressions defined in Table 3 (Section 2).
Error Guarantee Decomposition. Recall that the time series analytic defined in Table 3 (Section 2) combines one or more time series aggregation operations via arithmetic operators, i.e., where . In order to provide the deterministic error guarantee of the time series analytic , the key step is to calculate the deterministic error guarantee of each aggregation operation . Once we have for each aggregate expression, it is not hard to combine them to get the final error guarantee (see Appendix B).
Given a TSA and the compressed time series representation . When calculating , there are two cases depending on whether is an input time series or not. (If a time series is generated by applying some time series operators, then it is not a base time series. For example, , then is not a base time series.)
Case 1. is an input time series, then where is the reconstruction error in the error measures of . (Here we assume the aggregation operator aggregates the whole time series.)
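Case 1 can be illustrated with a short sketch that answers Sum over an input time series from the compressed representation alone. It assumes the reconstruction error of a segment is the absolute value of the summed estimation errors, and it combines several segments by adding their reconstruction errors, which is an upper bound by the triangle inequality; that multi-segment combination is an assumption of this sketch, as are the data and names.

```python
from typing import Callable, List, Tuple

# A compressed segment: (start, end, estimation function, reconstruction error).
Segment = Tuple[int, int, Callable[[int], float], float]

def compress(raw: List[float], start: int, f: Callable[[int], float]) -> Segment:
    """Store f plus the reconstruction error instead of the raw points."""
    end = start + len(raw) - 1
    gamma = abs(sum(v - f(start + i) for i, v in enumerate(raw)))
    return (start, end, f, gamma)

def approx_sum_with_guarantee(segments: List[Segment]) -> Tuple[float, float]:
    approx, guarantee = 0.0, 0.0
    for start, end, f, gamma in segments:
        approx += sum(f(t) for t in range(start, end + 1))
        guarantee += gamma            # triangle inequality across segments
    return approx, guarantee

raw_a, raw_b = [1.0, 2.0, 4.0], [5.0, 5.0, 5.0]          # made-up data
segments = [
    compress(raw_a, 1, lambda t: float(t)),              # deliberately non-optimal fits
    compress(raw_b, 4, lambda t: 5.5),
]
approx, guarantee = approx_sum_with_guarantee(segments)
true_answer = sum(raw_a) + sum(raw_b)
```

The accurate answer then lies in the range from `approx - guarantee` to `approx + guarantee`, without any access to the raw points at query time.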
Case 2. is a derived time series by applying the time series operators (recursively), Constant(), Shift(), , and . In this case, the aggregation operator can be depicted as a tree. Figure 7 shows an example tree of the aggregation operator in the “correlation TSA”. In order to compute , we first calculate the error measures for the root time series in the tree by propagating the error measures from the bottom time series to the root. Then we return the in the as the final error guarantee.
Next, we focus on computing the error measures for derived time series. We first explain the simpler case where each time series is a single segment. Table 8 shows the formulas for computing error measures for derived time series in this case. For the general scenario where multiple segments are involved in each input time series in the expression, there are two cases depending on whether the segments are aligned or not: if the -th segment in has the same domain as the -th segment in for all , then and are aligned; otherwise, they are misaligned.
In the following, we will show how to compute the most challenging error guarantee in both aligned and misaligned cases in Section 4.1 and Section 4.2 respectively. The computation of error guarantees of other expressions (i.e., Constant(), Shift(), and ) is presented in Appendix C.
4.1. Error Guarantee on Aligned Segments
Notations. Given a time series and the estimation function of , is the vector of errors produced by the estimation function. In the following, , and are all regarded as vectors. is the inner product of and . is a restriction operation, which restricts a vector to the domain [a,b]. Recall a time series segment is a subsequence of a time series. Thus, a segment is the restriction of a time series from a bigger domain into a smaller domain , denoted as . Figure 5(b) visualizes the restriction operator. For example, consider a time series , then is a restriction of . Note that for all .
Given two compressed time series representations and for the aligned time series and where and . Notice and have the same domain, i.e., , for all . For any estimation function family, the error guarantee of Sum() on aligned time series is:
The details are shown in Appendix D.
Example.
Consider the two aligned time series in Figure 7(a). Both and are partitioned into two segments in this case, i.e., () and (). Plato stores the error measures for each segment . For instance, . Then the error guarantee of Sum() on and is computed as .
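The aligned case can be sketched numerically. The formula itself is missing from this copy, so the bound below is the one that follows from writing each series as estimate plus error, expanding the product, and applying the Cauchy–Schwarz inequality to each term; it should be read as an assumption that may differ in form from the paper's formula. It uses only the per-segment norms of the errors and of the estimated values. Data, segment boundaries and names are made up.

```python
import math
from typing import List, Tuple

def norm(xs: List[float]) -> float:
    return math.sqrt(sum(x * x for x in xs))

def segment_measures(raw: List[float], est: List[float],
                     bounds: List[Tuple[int, int]]) -> List[Tuple[float, float]]:
    """Per segment: (norm of estimation errors, norm of estimated values)."""
    out = []
    for lo, hi in bounds:                      # 0-based inclusive bounds
        err = [r - e for r, e in zip(raw[lo:hi + 1], est[lo:hi + 1])]
        out.append((norm(err), norm(est[lo:hi + 1])))
    return out

def aligned_product_guarantee(m1, m2) -> float:
    """Cauchy-Schwarz bound on |Sum(T1 x T2) - Sum(f1 x f2)|, segment by segment."""
    return sum(f1 * e2 + f2 * e1 + e1 * e2
               for (e1, f1), (e2, f2) in zip(m1, m2))

bounds = [(0, 1), (2, 3)]                      # both series share these segments
t1, est1 = [1.0, 3.0, 2.0, 6.0], [2.0, 2.0, 4.0, 4.0]
t2, est2 = [4.0, 2.0, 5.0, 1.0], [3.0, 3.0, 3.0, 3.0]

guarantee = aligned_product_guarantee(segment_measures(t1, est1, bounds),
                                      segment_measures(t2, est2, bounds))
true_answer = sum(a * b for a, b in zip(t1, t2))
approx = sum(a * b for a, b in zip(est1, est2))
```

In this instance the true error is 10 while the bound is about 48: the terms that multiply the norms of the estimated values dominate, which is the amplitude dependence discussed in the introduction.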
4.1.1. Orthogonal projection optimization
If the estimation function family forms a vector space (VS), then we can apply the orthogonal projection property in VS to significantly reduce the error guarantee of from Formula 2 to Formula 4.1.1. (A vector space is a set that is closed under finite vector addition and scalar multiplication; see http://mathworld.wolfram.com/VectorSpace.html.)
Example.
Consider the two aligned time series in Figure 7(a) again. The estimation function family is the polynomial function family, which is a VS. Based on Formula 3, the error guarantee for Sum() is + = . This error guarantee is about smaller than that in Example 1 (i.e., ), where we did not take into account that the function family is VS.
Lemma.
(Orthogonal Projection Property) Let be a function family that forms a vector space (VS) and be the estimation function of time series . Then is the orthogonal projection of onto (Nelson, 1973).
Lemma 3 implies that is orthogonal to any function , which means . Therefore, given any two aligned segments and , as both and are in VS, thus and .
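The orthogonality argument can be checked numerically. The sketch below uses the constant functions (0-degree polynomials, as in PAA), which form a vector space; the least-squares constant on a segment is its mean. It verifies that each error vector is orthogonal to the other series' estimate on the same aligned segment, so the cross terms vanish and only the product of the error norms remains. The data and helper names are made up, and the reduced bound shown is the Cauchy–Schwarz bound on the remaining term, which is assumed rather than copied from the paper's formula.

```python
import math
from typing import List

def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))

def mean_fit(values: List[float]) -> List[float]:
    """Least-squares constant (the segment mean) evaluated at every position."""
    m = sum(values) / len(values)
    return [m] * len(values)

t1, t2 = [1.0, 3.0, 2.0, 6.0], [4.0, 2.0, 5.0, 1.0]
bounds = [(0, 1), (2, 3)]                     # aligned segments, 0-based inclusive

cross_terms, reduced_bound, approx = [], 0.0, 0.0
for lo, hi in bounds:
    s1, s2 = t1[lo:hi + 1], t2[lo:hi + 1]
    f1, f2 = mean_fit(s1), mean_fit(s2)
    e1 = [a - b for a, b in zip(s1, f1)]
    e2 = [a - b for a, b in zip(s2, f2)]
    cross_terms += [dot(e1, f2), dot(e2, f1)]   # vanish by orthogonal projection
    reduced_bound += math.sqrt(dot(e1, e1)) * math.sqrt(dot(e2, e2))
    approx += dot(f1, f2)

true_error = abs(dot(t1, t2) - approx)
```

For this data the true error is 10 and the reduced bound is also 10 (up to rounding), so the bound is attained, and it involves no norm of the estimated values.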
For visualization purposes, consider a time series with three data points and let be the 1-degree polynomial function family (i.e., 2-dimensional). The estimation function that minimizes the error to the original data is (Figure 9(a)). As shown in Figure 9(b), is the orthogonal projection of onto . The error vector is