Scalable Model-Based Management of Correlated Dimensional Time Series in ModelarDB

03/25/2019, by Søren Kejser Jensen et al., Aalborg University

To monitor critical infrastructure, high-quality sensors sampled at a high frequency are increasingly installed. However, due to the large amounts of data produced, only simple aggregates are stored. This removes outliers and hides fluctuations that could indicate problems. As a solution, we propose compressing time series with dimensions using a model-based method we name Multi-model Group Compression (MMGC). MMGC adaptively compresses groups of correlated time series with dimensions using an extensible set of models within a user-defined error bound (possibly zero). To partition time series into groups, we propose a set of primitives for efficiently describing correlation for data sets of varying sizes. We also propose efficient query processing algorithms for executing multi-dimensional aggregate queries on models instead of data points. Last, we provide an open-source implementation of our methods as extensions to the model-based Time Series Management System (TSMS) ModelarDB. ModelarDB interfaces with the stock versions of Apache Spark and Apache Cassandra and can thus reuse existing infrastructure. Through an evaluation we show that, compared to widely used systems, our extended ModelarDB provides up to 11 times faster ingestion due to high compression, 65 times better compression due to the adaptivity of MMGC, 92 times faster aggregate queries as they are executed on models, and close to linear scalability, while also being extensible and supporting online query processing.


1 Introduction

Companies maintaining critical infrastructure, e.g., for energy production, benefit from monitoring with a high degree of coverage and having data points sampled at a high frequency. To facilitate this in the energy domain, entities such as wind turbines are monitored by high-quality sensors with wired power and connectivity. As a result, invalid, missing, and out-of-order readings are rare, and all except missing values can be corrected using established methods. In addition to data points, metadata, e.g., location and sensor type, is stored for each time series to support analysis along multiple dimensions. However, due to the large amount of data points being produced, only simple aggregates are stored, removing outliers and fluctuations as a result. As a remedy, model-based storage allows for compression of time series within a known error bound (possibly zero) [32, 21]. A model is any representation from which the original time series can be reconstructed within a known error bound. Model-based storage of time series has been improved through Multi-model Compression (MMC) and Model-based Group Compression (MGC). MMC utilizes that the structure of time series changes over time and compresses each time series using multiple models [26, 27, 31, 14, 23]. MGC exploits that time series are correlated, e.g., temperature sensors in close proximity likely report similar values, and compresses correlated time series as one stream of models [32, 16]. MGC is illustrated in Figure 1. In the example a linear function is used to represent three correlated time series, creating a mapping from a timestamp to an approximated value for the three values observed at that timestamp.

Figure 1: Time series compressed and stored as one model per time series (Top), or as one model for all time series (Bottom)

However, to our knowledge no method for MMC exploits the correlation between time series, while existing methods for MGC each only utilize a single type of model. In this paper, we focus on the novel problem of compressing groups of correlated time series with user-defined dimensions using both MMC and MGC. We name this new type of compression Multi-model Group Compression (MMGC). We demonstrate that MMGC is suitable for use with a TSMS by extending the open-source MMC TSMS ModelarDB [23] with MMGC. To differentiate between the two versions of ModelarDB we will use ModelarDBv1 for the original version and ModelarDBv2 for our version extended with MMGC. We also demonstrate how multi-dimensional aggregate queries can be performed much more efficiently on models compared to data points. As a result, ModelarDBv2 provides a high compression ratio for time series data, distributed storage and query processing for scalability, stream processing for low latency, and efficient support for multi-dimensional aggregate queries of time series. In summary, we make the following contributions in the area of big data systems:


  • The concept of Multi-model Group Compression and extension of existing models for compressing groups of time series.

  • Primitives for partitioning time series into groups of correlated time series based on a dimensional hierarchy and user hints.

  • Algorithms for performing simple aggregate and multi-dimensional aggregate queries on models representing multiple time series.

  • The TSMS ModelarDBv2 implementing our methods for partitioning, Multi-model Group Compression and query processing.

  • An evaluation of ModelarDBv2 and its algorithms for partitioning, Multi-model Group Compression, and query processing.

The structure of the paper is as follows. Definitions are provided in Section 2. Section 3 provides an overview of ModelarDBv2. Section 4 documents our partitioning primitives, while Section 5 describes our MGC extensions to existing models. In Section 6 our query processing algorithms are described. An evaluation of ModelarDBv2 is given in Section 7. Related work is presented in Section 8. Last, Section 9 provides our conclusion and future work.

2 Preliminaries

We now provide definitions for use in the paper. We also provide an intuitive understanding of the definitions using examples. As ModelarDBv2 extends ModelarDBv1, Definitions 1–6 are from [23].

Definition 1 (Time Series)

A time series $TS$ is a sequence of data points, in the form of time stamp and value pairs, ordered by time in increasing order $TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle$. For each pair $(t_i, v_i)$, $1 \leq i$, the time stamp $t_i$ represents the time when the value $v_i \in \mathbb{R}$ was recorded. A time series $TS_B = \langle (t_1, v_1), \ldots, (t_n, v_n) \rangle$, consisting of a fixed number of $n$ data points, is a bounded time series.

Definition 2 (Regular Time Series)

A time series $TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle$ is considered regular if the time elapsed between each data point is always the same, i.e., $t_{i+1} - t_i = t_{i+2} - t_{i+1}$ for $1 \leq i$, and irregular otherwise.

Definition 3 (Sampling Interval)

The sampling interval $SI$ of a regular time series $TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle$ is the time elapsed between each pair of consecutive data points in the time series, $SI = t_{i+1} - t_i$ for $1 \leq i$.

To exemplify the definitions we use a time series $TS$. Each pair in $TS$ is a recorded time stamp and a value. The time stamps are measurements in milliseconds of the time elapsed since recording started. To construct a bounded time series $TS_B$ we can consider a subset of the data points. Both versions of $TS$ are regular and have the same $SI$ in milliseconds.
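As a minimal illustration with assumed timestamps and values (the concrete numbers below are hypothetical), a regular time series, a bounded subset of it, and their shared sampling interval could look as follows:

$$TS = \langle (100, 8.3),\, (200, 8.4),\, (300, 8.3), \ldots \rangle \qquad TS_B = \langle (100, 8.3),\, (200, 8.4),\, (300, 8.3) \rangle \qquad SI = 100 \text{ milliseconds}$$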

Definition 4 (Model)

A model is a representation of a time series $TS$ using a pair of functions $M = (m_{est}, m_{err})$. For each $t_i$, $1 \leq i$, the function $m_{est}$ is a real-valued mapping from $t_i$ to an estimate of the value for the corresponding data point in $TS$. $m_{err}$ is a mapping from a time series $TS$ and the corresponding $m_{est}$ to a positive real value representing the error of the values estimated by $m_{est}$.

A model can be fitted to the bounded time series $TS_B$ using, e.g., a linear function as $m_{est}$. If the uniform error norm is used for the error function $m_{err}$, the error of the model is the maximum absolute difference between the values in $TS_B$ and the estimates produced by $m_{est}$, and the model represents $TS_B$ within that error.

Definition 5 (Gap)

A gap between a regular bounded time series $TS_1 = \langle (t_1, v_1), \ldots, (t_s, v_s) \rangle$ and a regular time series $TS_2 = \langle (t_e, v_e), \ldots \rangle$ with the same sampling interval $SI$ and recorded from the same source, is a pair of time stamps $G = (t_s, t_e)$ with $t_e = t_s + m \times SI$, $m \geq 2$, and where no data points exist between $t_s$ and $t_e$.

Figure 2: Illustration of a gap between $TS_1$ and $TS_2$
Definition 6 (Regular Time Series with Gaps)

A regular time series with gaps is a regular time series, $TS = \langle (t_1, v_1), (t_2, v_2), \ldots \rangle$, where $v_i \in \mathbb{R} \cup \{\bot\}$ for $1 \leq i$. For a regular time series with gaps, a gap $G = (t_s, t_e)$ is a sub-sequence where $v_i = \bot$ for $t_s \leq t_i \leq t_e$.

A gap is shown in Figure 2. For simplicity, time series from the same source separated by gaps will be referred to as a time series with gaps. A time series containing a gap is an irregular time series with an undefined $SI$. However, it can also be represented as a regular time series with gaps by representing the missing values as $\bot$, in which case the $SI$ remains defined.

Definition 7 (Dimension)

A dimension with members is a 3-tuple where (i) is hierarchically organized descriptions of the time series in the set of time series with the special value as the top element of the hierarchy; (ii) is surjective; (iii) For , and where ; (iv) For , and , if then ; (v) ; (vi) .

A time series belongs to a dimension's most detailed level, which has no descendants. Each member (except $\top$) at a level has a parent at the level above it. This allows users to do analysis at different levels by grouping on a level. To better describe the relation of the time series to real-world entities we write dimensions using named levels. For example, for time series collected from wind turbines a location dimension could be defined as Turbine $\rightarrow$ Park $\rightarrow$ Region $\rightarrow$ Country $\rightarrow$ $\top$. For a time series $TS$, the dimension's member function then provides a member for the Turbine level, a member for the Park level, and so on. If $TS$ is collected from a sensor on a wind turbine placed in Aalborg, the member for the Turbine level is the turbine's id, the member for the Park level is Aalborg, and so on until the member function returns $\top$, indicating the top of the hierarchy.

Definition 8 (Time Series Group)

A time series group is a set of regular time series, possibly with gaps, $TSG = \{TS_1, \ldots, TS_n\}$, where for all $TS_i, TS_j \in TSG$ they have the same sampling interval $SI$ and $t^1_i \bmod SI = t^1_j \bmod SI$, where $t^1_i$ and $t^1_j$ are the first timestamp of $TS_i$ and $TS_j$, respectively.

For example, a time series group can contain a regular time series and a regular time series with gaps, as long as both have the same $SI$ in milliseconds and aligned timestamps. An irregular time series cannot be in the set as it does not have a defined $SI$.

Figure 3: Model-based compression of time series
Definition 9 (Segment)

A segment for a time series group is a 6-tuple defined as $S = (t_s, t_e, SI, G_{ts}, M, \epsilon)$ representing the data points for a bounded time interval of a time series group $TSG$. The 6-tuple consists of start time $t_s$, end time $t_e$, sampling interval $SI$, a function $G_{ts}$ which for the $Tid$ of each time series in $TSG$ gives the set of timestamps for which $v = \bot$ in that time series, and a model $M$ which defines the values of all other timestamps within the error bound $\epsilon$.

To ensure a model-based representation of time series does not exceed an error bound, the time series can be split into segments. As data points are ingested, segments are created to represent the time series within the user-defined error bound as shown in Figure 3. To illustrate this, consider three time series in a group. Representing these time series with a linear function creates an approximation whose error is measured using the uniform error norm. If the error is within the error bound, e.g., 10, a segment containing the linear model is created.

In this paper we focus on using MMGC to compress unbounded regular time series, possibly with gaps and dimensions, while the time series are being ingested by a TSMS and analyzed using data warehouse style Online Analytical Processing (OLAP) queries.

3 Architecture

3.1 Overview

Figure 4: Architecture of a worker node. Query processing and storage are co-located to increase locality

ModelarDBv2 is a novel distributed model-based TSMS designed as a portable library, ModelarDBv2 Core, that is simple to interface with existing software. We interface it with the stock versions of Apache Spark for query processing and Apache Cassandra for storage in a master/worker architecture. ModelarDBv2 implements MMGC by adding a Partitioner component and making changes to all of ModelarDBv1's components [23]. The Partitioner takes as input a set of dimensional time series and partitions them into groups based on user hints. To prevent data skew, each group is assigned to the worker with the most available resources. During ingestion the system automatically selects an appropriate model for each dynamically sized sub-sequence of each time series group. Three models, extended to support MGC, are included in ModelarDBv2 Core: the constant PMC-Mean model (PMC) [25], the linear Swing model (Swing) [15], and the lossless compression algorithm for floating-point values proposed for the Gorilla TSMS (Gorilla) [28]. Users can optionally implement more models through an extension API without recompiling ModelarDBv2. For query processing ModelarDBv2 uses SQL and expands the Segment View and Data Point View proposed for ModelarDBv1 [23]. The Segment View allows aggregates to be executed efficiently on segments, e.g., SUM on a linear model uses constant time, while queries on the Data Point View are executed on reconstructed data points.

The architecture of each worker node in ModelarDBv2 is split into three sets of components as shown in Figure 4. In Figure 4 each component is annotated with the software providing that functionality and components that have been modified for ModelarDBv2 are shown with a gray gradient. Components outside the dashed lines are implemented as part of the master node. Data Ingestion ingests time series and constructs models within a user-defined error bound; Query Processing caches recently constructed and queried segments and processes queries at either the segment or data point level; Segment Storage provides a uniform interface with predicate push-down for the persistent segment group store. In summary, ModelarDBv2 is simple to deploy in a cluster while providing state-of-the-art ingestion rates, compression and query performance, in one system. ModelarDBv2 achieves this by compressing multiple correlated time series with dimensions using models distributed as part of ModelarDBv2 Core and optionally user-defined.

3.2 Ingestion and Representation of Gaps

At each $SI$, ModelarDBv2 fits a model to the data points received from a group of time series, instead of one model per time series as in ModelarDBv1 [23]. Both treat models as black boxes with a common interface, allowing arbitrary user-defined models. ModelarDBv2 performs ingestion in four steps: (i) a data point from each time series in the group is received and added to a buffer, (ii) it is verified whether the current model can be fitted to the new data points; if not, the next model is used, (iii) when the last model can fit no more data points, the model providing the best compression ratio is flushed to memory and disk, and (iv) last, the data points represented by the flushed model are removed from the buffer and the process is repeated from the first model in the sequence. Any gaps are stored as part of the current segment before ingestion continues. To simplify management of gaps and improve filtering during query processing, both the start time and end time are stored for each segment. In addition, segments are stored disconnected to improve the compression ratio, as connected segments store overlapping data points [26, 27]. As a result, each segment represents a dynamically sized sub-sequence from a group of time series using the model providing the best compression within a user-defined error bound (possibly zero).
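The four steps above can be sketched as follows; the model objects and their try_fit, length, and compression_ratio methods are hypothetical stand-ins for ModelarDBv2's model interface, and the real system fits the models incrementally instead of refitting the buffer for every data point as this simplified sketch does.

```python
# Minimal sketch of the four-step ingestion loop described above. The model
# classes and their methods are assumptions, not ModelarDBv2's actual API.
def ingest_group(stream, model_types, emit_segment):
    buffer = []                              # data points not yet flushed as a segment
    for point in stream:                     # (i) one data point per time series per SI
        buffer.append(point)
        fitted = []
        for model_type in model_types:       # (ii) try each model in sequence
            model = model_type()
            for p in buffer:
                if not model.try_fit(p):     # p would exceed the error bound
                    break
            fitted.append(model)
        if all(m.length() < len(buffer) for m in fitted):
            # (iii) no model can represent the whole buffer, so flush the best one
            best = max(fitted, key=lambda m: m.compression_ratio())
            emit_segment(best)
            buffer = buffer[best.length():]  # (iv) drop the represented data points
```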

For storing gaps we consider two methods. The first stores gaps as triples consisting of a Tid, a start time, and an end time, where the start time and end time delimit a gap in the time series indicated by the Tid. The second creates a new segment when a gap occurs and stores gaps as Tids as shown in Figure 5. The group in this example consists of three time series, so the model is initially fitted to three values per time stamp. When a gap occurs in one of the time series, a new model is fitted to the values from only the two remaining time series. To indicate that this model only represents a subset of the time series, the Tids of the time series not represented are stored in the segment. When data points are received from all time series again, the process is repeated. Thus a segment represents data points for a static number of time series.

Figure 5: Flushing to remove gaps from segments

For ModelarDBv2 we use the second method as it: (i) simplifies the implementation of user-defined models, as storing gaps with the first method requires that models take any combination of gaps into account, and (ii) simplifies and reduces the computation required for ingestion, execution of aggregate queries, and reconstruction of data points, as these operations must skip gaps. This choice is, however, a trade-off, as storing a gap as a triple requires fewer bytes than storing it as a new segment. As a result, ModelarDBv2 significantly improves the state-of-the-art for compression, see Section 7, while making user-defined models simple to implement.

Figure 6: Schema for storing time series groups using segments

3.3 Storage Schema

The storage schema used by ModelarDBv2 to support MMGC is shown in Figure 6. The Time Series table contains metadata and denormalized user-defined dimensions for each time series, with each time series identified by a Tid. The only required metadata is the sampling interval. The Gid represents what group a time series has been partitioned into and is computed by ModelarDBv2 using user hints. The scaling constant is a constant that ModelarDBv2 applies to each value during ingestion and query processing. With a scaling constant, correlated time series with different values can be compressed together. The Model table maps a Mid to the Java Classpath of that model. Last, the Segment table contains all ingested data points as dynamically sized segments.

In data warehouse terms, the Segment table functions as a fact table with new segments continuously appended during ingestion. The user-defined dimensions are stored denormalized as part of the Time Series table. However, no explicit time dimension is required as aggregate queries in the time dimension can be computed efficiently using only StartTime and EndTime as described in Section 6.3. For Cassandra two modifications are made to the general schema. First, to more efficiently support predicate push-down, the primary key for Segment is changed to (Gid, EndTime, Gaps) [23]. Gaps is included to prevent duplicate keys due to the dynamic splitting described in Section 4.2. The values in Gaps are stored as integers with each bit representing whether a gap has occurred for that time series in the group. Second, as the StartTime column is not used for indexing, it is changed so the size of the segment is stored instead to save space. The StartTime can be efficiently recomputed from EndTime, the size, and the SI [23].
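As a small illustration of the two Cassandra-specific optimizations, the sketch below encodes the Gaps column as one bit per time series in a group and recomputes a segment's start time from its end time, stored size, and sampling interval; the bit layout and the exact start-time formula are simplified assumptions.

```python
# Sketch of the Gaps bitmask and the StartTime recomputation described above.
# The bit layout and the start-time formula are assumptions.
def encode_gaps(group_tids, tids_in_gap):
    """Set one bit per time series in the group that is currently in a gap."""
    bits = 0
    for position, tid in enumerate(group_tids):
        if tid in tids_in_gap:
            bits |= 1 << position
    return bits

def decode_gaps(group_tids, bits):
    """Return the Tids whose bit is set in the stored integer."""
    return [tid for position, tid in enumerate(group_tids) if bits & (1 << position)]

def start_time(end_time, size, si):
    """Recompute StartTime from EndTime, the stored size, and the SI."""
    return end_time - (size - 1) * si

# A group of three time series where the second one is in a gap, and a segment
# of five data points with SI = 100 milliseconds ending at timestamp 500.
assert decode_gaps([1, 2, 3], encode_gaps([1, 2, 3], {2})) == [2]
assert start_time(end_time=500, size=5, si=100) == 100
```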

4 Partitioning of Time Series

4.1 Partitioning of Correlated Time Series

To provide the benefit of model-based storage and query processing while ensuring low latency, models must be fitted online [23]. However, in a distributed system, time series compressed together should be ingested on one node to prevent excessive network traffic from limiting the scalability of the system. So, to prevent migration of data in the cluster, the time series must be partitioned based only on metadata or previously collected data. As historical data might not exist, and even a small data set of time series creates a quadratic number of pairs of possibly very large time series to compare for correlation, simply computing which time series are correlated from historical data quickly becomes infeasible.

1:Let be the set of time series.
2:Let be the dimensions for all time series.
3:Let be the set of user-defined correlations.
4:
5:
6:
7:while  do
8:     
9:     for each  do
10:         if  then
11:              
12:              
13:         end if
14:     end for
15:end while
16:return
Algorithm 1 Group time series using the primitives

We propose a set of primitives that can be combined to efficiently describe correlation for data sets with different quantities of time series and dimensions. The primitives are specified in ModelarDBv2's configuration file as modelardb.correlation clauses, with multiple primitives in one clause implicitly combined with an AND operator, while multiple clauses are implicitly combined with an OR operator. Using these user hints, ModelarDBv2 partitions time series into groups to be ingested together. The primitives allow correlation to be specified as sets of time series, as levels for which members must be equal in dimensions, or as the distance between all of the dimensions (described below). Grouping is performed as shown in Algorithm 1. After initializing a group per time series in Line 5, the algorithm iteratively combines groups until the number of groups reaches a fixpoint. The function correlated in Line 10 checks if the groups should be merged based on the user-defined correlations.
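The fixpoint grouping of Algorithm 1 can be sketched as follows, where the correlated predicate stands in for the user-defined correlation primitives and is an assumption of this sketch.

```python
# Sketch of Algorithm 1: merge groups of time series until the number of
# groups no longer changes. correlated() stands in for the user hints.
def group_time_series(time_series, correlated):
    groups = [[ts] for ts in time_series]      # one group per time series (Line 5)
    changed = True
    while changed:                             # iterate until a fixpoint is reached
        changed = False
        merged = []
        for group in groups:
            for target in merged:
                if correlated(target, group):  # merge if the user hints match (Line 10)
                    target.extend(group)
                    changed = True
                    break
            else:
                merged.append(group)
        groups = merged
    return groups
```

For example, group_time_series(["ts1", "ts2", "ts3"], lambda a, b: True) returns a single group containing all three time series.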

When specifying correlation as time series, their location (files or sockets) must be provided, e.g., 4L80R9a_Temperature.gz 4L80R9b_Temperature.gz. For time series that are correlated but do not contain similar values, a scaling constant can be added per time series. While this allows precise control over the groups, it quickly becomes too time consuming as the number of time series increases. The other primitives are based on the notion that time series correlation can be derived from their dimensions. As an example, temperature sensors in close proximity will likely produce similar values. The similarity of a dimension for two groups can be computed as their Lowest Common Ancestor (LCA) level. The LCA level is the lowest level in a dimension at which all time series in the two groups have equivalent members, starting from the top of the hierarchy. An example of computing the LCA can be seen in Figure 7.

Figure 7: An example Location dimension for wind turbines where the LCA for two of the time series is the member at the Park level

To specify correlation based on members, the user must provide either a triple consisting of a dimension, a level, and a member, or a pair consisting of a dimension and an LCA level. The triple Measure 1 Temperature, e.g., specifies that time series sharing the member Temperature at level one of the Measure dimension are correlated. The pair Location 2 says that if the LCA level is equal to or higher than two for the Location dimension, the time series are correlated. Zero specifies that all levels must be equal, and a negative number that all but the lowest levels must be equal. When specifying a scaling constant for many time series, it can be defined for time series with a shared member as a 4-tuple containing a dimension, a level, a member, and a scaling constant. These primitives are appropriate for a data set with few dimensions but many time series.
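As an illustration, a configuration combining these primitives might look as sketched below; the one-clause-per-line layout, the comma separating primitives within a clause, and the member names are assumptions:

```
modelardb.correlation 4L80R9a_Temperature.gz 4L80R9b_Temperature.gz
modelardb.correlation Measure 1 Temperature, Location 2
```

The first clause groups the two explicitly listed time series, while the second groups time series that both share the member Temperature at level one of the Measure dimension and have an LCA level of at least two in the Location dimension.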

For data sets with both a large number of time series and dimensions, the user can specify correlation as a distance between dimensions. The intuition is that time series with much overlap between their members will be correlated. For example, for the location dimension in Figure 7, time series sharing members at the Turbine level are more likely to be correlated than if they only share members at the Country level. The distance zero specifies that all members must match for the time series to be grouped, while the distance one specifies that all time series should be grouped. Values in-between specify different degrees of overlap. The user can inject domain knowledge by changing the impact of a dimension using a weight for which the default value is one. Distances above one due to user-defined weights are reduced to one. For distance-based correlation the rule of thumb is to use the lowest non-zero value for a data set such that only time series with many overlapping members are grouped. The lowest non-zero distance can be calculated as $1 / (|D| \times \max_{d \in D} |L_d|)$, where $L_d$ is the set of levels in dimension $d$ and $D$ is the set of dimensions.

1:Let be the dimensions for all time series.
2:Let be a time series group.
3:Let be a time series group.
4:
5:
6:
7:for each  do
8:     
9:     
10:     
11:     
12:     
13:end for
14:
15:return
Algorithm 2 Use of distance to indicate correlation

The pseudo-code for computing the distance between two time series groups is shown in Algorithm 2. In Line 11 the distance for a dimension is computed from its LCA level such that groups with equivalent members only at the top of the hierarchy receive a higher distance. In Line 12 the distance of the dimension is multiplied by the user-defined weight for that dimension before being added to the accumulator. In Lines 14–15 the distance between the two time series groups is normalized to the range zero to one and compared to the user-defined threshold to determine if the two time series groups are correlated. As an example, for the Location dimension shown in Figure 7, the normalized distance between two time series can be computed from the level of their LCA in this manner.
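A sketch of this distance computation is shown below; the per-dimension formula derived from the LCA level, and the objects representing dimensions, are assumptions based on the description above.

```python
# Sketch of Algorithm 2: compute a normalized distance between two time
# series groups from the LCA level of each dimension.
def distance(group_a, group_b, dimensions, weights):
    total = 0.0
    for dim in dimensions:
        levels = dim.number_of_levels                   # levels below the top element
        lca = dim.lca_level(group_a, group_b)           # 0 if only the top element matches
        dim_distance = (levels - lca) / levels          # penalize matches high in the hierarchy
        total += min(1.0, weights.get(dim.name, 1.0) * dim_distance)
    return total / len(dimensions)                      # normalize to the range [0, 1]

def correlated(group_a, group_b, dimensions, weights, threshold):
    return distance(group_a, group_b, dimensions, weights) <= threshold
```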

4.2 Dynamically Splitting Groups

As external events can change the values received for a time series, e.g., a wind turbine might be turned off or damaged, ModelarDBv2 can split a group if its time series become temporarily uncorrelated. A split can be performed after emission of a segment, as it indicates that the structure of a time series has changed so the next data point would exceed the error bound. To minimize the number of non-beneficial splits and the overhead of determining when to split, ModelarDBv2 uses two heuristics: poor compression ratio and the percentage error between ingested data points. First, ModelarDBv2 checks if the compression ratio of the new segment is below a user-configurable fraction of the average (the default is 10). If the compression ratio is lower and ModelarDBv2 has non-emitted data points, Algorithm 3 is executed.

1:Let be a time series group.
2:Let be the user-defined error bound.
3:Let be the data points buffered for .
4:
5:
6:while  do
7:     
8:     
9:     for each  do
10:         
11:         
12:         if  then
13:              
14:              
15:         end if
16:     end for
17:     
18:     
19:end while
20:return
Algorithm 3 Potentially splitting groups of time series temporarily

The algorithm groups time series if their buffered data points are correlated, and can create groups of size one up to the size of the original group. Time series currently in a gap are grouped together. In Lines 9–16 a time series is added to a new group if its buffered values are within twice the user-defined error bound of the values of that group. The double error bound is used as two data points cannot be approximated together if they are outside this bound. After all time series in the original group have been processed, the new groups are returned. An example of a split is shown in Figure 8. While ModelarDBv2 discards data points emitted as segments, the entire time series is shown in Figure 8 to show how they change over time. Initially the group is ingested using one Segment Generator; however, at some point the time series in the group are no longer correlated and segments with poor compression are emitted. Therefore, the group is split into two and ingestion continues with a Segment Generator per split. The original Segment Generator is unused after the split but not deallocated, as it synchronizes ingestion for the splits to simplify joining and joins the split groups if they become correlated again. Later the group is split again and each time series is then ingested separately.
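The regrouping performed by Algorithm 3 can be sketched as follows; treating the error bound as an absolute value and comparing against the first time series of each temporary group are simplifications of this sketch.

```python
# Sketch of Algorithm 3: regroup the time series in a group based on their
# buffered data points. Two time series end up in the same temporary group if
# their buffered values stay within twice the error bound of each other.
def split_group(group, buffers, error_bound):
    new_groups = []
    for ts in group:
        for candidate in new_groups:
            representative = candidate[0]
            if all(abs(a - b) <= 2 * error_bound
                   for a, b in zip(buffers[ts], buffers[representative])):
                candidate.append(ts)       # correlated with this temporary group
                break
        else:
            new_groups.append([ts])        # start a new temporary group
    return new_groups
```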

Figure 8: Ingestion of a time series group with dynamic splitting and joining
1:Let be groups marked for joining.
2:Let be a set of time series groups.
3:Let be the user-defined error-bound.
4:Let be the data points buffered for .
5:
6:
7:while  do
8:     
9:     
10:     
11:     for each  do
12:         
13:         
14:         
15:         
16:         if  and  then
17:              
18:              
19:              
20:         end if
21:     end for
22:     
23:     
24:end while
25:return
Algorithm 4 Potentially restoring a split group of time series

The algorithm for restoring a split group is shown in Algorithm 4 and is similar to Algorithm 3. However, when joining groups it is only necessary to compare one time series from each, as a group consists of correlated time series (otherwise a split would have occurred). To simplify joining groups, Algorithm 4 is only potentially executed at the end of each $SI$ so all groups have received data points for the same time period. As a segment being emitted indicates a significant change of the values ingested by a group, a split group is only marked for joining after emitting a number of segments. The number of segments that must be emitted is doubled after each attempt to join a split group to reduce the overhead of joining. The intuition is that each failed attempt at joining further indicates that the current splits are preferable. Continuing with the example in Figure 8, two of the time series later become correlated again and are merged into one group. Last, all the time series become correlated again, so the original Segment Generator takes over ingestion.

5 Multi-Model Group Compression

To benefit from MMGC a set of models is required. However, as most model-based compression methods for time series are designed for individual time series [32, 21], existing models must be extended to support MGC before they are used with ModelarDBv2. We first describe a simple method for using any model with MGC by storing multiple models per segment, and then two model-specific approaches that allow a group to use one model per segment.

5.1 Multiple Models per Segment

A baseline method for adding MGC support to any model is to split the data points received and fit them to separate models that are stored together as part of one segment. As gaps are managed by ModelarDBv2, no extensions to the models are required. However, to use the metadata in a segment for multiple models, each representing the values of different time series, the models must represent the same time interval. This is intuitively simple to ensure by verifying that all models will not exceed the error bound before fitting each new data point. However, this is unnecessary as explained next.

Figure 9: Fitting values to models, (O) indicates a model can fit the value, (X) that it cannot, and a dashed line is a new segment

Three cases can occur when multiple models are updated, as shown in Figure 9. For case (I) all models can represent the data point received from their respective time series. The opposite occurs in case (II), as the first model cannot represent the data point it received within the user-defined error bound. For both case (I) and case (II) it is trivial to see that all models represent the same time interval. In case (III), the first model can represent the data point received, but the second model cannot. As the models in the segment no longer represent the same time interval, the end time of the segment is simply not incremented. As each model represents all previously ingested values, the end time of a segment can be safely reduced in increments of $SI$. For models where the number of parameters depends on the number of data points fitted, e.g., Gorilla, the leftover parameters should be deleted. Afterwards the next set of data points are fitted to a new set of models. While storing multiple models in one segment reduces the amount of duplicate metadata from one copy per time series to one copy per group and is simple to implement, it does not reduce the storage required for the values. To further improve compression, each model must represent multiple time series using one set of parameters.

5.2 Single Model per Segment

To fully exploit MMGC a set of models must be provided which can each compress a group of time series using a single model. We found that the models used by ModelarDBv1 can be extended to efficiently compress a group of time series using a single model based on two general ideas. For models using lossless compression, e.g., Gorilla, values from multiple time series should be stored in time-ordered blocks. This allows exploitation of both temporal correlation and correlation across time series at each time stamp. For models that fit ingested data points using an upper and lower bound according to the uniform error norm, e.g., PMC and Swing, only the data points with the minimum and maximum value for each time stamp can modify the bounds and invalidate the model. As a result, the set of values for a time stamp can be reduced to a range of values represented by the 3-tuple consisting of the time stamp, the minimum value, and the maximum value. We now show in detail how MMGC can be performed efficiently using the three models provided as part of ModelarDBv2 Core.

Figure 10: Modifying models to fully support MGC
Figure 11: An aggregate performed on the linear model representing a group of three time series
1:Let be the WHERE clause of the SQL query.
2:Let be a mapping from Gid to Tid and reverse.
3:Let be a mapping from Members to Gid.
4:Let be the function preparing storage for the results.
5:Let be the segment aggregation function.
6:Let be a function for aggregating results.
7:
8:
9: Executed on workers with the result sent to the master
10:
11:
12:for each  do
13:     
14:end for
15: The results are merged and the final result computed
16:
17:return
Algorithm 5 Execution of simple aggregates on the Segment View

For PMC, the set of values from a group of time series at each time stamp is represented by a single constant as long as the difference between the running minimum and maximum values stays within the range allowed by the error bound. As a result, PMC requires no changes as the model only tracks the current minimum, maximum, and average value. See PMC in Figure 10. As Swing produces a linear function that is guaranteed to pass through the initial data point, the initial point can be computed using PMC. Then, as the Swing model maintains the upper bound and lower bound for a linear function that can represent the values of all data points received within the error bound, the data points are appended one at a time. See Swing in Figure 10. For Gorilla, values from data points with the same time stamp are stored in blocks. As the time series in a group are correlated, values in each block will have only a small delta compared to the first value and only require a few bits to encode. See Gorilla in Figure 10. To demonstrate the benefit of our MGC extensions we compress three real-life time series representing the temperature of co-located wind turbines. Compared to using only MMC, enabling MMGC in ModelarDBv2 reduces the storage required by 28.97% with a 0% error bound, by 29.22% for 1%, by 36.74% for 5%, and by 44.07% for 10%.
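As an illustration of the range-based idea, the sketch below extends a PMC-Mean-like constant model to a group by tracking only the running minimum, maximum, and average; treating the error bound as an absolute value is a simplification, and the class is not ModelarDBv2's actual implementation.

```python
# Sketch of a PMC-Mean-like constant model for a group of time series: only
# the minimum and maximum value per time stamp can invalidate the model.
class GroupPMCMean:
    def __init__(self, error_bound):
        self.error_bound = error_bound     # absolute error bound for simplicity
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.total = 0.0
        self.count = 0

    def try_fit(self, timestamp, values):
        low = min(self.minimum, min(values))
        high = max(self.maximum, max(values))
        average = (self.total + sum(values)) / (self.count + len(values))
        if high - average > self.error_bound or average - low > self.error_bound:
            return False                   # the constant model would exceed the bound
        self.minimum, self.maximum = low, high
        self.total += sum(values)
        self.count += len(values)
        return True                        # the average still represents all values

model = GroupPMCMean(error_bound=1.0)
assert model.try_fit(100, [9.5, 10.0, 10.5])       # the range fits around the average
assert not model.try_fit(200, [13.0, 10.0, 10.0])  # 13.0 stretches the range too far
```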

6 Query Processing

1:Let be the WHERE clause of the SQL query.
2:Let be a mapping from Gid to Tid and reverse.
3:Let be a mapping from Members to Gid.
4:Let be the roll-up level in the time hierarchy.
5:Let be the function preparing storage for the results.
6:Let be the segment aggregation function.
7:Let be a function for aggregating results.
8:
9:
10: Executed on workers with the result sent to the master
11:
12:
13:for each  do
14:     
15:     
16:     
17:     if  then
18:         
19:     else
20:         while  do
21:              
22:              
23:              
24:         end while
25:         
26:     end if
27:end for
28: The results are merged and the final result computed
29:
30:return
Algorithm 6 Rewrite and execution of aggregate queries with a roll-up in the time dimension using the Segment View

6.1 Query Interface

As a model can reconstruct the data points it represents within the error bound, queries can be executed on these data points. However, many aggregate queries can be answered directly from a model, e.g., for constant and linear functions MIN, MAX, SUM and AVG queries can be answered in constant time [23]. To support this, ModelarDBv2 provides a Segment View with the schema (Tid int, StartTime timestamp, EndTime timestamp, SI int, Mid int, Parameters blob, Gaps blob, Dimensions) and a Data Point View with the schema (Tid int, TS timestamp, Value float, Dimensions). Dimensions represents the columns storing the denormalized user-defined dimensions. The user-defined dimensions are cached in-memory and added to segments and data points when required during query processing using a hash-join, with an array used instead of a hash table (Tids are consecutive integers). Using the Segment View, ModelarDBv2 supports executing aggregate queries on segments using user-defined aggregate functions, which for simple queries are suffixed with _S, e.g., MAX_S. Functions performing aggregation in the time dimension are named after the aggregate and the level in the time hierarchy, e.g., CUBE_AVG_HOUR. All aggregate functions divide the result by the scaling constant of each time series as part of the final step. Queries performing aggregation using the user-defined dimensional hierarchy can be executed using a GROUP BY on the appropriate columns in the Segment View, reducing the problem to computing a simple aggregate on segments. As a result, in this section we describe how simple aggregate queries and multi-dimensional aggregate queries in the time dimension can be executed on a segment for distributive and algebraic functions [17].
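As a small illustration of the array-based join, the sketch below caches the denormalized dimension members per Tid and appends them to rows from either view; the row layout and the assumption that Tids start at one are simplifications.

```python
# Sketch of the array-based join described above: because Tids are small
# consecutive integers (assumed here to start at one), the dimension members
# can be attached by indexing an array instead of probing a hash table.
def build_dimension_cache(time_series_table):
    """time_series_table maps Tid to a tuple of denormalized dimension members."""
    cache = [None] * (max(time_series_table) + 1)   # index 0 is unused
    for tid, members in time_series_table.items():
        cache[tid] = members
    return cache

def add_dimensions(rows, cache):
    """Append the cached dimension members to each (Tid, ...) row."""
    return [row + cache[row[0]] for row in rows]

cache = build_dimension_cache({1: ("Turbine1", "Aalborg"), 2: ("Turbine2", "Aalborg")})
print(add_dimensions([(1, "2019-03-25 12:00:00", 7.1)], cache))
```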

6.2 Aggregate Queries

To allow queries to be expressed at the time series level instead of the time series group level, a mapping between Tids and Gids is performed as part of query processing using metadata from the Time Series table shown in Figure 6. As a result, queries provided by the user and the results returned by ModelarDBv2 only reference Tids, with Gids being utilized internally to simplify predicate push-down, as the segment store only needs to index one id per segment. While ModelarDBv1 only supports predicate push-down for Tid, StartTime and EndTime [23], ModelarDBv2 also supports predicate push-down for user-defined dimensions by rewriting all instances of a dimensional member in the WHERE clause to the Gids of the groups that include time series with that dimensional member.

The pseudo-code for executing aggregate queries using the Segment View is shown in Algorithm 5. In Line 8 the SQL query is rewritten to query segments in terms of time series groups by replacing the Tids and members in the SQL query's WHERE clause with the matching Gids at the master before the query is sent to each worker node. In Lines 9–10 each worker node initializes memory for storing the intermediate values and retrieves relevant segments from its data store. In Lines 11–13 the aggregate function passed as an argument is executed on each segment. Finally, in Line 15, to support both distributive and algebraic functions, any computation that must be performed on the intermediate results is performed. An example of a simple aggregate query executed on the Segment View is shown in Figure 11. First, all Tids in the query are rewritten to their corresponding Gids. Then for each segment the aggregate function specified in the query is executed. The aggregate function also applies the scaling constant. After the aggregate has been computed for all segments, the final aggregate is computed from the intermediate results, e.g., by computing an average.
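To make the constant-time aggregation concrete, the sketch below computes SUM over a segment represented by a linear model; the parameterization of the model as a slope and an intercept is an assumption, but any linear model admits the same closed form.

```python
# Sketch of a constant-time SUM over a segment represented by a linear model
# v(t) = slope * t + intercept that covers group_size time series.
def sum_linear_segment(start_time, end_time, si, slope, intercept, group_size):
    count = (end_time - start_time) // si + 1          # timestamps covered by the segment
    first = slope * start_time + intercept             # value at the first timestamp
    last = slope * end_time + intercept                # value at the last timestamp
    return group_size * count * (first + last) / 2     # arithmetic series, O(1) time

# A segment covering timestamps 100..500 with SI = 100 and values 1.0..5.0 for
# each of three time series sums to 3 * (1 + 2 + 3 + 4 + 5) = 45.
assert sum_linear_segment(100, 500, 100, 0.01, 0.0, 3) == 45.0
```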

6.3 Aggregation in the Time Dimension

As the schema shown in Figure 6 stores the start time and end time as part of each segment, aggregates in the time dimension can be computed using only the Segment table without an expensive join with a separate time dimension. The pseudo-code for executing aggregate queries in the time dimension using the Segment View is shown in Algorithm 6. The algorithm follows the same structure as Algorithm 5. First, in Line 9 the query is rewritten in terms of Gids instead of Tids and members, before each worker initializes memory for storing the intermediate results and retrieves relevant segments in Lines 10–11. In Lines 12–26 the algorithm iterates over each segment and computes an intermediate aggregate for each of the requested time intervals. Last, the final result is computed and returned in Line 28 to support distributive and algebraic functions.

An example of an aggregation in both a user-defined dimension and the time dimension using segments is shown in Figure 12. The query computes the sum per hour for a group of time series, using the function CUBE_SUM_HOUR to compute the result efficiently on segments instead of on data points. After rewriting the query, the aggregate is computed for the interval from the segment's start time until the next timestamp delimiting two aggregation intervals. Afterwards, the aggregate is computed for each full aggregation interval the segment spans. Last, the aggregate is computed for the interval from the last delimiting timestamp to and including the segment's end time. The last value is computed with an inclusive end time as ModelarDBv2, to increase the compression ratio, does not store connected segments [23].
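The interval splitting can be sketched as follows; the sketch assumes millisecond timestamps aligned with the sampling interval and delegates the per-interval aggregate to a function such as the constant-time SUM above.

```python
# Sketch of rolling a segment up to hourly intervals: the segment's time span
# is split at hour boundaries and each sub-interval is aggregated on the model.
HOUR_MS = 60 * 60 * 1000

def hourly_rollup(start_time, end_time, si, aggregate_on_model):
    results = {}                   # hour index since the epoch -> aggregate value
    current = start_time
    while current <= end_time:
        # last timestamp before the next hour boundary, clipped to the segment's end
        boundary = min((current // HOUR_MS + 1) * HOUR_MS - si, end_time)
        results[current // HOUR_MS] = aggregate_on_model(current, boundary)
        current = boundary + si
    return results
```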

Figure 12: Aggregation in the time dimension on a linear model representing a group of three time series

7 Evaluation

7.1 Overview and Evaluation Environment

We evaluate partitioning of correlated dimensional time series, MMGC, and efficient execution of multi-dimensional aggregate queries using models. For MMGC and query processing we compare ModelarDBv2 to the current state-of-the-art big data file formats used in industry (Apache ORC and Apache Parquet), systems used in industry (InfluxDB and Apache Cassandra), and the state-of-the-art for model-based compression ModelarDBv1 [23]. Apache Spark is used to execute queries on data stored in ORC, Parquet and Cassandra, and for InfluxDB queries are executed on a single node as distribution is not supported by the open-source version. Last, we evaluate the scalability of ModelarDBv2 using Microsoft Azure.

The hardware and software used for our seven-node local evaluation cluster is shown in Table 1. This cluster consists of one master node that functions as the Primary HDFS NameNode, Secondary HDFS NameNode and Spark Master, and six workers that function as Cassandra Nodes, HDFS Datanodes, and Spark Slaves. For each experiment we only keep the necessary software running and disable replication for all systems. Disk space usage is measured with the data on a single node using du. The default configuration of each system is used to the highest degree possible, with changed values shown in Table 1. The selected values were found to work well with the hardware configuration and the data sets. For parameters we change to evaluate their effect, all values are shown and the default highlighted in bold. The memory Spark can allocate is statically defined by spark.driver.memory and spark.executor.memory to prevent Cassandra or HDFS from crashing. To determine the memory for each setting we started at 4 GiB and lowered it until all experiments executed. For ModelarDBv2 we use the extended models described in Section 5.2: PMC [25], Swing [15], and Gorilla [28].

Hardware
  Processor: Intel Core i7-2620M
  Memory: 8 GiB of 1333 MHz DDR3
  Storage: 7,200 RPM Hard Drive
  Network: 1 Gbit Ethernet

Software
  Ubuntu GNU/Linux v16.04 LTS on ext4
  ModelarDB v2.0
    Model Error Bound: 0%, 1%, 5%, 10%
    Model Length Limit: 50
    Dynamic Split Fraction: 10
    Bulk Write Size: 50,000
  InfluxDB v1.4.2
  InfluxDB-Java v2.10
  Apache Hadoop v2.8.0
  Apache Spark v2.1.0
    spark.driver.memory: 4 GiB
    spark.executor.memory: 3 GiB
    spark.streaming.unpersist: false
    spark.streaming.stopGracefullyOnShutdown: true
    spark.sql.orc.filterPushdown: true
    spark.sql.parquet.filterPushdown: true
  Apache Cassandra v3.9
    batch_size_fail_threshold_in_kb: 50 MiB
    commitlog_segment_size_in_mb: 128 MiB
  DataStax Spark Cassandra Connector v2.0.3

Table 1: Evaluation environment

For the existing formats data points are stored using the Data Point View's schema: (Tid int, TS timestamp, Value float, Dimensions), where timestamp is each storage format's native timestamp type. For Cassandra (Tid, TS, Value) is used as the primary key, for InfluxDB all time series are stored as one measurement with the Tid as a tag, and for ORC and Parquet a file is created per time series and stored on HDFS in a folder with the name Tid=n so Spark can prune by Tid without reading each file.

7.2 Data Sets and Queries

Data Set “EP” This real-life data set consists of regular time series with gaps from energy production. The data set is provided by an energy trading company, has a sampling interval measured in seconds, and is collected over 508 days. Two dimensions are available: Production: Entity → Type and Measure: Concrete → Category. In total the data is 339 GiB in size when stored as uncompressed CSV.

Data Set “EH” This real-life data set consists of regular time series with gaps from energy production. The data was collected by us with an approximate sampling interval of 100 milliseconds using an OPC Data Access server running on a Windows server. As pre-processing, the time stamps are rounded to the nearest 100 milliseconds, and data points with equivalent timestamps due to the rounding have been removed. This pre-processing step is only required due to limitations of the collection process and would not be present in a production setup. The data set contains two dimensions: Location: Entity → Park → Country and Measure: Concrete → Category. In total the data is 582.68 GiB in size when stored as uncompressed CSV.

Queries We use a set of small simple aggregate queries to evaluate ModelarDBv2 for interactive analysis (S-AGG), a set of large-scale simple aggregate queries to evaluate scalability (L-AGG), a set of medium-scale multi-dimensional aggregate queries to evaluate reporting (M-AGG), and a set of point/range queries to evaluate extraction of sub-sequences (P/R). Half of S-AGG consists of aggregates on one time series, with the other half consisting of GROUP BY queries on five time series using Tid to GROUP BY. L-AGG consists of queries aggregating the full data set, with half being GROUP BY queries that GROUP BY Tid. M-AGG consists of multi-dimensional aggregate queries with the WHERE clause containing the member indicating energy production. Half the queries GROUP BY month and dimension while the others GROUP BY month, dimension, and Tid. P/R consists of time point and range queries restricted by WHERE clauses with either TS or Tid and TS.

Figure 13: Ingestion, EP
Figure 14: Storage, EP
Figure 15: Storage, EH
Figure 16: Models, EP
Figure 17: Models, EH
Figure 18: Distance
Figure 19: L-AGG, EP
Figure 20: Scale-out, L-AGG

7.3 Experiments

Ingestion Rate The ingestion rate is primarily evaluated on a single worker as the open-source version of InfluxDB does not support distribution. For each system we ingest a subset of files from EP representing different measures of energy production. This subset consists of gzipped CSV files (6.59 GiB). For InfluxDB we use the Java client library InfluxDB-Java with a batch size of 50,000. The dimensions are read from a 6.7 MiB CSV file. ModelarDBv2 stores the dimensions as described in Section 3.3. For the existing formats the denormalized dimensions are appended to the data points using an in-memory cache. We also measure the ingestion rate of ModelarDBv2 using all six worker nodes of the cluster to evaluate its scalability when ingesting, using two scenarios: Bulk Loading (B) without queries and Online Analytics (O) with aggregate queries executed on random time series using the Segment View during ingestion. On one node ModelarDBv2 uses a single ingestor, while Spark Streaming with a five-second micro-batch interval and a receiver per node is used when running on the cluster.

The results can be seen in Figure 13. As expected, InfluxDB and Cassandra perform the worst as they are designed to be queried during ingestion. ModelarDBv2 can also execute queries during ingestion but ingests the data set 5.5 times faster than InfluxDB and 11 times faster than Cassandra on a single node. Even when compared to Parquet and ORC, which are unsuitable for online analytics since they cannot be queried before a file is completely written, ModelarDBv2 is 2.59 and 2.93 times faster, respectively. Compared to ModelarDBv1, ModelarDBv2 is 2.10 times faster. When the existing formats do not ingest dimensions, ModelarDBv2's ingestion with dimensions is 1.69–5.5 times faster. On six worker nodes ModelarDBv2 achieves a 4.48 times speedup for bulk loading and a 4.11 times speedup when also executing queries. In summary, ModelarDBv2 provides a higher ingestion rate than the existing formats due to an efficient model-agnostic ingestion method and state-of-the-art compression while also supporting online analytics.

Effect of Error Bound We evaluate the benefit of MMGC using EP and EH. We ingest both using 0%, 1%, 5%, and 10% error bounds and the best combination of correlation primitives we found for each data set despite our limited domain knowledge. For systems that do not support approximation, the error bound is 0%. For each data set we present the storage used, the models used, and the actual average error, calculated as the average relative difference between each ingested value and its approximated value. As many time series in EP are correlated, MMGC should significantly reduce the storage required, while for EH MMGC should only provide a benefit with a high error bound as these time series are much less correlated.

The results for EP can be seen in Figure 14, with ModelarDBv2 using up to 16.19 times less storage than the other formats. Correlation is set as Production 0, Measure 1 ProductionMWh as the data set does not contain location information but has multiple different measurements for energy production per entity. Compared to the state-of-the-art model-based ModelarDBv1, ModelarDBv2 provides a 1.45 times reduction for a 0% error bound, 1.46 times for up to 1% with an average error of 0.02%, 1.51 times for up to 5% with an average error of 0.17%, and 1.54 times for up to 10% with an average error of 0.34%. The results for EH can be seen in Figure 15, with correlation defined by the lowest distance (0.16666667) using our rule of thumb from Section 4.1. For EH ModelarDBv2 reduces the storage required by up to 65.28 times compared to all existing formats except ModelarDBv1 when a low error bound is used. It is expected that ModelarDBv1 provides slightly better compression than ModelarDBv2 for EH as these time series only exhibit very limited correlation, and with a low error bound even small deviations can make a model exceed the error bound. In addition, ModelarDBv2 still outperforms all other formats and the difference in the storage required is minimal, as for a 0% error bound the increase is only 1.18 times, for 1% it is only 1.15 times, for 5% it is only 1.004 times, and for 10% ModelarDBv2 reduces the storage required by 1.22 times while only having an average actual error of 2.03%. Figures 16 and 17 show that all models were used in different combinations for both data sets. In summary, ModelarDBv2 provides better compression than existing storage formats by dynamically selecting appropriate combinations of models for each data set and error bound pair using MMGC.

Figure 21: S-AGG, EP
Figure 22: S-AGG, EH
Figure 23: P/R, EP
Figure 24: P/R, EH
Figure 25: M-AGG-One, EP
Figure 26: M-AGG-Two, EP
Figure 27: M-AGG-One, EH
Figure 28: M-AGG-Two, EH

Effect of Distance We evaluate the effectiveness of specifying correlation as a distance by ingesting EP and EH with increasing distances between the time series in each data set, until all data sets require more space than at the lowest distance. The number of dimensions and levels limit the possible distances, e.g., for EP only a few distinct distances are possible without weights. As both dimensions in EP have two levels they have the same impact on the distance. However, as the Measure dimension is a stronger indicator of correlation, the weight of Production is increased so only groups with equivalent members in the Production dimension are grouped.

The results for EP and EH are shown in Figure 18, and as expected only the lowest distance provides a decrease in the storage required, as increasing the distance creates inappropriate groupings of time series. This fits with our rule of thumb to use the lowest distance as a start when specifying correlation. When using the lowest distance we see only a 1.14–1.29 times increase in storage compared to our manually tuned results for EP (still up to 14.2 times lower than the existing formats), while distance-based correlation outperforms our manual tuning attempts for EH. In summary, even without domain knowledge, MMGC can be used successfully with distance-based specification of correlation and a simple rule of thumb.

Scale-out We evaluate the scalability of ModelarDBv2 using two experiments. First, we compare it against the existing formats when executing L-AGG on the cluster. Second, we evaluate the system's ability to scale when executing L-AGG using 1–32 Standard_D8_v3 nodes on Microsoft Azure. The node type is selected based on the documentation for Spark, Cassandra, and Azure [3, 1, 2]. The configuration from the local cluster is used on Azure with the exception that Spark is allowed 50% of each node's memory, as no crashes occur with this initial configuration. Thus, ModelarDBv2 cannot simply cache the entire data set in memory. EP is duplicated until the data ingested by each node is at least equal to its memory. To ensure duplicate values do not skew the results, the values of each duplicated data set are multiplied with a random value in the range [0.001, 1.001). Queries are executed using the most appropriate method for each system: InfluxDB's command-line interface (CLI), ModelarDB's Segment View (SV) and Data Point View (DPV), and for Cassandra, Parquet, and ORC a Spark SQL Data Frame (S).

The results for the cluster can be seen in Figure 19. ModelarDBv2 outperforms almost all of the existing formats, with Parquet being just 1.16 times faster due to the benefits of its column-based layout for simple aggregate queries on a single column. However, compared to ModelarDBv2, Parquet has multiple downsides as it is 2.59 times slower to ingest data, does not allow for online analytics, and uses 11.59 times more storage for EP. We are unable to execute L-AGG on InfluxDB as the open-source version does not support distribution and fails due to memory limitations on a single node with all 8 GiB available to it and the OS. ModelarDBv2 executes L-AGG on a single worker node in just 6.63 hours using the Segment View. Also, we have previously shown that ModelarDBv1 outperforms InfluxDB at scale [23], and ModelarDBv2 is 1.25 times faster than ModelarDBv1. The results for Azure are shown in Figure 20, with ModelarDBv2 scaling linearly for both the Segment View and the Data Point View up to 32 nodes. This is expected as ModelarDBv2 assigns each time series to a specific node, allowing the queries to be answered without shuffling. In summary, ModelarDBv2 provides either faster (up to 59.34 times) or at least comparable query performance compared to the existing formats for large-scale queries, while also providing faster ingestion (up to 11 times), supporting online analytics, providing better compression (up to 65.28 times), and scaling linearly when additional nodes are added.

Additional Query Processing Performance To further evaluate the query performance of ModelarDBv2, we execute S-AGG, P/R and M-AGG on all data sets using the same query interface as for the scale-out experiments. However, M-AGG cannot be executed for InfluxDB as it can only aggregate time intervals with a fixed size, e.g., an hour or a day [7, 5]. In addition, as InfluxDB has no DatePart functionality, aggregates over, e.g., the days of months as supported by ModelarDBv2 are not natively supported [6].

The results for S-AGG are in Figures 21 and 22. As expected, for EP ModelarDBv2 is slightly slower than most of the existing formats as a group of time series must be read from disk even if the query only uses one time series. Despite this overhead, the only format with support for online analytics that is faster than ModelarDBv2 is InfluxDB, and it is only 2 times faster. The results for EH are similar, although as EH consists of fewer but longer time series than EP, the overhead of reading a group is larger. Here Parquet is 28.93 times faster than ModelarDBv2 due to the benefits of its column-based layout for simple aggregate queries on a single column; however, ModelarDBv2 provides 2.59 times faster ingestion and uses 54.04 times less storage. InfluxDB is the only format that supports online analytics that is faster than ModelarDBv2 for EH (1.45 times). However, while InfluxDB performs well for the queries in S-AGG, ModelarDBv2 provides 5.5 times faster ingestion, uses less storage (2.48 times for EP and 2.19 times for EH), and executes queries that InfluxDB cannot, as shown in Figures 13, 14, 15, and 19, respectively.

Point queries and range queries are not the intended use case for ModelarDBv2 as a point query might read a large segment representing multiple time series from disk. Due to this overhead, MMGC is never a benefit for point queries and range queries on individual time series when compared to MMC. However, for completeness we evaluate the overhead of MMGC for such queries with a comparison to ModelarDBv1. The results for P/R can be seen in Figures 23 and 24. As expected, for both data sets ModelarDBv2 is slower than ModelarDBv1 due to the overhead of reading groups from disk. For EP ModelarDBv2 is only 3.5% slower than ModelarDBv1, while ModelarDBv2 is 5.25 times slower than ModelarDBv1 for EH as the grouped time series in EH are less correlated than in EP.

The results for M-AGG on EP can be seen in Figures 25 and 26. For M-AGG-One in Figure 25 the queries GROUP BY category, matching the groups created for EP when ingesting the data. As the queries aggregate energy production by month and GROUP BY the correlated time series, ModelarDBv2 only reads data necessary for each query and outperforms the existing formats by 1.84–55.47 times using the Segment View. For M-AGG-Two in Figure 26 the queries GROUP BY concrete to drill down one level below that used for partitioning. However, contrary to pre-computed aggregates, ModelarDBv2 can execute separate queries on each time series in a group, so changing the level of aggregation does not impact the performance. For M-AGG-Two on EP ModelarDBv2 is the fastest by 2.20–57.17 times. The results for M-AGG on EH are similar to those for EP and can be seen in Figures 27 and 28. For M-AGG-One the queries GROUP BY park and ModelarDBv2 is 1.05–82.45 times faster than the existing formats, while for M-AGG-Two the queries GROUP BY entity and ModelarDBv2 is 1.12–91.92 times faster.

In summary, for simple aggregate queries ModelarDBv2 provides competitive performance despite the overhead caused by MMGC when querying individual time series. For point and range queries, this overhead is, as expected, more pronounced since ModelarDBv2 was not designed for such queries. For multi-dimensional aggregate queries, ModelarDBv2 fully benefits from its use of MMGC and outperforms all existing formats by up to 91.92 times, even when drilling down below the level at which the data is grouped.

8 Related Work

We summarize papers about model-based time series management and model-based OLAP. There are surveys about model-based time series management [32, 21], Hadoop OLAP [30], and TSMSs [22].

Multi-Model Compression: MMC was proposed in [26, 27]. Models are fitted to a time series in parallel until they all fail, and the model with the highest compression ratio is then stored. The Adaptive Approximation (AA) algorithm [31] fits models in parallel and creates segments as each model fails. After all models have failed, the segments from the model with the highest compression ratio are stored. In [14] regression models are fitted in sequence, with coefficients added as required by the error bound. The model providing the best compression ratio is stored when the maximum number of coefficients is reached.
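As an illustration of the general MMC approach, the following Scala sketch fits a set of candidate models to the same stream of values and keeps the one with the best compression ratio. The Model trait and its methods are simplified placeholders and do not reflect the APIs of [26, 27, 31, 14] or ModelarDBv2.

```scala
// A minimal sketch of multi-model compression (MMC): candidate models are
// fitted to the same stream of values until every model has failed, and the
// model yielding the best compression ratio is emitted. The Model trait and
// its methods are simplified placeholders, not an actual model API.
trait Model {
  def append(value: Double): Boolean // false once the error bound would be exceeded
  def length: Int                    // number of data points represented so far
  def sizeInBytes: Int               // bytes needed to store the model's parameters
  def compressionRatio: Double = length * 4.0 / sizeInBytes // vs. 4-byte floats
}

object MmcSketch {
  def compressSegment(values: Iterator[Double], candidates: Seq[Model]): Model = {
    var active = candidates
    while (values.hasNext && active.nonEmpty) {
      val v = values.next()
      active = active.filter(_.append(v)) // drop models that can no longer fit v
    }
    // Every model has either failed or consumed the stream; keep the best one.
    candidates.maxBy(_.compressionRatio)
  }
}
```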

Model-Based Group Compression: MGC has primarily been used for distributed data acquisition instead of centralized compression. An overview and comparison is given in [37]. GAMPS [16] performs MGC at a central location by approximating each time series using constant functions. Afterwards, the error bound is relaxed and overlapping models are compressed together, possibly with amplitude scaling. Static grouping is done using an approximation algorithm, with the groups re-computed at run-time using dynamically sized windows.
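The grouping step can be sketched as follows, assuming each series has already been reduced to a constant approximation. The greedy strategy below is a simplification of the approximation algorithm used by GAMPS [16], omits amplitude scaling, and uses illustrative names throughout.

```scala
// A greedy sketch of the grouping idea behind GAMPS: each series is first
// reduced to a constant approximation, and series whose constants stay
// within a relaxed error bound of a base series are grouped together.
object GroupingSketch {
  def constantApprox(values: Seq[Double]): Double =
    values.sum / values.length

  def groupSeries(series: Map[Int, Seq[Double]], relaxedBound: Double): List[List[Int]] = {
    val approx = series.map { case (tid, vs) => tid -> constantApprox(vs) }
    var remaining = approx.keys.toList.sorted
    var groups = List.empty[List[Int]]
    while (remaining.nonEmpty) {
      val base = remaining.head
      val (inGroup, rest) = remaining.partition { tid =>
        math.abs(approx(tid) - approx(base)) <= relaxedBound
      }
      groups = groups :+ inGroup
      remaining = rest
    }
    groups
  }
}
```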

Model-Based Data Management Systems: Database Management Systems with explicit support for using mathematical models for data cleaning or compression have also been proposed. MauveDB [12] integrates the use of models into a Relational Database Management System (RDBMS) using views, to support data cleaning without needing to export the data to an external application. FunctionDB [34] natively supports models in the form of polynomial functions, allowing queries to be evaluated directly on models when possible. Plato [24] supports models for cleaning and has a framework for adding user-defined models that integrate with the system’s optimizer and query processor. The system in [18] uses an in-memory tree-based index, a distributed key-value store, and MapReduce to allow segments to be stored and queried in a distributed system. ModelarDBv1 [23] provides distributed model-based time series management using MMC with user-defined models by integrating the portable ModelarDBv1 Core with Spark and Cassandra.

Model-Based OLAP: Another use of model-based time series compression is approximate materialization of data cubes. Perera et al. [29] propose offline algorithms for finding similarities between time series aggregates in an OLAP cube; similar aggregates can then be materialized as a model, or as a model and an offset, to reduce the size of a materialized cube. A similar method for online data cubes was proposed by Shaikh et al. [33]. Using models, an approximate data cube is materialized in memory. As data points are ingested, the in-memory data cube is updated and the data points are written to disk for persistence. To preserve memory, models representing the oldest data may also be flushed to disk.

ModelarDBv2: In contrast to existing model-based compression algorithms [26, 27, 31, 14, 16] and model-based systems [12, 34, 24, 18, 23], ModelarDBv2 utilizes models for compression and unifies MMC and MGC into the novel MMGC method for efficient compression of time series. In addition, a simple API allows users to add user-defined models without recompiling ModelarDBv2. Compared to other OLAP systems [35, 20, 10, 38, 9, 36, 19, 11], ModelarDBv2 executes multi-dimensional aggregate queries on models. Also, while the existing model-based approaches for OLAP [29, 33] store both the raw data points and the models, ModelarDBv2 stores only the highly compressed models. In summary, ModelarDBv2 provides state-of-the-art compression and query performance for dimensional time series by compressing correlated time series as one sequence of models and executing OLAP queries on models.

9 Conclusion & Future Work

Motivated by the need for a system that can efficiently both store and perform multi-dimensional analysis of the large amounts of data produced by reliable sensors, we presented ModelarDBv2, a distributed model-based TSMS that achieves state-of-the-art compression and query performance by exploiting correlation between time series using a set of arbitrary models (optionally user-defined). To achieve this we presented multiple novel contributions: (i) the novel concept of Multi-model Group Compression and the extensions to models needed to support it, (ii) a set of primitives that simplify describing correlation between time series for data sets of any size without requiring historical data, and (iii) query processing algorithms for efficiently evaluating multi-dimensional aggregate queries directly on models. For distributed query processing and storage, ModelarDBv2 uses the stock versions of Apache Spark and Apache Cassandra, respectively. Through an evaluation we demonstrated that, compared to existing systems, ModelarDBv2 provides faster ingestion, significantly reduces the storage requirement by adaptively selecting appropriate models for dynamically sized segments, and provides much faster or at least similar query performance for aggregate queries.

For future work, we plan to simplify the use of ModelarDBv2 and increase its query performance: (i) Developing indexing techniques that exploit that data is stored as user-defined models. (ii) Supporting high-level analytical queries, e.g., similarity search, to be performed directly on user-defined models. (iii) Either removing or automatically inferring parameter arguments.

10 Acknowledgments

This research was supported by the DiCyPS center funded by Innovation Fund Denmark [13], the GOFLEX project EU grant agreement No 731232 [4], and Microsoft Azure for Research [8].

References

  • [1] Apache Cassandra - Hardware Choices. http://cassandra.apache.org/doc/latest/operating/hardware.html. Viewed: 2019-01-31.
  • [2] Apache Spark - Hardware Provisioning. https://spark.apache.org/docs/2.1.0/hardware-provisioning.html. Viewed: 2019-01-31.
  • [3] Azure Databricks. https://azure.microsoft.com/en-us/pricing/details/databricks/. Viewed: 2019-01-31.
  • [4] GOFLEX. https://goflex-project.eu/. Viewed: 2019-01-31.
  • [5] InfluxDB - Issue 3991. https://github.com/influxdata/influxdb/issues/3991. Viewed: 2019-01-31.
  • [6] InfluxDB - Issue 6723. https://github.com/influxdata/influxdb/issues/6723. Viewed: 2019-01-31.
  • [7] InfluxQL reference - Durations. https://docs.influxdata.com/influxdb/v1.4/query_language/spec/#durations. Viewed: 2019-01-31.
  • [8] Microsoft Azure for Research. https://www.microsoft.com/en-us/research/academic-program/microsoft-azure-for-research/. Viewed: 2019-01-31.
  • [9] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in spark. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015.
  • [10] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proceedings of the VLDB Endowment, 3(1-2):1459–1468, 2010.
  • [11] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, et al. The Snowflake Elastic Data Warehouse. In Proceedings of the SIGMOD International Conference on Management of Data, pages 215–226. ACM, 2016.
  • [12] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database systems. In Proceedings of the SIGMOD International Conference on Management of Data, pages 73–84. ACM, 2006.
  • [13] DiCyPS - Center for Data-Intensive Cyber-Physical Systems. http://www.dicyps.dk/dicyps-in-english/. Viewed: 2019-01-31.
  • [14] F. Eichinger, P. Efros, S. Karnouskos, and K. Böhm. A time-series compression technique and its application to the smart grid. The VLDB Journal, 24(2):193–218, 2015.
  • [15] H. Elmeleegy, A. K. Elmagarmid, E. Cecchet, W. G. Aref, and W. Zwaenepoel. Online piece-wise linear approximation of numerical streams with precision guarantees. Proceedings of the VLDB Endowment, 2(1):145–156, 2009.
  • [16] S. Gandhi, S. Nath, S. Suri, and J. Liu. Gamps: Compressing multi sensor data by grouping and amplitude scaling. In Proceedings of the SIGMOD International Conference on Management of Data, pages 771–784. ACM, 2009.
  • [17] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and Knowledge Discovery, 1(1):29–53, 1997.
  • [18] T. Guo, T. G. Papaioannou, and K. Aberer. Efficient Indexing and Query Processing of Model-View Sensor Data in the Cloud. Big Data Research, 1:52–65, 2014.
  • [19] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1917–1923. ACM, 2015.
  • [20] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson, O. O’Malley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang. Major technical advancements in Apache Hive. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1235–1246. ACM, 2014.
  • [21] N. Q. V. Hung, H. Jeung, and K. Aberer. An evaluation of model-based approaches to sensor data compression. IEEE Transactions on Knowledge and Data Engineering, 25(11):2434–2447, 2013.
  • [22] S. K. Jensen, T. B. Pedersen, and C. Thomsen. Time Series Management Systems: A Survey. IEEE Transactions on Knowledge and Data Engineering, 29(11):2581–2600, Nov 2017.
  • [23] S. K. Jensen, T. B. Pedersen, and C. Thomsen. ModelarDB: Modular Model-based Time Series Management with Spark and Cassandra. Proceedings of the VLDB Endowment, 11(11):1688–1701, July 2018.
  • [24] Y. Katsis, Y. Freund, and Y. Papakonstantinou. Combining Databases and Signal Processing in Plato. In Proceedings of the Biennial Conference on Innovative Data Systems Research, 2015.
  • [25] I. Lazaridis and S. Mehrotra. Capturing sensor-generated time series with quality guarantees. In IEEE Transactions on Knowledge and Data Engineering, pages 429–440. IEEE, 2003.
  • [26] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards online multi-model approximation of time series. In Proceedings of the International Conference on Mobile Data Management, volume 1, pages 33–38. IEEE, 2011.
  • [27] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards Online Multi-Model Approximation of Time Series. Technical report, EPFL LSIR, 2011.
  • [28] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A fast, scalable, in-memory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
  • [29] K. S. Perera, M. Hahmann, W. Lehner, T. B. Pedersen, and C. Thomsen. Modeling Large Time Series for Efficient Approximate Query Processing. In Revised Selected Papers from the DASFAA International Workshops, SeCoP, BDMS, and Poster, pages 190–204. Springer, 2015.
  • [30] M. Ptiček and B. Vrdoljak. Mapreduce research on warehousing of big data. In 40th International Convention on Information and Communication Technology, Electronics and Microelectronics, 2017.
  • [31] J. Qi, R. Zhang, K. Ramamohanarao, H. Wang, Z. Wen, and D. Wu. Indexable online time series segmentation with error bound guarantee. World Wide Web, 18(2):359–401, 2015.
  • [32] S. Sathe, T. G. Papaioannou, H. Jeung, and K. Aberer. A survey of model-based sensor data acquisition and management. In Managing and Mining Sensor Data, pages 9–50. Springer, 2013.
  • [33] S. A. Shaikh and H. Kitagawa. Approximate OLAP on Sustained Data Streams. In Proceedings of the International Conference on Database Systems for Advanced Applications, Part II, pages 102–118. Springer, 2017.
  • [34] A. Thiagarajan and S. Madden. Querying continuous functions in a database system. In Proceedings of the SIGMOD International Conference on Management of Data, pages 791–804. ACM, 2008.
  • [35] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive-a petabyte scale data warehouse using hadoop. In Proceedings of the International Conference on Data Engineering, pages 996–1005. IEEE, 2010.
  • [36] S. Vitthal Ranawade, S. Navale, A. Dhamal, K. Deshpande, and C. Ghuge. Online Analytical Processing on Hadoop using Apache Kylin. International Journal of Applied Information Systems, 12:1–5, 05 2017.
  • [37] B. Wang, Y. Song, Y. Sun, and J. Liu. Improvements to Online Distributed Monitoring Systems. In 2016 IEEE Trustcom/BigDataSE/ISPA, pages 1093–1100. IEEE, 2016.
  • [38] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: A Real-time Analytical Data Store. In Proceedings of the SIGMOD International Conference on Management of Data, pages 157–168. ACM, 2014.