1 Introduction
Companies maintaining critical infrastructure, e.g., for energy production, benefit from monitoring with a high degree of coverage and from data points sampled at a high frequency. To facilitate this in the energy domain, entities such as wind turbines are monitored by high-quality sensors with wired power and connectivity. As a result, invalid, missing, and out-of-order readings are rare, and all except missing values can be corrected using established methods. In addition to data points, metadata, e.g., location and sensor type, is stored for each time series to support analysis along multiple dimensions. However, due to the large number of data points being produced, only simple aggregates are stored, removing outliers and fluctuations as a result. As a remedy, model-based storage allows for compression of time series within a known error bound (possibly zero) [32, 21]. A model is any representation from which the original time series can be reconstructed within a known error bound. Model-based storage of time series has been improved through Multi-Model Compression (MMC) and Model-Based Group Compression (MGC). MMC exploits that the structure of time series changes over time and compresses each time series using multiple models [26, 27, 31, 14, 23]. MGC exploits that time series are correlated, e.g., temperature sensors in close proximity likely report similar values, and compresses correlated time series as one stream of models [32, 16]. MGC is illustrated in Figure 1. In the example, a single linear function is used to represent three correlated time series, creating a mapping from a timestamp to an approximated value for the three values observed at that timestamp.
However, to our knowledge no method for MMC exploits the correlation between time series, while existing methods for MGC each only utilize a single type of model. In this paper, we focus on the novel problem of compressing groups of correlated time series with user-defined dimensions using both MMC and MGC. We name this new type of compression Multi-Model Group Compression (MMGC). We demonstrate that MMGC is suitable for use with a TSMS by extending the open-source MMC TSMS ModelarDB [23] with MMGC. To differentiate between the two versions of ModelarDB, we use ModelarDB_{v1} for the original version and ModelarDB_{v2} for our version extended with MMGC. We also demonstrate how multi-dimensional aggregate queries can be performed much more efficiently on models than on data points. As a result, ModelarDB_{v2} provides a high compression ratio for time series data, distributed storage and query processing for scalability, stream processing for low latency, and efficient support for multi-dimensional aggregate queries on time series. In summary, we make the following contributions in the area of big data systems:

- The concept of Multi-Model Group Compression (MMGC) and extensions of existing models for compressing groups of time series.
- Primitives for partitioning time series into groups of correlated time series based on a dimensional hierarchy and user hints.
- Algorithms for performing simple aggregate and multi-dimensional aggregate queries on models representing multiple time series.
- The TSMS ModelarDB_{v2} implementing our methods for partitioning, Multi-Model Group Compression, and query processing.
- An evaluation of ModelarDB_{v2} and its algorithms for partitioning, Multi-Model Group Compression, and query processing.
The structure of the paper is as follows. Definitions are provided in Section 2. Section 3 provides an overview of ModelarDB_{v2}. Section 4 documents our partitioning primitives, while Section 5 describes our MGC extensions to existing models. In Section 6 our query processing algorithms are described. An evaluation of ModelarDB_{v2} is given in Section 7. Related work is presented in Section 8. Last, Section 9 provides our conclusion and future work.
2 Preliminaries
We now provide definitions for use in the paper, along with an intuitive understanding of each definition using examples. As ModelarDB_{v2} extends ModelarDB_{v1}, Definitions 1–6 are from [23].
Definition 1 (Time Series)
A time series TS is a sequence of data points, in the form of time stamp and value pairs, ordered by time in increasing order: TS = ⟨(t_1, v_1), (t_2, v_2), ...⟩. For each pair (t_i, v_i), the time stamp t_i represents the time when the value v_i ∈ ℝ was recorded. A time series consisting of a fixed number of data points is a bounded time series.
Definition 2 (Regular Time Series)
A time series is considered regular if the time elapsed between each pair of consecutive data points is always the same, i.e., t_{i+1} − t_i = t_{i+2} − t_{i+1} for 1 ≤ i ≤ n − 2, and irregular otherwise.
Definition 3 (Sampling Interval)
The sampling interval SI of a regular time series is the time elapsed between each pair of consecutive data points in the time series: SI = t_{i+1} − t_i for 1 ≤ i ≤ n − 1.
To exemplify the definitions we use a time series TS. Each pair in TS is a recorded time stamp and a value, with the time stamps measured in milliseconds since recording started. To construct a bounded time series we can consider a subset of the data points. Both versions of TS are regular and have the same sampling interval SI in milliseconds.
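As a sketch of Definitions 2 and 3, the following Python snippet checks whether a bounded time series is regular and derives its sampling interval; the data points are hypothetical and only illustrative:

```python
from typing import List, Optional, Tuple

def sampling_interval(ts: List[Tuple[int, float]]) -> Optional[int]:
    """Return the sampling interval SI of a regular time series
    (Definition 3), or None if the series is irregular (Definition 2)."""
    deltas = {ts[i + 1][0] - ts[i][0] for i in range(len(ts) - 1)}
    return deltas.pop() if len(deltas) == 1 else None

# Hypothetical bounded time series with time stamps in milliseconds.
regular = [(100, 28.3), (200, 30.7), (300, 28.3), (400, 28.3)]
irregular = [(100, 28.3), (200, 30.7), (400, 28.3)]

print(sampling_interval(regular))    # 100
print(sampling_interval(irregular))  # None
```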
Definition 4 (Model)
A model is a representation of a time series TS using a pair of functions M = (m, e). For each t_i, 1 ≤ i ≤ n, the function m is a real-valued mapping from t_i to an estimate of the value v_i for the corresponding data point in TS. The function e is a mapping from a time series TS and the corresponding m to a positive real value representing the error of the values estimated by m.

A model can be fitted to the bounded subset of TS using, e.g., a linear function, with the uniform error norm used for the error function e. Such a model represents TS within the resulting error.
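Definition 4 can be illustrated with a small sketch: a hand-picked linear function m is fitted to a hypothetical bounded time series, and the error function e is the uniform (L∞) error norm. All values here are made up for illustration:

```python
def uniform_error(ts, m):
    """The uniform error norm: the largest absolute deviation between an
    actual value and the value estimated by the model function m."""
    return max(abs(v - m(t)) for t, v in ts)

# Hypothetical bounded time series and a hand-picked linear model.
ts = [(100, 10.0), (200, 12.0), (300, 13.5)]
m = lambda t: 10.0 + (t - 100) / 100 * 1.75  # estimates 10.0, 11.75, 13.5

print(uniform_error(ts, m))  # 0.25, the deviation at t = 200
```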
Definition 5 (Gap)
A gap between a regular bounded time series TS_1 = ⟨..., (t_s, v_s)⟩ and a regular time series TS_2 = ⟨(t_e, v_e), ...⟩ with the same sampling interval SI and recorded from the same source, is a pair of time stamps G = (t_s, t_e) with t_e = t_s + m × SI, m ∈ ℕ, m ≥ 2, and where no data points exist between t_s and t_e.
Definition 6 (Regular Time Series with Gaps)
A regular time series with gaps is a regular time series where v_i ∈ ℝ ∪ {⊥} for 1 ≤ i ≤ n. For a regular time series with gaps, a gap is a maximal subsequence ⟨(t_s, ⊥), ..., (t_e, ⊥)⟩ where v_i = ⊥ for t_s ≤ t_i ≤ t_e.
A gap is shown in Figure 2. For simplicity, time series from the same source separated by gaps will be referred to as a time series with gaps. As a concrete example, consider a time series that contains a gap given as a pair of time stamps. As it contains a gap, it is an irregular time series with an undefined SI. However, it can also be represented as a regular time series with gaps by inserting ⊥ for the missing values, in which case its SI in milliseconds is well defined.
Definition 7 (Dimension)
A dimension D with members is a 3-tuple where (i) the members are hierarchically organized descriptions of the time series in the set of time series, with the special value ⊤ as the top element of the hierarchy; (ii) the mapping from time series to members is surjective; (iii)–(vi) the remaining conditions ensure that each member (except ⊤) at a level has exactly one parent at the level above and that the hierarchy is rooted at ⊤.
A time series belongs to a dimension's most detailed level, which has no descendants. Each member (except ⊤) at a level has a parent at the level above. This allows users to do analysis at different levels by grouping on a level. To better describe the relation of the time series to real-world entities, we write dimensions using named levels. For example, for time series collected from wind turbines a location dimension could be defined as Turbine → Park → Region → Country → ⊤. For a time series, the dimension then provides a member for the Turbine level and a member for the Park level. If the time series is collected from a sensor on a wind turbine placed in Aalborg, the member for the first level is the turbine's id, while the member for the next level is Aalborg, and so on until the mapping returns ⊤, indicating the top of the hierarchy.
Definition 8 (Time Series Group)
A time series group TSG is a set of regular time series, possibly with gaps, where all time series in TSG have the same sampling interval SI and aligned time stamps, i.e., for all TS_i, TS_j ∈ TSG it holds that t_1^i mod SI = t_1^j mod SI, where t_1^i and t_1^j are the first time stamps of TS_i and TS_j, respectively.
For example, a time series group can contain a time series and a regular time series with gaps as long as both have the same SI in milliseconds. An irregular time series cannot be in the set as it does not have a well-defined SI.
Definition 9 (Segment)
A segment S for a time series group TSG is a 6-tuple representing the data points for a bounded time interval of TSG. The 6-tuple consists of a start time t_s, an end time t_e, a sampling interval SI, a function which for each time series in TSG gives the set of time stamps for which the time series has gaps, a model M, and an error bound ε, where the values of all other time stamps are defined by the model M within the error bound ε.
To ensure a model-based representation of time series does not exceed an error bound, the time series can be split into segments. As data points are ingested, segments are created to represent the time series within the user-defined error bound as shown in Figure 3. To illustrate this, consider three time series represented with a single linear function, creating an approximation with some error when using the uniform error norm. If the error bound, e.g., is 10 and the approximation error is within it, a single segment is created for the three time series.
In this paper we focus on using MMGC to compress unbounded regular time series, possibly with gaps and dimensions, while the time series are being ingested by a TSMS and analyzed using data warehouse style Online Analytical Processing (OLAP) queries.
3 Architecture
3.1 Overview
ModelarDB_{v2} is a novel distributed model-based TSMS designed as a portable library, ModelarDB_{v2} Core, that is simple to interface with existing software. We interface it with the stock versions of Apache Spark for query processing and Apache Cassandra for storage in a master/worker architecture. ModelarDB_{v2} implements MMGC by adding a Partitioner component and making changes to all of ModelarDB_{v1}'s components [23]. The Partitioner takes as input a set of dimensional time series and partitions them into groups based on user hints. To prevent data skew, each group is assigned to the worker with the most available resources. During ingestion the system automatically selects an appropriate model for each dynamically sized subsequence of each time series group. Three models, extended to support MGC, are included in ModelarDB_{v2} Core: the constant PMC-Mean model (PMC) [25], the linear Swing model (Swing) [15], and the lossless compression algorithm for floating-point values proposed for Gorilla (Gorilla) [28]. Users can optionally implement more models through an extension API without recompiling ModelarDB_{v2}. For query processing ModelarDB_{v2} uses SQL and expands the Segment View and Data Point View proposed for ModelarDB_{v1} [23]. The Segment View allows aggregates to be executed efficiently on segments, e.g., SUM on a linear model uses constant time, while queries on the Data Point View are executed on reconstructed data points.

The architecture of each worker node in ModelarDB_{v2} is split into three sets of components as shown in Figure 4. In Figure 4 each component is annotated with the software providing that functionality, and components that have been modified for ModelarDB_{v2} are shown with a gray gradient. Components outside the dashed lines are implemented as part of the master node. Data Ingestion ingests time series and constructs models within a user-defined error bound; Query Processing caches recently constructed and queried segments and processes queries at either the segment or data point level; Segment Storage provides a uniform interface with predicate pushdown for the persistent segment group store. In summary, ModelarDB_{v2} is simple to deploy in a cluster while providing state-of-the-art ingestion rates, compression, and query performance in one system. ModelarDB_{v2} achieves this by compressing multiple correlated time series with dimensions using models distributed as part of ModelarDB_{v2} Core and, optionally, user-defined models.
3.2 Ingestion and Representation of Gaps
At each SI, ModelarDB_{v2} fits a model to the data points from a group of time series, instead of one model per time series as in ModelarDB_{v1} [23]. Both treat models as black boxes with a common interface allowing arbitrary user-defined models. ModelarDB_{v2} performs ingestion in four steps: (i) a data point from each time series in the group is received and added to a buffer; (ii) it is verified whether the current model can be fitted to the new data points, and if not, the next model is used; (iii) when the last model can fit no more data points, the model providing the best compression ratio is flushed to memory and disk; (iv) last, the data points represented by the flushed model are removed from the buffer and the process is repeated from the first model in the sequence. Any gaps are stored as part of the current segment before ingestion continues. To simplify management of gaps and improve filtering during query processing, both the start time and end time are stored for each segment. In addition, segments are stored disconnected to improve the compression ratio, as overlapping data points are not stored as they would be for connected segments [26, 27]. As a result, each segment represents a dynamically sized subsequence from a group of time series using the model providing the best compression within a user-defined error bound (possibly zero).
For storing gaps we consider two methods. The first stores gaps as triples (t_s, t_e, Tid), where t_s is the start time and t_e the end time of a gap in the time series indicated by the Tid. The second creates a new segment if a gap occurs and stores gaps as Tids as shown in Figure 5. The group in this example consists of three time series, so the model is initially fitted to three values at each time stamp. When a gap occurs in one time series, a new model is fitted to the values from only the two remaining time series. To indicate that this model only represents a subset of the time series, the Tids of the time series not represented are stored in the segment. When data points are received from all time series again, the process is repeated. Thus, a segment represents data points for a static number of time series.
For ModelarDB_{v2} we use the second method as it: (i) simplifies implementation of user-defined models, as storing gaps with the first method requires that models take any combination of gaps into account; (ii) simplifies and reduces the computation required for ingestion, execution of aggregate queries, and reconstruction of data points, as these operations must skip gaps. This choice is, however, a trade-off, as storing a gap as a triple requires fewer bytes than creating a new segment. As a result, ModelarDB_{v2} significantly improves the state-of-the-art for compression, see Section 7, while making user-defined models simple to implement.
3.3 Storage Schema
The storage schema used by ModelarDB_{v2} to support MMGC is shown in Figure 6. The Time Series table contains metadata and denormalized user-defined dimensions for each time series, with each time series identified by a Tid. The only required metadata is the SI. The Gid represents what group a time series has been partitioned into and is computed by ModelarDB_{v2} using user hints. The scaling constant is a constant that ModelarDB_{v2} applies to each value during ingestion and query processing. With a scaling constant, correlated time series with different value ranges can be compressed together. The Model table maps a Mid to the Java Classpath of that model. Last, the Segment table contains all ingested data points as dynamically sized segments.
In data warehouse terms, the Segment table functions as a fact table with new segments continuously appended during ingestion. The user-defined dimensions are stored denormalized as part of the Time Series table. However, no explicit time dimension is required as aggregate queries in the time dimension can be computed efficiently using only StartTime and EndTime as described in Section 6.3. For Cassandra two modifications are made to the general schema. First, to more efficiently support predicate pushdown, the primary key for Segment is changed to (Gid, EndTime, Gaps) [23]. Gaps is included to prevent duplicate keys due to the dynamic splitting described in Section 4.2. The values in Gaps are stored as integers with each bit representing whether a gap has occurred for that time series in the group. Second, as the StartTime column is not used for indexing, it is changed so the size of the segment is stored instead to save space. The StartTime can be efficiently recomputed from the EndTime, the size, and the SI [23].
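The two Cassandra-specific modifications can be sketched as follows; the exact off-by-one convention for recomputing StartTime and the bit order of the Gaps integer are assumptions for illustration, not taken from the schema:

```python
def start_time(end_time, size, si):
    """Recompute a segment's StartTime from its EndTime, the number of data
    points per time series it represents (Size), and the sampling interval
    SI (the off-by-one convention here is an assumption)."""
    return end_time - (size - 1) * si

def gaps_to_offsets(gaps):
    """Decode the Gaps integer: bit i set is assumed to mean that the i-th
    time series in the group is in a gap for this segment."""
    return [i for i in range(gaps.bit_length()) if gaps >> i & 1]

print(start_time(1000, 10, 100))  # 100
print(gaps_to_offsets(0b101))     # [0, 2]
```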
4 Partitioning of Time Series
4.1 Partitioning of Correlated Time Series
To provide the benefit of model-based storage and query processing while ensuring low latency, models must be fitted online [23]. However, in a distributed system, time series compressed together should be ingested on one node to prevent excessive network traffic from limiting the scalability of the system. So to prevent migration of data in the cluster, the time series must be partitioned based only on metadata or previously collected data. As historical data might not exist, and even a small data set of n time series creates n(n − 1)/2 pairs of possibly very large time series to compare for correlation, simply computing which time series are correlated from historical data quickly becomes infeasible.
We propose a set of primitives that can be combined to efficiently describe correlation for data sets with different quantities of time series and dimensions. The primitives are specified in ModelarDB_{v2}'s configuration file as modelardb.correlation clauses, with multiple primitives in one clause implicitly combined with an AND operator, while multiple clauses are implicitly combined with an OR operator. Using these user hints ModelarDB_{v2} partitions time series into groups to be ingested together. The primitives allow correlation to be specified as sets of time series, levels for which members must be equal in dimensions, or the distance between all of the dimensions (described below). Grouping is performed as shown in Algorithm 1. After initializing a group per time series in Line 5, the algorithm iteratively combines groups until a fixpoint in the number of groups is reached. The function correlated in Line 10 checks whether the groups should be merged based on the user-defined correlations.
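The fixpoint iteration of Algorithm 1 can be sketched as follows, with a caller-supplied `correlated` predicate standing in for the user-defined correlation clauses; the predicate and series names are hypothetical:

```python
def group(time_series, correlated):
    """Iteratively merge groups of time series until a fixpoint in the
    number of groups is reached (a sketch of Algorithm 1)."""
    groups = [[ts] for ts in time_series]          # one group per series
    while True:
        merged = []
        for g in groups:
            for m in merged:
                if correlated(m, g):               # user-defined check
                    m.extend(g)
                    break
            else:
                merged.append(g)
        if len(merged) == len(groups):             # fixpoint reached
            return merged
        groups = merged

# Hypothetical example: series are correlated if they share a park prefix.
series = ["park1-t1", "park1-t2", "park2-t1"]
same_park = lambda g1, g2: g1[0][:5] == g2[0][:5]
print(group(series, same_park))  # [['park1-t1', 'park1-t2'], ['park2-t1']]
```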
When specifying correlation as time series, their location (files or sockets) must be provided, e.g., 4L80R9a_Temperature.gz 4L80R9b_Temperature.gz. For time series that are correlated but do not contain similar values, a scaling constant can be added per time series. While this allows precise control over the groups, it quickly becomes too time consuming as the number of time series increases. The other primitives are based on the notion that time series correlation can be derived from their dimensions. As an example, temperature sensors in close proximity will likely produce similar values. The similarity of a dimension for two groups can be computed as their Lowest Common Ancestor (LCA) level. The LCA level is the lowest level in a dimension where all time series in the two groups have equivalent members, starting from the top of the hierarchy ⊤. An example of computing the LCA can be seen in Figure 7.
To specify correlation based on members, the user must provide either a triple consisting of a dimension, a level, and a member, or a pair with a dimension and an LCA level. The triple Measure 1 Temperature, e.g., specifies that time series sharing the member Temperature at level one of the Measure dimension are correlated. The pair Location 2 says that if the LCA level is equal to or higher than two for the Location dimension, the time series are correlated. Zero specifies that all levels must be equal, and a negative number that all but the lowest levels must be equal. When specifying a scaling constant for many time series, it can be defined for time series with a shared member as a 4-tuple containing a dimension, a level, a member, and a scaling constant. These primitives are appropriate for a data set with few dimensions but many time series.
For data sets with both a large number of time series and dimensions, the user can specify correlation as the distance between dimensions. The intuition is that time series with much overlap between their members will be correlated. For example, for the location dimension in Figure 7, time series sharing members at the Turbine level are more likely to be correlated than if they only share members at the Country level. The distance 0 specifies that all members must match for the time series to be grouped, and the distance 1 that all time series should be grouped. Values in between specify different degrees of overlap. The user can inject domain knowledge by changing the impact of a dimension using a weight, for which the default value is 1. Distances above 1 due to user-defined weights are reduced to 1. For distance-based correlation the rule of thumb is to use the lowest non-zero value for a data set such that only time series with many overlapping members are grouped. The lowest non-zero distance can be calculated from the set of levels in each dimension and the set of dimensions.
The pseudocode for computing the distance between two time series groups is shown in Algorithm 2. In Line 11 the distance of a dimension is computed from its LCA level to reduce the impact of groups with equivalent members only at the top of the hierarchy. In Line 12 the distance of the dimension is multiplied by the user-defined weight for that dimension before being added to the accumulator. In Line 14–15 the distance between the two time series groups is normalized to the range [0, 1] and compared to the user-defined threshold to determine if the two time series groups are correlated. As an example, for the Location dimension shown in Figure 7, the normalized distance between two time series can be computed from their LCA level.
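A sketch of the distance computation follows. Since the exact per-dimension formula is not reproduced here, the sketch assumes the natural choice of (levels − LCA) / levels, which is 0 when all members match and grows as shared members move toward the top of the hierarchy; the dimension names and level counts are hypothetical:

```python
def distance(dims, lca, weights=None):
    """Normalized distance in [0, 1] between two time series groups
    (a sketch of Algorithm 2). `dims` maps each dimension to its number of
    levels and `lca` to the LCA level of the two groups in that dimension."""
    weights = weights or {}
    acc = 0.0
    for d, levels in dims.items():
        dist = (levels - lca[d]) / levels        # 0 if all levels match
        acc += dist * weights.get(d, 1.0)        # user-defined weight
    return min(acc / len(dims), 1.0)             # normalize and cap at 1

dims = {"Location": 4, "Measure": 2}
# The groups share Location members down to level 2 and all Measure members.
print(distance(dims, {"Location": 2, "Measure": 2}))  # 0.25
```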
4.2 Dynamically Splitting Groups
As external events can change the values received for a time series, e.g., a wind turbine might be turned off or damaged, ModelarDB_{v2} can split a group if its time series become temporarily uncorrelated. A split can be performed after emission of a segment, as it indicates that the structure of a time series has changed so the next data point would exceed the error bound. To minimize the number of non-beneficial splits and the overhead of determining when to split, ModelarDB_{v2} uses two heuristics: poor compression ratio and the percentage error between ingested data points. First, ModelarDB_{v2} checks if the compression ratio of the new segment is below a user-configurable fraction of the average. If the compression ratio is lower and ModelarDB_{v2} has non-emitted data points, Algorithm 3 is executed.

The algorithm groups time series if their buffered data points are correlated, and can create groups of any size from one to the size of the original group. Time series currently in a gap are grouped together. In Line 9–16 a time series is added to a group if its values are within twice the user-defined error bound of the values of the time series already in the group. The double error bound is used as two data points cannot be approximated together if they are further apart than this bound. After all time series have been grouped, the new groups are returned. An example of a split is shown in Figure 8. While ModelarDB_{v2} discards data points emitted as segments, the entire time series is shown in Figure 8 to show how they change over time. Initially the group is ingested using a single Segment Generator; however, at some point all time series in the group are no longer correlated and segments with poor compression are emitted. Therefore, the group is split into two and ingestion continues with a Segment Generator per split. The original Segment Generator is unused after the split but not deallocated, as it synchronizes ingestion for the splits to simplify joining and joins the split groups if they become correlated. Later, the group is split again and each time series is then ingested separately.
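The regrouping step of Algorithm 3 can be sketched as follows; the buffers, Tids, and the exact form of the percentage-error check are hypothetical, but the rule that two series may share a group only if their buffered values are within twice the error bound follows the text:

```python
def split_group(buffers, error_bound):
    """Regroup a group's time series so that two series share a group only
    if all their buffered values are within twice the error bound (given in
    percent) of each other (a sketch of Algorithm 3)."""
    def close(b1, b2):
        return all(abs(v1 - v2) <= 2 * error_bound / 100 * max(abs(v1), abs(v2))
                   for v1, v2 in zip(b1, b2))
    groups = []
    for tid, buf in buffers.items():
        for g in groups:
            if all(close(buf, buffers[other]) for other in g):
                g.append(tid)
                break
        else:
            groups.append([tid])
    return groups

# Hypothetical buffers of non-emitted data points, keyed by Tid.
buffers = {1: [10.0, 10.1], 2: [10.1, 10.2], 3: [25.0, 25.5]}
print(split_group(buffers, error_bound=10))  # [[1, 2], [3]]
```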
The algorithm for restoring a split group is shown in Algorithm 4 and is similar to Algorithm 3. However, when joining groups it is only necessary to compare one time series from each, as a group consists of correlated time series (otherwise a split would have occurred). To simplify joining groups, Algorithm 4 is only potentially executed at the end of each SI so all groups have received data points for the same time period. As a segment being emitted indicates a significant change in the values ingested by a group, a split group is only marked for joining after emitting a number of segments. The number of segments that must be emitted is doubled after each attempt to join a split group to reduce the overhead of joining. The intuition is that each failed attempt at joining further indicates that the current splits are preferable. Continuing with the example in Figure 8, at some point two of the time series become correlated again and are merged into one group. Last, when all the time series are correlated again, the original Segment Generator takes over ingestion.
5 Multi-Model Group Compression
To benefit from MMGC a set of models is required. However, as most model-based compression methods for time series are designed for individual time series [32, 21], existing models must be extended to support MGC before they can be used with ModelarDB_{v2}. We first describe a simple method for using any model with MGC by storing multiple models per segment, and then two model-specific approaches that allow a group to use one model per segment.
5.1 Multiple Models per Segment
A baseline method for adding MGC support to any model is to split the data points received and fit them to separate models that are stored together as part of one segment. As gaps are managed by ModelarDB_{v2}, no extensions to the models are required. However, to share the metadata in a segment between multiple models, each representing the values of different time series, the models must represent the same time interval. This is intuitively simple to ensure by verifying that all models will not exceed the error bound before fitting each new data point. However, this is unnecessary as explained next.
Three cases can occur when multiple models are updated, as shown in Figure 9. For case (I) all models can represent the data point received from their respective time series. The opposite occurs in case (II), as the first model cannot represent the data point it received within the user-defined error bound. For both case (I) and case (II) it is trivial to see that all models represent the same time interval. In case (III), the first model can represent the data point received; however, the second model cannot. As the models in the segment no longer represent the same time interval, the end time of the segment is simply not incremented. As each model represents all previously ingested values, the end time of a segment can be safely reduced in increments of SI until all models represent the same time interval. For models where the number of parameters depends on the number of data points fitted, e.g., Gorilla, the leftover parameters should be deleted. Afterwards the next set of data points is fitted to a new set of models. While storing multiple models in one segment reduces the amount of duplicate metadata to one copy per group and is simple to implement, it does not reduce the storage required for the values. To further improve compression, each model must represent multiple time series using one set of parameters.
5.2 Single Model per Segment
To fully exploit MMGC a set of models must be provided which each compress a group of time series using a single model. We found that the models used by ModelarDB_{v1} can be extended to efficiently compress a group of time series using a single model based on two general ideas. For models using lossless compression, e.g., Gorilla, values from multiple time series should be stored in time-ordered blocks. This allows exploitation of both temporal correlation and correlation across time series at each SI. For models that fit ingested data points using an upper and lower bound according to the uniform error norm, e.g., PMC and Swing, only the data points with the minimum and maximum value for each time stamp can modify the bounds and invalidate the model. As a result, the set of values for a time stamp can be reduced to a range of values represented by the 3-tuple (time stamp, minimum, maximum). We now show in detail how MMGC can be performed efficiently using the three models provided as part of ModelarDB_{v2} Core.
For PMC, the set of values from a group of time series at each time stamp is represented by a single constant within the error bound, with the range between the minimum and maximum value determining when the model is invalidated. As a result, PMC requires no changes as the model only tracks the current minimum, maximum, and average value. See PMC in Figure 10. As Swing produces a linear function that is guaranteed to pass through the initial data point, the initial point can be computed using PMC. Then, as the Swing model maintains the upper bound and lower bound for a linear function that can represent the values of all data points received within the error bound, the data points are appended one at a time. See Swing in Figure 10. For Gorilla, values from data points with the same time stamp are stored in blocks. As the time series in a group are correlated, values in each block will have only a small delta compared to the first value and only require a few bits to encode. See Gorilla in Figure 10. To demonstrate the benefit of our MGC extensions we compress three real-life time series representing the temperature of co-located wind turbines. Compared to using only MMC, enabling MMGC in ModelarDB_{v2} reduces the storage required by 28.97% with a 0% error bound, by 29.22% for 1%, by 36.74% for 5%, and by 44.07% for 10%.
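As a sketch of the group-extended PMC model, the class below tracks only the running minimum, maximum, and average of all values in a group; for simplicity it assumes an absolute rather than a relative error bound, and the sample values are hypothetical:

```python
class GroupPMCMean:
    """Sketch of PMC extended to MGC: a single constant (the running
    average) represents all values from a group of time series, and only
    the minimum and maximum value seen so far can invalidate the model."""
    def __init__(self, error_bound):
        self.e = error_bound                      # absolute error bound
        self.lo, self.hi = float("inf"), float("-inf")
        self.total, self.count = 0.0, 0

    def append(self, values):
        """Try to fit the group's values for one time stamp; return False
        (leaving the model unchanged) if the error bound would be exceeded."""
        lo, hi = min(self.lo, min(values)), max(self.hi, max(values))
        total, count = self.total + sum(values), self.count + len(values)
        avg = total / count
        if avg - lo > self.e or hi - avg > self.e:
            return False
        self.lo, self.hi, self.total, self.count = lo, hi, total, count
        return True

    def value(self):
        return self.total / self.count            # the segment's one parameter

m = GroupPMCMean(error_bound=1.0)
print(m.append([20.0, 20.5, 19.8]))  # True
print(m.append([20.4, 21.0, 20.1]))  # True
print(m.append([25.0, 20.2, 20.3]))  # False: 25.0 would exceed the bound
```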
6 Query Processing
6.1 Query Interface
As a model M can reconstruct the data points it represents within the error bound ε, queries can be executed on these data points. However, many aggregate queries can be answered directly from a model, e.g., for constant and linear functions MIN, MAX, SUM, and AVG queries can be answered in constant time [23]. To support this, ModelarDB_{v2} provides a Segment View with the schema (Tid int, StartTime timestamp, EndTime timestamp, SI int, Mid int, Parameters blob, Gaps blob, Dimensions) and a Data Point View with the schema (Tid int, TS timestamp, Value float, Dimensions). Dimensions represents the columns storing the denormalized user-defined dimensions. The user-defined dimensions are cached in-memory and added to segments and data points when required during query processing using a hash join, with an array used instead of a hash table (Tids are consecutive integers). Using the Segment View, ModelarDB_{v2} supports executing aggregate queries on segments using user-defined aggregate functions, which for simple queries are suffixed with _S, e.g., MAX_S. Functions performing aggregation in the time dimension are suffixed with the aggregate and the level in the time hierarchy, e.g., CUBE_AVG_HOUR. All aggregate functions divide the result by the scaling constant of each time series as part of the finalization step. Queries performing aggregation using the user-defined dimensional hierarchy can be executed using a GROUP BY on the appropriate columns in the Segment View, reducing the problem to computing a simple aggregate on segments. As a result, in this section we describe how simple aggregate queries and multi-dimensional aggregate queries in the time dimension can be executed on a segment for distributive and algebraic functions [17].
6.2 Aggregate Queries
To allow queries to be expressed at the time series level instead of the time series group level, a mapping between Tids and Gids is performed as part of query processing using metadata from the Time Series table shown in Figure 6. As a result, queries provided by the user and the results returned from ModelarDB_{v2} only reference Tids, with Gids being utilized to simplify predicate pushdown as the segment store only needs to index one id per segment. While ModelarDB_{v1} only supports predicate pushdown for Tid, StartTime, and EndTime [23], ModelarDB_{v2} also supports predicate pushdown for user-defined dimensions by rewriting all instances of a dimensional member in the WHERE clause to the Gids of the groups that include time series with that dimensional member.
The pseudocode for executing aggregate queries using the Segment View is shown in Algorithm 5. In Line 8 the SQL query is rewritten to query segments in terms of time series groups by replacing the Tids and members in the SQL query's WHERE clause with the matching Gids at the master before the query is sent to each worker node. In Line 9–10 each worker node initializes memory for storing the intermediate values and retrieves relevant segments from its data store. Then, for each segment, in Line 11–13 the aggregate function passed as an argument is executed on the segment. Finally, in Line 15, to support both distributive and algebraic functions, any computation that must be performed on the intermediate results is performed. An example of a simple aggregate query executed on the Segment View is shown in Figure 11. First, all Tids in the query are rewritten to their corresponding Gids. Then, for each segment, the aggregate function specified in the query is executed. The aggregate function also applies the scaling constant. After the aggregate has been computed for all segments, the final aggregate is computed from the intermediate results, e.g., by computing an average.
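For example, SUM over a segment represented by a linear model is constant time because the reconstructed values form an arithmetic series, so only the endpoint values and the point count are needed; the parameter layout below is hypothetical:

```python
def sum_linear_segment(start, end, si, slope, intercept, num_series):
    """SUM on a segment whose model is the linear function
    v = slope * t + intercept, shared by num_series grouped time series.
    Constant time: only the segment's endpoints are evaluated."""
    n = (end - start) // si + 1                 # data points per time series
    first = slope * start + intercept
    last = slope * end + intercept
    return num_series * n * (first + last) / 2  # arithmetic series sum

# Hypothetical segment: 3 correlated series, t in [100, 500] ms, SI = 100 ms.
print(sum_linear_segment(100, 500, 100, 0.25, 10.0, 3))  # 1275.0
```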
6.3 Aggregation in the Time Dimension
As the schema shown in Figure 6 stores the start time and end time as part of each segment, aggregates in the time dimension can be computed using only the segment table without an expensive join with a separate time dimension. The pseudocode for executing aggregate queries in the time dimension using the Segment View is shown in Algorithm 6. The algorithm follows the same structure as Algorithm 5. First, in Line 9 the query is rewritten in terms of Gids instead of Tids and dimensional members, before each worker initializes memory for storing the intermediate results and retrieves relevant segments in Line 10–11. In Line 12–26 the algorithm iterates over each segment and computes an intermediate aggregate for each of the requested time intervals. Last, the final result is computed and returned in Line 28 to support distributive and algebraic functions.
An example of an aggregation in both a user-defined dimension and the time dimension using segments is shown in Figure 12. The query computes the sum per hour for a set of time series selected by their Tids, using the function CUBE_SUM_HOUR to compute the result efficiently on segments instead of on data points. After rewriting the query, the aggregate is computed for the interval from the segment's start time until the next timestamp delimiting two aggregation intervals. Afterwards, the aggregate is computed for each full aggregation interval. Last, the aggregate is computed for the interval from the last delimiting timestamp to and including the segment's end time. The last value is computed with an inclusive end time as ModelarDB_{v2}, to increase the compression ratio, does not store connected segments [23].
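The splitting of a segment's time span into hourly aggregation intervals, with the final interval including the end time, can be sketched as follows (timestamps in milliseconds; the function is illustrative, not ModelarDB's implementation):

```python
# Illustrative sketch of splitting a segment's time span into hourly
# aggregation intervals for a function such as CUBE_SUM_HOUR: aggregate up to
# each hour boundary, with the final interval including the segment's end
# time. Timestamps are in milliseconds.
HOUR_MS = 3_600_000

def hourly_intervals(start, end):
    """Return (interval_start, interval_end) pairs covering [start, end]."""
    intervals = []
    current = start
    while current <= end:
        boundary = (current // HOUR_MS + 1) * HOUR_MS  # next hour boundary
        intervals.append((current, min(boundary - 1, end)))
        current = boundary
    return intervals

# A segment spanning 00:30 to 02:30 is split into three aggregation intervals.
```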
7 Evaluation
7.1 Overview and Evaluation Environment
We evaluate partitioning of correlated dimensional time series, MMGC, and efficient execution of multidimensional aggregate queries using models. For MMGC and query processing we compare ModelarDB_{v2} to the current state-of-the-art big data file formats used in industry (Apache ORC and Apache Parquet), systems used in industry (InfluxDB and Apache Cassandra), and the state-of-the-art for model-based compression, ModelarDB_{v1} [23]. Apache Spark is used to execute queries on data stored in ORC, Parquet and Cassandra, while for InfluxDB queries are executed on a single node as distribution is not supported by the open-source version. Last, we evaluate the scalability of ModelarDB_{v2} using Microsoft Azure.
The hardware and software used for our seven node local evaluation cluster is shown in Table 1. The cluster consists of one master node that functions as the primary HDFS NameNode, secondary HDFS NameNode, and Spark Master, and six workers that function as Cassandra nodes, HDFS DataNodes, and Spark Slaves. For each experiment we only keep the necessary software running and disable replication for all systems. Disk space usage is measured with the data on a single node using du. The default configuration of each system is used to the highest degree possible, with changed values shown in Table 1. The selected values were found to work well with the hardware configuration and the data sets. For parameters we change to evaluate their effect, all values are shown and the default is highlighted in bold. The memory Spark can allocate is statically defined by spark.driver.memory and spark.executor.memory to prevent Cassandra or HDFS from crashing. To determine the memory for each setting we started at 4 GiB and lowered it until all experiments executed. For ModelarDB_{v2} we use the extended models described in Section 5.2: PMC [25], Swing [15], and Gorilla [28].
Table 1: Evaluation cluster hardware and software

Hardware
- Processor: Intel Core i7-2620M
- Memory: 8 GiB of 1333 MHz DDR3
- Storage: 7,200 RPM Hard Drive
- Network: 1 Gbit Ethernet

Software
- Ubuntu GNU/Linux v16.04 LTS on ext4
- ModelarDB v2.0
  - Model Error Bound: 0%, 1%, 5%, 10%
  - Model Length Limit: 50
  - Dynamic Split Fraction: 10
  - Bulk Write Size: 50,000
- InfluxDB v1.4.2
- InfluxDB-Java v2.10
- Apache Hadoop v2.8.0
- Apache Spark v2.1.0
  - spark.driver.memory: 4 GiB
  - spark.executor.memory: 3 GiB
  - spark.streaming.unpersist: false
  - spark.streaming.stopGracefullyOnShutdown: true
  - spark.sql.orc.filterPushdown: true
  - spark.sql.parquet.filterPushdown: true
- Apache Cassandra v3.9
  - batch_size_fail_threshold_in_kb: 50 MiB
  - commitlog_segment_size_in_mb: 128 MiB
- DataStax Spark Cassandra Connector v2.0.3
For the existing formats, data points are stored using the Data Point View's schema: (Tid int, TS timestamp, Value float, Dimensions), where timestamp is each storage format's native timestamp type. For Cassandra (Tid, TS, Value) is used as the primary key, for InfluxDB all time series are stored as one measurement with the Tid as a tag, and for ORC and Parquet a file is created per time series and stored on HDFS in a folder with the name Tid=n so Spark can prune by Tid without reading each file.
7.2 Data Sets and Queries
Data Set “EP” This real-life data set consists of regular time series with gaps from energy production. The data set is provided by an energy trading company, has an SI measured in seconds, and is collected over 508 days. Two dimensions are available: Production (levels: Entity, Type) and Measure (levels: Concrete, Category). In total the data is 339 GiB in size when stored as uncompressed CSV.
Data Set “EH” This real-life data set consists of regular time series with gaps from energy production. The data was collected by us with an approximate SI of 100 milliseconds using an OPC Data Access server running on a Windows server. As preprocessing, the time stamps are rounded to the nearest 100 milliseconds, and data points with equivalent timestamps due to the rounding have been removed. This preprocessing step is only required due to limitations of the collection process and would not be present in a production setup. The data set contains two dimensions: Location (levels: Entity, Park, Country) and Measure (levels: Concrete, Category). In total the data is 582.68 GiB in size when stored as uncompressed CSV.
Queries We use a set of small simple aggregate queries to evaluate ModelarDB_{v2} for interactive analysis (SAGG), a set of large scale simple aggregate queries to evaluate scalability (LAGG), a set of medium scale multidimensional aggregate queries to evaluate reporting (MAGG), and a set of point/range queries to evaluate extraction of subsequences (P/R). Half of SAGG consists of aggregates on one time series, with the other half consisting of GROUP BY queries on five time series grouping on Tid. LAGG consists of queries aggregating the full data set, with half being GROUP BY queries that group on Tid. MAGG consists of multidimensional aggregate queries with the WHERE clause containing the member indicating energy production. Half the queries GROUP BY month and dimension while the others GROUP BY month, dimension and Tid. P/R consists of time point and range queries restricted by WHERE clauses with either TS or Tid and TS.
7.3 Experiments
Ingestion Rate The ingestion rate is primarily evaluated on a single worker as the open-source version of InfluxDB does not support distribution. For each system we ingest a subset of files from EP representing different measures of energy production. This subset consists of gzipped CSV files (6.59 GiB). For InfluxDB we use the Java client library InfluxDB-Java with a batch size of 50,000. The dimensions are read from a 6.7 MiB CSV file. ModelarDB_{v2} stores the dimensions as described in Section 3.3. For the existing formats the denormalized dimensions are appended to the data points using an in-memory cache. We also measure the ingestion rate of ModelarDB_{v2} using all six worker nodes of the cluster to measure its scalability when ingesting using two scenarios: Bulk Loading (B) without queries and Online Analytics (O) with aggregate queries executed on random time series using the Segment View during ingestion. On one node ModelarDB_{v2} uses a single ingestor, while Spark Streaming with a five second micro-batch interval and a receiver per node is used when running on the cluster.
The results can be seen in Figure 16. As expected, InfluxDB and Cassandra perform the worst as they are designed to be queried during ingestion. ModelarDB_{v2} can also execute queries during ingestion but ingests the data set 5.5 times faster than InfluxDB and 11 times faster than Cassandra on a single node. Even when compared to Parquet and ORC, which are unsuitable for online analytics since they cannot be queried before a file is completely written, ModelarDB_{v2} is 2.59 and 2.93 times faster, respectively. Compared to ModelarDB_{v1}, ModelarDB_{v2} is 2.10 times faster. Even when the existing formats do not ingest dimensions, ModelarDB_{v2}'s ingestion with dimensions is 1.69–5.5 times faster. On six worker nodes ModelarDB_{v2} achieves a 4.48 times speedup for bulk loading and a 4.11 times speedup when also executing queries. In summary, ModelarDB_{v2} provides a higher ingestion rate than the existing formats due to an efficient model-agnostic ingestion method and state-of-the-art compression, while also supporting online analytics.
Effect of Error Bound We evaluate the benefit of MMGC using EP and EH. We ingest both using 0%, 1%, 5%, and 10% error bounds and the best combination of correlation primitives we found for each data set despite our limited domain knowledge. For systems that do not support approximation, the error bound is 0%. For each data set we present the storage used, the models used, and the actual average error, calculated as (100 / |D|) · Σ_{i=1}^{|D|} |v_i − a_i| / |v_i|, where D is the set of ingested data points, a_i is the ith approximated value, and v_i is the ith real value. As many time series in EP are correlated, MMGC should significantly reduce the storage required, while for EH MMGC should only provide a benefit with a high error bound as these time series are much less correlated.
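As a sketch, the actual average error over the ingested data points can be computed as the mean relative deviation between real and approximated values, reported as a percentage:

```python
# Sketch of the actual average error metric: the mean relative deviation
# between the real and approximated values, reported as a percentage so it is
# directly comparable to the relative error bound.
def average_error_pct(real, approx):
    assert len(real) == len(approx)
    total = sum(abs(r - a) / abs(r) for r, a in zip(real, approx))
    return 100.0 * total / len(real)

# real values [100, 200] approximated as [99, 202]: each value is 1% off
```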
The results for EP can be seen in Figure 16 with ModelarDB_{v2} using up to 16.19 times less storage than the other formats. Correlation is specified as Production 0, Measure 1 ProductionMWh, as the data set does not contain location information but has multiple different measurements of energy production per entity. Compared to the state-of-the-art model-based ModelarDB_{v1}, ModelarDB_{v2} provides a 1.45 times reduction for a 0% error bound, 1.46 times for up to 1% with an average error of 0.02%, 1.51 times for up to 5% with an average error of 0.17%, and 1.54 times for up to 10% with an average error of 0.34%. The results for EH can be seen in Figure 16 with correlation defined by the lowest distance (0.16666667) using our rule of thumb in Section 4.1. For EH ModelarDB_{v2} reduces the storage required by up to 65.28 times compared to all existing formats except ModelarDB_{v1} when a low error bound is used. It is expected that ModelarDB_{v1} provides slightly better compression than ModelarDB_{v2} for EH as these time series only exhibit very limited correlation, and with a low error bound even small deviations can make a model exceed the error bound. In addition, ModelarDB_{v2} still outperforms all other formats and the difference in the storage required is minimal: for a 0% error bound the increase is only 1.18 times, for 1% it is only 1.15 times, for 5% it is only 1.004 times, and for 10% ModelarDB_{v2} reduces the storage required by 1.22 times while only having an actual average error of 2.03%. Figure 16 and Figure 20 show that all models were used in different combinations for both data sets. In summary, ModelarDB_{v2} provides better compression than existing storage formats by dynamically selecting appropriate combinations of models for each data set and error bound pair using MMGC.
Effect of Distance We evaluate the effectiveness of specifying correlation as a distance by ingesting EP and EH with all possible distances between the time series in each data set until the distance at which all data sets require more space. The number of dimensions and levels limits the possible distances, e.g., for EP distances increase in fixed increments when weights are not used. As both dimensions in EP have two levels they have the same impact on the distance. However, as the Measure dimension is a stronger indicator of correlation, the weight of Production is increased so only groups with equivalent members in the Production dimension are grouped.
The results for EP and EH are shown in Figure 20, and as expected only the lowest distance provides a decrease in the storage required as increasing the distance creates inappropriate groupings of time series. This fits with our rule of thumb to use the lowest distance as a start when specifying correlation. When using the lowest distance we see only a 1.14–1.29 times increase in storage compared to our manually tuned results for EP (still up to 14.2 times lower than the existing formats), while distance-based correlation outperforms our manual tuning attempts for EH. In summary, even without domain knowledge, MMGC can be used successfully with distance-based specification of correlation and a simple rule of thumb.
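As an illustration of distance-based grouping, the sketch below derives a distance from the dimension hierarchies of two time series: for each dimension, the fraction of levels at which the members differ contributes, optionally scaled by a weight, and the contributions are averaged over the dimensions. This is our simplified interpretation, not necessarily the exact formula used by ModelarDB_{v2}:

```python
# Illustrative sketch of a distance between two time series derived from
# their dimension hierarchies: for each dimension, the fraction of levels at
# which the members differ contributes to the distance, optionally scaled by
# a user-defined weight, and the contributions are averaged.
def distance(dims_a, dims_b, weights=None):
    total = 0.0
    for dim, levels_a in dims_a.items():
        levels_b = dims_b[dim]
        weight = (weights or {}).get(dim, 1.0)
        differing = sum(1 for a, b in zip(levels_a, levels_b) if a != b)
        total += weight * differing / len(levels_a)
    return total / len(dims_a)

# Two series in the same park that differ only at the lowest Location level
ts1 = {"Location": ["Turbine3", "Park1", "DK"], "Measure": ["Temp", "Environment"]}
ts2 = {"Location": ["Turbine7", "Park1", "DK"], "Measure": ["Temp", "Environment"]}
```

Under this sketch, increasing a dimension's weight inflates its contribution, so only series agreeing on that dimension fall under a given distance threshold, mirroring how the weight of Production is increased for EP.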
Scale-out We evaluate the scalability of ModelarDB_{v2} using two experiments. First, we compare it against the existing formats when executing LAGG on the cluster. Second, we evaluate the system's ability to scale when executing LAGG using 1–32 Standard_D8_v3 nodes on Microsoft Azure. The node type is selected based on the documentation for Spark, Cassandra, and Azure [3, 1, 2]. The configuration from the local cluster is used on Azure with the exception that Spark is allowed 50% of each node's memory, as no crashes occur with this initial configuration. Thus, ModelarDB_{v2} cannot simply cache the entire data set in memory, as EP is duplicated until the data ingested by each node is at least equal to its memory. To ensure duplicate values do not skew the results, the values of each duplicated data set are multiplied with a random value in the range [0.001, 1.001). Queries are executed using the most appropriate method for each system: InfluxDB's command-line interface (CLI), ModelarDB's Segment View (SV) and Data Point View (DPV), and for Cassandra, Parquet, and ORC a Spark SQL Data Frame (S).
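The perturbation of duplicated copies can be sketched as follows; the function is illustrative, not the exact code used:

```python
# Illustrative sketch of perturbing a duplicated copy of the data set: all
# values in the copy are multiplied by one random factor from [0.001, 1.001)
# so duplicates do not artificially inflate the compression ratio.
import random

def scale_copy(values, rng=None):
    rng = rng or random.Random()
    factor = 0.001 + rng.random()  # uniform in [0.001, 1.001)
    return [v * factor for v in values]
```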
The results for the cluster can be seen in Figure 20. ModelarDB_{v2} outperforms almost all of the existing formats, with Parquet being just 1.16 times faster due to the benefits of its column-based layout for simple aggregate queries on a single column. However, compared to ModelarDB_{v2}, Parquet has multiple downsides as it is 2.59 times slower to ingest data, does not allow for online analytics, and uses 11.59 times more storage for EP. We are unable to execute LAGG on InfluxDB as the open-source version does not support distribution and fails due to memory limitations on a single node with all 8 GiB available to it and the OS. ModelarDB_{v2} executes LAGG on a single worker node in just 6.63 hours using the Segment View. Also, we have previously shown that ModelarDB_{v1} outperforms InfluxDB at scale [23], and ModelarDB_{v2} is 1.25 times faster than ModelarDB_{v1}. The results for Azure are shown in Figure 20 with ModelarDB_{v2} scaling linearly for both the Segment View and the Data Point View up to 32 nodes. This is expected as ModelarDB_{v2} assigns each time series to a specific node, allowing the queries to be answered without shuffling. In summary, ModelarDB_{v2} provides either faster (up to 59.34 times) or at least comparable query performance compared to the existing formats for large scale queries, while also providing faster ingestion (up to 11 times), supporting online analytics, having better compression (up to 65.28 times), and scaling linearly when additional nodes are added.
Additional Query Processing Performance To further evaluate the query performance of ModelarDB_{v2}, we execute SAGG, P/R and MAGG on all data sets using the same query interfaces as for the scale-out experiments. However, MAGG cannot be executed on InfluxDB as it can only aggregate time intervals with a fixed size, e.g., an hour or a day [7, 5]. In addition, as InfluxDB has no DatePart functionality, aggregates over, e.g., the days of months, as supported by ModelarDB_{v2}, are not natively supported [6].
The results for SAGG are in Figures 24 and 24. As expected, for EP ModelarDB_{v2} is slightly slower than most of the existing formats as a group of time series must be read from disk even if the query only uses one time series. Despite this overhead, the only format with support for online analytics faster than ModelarDB_{v2} is InfluxDB, and it is only 2 times faster. The results for EH are similar, although as EH consists of fewer but longer time series than EP, the overhead of reading a group is larger. Here Parquet is 28.93 times faster than ModelarDB_{v2} due to the benefits of its column-based layout for simple aggregate queries on a single column; however, ModelarDB_{v2} provides 2.59 times faster ingestion and uses 54.04 times less storage. InfluxDB is the only format supporting online analytics that is faster than ModelarDB_{v2} for EH (1.45 times). However, while InfluxDB performs well for the queries in SAGG, ModelarDB_{v2} provides 5.5 times faster ingestion, uses less storage (2.48 times for EP and 2.19 times for EH), and executes queries that InfluxDB cannot, as shown in Figure 16, 16, 16, and 20, respectively.
Point queries and range queries are not the intended use case for ModelarDB_{v2} as a point query might read a large segment representing multiple time series from disk. Due to this overhead, MMGC is never a benefit for point queries and range queries on individual time series when compared to MMC. However, for completeness we evaluate the overhead of MMGC for such queries with a comparison to ModelarDB_{v1}. The results for P/R can be seen in Figures 24 and 24. As expected, for both data sets ModelarDB_{v2} is slower than ModelarDB_{v1} due to the overhead of reading groups from disk. For EP ModelarDB_{v2} is only 3.5% slower than ModelarDB_{v1}, while ModelarDB_{v2} is 5.25 times slower than ModelarDB_{v1} for EH as the grouped time series in EH are less correlated than in EP.
The results for MAGG on EP can be seen in Figures 28 and 28. For MAGG-One in Figure 28 the queries GROUP BY category, matching the groups created for EP when ingesting the data. As the queries aggregate energy production by month and GROUP BY the correlated time series, ModelarDB_{v2} only reads data necessary for each query and outperforms the existing formats by 1.84–55.47 times using the Segment View. For MAGG-Two in Figure 28 the queries GROUP BY concrete to drill down one level below that used for partitioning. However, contrary to precomputed aggregates, ModelarDB_{v2} can execute separate queries on each time series in a group, so changing the level of aggregation does not impact the performance. For MAGG-Two on EP ModelarDB_{v2} is the fastest by 2.20–57.17 times. The results for MAGG on EH are similar to those for EP and can be seen in Figures 28 and 28. For MAGG-One the queries GROUP BY park and ModelarDB_{v2} is 1.05–82.45 times faster than the existing formats, while for MAGG-Two the queries GROUP BY entity and ModelarDB_{v2} is 1.12–91.92 times faster.
In summary, for simple aggregate queries ModelarDB_{v2} provides competitive performance despite the overhead caused by MMGC when querying individual time series. For point and range queries this overhead is more prevalent as expected since ModelarDB_{v2} was not designed for such queries. For multidimensional aggregate queries ModelarDB_{v2} fully benefits from its use of MMGC and outperforms all existing formats by up to 91.92 times, even when drilling down below the level at which the data is grouped.
8 Related Work
We summarize papers about model-based time series management and model-based OLAP. There are surveys about model-based time series management [32, 21], Hadoop OLAP [30], and TSMSs [22].
Multi-Model Compression: MMC was proposed in [26, 27], where models are fit to a time series in parallel until they all fail, and the model with the highest compression ratio is then stored. The Adaptive Approximation (AA) algorithm [31] fits models in parallel and creates segments as each model fails. After all models have failed, the segments from the model with the highest compression ratio are stored. In [14] regression models are fitted in sequence, with coefficients added as required by the error bound; the model providing the best compression ratio is stored when the maximum number of coefficients is reached.
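As a sketch of one candidate model in such a scheme, the constant (PMC-style) model below represents a run of values by the midpoint of their range for as long as the range stays within twice the error bound; an MMC scheme would fit several such model types side by side and keep the one with the best compression ratio when all fail:

```python
# Illustrative sketch of one candidate model for MMC: a PMC-style constant
# model that represents a run of values by the midpoint of their range, valid
# while max - min stays within twice the error bound.
class ConstantModel:
    def __init__(self, error_bound):
        self.error_bound = error_bound
        self.lo = self.hi = None
        self.length = 0

    def append(self, value):
        lo = value if self.lo is None else min(self.lo, value)
        hi = value if self.hi is None else max(self.hi, value)
        if hi - lo <= 2 * self.error_bound:  # midpoint within bound of all values
            self.lo, self.hi, self.length = lo, hi, self.length + 1
            return True
        return False  # the model fails on this value

def compress(values, error_bound):
    """Greedily split values into (midpoint, length) constant segments."""
    segments, model = [], ConstantModel(error_bound)
    for v in values:
        if not model.append(v):
            segments.append(((model.lo + model.hi) / 2, model.length))
            model = ConstantModel(error_bound)
            model.append(v)
    if model.length:
        segments.append(((model.lo + model.hi) / 2, model.length))
    return segments
```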
Model-Based Group Compression: MGC has primarily been used for distributed data acquisition instead of centralized compression. An overview and comparison is given in [37]. GAMPS [16] performs MGC at a central location by approximating each time series using constant functions. Afterwards, the error bound is relaxed and overlapping models are compressed together, possibly with scaling. Static grouping is done using an approximation algorithm, with the sets recomputed at runtime using dynamically sized windows.
Model-Based Data Management Systems: Database Management Systems with explicit support for using mathematical models for data cleaning or compression have also been proposed. MauveDB [12] integrates the use of models as part of a Relational Database Management System (RDBMS) using views to support data cleaning without needing to export the data to an external application. FunctionDB [34] natively supports models in the form of polynomial functions, allowing queries to be evaluated directly on models when possible. Plato [24] supports models for cleaning and has a framework for adding user-defined models that integrate with the system's optimizer and query processor. Using an in-memory tree-based index, a distributed key-value store, and MapReduce, [18] allows segments to be stored and queried in a distributed system. ModelarDB_{v1} [23] provides distributed model-based time series management using MMC with user-defined models by integrating the portable ModelarDB_{v1} Core with Spark and Cassandra.
Model-Based OLAP: Another use of model-based time series compression is approximate materialization of data cubes. Perera et al. [29] propose offline algorithms for finding similarities between time series aggregates in an OLAP cube; similar aggregates can then be materialized as a model, or as a model and an offset, to reduce the size of a materialized cube. A similar method for online data cubes was proposed by Shaikh et al. [33]: using models, an approximate data cube is materialized in memory. As data points are ingested, the in-memory data cube is updated and the data points are written to disk for persistence. To preserve memory, models representing the oldest data might also be flushed to disk.
ModelarDB_{v2}: In contrast to existing model-based compression algorithms [26, 27, 31, 14, 16] and model-based systems [12, 34, 24, 18, 23], ModelarDB_{v2} utilizes models for compression and unifies MMC and MGC to create the novel MMGC method for efficient compression of time series. In addition, a simple API allows users to add user-defined models without recompiling ModelarDB_{v2}. Compared to other OLAP systems [35, 20, 10, 38, 9, 36, 19, 11], ModelarDB_{v2} executes multidimensional aggregate queries on models. Also, while the existing model-based approaches for OLAP [29, 33] store both the raw data points and models, ModelarDB_{v2} stores only the highly compressed models. In summary, ModelarDB_{v2} provides state-of-the-art compression and query performance for dimensional time series by compressing correlated time series as one sequence of models and executing OLAP queries on models.
9 Conclusion & Future Work
Motivated by the need for a system that can efficiently both store and perform multidimensional analysis of the large amounts of data produced by reliable sensors, we presented ModelarDB_{v2}, a distributed model-based TSMS that achieves state-of-the-art compression and query performance by exploiting correlation between time series using a set of arbitrary models (optionally user-defined). To achieve this we presented multiple novel contributions: (i) the novel concept of Multi-model Group Compression and extensions to models to support it, (ii) a set of primitives that simplify describing correlation between time series for data sets of any size without requiring historical data, and (iii) query processing algorithms for efficiently evaluating multidimensional aggregate queries directly on models. For distributed query processing and storage ModelarDB_{v2} uses the stock versions of Apache Spark and Apache Cassandra, respectively. Through an evaluation we demonstrated that compared to existing systems, ModelarDB_{v2} provides faster ingestion, a significantly reduced storage requirement by adaptively selecting appropriate models for dynamically sized segments, and much faster or at least similar query performance for aggregates.
For future work, we plan to simplify the use of ModelarDB_{v2} and increase its query performance: (i) developing indexing techniques that exploit that data is stored as user-defined models, (ii) supporting high-level analytical queries, e.g., similarity search, performed directly on user-defined models, and (iii) either removing or automatically inferring parameter arguments.
10 Acknowledgments
References
 [1] Apache Cassandra – Hardware Choices. http://cassandra.apache.org/doc/latest/operating/hardware.html. Viewed: 2019-01-31.
 [2] Apache Spark – Hardware Provisioning. https://spark.apache.org/docs/2.1.0/hardware-provisioning.html. Viewed: 2019-01-31.
 [3] Azure Databricks. https://azure.microsoft.com/en-us/pricing/details/databricks/. Viewed: 2019-01-31.
 [4] GOFLEX. https://goflex-project.eu/. Viewed: 2019-01-31.
 [5] InfluxDB – Issue 3991. https://github.com/influxdata/influxdb/issues/3991. Viewed: 2019-01-31.
 [6] InfluxDB – Issue 6723. https://github.com/influxdata/influxdb/issues/6723. Viewed: 2019-01-31.
 [7] InfluxQL reference – Durations. https://docs.influxdata.com/influxdb/v1.4/query_language/spec/#durations. Viewed: 2019-01-31.
 [8] Microsoft Azure for Research. https://www.microsoft.com/en-us/research/academic-program/microsoft-azure-for-research/. Viewed: 2019-01-31.
 [9] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, et al. Spark SQL: Relational data processing in spark. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015.
 [10] S. Chen. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proceedings of the VLDB Endowment, 3(12):1459–1468, 2010.
 [11] B. Dageville, T. Cruanes, M. Zukowski, V. Antonov, A. Avanes, J. Bock, J. Claybaugh, D. Engovatov, M. Hentschel, J. Huang, et al. The Snowflake Elastic Data Warehouse. In Proceedings of the SIGMOD International Conference on Management of Data, pages 215–226. ACM, 2016.
 [12] A. Deshpande and S. Madden. MauveDB: supporting modelbased user views in database systems. In Proceedings of the SIGMOD International Conference on Management of Data, pages 73–84. ACM, 2006.
 [13] DiCyPS – Center for Data-Intensive Cyber-Physical Systems. http://www.dicyps.dk/dicyps-in-english/. Viewed: 2019-01-31.
 [14] F. Eichinger, P. Efros, S. Karnouskos, and K. Böhm. A timeseries compression technique and its application to the smart grid. The VLDB Journal, 24(2):193–218, 2015.
 [15] H. Elmeleegy, A. K. Elmagarmid, E. Cecchet, W. G. Aref, and W. Zwaenepoel. Online piecewise linear approximation of numerical streams with precision guarantees. Proceedings of the VLDB Endowment, 2(1):145–156, 2009.
 [16] S. Gandhi, S. Nath, S. Suri, and J. Liu. Gamps: Compressing multi sensor data by grouping and amplitude scaling. In Proceedings of the SIGMOD International Conference on Management of Data, pages 771–784. ACM, 2009.
 [17] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing groupby, crosstab, and subtotals. Data mining and Knowledge Discovery, 1(1):29–53, 1997.
 [18] T. Guo, T. G. Papaioannou, and K. Aberer. Efficient Indexing and Query Processing of ModelView Sensor Data in the Cloud. Big Data Research, 1:52–65, 2014.
 [19] A. Gupta, D. Agarwal, D. Tan, J. Kulesza, R. Pathak, S. Stefani, and V. Srinivasan. Amazon Redshift and the case for simpler data warehouses. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1917–1923. ACM, 2015.
 [20] Y. Huai, A. Chauhan, A. Gates, G. Hagleitner, E. N. Hanson, O. O’Malley, J. Pandey, Y. Yuan, R. Lee, and X. Zhang. Major technical advancements in Apache Hive. In Proceedings of the SIGMOD International Conference on Management of Data, pages 1235–1246. ACM, 2014.
 [21] N. Q. V. Hung, H. Jeung, and K. Aberer. An evaluation of modelbased approaches to sensor data compression. IEEE Transactions on Knowledge and Data Engineering, 25(11):2434–2447, 2013.
 [22] S. K. Jensen, T. B. Pedersen, and C. Thomsen. Time Series Management Systems: A Survey. IEEE Transactions on Knowledge and Data Engineering, 29(11):2581–2600, Nov 2017.
 [23] S. K. Jensen, T. B. Pedersen, and C. Thomsen. ModelarDB: Modular Modelbased Time Series Management with Spark and Cassandra. Proceedings of the VLDB Endowment, 11(11):1688–1701, July 2018.
 [24] Y. Katsis, Y. Freund, and Y. Papakonstantinou. Combining Databases and Signal Processing in Plato. In Proceedings of the Biennial Conference on Innovative Data Systems Research, 2015.
 [25] I. Lazaridis and S. Mehrotra. Capturing sensorgenerated time series with quality guarantees. In IEEE Transactions on Knowledge and Data Engineering, pages 429–440. IEEE, 2003.
 [26] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards online multimodel approximation of time series. In Proceedings of the International Conference on Mobile Data Management, volume 1, pages 33–38. IEEE, 2011.
 [27] T. G. Papaioannou, M. Riahi, and K. Aberer. Towards Online MultiModel Approximation of Time Series. Technical report, EPFL LSIR, 2011.
 [28] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: A fast, scalable, inmemory time series database. Proceedings of the VLDB Endowment, 8(12):1816–1827, 2015.
 [29] K. S. Perera, M. Hahmann, W. Lehner, T. B. Pedersen, and C. Thomsen. Modeling Large Time Series for Efficient Approximate Query Processing. In Revised Selected Papers from the DASFAA International Workshops, SeCoP, BDMS, and Poster, pages 190–204. Springer, 2015.
 [30] M. Ptiček and B. Vrdoljak. Mapreduce research on warehousing of big data. In 40th International Convention on Information and Communication Technology, Electronics and Microelectronics, 2017.
 [31] J. Qi, R. Zhang, K. Ramamohanarao, H. Wang, Z. Wen, and D. Wu. Indexable online time series segmentation with error bound guarantee. World Wide Web, 18(2):359–401, 2015.
 [32] S. Sathe, T. G. Papaioannou, H. Jeung, and K. Aberer. A survey of modelbased sensor data acquisition and management. In Managing and Mining Sensor Data, pages 9–50. Springer, 2013.
 [33] S. A. Shaikh and H. Kitagawa. Approximate OLAP on Sustained Data Streams. In Proceedings of the International Conference on Database Systems for Advanced Applications, Part II, pages 102–118. Springer, 2017.
 [34] A. Thiagarajan and S. Madden. Querying continuous functions in a database system. In Proceedings of the SIGMOD International Conference on Management of Data, pages 791–804. ACM, 2008.
 [35] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive – a petabyte scale data warehouse using Hadoop. In Proceedings of the International Conference on Data Engineering, pages 996–1005. IEEE, 2010.
 [36] S. Vitthal Ranawade, S. Navale, A. Dhamal, K. Deshpande, and C. Ghuge. Online Analytical Processing on Hadoop using Apache Kylin. International Journal of Applied Information Systems, 12:1–5, 05 2017.
 [37] B. Wang, Y. Song, Y. Sun, and J. Liu. Improvements to Online Distributed Monitoring Systems. In 2016 IEEE Trustcom/BigDataSE/ISPA, pages 1093–1100. IEEE, 2016.
 [38] F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: A Realtime Analytical Data Store. In Proceedings of the SIGMOD International Conference on Management of Data, pages 157–168. ACM, 2014.