An IDEA: An Ingestion Framework for Data Enrichment in AsterixDB

02/21/2019
by Xikui Wang, et al.
University of California, Irvine

Big Data today is being generated at an unprecedented rate from various sources such as sensors, applications, and devices, and it often needs to be enriched based on other existing information to support complex analytical queries. Depending on the use case, the enrichment operations can be compiled code, declarative queries, or machine learning models with different complexities. For enrichments that will be frequently used in the future, it can be advantageous to push their computation into the ingestion pipeline so that they can be stored (and queried) together with the data. In some cases, the referenced information may change over time, so the ingestion pipeline should be able to adapt to such changes to guarantee the currency and/or correctness of the enrichment results. In this paper, we present a new data ingestion framework that supports data ingestion at scale, enrichments requiring complex operations, and adaptiveness to reference data changes. We explain how this framework has been built on top of Apache AsterixDB and investigate its performance at scale under various workloads.


1 Introduction

Traditionally, data to be analyzed has been obtained from one or more operational systems, fed through an Extract, Transform, and Load (ETL) process, and stored in a data warehouse [8]. In today’s Big Data era, the data that people work with is no longer limited to the operational data from a company but also includes social network messages, sensor readings, user click-streams, and so on. Data from these sources is generated rapidly and continuously. It becomes increasingly undesirable to stage the data in large batches, process it overnight, and then load it into a data warehouse due to the volume of the incoming stream and the need to analyze current data when making important decisions.

To support the ingestion of continuously generated data and provide near real-time data analysis, streaming engines have been introduced into the Big Data analysis architecture [4, 10, 27]. The incoming data is collected by a streaming engine and then pushed to (or periodically pulled by) a warehouse for later complex data analysis. Adding a streaming engine simplifies the ingestion process for the data warehouse, but it introduces data routing overhead between different systems. To minimize this overhead and simplify the architecture, some systems such as Apache AsterixDB [12] have chosen to provide an integrated ingestion facility, data feeds, to enable users to ingest data directly into the system.

The ingested data, such as sensor readings, is often not useful alone in high-level data analysis. It often needs to be enriched with other existing information (e.g., sensor metadata, reference data) to reveal more valuable insights. Depending on the use case, the enrichment operations can be as simple as joining the incoming data with existing data, or as complex as applying machine learning models to the incoming data. When enrichments need to be queried frequently, their computations are often pushed into the ingestion pipeline and the enriched data is persisted [19]. This requires the ingestion framework to have the ability to process the incoming data and to access the existing data used in enrichment operations.

Most streaming engines support incoming data processing, but some only support a limited query syntax [6, 16]. If a processing task requires existing data from a warehouse, most streaming engines would need to query the warehouse repeatedly, as otherwise they would have to keep a copy of that data locally. Frequent data queries to the warehouse would increase the load on the system and incur unexpected latency, and maintaining multiple copies of the data would require data migration to keep the data consistent [19]. Both choices would slow down the enrichment/ingestion pipeline and increase the complexity of building the overall analysis platform.

The data feeds in AsterixDB support data enrichment during ingestion by allowing users to attach user-defined functions (UDFs) to the ingestion pipeline. A Java UDF can be used on a data feed to enrich the incoming data using existing information from resource files. A SQL++ UDF on a data feed can manipulate the incoming data declaratively using a SQL++ query. Note, however, that such a SQL++ function in AsterixDB is currently limited to accessing only the content of a given input record, as the execution plan for a general SQL++ UDF can be stateful. If other data were accessed by a SQL++ function attached to an AsterixDB data feed, its query plan could fail to be evaluated or could generate intermediate states that ignore changes in the referenced data. This limits the expressiveness (usefulness) of data enrichment using SQL++ UDFs in AsterixDB today.

Considering the potential benefits of data enrichment during ingestion, we believe that a data ingestion facility should provide high-performance ingestion for incoming data, full query syntax support for data enrichment, efficient access to existing data, and adaptiveness to changes in referenced data. With these requirements in mind, we have built a new ingestion framework for Apache AsterixDB. We have improved the scalability and stability of its data feeds and enabled users to attach declarative UDFs to data feeds with full query capability. We have decoupled the ingestion pipeline into layers based on their functionality and life-cycle to improve ingestion efficiency and to allow ingestion pipelines to adapt to data changes dynamically.

The rest of this paper is organized as follows. We introduce the background information about Apache AsterixDB and its Hyracks runtime engine in Section 2, and then discuss how to handle rapid incoming data for database systems in Section 3 and how to enrich ingested data at scale in Section 4. In Section 5, we investigate different strategies for utilizing data enrichment for data analysis and the current limitations of each for providing current and correct data enrichment in time. We introduce how we built the new ingestion framework and the techniques used in Section 6, and elaborate on the details of the new ingestion framework in Section 7. We show the experiments in Section 8, review related work in Section 9, and conclude our work in Section 10.

2 Background

In this paper, we use Apache AsterixDB to highlight and address the challenges of enriching incoming data during data ingestion. To help readers understand this better, we give a brief introduction to Apache AsterixDB and its runtime engine Hyracks.

2.1 Apache AsterixDB

Apache AsterixDB [2] is an open source Big Data Management System (BDMS). It provides distributed data management for large-scale, semi-structured data. It aims to reduce the need for gluing together multiple systems for Big Data analysis. AsterixDB uses SQL++ [7, 22] (an SQL-inspired query language for semi-structured data) for user queries and the AsterixDB Data Model (ADM) to manage the stored data. ADM is a superset of JSON and supports complex objects with nesting and collections. Before storing data in AsterixDB, a user can create a Datatype, which describes known aspects of the data being stored, and a Dataset, which is a collection of records of a datatype. AsterixDB allows a user to specify a datatype as “open”, which makes it a minimal, extensible description of the stored data. As shown in the example in Figure 1, we can create an open datatype named “TweetType” with only two required attributes: “id” and “text”. Tweets containing additional attributes can be stored and queried in this dataset as well.

  CREATE TYPE TweetType AS OPEN {
      id : int64,
      text: string
  };
  CREATE DATASET Tweets(TweetType)
  PRIMARY KEY id;
Figure 1: DDL statements for storing tweets
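
For illustration (this example is not among the original figures), a tweet carrying an attribute that is not declared in “TweetType”, such as a “country” field, can still be inserted into and queried from the open dataset:

  INSERT INTO Tweets([
    {"id": 1, "text": "Hello world", "country": "US"}
  ]);
  SELECT t.text, t.country FROM Tweets t WHERE t.country = "US";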

2.2 Hyracks

Hyracks [3] is a partitioned parallel computation platform that provides runtime execution support for AsterixDB. Computation requests from users are submitted to Hyracks as jobs. A “job” is a unit of work that can be executed on Hyracks. A “job specification” describes how data flows and is processed in a job. It contains a DAG of operators, which describe computational operations, and connectors, which express data routing strategies. Data flows in a runtime Hyracks job in frames containing multiple objects. An operator reads an incoming data frame, processes the data in it, and pushes the processed data frame to another connected operator through a connector. AsterixDB uses jobs to evaluate user queries. A query submitted to AsterixDB is first parsed and optimized into a query plan, and then compiled as a job specification to run on the Hyracks platform. Figure 2 shows how a user query can be represented as a Hyracks job.

Figure 2: Translating a user query to a Hyracks job

3 Data Ingestion

In many contemporary data analysis applications, the data no longer stays on storage devices to be batched into a database system later. Instead, data enters the system rapidly in a continuous fashion; examples of this include social network messages, infield sensor readings, and user click-streams. The traditional bulk loading technique in database systems cannot be applied here due to the active nature of the incoming data. A naive solution to this problem could be to create an external program that obtains/receives these records from external sources and pushes them actively into a dataset using INSERT statements like the one in Figure 3.

    INSERT INTO Tweets([
      {"id":0, "text": "Let there be light"}
    ]);
Figure 3: Insert data into a dataset

As data comes in at a rapid rate, it is impractical to issue repetitive insert statements to ingest data into a dataset, even if a user batches multiple records into a single insert statement. The processing cost of a massive number of insert statements soon becomes a bottleneck, and the system cannot scale to handle faster incoming data. In addition, a robust and efficient facility for moving data from external sources into database systems is a common need for many users with similar data analysis use cases. To untangle database systems and users from the handling of rapidly arriving data, one popular solution is to couple a streaming engine with a database system and use the streaming engine to handle the incoming data [19]. For example, one can set up a Kafka instance to ingest data from external sources, then create a program [21] or use the kafka-mongodb-connector [11] to transmit the ingested data to a MongoDB instance for later analysis.

While streaming engines provide scalable and reliable data ingestion for database systems, they simultaneously introduce additional data routing costs. For example, the incoming data has to be persisted in Kafka first and then pushed to the connected database system. Writing the same piece of data multiple times costs more resources and also delays demanding analytical queries that are awaiting the latest ingested data. In addition, configuring multiple systems and wiring them together can be challenging for data analysts, who usually have little system management experience.

To simplify the process of ingesting data for end users and improve ingestion efficiency for analytical systems, some database systems provide integrated data ingestion facilities for handling rapid incoming data. Apache AsterixDB, for example, provides data feeds that allow users to assemble a simple data ingestion pipeline with DDL statements [12]. A user can use the DDL statements in Figure 4 to create a data feed that receives incoming JSON formatted tweets using a socket server with specified configurations.

  CREATE FEED TweetFeed WITH {
    "type-name" : "TweetType",
    "adapter-name": "socket_adapter",
    "format" : "JSON",
    "sockets": "127.0.0.1:10001",
    "address-type": "IP"
  };
  CONNECT FEED TweetFeed TO DATASET Tweets;
  START FEED TweetFeed;
Figure 4: Create a socket feed

A data feed consists of two components: an adapter, which obtains/receives data from external data sources as raw bytes; and a parser, which translates the ingested bytes into ADM records. Compared with the “glue” solution using streaming engines, data feeds have no extra data routing overheads, and a user can easily assemble a basic data ingestion pipeline with declarative statements.
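
When a feed is no longer needed, it can be shut down and detached declaratively as well. A minimal sketch using standard AsterixDB feed DDL (not part of the original example):

  STOP FEED TweetFeed;
  DISCONNECT FEED TweetFeed FROM DATASET Tweets;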

4 Big Data Enrichment

4.1 Motivations

Due to various restrictions, such as the limited bandwidth of infield sensors and predefined data formats from API providers, data coming from external sources may not contain all the information needed for meaningful data analysis. In these scenarios, one can enrich the ingested data with existing knowledge (reference data) or machine learning models to reveal more useful information. For example, the IP address in a log record can be enriched by referencing IP reputation data to see whether there is known threat activity associated with that IP address [15]. Similarly, an incoming tweet can be enriched by referencing a sensitive key word list to see if its text is potentially threatening. It can also be processed by data mining techniques to extract sentiment and named entities for later analysis [9, 18].

One way of expressing data enrichment requests in database systems is to use User-defined Functions (UDFs). Users can use UDFs to modularize their data enrichment operations and scale them to Big Data applications with support from a BDMS. We use Apache AsterixDB as an example of how to use UDFs for data enrichment.

4.2 UDFs for Data Enrichment

AsterixDB supports both Java and SQL++ in its UDF framework. One can implement a Java UDF that utilizes the provided API to manipulate an input record, or create a SQL++ UDF to enrich input data using declarative queries. For example, we might want to create a UDF that checks whether a given tweet from the U.S. contains the keyword “bomb”. If so, the UDF adds a new field, “safety_check_flag”, to the tweet and sets it to “Red”. If not, the UDF adds the same new field and sets it to “Green”. Figure 5 shows an example of the Java UDF implementation (Java UDF 1).

        ...
        public void evaluate(IFunctionHelper functionHelper) throws Exception {
            JRecord inputRecord = (JRecord) functionHelper.getArgument(0);
            JString countryCode = (JString) inputRecord.getValueByName("country");
            JString text = (JString) inputRecord.getValueByName("text");
            safetyCheckFlag.setValue(countryCode.getValue().equals("US")
                    && text.getValue().contains("bomb") ? "Red" : "Green");
            inputRecord.addField("safety_check_flag", safetyCheckFlag);
            functionHelper.setResult(inputRecord);
        }
        ...
Figure 5: Java UDF 1 for tweet safety check

Although Java UDFs are powerful tools for enriching incoming data, especially when combined with machine learning models, constructing a Java UDF is often more complicated than writing a SQL++ query that expresses the same data enrichment requirement. In addition, a SQL++ UDF can be updated instantly by redefining the function, whereas updating a Java UDF requires a recompilation and redeployment process. Figure 6 shows an equivalent SQL++ UDF that performs the safety check for a given tweet (SQL++ UDF 1).

  CREATE FUNCTION USTweetSafetyCheck(tweet) {
    LET safety_check_flag =
      CASE tweet.country = "US"
        AND contains(tweet.text, "bomb")
        WHEN true THEN "Red" ELSE "Green"
      END
    SELECT tweet.*, safety_check_flag
  };
Figure 6: SQL++ UDF 1 for tweet safety check
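
Once defined, a SQL++ UDF can also be invoked in ad-hoc queries. Since the function body is a SELECT, it returns a collection, so a caller picks out its single element. A usage sketch (not from the original paper):

  SELECT VALUE USTweetSafetyCheck(t)[0]
  FROM Tweets t;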

4.3 Utilizing Existing Knowledge

In some use cases, a user needs to access existing knowledge, such as machine learning models or relevant stored information, for data enrichment. Both Java and SQL++ UDFs can support utilizing existing knowledge. A Java UDF in AsterixDB can load external files during its initialization. A SQL++ UDF can access reference data stored in datasets. To expand on our tweet safety check example in Section 4.2, given a list of countries and their sensitive keywords, suppose we want to flag a tweet from a country if it contains one of the keywords associated with that country. For a Java UDF, we can put the country-to-keywords mappings into a local resource file and load it during the UDF initialization. Figure 7 shows a snippet of the implementation of this Java UDF (Java UDF 2). For a SQL++ UDF, we can store the mappings in a “SensitiveWords” dataset and use a SQL++ query to enrich the input data. Figure 8 shows its SQL++ implementation (SQL++ UDF 2).

    ...
    @Override
    public void initialize(IFunctionHelper functionHelper, String nodeInfo) throws IOException {
           ...
       BufferedReader fr = Files.newBufferedReader(Paths.get(keywordListPath));
        fr.lines().forEach(line -> {
            String[] items = line.split("\\|");
            keywordList.putIfAbsent(items[1], new LinkedList<>());
            keywordList.get(items[1]).add(items[2]);
        });
    }
    ...
    public void evaluate(IFunctionHelper functionHelper) throws Exception {
        JRecord inputRecord = (JRecord) functionHelper.getArgument(0);
        JString countryCode = (JString) inputRecord.getValueByName("country");
        JString text = (JString) inputRecord.getValueByName("text");
        List<String> keywords = keywordList.getOrDefault(countryCode.getValue(),
                Collections.emptyList());
        // Flag the tweet "Red" if its text contains any keyword for its country.
        safetyCheckFlag.setValue("Green");
        for (String keyword : keywords) {
            if (text.getValue().contains(keyword)) {
                safetyCheckFlag.setValue("Red");
                break;
            }
        }
        inputRecord.addField("safety_check_flag", safetyCheckFlag);
        functionHelper.setResult(inputRecord);
    }
    ...
Figure 7: Java UDF 2 for tweet safety check
CREATE FUNCTION tweetSafetyCheck(tweet) {
  LET safety_check_flag = CASE
    EXISTS(SELECT s FROM SensitiveWords s
        WHERE tweet.country = s.country AND
        contains(tweet.text, s.word))
    WHEN true THEN "Red" ELSE "Green"
    END
  SELECT tweet.*, safety_check_flag
};
Figure 8: SQL++ UDF 2 for tweet safety check

Loading external files in a Java UDF is commonly used for necessary configurations that are not frequently updated. However, when an update occurs, such as new keywords being added for a certain country, the resource files on every node also need to be updated. A reference dataset used in a SQL++ UDF, on the other hand, can easily be updated with INSERT/UPSERT statements. (An UPSERT statement in SQL++ inserts an object if no object with the specified key already exists; otherwise, it replaces the existing object with the new one.)
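
For example (a sketch only; the key field shown here is an assumption, since the actual schema of “SensitiveWords” is not given, while the “country” and “word” fields follow Figure 8), a keyword entry can be added or replaced with a single statement:

  UPSERT INTO SensitiveWords([
    {"id": 42, "country": "US", "word": "bomb"}
  ]);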

5 Data Enrichment for Analysis

The goal of data enrichment is to allow analysts to use the enriched information in analytical queries. One can enrich ingested data lazily, when constructing the analytical queries, or eagerly at ingestion time and then store the enriched results. Enriching data in analytical queries is good for one-time analytical queries, while enriching and storing enriched results allows faster responses for future analytical queries. In this section, we show examples and discuss the implementation of both options.

5.1 Option 1 - Enrich in Analytical Queries

For data enrichment used in one-time analytical queries, one can apply enrichment UDFs directly when querying the data. A UDF in an analytical query is optimized together with the query to produce an overall query plan. For example, to find out how many tweets in each country are marked as “Red”, a sample analytical query using SQL++ UDF 2 is shown in Figure 9. Since the UDF is evaluated together with the analytical query, the evaluation of such queries can be slowed down considerably when the data enrichment UDF is complex.

SELECT tweet.country Country, count(tweet) Num
FROM Tweets tweet
LET enrichedTweet = tweetSafetyCheck(tweet)[0]
WHERE enrichedTweet.safety_check_flag = "Red"
GROUP BY tweet.country;
Figure 9: An analytical query using SQL++ UDF 2

5.2 Option 2 - Enrich during Data Ingestion

In common use cases, the enriched data may be used repeatedly in analytical queries at different points in time. Enriching data lazily on-the-fly in each analytical query separately can then be expensive, as it wastes time evaluating the same UDF on the same data multiple times. In such cases, it can be beneficial to instead persist the enriched data and use it for all future analytical queries with similar needs. To allow faster responses to those queries, the data enrichment is often completed eagerly, during the data ingestion process. Here we discuss three different approaches to enriching data during ingestion using Apache AsterixDB.

5.2.1 Approach 1 - External Programs

A naive approach to enriching data during ingestion would be to set up an external program that obtains/receives data from data sources, issues DML statements to enrich the collected data, and then inserts the enriched data into a dataset. A sample insert statement that enriches data using SQL++ UDF 2 and inserts the result into a target dataset “EnrichedTweets” (which can be created using DDL statements similar to those in Figure 1) is shown in Figure 10. However, as discussed in Section 3, issuing repeated insert statements has significant overheads and would not scale well.

  INSERT INTO EnrichedTweets(
    LET TweetsBatch = ([{"id":0, ...},
      {"id":1, ...}, ...])
    SELECT VALUE tweetSafetyCheck(tweet)
    FROM TweetsBatch tweet
  );
Figure 10: Enrich and insert collected tweets

5.2.2 Approach 2 - External Programs w/ Data Feeds

A user can use the data feeds feature introduced in Section 3 to improve ingestion performance. The data can first be ingested into a dataset using data feeds, then enriched and stored in another dataset by applying UDFs. A user could set up an external program that repeatedly issues the DML statement in Figure 11 to initiate data enrichment for ingested data. Depending on the arrival rate of incoming data, the user may issue a new DML statement as soon as the previous one returns, to keep up with the ingestion progress when the arrival rate is high, or wait for a certain period to batch the ingested data when the arrival rate is low. Benefiting from data feeds, this approach consumes the incoming data efficiently, even when the data comes in fast, but a user still needs to set up an external program that constantly initiates the data enrichment. In addition, the data is unnecessarily materialized twice, since all of the information in the tweets is kept in the enriched tweets as well.

  INSERT INTO EnrichedTweets(
    SELECT VALUE tweetSafetyCheck(tweet)
    FROM Tweets tweet WHERE tweet.id NOT IN
      (SELECT VALUE enrichedTweet.id
        FROM EnrichedTweets enrichedTweet)
  );
Figure 11: Enrich and insert ingested tweets

5.2.3 Approach 3 - Data Feeds w/ UDFs

In order to avoid the unnecessary materialization of incoming data and make the enriched data available to users as soon as possible, we may attach the data enrichment operation directly to the ingestion pipeline so that the ingested data is enriched before it arrives at storage. Apache AsterixDB allows users to attach certain UDFs to data feeds. As an example, a user could attach SQL++ UDF 1 (in Figure 6) to a data feed using the DDL statement in Figure 12. Incoming tweets are first received by the feed adapter, then parsed by the feed parser, and then enriched by the attached UDF. Finally, they are stored in the connected dataset. A Java UDF, such as the UDF in Figure 5, can also be attached to a data feed.

  CONNECT FEED TweetFeed
  TO DATASET EnrichedTweets
  APPLY FUNCTION USTweetSafetyCheck;
Figure 12: Attach a SQL++ UDF to a data feed
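
A Java UDF is attached in the same way. Assuming, hypothetically, that the Java UDF from Figure 5 has been installed in an external UDF library named “safetylib” under the function name “tweetSafetyCheckJava” (both names are illustrative), the statement would look like:

  CONNECT FEED TweetFeed
  TO DATASET EnrichedTweets
  APPLY FUNCTION safetylib#tweetSafetyCheckJava;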

5.3 More Complex Enrichment

5.3.1 Challenges

In AsterixDB today, UDF 1 can be attached to a data feed directly, as it only accesses the incoming record and does not create any intermediate states. We call this kind of UDF a stateless UDF. UDF 2 is different from UDF 1, as UDF 2 accesses external resources (the “SensitiveWords” dataset in the case of SQL++, or the equivalent local resource files in the case of Java) and creates intermediate states (such as in-memory hash tables) used for data enrichment. We call this kind of UDF a stateful UDF.

Attaching a stateful UDF to a data feed can be problematic since in some cases the referenced data can itself be modified during the ingestion process, in which case the intermediate states based on the referenced data need to be refreshed accordingly. Also, not all complex and stateful SQL++ UDFs can be applied to a continuously incoming data stream directly. To illustrate the challenges of applying complex and stateful SQL++ UDFs to the ingestion pipeline, here we discuss three possible computing models for attaching UDF 2 to a data feed.

5.3.2 Model 1 - Apply UDF per Record

A simple computing model for applying a stateful UDF to a feed is to evaluate the attached UDF against each incoming record separately. An incoming record is received and parsed by the feed adapter and parser first, then enriched and persisted in storage. An equivalent insert statement for enriching and persisting one record is shown in Figure 13. In this model, each collected datum is treated as a constant record. The attached UDF is evaluated on each record separately, and any intermediate states are refreshed from record to record. This allows the UDF to see data changes during the ingestion process, and it imposes no limitations on the applicable query constructs in attached UDFs. However, evaluating the UDF on a per-record basis introduces significant execution overhead, so this model cannot be applied in situations where the data arrives rapidly.

  INSERT INTO EnrichedTweets(
  LET tweet = { "id": ... }
  SELECT VALUE tweetSafetyCheck(tweet));
Figure 13: Enrich and insert a constant record

5.3.3 Model 2 - Apply UDF per Batch

To mitigate the execution overhead, one alternative is to batch the collected incoming records, apply the UDF to the batch, and store the enriched records. An equivalent insert statement for enriching a batch of records was shown in Figure 10. The records within one batch are enriched using the same reference data, and the reference data changes are captured between batches. A larger batch leads to lower execution overhead but less immediate sensitivity to reference data changes; the converse is also true. A user may choose a balance between ingestion performance and sensitivity to reference data changes by tuning the batch size.

5.3.4 Model 3 - Stream Datasource

To further reduce execution overheads, the system could attempt to treat the incoming data stream as an infinite dataset and evaluate the attached UDF as if the stream were a normal dataset. An equivalent insert statement is shown in Figure 14. (The keyword “FEED” is not an actual supported datasource in SQL++, so one cannot run this statement in the Apache AsterixDB system; here we use it to conceptually denote a continuous feed datasource.)

  INSERT INTO EnrichedTweets(
  SELECT VALUE tweetSafetyCheck(t)
  FROM FEED Tweets t);
Figure 14: Enrich and insert records from a feed

This model would be more efficient than the previous two, as the attached UDF is evaluated once for all incoming data. Any pre-computation for enriching the incoming data occurs only once and is used for all incoming data. Although this model would provide the best ingestion performance since it has the smallest execution overhead, it cannot be used when the attached UDF is stateful. Taking SQL++ UDF 2 as an example, when we attach this UDF to a data feed and use this model to compute it, the evaluation would become a join operation between the “SensitiveWords” dataset and the feed data source. Depending on the join algorithm and the size of the “SensitiveWords” dataset, the evaluation of the attached UDF 2 may have problems in certain cases:

  1. Hash Join with a small “SensitiveWords” dataset

    The evaluation of a hash join operation consists of two phases: build and probe [24]. In the build phase, the “SensitiveWords” dataset would be built into a hash table. In the probe phase, the data coming from the Twitter feed would then use the hash table to find the matching records in the “SensitiveWords” dataset.

    When the “SensitiveWords” dataset is small, the created hash table can be kept in memory. This allows incoming data to continuously probe the in-memory hash table for enrichment while the ingestion continues as shown in Figure 15. This appears to be a perfect model for this case, but it cannot incorporate the new changes to the “SensitiveWords” dataset, as the in-memory hash table would be built once and then used throughout the ingestion process.

    Figure 15: Case 1: Small SensitiveWords dataset
  2. Hash Join with a big “SensitiveWords” dataset

    When the “SensitiveWords” dataset is large, part of its data will be spilled to disk for the next round of the join [24]. This is shown in Figure 16. The hash join algorithm expects to process such spilled data recursively, after reading “all” data from Twitter, but of course the tweets will not stop coming. Thus, this model cannot be used in this case.

    Figure 16: Case 2: Big SensitiveWords dataset
  3. Index Nested Loops Join

    If there is an index on the “country” attribute of the “SensitiveWords” dataset, the SQL++ query compiler may choose the index nested loop join algorithm to compute the join operation. In this case, the incoming data can be used to look up the existing index first, then find the matched records for enrichment, as shown in Figure 17. By choosing this join algorithm manually and utilizing indexes for this specific join case, one could avoid creating intermediate states during the enrichment operation and thus see the new data changes directly. However, this approach is not applicable to more general use cases where the indexes on referenced datasets may not always exist, and/or where an enrichment UDF contains other operations that create intermediate states. As an example, the function in Figure 18 flags a tweet if it comes from one of the top 10 countries containing more keywords than others. The top 10 countries list would not be refreshed unless the attached UDF is evaluated again.

    Figure 17: Case 3: Enrich with an available index
      CREATE FUNCTION highRiskTweetCheck(t) {
        LET high_risk_flag = CASE
          t.country IN (SELECT VALUE s.country
            FROM SensitiveWords s
            GROUP BY s.country
            ORDER BY count(s) DESC
            LIMIT 10)
          WHEN true THEN "Red" ELSE "Green"
        END
        SELECT t.*, high_risk_flag
      };
    Figure 18: Enrichment UDF with a nested subquery

In the current Apache AsterixDB release, data feeds actually use this streaming model to evaluate any attached UDFs on an ingestion pipeline, so the attached UDFs are limited to being stateless. In order to support stateful data enrichment UDFs and allow users to use the full power of SQL++ in more complex data enrichment use cases, we need to create a new data ingestion framework that evaluates attached complex stateful UDFs properly.

6 Framework Building Blocks

As discussed in Section 5.3, only Models 1 and 2 support complex data enrichment during data ingestion while also capturing reference data changes. We have thus built a new data ingestion framework based on Model 2, as it provides flexibility by allowing users to choose the right batch sizes for their use cases. In this section, we describe the design of this new framework and the optimization techniques that we used to improve its performance.

6.1 Predeployed Jobs

Following Model 2, our new ingestion pipeline consists of two independent Hyracks jobs: an intake job and an insert job. A sample ingestion pipeline on a three-node cluster is shown in Figure 19. The intake job contains the feed adapter and parser (the adapter on each node can be activated separately depending on the use case), and this job runs continuously throughout the lifetime of the ingestion process. The insert job takes a batch of records from the intake job, enriches them by applying the attached UDFs, and inserts the enriched records into a dataset. It runs repeatedly, being invoked once per batch, during ingestion.

The insert job in Figure 19 is constructed based on the query in Figure 10. For every batch collected from the intake job, we replace the array of constant records in that insert query with the collected batch and execute it. As introduced in Section 2.2, a query in AsterixDB is first optimized and compiled into a job specification and then distributed to the cluster for execution. Since the insert job is executed repeatedly, we utilize parameterized predeployed jobs to avoid redundant query compilation and repeated job distribution costs.

Figure 19: Ingestion pipeline using insert jobs
Figure 20: Parameterized predeployed job

Parameterized predeployed jobs are straightforward. As shown in Figure 20, a user can choose to predeploy a query with specified parameters. This query is optimized and compiled normally, and the compiled job specification is then predeployed to all nodes in the cluster and cached there. When a user wants to run this query with a particular parameter, only an invocation message carrying the new parameter value is sent, instead of repeating the entire query compilation and distribution process.

Using this technique, the insert job is distributed as a predeployed job in the cluster before the feed starts. When the intake job obtains a new batch of records, it invokes a new insert job with the collected batch as the parameter.

6.2 Layered Ingestion Pipeline

Repeatedly executing the insert job allows any intermediate states created during UDF evaluation to be refreshed, so that new data changes will be used for enriching the incoming data. It should be noted that each insert job, like any insert query, has to wait for the storage log to be flushed before it can finish properly. Also, since the UDF evaluation and storage operations work sequentially in an insert job, UDF evaluation can be blocked while waiting for the downstream data to be written into storage. To fully utilize the cluster’s computing resources and improve overall throughput, we further decouple the insert job into a computing job and a storage job so they can work concurrently. The decoupled ingestion framework is shown in Figure 21.

Figure 21: Decoupled ingestion framework

In the decoupled ingestion framework, the intake job handles data from external data sources, the computing job evaluates the attached UDFs, if any, and the storage job writes the enriched data into storage. The intake job and storage job begin to run when the data feed starts, while the computing job is run repeatedly as data batches come in. As in the previous discussion, the computing job is distributed as a predeployed job to reduce execution overhead.

6.3 Partition Holders

In the decoupled ingestion framework, data frames are passed from the intake job to a computing job, and then from the computing job to the storage job. Currently, data exchanges in Hyracks are limited to being within the scope of a job; one job cannot access data frames from another job at runtime. As data exchanges between jobs in the decoupled framework are frequent and need to be fast, we needed to add an efficient mechanism to allow data to be passed from one job to another.

Since the operators in a job each work on data partitions, aligning the output partitions of one job with the input partitions of another allows data frames to be shipped between jobs efficiently through in-memory structures. For this, we introduce a new type of operator in Hyracks - a partition holder - to enable efficient data exchanges between jobs.

A partition holder operator “guards” a runtime partition by holding the incoming data frames in a queue with a limited size. There are two types of partition holders, active and passive, as shown in Figure 22. An active partition holder follows the default push strategy in Hyracks; it receives data frames from other jobs and pushes them to its downstream operators actively. A passive partition holder implements a pull strategy; it receives data frames from its upstream operators and waits for other jobs to pull them. Each partition holder has a unique ID that is associated with its partition number. When a new partition holder is created, it registers with the local partition holder manager. Jobs sending/receiving data to/from another job can locate the corresponding partition holders through local partition holder managers. In the decoupled ingestion framework, we add a passive partition holder to the tail of the intake job so that the computing jobs can request and receive data in batches. An active partition holder is added to the head of the storage job so that computing jobs can push the enriched data on to the storage job.

Figure 22: Partition holders
Figure 23: The new ingestion framework

7 The New Ingestion Framework

7.1 Runtime Management

As we discussed earlier, AsterixDB is a parallel data management system that runs on a cluster of commodity machines. In an AsterixDB cluster, one (and only one) node runs the Cluster Controller (CC) that takes in users’ queries and translates them into Hyracks jobs. Only the CC can start new jobs, and it keeps track of the progress of the running jobs in case of any failures. All working nodes in the cluster run a Node Controller (NC) that takes computing tasks from the CC. The CC and NC can coexist on the same node.

In the new ingestion framework, there are two long-running jobs, the intake and storage jobs, and one short-lived but repeatedly invoked computing job. In order to monitor data feed jobs, we created an Active Feed Manager (AFM) on the CC to manage the lifecycle of data feeds. The AFM tracks all active data feeds and helps them invoke new computing jobs when new data batches arrive.

When a user submits a start feed request, the CC creates the intake, computing, and storage jobs based on compiled job specifications generated from a query template similar to Figure 10. The intake and storage jobs start running immediately, and the computing job is predeployed into the cluster for later invocations. The AFM maintains the mappings from data feeds to predeployed computing jobs, so it can invoke new computing jobs for each data feed separately.

When the intake job starts, it asks the AFM on the CC to invoke the first computing job and to keep invoking new computing jobs when the previous one finishes. After that, the intake job begins ingesting data from an external data source, adding data records into its queue, and waiting for the computing job to collect the ingested data. The current computing job takes a data batch from the intake job, enriches its records with an attached UDF if any, and then pushes the enriched data batch to the storage job. When this computing job finishes, the AFM on the CC will then start a new computing job to continue the processing.

When the user stops a feed, the intake job first stops taking new data and then adds a special “EOF” data record into its queue. When a computing job sees this record, it will finish its current execution with the collected data without waiting for a complete batch. The intake job finishes when all ingested data has been consumed. When the intake job finishes, it notifies the AFM to stop invoking new computing jobs for this feed. When the last computing job for the feed finishes, the storage job stops accordingly.

7.2 The Architecture

The new ingestion framework consists of three jobs - the intake, computing, and storage jobs. As we have introduced in Section 2.2, a Hyracks job contains operators, which describe computational operations, and connectors, which express data routing strategies. In order to demonstrate how data is processed and transported in the new ingestion framework, Figure 23 shows the composition of the framework running on three nodes at the level of operators and connectors:

  • The intake job obtains/receives data from external data sources. The data enters the system through the Adapter, which collects data as raw bytes and arranges them into data frames for transportation within the system. Depending on the incoming data load and use case, the Adapter can be activated on one or more nodes. The ingested data frames are then fed through the Round-robin Partitioner to be distributed in a round-robin fashion. Since the attached UDFs could be expensive, distributing the incoming data evenly helps to minimize the overall execution time of the computing job. The partitioned data is forwarded to the Intake Partition Holder (implemented as a passive partition holder), which then waits for computing jobs to pull the data.

  • The computing job evaluates the attached UDF to enrich data batches. A computing job starts by collecting a data batch from a local intake partition holder. The obtained data batch is first parsed by the Parser and then fed to the UDF Evaluator for data enrichment. Depending on the attached UDF, the UDF evaluator could be a Java program that runs on each node independently, or it could be a group of operators produced by compiling a complex SQL++ UDF. In either case, local resource files (for Java UDFs) or reference datasets (for SQL++ UDFs) may be accessed, and intermediate states may be created as well. After being enriched, the data is pushed to the Feed Pipeline Sink to be forwarded to the storage job.

  • The storage job receives enriched data and stores it to disk. The enriched data is first received by the Storage Partition Holder, which is implemented as an active partition holder. The Storage Partition Holder pushes the received enriched data actively to the connected Hash Partitioner. The Hash Partitioner partitions the enriched data records by their primary keys so they can be stored in the appropriate Storage Partitions.

8 Experiments

In this section, we present a set of experiments that we conducted to evaluate the new ingestion framework. We compared the basic ingestion performance of the new ingestion framework with that of the existing Apache AsterixDB ingestion framework. We examined the data enrichment performance of the new ingestion framework using various Java and SQL++ UDFs. Finally, we investigated the speed-up and scale-out performance of the new ingestion framework for more complex data enrichment workloads. Our experiments were conducted on a cluster connected by a Gigabit Ethernet switch. Each node had a dual-core AMD Opteron 2212 processor at 2.0 GHz, 8 GB of RAM, and a 900 GB hard disk.

8.1 Basic Data Ingestion

When ingesting data without an attached UDF, the computing job in the new ingestion framework simply moves data from the intake job to the storage job. By comparing the data ingestion performance of the new ingestion framework to that of the current AsterixDB ingestion framework without UDFs, we can examine the execution overhead introduced by periodically refreshing the computing job in the new ingestion framework.

For this purpose, we measured the ingestion time of the current and new frameworks for a fixed amount of data, and we increased the size of the cluster to see how their ingestion times decreased when given more resources. The results are shown in Figure 24. In the current version of AsterixDB, all of the data parsing load is assigned to one intake node (the node that runs the feed adapter and parser). For a fairer comparison here, we designed another “balanced” version of the current ingestion framework in which the parsing load is manually divided across all nodes. For both the new ingestion framework and the ingestion framework of current AsterixDB, we used only one intake node. We will refer to the experiments on the new framework as “New”, the experiments on the current framework as “Current”, and the experiments on the balanced version of the current framework as “Balanced Current”.

For ingestion using the new ingestion framework, we experimented with different batch sizes to examine the effect of batch size on ingestion performance. We chose three batch sizes: 420 records/batch (New 1X), 1680 records/batch (New 4X), and 6720 records/batch (New 16X). We also varied the size of the cluster from 1 node to 6 nodes to see how ingestion speeds up with increased cluster size. We let the system ingest 1,000,000 tweets and measured the overall ingestion time.

Figure 24: 1M tweets ingestion speed-up over 6 nodes

As we can see in Figure 24, the ingestion performance of Current remained the same as the cluster size increased. In this case, the ingestion performance was limited by the parsing bottleneck on the single intake node. In contrast, Balanced Current kept improving, as more nodes participated in parsing the data. For New with different batch sizes, the ingestion time decreased as the batch size grew. The reason was that fewer computing jobs were initiated to perform the ingestion when the batch size was larger. The overall execution overhead for the computing jobs thus decreased as the batch size increased.

We can focus on the performance difference between New 1X and Balanced Current to compare the new and existing ingestion frameworks. With one node, the ingestion performance of New 1X was better than that of Balanced Current. This was because the data parsing and storage operations are decoupled into different jobs in the new framework, whereas they are chained together in a single job in the existing framework. Having them in different jobs allows data parsing in the new framework to operate on new data batches without waiting for the previous batch to be consumed by the storage operation, so the new framework can better utilize the available computing resources. As the cluster size grew, the execution overhead of repeatedly invoking the computing job in the new framework increased, and the performance of New 1X started to fall behind (the single-job-based) Balanced Current.

Figure 25: 1M tweets Ingestion with UDFs (log scale)

8.2 Data Enrichment with UDFs

To investigate the performance of the new ingestion framework in enriching data during data ingestion, we designed four sample use cases where the attached UDFs cover several common operations used in database queries, including join, group-by, order-by, and spatial join. The four use cases are as listed below:

  1. Safety Rating: Given a list of country names and their corresponding safety ratings, enrich each incoming tweet with a safety rating based on its “country” field value. (Hash join, Appendix A; a sketch of such a UDF is given after this list)

  2. Religious Population: Given the population of each religion in every country, enrich each incoming tweet with the overall religious population based on its “country” field value. (Group-by, Appendix B)

  3. Largest Religions: Given the population of each religion in every country, enrich each incoming tweet with the three largest religions according to its “country” field value. (Order-by, Appendix C)

  4. Nearby Monuments: Given a list of monuments and their coordinates, enrich each incoming tweet with the monuments within 1.5 degrees of the tweet’s location. (Spatial join, Appendix D)
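
The UDF definitions are given in the appendices, which are not reproduced here. Purely as an illustration of the kind of UDF involved, the Safety Rating enrichment (use case 1) could be written along the following lines, assuming a hypothetical “SafetyRatings” dataset with “country” and “rating” attributes:

  CREATE FUNCTION safetyRatingCheck(t) {
    LET safety_rating = (SELECT VALUE r.rating
        FROM SafetyRatings r
        WHERE r.country = t.country)[0]
    SELECT t.*, safety_rating
  };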

All of these enrichment UDFs are stateful, and their evaluations involve the challenges that we discussed in Section 5.3. Since the current ingestion pipeline of AsterixDB does not support attaching such stateful SQL++ UDFs to data feeds, we only experimented with Java UDFs on the current AsterixDB pipeline, denoted as “Current”. As Java UDFs are initialized once at the beginning of ingestion, data changes during ingestion could not be captured by the current data feeds. Similar to the basic data ingestion experiment, we added “Balanced Current”, where the parsing load was manually divided across all nodes, to avoid the impact of the parsing bottleneck on a single node in the current ingestion framework.

We tested both SQL++ and Java UDFs in the new ingestion framework. We varied the batch sizes for both UDFs from 420 records/batch to 1680 records/batch and 6720 records/batch to see how batching in the new ingestion framework affects performance. We denote the SQL++ cases as “New SQL++ 1X”, “New SQL++ 4X”, and “New SQL++ 16X” and the Java cases as “New Java 1X”, “New Java 4X”, and “New Java 16X”, respectively.

We deployed the system on a 6-node cluster and again let it ingest and enrich 1,000,000 tweets with the different UDFs on the ingestion pipeline to see the overall ingestion time differences for all four enrichment use cases. The experimental results (in log scale) are shown in Figure 25. As we can see, Balanced Current performed the best among all configurations for all UDFs. Compared with Current, the balanced version avoided the parsing bottleneck on a single node. Compared with the other configurations on the new ingestion framework (with both Java and SQL++), Balanced Current saved computation time by building the in-memory structures once and reusing them for data enrichment throughout the ingestion process. This allowed higher ingestion speed but ignored all data changes during the ingestion process, so it is included here just as a performance baseline.

On the new ingestion framework, ingestion performance improved as the batch size was increased. As in the ingestion-only case, the execution overhead of the computing jobs decreased as the batch size increased. The Safety Rating, Religious Population, and Largest Religions use cases showed significant speed improvements as we increased the batch sizes, while the Nearby Monuments use case did not. This was because, in the Nearby Monuments use case, most of the ingestion time was spent computing the expensive spatial join between the incoming tweets and the reference dataset, so the reduction in execution overhead was negligible compared with the computation time. As we will see in the later experiments, one can speed up such enrichment UDFs by adding more computing resources.

Comparing New Java at different batch sizes with Balanced Current, the performance difference between the two ingestion frameworks shows the overhead that the new ingestion framework introduces to capture reference data changes by repeatedly invoking new computing jobs. As we increased the batch size, the performance of New Java approached that of Balanced Current (which represents the ideal performance for the new ingestion framework). Benefiting from the predeployed jobs and the new batching architecture, the execution overhead associated with the new ingestion framework is seen to be relatively small.

8.3 Scale-out Experiments

In order to investigate the performance of the new ingestion framework on large scale data enrichment tasks, we designed three more complex data enrichment use cases that enriched the incoming tweets with more information. We tested the new framework with these cases to see how it scales. The additional use cases are listed below:

  1. Suspicious Names: Include the number of nearby facilities grouped by their types, the three closest religious buildings within three degrees of the tweet’s location, and information about suspicious users who have the same name as the tweet’s author. (Appendix E)

  2. Tweet Context: Include the average income for the district where the tweet was posted, the number of facilities in this district grouped by their types, and the ethnicity distribution of the residents in this district. (Appendix F)

  3. Worrisome Tweets: Include the religion names of the religious buildings within three degrees of the tweet and the number of terrorist attacks in the past two months that were related to that religion. (Appendix G)

Figure 26: UDF complexity comparison
Figure 27: 50K tweets ingestion speed-up for 24 vs. 6 Nodes with different batch sizes

To demonstrate the complexity of these additional use cases, we compared their evaluation times with that of the “Nearby Monuments” use case, the most complex UDF from the previous experiment. We let the new ingestion framework ingest and enrich 50,000 tweets with these four UDFs, and we measured the ingestion time in each case. The results are shown in Figure 26. We see that the added use cases had different complexities and that different use cases benefited from batch size changes differently. In the Tweet Context use case, there were multiple expensive spatial joins between the referenced datasets before joining with the tweets. Increasing the batch size reduced the number of times this computation was performed and thus lowered the overall ingestion time. In the other use cases, the tweets mostly joined with the reference datasets sequentially. Thus, increasing the batch sizes offered limited improvements in the Nearby Monuments, Suspicious Names, and Worrisome Tweets use cases.

The performance of scaling out the new framework is determined by the cluster size, batch size, and UDF complexity. Although increasing the number of nodes for computation can reduce the execution time of a computing job, it also introduces additional execution overhead for executing jobs on a larger cluster, so adding more resources may not always improve the overall ingestion time. Given a simple enrichment UDF, a small batch size, and a large cluster, the speed-up performance might be bounded by the execution overhead. To explore the relationship of these three factors, we experimented with the speed-up performance using different batch and cluster sizes.

We let the framework ingest and enrich 50,000 tweets using all seven UDFs. For each UDF, we measured its ingestion time on a 6-node cluster and a 24-node cluster separately and computed the resulting speed-up. We repeated this computation for each UDF with three different batch sizes, namely 420 records/batch (1X), 1680 records/batch (4X), and 6720 records/batch (16X), and we show the speed-up of each batch size in Figure 27.

Since the UDFs in the Safety Rating, Religious Population, and Largest Religions use cases were relatively simple, adding more resources yielded limited improvements to the execution times of their computing jobs. At the same time, their execution overhead grew as the cluster size increased. As a result, their speed-up was relatively poor. In contrast, the other UDFs (Nearby Monuments, Suspicious Names, Tweet Context, and Worrisome Tweets) each improved with more resources. For Tweet Context in particular, not only were there 4x as many nodes participating in the computation, but the added resources (particularly memory) also allowed the join process to finish earlier. This enabled the system to obtain more than the ideal 4x speed-up. For a given volume of tweets, a bigger batch size means that fewer computing job invocations are needed to enrich them, so the speed-up is better because the increase in execution overhead due to the larger cluster is smaller.

Figure 28: 50K tweets ingestion speed-ups (log scale)

In order to see how ingestion performance improves when adding more resources, we also evaluated the speed-up behavior of the four complex UDFs (Nearby Monuments, Suspicious Names, Tweet Context, and Worrisome Tweets). We asked the new ingestion framework to ingest and enrich 50,000 tweets, and we varied the cluster size from 1 node up to 24 nodes to see how the overall ingestion and enrichment times changed for a fixed batch size of 420 records/batch. The experimental results are shown in Figure 28. As we added computing resources to the system, the ingestion performance improved, since the available computing resources were the primary bottleneck for these complex enrichment UDFs. The overall ingestion time decreased significantly when we initially added more resources and started to level off as we further increased the size of the cluster. This was because the execution overhead of a larger cluster started to take away the speed-up benefits. The overall ingestion time of the different use cases still followed the complexities of the attached UDFs.

Lastly, we conducted a scale-out performance experiment with a subset of the UDFs (Nearby Monuments, Suspicious Names, Tweet Context, and Worrisome Tweets). We started with the ingestion and enrichment of 50,000 tweets (1X) on 6 nodes, and we then increased the size of the incoming data and the cluster proportionally (2X tweets on 12 nodes, 3X tweets on 18 nodes, and 4X tweets on 24 nodes) to see how the new ingestion framework scaled. The results are shown in Figure 29. As we increased the cluster and data sizes, the ingestion times for the Worrisome Tweets, Suspicious Names, and Nearby Monuments use cases remained relatively stable and grew only gradually due to the increased execution overhead on a larger cluster; their scale-out performance was as expected. For the Tweet Context use case, as in the speed-up experiment, the added resources changed how the join operation was evaluated, and the ingestion time actually decreased. This shows that the new ingestion framework is capable of handling data ingestion and enrichment tasks at scale.
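In terms of the back-of-envelope model introduced above (again our own formulation, not a metric defined by the system), ideal scale-out behavior corresponds to

  T_total(m·N, B, m·k) ≈ T_total(N, B, k)  for m = 1, 2, 3, 4,

with any residual growth attributable to the increase of the per-job overhead c_job(m·k) on the larger cluster; the relatively flat curves for the Worrisome Tweets, Suspicious Names, and Nearby Monuments use cases in Figure 29 match this expectation.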

Figure 29: Tweet ingestion and enrichment going from 6 nodes to 24 nodes

9 Related Work

Data enrichment has been widely used in data analysis applications in which the collected data contains limited information and needs to be correlated with existing knowledge to reveal higher-level insights. Abel et al. proposed constructing Twitter user profiles by extracting semantics from tweets and relating them to collected news articles [1]. Moraru et al. introduced a framework for enriching sensor measurements with semantic concepts to generate new features [20]. Our work aims at providing a scalable framework that users can employ to put such data enrichment operations into the ingestion pipeline so that the enriched data can be used by analytical queries as soon as it is persisted.

User-defined functions have been a long-standing feature of database systems [17, 25]. UDFs allow users to register their own functions with the database system for customized data processing and then to invoke them in declarative queries. Hellerstein and Stonebraker designed a predicate migration algorithm that moves expensive functions within a query plan to minimize the total cost of the query [13]. Rheinländer et al. surveyed techniques for optimizing complex dataflows with UDFs [23]. In our work, we use the UDF feature as the means for users to specify their data enrichment operations. An enrichment UDF is compiled and optimized by the query compiler as a whole, and we rely on the query compiler to produce a good query plan. Prepared queries are a technique that caches compiled query plans to improve query performance; the predeployed jobs technique that we used to reduce the execution time of computing jobs was inspired by it.

The traditional ETL process defines a workflow, including data collection, extraction, transformation, cleansing, and loading, that is performed to move data from an operational system into a data warehouse [8]. Data is extracted from an operational system, cleaned and transformed into a defined schema for analysis, and loaded into a periodically refreshed data warehouse for querying and data analysis. The refreshment process is often executed in an off-line mode with a relatively long period in order to minimize the burden on the operational systems [26]. Bruckner et al. proposed a near-real-time architecture that minimizes the delay before new data created in an operational system is loaded into the data warehouse [5]. In this paper, we focused on building an efficient and succinct framework aimed at ingesting and enriching data at the same time. While a user can achieve part of the ETL functionality by constructing appropriate UDFs, we do not consider the new ingestion framework to be a tool for solving general ETL problems. (Similarly, using a complicated ETL suite for data enrichment here would have been overkill.) The data feeds feature is related to the continuous data loading technique commonly used in near-real-time data warehouses [14].

Streaming engines were introduced into the Big Data toolkit to address the need for stream processing and real-time data analysis. They can handle streaming data sources and process stream data on the fly. Many streaming engines also allow users to access reference data during processing. Kafka [16] uses “change data capture” in combination with the Connect API to access reference data in databases, with a limited set of processing operations. Flink [6] supports registering external resources as Tables and offers the DataStream API to process the streaming data. Spark Streaming [28] uses Discretized Streams to discretize an incoming stream into Resilient Distributed Datasets and allows users to transform the data using normal Spark operations. Since streaming engines are designed for stream processing rather than for complex data analysis queries, the processed results are often stored in connected data warehouses [19]. In this paper, we have focused on data enrichment use cases where the reference data is frequently accessed and changed, and where the processed data needs to be stored in a data warehouse for timely data analysis. We sought to minimize the effort required from users so that they can create a data ingestion pipeline easily, with declarative statements, and apply enrichment UDFs without limitations. To achieve these goals, we chose to build a new ingestion framework that supports the full power of SQL++ for data enrichment operations inside AsterixDB. The batch processing model that we chose is also commonly used in streaming engines.

10 Conclusions

In this paper, we have investigated how to enrich incoming data during the data ingestion process. We discussed the challenges in data ingestion, illustrated possible computing models for evaluating stateful UDFs for data enrichment, and discussed the problems that may occur in different scenarios. We believe that an ingestion pipeline that supports efficient data ingestion and enrichment should be able to capture reference data changes during the ingestion process, maintain intermediate states properly, and support different enrichment operations with a full query language. To achieve these goals, we created a new ingestion framework with multiple optimization techniques. Its layered architecture allows the ingestion pipeline to better utilize the cluster resources. The repeatedly executed computing jobs in the framework allow incoming data to be enriched correctly, and its predeployed jobs improve the execution performance of those computing jobs. The framework’s partition holders support efficient data exchange between jobs. We conducted a series of experiments to verify the performance of the new ingestion pipeline. The results showed that the new ingestion framework can be scaled to support various data enrichment workloads involving reference data and/or stateful operations.

11 Acknowledgments

The work reported in this paper was supported in part by NSF CNS award 1305430.

References

  • [1] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Semantic enrichment of Twitter posts for user profile construction on the social web. In Extended semantic web conference, pages 375–389. Springer, 2011.
  • [2] S. Alsubaiee, Y. Altowim, H. Altwaijry, A. Behm, V. Borkar, Y. Bu, M. Carey, I. Cetindil, M. Cheelangi, K. Faraaz, et al. AsterixDB: A scalable, open source BDMS. Proceedings of the VLDB Endowment, 7(14):1905–1916, 2014.
  • [3] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 1151–1162. IEEE, 2011.
  • [4] I. Botan, Y. Cho, R. Derakhshan, N. Dindar, L. Haas, K. Kim, and N. Tatbul. Federated stream processing support for real-time business intelligence applications. In International Workshop on Business Intelligence for the Real-Time Enterprise, pages 14–31. Springer, 2009.
  • [5] R. M. Bruckner, B. List, and J. Schiefer. Striving towards near real-time data integration for data warehouses. In International Conference on Data Warehousing and Knowledge Discovery, pages 317–326. Springer, 2002.
  • [6] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.
  • [7] D. Chamberlin. SQL++ For SQL Users: A Tutorial. Couchbase, Inc., 2018.
  • [8] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Rec., 26(1):65–74, Mar. 1997.
  • [9] A. Doyle, G. Katz, K. Summers, C. Ackermann, I. Zavorin, Z. Lim, S. Muthiah, P. Butler, N. Self, L. Zhao, et al. Forecasting significant societal events using the embers streaming predictive analytics system. Big Data, 2(4):185–195, 2014.
  • [10] L. Duan and Y. Xiong. Big Data analytics and business analytics. Journal of Management Analytics, 2(1):1–21, 2015.
  • [11] H.-P. Grahsl. Kafka connect MongoDB sink, 2016. [Online; accessed 23-December-2018].
  • [12] R. Grover and M. J. Carey. Data ingestion in AsterixDB. In EDBT, pages 605–616, 2015.
  • [13] J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates, volume 22. ACM, 1993.
  • [14] C. S. Jensen, T. B. Pedersen, and C. Thomsen. Multidimensional databases and data warehousing. Synthesis Lectures on Data Management, 2(1):1–111, 2010.
  • [15] E. D. Knapp and J. T. Langill. Industrial Network Security (Second Edition). Syngress, Boston, 2015.
  • [16] J. Kreps, N. Narkhede, J. Rao, et al. Kafka: A distributed messaging system for log processing. In Proceedings of the NetDB, pages 1–7, 2011.
  • [17] V. Linnemann, K. Küspert, P. Dadam, P. Pistor, R. Erbe, A. Kemper, N. Südkamp, G. Walch, and M. Wallrath. Design and implementation of an extensible database management system supporting user defined data types and functions. In VLDB, pages 294–305, 1988.
  • [18] S. Mansmann, N. U. Rehman, A. Weiler, and M. H. Scholl. Discovering OLAP dimensions in semi-structured data. Information Systems, 44:120–133, 2014.
  • [19] J. Meehan, C. Aslantas, S. Zdonik, N. Tatbul, and J. Du. Data ingestion for the connected world. In CIDR, 2017.
  • [20] A. Moraru and D. Mladenić. A framework for semantic enrichment of sensor data. Journal of computing and information technology, 20(3):167–173, 2012.
  • [21] A. Morgan. MongoDB & data streaming – implementing a MongoDB Kafka consumer, 2016. [Online; accessed 23-December-2018].
  • [22] K. W. Ong, Y. Papakonstantinou, and R. Vernoux. The SQL++ query language: Configurable, unifying and semi-structured. arXiv preprint arXiv:1405.3631, 2014.
  • [23] A. Rheinländer, U. Leser, and G. Graefe. Optimization of complex dataflows with user-defined functions. ACM Computing Surveys (CSUR), 50(3):38, 2017.
  • [24] L. D. Shapiro. Join processing in database systems with large main memories. ACM Transactions on Database Systems (TODS), 11(3):239–264, 1986.
  • [25] M. Stonebraker, L. A. Rowe, and M. Hirohama. The implementation of POSTGRES. IEEE transactions on knowledge and data engineering, 2(1):125–142, 1990.
  • [26] P. Vassiliadis. A survey of extract–transform–load technology. International Journal of Data Warehousing and Mining (IJDWM), 5(3):1–27, 2009.
  • [27] H. J. Watson. Tutorial: Big Data analytics: Concepts, technologies, and applications. CAIS, 34:65, 2014.
  • [28] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al. Apache Spark: A unified engine for Big Data processing. Communications of the ACM, 59(11):56–65, 2016.

Appendix A Safety Score

In this experiment, we enrich a tweet with the safety rating of the country where the tweet comes from. The SafetyRatings dataset contains 50,000 records.

  CREATE TYPE SafetyRatingType AS open {
    country_code : string,
    safety_rating: string
  };
  CREATE DATASET SafetyRatings(SafetyRatingType)
    PRIMARY KEY country_code;
  CREATE FUNCTION enrichTweetQ1(t) {
    LET safety_rating = (SELECT VALUE s.safety_rating
      FROM SafetyRatings s
      WHERE t.country = s.country_code)
    SELECT t.*, safety_rating
  };
Figure 30: Safety rating enrichment
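
For context, the following is a minimal sketch of how an enrichment function like enrichTweetQ1 could be attached to an ingestion pipeline using AsterixDB’s stock data feed DDL. It assumes a Tweets dataset and a Tweet type already exist, the socket adapter settings are placeholders, and the new framework described in this paper may expose a different or extended connection syntax.

  CREATE FEED TweetFeed WITH {
    "adapter-name": "socket_adapter",
    "sockets": "127.0.0.1:10001",
    "address-type": "IP",
    "type-name": "Tweet",
    "format": "adm"
  };
  -- Route each incoming record through the enrichment UDF before storing it.
  CONNECT FEED TweetFeed TO DATASET Tweets APPLY FUNCTION enrichTweetQ1;
  START FEED TweetFeed;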

Appendix B Religious Population

In this experiment, we enrich an incoming tweet with the total religious population in its country. The ReligiousPopulations dataset contains 50,000 records.

  CREATE TYPE ReligiousPopulationType AS open {
      rid : string,
      country_name : string,
      religion_name : string,
      population: int
  };
  CREATE DATASET ReligiousPopulations
    (ReligiousPopulationType) PRIMARY KEY rid;
  CREATE FUNCTION enrichTweetQ2(t) {
      LET religious_population =
        (SELECT sum(r.population) FROM
        ReligiousPopulations r
        WHERE r.country_name = t.country)[0]
      SELECT t.*, religious_population
  };
Figure 31: Religious population enrichment

Appendix C Largest Religions

In this experiment, we enrich a tweet with the 3 largest religions in the country that the tweet came from. The ReligiousPopulations dataset contains 50,000 records.

  CREATE TYPE ReligiousPopulationType AS open {
      rid : string,
      country_name : string,
      religion_name : string,
      population: int
  };
  CREATE DATASET ReligiousPopulations
    (ReligiousPopulationType) PRIMARY KEY rid;
  CREATE FUNCTION enrichTweetQ3(t) {
      LET largest_religions =
        (SELECT VALUE r.religion_name
        FROM ReligiousPopulations r
        WHERE r.country_name = t.country
        ORDER BY r.population DESC LIMIT 3)
      SELECT t.*, largest_religions
  };
Figure 32: Largest religions enrichment

Appendix D Nearby Monuments

In this experiment, we enrich an incoming tweet with the monuments that are within 1.5 degrees of the tweet’s location. The monumentList dataset contains 50,000 records.

  CREATE TYPE monumentType AS open {
    monument_id: string,
    monument_location: point
  };
  CREATE DATASET monumentList(monumentType)
    PRIMARY KEY monument_id;
  CREATE FUNCTION enrichTweetQ4(t) {
    LET nearby_monuments =
      (SELECT VALUE m.monument_id
      FROM monumentList m
      WHERE spatial_intersect(
        m.monument_location,
        create_circle(
          create_point(t.latitude, t.longitude),
            1.5)))
    SELECT t.*, nearby_monuments
  };
Figure 33: Nearby monuments enrichment

Appendix E Suspicious Names

In this experiment, we enrich a tweet with the number of nearby facilities grouped by their types, the three closest religious buildings within three degrees of the tweet’s location, and information about suspicious users who have the same name as the tweet’s author. The ReligiousBuildings dataset contains 10,000 records. The Facilities dataset contains 50,000 records. The SuspiciousNames dataset contains 1,000,000 records.

  CREATE TYPE ReligiousBuildingType AS open {
      religious_building_id : string,
      religion_name : string,
      building_location : point,
      registered_believer: int
  };
  CREATE DATASET ReligiousBuildings(ReligiousBuildingType) PRIMARY KEY religious_building_id;
  CREATE TYPE FacilityType AS open {
      facility_id: string,
      facility_location: point,
      facility_type: string
  };
  CREATE DATASET Facilities(FacilityType) PRIMARY KEY facility_id;
  CREATE TYPE SuspiciousNamesType AS open {
      suspicious_name_id: string,
      suspicious_name: string,
      religion_name: string,
      threat_level: int
  };
  CREATE DATASET SuspiciousNames(SuspiciousNamesType) PRIMARY KEY suspicious_name_id;
  CREATE FUNCTION enrichTweetQ5(t) {
      LET nearby_facilities = (
        SELECT f.facility_type FacilityType, count(*) AS Cnt
        FROM Facilities f
        WHERE spatial_intersect(create_point(t.latitude, t.longitude),
          create_circle(f.facility_location, 3.0))
        GROUP BY f.facility_type),
      nearby_religious_buildings = (
        SELECT r.religious_building_id religious_building_id, r.religion_name religion_name
        FROM ReligiousBuildings r
        WHERE spatial_intersect(create_point(t.latitude, t.longitude),
          create_circle(r.building_location, 3.0))
        ORDER BY spatial_distance(create_point(t.latitude, t.longitude), r.building_location) LIMIT 3),
      suspicious_users_info = (
        SELECT s.suspicious_name_id suspect_id, s.religion_name AS religion, s.threat_level AS threat_level
        FROM SuspiciousNames s
        WHERE s.suspicious_name = t.user.name)
      SELECT t.*, nearby_facilities, nearby_religious_buildings, suspicious_users_info
  };

Figure 34: Enrich a tweet with nearby facilities and suspicious user information

Appendix F Tweet Context

In this experiment, we enrich a tweet with the average income of the district where the tweet is posted, the number of facilities in this district grouped by their types, and the ethnicity distribution of the residents in this district based on a sample of residents. The DistrictArea dataset contains 500 records. The Facilities dataset contains 50,000 records. The Persons dataset contains 1,000,000 records.

  CREATE TYPE DistrictAreaType AS open {
      district_area_id : string,
      district_area : rectangle
  };
  CREATE DATASET DistrictAreas(DistrictAreaType) PRIMARY KEY district_area_id;
  CREATE TYPE FacilityType AS open {
      facility_id: string,
      facility_location: point,
      facility_type: string
  };
  CREATE DATASET Facilities(FacilityType) PRIMARY KEY facility_id;
  CREATE TYPE AverageIncomeType AS open {
      district_area_id: string,
      average_income: double
  };
  CREATE DATASET AverageIncomes(AverageIncomeType) PRIMARY KEY district_area_id;
  CREATE TYPE PersonType AS open {
      person_id: string,
      ethnicity: string,
      location: point
  };
  CREATE DATASET Persons(PersonType) PRIMARY KEY person_id;
  CREATE FUNCTION enrichTweetQ6(t) {
      LET area_avg_income = (
        SELECT VALUE a.average_income
        FROM AverageIncomes a, DistrictAreas d1
        WHERE a.district_area_id = d1.district_area_id
          AND spatial_intersect(create_point(t.latitude, t.longitude), d1.district_area)),
      area_facilities = (
        SELECT f.facility_type, count(*) AS Cnt
        FROM Facilities f, DistrictAreas d2
        WHERE spatial_intersect(f.facility_location, d2.district_area)
          AND spatial_intersect(create_point(t.latitude, t.longitude), d2.district_area)
        GROUP BY f.facility_type),
      ethnicity_dist = (
        SELECT ethnicity, count(*) AS EthnicityPopulation
        FROM Persons p, DistrictAreas d3
        WHERE spatial_intersect(create_point(t.latitude, t.longitude), d3.district_area)
          AND  spatial_intersect(p.location, d3.district_area)
        GROUP BY p.ethnicity AS ethnicity)
      SELECT t.*, area_avg_income, area_facilities, ethnicity_dist
  };

Figure 35: Enrich a tweet with the average income, facility numbers, and ethnicity distribution in the area where the tweet is posted

Appendix G Worrisome Tweets

In this experiment, we enrich an incoming tweet with the names of the religions whose buildings are within three degrees of the tweet’s location and the number of terrorist attacks related to each such religion in the two months before the tweet. The ReligiousBuildings dataset contains 10,000 records. The AttackEvents dataset contains 5,000 records.

  CREATE TYPE ReligiousBuildingType AS open {
      religious_building_id : string,
      religion_name : string,
      building_location : point,
      registered_believer: int
  };
  CREATE DATASET ReligiousBuildings(ReligiousBuildingType) PRIMARY KEY religious_building_id;
  CREATE TYPE AttackEventsType AS open {
      attack_record_id: string,
      attack_datetime: datetime,
      attack_location: point,
      related_religion: string
  };
  CREATE DATASET AttackEvents(AttackEventsType) PRIMARY KEY attack_record_id;
  CREATE FUNCTION enrichTweetQ7(t) {
      LET nearby_religious_attacks = (
        SELECT r.religion_name AS religion, count(a.attack_record_id) AS attack_num
        FROM ReligiousBuildings r, AttackEvents a
        WHERE spatial_intersect(create_point(t.latitude, t.longitude),
          create_circle(r.building_location, 3.0))
          AND t.created_at  < a.attack_datetime + duration("P2M")
          AND t.created_at  > a.attack_datetime
          AND r.religion_name = a.related_religion
        GROUP BY r.religion_name)
      SELECT t.*, nearby_religious_attacks
  };

Figure 36: Enrich a tweet with nearby religions and the recent terrorist attacks related to them