Polystore++: Accelerated Polystore System for Heterogeneous Workloads

05/24/2019 · Rekha Singhal et al. · Tata Consultancy Services and Stanford University

Modern real-time business analytics consist of heterogeneous workloads (e.g., database queries, graph processing, and machine learning). These analytic applications need programming environments that can capture all aspects of the constituent workloads (including the data models they work on and the movement of data across processing engines). Polystore systems suit such applications; however, these systems currently execute on CPUs, and the slowdown of Moore's Law means they cannot meet the performance and efficiency requirements of modern workloads. We envision Polystore++, an architecture to accelerate existing polystore systems using hardware accelerators (e.g., FPGAs, CGRAs, and GPUs). Polystore++ systems can achieve high performance at low power by identifying and offloading components of a polystore system that are amenable to acceleration using specialized hardware. Building a Polystore++ system is challenging and introduces new research problems motivated by the use of hardware accelerators (e.g., optimizing and mapping query plans across heterogeneous computing units and exploiting hardware pipelining and parallelism to improve performance). In this paper, we discuss these challenges in detail and list possible approaches to address these problems.


I Introduction

Modern data-analytic applications such as personalized health care [1], content filtering, and monitoring [2] often process data segregated across legacy data-processing engines (Oracle, Neo4j, and Spark), each with its own custom data model; such applications are referred to as heterogeneous workloads [3]. For example, enterprises typically maintain transactional records in a relational store (like Postgres [4]) and users’ clickstreams in a timeseries store (like TimescaleDB [5]). A recommendation application may need to access both the transactional and clickstream data to predict the next best offers for users (Figure 1). As another example, consider the MIMIC III clinical dataset [1]: a comprehensive collection of information on patients admitted to the Beth Israel Deaconess Medical Center in Boston, MA, USA between 2001 and 2012. It includes timeseries information from bedside monitoring devices, waveforms from the history of previous patients, structured logs of patients’ metadata (patient address, phone number, and more in a relational table), as well as semi-structured data (text) consisting of doctors’ and nurses’ notes and prescriptions. A real-time health-monitoring application for patients (Figure 2) operating on MIMIC III would therefore rely on different data-processing engines for each of the given data types (timeseries and text) to quickly predict patients’ stay in the ICU, typically with a latency target of a few milliseconds.

Fig. 1: Examples of enterprise analytic applications. A recommendation system spans across multiple heterogeneous data stores (RDBMS, Key/Value, and Timeseries).

Typically, these analytic applications consist of a combination of multiple sub-applications, each written and optimized for a particular data-processing engine (Oracle for SQL queries and Neo4j for graph-based operations). The number of such sub-applications depends on the different types of data models accessed by an analytic application. Each sub-application works on one of these data models, reading and processing the data using the corresponding data store. Finally, the processed data is moved across sub-applications (in sequence), transforming it to the data model of the receiving application. Or, sub-applications submit their results to a central storage, mapping them to a single data model, which a post-processing application running on the associated engine processes to generate the final outcome—an approach commonly known as ‘one-size-fits-all.’ However, both these approaches spend the majority of their time performing unnecessary movement and remodeling of data, thus inflating the response latency of a real-time analytic system. Furthermore, they burden the developers of analytic applications with manually specifying and optimizing scheduling policies to achieve high performance.

A polystore system [6], on the other hand, assumes that ‘one-size-does-not-fit-all,’ and federates and automates the processing of data across engines. In addition, these engines work with specific data models to execute heterogeneous workloads efficiently. Each processing engine (a database system) is optimized for a specific set of operations: joins in Postgres [4], matrix operations in SciDB [7], and path-finding in Neo4j [8]. A polystore system exploits the domain-specific characteristics of these engines to expedite heterogeneous workloads. The system forms a complete stack consisting of frontend interfaces for application development, compilers and optimizers to generate efficient code, ‘CAST’ [6] for moving data across various storage platforms, and native processing engines with their respective adapters.

Fig. 2: A real-time application for clinical analysis with an example of a natural-language query and the corresponding heterogeneous program for the MIMIC III dataset [1]. Each step in the heterogeneous program is a workload that interacts with relational, text, and stream stores, as well as a deep-neural-network engine.

Polystore systems minimize workload completion times by exploiting native capabilities of data-processing engines while optimizing for CPU-based computations. However, with Moore’s Law [9] slowing down compute and Dennard Scaling [10] hitting power limits, we no longer observe exponential improvements in general-purpose compute using CPUs with fixed area and power cost. Therefore, industry is now moving toward domain-specific hardware architectures that trade the flexibility of CPUs for improved efficiency. For example, Microsoft Brainwave [11] and Google’s Tensor Processing Unit (TPU) [12] are designed to accelerate deep-learning workloads, unlike general-purpose CPU-based systems.

We envision that next-generation polystore systems, or Polystore++ systems, can leverage such hardware accelerators [13] to reduce the execution time of heterogeneous workloads. For example, Snorkel [14] (Figure 3) tightly integrates SQL and machine-learning workloads: the load_data function is interspersed in ML code to fetch data from a database using a SQL query. A Polystore++ system can identify this mix and accelerate the load_data function using hardware accelerators.
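A minimal sketch of this SQL-in-ML pattern is shown below; the table, column names, and load_data itself are illustrative, not Snorkel's actual API:

```python
import sqlite3
import numpy as np

# Illustrative in-memory table standing in for a clinical database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vitals (age REAL, heart_rate REAL, labeled INT)")
conn.executemany("INSERT INTO vitals VALUES (?, ?, ?)",
                 [(64, 88, 0), (71, 102, 0), (55, 76, 1)])

def load_data(query):
    """Fetch training candidates via SQL, interspersed in ML code.
    A Polystore++ system could identify this relational fetch and
    offload it to a hardware accelerator."""
    return np.asarray(conn.execute(query).fetchall(), dtype=np.float32)

X = load_data("SELECT age, heart_rate FROM vitals WHERE labeled = 0")
weights = np.zeros(X.shape[1])   # placeholder model parameters
scores = X @ weights             # downstream ML computation on the relation
```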

Polystore++ systems deploy accelerators in standalone, coprocessor, or bump-in-the-wire modes [15]. In standalone mode, key functions of a Polystore++ system run entirely on the hardware, whereas in the coprocessor mode the system logic is distributed across the host CPUs and hardware accelerators. And, bump-in-the-wire accelerators sit between the data-processing engines and data stores. However, these accelerators introduce new challenges of how to (1) facilitate development of clearly-specified (or clarity-optimized) heterogeneous programs, (2) convert heterogeneous programs into unified representations, (3) identify patterns in workloads to accelerate, (4) optimize execution plans across heterogeneous computing units (CPU-based data-processing engines and hardware accelerators)—a multi-objective optimization, (5) generate optimized code specific to a computing target (FPGA and GPU), and (6) schedule workloads to exploit hardware parallelism and pipelining.

In this paper, we discuss how Polystore++ systems can achieve high performance at low power by identifying and offloading polystore system components that are amenable to acceleration using specialized hardware. We detail new research challenges that surface with the introduction of hardware accelerators in polystore systems and discuss possible approaches to address these challenges (§IV).

We organize the paper as follows: §II discusses the state-of-the-art in the areas of polystore systems and hardware accelerators. §III presents an architecture for Polystore++ systems and discusses opportunities for hardware acceleration. §IV highlights the new challenges and approaches introduced by accelerators and heterogeneous programs in polystore systems, ranging from workload optimization and scheduling to execution in Polystore++ systems.

II Related Work

II-A Polystore Systems

Tan et al. [16] present a taxonomy of systems characterized by the data stores and query interfaces that they support for workload processing. (1) Federated database systems such as Multibase [17] execute workloads on a set of relational data stores using only an SQL query interface. However, with the advent of complex analytic applications working on different data models (relational, key-value, graph, and tensor), these systems became obsolete. (2) Polyglot systems, such as Spark [18], Weld [19], and MyriaX [20], are examples of ‘one-size-fits-all’: they support multiple query interfaces and process heterogeneous workloads on homogeneous data stores. For example, GRfusion [21] is a system that processes multiple queries for graph and machine-learning workloads. Furthermore, polyglot systems reduce workload completion time by fusing operators [19] and executing them on a specialized data-processing engine. (3) Multistores process workloads across heterogeneous data stores using a single query interface. For example, HadoopDB [22] processes queries across both relational and Hadoop data stores; however, it limits users to specifying workloads using a single SQL-based interface. (4) Polystore systems combine the advantages of both polyglot and multistore systems by supporting multiple query interfaces to represent disparate workloads and their execution across heterogeneous data stores. They utilize the native operations of data-processing engines to execute workloads faster than previous systems [23].

Fig. 3: Snorkel: A deep-learning model pipeline using SQL calls in Python for labelling training data using weak supervision.

BigDAWG [6] is one of the first implementations of a polystore system, having a SQL-like interface extended with BigDAWG operators, such as ‘bdrel’ and ‘bdarray,’ to fetch data from relational and array stores. Myria [24] is a cloud service with an algebra-based optimizer for efficient execution on data federated across heterogeneous data stores. Myria uses Pipegen [25] to migrate data across data-processing engines to improve performance (§III). Rheem [26] is another polystore system that builds a cost model for each data-processing operator by collecting its resource utilization from the respective data-processing engine. Teradata Vantage [27] is a commercial polystore system that supports different paradigms for workload processing on native data-processing engines. It facilitates application development without burdening developers with manually scheduling tasks for data movement. However, users still have to explicitly transfer data from legacy database systems to the Teradata Vantage environment. Performance improvements in these polystore systems have been limited to CPU-based computations and cannot keep pace with increasingly demanding modern heterogeneous workloads.

II-B Hardware Accelerators

Compute acceleration is the use of specialized hardware to perform functions more efficiently than a general-purpose CPU [13]. These accelerators exploit hardware parallelism and pipelining to achieve high performance, and maintain low power by operating at lower clock frequencies and adding hardware support for a variety of operations.

Specialized hardware can accelerate polystore systems by leveraging multicore CPUs, graphics processing units (GPUs [28]), field-programmable gate arrays (FPGAs [29]), application-specific integrated circuits (ASICs [30]), or coarse-grained reconfigurable arrays (CGRAs [31]) such as Plasticine [32]. Multicore CPUs consume more power per task, with limited parallelism and inefficient data movement compared to well-matched applications running on accelerators. GPUs, on the other hand, perform wide SIMD computations using hundreds of cores. GPUs clock these cores at lower speeds compared to CPUs, but achieve higher throughput by running workloads across many such cores. GPUs primarily focus on operations that exploit SIMD parallelism, such as matrix multiplication, and more recently database applications. FPGAs, compared to CPUs and GPUs, consume less power and allow programming more generalized operations. FPGAs can be used as flexible accelerators for high-cost operations within an execution plan and are well suited for operations on streaming data. However, FPGAs are hard to program and can only support a limited number of complex operations. Developers program FPGAs using hardware description languages (Verilog or VHDL), with lengthy compile times to synthesize applications to an FPGA configuration (a network of look-up tables, LUTs), leading to large development overheads. CGRAs have short reconfiguration times as they are constructed from standard processing elements [31]. ASICs are fixed-function devices with pre-configured logic (data-processing operators). These devices cannot be reconfigured but achieve extremely high performance and efficiency for those operators.

A hardware accelerator can sit in the access path of data-processing operators (as in Netezza [33]). Or, it can run as a coprocessor, installed in PCIe slots of a data-processing server alongside the CPU to accelerate frequently occurring operations. For example, GPU-based accelerators can speed up Spark workloads (SparkGPU [34] and Flare [3]); FPGA-based devices can accelerate relational databases (Postgres), hybrid workloads (DANA: native SQL extended with ML operators [35]), and streaming workloads (Saber [36]). Furthermore, these devices can operate in standalone mode (like the Tensor Processing Unit [12] or Microsoft Brainwave [11]). An exhaustive survey of workload acceleration using these devices is presented in [37].

However, hardware acceleration trades off general-purpose compute to achieve high performance at low power, making these devices difficult to program. Each accelerator is programmed using its hardware-specific low-level language (Verilog), which requires a developer to have a deep understanding of the underlying hardware. Application- and hardware-specific domain-specific languages (DSLs), such as Spatial [38], Relay [39], and Delite [40], ease application development for specific accelerators by abstracting low-level details into high-level primitives (parallel patterns [32]) that a developer is familiar with. For example, Halide [41] and TensorFlow [42] are application-specific DSLs for image processing and deep-neural-network processing, respectively.

Furthermore, each hardware accelerator has a specific area and power profile. Its design determines the corresponding performance and power benefits. Altaf and Wood proposed LogCA [43] to model the performance of these accelerators based on their design and interface to the host system, using applications with different computational intensity (the amount of work done on a host/accelerator per byte of offloaded data). LogCA helps in designing accelerator architectures for memory-access-bound and compute-bound operators (or kernels).

III Architecture

A Polystore++ system (Figure 4) consists of the following components:

  • An expressive integrated development environment (EIDE) supporting a mix of programming paradigms, languages, and libraries for hardware accelerators. The EIDE facilitates application development using heterogeneous programming languages such as Python, SQL, Cypher, and Java, as well as natural-language interfaces. Users also declare the configuration of a Polystore++ system through the EIDE, such as deployment details of the different data stores and the architecture of hardware accelerators.

  • A compiler takes in the program description from the EIDE and performs static optimizations and allocations. A compiler’s frontend interfaces with the EIDE to generate an intermediate representation (IR), which is then optimized in the core, and finally the backend sends the program representation to the middleware.

  • The middleware, also termed the runtime, is responsible for the actual execution of a program. As such, it consists of both the execution engine and a runtime optimizer. Similar to the optimizations performed in the core of the compiler, the optimizer uses cost models and optimization rules to optimize the IR for heterogeneous execution across data-processing engines and hardware accelerators. Finally, the executor coordinates execution through the adapter and data migrator (DM) using the configuration parameters. These parameters specify hardware details (FPGA or GPU) of the cluster’s data-processing engines and accelerators, their respective host CPU architectures, platform deployment details (Spark, Neo4j, or Accumulo), and metadata (location, type, and schema).

  • An adapter to interface with each data-processing engine. It receives a piece of IR to transform and execute locally at the data-processing engine. Further, it collects performance metrics after the workload execution and sends them to the middleware’s optimizer.

  • A data migrator (DM) manages data movement across data-processing engines in response to instructions from the middleware.

To understand the working of a Polystore++ system, consider a query on the Admission and Patients tables in the MIMIC III dataset [44]. The query requests the admission history of a patient identified by ‘pid,’ sorted by admission date (‘Date’). The Admission and Patients tables are stored in databases DB1 and DB2, respectively. DB1 projects the table Admission on ‘pid,’ while DB2 projects the table Patients on ‘pid.’ After receiving DB2’s output through a data migrator, DB1 performs a sort-merge on ‘Date.’ A Polystore++ system can accelerate DB1’s sort operations as well as the data migration task from DB2 to DB1, pipelining them to reduce latency.
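The sketch below models this flow under simplifying assumptions: the schema is illustrative, two in-memory SQLite databases stand in for DB1 and DB2, and a client-side semi-join stands in for the data migrator and sort-merge:

```python
import sqlite3

db1 = sqlite3.connect(":memory:")   # stands in for DB1, holding Admission
db2 = sqlite3.connect(":memory:")   # stands in for DB2, holding Patients
db1.execute("CREATE TABLE Admission (pid INT, Date TEXT, ward TEXT)")
db2.execute("CREATE TABLE Patients (pid INT, name TEXT)")
db1.executemany("INSERT INTO Admission VALUES (?, ?, ?)",
                [(1, "2012-03-01", "ICU"), (1, "2011-07-19", "ER")])
db2.execute("INSERT INTO Patients VALUES (1, 'Jane Doe')")

# DB2 side: project Patients on pid; the result is then "migrated" to DB1.
patient_pids = {row[0] for row in db2.execute("SELECT pid FROM Patients")}

# DB1 side: project Admission on pid and sort on Date, merging with the
# migrated pids -- the sort and the migration are the acceleration targets.
history = [row for row in
           db1.execute("SELECT pid, Date FROM Admission ORDER BY Date")
           if row[0] in patient_pids]
print(history)   # patient 1's admissions in date order
```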

III-A Acceleration in Polystore++ Systems

We now discuss various acceleration opportunities in polystore systems.

Fig. 4: Polystore++ system’s physical layout: hardware accelerators for data-processing engines, adapters, and data migrators running on a pool of servers; a middleware that parses, optimizes, and manages execution of heterogeneous workloads; and an interface (EIDE) for specifying these workloads.

III-A1 Operator Execution

The data-processing engine of a polystore system translates an application program to an intermediate representation (IR) consisting of a sequence of operators: SQL queries map to projection, hash, sort, group-by, and join operators; HiveQL queries translate to a sequence of MapReduce operations; Cypher programs translate to a series of graph operations such as match, subtree, path, and join; and deep-learning algorithms are converted into GEMV (matrix-vector multiplication) and GEMM (matrix-matrix multiplication) operations for inference and training.
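A minimal, purely illustrative rendering of such a lowering (not any specific system's IR) is:

```python
# Each frontend maps its sub-program to a sequence of engine-level
# operators that the optimizer can inspect and assign to computing units.
lowered = {
    "SQL":    ["scan", "projection", "hash", "group_by", "join"],
    "HiveQL": ["map", "shuffle", "reduce"],
    "Cypher": ["match", "subtree", "path", "join"],
    "DL":     ["GEMM", "GEMV"],   # training and inference kernels
}
for frontend, ops in lowered.items():
    print(f"{frontend}: {' -> '.join(ops)}")
```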

Further, the Polystore++ system exploits the pipeline-execution behavior of operators to accelerate them using FPGAs (e.g., the bitonic sort algorithm is inherently pipelined [45]), or executes domain-specific operators exhibiting parallel execution, such as deep-learning operators, faster using custom hardware like the TPU [12].

III-A2 Data Access

A data-processing engine fetches data before launching data-processing operators. Different data-processing engines support a variety of data-access mechanisms depending on storage, location, size, and compression techniques. For example, relational databases employ sequential and index scans as data-access operations, while the Hadoop Distributed File System (HDFS) employs only sequential scans.

A Polystore++ system can stream the output of a sequential scan returning a large amount of data to an FPGA-based accelerator that filters and/or projects relevant columns and records, reducing the amount of data communicated to main memory for further processing. Also, the Polystore++ system can cache the results of index-seek operations to reduce data-access latency.
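A software model of this pushdown is sketched below; the generator stands in for the FPGA on the scan path, and the data is illustrative:

```python
def sequential_scan(rows):
    for row in rows:                  # streaming scan output
        yield row

def pushdown_filter_project(stream, predicate, columns):
    """Filter and project on the stream so that only relevant
    columns/records reach main memory for further processing."""
    for row in stream:
        if predicate(row):
            yield tuple(row[c] for c in columns)

table = [{"pid": 1, "hr": 88, "ward": "ICU"},
         {"pid": 2, "hr": 70, "ward": "ER"}]
reduced = pushdown_filter_project(sequential_scan(table),
                                  lambda r: r["ward"] == "ICU",
                                  ("pid", "hr"))
print(list(reduced))                  # only the filtered, projected data
```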

III-A3 Data Migration (DM)

The data-processing engines in polystore systems need a mechanism for sending data to each other. For example, most data-processing engines support data export in comma-separated-value (CSV) format for portability. So a naive approach to data migration is exporting data from the source data store to a CSV file, transferring the CSV file to the destination data store over a network, and importing the CSV file at the destination data store. This leads to multiple conversions to and from CSV. However, Pipegen [25] uses ‘network pipes’ to eliminate disk writes and serialization while migrating data across data stores. Pipegen transfers elements (4 ints, 3 doubles each), approximately 40 GB, in 35 minutes on Amazon’s m4.large instances, where most of the time is spent transforming the different data types into an optimized binary format.

A Polystore++ system can offload the computations and serialization algorithms of Pipegen’s network pipes to an accelerator to pipeline data transformation and network transfer. Furthermore, the system can harness Remote Direct Memory Access (RDMA) to transfer data from one server’s memory to another, bypassing the memory-copy overheads of the network protocol stack.
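The contrast between the two migration paths can be sketched in-process as follows; a real system would stream over a socket or RDMA rather than an in-memory buffer, and the record layout is illustrative:

```python
import csv
import io
import struct

rows = [(1, 2.5), (2, 3.75)]   # (int, double) records to migrate

# Naive path: export to CSV text, then re-parse at the destination.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
parsed = [(int(a), float(b))
          for a, b in csv.reader(io.StringIO(buf.getvalue()))]

# Pipe-style path: pack records into a fixed binary layout once; the
# receiver unpacks directly, with no text serialization to amortize.
packed = b"".join(struct.pack("<id", i, d) for i, d in rows)
size = struct.calcsize("<id")
unpacked = [struct.unpack("<id", packed[o:o + size])
            for o in range(0, len(packed), size)]
assert parsed == [tuple(r) for r in unpacked]
```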

III-A4 Adapter

An adapter co-locates with each data-processing engine to transform and execute a piece of IR on the local data-processing engine. The transformation process, which maps operators in the IR to a set of operators compliant with the local processing engine, comprises a fixed set of rules. A Polystore++ system can encode the data-flow graph of these rules in an accelerator to free host CPU cycles for local data processing.
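A toy version of such a rule set is shown below; the engine names and rewrites are illustrative placeholders, not any adapter's actual rules:

```python
# Fixed rule table mapping IR operators to engine-native operators.
RULES = {
    ("postgres", "projection"): "SELECT <cols>",
    ("postgres", "sort"):       "ORDER BY <key>",
    ("neo4j",    "match"):      "MATCH <pattern>",
    ("spark",    "group_by"):   "df.groupBy(<key>)",
}

def transform(engine, ir_ops):
    """Rewrite a piece of IR into operators compliant with the engine."""
    return [RULES[(engine, op)] for op in ir_ops]

print(transform("postgres", ["projection", "sort"]))
```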

IV Challenges and Approaches

The introduction of an accelerator in a system requires changes in all phases of application development, from the programming model to its implementation. In the following sections, we discuss the new research challenges in the Polystore++ system stack, arising primarily from the introduction of hardware accelerators.

IV-A Programming Environment (EIDE) Challenges

An EIDE facilitates writing clarity-optimized programs (in contrast to performance-optimized programs: heavy performance optimization quickly obfuscates the meaning of code, but is nearly always required to reach near-peak performance) using well-known programming languages such as Python for ML/DL, SQL for database processing, and Cypher for graph operations. For example, Figure 5 shows an annotated data-flow graph for a heterogeneous program. One goal of the EIDE is to allow using the same data across different data models, much as Julia [46] allows a mix of C and Python data types, or through a multi-paradigm tensor-based model for representing heterogeneous programs. As such, the EIDE’s purpose is to accurately capture the computations being done and expose semantic information to the optimizer and compiler. Furthermore, an EIDE provides a clarity-optimized programming interface while simultaneously enabling efficient execution. We note that there are several key challenges in the design of such a programming environment.

Fig. 5: Heterogeneous workload abstraction as an annotated data-flow graph. D1, D1’, D2, D2’, and D3 represent data sets input to each piece of source code. IR: intermediate representation of the workload in the data store. Dotted lines indicate data transfer and transformation across IR nodes.
a. What useful information can be extracted from each subprogram to execute it optimally

Heterogeneous programs by definition consist of a mix of programming paradigms and abstraction levels. In order to capture and later optimize these programs, such a framework should be able to losslessly capture semantic information from each sub-program. This information is likely to be used in scheduling, resource allocation, and operation selection. For example, information about numerical precision may guide the choice between different matrix-multiplication algorithms.
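As a small sketch of this idea, a precision annotation captured from a sub-program could select between operator implementations (the annotation scheme and implementations are illustrative):

```python
import numpy as np

def matmul(a, b, precision="high"):
    """Select an implementation from a captured precision annotation."""
    if precision == "low":
        # cheaper low-precision path a scheduler might map to an accelerator
        return (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)
    return a @ b                       # default high-precision path

a = np.random.rand(4, 4).astype(np.float32)
b = np.random.rand(4, 4).astype(np.float32)
approx, exact = matmul(a, b, "low"), matmul(a, b)
print(np.max(np.abs(approx - exact)))  # the precision/performance trade-off
```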

b. How to transform the same data across different data models

Models and storage formats differ across programming domains, and overheads incurred by data movement and transformation across domains can quickly exceed the benefits of acceleration. Related work in the area of communication-avoiding algorithms indicates that data movement can easily become a bottleneck across a distributed system if not properly optimized [47]. For example, a large model such as the Google Neural Machine Translation model [48] may ordinarily have a few gigabytes of weights when stored efficiently. If this data were stored in a textual format, the weights’ storage cost balloons into the terabyte range. Such an increase in required data movement would significantly impact the performance of the system as a whole.

c. How to design a multi-paradigm language

It appears unlikely that any particular set of domain-specific languages and base languages will suffice to cover computing needs in the long term. As a consequence, the EIDE must be able to accommodate novel languages and abstraction levels.

d. What functions should be accelerated

Trivially, everything that can be accelerated should be accelerated, but in practice identifying such opportunities can be difficult. With reconfigurable hardware, nearly everything can be accelerated to varying degrees of profitability; as a result, a Polystore++ system needs to solve the additional problem of area and bandwidth allocation on these accelerators. Additionally, performance characterizations on accelerators such as FPGAs tend to be inaccurate without repeated synthesis, a process which takes hours to days per run for non-trivial designs.

e. How should a natural-language query be compiled to a semantically equivalent heterogeneous program

Yaghmazadeh et al. [49] and Jagdish et al. [50] have approached the problem of translating a natural-language query to a SQL query. Virtual assistants such as Almond [51] convert natural language commands into programs. These assistants may be extended to incorporate work such as AutoML [52] to automatically generate machine learning models for more complex queries.

IV-B Compilation Challenges

We can separate the structure of a traditional compiler into three components: frontend, core, and backend. In a heterogeneous system, their tasks are adapted for inter-operation of both frontend and backend components. The Intermediate Representation (IR) is a data structure which represents the program as a whole.

IV-B1 IR Design

One of the goals in designing an IR is to capture the semantics of data processing on different engines and hardware accelerators; the IR should also be extensible to incorporate the semantics of new compute engines.

A program on data-processing engines is more like a control-flow graph, whereas a hardware accelerator works on a data-flow graph. One approach is a hierarchical IR consisting of control nodes, where each control node may have a data-flow graph for an operator, as shown in Figure 5. The program inside a control node may be converted into a set of available hardware domain-specific operators for generating target-specific code. Figure 7 shows an OptiML example of a short k-means application translated from a longer TensorFlow version, which can be optimized for multi-core CPU, GPU, or FPGA accelerators [38, 40].
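A toy rendering of such a hierarchical IR, with illustrative node and operator names, might look like:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataflowOp:
    name: str                          # e.g. "distance", "argmin", "GEMM"
    inputs: List[str] = field(default_factory=list)

@dataclass
class ControlNode:
    label: str                         # orders the computation
    dataflow: List[DataflowOp] = field(default_factory=list)

# k-means as control nodes, each owning a data-flow graph whose operators
# can later be lowered to engine- or accelerator-specific code.
kmeans = [
    ControlNode("assign", [DataflowOp("distance", ["points", "centers"]),
                           DataflowOp("argmin", ["distance"])]),
    ControlNode("update", [DataflowOp("group_mean", ["points", "assign"])]),
]
```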

Fig. 6: Levels of optimization for a heterogeneous workload – L1: pieces of the program mapped into DSLs; L2 and L3: the IR and the lowest level of source code, respectively, at individual data-processing engines.

IV-B2 Frontend

A typical compiler frontend consists of a parser, followed by a variety of checks. These may include type checking and static assertions, as well as elaborating the code. For example, the C frontend handles macro expansions and includes. In a heterogeneous compiler for Polystore++ systems, the frontend faces the task of constructing a compute graph from a variety of sub-programs. Depending on the construction of the compiler, the frontend must still perform inter-subprogram checks, but can delegate subprogram-local checks to their respective frontends.

IV-B3 Core

The core of a compiler is where the most interesting transformations and optimizations are made. Ideally, such a core can reuse optimizations across a variety of applications. In order to do so, it must be able to express these optimizations in a domain-agnostic manner in a Polystore++ environment. Unlike lower-level optimizations such as loop transforms and strength reductions, these high-level optimizations may require semantic information. These high-level transformations cover the L1 optimizations in Figure 6. For example, when computing the eigenvalues of a matrix, either the Lanczos or the Householder tri-diagonalization method may be used. These algorithms create different resultant matrices, but both are valid for computing eigenvalues.

While instruction selection and scheduling are part of a traditional backend’s responsibilities, they must be moved to the core due to their interaction with resource and task allocation. In a Polystore++ system, the core must decide where each task should be assigned, and which resources may be attached (storage or network bandwidth). Resource allocation is particularly important as a transformation’s benefits are intimately tied with the characteristics of the executing platform.

IV-B4 Backend

The role of the backend in a heterogeneous system is the least changed compared to a traditional compiler. In a heterogeneous system, the backend is responsible for generating the code corresponding to each computing unit involved. Each constituent computing unit is then responsible for providing component-specific optimizations, such as fine-grained tiling, pipelining, and other local tasks in an FPGA. In a Polystore++ system, the high-level decisions made by different constituent components become L2 optimizations in Figure 6. Finally, L3 optimizations are the implementation-level transformations for each component, such as those in the underlying C/C++ compiler for a database engine. We note here that while the Roofline Model [53] is used as a method of measuring performance on CPUs and can be extended to fixed hardware, obtaining accurate and simple analytical models of theoretical performance on reconfigurable hardware such as FPGAs or CGRAs is substantially more difficult. One efficient technique can be found in Koeplinger et al. [54], which uses repeated sampling to build an empirical performance model.

IV-C Optimization Challenges

Optimization challenges arise in both the middleware and the compiler. In the middleware, these tasks are delegated to the optimizer, which minimizes the total execution time of a program while optimizing the number and size of data movements and the cost of operator execution across data stores [55]. In the compiler for a Polystore++ system, such parameter selection is termed tuning or auto-tuning, and naturally arises from parallelism factors, numerical precision, and resource allocation.

Fig. 7: An example of translating a TensorFlow k-means application to OptiML [56], a domain-specific language for hardware accelerators.

Mathematically, in the mono-objective formulation, this is the problem of finding a global minimizer x* of an unknown (black-box) objective function f:

x* = argmin_{x ∈ X} f(x)    (1)

where X is the input design space of a Polystore++ system, which includes the configuration of heterogeneous computing units and the design parameters of hardware accelerators. The problem addressed in this paper is the optimization of a deterministic black-box function f, a computer program, over a domain of interest that includes lower and upper bound constraints on the problem variables. The design space of the optimization in Polystore++ includes discrete variables, either categorical (boolean) or ordinal (choice of on-chip memory sizes); these types of variables preclude the computation of derivatives (derivatives are not defined on discrete variables) and, by consequence, the use of standard gradient-based optimization procedures. Hence, we assume in Polystore++ that the derivative of f is neither symbolically nor numerically available. This problem is referred to in the mathematical optimization literature as black-box optimization [57] and, in the computer systems community, as design space exploration (DSE) [58, 59].

HyperMapper 2.0 [60, 61, 62] is designed for heterogeneous workloads and can handle complex design spaces consisting of multiple objectives, categorical/ordinal variables, unknown feasibility constraints [63], and exploitation of performance profiles from earlier workload executions in Polystore++.

IV-C1 Multi-Objective Optimization

One approach to achieving multi-objective optimization is to build cost models for each of the heterogeneous computing units. A cost model could be a performance-prediction model (also known as a surrogate model) that estimates the performance efficiency of a workload on different platforms for scheduling. Luo et al. [64] present power-performance characteristics of analytic queries on database systems, and Singhal et al. [65, 66, 67] and Sangroya and Singhal [68] present similar cost models for relational database, Hive, Hadoop, and Spark platforms deployed on conventional CPU systems. However, none of these include a design space for hardware accelerators.

The design space is broad: the input data size could be any number of items; the data-processing engine could be Spark, Hadoop, Postgres, or others; the hardware accelerator could be an FPGA, GPU, or CPU-MPI; and hardware details include memory size, FPGA LUT counts, CGRA layout, or GPU memory and its connector.

The optimizer thus has a large design space to explore for a Polystore++ system, and building cost models with a large number of parameters may be expensive. Alternatively, active learning [60, 61, 62] can trade off exploration and exploitation to give an approximately optimal configuration for workload execution in Polystore++ systems.

Active learning is a paradigm in supervised machine learning that uses fewer training examples to achieve better optimization by iteratively training a predictor and using it, in each iteration, to choose the training examples that increase the chances of finding better configurations while also improving the accuracy of the prediction model. The optimization results are thus incrementally improved by interleaving exploration and exploitation steps. One can use randomized decision forests [69] as the base predictors, created from a number of points sampled from the parameter space, that is, from different configurations of the Polystore++ design space.

The application is evaluated on the sampled points, yielding the labels of the supervised setting given by the multiple objectives, which can be obtained from historical executions of workloads on Polystore++. Since the goal is to accurately estimate the points near the Pareto-optimal front, the current predictor provides performance values over the parameter space and thus estimates the Pareto front. For the next iteration, only parameter points near the predicted Pareto front are sampled and evaluated, and subsequently used to train new predictors using the entire collection of training points from the current and all previous iterations. Evaluating a sample point means actually executing the workload in the given configuration; the execution may be sub-optimal, but it helps learn the cost model for future similar workloads. This process is repeated over a number of iterations, forming the active-learning loop. Bodin et al. [61] and Nardi et al. [62] indicate that this guided method of searching for highly informative parameter points yields superior predictors compared to a baseline that uses randomly sampled points alone. By iterating this process several times in the active-learning loop, we can discover high-quality design configurations that lead to good performance outcomes. Figure 8 shows the high-level active-learning DSE algorithm.
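A minimal single-objective sketch of this loop follows, using a random-forest surrogate (scikit-learn assumed available); evaluate(), the toy design space, and the cost numbers are illustrative stand-ins for real workload executions, and HyperMapper handles the full multi-objective, Pareto-front setting described above:

```python
import random
from sklearn.ensemble import RandomForestRegressor

ENGINES = ("cpu", "fpga", "gpu")

def evaluate(engine, bufs):
    """Stand-in for actually executing the workload in a configuration."""
    return {"cpu": 1.0, "fpga": 0.4, "gpu": 0.6}[engine] * 100 / bufs

def encode(engine, bufs):
    return [ENGINES.index(engine), bufs]   # numeric features for the forest

space = [(e, b) for e in ENGINES for b in (1, 2, 4, 8)]
random.seed(0)
sampled = random.sample(space, 4)          # bootstrap with random points
X = [encode(*c) for c in sampled]
y = [evaluate(*c) for c in sampled]

for _ in range(4):                         # active-learning iterations
    surrogate = RandomForestRegressor(n_estimators=30, random_state=0)
    surrogate.fit(X, y)                    # train predictor on all evaluations
    preds = surrogate.predict([encode(*c) for c in space])
    # exploit: evaluate the unseen configuration the surrogate likes best
    nxt = min((i for i, c in enumerate(space) if c not in sampled),
              key=lambda i: preds[i])
    sampled.append(space[nxt])
    X.append(encode(*space[nxt]))
    y.append(evaluate(*space[nxt]))

print("best configuration found:", sampled[y.index(min(y))], min(y))
```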

Fig. 8: Active learning is a more efficient method of searching the design space by selecting evaluation points which are likely to reveal information about good configurations. In a multi-objective setting, the system attempts to learn the Pareto curve, a generalized notion of optimality.

IV-D Execution Challenges

The goal of the executor is to schedule the optimized IR and coordinate its execution across heterogeneous computing units. The optimized IR may be considered a sequence of stages (as in Spark), where each stage may have heterogeneous tasks to execute in parallel; a task could be an operator on a data-processing engine or a hardware accelerator, or a data movement. The whole workload execution can be perceived as a pipeline of stage executions for maximum throughput. An operator may be implemented as a stream, where data may arrive from local storage or a remote platform through the accelerator.
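As a rough illustration of pipelined stage execution, the sketch below streams batches through two stages so that a downstream task starts as soon as the first batch arrives; Python threads and queues stand in for engines, accelerators, and data movement:

```python
from concurrent.futures import ThreadPoolExecutor
from queue import Queue

def stage(fn, inq, outq):
    """Run one pipeline stage until end-of-stream (None) arrives."""
    while (item := inq.get()) is not None:
        outq.put(fn(item))
    outq.put(None)                     # propagate end-of-stream downstream

q1, q2, q3 = Queue(), Queue(), Queue()
with ThreadPoolExecutor(max_workers=2) as pool:
    pool.submit(stage, lambda b: [x * 2 for x in b], q1, q2)  # e.g. scan+map
    pool.submit(stage, sum, q2, q3)                           # e.g. aggregate
    for batch in ([1, 2], [3, 4], [5]):   # batches stream through the stages
        q1.put(batch)
    q1.put(None)
    results = []
    while (r := q3.get()) is not None:
        results.append(r)
print(results)                            # [6, 14, 10]: per-batch aggregates
```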

The research challenges for the execution are as follows:

a. IR mapping to local accelerators and kernels

At runtime a variety of candidates may be available, and the selection will ultimately depend on a combination of the runtime environment and data-dependent analyses.

b. Coordination and scheduling across heterogeneous units

The field of scheduling is well studied, but is a difficult engineering challenge, especially due to the complexity of interfacing with accelerators.

c. Runtime acceleration

Hardware acceleration motivates the design and development of data-flow algorithms for the adapter and the DM for a set of data-processing engines. Customized hardware can also be explored for accelerating data migration, since it is a prevalent operation in data centers.

d. Runtime statistics

Runtime metrics are crucial for optimization, but hardware accelerators do not provide such utilities. Currently, such metrics may instead be gathered through low-level simulators, which run orders of magnitude slower than actual hardware.

V Conclusion

Modern real-time analytic applications are represented using a mix of heterogeneous programming paradigms (referred to as heterogeneous workloads) spanning heterogeneous data-processing engines that work with different data models. Polystore systems suit such applications, but with the decline of Moore’s Law and Dennard Scaling, current and future workloads will require hardware accelerators to meet performance requirements.

We present Polystore++, an architecture to accelerate polystore systems using specialized hardware to achieve high performance at low power. This architecture highlights various open and novel challenges, which may guide future work in the area and provide end-to-end performance and power improvements. We note that these challenges span beyond just the database world and also encompass the compiler, optimization, and hardware communities.

Acknowledgments

The authors wish to thank Jeffrey D. Ullman and Tian Zhao for helpful early discussions; and the anonymous ICDCS reviewers for their feedback on earlier versions of this paper. This work is based on research sponsored by Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) under agreement number FA8650-18-2-7865. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory (AFRL) and Defense Advanced Research Projects Agency (DARPA) or the U.S. Government. This research is also supported in part by Tata Consultancy Services, affiliate members and supporters of the Stanford DAWN project: Ant Financial, Facebook, Google, Infosys, Intel, Microsoft, NEC, Teradata, SAP, and VMware.

References

  • [1] F. Tang, J. Zhou, C. Xiao, and F. Wang, “Predictive Modeling in Urgent Care: A Comparative Study of Machine Learning Approaches,” JAMIA OPEN, vol. 1, no. 1, pp. 87–98, 06 2018. [Online]. Available: https://dx.doi.org/10.1093/jamiaopen/ooy011
  • [2] [Online]. Available: https://en.wikipedia.org/wiki/Hardware_acceleration
  • [3] G. Essertel, R. Tahboub, J. Decker, K. Brown, K. Olukotun, and T. Rompf, “Flare: Optimizing Apache Spark with Native Compilation for Scale-Up Architectures and Medium-Size Data,” in USENIX OSDI, 2018.
  • [4] K. Douglas and S. Douglas, PostgreSQL.   Thousand Oaks, CA, USA: New Riders Publishing, 2003.
  • [5] M. O. M.J. Slabber, F.J. Joubert, “Scalable Time Series Documents Store,” ICALEPCS, 2017.
  • [6] J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. Zdonik, “The BigDAWG Polystore System,” SIGMOD Record, vol. 44, no. 2, pp. 11–16, Aug. 2015. [Online]. Available: http://doi.acm.org/10.1145/2814710.2814713
  • [7] M. Stonebraker, P. Brown, J. Becla, and D. Zhang, “SciDB: A Database Management System for Applications with Complex Analytics,” Computing in Science and Engineering, vol. 15, no. 3, pp. 54–62, May 2013. [Online]. Available: http://dx.doi.org/10.1109/MCSE.2013.19
  • [8] M. Lal, Neo4J Graph Data Modeling.   Packt Publishing, 2015.
  • [9] R. R. Schaller, “Moore’s Law: Past, Present and Future,” IEEE Spectrum, vol. 34, no. 6, pp. 52–59, 1997.
  • [10] M. Bohr, “A 30 year retrospective on Dennard’s MOSFET scaling paper,” IEEE Solid-State Circuits Society Newsletter, vol. 12, no. 1, pp. 11–13, 2007.
  • [11] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services,” in ACM/IEEE ISCA, 2014.
  • [12] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in ACM/IEEE ISCA, 2017.
  • [13] [Online]. Available: https://en.wikipedia.org/wiki/Hardware_acceleration
  • [14] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Ré, “Snorkel: Fast Training Set Generation for Information Extraction,” in ACM SIGMOD, 2017.
  • [15] M. Najafi, K. Zhang, M. Sadoghi, and H. Jacobsen, “Hardware Acceleration Landscape for Distributed Real-Time Analytics: Virtues and Limitations,” in IEEE ICDCS, 2017.
  • [16] R. Tan, R. Chirkova, V. Gadepally, and T. G. Mattson, “Enabling Query Processing Across Heterogeneous Data Models: A Survey,” IEEE Big Data, pp. 3211–3220, 2017.
  • [17] J. M. Smith, P. A. Bernstein, U. Dayal, N. Goodman, T. Landers, K. W. T. Lin, and E. Wong, “Multibase: Integrating Heterogeneous Distributed Database Systems,” in ACM AFIPS, 1981.
  • [18] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica, “Apache Spark: A Unified Engine for Big Data Processing,” Communications ACM, vol. 59, no. 11, pp. 56–65, Oct. 2016. [Online]. Available: http://doi.acm.org/10.1145/2934664
  • [19] S. Palkar, J. J. Thomas, A. Shanbhag, D. Narayanan, H. Pirk, M. Schwarzkopf, S. Amarasinghe, and M. Zaharia, “Weld: A Common Runtime for High Performance Data Analytics,” 2017.
  • [20] D. Halperin, V. Teixeira de Almeida, L. L. Choo, S. Chu, P. Koutris, D. Moritz, J. Ortiz, V. Ruamviboonsuk, J. Wang, A. Whitaker, S. Xu, M. Balazinska, B. Howe, and D. Suciu, “Demonstration of the Myria Big Data Management Service,” in ACM SIGMOD, 2014.
  • [21] M. S. Hassan, T. Kuznetsova, H. C. Jeong, W. G. Aref, and M. Sadoghi, “Empowering In-Memory Relational Database Engines with Native Graph Processing,” CoRR, vol. abs/1709.06715, 2017.
  • [22] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, “HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads,” VLDB Endowment, vol. 2, no. 1, pp. 922–933, Aug. 2009. [Online]. Available: https://doi.org/10.14778/1687627.1687731
  • [23] V. Gadepally, P. Chen, J. Duggan, A. Elmore, B. Haynes, J. Kepner, S. Madden, T. Mattson, and M. Stonebraker, “The BigDAWG Polystore System and Architecture,” in IEEE HPEC, 2016.
  • [24] J. Wang, T. Baker, M. Balazinska, D. Halperin, B. Haynes, B. Howe, D. Hutchison, S. Jain, R. Maas, P. Mehta, D. Moritz, B. Myers, J. Ortiz, D. Suciu, A. Whitaker, and S. Xu, “The Myria Big Data Management and Analytics System and Cloud Services,” in CIDR, 2017.
  • [25] B. Haynes, A. Cheung, and M. Balazinska, “PipeGen: Data Pipe Generator for Hybrid Analytics,” in ACM SoCC, 2016.
  • [26] D. Agrawal, S. Chawla, B. Contreras-Rojas, A. Elmagarmid, Y. Idris, Z. Kaoudi, S. Kruse, J. Lucas, E. Mansour, M. Ouzzani, P. Papotti, J.-A. Quiané-Ruiz, N. Tang, S. Thirumuruganathan, and A. Troudi, “RHEEM: Enabling Cross-platform Data Processing: May the Big Data Be with You!” VLDB Endowment, 2018.
  • [27] [Online]. Available: https://www.teradata.com/Products/Software/Vantage
  • [28] R. Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics.   Pearson Higher Education, 2004.
  • [29] I. Kuon, R. Tessier, and J. Rose, “FPGA Architecture: Survey and Challenges,” Foundations and Trends in Electronic Design Automation, vol. 2, no. 2, pp. 135–253, Feb. 2008. [Online]. Available: http://dx.doi.org/10.1561/1000000005
  • [30] I. Kuon and J. Rose, Quantifying and Exploring the Gap Between FPGAs and ASICs, 1st ed.   Springer Publishing Company, Incorporated, 2009.
  • [31] Y. Park, H. Park, and S. Mahlke, “CGRA Express: Accelerating Execution Using Dynamic Operation Fusion,” in ACM CASES, 2009.
  • [32] R. Prabhakar, Y. Zhang, D. Koeplinger, M. Feldman, T. Zhao, S. Hadjis, A. Pedram, C. Kozyrakis, and K. Olukotun, “Plasticine: A Reconfigurable Architecture For Parallel Patterns,” in ISCA, 2017.
  • [33] M. Singh and B. Leonhardi, “Introduction to the IBM Netezza Warehouse Appliance,” in CASCON, 2011.
  • [34] [Online]. Available: https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
  • [35] D. Mahajan, J. K. Kim, J. Sacks, A. Ardalan, A. Kumar, and H. Esmaeilzadeh, “In-RDBMS Hardware Acceleration of Advanced Analytics,” VLDB Endowment, vol. 11, no. 11, pp. 1317–1331, Jul. 2018. [Online]. Available: https://doi.org/10.14778/3236187.3236188
  • [36] A. Koliousis, M. Weidlich, R. Castro Fernandez, A. L. Wolf, P. Costa, and P. Pietzuch, “SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures,” in ACM SIGMOD, 2016.
  • [37] J. L. Bordim, Y. Ito, and K. Nakano, “Accelerating the CKY Parsing Using FPGAs,” in Springer-Verlag HiPC ’02, 2002.
  • [38] D. Koeplinger, M. Feldman, R. Prabhakar, Y. Zhang, S. Hadjis, R. Fiszel, T. Zhao, L. Nardi, A. Pedram, C. Kozyrakis, and K. Olukotun, “Spatial: A Language and Compiler for Application Accelerators,” in ACM PLDI, 2018.
  • [39] J. Roesch, S. Lyubomirsky, L. Weber, J. Pollock, M. Kirisame, T. Chen, and Z. Tatlock, “Relay: A New IR for Machine Learning Frameworks,” in ACM MAPL, 2018.
  • [40] [Online]. Available: https://stanford-ppl.github.io/Delite
  • [41] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, “Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines,” in ACM PLDI, 2013.
  • [42] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A System for Large-scale Machine Learning,” in USENIX OSDI, 2016.
  • [43] M. S. B. Altaf and D. A. Wood, “LogCA: A High-Level Performance Model for Hardware Accelerators,” in ACM ISCA, 2017.
  • [44] J. Alistair E.W., P. Tom J., S. Lu, L. Li-wei H., F. Mengling, G. Mohammad, M. Benjamin, S. Peter, A. C. Leo, and M. Roger G., “MIMIC-III, A Freely Accessible Critical Care Database,” Scientific Data, vol. 3, 2016.
  • [45] A. Srivastava, R. Chen, V. K. Prasanna, and C. Chelmis, “A Hybrid Design for High Performance Large-scale Sorting on FPGA,” International Conference on ReConFigurable Computing and FPGAs (ReConFig), pp. 1–6, 2015.
  • [46] [Online]. Available: https://en.wikipedia.org/wiki/Julia_programming_language
  • [47] J. Demmel, M. Hoemmen, M. Mohiyuddin, and K. Yelick, “Avoiding Communication in Sparse Matrix Computations,” in IEEE SPDP, 2008.
  • [48] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation,” CoRR, vol. abs/1609.08144, 2016.
  • [49] N. Yaghmazadeh, Y. Wang, I. Dillig, and T. Dillig, “SQLizer: Query Synthesis from Natural Language,” ACM Programming Languages, vol. 1, no. OOPSLA, pp. 63:1–63:26, Oct. 2017. [Online]. Available: http://doi.acm.org/10.1145/3133887
  • [50] F. Li and V. Jagdish, “Querying RDBMS Using Natural Language,” Ph.D. dissertation, 2017.
  • [51] [Online]. Available: https://almond.stanford.edu/
  • [52] F. Hutter, L. Kotthoff, and J. Vanschoren, Eds., Automatic Machine Learning: Methods, Systems, Challenges.   Springer, 2018.
  • [53] S. Williams, A. Waterman, and D. Patterson, “Roofline: An Insightful Visual Performance Model for Multicore Architectures,” Communications ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009. [Online]. Available: http://doi.acm.org/10.1145/1498765.1498785
  • [54] D. Koeplinger, R. Prabhakar, Y. Zhang, C. Delimitrou, C. Kozyrakis, and K. Olukotun, “Automatic Generation of Efficient Accelerators for Reconfigurable Hardware,” in ACM ISCA, 2016.
  • [55] S. Kruse, Z. Kaoudi, J.-A. Quiané-Ruiz, S. Chawla, F. Naumann, and B. Contreras, “RHEEMix in the Data Jungle – A Cross-Platform Query Optimizer,” CoRR, vol. abs/1805.03533, 2018.
  • [56] A. K. Sujeeth, H. Lee, K. J. Brown, H. Chafi, M. Wu, A. R. Atreya, K. Olukotun, T. Rompf, and M. Odersky, “OptiML: An Implicitly Parallel Domain-specific Language for Machine Learning,” in ICML, 2011.
  • [57] P. Feliot, J. Bect, and E. Vazquez, “A Bayesian Approach to Constrained Single- and Multi-objective Optimization,” Journal of Global Optimization, vol. 67, no. 1-2, pp. 97–133, 2017.
  • [58] E. Ïpek, S. A. McKee, R. Caruana, B. R. de Supinski, and M. Schulz, Efficiently Exploring Architectural Design Spaces via Predictive Modeling.   ACM, 2006, vol. 41.
  • [59] E. Kang, E. Jackson, and W. Schulte, “An Approach for Effective Design Space Exploration,” in Monterey Workshop.   Springer, 2010, pp. 33–54.
  • [60] L. Nardi, D. Koeplinger, and K. Olukotun, “Practical design space exploration,” arXiv preprint arXiv:1810.05236, 2018.
  • [61] B. Bodin, L. Nardi, M. Z. Zia, H. Wagstaff, G. Sreekar Shenoy, M. Emani, J. Mawer, C. Kotselidis, A. Nisbet, M. Lujan et al., “Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding,” in ACM PACT, 2016.
  • [62] L. Nardi, B. Bodin, S. Saeedi, E. Vespa, A. J. Davison, and P. H. Kelly, “Algorithmic Performance-accuracy Trade-off in 3D Vision Applications using Hypermapper,” in IEEE IPDPSW, 2017.
  • [63] M. A. Gelbart, J. Snoek, and R. P. Adams, “Bayesian Optimization with Unknown Constraints,” arXiv preprint arXiv:1403.5607, 2014.
  • [64] B. Luo, Y. Hayamizu, K. Goda, and M. Kitsuregawa, “Modeling Query Energy Costs in Analytical Database Systems with Processor Speed Scaling,” in Database and Expert Systems Applications, S. Hartmann, H. Ma, A. Hameurlain, G. Pernul, and R. R. Wagner, Eds.   Springer International Publishing, 2018, pp. 310–317.
  • [65] R. Singhal and M. K. Nambiar, “Predicting SQL Query Execution Time for Large Data Volume,” in IDEAS, 2016.
  • [66] R. Singhal and A. Verma, “Predicting Job Completion Time in Heterogeneous MapReduce Environments,” in IEEE IPDPS, 2016.
  • [67] R. Singhal and P. Singh, “Performance Assurance Model for Applications on SPARK Platform,” in Performance Evaluation and Benchmarking for the Analytics Era - 9th TPC Technology Conference, 2017.
  • [68] A. Sangroya and R. Singhal, “Performance Assurance Model for HiveQL on Large Data Volume,” in IEEE HiPC, 2015.
  • [69] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.