
Query Processing on Tensor Computation Runtimes

The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now offered by major cloud vendors. By hiding the low-level complexity through a tensor-based interface, tensor computation runtimes (TCRs) such as PyTorch allow data scientists to efficiently exploit the exciting capabilities offered by the new hardware. In this paper, we explore how databases can ride the wave of innovation happening in the AI space. We design, build, and evaluate Tensor Query Processor (TQP): TQP transforms SQL queries into tensor programs and executes them on TCRs. TQP is able to run the full TPC-H benchmark by implementing novel algorithms for relational operators on the tensor routines. At the same time, TQP can support various hardware while only requiring a fraction of the usual development effort. Experiments show that TQP can improve query execution time by up to 10× over specialized CPU- and GPU-only systems. Finally, TQP can accelerate queries mixing ML predictions and SQL end-to-end, and deliver up to 9× speedup over CPU baselines.



1. Introduction

DBMS vendors have delivered constant performance improvement for decades by evolving software to keep up with Moore’s law while influencing hardware development through close relationships with manufacturers. While data volumes and demand for analytics are growing faster than ever (Statista, 2022a), CPU performance improvement has slowed down (Theis and Wong, 2017). However, processor transistor count has continued to grow over the last decade, as hardware manufacturers first adopted multi-core CPU architectures and then augmented their computing platforms with specialized components such as GPUs, FPGAs, compression and encryption chips, DSPs, and neural-network (NN) accelerators. Although DBMS builders have taken advantage of multi-core and SIMD instructions effectively (Zhou and Ross, 2002; Polychroniou et al., 2015; Kim et al., 2009), the explosion in the number of specialized hardware components, each with different characteristics and programming abstractions, makes it challenging to support all the capabilities that these new powerful devices can offer.

On the other hand, the huge demand for computation in artificial intelligence (AI) (A. Gholami and Keutzer, 2021), combined with the market fever for AI, is driving unparalleled investments in new hardware and software for AI. Hardware makers (e.g., Intel (Habana, 2022), Apple (Apple, 2022a), Xilinx (Xilinx, 2022), AMD (AMD, 2022)), cloud vendors (e.g., Amazon (AWS, 2022), Microsoft (Chung et al., 2018), Google (Jouppi et al., 2017)), startups (e.g., Graphcore (gra, 2020), Sambanova (sam, 2020), Cerebras (cer, 2020a)), and car companies like Tesla (Tesla, 2022) are investing heavily in this space. Venture capital firms alone are pouring nearly $2B a quarter into special hardware for AI, aiming for a market expected to exceed $200B a year by 2025 (Statista, 2022b). On the software side, companies and open source communities are rallying behind a small number of big efforts (e.g., PyTorch (pyt, 2020), TensorFlow (Abadi et al., 2016), TVM (Chen et al., 2018)). The combination of investments in specialized hardware and large communities with a focus on performance is allowing these efforts to thrive. Our realization is that the ML community has made hardware accelerators accessible to nonspecialists (e.g., data scientists). The fact that the most popular ML frameworks are open-source creates a virtuous cycle whereby any hardware vendor interested in the ML space must support these frameworks well to get adoption. At the same time, these large open source communities successfully tackle the labor-intensive problem of providing specialized kernels for various hardware, e.g., a month after Apple M1 was announced, TVM outperformed Apple’s CoreML by 2× (team, 2022). Hardware vendors can directly improve the performance of the kernels, or improve the hardware itself (pyt, 2022a, b, c). This further helps adoption since performance improves with each new software and hardware release.

We argue that the best path forward for analytical DBMSs is to embrace this tectonic shift, and take advantage of the groundswell of new hardware and software targeting AI workloads. To demonstrate the viability of this idea, we propose and prototype a new query processor which runs SQL queries atop tensor computation runtimes (TCRs) such as PyTorch, TVM, and ONNX Runtime (onn, 2022). We call our prototype Tensor Query Processor (TQP). TQP transforms a SQL query into a tensor program, and executes it on TCRs. To our knowledge, TQP is the first query processor built atop TCRs. Careful architectural and algorithmic design enables TQP to: (1) deliver significant performance improvements over popular CPU-based data systems, and match or outperform custom-built solutions for GPUs; (2) demonstrate portability across a wide range of target hardware and software platforms; and (3) achieve all the above with parsimonious and sustainable engineering effort.

The above might appear surprising, as specialized hardware accelerators are notoriously hard to program and require extensive customization to extract the best performance. Furthermore, their programming abstractions differ sufficiently to make our goals of performance (G1), portability (G2), and parsimonious engineering effort (G3) seemingly hard to reconcile. However, the key is a compilation layer and a set of novel algorithms, which can map the classical database abstraction to the prevalent one in machine learning (ML), i.e., mapping relational algebra to tensor computation. This allows us to free-ride on existing labor-intensive efforts from the ML community to port and optimize TCRs across all the new specialized hardware platforms. The initial performance results are encouraging: TQP is able to outperform open-source GPU databases in terms of query execution time. On CPU, TQP outperforms Spark (Zaharia et al., 2012), while it is comparable to a state-of-the-art vectorized engine, DuckDB (Raasveldt and Mühleisen, 2020), for several queries. When ML and SQL queries are used in concert, TQP is able to provide end-to-end acceleration, delivering up to 9× speedup over CPU baselines.

Pursuing our goals of portability and parsimonious engineering, we make a deliberate decision to target existing tensor APIs rather than customize lower-level operators. This decision potentially leaves some performance on the table, but leads to a very sustainable long-term play, as TQP benefits from any performance enhancement and optimization added to the underlying software and hardware (e.g., (pyt, 2022a)). To validate this proposition, we run TQP on several different hardware settings: from CPUs, to discrete GPUs, to integrated GPUs (Intel and AMD), to NN-accelerators (TPUs (Jouppi et al., 2017)), and web browsers. Furthermore, TQP is able to run the full TPC-H benchmark on both CPU and GPU with just around 8,000 lines of code—this is quite an achievement considering that until 2021 no GPU database was able to run all the 22 TPC-H queries (Lee et al., 2021).

Contributions. This paper makes the following core contributions:

  • We show that the tensor interface of TCRs is expressive enough to support all common relational operators.

  • We propose a collection of algorithms and a compiler stack for transforming relational operators into tensor computation.

  • We evaluate the Tensor Query Processor approach extensively against state-of-the-art baselines on the TPC-H benchmark.

Organization. Section 2 introduces some background on TCRs. Section 3 summarizes the challenges and the design choices we make. Section 4 introduces TQP, and Section 5 describes the algorithms used to implement several key relational operators with tensor programs. Experiments are in Section 6. Related works are in Section 7. The paper is concluded by Section 8.

2. Background

In this section, we briefly summarize the system support for tensor computation (Section 2.1), and provide a taxonomy of the tensor operations used throughout the paper (Section 2.2).

2.1. Tensor Computation Runtimes (TCRs)

Recent years have witnessed an increase in the popularity of ML models based on NNs (Goodfellow et al., 2016). While in the early days these models were implemented manually in C++, data scientists can now take advantage of several open-source ML frameworks that simplify the authoring and deployment of NN models. TensorFlow (ten, 2018) and PyTorch (Paszke et al., 2017) are considered the most popular of such frameworks.

ML frameworks follow a common architecture: at the top, they have a high-level Python API where data is commonly represented as multi-dimensional arrays called tensors, while computation is expressed as a composition of tensor operations embedded into the Python language. (TCRs allow implementation in other languages too, e.g., Java (PyTorch, 2020), Rust (Mazare, 2020), and C# (Foundation, 2020); Python is, however, the default language of choice for data scientists.) At the lower level, they have a runtime and a dispatcher/compiler that run the tensor operations over different hardware backends such as CPUs, GPUs, and custom ASICs, using single-node, distributed (Li et al., 2020), or mobile/edge (Google, 2021) execution.

Modern ML frameworks can run computation in an interpreted mode (often referred to as eager execution), or in a compiled mode (graph execution) that enables code optimizations such as common sub-expression elimination, operator fusion, and code generation (eag, 2022), as well as removing the Python dependency (TVM, 2022b, a). Interpreted vs. compiled execution is a popular dichotomy in query processing system implementations (Kersten et al., 2018). ML frameworks allow both modalities, and in Section 6 we explore the trade-offs involved when using one vs. the other, as well as the current limits of tensor compilers.

We will refer to ML frameworks, runtimes (onn, 2022; ten, 2019), and compilers as tensor computation runtimes (TCRs) in the rest of the paper.

2.2. Tensor Operations

TCRs provide hundreds of tensor operations. We provide a brief summary of the operators used in TQP, organized by category. Since TQP is currently built on top of PyTorch, from now on we use the PyTorch naming convention; similar tensor operations can be found in other TCRs. Additionally, we take the freedom to provide a different taxonomy than the one found in the PyTorch documentation (PyTorch, 2022b) and in our previous work (Koutsoukos et al., 2021b).



Creation.

This category contains all operations used to create tensors (e.g., from_numpy), fill a tensor with specific elements (zeros, ones, empty, fill, arange), or create a tensor with the same shape as another tensor (zeros_like, ones_like).

Indexing and slicing.

This category involves operations for selecting one or more elements of a tensor using the square bracket notation, or using indexing (index_select), a mask (masked_select), or a range (narrow).

Reorganization.

This category includes reshape, view, and squeeze, which reorganize the shape of a tensor (possibly by changing only its metadata). gather and scatter reorganize the elements of a tensor using an index, while sort sorts its elements.

Comparison.

eq, lt, gt, le, ge, and isnan are operators in this category. Other operations are where, which implements conditional statements, and bucketize, which implements binary search.

Arithmetic.

add, mul, div, sub, fmod, and remainder are in this category. We also include logical operators such as logical_and, logical_or, negative, and shift operations.

Join.

This category contains operations that concatenate (concat) or stack multiple tensors.

Reduction.

This category contains operations for calculating simple aggregates (sum, max, min, mean), aggregates over groups (scatter_add, scatter_min, scatter_max, scatter_mean), logical reductions (all, any), as well as operations to build histograms (bincount, histc), nonzero (returning the indexes of non-zero elements), unique, and unique_consecutive.
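As a rough illustration of how the operations listed above compose, the following sketch computes per-group sums, counts, and means with scatter_add and bincount. The column names, the sample values, and the assumption that group ids are dense integers in a known range are hypothetical and chosen only for this example; this is not TQP's group-by implementation.

```python
import torch

# Hypothetical data: one group id and one value per row.
group_ids = torch.tensor([0, 1, 0, 2, 1])
values = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
num_groups = 3  # assumed to be known up front in this sketch

# Per-group sum: scatter_add accumulates values[i] into slot group_ids[i].
group_sum = torch.zeros(num_groups).scatter_add(0, group_ids, values)

# Per-group row count, built as a histogram over the group ids.
group_count = torch.bincount(group_ids, minlength=num_groups)

# Per-group mean derived from the two aggregates above.
group_mean = group_sum / group_count
```

No Python loop over the rows is needed: every step is a single tensor operation that the runtime can execute on whichever device holds the tensors.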

3. Query Processing on TCRs

In this section, we summarize the challenges (Section 3.2), and the design principles we commit to (Section 3.3) when building TQP. But first we show how relational operators can be implemented using tensor programs with an example (Section 3.1).

3.1. Relational Operators as Tensor Programs

TCRs operate over data represented as tensors. Tensors are arrays of arbitrary dimensions containing elements of the same data type. 0d-tensors are referred to as scalars, 1d-tensors as vectors, and 2d-tensors as matrices. For a tensor of n dimensions, its shape is an n-tuple where the i-th element specifies the size of the i-th dimension. For example, a matrix with 10 rows and 5 columns is a 2d-tensor of shape (10, 5). In this paper, we only consider dense tensors where each element is explicitly stored in memory.

ML practitioners implement programs (NNs) as a composition of operations over tensors. While relational operations are commonly expressed as queries in a standalone language (e.g., SQL), tensor operations are embedded in a host language (e.g., Python), which is used for implementing control flows, etc. TCRs allow a limited form of declarativeness through graph execution. Next, we introduce several possible ways of implementing a filter using tensors.

Example 3.1.

Let’s assume we want to implement a simple filter condition over the l_quantity column of the lineitem table: where l_quantity < 24. First, we can represent the l_quantity column as a 1d-tensor of floating point numbers. We can then use the lt (less than) operator to implement the filter condition (line 1 of Listing 1). lt generates a boolean mask, which we can then pass as a parameter to the masked_select operator in line 2 to generate the filtered version of the l_quantity column vector. The program can be easily extended to multiple conditions by intersecting the masks using logical_and.

1  mask = torch.lt(l_quantity, 24)
2  output = torch.masked_select(l_quantity, mask)
Listing 1: Filter implementation using bitmaps.

This implementation is almost identical to the Bitmap-based representation (Ngom et al., 2021) of filters in vectorized query processors (Polychroniou and Ross, 2019; Raman et al., 2013). In fact, on CPU, TCRs have SIMD implementations of several conditions and intersection operators. An alternative would be to use indexes rather than masks to extract the values. This is commonly referred to as Selection Vector representation (Ngom et al., 2021; Răducanu et al., 2013), and can be similarly implemented using tensor operators lt, nonzero, and index_select.
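The two representations can be compared side by side. The following is a minimal sketch, assuming a small hypothetical l_quantity tensor; it is meant to illustrate the operators named above rather than TQP's actual filter code.

```python
import torch

# Hypothetical sample data standing in for the l_quantity column.
l_quantity = torch.tensor([17.0, 36.0, 8.0, 24.0, 28.0, 5.0])

# Bitmap variant: build a boolean mask, then select through it.
mask = torch.lt(l_quantity, 24)
bitmap_out = torch.masked_select(l_quantity, mask)

# Selection-vector variant: turn the mask into row indexes first.
# nonzero returns a (k, 1) tensor for 1d input, so drop the last dimension.
sel = torch.nonzero(mask).squeeze(1)
selvec_out = torch.index_select(l_quantity, 0, sel)
```

Both variants yield the same values; the selection vector additionally keeps the qualifying row positions, which can be reused to gather other columns of the same table.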

Listing 2 shows another implementation. Here, we iterate over all the elements of the input tensor and use a Python conditional statement. This implementation does not take advantage of any tensor operation beyond creating the output tensor.

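# Pre-allocate the output, then copy qualifying elements one at a time.
output = torch.zeros_like(l_quantity)
j = 0
for i in range(l_quantity.shape[0]):
    datum = l_quantity[i]
    if datum < 24:
        output[j] = datum
        j = j + 1
# Keep only the first j (filled) positions.
output = output[:j]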
Listing 2: Filter implementation using Python control flow.

Table 1 shows the performance of the two implementations. The implementation using Python control flow is considerably slower, and GPU execution actually decreases the performance. This result highlights one of the design choices (Section 3.3) we make in TQP: do not use Python for data-dependent code.

Implementation   CPU                      GPU
                 Torch     TorchScript    Torch     TorchScript
Bitmap           36.6ms    36.6ms         2.9ms     2.9ms
Python           23s       22.7s          200.3s    200s
Table 1. Execution time of different filter implementations over 6M elements. We evaluate them in PyTorch interpreted (Torch) and compiled (TorchScript) modes.

3.2. Challenges

Implementing a query processor on TCRs requires overcoming several challenges. After all, TCRs are built for authoring and executing NN models, not relational queries.



C1: Expressivity. Relational queries can contain filters with fairly complex expressions (e.g., like, in), sub-queries, group-by aggregates, joins (e.g., natural, anti, semi, outer), etc. It is not clear whether the tensor operations currently available in TCRs are enough to support all these relational operators.

C2: Performance. Even if a relational operator is implementable using tensors, this does not automatically lead to good performance, as the example in Listing 2 suggests. In fact, it is not clear whether tensor programs can achieve good performance beyond NNs.

C3: Data Representation. To use TCRs as execution engines, relational tables must be transformed into a tensor representation. Previous approaches have explored this challenge (e.g., (Hu et al., 2021)), but their cost of translation is not negligible. Furthermore, TCRs commonly do not support strings or date data types.

C4: Extensibility. Running relational queries over TCRs makes it possible to run a query seamlessly over different hardware (CPU, GPU, ASICs, etc.) and backends (single node, distributed, edge, web browser, etc.). A single monolithic compiler architecture does not work in all situations; therefore, TQP’s design must be flexible enough to address all these use cases.

3.3. Design Choices

When building TQP, we embrace the following design choices.



DC1: Avoid implementing data-dependent control flow in Python. As Table 1 suggests, computation in TQP must use tensor operations as much as possible. Note that for loops and conditionals over schema elements are acceptable (e.g., loops over the columns of a table). This design choice allows us to address C2 and achieve G1.

DC2: Tensor-based columnar format for input tabular data. Relational data must be transformed into the tensor format. To do this, TQP adopts a columnar representation of tables, and considers each column in a table as a tensor. We provide more details on our data representation in Section 4.1. This design choice addresses C3.

DC3: Adherence to TCRs’ API. This design choice is required for achieving G2 and G3. If we started extending TCRs with new features and operators, doing so would eventually hinder portability and increase the engineering effort, because we would have to support them on every hardware platform. Hence, we take advantage of the existing TCRs’ API rather than trying to extend it. With this design choice, we are also able to address C1.

DC4: Extensible infrastructure allowing easy integration with relational and ML frameworks. Having a flexible infrastructure is of paramount importance since we desire to ride the wave of investments in ML. Therefore, we embrace an extensible architecture composed of a core compiler, pluggable frontends (e.g., query parsing and optimizer), and support for different output target formats (e.g., PyTorch, ONNX). This design choice addresses C4.

4. Tensor Query Processor (TQP)

TQP extends Hummingbird (Nakandala et al., 2020; Microsoft, 2022a). In TQP, relational operators and ML models are compiled into tensor programs using a unified infrastructure. Here we only focus on the relational part, as the ML part was already described in (Nakandala et al., 2020).

System Overview. TQP’s workflow has two phases: (1) in the compilation phase an input query is transformed into an executable tensor program; (2) in the execution phase, input data is first transformed into tensors, and then fed into the compiled program to generate the final query result. In its current implementation, TQP uses vanilla PyTorch in the compilation phase as the implementation target for the tensor programs. PyTorch programs are then lowered into different target formats if necessary for portability or performance. The selection of the hardware device to target is generally done in the compilation phase. Next, we describe each phase in detail, starting from compilation (Section 4.2), followed by execution (Section 4.3). But before that, we first describe how TQP represents relational data using tensors.

Figure 1. TQP represents input tables in a columnar format with a 2d-tensor per column.

4.1. Data Representation

Before executing the query, TQP must convert the input (tabular) data to tensors. Databases often manage and convert data into their own proprietary format, and TQP is no different. TQP internally represents tabular data in a columnar format with virtual IDs (Abadi et al., 2013a), as illustrated in Figure 1. Data for each column is stored as an (n × m) tensor, where n is the number of input rows and m is the length required to store the values. The translation logic depends on the column data type. For example, numerical columns (sid in Figure 1) can be directly represented as tensors. The conversion of numerical columns to tensors is often zero-copy. TQP represents date data in (n × 1) numeric tensors as the number of nanoseconds since some pre-defined epoch. In this case, (de)serialization may be required depending on the source/target date representation. Finally, TQP represents string columns using (n × m) numeric tensors, where m is the maximum character length of any string for that column. Given a string, TQP stores a character per tensor column and right-pads it with 0s if its length is smaller than m. We are actively working on adding support for encoded data (e.g., bitpacking, run-length encoding, dictionary encoding) and more compact string representations (nvi, 2021).
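The string and date encodings described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not TQP's conversion code: the helper names encode_strings and encode_dates are hypothetical, characters are stored as Unicode code points, the padding value is 0, and the epoch is taken to be 1970-01-01.

```python
import torch
from datetime import datetime

EPOCH = datetime(1970, 1, 1)  # assumed epoch for this sketch

def encode_strings(values):
    """Encode strings as an (n x m) integer tensor, one character per tensor column."""
    m = max(len(v) for v in values)  # longest string in the column
    # Right-pad shorter strings with 0 (assumed padding value).
    rows = [[ord(c) for c in v] + [0] * (m - len(v)) for v in values]
    return torch.tensor(rows, dtype=torch.int64)

def encode_dates(iso_dates):
    """Encode ISO dates as an (n x 1) int64 tensor of nanoseconds since EPOCH."""
    ns = []
    for s in iso_dates:
        delta = datetime.fromisoformat(s) - EPOCH
        seconds = delta.days * 86_400 + delta.seconds
        ns.append(seconds * 1_000_000_000 + delta.microseconds * 1_000)
    return torch.tensor(ns, dtype=torch.int64).reshape(-1, 1)

ship_modes = encode_strings(["AIR", "TRUCK"])
ship_dates = encode_dates(["1970-01-02", "1995-03-15"])
```

With this layout, an equality predicate on a string column reduces to an element-wise comparison against the encoded constant followed by a reduction along the character dimension.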

4.2. Query Compilation

TQP’s compilation phase is composed of four main layers, as shown in Figure 2. (1) The Parsing Layer converts an input SQL statement into an internal Intermediate Representation (IR) graph depicting the query’s physical plan (Section 4.2.2). The physical plan is generated within TQP by an external frontend database system. Since TQP translates physical plans into IR graphs, the architecture decouples the physical plan specification from the other layers, which allows different frontends to be plugged in. (2) The Canonicalization and Optimization Layer does IR-to-IR transformations (Section 4.2.3). (3) The Planning Layer translates the IR graph generated in the previous layer into an operator plan, where each operator in the IR is mapped into a tensor program implementation (Section 4.2.4). (4) The Execution Layer generates an executor from the operator plan (Section 4.2.5). The executor is the program that runs on the target TCR and hardware. Next, before describing each layer in more detail, we give a quick overview of TQP’s intermediate representation.

Figure 2. TQP’s compilation phase.

4.2.1. Intermediate Representation (IR)

The IR is a graph-based data structure. It consists of a list of operators and variables. Each operator corresponds to a node in the graph and it contains: (1) a list of input variables; (2) a list of output variables; (3) an alias identifying the operator type; and (4) a reference to the corresponding operator instance in the original physical plan. The latter is used to instantiate the tensor program implementing the operator. For example, to create a filter, TQP needs to access the expressions contained in the original physical operator.

Edges represent data (tensors) flowing between operators. In particular, an edge connects an output variable of one operator to an input variable of another operator. A variable contains: (1) a unique identifier, and (2) the corresponding frontend column name in the original plan, which is used to translate expressions. When a variable is created, a unique identifier is generated deterministically based on information available in the graph. Variables in the IR are generated as follows. First, TQP generates a variable for each column in the input table. These variables can then be used as input to many operators; however, a new variable is always created for each output of an operator. Thanks to this design: (1) properties (e.g., sorting information) can be immutably attached to columns; (2) the IR is easier to debug because variables, once defined, are never changed; and (3) TQP can detect at runtime when a column is not used anymore and safely garbage-collect it.
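The operator and variable structure described above could be modeled roughly as follows. All class names, field names, and the identifier scheme are hypothetical and chosen for illustration; the text does not specify how TQP names these structures or derives its identifiers.

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass(frozen=True)
class Variable:
    uid: str          # unique identifier, derived deterministically from the graph
    column_name: str  # frontend column name in the original plan

@dataclass
class Operator:
    alias: str                # identifies the operator type
    inputs: List[Variable]
    outputs: List[Variable]
    physical_op: Any = None   # reference to the operator in the physical plan

@dataclass
class IRGraph:
    operators: List[Operator] = field(default_factory=list)
    variables: List[Variable] = field(default_factory=list)

    def new_variable(self, column_name):
        # Assumed scheme: position in the graph plus the column name.
        var = Variable(f"v{len(self.variables)}_{column_name}", column_name)
        self.variables.append(var)
        return var

    def add_operator(self, alias, inputs, output_columns, physical_op=None):
        # Outputs always get fresh variables; inputs are shared references.
        outputs = [self.new_variable(c) for c in output_columns]
        op = Operator(alias, list(inputs), outputs, physical_op)
        self.operators.append(op)
        return op

graph = IRGraph()
scan = graph.add_operator("scan", [], ["l_quantity"])
flt = graph.add_operator("filter", scan.outputs, ["l_quantity"])
```

Making Variable frozen mirrors the property that variables, once defined, are never changed: the filter output is a new variable even though it carries the same column name as its input.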

4.2.2. Parsing Layer

The goal of the Parsing Layer is to translate input queries into TQP’s internal IR. This goal is accomplished in two steps: (1) input queries are parsed, optimized, and exposed as frontend-specific physical query plans; and (2) a frontend-specific parsing logic translates the physical plan into an IR plan.

In its current version, TQP supports queries expressed as Spark SQL statements, and it uses the PySpark API to parse, optimize, and return the physical plan in a JSON format. We plan to add support for Calcite (Begoli et al., 2018), DuckDB (Raasveldt and Mühleisen, 2020), and eventually Substrait (sub, 2022). Note that Apache Spark is currently the only supported relational frontend; TQP does, in fact, support all the ML frontends available in Hummingbird (Microsoft, 2022a). Then the Spark parser constructs the internal IR version of the physical plan using a DFS post-order traversal. If an unsupported operator is found in the plan, this phase will fail with an exception. The list of operators supported by the IR is extensible (DC4).
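The post-order traversal can be sketched on a toy plan. The nested-dictionary plan format, the operator names, and the parse_plan helper below are hypothetical and do not reflect Spark's actual JSON plan layout or TQP's parser.

```python
# Hypothetical set of operators that the IR knows how to represent.
SUPPORTED = {"Scan", "Filter", "Project", "SortMergeJoin"}

def parse_plan(node, ir_ops=None):
    """Visit children before their parent (DFS post-order) and collect operators."""
    if ir_ops is None:
        ir_ops = []
    for child in node.get("children", []):
        parse_plan(child, ir_ops)
    if node["op"] not in SUPPORTED:
        # Mirrors the behavior described above: fail on unsupported operators.
        raise NotImplementedError(f"unsupported operator: {node['op']}")
    ir_ops.append(node["op"])
    return ir_ops

plan = {
    "op": "Project",
    "children": [{
        "op": "SortMergeJoin",
        "children": [
            {"op": "Filter", "children": [{"op": "Scan"}]},
            {"op": "Scan"},
        ],
    }],
}
order = parse_plan(plan)
```

Because children are emitted before their parents, every operator is created after the operators that produce its inputs, so its input variables already exist when it is added to the IR.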

4.2.3. Canonicalization and Optimization Layer

This layer implements IR graph transformations similarly to a classical rule-based optimizer. Rules are applied to the IR graph in two stages. In the first stage, canonicalization, the rules are used to eliminate any of the frontend-system idiosyncrasies that are present in the IR graph. For example, Apache Spark returns a projection operator with no inputs for count(*) statements. In the second stage, optimization, rules rewrite the IR graph to obtain better performance. While we did not explore in depth the optimization space enabled by TQP’s design, we show in Section 6.6 that hand-optimized tensor programs are more efficient than the ones currently generated by TQP.

4.2.4. Planning Layer

In this layer, TQP transforms the optimized IR graph into an operator plan composed of PyTorch tensor programs implementing each operator in the IR graph. In Section 5, we describe some operator implementations in detail. The implementation of the Planning Layer is straightforward. For each operator in the IR graph, TQP fetches the corresponding implementation containing the tensor program from a dictionary, which is then instantiated with the IR operator’s reference to the frontend physical operator instance.

4.2.5. Execution Layer

Here the operator plan is wrapped around a PyTorch executor object. This object is responsible for: (1) calling the programs in the operator plan following a topological order; (2) wiring the output tensors generated by each program into the successive one; and (3) keeping track of tensor references to garbage-collect them when not used anymore (i.e., when all the programs using a certain tensor/variable have been executed).
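The three responsibilities of the executor can be sketched with reference counting. This is a simplified stand-in that assumes a hypothetical plan format of (output name, program, input names) triples already in topological order, and it uses plain Python lists in place of tensors to stay self-contained.

```python
from collections import Counter

class Executor:
    def __init__(self, plan):
        # plan: list of (output_name, program, input_names) in topological order.
        self.plan = plan
        self.freed = []  # names released during the last run, for inspection

    def run(self, inputs):
        env = dict(inputs)
        # Number of pending uses of every variable across the whole plan.
        pending = Counter(n for _, _, in_names in self.plan for n in in_names)
        self.freed = []
        result = None
        for out_name, program, in_names in self.plan:
            # Wire the outputs of earlier programs into this one.
            result = program(*[env[n] for n in in_names])
            env[out_name] = result
            for n in in_names:
                pending[n] -= 1
                if pending[n] == 0:
                    # No remaining consumer: drop the reference.
                    del env[n]
                    self.freed.append(n)
        return result

plan = [
    ("mask", lambda q: [x < 24 for x in q], ["qty"]),
    ("out", lambda q, m: [x for x, keep in zip(q, m) if keep], ["qty", "mask"]),
]
executor = Executor(plan)
filtered = executor.run({"qty": [17, 36, 8]})
```

Dropping a reference as soon as its last consumer has run keeps peak memory bounded by the live intermediate results, which matters most on devices with limited memory.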

Once the executor program is generated, TQP provides options to compile it into different target formats in addition to (PyTorch) interpreted execution. Currently, TQP can lower the executor into the TorchScript and ONNX formats, as well as use TVM to compile it directly into machine-level code. Note that not all queries can be compiled into all formats since not all PyTorch operations are supported by all the target representations. We will further discuss these trade-offs in Section 6.

4.3. Execution

Once the executor program is generated, it can be executed over the input data. The program automatically manages: (1) converting data into the tensor format; (2) data movements to/from device memory; and (3) scheduling of the operators in the selected device. Once the data is in the right format and on the desired device, all the operators are executed sequentially. Regarding parallelization, TQP exploits the tensor-level intra-operator parallelism provided by the TCRs. However, given the poor scalability performance (Section 6.3), we are exploring support for inter-operator parallelism, and data-parallel strategies. Once the executor completes, TQP returns the result of the query in tensor, NumPy, or Pandas formats.

5. Operator Implementation In TQP

We described how TQP uses the Planning Layer to translate relational operators in the IR graph into tensor programs. Here we provide an overview of a few program implementations. TQP provides tensor-based implementations for the following relational operators: selection, projection, sort, group-by aggregation (sort-based), natural join (hash-based and sort-based), non-equi, left-outer, left-semi, and left-anti joins. TQP supports expressions including comparison and arithmetic operations, functions on the date data type, in, case, and like statements, as well as aggregate expressions using sum, avg, min, max, and count aggregates (with and without distinct). Finally, TQP supports nulls, subqueries (scalar, nested, and correlated), and predict UDFs (Microsoft, 2021a, b). (While generic UDFs are hard to support in TQP because of data conversion and data representation mismatches, Spark vectorized UDFs (vec, 2021) can be supported on CPU.) With all the above, TQP is able to compile and execute all 22 queries of the TPC-H benchmark (C1). Interestingly, to support the full TPC-H benchmark, only the tensor operations listed in Section 2.2 are required, and we did not have to introduce any additional custom tensor operators (DC3). Due to space constraints, we only describe how TQP implements relational expressions with tensor operations (Section 5.1), and implementations for two representative operators: join (sort- and hash-based, in Section 5.2 and Section 5.3 respectively), and group-by aggregation (Section 5.4). Finally, note that the filter implementation in TQP is close to the Bitmap representation described in Section 3.1.

Input: data: input columns passed as an array of tensors.

Output: an array of tensors representing the join output.

1: Sort join keys
3: Build histograms for the left and right key columns
4: Compute the number of rows for each pair of matching keys
5: Compute the prefix sums of histograms
8: Initialize the output size and output offsets
10: Find the bucket of matching keys to which each output belongs
11: Compute the indexes from left and right in the join output
15: return createOutput(data, leftOutIdx, rightOutIdx)
Algorithm 1 Sort-Based Join

5.1. Expressions

Relational expressions such as sum(l_extendedprice * (1 - l_discount)) can be found in projection operators, filter conditions, etc. In an expression tree, each leaf node represents a column or a constant value (e.g., l_extendedprice) and each branch node represents an operator (e.g., *). TQP keeps an internal dictionary that maps operators to their corresponding tensor operations, e.g., * to torch.mul. To implement an expression with tensor operations, TQP performs a post-order DFS traversal on the expression tree. For each leaf node, TQP fetches (or generates) the proper column-tensor (constant value). For each internal operator, TQP retrieves the corresponding tensor operation (or a series of tensor operations) from the internal dictionary. In this way (and with the help of Python lambda functions), TQP generates a chain of tensor operations representing the evaluation of the expression. As an example, from Q21 in TPC-H, the expression o_orderstatus = ‘F’ and l_receiptdate > l_commitdate is implemented as torch.logical_and(torch.eq(o_orderstatus, [70]), torch.gt(l_receiptdate, l_commitdate)), where [70] is a 1x1 tensor storing the ASCII value for the constant ‘F’.
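The traversal can be sketched with a small operator dictionary and nested lambdas. The tuple-based expression tree, the TENSOR_OPS dictionary, and the compile_expr helper are hypothetical simplifications for illustration, not TQP's internal structures.

```python
import torch

# Hypothetical operator dictionary mapping SQL operators to tensor operations.
TENSOR_OPS = {
    "*": torch.mul, "+": torch.add, "-": torch.sub,
    "=": torch.eq, ">": torch.gt, "<": torch.lt,
    "and": torch.logical_and,
}

def compile_expr(node):
    """Post-order DFS: return a function from a dict of column tensors to a tensor."""
    if isinstance(node, str):            # leaf: column reference
        return lambda cols: cols[node]
    if isinstance(node, (int, float)):   # leaf: constant value
        return lambda cols: torch.tensor(node)
    op, left, right = node               # branch: binary operator
    f_left, f_right = compile_expr(left), compile_expr(right)
    tensor_op = TENSOR_OPS[op]
    return lambda cols: tensor_op(f_left(cols), f_right(cols))

# l_extendedprice * (1 - l_discount), over hypothetical sample data.
expr = ("*", "l_extendedprice", ("-", 1, "l_discount"))
columns = {
    "l_extendedprice": torch.tensor([100.0, 200.0]),
    "l_discount": torch.tensor([0.5, 0.25]),
}
revenue = compile_expr(expr)(columns)
total = torch.sum(revenue)  # the enclosing sum(...) aggregate
```

The compiled function only composes tensor operations, so the Python recursion happens once at compilation time over the expression tree and never over the data.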

5.2. Sort-Based Join

Figure 3. An example of the sort-based join implementation.

TQP adopts a late materialization strategy for joins, similar to the one commonly used in columnar databases (Abadi et al., 2013b; Li and Ross, 1999). TQP takes only the columns in the join predicate as input to the join, and the output is a set of pairs of indexes identifying the records for which the join predicate succeeds. The sort-based equi-join algorithm is shown in Algorithm 1, where, to simplify the description, we describe the case in which two integer columns are joined. With a few modifications, the algorithm is also able to support non-equi joins, left-semi joins and outer joins. We use the typewriter font (e.g., bucketize) to denote tensor operations, and the capital font (e.g., createOutput) to denote class methods. Figure 3 further illustrates the algorithm.

First, TQP sorts the join-key columns from each table (Algorithm 1, ➊ in Figure 3). Then, ➋ TQP builds two histograms for the join keys from left and right, respectively, i.e., TQP counts the number of occurrences of each unique join key. Then, ➌ by multiplying the histograms element-wise, TQP computes the bucket sizes: the number of output rows for each matching join key from left and right. Afterwards, TQP computes the prefix sums of the left and right histograms (➍), as well as of their element-wise product (➎). The prefix sums are used later to retrieve, for each row of the join output, the positions in left and right. The total size of the join output is then the last element of the prefix sum over the bucket sizes, and ➏ TQP generates an index array (offset) of the same size. Then, ➐ TQP performs a parallel binary search on the prefix sum over the bucket sizes to find the matching join key (bucket) to which each row of the join output belongs. Next, ➑ TQP computes the indexes from left and right that generate each row of the join output. Figure 3 shows this computation for one row of the join output in the example. To compute the indexes from left and right for a given offset in the join output, TQP first subtracts from offset the prefix sum of the bucket sizes prior to the current bucket. offset then holds the position within the current bucket of matching join keys. TQP then takes the value for the previous bucket from the respective prefix-sum histogram (cumLeftHist and cumRightHist), and adds to it the result (quotient for leftOutIdx, remainder for rightOutIdx) of dividing offset by the number of join keys from right in the current bucket. At this point, for each row in the join output, TQP knows which rows from left and right contributed to it. It then generates the join output (not depicted in Figure 3). It is important to note that all computations in this join implementation are achieved using tensor operations, with only minimal usage of Python code.
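To make the index arithmetic concrete, the following is a minimal sketch of the steps described above for two integer key columns. It is written with NumPy rather than PyTorch (an assumption made so the sketch runs without a TCR installed): np.searchsorted plays the role of bucketize and np.cumsum that of the prefix sum. The function name, the variable names, and the shared key domain built with np.union1d are illustrative choices, not taken from TQP.

```python
import numpy as np

def sort_join_indexes(left_keys, right_keys):
    """Return (left_idx, right_idx) with left_keys[left_idx] == right_keys[right_idx]."""
    # Sort the join keys, remembering the original row positions.
    left_order = np.argsort(left_keys, kind="stable")
    right_order = np.argsort(right_keys, kind="stable")
    left_sorted, right_sorted = left_keys[left_order], right_keys[right_order]

    # Histograms over a shared domain of distinct keys (one bucket per key).
    domain = np.union1d(left_sorted, right_sorted)
    left_hist = np.bincount(np.searchsorted(domain, left_sorted), minlength=domain.size)
    right_hist = np.bincount(np.searchsorted(domain, right_sorted), minlength=domain.size)

    # Bucket sizes: number of output rows produced by each key.
    bucket_sizes = left_hist * right_hist

    # Prefix sums of the histograms and of the bucket sizes.
    cum_left = np.cumsum(left_hist)
    cum_right = np.cumsum(right_hist)
    cum_bucket = np.cumsum(bucket_sizes)

    # Output size and one offset per output row.
    out_size = int(cum_bucket[-1]) if cum_bucket.size else 0
    offset = np.arange(out_size)

    # Binary search: the bucket each output row belongs to.
    bucket = np.searchsorted(cum_bucket, offset, side="right")

    # Offset inside the bucket, then quotient/remainder by the right-side count.
    within = offset - (cum_bucket[bucket] - bucket_sizes[bucket])
    left_start = cum_left[bucket] - left_hist[bucket]
    right_start = cum_right[bucket] - right_hist[bucket]
    n_right = right_hist[bucket]
    left_pos = left_start + within // n_right
    right_pos = right_start + within % n_right

    # Map positions in the sorted columns back to original row indexes.
    return left_order[left_pos], right_order[right_pos]

left = np.array([3, 1, 2, 3])
right = np.array([3, 3, 5, 1])
left_out, right_out = sort_join_indexes(left, right)
```

Using side="right" in the binary search skips buckets of size zero, so the divisor n_right is never zero for a row that actually appears in the output.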

Input: data: input columns passed as an array of tensors.

Output: an array of tensors representing the join output.

2: Compute the hash values for join keys (m is the max hash table size)
3: Build the histogram of hash values for the left join keys
5: Build and probe the hash table in an interleaved way
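6: for each round, up to the maximum number of elements that share a hash value, do
8:      Skip those scattered for future iterations by setting their hashes to m (the additional bucket)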
10:      Probe the current hash table and get the left and right indexes
14:      Find the indexes that have matching join keys
17:      Append the indexes to the global results
Algorithm 2 Hash-Based Join

5.3. Hash-Based Join

The hash equi-join algorithm is shown in Algorithm 2. The definition of the input and output here is the same as in Section 5.2. The algorithm is similar to the classical hash join algorithm, except that the build and probe phases are interleaved and repeated as many times as the maximum number of elements that share a hash value (line 6). The algorithm is as follows: TQP first generates the indexes (line 2) and the hash values (line 3) for the left and right tables. Afterwards, TQP computes a histogram over the table on which the hash table will be built (left in this case, line 4) and checks the maximum number of elements in a hash bucket (line 5). Then, TQP repeatedly builds a hash table (lines 7 and 8) and probes it (lines 11 to 14) to find matching keys (lines 15 to 17). Matching keys are accumulated across iterations (lines 18 and 19). In each iteration, TQP also keeps track of the indexes that are stored into the hash table so that they do not appear in subsequent iterations (lines 9 and 10). To achieve this, let m be the hash table size; TQP appends an additional (m+1)-th bucket to the hash table, and uses it to redirect the already scattered indexes. Note that, when there are no hash collisions, TQP skips the logic of lines 9 to 10 and 18 to 19. This path is therefore close to optimal.
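The interleaved build and probe loop can be sketched as below. This is an illustration under stated assumptions, not TQP's implementation: NumPy index assignment stands in for the tensor scatter, the hash function is a plain modulo, and hash_join_indexes and its variable names are hypothetical. The sketch also omits the collision-free fast path mentioned above.

```python
import numpy as np

def hash_join_indexes(left_keys, right_keys, m=8):
    """Equi-join two integer key arrays; m is the hash table size."""
    left_idx = np.arange(left_keys.size)
    right_idx = np.arange(right_keys.size)
    left_hash = left_keys % m            # assumed hash function: key modulo m
    right_hash = right_keys % m

    # Histogram on the build side: its maximum is the number of rounds needed.
    hist = np.bincount(left_hash, minlength=m)
    rounds = int(hist.max())

    out_left = [np.empty(0, dtype=left_idx.dtype)]
    out_right = [np.empty(0, dtype=right_idx.dtype)]
    for _ in range(rounds):
        # Build: scatter left row ids into m + 1 slots. When several rows share
        # a slot, one of them ends up stored (which one is unspecified).
        table = np.full(m + 1, -1, dtype=left_idx.dtype)
        table[left_hash] = left_idx
        table[m] = -1                    # slot m only absorbs redirected rows
        stored = table[left_hash] == left_idx

        # Probe: fetch the candidate left row for every right row, compare keys.
        candidate = table[right_hash]
        has_candidate = candidate >= 0
        same_key = left_keys[np.where(has_candidate, candidate, 0)] == right_keys
        match = has_candidate & same_key

        # Append the matching indexes to the global results.
        out_left.append(candidate[match])
        out_right.append(right_idx[match])

        # Redirect rows already stored to slot m so they are skipped from now on.
        left_hash = np.where(stored, m, left_hash)
    return np.concatenate(out_left), np.concatenate(out_right)

left = np.array([3, 1, 2, 3])
right = np.array([3, 3, 5, 1])
# m=2 forces collisions (keys 3, 1, 3 all hash to 1), so three rounds are needed.
left_out, right_out = hash_join_indexes(left, right, m=2)
```

Each round stores exactly one row per non-empty bucket, so the number of rounds equals the largest bucket count, which is why heavily skewed hash values make this approach slower than the sort-based join.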

Compared to the sort-based join, when there are no hash collisions, this implementation is around 30% to 50% faster on CPU and 2× faster on GPU. When there are hash collisions, it is faster than the sort-based join as long as at most around 15 elements share a hash value; when more than 15 elements share a hash value, the sort-based join is faster. We are also currently working on a partitioned hash-join implementation.

5.4. Aggregation

Input: data: input columns passed as an array of tensors.

Output: the aggregation output as an array of tensors.

1: Generate unique groups
5: Evaluate the aggregation expression
Algorithm 3 Aggregation

Algorithm 3 shows the pseudocode of the aggregation implementation. First, TQP horizontally concatenates the values of the group-by columns. TQP then sorts the values of the concatenated columns using radix sort, and permutes all the input data columns according to this sorted order. Using uniqueConsecutive, TQP eliminates all but the first key from every consecutive group of equivalent keys. At the same time, TQP computes the inverted indexes that indicate which bucket (unique key) each row in the sorted list ends up in. Finally, with the unique key list and the inverted indexes, TQP evaluates the aggregate expression for all groups. This last operation makes use of the expression generated (at compile time) as described in Section 5.1.
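A minimal sketch of this sort-based aggregation for a sum aggregate follows. It assumes NumPy instead of PyTorch: np.lexsort replaces the radix sort, the first-of-run mask emulates what unique_consecutive returns (unique keys plus inverse indexes), and np.add.at plays the role of scatter_add. The function name and the sample columns are hypothetical.

```python
import numpy as np

def group_by_sum(group_cols, values):
    """Sort-based group-by: returns (unique_keys, per-group sums)."""
    # Horizontally concatenate the group-by columns: one key row per input row.
    keys = np.stack(group_cols, axis=1)

    # Sort by the first group-by column, then the second, and so on
    # (np.lexsort treats its LAST key as the primary one, hence the reversal).
    order = np.lexsort(tuple(reversed(group_cols)))
    keys_sorted = keys[order]
    values_sorted = values[order]

    # Keep the first key of every run of equal keys, and compute for each
    # sorted row the index of the group (bucket) it belongs to.
    is_first = np.ones(len(keys_sorted), dtype=bool)
    is_first[1:] = np.any(keys_sorted[1:] != keys_sorted[:-1], axis=1)
    unique_keys = keys_sorted[is_first]
    inverse = np.cumsum(is_first) - 1

    # Evaluate the aggregate: scatter-add each value into its group's slot.
    sums = np.zeros(len(unique_keys), dtype=values.dtype)
    np.add.at(sums, inverse, values_sorted)
    return unique_keys, sums

# Hypothetical integer-encoded group-by columns and a value column.
flag = np.array([0, 1, 0, 1, 0])
status = np.array([1, 0, 1, 1, 1])
price = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
group_keys, group_sums = group_by_sum([flag, status], price)
```

Other aggregates follow the same pattern by swapping the final scatter step, for example np.maximum.at for max, or a scatter-add of ones for count.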

6. Evaluation

In this section, we aim to answer the following questions: (1) Is TQP’s performance on CPU comparable to other CPU-based data processing systems on a single core (Section 6.1)? (2) Is TQP’s performance on GPU comparable to other GPU databases (Section 6.2)? (3) How well does TQP scale with the increase in the number of CPU cores and dataset sizes (Section 6.3)? (4) What is the cost/performance trade-off of TQP on GPU (Section 6.4)? (5) Which operation takes the most time in query execution (Section 6.5)? (6) Can hand-optimized query plans improve over TQP’s query time (Section 6.6)? (7) Can TQP accelerate workloads mixing ML and relational queries (Section 6.7)? (8) What are the overheads introduced by TQP (Section 6.8)? (9) Can TQP run over different hardware and software backends while minimizing the engineering effort (Section 6.9 and Section 6.10)?

Baseline Systems. Our goal is to compare TQP with state-of-the-art query processing systems for different hardware settings. Specifically, for CPU execution, we compare TQP with Apache Spark (Zaharia et al., 2012) (recall that Spark and TQP share the same query plans) and DuckDB (Raasveldt and Mühleisen, 2020): a state-of-the-art vectorized engine. For GPU execution, we compare TQP with two well-known open-source GPU databases: BlazingSQL (bla, 2020) and OmniSciDB (omn, 2020).

Hardware and Software Setup. For all the experiments (except when noted otherwise), we use an Azure NC6 v2 machine with 112 GB of RAM, an Intel Xeon CPU E5-2690 v4 @ 2.6GHz (6 virtual cores), and an NVIDIA P100 GPU (with 16 GB of memory). The machine runs Ubuntu 18.04 with PyTorch 1.11, torch-scatter 2.0.9, BlazingSQL 21.8.1, PySpark 3.1.1, OmniSciDB 5.9.0, DuckDB 0.2.4, RAPIDS 21.08, CUDA 10.2, TVM 0.8, and scikit-learn 0.21.3.

Experimental Setup. We use the TPC-H benchmark (Council, 2018), which consists of 22 queries. We use the parameters specified in the query validation sections in (Council, 2018). We generate data at different scale factors (from 1 to 10, where 1 means 1 GB of data in total) using the dbgen tool. [Footnote 5: Note that some queries can run on scale factors larger than 10 on GPUs, thanks to TQP’s ability to push projections into data conversion. We are working on supporting out-of-memory computation by leveraging PyTorch’s DataLoader (dat, 2022).] We load the generated data from disk into Pandas dataframes. All dataframes use the data types specified in the benchmark, except for decimals: we use doubles for all systems since TQP does not support decimals yet. Subsequently, we register/convert each dataframe into each system’s internal format, e.g., Spark dataframes for Spark, PyTorch tensors for TQP, CUDA dataframes for BlazingSQL, etc., and move the data to the GPU, when applicable. [Footnote 6: For Spark, we additionally load the working datasets in memory using cache.] We measure the total query execution time, including the time for generating the output. For each experiment, we do 10 runs, of which the first 5 are used for warm-up. The reported numbers are the median values of the last 5 runs.
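The measurement protocol in the last three sentences (10 runs, the first 5 discarded as warm-up, median of the remaining 5) can be sketched with the Python standard library. The function name measure and its parameters are hypothetical, and the clock argument exists only so that the timer can be replaced in tests.

```python
import statistics
import time

def measure(run_query, runs=10, warmup=5, clock=time.perf_counter):
    """Time run_query `runs` times and return the median of the non-warm-up runs."""
    timings = []
    for _ in range(runs):
        start = clock()
        run_query()                      # should include producing the output
        timings.append(clock() - start)
    return statistics.median(timings[warmup:])

# Example with a stand-in workload (not a real TPC-H query).
median_seconds = measure(lambda: sum(range(10_000)))
```

Reporting the median rather than the mean keeps a single slow run (for example, one affected by a background process) from skewing the reported time.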

Key Takeaways. (1) TQP’s query execution time on CPU using a single core is better than Spark’s over the same physical plans; however, (2) TQP’s scalability on CPU is poor because PyTorch lacks parallelization in some operators’ implementations and because of its intra-operator parallelism model. (3) TQP is, in general, slower than DuckDB on CPU, but for a few queries, TQP is comparable or even better. (4) Hand-optimized plans can improve TQP’s performance, which suggests that a TCR-aware query optimizer is required to achieve the best performance. (5) TQP’s query execution time on GPU is often better than both BlazingSQL’s and OmniSciDB’s, and TQP supports more queries. (6) When ML model prediction and SQL queries are mixed together, TQP is able to provide end-to-end acceleration, which delivers up to a 9× performance improvement over CPU baselines. (7) TQP on GPU performs favorably, and the query time speedup justifies the cost increase compared to CPU-only systems. (8) TQP can run queries on different hardware and software backends (including even integrated GPUs and web browsers), with orders of magnitude fewer lines of code required compared to the baseline systems.

Query   CPU (1 core)                          GPU
        Spark    DuckDB   TQP      TQPJ       Blazing  Omnisci  TQP      TQPJ
Q1      2.261    0.664    7.535    7.301      0.216    0.095    0.157    0.154
Q2      8.751    0.101    0.629    0.577      0.238    0.351    0.039    0.028
Q3      3.669    0.273    1.154    1.165      0.128    0.293    0.027    0.024
Q4      4.719    0.216    1.050    1.087      0.093    0.292    0.020    0.018
Q5      6.963    0.302    2.459    2.963      0.164    0.064    0.048    0.042
Q6      0.381    0.156    0.143    0.073      0.045    0.047    0.003    0.002
Q7      5.569    0.430    2.236    1.931      0.244    0.067    0.042    0.035
Q8      4.034    0.278    2.460    2.503      0.215    0.079    0.050    0.039
Q9      17.61    2.533    4.518    4.616      0.569    0.072    0.105    0.092
Q10     15.98    0.430    1.168    1.184      0.173    0.740    0.057    0.052
Q11     1.047    0.034    0.476    0.324      N/A      0.084    0.016    0.009
Q12     4.063    0.309    0.976    0.966      0.069    0.062    0.025    0.021
Q13     6.081    0.181    9.379    9.197      0.303    0.069    0.153    0.136
Q14     0.509    0.171    0.124    0.096      0.076    N/A      0.007    0.005
Q15     2.640    0.291    0.133    N/A        N/A      0.086    0.129    N/A
Q16     16.94    0.093    3.664    3.699      N/A      3.689    0.320    0.301
Q17     3.165    0.381    2.303    2.466      0.121    0.132    0.061    0.051
Q18     6.942    0.765    2.245    2.406      0.204    0.593    0.053    0.048
Q19     2.300    0.419    1.577    1.316      0.188    0.058    0.042    0.036
Q20     4.232    N/A      2.032    1.975      0.149    N/A      0.048    0.041
Q21     12.39    0.932    25.49    24.25      N/A      N/A      0.158    0.151
Q22     3.919    0.069    0.315    0.296      N/A      N/A      0.011    0.010
Table 2. Query execution time (in seconds) on the TPC-H benchmark (scale factor 1). Bold numbers highlight the best performance for the specific setup (CPU or GPU). We evaluate TQP in two modalities: interpreted (TQP) and compiled using TorchScript (TQPJ). N/A means the query execution did not finish because of an error. TQPJ currently does not support materialized views.
(a) Query execution time over different numbers of cores.
(b) Query execution time over different scale factors.
Figure 4. Scalability on selected queries from TPC-H. For TQP, we report the best time of the interpreted (PyTorch) and compiled (TorchScript) versions. In (a), the scale factor is 1. In (b), all CPU methods use 6 cores. BlazingSQL throws errors for Q9 at scale factors 2, 5, and 10. OmnisciDB does not support Q14. The y-axes in (b) are in (symmetric) log scale.

6.1. Single Core Execution on CPU

In this first experiment, we use a single CPU core and TPC-H at scale factor 1. The results are shown in Table 2 (under CPU). We compare Spark and DuckDB vs. TQP, using both interpreted (TQP) and compiled execution with TorchScript (TQPJ). Both Spark and TQP are able to support all 22 queries.

In terms of query time, TQPJ is either comparable to TQP or better. This is because TorchScript removes Python code dependency and provides optimizations not offered by vanilla PyTorch (DeVito, 2019). TQP outperforms Spark for most of the queries, sometimes by an order of magnitude (e.g., Q10, Q15, and Q22). Given that TQP uses the same physical plans as Spark, this suggests that the tensor abstraction is indeed good for executing relational queries. The practical reasons are: (1) TQP is column-oriented, while Spark is row-oriented. This makes the former better suited for analytical queries; (2) some tensor operations use SIMD instructions, while Spark does not exploit vectorization; (3) in TQP, tensor operations are implemented in C++, while Spark is Java-based; (4) Spark is designed as a scale-out system. For some queries (i.e., Q1, Q13, and Q21) where TQP is slower than Spark, the reasons are: (1) TQP’s left anti-join and left outer-join implementations are not optimized; (2) the performance of the uniqueConsecutive operator in PyTorch is not optimal.

Finally, TQP has better performance than DuckDB only for 3 out of the 21 queries supported by DuckDB. For the other queries, DuckDB clearly outperforms TQP. If we exclude Q1, Q13, and Q21 (discussed above), TQP is within the same order of magnitude as DuckDB. To evaluate whether this poor performance compared with DuckDB is due to bad query plans or the tensor abstraction, we hand-code optimal plans and tensor programs in Section 6.6 and show that TQP can match and even outperform DuckDB on CPU.

6.2. Execution on GPU

In this experiment, we evaluate the performance of TQP on GPU. The results are shown in Table 2 (under GPU). Starting from TQP vs. TQPJ, as in the CPU case, TQPJ outperforms TQP. Compared with the baselines, TQP (interpreted or compiled) outperforms BlazingSQL (Blazing in the table) for all the queries, and it outperforms OmniSciDB (Omnisci) on 14 queries out of the 18 queries supported by OmniSciDB. For the remaining 4 queries, TQP achieves query times within a factor of 2 from OmniSciDB. Note that TQP supports all 22 TPC-H queries, while BlazingSQL and OmniSciDB only support 17 and 18 queries, respectively.

Finally, if we compare the best CPU performance with the best GPU performance, we see that, in general, the runtimes on GPU are 1.5× to 48× better than the CPU ones (single core), except for Q16, where DuckDB is about 3× faster than the best-performing GPU method. This somewhat counter-intuitive result is due to the fact that, at scale factor 1, GPU resources are not completely saturated. Therefore, it makes sense to explore how these systems scale with more data and more available cores. This is what we explore next.

6.3. Scalability

For this and the following experiments, we select a representative set of queries: complex aggregation (Q1), joins and filters (Q2), simple filters (Q6), complex joins (Q9), simple join and aggregation (Q14), a complex mix of join, aggregation, and sub-queries (Q18).

6.3.1. Scaling the Number of Cores

In this experiment, we scale the number of available CPU cores from 1 to 6 over TPC-H at scale factor 1. Figure 4(a) compares the scaling performance of Spark, DuckDB, and TQP. Spark has the best scalability trend lines for almost all queries. DuckDB also scales well. TQP’s scaling performance is, however, sub-optimal, and for some queries increasing the number of cores provides no benefit. There are two reasons: (1) PyTorch uses intra-operator parallelism, which is not as efficient as the shuffle (Zaharia et al., 2012) or morsel-based (Leis et al., 2014) approaches in Spark and DuckDB, respectively; (2) some PyTorch operators run on a single core (e.g., unique and unique_consecutive (PyTorch, 2022c), which are used in aggregation). We are investigating how to overcome this limitation by adding data-parallel support to TQP leveraging PyTorch Distributed Data Parallel (Li et al., 2020; ddp, 2022) or by adding parallel operator implementations.

6.3.2. Scaling the Data

In this experiment, we scale the dataset from 1GB to 10GB. In Figure 4(b), we compare the scalability of CPU implementations running over 6 cores (Spark, DuckDB), as well as GPU systems (BlazingSQL and OmniSciDB). In general, we see that TQP CPU scales the worst for almost all queries (only Spark is worse, for Q6 and Q14), while GPU systems scale better than the CPU ones. For Q1, OmniSciDB provides the best performance, followed by BlazingSQL. TQP’s GPU performance is slightly worse than DuckDB’s. For Q2, Q14, and Q18, TQP GPU has the best performance, while for Q6, TQP GPU is comparable to OmniSciDB. Finally, for Q9, OmniSciDB has the best performance. Q9 has six joins, and OmniSciDB is able to better use the GPU resources. This query is memory-bound, and the memory bandwidth of the P100 makes it much faster on GPU than on CPU.

Figure 5. Cost/performance trade-off for TQP on selected queries at scale factor 10. We plot the speedups of TQP run on different GPUs (NVIDIA T4, P100 and V100) over DuckDB on a baseline CPU-only machine. The dashed lines represent the speedup required by the GPU executions to be more cost-effective compared to the DuckDB CPU baseline.
(a) Query time breakdown for tensor operators on CPU
(b) Query time breakdown for tensor operators on GPU
Figure 6. Query time breakdown for tensor operators for selected TPC-H queries at scale factor 10.
Figure 7. GPU utilization breakdown for selected TPC-H queries at scale factor 10. Utilization varies by query. Runtime is the time spent in scheduling the kernels.

6.4. Cost/Performance Trade-off

In this section, we provide a cost/performance analysis of TQP on GPU compared to a CPU-only baseline. Specifically, we select a general-purpose (CPU-only) VM in Azure with a cost similar to the cheapest VM equipped with a GPU (NC4as_T4_v3), and with a similar main memory size. Following these constraints, we picked a D2ds_v5 with 8 CPU cores and 32GB of memory. Then we compare the performance of DuckDB on the D2ds_v5 with TQP on (1) NC4as_T4_v3 (with an NVIDIA T4 GPU, about 15% more expensive than the CPU-only machine), (2) NC6s_v2 (with the NVIDIA P100 used in the previous experiments, around 4.6× more expensive than the CPU-only VM), and (3) NC6s_v3 (with an NVIDIA V100, around 6.6× more expensive than the CPU-only VM). For each GPU VM type, we show the speedup required to be more cost-effective than the DuckDB baseline. That is, the speedup provided by TQP has to be more than 1.15× for the T4 to justify the cost increase of the T4 VM compared to the DuckDB CPU baseline, more than 4.6× for the P100, and more than 6.6× for the V100. The results for scale factor 10 are shown in Figure 5 for a few representative queries. As shown, GPU execution is cost-effective compared to the CPU-only machine for 5 out of the 6 queries in the figure (14 of 20 supported queries in the full TPC-H) for the T4, 4 of 6 (8 of 20 in the full TPC-H) for the P100, and 5 of 6 (8 of 20 in the full TPC-H) for the V100 GPU.
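The break-even rule used here follows from a simple argument: the cost of a query is the VM price per unit of time multiplied by the query time, so a GPU VM that costs r times more and runs the query s times faster has a relative cost of r / s, which is below 1 exactly when s exceeds r. The sketch below encodes this with the price ratios stated above; the function names and the example speedup are hypothetical and are not measurements from the paper.

```python
# Price of each GPU VM relative to the CPU-only VM, as stated in the text.
PRICE_RATIO = {"T4": 1.15, "P100": 4.6, "V100": 6.6}

def relative_cost(gpu, speedup):
    """Cost of one query on the GPU VM divided by its cost on the CPU-only VM."""
    return PRICE_RATIO[gpu] / speedup

def is_cost_effective(gpu, speedup):
    """True when the speedup over the CPU baseline exceeds the price ratio."""
    return relative_cost(gpu, speedup) < 1.0

# Hypothetical speedup over the CPU baseline, for illustration only.
example_speedup = 5.0
verdicts = {gpu: is_cost_effective(gpu, example_speedup) for gpu in PRICE_RATIO}
```

With a 5× speedup, the T4 and P100 VMs would be cheaper per query than the CPU-only VM, while the V100 VM would not, since 5 is below its break-even ratio of 6.6.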

6.5. Performance Breakdown

In this experiment, we show the major contributing factors to the query execution time. TQP is integrated with TensorBoard (ten, 2020), which provides performance breakdowns and makes it easy to spot bottlenecks (Yuki Asada, 2022). We start by looking into which tensor operators are responsible for the majority of the execution time. Figures 6(a) and 6(b) show the breakdown for a few selected queries on CPU and GPU, respectively. Interestingly, even though TQP uses the same algorithms on both CPU and GPU, the same query can show different operator contributions. For example, for Q1 on CPU the majority of the time is spent on computing the unique elements, while on GPU the majority of the time is spent on scatter_add. This is because the quality of the operator implementations is different for CPU and GPU. Across queries, the majority of the time is also spent on different operators on CPU and on GPU. On CPU, most queries are bound by unique operators, masked_select, and indexing; on GPU, most of the time is spent on sorting, unique, and nonzero. These observations suggest that: (1) the quality of kernels differs between CPU and GPU, e.g., after further investigation, we find that the GPU implementation of scatter_add is suboptimal, and nonzero requires host/device synchronization (non, 2022) (however, we believe that over time the community will fix such performance issues); and (2) it might be worth investigating backend-aware tensor algorithms.

Finally, we report the GPU utilization for the same set of queries in Figure 7. As we can see, each query has different utilization characteristics. For instance, Q1 contains complex aggregation and it spends 87% of the time on kernel execution; conversely, Q6 and Q14 are simple queries and most of the time is spent allocating GPU memory. Finally, Q2 spends a considerable amount of time in generating the output on CPU.

6.6. Hand-Optimized Plans

Setup           Query   Best Baseline        TQP Hand-Opt.
                                             Torch    JIT      TVM
CPU (1 core)    Q1      6.54 (DuckDB)        5.97     6.89     N/A
CPU (1 core)    Q6      1.5 (DuckDB)         0.87     1.18     0.24
CPU (1 core)    Q9      45.11 (DuckDB)       19.34    18.66    N/A
CPU (1 core)    Q14     1.7 (DuckDB)         0.52     0.49     0.47
CPU (6 cores)   Q1      1.1 (DuckDB)         4.68     5.17     N/A
CPU (6 cores)   Q6      0.25 (DuckDB)        0.66     0.71     0.12
CPU (6 cores)   Q9      7.75 (DuckDB)        14.59    13.83    N/A
CPU (6 cores)   Q14     0.33 (DuckDB)        0.12     0.10     0.16
GPU             Q1      0.17 (OmnisciDB)     0.73     0.73     N/A
GPU             Q6      0.02 (OmnisciDB)     0.01     0.01     0.06
GPU             Q9      0.14 (OmnisciDB)     0.45     0.44     N/A
GPU             Q14     0.12 (BlazingSQL)    0.01     0.01     0.30
Table 3. Query execution time (in seconds) on selected TPC-H queries (scale factor 10). TQP Hand-Opt. uses hand-optimized tensor programs. We use Torch, JIT, and TVM to refer to execution using PyTorch (interpreted), TorchScript (compiled), and TVM, respectively. Bold numbers highlight the best performance for the specific setup: CPU (1 core), CPU (6 cores), or GPU.

Next, we study whether TQP’s performance can be improved with a better optimizer able to generate better tensor programs. To understand this, we hand-optimize the tensor programs for a few selected queries similarly to what a reasonable optimizer with knowledge about cardinalities and tensor characteristics would do, e.g., avoid sorting (or computing unique) over already sorted (or unique) columns, and select better join implementations. The results are shown in Table 3, where we report the best baseline for each setting (CPU 1 and 6 cores, and GPU), and over three execution modes: interpreted PyTorch (Torch), compiled TorchScript (JIT), and compiled using TVM. TVM only supports Q6 and Q14.

If we focus on the CPU numbers first, TQP’s performance is comparable to or even better than DuckDB’s, whereas without the hand-optimized plans TQP was much slower than DuckDB in both single- and multi-core execution. TQP is now faster than DuckDB for all queries over 1 CPU core, and for two queries over 6 CPU cores. For some queries, TQP is faster than DuckDB by a large margin, e.g., for Q6, 1-core TVM execution is 6× faster. This is because TVM uses code generation and operator fusion to minimize intermediate data materialization across operators. When scaling to 6 cores, TQP scales well only for Q14, while DuckDB scales linearly. For the other queries, TQP’s query times improve by at most 2×. This again shows the limitations of PyTorch’s scalability on CPU, which cannot be overcome by using better tensor programs.

Finally, if we focus on the GPU performance, we see that OmniSciDB still has better performance for Q1 and Q9, although TQP’s query times on GPU improve by 3× and 4×, respectively, when using the hand-optimized plans. This is because TQP’s aggregate implementation heavily uses sorting, while OmniSciDB uses hash-based implementations.

6.7. Prediction Queries

We now investigate the performance benefits of using a unified runtime for queries mixing relational and ML operators. We use prediction queries as a use case, i.e., queries embedding a trained ML model performing predictions over some input data (Microsoft, 2021b). Recall that TQP natively supports predictions of any PyTorch model (e.g., NNs), and traditional ML models through its integration with Hummingbird. Here, we join the customer and orders tables in TPC-H (scale factor 10), and train a gradient boosting tree model (128 trees with a maximum depth of 8) over a mix of categorical (c_orderstatus) and numerical features (c_custkey, c_nationkey, c_acctbal, sum(o_totalprice)) after we apply one-hot encoding and feature scaling, respectively. We run a prediction query using the trained model over the query with two filter predicates added (c_mktsegment = ‘building’ and o_orderdate >= date ‘1993-10-01’). Note that this prediction query mixes ML operators (tree ensemble, one-hot encoding, scaling, and concatenation) with relational ones (join, aggregation, and filtering). We compare TQP with two baselines: one where the prediction query is executed over Spark (MLlib (Meng et al., 2016a) is used to build the model), and one where we use DuckDB for the relational part and scikit-learn (Pedregosa et al., 2011) for the ML part. [Footnote 7: Note that moving data from DuckDB to scikit-learn is zero-copy since DuckDB can directly return data in Pandas dataframe format (duc, 2022).] Since TQP subsumes Hummingbird, it is able to compile both the ML and the relational operators of the query into a unified plan executable on TCRs. Figure 8 shows the result. On a single CPU core, TQP is about 40% faster than Spark, while DuckDB+scikit-learn is about 7× faster than TQP. When enabling all cores, Spark and DuckDB scale much better than TQP, for the reasons described in Section 6.3. Finally, TQP is able to exploit GPU acceleration end-to-end, which brings a 9× improvement in query time compared to the best CPU baseline.

Figure 8. Query time on a query mixing ML prediction and relational operators. The number of cores for CPU systems is shown in parentheses. The x-axis is in symmetric log scale.
(a) End-to-end execution breakdown on CPU
(b) End-to-end execution breakdown on GPU
Figure 9. End-to-end breakdown (incl. all overheads, and w/o pipelining and caching) for selected queries at scale factor 10.

6.8. Overheads

Next, we evaluate the overheads of TQP for both CPU and GPU. The breakdown of the end-to-end execution with all overheads is shown in Figure 9. Note that: (1) data conversion is done once and many databases (e.g., BlazingSQL, OmniSciDB, Spark, SQL Server, etc.) require it; (2) TQP pipelines data movement (to the GPU) with query execution (non-blocking IO), while for this experiment we explicitly make data movement blocking; (3) the machine in this experiment uses PCIe 3, which is slower than the latest version, PCIe 5; (4) query compilation can be cached, but here we report the full query compilation time as the sum of the time for the frontend database to generate the physical plan and the time for TQP to generate the final executable tensor program.

If we focus first on the CPU side (Figure 9(a)), compilation and data conversion take the majority of the time only for simple queries (e.g., Q6), while for the other queries the majority of the time is spent on query execution. In the GPU case (Figure 9(b)), instead, except for Q2 and Q9, the majority of the time is spent on data operations (conversion and movement) and compilation. In practice, however, as described above, these overheads are hidden (e.g., data movement using pipelining) or are one-time overheads (data conversion and query compilation). Regarding query compilation, 90% of the time is spent on initializing the PyTorch models from the Spark plans, and we are currently investigating how to speed up this process. Finally, using TorchScript adds substantial compilation overheads since queries are traced using input samples.

Backend                   TCR / compilation stack   Query time (ms)
Intel UHD Graphics 630    TVM on Metal              62
AMD Radeon Pro 5300M      TVM on Metal              17
NVIDIA K80                PyTorch                   5
NVIDIA V100               PyTorch                   1
TPU                       PyTorch on XLA            25
Chrome                    ORT on WASM               1900
Table 4. Query time (in milliseconds) of TPC-H Query 6 (scale factor 1) using the hand-optimized plan over different hardware and software backends. The second column gives the TCR used as well as the compilation stack (when applicable).

6.9. Portability

To evaluate whether TQP can run on different hardware and software backends, we run TPC-H Query 6 with the hand-optimized plan on: (1) two integrated graphics cards, one from Intel and one from AMD; (2) two discrete GPUs from NVIDIA (K80 and V100: the former is one generation before the P100 GPU used for the experiments in the previous sections; the latter is one generation after); (3) a custom ASIC used for NN training and inference (TPU); and (4) a web browser. We use a scale factor of 1. The results are shown in Table 4. This experiment demonstrates the versatility of TQP. For the integrated GPUs, we use TVM to code-generate the query using Metal (Apple, 2022b). For the two discrete GPUs, we use vanilla PyTorch, while for the TPU we use the XLA backend for PyTorch (PyTorch, 2022a). [Footnote 8: Note that PyTorch/XLA does not support all the necessary tensor operations and the execution falls back to regular CPU for part of the query.] Finally, we are able to run the query in the browser by exporting it into the ONNX format and running it in Chrome using ONNX Runtime (ORT) for WebAssembly (WASM) (Microsoft, 2022b).

6.10. Engineering Effort

To demonstrate the minimal engineering effort required by TQP to run queries over different hardware, we compare the lines of code for a few relational operators (hash and sort-based joins, aggregation) across all evaluated systems. For each relational operator and each system, we use cloc (Danial, 2021) to count the lines of source code (excluding comment and blank lines) in the files containing the algorithmic functionality of the operator. This is admittedly a subjective process, but we believe the number of lines of code can roughly reflect the engineering effort required to implement relational operators in each system. Table 5 shows the results. Compared with the baselines, TQP requires significantly lower engineering effort: up to 10× less compared to CPU implementations, and 50× less compared to GPU ones. It is worth noting that TQP is able to target different hardware with the same implementation, so the engineering effort required for TQP to scale over different hardware is constant. The other baseline systems do not share this property. For instance, to run Spark on GPU (e.g., using RAPIDS (spa, 2020), the same backend as BlazingSQL), we would have to add the lines of code for the GPU implementation.

System              Hash Join   Sort-Based Join   Aggregation
TQP (Various HW)    148         182               104 (sort-based)
Spark (CPU)         706         1439              637 (sort-based)
DuckDB (CPU)        1415        877               1466 (hash-based)
BlazingSQL (GPU)    1628        N/A               1389 (hash-based)
OmnisciDB (GPU)     10141       N/A               2416 (hash-based)
Table 5. Lines of source code for implementing relational operators, excluding blank lines and comments.
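As an illustration of why operators built on tensor routines can stay compact, here is a minimal sketch of a sort-based group-by SUM in PyTorch. It is an assumption-laden example, not TQP's actual implementation: the function name `sort_based_group_sum` and the sample data are hypothetical, and only a single integer grouping key is handled.

```python
import torch


def sort_based_group_sum(keys, values):
    """Group `values` by `keys` and sum each group, using sorting."""
    # 1. Sort rows by key so that equal keys become adjacent.
    sorted_keys, perm = torch.sort(keys)
    sorted_vals = values[perm]
    # 2. Collapse runs of equal keys; `group_ids` maps each row to its group.
    group_keys, group_ids = torch.unique_consecutive(sorted_keys, return_inverse=True)
    # 3. Segmented sum: accumulate every row into the slot of its group.
    sums = torch.zeros(group_keys.shape[0], dtype=values.dtype, device=values.device)
    sums.index_add_(0, group_ids, sorted_vals)
    return group_keys, sums


keys = torch.tensor([3, 1, 3, 2, 1])
values = torch.tensor([10.0, 20.0, 30.0, 40.0, 50.0])
group_keys, sums = sort_based_group_sum(keys, values)
print(group_keys.tolist(), sums.tolist())
```

The sketch contains no explicit loops and no device-specific code; sorting, run detection, and accumulation are all delegated to the runtime's own kernels.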

7. Related Work

Common representation for relational and ML workloads. Since the ’90s (Netz et al., 2000), there have been many works trying to integrate relational queries with data science and ML workloads (Kumar et al., 2017; Hellerstein et al., 2012; Feng et al., 2012; Schüle et al., 2019, 2019; Meng et al., 2016b; Schelter et al., 2016; Prasad et al., 2015; Karanasos et al., 2020; Buono et al., 2021; Syed and Vassilvitskii, 2017; tid, 2020; Wang et al., 2020; Kotlyar et al., 1997; Petersohn et al., 2020; Sinthong and Carey, 2021; Jankov et al., 2019; Yuan et al., 2021; Boehm et al., 2016; Hutchison et al., 2017; Damme et al., 2022; Boehm et al., 2019; Microsoft, 2021a; Kernert et al., 2014; Redshift ML, 2021). To our knowledge, we are the first to propose executing relational queries over TCRs. Earlier attempts tried to run a few relational operators on the TPU using TensorFlow (Holanda and Mühleisen, 2019). TQP is orthogonal to previous efforts trying to optimize relational and tensor algebra (e.g., (Hutchison et al., 2017; Wang et al., 2020)), and we believe TQP can leverage them to further improve its performance. An analysis of matrix query languages can be found in (Geerts et al., 2021). Here, we focus on TCRs’ tensor interface, which is more flexible than a linear algebra API.

SciDB (Stonebraker et al., 2011; Rogers et al., 2010) is a database that uses arrays as its base data representation. TensorDB (Kim and Candan, 2014) further proposes support for tensor data and decomposition operations inside databases. SciDB, TensorDB, and TQP all suggest representing data in a format closer to the one used in data science and ML. However, TQP further exploits TCRs to run both relational and ML workloads on hardware accelerators.

GPUs and hardware accelerators. Several systems have explored running relational queries over GPUs (Shanbhag et al., 2020; Lee et al., 2021; Yuan et al., 2013; Lutz et al., 2020; Paul et al., 2020; Paul et al., 2016; Power et al., 2015). We refer readers to (Paul et al., 2021) for a recent survey. However, most of them focus on microbenchmarks, while, to our knowledge, only RateupDB (Lee et al., 2021) is able to support the full TPC-H benchmark. TQP is able to run the TPC-H benchmark on both CPU and GPU, thanks to TCRs’ flexibility to support different hardware backends. TCUDB (Hu et al., 2021) suggests using the Tensor Core Unit (TCU) of GPUs for accelerating relational operators. TCUDB requires an expensive transformation from tables to matrices, and it also uses low-level CUDA kernels, while TQP takes advantage of the high-level tensor interface of TCRs. GPUs are the default hardware for running neural network models. However, there has recently been a rise in custom ASICs (Jouppi et al., 2017; gra, 2020; cer, 2020a; sam, 2020; Apple, 2022a) purposely built for ML workloads. With TQP, we propose a solution that allows relational queries to run on any hardware supported by TCRs, since many ASICs (ipu, 2020; cer, 2020b; Jouppi et al., 2017) provide high-level interfaces directly through TCRs or are targetable through tensor compilers (Chen et al., 2018; Lattner et al., 2020).

Query processing over heterogeneous hardware. Several recent works have started to explore query execution over heterogeneous hardware, such as CPU-GPU co-execution (Rosenfeld et al., 2022; Wang et al., 2021; Breß et al., 2018; Chrysogelos et al., 2019; Pirk et al., 2016; Heimel et al., 2013; Rossbach et al., 2013; Funke et al., 2018). Many of them rely on OpenCL (ope, 2020) to target different hardware. However, targeting a common language (or, similarly, a generic compiler such as MLIR (Lattner et al., 2020)) requires non-trivial engineering effort, since each device requires proper tuning (Pirk et al., 2016), algorithms, and data structures (as well as abstractions/dialects in the MLIR case). Conversely, TQP is able to natively run on any hardware supported by TCRs, and it reuses TCRs’ tensor operation implementations and compilation stacks. Currently, the user has to specify which fragment of the query should run on which hardware, but we are exploring how to automate this and enable co-execution.

A recent trend suggests splitting relational operators into smaller functions that can be easily composed and efficiently dispatched over heterogeneous hardware (Bandle and Giceva, 2021; Vu, 2019; Koutsoukos et al., 2021a). TQP fits this trend, with tensor operations acting as the sub-components.

Vectorized execution, query compilation, and columnar databases. MonetDB/X100 (Boncz et al., 2005) pioneered the vectorized execution model as well as the columnar data layout (Stonebraker et al., 2005). TQP follows a similar design, where data is stored in a columnar format with virtual IDs (Abadi et al., 2013b), but each column is represented as a tensor. Recent works, such as HyPer (Neumann, 2011) and others (Shaikhha et al., 2016; Menon et al., 2017; Neumann and Freitag, 2020), have focused on query compilation. Nevertheless, since (1) there is no clear winner between query compilation and vectorized execution (Kersten et al., 2018); (2) many industry-grade systems use vectorized execution because it is easier to debug and profile (Behm, 2022); and (3) compiled systems are starting to move to vectorized execution (e.g., Spark with Photon), we evaluate TQP against a state-of-the-art vectorized engine, DuckDB (Raasveldt and Mühleisen, 2020).

On the ML system side, TensorFlow initially embraced compiled (graph) execution (Abadi et al., 2016), while PyTorch pioneered interpreted (eager) execution (Paszke et al., 2017). Compilers (Chen et al., 2018; xla, 2020; tor, 2022; Lattner et al., 2020; Kjolstad et al., 2017; Fegade et al., 2021a, b) and optimization techniques (Jia et al., 2019a, b; Jeong et al., 2019) for neural networks are hot topics in the MLSys community. With TQP, our goal is to ride the wave of innovation in this domain. For TQP, interpreted vs. compiled execution is just another point in the query optimization space, since TCRs allow switching seamlessly between them.
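The switch between interpreted and compiled execution can be sketched in PyTorch as follows. This is an illustrative example only, not TQP's code: the function `filter_sum` and its inputs are hypothetical, and `torch.jit.trace` is used here as one of the ways PyTorch offers to obtain a TorchScript graph from an eager function.

```python
import torch


def filter_sum(price, discount):
    """A small query fragment: SUM(price * discount) WHERE discount >= 0.05."""
    mask = discount >= 0.05
    return (price * discount * mask).sum()


price = torch.tensor([100.0, 200.0, 300.0], dtype=torch.float64)
discount = torch.tensor([0.10, 0.01, 0.20], dtype=torch.float64)

# Interpreted (eager) execution: operations run one by one as Python executes.
eager_result = filter_sum(price, discount)

# Compiled (graph) execution: record the same operations into a TorchScript
# graph once, then run the recorded graph.
compiled_filter_sum = torch.jit.trace(filter_sum, (price, discount))
compiled_result = compiled_filter_sum(price, discount)

print(eager_result.item(), compiled_result.item())
```

The query fragment itself is unchanged between the two modes; only the way it is handed to the runtime differs, which is why the choice can be treated as an optimization decision rather than a reimplementation.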

8. Conclusion

We proposed TQP, the first system able to run relational queries on TCRs. TQP is able to take advantage of all the innovation poured into TCRs, as well as to run efficiently on any hardware devices supported by TCRs. Our experiments showed not only that TQP is capable of running the full TPC-H benchmark on TCRs, but also that TQP’s performance is comparable and often superior to that of specialized CPU and GPU query processing systems.


  • ten (2018) 2018. TensorFlow.
  • ten (2019) 2019. Tensor-RT.
  • bla (2020) 2020. BlazingSQL.
  • cer (2020a) 2020a. Cerebras.
  • cer (2020b) 2020b. Cerebras Software.
  • gra (2020) 2020. GraphCore.
  • omn (2020) 2020. OmnisciDB.
  • ope (2020) 2020. OpenCL.
  • pyt (2020) 2020. Pytorch Ecosystem.
  • ipu (2020) 2020. PyTorch Release for IPU.
  • sam (2020) 2020. Sambanova: Massive Models for Everyone.
  • spa (2020) 2020. Spark-RAPIDS.
  • ten (2020) 2020. TensorBoard.
  • xla (2020) 2020. Tensorflow XLA.
  • tid (2020) 2020. Tidypredict.
  • nvi (2021) 2021. GPU-Accelerated String Processing with RAPIDS.
  • vec (2021) 2021. Introducing Pandas UDF for PySpark.
  • eag (2022) 2022. Code with Eager Execution, Run with Graphs: Optimizing Your Code with RevNet as an Example. Retrieved February, 2022 from
  • dat (2022) 2022. Datasets & DataLoaders.
  • duc (2022) 2022. Efficient SQL on Pandas with DuckDB.
  • pyt (2022a) 2022a. Intel Extension for PyTorch.
  • pyt (2022b) 2022b. Introducing Accelerated PyTorch Training on Mac.
  • onn (2022) 2022. ONNX Runtime.
  • ddp (2022) 2022. PyTorch Distributed Overview.
  • pyt (2022c) 2022c. PyTorch for AMD ROCm Platform now available as Python package.
  • sub (2022) 2022. Substrait.
  • non (2022) 2022. torch.nonzero.
  • tor (2022) 2022. TorchScript Documentation.
  • A. Gholami and Keutzer (2021) A. Gholami, Z. Yao, S. Kim, M. W. Mahoney, and K. Keutzer. 2021. AI and memory wall. Berkeley. Retrieved January, 2022 from
  • Abadi et al. (2013a) Daniel Abadi, Peter Boncz, and Stavros Harizopoulos. 2013a. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., Hanover, MA, USA.
  • Abadi et al. (2013b) Daniel Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013b. The Design and Implementation of Modern Column-Oriented Database Systems.
  • Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation. 265–283.
  • (2021) 2021. Redshift ML.
  • AMD (2022) AMD. 2022. ROCm. Retrieved January, 2022 from
  • Apple (2022a) Apple. 2022a. Apple Neural Engine. Retrieved January, 2022 from
  • Apple (2022b) Apple. 2022b. Metal. Retrieved January, 2022 from
  • AWS (2022) AWS. 2022. Inferentia. Retrieved January, 2022 from
  • Bandle and Giceva (2021) Maximilian Bandle and Jana Giceva. 2021. Database Technology for the Masses: Sub-Operators as First-Class Entities. Proc. VLDB Endow. 14, 11 (2021), 2483–2490.
  • Begoli et al. (2018) Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 221–230.
  • Behm (2022) Alexander Behm. 2022. Photon: A High-Performance Query Engine for the Lakehouse. In CIDR.
  • Boehm et al. (2019) Matthias Boehm, Iulian Antonov, Mark Dokter, Robert Ginthör, Kevin Innerebner, Florijan Klezin, Stefanie N. Lindstaedt, Arnab Phani, and Benjamin Rath. 2019. SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. CoRR abs/1909.02976 (2019).
  • Boehm et al. (2016) Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (sep 2016), 1425–1436.
  • Boncz et al. (2005) Peter A. Boncz, Marcin Zukowski, and Niels Nes. 2005. MonetDB/X100: Hyper-Pipelining Query Execution. In CIDR. 225–237.
  • Breß et al. (2018) Sebastian Breß, Bastian Köcher, Henning Funke, Steffen Zeuch, Tilmann Rabl, and Volker Markl. 2018. Generating Custom Code for Efficient Query Execution on Heterogeneous Processors. The VLDB Journal 27, 6 (2018), 797–822.
  • Buono et al. (2021) Francesco Del Buono, Matteo Paganelli, Paolo Sottovia, Matteo Interlandi, and Francesco Guerra. 2021. Transforming ML Predictive Pipelines into SQL with MASQ. In SIGMOD ’21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021. ACM, 2696–2700.
  • Chen et al. (2018) Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. 2018. TVM: An Automated End-to-end Optimizing Compiler for Deep Learning. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation.
  • Chrysogelos et al. (2019) Periklis Chrysogelos, Manos Karpathiotakis, Raja Appuswamy, and Anastasia Ailamaki. 2019. HetExchange: Encapsulating Heterogeneous CPU-GPU Parallelism in JIT Compiled Engines. Proc. VLDB Endow. 12, 5 (2019), 544–556.
  • Chung et al. (2018) Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Mahdi Ghandi, Daniel Lo, Steve Reinhardt, Shlomi Alkalay, Hari Angepat, Derek Chiou, Alessandro Forin, Doug Burger, Lisa Woods, Gabriel Weisz, Michael Haselman, and Dan Zhang. 2018. Serving DNNs in Real Time at Datacenter Scale with Project Brainwave. IEEE Micro 38 (2018), 8–20.
  • Council (2018) Transaction Processing Performance Council. 2018. TPC Benchmark H. Retrieved January, 2022 from
  • Damme et al. (2022) Patrick Damme, Marius Birkenbach, Constantinos Bitsakos, Matthias Boehm, Philippe Bonnet, Florina Ciorba, Mark Dokter, Pawel Dowgiallo, Ahmed Eleliemy, Christian Faerber, Georgios Goumas, Dirk Habich, Niclas Hedam, Marlies Hofer, Wenjun Huang, Kevin Innerebner, Vasileios Karakostas, Roman Kern, Tomaž Kosar, Alexander Krause, Daniel Krems, Andreas Laber, Wolfgang Lehner, Eric Mier, Tilmann Rabl, Piotr Ratuszniak, Pedro Silva, Nikolai Skuppin, Andreas Starzacher, Benjamin Steinwender, Ilin Tolovski, Pinar Tözün, Wojciech Ulatowski, Yuanyuan Wang, Izajasz Wrosz, Aleš Zamuda, Ce Zhang, and Xiao Xiang Zhu. 2022. DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines. In 12th Annual Conference on Innovative Data Systems Research (CIDR ’22).
  • Danial (2021) Albert Danial. 2021. cloc: v1.92.
  • DeVito (2019) Zachary DeVito. 2019. TorchScript: Optimized Execution of PyTorch Programs. Retrieved January, 2022 from
  • Fegade et al. (2021a) Pratik Fegade, Tianqi Chen, Phillip Gibbons, and Todd Mowry. 2021a. Cortex: A Compiler for Recursive Deep Learning Models. In Proceedings of Machine Learning and Systems, Vol. 3. 38–54.
  • Fegade et al. (2021b) Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, and Todd C. Mowry. 2021b. The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding. CoRR abs/2110.10221 (2021).
  • Feng et al. (2012) Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. 2012. Towards a Unified Architecture for In-RDBMS Analytics. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 325–336.
  • Foundation (2020) .NET Foundation. 2020. TorchSharp - PyTorch .NET bindings. Retrieved February, 2022 from
  • Funke et al. (2018) Henning Funke, Sebastian Breß, Stefan Noll, Volker Markl, and Jens Teubner. 2018. Pipelined Query Processing in Coprocessor Environments. In Proceedings of the 2018 International Conference on Management of Data. ACM, New York, NY, USA, 1603–1618.
  • Geerts et al. (2021) Floris Geerts, Thomas Muñoz, Cristian Riveros, Jan Van den Bussche, and Domagoj Vrgoč. 2021. Matrix Query Languages. SIGMOD Rec. 50, 3 (2021), 6–19.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT press.
  • Google (2021) Google. 2021. Improved On-Device ML on Pixel 6, with Neural Architecture Search. Retrieved January, 2021 from
  • Habana (2022) Habana. 2022. Habana. Retrieved January, 2022 from
  • Heimel et al. (2013) Max Heimel, Michael Saecker, Holger Pirk, Stefan Manegold, and Volker Markl. 2013. Hardware-Oblivious Parallelism for in-Memory Column-Stores. Proc. VLDB Endow. 6, 9 (2013), 709–720.
  • Hellerstein et al. (2012) Joseph M. Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. 2012. The MADlib Analytics Library: Or MAD Skills, the SQL. Proc. VLDB Endow. 5, 12 (2012), 1700–1711.
  • Holanda and Mühleisen (2019) Pedro Holanda and Hannes Mühleisen. 2019. Relational Queries with a Tensor Processing Unit. In Proceedings of the 15th International Workshop on Data Management on New Hardware. ACM, New York, NY, USA, Article 19, 3 pages.
  • Hu et al. (2021) Yu-Ching Hu, Yuliang Li, and Hung-Wei Tseng. 2021. TCUDB: Accelerating Database with Tensor Processors. CoRR abs/2112.07552 (2021).
  • Hutchison et al. (2017) Dylan Hutchison, Bill Howe, and Dan Suciu. 2017. LaraDB. Proceedings of the 4th Algorithms and Systems on MapReduce and Beyond - BeyondMR’17 (2017).
  • Jankov et al. (2019) Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2019. Declarative Recursive Computation on an RDBMS: Or, Why You Should Use a Database for Distributed Machine Learning. Proc. VLDB Endow. 12, 7 (2019), 822–835.
  • Jeong et al. (2019) Eunji Jeong, Sungwoo Cho, Gyeong-In Yu, Joo Seong Jeong, Dongjin Shin, and Byung-Gon Chun. 2019. JANUS: Fast and Flexible Deep Learning via Symbolic Graph Execution of Imperative Programs. In 16th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2019, Boston, MA, February 26-28, 2019. USENIX Association, 453–468.
  • Jia et al. (2019a) Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019a. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. ACM, New York, NY, USA, 47–62.
  • Jia et al. (2019b) Zhihao Jia, James Thomas, Todd Warszawski, Mingyu Gao, Matei Zaharia, and Alex Aiken. 2019b. Optimizing DNN Computation with Relaxed Graph Substitutions. In Proceedings of Machine Learning and Systems, Vol. 1. 27–39.
  • Jouppi et al. (2017) Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, Richard C. Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. CoRR abs/1704.04760 (2017).
  • Karanasos et al. (2020) Konstantinos Karanasos, Matteo Interlandi, Fotis Psallidas, Rathijit Sen, Kwanghyun Park, Ivan Popivanov, Doris Xin, Supun Nakandala, Subru Krishnan, Markus Weimer, Yuan Yu, Raghu Ramakrishnan, and Carlo Curino. 2020. Extending Relational Query Processing with ML Inference. In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings.
  • Kernert et al. (2014) David Kernert, Frank Köhler, and Wolfgang Lehner. 2014. SLACID - Sparse Linear Algebra in a Column-Oriented in-Memory Database System. In Proceedings of the 26th International Conference on Scientific and Statistical Database Management. ACM, New York, NY, USA, Article 11, 12 pages.
  • Kersten et al. (2018) Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter A. Boncz. 2018. Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask. Proc. VLDB Endow. 11, 13 (2018), 2209–2222.
  • Kim et al. (2009) Changkyu Kim, Tim Kaldewey, Victor W. Lee, Eric Sedlar, Anthony D. Nguyen, Nadathur Satish, Jatin Chhugani, Andrea Di Blas, and Pradeep Dubey. 2009. Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs. Proc. VLDB Endow. 2, 2 (2009), 1378–1389.
  • Kim and Candan (2014) Mijung Kim and K. Selçuk Candan. 2014. TensorDB: In-Database Tensor Manipulation with Tensor-Relational Query Plans. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. ACM, New York, NY, USA, 2039–2041.
  • Kjolstad et al. (2017) Fredrik Kjolstad, Shoaib Kamil, Stephen Chou, David Lugato, and Saman Amarasinghe. 2017. The Tensor Algebra Compiler. Proc. ACM Program. Lang. 1, Article 77 (2017), 29 pages.
  • Kotlyar et al. (1997) Vladimir Kotlyar, Keshav Pingali, and Paul Stodghill. 1997. A Relational Approach to the Compilation of Sparse Matrix Programs. Technical Report. USA.
  • Koutsoukos et al. (2021a) Dimitrios Koutsoukos, Ingo Müller, Renato Marroquín, Ana Klimovic, and Gustavo Alonso. 2021a. Modularis: Modular Relational Analytics over Heterogeneous Distributed Platforms. Proc. VLDB Endow. 14, 13 (2021), 3308–3321.
  • Koutsoukos et al. (2021b) Dimitrios Koutsoukos, Supun Nakandala, Konstantinos Karanasos, Karla Saur, Gustavo Alonso, and Matteo Interlandi. 2021b. Tensors: An abstraction for general data processing. Proc. VLDB Endow. 14, 10 (2021), 1797–1804.
  • Kumar et al. (2017) Arun Kumar, Matthias Boehm, and Jun Yang. 2017. Data Management in Machine Learning: Challenges, Techniques, and Systems. In Proceedings of the 2017 ACM International Conference on Management of Data. ACM, New York, NY, USA, 1717–1722.
  • Lattner et al. (2020) Chris Lattner, Jacques Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Tatiana Shpeisman, Andy Davis, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore’s Law. (2020). arXiv:2002.11054
  • Lee et al. (2021) Rubao Lee, Minghong Zhou, Chi Li, Shenggang Hu, Jianping Teng, Dongyang Li, and Xiaodong Zhang. 2021. The Art of Balance: A RateupDB™ Experience of Building a CPU/GPU Hybrid Database Product. Proc. VLDB Endow. 14, 12 (2021), 2999–3013.
  • Leis et al. (2014) Viktor Leis, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2014. Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age. ACM, New York, NY, USA, 743–754.
  • Li et al. (2020) Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, and Soumith Chintala. 2020. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. Proc. VLDB Endow. 13, 12 (2020).
  • Li and Ross (1999) Zhe Li and Kenneth A. Ross. 1999. Fast Joins Using Join Indices. The VLDB Journal 8, 1 (apr 1999), 1–24.
  • Lutz et al. (2020) Clemens Lutz, Sebastian Breß, Steffen Zeuch, Tilmann Rabl, and Volker Markl. 2020. Pump Up the Volume: Processing Large Data on GPUs with Fast Interconnects. ACM, New York, NY, USA, 1633–1649.
  • Mazare (2020) Laurent Mazare. 2020. PyTorch Rust bindings. Retrieved February, 2022 from
  • Meng et al. (2016a) Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2016a. MLlib: Machine Learning in Apache Spark. Journal of Machine Learning Research 17, 34 (2016), 1–7.
  • Meng et al. (2016b) Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, and Ameet Talwalkar. 2016b. MLlib: Machine Learning in Apache Spark. J. Mach. Learn. Res. 17, 1 (2016), 1235–1241.
  • Menon et al. (2017) Prashanth Menon, Todd C. Mowry, and Andrew Pavlo. 2017. Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together at Last. Proc. VLDB Endow. 11, 1 (2017), 1–13.
  • Microsoft (2021a) Microsoft. 2021a. PREDICT in T-SQL.
  • Microsoft (2021b) Microsoft. 2021b. Tutorial: Score machine learning models with PREDICT in serverless Apache Spark pools. Retrieved January, 2022 from
  • Microsoft (2022a) Microsoft. 2022a. Hummingbird. Retrieved January, 2022 from
  • Microsoft (2022b) Microsoft. 2022b. ONNX Runtime Web—running your machine learning model in browser. Retrieved January, 2022 from
  • Nakandala et al. (2020) Supun Nakandala, Karla Saur, Gyeong-In Yu, Konstantinos Karanasos, Carlo Curino, Markus Weimer, and Matteo Interlandi. 2020. A Tensor Compiler for Unified Machine Learning Prediction Serving. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). USENIX Association, 899–917.
  • Netz et al. (2000) Amir Netz, Jeff Bernhardt, Usama Fayyad, and Surajit Chaudhuri. 2000. Integration of Data Mining and Relational Databases. In Proceedings of the 26th International Conference on Very Large Databases. VLDB Endowment.
  • Neumann (2011) Thomas Neumann. 2011. Efficiently Compiling Efficient Query Plans for Modern Hardware. Proc. VLDB Endow. 4, 9 (2011), 539–550.
  • Neumann and Freitag (2020) Thomas Neumann and Michael J. Freitag. 2020. Umbra: A Disk-Based System with In-Memory Performance. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings.
  • Ngom et al. (2021) Amadou Ngom, Prashanth Menon, Matthew Butrovich, Lin Ma, Wan Shen Lim, Todd C. Mowry, and Andrew Pavlo. 2021. Filter Representation in Vectorized Query Execution. ACM, New York, NY, USA.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
  • Paul et al. (2020) Johns Paul, Bingsheng He, Shengliang Lu, and Chiew Tong Lau. 2020. Improving Execution Efficiency of Just-in-Time Compilation Based Query Processing on GPUs. Proc. VLDB Endow. 14, 2 (2020), 202–214.
  • Paul et al. (2016) Johns Paul, Jiong He, and Bingsheng He. 2016. GPL: A GPU-Based Pipelined Query Processing Engine. In Proceedings of the 2016 International Conference on Management of Data. ACM, New York, NY, USA, 1935–1950.
  • Paul et al. (2021) Johns Paul, Shengliang Lu, and Bingsheng He. 2021. Database Systems on GPUs. Foundations and Trends® in Databases 11, 1 (2021), 1–108.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (2011), 2825–2830.
  • Petersohn et al. (2020) Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. 2020. Towards Scalable Dataframe Systems. Proc. VLDB Endow. 13, 12 (2020), 2033–2046.
  • Pirk et al. (2016) Holger Pirk, Oscar Moll, Matei Zaharia, and Sam Madden. 2016. Voodoo - a Vector Algebra for Portable Database Performance on Modern Hardware. Proc. VLDB Endow. 9, 14 (2016).
  • Polychroniou et al. (2015) Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD Vectorization for In-Memory Databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 1493–1508.
  • Polychroniou and Ross (2019) Orestis Polychroniou and Kenneth A. Ross. 2019. Towards Practical Vectorized Analytical Query Engines. In Proceedings of the 15th International Workshop on Data Management on New Hardware. ACM, New York, NY, USA, Article 10, 7 pages.
  • Power et al. (2015) Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, and David A. Wood. 2015. Toward GPUs Being Mainstream in Analytic Processing: An Initial Argument Using Simple Scan-Aggregate Queries. In Proceedings of the 11th International Workshop on Data Management on New Hardware. ACM, New York, NY, USA, Article 11, 8 pages.
  • Prasad et al. (2015) Shreya Prasad, Arash Fard, Vishrut Gupta, Jorge Martinez, Jeff LeFevre, Vincent Xu, Meichun Hsu, and Indrajit Roy. 2015. Large-Scale Predictive Analytics in Vertica: Fast Data Transfer, Distributed Model Creation, and In-Database Prediction. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 1657–1668.
  • PyTorch (2020) PyTorch. 2020. PyTorch Java bindings. Retrieved February, 2022 from
  • PyTorch (2022a) PyTorch. 2022a. PyTorch on XLA Devices. Retrieved January, 2022 from
  • PyTorch (2022b) PyTorch. 2022b. Torch.Tensor Documentation. Retrieved January, 2022 from
  • PyTorch (2022c) PyTorch. 2022c. Unique.cpp. Retrieved January, 2022 from
  • Raasveldt and Mühleisen (2020) Mark Raasveldt and Hannes Mühleisen. 2020. Data Management for Data Science - Towards Embedded Analytics. In 10th Conference on Innovative Data Systems Research, CIDR 2020, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings.
  • Raman et al. (2013) Vijayshankar Raman, Gopi Attaluri, Ronald Barber, Naresh Chainani, David Kalmuk, Vincent KulandaiSamy, Jens Leenstra, Sam Lightstone, Shaorong Liu, Guy M. Lohman, Tim Malkemus, Rene Mueller, Ippokratis Pandis, Berni Schiefer, David Sharpe, Richard Sidle, Adam Storm, and Liping Zhang. 2013. DB2 with BLU Acceleration: So Much More than Just a Column Store. Proc. VLDB Endow. 6, 11 (2013), 1080–1091.
  • Rogers et al. (2010) J Rogers, R Simakov, E Soroush, P Velikhov, M Balazinska, D DeWitt, B Heath, D Maier, S Madden, J Patel, et al. 2010. Overview of SciDB: Large scale array storage, processing and analysis. In 2010 International Conference on Management of Data, SIGMOD’10. 963–968.
  • Rosenfeld et al. (2022) Viktor Rosenfeld, Sebastian Breß, and Volker Markl. 2022. Query Processing on Heterogeneous CPU/GPU Systems. ACM Comput. Surv. 55, 1, Article 11 (2022), 38 pages.
  • Rossbach et al. (2013) Christopher J. Rossbach, Yuan Yu, Jon Currey, Jean-Philippe Martin, and Dennis Fetterly. 2013. Dandelion: a compiler and runtime for heterogeneous systems. In ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3-6, 2013. ACM, 49–68.
  • Răducanu et al. (2013) Bogdan Răducanu, Peter Boncz, and Marcin Zukowski. 2013. Micro Adaptivity in Vectorwise. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 1231–1242.
  • Schelter et al. (2016) Sebastian Schelter, Shannon Quinn, Suneel Marthi, and Andrew Musselman. 2016. Samsara: Declarative Machine Learning on Distributed Dataflow Systems.
  • Schüle et al. (2019) Maximilian Schüle, Matthias Bungeroth, Dimitri Vorona, Alfons Kemper, Stephan Günnemann, and Thomas Neumann. 2019. ML2SQL - Compiling a Declarative Machine Learning Language to SQL and Python. In Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019. 562–565.
  • Schüle et al. (2019) Maximilian Schüle, Frédéric Simonis, Thomas Heyenbrock, Alfons Kemper, Stephan Günnemann, and Thomas Neumann. 2019. In-Database Machine Learning: Gradient Descent and Tensor Algebra for Main Memory Database Systems. In BTW 2019. Gesellschaft für Informatik, Bonn, 247–266.
  • Shaikhha et al. (2016) Amir Shaikhha, Yannis Klonatos, Lionel Parreaux, Lewis Brown, Mohammad Dashti, and Christoph Koch. 2016. How to Architect a Query Compiler. In Proceedings of the 2016 International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’16). ACM, New York, NY, USA, 1907–1922.
  • Shanbhag et al. (2020) Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A Study of the Fundamental Performance Characteristics of GPUs and CPUs for Database Analytics. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 1617–1632.
  • Sinthong and Carey (2021) Phanwadee Sinthong and Michael J. Carey. 2021. PolyFrame: A Retargetable Query-Based Approach to Scaling Dataframes. Proc. VLDB Endow. 14, 11 (2021), 2296–2304.
  • Statista (2022a) Statista. 2022a. Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2025. (Jan. 2022).
  • Statista (2022b) Statista. 2022b. Worldwide AI hardware market revenues. (Jan. 2022).
  • Stonebraker et al. (2005) Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2005. C-store: A Column-oriented DBMS. In VLDB. 553–564.
  • Stonebraker et al. (2011) Michael Stonebraker, Paul Brown, Alex Poliakov, and Suchi Raman. 2011. The Architecture of SciDB. In Scientific and Statistical Database Management, Judith Bayard Cushing, James French, and Shawn Bowers (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1–16.
  • Syed and Vassilvitskii (2017) Umar Syed and Sergei Vassilvitskii. 2017. SQML: Large-Scale in-Database Machine Learning with Pure SQL. In Proceedings of the 2017 Symposium on Cloud Computing. ACM, New York, NY, USA, 659.
  • team (2022) OctoML AI team. 2022. TVM on M1 GPUs performance. (Feb. 2022).
  • Tesla (2022) Tesla. 2022. Tesla unveils chip to train A.I. models inside its data centers. Retrieved January, 2022 from
  • Theis and Wong (2017) Thomas N. Theis and H.-S. Philip Wong. 2017. The End of Moore’s Law: A New Beginning for Information Technology. Computing in Science & Engineering 19, 2 (2017), 41–50.
  • TVM (2022a) TVM. 2022a. Bring Your Own Codegen To TVM. Retrieved January, 2022 from
  • TVM (2022b) TVM. 2022b. Pass Infrastructure. Retrieved January, 2022 from
  • Vu (2019) Tin Vu. 2019. Deep Query Optimization. In Proceedings of the 2019 International Conference on Management of Data. ACM, New York, NY, USA, 1856–1858.
  • Wang et al. (2021) Dalin Wang, Feng Zhang, Weitao Wan, Hourun Li, and Xiaoyong Du. 2021. FineQuery: Fine-Grained Query Processing on CPU-GPU Integrated Architectures. In 2021 IEEE International Conference on Cluster Computing. 355–365.
  • Wang et al. (2020) Yisu Remy Wang, Shana Hutchison, Jonathan Leang, Bill Howe, and Dan Suciu. 2020. SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra. Proc. VLDB Endow. 13, 12 (2020), 1919–1932.
  • Xilinx (2022) Xilinx. 2022. Xilinx AI Engine Technology. Retrieved January, 2022 from
  • Yuan et al. (2021) Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, and Chris Jermaine. 2021. Tensor Relational Algebra for Distributed Machine Learning System Design. Proc. VLDB Endow. 14, 8 (2021), 1338–1350.
  • Yuan et al. (2013) Yuan Yuan, Rubao Lee, and Xiaodong Zhang. 2013. The Yin and Yang of Processing Data Warehousing Queries on GPU Devices. Proc. VLDB Endow. 6, 10 (2013), 817–828.
  • Yuki Asada (2022) Yuki Asada, Victor Fu, Apurva Gandhi, Advitya Gemawat, Lihao Zhang, Vivek Gupta, Ehi Nosakhare, Dalitso Banda, Rathijit Sen, and Matteo Interlandi. 2022. Share the Tensor Tea: How Databases can Leverage the Machine Learning Ecosystem. To appear at VLDB (2022).
  • Zaharia et al. (2012) Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In NSDI 2012.
  • Zhou and Ross (2002) Jingren Zhou and Kenneth A. Ross. 2002. Implementing Database Operations Using SIMD Instructions. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. ACM, New York, NY, USA, 145–156.