Code Generation Techniques for Raw Data Processing

12/09/2017
by Xin Zhang, et al.

The motivation of the current study was to design an algorithm that speeds up query processing. Its key feature is generating code dynamically for a specific query. We present a code-generation technique applied to query processing on a raw file. The idea is to customize a query program for a given query and generate machine- and query-specific source code. The generated code is compiled by GCC, Clang or any other C/C++ compiler, and the compiled file is dynamically linked to the main program for further processing. Code generation reduces the cost of generalized query processing. It also avoids the overhead of conventional interpretation, which helps achieve high performance. Database Management Systems (DBMSs) do an excellent job in many aspects of big data, such as storage, indexing, and analysis. However, DBMSs typically format the entire data set and load it into their storage layer. This increases data-to-query time, which is the time it takes to convert data into a specific schema and persist them on disk. Ideally, a DBMS should adapt to the input data and extract only the column or columns associated with a given query, not the entire data. A query engine that works on a raw file can therefore reduce the cost of conventional general-purpose operators and avoid unnecessary procedures, such as fully scanning, tokenizing and parsing the whole data. In the current study, we introduce our code-generation approach for in-situ processing of raw files, which is based on the template approach and the hype approach. The approach minimizes the data-to-query time and achieves high performance for query processing. Our work has several benefits: reducing branches and instructions, unrolling loops, eliminating unnecessary data type checks, and optimizing the binary code with a compiler on the local machine.
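As a rough illustration of the idea of query-specific code generation on a raw file, the following minimal sketch builds source code specialized to one query and then compiles and loads it at run time. It is not the authors' implementation: the abstract describes generating C/C++ source, compiling it with GCC or Clang, and dynamically linking the result, whereas this sketch swaps in Python's built-in `compile` and `exec` so that it stays self-contained. The function names `generate_query_source`, `compile_query` and `run_query`, the comma-separated integer data, and the sample query are all assumptions made for illustration.

```python
import io


def generate_query_source(column_index, threshold):
    """Return source code specialized to one hypothetical query:
    SELECT SUM(c) FROM raw_file WHERE c > threshold,
    where c is the column at position column_index.

    The column position and the threshold are baked into the generated
    text as constants, so the generated loop carries no per-row lookup
    of query parameters and no generic type dispatch.
    """
    return (
        "def run_query(lines):\n"
        "    total = 0\n"
        "    for line in lines:\n"
        # Tokenize only as far as the needed column, not the whole row.
        f"        field = line.rstrip('\\n').split(',', {column_index + 1})[{column_index}]\n"
        # The column type is assumed to be integer, so no type check is emitted.
        "        value = int(field)\n"
        f"        if value > {threshold}:\n"
        "            total += value\n"
        "    return total\n"
    )


def compile_query(source):
    """Compile the generated source and return the specialized function.

    This stands in for 'compile with GCC/Clang, then dynamically link'.
    """
    namespace = {}
    exec(compile(source, "<generated_query>", "exec"), namespace)
    return namespace["run_query"]


# Assumed raw input: three comma-separated rows, queried in situ.
RAW = "1,10,a\n2,25,b\n3,40,c\n"

source = generate_query_source(column_index=1, threshold=20)
run_query = compile_query(source)
result = run_query(io.StringIO(RAW))
print(result)  # 65 (25 + 40; the row with value 10 fails the filter)
```

In this sketch the generated function never parses the first or third column into typed values, which mirrors the abstract's point about extracting only the columns a query needs instead of tokenizing and parsing the whole file.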


