Efficient and Accurate In-Database Machine Learning with SQL Code Generation in Python

04/07/2021
by   Michael Kaufmann, et al.

Following an analysis of the advantages of SQL-based Machine Learning (ML) and a short literature survey of the field, we describe a novel method for In-Database Machine Learning (IDBML). We contribute a process for SQL code generation in Python using template macros in Jinja2, as well as a prototype implementation of that process. We describe how our implementation computes multidimensional histogram (MDH) probability estimation in SQL. For this, we contribute and implement a novel discretization method called equal quantized rank binning (EQRB), and we also implement equal-width binning (EWB). Based on this, we provide data gathered in a benchmarking experiment for the quantitative empirical evaluation of our method and system using the Covertype dataset. We measured accuracy and computation time and compared them with state-of-the-art classification algorithms from scikit-learn. Using EWB, our multidimensional probability estimation was the fastest of all tested algorithms, while being only 1-2% less accurate than the best state-of-the-art methods found (decision trees and random forests). Our method was significantly more accurate than Naive Bayes, which assumes independent one-dimensional probabilities and/or densities. Our method was also significantly more accurate and faster than logistic regression. This motivates further research into accuracy improvement and into IDBML with SQL code generation for big data and larger-than-memory datasets.
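To make the idea concrete, here is a minimal sketch of template-based SQL generation for an equal-width-binned multidimensional histogram. It is not the authors' implementation: the standard library's string.Template stands in for Jinja2 macros so the sketch has no dependencies, SQLite stands in for the target database, and every name (samples, mdh_model, x, y, cls, generate_model_sql, predict) is a hypothetical choice made for illustration.

```python
import sqlite3
from string import Template

# Hypothetical templates. string.Template is a stand-in for Jinja2 macros.
# Equal-width binning: map [lo, hi] onto bin indices 0 .. bins-1.
BIN_EXPR = Template(
    "MIN(CAST((${col} - (${lo})) * ${bins} / ((${hi}) - (${lo})) AS INTEGER),"
    " ${bins} - 1) AS ${col}_bin"
)
MODEL_SQL = Template(
    "CREATE TABLE ${model} AS\n"
    "SELECT ${bin_cols}, ${label}, COUNT(*) AS n\n"
    "FROM ${table}\n"
    "GROUP BY ${group_cols}, ${label}"
)


def generate_model_sql(table, features, label, ranges, bins, model="mdh_model"):
    """Render SQL that counts training rows per (bin tuple, class label)."""
    bin_cols = ", ".join(
        BIN_EXPR.substitute(col=f, lo=ranges[f][0], hi=ranges[f][1], bins=bins)
        for f in features
    )
    group_cols = ", ".join(f"{f}_bin" for f in features)
    return MODEL_SQL.substitute(
        model=model, bin_cols=bin_cols, label=label,
        table=table, group_cols=group_cols,
    )


def predict(conn, model, label, bin_values):
    """Return [(label, estimated probability)] for one bin tuple, best first."""
    where = " AND ".join(f"{col} = ?" for col in bin_values)
    rows = conn.execute(
        f"SELECT {label}, n FROM {model} WHERE {where} ORDER BY n DESC, {label}",
        list(bin_values.values()),
    ).fetchall()
    total = sum(n for _, n in rows)
    # An empty histogram cell yields no estimate in this sketch.
    return [(lab, n / total) for lab, n in rows] if rows else []


# Toy data: two features in [0, 1] and a class label.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, y REAL, cls TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(0.1, 0.1, "a"), (0.2, 0.3, "a"), (0.15, 0.2, "b"),
     (0.9, 0.8, "b"), (0.8, 0.95, "b"), (1.0, 1.0, "b")],
)

sql = generate_model_sql(
    "samples", ["x", "y"], "cls",
    ranges={"x": (0.0, 1.0), "y": (0.0, 1.0)}, bins=2,
)
conn.execute(sql)  # training runs entirely inside the database

probs = predict(conn, "mdh_model", "cls", {"x_bin": 0, "y_bin": 0})
print(sql)
print(probs)
```

In this sketch, training is a single GROUP BY executed by the database, so the data never has to be loaded into Python memory. That is the property the abstract points to for larger-than-memory datasets. Identifiers are interpolated directly into the SQL text, so a real generator would need to validate or quote them.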
