Efficient Fuzz Testing for Apache Spark Using Framework Abstraction

03/08/2021
by Qian Zhang, et al.

Emerging data-intensive applications increasingly depend on data-intensive scalable computing (DISC) systems, such as Apache Spark, to process large data. Despite their popularity, DISC applications are hard to test. In recent years, fuzz testing has been remarkably successful; however, it is nontrivial to apply traditional fuzzing to big data analytics directly because (1) the long latency of DISC systems makes fuzzing impractical, and (2) conventional branch coverage is unlikely to distinguish application logic from the DISC framework implementation. We devise a novel fuzz testing tool called BigFuzz that automatically generates concrete data for an input Apache Spark program. The key idea of our approach is that we abstract the dataflow behavior of the DISC framework with executable specifications and design schema-aware mutations based on common error types in DISC applications. Our experiments show that, compared to random fuzzing, BigFuzz speeds up fuzzing time by 1477X, improves application code coverage by 271%, and detects more application errors. The demonstration video of BigFuzz is available at https://www.youtube.com/watch?v=YvYQISILQHs&feature=youtu.be.
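
The two ideas in the abstract can be illustrated with a minimal sketch. This is not BigFuzz's implementation: the schema, the names `map_spec`, `filter_spec`, `program`, `mutate`, and `fuzz`, and the four mutation kinds are all assumptions chosen for illustration. The `*_spec` functions stand in for framework operators as plain in-memory code, so each fuzzing trial avoids cluster latency, and `mutate` changes a row in ways that respect its column layout rather than flipping random bytes.

```python
import random

# Hypothetical schema for one comma-separated row: zip code, age, name.
SCHEMA = (int, int, str)


def map_spec(rows, fn):
    """Executable stand-in for a Spark-style map: a plain in-memory loop."""
    return [fn(row) for row in rows]


def filter_spec(rows, pred):
    """Executable stand-in for a Spark-style filter."""
    return [row for row in rows if pred(row)]


def parse(line):
    cols = line.split(",")
    return int(cols[0]), int(cols[1]), cols[2]


def program(lines):
    """Toy application under test: return the names of adults."""
    records = map_spec(lines, parse)
    adults = filter_spec(records, lambda r: r[1] >= 18)
    return map_spec(adults, lambda r: r[2])


def mutate(line, rng):
    """Apply one schema-aware mutation; assumes `line` matches SCHEMA."""
    cols = line.split(",")
    i = rng.randrange(len(cols))
    kind = rng.choice(["type", "empty", "drop", "delimiter"])
    if kind == "type":
        # Put text where the schema expects a number.
        j = rng.choice([k for k, t in enumerate(SCHEMA) if t is int])
        cols[j] = "not-a-number"
    elif kind == "empty":
        cols[i] = ""
    elif kind == "drop":
        del cols[i]
    else:
        # Re-join with a delimiter the parser does not expect.
        return ";".join(cols)
    return ",".join(cols)


def fuzz(seed, trials=200, rng_seed=0):
    """Run the program on mutated inputs; keep one failing input per error type."""
    rng = random.Random(rng_seed)
    failures = {}
    for _ in range(trials):
        candidate = mutate(seed, rng)
        try:
            program([candidate])
        except (ValueError, IndexError) as exc:
            failures.setdefault(type(exc).__name__, candidate)
    return failures
```

Calling `fuzz("90024,25,Ann")` returns a dictionary mapping each exception type observed to one input that triggered it. Because the operators are ordinary functions, coverage measured over `program` reflects application logic only, not framework internals.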

