Inferring Drop-in Binary Parsers from Program Executions

04/19/2021
by Thurston H. Y. Dang, et al.

We present BIEBER (Byte-IdEntical Binary parsER), the first system to model and regenerate a full working parser from instrumented program executions. To achieve this, BIEBER exploits the regularity (e.g., header fields and array-like data structures) that is commonly found in file formats. Key generalization steps derive strided loops that parse input file data and rewrite concrete loop bounds with expressions over input file header bytes. These steps enable BIEBER to generalize parses of specific input files to obtain parsers that operate over input files of arbitrary size. BIEBER also incrementally and efficiently infers a decision tree that reads file header bytes to route input files of different types to inferred parsers of the appropriate type. The inferred parsers and decision tree are expressed in an IR; separate backends (C and Perl in our prototype) can translate the IR into the same language as the original program (for a safer drop-in replacement), or automatically port to a different language. An empirical evaluation shows that BIEBER can successfully regenerate parsers for six file formats (waveform audio [1654 files], MT76x0 .BIN firmware containers [5 files], OS/2 1.x bitmap images [9 files], Windows 3.x bitmaps [9971 files], Windows 95/NT4 bitmaps [133 files], and Windows 98/2000 bitmaps [859 files]), correctly parsing between 99.98% and 100% of the corresponding corpora. The regenerated parsers contain automatically inserted safety checks that eliminate common classes of errors such as memory errors. We find that BIEBER can help reverse-engineer file formats, because it automatically identifies predicates for the decision tree that relate to key semantics of the file format. We also discuss how BIEBER helped us detect and fix two new bugs in stb_image as well as independently rediscover and fix a known bug.
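
The two generalization steps named in the abstract (deriving a strided loop, then rewriting its concrete bound as an expression over header bytes) can be illustrated with a small sketch. This is not BIEBER's implementation or its IR. The toy file format (a two-byte magic `TF`, a two-byte little-endian record count, then two-byte records) and every function name below are assumptions made purely for illustration.

```python
from typing import List, Optional, Tuple


def infer_stride(offsets: List[int]) -> Optional[Tuple[int, int, int]]:
    """Return (start, stride, count) if the observed read offsets form an
    increasing arithmetic progression, i.e. they look like one strided loop."""
    if len(offsets) < 2:
        return None
    stride = offsets[1] - offsets[0]
    if stride <= 0:
        return None
    if any(b - a != stride for a, b in zip(offsets, offsets[1:])):
        return None
    return offsets[0], stride, len(offsets)


def find_header_bound(header: bytes, count: int) -> Optional[Tuple[int, int]]:
    """Search the header for a little-endian field whose value equals the
    concrete loop count. Returns (offset, width) of the first match.
    Assumption: one trace is enough here; a real system would confirm the
    candidate against many input files to rule out coincidental matches."""
    for width in (4, 2, 1):
        for off in range(len(header) - width + 1):
            if int.from_bytes(header[off:off + width], "little") == count:
                return off, width
    return None


def parse(data: bytes, start: int, stride: int,
          bound_off: int, bound_width: int) -> List[int]:
    """Generalized parser: the loop bound is read from the header instead of
    being hard-coded, so files of any size can be parsed."""
    count = int.from_bytes(data[bound_off:bound_off + bound_width], "little")
    end = start + count * stride
    if end > len(data):  # inserted safety check against out-of-bounds reads
        raise ValueError("file is shorter than its header claims")
    return [int.from_bytes(data[i:i + stride], "little")
            for i in range(start, end, stride)]


def make_file(values: List[int]) -> bytes:
    """Build a file in the hypothetical toy format."""
    body = b"".join(v.to_bytes(2, "little") for v in values)
    return b"TF" + len(values).to_bytes(2, "little") + body


# One observed execution: a 3-record file whose records were read at 4, 6, 8.
small = make_file([7, 8, 9])
start, stride, count = infer_stride([4, 6, 8])
bound_off, bound_width = find_header_bound(small[:4], count)

# The generalized parser now handles a larger file it has never seen.
big = make_file([10, 20, 30, 40, 50])
records = parse(big, start, stride, bound_off, bound_width)
```

The bounds check in `parse` stands in for the kind of automatically inserted safety check the abstract mentions: a truncated file raises an error instead of being read past its end.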


