Revisiting Deep Learning for Variable Type Recovery

04/07/2023
by Kevin Cao, et al.

Compiled binary executables are often the only available artifact in reverse engineering, malware analysis, and software systems maintenance. Unfortunately, the lack of semantic information such as variable types makes binaries difficult to comprehend. To improve the comprehensibility of binaries, researchers have recently used machine learning techniques to predict semantic information contained in the original source code. Chen et al. implemented DIRTY, a Transformer-based encoder-decoder architecture that augments decompiled code with variable names and types by leveraging decompiler output tokens and variable size information. They demonstrated a substantial increase in name and type extraction accuracy on Hex-Rays decompiler outputs compared to existing static analysis and AI-based techniques. We extend the original DIRTY results by re-training the DIRTY model on a dataset produced by the open-source Ghidra decompiler. Although Chen et al. concluded that Ghidra was not a suitable decompiler candidate because of its difficulty in parsing and incorporating DWARF symbols during analysis, we demonstrate that straightforward parsing of variable data generated by Ghidra yields similar retyping performance. We hope this work inspires further interest in and adoption of the Ghidra decompiler for use in research projects.


