DIRE: A Neural Approach to Decompiled Identifier Naming

09/19/2019
by   Jeremy Lacomis, et al.
0

The decompiler is one of the most common tools for examining binaries without corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Decompilers can reconstruct much of the information that is lost during the compilation process (e.g., structure and type information). Unfortunately, they do not reconstruct semantically meaningful variable names, which are known to increase code understandability. We propose the Decompiled Identifier Renaming Engine (DIRE), a novel probabilistic technique for variable name recovery that uses both lexical and structural information recovered by the decompiler. We also present a technique for generating corpora suitable for training and evaluating models of decompiled code renaming, which we use to create a corpus of 164,632 unique x86-64 binaries generated from C projects mined from GitHub. Our results show that on this corpus DIRE can predict variable names identical to the names in the original source code up to 74.3

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

03/23/2021

Variable Name Recovery in Decompiled Binary Code using Constrained Masked Language Modeling

Decompilation is the procedure of transforming binary programs into a hi...
08/13/2021

Augmenting Decompiler Output with Learned Variable Names and Types

A common tool used by security professionals for reverse-engineering bin...
06/08/2019

Recovering Variable Names for Minified Code with Usage Contexts

In modern Web technology, JavaScript (JS) code plays an important role. ...
04/16/2020

Deep Generation of Coq Lemma Names Using Elaborated Terms

Coding conventions for naming, spacing, and other essentially stylistic ...
09/01/2021

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

This paper presents an ensemble part-of-speech tagging approach for sour...
03/26/2018

code2vec: Learning Distributed Representations of Code

We present a neural model for representing snippets of code as continuou...
03/01/2021

Roosterize: Suggesting Lemma Names for Coq Verification Projects Using Deep Learning

Naming conventions are an important concern in large verification projec...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.