Finding Inlined Functions in Optimized Binaries

by Toufique Ahmed, et al.

Much software, whether beneficent or malevolent, is distributed only as binaries, without source code. Absent source code, understanding a binary's behavior can be quite challenging, especially when it was compiled at higher levels of compiler optimization. These optimizations can transform comprehensible, "natural" source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools that "decompile" binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central optimization step in creating binaries is function inlining. Recovering these inlined functions from optimized binaries is an essential task that most state-of-the-art decompiler tools attempt but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering inlined functions. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with inlined functions. We augment this large but limited labeled dataset with a pre-training step, which learns the statistics of decompiled code from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in inlined function recovery, especially at higher levels of optimization.




