SLaDe: A Portable Small Language Model Decompiler for Optimized Assembler

05/21/2023
by Jordi Armengol-Estapé, et al.

Decompilation is a well-studied area with numerous high-quality tools available. These are frequently used for security tasks and to port legacy code. However, they regularly generate difficult-to-read programs and require a large amount of engineering effort to support new programming languages and ISAs. Recent interest in neural approaches has produced portable tools that generate readable code. However, to date such techniques are usually restricted to synthetic programs without optimization, and no models have evaluated their portability. Furthermore, while the code generated may be more readable, it is usually incorrect. This paper presents SLaDe, a Small Language model Decompiler based on a sequence-to-sequence transformer trained over real-world code. We develop a novel tokenizer and exploit no-dropout training to produce high-quality code. We use type inference to generate programs that are more readable and accurate than those of standard analytic and recent neural approaches. Unlike standard approaches, SLaDe can infer out-of-context types, and unlike neural approaches, it generates correct code. We evaluate SLaDe on over 4,000 functions from AnghaBench across two ISAs and two optimization levels. SLaDe is up to 6 times more accurate than Ghidra, a state-of-the-art, industrial-strength decompiler, up to 4 times more accurate than the large language model ChatGPT, and generates significantly more readable code than both.
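To make the sequence-to-sequence setup concrete, the sketch below shows how a trained encoder-decoder decompiler of this kind could be queried: assembler in, C source out. It assumes a Hugging Face transformers-style checkpoint; the model name "example/slade-x86-O0", the input snippet, and the expected output are all hypothetical, and SLaDe's custom tokenizer and training recipe are described in the paper, not reproduced here.

    # Minimal sketch of neural seq2seq decompilation (assumptions noted above).
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Hypothetical checkpoint name; SLaDe's real tokenizer/weights differ.
    tokenizer = AutoTokenizer.from_pretrained("example/slade-x86-O0")
    model = AutoModelForSeq2SeqLM.from_pretrained("example/slade-x86-O0")

    asm = """
    foo:
        movl    %edi, %eax
        addl    %esi, %eax
        ret
    """

    # Encode the assembler, then decode a C candidate with beam search.
    inputs = tokenizer(asm, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    # A plausible decompilation: int foo(int a, int b) { return a + b; }

In a pipeline like the one the abstract describes, candidates decoded this way would additionally be checked for correctness (e.g. against the original binary's behavior) rather than accepted as-is.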

