Pluvio: Assembly Clone Search for Out-of-domain Architectures and Libraries through Transfer Learning and Conditional Variational Information Bottleneck

07/20/2023
by   Zhiwei Fu, et al.
0

The practice of code reuse is crucial in software development for a faster and more efficient development lifecycle. In reality, however, code reuse practices lack proper control, resulting in issues such as vulnerability propagation and intellectual property infringements. Assembly clone search, a critical shift-right defence mechanism, has been effective in identifying vulnerable code resulting from reuse in released executables. Recent studies on assembly clone search demonstrate a trend towards using machine learning-based methods to match assembly code variants produced by different toolchains. However, these methods are limited to what they learn from a small number of toolchain variants used in training, rendering them inapplicable to unseen architectures and their corresponding compilation toolchain variants. This paper presents the first study on the problem of assembly clone search with unseen architectures and libraries. We propose incorporating human common knowledge through large-scale pre-trained natural language models, in the form of transfer learning, into current learning-based approaches for assembly clone search. Transfer learning can aid in addressing the limitations of the existing approaches, as it can bring in broader knowledge from human experts in assembly code. We further address the sequence limit issue by proposing a reinforcement learning agent to remove unnecessary and redundant tokens. Coupled with a new Variational Information Bottleneck learning strategy, the proposed system minimizes the reliance on potential indicators of architectures and optimization settings, for a better generalization of unseen architectures. We simulate the unseen architecture clone search scenarios and the experimental results show the effectiveness of the proposed approach against the state-of-the-art solutions.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/17/2020

Constraint-Based Software Diversification for Efficient Mitigation of Code-Reuse Attacks

Modern software deployment process produces software that is uniform, an...
research
10/12/2019

Deep Transfer Learning for Source Code Modeling

In recent years, deep learning models have shown great potential in sour...
research
04/17/2023

A study on a Q-Learning algorithm application to a manufacturing assembly problem

The development of machine learning algorithms has been gathering releva...
research
03/15/2019

Get rid of inline assembly through trustable verification-oriented lifting

Formal methods for software development have made great strides in the l...
research
07/19/2022

On the Usability of Transformers-based models for a French Question-Answering task

For many tasks, state-of-the-art results have been achieved with Transfo...
research
08/27/2022

A Multi-Format Transfer Learning Model for Event Argument Extraction via Variational Information Bottleneck

Event argument extraction (EAE) aims to extract arguments with given rol...
research
10/02/2020

XDA: Accurate, Robust Disassembly with Transfer Learning

Accurate and robust disassembly of stripped binaries is challenging. The...

Please sign up or login with your details

Forgot password? Click here to reset