COMBO: Pre-Training Representations of Binary Code Using Contrastive Learning

10/11/2022
by Yifan Zhang, et al.

Compiled software is delivered as executable binary code. Developers write source code to express the software's semantics, but the compiler converts it to a binary format that the CPU can directly execute. Binary code analysis is therefore critical to reverse engineering and computer security applications where source code is not available. However, unlike source code and natural language, which contain rich semantic information, binary code is typically difficult for human engineers to understand and analyze. While existing work uses AI models to assist source code analysis, few studies have considered binary code. In this paper, we propose a COntrastive learning Model for Binary cOde Analysis, or COMBO, that incorporates source code and comment information into binary code during representation learning. Specifically, we present three components in COMBO: (1) a primary contrastive learning method for cold-start pre-training, (2) a simplex interpolation method to incorporate source code, comments, and binary code, and (3) an intermediate representation learning algorithm to provide binary code embeddings. Finally, we evaluate the effectiveness of the pre-trained representations produced by COMBO using three indicative downstream tasks relating to binary code: algorithmic functionality classification, binary code similarity, and vulnerability detection. Our experimental results show that COMBO facilitates representation learning of binary code, as visualized by distribution analysis, and improves performance on all three downstream tasks by 5.45% on average compared to state-of-the-art large-scale language representation models. To the best of our knowledge, COMBO is the first language representation model that incorporates source code, binary code, and comments into contrastive code representation learning and unifies multiple tasks for binary code analysis.
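The abstract names contrastive pre-training and simplex interpolation over three modalities but gives no code on this page. Purely as an illustrative sketch, not the authors' implementation, the snippet below mixes source, comment, and binary embeddings with weights drawn from the probability simplex (via a Dirichlet sample, an assumption) and trains with a standard InfoNCE-style contrastive loss. All function names, shapes, and hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def simplex_interpolate(src_emb, cmt_emb, bin_emb):
    """Mix the three modality embeddings with weights from the
    probability simplex (one Dirichlet draw per batch for brevity;
    the paper's exact interpolation scheme may differ)."""
    w = torch.distributions.Dirichlet(torch.ones(3)).sample()  # w >= 0, sums to 1
    return w[0] * src_emb + w[1] * cmt_emb + w[2] * bin_emb

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE contrastive loss: each anchor should be most similar
    to its own positive among all positives in the batch."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.t() / temperature       # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0))      # diagonal entries are matched pairs
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 8 paired embeddings of dimension 128.
B, D = 8, 128
src, cmt, bins = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
mixed = simplex_interpolate(src, cmt, bins)  # interpolated multi-modal view
loss = info_nce_loss(bins, mixed)            # pull binary embeddings toward it
```

The design intuition, under these assumptions, is that interpolated views anchored by the same function act as positives, so the encoder learns to place a function's binary form near its source and comment forms in embedding space.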

research · 05/04/2022
CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training
Recent years have witnessed increasing interest in code representation l...

research · 10/11/2022
Leveraging Artificial Intelligence on Binary Code Comprehension
Understanding binary code is an essential but complex software engineeri...

research · 06/05/2023
CONCORD: Clone-aware Contrastive Learning for Source Code
Deep Learning (DL) models to analyze source code have shown immense prom...

research · 01/04/2023
Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries
Reverse engineering binaries is required to understand and analyse progr...

research · 01/20/2023
Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning
The Bidirectional Encoder Representations from Transformers (BERT) were ...

research · 10/03/2022
ContraGen: Effective Contrastive Learning For Causal Language Model
Despite exciting progress in large-scale language generation, the expres...

research · 07/14/2020
Contextualized Code Representation Learning for Commit Message Generation
Automatic generation of high-quality commit messages for code commits ca...
