UniASM: Binary Code Similarity Detection without Fine-tuning

10/28/2022
by   Yeming Gu, et al.

Binary code similarity detection (BCSD) is widely used in binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. In this paper, we propose a novel transformer-based binary code embedding model, named UniASM, to learn representations of binary functions. We design two new training tasks that make the spatial distribution of the generated vectors more uniform, so that the vectors can be used directly in BCSD without any fine-tuning. In addition, we propose a new tokenization approach for binary functions that increases each token's semantic information while mitigating the out-of-vocabulary (OOV) problem. The experimental results show that UniASM outperforms state-of-the-art (SOTA) approaches on the evaluation dataset. Its average recall@1 scores in the cross-compiler, cross-optimization-level, and cross-obfuscation settings are 0.72, 0.63, and 0.77, respectively, all higher than those of the existing SOTA baselines. In a real-world known-vulnerability search task, UniASM outperforms all the current baselines.
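To make the recall@1 metric concrete, the following is a minimal sketch of how such a score could be computed from function embeddings. It is not the authors' evaluation code: the use of cosine similarity, the helper names `cosine` and `recall_at_1`, and the toy vectors are all assumptions made for illustration, and the abstract does not state the pool size or the similarity measure used in the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors (assumed metric).
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_1(queries, pool):
    # queries[i] and pool[i] are assumed to embed the same source function
    # compiled under two different settings. A query is a hit when its
    # nearest neighbour in the pool is its own counterpart.
    hits = 0
    for i, q in enumerate(queries):
        best = max(range(len(pool)), key=lambda j: cosine(q, pool[j]))
        if best == i:
            hits += 1
    return hits / len(queries)

# Toy embeddings, invented for illustration only.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pool = [[2.0, 0.1], [0.1, 3.0], [1.0, -1.0]]
score = recall_at_1(queries, pool)
print(f"recall@1 = {score:.2f}")
```

In this toy case the first two queries retrieve their counterparts and the third does not, so two of three queries are hits. A model used without fine-tuning would supply the embeddings directly, with no task-specific training step before this ranking.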


research
05/25/2022

jTrans: Jump-Aware Transformer for Binary Code Similarity

Binary code similarity detection (BCSD) has important applications in va...
research
11/10/2022

Semantic Learning and Emulation Based Cross-platform Binary Vulnerability Seeker

Clone detection is widely exploited for software vulnerability search. T...
research
11/13/2018

SAFE: Self-Attentive Function Embeddings for Binary Similarity

The binary similarity problem consists in determining if two functions a...
research
06/10/2021

Semantic-aware Binary Code Representation with BERT

A wide range of binary analysis applications, such as bug discovery, mal...
research
09/25/2019

A Survey of Binary Code Similarity

Binary code similarity approaches compare two or more pieces of binary c...
research
05/19/2023

Searching by Code: a New SearchBySnippet Dataset and SnippeR Retrieval Model for Searching by Code Snippets

Code search is an important task that has seen many developments in rece...
research
03/09/2022

BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model

Binary authorship analysis is a significant problem in many software eng...
