Symmetry-Preserving Program Representations for Learning Code Semantics

08/07/2023
by   Kexin Pei, et al.

Large Language Models (LLMs) have shown promise in automated program reasoning, a crucial aspect of many security tasks. However, existing LLM architectures for code are often borrowed from other domains like natural language processing, raising concerns about their generalization and robustness to unseen code. A key generalization challenge is to incorporate knowledge of code semantics, including control and data flow, into the LLM architectures. Drawing inspiration from the way convolution layers exploit translation symmetry, we explore how code symmetries can enhance LLM architectures for program analysis and modeling. We present a rigorous group-theoretic framework that formally defines code symmetries as semantics-preserving transformations and provides techniques for precisely reasoning about symmetry preservation within LLM architectures. Using this framework, we introduce a novel variant of self-attention that preserves program symmetries, demonstrating its effectiveness in generalization and robustness through detailed experimental evaluations across different binary and source code analysis tasks. Overall, our code symmetry framework offers rigorous and powerful reasoning techniques that can guide the future development of specialized LLMs for code and advance LLM-guided program reasoning tasks.
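To make the symmetry-preservation idea concrete, here is a minimal, hypothetical sketch (not the paper's actual architecture) of the group-theoretic property involved: plain self-attention without positional encodings is equivariant under the group of token permutations, meaning permuting the input rows permutes the output rows the same way. The paper's framework reasons analogously about equivariance under code symmetry groups such as semantics-preserving statement reorderings.

```python
# Illustrative sketch, assuming permutation symmetry as a stand-in for
# the paper's code symmetry groups. Checks that f(P X) == P f(X) for
# self-attention f without positional encodings.
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention (no positions).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 tokens, 8-dim embeddings
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)        # a group element: one token permutation
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: applying the symmetry before or after the layer agrees.
assert np.allclose(out_perm, out[perm])
```

Adding absolute positional encodings to `X` breaks this property, which is why symmetry-preserving variants of self-attention must encode structure (e.g., control and data flow) in a way that commutes with the symmetry group.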

Related Research

- 05/22/2023: Neural Machine Translation for Code Generation
  Neural machine translation (NMT) methods developed for natural language ...

- 06/19/2018: Neural Code Comprehension: A Learnable Representation of Code Semantics
  With the recent success of embeddings in natural language processing, re...

- 01/27/2022: Reasoning Like Program Executors
  Reasoning over natural language is a long-standing goal for the research...

- 09/04/2023: Code Representation Pre-training with Complements from Program Executions
  Large language models (LLMs) for natural language processing have been g...

- 04/15/2020: Evaluation of Generalizability of Neural Program Analyzers under Semantic-Preserving Transformations
  The abundance of publicly available source code repositories, in conjunc...

- 08/17/2023: CodeCoT and Beyond: Learning to Program and Test like a Developer
  In natural language processing, transformer-based large language models ...

- 03/12/2020: Control-flow Flattening Preserves the Constant-Time Policy (Extended Version)
  Obfuscating compilers protect a software by obscuring its meaning and im...
