DOBF: A Deobfuscation Pre-Training Objective for Programming Languages

02/15/2021
by Baptiste Roziere, et al.

Recent advances in self-supervised learning have dramatically improved the state of the art on a wide variety of tasks. However, research in language model pre-training has mostly focused on natural languages, and it is unclear whether models like BERT and its variants provide the best pre-training when applied to other modalities, such as source code. In this paper, we introduce a new pre-training objective, DOBF, that leverages the structural aspect of programming languages and pre-trains a model to recover the original version of obfuscated source code. We show that models pre-trained with DOBF significantly outperform existing approaches on multiple downstream tasks, providing relative improvements of up to 13% in unsupervised code translation and 24% in natural language code search. Incidentally, we found that our pre-trained model is able to deobfuscate fully obfuscated source files and to suggest descriptive variable names.
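To make the objective concrete, the sketch below shows how a DOBF-style training pair might be constructed: identifiers in a snippet are replaced with uninformative placeholders, and the model is trained to output the mapping from placeholders back to the original names. This is a minimal, regex-based illustration only; the actual pipeline described in the paper parses the code properly and uses separate placeholder families for functions, classes, and variables, and the `obfuscate` helper and single `VAR_i` scheme here are simplifying assumptions.

```python
import re

def obfuscate(source: str, identifiers: list) -> tuple:
    """Replace each identifier with an uninformative placeholder.

    Returns the obfuscated source and the placeholder-to-name mapping,
    which serves as the target the model must recover.
    """
    mapping = {}
    for i, name in enumerate(identifiers):
        placeholder = f"VAR_{i}"
        mapping[placeholder] = name
        # \b ensures only whole identifiers are replaced, not substrings
        source = re.sub(rf"\b{re.escape(name)}\b", placeholder, source)
    return source, mapping

code = "def add(a, b):\n    total = a + b\n    return total"
obfuscated, target = obfuscate(code, ["add", "a", "b", "total"])
print(obfuscated)
# def VAR_0(VAR_1, VAR_2):
#     VAR_3 = VAR_1 + VAR_2
#     return VAR_3
print(target)
# {'VAR_0': 'add', 'VAR_1': 'a', 'VAR_2': 'b', 'VAR_3': 'total'}
```

Because the placeholders carry no semantic information, recovering `target` from `obfuscated` forces the model to infer each identifier's role from code structure alone, which is what makes the objective useful both for pre-training and for suggesting descriptive variable names.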

Related research

- 01/05/2022 · SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations. Recent years have seen the successful application of large pre-trained m...
- 08/10/2021 · SynCoBERT: Syntax-Guided Multi-Modal Contrastive Pre-Training for Code Representation. Code representation learning, which aims to encode the semantics of sour...
- 12/13/2020 · InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees. Building deep learning models on source code has found many successful s...
- 09/14/2023 · Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names? Recent breakthroughs in pre-trained code models, such as CodeBERT and Co...
- 12/05/2021 · VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning. Variable names are critical for conveying intended program behavior. Mac...
- 01/24/2022 · Cobol2Vec: Learning Representations of Cobol code. There has been a steadily growing interest in development of novel metho...
- 05/06/2023 · Pre-training Language Model as a Multi-perspective Course Learner. ELECTRA, the generator-discriminator pre-training framework, has achieve...
