Contrastive Code Representation Learning

07/09/2020
by   Paras Jain, et al.
5

Machine-aided programming tools such as automated type predictors and autocomplete are increasingly learning-based. However, current approaches predominantly rely on supervised learning with task-specific datasets. We propose Contrastive Code Representation Learning (ContraCode), a self-supervised algorithm for learning task-agnostic semantic representations of programs via contrastive learning. Our approach uses no human-provided labels, only the raw text of programs. ContraCode optimizes for a representation that is invariant to semantic-preserving code transformations. We develop an automated source-to-source compiler that generates textually divergent variants of source programs. We then train a neural network to identify variants of anchor programs within a large batch of non-equivalent negatives. To solve this task, the network must extract features representing the functionality, not form, of the program. In experiments, we pre-train ContraCode with 1.8M unannotated JavaScript methods mined from GitHub, then transfer to downstream tasks by fine-tuning. Pre-training with ContraCode consistently improves the F1 score of code summarization baselines by up to 8 and top-1 accuracy of type inference baselines by up to 13 ContraCode achieves 9 current state-of-the-art static type analyzer for TypeScript.

READ FULL TEXT
POST COMMENT

Comments

There are no comments yet.

Authors

page 16

09/06/2020

Self-Supervised Learning for Code Retrieval and Summarization through Semantic-Preserving Program Transformations

Code retrieval and summarization are useful tasks for developers, but it...
05/27/2020

CLOCS: Contrastive Learning of Cardiac Signals

The healthcare industry generates troves of unlabelled physiological dat...
05/04/2022

CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training

Recent years have witnessed increasing interest in code representation l...
01/26/2022

CodeRetriever: Unimodal and Bimodal Contrastive Learning

In this paper, we propose the CodeRetriever model, which combines the un...
12/02/2021

GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses

Code embedding is a keystone in the application of machine learning on s...
02/04/2022

Supervised Contrastive Learning for Product Matching

Contrastive learning has seen increasing success in the fields of comput...
12/05/2021

VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning

Variable names are critical for conveying intended program behavior. Mac...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.