Addressing Leakage in Self-Supervised Contextualized Code Retrieval

04/17/2022
by Johannes Villmow, et al.

We address contextualized code retrieval, the search for code snippets that help fill gaps in a partial input program. Our approach enables large-scale self-supervised contrastive training by randomly splitting source code into contexts and targets. To combat leakage between the two, we propose a novel approach based on mutual identifier masking, dedentation, and the selection of syntax-aligned targets. Our second contribution is a new dataset for the direct evaluation of contextualized code retrieval, built from manually aligned subpassages of code clones. Our experiments demonstrate that our approach improves retrieval substantially and yields new state-of-the-art results for code clone detection and defect detection.
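The context/target split and two of the leakage countermeasures named above can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `<HOLE>` and `<MASK>` tokens, the regex-based identifier detection, and the choice to mask shared identifiers on the target side are all assumptions made for illustration. The split position is passed in explicitly rather than drawn at random, and syntax-aligned target selection is not shown.

```python
import keyword
import re
import textwrap

# Rough identifier pattern (assumption: a real system would use a proper tokenizer).
IDENT = re.compile(r"\b[A-Za-z_]\w*\b")


def identifiers(text):
    """Return the set of non-keyword identifiers found in text."""
    return {tok for tok in IDENT.findall(text) if not keyword.iskeyword(tok)}


def make_pair(code, start, end, hole="<HOLE>", mask="<MASK>"):
    """Cut lines [start, end) out of code as the target; the rest is the context.

    The target is dedented so its indentation level does not reveal where it
    came from, and identifiers that also occur in the context are masked in
    the target so the pair cannot be matched by name overlap alone.
    """
    lines = code.splitlines()
    outside = lines[:start] + lines[end:]
    target = textwrap.dedent("\n".join(lines[start:end]))

    # Identifiers shared by both sides are the leakage candidates.
    shared = identifiers("\n".join(outside)) & identifiers(target)
    masked_target = IDENT.sub(
        lambda m: mask if m.group(0) in shared else m.group(0), target
    )

    context = "\n".join(lines[:start] + [hole] + lines[end:])
    return context, masked_target, shared


example = """def total(prices):
    s = 0
    for p in prices:
        s += p
    return s
"""

context, target, shared = make_pair(example, start=2, end=4)
print(context)
print(target)
```

In this example the loop is removed from the function, dedented, and the names `prices` and `s`, which appear on both sides, are replaced by the mask token in the target, while the loop variable `p` is left intact because it occurs only in the target.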

research
03/27/2022

HELoC: Hierarchical Contrastive Learning of Source Code Representation

Abstract syntax trees (ASTs) play a crucial role in source code represen...
research
09/06/2020

Self-Supervised Learning for Code Retrieval and Summarization through Semantic-Preserving Program Transformations

Code retrieval and summarization are useful tasks for developers, but it...
research
09/06/2021

Self-supervised Product Quantization for Deep Unsupervised Image Retrieval

Supervised deep learning-based hash and vector quantization are enabling...
research
06/11/2021

A comprehensive solution to retrieval-based chatbot construction

In this paper we present the results of our experiments in training and ...
research
01/17/2022

ICLEA: Interactive Contrastive Learning for Self-supervised Entity Alignment

Self-supervised entity alignment (EA) aims to link equivalent entities a...
research
03/28/2023

Colo-SCRL: Self-Supervised Contrastive Representation Learning for Colonoscopic Video Retrieval

Colonoscopic video retrieval, which is a critical part of polyp treatmen...
