Is a Single Model Enough? MuCoS: A Multi-Model Ensemble Learning for Semantic Code Search

07/10/2021 ∙ by Lun Du, et al. ∙ Xi'an Jiaotong University, Microsoft

Recently, deep learning methods have become mainstream in code search since they better capture semantic correlations between code snippets and search queries and show promising performance. However, code snippets carry information from many dimensions, such as business logic, specific algorithms, and hardware communication, so it is hard for a single code representation module to cover all these perspectives. On the other hand, since a specific query may focus on one or several of these perspectives, it is difficult for a single query representation module to represent different user intents. In this paper, we propose MuCoS, a multi-model ensemble learning architecture for semantic code search. It combines several individual learners, each of which emphasizes a specific perspective of code snippets. We train the individual learners on different datasets, each containing a different perspective of code information, obtained through a data augmentation strategy. We then ensemble the learners to capture comprehensive features of code snippets.


1. Introduction

Code search is the most frequent developer activity in the software development process (Caitlin15). Reusable code examples help improve developers' efficiency during development (Brandt09; Shuai2020). Given a natural language query that describes the developer's intent, the goal of code search is to find the most relevant code snippet in a large source code corpus.

Many code search engines have been developed. They mainly rely on traditional information retrieval (IR) techniques such as keyword matching (Meili15) or a combination of text similarity and Application Programming Interface (API) matching (Lv15). Recently, many works have applied deep learning methods (he2016deep; ChoMGBBSB14; wang2019tag2gauss; wang2019tag2vec; yang2020domain) to code search (Gu2018; Cambronero2019; Yan2020; Li2020; Feng2020; Zhu2020; Shuai2020; Ye2020; Haldar2020; Ling2020; Ling2020a; wang2020cocogum), using neural networks to capture deep semantic correlations between natural language queries and code snippets, and have achieved promising performance improvements. These methods employ various types of model structures, including sequential models (Gu2018; Cambronero2019; Yan2020; Li2020; Feng2020; Zhu2020; Shuai2020; Ye2020; Haldar2020), graph models (Ling2020; Guo2020), and transformers (Feng2020).

Existing deep learning code search methods mainly use a single model to represent queries and code snippets. However, code may carry information from many dimensions, such as business logic, specific algorithms, and hardware communication, making it hard for a single code representation module to cover all these perspectives. On the other hand, since a specific query may focus on one or several perspectives, it is difficult for a single query representation module to represent different user intents. Listings 1 and 2 show a code snippet before and after two of the transformations used for data augmentation: variable renaming and statement permutation.

public static String replaceHtmlEntities(String content, Map<String, Character> map) {
  for (Entry<String, Character> entry : map.entrySet()) {
    if (content.indexOf(entry.getKey()) != -1) {
      content = content.replace(entry.getKey(), String.valueOf(entry.getValue()));
    }
  }
  return content;
}

public static String replaceHtmlEntities(String var0, Map<String, Character> var2) {
  for (Entry<String, Character> var1 : var2.entrySet()) {
    if (var0.indexOf(var1.getKey()) != -1) {
      var0 = var0.replace(var1.getKey(), String.valueOf(var1.getValue()));
    }
  }
  return var0;
}

Listing 1: Code before and after variable renaming.
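
To make the transformation concrete, the following is a minimal sketch of variable renaming via word-boundary regex replacement. The class name, the caller-supplied identifier list, and the var0, var1, ... scheme are illustrative assumptions rather than the paper's actual implementation; a production version would rename through a real parser to avoid touching string literals and comments.

import java.util.List;
import java.util.regex.Pattern;

public class VariableRenamer {
  // Replace each given identifier with var0, var1, ... using word-boundary
  // matching so that substrings of longer identifiers are left untouched.
  public static String rename(String code, List<String> names) {
    String result = code;
    for (int i = 0; i < names.size(); i++) {
      result = result.replaceAll(
          "\\b" + Pattern.quote(names.get(i)) + "\\b", "var" + i);
    }
    return result;
  }

  public static void main(String[] args) {
    String stmt = "content = content.replace(entry.getKey(), value);";
    // Prints: var0 = var0.replace(var1.getKey(), value);
    System.out.println(rename(stmt, List.of("content", "entry")));
  }
}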
public void doAESEncryption() throws Exception {
  if (!initAESDone)
    initAES();
  cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
  //System.out.println(secretKey.getEncoded());
  cipher.init(Cipher.ENCRYPT_MODE, secretKey);
  AlgorithmParameters params = cipher.getParameters();
  iv = params.getParameterSpec(IvParameterSpec.class).getIV();
  secretCipher = cipher.doFinal(secretPlain);
  clearPlain();
}

public void doAESEncryption() throws Exception {
  if (!initAESDone)
    initAES();
  cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
  //System.out.println(secretKey.getEncoded());
  AlgorithmParameters params = cipher.getParameters();
  cipher.init(Cipher.ENCRYPT_MODE, secretKey);
  iv = params.getParameterSpec(IvParameterSpec.class).getIV();
  secretCipher = cipher.doFinal(secretPlain);
  clearPlain();
}

Listing 2: Code before and after statement permutation.
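
A simple, deliberately conservative way to generate such permutations is to swap two adjacent statements only when they mention disjoint sets of identifiers, as in the sketch below. The identifier-overlap test is an assumption standing in for a real data-dependence analysis (the paper does not specify one), and it is stricter than the example in Listing 2, which reorders two statements that both mention cipher.

import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatementPermuter {
  private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

  // Collect every identifier-like token in a statement; a crude proxy
  // for the statement's read/write set.
  private static Set<String> identifiers(String stmt) {
    Set<String> ids = new HashSet<>();
    Matcher m = IDENT.matcher(stmt);
    while (m.find()) {
      ids.add(m.group());
    }
    return ids;
  }

  // Swap statements i and i+1 only if they share no identifiers, so the
  // reordering cannot change which values either statement observes.
  public static boolean trySwap(List<String> stmts, int i) {
    Set<String> shared = identifiers(stmts.get(i));
    shared.retainAll(identifiers(stmts.get(i + 1)));
    if (!shared.isEmpty()) {
      return false; // potential data dependence: keep the original order
    }
    Collections.swap(stmts, i, i + 1);
    return true;
  }
}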
Figure 1. An overview of MuCoS. This framework consists of three phases: data augmentation/separation, individual encoder fine-tuning, and ensemble learning. We first generate three datasets, each of which focuses on a specific aspect of code snippets, based on data augmentation or separation. Then we learn three individual code search models by fine-tuning three pre-trained CodeBERT models on the generated datasets, respectively. Finally, we concatenate the encodings from the three individual models and feed them into a multi-layer perceptron for ensemble learning.
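
The ensemble step itself is small: the three fine-tuned encoders each produce a vector for a (query, code) pair, and a learned multi-layer perceptron maps the concatenation to a relevance score. The sketch below shows that computation with a single ReLU hidden layer over plain arrays; the layer shape and activation are illustrative assumptions, and the weights w1 and w2 would come from the ensemble-training phase.

public class EnsembleScorer {
  private final double[][] w1; // hidden-layer weights, shape [hidden][3 * d]
  private final double[] w2;   // output weights, shape [hidden]

  public EnsembleScorer(double[][] w1, double[] w2) {
    this.w1 = w1;
    this.w2 = w2;
  }

  // Score one (query, code) pair from the three individual encoders' outputs.
  public double score(double[] enc1, double[] enc2, double[] enc3) {
    // Concatenate the three encodings into one feature vector.
    double[] x = new double[enc1.length + enc2.length + enc3.length];
    System.arraycopy(enc1, 0, x, 0, enc1.length);
    System.arraycopy(enc2, 0, x, enc1.length, enc2.length);
    System.arraycopy(enc3, 0, x, enc1.length + enc2.length, enc3.length);

    // One hidden layer with ReLU, then a linear read-out to a scalar score.
    double score = 0.0;
    for (int h = 0; h < w1.length; h++) {
      double z = 0.0;
      for (int j = 0; j < x.length; j++) {
        z += w1[h][j] * x[j];
      }
      score += w2[h] * Math.max(0.0, z);
    }
    return score;
  }
}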
