
CodeGRU: Context-aware Deep Learning with Gated Recurrent Unit for Source Code Modeling

Recently, many NLP-based deep learning models have been applied to source code for suggestion and recommendation tasks. A major limitation of these approaches is that they treat source code as simple tokens of text and ignore its contextual, syntactic, and structural dependencies. In this work, we present CodeGRU, a Gated Recurrent Unit based source code language model that is capable of capturing contextual, syntactic, and structural dependencies when modeling source code. CodeGRU introduces several new components. First, the Code Sampler selects noise-free code samples and transforms obfuscated code into its proper syntax, which helps capture syntactic and structural dependencies. Next, the Code Regularizer encodes source code in a way that helps capture its contextual dependencies. Finally, we propose a novel method that learns a variable-size context for modeling source code. We evaluated CodeGRU on a real-world dataset, and the results show that CodeGRU can effectively capture contextual, syntactic, and structural dependencies that previous works fail to capture. We also discuss and visualize two use cases of CodeGRU for source code modeling tasks: (1) source code suggestion and (2) source code generation.





1 Introduction

Source code suggestion, code generation, bug fixing, and similar features are vital parts of a modern integrated development environment (IDE). These features help software developers build and debug software rapidly. In the last few years, there has been a massive increase in code-related databases on the internet. Many open source websites (e.g., w3school, GitHub, Stack Overflow) provide API libraries and code usage examples that help with troubleshooting source code, fixing bugs, and much more. Software developers rely heavily on such resources for these purposes.

Natural language processing (NLP) Cambria and White [2014], Manning et al., Ranjan and Ahmad [2016] explores, understands, and manipulates natural language text or speech to do useful things. NLP techniques have shown their effectiveness in many fields, such as speech recognition Bellegarda [2000], information retrieval Berger and Lafferty, text mining Rajman and Besançon, machine translation Xiao et al., code completion Hindle et al., Raychev et al. [b], Karaivanov et al., Tu et al., code search, API usage pattern mining, and code summarization. One of the most common NLP techniques for source code modeling is the statistical language model (SLM), which estimates a probability distribution over sequences in a corpus. Given a sequence $S$ of length $n$, it assigns a probability to the whole sequence and then calculates the likelihood of all sub-sequences to find the most likely next sequence.
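To make the idea of a statistical language model concrete, the following is a minimal sketch of an unsmoothed bigram model over code tokens. It is an illustration only, not the n-gram setup used in the works cited above: the function names, the `<s>` and `</s>` boundary markers, and the toy corpus in the usage note are all assumptions.

```python
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over a list of token sequences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        # Boundary markers are an assumption of this sketch.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def sequence_probability(tokens, unigrams, bigrams):
    """Chain-rule probability of a token prefix under the bigram approximation."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no smoothing in this sketch
        prob *= bigrams[(prev, tok)] / unigrams[prev]
        prev = tok
    return prob


def most_likely_next(prev, bigrams):
    """Return the token that most often follows `prev`, or None if unseen."""
    candidates = {b: c for (a, b), c in bigrams.items() if a == prev}
    return max(candidates, key=candidates.get) if candidates else None
```

For a toy corpus of two statements, `int i = 0 ;` and `int j = 1 ;`, the model always starts with `int` and is then evenly split between `i` and `j`, which is the kind of short-range statistic an SLM captures.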

Recent advances in deep neural network (DNN) based NLP models Young et al. [a] have shown that they can effectively overcome the context issue that SLM-based models Raychev et al. [b], Hindle et al., Tu et al. cannot address well. Many deep learning based approaches have been applied to source code modeling tasks such as code summarization Iyer et al., Allamanis et al. [b], code generation Sethi et al. [a], error fixing Sethi et al. [a], Gupta et al., and code recommendation White et al., Gu et al., Dam et al. Applying recurrent neural network (RNN) models to source code modeling can help improve on the performance of such SLM models. Recently, some works White et al., Raychev et al. [b] directly applied RNNs to code suggestion. A major limitation of these approaches is that they treat source code as simple tokens of text and ignore its contextual, syntactic, and structural dependencies. Another limitation is that they treat source code modeling as a sequence-to-sequence problem with a fixed-size context, where the right context may not be captured in the fixed-size window, which leads to inaccurate prediction of the next code token.

Compared with natural language text, source code tends to have richer contextual, syntactic, and structural dependencies. Treating source code as simple text cannot effectively capture these dependencies. Software developers usually choose different names for methods, classes, and variables, which makes it difficult to capture the right context. For example, one developer may choose the name num for a variable of the INT data type, while another may choose size for the same purpose. Consider another example, where a common method i.tostring() converts a variable to the String data type. A similar call, person.tostring(), refers to an object of a class and returns a person’s information. In addition, source code must follow the rules defined by its grammar. For example, a try block must be followed by a catch block. Another example is that when a developer uses a do block, the next block should be while(), and the next token suggestion should be ; according to the Java language grammar. We argue that these structural, syntactic, and contextual dependencies can be fruitful in source code modeling. Using them can help improve various applications, including code suggestion and code generation.

In this paper, we propose CodeGRU to better capture the context, syntax, and structure of source code when suggesting the next source code token. This work includes several new components. First, unlike previous works White et al., Raychev et al. [b], Hindle et al., we do not simply treat source code as text. To capture structural dependencies, we propose the Code Sampler, a novel approach that carefully samples noise-free data. It removes all unnecessary code and transforms obfuscated code into its proper structure. Second, we propose a novel Code Regularizer technique, which parses the source code into an abstract syntax tree (AST) to encode the contextual dependencies; it is discussed in detail in Section 4. Thanks to the Code Regularizer, CodeGRU can effectively capture the right context even when it is located far away in the code. Finally, we introduce a novel approach for learning a variable-size source code context.

This work makes the following unique contributions:

  • A novel approach for source code modeling is proposed, which consists of the Code Sampler and the Code Regularizer. The Code Sampler selects noise-free data and transforms the source code into its proper structure and syntax. The Code Regularizer parses the source code into an abstract syntax tree (AST) and encodes its contextual information. These two components help CodeGRU capture contextual, syntactic, and structural dependencies. CodeGRU also reduces the vocabulary size by 10%-50% and helps to overcome the out-of-vocabulary issue.

  • A novel method that learns a variable-size context of the source code is also proposed. Unlike previous works, we do not use a fixed-size context window to model the source code. CodeGRU learns a variable-size context and increases its context based on source code syntax and structure.

  • An extensive evaluation of CodeGRU on a real-world data set shows an improvement in accuracy. Unlike previous works, CodeGRU considers a large vocabulary size, which enables it to suggest a next code token that has not appeared in the training data.

  • We also present two use cases of CodeGRU: (1) code suggestion, which can suggest multiple predictions for the next code token, and (2) code generation, which can generate the whole next code sequence.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 covers preliminary technical details. Section 4 presents our CodeGRU model in detail. Section 5 covers the empirical evaluation of CodeGRU. Section 6 discusses use cases of CodeGRU. Finally, Section 8 concludes this work.

2 Related Work

Most modern IDEs provide code completion and code suggestion features. In recent years, deep neural techniques have been successfully applied to various tasks in natural language processing, and they have also shown their effectiveness on problems such as code completion, code suggestion, code generation, API mining, code migration, and code categorization.

Hindle et al. have shown how natural language processing techniques can help in source code modeling. They provide a simple n-gram based model that helps predict the next code token in the Eclipse IDE. Raychev et al. [b] used a statistical language model for synthesizing code completions. They applied n-gram and RNN language models to the task of code completion. Tu et al. proposed a cache based language model that consists of an n-gram and a cache. Hellendoorn and Devanbu further improved the cache based model by introducing nested locality. White et al. applied deep learning to source code modeling. Another approach to source code modeling is to use probabilistic context-free grammars (PCFGs) Bielik et al. Allamanis and Sutton used a PCFG based model to mine idioms from source code. Maddison and Tarlow used a structured generative model for source code. They evaluated their approach against n-gram and PCFG based language models and showed how it can help in source code generation tasks. Raychev et al. [a] applied decision trees to predict API elements. Chan et al. used a graph-based search approach to search for and recommend API usages.

Recently there has been an increase in work on API usage mining and suggestion Wang et al. [2013], Keivanloo et al., Dsouza et al. [2016]. Thung et al. [2013] introduced a recommendation system for API methods that uses feature requests. Nguyen et al. [b] proposed a methodology to learn API usages from byte code. Allamanis and Sutton introduced a model that automatically mines source code idioms. A neural probabilistic language model that can suggest names for methods and classes was introduced in Allamanis et al. [a]. Franks et al. [2015] created a tool for Eclipse named CACHECA for source code suggestion using an n-gram model. Nguyen et al. [2012] introduced an Eclipse plugin that provides context-sensitive code completion based on API usage pattern mining techniques. Chen and Xing [2016] created a web-based tool to find analogical libraries for different languages.

Yin and Neubig proposed a syntax-driven neural code generation approach that generates an abstract syntax tree by sequentially applying actions from a grammar model. In similar work, Rabinovich et al. introduced abstract syntax networks, a modeling framework for tasks like code generation and semantic parsing. Sethi et al. [b] introduced a model that automatically generates source code from deep learning based research papers. Allamanis et al. [c] proposed a bimodal model that suggests source code snippets given a natural language query. It is also capable of retrieving natural language descriptions given a source code query. Recently, deep learning based approaches have been widely applied to source code modeling tasks such as code summarization Iyer et al., Allamanis et al. [b], Guerrouj et al. [2015], code mining Va , clone detection Kumar and Singh [2015], and API learning Gu et al.

Our work is similar to White et al., Raychev et al. [b], who applied RNN based models to show how deep learning can help improve source code modeling. A major limitation of their work is that they treat source code as simple tokens of text and ignore its contextual, syntactic, and structural dependencies. The work most similar to ours is DNN Nguyen et al. [a]; however, it differs in several important ways. They apply deep neural networks to source code modeling with a fixed-size context, which can only suggest the next code token, whereas our work can generate a whole sequence of source code and considers a variable-size context. Their work uses a context size of n=4, and a larger size may cause scalability problems, as mentioned in their work Nguyen et al. [a]. This work introduces a novel approach to variable-size context learning that shows a considerable improvement in source code modeling. It can not only predict the next code token according to the correct context and the syntax of the grammar but can also predict variable types, class objects, class methods, and much more. CodeGRU is capable of suggesting multiple next code tokens with correct context, syntax, and structure. Unlike Santos et al. Eddie Antonio Santos [2018] and Gupta et al. [2017], our work focuses on source code suggestion tasks, whereas their works focus on fixing syntax errors with very limited vocabulary sizes of 113 and 129 code tokens, respectively. Their approach only considers language-defined keywords and stop words to build a limited vocabulary, which is not ideal for source code suggestion.

3 Preliminaries

In this section, we will discuss the preliminaries and technical overview of this work.

Figure 1: Architecture of an RNN neuron, where the input is a code token vector at index $t$, and the outputs are different next code tokens based on the context and probabilities.

Fig. 1 shows the architecture of the RNN for source code modeling, where $x$ is the input layer, $h$ is the context layer (also known as the hidden layer), and $y$ is the output layer. The hidden state activation $h_t$ at time step $t$ is computed as a function $f$ of the previous hidden state $h_{t-1}$ and the current code token $x_t$:

$$h_t = f(h_{t-1}, x_t) \qquad (1)$$

Usually $f$ is composed of an element-wise nonlinearity applied to an affine transformation of $x_t$ and $h_{t-1}$:

$$h_t = \phi(W x_t + U h_{t-1}) \qquad (2)$$

Here $W$ is the input-to-hidden weight matrix, $U$ is the state-to-state weight matrix, and $\phi$ is an activation function. RNN models Mikolov et al. [2010], Funahashi and Nakamura tend to look back further than $t-1$. However, a vanilla RNN suffers from the vanishing gradient problem, which can be overcome by using the Gated Recurrent Unit (GRU) model.

The GRU Young et al. [b] exposes the full hidden content without any control, which is ideal for source code modeling. It is composed of two gates, the reset gate $r_t$ and the update gate $z_t$. Further, it entirely exposes its memory context at each time step $t$. Exposing the entire context at each time step helps to learn contextual dependencies that a vanilla RNN fails to capture. It can be expressed as

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \qquad (3)$$

where $h_{t-1}$ and $\tilde{h}_t$ are the prior context and the fresh context, respectively.

$$\tilde{h}_t = \phi(W x_t + U (r_t \odot h_{t-1})) \qquad (4)$$

A major difference from Eq. 2 is that $h_{t-1}$ is modulated by the reset gate $r_t$. Here $\odot$ is element-wise multiplication and $\sigma$ is the activation function of the gates $r_t$ and $z_t$. We use the sigmoid activation, which can be expressed as

$$\sigma(x) = \frac{1}{1 + e^{-x}} \qquad (5)$$
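The GRU step described above can be sketched in plain Python. This is a sketch under stated assumptions rather than the implementation used in this work: the gate equations are not spelled out in the text, so the sigmoid gates with their own weight matrices (`Wz`, `Uz`, `Wr`, `Ur`) follow the standard GRU formulation, biases are omitted, and all function and parameter names are hypothetical. The candidate activation is left as a parameter and defaults to tanh, the activation listed for the estimator in Table 5.

```python
import math


def sigmoid(x):
    """Logistic function used for the reset and update gates."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x_t, h_prev, p, phi=math.tanh):
    """One GRU time step; `p` maps (assumed) names to weight matrices."""
    # Update gate z_t and reset gate r_t (standard formulation, assumed).
    z = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wz"], x_t), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b)
         for a, b in zip(matvec(p["Wr"], x_t), matvec(p["Ur"], h_prev))]
    # Fresh context: the previous state is modulated by the reset gate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [phi(a + b)
               for a, b in zip(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Interpolate between the prior context and the fresh context.
    return [(1.0 - zi) * hp + zi * ht
            for zi, hp, ht in zip(z, h_prev, h_tilde)]
```

With all weights set to zero, both gates evaluate to 0.5 and the fresh context is zero, so the step simply halves the previous state, which is a convenient sanity check for the interpolation.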


We use both RNN and GRU based models to show the effectiveness of our approach. Next we will discuss our proposed CodeGRU in detail.

4 The CodeGRU Model

Figure 2: The framework of CodeGRU, which is a context aware deep learning model for source code modeling.

In this section, we introduce CodeGRU in detail. CodeGRU is composed of several novel components. The overall workflow of CodeGRU is illustrated in Fig. 2. The first step is data collection, which is discussed in Section 5.1. The next step is the Code Sampler, which removes noise from the raw data and helps capture syntactic and structural dependencies. Then, we encode the sampled data using the Code Regularizer to capture the contextual dependencies. Finally, we build the language model, which learns a variable-size context of the source code, as discussed in Section 4.4.

4.1 Code Sampler

The Code Sampler first takes the code database and selects the language-specific files, such as Java, Python, C, or C++ files. Here we focus on Java files. It then compiles the sampled files using the Java compiler Corportation [2018] to remove noise. Here noise refers to incorrect syntax, unsuccessful compilation, debug issues, etc. It also removes partially compiled programs. The compilation step is applied only to the model training dataset and has no impact on the model testing dataset. Further, it removes all blank lines in the sampled files. Next, it removes all block-level and inline comments. Then it transforms obfuscated source code into its proper structure, as shown in Fig. 3. Here structure means indentation, block leveling, and other such elements. This transformation helps CodeGRU capture the structural and syntactic dependencies of source code. Our approach is general and can easily be extended to other statically typed languages for which a language parser is available.
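Two of the cleaning steps described above, comment removal and blank-line removal, can be sketched as follows. This is a regular-expression approximation written for illustration, not the Code Sampler itself: the compilation filter and the re-indentation of obfuscated code are omitted, and the function names are hypothetical. String and character literals are matched first so that comment markers inside them are left untouched.

```python
import re

# String and character literals (kept as-is) and comments (dropped).
_LITERAL = r'"(?:\\.|[^"\\])*"' + "|" + r"'(?:\\.|[^'\\])*'"
_COMMENT = r"/\*.*?\*/|//[^\n]*"
_PATTERN = re.compile("(" + _LITERAL + ")|(" + _COMMENT + ")", re.DOTALL)


def strip_comments(source):
    """Remove block-level and inline comments, preserving literals."""
    return _PATTERN.sub(lambda m: m.group(1) or "", source)


def drop_blank_lines(source):
    """Remove blank lines and trailing whitespace."""
    return "\n".join(
        line.rstrip() for line in source.splitlines() if line.strip()
    )


def sample(source):
    """Apply both cleaning steps to one source file's text."""
    return drop_blank_lines(strip_comments(source))
```

A production pipeline would more likely rely on the language parser for this step; the regular expression is only meant to show what is removed and what is kept.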

Figure 3: Transforming an obfuscated source code block into its proper structure, which helps in capturing the structural dependencies.

4.2 Code Regularizer

Source code consists of different kinds of tokens, such as classes, functions, variables, literals, language-specific keywords, data types, and stop words. Among these, language-dependent keywords, stop words, library functions, and data types form a shared vocabulary, which can also be considered as context. To capture the context of source code tokens, we encode them into their token types. Here we care about the token type rather than the token identifier. As discussed earlier, code token identifiers vary from developer to developer and do not carry useful information, whereas their data types can help capture context. We therefore encode this type information. For this purpose, we use the Java parser Jav [2018] to parse the source code files and extract their abstract syntax trees (ASTs). An AST is a tree representation of the abstract syntactic structure of the source code. These ASTs help us encode the type information of source code tokens. The exact values of literals (Int, float, long, double, byte, String literals, etc.) have no impact on code suggestion, while keeping them makes the data noisy. We encode all literal values to their base data types according to the Java language grammar. For example, given System.out.println( "Hello World"), the string value "Hello World" is of type String according to the Java language grammar. We encode it with its literal type String combined with a special token Val. Similarly, the value assigned to identifier a in a = 1.1 is not important, so we encode 1.1 with its literal type Float along with the special token Val.

Figure 4: An example of encoding a parsed Java AST with our Code Regularizer.
Code Token    Token Type           Special Token
              Int                  IntVar
              String Literal       StringVal
              Literal              null
              File                 FileVar
              Exception            ExceptionVar
              ArrayList<String>    StringArrayListVar
              List<Int>            IntListVar
              char                 CharVal
              Boolean              true
Table 1: Common source code encoding examples encoded by our Code Regularizer

A challenging issue in source code modeling is encoding variable and class object identifiers. Java is a strongly and statically typed language, which means the types of such declarations must be defined before use. Fig. 4 shows an encoding example for variable and class object identifiers. We encode all primitive-type variable identifiers into their resolved data types. In Fig. 4, all instances of the primitive-type variable identifier i are encoded with the resolved data type Int combined with a special token Var. To encode class object identifiers, we use a symbolic resolver to resolve the class object's type. It takes a Java project, analyzes all files in it, and then resolves each instance of the class object. In the example above, the class object identifier admins is replaced with its class type Admin combined with a special token Var.

Furthermore, we encode complex data types such as ArrayList<String>. We encode such a type with its subtype String followed by its base type ArrayList, combined with the special token Var. We leave the special code tokens (true, false, null) unencoded. Unlike literals, variables, and class objects, such tokens reflect constant behavior and do not need encoding. We also leave class method and function names as their original code tokens. This novel encoding approach helps us build an open vocabulary without compromising any code token. Table 1 shows some popular code tokens and their resolved types as encoded by our novel approach.
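The encoding rules above can be illustrated with a small token-level sketch. It is not the AST-based Code Regularizer: the symbol table that maps identifiers to resolved types is assumed to be supplied by a parser and symbol resolver, the literal patterns are simplified regular expressions, and the function names and the `IntVal` label for integer literals are assumptions made for this example.

```python
import re

# Special code tokens that are left unencoded.
KEEP = {"true", "false", "null"}


def encode_generic(type_name):
    """ArrayList<String> -> StringArrayList (subtype first, then base type)."""
    match = re.fullmatch(r"(\w+)<(\w+)>", type_name)
    return match.group(2) + match.group(1) if match else type_name


def encode_token(token, symbols):
    """Encode one token; `symbols` maps identifiers to resolved types."""
    if token in KEEP:
        return token
    if token in symbols:  # variable or class object identifier
        return encode_generic(symbols[token]) + "Var"
    if re.fullmatch(r'"(?:\\.|[^"\\])*"', token):
        return "StringVal"
    if re.fullmatch(r"'(?:\\.|[^'\\])'", token):
        return "CharVal"
    if re.fullmatch(r"\d+\.\d+[fFdD]?", token):
        return "FloatVal"
    if re.fullmatch(r"\d+[lL]?", token):
        return "IntVal"  # label assumed by analogy with FloatVal
    return token  # keywords, method names, punctuation stay unchanged


def encode_line(tokens, symbols):
    """Encode every token of one tokenized statement."""
    return [encode_token(tok, symbols) for tok in tokens]
```

For instance, with `admins` resolved to `Admin` and `i` resolved to `Int`, the statement `admins.add(i);` becomes `AdminVar . add ( IntVar ) ;`, while the method name `add` is kept as-is.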

Programming languages strictly follow the rules defined by their grammar. Each line in a programming language starts with a language-reserved identifier, a variable or class object declaration, an assignment statement, etc., and an assignment statement can only have a variable name, object instance, or array index on its left-hand side. Similarly, programming languages follow block rules such as try-catch/finally and do-while, where one block must follow the other. Capturing such information can help improve the accuracy of source code models.

4.3 Tokenization and Vocabulary Building

In this work, we consider a large vocabulary for modeling source code without compromising any source code token. Unlike previous works White et al., Hindle et al., we do not remove any source code tokens, which helps CodeGRU capture source code regularities much more effectively. To build the vocabulary, we first tokenize the source code files at the token level, as shown in Fig. 5. Each unique source code token corresponds to an entry in the vocabulary. Then, we vectorize the source code so that each source code token has a positive integer value. Table 2 shows the vocabulary statistics, where the first row shows the vocabulary statistics without the Code Regularizer and the second row shows the vocabulary statistics with the Code Regularizer.

Figure 5: Process of building vocabulary for language models.
                            Min     Max      Mean       Median
Without Code Regularizer    5159    37979    15128.6    15128.6
With Code Regularizer       4378    27362    11220.2    9621.0
Table 2: Vocabulary statistics between projects.
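The vocabulary building and vectorization steps can be sketched as below. The function names are hypothetical, and reserving index 0 (for example for padding) is an assumption of this sketch; the text only states that every token receives a positive integer value.

```python
def build_vocabulary(token_streams):
    """Map each unique token to a positive integer in order of first use."""
    vocab = {}
    for tokens in token_streams:
        for tok in tokens:
            if tok not in vocab:
                # Index 0 is left unused (assumed reserved for padding).
                vocab[tok] = len(vocab) + 1
    return vocab


def vectorize(tokens, vocab):
    """Replace each token with its vocabulary index."""
    return [vocab[tok] for tok in tokens]
```

Because identifiers and literal values are replaced by type-based tokens before this step, many distinct raw tokens collapse into the same vocabulary entry, which is where the vocabulary reduction reported in Table 2 comes from.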

4.4 Variable Size Context Learning

The proposed CodeGRU uses a GRU Cho et al. to perform variable-size context learning. CodeGRU takes a source code program as a line-wise sequence of code tokens X. The goal is to produce the next token y that satisfies the context of X. Let $S_l$ denote the code statement at line $l$. A source code program can then be represented as $X = (X_1, X_2, \ldots, X_L)$, where $l$ is the line number and $X_l$ is the tokenization of $S_l$. CodeGRU breaks each $X_l$ into several sub-sequences by increasing the context size $c$ on each iteration. It learns the source code context at line $l$ and keeps increasing $c$ until it reaches the upper bound $|X_l|$, the number of tokens in the line. When it reaches this upper bound, it increases $l$ and keeps learning the source code context. The proposed approach removes the limitation of the sliding-window approach used by previous models White et al., Hindle et al., Nguyen et al. [a], where the right context for the next code token may not be captured in the given window. In this way, CodeGRU iteratively learns over the source code with a variable context size, which improves source code modeling.
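One possible reading of this procedure is sketched below: for every line, the context grows one token at a time and the following token is the prediction target. Whether the context is reset at each new line or carried over from previous lines is not stated in the text; the sketch assumes a reset, and the function names are hypothetical. A fixed-size sliding window is included for contrast.

```python
def variable_context_samples(lines):
    """Yield (context, next_token) pairs with a context that grows per line."""
    for tokens in lines:
        # The context size runs from 1 up to the line length minus one.
        for size in range(1, len(tokens)):
            yield tokens[:size], tokens[size]


def fixed_window_samples(tokens, n):
    """Yield (context, next_token) pairs from a sliding window of size n."""
    for i in range(n, len(tokens)):
        yield tokens[i - n:i], tokens[i]
```

For the line `do { IntVar ++ ;`, the growing context eventually contains `do` when the model predicts later tokens, whereas a window of size 2 has already dropped it by the time `;` is predicted.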

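Further, we expect CodeGRU to assign a high probability to the correct next source code token, which corresponds to a low cross-entropy. Cross-entropy is a cost function that measures how well the model performs. A low cross-entropy value indicates a good model. It can be expressed as

$$H(y, \hat{y}) = -\sum_{i=1}^{|V|} y_i \log \hat{y}_i \qquad (6)$$

where $y$ is the one-hot vector of the actual next code token, $\hat{y}$ is the predicted probability distribution, and $|V|$ is the vocabulary size.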
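As a small worked example, the following sketch computes the categorical cross-entropy for a single prediction and its mean over several predictions. It assumes a one-hot target, so the sum reduces to the negative log-probability assigned to the correct token; the function names are hypothetical.

```python
import math


def cross_entropy(predicted, target_index):
    """Categorical cross-entropy for a one-hot target: -log p(target)."""
    return -math.log(predicted[target_index])


def mean_cross_entropy(predictions, targets):
    """Average cross-entropy over a list of predictions and target indices."""
    total = sum(cross_entropy(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)
```

A model that puts probability 0.9 on the correct token has a much lower loss than one that puts 0.1 on it, which matches the statement that a low cross-entropy indicates a good model.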


5 Empirical Evaluation

In this section, we provide an empirical evaluation of CodeGRU. We train and evaluate our models on an Intel(R) Xeon(R) CPU E5-2620 v4 2.10GHz with 16 cores and 64GB of RAM running the CentOS 7 operating system, equipped with an NVIDIA Tesla K40m with 12GB of GPU memory and 2880 CUDA cores. The average train and test time statistics are summarized in Table 3. Table 3 also shows the source code suggestion and source code generation times. On average, it takes 1-7 days to fully train and evaluate a single project and 10-30 minutes to test the trained project. Source code suggestion and source code generation each take less than 30 milliseconds.

To evaluate the performance of CodeGRU, we aim at answering the following questions:

  • RQ1: Does the proposed approach outperform the state-of-the-art approaches?

  • RQ2: How well does CodeGRU perform in source code suggestion and source code generation tasks?

  • RQ3: To what extent does CodeGRU help reduce the vocabulary size?

  • RQ4: In what applications can CodeGRU be applied to help software developers?

To answer RQ1, we compare the performance of the proposed approach with the state-of-the-art approaches White et al., Hindle et al., Nguyen et al. [a] to measure the performance improvement. To answer RQ2, we evaluate CodeGRU using the mean reciprocal rank (MRR) against the same state-of-the-art approaches to assess how well it performs on source code suggestion and generation tasks. To answer RQ3, we provide vocabulary statistics with and without our proposed approach. To answer RQ4, we discuss and visualize two case studies in Section 6.

Train (days)    Test (minutes)    Code Suggestion (milliseconds)    Code Generation (milliseconds)
1-7             10-30             <30                               <30
Table 3: Average time statistics of deep learning based models.

5.1 Dataset

To build our code database, we collected open-source Java projects from GitHub, summarized in Table 4. We chose the projects used in previous works White et al., Hindle et al., Nguyen et al. [a]. Table 4 shows the version of each project, the total number of code lines, the total code tokens, and the unique code tokens found in each project without any transformation. We split each project into ten equal folds, of which one fold is used for testing and the rest are used for training. After training each model, we rebuild the model from scratch so that the previous model’s training has no impact on the new model.
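The ten-fold split can be sketched as follows. The text does not say whether the split is made over files, lines, or tokens, nor whether every fold is rotated into the test role, so this sketch assumes a split over source files and yields all ten rotations; taking only the first yielded pair gives a single nine-to-one split. The function name is hypothetical.

```python
def ten_fold_splits(files, folds=10):
    """Yield (train, test) file lists; each fold serves once as the test set."""
    # Round-robin assignment keeps the folds as equal in size as possible.
    chunks = [files[i::folds] for i in range(folds)]
    for k in range(folds):
        test = chunks[k]
        train = [f for j, chunk in enumerate(chunks) if j != k for f in chunk]
        yield train, test
```

Keeping the test fold completely separate from the training folds, and rebuilding the model between runs, ensures that no information leaks from one trained model into the next.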

                                   Code Tokens
Projects     Version    LOC       Total      Unique
ant          1.10.5     84076     555216     12334
cassandra    3.11.3     90087     796988     15669
db4o         7.2        146721    1005485    17333
jgit         5.1.3      101777    786703     14020
poi          4.0.0      258230    1992844    37979
maven        3.6.0      49152     373115     7105
batik        1.10.0     72683     494842     12927
jts          1.16.0     56544     432047     9082
itext        5.5.13     141435    1128560    19678
antlr        4.7.1      35828     264077     5159
Table 4: List of Java projects used for evaluation. Each project is open source and was gathered at its latest version at the time of this study. The table shows the name of the project, the version of the project, the lines of code (LOC) count, the total code token count, and the unique code tokens in each project.

5.2 Baselines

We train several baseline models for the evaluation of this work. In this section, we describe the baselines used for comparison.

We compare our work against the n-gram model used in Hindle et al., the RNN model used in White et al., and the DNN model used in Nguyen et al. [a]. Further, we train three different models: RNN+, GRU, and CodeGRU. RNN+ is a variation of the vanilla RNN trained using our proposed approach. The GRU based model is trained similarly to previous works Hindle et al., White et al., by treating source code as simple text with a fixed-size window. The CodeGRU model is trained using our proposed approach with variable-size context, as described in Section 4. For training, we use valid Java source code files. Each source code file is first tokenized and vectorized, as discussed in Section 4.3. Then we encode each vector into a binary encoded matrix. This encoding is also known as one-hot encoding, where every column in a row is zero except the column at the token's vocabulary index, which has the value one. Then, we map the vocabulary to continuous feature vectors of dense size 300, similar to Word2Vec Rong [2014]. This approach helps us build a dense vector representation for each vocabulary index without compromising the semantic meaning of the source code tokens.
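The one-hot encoding step can be illustrated with a few lines of Python. The function names are hypothetical and the sketch uses zero-based column indices, which is an assumption; with positive vocabulary values, the index would be shifted accordingly.

```python
def one_hot(index, vocab_size):
    """Return a row of zeros with a single one at the given column index."""
    row = [0] * vocab_size
    row[index] = 1
    return row


def one_hot_matrix(indices, vocab_size):
    """Encode a vectorized token sequence as a binary matrix, one row per token."""
    return [one_hot(i, vocab_size) for i in indices]
```

Each row is as wide as the vocabulary, which is why the encoding is followed by a mapping to a dense 300-dimensional vector: the dense representation is far smaller than a row with tens of thousands of columns.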

We use settings similar to those in previous studies White et al., Hindle et al., Nguyen et al. [a]. We train a 7-gram model with Good-Turing smoothing. Table 5 shows the architecture of the deep learning based models. We use 300 hidden units for each model. We use the Adam Kingma and Ba [2014] optimizer with a learning rate of 0.001. To control overfitting, we use Dropout Gajbhiye et al. [2018] at a rate of 0.2. We run each model for 100 epochs with a batch size of 512. When a model is fully trained, we save the trained model in hierarchical data format 5 (HDF5) along with the model settings. We train four models simultaneously to get the most out of the NVIDIA Tesla K40m CUDA cores.

                Type                         Size     Activations
Input           Code embedding               300
Estimator       RNN, GRU                     300      tanh
Over Fitting    Dropout                      0.25
Output          Dense                                 softmax
Loss            Categorical cross entropy
Optimizer       Adam                         0.001
Table 5: Deep learning models architecture summary.

5.3 Metrics

We choose the top-k accuracy and mean reciprocal rank (MRR) metrics, as used in previous works White et al., Hindle et al., Nguyen et al. [a]. We calculate top-k accuracy for k = 1, 2, 3, 5, 10. MRR is a rank-based evaluation metric in which suggestions that occur earlier in the list are weighted higher than those that occur later. MRR is the average of the reciprocal ranks of the suggested code token lists for a given set of code sequences $C$. MRR produces a value between 0 and 1, where the value 1 indicates a perfect source code suggestion model. MRR can be expressed as

$$MRR = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{1}{rank_i} \qquad (7)$$

where $C$ is the set of code sequences and $rank_i$ refers to the index of the first relevant prediction.
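Both metrics are simple to compute from ranked suggestion lists, as the following sketch shows. The function names are hypothetical, and a query whose correct token does not appear in the suggestion list is assumed to contribute a reciprocal rank of zero.

```python
def reciprocal_rank(ranked, truth):
    """1/rank of the first correct suggestion, or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == truth:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(suggestions, truths):
    """Average reciprocal rank over all code sequences."""
    total = sum(reciprocal_rank(r, t) for r, t in zip(suggestions, truths))
    return total / len(truths)


def top_k_accuracy(suggestions, truths, k):
    """Fraction of sequences whose correct token is in the first k suggestions."""
    hits = sum(1 for r, t in zip(suggestions, truths) if t in r[:k])
    return hits / len(truths)
```

For three queries where the correct token is ranked first, second, and missing, the reciprocal ranks are 1, 0.5, and 0, so the MRR is 0.5, while the top-1 accuracy is 1/3 and the top-3 accuracy is 2/3.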

5.4 Results

The accuracy scores of all the models are shown in Table 6. One can see that the simple RNN+ model outperforms the baseline models White et al., Hindle et al., Nguyen et al. [a]. Previous works do not consider the contextual, syntactic, and structural dependencies of source code, whereas our proposed approach does and outperforms them. Table 6 reports the accuracy scores in the same way as previous works. We observe that both the RNN+ and CodeGRU models perform better when the project size is large. The itext project is one of the larger projects in our data set. As Table 6 shows, CodeGRU reaches an accuracy score of 66.52 @ k=1 and 87.25 @ k=10 on it, among the highest in the table, whereas the best previous model reaches a much lower 56.54 @ k=1 and 82.57 @ k=10. Further, Table 6 shows that on most projects the accuracy of the CodeGRU model improves when variable-size context is used, compared to the fixed-size context based GRU model (for example, 66.52 versus 59.48 @ k=1 on itext).

Previous Works Our Work
Projects k N-grams RNN DNN
ant 1 11.31 59.46 60.34 61.63 62.73 62.82
3 12.35 73.94 74.61 77.49 74.72 78.36
5 12.66 77.00 77.09 80.45 77.57 81.14
10 12.76 79.97 80.51 83.33 80.31 83.96
cassandra 1 09.39 51.64 54.74 57.94 52.72 54.25
3 10.64 67.61 68.17 69.29 68.15 70.11
5 10.81 71.81 73.23 78.62 72.19 74.07
10 10.98 75.59 77.07 83.05 75.82 77.67
db4o 1 08.52 50.48 51.21 52.80 53.85 54.52
3 09.23 68.15 69.37 70.26 70.39 71.85
5 09.34 72.61 72.98 74.66 74.31 75.92
10 09.41 76.54 77.12 78.72 77.92 79.77
jgit 1 11.41 58.51 53.80 61.33 60.01 62.29
3 12.96 72.24 70.01 76.12 74.29 76.91
5 13.15 74.65 75.64 79.10 77.48 79.91
10 13.30 78.98 79.75 82.17 80.76 82.98
poi 1 00.12 56.34 59.31 64.10 63.85 66.57
3 17.93 72.96 73.76 78.84 77.48 80.59
5 18.09 76.95 77.47 82.02 80.88 83.69
10 18.24 80.77 81.79 85.41 84.25 86.98
maven 1 12.28 55.17 57.16 60.65 60.93 60.17
3 13.16 64.00 62.86 74.99 73.93 74.54
5 14.03 76.19 77.65 78.13 77.15 77.78
10 14.47 79.39 80.54 81.59 80.37 81.52
batik 1 13.29 47.87 58.89 61.24 51.47 62.61
3 13.97 75.24 76.15 76.09 68.36 76.92
5 14.04 78.24 79.21 80.13 73.12 81.03
10 14.07 80.30 81.71 84.01 76.61 84.86
jts 1 12.66 54.23 57.28 58.49 55.37 59.68
3 14.07 70.83 72.45 74.16 71.65 74.76
5 14.17 74.52 75.21 77.81 75.04 78.45
10 14.34 78.06 80.75 81.58 78.38 82.34
itext 1 12.85 55.33 56.54 66.31 59.48 66.52
3 15.22 72.49 68.01 80.90 73.64 80.74
5 15.58 75.96 78.16 83.84 76.97 84.04
10 15.80 79.42 82.57 87.06 80.25 87.25
antlr 1 16.90 54.04 55.14 57.55 64.88 58.77
3 18.78 73.15 73.33 74.29 77.31 75.61
5 19.35 76.71 77.15 77.92 80.23 79.27
10 19.43 79.81 80.52 81.43 83.13 82.87
Table 6: Top-k accuracy comparison of CodeGRU with previous works, where k is the number of top-ranked suggestions considered.

Table 7 shows the model evaluation results for the code suggestion task. The MRR score lies between 0 and 1, and a higher value indicates better source code suggestion. Even our simple model outperforms the baseline models. The maximum MRR score of our model is 0.744, and the minimum is 0.625. Further, we observe that the MRR scores of RNN+ and CodeGRU increase with the data size. For the itext project, the largest one in our code database, Table 7 shows an MRR score of 0.744, which indicates that our proposed approach gives accurate suggestions within its top five suggestion list.

Previous Works (N-grams, RNN, DNN) | Our Work (RNN+, GRU, CodeGRU)
Projects N-grams RNN DNN RNN+ GRU CodeGRU
ant 0.118 0.677 0.686 0.698 0.686 0.710
cassandra 0.100 0.587 0.596 0.607 0.603 0.625
db4o 0.088 0.569 0.591 0.616 0.620 0.632
jgit 0.122 0.649 0.669 0.686 0.669 0.694
poi 0.169 0.598 0.613 0.692 0.702 0.736
maven 0.129 0.669 0.670 0.676 0.675 0.678
batik 0.136 0.647 0.667 0.693 0.614 0.703
jts 0.133 0.627 0.633 0.664 0.637 0.675
itext 0.141 0.620 0.676 0.732 0.661 0.744
antlr 0.179 0.609 0.615 0.661 0.719 0.678
Table 7: The MRR score comparison of CodeGRU with previous works.

5.5 Impact of Code Sampler and Code Regularize on Vocabulary

Natural language based models are trained on a finite vocabulary, and tokens that are not present in the training data are removed from the test dataset. This approach is not practical for source code, where software developers continuously introduce new variables, class objects, and function names. Previous works Hindle et al. , White et al. follow a similar approach: they remove all code tokens that do not appear in the training data set when building the vocabulary used at testing time. Unlike these approaches, this work does not remove any code token, even if it does not appear at training time. Our work can not only predict the next code token according to the correct syntax of the grammar but can also capture the correct context for the next source code suggestion. CodeGRU is capable of suggesting the next code token even in the presence of an unseen token. Our novel Code Sampler and Code Regularize techniques help build an open vocabulary without discarding any source code token. Table 8 shows the vocabulary statistics with and without our proposed approach. Our approach reduces the vocabulary size by roughly 10% to 54%, depending on the project.

Projects | Code Tokens | Vocabulary (without our approach) | Vocabulary (with our approach) | % Decrease
ant 555216 12334 9578 22.34%
cassandra 796988 15669 14055 10.30%
db4o 1005485 17333 14800 14.61%
jgit 786703 14020 11222 19.96%
poi 1992844 37979 27362 27.95%
maven 373115 7105 5957 16.16%
batik 494842 12927 9664 25.24%
jts 432047 9082 6131 32.49%
itext 1128560 19678 9055 53.98%
antlr 264077 5159 4378 15.14%
Table 8: Impact of Code Sampler and Code Regularize on vocabulary. The table shows the total count of code tokens, the vocabulary size without our approach, the vocabulary size with our proposed approach, and the resulting percentage decrease.
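The percentage decrease in Table 8 follows directly from the two vocabulary sizes. The short Python sketch below recomputes it from the values in the table; the dictionary layout and the function name are illustrative assumptions.

```python
# Vocabulary sizes copied from Table 8: (without our approach, with our approach).
vocab_sizes = {
    "ant": (12334, 9578),
    "cassandra": (15669, 14055),
    "db4o": (17333, 14800),
    "jgit": (14020, 11222),
    "poi": (37979, 27362),
    "maven": (7105, 5957),
    "batik": (12927, 9664),
    "jts": (9082, 6131),
    "itext": (19678, 9055),
    "antlr": (5159, 4378),
}


def percent_decrease(before: int, after: int) -> float:
    """Relative reduction in vocabulary size, in percent."""
    return 100.0 * (before - after) / before


for project, (before, after) in vocab_sizes.items():
    print(f"{project}: {percent_decrease(before, after):.2f}%")
```

The smallest reduction is for cassandra and the largest for itext, which is where the quoted range for the vocabulary reduction comes from.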

5.6 Impact of Variable Size Context based Learning

To evaluate the impact of our variable size context learning on model performance, we train the RNN+ model as described earlier in section 5.2. We set the upper bound of the context size to the value used in previous works White et al. , Hindle et al. . Table 7 shows that RNN+ gains an MRR score of 0.732 with variable length context on the itext project, while the score drops to 0.620 for the RNN model when a fixed size context window is used. We further evaluate both models on the code generation task. Table 9 shows that RNN+ with variable size context learning can generate the whole next sequence with accurate context and syntax, while the fixed size context RNN model fails to generate accurate sequences. This indicates that variable size context learning over code sequences substantially improves both the prediction of the next code token and the generation of the whole next code sequence. We also observe that CodeGRU accurately predicts the next code token within its top five suggestions almost all the time. We visualize CodeGRU in detail in section 6 for the code suggestion and code generation tasks.

Code Input | Variable size context | Fixed size context
}While | }While ( intvar >= intval ) | }While () != null )
if ( | if ( intvar < intval ) | if ( xslftablecellvar [ intvar +
String StringVar | String StringVar = stringval | String StringVar areareference stringval
Table 9: The impact of variable size context versus fixed size context on CodeGRU for the code generation task.
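To illustrate the difference between the two training regimes, the following Python sketch builds (context, next token) training pairs from one tokenized code sequence. It is only an assumption about how such samples could be constructed; the exact sampling procedure and the context bound used by CodeGRU are not reproduced here, and the function names and `max_context` value are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[List[str], str]


def variable_context_samples(tokens: List[str], max_context: int) -> List[Sample]:
    """Every prefix (up to max_context tokens) predicts the token that follows it."""
    samples = []
    for i in range(1, len(tokens)):
        start = max(0, i - max_context)
        samples.append((tokens[start:i], tokens[i]))
    return samples


def fixed_context_samples(tokens: List[str], window: int) -> List[Sample]:
    """Only full windows of exactly `window` tokens are used as context."""
    return [(tokens[i - window:i], tokens[i]) for i in range(window, len(tokens))]


tokens = ["for", "(", "int", "intvar", "=", "intval", ";"]

variable = variable_context_samples(tokens, max_context=20)  # hypothetical bound
fixed = fixed_context_samples(tokens, window=3)

print(variable[0])  # shortest context: a single token
print(fixed[0])     # first full window of three tokens
print(len(variable), len(fixed))
```

With variable size context the model also sees short prefixes such as a lone `for`, which is the situation a developer is in when starting a new statement; a fixed window never produces such samples.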

6 Use Cases of CodeGRU

In this section, we will discuss and visualize two use cases of CodeGRU, (1) code suggestion, which aims to suggest multiple predictions for the next code token and (2) code generation, which aims to generate the whole next source code sequence.

6.1 Code Suggestion

CodeGRU ranks the candidate next code tokens by their likelihood given the context. Fig. 6 (a) shows an example of the code suggestion task in which a software developer is writing a simple for loop at line six. The most probable suggestion should be int, but Visual Studio Code does not offer it. Fig. 6 (b) shows that CodeGRU suggests the correct next code token int at the top of its suggestion list. Table 10 shows that CodeGRU successfully suggests the next code token within its top three suggestions almost all the time. We also observe that CodeGRU suggests the next source code token by capturing the correct context and syntax according to the Java language grammar.

Figure 6: Source code suggestion and generation examples in the Visual Studio Code IDE using our novel CodeGRU model.
Code Input | Top three code suggestions with CodeGRU
for ( | ['int', 'codegeneratorextension', 'string']
private static final | ['string', 'class', 'int']
StringArrayListVar . | ['add', 'addall', 'write']
if ( StringArrayListVar . | ['contains', 'add', 'addall']
for ( String StringVar : | ['stringarrayvar', 'stringlistvar', 'stringsetvar']
}catch ( | ['exception', 'ioexception', 'recognitionexception']
listvar . | ['add', 'seek', 'antlr']
for ( int intvar = intval | [';', '(', '.']
}else | ['{', 'if', 'return']
hashmapvar . | ['add', 'close', 'write']
} while | ['(', ';', '=']
Table 10: Use case of CodeGRU for code suggestion task. The input is a variable length code and the output is the list of top-ranked suggestions recommended by CodeGRU.
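Ranking suggestions by likelihood can be sketched as a softmax over the model's output scores followed by a sort. In the Python example below the vocabulary and the scores are made-up values chosen for illustration; they are not outputs of CodeGRU, and the function names are assumptions.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_suggestions(scores: List[float], vocab: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most likely tokens with their probabilities, best first."""
    probs = softmax(scores)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical vocabulary and output scores for the input "for (".
vocab = ["int", "string", "codegeneratorextension", ";", "{"]
scores = [3.0, 1.0, 2.0, -1.0, 0.0]

suggestions = top_k_suggestions(scores, vocab, k=3)
print([token for token, _ in suggestions])
```

The same ranked list is what the top-k accuracy and MRR metrics are computed over.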

6.2 Code Generation

Unlike previous works White et al. , Hindle et al. , Nguyen et al. [a], CodeGRU can generate the whole next code sequence. We take the same example discussed earlier in Fig. 6 (a), where a developer is writing for ( at line 6 in Visual Studio Code. As Fig. 6 (c) shows, CodeGRU successfully captures the correct context and suggests the whole code sequence for( int intvar = intval ; intvar < intvar ; intvar ++ ) with correct syntax. This can substantially reduce the effort required of a software developer. Table 11 summarizes some common code generation examples produced by our CodeGRU model.

Code Input | Code generation with CodeGRU
for ( int | for( int intvar = intval ; intvar < intvar ; intvar ++ ) {
public static | public static class nongreedyconfigs extends baselexertestdescriptor
} While | } While ( intvar != token . eof ) ;
if ( stringarrayvar | if ( stringarrayvar . length == intval ){ system . gettext
String StringVar | String StringVar = stringval ; assertequals ( stringvar , stringvar ) ;
private static final | private static final string [] stringarrayvar = stringarrayval ;
while ( stringvar | while ( stringvar != null ) { stringbuildervar . intvar == intval ;
String temp () { return | String temp () { return stringvar ;
StringArrayListVar . | StringArrayListVar . add ( stringval
} catch ( | } catch ( exception exceptionvar )
} else if | } else if ( intvar >= intvar
Table 11: Use case of CodeGRU for code generation task, where the input is a variable length code and output is the next code sequence generated by our CodeGRU along with correct context and syntax.
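Sequence generation of this kind can be sketched as a greedy loop that repeatedly appends the most likely next token to the context. In the Python example below, `toy_next_token` is a hypothetical stand-in for the trained model's top-1 prediction (it simply looks up canned sequences), and the stopping rule is an assumption; neither is taken from the CodeGRU implementation.

```python
from typing import Callable, List, Optional

# Canned continuations standing in for a trained model (illustrative only).
KNOWN_SEQUENCES = [
    ["for", "(", "int", "intvar", "=", "intval", ";",
     "intvar", "<", "intvar", ";", "intvar", "++", ")"],
    ["}", "catch", "(", "exception", "exceptionvar", ")"],
]


def toy_next_token(context: List[str]) -> Optional[str]:
    """Return the token following `context` in a known sequence, if any."""
    n = len(context)
    for seq in KNOWN_SEQUENCES:
        if n < len(seq) and seq[:n] == context:
            return seq[n]
    return None  # the toy model has nothing more to predict


def generate(prefix: List[str],
             next_token: Callable[[List[str]], Optional[str]],
             max_new_tokens: int = 20) -> List[str]:
    """Greedy decoding: feed the growing context back in, one token at a time."""
    output = list(prefix)
    for _ in range(max_new_tokens):
        token = next_token(output)
        if token is None:
            break
        output.append(token)
    return output


print(" ".join(generate(["for", "(", "int"], toy_next_token)))
print(" ".join(generate(["}", "catch", "("], toy_next_token)))
```

With a real model, `next_token` would return the highest-probability token from the suggestion ranking, and generation would stop at a chosen end condition such as a statement terminator or a length limit.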

7 Threats to Validity

Internal: All models are developed using keras version 2.2 with a tensorflow version 1.1 backend. Although our experiments are detailed and the results show the effectiveness of our approach, neural networks are still in their infancy. Changing the neural network settings or evaluating with a different project version may lead to different results.

External: All the source code projects used in this study are collected from GitHub, a well-known source code repository provider. The projects used in this study do not necessarily represent source code in other languages, or Java source code in its entirety. Using different projects or languages may affect the performance of our approach.

8 Conclusion and Future Work

This paper proposed CodeGRU, a novel approach for source code modeling that captures the contextual, structural and syntaxtual dependencies of source code. Unlike previous works, we do not treat source code as simple text. This work introduces several new components, such as Code Sampler and Code Regularize. These components reduce the vocabulary size by roughly 10% to 54% and help overcome the out-of-vocabulary issue. Further, CodeGRU can learn source code with variable size context. This work has shown that CodeGRU can not only predict the next code token according to the correct context and syntax of the grammar but can also predict variable names, class objects, class methods and more. We have also visualized the use cases of CodeGRU for the code suggestion and code generation tasks. With our novel approach, CodeGRU suggested the next code token within its top three suggestions almost all the time. Moreover, it is capable of generating the whole next code sequence, which is difficult for previous works to do.

In the future, we would like to evaluate our approach on dynamically typed languages such as Python. In dynamically typed languages, a source code token can have different types, which makes token types difficult to capture. We also aim to provide an end-to-end solution that lets software developers use these models directly. Another limitation of deep learning based approaches is computation power: training a new model requires additional resources. A typical software developer cannot afford a server or GPU based computer to train and use these models. There is a need to centralize these language models so that they can directly benefit software developers with minimum effort.

Acknowledgements: We are thankful to all the anonymous reviewers who helped us greatly improve our work.

This work was supported by the National Key R&D Program of China [grant no. 2018YFB1003902].


References

  • Jav [2018] Java programming language parser. 2018. URL
  • [2] Miltiadis Allamanis and Charles Sutton. Mining Idioms from Source Code. pages 472–483. doi: 10.1145/2635868.2635901.
  • Allamanis et al. [a] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. Suggesting accurate method and class names. Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering - ESEC/FSE 2015, pages 38–49, a. ISSN 0362-1340. doi: 10.1145/2786805.2786849.
  • Allamanis et al. [b] Miltiadis Allamanis, Hao Peng, and Charles Sutton. A Convolutional Attention Network for Extreme Summarization of Source Code. b.
  • Allamanis et al. [c] Miltiadis Allamanis, Daniel Tarlow, and Andrew D Gordon. Bimodal Modelling of Source Code and Natural Language. Icml, pages 2123–2132, c.
  • Bellegarda [2000] J R Bellegarda. Large vocabulary speech recognition with multispan statistical language models. IEEE Transactions on Speech and Audio Processing, 8(1):76–84, 2000. ISSN 1063-6676. doi: 10.1109/89.817455.
  • [7] Adam Berger and John Lafferty. Information retrieval as statistical translation. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR ’99, (2):222–229. ISSN 09042512. doi: 10.1145/312624.312681.
  • [8] Pavol Bielik, Veselin Raychev, and Martin Vechev. PHOG: Probabilistic Model for Code. International Conference on Machine Learning (ICML’16), pages 2933–2942.
  • Cambria and White [2014] Erik Cambria and Bebo White. Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine, 9(2):48–57, 2014. ISSN 1556603X. doi: 10.1109/MCI.2014.2307227.
  • [10] Wing-Kwan Chan, Hong Cheng, and David Lo. Searching Connected API Subgraph via Text Phrases. Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, pages 10:1–10:11. doi: 10.1145/2393596.2393606.
  • Chen and Xing [2016] Chunyang Chen and Zhenchang Xing. Similartech: Automatically recommend analogical libraries across different programming languages. pages 834–839, 2016. doi: 10.1145/2970276.2970290. URL
  • [12] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. ISSN 09205691. doi: 10.3115/v1/D14-1179.
  • Corporation [2018] Oracle Corporation. Javac - Java programming language compiler. 2018. URL
  • [14] Hoa Khanh Dam, Truyen Tran, and Trang Pham. A deep language model for software code. pages 1–4.
  • Dsouza et al. [2016] Andrea Renika Dsouza, Di Yang, and Cristina V. Lopes. Collective intelligence for smarter API recommendations in python. Proceedings - 2016 IEEE 16th International Working Conference on Source Code Analysis and Manipulation, SCAM 2016, pages 51–60, 2016. ISSN 1942-5430. doi: 10.1109/SCAM.2016.22.
  • Santos et al. [2018] Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and Jose Nelson Amaral. Syntax and sensibility: Using language models to detect and correct syntax errors. 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 311–322, 2018. doi: 10.1109/SANER.2018.8330219.
  • Franks et al. [2015] Christine Franks, Zhaopeng Tu, Premkumar Devanbu, and Vincent Hellendoorn. Cacheca: A cache language model based code suggestion tool. pages 705–708, 2015. URL
  • [18] K Funahashi and Y Nakamura. Approximation of dynamical systems by continuous time recurrent neural networks. Neural Networks, (6):801–806. ISSN 08936080. doi: 10.1016/S0893-6080(05)80125-X.
  • Gajbhiye et al. [2018] A. Gajbhiye, S. Jaf, N. A. Moubayed, A. S. McGough, and S. Bradley. An Exploration of Dropout with RNNs for Natural Language Inference. ArXiv e-prints, October 2018.
  • [20] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. Deep API learning. Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering - FSE 2016, pages 631–642. ISSN 9781450321389. doi: 10.1145/2950290.2950334.
  • Guerrouj et al. [2015] Latifa Guerrouj, David Bourque, and Peter C. Rigby. Leveraging Informal Documentation to Summarize Classes and Methods in Context. Proceedings - International Conference on Software Engineering, 2:639–642, 2015. ISSN 02705257. doi: 10.1109/ICSE.2015.212.
  • [22] Rahul Gupta, Aditya Kanade, and Shirish Shevade. Deep Reinforcement Learning for Programming Language Correction.
  • Gupta et al. [2017] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. DeepFix : Fixing Common C Programming Errors by Deep Learning. Aaai’17, 1(Traver):1345–1351, 2017.
  • [24] Vincent J Hellendoorn and Premkumar Devanbu. Are Deep Neural Networks the Best Choice for Modeling Source Code? Fse, pages 763–773. doi: 10.1145/3106237.3106290.
  • [25] Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the Naturalness of Software. Proceedings of the 34th International Conference on Software Engineering, (5):837–847. ISSN 00010782. doi: 10.1145/2902362.
  • [26] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Summarizing Source Code using a Neural Attention Model. ACL, pages 2073–2083.
  • [27] Svetoslav Karaivanov, Veselin Raychev, and Martin Vechev. Phrase-Based Statistical Translation of Programming Languages. Onward, pages 173–184. doi: 10.1145/2661136.2661148.
  • [28] Iman Keivanloo, Juergen Rilling, and Ying Zou. Spotting working code examples. Proceedings of the 36th International Conference on Software Engineering - ICSE 2014, pages 664–675. ISSN 02705257. doi: 10.1145/2568225.2568292.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL
  • Kumar and Singh [2015] Balwinder Kumar and Satwinder Singh. Code Clone Detection and Analysis Using Software Metrics and Neural Network-A Literature Review. 3(2):127–132, 2015.
  • [31] Chris J. Maddison and Daniel Tarlow. Structured Generative Models of Natural Source Code. arXiv preprint arXiv:1401.0514, pages 1–17.
  • [32] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60. ISSN 1098-6596. doi: 10.3115/v1/P14-5010.
  • Mikolov et al. [2010] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernocký, and Sanjeev Khudanpur. Recurrent neural network based language model. Interspeech, (September):1045–1048, 2010.
  • Nguyen et al. [a] Anh Tuan Nguyen, Trong Duc Nguyen, Hung Dang Phan, and Tien N. Nguyen. A deep neural network language model with contexts for source code. 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 323–334, a. ISSN 1687-5966. doi: 10.1109/SANER.2018.8330220.
  • Nguyen et al. [2012] Anh Tuan Nguyen, Tung Thanh Nguyen, Hoan Anh Nguyen, Ahmed Tamrawi, Hung Viet Nguyen, Jafar Al-Kofahi, and Tien N. Nguyen. Graph-based pattern-oriented, context-sensitive source code completion. pages 69–79, 2012. URL
  • Nguyen et al. [b] Tam The Nguyen, Hung Viet Pham, Phong Minh Vu, and Tung Thanh Nguyen. Learning API Usages from Bytecode: A Statistical Approach. b. ISSN 02705257. doi: 10.1145/2884781.2884873.
  • [37] Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract Syntax Networks for Code Generation and Semantic Parsing. doi: 10.18653/v1/P17-1105.
  • [38] Martin Rajman and Romaric Besançon. Text mining: natural language techniques and text mining applications. Data mining and reverse engineering, pages 50–64. ISSN 17980461. doi: 10.4304/jetwi.1.1.60-76.
  • Ranjan and Ahmad [2016] Nihar Ranjan and Saim Ahmad. A Survey on Techniques in NLP. 134(8):6–9, 2016. doi: 10.15662/IJAREEIE.2016.0509045.
  • Raychev et al. [a] Veselin Raychev, Pavol Bielik, Martin Vechev, and Andreas Krause. Learning programs from noisy data. Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages - POPL 2016, (1):761–774, a. ISSN 07308566. doi: 10.1145/2837614.2837671.
  • Raychev et al. [b] Veselin Raychev, Martin Vechev, and Eran Yahav. Code completion with statistical language models. Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation - PLDI ’14, pages 419–428, b. ISSN 03621340. doi: 10.1145/2594291.2594321.
  • Rong [2014] Xin Rong. word2vec parameter learning explained. CoRR, abs/1411.2738, 2014.
  • Sethi et al. [a] Akshay Sethi, Anush Sankaran, Naveen Panwar, Shreya Khare, and Senthil Mani. DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers. a.
  • Sethi et al. [b] Akshay Sethi, Anush Sankaran, Naveen Panwar, Shreya Khare, and Senthil Mani. DLPaper2Code: Auto-generation of Code from Deep Learning Research Papers. b.
  • Thung et al. [2013] Ferdian Thung, Shaowei Wang, David Lo, and Julia Lawall. Automatic recommendation of API methods from feature requests. 2013 28th IEEE/ACM International Conference on Automated Software Engineering, ASE 2013 - Proceedings, (November):290–300, 2013. doi: 10.1109/ASE.2013.6693088.
  • [46] Zhaopeng Tu, Zhendong Su, and Premkumar Devanbu. On the localness of software. Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering - FSE 2014, pages 269–280. doi: 10.1145/2635868.2635875.
  • [47] Tao Xie and Jian Pei. MAPO: Mining API Usages from Open Source Repositories. pages 54–57.
  • Wang et al. [2013] Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, and Dongmei Zhang. Mining succinct and high-coverage API usage patterns from source code. IEEE International Working Conference on Mining Software Repositories, pages 319–328, 2013. ISSN 21601852. doi: 10.1109/MSR.2013.6624045.
  • [49] Martin White, Christopher Vendome, Mario Linares-Vasquez, and Denys Poshyvanyk. Toward Deep Learning Software Repositories. 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pages 334–345. ISSN 21601860. doi: 10.1109/MSR.2015.38.
  • [50] Yan Xiao, Jacky Keung, Kwabena E. Bennin, and Qing Mi. Machine translation-based bug localization technique for bridging lexical gap. Information and Software Technology, (August 2017):58–61. ISSN 09505849. doi: 10.1016/j.infsof.2018.03.003.
  • [51] Pengcheng Yin and Graham Neubig. A Syntactic Neural Model for General-Purpose Code Generation. doi: 10.18653/v1/P17-1041.
  • Young et al. [a] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent Trends in Deep Learning Based Natural Language Processing. pages 1–24, a.
  • Young et al. [b] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent Trends in Deep Learning Based Natural Language Processing. pages 1–31, b.