Siri, Write the Next Method

03/08/2021 · by Fengcai Wen, et al.

Code completion is one of the killer features of Integrated Development Environments (IDEs), and researchers have proposed different methods to improve its accuracy. While these techniques are valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a few APIs that a developer is likely to use next. We present FeaRS, a novel retrieval-based approach that, given the current code a developer is writing in the IDE, can recommend the next complete method (i.e., signature and method body) that the developer is likely to implement. To do this, FeaRS exploits "implementation patterns" (i.e., groups of methods usually implemented within the same task) learned by mining thousands of open source projects. We instantiated our approach to the specific context of Android apps. A large-scale empirical evaluation we performed across more than 20k apps shows encouraging preliminary results, but also highlights future challenges to overcome.


I Introduction

Developing high-quality software while reducing time-to-market are two classical contrasting objectives in the software industry. This translates into the need for increasing the productivity of software developers, by lowering their learning curves when dealing with unfamiliar code, and by maximizing the quality of the code they write. In response to these needs, researchers have proposed recommender systems for software engineering, defined by Robillard et al. as “applications that provide information items valuable for a software engineering task in a given context” [33].

Some recommender systems pursue a long-lasting dream of software engineering research: The (semi-)automatic generation of source code. The goal of these tools is to speed up the implementation of new code. Code completion techniques are nowadays one of the killer features of IDEs [24]. Researchers have proposed different methods to improve code completion accuracy and, more generally, its capabilities [9, 16, 26, 41, 32, 28, 25]. While these approaches are certainly valuable to speed up code writing, they are limited to recommendations related to the next few tokens a developer is likely to type given the current context. In the best case, they can recommend a sequence of APIs that a developer is likely to use next [26, 28].

We aim at reaching the next level in supporting developers during the writing of new code. We present FeaRS, an approach and an IDE plugin which monitors the code written by Android developers in the IDE and is able to recommend the complete code of the next method (i.e., signature and method body) they are likely to implement based on method(s) they already have implemented.

FeaRS relies on a set of implementation patterns that we built by mining 20,713 open-source Android apps available on GitHub. To give a concrete example, the code snippet in Fig. 1 implements an options menu in an Android app. To perform such a task, tutorials recommend, as a first step, inflating the menu in the onCreateOptionsMenu(...) method and, then, handling the item selection in the onOptionsItemSelected(...) method. Assuming the existence of this implementation pattern in several apps, FeaRS can learn it and recommend the implementation of onOptionsItemSelected(...) once onCreateOptionsMenu(...) has been implemented by the developer.

Fig. 1: An implementation pattern in Android
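For illustration, here is a minimal Java sketch of the two methods involved in this pattern, as they would appear inside an Activity; the resource identifiers (R.menu.main_menu, R.id.action_settings) are placeholders rather than code from any specific app.

import android.app.Activity;
import android.view.Menu;
import android.view.MenuItem;

public class MainActivity extends Activity {
    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Step 1 of the pattern: inflate the menu, adding its items
        // to the action bar (R.menu.main_menu is a placeholder id).
        getMenuInflater().inflate(R.menu.main_menu, menu);
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        // Step 2, the method FeaRS would recommend: handle clicks
        // on the inflated menu items (R.id.action_settings is a placeholder).
        if (item.getItemId() == R.id.action_settings) {
            // ... react to the selected item ...
            return true;
        }
        return super.onOptionsItemSelected(item);
    }
}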

We analyzed 2,721,800 commits performed during the history of the subject apps to identify new methods that are implemented within the same commit. This results, for each analyzed commit c, in a set Mc of new methods created in c. By extracting this information for thousands of commits, we can identify implementation patterns repeatedly followed by Android developers, e.g., the implementation of a method m1 could imply the implementation of a method m2. We refer to m1 as the Left-Hand Side (LHS) of the pattern and to m2 as the Right-Hand Side (RHS).

The identification of these implementation patterns is far from trivial. Indeed, two commits c1 and c2 performed in two different repositories may implement different sets of new methods (e.g., {m1, m2} and {m3, m4}) that, however, represent the same implementation pattern (i.e., m1 is equivalent to m3 and m2 to m4). Recognizing this situation is necessary to identify groups of methods that are repeatedly implemented together in different commits/apps, and not just by chance in a single/few commit(s).

FeaRS identifies clusters of methods likely to implement the same feature in the overall set of mined added methods. Going back to the previous example, this means that m1 and m3 are assigned to the same cluster C1, and m2 and m4 to C2. This results in the flattening of c1 and c2 to the same implementation pattern (i.e., C1 → C2). Once this processing is done for all mined commits, FeaRS applies association rule discovery [3] on all commits, thus creating the set of implementation patterns it relies on.

When monitoring the code written by a developer in the IDE, FeaRS identifies newly written methods and assigns, if possible, each of them to one of the clusters created in the previous step. Then, it checks if an implementation pattern having one or more of the newly implemented methods as LHS is available and, in case a pattern is found, the corresponding RHS is triggered as a recommendation to the developer.

We evaluated FeaRS in a study in which we simulated its usage in the change history of the same 20,713 apps we used to extract the implementation patterns. We used the first 80% of the apps’ histories to extract the implementation patterns, the subsequent 10% to tune the FeaRS’s parameters, and the last 10% to assess its performance (i.e., test set). For each commit c in the test set, we simulated the scenario in which a developer implemented a subset S of the new methods added in c and used FeaRS to generate recommendations using S as LHS. Then, in case a recommendation is generated, we check if the RHS corresponds to one of the methods actually implemented in c and not part of S.

The achieved results show the feasibility of our approach, but also its strong limitations. Indeed, while FeaRS is able to generate meaningful recommendations for thousands of methods, several of them concern small methods that are not expected to substantially boost the developer’s productivity.

II FeaRS

Fig. 2 depicts the inner working of FeaRS.

Fig. 2: The FeaRS pipeline

The black boxes represent components that we developed; the grey boxes depict external tools we reused and/or adapted.

All components except the Android Studio IDE plugin reside on a central server providing an access point via the FeaRS Web service. Steps 1-7 are executed offline and only once. Step 8 is executed every time the developer completes the implementation of a new method.

II-A Mining Android Apps

The Android apps miner identifies GitHub repositories related to Android apps. Their history is then analyzed to identify methods implemented within the same commit. We use the GitHub APIs to search for repositories satisfying the following criteria:

They are written in Java. While Android is transitioning to Kotlin as the official language, the majority of Android apps is still written in Java [11]. Note that while we instantiated FeaRS to the specific problem of recommending complete methods for Java Android apps, all the steps in Fig. 2 can be customized to any programming language.

They are Android apps. We ensure that the repository contains a build.gradle file with an explicit dependency towards the Android SDKs, indicating the usage of the Gradle build system, the default choice in Android Studio.

They have a limited, but non-trivial change history. We excluded apps with fewer than 100 commits, since we are interested in identifying the new methods added by developers within the same commit. Also, we excluded apps having more than 1,000 commits, since we do not want FeaRS to learn coding patterns peculiar only to a few apps.
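As a rough sketch of how the first criterion can be expressed against the GitHub search API (this is an assumption about the query, not the miner's exact implementation; the build.gradle and commit-count checks must be verified on each candidate repository after cloning, since the search API does not expose them), assuming Java 11+:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AndroidAppsMiner {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // Candidate repositories: Java projects tagged as Android.
        // Authentication and pagination handling are omitted here.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://api.github.com/search/repositories"
                        + "?q=language:java+topic:android&per_page=100"))
                .header("Accept", "application/vnd.github+json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON list of candidate repos
    }
}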

The Android apps miner identified and cloned 20,713 GitHub repositories, the set of apps that we use in this work, available in our replication package [30]. The set can be expanded by re-running the Android apps miner.

II-B Identifying Methods Added in Commits

The set of cloned repositories is provided as input to the History miner (step 2 in Fig. 2). This component extracts the list of commits performed in all branches of each repository by using a git command that allows analyzing all branches of a project without intermixing their history, thus avoiding unwanted effects of merge commits.

History miner uses JavaParser [19] to extract, from the Java files added or modified in each commit, the AST nodes which represent the callable declarations (i.e., methods and constructors). In particular, we are interested in the callable declarations added in each commit. Commits not implementing at least two new methods and/or constructors are excluded at this stage, since we want FeaRS to learn implementation patterns in the form LHS → RHS, where the LHS represents a set of one or more methods and the RHS a method that FeaRS can recommend based on the fact that the developer implemented the methods in the LHS. Thus, assuming the LHS to be a singleton, at least two new methods must be implemented in a commit (i.e., the one in the LHS and the one in the RHS) to make it useful for learning. We excluded commits adding more than 10 new methods (14% of the total number of commits), since these are likely to be tangled commits not representative of any specific implementation pattern [15].
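A minimal sketch of this extraction step using JavaParser is shown below; comparing the signatures of a file's "before" and "after" versions is a simplification of ours, not the exact implementation.

import com.github.javaparser.StaticJavaParser;
import com.github.javaparser.ast.CompilationUnit;
import com.github.javaparser.ast.body.CallableDeclaration;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class AddedCallablesExtractor {
    // Collect the signatures of all callable declarations (methods and
    // constructors) found in a Java source file.
    static Set<String> signaturesOf(String source) {
        CompilationUnit cu = StaticJavaParser.parse(source);
        Set<String> signatures = new HashSet<>();
        cu.findAll(CallableDeclaration.class)
          .forEach(c -> signatures.add(c.getSignature().asString()));
        return signatures;
    }

    public static void main(String[] args) throws Exception {
        String before = Files.readString(Path.of(args[0])); // file at parent commit
        String after  = Files.readString(Path.of(args[1])); // file at commit
        Set<String> added = signaturesOf(after);
        added.removeAll(signaturesOf(before));              // keep new callables only
        added.forEach(System.out::println);
    }
}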

Overall, we processed 2,721,800 commits, of which 841,995 were useful for building FeaRS (i.e., those adding at least two new methods and no more than ten). These commits are provided as input to the module in charge of the methods clustering (step 4 in Fig. 2).

II-C Clustering Similar Methods

To identify recurring implementation patterns in the considered commits, FeaRS applies clustering to group methods added in different commits, possibly from different systems, that implement equivalent or very similar functionalities. Two commits c1 and c2 performed in two different repositories may implement different sets of new methods (e.g., {m1, m2} and {m3, m4}) that represent the same implementation pattern (i.e., m1 is equivalent to m3 and m2 to m4). FeaRS can identify, through association rule discovery, that these sets of methods represent a repetitive implementation pattern.

FeaRS builds a weighted undirected graph. Each method added in any of the commits is considered as a node. The weight on the edges connecting each pair of nodes represents the similarity between the two corresponding methods. To assess similarity we use the publicly available ASIA clone detector [1], since it (i) is designed to capture the similarity between two Android methods; and (ii) returns as output an easily interpretable value from 0 (min similarity) to 1 (max). We customized the ASIA similarity algorithm in two ways.

First, in the original implementation all terms in the two methods to compare are lowercased before computing their textual similarity. This is suboptimal in FeaRS, since high precision in the identification of related methods is fundamental.

Experiments revealed that the similarity of methods is artificially boosted by lowercase transformation: Given two methods m1 and m2, it happens that a term appearing in the name of m1 (e.g., date) is matched with the type of an object appearing in m2 (e.g., Date). By not transforming Date to lowercase, the presence of these two terms does not positively influence the similarity between m1 and m2.

Second, while ASIA uses tf-idf (term frequency-inverse document frequency) as a weighting schema for the terms during the textual similarity computation, we only employ term frequency, because we noticed that a single term appearing in both methods and having a very high idf (i.e., being very rare in the corpus) can result in a high similarity between the two methods, even if they implement completely different features. This is especially true in small methods, due to the low number of terms present in them and the strong impact a single shared term can have on their similarity.

The graph we built contains 2,018,479 nodes. We prune all edges with a weight below a threshold simcluster (this threshold will be tuned in our evaluation). This creates a set of disconnected subgraphs, each one representing a cluster of methods implementing strongly related functionalities. Within each subgraph (i.e., cluster) we identify the cluster centroid: the method with the highest number of edges, which serves as representative for that cluster. The centroid is used later on by the FeaRS Web service when interacting with the IDE plugin.
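The customized similarity can be approximated as a case-sensitive, term-frequency-based cosine similarity. The following Java sketch illustrates the two customizations (no lowercasing, no idf); ASIA's full algorithm includes further clone-detection steps not reproduced here.

import java.util.HashMap;
import java.util.Map;

public class MethodSimilarity {
    // Case-sensitive term frequencies: "date" (identifier) and "Date"
    // (type) remain distinct terms, as required by the first customization.
    static Map<String, Integer> termFrequencies(String methodSource) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : methodSource.split("[^A-Za-z0-9]+")) {
            if (!term.isEmpty()) tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Cosine similarity over raw term frequencies (no idf weighting,
    // the second customization); returns a value in [0, 1].
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += v * v;
        return (normA == 0 || normB == 0) ? 0
                : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // An edge between two methods is kept only if similarity >= simcluster.
    static boolean keepEdge(String m1, String m2, double simCluster) {
        return cosine(termFrequencies(m1), termFrequencies(m2)) >= simCluster;
    }
}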

II-D Association Rule Mining

This module takes as input the list of commits generated by the History miner and the clusters output of the previous step (step 6 in Fig. 2) and creates a text file reporting in each line a set of methods added in the same commit and in the same file, represented through the clusters they belong to. For example, assuming a commit adding three methods m1, m2, and m3 to a file F1, and those methods being assigned to clusters C1, C2, and C3, respectively, the line “C1 C2 C3” will be added to the file. We decided to split methods added in the same commit but in different files to extract more “cohesive” association rules, and to avoid learning recommendations that span different files (i.e., the developer is working on F1 and FeaRS recommends a method to add in F2).

FeaRS analyzes the created file using Association Rule Mining [2] to identify implementation patterns, relying on the arules R package. We use the first 80% of the apps’ commits to extract the association rules, 10% for tuning the parameters of FeaRS, and 10% to evaluate it. The output is a set of association rules in the form {LHS} → RHS, where the LHS can be composed of one or more methods, while the RHS always has a single method. This means that FeaRS can only recommend the next method to implement given the one(s) already implemented by the developer.

There are three parameters that we tune in our evaluation: minimum support (minsup), confidence for the mined rules (minconf), and maximum size of the LHS (maxLHS).

The support (minsup) indicates how frequently a rule is observed in the dataset and, in our case, represents the percentage of analyzed commits that contain the specific rule.

The confidence (minconf) assesses how often a given rule is actually true in the dataset. Given a rule {LHS} → RHS, it is computed as the number of commits implementing in the same file all methods in the LHS and the RHS, divided by the number of commits implementing the LHS in the same file (with or without the RHS). Finally, we also tune the maximum size of the LHS (maxLHS).
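While FeaRS relies on the arules package, the two measures can be illustrated with a few lines of Java over the per-file transactions described above (each transaction being the set of clusters of the methods added in one commit to one file); a minimal sketch mirroring the definitions:

import java.util.List;
import java.util.Set;

public class RuleMetrics {
    // Support of LHS -> RHS: fraction of transactions containing
    // both the whole LHS and the RHS.
    static double support(List<Set<String>> transactions,
                          Set<String> lhs, String rhs) {
        long both = transactions.stream()
                .filter(t -> t.containsAll(lhs) && t.contains(rhs))
                .count();
        return (double) both / transactions.size();
    }

    // Confidence of LHS -> RHS: transactions containing LHS and RHS,
    // divided by transactions containing the LHS (with or without RHS).
    static double confidence(List<Set<String>> transactions,
                             Set<String> lhs, String rhs) {
        long lhsCount = transactions.stream()
                .filter(t -> t.containsAll(lhs)).count();
        long both = transactions.stream()
                .filter(t -> t.containsAll(lhs) && t.contains(rhs))
                .count();
        return lhsCount == 0 ? 0 : (double) both / lhsCount;
    }
}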

II-E The FeaRS Android Studio Plugin

Fig. 3 shows the FeaRS Android Studio IDE plugin.

Fig. 3: The FeaRS Android Studio plugin

The plugin interacts with the server through the Web service (step 8 in Fig. 2). The developer can start and stop FeaRS through simple start and stop icons in the IDE toolbar. By clicking the start icon, FeaRS starts monitoring the code written by the developer and identifies when a new method is added. When this happens, the text of the new methods added by the developer since she pressed the start button is sent to the Web service.

The Web service identifies, for each received method, the cluster it belongs to. Our customized version of the ASIA clone detector computes the similarity between each received method and each centroid representative of the computed clusters. The similarity for the most similar centroid is compared against a threshold simmatch (the fifth and last FeaRS parameter to tune): if the similarity is greater than or equal to simmatch, the method is assigned to the cluster represented by the most similar centroid; otherwise, no match is found and the method is discarded.

All combinations of received methods that are matched with a centroid are used to generate different LHSs. For example, if three methods added by the developer are matched to clusters C1, C2, and C3, we generate 7 possible LHSs: {C1}, {C2}, {C3}, {C1, C2}, {C1, C3}, {C2, C3}, and {C1, C2, C3}.
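The LHS generation corresponds to enumerating all 2^n - 1 non-empty subsets of the n matched clusters (n is small, bounded by the number of methods sent to the Web service); a minimal Java sketch using a bitmask:

import java.util.ArrayList;
import java.util.List;

public class LhsGenerator {
    // All non-empty subsets of the matched clusters; for three clusters
    // this yields exactly the 7 LHSs of the example above.
    static List<List<String>> allLhss(List<String> matchedClusters) {
        int n = matchedClusters.size();
        List<List<String>> lhss = new ArrayList<>();
        for (int mask = 1; mask < (1 << n); mask++) {
            List<String> lhs = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                if ((mask & (1 << i)) != 0) lhs.add(matchedClusters.get(i));
            }
            lhss.add(lhs);
        }
        return lhss;
    }
}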

FeaRS checks if any of these LHSs is equal to the LHS of one of the association rules previously extracted. In case of a match, a recommendation is generated. In the reported example, if {C1, C2} matches a rule {C1, C2} → C4, then the centroid of cluster C4 is returned by the Web service to the plugin as a recommendation. For the same LHS several different RHSs may be recommended. The matching of the LHS of two rules can also lead to redundant recommendations. In the example, let us assume that two rules are matched, one with {C1} and one with {C1, C2} as LHS, and that both of them have C4 as RHS. In this case, the Web service returns the centroid of C4 only once, reporting that it is recommended based on the LHS belonging to the rule having the highest confidence.

The generated recommendations are shown in the IDE as depicted in the bottom part of Fig. 3. Area 2 shows the signatures of the methods implemented by the developer that are part of the LHS of the association rule used to recommend the method shown in area 3 (i.e., the RHS of the rule). In case several recommendations share the same LHS, the plugin displays them as one recommendation, allowing developers to switch between the different RHSs using the arrow buttons above area 3. The buttons at the bottom of the code snippet (area 4) allow the developer to: (i) provide feedback reporting whether the recommendation was useful; (ii) copy the snippet; and (iii) delete the recommendation. The feedback, in our current implementation, is stored but not used. We plan to use it in the future to adjust the confidence of the recommendations. If the developer decides to copy the snippet, a comment documenting the GitHub repository from which the snippet has been taken is added to the code, so that the developer can check its reusability from a legal perspective.

The slider at the top of the plugin GUI (area 1) allows the developer to customize the “chattiness” of the plugin on three different levels. Low, Medium, and High sensitivity are three different FeaRS configurations that resulted from the calibration of its parameters presented in Section IV-A. By moving the slider towards Low, FeaRS becomes more strict and generates fewer, but higher quality, recommendations, while the opposite holds for High.

III Study Design

The goal of this study is to assess the performance of FeaRS when used to recommend the next method to implement given one or more (already implemented) methods as input. It thus addresses the following research question:

RQ: What is the accuracy of FeaRS in recommending complete methods in the context of Android apps?

III-A Context Selection and Data Collection

Fig. 5 overviews the steps in our experimental design. We exploit the dataset of 20,713 Android apps as the context of our study. Then, we split such a dataset into three blocks, namely training, validation, and test. Fig. 4 depicts how we create and use these three sets in our study.

Fig. 4: Data splitting and processing
Fig. 5: Study Design

The black arrows represent the change history of the apps considered in our study. Note that the history of the apps is not aligned, meaning that not all the apps exist in the same time period. The vertical dashed lines show how we divide the change history of the apps.

We use the first 80% to extract the association rules used by FeaRS to generate recommendations. We refer to this subset of the history as the “training set.” The subsequent 10% is used to tune the parameters of FeaRS to identify the best configurations (i.e., “validation set”), which are used to generate recommendations on the “test set” (i.e., the last 10%), with the goal of assessing the performance of FeaRS.

One important clarification: We do not use the first 80% of each repository as the training set, due to the misalignment of the mined change histories. Instead, given the date dstart of the oldest commit present across all analyzed apps and the date dend of the most recent commit, we take the first 80% of the time interval going from dstart to dend as training set. As shown in Fig. 4, this may result in some apps exclusively contributing to the training set (or to the validation/test sets).

However, such a design is needed to avoid using “data from the future” when generating recommendations for the validation and test set and, thus, to simulate a real usage scenario for FeaRS. Indeed, by selecting the first 80% of the history of each app to learn the association rules, it could happen that a given app A1 has the last commit of its training set made on date d1, while for another app A2 the latest commit of its entire history comes on date d2, with d2 < d1 (i.e., d2 is older than d1). This would mean that association rules learned on A1 would be applied to generate recommendations for commits of A2 performed before d1 (that will be part of the test set), thus using data from the future to learn how to trigger recommendations, something that cannot happen in a real usage scenario.

Parameter     Experimented values
minconf       0.05, 0.20, 0.35, 0.50, 0.65, 0.80
minsup        8.00E-06, 4.80E-05, 8.80E-05, 1.28E-04, 1.68E-04
simcluster    0.80, 0.85, 0.90, 0.95
maxLHS        1, 2, 3, 4, 5, 6, 7, 8, 9

TABLE I: FeaRS parameters tuning options

Once the association rules are learned, we assess the performance of FeaRS on the validation set with different parameter configurations (Table I), for a total of 1,080 configurations. Given the number of mined commits, the minimum value of minsup we experiment with (i.e., 8.00E-06) ensures that an association rule is learned from at least 5 commits to be considered valid.

In all combinations of parameters, we used simmatch = simcluster, meaning that the minimum similarity needed to cluster two methods together (i.e., simcluster) is also the minimum similarity used, when generating recommendations, to assign a newly implemented method to a cluster (i.e., simmatch, see Section II-E).

As shown in Fig. 4, to identify the best configuration(s) we use 10% of the apps’ change history (validation set).

For each commit in the validation set (c1, c2, and c3 in Fig. 4) we match all newly added methods to the clusters that have been defined during the association rules extraction from the training set (using the same similarity threshold as for the clusters definition). This means that we simulated the scenario in which each of the added methods is written by the developer in the IDE, and the FeaRS plugin checks if the added method can be matched with any of the existing clusters (i.e., if its similarity with one of the centroids is higher than simmatch). If a method is not matched, no further action is taken, while all matched methods are assigned to the corresponding cluster.

Fig. 4 represents our running example, in which the grey box on the left shows the association rules learned on the training set, and the black box at the bottom shows how performance is computed for each commit in the evaluation set. In the case of commit c1, three added methods have been matched to clusters C1, C2, and C3. Then, we compute all possible combinations of the matched clusters involving all but one of them. In the case of commit c1, this means all possible combinations having length lower than three: {C1}, {C2}, {C3}, {C1, C2}, {C1, C3}, {C2, C3}. Then, we check if any of those combinations matches the LHS of one of the rules learned from the training set. In Fig. 4 the pair {C1, C2} matches the rule {C1, C2} → C3. This means that, assuming the methods matched to C1 and C2 to be written before the one matched to C3 (more discussion on this assumption in our threats to validity), FeaRS would be able in a real usage scenario to successfully recommend the next method to implement (i.e., the C3 centroid). Thus, in Fig. 4, we count the number of recommendations generated by FeaRS (1), column “Recomm.”, the number of correct recommendations (1), and the number of methods added in commit c1 that FeaRS would have potentially been able to recommend (1 out of 3), column “Cover. Meth.” Concerning commit c2, it would match a rule whose RHS does not correspond to any of its added methods, generating one wrong recommendation (see Fig. 4). No recommendation would be triggered for commit c3, since no matching rules are found.

There are two special cases that must be handled:

First, when multiple association rules have the same RHS (e.g., assume {C1} → C3 and {C1, C2} → C3 are both available in the set of learned association rules). In this case, both rules could be applied, for example, in the context of commit c1 in Fig. 4. However, considering both rules as successful would inflate the performance of FeaRS since, in a real usage scenario, if {C1} → C3 is applied, {C1, C2} → C3 cannot be applied, since the method corresponding to C3 already exists.

Second, in case of a “circular dependency” between the LHS and the RHS of two rules, e.g., R1 = {C1} → C2 and R2 = {C2, C3} → C1. The LHS of R1 matches the RHS of R2, and the RHS of R1 is contained in the LHS of R2.

In theory both rules could be applied to commit c1 in Fig. 4, but the application of one rule would exclude the other in a real usage scenario. If we apply R1, it means that the method matched to C1 has been implemented by the developer, and it does not make sense for R2 to recommend it. Similarly, if R2 is applied, this means that the method matched to C2 already exists, making R1 useless.

In both cases we select the rule with the highest confidence.

III-B Data Analysis

We assess the performance of each experimented configuration by computing the following metrics:


Recall:

CR / |C|, where CR is the number of commits for which FeaRS generated at least one correct recommendation and C is the set of commits mined in the validation set. A correct recommendation is not necessarily an exact match to the actual implemented code, but the similarity has to be above a certain threshold, consistent with the predefined clusters. Recall indicates in how many commits FeaRS could potentially be useful for developers.

Precision:

CR / R, where R is the number of commits for which FeaRS generated at least one recommendation (correct or wrong).

Covcommits:

R / |C|. This metric indicates the percentage of commits from the validation set that could have triggered FeaRS to generate at least one recommendation (correct or wrong) for developers.

Covmeth:

MR / M, where MR is the number of methods successfully recommended by FeaRS and M is the total number of methods added in the validation set commits. This coverage metric indicates the percentage of methods added in all commits from the validation set that could have been automatically generated by FeaRS.

#Recom:

#Recom is the number of recommendations generated by FeaRS in a commit for which it was triggered. We report both the mean and the median values.

Disttokens:

Disttokens is the distance, in number of tokens that must be modified, added, or deleted by a developer, between a correct recommendation received from FeaRS and the code actually implemented by the developer (a correct recommendation does not imply an exact match with the implemented code). Thus, we assess the effort needed by developers to adapt the received recommendation to their codebase (an example computation of this metric is shown in Fig. 6).
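Putting the definitions above together, the metrics can be computed from a handful of raw counts; a minimal Java sketch (the field names are ours):

public class EvaluationMetrics {
    long commitsInSet;          // |C|: commits mined in the validation set
    long commitsWithRecomm;     // R: commits with at least one recommendation
    long commitsWithCorrect;    // CR: commits with at least one correct recommendation
    long methodsAdded;          // M: total methods added in the set
    long methodsRecommended;    // MR: methods successfully recommended

    double recall()     { return (double) commitsWithCorrect / commitsInSet; }
    double precision()  { return (double) commitsWithCorrect / commitsWithRecomm; }
    double covCommits() { return (double) commitsWithRecomm / commitsInSet; }
    double covMeth()    { return (double) methodsRecommended / methodsAdded; }
}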

Fig. 6: An example of Disttokens calculation
Fig. 7: Tuning of FeaRS’s parameters

IV Results Discussion

IV-A FeaRS Parameters Tuning

Fig. 7 shows the results of the parameters tuning performed on the validation set. Each of the four graphs reports on the x-axis the values experimented for a specific parameter; from left to right: minimum confidence (minconf), minimum support (minsup), minimum similarity to cluster two methods (simcluster), and maximum size of the LHS (maxLHS). The y-axis reports the recall (left) and the precision (right) achieved, with red dots indicating recall values and black dots precision values. We decided to use these two metrics, over the others, for the parameters tuning since we wanted to contrast the talkativeness of our tool (i.e., in how many commits it generates a recommendation) against the precision of the generated recommendations. To better understand what the black and red dots represent, consider the minconf graph when its value is set to 0.05. The dots plotted in correspondence of this value represent the performance achieved when fixing minconf = 0.05 and varying all other parameters.

One first observation is related to the range of performance achieved by different configurations: The recall varies from 0.02 to 0.28, while the precision from 0.08 to 0.84. While the recall values may look low, it is important to note that the validation set includes 70,562 commits.

The trends observed for the four parameters indicate that minconf has the strongest influence on performance. When the minimum confidence needed to trigger a recommendation grows, as expected the precision linearly increases, with a corresponding linear decrease of recall (left part of Fig. 7). Setting minconf lower than 0.50 does not ensure acceptable precision.

Concerning minsup, increasing its value does not substantially increase precision while having a strong negative effect on recall. Low values of this parameter are preferable. Instead, increasing the simcluster parameter results in a notable increase in precision, especially when moving from 0.80 to 0.90/0.95. In this case, 0.90 seems to be a good compromise, also considering the minor loss of recall as compared to lower values. Finally, maxLHS does not play a big role in the performance of FeaRS. As the output of this tuning process, we identified three configurations that we linked to the sensitivity bar in our IDE plugin and that are shown in the gray boxes at the right of Fig. 7.

These configurations have been picked using the following process. We started from the assumption that a precision level below 0.50 (i.e., one out of two generated recommendations is correct) is not acceptable. Then, we picked as high sensitivity configuration the one ensuring a precision of at least 0.50 and having the highest recall. This configuration is able to generate 8,355 correct recommendations in the validation set, with a precision of 52%. Then, we increased the minimum acceptable precision by 10%, identifying the configuration ensuring at least a 60% precision with the maximum recall. This resulted in the medium sensitivity configuration, which can successfully recommend useful methods in 7,092 cases, with a precision of 64%. Finally, a further increase of the precision level to at least 70% led to the identification of the low sensitivity configuration, which can recommend 5,801 correct methods, with a precision of 72%. These three configurations are the ones we experiment with.

IV-B Quantitative Results

Table II reports the results achieved by the three FeaRS’s configurations on the test set. The top part of the table reports the raw data used to compute the performance metrics in the bottom part of the table. In the top part, while “#commits w. corr. recomm.” indicates the number of commits with at least one correct recommendation, “#corr. recomm.” represents the number of correctly recommended methods, possibly more than one per commit.

                              high      medium    low
                              sensit.   sensit.   sensit.
#commits                      69,480    69,480    69,480
#added methods                219,331   219,331   219,331
#commits w. recomm.           8,757     6,447     4,116
#commits w. corr. recomm.     4,878     4,167     3,110
#recommendations              14,642    9,996     7,170
#corr. recomm.                7,383     6,183     5,149
recall                        0.07      0.05      0.04
precision                     0.50      0.62      0.72
Covcommits                    0.13      0.09      0.06
Covmeth                       0.03      0.03      0.02
#recom (median)               1         1         1
#recom (mean)                 1.67      1.55      1.74
Disttokens (Q1,Q2,Q3)         0,1,2     0,1,2     0,1,2
Disttokens (mean)             1.94      2.03      1.81
%Disttokens (Q1,Q2,Q3)        0,13,22   0,13,22   0,13,20
%Disttokens (mean)            14%       14%       13%

TABLE II: Performance when considering all methods

The results achieved by the three configurations are in line with what we observed on the validation set: precision goes from 0.50 (high sensitivity) to 0.72 (low sensitivity), while recall moves in the opposite direction, decreasing from 0.07 (high sensitivity) to 0.04 (low sensitivity).

The recall values, while low, still correspond to thousands of methods correctly recommended. As we learned while performing the qualitative analysis in Section IV-C, a correct recommendation does not imply a “useful” recommendation. We noticed that many of the correct recommendations are due to small methods (e.g., a getter method triggers the implementation of the corresponding setter), and decided to re-compute the performance of FeaRS only considering recommended methods with at least four lines of code (including signature but excluding annotations and the closing brace). To correctly compute recall, this also required us to exclude from our analysis the commits in which a successful recommendation would not be possible at all, due to the absence of newly implemented methods having at least four lines.

                              high      medium    low
                              sensit.   sensit.   sensit.
#commits                      31,088    31,088    31,088
#added methods                83,562    83,562    83,562
#commits w. recomm.           900       763       564
#commits w. corr. recomm.     568       536       413
#recommendations              1,329     1,099     738
#corr. recomm.                778       742       522
recall                        0.02      0.02      0.01
precision                     0.59      0.68      0.71
Covcommits                    0.03      0.03      0.02
Covmeth                       0.01      0.01      0.01
#recom (median)               1         1         1
#recom (mean)                 1.48      1.44      1.30
Disttokens (Q1,Q2,Q3)         0,3,10    0,3,10    0,3,4
Disttokens (mean)             5.08      5.07      3.98
%Disttokens (Q1,Q2,Q3)        0,14,28   0,14,28   0,10,18
%Disttokens (mean)            17%       16%       13%

TABLE III: Performance when excluding short methods

Table III reports the results achieved in this scenario. The precision values are in line with before (min: 0.59, max: 0.71), showing that the “quality” of the recommendations is not influenced by the length of the recommended methods. Instead, we observed a drop in recall, which does not go above 2%, with a number of correct recommendations ranging between 522 (low sensitivity) and 778 (high sensitivity).

The number of recommendations generated by FeaRS (#recom) is usually very low (median = 1 and mean below 2 in both scenarios). This shows that FeaRS does not generate many cases to inspect when triggered. Also, the results of Disttokens indicate that developers need to modify only a few tokens to adapt the received recommendations to their code.

While these results show the potential of FeaRS, they also highlight (as in the cases discussed for Table II) that the recommended methods are short, with a potentially small benefit for developers. Our qualitative analysis will help in better assessing the value of these recommendations.

IV-C Qualitative Examples

IV-C1 Correct Recommendations

Fig. 8 shows an example of a recommendation generated for the Memento app for Android Wear [23].

Fig. 8: Correct recommendation related to the usage of external storage in Android.

Suppose that the developer implements the isExternalStorageReadable() method to check whether the external storage of the device is mounted in read-only mode. FeaRS can pop up and recommend the isExternalStorageWritable() method to also check whether it is writable. This rule had four matching instances in our test set, from four different repositories.
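The two methods of this pattern follow the canonical checks from the Android documentation, based on Environment.getExternalStorageState(); a sketch of their typical shape (the code recommended by FeaRS in Fig. 8 follows the same structure):

import android.os.Environment;

public class StorageChecks {
    // External storage is readable if it is mounted, with or without
    // write access.
    public boolean isExternalStorageReadable() {
        String state = Environment.getExternalStorageState();
        return Environment.MEDIA_MOUNTED.equals(state)
                || Environment.MEDIA_MOUNTED_READ_ONLY.equals(state);
    }

    // External storage is writable only if it is fully mounted.
    public boolean isExternalStorageWritable() {
        return Environment.MEDIA_MOUNTED
                .equals(Environment.getExternalStorageState());
    }
}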

Fig. 9 shows an example of providing a custom back navigation for an Android DrawerLayout.

Fig. 9: Correct recommendation to provide a custom back navigation for an Android DrawerLayout.

Following the implementation of an onNavigationItemSelected(...) method that uses a DrawerLayout, FeaRS recommends a proper implementation for the onBackPressed() method. Interestingly, in case of a missing implementation, the DrawerLayout might not close properly, as discussed in a Stack Overflow question [35]. We found 19 matching instances for this rule in 17 different repositories.
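The typical shape of the recommended method, mirroring the solution discussed on Stack Overflow, is sketched below (the drawer field and its initialization are assumptions of this sketch):

import androidx.appcompat.app.AppCompatActivity;
import androidx.core.view.GravityCompat;
import androidx.drawerlayout.widget.DrawerLayout;

public class MainActivity extends AppCompatActivity {
    private DrawerLayout drawer; // initialized in onCreate (not shown)

    @Override
    public void onBackPressed() {
        // Close the navigation drawer if it is open; otherwise fall
        // back to the default back-button behavior.
        if (drawer.isDrawerOpen(GravityCompat.START)) {
            drawer.closeDrawer(GravityCompat.START);
        } else {
            super.onBackPressed();
        }
    }
}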

Fig. 10 shows an example recommendation for the creation of a Google Map object from the Google Maps SDK.

Fig. 10: Correct recommendation for the creation of a GoogleMap instance from the Google Maps SDK for Android.

We found 68 matches for this rule in 62 repositories. FeaRS matches an onCreate(...) method in which an Activity creates a SupportMapFragment from the SDK. Next, it recommends an initial implementation for the onMapReady(...) method, which shows how to add a marker to the map. We found various implementations having a different initial marker position (e.g., London, Sydney).
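The recommended body typically mirrors the Google Maps SDK template code, as in the following sketch (the initial marker position is app-specific):

import com.google.android.gms.maps.CameraUpdateFactory;
import com.google.android.gms.maps.GoogleMap;
import com.google.android.gms.maps.OnMapReadyCallback;
import com.google.android.gms.maps.model.LatLng;
import com.google.android.gms.maps.model.MarkerOptions;

public class MapsActivity implements OnMapReadyCallback {
    @Override
    public void onMapReady(GoogleMap googleMap) {
        // Template-style initial implementation: add a marker and move
        // the camera to it (the position varies across the mined apps).
        LatLng sydney = new LatLng(-34, 151);
        googleMap.addMarker(new MarkerOptions()
                .position(sydney)
                .title("Marker in Sydney"));
        googleMap.moveCamera(CameraUpdateFactory.newLatLng(sydney));
    }
}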

IV-C2 Unmatched Implementation Patterns

We present FeaRS’s recommendations that have been triggered during the evaluation process (i.e., their LHS has been matched in the test commits) but that have never been successful (i.e., the RHS has not been matched).

Fig. 11 shows an example of a recommendation generated for the Artissans Android app [6].

Fig. 11: Unmatched recommendation for user credential validation in sign-up activity.

Suppose that the developer implements the isValidEmail() method to check whether the email address provided when creating a new account is valid. FeaRS recommends the isValidPassword() method to check, in the same scenario, if the provided password/confirm password fields are valid (i.e., they are composed of at least six characters, and they match each other). This rule had been triggered twice without finding a match for the RHS, thus being classified as an incorrect recommendation. However, when we looked into the two commits in which this recommendation was triggered, we found that both of them actually implemented an isValidPassword() method that, however, only validated the password based on its length, making the recommended method and the implemented one not similar enough to be counted as a correct recommendation. This example is representative of others we found.

Fig. 12: Unmatched recommendation for creating custom filter for filterable adapter in Android.

For example, Fig. 12 relates to the creation of a custom filter applied to a RecyclerView.Adapter in Android. The class Filter is used in Android to constrain data according to a specified pattern.

Following the implementation of a UserFilter constructor, FeaRS recommends a proper implementation of the overridden publishResults method from the Filter class that, as explained in the Android documentation, is invoked in the UI thread to publish the filtering results in the user interface. Again, this recommendation was not matched (and thus considered wrong) during our study. However, also in this case, looking into the test commit [5] subject of the recommendation, we found that a similar overridden publishResults method was implemented as well, following a custom filter constructor. Unfortunately, the similarity between the RHS of the rule and the implemented publishResults was not high enough to identify the recommendation as useful.
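A sketch of the typical shape of this pattern is shown below; the names (UserAdapterFilter, allUsers, filteredUsers, and the Runnable used to notify the adapter) are placeholders of ours, not taken from the commit under discussion.

import android.widget.Filter;
import java.util.ArrayList;
import java.util.List;

public class UserAdapterFilter extends Filter {
    private final List<String> allUsers;  // full data set (placeholder type)
    private List<String> filteredUsers;   // data currently shown in the adapter
    private final Runnable notifyAdapter; // e.g., adapter::notifyDataSetChanged

    public UserAdapterFilter(List<String> allUsers, Runnable notifyAdapter) {
        this.allUsers = allUsers;
        this.notifyAdapter = notifyAdapter;
    }

    @Override
    protected FilterResults performFiltering(CharSequence constraint) {
        // Runs on a worker thread: constrain the data set to the pattern.
        List<String> matches = new ArrayList<>();
        for (String user : allUsers) {
            if (user.toLowerCase().contains(constraint.toString().toLowerCase())) {
                matches.add(user);
            }
        }
        FilterResults results = new FilterResults();
        results.values = matches;
        results.count = matches.size();
        return results;
    }

    @Override
    @SuppressWarnings("unchecked")
    protected void publishResults(CharSequence constraint, FilterResults results) {
        // Invoked in the UI thread to publish the filtering results.
        filteredUsers = (List<String>) results.values;
        notifyAdapter.run();
    }
}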

These cases show that our experimental design, while useful to provide a first indication about the quality of the recommendations triggered by FeaRS, has imprecisions in assessing FeaRS’s performance. As previously said, only complementing this mining-based study with experiments with developers can help in better assessing FeaRS’s usefulness.

V Threats to Validity

Construct validity. In our experimental design we assumed that if a commit added three methods belonging to clusters C1, C2, and C3 and FeaRS has an association rule {C1, C2} → C3, FeaRS would have been useful in that commit to recommend the method matched to C3 to the developer. However, we cannot know whether that method was written after the other two, and an inverted writing order would make FeaRS’s recommendation useless in practice. Such a threat can only be addressed by (i) performing a user study in which developers code live using FeaRS, or (ii) recording IDE interaction data of programming sessions. While this is part of our future work, we preferred as a first evaluation for FeaRS something that can be large-scale and fully automated, before moving to more costly studies requiring human involvement. In the design of our study, we also assume that the coding activities in one single commit perform an implementation task, ignoring those cases in which a given task is split across several commits. We considered the idea of using temporally close commits as a single data point, but we found that it is hard to define a proper criterion for the selection of multiple commits and that it might be risky for the cohesiveness of the task.

Another threat is related to the criterion we used to identify a generated recommendation as “correct.” Given a commit in which methods m1 and m2 are added, we assume that a recommendation is correct if m1 is matched to an existing cluster C1 and m2 to an existing cluster C2 appearing, respectively, in the LHS and the RHS of a learned rule (or vice versa, i.e., m1 to C2 and m2 to C1). This implies assuming that the assignment of methods to clusters is correct or, in other words, that when a method is assigned to a cluster, the method actually implements functionalities related to those of the cluster. To partially address this threat, two of the authors manually analyzed a set of 100 methods assigned by FeaRS to a specific cluster, with the goal of verifying whether the assigned cluster actually implements the same feature as the method.

After resolving conflicts, which arose in 7% of cases, they reported an accuracy of 91%. Thus, we acknowledge possible imprecisions.

Internal validity. We tuned the FeaRS’s parameters on a set of commits used neither for learning the association rules nor for assessing FeaRS’s performance. We experimented with 1,080 combinations of parameters. However, it is possible that better performance could be achieved with other values; from this point of view, the reported performance is an underestimation. We adopted a careful experimental design to avoid using “data from the future” when tuning and testing our approach.

External validity. Overall, our study involves 20,713 open-source Android apps. The main issue is related to the fact that all subject apps are open source and might not be representative of commercial apps. Also, while FeaRS is general enough to be adapted to other contexts (e.g., Java programming in general), we decided to focus on a narrower scenario, at least for this first work.

VI Related Work

FeaRS is one of the many recommender systems proposed in the software engineering literature. Such systems have been proposed to support many different tasks, such as the recommendation of formal and informal documentation (see, e.g., [40, 44, 27]), the automatic generation of code for different purposes (e.g., [43, 42, 20, 13, 21, 34]), or the recommendation of relevant code examples/discussions for a task at hand (e.g., [12, 31, 36, 17, 18]). We focus our discussion on the most related works, in particular on those dealing with code completion techniques and code search engines.

VI-A Code Completion Techniques

Basic code completion features of IDEs often rely on the static type system of a programming language and do not consider the actual code context. Suggestions are usually sorted, e.g., in alphabetical order. As a result, relevant recommendations are not always easy to identify.

An alternative approach was presented by Bruch et al. [9]. Their intelligent code completion system filters out candidates from the list of tokens recommended by the IDE that are not relevant to the current working context, and ranks candidates based on how relevant to the context they are.

Another context-sensitive approach was developed by Nguyen et al. [26]. Their GraPacc method uses graphs to model API usage patterns, where nodes represent actions (e.g., method calls) and control points (e.g., while), and edges represent control and data flow dependencies between nodes. Context information such as the relation between API elements and other code elements is considered for ranking the most fitted API usage patterns.

Statistical language models have also been used for code completion. In their seminal work on the naturalness of software, Hindle et al. developed a code completion engine for Java based on an n-gram language model [16]. Their work has been extended by Nguyen et al. [25] and Tu et al. [41].

A language model approach was implemented by Raychev et al. too [28]. They extract sequences of method calls from a large codebase to train a model, which they use to support the autocompletion of method calls, achieving an accuracy of 90% when considering the top three results. Method call completion was also explored by Asaduzzaman et al. [7]. Their approach, called CSCC, relies on a database of method call usage contexts collected from open source projects and applies a hash function to find relevant recommendations. From another perspective, Robbes and Lanza proposed to improve code completion by focusing on the recent changes implemented by the developer [32].

Popular IDEs have recognized the importance of supporting context-sensitive recommendations. For example, IntelliJ IDEA has a feature called Smart completion to filter and show suggestions applicable to the current context. NetBeans has a Smart Code Completion feature to display at the top of the suggestions the most relevant ones for the context. Eclipse has plugins to extend its core code completion; among these, aiX Code Completer [4] and Codota [10] use AI techniques and can even recommend a full line of code.

While these approaches are undoubtedly valuable to speed up code writing, they are limited to recommendations related to the next few tokens the developer is likely to type given the current context. In the best case, they can recommend a few APIs that the developer is likely to use next. With FeaRS we take another step ahead, predicting the next full method a developer is likely to implement.

VI-B Code Search Engines

FeaRS is also related to approaches implementing code search engines that allow retrieving code samples and reusable open source code from the Web.

Early online code search engines (e.g., codesearch.google.com, koders.com, and krugle.org) offered keyword-based search and file-level retrieval. These approaches could be improved by considering structural and semantic information of code. Bajracharya et al. [8] developed Sourcerer, a code search engine that extracts structural information from the code and stores it in a relational model so it can be queried for code search. It supports queries for control structures, Java types, and micro patterns (e.g., implementation of Semaphore).

Reiss developed an approach that combines code search with transformations mapping the retrieved code to meet user specifications [29]. For the search, the approach allows the user to specify multiple semantic rules, which also form the basis for the transformations.

Thummalapenta et al. developed an approach to support code search engines with static analysis to return fewer, but more relevant, code samples for search queries [39, 37]. Their primary goal was to support a user in reusing a given API. Later they extended their approach with SpotWeb [38] to assist users by detecting hotspots that can serve as starting points for reusing APIs.

API usage was also proposed by McMillan et al. [14, 22] to return highly relevant matches for a source code search engine. Their approach combines three sources of information to locate relevant software: the textual descriptions of applications, the API calls used inside each application, and the dataflow among those API calls.

Compared to code search engines, FeaRS also relies on an extensive database of methods’ source code in open source applications. These methods are organized in clusters based on a similarity algorithm implemented in the ASIA clone detector [1]. FeaRS does not require the user to write a “query” to identify relevant pieces of code, but extrapolates this need by monitoring the IDE.

VII Conclusions

Code completion, while provenly useful and extensively used by developers [24], is just a step in the direction of an automated pair programmer: a tool adding complete methods that a developer would have to write anyway, thus removing from the developer the burden of rote work. This was the ambitious goal that we set out to achieve with this work, embodied in the creation of FeaRS, an approach and a tool [30] to automatically recommend to developers the complete next method to write during implementation activities.

FeaRS relies on a simple but intuitive idea: programming is a repetitive activity, which some even go as far as calling “natural” [16]. What a developer is doing has a high chance of having been done by someone else, somewhere else before. Leveraging this idea, FeaRS mines vast amounts of data to recommend complete methods given a set of methods being implemented by a developer. We evaluated FeaRS on the change history of 20,713 Android apps. The results show the potential of FeaRS, with hundreds of correct methods recommended even in its most conservative configuration.

However, our findings are not conclusive for what concerns the actual usefulness of the generated recommendations in a real usage scenario, in which developers use FeaRS during coding activities. This is due to two observations we made. First, some of the methods recommended by FeaRS are quite short and, while they can still be useful, they could also represent “trivial” recommendations for developers. We believe this can in part be compensated by introducing a user feedback loop, which is part of our future work. The quantitative results show that around 15% of the tokens of the recommendations need to be modified, added, or deleted to fit the user’s code base. One of our future plans is to integrate code adaptation techniques into FeaRS to avoid potential conflicts or compilation errors with the user’s code environment, and to adapt the recommended code to the user’s coding conventions. Second, due to our experimental design, the “unmatched recommendations” are always considered false positives, while we observed that some of them are actually valuable recommendations. Thus, a deeper evaluation of FeaRS, including a well-designed user study, represents another main target of our future research.

Acknowledgment

We gratefully acknowledge the financial support of the Swiss National Science Foundation for the projects PROBE (SNF Project No. 172799) and CCQR (SNF Project No. 175513).

References

  • [1] E. Aghajani, G. Bavota, M. Linares-Vásquez, and M. Lanza (2019) Automated documentation of Android apps. IEEE Transactions on Software Engineering.
  • [2] R. Agrawal, T. Imieliński, and A. Swami (1993) Mining association rules between sets of items in large databases. SIGMOD Record 22 (2), pp. 207–216.
  • [3] R. Agrawal and R. Srikant (1995) Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pp. 3–14.
  • [4] aiX Code Completer, Eclipse plugin.
  • [5] GitHub test commit discussed in Section IV-C2.
  • [6] Artissans Android app, GitHub repository.
  • [7] M. Asaduzzaman, C. K. Roy, K. A. Schneider, and D. Hou (2014) Context-sensitive code completion tool for better API usability. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 621–624.
  • [8] S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes (2006) Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN Symposium on Object-Oriented Programming Systems, Languages, and Applications, OOPSLA ’06, pp. 681–682.
  • [9] M. Bruch, M. Monperrus, and M. Mezini (2009) Learning from examples to improve code completion systems. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE 2009, pp. 213–222.
  • [10] Codota, Eclipse plugin.
  • [11] R. Coppola, L. Ardito, and M. Torchiano (2019) Characterizing the transition to Kotlin of Android apps: a study on F-Droid, Play Store, and GitHub. In Proceedings of the International Workshop on App Market Analytics, pp. 8–14.
  • [12] J. Cordeiro, B. Antunes, and P. Gomes (2012) Context-based recommendation to support problem solving in software development. In Proceedings of RSSE 2012, pp. 85–89.
  • [13] R. L. Glass (1996) Some thoughts on automatic code generation. SIGMIS Database 27 (2), pp. 16–18.
  • [14] M. Grechanik, C. Fu, Q. Xie, C. McMillan, D. Poshyvanyk, and C. Cumby (2010) A search engine for finding highly relevant applications. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10, pp. 475–484.
  • [15] K. Herzig and A. Zeller (2013) The impact of tangled code changes. In 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 121–130.
  • [16] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu (2012) On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE 2012, pp. 837–847.
  • [17] R. Holmes, R. Walker, and G. Murphy (2005) Strathcona example recommendation tool. SIGSOFT Software Engineering Notes 30, pp. 237–240.
  • [18] R. Holmes, R. Walker, and G. Murphy (2006) Approximate structural context matching: an approach to recommend relevant examples. IEEE TSE 32 (12), pp. 952–970.
  • [19] JavaParser.
  • [20] C. Lezos, G. Dimitroulakos, I. Latifis, and K. Masselos (2016) Automatic generation of code analysis tools: the CastQL approach. In Proceedings of the 1st International Workshop on Real World Domain Specific Languages, RWDSL ’16.
  • [21] H. Liao, J. Jiang, and Y. Zhang (2010) A study of automatic code generation. In 2010 International Conference on Computational and Information Sciences, pp. 689–691.
  • [22] C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Transactions on Software Engineering 38 (5), pp. 1069–1087.
  • [23] Memento app for Android Wear, GitHub repository.
  • [24] G. C. Murphy, M. Kersten, and L. Findlater (2006) How are Java software developers using the Eclipse IDE?. IEEE Software 23 (4), pp. 76–83.
  • [25] A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen (2016) A large-scale study on repetitiveness, containment, and composability of routines in open-source projects. In Proceedings of the IEEE/ACM 13th Working Conference on Mining Software Repositories (MSR 2016), pp. 362–373.
  • [26] A. T. Nguyen, T. T. Nguyen, H. A. Nguyen, A. Tamrawi, H. V. Nguyen, J. Al-Kofahi, and T. N. Nguyen (2012) Graph-based pattern-oriented, context-sensitive source code completion. In 2012 34th International Conference on Software Engineering (ICSE), pp. 69–79.
  • [27] L. Ponzanelli, S. Scalabrino, G. Bavota, A. Mocci, R. Oliveto, M. Di Penta, and M. Lanza (2017) Supporting software developers with a holistic recommender system. In Proceedings of ICSE 2017 (39th International Conference on Software Engineering), pp. 94–105.
  • [28] V. Raychev, M. Vechev, and E. Yahav (2014) Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2014, pp. 419–428.
  • [29] S. P. Reiss (2009) Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering, ICSE ’09, pp. 243–253.
  • [30] FeaRS replication package.
  • [31] P. Rigby and M. Robillard (2013) Discovering essential code elements in informal documentation. In Proceedings of ICSE 2013, pp. 832–841.
  • [32] R. Robbes and M. Lanza (2010) Improving code completion with program history. Automated Software Engineering 17 (2), pp. 181–212.
  • [33] M. P. Robillard, W. Maalej, R. J. Walker, and T. Zimmermann (2014) Recommendation systems in software engineering. Springer.
  • [34] N. K. Singh (2013) EB2ALL: an automatic code generation tool. In Using Event-B for Critical Device Software Systems, pp. 105–141.
  • [35] Stack Overflow question on custom back navigation for an Android DrawerLayout.
  • [36] W. Takuya and H. Masuhara (2011) A spontaneous code recommendation tool based on associative search. In Proceedings of SUITE 2011, pp. 17–20.
  • [37] S. Thummalapenta and T. Xie (2007) Parseweb: a programmer assistant for reusing open source code on the web. In Proceedings of the Twenty-Second IEEE/ACM International Conference on Automated Software Engineering, ASE ’07, pp. 204–213.
  • [38] S. Thummalapenta and T. Xie (2008) SpotWeb: detecting framework hotspots and coldspots via mining open source code on the web. In 2008 23rd IEEE/ACM International Conference on Automated Software Engineering, pp. 327–336.
  • [39] S. Thummalapenta (2007) Exploiting code search engines to improve programmer productivity. In Companion to the 22nd ACM SIGPLAN Conference on Object-Oriented Programming Systems and Applications Companion, OOPSLA ’07, pp. 921–922.
  • [40] C. Treude and M. P. Robillard (2016) Augmenting API documentation with insights from Stack Overflow. In Proceedings of ICSE 2016 (38th International Conference on Software Engineering), pp. 392–403.
  • [41] Z. Tu, Z. Su, and P. Devanbu (2014) On the localness of software. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, pp. 269–280.
  • [42] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk (2019) On learning meaningful code changes via neural machine translation. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, pp. 25–36.
  • [43] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk (2018) An empirical investigation into learning bug-fixing patches in the wild via neural machine translation. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, pp. 832–837.
  • [44] E. Wong, J. Yang, and L. Tan (2013) AutoComment: mining question and answer sites for automatic comment generation. In Proceedings of ASE 2013 (28th IEEE/ACM International Conference on Automated Software Engineering), pp. 562–567.