Application Programming Interfaces (APIs) play an important role in software development. With the help of APIs, developers can accomplish their programming tasks more efficiently. However, due to the huge number of APIs in a library, it is impractical for developers to become familiar with all of them and always select the correct ones for specific development tasks.
To tackle this problem, many API recommendation approaches and tools have been proposed to relieve the burden on developers of understanding and searching for APIs. Based on the input, there are generally two types of API recommendation scenarios, i.e., recommendation with queries and recommendation without queries. The first type requires developers to state what they want in natural language as a query, which is fed into the recommendation system. In the second type, since there is no explicit query, the existing code, especially the neighbouring code, is leveraged as context, and the missing APIs are inferred and recommended to end users. A majority of related work employs text similarity-based techniques. For example, some approaches recommend APIs according to the similarity between search queries and supplementary information of APIs [3, 4]; others return API usages depending on how closely they are related to context information in source code [5, 6]. Generally, these approaches use keywords to narrow down the search space in massive target repositories and speed up recommendation. However, in many cases, the correct API information is not literally similar to the query because of the semantic gap [7, 8, 9]. For example, the answer to the query “Make a negative number positive” could be “java.lang.Math.abs”, which returns the absolute value of the argument and matches the problem perfectly, even though the query and the API name share no words. For such dissimilar query-answer pairs, textual matching is of limited use. Moreover, very few of these approaches consider the role of developers’ feedback information in the recommendation process, although such information is usually crucial for improving API recommendation performance.
Feedback information generally refers to a user’s interactions with the recommended results during a recommendation session. Usually, it reflects the user’s preference for different items. In traditional recommendation systems, the use of feedback information can greatly improve the accuracy of recommendation [11, 12]. For example, in a movie recommendation system, the user’s viewing history is regarded as feedback information. In an online shopping system, feedback usually refers to the product browsing history of a particular customer. We note that these are usually referred to as implicit feedback. (In contrast, users’ ratings are considered to be explicit.)
In the process of API recommendation, selecting an API from the recommended list usually suggests that the API is useful for the user in solving the particular problem stated in the query. Hence, it is deemed to be the correct answer to the query. During each query-answer session, we record the query together with the API selected by the user, and put this query-API pair into the feedback repository. The API is regarded as the feedback information of the query.
In this paper, we propose a novel framework, BRAID (Boosting RecommendAtion with Implicit FeeDback), to boost recommendation effectiveness by leveraging (implicit) feedback information. Particularly, we focus on the first type of recommendation scenarios, i.e., recommendation with queries. By introducing feedback, not only do we improve the performance of API recommendations, but also we can accomplish personalized recommendation. For the same query, different list order will be recommended based on user personal interaction history (feedback). Moreover, our framework could accommodate existing recommendation approaches as components.
To effectively integrate user feedback into the code recommendation loop, we harness learning-to-rank (LTR) techniques, which are widely used in areas such as information retrieval and recommendation. The key of LTR in information retrieval is to train a ranking model which, for a given query, produces an optimized order of the relevant documents based on feedback information. By viewing APIs as documents, we can apply LTR techniques to API recommendation to boost its performance. Furthermore, to accelerate the feedback learning process, we incorporate active learning, which alleviates the “cold start” caused by scarce feedback information at the beginning. We query an oracle to get the correct label of a selected sample and add it to the training set. By iterating this process, we obtain a well-trained active learning model together with an expanded labeled set. This training set can, in turn, be used to train a well-performing model that generates an optimized recommendation list.
To evaluate BRAID, we use three state-of-the-art query-based API recommendation systems, i.e., BIKER, RACK, and NLP2API, as baselines and Hit@k/Top-k accuracy, MAP, and MRR as evaluation metrics. With continuous accumulation of feedback information, the Top-accuracy of BIKER increases by 14.5%, the absolute growth of RACK is nearly 35.42%, and that of NLP2API is eventually 32.26%.
The main contributions of the paper are as below.
We propose a novel framework BRAID, which integrates programmers’ feedback information by using the learning-to-rank technique to improve the accuracy of API recommendation.
BRAID also features the active learning technique, with which the learning process of feedback information can be accelerated. Even with a small proportion of feedback data, the performance of recommendation can still be enhanced considerably.
We conduct a comprehensive empirical study and compare BRAID with three state-of-the-art API recommendation systems. The results show that our approach performs well and demonstrate its generalizability.
Our work is orthogonal to recent efforts in recommending APIs with machine learning techniques, largely in the context of intelligent software development. It does not put forward yet another recommendation method; rather, it boosts the performance of, and is applicable to, a wide spectrum of extant query-based recommendation systems. To the best of our knowledge, this is the first time that feedback is taken into account seriously in API recommendation, and our work is one of the first to utilize LTR in this area.
Structure of the paper. Section 2 briefly introduces the background of this study. Section 3 gives the details of our approach. Section 4 presents the experimental settings and comparative results on related API recommendation systems. In Sections 5 and 6, threats to validity and related work are discussed respectively. Finally, conclusions are drawn and future research is outlined in Section 7.
As a widely used ranking technique, LTR has achieved great success in a variety of fields including information retrieval, natural language processing, and software engineering [15, 16, 17]. The basic task of LTR is to learn an ordering of the documents in a document set by optimizing a loss function which is dependent on a given query. LTR is essentially a supervised learning task, typically carried out by extracting features from documents and predicting the corresponding labels, which reflect the relevance between the query and the documents. Different from traditional approaches based on similarity calculation, the main characteristic of LTR is to define a loss function and train a ranking model to sort the candidate documents in the document set. In this work, in a nutshell, we regard APIs as “documents”, and thus naturally cast API recommendation as an LTR problem.
LTR techniques can be classified based on the underlying learning models. Examples include SVM techniques, boosting techniques, neural network techniques, and others. A more interesting classification is based on the characteristics of the input space, where one usually speaks of pointwise, pairwise and listwise LTR [21, 22]. In general, the pointwise approach focuses on the relevance of a query and a single document. By converting each single document into a feature vector, it can predict the relevance score of the document via classification or regression methods. The pairwise approach regards ranking as comparing the relative preference between document pairs. In this way, it turns a ranking task into deciding the relative order of each document pair, which can be considered as a binary classification or a pairwise regression problem. The listwise approach takes the results of a user query (namely, a list of documents) as a single data point in the training data set, based on which a ranking model can be trained. For a new query, the model predicts a score for each document on the list and then ranks the documents in (say) descending order.
In API recommendation, it is neither practical nor necessary to obtain a fully ranked list of APIs, since programmers are merely interested in the most appropriate APIs associated with the query and ignore the irrelevant ones. Instead we only need to compare pairwise preference of a few candidate APIs with the help of programmers’ feedback. As a result, in our framework, we adopt the pairwise LTR technique.
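To make the pairwise view concrete, the following minimal sketch turns the labelled candidates of one query into preference pairs, which is the kind of training signal a pairwise ranker consumes. The function name and the toy labels are hypothetical and only serve as illustration.

```python
from itertools import combinations

def pairwise_preferences(candidates):
    """Build (preferred, other) pairs from the candidates of ONE query.

    candidates: list of (api_name, relevance_label) tuples.
    Pairs with equal labels carry no preference and are skipped.
    """
    pairs = []
    for (a, label_a), (b, label_b) in combinations(candidates, 2):
        if label_a > label_b:
            pairs.append((a, b))
        elif label_b > label_a:
            pairs.append((b, a))
    return pairs

# Toy example: the user picked Thread.interrupt (label 1); the rest get label 0.
candidates = [
    ("java.lang.Thread.start", 0),
    ("java.lang.Thread.interrupt", 1),
    ("java.lang.Thread.sleep", 0),
]
pairs = pairwise_preferences(candidates)
```

With a single selected API per query, the number of pairs grows only linearly with the length of the list, which is one reason the pairwise formulation fits sparse click-style feedback.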
2.2 Active Learning
Supervised learning requires annotated/labeled data, which in many cases may be very expensive to obtain. Active learning is proposed with the general aim of training a model of better performance with fewer training instances. When the annotated data is scarce or the cost of labeling data is high, the active learning algorithm can actively select specific data to label, and these data will then be sent to annotators. Generally speaking, the selected samples should be the most informative ones, which not only make a maximum contribution to model optimization, but also help reduce the amount of annotated data needed.
In general, the paradigm of active learning can be represented as a tuple of five elements: the model to be learnt (e.g., a classifier), the query function which acquires the most informative data from the unlabeled samples, the oracle which labels the samples, and the sets of labeled and unlabeled samples.
An active learning algorithm usually starts by training a model with only a small amount of labeled data from the labeled set. Then it invokes the query function, which defines the selection strategy, and thus obtains samples from the unlabeled set. As the next step, it submits these selected samples to the oracle for annotation and puts them into the labeled set. Finally, the newly labeled samples are used to retrain the model. This process repeats until some specific termination criteria are met, such as those based on the number of iterations or on performance-related metrics.
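The loop described above can be sketched in a few lines. This is a deliberately tiny illustration, assuming a one-dimensional toy classifier (a threshold halfway between the class means), a query function that picks the sample closest to the decision boundary, and a synthetic oracle; none of these choices come from the paper.

```python
def fit_threshold(labeled):
    """Toy 1-D model: decision threshold = midpoint between the class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def most_uncertain(threshold, unlabeled):
    """Query function: the sample closest to the boundary is the least certain."""
    return min(unlabeled, key=lambda x: abs(x - threshold))

def active_learning(labeled, unlabeled, oracle, max_iterations):
    labeled, unlabeled = list(labeled), list(unlabeled)
    threshold = fit_threshold(labeled)           # train on the small seed set
    for _ in range(max_iterations):              # termination: iteration budget
        if not unlabeled:
            break
        x = most_uncertain(threshold, unlabeled) # select the most informative sample
        unlabeled.remove(x)
        labeled.append((x, oracle(x)))           # ask the oracle for its label
        threshold = fit_threshold(labeled)       # retrain with the expanded set
    return threshold, labeled

# Synthetic oracle (assumption): the true boundary lies at 0.6.
oracle = lambda x: 1 if x >= 0.6 else 0
seed = [(0.0, 0), (1.0, 1)]
pool = [0.1, 0.3, 0.48, 0.55, 0.7, 0.9]
threshold, labeled = active_learning(seed, pool, oracle, max_iterations=3)
```

After three oracle calls the learned threshold is already close to the true boundary, although only half of the pool has been labeled.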
As illustrated by Fig. 1, the BRAID framework consists mainly of four parts:
(a) Initial API recommendation. Given a query as input, an initial API recommendation list is returned. This can be acquired by applying an existing API recommendation algorithm to the given query.
(b) The feedback repository, which stores pairs of queries and associated recommended APIs. More formally, the feedback repository is a set of pairs, each consisting of a query and the corresponding API. Initially the feedback repository is empty, but it accumulates entries in the course of interactions with users. The query and the selected API (i.e., the user feedback) are recorded in the feedback repository.
(c) The feature extraction engine, which generates a feature vector for each API on the recommended API list when a query is given. The feature vector comprises two parts, i.e., feedback features and related information features. In particular, the feedback information is obtained by looking up the feedback repository, whereas the related information is obtained from relevant domain knowledge, e.g., the official Java API documentation (cf. Section 3.2).
(d) The ranking engine, which gives the ranking of the recommended APIs for a given query. To this end, the engine applies two techniques: (1) LTR, to compute scores based on the generated feature vectors (cf. Section 3.3.1); and (2) active learning, which leverages crowdsourced knowledge (from, e.g., Stack Overflow) as an oracle and trains a classifier to predict the scores (cf. Section 3.3.2). The two scores are combined to give the final verdict (cf. Section 3.3.3).
The basic workflow of our approach is as follows.
1) When a user makes a query to the system (in the form of user input such as a short sentence in natural language), a base API recommendation method is employed to provide an initial API list.
2) The system looks up the feedback repository, checking whether or not it contains a query similar to the user query. If this is the case, the system returns the set of query-API pairs whose queries have a similarity score with the user query above a certain threshold (cf. Section 3.1). Otherwise, there is no query in the feedback repository similar to the user query (which is especially the case at the initial stage of the interaction), and the returned set is simply empty. The initial API list and the returned query-API pairs are then fed to the feature extraction engine.
3) The feature extraction engine, upon receiving the initial API list and the returned query-API pairs, computes a composite feature vector for each API. The vector includes two components, i.e., the feedback features and the related information features. (In case the returned set of pairs is empty, the vector consists solely of related information features.)
4) The ranking engine takes the feature vectors as input, and applies the trained learning-to-rank model and active learning model to obtain the prediction values. The system then calculates the API scores based on the prediction values of these two models. Afterwards, the initial API list is re-ranked in descending order of the API scores, and the new recommendations are presented to the user.
As the core component of our framework, the feedback repository is maintained throughout the life of the system and is continually updated through interaction with users. In the beginning, the feedback repository is empty. (In this case, no feedback feature can be provided, and thus BRAID outputs the initial API recommendation list as the result.) When APIs are recommended to users (e.g., programmers), the API a user selects is treated as an implicit label marking the most relevant API, i.e., the “ground-truth” recommendation for the given query. The resulting query-API pair is the feedback from the user and is stored in the feedback repository. The feedback repository thus grows gradually with more user interactions.
In general, the feedback repository is used in feature extraction (see (c) and 3) above) and in training the LTR model (cf. Section 3.3.1). We note that, for efficiency consideration, we do not re-train the LTR model whenever the feedback repository is updated. Instead it is done on a user session basis, which can strike a balance between ranking precision and overheads.
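The role of the feedback repository in the workflow (lookup of similar queries, then recording of the user's selection) can be sketched as follows. The class and method names are hypothetical, the word-overlap (Jaccard) similarity is only a stand-in for the embedding-based measure of Section 3.1, and the threshold of 0.3 and the second stored answer are illustrative assumptions.

```python
class FeedbackRepository:
    """Stores (query, api) pairs and returns those similar to a new query."""

    def __init__(self, similarity, threshold):
        self.similarity = similarity  # any function (query, query) -> [0, 1]
        self.threshold = threshold
        self.pairs = []

    def record(self, query, api):
        """Store the API the user selected for this query."""
        self.pairs.append((query, api))

    def lookup(self, user_query):
        """Return (stored_query, api, score) for pairs above the threshold."""
        hits = []
        for stored_query, api in self.pairs:
            score = self.similarity(user_query, stored_query)
            if score > self.threshold:
                hits.append((stored_query, api, score))
        return hits

def jaccard(a, b):
    """Stand-in similarity: word overlap between two queries."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

repo = FeedbackRepository(similarity=jaccard, threshold=0.3)
empty_hits = repo.lookup("killing a running thread in Java")  # cold start

repo.record("Stopping looping thread in Java", "java.lang.Thread.interrupt")
repo.record("How do I compare two dates", "java.util.Date")  # illustrative answer
hits = repo.lookup("killing a running thread in Java")
```

On an empty repository the lookup returns nothing, which corresponds to the case where the initial recommendation list is passed through unchanged.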
3.1 Preprocessing and similarity calculation
To facilitate the feature extraction and learning steps, we first need to convert user queries and APIs (as well as their related documents) into vectors. As mentioned in Section 1, the lexical gap between queries in natural languages and APIs in programming languages impedes recommendation performance. We hence use word embedding to bridge this gap during vectorization. To train the model, we collect API-related posts from the Stack Overflow website (https://stackoverflow.com/). In particular, we use the data dump from Stack Exchange (https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z, updated in March 2019). All the titles of the posts tagged with Java are extracted, since we mainly focus on Java-related API recommendation. (Note however that the general methodology is clearly not Java-specific.) The retained posts are subject to classic textual preprocessing steps including tokenization and stemming. NLTK (http://www.nltk.org/) is employed for the preprocessing task, and Word2Vec (https://radimrehurek.com/gensim/models/word2vec.html) is used to train the embedding model. Following prior work, we calculate the IDF (Inverse Document Frequency) of each word in the preprocessed post corpus, and thus build an IDF vocabulary as the weighting scheme of the embedding model.
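Similarity calculation. To calculate the similarity between a user query $q$ and a text $t$ (e.g., a query stored in the feedback repository), we first convert them to two bags of words $W_q$ and $W_t$. Then we use the semantic similarity measure introduced by Mihalcea et al.
For any word $w \in W_q$, $sim(w, W_t)$ is defined to be the maximum value of $sim(w, w')$ over the words $w' \in W_t$. Formally,
$$sim(w, W_t) = \max_{w' \in W_t} sim(w, w') \tag{1}$$
where $sim(w, w')$ is the semantic similarity of the two words $w$ and $w'$, captured by the cosine similarity of the embeddings $\mathbf{v}_w$ and $\mathbf{v}_{w'}$ of $w$ and $w'$ as vectors:
$$sim(w, w') = \frac{\mathbf{v}_w \cdot \mathbf{v}_{w'}}{\lVert \mathbf{v}_w \rVert \, \lVert \mathbf{v}_{w'} \rVert} \tag{2}$$
Based on Equation (1), the asymmetric similarity can be defined as:
$$sim(W_q \rightarrow W_t) = \frac{\sum_{w \in W_q} sim(w, W_t) \cdot idf(w)}{\sum_{w \in W_q} idf(w)} \tag{3}$$
where $idf(w)$ is the IDF weight of $w$, computed from the number of documents that contain $w$.
Finally, the (symmetric) similarity between $q$ and $t$ is derived as the arithmetic mean of $sim(W_q \rightarrow W_t)$ and $sim(W_t \rightarrow W_q)$, i.e.,
$$Sim(q, t) = \frac{sim(W_q \rightarrow W_t) + sim(W_t \rightarrow W_q)}{2} \tag{4}$$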
In this way, we can compute the similarity between the user query and other artifacts such as an API, a query in the feedback repository, etc. Recall that in step 2), the system needs to check whether there exists a query in the feedback repository which is similar to the user query. For this purpose, we set a similarity threshold to decide whether or not two queries are similar: if their similarity exceeds the threshold, they are considered to be relevant. (The threshold value was determined empirically in our experiments via trial and error.)
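The similarity measure of this section (best-matching word similarity, IDF weighting, and the arithmetic mean of the two directions) can be sketched in plain Python as follows. The two-dimensional embeddings and the IDF values are toy assumptions; in practice they would come from the trained Word2Vec model and the IDF vocabulary.

```python
import math

def cosine(u, v):
    """Cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def asymmetric_similarity(src, dst, emb, idf):
    """IDF-weighted average, over words in src, of the best cosine match in dst."""
    num = den = 0.0
    for w in src:
        if w not in emb:
            continue  # out-of-vocabulary words are skipped
        best = max((cosine(emb[w], emb[x]) for x in dst if x in emb), default=0.0)
        weight = idf.get(w, 1.0)
        num += best * weight
        den += weight
    return num / den if den else 0.0

def similarity(a, b, emb, idf):
    """Symmetric similarity: arithmetic mean of both asymmetric directions."""
    return (asymmetric_similarity(a, b, emb, idf)
            + asymmetric_similarity(b, a, emb, idf)) / 2

# Toy embeddings and IDF weights (assumptions for illustration only).
emb = {"kill": (1.0, 0.0), "stop": (0.8, 0.6), "thread": (0.0, 1.0)}
idf = {"kill": 2.0, "stop": 2.0, "thread": 1.0}
score = similarity(["kill", "thread"], ["stop", "thread"], emb, idf)
```

Here "kill" and "stop" never match literally, yet the embedding-based score is high because their vectors are close, which is exactly the gap that keyword matching cannot bridge.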
3.2 Feature extraction
Recall that the basic functionality of the feature extraction module is to compute the features of APIs. As stated in workflow step 3), the input of this module consists of the recommended API list, i.e., the top-ranked APIs recommended for the user query, and the set of query-API pairs stored in the feedback repository which, crucially, correspond to queries similar to the user query. The aim is to generate a feature vector for each of the APIs on the recommended list, based on these query-API pairs. As the feature extraction is based on the user query, it can be treated as a process of query-aware feature engineering.
The rationale is that the relevance of each API on the recommended API list to the user query depends on (1) the relevance of the API-related description information to the query, and (2) whether the feedback repository contains some API that was selected for a similar query. As a result, we consider two kinds of features:
related information features, representing the relevance to the recommended APIs as well as the associated document description;
feedback features, representing the relevance to the APIs in the feedback repository.
These are articulated as follows.
Related information feature: The related information feature of each API on the recommended API list consists of the following two parts.
API feature, representing the similarity between the user query and the API under consideration. The similarity measure is calculated as in Section 3.1.
API description feature, representing the similarity between the user query and the description of the API under consideration. The similarity measure is also calculated as in Section 3.1. The description can be obtained from official API documentation. In particular, we extract API descriptions from the official JDK 8 documentation.
Feedback feature: The feedback feature is extracted based on the similarity between the user query and the queries in the feedback repository.
We then collect the subset of the similar query-API pairs consisting of only those whose API appears on the recommended API list. Each retained pair gives rise to a tuple of the API and the similarity between its query and the user query.
We remark that there may be several tuples whose API is the same. Therefore, an API on the recommended list may have several similarities to be considered as the feedback feature, and we select the five most relevant ones as the feedback feature.
Algorithm 1 shows the pseudo-code of feedback feature generation for the recommended API list. We first create an object of HashMap type to accommodate the result (Line 2); then we sort the tuples in descending order of similarity score (Line 3). Afterwards, we iterate over the recommended API list, and for each API we create an array (Line 6) to record the five most relevant similarity values for that API, taken from the sorted tuples (Lines 7-16). Then, the pair of the API and its array is inserted into the map (Line 19).
Example. As an example, when the user query is ‘killing a running thread in Java’, we first get the recommended API list shown in Table I from an initial API recommendation tool (e.g., BIKER), and obtain the related information features of the APIs on the list. Then we look up the feedback repository and find a pair, shown in Table II, whose query is similar to the user query. Because the API of this pair, ‘java.lang.Thread.interrupt’, is on the recommended list (as the ninth API), this API and the similarity between the two queries make up a tuple. The similarity is calculated as 0.72 based on Equation (4). There are no other such tuples, so we put the similarity (0.72) into the first position of the feedback feature array, and the remaining four elements are zero. The API and this array form its feedback feature. Combining the feedback features with the related information features yields the feature vectors of the APIs on the list.
Table I
|Query|killing a running thread in java|
|Initial API list (by BIKER)|1|java.lang.Thread.start|
Table II
|Query|Stopping looping thread in Java|
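The feedback feature generation described for Algorithm 1 can be sketched as follows. This is a minimal Python rendering under the assumption that the input tuples are (API, similarity) pairs; the function name is hypothetical, and the toy data mirrors the example above with a shortened API list.

```python
def feedback_features(api_list, tuples, size=5):
    """Map each recommended API to its `size` highest feedback similarities.

    api_list: APIs on the recommended list.
    tuples:   (api, similarity) pairs derived from similar feedback queries.
    Arrays shorter than `size` are padded with zeros.
    """
    result = {}
    # Sort once in descending order of similarity.
    ranked = sorted(tuples, key=lambda t: t[1], reverse=True)
    for api in api_list:
        values = [sim for candidate, sim in ranked if candidate == api][:size]
        values += [0.0] * (size - len(values))
        result[api] = values
    return result

api_list = ["java.lang.Thread.start", "java.lang.Thread.interrupt"]
tuples = [("java.lang.Thread.interrupt", 0.72)]
features = feedback_features(api_list, tuples)
```

APIs that never appear in the feedback tuples receive an all-zero feedback feature, so their ranking is driven by the related information features alone.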
3.3 Re-ranking recommendation API list
In this section, we describe the functionality of the ranking engine. As stated earlier, the input is a list of APIs produced by the adopted recommendation tool, endowed with feature vectors based on the user query. The ranking engine aims to re-rank the APIs on the list so the recommendation is more customised to the users’ feedback. To this end, we harness two techniques, i.e., LTR and active learning.
3.3.1 LTR model and rank scores
LTR is a supervised learning approach, which demands labeled training data. To this end, we use the recommended APIs for the queries stored in the feedback repository. Recall that each query-API pair in the feedback repository has gone through feature engineering (Section 3.2). We can then collect the feature vectors of the APIs on the recommended list of each stored query, and label the selected API (i.e., the one recorded in the pair) as 1 and the others as 0. This process gives rise to the labeled training data set for the LTR model.
We adopt LambdaMART, a widely used ranking algorithm, as our LTR model. LambdaMART is a boosted tree model with an optimization strategy based on LambdaRank. The key observation behind this strategy is that, in order to train a model, the objective function itself is not needed. Instead, we only need the gradient of the objective function, which can be modeled by the sorted positions of the items for a given query. In LambdaMART, we assume that there is an objective utility function Util whereby we define
where, for two feature vectors of which the first ranks higher than the second, the score terms are the scores the model assigns to the two vectors, the sigmoid parameter determines the shape of the sigmoid function, and the remaining factor can be the change in any ranking measure generated by swapping the rank positions of the two vectors. For example, when this factor stands for the change in the MAP metric, the model actually optimizes MAP directly.
Symmetrically, in case that ranks higher than , we define
where is the indicator function defined as:
It follows that, for each feature vector , we can define the utility function as
Since we build the LTR model based on tree-based algorithms, the regularization term is based on the complexity of the tree model. More concretely, it is defined as
$$\Omega = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$$
where $T$ represents the number of leaf nodes, $w_j$ is the weight of the $j$-th leaf node, and $\gamma$ and $\lambda$ are hyper-parameters used to adjust the weights of $T$ and $w$. (In our setting, these two hyper-parameters are set to 0.3 and 1, respectively.)
Finally, the objective of our LTR model is to maximize
where ranges over all labeled samples.
LambdaMART trains a boosted tree model, MART (multiple additive regression trees), in which the prediction value of the model is a linear combination of the outputs of a set of regression trees. In our LTR model, LambdaMART maps a feature vector $\mathbf{x}$ to a score $F(\mathbf{x})$, which can be written as:
$$F(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i f_i(\mathbf{x})$$
where $f_i$ is a function modeled by a single regression tree and $\alpha_i$ is the weight associated with the $i$-th regression tree. Both $f_i$ and $\alpha_i$ are learned during training, and $N$ is the number of trees.
Given a user query, we extract features as in Section 3.2, and then use the trained LTR model to predict the rank scores for the recommended API list. The result comprises one rank score for the feature vector of each API on the list.
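To illustrate the gradient-only idea behind LambdaRank, the sketch below computes per-item "lambdas" for one query in the form commonly given in the LambdaRank literature: a sigmoid of the score difference scaled by the change in a ranking metric. It is not necessarily the exact formulation used in this paper; the function names are hypothetical, and the constant metric change of 1.0 (which reduces the scheme to plain pairwise logistic gradients) is an assumption made to keep the example short.

```python
import math

def pairwise_lambda(s_i, s_j, delta_metric, sigma=1.0):
    """Gradient magnitude for a pair in which item i should rank above item j.

    The worse the current scores order the pair (s_i far below s_j),
    the larger the resulting push.
    """
    return sigma * abs(delta_metric) / (1.0 + math.exp(sigma * (s_i - s_j)))

def accumulate_lambdas(scores, labels, delta_fn, sigma=1.0):
    """Sum the pairwise pushes into one lambda per item of a single query."""
    lambdas = [0.0] * len(scores)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                lam = pairwise_lambda(scores[i], scores[j], delta_fn(i, j), sigma)
                lambdas[i] += lam  # push the preferred item up
                lambdas[j] -= lam  # push the other item down
    return lambdas

# One query, two APIs: the relevant one (label 1) currently has the lower score.
scores, labels = [0.2, 0.9], [1, 0]
lambdas = accumulate_lambdas(scores, labels, delta_fn=lambda i, j: 1.0)
```

In a full implementation, `delta_fn` would return the change in a metric such as MAP obtained by swapping the two items, and each boosting round would fit a regression tree to these lambdas.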
3.3.2 Active learning model and relevance scores
We utilize the active learning technique to improve the learning efficiency when the feedback repository data is scarce. An active learning algorithm usually starts by training a model with selective labeled data for which we follow the same approach as LTR (cf. Section 3.3.1). The structure of the active learning module is shown in Fig. 2.
To instantiate the active learning paradigm, we use the Logistic Regression algorithm to train the model. The uncertainty sampling strategy, which selects the most informative data (i.e., the data which may not be classified well by the classifier model), is used as the query function. In our work, we collect query-API pairs to serve as the oracle. These query-API pairs represent crowdsourced knowledge derived from the questions and accepted answers in Stack Overflow posts, which can be used to annotate the selected data. Note that this is one way to instantiate the oracle; one can certainly seek other resources to serve as the oracle.
Because BRAID outputs the initial API recommendation list when the feedback repository is empty, the active learning module starts to play its role only once feedback data is available. We train the model by the following steps. First, we collect the feature vectors of the APIs recommended for the queries in the feedback repository (cf. Section 3.3.1) and label them to form the labeled set. We thus formulate the task as a classification problem, and accordingly use the labeled set to build an active learning classifier model. Next, we collect the feature vectors of the recommended APIs of the queries in Stack Overflow to form the unlabeled set. After applying the current model to the unlabeled set, we use the uncertainty sampling strategy to select the data about which the classifier is least certain. Then the queries corresponding to the selected data are sent to the oracle for annotation, and the results are put into the feedback repository. The selected samples, along with their labels, are used to expand the labeled set and retrain the classifier model. The above steps are repeated, and we finally obtain the optimized classifier model and an expanded feedback repository, which is also used to provide input to the LTR model (cf. Section 3.3.1).
Similar to LTR, we consider the features extracted from a given query (cf. Section 3.2) as input, and use the well-trained classifier to predict the relevance of each API on the recommended list. Because there are only two classes (either 1 or 0), in which ‘1’ represents relevant while ‘0’ means irrelevant, the probability generated by the classifier can be interpreted as the relevance score. The relevance scores are then obtained by computing this probability for each API on the recommended API list of a user query. In this way, we can combine active learning with API recommendation systems.
3.3.3 Re-ranking list and collecting user feedback
The last step is to re-rank the API list. In Section 3.3.1 and Section 3.3.2, we have obtained two predictions for each API on the recommended list, i.e., the rank score and the relevance score, through the well-trained LTR and active learning models respectively. By normalizing the rank score, we calculate the overall prediction score of each API. The overall score combines the rank score of the API, normalized using the maximum and minimum values of the rank scores on the list, with its relevance score, using a weight whose value is dynamic and depends on the position of the API on the list. We then re-rank the list in descending order of the final prediction score. Programmers can choose an adequate API from the re-ranked list corresponding to the query. Meanwhile, the decision is recorded in the feedback repository.
In this section, we evaluate the proposed BRAID approach. We shall mainly study the following research questions (RQs).
RQ1. How effective is BRAID to recommend API for given queries in general?
RQ2. How does the feedback information contribute to BRAID for recommending API, in particular, how does the accumulation of the feedback repository improve the performance of BRAID?
RQ3. How do LTR and active learning techniques contribute to BRAID respectively?
RQ4. Is the overhead introduced by BRAID acceptable?
The BRAID approach is essentially an “add-on” technique, which is designed to be instrumented to extant query-based API recommendation systems for which we use three representative systems, i.e., BIKER, RACK, and NLP2API, as baselines.
BIKER collects 413 questions, along with their ground-truth APIs, as the testing dataset for the empirical study. They are extracted from API-related posts on Stack Overflow following the approach of Ye et al. The question titles of the posts are considered as the queries, whereas the APIs referred to in the accepted answers are treated as the standard answers. Sometimes, for a common programming task query, other answers that are helpful but not flagged as accepted can also be added to the correct answers.
RACK collects 150 queries for the evaluation from three Java tutorial sites: KodeJava (https://kodejava.org), JavaDB (https://www.javadb.com) and Java2s (https://java2s.com). These sites contain a large number of programming tasks whose descriptions are generally composed of three parts, i.e., a question title, a solution consisting of code snippets, and a comment explaining the code. Similar to the accepted answers in Stack Overflow posts, the comment explaining the code also refers to one or more APIs which are vital to dealing with the question. Hence the ground-truth dataset is made up of the question titles of the programming tasks on these sites and the corresponding APIs extracted from the code explanations.
NLP2API collects 310 code search query-API pairs. Similar to RACK, the source of data is also Java tutorial sites. In addition to the sites which RACK refers to, it also draws on the data from CodeJava (https://www.codejava.net). Thus, besides the 150 queries already gathered by RACK, there are 160 new ground-truth pairs, which make up the 310 pairs of NLP2API. Though some query-API pairs of NLP2API are the same as those of RACK, this has no effect on our evaluation results because of the differences in the recommendation algorithms. The ground-truth data set of this API recommendation system is composed in the same way as that of RACK.
In the experiments, we reuse the existing data sets, as well as the implementations, from the replication packages of the baselines, i.e., BIKER (https://github.com/tkdsheep/BIKER-ASE2018), RACK (http://homepage.usask.ca/~masud.rahman/rack/), and NLP2API (https://github.com/masud-technope/NLP2API-Replication-Package). The data set is randomly split into a training set and a testing set at a ratio of 9:1. In addition, for a fair comparison, we repeat the split three times, corresponding to three runs of each individual experiment. For each split, the result is recorded, and the average values are reported as the final results. Our implementation is based on XGBoost (ver. 0.82) and modAL (ver. 0.3.4, https://github.com/modAL-python/modAL) for the LTR and active learning modules respectively. The experiments are conducted on a PC running Windows 10 with an AMD Ryzen 5 1600 CPU (6 cores, 3.2GHz) and 8GB DDR4 RAM.
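The repeated 9:1 split can be sketched as follows. The seeds, the helper name, and the placeholder query identifiers are assumptions for illustration; the paper does not state how the random splits were generated.

```python
import random

def split_9_to_1(items, seed):
    """Shuffle with a fixed seed and split into 90% training / 10% testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 9) // 10
    return shuffled[:cut], shuffled[cut:]

# Placeholder data set of 150 queries (the size of the RACK data set).
queries = [f"query-{i}" for i in range(150)]

# Three independent splits, one per experimental run; metrics are then
# computed per split and averaged.
splits = [split_9_to_1(queries, seed) for seed in (0, 1, 2)]
```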
4.2 Performance metrics
Hit@k (Top-k accuracy) is the percentage of queries for which at least one relevant API appears within the top-$k$ results. Formally,
$$\mathrm{Hit@}k = \frac{Hit(k)}{|Q|}$$
where $Hit(k)$ represents the number of queries whose relevant API appears in the top-$k$ results, and $|Q|$ is the total number of queries.
Mean Average Precision (MAP) is the mean of the average precision (AP) scores over all queries. Formally,
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|I_i|} \sum_{j \in I_i} \frac{num(j)}{j}$$
where $I_i$ is the set of ranking positions of the relevant APIs in the ranked API list of the $i$-th query, and $num(j)$ represents the number of relevant APIs in the top-$j$ results.
Mean Reciprocal Rank (MRR) takes the reciprocal of the rank of the first relevant API for each query and averages these values over all queries. Formally,
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{first_i}$$
where $first_i$ represents the ranking position of the first relevant API for the $i$-th query.
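To make the three metrics concrete, the following sketch computes them for ranked recommendation lists. It rests on two assumptions that the text does not spell out: AP is normalized by the number of relevant APIs that actually appear in the ranked list, and a query with no relevant API in its list contributes 0. The function names are illustrative.

```python
def hit_at_k(ranked_lists, relevant_sets, k):
    """Fraction of queries with at least one relevant API in the top-k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(api in relevant for api in ranked[:k])
    )
    return hits / len(ranked_lists)


def mean_average_precision(ranked_lists, relevant_sets):
    """Mean over queries of the average precision at each relevant position."""
    ap_scores = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        found = 0
        precisions = []
        for position, api in enumerate(ranked, start=1):
            if api in relevant:
                found += 1  # number of relevant APIs in the top-`position`
                precisions.append(found / position)
        ap_scores.append(sum(precisions) / len(precisions) if precisions else 0.0)
    return sum(ap_scores) / len(ap_scores)


def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Mean over queries of 1 / (rank of the first relevant API)."""
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        rr = 0.0
        for position, api in enumerate(ranked, start=1):
            if api in relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Toy example with two queries (API names are placeholders).
ranked = [["Math", "String", "Random"], ["List", "Map", "Set"]]
relevant = [{"Math"}, {"Map", "Set"}]
print(hit_at_k(ranked, relevant, 1))             # 0.5
print(mean_average_precision(ranked, relevant))  # (1 + 7/12) / 2 = 19/24
print(mean_reciprocal_rank(ranked, relevant))    # (1 + 1/2) / 2 = 0.75
```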
4.3 Experimental results
RQ1. How effective is BRAID to recommend API for given queries in general?
In the experiment, we randomly select 10 query-answer pairs from the training set which are used to build the feedback repository; one example is given in Table III. Note that these randomly selected query-answer pairs have low correlation with the queries in the testing set.
Table III: Queries in an example feedback repository
|Query|
|How to implement the hashCode and equals method using Apache Commons|
|How do I set the value of file attributes|
|How do I compare two dates|
|How do I call a stored procedure that return a result set|
|How do I turn the Num Lock button on|
|How to Move image on screen|
|Connect with a Web server|
|Execute a command from code|
|How do I create a web based file upload|
|Get Request Parameters in a Servlet|
As the main aim of this experiment is to investigate how effectively the feedback repository improves recommendation, we keep the feedback repository unchanged. We use queries from the testing set to evaluate the three baselines BIKER, RACK and NLP2API augmented with BRAID, i.e., BRAID (BIKER), BRAID (RACK) and BRAID (NLP2API) respectively. We measure the performance of BRAID (BIKER), BRAID (RACK) and BRAID (NLP2API) at the class level in terms of Hit@1, Hit@3, Hit@5, MAP and MRR. In addition to the feedback repository shown in Table III, we repeat the experiment four more times, each time with a different (randomly sampled) feedback repository and set of queries. We calculate the average metrics, and the results are shown in Table IV.
Note that RACK and NLP2API recommend API classes, while BIKER can recommend API methods and API classes. In this experiment, the APIs in the feedback repository are at the class level, so we consider the case of BIKER recommended API class in this experiment. As for the other experiments, we consider API methods recommendation for BIKER.
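Table IV: Average improvement of BRAID over each baseline
|Approach|Hit@1|Hit@3|Hit@5|MAP|MRR|
|Imp. Avg. BRAID (BIKER)|11.7%|1.1%|0.81%|3.56%|3.73%|
|Imp. Avg. BRAID (RACK)|18.67%|31.85%|11.76%|11.88%|16.04%|
|Imp. Avg. BRAID (NLP2API)|27.74%|30%|14.92%|31.31%|21.64%|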
From Table IV, one can see that almost all metrics have improved compared with the baselines. In general, even when a small-scale feedback repository is harnessed, BRAID demonstrates substantial improvements over the baselines by 11.7%, 1.1%, 0.81%, 3.56%, 3.73% for BIKER, 18.67%, 31.85%, 11.76%, 11.88%, 16.04% for RACK and 27.74%, 30%, 14.92%, 31.31%, 21.64% for NLP2API respectively. This confirms that the feedback repository is effective in boosting the performance of API recommendations. In addition, the same feedback repository works well on the three API recommendation systems (BIKER, RACK and NLP2API), which demonstrates the generalization ability of BRAID for query-based API recommendation.
RQ2. How does the accumulation of the feedback repository improve the performance of BRAID?
In the first experiment, we fix the feedback repository. In a real scenario, the feedback repository is updated with the feedback received from end users. How does the accumulation of the feedback repository (representing the feedback information) influence the recommendation results? This experiment aims to answer this question.
We randomly select the query-answer pairs from the training set to form the feedback repository. The size of the feedback repository varies from 0% to 100% of the training set, with an increment of 10%. Note that the baseline is represented by the case of size equal to 0%, where the feedback repository is disabled. For each sampled feedback repository, we carry out experiments 10 times and the reported results represent the average.
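The sampling protocol above can be sketched as a loop over repository sizes. This is an illustrative outline only: `evaluate` is a hypothetical callback standing in for running a BRAID-augmented recommender with the given feedback repository and returning one metric value.

```python
import random
from statistics import mean


def sample_feedback(train_pairs, percent, rng):
    """Randomly draw `percent`% of the training pairs as the feedback repository."""
    size = len(train_pairs) * percent // 100
    return rng.sample(train_pairs, size)


def accumulation_curve(train_pairs, evaluate, repeats=10, seed=0):
    """Average a metric over `repeats` samples for sizes 0%, 10%, ..., 100%.

    A size of 0% corresponds to the baseline, where the feedback
    repository is empty.
    """
    rng = random.Random(seed)
    curve = {}
    for percent in range(0, 101, 10):
        scores = [
            evaluate(sample_feedback(train_pairs, percent, rng))
            for _ in range(repeats)
        ]
        curve[percent] = mean(scores)
    return curve
```

Plotting `curve` against the size percentages would give a trend line of the kind shown for each baseline.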
Table V presents the experimental results. To better visualize the trend, we also plot the results in Fig. 3. One can observe that the performance improves with the accumulation of the feedback repository. This is consistent across all three baselines, indicating that our approach generalizes across query-based recommenders. In particular, all the metrics improve considerably. MAP and MRR rise by 10% for BIKER, 21% for RACK, and over 25% for NLP2API.
Hit@1, arguably the most important indicator, enjoys the largest boost, which demonstrates that our approach can use feedback information to rank the most relevant API at the top-1 position. Fig. 4 (BRAID curve in blue) shows the Hit@1 metric for all the baselines: in absolute terms, Hit@1 increases by 14.5% for BIKER, by 35.42% for RACK, and by 32.26% for NLP2API.
RQ3. How do LTR and active learning techniques contribute to BRAID respectively?
Recall that our approach makes use of two learning techniques, i.e., LTR and active learning. To better interpret the performance improvement of BRAID, we perform an ablation analysis to pinpoint the individual contribution of each technique to the performance.
In the experiment, similar to the previous one, we gradually increase the size of the feedback repository. At each stage, we disable either LTR or active learning and collect the performance metrics accordingly. We calculate the results of baselines for testing data and the averages (over all stages) of LTR and active learning techniques respectively. The experimental results are given in Table VI.
Table VI: Average improvement of LTR, active learning (AL) and BRAID over each baseline
|Baseline|Technique|Hit@1|Hit@3|Hit@5|MAP|MRR|
|BIKER|Imp. Avg. LTR|17.83%|11.17%|7.66%|8.75%|10.6%|
|BIKER|Imp. Avg. AL|21.17%|9.72%|4.98%|10.29%|11.2%|
|BIKER|Imp. Avg. BRAID|24.53%|13.28%|10.2%|12.96%|14.01%|
|RACK|Imp. Avg. LTR|65.39%|31.2%|13.74%|29.27%|33.34%|
|RACK|Imp. Avg. AL|69.94%|22.9%|8.26%|24.62%|31.59%|
|RACK|Imp. Avg. BRAID|77.21%|33.74%|14.63%|32.15%|37.52%|
|NLP2API|Imp. Avg. LTR|67.68%|36.91%|15.84%|49.09%|40.29%|
|NLP2API|Imp. Avg. AL|67.54%|35.98%|16.26%|48.88%|39.77%|
|NLP2API|Imp. Avg. BRAID|72.55%|38.84%|17%|52.57%|42.83%|
Table VI shows the roles that learning-to-rank and active learning play in boosting API recommendation. The two techniques contribute differently across the baselines. Active learning performs better than LTR for three baselines on average. Moreover, RACK has the lowest performance among the three baselines but gains the largest boost from our approach. The improvement trend of the two techniques is consistent across all three baselines. We also find, from these trends, that both techniques improve Hit@1, Hit@3, MAP and MRR more than Hit@5, and the effect on Hit@1 is the most pronounced. Although LTR and active learning improve the performance to different degrees, neither of them alone performs better than their combination, which justifies the methodology adopted by BRAID.
In Fig. 4, we plot the Hit@1 curves of the overall BRAID approach (as discussed in RQ2), LTR and active learning with respect to feedback sizes. The figures show that when the feedback repository is small, active learning performs better (except on RACK), whereas when a lot of feedback data is available, LTR performs better on all three baselines. In general, as more feedback is incorporated, LTR, active learning and BRAID all improve steadily and outperform the original baselines (whose Hit@1 is 38.33%, 31.25% and 33.33% for BIKER, RACK and NLP2API respectively). Notably, the overall BRAID achieves the greatest improvement, which confirms the importance of combining LTR and active learning.
RQ4. Is the overhead introduced by BRAID acceptable?
As an “add-on” technique, when used in conjunction with existing recommendation systems, BRAID boosts the effectiveness (as demonstrated by the previous experiments) but inevitably introduces overheads. Are these overheads acceptable? This is what we are investigating.
Table VII shows the runtime of our approach. The original time records the runtime of the baseline. The extraction time represents the time spent on feature extraction. The training time represents the time for training the ranking model of BRAID. The ranking time represents the time to re-rank the recommended API list. The total time is the sum of the extraction, training and ranking times, and represents the overhead introduced by BRAID. The pct. (%) column gives the total time as a percentage of the original time.
We repeat this experiment 10 times on each baseline. Each time, we issue 10 user queries and measure the runtime of each query. From Table VII, we can see that most of the total time is spent on training the ranking model, while the re-ranking time is largely negligible (measured in seconds). Among the three baselines, BIKER takes the longest time, 14.29 seconds, most of which is spent on loading data. Overall, BRAID takes 0.203 seconds on average on BIKER (1.42% of the original time), 0.173 seconds on RACK (1.73% of the original time), and 0.175 seconds on NLP2API (4.41% of the original time).
Table VII: Runtime overhead introduced by BRAID (columns: Approach, Original (s), Extraction (s), Training (s), Ranking (s), Total (s), Pct. (%))
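As a quick check of how the percentage column relates to the timings, the sketch below recomputes the BIKER figure from the two values stated in the text (14.29 s original, 0.203 s total overhead). The function names are illustrative and not part of BRAID.

```python
def total_overhead(extraction_s, training_s, ranking_s):
    """Total BRAID overhead: feature extraction + model training + re-ranking."""
    return extraction_s + training_s + ranking_s


def overhead_pct(total_s, original_s):
    """Overhead expressed as a percentage of the baseline's original runtime."""
    return 100.0 * total_s / original_s


# BIKER: 0.203 s of overhead on top of a 14.29 s baseline run.
print(round(overhead_pct(0.203, 14.29), 2))  # 1.42
```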
5 Threats to Validity
Threats to internal validity are related to experimental errors and biases. The main threats of this kind originate from the potential bias introduced in the data. To ensure a fair comparison with the baselines, we use the same data published in the replication packages of the original work. Moreover, we directly employ their tools to avoid possible errors during re-implementation. The experiments in our study are usually conducted three times, and the average values are used as the final results. In the active learning process, we leverage crowdsourced knowledge from Stack Overflow posts as oracles to provide feedback data. This strategy is adopted in many studies, including the comparative study and other research work. To ensure quality, we manually double-check the extracted data and confirm that these labels are correct.
Threats to external validity concern the extent to which the results can be generalized to cases other than those used in the experiments. Indeed, as in other empirical studies, it is hard to guarantee that our framework works well on every third-party recommendation approach. However, we believe that the three state-of-the-art tools selected to demonstrate the advantage of our approach are representative, and that the comprehensive experiments illustrate the performance enhancement well. In addition, our experiments concentrate on Java APIs, which is the same strategy adopted in the baseline work. Nevertheless, BRAID is designed to be a language-independent framework: our methodology does not capitalize on any peculiarities of Java, so we believe it can be adapted to programming languages other than Java.
6 Related Work
Recommendation systems have been intensively studied in software engineering to assist developers with a wide range of activities [36, 37]. Rather than giving a detailed literature review, we mainly discuss the work most closely related to ours. In particular, we focus on three threads of work, i.e., search based code recommendation, generation based code recommendation/completion, and techniques for ranking results.
Search based code recommendation. Code recommendation generally starts from code search. When facing a programming problem, developers usually turn to the Internet for help. Indeed, a recent case study conducted at Google confirmed that developers search for code very frequently. Work in this category typically leverages code from open source projects, sometimes augmented with various software artifacts to enhance recommendation precision. Examples include Strathcona, Portfolio, SENSORY, and Aroma. Strathcona recommends code examples to developers by comparing structural similarity in the code repository. Portfolio mainly combines NLP, PageRank and spreading activation network algorithms to find the most relevant code for users. SENSORY considers statement sequence information and uses the Burrows-Wheeler Transform algorithm to search the code repository, and then re-ranks the results based on structural information. Aroma takes a partial code snippet as query input and returns a set of code snippets as recommendations. The above approaches mainly rely on code information to perform recommendation.
Meanwhile, some approaches employ additional information from other software artifacts or crowdsourced knowledge. Examples include BIKER, RACK, and NLP2API, all of which serve as our baselines in this paper. These approaches leverage Q&A posts from the Stack Overflow website to find the most relevant APIs. Like our work, NLP2API also incorporates (pseudo-) feedback information, but its purpose is to reformulate the query. Similarly, QUICKAR also aims to automatically reformulate a given query. Examples that augment recommendation with other information include APIREC and FOCUS. APIREC leverages fine-grained change commit history from GitHub to extract frequent change patterns that supplement the recommendation process. FOCUS tackles the usage pattern recommendation problem from the perspective of collaborative filtering, consulting information from similar projects during the recommendation process. Thung et al. unify historical feature requests and API documentation to recommend API methods. Yuan et al. combine code parsing and text processing on Android tutorials and SDK documents to recommend functional APIs in Android. Ponzanelli et al. propose a holistic recommendation system, Libra, which integrates the IDE and the web browser. Libra can provide more personalized recommendations since it records developers' navigation history and other contextual information.
Generation based code recommendation/completion. Another important thread mostly bases its methodology on deep learning related techniques. White et al. empirically demonstrate that a relatively simple RNN model can outperform n-gram models at certain software engineering tasks, such as code suggestion. Gu et al. propose DeepAPI, which adapts a neural language model to encode the words of the query and the associated API sequences. By training the model with a large corpus of annotated API sequences from GitHub, DeepAPI can generate API usage sequences for a query. In their subsequent work, a deep neural network model, CODEnn, was proposed to bridge the lexical gap between queries and source code. It generates a unified vector representation for both code and descriptions. Raychev et al. combine 3-gram and RNN models to synthesize code snippets, completing method invocations and invocation parameters. Although this thread of research mainly generates target code entities, such approaches could still be plugged into our framework, as long as an initial API recommendation list can be produced.
Ranking recommendation results. Apart from different approaches to code recommendation, a few initiatives have focused on applying machine learning based techniques to rank the recommendation candidates. Thung et al. propose an automated approach, WebAPIRec, which converts web API recommendation into a personalized ranking task based on historical API usage data. WebAPIRec learns a model that minimizes errors in the ordering of web APIs. Unlike our work, WebAPIRec does not utilize feedback information during recommendation. Liu et al. propose a ranking-based discriminative approach, RecRank, to optimize the top-1 recommendation on top of APIREC. Specifically, it uses usage path based features to rank the recommendation list generated by APIREC. In contrast, our approach is not bound to any particular recommendation method. In addition, RecRank does not consider feedback information either. Niu et al. apply the LTR technique to recommend code examples given a query. A pair-wise LTR algorithm is employed to train a ranking schema, which can later be used for new queries. They also use LTR techniques, but address a different recommendation problem. Moreover, feedback information is also neglected in their approach.
7 Conclusion
In this paper, we propose BRAID, a novel framework to boost the performance of query-based API recommendation systems. BRAID takes a user query and the result of an existing API recommender as input. It uses the user's click history as feedback information and leverages learning-to-rank and active learning techniques to build a new API recommendation model. As feedback information accumulates, BRAID performs increasingly better than the baseline API recommenders. The experiments show that BRAID can substantially enhance the effectiveness of state-of-the-art API recommenders. In future work, we plan to develop a full-fledged tool based on BRAID as a plugin for current mainstream IDEs to better support programming. In addition, we believe the approach put forward in this paper has broader applicability, and we plan to extend it to other recommendation scenarios in software engineering.
Acknowledgment
This work was partially supported by the National Key R&D Program of China (No. 2018YFB1003902), the National Natural Science Foundation of China (NSFC, No. 61972197), the Collaborative Innovation Center of Novel Software Technology and Industrialization, and the Qing Lan Project. T. Chen is partially supported by Birkbeck BEI School Project (ARTEFACT), NSFC grant (No. 61872340), and Guangdong Science and Technology Department grant (No. 2018B010107004).
References
- [1] J. Brandt, P. J. Guo, J. Lewenstein, M. Dontcheva, and S. R. Klemmer, “Two studies of opportunistic programming: interleaving web foraging, learning, and writing code,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 2009, pp. 1589–1598.
- [2] M. Piccioni, C. A. Furia, and B. Meyer, “An empirical study of api usability,” in Acm, 2013.
- [3] Q. Huang, X. Xia, Z. Xing, D. Lo, and X. Wang, “Api method recommendation without worrying about the task-api knowledge gap,” in Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. ACM, 2018, pp. 293–304.
- [4] H. Yu, W. Song, and T. Mine, “Apibook: an effective approach for finding apis,” in Proceedings of the 8th Asia-Pacific Symposium on Internetware. ACM, 2016, pp. 45–53.
- [5] P. Nguyen, J. Di Rocco, D. Ruscio, L. Ochoa, T. Degueule, and M. Di Penta, “Focus: A recommender system for mining api function calls and usage patterns,” in 41st ACM/IEEE International Conference on Software Engineering (ICSE), 2019.
- [6] J. Fowkes and C. Sutton, “Parameter-free probabilistic api mining across github,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016, pp. 254–265.
- [7] S. Haiduc and A. Marcus, “On the effect of the query in ir-based concept location,” in IEEE International Conference on Program Comprehension, 2011.
- [8] J. Yang and T. Lin, “Inferring semantically related words from software context,” 2012.
- [9] X. Li, H. Jiang, Y. Kamei, and X. Chen, “Bridging semantic gaps between natural languages and apis with word embedding,” IEEE Transactions on Software Engineering, 2018.
- [10] P. Resnick and H. R. Varian, “Recommender systems,” Communications of the ACM, vol. 40, no. 3, pp. 56–59, 1997.
- [11] G. Salton and C. Buckley, “Improving retrieval performance by relevance feedback,” Journal of the American society for information science, vol. 41, no. 4, pp. 288–297, 1990.
- [12] C. Carpineto and G. Romano, “A survey of automatic query expansion in information retrieval,” Acm Computing Surveys, vol. 44, no. 1, pp. 1–50, 2012.
- [13] M. M. Rahman, C. K. Roy, and D. Lo, “Rack: Automatic api recommendation using crowdsourced knowledge,” in IEEE International Conference on Software Analysis, 2016.
- [14] M. M. Rahman and C. Roy, “Effective reformulation of query for code search using crowdsourced knowledge and extra-large data analytics,” in 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2018, pp. 473–484.
- [15] T.-Y. Liu et al., “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
- [16] H. Li, “Learning to rank for information retrieval and natural language processing,” Synthesis Lectures on Human Language Technologies, vol. 4, no. 1, pp. 1–113, 2011.
- [17] X. Ye, R. Bunescu, and C. Liu, “Learning to rank relevant files for bug reports using domain knowledge,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2014, pp. 689–699.
- [18] Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon, “Adapting ranking svm to document retrieval,” in Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2006, pp. 186–193.
- [19] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm for combining preferences,” Journal of machine learning research, vol. 4, no. Nov, pp. 933–969, 2003.
- [20] Y. Song, H. Wang, and X. He, “Adapting deep ranknet for personalized search,” in Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 2014, pp. 83–92.
- [21] F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li, “Listwise approach to learning to rank: theory and algorithm,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 1192–1199.
- [22] Z. Cao, T. Qin, T.-Y. Liu, M.-F. Tsai, and H. Li, “Learning to rank: from pairwise approach to listwise approach,” in Proceedings of the 24th international conference on Machine learning. ACM, 2007, pp. 129–136.
- [23] B. Settles, “Active learning literature survey,” University of Wisconsin-Madison Department of Computer Sciences, Tech. Rep., 2009.
- [24] R. Mihalcea, C. Corley, C. Strapparava et al., “Corpus-based and knowledge-based measures of text semantic similarity,” in Aaai, vol. 6, no. 2006, 2006, pp. 775–780.
- [25] C. J. Burges, “From ranknet to lambdarank to lambdamart: An overview,” Tech. Rep. MSR-TR-2010-82, June 2010. [Online]. Available: https://www.microsoft.com/en-us/research/publication/from-ranknet-to-lambdarank-to-lambdamart-an-overview/
- [26] C. J. C. Burges, R. Ragno, and Q. V. Le, “Learning to rank with nonsmooth cost functions,” in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, 2006, pp. 193–200. [Online]. Available: http://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions
- [27] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. ACM, 2016, pp. 785–794.
- [28] D. D. Lewis and J. Catlett, “Heterogeneous uncertainty sampling for supervised learning,” in Eleventh International Conference on International Conference on Machine Learning, 1994.
- [29] X. Ye, H. Shen, X. Ma, R. Bunescu, and C. Liu, “From word embeddings to document similarities for improved information retrieval in software engineering,” in Proceedings of the 38th international conference on software engineering. ACM, 2016, pp. 404–415.
- [30] R. F. Silva, C. K. Roy, M. M. Rahman, K. A. Schneider, K. Paixao, and M. de Almeida Maia, “Recommending comprehensive solutions for programming tasks by mining crowd knowledge,” in Proceedings of the 27th International Conference on Program Comprehension. IEEE Press, 2019, pp. 358–368.
- [31] X. Liu, L. Huang, and V. Ng, “Effective api recommendation without historical software repositories.” in ASE, 2018, pp. 282–292.
- [32] C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu, “Portfolio: finding relevant functions and their usage,” in Proceedings of the 33rd International Conference on Software Engineering. ACM, 2011, pp. 111–120.
- [33] C. Manning, P. Raghavan, and H. Schütze, “Introduction to information retrieval,” Natural Language Engineering, vol. 16, no. 1, pp. 100–103, 2010.
- [34] R. Feldt and A. Magazinius, “Validity threats in empirical software engineering research-an initial survey.” in Seke, 2010, pp. 374–379.
- [35] L. Nie, H. Jiang, Z. Ren, Z. Sun, and X. Li, “Query expansion based on crowd knowledge for code search,” IEEE Trans. Services Computing, vol. 9, no. 5, pp. 771–783, 2016. [Online]. Available: https://doi.org/10.1109/TSC.2016.2560165
- [36] M. Robillard, R. Walker, and T. Zimmermann, “Recommendation systems for software engineering,” IEEE software, vol. 27, no. 4, pp. 80–86, 2009.
- [37] M. Gasparic and A. Janes, “What recommendation systems for software engineering recommend: A systematic literature review,” Journal of Systems and Software, vol. 113, pp. 101–113, 2016.
- [38] C. Sadowski, K. T. Stolee, and S. Elbaum, “How developers search for code: a case study,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 2015, pp. 191–201.
- [39] R. Holmes and G. C. Murphy, “Using structural context to recommend source code examples,” in Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005. IEEE, 2005, pp. 117–125.
- [40] L. Ai, Z. Huang, W. Li, Y. Zhou, and Y. Yu, “Sensory: Leveraging code statement sequence information for code snippets recommendation,” in 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), vol. 1. IEEE, 2019, pp. 27–36.
- [41] S. Luan, D. Yang, C. Barnaby, K. Sen, and S. Chandra, “Aroma: Code recommendation via structural code search,” Proceedings of the ACM on Programming Languages, vol. 3, no. OOPSLA, p. 152, 2019.
- [42] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” in International Conference on World Wide Web, 1998, pp. 107–117.
- [43] M. M. Rahman and C. K. Roy, “Quickar: automatic query reformulation for concept location using crowdsourced knowledge,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 220–225.
- [44] A. T. Nguyen, M. Hilton, M. Codoban, H. A. Nguyen, L. Mast, E. Rademacher, T. N. Nguyen, and D. Dig, “Api code recommendation using statistical learning from fine-grained changes,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016, pp. 511–522.
- [45] F. Thung, S. Wang, D. Lo, and J. Lawall, “Automatic recommendation of api methods from feature requests,” in IEEE/ACM International Conference on Automated Software Engineering, 2013.
- [46] W. Yuan, H. H. Nguyen, L. Jiang, Y. Chen, J. Zhao, and H. Yu, “Api recommendation for event-driven android application development,” Information and Software Technology, vol. 107, pp. 30–47, 2019.
- [47] L. Ponzanelli, S. Scalabrino, G. Bavota, A. Mocci, R. Oliveto, M. D. Penta, and M. Lanza, “Supporting software developers with a holistic recommender system,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 94–105.
- [48] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, pp. 436–444, 2015.
- [49] M. White, C. Vendome, M. Linares-Vásquez, and D. Poshyvanyk, “Toward deep learning software repositories,” in Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE Press, 2015, pp. 334–345.
- [50] X. Gu, H. Zhang, D. Zhang, and S. Kim, “Deep api learning,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2016, pp. 631–642.
- [51] X. Gu, H. Zhang, and S. Kim, “Deep code search,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 933–944.
- [52] V. Raychev, M. Vechev, and E. Yahav, “Code completion with statistical language models,” in Acm Sigplan Notices, vol. 49, no. 6. ACM, 2014, pp. 419–428.
- [53] F. Thung, R. J. Oentaryo, D. Lo, and Y. Tian, “Webapirec: Recommending web apis to software projects via personalized ranking,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 1, no. 3, pp. 145–156, 2017.
- [54] H. Niu, I. Keivanloo, and Y. Zou, “Learning to rank code examples for code search engines,” Empirical Software Engineering, vol. 22, no. 1, pp. 259–291, 2017.