User Review-Based Change File Localization for Mobile Applications

03/03/2019 ∙ by Yu Zhou, et al. ∙ Nanjing University of Aeronautics and Astronautics ∙ Universität Zürich ∙ Birkbeck, University of London

In current mobile app development, novel and emerging DevOps practices (e.g., Continuous Delivery, Integration, and user feedback analysis) and tools are becoming more widespread. For instance, the integration of user feedback (provided in the form of user reviews) in the software release cycle represents a valuable asset for the maintenance and evolution of mobile apps. To fully make use of these assets, it is highly desirable for developers to establish semantic links between the user reviews and the software artefacts to be changed (e.g., source code and documentation), and thus to localize the potential files to change for addressing the user feedback. In this paper, we propose RISING (user-Reviews Integration via claSsification, clusterIng, and linkiNG), an automated approach to support the continuous integration of user feedback via classification, clustering, and linking of user reviews. RISING leverages domain-specific constraint information and semi-supervised learning to group user reviews into multiple fine-grained clusters concerning similar user requests. Then, by combining the textual information from both commit messages and source code, it automatically localizes potential change files to accommodate the users' requests. Our empirical studies demonstrate that the proposed approach (significantly) outperforms the state-of-the-art baseline work in terms of clustering and localization accuracy, and thus produces more reliable results.


1 Introduction

The extensive proliferation of smart devices represents one of the most visible technological and societal developments of recent years. Indeed, mobile phones, tablets and smart watches are widely used in many aspects of today’s life [1, 2]. This trend is particularly reflected in the growth of the app industry, with millions of mobile applications being developed and maintained [3, 4].

This trend also impacts current mobile app development, which is characterized by novel and emerging DevOps practices (e.g., Continuous Integration, Deployment, Delivery, and user feedback analysis) and tools. For instance, the integration of user feedback (provided in the form of user reviews) in the software release cycle represents a valuable asset for the maintenance and evolution of these apps [5, 6], or for ensuring reliable testing automation for them [7]. Thus, a key aspect of successful apps is the capability of developers to deliver high-quality releases while, at the same time, addressing user requests; this is crucial for an app to stay on the market and to keep gaining users [2, 8].

Mobile user reviews, mainly distributed through major online app stores (e.g., Google Play and Apple AppStore), provide valuable feedback for further improvements of mobile apps. They might report software bugs, complain about usage inconveniences, request new features, etc. [1, 6, 9]. Such information is valuable for developers, since it represents crowd-sourced knowledge from the customers’ perspective, providing useful information for the evolution and release planning of mobile apps [5, 6, 9]. As a concrete example, among the many reviews of the popular instant messaging app Signal (https://play.google.com/store/apps/details?id=org.thoughtcrime.securesms), one group concentrates on theme issues. Particularly, one review states that “Wish it had a dark or black theme.” In the following release of the app, new themes, including the aforementioned dark and black ones, were integrated. As another example, for the app AcDisplay (https://play.google.com/store/apps/details?id=com.achep.acdisplay), one review states “It would be great if you could just use the proximity sensors to wake the screen much like the Moto app uses IR sensors when you wave over the phone.” Later on, this feature was added in the next version of the app.

Due to the high number of app user reviews developers receive on a daily basis (popular apps could receive more than 500 reviews per day on average [10]), collecting and analyzing them manually becomes increasingly infeasible. As a result, developers are interested in adopting automated approaches which are able to classify/cluster such reviews and to localize potential change files. This is key to enhancing development productivity and, in turn, to facilitating the continuous delivery of app products.

Recent work has proposed tools for user feedback classification [6, 11], clustering [12, 13] and summarization [14, 15]. Unfortunately, most of these tools suffer from important limitations. First, the classification or clustering accuracy is hindered by the generally low quality of user reviews [9, 6, 13]. Compared to other kinds of software artefacts, such as software documents, bug reports, and logging messages, which are provided by developers, reviews are generated by (non-technical) users, who tend to produce reviews of lower quality (e.g., the textual descriptions are usually short and unstructured, mixed with typos, acronyms and even emojis [16]). Second, existing classification and clustering are usually conducted at a coarse-grained sentence level containing potentially multiple topics, without taking domain-specific knowledge into account. This further reduces the accuracy of classification/clustering and impedes the effectiveness of further localization. Third, and most importantly, available tools are not able to cope with the lexicon gap between user reviews and the software artefacts (e.g., the source code) of the apps, which makes standard textual-similarity-based localization approaches less effective [13, 7, 11]. Consequently, existing tools settle for a low file localization accuracy [13, 11].

To overcome the aforementioned limitations, in this paper, we propose RISING (user-Reviews Integration via claSsification, clusterIng, and linkiNG), an automated approach to support the continuous integration of user feedback via classification, clustering, and linking of user reviews. Specifically, RISING leverages domain-specific constraint information and semi-supervised learning to group reviews into multiple fine-grained clusters concerning similar user requests. Then, by combining the textual information from both commit messages and source code, it automatically localizes the files to change to accommodate the users’ requests. Our empirical studies demonstrate that the proposed approach (significantly) outperforms state-of-the-art baselines [13] in terms of accuracy, providing more reliable results.

The main contributions of the paper are summarized as follows:

  • We propose a semi-supervised clustering method by leveraging domain-specific knowledge to capture constraints between the reviews. The experiments demonstrate its efficacy in improving the accuracy of clustering, in particular, its superiority to other clustering methods reported in the literature. To the best of our knowledge, this is the first time that semi-supervised clustering methods are exploited to group mobile app user reviews.

  • We propose a change file localization approach by exploiting commit messages as a medium to fill the lexicon gap between user reviews and software artefacts, which, as the experiments demonstrate, enhances the localization accuracy significantly.

  • We collect user reviews and commit messages from 10 apps available from Google Play and GitHub, and prepare a dataset with the processed reviews and commit logs (https://csyuzhou.github.io/files/dataset.zip). This will not only facilitate the replication of our work, but also serve other related software engineering research, for example, mobile app store mining and intelligent software development.

Structure of the paper. Section 2 gives some background information for a better understanding of the context of our work, while Section 3 details the proposed approach to address the limitations of state-of-the-art approaches on user reviews analysis. Section 4 presents the main research questions driving our investigation and describes the case studies we conduct to answer the research questions. In Section 5 we provide some discussions and the threats that might have biased our results and how we mitigate them. Related work is discussed in Section 6, while Section 7 concludes the paper and describes our future research agenda.

2 Background

This section provides a brief overview of (i) the contemporary development pipeline of mobile applications and (ii) the importance of user feedback analysis in the mobile context. Section 6 complements this section by providing related work on user feedback analysis and applying Information Retrieval (IR) in software engineering, with a specific emphasis on the mobile application domain.

Development Release Cycle of Mobile Apps. As shown in Figure 1 [17], the conventional mobile software release cycle has evolved in recent years into a more complex process, integrating DevOps software engineering practices [18, 19]. The DevOps movement aims at unifying the conflicting objectives of software development (Dev) and software operations (Ops), with tools for shortening release cycle activities. Continuous Delivery (CD) is one of the most emerging DevOps software development practices, in which developers’ source/test code changes are sent to server machines to automate all software integration (e.g., building and integration testing) tasks required for the delivery [20]. When this automated process fails (known as a build failure), developers are forced to go back to coding to discover and fix the root cause of the failure [21, 22, 23]; otherwise, the changes are released to production in short cycles. Users are then notified of these software changes as new updates of their mobile apps. In this context, users usually provide feedback on the new version (or the A/B testing versions) of the apps installed on their devices, often in the form of comments in app reviews [9, 6, 24, 12, 15].

Fig. 1: Release Cycle

User Feedback Analysis in the Mobile Context. Mobile user feedback, stored in different forms (e.g., user reviews, video recordings, A/B testing strategies, etc.), can be used by developers to decide possible future directions of development or maintenance activities [6, 25]. Therefore, user feedback represents a valuable resource for evolving software applications [24]. As a consequence, mobile development would strongly benefit from integrating User Feedback in the Loop (UFL) of the release cycle [7, 26, 27, 28, 29] (as highlighted by the blue elements/lines shown in Fig. 1 [17]), especially in the testing and maintenance activities. This has pushed the software engineering research community to study more effective automated solutions to “enable the collection and integration of user feedback information in the development process” [26]. The key idea of the techniques for user feedback analysis is to model [9, 6, 24, 11], classify [9, 6, 11], summarize [29, 24, 15] or cluster [12] user feedback in order to integrate it into the release cycle. The research challenge is to effectively extract the useful feedback to actively support developers in accomplishing the release cycle tasks.

Mobile Testing and Source Code Localization based on User Feedback Analysis. User feedback analysis can potentially provide developers with information about the changes to perform to achieve better user satisfaction and mobile app success. However, user review analysis alone is not sufficient to concretely help developers to continuously integrate user feedback information in the release cycle, and in particular in (i) maintenance [11, 13, 6, 24] and (ii) testing [27, 7, 29, 28] activities. Recent research directions push the boundaries of user feedback analysis in the direction of change-request file localization [11, 13] and user-oriented testing (where user feedback is systematically integrated into the testing process) [7, 27, 28]. We will elaborate on the literature in more detail in Section 6.

In this paper we focus on supporting developers with more advanced and reliable approaches to derive and cluster change-requests from user feedback, thus localizing the files to change [11, 13] to better support mobile maintenance tasks [11, 13, 6, 24].

3 Approach

As identified in the introduction, there are three major limitations in the existing approaches: low classification and clustering accuracy caused by the low quality of user reviews, difficulties in coping with the different vocabularies users adopt to describe their experience with the apps, and the lexicon gap between user reviews and software artefacts. RISING employs various techniques to mitigate these issues, which we elaborate in this section. As outlined in Fig. 2, RISING consists of two major parts, i.e., clustering and localization, the details of which are given in Section 3.1 and Section 3.2, respectively.

Fig. 2: Approach Overview

3.1 User review clustering

Most user reviews are short textual snippets consisting of multiple sentences. These raw sentences may address different aspects of apps and need to be preprocessed before clustering. Based on their contents, the reviews can mainly be classified into four categories, i.e., information giving, information seeking, feature request and problem discovery [6, 13]. In particular, “information giving” denotes those sentences that inform or update users or developers about an aspect related to the app; “information seeking” denotes those which attempt to obtain information or help; “feature request” denotes those expressing ideas, suggestions or needs for improving the app’s functionalities and performance; “problem discovery” denotes the sentences describing issues with the apps or their unexpected behaviors [6]. Since our aim is to identify those reviews which are directly relevant to apps’ evolution, following [13], we only focus on the last two categories, i.e., feature request and problem discovery. To this end, we first employ ARDOC, a user review classifier developed in previous work [6], which transforms user reviews into individual sentences and then classifies these sentences into one of the aforementioned four categories. We then collect the sentences of the last two categories.

To improve the accuracy of clustering, two tactics are employed, i.e., finer granularity review segmentation and textual processing, which will be elaborated in the following two subsections.

3.1.1 Fine-grained review segmentation

Clustering user reviews is usually conducted at the sentence level. We observe that, even inside an individual sentence, there may still be multiple topics involved, possibly addressing quite different concerns. As an example, one user review of AcDisplay reads “I wish there was a pattern lock feature and a camera shortcut for the lockscreen.” Apparently, the user asks for one more feature (pattern lock) and a shortcut utility. Moreover, for composite sentences in user reviews, if they contain adversative conjunctions such as ‘but’, the content after ‘but’ usually discloses the real information. As an example from K-9 Mail (https://play.google.com/store/apps/details?id=com.fsck.k9), one user states that “This app is good, but it is lacking a key feature for anyone who uses mailing lists: Reply-To-List.” In this case, for the purpose of localization, the content before ‘but’ is not informative at all, and may introduce noise into the follow-up process. As a result, we propose a more fine-grained text analysis. In particular, we split composite sentences into atomic ones, each of which expresses a single concern only, and remove the irrelevant part of the sentence.

To achieve that, we employ a statistical parser from the Stanford NLP toolkit (https://nlp.stanford.edu/software/lex-parser.shtml) to generate grammatical structures of sentences, i.e., phrase structure trees. We then traverse the leaf nodes of the phrase structure tree to determine whether or not the sentence contains conjunctions. Particularly, we focus on two types of conjunctions, i.e., copulative conjunctions and adversative conjunctions. The former (e.g., ‘and’, ‘as well as’ and ‘moreover’) mainly express addition while the latter (e.g., ‘but’, ‘yet’) denote contrast.

For the first type, we recursively parse the nodes to identify the layer where the copulative conjunction is located. We then obtain the copulative conjunction’s sibling nodes. The two parts connected by the conjunction may be two sentences, two noun phrases, two verb phrases, etc. Given the different conditions, we can generate two atomic sentences based on the parts which are connected by the conjunction. As a concrete example, if the conjunction ‘and’ connects two noun objects, then the two objects are split into the only object of each atomic sentence, but they share the same subject and verb (e.g., “I wish there was a pattern lock feature and a camera shortcut for the lockscreen.” is split into “I wish there was a pattern lock feature for the lockscreen.” and “I wish there was a camera shortcut for the lockscreen.”). If the conjunction ‘and’ connects two sentences, then the two sentences are simply split into two atomic sentences (e.g., “There are only 2 things I’d change for a 5 star review; I wish it had audio controls, and I wish there was a camera shortcut from the lock screen.” is split into “There are only 2 things I’d change for a 5 star review; I wish it had audio controls.” and “There are only 2 things I’d change for a 5 star review; I wish there was a camera shortcut from the lock screen.”).

For the second type, since we believe that the content after the adversative conjunction conveys the real information, we only preserve the leaf nodes after the conjunction node and simply discard the other parts.
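
To make the splitting step concrete, the following minimal Python sketch operates on a phrase structure tree, written here by hand in bracketed form (in RISING the tree would come from the Stanford parser). It is a simplification of the procedure above: it handles single-token conjunctions only, and shares all surrounding words (hence subject and verb) among the generated atomic sentences; the names split_review_sentence and leaf_span are ours, introduced purely for illustration.

    from nltk.tree import Tree

    def leaf_span(tree, pos):
        """Leaf index range [start, end) covered by the subtree at position pos."""
        for start, leaf_pos in enumerate(tree.treepositions("leaves")):
            if leaf_pos[:len(pos)] == pos:
                return start, start + len(tree[pos].leaves())
        raise ValueError("subtree has no leaves")

    def split_review_sentence(tree):
        """Split a parsed review sentence into atomic sentences: keep only the part
        after an adversative conjunction; for a copulative coordination, emit one
        sentence per conjunct, re-using the surrounding words."""
        leaves = tree.leaves()
        lowered = [w.lower() for w in leaves]
        for conj in ("but", "yet"):                        # adversative: keep the tail
            if conj in lowered:
                return [" ".join(leaves[lowered.index(conj) + 1:])]
        for pos in tree.treepositions():
            sub = tree[pos]
            if isinstance(sub, Tree) and any(
                    isinstance(c, Tree) and c.label() == "CC" for c in sub):
                p_start, p_end = leaf_span(tree, pos)      # span of the coordination
                atomic = []
                for i, child in enumerate(sub):
                    if isinstance(child, Tree) and child.label() not in ("CC", ","):
                        c_start, c_end = leaf_span(tree, pos + (i,))
                        atomic.append(" ".join(
                            leaves[:p_start] + leaves[c_start:c_end] + leaves[p_end:]))
                if len(atomic) > 1:
                    return atomic
        return [" ".join(leaves)]

    # Hand-written parse of "I wish there was a pattern lock feature and a camera shortcut".
    tree = Tree.fromstring(
        "(S (NP (PRP I)) (VP (VBP wish) (SBAR (S (NP (EX there)) "
        "(VP (VBD was) (NP (NP (DT a) (NN pattern) (NN lock) (NN feature)) "
        "(CC and) (NP (DT a) (NN camera) (NN shortcut))))))))")
    print(split_review_sentence(tree))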

3.1.2 Textual processing

User reviews are generally informal and unstructured, mixed with typos, acronyms and even emojis [16]. The noisy data inevitably degrades the performance of clustering and localization, which necessitates further textual processing. We first filter out emoji characters and other punctuation content. Some emojis which were published as icons are stored in a text format, and their encoding appears as a combination of question marks. Some others use a combination of common punctuation marks, such as smiley faces. These patterns are matched by using regular expressions. Particularly, we use two regular expressions: the first removes all punctuation and replaces it with a space; the second removes the remaining non-alphanumeric parts. Furthermore, we also convert all letters to lowercase uniformly.

Given the above steps, sentences are transformed into lists of words (i.e., tokens). We then use the Stanford NLP toolkit (https://stanfordnlp.github.io/CoreNLP/) to transform the inflected words into their lemmatized form. Here a dictionary-based, instead of a rule-based, approach is used to convert words into tokens, which avoids over-processing of words (for instance, “images” is transformed correctly to “image” instead of to “imag”). User reviews may contain stopwords that could introduce noise for clustering and need to be removed. We note that the existing English stopword list cannot be well applied here for two reasons: first, a large number of user reviews contain irregular acronyms (e.g., asap–as soon as possible, cuz–cause) which cannot be processed by the existing stopword list; second, some words are in the regular stopword list, but for specific apps they may convey important information. For example, some words, such as “home”, listed in strings.xml, which encodes the string literals used by the GUI components, are of this kind. Therefore, we manually edit the English stopword list accordingly (e.g., by adding some commonly used acronyms and removing some words that appear in strings.xml); the customized stopword list is available online with the replication package. We also delete repeated words and the sentences which contain fewer than two words, because in short documents like user reviews, documents with fewer than two words hardly convey any useful information for evolution purposes.
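
A minimal sketch of this preprocessing pipeline is shown below. The regular expressions, the acronym table and the stopword set are illustrative stand-ins only (the actual patterns and the customized stopword list are part of the replication package), and the dictionary-based lemmatization performed with the Stanford toolkit is omitted for brevity.

    import re

    # Illustrative substitutes for the two regular expressions mentioned above:
    # the first replaces punctuation (including textual emoji remnants) with a
    # space, the second strips any remaining non-alphanumeric characters.
    PUNCT_RE = re.compile(r"[^\w\s]")
    NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")

    # Hypothetical, app-specific stopword list and acronym table.
    STOPWORDS = {"the", "a", "an", "is", "it", "this", "i", "to", "of", "for"}
    ACRONYMS = {"asap": "as soon as possible", "cuz": "cause"}

    def preprocess(sentence):
        text = sentence.lower()
        for short, full in ACRONYMS.items():
            text = re.sub(rf"\b{short}\b", full, text)
        text = PUNCT_RE.sub(" ", text)
        text = NON_ALNUM_RE.sub("", text)
        # dictionary-based lemmatization (e.g., via Stanford CoreNLP) would go here
        tokens = [t for t in text.split() if t not in STOPWORDS]
        tokens = list(dict.fromkeys(tokens))           # drop repeated words
        return tokens if len(tokens) >= 2 else []      # drop near-empty sentences

    print(preprocess("Wish it had a dark or black theme!! :)"))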

3.1.3 User Review Clustering

Although ARDOC could classify reviews into “problem discovery” and “feature request”, such coarse-grained classification provides limited guidance for developers when confronted with specific maintenance tasks. A more fine-grained approach is highly desirable. Firstly, there is usually a huge number of user reviews and thus it is practically infeasible to address every concern. Therefore, developers would like to identify the most common issues or requests raised by the end users, which are supposed to be treated with higher priority [12]. Secondly, not all user reviews are meaningful, especially in the problem discovery category. In practice, it is not uncommon that some complaints are actually caused by users’ misunderstanding. By grouping similar issues together, such cases would be easier to be identified. Both of these motivate using clustering of pre-processed user reviews.

Construction of word-review matrix.

We adopt the widely-used Vector Space Model (VSM) [30] to represent the pre-processed texts. We fix a vocabulary $V=\{w_1,\ldots,w_m\}$, each word of which represents a feature in our approach. Let $m=|V|$ be the size of the vocabulary, and $n$ be the number of atomic sentences. We first construct a raw matrix $R \in \mathbb{N}^{m \times n}$ where each entry $R_{i,j}$ is equal to the number of occurrences of the word $w_i$ in the review $r_j$.

For each word $w_i$, let $df(w_i)$ denote the occurrence of $w_i$ in all reviews, i.e., the number of reviews containing $w_i$, and we use the logarithmically scaled document frequency ($ldf$) as the weight assigned to the corresponding word:

$$ldf(w_i) = \log\big(1 + df(w_i)\big)$$

Finally we can construct the scaled word-review matrix $\hat{R}$, where each entry $\hat{R}_{i,j} = R_{i,j} \cdot ldf(w_i)$.

We remark that there was some related work using traditional tf-idf as the weighting strategy [12, 31]. However, we use the document frequency (df) [30] mainly due to the fact that the clustering unit in our approach is at the sentence level. Particularly, these sentences are short and an individual word usually occurs only once, so the term frequency (tf) would become meaningless for clustering in most cases. Besides, the purpose of idf is to reduce the weight of stop words, which have already been removed by the data preprocessing steps.

Due to the large number of user reviews and the shortness of individual atomic sentences, the word vectors are of very high dimension but very sparse. To reduce the dimension, we use the principal component analysis (PCA) technique [32, 33], which is one of the most widely used techniques for dimension reduction. Essentially, PCA replaces the original features with a (usually much smaller) number of new features. The new features are linear combinations of the original ones that maximize the sample variance and try to make the new features uncorrelated. The conversion between the two feature spaces captures the inherent variability of the data.
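
The vectorization step can be sketched as follows, assuming the token lists produced in Section 3.1.2; it builds one row per atomic sentence (i.e., the transpose of the word-review matrix above), applies the ldf weighting, and reduces the dimension with scikit-learn's PCA. The function name review_vectors and the default number of components are our own choices.

    import numpy as np
    from sklearn.decomposition import PCA

    def review_vectors(token_lists, n_components=50):
        """Build ldf-scaled review vectors (one row per atomic sentence) and
        reduce their dimension with PCA."""
        vocab = sorted({w for tokens in token_lists for w in tokens})
        index = {w: i for i, w in enumerate(vocab)}
        raw = np.zeros((len(token_lists), len(vocab)))
        for j, tokens in enumerate(token_lists):
            for w in tokens:
                raw[j, index[w]] += 1                  # occurrences of w in sentence j
        df = (raw > 0).sum(axis=0)                     # number of sentences containing w
        scaled = raw * np.log(1.0 + df)                # ldf weighting
        pca = PCA(n_components=min(n_components, *scaled.shape))
        return pca.fit_transform(scaled)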

COP-Kmeans. After obtaining the vector models, we are in a position to cluster similar texts based on their contents. Existing approaches mainly employ automatic clustering algorithms to divide the reviews into multiple groups. However, we postulate that clustering would benefit from leveraging domain knowledge about the mobile app dataset. By investing limited human effort, the performance of clustering could be further boosted. For example, in AcDisplay, some reviews state “I do wish you could set a custom background, though.” and “Would be nice to be able to customize the wallpaper too.” For traditional clustering algorithms, since the two keywords (i.e., background and wallpaper) are quite different in regular contexts, these two sentences would have a very low similarity score and thus be clustered into two different categories. However, professional developers would easily recognize that “wallpaper” and “background” refer to similar things in UI design, which suggests that the two reviews address the same issue and should be put into the same cluster.

On the other hand, some reviews might address quite irrelevant issues using the same words. For example, again in AcDisplay, two reviews are as below: “I would love the option of having different home screen.”, and “First I’d like to suggest to disable that home button action because it turns the lock screen off …, I hope you do it in next update.”. These two reviews have completely different meanings, but since they both contain key words “home” and “screen”, they are very likely to be clustered together by traditional clustering algorithms.

Domain knowledge of developers could potentially improve the precision of clustering, but it has not been exploited by traditional clustering algorithms. To remedy this shortcoming, we annotate a subset of instances with two types of link information, i.e., must-link and cannot-link constraints, as a priori knowledge, and then apply the constrained K-means clustering technique [34]. The must-link constraints specify the instance pairs that discuss semantically similar or identical concerns, judged by professional developers with rich development expertise. Likewise, the cannot-link constraints specify the instance pairs that are not supposed to be clustered together. Besides, the must-link constraints define a transitive binary relation over the instances [34]. When making use of the constraints (of both kinds), we take a transitive closure over the constraints. (Note that although only the must-link constraints are transitive, the closure is performed over both kinds because, e.g., if $a$ must link to $b$ and $b$ cannot link to $c$, then we also know that $a$ cannot link to $c$.)
To use the K-means family of algorithms, one needs to determine the value of the hyper-parameter $k$. There are some traditional, general-purpose approaches [35, 36, 37], but they do not take the topic distribution into consideration and thus cannot provide a satisfactory solution in our setting. We instead use a heuristic-based method to infer $k$. The heuristic is derived from the N-gram model of the review texts, since we believe the cluster number should strongly correlate with the topic distribution. Concretely, we obtain the 2-gram phrases of all user reviews. Then we merge identical phrases and record the number of occurrences of each phrase. If two phrases share the same word information, the less frequent phrase is deleted. We also delete the phrases which occur only once. $k$ is then set to be the number of the remaining phrases. (2-gram is used as we empirically found that this yields the best performance.)
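
A sketch of this heuristic is given below; the function name infer_k and the greedy frequency-based tie-breaking are our reading of the procedure just described.

    from collections import Counter

    def infer_k(token_lists):
        """Estimate the number of clusters k from the 2-gram phrases of all
        reviews: drop phrases occurring once, and when two phrases share a
        word keep only the more frequent one."""
        counts = Counter()
        for tokens in token_lists:
            counts.update(zip(tokens, tokens[1:]))     # 2-gram phrases
        frequent = {p: c for p, c in counts.items() if c > 1}
        kept = []
        for phrase, _ in sorted(frequent.items(), key=lambda item: -item[1]):
            if all(not (set(phrase) & set(other)) for other in kept):
                kept.append(phrase)
        return max(len(kept), 1)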

The COP-Kmeans algorithm takes the must-link constraint set $M$, the cannot-link constraint set $C$, the value of $k$ and the atomic sentence vectors $D$ as input, and produces the clustering results. The pseudo-code is given in Algorithm 1. First, it randomly selects $k$ samples from $D$ as the initial cluster centers. Then, each sample in $D$ is assigned to the closest cluster such that the assignment violates no constraint in $M$ and $C$; if no such cluster exists, an error message is returned (lines 4-21). Then, for each cluster, its centroid is updated by averaging all of the points assigned to it (lines 22-24). This process iterates until the mean vectors no longer change.

Input : 
The data set D = {x_1, x_2, ..., x_n};
The must-link constraints M;
The cannot-link constraints C;
The number of clusters k;
Output : 
The clustering results {C_1, C_2, ..., C_k};
1 Randomly select k samples from D as the initial cluster centers {μ_1, ..., μ_k};
2 repeat
3       C_j ← ∅ for j = 1, 2, ..., k;
4       for i = 1, 2, ..., n do
5             Calculate the distance d_{ij} between the sample x_i and each mean vector μ_j;
6             K ← {1, 2, ..., k};
7             is_merged ← false;
8             while ¬ is_merged do
9                   Find the cluster C_r closest to the sample x_i based on d_{ij}, with r ∈ K;
10                  Detect whether assigning x_i to C_r violates any constraint in M and C;
11                  if ¬ is_violated then
12                        C_r ← C_r ∪ {x_i};
13                        is_merged ← true;
14                  else
15                        K ← K \ {r};
16                        if K = ∅ then
17                              Return error message;
18                        end if
19                  end if
20            end while
21      end for
22      for j = 1, 2, ..., k do
23            μ_j ← (1 / |C_j|) Σ_{x ∈ C_j} x;
24      end for
25 until mean vectors are no longer updated;
Algorithm 1 Constrained K-means Algorithm
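
For illustration, a compact Python rendering of the constrained assignment loop is given below. It is a sketch following the pseudo-code, not the exact implementation used in RISING; the transitive closure over the constraints is assumed to have been computed beforehand, and the constraint sets contain index pairs over the rows of X.

    import numpy as np

    def violates(i, cluster, labels, must_link, cannot_link):
        """True if putting sample i into `cluster` breaks a constraint, given
        the assignments made so far (label -1 means not yet assigned)."""
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] != -1 and labels[other] != cluster:
                    return True
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] == cluster:
                    return True
        return False

    def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
        """Return one cluster label per row of X."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        labels = np.full(len(X), -1)
        for _ in range(max_iter):
            new_labels = np.full(len(X), -1)
            for i, x in enumerate(X):
                # try clusters from nearest to farthest
                for c in np.argsort(np.linalg.norm(centers - x, axis=1)):
                    if not violates(i, c, new_labels, must_link, cannot_link):
                        new_labels[i] = c
                        break
                else:
                    raise ValueError("no feasible cluster for sample %d" % i)
            # update each centroid as the mean of its assigned points
            centers = np.array([X[new_labels == c].mean(axis=0)
                                if (new_labels == c).any() else centers[c]
                                for c in range(k)])
            if (new_labels == labels).all():
                break
            labels = new_labels
        return labels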

3.2 Change file localization

For localizing potential change files, our approach combines the information from both the commit messages and the source code. To get the commit messages of mobile apps, we exploit open-source projects to collect (i) the title, (ii) the description, (iii) the set of files involved, and (iv) the timestamp of each commit. For source code, we mainly use the file path, class summary, method summary, method names and field declarations. Class and method summaries can be extracted based on the Javadoc tags. Method names and field declarations are parsed through abstract syntax tree (AST) analysis. In both cases, we remove non-textual information, split identifiers based on camel case, convert letters to lower case, perform stemming, and remove stopwords and repeated words. Finally, the bag-of-words (BoW) models from the target app’s source code and commit messages are generated respectively.

3.2.1 Tag Source Code Files

As mentioned earlier, we propose to leverage historical commit information to bridge the semantic gap between user reviews and source code. To this end, we first tag the source code with the historical change information. Particularly, for each commit, we extract the title, description, timestamps, and the involved file paths. From the file paths, we traverse the corresponding source code files in the project, and all the collected information, i.e., the title, description, and timestamps, is attached to the source file. As a result, each source code file can be regarded as a pair ⟨code, tag⟩, where both code and tag are bags of words.

Fig. 3 shows a commit example from AcDisplay. We extract title, description, timestamps (in blue rectangle) and relevant file paths (in red rectangle) information. All the files will be tagged with such information.

Fig. 3: Commit Message Illustration
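
A minimal sketch of the tagging step, built on plain git log, is shown below. The '@@' record separator and the function name tag_source_files are our own choices, and for brevity only the commit title and timestamp are kept (the description would be collected analogously).

    import subprocess
    from collections import defaultdict

    def tag_source_files(repo_path):
        """Map each file path to the list of commit tags (timestamp, title)
        of the commits that touched it."""
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-only",
             "--pretty=format:@@%at%x09%s"],
            capture_output=True, text=True, check=True).stdout
        tags = defaultdict(list)                       # file path -> commit tags
        for record in log.split("@@")[1:]:
            lines = [l for l in record.splitlines() if l.strip()]
            timestamp, _, title = lines[0].partition("\t")
            for path in lines[1:]:                     # files changed by this commit
                tags[path].append({"timestamp": int(timestamp), "title": title})
        return tags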

3.2.2 Localization

Similarity Computation. As mentioned earlier, due to the semantic gap between natural language and programming language, the direct similarity matching cannot precisely localize potential change files. We introduce the commit information to bridge the gap. Therefore, the similarity is attributed to the following two parts:

  • the similarity between the user review and the code components extracted from one class of the target app;

  • the similarity between the user review and the commit tags of one class whose time stamps were earlier than the user review.

Palomba et al. [38] used the asymmetric Dice coefficient [30] to compute a textual similarity between a user review and a commit, as well as a textual similarity between user reviews and source code components. Since user reviews are usually much shorter than source code files and commits, asymmetric Dice coefficient based similarity measures are usually employed (as opposed to other alternatives such as the cosine similarity or the Jaccard coefficient [39]). However, the original asymmetric Dice coefficient treats all words equally and ignores those words which occur more frequently. Hence, we introduce a weighted asymmetric Dice coefficient as follows:

$$sim(r_j, c_k) = \frac{\sum_{w \in W_{r_j} \cap W_{c_k}} df(w)}{\min\left(\sum_{w \in W_{r_j}} df(w),\ \sum_{w \in W_{c_k}} df(w)\right)} \qquad (1)$$

where $W_{r_j}$ is the set of words within the review $r_j$, $W_{c_k}$ is the set of words within the code components of class $c_k$, $df(w)$ represents the document frequency (df) of the word $w$, and the function $\min$ returns the argument whose value is smaller. In (1), we use df's value as the weight of the words. The intuition is that the more frequently a word occurs, the more important the word is.

The similarity between the user review and the commit tags is computed analogously, by replacing $W_{c_k}$ with $W_{t_k}$ as shown in (2), where $W_{t_k}$ is the set of words within the commit tags of class $c_k$:

$$sim(r_j, t_k) = \frac{\sum_{w \in W_{r_j} \cap W_{t_k}} df(w)}{\min\left(\sum_{w \in W_{r_j}} df(w),\ \sum_{w \in W_{t_k}} df(w)\right)} \qquad (2)$$
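
Both similarities can be computed with one small helper following Equations (1) and (2); df is passed in as a word-to-document-frequency dictionary, and the name weighted_dice is ours.

    def weighted_dice(review_words, target_words, df):
        """Weighted asymmetric Dice coefficient: review_words and target_words
        are sets of words, df maps a word to its document frequency."""
        shared = review_words & target_words
        numerator = sum(df.get(w, 0) for w in shared)
        denominator = min(sum(df.get(w, 0) for w in review_words),
                          sum(df.get(w, 0) for w in target_words))
        return numerator / denominator if denominator else 0.0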

Dynamic Interpolation Weights. The similarity score between user reviews and source code files is calculated as a linear combination of the similarity score between the reviews and the source code contained in the files and the one between the reviews and the commit messages associated with the files (cf. Section 3.2.1). However, in the initial stage of the project life cycle, there is not enough commit information, reminiscent of the cold-start problem. During the course of the project, commit messages accumulate. In light of this, we dynamically assign the weights to the two parts, inspired by dynamic interpolation weights [40, 41]:

$$sim(r_j, f_k) = \frac{N - x}{N} \cdot sim(r_j, c_k) + \frac{x}{N} \cdot sim(r_j, t_k)$$

where $x$ is the number of common words which appear in both the user review $r_j$ and the commit tags $t_k$, and $N$ is the number of words in the user review $r_j$. We use $N$ instead of the concentration parameter because we can determine the maximum value of $x$. Based on the above equation, if $f_k$ does not have enough commit tags (when $x$ is small), then the code components of $f_k$ will be preferred, which copes with the cold-start problem in which there are few or even no commits at the beginning of the project life cycle. As the number of commit tags grows ($x$ becomes large), the commits will be preferred. This strategy gradually increases the weight of commit messages during similarity calculation over time.
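
Combining the two parts then amounts to a few lines; the sketch below reuses the weighted_dice helper from above and follows the interpolation formula as stated, with x the number of words the review shares with the commit tags and N the number of words in the review.

    def combined_similarity(review_words, code_words, tag_words, df):
        """Similarity between a review and a source file: code-based and
        commit-tag-based similarities mixed with dynamic interpolation weights."""
        x = len(review_words & tag_words)      # words shared with the commit tags
        n = max(len(review_words), 1)          # words in the review
        return ((n - x) / n) * weighted_dice(review_words, code_words, df) \
               + (x / n) * weighted_dice(review_words, tag_words, df)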

4 Case Study

We collect the user reviews and commit messages of ten popular apps available from Google Play. The basic statistics of the selected projects are listed in Table I.

App Name Category Version Comm. Msg. No. Review No.
AcDisplay Personalization 3.8.4 1096 8074
SMS Backup+ Tools 1.5.11 1687 1040
AnySoftKeyboard Tools 1.9 4227 3043
Phonograph Music&Audio 1.2.0 1470 6986
Terminal Emulator Tools 1.0.70 1030 4351
SeriesGuide Entertainment 44 9217 5287
ConnectBot Communication 1.9.5 1714 4806
Signal Communication 4.29.5 3846 6460
AntennaPod Video Players 1.7.0 4562 3708
K-9 Mail Communication 5.600 8005 8040
Total - - 36854 58775
TABLE I: Overview of selected apps

The selection criteria for Android apps are (i) open-source Android apps published on the Google Play market with version system and commit messages publicly accessible, and (ii) diversity in terms of app category (e.g., Personalization, Tools, Communication), size, and number of commit messages.

We developed two web scrapers to crawl the raw data: one extracts user reviews from the Google Play Store, and the other extracts commit messages from GitHub. As for the source code, we download the latest version of the apps from GitHub. The version information is also shown in the table.

To evaluate how well our approach could help developers localize potential change files, we investigate the following two research questions.

RQ1: Does the constraint-based clustering algorithm perform better than the state-of-the-art baseline?

RQ2: Do commit messages improve the accuracy of localization?

RQ1. We implemented the approach and ran it on our dataset to address the above research questions. We chose ChangeAdvisor [13] as the baseline for two reasons. Firstly, ChangeAdvisor is the closest and most relevant approach to ours, and the two largely address the same question; secondly, ChangeAdvisor is the state-of-the-art reference on clustering of user reviews in the literature, and its superiority has been demonstrated compared to other similar approaches, such as BLUiR [42]. We observe that the work in [13] did not distinguish the two general categories of user reviews, i.e., feature request and problem discovery, in the clustering process. Thus it is very likely that reviews of the two categories are clustered together. For example, in AcDisplay, the two reviews “The error was with my phone and other apps were crashing as well” and “It lacks UI customization like resizing the clock size or adding circle boarder …” are grouped into the same cluster. Apparently, they express quite different concerns. To give a fair comparison, we differentiate the two categories, reuse the original prototype of ChangeAdvisor, and adopt the same parameter settings as published in [13].

App Name Review No. FR (Total M-link N-link) PD (Total M-link N-link)
AcDisplay 8074 1400 50 50 1437 50 50
SMS Backup+ 1040 677 22 8 1425 32 21
AnySoftKeyboard 3043 280 25 3 290 16 6
Phonograph 6986 1094 63 28 960 42 53
Terminal Emulator 4351 248 13 9 372 10 28
SeriesGuide 5287 588 28 21 460 16 21
ConnectBot 4806 467 29 16 604 43 17
Signal 6460 629 36 23 792 32 45
AntennaPod 3708 336 25 21 359 16 22
K-9 Mail 8040 1018 65 36 1854 66 28
Total 58775 6737 356 215 8553 323 291
TABLE II: Apps’ review information (FR: feature request; PD: problem discovery)

Table II shows the information of the input to the clustering stage. As mentioned before, we distinguish the two categories of feature request (FR) and problem discovery (PD). The table also gives the human-annotated cluster constraints information of each app. In total, in the feature request category, the annotated must-link (M-link) takes up around 10.57% (356 pairs out of 6737), and cannot-link (N-link) takes up around 6.38% (215 pairs out of 6737); while in the problem discovery category, the percentage of must-link instances is around 7.55% (323 pairs out of 8553), and of cannot-link is around 6.80% (291 pairs out of 8553). The marked instances are randomly selected from each app. In line with the metrics used in [13], we compare the cohesiveness and separation of clustering results between two approaches.

RISING incorporates domain knowledge to annotate the must-link and cannot-link information on a subset of user reviews, and leverages a heuristic-based method to infer the hyper-parameter $k$. ChangeAdvisor directly applies the Hierarchical Dirichlet Process (HDP) algorithm to group the sentences of the reviews [43]. So we first compare the cluster numbers that the two approaches yield, and then the quality of each cluster.

App Name ChangeAdvisor RISING
FR No. PD No. FR No. PD No.
AcDisplay 10 12 83 129
SMS Backup+ 12 10 71 107
AnySoftKeyboard 11 11 46 41
Phonograph 8 7 105 106
Terminal Emulator 11 8 38 45
SeriesGuide 11 11 59 44
connectBot 8 11 47 75
Signal 12 10 85 97
AntennaPod 10 8 43 57
K-9 Mail 6 10 113 154
Total 99 98 690 855
TABLE III: Clustering information

Table III presents the comparison of clustering between our approach and ChangeAdvisor. From the table, we can observe that the number of clusters yielded by ChangeAdvisor and by our approach varies a lot. The median numbers of clusters for the feature request and problem discovery categories in the studied apps are 10.5 and 10 for ChangeAdvisor, respectively, while in our approach the median values of the two categories are 65 and 86, respectively. Moreover, we found that the clustering result of ChangeAdvisor is quite unbalanced. For example, in AcDisplay, the number of clusters of ChangeAdvisor (problem discovery category) is 12, but 1142 out of 1437 sentences are grouped into one cluster, which takes up more than 79%; in our approach, the largest cluster contains 339 instance sentences, which takes up around 23.6%. This highlights that the clusters generated by our approach are of more practical use to developers compared to those generated by ChangeAdvisor.

To compare the quality of the clusters obtained by the two approaches, we use the Davies-Bouldin index (DBI), a widely adopted method to evaluate clustering algorithms [44], as a metric to assess the cohesiveness of intra-clusters and the separation of inter-clusters. This is an internal evaluation scheme, where the validation of how well the clustering has been done is made using quantities and features inherent to the dataset. DBI consists of two parts: one is the measure of scatter within a cluster, and the other is the separation between clusters.

For a cluster $C_i$, the measure of scatter $S_i$ of $C_i$ is defined as follows:

$$S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \left\| x - A_i \right\| \qquad (3)$$

where $|C_i|$ is the size of the cluster and $A_i$ is the centroid of the cluster $C_i$.

The measure of separation $M_{i,j}$ between clusters $C_i$ and $C_j$ is defined as follows:

$$M_{i,j} = \left\| A_i - A_j \right\| = \sqrt{\sum_{k} \left(a_{k,i} - a_{k,j}\right)^2} \qquad (4)$$

where $a_{k,i}$ is the $k$-th element of the centroid $A_i$ of cluster $C_i$. DBI can then be defined as

$$DBI = \frac{1}{N} \sum_{i=1}^{N} \max_{j \neq i} \frac{S_i + S_j}{M_{i,j}} \qquad (5)$$

where $N$ is the number of clusters.

The DBI value is a standard measurement of the quality of clustering, i.e., the cohesiveness of intra-cluster and the separation of inter-clusters. Typically, a better clustering algorithm admits a lower value of DBI.
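
For reference, a direct transcription of Equations (3)-(5) is given below; scikit-learn's davies_bouldin_score implements the same index and can serve as a cross-check.

    import numpy as np

    def dbi(X, labels):
        """Davies-Bouldin index: X holds the review vectors (one per row) and
        labels the cluster assignment of each row."""
        clusters = np.unique(labels)
        centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
        scatter = np.array([
            np.linalg.norm(X[labels == c] - centroids[i], axis=1).mean()
            for i, c in enumerate(clusters)])
        worst = []
        for i in range(len(clusters)):
            ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centroids[i] - centroids[j])
                      for j in range(len(clusters)) if j != i]
            worst.append(max(ratios))          # most confusable other cluster
        return float(np.mean(worst))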

ChangeAdvisor uses the HDP algorithm for clustering, which accepts text objects as input. To enable the adoption of DBI, we need to convert the dataset of ChangeAdvisor into a vector format. In order to ensure a fair comparison, we use the same method to convert the transformed sentences into the vector representation as in our approach, as detailed in Section 3.1.

App Name ChangeAdvisor RISING
FR PD FR PD
AcDisplay 0.493 0.361 0.035 0.020
SMS Backup+ 0.321 0.444 0.047 0.042
AnySoftKeyboard 0.357 0.342 0.050 0.050
Phonograph 0.514 0.693 0.031 0.029
Terminal Emulator 0.300 0.557 0.105 0.060
SeriesGuide 0.440 0.303 0.075 0.057
ConnectBot 0.606 0.479 0.080 0.027
Signal 0.317 0.391 0.055 0.027
AntennaPod 0.447 0.548 0.048 0.046
K-9 Mail 0.928 0.538 0.040 0.022
Average 0.472 0.466 0.057 0.038
TABLE IV: DBI results comparison

The results are summarized in Table IV. From the table, we can observe that our approach yields a significantly better DBI result compared with ChangeAdvisor. The average DBI values of feature request and problem discovery by ChangeAdvisor are 0.472 and 0.466 respectively; while by our approach, the average values are 0.057 and 0.038 respectively.

To further evaluate the quality of the clustered reviews, we hired three app developers, each of whom has over three years of development experience. We asked them to look into the clustered sentences and to assess the coherence of the contents of each individual cluster as well as the semantic separation between different clusters. The assessment is given in Likert scale grades: “exactly related topics” (5), “mostly related topics” (4), “basically related topics” (3), “just a few related topics” (2), and “not relevant topics” (1). Different from the evaluation method in [13], we evaluate all the clusters and calculate the average value as the final result.

The results are shown in Table V. From the table, we observe that RISING yields a better value of Likert scale compared with ChangeAdvisor. The average values of feature request and problem discovery categories by ChangeAdvisor are 2.07 and 1.94 respectively; while by RISING, the average values are 4.20 and 4.26 respectively.

App Name ChangeAdvisor RISING
FR PD FR PD
AcDisplay 2.22 2.12 4.30 4.29
SMS Backup+ 1.93 2.03 4.23 4.26
AnySoftKeyboard 2.50 2.47 4.23 4.09
Phonograph 2.35 1.55 4.40 4.35
Terminal Emulator 2.18 2.15 3.83 4.17
SeriesGuide 2.17 1.74 4.22 4.29
ConnectBot 1.43 2.05 4.20 4.35
Signal 1.96 1.70 4.26 4.31
AntennaPod 2.08 1.67 4.17 4.25
K-9 Mail 1.87 1.92 4.11 4.25
Average 2.07 1.94 4.20 4.26
TABLE V: Likert results comparison

The above objective and subjective measures answer RQ1: our constraint-based clustering method, aided by more intensive (automated) data preprocessing and marginal human annotation effort, significantly boosts the clustering performance.

RQ2. To address RQ2, we need to judge whether commit messages improve the accuracy of localization. In the experiments, we use the same ten Android apps as in the preceding step (cf. Table I). As the first step, we need to obtain the ground truth for the localization results, which requires human examination. To reduce personal bias, we hired two additional mobile app developers, both of whom also have over three years of development experience. The five evaluators were asked to check the localization results individually and discuss them together until a consensus was reached, which then serves as the ground truth.

As the next step, we applied ChangeAdvisor and RISING to the reviews and compared the results returned by them against the ground-truth results from the previous step. For each category in each app, we randomly selected 10-15 user reviews and then applied ChangeAdvisor and RISING separately to these sample reviews. Overall, we selected 230 (121 + 109) user reviews from these 10 apps. RISING could return potential change files in all the cases. However, ChangeAdvisor could only produce outputs for 98 (62 + 36) of the user reviews, fewer than 50% of those handled by RISING. The details of the localizability comparison of the two approaches in the studied apps are given in Table VI.

To evaluate the localization results, we employed Top-k accuracy and NDCG as metrics, which are commonly used in recommendation systems [45, 46, 47]. Top-k accuracy can be calculated as

$$\text{Top-k accuracy} = \frac{\sum_{r \in R} \text{isRelevant}(r, k)}{|R|}$$

where $R$ represents the set of all user feedback and the function $\text{isRelevant}(r, k)$ returns 1 if at least one of the top-$k$ recommended source code files is actually relevant to the user feedback $r$, and returns 0 otherwise.

Table VII reports the Top-k accuracy achieved by ChangeAdvisor and by RISING for each of the considered apps, where the value of $k$ is set to 1, 3 and 5. From the table, we can observe that, in most cases, RISING significantly outperforms ChangeAdvisor in terms of Top-k hits. On average, for the feature request category, the Top-1, Top-3, and Top-5 values are improved from 52.98% to 74.38%, from 77.56% to 90.08%, and from 82.99% to 98.35%, respectively; for the problem discovery category, the Top-1, Top-3, and Top-5 values are improved from 45.86% to 73.39%, from 60.48% to 92.66%, and from 72.74% to 97.25%, respectively.

NDCG is defined as follows:

$$DCG@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad NDCG@k = \frac{DCG@k}{IDCG@k}$$

where $rel_i = 1$ if the $i$-th source code file is related to the user feedback, and $rel_i = 0$ otherwise. IDCG is the ideal result of DCG, which means all related source code files are ranked higher than the unrelated ones. For example, if an algorithm recommends five source code files in which the 1st, 3rd and 5th source code files are related, the result is represented as $\langle 1, 0, 1, 0, 1 \rangle$, whereas the ideal result is $\langle 1, 1, 1, 0, 0 \rangle$.
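
Both metrics are straightforward to compute; the sketch below follows the definitions above, where recommendations and relevant are hypothetical mappings from a review to its ranked candidate files and to its ground-truth files, respectively.

    import math

    def top_k_accuracy(recommendations, relevant, k):
        """Fraction of reviews for which at least one of the top-k recommended
        files is in the ground-truth set (the isRelevant indicator above)."""
        hits = sum(1 for review, ranked in recommendations.items()
                   if set(ranked[:k]) & relevant[review])
        return hits / len(recommendations)

    def ndcg_at_k(rels, k):
        """NDCG@k for a single review; rels is the 0/1 relevance of the ranked files."""
        dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
        idcg = sum(rel / math.log2(i + 2)
                   for i, rel in enumerate(sorted(rels, reverse=True)[:k]))
        return dcg / idcg if idcg else 0.0

    # The example from the text: the files ranked 1st, 3rd and 5th are relevant.
    print(ndcg_at_k([1, 0, 1, 0, 1], k=5))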

Table VIII reports the NDCG values achieved by ChangeAdvisor and RISING for each of the considered apps where, similarly, the value is set to be 1, 3 and 5. Based on the table, we observe that, in most cases of the studied apps, the NDCG value of RISING is greater than that of ChangeAdvisor, which indicates a better performance. On average, the NDCG@1, NDCG@3 and NDCG@5 values of ChangeAdvisor in the problem discovery category are 45.86%, 43.36%, and 56.50% respectively. In contrast, the corresponding values of RISING in this category are 73.35%, 71.41%, and 83.99% respectively. In feature request category, the NDCG@1, NDCG@3 and NDCG@5 values of ChangeAdvisor are 52.98%, 58.39%, 67.21% respectively; while the values of RISING in this category are 74.28%, 70.41%, and 84.01% respectively.

The experiment results answer RQ2 that, in terms of the localization accuracy, our approach which exploits commit messages to fill the lexicon gap could improve the performance significantly.

App Name ChangeAdvisor RISING
FR No. PD No. FR No. PD No.
AcDisplay 3 2 14 10
SMS Backup+ 4 2 11 12
AnySoftKeyboard 7 5 11 10
Phonograph 8 4 15 11
Terminal Emulator 3 3 12 13
SeriesGuide 11 7 12 11
ConnectBot 8 3 11 11
Signal 5 4 13 10
AntennaPod 7 2 12 10
K-9 Mail 6 4 10 11
Sum 62 36 121 109
TABLE VI: Overview of Localization
App Name ChangeAdvisor RISING
FR PD FR PD
Top-1 Top-3 Top-5 Top-1 Top-3 Top-5 Top-1 Top-3 Top-5 Top-1 Top-3 Top-5
AcDisplay 1.0000 1.0000 1.0000 0.5000 0.5000 0.5000 1.0000 1.0000 1.0000 0.7000 1.0000 1.0000
SMS Backup+ 0.5000 0.7500 0.7500 0.5000 0.5000 0.5000 0.7273 1.0000 1.0000 0.8333 1.0000 1.0000
AnySoftKeyboard 0.7143 0.8571 0.8571 0.8000 1.0000 1.0000 1.0000 1.0000 1.0000 0.9000 1.0000 1.0000
Phonograph 0.7500 0.8750 0.8750 0.5000 0.7500 1.0000 0.8000 1.0000 1.0000 0.9091 0.9091 1.0000
Terminal Emulator 0.3333 0.6667 0.6667 0.3333 0.6667 1.0000 0.3333 0.6667 0.6667 0.6923 0.9231 1.0000
SeriesGuide 0.2727 0.7273 0.8182 0.2857 0.7143 0.8571 0.5833 0.7500 0.8333 0.7273 1.0000 1.0000
ConnectBot 0.3750 0.8750 0.8750 0.6667 0.6667 0.6667 0.5455 0.7273 1.0000 0.8182 0.9091 0.9091
Signal 0.4000 0.6000 0.6000 0.2500 0.2500 0.7500 0.5385 0.8462 1.0000 0.5000 0.8000 1.0000
AntennaPod 0.2857 0.5714 0.8571 0.5000 0.5000 0.5000 0.5000 0.8333 1.0000 0.8000 0.9000 0.9000
K-9 Mail 0.6667 0.8333 1.0000 0.2500 0.5000 0.5000 0.9000 0.9000 1.0000 0.4545 0.7273 1.0000
Average 0.5298 0.7756 0.8299 0.4586 0.6048 0.7274 0.7438 0.9008 0.9835 0.7339 0.9266 0.9725
TABLE VII: Top-k Accuracy of Localization
App Name ChangeAdvisor RISING
FR PD FR PD
NDCG@1 NDCG@3 NDCG@5 NDCG@1 NDCG@3 NDCG@5 NDCG@1 NDCG@3 NDCG@5 NDCG@1 NDCG@3 NDCG@5
AcDisplay 1.0000 0.8683 0.9356 0.5000 0.3520 0.4427 1.0000 0.8653 0.9529 0.7000 0.7615 0.8626
SMS Backup+ 0.5000 0.5283 0.5943 0.5000 0.3066 0.4386 0.7273 0.6124 0.8138 0.8333 0.6549 0.8733
AnySoftKeyboard 0.7143 0.7672 0.8000 0.8000 0.6922 0.8692 1.0000 0.8822 0.9657 0.9000 0.9072 0.9482
Phonograph 0.7500 0.6940 0.7899 0.5000 0.5180 0.7303 0.8000 0.7768 0.8966 0.9091 0.7505 0.9039
Terminal Emulator 0.3333 0.4623 0.5503 0.3333 0.3841 0.6684 0.8333 0.7493 0.8590 0.6923 0.6772 0.8079
SeriesGuide 0.2727 0.4719 0.5667 0.2857 0.3805 0.5451 0.5833 0.5594 0.7040 0.7273 0.7218 0.8654
ConnectBot 0.3750 0.5536 0.6599 0.6667 0.6667 0.6667 0.5455 0.5759 0.7492 0.8182 0.8386 0.8719
Signal 0.4000 0.4839 0.4839 0.2500 0.2500 0.4434 0.5385 0.6490 0.7742 0.5000 0.5959 0.7463
AntennaPod 0.2857 0.3863 0.5831 0.5000 0.4599 0.4599 0.5000 0.5944 0.7807 0.8000 0.7066 0.8041
K-9 Mail 0.6667 0.6228 0.7571 0.2500 0.3266 0.3859 0.9000 0.7757 0.9049 0.4545 0.5269 0.7150
Average 0.5298 0.5839 0.6721 0.4586 0.4336 0.5650 0.7428 0.7041 0.8401 0.7335 0.7141 0.8399
TABLE VIII: NDCG@k of Localization

5 Discussion

Identifying meaningful user reviews from app markets is a non-trivial task, since a majority of them are not informative. Furthermore, linking and localizing potential change files based on the meaningful feedback is highly desirable for software developers. Compared with the state-of-the-art baseline work, RISING gives more fine-grained clustering results and more accurate localization performance. In our experiments, we also observe that, in each run of ChangeAdvisor, the clustering result is noticeably different from other runs, making the clustering less stable and less deterministic. In RISING, by contrast, the clustering result is much more stable. In the localization phase, RISING leverages the commit information to bridge the lexicon gap. Note that the commit history contains all the files relevant to a change transaction, including not only source files but also configuration-related files (such as XML files). Our approach is thus advantageous over similar approaches in being able to locate multiple files which are necessary for a problem fix or feature request. ChangeAdvisor, however, does not take into account the association between files and would miss, for instance, configuration files.

Threats to Validity

Internal validity. We conclude that, with domain knowledge, marginal human effort could significantly boost the clustering performance. Such effectiveness has already been demonstrated in various scenarios [48, 49]. In our experiment, we only annotate a small portion (6%-10%) of the whole review set, reducing the threat of over-fitting. The recovery of missing traceability links between various software artefacts has also been actively studied in the literature [50]. Commit messages contain rich information about the change history and the motivation of the change itself, and this information can thus help bridge the vocabulary gap between professional developers and ordinary users. Another threat arises from the selection bias of the dataset. In our experiments, we strive to reuse as many of the apps from the baseline work as possible. To reduce the noise from the raw data and bias in the results, we take standard measures to pre-process the raw texts and involve multiple developers to resolve subjective conflicts.

External validity. In our case study, we deliberately selected 10 apps across different categories instead of limiting ourselves to a narrow domain. To give a fair comparison, we use a combination of multiple evaluation metrics, including both objective and subjective ones. Similar to other empirical studies, no evidence can theoretically prove that our approach always accurately localizes change files in all scenarios. But we believe that, since our approach is open to different scenarios, domain knowledge could be leveraged via new constraints and heuristics incorporated into our approach, which could improve the clustering and localization performance on new datasets as well.

6 Related Work

The concept of app store mining was introduced by Harman et al. [5] in 2012, and several researchers focused on mining mobile apps and app store data to support developers during the maintenance and evolution of mobile applications, with the goal to achieve a higher app success [2].

6.1 The Role of User Feedback Analysis in the Mobile App Success

App Rating & App Success. Previous research widely investigated the relationship between the rating and particular characteristics (or features) of mobile applications [51, 52, 53, 54, 55]. Recent research efforts have been devoted to investigating the reliability of app ratings when used as a proxy for user satisfaction. For example, Luiz et al. [56] proposed a framework performing sentiment analysis on a set of relevant features extracted from user reviews. Although the star rating was considered to be a good metric for user satisfaction, their results suggest that sentiment analysis might be more accurate in capturing the sentiment transmitted by the users. Hu et al. [57] studied the consistency of reviews and star ratings for hybrid Android and iOS apps, discovering that they are not consistent across different app markets. Finally, Catolino [58] preliminarily investigated the extent to which source code quality can be used as a predictor of commercial success of mobile apps.

User Feedback Analysis & App Success. Several approaches have been proposed with the aim to classify useful user reviews for app success. AR-Miner [9] was the first one able to classify informative reviews. Panichella et al. adopted natural language processing and text and sentiment analysis to automatically classify user reviews [6, 59] according to a User Review Model (URM). Gu and Kim [60] proposed an approach that summarizes sentiments and opinions of reviews.

Following the general idea of incorporating user feedback into the typical development process, Di Sorbo et al. [14, 15] and Scalabrino et al. [12, 61] proposed SURF and CLAP, two approaches aiming at recommending the most important reviews to take into account while planning a new release of a mobile application. CLAP improves AR-Miner by clustering reviews into specific categories (e.g., reports of security issues) and by learning from the app history (or from similar apps) which reviews should be addressed [61]. SURF proposed a first strategy to automatically summarize user feedback into more structured and recurrent topics [15, 29] (e.g., GUI, app pricing, app content, bugs, etc.). Finally, Palomba et al. [13], inspired by the work of Scalabrino et al., proposed ChangeAdvisor, a tool that clusters user reviews of mobile applications. In this paper we considered ChangeAdvisor as the baseline since, similarly to our approach, it is based on clustering of user review feedback. In evaluating our approach, we discovered that ChangeAdvisor tends to generate rather different user review clusters with the same study setting and user review data, which highlights the higher reliability of our approach compared to this state-of-the-art tool.

6.2 Information Retrieval in SE & the Mobile Context

Information Retrieval techniques have been widely adopted to handle several SE problems. Specifically, strategies for recovering traceability links between textual artefacts and the source code were widely studied in the past [50, 62]. In the same way, several approaches have been proposed to locate features in the source code [63], and to trace informal textual documentation, such as e-mails [64, 65, 66], forum discussions [67, 68, 69], and bug reports [42], to the source code. However, as previously demonstrated by Panichella et al. [70], the configuration used to set the clustering algorithm is an important component of topic modeling techniques used in several traceability recovery approaches, and an optimal choice of the parameters generally results in better performance.

In the context of mobile computing research, two pieces of work are closest to the one proposed in this paper. Ciurumelea et al. [71, 31] employed machine learning techniques for the automatic categorization of user reviews according to a two-level taxonomy, adopting a modified version of the Vector Space Model (VSM) to automatically link user reviews to code artefacts. Similarly, Palomba et al. [13], with ChangeAdvisor, cluster user reviews of mobile applications and suggest the source-code artefacts to maintain. Also in this case we compare our approach against ChangeAdvisor since, similarly to our approach, it leverages clustering of user review feedback and IR-based methods for suggesting the source-code artefacts to maintain according to user change requests.
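As a concrete illustration of the VSM-based linking idea shared by these approaches, the sketch below ranks candidate source files against a user review by cosine similarity over TF-IDF vectors. The file names, the code corpus, and the preprocessing are hypothetical simplifications, not the configuration of any of the cited tools.

```python
# Minimal VSM-style linking sketch: rank source files against a user review
# by cosine similarity of TF-IDF vectors. Identifiers and documents are
# hypothetical; real approaches add preprocessing such as identifier
# splitting, stop-word removal, and stemming.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Textual corpus of code artefacts (e.g., identifiers, comments, and related
# commit messages), one document per candidate file.
code_documents = {
    "ThemeManager.java": "theme manager apply dark black color style night mode",
    "NotificationService.java": "notification push message alert service channel",
    "ProximitySensorHandler.java": "proximity sensor wake screen listener event",
}

review = "Wish it had a dark or black theme"

files = list(code_documents.keys())
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
matrix = vectorizer.fit_transform(list(code_documents.values()) + [review])

# The last row is the review; the remaining rows are the code documents.
similarities = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

# Rank candidate files by decreasing textual similarity to the review.
for name, score in sorted(zip(files, similarities), key=lambda p: -p[1]):
    print(f"{score:.3f}  {name}")
```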

7 Conclusions and Future Work

User reviews convey client-side requirements for mobile app products. Accurately recovering user concerns and automatically localizing the relevant source code based on this feedback is of great importance for facilitating rapid development. In this paper, we presented an approach to localize potential change files based on user reviews of mobile applications. We conducted experiments on 10 popular mobile apps and used a comprehensive set of metrics to assess the performance of our approach. Experimental results show that our approach significantly outperforms the state-of-the-art baseline work.

As immediate future work, we plan to develop comprehensive tool support for change file localization so as to improve the applicability of our approach. Moreover, our current case studies all concern open-source apps; our future plans include collaborating with commercial app developers and applying our approach to industrial cases.

Acknowledgements

This work was partially supported by the National Key R&D Program of China (No. 2018YFB1003902), and the Collaborative Innovation Center of Novel Software Technology in China. T. Chen is partially supported by UK EPSRC grant (EP/P00430X/1), ARC Discovery Project (DP160101652, DP180100691), and NSFC grant (No. 61662035). We also acknowledge the Swiss National Science Foundation’s support for the project SURF-MobileAppsData (SNF Project No. 200021-166275).

References

  • [1] M. Dehghani, “An assessment towards adoption and diffusion of smart wearable technologies by consumers: the cases of smart watch and fitness wristband products,” ser. CEUR Workshop Proceedings, vol. 1628.   CEUR-WS.org, 2016.
  • [2] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman, “A survey of app store analysis for software engineering,” IEEE Transactions on Software Engineering, vol. PP, no. 99, pp. 1–1, 2016.
  • [3] Statista. (2018, Mar.) Number of apps available in leading app stores as of October 2018. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/.
  • [4] Business of Apps. There are 12 million mobile developers worldwide, and nearly half develop for Android first. https://goo.gl/RNCSHC.
  • [5] M. Harman, Y. Jia, and Y. Zhang, “App store mining and analysis: Msr for app stores,” in 2012 9th IEEE Working Conference on Mining Software Repositories (MSR), June 2012, pp. 108–111.
  • [6] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall, “How can i improve my app? classifying user reviews for software maintenance and evolution,” in Proceedings of the 2015 IEEE International Conference on Software Maintenance and Evolution, ser. ICSME ’15, 2015, pp. 281–290.
  • [7] G. Grano, A. Ciurumelea, F. Palomba, S. Panichella, and H. Gall, “Exploring the integration of user feedback in automated testing of android applications,” in Software Analysis, Evolution and Reengineering, 2018 IEEE 25th International Conference on, 2018.
  • [8] A. Machiry, R. Tahiliani, and M. Naik, “Dynodroid: An input generation system for android apps,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013, 2013, pp. 224–234.
  • [9] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang, “Ar-miner: Mining informative reviews for developers from mobile app marketplace,” in Proceedings of the 36th International Conference on Software Engineering, ser. ICSE 2014, 2014, pp. 767–778.
  • [10] S. Mcilroy, W. Shang, N. Ali, and A. E. Hassan, “User reviews of top mobile apps in apple and google app stores,” Communications of the ACM, vol. 60, no. 11, pp. 62–67, 2017.
  • [11] A. Ciurumelea, A. Schaufelbuhl, S. Panichella, and H. C. Gall, “Analyzing reviews and code of mobile apps for better release planning,” in SANER.   IEEE Computer Society, 2017, pp. 91–102.
  • [12] L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta, “Release planning of mobile apps based on user reviews,” in Proceedings of the 38th International Conference on Software Engineering, ser. ICSE ’16, 2016, pp. 14–24.
  • [13] F. Palomba, P. Salza, A. Ciurumelea, S. Panichella, H. C. Gall, F. Ferrucci, and A. D. Lucia, “Recommending and localizing change requests for mobile apps based on user reviews,” in Proceedings of the 39th International Conference on Software Engineering, 2017, pp. 106–117.
  • [14] A. Di Sorbo, S. Panichella, C. V. Alexandru, J. Shimagaki, C. A. Visaggio, G. Canfora, and H. C. Gall, “What would users change in my app? summarizing app reviews for recommending software changes,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering.   ACM, 2016, pp. 499–510.
  • [15] A. D. Sorbo, S. Panichella, C. V. Alexandru, C. A. Visaggio, and G. Canfora, “SURF: summarizer of user reviews feedback,” in Proceedings of the 39th International Conference on Software Engineering, ICSE 2017, Buenos Aires, Argentina, May 20-28, 2017 - Companion Volume, 2017, pp. 55–58.
  • [16] P. M. Vu, T. T. Nguyen, H. V. Pham, and T. T. Nguyen, “Mining user opinions in mobile app reviews: A keyword-based approach (t),” in Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ser. ASE ’15.   Washington, DC, USA: IEEE Computer Society, 2015, pp. 749–759.
  • [17] K. Beck, M. Beedle, A. van Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Grenning, J. Highsmith, A. Hunt, R. Jeffries, J. Kern, B. Marick, R. C. Martin, S. Mellor, K. Schwaber, J. Sutherland, and D. Thomas, “Manifesto for agile software development,” 2001. [Online]. Available: http://www.agilemanifesto.org/
  • [18] P. Duvall, S. M. Matyas, and A. Glover, Continuous Integration: Improving Software Quality and Reducing Risk.   Addison-Wesley, 2007.
  • [19] T. Laukkarinen, K. Kuusinen, and T. Mikkonen, “Devops in regulated software development: Case medical devices,” in 39th IEEE/ACM International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track, ICSE-NIER 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 15–18.
  • [20] J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation, 1st ed.   Addison-Wesley Professional, 2010.
  • [21] M. R. Islam and M. F. Zibran, “Insights into continuous integration build failures,” in Proceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, Buenos Aires, Argentina, May 20-28, 2017, 2017, pp. 467–470.
  • [22] C. Ziftci and J. Reardon, “Who broke the build? Automatically identifying changes that induce test failures in continuous integration at google scale,” in 39th IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice Track, ICSE-SEIP 2017, 2017, pp. 113–122.
  • [23] C. Vassallo, G. Schermann, F. Zampetti, D. Romano, P. Leitner, A. Zaidman, M. D. Penta, and S. Panichella, “A tale of CI build failures: An open source and a financial organization perspective,” in 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, Shanghai, China, September 17-22, 2017, 2017, pp. 183–193.
  • [24] A. Di Sorbo, S. Panichella, C. V. Alexandru, J. Shimagaki, C. A. Visaggio, G. Canfora, and H. C. Gall, “What would users change in My app? Summarizing app reviews for recommending software changes,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016.   New York, NY, USA: ACM, 2016, pp. 499–510. [Online]. Available: http://doi.acm.org/10.1145/2950290.2950299
  • [25] E. Noei, D. A. Da Costa, and Y. Zou, “Winning the app production rally,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 283–294.
  • [26] M. Nagappan and E. Shihab, “Future trends in software engineering research for mobile apps,” in Leaders of Tomorrow Symposium: Future of Software Engineering, FOSE@SANER 2016, Osaka, Japan, March 14, 2016, 2016, pp. 21–32.
  • [27] G. Grano, A. Di Sorbo, F. Mercaldo, C. A. Visaggio, G. Canfora, and S. Panichella, “Android apps and user feedback: A dataset for software evolution and quality improvement,” in Proceedings of the 2Nd ACM SIGSOFT International Workshop on App Market Analytics, ser. WAMA 2017, 2017, pp. 8–11.
  • [28] L. Pelloni, G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, and H. C. Gall, “Becloma: Augmenting stack traces with user review information,” in 25th International Conference on Software Analysis, Evolution and Reengineering (SANER).   IEEE, 2018, pp. 522–526.
  • [29] S. Panichella, “Summarization techniques for code, change, testing, and user feedback (invited paper),” in 2018 IEEE Workshop on Validation, Analysis and Evolution of Software Tests, VST@SANER 2018, Campobasso, Italy, March 20, 2018, C. Artho and R. Ramler, Eds.   IEEE, 2018, pp. 1–5.
  • [30] R. Baeza-Yates, B. Ribeiro-Neto et al., Modern information retrieval.   ACM press New York, 1999, vol. 463.
  • [31] A. Ciurumelea, A. Schaufelbuhl, S. Panichella, and H. C. Gall, “Analyzing reviews and code of mobile apps for better release planning,” in IEEE 24th International Conference on Software Analysis, Evolution and Reengineering, SANER 2017, Klagenfurt, Austria, February 20-24, 2017, 2017, pp. 91–102.
  • [32] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, “Dimensionality reduction for k-means clustering and low rank approximation,” CoRR, vol. abs/1410.6801, 2014.
  • [33] C. H. Q. Ding and X. He, “K-means clustering via principal component analysis,” in Machine Learning, Proceedings of the Twenty-first International Conference (ICML), 2004.
  • [34] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl et al., “Constrained k-means clustering with background knowledge,” in ICML, vol. 1, 2001, pp. 577–584.
  • [35] G. Hamerly and C. Elkan, “Learning the k in k-means,” in Advances in Neural Information Processing Systems 16 [Neural Information Processing Systems, NIPS 2003, December 8-13, 2003, Vancouver and Whistler, British Columbia, Canada], 2003, pp. 281–288.
  • [36] R. Tibshirani, G. Walther, and T. Hastie, “Estimating the number of clusters in a dataset via the gap statistic,” vol. 63, pp. 411–423, 2000.
  • [37] D. Pelleg and A. W. Moore, “X-means: Extending k-means with efficient estimation of the number of clusters,” in Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), 2000, pp. 727–734.
  • [38] F. Palomba, M. Linares-Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “User reviews matter! tracking crowdsourced reviews to support evolution of successful apps,” in 2015 IEEE international conference on software maintenance and evolution (ICSME).   IEEE, 2015, pp. 291–300.
  • [39] P. Jaccard, “Étude comparative de la distribution florale dans une portion des alpes et des jura,” Bull Soc Vaudoise Sci Nat, vol. 37, pp. 547–579, 1901.
  • [40] Z. Tu, Z. Su, and P. T. Devanbu, “On the localness of software,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, (FSE-22), Hong Kong, China, November 16 - 22, 2014, 2014, pp. 269–280.
  • [41] K. Knight, “Bayesian Inference with Tears,” Tech. Rep., 2009.
  • [42] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry, “Improving bug localization using structured information retrieval,” in Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, Nov 2013, pp. 345–355.
  • [43] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, “Sharing clusters among related groups: Hierarchical dirichlet processes,” in Advances in neural information processing systems, 2005, pp. 1385–1392.
  • [44] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-1, no. 2, pp. 224–227, April 1979.
  • [45] C. Tantithamthavorn, R. Teekavanich, A. Ihara, and K. Matsumoto, “Mining A change history to quickly identify bug locations : A case study of the eclipse project,” in IEEE 24th International Symposium on Software Reliability Engineering, ISSRE 2013, Pasadena, CA, USA, November 4-7, 2013 - Supplemental Proceedings, 2013, pp. 108–113.
  • [46] C. Tantithamthavorn, A. Ihara, and K. Matsumoto, “Using co-change histories to improve bug localization performance,” in 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, SNPD 2013, Honolulu, Hawaii, USA, July 1-3, 2013, pp. 543–548.
  • [47] X. Li, H. Jiang, Y. Kamei, and X. Chen, “Bridging semantic gaps between natural languages and apis with word embedding,” CoRR, vol. abs/1810.09723, 2018. [Online]. Available: http://arxiv.org/abs/1810.09723
  • [48] M. Bilenko, S. Basu, and R. J. Mooney, “Integrating constraints and metric learning in semi-supervised clustering,” in Proceedings of the twenty-first international conference on Machine learning.   ACM, 2004, p. 11.
  • [49] S. Basu, I. Davidson, and K. Wagstaff, Constrained clustering: Advances in algorithms, theory, and applications.   CRC Press, 2008.
  • [50] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia, and E. Merlo, “Recovering traceability links between code and documentation,” IEEE Transactions on Software Engineering, vol. 28, no. 10, pp. 970–983, 2002.
  • [51] L. Corral and I. Fronza, “Better code for better apps: A study on source code quality and market success of android applications,” in Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems, ser. MOBILESoft ’15, 2015, pp. 22–32.
  • [52] G. Bavota, M. Linares-Vasquez, C. Bernal-Cardenas, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “The impact of api change- and fault-proneness on the user ratings of android apps,” Software Engineering, IEEE Transactions on, vol. 41, no. 4, pp. 384–407, 2015.
  • [53] M. Linares-Vásquez, G. Bavota, C. Bernal-Cárdenas, M. Di Penta, R. Oliveto, and D. Poshyvanyk, “Api change and fault proneness: A threat to the success of android apps,” in Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, ser. ESEC/FSE 2013, 2013, pp. 477–487.
  • [54] S. E. S. Taba, I. Keivanloo, Y. Zou, J. Ng, and T. Ng, An Exploratory Study on the Relation between User Interface Complexity and the Perceived Quality, 2014.
  • [55] Y. Tian, M. Nagappan, D. Lo, and A. E. Hassan, “What are the characteristics of high-rated apps? A case study on free android applications,” in 2015 IEEE International Conference on Software Maintenance and Evolution, ICSME 2015, Bremen, Germany, September 29 - October 1, 2015, 2015, pp. 301–310.
  • [56] W. Luiz, F. Viegas, R. Alencar, F. Mourão, T. Salles, D. Carvalho, M. A. Gonçalves, and L. Rocha, “A feature-oriented sentiment rating for mobile app reviews,” in Proceedings of the 2018 World Wide Web Conference, ser. WWW ’18, 2018, pp. 1909–1918.
  • [57] H. Hu, S. Wang, C.-P. Bezemer, and A. E. Hassan, “Studying the consistency of star ratings and reviews of popular free hybrid android and ios apps,” Empirical Software Engineering, 2018.
  • [58] G. Catolino, “Does source code quality reflect the ratings of apps?” in Proceedings of the 5th International Conference on Mobile Software Engineering and Systems.   ACM, 2018, pp. 43–44.
  • [59] S. Panichella, A. Di Sorbo, E. Guzman, C. A. Visaggio, G. Canfora, and H. C. Gall, “Ardoc: App reviews development oriented classifier,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016, 2016.
  • [60] X. Gu and S. Kim, “What parts of your apps are loved by users?” in 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015), 2015.
  • [61] S. Scalabrino, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta, “Listening to the crowd for the release planning of mobile apps,” IEEE Transactions on Software Engineering, 2017.
  • [62] A. De Lucia, A. Marcus, R. Oliveto, and D. Poshyvanyk, Software and Systems Traceability, 2012, ch. Information Retrieval Methods for Automated Traceability Recovery.
  • [63] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, “Feature location in source code: a taxonomy and survey,” Journal of Software: Evolution and Process, vol. 25, no. 1, pp. 53–95, 2013.
  • [64] A. Bacchelli, M. Lanza, and R. Robbes, “Linking e-mails and source code artifacts,” in Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE 2010, Cape Town, South Africa, 1-8 May 2010, 2010, pp. 375–384.
  • [65] A. Di Sorbo, S. Panichella, C. A. Visaggio, M. Di Penta, G. Canfora, and H. C. Gall, “Development emails content analyzer: Intention mining in developer discussions (T),” in 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015, 2015, pp. 12–23.
  • [66] A. Di Sorbo, S. Panichella, C. A. Visaggio, M. Di Penta, G. Canfora, and H. C. Gall, “DECA: development emails content analyzer,” in Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016 - Companion Volume, pp. 641–644.
  • [67] C. Parnin, C. Treude, L. Grammel, and M.-A. Storey, “Crowd documentation: Exploring the coverage and dynamics of API discussions on stack overflow,” Georgia Tech, Tech. Rep. GIT-CS-12-05, 2012.
  • [68] S. Panichella, J. Aponte, M. Di Penta, A. Marcus, and G. Canfora, “Mining source code descriptions from developer communications,” in IEEE 20th International Conference on Program Comprehension (ICPC’12), 2012, pp. 63–72.
  • [69] C. Vassallo, S. Panichella, M. Di Penta, and G. Canfora, “Codes: Mining source code descriptions from developers discussions,” in Proceedings of the 22Nd International Conference on Program Comprehension, ser. ICPC 2014, 2014, pp. 106–109.
  • [70] A. Panichella, B. Dit, R. Oliveto, M. Di Penta, D. Poshyvanyk, and A. De Lucia, “How to effectively use topic models for software engineering tasks? An approach based on genetic algorithms,” in Proceedings of the 2013 International Conference on Software Engineering, ser. ICSE ’13, 2013.
  • [71] A. Ciurumelea, S. Panichella, and H. C. Gall, “Automated user reviews analyser,” in Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings.   ACM, 2018, pp. 317–318.