Detecting Low Rating Android Apps Before They Have Reached the Market

12/12/2017 ∙ by Ding Li, et al. ∙ Microsoft

Driven by the popularity of the Android system, Android app markets have boomed in recent years. One critical problem for modern Android app markets is how to prevent apps that are going to receive low ratings from reaching end users. For this purpose, traditional approaches have to publish an app first and then collect enough user ratings and reviews to determine whether the app is favored by end users. By that point, however, the reputation of the app market has already been damaged. To address this problem, we propose a novel technique, Sextant, to detect low rating Android apps based on their .apk files. With our proposed technique, an Android app market can avoid risking its reputation by exposing low rating apps to users. Sextant is built on novel static analysis techniques as well as machine learning techniques. In our study, our proposed approach can achieve on average 90.50

1. Introduction

Android is currently the most popular mobile platform: 88% of mobile phones run Android (and, 2017c). This popularity has fueled a thriving Android app market. According to a recent report, the total revenue of Android apps reached 27 billion US dollars in 2016 (app, 2017). This growth is not limited to the official Google Play market; third party markets produced 10 billion US dollars in revenue in 2016 (app, 2017). Today, millions of apps can be downloaded from Android markets, and thousands of apps are created and uploaded to those markets every day (and, 2017d).

Despite the prosperity of Android markets, many apps receive low star ratings. These apps are not appreciated by end users: they either provide an inferior user experience or have low code quality. Having too many low rating apps can hurt the reputation of a market, eventually driving end users away and costing the market revenue. This problem can be even more severe for third party markets since they face more competition than the Google Play market.

For this reason, app markets are trying to keep out low rating apps. Google Play is working on punishing the developers of the apps with the lowest star ratings (low, 2017). This approach is useful, but it repairs damage rather than preventing it. Currently, the only way for an app market to know whether an app will have a low rating is to publish the app first, accumulate star ratings from a substantial number of end users, and finally average these star ratings. In such a process, the low rating app has already reached end users, which means it has already damaged the reputation of the market. Another approach to prevent low rating apps is manual inspection (ios, 2017). However, this approach is labor intensive and expensive.

Therefore, it is beneficial for app markets to be able to automatically detect low rating apps without any user feedback. This benefits an app market in two ways. First, the market does not need to publish low rating apps in order to detect them. When a potential low rating app is uploaded, it is automatically detected and kept out of the market, so the reputation of the market is not damaged. Second, the automated detection technique can assist manual inspection and accelerate the review process, which saves labor and expense.

Being able to detect low rating apps can also be very beneficial for developers and end users. For developers, such a capability provides quick feedback about their app without waiting a few weeks for actual user ratings. If they find that their app may be a low rating app, they can plan modifications early. For end users, this capability is useful when installing apps from unknown sources. Such apps rarely have valid user ratings, so it is very difficult for end users to judge their quality. Being able to detect low rating apps can help end users avoid wasting their time on apps that do not meet their expectations.

Although automatically detecting low rating apps is very valuable, it is very challenging to achieve. The main challenge comes from the fact that app ratings are highly subjective and abstract. App ratings are subjective because they are provided by end users and are therefore inevitably affected by the personal preference of each end user. Such personal preference is very difficult to model. App ratings are abstract because they describe the general feeling of end users about an app. There is no concrete rule or algorithm that generates the rating of an app. Thus, it is impossible to detect low rating apps by detecting a pre-defined code pattern or bug.

Traditional program analysis techniques are not capable of detecting low rating Android applications. Traditional static or dynamic program analysis techniques work well on detecting and fixing specific problems in programs (Li et al., 2016; Zhang et al., 2015; Zhang et al., 2016; Jia et al., 2015) or modeling concrete metrics, such as energy (Hao et al., 2013; Li et al., 2013; Liu et al., 2014) or runtime (Xu et al., 2010; Xiao et al., 2013; Nistor et al., 2013). However, since the rating of an Android app is subjective and abstract, detecting the specific code problems or modeling specific metrics is not sufficient to detect low rating apps.

To automatically detect low rating Android apps, we propose Sextant, an approach based on static analysis and machine learning. Our approach determines whether an Android app is low rating based only on the .apk package of the app. Our implementation of Sextant is available on GitHub (https://github.com/marapapman/Sextant). To the best of our knowledge, this is the first technique to perform this task.

Sextant contains three main components. First, it contains two novel representations for an Android app. These two representations can be retrieved from the .apk file of an app with effective and scalable static analysis techniques. The first representation is the semantic vector, which is used to capture the features of the executable file of the app. The second one is the layout vector, which is used to represent the layout information of the app. Second, Sextant contains two pre-training neural network models to learn unified features from both executable files and UI layout of an Android app. Third, Sextant contains a neural network model that accepts features from the pre-training model and determines whether an Android app is low rating or not.

We also perform an extensive empirical study with Sextant. Specifically, we measure the detection accuracy of Sextant on 33,456 realistic Android apps from the Google Play market. In our experiment, Sextant on average achieved 92.31% accuracy, 90.50% precision, and 94.31% recall. We also compare Sextant with four baseline models: two models that use the bag of words representation (Peiravian and Zhu, 2013; Sahs and Khan, 2012; Shamili et al., 2010; Aafer et al., 2013; Schmidt et al., 2009; Burguera et al., 2011; Enck et al., 2009; Zhou et al., 2012; Arp et al., 2014), the executable only model, and the UI only model. Sextant outperforms all four baseline models with statistically significant differences.

This paper makes the following contributions:

  • Our approach is the first approach to detect low rating Android apps before they can reach the end users.

  • We propose a novel representation to model the semantics of Java/Android apps.

  • Extensive evaluation of the proposed approach on 33,456 Google Play apps demonstrates the effectiveness of the proposed method.

The rest of this paper is organized as follows. In Section 2, we briefly discuss background information on Android apps and convolutional neural networks. In Section 3, we discuss the approach of Sextant. In Section 4, we discuss a preliminary study that evaluates how accurately our executable representation captures the differences and similarities between Android/Java applications on the semantic level. In Section 5, we discuss the evaluation results of Sextant. In Section 6, we discuss the threats to validity of our evaluation. In Section 7, we discuss related work. Finally, we conclude this paper in Section 8.

2. Background

In this section, we briefly discuss some background information about the structures of Android apps and the principle of convolutional neural networks.

2.1. Structures of Android Apps

An Android app is organized as a .apk package, which includes all the resources and code for the app. Among those resources, the executable is organized as the .dex file and the UI layout files are stored as XML files in the layout/ folder.

The executable of an Android app is in the form of Dalvik bytecode, which is compiled from Java. Therefore, many tools, such as soot (Vallée-Rai et al., 1999) and dex2jar (dex, 2017) can be used to convert the Dalvik bytecode to Java bytecode. The executable of an Android app contains several activities, which are basic components of the Android app. Each activity starts at the “onCreate” callback (and, 2017a).

The UI layout XML files define the visual structure of user interfaces. They can be loaded in the “onCreate” callback of an activity to create the GUI. The basic building blocks of the UI layout XML files are the View and ViewGroup tags. A View tag represents a basic UI element such as a text box, button, or graph. A ViewGroup tag is a special type of View that represents a group of other View tags. By combining View and ViewGroup tags, developers can declare the UI of an app as a layout tree (and, 2017b).

2.2. Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a popular deep neural network model that has been widely used for image classification, machine translation, etc. A CNN simulates the biological process in which each neuron is only activated by a restricted area of the input. Compared to ordinary neural network structures such as restricted Boltzmann machines or autoencoders, typical convolutional neural networks comprise one or more convolutional layers (often with a subsampling step) followed by one or more fully connected layers, as in a standard multilayer neural network. The convolutional layer is the key building block of a convolutional neural network. During the forward pass, each filter is slid across the width and height of the input volume to compute the dot product between the entries of the filter and the input at each position. In this way, convolutional neural networks reduce the number of weights that need to be learned and obtain meaningful representations of structured input.

Figure 1. An example of convolutional neural networks

Figure 1 shows an example of convolutional neural networks. In this figure, the input is a $4 \times 4$ matrix. The sliding window, which has a size of $3 \times 3$, is shown as the blue and red boxes in Figure 1. Each color represents a step of moving the sliding window. In a forward pass, the convolutional layer first computes the dot product between the $3 \times 3$ filter and the $3 \times 3$ sliding window on the input, then it moves the sliding window with stride 1 to the next $3 \times 3$ input entries and computes their dot product with the $3 \times 3$ filter again. This process is repeated until all elements in the input matrix are processed. After the convolution operation with a number of filters, multiple feature maps are produced, subsampled, and transformed into a hidden feature vector (hidden layer), which can eventually be used for classification or other tasks.
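The sliding-window computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `conv2d_valid` and the input and filter values are assumptions made for the example and are not taken from Figure 1.

```python
def conv2d_valid(matrix, kernel, stride=1):
    """Slide a square `kernel` over a square `matrix` and return the
    feature map of dot products (no padding)."""
    n, k = len(matrix), len(kernel)
    feature_map = []
    for r in range(0, n - k + 1, stride):
        row = []
        for c in range(0, n - k + 1, stride):
            # Dot product between the filter and the current window.
            row.append(sum(matrix[r + i][c + j] * kernel[i][j]
                           for i in range(k) for j in range(k)))
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 input and 3x3 filter (illustrative values only).
x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
w = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]

feature_map = conv2d_valid(x, w)  # a 2x2 feature map for stride 1
print(feature_map)
```

With a $4 \times 4$ input, a $3 \times 3$ filter, and stride 1, the window fits in two positions per axis, so the feature map is $2 \times 2$.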

3. Approach

Figure 2. The workflow of our approach

Our approach takes the .apk of an Android app as input and outputs whether the input app is a low rating app. The workflow of our approach is shown in Figure 2. It has three stages. The first stage is the analysis stage. In this stage, Sextant obtains the executable representation, i.e., the semantic vector, and the UI representation, i.e., the layout vector, of the input app. The executable representation is obtained by using static program analysis techniques. The UI representation is obtained by parsing the UI layout files of the app. The second stage is the pre-training stage. In this stage, Sextant feeds the executable representation and the UI representation to two Convolutional Neural Network (CNN) models respectively. This stage learns the normalized features of the executable and the UI. In the final stage, i.e., the learning stage, Sextant concatenates the executable feature and the UI feature, which are obtained in the pre-training stage, into a global feature vector and then feeds it to a multilayer perceptron followed by a softmax function to perform the low rating app detection. The detection process of the final stage is essentially a classification process. The softmax function determines whether the input app is a low rating app or not.

3.1. Executable Representation

In Sextant, the executable representation of an Android app is the semantic vector, which is defined as $V = \langle v_1, v_2, \ldots, v_N \rangle$, where $N$ is the total number of types of instructions in the Android framework. Here, instructions include all basic operations, such as arithmetic add and branching instructions, and all the APIs in the Android framework, such as file operations or network communications. All types of instructions are numbered from 1 to $N$. $v_i$ is the feature of the $i$-th type of instructions. It is a 3-tuple $\langle f_i, l_i, b_i \rangle$, where $f_i$ is the frequency of instructions with type $i$ in the input app, $l_i$ is the average loop depth of each instruction with type $i$, and $b_i$ is the average number of branches that contain instructions with type $i$. In the rest of this paper, the terms semantic vector and executable representation are used interchangeably.

1   public void method(){
2     List<Integer> list=new LinkedList<Integer>();
3     for(int i=0;i<10;i++){ // outer loop
4       list.add(i+1);
5       for(int j=0;j<2;j++){ // inner loop
6           if(list.length>7){
7             list.add(j+i);
8           }
9       }
10    }
11    System.out.println(list.length);
12  }
Program 1: Example code for the executable representation

We use Program 1 as an example to explain the semantic vector. In this example, if we neglect reference instructions and jump instructions, the program has six types of instructions. These instructions are listed in Table 1.

ID | Description
1  | The new instruction at line 2
2  | The list.add operation at lines 4 and 7
3  | The arithmetic add operation at lines 3, 4, 5, and 7
4  | The comparison instruction at lines 3, 5, and 6
5  | The assignment instruction at lines 2, 3, and 5
6  | The System.out.println at line 11
Table 1. The ID of each type of instructions in Program 1

The semantic vector of Program 1 is shown in Figure 3. Each column of Figure 3 is the 3-tuple of one instruction type. The ID of each column in Figure 3 is the same as in Table 1. The rows $f$, $l$, and $b$ represent the frequency, average loop depth, and average branch count respectively. Among all the instructions, we use the arithmetic add operation as an example, which is the column with ID 3 in Figure 3. The arithmetic add operation appears four times in the program, so its frequency, $f_3$, is 4. The i++ and i+1 at lines 3 and 4 are in the outer loop, so their loop depths are 1. The j++ and j+i at lines 5 and 7 are in the inner loop, so their loop depths are 2. Thus, the average loop depth of the arithmetic add operation, i.e., $l_3$, is $(1 + 1 + 2 + 2)/4 = 1.5$. Only the j+i at line 7 is in a branch. Thus, the average branch count of the arithmetic add operation, i.e., $b_3$, is $1/4 = 0.25$.

Figure 3. The semantic vector of Program 1
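The arithmetic for the column with ID 3 can be reproduced with a short Python sketch. The function `semantic_vector` and the hand-written occurrence list are illustrative assumptions, not the Sextant implementation; only the four arithmetic add occurrences that the text walks through are encoded.

```python
from collections import defaultdict


def semantic_vector(occurrences):
    """Aggregate (type_id, loop_depth, branch_count) occurrences into
    {type_id: (frequency, average loop depth, average branch count)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for type_id, loop_depth, branch_count in occurrences:
        entry = totals[type_id]
        entry[0] += 1              # frequency
        entry[1] += loop_depth     # total loop depth
        entry[2] += branch_count   # total branch count
    return {tid: (f, l / f, b / f) for tid, (f, l, b) in totals.items()}


ADD = 3  # ID of the arithmetic add operation in Table 1
program1_adds = [
    (ADD, 1, 0),  # i++ at line 3: outer loop, no branch
    (ADD, 1, 0),  # i+1 at line 4: outer loop, no branch
    (ADD, 2, 0),  # j++ at line 5: inner loop, no branch
    (ADD, 2, 1),  # j+i at line 7: inner loop, inside the if branch
]

vector = semantic_vector(program1_adds)
print(vector[ADD])  # (4, 1.5, 0.25)
```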

3.2. UI Layout Representation

Sextant uses the layout vector to represent the UI layout of the target app. Similar to the semantic vector, the layout vector is a vector of tuples. Formally, a layout vector is defined as $U = \langle u_1, u_2, \ldots, u_M, u_{M+1}, u_{M+2} \rangle$, where $M$ is the total number of types of UI elements in the Android framework. $u_i$ ($1 \le i \le M$) represents the $i$-th type of UI elements. $u_{M+1}$ represents the UI elements that are from the Android Legacy Library. $u_{M+2}$ represents the customized UI elements from developers. Each $u_i$ in the layout vector is a 2-tuple $\langle f_i, d_i \rangle$, where $f_i$ and $d_i$ are the frequency and the average depth in the layout tree of the UI elements with type $i$ respectively. In the rest of this paper, the terms layout vector and UI representation are used interchangeably.

<LinearLayout>
  <LinearLayout>
      <TextView />
      <TextView />
  </LinearLayout>
  <android.support.v7.internal.view.menu.ActionMenuItemView />
  <android.support.v7.internal.view.menu.ActionMenuItemView />
  <android.support.v7.internal.view.menu.ExpandedMenuView />
  <com.facebook.resources.ui.FbTextView .. />
</LinearLayout>
Program 2: The example UI layout

An example UI layout file is shown in Program 2, which is part of a layout file retrieved from the market app Facebook (fac, 2017). This example uses two UI elements from the Android framework, “LinearLayout” and “TextView”. It uses two UI elements from the legacy library, whose tag names start with “android.support”. It also contains a customized UI element, “com.facebook.resources.ui.FbTextView”. To encode this layout file as a layout vector, our approach first numbers the UI elements. It sets the IDs of “LinearLayout” and “TextView” to 1 and 2 respectively. In this case, the UI elements from the legacy library have the ID 3 and customized elements have the ID 4. One of the “LinearLayout” tags has a depth of zero in the XML tree and the other has a depth of one. Hence, the average depth of “LinearLayout” is 0.5. The average depths of the other UI elements can be calculated similarly, and the resulting layout vector of Program 2 is shown in Figure 4.

Figure 4. The layout vector of Program 2
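The encoding of Program 2 can be sketched with Python's standard `xml.etree.ElementTree` parser. This is a minimal sketch under stated assumptions: the ID map contains only the two framework elements used in the example, attributes elided in Program 2 are simply omitted, and the helper names `element_id` and `layout_vector` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Program 2 with the elided attributes dropped so that it parses.
LAYOUT = """
<LinearLayout>
  <LinearLayout>
    <TextView />
    <TextView />
  </LinearLayout>
  <android.support.v7.internal.view.menu.ActionMenuItemView />
  <android.support.v7.internal.view.menu.ActionMenuItemView />
  <android.support.v7.internal.view.menu.ExpandedMenuView />
  <com.facebook.resources.ui.FbTextView />
</LinearLayout>
"""

# Assumed ID map for this example only (IDs as chosen in the text).
FRAMEWORK_IDS = {"LinearLayout": 1, "TextView": 2}
LEGACY_ID, CUSTOM_ID = 3, 4


def element_id(tag):
    """Map a tag name to its element type ID."""
    if tag in FRAMEWORK_IDS:
        return FRAMEWORK_IDS[tag]
    return LEGACY_ID if tag.startswith("android.support") else CUSTOM_ID


def layout_vector(xml_text):
    """Return {id: (frequency, average depth in the layout tree)}."""
    totals = defaultdict(lambda: [0, 0])
    stack = [(ET.fromstring(xml_text.strip()), 0)]  # the root has depth 0
    while stack:
        node, depth = stack.pop()
        entry = totals[element_id(node.tag)]
        entry[0] += 1
        entry[1] += depth
        stack.extend((child, depth + 1) for child in node)
    return {i: (f, d / f) for i, (f, d) in sorted(totals.items())}


vector = layout_vector(LAYOUT)
print(vector)
```

Under these assumptions, “LinearLayout” comes out with frequency 2 and average depth 0.5, which agrees with the value derived in the text.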

3.3. The Analysis Stage

The input of the analysis stage of Sextant is the .apk file of an Android app. The output of it is the semantic vector as the executable representation and the layout vector as the UI layout representation. To do so, Sextant first unpacks the input Android app to retrieve the binary executable file and the UI layout XML files of the app. Then it analyzes the executable file to calculate the frequency, average loop depth, and average branch count of the semantic vector. It also parses the layout files to generate the layout vector. In this section, we will focus on how to generate the semantic vector from the executable file because the process of getting the layout vector from the layout files is similar but more straightforward.

We first introduce the intra-procedural analysis that produces the semantic vector. This process is shown in Algorithm 1, which takes the binary code of a method and generates the semantic vector of the method. Algorithm 1 first builds the nested loop tree and detects all branches of the method. Then it parses all the instructions in the method and updates the loop depth and the branch counts accordingly for each instruction. At lines 3 to 8, Algorithm 1 first calculates the frequency, the total loop depth, and the total branch count for each type of instructions. Then it averages the loop depth and the branch count by dividing the total loop depth and the total branch count by the frequency.

Algorithm 1 works directly on the binary code of a method. Thus, it does not have the Abstract Syntax Tree (AST) to identify the boundaries of loops and branches. To detect loops, Sextant uses the standard algorithm (Aho et al., 2006). For branches, instead of detecting the branch structures directly, Sextant detects branches on the Control Flow Graph (CFG) of the method.

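In Sextant, a branch is defined as a 3-tuple $\langle s, e, T \rangle$, where $s, e \in I$ and $T \subseteq I$ for the instruction set $I$ of the method. In this tuple, $s$ is the starting point of the branch. It is a branch instruction, such as “if”, and is not the source of a back edge. $e$ is the immediate post-dominator of $s$. $T$ is the set of all instructions between $s$ and $e$. In Algorithm 1, GetBranches first removes all the back edges of the CFG of $m$ and then finds all branches in the method. BranchCount counts how many branches contain the instruction $i$.

Input: A method m and the ID map M
Output: The semantic vector V
1: L ← BuildLoopTree(m)
2: B ← GetBranches(m)
3: for all instructions i in m do
4:     k ← M(i)
5:     V[k].f ← V[k].f + 1
6:     V[k].l ← V[k].l + LoopDepth(L, i)
7:     V[k].b ← V[k].b + BranchCount(B, i)
8: end for
9: for all v in V with v.f > 0 do
10:     v.l ← v.l / v.f
11:     v.b ← v.b / v.f
12: end for
Algorithm 1 Intra-procedural semantic vector building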
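The branch definition above (a branch instruction, its immediate post-dominator, and the instructions in between) can be sketched in Python on a small CFG. This is a hedged illustration, not Sextant's code: the CFG is hypothetical, back edges are assumed to be already removed, a single exit node is assumed, and all function names are invented for the example.

```python
def post_dominators(succ, exit_node):
    """Iteratively compute post-dominator sets for a CFG {node: [successors]}.
    Assumes a single exit and that every other node has a successor."""
    nodes = list(succ)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == exit_node:
                continue
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return pdom


def immediate_post_dominator(pdom, node):
    """The strict post-dominator that all other strict post-dominators
    of `node` also post-dominate."""
    strict = pdom[node] - {node}
    for candidate in strict:
        if strict <= pdom[candidate]:
            return candidate
    return None


def find_branches(succ, exit_node):
    """Return a list of (s, e, T) tuples for every node with two or more successors."""
    pdom = post_dominators(succ, exit_node)
    branches = []
    for s, targets in succ.items():
        if len(targets) < 2:
            continue
        e = immediate_post_dominator(pdom, s)
        body, stack = set(), list(targets)
        while stack:  # collect everything reachable from s before e
            n = stack.pop()
            if n == e or n in body:
                continue
            body.add(n)
            stack.extend(succ[n])
        branches.append((s, e, body))
    return branches


def branch_count(branches, node):
    """Number of branches whose body contains `node`."""
    return sum(node in body for _, _, body in branches)


# Hypothetical acyclic CFG: an if at A whose first arm contains a nested if at B.
cfg = {"A": ["B", "C"], "B": ["D", "E"], "C": ["G"],
       "D": ["F"], "E": ["F"], "F": ["G"], "G": []}
branches = find_branches(cfg, "G")
print(branch_count(branches, "D"))  # D lies inside both branches
```

In this example the branch starting at A ends at G and the nested branch starting at B ends at F, so an instruction in D is counted in two branches while one in C is counted in one.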

Inter-procedural Analysis: Sextant takes a summary based approach to perform the inter-procedural analysis. For a method $m$, its semantic vector is the summary. To perform the inter-procedural analysis, Sextant builds the summary for each method in the reverse topological order of the call graph. While processing each method, if line 3 of Algorithm 1 encounters a method invocation, lines 4 to 7 of Algorithm 1 are replaced by the process in Algorithm 2.

Input: An invoke instruction i, the semantic vector V
1: S ← GetSummary(i)
2: if S = null then
3:     k ← M(i)
4:     V[k].f ← V[k].f + 1
5:     V[k].l ← V[k].l + LoopDepth(L, i)
6:     V[k].b ← V[k].b + BranchCount(B, i)
7: else
8:     for all entries S[k] in S do
9:         V[k].f ← V[k].f + S[k].f
10:         V[k].l ← V[k].l + S[k].f × (S[k].l + LoopDepth(L, i))
11:         V[k].b ← V[k].b + S[k].f × (S[k].b + BranchCount(B, i))
12:     end for
13: end if
Algorithm 2 Handling Summaries

Our summary handling process, which is shown in Algorithm 2, takes the invocation instruction, $i$, and the semantic vector of the current method, $V$, as input. It first checks whether $i$ has a summary at line 2. If not, $i$ points to an API, and Algorithm 2 processes $i$ as a normal instruction, taking the same steps as Algorithm 1. If the invocation has a summary, say $S$, Algorithm 2 first adds the frequency of all instructions in $S$ to the summary of the current method, $V$, at line 9. Then, at line 10, it updates the total loop depths of all instructions of $S$. Since the average loop depth of each instruction in $S$ is increased by the loop depth of the instruction $i$ in the current method, the total loop depth in the summary of the current method should be increased by $S[k].f \times (S[k].l + \mathit{LoopDepth}(i))$. Similar updates are performed for the total branch count at line 11.

After Algorithm 2 has finished processing $i$, control returns to Algorithm 1 to process the next instruction. Finally, the average loop depth and branch count are calculated at lines 10 and 11 of Algorithm 1.
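The summary-merging step can be illustrated with a small Python sketch. It assumes that a callee summary stores averages, as a finished semantic vector does, and that the caller accumulates totals until the final averaging step; the names `merge_summary` and `finalize` and all numbers are hypothetical.

```python
def merge_summary(totals, callee_summary, call_loop_depth, call_branch_count):
    """Fold a callee's summary into the caller's running totals.

    totals:         {type_id: [frequency, total loop depth, total branch count]}
    callee_summary: {type_id: (frequency, avg loop depth, avg branch count)}
    """
    for type_id, (freq, avg_loop, avg_branch) in callee_summary.items():
        entry = totals.setdefault(type_id, [0, 0.0, 0.0])
        entry[0] += freq
        # Every callee instruction sits call_loop_depth loops deeper at this call site.
        entry[1] += freq * (avg_loop + call_loop_depth)
        entry[2] += freq * (avg_branch + call_branch_count)
    return totals


def finalize(totals):
    """Turn totals into averages, as in the last loop of Algorithm 1."""
    return {tid: (f, l / f, b / f) for tid, (f, l, b) in totals.items()}


# Hypothetical caller state and callee summary.
caller_totals = {3: [1, 1.0, 0.0]}
callee = {3: (2, 1.0, 0.5), 6: (1, 0.0, 0.0)}

# The call site is nested in 2 loops and 1 branch of the caller.
merge_summary(caller_totals, callee, call_loop_depth=2, call_branch_count=1)
summary = finalize(caller_totals)
print(summary)
```

Here the two callee occurrences of type 3 contribute $2 \times (1.0 + 2) = 6$ to the caller's total loop depth, giving a total of 7 over 3 occurrences.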

Processing UI Layout Files: Sextant takes the UI layout XML files of an Android app as input and generates the layout vector. The process is similar to that for executables. The only difference is that, for UI layout XML files, Sextant counts the depth of each UI element in the XML tree rather than the loop depth or branch count. Android layout XML files may contain reference tags, which allow developers to include a layout tree defined in another file with one XML tag. To handle reference tags, Sextant treats them like method calls in the executable and applies a similar summary based inter-procedural analysis.

3.4. The Pre-Training Stage

The inputs of the pre-training stage are the semantic vector and the layout vector. Its outputs are the normalized executable feature and UI feature. The reason for having the pre-training stage is that the semantic vector and the layout vector do not have the same shape and their values differ in magnitude. Specifically, in Sextant, the elements of the semantic vector are 3-tuples while the elements of the layout vector are 2-tuples, so they cannot be combined directly into one feature vector. Furthermore, the number of instructions in the semantic vector can often exceed hundreds of thousands, while the number of UI elements in the layout vector is below one thousand. Any simple method that reshapes and combines the semantic vector and the layout vector directly will make the machine learning techniques ignore the effect of the layout vector.

Sextant learns the normalized executable feature vector and the layout feature vector with two CNN models respectively. These two models have similar structures. The only difference is the shape of the input and the size of hidden layers in the models. For the conciseness of this paper, we will focus on the pre-training model of the executable feature.

The model for executable feature learning is a convolutional neural network model. The first layer is a convolutional layer with 10 different filters. This convolutional layer is followed by three dense layers. Each of the first two dense layers consists of 1000 nodes, and the third layer has 50 nodes and serves as the feature layer. The last layer of the model is a binary classifier that determines whether the input app is a low rating app or not. Each layer applies an activation function and uses batch normalization to avoid overfitting. A softmax function is used as the classifier and cross entropy loss is used for supervised training.

For training, Sextant first converts the semantic vectors of the apps in the training set into a set of $3 \times N$ matrices, where $N$ is the total number of types of instructions. Then these matrices are fed to the convolutional layer. During the training, the model optimizes the cross entropy loss to improve the accuracy of classifying low rating apps. After the training, to obtain the executable feature of an app, Sextant feeds the semantic vector of the app to the pre-trained model and uses the output of the feature layer as the executable feature for the input app.
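As an illustration of the conversion step, the following Python sketch lays a semantic vector out as a matrix with one row per tuple component and one column per instruction type. It assumes that instruction types absent from an app are filled with zeros; the function name `to_matrix` and the tiny instruction set are hypothetical.

```python
def to_matrix(semantic_vector, n_types):
    """Build a 3 x n_types matrix from {type_id: (f, l, b)}.
    Rows: frequency, average loop depth, average branch count.
    Assumption: types that do not occur in the app stay at zero."""
    matrix = [[0.0] * n_types for _ in range(3)]
    for type_id, features in semantic_vector.items():
        for row, value in enumerate(features):
            matrix[row][type_id - 1] = float(value)  # type IDs start at 1
    return matrix


# Column 3 of Program 1's semantic vector, in a hypothetical 6-type instruction set.
matrix = to_matrix({3: (4, 1.5, 0.25)}, n_types=6)
for row in matrix:
    print(row)
```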

UI Feature: We follow a similar procedure to pre-train a CNN model for the UI feature. The differences are that this model contains 10 different filters in the convolutional layer and has only one dense layer, with 50 nodes, as the feature layer.

3.5. The Learning Stage

The input of the learning stage is a vector generated by concatenating the one dimensional executable feature and the one dimensional UI feature. The output is produced by a softmax function that predicts whether the input app is a low rating app. The multilayer perceptron used in the learning stage has two dense layers and one output layer for classification. The first dense layer has 100 nodes and the second layer has 20 nodes. The output layer has two nodes to generate the class of the input app. Class 0 represents that the input app is a low rating app and class 1 represents that the input app is not a low rating app. The dense layers apply an activation function and use batch normalization to avoid overfitting. The cross entropy loss is used as the loss function.
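The input and output ends of the learning stage can be sketched in plain Python. This sketch covers only the feature concatenation, the softmax, and the cross entropy loss; the dense layers are omitted, and the feature values and logits are placeholders rather than outputs of a trained model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Cross entropy loss for a single example with a known class."""
    return -math.log(probs[true_class])


def is_low_rating(logits):
    """Class 0 means low rating, class 1 means not low rating."""
    probs = softmax(logits)
    return probs[0] > probs[1]


# Placeholder 50-dimensional features from the two pre-trained feature layers.
executable_feature = [0.1] * 50
ui_feature = [0.2] * 50
global_feature = executable_feature + ui_feature  # concatenation, 100 values

# Hypothetical logits from the two-node output layer.
logits = [2.0, 0.5]
print(is_low_rating(logits), cross_entropy(softmax(logits), true_class=0))
```

Because both feature layers have 50 nodes, the concatenated global feature vector has 100 entries.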

4. Preliminary Study

The quality of the semantic vector is critical for the accuracy of Sextant. One important question is whether the semantic vector can accurately model the differences and similarities between Java/Android programs on the semantic level. This question is fundamental because if the semantic vector cannot represent these differences and similarities with high accuracy, using it to detect low rating Android apps cannot achieve a good result. For this reason, before evaluating the accuracy of using Sextant to detect low rating Android apps, we first answer, in this preliminary study, the question of whether the semantic vector can represent differences and similarities between Java/Android apps.

To answer this question, we use the accuracy of classifying Java programs that implement the same functionalities as the metric to measure how well the semantic vector captures the similarities and differences between Java/Android programs on the semantic level. Our general experiment is as follows. First, we collect a group of Java applications in $K$ categories. All the programs in the same category, $C_k$, implement the same functionality. Then, we use a convolutional neural network model to classify the category of each Java program. The convolutional neural network model is similar to the pre-training model of the executable representation in Section 3.4. However, the model in this preliminary study has smaller hidden layers and the output layer generates $K$ classes instead of two, where $K$ is the total number of categories. Finally, we report the classification accuracy.

The data set of the preliminary study was collected from the Google Code Jam website (cod, 2017). Google Code Jam is a programming competition sponsored by Google every year since 2008. Each year Google Code Jam posts programming questions for more than 10,000 competitors. Each question requires competitors to upload a program that passes the pre-designed test suite within the required resource limits. Thus, the uploaded answers to the same question implement the same functionality and belong to the same category in our preliminary study.

In our preliminary study, we downloaded the Java answers from the top 100 competitors from 2008 to 2016. We filtered the programs as follows:

  • We removed those solutions that could not be compiled.

  • We removed those solutions that caused Soot to crash.

  • We removed those questions that had fewer than 20 answers.

The reason for the first two criteria was that our implementation relied on Soot to analyze the binaries of Java programs. We applied the third criterion because we needed sufficient cases in each category to evaluate the accuracy. After the filtering, our set of test cases contained 105 questions and 8,245 answers. In other words, we had 8,245 Java programs in 105 categories and each category contained at least 20 programs.

To further evaluate the accuracy of our executable representation, we compared the accuracy of using the semantic vector with that of the baseline representation, which was the 1-dimensional bag of words approach (Peiravian and Zhu, 2013; Sahs and Khan, 2012; Shamili et al., 2010; Aafer et al., 2013; Schmidt et al., 2009; Burguera et al., 2011; Enck et al., 2009; Zhou et al., 2012; Arp et al., 2014). This representation is equivalent to using only the frequency of instructions in the semantic vector. We built two baseline neural network models that accepted the bag of words representation as input. The first one was the fully connected model. In this model, we replaced the convolutional layer with a dense layer. The second model was a baseline convolutional neural network model. For this model, we replaced the $3 \times 20$ filters in the semantic vector model with $1 \times 20$ filters since the bag of words representation only has one dimension.

During our experiment, we first calculated the semantic vector for each of the programs with the method introduced in Section 3.1. Then, we used the convolutional neural network model to classify the programs. To obtain a valid result, we took a 10-fold approach and repeated the experiment ten times. In each round of the experiment, we randomly split the data into ten sets. Then, we used nine sets as training sets and one as the testing set. The classification accuracy was measured in each experiment. Finally, we reported the average results and the standard deviations. The same protocol was applied to the two baseline models for the bag of words representation.
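The evaluation protocol can be sketched in Python as repeated 10-fold cross-validation. This is one reading of the protocol, assuming that every fold serves as the testing set once per round; the helper names and the `evaluate` callback (which would train a model and return its accuracy) are hypothetical.

```python
import random
import statistics


def k_fold_indices(n_samples, k, rng):
    """Randomly split the indices 0..n_samples-1 into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_k_fold(n_samples, evaluate, k=10, repeats=10, seed=0):
    """Run `repeats` rounds of k-fold evaluation and return the mean and
    standard deviation of the scores returned by evaluate(train, test)."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        folds = k_fold_indices(n_samples, k, rng)
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            scores.append(evaluate(train, test))
    return statistics.mean(scores), statistics.stdev(scores)


# Stand-in evaluator: reports the share of held-out samples instead of
# training a model, so the sketch runs without any learning library.
def held_out_share(train, test):
    return len(test) / (len(train) + len(test))


mean_score, std_score = repeated_k_fold(100, held_out_share)
print(mean_score, std_score)
```

In a real run, `evaluate` would fit the classifier on the training indices and return its accuracy on the testing indices.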

In our measurement, the neural network model on the semantic vectors achieved an average accuracy of 91.86% with a standard deviation of 1.00%. The fully connected neural network model over the bag of words representation achieved on average 89.64% accuracy with a standard deviation of 1.30%. The convolutional neural network model over the bag of words representation achieved on average 90.50% accuracy with a standard deviation of 1.08%. We compared the result of the semantic vector model to the two baseline models with Student's t-test. The p values were below 0.04, which meant that the accuracy of the model on the semantic vectors was significantly higher than that of the two baseline models.

This result suggests that the semantic vectors can accurately represent the semantic similarities between programs. It indicates that the semantic vector can possibly capture the semantic features of a Java/Android program. It is possible to use the semantic vector to detect low rating apps.

Our result also shows that the semantic vector has a significantly higher accuracy than the bag of words representation. This is not surprising since the semantic vector of a program contains the loop and branch information, which is not contained by the bag of words representation.

5. Evaluation

In our evaluation, we seek to answer the following three research questions:

  • RQ 1: How accurately can our approach detect low rating Android apps?

  • RQ 2: How long does the static analysis take to process an app?

  • RQ 3: How long does the learning stage take?

5.1. Implementation

We implemented the analysis stage of Sextant with soot (Vallée-Rai et al., 1999), apktool (apk, 2017), and FlowDroid (Arzt et al., 2014). We used apktool to unpack Android .apk files and retrieve the XML files, FlowDroid to build the call graph of Android apps, and soot to build the semantic vector of the executable. We used Keras (ker, 2017) and TensorFlow (ten, 2017) to implement the neural network models. We used the API set of Android 7.0 as the instruction set of the semantic vector and the UI element set of Android 7.0 for the layout vector. The hardware used in our experiments was a desktop with a Core i7 6700K processor, 32GB of memory, and an Nvidia GTX 1080 graphics card. We trained the model with batched input, using a batch size of 128, and trained for ten epochs to achieve the best accuracy.

5.2. Data Set

We evaluated Sextant with real Google Play apps. We downloaded the apps from the PlayDrone project (pla, 2017; Viennot et al., 2014), which contains 1.1 million free apps and their meta information from the Google Play market. We categorized the apps by star rating into four groups: 1 to 2 stars, 2 to 3 stars, 3 to 4 stars, and 4 to 5 stars. For each group, we randomly downloaded 10,000 apps. Thus, we had 40,000 apps from PlayDrone, with star ratings roughly evenly distributed from one star to five stars. We then filtered out the apps that caused soot or FlowDroid to crash and the apps whose UI layout could not be successfully retrieved by apktool. After the filtering, we had 8,684 apps with more than four stars, 7,854 apps with three to four stars, 8,775 apps with two to three stars, and 8,142 apps with fewer than two stars. In total, we had 33,456 market Android apps.
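
The grouping and sampling step can be sketched as below. The handling of boundary ratings (for example, an app with exactly 2.0 stars) is not specified above, so the sketch assumes that each group includes its lower bound; the function names and the `(app_id, stars)` input format are likewise illustrative assumptions.

```python
import random
from collections import defaultdict


def star_bucket(stars):
    """Map an average star rating in [1, 5] to one of the four groups.

    Assumption: each group includes its lower bound, and 5.0 falls in "4-5".
    """
    if not 1.0 <= stars <= 5.0:
        raise ValueError("star rating must be between 1 and 5")
    if stars < 2.0:
        return "1-2"
    if stars < 3.0:
        return "2-3"
    if stars < 4.0:
        return "3-4"
    return "4-5"


def sample_per_bucket(apps, per_bucket, seed=0):
    """Randomly pick up to `per_bucket` app ids from each star group.

    `apps` is an iterable of (app_id, stars) pairs.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for app_id, stars in apps:
        groups[star_bucket(stars)].append(app_id)
    return {
        bucket: rng.sample(ids, min(per_bucket, len(ids)))
        for bucket, ids in groups.items()
    }


toy_apps = [("a", 1.5), ("b", 1.7), ("c", 2.5), ("d", 4.2)]
toy_sample = sample_per_bucket(toy_apps, per_bucket=1)
```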

Our apps were from 25 categories, such as game, lifestyle, and business. Figure 5 shows the distribution of categories of our downloaded apps.

Figure 5. The distribution of categories of our apps

In Figure 5, we plotted the 9 categories with the most apps as a pie chart. The other 16 categories were summarized as OTHERS. As shown in the chart, we had a significant number of apps in each category. The GAME category had the most apps because there were many more games than other apps in Google Play (Viennot et al., 2014). In fact, GAME could be further broken down into 17 sub-categories. Nevertheless, our downloaded apps could still provide enough diversity in the functionalities of apps.

5.3. RQ 1: Accuracy of Classification

Our first research question evaluates the accuracy of Sextant in detecting low rating apps. To do this, we first labeled all the apps that have less than three stars as low rating apps. We labeled all the apps that have more than three stars as negative samples. Thus, in total, we labeled 16,917 apps as low rating apps and 16,538 apps as not-low rating apps. Then we followed the 10-fold protocol we had used in Section 4: we randomly split the data set into ten parts and took nine of them as the training set and one as the testing set. This process was repeated ten times. The averages and standard deviations of precision, recall, and accuracy were measured.
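
The labeling rule and the three reported metrics can be sketched as follows, with low rating apps as the positive class. How an app with exactly three stars is labeled is not stated above; the sketch assumes it is treated as not low rating. The toy labels and predictions are made up for illustration.

```python
def is_low_rating(stars, threshold=3.0):
    """Label an app as low rating when it has fewer than `threshold` stars.

    Assumption: an app with exactly `threshold` stars is not low rating.
    """
    return stars < threshold


def precision_recall_accuracy(y_true, y_pred):
    """Compute precision, recall, and accuracy with low rating as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Toy example: star ratings of six apps and hypothetical classifier outputs.
stars = [1.2, 2.4, 2.9, 3.1, 4.0, 4.8]
y_true = [is_low_rating(s) for s in stars]
y_pred = [True, True, False, True, False, False]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
```

Recall is the share of truly low rating apps that are flagged, so one minus recall is the share that a market would miss.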

In our evaluation, we also built four baseline models for comparison. The first two were based on the bag of words representation, which was used in previous techniques (Peiravian and Zhu, 2013; Sahs and Khan, 2012; Shamili et al., 2010; Aafer et al., 2013; Schmidt et al., 2009; Burguera et al., 2011; Enck et al., 2009; Zhou et al., 2012; Arp et al., 2014). The first model was the fully connected neural network model for the bag of words representation. The structure of this model was similar to the pre-training model of the semantic vector. The only difference was that we replaced the convolutional layer of the pre-training model of the semantic vector with a dense layer with 1000 nodes. The second model was the CNN model over the bag of words representation. In this model, we replaced the 320 filters in the pre-training model for the semantic vector with 120 filters to accept the 1-dimensional bag of words representation. The third model was an executable only model, in which we used the pre-training model for semantic vector learning alone to detect low rating apps. Similarly, the fourth model was the UI only model, which we built by using the pre-training model for UI feature learning alone. This model evaluated the accuracy of using only UI information for low rating app detection. For all four models, we followed the same 10-fold experiment process and calculated the averages and standard deviations of their precision, recall, and accuracy.

Figure 6. The accuracy of Sextant 

The result of our measurement is shown in Figure 6. The bars with represent the result of the bag of words representation with the fully connected neural network model. represents the bag of words representation with the convolutional neural network model. Executable is the result of only using the semantic vector for classification. UI is the accuracy of only using the UI representation. Union is the accuracy of Sextant .

In our experiment, Sextant achieved an average accuracy of 92.31% with a standard deviation of 0.55%. The average precision was 90.50% with a standard deviation of 2.25%. The average recall was 94.31% with a standard deviation of 3.14%.

The bag of words representation with the fully connected neural network model achieved an average accuracy of 85.14% with a standard deviation of 0.67%. Its average precision was 81.76% with a standard deviation of 2.78%. Its average recall was 89.32% with a standard deviation of 3.77%.

For the bag of words representation with the convolutional neural network model, the average accuracy was 87.18% and the standard deviation was 4.7%. The average precision was 84.63% with a standard deviation of 6.7%. The average recall was 93.12% with a standard deviation of 7.70%.

For the executable only model, the average accuracy was 90.82% and the standard deviation was 1.25%. The average precision was 89.10% with a standard deviation of 2.94%. The average recall was 93.49% with a standard deviation of 2.68%.

For the UI only model, the average accuracy was 70.73% with a standard deviation of 1.15%. The average precision was 69.29% with a standard deviation of 0.92%. The average recall was 76.75% with a standard deviation of 3.62%.

To ensure statistical significance, we also performed Student's t-test between the results of each pair of the five models. The p-values of all tests were smaller than 0.035, except the recall between the and . This result meant that there were statistically significant differences between the results of the five models except the recalls of and .

The result of our experiment is promising. It shows that Sextant can accurately detect low rating apps. Note that, in our implementation, we simply labeled apps with less than three stars as low rating apps. This labeling method does not account for borderline apps whose ratings are around three stars. For example, an app with 2.9 stars is not necessarily worse or less popular than an app with 3.1 stars. It is harder for Sextant to correctly detect the borderline apps. For this reason, we consider the 92.31% accuracy satisfactory.

One interesting observation is that Sextant has a higher recall than precision. This means that our approach is less likely to miss low rating apps. This property is beneficial for app stores because they can use Sextant to scan their apps and find nearly all candidate low rating apps. This process will miss only 5.7% of low rating apps. Then, the app stores can focus on the apps that are detected as low rating apps. This process could potentially save app markets the labor of detecting low rating apps.

The results of our experiments also indicate that it is effective to add branch and loop information into the semantic vector. As shown in Figure 6, using the semantic vector can achieve 3.62% more accuracy than the bag of words approach. It can also achieve a much smaller standard deviation, which means the model with the semantic vector is more robust than the convolutional neural network model over the bag of words representation. The improvement from using the semantic vector to detect low rating apps is larger than the improvement from using it to classify programs with the same functionality in Section 4. This is because realistic Android apps are larger and more complex than solutions to programming questions on the CodeJam platform. It is more beneficial to keep the loop and branch information in large programs.

Our experiments also prove that combining the UI representation with the executable representation can significantly improve the detection accuracy over the executable only model. This is not surprising since UI is also an important factor that can influence user experience of an Android app.

5.4. RQ 2: Run Time Overhead of Static Analysis

Our second research question is to evaluate the time consumed by our static analysis stage to build the executable representation and UI representation from the .apk file of an Android app.

To answer this research question, we first added time stamp logs to our static analysis code and scripts. Then we executed our code and collected the time overhead for each of our test cases. To better understand the time overhead of the static analysis phase, we broke down the total time into four categories: the processing time of FlowDroid, the time overhead to build the executable representation, the time for apktool to retrieve the UI XML files, and the analysis time to build the UI representation.
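
The instrumentation can be sketched as a small timing helper that records the elapsed time of each stage. This is only an illustration in Python: the stage names are hypothetical labels for the four categories, and the stage bodies are placeholders rather than calls into FlowDroid, soot, or apktool.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Elapsed seconds per stage, one entry per analyzed app.
stage_seconds = defaultdict(list)


@contextmanager
def timed(stage):
    """Record the wall-clock time spent inside the `with` block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage].append(time.perf_counter() - start)


def analyze_app(apk_path):
    """Run the four stages for one app; each body is a placeholder here."""
    with timed("flowdroid_call_graph"):
        pass  # placeholder: build the call graph
    with timed("executable_representation"):
        pass  # placeholder: build the semantic vector
    with timed("apktool_ui_xml"):
        pass  # placeholder: unpack the .apk and retrieve the UI XML files
    with timed("ui_representation"):
        pass  # placeholder: build the layout vector


analyze_app("example.apk")  # hypothetical file name
```

Averaging each list in `stage_seconds` over all apps yields a per-stage breakdown of the kind reported for this research question.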

In our measurement, the average processing time was 23.98 seconds per app, with a standard deviation of 18.00 seconds. The breakdown of the four categories of time is shown in Figure 7. More specifically, on average, the processing time of FlowDroid per app was 17.44 seconds with a standard deviation of 17.36 seconds, the time overhead to build the executable representation was 1.22 seconds with a standard deviation of 0.77 seconds, the time cost of apktool to retrieve the UI XML files was 5.14 seconds with a standard deviation of 0.42 seconds, and the processing time to build the UI representation was 0.28 seconds with a standard deviation of 0.04 seconds.

Figure 7. The time cost of the static analysis of Sextant 

According to the result, our static analysis stage on average takes less than 30 seconds to process one Android market app. This indicates that our approach is scalable and can be applied to realistic Android markets. One interesting fact about our result is that 94% of the processing time of our approach is consumed by apktool and FlowDroid. In particular, 73% of the time is consumed by FlowDroid. This is because, besides building the call graph, FlowDroid also performs other analyses that are not used in our approach. We expect our approach could be faster with a more lightweight way to build the call graph.
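
The two percentages follow from the reported averages, as the short calculation below shows. It uses the reported average total of 23.98 seconds as the denominator; the variable and function names are illustrative.

```python
TOTAL_SECONDS = 23.98      # reported average processing time per app
FLOWDROID_SECONDS = 17.44  # reported average FlowDroid time
APKTOOL_SECONDS = 5.14     # reported average apktool time


def share_of_total(seconds, total=TOTAL_SECONDS):
    """Return `seconds` as a percentage of the total processing time."""
    return 100.0 * seconds / total


flowdroid_share = share_of_total(FLOWDROID_SECONDS)
tools_share = share_of_total(FLOWDROID_SECONDS + APKTOOL_SECONDS)
```

Rounded to whole percentages, these give the 73% and 94% figures quoted above.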

5.5. RQ 3: Learning Time

To answer this research question, we measured the time consumed during the training stage. This time consists of three parts: the time consumed by the pre-training model of the executable feature learning, the time of the pre-training model of the UI feature learning, and the time of the final learning stage. To ensure the validity of our result, we measured each of the three parts ten times during the 10-fold experiment.

Figure 8. The learning time of Sextant 

The result is shown in Figure 8, where the unit is minutes. On average, training the pre-training model for the executable feature took 174 minutes with a standard deviation of 28 minutes. The average time to train the UI feature model was six minutes with a standard deviation of 10 seconds. The average training time in the final stage was 22 minutes with a standard deviation of 17 seconds.

As shown in our evaluation, our model could be trained in 3.4 hours, which is acceptable for more than 30,000 apps. This time can be further reduced with more powerful hardware. Another observation is that 86% of the total training time is spent training the executable feature model in the pre-training stage. This is because the input of the semantic vector has much larger dimensions than the UI representation and the input in the final stage. Nevertheless, our model can still be trained in a reasonable amount of time.
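
The totals above can be reproduced from the three reported averages. The calculation assumes the three stages run one after another, so that their times simply add up; the constant names are illustrative.

```python
PRETRAIN_EXECUTABLE_MIN = 174  # reported average, executable feature pre-training
PRETRAIN_UI_MIN = 6            # reported average, UI feature pre-training
FINAL_STAGE_MIN = 22           # reported average, final learning stage

# Assumption: the stages are sequential, so the total is a plain sum.
total_minutes = PRETRAIN_EXECUTABLE_MIN + PRETRAIN_UI_MIN + FINAL_STAGE_MIN
total_hours = total_minutes / 60
executable_share = 100.0 * PRETRAIN_EXECUTABLE_MIN / total_minutes
```

This gives 202 minutes in total, or about 3.4 hours, with roughly 86% spent on the executable feature model.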

6. Threat To Validity

External Validity: To ensure that our applications are representative, we collected 33,456 realistic Android apps from the Google Play market. These apps have different star ratings and are from 25 categories. During the process of app collection, we filtered out the apps that could not be processed by soot and FlowDroid and the apps whose layout files could not be retrieved by apktool. This process did not bias the result of our evaluation.

In our experiment, we used the apps from the database crawled by PlayDrone (pla, 2017; Viennot et al., 2014), which contains a snapshot of the Google Play store as of Oct 31st, 2014. This app set does not contain Android apps released after Android 4.4. This threat does not affect the validity of our experiment because the architecture, API set, and the way to create apps for Android have not changed significantly since Android 4.0. To further alleviate this threat, we used the API set and the UI element set of Android 7.0 in our implementation. The unique APIs and UI elements in Android 4.4 were not captured. Thus, we do not expect the results of our experiments to differ significantly for recent apps.

Our app set only contains free apps. However, we believe that free apps and paid apps do not differ significantly in programming methodology and UI design. Thus, we do not expect a very different result for paid apps.

Internal Validity: The neural network models of Sextant were randomly initialized. Such randomness can affect the accuracy of our models. The time measurement in our experiments can also be affected by randomness. To alleviate its impact, we followed a 10-fold process. We randomly divided our apps into ten sets. In each experiment, we used nine sets for training and one for testing. We then averaged the results over ten experiments and reported the standard deviation.

Construct Validity: In Section 5.3, we compared the accuracy of Sextant to four baseline models. Such a comparison can be affected by random measurement errors. To make the comparison solid, we also performed Student's t-test for each pair of our models. The result of the test showed that the difference in accuracy between any two models in Section 5.3 was statistically significant.

7. Related Work

To the best of our knowledge, detecting low rating Android apps with only static program and UI information is a problem that has not been addressed before. Monett and colleagues proposed a technique to predict the star number of Android apps from the user reviews (Monett and Stolte, 2016). Similar techniques are also used to predict movie scores or to perform other sentiment analysis in natural language processing (Pang et al., 2008, 2002; Pang and Lee, 2005; Goldberg and Zhu, 2006; Tang et al., 2015). The problem with these techniques is that they still require users to provide reviews for Android apps. They suffer from the same problem as the current star rating system.

Another work related to ours is that of Mou and colleagues (Mou et al., 2016). This approach first embeds the keywords of a programming language as vectors so that similar keywords have closer vectors in the Euclidean space. Then, it uses the token level embeddings to classify the programs from the online judgment platform of Peking University. To classify the programs, this approach represents the nodes in the AST of a program with the token level embeddings and uses a tree-based convolutional neural network to classify the ASTs. This task is similar to what we have done in the preliminary study.

Compared to Mou and colleagues' work, our approach addresses a different problem with realistic apps. Our approach detects low rating Android market apps. Android apps are substantially larger and more complex than online judgment platform questions. The star ratings of Android apps are also less accurate than the labels of questions. Furthermore, Mou and colleagues' work requires the source code to build the ASTs, while our approach works directly on the executables. Thus, our approach can be used on closed-source applications while Mou and colleagues' work cannot.

White and colleagues proposed a neural network based technique to detect program clones (White et al., 2016). This work first uses a recursive neural network based approach to embed tokens of program source code as vectors (White et al., 2015). It then encodes the source code of program snippets as vectors with a recursive autoencoder (Socher et al., 2011). Finally, the program vectors are used to detect program clones. Deckard (Jiang et al., 2007; Gabel et al., 2008) summarizes the patterns in ASTs of programs and counts these patterns as vectors. Then it uses machine learning techniques to detect code clones with the tree patterns. Chen and colleagues embed the CFG of programs as a matrix (Chen et al., 2014) to detect program clones. Unlike Sextant, these approaches focus on detecting program clones. They do not detect low rating Android apps. Further, similar to Mou and colleagues' work (Mou et al., 2016), these approaches also need the source code, while our approach does not.

Zhang and colleagues proposed a machine learning based technique, DroidSIFT, to detect malware (Zhang et al., 2014). This approach builds an API dependency pattern database from benign and malicious Android apps. Then it encodes new apps based on their similarities to the API dependency patterns in the database. Finally, it detects malware by learning a model from the app encodings. The limitation of this work is that it encodes apps based on the dependency graph for each API. It is very expensive to build the dependency graph for all APIs used by an app. For malware classification, this problem can be alleviated by focusing on a small set of permission related APIs. However, detecting low rating apps cannot rely on analyzing only a small set of APIs. In this case, DroidSIFT will encounter scalability issues.

Wang and colleagues proposed a neural network based approach for defect prediction on the source code file level (Wang et al., 2016). This approach first encodes programs based on the token vector of AST nodes. Then it uses a deep belief network for feature dimension reduction. Finally, it trains a model to classify buggy files with the reduced-dimension features. This technique cannot be directly used to detect low rating apps for two reasons. First, it requires source code. Second, the size of the token vector can be too large for machine learning models on the whole application level.

Bag of words approaches are used for malware detection (Peiravian and Zhu, 2013; Sahs and Khan, 2012; Shamili et al., 2010; Aafer et al., 2013; Schmidt et al., 2009; Burguera et al., 2011; Enck et al., 2009; Zhou et al., 2012; Arp et al., 2014). Compared with these approaches, Sextant addresses a very different problem. Furthermore, as we evaluated in Section 5.3, our approach significantly outperforms the bag of words approach regarding classification accuracy.

Many approaches also use machine learning techniques to generate code snippets from natural language queries (Gu et al., 2016; Allamanis et al., 2015; Raghothaman et al., 2016). These techniques learn a translation model from API sequences and their comments. Then, given a natural language query, the model generates the API sequence. Other techniques have also been proposed for API pattern mining (Fowkes and Sutton, 2016; Moritz et al., 2013; Wang et al., 2013). Despite the usefulness of these techniques, they address problems very different from the one we address in this paper.

There is also a group of studies that examine the relationship between the ratings and features of apps. Tian and colleagues (Tian et al., 2015) found that the size, code complexity, and 15 other features have positive correlations with the star ratings. Linares-Vásquez and colleagues (Linares-Vásquez et al., 2013) studied the relationship between API changes in Android apps and star ratings. Ruiz and colleagues (Ruiz et al., 2014) studied the correlation between ad libraries and ratings of Android apps. Gui and colleagues (Gui et al., 2015) studied the relationship between energy consumption and ratings of Android apps. Although the conclusions of these studies are interesting, they do not provide a method to predict or classify low rating apps.

8. Conclusions

In this paper, we proposed a novel method to detect low rating Android apps only based on the .apk files. With our approach, an app market does not need to risk its reputation by exposing low rating apps to end users for rating collection. Our approach is based on static program analysis and machine learning. In our approach, we first proposed novel representations for the executable file as well as the UI layout of an Android app. Then, based on the executable and UI representations, we built a convolutional neural network model to detect low rating Android apps.

We also performed an extensive evaluation of our approach on 33,456 realistic Android market apps. In our experiment, our approach could detect low rating Android apps with 90.50% precision and 94.32% recall. Our approach also outperformed all the four baseline models with statistical significance.

Overall, our approach is both accurate and scalable. It can potentially help Android app markets prevent low rating apps from reaching users and accelerate the manual reviewing process.

References

  • app (2017) 2017. https://9to5mac.com/2017/03/29/app-store-android-app-market-in-revenue/. (July 2017).
  • pla (2017) 2017. https://archive.org/details/android_apps&tab=about. (July 2017).
  • cod (2017) 2017. https://code.google.com/codejam/. (July 2017).
  • and (2017a) 2017a. https://developer.android.com/reference/android/app/Activity.html. (July 2017).
  • and (2017b) 2017b. https://developer.android.com/reference/android/view/ViewGroup.html. (July 2017).
  • ios (2017) 2017. https://developer.apple.com/support/app-review/. (June 2017).
  • ker (2017) 2017. https://github.com/fchollet/keras. (July 2017).
  • dex (2017) 2017. https://github.com/pxb1988/dex2jar. (July 2017).
  • apk (2017) 2017. https://ibotpeaches.github.io/Apktool/. (July 2017).
  • fac (2017) 2017. https://play.google.com/store/apps/details?id=com.facebook.katana&hl=en. (July 2017).
  • and (2017c) 2017c. https://qz.com/826672/android-goog-just-hit-a-record-88-market-share-of-all-smartphones/. (July 2017).
  • and (2017d) 2017d. https://www.appbrain.com/stats/number-of-android-apps. (July 2017).
  • ten (2017) 2017. https://www.tensorflow.org/. (July 2017).
  • low (2017) 2017. http://www.androidauthority.com/googles-new-punishment-poorly-made-apps-774032/. (July 2017).
  • Aafer et al. (2013) Yousra Aafer, Wenliang Du, and Heng Yin. 2013. DroidAPIMiner: Mining API-Level Features for Robust Malware Detection in Android. Springer International Publishing, Cham, 86–103. https://doi.org/10.1007/978-3-319-04283-1_6
  • Aho et al. (2006) Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
  • Allamanis et al. (2015) Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. 2015. Bimodal Modelling of Source Code and Natural Language. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37 (ICML’15). JMLR.org, 2123–2132. http://dl.acm.org/citation.cfm?id=3045118.3045344
  • Arp et al. (2014) Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. 2014. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket.. In NDSS.
  • Arzt et al. (2014) Steven Arzt, Siegfried Rasthofer, Christian Fritz, Eric Bodden, Alexandre Bartel, Jacques Klein, Yves Le Traon, Damien Octeau, and Patrick McDaniel. 2014. FlowDroid: Precise Context, Flow, Field, Object-sensitive and Lifecycle-aware Taint Analysis for Android Apps. SIGPLAN Not. 49, 6 (June 2014), 259–269. https://doi.org/10.1145/2666356.2594299
  • Burguera et al. (2011) Iker Burguera, Urko Zurutuza, and Simin Nadjm-Tehrani. 2011. Crowdroid: behavior-based malware detection system for android. In Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices. ACM, 15–26.
  • Chen et al. (2014) Kai Chen, Peng Liu, and Yingjun Zhang. 2014. Achieving accuracy and scalability simultaneously in detecting application clones on android markets. In Proceedings of the 36th International Conference on Software Engineering. ACM, 175–186.
  • Enck et al. (2009) William Enck, Machigar Ongtang, and Patrick McDaniel. 2009. On lightweight mobile phone application certification. In Proceedings of the 16th ACM conference on Computer and communications security. ACM, 235–245.
  • Fowkes and Sutton (2016) Jaroslav Fowkes and Charles Sutton. 2016. Parameter-free Probabilistic API Mining Across GitHub. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 254–265. https://doi.org/10.1145/2950290.2950319
  • Gabel et al. (2008) Mark Gabel, Lingxiao Jiang, and Zhendong Su. 2008. Scalable detection of semantic clones. In Software Engineering, 2008. ICSE’08. ACM/IEEE 30th International Conference on. IEEE, 321–330.
  • Goldberg and Zhu (2006) Andrew B. Goldberg and Xiaojin Zhu. 2006. Seeing Stars when There Aren’t Many Stars: Graph-based Semi-supervised Learning for Sentiment Categorization. In Proceedings of the First Workshop on Graph Based Methods for Natural Language Processing (TextGraphs-1). Association for Computational Linguistics, Stroudsburg, PA, USA, 45–52. http://dl.acm.org/citation.cfm?id=1654758.1654769
  • Gu et al. (2016) Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API Learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016). ACM, New York, NY, USA, 631–642. https://doi.org/10.1145/2950290.2950334
  • Gui et al. (2015) J. Gui, S. Mcilroy, M. Nagappan, and W. G. J. Halfond. 2015. Truth in Advertising: The Hidden Cost of Mobile Ads for Software Developers. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 100–110. https://doi.org/10.1109/ICSE.2015.32
  • Hao et al. (2013) S. Hao, D. Li, W. G. J. Halfond, and R. Govindan. 2013. Estimating mobile application energy consumption using program analysis. In 2013 35th International Conference on Software Engineering (ICSE). 92–101. https://doi.org/10.1109/ICSE.2013.6606555
  • Jia et al. (2015) Y. Jia, M. B. Cohen, M. Harman, and J. Petke. 2015. Learning Combinatorial Interaction Test Generation Strategies Using Hyperheuristic Search. In 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, Vol. 1. 540–550. https://doi.org/10.1109/ICSE.2015.71
  • Jiang et al. (2007) Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th international conference on Software Engineering. IEEE Computer Society, 96–105.
  • Li et al. (2013) Ding Li, Shuai Hao, William G. J. Halfond, and Ramesh Govindan. 2013. Calculating Source Line Level Energy Information for Android Applications. In Proceedings of the 2013 International Symposium on Software Testing and Analysis (ISSTA 2013). ACM, New York, NY, USA, 78–89. https://doi.org/10.1145/2483760.2483780
  • Li et al. (2016) D. Li, Y. Lyu, J. Gui, and W. G. J. Halfond. 2016. Automated Energy Optimization of HTTP Requests for Mobile Applications. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). 249–260. https://doi.org/10.1145/2884781.2884867
  • Linares-Vásquez et al. (2013) Mario Linares-Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Massimiliano Di Penta, Rocco Oliveto, and Denys Poshyvanyk. 2013. API Change and Fault Proneness: A Threat to the Success of Android Apps. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2013). ACM, New York, NY, USA, 477–487. https://doi.org/10.1145/2491411.2491428
  • Liu et al. (2014) Yepang Liu, Chang Xu, and Shing-Chi Cheung. 2014. Characterizing and Detecting Performance Bugs for Smartphone Applications. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 1013–1024. https://doi.org/10.1145/2568225.2568229
  • Monett and Stolte (2016) D. Monett and H. Stolte. 2016. Predicting star ratings based on annotated reviews of mobile apps. In 2016 Federated Conference on Computer Science and Information Systems (FedCSIS). 421–428.
  • Moritz et al. (2013) E. Moritz, M. Linares-Vásquez, D. Poshyvanyk, M. Grechanik, C. McMillan, and M. Gethers. 2013. ExPort: Detecting and visualizing API usages in large source code repositories. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). 646–651. https://doi.org/10.1109/ASE.2013.6693127
  • Mou et al. (2016) Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16). AAAI Press, 1287–1293. http://dl.acm.org/citation.cfm?id=3015812.3016002
  • Nistor et al. (2013) Adrian Nistor, Linhai Song, Darko Marinov, and Shan Lu. 2013. Toddler: Detecting Performance Problems via Similar Memory-access Patterns. In Proceedings of the 2013 International Conference on Software Engineering (ICSE ’13). IEEE Press, Piscataway, NJ, USA, 562–571. http://dl.acm.org/citation.cfm?id=2486788.2486862
  • Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (ACL ’05). Association for Computational Linguistics, Stroudsburg, PA, USA, 115–124. https://doi.org/10.3115/1219840.1219855
  • Pang et al. (2008) Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Foundations and Trends® in Information Retrieval 2, 1–2 (2008), 1–135.
  • Pang et al. (2002) Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs Up?: Sentiment Classification Using Machine Learning Techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10 (EMNLP ’02). Association for Computational Linguistics, Stroudsburg, PA, USA, 79–86. https://doi.org/10.3115/1118693.1118704
  • Peiravian and Zhu (2013) N. Peiravian and X. Zhu. 2013. Machine Learning for Android Malware Detection Using Permission and API Calls. In 2013 IEEE 25th International Conference on Tools with Artificial Intelligence. 300–305. https://doi.org/10.1109/ICTAI.2013.53
  • Raghothaman et al. (2016) Mukund Raghothaman, Yi Wei, and Youssef Hamadi. 2016. SWIM: Synthesizing What I Mean: Code Search and Idiomatic Snippet Synthesis. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 357–367. https://doi.org/10.1145/2884781.2884808
  • Ruiz et al. (2014) I. J. Mojica Ruiz, M. Nagappan, B. Adams, T. Berger, S. Dienst, and A. E. Hassan. 2014. Impact of Ad Libraries on Ratings of Android Mobile Apps. IEEE Software 31, 6 (Nov 2014), 86–92. https://doi.org/10.1109/MS.2014.79
  • Sahs and Khan (2012) J. Sahs and L. Khan. 2012. A Machine Learning Approach to Android Malware Detection. In 2012 European Intelligence and Security Informatics Conference. 141–147. https://doi.org/10.1109/EISIC.2012.34
  • Schmidt et al. (2009) A-D Schmidt, Rainer Bye, H-G Schmidt, Jan Clausen, Osman Kiraz, Kamer A Yuksel, Seyit Ahmet Camtepe, and Sahin Albayrak. 2009. Static analysis of executables for collaborative malware detection on android. In Communications, 2009. ICC’09. IEEE International Conference on. IEEE, 1–5.
  • Shamili et al. (2010) A. S. Shamili, C. Bauckhage, and T. Alpcan. 2010. Malware Detection on Mobile Devices Using Distributed Machine Learning. In 2010 20th International Conference on Pattern Recognition. 4348–4351. https://doi.org/10.1109/ICPR.2010.1057
  • Socher et al. (2011) Richard Socher, Jeffrey Pennington, Eric H Huang, Andrew Y Ng, and Christopher D Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 151–161.
  • Tang et al. (2015) Duyu Tang, Bing Qin, Ting Liu, and Yuekui Yang. 2015. User Modeling with Neural Network for Review Rating Prediction. In Proceedings of the 24th International Conference on Artificial Intelligence (IJCAI’15). AAAI Press, 1340–1346. http://dl.acm.org/citation.cfm?id=2832415.2832436
  • Tian et al. (2015) Y. Tian, M. Nagappan, D. Lo, and A. E. Hassan. 2015. What are the characteristics of high-rated apps? A case study on free Android Applications. In 2015 IEEE International Conference on Software Maintenance and Evolution (ICSME). 301–310. https://doi.org/10.1109/ICSM.2015.7332476
  • Vallée-Rai et al. (1999) Raja Vallée-Rai, Phong Co, Etienne Gagnon, Laurie Hendren, Patrick Lam, and Vijay Sundaresan. 1999. Soot - a Java Bytecode Optimization Framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research (CASCON ’99). IBM Press, 13–. http://dl.acm.org/citation.cfm?id=781995.782008
  • Viennot et al. (2014) Nicolas Viennot, Edward Garcia, and Jason Nieh. 2014. A Measurement Study of Google Play. SIGMETRICS Perform. Eval. Rev. 42, 1 (June 2014), 221–233. https://doi.org/10.1145/2637364.2592003
  • Wang et al. (2013) Jue Wang, Yingnong Dang, Hongyu Zhang, Kai Chen, Tao Xie, and Dongmei Zhang. 2013. Mining Succinct and High-coverage API Usage Patterns from Source Code. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13). IEEE Press, Piscataway, NJ, USA, 319–328. http://dl.acm.org/citation.cfm?id=2487085.2487146
  • Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically Learning Semantic Features for Defect Prediction. In Proceedings of the 38th International Conference on Software Engineering (ICSE ’16). ACM, New York, NY, USA, 297–308. https://doi.org/10.1145/2884781.2884804
  • White et al. (2016) Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep Learning Code Fragments for Code Clone Detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE 2016). ACM, New York, NY, USA, 87–98. https://doi.org/10.1145/2970276.2970326
  • White et al. (2015) Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. Toward Deep Learning Software Repositories. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR ’15). IEEE Press, Piscataway, NJ, USA, 334–345. http://dl.acm.org/citation.cfm?id=2820518.2820559
  • Xiao et al. (2013) Xusheng Xiao, Shi Han, Dongmei Zhang, and Tao Xie. 2013. Context-sensitive Delta Inference for Identifying Workload-dependent Performance Bottlenecks. In Proceedings of the 2013 International Symposium on Software Testing and Analysis (ISSTA 2013). ACM, New York, NY, USA, 90–100. https://doi.org/10.1145/2483760.2483784
  • Xu et al. (2010) Guoqing Xu, Nick Mitchell, Matthew Arnold, Atanas Rountev, Edith Schonberg, and Gary Sevitsky. 2010. Finding Low-utility Data Structures. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’10). ACM, New York, NY, USA, 174–186. https://doi.org/10.1145/1806596.1806617
  • Zhang et al. (2016) Jie Zhang, Ziyi Wang, Lingming Zhang, Dan Hao, Lei Zang, Shiyang Cheng, and Lu Zhang. 2016. Predictive Mutation Testing. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA 2016). ACM, New York, NY, USA, 342–353. https://doi.org/10.1145/2931037.2931038
  • Zhang et al. (2015) Mu Zhang, Yue Duan, Qian Feng, and Heng Yin. 2015. Towards Automatic Generation of Security-Centric Descriptions for Android Apps. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS ’15). ACM, New York, NY, USA, 518–529. https://doi.org/10.1145/2810103.2813669
  • Zhang et al. (2014) Mu Zhang, Yue Duan, Heng Yin, and Zhiruo Zhao. 2014. Semantics-Aware Android Malware Classification Using Weighted Contextual API Dependency Graphs. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS ’14). ACM, New York, NY, USA, 1105–1116. https://doi.org/10.1145/2660267.2660359
  • Zhou et al. (2012) Yajin Zhou, Zhi Wang, Wu Zhou, and Xuxian Jiang. 2012. Hey, you, get off of my market: detecting malicious apps in official and alternative android markets. In NDSS, Vol. 25. 50–52.