Datasets of Android Applications: a Literature Review

by Franz-Xaver Geiger, et al.
Vrije Universiteit Amsterdam

Mobile phones and tablets have become the most widely used computing devices, with a large predominance of the Android platform. As a natural evolution, the development of Android applications has surged and has become a major field of study, with research efforts ranging from energy efficiency to code smells, performance, maintainability, and security. These kinds of challenges call for dedicated solutions, tools, and datasets. This survey identifies and reviews 31 existing datasets of Android applications and classifies each of them according to key features, such as the total number of apps it contains, whether the commit history of the apps is available, whether it focuses on the source code or on the executable binaries of the apps, and the sources used for building the dataset. This study can benefit both experienced and novice researchers interested in doing research on Android apps, who can use the results of our study as a map for identifying the most suitable datasets for their research objectives.




1 Introduction

Mobile phones and tablets have become the most widely used computing devices. Consequently, development of mobile applications has surged and become a major field of study. Additionally, mobile platforms bring their particular set of constraints. For instance, energy on small devices is scarce and power management is paramount. Privacy of users and software security are other highly studied topics. These kinds of challenges call for dedicated solutions, tools, and datasets.

This survey reviews datasets of Android applications. However, not all research needs the same set of data. Martin et al. provide an extensive survey of studies and datasets of app store analysis for various platforms [26]. They identified seven key subfields (API Analysis, Feature Analysis, Release Engineering, Review Analysis, Security, Store Ecosystem, and Size and Effort Prediction) in addition to Closely Related Work (which includes Mining Tools). Research may be interested in technical attributes, such as API usage or platform version, as well as non-technical attributes, e.g., reviews or number of downloads. For dynamic analysis of applications, executable artifacts are necessary. Bytecode from APKs can be decompiled to learn information about data flow and other code metrics. To analyze apps for programming practices and project management, source code and data from source code management tools, such as version control systems and bug trackers, are helpful. The latter category of information is not readily available for the vast number of proprietary applications. Studies that need this kind of data have to rely on open-source Android apps.

Therefore, we review existing literature for various characteristics which may facilitate different sub-fields and studies. This survey thus focuses on these main traits of datasets of Android applications:

  • Does the dataset facilitate access to source code of applications?

  • Is the source code available in version control (e.g. Git)?

  • Are installable APKs included?

  • Does the dataset link to app stores where additional information (such as ratings and reviews) is accessible?

This literature survey is structured as follows: First, in Section 2, we explain the iterative literature review process, from keyword search and snowballing to a concise tabular view of the important information. Following that, we review and summarize the datasets and studies resulting from the search process (Section 3). Lessons from the review results are detailed in Section 4, where we argue that too few datasets include access to source code and that those which link source code contain too few applications. Finally, we conclude this survey in Section 5.

2 Literature Review Process

Literature presented in this review was collected with a combination of keyword search and snowballing, i.e., walking the graph of references in both directions. All queries were run against the Google Scholar database in winter 2017/18. The review is concerned with datasets of Android applications in general but also more specifically with datasets that allow access to source code of apps. Figure 1 shows the iterative search process, which followed four steps, repeating Phase 2 and Phase 3 until search results were exhausted. The four steps are (1) an initial keyword search, (2) filtering of relevant publications by title and abstract, (3) finding candidate publications by following the graph of citations from new search results to both citing and cited articles, and finally, (4) summarizing all relevant publications found in textual and tabular form.

Phase 1: Keyword search

Initially we searched for “Android app dataset”, “Android app collection”, and “Android app mining”. The search results were complemented by replacing the keyword app with application in each search term. Filtering the search results for relevant publications showed one major group of publications around the topic of Android application security. These publications are largely centered around AndroZoo [4] and the Android Malware Genome Project [36]. To broaden the search scope and find datasets including source code, the search terms “android app” “source code” repository dataset were included.

Figure 1: Literature review process. Phase 1: search by keywords. Phase 2.1: filter relevant publications by title. Phase 2.2: filter relevant publications by abstract. Phase 3.1: list publications cited by search results. Phase 3.2: list publications citing search results. Phase 4.1: summarize each publication. Phase 4.2: describe datasets in table. Phase 4.3: categorize collections of Android apps.

Phase 2.1: Title filter

The search results at this point were filtered to exclude publications that are obviously out of scope for this review by looking at their titles.

Phase 2.2: Abstract filter

After reducing the scope by title, we read through the abstracts of all search results and filtered out those which do not create a dataset of Android applications. We looked for indicators that the paper actually gathers data on Android applications or uses a dataset to study Android apps. Only in the former case did we include the publication in our set of relevant work. In the latter case, we did not deem the paper itself relevant to our review but included it in the snowballing phase to find further links to existing datasets.

If the filter of Phase 2 yielded new results, Phase 3 was revisited. Otherwise, the collection phases were concluded and we continued with Phase 4.

Phase 3.1: Cited publications

In the next step, we followed links from the new papers collected so far to find relevant publications which are cited by them. This allowed us to find previous works which the authors of already identified publications deem relevant to the subject.

Phase 3.2: Citing publications

We also searched for publications which refer to papers already in our set of relevant works. While looking at cited publications gives a glance into the past of related literature, searching for articles which cite already known papers reveals work published after these papers.

This new list of candidate articles was then fed into the filtering process (Phase 2).
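The loop between Phase 2 and Phase 3 can be read as a graph traversal that stops once no new candidates appear. The following is a minimal sketch under stated assumptions, not the tooling used for this review: the `classify` callback stands in for the manual title and abstract filter, and the paper identifiers and the two citation maps are hypothetical toy data.

```python
from collections import deque


def snowball(seeds, cited, citing, classify):
    """Collect dataset-creating papers by following citations both ways.

    classify(paper) returns "creates" (paper builds a dataset, kept as
    relevant), "uses" (paper only uses a dataset, followed but not kept),
    or "out" (out of scope, dropped).
    """
    relevant = set()
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        paper = queue.popleft()
        label = classify(paper)  # Phase 2: title and abstract filter
        if label == "out":
            continue
        if label == "creates":
            relevant.add(paper)
        # Phase 3.1 (cited publications) and Phase 3.2 (citing publications)
        for neighbour in cited.get(paper, []) + citing.get(paper, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return relevant


# Hypothetical toy citation graph.
cited = {"A": ["B"], "B": ["C"], "U": ["D"]}
citing = {"A": ["U"], "C": ["X"]}
labels = {"A": "creates", "B": "creates", "C": "creates",
          "U": "uses", "D": "creates", "X": "out"}

found = snowball(["A"], cited, citing, labels.get)
```

In this toy graph, paper U only uses a dataset, so it is not kept, but following its references still leads to the dataset paper D.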

Phase 4.1: Summaries

Phase 4 started after the data collection process was complete with 28 relevant publications and repeating phases 2 and 3 did not return any relevant new publications. We read all search results and briefly summarized them (cf. Section 3).

Phase 4.2: Tabular data overview

Data from these summaries was then processed into a table (cf. Appendix A).

Phase 4.3: Categories of datasets

Finally, we categorized datasets of Android apps which we found in the literature into (1) datasets which use data from app markets (cf. Section 3.1), (2) datasets providing executable APKs (cf. Section 3.2), and (3) datasets with access to source code based on F-Droid (cf. Section 3.3).

3 Datasets of Android Apps in the Literature

In the following sections we describe literature included in this survey which provides access to different levels of information. An overview of all included publications with relevant traits in tabular form can be found in Appendix A.

The information collected about Android applications may contain metadata from app stores, such as Google Play (Section 3.1). In Section 3.2 we list previous work that contains executable Android application packages. A directory of open-source Android apps is F-Droid. Datasets that provide access to source code and commit history are often based on it (Section 3.3). In Section 4, we reflect on the findings of this literature study and propose future directions to improve the state of Android app datasets. Finally, Section 5 is a summary of the common characteristics and problems of reviewed datasets.

3.1 Datasets of Market Data

Many interesting insights can be learned from data on application markets and aggregations of that data. Official app stores, such as Google Play, contain several million apps for the Android platform. Official and unofficial market places host executables and metadata generated by developers and users for each application. Data from Google Play can only be accessed through the public web interface and an undocumented API used by Android smartphones to manage app installations. Commercial databases exist that mirror metadata from Google Play and other app markets and sell access to this information (e.g., and [17]). Some of these commercial databases contain comprehensive metadata of millions of apps but they lack links to other resources, such as source code or executable artifacts.

Data from market places is widely used despite the difficulty of accessing it. Petsas et al. [32] monitored different Android app stores with a focus on popularity, pricing, and revenue of apps. They directly scraped information from the web interfaces of the market places in their study. Their findings indicate that 10 percent of the apps account for 70 to 90 percent of total downloads and that popularity of paid apps follows a power law distribution.
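To make the concentration figure concrete, the share of downloads held by the most popular apps can be computed from a list of per-app download counts. The sketch below is illustrative only: the download counts are synthetic (a Zipf-like series chosen for the example), not data from Petsas et al., and the function name is hypothetical.

```python
def top_share(downloads, fraction=0.10):
    """Return the share of all downloads held by the top `fraction` of apps."""
    ranked = sorted(downloads, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # number of apps in the top group
    return sum(ranked[:k]) / sum(ranked)


# Synthetic, Zipf-like download counts for 1,000 hypothetical apps.
downloads = [1_000_000 // rank for rank in range(1, 1001)]
share = top_share(downloads)
print(f"Top 10% of apps hold {share:.0%} of downloads")
```

Running the same function on scraped download counts per market would allow a direct comparison with the 70 to 90 percent range reported above.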

User reviews are another valuable data point from market places. Malavolta et al. [25] investigated users’ perception of hybrid apps by studying 11,917 free apps and their metadata from Google Play. They answered questions from both developers’ and users’ perspectives by combining user reviews and technical aspects in their study. In the data collection process they selected sample apps from the most popular apps of each category in Google Play. Grano et al. [14] also studied user reviews, albeit from a different source: They built a dataset of 288,065 user reviews for 395 applications sourced from F-Droid. The dataset includes information from Google Play, as well as results from static analysis of the application packages. They labeled reviews with automated classifiers. Other studies use permissions of apps [30], API usage of apps [1, 23], descriptions [13], or times of updates of apps [27] from Google Play.

This wide field of research on data from application markets shows that app metadata, user reviews, and app binaries offer insights and are worth investigating. However, access restrictions and unstable APIs limit the usefulness of app stores for the research community.

3.2 Datasets of Executables

AndroZoo is an ongoing effort to gather executable Android applications from as many sources as possible and make them available for analysis. Allix et al. [4] created crawlers for several app stores to collect a comprehensive and up-to-date sample of executable Android app packages. The crawlers are customized for each app store to collect as many apps as possible. At the same time, the authors took measures to minimize the load on the market places they crawl to avoid losing access and jeopardizing the long-term integrity of the dataset. The sources from which AndroZoo draws include the major market places Google Play, Anzhi, and AppChina, as well as the smaller directories 1mobile, AnGeeks, Slideme, ProAndroid, HiApk, and F-Droid. The applications from these app stores were complemented with additional artifacts from torrents distributed peer-to-peer and from the Android Malware Genome Project [36]. Candidate apps are downloaded by dedicated crawlers for each source, and each download is recorded with a unique identifier and a checksum of the file for deduplication. Most crawlers are based on the scrapy framework. However, Allix et al. created dedicated software to overcome restrictions of Google Play, e.g., an undocumented API, rate limits, and the need for an Android device. A central dispatcher spreads the workload to download agents in several locations and over different protocols. With this setup it was possible to eliminate the backlog of old applications. Subsequently, fewer agents were necessary to keep up with new additions to Google Play. A web service is tasked with organizing and storing received APKs. This unit also handles authentication for downloads of the dataset and publicly displays statistics. When creating AndroZoo, Allix et al. encountered several data collection challenges. They list unexpected downtime of markets, HTML instability, monitoring of crawlers, protocol changes, and information loss.
Overall, the authors were able to collect more than three million Android applications initially. The current count is more than five million [20]. The majority of these apps stem from Google Play, Anzhi, and AppChina, with the other market places contributing a much lower number. The dataset is available for download for the research community as a regularly updated list of APKs. This list contains SHA256 hashes as identifiers and additional metadata, such as compilation date, malware status, package name, version, etc. Individual apps can be downloaded with the SHA256 hash as index. One defining feature of AndroZoo is that all apps in the dataset are tested for malware by over 60 security products hosted by VirusTotal. Allix et al. report that 22 percent of apps in Google Play are flagged as malware by at least one product while 50 percent or more are found to be malware in the two major Chinese market places. When counting only APKs which at least ten security products recognize as malware, this number drops to around 1 percent in Google Play and to 33 percent and 17 percent in Anzhi and AppChina, respectively. All samples of the Android Malware Genome Project are successfully recognized by at least 10 antivirus products. The dataset lends itself to security research since the metadata of all samples contains the malware detection status. Examples of such research based on AndroZoo are [2, 3, 5, 16]. Other uses leverage the fact that the dataset contains several versions of many apps [15] and the availability of compiled bytecode [21]. AndroZoo also contains many Android applications which are not marketed in Google Play. This facilitates analysis of marketed and non-marketed apps [31]. Limitations of AndroZoo mostly stem from the fragility of the data collection process. Collecting was not continuous but was rather resumed irregularly when issues occurred. Additionally, some app market maintainers have blocked crawlers and thus caused outages and incomplete sets of data.
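The two malware figures above differ only in the detection threshold applied to each APK. The sketch below illustrates that computation together with checksum-based deduplication. It is a sketch under assumptions: the record layout (`sha256`, `market`, `vt_detection`) is a simplification chosen for this example rather than the actual AndroZoo list format, and the records (including the shortened placeholder hashes) are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ApkRecord:
    sha256: str        # checksum used as identifier (placeholders below)
    market: str        # market the APK was collected from
    vt_detection: int  # number of security products flagging the APK


def deduplicate(records):
    """Keep the first record seen for each checksum."""
    unique = {}
    for record in records:
        unique.setdefault(record.sha256, record)
    return list(unique.values())


def flagged_share(records, market, min_detections):
    """Share of a market's APKs flagged by at least `min_detections` products."""
    in_market = [r for r in records if r.market == market]
    if not in_market:
        return 0.0
    flagged = sum(1 for r in in_market if r.vt_detection >= min_detections)
    return flagged / len(in_market)


# Invented records; "aa" appears twice to exercise deduplication.
records = [
    ApkRecord("aa", "play", 0),
    ApkRecord("bb", "play", 1),
    ApkRecord("cc", "play", 12),
    ApkRecord("dd", "play", 0),
    ApkRecord("aa", "play", 0),
]
unique = deduplicate(records)
at_least_one = flagged_share(unique, "play", 1)
at_least_ten = flagged_share(unique, "play", 10)
```

Raising `min_detections` trades recall for precision: a single flagging product may be a false positive, whereas agreement among ten products is a much stronger signal.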

Another dataset of Android applications is the Android Malware Genome Project [36]. Zhou and Jiang collect samples of malicious Android apps from August 2010 to October 2011 to advance understanding of malware on mobile platforms. They present a dataset of 1,260 apps in 49 different malware categories. Furthermore, the authors analyze and characterize the collected malware samples to trace behavior and major outbreaks of certain types. Zhou and Jiang report that most of the samples are repackaged versions of legitimate applications containing malicious payload. Another vector for infecting Android devices are update attacks and drive-by downloads. Types of malware include root-level exploits, botnet clients, incurring costs through calling or messaging to premium-rate numbers, and harvesting of users’ information. In their evolution-based study, Zhou and Jiang describe how Android malware rapidly evolves. Thus, malware authors are able to keep ahead of existing anti-malware solutions through application of sophisticated obfuscation and evasion techniques. The project allows studying of generations and classes of malware but does not link these artifacts with source code or version control data. The authors stopped sharing their data after graduation in 2015.

Recently, Meng et al. [28] published AndroVault, a knowledge graph of information on over five million Android apps. Since 2013, applications have been crawled from 33 different sources including Google Play and F-Droid. The tool computes 36 attributes for each app based on downloaded APKs and descriptions. Resulting data from static and dynamic analysis is combined in a knowledge graph with fast access. Entities in this knowledge graph are heuristically clustered and correlated by attributes. This facilitates easier selection and sampling of relevant apps by certain traits to research specific kinds of Android applications. AndroVault has already proven a useful dataset for research such as malware detection.

One large user of datasets of Android application packages is the security research community, e.g., for evaluation of malware detection systems [6, 24, 35]. Malware detection necessarily needs to work on compiled artifacts because that is the form in which malware is installed on devices and for which detection is possible. Datasets of executables are therefore well suited for studying malicious software and training detection systems. However, Android application packages are not a substitute for source code and project management data, such as issue trackers and code review.

3.3 Datasets Based on F-Droid

So far, all described datasets rely on Google Play or similar market places as seeds. This limits the available types of information to market metadata, executable packages, and what can be statically or dynamically inferred from the APK files. In order to enable research that relies on access to source code, data from application markets needs to be linked to additional information. One data source that provides access to source files is F-Droid, a directory of open-source Android applications. All apps listed in this directory are compiled from source and code repositories are publicly linked.

In 2013, Minelli and Lanza [29] analyzed Android apps from F-Droid and reported notable findings, such as little use of inheritance and heavy reliance on external APIs. Freiling et al. [11] use 240 randomly selected apps from F-Droid to evaluate obfuscation transformations.

Bao et al. [8] collected 468 commits from 154 GitHub repositories of Android apps starting from 1,273 apps on F-Droid. They categorized energy-aware commits in six buckets, corresponding to common power management techniques applied by developers. They found that types of power management related changes differ between Android apps of different app store categories.

Lamba et al. [19] extensively describe F-Droid and used 1,120 apps from the app directory to analyze software use for Android applications. They downloaded the latest version of the source code of all collected apps and ran their analysis on 87,478 Java files with 17.2 million lines of code. Corral and Fronza [9] manually combined data from F-Droid, Google Play, and any available source code repository for 100 apps to compare source code quality with market success. They report that source code quality has a marginal impact on market success.

Nayebi et al. [31] analyzed 1,844 applications from F-Droid and found 69 apps that matched their search criteria. They linked this data to GitHub repositories and Google Play listings for further analysis of release cycles. “A Dataset of Open-Source Android Applications” [17] was similarly generated with F-Droid as starting point. The dataset contains 1,179 entries and links to source code repositories and information gleaned from static analysis of binary artifacts. It additionally contains version control information, such as commit messages and authorship. Unfortunately, the website hosting the dataset seems to be defunct.

For a follow-up study, Krutz et al. [18] extended this data for detailed analysis of app permissions. They searched F-Droid for applications with source repositories on GitHub to find out how and by whom permissions of applications are modified. To that end, they traced changes to Android manifest files through commit history and analyzed traits of developers who perform these changes.
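Tracing permission changes through commit history comes down to comparing the `uses-permission` entries of two versions of a manifest file. The following is a minimal sketch of that comparison, not the tooling of Krutz et al.; the function names and the two manifest snippets are hypothetical, and in practice the two versions would be read from consecutive commits.

```python
import xml.etree.ElementTree as ET

# XML namespace used for android:* attributes in manifest files.
ANDROID_NS = "http://schemas.android.com/apk/res/android"


def permissions(manifest_xml):
    """Return the set of permission names requested in a manifest."""
    root = ET.fromstring(manifest_xml)
    name_attr = f"{{{ANDROID_NS}}}name"
    return {
        element.get(name_attr)
        for element in root.iter("uses-permission")
        if element.get(name_attr)
    }


def permission_changes(old_xml, new_xml):
    """Return (added, removed) permissions between two manifest versions."""
    old, new = permissions(old_xml), permissions(new_xml)
    return sorted(new - old), sorted(old - new)


# Hypothetical manifest versions from two consecutive commits.
OLD = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
</manifest>"""

NEW = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.app">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-permission android:name="android.permission.CAMERA"/>
</manifest>"""

added, removed = permission_changes(OLD, NEW)
```

Combining the result with the author of the later commit would give the kind of "who changed which permission" record that such a study needs.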

Das et al. [10] seeded their dataset from various sources in order to achieve wider coverage of available apps. Next to F-Droid, they also included open-source applications listed on Wikipedia and they searched for links from Readme files of GitHub repositories to Google Play pages. In total they found 2,443 open-source Android apps with source code on GitHub. Access to version control data allowed them to investigate performance related commits by looking at commit messages stored in Git. In summary, their dataset not only contains links to F-Droid with executable APKs, but also references to source code on GitHub and additional metadata on Google Play.
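Finding links from Readme files to Google Play pages can be approximated with a pattern match on the store URL. The sketch below assumes the common `details?id=<package>` URL form; the regular expression, function name, and sample Readme text are illustrative and are not taken from Das et al.

```python
import re

# A package name: dot-separated segments, each starting with a letter.
PACKAGE = r"[A-Za-z][A-Za-z0-9_]*(?:\.[A-Za-z][A-Za-z0-9_]*)+"
PLAY_LINK = re.compile(
    r"https?://play\.google\.com/store/apps/details\?id=(" + PACKAGE + ")"
)


def play_package_ids(readme_text):
    """Return the distinct package names linked from a Readme, sorted."""
    return sorted(set(PLAY_LINK.findall(readme_text)))


# Hypothetical Readme excerpt.
readme = (
    "Get it on [Google Play]"
    "(https://play.google.com/store/apps/details?id=org.example.notes&hl=en).\n"
    "Direct link: https://play.google.com/store/apps/details?id=org.example.notes."
)
ids = play_package_ids(readme)
```

A repository whose Readme yields exactly one package name is a strong candidate for linking source code to a store listing, while several names may indicate a library or a collection of apps.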

Tufano et al. [34] manually analyzed 9,164 commits from Git repositories to investigate how bad programming practices are introduced. Android app source code is one of the three fields they study. Their dataset includes 70 apps sourced from F-Droid. Similarly, Stojkovski [33] created a dataset of 865 Android applications sourced from F-Droid to study software quality metrics. Stojkovski also considered Sourceforge but did not use it for lack of automated access to Android applications. As mentioned above, Grano et al. [14] mined user reviews from Google Play for a list of apps from F-Droid. The generated dataset contains 395 apps in around 600 versions.

By resorting to F-Droid as source of Android applications, researchers utilize links to Google Play and especially to source code repositories. F-Droid only lists open-source apps and providing source code is inherent to the platform. This allows researchers to use source code in their analysis and even version control data, such as commit messages and contents.

A drawback of using F-Droid over other market places is that it only contains 2,697 applications (as of March 12, 2018) and excludes apps which are not freely licensed. The number of apps listed on F-Droid is orders of magnitude smaller than on closed-source market places, foremost Google Play. However, what F-Droid lacks in numbers is compensated by the links to source code repositories with version-controlled source code, change reviews, and bug trackers. F-Droid therefore is a valuable source for mining Android apps.

3.4 Datasets of Source Code without Reliance on F-Droid

However, access to source code of Android apps does not need to be restricted to applications in F-Droid. Linares-Vásquez et al. [22] try a different approach by directly searching GitHub repositories labeled as containing Java files for AndroidManifest.xml files. These manifest files are mandatory for and unique to Android apps. Therefore, they are a good search criterion to identify source code repositories containing code of Android applications. Linares-Vásquez et al. found 16,333 repositories with code for Android apps, which is a much higher number than the number of apps available on F-Droid.

Geiger et al. [12] use the same idea to initially search for manifest files and construct AndroidTimeMachine, a graph database of 8,431 Android apps which are both accessible on GitHub and Google Play. Their dataset links data from Google Play pages and GitHub repositories and includes metadata of all commits in one Neo4j graph.
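The manifest-based identification described above can be sketched for a locally cloned repository: look for AndroidManifest.xml files, read the package name, and derive candidate store links from it. This is a sketch under assumptions, not the pipeline of either study. It assumes the manifest carries a `package` attribute (projects may instead define the application ID in their build configuration), and the constructed links only resolve if the app is actually listed in the respective store.

```python
import os
import xml.etree.ElementTree as ET


def find_manifests(repo_root):
    """Yield paths of all AndroidManifest.xml files below repo_root."""
    for dirpath, _dirnames, filenames in os.walk(repo_root):
        if "AndroidManifest.xml" in filenames:
            yield os.path.join(dirpath, "AndroidManifest.xml")


def package_name(manifest_path):
    """Return the package attribute of the manifest root, or None."""
    return ET.parse(manifest_path).getroot().get("package")


def store_links(package):
    """Construct candidate store URLs from a package name."""
    return {
        "google_play": f"https://play.google.com/store/apps/details?id={package}",
        "f_droid": f"https://f-droid.org/packages/{package}/",
    }
```

A repository with no manifest can be discarded early, and one with several manifests (for example library modules next to the app module) needs an additional rule for picking the app's own package name.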

4 Reflections

This literature review found several datasets of Android applications. They collect and provide executables, market and distribution data, source code, and even analysis results in various forms of detail.

App store data

One common problem faced by many datasets is the lack of documented access to data from app stores, especially Google Play. Google does not provide a public API and other market places actively block crawlers from collecting data. Tools do exist to gather app store data from many sources but they rely heavily on regular maintenance and updates to keep working. Future work could include creating one dataset with comprehensive access to market place data to facilitate research on Android applications.

Updated data

Researchers have poured a lot of work into creating diverse datasets of Android apps. Information in these app datasets is capable of shedding light on interesting questions in the field of Android research. Unfortunately, many of these datasets have not received updates in years, and the information in them has turned stale. Researchers facing an ever-changing environment of application development cannot rely on these old datasets to perform current research. This leads to a gap in possible research since newer Android app datasets may not include similar information necessary to answer some research questions. Future efforts should be directed at updating existing datasets and setting up new datasets in such a way that they are easier to maintain and keep up to date. Releasing tooling to create a dataset is already a step in the right direction. Regularly performing the data collection process and making the results available in a versioned format or a timeline should be the next step.

Accessibility of data

Worse than the problem of outdated data is inaccessible data. Many datasets of Android applications have not been released publicly or authors stopped sharing them after some time. It is unfortunate to see that potentially useful data is not shared with the research community. Instead of re-creating datasets from scratch, building upon previous work and complementing existing data would benefit authors of both old and new publications. Therefore, researchers should make sure they share data in widely accessible formats and on open platforms to be independent of individual maintenance. Also including permanent links to data could help make data more easily accessible years after publication.

Source code

Previous studies and datasets provide different levels of access to data of Android applications. However, none of the datasets combines all potential data. Martin et al. [26] also highlight a key shortcoming of the literature in its current state: There are few mining tools and datasets which combine source code with application metadata from app stores and development tools for large sets of apps. One tool that combines access to all sources mentioned above is CALAPPA. To ease access to app market data, Avdiienko et al. [7] developed a toolchain for mining Android apps. It has modules for data retrieval from various sources. This design allows Avdiienko et al. to combine app metadata, user reviews, executables, and source code where applicable. Modules include crawlers and metadata analysis as well as static program analysis and post-processing. CALAPPA can retrieve source code for Android apps limited to those listed on F-Droid but does not seem to be publicly available. Some datasets have increased the number of Android applications for which source code is available. Unfortunately, this number is still low and the sample of apps is likely biased. Finding additional means to get access to source code should be on the agenda for future work.

Combining existing data

Finally, future research could benefit more from existing datasets if the information contained in them could be related to information in other datasets. Various efforts have been undertaken to gather, process, and present relevant data. The information on Android apps in different datasets is complementary. New insights could be gained from combining datasets and drawing connections between the existing data points. Future work could facilitate this kind of research by creating a meta-dataset which links data on Android applications in existing datasets.
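One way to picture such a meta-dataset is as an index keyed by an identifier that the individual datasets share. The sketch below assumes the package name serves as that key, which is an assumption of this example rather than a property guaranteed by the reviewed datasets; the dataset names and records are invented.

```python
def link_datasets(**datasets):
    """Merge per-dataset records, keyed by package name, into one index.

    Each keyword argument maps a dataset name to a dict of
    {package_name: attributes}.
    """
    merged = {}
    for dataset_name, records in datasets.items():
        for package, attributes in records.items():
            merged.setdefault(package, {})[dataset_name] = attributes
    return merged


# Invented records from two hypothetical datasets.
market = {
    "org.example.notes": {"rating": 4.2},
    "org.example.maps": {"rating": 3.9},
}
source = {
    "org.example.notes": {"repo": "https://github.com/example/notes"},
}

merged = link_datasets(market=market, source=source)
in_both = sorted(pkg for pkg, parts in merged.items() if len(parts) == 2)
```

Apps that appear in only one dataset remain in the index, so the same structure also shows where coverage is missing.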

5 Conclusions

Researchers of Android applications have a vast amount of data at hand. There are already many datasets containing executable artifacts. App store metadata is plentiful and public albeit difficult to access. Many studies report this problem, especially in accessing data from Google Play. However, insight into source code is limited because the vast majority of apps is proprietary. Several studies tried to gather and combine source code with other app metadata.

Datasets of app store data and executables have the advantage that they are independent of the licensing of the application source code. Data from marketplaces can be scraped for free while APK archives can be downloaded from app stores. On the other hand, source code for proprietary applications is to a large extent not available at all. Having both a comprehensive dataset of (almost) all available apps, as with AndroZoo, and access to source code is unfortunately not reconcilable.

Appendix A Survey results

Year Summary Data gathered Number of apps Source code Commit history Executables Google Play link F-Droid link Sourced from Remarks
Appannie Commercial data aggregation ongoing 14+ million no no no yes no “All major app stores”
AppBrain Commercial app meta data database (Play mirror) ongoing 3,749,507 no no no yes no Google Play
AppZoom Commercial app review and analysis ongoing ? no no no yes no Google Play, Apple Store
Zhou and Jiang [36] 2012 Selection of malware. 2010 to 2011 1,260 no no yes no no Security announcments, publications from anti virus vendors and researchers. Stopped sharing their data in 2015.
Aafer et al. [1] 2013 Static analysis of APKs July 2012 around 20,000 no no yes partially no McAffee, [36], Google Play
Minelli and Lanza [29] 2013 Static analysis of source code 2013 (?) 20 yes yes no yes yes F-Droid
Petsas et al. [32] 2013 Monitoring of metrics on app stores. Mar – Aug 2012 300,000 no no no no no SlideMe, 1Mobile, AppChina, Anzhi Sources selected for accurate number of downloads reported.
Zheng et al. [35] 2013 Signature based analytics ongoing 150,368 no no yes no no Google Play and other app stores, malware forums.
Arp et al. [6] 2014 Vector based analytics Aug 2010 – Oct 2012 123,453 no no yes partially no Google Play, Chinese and Russion app stores, malware forums, [36]
Gorla et al. [13] 2014 Signature based anomaly detection Winter and Spring 2013 32,136 no no yes yes no Google Play
Linares-Vásquez et al. [23] 2014 Detect energy optimizations from usage patterns 2014 (?) 55 no no no yes no Google Play
Lindorfer et al. [24] 2014 Automated dynamic and static analysis 2012 – 2015 1,034,999 no no yes yes no submissions, malware feeds discontinued
Moonsamy et al. [30] 2014 Fingerprinting of permissions Aug 2010 – Oct 2011 1,227 no no yes no no SlideME, Pandaapp Used [36] as complementary set of malware.
Corral and Fronza [9] 2015 Relating source code quality to market success 2013 100 yes no no yes yes F-Droid
Freiling et al. [11] 2015 Evaluation of obfuscation transformations 2015 (?) 240 no no yes no yes F-Droid
Krutz et al. [17] 2015 Collection and static analysis of open source Android applications with metadata and commit history 2015 (?) 1,179 yes yes yes no yes F-Droid Open Source only. Website hosting dataset seems to be defunct.
Lamba et al. [19] 2015 Static analysis on source code July 2014 1,120 yes no no no yes F-Droid Extensive description of F-Droid
Linares-Vásquez et al. [22] 2015 Survey among developers on performance issues 2015 (?) 485 yes yes no no no GitHub Identifies Android apps in GitHub repositories by manifest file.
Malavolta et al. [25] 2015 Study of users’ perception of hybrid apps Nov 2014 11,917 no no no yes no Google Play
Tufano et al. [34] 2015 Identify bad programming practices from commit history 2015 (?) 70 yes yes no no yes F-Droid Next to Android apps, also Apache and Eclipse projects are studied.
Allix et al. [4] 2016 Collection of APKs for analysis. ongoing 5,842,525 no no yes yes (link can be constructed from the package name if the app is available on Google Play) yes (link can be constructed from the package name if the app is available on F-Droid) Various app markets, torrents, [36]. The collection is still growing. Apps are labeled with the markets they are found on.
Avdiienko et al. [7] 2016 Scraping tool to combine data about Android apps from various sources yes* yes* yes* yes* yes* Google Play, F-Droid (* depending on crawler module and source)
Bao et al. [8] 2016 Identify power management activities from Git commits 2016 (?) 1,273 yes yes no no yes F-Droid, [17]
Das et al. [10] 2016 Study of performance related commits 2016 (?) 2,443 yes yes no yes yes F-Droid, Wikipedia, GitHub README files
McIlroy et al. [27] 2016 Study of update frequency of apps 2014 10,713 no no no yes no Google Play
Nayebi et al. [31] 2016 Analysis of app release cycles 2016 (?) 1,844 yes yes yes F-Droid
Grano et al. [14] 2017 Tracking of user feedback from reviews to changes 2017 (?) 395 yes no yes yes yes F-Droid, Google Play Includes 297,323 reviews
Krutz et al. [18] 2017 Static analysis of app permissions of apps in F-Droid. 2017 (?) 1,402 yes yes no yes yes F-Droid, GitHub
Meng et al. [28] 2017 Knowledge graph from results of static and dynamic analysis since 2013 5,000,000 no no yes partially partially 28 app stores including Google Play and F-Droid
Stojkovski [33] 2017 Thesis on various software metrics for Android apps 2014 – 2017 (?) 865 yes yes no no yes F-Droid Did not source from SourceForge for lack of scalable access to Android apps
Geiger et al. [12] 2018 Graph database combining metadata on Google Play and GitHub with commit history 2017 – 2018 8,431 yes yes no yes no GitHub, Google Play
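
Two remarks in the table describe mechanics that a short sketch may make concrete: identifying Android apps in repositories by their manifest file, and constructing store links from a package name. The following Python sketch is an illustration under stated assumptions, not the method of any surveyed study. The URL patterns, the function names `package_from_manifest` and `candidate_links`, the simplified package-name check, and the sample manifest are all assumptions introduced here.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import quote

# Assumed URL patterns of the public store pages (not stated in the survey).
GOOGLE_PLAY_URL = "https://play.google.com/store/apps/details?id={package}"
FDROID_URL = "https://f-droid.org/packages/{package}/"

# Simplified, hypothetical check: at least two dot-separated segments,
# each starting with a letter and containing letters, digits or underscores.
PACKAGE_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*(\.[A-Za-z][A-Za-z0-9_]*)+")


def package_from_manifest(manifest_xml: str) -> Optional[str]:
    """Return the `package` attribute of the <manifest> root element, or None."""
    try:
        root = ET.fromstring(manifest_xml)
    except ET.ParseError:
        return None  # not well-formed XML
    if root.tag != "manifest":
        return None  # some other XML file
    package = root.get("package")
    if package is None or PACKAGE_RE.fullmatch(package) is None:
        return None
    return package


def candidate_links(package: str) -> dict:
    """Build candidate store links; they resolve only if the app is listed."""
    escaped = quote(package, safe="")
    return {
        "google_play": GOOGLE_PLAY_URL.format(package=escaped),
        "f_droid": FDROID_URL.format(package=escaped),
    }


# Hypothetical manifest, as it might appear in a repository.
SAMPLE_MANIFEST = """<manifest xmlns:android="http://schemas.android.com/apk/res/android"
    package="org.example.notes">
    <application android:label="Notes" />
</manifest>"""

package = package_from_manifest(SAMPLE_MANIFEST)
links = candidate_links(package) if package else {}
print(package)
for store, url in links.items():
    print(store, url)
```

A constructed link only resolves if the app is actually listed in the respective store, so each link is a candidate that would still have to be checked against the store itself.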


References

  • [1] Yousra Aafer, Wenliang Du, and Heng Yin. DroidAPIMiner: Mining API-level features for robust malware detection in Android. In International Conference on Security and Privacy in Communication Systems, pages 86–103. Springer, 2013.
  • [2] Kevin Allix, Tegawendé F. Bissyandé, Quentin Jérome, Jacques Klein, and Yves Le Traon. Empirical assessment of machine learning-based malware detectors for Android. Empirical Software Engineering, 21(1):183–211, 2016.
  • [3] Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. Are Your Training Datasets Yet Relevant? In International Symposium on Engineering Secure Software and Systems, pages 51–67. Springer, 2015.
  • [4] Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. AndroZoo: collecting millions of Android apps for the research community. pages 468–471. ACM Press, 2016.
  • [5] Kevin Allix, Quentin Jerome, Tegawende F. Bissyandé, Jacques Klein, Radu State, and Yves Le Traon. A Forensic Analysis of Android Malware–How is Malware Written and How it Could Be Detected? In Computer Software and Applications Conference (COMPSAC), 2014 IEEE 38th Annual, pages 384–393. IEEE, 2014.
  • [6] Daniel Arp, Michael Spreitzenbarth, Malte Hubner, Hugo Gascon, Konrad Rieck, and CERT Siemens. DREBIN: Effective and Explainable Detection of Android Malware in Your Pocket. In NDSS, volume 14, pages 23–26, 2014.
  • [7] Vitalii Avdiienko, Konstantin Kuznetsov, Paolo Calciati, Juan Carlos Caiza Román, Alessandra Gorla, and Andreas Zeller. CALAPPA: a toolchain for mining Android applications. In Proceedings of the International Workshop on App Market Analytics, pages 22–25. ACM, 2016.
  • [8] Lingfeng Bao, David Lo, Xin Xia, Xinyu Wang, and Cong Tian. How Android App Developers Manage Power Consumption?-An Empirical Study by Mining Power Management Commits. In Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on, pages 37–48. IEEE, 2016.
  • [9] Luis Corral and Ilenia Fronza. Better Code for Better Apps: A Study on Source Code Quality and Market Success of Android Applications. In Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems, MOBILESoft ’15, pages 22–32, Piscataway, NJ, USA, 2015. IEEE Press.
  • [10] Teerath Das, Massimiliano Di Penta, and Ivano Malavolta. A Quantitative and Qualitative Investigation of Performance-Related Commits in Android Apps. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on, pages 443–447. IEEE, 2016.
  • [11] Felix C. Freiling, Mykola Protsenko, and Yan Zhuang. An Empirical Evaluation of Software Obfuscation Techniques Applied to Android APKs. In Jin Tian, Jiwu Jing, and Mudhakar Srivatsa, editors, International Conference on Security and Privacy in Communication Networks, volume 153, pages 315–328. Springer International Publishing, Cham, 2015.
  • [12] Franz-Xaver Geiger, Ivano Malavolta, Luca Pascarella, Fabio Palomba, Dario Di Nucci, and Alberto Bacchelli. A Graph-based Dataset of Commit History of Real-World Android apps. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR, New York, NY, May 2018. ACM.
  • [13] Alessandra Gorla, Ilaria Tavecchia, Florian Gross, and Andreas Zeller. Checking app behavior against app descriptions. In Proceedings of the 36th International Conference on Software Engineering, pages 1025–1035. ACM, 2014.
  • [14] Giovanni Grano, Andrea Di Sorbo, Francesco Mercaldo, Corrado A. Visaggio, Gerardo Canfora, and Sebastiano Panichella. Android apps and user feedback: a dataset for software evolution and quality improvement. In Proceedings of the 2nd ACM SIGSOFT International Workshop on App Market Analytics, pages 8–11. ACM, 2017.
  • [15] Geoffrey Hecht, Omar Benomar, Romain Rouvoy, Naouel Moha, and Laurence Duchien. Tracking the software quality of android applications along their evolution (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pages 236–247. IEEE, 2015.
  • [16] Médéric Hurier, Kevin Allix, Tegawendé F. Bissyandé, Jacques Klein, and Yves Le Traon. On the lack of consensus in anti-virus decisions: Metrics and insights on building ground truths of android malware. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 142–162. Springer, 2016.
  • [17] Daniel E. Krutz, Mehdi Mirakhorli, Samuel A. Malachowsky, Andres Ruiz, Jacob Peterson, Andrew Filipski, and Jared Smith. A dataset of open-source Android applications. In Proceedings of the 12th Working Conference on Mining Software Repositories, pages 522–525. IEEE Press, 2015.
  • [18] Daniel E. Krutz, Nuthan Munaiah, Anthony Peruma, and Mohamed Wiem Mkaouer. Who Added That Permission to My App? An Analysis of Developer Permission Changes in Open Source Android Apps. pages 165–169. IEEE, May 2017.
  • [19] Yash Lamba, Manisha Khattar, and Ashish Sureka. Pravaaha: Mining Android applications for discovering API call usage patterns and trends. In Proceedings of the 8th India Software Engineering Conference, pages 10–19. ACM, 2015.
  • [20] Li Li. Mining androzoo: A retrospect. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on, pages 675–680. IEEE, 2017.
  • [21] Li Li, Alexandre Bartel, Tegawendé F. Bissyandé, Jacques Klein, Yves Le Traon, Steven Arzt, Siegfried Rasthofer, Eric Bodden, Damien Octeau, and Patrick McDaniel. Iccta: Detecting inter-component privacy leaks in android apps. In Proceedings of the 37th International Conference on Software Engineering-Volume 1, pages 280–291. IEEE Press, 2015.
  • [22] Mario Linares-Vasquez, Christopher Vendome, Qi Luo, and Denys Poshyvanyk. How developers detect and fix performance bottlenecks in android apps. In Software Maintenance and Evolution (ICSME), 2015 IEEE International Conference on, pages 352–361. IEEE, 2015.
  • [23] Mario Linares-Vásquez, Gabriele Bavota, Carlos Bernal-Cárdenas, Rocco Oliveto, Massimiliano Di Penta, and Denys Poshyvanyk. Mining energy-greedy api usage patterns in android apps: an empirical study. In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 2–11. ACM, 2014.
  • [24] Martina Lindorfer, Matthias Neugschwandtner, Lukas Weichselbaum, Yanick Fratantonio, Victor Van Der Veen, and Christian Platzer. Andrubis–1,000,000 apps later: A view on current Android malware behaviors. In Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), 2014 Third International Workshop on, pages 3–17. IEEE, 2014.
  • [25] Ivano Malavolta, Stefano Ruberto, Tommaso Soru, and Valerio Terragni. Hybrid mobile apps in the google play store: An exploratory investigation. In Proceedings of the Second ACM International Conference on Mobile Software Engineering and Systems, pages 56–59. IEEE Press, 2015.
  • [26] William Martin, Federica Sarro, Yue Jia, Yuanyuan Zhang, and Mark Harman. A survey of app store analysis for software engineering. IEEE Transactions on Software Engineering, 43(9):817–847, 2017.
  • [27] Stuart McIlroy, Nasir Ali, and Ahmed E. Hassan. Fresh apps: an empirical study of frequently-updated mobile apps in the Google play store. Empirical Software Engineering, 21(3):1346–1370, 2016.
  • [28] Guozhu Meng, Yinxing Xue, Jing Kai Siow, Ting Su, Annamalai Narayanan, and Yang Liu. AndroVault: Constructing Knowledge Graph from Millions of Android Apps for Automated Analysis. arXiv preprint arXiv:1711.07451, 2017.
  • [29] Roberto Minelli and Michele Lanza. Software Analytics for Mobile Applications–Insights & Lessons Learned. In Software Maintenance and Reengineering (CSMR), 2013 17th European Conference on, pages 144–153. IEEE, 2013.
  • [30] Veelasha Moonsamy, Jia Rong, and Shaowu Liu. Mining permission patterns for contrasting clean and malicious android applications. Future Generation Computer Systems, 36:122–132, 2014.
  • [31] Maleknaz Nayebi, Homayoon Farrahi, and Guenther Ruhe. Analysis of marketed versus not-marketed mobile app releases. In Proceedings of the 4th International Workshop on Release Engineering, pages 1–4. ACM, 2016.
  • [32] Thanasis Petsas, Antonis Papadogiannakis, Michalis Polychronakis, Evangelos P. Markatos, and Thomas Karagiannis. Rise of the planet of the apps: A systematic study of the mobile app ecosystem. In Proceedings of the 2013 conference on Internet measurement conference, pages 277–290. ACM, 2013.
  • [33] Mile Stojkovski. Thresholds for Software Quality Metrics in Open Source Android Projects. Master’s thesis, NTNU, 2017.
  • [34] Michele Tufano, Fabio Palomba, Gabriele Bavota, Rocco Oliveto, Massimiliano Di Penta, Andrea De Lucia, and Denys Poshyvanyk. When and why your code starts to smell bad. In Software Engineering (ICSE), 2015 IEEE/ACM 37th IEEE International Conference on, volume 1, pages 403–414. IEEE, 2015.
  • [35] Min Zheng, Mingshen Sun, and John CS Lui. Droid analytics: A signature based analytic system to collect, extract, analyze and associate android malware. In Trust, Security and Privacy in Computing and Communications (TrustCom), 2013 12th IEEE International Conference on, pages 163–171. IEEE, 2013.
  • [36] Yajin Zhou and Xuxian Jiang. Dissecting Android Malware: Characterization and Evolution. pages 95–109. IEEE, May 2012.