Android devices have become increasingly popular in recent years. According to a recent statistical report, in the first quarter of 2018, 85.9% of all smartphones sold to end users ran the Android operating system. With the growing popularity of the Android platform, the threats from various unwanted apps, including malware and other potentially harmful apps, have become more serious. Such apps may leak users’ private information without consent, silently root users’ devices, stealthily send premium SMS messages, and so on, which undermines the dependability of the Android app ecosystem [3, 4, 5].
To ensure the security and privacy of Android users, some device manufacturers [6, 7, 8] collect audit logs from users’ systems. Specialized analysts use these logs to diagnose potential Android attacks, and then generate improved security policies or remove the detected malware from the regulated app markets. The more information analysts learn from the audit logs, the more likely they are to unveil underlying malicious intentions. Therefore, the precise and complete reconstruction of runtime app behaviors from the audit logs is one of the most critical problems that analysts face.
Many state-of-the-art techniques [9, 10, 11, 12, 7] have been proposed to assist analysts in reconstructing app behaviors based on runtime logs. For example, DroidScope reconstructs both the OS-level and Java-level semantics by instrumenting the virtual machine. CopperDroid leverages system call-related information to automatically reconstruct app behaviors. DroidForensic captures multi-layer forensic logs from the application level, Binder level and system-call level to reconstruct Android attacks. Although these schemes can reconstruct most real-world app behaviors, one problem remains. Specifically, the resource constraints of smartphones force system developers to record only significant logs, such as invocations of sensitive Android APIs, and make it impractical to enforce computation-heavy mechanisms, such as precise information-flow tracking [13, 10]. As a result, contextual information, such as the guarding conditions of sensitive actions [14, 15] and the information-flow transmission paths [3, 16], is unavailable in the reconstructed app behaviors. This information is valuable evidence for unveiling the intentions behind app behaviors.
We propose and implement DroidHolmes, a novel system that recovers contextual information around the gathered audit logs. DroidHolmes first gathers the runtime logs about a target app from the Android middleware, and then reconstructs a path matched with the logs from the app’s control-flow graph on a PC. The path is finally used to profile the contextual information of runtime app behaviors. DroidHolmes requires only a small number of logs to recover sufficient contextual information about app behaviors, while the overhead imposed on mobile devices is negligible. Our primary goal is to help improve the performance of existing analysis techniques, such as information-flow analysis and behavioral reconstruction, and to facilitate manual examination.
The major challenge in implementing DroidHolmes is the high computational complexity of log matching. It is caused by a coupling relation: when a node is matched, its successors become the candidates for matching subsequent logs. Specifically, multiple nodes are possible candidates when matching a log record produced by certain Android mechanisms, such as reflection or inter-component communication (ICC) [17, 18]. Worse, Android attackers may abuse these mechanisms to evade malware detection. In that case, the number of candidate paths increases exponentially. A straightforward solution is to record enough logs to decide each branch, but the runtime overhead is considerable. Existing static analysis tools [17, 18] cannot parse obfuscated or encrypted arguments in the above mechanisms.
We propose a divide-and-conquer algorithm to address this challenge. Our key idea is to leverage the information in the call stack to decompose the coupling relation among consecutive logs. Specifically, this information can be used to position each log record individually. The problem can therefore be divided into independent sub-problems, which reduces the computational complexity of log matching.
Our main contributions are summarized as follows:
We implement a novel system named DroidHolmes to recover the contextual information of behaviors in Android apps around a limited quantity of logs.
We propose a divide-and-conquer algorithm to achieve efficient log matching.
The rest of the paper proceeds as follows: Section II introduces a motivating example for our system. Section III presents the architecture of DroidHolmes. Section IV describes the design of DroidHolmes. In Section V, we evaluate DroidHolmes using open-source test suites and real-world apps. Section VI discusses limitations and future work. Section VII reviews related work, and we conclude in Section VIII.
II Motivating Example
To motivate our work, Figure 1 presents a simplified control-flow graph (CFG) extracted from a real-world app. The graph shows that the app’s developer uses many reflection strategies to hide the real intention. However, we cannot identify the intention as malicious based only on the names of callsites (e.g., invoke()), because contextual information is lacking.
The contextual information provides analysts with a comprehensive behavioral profile, which is valuable evidence for disclosing the malice of a behavior. For instance, is control dependent on , and the semantics of the conditional statement in is to check whether the length of the string is longer than a constant value. It may be an anti-virtualization technique to detect the current environment. Then the arguments of some callsites in nodes (e.g., ) are obfuscated, which may be used to evade static analysis. After performing data-flow analysis, we also learn that there exists a data flow from to , where and correspond to getDeviceId() and execute() respectively. Based on this contextual information, we can speculate that the developer intentionally steals the user’s private data.
Although the contextual information of behaviors can be reconstructed by logging all actions of the app, the performance overhead on smartphones is considerable. Therefore, existing logging systems [9, 11, 12] record only a subset of the operations related to the user’s security and privacy, such as invoke() in the gray nodes of Figure 1. With this strategy, however, the reconstructed behaviors lack contextual information, which may reduce the accuracy of behavioral analysis.
Our goal is to recover contextual information around a limited quantity of logs for app behaviors. We notice that there exists a path on the app’s control-flow graph that matches the runtime logs. Along this path, the logged operations are ordered nodes connected by surrounding statements (i.e., the white nodes in Figure 1), where the contextual information is available. Therefore, DroidHolmes focuses on finding this path with a small number of logs.
III Architecture of DroidHolmes
Figure 2 depicts the overall architecture of DroidHolmes, which contains two modules:
Logging Module. The logging module records audit logs when a user interacts with apps on the smartphone. The module is deployed in the Android middleware to capture the invocations of specified Android APIs. Finally, the logs are retrieved from the user’s smartphone and sent to the auditing module together with the target app.
Auditing Module. The auditing module runs on a PC to recover contextual information around the audit logs produced by the logging module. It first analyzes the app’s bytecode to build graphs. On these graphs, it finds a path matched with the logs. To achieve efficient log matching, it performs a divide-and-conquer algorithm to individually position each log record on the graphs and decide the successors from possible candidates. Based on the path, existing analyses, such as data-flow analysis [3, 23, 13], trigger analysis [15, 14] and API misuse detection [24, 25], can be used to recover the contextual information.
IV System Design
IV-A Logging Implementation
In this section, we explain how to construct our audit logs for the subsequent log matching. The selection of logging points is elaborated in Section V.
The audit log is a time-stamped sequence of specified app operations. Since is always temporally ordered, we use the sequence without timestamps to model . A log record is a tuple , where is the description of the invoked API and is the calling context. The details are as follows.
contains two parts, namely the signature of the logged API and the used arguments. The signature is used to distinguish different app operations, and the arguments are used to decide the successor when the API relates to some Android mechanisms such as reflection and ICC.
is the key information to position each log record. As shown in Figure 3, when callstmt is logged, the structure of the current call stack is on the right. The calling context can be extracted from the stack. It includes the last K methods in the stack () and the depth of the stack (). With these elements, the search space for matching each log record is confined to a single method. Note that K is not equal to , because the methods at the bottom of the stack are invoked by the OS rather than the app. Moreover, recording all methods in the stack imposes considerable performance overhead on the smartphone. Therefore, in Section V, K is chosen to balance the efficiency of log matching and the runtime performance of the smartphone.
Before performing log matching, we need to remove the log records produced by code in Android libraries from the log sequence. These records have the same process ID as the records produced by the code of the apps under test, but no code within the apps matches them. They can be identified by the class name of the caller of each callsite. Specifically, the prefix of the class name belongs to a particular set, whose elements include com.android, android.os and so on. For instance, in Figure 3, CallerMethod is the caller of callstmt, and AppClass is the class name. AppClass does not belong to the set, so we preserve the log record about callstmt.
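The log record and the library filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LogRecord` field names and the helper functions are hypothetical, and the prefix set contains only the two examples named in the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Only the two prefixes named in the text; the full set is not given there.
LIBRARY_PREFIXES: Tuple[str, ...] = ("com.android", "android.os")


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical layout of one log record (API description + calling context)."""
    api_signature: str               # signature of the logged API
    arguments: Tuple[str, ...]       # arguments used in the call
    last_k_methods: Tuple[str, ...]  # last K methods on the runtime call stack
    stack_depth: int                 # depth of the runtime call stack
    caller_class: str                # class of the method containing the callsite


def is_library_record(record: LogRecord) -> bool:
    """True if the caller class is in a library package (match on package boundaries)."""
    name = record.caller_class
    return any(name == p or name.startswith(p + ".") for p in LIBRARY_PREFIXES)


def filter_app_records(records: Iterable[LogRecord]) -> List[LogRecord]:
    """Keep only the records produced by the app's own code, preserving order."""
    return [r for r in records if not is_library_record(r)]
```

Matching on package boundaries rather than on raw string prefixes is a design choice of this sketch; it avoids treating a package such as `com.androidx` as a library package by accident.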
IV-B Path Generation
IV-B1 Graph Construction
The graphs are the basis for DroidHolmes to find the path matched with the audit logs. However, unlike traditional console applications, which have main() as an entry point, Android apps wait for events from users (e.g., clicking a button), the system (e.g., low battery), or other apps to launch a specific callback method. Therefore, there is no common entry point for an app. Existing static analysis schemes [3, 18, 26] predefine the calling orders of callback methods in a dummy-main method, which is regarded as the entry point of an app. However, we find that the dummy-main method may miss some callback methods, and the sophisticated maintenance of callback models can be error-prone, as illustrated in Section V.
To solve the problem, DroidHolmes builds a separate supergraph for each callback method. The supergraph combines the CFGs of all methods that are reachable from the callback method via the callgraph. Our scheme does not miss any callback method, and can help to determine the calling order of callback methods for apps, as demonstrated in Section V.
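As a rough sketch of the per-callback construction, the set of methods whose CFGs would be combined into one supergraph can be computed by a reachability search over the callgraph. The dictionary-based callgraph and the function names below are assumptions made for illustration; the paper builds its graphs on Soot.

```python
from collections import deque
from typing import Dict, Iterable, Set


def reachable_methods(call_graph: Dict[str, Iterable[str]], callback: str) -> Set[str]:
    """Breadth-first search: all methods reachable from `callback` via call edges."""
    seen = {callback}
    queue = deque([callback])
    while queue:
        method = queue.popleft()
        for callee in call_graph.get(method, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def build_supergraph_scopes(
    call_graph: Dict[str, Iterable[str]], callbacks: Iterable[str]
) -> Dict[str, Set[str]]:
    """One scope per callback: the methods whose CFGs would form its supergraph."""
    return {cb: reachable_methods(call_graph, cb) for cb in callbacks}
```

Because every callback gets its own scope, no callback has to be ordered inside a dummy-main method before the logs are available.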
IV-B2 Log Matching
The key module of DroidHolmes finds a path matched with the logs on the constructed supergraphs. It is impractical to find the path by directly using a basic search algorithm (e.g., depth-first search or breadth-first search) because of the coupling relation among successive logs. For instance, as described in the case study of Section V, most of the logged nodes within the app’s supergraph are about the same app operation (i.e., invoke()). The basic methods cannot distinguish these operations, so in the worst case they traverse the nodes in all possible orders. Furthermore, the sequence of the nodes in the found path may not be identical to the runtime execution flow.
We propose a divide-and-conquer algorithm to find the path matched with the logs. The algorithm builds on depth-first search (DFS), where Algorithm 1 is crucial for positioning each log record individually using . Algorithm 1 takes four inputs: (1) the difference between and (i.e., ), (2) the last K methods in the runtime call stack (i.e., ), (3) a global stack that sequentially stores the visited methods during log matching (i.e., EmulatedCallStack), and (4) the length of EmulatedCallStack when the last log record is matched (i.e., Len). It outputs a search decision for the current node.
This algorithm works as follows. It first calculates the difference between Len and Len. Then, it computes the value of dis, which is the difference between and . Next, when dis 0, the current node can be visited if one of two conditions is satisfied: (1) dis .length() and the method sequence in EmulatedCallStack matches the method sequence in (Line 7), or (2) dis .length() (Line 11). If neither condition is satisfied, the algorithm stops visiting the current node (Lines 9 and 13).
Divide. The algorithm uses to confine the search range for each log record. However, there is a problem: is always larger than the length of EmulatedCallStack. Specifically, some methods at the bottom of the runtime call stack may be invoked by the OS rather than by apps for initialization tasks (e.g., com.android.internal.os.ZygoteInit). Although these methods do not relate to the app’s code, they are counted in . We notice that these methods are the same for two successive log records, so we calculate the difference to eliminate their interference (i.e., ). The value dis represents the difference in depth between the currently visited method and the target method in the call stack, where the target method contains the callsite of the log record being matched. Therefore, dis is used to guide our log matching as follows: (1) dis 0 means that the target method cannot be found by the forward search from the current node. (2) 0 dis .length() means that the target method has to be found along the method sequence in . (3) dis .length() means that, to find the target method, the first method in needs to be visited first by the DFS.
In the algorithm, depicts a method call sequence to find the target callsite of each log record. isMatched() is invoked to check whether the method sequence obtained from the top of EmulatedCallStack matches the method sequence in . Figure 4 shows matching examples for isMatched(). Figure 4(a) describes a matched case, where the algorithm needs to find CallerMethod from Method, and along with the method sequence in . Figure 4(b) is a mismatched case because Method does not match Method. It means that the method visiting sequence differs from the runtime execution flow. Therefore, the algorithm stops visiting the current node, and then performs the DFS to visit Method in the supergraph.
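One plausible reading of the isMatched() check is sketched below. It is an interpretation under stated assumptions rather than the paper's Algorithm 1: the recorded methods are assumed to be ordered from outermost to innermost (the last one contains the logged callsite), `dis` is assumed to be the number of frames still to descend before reaching the target method, and the name `is_matched` and its signature are hypothetical.

```python
from typing import Sequence


def is_matched(
    emulated_stack: Sequence[str],   # visited methods, bottom of stack first
    runtime_methods: Sequence[str],  # last K runtime methods, outermost first (assumed order)
    dis: int,                        # assumed: frames still to descend to the target method
) -> bool:
    """Check that the top of the emulated stack agrees with the recorded call sequence."""
    if dis < 0 or dis > len(runtime_methods):
        raise ValueError("dis must lie between 0 and the number of recorded methods")
    # Methods that should already be on the emulated stack.
    overlap = len(runtime_methods) - dis
    if overlap == 0:
        return True  # nothing recorded overlaps the emulated stack yet
    if overlap > len(emulated_stack):
        return False
    return list(emulated_stack[-overlap:]) == list(runtime_methods[:overlap])
```

Under this reading, a mismatch means that the DFS has entered a method that the runtime execution did not pass through, so the search can abandon the current node.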
Furthermore, to improve the efficiency of log matching, the log sequence is split into multiple segments. The logged callback methods are used to split the logs, so that each segment contains a method call sequence starting from a callback method. After that, DroidHolmes matches each log segment against the supergraph built for the same callback method. Using segmented logs reduces the computational complexity of log matching.
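The segmentation step can be illustrated with a short sketch. Records are modeled here as plain strings and the callback test is passed in as a predicate; both are simplifications chosen for illustration.

```python
from typing import Callable, List, Sequence


def split_by_callback(
    records: Sequence[str], is_callback: Callable[[str], bool]
) -> List[List[str]]:
    """Start a new segment at every logged callback method, preserving log order."""
    segments: List[List[str]] = []
    for record in records:
        if is_callback(record) or not segments:
            # A callback opens a new segment; records seen before any
            # callback are collected into a leading segment of their own.
            segments.append([record])
        else:
            segments[-1].append(record)
    return segments
```

For example, the log `onCreate, invoke, getDeviceId, onClick, invoke` with callbacks `onCreate` and `onClick` yields two segments, each of which can then be matched independently against the supergraph of its callback.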
Conquer. Once the position of each log record is confined to a single method, we adopt a straightforward method to find the target callsite. Specifically, we perform the DFS on the CFG of the method to match the signatures of the APIs invoked at callsites with the signature in . Although there may be many callsites that call the same API in a method, it is practical to decide which callsites match the logs, because the search space is small. Finally, the path consists of multiple path segments ordered by the logging sequence of callback methods.
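A minimal sketch of the search inside a single method follows, assuming the CFG is given as a successor map and each node is labeled with the API it invokes (if any). The data layout and the name `find_callsites` are hypothetical.

```python
from typing import Dict, List, Sequence


def find_callsites(
    cfg: Dict[str, Sequence[str]],  # node -> successor nodes within one method
    entry: str,                     # entry node of the method's CFG
    node_api: Dict[str, str],       # node -> signature of the API invoked there
    target_signature: str,          # signature taken from the log record
) -> List[str]:
    """Iterative DFS that collects every reachable callsite invoking the target API."""
    stack = [entry]
    seen = set()
    hits: List[str] = []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node_api.get(node) == target_signature:
            hits.append(node)
        # Push successors in reverse so they are visited in their listed order.
        stack.extend(reversed(cfg.get(node, ())))
    return hits
```

When several callsites in the method invoke the same API, the function returns all of them as candidates; choosing among them is cheap because the search is limited to one method.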
To ensure the accuracy of log matching, the successor has to be determined from possible candidates when Android mechanisms such as reflection and ICC are involved. The ICC model provides a message passing mechanism for data exchange among components, so attackers may abuse it to threaten user privacy. Existing static analysis techniques [26, 16] aim to enumerate all the possible pairs of senders and receivers before analyzing ICC leaks, which may produce false detection results. Reflection is widely used in real-world apps to hide real actions, but state-of-the-art static analysis tools cannot resolve it completely. Specifically, these static analysis tools [17, 16, 3, 23] cannot parse obfuscated, encrypted or dynamically assigned arguments. Harvester cannot extract the runtime values of all obfuscated arguments either, because it does not support ICC-based analysis. When reflective calls are not resolved correctly, the successors of these callsites cannot be decided.
To solve the above problems, DroidHolmes updates the supergraphs using the arguments recorded in the log records. After the update, DroidHolmes obtains the decided successor and continues log matching. Due to space restrictions, we only explain how to handle startActivityForResult() and invoke(), as shown in Figure 5. DroidHolmes can be extended to handle more cases of reflection and ICC.
In Figure 5(a), DroidHolmes adopts four steps to reconstruct ICC links for startActivityForResult(). It first queries Log Parser for an intent table (steps (1) and (2)). The intent table records a tuple , which indicates that the component has sent an intent to the component by calling the method . Based on the table, DroidHolmes connects the ICC callsite to the entry point of the target component and adds the return edge from finish() to the specified callback onActivityResult() (steps (3) and (4)).
Figure 5(b) describes five steps to update the supergraph for invoke(). DroidHolmes checks Log Parser to obtain the signature of the invoked method (steps (1) and (2)). To reconstruct the call relation of the method, there are two circumstances: (1) calling an API that is specified to be logged, and (2) calling a developer-defined method which may contain logged operations. We design different strategies for each. For (1), DroidHolmes inserts a new node after the node of invoke() (step (3)). The invocation statement within the new node explicitly calls the API. For (2), DroidHolmes first builds a sub-supergraph, or reuses one that has already been generated, whose entry point is the invoked method. DroidHolmes then adds the node as in (1), and inserts the sub-supergraph into the original supergraph by reconstructing the call edge and the return edge (steps (3), (4) and (5)).
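The two graph updates can be sketched on a successor-map representation of the supergraph. This is an illustrative simplification: the node names are hypothetical, only circumstance (1) of the invoke() handling is shown, and the real system operates on Soot's graph structures.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node -> successor nodes


def add_icc_edges(
    graph: Graph,
    icc_callsite: str,     # node calling startActivityForResult()
    target_entry: str,     # entry point of the target component
    finish_node: str,      # node calling finish() in the target component
    result_callback: str,  # entry of onActivityResult() in the sender
) -> None:
    """Connect the ICC callsite to the target entry and add the return edge."""
    graph.setdefault(icc_callsite, []).append(target_entry)
    graph.setdefault(finish_node, []).append(result_callback)


def insert_after_invoke(graph: Graph, invoke_node: str, new_node: str) -> None:
    """Insert a node that explicitly calls the resolved API right after invoke()."""
    graph[new_node] = list(graph.get(invoke_node, []))  # inherit old successors
    graph[invoke_node] = [new_node]                      # invoke() now leads to it
```

After either update, the successor of the logged callsite is explicit in the graph, so log matching can continue without enumerating candidate targets.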
IV-C Context Extraction
IV-C1 Contextual Information
Existing analysis tools can be used to extract contextual information from the result of DroidHolmes. Specifically, the contextual information includes information-flow paths [3, 13, 29], guarding conditions, API usage, app behavior contraction and so on. The extracted contexts are not only evidence for profiling app behaviors, but also important features for some machine learning-based systems [24, 15, 31, 25].
IV-C2 Improvements for Existing Tools
DroidHolmes can help to improve the performance of existing analysis tools in the following ways: entry-point reconstruction, graph refinement and apk reinforcement. We evaluate these improvements in Section V.
Entry-point Reconstruction. The result of DroidHolmes can be used to reconstruct a new entry point. Unlike the dummy main built by existing tools [3, 18, 26], the new entry point prunes the redundant callback transitions that are not covered at runtime, and loses no callback method. Therefore, with the help of the new entry point, the efficiency and accuracy of the behavioral analysis (e.g., data-flow tracking) performed by existing tools can be improved.
Graph Refinement. The matched path generated by DroidHolmes is refined from the originally built supergraphs. Specifically, the nodes and edges that are unrelated to the runtime execution flow are removed from the path, and DroidHolmes inserts additional nodes and edges to complement the call relations of Android mechanisms such as reflection and ICC. Therefore, static analysis tools can obtain more precise detection results on the path.
Apk Reinforcement. DroidHolmes can generate a reinforced apk to facilitate more analysis tools. The graph construction and refinement are implemented on the Soot framework. However, some tools are implemented upon Argus-SAF, WALA and so on, and cannot directly leverage the matched path for their own analysis. To support these tools, DroidHolmes can output a reinforced apk according to the path. Specifically, DroidHolmes uses the interfaces provided by DroidRA and IccTA to complement the statements about reflection and ICC respectively, and removes the statements that are excluded by the path.
V Experiments and Evaluation
To evaluate the effectiveness of DroidHolmes, we seek to answer the following questions:
Which operations need to be logged in DroidHolmes? What value should K in the calling context take? What is the runtime overhead of the logging module?
How does DroidHolmes help to improve the performance of existing analysis tools?
How much contextual information can be recovered via DroidHolmes for real-world apps? How does the auditing module perform?
We implement a prototype of DroidHolmes. Our logging scheme can be deployed on different Android versions; here we modify the source code of Android 5.0.1 to implement the logging mechanism and flash the system image onto a Nexus 4 device. The auditing module is developed on the Soot framework. This module is deployed on a server with an Intel Broadwell E5-2660V4 2.0GHz CPU and 128 GB of memory, running Ubuntu 16.04 LTS (64-bit).
V-A Experiments and Evaluation on the Logging Module
V-A1 Operation Selection
In the evaluation, we record three types of app operations.
Event Handlers: The event handlers are a series of callbacks related to the state transitions of Android lifecycles, GUI operations and system events. An event handler carries the semantics of behavioral activation. As mentioned above, callback methods are significant for log splitting and log matching.
Privacy-related operations: These are the operations related to privacy leaks, which are one of the most critical problems in Android security. We choose the operations from the sources and sinks in SuSi. The logged operations can be divided into the following categories: account, bluetooth, device information, database, file, network, SMS, etc.
ICC and Reflection-based operations: These operations are logged to update the graphs. We record the origin, the target and the invoked API for each ICC communication. Meanwhile, we record the reflection-based APIs and their used arguments. The arguments indicate the dynamically loaded classes, used fields, invoked methods and so on.
Note that analysts may have different requirements for behavioral reconstruction, so DroidHolmes allows logging points to be set flexibly. Because the logging module runs on smartphones, it is impractical to log all the APIs related to the aforementioned operations due to the overhead. Therefore, we perform the operation selection by testing 3970 real-world apps gathered from the MalGenome project, VirusShare and the Google Play market. Beforehand, we select 145 Android APIs as candidate operations.
Table I: Operations with high Occ and operations with high Fre.
We explore two metrics for an operation: Occ and Fre. Occ is defined as $Occ = \frac{\text{occurrence \#}}{\text{app \#}}$, where the occurrence # is the number of apps containing the operation, and the app # is the total number of apps. Fre is defined as the average number of executions per minute. We manually run each app for 15 minutes, and then use Monkey to generate 5000 random events for it, where the effectiveness of injecting 500 events is confirmed in prior work.
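The two metrics can be computed as in the following sketch. It assumes that Occ is the plain ratio of the two counts named in the text; the function names and the example counts other than the 3970 apps and the 15-minute run are hypothetical.

```python
def occ(apps_containing_operation: int, total_apps: int) -> float:
    """Share of apps in which the operation occurs (assumed to be a plain ratio)."""
    if total_apps <= 0:
        raise ValueError("total_apps must be positive")
    return apps_containing_operation / total_apps


def fre(executions: int, minutes: float) -> float:
    """Average number of executions of the operation per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return executions / minutes


# Hypothetical counts: an operation found in 397 of the 3970 tested apps,
# executed 150 times during a 15-minute manual run.
example_occ = occ(397, 3970)
example_fre = fre(150, 15)
```

Operations that score high on either metric are the ones worth logging first, since they are the most likely to appear in the runtime logs of an arbitrary app.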
The top 10 recorded operations for each metric are shown in Table I. We choose the 50 operations with the highest Occ or Fre because they are commonly used in practice. In addition, we empirically add 25 operations that are not covered above but are widely used in privacy-related behaviors. With this selection, DroidHolmes records about 100 log records per minute.
V-A2 Deciding the value of K
The value of K needs to be decided before performing the following evaluations. We first randomly select 2000 apps, including 1000 apps from the Google Play market and 1000 apps from the MalGenome project and VirusShare. Then, we calculate the depth of the callsites of the selected APIs in the corresponding app’s callgraph. Next, we use three popular benchmarks, namely AnTuTu, CF-Bench and Linpack, to test the system overhead of the smartphone when K is set to different depth values. We run each benchmark 10 times and report the average value for each indicator.
Figure 6 depicts the average system overhead and the CDF of APIs under different values of K (from 4 to 8). The three benchmarks test the performance of the smartphone from different aspects, so we average all test results under each K to represent the system overhead. From the figure, we notice that the depth of most APIs’ callsites (93.5%) is less than 8, and the average system overhead gradually increases (from 2.25% to 3.11%) when the depth increases from 4 to 8. The APIs whose depth is larger than 8 are usually in advertising libraries. To ensure high API coverage and low system overhead, we set K as 7 because the division of by is biggest. For the APIs whose depth is greater than 7, our log matching algorithm can handle the case with little computation.
V-A3 Performance Overhead
Figure 7 describes the detailed performance overhead of the smartphone when K is set as 7. The first five indicators are generated by AnTuTu, and the next three indicators are generated by CF-Bench, and the last two indicators are generated by Linpack. The overall results show that DroidHolmes introduces negligible runtime overhead (2.39% on average), with the worst overhead case at 4.4% in the multi-threaded MFLOPS indicator.
V-B Effectiveness of Improving Existing Analysis Tools
We evaluate whether DroidHolmes can help to improve the data-flow detection capability of FlowDroid and IccTA, which are effective static data-flow analysis tools and outperform existing commercial tools [37, 38]. IccTA is based on FlowDroid and improves the accuracy of ICC-based data-flow tracking.
We choose the open-source test suites DroidBench and ICC-Bench as our benchmarks. There are 119 cases in DroidBench, covering the modeling of the Android lifecycle, reflection, ICC and so on. ICC-Bench contains 20 cases for evaluating the detection accuracy of ICC-based information leakage. Some apps in DroidBench and ICC-Bench are excluded for the following reasons. Three apps in DroidBench are designed for testing in Android emulators and another three are designed for inter-app communication, which DroidHolmes does not currently consider. One app in ICC-Bench cannot run normally due to an illegal integer assignment. Therefore, we choose 113 cases in DroidBench and 19 cases in ICC-Bench for the following evaluation.
|Arrays and Lists|
|Inter-Component Communication (ICC)|
|Sum, Precision, Recall, and measure|
| True positives #, TP | 74 | 111 | 97 | 111 |
| False positives #, FP | 16 | 6 | 11 | 6 |
| False negatives #, FN | 37 | 0 | 14 | 0 |
For apps in the benchmarks, we use DroidHolmes to output reinforced apks according to the runtime logs. Then we use FlowDroid and IccTA to detect data-flow leaks within original and reinforced apps respectively. Note that FlowDroid and IccTA detect reinforced apks using our new entry points. The representative results are shown in Table II, where each mark denotes a detection result from a testing app. Overall, DroidHolmes helps to improve the performance of FlowDroid and IccTA (94.87% and 100% for precision and recall respectively), which are calculated based on the detected data flows rather than the testing apps. The detailed analysis is depicted as follows.
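The reported precision and recall follow directly from the true-positive, false-positive and false-negative counts in Table II. The short sketch below recomputes them for the columns whose counts are TP = 111, FP = 6 and FN = 0; the function names are chosen for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of reported data flows that are real leaks."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Fraction of real leaks that are reported."""
    return tp / (tp + fn)


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Counts taken from the table rows above (TP = 111, FP = 6, FN = 0).
p = precision(111, 6)   # 111 / 117
r = recall(111, 0)      # 111 / 111
f = f_measure(p, r)
print(f"precision = {p:.2%}, recall = {r:.2%}")
```

With these counts, precision is 111/117, which rounds to 94.87%, and recall is 100%, in agreement with the figures quoted in the text.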
ICC. There are 47 cases about ICC. Based on the reinforced apks, the two tools can achieve 100% precision and 100% recall in tracking ICC-based data flows. For the original apps, FlowDroid’s precision and recall are low (46.7% and 20.6% respectively). IccTA reaches a higher precision (90.9%) and recall (88.2%), but it misses some data leaks because static analysis of the content in the intent object is challenging.
Reflection. The newest versions of FlowDroid and IccTA can both detect the reflection-based data leaks in the benchmarks. We therefore also test an older version of FlowDroid obtained from GitHub, and find that it fails to detect the data leaks in the original apps. However, the older version of FlowDroid does detect the data leaks in the reinforced apks.
Lifecycle. Both FlowDroid and IccTA achieve 93.3% precision and 93.3% recall on the 16 original apps. The performance of the two tools improves (i.e., 93.75% precision and 100% recall) when analyzing the reinforced apks. Furthermore, we find a design flaw in the lifecycle model of FlowDroid in the case MethodOverride1, even though the final detection result is a true positive. Specifically, FlowDroid models onCreate() as the first method called in the dummy-main method of the app. In the app, however, the overridden method attachBaseContext() is called before onCreate(). The call sequence of the callback methods recorded in the runtime logs is correct. Therefore, the flaw can be fixed by using our new entry point.
Special Cases. In Table II, Button2* and Button3* are two special cases in which the data leaks occur only when the user clicks buttons in a specific order. As static analysis tools, FlowDroid and IccTA cannot identify this order, so their detection results may differ from the real app execution. DroidHolmes can tell from the runtime logs whether data leaks have actually happened, which improves the accuracy of static analysis.
Miscellaneous Cases. DroidHolmes can also resolve the virtual dispatch problem using the logs. Specifically, in VirtualDispatch2, two classes B and C derive from class A, where B’s operation f() contains sensitive operations whereas C’s f() does not. If a class Test has an operation with the argument a of type A and a.f() is invoked, static analysis cannot know which f() is the actual implementation. FlowDroid and IccTA deal with the problem by enumerating all possible paths, which leads to false positives.
DroidHolmes can also help to rewrite the implementation of some methods to achieve higher recall. Specifically, we rewrite only a subset of the methods in the Java and Android libraries for the data-flow analysis, as in the cases PublicAPIField2 and ArrayToString1. For instance, in PublicAPIField2, we rewrite the method setAction() as an assignment statement that modifies the member variable of an object. Similarly, the method getAction() is rewritten as a statement that retrieves the value from the member variable. Therefore, the information flow can be propagated when these methods are traversed. StubDroid systematically addresses the problem by automatically generating summarized models of the libraries.
DroidHolmes cannot help FlowDroid and IccTA to completely solve the limitations of static data-flow analysis. For example, when a source transfers data to an array and another data in the array is sent to a sink, e.g., ArraysAndLists, DroidHolmes cannot help to differentiate the items in the array.
V-C Recovering Contextual Information for Real-world Apps
We randomly choose 500 real-world apps, of which 250 are gathered from the Google Play Store and 250 are from the MalGenome project and VirusShare. Our evaluation requires manual effort, so the test set is not as large as those of some automatic tools [2, 34]. We invite two experimenters to install and operate each app, and ask them to cover as many functionalities of each app as possible, such as login, connecting to the network, sharing location information and so on. To evaluate the effectiveness of DroidHolmes in contextual recovery, we manually inspect the bytecode of the apps to find specific execution paths that contain sufficient contextual information. For the paths with conditional statements, we leverage code instrumentation to make the checks return true. After that, we collect the runtime logs of each app and use DroidHolmes to help recover the contextual information.
We analyze the contextual information of matched paths and summarize the following three findings.
|Package Name||Lost Callback Method|
Finding 1: The entry point built by existing static analysis tools may be incomplete for some apps. We notice that some callback methods are recorded in the logs but missing from the entry point built by existing static analysis tools [3, 18]. We list six cases found in real-world apps in Table III. Furthermore, some callback methods involving AsyncTask, such as doInBackground() or onPreExecute(), are missing from the entry point of 53 apps. When these methods are not correctly modeled in the entry point, the detection result may be imprecise. To solve the problem, DroidHolmes rebuilds a new entry point based on the audit logs.
|No.||For Market App||For Malware|
|1||Checking for the value||Logic bomb|
|2||Checking for compatibility of APIs||Anti-virtualization|
Finding 2: Conditional statements are among the most important pieces of contextual information for unveiling the malice of app behaviors. We notice that the semantics of the conditional statements extracted from the paths differ between market apps and malware, as listed in Table IV. Specifically, in market apps, conditional statements are commonly used to ensure the normal use of the app. For example, in owncloud, setDescription() is invoked when the value of android.os.Build.VERSION.SDK_INT is larger than 26. The reason for the check is that the API can only be used on OS versions after Android 8.0. In malware, conditional statements are usually used to hide the real intention of the app. For instance, Shuilianhua steals the user’s device ID once the app has been used continuously for 48 hours.
Finding 3: The topological structures of the paths extracted from some malware samples are complex, which increases the difficulty of app behavioral analysis. We notice that the paths in these malware samples, such as ynqgas.mqbgseos, consist of exception handlers, loops and branches. When these three substructures are nested, the apps’ CFGs contain a large number of possible control-flow paths. We suspect that these sophisticated structures are designed to increase the difficulty of static behavioral analysis.
V-C2 Runtime Performance
On average, the auditing module takes 6.55 seconds to find a matched path for a market app, with an average memory consumption of 845.3 MB. For a malware sample, it takes 10.46 seconds on average, with an average memory consumption of 1252.6 MB. Some sophisticated malware samples make DroidHolmes consume considerable time and memory to match a long log sequence against complicated graph structures, as shown in the case study.
V-C3 Case Study
We select a representative app for our case study: a malware sample of the FakeInstaller family whose MD5 is dd40531493f53456c3b22ed0bf3e20ef. In our evaluation, most of the tested apps use the reflection mechanism only occasionally. This app is unusual because almost all of its methods are invoked through reflective calls, and the arguments of the reflective calls are all obfuscated. Traditional static analysis tools such as FlowDroid cannot reveal the real intention of the app’s behaviors. Harvester cannot extract all the runtime values for the app because of the limitations of code slicing. We also try AppAudit, which only finds that the app uses some sensitive permissions.
Recovering the contextual information of this app is challenging for DroidHolmes. First, even when the app runs for only about 20 seconds, the reflection API invoke() is called more than 11200 times. Second, almost all the call-statement nodes in the app’s supergraphs are the same reflective method invoke(). Finding the path by matching the method signatures in the logs against these call sites is impractical because of the huge computational complexity.
DroidHolmes takes 640.1 seconds to find the target path matched with 17097 log records, and the memory consumption is about 26.6 GB. After the supergraphs of the app are updated according to the logs, the number of call edges is 5749; DroidHolmes has pruned 884 uncovered edges and added 528 edges for reflective calls. To evaluate the effectiveness of our log matching algorithm, we also try to find the matched path with a plain depth-first search. No matched path is found even after more than 5 hours of running time.
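To give an intuition for why enumerating paths one by one scales poorly on graphs with loops, the sketch below matches a log against a graph by advancing a set of candidate nodes per log record instead of exploring individual paths. It is a generic frontier-based technique shown for illustration only, not the matching algorithm of DroidHolmes; the graph encoding, the labels map and the assumption that the entry node is unlabeled are all hypothetical, and calling contexts are ignored.

```python
from collections import deque


def matches(graph, labels, entry, log):
    """True if some path from `entry` visits labeled nodes that spell out `log`.

    graph:  node -> list of successor nodes
    labels: node -> logged API name (nodes absent from `labels` are silent)
    """
    def step(frontier, record):
        # Search through silent nodes only and stop at the first labeled node
        # on each branch; keep it if its label equals the current record.
        hits, seen, queue = set(), set(frontier), deque(frontier)
        while queue:
            node = queue.popleft()
            for succ in graph.get(node, []):
                if succ in labels:
                    if labels[succ] == record:
                        hits.add(succ)
                elif succ not in seen:
                    seen.add(succ)
                    queue.append(succ)
        return hits

    frontier = {entry}
    for record in log:
        frontier = step(frontier, record)
        if not frontier:
            return False  # no node can produce this record at this position
    return True


# Hypothetical graph with a loop: entry -> a -> (b -> a | c)
GRAPH = {"entry": ["a"], "a": ["b", "c"], "b": ["a"], "c": []}
LABELS = {"b": "invoke", "c": "send"}
```

Each log record costs at most one traversal of the graph, so the work grows with the product of the log length and the graph size rather than with the number of distinct paths.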
Figure 8 depicts a path segment and its contextual information. In the figure, the left path is reconstructed based on the logs, and the right path is the analysis result obtained with the help of DroidHolmes. We notice that the library embedded in the app collects the user’s private information (i.e., IMEI, IMSI) by calling getSubscriberId() and getDeviceId() when the app is launched. It then checks the two strings against predefined rules, such as whether each character can be converted to an integer value. We suspect that this is an anti-virtualization technique to avoid obtaining mock information. When the checks pass, it transmits the information to the specified server over the network. Another suspicious behavior is that the app obtains the method signature toCharArray() by concatenating many unordered characters, which is a trick to evade static analysis. Furthermore, the app uses a large number of exception handlers to prevent crashes when performing suspicious operations. We also find that multiple reflective calls in the app are nested. For instance, getFields() is sometimes called through invoke().
VI Limitations and Discussion
DroidHolmes currently does not analyze the implementation of native methods. Native methods are implemented in C++ instead of Java, so analyzing them requires additional reverse-engineering techniques for C++ implementations. In this paper we focus only on the Java-based implementation and plan to analyze the C++ code in future work. DroidHolmes also does not yet support the analysis of dynamic code loading. It is theoretically possible to find the matched path once the dynamically loaded code blocks are obtained. Meanwhile, we do not build a fine-grained model for threading, which may cause imprecision in contextual recovery. We leave these problems as future work. Note that the integrity of the logs must be ensured before DroidHolmes runs. One option is to use the TrustZone technology to protect the logs from tampering by adversaries.
VII Related Work
VII-A Behavioral Reconstruction
Many techniques have been proposed to reconstruct app behaviors from runtime information. CopperDroid modifies the Android emulator to collect system-call information for reconstructing the potentially malicious behaviors of a running app. DroidScope seamlessly reconstructs app behaviors based on the OS-level and Java-level semantics by instrumenting the virtual machine. DroidForensics collects multi-layer forensic logs and reconstructs Android attacks. VetDroid reconstructs permission use behaviors to find information leaks and so on. Furthermore, SLEUTH aims to reconstruct real-time attack scenarios for systems including Windows, Linux and FreeBSD.
DroidHolmes finds a path on the app’s CFG that matches the audit logs. Based on the result of DroidHolmes, existing analysis tools can obtain detailed contextual information to profile the hidden intentions of app behaviors. This contextual information is unavailable in the behaviors reconstructed by the above techniques. In addition, unlike the above techniques, which record multi-level logs, DroidHolmes logs only a small number of Android API calls. Therefore, our logging scheme incurs negligible performance overhead on users’ smartphones.
VII-B Behavioral Analysis
Static Analysis. RiskRanker assesses potential security risks of Android apps, such as known root, illegal cost creation and privacy violation attacks. FlowDroid proposes a taint propagation analysis to determine whether there exists a data flow from a predefined source to a given sink. FlowDroid analyzes only single components in apps, with high accuracy, while CHEX, Amandroid, DroidSafe, Epicc, IC3 and IccTA are proposed to analyze ICC-based data leaks. These static analysis schemes have effectively detected a large number of malicious behaviors, but they may over-estimate malware threats. Moreover, static analysis may fail to completely detect Android attacks that involve reflection and ICC.
The performance of existing static analysis tools can be improved with the help of DroidHolmes. It leverages the runtime information to reconstruct method call relations (e.g., reflective calls and ICC links) for the Android mechanisms on supergraphs. Moreover, in the graphs, the redundant edges that are not covered in the runtime execution are pruned in the result of DroidHolmes. Therefore, the above tools can achieve more accurate behavioral analysis based on the result of DroidHolmes.
Dynamic Analysis. TaintDroid modifies the Dalvik virtual machine to implement dynamic taint tracking. AppsPlayground adopts an improved version of TaintDroid for dynamic data-flow tracking. BOXMATE presents an automatic test generation scheme based on BOXIFY. DroidHolmes does not inject any module into the OS to track runtime data flows or perform elaborate path exploration for apps; instead, it only logs a small number of Android API calls. Therefore, the performance overhead that DroidHolmes imposes on smartphones is lower than that of techniques that introduce heavy computation at runtime.
Hybrid Analysis. Pegasus designs a model checking mechanism to detect whether the operations in an app are consistent with users’ GUI-based interactions, and also performs dynamic analysis to deal with possible Java reflection cases. AppIntent applies static analysis to identify the possible execution paths leading to sensitive data transmission, and lets human analysts determine whether the transmission is user-intended according to the results of dynamic analysis. AppAudit proposes an efficient analysis framework that uses less time and memory than AppIntent and FlowDroid. Harvester dynamically executes the specified code paths after code slicing to obtain the runtime values of reflection.
The purposes of DroidHolmes and these techniques differ. Specifically, these techniques leverage static analysis to find the code blocks whose underlying intentions cannot yet be decided. Based on that result, they then use dynamic analysis to perform targeted testing that reveals the real malice. DroidHolmes does not restrict the scope and goal of dynamic analysis. It analyzes both the temporal and structural relationships among apps’ operations and provides a refined result to facilitate behavioral analysis. In other words, the result of DroidHolmes can help to improve the performance of the above techniques.
Machine Learning-based Analysis. Machine learning-based systems [24, 15, 31, 51, 52] have been used to identify malicious app behaviors. The inaccurate graphs caused by ICC, reflection or other mechanisms also bring great challenges in extracting precise features from apps. For instance, unresolved reflection methods are treated as security-sensitive in AppContext, which may cause false positives in the analysis. DroidHolmes can help these systems extract precise features to train more effective models for behavioral classification.
VIII Conclusion
We propose and implement DroidHolmes, a novel system for recovering the contextual information of behaviors in Android apps from a limited quantity of audit logs. In our evaluation, DroidHolmes helps existing analysis tools to achieve 94.87% precision and 100% recall on 132 apps from open-source test suites. Based on the result of DroidHolmes, the contextual information in the behaviors of 500 real-world apps is also recovered. Meanwhile, DroidHolmes incurs negligible performance overhead on the smartphone.
References
-  “Global mobile os market share in sales to end users from 1st quarter 2009 to 1st quarter 2018,” https://www.statista.com/statistics/266136/global-market-share-held-by-smartphone-operating-systems/.
-  X. Pan, X. Wang, Y. Duan, X. Wang, and H. Yin, “Dark hazard: learning-based, large-scale discovery of hidden sensitive operations in android apps,” in Proc. of NDSS, 2017.
-  S. Arzt, S. Rasthofer, C. Fritz, E. Bodden, A. Bartel, J. Klein, Y. L. Traon, D. Octeau, and P. McDaniel, “Flowdroid: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for android apps,” in PLDI, 2014, p. 29. [Online]. Available: http://doi.acm.org/10.1145/2594291.2594299
-  H. Zhang, D. She, and Z. Qian, “Android root and its providers: A double-edged sword,” in Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015, pp. 1093–1104.
-  J. Huang, X. Zhang, L. Tan, P. Wang, and B. Liang, “Asdroid: detecting stealthy behaviors in android applications by user interface and program behavior contradiction,” in ICSE, 2014, pp. 1036–1046. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568301
-  R. Wang, W. Enck, D. S. Reeves, X. Zhang, P. Ning, D. Xu, W. Zhou, and A. M. Azab, “Easeandroid: Automatic policy analysis and refinement for security enhanced android via large-scale semi-supervised learning.” in USENIX Security Symposium, 2015, pp. 351–366.
-  T. Isohara, K. Takemori, and A. Kubota, “Kernel-based behavior analysis for android malware detection,” in Computational Intelligence and Security (CIS), 2011 Seventh International Conference on. IEEE, 2011, pp. 1011–1015.
-  Y. Zhou, Z. Wang, W. Zhou, and X. Jiang, “Hey, you, get off of my market: detecting malicious apps in official and alternative android markets.” in NDSS, vol. 25, no. 4, 2012, pp. 50–52.
-  L.-K. Yan and H. Yin, “Droidscope: Seamlessly reconstructing the os and dalvik semantic views for dynamic android malware analysis.” in USENIX security symposium, 2012, pp. 569–584.
-  K. Tam, S. J. Khan, A. Fattori, and L. Cavallaro, “Copperdroid: Automatic reconstruction of android malware behaviors,” in NDSS, 2015. [Online]. Available: http://www.internetsociety.org/doc/copperdroid-automatic-reconstruction-android-malware-behaviors
-  X. Yuan, O. Setayeshfar, H. Yan, P. Panage, X. Wei, and K. H. Lee, “Droidforensics: Accurate reconstruction of android attacks via multi-layer forensic logging,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 2017, pp. 666–677.
-  Y. Zhang, M. Yang, B. Xu, Z. Yang, G. Gu, P. Ning, X. S. Wang, and B. Zang, “Vetting undesirable behaviors in android apps with permission use analysis,” in CCS, 2013, pp. 611–622. [Online]. Available: http://doi.acm.org/10.1145/2508859.2516689
-  W. Enck, P. Gilbert, B. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. Sheth, “Taintdroid: An information-flow tracking system for realtime privacy monitoring on smartphones,” in OSDI, 2010, pp. 393–407. [Online]. Available: http://www.usenix.org/events/osdi10/tech/full_papers/Enck.pdf
-  Y. Fratantonio, A. Bianchi, W. Robertson, E. Kirda, C. Kruegel, and G. Vigna, “Triggerscope: Towards detecting logic bombs in android applications,” in IEEE Symposium on Security and Privacy (SP), 2016, pp. 377–396.
-  W. Yang, X. Xiao, B. Andow, S. Li, T. Xie, and W. Enck, “Appcontext: Differentiating malicious and benign mobile app behaviors using context,” in ICSE, 2015, pp. 303–313. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2015.50
-  F. Wei, S. Roy, X. Ou, and Robby, “Amandroid: A precise and general inter-component data flow analysis framework for security vetting of android apps,” in CCS, 2014, pp. 1329–1341. [Online]. Available: http://doi.acm.org/10.1145/2660267.2660357
-  L. Li, T. F. Bissyandé, D. Octeau, and J. Klein, “Droidra: Taming reflection to support whole-program analysis of android apps,” in Proceedings of the 25th International Symposium on Software Testing and Analysis. ACM, 2016, pp. 318–329.
-  L. Li, A. Bartel, T. F. Bissyandé, J. Klein, Y. L. Traon, S. Arzt, S. Rasthofer, E. Bodden, D. Octeau, and P. McDaniel, “Iccta: Detecting inter-component privacy leaks in android apps,” in ICSE, 2015, pp. 280–291. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2015.48
-  “Google play. https://play.google.com/store.”
-  Y. Zhou and X. Jiang, “Dissecting android malware: Characterization and evolution,” in IEEE Symposium on Security and Privacy, SP, 2012, pp. 95–109. [Online]. Available: http://dx.doi.org/10.1109/SP.2012.16
-  “Virusshare. https://virusshare.com/.”
-  T. Vidas and N. Christin, “Evading android runtime analysis via sandbox detection,” in Proceedings of the 9th ACM symposium on Information, computer and communications security. ACM, 2014, pp. 447–458.
-  M. I. Gordon, D. Kim, J. H. Perkins, L. Gilham, N. Nguyen, and M. C. Rinard, “Information flow analysis of android applications in droidsafe.” in NDSS, 2015.
-  Y. Aafer, W. Du, and H. Yin, “Droidapiminer: Mining api-level features for robust malware detection in android,” in SecureComm, 2013, pp. 86–103. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-04283-1_6
-  A. Gorla, I. Tavecchia, F. Gross, and A. Zeller, “Checking app behavior against app descriptions,” in Proceedings of the 36th International Conference on Software Engineering. ACM, 2014, pp. 1025–1035.
-  D. Octeau, D. Luchaup, M. Dering, S. Jha, and P. McDaniel, “Composite constant propagation: Application to android inter-component communication analysis,” in ICSE, 2015, pp. 77–88. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2015.30
-  D. Octeau, S. Jha, M. Dering, P. McDaniel, A. Bartel, L. Li, J. Klein, and Y. Le Traon, “Combining static analysis with probabilistic models to enable market-scale android inter-component analysis,” in ACM SIGPLAN Notices, vol. 51, no. 1. ACM, 2016, pp. 469–484.
-  S. Rasthofer, S. Arzt, M. Miltenberger, and E. Bodden, “Harvesting runtime values in android applications that feature anti-analysis techniques.” in NDSS, 2016.
-  M. Sun, T. Wei, and J. Lui, “Taintart: A practical multi-level information-flow tracking system for android runtime,” in Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2016, pp. 331–342.
-  K. Jamrozik, P. von Styp-Rekowsky, and A. Zeller, “Mining sandboxes,” in Software Engineering (ICSE), 2016 IEEE/ACM 38th International Conference on. IEEE, 2016, pp. 37–48.
-  V. Avdiienko, K. Kuznetsov, A. Gorla, A. Zeller, S. Arzt, S. Rasthofer, and E. Bodden, “Mining apps for abnormal usage of sensitive data,” in ICSE, 2015, pp. 426–436. [Online]. Available: http://dx.doi.org/10.1109/ICSE.2015.61
-  P. Lam, E. Bodden, O. Lhoták, and L. Hendren, “The soot framework for java program analysis: a retrospective,” in CETUS, 2011.
-  Z. Yang, M. Yang, Y. Zhang, G. Gu, P. Ning, and X. S. Wang, “Appintent: analyzing sensitive data transmission in android for privacy leakage detection,” in CCS, 2013, pp. 1043–1054. [Online]. Available: http://doi.acm.org/10.1145/2508859.2516676
-  S. Rasthofer, S. Arzt, and E. Bodden, “A machine-learning approach for classifying and categorizing android sources and sinks,” in NDSS, 2014.
-  “Monkey. https://developer.android.com/studio/test/monkey.”
-  Y. Li, F. Yao, T. Lan, and G. Venkataramani, “Sarre: semantics-aware rule recommendation and enforcement for event paths on android,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 12, pp. 2748–2762, 2016.
-  “IBM Rational AppScan. IBM security appscan standard resources.” ibm.com/software/products/en/appscan.
-  “Fortify Static Code Analyzer. build better code and secure your software.” www8.hp.com/us/en/software-solutions/static-code-analysis-sast/.
-  S. Arzt and E. Bodden, “Stubdroid: automatic inference of precise data-flow summaries for the android framework,” in ICSE, 2016, pp. 725–735. [Online]. Available: http://doi.acm.org/10.1145/2884781.2884816
-  M. Zhang and H. Yin, “Appsealer: Automatic generation of vulnerability-specific patches for preventing component hijacking attacks in android applications.” in NDSS, 2014.
-  M. Xia, L. Gong, Y. Lyu, Z. Qi, and X. Liu, “Effective real-time android application auditing,” in 2015 IEEE Symposium on Security and Privacy, 2015, pp. 899–914. [Online]. Available: http://dx.doi.org/10.1109/SP.2015.60
-  S. Rasthofer, S. Arzt, S. Triller, and M. Pradel, “Making malory behave maliciously: Targeted fuzzing of android execution environments,” in Software Engineering (ICSE), 2017 IEEE/ACM 39th International Conference on. IEEE, 2017, pp. 300–311.
-  K. Rubinov, L. Rosculete, T. Mitra, and A. Roychoudhury, “Automated partitioning of android applications for trusted execution environments,” in ICSE, 2016, pp. 923–934.
-  M. N. Hossain, S. M. Milajerdi, J. Wang, B. Eshete, R. Gjomemo, R. Sekar, S. Stoller, and V. Venkatakrishnan, “Sleuth: Real-time attack scenario reconstruction from cots audit data,” 2017.
-  M. C. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang, “Riskranker: scalable and accurate zero-day android malware detection,” in MobiSys, 2012, pp. 281–294. [Online]. Available: http://doi.acm.org/10.1145/2307636.2307663
-  L. Lu, Z. Li, Z. Wu, W. Lee, and G. Jiang, “CHEX: statically vetting android apps for component hijacking vulnerabilities,” in CCS, 2012, pp. 229–240. [Online]. Available: http://doi.acm.org/10.1145/2382196.2382223
-  D. Octeau, P. McDaniel, S. Jha, A. Bartel, E. Bodden, J. Klein, and Y. L. Traon, “Effective inter-component communication mapping in android: An essential step towards holistic security analysis,” in USENIX Security Symposium, 2013, pp. 543–558. [Online]. Available: https://www.usenix.org/conference/usenixsecurity13/technical-sessions/presentation/octeau
-  V. Rastogi, Y. Chen, and W. Enck, “Appsplayground: automatic security analysis of smartphone applications,” in CODASPY, 2013, pp. 209–220. [Online]. Available: http://doi.acm.org/10.1145/2435349.2435379
-  M. Backes, S. Bugiel, C. Hammer, O. Schranz, and P. v. Styp-Rekowsky, “Boxify: Full-fledged app sandboxing for stock android,” 2015.
-  K. Z. Chen, N. M. Johnson, V. D’Silva, S. Dai, K. MacNamara, T. R. Magrino, E. X. Wu, M. Rinard, and D. X. Song, “Contextual policy enforcement in android applications with permission event graphs,” in NDSS, 2013.
-  A. Gorla, I. Tavecchia, F. Gross, and A. Zeller, “Checking app behavior against app descriptions,” in ICSE, 2014, pp. 1025–1035. [Online]. Available: http://doi.acm.org/10.1145/2568225.2568276
-  Z. Zhu and T. Dumitras, “Featuresmith: Automatically engineering features for malware detection by mining the security literature,” in CCS, 2016, pp. 767–778. [Online]. Available: http://doi.acm.org/10.1145/2976749.2978304