1. Introduction
This paper seeks to answer the following fundamental question at the intersection of programming language theory and empirical software engineering:
What portion of the code of a large corpus of real software systems lies in SubTuring islands: ‘islands’ of code that denote computation for which interesting program analysis questions are decidable?
We use the term ‘Turing Swamp’ to refer to any code that does not lie in such a SubTuring island. Of course, merely determining whether or not code lies within an island or in the swamp is, itself, undecidable. Therefore, our tool uses a simple conservative underapproximation of SubTuring islands (and corresponding overapproximation of the swamp).
Our SubTuring island identification algorithm, Cook, guarantees that the halting problem is decidable for any computation it identifies as lying within an island; as a result, Cook necessarily underapproximates the amount of code that lies within such islands. Even with this relatively simple underapproximation, we were able to determine that a large proportion of nontrivial production code (for Android) does indeed lie in island (not swamp) code. That is, for a corpus of 1100 Android applications, containing over 2 million methods, we found that 55% of the methods are subTuring.
Even if we remove the ‘long tail’ of simple methods (getters, setters, and methods with fewer than 30 bytecode instructions), we still find that 22% of all code lies in a SubTuring island. We then ask:
Since we find that at least one fifth of the code of nontrivial real-world systems lies in a SubTuring island, what are some of the ramifications for programming language and software engineering applications that rely on static analysis?
To investigate these implications we conducted two empirical studies of the impact of subTuring islands. Even with our conservative underapproximation, we found that (at least) 37% of the verification conditions for runtime exceptions (e.g., array bounds and null pointer violations) lie within subTuring islands. Furthermore, for a dataset of ten open source applications, we found a statistically significant difference in bug density, with a large effect size.
These findings reveal a glimpse of the potential implications and applications of SubTuring analysis. In a single paper we cannot claim to have addressed more than the first few natural questions that occur when considering the approximate computation of the boundary between SubTuring islands and the swamp. Nevertheless, we believe that our results demonstrate that a surprisingly large portion of code does clearly lie within a SubTuring island and that there is practical merit in studying islands to inform and improve static analysis. SubTuring islands support fully automatic and precise symbolic reasoning; this reasoning might be exploited for bug repair, freeing humans to concentrate on problems occurring in the swamp.
There are many avenues for future work. We outline some of these and their relationship to existing trends of intellectual investigation in the programming languages and software engineering research communities. We hope that this paper will stimulate further investigation of SubTuring analyses of software and real-world applications of these findings. Our paper seeks to motivate this research agenda with scientific evidence for the prevalence of SubTuring islands (within Android applications in this case) and the real-world impact and implications for bug density and verification.
Specifically, the contributions of this paper are the following.

We introduce and formalise the concept of a subTuring island.

We provide an analysis for identifying subTuring islands and its implementation in the prototype tool Endeavour.

We reveal that SubTuring code is more prevalent in real-world systems than might be expected: a conservative lower bound shows that at least one fifth of nontrivial Android app code is SubTuring.

We demonstrate that SubTuring island analysis has great potential for real-world application. Specifically, we report that 37% of the array bounds and null pointer verification conditions lie within islands, and that islands enjoy a lower bug density than the Turing swamp.
2. SubTuring Islands
This section first defines the notion of a SubTuring island, where the definition is parameterised by a decision procedure. As an illustrative decision procedure, we consider terminating islands: SubTuring islands with a conservative decision procedure for ‘halts’ versus ‘may not halt’. The bulk of this section then presents the syntax and semantics of Carib, a core language that facilitates the identification of SubTuring islands.
Definition 2.1 (SubTuring Island).
A region of code r is subTuring with respect to a property φ if there exists a decision procedure D that determines whether φ holds over all executions of r.
Under Rice’s theorem (rice53), SubTuring islands are only computable when D approximates φ while being certain when it determines that φ holds; it cannot decide both φ and ¬φ. For an arbitrary property, a suitable decision procedure may not exist; hence the existential quantification in the definition of subTuring. Finding a decision procedure rests on human ingenuity. The parameterisation of the definition on the decision procedure D implies that subTuring islands are only defined with respect to a given decidable property. Different static analyses can safely approximate the islands, with different levels of precision, thereby giving rise to different islands. However, if an approximation safely underapproximates the code that lies within an island, then it is safe to make ‘island-aware assertions and inferences’ within any given island.
We focus our investigation on a decision procedure for the halting problem. Our realisation, given in Section 3.2, soundly approximates this undecidable problem. Given this decision procedure, we frame the Terminating Islands Identification Problem in terms of states σ: L → V ∪ {⊥}, where L denotes the set of program lvalues and V ∪ {⊥} denotes the lifted value domain, which we leave otherwise unspecified. We say that a state σ is divergence free if none of its lvalues is mapped to ⊥. Given a divergence-free state σ and a region of code r, our goal is to determine if ⟦r⟧⊥σ is also divergence free; we formalise ⟦·⟧⊥, the semantics of Carib, in Section 2.1. From a practical perspective, we are interested in regions that represent meaningful code fragments such as a method or a loop body.
Definition 2.2 (The Terminating Island Identification Problem).
Given a region of code r, the Terminating Island Identification Problem is to determine if, for every divergence-free state σ, the state ⟦r⟧⊥σ is also divergence free.
When this condition is satisfied, divergence can only result from the code that makes up r itself. In this sense, r is independent of its context (the rest of the program). Since we consider only a decision procedure for termination in this paper, we use ‘SubTuring island’ to refer to a terminating island in what follows.
To illustrate the goal of our analysis, consider the three examples shown in Figure 1, where the region of code considered is a method. To begin with, method foo of Figure 1a calls method bar, which is clearly subTuring; thus it is not a source of divergence to its caller foo, which is also subTuring. In contrast, in Figure 1b, callee bar is not subTuring as it contains a loop whose termination cannot be guaranteed. As a result its caller foo is also not subTuring. Finally, Figure 1c is similar to Figure 1b except that the source of divergence is a call to an API method, which may diverge. (A call to a recursive method would have the same effect.) The potentially divergent API call does not, however, relegate Line 4 to the swamp, despite its control dependence on the call to bar. This is because a Carib function always terminates, so the fact that Line 4 is termination-sensitive control dependent on Line 3 does not matter (sdetal:unifying; kaetal:acmsurveys).
1 int foo(){ ✔
2  int x = 5;
3  x = bar();
4  return x;
5 }
6
7 int bar(){ ✔
8  int y = 1;
9  return y;
10 }

1 int foo(){ ✘
2  int x = 5;
3  x = bar();
4  return x;
5 }
6
7 int bar(){ ✘
8  int y = 0;
9  while(){
10   y = y+1;
11  }
12  return y;
13 }

1 int foo(){ ✔
2  int x = 5;
3  int unused = bar();
4  return x;
5 }
6
7 int bar(){ ✘
8  int y = 0;
9  boolean r = api();
10  if (r) {
11   y = y+1;
12  }
13  return y;
14 }

(a)  (b)  (c) 

2.1. Carib: Its Syntax and Semantics
Figure 2 defines the grammar of Carib, our core language. Carib is a Jimple-like (ValleeRaiGHLPS00) intermediate representation with a minimal set of instructions. Vallée-Rai et al. (ValleeRaiGHLPS00) and Bartel et al. (BartelKTM12) have shown that Jimple can encode the entire instruction set of widely deployed virtual machines, such as the JVM and Dalvik. A Carib program is a set of methods derived from the nonterminal prog. Carib incorporates three simplifications that ease the presentation. First, instead of the usual syntax for method invocation o.m(a₁, …, aₙ), Carib uses the form m(o, a₁, …, aₙ), with the receiver being the first argument. Second, Carib defines only the two structured control constructs while and if-else. Finally, to simplify reasoning about side-effects, Carib restricts pointer dereferences to its assignment statement, where the dereference operator can appear on either the LHS or the RHS, but not both. Further, the call syntax only permits an identifier as an actual parameter, again ruling out pointer dereferences. These properties simplify reasoning about aliasing in Carib.
Carib’s semantics, ⟦·⟧⊥, extends a conventional semantics, ⟦·⟧, such as Winskel’s IMP (Winskel), where ⟦s⟧ is a partial function from Σ to Σ that updates the state to reflect the execution of statement s. For the statements of Figure 2, ⟦s⟧⊥ is identical to the conventional semantics ⟦s⟧ when s terminates. When s does not terminate, ⟦s⟧⊥ reifies the nontermination by binding ⊥ to each variable modified by s.
To reify nontermination, the semantics must first identify it. In Carib there are three potential sources: loops, recursive method calls, and (unknown) API calls. The semantics uses three oracles in the identification: the write oracle writes, the termination oracle halts, and the set A of divergent API methods (Section 3 discusses our computable approximations to all three). For example, the termination oracle halts is used to identify nonterminating loops as well as recursive methods that may diverge. Figure 3 defines these oracles and the other symbols and functions used to define Carib’s semantics.
Finally, we formalise the notions of state and state update as used in the Carib semantics. As a convenience, we assign a unique name to each local variable and formal parameter, and then simply refer to only those names that are in scope. We use L to denote the set of all program lvalues (in Figure 2, these include identifiers and array/structure references). An lvalue l denotes a memory location that holds a value from the lifted value domain V ∪ {⊥}, where ⊥ denotes divergence; we leave V otherwise unspecified. A program state σ maps each lvalue l to its value σ(l). Σ denotes the (possibly infinite) set of all program states. We write σ(l) to denote the value that σ maps l to and σ[l ↦ v] to denote the updated state σ′ where σ′(l) = v and σ′(l′) = σ(l′) for all l′ ≠ l. As a notational convenience, we write σ[X ↦ ⊥] for a variable set X = {x₁, …, xₙ} to denote σ[x₁ ↦ ⊥]⋯[xₙ ↦ ⊥]. Finally, we write ⊥ ∈ σ to denote that σ maps some lvalue to ⊥, i.e., that σ is not divergence free.
Carib’s semantics must account for two potentially divergent constructs, while and call. If a loop does not terminate, our semantics effectively replaces the loop with a parallel assignment of ⊥ to all the lvalues that the loop may modify. We handle recursive calls in the same way. Other constructs may propagate ⊥, but will not introduce it. In essence, our semantics is a collecting semantics based on taint analysis where while and call are the only taint sources. We say that a program point is in the swamp if ⊥ reaches it; otherwise the point is in a subTuring island. Finally, we emphasise that, under its semantics, Carib methods always terminate.
writes(s)  the set of all lvalues potentially written by the execution of s
halts(s)  the termination oracle used to check for nontermination of s
A  the set of (assumed divergent) API methods
R  the set of recursive methods halts deems divergent
FV(e)  the set of free variables in expression e
ret  a fresh pseudovariable used to hold a method’s return value
sequence : ⟦s₁; s₂⟧⊥ σ = ⟦s₂⟧⊥ (⟦s₁⟧⊥ σ)

assign : ⟦x := e⟧⊥ σ = σ[x ↦ ⊥] if ∃v ∈ FV(e). σ(v) = ⊥; otherwise ⟦x := e⟧ σ
ife : ⟦if e then s₁ else s₂⟧⊥ σ = σ[writes(s₁) ∪ writes(s₂) ↦ ⊥] if ∃v ∈ FV(e). σ(v) = ⊥; otherwise ⟦s₁⟧⊥ σ when σ(e) holds and ⟦s₂⟧⊥ σ when it does not
while : ⟦while e do s⟧⊥ σ = σ[writes(s) ↦ ⊥] if ¬halts(while e do s); otherwise the conventional loop semantics ⟦while e do s⟧ σ
call : ⟦x := m(a₁, …, aₙ)⟧⊥ σ = σ[writes(body(m)) ∪ {x} ↦ ⊥] if m ∈ A ∪ R; otherwise σ[x ↦ σ′(ret)], where σ′ = ⟦body(m)⟧⊥ applied to σ with m’s formals bound to the actuals

Figure 4 presents Carib’s semantics, ⟦·⟧⊥. The rule for while leverages halts to determine if a loop terminates. For a loop that may not terminate, the first line of the while rule binds ⊥ to each lvalue potentially written during the execution of the loop body. The externally supplied function writes identifies these lvalues. Section 3 describes our conservative realisation of halts as well as our conservative determination of the set of written lvalues.
The second source of nontermination is calls to recursive methods and API methods, captured by the sets R and A in Figure 4. For the call rule, the first case binds ⊥ to all lvalues that the called method’s execution potentially updates, together with ret, the variable receiving the method’s return value. In call’s second case, conventional semantics apply. Here, body(m) denotes the statements of the called method m. Working outward from body(m), we evaluate m’s statements on the state formed by binding m’s formals to the actuals found in the call. Other than the return value, there is no need to map information back to the caller because Carib uses call-by-value semantics. In the case of objects (and arrays), Carib passes a copy of the reference to the object to the callee, thereby allowing the called method to update (only) the members of the class (or array) associated with the actual. To handle return values, we store the value in ret and then update the final state to bind x to this value.
Finally, the ife and assign rules can only propagate ⊥; they do not introduce it. For assign, if ⊥ reaches any variable in the right-hand side, assign binds ⊥ to the assigned lvalue; otherwise, the conventional semantics is used. When ⊥ reaches an if statement’s conditional expression, the ife rule assigns ⊥ to all lvalues potentially written by either branch of the if statement; otherwise, it applies ⟦·⟧⊥ to the appropriate branch.
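The ⊥-reifying evaluation just described can be sketched concretely. The following is a minimal Python illustration under our own encoding (tuples for statements, a `halts` callback standing in for the termination oracle, `writes` as the syntactic over-approximation of assigned lvalues); it is not the authors’ implementation.

```python
BOT = "BOT"  # stands in for the divergence value, written ⊥ in the text

def writes(body):
    """Syntactic over-approximation of the lvalues a body may assign."""
    out = set()
    for s in body:
        if s[0] == "assign":
            out.add(s[1])
        elif s[0] == "while":
            out |= writes(s[2])
    return out

def run(stmts, state, halts):
    """Execute statements over a state; reify nontermination with BOT."""
    for s in stmts:
        if s[0] == "assign":            # ("assign", x, free_vars, fn)
            _, x, fvs, fn = s
            if any(state.get(v) is BOT for v in fvs):
                state[x] = BOT          # the RHS mentions a ⊥ variable
            else:
                state[x] = fn(state)
        elif s[0] == "while":           # ("while", cond, body)
            _, cond, body = s
            if halts(s):
                while cond(state):      # conventional loop semantics
                    run(body, state, halts)
            else:                       # oracle cannot prove termination:
                for l in writes(body):  # bind ⊥ to every written lvalue
                    state[l] = BOT
    return state

# A divergent loop followed by a use of the tainted variable.
loop = ("while", lambda st: True,
        [("assign", "y", ["y"], lambda st: st["y"] + 1)])
program = [("assign", "y", [], lambda st: 0),
           loop,
           ("assign", "x", ["y"], lambda st: st["y"] + 1)]
final = run(program, {}, halts=lambda l: False)
```

Here the oracle rejects the loop, so both y and the subsequent use x end up bound to ⊥, mirroring Figure 1b.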
3. Realising Oracles and Translating to Carib
To analyse industrial programs, we must translate them into Carib and instantiate Carib’s three oracles: its writable-location oracle writes, its termination oracle halts, and its externally defined set A of divergent API methods. To handle constructs Carib does not define, we translate to Carib in the obvious way, conceptually desugaring them. For A, we assume a user-supplied list. Below, we describe how we realise writes and halts for Android bytecode. Realising these oracles is not enough: to apply the Cook analysis to actual programs, we also need to change the semantics of their potentially divergent constructs, such as loops and method calls. We achieve this via a program transformation that replaces each potentially divergent construct with parallel assignments of ⊥ to the lvalues it may write.
3.1. Soundly Identifying Potential Writes
Realising writes requires finding writable locations, both syntactic lvalues and what can be reached through them. We currently harvest syntactic lvalues from assignment statements without considering their feasibility. Computing reachable lvalues would require handling aliasing, which occurs when two lvalues refer to the same object. To account for the possibility of aliasing, the analysis uses lvalue representatives, where two lvalues are aliases if they have the same representative.
We want our Cook analysis (Section 4) to scale to large applications, so we need a sound and efficient handling of aliases. Instead of performing pointer analysis (Section 6.4), we use Sundaresan et al.’s approach (SundaresanHRVLGG00), as it offers a simple and scalable solution. Their approach is based on the observation that only objects of compatible types can be aliases in a type-safe programming language such as Java. Another advantage of this approach is its simplicity and ease of implementation. Sundaresan et al. use a flow-insensitive analysis because their main purpose is to keep track of object types. In our case, we need to track value transfer between variables, so we take inter-variable flow relations into account. For example, in the code x := y; y := z, a flow-insensitive analysis captures the dataflow from y to x and from z to y, but also the spurious flow from z to x. Our approach does not include the spurious flow. Despite this additional precision, we still benefit, in terms of scalability, from Sundaresan et al.’s alias handling.
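The spurious-flow example above can be made concrete. This is a minimal Python sketch of our own (not the tool’s code) contrasting the two treatments of x := y; y := z, with each statement encoded as a (target, source) pair:

```python
def flow_insensitive(stmts):
    """Ignore statement order: close the value-flow edges transitively."""
    edges = set(stmts)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(edges):
            for (c, d) in list(edges):
                if b == c and (a, d) not in edges:
                    edges.add((a, d))
                    changed = True
    return edges

def flow_sensitive(stmts):
    """Respect statement order: a target inherits what reaches its source."""
    reaches = {}
    for (tgt, src) in stmts:  # each statement is tgt := src
        reaches[tgt] = reaches.get(src, set()) | {src}
    return {(tgt, v) for tgt, vs in reaches.items() for v in vs}

code = [("x", "y"), ("y", "z")]  # x := y; y := z
```

The flow-insensitive closure reports the spurious flow z → x, while the order-aware version does not.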
Let L# denote L augmented with abstracted locations created from the types of the subject program. Formally, we define rep : L → L# to map each lvalue to its representative:

rep(x) = x
rep(o.f) = base(T, f).f
rep(a[i]) = part(a)[*]

where o is an object of type T, f is a field, a is an array, i is an index, and base(T, f) denotes the highest class in the type hierarchy of T that contains the field f.
In Carib (Section 2), an lvalue is either a variable, an array access, or a field access. In the absence of an array or field dereference, the mapping is simply the identity, rep(x) = x. The other two cases, a field or array access, are more involved. Two field accesses o.f and o′.f are aliases if o and o′ point to the same object. To handle this case, all potentially aliasing field accesses must have the same representative. In type-safe languages, like Java, o.f and o′.f can be aliases only if o and o′ belong to the same type hierarchy. While in principle, if o.f and o′.f are aliases, either suffices as the representative, for ease of identification we include in rep’s range representatives based on type names, specifically base(T, f).f.
An array access lvalue can alias for two reasons: reference and index. In the reference case, a[i] and b[j] can alias if a and b alias. In the index case, a[i] and a[j] alias if i = j. To take both into account, we perform a lightweight alias analysis that partitions array terms into sets of potential aliases. Given an array a, part(a) returns the representative of a’s alias partition. Defining the representative of a[i] as part(a)[*] solves the problem of reference-induced aliasing, but not index-induced aliasing. Tracking indexes may generate an unbounded number of terms when indices are modified inside a loop. Therefore, rep’s third case conservatively assumes all indices alias.
Using rep and lvals(s), the set of all syntactic lvalues in s, we approximate writes as

writes(s) = { rep(l) | l ∈ lvals(s) }
This realisation of writes assumes the analysis can access s’s internals. This assumption does not hold, in general, for API calls, whose implementation can be externally defined in a black-box library. Such API calls are prevalent in real-world code. To handle them, we use a second instantiation of writes, described in Section 3.3 where we first use it.
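To illustrate the representative mapping, here is a hypothetical Python sketch; `base_class` and `part` are assumed stand-ins for the type-hierarchy lookup and the lightweight array-alias partition described above, and the tuple encoding of lvalues is ours:

```python
def rep(lval, base_class, part):
    """Map an lvalue to its alias-aware representative."""
    kind = lval[0]
    if kind == "var":                    # ("var", name): identity
        return lval
    if kind == "field":                  # ("field", obj_type, field_name)
        _, t, f = lval
        return ("field", base_class(t, f), f)
    if kind == "array":                  # ("array", array_name, index)
        _, a, _i = lval
        return ("array", part(a), "*")   # all indices conservatively alias
    raise ValueError(kind)

# Toy hierarchy: class B extends A, and field f is declared in A.
base = lambda t, f: "A" if (t in {"A", "B"} and f == "f") else t
# Toy partition: arrays a1 and a2 may alias, so they share a part.
part = lambda a: {"a1": "p0", "a2": "p0"}.get(a, a)
```

Field accesses through A and B collapse to one representative, and any two accesses of potentially aliasing arrays collapse regardless of index.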
3.2. Identifying Divergent Loops and Calls
Realising Carib’s semantics for real bytecode demands that we first identify loops. Since bytecode permits unstructured loops, we implemented a loop-detection analysis that searches for loops in the control flow graph of each method; this analysis, an optimised depth-first traversal of the control flow graph, was experimentally shown to outperform existing alternatives (WeiMZC07). It detects both simple and complex loops, including nested loops and those constructed using gotos. A purely syntactic method, this analysis is complete (finds all loops) but unsound, in that it may report infeasible loops.
Having identified loops, we turn to realising halts. Despite the undecidability of loop termination in general, we can statically determine that some loop forms terminate. While simple, our oracle realisation is not purely syntactic: it simplifies a loop before checking its form. Let l be a loop that does not contain any nested loops, but otherwise has an arbitrary body. The execution of loop l can be expressed as a sequence of single-iteration cycles. Our oracle concludes that l terminates iff both of the following hold: 1) each cycle increments a counter, and 2) this counter is bounded in each cycle. These two conditions guarantee the existence of an increasing ranking function that is bounded from above, which is sufficient to ensure loop termination (PodelskiR04). This instantiation of halts safely and conservatively underapproximates the set of terminating loops.
Our corpus of Android apps includes 627,423 loops. Of these, our oracle shows that 330,894 (53%) terminate. Despite its simplicity and conservatism, the oracle identifies a large number of terminating loops.
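The counting-loop criterion can be sketched as follows. This is an illustrative Python model under an assumed loop encoding (a comparison condition plus a flat list of body effects), not the oracle’s actual implementation; like the oracle, it answers True only when it can see an incremented, bounded counter.

```python
def halts(cond, body):
    """Conservative check: some condition variable is incremented every
    iteration and its bound is never written in the body."""
    op, counter, bound = cond
    if op != "lt":
        return False      # unrecognised loop form: assume it may diverge
    incremented = any(s == ("incr", counter) for s in body)
    bound_stable = all(not (s[0] == "assign" and s[1] == bound)
                       for s in body)
    counter_stable = all(not (s[0] == "assign" and s[1] == counter)
                         for s in body)
    return incremented and bound_stable and counter_stable
```

A for-like loop (`i < n` with `i` incremented) is accepted; a loop that never advances its counter, or that rewrites its own bound, is conservatively rejected.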
In Section 2.1, we use halts to populate R, the set of divergent recursive methods. halts works only on loops, which necessitates a separate mechanism for R. We conservatively build a call graph for the program using the class hierarchy approach (SundaresanHRVLGG00), which provides a conservative approximation of the runtime types of receiver objects. We then identify recursive methods as those nodes belonging to strongly connected components in the call graph. To this end, we use Tarjan’s algorithm for detecting strongly connected components (Tarjan72).
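The recursive-method detection just described amounts to finding cycles in the call graph. A minimal Python sketch using Tarjan’s algorithm (function and variable names are ours, not the tool’s):

```python
def recursive_methods(calls):
    """Flag every method in a non-trivial SCC, or with a self-call."""
    graph = {}
    for caller, callee in calls:
        graph.setdefault(caller, []).append(callee)
        graph.setdefault(callee, [])
    index, low, on_stack, stack = {}, {}, set(), []
    sccs, counter = [], [0]

    def strongconnect(v):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:           # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            strongconnect(v)
    rec = {v for scc in sccs for v in scc if len(scc) > 1}
    rec |= {c for c, d in calls if c == d}   # direct self-recursion
    return rec

calls = [("f", "g"), ("g", "f"), ("g", "h"), ("k", "k")]
```

Mutually recursive f and g and self-recursive k are flagged; h, called from a cycle but not part of one, is not.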
3.3. Rewriting Divergent Constructs
To track divergence in a program, we identify divergent constructs and replace them with assignments of ⊥ to every lvalue representative potentially modified. We do this as a source-to-source transformation T, which maps each statement s to a statement T(s) that explicitly includes all necessary assignments of ⊥. As Figure 5 depicts, we define T using three rewriting rules: loop, rec, and api. The loop rule replaces the loop with a parallel assignment of ⊥ to all lvalue representatives that the loop modifies; it uses writes to identify these lvalues. The rec rule does the same for calls to recursive methods.
Handling loops and recursion makes us complete with respect to our language Carib (Section 2), assuming that a program is self-contained (i.e., makes no API calls). Most programs, however, make API calls to external libraries. Because an external library is a black box, we cannot syntactically determine which of its parameters it may write through. Thus, we need a second instantiation of writes to handle API calls. This second instantiation determines all the lvalue representatives potentially modified through a given formal parameter. For arrays and objects, we must consider their fields. Let fields(x) be the set of fields of a variable x, where the empty set denotes a scalar. Given the formal parameter x, its reachable lvalues are the set of lvalue representatives given by the function reach, defined as follows:

reach(x) = ⋃ f ∈ fields(x) ( {rep(x.f)} ∪ reach(x.f) )
reach is irreflexive because, under Carib’s call-by-value semantics, actual parameters are immutable. In Figure 5, the api rule uses reach to handle API calls, where the rewritten assignment targets all lvalue representatives reachable from the call’s actual parameters, as determined by reach.
As an illustration of reach, consider the code shown in Figure 6. To simplify the presentation, the code examples in the paper use a more common Java-like syntax. The actual analysis is applied to bytecode, which is closer to Carib’s Jimple-like syntax of Figure 2; however, the more Java-like syntax better communicates the intuition behind our technique. In Figure 6, method m has a single formal parameter a and accesses its b field, and subsequently b’s field x through the variable tmp. Thus, from a we reach a.b and tmp.x, which, assuming A has no superclasses, yields reach(a) = {A.b, A.B.x}.
Our transformation T applies the rules loop, rec, and api. We naturally extend T to an entire program P, where each divergent construct s in P is replaced with T(s) in T(P).
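The reach closure of Section 3.3 can be sketched as a recursive walk over a type’s fields. The field map below is an assumed encoding for illustration, chosen to mirror the Figure 6 example (class A with a field b of class B, which has a scalar field x):

```python
def reach(var_type, fields, seen=None):
    """Representatives of every field transitively reachable from a
    parameter of the given type; `seen` guards against cyclic types."""
    seen = set() if seen is None else seen
    out = set()
    for f_type, f_name in fields.get(var_type, []):
        r = (var_type, f_name)        # representative of x.f
        if r in seen:
            continue
        seen.add(r)
        out.add(r)
        out |= reach(f_type, fields, seen)
    return out

# Type -> list of (field type, field name); scalars have no fields.
fields = {"A": [("B", "b")], "B": [("int", "x")], "int": []}
```

Starting from a parameter of type A, the closure collects the representatives of a.b and of b’s field x, matching the {A.b, A.B.x} result of Figure 6.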
4. Cook: Discovering SubTuring Islands
This section introduces our analysis algorithm Cook, named after the British explorer Captain Cook. Starting from the basic knowledge that divergent constructs are clearly part of the swamp, we want to analyse their impact on other parts of the program. Let us write ℓ:x to refer to variable x at program location ℓ (e.g., at a given line number). We also write depend(ℓ₂:y, ℓ₁:x, s) to indicate that, in the scope of statement s, variable y at location ℓ₂ depends on variable x at location ℓ₁. In other words, modifying x at ℓ₁ may modify y at ℓ₂. We overapproximate the semantics of Figure 4 with respect to divergence propagation via the following rule:

  ℓ₁:x = ⊥    depend(ℓ₂:y, ℓ₁:x, s)
  ────────────────────────────────── (divergence propagation)
             ℓ₂:y = ⊥

This suggests an approach for identifying subTuring islands by applying a dependency analysis whose goal is to assign ⊥ to variables that depend on other divergence-affected variables. Broadly speaking, our analysis approximates the dependency relation induced by the program over its variables, yielding an overapproximation of the swamp. Methods not in the swamp make up the subTuring islands, which we thus conservatively underapproximate.
Figure 7 overviews Cook’s workflow and components. Cook takes as input a transformed program whose divergent constructs have been rewritten. Cook outputs a report indicating which methods are subTuring and which fall into the Turing swamp.
While a subTuring island can be any code region, the islands we consider in the remainder of the paper are methods. Cook implements a bottomup interprocedural dependency analysis. It consists of two fixpoint computations. The outer computation, Explore (Algorithm 1), operates over the whole program and calls the inner computation Landfall (Algorithm 2), to compute facts for methods. In what follows, we describe each algorithm in detail.
4.1. Explore: Interprocedurally Searching for SubTuring Islands
Starting from the transformed program, Cook performs an interprocedural taint analysis that propagates divergence. Cook assigns a method to the swamp if it uses a tainted variable when called in a divergence-free state. Thus, Cook considers only taints produced by the method itself or by a method it transitively calls.
In subTuring analysis of termination, nested loops (and recursion) can propagate ⊥ outwards, but enclosing loops cannot propagate it inwards. If nontermination were defined to propagate inwards, the analysis of islands would often be trivial and useless. For example, a loop-free reactive program encased in a single nonterminating loop would often simply become ‘all swamp’. That would not be helpful for analysis: the body is loop free and so always terminates. It can be analysed as a terminating island of code, in isolation from its surrounding loop.
In such a reactive system, figuratively speaking, the program is a single large ‘castle’ on an island surrounded by a ‘moat’ of swamp. Such a ‘swamp castle’ does not, itself, fall into the swamp. Pragmatically, this means that we could (and, we argue, should) analyse and reason about the body of such a reactive system (which is loop free) in a very different way from the way in which we would reason about it as a whole component in a larger system. However, for our Cook analysis, the fact that taints do not propagate from the calling context means we cannot use an off-the-shelf solution (Flowdroid).
Cook’s output is the set of subTuring methods. Cook is interprocedural and needs the program’s call graph. Object-oriented languages, in general, have many features, such as method overriding, that make constructing an exact call graph at compile time impossible. Thus, Cook overapproximates the call graph using a class hierarchy approach (SundaresanHRVLGG00) that conservatively approximates the runtime types of receiver objects. For an object o having a declared type T, its estimated types will be T plus all the subclasses of T. If T is an interface, then its estimated types are all the classes implementing it and the classes derived from them. We use the notation T′ ⪯ T to represent the inheritance relation between classes (types); T′ ⪯ T means that T′ is a subclass of T. This relation is reflexive, thus T ⪯ T. Given an object o, the function types(o) returns all the types that o can potentially have at runtime. If the declared type of o is the class T, then we have

types(o) = { T′ | T′ ⪯ T }

Let the function impl(I) return all classes implementing interface I, including the implementations of subinterfaces of I. If the declared type of o is an interface I, then we have

types(o) = { T′ | ∃C ∈ impl(I). T′ ⪯ C }
This means that we take into account all the classes implementing I, the ones implementing subinterfaces of I, and their subclasses. For a method invocation m(o, a₁, …, aₙ), the set of possible resolutions of the virtual method m at runtime is given by

resolve(m, o) = { C.m | C ∈ types(o) ∧ C defines or inherits m }
We use a class name as a prefix to distinguish different virtual methods: we write C.m to indicate that method m is defined in class C, and we use s ∈ m to stipulate that statement s appears in method m. Finally, the call graph of a program P is given by

CG(P) = { (C.m, C′.m′) | C ∈ P ∧ s ∈ C.m ∧ s invokes m′(o, …) ∧ C′.m′ ∈ resolve(m′, o) }

By C ∈ P, we mean that the class C is defined in the program P. Hence the call graph represents the set of all possible (caller, callee) pairs belonging to the given program.
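Class-hierarchy resolution can be sketched as follows. `subclasses` and `defines` are assumed inputs derived from the program, and for simplicity this sketch only reports classes that declare the method themselves (real class-hierarchy analysis also accounts for inherited definitions):

```python
def possible_types(declared, subclasses):
    """Declared type plus all of its (transitive) subclasses."""
    out = {declared}
    for c in subclasses.get(declared, []):
        out |= possible_types(c, subclasses)
    return out

def resolve(method, declared, subclasses, defines):
    """Possible (class, method) targets of a virtual call on a receiver
    of the declared type."""
    return {(c, method)
            for c in possible_types(declared, subclasses)
            if method in defines.get(c, set())}

# Toy hierarchy: B extends A, C extends B; A and C each define m.
subclasses = {"A": ["B"], "B": ["C"]}
defines = {"A": {"m"}, "C": {"m"}}
```

A call through a receiver declared as A conservatively resolves to both A.m and the override C.m.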
Leveraging the approximate call graph, Algorithm 1 implements Explore, Cook’s interprocedural algorithm. Explore takes a program transformed by T (Section 3). It initialises a worklist to hold all the methods found in the program (Line 1) and associates an empty summary with each method (Lines 4–5). The swamp is also initially empty (Line 3). A summary for each method m is then computed by calling Landfall (Line 8), described below. Using the facts returned by Landfall, Line 9 tests if m belongs in the swamp. It does when its summary contains at least one element of the form (l, ⊥), meaning that Cook cannot safely, statically determine that m terminates. If this is the case, Explore adds m to the swamp. The function locals returns the set of lvalue representatives corresponding to a method’s local variables. It is useless to keep such elements in a summary; Line 11 discards them. If m’s summary has changed (Line 14), its entry is updated and its callers are placed on the worklist (Line 15). Finally, on Line 16, Explore returns the set of subTuring methods as the complement of the swamp with respect to the set of all program methods.
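Explore’s outer fixpoint can be sketched in a few lines of Python. Here `summarise` stands in for Landfall, "BOT" stands in for ⊥, and the toy summaries are illustrative inputs of ours, not output of the tool:

```python
def explore(methods, callers, summarise):
    """Outer fixpoint: resummarise methods until summaries stabilise;
    requeue a method's callers whenever its summary changes."""
    worklist = list(methods)
    summaries = {m: frozenset() for m in methods}
    swamp = set()
    while worklist:
        m = worklist.pop()
        new = summarise(m, summaries)
        if any(dep == "BOT" for (_, dep) in new):
            swamp.add(m)                 # a fact (l, ⊥): m may diverge
        if new != summaries[m]:
            summaries[m] = new
            worklist.extend(callers.get(m, []))
    return set(methods) - swamp          # the subTuring methods

# Toy program: g diverges; f's result depends on g's; h is self-contained.
def summarise(m, summaries):
    if m == "g":
        return frozenset({("ret", "BOT")})
    if m == "f":
        return frozenset(summaries["g"])
    return frozenset({("ret", "x")})     # h

callers = {"g": ["f"]}
```

Once g is summarised as divergent, f is requeued, picks up the (ret, ⊥) fact, and joins g in the swamp; only h is reported subTuring.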
4.2. Landfall: Cook’s Intraprocedural Analysis
Landfall (Algorithm 2) is an intraprocedural analysis. It approximates the dependence relation induced by a given method over program variables. It uses the lifted set of lvalue representatives L# ∪ {⊥} and an abstract interpretation over the domain D representing the powerset of pairs of lvalue representatives:

D = P( (L# ∪ {⊥}) × (L# ∪ {⊥}) )

where a pair (x, y) is taken up to representatives, i.e., as (rep(x), rep(y)). Each pair (x, y) in D means that x depends on y, with the use of rep taking aliasing into account. We call the pair (x, y) a fact. Furthermore, the element (x, ⊥) expresses that we cannot rule out the possibility that x might be affected by divergence.
Landfall computes the transitive closure over elements from the domain D with respect to the statements of a method using two auxiliary functions: the control-dependence function ctrl_dep and the data-dependence function data_dep. The function ctrl_dep captures control dependencies created by conditional statements. For example, consider the code shown in Figure 8. If in this example we only account for data dependencies, we conclude that variable y only depends on z, errantly omitting x. However, if x is affected by divergence, we need to propagate this fact to y.
Before describing how we compute ctrl_dep, we introduce relevant terminology. Each method in the program is represented by a control flow graph (CFG), a directed graph G = (N, E) where N is the set of nodes and E the set of edges. Each node represents either an assignment or a branch condition. The edges represent control flow between program statements. In Carib, we map each assignment to a node with one successor and each conditional statement to a node with two successors, representing the true and false branches. For a CFG node n, succ(n) is the set of successors of n, pred(n) its predecessors, and stmt(n) the statement n represents. Finally, each CFG includes two special nodes: entry is the CFG’s unique entry node, which has no predecessors, and exit is its unique exit node, which has no successors.
To compute control dependencies, we use the well-established approach of Ferrante et al. (FerranteOW87), which we denote as ctrl_dep(CFG, ℓ, IN), where CFG is a control flow graph, ℓ a location, and IN a map associating locations with sets of facts. This function returns the set of facts induced by control dependencies for location ℓ; it includes transitive control dependencies.
Turning to the data dependencies, the function data_dep models the effect of program statements on elements of the abstract domain D. For a given set of input facts IN and statement s, data_dep is defined as follows:

data_dep(IN, s) = (IN ∖ kill(s)) ∪ gen(s) ∪ { (x, z) | (x, y) ∈ IN ∧ (y, z) ∈ gen(s) }

where gen(s) is the set of dependencies locally induced by statement s; for example, gen(x := y + z) yields {(x, y), (x, z)}. The function data_dep transitively extends the relation represented by the input facts with the relation induced by the statement. It also excludes (kills) facts that are no longer valid after the assignment. For example,

data_dep({(y, x)}, x := z) = {(x, z), (y, z)}

Since the assignment modifies x, the fact (y, x) no longer holds. Landfall transitively obtains the fact (y, z) from the input fact (y, x) combined with (x, z) from the assignment statement.
We provide the definitions of the generated and killed dependencies for Carib's basic statements in Table 1. Assignments to simple variables (the first five cases) result in dependencies expressing how the assignment's left-hand side depends on the identifiers appearing in its right-hand side, except when the right-hand side is a constant, which does not introduce any dependencies. When the right-hand side is an object field or an array reference, we use its representative to take aliases into account. The return statement is modeled as the assignment ret := id, where ret is a special variable (see Figure 4) used to store and retrieve the method's return value. In all these cases, we kill input facts expressing dependencies involving the assignment's left-hand side.
In the case of an assignment to a field or array element, we use its representative to take aliases into account. To preserve soundness, we do not kill any facts: a representative overapproximates the possible aliases, so the updated lvalue may or may not be an actual alias of a given fact. For a call to a method m, we replace the formal parameters with the corresponding actuals in m's summary, which is a set of facts expressing the dependencies induced by m. We also replace the special variable ret with r, the variable receiving the call's result. Landfall computes method summaries iteratively, on-the-fly, when demanded by Explore. Finally, for an assignment whose right-hand side is the divergence value, we keep the fact expressing that the assigned variable is affected by divergence, because the purpose of our analysis is to track the propagation of divergence.
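A minimal sketch of a data-dependence transfer function in the style described above may help make this concrete. The encoding is an assumption for illustration: facts are pairs (target, source) read as "target depends on source", and the names gen and data_dep are our reconstruction, not Landfall's actual code.

```python
# Hedged sketch of a data-dependence transfer function in the style
# described above. Facts are pairs (target, source): "target depends on
# source". The encoding and helper names are illustrative assumptions.

def gen(stmt):
    """Dependencies locally induced by an assignment lhs := rhs."""
    lhs, rhs_ids = stmt  # rhs_ids: identifiers on the right-hand side
    return {(lhs, v) for v in rhs_ids}

def data_dep(facts, stmt):
    lhs, _ = stmt
    local = gen(stmt)
    # Transitive extension: if lhs depends on y and y depends on z,
    # then lhs depends on z.
    transitive = {(t, z) for (t, y) in local for (y2, z) in facts if y == y2}
    # Kill input facts whose target is overwritten by this assignment.
    surviving = {(t, s) for (t, s) in facts if t != lhs}
    return surviving | local | transitive
```

Running this on the example from the text, data_dep({(y, z), (x, w)}, x := y) kills (x, w), generates (x, y), and transitively derives (x, z).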
Table 1. The dependency facts generated and killed by each of Carib's basic statement forms.
Landfall uses the data- and control-dependence functions in a standard worklist algorithm. The input and output sets of all nodes are initialized to the empty set on Lines 4-5. Then, the entry node's input is created on Line 6. New facts are produced by simulating the effect of program statements using the transfer function data_dep (Line 12) and accounting for control dependencies (Line 13). When the set of facts associated with a given location changes, all successors of that location are explored again (Lines 14-16). The algorithm is guaranteed to terminate because the set of locations is finite and so is the set of facts. Once a fixpoint is reached, the algorithm returns the set of facts accumulated at the exit node.
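The worklist iteration described above can be sketched as follows. The CFG encoding and the transfer function are illustrative assumptions; a real implementation would plug in Landfall's data- and control-dependence transfer functions.

```python
# Hedged sketch of the worklist fixpoint described above. The CFG
# encoding (succ: node -> successor list) and the transfer function are
# illustrative assumptions, not Landfall's actual implementation.
from collections import deque

def fixpoint(succ, entry, transfer, init):
    """Iterate transfer over the CFG until the per-node facts stabilize."""
    pred = {n: [] for n in succ}
    for n, ms in succ.items():
        for m in ms:
            pred[m].append(n)
    out = {n: set() for n in succ}     # all fact sets start empty
    work = deque(succ)                 # visit every node at least once
    while work:
        n = work.popleft()
        inp = set(init) if n == entry else set()  # entry node's input
        for p in pred[n]:
            inp |= out[p]              # join the predecessors' outputs
        new_out = transfer(n, inp)
        if new_out != out[n]:          # facts changed: revisit successors
            out[n] = new_out
            work.extend(succ[n])
    return out
```

Termination follows for the same reason as in the text: the node set is finite, the fact sets are drawn from a finite domain, and a node is re-enqueued only when its output grows.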
4.3. Implementation
We implemented our approach for subTuring island identification in a tool called Endeavour, which is written in Python. Endeavour takes as input an Android application and returns a report that includes the analysis result together with other statistics. Endeavour accepts Android apps directly in binary (APK) format. It uses Androguard (https://github.com/androguard/androguard) to parse and decompile the APK files as well as to generate the control flow graphs; hence, Endeavour does not require source code. We use our own intermediate representation for instructions, which has a Lisp-like format. One key phase in Endeavour is loop extraction (Section 3.2), which extracts a list of loops, each identified by its header together with the nodes it contains. It also obtains the hierarchical (domination) relation between loops. Finally, Endeavour implements the overapproximation of the call graph based on the class hierarchy approach (SundaresanHRVLGG00) (Section 4.1). Endeavour is available at to.be@posted.post.final.acceptance.
5. Experimental Results
This section empirically investigates six research questions involving subTuring islands, henceforth abbreviated STislands. We start by overviewing the application corpora that make up our experimental subjects. The investigation then begins by considering the prevalence of STislands: simply put, if STislands are rare then their study is of little practical value. We next take a deeper look into the main causes of divergence. Finding that API methods are the dominant source, we consider the impact of safe listing subsets of the API methods. Then, turning to two of the many applications of STislands, we consider first the relationship between bug density in the swamp and on the STislands, and second the percentage of verification conditions, such as array-bound violations and null-object dereferences, that occur on STislands. Finally, we consider the runtime efficiency of our tool Endeavour.
In the experiments, unless otherwise stated, we make the following assumptions. First, we discard getters and setters, as we assume that they are implemented in a standard way making them trivially subTuring. In addition, we initially assume that all API calls diverge, binding divergence to all variables they may write and to all variables that depend on them.
We study two sets of apps: a large dataset, app_bin, of over one thousand apps for which source code is unavailable, and a smaller set, app_src, of ten apps for which full source code is available. Both corpora are composed of a range of real-world production apps to ensure that our empirical findings have high external validity. The app_bin dataset is composed of 1100 Android applications uniformly selected from more than 600000 apps collected from the Androzoo repository (https://androzoo.uni.lu). Androzoo apps have diverse origins, including the Google Play store, which is the predominant source of the apps we study. Our set of 1100 apps contains more than 2 million methods. The app_src dataset is composed of ten applications selected from Github under certain criteria that we describe later. We only consider this dataset in the experiment described in Section 5.4, which requires the app source code. In all other experiments, we consider the larger app_bin set.
5.1. Landscape of STislands
First of all, it is important to know the proportion of code that resides within STislands. The answer to this question indicates the amount of code over which we can reason precisely. A significant proportion means that it is worth investing in the improvement of static analysis, as the benefit may be substantial. So the first research question we address is the following:
RQ1: What is the proportion of code occupied by STislands?
The results using app_bin are summarized in Figure 9, where the left boxplot shows the distribution of STmethod percentages.
Finding 1a: Overall, the average percentage of STmethods in an app is approximately 55%, hence, the majority of methods are subTuring.
To study the impact of code size on our results, we want to exclude trivial methods. Defining trivial is hard; we conservatively consider methods of fewer than ten lines as trivial. To convert lines into bytecode instructions, we averaged each method's length in bytecode instructions over its non-comment source-code length and found that, on average, each line of source code generates three bytecode instructions. Thus, we consider a method trivial if it includes fewer than thirty bytecode instructions. The results when considering only nontrivial methods are shown on the right of Figure 9.
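The line-to-bytecode conversion above amounts to a simple averaging step, sketched below; the per-method counts are hypothetical stand-ins for measurements over the corpus.

```python
# Hedged sketch of deriving the triviality threshold described above.
# The per-method measurements are hypothetical placeholder data.
methods = [
    {"bytecode_insts": 90, "source_lines": 30},
    {"bytecode_insts": 33, "source_lines": 11},
    {"bytecode_insts": 27, "source_lines": 10},
]

# Average expansion factor: bytecode instructions per source line (~3).
ratios = [m["bytecode_insts"] / m["source_lines"] for m in methods]
avg_ratio = sum(ratios) / len(ratios)

TRIVIAL_SOURCE_LINES = 10
threshold = round(avg_ratio) * TRIVIAL_SOURCE_LINES  # 30 instructions

def is_trivial(method):
    """A method is trivial if it falls below the instruction threshold."""
    return method["bytecode_insts"] < threshold
```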
Finding 1b: Discounting trivial methods, the percentage of STmethods is 22%, which, while lower than the overall average, still represents a significant portion of the code.
While the percentage of STmethods drops, it remains significant, representing almost a quarter of each app. Moreover, our analysis is both sound and efficient; hence this percentage underestimates the true proportion of code that lies in nontrivial subTuring islands. A more precise but less efficient analysis can only ever uncover additional subTuring methods. This result thus underscores the value of investing in static analysis tools specialized to exploit STislands.
5.2. Causes of Divergence
Understanding the causes of divergence informs us about the prevalent causes of precision loss. For example, if it turns out that a certain language construct is the dominant cause of divergence, then we might want to give it greater attention in future work. Therefore, we seek an answer to the following research question:
RQ2: What are the main causes of divergence?
To answer this question, we refined our analysis by extending the abstract domain with an element indicating the cause of divergence: API call, loop, or recursive method.
Based on this refinement, we classify the sources of divergence as follows:

API call: 76%    loop: 13%    recursion: 11%

We can see that over three quarters of the divergence is due to library API calls. This suggests that a more precise modelling of API calls is likely to improve the precision of a given static analysis. We experimentally investigate this hypothesis in the next section.
5.3. API Safe Listing
Figure 10. (a) All methods. (b) Only methods with more than 30 bytecode instructions.
Cook is very conservative in that it assumes that all API calls cause divergence. In practice, many called API methods have quite well-understood and documented behaviour, making it plausible to assume that calls to such API methods are not a source of divergence. In this section, we test the impact of this possibility via the following research question:
RQ3: How does a more precise modelling of APIs impact the analysis?
We define a safe list of the most frequently used APIs, which are assumed not to induce divergence. Among the selected APIs are methods from the Java standard library and some frequently used Android API methods. Under this setting, we repeat the experiments of Section 5.1, varying the size of the API safe list. Results are shown in Figure 10: Figure 10a shows the percentage of STmethods per app for different sizes of the API safe list, while Figure 10b considers only methods with more than 30 bytecode instructions. Results when using an empty API safe list repeat the data shown in Figure 9; we include them as a baseline. At the other end, placing all API methods on the safe list allows us to investigate the impact for a developer who seeks to focus the analysis solely on his or her own code.
Finding 3: For a safe list containing just the 5% most frequently used APIs, the average percentage of STmethods grows to almost 80% when all methods are considered, and to just over 50% when only methods containing more than 30 bytecode instructions are considered.
Here, a safe list of only the 5% most frequently used APIs yields a substantial increase in STmethods. Interestingly, increasing this to 10% has minimal impact, which may reflect the fact that API call frequencies tend to follow a power-law distribution. Finally, including all APIs on the safe list causes 88% of all methods and 66% of all nontrivial methods to be STmethods. The trend here hints at the value of techniques such as providing formal summaries for the common API methods.
5.4. Distribution of Bugs over STIslands
It is interesting to check whether there is a correlation between bugs and STislands. We address this possibility in the following research question:
RQ4: Is there a significant difference in the bug distribution in the swamp compared to the STislands?
Investigation of this research question requires application source code; thus we make use of the app_src collection, which was assembled under the following constraints:

Open source: we need the code of the application as well as the corresponding repository to perform the experiment.

Repository history: to rule out simple weekend projects.

Nontrivial size: to rule out small toy applications.

Number of application installations: we want the apps to have real users, thereby attesting to their practical use.
The resulting app_src collection includes the ten realworld applications shown in Table 2.
We compute bug density for STmethods and swamp methods using the following steps:

To identify bugs and their corresponding locations, we use a heuristic based on a bag of words: we check for commits whose messages contain keywords such as "bug", "fix", etc. in the git repository of each application. We call such commits bug-fix commits. A buggy line is any line removed, added, or modified by a bug-fix commit. A method is buggy if it contains a buggy line. We assume that a single bug is associated with a single commit and write bugs(m) to denote the number of bugs associated with method m.

As our analysis is at the bytecode level, we compile the original source code of each app considered to obtain a binary APK file to analyse.

Finally, we compute bug density for STmethods and swamp methods. The bug density for an application A, density(A), is defined as

    density(A) = (1/n) * sum over methods m in A of bugs(m) / |m|

where |m| is the number of lines of code in method m and n the number of methods in A. We denote the bug density over the STmethods and over the swamp methods of A as density_ST(A) and density_swamp(A), respectively.
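The steps above can be sketched as follows. The keyword bag and the per-method records are hypothetical stand-ins; a real implementation would mine them from each app's git log.

```python
# Hedged sketch of the bug-density computation described above.
# The keyword bag and method records are hypothetical placeholders
# for data mined from a real git repository.
BUG_KEYWORDS = ("bug", "fix", "crash", "issue")

def is_bugfix(message):
    """Bag-of-words heuristic: does the commit message mention a bug?"""
    msg = message.lower()
    return any(k in msg for k in BUG_KEYWORDS)

def bug_density(methods):
    """Average per-method bugs-per-line, as defined in the text.
    methods: list of dicts with 'bugs' (count) and 'loc' (lines)."""
    n = len(methods)
    return sum(m["bugs"] / m["loc"] for m in methods) / n

# Hypothetical per-method records for the two partitions.
st_methods = [{"bugs": 0, "loc": 12}, {"bugs": 1, "loc": 40}]
swamp_methods = [{"bugs": 2, "loc": 25}, {"bugs": 3, "loc": 60}]
```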
app            LOC
bitcoinwallet  23392
connectbot     26625
irccloud       57471
k9             123606
mgit           10919
orbot          18772
owncloud       63495
signal         92868
vlc            69976
worldpress     128433
Overall, in app_src there are 6906 STmethods comprising 475 KLoC with 1863 bugs, and 7417 swamp methods comprising 894 KLoC with 5317 bugs. We compare bugginess statistically using the nonparametric Wilcoxon test, first at the method level and then at the line level. The average bugs per method, 0.27 for STmethods and 0.72 for the swamp, are statistically different (). Because swamp methods tend to include more lines of code, we also compare the two using bugs per line. In this case, the 0.0265 for STmethods is again statistically less than the 0.289 for the swamp (). Table 2 breaks these bug densities out by program.
Finally, we use generalized linear models to investigate the question "How likely is a method to be buggy?", where a method is considered buggy if it contains one or more bugs. A method's bugginess forms each model's response variable. Generalized linear models enable us to consider multiple explanatory variables as well as binary response variables. In the first model, we use STisland as the sole explanatory variable. With an odds ratio of 2.07, the model predicts that a swamp method is over twice as likely to contain a bug when compared to an STisland method (). Including program as an additional explanatory variable, which enables the model to account for differences between programs, increases the odds ratio to 2.09. The impact of additionally including lines as an explanatory variable is negligible, with or without the program variable. Finally, it is interesting that there is no significant interaction between program and a method being an STmethod; thus, the likelihood of being an STisland method is independent of the program. This unexpected uniformity strongly supports the external validity of our findings.

Finding 4: The bug densities for STislands are statistically smaller than those of the swamp ().
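The odds-ratio comparison above can be illustrated with a 2x2 contingency table; the counts below are hypothetical placeholders, not the study's data.

```python
# Hedged sketch: odds ratio of bugginess, swamp vs. STisland methods.
# The counts are hypothetical placeholders, not the paper's data.
def odds_ratio(buggy_a, clean_a, buggy_b, clean_b):
    """Odds of group A containing a bug relative to group B."""
    return (buggy_a / clean_a) / (buggy_b / clean_b)

# Hypothetical counts: swamp 400 buggy / 600 clean;
# island 200 buggy / 800 clean.
or_swamp_vs_island = odds_ratio(400, 600, 200, 800)
```

A full analysis would instead fit a logistic regression with STisland, program, and lines as explanatory variables, from which the odds ratio is the exponentiated coefficient.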
From the above statistics, bug density tends to be higher in the swamp. This result further supports our suggestion to use the swamp as a hint for guiding bug search. In other words, one should allocate more of a limited budget (time, resources, etc.) to the swamp than to the STislands.
5.5. Finding Potential Errors
STislands are portions of code about which we can precisely answer whether a given property holds. We would like to investigate the presence of concrete properties, on which program safety relies, that fall into STislands. One such class of properties is the absence of runtime exceptions such as array out-of-bounds accesses and null-object dereferences. We address the following research question:
RQ5: What is the percentage of verification conditions related to detecting bound violations and null object dereference runtime errors that occur in STislands?
We studied the spread of these two potential runtime exceptions over STislands in our app_bin corpus of 1100 applications. We count all array accesses and object dereferences in the code and compute the proportion of the ones occurring in STmethods for each application.
The results, presented in Figure 11, show that just over one in three exceptions can be precisely checked at compile time because it lies on a subTuring island. This is a lower bound for our corpus of 1100 apps, because our determination of subTuring islands is a safe underapproximation. Moreover, as visible in the violin plot (Figure 11), the percentage of array accesses and object dereferences is around 80% for a notable number of apps.
Finding 6: A lower bound on the average percentage of subTuring array accesses and object dereferences in our corpus is 37%.
5.6. Analysis Performance
We have established that nontrivial portions of real-world Android app code lie in subTuring islands and have demonstrated empirically that this has implications for bug density and verification. Finally, we report on the computational cost of identifying subTuring islands using our approximation. While many other approximation techniques could be used, and should be explored in future work, it is useful to know whether at least one scalable analysis exists. Evidence that our approximation is computationally feasible, and therefore that a scalable, useful approximation to subTuring islands exists, further underscores the practical value of subTuring analysis.
RQ6: Can STislands be efficiently identified?
We measured Endeavour's analysis time from parsing an application to delivering its output on a 3.2GHz Intel Core i5 quad-core processor with 8GB of memory, running Linux. The results show that our approach is scalable to real-world applications.
Finding 7: Endeavour takes less than four minutes for even the largest applications studied, containing more than methods.
6. Related Work
Our analysis marries taint analysis with termination reification (as divergence). Taint analysis is a technique used in software security (Flowdroid; Enck:2014; WeiROR14; TrippPCCG13; GordonKPGNR15). The goal of taint analysis is to show the absence of information leaks from a set of given sources to a set of given sinks. It can be performed statically (Flowdroid; WeiROR14; TrippPCCG13; GordonKPGNR15) or dynamically (Enck:2014). Our bottomup interprocedural dataflow analysis is a flowsensitive taint analysis that takes into account implicit information flows due to control dependencies. In our case, sources are divergent constructs. Our work also relates to various other topics, including invariant generation, loop summarization, bounded model checking, termination analysis, strictness analysis and program slicing.
6.1. Loops
As loops are a key component in our study, we consider work from the literature aimed at their analysis.
Summaries.
Our modelling of potentially nonterminating loops consists of assigning divergence values to the variables they may modify. Loop summarization techniques allow one to infer loop-free code that soundly approximates a given loop.
Sharygina and Browne proposed a syntactic transformation for abstracting branches in loops in a UML dialect (design level) (SharyginaB03). Kroening and Weissenbacher proposed an approach based on associating recurrence equations with loop variables and then computing a closed form for each equation. Kroening et al. (KroeningSTTW08) proposed a related technique for replacing code fragments, including loops, with corresponding abstract transformers that play the role of summaries. Seghir proposed a lightweight technique for inferring loop summaries over array segments as well as simple variables using a set of inference rules (Seghir11). Xie et al. presented a technique for summarizing loops that contain multiple paths and manipulate strings, with conditions over string content (XieLLLC15). They further extended their work to support disjunctive reasoning (XieCLLL16). Loop summarization can be folded into our approach to increase the number of loops that can be statically determined to terminate by construction.
Invariants
One approach for reasoning about loops in the context of program verification is through loop invariants (Hoare69). Many verification tools rely on manually provided invariants (FlanaganLLNSS02; BarnettCDJL05; DahlweidMSTS09). However, the literature is rich in terms of approaches that automatically infer invariants in various domains: arithmetic (Karr76; MullerOlmS04; ColonSS03) (linear), (SankaranarayananSM04) (nonlinear), arrays (JhalaM07; GulwaniMT08; SrivastavaG09) and heaps (SagivRW02; PodelskiW05). Software model checkers attempt to build invariants automatically, during the verification process (BallRaj; HenzingerJMS02; ChakiETAL03; PodelskiRybalARMC; IvanicicSGG05; cksy2004), relying on a popular technique called predicate abstraction (GrafS97). We can use invariants to express state changes (transitions) by introducing fresh variables to symbolically model initial values of variables. Hence, similar to summaries, we can use them to express the effect of a given loop, which should improve our algorithm’s precision.
Termination
Termination is another issue related to loops. Knowing the after-state of a given loop is only possible when the loop terminates; therefore, we model the effect of potentially nonterminating loops by assigning a divergent value to potentially modified variables. The literature is rich with work on termination analysis (PodelskiRybalARMC; PodelskiR04; UrbanGK16; PodelskiR05; PodelskiR04LICS; CookPR05). So-called ranking functions (PodelskiR04; UrbanGK16) and transition invariants (PodelskiRybalARMC; PodelskiR05; PodelskiR04LICS; CookPR05) are among the key approaches proposed to show termination. Both express relationships over program states modeling the progress of variables. From a more pragmatic perspective, showing termination of loops via simple arguments has also been studied (FratantonioMBKV15). Integrating such loop analyses with our approach would help us mitigate precision loss.
Bounded Model Checking
Bounded model checking (BMC) is a technique that deals with loops in a systematic manner by simply unrolling (simulating) them (ClarkeKL04; FalkeMS13; Cordeiro10). The unrolling process may eventually result in a loop-free code fragment that exactly models the original loop's effect on program variables. Unfortunately, such an approach does not work for loops that are not explicitly bounded, as the unfolding process will not terminate. Nonetheless, we can combine BMC with our approach to improve reasoning precision by restricting its application to loops with explicit bounds and applying other techniques to those without.
6.2. Slicing
Program slicing is a technique proposed by Weiser (Weiser81) to extract a set of statements, called a slice, that influence a specified computation of interest, referred to as the slicing criterion. The semantics of the original program are preserved by the slice with respect to the slicing criterion. There has been a tremendous amount of work on slicing and its applications (BinkleyH04). While the original proposal statically defined a slice, a dynamic variant has been proposed as well (AgrawalH90). In the latter, a slice is a set of statements that affect the slicing criterion with respect to a particular input.
Slicing has been applied to various problems: program debugging (AgrawalDS93), testing (HarmanHLMW07), comprehension (KorelR98), reuse (CanforaLM98), and reengineering (RepsR95). While the original proposal is syntax-preserving (i.e., the statements of the slice are all taken from the original code), some variants are amorphous (HarmanD97), allowing changes to the program syntax as long as the program semantics are preserved with respect to the criterion. In the context of software model checking, path slicing was proposed to find the statements in a given path that are relevant to showing its (in)feasibility (JhalaM05). Slicing has also been used to reduce the number of interleavings in event-oriented applications (BlackshearCS15), and recently it has been combined with runtime analysis to extract values of variables that make an application difficult to statically analyse (SiegSME16).
Our approach shares with slicing the characteristic of relying on dependency analysis. Moreover, our analysis naturally yields subTuring slices (i.e., portions of the program that are subTuring). We obtain them by simply backtracking paths in the control flow graph of a given method and selecting statements that are not affected by divergent values.
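The backtracking described above can be sketched as a backward traversal that skips divergence-tainted statements. The CFG encoding and the divergent-node set are illustrative assumptions, not our tool's actual data structures.

```python
# Hedged sketch: collecting a "subTuring slice" by walking backwards
# over predecessor edges and keeping only statements not affected by
# divergent values. The encoding is an illustrative assumption.
def subturing_slice(pred, divergent, criterion):
    """pred: node -> list of predecessors; divergent: tainted nodes."""
    slice_nodes, work = set(), [criterion]
    while work:
        n = work.pop()
        if n in slice_nodes or n in divergent:
            continue                      # skip swamp statements
        slice_nodes.add(n)
        work.extend(pred.get(n, ()))      # backtrack in the CFG
    return slice_nodes
```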
6.3. Strictness Analysis
Similar to our approach, strictness analysis has been proposed to track divergence resulting from nontermination and from errors causing program crashes, such as division by zero. A function is said to be strict if it diverges whenever one of its parameters diverges. A variant of strictness analysis, joint strictness, takes parameter combinations into account: a function is jointly strict in a subset of its arguments if it diverges whenever all the arguments in the subset diverge. Mycroft proposed an approach to approximate the divergence relationship induced by a given function over its parameters and the result it returns (Mycroft80). The approach relies on an underlying forward abstract interpretation (CousotC77). A backward analysis has been implemented in the Glasgow Haskell Compiler to perform strictness analysis in a demand-driven fashion (tpdahaskell). Other forms of strictness analysis have been proposed in the literature; for example, Wadler and Hughes describe several projection-based strictness properties (WadlerH87), such as head-strictness and tail-strictness, refining the original basic definition.
However, a function being subTuring entails neither strictness nor the converse. Indeed, if a function always diverges regardless of its parameters, it is strict but not subTuring. On the other hand, the function f(x,y){if x return 1 else return y} is subTuring but not strict: it is subTuring because it does not contain any divergent construct, yet it is not strict because, when x is true, the function does not diverge even if y diverges.
6.4. Pointer Analysis
Pointer analysis aims at determining the set of memory locations a pointer may refer to during program execution. Two popular pointer analyses that constitute the basis of many other approaches are Steensgaard's (Steensgaard96) and Andersen's (Andersen94programanalysis). While Steensgaard's analysis does not take into account the direction of the flow of values induced by assignments, Andersen's approach models assignment direction. Therefore, Steensgaard's technique offers more scalability while Andersen's provides more precision. Das proposed an algorithm lying between Andersen's and Steensgaard's approaches (Das00); it is scalable and, at the same time, its precision is very close to Andersen's.
Lhoták and Hendren introduced the SPARK framework (LhotakH03) that offers building blocks for implementing various pointer analysis for Java.
Sridharan et al. proposed a pointer analysis variant which is suitable for environments with small time and memory budgets (SridharanGSB05). Their approach is demanddriven, i.e., performs only the work necessary to answer a query issued by a client.
Instead of applying a pointer analysis, we soundly handle aliases using the variable-representative idea inspired by Sundaresan et al. (SundaresanHRVLGG00) (Section 3.1). We plan to empirically study the impact of pointer analysis on Cook.
7. Conclusion
In this paper, we addressed the empirical question of how often a program analysis question has, in practice, an exact solution. To this end, we introduced subTuring islands, which are portions of code in which any question of interest is decidable. We provided a formal definition of subTuring islands and presented an algorithm for identifying such islands in applications. We have implemented our approach in a tool called Endeavour and applied it to a representative corpus of 1100 Android applications.
Our empirical study revealed that subTuring islands make up 55% of the methods in the 1100 Android apps studied. These results are not merely of theoretical interest, but have practical ramifications in software engineering. Our findings suggest that we can provide more precise assessments of test coverage; that we can expect more precise assessments of change impact analysis; and that we can hope for more precise slices, and thereby more precise reuse, better comprehension, and better reengineering interventions. For example, in the code on which we report, 37% of runtime-exception guards reside within subTuring islands. This means that an exact answer regarding the validity of these guards can be statically determined.