Understanding and Analyzing Java Reflection

06/14/2017
by   Yue Li, et al.

Java reflection has been increasingly used in a wide range of software. It allows a software system to inspect and/or modify the behaviour of its classes, interfaces, methods and fields at runtime, enabling the software to adapt to dynamically changing runtime environments. However, this dynamic language feature imposes significant challenges to static analysis, because the behaviour of reflection-rich software is logically complex and statically hard to predict, especially when manipulated frequently by statically unknown string values. As a result, existing static analysis tools either ignore reflection or handle it partially, resulting in missed, important behaviours, i.e., unsound results. Therefore, improving or even achieving soundness in (static) reflection analysis -- an analysis that infers statically the behaviour of reflective code -- will provide significant benefits to many analysis clients, such as bug detectors, security analyzers and program verifiers. This paper makes two contributions: we provide a comprehensive understanding of Java reflection through examining its underlying concept, API and real-world usage, and, building on this, we introduce a new static approach to resolving Java reflection effectively in practice. We have implemented our reflection analysis in an open-source tool, called SOLAR, and evaluated its effectiveness extensively with large Java programs and libraries. Our experimental results demonstrate that SOLAR is able to (1) resolve reflection more soundly than the state-of-the-art reflection analysis; (2) automatically and accurately identify the parts of the program where reflection is resolved unsoundly or imprecisely; and (3) guide users to iteratively refine the analysis results by using lightweight annotations until their specific requirements are satisfied.

1 Introduction

Java reflection allows a software system to inspect and/or modify the behaviour of its classes, interfaces, methods and fields at runtime, enabling the software to adapt to dynamically changing runtime environments. This dynamic language feature eases the development and maintenance of Java programs in many programming tasks by, for example, facilitating their flexible integration with third-party code and allowing their main behaviours to be configured, in a decoupled way, according to the deployed runtime environment. Due to these advantages, reflection has been widely used in a variety of Java applications and frameworks [Li et al. (2014), Zhauniarovich et al. (2015)].

Static analysis is widely recognized as a fundamental tool for bug detection [Engler et al. (2001), Naik et al. (2006)], security vulnerability analysis [Livshits and Lam (2005), Arzt et al. (2014)], compiler optimization [Dean et al. (1995), Sui et al. (2013)], program verification [Das et al. (2002), Blanchet et al. (2003)], and program debugging and understanding [Sridharan et al. (2007), Li et al. (2016)]. However, when applying static analysis to Java programs, reflection poses a major obstacle [Livshits et al. (2005), Li et al. (2014), Li et al. (2015), Smaragdakis et al. (2015)]. If the behaviour of reflective code is not resolved well, much of the codebase will be rendered invisible to static analysis, resulting in missed, important behaviours, i.e., unsound analysis results [Livshits et al. (2015)]. Therefore, improving or even achieving soundness in (static) reflection analysis, i.e., an analysis that statically infers the behaviour of reflective code, will provide significant benefits to all the client analyses mentioned above.

1.1 Challenges

Developing effective reflection analysis for real-world programs remains a hard problem, widely acknowledged by the static analysis community [Livshits et al. (2015)]:

“Reflection usage and the size of libraries/frameworks make it very difficult to scale points-to analysis to modern Java programs.” [WALA (WALA)];

“Reflection makes it difficult to analyze statically.” [Rastogi et al. (2013)];

“In our experience [Ernst et al. (2014)], the largest challenge to analyzing Android apps is their use of reflection …” [Barros et al. (2015)];

“Static analysis of object-oriented code is an exciting, ongoing and challenging research area, made especially challenging by dynamic language features, a.k.a. reflection.” [Landman et al. (2017)].

There are three reasons why this knotty problem is hard to untangle:

  • The Java reflection API is large and its common uses in Java programs are complex. It remains unclear which reflection methods an analysis should focus its effort on in order to achieve the desired analysis results.

  • The dynamic behaviours of reflective calls are mainly specified by their string arguments, which are usually unknown statically (e.g., with some string values being encrypted, read from configuration files, or retrieved from the Internet).

  • The reflective code in a Java program cannot be analyzed alone in isolation. To resolve reflective calls adequately, a reflection analysis often works inter-dependently with a pointer analysis [Livshits et al. (2005), Li et al. (2014), Li et al. (2015), Smaragdakis and Balatsouras (2015), Smaragdakis et al. (2015)], with each being both the producer and consumer of the other. When some reflective calls are not yet resolved, the pointer information that is currently available can be over- or under-approximate. Care must be taken to ensure that the reflection analysis helps increase soundness (coverage) while still maintaining sufficient precision for the pointer analysis. Otherwise, the combined analysis would be unscalable for large programs.
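The second challenge can be made concrete with a minimal sketch. The class `ConfigDrivenReflection`, the property key `handler.class` and the configuration text are hypothetical names chosen for illustration; they are not taken from the programs studied in this paper. The sketch only shows that the argument of `Class.forName()` is computed from external input, so an analysis that relies on string constants cannot tell which class is instantiated.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical example: the class to instantiate is named by configuration data.
final class ConfigDrivenReflection {

    // Creates an object of whichever class the configuration text names.
    static Object createHandler(String configText) {
        try {
            Properties props = new Properties();
            props.load(new StringReader(configText));
            // The class name is only known at runtime; statically it is an opaque String.
            String className = props.getProperty("handler.class");
            Class<?> clz = Class.forName(className);
            return clz.getDeclaredConstructor().newInstance();
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException("cannot create handler", e);
        }
    }

    public static void main(String[] args) {
        Object handler = createHandler("handler.class=java.util.ArrayList\n");
        System.out.println(handler.getClass().getName()); // prints java.util.ArrayList
    }
}
```

Changing the configuration text changes the allocated type without touching the code, which is exactly what makes the call site hard to resolve statically.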

As a result, most of the papers on static analysis for object-oriented languages, like Java, treat reflection orthogonally (often without even mentioning its existence). Existing static analysis tools either ignore reflection or handle it partially and ineffectively.

1.2 Previous Approaches

Initially, reflection analysis relied mainly on string analysis, especially when the string arguments to reflective calls are string constants, to resolve reflective targets, i.e., methods or fields reflectively accessed. Currently, this mainstream approach is still adopted by many static analysis tools for Java, such as Soot, Wala, Chord and Doop. However, as described in Section 1.1, string analysis will fail in many situations where string arguments are unknown, resulting in limited soundness and precision. As a static analysis, a (more) sound reflection analysis is one that allows (more) true reflective targets (i.e., targets that are actually accessed at runtime) to be resolved statically. In practice, any reflection analysis must inevitably make a trade-off among soundness, precision, scalability, and (sometimes) automation.

In addition, existing reflection analyses [Livshits et al. (2005), Bravenboer and Smaragdakis (2009), Smaragdakis et al. (2015), Barros et al. (2015), Li et al. (2016), Zhang et al. (2017)] cannot answer two critical questions that are raised naturally, in practice: Q(1) how sound is a given reflection analysis and Q(2) which reflective calls are resolved unsoundly or imprecisely? We argue for their importance as follows:

  • If Q(1) is unanswered, users would be unsure (or lose confidence) about the effectiveness of the analysis results produced. For example, a bug detector that reports no bugs may actually miss many bugs if some reflective calls are resolved unsoundly.

  • If Q(2) is unanswered, users would not have an opportunity to contribute to improving the precision and soundness of the analysis results, e.g., by providing some user annotations. For some client analyses (e.g., verification), soundness is required.

1.3 Contributions

In this paper, we attempt to lift the veil of mystery surrounding Java reflection and to change the informed opinion in the program analysis community about static reflection analysis: “Java reflection is a dynamic feature which is nearly impossible to handle effectively in static analysis”. We make the following contributions:

  • We provide a comprehensive understanding of Java reflection through examining its underlying concept (what it is), interface (how its API is designed), and real-world usage (how it is used in practice). As a result, we will provide the answers to several critical questions, which are somewhat related, including:

    • What is reflection, why is it introduced in programming languages, and how is Java reflection derived from the basic reflection concept?

    • Which methods of the Java reflection API should be analyzed carefully and how are they related, as the API is large and complex (with about 200 methods)?

    • How is reflection used in real-world Java programs and what can we learn from its common uses? We have conducted a comprehensive study about reflection usage in a set of 16 representative Java programs by examining their 1,423 reflective call sites. We report 7 useful findings to enable the development of improved practical reflection analysis techniques and tools in future research.

    Figure 1: Reflection analysis: prior work vs. Solar.
  • We introduce a new static analysis approach, called Solar (soundness-guided reflection analysis), to resolve Java reflection effectively in practice. As shown in Figure 1, Solar has three unique advantages compared with previous work:

    • Solar is able to yield significantly more sound results than the state-of-the-art reflection analysis. In addition, Solar allows its soundness to be reasoned about when some reasonable assumptions are met.

    • Solar is able to accurately identify the parts of the program where reflection is analyzed unsoundly or imprecisely, making it possible for users to be aware of the effectiveness of their analysis results (as discussed in Section 1.2).

    • Solar provides a mechanism to guide users to iteratively refine the analysis results by adding lightweight annotations until their specific requirements are satisfied, enabling reflection to be analyzed in a controlled manner.

  • We have implemented Solar in Doop [Bravenboer and Smaragdakis (2009)] (a state-of-the-art pointer analysis tool for Java) and released it as an open-source tool. In particular, Solar can output its reflection analysis results in the format supported by Soot (a popular framework for analyzing Java and Android applications), allowing Soot’s clients to use Solar’s results directly.

  • We conduct extensive experiments to evaluate Solar’s effectiveness with large Java applications and libraries. Our experimental results provide convincing evidence of Solar’s ability to analyze Java reflection effectively in practice.

1.4 Organization

The rest of this paper is organized as follows. We will start by providing a comprehensive understanding of Java reflection in Section 2. Building on this understanding, we give an overview of Solar in Section 3 and introduce its underlying methodology in Section 4. Then, we formalize Solar in Section 5, describe its implementation in Section 6, and evaluate its effectiveness in Section 7. Finally, we discuss the related work in Section 8 and conclude in Section 9.

2 Understanding Java Reflection

Java reflection is a useful but complex language feature. To gain a deep understanding about Java reflection, we examine it in three steps. First, we describe what Java reflection is, why we need it, and how it is proposed (Section 2.1). Second, we explain how Java reflection is designed to be used, i.e., its API (Section 2.2). Finally, we investigate comprehensively how it has been used in real-world Java applications (Section 2.3). After reading this section, the readers are expected to develop a whole picture about the basic mechanism behind Java reflection, understand its core API design, and capture the key insights needed for developing practical reflection analysis tools.

2.1 Concept

Reflection, which has long been studied in philosophy, is the human ability to introspect and learn about one’s own nature. Accordingly, a (non-human) object can also be endowed with such a capability for self-awareness. This arises naturally in artificial intelligence: “Here I am walking into a dark room. Since I cannot see anything, I should turn on the light”. As explained in [Sobel and Friedman (1996)], such a thought fragment reveals a self-awareness of behaviour and state, one that leads to a change in that selfsame behaviour and state, which allows an object to examine itself and make use of the meta-level information to decide what to do next.

Similarly, when programs are given such reflective capabilities, they can observe and modify properties of their own behaviour. Making a program self-aware is the basic motivation of so-called computational reflection, which is also the notion of reflection used in the area of programming languages [Demers and Malenfant (1995)].

In the rest of this section, we will introduce what computational reflection is (Section 2.1.1), what reflective abilities it supports (Section 2.1.2), and how Java reflection is derived from it (Section 2.1.3).

2.1.1 Computational Reflection

Reflection, as a concept for computational systems, dates from Brian Smith’s doctoral dissertation [Smith (1982)]. Generally, as shown in Figure 2(a), a computational system is related to a domain and it answers questions about and/or supports actions in the domain [Maes (1987)]. Internally, a computational system incorporates both the data that represents entities and relations in the domain and a program that describes how these data may be manipulated.

A computational system is said to be also a reflective system, as shown in Figure 2(b), if the following two conditions are satisfied:

  • First, the system has its own representation, known as its self-representation or metasystem, in its domain as a kind of data to be examined and manipulated.

  • Second, the system and its representation are causally connected: a change to the representation implies a change to the system, and vice versa.

Figure 2: Computational vs. reflective computational systems.

The base system should be reified into its representation before its metasystem can operate. Then the metasystem examines and manipulates its behaviour using the reified representation. If any changes are made by the metasystem, then the effects will also be reflected in the behavior of the corresponding base system.

2.1.2 Reflective Abilities

Generally, (computational) reflection is the ability of a program to examine and modify the structure and behaviour of a program at runtime [Herzeel et al. (2008), Malenfant et al. (1996)]. Thus, it endows the program with the capabilities of self-awareness and self-adaptation. These two reflective abilities are known as introspection and intercession, respectively, and both require a reification mechanism to encode a program’s execution state as data first [Demers and Malenfant (1995)].

  • Introspection: the ability of a program to observe, and consequently, reason about its own execution state.

  • Intercession: the ability of a program to modify its own execution state or alter its own interpretation or meaning.

Providing full reflective abilities as shown above is hardly acceptable in practice, as this will introduce both implementation complexities and performance problems [Chiba (2000)]. Thus, in modern programming languages like Java, reflective abilities are only partially supported [Forman and Forman (2004), Bracha and Ungar (2004)].

2.1.3 Java Reflection

Java reflection supports introspection and very limited intercession. In particular, an introspection step is usually followed by behaviour changes such as object creation, method invocation and attribute manipulation [Cazzola (2004), Forman and Forman (2004)]. (Footnote: Some other researchers hold a different view that Java reflection does not support intercession at all [Bracha and Ungar (2004), Donkervoet and Agha (2007)], as they adopt a more strict definition of intercession, which implies the ability to modify the self-representation of a program.)
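As a rough illustration of an introspection step followed by a behaviour change, consider the sketch below. The classes `IntrospectionDemo` and `Lamp` are made up for this purpose (the lamp echoes the dark-room thought fragment in Section 2.1) and do not appear in the paper.

```java
import java.lang.reflect.Field;

// Hypothetical illustration of introspection followed by a behaviour change.
final class IntrospectionDemo {

    // Made-up class whose state is inspected and then modified reflectively.
    static class Lamp {
        private boolean on = false;
    }

    // Introspection: ask the Class metaobject whether a field with this name is declared.
    static boolean declaresField(Class<?> clz, String name) {
        for (Field f : clz.getDeclaredFields()) {
            if (f.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    // Behaviour change: use the Field metaobject to modify the state of one object.
    static boolean switchOn(Lamp lamp) {
        try {
            Field f = Lamp.class.getDeclaredField("on");
            f.setAccessible(true);      // bypasses the private modifier (encapsulation)
            f.setBoolean(lamp, true);
            return f.getBoolean(lamp);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(declaresField(Lamp.class, "on"));
        System.out.println(switchOn(new Lamp()));
    }
}
```

The sketch changes the state of a `Lamp` object but not the structure of the `Lamp` class: the `java.lang.reflect` API has no call for adding a method or field to an existing class at runtime, which is one way to read "very limited intercession".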

Despite its limited reflective abilities, Java reflection allows programmers to break the constraints of staticity and encapsulation, making the program adapt to dynamically changing runtime environments. As a result, Java reflection has been widely used in real-world Java applications to flexibly support different programming tasks, such as reasoning about control (i.e., about which computations to pursue next) [Forman and Forman (2004)], interfacing (e.g., interaction with GUIs or database systems) [Gestwicki and Jayaraman (2002), Rashid and Chitchyan (2003)], and self-activation (e.g., through monitors) [Dawson et al. (2008)].

Java reflection does not have a reify operation as described in Section 2.1.1 (Figure 2(b)) to turn the basic (running) system (including stack frames) into a representation (data structure) that is passed to a metasystem. Instead, a kind of metarepresentation, based on metaobjects, exists when the system begins running and persists throughout the execution of the system [Forman and Forman (2004)].

A metaobject is like the reflection in a mirror: one can adjust one’s smile (behaviour changes) by looking at oneself in a mirror (introspection). In Section 2.2, we will look at how Java reflection uses metaobjects and its API to facilitate reflective programming.

1  A a = new A();
2  String cName, mName, fName = ...;
3  Class clz = Class.forName(cName);
4  Object obj = clz.newInstance();
5  Method mtd = clz.getDeclaredMethod(mName,{A.class});
6  Object l = mtd.invoke(obj, {a});
7  Field fld = clz.getField(fName);
8  X r = (X)fld.get(a);
9  fld.set(null, a);
Figure 3: An example of reflection usage in Java.

2.2 Interface

We first use a toy example to illustrate some common uses of the Java reflection API (Section 2.2.1). We then delve into the details of its core methods, which are relevant to (and thus should be handled by) any reflection analysis (Section 2.2.2).

2.2.1 An Example

There are two kinds of metaobjects: Class objects and member objects. In Java reflection, one always starts with a Class object and then obtains its member objects (e.g., Method and Field objects) from the Class object by calling its corresponding accessor methods (e.g., getMethod() and getField()).

In Figure 3, the metaobjects clz, mtd and fld are instances of the metaobject classes Class, Method and Field, respectively. Constructor can be seen as Method except that the method name “<init>” is implicit. Class allows an object to be created reflectively by calling newInstance(). As shown in line 4, the dynamic type of obj is the class (type) represented by clz (specified by cName). In addition, Class provides accessor methods such as getDeclaredMethod() in line 5 and getField() in line 7 to allow the member metaobjects (e.g., of Method and Field) related to a Class object to be introspected. With dynamic invocation, a Method object can be commanded to invoke the method that it represents (line 6). Similarly, a Field object can be commanded to access or modify the field that it represents (lines 8 and 9).
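Figure 3 uses a simplified notation (for example, `{A.class}` for an array argument) and leaves the target class open. The following is one possible compilable variant, given as a sketch. The nested class `A`, its method `greet` and its field `shared` are assumptions introduced here so that the code runs; the field is static so that both `fld.get(a)` and `fld.set(null, a)` are legal on the same `Field` object.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

// A compilable variant of Figure 3 under assumed class, method and field names.
final class Figure3Demo {

    // Hypothetical target class; Figure 3 leaves the class named by cName open.
    static class A {
        public static A shared;                      // static, so the receiver may be null
        String greet(A other) { return "greeted"; }  // found via getDeclaredMethod()
    }

    @SuppressWarnings("deprecation") // Class.newInstance() is deprecated since Java 9
    static Object run(String cName, String mName, String fName) {
        try {
            A a = new A();
            Class<?> clz = Class.forName(cName);                 // entry method
            Object obj = clz.newInstance();                      // like: obj = new A()
            Method mtd = clz.getDeclaredMethod(mName, A.class);  // member introspection
            Object l = mtd.invoke(obj, a);                       // like: l = obj.greet(a)
            Field fld = clz.getField(fName);                     // member introspection
            A r = (A) fld.get(a);                                // static field: a is ignored
            fld.set(null, a);                                    // like: A.shared = a
            return l;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(A.class.getName(), "greet", "shared")); // prints greeted
    }
}
```

The three string arguments are passed in from outside `run()`, mirroring the fact that `cName`, `mName` and `fName` in Figure 3 are not constants.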

2.2.2 Core Java Reflection API

In reflection analysis, we are concerned with reasoning about how reflection affects the control and data flow information in the program. For example, if a target method (say m) that is reflectively invoked in line 6 in Figure 3 cannot be resolved statically, the call graph edge from this call site to method m (control flow) and the values passed interprocedurally from obj and a to this and the parameter of m (data flow), respectively, will be missing. Therefore, we should focus on the part of the Java reflection API that affects a pointer analysis, a fundamental analysis that statically resolves the control and data flow information in a program [Livshits et al. (2005), Li et al. (2014), Smaragdakis et al. (2015), Lhoták and Hendren (2003), Milanova et al. (2005), Smaragdakis et al. (2011), Tan et al. (2016), Tan et al. (2017)].

Figure 4: Overview of core Java reflection API. (Footnote: We summarize and explain the core reflection API (25 methods) that is critical to static analysis. A more complete reflection API list (181 methods) is given in [Landman et al. (2017)] without explanations though.)

It is thus sufficient to consider only the pointer-affecting methods in the Java reflection API. We can divide such reflective methods into three categories (Figure 4):

  • Entry methods, which create Class objects, e.g., forName() in line 3 in Figure 3.

  • Member-introspecting methods, which introspect and retrieve member metaobjects, i.e., Method (Constructor) and Field objects from a Class object, e.g., getDeclaredMethod() in line 5 and getField() in line 7 in Figure 3.

  • Side-effect methods, which affect the pointer information in the program reflectively, e.g., newInstance(), invoke(), get() and set() in lines 4, 6, 8 and 9 in Figure 3 for creating an object, invoking a method, accessing and modifying a field, respectively.

Entry Methods

Class objects are returned by entry methods, as everything in Java reflection begins with Class objects. There are many entry methods in the Java reflection API. In Figure 4, only the four most widely used ones are listed explicitly.

Note that forName() (loadClass()) returns a Class object representing a class that is specified by the value of its string argument. The Class object returned by o.getClass() and A.class represents the dynamic type (class) of o and A, respectively.
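The four widely used entry methods can be compared side by side in a small sketch. The class name `EntryMethodsDemo` and the choice of `java.lang.StringBuilder` as the target are illustrative assumptions.

```java
// Sketch: four entry methods that yield a Class object for the same class.
final class EntryMethodsDemo {

    // Obtains the Class object for java.lang.StringBuilder in four ways.
    static Class<?>[] fourWays() {
        try {
            Class<?> byName = Class.forName("java.lang.StringBuilder");
            Class<?> byLoader =
                ClassLoader.getSystemClassLoader().loadClass("java.lang.StringBuilder");
            Class<?> byObject = new StringBuilder().getClass();
            Class<?> byLiteral = StringBuilder.class;
            return new Class<?>[] {byName, byLoader, byObject, byLiteral};
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // getClass() reports the dynamic type, not the declared (static) type CharSequence.
    static Class<?> dynamicTypeOf(CharSequence cs) {
        return cs.getClass();
    }

    public static void main(String[] args) {
        for (Class<?> c : fourWays()) {
            System.out.println(c.getName());
        }
        System.out.println(dynamicTypeOf("text").getName()); // prints java.lang.String
    }
}
```

Only the first two calls depend on a string value; the last two are determined by an object's dynamic type and by a class literal, respectively.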

Member-Introspecting Methods

Class provides a number of accessor methods for retrieving its member metaobjects, i.e., the Method (Constructor) and Field objects. In addition, these member metaobjects can be used to introspect the methods, constructors and fields in their target class. Formally, these accessor methods are referred to here as the member-introspecting methods.

As shown in Figure 4, for each kind of member metaobjects, there are four member-introspecting methods. We take a Method object as an example to illustrate these methods, whose receiver objects are the Class objects returned by the entry methods.

  • getDeclaredMethod(String, Class[]) returns a Method object that represents a declared method of the target Class object with the name (formal parameter types) specified by the first (second) parameter (line 5 in Figure 3).

  • getMethod(String, Class[]) is similar to getDeclaredMethod(String, Class[]) except that the returned Method object is public (either declared or inherited). If the target Class does not have a matching method, then its superclasses are searched first recursively (bottom-up) before its interfaces (implemented).

  • getDeclaredMethods() returns an array of Method objects representing all the methods declared in the target Class object.

  • getMethods() is similar to getDeclaredMethods() except that all the public methods (either declared or inherited) in the target Class object are returned.
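The difference between the "declared" and the "public" lookups can be seen in the following sketch. The classes `Base` and `Derived` and their methods are hypothetical and exist only for this example.

```java
import java.lang.reflect.Method;
import java.util.Set;
import java.util.TreeSet;

// Sketch contrasting getDeclaredMethod(s)() with getMethod(s)() on a made-up hierarchy.
final class MemberLookupDemo {

    static class Base {
        public void inherited() { }
        void packagePrivate() { }
    }

    static class Derived extends Base {
        public void own() { }
        private void secret() { }
    }

    // True if Derived itself declares a no-argument method with this name.
    static boolean foundByGetDeclaredMethod(String name) {
        try {
            Derived.class.getDeclaredMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // True if Derived has a public no-argument method with this name, declared or inherited.
    static boolean foundByGetMethod(String name) {
        try {
            Derived.class.getMethod(name);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    // Collects the names of the methods in an array of Method metaobjects.
    static Set<String> names(Method[] methods) {
        Set<String> result = new TreeSet<>();
        for (Method m : methods) {
            result.add(m.getName());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(names(Derived.class.getDeclaredMethods()));
        System.out.println(names(Derived.class.getMethods()));
    }
}
```

In this sketch `secret` is visible only to the declared lookups, while `inherited` (and the public methods of `Object`) are visible only to the public lookups.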

Side-Effect Methods

Table 2.2.2: Nine side-effect methods and their side effects on the pointer analysis, assuming that the target class of clz and ctor is A, the target method of mtd is m and the target field of fld is f.

Simplified Method          Calling Scenario                    Side Effect
Class::newInstance         o = clz.newInstance()               o = new A()
Constructor::newInstance   o = ctor.newInstance({arg, …})      o = new A(arg, …)
Method::invoke             a = mtd.invoke(o, {arg, …})         a = o.m(arg, …)
Field::get                 a = fld.get(o)                      a = o.f
Field::set                 fld.set(o, a)                       o.f = a
Proxy::newProxyInstance    o = Proxy.newProxyInstance(…)       o = new Proxy$*(…)
Array::newInstance         o = Array.newInstance(clz, size)    o = new A[size]
Array::get                 a = Array.get(o, i)                 a = o[i]
Array::set                 Array.set(o, i, a)                  o[i] = a

As shown in Figure 4, a total of nine side-effect methods that can possibly modify or use (as their side effects) the pointer information in a program are listed. Accordingly, Table 2.2.2 explains how these methods affect the pointer information by giving their side effects on the pointer analysis.

In Figure 4, the first five side-effect methods use four kinds of metaobjects as their receiver objects while the last four methods use Class or Array objects as their arguments. Below we briefly examine them in the order given in Table 2.2.2.

  • The side effect of newInstance() is allocating an object with the type specified by its metaobject clz or ctor (say A) and initializing it via a constructor of A, which is the default constructor in the case of Class::newInstance() and the constructor specified explicitly in the case of Constructor::newInstance().

  • The side effect of invoke() is a virtual call when the first argument of invoke(), say o, is not null. The receiver object is o as shown in the “Side Effect” column in Table 2.2.2. When o is null, invoke() should be a static call.

  • The side effects of get() and set() are retrieving (loading) and modifying (storing) the value of an instance field, respectively, when their first argument, say o, is not null; otherwise, they are operating on a static field.

  • The side effect of newProxyInstance() is creating an object of a proxy class Proxy$*, and this proxy class is generated dynamically according to its arguments (containing a Class object). Proxy.newProxyInstance() can be analyzed according to its semantics. A call to this method returns a Proxy object, which has an associated invocation handler object that implements the InvocationHandler interface. A method invocation on a Proxy object through one of its Proxy interfaces will be dispatched to the invoke() method of the object’s invocation handler.

  • The side effect of Array.newInstance() is creating an array (object) with the component type represented by the Class object (e.g., clz in Table 2.2.2) used as its first argument. Array.get() and Array.set() retrieve and modify, respectively, an indexed element in the array object specified as their first argument.
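Some of the less obvious side effects described above can be observed directly in a short sketch: a static call through invoke() with a null receiver, a static field read through get() with a null argument, the Array-related methods, and the dispatch of a proxy call to its invocation handler. The class `SideEffectDemo` and the choice of `Integer.parseInt`, `Integer.MAX_VALUE` and `Supplier` as targets are assumptions made only for illustration.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.function.Supplier;

// Sketch of selected side-effect methods on standard library targets.
final class SideEffectDemo {

    // Method::invoke with a null receiver acts like the static call Integer.parseInt("42").
    static Object staticInvoke() {
        try {
            Method parse = Integer.class.getMethod("parseInt", String.class);
            return parse.invoke(null, "42");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Field::get with a null argument reads the static field Integer.MAX_VALUE.
    static Object staticFieldGet() {
        try {
            Field max = Integer.class.getField("MAX_VALUE");
            return max.get(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Array::newInstance, set and get act like new String[2], arr[1] = "x" and arr[1].
    static Object arrayRoundTrip() {
        Object arr = Array.newInstance(String.class, 2);
        Array.set(arr, 1, "x");
        return Array.get(arr, 1);
    }

    // Proxy::newProxyInstance: a call on the proxy is dispatched to the handler's invoke().
    static String proxyCall() {
        InvocationHandler handler = (proxy, method, args) -> "handled " + method.getName();
        Object p = Proxy.newProxyInstance(
            SideEffectDemo.class.getClassLoader(),
            new Class<?>[] {Supplier.class},
            handler);
        Supplier<?> supplier = (Supplier<?>) p;
        return (String) supplier.get();
    }

    public static void main(String[] args) {
        System.out.println(staticInvoke());    // prints 42
        System.out.println(staticFieldGet());  // prints 2147483647
        System.out.println(arrayRoundTrip());  // prints x
        System.out.println(proxyCall());       // prints handled get
    }
}
```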

2.3 Reflection Usage

The Java reflection API is rich and complex. We have conducted an empirical study to understand reflection usage in practice in order to guide the design and implementation of a sophisticated reflection analysis described in this paper. In this section, we first list the focus questions in Section 2.3.1, then describe the experimental setup in Section 2.3.2, and finally, present the study results in Section 2.3.3.

2.3.1 Focus Questions

We address the following seven focus questions in order to understand how Java reflection is used in the real world:

  • Q1. The core part of reflection analysis is to resolve all the nine side-effect methods (Table 2.2.2) effectively. What are the side-effect methods that are most widely used and how are the remaining ones used in terms of their relative frequencies?

  • Q2. The Java reflection API contains many entry methods for returning Class objects. Which ones should be focused on by an effective reflection analysis?

  • Q3. Existing reflection analyses resolve reflection by analyzing statically the string arguments of entry and member-introspecting method calls. How often are these strings constants and how often can non-constant strings be resolved by a simple string analysis that models string operations such as “+” and append()?

  • Q4. Existing reflection analyses ignore the member-introspecting methods that return an array of member metaobjects. Is it necessary to handle such methods?

  • Q5. Existing reflection analyses usually treat reflective method calls and field accesses as being non-static. Does this treatment work well in real-world programs? Specifically, how often are static reflective targets used in reflective code?

  • Q6. In [Livshits et al. (2005)], intraprocedural post-dominating cast operations are leveraged to resolve newInstance() when its class type is unknown. This approach is still adopted by many reflection analysis tools. Does it generally work in practice?

  • Q7. What are new insights on handling Java reflection (from this paper)?
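The code pattern behind Q6 can be sketched as follows. The names `CastHintDemo`, `Shape` and `Circle` are hypothetical. The snippet only shows the pattern that a cast-based analysis looks for, not the analysis itself, and makes no claim about how often the pattern occurs in practice.

```java
// Sketch of the cast pattern from Q6 on a made-up type hierarchy.
final class CastHintDemo {

    interface Shape {
        String name();
    }

    static class Circle implements Shape {
        public String name() { return "circle"; }
    }

    // className is statically unknown, but the cast follows the allocation on every path
    // of this method, so an analysis may restrict the created type to subtypes of Shape.
    @SuppressWarnings("deprecation") // Class.newInstance() is deprecated since Java 9
    static Shape create(String className) {
        try {
            Object o = Class.forName(className).newInstance();
            return (Shape) o;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Here the object leaves the method as Object; any cast happens in some caller,
    // so an intraprocedural look at this method finds no type hint.
    @SuppressWarnings("deprecation")
    static Object createUntyped(String className) {
        try {
            return Class.forName(className).newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(create(Circle.class.getName()).name()); // prints circle
    }
}
```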

2.3.2 Experimental Setup

We have selected a set of 16 representative Java programs, including three popular desktop applications, javac-1.7.0, jEdit-5.1.0 and Eclipse-4.2.2 (denoted Eclipse4), two popular server applications, Jetty-9.0.5 and Tomcat-7.0.42, and all eleven DaCapo benchmarks (2006-10-MR2) [Blackburn et al. (2006)]. Note that the DaCapo benchmark suite includes an older version of Eclipse (version 3.1.2). We exclude its bloat benchmark since its application code is reflection-free. We consider lucene instead of luindex and lusearch separately since these two benchmarks are derived from lucene with the same reflection usage.

We consider a total of 191 methods in the Java reflection API (version 1.6), including the ones mainly from package java.lang.reflect and class java.lang.Class.

We use Soot [Vallée-Rai et al. (1999)] to pinpoint the calls to reflection methods in the bytecode of a program. To understand the common reflection usage, we consider only the reflective calls found in the application classes and their dependent libraries but exclude the standard Java libraries. To increase the code coverage for the five applications considered, we include the jar files whose names contain the names of these applications (e.g., *jetty*.jar for Jetty) and make them available under the process-dir option supported by Soot. For Eclipse4, we use org.eclipse.core.runtime.adaptor.EclipseStarter to let Soot locate all the other jar files used.

We manually inspect the reflection usage in a program in a demand-driven manner, starting from its side-effect methods, assisted by Open Call Hierarchy in Eclipse, by following their backward slices. For a total of 609 side-effect call sites examined, 510 call sites for calling entry methods and 304 call sites for calling member-introspecting methods are tracked and analyzed. As a result, a total of 1,423 reflective call sites, together with some nearby statements, are examined in our study.

Figure 5: Side-effect methods.

2.3.3 Results

Below we describe our seven findings on reflection usage as our answers to the seven focus questions listed in Section 2.3.1, respectively. We summarize our findings as individual remarks, which are expected to be helpful in guiding the development of practical reflection analysis techniques and tools in future research.

Q1. Side-Effect Methods

Figure 5 depicts the percentage frequency distribution of all the nine side-effect methods in all the programs studied. We can see that newInstance() and invoke() are the ones that are most frequently used (46.3% and 32.7%, respectively, on average). Both of them are handled by existing static analysis tools such as Doop, Soot, Wala and Bddbddb. However, Field- and Array-related side-effect methods, which are also used in many programs, are ignored by most of these tools. To the best of our knowledge, they are handled only by Elf [Li et al. (2014)], Solar [Li et al. (2015)] and Doop [Smaragdakis et al. (2015)]. Note that newProxyInstance() is used only in jEdit in our study, whereas a recent survey on reflection analysis [Landman et al. (2017)] reports more usages of it in the real world.

Remark 1. Reflection analysis should at least handle newInstance() and invoke() as they are the most frequently used side-effect methods (79% on average), which will significantly affect a program’s behavior, in general; otherwise, much of the codebase may be invisible for analysis. Effective reflection analysis should also consider Field- and Array-related side-effect methods, as they are also commonly used.
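To make the four kinds of side-effect methods discussed above concrete, the following is a minimal sketch that exercises each of them once. The class SideEffectDemo and its nested class Point are hypothetical names introduced only for illustration; object creation goes through Constructor.newInstance() because Class.newInstance() has been deprecated since Java 9.

```java
import java.lang.reflect.Array;
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class SideEffectDemo {
    /** Hypothetical target class with one public instance field. */
    public static class Point { public int x; }

    // newInstance(): reflective object creation from a class name.
    static Object create(String className) {
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // invoke(): reflective call of a public no-argument method.
    static Object call(Object receiver, String methodName) {
        try {
            Method m = receiver.getClass().getMethod(methodName);
            return m.invoke(receiver);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // set()/get(): reflective field modification followed by a field access.
    static int setAndGetX(Point p, int value) {
        try {
            Field f = Point.class.getField("x");
            f.set(p, value);
            return (Integer) f.get(p);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Array.newInstance(): reflective array creation.
    static Object newArray(Class<?> componentType, int length) {
        return Array.newInstance(componentType, length);
    }

    public static void main(String[] args) {
        Object list = create("java.util.ArrayList");
        System.out.println(call(list, "size"));                          // prints 0
        System.out.println(setAndGetX(new Point(), 7));                  // prints 7
        System.out.println(Array.getLength(newArray(String.class, 3)));  // prints 3
    }
}
```

An analysis that models only create() and call() would miss the write to Point.x and the array allocation, which is the gap Remark 1 points at.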

Q2. Entry Methods

Figure 6 shows the percentage frequency distribution of eight entry methods. “Unknown” is included since we failed to find the entry methods for some side-effect calls (e.g., invoke()) even by using Eclipse’s Open Call Hierarchy tool. For the first 12 programs, the six entry methods as shown (excluding “Unknown” and “Others”) are the only ones leading to side-effect calls. For the last two, Jetty and Tomcat, “Others” stands for defineClass() in ClassLoader and getParameterTypes() in Method. Finally, getComponentType() is usually used in the form of getClass().getComponentType() for creating a Class object argument for Array.newInstance().

On average, Class.forName(), .class, getClass() and loadClass() are the top four most frequently used (48.1%, 18.0%, 17.0% and 9.7%, respectively). A class loading strategy can be configured in forName() and loadClass(). In practice, forName() is often used by the system class loader and loadClass() is usually overridden in custom class loaders, especially in framework applications such as Tomcat and Jetty.

Figure 6: Entry methods.

Remark 2. Reflection analysis should handle Class.forName(), getClass(), .class, and loadClass(), which are the four major entry methods for creating Class objects. In addition, getComponentType() should also be modeled if Array-related side-effect methods are analyzed, as they are usually used together.
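The entry methods named in Remark 2 can be illustrated with a short sketch that obtains the same Class object in four ways and then uses getClass().getComponentType() together with Array.newInstance(). The class name EntryMethodDemo and its helper names are assumptions made for this example; only JDK classes are loaded.

```java
import java.lang.reflect.Array;

class EntryMethodDemo {
    // Class.forName(): the class is named by a string.
    static Class<?> viaForName(String name) {
        try {
            return Class.forName(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // loadClass(): the class is named by a string and loaded by an explicit loader.
    static Class<?> viaLoadClass(String name) {
        try {
            return ClassLoader.getSystemClassLoader().loadClass(name);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // getClass().getComponentType() supplies the Class argument of Array.newInstance().
    static Object emptyCopyOf(Object array) {
        Class<?> component = array.getClass().getComponentType();
        return Array.newInstance(component, Array.getLength(array));
    }

    public static void main(String[] args) {
        Class<?> literal = java.util.ArrayList.class;                     // .class
        Class<?> dynamic = new java.util.ArrayList<String>().getClass();  // getClass()
        System.out.println(literal == dynamic);
        System.out.println(viaForName("java.util.ArrayList") == literal);
        System.out.println(viaLoadClass("java.util.ArrayList") == literal);
        System.out.println(emptyCopyOf(new String[2]).getClass().getSimpleName());
    }
}
```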

Q3. String Constants and String Manipulations

In entry methods, Class.forName() and loadClass() each have a String parameter to specify the target class. In member-introspecting methods, getDeclaredMethod(String,...) and getMethod(String,...) each return a Method object named by its first parameter; getDeclaredField(String) and getField(String) each return a Field object named by its single parameter.

As shown in Figure 7, string constants are commonly used when calling the two entry methods (34.7% on average) and the four member-introspecting methods (63.1% on average). In the presence of string manipulations, many class/method/field names cannot be determined exactly. This is mainly because their static resolution requires precise handling of many different operations, e.g., substring() and append(). In fact, many cases are rather complex and thus cannot be handled well by simply modeling the java.lang.String-related API. Thus, Solar does not currently handle string manipulations. However, the incomplete information about class/method/field names (i.e., partial string information) can be exploited beneficially [Smaragdakis et al. (2015)].

We also found that many string arguments are Unknown (55.3% for calling entry methods and 25.1% for calling member-introspecting methods, on average). These are the strings that may be read from, say, configuration files, command lines, or even Internet URLs. Finally, string constants are found to be more frequently used for calling the four member-introspecting methods than the two entry methods: 146 calls to getDeclaredMethod() and getMethod(), 27 calls to getDeclaredField() and getField() in contrast with 98 calls to forName() and loadClass(). This suggests that the analyses that ignore string constants flowing into some member-introspecting methods may fail to exploit such valuable information and thus become imprecise.

(a) Calls to entry methods (b) Calls to member-introspecting methods
Figure 7: Classification of the String arguments of two entry methods, forName() and loadClass(), and four member-introspecting methods, getMethod(), getDeclaredMethod(), getField() and getDeclaredField().

Remark 3. Resolving reflective targets by string constants does not always work. On average, only 49% of the reflective call sites (where string arguments are used to specify reflective targets) use string constants. In addition, fully resolving non-constant string arguments by string manipulation, although mentioned elsewhere [Livshits et al. (2005), Bodden et al. (2011)], may be hard to achieve in practice.
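The three kinds of string arguments classified in Figure 7 can be sketched as follows. The class StringArgumentDemo, its nested class Commands and the system property name demo.method are all made up for this illustration; the "_" + cmd pattern is borrowed from the Eclipse excerpt discussed under Q7.

```java
import java.lang.reflect.Method;

class StringArgumentDemo {
    /** Hypothetical target class with one "_"-prefixed command method. */
    public static class Commands {
        public String _help() { return "usage"; }
    }

    private static Method lookup(String methodName) {
        try {
            return Commands.class.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // (1) String constant: the member name is fully visible in the code.
    static Method byConstant() {
        return lookup("_help");
    }

    // (2) String manipulation: only the prefix "_" is known statically.
    static Method byManipulation(String cmd) {
        return lookup("_" + cmd);
    }

    // (3) Unknown: the whole name comes from the environment
    //     (the property name "demo.method" is an assumption of this sketch).
    static Method byExternalInput() {
        return lookup(System.getProperty("demo.method", "_help"));
    }

    public static void main(String[] args) {
        System.out.println(byConstant().getName());
        System.out.println(byManipulation("help").getName());
        System.out.println(byExternalInput().getName());
    }
}
```

All three calls may reach the same method at runtime, but only the first one exposes its target to an analysis that relies on string constants alone.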

Q4. Retrieving an Array of Member Objects

As introduced in Section 2.2.2, half of member-introspecting methods (e.g., getDeclaredMethods()) return an array of member metaobjects. Although not as frequently used as the ones returning a single member metaobject (e.g., getDeclaredMethod()), they play an important role in introducing new program behaviours in some applications. For example, in the two Eclipse programs studied, there are four invoke() call sites called on an array of Method objects returned from getMethods() and 15 fld.get() and fld.set() call sites called on an array of Field objects returned by getDeclaredFields(). Through these calls, dozens of methods are invoked and hundreds of fields are modified reflectively. Ignoring such methods as in prior work [Livshits et al. (2005)] and tools (Bddbddb, Wala, Soot) may cause the analysis to miss a significant amount of program behaviour.

Remark 4. In member-introspecting methods, get(Declared)Methods/Fields/Constructors(), which return an array of member metaobjects, are usually ignored by most existing reflection analysis tools. However, they play an important role in certain applications for both method invocations and field manipulations.
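A small sketch shows why one call site on an array of member metaobjects can introduce many behaviours at once. The class MemberArrayDemo and its nested class Hooks are hypothetical; the loop invokes every public static no-argument method that getDeclaredMethods() returns, so a single invoke() call site has as many targets as there are such methods.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

class MemberArrayDemo {
    /** Hypothetical class whose public static methods are reached reflectively. */
    public static class Hooks {
        public static String first() { return "first"; }
        public static String second() { return "second"; }
    }

    // A single invoke() call site reaches every public static no-argument
    // method declared in clazz; results are sorted to give a stable order,
    // since getDeclaredMethods() returns its elements in no particular order.
    static Set<String> invokeAll(Class<?> clazz) {
        Set<String> results = new TreeSet<>();
        for (Method m : clazz.getDeclaredMethods()) {
            int mods = m.getModifiers();
            if (m.isSynthetic() || !Modifier.isPublic(mods)
                    || !Modifier.isStatic(mods) || m.getParameterCount() != 0) {
                continue;
            }
            try {
                results.add(String.valueOf(m.invoke(null)));
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(invokeAll(Hooks.class)); // prints [first, second]
    }
}
```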

(a) Method::invoke() call sites (b) Field::get()/set() call sites
Figure 8: The percentage frequency distribution of side-effect call sites on instance and static members.

Q5. Static or Instance Members

In the literature on reflection analysis [Livshits et al. (2005), Li et al. (2014), Smaragdakis et al. (2015)], reflective targets are mostly assumed to be instance members. Accordingly, calls to the side-effect methods such as invoke(), get() and set(), are usually considered as virtual calls, instance field accesses, and instance field modifications, respectively (see Table 2.2.2 for details). However, in real programs, as shown in Figure 8, on average, 37% of the invoke() call sites are found to invoke static methods and 50% of the get()/set() call sites are found to access/modify static fields. Thus in practice, reflection analysis should distinguish both cases and also be aware of whether a reflective target is a static or instance member, since the approaches for resolving both cases are usually different.

Remark 5. Static methods/fields are invoked/accessed as frequently as instance methods/fields in Java reflection, even though the latter has received more attention in the literature. In practice, reflection analysis should distinguish the two cases and adopt appropriate approaches for handling them.
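The distinction drawn in Remark 5 can be seen at a single invoke() call site, as in the following sketch. The class StaticOrInstanceDemo and its nested class Counter are hypothetical names; the same call behaves as a static call (the receiver is ignored and may be null) or as a virtual call, depending on the Method object that flows into it.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

class StaticOrInstanceDemo {
    /** Hypothetical target with one static and one instance method. */
    public static class Counter {
        public int value = 5;
        public static String kind() { return "counter"; }
        public int next() { return value + 1; }
    }

    private static Method find(String name) {
        try {
            return Counter.class.getMethod(name);
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if the named method is static, i.e., invoke() ignores its receiver.
    static boolean isStatic(String name) {
        return Modifier.isStatic(find(name).getModifiers());
    }

    // One invoke() call site: a static call when the target is static
    // (the receiver may be null) and a virtual call otherwise.
    static Object invoke(Object receiverOrNull, String name) {
        try {
            return find(name).invoke(receiverOrNull);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isStatic("kind") + " " + invoke(null, "kind"));
        System.out.println(isStatic("next") + " " + invoke(new Counter(), "next"));
    }
}
```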

Q6. Resolving newInstance() by Casts

In Figure 3, when cName is not a string constant, the (dynamic) type of obj created by newInstance() in line 4 is unknown. For this case, Livshits et al. [Livshits et al. (2005)] propose to infer the type of obj by leveraging the cast operation that post-dominates intra-procedurally the newInstance() call site. If the cast type is A, the type of obj must be A or one of its subtypes assuming that the cast operation does not throw any exceptions. This approach has been implemented in many analysis tools such as Wala, Bddbddb and Elf.

However, as shown in Figure 9, exploiting casts this way does not always work. On average, 28% of newInstance() call sites have no such intra-procedural post-dominating casts. As newInstance() is the most widely used side-effect method, its unresolved call sites may significantly affect the soundness of the analysis, as discussed in Section 7.5.1. Hence, we need a better solution to handle newInstance().

Figure 9: newInstance() resolution by leveraging intra-procedural post-dominating casts.

Remark 6. Resolving newInstance() calls by leveraging their intra-procedural post-dominating cast operations fails to work for 28% of the newInstance() call sites found. As newInstance() affects critically the soundness of reflection analysis (Remark 1), a more effective approach for its resolution is required.
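The two situations contrasted in Remark 6 can be sketched as follows. The class CastResolutionDemo and its method names are assumptions made for this example. In the first method the cast sits in the same method as the newInstance() call, so the cast type bounds the created object's type; in the second method no such cast exists and the object is later used only reflectively, in a different method.

```java
import java.util.List;

class CastResolutionDemo {
    // Case 1: the cast follows newInstance() inside the same method, so the
    // created object must be a List (or a subtype) if the cast succeeds.
    static List<?> withCast(String cName) {
        try {
            Object obj = Class.forName(cName).getDeclaredConstructor().newInstance();
            return (List<?>) obj;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Case 2: no cast in this method; the object escapes as a plain Object,
    // so a purely intra-procedural cast-based inference learns nothing here.
    static Object withoutCast(String cName) {
        try {
            return Class.forName(cName).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The escaped object is only used reflectively, in another method.
    static Object sizeOf(Object o) {
        try {
            return o.getClass().getMethod("size").invoke(o);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withCast("java.util.ArrayList").getClass().getName());
        System.out.println(sizeOf(withoutCast("java.util.ArrayList")));
    }
}
```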

Q7. Self-Inferencing Property

As illustrated by the program given in Figure 3, the names of its reflective targets are specified by the string arguments (e.g., cName, mName and fName) at the entry and member-introspecting reflective calls. Therefore, string analysis has been a representative approach for static reflection analysis in the last decade. However, if the value of a string is unknown statically (e.g., read from external files or command lines), then the related reflective calls, including those to newInstance(), may have to be ignored, rendering the corresponding codebase or operations invisible to the analysis. To improve precision, in this case, the last resort is to exploit the existence of some intra-procedurally post-dominating cast operations on a call to newInstance() in order to deduce the types of objects reflectively created (see Q6).

However, in our study, we find that there are many other rich hints about the behaviors of reflective calls at their usage sites. Such hints can be and should be exploited to make reflection analysis more effective, even when some string values are partially or fully unknown. In the following, we first look at three real example programs to examine what these hints are and expose a so-called self-inferencing property inherent in these hints. Finally, we explain why the self-inferencing property is pervasive for Java reflection and discuss its potential in making reflection analysis more effective.

Application: Eclipse (v4.2.2)
Class:org.eclipse.osgi.framework.internal.core.FrameworkCommandInterpreter
123  public Object execute(String cmd) {...
155    Object[] parameters = new Object[] {this}; ...
167    for (int i = 0; i < size; i++) {
174      method = target.getClass().getMethod("_" + cmd, parameterTypes);
175      retval = method.invoke(target, parameters); ...}
228  }
Figure 10: Self-inferencing property for a reflective method invocation, deduced from the number and dynamic types of the components of the one-dimensional array argument, parameters, at an invoke() call site.

[Reflective Method Invocation (Figure 10)] The method name (the first argument of getMethod() in line 174) is statically unknown as part of it is read from command line cmd. However, the target method (represented by method) can be deduced from the second argument (parameters) of the corresponding side-effect call invoke() in line 175. Here, parameters is an array of objects, with only one element (line 155). By querying the pointer analysis and also leveraging the type information in the program, we know that the type of the object pointed to by this is FrameworkCommandInterpreter, which has no subtypes. As a result, we can infer that the descriptor of the target method in line 175 must have only one argument and its declared type must be FrameworkCommandInterpreter or one of its supertypes.
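The deduction described for Figure 10 can be approximated with plain reflection, as in the sketch below. It is an illustration under stated assumptions, not Solar's implementation: the class InvokeInferenceDemo and its nested Interpreter class are hypothetical stand-ins, the dynamic argument types are passed in directly instead of being obtained from a pointer analysis, and boxing of primitive parameters is not modeled.

```java
import java.lang.reflect.Method;
import java.util.SortedSet;
import java.util.TreeSet;

class InvokeInferenceDemo {
    /** Hypothetical stand-in for a command interpreter with "_"-prefixed commands. */
    public static class Interpreter {
        public Object _help(Interpreter i) { return "help"; }
        public Object _exit(Interpreter i) { return "exit"; }
        public Object _log(String message, Interpreter i) { return "log"; }
    }

    // Names of the public methods of receiverType whose parameter list can
    // accept arguments with the given dynamic types: same arity, and each
    // declared parameter type is the argument type or one of its supertypes.
    static SortedSet<String> candidates(Class<?> receiverType, Class<?>... argTypes) {
        SortedSet<String> names = new TreeSet<>();
        for (Method m : receiverType.getMethods()) {
            Class<?>[] params = m.getParameterTypes();
            if (params.length != argTypes.length) {
                continue;
            }
            boolean compatible = true;
            for (int i = 0; i < params.length; i++) {
                compatible &= params[i].isAssignableFrom(argTypes[i]);
            }
            if (compatible) {
                names.add(m.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Mirrors parameters = new Object[] {this} at the invoke() call site.
        System.out.println(candidates(Interpreter.class, Interpreter.class));
    }
}
```

With a single argument of type Interpreter, the candidates are _exit, _help and the inherited equals(Object), whose parameter type is a supertype of Interpreter. The two-parameter _log is excluded by arity alone, even though no method name was consulted.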

Application: Eclipse (v4.2.2)
Class:org.eclipse.osgi.framework.internal.core.Framework
1652  public static Field getField(Class clazz, ...) {
1653    Field[] fields = clazz.getDeclaredFields(); ...
1654    for (int i = 0; i < fields.length; i++) { ...
1658      return fields[i]; } ...
1662  }
1682  private static void forceContentHandlerFactory(...) {
1683    Field factoryField = getField(URLConnection.class, ...);
1687    java.net.ContentHandlerFactory factory =
          (java.net.ContentHandlerFactory) factoryField.get(null); ...
1709  }
Figure 11: Self-inferencing property for a reflective field access, deduced from the cast operation and the null argument used at a get() call site.

[Reflective Field Access (Figure 11)] In this program, factoryField (line 1683) is obtained as a Field object from an array of Field objects created in line 1653 for all the fields in URLConnection. In line 1687, the object returned from get() is cast to java.net.ContentHandlerFactory. Based on its cast operation and null argument, we know that the call to get() may only access the static fields of URLConnection with the type java.net.ContentHandlerFactory, its supertypes or its subtypes. Otherwise, all the fields in URLConnection must be assumed to be accessed conservatively.

Application: Eclipse (v4.2.2)
Class:org.eclipse.osgi.util.NLS
300  static void load(final String bundleName, Class<?> clazz) {
302    final Field[] fieldArray = clazz.getDeclaredFields();
336    computeMissingMessages(..., fieldArray, ...); ...
339  }
267  static void computeMissingMessages(..., Field[] fieldArray,...) {
272    for (int i = 0; i < numFields; i++) {
273      Field field = fieldArray[i];
284      String value = "NLS missing message: " + ...;
290      field.set(null, value); } ...
295  }
Figure 12: Self-inferencing property for a reflective field modification, deduced from the null argument and the dynamic type of the value argument at a set() call site.

[Reflective Field Modification (Figure 12)] Like the case in Figure 11, the field object in line 290 is also read from an array of field objects created in line 302. This code pattern appears one more time in line 432 in the same class, i.e., org.eclipse.osgi.util.NLS. According to the two arguments, null and value, provided at set() (line 290), we can deduce that the target field (to be modified in line 290) is static (from null) and its declared type must be java.lang.String or one of its supertypes (from the type of value).
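The field deductions made for Figures 11 and 12 can be sketched in the same spirit. The class FieldInferenceDemo and its nested Holder class are hypothetical, and the cast type and value type are supplied by hand rather than derived by an analysis. For get(null) followed by a cast, the candidates are the static fields whose declared type is the cast type, a supertype or a subtype; for set(null, value), they are the static fields whose declared type is the type of value or one of its supertypes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.function.Predicate;

class FieldInferenceDemo {
    /** Hypothetical holder; only its field declarations matter. */
    public static class Holder {
        public static Object any;
        public static CharSequence text;
        public static String message;
        public static Integer count;
        public String perInstance;
    }

    // Names of the static fields of declaring whose declared type passes typeOk.
    private static SortedSet<String> staticFields(Class<?> declaring,
                                                  Predicate<Class<?>> typeOk) {
        SortedSet<String> names = new TreeSet<>();
        for (Field f : declaring.getDeclaredFields()) {
            if (f.isSynthetic() || !Modifier.isStatic(f.getModifiers())) {
                continue; // a null receiver rules out instance fields
            }
            if (typeOk.test(f.getType())) {
                names.add(f.getName());
            }
        }
        return names;
    }

    // (T) field.get(null): declared type is T, a supertype of T or a subtype of T.
    static SortedSet<String> forStaticGet(Class<?> declaring, Class<?> castType) {
        return staticFields(declaring,
                t -> t.isAssignableFrom(castType) || castType.isAssignableFrom(t));
    }

    // field.set(null, value): declared type is value's type or one of its supertypes.
    static SortedSet<String> forStaticSet(Class<?> declaring, Class<?> valueType) {
        return staticFields(declaring, t -> t.isAssignableFrom(valueType));
    }

    public static void main(String[] args) {
        System.out.println(forStaticGet(Holder.class, CharSequence.class));
        System.out.println(forStaticSet(Holder.class, StringBuilder.class));
    }
}
```

In both queries the instance field perInstance and the Integer field count are excluded without any field name being known.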

[Self-Inferencing Property] For each side-effect call site, where reflection is used, all the information of its arguments (including the receiver object), i.e., the number of arguments, their types, and the possible downcasts on its returned values, together with the possible string values statically resolved at its corresponding entry and member-introspecting call sites, forms its self-inferencing property.

We argue that the self-inferencing property is a pervasive fact about Java reflection, due to the characteristics of object-oriented programming and the Java reflection API. As an example, the declared type of the object (reflectively returned by get() and invoke() or created by newInstance()) is always java.lang.Object. Therefore, the object returned must be first cast to a specific type before it is used as a regular object, except when its dynamic type is java.lang.Object or it will be used only as a receiver for the methods inherited from java.lang.Object; otherwise, the compilation would fail. As another example, the descriptor of a target method reflectively called at invoke() must be consistent with what is specified by its second argument (e.g., parameters in line 175 of Figure 10); otherwise, exceptions would be thrown at runtime. These constraints should be exploited to enable resolving reflection in a disciplined way.

The self-inferencing property not only helps resolve reflective calls more effectively when the values of string arguments are partially known (e.g., when either a class name or a member name is known), but also provides an opportunity to resolve some reflective calls even if the string values are fully unknown. For example, in some Android apps, class and method names for reflective calls are encrypted for benign or malicious obfuscation, which “makes it impossible for any static analysis to recover the reflective call” [Rastogi et al. (2013)]. However, this appears to be too pessimistic in our setting, because, in addition to the string values, some other self-inferencing hints are possibly available to facilitate reflection resolution. For example, given (A)invoke(o, {...}), the class type of the target method can be inferred from the dynamic type of o (by pointer analysis). In addition, the declared return type and descriptor of the target method can also be deduced from A and {...}, respectively, as discussed above.

Remark 7. The self-inferencing property is inherent and pervasive in the reflective code of Java programs. However, this property has not been fully exploited in analyzing reflection before. We will show how this property can be leveraged in different ways (for analyzing different kinds of reflective methods as shown in Sections 4.2 and 4.3) in order to make reflection analysis significantly more effective.

3 Overview of Solar

We first introduce the design goal of, challenges faced by, and insights behind Solar in Section 3.1. We then present an overview of the Solar framework including its basic working mechanism and the functionalities of its components in Section 3.2.

3.1 Goals, Challenges and Insights

Design Goal

As already discussed in Section 1.3, Solar is designed to resolve reflection as soundly as possible (i.e., more soundly or even soundly when some reasonable assumptions are met) and accurately identify the reflective calls resolved unsoundly.

Challenges

In addition to the challenges described in Section 1.1, we must also address another critical problem: it is hard to reason about the soundness of Solar and identify accurately which parts of the reflective code have been resolved unsoundly.

If one target method at one reflective call is missed by the analysis, it may be possible to identify the statements that are unaffected and thus still handled soundly. However, the situation will deteriorate sharply if many reflective calls are resolved unsoundly. In the worst case, all the other statements in the program may be handled unsoundly. To be safe, the behaviors of all statements must be assumed to be under-approximated in the analysis, as we do not know which value at which statement has been affected by the unsoundly resolved reflective calls.

Insights

To achieve the design goals of Solar, we first need to ensure that as few reflective calls are resolved unsoundly as possible. This will reduce the propagation of unsoundness to as few statements as possible in the program. As a result, if Solar reports that some analysis results are sound (unsound), then they are likely sound (unsound) with high confidence. This is the key to enabling Solar to achieve practical precision in terms of both soundness reasoning and unsoundness identification.

To resolve most or even all reflective calls soundly, Solar needs to maximally leverage the available information (the string values at reflective calls are inadequate as they are often unknown statically) in the program to help resolve reflection. Meanwhile, Solar should resolve reflection precisely. Otherwise, Solar may be unscalable due to too many false reflective targets introduced. In Sections 4.2 and 4.3, we will describe how Solar leverages the self-inferencing property in a program (Definition 12) to analyze reflection with good soundness and precision.

Finally, Solar should be aware of the conditions under which a reflective target cannot be resolved. In other words, we need to formulate a set of soundness criteria for different reflection methods based on different resolution strategies adopted. If the set of criteria is not satisfied, Solar can mark the corresponding reflective calls as the ones that are resolved unsoundly. Otherwise, Solar can determine the soundness of the reflection analysis under some reasonable assumptions (Section 4.1).

3.2 The Solar Framework

Figure 13 gives an overview of Solar. Solar consists of four core components: an inference engine for discovering reflective targets, an interpreter for soundness and precision, a locater for unsound and imprecise calls, and a Probe (a lightweight version of Solar). In the rest of this section, we first introduce the basic working mechanism of Solar and then briefly explain the functionality of each of its components.

3.2.1 Working Mechanism

Given a Java program, the inference engine resolves and infers the reflective targets that are invoked or accessed at all side-effect method call sites in the program, as soundly as possible. There are two possible outcomes. If the reflection resolution is scalable (under a given time budget), the interpreter will proceed to assess the quality of the reflection resolution under soundness and precision criteria. Otherwise, Probe, a lightweight version of Solar, will be called upon to analyze the same program again. As Probe resolves reflection less soundly but much more precisely than Solar, its scalability can usually be guaranteed. We envisage providing a range of Probe variants with different trade-offs among soundness, precision and scalability, so that the scalability of Probe can always be guaranteed.

Figure 13: Overview of Solar.

If the interpreter confirms that the soundness criteria are satisfied, Solar reports that the reflection analysis is sound. Otherwise, the locater will be in action to identify which reflective calls are resolved unsoundly. In both cases, the interpreter will also report the reflective calls that are resolved imprecisely if the precision criteria are violated. This allows potential precision improvements to be made for the analysis.

The locater not only outputs the list of reflective calls that are resolved unsoundly or imprecisely in the program but also pinpoints the related entry and member-introspecting method calls of these “problematic” calls, which contain the hints to guide users to add annotations, if possible. Figure 14 depicts an example output.

As will be demonstrated in Section 7, for many programs, Solar is able to resolve reflection soundly under some reasonable assumptions. However, for certain programs, like other existing reflection analyses, Solar is unscalable. In this case, Probe (a lightweight version of Solar whose scalability can be guaranteed as explained above) is applied to analyze the same program. Note that Probe is also able to identify the reflective calls that are resolved unsoundly or imprecisely in the same way as Solar. Thus, with some unsound or imprecise reflective calls identified by Probe and annotated by users, Solar will re-analyze the program scalably after one or more iterations of this “probing” process. As discussed in Section 7, the number of such iterations is usually small, e.g., only one is required for most of the programs evaluated.

For some programs, users may choose not to add annotations to facilitate reflection analysis. Even in this case, users can still benefit from the Solar approach, for two reasons. First, Probe is already capable of producing good-quality reflection analysis results, more soundly than string analysis. Second, users can understand the quality of these results by inspecting the locater’s output, as discussed in Section 1.2.

Figure 14: An example output from Solar when its soundness/precision criteria are violated.

3.2.2 Basic Components

The functionalities of the four components of Solar are briefly explained below.

Reflective Target Inference Engine

We employ two techniques to discover reflective targets: collective inference for resolving reflective method invocations (invoke()) and field accesses/modifications (get()/set()) and lazy heap modeling for handling reflective object creation (newInstance()). Both techniques exploit the self-inferencing property found in our reflection usage study (Section 2.3.3) to resolve reflection in a disciplined manner with good soundness and precision. We will explain their approaches in Sections 4.2 and 4.3, respectively, and further formalize them in Section 5.4.

Soundness and Precision Interpreter

Solar currently adopts a simple but practical scheme to measure precision, in terms of the number of targets resolved at a side-effect call site. Solar allows users to specify a threshold value in advance to define the imprecision that can be tolerated for each kind of side-effect call, forming its precision criteria. Its soundness criteria are formulated in terms of conditions under which various inference rules (adopted by the inference engine) can be applied soundly. We will formalize the soundness criteria required in Section 5.5.

Unsound and Imprecise Call Locater

A side-effect reflective call is identified as being imprecisely resolved if the number of resolved targets is higher than permitted by its corresponding precision criterion. Similarly, a side-effect reflective call is marked as being unsoundly resolved if its corresponding soundness criterion is violated.
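The two checks just described amount to a simple classification per call site, sketched below. The class CallSiteLocater, the Flag enum and the numbers used in main are hypothetical; in particular, whether a soundness criterion holds is passed in as a boolean here, whereas Solar derives it from its inference rules.

```java
import java.util.EnumSet;

class CallSiteLocater {
    enum Flag { UNSOUND, IMPRECISE }

    // Flags one side-effect call site: IMPRECISE if more targets were resolved
    // than the user-specified threshold permits, UNSOUND if the soundness
    // criterion for this kind of call is violated. Both flags may apply.
    static EnumSet<Flag> classify(int resolvedTargets, int threshold,
                                  boolean soundnessCriterionHolds) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        if (!soundnessCriterionHolds) {
            flags.add(Flag.UNSOUND);
        }
        if (resolvedTargets > threshold) {
            flags.add(Flag.IMPRECISE);
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(classify(3, 10, true));    // prints []
        System.out.println(classify(25, 10, false));  // prints [UNSOUND, IMPRECISE]
    }
}
```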

To facilitate user annotations for an imprecisely or unsoundly resolved side-effect reflective call, the locater also pinpoints its corresponding entry and member-introspecting method call sites. It can be difficult to understand the semantics of a reflective call by reading just the code in its vicinity. Often, more hints about its semantics are available at or around its entry and member-introspecting method call sites, which may reside in different methods or even classes in the program.

Figure 14 illustrates Solar’s output for a real program. The invoke() (side-effect method) call site in method getValue is the unsoundly resolved call identified. Its entry method (forName()) and member-introspecting method (getMethods()) call sites, which are located in the constructor of class org.hsqldb.Function, are also highlighted. At the right-hand side of the figure, we can see that the hints for annotations are available around the entry and member-introspecting call sites (e.g., lines 169, 184 and 185) rather than the side-effect call site (line 352). This demonstrates the usefulness of Solar’s annotation strategy.

We will further explain how Solar identifies unsoundly resolved reflective calls in Section 4.4 and how users are guided to add annotations in Section 4.5.

Probe

Probe is a lightweight version of Solar obtained by weakening the power of its inference engine. Probe changes its inference strategies in both collective inference and lazy heap modeling, resolving reflection more precisely but less soundly. Thus, the scalability of Probe can usually be guaranteed as fewer false reflective targets are introduced. We will formalize Probe based on the formalism of Solar in Section 5.7.

4 The Solar Methodology

We first define precisely a set of assumptions made (Section 4.1). Then we examine the methodologies of collective inference (Section 4.2) and lazy heap modeling (Section 4.3) used in Solar’s inference engine. Finally, we explain how Solar identifies unsoundly resolved reflective calls (Section 4.4) and how doing so helps guide users to add lightweight annotations to facilitate a subsequent reflection analysis (Section 4.5).

4.1 Assumptions

There are four reasonable assumptions. The first one is commonly made in static analysis [Sonntag and Colnet (2014)] and the next two were made previously in reflection analysis for Java [Livshits et al. (2005)]. Solar adds one more assumption to allow reflective allocation sites to be modeled lazily. Under the four assumptions, it becomes possible to reason about the soundness and imprecision of Solar.

Assumption 1 (Closed-World)

Only the classes reachable from the class path at analysis time can be used during program execution.

This assumption is reasonable since we cannot expect static analysis to handle all classes that a program may download from the Internet and load at runtime. In addition, Java native methods are excluded as well.

Assumption 2 (Well-Behaved Class Loaders)

The name of the class returned by a call to Class.forName(cName) equals cName.

This assumption says that the class to be loaded by Class.forName(cName) is the expected one specified by the value of cName, thus avoiding handling the situation where a different class is loaded by, e.g., a malicious custom class loader. How to handle custom class loaders statically is still a hard open problem. Note that this assumption also applies to loadClass(), another entry method shown in Figure 4.

Assumption 3 (Correct Casts)

Type cast operations applied to the results of calls to side-effect methods are correct, without throwing a ClassCastException.

This assumption has been recently demonstrated as practically valid through extensive experiments in [Landman et al. (2017)].

Assumption 4 (Object Reachability)

Every object o created reflectively in a call to newInstance() flows into (i.e., will be used in) either (1) a type cast operation …= (T) v or (2) a call to a side-effect method, get(v), set(v,…) or invoke(v,…), where v points to o, along every execution path in the program.

Cases (1) and (2) represent two kinds of usage points at which the class types of object o will be inferred lazily. Specifically, case (1) indicates that o is used as a regular object, and case (2) says that o is used reflectively, i.e., flows to the first argument of different side-effect calls as a receiver object. The only situation not covered by this assumption is the rare one where o is created but never used later. As validated in Section 7.2, Assumption 4 is found to hold for almost all reflective allocation sites in the real code.

4.2 Collective Inference

Figure 15 gives an overview of collective inference for handling reflective method invocations and field accesses/modifications. Essentially, we see how the side-effect method calls invoke(), get() and set() are resolved. A Class object is first created for the target class named cName. Then a Method (Field) object representing the target method (field) named mName (fName) in that target class is created. Finally, at some reflective call sites, e.g., invoke(), get() and set(), the target method (field) is invoked (accessed) on the target object o, with the arguments, {...} or a.

Solar works as part of a pointer analysis, with each being both the producer and consumer of the other. By exploiting the self-inferencing property (Definition 12) inherent in the reflective code, Solar employs the following two component analyses:

Target Propagation (Marked by Solid Arrows)

Solar resolves the targets (methods or fields) of reflective calls, invoke(), get() and set(), by propagating the names of their target classes and methods/fields (e.g., those pointed to by cName, mName and fName if statically known) along the solid lines into the points symbolized by circles.

Figure 15: Collective Inference in Solar.
Target Inference (Marked by Dashed Arrows)

By using Target Propagation alone, a target member name (blue circle) or its target class type (red circle) at a reflective call site may be missing, i.e., unknown, due to the presence of input-dependent strings (Figure 7). If the target class type (red circle) is missing, Solar will infer it from the dynamic type of the target object o (obtained by pointer analysis) at invoke(), get() or set() (when o != null). If the target member name (blue circle) is missing, Solar will infer it from (1) the dynamic types of the arguments of the target call, e.g., {...} of invoke() and a of set(), and/or (2) the downcast on the result of the call, such as (A) at invoke() and get().

Let us illustrate Target Inference by considering the get() call in Figure 15, which takes o as its argument and whose result r is downcast to A. If a target field name is known but its target class type (i.e., red circle) is missing, we can infer it from the types of all the objects pointed to by o: for each such type, a potential target class of o is that type or any of its supertypes. If the target class type is known but a potential target field name (i.e., blue circle) is missing, we can deduce it from the downcast (A) to resolve the call to r = o.f, where f is a member field in the target class whose type is A or a supertype or subtype of A. A supertype is possible because a field of this supertype may initially point to an object of type A or a subtype of A.
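The first half of this inference, recovering a missing target class from the dynamic type of o when the field name is known, can be sketched as follows. The class TargetClassInferenceDemo and its nested classes Base and Derived are hypothetical, the dynamic type is given directly instead of being queried from a pointer analysis, and, as a simplification of this sketch, only superclasses (not interfaces) that actually declare the named field are reported.

```java
import java.util.ArrayList;
import java.util.List;

class TargetClassInferenceDemo {
    public static class Base { public int count; }
    public static class Derived extends Base { public int extra; }

    // Walks from the dynamic type of o up through its superclasses and keeps
    // the classes that declare a field with the known name.
    static List<String> declaringCandidates(Class<?> dynamicType, String fieldName) {
        List<String> result = new ArrayList<>();
        for (Class<?> c = dynamicType; c != null; c = c.getSuperclass()) {
            try {
                c.getDeclaredField(fieldName);
                result.add(c.getSimpleName());
            } catch (NoSuchFieldException e) {
                // Not declared in this class; continue with its superclass.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(declaringCandidates(Derived.class, "count")); // prints [Base]
        System.out.println(declaringCandidates(Derived.class, "extra")); // prints [Derived]
    }
}
```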

In Figure 15, if getMethods() (getFields()) is called as a member-introspecting method instead, then an array of Method (Field) objects will be returned so that Target Propagation from it is implicitly performed by pointer analysis. All the other methods in Class for introspecting methods/fields/constructors are handled similarly.

Resolution Principles

To balance soundness, precision and scalability in a disciplined manner, collective inference resolves the targets at a side-effect method call site (Figure 15) if and only if one of the following three conditions is met:

  • Both its target class type (red circle) and target member name (blue circle) are made available by target propagation (solid arrow) or target inference (dashed arrow).

  • Only its target class type (red circle) is made available by target propagation (solid arrow) or target inference (dashed arrow).

  • Only its target member name (blue circle) is made available by both target propagation (solid arrow) and target inference (dashed arrow).

In practice, the first condition is met by many calls to invoke(), get() and set(). In this case, the number of spurious targets introduced can be significantly reduced due to the simultaneous enforcement of two constraints (the red and blue circles).

To increase the inference power of Solar, as explained in Section 3.1, we will also resolve a side-effect call under either of the other two conditions (i.e., when only one circle is available). The second condition requires only its target class type to be inferable, as a class (type) name that is prefixed with its package name is usually unique. However, when only its target member name (blue circle) is inferable, we insist that both its name (solid arrow) and its descriptor (dashed arrow) are available. In a large program, many unrelated classes may happen to use the same method name, so relying on the name of a method alone in the last condition may cause imprecision.

If a side-effect call does not satisfy any of the above three conditions, then Solar will flag it as being unsoundly resolved, as described in Section 4.4.
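The three conditions can be read as a small decision procedure. The Java sketch below is an illustration only: the class, the enum and the three boolean flags are hypothetical names that summarize what target propagation and target inference have made available at a call site.

```java
class ResolutionPrinciplesSketch {
    enum Decision { RESOLVE, FLAG_UNSOUND }

    // classTypeKnown:     the target class type (red circle) is available
    // memberNameKnown:    the member name (blue circle) came from target propagation
    // descriptorInferred: the member descriptor came from target inference
    static Decision decide(boolean classTypeKnown,
                           boolean memberNameKnown,
                           boolean descriptorInferred) {
        if (classTypeKnown) {
            // Conditions 1 and 2: a class type alone is enough to resolve.
            return Decision.RESOLVE;
        }
        if (memberNameKnown && descriptorInferred) {
            // Condition 3: without a class type, both name and descriptor are required.
            return Decision.RESOLVE;
        }
        // No condition holds: the call is reported as unsoundly resolved.
        return Decision.FLAG_UNSOUND;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, true, false)); // FLAG_UNSOUND
    }
}
```

A method name without a descriptor is deliberately not enough here, reflecting the precision concern about unrelated classes sharing a method name.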

4.3 Lazy Heap Modeling

Figure 16: Lazy heap modeling (LHM). The abstract objects, $o_i^A$, $o_i^B$ and $o_i^D$, for newInstance() are created lazily at the two kinds of LHM (usage) points in Cases (II) and (III), where A and B have no subtypes and m1 is declared in D with one subtype B, implying that the dynamic type of the objects pointed to by v4 is D or B.

As shown in Section 2.3.3, reflective object creation, i.e., newInstance(), is the most widely used side-effect method. Lazy heap modeling (LHM), illustrated in Figure 16, is developed to facilitate its target inference and the soundness reasoning for Solar.

There are three cases. Let us consider Cases (II) and (III) first. Usually, an object, say o, created by newInstance() will be used later either regularly or reflectively as shown in Cases (II) and (III), respectively. In Case (II), since the declared type of o is java.lang.Object, o is first cast to a specific type before being used for calling methods or accessing fields as a regular object. Thus, o will flow to some cast operations. In Case (III), o is used in a reflective way, i.e., as the first argument of a call to a side-effect method, invoke(), get() or set(), on which the target method (field) is called (accessed). This usage appears to be especially common in Android apps.

For these two cases, we can leverage the information at o’s usage sites to infer its type lazily and also make its corresponding effects (on static analysis) visible there. As for the (regular) side-effects that may be made by o along the paths from the newInstance() call site to its usage sites, we use Case (I) to cover this situation.
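To make the three kinds of usage concrete, here is a small runnable Java program written in the spirit of these cases. It is a hypothetical example, not the code of Figure 16: the class Worker, the method demo and the variable names are assumptions chosen for illustration.

```java
import java.lang.reflect.Method;

class LazyHeapCasesSketch {
    // Hypothetical class that is instantiated reflectively.
    public static class Worker {
        public String work() { return "done"; }
    }

    @SuppressWarnings("deprecation") // Class.newInstance() is deprecated since Java 9
    static String demo(String cName, String mName) {
        try {
            Class<?> c1 = Class.forName(cName); // cName may be statically unknown
            Object v = c1.newInstance();        // created object: type unknown statically

            // Case (I): before any cast, only java.lang.Object methods can be called.
            Class<?> k = v.getClass();

            // Case (II): regular use after a downcast.
            Worker w = (Worker) v;
            String viaCast = w.work();

            // Case (III): reflective use as the receiver of invoke().
            Method m1 = k.getMethod(mName);
            String viaInvoke = (String) m1.invoke(v);

            return viaCast + "/" + viaInvoke;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(Worker.class.getName(), "work")); // done/done
    }
}
```

The cast in Case (II) and the invoke() receiver in Case (III) are the two places where a static analysis can recover the type of the reflectively created object.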

Now, we examine these Cases (I) – (III), which are highlighted in Figure 16, one by one, in more detail. If cName at c = Class.forName(cName) is unknown, Solar will create a Class object $c^u$ that represents this unknown class and assign it to c. On discovering that c1 points to a $c^u$ at an allocation site $i$ (v = c1.newInstance()), Solar will create an abstract object $o_i^u$ of an unknown type for the site to mark it as being unresolved yet. Subsequently, $o_i^u$ will flow into Cases (I) – (III).

In Case (I), the returned type of o is declared as java.lang.Object. Before o flows to a cast operation, the only side-effect that can be made by this object is to call some methods declared in java.lang.Object. In terms of reflection analysis, only the two pointer-affecting methods shown in Figure 16 need to be considered. Solar handles both soundly, by returning (1) an unknown string for v1.toString() and (2) an unknown Class object for v1.getClass(). Note that clone() cannot be called on v1 of type java.lang.Object (without a downcast being performed on v1 first).

Let us consider Cases (II) and (III), where each statement, say $S_x$, is called an LHM point, containing a variable $x$ into which $o_i^u$ flows. In Figure 16, we have $x \in \{v2, v3, v4\}$. Let $lhm(S_x)$ be the set of class types discovered for the unknown class $u$ at $S_x$ by inferring from the cast operation at $S_x$ as in Case (II) or the information available at a call to (C) m1.invoke(v4, args) (e.g., on C, m1 and args) as in Case (III). For example, given S2: A a = (A) v2, lhm(S2) contains A and its subtypes. To account for the side-effects of v = c1.newInstance() at $S_x$ lazily, we add (conceptually) a statement, x = new T(), for every $T \in lhm(S_x)$, before $S_x$. Thus, $o_i^u$ is finally split into and thus aliased with $n$ distinct abstract objects, $o_i^{T_1}, \ldots, o_i^{T_n}$, where $lhm(S_x) = \{T_1, \ldots, T_n\}$, such that $x$ will be made to point to all these new abstract objects.

Figure 16 illustrates lazy heap modeling for the case when neither A nor B has subtypes and the declaring class for m1 is discovered to be D, which has one subtype B. Thus, Solar will deduce that $lhm(S2) = \{A\}$, $lhm(S3) = \{B\}$ and $lhm(S4) = \{B, D\}$. Note that in Case (II), $o_i^u$ will not flow to a and b due to the cast operations.

As java.lang.Object contains no fields, all field accesses to $o_i^u$ will only be made on its lazily created objects. Therefore, if the same concrete object represented by $o_i^u$ flows to both $S_{x_1}$ and $S_{x_2}$, then $lhm(S_{x_1}) \cap lhm(S_{x_2}) \neq \emptyset$. This implies that $x_1$ and $x_2$ will point to a common object lazily created. For example, in Figure 16, v3 and v4 point to $o_i^B$ since $lhm(S3) \cap lhm(S4) = \{B\}$. As a result, the alias relation between $x_1.f$ and $x_2.f$ is correctly maintained, where $f$ is a field of $o_i^u$.
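The sets discussed for Figure 16 can be reproduced with a few lines of Java. This is a sketch under stated assumptions: the closed world of three classes, the helper names lhm and common, and the use of run-time Class objects to model the class hierarchy are all hypothetical conveniences, not part of Solar.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class LhmSketch {
    // Hypothetical hierarchy matching the description: A and B have no subtypes,
    // m1 is declared in D, and B is the only subtype of D.
    static class A {}
    static class D { public void m1() {} }
    static class B extends D {}

    // Closed world assumed for this sketch.
    static final List<Class<?>> WORLD = Arrays.asList(A.class, B.class, D.class);

    // Class types possible at an LHM point constrained by `bound`
    // (a cast type, or the declaring class of an invoked method):
    // `bound` itself and each of its subtypes in WORLD.
    static Set<String> lhm(Class<?> bound) {
        Set<String> types = new TreeSet<>();
        for (Class<?> c : WORLD) {
            if (bound.isAssignableFrom(c)) {
                types.add(c.getSimpleName());
            }
        }
        return types;
    }

    // Types shared by two LHM points, i.e., the lazily created objects they have in common.
    static Set<String> common(Set<String> x, Set<String> y) {
        Set<String> r = new TreeSet<>(x);
        r.retainAll(y);
        return r;
    }

    public static void main(String[] args) {
        Set<String> s2 = lhm(A.class); // cast to A
        Set<String> s3 = lhm(B.class); // cast to B
        Set<String> s4 = lhm(D.class); // invoke() of a method declared in D
        System.out.println(s2 + " " + s3 + " " + s4 + " " + common(s3, s4));
        // [A] [B] [B, D] [B]
    }
}
```

The non-empty intersection of the last two sets is what lets v3 and v4 share one lazily created object, so aliasing through its fields is preserved.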

Figure 17: An example for illustrating LHM in Solar.

In Figure 17, Solar will model the newInstance() call in line 3 lazily (as cName1 is statically unknown) by returning an object $o_3^u$ of an unknown type $u$. Note that $o_3^u$ flows into two kinds of usage points: the cast operation in line 12 and the invoke() call in line 15. In the former case, Solar will infer $u$ to be A and its subtypes in line 12. In the latter case, Solar will infer $u$ based on the information available in line 15 by distinguishing three cases. (1) If cName2 is known, then Solar deduces $u$ from the known class in cName2. (2) If cName2 is unknown but mName2 is known, then Solar deduces $u$ from the known method name in mName2 and the second argument new Object[] {b,c} of the invoke() call site. (3) If both cName2 and mName2 are unknown (given that the types of $o_3^u$ are already unknown), then Solar will flag the invoke() call in line 15 as being unsoundly resolved, detected automatically by verifying one of the soundness criteria, i.e., Condition (4) in Section 5.5.

Discussion

Under Assumption 4, we need only to handle the three cases in Figure 16 in order to establish whether a newInstance() call has been modeled soundly or not. The rare exception (which breaks Assumption 4) is that o is created but never used later (where no hints are available). To achieve soundness in this rare case, the corresponding constructor (of the dynamic type of o) must be annotated to be analyzed statically unless ignoring it will not affect the points-to information to be obtained. Again, as validated in Section 7.2, Assumption 4 is found to be very practical.

4.4 Unsound Call Identification

Intuitively, we mark a side-effect reflective call as being unsoundly resolved when Solar has exhausted all its inference strategies to resolve it, but to no avail. In addition to Case (3) in Example 4.3, let us consider another case in Figure 16, except that c2 and mName are assumed to be unknown. Then m1 at s4: m1.invoke(v4, args) will be unknown. Solar will mark it as unsoundly resolved, since just leveraging args alone to infer its target methods may cause Solar to be too imprecise to scale (Section 4.2).

The formal soundness criteria that are used to identify unsoundly resolved reflective calls are defined and illustrated in Section 5.5.

4.5 Guided Lightweight Annotation

As shown in Example 3.2.2, Solar can guide users to the program points where hints for annotations are potentially available for unsoundly or imprecisely resolved reflective calls. As these “problematic” call sites are the places in a program where side-effect methods are invoked, we can hardly extract the information there to know the names of the reflective targets, as they are specified at the corresponding entry and member-introspecting call sites (also called the annotation sites), which may not appear in the same method or class (as the “problematic call sites”). Thus, Solar is designed to automatically track the flows of metaobjects from the identified “problematic” call sites in a demand-driven way to locate all the related annotation sites.

In Solar, we propose to add annotations for unsoundly resolved side-effect call sites, which are often identified accurately. As a result, the number of annotations required (for achieving soundness) would be significantly smaller than that required in [Livshits et al. (2005)], which simply asks for annotations when the string argument of a reflective call is statically unknown. This is further validated in Section 7.3.

5 Formalism

We formalize Solar, as illustrated in Figure 13, for RefJava, which is Java restricted to a core subset of its reflection API. Solar is flow-insensitive but context-sensitive. However, our formalization is context-insensitive for simplicity. We first define RefJava (Section 5.1), give a road map for the formalism (Section 5.2) and present some notations used (Section 5.3). We then introduce a set of rules for formulating collective inference and lazy heap modeling in Solar’s inference engine (Section 5.4). Based on these rules, we formulate a set of soundness criteria (Section 5.5) that enables reasoning about the soundness of Solar (Section 5.6). Finally, we describe how to instantiate Probe from Solar (Section 5.7), and handle static class members (Section 5.8).

5.1 The RefJava Language

RefJava consists of all Java programs (under Assumptions 1–4) except that the Java reflection API is restricted to the seven core reflection methods: one entry method Class.forName(), two member-introspecting methods getMethod() and getField(), and four side-effect methods for reflective object creation newInstance(), reflective method invocation invoke(), reflective field access get() and modification set().
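For readers less familiar with the API, the seven core methods can be exercised together in one short program. The example is illustrative only: the class Greeter and the method run are hypothetical, while every reflective call used belongs to the standard java.lang and java.lang.reflect API.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

class RefJavaCoreApiDemo {
    // Hypothetical target class for the reflective calls.
    public static class Greeter {
        public String name = "world";
        public String greet(String prefix) { return prefix + ", " + name; }
    }

    @SuppressWarnings("deprecation") // Class.newInstance() is deprecated since Java 9
    static String run() {
        try {
            Class<?> c = Class.forName(Greeter.class.getName()); // entry method
            Method m = c.getMethod("greet", String.class);       // member-introspecting
            Field f = c.getField("name");                        // member-introspecting
            Object o = c.newInstance();                          // side effect: object creation
            f.set(o, "Solar");                                   // side effect: field modification
            Object n = f.get(o);                                 // side effect: field access
            Object r = m.invoke(o, "hello");                     // side effect: method invocation
            return n + "|" + r;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // Solar|hello, Solar
    }
}
```

Here all string arguments are constants, which is the easy case for a static analysis; the difficulty discussed in this paper arises when they are computed at run time.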

Our formalism is designed to allow its straightforward generalization to the entire Java reflection API. As is standard, a Java program is represented only by five kinds of statements in the SSA form, as shown in Figure 20. For simplicity, we assume that all the members (fields or methods) of a class accessed reflectively are its instance members, i.e., o != null in get(o), set(o,a) and invoke(o,…) in Figure 15. We will formalize how to handle static members in Section 5.8.

5.2 Road Map

Figure 18: Solar’s inference engine: five components and their inter-component dependences (depicted by black arrows). The dependences between Solar and pointer analysis are depicted in red arrows.

As depicted in Figure 18, Solar’s inference engine, which consists of five components, works together with a pointer analysis. The arrow between a component and the pointer analysis means that each is both a producer and consumer of the other.

Let us take an example to see how this road map works. Consider the side-effect call t = f1.get(o) in Figure 15. If cName and fName are string constants, Propagation will create a Field object (pointed to by f1) carrying its known class and field information and pass it to Target Search (1). If cName or fName is not a constant, a Field object marked as such is created and passed to Inference (2), which will infer the missing information and pass a freshly generated Field object enriched with the missing information to Target Search (3). Then Target Search maps a Field object to its reflective target in its declaring class (4). Finally, Transformation turns the reflective call t = f1.get(o) into a regular statement t = o.f, where f is the target field found, and passes it to the pointer analysis (5). Note that Lazy Heap Modeling handles newInstance() based on the information discovered by Propagation (a) or Inference (b).
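The net effect of this pipeline on a reflective field read can be illustrated with a runnable comparison. The sketch is hypothetical (the class Box and the two helper methods are not from the paper): it only shows that, once the target field is known, a reflective get() behaves like the regular field access that Transformation produces.

```java
import java.lang.reflect.Field;

class RoadMapSketch {
    // Hypothetical class with a single public field f.
    public static class Box {
        public Object f = "payload";
    }

    // The reflective form: t = f1.get(o), with f1 obtained from cName and fName.
    static Object reflective(Object o, String cName, String fName) {
        try {
            Field f1 = Class.forName(cName).getField(fName);
            return f1.get(o);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The regular statement it is turned into once the target is resolved: t = o.f.
    static Object regular(Box o) {
        return o.f;
    }

    public static void main(String[] args) {
        Box b = new Box();
        Object t1 = reflective(b, Box.class.getName(), "f");
        Object t2 = regular(b);
        System.out.println(t1 == t2); // true: both read the same object
    }
}
```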

5.3 Notations

In this paper, a field signature consists of the field name and descriptor (i.e., field type), and a field is specified by its field signature and the class where it is defined (declared or inherited). Similarly, a method signature consists of the method name and descriptor (i.e., return type and parameter types), and a method is specified by its method signature and the class where it is defined.

  • class type: $t$
  • unknown: $u$
  • Class object: $c^t$, $c^u$
  • Abstract heap object: $o_i^t$, $o_i^u$
  • Field object*: $f^t_s$, $f^t_u$, $f^u_s$, $f^u_u$
  • field signature*: $s$, made up of a field type* and a field name*
  • Method object*: $m^t_s$, $m^t_u$, $m^u_s$, $m^u_u$
  • method signature*: $s$, made up of a return type*, a method name* and a parameter*
  • further domains: field/method name, field, method, parameter (types), local variable

Figure 19: Notations. Here, $u$ is an unknown class type or an unknown field/method signature. A superscript ‘*’ marks a domain that contains $u$.

We will use the notations given in Figure 19. $\mathbb{CO}$, $\mathbb{FO}$ and $\mathbb{MO}$ represent the set of Class, Field and Method objects, respectively. In particular, $c^t$ denotes a Class object of a known class $t$ and $c^u$ denotes a Class object of an unknown class $u$. As illustrated in Figure 16, we write $o_i^t$ to represent an abstract object created at an allocation site $i$ if it is an instance of a known class $t$ and $o_i^u$ of (an unknown class type) otherwise. For a Field object, we write $f^t_s$ if it is a field defined in a known class $t$ and $f^u_s$ otherwise, with its signature being $s$. In particular, we write $f^{-}_u$ for $f^{-}_s$ (with ‘-’ standing for either $t$ or $u$) in the special case when $s$ is unknown, i.e., when both its field type and its field name are $u$. Similarly, $m^t_s$, $m^t_u$, $m^u_s$ and $m^u_u$ are used to represent Method objects. We write $m^{-}_u$ for $m^{-}_s$ when $s$ is unknown (with the return type being irrelevant, i.e., either known or unknown), i.e., when both its method name and its parameter types are $u$.

5.4 The Inference Engine of Solar

We present the inference rules used by all the components in Figure 18, starting with the pointer analysis and moving to the five components of Solar. Due to their cyclic dependencies, the reader is invited to read ahead sometimes, particularly to Section 5.4.6 on LHM, before returning to the current topic.

5.4.1 Pointer Analysis

Figure 20 gives a standard formulation of a flow-insensitive Andersen’s pointer analysis for RefJava. pt(x) represents the points-to set of a pointer x. An array object is analyzed with its elements collapsed to a single pseudo-field, so that x[i] = y is treated as a store to that pseudo-field of x. In [A-New], $o_i^t$ uniquely identifies the abstract object created as an instance of $t$ at this allocation site, labeled by $i$. In [A-Ld] and [A-St], only the fields of an abstract object $o_i^t$ of a known type $t$ can be accessed. In Java, as explained in Section 4.3, the field accesses to $o_i^u$ (of an unknown type) can only be made to the abstract objects of known types created lazily from $o_i^u$ at LHM points.

Figure 20: Rules for Pointer Analysis ([A-New], [A-Cpy], [A-Ld], [A-St] and [A-Call]).

In [A-Call] (for non-reflective calls), like the one presented in [Sridharan et al. (2013)], a dispatch function is used to resolve the virtual dispatch of a method on the receiver object to its target method. There are two cases. If the receiver is an object $o_i^t$ of a known type, we proceed normally as before. For a receiver $o_i^u$ of an unknown type, it suffices to restrict the methods called to toString() and getClass(), as illustrated in Figure 16 and explained in Section 4.3. We assume that the target method has a formal parameter for the receiver object and formal parameters for the remaining arguments, and that a pseudo-variable is used to hold its return value.
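As a reminder of how such a flow-insensitive, inclusion-based analysis computes points-to sets, the following Java sketch implements a naive fixed-point solver for allocation, copy, load and store statements. It is a simplified illustration, not the formulation of Figure 20: method calls, types and unknown objects are omitted, pointers and abstract objects are plain strings, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class AndersenSketch {
    // pt(x) for variables and pt(o.f) for object fields, both keyed by name.
    final Map<String, Set<String>> pt = new HashMap<>();
    final List<String[]> copies = new ArrayList<>(); // {x, y}    for x = y
    final List<String[]> loads = new ArrayList<>();  // {x, y, f} for x = y.f
    final List<String[]> stores = new ArrayList<>(); // {x, f, y} for x.f = y

    Set<String> pts(String p) {
        return pt.computeIfAbsent(p, k -> new TreeSet<>());
    }

    void alloc(String x, String obj) { pts(x).add(obj); }                 // x = new ...
    void copy(String x, String y) { copies.add(new String[] {x, y}); }
    void load(String x, String y, String f) { loads.add(new String[] {x, y, f}); }
    void store(String x, String f, String y) { stores.add(new String[] {x, f, y}); }

    // Re-apply all constraints until no points-to set grows.
    void solve() {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] c : copies) {
                changed |= pts(c[0]).addAll(pts(c[1]));
            }
            for (String[] l : loads) {
                for (String o : new ArrayList<>(pts(l[1]))) {
                    changed |= pts(l[0]).addAll(pts(o + "." + l[2]));
                }
            }
            for (String[] s : stores) {
                for (String o : new ArrayList<>(pts(s[0]))) {
                    changed |= pts(o + "." + s[1]).addAll(pts(s[2]));
                }
            }
        }
    }

    public static void main(String[] args) {
        AndersenSketch a = new AndersenSketch();
        a.alloc("x", "o1");    // x = new A()
        a.alloc("y", "o2");    // y = new B()
        a.store("x", "f", "y"); // x.f = y
        a.copy("z", "x");      // z = x
        a.load("w", "z", "f"); // w = z.f
        a.solve();
        System.out.println(a.pts("w")); // [o2]
    }
}
```

Because the solver ignores statement order, w receives o2 even though the load is only meaningful after the store and the copy, which is exactly what flow-insensitivity means.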

5.4.2 Propagation

The following figure (Rules for Propagation) gives the rules for handling forName(), getMethod() and getField() calls. Different kinds of Class, Method and Field objects are created depending on whether their string arguments are string constants or not. In these rules, a string argument is checked against the set of string constants; if it is a constant, a Class object $c^t$ is created, where $t$ is the class specified by the value of that string.

Figure: Rules for Propagation ([P-ForName], [P-GetMtd] and [P-GetFld]).

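By design, $c^t$, $f^t_s$ and $m^t_s$ will flow to Target Search but all the others, i.e., $c^u$, $f^u_{-}$, $f^{-}_u$, $m^u_{-}$ and $m^{-}_u$ will flow to Inference, where the missing information is inferred. During Propagation, only the name of a method/field signature $s$ can be discovered but its other parts (the return type and parameter types of a method, or the type of a field) are unknown, i.e., $u$.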

Figure 21: Rules for Collective Inference. For x = (A) m.invoke(y, args): [I-InvTp], [I-InvSig] and [I-InvS2T]. For x = (A) f.get(y): [I-GetTp], [I-GetSig] and [I-GetS2T]. For f.set(y, x): [I-SetTp], [I-SetSig] and [I-SetS2T].

5.4.3 Collective Inference

Figure 21 gives nine rules to infer reflective targets at x = (A) m.invoke(y,args), x = (A) f.get(y), f.set(y,x), where A indicates a post-dominating cast on their results. If A = Object, then no such cast exists. These rules fall into three categories. In [I-InvTp], [I-GetTp] and [I-SetTp], we use the types of the objects pointed to by y to infer the class type of a method/field. In [I-InvSig], [I-GetSig] and [I-SetSig], we use the information available at a call site (excluding y) to infer the descriptor of a method/field signature. In [I-InvS2T], [I-GetS2T] and [I-SetS2T], we use a method/field signature to infer the class type of a method/field.

Some notations used are in order. As is standard, $t <: t'$ holds when $t$ is $t'$ or a subtype of $t'$. In [I-InvSig], [I-GetSig], [I-InvS2T] and [I-GetS2T], $\ll:$ is used to take advantage of the post-dominating cast (A) during inference when A is not Object. By definition, $u \ll: \mathrm{Object}$ holds. If $t'$ is not Object, then $t \ll: t'$ holds if and only if $t' <: t$ or $t <: t'$ holds. In [I-InvSig] and [I-InvS2T], the information on args is also exploited, where args is an array of type Object[], only when it can be analyzed exactly element-wise by an intra-procedural analysis. In this case, suppose that args is an array of $n$ elements. Let $P_i$ be the set of types of the objects pointed to by its $i$-th element, args[$i$], together with their supertypes. Then $Ptp(args) = P_0 \times \cdots \times P_{n-1}$. Otherwise, $Ptp(args) = \emptyset$, implying that args is ignored as it cannot be exploited effectively during inference.

To maintain precision in [I-InvS2T], [I-GetS2T] and [I-SetS2T], we use a method (field) signature to infer its classes only when both its name and descriptor are known. In [I-InvS2T], the inference function returns the set of classes where the method with the specified signature is defined if both its method name and its parameter types are known, and the empty set otherwise. The return type of the matching method is ignored if it is unknown. In [I-GetS2T] and [I-SetS2T], the corresponding function returns the set of classes where the field with the given signature is defined if both its field name and its field type are known, and the empty set otherwise.
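The idea of inferring classes from a method signature can be sketched as follows. This is a hedged illustration rather than the rule [I-InvS2T] itself: the closed world of three handler classes, the helper names classesDefining and accepts, and the use of run-time reflection to enumerate methods are assumptions made for the example.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SignatureToClassSketch {
    // Hypothetical application classes.
    public static class Req {}
    public static class HandlerA { public void handle(Req r) {} }
    public static class HandlerB { public void handle(String s) {} }
    public static class Other { public void run(Req r) {} }

    // Closed world assumed for this sketch.
    static final List<Class<?>> WORLD =
            Arrays.asList(HandlerA.class, HandlerB.class, Other.class);

    // Classes in WORLD where a public method called `name` is defined (declared or
    // inherited) whose parameters can accept arguments of the given dynamic types.
    static List<String> classesDefining(String name, Class<?>... argTypes) {
        List<String> result = new ArrayList<>();
        for (Class<?> c : WORLD) {
            for (Method m : c.getMethods()) {
                if (m.getName().equals(name) && accepts(m.getParameterTypes(), argTypes)) {
                    result.add(c.getSimpleName());
                    break;
                }
            }
        }
        return result;
    }

    // True if every argument type is assignable to the corresponding parameter type.
    static boolean accepts(Class<?>[] params, Class<?>[] args) {
        if (params.length != args.length) {
            return false;
        }
        for (int i = 0; i < params.length; i++) {
            if (!params[i].isAssignableFrom(args[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(classesDefining("handle", Req.class)); // [HandlerA]
    }
}
```

With the name alone, both HandlerA and HandlerB would be candidates; adding the argument types narrows the result to one class, which is why both name and descriptor are required.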

Let us illustrate our rules by considering two examples in Figures 17 and 22.

Let us modify the reflective allocation site in line 3 (Figure 17) to c1.newInstance(), where c1 represents a known class, named A, so that $c^A \in pt(c1)$. By applying [L-KwTp] (introduced later in Figure 24) to the modified allocation site, Solar will create a new object $o_3^A$, which will flow to line 10, so that $o_3^A \in pt(v)$. Suppose both cName2 and mName2 point to some unknown strings. When [P-GetMtd] is applied to c.getMethod(mName,…) in line 7, a Method object, say, $m^u_u$ is created and eventually assigned to m in line 14. By applying [I-InvTp] to m.invoke(v, args) in line 15, where $o_3^A \in pt(v)$, Solar deduces that the target method is a member of class A. Thus, a new object $m^A_u$ is created and added to $pt(m)$. Given args = new Object[] {b, c}, $Ptp(args)$ is constructed as described earlier. By applying [I-InvSig] to this invoke() call, Solar will add all new Method objects $m^A_s$ to $pt(m)$ such that the parameter types of $s$ are in $Ptp(args)$, which represent the potential target methods called reflectively at this site.

In Figure 22, hd is statically unknown but the string argument of getMethod() is "handle", a string constant. By applying [P-ForName], [P-GetMtd] and [L-UkwTp] (Figure 24) to the forName(), getMethod() and newInstance() calls, respectively, we obtain a Class object $c^u$, a Method object $m^u_s$ and an abstract object $o_i^u$, where $s$ indicates a signature with a known method name (i.e., “handle”). Since the second argument of the invoke() call can also be exactly analyzed, Solar will be able to infer the classes where method "handle" is defined by applying [I-InvS2T]. Finally, Solar will add all inferred Method objects $m^t_s$ to the points-to set of the Method variable at the invoke() call site. Since neither the superscript nor the subscript of $m^t_s$ is $u$, the inference is finished and the inferred $m^t_s$ will be used to find out the reflective targets (represented by it) in Target Search (Section 5.4.4).

Figure 22: A simplified real code of Example 5.4.3 for illustrating inference rule [I-InvS2T]

5.4.4 Target Search

For a Method object $m^t_s$ in a known class $t$ (with $s$ being possibly $u$), we define $MTD(m^t_s)$ to find all the methods matched:

$$MTD(m^t_s) = \bigcup_{t <: t'} mtdLookup(t', s) \qquad (1)$$

where $mtdLookup$ is the standard lookup function for finding the methods according to a declaring class and a signature except that (1) the return type is also considered in the search (for better precision) and (2) any $u$ that appears in $s$ is treated as a wild card during the search.

Similarly, we define $FLD(f^t_s)$ for a Field object $f^t_s$:

$$FLD(f^t_s) = \bigcup_{t <: t'} fldLookup(t', s) \qquad (2)$$

to find all the fields matched, where $fldLookup$ plays a similar role as $mtdLookup$. Note that both $MTD$ and $FLD$ also need to consider the super types of $t$ (i.e., the union of the results for all $t'$ where $t <: t'$, as shown in the functions) to be conservative due to the existence of member inheritance in Java.
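A lookup with wild cards over a class and its supertypes can be sketched in Java as below. The sketch is an approximation under stated assumptions: null stands for the unknown marker, only the superclass chain is walked (interfaces and java.lang.Object are skipped to keep the output short), and the classes Base and Derived and the method search are hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TargetSearchSketch {
    // Hypothetical hierarchy.
    public static class Base {
        public String id() { return "base"; }
    }
    public static class Derived extends Base {
        public int size(int n) { return n; }
    }

    // Finds methods in t and its superclasses that match the given signature parts.
    // A null name, return type or parameter list is treated as a wild card.
    static List<String> search(Class<?> t, String name, Class<?> ret, Class<?>[] params) {
        List<String> found = new ArrayList<>();
        for (Class<?> c = t; c != null && c != Object.class; c = c.getSuperclass()) {
            for (Method m : c.getDeclaredMethods()) {
                if (m.isSynthetic()) {
                    continue; // ignore compiler- or tool-generated methods
                }
                boolean ok = (name == null || m.getName().equals(name))
                        && (ret == null || m.getReturnType() == ret)
                        && (params == null || Arrays.equals(m.getParameterTypes(), params));
                if (ok) {
                    found.add(c.getSimpleName() + "." + m.getName());
                }
            }
        }
        Collections.sort(found);
        return found;
    }

    public static void main(String[] args) {
        // Known name, unknown descriptor: the inherited method is found in Base.
        System.out.println(search(Derived.class, "id", null, null)); // [Base.id]
        // Fully unknown signature: every method of Derived and Base matches.
        System.out.println(search(Derived.class, null, null, null)); // [Base.id, Derived.size]
    }
}
```

The second call shows why a fully unknown signature is costly: every member of the class and its supertypes becomes a potential target.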

5.4.5 Transformation

Figure 23 gives the rules used for transforming a reflective call into a regular statement, which will be analyzed by the pointer analysis.

Figure 23: Rules for Transformation ([T-Inv], [T-Get] and [T-Set]).

Let us examine [T-Inv] in more detail. The second argument args points to a one-dimensional array of type Object[], with its elements collapsed to a single field during the pointer analysis, unless args can be analyzed exactly intra-procedurally in our current implementation. Let $arg_1, \ldots, arg_n$ be the freshly created arguments to be passed to each potential target method $m'$ found by Target Search. Let $p_1, \ldots, p_n$ be the parameters (excluding this) of $m'$, such that the declaring type of $p_k$ is $t''$. We include $o_i^{t'}$ in $pt(arg_k)$ only when $t' <: t''$ holds in order to filter out the objects that cannot be assigned to $p_k$. Finally, the reflective target method found can be analyzed by [A-Call] in Figure 20.

5.4.6 Lazy Heap Modeling

In Figure 24, we give the rules for lazily resolving a newInstance() call, as explained in Section 4.3.

Figure 24: Rules for Lazy Heap Modeling ([L-KwTp], [L-UkwTp], [L-Cast], [L-Inv] and [L-GSet]).

In [L-KwTp], for each Class object $c^t$ pointed to by c, an object, $o_i^t$, is created as an instance of this known type at allocation site $i$ straightaway. In [L-UkwTp], as illustrated in Figure 16, $o_i^u$ is created to enable LHM if c points to a $c^u$ instead. Then its lazy object creation happens at its Case (II) by applying [L-Cast] (with $o_i^u$ blocked from flowing from x to a) and its Case (III) by applying [L-Inv] and [L-GSet]. Note that in [L-Cast], A is assumed not to be Object.

5.5 Soundness Criteria

RefJava consists of four side-effect methods as described in Section 5.1. Solar is sound if their calls are resolved soundly under Assumptions 1–4. Due to Assumption 4 illustrated in Figure 16, there is no need to consider newInstance() since it is soundly resolved if invoke(), get() and set() are. For convenience, we define:

(3)

which means that the dynamic type of every object pointed to by is known.

Recall the nine rules given for resolving (A) m.invoke(y, args), (A) f.get(y) and f.set(y, x) in Figure 21. For the Method (Field) objects $m^t_{-}$ ($f^t_{-}$) with known classes $t$, these targets can be soundly resolved by Target Search, except that the signatures can be further refined by applying [I-InvSig], [I-GetSig] and [I-SetSig].

For the Method (Field) objects $m^u_{-}$ ($f^u_{-}$) with unknown class types $u$, the targets accessed are inferred by applying the remaining six rules in Figure 21. Let us consider a call to (A) m.invoke(y, args). Solar attempts to infer the missing classes of its Method objects in two ways, by applying [I-InvTp] and [I-InvS2T]. Such a call is soundly resolved if the following condition holds:

(4)

If the first disjunct holds, applying [I-InvTp] to invoke() can over-approximate its target methods from the types of all objects pointed to by y. Thus, every Method object $m^u_{-}$ pointed to by m is refined into a new one $m^t_{-}$ for every $o_i^t \in pt(y)$.

If the second disjunct holds, then [I-InvS2T] comes into play. Its targets are over-approximated based on the known method names and the types of the objects pointed to by args. Thus, every Method object $m^u_s$ pointed to by m is refined into a new one $m^t_s$, where the return type of $s$ is related to A by $\ll:$ and the parameter types of $s$ are in $Ptp(args)$. Note that the return type of $s$ is leveraged only when it is not $u$. The post-dominating cast (A) is considered not to exist if A = Object. In this case, $u \ll: \mathrm{Object}$ holds (only for $u$).

Finally, the soundness criteria for get() and set() are derived similarly:

(5)
(6)

In (5), applying [I-GetTp] ([I-GetS2T]) resolves a get() call soundly if its first (second) disjunct holds. In (6), applying [I-SetTp] ([I-SetS2T]) resolves a set() call soundly if its first (second) disjunct holds. By observing [T-Set], we see why is needed to reason about the soundness of [I-SetS2T].

5.6 Soundness Proof

We prove the soundness of Solar for RefJava subject to our soundness criteria (4) – (6) under Assumptions 1–4. We do so by taking advantage of the well-established soundness of Andersen’s pointer analysis (Figure 20) stated below.

Solar is sound for RefJava with its reflection API ignored.

If we know the class types of all targets accessed at a reflective call but possibly nothing about their signatures, Solar can over-approximate its target method/fields in Target Search. Hence, the following lemma holds.

Solar is sound for RefJava, the set of all RefJava programs in which cName is a string constant in every Class.forName(cName) call. [Sketch] By [P-ForName], the Class objects at all Class.forName(cName) calls are created from known class types. By Lemma 5.6, this has four implications. (1) LHM is not needed. For the rules in Figure 24, only [L-KwTp] is relevant, ena