Static Code Analysis of Multilanguage Software Systems

06/03/2019 ∙ by Anas Shatnawi, et al.

Identifying dependency call graphs of multilanguage software systems using static code analysis is challenging. The different languages used in developing today's systems often have different lexical, syntactical, and semantic rules that make thorough analysis difficult. Also, they offer different modularization and dependency mechanisms, both within and between components. Finally, they promote and/or require a variety of frameworks offering different sets of services, which introduce hidden dependencies that are invisible to current static code analysis approaches. In this paper, we identify five important challenges that static code analysis must overcome with multilanguage systems and we propose requirements to handle them. Then, we present solutions to these requirements for JEE applications, which combine server-side Java source code with a number of client-side Web dialects (e.g., JSP, JSF) while relying on frameworks (e.g., Web and EJB containers) that create hidden dependencies. Finally, we evaluate our implementations of the solutions by developing a set of tools that analyze JEE applications to build a dependency call graph and by applying these tools to two sample JEE applications. Our evaluation shows that our tools can solve the identified challenges and improve the recall in the identification of multilanguage dependencies compared to standard JEE static code analysis and, thus, indirectly that the proposed requirements are useful to build multilanguage static code analysis.

1. Introduction

Popular websites, such as Google, Facebook, and YouTube, and common software systems, such as Mozilla Firefox and Microsoft Word, are now built in heterogeneous multilanguage environments: they combine components developed in multiple programming languages, like C, C++, Java, JSP, PHP, SQL, and/or XML, with complex interactions (boughanmi2010multi, ).

The static code analysis of these multilanguage systems is essential to support software engineers in tasks such as program comprehension, optimization, reuse, and service identification. It extracts a dependency call graph that represents the dependencies between program elements. As we deal with multilanguage systems, we must compute such a static dependency call graph in a way that identifies all of the dependencies that exist between different program elements, despite the different languages, technologies, and runtime environments, which abstract away (and thus obfuscate) such dependencies.

However, identifying such a dependency call graph by statically analyzing the code of multilanguage systems is challenging because: (1) different languages follow different ways to describe their lexical, syntactical, and semantic rules, (2) different languages offer different modularization mechanisms and manage dependencies differently, both within and between components, (3) several indirect dependencies are described in textual configuration files with different syntaxes, whose interpretation depends on the frameworks used, (4) heterogeneous components can be distributed in different layers (e.g., view, service, and data layers), making it difficult to trace cross-component dependencies through various communication mechanisms, and (5) heterogeneous components rely on a variety of frameworks that manage their life cycles and offer different container services that hide dependencies.

Thus, existing traditional unilingual static code analysis approaches must be improved by considering other types of analyses, involving other kinds of software artifacts (e.g., configuration files) and codifying container services provided by frameworks. We illustrate in Figure 1 how the results of a classical static code analysis can be improved with the detection of hidden dependencies by analyzing multilanguage and container service dependencies to connect the different parts of the dependency call graph.

Figure 1. Improving classic static code analysis to include multilanguage and container service dependencies

In the literature, approaches have been proposed to support the static code analysis of multilanguage systems, such as (von1995program, ; kienle2010rigi, ; bruneliere2014modisco, ; moise2005extracting, ; kraft2008cross, ; german2009license, ). Yet, these approaches suffer from three main limitations. First, they can only analyze one programming language at a time, although they can deal with more than one programming language in isolation from one another, like (von1995program, ; bruneliere2014modisco, ; kienle2010rigi, ). Second, they do not identify hidden dependencies related to container services (bruneliere2014modisco, ; moise2005extracting, ; kraft2008cross, ; german2009license, ). Third, they cannot analyze source code files mixing multilanguage code (von1995program, ; kienle2010rigi, ; bruneliere2014modisco, ; moise2005extracting, ).

Moreover, none of these existing approaches described the challenges—and their requirements—of the static analysis of multilanguage software systems. Instead, each focused on providing solutions to a narrow, particular subset of the challenges 1–5.

In this paper, we introduce and discuss the five challenges of statically analyzing multilanguage software systems. We identified these challenges based on our literature review, our experience with analyzing JEE applications, and discussions and experience with industrial partners. We propose requirements to handle these challenges and allow the development of static code analyses for multilanguage software systems.

Then, we illustrate how the five challenges can be overcome in the particular case of JEE applications, which are multilanguage software systems that combine server-side Java code with a number of client-side Web dialects (JSP, JSF, etc.). In JEE applications, dependencies within and between languages are hidden in container services and various ad-hoc configuration files.

We develop an automated tool that represents program elements and their related dependencies within and between JEE components as one dependency call graph. To define this dependency call graph, we rely on the OMG’s Knowledge Discovery Meta-model (KDM) (perez2011knowledge, ), a language-independent meta-model. We also rely on the MoDisco Eclipse plug-in that offers an open-source implementation of KDM for Java code (bruneliere2014modisco, ). We propose a number of rules to transform multilanguage JEE source code to the KDM model and to codify dependencies related to the container services, string literals, and configuration files. We apply our tool on two JEE applications: Java PetStore and JSP Blog. We thus show that our tool can address the five challenges for JEE applications.

The rest of this paper is organized as follows. Section 2 discusses the five challenges of statically analyzing multilanguage software systems to build a dependency call graph and identifies their related requirements. Section 3 provides a motivating example. Section 4 presents our static analysis approach for JEE applications. Section 5 shows and discusses experimental results. Section 6 presents related work. Finally, Section 7 concludes with future work.

2. Challenges with Multilanguage Software Systems

We identify five challenges that make static code analyses difficult for multilanguage software systems. These challenges are: (1) diversity of meta-models, (2) mixing multilanguage code in one source code file, (3) various patterns to codify dependencies, (4) hidden dependencies related to frameworks and container services, and (5) string literals to describe dependencies. We analyze these challenges to derive requirements that a static analysis must meet.

2.1. Diversity of Meta-models

Different programming languages were designed following different meta-models that describe different lexical, syntactic, and semantic rules. For example, while C++, C#, and Java follow the object-oriented paradigm, i.e., their meta-models offer the concepts of class, method, and attribute, many other concepts are language specific, e.g., virtual methods, copy constructors, and interfaces. Further, their elements may communicate through method invocation, attribute access, inheritance, object instantiation, etc. As another example, ASP, HTML, and JSP follow XML meta-models in their source code files, i.e., their elements are based on XML tags that define their attributes and dependencies. Some other programming languages, such as Basic, C, and COBOL, are defined based on procedural meta-models.

There are situations where one single statement/character has contradicting semantics across different programming languages. For example, the static keyword has different meanings in C, C++, and Java. Unlike Java, C++ does not support static blocks. Therefore, the use of one common code analyzer is difficult.

Requirement 1: the analysis should rely on a standard representation that can represent all programming languages in a multilanguage software system, including program elements and their dependencies.

2.2. Mixing Multilanguage Code in Source Files

Some programming languages allow mixing multilanguage code inside a single source code file, e.g., developers are permitted to mix Java and JavaScript code with JSP and HTML tags in JSP technology. Similarly, it is possible to mix HTML, JavaScript, and PHP in a single PHP file. The different pieces of code can have internal dependencies among one another inside the same file and external dependencies with code in other files. A static analysis must identify both types of dependencies for completeness.

Requirement 2: the analysis must detect when multilanguage code is used in one file and parse each piece of code according to its language to identify program elements and their internal and external dependencies.
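
As a rough illustration of the detection step in Requirement 2, the following sketch flags which languages occur in the text of a single JSP file. The class name JspLanguageDetector and its regular expressions are assumptions made for this example, not part of the tooling described in this paper, and a real analysis would use proper parsers rather than patterns.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper (not the paper's tool): reports which languages
// occur in the text of one JSP source file.
class JspLanguageDetector {
    // <% ... %>, <%= ... %>, and <%! ... %> embed Java; directives (<%@)
    // and JSP comments (<%--) are excluded by the negative lookahead.
    private static final Pattern JAVA = Pattern.compile("<%(?![@-])[\\s\\S]*?%>");
    // Standard JSP actions such as <jsp:include> or <jsp:useBean>.
    private static final Pattern JSP_ACTION = Pattern.compile("<jsp:[A-Za-z.]+");
    // Inline or linked JavaScript.
    private static final Pattern SCRIPT =
            Pattern.compile("<script\\b", Pattern.CASE_INSENSITIVE);
    // Any un-prefixed markup tag is treated as HTML (a simplification).
    private static final Pattern HTML =
            Pattern.compile("<[A-Za-z][A-Za-z0-9]*(?=[\\s>/])");

    static Set<String> detect(String source) {
        Set<String> found = new LinkedHashSet<>();
        if (JAVA.matcher(source).find()) found.add("Java");
        if (JSP_ACTION.matcher(source).find()) found.add("JSP");
        if (SCRIPT.matcher(source).find()) found.add("JavaScript");
        if (HTML.matcher(source).find()) found.add("HTML");
        return found;
    }

    // A file is multilanguage when more than one language is detected.
    static boolean isMultilanguage(String source) {
        return detect(source).size() > 1;
    }
}
```

Each detected fragment would then be handed to a parser for its own language. The pattern-based detection can misfire, for instance on a comparison such as a<b inside Java code, which is why it is only a sketch.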

2.3. Patterns of Dependencies

There is a variety of patterns that reflect dependency codification in multilanguage software systems due to:

  1. Different communication protocols: heterogeneous components rely on communication patterns to encode their dependencies. Some components use RMI while some others use HTTP requests, etc. Such communication patterns must be captured to codify related dependencies.

  2. Several versions: programming languages evolve over time. Thus, several mechanisms can be used to codify the same dependencies in any given language. We must identify these mechanisms to capture their dependencies.

  3. Variance in development and design patterns: developers follow different patterns while developing their systems. For example, in the event-driven pattern, many dependencies are based on runtime events like user actions, sensor values, etc. Other patterns codify dependencies differently. [Footnote 1: There are two main challenges behind having different codification patterns. First, developers are constrained to use specific patterns following the application context (e.g., mobile, desktop, Web, etc.). Second, developers have varied background, knowledge, and experience. Therefore, the analysis must understand how these patterns work to identify related dependencies.]

    Although different patterns must conform to the main guidelines and mechanisms provided by the languages, some languages leave space for developers to create specific patterns of dependencies that cannot be recovered by general static code analysis. Such cases can be found frequently in JEE custom tag libraries, where developers have flexibility in implementing tag handler classes. [Footnote 2: https://docs.oracle.com/cd/E17802_01/blueprints/blueprints/code/jps11/src/com/sun/j2ee/blueprints/petstore/taglib/list/PrevFormTag.java.html]

Requirement 3: the analysis must know the communication patterns available in each programming language of interest. It must also consider the versions of programming languages.

2.4. Hidden Dependencies with Frameworks and Containers

Various programming languages rely on different frameworks to deploy their components. These frameworks provide callback methods that are invoked by the frameworks at specific application execution times/events. Thus, we must understand these frameworks to identify which callback methods are called at certain times/for certain events, e.g., when the user presses a button. Also, we must explicitly codify/represent the dependencies that are managed by the frameworks. Figure 2 illustrates the idea of hidden dependencies being invisible in user code. It shows that hidden dependencies are caused by:

  • Calls related to the application life-cycle management. For example, when we create a Servlet, the server container automatically calls the init() method, then the service() one, which can in turn call other methods thus creating hidden dependencies.

  • Configuration files describing application/framework functionalities. These files are located in different directories, which requires visiting these directories to parse each file and identify dependencies.

  • Callback methods offered by frameworks. These methods are used by developers to access underlying services and create hidden dependencies.

Requirement 4: we must analyze the different frameworks used to deploy different components so that the analysis can detect related hidden dependencies. We must understand the life cycle of each component, how they are configured in configuration files, and what callback methods they use to communicate with other components/frameworks.
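
To make the idea of life-cycle dependencies concrete, the sketch below emits the edges that are invisible in user code for one Servlet class. The class ServletLifecycleEdges, the "container" node, and the textual edge format are assumptions for illustration. init() and service() are the callbacks named above, destroy() is added because it is also part of the Servlet life cycle, and the dispatch from service() to doGet()/doPost() reflects the standard HttpServlet behavior.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Set;

// Hypothetical helper: lists container-induced edges for one Servlet class.
class ServletLifecycleEdges {
    // Life-cycle callbacks invoked by the Web container itself.
    private static final List<String> LIFECYCLE = List.of("init", "service", "destroy");
    // Methods that HttpServlet.service() dispatches to when they are overridden.
    private static final Set<String> HTTP_HANDLERS =
            Set.of("doGet", "doPost", "doPut", "doDelete");

    static List<String> hiddenEdges(String servletClass, Collection<String> declaredMethods) {
        List<String> edges = new ArrayList<>();
        for (String callback : LIFECYCLE) {
            edges.add("container -> " + servletClass + "." + callback);
        }
        for (String method : declaredMethods) {
            if (HTTP_HANDLERS.contains(method)) {
                edges.add(servletClass + ".service -> " + servletClass + "." + method);
            }
        }
        return edges;
    }
}
```

For a hypothetical CartServlet that declares doPost and a private helper, this yields three container edges plus one edge from service to doPost; the helper is ignored because the container never calls it directly.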

Figure 2. Example of hidden dependencies inherent in frameworks

2.5. String Literals Encoding Dependencies

String literals are frequently used to realize dependencies and to encapsulate parameter values, e.g., dependencies described in MANIFEST.MF files. The analysis of such literals is difficult because the meaning of a literal depends on the application, frameworks, and programming languages. It may require data-flow analysis to extract the value of the literals.

Requirement 5: the analysis must identify where/when string literals are used to encode dependencies and know how to parse each string literal following the application context.
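
As one small example of a dependency carried by a string, the following sketch reads the Class-Path header of a MANIFEST.MF file with the standard java.util.jar.Manifest class and splits it into entries. The class name ManifestDependencies and the choice of the Class-Path header are assumptions for illustration; other headers and other file types would need their own interpretation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.Manifest;

// Hypothetical helper: extracts the archives named in a manifest's Class-Path.
class ManifestDependencies {
    static List<String> classPathEntries(String manifestText) {
        Manifest manifest;
        try {
            manifest = new Manifest(new ByteArrayInputStream(
                    manifestText.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The header value is one string; entries are separated by whitespace.
        String value = manifest.getMainAttributes().getValue("Class-Path");
        List<String> entries = new ArrayList<>();
        if (value != null) {
            for (String entry : value.trim().split("\\s+")) {
                if (!entry.isEmpty()) entries.add(entry);
            }
        }
        return entries;
    }
}
```

The same literal "lib/a.jar lib/b.jar" would mean something different in another context, which is the point of Requirement 5: the parser has to be chosen according to where the string appears.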

3. Motivating Example

To better illustrate these challenges and their solutions in JEE applications, we selected the custom tag libraries from Java PetStore 1.1.2 as a motivating example (PetStore, ). Developers of Java PetStore implemented their own custom tags that are reused in several JSP pages. Figure 3.a shows an instance of reusing the prevForm custom tag in a JSP page. This tag is configured using a Tag Library Descriptor (TLD) shown in Figure 3.b. A TLD is an XML file that defines tags and their related attributes and also maps the tags to their implementing tag handlers, written as Java classes (Figure 3.c). [Footnote 3: A Java implementation of PrevFormTag is available at https://docs.oracle.com/cd/E17802_01/blueprints/blueprints/code/jps11/src/com/sun/j2ee/blueprints/petstore/taglib/list/PrevFormTag.java.html.]

Figure 3. Motivating example of a JSP tag library

4. Multilanguage Static Code Analysis for JEE Applications

4.1. Approach Overview

The static analysis of JEE applications to build dependency call graphs faces all the previously mentioned challenges. Figure 3 shows instances of these challenges (different meta-models, multilanguage code, patterns of dependencies, hidden dependencies, and string literals). Therefore, our static analysis for JEE applications must fulfill the previous requirements.

We want to identify one dependency call graph that represents all program elements and their related dependencies in a JEE application. We must build this dependency call graph based on a language-independent meta-model that enables the representation of different meta-models (Java, JSP…) and abstracts differences among them. Also, we must represent all dependencies that exist among program elements, regardless of their technologies (e.g., EJBs, JavaBeans, JSPs…) or application layers (e.g., Web, business).

In the literature, we identified several meta-models, such as KDM (perez2011knowledge, ), FAMIX (famix, ), Abstract Syntax Tree (AST) and Pattern and Abstract-level Description Language (PADL) (gueheneuc2005ptidej, ). Among these meta-models, we selected the OMG KDM (Knowledge Discovery Meta-model) (perez2011knowledge, ) to represent our dependency call graph due to:

  1. Possibility of evolution: there are open-source specifications of KDM offered by the OMG that can be extended using a light-weight extension mechanism, in case we need to evolve the meta-model.

  2. Supported by open-source software tools: KDM is supported by the MoDisco tool (bruneliere2014modisco, ), which offers an open-source implementation.

  3. Ability to represent multilanguage software components: KDM allows representing multilanguage software components and their related dependencies by customizing the KDMEntity and KDMRelationship interfaces.

  4. Built on the container concept: this container concept enables the composition of an element based on other ones, e.g., a KDMEntity instance of a class can compose a set of KDMEntity instances corresponding to methods of this class. This supports the representation of program elements and dependencies at different levels of abstraction between (i.e., cross-language dependencies) and within (i.e., dependencies inside components) multilanguage JEE components.

  5. Ability to represent non-executable artifacts: KDM also allows representing different configuration files that are frequently used in JEE applications by extending the KDMEntity and KDMRelationship interfaces.
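
The container concept and the entity/relationship pair can be pictured with a deliberately small model. The Entity and Relationship classes below are hypothetical stand-ins written for this illustration; they are not the KDMEntity and KDMRelationship interfaces of KDM or MoDisco, whose actual APIs are richer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a language-independent program element.
class Entity {
    final String kind;   // e.g. "Class", "Method", "JspPage", "ConfigFile"
    final String name;
    final List<Entity> children = new ArrayList<>();
    final List<Relationship> outgoing = new ArrayList<>();

    Entity(String kind, String name) {
        this.kind = kind;
        this.name = name;
    }

    // Container concept: an entity is composed of nested entities.
    Entity add(Entity child) {
        children.add(child);
        return child;
    }

    Relationship dependOn(Entity target, String relationKind) {
        Relationship relationship = new Relationship(relationKind, this, target);
        outgoing.add(relationship);
        return relationship;
    }

    // Dependencies of nested elements are lifted to their container,
    // which gives the same graph at a coarser level of abstraction.
    List<Relationship> allOutgoing() {
        List<Relationship> all = new ArrayList<>(outgoing);
        for (Entity child : children) {
            all.addAll(child.allOutgoing());
        }
        return all;
    }
}

// Hypothetical stand-in for a typed dependency between two entities.
class Relationship {
    final String kind;
    final Entity from;
    final Entity to;

    Relationship(String kind, Entity from, Entity to) {
        this.kind = kind;
        this.from = from;
        this.to = to;
    }
}
```

Because the kind field is just a label, a JSP page, a Java method, and a configuration file can all live in one graph, which is the property the list above attributes to KDM.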

To build an instance of our KDM meta-model for a given JEE application, we developed a set of tools that: (1) transform JEE components into instances of KDM entities (and their relationships), (2) describe multilanguage code in one file in the KDM model, (3) identify hidden dependencies of container services, and (4) parse string literals to identify related dependencies. We now describe in more detail the technical implementation of these tools.

4.2. Automatic Transformation of JEE Components to KDM Representations

We developed a tool that automatically transforms the source code of a JEE application into a KDM model. First, our tool uses MoDisco to construct a KDM model of the Java program artifacts of the JEE application. Such a model contains only object-oriented elements and their related dependencies. [Footnote 4: MoDisco only supports the construction of KDM models from object-oriented programming languages, like C++ and Java, not HTML or JSP.] We extended this model to include program elements and dependencies of JSP and JSF pages and configuration files.

MoDisco supports the transformation of normal Java code to KDM models. Therefore, we decided to translate JSP pages to equivalent Java Servlets that are implemented by Java code because:

  1. Servlets represent the underlying Java implementation of JSPs. Thus, they support the life cycle management of JSP pages.

  2. An open-source tool is available to translate JSP pages into Servlets.

  3. An open-source tool is available to build KDM models of Servlets.

We based the translation of JSP pages to Servlets on the Jasper tool provided by Apache Tomcat (apache-tomcat, ). Figure 4 presents the process used by Jasper to translate JSP pages into Servlets.

Figure 4. Process of converting JSP pages into Servlets by Jasper

Jasper follows a set of rules to translate each JSP page to a Servlet:

  • JSP scriptlet tags (e.g., <% code fragment %>) that are used to insert Java code inside the JSP pages are represented in the Servlet class with the same code fragment. For example, the scriptlet <% for (int i=0; i<10; i++) %> is translated into for (int i=0; i<10; i++).

  • JSP declaration tags (e.g., <%! declaration; [ declaration; ]+ … %>) that are used for variable declarations are converted into equivalent variable declarations. For example, <%! int i=0; %> is translated to int i=0;.

  • References to JavaBeans are converted by creating an instance of the corresponding classes. The references to setter/getter methods are realized through normal Java method invocations. For example, <jsp:useBean id="myBeans" class= "package.BeansClass" scope="session"> is transformed into an object instantiation of package.BeansClass and <jsp:getProperty name="myBeans" property="firstName"> </jsp:getProperty> into a method invocation to the getFirstName() method of this instance.

  • Uses of custom tag handlers are realized by a set of invocations to the life-cycle management methods of the tag handlers. Such methods are doStartTag(), doEndTag(), setAttribute().

  • Other JSP tags and HTML/XML tags are written as string literals in the output of the response objects. For example, <jsp:include page="/myPage.jsp." flush="true" /> is converted as out.write("<jsp:include page=\"/myPage.jsp.\" flush=\"true\" />");.
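
The first and last rules above can be imitated with a toy translator that turns template text into out.write(...) calls, expressions into out.print(...) calls, and copies scriptlets verbatim. MiniJspTranslator is a hypothetical class written for this illustration; it is not Jasper and ignores declarations, directives, comments, JavaBeans, and custom tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy translator: JSP text -> body of a service() method.
class MiniJspTranslator {
    // Group 1 is the optional marker after "<%", group 2 the enclosed text.
    private static final Pattern TAG = Pattern.compile("<%([=!@-]?)([\\s\\S]*?)%>");

    static String translate(String jsp) {
        StringBuilder body = new StringBuilder();
        Matcher m = TAG.matcher(jsp);
        int last = 0;
        while (m.find()) {
            emitText(body, jsp.substring(last, m.start()));
            String marker = m.group(1);
            String code = m.group(2).trim();
            if (marker.isEmpty()) {
                body.append(code).append('\n');                        // scriptlet: copied as is
            } else if (marker.equals("=")) {
                body.append("out.print(").append(code).append(");\n"); // expression
            }
            // Declarations, directives, and comments are skipped in this sketch.
            last = m.end();
        }
        emitText(body, jsp.substring(last));
        return body.toString();
    }

    // Template text becomes an escaped Java string literal.
    private static void emitText(StringBuilder body, String text) {
        if (text.isEmpty()) return;
        String escaped = text.replace("\\", "\\\\")
                             .replace("\"", "\\\"")
                             .replace("\n", "\\n");
        body.append("out.write(\"").append(escaped).append("\");\n");
    }
}
```

Once the page has this form, the Java fragments are ordinary code that a Java analyzer can process, while the markup survives only inside string literals, which is why some tags need separate treatment.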

As a result of these transformation rules, several, but not all, program elements and their related dependencies of JSP pages are converted to Java code. Based on this Java code, we identified an initial KDM model that includes program elements and dependencies. However, there are some JSP tags that are not translated to Java code. We identified these tags through runtime tests. Table 1 summarizes the JSP tags and attributes whose dependencies we need to codify. To express dependencies of these JSP tags in the initial KDM model, we developed a set of tools that: (1) parse JSP pages to identify usage scenarios of these JSP tags, (2) parse these scenarios to identify the URLs of target server pages, (3) parse configuration files to map URLs to server pages, and (4) update our KDM model with these dependencies.

Tags                       Attributes
<form>                     action, method
<jsp:include>              page
<%@ include>               file
<jsp:directive.include>    file
<jsp:forward>              page
<%@ page %>                errorPage
<jsp:directive.page>       errorPage
<a>                        href
<c:redirect>               url
<c:url>                    value
Table 1. JSP tags and their attributes related to dependencies
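
Steps (1) and (2) of the process above can be approximated by a pattern-based scan driven by Table 1. The class JspTagDependencyScanner and its rule table are assumptions for illustration; the sketch only reads the attribute that names a target (so action but not method for <form>), expects double-quoted values, and does not resolve the extracted URLs to server pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner: extracts target URLs from the tags listed in Table 1.
class JspTagDependencyScanner {
    // {tag opener as a regex, attribute that holds the target}
    private static final String[][] RULES = {
        {"<form", "action"},
        {"<jsp:include", "page"},
        {"<%@\\s*include", "file"},
        {"<jsp:directive\\.include", "file"},
        {"<jsp:forward", "page"},
        {"<%@\\s*page", "errorPage"},
        {"<jsp:directive\\.page", "errorPage"},
        {"<a", "href"},
        {"<c:redirect", "url"},
        {"<c:url", "value"}
    };

    static List<String> targets(String jsp) {
        List<String> found = new ArrayList<>();
        for (String[] rule : RULES) {
            // The attribute must occur inside the same tag, i.e. before the next '>'.
            Pattern p = Pattern.compile(
                    rule[0] + "\\b[^>]*?\\b" + rule[1] + "\\s*=\\s*\"([^\"]*)\"",
                    Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(jsp);
            while (m.find()) {
                found.add(m.group(1));
            }
        }
        return found;
    }
}
```

Results are grouped by rule rather than by position in the file. Mapping each extracted URL to a Servlet or JSP page, step (3), still requires the configuration files.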

4.3. Describing Multilanguage Code in One File in KDM

In JEE, multilanguage code in one file can be present in both Java classes (i.e., Servlets and tag handlers) and JSP pages.

4.3.1. Multilanguage Code in Java Classes

For Servlets and tag handlers, developers combine server-side Java code with client-side code (i.e., <form action="…"> and <a href="…">) in the output stream object offered by the Web container. Figure 3.c presents an example of an output stream object in a tag handler.

By studying Oracle’s JEE specification and running runtime tests, we identified that two tags are used by the Web container to describe dependencies to other server pages in the output stream: <form action="…"> and <a href="…">. [Footnote 5: We attached JSP tags in the output stream of a Servlet to test whether the Web container executes them at runtime and creates dependencies.] Therefore, we developed a tool that identifies such tags in Servlet and tag handler implementations. It extracts URLs related to these tags based on Table 1 to identify the dependent server pages.

For example, in Figure 3.c, our tool creates a dependency between the JSP page that uses this tag handler (Figure 3.a) and the JSP page called cart by analyzing the HTML form and the value of its action attribute.
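
A minimal sketch of this idea, assuming the markup is written as plain string literals: collect the literals of a Java source file, undo the quote escaping, and look for form actions and link targets in the resulting text. OutputStreamTagScanner is a hypothetical name, and the sketch does not follow values that are built from variables, which would need data-flow analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scanner: finds <form action> and <a href> targets that a
// Servlet or tag handler writes to its output stream as string literals.
class OutputStreamTagScanner {
    // A Java string literal, allowing backslash escapes inside it.
    private static final Pattern LITERAL =
            Pattern.compile("\"((?:\\\\.|[^\"\\\\])*)\"");
    // The two tags observed to create dependencies, with quoted values.
    private static final Pattern TARGET = Pattern.compile(
            "<(?:form\\b[^>]*?\\baction|a\\b[^>]*?\\bhref)\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    static List<String> targets(String javaSource) {
        // Concatenate all literals so that tags split across calls are rejoined.
        StringBuilder markup = new StringBuilder();
        Matcher literal = LITERAL.matcher(javaSource);
        while (literal.find()) {
            markup.append(literal.group(1).replace("\\\"", "\""));
        }
        List<String> found = new ArrayList<>();
        Matcher target = TARGET.matcher(markup);
        while (target.find()) {
            found.add(target.group(1));
        }
        return found;
    }
}
```

Applied to a handler that prints a form whose action is "cart", the scanner returns that value, which can then be resolved to the cart page as in the example above.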

4.3.2. Multilanguage Code in JSP Pages

Multilanguage code can appear in JSP pages with a mix of Java code, JSP and HTML tags. We addressed such code in Section 4.2 when using Jasper to transform JSP pages to Servlets. Jasper detects JSP tags (i.e., JSP declaration and JSP scriptlet tags) having Java code and represents their contents in the right contexts in the translated JSP pages. Also, it builds the dependencies with other JSP pages and parts thereof. For example, Jasper translates the JSP code in Listing 1 to the Servlet in Listing 2. Dependencies related to Java code are identified in the KDM using MoDisco.

...
<TABLE BORDER="2" ALIGN="center">
  <TH>Exponent</TH>
  <TH>2^Exponent</TH>
  <% for (int i=0; i<10; i++) {%>
  <TR>
    <TD><%= i%></TD>
    <TD><%= Math.pow(2, i)%></TD>
  </TR>
  <% } //end for loop %>
</TABLE>
...
Listing 1: An example JSP code

...
public class PowersOf2 extends HttpServlet
{
    public void service(HttpServletRequest ...){
        ...
        out.print("<TABLE BORDER='2' ALIGN='center'>");
        out.print("<TH>Exponent</TH><TH>2^Exponent</TH>");
        for (int i=0; i<10; i++){
            out.print("<TR><TD>" + i + "</TD>");
            out.print("<TD>" + Math.pow(2, i) + "</TD>");
            out.print("</TR>");
        } //end for loop
        out.print("</TABLE>");
        ...
    }
}
Listing 2: The resulting Java Servlet based on the JSP code

4.4. Identifying Hidden Dependencies of Container Services in KDM

Figure 5. Life cycle management of JSP tag handler

JEE relies mainly on two containers to offer its services: the Web container and the EJB container. The Web container hosts JSFs, JSPs, Servlets, and tag handlers. The EJB container hosts Enterprise Beans. We studied these two containers and identified a set of patterns that they use to offer their services (e.g., life cycle management, RMI, callback methods, etc.). We built state diagram models based on the life cycles of JEE components. Figure 5 shows an instance of these state diagrams related to EJBs and JSP tag handlers.

We observed that some dependencies are configurable, like security, transaction, and persistence. Their configuration parameters are specified in XML configuration files (e.g., web.xml, ejb.xml, tag library descriptor.tld) and code annotations (web annotations, EJB annotations). We studied patterns/idioms that JEE containers use to parse these configuration files and developed a specific parser for each configuration file to extract dependencies.

For our example in Figure 3.b, our parser identifies that prevForm has a mandatory attribute called action handled by the PrevFormTag class located in the com.sun.j2ee.blueprints.petstore.taglib.list package.
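
A descriptor parser of this kind can be sketched with the JDK's DOM API. TldParser is a hypothetical class, not the parser used in this work. It accepts both <tagclass>, the element name used by JSP 1.1 descriptors, and the later <tag-class>, and it assumes that a tag's own <name> element precedes the <name> elements of its attributes.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical parser: maps each custom tag name to its tag handler class.
class TldParser {
    static Map<String, String> tagHandlers(String tldXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(tldXml)));
            Map<String, String> handlers = new LinkedHashMap<>();
            NodeList tags = doc.getElementsByTagName("tag");
            for (int i = 0; i < tags.getLength(); i++) {
                Element tag = (Element) tags.item(i);
                String name = firstText(tag, "name");
                String handler = firstText(tag, "tagclass");      // JSP 1.1 spelling
                if (handler == null) {
                    handler = firstText(tag, "tag-class");        // later spelling
                }
                if (name != null && handler != null) {
                    handlers.put(name, handler);
                }
            }
            return handlers;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse TLD", e);
        }
    }

    // Text of the first descendant element with the given name, or null.
    private static String firstText(Element parent, String childName) {
        NodeList nodes = parent.getElementsByTagName(childName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

For a descriptor that declares prevForm with the handler class com.sun.j2ee.blueprints.petstore.taglib.list.PrevFormTag, the returned map contains exactly that pair. Real descriptors often carry a DOCTYPE pointing at an external DTD, which a production parser would have to resolve locally or ignore.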

Figure 6 summarizes the container service dependencies that our tool identifies when a JSP page uses prevForm. These dependencies are due to callback methods: setAction("cart") as setter method for the action attribute and doStartTag() and doEndTag() for life cycle management of the tag handler. In our KDM model, we created dependencies from the KDM entity of this JSP page to the KDM entities of the corresponding methods of PrevFormTag. We proceeded similarly for other JEE components: Servlets, JSP pages, and EJBs.
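
The derivation of these callback dependencies is mechanical enough to sketch. TagCallbackEdges is a hypothetical helper that, for one use of a custom tag, names the setter for each attribute present (following the JavaBeans naming convention) and adds the two life-cycle methods mentioned above; the textual edge format is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: container-induced calls for one use of a custom tag.
class TagCallbackEdges {
    static List<String> edges(String jspPage, String handlerClass, List<String> attributes) {
        List<String> edges = new ArrayList<>();
        for (String attribute : attributes) {
            // JavaBeans convention: attribute "action" -> setter "setAction".
            String setter = "set"
                    + Character.toUpperCase(attribute.charAt(0))
                    + attribute.substring(1);
            edges.add(jspPage + " -> " + handlerClass + "." + setter);
        }
        // Life-cycle callbacks of the tag handler.
        edges.add(jspPage + " -> " + handlerClass + ".doStartTag");
        edges.add(jspPage + " -> " + handlerClass + ".doEndTag");
        return edges;
    }
}
```

For a page that uses prevForm with the action attribute, this produces edges to setAction, doStartTag, and doEndTag of PrevFormTag, matching the three callbacks listed above. Attribute names are assumed to be non-empty.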

Figure 6. Container service dependencies of our motivating example in Figure 3

4.5. Parsing String Literals to Identify Dependencies

String literals are used for multiple purposes. Figure 7 shows string literals used to forward requests between server pages and to communicate with Beans components, e.g., access attributes (the value entered by the end-user is stored in myBeans.myAttribute) and method invocations (the value returned from myBeans.myMethod() determines the relative-URL). We identified dependencies in string literals as follows:

  1. We analyzed the JEE specification and observed the situations where string literals create dependencies.

  2. We built a list of tags and their related attributes in which string literals codify dependencies.

  3. We developed a string parser that analyzes the values corresponding to the tags and attributes in our list. This parser is based on Expression Language because it is the official language used by the JSP engine to construct such string literals.
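
The third step can be illustrated with a very small extractor for Expression Language references of the form ${bean.member} or #{bean.member}. ElReferenceExtractor is a hypothetical class, far simpler than a real EL parser: it reports only the first bean.member pair of each expression and recognizes only empty-argument method calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor: finds bean references inside attribute values.
class ElReferenceExtractor {
    // ${bean.member} or #{bean.member}, optionally followed by "()".
    private static final Pattern EL = Pattern.compile(
            "[$#]\\{\\s*([A-Za-z_]\\w*)\\.([A-Za-z_]\\w*)\\s*(\\(\\))?");

    static List<String> references(String attributeValue) {
        List<String> found = new ArrayList<>();
        Matcher m = EL.matcher(attributeValue);
        while (m.find()) {
            String callSuffix = m.group(3) == null ? "" : "()";
            found.add(m.group(1) + "." + m.group(2) + callSuffix);
        }
        return found;
    }
}
```

An attribute value such as "${myBeans.myMethod()}" yields a reference to myBeans.myMethod(), while a plain value such as "cart.jsp" yields none and is treated as a literal URL.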

Figure 7. String literals in server pages

5. Evaluation

We now evaluate our static code analysis approach for JEE applications. We developed DeJEE (Dependencies in JEE), a fully automated tool to identify a dependency call graph of a given JEE application based on the KDM meta-model. We tested DeJEE on two case studies: Java PetStore (PetStore, ) and JSP Blog (JSPBlog, ).

We selected Java PetStore because it represents the official demonstration example provided by Sun Microsystems on how to develop flexible, scalable, cross-platform JEE applications. Also, the availability of its source code and documentation makes it ideal. Finally, it is implemented using many JEE technologies (e.g., JSPs, Servlets, EJBs, etc.). This implementation is composed of 88 JSP, 233 Java classes, and 8 HTML files.

JSP Blog is implemented using 10 JSP files and 1 HTML file. We selected JSP Blog because its JSP pages make intensive use of multilanguage code (combinations of JSP tags and Java code) and do not depend on Java classes, which allows us to expose the limitations of existing tools, i.e., MoDisco. Thus, it is a good option for evaluating how DeJEE works with multilanguage files but without Java classes.

5.1. Objectives and Methodology

We want to achieve the following three objectives:

  1. Objective: Demonstrate the capability of DeJEE to meet the five requirements of static code analysis in the case of JEE applications.

    Method: we extract one common KDM model for each case study. This addresses the first challenge, the diversity of meta-models. Then, we highlight the identified dependencies related to the other four challenges: multilanguage code in one file, patterns of dependencies, hidden dependencies on container services, and dependencies encoded in string literals.

  2. Objective: Measure the improvement in recall (if any) that our approach adds in comparison to MoDisco in extracting multilanguage dependencies.

    Method: we extract KDM models using both MoDisco and DeJEE and then we compare the KDM models to report their differences in terms of numbers of KDM entities and relationships, which describe dependencies among JEE program elements. We selected MoDisco because it is the only tool that is remotely comparable to DeJEE in terms of their KDM models.

  3. Objective: Measure the correctness of dependencies identified by DeJEE.

    Method: we manually constructed a ground-truth call graph of JSP Blog to measure the precision and recall of dependencies identified by DeJEE. We identified the list of JSP pages invoked by each JSP page.

    As it is hard to manually construct a ground-truth call graph of Java PetStore due to its large number of source code files (88 JSP, 233 Java classes, and 8 HTML files), we only manually evaluated the precision of identified dependencies in Java PetStore. For each identified KDM relationship instance, we studied the implementation files of the two related KDM entities to validate whether these implementation files do have a dependency or not.
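
For reference, precision and recall over sets of dependencies can be computed as in the following sketch. DependencyMetrics is a hypothetical helper, and encoding each dependency as a "source->target" string is an assumption for illustration; only the standard definitions of the two measures are used.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: standard precision and recall over dependency sets.
class DependencyMetrics {
    // Fraction of identified dependencies that are in the ground truth.
    static double precision(Set<String> found, Set<String> truth) {
        if (found.isEmpty()) return 0.0;
        return (double) common(found, truth).size() / found.size();
    }

    // Fraction of ground-truth dependencies that were identified.
    static double recall(Set<String> found, Set<String> truth) {
        if (truth.isEmpty()) return 0.0;
        return (double) common(found, truth).size() / truth.size();
    }

    private static Set<String> common(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }
}
```

Recall needs the complete ground truth, whereas precision only needs each identified dependency to be checked, which is why only precision is evaluated manually for the larger application.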

5.2. Results

5.2.1. Capability of DeJEE to Identify Dependencies

Table 2 shows the results of applying DeJEE on Java PetStore and JSP Blog.

Multilanguage Code in One File

The results show the ability of DeJEE to automatically detect multilanguage files. DeJEE identified 40 and 6 files respectively in Java PetStore and JSP Blog. We manually checked these files and found that 100% of them contained multilanguage code. By analyzing these files, we found that DeJEE identifies dependencies related to multilanguage code of Servlets, tag handlers and JSP pages in the KDM model. For JSP Blog, DeJEE discovers 41 and 20 program dependencies by analyzing Java code and JSP tags in the multilanguage files, respectively.

Pattern of Dependencies

Different communication patterns are used to create dependencies among different JEE components (JSPs, JavaBeans, EJBs, Managed Beans, etc.). The HTTP Request Communication Pattern is used by JEE components to invoke Servlets and JSP pages. JSPs and JSFs use special tags and an expression language to connect to JavaBeans and Managed Beans components. EJBs are invoked through a JNDI lookup (Java Naming and Directory Interface), e.g., context.lookup("java:comp/env/ejb/HelloBean");. JSPs use this interface as Java code embedded using the scriptlet tag. DeJEE can recover these dependencies using these patterns.

Container Service Dependencies

DeJEE identified container service dependencies related to custom tag libraries, JSPs, and Servlets by parsing the related configuration files and annotations. To map relative URLs to their corresponding Servlets or JSP pages, we parse web.xml and web annotations. Custom tags are mapped to their tag handlers based on taglib.tld.
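
The web.xml part of this mapping can be sketched with the standard JAXP DOM API. The class WebXmlMappings is a hypothetical illustration, not DeJEE's code; it assumes a descriptor without a DOCTYPE that would require fetching an external DTD, and it ignores annotation-based mappings.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper (not DeJEE's code): maps URL patterns to Servlet classes or JSP files. */
class WebXmlMappings {
    static Map<String, String> urlPatternToTarget(String webXml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(webXml)));
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse deployment descriptor", e);
        }

        // Step 1: servlet-name -> servlet-class (or jsp-file when no class is declared).
        Map<String, String> targetByName = new HashMap<>();
        NodeList servlets = doc.getElementsByTagName("servlet");
        for (int i = 0; i < servlets.getLength(); i++) {
            Element servlet = (Element) servlets.item(i);
            String target = text(servlet, "servlet-class");
            if (target == null) {
                target = text(servlet, "jsp-file");
            }
            targetByName.put(text(servlet, "servlet-name"), target);
        }

        // Step 2: url-pattern -> target, joined through servlet-name.
        Map<String, String> targetByUrl = new LinkedHashMap<>();
        NodeList mappings = doc.getElementsByTagName("servlet-mapping");
        for (int i = 0; i < mappings.getLength(); i++) {
            Element mapping = (Element) mappings.item(i);
            String target = targetByName.get(text(mapping, "servlet-name"));
            NodeList patterns = mapping.getElementsByTagName("url-pattern");
            for (int j = 0; j < patterns.getLength(); j++) {
                targetByUrl.put(patterns.item(j).getTextContent().trim(), target);
            }
        }
        return targetByUrl;
    }

    // Text of the first descendant element with the given tag name, or null if absent.
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }
}
```

With such a table, a URL found in a JSP page or in Java code can be resolved to the program element that the container will actually invoke.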

Dependencies in String Literals

DeJEE discovered string literals and parsed them to identify embedded, hidden dependencies. On average, each JSP page contains 17.51 (1,541/88 pages) and 18.54 (204/11 pages) string literals, of which 46.7% (720/1,541) and 31.8% (65/204) are related to dependencies, in PetStore and JSP Blog respectively.
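
A minimal sketch of this idea is given below. The class StringLiteralDependencies and, in particular, the list of file extensions used to decide whether a literal looks like a dependency are assumptions made for illustration; they are not the rules used by DeJEE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not DeJEE's code): extracts string literals and keeps resource-like ones. */
class StringLiteralDependencies {
    // A double-quoted literal, allowing backslash-escaped characters inside.
    private static final Pattern LITERAL =
            Pattern.compile("\"((?:[^\"\\\\]|\\\\.)*)\"");

    // Assumed heuristic: a path ending in a Web resource extension, with an optional query string.
    private static final Pattern RESOURCE =
            Pattern.compile("[\\w./-]+\\.(?:jsp|jspx|html|htm|do)(?:\\?.*)?");

    /** Returns the contents of all double-quoted literals, in source order. */
    static List<String> literals(String source) {
        List<String> found = new ArrayList<>();
        Matcher m = LITERAL.matcher(source);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    /** Returns the literals that look like references to other Web resources. */
    static List<String> dependencyCandidates(String source) {
        List<String> candidates = new ArrayList<>();
        for (String literal : literals(source)) {
            if (RESOURCE.matcher(literal).matches()) {
                candidates.add(literal);
            }
        }
        return candidates;
    }
}
```

The ratio between the sizes of the two lists corresponds to the proportion of dependency-related string literals reported above.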

In summary, DeJEE successfully identified 329 and 11 KDMEntity instances corresponding to Java code, JSP, and HTML files, and 2,673 and 61 KDMRelationship instances corresponding to dependencies, respectively for PetStore and JSP Blog.

Results                                   | Java PetStore | JSP Blog
Multilanguage files                       | 45.4% of JSPs | 54.5% of JSPs
Total no. of string literals              | 1,541         | 204
No. of string literals with dependencies  | 720           | 65
No. of KDM entities                       | 329           | 11
No. of KDM relationships                  | 2,673         | 61

Table 2. Results of DeJEE tool

5.2.2. Recall Improvement

Figure 8 shows the results of the recall improvement that DeJEE achieves compared to MoDisco. MoDisco does not support the analysis of JSP pages and, thus, cannot build KDM models for JSP pages and identify dependencies in these pages or between these pages and other program elements, like Java code. Also, when analysing multilanguage code in one file, MoDisco outputs an empty KDM model for JSP Blog while DeJEE provides a KDM model that consists of 11 KDMEntity and 61 KDMRelationship instances. Similarly for Java PetStore, DeJEE provides 329 KDMEntity and 2673 KDMRelationship instances compared to 233 KDMEntity and 2284 KDMRelationship ones provided by MoDisco.

Consequently, DeJEE improves MoDisco’s recall by discovering 41% ((329 - 233)/233) and 100% more program elements and 17% ((2,673 - 2,284)/2,284) and 100% more program dependencies, respectively in Java PetStore and JSP Blog, than MoDisco. For JSP Blog, the MoDisco model is empty, so every element and dependency reported by DeJEE is new. DeJEE provides better results because of its codification of dependencies related to container services, string literals, and multilanguage code.

Figure 8. Recall Improvement

5.2.3. Correctness of Dependencies Identified by DeJEE

For JSP Blog, we found that the invocations in the manually constructed ground-truth call graph conform to the KDM model identified by DeJEE, giving 100% precision and 100% recall.

Recall that we only measured the precision of the KDM dependencies of Java PetStore, because the size of the system makes it difficult to build a ground-truth call graph. When we manually evaluated the identified KDM dependencies of the Java PetStore KDM model, we found all of them to be correct, i.e., 100% precision.

5.3. Threats to Validity

In this section, we discuss internal and external threats to validity of the proposed approach.

5.3.1. Internal Threats to Validity

Manual Construction of Ground Truth Call Graphs

We evaluated the resulting KDM models by manually constructing a ground-truth dependency call graph of JSP Blog. This manual construction is error-prone, and any error would affect the precision and recall values. To reduce this risk, we validated the constructed call graph three times before comparing it with the one produced by DeJEE. Note that we are familiar with this case study, since we started working with it two years ago as part of our project that aims to analyze program dependencies in JEE Web applications.

We plan to measure the precision and recall of dependencies identified by DeJEE against a real ground-truth dependency call graph, by comparing DeJEE dependencies against ones obtained from runtime executions. In this context, dependencies can be classified into three types: (i) Dependencies that appear in both cases are considered correct. (ii) Dependencies that are only identified during the runtime execution serve as references to identify false negatives, i.e., true dependencies that DeJEE does not identify. (iii) Dependencies that are only identified by DeJEE remain undecided, as the tested use case scenarios may not cover their execution paths at runtime. In this last case, we will rely on external human experts to validate the correctness of DeJEE. We will try to get in touch with the developers of the case studies to evaluate them. Otherwise, we will consider M.Sc. and Ph.D. students.
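
The three-way classification described above amounts to simple set operations. The sketch below is only an illustration of this planned comparison: the class DependencyComparison is hypothetical, and it assumes that each dependency is encoded as a "source->target" string.

```java
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: compares statically identified dependencies with runtime-observed ones. */
class DependencyComparison {
    /** Type (i): dependencies found both statically and at runtime. */
    static Set<String> inBoth(Set<String> staticDeps, Set<String> runtimeDeps) {
        Set<String> result = new TreeSet<>(staticDeps);
        result.retainAll(runtimeDeps);
        return result;
    }

    /** Type (ii): dependencies seen only at runtime, i.e., missed by the static analysis. */
    static Set<String> runtimeOnly(Set<String> staticDeps, Set<String> runtimeDeps) {
        Set<String> result = new TreeSet<>(runtimeDeps);
        result.removeAll(staticDeps);
        return result;
    }

    /** Type (iii): dependencies found only statically, which need manual validation. */
    static Set<String> staticOnly(Set<String> staticDeps, Set<String> runtimeDeps) {
        Set<String> result = new TreeSet<>(staticDeps);
        result.removeAll(runtimeDeps);
        return result;
    }
}
```

Only the third set requires human judgment, which keeps the effort asked of external experts proportional to the number of dependencies that the executed scenarios did not cover.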

Missing of Java Reflection Dependencies

We are aware that DeJEE dependency call graphs miss some dependencies due to Java reflection. We want to address this limitation in future work to further improve the recall of DeJEE. We will rely on the studies of Landman et al. (landman2017challenges, ) and Livshits et al. (livshits2005reflection, ) to address the challenges of Java reflection.

5.3.2. External Threats to Validity

Coverage of DeJEE for Other Case Studies

When we developed DeJEE, we covered as many as possible of the dependency codification patterns that developers can use. To do so, we studied Oracle’s JEE specification, which is the main reference for JEE developers to learn the JEE technologies. For example, we identified the various mechanisms that can be used in a JSP page to call another one. This represents 100% of pure inter-JSP dependencies. We applied DeJEE on two JEE applications: Java PetStore and JSP Blog. As we mentioned earlier, Java PetStore is the official demonstration example provided by Sun Microsystems to explain how to develop JEE applications using several implementation patterns. JSP Blog can be considered a good representative of JSP pages that mix Java code, because its JSP pages intensively use multilanguage code (a combination of JSP tags and Java code). Together, these points suggest that DeJEE can work for most JEE applications that follow standard JEE development patterns.

Including Dependencies of Other Programming Languages

DeJEE builds dependency call graphs based on the KDM meta-model, which supports the representation of other programming languages. Therefore, DeJEE can be extended to include other programming languages (e.g., C++). To do this extension, we have to: (i) transform source code elements of these programming languages to KDM representations, (ii) identify patterns of dependencies in the new multilanguage code (e.g., extract JNI dependencies), (iii) develop a tool that detects these patterns, and (iv) extend the KDM model by adding the identified dependencies.
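
As an example of step (ii), a JNI dependency can be detected by matching a Java native method against the C or C++ function name that the JNI naming convention assigns to it. The sketch below is a hypothetical illustration that is not part of DeJEE; it covers the short name only and ignores the signature suffix used for overloaded native methods.

```java
/** Hypothetical helper: computes the native function name expected for a Java native method. */
class JniSymbols {
    // Escapes one name component following the JNI naming convention:
    // '.' or '/' -> "_", '_' -> "_1", ';' -> "_2", '[' -> "_3",
    // and any other non-alphanumeric character -> "_0xxxx" (four lowercase hex digits).
    static String mangle(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c == '.' || c == '/') {
                sb.append('_');
            } else if (c == '_') {
                sb.append("_1");
            } else if (c == ';') {
                sb.append("_2");
            } else if (c == '[') {
                sb.append("_3");
            } else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')) {
                sb.append(c);
            } else {
                sb.append(String.format("_0%04x", (int) c));
            }
        }
        return sb.toString();
    }

    /** Short native name for a method of the given fully qualified class (no overload suffix). */
    static String symbolFor(String className, String methodName) {
        return "Java_" + mangle(className) + "_" + mangle(methodName);
    }
}
```

A detection tool could compute this name for every method declared native in the Java code and search for a function with the same name in the native sources, then record a dependency between the two corresponding KDM entities.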

6. Related Work

Although many research works have been presented in the literature to support the static code analysis of multilanguage systems (von1995program, ; kienle2010rigi, ; bruneliere2014modisco, ; moise2005extracting, ; kraft2008cross, ; german2009license, ), these works suffer from three main limitations. First, they analyze only one programming language at a time, although some of them can deal with more than one programming language in isolation from one another (von1995program, ; bruneliere2014modisco, ; kienle2010rigi, ). Second, they do not identify hidden dependencies related to container services (bruneliere2014modisco, ; moise2005extracting, ; kraft2008cross, ; german2009license, ). Third, they are not able to analyze source code files mixing multilanguage code (von1995program, ; kienle2010rigi, ; bruneliere2014modisco, ; moise2005extracting, ). Furthermore, none of these existing works has described the challenges, and the resulting requirements, in the static code analysis of multilanguage software systems. Instead, each focused on providing solutions to a narrow, particular set of challenges.

Von Mayrhauser and Vans (von1995program, ) studied various program comprehension models and sketched an integrated model that can be used to explain developers’ comprehension of components written in any programming language, one at a time. As another example, Müller and his group developed Rigi (kienle2010rigi, ), an environment to reverse engineer, explore, visualize, and document components in C, C++, or COBOL, in isolation from one another.

Cross-language research works are intrinsically multilanguage and their results pertain to the dependencies among heterogeneous components, two or more at a time. Such dependencies include (1) procedure calls; (2) clones; (3) license integration patterns; and (4) refactorings. For example, Moise and Wong (moise2005extracting, ) were among the first researchers, along with (linos2003tool, ; deruelle2001analysis, ), to extract, represent, and study cross-language dependencies, i.e., dependencies among heterogeneous components. They used the API provided by each language to identify cross-language calls, e.g., calls to the Java Native Interface API in C and Java. Kraft et al. (kraft2008cross, ) developed a technique to identify cross-language clones using the Microsoft Code-DOM library for .NET languages and a hybrid token/tree-based algorithm for clone detection. They reported clones whose siblings exist in components written in both C# and Visual Basic.NET. German and Hassan (german2009license, ) described five possible kinds of dependencies between heterogeneous components: linking, forking, sub-classing, inter-process communication, and plug-in. Then, they identified these kinds of dependencies in 124 open-source software systems. Using the identified component dependencies and their licenses, they proposed 12 patterns of license integration. Mayer and Schroeder (mayer2012cross, ) proposed a technique based on the MOF Query, View, Transformation Relations specification (QVT/R) to identify dependencies among heterogeneous components, warn of potential missing dependencies, and propagate renamings among heterogeneous components.

Although of major importance to understand multilanguage software, these works do not analyze the heterogeneous components as wholes, i.e., the patterns of dependencies and the patterns of control and data flows through these dependencies. Only a few works focused explicitly on the interactions among heterogeneous components, taken together: (1) static and (2) dynamic data and control flow interactions and (3) conformance of components and their configurations. Ayers et al. (ayers2005traceback, ) proposed TraceBack to diagnose bugs in multilanguage software by collecting data through runtime instrumentation of control-flow blocks. The data is collected by statically rewriting the binaries and–or instrumenting the intermediate languages to generate a unified trace of the components’ execution. Tan and Croft (tan2008empirical, ) studied the interactions between Java code and the C++ code in the underlying Java virtual machine. They showed that bugs are possible due to the assumptions made by the Java code regarding the C++ code. For example, the C++ native method java.util.zip.Deflater.deflateBytes() assumes that its Java callers check bounds, which could lead to buffer overflows.

Some works also relied on KDM models to analyze multilanguage code. Yazdanshenas and Moonen (yazdanshenas2011crossing, ) built homogeneous KDM models of heterogeneous systems, with components in C, C++, and Java and configuration files in XML. They used these models to obtain system dependency graphs and sliced these graphs to show whether a given input is used to produce the expected output, typically in sensor/actuator systems and other such component-based systems. However, their work only takes into account programming languages with an object-oriented meta-model (similar to what MoDisco does (bruneliere2014modisco, )), disregarding non-object-oriented languages like JSP and JSF.

For the case of Web applications, Ricca et al. (ricca2002construction, ) sliced Web applications based on dynamic analysis techniques. They traced dependencies between the PHP server and the JavaScript client to extract the program elements relevant to a specified computation. Naumovich et al. (naumovich2004static, ) proposed a static analysis approach only for EJBs and JEE access policies. Kirkegaard et al. only analyzed output streams of Servlets to verify whether they conform to the container specifications using context-free grammars. Perin (perin2012reverse, ) introduced a meta-model and reverse-engineering techniques to model the components of multilanguage software, focusing on Enterprise Java Beans software. Then, he used his approach to validate their architectures, to map their databases and transaction flows, and to identify hidden (i.e., abstracting explicit) dependencies, and he proposed visualization techniques to ease their understanding. Although these approaches identified dependencies that are similar to ones in JEE applications, they are ad hoc and did not cover all types of JEE dependencies. Shatnawi et al. (shatnawi2017analyzing, ; shatnawi2018identifying, ) identified KDM models of JEE applications by analyzing Servlets and JSPs. However, they did not analyze dependencies related to tag handlers. Hecht et al. (hecht2018codifying, ) extracted a declarative specification of the hidden dependencies of J2EE applications that are inherent in the services offered and that are not visible in the user code that uses them. Based on this declarative specification, they codified the hidden dependencies into rules that can be automatically detected using a tool. Despite the large contributions of these works in analyzing program dependencies in multilanguage software in general and JEE applications in particular, none of them describes the main challenges (and the resulting requirements) in analyzing multilanguage software systems. Also, existing works do not cover all dependencies that arise among the components of multilanguage JEE applications. For example, none of these works allowed us to codify the dependencies in our motivating example in Figure 3.

7. Conclusion and Future Work

Static code analysis is difficult when dealing with multilanguage software systems, such as JEE applications, which combine Java, JSF, JSP, and other source code and configuration files. In this paper, we identified five challenges with multilanguage software systems and proposed requirements to address them when developing static code analysis approaches for JEE applications.

We developed an automated analysis tool, DeJEE, that illustrates our approach to identify a dependency call graph of a given JEE application. This dependency call graph is identified based on the KDM meta-model. We developed DeJEE as an Eclipse plug-in that allows software developers: (i) to identify files having multilanguage code in JEE applications, (ii) to describe program elements and their dependencies across and within Java code, Servlets, and JSPs using one common KDM model, (iii) to parse configuration files to identify container service dependencies, and (iv) to discover dependencies among program elements that do not exist explicitly in the source code of the analyzed applications and would cause runtime errors if missing.

We applied DeJEE on two JEE applications (Java PetStore and JSP Blog) to evaluate the capability of DeJEE to solve the challenges of static code analysis in JEE applications and the correctness of the dependencies that DeJEE identifies. Our results showed that DeJEE solves the challenges that we identified in the static analysis of JEE applications and detects related dependencies with 100% precision for JSP Blog and Java PetStore, and with 100% recall for JSP Blog. Furthermore, we compared DeJEE with MoDisco and found that DeJEE improves the recall of MoDisco by discovering respectively 70.5% and 58.5% more KDM entity instances and KDM relationship instances, on average over the two case studies.

We plan to extend our work in the following four directions.

  1. Generalization of DeJEE: we plan to assess the generality and scalability of DeJEE by applying it to a number of JEE applications of different sizes and from various domains.

  2. Extension of DeJEE: we want to extend DeJEE by addressing the static analysis of Java reflection. We will rely on the study of Landman et al. (landman2017challenges, ) to address the challenges of Java reflection, which is used in Web applications. Also, we plan to extend DeJEE to analyze dependencies in multilanguage software including other programming languages such as JavaScript and PHP.

  3. Application of DeJEE: we want to exploit the KDM model identified by DeJEE to perform other software engineering tasks like program understanding, change impact analysis, program debugging, reverse engineering and service identification.

  4. Study of the prevalence of the identified challenges in other multilanguage software systems: we identified, in this paper, five main challenges related to the static analysis of multilanguage software. We want to empirically study how frequently these challenges occur in multilanguage software systems other than JEE applications, which would support the extension of DeJEE.

References

  • [1] Ferdaous Boughanmi. Multi-language and heterogeneously-licensed software analysis. In WCRE, pages 293–296. IEEE, 2010.
  • [2] Anneliese Von Mayrhauser and A Marie Vans. Program comprehension during software maintenance and evolution. Computer, 28(8):44–55, 1995.
  • [3] Holger M Kienle et al. Rigi-an environment for software reverse engineering, exploration, visualization, and redocumentation. Science of Computer Programming, 75(4):247–263, 2010.
  • [4] Hugo Bruneliere et al. Modisco: A model driven reverse engineering framework. IST, 56(8):1012–1032, 2014.
  • [5] Daniel L Moise et al. Extracting and representing cross-language dependencies in diverse software systems. In WCRE, pages 10–pp. IEEE, 2005.
  • [6] Nicholas A Kraft, Brandon W Bonds, and Randy K Smith. Cross-language clone detection. In SEKE, pages 54–59, 2008.
  • [7] Daniel M German and Ahmed E Hassan. License integration patterns: Addressing license mismatches in component-based development. In Proceedings of the 31st International Conference on Software Engineering, pages 188–198. IEEE Computer Society, 2009.
  • [8] Ricardo Pérez et al. Knowledge discovery metamodel-iso/iec 19506: A standard to modernize legacy systems. CSI, 33(6):519–532, 2011.
  • [9] Sun Microsystems. Java pet store, http://www.oracle.com/technetwork/java/petstore1-3-1-02-139690.html, last access: October 2017.
  • [10] Serge Demeyer, Sander Tichelaar, and Stéphane Ducasse. Famix 2.1-the famoos information exchange model, 2001.
  • [11] Yann-Gaël Guéhéneuc. Ptidej: promoting patterns with patterns.
  • [12] The Apache Tomcat software. http://tomcat.apache.org/, last access: October 2017.
  • [13] JSP Blog. http://jspblog.sourceforge.net, last access: October 2017.
  • [14] Davy Landman, Alexander Serebrenik, and Jurgen J Vinju. Challenges for static analysis of java reflection: literature review and empirical study. In Proceedings of the 39th International Conference on Software Engineering, pages 507–518. IEEE Press, 2017.
  • [15] Benjamin Livshits, John Whaley, and Monica S Lam. Reflection analysis for java. In Asian Symposium on Programming Languages and Systems, pages 139–160. Springer, 2005.
  • [16] Panagiotis K Linos, Zhi-hong Chen, Seth Berrier, and Brian O’Rourke. A tool for understanding multi-language program dependencies. In 11th IEEE International Workshop on Program Comprehension, 2003, pages 64–72. IEEE, 2003.
  • [17] Laurent Deruelle, Nordine Melab, Mourad Bouneffa, and Henri Basson. Analysis and manipulation of distributed multi-language software code. In Source Code Analysis and Manipulation, 2001. Proceedings. First IEEE International Workshop on, pages 43–54. IEEE, 2001.
  • [18] Philip Mayer et al. Cross-language code analysis and refactoring. In SCAM, pages 94–103. IEEE, 2012.
  • [19] Andrew Ayers, Richard Schooler, Chris Metcalf, Anant Agarwal, Junghwan Rhee, and Emmett Witchel. Traceback: first fault diagnosis by reconstruction of distributed control flow. In ACM SIGPLAN Notices, volume 40, pages 201–212. ACM, 2005.
  • [20] Gang Tan and Jason Croft. An empirical security study of the native code in the jdk. In Usenix Security Symposium, pages 365–378, 2008.
  • [21] Amir Yazdanshenas et al. Crossing the boundaries while analyzing heterogeneous component-based software systems. In ICSM, pages 193–202. IEEE, 2011.
  • [22] Filippo Ricca et al. Construction of the system dependence graph for web application slicing. In SCAM, pages 123–132. IEEE, 2002.
  • [23] Gleb Naumovich et al. Static analysis of role-based access control in j2ee applications. ACM Software Engineering Notes, 29(5):1–10, 2004.
  • [24] Fabrizio Perin. Reverse engineering heterogeneous applications. PhD thesis.
  • [25] Anas Shatnawi, Hafedh Mili, Ghizlane El Boussaidi, Anis Boubaker, Yann-Gaël Guéhéneuc, Naouel Moha, Jean Privat, and Manel Abdellatif. Analyzing program dependencies in java ee applications. In Proceedings of the 14th International Conference on Mining Software Repositories, pages 64–74. IEEE Press, 2017.
  • [26] Anas Shatnawi, Hafedh Mili, Manel Abdellatif, Ghizlane El Boussaidi, Jean Privat, Yann-Gaël Guéhéneuc, and Naouel Moha. Identifying kdm model of jsp pages. arXiv preprint arXiv:1803.05270, 2018.
  • [27] Geoffrey Hecht, Hafedh Mili, Ghizlane El-Boussaidi, Anis Boubaker, Manel Abdellatif, Yann-Gaël Guéhéneuc, Anas Shatnawi, Jean Privat, and Naouel Moha. Codifying hidden dependencies in legacy j2ee applications. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC), pages 305–314. IEEE, 2018.