Automated Query Generation for Design Pattern Mining in Source Code

Identifying which design patterns already exist in source code can help maintenance engineers gain a better understanding of the source code and determine if new requirements can be satisfied. There are current techniques for mining design patterns, but some of these techniques require the tedious work of manually labeling training datasets, or manually specifying rules or queries for each pattern. To address this challenge, we introduce Model2Mine, a technique for automatically generating SPARQL queries by parsing UML diagrams, ensuring that all constraints are appropriately addressed. We discuss the underlying architecture of Model2Mine and its functionalities. Our initial results indicate that Model2Mine can automatically generate queries for the three types of design patterns (i.e., creational, behavioral, structural), with a slight performance overhead compared to manually generated queries, and with accuracy that is comparable to, or better than, that of existing techniques.




I Introduction

Design patterns are general-purpose solutions to recurring software engineering problems. They have advantages such as enhancing reusability and maintainability by furnishing an explicit specification of class and object interactions and their underlying intent [b27]. Secure design patterns [b28] are reusable components that not only address common vulnerabilities but also reduce the high cost and effort associated with implementing security at a later stage [b20]. Ever since they were introduced, much research has gone into design patterns and secure design patterns, as they impact the design, development, and maintenance stages of software engineering.

Since design patterns assist with satisfying requirements, it is important for maintenance engineers to determine which patterns are already present in the code. Finding design patterns can be time-consuming, due to the manual work required to reverse engineer the code [b7]. Meanwhile, other techniques also require manual work before design patterns can be mined. This may involve the time-consuming task of manually labeling training data [b13] or manually specifying patterns for mining (e.g., rules [b45], queries [b6]). Another key challenge in automation is the variation in implementations of design patterns, which makes direct pattern matching infeasible. This is especially true of secure design patterns, which have a higher level of variability than object-oriented design patterns [b40] [b41].

To address these challenges, we developed Model2Mine, which automatically generates queries from UML Class Diagrams [b2] to mine design patterns. We leverage Semantic Web technologies, such as the Resource Description Framework (RDF) [b36], CodeOntology [b33] [b34] and SPARQL [b21], in developing this generator. An RDF graph shows relationships between resources (which could be data), and these relationships are represented as triples [b36]. CodeOntology is used to convert source code to RDF triples, and it preserves all relationships between elements within the code (e.g., between packages, classes, and methods) [b33] [b34]. Once we have an RDF graph of the source code, we can retrieve triples using SPARQL queries [b21]. Our technique, Model2Mine, automatically generates these SPARQL queries from an XML Metadata Interchange (XMI) [b42] representation of a UML Class diagram.

Model2Mine is feasible to use because repositories of common design patterns have already been created [b15] [b27] and they already include UML Class diagrams in their description. This also applies to security design patterns as many of them also include Class diagrams [b43] [b44].

In this paper, our main contribution is a language-agnostic, fully automated approach that can account for implementation variants of any design pattern. Compared to other methods for design pattern mining in source code, the ease of use of this tool comes from the fact that it does not involve any manual training stage and does not require defining rules and queries. The second contribution is that Model2Mine captures behavioral aspects of a pattern in addition to structural characteristics by incorporating stereotypes and filters. Thirdly, the paper discusses the various ways in which accuracy can be enhanced when mining design patterns using SPARQL queries.

We assessed Model2Mine using two types of evaluation. First, we compared our automatically generated SPARQL queries against manually constructed queries, for different types of design patterns (i.e., creational, behavioral and structural patterns [b27]). The automatically generated SPARQL queries were comparable to manually constructed queries, with only slight differences in running time. We also assessed the accuracy of mined queries. Our results thus far indicate that their accuracy is comparable to, or better than, that of existing techniques [b5], [b6], [b9], [b13], [b16], [b17], [b18], [b19], [b31].

This paper is organized as follows. We start with the motivation for our work (Section II). This is followed by a discussion of the modeling and Semantic Web technologies used (Section III). A comparison of Model2Mine with existing methodologies is discussed in Section IV. Tool design, with a discussion of each module in the tool, is presented in Section V. We then discuss our evaluations, and the limitations and challenges, in the Validation and Discussion sections respectively.

II Motivation

Significant work has gone into identifying design patterns that can be used by software engineers [b15] [b27], but these patterns assume that a developer is working on the design phase of software. Many times, however, a maintenance engineer works on an existing codebase, and it is unclear which design patterns already exist in the source code.

The ability to identify existing patterns in source code is especially important for legacy code that needs to meet new security requirements. Model2Mine serves the purpose of helping security engineers to rapidly understand which, if any, existing security mechanisms (i.e., security design patterns) have been designed into the existing code base. The mined security design patterns can then be compared with security requirements.

Fig. 1: UML Diagram and corresponding SPARQL query for a simple inheritance relationship

III Background: Modeling & Semantic Web

In this section we provide background on the various technologies we use for Model2Mine.

Modeling: The Unified Modeling Language (UML) is a general-purpose modeling language intended to provide a standard way to visualize the design of a system [b2]. A UML Class Diagram has components like Classes and Interfaces, which in turn contain attributes and operations. Classes are connected using relationships including Generalization, Association, Composition, Collaboration and Interface Realization.

There are various UML editing tools that enable users to create UML Diagrams. One of these tools is StarUML [b49]. We use StarUML to create UML Class Diagrams of various design patterns (e.g., Proxy, Visitor, Factory, Builder). We also use the StarUML XMI plugin to convert model (.mdj) and fragment (.mfj) files of these design patterns to XMI files. These XMI files serve as input to Model2Mine.

Modeling Metrics: SDMetrics is an Object Oriented design quality measurement tool for UML [b48]. SDMetrics analyzes the structure of UML models and works with all UML design tools that support XMI. Although the software is rich with features like comprehensive design measurements, automated design rule checks and an interactive UI, the only functionality we use in this project is the Open Core library used in its backend that parses UML Files stored as XMI. It supports all XMI versions currently in use. It also has a flexible custom XMI import, configurable to support proprietary UML metamodel extensions and tools that deviate from XMI standards.

Semantic Web: Model2Mine is built on top of various Semantic Web technologies: RDF, CodeOntology, and SPARQL. An RDF graph is a finite set of RDF triples [b36]. RDF triples contain facts, which are relationships between resources. Resources are represented as nodes, and relationships are represented as edges. The vocabulary for RDF graphs consists of three disjoint sets: a set of URIs $U$, a set of bnode identifiers $B$, and a set of well-formed literals $L$. The union of these sets is called the set of RDF terms, $T = U \cup B \cup L$. An RDF triple is a tuple $(s, p, o) \in (U \cup B) \times U \times T$.

CodeOntology is a building block of the Web of Code, an attempt to leverage code in a semantic framework [b33] [b34]. CodeOntology exposes an API for parsing the source code of OpenJDK 8, and also provides the result sets of parsing open source code on GitHub through the GitHub API. Its framework is composed of three main components: Ontology, Parser and Datasets.

The ontology component is designed to model the domain of object-oriented programming languages. It is written in OWL 2 and is mainly focused towards the Java programming language, but it can be replaced to represent more languages. The modelling process underlying the creation of the ontology has been guided by common competency questions that usually arise during software processes and has been inspired by a re-engineering of the Java abstract syntax tree.

The parser component analyzes Java code and serializes it into RDF triples. Internally, the RDF triple extraction is managed by a Spoon [b47] processor invoked for every package in the input project. The RDF serialization process is handled using Apache Jena [b46]. The parser is able to extract structural information common to all object-oriented programming languages, like class hierarchy, methods and constructors. Optionally, it can also serialize all statements and expressions into RDF triples, thereby providing a complete RDF-ization of the source code. The RDF serialization of a Java project proceeds in three steps. First, the project is analyzed to download all of its dependencies and load them into the class path. Second, an abstract syntax tree of the source code and its dependencies is built. Third, the tree is processed to extract a set of RDF triples.

We also use SPARQL. The building block of SPARQL queries is the Basic Graph Pattern (BGP). A SPARQL BGP is a set of triple patterns. A triple pattern is an RDF triple in which zero or more variables might appear. Variables are taken from an infinite set $V$, which is disjoint from the above-mentioned sets [b1].
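To make the notion of a triple pattern concrete, the following minimal Python sketch evaluates a basic graph pattern against a small in-memory set of triples. It is only an illustration and not how a SPARQL engine works internally. The graph contents, the `woc:extends` predicate, the convention that a plain string starting with `?` is a variable, and the function names `match_pattern` and `evaluate_bgp` are all assumptions made for this example.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]


def is_variable(term: str) -> bool:
    """Terms starting with '?' are treated as variables (illustrative convention)."""
    return term.startswith("?")


def match_pattern(pattern: Triple, triple: Triple, binding: Binding) -> Optional[Binding]:
    """Try to extend `binding` so that `pattern` matches `triple`; None on mismatch."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            # Bind the variable, or check consistency with an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None  # a constant term does not match
    return extended


def evaluate_bgp(patterns: List[Triple], graph: Set[Triple]) -> List[Binding]:
    """Return every binding that satisfies all triple patterns simultaneously."""
    solutions: List[Binding] = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


# Hypothetical graph: two classes, one of which extends the other.
graph = {
    ("ex:Dog", "rdf:type", "woc:Class"),
    ("ex:Animal", "rdf:type", "woc:Class"),
    ("ex:Dog", "woc:extends", "ex:Animal"),
}

# BGP with two triple patterns sharing the variable ?sub.
bgp = [
    ("?sub", "rdf:type", "woc:Class"),
    ("?sub", "woc:extends", "?super"),
]
solutions = evaluate_bgp(bgp, graph)
```

The shared variable `?sub` is what joins the two patterns: only bindings that satisfy both triples survive.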

SPARQL is a query language and a protocol for accessing RDF graphs [b1]. SPARQL takes the description of what the application wants, in the form of a query, and returns that information, in the form of a set of bindings or an RDF graph.

A sample SPARQL query that searches for all entries that have the name attribute set to the value Smith is as follows:

Fig. 2: Object-oriented Design of the UML to SPARQL Converter Library

The query has PREFIX, SELECT and WHERE sections: the PREFIX section declares the schema (namespaces) being queried, the SELECT statement defines the attributes being extracted, and the WHERE clause defines the various constraints that need to be matched to extract the results. The WHERE clause can have one or more constraints, including a FILTER expression.
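As a rough illustration of these three sections, the sketch below assembles a query for entries whose name is Smith. The `ex:` prefix and its URI, the `ex:name` property, and the helper `build_query` are hypothetical and are not taken from Model2Mine.

```python
from typing import Dict, List, Optional


def build_query(prefixes: Dict[str, str], variables: List[str],
                triples: List[str], filter_expr: Optional[str] = None) -> str:
    """Assemble a SPARQL SELECT query from its PREFIX, SELECT and WHERE parts."""
    lines = [f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items()]
    lines.append("SELECT " + " ".join(variables))
    lines.append("WHERE {")
    lines.extend(f"  {triple} ." for triple in triples)
    if filter_expr is not None:
        lines.append(f"  FILTER ({filter_expr})")
    lines.append("}")
    return "\n".join(lines)


query = build_query(
    prefixes={"ex": "http://example.org/schema#"},  # hypothetical namespace
    variables=["?person"],
    triples=["?person ex:name ?name"],              # hypothetical property
    filter_expr='?name = "Smith"',
)
print(query)
```

The same constraint could be written without FILTER by placing the literal directly in the triple pattern; the FILTER form is used here only because the text mentions it.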

IV Literature Review

Earlier approaches for detecting design patterns in source code range from sub-graph matching [b16], [b17], [b19] and ontology-based techniques [b6], [b9] to machine learning techniques [b5], [b13], [b18], [b31] and sequence diagrams [b9]. A detailed meta-analysis of various design pattern mining approaches is presented in [b14].

The construction of RDF triples from UML diagrams is discussed in [b39]. Our technique introduces a novel method and tool, called Model2Mine, to generate SPARQL queries automatically by parsing UML diagrams. Model2Mine uses Semantic Web technologies to convert source code to RDF triples and to query the triples using SPARQL queries. This technique not only removes the bottleneck of manually constructing queries but also enables bulk parsing of projects and the creation of datasets for source code mining research.

There are ontology-based approaches to mining design patterns (e.g., [b6], [b9]). One approach uses Semantic Web technologies for automatically detecting design patterns [b6]. However, this requires manual specification of queries and rules. That is, the SPARQL queries had to be manually constructed for each pattern intended to be mined. Their queries handle implementation variations using UNION operations, by defining each component and its associated relationships within the UNION block. However, since the SELECT statement and component declarations are common, this only covers implementation variants that have exactly the same number of target components. Another technique uses a knowledge base and inference rules to detect design patterns that are similar in structure [b9]. The target system design, including a class diagram and its associated sequence diagrams, is analyzed and translated into knowledge concepts in an ontology in terms of RDF/OWL elements. The detection is performed by semantically searching a predefined knowledge base of the expected design patterns and their corresponding detection inference rules through SWRL and SQWRL. Our method uses a similar approach that relies on an ontology by converting source code to RDF triples. However, we not only mine for structurally similar patterns, but also address behavioral and creational patterns. We achieve this by using SPARQL queries.

Numerous researchers have identified language-specific solutions to design pattern mining in object-oriented languages like C++ and Java, including both manual [b12] and automated techniques [b26] [b37]. The underlying idea of creating a language-agnostic parser is similar to the multi-stage filtering strategy in [b25], as they also address behavioral and creational patterns in addition to filtering for structural similarity. However, their extractor was developed only for C++, and the Abstract Object Language (AOL) representation of each pattern had to be constructed manually.

The IDEA (Interactive DEsign Assistant) system is another tool that matches a UML diagram of a design pattern against a class being implemented, verifying whether the implementation can be improved to match the design pattern [b24]. However, due to scalability issues, it only supports verification of one class at a time in the source code. Applying it to a distributed set of open source projects is therefore difficult.

Other techniques use source code metrics and machine learning to detect patterns without strict structural constraints, so that minor variations in how different projects implement a pattern do not lead to false negatives [b5] [b18] [b13]. However, this lack of strict constraints leads to a large number of false positives. These techniques also require tedious manual training for each pattern that needs to be detected. The similarity score comparison of graph vertices used in [b31] has the same limitation. The advantage of our model is that it accommodates variations in implementation without compromising accuracy, and it removes the need for manual training for each pattern.

There have also been approaches to detect patterns from software documentation [b30]. However, they require the descriptive and prescriptive architectures to be the same for the model to perform accurately.

V Design

As mentioned, Model2Mine is built on top of the Semantic Web technologies discussed in Section III. It uses the XMI representations of UML Class diagrams to generate queries for mining these patterns in source code, which is represented as an RDF graph. An object-oriented design of the tool is shown in Figure 2.

The architecture follows a modular design with separation of concerns. For instance, the task of identifying components and relationships is handled by ModelElementResolverService. On the other hand, the task of constructing the query from the identified Components and RelationshipItems is handled by QueryConstructionService. The two services are completely decoupled. Model2Mine was designed to enhance the reusability and portability of the individual modules. Model2Mine can be extended to support more relationship types and component types with minimal changes.

Implemented objects are described in detail below.

V-A PatternUMLParser

The PatternUMLParser class contains the parseXMIFile and saveOutputAsText methods, which parse UML Class diagrams in .xmi files and save SPARQL queries as .rq files, respectively. A Model object is created from the XMI file using the SDMetrics Open Core library. Once the diagram is parsed as a Java Model object, each ModelElement in the model is iteratively converted into a Component or RelationshipItem object. Further, the library iteratively analyzes each component and relationship in the diagram. It creates a SPARQL query which includes two parts: a SELECT statement and a WHERE clause. The query is a string formed by concatenating each RDF triple generated by analyzing relationships (e.g., associations, interface realizations, generalizations) in the Class diagram as well as constraints like data type and visibility.

The PatternUMLParser relies on ModelElementResolverService to resolve whether the ModelElement being parsed is relevant for constructing the SPARQL query. The ModelElementResolverService constructs a blank SparqlQuery object, and each relevant element is added to the list of Components or RelationshipItems in the SparqlQuery object being constructed. Once all elements are checked, PatternUMLParser uses the QueryConstructionService to construct the query attribute of the SparqlQuery object. The Model object, which contains a hierarchical map of ModelElements constructed by parsing the XMI representation of a UML diagram, is shown in Figure 3. The class parses the XMI file to identify UML elements defined in the MetaModel object, based on the element-to-XMI keyword mapping defined in the XMITransformations object. After the SPARQL query is constructed as explained above, the output is saved as an .rq file in the path defined in the Model2Mine util files.
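The parsing step can be approximated with a short sketch. Model2Mine itself relies on the Java-based SDMetrics Open Core library; the example below instead uses Python's standard `xml.etree.ElementTree` as a stand-in, so none of its names correspond to the tool's API. The XMI fragment is hand-written in the style of XMI 2.x and is far simpler than a real StarUML export, and the function name `parse_xmi` is hypothetical.

```python
import xml.etree.ElementTree as ET

XMI_NS = "http://www.omg.org/spec/XMI/20131001"

# Hand-written fragment in the style of XMI 2.x; real exports carry far more detail.
XMI_SAMPLE = f"""
<xmi:XMI xmlns:xmi="{XMI_NS}" xmlns:uml="http://www.omg.org/spec/UML/20131001">
  <uml:Model xmi:id="m1" name="Sample">
    <packagedElement xmi:type="uml:Class" xmi:id="c1" name="Animal"/>
    <packagedElement xmi:type="uml:Class" xmi:id="c2" name="Dog">
      <generalization xmi:type="uml:Generalization" xmi:id="g1" general="c1"/>
    </packagedElement>
  </uml:Model>
</xmi:XMI>
""".strip()


def parse_xmi(xmi_text):
    """Return (components, relationships) extracted from the XMI text."""
    root = ET.fromstring(xmi_text)
    type_attr = f"{{{XMI_NS}}}type"  # namespaced attribute xmi:type
    id_attr = f"{{{XMI_NS}}}id"      # namespaced attribute xmi:id

    # First pass: collect classes and interfaces, indexed by their xmi:id.
    names_by_id = {}
    components = []
    for elem in root.iter("packagedElement"):
        if elem.get(type_attr) in ("uml:Class", "uml:Interface"):
            names_by_id[elem.get(id_attr)] = elem.get("name")
            components.append(elem.get("name"))

    # Second pass: resolve generalizations to (child, relationship, parent).
    relationships = []
    for elem in root.iter("packagedElement"):
        for gen in elem.findall("generalization"):
            parent = names_by_id.get(gen.get("general"))
            if parent is not None:
                relationships.append((elem.get("name"), "Generalization", parent))
    return components, relationships


components, relationships = parse_xmi(XMI_SAMPLE)
```

Two passes are used because a generalization refers to its parent by identifier, and the parent may be declared after the child in the file.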

Fig. 3: Hierarchical map of Model Elements created by parsing XMI using SDMetrics Open Core library

V-B ModelElementResolverService

This service resolves whether a ModelElement should be appended to the query and, if so, in what format. Dedicated methods for resolving each type of relationship are defined here; they rely on the Relationships enumeration (enum) to identify relevant relationships.
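The resolution step can be pictured as a simple dispatch on the element type. The sketch below is an assumption about the general shape of such a resolver, not the service's actual logic: the dictionary-based element representation, the two type sets, and the function name `resolve` are all hypothetical.

```python
from typing import Optional, Tuple

# Element kinds treated as components or relationships; both sets are illustrative.
COMPONENT_TYPES = {"class", "interface", "operation", "attribute"}
RELATIONSHIP_TYPES = {"generalization", "interfacerealization", "association", "composition"}


def resolve(element: dict) -> Optional[Tuple[str, object]]:
    """Classify a parsed model element, or return None if it is irrelevant to the query."""
    kind = element.get("type", "").lower()
    if kind in COMPONENT_TYPES:
        return ("component", element["name"])
    if kind in RELATIONSHIP_TYPES:
        return ("relationship", (element["source"], kind, element["target"]))
    return None  # e.g. diagrams, comments, layout information


resolved = [
    resolve({"type": "Class", "name": "Proxy"}),
    resolve({"type": "Generalization", "source": "Dog", "target": "Animal"}),
    resolve({"type": "Diagram", "name": "Main"}),
]
```

Returning `None` for unrecognized elements keeps layout and diagram metadata out of the generated query.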

V-C Component

Components are the nodes in a UML diagram between which relationships exist. Components could be classes, interfaces, methods or other attributes.

V-D Relationships

This is an enumeration identifying all the relationships that can appear between two components in a SPARQL query generated from a UML Diagram. It covers relations between two classes, between a class and its attributes and methods, and between a method and its parameters. That is, in addition to relationships like Generalization, Interface Realization, Association and Composition, it has entries corresponding to the generation of triples with relations such as woc:hasReturnType, woc:hasMethod, and woc:hasParameter. This is maintained as a separate enumeration so that the construction of constraint triples can be generalized as a RelationshipItem, explained in Section V-E, where the elements of the triple have the types (Component, Relationships, Component).

V-E RelationshipItem

A RelationshipItem has three attributes: a subject, an object, and a predicate, which has a value from the Relationships enumeration describing the relationship the subject has to the object. Both the subject and the object are of Component type. An RDF triple can be constructed as (subject, predicate, object).
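A minimal sketch of this data structure is given below. The field names `subject`, `predicate`, and `obj`, the `to_triple` method, and the use of the component name as a SPARQL variable are assumptions made for illustration. Only the three `woc:has...` predicates are named in the text; `woc:extends` is an assumed predicate for Generalization.

```python
from dataclasses import dataclass
from enum import Enum


class Relationships(Enum):
    # Predicates named in the text.
    HAS_METHOD = "woc:hasMethod"
    HAS_PARAMETER = "woc:hasParameter"
    HAS_RETURN_TYPE = "woc:hasReturnType"
    # Assumed predicate for Generalization; not confirmed by the text.
    GENERALIZATION = "woc:extends"


@dataclass(frozen=True)
class Component:
    name: str  # used here as the SPARQL variable name
    kind: str  # e.g. "Class", "Interface", "Method"


@dataclass(frozen=True)
class RelationshipItem:
    subject: Component
    predicate: Relationships
    obj: Component

    def to_triple(self) -> str:
        """Render the item as a SPARQL triple pattern over two variables."""
        return f"?{self.subject.name} {self.predicate.value} ?{self.obj.name} ."


item = RelationshipItem(
    subject=Component("proxy", "Class"),
    predicate=Relationships.HAS_METHOD,
    obj=Component("request", "Method"),
)
triple = item.to_triple()
```

Because every constraint has the same (Component, Relationships, Component) shape, class-level and member-level relations can be rendered by the same method.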

V-F QueryConstructionService

Once the lists of components and relationships are constructed for the Model under consideration, the QueryConstructionService builds the SparqlQuery. Each component is added to the SELECT statement. Each component is also added to the WHERE clause with a triple defining its type. For example, if a Method is encountered, an RDF triple stating that the element has type woc:Method is added.
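The construction step described above can be sketched as follows. This is an illustrative approximation rather than the service's actual code: the tuple-based representations, the function name `construct_query`, the `woc:` namespace URI, and the `woc:Class` type are assumptions, while `woc:Method` and `woc:hasMethod` come from the text.

```python
from typing import List, Tuple

Component = Tuple[str, str]          # (variable name, type), e.g. ("request", "Method")
Relationship = Tuple[str, str, str]  # (subject variable, predicate, object variable)

# Assumed namespace URI for the woc: prefix.
WOC_PREFIX = "PREFIX woc: <http://rdf.webofcode.org/woc/>"


def construct_query(components: List[Component], relationships: List[Relationship]) -> str:
    """Build a SELECT query: one variable and one type triple per component,
    followed by one constraint triple per relationship."""
    select = "SELECT " + " ".join(f"?{name}" for name, _ in components)
    where = [f"  ?{name} a woc:{kind} ." for name, kind in components]
    where += [f"  ?{s} {p} ?{o} ." for s, p, o in relationships]
    return "\n".join([WOC_PREFIX, select, "WHERE {", *where, "}"])


query = construct_query(
    components=[("proxy", "Class"), ("request", "Method")],
    relationships=[("proxy", "woc:hasMethod", "request")],
)
print(query)
```

In this sketch the keyword `a` is the standard SPARQL shorthand for `rdf:type`, so no additional prefix declaration is needed for the type triples.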