Feature Maps: A Comprehensible Software Representation for Design Pattern Detection

by   Hannes Thaller, et al.

Design patterns are elegant and well-tested solutions to recurrent software development problems. They are the result of software developers dealing with problems that frequently occur, solving them in the same or a slightly adapted way. A pattern's semantics provide the intent, motivation, and applicability, describing what it does, why it is needed, and where it is useful. Consequently, design patterns encode a well of information. Developers weave this information into their systems whenever they use design patterns to solve problems. This work presents Feature Maps, a flexible human- and machine-comprehensible software representation based on micro-structures. Our algorithm, the Feature-Role Normalization, presses the high-dimensional, inhomogeneous vector space of micro-structures into a feature map. We apply these concepts to the problem of detecting instances of design patterns in source code. We evaluate our methodology on four design patterns, a wide range of balanced and imbalanced labeled training data, and compare classical machine learning (Random Forests) with modern deep learning approaches (Convolutional Neural Networks). Feature maps yield robust classifiers even under challenging settings of strongly imbalanced data distributions without sacrificing human comprehensibility. Results suggest that feature maps are an excellent addition in the software analysis toolbox that can reveal useful information hidden in the source code.



There are no comments yet.


page 1

page 3

page 6


Feature-Based Software Design Pattern Detection

Software design patterns are standard solutions to common problems in so...

Pseudo-Feature Generation for Imbalanced Data Analysis in Deep Learning

We generate pseudo-features by multivariate probability distributions ob...

Deep Learning Anti-patterns from Code Metrics History

Anti-patterns are poor solutions to recurring design problems. Number of...

ZeroIn: Characterizing the Data Distributions of Commits in Software Repositories

Modern software development is based on a series of rapid incremental ch...

On the Spectrum of Random Features Maps of High Dimensional Data

Random feature maps are ubiquitous in modern statistical machine learnin...

A Feature Based Methodology for Variable Requirements Reverse Engineering

In the past years, software reverse engineering dealt with source code u...

Onboarding in Software Product Lines: ConceptMaps as Welcome Guides

With a volatile labour and technological market, onboarding is becoming ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


Two ingredients are crucial for the approach within this work. First, Design Patterns (DPs) and their intricacies from a Design Pattern (Instance) Detection (DPD) perspective influence how detectors are built and evaluated. Thus a more formal definition helps to frame the problem and its solution. Secondly, Machine Learning (ML) and its automation capabilities concerning data modeling are essential to understand the approach and its evaluation.

Design Patterns


Figure : The goal is to find a process that can reconstruct the original role mappings of a given pattern. Inputs are classes of a system, outputs are mappings between pattern roles and the input classes. A unique mapping defines a set of mappings that share the same classes mapped to the primary roles. Component (Graphic) is assigned to the primary role and defines the communication scheme, while Composite (Picture), Leaf (Line and Rectangle) are classes assigned to secondary roles. In addition to the Composite design pattern, the example also presents the detection of the Aggregation and Inheritance micro-structures. Similar to the detection of design patterns, micro-structures are only an assortment of classes that form a class structure. The Composite example is based on an example by Gamma et al. [Gamma1995].

A Design Pattern (DP) is a set of roles to which participating classes are mapped. These classes communicate (structure and collaboration) in an organized fashion and have specific semantics (intent, motivation, and applicability) attached to them. Design pattern (instance) detection reconstructs the original mappings between classes of a system and roles of a pattern with respect to their communication such that the attached semantics provide information about the system under inspection. Figure Document (based on an example by Gamma et al. [Gamma1995]) illustrates this process in which the input is a set of classes, and the output is a set of mappings between these classes and the roles of a specific pattern. In other words, the task is to annotate a set of classes with the roles of a design pattern. The example is a subsystem of a drawing tool that uses the Composite pattern to interact, in a uniform way, with the drawing primitives and the scene. The Composite pattern organizes objects into a tree structure where internal nodes delegate specific calls while leaf nodes implement the actual behavior of the call. Graphic is the superclass of Picture, Line, and Rectangle and defines a common interface for these. Picture aggregates Graphic objects, and delegates calls to the draw-method. Line and Rectangle provide the actual implementation of draw, hence are called leafs. The goal is now to find a process that reconstructs the original mappings between classes and pattern roles indicated by the question mark in Figure Document. The left side in Figure Document represents the initial state of the system before the detection process. The right side shows the system with annotated roles after the detection process. Mapping 1, for example, maps Graphic Component, Picture Composite and Line Leaf. Each role mapping assigns at least one class to one role resulting in multiple possible role mappings for the same subsystem given various implementations of the different roles. Mappings 1 and 2 differ only in one role mapping which is a common scheme in design patterns. Primary roles define the communication scheme within the pattern and drive the communication through a pattern’s class structure, hence are often abstractions. Secondary roles provide the implementation for the abstractions and inherit the protocol from the primary roles, thus are commonly fluctuating in their class assignment. A system usually provides multiple versions of the secondary roles but only a handful of different implementations for the primary roles. All mappings that share the same primary roles belong to the same unique role mapping representing one specific implementation of a pattern within a subsystem. More formally, a pattern mapping is a -fold relation between a set of classes and a set of roles with


in which is a specific pattern with roles. Each unique role mapping reflects an equivalence class in which the primary roles are compared. Given a set of pattern mappings with the equivalence relation in which the classes mapped to the primary roles are compared, then


represents the equivalence class from . That is, all role mappings of pattern with roles that have the same classes mapped to the primary roles . In Figure Document, Graphic maps to the Component role representing the primary role of the Composite pattern. Line, Rectangle, and Picture map to the secondary roles where Mappings 1-2 belong to the Unique Mapping A. Tertiary roles (e.g., Client) that solely function as an entry point for the pattern’s communication sequence are often ignored in detection processes as they do not carry any useful information. The concept of unique mapping carries importance in analysis settings where they explicitly state the boundaries for design pattern instances.

Machine Learning

Machine Learning (ML) describes methods that learn relationships or structure from data in an automated fashion. Essential elements of ML are data (e.g., images or time series), the model (e.g., Random Forest [Breiman1984] or CNN [Lecun1998a]), the optimization procedure (e.g., Adam [Kingma2014]) and the evaluation procedure (e.g., Cross-Validation [Bishop2006] or test-set [Hastie2009] method).


The DPD problem can be framed as a supervised classification problem in which observations are annotated with the ground truth (e.g., is an instance of Composite, is not an instance of Composite). ML algorithms often expect the data to be independent and identically distributed (i.i.d.) meaning that the observations are mutually independent and collected in the same fashion. Especially observation independence is vital as violating the assumption results over-optimistic model performance in classification settings. Data can be preprocessed by standardizing or rescaling the values into a specific range. In addition to normalizing, data permutation is often employed to increase the amount of available data or to fit models that are robust against simple transformations (i.e., reduce overfitting) [Goodfellow2016b]. For instance, rotating or mirroring an image does not change the objects it shows (e.g., rotated apple is still an apple).


Many different supervised machine learning models of varying complexity and with different strengths and weaknesses exist. Convolutional Neural Networks (CNNs) [Lecun1998a]

are fundamental building blocks in modern machine learning and are prominent for their capabilities for computer vision problems. They learn local correlations within volumes and combine these correlations into high-level features at later stages. For instance, nearby pixels often correlate with each other, e.g., color and texture of an apple on the top right corner of an image. CNNs innately model such local correlations leading to good results in many domains while still having a reasonable amount of model parameters. A

Decision Tree [Murphy2012] recursively partitions the input space via axis-parallel splits that can be represented as a tree. Each leaf carries a response (e.g., a pattern class) while the nodes represent boolean conditions on an input dimension (e.g., is an interface {yes, no}). A Random Forest (RF) [Breiman1984]

contains multiple randomly perturbed decision trees, that reduce the possible high variance of a single tree. The trees are perturbed by fitting each on a subset of the data which creates smoother decision boundaries as multiple splits are averaged in the final prediction.

Hyper-parameter optimization searches the space of possible parameter configurations of an ML algorithm that leads to the best performing model according to some predefined goal function. For instance, tree depth or partition criterion are hyper-parameters of a decision tree.


It is crucial that the evaluation procedure provides a measurement of the model that is truthful, i.e., it should not over- or underestimate the performance. Cross-Validation [Geisser1993] (CV) is a method to evaluate the generalization performance by splitting the dataset into -folds.

folds are used to fit a model while the remaining fold is used to estimate its generalization performance. This process is repeated

times leading to

models and estimates that are averaged into a global generalization performance of the ML algorithm regarding the dataset and parameters. CV is a nearly unbiased estimator for the generalization performance except in the case of small datasets in which the evaluation variance may misestimate the generalization. Repeated cross-validation may be used to reduce the variance, trading it for some bias. Accuracy, Precision, Recall or the Matthews Correlation Coefficient (MCC) 

[Matthews1975, Powers2007]

are performance metrics for classification models. Accuracy, Recall, and Precision are frequently used metrics and need no further explanation. MCC is a reliable and balanced performance metric for binary classification ranging from -1 to 1 describing the strength of association between model prediction and ground truth. It provides the most accurate measurement of a (binary) model’s performance even if the dataset is imbalanced (skewed distribution of prediction labels). This does not hold for Accuracy, Precision and Recall as they do not consider all types of correctly or incorrectly predicted instances (true/false positive, true/false negatives).

Design Pattern Instance Detection Pipeline


Figure : Overview of the detection pipeline. Ellipses are processes; rectangles are artifacts. First, class features are extracted from the source code. Then, based on the features, pattern instance candidates are sampled from the system. Optionally, the features are normalized and ultimately classified by the inference method.

Figure Document shows a typical multistage DPD pipeline similar to many DP detectors [Alhusain2013, Chihada2015, Hussain2017, Zanoni2015, Lucia2009, Tsantalis2006]. Rectangles are artifacts while ellipses are processes. Source Code and Probability are the input and output of the entire DPD pipeline while Micro-structures, Candidate Role Mappings and Feature Maps are intermediate artifacts. Feature Extraction extracts distinctive features from the source code. Candidate Sampling reduces the total number of mappings that need to be evaluated. Feature Normalization transforms the features into a homogeneous space such that the Design Pattern Inference can reason efficiently about it.

Feature Extraction

Feature extraction deals with the inherent complexity of programming languages by extracting high-level concepts that later stages can use to successfully find pattern instances (see Figure Document). DPs describe classes and their loose arrangements and communication paths. Consequently, extracted features ideally capture these arrangements and their relationships to improve reasoning in later stages. We use Micro-Structures (MSs) [ArcelliFontana2011]

as features, which are very small DPs (usually one or two roles). A MS describes a general structural or behavioral property within or between a set of classes (e.g., self-reference, inheritance or method calls). Casting MSs into an easily readable form allows developers and algorithms to effectively comprehend these properties. This includes fast discovery of inheritance and call dependencies but also complex aspects like the implementation of a Composite or Decorator pattern. The most outstanding difference between MSs and DPs is that the former can be detected using logic or pattern matching, i.e., their size prohibits variance in their actual implementation. MS detectors are sub-graph filters retaining only sub-graphs that match their specified predicate. For example,

, describes the logic to filter for the Inheritance MS. Given two Type nodes, has to be an ancestor of in order to fulfill the predicate. The result of such an MS detector are sub-graphs from the program’s ASG that are annotated with the MS roles illustrated in Figure Document for the Inheritance and Aggregation MS. Graphic Superclass and Picture Subclass for one possible Inheritance MS instance. A complete catalog of Micro-Structures and their detectors is given in our previous work [Thaller2016], which is a refined and extended catalog based on Maggioni’s work [Maggioni2009]. The catalog is made up of three sub-catalogs: Elemental Design Patterns [Smith2011], Design Pattern Clues [ArcelliFontana2011b] and Micro-Patterns [Gil2005]. Each sub-catalog was defined independently with different motivations and goals, but all of them prove valuable in the process of DPD as Arcelli Fontana et al. [ArcelliFontana2011a] concluded in a series of experiments.

Candidate Sampling

Candidate Sampling uses the extracted features (and sometimes the Source Code) to find potential candidates of design pattern instances in a program’s ASG. As discussed, role mappings link concrete classes to specific pattern roles which span a search space of potential mappings where is the number of available classes and

the number of roles that need to be mapped for a pattern. Hence, finding potential candidates is vital since a full-search is impossible for non-trivial systems. We used Heuristic Search 

[Thaller2016] that we proposed in our previous work as candidate sampler. Heuristic Search checks for graph isomorphism [Wilson1986] between a very general description of the pattern and classes in the program under inspection. It uses primary roles as an entry point to iteratively search for secondary roles that fit the pattern description. Heuristic Search reduces the usual isomorphism search [Babai1983] to a search that is linear with the number of types. For Composite, heuristic search executes the following steps: Collect all types that are super-types as Component ( Component); Collect all types that are sub-types from and aggregate as Composite ( Composite); Collect all types that are sub-types from and do not aggregate as Leaf ( Leaf). Each found role mapping is a candidate that is later checked in the design pattern inference stage for their correct consistency with the pattern. A drawback to this heuristic search is that it might miss potential pattern instances. The advantage is that these are simple and efficient algorithms to find potential candidates.

Feature Normalization

Feature-Role Normalization (FRN) is the main contribution of this work. It is used to normalize micro-structures into a fixed sized matrix called Feature Map (FM). FRN allows the usage of a wide variety of existing statistical methods while still retaining the interpretability for manual analysis via software engineers. Figure Document shows the pipeline in which first the features (micro-structures) are extracted followed by the sampling of candidates from the system. A candidate mapping is a set of classes that map to the design pattern roles which may, may not or may partially map to several micro-structures. Each of the potentially mapped micro-structures may have a different number of roles, resulting in an inhomogeneous feature space. This inhomogeneous feature space is problematic in the inference step as most machine learning methods cannot handle irregularly structured data. Feature-Role Normalization presses this inhomogeneous feature space into a homogeneous two-dimensional feature space that is comprehensible and maintains enough information to be useful for engineers and learning algorithms. More specific, the goal of the normalization, is to take features each having roles and map them to design pattern roles such that the result is a fixed-sized matrix. FRN provides one approach to this problem by creating an matrix where rows represent micro-structures and columns design pattern roles. Values in the feature map, in the context of DPD, are globally unique role identifiers (), where no two roles of any pattern have the same values associated. That said, roles might be shared across multiple features if they share similar semantic information. Zeros represent missing features meaning that the class mapped to the role does not participate in the micro-structure.


Figure : A (condensed) feature map is a stack of micro-structures for a set of classes (represented by their Decorator roles in this case). Values represent unique role identifiers with or any other property related to micro-structures (e.g., number of occurrences of the micro-structure). The coloring in this example is a gradient based on the role identifiers. Gray spots represent absent values (zeros).

Figure Document

illustrates an example of a feature map describing classes that make up a Composite pattern. Values are role identifiers colored by a simple gradient, and empty gray spots are absent mappings (zeros). Rows depict an illustrative subset of the micro-structures (features) and columns the design pattern roles to which a set of classes are mapped. The class assigned to Component is a Superclass with its children classes assigned to Composite and Leaf. Composite aggregates a Component, delegating and redirecting method calls to it. This short example mirrors how software engineers would, in a slow fashion, go through the source code of a project to find possible instances of a Composite pattern. Feature maps simplify this process by compacting the relationships between multiple classes into new high-level concepts (e.g., a design pattern) and offer a straightforward way of visualizing them. Naturally, the visualization can be extended by recoloring, grouping or filtering the entries of a feature map, or by visualizing another source code property than identifiers (e.g., number of invocations between roles or complexity measures). This makes FMs a strong visualization technique for software analysis. They provide a compact view of specific aspects of a small set of classes allowing a quick overview of their relationships and class characteristics. Furthermore, the matrix structure has clear advantages regarding automation as many methods, e.g., machine learning algorithms, are designed to work with flat data. Detection processes that operate on structured data may circumvent the use of FMs for DPD by directly operating on the ASG. Examples are Support Vector Machines 

[Burges1998] (with tree-kernels), Recursive Neural Networks [Gori2003]

, sequence-to-sequence learning models like Long Short-Term Memory Networks 

[Hochreiter1997] or Graph Convolutional Neural Networks [Kipf2016]. These learning algorithms and their respective models might provide possible performance advantages in the context of DPD by naturally handling the ASG and the MSs. However, some lack computational efficiency while others lack interpretability of the classification decisions.

Information Preservation Issues in Feature Maps

FRN normalizes the inhomogeneous feature space caused by micro-structures and their roles into a fixed-sized feature map. The resulting feature map is a sparse representation of attributes and relationships within and between nodes in an ASG. While inspired by adjacency matrices, feature maps are not a proper and full representation of the subgraphs; hence they suffer from two specific issues: noise and collisions.


Figure : The left side illustrates an example of noise (R3) and role collisions (R5/6) within feature maps. Noise is introduced by features that relate to classes that do not participate in the actual mapping. Role collisions occur if a feature is present multiple times with the same set of mapped classes. The right side provides a solution to both issues where mappings to classes that do not participate in the actual mapping are ignored (forced to 0), and role collisions are solved by introducing virtual roles (Rx) that represent multiple roles at the same time.

A feature introduces noise if it has at least two roles and only one of its roles map to a class within the FM. Figure Document illustrates this problem with the Composite pattern and the features Pure Type, Inheritance and Delegation. On the left side, the class assigned to Leaf is a Superclass. However, neither of the other Composite roles participate in this inheritance hierarchy (both are 0). The Superclass role of the Leaf class exists in the project but does not carry any meaningful information in the context of the current FM mapping. These out of context values inflate the feature map with irrelevant information and introduce noise reducing the usability of FMs. A simple solution to the problem is to force entries to zero in cases where features depict a relationship, but none of the other mapped classes participate in it. Figure Document depicts this solution on the right side where R3 becomes . A feature suffers from a role collision if it has at least two roles and multiple instances of the feature trace to the same classes. Figure Document (left) illustrate a collision where two instances of the Delegation MS are present in the classes assigned to Component, Composite, and Leaf. Composite delegates to Component and Leaf delegates to Composite. The multi-assignment for Composite is the result of the class-level abstraction and compactness of FMs. One possible solution is to introduce virtual roles that describe two or more roles like Rx in Figure Document on the right. Virtual roles incur a certain amount of information loss (direction of relationship) especially if multiple virtual roles have to be used for a feature in a FM. However, virtual roles also allow to express self-loops, for instance if a class aggregates itself (it is Aggregator and Aggregate at the same time). These self-loops are helpful in many cases, for example, of the Singleton pattern instances contain Creation Site / Type To Create indicating that the class is the factory for itself. This information is essential in the context of Singleton detection.

Design Pattern Inference

Design Pattern Inference is the last step of the DPD pipeline (Figure Document) and receives feature maps as input. During training time the models and their parameters are optimized toward a labeled dataset of feature maps. During production time the models only return the probability that an FM belongs to the model’s optimized pattern or not (no labels, no model parameter modifications). The detailed intricacies of the training and production time are subject to the selected model class and may differ quite drastically. For instance, a CNN is optimized via a gradient-based algorithm while decision trees are constructed by splitting the input dimension values via an information theoretical criterion (training time).

Design Pattern Instance Detection Study

This study frames the process of DPD (see Figure Document) as a binary classification problem. Each design pattern is detected by a separate model that provides the probability that the input FM belongs to the design pattern. Naturally, multi-label classification, i.e., multiple patterns per observation (FM), is supported by feeding the same FM into different models. Given this setup, we conducted several experiments with FMs in the context of design pattern instance detection. For better reproducibility and comparability we evaluated only the last two stages of the DPD pipeline highlighted in Figure Document. Experience shows that reimplementing the entire pipeline of a DPD approach including the intermediates (e.g., sampler results) while making the detection results comparable is nearly impossible. That said, the entire pipeline from Figure Document was implemented, and details about the specific implementation of micro-structures and candidate samplers are given in our previous work [Thaller2016]. The following evaluation and experimental methodology should be seen as a minimal effort to guarantee comparable research in the future.

Controlled Variables

Detecting design pattern instances can be a daunting task as there are numerous ways to encode software (e.g., feature maps), select potential candidates (candidate sampling), and build classification models (e.g., RFs or CNNs). In total, the study controls for 7 Experiment Parameters (ExP).

  1. Pattern: Four widely used patterns are considered {Singleton, Template Method, Composite, Decorator}. This selection is based on previous studies [Tsantalis2006, Uchiyama2011, ArcelliFontana2011a, Zanoni2015] but also represents an even selection regarding their categorical distribution (creational, structural and behavioral patterns).

  2. Role Count: A confounding variable is the number of roles of the pattern. The expectation is that patterns with one role are easier to detect than patterns with 2 or 3 roles. However, more roles also imply more conditions through which better precision in the classification can be achieved. The selected patterns have a linear role count distribution with {1, 2, 3, 4} roles for {Singleton, Template Method, Composite, Decorator} excluding tertiary roles.

  3. Classification Model: {Random Forest, CNN} were selected as classification model classes. The rationale for selecting CNNs is their natural ability to handle volumes, i.e., to process FMs while leveraging their structural information. RFs were selected as they are a popular and efficient choice in many ML tasks and are more light-weight regarding hyper-parameters in contrast to CNNs.

  4. Negative-Positive Candidate Ratio: Non-trivial software systems expose a combinatorial number of possible class mappings that can function as pattern candidates. This means a practically infinite number of pattern counterexamples can be found within a system. The candidate samplers (Section Document) that mitigate this issue favor recall leading to imbalanced distributions of positive and negative examples. Negative-Positive Candidate Ratio (NPCR) captures this skew in the data. The study explores NPCRs of {1, 2, 4, 6, 8, 10}.

  5. Data Augmentation: Shuffling of rows was used as data augmentation method to increase the dataset size and classification difficulty. This inhibits the classifier from learning specific relationships between rows, instead of their content. A permutation count of {0, 1, 5, 10} states how often the dataset was copied, rows augmented and added to the dataset. Permuting rows and

    columns was not considered as it would result in a nearly random matrix destroying any structural information captured in feature maps.

  6. Optimization Budget: A budget of {200} evaluations was available for each model class (RF or CNN). These evaluations were used to tune their hyper-parameters (such as the number of units within layers, or depth of tree). The optimization selected hyper-parameter configurations that maximized the MCC.

  7. Instance Independence: Describes the confounding variable that instances within a unique mapping correlate thus result in too optimistic classification results if not properly handled. Using standard cross-validation mixes mappings from the same unique mapping into training and test-folds making the folds not independent. This leads to too optimistic evaluation results as the training examples leak information between training and test phase. We used {project-fold cross validation} to avoid information leaks (correlations between samples) during the trials.

Response Variables

The result of each experiment trial was the generalization performance of the classifier evaluated through cross-validation and measured by {Accuracy, Precision, Recall, F1, Matthews Correlation Coefficient

}. MCC was selected as the primary evaluation metric as it provides the most unbiased single number metric for binary classification.

Data Source

A total of widely used design patterns were selected for the study, each having peer-reviewed classification examples. These peer-reviewed classifications are given by nine projects that are part of the Pattern-like Micro-Architecture Repository (P-MARt111http://www.ptidej.net/tools/designpatterns/index_html#2 04/10/19) [Gueheneuc2007]. Applications within the repository range from modeling and drawing tools to static analysis and refactoring frameworks, therefore its diversity is a good representation for the real world. Table Document contains the distribution of design pattern instances across projects. Original are the mapping counts in the exploded form provided by the dataset. Revised are the mappings after a manual cleanup of the data. Some design pattern instances trace to classes not contained in the available source code repositories (e.g., java.awt, javax.swing, or third-party libraries) and were removed. Projects that did not include the target pattern were also not considered during the experiments (for the respective pattern) as controlling for imbalance (NPCR) is impossible, e.g., JRefactory for Template Method. At last, the Netbeans project was excluded as we were not able to parse it. Fixing the source code pieces in question was deemed futile as they were too numerous introducing potential bias in the evaluation. Unique are the unique mappings after the revision.

! [.1em] 2*Project Original / Revised / Unique Singleton Template Method Composite Decorator [.075em]1-5 JHotDraw 2 / 2 / 2 68 / 34 / 2 3840 / 960 / 1 176 / 44 / 1 JRefactory 2 / 2 / 2 0 / 0 / 0 0 / 0 / 0 0 / 0 / 0 JUnit 2 / 2 / 2 0 / 0 / 0 222 / 74 / 1 396 / 132 / 1 Lexi 2 / 2 / 2 0 / 0 / 0 0 / 0 / 0 0 / 0 / 0 MapperXML 3 / 3 / 3 88 / 44 /4 340 / 85 / 1 0 / 0 / 0 Netbeans - / - / - - / - / - - / - / - - / - / - Nutch 1 / 1 / 1 14 / 7 / 2 0 / 0 / 0 0 / 0 / 0 PMD 0 / 0 / 0 1 / 0 / 0 3 / 0 / 0 0 / 0 / 0 QuickUML 1 / 1 / 1 0 / 0 / 0 120 / 30 / 2 0 / 0 / 0 [.075em]1-5 Total 13 / 13 / 13 171 / 85 / 8 4525 / 1149 / 5 572 / 176 / 2 [.1em]

Table : The 9 projects contained in the P-MARt dataset. For every project the original and the revised (excluding instances outside of the available source code) number of design pattern instances (i.e. mappings) as well as the number of unique mappings is given.


First, micro-structures were extracted (Section Document) and possible candidates sampled (Section Document) as shown in Figure Document. The micro-structure extraction and candidate sampling were based on our previous work [Thaller2017]. Feature-Role Normalization (Section Document) created the feature maps for each candidate and pattern. Cells within the FM were globally unique role identifiers () which received no further preprocessing such as rescaling or standardization. Further, no specific assignment strategy was employed for identifiers, i.e., role identifiers were assigned in their encountered order resulting in a total value range of including the virtual roles. The controlled variables from Section Document

were executed in a systematic fashion resulting in the response variables from Section 

Document. All features (rows) were included. First, the FMs were collected into datasets, and negative-subsampling was applied according to the NPCR ratio (ExP Document). The dataset was then copied, permutated, and added times (ExP Document). FMs were provided to CNNs as-is and linearized (row-wise) into a vector for RF. Then the models were fitted (ExP Document) and evaluated using project-fold cross-validation (ExP Document). This procedure was repeated times in search of the optimal hyper-parameters (ExP Document) and for each pattern (ExP Document).

Experiment Results

.7! [.1em] Model Pattern Accuracy Precision Recall F1 MCC [.075em]1-7 4*CNN Singleton Template Method Composite Decorator [.075em]1-7 4*RF Singleton Template Method Composite Decorator [.1em]

Table : Average test-performance of the Convolutional Neural Network (CNN) and Random Forest (RF) based detections.

The experiment parameters led to a total of trials in which classifiers were fitted. The bulk of these classifiers were part of the optimization budget (ExP Document). In total, the best performing top-30 classifiers (including ties) for each Pattern, Model, and NPCR combination was selected for the evaluation. For instance, the boxplot in Figure Document for the Singleton pattern at an NPCR of 1 represents the top-30 classifiers of this category. Furthermore, Table Document contains the average test performance (marginalized over NPCR) for each pattern and model. Using the Evans guidelines [Evans1996] for interpreting correlations, we see that CNN models have on average a strong performance with a Median (Med) and an Interquartile Range (IQR) of to . The worst median performance is given by the Template Method classifiers reaching a moderate ). Best median performance is given by the Composite pattern with strong . The total variance in these estimates is acceptably low and would cause in the worst case a performance drop of degrading it to a moderate classifier. CNNs are rather robust against data imbalance with an average NPCR 1 to 10 change of . RF models are quite similar to CNN models for patterns with low role count but degrade quite drastically for patterns with four roles. RFs have on average a moderate performance with . At least a moderate () up to a strong performance () is given, excluding the Decorator models that were systematically worse () than a random model (). Again, the best performance as reached for Composite patterns with . On average, RF models exposed a variance of (Composite). Figure Document shows the effect of NPCRs on the classifier’s performance. Both, CNN and RF suffer from imbalance except in the case of the Singleton pattern. CNNs are on average (all patterns) more robust against imbalance with an average NPCR 1 to NPCR 10 difference of . However, they degrade for Composite and Decorator. The average difference for RFs is . Tests for independence [Kruskal1952] regarding MCC between CNN models with different permutation counts (ExP Document) were close to significant . Permutation passes for RF models were significant with . There was a significant effect concerning the NPCR for both model types with CNN () and RF ().


Figure : Each boxplot represents the top-30 fitted models for a given pattern and model, illustrating the trend across imbalanced datasets (NPCRs). Trends indicate that Feature Maps are a robust representation for DPD even in highly imbalanced settings.


The presented results in Section Document fit with the intuition that patterns with more roles are easier to detect, models have a decline in performance with larger NPCR, FMs fit well in the framework of CNNs. Patterns with more roles inherently have more conditions that have to be met as each role describes some specific behavior a participating class must fulfill. This avoids unintentional implementations of these patterns and their detection because of a more elaborate structure and communication flow. Template Method, for example, describes a very general and freely used concept of abstract classes that defer implementation details of an algorithm to subclasses. Not only is this implementation technique used without actually having the intention to implement the Template Method pattern, but also pre-labeling of the dataset in this regard might prove challenging because of its general applicability. In contrast, Composite or Decorator use inheritance in conjunction with redirection techniques forcing the classifier to focus not only on inheritance but also the redirection aspect between multiple classes. Both patterns achieved excellent results that are quite robust across different NPCRs using CNNs (low degradation and low variance). However, RF, while similar to CNN for Singleton and Template Method, strongly degrades with patterns that have more roles (both variance and bias). This is to be expected as FMs encode relationship information in a 2-dimensional format (matrix) which needs to be serialized (loss of structural information) into a vector. Furthermore, serialization with patterns that have, e.g., four roles (Decorator), result in a dimensional input vector. We did not explore the invertibility of the negative correlation (systematic mistakes) that RFs made for the Decorator. However, a simple inversion of the predictions might not generalize as expected and is open for further research. Nevertheless, RFs are still a good and fast solution for patterns with 1-3 three roles as the performance in Figure Document shows. The degrading performance of models with higher imbalance is often an issue. The minority class (class with lesser samples) is outweighed and too seldom encountered during training of the classifier in order to learn the essential aspects of it. Figure Document shows that the performance degrades with higher NPCR. Still, the underlying trends are rather shallow except for Template Method. The increase in performance for the Singleton pattern is most likely related to the amount of available data. With only examples it is expected that the classifiers use the extra negative samples to make the overall predictions more robust. In total, the degradation of performance over NPCRs is less extreme than initially anticipated. Though, it would need an additional study to attribute this effect properly. FMs work well in the framework of CNNs because of their natural processing of volumes. CNNs are able to partially replicate the human DPD process via feature maps described in Section Document. For example, a filter can combine Aggregation, Inheritance, and Delegation across Component, Composite and Leaf in the first convolutional step from Figure Document

. The result would be a higher level internal feature (of unknown quantity and semantics) representing this combination. Each layer then computes an even more abstract interpretation of the original feature leading to a good prediction. The same does not necessarily hold for RFs as the structural information (relationships between DP roles) would need to be recovered from the serialized vectors while still focusing on the content. We think that this natural advantage of CNNs resulted in the overall better performance especially for patterns with more roles. Permutation passes increased the performance for RF classifiers and acted as additional regularization for CNNs without any performance loss (nor significant gain). An advantage of NPCR (or generally imbalance) analysis is that it allows for direct comparison with other results via linear interpolation between NPCRs (as long as the datasets are the same). Zanoni et al. 

[Zanoni2015] conducted a similar study (same dataset) where they used multitudes of different classification models to find DPs. They report Accuracy for their best performing models without explicitly reporting the values. We sketch a comparison by interpolating our results to match their (computed form the ZeroR model) and compare our average model accuracies with their best model accuracies. Their best result for the Singleton pattern () was given by an RF model with . Our RF is worse with , and the CNN is worse with . Their best Composite () model ( [Smola2004]) reaches where both of our models are better (CNN and RF ). Their best Decorator () model given by an RF with which is quite similar to our average CNNs with . No data for Template Method is available. These results show how well FMs are suited for DPD given that we compared the average performance and not the best performing models. Besides, the authors used ordinary 10-fold cross-validation in which instances from the same unique mapping may leak information between evaluation and training folds.

Threats to Validity

An internal threat to validity is given by the dataset (P-MARt 04/10/19 [Gueheneuc2007]) we used to train and evaluate our approach. We inspected many projects and their respective labeled design patterns in order to understand the misclassification during the experiments. Many of these projects are already outdated using old Java versions that forced the developers to take strange roundabouts in their implementations. Consequently, it is possible that some design patterns would be implemented differently today. For instance, the Singleton pattern is in modern systems most likely implemented via dependency injection frameworks. Furthermore, we could not understand some of the labeled pattern instances as they seem to diverge too far from the original design pattern definition. The size of the dataset poses a threat, especially for the Singleton pattern as the model may overfit. To mitigate this thread we used repeated cross-validation to estimate the generalization performance on unseen data, added data augmentation (increases amount of observations), and included additional model specific regularization methods. Random forests naturally regularize with an increased number of trees. We used Dropout [Srivastava2014], kernel and activation regularization (L1, L2 [Murphy2012, Goodfellow2016b]), and early stopping [Prechelt1998a] to regularize the CNNs. Our experiment design tried to eliminate most external threats by using multiple patterns with different numbers of roles, with multiple bootstrapped datasets and different NPCRs. However, it only evaluates the performance of one pattern with number of roles, and generalizing it to all patterns with roles may only be possible on a limited scale. Nevertheless, the broad sample of models still provides valuable insight into the methodologies performance in many different settings.

Related Work

The DPD research community has a long and active history with various approaches, tools, and methodologies. Several interests can be distilled from DPD and that are reflected in Figure Document. A coarse classification of these interests is feature extraction, intermediate representations, candidate sampling, and inference methods. This work focused on intermediate representations and inference methods. Zanoni [Zanoni2015], with whom we compared our results, uses micro-structures, clustering, and various machine learning approaches to find design patterns. Along with Arcelli, they proposed a framework that uses these techniques called MARPLE [ArcelliFontana2011]. Most influential work for feature maps and our approach is given by Tsantalis et al. [Tsantalis2006], in which they extract adjacency matrices from the ASG reflecting a specific aspect. An example of an aspect would be Generalization or Association, and in such a way they may be seen as micro-structures. The approach itself compared the adjacency matrices from patterns with the matrices extracted from the system via an iterative similarity scoring algorithm proposed by Blondel et al. [Blondel2004]. Their approach provides nearly always a recall and precision of 1 for the patterns detected in this work. However, their evaluation is only based on three manually inspected projects (JHotDraw, JRefactory, and JUnit). The authors accounted for subjective bias through manual inspection but did not mitigate its possible impact. Finally, they report that their methodology suffers from computational inefficiency caused by the (adjacency) matrices and the matching algorithm. One big issue in the DPD community is that results are hard to reproduce and compare because of the multistage nature of detectors. Fontana et al. [ArcelliFontana2012] give an attempt to make results comparable. Nevertheless, the approach is a community-driven web-based benchmarking system that suffers, like many other community-based approaches, from the cold start phenomenon, i.e., it provides not enough upfront benefit to justify its usage in the first place. To improve this situation, we employed an evaluation strategy that decouples the last stage of the DPD pipeline simplifying the reproducibility and comparability of the results to the availability of the dataset that candidate samplers produce. Furthermore, we provided a comprehensive approach to evaluate detectors in a more general fashion than peak performance models.


This work presented Feature Maps (FMs), how they are computed via Feature-Role Normalization (FRN) and used for design pattern detection. Feature maps themselves can be understood as flexible and comprehensible source code representation useful beyond DPD. For DPD, FMs provide a representation that allows for robust detection performance even if the datasets are strongly imbalanced. In conclusion, FMs do not only help to extract the information developers weave into their systems, but also provide means to represent and comprehend them in a compact fashion. In the future, we are planning to extend the study on FMs in the context of DPD by applying the methodology to more design patterns and a bigger dataset. Futhermore, we are planning to compare the methodology to learning algorithms that handle graph representations natively.