Recover and RELAX: Concern-Oriented Software Architecture Recovery for Systems Development and Maintenance

The stakeholders of a system are legitimately interested in whether and how its architecture reflects their respective concerns at each point of its development and maintenance processes. Having such knowledge available at all times would enable them to continually adjust their system's structure at each juncture and reduce the buildup of technical debt that can be hard to reduce once it has persisted over many iterations. Unfortunately, software systems often lack reliable and current documentation about their architecture. In order to remedy this situation, researchers have conceived a number of architectural recovery methods, some of them concern-oriented. However, the design choices forming the bases of most existing recovery methods mean that none of them has a complete set of desirable qualities for the purpose stated above. Tailoring a recovery to a system is either not possible or possible only through iterative experiments with numeric parameters. Furthermore, limitations in their scalability make it prohibitive to apply the existing techniques to large systems. Finally, since several current recovery methods employ non-deterministic sampling, their inconsistent results do not lend themselves well to tracking a system's course over several versions, as needed by its stakeholders. RELAX (RELiable Architecture EXtraction), a new concern-based recovery method that uses text classification, addresses these issues efficiently by (1) assembling the overall recovery result from smaller, independent parts, (2) basing it on an algorithm with linear time complexity and (3) being tailorable to the recovery of a single system or a sequence thereof through the selection of meaningfully named, semantic topics. An intuitive, informative architectural visualization rounds out RELAX's contributions. RELAX is illustrated on a number of existing open-source systems and compared to other recovery methods.






I Introduction

A software system can only be maintained to the extent that it is known. Knowing a system includes being aware of its architecture. This awareness avoids technical debt and ensures the system’s continued integrity.

In practice, the knowledge of a system’s architecture may have never existed or been degraded over time through phenomena such as missing or poor documentation, personnel changes as well as architectural drift or erosion [1]. (The latter two are caused by careless or unintentional addition, removal, and modification of architectural design decisions [2].) In many of these cases, the only way to obtain any architectural information is to recover it from implementation-level artifacts. For this, a wide variety of software architecture recovery methods exists. These recover different views of a system’s architecture under different paradigms. For this, they apply different algorithms on the system’s implementation artifacts (e.g., the source code, bytecode, executable files, directory structure, configuration files).

Recently, a range of studies have begun looking at the nature, rate, and impact of changes in a system’s architecture and the resulting architectural decay in existing systems [3], [4], [5]. This places special emphasis on the recovery methods used in that they must be:

  • Accurate - the architectural view they provide must be a proper reflection of the architecture.

  • Appropriately sensitive and deterministic - the difference in the obtained architectural views must be commensurate with the amount and type of system change. If, according to measures of architectural similarity, any change in the source code of a system results in the recovered architecture of every version of a system being entirely different from any other version, changes cannot be meaningfully compared. This diminishes the usefulness of such recoveries for evolutionary studies on the impact of changes in a system’s source code on its architecture.

  • Efficient - code-bases for individual systems and different system versions must be analyzable reasonably quickly. This becomes crucial for evolutionary studies that track architectural changes over a range of versions.

  • Scalable - recovery techniques must be able to handle very large systems that are common today.

While many existing recovery methods may give an accurate view of the system under their respective paradigms, they lack one or more of the above listed desirable attributes, which limits their use and their utility in many situations. To address these issues, we have developed RELAX. We claim the architectural view (described below) produced by RELAX is useful and correctly reflects the underlying architecture. RELAX is appropriately sensitive to changes in that minor changes to source code do not cause major changes in the recovered view. We also claim that RELAX is both efficient and scalable, enabling it to recover architectures of large systems. This is enabled by RELAX’s additivity, which allows the composition and reuse of partial results, distribution of the recovery process and reduction of the workload on new versions of a system to just the parts that have changed. Additionally, RELAX is tailorable by allowing different stakeholders to maximize the utility of the recovery by considering their perspective.

Just like every system has an architecture by definition (even if none of its stakeholders are aware of it), each recovery method needs to follow a paradigm that is determined by the purpose it intends to serve. We have aimed at choosing a flexible paradigm that serves as many different groups of stakeholders of a system as possible, not only one. Another goal was to make using the method and interpreting its result as straightforward as possible. It is our hope that this will lead to a democratization of architecture recovery.

These considerations have led us to choose a concern-oriented architectural view for RELAX. In this context, a “concern” can be defined as a role, responsibility, concept, or purpose of a software system. Data persistence, Networking, and GUI are examples of generic concerns that a system may commonly address. On the other hand, there are domain-specific or application-specific concerns, e.g., Interrupt Handling as part of an OS Kernel. (It is important to note that in the context of software architecture recovery, the noun concern is used in its more general meaning of something that is regarded as important and not limited to something that causes worry [6].)

Approaching a system from this point of view is useful for many different types of stakeholders: Maintainers and particularly programmers will be interested in learning what a system does and how and where it does it. A concern-centric view can also be useful for stakeholders other than programmers. For instance, the architect can assess how well concerns are separated. Project managers can determine task allocation among programmers with varying degrees of familiarity with the system. Customers for whom the system is being built can check whether their concerns are reflected in it. Even interested end users may use RELAX to find out whether a system’s source code implements a functionality that may not be mentioned in its documentation. The latter two types of stakeholders do not even need to be experts in software development to derive utility from RELAX. The usefulness of a perspective based on a system’s implemented concerns in comprehending an architecture has been shown [7].

Given an input of a system’s source code and a set of concerns, RELAX classifies and clusters a system’s code entities into word classes that relate to user-specified concerns. Its output is a view that represents the system’s architectural structure and location of concerns textually and visually. Both elements of the view provide actionable information to its maintainers. Additionally, the visualization allows the viewer to gather important facts about the overall architecture of the system at a glance while also allowing them to dig deeper.

RELAX is evaluated on a set of open source systems.

The research contributions of this paper are RELAX (RELiable Architecture EXtraction), a concern based architecture recovery method that is scalable, accurate and appropriately sensitive. RELAX provides an integrated visualization of the results that can be easily interpreted and directly applied to the maintenance of the system.

The remainder of this paper is organized as follows: Section II explains the foundation of RELAX. Section III describes RELAX’s approach. Section IV presents our evaluation results. Section V compares our approach to that of other recovery methods. Section VII with our conclusion and Section VIII on future work round out the paper.

II Foundation

II-A Software Architecture and Architecture Recovery

Many different definitions of “Software Architecture” exist [8]. Additionally, many different recovery methods exist that espouse different views of a software architecture [9]. This creates the potential of a mismatch if both are not selected in light of each other.

Architecture recovery is the process of retrieving a system’s architecture from its implementation-level artifacts [1]. This means that only what is actually present in the system’s implementation can be used for recovery and, consequently, that the definition of “Software Architecture” which forms the basis of a given recovery method needs to reflect this for consistency.

This makes a definition such as “the set of principal design decisions about a system” [1] unsuitable for the purposes of architecture recovery, since due to erosion and drift, there is no guarantee that a single one of these decisions is realized in a current version of the system as built. In extreme cases, they may never have been present at all, rendering any attempt at recovering an architecture under this definition moot.

A definition that fits architecture recovery in general well is: “Fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution” [10]. This definition covers the source code, which most recovery methods, including ours, use as their basic resource, as an element of the system. We think that another definition fits the concern-orientation of RELAX even better [11]. According to it, a software architecture comprises

  • A collection of software and system components, connections, and constraints.

  • A collection of system stakeholders’ need statements.

  • A rationale which demonstrates that the components, connections, and constraints define a system that, if implemented, would satisfy the collection of system stakeholders’ need statements.

When considering that the output of any recovery method is a view of a system’s architecture, it needs to be kept in mind that there is no single “correct” view that contains the whole truth. Instead, the same architecture can be described through different views [12]. While it is possible to recover the architecture of very small systems manually, for large systems with millions of SLOC only computer-aided recovery is feasible.

II-B Text Classification

Fig. 1: Text Classification Workflow

Text classification is the natural language processing method RELAX employs to locate concerns in a software system. This method automatically assigns provided documents to specified categories [13]. To achieve this, a set of labeled examples from two or more categories of interest is supplied and then used to train a classifier algorithm, which can then determine whether any given document belongs in one of the categories the classifier was trained on.

Figure 1 shows the training and prediction phases of text classification and how they interact [14].
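The two phases can be illustrated with a minimal sketch in Java. It assumes a multinomial Naïve Bayes model with add-one smoothing and a simple tokenizer that splits on non-letter characters. The class name TinyTextClassifier, the training snippets and the category names are hypothetical and are not taken from RELAX or from the toolkit it uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical multinomial Naive Bayes text classifier with add-one smoothing. */
final class TinyTextClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();              // category -> number of words seen
    private final Map<String, Integer> docCounts = new TreeMap<>();               // category -> number of documents (sorted for determinism)
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Splits text into lowercase alphabetic tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Training phase: record the word frequencies of one labeled example. */
    void train(String category, String text) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
    }

    /** Prediction phase: return the category with the highest log-probability. */
    String predict(String text) {
        List<String> tokens = tokenize(text);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = Math.log(docCounts.get(category) / (double) totalDocs); // prior
            Map<String, Integer> counts = wordCounts.get(category);
            int total = totalWords.getOrDefault(category, 0);
            for (String word : tokens) {
                int count = counts.getOrDefault(word, 0);
                score += Math.log((count + 1.0) / (total + vocabulary.size())); // smoothed likelihood
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyTextClassifier classifier = new TinyTextClassifier();
        classifier.train("database", "select table query sql resultset row column");
        classifier.train("networking", "socket port host connect packet ip address");
        System.out.println(classifier.predict("String query = select from table"));
    }
}
```

Because prediction only multiplies stored frequencies, the same trained model always yields the same label for the same document.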

III Approach

As its input, RELAX takes a software system’s source code alongside a trained classifier. It then leverages text classification techniques to incrementally tag individual source entities with attributes and group them into concern-related clusters. The combination of choices we made in the way in which RELAX uses text classification to build its architectural view results in a set of features that address several weaknesses present in other recovery methods.

  1. The basic building block of RELAX’s architectural view is the result of the independent text classification of each source code entity.

  2. The additive nature of forming clusters from individual source code entities whose attributes are independent of each other directly facilitates RELAX’s scalability.

  3. The Naïve Bayes classifier-based algorithm further aids the scalability and accuracy of RELAX.

  4. Explicit prevention of crosstalk between changes in individual code entities limits their impact on the resulting architecture.

  5. The ability of users to select the concerns on which RELAX bases its recovery enables tailoring RELAX to specific needs through easily understandable choices.

  6. An intuitive and informative visualization allows stakeholders to quickly get an overview of the prevailing system-level concerns, and also to dig deeper to the level of individual source code entities.

This section describes the key principles underlying RELAX and its visualization, as well as the details of its implementation.

III-A Main Recovery Process

III-A1 Selecting Concerns

The stakeholders and their concerns stand at the beginning of the process. Those concerns can have any level of granularity, ranging from top level concerns (e.g. Database, Graphics or Networking) to lower application-specific levels (e.g. HDFS Upgrade Management or InterDataNode Protocol for Apache Hadoop [15]). In addition, non-functional concerns (e.g. Security, Backup, Interoperability) can also be used. RELAX does not impose a hard limit on the number of input concerns to use in the training phase. The “right” number of concerns to look for in a system is not determined by any attributes of that system, such as its size or complexity. Instead, based on their knowledge about the system to be recovered and their use case, users can decide on any set of named concerns that form the basis of the system’s recovery.

For example, a project manager might be interested in a suitable task distribution of maintenance activities among programmers with specific skills. The project manager can then choose to conduct a coarse-grained recovery with a selection of topics that mirrors the fields of specialization of the programmers, such as Database, Graphics and Networking. In another situation, a researcher may be interested in how certain concerns are shared among related systems. For example, they could be interested in whether a project like Apache Chukwa [16], which is built on the Hadoop Distributed File System (HDFS) [17], addresses HDFS-related concerns such as HDFS Upgrade Management or InterDataNode Protocol. The choice of concerns is the only activity required of the user that is similar to setting parameters in other recovery methods. However, RELAX aims to make this an intuitive choice because the concerns are either named for well-known topics of general interest or, optionally, named by the users themselves.

III-A2 Collecting Training Data and Training a Classifier

The kind and amount of work necessary in this step depends on the concerns selected by the user. If the concerns are already covered by an existing classifier that is provided by RELAX or that the user otherwise has access to (such as through having trained it in an earlier recovery), no additional work is necessary here.

If this is not the case, the user will be interested in training their own classifier on their chosen concerns. The required training data can either come from the curated labeled training data already provided by RELAX, or it can be provided by the user. In the latter case, the user needs to find sources of training data related to the desired concerns and label each category of training data with a concern name of their choice. The training data can be any mixture of source code, API documents, articles on the subject or simply a list of related words. It is important to note that the user is not required to fully understand the training data. Figure 2 shows an example of a directory structure with training data files.

Fig. 2: Training Data Example

A classifier is then trained from the provided labeled training data. For this, the distinguishing features of the sets of documents labeled with different concerns are determined by the classifier training algorithm.

From this, a classifier model is generated that can later be used to label different sets of data that were not part of the training process, such as the code entities of systems whose architecture is to be recovered.

The training process generates a number of classifier candidates whose accuracy is then checked on a portion of the training data that has been set aside for this purpose. The classifier with the best accuracy is then chosen as the classifier to be used for architecture recoveries. Figure 3 shows the accuracy information obtained for two classifier candidates called “Trial 30” and “Trial 31”. It shows the overall accuracy of each classifier candidate as well as a confusion matrix. A confusion matrix is a table whose rows show the labels of the test data and whose columns show the labels determined by the classifier candidate. It is easy to determine whether a candidate’s results are fully correct (i.e., all documents are labeled by the classifier candidate with the labels of the test data) by looking at whether or not all numbers are lined up on the diagonal from the table’s origin to the lower right. As a summary, the training output also shows an overall accuracy value between 0 and 1 for each candidate, which is calculated by dividing the number of correctly identified test documents by the overall number of test documents. In the figure, “Trial 30” has misclassified two test documents that should have been labeled with “security” as “networking” and therefore has an accuracy close to 0.94, or 31/33. “Trial 31” has classified all documents correctly and consequently has an accuracy of 1.0. It should therefore be chosen.
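The accuracy computation described above can be written down in a few lines of Java. The sketch below is illustrative only: the class name ConfusionMatrixAccuracy and the per-concern counts are hypothetical. Only the total of 33 test documents and the two “security” documents mislabeled as “networking” are taken from the description of “Trial 30”.

```java
/** Computes overall accuracy from a confusion matrix (rows: true labels, columns: predicted labels). */
final class ConfusionMatrixAccuracy {

    /** Divides the sum of the diagonal (correct predictions) by the sum of all cells. */
    static double accuracy(int[][] matrix) {
        long correct = 0;
        long total = 0;
        for (int row = 0; row < matrix.length; row++) {
            for (int col = 0; col < matrix[row].length; col++) {
                total += matrix[row][col];
                if (row == col) {
                    correct += matrix[row][col];
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("confusion matrix contains no test documents");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical split of the 33 test documents over three concerns.
        int[][] trial30 = {
            {11, 0, 0},   // database
            {0, 11, 0},   // networking
            {0, 2, 9}     // security: two documents mislabeled as networking
        };
        int[][] trial31 = {
            {11, 0, 0},
            {0, 11, 0},
            {0, 0, 11}
        };
        System.out.println("Trial 30: " + accuracy(trial30));
        System.out.println("Trial 31: " + accuracy(trial31));
    }
}
```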

Once a classifier is trained on a set of topics, it is reusable.

Fig. 3: Classifier Candidate Selection

III-A3 Classification

package org.apache.hadoop.chukwa.database;
import org.apache.hadoop.chukwa.util.DatabaseWriter;
import java.sql.SQLException;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
public class Consolidator extends Thread {
  private DatabaseConfig dbc = new DatabaseConfig();
  private String table = null;
  public Consolidator(String table, String intervalString) {
        this.table = table;
  String query = "select * from " + table;
  log.debug("Query: " + query);
  rs = db.query(query);
  if ( {
ResultSetMetaData rmeta = rs.getMetaData();
    for (int i = 1; i <= rmeta.getColumnCount(); i++) {
      if (rmeta.getColumnName(i).equals("timestamp")) {
        start = rs.getTimestamp(i).getTime();
      end = start + (interval * 60000);
        } catch (SQLException ex2) {
      log.error("Unable to determine starting point in table: " + this.table);
      log.error("SQL Error:");
Listing 1: Example code with SQL-related text in red

In the classification step, the trained classifier extracts the features from each code entity and assigns a feature vector to it based on that entity’s affinities to each concern the classifier has been trained on. We have chosen Naïve Bayes as a classifier for RELAX based on several considerations: first, it assumes that features are independent, which appears to be a good fit for code files, where each feature may be encountered individually and can individually determine which topic a code entity belongs to. Second, its linear time complexity serves the scalability of RELAX. Third, classifiers trained with it have performed well in our accuracy evaluation (compare to Section IV-A). Last but not least, the prediction model of the Naïve Bayes algorithm is deterministic [18]. Determinism is an important feature for evolutionary software studies since without it, we cannot determine with certainty whether two different recovered architectural views which were produced by the same recovery method came from two different systems or system versions.

Listing 1 shows a database-related code snippet with the words that indicate its relation to SQL databases highlighted. For a Naïve Bayes classifier, the feature vector consists of values between 0 and 1 for each concern. For example, the feature vectors for three code entities called “SQL.Java”, “Screen.Java” and “ConnectIP.Java” could look like the rows of Table I. We can see that the affinity values over all concerns do not have to add up to 1.0 and that they can have values that are not 0 or 1. This is because a code entity may not be related to any selected concern or it may be strongly related to more than one concern.

Entities         Database   Graphics   Networking
SQL.Java         0.9        0.1        0.2
Screen.Java      0.05       0.95       0.1
ConnectIP.Java   0.02       0.01       0.92
TABLE I: Entities with Feature Vectors

III-A4 Clustering

Cluster / Feature   Database   Graphics   Networking
Database            1          0          0
Graphics            0          1          0
Networking          0          0          1
Unknown             0          0          0
TABLE II: Clusters with Feature Vectors

Before clustering begins, each user-selected concern-related cluster is assigned an orthogonal feature vector that mirrors that concern and allows code entities to be grouped into it. A default “Unknown” cluster without any concern affinities is always created for the code entities that are not related to any selected concern. The rows in Table II show the feature vectors for three clusters related to databases, graphics and networking, respectively, as well as the default cluster for entities that are not related to any selected concern. Based on the results of the classification, each code entity is then assigned to the concern-related cluster that its feature vector is most similar to. This similarity is determined using the cosine similarity between the feature vector of the code entity and that of the cluster. Cosine similarity measures the cosine of the angle between two vectors and is commonly used in Natural Language Processing in order to determine how close a body of text is to a given topic [19], [20], [21].
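The cluster assignment can be sketched in Java using the vectors of Tables I and II. The handling of the all-zero “Unknown” vector is not specified here, since cosine similarity is undefined for a vector of length zero. As an assumption, this sketch treats that similarity as 0, so that an entity falls into “Unknown” only when no concern cluster scores above zero. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: assign an entity to the cluster whose feature vector is most similar. */
final class CosineClustering {

    /** Cosine similarity; defined as 0 here when either vector has zero length (an assumption). */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the most similar concern cluster, or "Unknown" if none scores above zero. */
    static String assign(double[] entity, Map<String, double[]> clusters) {
        String best = "Unknown";
        double bestSimilarity = 0.0;
        for (Map.Entry<String, double[]> cluster : clusters.entrySet()) {
            double similarity = cosine(entity, cluster.getValue());
            if (similarity > bestSimilarity) {
                bestSimilarity = similarity;
                best = cluster.getKey();
            }
        }
        return best;
    }

    /** Concern clusters of Table II; feature order is Database, Graphics, Networking. */
    static Map<String, double[]> tableTwoClusters() {
        Map<String, double[]> clusters = new LinkedHashMap<>();
        clusters.put("Database", new double[] {1, 0, 0});
        clusters.put("Graphics", new double[] {0, 1, 0});
        clusters.put("Networking", new double[] {0, 0, 1});
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, double[]> clusters = tableTwoClusters();
        System.out.println("SQL.Java -> " + assign(new double[] {0.9, 0.1, 0.2}, clusters));
        System.out.println("Screen.Java -> " + assign(new double[] {0.05, 0.95, 0.1}, clusters));
        System.out.println("ConnectIP.Java -> " + assign(new double[] {0.02, 0.01, 0.92}, clusters));
    }
}
```

Since each entity is compared only against the fixed cluster vectors, its assignment does not depend on any other entity.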

III-A5 Additivity and Crosstalk Prevention

Recall that our goals for RELAX include scalability, efficiency, appropriate sensitivity and determinism. Our intuitive approach to this is to explore building up the overall recovery result from individual parts that could be individually and independently processed and reused or updated as needed. We then decided that these individual parts should be the source code entities of the system and analyzed which beneficial properties would emerge.

RELAX classifies each code entity as belonging to a set of user defined concerns. The classification task is performed on an individual source code entity and has no dependence on the classification of any other entity.

The classification of source code entities individually enables RELAX’s important property of additivity. This means that the recovery results of the whole system or its subsystems can be composed from smaller parts, eventually reaching down to the ground level of individual source code entities. Additivity in turn enables scalability and efficiency by allowing the following operations:

  • For the architecture of a system to be recovered, it can be split up into smaller units which can then be distributed to be classified and associated with a cluster. This way, the ceiling of the system size that can be evaluated is nearly unlimited.

  • For evolutionary studies on a system, only the entities that have changed will need to be evaluated. The information on the remaining entities gained in a previous recovery run can be reused.

  • Libraries or frameworks can be evaluated separately and their results added or subtracted from the whole as needed.

Further, the individual classification also limits “Crosstalk”. Crosstalk is a phenomenon in which a change in a source code entity affects parts of the recovered architectural view that do not pertain to it. Because RELAX is not affected by Crosstalk, a change in the recovery result is confined to the code entities that have changed and (possibly, depending on whether their associated concerns have changed) their cluster association. Further, the scale of the changes in the recovered view is proportional to the size of the changes in the source code entities.
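The reuse of earlier results for unchanged entities can be sketched as follows. This is an illustration under stated assumptions, not RELAX's implementation: the class IncrementalRecovery is hypothetical, the classifier is passed in as a plain function from source text to a concern label, and changes are detected by comparing the stored source text (a content digest could be used instead).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: reclassify only the entities that changed between two versions. */
final class IncrementalRecovery {
    private final Function<String, String> classifier;               // source text -> concern label
    private final Map<String, String> lastContent = new HashMap<>();  // path -> source text last classified
    private final Map<String, String> labels = new HashMap<>();       // path -> concern label
    private int classifierCalls = 0;

    IncrementalRecovery(Function<String, String> classifier) {
        this.classifier = classifier;
    }

    /** Classifies new or changed entities; unchanged entities keep their earlier result. */
    Map<String, String> recover(Map<String, String> sourcesByPath) {
        // Forget entities that no longer exist in this version.
        lastContent.keySet().retainAll(sourcesByPath.keySet());
        labels.keySet().retainAll(sourcesByPath.keySet());
        for (Map.Entry<String, String> entry : sourcesByPath.entrySet()) {
            String path = entry.getKey();
            String source = entry.getValue();
            if (!source.equals(lastContent.get(path))) {
                labels.put(path, classifier.apply(source));
                lastContent.put(path, source);
                classifierCalls++;
            }
        }
        return new HashMap<>(labels);
    }

    int classifierCalls() {
        return classifierCalls;
    }

    public static void main(String[] args) {
        // Toy stand-in for a trained classifier.
        IncrementalRecovery recovery = new IncrementalRecovery(
                source -> source.contains("sql") ? "Database" : "Unknown");

        Map<String, String> version1 = new HashMap<>();
        version1.put("A.java", "import java.sql.ResultSet;");
        version1.put("B.java", "class B {}");
        recovery.recover(version1);

        Map<String, String> version2 = new HashMap<>(version1);
        version2.put("B.java", "class B { java.sql.Connection c; }");
        Map<String, String> result = recovery.recover(version2);

        System.out.println(result + " after " + recovery.classifierCalls() + " classifier calls");
    }
}
```

In this toy run, the second version triggers a single classifier call for the changed entity, and the label of the unchanged entity cannot be affected by that change.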

III-A6 Textual Output

Conceptually, the textual output produced contains (1) the classification of each source code entity, (2) the constituents of all concern-related clusters, and (3) the auxiliary output from other tools, such as the list of dependencies between code entities or the size of entities in SLOC, which can be used for further processing and analysis. (RELAX uses the Classycle library [22] to determine the dependencies between code entities. These auxiliary outputs are not determined by RELAX itself, but can be used for further analysis.)

III-B Visualization

Fig. 4: RELAX Directory Graph Example

The directory visualization of RELAX shown in Figure 4 aims to give any stakeholder a high-level overview of the system architecture and the system’s addressed concerns that can be enlarged to the level of individual source entities.

The visualization is based on the directory structure of the system, which corresponds to the package structure in Java. The system is shown as a directory tree. Nodes are either packages (inner nodes) or source entities (leaf nodes). Nodes that belong to the same package are surrounded by a rectangle. Since software systems can consist of a very large number of source entities, individual nodes can be very small in an overview (situations in which an individual node would make up less space than a pixel would be conceivable), and gaining an impression of their concerns would be impossible.

Therefore, in order to guarantee that concerns can be shown, the lines from each package folder to its children are shown in the color that corresponds to the prevailing concern in that package. The prevailing concern is determined as follows: For each child node, the main concern is determined by the classifier as the topic most relevant to the corresponding code entity. The weight of this entity is then determined by its file size (physical or logical SLOC can also be selected for this). If a child node is not a leaf node (i.e. it stands for a package), then its prevailing concern is the concern that carries the most weight with its children. This relationship holds recursively throughout the tree. One important outcome of this is that there is an easy way to see what the main concern of the overall system is by checking the color of the root node (or colors of the root nodes, if several exist) of the system. Because of the recursion, this holds for each package.
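The recursive rule can be sketched in Java as follows. Two points are assumptions of this sketch rather than statements about RELAX: a sub-package contributes the summed weight of all entities below it to its own prevailing concern, and ties are broken alphabetically by concern name. The class, package and file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of determining the prevailing concern of each package. */
final class PrevailingConcern {

    /** A node of the directory tree: a package (inner node) or a source entity (leaf). */
    static final class Node {
        final String name;
        final String concern;   // main concern of a leaf, null for packages
        final long size;        // weight of a leaf, e.g. file size in bytes
        final List<Node> children = new ArrayList<>();

        private Node(String name, String concern, long size) {
            this.name = name;
            this.concern = concern;
            this.size = size;
        }

        static Node leaf(String name, String concern, long size) {
            return new Node(name, concern, size);
        }

        static Node pkg(String name, Node... kids) {
            Node node = new Node(name, null, 0);
            for (Node kid : kids) {
                node.children.add(kid);
            }
            return node;
        }

        boolean isLeaf() {
            return concern != null;
        }
    }

    /** Weight of a leaf is its size; weight of a package is the sum over its children (assumption). */
    static long weight(Node node) {
        if (node.isLeaf()) {
            return node.size;
        }
        long sum = 0;
        for (Node child : node.children) {
            sum += weight(child);
        }
        return sum;
    }

    /** Returns the concern that carries the most weight among a node's children, recursively. */
    static String prevailing(Node node) {
        if (node.isLeaf()) {
            return node.concern;
        }
        Map<String, Long> weightByConcern = new TreeMap<>(); // sorted keys give deterministic ties
        for (Node child : node.children) {
            String childConcern = prevailing(child);
            if (childConcern != null) {                       // skip empty packages
                weightByConcern.merge(childConcern, weight(child), Long::sum);
            }
        }
        String best = null;
        long bestWeight = -1;
        for (Map.Entry<String, Long> entry : weightByConcern.entrySet()) {
            if (entry.getValue() > bestWeight) {
                bestWeight = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Node database = Node.pkg("database",
                Node.leaf("Consolidator.java", "Database", 4000),
                Node.leaf("Util.java", "Networking", 1000));
        Node root = Node.pkg("chukwa",
                database,
                Node.leaf("Main.java", "Graphics", 3000),
                Node.leaf("Net.java", "Networking", 2500));
        System.out.println("database: " + prevailing(database));
        System.out.println("chukwa: " + prevailing(root));
    }
}
```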

RELAX generates a legend for the directory visualization which shows the names of the concerns as they exist in the classifier in the color automatically selected for them by RELAX. The color selection is based on guidelines for optimal distinguishability of adjacent colors [23].

Fig. 5: RELAX Directory Graph Detail

Individual nodes can be examined by zooming in. The paradigm that this visualization is following is that of a navigational file manager, such as the Finder in macOS or the Explorer in Windows. The details shown in Figure 5 correspond to the metadata view obtained by right-clicking and selecting “Get Info” on the macOS Finder or right-clicking and selecting “Properties” on the Windows Explorer. In the example shown in Figure 5, we are seeing a package (left) and two Java source entities. Since all three belong to the same package, we see three incoming arrows from the top in the same color. Each entity is shown with a group of attributes:

  • A top box containing the base-name of its canonical name,

  • A second row showing

    • its file size in bytes with the color of its concern as the background color,

    • its logical SLOC with the same background color.

  • A third row with all outgoing dependencies colored for the corresponding entities,

  • A fourth row with all incoming dependencies colored for the corresponding entities.

Checking individual entities can give the user an impression of how connected an entity is and which type of concerns the related entities address. Questionable dependencies could be caught here. The format of the file that is used to lay out the directory graph is a human-readable text file that describes a directed graph. The actual layout is done by dot, a program from the Graphviz [24] package. The dot program creates hierarchical layouts. Results are created in PDF format. It is possible to provide specific directives for the width and height of the graph.
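As an illustration of such a graph description, the following Java sketch writes a small directed graph in the DOT language with colored edges. The class DotSketch, the node names and the colors are hypothetical, and the use of the Graphviz size attribute (width and height in inches) to constrain the drawing is an assumption about how width and height directives could be expressed, not a description of RELAX's actual output file.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: emit a directory tree as a DOT digraph with concern-colored edges. */
final class DotSketch {

    /** An edge from a package to one of its children, colored by the prevailing concern. */
    static final class Edge {
        final String parent;
        final String child;
        final String color;

        Edge(String parent, String child, String color) {
            this.parent = parent;
            this.child = child;
            this.color = color;
        }
    }

    /** Wraps a string in double quotes, escaping backslashes and embedded quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Builds the DOT text; size limits the drawing to the given width and height in inches. */
    static String toDot(List<Edge> edges, double widthInches, double heightInches) {
        StringBuilder sb = new StringBuilder();
        sb.append("digraph relax {\n");
        sb.append("  size=\"").append(widthInches).append(",").append(heightInches).append("\";\n");
        sb.append("  node [shape=box];\n");
        for (Edge edge : edges) {
            sb.append("  ").append(quote(edge.parent)).append(" -> ").append(quote(edge.child))
              .append(" [color=").append(quote(edge.color)).append("];\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Edge> edges = Arrays.asList(
                new Edge("chukwa", "database", "red"),
                new Edge("database", "Consolidator.java", "red"),
                new Edge("chukwa", "Net.java", "blue"));
        System.out.print(toDot(edges, 8, 6));
    }
}
```

Saved to a file such as graph.dot, the output can be rendered to PDF with the Graphviz command `dot -Tpdf graph.dot -o graph.pdf`.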

From the hierarchical diagram of the system shown in Figure 4, a stakeholder can immediately get an overview of the system and gain some first impressions: First, it is apparent that the system has two top level folders (with branching only beginning several levels below the top due to the Java packaging conventions, which use the reverse Internet domain names of organizations [25]). It is clear that five package levels have leaf nodes (which stand for code entities). The third level from the bottom has the most code entities.

Regarding concerns, the system seems to be mostly addressing the one that is shown in bright red. Two concerns, bright green and dark blue, seem to be addressed mostly in one package each (second level from the bottom at the very left and near the middle of the third level from the bottom). Several concerns, such as the orange, the light blue one and chiefly the bright red one, are shown to be distributed throughout the system. This could indicate a poor separation of concerns (or possibly the need for a narrower definition of the concerns that should be used for classification).

Conclusions can also be drawn when studying the evolution of a system. The diagrams in Figures 6 and 7 show two consecutive minor versions of the same system:

Fig. 6: RELAX Directory Graph of First System Version
Fig. 7: RELAX Directory Graph of Second System Version

The similar outlines make the two versions of the system easy to compare (though some differences in shape are due to the automatic layout in the Graphviz package). The comparison shows that the leftmost package in the hierarchy, which was dominated by the red concern in the first version, is now more evenly split three ways between red, blue and green, and that its prevailing concern has changed from red to green.
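The same comparison can be made without the diagrams. The following sketch, which assumes that each version's recovery result is available as a mapping from file path to concern, computes the prevailing concern of every package and reports the packages where it changed. The function names and the sample data are hypothetical and are not taken from the systems shown in the figures.

```python
from collections import Counter, defaultdict
from pathlib import PurePosixPath


def prevailing_concerns(classification):
    """Map each package (parent directory) to (prevailing concern, share of its entities)."""
    per_package = defaultdict(Counter)
    for path, concern in classification.items():
        per_package[str(PurePosixPath(path).parent)][concern] += 1
    result = {}
    for package, counts in per_package.items():
        concern, count = counts.most_common(1)[0]
        result[package] = (concern, count / sum(counts.values()))
    return result


def changed_packages(old, new):
    """Return {package: (old concern, new concern)} for packages whose prevailing concern changed."""
    before, after = prevailing_concerns(old), prevailing_concerns(new)
    return {
        package: (before[package][0], after[package][0])
        for package in before.keys() & after.keys()
        if before[package][0] != after[package][0]
    }


# Hypothetical data: one package whose prevailing concern shifts from red to green.
version_1 = {"pkg/A.java": "red", "pkg/B.java": "red", "pkg/C.java": "green"}
version_2 = {"pkg/A.java": "red", "pkg/B.java": "green", "pkg/C.java": "green", "pkg/D.java": "blue"}
changes = changed_packages(version_1, version_2)
```

Ties between concerns are resolved by insertion order here; a real comparison would need an explicit rule for them.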

III-C Workflow

Fig. 8: RELAX Recovery Workflow

Figure 8 shows the workflow of a RELAX recovery from the point of view of the programmatic process, which incorporates all parts of our approach. The selection of concerns is not shown as an explicit step, but is an implicit part of the selection of a trained classifier. Training a classifier based on training data is shown with dashed lines since it is not a necessary part of each recovery.
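The core of this workflow can be sketched in a few lines, under the assumption that a trained classifier is available as a function from source text to a concern name. The names `recover` and `keyword_classifier` are hypothetical, and the keyword rule merely stands in for a trained classifier; it is not how RELAX classifies.

```python
from collections import defaultdict


def recover(entities, classify):
    """Group code entities into concern clusters.

    entities: maps an entity name to its source text.
    classify: a trained classifier, here any callable from text to a concern name.
    """
    clusters = defaultdict(list)
    for name, text in entities.items():
        # Each entity is classified on its own, independently of all others.
        clusters[classify(text)].append(name)
    return {concern: sorted(names) for concern, names in clusters.items()}


def keyword_classifier(text):
    # Stand-in for a trained classifier; purely illustrative.
    return "networking" if "socket" in text.lower() else "other"


clustering = recover(
    {"A.java": "opens a Socket to the server", "B.java": "parses command line arguments"},
    keyword_classifier,
)
```

Because the classifier is passed in, training is a separate, optional step, which mirrors the dashed lines in the workflow.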

III-D Implementation

RELAX uses the MALLET [26] toolkit, which includes different classification algorithms such as Naïve Bayes [27][28], Maximum Entropy [29] and Decision Trees [30] and allows training and applying them.

RELAX has been implemented in Java as part of a workbench comprising a suite of architecture recovery techniques. The implementation is GUI-based and allows training classifiers, running RELAX, and visualizing the results without leaving the GUI. The principal outputs are a textual clustering of the system’s source code entities and a directory visualization.

IV Evaluation

IV-A Accuracy

Among the principal results of a RELAX recovery are the classification of all individual code entities of a system and their grouping into a set of concern-related clusters. The accuracy of the clustering can be determined by measuring its similarity to an expert decomposition, which is another clustering manually prepared by an expert on the system, such as its architect. The expert decomposition serves as a “ground truth” [31]. A known measure of this similarity is MojoFM [32]. We picked it for our evaluation for two reasons. First, it has been used in several studies, such as [33] and [34]. Second, MojoFM data on the closeness to expert decompositions is already available for ACDC and ARC (see Section V), whereas the clustering results that formed the basis of the study [2] are not. We felt it was important to compare the performance of RELAX to that of ARC, since the latter is another recovery method whose paradigm is that of concern-oriented clustering.

MojoFM expresses the similarity between two partitions of a set as a percentage, where 100% represents identity and 0% maximal difference. Its formula is:

$$\mathrm{MojoFM}(M) = \left(1 - \frac{mno(A, B)}{\max(mno(\forall A, B))}\right) \times 100\%$$

Here, $M$ stands for the clustering technique, $A$ is the clustering produced by $M$ and $B$ is the expert decomposition. $mno(A, B)$ represents the minimum number of Move (moving an object to a different cluster) and Join (joining two clusters into one) operations needed to transform partition $A$ into $B$, and $\max(mno(\forall A, B))$ is the largest such number over all possible partitions $A$.
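As a small worked illustration, the sketch below evaluates the measure for given operation counts. It assumes the standard definition from [32] and takes both counts as inputs; computing the minimum number of Move and Join operations itself is not shown. The function name `mojofm` and the numbers are hypothetical.

```python
def mojofm(mno_a_b, max_mno):
    """Return MojoFM in percent: (1 - mno(A, B) / max(mno(for all A, B))) * 100.

    mno_a_b: minimum number of Move and Join operations to turn A into B.
    max_mno: the largest such number over all possible partitions A.
    """
    if max_mno <= 0:
        raise ValueError("max_mno must be positive")
    if not 0 <= mno_a_b <= max_mno:
        raise ValueError("mno_a_b must lie between 0 and max_mno")
    return (1 - mno_a_b / max_mno) * 100


# Hypothetical counts: 12 operations needed, at most 48 possible.
example = mojofm(12, 48)  # 75.0
```

A clustering that needs no operations scores 100%, and one that needs the maximum number scores 0%.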

For the purposes of our comparisons of RELAX clusterings to expert decompositions, we are interested in answering the following question: How close would RELAX come to the expert decomposition if a classifier were trained to categorize a system’s code entities into clusters related to the concerns present in the expert decomposition?

Table III compares the MojoFM values of RELAX to those of two other recovery methods (ACDC and ARC, both of which are described in detail in Section V). In a previous study of ten recovery methods on eight systems, these two were identified as the closest to the expert decompositions [2]. As can be seen in the table, RELAX exceeds both of their MojoFM values in five cases, lies between the two in one case and falls closely below them in two. On average, RELAX reaches 70.59%, compared to 58.76% and 55.94% for the other two methods. This lets us conclude that, for these systems, RELAX’s overall accuracy is better than that of the two most accurate known recovery methods so far.

| System | RELAX (%) | ARC (%) | ACDC (%) |
|---|---|---|---|
| Bash | 75.86 | 57.89 | 49.35 |
| OODT | 43.47 | 48.48 | 46.01 |
| Hadoop | 58.32 | 54.28 | 62.92 |
| ArchStudio | 74.60 | 76.28 | 87.68 |
| Linux-D | 74.67 | 51.47 | 36.31 |
| Linux-C | 93.70 | 75.72 | 63.76 |
| Mozilla-D | 53.47 | 43.44 | 41.20 |
| Mozilla-C | 90.62 | 62.50 | 60.30 |
| Average | 70.59 | 58.76 | 55.94 |

TABLE III: MojoFM Ground Truth Comparison Values
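The counts and averages reported for Table III can be checked with a short script. The sketch below assumes that the first value in each row belongs to RELAX, which is consistent with the counts given in the text; the order of the two comparison methods does not matter for this check. The names `TABLE_III` and `summarize` are hypothetical.

```python
# MojoFM values per system: (RELAX, comparison method, comparison method).
TABLE_III = {
    "Bash": (75.86, 57.89, 49.35),
    "OODT": (43.47, 48.48, 46.01),
    "Hadoop": (58.32, 54.28, 62.92),
    "ArchStudio": (74.60, 76.28, 87.68),
    "Linux-D": (74.67, 51.47, 36.31),
    "Linux-C": (93.70, 75.72, 63.76),
    "Mozilla-D": (53.47, 43.44, 41.20),
    "Mozilla-C": (90.62, 62.50, 60.30),
}


def summarize(rows):
    """Count where RELAX is above, between, or below the other two, and average each column."""
    higher = between = lower = 0
    for relax, first, second in rows.values():
        if relax > max(first, second):
            higher += 1
        elif relax < min(first, second):
            lower += 1
        else:
            between += 1
    averages = tuple(
        round(sum(row[i] for row in rows.values()) / len(rows), 2) for i in range(3)
    )
    return higher, between, lower, averages


summary = summarize(TABLE_III)
```

The result is five systems above, one between and two below, with column averages of 70.59, 58.76 and 55.94, which matches the Average row of the table.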

IV-B Scalability and Efficiency

Since each file is classified individually and independently of any other, and the time of an individual Naïve Bayes classification depends only on the size of the file to be classified [35], the time required to recover a system’s architecture should scale linearly with the overall size of the files to be classified, or the number of lines to be processed.
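To make the linear-time argument concrete, the following is a minimal multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is a sketch for illustration and not the MALLET implementation that RELAX uses; the class name, the tokenizer and the training texts are hypothetical. Classifying a file visits each of its tokens once per concern, so for a fixed set of concerns the cost grows linearly with the size of the file.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z]+")


def tokenize(text):
    return [token.lower() for token in TOKEN.findall(text)]


class NaiveBayes:
    def __init__(self, training):
        """training: list of (text, concern) pairs."""
        self.word_counts = {}
        self.doc_counts = Counter()
        self.vocabulary = set()
        for text, concern in training:
            self.doc_counts[concern] += 1
            counts = self.word_counts.setdefault(concern, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocabulary.add(token)
        self.totals = {c: sum(w.values()) for c, w in self.word_counts.items()}

    def classify(self, text):
        # One pass over the file; tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        n_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best, best_score = None, -math.inf
        for concern in sorted(self.doc_counts):  # sorted for deterministic tie-breaking
            score = math.log(self.doc_counts[concern] / n_docs)
            counts, total = self.word_counts[concern], self.totals[concern]
            for token in tokens:  # one lookup per token
                score += math.log((counts[token] + 1) / (total + vocab_size))
            if score > best_score:
                best, best_score = concern, score
        return best


classifier = NaiveBayes([
    ("open socket connect port host", "networking"),
    ("read file write disk path", "io"),
])
label = classifier.classify("connect to host via socket")
```

Since no state is shared between calls to `classify`, files can be processed in any order, and the same input always yields the same label.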

In order to determine how the performance of RELAX changes with system size, we measured it on 15 versions of Apache Hadoop [15], 7 versions of Apache Chukwa [16], and one version each of Log4j2 [36] and Chromium [37]. Altogether, these comprise more than 2.45 million SLOC.

The scatter plot in Figure 9 shows how many SLOC RELAX processed per second for each of the 24 systems. A trend line was fitted to the plot.

Fig. 9: RELAX Recovery Speeds for Projects of Different Sizes

The observations confirm that, as expected, the number of source lines of code (SLOC) that are processed over a given unit of time does not decline with the size of the system. (The trend line shows an increasing performance with an increase in SLOC. This may be an artifact of the underlying OS and is not expected to be sustained for bigger system sizes.)

V Related Work

V-A PKG

PKG [38], the simplest approach to architecture recovery, is based on the package-level structure view of a system’s implementation. This approach produces an objective but not architecturally satisfying view in that it stays at the surface instead of helping its user determine why the system is built the way it is. Other clustering techniques have been suggested on the basis of file names and file naming conventions [39, 40]. However, their assumptions about naming conventions are not always correct. Many other, more sophisticated techniques exist. We review several relevant approaches below.

V-B ACDC

The ACDC (Algorithm for Comprehension-Driven Clustering) algorithm [41] recovers components and configurations from structural relationships specified as patterns. It bounds the size of each cluster (the number of software entities in the cluster) and names each cluster based on the names of the files in it. ACDC’s view is oriented toward components that are based on structural patterns (e.g., a component consisting of entities that together form a particular subgraph).

V-C ARC

ARC [7] uses topic modeling to find concerns and combines them with structural information to automatically identify components and connectors. The topic model employed in ARC is Latent Dirichlet Allocation (LDA) [42]. Using LDA, ARC can detect concerns in individual code entities and compute similarities between them. A software system’s implementation entities, such as its source files, are represented as a set of documents (a corpus) and each document in turn as a “bag of words” [7]. Each document can be related to several different topics. The documents are then clustered using the dependencies between them as structural information and the concerns (the topics from the topic model) as features. It is important to note that topic modeling as applied in ARC does not name the detected topics automatically and is an iterative process that is not guaranteed to yield topics that are consistent or that a human being can name [43]. In contrast, document classification uses named topics from the outset. We have outlined issues with ARC in detail elsewhere [43] and therefore limit ourselves to a short overview. The issues concern:

  • Its handling of stop words,

  • The selection of the number of topics to be detected,

  • Topic quality,

  • Determinism,

  • Sensitivity to architectural change, and

  • Scalability.

VI Threats to Validity

VI-A Generalizability

For RELAX (or any other recovery method that is based on natural language processing) to produce meaningful results, all of the following conditions need to hold:

  1. The programming language the system is written in allows the use of comments or variable names,

  2. The code contains meaningful comments or variable names,

  3. The comments and variable names are pertinent to the purpose of the code.

This excludes code that has misleading comments or variable names or has been obfuscated. Additionally, due to availability issues regarding closed-source systems’ source code, only versions of open-source systems have been evaluated.

VI-B User Studies

Because the selection of systems to be evaluated had not yet been settled when our evaluation was conducted, user studies with engineers, which would serve to further evaluate RELAX, have not been conducted yet.

VII Conclusion

In this paper, we presented a novel architecture recovery method that employs text classification to recover a concern-oriented view of a system. The approach assigns source code entities to clusters based on user-defined concerns. The conceptual and design choices made have resulted in an accurate and scalable concern-based recovery method. The tool that implements the approach has been evaluated for accuracy and scalability on a set of open-source systems. The results confirm the claims made.

VIII Future Work

Currently, RELAX assigns each source code entity to the cluster that represents its dominant concern. We aim to allow users to control the way in which code entities are assigned to clusters. For instance, users could choose clusters that represent what to them are meaningful combinations of more than one concern and instruct RELAX to assign code entities to such clusters. Another feature to be implemented is the ability to define undesirable dependencies in a system (e.g., those that break a desired layered architecture by having entities that serve low-level concerns depend on others in a higher layer). Of further interest are studies of architecture evolution with RELAX, as well as comparisons between its performance and that of other recovery methods on large or very large systems (e.g., Chromium OS).
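One possible shape for the planned dependency check is sketched below. It is not an existing RELAX feature: it assumes that each entity has already been assigned a concern, that the user ranks concerns by layer (0 being the lowest), and that dependencies are given as source and target pairs. All names and data are hypothetical.

```python
def layer_violations(dependencies, concern_of, layer_of):
    """Return the dependencies in which a lower-layer entity depends on a higher-layer one.

    dependencies: iterable of (source entity, target entity) pairs.
    concern_of:   maps an entity to its assigned concern.
    layer_of:     maps a concern to its layer rank, with 0 as the lowest layer.
    """
    return [
        (source, target)
        for source, target in dependencies
        if layer_of[concern_of[source]] < layer_of[concern_of[target]]
    ]


# Hypothetical example: storage is the low layer, gui the high layer.
violations = layer_violations(
    [("Ui.java", "Store.java"), ("Store.java", "Ui.java")],
    {"Ui.java": "gui", "Store.java": "storage"},
    {"storage": 0, "gui": 1},
)
```

Only the dependency from the storage entity up to the GUI entity is reported; the downward dependency is allowed.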


The authors would like to thank Nenad Medvidovic for reviewing early drafts of this paper. We would further like to thank Duc Le and Suhrid Karthik for providing comments and insights.


  • [1] R. N. Taylor, N. Medvidovic, and E. M. Dashofy, Software Architecture - Foundations, Theory, and Practice.   Wiley, 2010.
  • [2] J. Garcia, I. Ivkovic, and N. Medvidovic, “A Comparative Analysis of Software Architecture Recovery Techniques,” Automated Software Engineering (ASE), 2013 IEEE/ACM 28th International Conference on, pp. 486–496, 2013.
  • [3] P. Behnamghader, D. M. Le, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic, “A large-scale study of architectural evolution in open-source software systems,” Empirical Software Engineering, pp. 1–48, 2016. [Online]. Available:
  • [4] W. N. Oizumi, A. F. Garcia, T. E. Colanzi, M. Ferreira, and A. V. Staa, “When code-anomaly agglomerations represent architectural problems? an exploratory study,” Proceedings - 28th Brazilian Symposium on Software Engineering, SBES 2014, pp. 91–100, 2014.
  • [5] Y. Cai, H. Wang, S. Wong, and L. Wang, “Leveraging design rules to improve software architecture recovery,” Proceedings of the 9th international ACM Sigsoft conference on Quality of software architectures, pp. 133–142, 2013. [Online]. Available:
  • [6] Merriam-Webster, “Concern — Definition of Concern by Merriam-Webster.” [Online]. Available:
  • [7] J. Garcia, D. Popescu, C. Mattmann, N. Medvidovic, and Y. Cai, “Enhancing architectural recovery using concerns,” in 2011 26th IEEE/ACM International Conference on Automated Software Engineering, ASE 2011, Proceedings, 2011, pp. 552–555.
  • [8] Software Engineering Institute, “Community Software Architecture Definitions.” [Online]. Available:
  • [9] S. Ducasse and D. Pollet, “Software Architecture Reconstruction : A Process-Oriented Taxonomy,” IEEE Transactions on Software Engineering, vol. 35, no. 4, pp. 573–591, 2009.
  • [10] ISO/IEC/IEEE, “Systems and software engineering — Architecture description.” [Online]. Available:
  • [11] C. Gacek, A. Abd-Allah, B. Clark, and B. Boehm, “On the definition of software system architecture,” in Proceedings of the First International Workshop on Architectures for Software Systems.   Seattle, WA, 1995, pp. 85–94.
  • [12] O. Maqbool and H. A. Babri, “Hierarchical clustering for software architecture recovery,” IEEE Transactions on Software Engineering, vol. 33, no. 11, pp. 759–780, 2007.
  • [13] C. D. Manning, P. Raghavan, H. Schütze et al., Introduction to Information Retrieval.   Cambridge University Press, 2008, vol. 1, no. 1.
  • [14] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python.   O’Reilly Media, Inc., 2009.
  • [15] “Welcome to apache™ hadoop®!” Apr 2018. [Online]. Available:
  • [16] The Apache Software Foundation, “Welcome To Apache Chukwa,” 2016. [Online]. Available:
  • [17] ——, “HDFS Users Guide,” 2018. [Online]. Available:
  • [18] R. Krzysztofowicz, “Bayesian forecasting via deterministic model,” Risk Analysis, vol. 19, no. 4, pp. 739–749, 1999.
  • [19] L. D. Baker and A. McCallum, “Distributional Clustering of Words for Text Classification,” in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval.   ACM, 1998, pp. 96–103.
  • [20] A. Huang, “Similarity measures for text document clustering,” Proceedings of the Sixth New Zealand, no. April, pp. 49–56, 2008.
  • [21] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, “A Similarity Measure for Text Classification and Clustering,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, pp. 1575–1590, 2014. [Online]. Available:
  • [22] F.-J. Elmer, “Classycle: Analysing Tools for Java Class and Package Dependencies,” 2014. [Online]. Available:
  • [23] C. Brewer, M. Harrower, and The Pennsylvania State University, “Colorbrewer 2.0 - Color Advice for Cartography.” [Online]. Available:
  • [24] AT&T Research and Contributors, “Graphviz - Graph Visualization Software.” [Online]. Available:
  • [25] Oracle, “Naming a Package,” 2015. [Online]. Available:
  • [26] A. McCallum, “MALLET MAchine Learning for LanguagE Toolkit.” [Online]. Available:
  • [27] K. P. Murphy, “Naive Bayes classifiers,” Bernoulli, vol. 4701, no. October, pp. 1–8, 2006. [Online]. Available:
  • [28] Y. H. Li and A. K. Jain, “Classification of Text Documents,” The Computer Journal, vol. 41, no. 8, pp. 537–546, 1998. [Online]. Available:
  • [29] K. Nigam, J. Lafferty, and A. McCallum, “Using maximum entropy for text classification,” IJCAI-99 workshop on machine …, 1999. [Online]. Available:
  • [30] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE transactions on systems, man, and cybernetics, vol. 21, no. 3, pp. 660–674, 1990.
  • [31] O. Maqbool and H. Babri, “The weighted combined algorithm: a linkage algorithm for software clustering,” Eighth European Conference on Software Maintenance and Reengineering, pp. 15–24, 2004. [Online]. Available:
  • [32] Zhihua Wen and V. Tzerpos, “An effectiveness measure for software clustering algorithms,” Proceedings. 12th IEEE International Workshop on Program Comprehension, 2004., pp. 194–203, 2004.
  • [33] C. Patel, A. Hamou-Lhadj, and J. Rilling, “Software clustering using dynamic analysis and static dependencies,” in Software Maintenance and Reengineering, 2009. CSMR’09. 13th European Conference on.   IEEE, 2009, pp. 27–36.
  • [34] A. Mahmoud and N. Niu, “Evaluating software clustering algorithms in the context of program comprehension,” in Program Comprehension (ICPC), 2013 IEEE 21st International Conference on.   IEEE, 2013, pp. 162–171.
  • [35] D. D. Lewis and M. Ringuette, “A comparison of two learning algorithms for text categorization,” in Third annual symposium on document analysis and information retrieval, vol. 33, no. October 1996, 1994, pp. 81–93.
  • [36] The Apache Software Foundation, “Apache logging services.” [Online]. Available:
  • [37] Google and volunteers, “Chromium.” [Online]. Available:
  • [38] D. M. Le, P. Behnamghader, J. Garcia, D. Link, A. Shahbazian, and N. Medvidovic, “An empirical study of architectural change in open-source software systems,” IEEE International Working Conference on Mining Software Repositories, vol. 2015-Augus, pp. 235–245, 2015.
  • [39] N. Anquetil and T. Lethbridge, “File Clustering Using Naming Conventions for Legacy Systems,” in Proceedings of the 1997 conference of the Centre for Advanced Studies on Collaborative research, 1997.
  • [40] N. Anquetil and T. C. Lethbridge, “Recovering software architecture from the names of source files,” Journal of Software Maintenance: Research and Practice, vol. 11, no. 3, pp. 201–221, 1999.
  • [41] V. Tzerpos and R. C. Holt, “ACDC: An algorithm for comprehension-driven clustering,” in Reverse Engineering, 2006. WCRE’00. 7th ….   IEEE, 2000, pp. 258–267.
  • [42] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
  • [43] D. Link, P. Behnam, R. Moazeni, and B. Boehm, “The value of software architecture recovery for maintenance,” in 12th Innovations in Software Engineering Conference, 2019, p. to appear. [Online]. Available: