Variability-aware Datalog

by Ramy Shahin, et al.

Variability-aware computing is the efficient application of programs to different sets of inputs that exhibit some variability. One example is program analyses applied to Software Product Lines (SPLs). In this paper we present the design and development of a variability-aware version of the Soufflé Datalog engine. The engine can take facts annotated with Presence Conditions (PCs) as input, and compute the PCs of its inferred facts, eliminating facts that do not exist in any valid configuration. We evaluate our variability-aware Soufflé implementation on several fact sets annotated with PCs to measure the associated overhead in terms of processing time and database size.






1 Introduction

A Datalog engine is used to infer knowledge from a set of facts given some inference rules. There are cases though where we need to apply the same rules to different sets of facts coming from different worlds, or different configurations. For example, the Doop [Bravenboer:2009] pointer analysis framework encodes its logic as Datalog rules, and applies them to facts extracted from Java programs. Doop can only work on a single software product at a time. However, it is common for software engineers to develop a whole family of products, a Software Product Line (SPL) [Clements:2001], as one project, exploiting the commonality across those products. Different variants (products) implement different sets of features. Since each feature can be either present or not in a variant, the number of variants is usually exponential in the number of features.

To use a framework like Doop on an SPL, we need to apply it to each of the variants individually. This is infeasible in most cases because of the exponential number of variants. It also involves a lot of redundancy because it does not leverage the commonality across variants. To mitigate those drawbacks, some program analyses have been lifted to efficiently work on SPLs instead of single products [Bodden:2013, Classen:2010, Gazzillo:2012, Kastner:2012, Midtgaard:2015, Salay:2014, Thum:2014]. This lifting process usually involves reimplementing the analysis to be variability-aware.

Our prior work [Shahin:2019] outlines an approach to apply Doop (and similar frameworks) to the whole SPL at once, showing orders of magnitude of savings in computation time and storage space compared to running on each variant separately. One building block of that work was modifying the Soufflé [Jordan:2016] Datalog engine to be variability-aware, i.e., taking fact variability into consideration when inferring new facts. One fundamental advantage of our approach is that lifting a Datalog engine to be variability-aware automatically lifts all analyses that use it. In addition, variability-aware inference can be widely applied beyond program analysis. In any application domain, it is possible for different facts to be present only in specific situations, configurations, or in some constrained worlds. Instead of modeling each of those variants separately, it makes sense to model them together since inference rules are orthogonal to variability.

The rest of this paper starts with some background definitions and a motivating example (Sec. 2), followed by the design of variability-aware Soufflé (Sec. 3). We then present the results of our evaluation experiments (Sec. 4), and finally conclude and suggest some future directions (Sec. 5).

2 Background and Motivating Example

In this section we define some Datalog and variability terms, illustrating them on the motivating example in Fig. 1. We then briefly introduce the architecture of the Soufflé Datalog engine.

Path(v1, v2) :- Edge(v1, v2).
Path(v1, v3) :-
        Edge(v1, v2), Path(v2, v3).
Listing 1: Path rules.

Edge(Athens, Rome)   @ Sea.
Edge(Rome, Toronto)  @ Air.
Edge(NYC, Athens)    @ !Land.
Edge(Toronto, NYC)   @ Land.
Listing 2: Variability-aware inputs.

Path(Athens, Rome)    @ Sea.
Path(Rome, Toronto)   @ Air.
Path(NYC, Athens)     @ !Land.
Path(Toronto, NYC)    @ Land.
Path(Athens, Toronto) @ Sea /\ Air.
Path(Rome, NYC)       @ Air /\ Land.
Path(Toronto, Athens) @ Land /\ !Land.
Path(NYC, Rome)       @ Sea.
Listing 3: Variability-aware outputs.

Figure 1: Motivating example.

2.1 Datalog and Variability

Datalog is a declarative data definition and query language that combines relational data manipulation and logical inference [Ceri:1989]. A Datalog program is a set of inference rules, collectively referred to as the Intensional Database (IDB). For example, the Datalog program in Listing 1 computes directed paths given graph edges.

A program takes facts, referred to as the Extensional Database (EDB), as input. New output facts are generated by repeatedly applying the inference rules to the input facts. Listings 2 and 3 are examples of input and output facts respectively.

Variability-aware computing is the ability to efficiently compute over values from different worlds at the same time. Worlds are defined in terms of a set of features. A single world is defined by a configuration, in which each feature is either present or absent. A set of worlds is defined by a propositional formula over features.

Each software artifact can be labeled with a Presence Condition (PC): a propositional formula specifying the set of worlds in which this artifact exists. Datalog facts are an example of artifacts. If we are modeling a set of worlds defined by three features (Land, Air and Sea), facts can be labeled with PCs as seen in Listing 2. The ’@’ symbol separates the fact predicate from its PC. We use the symbols ’!’ for negation, ’\/’ for disjunction, and ’/\’ for conjunction. Parentheses can also be used to override operator precedence.

Usually not all feature combinations are valid. For example, the expression Land /\ Sea states that we have an edge that is both overland and marine, which does not make sense. To rule out invalid feature combinations, a product line usually has a feature model: a propositional formula over features specifying their valid combinations (valid worlds). A configuration is valid only if its conjunction with the feature model is satisfiable. Our example’s feature model is

Now a variability-aware Datalog engine needs to take both the feature model and the presence conditions of facts into consideration when inferring new facts. Whenever a new fact is inferred, its Presence Condition (PC) should be the conjunction of the PCs of its resolvent facts together with the feature model. If this PC is not satisfiable, the inferred fact does not belong to any valid configuration (world), and can be removed.

Listing 3 shows the results of applying our variability-aware Datalog engine to the program and facts above. Crossed-out facts are the ones removed because their presence conditions are not satisfiable (in general or with respect to the feature model). For example, Path(Toronto, Athens) is removed because its PC, Land /\ !Land, is unsatisfiable.

Formal syntax and semantics of variability-aware Datalog, together with correctness criteria of the lifted inference algorithm, and proof of correctness are presented in [Shahin:2019].
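The inference scheme described in this section can be illustrated with a minimal Python sketch. It is not the Soufflé implementation: PCs are represented as explicit sets of worlds (truth-table rows) rather than symbolic formulas, evaluation is a naive fixpoint, and the feature model is left as a parameter that defaults to True, an assumption made only for this sketch. The names pc, lifted_paths and feature_model are hypothetical.

```python
from itertools import product

FEATURES = ("Land", "Air", "Sea")
# Each world assigns True/False to every feature.
WORLDS = [dict(zip(FEATURES, bits))
          for bits in product([False, True], repeat=len(FEATURES))]

def pc(formula):
    """Represent a PC as the set of world indices in which `formula` holds."""
    return frozenset(i for i, w in enumerate(WORLDS) if formula(w))

TRUE = pc(lambda w: True)

def lifted_paths(edges, feature_model=TRUE):
    """Naive fixpoint for the Path rules of Listing 1, carrying PCs along.

    `feature_model` defaults to True (an assumption of this sketch).
    """
    path = {}

    def add(fact, cond):
        cond = cond & feature_model      # conjoin with the feature model
        if not cond:                     # unsatisfiable: no valid world has it
            return False
        old = path.get(fact, frozenset())
        path[fact] = old | cond          # fact derived again: disjoin the PCs
        return path[fact] != old

    changed = True
    while changed:
        changed = False
        for (a, b), c in edges.items():
            changed |= add((a, b), c)              # Path(a,b) :- Edge(a,b).
            for (x, y), d in list(path.items()):
                if x == b:                         # Path(a,y) :- Edge(a,b), Path(b,y).
                    changed |= add((a, y), c & d)
    return path

# Input facts of Listing 2.
edges = {
    ("Athens", "Rome"): pc(lambda w: w["Sea"]),
    ("Rome", "Toronto"): pc(lambda w: w["Air"]),
    ("NYC", "Athens"): pc(lambda w: not w["Land"]),
    ("Toronto", "NYC"): pc(lambda w: w["Land"]),
}
paths = lifted_paths(edges)
```

With these inputs, Path(Toronto, Athens) never enters the result because Land /\ !Land holds in no world, while Path(Athens, Toronto) is stored with the worlds satisfying Sea /\ Air. Passing a different feature_model restricts every derived PC to the valid worlds.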

2.2 Soufflé 

Figure 2: Soufflé architecture.

Soufflé [Jordan:2016] is an optimized Datalog engine, with a Datalog interpreter in addition to the option of compiling programs into native C++ code (Fig. 2). Soufflé first compiles Datalog into Relational Algebra Machine (RAM) programs, which are then either interpreted or compiled. RAM is a relational algebra language with a fixpoint operator.

Soufflé employs a semi-naive Datalog evaluation algorithm to compile Datalog into RAM. Elaborate data indexing techniques and multi-threaded query processing are then used to evaluate RAM programs. These techniques, in addition to the ability to compile RAM into C++, and subsequently into optimized native machine code, result in high-performance execution of Datalog programs.
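As a rough illustration of semi-naive evaluation in general (not of the RAM code that Soufflé generates), the following Python sketch evaluates the Path rules of Listing 1 without PCs. The key idea is that each round joins Edge only against the facts discovered in the previous round (the delta), instead of against the whole Path relation. The function name semi_naive_paths is hypothetical.

```python
def semi_naive_paths(edges):
    """Semi-naive evaluation of the Path rules from Listing 1 (no PCs)."""
    path = set(edges)      # Path(v1, v2) :- Edge(v1, v2).
    delta = set(edges)     # facts that were new in the previous round
    while delta:
        # Path(v1, v3) :- Edge(v1, v2), Path(v2, v3), joined against delta only.
        derived = {(a, c) for (a, b) in edges for (b2, c) in delta if b == b2}
        delta = derived - path   # keep only facts not seen before
        path |= delta
    return path

chain = {("a", "b"), ("b", "c"), ("c", "d")}
result = semi_naive_paths(chain)
```

The loop stops as soon as a round produces no new facts, which is the fixpoint.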

3 Variability-aware Soufflé 

We modified the Soufflé engine to support variability-aware Datalog inference. Soufflé runs in two modes: interpreter mode and compilation (code synthesis) mode. We only support the interpreter mode at this time.

3.1 Syntax Extension

Figure 3: BNF syntax of Soufflé clauses and presence conditions.

We extend the Soufflé fact syntax (Fig. 3) with an optional Presence Condition (PC) before the period (’.’) at the end. A presence condition is prefixed with the ’@’ symbol, and has the syntactic structure of a propositional formula.

Figure 4: Modifications and additions to Soufflé syntax and parsing classes.

The Soufflé grammar (Lex and Yacc files) is extended accordingly, and Abstract Syntax Tree (AST) classes are added to the code-base for Presence Conditions (Fig. 4). AstPresenceCondition is an abstract class inheriting from AstNode. Concrete subclasses of AstPresenceCondition are Primitive (for True , False and atomic propositional symbols), Negation, and BinOp (for conjunction and disjunction).

The syntactic category of presence conditions can appear in Soufflé programs, and also in CSV files. While the Soufflé parser takes care of programs, we had to implement a separate parser for PCs appearing in CSV files (the PresenceConditionParser class). It identifies a PC as an optional field prefixed with ’@’ coming at the end of a fact. If a PC exists, it is parsed into an AstPresenceCondition object.
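To make the PC syntax concrete, here is a small recursive-descent parser sketch in Python. It is not the PresenceConditionParser class itself. The operator precedence ('!' binds tightest, then '/\', then '\/') and the assumption that '@' never occurs inside the data fields are choices made for this sketch, and the names parse_pc and split_fact are hypothetical. Parsed PCs are returned as nested tuples instead of AST objects.

```python
import re

# Tokens: conjunction /\, disjunction \/, negation, parentheses, identifiers.
TOKEN = re.compile(r"\s*(/\\|\\/|[!()]|[A-Za-z_][A-Za-z0-9_]*)")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_pc(text):
    """Parse a propositional formula into nested tuples."""
    tokens, pos = tokenize(text), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def disjunction():                      # lowest precedence (assumed)
        node = conjunction()
        while peek() == "\\/":
            advance()
            node = ("or", node, conjunction())
        return node

    def conjunction():
        node = negation()
        while peek() == "/\\":
            advance()
            node = ("and", node, negation())
        return node

    def negation():                         # highest precedence (assumed)
        tok = peek()
        if tok == "!":
            advance()
            return ("not", negation())
        if tok == "(":
            advance()
            node = disjunction()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return node
        if tok is None or not (tok[0].isalpha() or tok[0] == "_"):
            raise ValueError(f"expected a feature name, got {tok!r}")
        advance()
        if tok in ("True", "False"):
            return ("const", tok == "True")
        return ("var", tok)

    node = disjunction()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return node

def split_fact(line):
    """Split 'fields @ pc' into (fields, parsed PC); a missing PC means True."""
    fields, sep, pc_text = line.partition("@")
    parsed = parse_pc(pc_text) if sep else ("const", True)
    return fields.strip(), parsed
```

For example, split_fact("Toronto,Athens @ Land /\ !Land") yields the field text "Toronto,Athens" together with a conjunction node, while a line without '@' gets the default True condition.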

The AstClause class is now extended with an AstPresenceCondition field. Unless a PC is provided for a clause, the default value is the True proposition (indicating that the fact is present in all configurations). AstTranslator has a method called translateClause that compiles an AstClause into a RamStatement. This method is modified to translate the PC of the clause as well.

The propositional symbols used in PCs come from a syntactic category different from that of Soufflé variables and constants. To avoid name collisions, we store those symbols in a separate symbol table (featSymTable). An AstTranslationUnit now has two symbol tables: one for Datalog symbols and the other for propositional symbols (feature names).

Soufflé performs some optimizations on the AST before it is translated into a RAM program. For example, in the MinimiseProgramTransformer class, areBijectivelyEquivalent is a method that checks if two clauses are bijectively equivalent. We extend this method to compare the PCs of the clauses as well. If the PCs are not syntactically the same, we consider the two clauses not equivalent.

3.2 RAM

Figure 5: Modifications and additions to RAM interpreter classes.

The AST of a Soufflé translation unit is compiled into a Relational Algebra Machine (RAM) program, encapsulated in a RamTranslationUnit object (Fig. 5). Similar to AstTranslationUnit, we need to carry the feature symbol table (featSymTable) over to RAM as a part of the translation unit. A RamProgram is contained within a translation unit, and it consists of a set of RamStatement objects. A RamFact is a special kind of RamStatement, and we add a PresenceCondition object as a field to it.

A syntactic AstPresenceCondition is compiled into a PresenceCondition object, which encapsulates a representation of the PC propositional formula. We store PCs as Binary Decision Diagrams [Huth:2004], and we use CUDD [Somenzi:1998] as a BDD engine. To keep the number of PC objects at a minimum, we also maintain a hash-table mapping BDDs to PC objects. This way a new PC object is created only if no other object with the same BDD already exists in memory.
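The idea of keeping one PC object per distinct formula can be sketched as follows. This is a simplified stand-in, not the actual PresenceCondition class: instead of a CUDD BDD, the canonical form used as the hash-table key is the set of satisfying feature assignments, which plays the same role (logically equivalent formulas map to the same key) but does not scale like a BDD. The class name InternedPC and its methods are hypothetical.

```python
from itertools import product

class InternedPC:
    """A PC stored canonically as the set of satisfying feature assignments.

    All PCs are assumed to be built over the same ordered feature tuple.
    """
    _table = {}                          # canonical form -> unique PC object

    def __init__(self, models):
        self.models = models

    @classmethod
    def intern(cls, models):
        models = frozenset(models)
        pc = cls._table.get(models)
        if pc is None:                   # create only if no equal PC exists yet
            pc = cls._table[models] = cls(models)
        return pc

    @classmethod
    def from_predicate(cls, features, predicate):
        rows = product([False, True], repeat=len(features))
        return cls.intern(bits for bits in rows
                          if predicate(dict(zip(features, bits))))

    def __and__(self, other):
        return self.intern(self.models & other.models)

    def __or__(self, other):
        return self.intern(self.models | other.models)

    def is_sat(self):
        return bool(self.models)

F = ("Land", "Air", "Sea")
land = InternedPC.from_predicate(F, lambda w: w["Land"])
not_land = InternedPC.from_predicate(F, lambda w: not w["Land"])
air = InternedPC.from_predicate(F, lambda w: w["Air"])
```

Because every construction path goes through intern, equivalent formulas such as land & air and air & land resolve to the same object, so PCs can be compared by identity and referenced from relations by a single numeric handle.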

Soufflé stores RAM relations as tables of numbers. String values are stored elsewhere, and their corresponding numeric identifiers are the values actually stored in relations. This keeps relations homogeneous and easy to access and index. Since we now need to add a PC for each RAM record, the simplest approach is to extend relations with an extra field for the PC. To keep the relation data structure homogeneous, instead of storing a PC object, we store its address, which is a 64-bit numeric value just like the other fields. This way our extra PC field is opaque to the rest of the RAM subsystem. We had to take special care of nullary relations, i.e., relations with zero fields. They have special semantics in Soufflé, and to preserve those semantics, we consider a relation with a single field (the PC) to be nullary.

3.3 Interpreter

The Soufflé interpreter runs a program on the fly, keeping a context of type InterpreterContext, and manipulating a set of RAM relations. To avoid getting into the details of how relations are stored, and how data indices are maintained, we decided not to modify InterpreterRelation and InterpreterIndex. Instead, we wrap InterpreterRelation in LiftedInterpreterRelation. The wrapper maintains the same interface, but adds the semantic manipulation of the PC field.

Another significant difference between LiftedInterpreterRelation and InterpreterRelation is existence checking of records. In Soufflé, checking whether a record exists in a relation is straightforward using the full index of the relation, returning true if the record exists in the index and false otherwise. With PCs, existence checking is more subtle because the record we are looking for might exist but with a different PC. To accommodate this, we add a PC output parameter to exists, the existence checking method of LiftedInterpreterRelation. Now instead of just returning a boolean indicating whether a record exists in a relation, we also return a pointer to the stored PC of the record (if the record exists).

Now whenever two records are resolved by the interpreter, their PCs need to be conjoined, and the conjunction (if satisfiable) becomes the PC of the resulting record. If on the other hand the conjunction is not satisfiable, the result can be safely ignored because an unsatisfiable PC indicates an empty set of configurations in which this record exists. Satisfiability checking is a constant-time operation on BDDs (although BDD construction might take exponential time in the number of variables). Because clause resolution might take place recursively, we add a PC field to InterpreterContext, which keeps track of the PCs of intermediate results.

When inserting a record into a relation, we again need to take the PC into consideration. If the record already exists in the relation with the same PC, we do not need to add it again. If it exists with a different PC, we replace the stored PC with the disjunction of the old and new PCs, because the record now exists in the union of the two sets of configurations. If the record does not exist at all, we simply add it with its new PC.
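The existence check and insertion logic described above can be sketched in Python as follows. This is an illustration only: the real LiftedInterpreterRelation wraps Soufflé's indexed relations and stores PCs as BDD-backed objects, whereas this sketch uses a dictionary and represents a PC as a frozenset of world identifiers (so conjunction is intersection and disjunction is union). The class name LiftedRelationSketch is hypothetical.

```python
class LiftedRelationSketch:
    """A relation mapping each record to its PC (a frozenset of world ids)."""

    def __init__(self):
        self._rows = {}

    def exists(self, record):
        """Return (found, pc); pc is the stored PC, or None when absent."""
        pc = self._rows.get(record)
        return pc is not None, pc

    def insert(self, record, pc):
        """Insert record under pc; return True if the relation changed."""
        if not pc:                       # unsatisfiable PC: nothing to store
            return False
        found, old = self.exists(record)
        if not found:                    # new record: store it with its PC
            self._rows[record] = pc
            return True
        merged = old | pc                # existing record: disjoin the PCs
        if merged == old:                # same PC, or one already covered
            return False
        self._rows[record] = merged
        return True

def resolve(pc1, pc2):
    """PC of a record derived from two records: the conjunction of their PCs."""
    return pc1 & pc2

rel = LiftedRelationSketch()
first = rel.insert(("Athens", "Rome"), frozenset({1, 2}))
again = rel.insert(("Athens", "Rome"), frozenset({1, 2}))
wider = rel.insert(("Athens", "Rome"), frozenset({3}))
```

Returning whether the relation changed is a design choice of this sketch; it lets a fixpoint loop detect that re-deriving a record under an already covered PC adds no new information.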

We had to modify the I/O subsystem of Soufflé to make sure we correctly read and write PCs together with records from/to CSV files. PresenceConditionParser is used to parse PCs on input, and logic for serializing PCs is added to the PresenceCondition class. At this point, we do not support storing facts to SQLite databases.

4 Evaluation

(a) Time overhead.
(b) Space overhead.
Figure 6: Time and space overhead due to variability-aware inference for five different fact sets and three sets of rules.
                       | insens                          | 1Type+Heap                      | taint-1Call+Heap
Fact-base    R     FPC |   T(ms)   S(KB)  TN(ms)  SN(KB) |   T(ms)   S(KB)  TN(ms)  SN(KB) |   T(ms)   S(KB)  TN(ms)  SN(KB)
Lampiro     18     343 |   8,111  41,170   8,324  41,160 |  20,725 149,686  20,522 149,661 |  45,996 230,370  43,014 230,329
Prevayler    5   6,507 |   5,334   4,407   5,066   4,177 |   6,013   8,630   5,908   8,035 |   9,717   5,534   9,640   5,203
BerkeleyDB  42  49,062 |  10,810  49,725  10,966  47,071 |  17,273 122,922  17,186 113,346 |  21,474 112,060  21,247 104,137
MM08        27   6,811 |   4,720   3,259   4,656   2,944 |   5,142   6,990   5,099   6,114 |   9,306   7,829   9,360   6,960
GPL         21   3,353 |   4,517     409   4,471     314 |   4,718     593   4,675     441 |   8,861     462   8,795     344
Table 1: Inference time and database size for three different Datalog programs applied to five different fact sets. For each fact base we report the number of features (R), number of facts with PCs other than True (FPC), inference time (T), database size (S), non-variability-aware inference time (TN), and non-variability-aware database size (SN). Time is reported in milliseconds, and space is reported in Kilobytes.

We evaluate the performance of our implementation of variability-aware Soufflé in terms of time and space overhead. In particular, the research question we are trying to answer is how much overhead in inference time and database size is attributable to our modifications to Soufflé. To answer this question, we compare the performance of Soufflé on a fact set annotated with PCs against its performance on the same set with the PCs removed.

We use the same dataset used in [Shahin:2019], which comprises five fact sets extracted from Java programs, and three program analyses (implemented as Datalog rules) applied to each of them. Table 1 summarizes the number of features (R) and number of facts annotated with PCs (FPC) for each of the five benchmark fact sets. In addition, for each of the three Datalog rule sets (insens, 1Type+Heap, taint-1Call+Heap) it reports the inference time (T), database size after inference (S), and the corresponding values when the fact set with no PC annotations is used (TN and SN respectively). Time is measured in milliseconds, and space is measured in Kilobytes.

Fig. 6(a) shows the inference time overhead when applying each of the three Datalog programs to each of the five fact sets. Overhead is calculated as the ratio of the time taken by variability-aware inference to the time taken by standard Datalog inference. There are a few cases of overhead values less than 1.0, which can be considered outliers due to other factors affecting overall processing time (e.g., I/O). From this graph, we can conclude that the overhead is relatively small (7% was the maximum reported, for taint-1Call+Heap on Lampiro). We still cannot see a direct correlation between the time overhead and fact set attributes (e.g., feature count, percentage of facts annotated with PCs).

Similarly, Fig. 6(b) shows the database size overhead when applying the same Datalog programs to the fact sets, where the ratio here is between database sizes. Soufflé databases are stored as text files, and since variability-aware facts (including inferred ones) might have PCs, and those PCs are stored as text, it is natural that a variability-aware fact database takes more space than a plain database with no PCs. We can see from this graph that the database size overhead grows roughly with the percentage of PC-annotated input facts. This overhead reaches almost 34% for GPL, where about 60% of the input facts are PC-annotated.
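The overhead ratios discussed above can be recomputed directly from Table 1. The short Python sketch below does so; the numbers are copied from the table, while the dictionary layout and the function name overheads are choices made for this illustration.

```python
# (T, S, TN, SN) per fact base and rule set, copied from Table 1.
TABLE1 = {
    "Lampiro": {
        "insens": (8111, 41170, 8324, 41160),
        "1Type+Heap": (20725, 149686, 20522, 149661),
        "taint-1Call+Heap": (45996, 230370, 43014, 230329),
    },
    "Prevayler": {
        "insens": (5334, 4407, 5066, 4177),
        "1Type+Heap": (6013, 8630, 5908, 8035),
        "taint-1Call+Heap": (9717, 5534, 9640, 5203),
    },
    "BerkeleyDB": {
        "insens": (10810, 49725, 10966, 47071),
        "1Type+Heap": (17273, 122922, 17186, 113346),
        "taint-1Call+Heap": (21474, 112060, 21247, 104137),
    },
    "MM08": {
        "insens": (4720, 3259, 4656, 2944),
        "1Type+Heap": (5142, 6990, 5099, 6114),
        "taint-1Call+Heap": (9306, 7829, 9360, 6960),
    },
    "GPL": {
        "insens": (4517, 409, 4471, 314),
        "1Type+Heap": (4718, 593, 4675, 441),
        "taint-1Call+Heap": (8861, 462, 8795, 344),
    },
}

def overheads(table):
    """Map (fact base, rule set) to (time ratio T/TN, size ratio S/SN)."""
    return {
        (base, rule): (t / tn, s / sn)
        for base, rules in table.items()
        for rule, (t, s, tn, sn) in rules.items()
    }

ratios = overheads(TABLE1)
worst_time = max(ratios, key=lambda k: ratios[k][0])
worst_size = max(ratios, key=lambda k: ratios[k][1])
```

Here worst_time identifies taint-1Call+Heap on Lampiro (a ratio just under 1.07), and worst_size falls on GPL, in line with the figures quoted in the text.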

Recall that the rationale behind variability-aware computing is to run a program only once on values from all configurations, as opposed to running the program on each configuration separately. Since the number of configurations is typically exponential in the number of features, the marginal overhead we see here is negligible compared to the savings due to running the program only once. More details on our experiment setup and evaluation results can be found in [Shahin:2019].

5 Conclusion and Future Work

In this paper we presented the design and development of the variability-aware Soufflé Datalog engine. The engine can take Datalog facts annotated with presence conditions as input, and compute the presence conditions of its inferred facts, eliminating facts that do not exist in any valid configuration.

We evaluated the overhead of our variability-aware Datalog inference in terms of inference time and size of the fact database, showing that time overhead is marginal, and space overhead grows with the percentage of PC-annotated input facts. This overhead is acceptable compared to the brute force approach (each configuration running separately), where the number of configurations, and accordingly the overhead, is exponential in the number of variability features.

For future work, we plan to extend our variability-aware inference implementation to the Soufflé C++ code generator. We also plan to extend our theoretical foundations and implementation to support presence conditions on rules. This would allow for variability of inference logic in addition to data.