More Effective Ontology Authoring with Test-Driven Development

Ontology authoring is a complex process, where commonly the automated reasoner is invoked for verification of newly introduced changes, therewith amounting to a time-consuming test-last approach. Test-Driven Development (TDD) for ontology authoring is a recent test-first approach that aims to reduce authoring time and increase authoring efficiency. Current TDD testing falls short on coverage of OWL features and possible test outcomes, the rigorous foundation thereof, and evaluations to ascertain its effectiveness. We aim to address these issues in one instantiation of TDD for ontology authoring. We first propose a succinct, logic-based model of TDD testing and present novel TDD algorithms so as to also cover any OWL 2 class expression for the TBox and for the principal ABox assertions, and prove their correctness. The algorithms use methods from the OWL API directly such that reclassification is not necessary for test execution, therewith reducing ontology authoring time. The algorithms were implemented in TDDonto2, a Protégé plugin. TDDonto2 was evaluated on editing efficiency and by users. The editing efficiency study demonstrated that it is faster than a typical ontology authoring interface, especially for medium-sized and large ontologies. The user evaluation demonstrated that modellers make significantly fewer errors with TDDonto2 compared to the standard Protégé interface and complete their tasks better using less time. Thus, the results indicate that Test-Driven Development is a promising approach in an ontology development methodology.




1 Introduction

Ontology engineering is facilitated by methods and methodologies, and tooling support for them. The methodologies are mostly information system-like, high-level directions, such as variants on waterfall and lifecycle development Garcia et al. (2010); Suárez-Figueroa et al. (2008), although more recently, notions of Agile development are being ported to the ontology development setting, e.g., Blomqvist et al. (2012); Peroni (2017), including testing in some form Ferré and Rudolph (2012); García-Ramos et al. (2009); Vrandečić and Gangemi (2006); Warrender and Lord (2015). Now that most automated reasoners for OWL have become stable and reliable over the years, new methods have been devised that use the reasoner creatively in the support of the ontology authoring process. Notably, the OWL reasoner can also be used for examining negations Ferré (2016); Ferré and Rudolph (2012), checking the changes in entailments after an ontology edit Denaux et al. (2012); Matentzoglu et al. (2016), and proposing compatible object properties for any two classes Keet et al. (2013).

Such tools are motivated at least in part by the time-consuming trial-and-error authoring process, i.e., where a modeller checks consistency after each edit Vigo et al. (2014). However, aforementioned methods and tools still require classification for each assessment step, which is unsustainable for large or complex ontologies due to prohibitively long classification times. Effectively, these modellers take a test-last approach to ontology authoring. In this respect, ontology engineering methodologies still lag behind software engineering methodologies both with respect to maturity and adoption Iqbal et al. (2013).

There are a few recent attempts at explicitly incorporating automated testing with a test-first approach Keet and Ławrynowicz (2016); Warrender and Lord (2015), which is common in software engineering and known under the banner of test-driven development (TDD) Beck (2004). To the best of our knowledge, there are three tools for TDD unit testing of ontologies Keet and Ławrynowicz (2016); Warrender and Lord (2015); Scone Project (2016) such that one can check whether an axiom is entailed before adding it. They exhibit two shortcomings that limit their potential for wider uptake: 1) certain axioms expressible in OWL 2 DL Motik et al. (2009) are not supported as TDD tests, such as subsumptions whose subclass is a class expression rather than a named class, and 2) the outcome of a test can be only “pass” or “fail” with no further information about the nature of failure. Further, there has been no rigorous theoretical analysis of the techniques used for such ontology testing that avails of the automated reasoner pre-emptively. Yet, for modellers to be able to rely on reasoner-driven TDD for ontology authoring, as they do with test-last ontology authoring, such a theoretical foundation would be needed.

In this paper, we aim to fill this gap in rigour and coverage. We first propose a succinct logic-based model of TDD unit testing as a prerequisite. Subsequently, we generalise the piecemeal algorithms of Keet and Ławrynowicz (2016) to also cover any OWL 2 class expression in the axiom under test for not only the TBox, as in Keet and Ławrynowicz (2016), but also for the principal ABox assertions, and prove their correctness. These algorithms do not require reclassification of an ontology in any test after a first single classification before executing one or more TDD tests, and they work with any OWL 2 compliant reasoner. This is feasible through bypassing the ontology editor functionality (‘pressing the reasoner button’) and availing directly of a set of methods available from the OWL reasoners in a carefully orchestrated way. We have implemented the algorithms by extending one of the three TDD tools for ontologies, TDDOnto Keet and Ławrynowicz (2016), into TDDonto2 (also a Protégé 5 plugin) as a proof-of-concept to ascertain their correct functioning practically Davies et al. (2017). The plugin is open source. This implementation was subsequently used in two evaluations. First, we devised a human-independent editing efficiency approach examining clicks, keystrokes, and reasoner invocation to compare the test-last approach with the basic Protégé 5 interface to the test-first approach with TDDonto2. TDDonto2 has a higher editing efficiency, i.e., takes less time, than the basic interface, with a small difference for very small ontologies and a substantial one for medium to large ontologies. Second, we conducted a typical user evaluation to compare the basic Protégé interface to TDDonto2, which demonstrated that the modellers completed a larger part of the tasks with fewer mistakes in less time. Thus, TDD with the test-first approach and TDDonto2 is more effective than the common (test-last with Protégé) authoring approach.

In the remainder of the paper, we first provide motivations for why testing is applicable to ontologies (Section 2), and describe the main requirements. Section 3 describes related work. The first part of the main contributions is presented in Section 4, which covers the model for testing and the main novel algorithms. The evaluations of its implementation are presented in Section 5. Section 6 discusses the work and we conclude in Section 7.

2 Motivations and requirements for a test-first approach

Test-Driven Development (TDD) in software engineering Beck (2004) is a methodology based on two rules: 1) write new code only if an automated test has failed, and 2) eliminate duplication. This induces a “red–green–refactor” pattern of development: first write a new test which fails, then write code which makes it pass with minimal effort, then remove resultant duplication and restructure if necessary. Tests thus serve to define desired functionality. The process is usually facilitated with a test harness that runs tests automatically and generates reports. TDD has been shown to improve code quality Rafique and Mišić (2013), especially in complex projects, and it is also shown to improve productivity Janzen (2006). In light of this, TDD may also be used for ontology development. In the next two subsections we first motivate where in the ontology authoring process tests could be useful and subsequently look at the broader picture of the ontology development processes.

2.1 TDD tests for ontologies

Ontologies, like computer programs, can become complex so that it is difficult for a human author to predict the consequences of changes. Automated tests are therefore useful to detect unintended consequences. As an illustrative example, suppose an author creates a class hierarchy in which giraffes are subsumed by herbivores and herbivores by mammals. The author then realises that not all herbivores are mammals, so shortens the hierarchy by removing the subsumption of herbivores under mammals, thereby losing the derivation $\mathsf{Giraffe} \sqsubseteq \mathsf{Mammal}$. An application that uses this ontology to retrieve mammals would then erroneously exclude giraffes. This issue can be caught by a simple automated test to check whether $\mathsf{Giraffe} \sqsubseteq \mathsf{Mammal}$ is still entailed. Superficially, it may seem like this problem can be solved by just adding such axioms directly to the ontology. However, adding them introduces a lot of redundancy, making modification of the ontology more difficult. Adding only a test instead ensures correctness without bloating the ontology. In addition, if it is specified as a test and documented as such in a testing environment, one can easily re-run it repeatedly.
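The giraffe example can be made concrete with a toy regression test. The sketch below is only an illustration: it stands in for a reasoner with a plain transitive closure over told subsumptions between named classes, which is far weaker than OWL 2 reasoning, and the function names are hypothetical.

```python
from typing import Dict, Set


def entailed_superclasses(told: Dict[str, Set[str]], cls: str) -> Set[str]:
    """Collect every named class reachable from `cls` via told subsumptions.

    Toy stand-in for a reasoner: it only follows chains of named classes.
    """
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in told.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumption_holds(told: Dict[str, Set[str]], sub: str, sup: str) -> bool:
    """Regression test: is `sub` still (transitively) subsumed by `sup`?"""
    return sup in entailed_superclasses(told, sub)


# Hierarchy before the edit: giraffes are herbivores, herbivores are mammals.
before = {"Giraffe": {"Herbivore"}, "Herbivore": {"Mammal"}}
# Hierarchy after the edit: herbivores are no longer mammals.
after = {"Giraffe": {"Herbivore"}}

print(subsumption_holds(before, "Giraffe", "Mammal"))
print(subsumption_holds(after, "Giraffe", "Mammal"))
```

Keeping the check as a test, rather than asserting the axiom in the ontology, flags the regression after the edit without adding a redundant axiom.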

Tests may also be used outside an automated test suite in order to explore and understand an ontology. For example, an author might be assessing an ontology of animals for reuse and wants to verify that a particular axiom is entailed in the ontology. The author can simply create a corresponding temporary test and observe the result, saving the time it would take to browse the inferred class hierarchy in an ontology development environment such as Protégé.

A similar approach can be employed when developing a new ontology: create a temporary test to determine whether the axiom i) is already entailed, ii) would result in a contradiction or unsatisfiable class if it were to be added to the ontology, or iii) can be added safely. For instance, one may wish to check whether one can add a domain declaration that restricts some object property to animals, such that it would not cause any class to become unsatisfiable. In this case, one would just execute this one TDD test for this axiom. Compare that with the laborious and time-consuming alternatives: 1) browsing to the object property declarations and/or clicking through all class axioms of classes that are not animals and manually inspecting each one on whether it happens to participate in an axiom involving that object property, or 2) the standard approach of adding an axiom, running the reasoner, and then observing the consequences. Option 1 may be feasible for a small toy ontology consisting of a few classes, but not for typical ontologies, let alone large ontologies, and one probably still adds option 2 to it. Option 2 involves reclassification, which may be slow, and which a TDD test can avoid once the ontology is classified.

This gives us two broad use cases:

  1. Declare many tests alongside an ontology and evaluate them all together in order to demonstrate quality or detect regressions.

  2. Evaluate temporary tests as needed in order to explore an ontology or predict the consequences of adding a new axiom.

To satisfy both of these, a hard requirement is that tests must not reclassify the ontology, and they must produce results that identify the consequences of adding an axiom.

2.2 Tests within the development process

The previous examples assume there is an axiom to add, but that has to come from some place and more activities are going on when testing. We identified three scenarios for the former and illustrated an informal lifecycle process for the latter in Keet and Ławrynowicz (2016). For instance, i) the axiom may be a formalisation of a competency question formulated by domain experts, ii) a modeller may work with a template for axioms, a spreadsheet or an ontology design pattern for one or more axioms, or iii) a knowledge engineer may already know which axiom to add. Resolving this aspect is beyond the current scope, as even within the TDD part, multiple options are already possible even with the straightforward fail-pass character. This is illustrated in Figure 1 with a more detailed version of the TDD test cycle within the larger TDD lifecycle for the simplest base case. TDD tests for ontologies currently only pass or fail, but, cf. TDD tests in software development, one has to deal with classifying the ontology and handling any inconsistency or unsatisfiable classes as well. The classification step is present because when a test evaluates to a ‘fail’, one does not know whether that axiom is absent just because the knowledge is missing or because it would result in an inconsistency if it were to be added, so one would have to check that anyway. Ideally, one would want to know before any edit why the axiom is not in the ontology, so as to be better informed about the next step(s) to take.

Figure 1: General flow for the simplest base case for a single TDD test that corresponds to the red-green (fail-pass) TDD notion, based on the lifecycle in Keet and Ławrynowicz (2016) (not all possible permutations are shown). The highlighted part indicates the flow of steps for the straightforward fail-pass case.

3 Related works

We briefly outline the few ontology testing implementations first and subsequently recap relevant aspects of automated reasoners for OWL ontologies.

Test-Driven Development for ontologies

Three TDD tools for ontologies have been proposed recently. TDDOnto is a Protégé plugin that allows test axioms to be specified in Protégé’s syntax and then uses the reasoner through the OWL API, for it was shown to be the most efficient technology examined Keet and Ławrynowicz (2016). Further, once the ontology is classified, one can run as many implemented tests as one wants without invoking a classification again, whereas the other tools require ontology classification with each axiom exploration; hence, by design, they are already less efficient than TDDOnto. The two other TDD approaches for ontology development use a subset of OWL or have a different scope. Tawny-OWL Warrender and Lord (2015) is an ontology development framework implemented in Clojure (not a widely known and used programming language) and it provides predicate functions which query the reasoner that will return true/false. It can be used in conjunction with any testing framework, such as the built-in “clojure.test”. TDD from a domain expert perspective is explored with Scone Scone Project (2016), which is based on Cucumber Cucumber (2016). It leverages controlled natural language and (computationally costly) mock individuals, as do the tests described in Denaux et al. (2012). Like Tawny-OWL, Scone is also separated from the common ontology development tools. Tawny-OWL and Scone do not support testing object properties or data properties.

No attempt has been made to rigorously prove the correctness of the testing algorithms of TDDOnto, Tawny-OWL, or Scone. None of these three tools supports the full range of axioms permitted in OWL 2. Most notably, in none of them is it possible to directly test axioms of the form $C \sqsubseteq D$ where $C$ is not a named class, i.e., where it is a class expression. In addition, all three tools give only limited information about the result of any test, being pass/fail in Tawny-OWL and Scone, with TDDOnto also reporting missing vocabulary. This hinders their usefulness as a means to explore an ontology or aid in development. Using TDD with arbitrary General Concept Inclusions (GCIs) and a more comprehensive testing model is possible with TDDonto2, which has been introduced recently Davies et al. (2017), but that paper did not cover technical details and evaluation.

There are several related works that move in the direction of Agile ontology development Blomqvist et al. (2012); Peroni (2017), such as the pay-as-you-go approach for Ontology-Based Data Access Sequeda and Miranker (2017) and the axiom-based possible world explorer Ferré (2016), a proposal for an ontology testing framework for requirements verification and validation Fernandez-Izquierdo (2017), and the ‘preparation’ step of going from competency questions to computing TDD tests Dennis et al. (2017). They have an affinity with TDD insofar as that they could integrate with, or even rely on, a well-functioning, reliable, implementation of TDD for ontology authoring, and they predominantly rely on SPARQL queries rather than an OWL reasoner.

Automated reasoners

There are numerous OWL 2-compliant automated reasoners that employ several mechanisms to handle OWL files; e.g., the OWL API OWL API (2016), OWLlink Liebig et al. (2011) (a Java library that defines a widely-supported standard interface for reasoners), and OWL-BGP Kollia et al. (2011) (a Java library that implements SPARQL). OWL-BGP introduces an efficiency overhead which is not present in OWL API Keet and Ławrynowicz (2016) and OWLlink specifies a protocol for communication between distributed components (availing of the OWL API) whose scenario is thus orthogonal. Therefore we consider only using the reasoner directly through the OWL API and its functionality.

Performance evaluation of TDDOnto found that implementations that introduced temporary “mock” individuals were substantially slower than all others Keet and Ławrynowicz (2016). The cause was not explicitly identified, but it is likely due to the need for reclassification of the ontology to include the new assertions. As stipulated in Section 1, reclassification is undesirable and therefore that approach is not appropriate when it can be avoided. Instead, after the ontology is initially classified, one should be able to test by making queries, which are assumed to be acceptably efficient. The OWL API reasoner interface specifies a “convenience” method named isEntailed that accepts any axiom and returns a Boolean indicating whether or not that axiom is entailed. However, it is not mandatory for reasoners to implement this method, and at most half do so (footnote 2: of the 73 reasoners listed, 38 showed evidence of maintenance since 2012, and of those, 19 list support for entailment). We therefore do not rely on it and use other available methods instead.

4 Foundations of TDD for ontologies

After a few preliminaries, we introduce the model for testing ontologies and subsequently present a selection of the algorithms.

4.1 Preliminaries

We begin with two prerequisite definitions, and then identify relevant reasoner methods and their returned values.

Definition 1 (Ontology language $\mathcal{L}_O$).

$\mathcal{L}_O$ is the language of ontology $O$ with language specification adhering to the OWL 2 standard Motik et al. (2009), which has classes denoted $C$ and $D$, where $C$ or $D$ may be a named class or a class expression, object properties $R$, individuals $a$, and axioms that adhere to those permitted in OWL 2 DL.

Definition 2 (Signatures).

The signature of an axiom $A$ in ontology $O$ represented in $\mathcal{L}_O$, denoted $\Sigma(A)$, is the set of all symbols in the axiom. Further, there is a class signature, an object property signature, and an individual signature.

Definition 3 (OWL reasoner methods).

Let $O$ be the ontology under test, $C$ and $D$ class expressions, $N$ a named class, $R$ an object property, and $a$ an individual. The methods available from the reasoner include getSubClasses, getInstances, and isSatisfiable, which are used in Section 4.3.

Note that getSubClasses returns the union of equivalent classes and strict subclasses.

4.2 A model of testing ontologies

In order to rigorously examine any testing algorithms, we need a formal description of what it means to test an axiom against an ontology (footnote 3: observe that the scope is testing axioms, not a broad informal meaning of ‘testing’ that also would include, say, checking for naming conventions). In line with the use cases identified in Section 2, we define the possible test results. Instead of the underspecified three possible statuses of the existing tools (pass/fail/unknown), we specify seven cases that include the three existing ones and, principally, refine the ‘fail’ cases. They are listed in order from most grave failure to pass.

  • Ontology already inconsistent. That is, $O \models \bot$. The reasoner cannot meaningfully respond to queries, so no claims can be made about the axiom.

  • Ontology already incoherent. There is at least one named class $N$ such that $O \models N \sqsubseteq \bot$.

  • Missing entity in axiom. By Definition 2, we have $\Sigma(A) \not\subseteq \Sigma(O)$, i.e., the axiom $A$ uses at least one symbol that does not occur in the ontology $O$.

  • Axiom causes inconsistency. If the axiom were to be added to $O$, then it would cause it to become inconsistent, i.e., $O \cup \{A\} \models \bot$.

  • Axiom causes incoherence. If the axiom were to be added to $O$, it would cause at least one named class to become unsatisfiable.

  • Axiom absent. The axiom is not entailed by the ontology ($O \not\models A$) and can be added without negative consequences.

  • Axiom entailed. The axiom is already entailed by the ontology ($O \models A$).

In the context of TDD, only “Axiom entailed” is a pass; all the others are test failures. The first two possible failures apply to the entire suite of tests rather than to any one, so they should be checked only once as preconditions before evaluating any tests. Therefore, we do not consider them in any of the algorithms in Section 4.3. Similarly, the missing entities case can be a simple check at the start of each test which does not affect how it is otherwise evaluated. Since there is no ambiguity, we henceforth abbreviate the remaining cases to “inconsistent”, “incoherent”, “absent”, and “entailed”. This leads to the following formal definition of the testing model.

Definition 4 (Model for testing).

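Given a consistent and coherent ontology $O$, and an axiom $A$ s.t. $\Sigma(A) \subseteq \Sigma(O)$, i.e., every symbol of $A$ occurs in $O$, then the result of testing $A$ against $O$ is:

$$
\begin{cases}
\text{entailed} & \text{if } O \models A \\
\text{inconsistent} & \text{if } O \cup \{A\} \models \bot \\
\text{incoherent} & \text{if } O \cup \{A\} \not\models \bot \text{ and } O \cup \{A\} \models N \sqsubseteq \bot \text{ for some named class } N \\
\text{absent} & \text{otherwise}
\end{cases}
$$

The resultant values are ordered according to graveness of failure, from most to least grave: inconsistent, incoherent, absent, entailed. One could add nicer labels in a user interface, such as “this axiom is redundant” for “entailed”, but such considerations are outside the current scope.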
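The four outcomes and their ordering by graveness can be captured in a small ordered enumeration. This is a minimal sketch: the numeric values and the helper names `gravest` and `is_pass` are assumptions made for illustration and do not come from TDDonto2.

```python
from enum import IntEnum
from typing import Iterable


class TestResult(IntEnum):
    """Test outcomes, with smaller values denoting graver failures (assumed encoding)."""
    INCONSISTENT = 0
    INCOHERENT = 1
    ABSENT = 2
    ENTAILED = 3


def gravest(results: Iterable[TestResult]) -> TestResult:
    """Return the most grave outcome among several sub-test results."""
    return min(results)


def is_pass(result: TestResult) -> bool:
    """In TDD terms only 'entailed' is a pass; every other outcome is a failure."""
    return result is TestResult.ENTAILED


outcome = gravest([TestResult.ENTAILED, TestResult.ABSENT, TestResult.INCOHERENT])
print(outcome.name, is_pass(outcome))
```

An ordered type of this kind makes it straightforward to combine the outcomes of several sub-tests into one result by taking the gravest.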

4.3 Algorithms and analysis

We now introduce the algorithms and analyses, in the context of an ontology $O$, which cover the most used types of axioms for (complex) classes, individuals, and RBox axioms that can be expressed as class axioms (domain and range axioms, and functional and local reflexivity and their inverses) (footnote 4: currently not supported are 1) entity declarations and datatype definitions because they cannot meaningfully be tested, 2) a further axiom type that is hardly used due to its unusual semantics, and 3) RBox axiom types other than those listed above, because it is complex to detect inconsistencies Keet (2012) and they require non-standard reasoning services). In the interest of space and readability, a selection of the algorithms and proofs is presented in this section, chosen on importance and novelty, and to demonstrate the approach to the algorithms and proofs. The remaining ones follow the same pattern and are explained briefly in the text; the complete set of algorithms and their proofs can be found in a technical report Davies (2016).

Overall, this leaves four generic class axioms, three assertions, and six object property axioms. Each algorithm is named according to the axiom it tests, as written in OWL 2 functional syntax, prepended with “test”. For example, the algorithm for testing SubClassOf axioms is named testSubClassOf.

4.3.1 Class axioms

In the class axioms permitted by OWL 2 DL, all arguments may be arbitrary class expressions, not just named classes, except for DisjointUnion, in which the first argument must be a named class. Consequently, to determine if $C \sqsubseteq D$ holds, it is not sufficient to check if $C \in \text{getSubClasses}(D)$, because $C$ will not occur in this set if it is not a named class. To resolve this, we build class expressions from the arguments and query them for satisfiability and instances.

To test such GCIs, we introduce Algorithm 1, which tests subsumption of class expressions and we will show its correctness.

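1: Input: $C$, $D$ class expressions
2: function testSubClassOf($C$, $D$)
3:     if getInstances($C \sqcap \neg D$) $\neq \emptyset$ then
4:         return inconsistent
5:     else if getSubClasses($C \sqcap \neg D$) $\neq \emptyset$ then
6:         return incoherent
7:     else if isSatisfiable($C \sqcap \neg D$) then
8:         return absent
9:     else
10:         return entailed
11:     end if
12: end function
Algorithm 1 testSubClassOf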
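Lemma 1.

For any set of axioms $O$ and class expressions $C$ and $D$, $O \models C \sqsubseteq D$ if and only if $C \sqcap \neg D$ is unsatisfiable w.r.t. $O$.

This is well known and therefore the proof is not included.

Proposition 1.

testSubClassOf is sound and complete for entailment. That is, $\text{testSubClassOf}(C, D) = \text{entailed} \Rightarrow O \models C \sqsubseteq D$ and $O \models C \sqsubseteq D \Rightarrow \text{testSubClassOf}(C, D) = \text{entailed}$.


Soundness Algorithm 1 can only return entailed at line 10, so the three if-conditions must all be false. So

$$\text{getInstances}(C \sqcap \neg D) = \emptyset \;\wedge\; \text{getSubClasses}(C \sqcap \neg D) = \emptyset \;\wedge\; \neg\,\text{isSatisfiable}(C \sqcap \neg D) \qquad (1)$$

Now suppose $O \not\models C \sqsubseteq D$. By Lemma 1, $C \sqcap \neg D$ is then not unsatisfiable w.r.t. $O$. In other words, $C \sqcap \neg D$ is satisfiable, which contradicts the last term of Eq. 1. Hence the supposition is false, so $O \models C \sqsubseteq D$.

Completeness The algorithm returns entailed if Eq. 1 holds (see Soundness). From $O \models C \sqsubseteq D$ we have, by Lemma 1, that $C \sqcap \neg D$ is unsatisfiable, so the last term of the equation is true. Since $C \sqcap \neg D$ is unsatisfiable, by the coherence precondition it has no named subclasses, and by the consistency precondition it has no instances. Therefore the first and second terms of the equation are also true. Therefore Eq. 1 holds, and so the algorithm returns entailed. ∎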

Proposition 2.

testSubClassOf is sound and complete w.r.t. inconsistency.


Soundness Algorithm 1 can only return inconsistent at line 4, so the first if-condition holds, so $\text{getInstances}(C \sqcap \neg D) \neq \emptyset$, which means there exists an individual $a$ such that $O \models (C \sqcap \neg D)(a)$. Under $O \cup \{C \sqsubseteq D\}$ it follows also that $D(a)$, which is a contradiction, so $O \cup \{C \sqsubseteq D\} \models \bot$.

Completeness We have that $O$ is consistent, so it has an interpretation, but $O \cup \{C \sqsubseteq D\}$ is inconsistent, so it has no interpretations. Suppose the algorithm does not return inconsistent. Then it must be that $\text{getInstances}(C \sqcap \neg D) = \emptyset$. But in this case there exists an interpretation which models both $O$ and $C \sqsubseteq D$. Let $\mathcal{I}$ be the interpretation of $O$ with the smallest domain. This means that the interpretation of any class only contains elements which correspond to individuals which must be in that class. This clearly still models $O$ because every individual is still in all classes it is entailed to be in. Under the supposition, we have that $\text{getInstances}(C \sqcap \neg D) = \emptyset$. So for any individual $a$, $O \not\models (C \sqcap \neg D)(a)$. Letting $x = a^{\mathcal{I}}$, then $x \notin (C \sqcap \neg D)^{\mathcal{I}}$. From the construction of $\mathcal{I}$, this means that $C^{\mathcal{I}} \subseteq D^{\mathcal{I}}$. So $\mathcal{I}$ also models $C \sqsubseteq D$. This contradicts the initial condition that $O \cup \{C \sqsubseteq D\}$ is inconsistent, so the supposition must be false, and therefore the algorithm returns inconsistent. ∎

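Proposition 3.

testSubClassOf is sound and complete w.r.t. incoherence.


Soundness Algorithm 1 can only return incoherent at line 6, so the first if-condition must be false and the second true. So $\text{getInstances}(C \sqcap \neg D) = \emptyset \wedge \text{getSubClasses}(C \sqcap \neg D) \neq \emptyset$. Therefore, by the second term, there exists some named class $N$ such that $O \models N \sqsubseteq C \sqcap \neg D$. By the contrapositive of Proposition 2, $O \cup \{C \sqsubseteq D\}$ is consistent, so by Lemma 1, $C \sqcap \neg D$ is unsatisfiable w.r.t. $O \cup \{C \sqsubseteq D\}$ and hence $O \cup \{C \sqsubseteq D\} \models N \sqsubseteq \bot$.

Completeness If $C \sqsubseteq D$ is added to $O$, then the only classes that are affected are $C$ and its subclasses. Consider a named class $N$. If $N \notin \text{getSubClasses}(C \sqcap \neg D)$ then it is possible that any element in the interpretation of $N$ is also in the interpretation of $\neg C \sqcup D$, and thus it is possible that $N$ remains satisfiable. If this is true for all such $N$, then they are all satisfiable in $O \cup \{C \sqsubseteq D\}$, which is therefore coherent, so it must not be true for at least one $N$. That is, $\text{getSubClasses}(C \sqcap \neg D) \neq \emptyset$.

From the contrapositive of Proposition 2 we have that the first if-condition is false, and we have shown that the second if-condition is true, so the algorithm returns incoherent. ∎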

Theorem 1.

testSubClassOf is correct and terminating.


It has been shown that Algorithm 1 is sound and complete for entailment, inconsistency, and incoherence, and the result is absent when it is not one of these other three. Therefore the algorithm returns the correct result in all cases. Termination is trivial, since the algorithm contains no loops or recursion. ∎
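The control flow that the proofs above attribute to testSubClassOf (instances of $C \sqcap \neg D$ first, then named subclasses, then satisfiability) can be sketched as follows. `StubReasoner` is a hypothetical stand-in with canned answers, not the OWL API; class expressions are plain strings in OWL 2 functional syntax, and the stub's subclass sets are assumed to exclude owl:Nothing.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class StubReasoner:
    """Hypothetical reasoner facade returning canned answers per class expression."""
    instances: Dict[str, Set[str]] = field(default_factory=dict)
    subclasses: Dict[str, Set[str]] = field(default_factory=dict)
    unsatisfiable: Set[str] = field(default_factory=set)

    def get_instances(self, expr: str) -> Set[str]:
        return self.instances.get(expr, set())

    def get_subclasses(self, expr: str) -> Set[str]:
        # Assumption: named subclasses only, without owl:Nothing.
        return self.subclasses.get(expr, set())

    def is_satisfiable(self, expr: str) -> bool:
        return expr not in self.unsatisfiable


def check_subclass_of(reasoner: StubReasoner, c: str, d: str) -> str:
    """Classify the outcome of testing SubClassOf(c d) without reclassification."""
    # The expression "c and not d": any witness for it contradicts c being a subclass of d.
    witness = f"ObjectIntersectionOf({c} ObjectComplementOf({d}))"
    if reasoner.get_instances(witness):
        return "inconsistent"  # an individual would have to be both d and not d
    if reasoner.get_subclasses(witness):
        return "incoherent"    # a named class would become unsatisfiable
    if reasoner.is_satisfiable(witness):
        return "absent"        # the axiom is not entailed but can be added safely
    return "entailed"          # "c and not d" is unsatisfiable, so c is a subclass of d


w = "ObjectIntersectionOf(Giraffe ObjectComplementOf(Mammal))"
print(check_subclass_of(StubReasoner(unsatisfiable={w}), "Giraffe", "Mammal"))
```

Only query-style calls are made, which is what allows a test to run against an already classified ontology.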

Note that a test for local reflexivity can avail of the same algorithm, with testSubClassOf($\top$, ObjectHasSelf(R)), and likewise for irreflexivity. The same holds for functional properties, with testSubClassOf($\top$, ObjectMaxCardinality(1, R)), and likewise for inverse functional properties and for object property domain and range (see Davies (2016) for details).
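To illustrate how object property axioms reduce to subclass tests, the sketch below maps each axiom type to a (subclass, superclass) pair in OWL 2 functional syntax, using the standard OWL 2 equivalences. The function name and the string encoding are assumptions for illustration; the exact formulations used by the authors are in Davies (2016).

```python
from typing import Optional, Tuple

THING, NOTHING = "owl:Thing", "owl:Nothing"


def as_subclass_test(axiom_type: str, prop: str, cls: Optional[str] = None) -> Tuple[str, str]:
    """Return the (sub, sup) arguments of an equivalent SubClassOf test."""
    if axiom_type == "ReflexiveObjectProperty":
        return THING, f"ObjectHasSelf({prop})"
    if axiom_type == "IrreflexiveObjectProperty":
        return f"ObjectHasSelf({prop})", NOTHING
    if axiom_type == "FunctionalObjectProperty":
        return THING, f"ObjectMaxCardinality(1 {prop})"
    if axiom_type == "InverseFunctionalObjectProperty":
        return THING, f"ObjectMaxCardinality(1 ObjectInverseOf({prop}))"
    if cls is None:
        raise ValueError(f"{axiom_type} needs a class expression")
    if axiom_type == "ObjectPropertyDomain":
        return f"ObjectSomeValuesFrom({prop} {THING})", cls
    if axiom_type == "ObjectPropertyRange":
        return THING, f"ObjectAllValuesFrom({prop} {cls})"
    raise ValueError(f"unsupported axiom type: {axiom_type}")


# Hypothetical property and class names.
print(as_subclass_test("ObjectPropertyDomain", "eats", "Animal"))
```

With such a mapping, six object property axiom types need no dedicated reasoner calls beyond those already used for subsumption.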

Testing for equivalent classes is done with a testEquivalentClasses function. The algorithm is correct and terminating, which follows directly from its specification: it iterates through testSubClassOf in a nested for-loop over the classes given as arguments, and given that testSubClassOf is correct and terminating, so is testEquivalentClasses (see Davies (2016) for details).

Algorithm 2 for disjoint classes is sound and complete for entailment, inconsistency, and incoherence, following largely the proofs of testSubClassOf that it uses within its for-loop, and therefore its proofs are not included here (they are available in the online technical report Davies (2016)). Likewise, Algorithm 3 for disjoint union is correct and terminating, relying on the previous results for the test for equivalent classes and for disjoint classes.

Input: class expressions $C_1, \ldots, C_n$
function testDisjointClasses($C_1, \ldots, C_n$)
    …
    for … to … do
        for … to … do
            …
        end for
    end for
    return …
end function
Algorithm 2 testDisjointClasses

Input: named class $N$; class expressions $C_1, \ldots, C_n$
function testDisjointUnion($N$, $C_1, \ldots, C_n$)
    …
    return …
end function
Algorithm 3 testDisjointUnion
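One way to realise a disjointness test on top of a subclass test is to check each pair $C_i \sqsubseteq \neg C_j$ and keep the gravest outcome. The sketch below is an assumption about how the pairwise results might be combined, not a transcription of Algorithm 2; the subclass test is passed in as a callable so that the example stays self-contained.

```python
from itertools import combinations
from typing import Callable, Sequence

# Outcomes from most to least grave, following the ordering of the testing model.
ORDER = ["inconsistent", "incoherent", "absent", "entailed"]


def check_disjoint_classes(
    classes: Sequence[str],
    subclass_test: Callable[[str, str], str],
) -> str:
    """Combine pairwise tests of SubClassOf(Ci ObjectComplementOf(Cj)), i < j."""
    results = [
        subclass_test(ci, f"ObjectComplementOf({cj})")
        for ci, cj in combinations(classes, 2)
    ]
    # Keep the gravest pairwise outcome; with fewer than two classes there is
    # nothing to test, so this sketch reports "entailed".
    return min(results, key=ORDER.index, default="entailed")


# Canned pairwise outcomes standing in for a real subclass test (hypothetical).
canned = {
    ("A", "ObjectComplementOf(B)"): "entailed",
    ("A", "ObjectComplementOf(C)"): "absent",
    ("B", "ObjectComplementOf(C)"): "entailed",
}
print(check_disjoint_classes(["A", "B", "C"], lambda c, d: canned[(c, d)]))
```

Because every pair is tested against the same classified ontology, the loop adds only query calls and no reclassification.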

4.3.2 Assertions

Observe that adding an assertion does not affect satisfiability of classes, so $O$ cannot become incoherent. We take this as given for all axioms tested in this section. Algorithm 4 tests equivalence and Algorithm 5 difference of individuals. When an algorithm accepts individuals $a_1, \ldots, a_n$ as arguments, we use a shorthand for this set. We use an integer variable to iterate over the indices of the individuals given as arguments, and two further variables to temporarily store sets of individuals. Soundness and completeness proofs are available in the online technical report Davies (2016).

Input: individuals $a_1, \ldots, a_n$
function testSameIndividual($a_1, \ldots, a_n$)
    if … then
        return …
    else
        for … to … do
            …
            if … then
                return …
            end if
        end for
        return …
    end if
end function
Algorithm 4 testSameIndividual

Input: individuals $a_1, \ldots, a_n$
function testDifferentIndividuals($a_1, \ldots, a_n$)
    for … to … do
        …
        if … then
            return …
        end if
    end for
    for … to … do
        …
        if … then
            return …
        end if
    end for
    return …
end function
Algorithm 5 testDifferentIndividuals

Finally, the algorithm for class assertions checks whether the individual is an instance of a class expression, using the getInstances(C) function.
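A class assertion test along these lines can be sketched with two instance queries: one for the class expression itself and one for its complement. The complement check is an assumption added here to separate “inconsistent” from “absent”; it rests on the fact that adding $C(a)$ is inconsistent exactly when $\neg C(a)$ is already entailed. The instance lookup is a hypothetical callable standing in for getInstances.

```python
from typing import Callable, Dict, Set


def check_class_assertion(
    get_instances: Callable[[str], Set[str]],
    cls: str,
    individual: str,
) -> str:
    """Classify the outcome of testing ClassAssertion(cls individual)."""
    if individual in get_instances(cls):
        return "entailed"
    # If the individual is entailed to be in the complement, asserting
    # membership of cls would make the ontology inconsistent.
    if individual in get_instances(f"ObjectComplementOf({cls})"):
        return "inconsistent"
    # Assertions cannot make classes unsatisfiable, so "incoherent" never occurs.
    return "absent"


# Canned instance sets for a toy ontology (hypothetical names).
canned: Dict[str, Set[str]] = {
    "Mammal": {"geoffrey"},
    "ObjectComplementOf(Plant)": {"geoffrey"},
}


def lookup(expr: str) -> Set[str]:
    return canned.get(expr, set())


print(check_class_assertion(lookup, "Mammal", "geoffrey"))
```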

5 Evaluation of TDD with TDDonto2

Figure 2: Annotated screenshot of TDDonto2 with a few evaluated sample axioms and a tourism ontology.

In order to evaluate the model of testing and the algorithms, we developed a Protégé plugin, named TDDonto2, that implements these TDD features, which will be introduced in Section 5.1. The actual evaluation will be presented afterwards. We carry out two ontology authoring tests: a quantitative approach to editing efficiency (Section 5.2) and an experiment with 25 novice modellers (Section 5.3).

5.1 The TDDonto2 Protégé plugin

TDDonto2 Davies et al. (2017) is an updated version of TDDOnto Keet and Ławrynowicz (2016) and is also a Protégé 5.x plugin that uses the OWL API, as that was shown to result in the best performance compared to the other techniques investigated Keet and Ławrynowicz (2016); Ławrynowicz and Keet (2016). It implements the new model of testing for TDD and the new TDD algorithms and also has a GUI wrapped around it that incorporates some of the Protégé features, such as autocomplete and recognising whether a term is in the ontology’s vocabulary. The plugin can be added to any tab in Protégé, as desired (footnote 5: it thus still allows the modeller to browse to the standard interface with the classes, object properties, and individuals tabs and so forth to author ontologies). An annotated screenshot of the plugin with several tests is included in Figure 2. Within the broader setting of an overarching TDD methodology for ontology authoring, it covers the aspects from entering the axiom (wherever it came from) until (but excluding) the refactoring step. As with its predecessor TDDOnto, one can type a single axiom and evaluate it directly, add several axioms and evaluate the whole set of axioms or a selected subset thereof, and add a single axiom or a set of axioms to the ontology with a one-click operation. The different possible statuses for the test outcomes are colour-coded, where ‘no evaluation’ is left blank, ‘entailed’ is highlighted in green, and the other statuses are highlighted in red. Examples that illustrate the tool, a tutorial-style screencast, the source code, and the jar file are available online.

Note that TDDonto2 is a proof-of-concept tool primarily to test the workings of the algorithms and investigate in more depth what the best components of a full TDD methodology would be, i.e., it has not been subjected to a full software development lifecycle and it does not yet cover all steps in a TDD methodology (e.g., it does not consider refactoring). Nonetheless, it suffices to evaluate TDD with TDDonto2 and to obtain first indications of whether a test-first approach brings any benefit.

5.2 Editing efficiency evaluation

Ontology authoring efficiency consists of two components: (1) the number of clicks and keystrokes one has to carry out even when one is familiar with the interface, and (2) the repeated invocation of the reasoner, where one extreme is reasoner invocation after each edit and the other is invocation only at the end of all editing operations. The former depends mainly on the axioms and the interface design of the ontology editor; the latter is an orthogonal dimension that can be added based on the number of edits.

Keystroke models are well-known in Human-Computer Interaction research as a non-invasive way to determine how long a task will take with the software, which has been extended over the years to determine correlations between typing & browsing speed and programming performance, the emotional state of a user, and biometric identification of users to detect hackers (see Kołakowska (2013); Thomas et al. (2005) and references therein). For the editing efficiency evaluation of test-first TDD-based ontology authoring with TDDonto2 vs the test-last-based baseline, we are interested in task completion, and within that scope, to eliminate the noise of users so as to get a clear understanding of the ‘number of clicks and keystrokes’ component and the effects of reasoner invocation to determine the best performance in the most efficient possible situation. This then puts forward the hypothesis that:

  • H1: Given a time allocation to clicks and keystrokes and automated reasoning (classification), the overall editing time is lower for TDD in TDDonto2 than for the ‘expert user’ with the test-last approach.

The ‘expert user’ is idealised as an experienced ontology engineer, i.e., someone who has a high familiarity with the widely-used Protégé 5.x tool and who never browses to the wrong place nor gets sidetracked in the authoring process. That said, the evaluation method described in the next section can easily be amended for another ontology editor, and a ‘novice factor’ can be added (for instance, slower than average typing, more latency in scrolling and moving the mouse, “Mental Operator” time penalties for switching contexts, including to the wrong ones, like the wrong tab, and keyboard/mouse input devices; see also Kołakowska (2013); Thomas et al. (2005) for further possible variables and metrics).

5.2.1 Materials and methods


To falsify, or validate, the hypothesis, we need a systematic and plausible scenario that can handle axiom input changes. The first aspect to establish is the type of edits. We assume a typical ontology that is not too lightweight, yet does not use all corner cases of OWL 2 DL axioms either, and draw on data about common axioms in an ontology. This resulted in the following set of 10 types of axioms, where $C$ and $D$ are named classes, $R$ is a simple object property, and $a$ is an individual (there are obviously variations, but the main point here is the principle of how to approach computing an editing efficiency and what variables are relevant):

  1. simple class subsumption, $C \sqsubseteq D$

  2. simple existential (all-some), $C \sqsubseteq \exists R.D$, or simple universal, $C \sqsubseteq \forall R.D$

  3. simple disjointness, $C \sqcap D \sqsubseteq \bot$ or $C \sqsubseteq \neg D$

  4. domain

  5. range

  6. instance declaration, $C(a)$

  7. qualified cardinality constraint, $C \sqsubseteq\ \geq n\, R.D$ or $C \sqsubseteq\ \leq n\, R.D$ or $C \sqsubseteq\ = n\, R.D$, with $n$ a non-negative integer

  8. non-simple class on the left-hand side (lhs) (other than domain and range axiom)

  9. ‘closure’ axiom

  10. arbitrary class expression on the rhs, conjunction/disjunction

Let us assume for now that these 10 axiom types are the only ones, so as to show the principle of the calculations.

The second dimension for the editing efficiency calculation is the presence/absence in the ontology of all the vocabulary elements used in the axioms, i.e., whether the terms have yet to be typed up or not. Third, whether all those vocabulary elements are top-level entities in the hierarchy (i.e., directly subsumed by owl:Thing and topObjectProperty) versus all entities are located at the leaves in the deepest hierarchy in the ontology. One then can compute three core variants: a) the lower bound with existing vocabulary + top-level elements, b) the upper bound with new vocabulary + down in the hierarchy, and c) an average case by taking the average characters/term and hierarchy depth for a typical online ontology or computed for the specific ontology under evaluation. We opt for the latter option, i.e., we take a quantitative approach where pre-selected axiom types are filled with average vocabulary from an existing ontology. To do this, we need the following variables:

  • the number of characters of a term, where term is the class name or object property name;

  • the depth of a term in its hierarchy, be this the class or object property hierarchy;

  • the number of characters of a term, where term is the instance name;

  • their respective averages, for some arbitrary ontology or for a specific ontology that is to be edited.

The number of clicks and keystrokes for a set of editing operations is then calculated as follows. Let $n$ be the number of axioms to add, each of which has the form of one of the 10 types above and is categorised as such. Then for each axiom $i$ ($1 \leq i \leq n$), one can calculate the minimum number of clicks and keystrokes ($\mathit{min}_i$) and the maximum ($\mathit{max}_i$), so that the best case scenario is $\sum_{i=1}^{n} \mathit{min}_i$ and the worst case scenario is $\sum_{i=1}^{n} \mathit{max}_i$. Alternatively, one can select an option for each axiom that mirrors one’s habit, or fix the option to systematically compare the same actions across ontologies. Because this is the first such evaluation, we choose the latter case.
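The best-case and worst-case totals described above can be sketched as follows. The function name and the cost values are hypothetical and serve only to illustrate the summation; they are not measurements from the evaluation.

```python
def best_and_worst_case(option_costs):
    """Sum the cheapest and the most expensive input option per axiom.

    option_costs holds one list per axiom; each list contains the cost
    (clicks plus keystrokes) of every way of entering that axiom.
    """
    best = sum(min(costs) for costs in option_costs)
    worst = sum(max(costs) for costs in option_costs)
    return best, worst


# Hypothetical costs for three axioms with 3, 2 and 1 input options.
costs = [[1, 10, 9], [14, 20], [12]]
best, worst = best_and_worst_case(costs)
print(best, worst)  # 27 42
```

Fixing one option per axiom, as chosen in the evaluation, corresponds to passing a single cost per axiom, in which case both totals coincide.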

Next, the way the number of clicks and keystrokes is calculated for each axiom specifically is as follows. We examined the editing efficiency calculations by using Protégé 5.2, which offers several ways of adding an axiom in most cases, under the assumption that the vocabulary is already present in the ontology. The first one for axiom of type (i) is included here and the others are listed in Appendix A:

  1. click ‘Classes’ tab (= 1), then either:

    a. drag class to position, if sufficiently nearby (existing classes) = 1

    b. click class - click SubClass Of - in ‘class expression editor’ type classname - click ok (existing classes) =

    c. click class - click SubClass Of - in ‘class hierarchy’ click as far down in the hierarchy as needed - select class - click ok (existing or new classes) =

The corresponding one for TDDonto2 is C SubClassOf: D, with C and D the names of the two classes.

Further, when typing the name of a vocabulary element or keyword, we consider the autocomplete feature. For a term, the number of keystrokes varies by size and naming, but for keywords (e.g., SubClassOf:) it is one character plus a tab. To even this out, each autocomplete is set to 4 keystrokes. For instance, the above calculation for C SubClassOf: D in TDDonto2 then becomes 4 + 4 + 4 = 12.
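The keystroke count with and without autocomplete can be sketched as below. The function is hypothetical; it assumes a flat cost per autocompleted term or keyword, as in the text, and that separators such as spaces are not counted. The class names Lion and Animal are illustrative only.

```python
def keystrokes(tokens, autocomplete=4):
    """Keystrokes needed to type the terms and keywords of one axiom.

    With autocomplete, every token costs a flat number of keystrokes;
    with autocomplete=None, every character of every token is typed.
    Separators between tokens are not counted (an assumption).
    """
    if autocomplete is None:
        return sum(len(token) for token in tokens)
    return len(tokens) * autocomplete


axiom = ["Lion", "SubClassOf:", "Animal"]   # illustrative class names
print(keystrokes(axiom))                    # 12, i.e., 4 + 4 + 4
print(keystrokes(axiom, autocomplete=8))    # 24
print(keystrokes(axiom, autocomplete=None)) # 21, i.e., 4 + 11 + 6
```

The `autocomplete` parameter makes it straightforward to rerun the same axioms under a different autocomplete cost or with autocomplete switched off.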

For the principal case to examine, we factor in time in two ways, with the first one being the determining one for falsifying hypothesis H1:

  • reasoning time, by the worst case scenario for reasoning where after each axiom, the reasoner is invoked in the standard Protégé vs an “evaluate all” in TDDonto2.

  • allocating 1 second to each click and 0.3 second to each keystroke, which is based on the typing speed average of 190-200 characters per minute (and permutations thereof; see below);

For small ontologies that classify fast, the time-per-click is expected to be the major factor, whereas for larger or complex ontologies, the reasoning time is expected to be the major factor. The reasoning time is determined by classifying the ontology once. The worst case for Protégé (reasoner invocation after each axiom) is then approximated by multiplying that time by 9, and the cost for TDDonto2 by multiplying it by two (one invocation at the start and one after adding all axioms).
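As a worked sketch of this time model, the following combines the interaction time with the reasoner invocations. The function names are hypothetical; the constants are those of the principal case, and the AWO figures (75.9s and 68.4s of clicks and keystrokes, 0.81s per classification) are the ones reported in Table 1 and Section 5.2.2.

```python
CLICK_S = 1.0       # seconds per click (principal case)
KEYSTROKE_S = 0.3   # seconds per keystroke (principal case)


def interaction_time(clicks, keystrokes, click_s=CLICK_S, keystroke_s=KEYSTROKE_S):
    """Time spent on the interface alone, in seconds."""
    return clicks * click_s + keystrokes * keystroke_s


def total_time(interaction_s, classification_s, reasoner_runs):
    """Interaction time plus the approximated cost of the reasoner invocations."""
    return interaction_s + reasoner_runs * classification_s


# 12 keystrokes for an autocompleted subsumption axiom in TDDonto2.
print(round(interaction_time(clicks=0, keystrokes=12), 2))  # 3.6

# AWO: classification takes 0.81s; 9 runs for Protégé, 2 for TDDonto2.
awo_protege = total_time(75.9, 0.81, reasoner_runs=9)
awo_tddonto2 = total_time(68.4, 0.81, reasoner_runs=2)
print(round(awo_protege, 2), round(awo_tddonto2, 2))  # 83.19 70.02
```

The difference of about 13s between the two AWO totals is the lower end of the range reported in the results.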

Lastly, the principal case will be checked against several permutations to assess robustness of results. They are as follows: (1) the same setting of 1s click and 0.3s keystroke but without autocomplete, (2) without autocomplete but slower clicking (2s) and faster typing (0.25s/keystroke), and (3) the same setting but 8 keystrokes for autocomplete cf. 4.


We selected three actual ontologies to compute the average values and augmented them with three ‘mock’ ontology averages. The actual ontologies are the African Wildlife Ontology (AWO), the Pizza ontology, and the DataMining Optimization Ontology (DMOP). The AWO is a very small tutorial ontology (31 classes, 5 object properties, and 56 logical axioms, in OWL 2 DL), which is used in the ontology engineering course of one of the authors and will also be used for the user evaluation in the next section. The Pizza ontology Rector et al. (2004) is deemed well-known; it has 98 classes, 8 object properties, and 786 logical axioms, and is of OWL DL expressiveness. DMOP Keet et al. (2015) is medium-sized and complex (723 classes, 96 object properties, and 2425 logical axioms, in OWL 2 DL). It was chosen because of its size and complexity and because two of the authors were involved in its development and thus would be able to analyse some internals if the results demanded it. In addition, we added three mock ontologies whose parameters have different values from AWO, Pizza, and DMOP, such that the effects of the variables’ values for reasoner time, length of names, and hierarchy depth can be examined further. Their relevant characteristics are included in Table 1. Given that we calculate with averages, the actual number of classes etc. in M1-M3 does not matter, nor does the DL fragment or OWL species as indication of possible reasoning times, for that is also a fixed value for each mock ontology, which was set to be in-between Pizza’s insignificant classification time on the one end and DMOP’s 20 minutes at the other end. While there are larger ontologies with longer classification times, 20 minutes is already substantial for the authoring process and much longer than the overhead of the TDDonto2 tool to compute the outcome of the tests.

The classification times for AWO, Pizza, and DMOP were recorded on a MacBook Pro with 2.7 GHz Intel Core i5 and 16 GB memory.

                                          AWO     Pizza   DMOP      M1    M2    M3
Classif. (s)                              0.81    0.1     1196.53   100   500   25
Class names, avg. characters              7.06    13.07   21.09     15    15    23
Class hierarchy, avg. depth               2       4.86    8.39      6     12    6
Object property names, avg. characters    9.4     11.63   14.14     12    12    15
Object property hierarchy, avg. depth     1.2     1.5     2.2       2     3     1.5
Instance names, avg. characters           0       6.4     19.03     10    10    19
Table 1: Classification time in seconds (“Classif. (s)”), average number of characters for the names of classes, object properties, and instances, and average hierarchy depth of classes and object properties, used in the editing efficiency computation. M1-M3: mock ontologies where values are set to assess their effects on editing efficiency.
Figure 3: Illustrative typical results of the editing efficiency data for Protégé v5.2 and TDDonto2. The times increase with increasing vocabulary name size and hierarchy depth averages (see Table 1).

5.2.2 Results

The aggregate results in editing efficiency time with and without reasoner are shown in Figure 4. To falsify or validate hypothesis H1 on test-last vs test-first, compare “Total Protégé - single edit reasoner” with “Total TDDonto2 - with reasoner”: test-first is faster than test-last throughout, with the difference ranging from 13s in case of the AWO to 8430s with the DMOP. Thus, H1 is validated.

Comparing the first two data series (i.e., without reasoner) to the latter two (with reasoner), it is clear that the reasoner has most effect on overall editing time for the medium to large ontologies (DMOP and mock2). For the two small ontologies, AWO and Pizza, the click and keystroke time turns out to be the major contributor to the overall time taken. AWO’s clicks and keystrokes amount to 75.9s in Protégé, so with a single classification taking 0.81s, it only reaches a total of 83.19s for Protégé’s worst case (invoke reasoner after each edit). For TDDonto2, the corresponding figures are 68.4s without the reasoner and 70.02s with its two invocations of the reasoner.

Figure 4: Aggregate results for the Protégé v5.2 and TDDonto2 editing efficiency with and without factoring the reasoning time, when an instance of each axiom type is tested/added once.
Figure 5: Scenario (a) for axiom types (ix) and (x), i.e., where not the class expression editor is used (as in Figure 3), but the browsing and clicking interface instead. Note the different range of values on the y-axis cf. Figure 3.

While test-first is a clear winner, the difference is small for very small ontologies and the interface itself seems to have a bigger impact. Therefore, let us disaggregate the click-and-keystroke time by type of axiom. The results are then still favourable for TDDonto2 (i.e., less time), except for a small difference for axiom types (ix) and (x); see Figure 3. We carried out a statistical analysis for each axiom type, where the null hypothesis is that there is no difference. As the TDDonto2 series’ data are not normally distributed, we used the Wilcoxon test for paired data, 2-tailed, with significance level 0.05. This was calculated over the data of the 4 scenarios pooled together so as not to cherry-pick (and 6 data points are insufficient for Wilcoxon). The difference was statistically significant for all axiom types except for type 5 (see online supplementary data).
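To make the paired test concrete, the following is a minimal pure-Python sketch of an exact two-sided Wilcoxon signed-rank test. It is not the procedure or tooling used for the reported analysis, which the text does not specify. The function name and the sample values are hypothetical, and the exhaustive enumeration is practical only for small samples.

```python
from itertools import product


def wilcoxon_signed_rank(xs, ys):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w, p), where w is the smaller of the positive and negative
    rank sums. Zero differences are dropped and tied absolute differences
    receive average ranks. All 2**n sign patterns are enumerated.
    """
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1      # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    hits = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, positive in zip(ranks, signs) if positive)
        if min(s, total - s) <= w:
            hits += 1
    return w, hits / 2 ** n


# Hypothetical paired editing times (seconds) for one axiom type.
protege = [10, 12, 15, 20, 22, 25, 30, 33]
tddonto2 = [9, 10, 12, 16, 17, 19, 23, 25]
w, p = wilcoxon_signed_rank(protege, tddonto2)
print(w, p)  # 0.0 0.0078125
```

With these made-up values every difference favours the second series, so the statistic is 0 and the null hypothesis of no difference would be rejected at the 0.05 level.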

There are several other noteworthy observations from this interface-only data. First, unlike with Protégé, the editing efficiency with TDDonto2 is immune to vocabulary name sizes and hierarchy depths. This is partially thanks to the autocomplete feature. Second, there is an obvious gradient in the Protégé data. The three mock ontologies, which vary class hierarchy depth and vocabulary name size, demonstrate that hierarchy depth has the most negative effect: mock2 has a greater hierarchy depth and a lower vocabulary name size than DMOP, yet higher values overall, whereas mock1 and mock3 have similar results although only the low hierarchy depth value is the same for both.

Note also that the values for axiom types (ix) and (x) are similar and now also invariant for Protégé, because of the possible options to add it to the ontology and the task execution scenario selected was that of the class expression editor rather than firing the steps for axiom type (ii) twice and related auxiliary clicks (see Appendix for details). That is, Protégé now also requires the user to type the characters of the axiom, which is similar to the TDDonto2 interface. Using their respective scenario (a) with clicking, then the differences with TDDonto2 are the largest, as shown in Figure 5, for it is a compounding effect adding up from the simpler axiom types. As they rely on axiom type (ii), which was already statistically significantly faster with TDDonto2, then so it is for axiom types (ix) and (x).

The comparison with other scenarios that vary click and keystroke times and autocomplete indicates robustness of results. That is, in the first alternate scenario (no autocomplete), Protégé has slightly lower values for most axiom types. In the second and third scenarios (slower clicks and less autocomplete, respectively), TDDonto2 has lower values for most axiom types and they show a similar pattern as in Figure 3 where axiom types (i)-(vii) are better for TDDonto2 and (ix)-(x) slightly better with Protégé (data not included). Thus, on the micro-level of the axiom, there is some editing difference between Protégé 5.2 and TDDonto2 for (very) small ontologies and an expert user, which becomes larger in favour of TDDonto2 for medium-sized and large ontologies. The latter is even more pronounced when taking into account the reasoner.

5.2.3 Discussion

It may be clear from the materials and methods section that there can be many parameters to assess editing efficiency, which, perhaps, detract from the core result of test-first vs. test-last with the reasoner: the former has a higher editing efficiency. For the interface interaction, we made choices that seem reasonable to us, such as the autocomplete feature and the average case only and, to some extent, which of the alternate interface interaction scenarios was selected to be included. (For instance, for (ix) and (x), the clicking option, which results in the large difference with TDDonto2 shown in Figure 5, could have been selected instead of the class expression editor. Besides that the authors would choose the latter option over firing the steps for axiom type (ii) twice, the class expression editor is the only option in case of a disjunction on the right-hand side, and therefore the class expression editor had been selected upfront.) This serves the investigation into simulating a set of expert users in tool evaluations as well as teasing out parameters for non-human quantitative evaluations of ontology authoring tasks. In the case of the common Protégé 5.2 interface vs TDDonto2, the latter generally comes out favourably in several scenarios. This is especially so for medium and large-sized ontologies, and it is further amplified with greater class hierarchy depth. In addition, if a ratio of axiom types had been selected for the calculations, rather than one of each, it would consist mostly of type (ii), where TDDonto2 is distinctly more efficient, so this beneficial effect would be amplified further. Thus, ontologies within OWL 2 EL expressiveness characteristics (tailored to large, ‘simple’ TBoxes), such as SNOMED CT, would benefit most from TDD and its axiom-level input in TDDonto2. This would be even more so for the scenario where new vocabulary has to be added, for the additional typing has a comparatively larger effect on the predominantly clicking-based Protégé interface with the existing vocabulary scenario we opted for. Finally, one may argue that most ontology developers are not efficient expert users and that the editing efficiency calculations are artificial. Therefore, we shall return to this factor in the next section with the user evaluation.

The set-up has been lenient on the automated reasoner. This is because the aim was to obtain a general sense of any possible benefit in the authoring process, rather than aiming for improvements in the seconds. Practically, reasoning time may increase greatly as a result of adding an axiom. A quantitative approach would amount to randomised adding of axioms, so it would be an unpredictable effect. Reasoner performance per se is not within the scope of the paper, however, and therefore we deemed the approximation of 9x the baseline acceptable, rather than adding more than the baseline. Further, the mode of approximation taken is favourable for the test-last setting rather than TDD, for the former requires the reasoner to be invoked more often. That TDD already emerged positively suggests it will be even more so in praxis.

5.3 User study

The editing efficiency evaluation reported on in the previous section assumed an optimal user working on an average ontology. Here, we are evaluating actual novice users on a relatively small ontology. We run this user study to test the following claims:

  • C1: Users will complete a larger part of the task when using the test-first TDDonto2 Protégé plugin than when using the test-last basic ontology editor interface of Protégé.

  • C2: Users will make fewer mistakes when using the TDDonto2 Protégé plugin than when using the basic ontology editor interface.

  • C3: Users will be able to complete the tasks in less time when using the TDDonto2 plugin than when using the basic ontology editor interface.

5.3.1 Materials and methods

The study included two general subtasks (assessing TBox axioms and ABox axioms) in one domain: African wildlife.

The subjects were master's students studying computer science, who have basic knowledge of ontologies and semantic technologies. The users had been informed that the purpose of the user study was to evaluate a tool for introducing knowledge into an ontology, namely TDDonto2, a plugin to the Protégé editor, compared to using the existing Protégé interface without the plugin. The users were presented with a demo of TDDonto2 (available online) and an ontology to be extended.

The general task was to assess the status of each axiom from the two sets (TBox axioms and ABox axioms) with respect to the ontology. The subjects could select the status from the following set: entailed, absent, incoherent, inconsistent, and they were provided textual definitions of each of the statuses. We measured what percent of the statuses of the axioms the subjects were able to assess within given time (completeness) and what percent of their assessments were correct (correctness).

The experiment procedure was as follows:

For the given set of axioms, follow these steps:

  • Register the current time (in the field provided).

  • Select the status of the given axiom (“Entailed”, “Absent”, “Incoherent”, “Inconsistent”) for each axiom from the set,

  • Enter the axioms for which you have marked “Absent” into the ontology and save the ontology file.

  • Copy the ontology source file to the given field.

  • Register the current time (in the field provided).

The experiment was designed as a within-subject study. The subjects were divided into two groups corresponding to the attendees of two class labs. There were 25 subjects altogether: 13 in the first group (which we denote Group A), and 12 in the second group (which we denote Group B). They were assigned four tasks:

  • Task TBox1: Testing the introduction of axioms regarding classes – without TDDonto2 plugin,

  • Task ABox1: Testing the introduction of axioms regarding instances – without TDDonto2 plugin,

  • Task TBox2: Testing the introduction of axioms regarding classes – with TDDonto2 plugin,

  • Task ABox2: Testing the introduction of axioms regarding instances – with TDDonto2 plugin.

In Group A, the subjects first used the basic editor interface to complete the tasks and then TDDonto2. In Group B, the subjects first used TDDonto2 to complete the tasks and then the basic editor interface. The time to complete tasks TBox1 and TBox2 (TBox axioms) was limited to 20 minutes for each task, and the time to complete tasks ABox1 and ABox2 (ABox axioms) was limited to 5 minutes for each task. Furthermore, to avoid the issue of a transfer effect (i.e., not repeating the errors the second time the subjects do the task), we prepared two different but comparable axiom sets (the tested axioms are available from the same online location as the TDDonto2 demo).

5.3.2 Results and discussion

Table 2 shows the data and statistics on the results of the user study, corresponding to claims C1-C3, with a breakdown by groups and the interfaces used. Regarding claim C1 (see Table 2a), we can see that, on average, more subjects completed the tasks when using TDDonto2 (94% of task completeness versus 90% regarding the basic interface), which supports our claim. Regarding claim C2 (see Table 2b), relatively more subjects (88%) correctly completed the tasks when using TDDonto2, while the percent of correct answers was lower in case of using the basic editor (62%), which supports claim C2. Regarding claim C3 (see Table 2c), we can see that the average time of completing the task was more than 2 times shorter when using TDDonto2 than when using the basic editor, more precisely it constituted 44% of the time of completing the tasks when using the basic editor, which supports our claim C3.

Interface      Task    Group A (basic editor first)   Group B (TDDonto2 first)   Total
Basic editor   TBox1   13 (100%)                      10 (83%)                   90%
               ABox1   13 (100%)                      9 (75%)
TDDonto2       TBox2   13 (100%)                      12 (100%)                  94%
               ABox2   12 (92%)                       10 (83%)
Total                  98%                            85%
(a) Degree of completing the tasks per each group.

Interface      Task    Group A (basic editor first)   Group B (TDDonto2 first)   Total
Basic editor   TBox1   61/117 (52%)                   63/90 (70%)                62%
               ABox1   26/39 (67%)                    20/27 (74%)
TDDonto2       TBox2   98/117 (84%)                   96/108 (89%)               88%
               ABox2   31/36 (86%)                    30/30 (100%)
Total                  70%                            82%
(b) Correctness of completing the tasks per each group.

Interface      Task    Group A (basic editor first)   Group B (TDDonto2 first)   Total
Basic editor   TBox1   19.00 min                      19.19 min                  12.27 min
               ABox1   4.92 min                       4.67 min
TDDonto2       TBox2   8.15 min                       7 min                      5.38 min
               ABox2   2.86 min                       2.60 min
Total                  8.97 min                       8.56 min
(c) Average time (in minutes) of completing the tasks per each group.
Table 2: Statistics corresponding to claims C1, C2, C3.
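The correctness totals of Table 2b can be recomputed from the reported counts, as in the short sketch below. It assumes that the totals were obtained by pooling the correct and assessed counts over the relevant cells; the text does not state this, but pooling reproduces the reported percentages. The variable and function names are hypothetical.

```python
# (correct, assessed) per cell, copied from Table 2b; each list holds the
# TBox task followed by the ABox task.
counts = {
    "basic editor": {"A": [(61, 117), (26, 39)], "B": [(63, 90), (20, 27)]},
    "TDDonto2": {"A": [(98, 117), (31, 36)], "B": [(96, 108), (30, 30)]},
}


def pooled_percentage(cells):
    """Percentage of correct assessments over the pooled cells, rounded."""
    correct = sum(c for c, _ in cells)
    assessed = sum(a for _, a in cells)
    return round(100 * correct / assessed)


per_interface = {
    name: pooled_percentage(groups["A"] + groups["B"])
    for name, groups in counts.items()
}
per_group = {
    group: pooled_percentage(counts["basic editor"][group] + counts["TDDonto2"][group])
    for group in ("A", "B")
}
print(per_interface)  # {'basic editor': 62, 'TDDonto2': 88}
print(per_group)      # {'A': 70, 'B': 82}
```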

Figure 6 shows the comparison of correctness of completing the tasks when using the basic interface, and TDDonto2, disaggregated by the types of axiom statuses.

Figure 6: A breakdown of the correctness results (basic interface vs TDDonto2) with respect to status types, for (a) the TBox and (b) the ABox.

We have performed a t-test for statistical significance of the correctness percentage results. We generated two sets for three settings (per TBox, per ABox, and per TBox plus ABox): a set with the overall correctness percentage per user per task for the basic interface and a set with the overall correctness percentage per user per task for TDDonto2. The null hypothesis was that the two relevant sets of correctness percentages (the results for the basic interface and for TDDonto2) had identical mean (expected) values. The results for the TBox experiments are: statistic = -3.3497, p = 0.0015. The results for the ABox experiments are: statistic = -2.8108, p = 0.0072. The overall result (TBox plus ABox) is as follows: statistic = -4.3337, p = 3.55e-05. Since the value of p is below 0.05 in all cases, we reject the null hypothesis and conclude that the difference in the correctness results between the basic interface version of the experiments and the version in which the subjects used TDDonto2 is statistically significant.
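For illustration, a two-sample t statistic and its two-sided p-value can be computed with the standard library alone, as sketched below. The text does not state which t-test variant was used; the sketch assumes an independent two-sample test with pooled variance, and the correctness percentages are made up. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(a, b):
    """Student's two-sample t statistic (equal variances assumed) and its df."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def two_sided_p(t, df, steps=10_000):
    """Two-sided p-value: 1 - 2 * integral of the t density from 0 to |t|.

    The integral is approximated with Simpson's rule (steps must be even).
    """
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    h = abs(t) / steps
    area = pdf(0) + pdf(abs(t))
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1 - 2 * area * h / 3)


# Hypothetical correctness percentages per user.
basic = [50, 60, 70, 55, 65]
tddonto2 = [80, 90, 85, 95, 100]
t, df = pooled_t_statistic(basic, tddonto2)
p = two_sided_p(t, df)
print(t, df, p < 0.05)  # t is -6.0 up to rounding, df is 8, and p < 0.05
```

A negative statistic, as in the reported results, indicates that the first set (basic interface) has the lower mean.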

Finally, note that in the editing efficiency evaluation, editing time in the AWO was about the same for Protégé and TDDonto2, for an assumed efficient expert user. Yet, these results with actual users demonstrate clearly that in praxis one can already observe TDD benefits even in this setting of a small ontology.

6 Discussion

TDD as test-first approach to ontology authoring has been shown to be theoretically and technologically a viable option, and the first user study indicated that it is also beneficial for the authoring process from a user perspective. This clearly can be embedded in a broader process of ontology engineering, as the proposed lifecycle in Keet and Ławrynowicz (2016) already suggested. This can be extended further to also include goal or behaviour-driven development, which Scone Scone Project (2016) aims at, and conversion of competency questions into axioms that would feed into the technical TDD component presented in this paper by, e.g., linking it to Dennis et al. (2017). The TDD component of regression testing—verifying past tests still pass—also may be an avenue for future works. Overall, these additions change the general flow of a TDD test of Figure 1 into the one shown in Figure 7. Also in this case, more scenarios are possible than shown, so as not to obscure the general idea. For instance, after resolving conflicts when a precondition fails, one may not want the axiom in the ontology anymore. The current version of the TDDonto2 tool can cater for these variants, but it has no explicit interface features for them at present and it is left to the modeller’s decisions.

Figure 7: Main steps of the general flow for the base case for a single TDD test, incorporating the model for testing.

Concerning theoretical and feature advances, the algorithms presented in Section 4.3 are the first ones with a broad coverage of OWL 2 language features, superseding those presented in all related work Keet and Ławrynowicz (2016); Scone Project (2016); Warrender and Lord (2015) especially on GCIs and the ABox, and also proving correctness of encoding. In addition, the model of testing axioms goes beyond the pass/fail/unknown and reporting missing vocabulary of the related work (which it does, too), by providing other possible outcomes that clarify what sort of a ‘fail’ it is. This is a distinct feature for the setting of ontologies cf. software engineering, where a fail simply means “not present, to implement”: the ‘fail/not present’ may be because of absence due to lack of coverage, indeed, but also may be because adding it would cause inconsistency or incoherence, which is something one would want to know to determine the next step in the ontology authoring process. That is, unlike in software development, a ‘fail’ does not necessarily imply ‘to add’.

We did make certain design decisions for this TDD that one may want to experiment with, aside from the choice of technology (note that alternatives to using the reasoner directly have been investigated, notably BGP with SPARQL-OWL Kollia et al. (2011) and an instance-based approach with mock objects, but exploiting the OWL reasoner turned out to be the fastest Keet and Ławrynowicz (2016); Ławrynowicz and Keet (2016)). For instance, once isEntailed is implemented by most or all reasoners, one could choose to update some of the algorithms accordingly. One may also want to relax the coherency precondition. In our Model for Testing specification (Definition 4), we sided with the somewhat ‘hardline’ approach from a logician’s viewpoint (a consistent theory, and every element satisfiable) compared to a possible tolerance for unsatisfiable classes at some point in the authoring stage. Anecdotally, we have seen behaviour along the line of “yes, I know x and y are inconsistent but I do not want to deal with them now”. Within the TDD scope, it would be preferable to remove them from the ontology and add them, at least temporarily, as TDD tests in the test set. This eliminates issues with cascading unsatisfiable classes without losing that knowledge, which the test specification makes more easily examinable and thus resolvable.

The overall time of authoring an ontology is reduced thanks to not invoking the reasoner for each edit, which many a developer does Vigo et al. (2014), yet still being able to evaluate what the outcome would be if the axiom were to be added to the ontology. It does not reduce the reasoning time for the first classification, nor after actually having modified the ontology. Such efficiency improvements are reasoner improvements (e.g., using incremental reasoning), whereas here we focus on authoring improvements.

7 Conclusions

The novel test-driven development algorithms introduced in this paper fill a gap in rigour and in coverage of both types of axioms that can be tested with a test-first approach, and they provide more feedback to the modeller by means of the model of testing. The evaluation of this test-driven development in TDDonto2 with a novel human-independent assessment approach for editing efficiency demonstrated that it is faster than the typical ontology authoring interface (Protégé 5.2) to some extent for smaller ontologies and even more so for medium to large ontologies with a stylised expert modeller, especially when automated reasoning is factored into the authoring process. Further, the user evaluation demonstrated that it is also more effective in task completion, time, and correctness (quality) for smaller ontologies and relatively novice users. Thus, TDD’s test-first approach with TDDonto2 is more effective than the common test-last authoring approach with Protégé.

The results demonstrate the promise of test-driven development as an ontology development methodology. To turn it into a complete methodology, other components can be investigated, such as the refactoring step and the interaction with competency questions.


This work was partly supported by the Polish National Science Center (Grant No 2014/13/D/ST6/02076).


  • Garcia et al. (2010) Garcia, A., O’Neill, K., Garcia, L.J., Lord, P., Stevens, R., Corcho, O., et al. Developing ontologies within decentralized settings. In: Chen, H., et al., editors. Semantic e-Science. Annals of Information Systems 11. Springer; 2010, p. 99–139.
  • Suárez-Figueroa et al. (2008) Suárez-Figueroa, M.C., de Cea, G.A., Buil, C., Dellschaft, K., Fernández-Lopez, M., Garcia, A., et al. NeOn methodology for building contextualized ontology networks. NeOn Deliverable D5.4.1; NeOn Project; 2008.
  • Blomqvist et al. (2012) Blomqvist, E., Sepour, A., Presutti, V.. Ontology testing – methodology and tool. In: 18th International Conference on Knowledge Engineering and Knowledge Management (EKAW’12); vol. 7603 of LNAI. Springer; 2012, p. 216–226.
  • Peroni (2017) Peroni, S.. A simplified agile methodology for ontology development. In: Dragoni, M., Poveda-Villalón, M., Jimenez-Ruiz, E., editors. OWLED 2016, ORE 2016: OWL: Experiences and Directions - Reasoner Evaluation; vol. 10161 of LNCS. Springer; 2017, p. 55–69.
  • Ferré and Rudolph (2012) Ferré, S., Rudolph, S.. Advocatus diaboli – exploratory enrichment of ontologies with negative constraints. In: Proc. of EKAW’12; vol. 7603 of LNAI. Springer; 2012, p. 42–56. 8-12 Oct 2012, Galway, Ireland.
  • García-Ramos et al. (2009) García-Ramos, S., Otero, A., Fernández-López, M.. OntologyTest: A tool to evaluate ontologies through tests defined by the user. In: Omatu, S., et al., editors. 10th International Work-Conference on Artificial Neural Networks, IWANN 2009 Workshops, Proceedings, Part II; vol. 5518 of LNCS. Springer; 2009, p. 91–98. Salamanca, Spain, June 10-12, 2009.
  • Vrandečić and Gangemi (2006) Vrandečić, D., Gangemi, A.. Unit tests for ontologies. In: On the Move to Meaningful Internet Systems 2006: OTM 2006 Workshops; vol. 4278 of Lecture Notes in Computer Science. Springer. ISBN 978-3-540-48276-5; 2006, p. 1012–1020.
  • Warrender and Lord (2015) Warrender, J.D., Lord, P.. How, what and why to test an ontology. CoRR 2015;abs/1505.04112. 1505.04112; URL
  • Ferré (2016) Ferré, S.. Semantic authoring of ontologies by exploration and elimination of possible worlds. In: Proc. of EKAW’16; vol. 10024 of LNAI. Springer; 2016, p. 180–195. 19-23 November 2016, Bologna, Italy.
  • Denaux et al. (2012) Denaux, R., Thakker, D., Dimitrova, V., Cohn, A.G.. Interactive semantic feedback for intuitive ontology authoring. In: Proc. of FOIS’12. IOS Press; 2012, p. 160–173.
  • Matentzoglu et al. (2016) Matentzoglu, N., Vigo, M., Jay, C., Stevens, R.. Making entailment set changes explicit improves the understanding of consequences of ontology authoring actions. In: Proc. EKAW’16; vol. 10024 of LNAI. Springer; 2016, p. 432–446.
  • Keet et al. (2013) Keet, C.M., Khan, M.T., Ghidini, C.. Ontology authoring with FORZA. In: Proc. of CIKM’13. ACM proceedings; 2013, p. 569–578.
  • Vigo et al. (2014) Vigo, M., Bail, S., Jay, C., Stevens, R.D.. Overcoming the pitfalls of ontology authoring: strategies and implications for tool design. International Journal of Human-Computer Studies 2014;72(12):835–845.
  • Iqbal et al. (2013) Iqbal, R., Murad, M.A.A., Mustapha, A., Sharef, N.M.. An analysis of ontology engineering methodologies: A literature review. Research Journal of Applied Sciences, Engineering and Technology 2013;6(16):2993–3000.
  • Keet and Ławrynowicz (2016) Keet, C.M., Ławrynowicz, A.. Test-driven development of ontologies. In: Proc. of ESWC’16; vol. 9678 of LNCS. Springer; 2016, p. 642–657.
  • Beck (2004) Beck, K.. Test-Driven Development: by example. Addison-Wesley, Boston, MA; 2004.
  • Scone Project (2016) Scone Project. Scone project.; Accessed: 9-5-2016.
  • Motik et al. (2009) Motik, B., Patel-Schneider, P.F., Parsia, B.. OWL 2 web ontology language structural specification and functional-style syntax. W3C Recommendation; W3C; 2009. Http://
  • Davies et al. (2017) Davies, K., Keet, C.M., Ławrynowicz, A.. TDDonto2: A test-driven development plugin for arbitrary TBox and ABox axioms. In: Blomqvist, E., Hose, K., Paulheim, H., Ławrynowicz, A., Ciravegna, F., Hartig, O., editors. The Semantic Web: ESWC 2017 Satellite Events; vol. 10577 of LNCS. Springer; 2017, p. 120–125. 30 May - 1 June 2017, Portoroz, Slovenia.
  • Rafique and Mišić (2013) Rafique, Y., Mišić, V.B.. The effects of test-driven development on external quality and productivity: A meta-analysis. IEEE Transactions on Software Engineering 2013;39(6):835–856. doi:10.1109/TSE.2012.28.
  • Janzen (2006) Janzen, D.S.. Software architecture improvement through test-driven development. In: Companion to 20th ACM SIGPLAN Conference 2005. ACM Proceedings; 2006, p. 240–241.
  • Cucumber (2016) Cucumber. Cucumber.; Accessed: 1-11-2016. URL
  • Sequeda and Miranker (2017) Sequeda, J.F., Miranker, D.P.. A pay-as-you-go methodology for ontology-based data access. IEEE Internet Computing 2017;March /April 2017:92–96.
  • Fernandez-Izquierdo (2017) Fernandez-Izquierdo, A.. Ontology testing based on requirements formalization in collaborative development environments. In: Aroyo, L., Gandon, F., editors. Doctoral Consortium at ISWC (ISWC-DC’17); vol. 1962 of CEUR-WS. 2017, Vienna, Austria, October 22nd, 2017.
  • Dennis et al. (2017) Dennis, M., van Deemter, K., Dell’Aglio, D., Pan, J.Z.. Computing authoring tests from competency questions: Experimental validation. In: d’Amato, C., et al., editors. The Semantic Web - ISWC 2017; vol. 10587 of LNCS. Springer; 2017, p. 243–259.
  • OWL API (2016) OWL API. OWL API.; Accessed: 1-11-2016.
  • Liebig et al. (2011) Liebig, T., Luther, M., Noppens, O., Wessel, M.. OWLlink. Semantic Web Journal 2011;2(1):23–32.
  • Kollia et al. (2011) Kollia, I., Glimm, B., Horrocks, I.. SPARQL Query Answering over OWL Ontologies. In: Proc. of ESWC’11; vol. 6643 of LNCS. Springer; 2011, p. 382–396.
  • Keet (2012) Keet, C.M.. Detecting and revising flaws in OWL object property expressions. In: Proc. of EKAW’12; vol. 7603 of LNAI. Springer; 2012, p. 252–266. 8-12 Oct 2012, Galway, Ireland.
  • Davies (2016) Davies, K.. Towards test-driven development of ontologies: An analysis of testing algorithms. Project Report; University of Cape Town; 2016.
  • Ławrynowicz and Keet (2016) Ławrynowicz, A., Keet, C.M.. The TDDonto tool for test-driven development of DL knowledge bases. In: Proc. of DL’16; vol. 1577 of CEUR-WS. 2016, 22-25 April 2016, Cape Town, South Africa.
  • Kołakowska (2013) Kołakowska, A.. A review of emotion recognition methods based on keystroke dynamics and mouse movements. In: HSI 2013. IEEE Xplore; 2013, p. 548–555. 6-8 June 2013, Sopot, Poland.
  • Thomas et al. (2005) Thomas, R.C., Karahasanovic, A., Kennedy, G.E.. An investigation into keystroke latency metrics as an indicator of programming performance. In: Proceedings of the 7th Australasian conference on Computing education (ACE’05); vol. 42. Australian Computer Society; 2005, p. 127–134.
  • Rector et al. (2004) Rector, A., Drummond, N., Horridge, M., Rogers, L., Knublauch, H., Stevens, R., et al. OWL pizzas: Practical experience of teaching OWL-DL: Common errors & common patterns. In: Proceedings of the 14th International Conference Knowledge Acquisition, Modeling and Management (EKAW’04); vol. 3257 of LNCS. Springer; 2004, p. 63–81. Whittlebury Hall, UK.
  • Keet et al. (2015) Keet, C.M., Ławrynowicz, A., d’Amato, C., Kalousis, A., Nguyen, P., Palma, R., et al. The data mining optimization ontology. Web Semantics: Science, Services and Agents on the World Wide Web 2015;32:43–53.


The calculations of the interface clicks for Protégé 5.2 are as follows, noting that it offers several ways of adding an axiom in most cases. Given that several options are typically possible, we select one, which is indicated with an asterisk at the end of the option.

  1. click ‘Classes’ tab (= 1), then either:

    1. drag class to position, if sufficiently nearby (existing classes) = 1

    2. click class - click SubClass Of - in ‘class expression editor’ type classname - click ok (existing classes) =

    3. click class - click SubClass Of - in ‘class hierarchy’ click as far down in the hierarchy as needed - select class - click ok (existing or new classes) = [*]

  2. click ‘Classes’ tab (=1), then either:

    1. click class - click SubClass Of - in ‘class expression editor’ type “R some/only D” - click ok =

    2. click class - click SubClass Of - in ‘Object restriction creator’ click as far down in the property hierarchy as needed - select property - in ‘Object restriction creator’ click as far down in the restriction filler as needed - select class - click restriction type some - click ok = [*]

  3. click ‘Classes’ tab (=1), then either:

    1. click class - click Disjoint With - in ‘class expression editor’ type classname - click ok (existing classes) =

    2. click class - click Disjoint With - in ‘class hierarchy’ click as far down in the hierarchy as needed - select class - click ok = [*]

  4. click ‘Object properties’ tab (=1), then either:

    1. click property - click Domain - in ‘class expression editor’ type classname - click ok =

    2. click property - click Domain - in ‘class hierarchy’ click as far down in the hierarchy as needed - select class - click ok = [*]

  5. has the same process as for (iv), clicking Range instead of Domain.

  6. click ‘Individuals by class’ tab (=1), then either:

    1. click Types - in ‘class expression editor’ type classname - click ok =

    2. click Types - in ‘class hierarchy’ click as far down in the hierarchy as needed - select class - click ok = [*]

    3. click ‘Classes’ tab (=1), then click Instances - click instance - click ok = 3

  7. click ‘Classes’ tab (=1), then either:

    1. click class - click SubClass Of - in ‘class expression editor’ type “R min x D” - click ok =

    2. click class - click SubClass Of - in ‘Object restriction creator’ click as far down in the property hierarchy as needed - select property - in ‘Object restriction creator’ click as far down in the restriction filler as needed - select class - click restriction type - click/type cardinality - click ok = [*]

  8. click ‘Active ontology’ tab (=1), then

    1. click ‘General class axioms’ - click add - type the entire GCI - click ok = 3 + GCI

  9. click ‘Classes’ tab (=1), then either:

    1. execute the clicks for axiom type (ii) twice

    2. click add - click class expression editor, and type: some and only = 2 + 4 + + + 3 + 4 + + = 13 + + [*]

  10. click ‘Classes’ tab (=1), then either:

    1. execute the clicks for axiom type (ii), then click add - click class expression editor and type: some ( or ) = [axiom type (ii) clicks] + 2 + 4 + + 1+ + 2 + + 1 = [axiom type (ii) clicks] + 10 + + +

    2. click add - click class expression editor, and type: some and some ( or ) = 2 + 4 + + + 3 + 4 + + 1 + + 2 + + 1 = 17 + + + + + [*]
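The constant terms in the typed options above can be checked with a short script. This is a minimal sketch, assuming that whitespace is not counted, that every other typed character counts as one action, and that the variable terms in the formulae are the character lengths of the entity names. The function names and the placeholder names R, C, D, E are hypothetical and only serve the illustration.

```python
import re

CLICKS_TO_OPEN_EDITOR = 2  # "click add - click class expression editor"


def tokenize(expression):
    """Split a Manchester-syntax-like string into words and parentheses."""
    return re.findall(r"[A-Za-z0-9_:]+|[()]", expression)


def protege_constant_cost(expression, entity_names):
    """Clicks plus typed keyword and parenthesis characters, excluding names.

    Assumption: whitespace is free and each other character is one action.
    """
    keyword_chars = sum(
        len(tok) for tok in tokenize(expression) if tok not in entity_names
    )
    return CLICKS_TO_OPEN_EDITOR + keyword_chars


def name_cost(expression, entity_names):
    """Characters contributed by entity names (the variable part)."""
    return sum(len(tok) for tok in tokenize(expression) if tok in entity_names)


names = {"R", "C", "D", "E"}  # hypothetical one-letter placeholders

# Axiom type (ix), second option: 2 + 4 + 3 + 4
print(protege_constant_cost("R some C and R only D", names))  # 13

# Extra part of axiom type (x), first option: 2 + 4 + 1 + 2 + 1
print(protege_constant_cost("some (D or E)", names))  # 10
```

With real entity names, `name_cost` gives the part of the total that grows with the length of the vocabulary, which is why the formulae keep it symbolic.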

The click formulae for TDDonto2 are as follows. In the TDDonto2 plugin, one only types full GCIs/assertions and then clicks “Add” (= 1). For the 10 axiom types, in the same order, the typed part is:

  1. SubClassOf:

  2. SubClassOf: some

  3. and SubClassOf: owl:Nothing or SubClassOf: not or

  4. some SubClassOf:

  5. some (inverse() SubClassOf:

  6. c1 Type: C =

  7. SubClassOf: min n

  8. non-simple class on the lhs (other than domain and range axiom) = can be anything

  9. SubClassOf: some and only =

  10. SubClassOf: some and some ( or ) = + 11 + 4 + + + 3 + 4 + + 1+ + 2 + + 1 = 26 + + + + + + .
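The TDDonto2 count for axiom type 10 can be reproduced in the same spirit. The sketch below assumes that whitespace is not counted, that the keyword `SubClassOf:` costs 11 characters, and that the single “Add” click is kept separate from the typed total. The pizza-style entity names and the function names are hypothetical examples, not taken from the evaluation.

```python
import re

ADD_CLICK = 1  # the single click on "Add" in the plugin


def split_typing_cost(axiom, entity_names):
    """Return (fixed_chars, name_chars) for a typed axiom, ignoring whitespace."""
    tokens = re.findall(r"[A-Za-z0-9_:]+|[()]", axiom)
    fixed = sum(len(tok) for tok in tokens if tok not in entity_names)
    variable = sum(len(tok) for tok in tokens if tok in entity_names)
    return fixed, variable


def tddonto2_total(axiom, entity_names):
    """Typed characters plus the final "Add" click."""
    fixed, variable = split_typing_cost(axiom, entity_names)
    return ADD_CLICK + fixed + variable


# Hypothetical entity names for illustration only.
names = {"Pizza", "hasTopping", "Cheese", "Tomato", "Basil"}
axiom = (
    "Pizza SubClassOf: hasTopping some Cheese "
    "and hasTopping some (Tomato or Basil)"
)

fixed, variable = split_typing_cost(axiom, names)
print(fixed)  # 26, i.e. 11 + 4 + 3 + 4 + 1 + 2 + 1
print(tddonto2_total(axiom, names))
```

Separating the fixed keyword cost from the name cost makes it easy to compare interfaces on the same axiom: the name characters are identical wherever the axiom is typed in full, so the difference lies in the fixed part and the surrounding clicks.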