Path Outlines: Browsing Path-Based Summaries of Linked Open Datasets

02/23/2020
by   Marie Destandau, et al.
Inria

Linked Data (LD) are structured sources of information, such as DBpedia or Geonames, that can be linked together and queried. The information they contain is atomized into triples, each triple being a simple statement composed of a subject, a predicate and an object. Triples can then be combined to form higher-level statements according to information needs. This granularity makes it difficult to produce overviews of LD content. We therefore introduce the concept of path-based summaries, which carries a higher level of semantics for data producers. We also introduce the tool Path Outlines to support LD producers in browsing path-based summaries of their datasets. We present its interface, based on the broken (out)lines layout algorithm and the path browser visualisation. Our approach, reifying chains of statements into path outlines, was informed by the observation of LD producers, and we report a characterisation of their needs. We compare Path Outlines with the current baseline technique (the Virtuoso SPARQL query editor) in an experiment with 36 participants. We show that participants prefer Path Outlines, find it easier to understand and easier to use, perform tasks faster with it, and give up on fewer tasks before completing them.


1 Introduction

As explained by Bizer et al., “In recent years the Web has evolved from a global information space of linked documents to one where both documents and data are linked. Underpinning this evolution is a set of best practices for publishing and connecting structured data on the Web known as Linked Data (LD). […] These best practices have been adopted by an increasing number of data providers […], leading to the creation of a global data space containing billions of assertions — the Web of Data” [1].

To keep this Web of Data usable, LD providers, our users, need to ensure the quality of the assertions (statements) in their datasets. Several methods exist and can be combined. The quality of ontologies (data models) is evaluated on their ability and efficiency to express knowledge from a domain [2], and on their compatibility, in cases where several are combined [3]. The quality of data is evaluated on their formal conformity with ontologies [4] and with good practices [5, 6]. Most of the time, LD are created from the transformation of existing data sources, so the quality constraints can also be implemented when transforming the data [7].

However, data assessed with such methods can still present usability problems, such as nonsensical statements or incomplete information, and it remains difficult for data providers to detect such problems. While the necessity of summaries to produce overviews of the content is acknowledged [8], it is hard to determine the right unit to summarise meaningful pieces of information. Existing approaches are either at an atomic level, too focused for the user to make sense of the information globally, or at an ontological level, too abstract.

To support LD producers in performing their data curation tasks, our contributions include:

  • the concept of path-based summaries with an API to analyse such summaries,

  • a visualisation tool, Path Outlines, to present them, based on two new visualisation techniques, and

  • a controlled experiment to evaluate the tool.

After giving a brief introduction to the basic concepts of Linked Data, we discuss related work regarding LD summaries, their visualisation, and issues relative to retrieving summary information. We introduce Path Outlines, the tool that supports LD producers in browsing path-based summaries of their datasets, and present its interface, based on the broken (out)lines layout algorithm and the path browser visualisation. We report our observation of LD producers and the characterisation of their needs, and explain how it led us to reify chains of statements into path outlines and to operationalise the needs into path-based tasks. We then report our interviews with 11 producers to evaluate the relevance of such tasks. Finally, we report an evaluation of Path Outlines, compared with the Virtuoso SPARQL query editor as a baseline, and we discuss the results of the evaluation.

Fig. 1: Basic concepts of Linked Data. Samples extracted from Nobel and DBpedia datasets. Full datasets contain respectively and triples (on 2019-09-07).

2 Basic Concepts of Linked Data

The syntax of Linked Data is defined in the Resource Description Framework (RDF) W3C Recommendation [9]. A dataset is a collection of statements named triples. Triples are composed of a subject, a predicate and an object, as shown in Figure 1. Subjects and predicates are always Uniform Resource Identifiers (URIs). Objects can be URIs (triples 4–9) or literals (triples 1–3). The same URI can be the subject and object of several triples (triples 5, 6 and 7, or triples 6, 8 and 9). The triples form a network that can be represented as a node-link diagram (Figure 1-c). The special predicate rdf:type (triples 4, 7 and 8) expresses that a subject entity is part of a class of resources. Predicates and classes of resources are defined in data models called ontologies. For instance, the predicates of the first 3 triples, and the object of the 4th, belong to the FOAF (friend of a friend) ontology [10], a model dedicated to the description of people and their relationships. In principle, URIs should always be dereferenceable: querying them on the web should lead to their description. Literals can be typed, and string literals can be associated with a language (Figure 1-a, grey color). URIs can be prefixed for better readability, as in Figure 1-c: the beginning, common to several URIs, is given a short name, e.g. foaf: instead of http://xmlns.com/foaf/0.1/.

Linked Datasets are interlinked: a dataset can reference an entity produced in another one (red color). When this happens, they can be queried jointly, through federated queries, and a chain of statements can jump from one dataset to another: the triples in the Nobel dataset (“la Sorbonne is in Paris”, “the Paris entity in Nobel is equivalent to the Paris entity in DBpedia”) can be completed by those from DBpedia (“Paris’ latitude is 48.856701”, “Paris’ longitude is 2.350800”).

As seen in the example, the information is separated into atomic pieces that can be retrieved and combined according to needs. For instance, a question like “When was Marie Curie born?” could be answered with triple 2. “What was her affiliation?” could be answered by chaining triples 5, 6 and 9. Placing Marie Curie on a map displaying laureates by affiliation could be achieved by chaining triples 5, 6, 9 and 10 to get the latitude, and triples 5, 6, 9 and 11 to get the longitude. A chain of statements is commonly called a path in the graph.
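As an illustration, such a path can be expressed in SPARQL (the query language discussed in Section 3.3) by chaining triple patterns. A minimal sketch, in which ex:affiliation is a hypothetical property standing in for the actual ones of Figure 1:

    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX ex:   <http://example.org/terms/>   # hypothetical prefix

    SELECT ?affiliationName WHERE {
      ?laureate    foaf:name      "Marie Curie" .    # find the start entity
      ?laureate    ex:affiliation ?affiliation .     # follow the chain...
      ?affiliation foaf:name      ?affiliationName . # ...to the final value
    }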

Linked Data is used by communities such as Wikidata, institutions such as national libraries and archives, research laboratories, and companies, to combine and share data and let them be queried jointly. There are few tools to edit Linked Data; datasets are more often produced by transforming existing data sources, with transformation tools or ad hoc scripts. The workflows are very diverse, even among similar producers: the video archives of the 2019 professional meeting Semantic Web in liBraries (SWIB) show each data producer describing a specific workflow (https://www.youtube.com/playlist?list=PL7fMsenbLiQ3FnY59f-nrlHpmy2z5Nmtc). However, these workflows have in common that they are very fragmented, because of the many diverse sources. As a result, producers have no overview of their data, and when errors occur, their discovery is delayed until a user finds out, or never happens.

3 Related Work

We will discuss the types of summaries which are currently available, the visualisations of these summaries, and the difficulty of writing and running queries for summary information.

By summary we mean a description of the content of a dataset, sometimes characterised by descriptive statistics.

3.1 Dataset Summaries

Finding the right unit to summarise such content is not trivial. A first approach is to focus on simple statements, following the structure of the data. Auer et al. present features like the number of occurrences of a property (e.g., foaf:name), the number of entities in a class (e.g., foaf:Person), or the datatype of objects (e.g., xsd:date), aggregated over the whole dataset [11] (http://lodstats.aksw.org/). This summary is complete and accurate, but does not reveal much about the content: it is possible to know that there are n names in a dataset, but what do those names describe? Or that there are n persons in the dataset, but how are they described? How many of them have a name? To answer such questions, more context can be added by considering properties for entities with a specific type, that is, counting the number of foaf:Person having a foaf:name [12, 13, 14]. Others also take into account the type of the objects of the triples (not only of their subjects). This makes it possible to count the number of Persons having a birthplace that is a City on the one hand, and the number of Persons having a birthplace that is a Country on the other [15, 16, 17]. Adding more context leads to more interpretable summaries, the tradeoff being to leave aside parts of the graph, such as statements involving untyped or literal objects.
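To make the three granularities concrete, the following SPARQL aggregation queries sketch them (three separate queries shown together; the class and property names follow the examples above and are illustrative):

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    # (1) Occurrences of a property over the whole dataset [11]:
    SELECT (COUNT(*) AS ?n) WHERE { ?s foaf:name ?name . }

    # (2) Adding the type of the subject [12, 13, 14]:
    # how many foaf:Person entities have a foaf:name?
    SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {
      ?p rdf:type foaf:Person ; foaf:name ?name .
    }

    # (3) Also constraining the type of the object [15, 16, 17]:
    # persons whose birth place is a City (vs. a Country).
    SELECT (COUNT(DISTINCT ?p) AS ?n) WHERE {
      ?p rdf:type dbo:Person ; dbo:birthPlace ?b .
      ?b rdf:type dbo:City .
    }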

Another approach is to construct a subgraph as the summary, producing some sort of a posteriori model inferred from the data. Čebirić et al. compute the smallest graph containing all patterns, with variations regarding the definition of a pattern [18]. These subgraphs are very dense and query-oriented, not meant to be read by humans. Troullinou et al. limit the subgraph to the most represented classes, and the most represented direct properties between them [8]. The restrictions make it graspable, yet very incomplete. Weise et al. [19] give access to more elaborate statements, also starting from the most represented classes, and considering the most represented properties, which can be chained without involving untyped entities. Those subgraphs preserve access to chains of statements, but the statistics are produced for single statements only. Since the number of paths that can be extracted from a graph is extremely large, considering chains of statements would require deciding where to start and where to stop.

Khatchadourian and Consens [14] also provide a summary of the links between datasets. However, they account only for entities described with an owl:sameAs link that belong to the same class of resources and are described by the same properties in both datasets. In other words, they focus on strong similarities, while we think the power of interlinking rests on the complementarity of items described in different ways.

In contrast to previous methods, our approach considers paths as the basic unit for summaries. We define path-based summaries as descriptive statistics about chains of statements starting from a set of entities sharing the same rdf:type, with a specified maximum depth.
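As an illustration, under this definition the coverage of a single depth-2 path outline could be computed with a query along these lines (a sketch: dbo:Person, dbo:birthPlace and foaf:name are placeholders for a concrete start set and property chain):

    PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX dbo:  <http://dbpedia.org/ontology/>

    # Coverage = percentage of entities in the start set for which
    # the path Person -> birthPlace -> name actually exists.
    SELECT ((?withPath * 100.0 / ?total) AS ?coverage) WHERE {
      { SELECT (COUNT(DISTINCT ?s) AS ?total) WHERE {
          ?s rdf:type dbo:Person . } }
      { SELECT (COUNT(DISTINCT ?s) AS ?withPath) WHERE {
          ?s rdf:type dbo:Person ;
             dbo:birthPlace/foaf:name ?v . } }
    }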

Fig. 2: An example of LD-VOWL visualization [19]. The node-link diagram theoretically enables reading sequences of statements but is hard to read, even for a relatively small graph. Selecting a node or edge displays statistics for this element in the right panel.

3.2 Visualising LD Summaries

The few summaries presented as visualisations use simple lists or different types of node-link diagrams. Simple lists have the advantage of being readable, but they provide no overview [11]. On the other hand, diagrams give a full picture but can be harder to interpret. UML diagrams can be accurate for single statements [12]. The use of classic node-link diagrams is very similar to ontology visualisation [20, 21, 22]; in both cases, groups of entities or literals are presented as nodes, and properties as links. For ontologies, the class of a group of entities can be a separate node representing the class entity, as expressed in RDF [23], or, by metonymy, the name of the class is often used to label the group of entities [24]. Only the second method is used for summaries, since it makes statements more readable [8, 19]. LD-VOWL [19] uses the subgraph, visualized as a node-link diagram, as an interface to display statistics: selecting a node in the diagram gives access to statistical information about all similar nodes and the properties starting from them in the original graph.

As its name implies, almost everything in Linked Data is a link: entities, properties, classes and datatypes are URIs, which by definition are links [25]. Technically, the connection between two datasets is made possible by joins: the fact that the same URI exists in two datasets enables querying them jointly. Semantically, one tends to think of these joins as links. The representation used by the LOD Cloud—frequently chosen to illustrate the Semantic Web—is a node-link diagram where each dataset is displayed as a node and the presence of joins between two datasets is displayed as a link between them (https://lod-cloud.net/).

In the related domain of ontology alignment tools with instance matching features, Kotis and Lanzenberger point out that instances are usually presented as simple lists, out of the context of their respective datasets [26]: “Interpreting an entity of one ontology in the context of the knowledge of another ontology is a cognitively difficult task since it requires the understanding of semantic relations among entities of different ontologies” [27].

Node-link diagrams are also the most common representation for paths in graphs that are not necessarily Linked Data [28], and their readability has been studied. Huang and Eades remarked that people try to read paths from left to right and top to bottom, even when the task requires another direction [29]. Van Amelsvoort et al. demonstrated that reading behaviours are influenced by the direction of elements [30]. Ware et al. showed that continuity, edge crossings and path length influence the ability to follow a path [31]. A specific type of node-link diagram, the node-link tree, seems to be more efficient for tasks related to following paths, traversing graphs [28, 32], and reading paths [33]. A survey on the readability of hypertext mentions many studies showing that the multiplication of possibilities impacts readability negatively [34], supporting the same idea. In an approach that has similarities with ours, PathFinder [35] laid out all possible paths for a subgraph as a flat list; the list was very long and had to be paginated even when the subgraph was very small.

Therefore, existing visualizations of LD provide some level of summary, but they struggle to find the right level of representation and interaction, and they present readability issues.

3.3 Querying Summary Information

SPARQL, the main query language for Linked Data, provides aggregation operators to count occurrences. These operators can be applied to the different types of summaries we mentioned: simple patterns, patterns specifying the type of the subject and / or of the object, and more complex statements, following paths in the graph by chaining triples. However, complex queries raise both technical and conceptual issues, as reported by Warren et al. [36].

From a technical point of view, the cost of a query increases with the number of entities, the length of paths, and the fact that a query is federated, resulting in possible network and server timeouts and errors. SPARQL query optimisation [37] and federated distributed SPARQL query processing [38] are two intertwined research areas.
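The following sketch illustrates why federation compounds the cost: the SERVICE clause sends part of the pattern to a remote endpoint (here DBpedia) for the bindings produced locally, in the spirit of the Paris example of Section 2:

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

    # Federated query: local entities are extended with
    # coordinates fetched from the remote DBpedia endpoint.
    SELECT ?city ?lat ?long WHERE {
      ?city owl:sameAs ?dbpCity .
      SERVICE <http://dbpedia.org/sparql> {
        ?dbpCity geo:lat  ?lat ;
                 geo:long ?long .
      }
    }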

From a conceptual point of view, thinking of path patterns in graphs is not a simple mental operation. If finding the right unit to summarise and visualise is difficult, forming a mental model of paths cannot be simple either. Among the tools that assist query writing, some offer the possibility of discovering the model iteratively, enabling users at each step to browse the available possibilities for extending the current path [39]. This applies to one dataset only, not to others that would be queried jointly, and the tool does not present summaries, so the query must then be edited manually to add aggregation operators, which adds a level of complexity.

Altogether, there is still a need for a tool enabling the summarisation and visualisation of paths in a Linked Dataset, displaying links between datasets in context, and not requiring too much effort to manipulate and read.

4 User Study 1: Characterizing LD Producers’ Needs

One of the authors was a support engineer for data producers in a joint research project where three major public institutions built a model and published their data to interlink their musical catalogs. It was striking how much they seemed to be hindered in the understanding of their data, dedicating a substantial part of their time to improvising ad hoc representations in order to see what they were doing. To inform the design of our tool, we analysed her detailed meeting notes over these two years to identify the problems the producers had faced.

4.1 Participants

The group of producers was composed of 4 women and 3 men, employed by three public institutions. The meeting notes covered 26 meetings during the first phase of the project, where participants discussed the work accomplished and tried to solve problems together. There was a mean of 5.31 core participants per meeting. In addition to the core participants, a coordinator attended 5 meetings, and there was a total of 21 guests over 11 meetings.

4.2 Data Analysis

Relying on the notes, we summarized the activities and listed the problems encountered in each meeting. We characterized the problems in a bottom-up approach.

4.3 Results

4.3.1 See What the Data Can Express

This concern appeared in 22 meetings. Data producers developed a number of ad hoc tools, such as custom-made node-link diagrams, spreadsheets with specialized scripts to filter, and documents listing properties and classes. Although this enabled them to improve their understanding of the model and to communicate with each other, readability and interpretability remained difficult. For instance, in the last observed meeting, after more than a year and a half of work, there were still discussions about the level of abstraction (Work, Expression, Manifestation) at which the title could and should be expressed in the current version of the model.

4.3.2 See How the Data Fill In The Model

This problem appeared in 18 meetings. Data producers had themselves written the mapping rules to transform their data into RDF, and knew them rather well. But they did not know how the data really fitted them. Until a specific interface was developed, which occurred nearly a year after the first data were transformed, they had to use spreadsheets or raw RDF/XML files to check whether the model enabled them to express the data accurately, selecting both representative and random items. Knowing how well a property was represented, and what the data coming from the different databases really had in common, was difficult.

4.3.3 See How the Data are Encoded

This concern appeared in 8 meetings, when working on mapping rules. Librarians made the inventory of possible cases relying on their (impressive) memory, but they were concerned about the cases they had probably missed. As an example, the original information for the date of creation of an Expression could be a category (beginning of the XVIIth century), a text note containing both time and other information, or a date in different formats—knowing that library models use non-standard options to express ranges or uncertainty. The rules to process such information were many and complicated, and varied from one database to another. It was nearly impossible to get a sense of the partition of the resulting data.

4.3.4 Find Data Matching Specific Criteria

This concern appeared in 14 meetings. Producers were building this common model in order to enable joint querying of their databases. While the abstraction of the model seemed to ensure a common structure with pivot elements to which different properties could be attached depending on the available information, it was then difficult to imagine how user needs could be expressed in queries addressing entities originating from all databases.

Fig. 3: A collection of path-based tasks that can be combined to address user needs.

4.3.5 See Information Available Through Interlinking

This concern appeared in 6 meetings. To decide which existing vocabularies to create similarity links to, data producers needed to know the type of information they would give access to, but also the coverage for their data. For instance, aligning places with Geonames would in theory give access to geocoordinates. But there were places, typically for traditional music, for which it was likely that no alignment would be found in Geonames. These could maybe be bridged by using Geoethno. But there might be other reasons for missing alignments, and in the end it was impossible to estimate the coverage of the information really added by interlinking.

5 User Study 2: Validating the Approach

5.1 Motivation

To address these “specific user problems” [40], we needed to provide some intermediate level of understanding between the precise example and the abstract model, in other words a summary. We relied on path-based summaries, and operationalised this approach in a series of low-level path-based tasks, presented in Figure 3. These tasks basically consist of three actions: browse, filter and inspect, which can be associated with the different features of a path and its extensions. To check whether this approach fulfilled their needs, we interviewed 11 data producers for a partial validation. We selected 6 tasks involving all the concepts (identification, inspection, coverage, features and extensions), illustrated with examples inspired from the situations we observed.

5.2 Participants

We conducted fifteen- to thirty-minute interviews with 11 data producers recruited via calls on Semantic Web mailing lists and Twitter. Participants belonged to industry (4), academia (4) and public institutions (3). The datasets they usually manipulated contained data from various domains, ranging from biological pathways to cultural heritage and household appliances. All participation was voluntary and without compensation.

5.3 Set up

The interview was supervised online through the videoconference system Renater.

5.4 Procedure

We present each type of task, together with a precise example that can be adapted to the participant’s domain. We ask participants whether they already perform such a task; if so, how often and by which means; if not, for what reason. Finally, we ask whether those tasks make them think of other similar or related tasks.

5.5 Data Collection and Analysis

We collected answers in an xls sheet and analysed them with R.

5.6 Results

Fig. 4: Usage and interest of data producers regarding the tasks: a) they hardly ever perform them, b) but would be very interested in a tool supporting them

5.6.1 Current Status

A few participants already performed such tasks, as reported in Figure 4. Some performed similar tasks, but for direct properties only (4) (the counts in this paragraph correspond to the number of tasks, not to the number of participants), or on the original data before the transformation to RDF (5), especially for validating the datatype. In these cases the tasks had been identified as needed, but the available solution was incomplete.

Participants already performing such tasks used SPARQL query editors (16) or content negotiation in the browser (3). The main reason given for not performing a task, or performing it too rarely, was the lack of a tool (14). These tasks are actually possible with SPARQL, but we interpret this as a sign that participants either did not know how to write the queries, or regarded it as so complicated that they would not even consider it an option. The second main reason was time concerns (13): the task was regarded as doable, but the time it would have taken to write such queries was too great.

5.6.2 Interest for Tasks

Two participants had difficulties relating to the tasks, and did not express interest. Their use of Linked Data was focused on querying single entities rather than sets, and they did not feel the need for an overview. Most other participants, however, declared a strong interest in the tasks (Figure 4). Three had already identified their needs; others sounded genuinely enthusiastic that we were able to elicit the tasks for them. In some cases, participants needed rephrasing or further examples to really understand the tasks.

Six participants spontaneously mentioned clearly seeing the interest of a tool enabling those tasks for reusers, in a discovery context. DP10 was the only participant to suggest a related task: identify outliers in values of paths typed as numerical values. This corresponds to task E with more advanced statistics.

5.7 Summary

This interview confirms that data producers are aware of their difficulties, but need help to elicit their needs and the tasks to address them, as well as tools to support those tasks.

6 Path Outlines

We present Path Outlines, a tool to support data producers in curating their datasets by letting them browse and inspect path-based summaries. We posit that the path level is the appropriate level of abstraction to represent meaningful information for sets of entities in Linked Data. Our tool is based on a user interface using several visualisations to explore the paths. We introduce a new layout algorithm called broken (out)lines, which allows displaying a large number of paths, and the path browser visualisation, which compacts the representation of paths. We also introduce an application programming interface (API) called LDPath to analyse the paths.

6.1 Definition

Path outlines are conceptual objects reifying summaries of chains of statements in a Linked Data graph [41]. They offer a granularity that matches our users’ tasks, beyond simple statements. We define a path outline as a set of similar resources, e.g., belonging to the same class of resources, related to a set of values by a sequence of RDF statements. Values are the URIs or literals reached by following the chain of properties all the way to the last statement in the chain, for all resources in the set. The core features of a path outline are described in Table I.

Going to the end of all chains in the dataset would produce too many overly long statements to be technically feasible, so we set a limit that can be adjusted depending on the specificity of the model and the computing resources available.

Start set
  Entrypoint: rdf:type of the start set
  Coverage: percentage of entities in the start set for which this path actually exists

Statements
  Depth: number of statements from the start set to the set of values
  Properties: URIs of the property in each statement, from the start to the end
  Types: rdf:types of the intermediate sets of entities, from the start to the end

Values
  Count: number of total values (or URIs) at the end of the path (= number of instances of the path)
  Unique count: number of unique values (or URIs) at the end of the path
  Datatype: datatype of the values at the end of the path (except for URIs)
  Language: for strings, if specified: list of languages of the values
  Min / max: for numerical values, minimum and maximum value; for strings, first and last value in alphabetical order

TABLE I: Core features of a path outline.
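Several of these features can be computed with a single aggregation query per path. A minimal sketch for a hypothetical depth-2 path, where the prefix ex: and the names StartClass, p1 and p2 are placeholders:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/terms/>   # hypothetical prefix

    SELECT (COUNT(?v)          AS ?valuesCount)
           (COUNT(DISTINCT ?v) AS ?uniqueCount)
           (MIN(?v)            AS ?min)
           (MAX(?v)            AS ?max)
    WHERE {
      ?s rdf:type ex:StartClass ;   # start set
         ex:p1/ex:p2 ?v .           # the path's chain of properties
    }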

6.2 User Interface

Our interface (Figure 6-1) first presents one or more datasets to explore, as well as the datasets to which they are linked. Datasets are laid out with a circle packing algorithm [42] as a bubble chart, which allows users to see and compare their sizes, mapped to the number of triples they contain. Users can hover over a dataset to display its name. Using the filter panel (Figure 6-2), they can select a specific size range or search by name; other datasets are filtered out with a fade effect. Clicking on a dataset opens it in the foreground (Figure 6-3). Datasets linked to the selected one also come to the foreground, but as small bullets laid out on the side (Figure 6-8). The different sets of entities belonging to the selected dataset and sharing the same rdf:type are laid out inside it in another circle packing, their size corresponding to the number of entities (Figure 6-4). The filter panel enables filtering them by size and name (Figure 6-6). Users can hover over a set to display its name, and click on it to open it. Other sets become smaller and are aligned on the side to remain easily available (Figure 6-8). The available path depths (Figure 6-7) are laid out with the broken (out)lines algorithm presented in Figure 5. By default, paths of depth 1 are selected and displayed in the browser (Figure 6-9). We took inspiration from systems which present an overview of a graph for the user to select one of different cuts in it [43, 44]. This mechanism allows the interface to present a very large number of paths. For instance, the analysis of Data.bnf (LD produced from the National Library of France, BNF) with a maximum depth of 5 gives a very large number of paths for 9 sets of entities. With such cuts, the largest group of paths are the paths of depth 4 for the set “Event”, as shown in Figure 6-f. We will now see how users can inspect these paths in the path browser.

Fig. 5: Broken outlines are drawn and positioned according to the maximum possible depth of paths.
Fig. 6: Overview to detail. a) The user selects a dataset among available datasets. b) The dataset opens in the foreground; interlinked datasets are placed aside. The user selects a start set. c) Paths are displayed in the path browser. When a single path is hovered or selected, details are available in the detail panel. d) The user selects extensions of a path in another dataset. e/f) The user selects paths of depth 4. When hovering a property, its label and description are displayed in a tooltip.

6.2.1 The path browser visualisation: paths as readable sequences

The path browser displays all path combinations of a given depth for a given set of entities in a given dataset, laid out with the path browser visualisation. The path browser visualisation can be described as a combination of a Sankey diagram with a treemap. Paths being sequences of properties, it is possible to represent them with a Sankey diagram, as shown in Figure 7, though the number of paths that can be displayed is limited, and it is difficult to follow which edge the labels relate to and to identify sequences. We kept the idea of displaying an element once per step, thus taking advantage of the fact that the many possible paths in a graph are composed by combining the same properties in different sequences. However, instead of using curved lines crossing each other to represent properties, we drew rectangles, dividing the space between all properties at each step with a space-filling algorithm similar to treemaps [45]. With our visualisation, the Event paths of depth 4 in Data.bnf (made of 13 properties at depth 1, 12 at depth 2, 25 at depth 3, 55 at depth 4, and 110 at depth 5) can be displayed (Figure 6-f).

Fig. 7: Sankey diagram generated with Google charts for a sample of 60 paths of depth 2 (out of 110) describing Periodical in Data.bnf

To browse the paths, users can click on a property in one or several columns. The large rectangles allow easy hovering and clicking, making it easy to filter on properties by direct manipulation. Hovering shows how the flows merge and divide (Figure 6-e), since this is no longer visible at first glance, as it was in the Sankey diagram. Selected properties form a pattern, and all paths that do not match this pattern are filtered out. The filter panel allows filtering by statistical features (Figure 6-10), and gives an overview of the available range for each feature. Property and statistical filters can be combined. Hovering over or selecting a single path displays its statistical information in the statistical panel (Figure 6-11). This panel also offers a list of linked datasets to which the selected path can be extended. Selecting a linked dataset adds a column on the right (Figure 6-12), where all extensions can be browsed. The filter panel (Figure 6-13) and statistical panel (Figure 6-14) then apply to the paths including extensions. A line shows the target dataset, inviting users to click it and explore its paths.

6.2.2 Scenario of use

Scenario 1: A member of the DBpedia community would like to check the quality of music albums described in the DBpedia dataset. She opens Path Outlines and searches for DBpedia in the filter panel (Figure 6-a2). A dozen datasets remain; all others are filtered out (Figure 6-a1). Hovering over them, she can see that each one corresponds to a different language. She clicks on the French version, which opens in the foreground (Figure 6-b3). To find music albums among the many sets of entities, she types music in the filter panel (Figure 6-b6). Five sets of entities correspond to this keyword (Figure 6-b5); she hovers over them and identifies schema:MusicAlbum, which she selects. This isolates the set, displays its broken (out)lines (Figure 6-c7), and opens the path browser (Figure 6-c8). By default, paths of depth 1 are displayed. The interface announces that there are more than 41 000 albums, with 87 paths of depth 1. She wants to check properties with a bad coverage, to see if there is a reason for it. She uses the cursor in the filter panel (Figure 6-c10) to select paths with a coverage lower than 10 percent. She hovers over the available paths and inspects their coverage. She notices that the property http://fr.dbpedia.org/property/writer is used only once, while a property which sounds very similar, http://dbpedia.org/property/writer, is used more than 800 times. To identify the entity she needs to modify, she clicks on the button “See query”, which gives her access to the endpoint. She will now do similar checks with other paths of depth 1 and paths of depth 2.

Scenario 2: A person in charge of the Nobel dataset would like to know what kind of geographical information is available for the nobel:Laureates. Could she draw maps of their birth places or affiliations? She knows there are no geo coordinates in her dataset, but some should be available through similarity links. She opens Path Outlines, searches for nobel in the filter panel, and opens her dataset. She then selects the nobel:Laureates start set. She starts by looking for laureates having an affiliation aligned with another dataset. She selects paths of depth 3. In the first column, she types affiliation. This removes properties other than nobel:affiliation from this column, and properties which are not used in a path starting with nobel:affiliation from the other columns. Among the properties remaining in the second column she can easily identify dbpedia:city, which she selects. In the third column, she selects the owl:sameAs property. A single path being now selected, summary information appears in the inspector: 72 percent of the laureates have an affiliation aligned with an external dataset. She selects the link to display extensions in DBpedia. A list of 78 available properties to extend the path in DBpedia appears. She types geo in the search field; a list of 4 properties containing geo:lat and geo:long remains. She inspects the summary information of the extended paths: only 32 percent of the laureates have geo coordinates in DBpedia. She repeats the same operations for birth places: 96 percent have a similarity link to an external dataset, among which 61 percent have geo coordinates in DBpedia.

6.2.3 Implementation

The front-end interface is developed with NodeJS; it uses the Vue.js and d3.js frameworks. The code is open source (https://gitlab.inria.fr/mdestand/spf).

6.3 LDPath API for Path Analysis

In order to analyse the paths, we developed a specific extension to the semantic framework CORESE (https://corese.inria.fr/). Given an input query, it discovers and navigates paths in a SPARQL endpoint by completing the input query with predicates that exist in the endpoint. LDPath first computes the list of possible predicates and then, for each predicate, counts the number of paths. This process is applied recursively for each predicate until a maximum path length is reached. The values at the end of each path are analysed to retrieve the features listed in Table I. LDPath can also, for each path, count the number of joins of this path in another endpoint, and compute the list of possible predicates to extend the path by one statement. The values at the end of the extension are also analysed. The software works by recursively rewriting and executing SPARQL queries with appropriate SERVICE clauses. The API of this extension is made available for other purposes, and can be queried independently of Path Outlines.
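The sketch below illustrates this recursive rewriting as we understand it from the description above (not LDPath’s exact internals): a first query enumerates and counts the predicates leaving the start set, then the recursion fixes each discovered predicate and repeats one step deeper.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX ex:  <http://example.org/terms/>   # placeholder start class

    # Depth 1: which predicates leave the start set, and how often?
    SELECT ?p (COUNT(*) AS ?n) WHERE {
      ?s rdf:type ex:StartClass .
      ?s ?p ?o .
    }
    GROUP BY ?p

    # Depth 2, for each ?p found above (here fixed as ex:p1):
    # SELECT ?p2 (COUNT(*) AS ?n) WHERE {
    #   ?s rdf:type ex:StartClass . ?s ex:p1 ?mid . ?mid ?p2 ?o . }
    # GROUP BY ?p2
    # ...and so on, until the maximum depth is reached.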

7 User Study 3: Evaluating Path Outlines

We conducted a 2×2 within-subject controlled experiment to compare Path Outlines with the Virtuoso SPARQL query editor. The first independent variable was the technique, the second the dataset. We chose a non-graphical tool as a baseline because we were unable to find a graphical tool able to support the tasks, whereas we identified SPARQL query editors as the most straightforward means to realise them, as confirmed by the producers we interviewed. We limited the experiment to 3 tasks to keep the total time under one hour, knowing that it is tiring for participants to write queries in a limited time, especially when the experimenter is watching. The tasks were very similar for both datasets, with small adaptations to the context. Task 2 necessitated looking at paths of several depths. Our hypothesis was that Path Outlines would make the execution of the tasks more comfortable, easier and quicker. We also hypothesised that participants would better remember the structure of a dataset after the experiment with Path Outlines.

7.1 Participants

We recruited 36 participants (30 men and 6 women) via calls on Semantic Web mailing lists and Twitter, with the requirement that they should be able to write SPARQL queries. 5 of the participants in the interview also registered for the experiment. Job categories included 12 researchers, 10 PhD students, 9 engineers and 3 librarians. 29 produced RDF data and 31 reused them. Their experience with SPARQL ranged from 6 months to 15 years, the average being 5.07 years and the median 4 years (note: SPARQL has existed since 2004; the standard was released in 2008). 12 rated their level of comfort with SPARQL as very comfortable, 11 as rather comfortable, 10 as fine, and 3 as rather uncomfortable. 18 used it several times a week, 13 several times a month, 2 several times a year and 3 once a year or less. All participation was voluntary and without compensation.

7.2 Set-up

The experiment was supervised online through the videoconference system Renater Rendez-vous; due to technical problems it was replaced by Skype in 4 cases and appear.in in 2 cases, and it was run face-to-face for 3 participants. We used a LimeSurvey form to guide participants through the tasks and collect the results. The form provided links to our tool, to a web interface developed in JavaScript, and to a SPARQL endpoint we had set up for the experiment; in 5 cases, due to network restrictions, we replaced our endpoint by the Nobel public endpoint. We used two datasets, which had been analysed with our tool and are hosted on our endpoint. Two participants stopped after two tasks because of personal planning reasons, so we asked the last two participants to complete only two tasks, to keep the four configurations balanced for all tasks.

Fig. 8: Comfort of the technique, easiness of the task and success: comparison of Path Outlines and SPARQL query editor on each task. a) Participants find Path Outlines more comfortable, b) they perceive similar tasks as easier when performed with it, c) and they are more able to complete the tasks successfully with it.

7.3 Procedure

We send an email to the participants with a link to the video conference. As they connect, we give them a link to the form with a unique token, valid only once, associated with their anonymous unique identifier. Participants are invited to read the consent form; we rephrase the main points, and invite them to accept it if they wish to continue. We start with a set of questions about their experience with SPARQL. Then we introduce the experiment and explain how it will unfold.

The first task is displayed, associated with a technique and a dataset. For example: Task 1 / Nobel Dataset / Path Outlines. “Consider all the awards in the dataset. For what percentage of them can you find the label of the birth place of the laureate of an award?” We read it aloud, and rephrase the statement until it makes sense to the participants (performing such tasks on sets of entities in a Linked Dataset was a new concept for some of them). Participants are asked to describe their plan before they perform the task. We rate the precision of the plan (0 for no or very imprecise planning, 1 for imprecise planning, 2 for very precise planning). The time to actually perform the task is limited to eight minutes. If participants cannot complete in time, they are asked to estimate how much more time they think they would have needed. Then they rate the difficulty of the task and the comfort of the technique.

The next task is a similar task with the other technique on the other dataset. We counterbalance the order of the two factors, technique and dataset, resulting in 4 configurations. Tasks are always performed in the same order, from the one involving the simplest concepts to the most complicated. After each set of two similar tasks, participants are asked which environment they would choose if they had both at their disposal for such a task.

The same is repeated for two other sets of tasks. For instance, Task 2 (T2) on Nobel dataset is Consider all the laureates in the dataset. Find all the paths of depth 1 or 2 starting from them and leading to a temporal information. Indicate the datatype of the values at the end of the path. And Task 3 (T3): Imagine you want to plot a map of the universities. The most precise geographical information about the universities in the dataset seems to be the cities, which are aligned to DBpedia through similarity links owl:sameAs. Find one or several properties in DBpedia (http://dbpedia.org/sparql) that could help you place the cities on a map.

At the end of the three sets of tasks, participants answer an MCQ form about the general structure of a dataset: number of triples, classes, and paths of length 1 and length 2. Finally, they are invited to comment on the tool and make suggestions.

7.4 Data Collection and Analysis

We collect the answers to the form, screencasts of the web browser and notes. Answers to the form and notes are merged in an xls sheet, and analysed with R.

7.5 Results

7.5.1 Comfort and easiness

In general, participants found Path Outlines more comfortable than the SPARQL query editor (Figure 8a). Several said that they would need more time to become fully comfortable with the tool. Five minutes of practice is indeed a very short time, but the level of comfort reported with Path Outlines is already very good. The level of comfort reported when performing tasks with SPARQL was lower than the level initially expressed. We interpret this as being due partly to the fact that it is uncomfortable to code while an experimenter is watching, and partly to the difficulty of the tasks. Being very familiar with SPARQL does not mean being familiar with queries involving both sets of entities and deep paths. This supports the idea that a specific tool for such tasks can be useful even for experts. Three users mentioned being less comfortable with Virtuoso than with their usual environment; however, Virtuoso was the tool most frequently listed as a usual tool by participants (23). Participants perceived the same tasks as easier when performed with Path Outlines than with the Virtuoso SPARQL query editor, as shown in Figure 8b. We think this is because Path Outlines enables them to manipulate the paths directly, saving them the mental process of reconstructing the paths by chaining statements and associating summary information with them.

7.5.2 Time on task

We counted 8 minutes for each timeout or dropout. Participants were quicker with Path Outlines on all three tasks, as shown in Figure 9a. We applied paired-samples t-tests to compare time between techniques for each task; the difference was significant for all three tasks (T1, T2 and T3), which suggests that the technique has a significant effect on time.

Those who did not complete the tasks were asked to estimate the additional time they would have needed. We did not use self-estimations to make a time comparison: not all participants were able to answer, and such estimations are likely to be unreliable, time perception and self-perception being influenced by many factors. However, we report them as an indicator: for participants with a very precise plan, it ranged from 30 seconds to an hour; with an imprecise plan, from 15 seconds to 45 minutes; and with no plan, from 4 minutes to several hours. Task 2 required looking at paths of two different depths, which we had identified as a non-optimal aspect of our interface. Although participants took longer on this task, Path Outlines still outperformed the Virtuoso SPARQL query editor, but several participants expressed the wish to see both depths at the same time.

7.5.3 Task completion and errors

Using our tool, only one participant timed out, on Task 2; all others managed to complete each of the tasks within 8 minutes. With SPARQL, there were 37 dropouts (9 on T1, 10 on T2 and 18 on T3) and 15 timeouts (9 on T1, 5 on T2 and 1 on T3). Among the tasks completed in time, 28 had erroneous or incomplete results with SPARQL (11 on T1, 13 on T2 and 5 on T3) versus 13 with our tool (all on T2), as summed up in Figure 8c.

The main error on T1 was that some participants counted the number of paths matching the pattern instead of the number of documents having such paths (either by counting values at the end of the paths or by counting entities without the DISTINCT keyword). It occurred 9 times with SPARQL, and never with our tool. Four participants came close to making the mistake but corrected themselves with SPARQL, and one did so with our tool. Another error occurred only once with SPARQL: the participant started from the wrong class of resources.
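In SPARQL terms, the difference between the two counts is a single DISTINCT keyword, as the following sketch shows (illustrative property names, not the experiment’s exact queries):

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX ex:   <http://example.org/terms/>

    # Counts every matching path instance (the T1 error):
    SELECT (COUNT(?label) AS ?n) WHERE {
      ?award a ex:Award ;
             ex:laureate/ex:birthPlace/rdfs:label ?label . }

    # Counts the awards for which the path exists (expected answer):
    SELECT (COUNT(DISTINCT ?award) AS ?n) WHERE {
      ?award a ex:Award ;
             ex:laureate/ex:birthPlace/rdfs:label ?label . }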

T2 presented the particular difficulty that temporal information in RDF datasets can be typed with various datatypes, including xsd:string and xsd:integer. The most common error was to give only part of the results, either because of relying on only one datatype, or because it was difficult to sort out the right ones when displaying all of them. It occurred 12 times with both techniques. The mean percentage of correct results was 75 percent with our tool, versus 50 percent with SPARQL. With SPARQL, one participant gave all paths as an answer, including non-temporal ones.

For T3, one participant gave an answer with SPARQL that did not meet the requirement, stating that it would be too complicated. Another error, which happened 5 times, was that the query timed out although it was correct. There are tricks and workarounds, but in most cases the time needed to write the query and realise it would time out was already too long to start figuring out a workaround. This is a common problem with federated queries on sets, also reported by Warren and Mulholland [36].

Fig. 9: Time on task and preference: comparison of Path Outlines and SPARQL query editor on each task. a) Participants are quicker with Path Outlines and b) prefer Path Outlines to SPARQL query editor

7.5.4 Gaining a structural overview of the dataset

Answers were very sparse: most participants did not remember the information at all, and there was no significant difference between the techniques. We cannot draw any conclusion from the data we collected. We think this is related to the fact that participants were fully focused on finishing tasks in time, and did not take time to look at other elements of the interface. We could run another experiment asking similar questions after open exploration tasks.

7.5.5 Comments

Several participants expressed the need for such a tool in their work, and asked if they could try it on their own data. Most of them liked the tool and made positive comments. One participant wrote an email after the experiment to thank us for the work, saying that “such tools are needed due to the conceptual difficulties in understanding large complex datasets”. Interestingly, he happened to be one of the two participants who had difficulty relating to the tasks during the interview.

7.5.6 Preference

Most participants preferred Path Outlines (34 on T1, 31 on T2 and 29 on T3) versus Virtuoso SPARQL query editor (2 on T1, 5 on T2 and 3 on T3), as shown in Figure 9b.

8 Discussion and Future Work

Although Path Outlines was designed for users who are not very comfortable with SPARQL, it proved difficult to find enough beginners as participants, so we opened the call to any level of expertise, expecting this might lower the effect size. We were positively surprised to see that, despite a rather high average level of experience and expertise, participants still performed significantly better with our tool. While novice users are often more efficient and comfortable with an interface relying on recognition, such as ours, than with a query editor relying on recall [46], this is not always the case with more experienced users. We think this might be because querying over sets did not seem to be a usual operation for most of the participants. Linked Open Data are by definition huge and incomplete, so it might be natural not to even try to get overviews, since such queries are likely to either time out or return results that are difficult to interpret. However, as human beings, we need to compare, evaluate, and see resources in the context of other similar resources. Considering subsets of the LOD world, and being able to see how they relate, is probably needed to leverage the use of LOD. The new standard of property paths for SPARQL queries is a sign that considering paths deeper in the graph was needed; the difficulty of thinking deep and broad at the same time remains, and we think path-based summaries can help.
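For instance, the sequence and alternation operators of SPARQL 1.1 property paths can express a “depth 1 or 2” pattern such as Task 2 in a single triple pattern (a sketch with illustrative property names):

    PREFIX ex: <http://example.org/terms/>

    # Values reachable at depth 1 (ex:date) or depth 2 (ex:award/ex:year):
    SELECT ?v WHERE {
      ?s a ex:Laureate ;
         (ex:date | ex:award/ex:year) ?v .
    }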

The concept still needs to be refined and developed. At the moment, our paths are “weakly typed” [18]: they consider statements going through different types of intermediate entities as being similar. Nor do they take into account overlapping between sets of entities at the start point. It would be worth investigating the benefits of filtering on those criteria. Although the cost of computing the analysis would be higher, we could imagine several modes of summaries, depending on the time and resources available to compute the summary.

Although it seemed logical to start by looking at shorter paths before longer ones, there are cases when users would prefer to see several depths at the same time, as for Task 2. With the current interface this means repeating the same task with different depths. In the absence of a better solution, we considered this as acceptable, though not optimal. The challenge is not trivial, but would definitely be worth further investigations.

As a prototype, our tool works on small to medium datasets. For the demo instance, we analysed 5 datasets. On Data.bnf, our analysis tool breaks when trying to analyse the three most populated classes (Work, Expression and Manifestation); we were able to analyse them on a 10 percent sample of the dataset. For larger datasets, it would make sense to compute the analysis on a sample and extrapolate. This would raise design challenges regarding the representation of uncertainty. In the current state of the prototype, we do not provide access to the real data: to identify the entities summarised by a path, we only provide a link to the SPARQL endpoint with the query to fetch the results. It would definitely be valuable to integrate statistics with the content [47], although this would come with new technical and design challenges.

While we studied a very specific user group, participants in our interview and experiment spontaneously mentioned the interest of such a tool for data reusers when discovering a dataset. The tool could also be adapted for ontology builders, for instance to support navigation on inferred class hierarchies [48], and let them discover to which statements inferences can lead. Studying Semantic Web users and building tools that leverage the use of the technology is needed if we want to use it to “overcome challenges in HCI” [49], so that initiatives such as [50, 51] do not remain at the margin of the community. Semantic Web data being graph data, the principle of a path browser could also be generalized to other graph data, addressing key concerns such as gaining overviews [52, 53], structuring high-dimensional data in a low number of dimensions [54, 55, 56], building visual analysis tools [57], and representing irregular and heterogeneous semi-structured data [58, 59].

We focused on data producers to design and evaluate our tool. However, during both the interview and the experiment, many participants spontaneously mentioned the interest of such a tool in situations where a data reuser discovers a dataset. A direction for future work could be to check whether there are tasks specific to this context. Another application, for data producers, could be to adapt the tool to the display of ontologies, to help identify potential paths produced by an ontology or a combination of ontologies, and assist producers in modelling the data before they are created and can be summarised.

9 Conclusion

Linked Data producers face a challenge: the particular structure of their data implies new tasks that need to be elicited and supported. We observed them in situations where they felt hindered in the understanding of their data, and characterised their needs. To address these needs, we reified chains of statements into paths, and operationalised this approach into tasks. We interviewed 11 data producers and confirmed that they were enthusiastic about the elicitation of their needs and related tasks, and interested in a tool to meet them. We designed Path Outlines, a tool to support such tasks, relying on an API to analyse the paths. It enables users to browse and inspect large collections of paths. We compared Path Outlines with the Virtuoso SPARQL query editor, SPARQL being the most common way to realise such tasks. Path Outlines was rated as more comfortable and easier, performed better in terms of time, and lowered the number of abandoned tasks, although participants had, on average, 5 years of experience with SPARQL versus 5 minutes with our tool.

10 Acknowledgments

We wholeheartedly thank Jean-Daniel Fekete, Wendy Mackay and Jean-Philippe Rivière, the members of the Doremus project, as well as all the participants in our interview and experiment.

References

  • [1] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data: The story so far,” in Semantic services, interoperability and web applications: emerging concepts.   IGI Global, 2011, pp. 205–227.
  • [2] H. Hlomani and D. Stacey, “Approaches, methods, metrics, measures, and subjectivity in ontology evaluation: A survey,” Semantic Web Journal, vol. 1, no. 5, pp. 1–11, 2014.
  • [3] K. C. Feeney, G. Mendel-Gleason, and R. Brennan, “Linked data schemata: fixing unsound foundations,” Semantic Web, vol. 9, no. 1, pp. 53–75, 2018.
  • [4] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, and P. Hitzler, “Quality assessment methodologies for linked open data,” Submitted to Semantic Web Journal, 2013.
  • [5] A. Hogan, J. Umbrich, A. Harth, R. Cyganiak, A. Polleres, and S. Decker, “An empirical survey of linked data conformance,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 14, pp. 14–44, 2012.
  • [6] M. Schmachtenberg, C. Bizer, and H. Paulheim, “Adoption of the linked data best practices in different topical domains,” in International Semantic Web Conference.   Springer, 2014, pp. 245–260.
  • [7] A. Dimou, D. Kontokostas, M. Freudenberg, R. Verborgh, J. Lehmann, E. Mannens, S. Hellmann, and R. Van de Walle, “Assessing and refining mappings to RDF to improve dataset quality,” in International Semantic Web Conference.   Springer, 2015, pp. 133–149.
  • [8] G. Troullinou, H. Kondylakis, E. Daskalaki, and D. Plexousakis, “Ontology understanding without tears: The summarization approach,” Semantic Web, vol. 8, no. 6, pp. 797–815, 2017.
  • [9] J. Carroll and G. Klyne, “Resource description framework (RDF): Concepts and abstract syntax,” W3C, W3C Recommendation, Feb. 2004, http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.
  • [10] D. Brickley and L. Miller, “FOAF vocabulary specification 0.99,” Ontology, Feb. 2014, http://xmlns.com/foaf/spec/.
  • [11] S. Auer, J. Demter, M. Martin, and J. Lehmann, “LODStats – an extensible framework for high-performance dataset analytics,” in International Conference on Knowledge Engineering and Knowledge Management.   Springer, 2012, pp. 353–362.
  • [12] S. Issa, P.-H. Paris, F. Hamdi, and S. S.-S. Cherfi, “Revealing the conceptual schemas of RDF datasets,” in International Conference on Advanced Information Systems Engineering.   Springer, 2019, pp. 312–327.
  • [13] K. Kellou-Menouer and Z. Kedad, “Schema discovery in RDF data sources,” in International Conference on Conceptual Modeling.   Springer, 2015, pp. 481–495.
  • [14] S. Khatchadourian and M. P. Consens, “ExpLOD: Summary-based exploration of interlinking and RDF usage in the linked open data cloud,” in Extended Semantic Web Conference.   Springer, 2010, pp. 272–287.
  • [15] B. Spahiu, R. Porrini, M. Palmonari, A. Rula, and A. Maurino, “ABSTAT: ontology-driven linked data summaries with pattern minimalization,” in European Semantic Web Conference.   Springer, 2016, pp. 381–395.
  • [16] M. Dudáš, V. Svátek, and J. Mynarz, “Dataset summary visualization with LODSight,” in European Semantic Web Conference.   Springer, 2015, pp. 36–40.
  • [17] M. Dudáš and V. Svátek, “Discovering issues in datasets using LODSight visual summaries,” in Proceedings of the International Workshop on Visualizations and User Interfaces for, 2015, p. 77.
  • [18] Š. Čebirić, F. Goasdoué, and I. Manolescu, “Query-oriented summarization of RDF graphs,” 2016.
  • [19] M. Weise, S. Lohmann, and F. Haag, “LD-VOWL: Extracting and visualizing schema information for linked data,” in 2nd International Workshop on Visualization and Interaction for Ontologies and Linked Data, 2016, pp. 120–127.
  • [20] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. Giannopoulou, “Ontology visualization methods—a survey,” ACM Computing Surveys (CSUR), vol. 39, no. 4, p. 10, 2007.
  • [21] M. Dudáš, O. Zamazal, and V. Svátek, “Roadmapping and navigating in the ontology visualization landscape,” in International Conference on Knowledge Engineering and Knowledge Management.   Springer, 2014, pp. 137–152.
  • [22] N. Bikakis and T. Sellis, “Exploration and visualization in the web of big linked data: A survey of the state of the art,” arXiv preprint arXiv:1601.08059, 2016.
  • [23] E. Pietriga, “IsaViz, a visual environment for browsing and authoring RDF models,” in Eleventh International World Wide Web Conference Developers Day, 2002.
  • [24] S. Lohmann, V. Link, E. Marbach, and S. Negru, “WebVOWL: Web-based visualization of ontologies,” in International Conference on Knowledge Engineering and Knowledge Management.   Springer, 2014, pp. 154–158.
  • [25] W. A. Woods, “What’s in a link: Foundations for semantic networks,” in Representation and understanding.   Elsevier, 1975, pp. 35–82.
  • [26] A. Anikin, D. Litovkin, M. Kultsova, E. Sarkisova, and T. Petrova, “Ontology visualization: Approaches and software tools for visual representation of large ontologies in learning,” in Conference on Creativity in Intelligent Technologies and Data Science.   Springer, 2017, pp. 133–149.
  • [27] K. Kotis and M. Lanzenberger, “Ontology matching: current status, dilemmas and future challenges,” in 2008 International Conference on Complex, Intelligent and Software Intensive Systems.   IEEE, 2008, pp. 924–927.
  • [28] L. R. Novick, “Understanding spatial diagram structure: An analysis of hierarchies, matrices, and networks,” The Quarterly Journal of Experimental Psychology, vol. 59, no. 10, pp. 1826–1856, 2006.
  • [29] W. Huang and P. Eades, “How people read graphs,” in Proceedings of the 2005 Asia-Pacific Symposium on Information Visualisation - Volume 45.   Australian Computer Society, Inc., 2005, pp. 51–58.
  • [30] M. van Amelsvoort, J. van der Meij, A. Anjewierden, and H. van der Meij, “The importance of design in learning from node-link diagrams,” Instructional science, vol. 41, no. 5, pp. 833–847, 2013.
  • [31] C. Ware, H. Purchase, L. Colpoys, and M. McGill, “Cognitive measurements of graph aesthetics,” Information visualization, vol. 1, no. 2, pp. 103–110, 2002.
  • [32] L. R. Novick and S. M. Hurley, “To matrix, network, or hierarchy: That is the question,” Cognitive psychology, vol. 42, no. 2, pp. 158–216, 2001.
  • [33] B. Lee, C. S. Parr, C. Plaisant, B. B. Bederson, V. D. Veksler, W. D. Gray, and C. Kotfila, “TreePlus: Interactive exploration of networks with enhanced tree layouts,” IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 6, pp. 1414–1426, 2006.
  • [34] D. DeStefano and J.-A. LeFevre, “Cognitive load in hypertext reading: A review,” Computers in human behavior, vol. 23, no. 3, pp. 1616–1641, 2007.
  • [35] C. Partl, S. Gratzl, M. Streit, A. M. Wassermann, H. Pfister, D. Schmalstieg, and A. Lex, “Pathfinder: Visual analysis of paths in graphs,” in Computer Graphics Forum, vol. 35, no. 3.   Wiley Online Library, 2016, pp. 71–80.
  • [36] P. Warren and P. Mulholland, “Using SPARQL – the practitioners’ viewpoint,” in European Knowledge Acquisition Workshop.   Springer, 2018, pp. 485–500.
  • [37] M. Stocker, A. Seaborne, A. Bernstein, C. Kiefer, and D. Reynolds, “SPARQL basic graph pattern optimization using selectivity estimation,” in Proceedings of the 17th International Conference on World Wide Web.   ACM, 2008, pp. 595–604.
  • [38] A. Macina, J. Montagnat, and O. Corby, “A SPARQL Distributed Query Processing Engine Addressing both Vertical and Horizontal Data Partitions,” in 32ème Conférence sur la Gestion de Données - Principes, Technologies et Applications (BDA), Poitiers, Nov. 2016.
  • [39] S. Ferré, “Sparklis: an expressive query builder for SPARQL endpoints with guidance in natural language,” Semantic Web, vol. 8, no. 3, pp. 405–418, 2017.
  • [40] D. R. Karger, “The semantic web and end users: What’s wrong and how to fix it,” IEEE Internet Computing, vol. 18, no. 6, pp. 64–70, 2014.
  • [41] M. Beaudouin-Lafon and W. E. Mackay, “Reification, polymorphism and reuse: three principles for designing visual interfaces,” in Proceedings of the working conference on Advanced visual interfaces.   ACM, 2000, pp. 102–109.
  • [42] C. R. Collins and K. Stephenson, “A circle packing algorithm,” Computational Geometry, vol. 25, no. 3, pp. 233–256, 2003. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925772102000998
  • [43] J. Abello, F. Van Ham, and N. Krishnan, “Ask-graphview: A large scale graph visualization system,” IEEE transactions on visualization and computer graphics, vol. 12, no. 5, pp. 669–676, 2006.
  • [44] D. Archambault, T. Munzner, and D. Auber, “Tugging graphs faster: Efficiently modifying path-preserving hierarchies for browsing paths,” IEEE Transactions on Visualization and Computer Graphics, vol. 17, no. 3, pp. 276–289, 2010.
  • [45] B. Shneiderman, “Tree visualization with tree-maps: A 2-d space-filling approach,” Tech. Rep., 1998.
  • [46] D. Bau, J. Gray, C. Kelleher, J. Sheldon, and F. Turbak, “Learnable programming: blocks and beyond,” arXiv preprint arXiv:1705.09413, 2017.
  • [47] A. Perer and B. Shneiderman, “Integrating statistics and visualization for exploratory power: From long-term case studies to design guidelines,” IEEE Computer Graphics and Applications, vol. 29, no. 3, pp. 39–51, 2009.
  • [48] M. Vigo, C. Jay, and R. Stevens, “Constructing conceptual knowledge artefacts: Activity patterns in the ontology authoring process,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, ser. CHI ’15.   New York, NY, USA: ACM, 2015, pp. 3385–3394. [Online]. Available: http://doi.acm.org/10.1145/2702123.2702495
  • [49] D. Degler, S. Henninger, and L. Battle, “Semantic web hci: Discussing research implications,” in CHI ’07 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’07.   New York, NY, USA: ACM, 2007, pp. 1909–1912. [Online]. Available: http://doi.acm.org/10.1145/1240866.1240921
  • [50] I. Roes, N. Stash, Y. Wang, and L. Aroyo, “A personalized walk through the museum: The chip interactive tour guide,” in CHI ’09 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’09.   New York, NY, USA: ACM, 2009, pp. 3317–3322. [Online]. Available: http://doi.acm.org/10.1145/1520340.1520479
  • [51] K. Luyten, K. Thys, S. Huypens, and K. Coninx, “Telebuddies: Social stitching with interactive television,” in CHI ’06 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’06.   New York, NY, USA: ACM, 2006, pp. 1049–1054. [Online]. Available: http://doi.acm.org/10.1145/1125451.1125651
  • [52] R. Shannon, A. Quigley, and P. Nixon, “Graphemes: Self-organizing shape-based clustered structures for network visualisations,” in CHI ’10 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’10.   New York, NY, USA: ACM, 2010, pp. 4195–4200. [Online]. Available: http://doi.acm.org/10.1145/1753846.1754125
  • [53] B. E. Alper, N. Henry Riche, and T. Hollerer, “Structuring the space: A study on enriching node-link diagrams with visual references,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’14.   New York, NY, USA: ACM, 2014, pp. 1825–1834. [Online]. Available: http://doi.acm.org/10.1145/2556288.2557112
  • [54] B. Alper, B. Bach, N. Henry Riche, T. Isenberg, and J.-D. Fekete, “Weighted graph comparison techniques for brain connectivity analysis,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’13.   New York, NY, USA: ACM, 2013, pp. 483–492. [Online]. Available: http://doi.acm.org/10.1145/2470654.2470724
  • [55] M. Wattenberg, “Visual exploration of multivariate graphs,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’06.   New York, NY, USA: ACM, 2006, pp. 811–819. [Online]. Available: http://doi.acm.org/10.1145/1124772.1124891
  • [56] B. Bach, E. Pietriga, and J.-D. Fekete, “Visualizing dynamic networks with matrix cubes,” in Proceedings of the 32nd Annual ACM Conference on Human Factors in Computing Systems, ser. CHI ’14.   New York, NY, USA: ACM, 2014, pp. 877–886. [Online]. Available: http://doi.acm.org/10.1145/2556288.2557010
  • [57] N. Cao, Y.-R. Lin, L. Li, and H. Tong, “g-miner: Interactive visual group mining on multivariate graphs,” in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, ser. CHI ’15.   New York, NY, USA: ACM, 2015, pp. 279–288. [Online]. Available: http://doi.acm.org/10.1145/2702123.2702446
  • [58] D. R. Karger and D. Quan, “Haystack: A user interface for creating, browsing, and organizing arbitrary semistructured information,” in CHI ’04 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’04.   New York, NY, USA: ACM, 2004, pp. 777–778. [Online]. Available: http://doi.acm.org/10.1145/985921.985931
  • [59] ——, “Collections: Flexible, essential tools for information management,” in CHI ’04 Extended Abstracts on Human Factors in Computing Systems, ser. CHI EA ’04.   New York, NY, USA: ACM, 2004, pp. 1159–1162. [Online]. Available: http://doi.acm.org/10.1145/985921.986013