Empowering Investigative Journalism with Graph-based Heterogeneous Data Management

02/08/2021 ∙ by Angelos Christos Anadiotis, et al. ∙ Inesc-ID, Inria, Ecole Polytechnique

Investigative Journalism (IJ, in short) is a staple of modern, democratic societies. IJ often necessitates working with large, dynamic sets of heterogeneous, schema-less data sources, which can be structured, semi-structured, or textual, limiting the applicability of classical data integration approaches. In prior work, we have developed ConnectionLens, a system capable of integrating such sources into a single heterogeneous graph, leveraging Information Extraction (IE) techniques; users can then query the graph by means of keywords, and explore query results and their neighborhood using an interactive GUI. Our keyword search problem is complicated by the graph heterogeneity, and by the lack of a result score function that would allow pruning some of the search space. In this work, we describe an actual IJ application studying conflicts of interest in the biomedical domain, and we show how ConnectionLens supports it. Then, we present novel techniques addressing the scalability challenges raised by this application: one reduces the significant IE costs incurred while building the graph, while the other is a novel, parallel, in-memory keyword search engine, which achieves orders of magnitude speed-up over our previous engine. Our experimental study on the real-world IJ application data confirms the benefits of our contributions.




1. Introduction

Journalism and the press are a critical ingredient of any modern society. Like many other industries, such as trade or entertainment, journalism has benefitted from the explosion of Web technologies, which enabled instant sharing of its content with the audience. However, unlike trade, where databases and data warehouses had taken over daily operations decades before the Web age, many newsrooms discovered the Web and social media long before building strong information systems where journalists could store their information and/or ingest data of interest to them. As a matter of fact, journalists’ desire to protect their confidential information may also have played a role in delaying the adoption of data management infrastructures in newsrooms.

At the same time, highly appreciated journalism work often requires acquiring, curating, and exploiting large amounts of digital data. Among the authors, S. Horel co-authored the “Monsanto Papers” series which obtained the European Press Prize Investigative Reporting Award in 2018 (17); a similar project is the “Panama Papers” (later known as “Offshore Leaks”) series of the International Consortium of Investigative Journalists (32). In such works, journalists are forced to work with heterogeneous data, potentially in different data models (structured such as relations, semistructured such as JSON or XML documents, or graphs, including but not limited to RDF, as well as unstructured text). We, the authors, are currently collaborating on such an Investigative Journalism (IJ, in short) application, focused on the study of situations potentially leading to conflicts of interest (CoIs, in short; see Footnote 1) between biomedical experts and various organizations: corporations, industry associations, lobbying organizations or front groups. Information of interest in this setting comes from: scientific publications (in PDF) where authors declare e.g., “Dr. X. Y. has received consulting fees from ABC”; semi-structured metadata (typically XML, used for instance in PubMed), where authors may also specify such connections; a medical association, say, French cardiology, may build its own disclosure database which may be relational, while a company may disclose its ties to specialists in a spreadsheet.

Footnote 1: According to the 2011 French transparency law, “A conflict of interest is any situation where a public interest may interfere with a public or private interest, in such a way that the public interest may be, or appear to be, unduly influenced.”

This paper builds upon our recent work (Anadiotis et al., 2020a), where we have identified a set of requirements (R) and the constraints (C) that need to be addressed to efficiently support IJ applications. We recall them here for clarity and completeness:

R1. Integral source preservation and provenance: in journalistic work, it is crucial to be able to trace each information item back to the data source from which it came. This enables adequately sourcing information, an important tenet of quality journalism.

R2. Little to no effort required from users: journalists often lack time and resources to set up IT tools or data processing pipelines. Even when they are able to use a tool supporting one or two data models (e.g., most relational databases provide some support for JSON data), handling other data models remains challenging. Thus, the data analysis pipeline needs to be as automatic as possible.

C1. Little-known entities: interesting journalistic datasets feature some extremely well-known entities (e.g., world leaders in the pharmaceutical industry) next to others of much smaller notoriety (e.g., an expert consulted by EU institutions, or a little-known trade association). From a journalistic perspective, such lesser-known entities may play a crucial role in making interesting connections among data sources, e.g., the association may be created by the industry leader, and it may pay the expert honoraries.

C2. Controlled dataset ingestion: the level of confidence in the data required for journalistic use excludes massive ingestion from uncontrolled data sources, e.g., through large-scale Web crawls.

R3. Performance on “off-the-shelf” hardware: The efficiency of our data processing pipeline is important; also, the tool should run on general-purpose hardware, available to users like the ones we consider, without expertise or access to special hardware.

Further, IJ applications’ data analysis needs entail:

R4. Finding connections across heterogeneous datasets is a core need. In particular, it is important for our approach to be tolerant of inevitable differences in the organization of data across sources. Heterogeneous data integration works, such as (Doan et al., 2012; Calvanese et al., 2007; Buron et al., 2020), and recent heterogeneous polystores, e.g., (Duggan et al., 2015; Alotaibi et al., 2019; Quamar et al., 2020) assume that sources have well-understood schemas; other recent works, e.g., (Ota et al., 2020; Christodoulakis et al., 2020; Nargesian et al., 2020) focus on analyzing large sets of Open Data sources, all of which are tabular. IJ data sources do not fit these hypotheses: data can be semi-structured, structured, or simply text. Therefore, we opt for integrating all data sources in a heterogeneous graph (with no integrated schema), and for keyword-based querying where users specify some terms, and the system returns subtrees of the graph that connect nodes matching these terms.

C4. Lack of single, well-behaved answer score: After discussing several journalistic scenarios, no unique method (score) for deciding which are the best answers to a query has been identified. Instead: (i) it appears that “very large” answers (say, of more than 20 edges) are of limited interest; (ii) connections that “state the obvious”, e.g., that a French scientist is connected to a French company through their nationality, are not of interest. Therefore, unlike prior keyword search algorithms, which fix a score function and exploit it to prune the search, our algorithm must be orthogonal to the score function and work with any such function.

Building upon our previous work, and years-long discussions of IJ scenarios, this paper makes the following contributions:

  • We describe the CoI IJ application proposed by S. Horel (Section 2), we extract its technical requirements and we devise an end-to-end data analysis pipeline addressing these requirements (Section 3).

  • We provide application-driven optimizations, inspired by the CoI scenario but reusable in other contexts, which speed up the graph construction process (Section 4).

  • We introduce a parallel, in-memory version of the keyword search algorithm previously introduced in (Anadiotis et al., 2020b, a), and we explain our design in both the physical database layout and the parallel query execution (Section 5).

  • We evaluate the performance of our system using both synthetic and real-world PubMed data, we demonstrate its scalability, and we show that we have improved the performance compared to our prior work by several orders of magnitude, thereby enabling the journalists to perform interactive exploration of their data (Section 6).

2. Use case: conflicts of interest in the biomedical domain

The topic. Biomedical experts such as health scientists and researchers in life sciences play an important role in society, advising governments and the public on health issues. They also routinely interact with industry (pharmaceutical, agrifood etc.), consulting, collaborating on research, or otherwise sharing work and interests. To trust advice coming from these experts, it is important to ensure the advice is not unduly influenced by vested interests. Yet, IJ work, e.g. (Oreskes and Conway, 2012; Horel, 2018, 2020), has shown that disclosure information is often scattered across multiple data sources, hindering access to this information. We now illustrate the data processing required to gather and collectively exploit such information.

Figure 1. Graph data integration in ConnectionLens.

Sample data. Figure 1 shows a tiny fragment of data that can be used to find connections between scientists and companies. For now, consider only the nodes shown as a black dot or as a text label, and the solid, black edges connecting them; these model directly the data. The others are added by ConnectionLens as we discuss in Section 3.1. (i) Hundreds of millions of bibliographic notices (in XML) are published on the PubMed web site; the site also links to research (in PDF). In recent years, PubMed has included an optional CoIStatement element where authors can declare (in free text) their possible links with industrial players; less than 20% of recent papers have this element, and some of those present are empty (“The authors declare no conflict of interest”). (ii) Within the PDF papers themselves, paragraphs titled, e.g., “Acknowledgments”, “Disclosure statement” etc. may contain such information, even if the CoIStatement is absent or empty. This information is accessible if one converts the PDF into a format such as JSON. In Figure 1, Alice declares her consulting for ABCPharma in XML, yet the “Acknowledgments” paragraph in her PDF paper mentions HealthStar (see Footnote 2). (iii) A (subset of a) knowledge base (in RDF) such as WikiData describes well-known entities, e.g., ABCPharma; however, less-known entities of interest in an IJ scenario are often missing from such KGs, e.g., HealthStar in our example. (iv) Specialized data sources, such as a trade catalog or a Wiki Web site built by other investigative journalists, may provide information on some such actors: in our example, the PharmaLeaks Web site shows that HealthStar is also funded by the industry. Such a site, established by a trusted source (or colleague), even if it has little or no structure, is a gold mine to be reused, since it saves days or weeks of tedious IJ work. In this and many IJ scenarios, sources are highly heterogeneous, while time, skills, and resources to curate, clean, or structure the data are not available.

Footnote 2: This example is inspired by prior work of S. Horel where she identified (manually inspecting thousands of documents) an expert supposedly with no industrial ties, yet who authored papers for which companies had supplied and prepared data.

Sample query. Our application requires the connections of specialists in lung diseases, working in France, with pharmaceutical companies. In Figure 1, the edges with green highlight and those with yellow highlight, together, form an answer connecting Alice to ABCPharma (spanning over the XML and RDF sources); similarly, the edges highlighted in green together with those in blue, spanning over XML, JSON and HTML, connect her to HealthStar.

The potential impact of a CoI database. A database of known relationships between experts and interested companies, built by integrating heterogeneous data sources, would be a very valuable asset. In Europe, such a database could be used, e.g., to select, for a committee advising EU officials on industrial pollutants, experts with few or no such relationships. In the US, the Sunshine Act (35), just like the French 2011 law, requires manufacturers of drugs and medical devices to declare such information, but this does not extend to companies from other sectors.

3. Investigative Journalism pipeline

Figure 2. Investigative Journalism data analysis pipeline.

The pipeline we have built for IJ is outlined in Figure 2. First, we recall ConnectionLens graph construction (Section 3.1), which integrates heterogeneous data into a graph, stored and indexed in PostgreSQL. On this graph, the GAM keyword search algorithm (recalled in Section 3.2) answers queries such as our motivating example; these are both detailed in (Anadiotis et al., 2020a). The modules on yellow background in Figure 2 are the novelties of this work, and will be introduced below: scenario-driven performance optimizations to the graph construction (Section 4), and an in-memory, parallel keyword search algorithm, called P-GAM (Section 5).

3.1. ConnectionLens graph construction

ConnectionLens integrates JSON, XML, RDF, HTML, relational or text data into a graph, as illustrated in Figure 1. Each source is mapped to the graph as close to its data model as possible, e.g., XML edges have no labels while internal nodes all have names, while in JSON conventions are different etc. Next, ConnectionLens extracts named entities from all text nodes, regardless of the data source they come from, using trained language models. In the figure, blue, green, and orange nodes denote Organization, Location, and Person entities, respectively. Each such entity node is connected to the text node it has been extracted from, by an extraction edge recording also the confidence of the extraction (dashed in the figure). Entity nodes are shared across the graph, e.g., Person:Alice has been found in three data sources, Org:BestPharma in two sources etc. ConnectionLens includes a disambiguation module which avoids mistakenly unifying entities with the same labels but different meanings. Finally, nodes with similar labels are compared, and if their similarity is above a threshold, a sameAs (red) edge is introduced connecting them, labeled with the similarity value.

A sameAs edge with similarity 1.0 is called an equivalence edge. Then, $p$ equivalent nodes, e.g., the ABCPharma entity and the identical-label RDF literal, would lead to $p(p-1)/2$ equivalence edges. To keep the graph compact, one of the $p$ nodes is declared the representative of all $p$ nodes, and instead, we only store the $p-1$ equivalence edges adjacent to the representative. Details on all the above graph construction steps can be found in (Anadiotis et al., 2020a).
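The saving brought by the representative can be illustrated with a short Python sketch. It is only an illustration under simple assumptions: nodes are hypothetical `(id, label)` pairs, equivalence means identical labels, and the representative is arbitrarily chosen as the node with the smallest identifier (the text above does not say how ConnectionLens picks it).

```python
from collections import defaultdict
from itertools import combinations


def group_by_label(nodes):
    """Group node identifiers by their (identical) label."""
    groups = defaultdict(list)
    for node_id, label in nodes:
        groups[label].append(node_id)
    return groups


def all_equivalence_edges(nodes):
    """Naive option: one equivalence edge per pair of equivalent nodes."""
    return [pair
            for ids in group_by_label(nodes).values()
            for pair in combinations(ids, 2)]


def representative_edges(nodes):
    """Compact option: only edges between a representative and its peers."""
    edges = []
    for ids in group_by_label(nodes).values():
        representative = min(ids)  # assumption: smallest id is the representative
        edges.extend((representative, other)
                     for other in ids if other != representative)
    return edges


# Hypothetical nodes: four occurrences of the same label, one unrelated node.
nodes = [(1, "ABCPharma"), (2, "ABCPharma"), (3, "ABCPharma"),
         (4, "ABCPharma"), (5, "HealthStar")]
naive = all_equivalence_edges(nodes)
compact = representative_edges(nodes)
print(len(naive), len(compact))
```

With four equivalent nodes, the naive option stores one edge per pair, while the compact option stores one edge per non-representative node.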

Formally, a ConnectionLens graph is denoted $G = (N, E)$, where nodes $N$ can be of different types (URIs, XML elements, JSON nodes etc., but also extracted entities) and edges $E$ encode: data source structure, entities extracted from text, and node label similarity.

3.2. The GAM keyword search algorithm

We view our motivating query, on highly heterogeneous content with no a-priori known structure, as a keyword search query over a graph. Formally, a query $Q = \{w_1, \ldots, w_m\}$ is a set of $m$ keywords, and an answer tree (AT, in short) is a set of graph edges which (i) together, form a tree, and (ii) for each $w_i$, contain at least one node whose label matches $w_i$. We are interested in minimal answer trees, that is, answer trees which satisfy the following properties: (i) removing an edge from the tree will make it lack at least one keyword match, and (ii) if more than one node matches a query keyword, then all matching nodes are related through sameAs links with similarity 1.0. In the literature (see Section 7), a score function is used to compute the quality of an answer, and only the best $k$ ATs are returned, for a small integer $k$. Our problem is harder since: (i) our ATs may span over different data sources, even of different data models; (ii) they may traverse an edge in its original or in the opposite direction, e.g., to go from JSON to XML through Alice; this brings the search space size in $O(2^{|E|})$, where $|E|$ is the number of edges; and (iii) no single score function serves all IJ needs since, depending on the scenario, journalists may favor different (incompatible) properties of an AT, such as “being characteristic of the dataset” or, on the contrary, “being surprising”. Thus, we cannot rely on special properties of the score function to help us prune unpromising parts of the search space, as done in prior work (see Section 7). Intuitively, tree size could be used to limit the search: very large answer trees (say, of more than 100 edges) generally do not represent meaningful connections. However, in heterogeneous, complex graphs, users find it hard to set a size limit for the exploration. Nor is a smaller solution always better than a larger one. For instance, an expert and a company may both have “nationality” edges leading to “French” (a solution of 2 edges), but that may be less interesting than finding that the expert has written an article specifying in its CoIStatement funding from the company (which could span over 5 edges or more).
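The two defining conditions of an answer tree can be checked mechanically. The following Python sketch is a minimal illustration, assuming that edges are hypothetical `(node, node)` pairs traversable in both directions, that `labels` maps each node to its text label, and that "matching" means case-insensitive substring containment; the minimality conditions are deliberately not checked.

```python
def is_answer_tree(edges, labels, keywords):
    """Check that `edges` form a tree and that every keyword is matched."""
    nodes = {n for edge in edges for n in edge}
    # A connected graph is a tree exactly when it has |nodes| - 1 edges.
    if not edges or len(nodes) != len(edges) + 1:
        return False
    adjacent = {n: set() for n in nodes}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    # Depth-first traversal, ignoring edge direction, to test connectivity.
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adjacent[node] - seen)
    if seen != nodes:
        return False
    # Each keyword must appear in the label of at least one tree node.
    return all(any(k.lower() in labels[n].lower() for n in nodes)
               for k in keywords)


# Hypothetical fragment loosely following the running example.
labels = {"article": "PubMedArticle", "alice": "Alice",
          "coi": "consulting for ABCPharma", "other": "HealthStar"}
tree = [("article", "alice"), ("article", "coi")]
print(is_answer_tree(tree, labels, ["Alice", "ABCPharma"]))
```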

Our Grow-and-Aggressive-Merge (GAM) algorithm (Anadiotis et al., 2020b, a) enumerates trees exhaustively, until a number of answers are found, or a time-out. First, it builds 1-node trees from the nodes of the graph which match 1 or more keywords; Figure 3 shows some partial trees built when answering our sample query. The keyword match in each node label appears in bold. Then, GAM relies on two steps. Grow adds to the root of a tree one of its adjacent edges in the graph, leading to a new tree, rooted in the other end of that edge; successive Grow steps thus build increasingly larger trees. For instance, starting from the 1-node tree of the HealthStar entity, successive Grow steps go from the HTML to the JSON data source (the HealthStar entity occurs in both), and then to the XML one. Second, as soon as a tree is built by Grow, it is Merged with all the trees already found, rooted in the same node, matching different keywords and having disjoint edges wrt the given tree; the Merge result has the union of their edges. Each Merge result is then merged again with all qualifying trees (thus the “aggressive” in the algorithm name). For instance, when Grow builds a tree rooted in the PubMedArticle node (not shown), it is immediately merged with a compatible tree rooted in the same node, and the result is exactly the answer highlighted with green and blue in Figure 1.

Together, Grow and Merge are guaranteed to generate all solutions. Grow alone is sufficient to build an answer that is a simple path starting at a keyword match, while an answer with branches also requires the Merge step. GAM may build a tree in several ways, e.g., the answer above could also be obtained through a different sequence of Grow and Merge steps; GAM keeps a history of the trees already explored, to avoid repeating work on them. Importantly, GAM can be used with any score function. Its details are described in (Anadiotis et al., 2020b, a).
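To make the interplay of Grow and Merge concrete, here is a simplified, sequential Python sketch. It is not the authors' implementation: it assumes an undirected view of hypothetical `(node, node)` edges, case-insensitive substring matching, a plain FIFO queue instead of a priority order, no time-out, and it ignores sameAs edges and the minimality conditions. The names `gam`, `register` and `merge_all` are illustrative.

```python
from collections import deque


def gam(edges, labels, keywords, max_answers=10):
    """Simplified Grow-and-Aggressive-Merge; a tree is (root, edges, matched)."""
    keywords = sorted({k.lower() for k in keywords})

    def matches(node):
        return frozenset(k for k in keywords if k in labels[node].lower())

    adjacent = {}
    for edge in edges:
        for endpoint in edge:
            adjacent.setdefault(endpoint, []).append(edge)

    history = set()      # (root, edge set) of every tree built so far
    trees_by_root = {}   # root -> trees rooted there, candidates for Merge
    queue = deque()      # trees waiting for a Grow step
    answers = []

    def nodes_of(tree):
        root, tree_edges, _ = tree
        return {root} | {n for e in tree_edges for n in e}

    def register(tree):
        """Record a new tree; return True if it should be explored further."""
        root, tree_edges, matched = tree
        if (root, tree_edges) in history:
            return False
        history.add((root, tree_edges))
        if len(matched) == len(keywords):   # all keywords matched: an answer
            if tree_edges not in answers:
                answers.append(tree_edges)
            return False                    # answers are not explored further
        trees_by_root.setdefault(root, []).append(tree)
        queue.append(tree)
        return True

    def merge_all(tree):
        """Aggressive Merge: merge, then merge the results again."""
        pending = [tree]
        while pending:
            current = pending.pop()
            root, cur_edges, cur_matched = current
            for other in list(trees_by_root.get(root, [])):
                _, oth_edges, oth_matched = other
                if cur_matched & oth_matched or cur_edges & oth_edges:
                    continue
                if nodes_of(current) & nodes_of(other) != {root}:
                    continue                # more shared nodes would create a cycle
                merged = (root, cur_edges | oth_edges, cur_matched | oth_matched)
                if register(merged):
                    pending.append(merged)

    for node in labels:                     # 1-node trees on keyword matches
        if matches(node):
            register((node, frozenset(), matches(node)))

    while queue and len(answers) < max_answers:
        tree = queue.popleft()
        root, tree_edges, matched = tree
        for edge in adjacent.get(root, []):  # Grow along each edge of the root
            new_root = edge[1] if edge[0] == root else edge[0]
            if new_root in nodes_of(tree):
                continue
            grown = (new_root, tree_edges | {edge}, matched | matches(new_root))
            if register(grown):
                merge_all(grown)
    return answers[:max_answers]


# Hypothetical graph loosely following the running example.
labels = {"article": "PubMedArticle", "alice": "Alice", "coi": "CoIStatement",
          "abc": "ABCPharma", "affil": "Lung Institute, France"}
edges = [("article", "alice"), ("article", "coi"),
         ("coi", "abc"), ("article", "affil")]
answers = gam(edges, labels, ["Alice", "ABCPharma"])
three = gam(edges, labels, ["Alice", "ABCPharma", "France"])
print(answers, three)
```

The two-keyword answer is a path, which Grow alone can build starting from either matching node; the three-keyword answer branches at the article node and is only produced through Merge.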

Figure 3. Trees built by GAM for our sample query.

4. Use case-driven optimization

In this section, we present an optimization we brought to the graph construction process, guided by our target application.

In the experiments we ran, Named Entity Recognition (NER) took up to 90% of the time ConnectionLens needs to integrate data sources into a graph. The more textual the sources are, the more time is spent on NER. Our application data lead us to observe that:

  • Some text nodes, e.g., those found on the path PubMedArticle.Authors.Author.Name, always correspond to entities of a certain type, in our example, Person. If this information is given to ConnectionLens, it can create a Person entity node, like the Alice node in Figure 1, without calling the expensive NER procedure.

  • Other text nodes may be deemed uninteresting for the extraction, when journalists think no interesting entities appear there. If ConnectionLens is aware of this, it can skip the NER call on such text nodes. Observe that the input data, including all its text nodes, is always preserved; we only avoid extraction effort deemed useless (but which can still be applied later if application requirements evolve).

To exploit this insight, we introduced a notion of context, and allow users to specify (optional) extraction policies. A context is an expression designating a set of text nodes in one or several data sources. For instance, a context specified by the rooted path PubMedArticle.Authors.Author.Name designates all the text values of nodes found on that path in an XML data source; the same mechanism applies to an HTML or JSON data source. In a relational data source containing table $R$ with attribute $a$, a context of the form $R.a$ designates all text nodes in the ConnectionLens graph obtained from a value of the attribute $a$ in relation $R$. Finally, an RDF property $p$ used as context designates all the values $o$ such that a triple $(s, p, o)$ is ingested in a ConnectionLens graph.

Based on contexts, an extraction policy takes one of the following forms: (i) a force policy, given a context $c$ and an entity type $T$, e.g., Person, states that each node designated by the context is exactly one instance of $T$; (ii) a skip policy on $c$ indicates that NER should not be performed on the text nodes designated by $c$; (iii) as syntactic sugar, for hierarchical data models (e.g., XML, JSON etc.), a skipAll policy on $c$ allows stating that NER should not be performed on the text nodes designated by $c$, nor on any descendant of their parent. This allows larger-granularity control of NER on different portions of the data.
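The three policy forms can be sketched in Python as follows. This is an illustrative sketch, not the ConnectionLens policy syntax: the `Policy` class, the `decide` function and its return values are hypothetical, contexts are assumed to be dot-separated rooted paths, and all paths other than PubMedArticle.Authors.Author.Name are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    context: str           # rooted path, e.g. "PubMedArticle.Authors.Author.Name"
    action: str            # "force", "skip" or "skipAll"
    entity_type: str = ""  # only meaningful for "force"


def decide(path, policies):
    """Decide what to do with the text node found at the rooted path `path`.

    Returns ("entity", type) to create an entity without calling NER,
    ("skip", None) to leave the text node without extraction,
    or ("ner", None) to fall back to the regular NER call.
    """
    for policy in policies:
        if policy.action == "force" and path == policy.context:
            return ("entity", policy.entity_type)
        if policy.action == "skip" and path == policy.context:
            return ("skip", None)
        if policy.action == "skipAll":
            # Covers the context itself and every descendant of its parent.
            # For a one-step context (no dot), the context acts as the parent.
            parent = policy.context.rsplit(".", 1)[0]
            if path == policy.context or path.startswith(parent + "."):
                return ("skip", None)
    return ("ner", None)


policies = [
    Policy("PubMedArticle.Authors.Author.Name", "force", "Person"),
    Policy("PubMedArticle.Article.Title", "skip"),       # hypothetical path
    Policy("PubMedArticle.Journal.Title", "skipAll"),    # hypothetical path
]
for path in ["PubMedArticle.Authors.Author.Name",
             "PubMedArticle.Journal.Issue.Volume",
             "PubMedArticle.CoIStatement"]:
    print(path, decide(path, policies))
```

Text nodes not covered by any policy, such as the CoIStatement in this example, still go through NER.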

Observe that our contexts (thus, our policies) are specified within a data model; this is because the regularity that allows defining them can only be hoped for within data sources with identical structure. Policies allow journalists to state what is obvious to them, and/or what is not interesting, in the interest of graph construction speed. Force policies may also improve graph quality, by making sure NER does not miss any entity designated by the context.

5. In-memory parallel keyword search

We now describe the novel keyword search module that is the main technical contribution of this work. An in-memory graph storage model, specifically designed for our graphs and with keyword search in mind (Section 5.1), is leveraged by a multi-threaded, parallel algorithm called P-GAM (Section 5.2), which is a parallel extension of our original GAM algorithm, outlined in Section 3.2.

5.1. Physical in-memory database design

The size of the main memory in modern servers has grown significantly over the past decade. For instance, AWS EC2 offers nodes providing up to 24TB of main memory and 448 hardware threads (Amazon Web Services). Data management research has by now led to several mature products (DB engines) running entirely in main memory, such as Oracle Database In-Memory, SAP HANA, and Microsoft SQL Server with Hekaton. Moving the data from the hard disk to the main memory significantly boosts performance, avoiding disk I/O costs. However, it introduces new challenges on the optimization of the data structures and the execution model for a different bottleneck: the memory access (Boncz et al., 1999).

We have integrated P-GAM inside a novel in-memory graph database, which we have built and optimized for P-GAM operations. The physical layout of a graph database is important, given that graph processing is known to suffer from random memory accesses (Ahn et al., 2015; Elyasi et al., 2019; Roy et al., 2013; Hong et al., 2015). Our design (i) includes all the data needed by applications as described in Section 2, while also (ii) aiming at high-performance, parallel query execution in modern scale-up servers, in order to tackle huge search spaces (Section 3.2).

We start with the scalability requirements. Like GAM, P-GAM also performs Grow and Merge operations (recall Figure 3).

To enumerate possible Grow steps, P-GAM needs to access all edges adjacent to the root of a tree, as well as the representative (Section 3.1) of the root, to enable growing with an equivalence edge. Further, as we will see, P-GAM (as well as GAM) relies on a simple edge metric, called specificity, derived from the number of edges with the same label adjacent to a given node, to decide the best neighbor to Grow to. For instance, if a node has 1 spouse and 10 friend edges, the edge going to the spouse is more specific than one going to a friend.
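The text above only states that specificity is derived from the number of same-label edges adjacent to a node. As an illustration, the Python sketch below assumes the simplest formula consistent with that description, namely the inverse of that count; the actual formula used by GAM and P-GAM may differ.

```python
from collections import Counter


def specificity_by_label(adjacent_edge_labels):
    """Map each edge label around one node to an assumed specificity.

    Assumption: specificity = 1 / (number of adjacent edges with that label).
    """
    counts = Counter(adjacent_edge_labels)
    return {label: 1.0 / count for label, count in counts.items()}


# The example from the text: 1 spouse edge and 10 friend edges.
spec = specificity_by_label(["spouse"] + ["friend"] * 10)
most_specific = max(spec, key=spec.get)
print(most_specific, spec)
```

Under this assumption, Grow would prefer the single spouse edge over any of the ten friend edges.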

A Merge does not need more information than available in its input trees; instead, it requires specific run-time data structures, as we describe below.

Figure 4. Physical graph layout in memory.

In our memory layout, we split the data required for search from the rest, as the former are critical for performance; we refer to the latter as metadata. Figure 4 depicts the memory tables that we use. The Node table includes the ID of the data source where the node comes from, and references to each node’s: (i) representative, (ii) $K$ neighbors, if they exist (for a fixed $K$: static allocation), (iii) metadata, and (iv) other neighbors, if they exist (dynamic allocation). We separate the allocation of neighbors into static and dynamic, to keep $K$ neighbors in the main Node structure, while the rest are placed in a separate heap area, stored in the Node connections table. This way, we can allocate a fixed size to each Node, efficiently supporting the memory accesses of P-GAM. In our implementation, $K$ is a fixed constant; in general, it can be set based on the average degree of the graph vertices. The Node metadata table includes information about the type of each node (e.g., JSON, HTML, etc.) and its label, comprising the keywords that we use for searching the graph. The Edge table includes a reference to the source and the target node of every edge, the edge specificity, and a reference to the edge metadata. The Edge metadata table includes the type and the label of each edge. Finally, we use a keywordIndex, which is a hash-based map associating every node with its labels. P-GAM probes the keywordIndex when a query arrives to find the references to the Node table that match the query keywords and start the search from there. Among all the structures, only Node connections (singled out by a dark background in Figure 4) is in a dynamically allocated area; all the others are statically allocated.
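The split between fixed-size node records and a dynamically allocated overflow area can be mimicked in Python as below. This is only a structural sketch under stated assumptions: the actual engine is a C++ application, the value `K = 4` is arbitrary, neighbor slots are assumed to hold edge identifiers, the keyword index is assumed to split labels on whitespace, and all class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

K = 4  # assumption: number of neighbor slots stored inline in each Node


@dataclass
class Node:
    source_id: int
    metadata_ref: int
    representative: Optional[int] = None
    inline_neighbors: list = field(default_factory=lambda: [None] * K)


class GraphStore:
    def __init__(self):
        self.nodes = []             # Node table (fixed-size records)
        self.node_metadata = []     # (type, label) per node
        self.node_connections = {}  # overflow area: node id -> extra edge ids
        self.edges = []             # (source, target, specificity, metadata ref)
        self.edge_metadata = []     # (type, label) per edge
        self.keyword_index = {}     # keyword -> ids of nodes whose label has it

    def add_node(self, source_id, node_type, label):
        node_id = len(self.nodes)
        self.node_metadata.append((node_type, label))
        self.nodes.append(Node(source_id, metadata_ref=node_id))
        for word in label.lower().split():
            self.keyword_index.setdefault(word, []).append(node_id)
        return node_id

    def add_edge(self, source, target, edge_type, label, specificity=1.0):
        edge_id = len(self.edges)
        self.edge_metadata.append((edge_type, label))
        self.edges.append((source, target, specificity, edge_id))
        for node_id in {source, target}:
            self._attach(node_id, edge_id)
        return edge_id

    def _attach(self, node_id, edge_id):
        slots = self.nodes[node_id].inline_neighbors
        for i in range(K):
            if slots[i] is None:    # a free inline slot: static area
                slots[i] = edge_id
                return
        self.node_connections.setdefault(node_id, []).append(edge_id)

    def adjacent_edges(self, node_id):
        inline = [e for e in self.nodes[node_id].inline_neighbors
                  if e is not None]
        return inline + self.node_connections.get(node_id, [])


store = GraphStore()
hub = store.add_node(0, "XML", "Alice")
for i in range(6):
    leaf = store.add_node(0, "XML", "paper %d" % i)
    store.add_edge(hub, leaf, "data", "author")
print(store.adjacent_edges(hub), store.node_connections)
```

In this sketch, a node with six edges keeps four of them inline and spills the other two into the overflow area, so low-degree nodes never touch the dynamically allocated table.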

The above storage is row (node) oriented, even though column storage often speeds up greatly analytical processing; this is due to the nature of the keyword search problem, which requires traversing the graph from the nodes matching the keywords, in BFS style. Since we consider fully ad-hoc queries (any keyword combinations), there are no guarantees about the order of the nodes P-GAM visits. Therefore, in our setting, the vertically selective access patterns, which are optimally exploited by column-stores, do not apply. Instead, the crucial optimization here is to find the neighbors of every node fast. This is leveraged by our algorithm, as we explain below.

Input: graph G = (N, E), query Q = {w_1, ..., w_m}, maximum number of solutions, maximum time limit
Output: Answer trees for Q on G
1 pQueue_i ← new priority queue of (tree, edge) pairs, one for each worker thread i;
2 N_Q ← union of keywordIndex.lookup(w_j), for each w_j in Q;
3 for n ∈ N_Q, e edge adjacent to n do
4       push (n, e) on some pQueue (distribute equally)
5 end for
6 launch the P-GAM Worker (Algorithm 2) threads;
return solutions
Algorithm 1 P-GAM

5.2. P-GAM: parallel keyword query execution

Our P-GAM (Parallel GAM) query algorithm builds a set of data structures, which are exploited by concurrent workers (threads) to produce query answers. We split these data structures into shared and private to the workers. We start with the shared ones. The history data structure holds all trees built during the exploration, while treesByRoot gives access to all trees rooted in a certain node. As the search space is huge, the history and treesByRoot data structures grow very large. Specifically, for history, P-GAM first has to make sure that an intermediate AT has not been considered before (i.e., browse the history) before writing a new entry. Similarly, treesByRoot is updated only when a tree changes its root or if there is a Merge of two trees; however, it is probed several times for Merge candidates. Therefore, we have implemented these data structures as lock-free hash-based maps to ensure high concurrency and prioritize read accesses. Observe that, given the high degree of data sharing, keeping these data structures thread-private would not yield any benefit.

Moving to the thread-private data structures, each thread, say number $i$, has a priority queue pQueue_i, in which are pushed (tree, edge) pairs, such that the edge is adjacent to the root of the tree. Priority in this queue is determined as follows: we prefer the pairs whose nodes match most query keywords; to break a tie, we prefer smaller trees; and to break a possible tie among these, we prefer the pair where the edge has the highest specificity. This is a simple priority order we chose empirically; any other priority could be used, with no change to the algorithm.
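The three-level priority order described above maps naturally onto a tuple key in a binary heap. The Python sketch below is an illustration using the standard `heapq` module; the function names and the string placeholders standing for trees and edges are hypothetical.

```python
import heapq
import itertools

_sequence = itertools.count()  # final tie-breaker, so trees are never compared


def push(pqueue, matched_keywords, tree_size, specificity, tree, edge):
    """Push a (tree, edge) pair; heapq pops the smallest key first."""
    key = (-matched_keywords,  # more matched keywords first
           tree_size,          # then smaller trees
           -specificity,       # then higher edge specificity
           next(_sequence))
    heapq.heappush(pqueue, (key, tree, edge))


def pop(pqueue):
    """Return the highest-priority (tree, edge) pair."""
    _, tree, edge = heapq.heappop(pqueue)
    return tree, edge


pqueue = []
push(pqueue, 1, 3, 0.5, "tree_a", "edge_1")
push(pqueue, 2, 5, 0.1, "tree_b", "edge_2")
push(pqueue, 2, 5, 0.9, "tree_c", "edge_3")
push(pqueue, 2, 2, 0.1, "tree_d", "edge_4")
order = [pop(pqueue)[0] for _ in range(4)]
print(order)
```

Swapping in another priority order only requires changing how `key` is built, which is in line with the remark that the algorithm itself does not depend on this choice.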

P-GAM keyword search is outlined in Algorithm 1. It creates the shared structures and the worker threads (as many as the available computing hardware resources allow). The search starts by looking up the nodes matching at least one query keyword (line 2); we create a 1-node tree from each such node, and push it together with an adjacent edge (line 4), in one of the pQueue’s (distributing them in round-robin).

Next, worker threads run Algorithm 2 in parallel, until a global stop condition is met: time-out, or the maximum number of solutions has been reached, or all the queues are empty. Each worker repeatedly picks the highest-priority (tree, edge) pair on its queue (line 2), and applies Grow on it (line 3), leading to a 1-edge larger tree. Thus, the queue priority orders the possible Grow steps at a certain point during the search; it tends to lead to small solutions being found first, so that users are not surprised by the lack of a connection they expected (and which usually involves few links). If the Grow result tree had not been found before (this is determined from the history), the worker tries to Merge it with all compatible trees, found within treesByRoot (line 6). The Merge partners should match different (disjoint) keywords; this condition ensures minimality of the solution. Merge results are repeatedly Merge’d again; the thread switches back to Grow only when no new Merge on the same root is possible. Any newly created tree is checked and, if it matches all query keywords, added to the solution set (and not pushed in any queue). Finally, to balance the load among the workers, if one has exhausted its queue, it retrieves the highest-priority (tree, edge) pair from the queue with most entries, pushing the possible results in its own queue.

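1 repeat
2       pop (tree, edge), the highest-priority pair in this worker's pQueue (or, if it is empty, from the pQueue having the most entries);
3       apply Grow to (tree, edge), obtaining the Grow result tree;
4       if the Grow result is not in history then
5            for all edges adjacent to the root of the Grow result, push (Grow result, edge) in pQueue;
6            build all Merge results of the Grow result with a partner tree, where the partner is in treesByRoot.get(root of the Grow result) and matches keywords disjoint from those of the Grow result;
7            if a Merge result is not in history then
8                 recursively merge it with all suitable partners;
9                 add all the (new) Merge trees to history;
10                for each new Merge tree, and each edge adjacent to its root, push the (tree, edge) pair in pQueue;
11           end if
12      end if
13 until time-out, or the maximum number of solutions is found, or all pQueues are empty;
Algorithm 2 P-GAM Worker (thread number out of )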
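
To make the interplay of Grow, Merge, history and treesByRoot concrete, the following is a minimal single-threaded sketch in Python. It is not the authors' C++ engine: the data layout (edge dictionary, tuples for trees), the function name keyword_search and the use of tree size as the only priority are assumptions made for illustration, and all synchronization is omitted.

```python
import heapq
from itertools import count

def keyword_search(edges, labels, query, max_solutions=1000):
    """Single-threaded Grow/Merge sketch (hypothetical data layout).

    edges:  {edge_id: (u, v)}, undirected; parallel edges are allowed
    labels: {node: set of keywords matched by that node}
    query:  set of keywords; returns the edge sets of the trees found
    """
    query = frozenset(query)
    adjacent = {}
    for eid, (u, v) in edges.items():
        adjacent.setdefault(u, []).append(eid)
        adjacent.setdefault(v, []).append(eid)

    queue, tick = [], count()   # entries: (tree size, tie-breaker, tree, edge)
    history = set()             # (root, edge set) of every tree built so far
    trees_by_root = {}          # root -> non-solution trees rooted there
    solutions = set()

    def record(tree):
        """Store a new tree; return True only if it can be developed further."""
        root, tree_edges, nodes, kwds = tree
        if (root, tree_edges) in history:
            return False
        history.add((root, tree_edges))
        if kwds == query:
            solutions.add(tree_edges)       # solutions are not pushed
            return False
        trees_by_root.setdefault(root, []).append(tree)
        for eid in adjacent.get(root, []):
            heapq.heappush(queue, (len(tree_edges), next(tick), tree, eid))
        return True

    for node, kwds in labels.items():       # one 1-node tree per matching node
        matched = frozenset(kwds) & query
        if matched:
            record((node, frozenset(), frozenset([node]), matched))

    while queue and len(solutions) < max_solutions:
        _, _, (root, tree_edges, nodes, kwds), eid = heapq.heappop(queue)
        u, v = edges[eid]
        new_root = v if u == root else u
        if new_root in nodes:
            continue                        # Grow must reach a new node
        grown = (new_root, tree_edges | {eid}, nodes | {new_root},
                 kwds | (frozenset(labels.get(new_root, ())) & query))
        pending = [grown] if record(grown) else []
        while pending:                      # Merge, then re-Merge the results
            t = pending.pop()
            for p in list(trees_by_root[t[0]]):
                if p[3] & t[3] or (p[2] & t[2]) != {t[0]}:
                    continue                # keywords and nodes must be disjoint
                merged = (t[0], t[1] | p[1], t[2] | p[2], t[3] | p[3])
                if record(merged):
                    pending.append(merged)
    return solutions
```

On a three-node chain with two parallel edges per hop and one keyword at each end, this sketch returns the four two-edge trees. A parallel version would additionally need one queue per worker and thread-safe access to history and trees_by_root.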

As seen above, the threads intensely compete for access to history and treesByRoot. As we demonstrate in Section 6.3, our design allows excellent scalability as the number of threads increases.
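
The work distribution described above (round-robin seeding, then taking work from the fullest queue once a worker's own queue is empty) can be sketched as follows. The function names distribute and next_pair are hypothetical, a lower number stands for a higher priority, and the locking that a real multi-threaded engine needs is left out.

```python
import heapq

def distribute(seeds, n_workers):
    """Spread the initial (priority, item) pairs round-robin over one heap per worker."""
    queues = [[] for _ in range(n_workers)]
    for k, seed in enumerate(seeds):
        heapq.heappush(queues[k % n_workers], seed)
    return queues

def next_pair(queues, worker):
    """Pop from the worker's own heap; if it is empty, pop from the heap with most entries."""
    own = queues[worker]
    source = own if own else max(queues, key=len)
    return heapq.heappop(source) if source else None
```

For instance, with two workers and five seeds, worker 1 first drains its own two entries and then takes the best entry of worker 0's queue.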

6. Experimental Evaluation

We now present the results of our experimental evaluation. Section 6.1 presents the hardware and data used in our application. Then, Section 6.2 studies the impact of extraction policies (Section 4). Section 6.3 analyzes the scalability of the P-GAM algorithm, focusing on its interaction with the hardware, and demonstrates its significant gains with respect to GAM. Section 6.4 demonstrates P-GAM scalability on a large, real-world graph built for our CoI IJ application.

6.1. Hardware and Software Setup

We used a server with two 10-core Intel Xeon E5-2640 v4 (Broadwell) CPUs clocked at 2.4GHz, and 128GB of DRAM. We do not use Hyper-Threads, and we bind every CPU core to a single worker thread. As shown in Figure 2, ConnectionLens (90% Java, 10% Python) is used (Section 6.2) to construct a graph out of a set of data sources, and store it in PostgreSQL. Next in the processing pipeline, we migrate the graph to the novel in-memory graph engine previously described, which queries it using the P-GAM algorithm. The query engine is a NUMA-aware, multi-threaded C++ application.

6.2. The Impact of Extraction Policies

In this experiment, we loaded a set of PubMed XML bibliographic notices ( MB on disk). This dataset inspired an extraction policy stating that the text content of any PubMedArticle.Authors.Author.Name is a Person entity, and that extraction is skipped from the article and journal titles, as well as from the article keywords. NER is still applied on author affiliations (rich with Organization and Location entities), as well as on the CoIStatement elements, which are of crucial interest in our context.
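
A policy of this kind can be pictured as a lookup from element paths to actions. The sketch below is only an illustration under assumptions: the three skipped paths, the constant names and the extract_entities function are hypothetical, and only the PubMedArticle.Authors.Author.Name path comes from the text; the actual policy syntax of ConnectionLens is not shown here.

```python
SKIP, RUN_NER = "skip", "ner"

POLICY = {
    "PubMedArticle.Authors.Author.Name": "Person",   # path given in the text
    "PubMedArticle.ArticleTitle": SKIP,              # hypothetical path
    "PubMedArticle.Journal.Title": SKIP,             # hypothetical path
    "PubMedArticle.Keywords.Keyword": SKIP,          # hypothetical path
}

def extract_entities(path, text, ner):
    """Apply the policy to one text node; `ner` is any callable doing NER."""
    action = POLICY.get(path, RUN_NER)   # default: no policy entry, run NER
    if action == SKIP:
        return []                        # no extraction at all
    if action == RUN_NER:
        return ner(text)                 # costly extraction path
    return [(text, action)]              # whole text is one typed entity
```

Only paths without a policy entry reach the NER callable, which is where the savings come from.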

Setting       Total (s)  Extraction (s)  Storage (s)
No policy     1416       1199            136
Using policy  929        716             131
Table 1. Sample impact of an extraction policy.

Table 1 shows that our policy reduced the extraction time by about 40%, reducing the loading time by 34%. As a point of reference, we also noted the time to load (and index) the graph nodes and edges in PostgreSQL; extraction strongly dominates the total time, confirming the practical interest of application-driven policies.
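
The percentages can be rechecked directly from the values in Table 1; the short computation below is a sketch whose function name reduction_pct is our own.

```python
def reduction_pct(before, after):
    """Relative reduction, in percent, from `before` to `after`."""
    return 100.0 * (before - after) / before

extraction_gain = reduction_pct(1199, 716)   # extraction time, Table 1
total_gain = reduction_pct(1416, 929)        # total loading time, Table 1
share_before = 100.0 * 1199 / 1416           # extraction share, no policy
share_after = 100.0 * 716 / 929              # extraction share, with policy

print(round(extraction_gain, 1), round(total_gain, 1))   # 40.3 34.4
print(round(share_before, 1), round(share_after, 1))     # 84.7 77.1
```

Extraction thus accounts for roughly 85% of the loading time without the policy and 77% with it, which is what "strongly dominates" refers to.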

6.3. Scalability Analysis

The scalability analysis is performed on synthetic graphs, whose size and topology we can fully control. We focus on two aspects that impact scalability: (i) contention in concurrent access to data structures, and (ii) size of the graph (which impacts the search space). To analyze the behavior of P-GAM’s concurrent data structures, we use chain graphs, because they yield a large number of intermediate results, shared across threads, even for a small graph. This way, we can isolate the size of the graph from the size of the intermediate results.

Figure 5. Synthetic graphs: chain and star.

We use two shapes of graphs (each with 1 associated query), leading to very different search space sizes (Figure 5). In both graphs, all the kwd for are distinct keywords, as well as the labels of the node(s) where the keyword is shown; no other node label matches these keywords. Chain has edges; on it, {kwd, kwd} has solutions, since any two neighbor nodes can be connected by an or by a edge; further, partial (non-solution) trees are built, each containing one keyword plus a path growing toward (but not reaching) the other. Star has branches, each of which is a line of length ; at one extremity each line has a keyword kwd, , while at the other extremity, all lines have kwd. As explained in Section 3.1, these nodes are equivalent, one is designated their representative (in the Figure, the topmost one), and the others are connected to it through equivalence edges, shown in red. On this graph, the query {kwd, kwd, kwd} has exactly 1 solution which is the complete graph; there are partial trees.
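
As a small worked example of why chain graphs produce so many results, assume a chain whose two keyword nodes sit at its ends and whose adjacent nodes are each linked by two parallel edges: a solution picks one edge per hop, so a chain with n hops has 2^n solutions. The function names below are hypothetical, and reading the solution counts of Table 2 (4096 to 32768) as chains of 12 to 15 hops is our inference, not a statement from the text.

```python
from itertools import product

def chain_solutions(hops, parallel_edges=2):
    """List every way to link the two ends: one edge choice per hop."""
    return list(product(range(parallel_edges), repeat=hops))

def hops_for(solution_count):
    """Number of hops n such that 2**n equals the given solution count."""
    hops = solution_count.bit_length() - 1
    if 2 ** hops != solution_count:
        raise ValueError("not a power of two")
    return hops

print(len(chain_solutions(3)))                              # 8
print([hops_for(n) for n in (4096, 8192, 16384, 32768)])    # [12, 13, 14, 15]
```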

Graph  Solutions  P-GAM first (ms)  P-GAM total (s)  GAM first (ms)  GAM total (s)
chain  4096       2                 0.8              160             674.5
chain  8192       4                 3.8              203             900.0
chain  16384      4                 13.7             234             900.0
chain  32768      8                 53.2             315             900.0
star   1          233               0.6              4063            60.2
star   1          969               3.3              12580           243.9
star   1          2469              10.1             36261           900.0
star   1          5149              23.3             67984           900.0
star   1          9111              44.7             108960          900.0
Table 2. Single-thread P-GAM vs. GAM performance.

Single-thread P-GAM vs. GAM. We start by comparing P-GAM, using only 1 thread, with the (single-threaded) Java-based GAM, accessing graph edges from a PostgreSQL database. We ran the two algorithms on the synthetic graphs and queries, with a time-out of 15 minutes; both could stop earlier if they exhausted the search space. Table 2 shows: the number of solutions, the time (ms) until the first solution is found by P-GAM and its total running time (s), as well as the corresponding times for GAM (Java on Postgres). On these tiny graphs, both algorithms found all the expected solutions; however, even without parallelism, P-GAM is to more than faster. In particular, on all but the 3 smallest graphs, GAM did not exhaust its search space in 15 minutes. This experiment validates the expected orders of magnitude speed-up of a carefully designed in-memory implementation, even without parallelism (since we restricted P-GAM to 1 thread).
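
The per-graph ratio between the two total running times can be recomputed from Table 2, as sketched below. Wherever GAM reached the 900 s time-out, the ratio is only a lower bound on the real speed-up; the names ROWS and speedups are our own.

```python
TIMEOUT_S = 900.0   # the 15-minute time-out stated in the text

# (graph, P-GAM total time in s, GAM total time in s), copied from Table 2
ROWS = [
    ("chain", 0.8, 674.5), ("chain", 3.8, 900.0),
    ("chain", 13.7, 900.0), ("chain", 53.2, 900.0),
    ("star", 0.6, 60.2), ("star", 3.3, 243.9),
    ("star", 10.1, 900.0), ("star", 23.3, 900.0), ("star", 44.7, 900.0),
]

def speedups(rows, timeout=TIMEOUT_S):
    """Return (graph, GAM/P-GAM time ratio, is_lower_bound) per row."""
    return [(g, gam / pgam, gam >= timeout) for g, pgam, gam in rows]

for graph, ratio, lower_bound in speedups(ROWS):
    print(f"{graph}: {'>=' if lower_bound else ''}{ratio:.0f}x")
```

Six of the nine ratios are lower bounds, matching the remark that GAM exhausted its search space only on the three smallest graphs.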

Figure 6. P-GAM scaling on chain graphs.

Parallel P-GAM. Next, on the graphs chain for , we report the exhaustive search time (Figure 6) for query {kwd, kwd} as we increase the number of worker threads from 1 to 20. We see a clear speedup as the number of threads increases, which is on average 13x for the graph sizes that we report. The speedup is not linear because, as the size of the intermediate results grows, it exceeds the size of the CPU caches, while threads need to access them at every iteration. Our profiling revealed that, as several threads access the shared data structures, they evict content from the CPU cache that would be useful to other threads. In contrast, we did not observe overheads from our synchronization mechanisms.

Figure 7. P-GAM scaling on star graphs.

To study the scalability of the algorithm with the graph size, we use for and the query {kwd, kwd, kwd, kwd}. Figure 7 shows the exhaustive search time of P-GAM on these graphs of up to nodes, using to threads. We obtain an average speed-up of with threads, regardless of the size of the graph, which shows that P-GAM scales well for different graph models and graph sizes. After profiling, we observed that the size of the intermediate results impacts the performance, similar to the previous case of the chain graph.

In the above star experiments, we used up to threads since the graph has a symmetry of (however, threads share the work with no knowledge of the graph structure). When keyword matches are poorly connected, e.g., at the end of simple paths, as in our star graphs, P-GAM search starts by exploring these paths, moving farther away from each keyword; if nodes match query keywords, up to threads can share this work. In contrast, as soon as these explored paths intersect, Grow and Merge create many opportunities that can be exploited by one thread or another. On chain, the presence of 2 edges between any adjacent nodes multiplies the Grow and Merge opportunities, work which can be shared by many threads. This is why on chain, we see scalability up to 32 worker threads, which is the maximum that our server supports.

6.4. P-GAM in Conflict of Interest Application

We now describe experiments on actual application data.

The graph. We selected sources based on S. Horel’s expertise and suggestions, as follows. (i) We loaded 400,000 PubMed bibliographic notices (XML), corresponding to articles from 2019 and 2020; they occupy 803 MB on disk. We used the same extraction policy as in Table 1 to perform only the necessary extraction. (ii) We downloaded 85,400 PDF articles corresponding to these notices (those that were available in Open Access), transformed them into JSON using an extraction script we developed, and preserved only those paragraphs starting with one of a set of keywords (“Disclosure”, “Competing Interest”, “Acknowledgments”, etc.), which have been shown (17) to encode potentially interesting participations of people (other than authors) and organizations in an article. Together, these JSON fragments occupy 173 MB on disk. The JSON and the XML content from the same paper are connected (at least) through the URI of that paper, as shown in Figure 1. (iii) We crawled 375 HTML Web pages from a set of Web sites describing people and organizations previously involved in scientific expertise on sensitive topics (such as tobacco or endocrine disruptors), specifically: www.desmogblog.com, tobaccotactics.org, www.wikicorporates.org and www.sourcewatch.org. These pages total 31.97 MB. Table 3 shows the numbers of nodes, of edges, and of Person, Organization and Location entities, split by data model and overall.

Source  Nodes       Edges       Person     Organization  Location
XML     32,028,429  19,851,904  1,483,631  584,734       126,629
JSON    1,025,307   432,303     75,297     7,320         4,139
HTML    246,636     185,479     3,726      7,227         320
Total   33,300,372  20,469,686  1,562,654  665,167       131,088
Table 3. Statistics on Conflict of Interest application graph.
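
The filtering of PDF-derived paragraphs described above can be sketched as a prefix test. This is not the authors' extraction script: the marker list is incomplete (the text ends it with "etc."), and the function names and JSON field names (uri, paragraphs) are hypothetical.

```python
import json

# Markers named in the text; the original list is longer.
MARKERS = ("Disclosure", "Competing Interest", "Acknowledgments")

def keep_paragraphs(paragraphs, markers=MARKERS):
    """Keep only the paragraphs that start with one of the markers."""
    return [p for p in paragraphs if p.lstrip().startswith(markers)]

def to_json_fragment(uri, paragraphs):
    """Serialize the kept paragraphs together with the paper URI."""
    return json.dumps({"uri": uri, "paragraphs": keep_paragraphs(paragraphs)})
```

Keeping the paper URI in each fragment is what allows the JSON content to be connected to the XML notice of the same paper.
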

Query  Keywords         First (ms)  Last (ms)  Total (ms)  Solutions  Data sources
1      A1, A2           4462        5315       5316        1000       2-10, 6
2      A3, H1           4671        5140       5140        1000       3-7, 6
3      U1, H1           4832        4981       4981        1000       2-5, 5
4      A4, I1           8520        13711      13712       1000       2-5, 5
5      A5, I2           5800        6366       6366        1000       2-8, 8
6      A6, I3, P1       4657        5072       60000       16         4, 4
7      A7, I3, P2       44256       44273      60000       10         5, 5
8      A8, I4, P3       12560       12560      60000       2          5, 5
9      A9, I4, P3       28982       33435      60000       3          5, 5
10     A10, U1, I3      7577        17383      17383       1000       4-6, 6
11     A11, I4, I5      10396       32320      60000       6          3, 3
12     A12, I4, I6      7320        7467       60000       24         4, 4
13     A3, A13, U2, P4  15759       35025      60000       5          5-6, 8, 6,8
14     A3, A14, U3, G1  10711       10711      60000       1          7, 7
15     A3, A15, U4, P4  8560        9942       60000       16         9, 9
Table 4. P-GAM performance on CoI real-world graph.

Querying the graph. Table 4 shows the results of executing 15 queries, until 1000 solutions or for at most 1 minute, using P-GAM. From left to right, the columns show: the query number, the query keywords, the time until the first solution is found, the time until the last solution is found, the total running time, the number of solutions found, and some statistics on the number of data sources participating in the solutions found (see below). All times are in milliseconds. We have anonymized the keywords that we use, so as not to single out individuals or corporations, and since the queries were selected aiming not at them, but at a large variety of P-GAM behavior. We use the following codes: A for author, G for government service, H for hospital, P for country, U for university, and I for industry (company). A value of the form “2-10, 6” means that P-GAM found solutions spanning at least 2 and at most 10 data sources, while most solutions spanned 6 sources.
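
Which of the two stop conditions ended each query can be read off Table 4, as in the sketch below; the names RUNS and stop_reason are our own, and the data are copied from the table.

```python
MAX_SOLUTIONS = 1000   # solution cap stated in the text
TIMEOUT_MS = 60000     # 1-minute time-out stated in the text

# (query number, total running time in ms, solutions found), from Table 4
RUNS = [(1, 5316, 1000), (2, 5140, 1000), (3, 4981, 1000), (4, 13712, 1000),
        (5, 6366, 1000), (6, 60000, 16), (7, 60000, 10), (8, 60000, 2),
        (9, 60000, 3), (10, 17383, 1000), (11, 60000, 6), (12, 60000, 24),
        (13, 60000, 5), (14, 60000, 1), (15, 60000, 16)]

def stop_reason(total_ms, solutions):
    """Name the condition that ended a run."""
    if solutions >= MAX_SOLUTIONS:
        return "solution cap"
    if total_ms >= TIMEOUT_MS:
        return "time-out"
    return "search space exhausted"

capped = [q for q, ms, n in RUNS if stop_reason(ms, n) == "solution cap"]
timed_out = [q for q, ms, n in RUNS if stop_reason(ms, n) == "time-out"]
```

Under this reading, queries 1 to 5 and 10 stop at the solution cap, while the other nine run until the time-out.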

We make several observations based on the results. The stop conditions were set here based on what we consider an interactive query response time, and on a number of solutions that allows further exploration by the users (e.g., through an interactive GUI we developed). Further, solutions span several datasets, demonstrating the interest of multi-dataset search, and that P-GAM exploits this possibility. Finally, the queries include different numbers of keywords, and the system remains responsive within the same time bounds despite the increasing query complexity.

7. Related Work and Conclusion

In this paper, we presented a complete pipeline for managing heterogeneous data for IJ applications. This innovates upon recent work (Anadiotis et al., 2020a), where we addressed the problems of integrating such data in a graph and querying it, as follows: (i) we present a complete data science application with clear societal impact, (ii) we show how extraction policies improve the graph construction performance, and (iii) we introduce a parallel search algorithm which scales across different graph models and sizes. Below, we discuss prior work most relevant with respect to the contributions we made here; more elements of comparison can be found in (Anadiotis et al., 2020a).

Our work falls into the data integration area (Doan et al., 2012); our IJ pipeline starts by ingesting data into an integrated data repository, deployed in PostgreSQL. The first platform we proposed to Le Monde journalists was a mediator (Bonaque et al., 2016), resembling polystores, e.g., (Duggan et al., 2015; Kolev et al., 2016). However, we found that: (i) their datasets are changing, text-rich and schema-less, (ii) running a set of data stores (plus a mediator) was not feasible for them, (iii) knowledge of a schema or the capacity to devise integration plans was lacking. ConnectionLens’ first iteration (Chanial et al., 2018) lifted (iii) by introducing keyword search, but it still kept part of the graph virtual, and split keyword queries into subqueries sent to sources. Consolidating the graph in a single store, and the centralized GAM algorithm (Anadiotis et al., 2020a) greatly sped up and simplified the tool, whose performance we again improve here. We share the goal of exploring and connecting data with data discovery methods (Sarma et al., 2012; Fernandez et al., 2018a, b; Ota et al., 2020), which have mostly focused on tabular data. While our data is heterogeneous, focusing on an IJ application partially eliminates risks of ambiguity, since in our context, one person or organization name typically denotes a single concept.

Keyword search has been studied in XML (Guo et al., 2003; Liu and Chen, 2007), graphs (from where we borrowed Grow and Merge operations for GAM) (Ding et al., 2007; He et al., 2007), and in particular RDF graphs (Elbassuoni and Blanco, 2011; Le et al., 2014). However, our keyword search problem is harder in several aspects: (i) we make no assumption on the shape and regularity of the graph; (ii) we allow answer trees to explore edges in both directions; (iii) we make no assumption on the score function, invalidating Dynamic Programming (DP) methods such as (Liu and Chen, 2007) and other similar prunings. In particular, we show in (Anadiotis et al., 2020b) that edges with a confidence lower than 1, such as similarity and extraction edges in our graphs, compromise, for any “reasonable” score function which reflects these confidences, the optimal substructure property at the core of DP. Works on parallel keyword search in graphs either consider a different setting, returning a certain class of subgraphs instead of trees (Yang et al., 2019) or standard graph traversal algorithms like BFS (Hong et al., 2011; Dhulipala et al., 2017; Leiserson and Schardl, 2010). To the best of our knowledge, GAM is the first keyword search algorithm for the specific problem that we consider in this paper. Accordingly, in this paper we have parallelized GAM, into P-GAM, by drawing inspiration and addressing common challenges raised in graph processing systems in the literature, in particular concerning the CPU efficiency while interacting with the main memory (Malicevic et al., 2017; Ahn et al., 2015; Elyasi et al., 2019; Roy et al., 2013; Hong et al., 2015).

Our future work includes: building a unified CoI repository based on more biomedical sources, enhancing our in-memory query processor, and querying the graph using natural language.

Acknowledgments. The authors thank M. Ferrer and the Décodeurs team (Le Monde) for introducing us, and for many insightful discussions.


  • J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi (2015) A scalable processing-in-memory accelerator for parallel graph processing. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 13-17, 2015, D. T. Marr and D. H. Albonesi (Eds.), pp. 105–117.
  • R. Alotaibi, D. Bursztyn, A. Deutsch, I. Manolescu, and S. Zampetakis (2019) Towards scalable hybrid stores: constraint-based rewriting to the rescue. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, P. A. Boncz, S. Manegold, A. Ailamaki, A. Deshpande, and T. Kraska (Eds.), pp. 1660–1677.
  • A. G. Anadiotis, O. Balalau, C. Conceição, H. Galhardas, M. Y. Haddad, I. Manolescu, T. Merabti, and J. You (2020a) Graph integration of structured, semistructured and unstructured data for data journalism. CoRR abs/2012.08830. Note: Currently under evaluation as an invited journal paper.
  • A. G. Anadiotis, M. Y. Haddad, and I. Manolescu (2020b) Graph-based keyword search in heterogeneous data sources. In Bases de Données Avancés (informal publication). arXiv:2009.04283.
  • R. Bonaque, T. D. Cao, B. Cautis, F. Goasdoué, J. Letelier, I. Manolescu, O. Mendoza, S. Ribeiro, X. Tannier, and M. Thomazo (2016) Mixed-instance querying: a lightweight integration architecture for data journalism. Proc. VLDB Endow. 9 (13), pp. 1513–1516.
  • P. A. Boncz, S. Manegold, and M. L. Kersten (1999) Database architecture optimized for the new bottleneck: memory access. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB ’99, San Francisco, CA, USA, pp. 54–65. ISBN 1558606157.
  • M. Buron, F. Goasdoué, I. Manolescu, and M. Mugnier (2020) Obi-wan: ontology-based RDF integration of heterogeneous data. Proc. VLDB Endow. 13 (12), pp. 2933–2936.
  • D. Calvanese, G. D. Giacomo, M. Lenzerini, D. Lembo, A. Poggi, and R. Rosati (2007) MASTRO-I: efficient integration of relational data through DL ontologies. In DL Workshop, CEUR Workshop Proceedings, Vol. 250.
  • C. Chanial, R. Dziri, H. Galhardas, J. Leblay, M. L. Nguyen, and I. Manolescu (2018) ConnectionLens: finding connections across heterogeneous data sources. Proc. VLDB Endow. 11 (12), pp. 2030–2033.
  • C. Christodoulakis, E. Munson, M. Gabel, A. D. Brown, and R. J. Miller (2020) Pytheas: pattern-based table discovery in CSV files. Proc. VLDB Endow. 13 (11), pp. 2075–2089.
  • L. Dhulipala, G. E. Blelloch, and J. Shun (2017) Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2017, Washington DC, USA, July 24-26, 2017, C. Scheideler and M. T. Hajiaghayi (Eds.), pp. 293–304.
  • B. Ding, J. X. Yu, S. Wang, L. Qin, X. Zhang, and X. Lin (2007) Finding top-k min-cost connected trees in databases. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, R. Chirkova, A. Dogac, M. T. Özsu, and T. K. Sellis (Eds.), pp. 836–845.
  • A. Doan, A. Y. Halevy, and Z. G. Ives (2012) Principles of data integration. Morgan Kaufmann. ISBN 978-0-12-416044-6.
  • J. Duggan, A. J. Elmore, M. Stonebraker, M. Balazinska, B. Howe, J. Kepner, S. Madden, D. Maier, T. Mattson, and S. B. Zdonik (2015) The BigDAWG polystore system. SIGMOD.
  • S. Elbassuoni and R. Blanco (2011) Keyword search over RDF graphs. In Proceedings of the 20th ACM Conference on Information and Knowledge Management, CIKM 2011, Glasgow, United Kingdom, October 24-28, 2011, C. Macdonald, I. Ounis, and I. Ruthven (Eds.), pp. 237–242.
  • N. Elyasi, C. Choi, and A. Sivasubramaniam (2019) Large-scale graph processing on emerging storage devices. In 17th USENIX Conference on File and Storage Technologies, FAST 2019, Boston, MA, February 25-28, 2019, A. Merchant and H. Weatherspoon (Eds.), pp. 309–316.
  • [17] (2018) European Press Prize: the Monsanto Papers. European Press Prize.
  • R. C. Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, and M. Stonebraker (2018a) Aurum: A data discovery system. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pp. 1001–1012.
  • R. C. Fernandez, E. Mansour, A. A. Qahtan, A. K. Elmagarmid, I. F. Ilyas, S. Madden, M. Ouzzani, M. Stonebraker, and N. Tang (2018b) Seeping semantics: linking datasets using word embeddings for data discovery. In 34th IEEE International Conference on Data Engineering, ICDE 2018, Paris, France, April 16-19, 2018, pp. 989–1000.
  • L. Guo, F. Shao, C. Botev, and J. Shanmugasundaram (2003) XRANK: ranked keyword search over XML documents. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003, A. Y. Halevy, Z. G. Ives, and A. Doan (Eds.), pp. 16–27.
  • H. He, H. Wang, J. Yang, and P. S. Yu (2007) BLINKS: ranked keyword searches on graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, C. Y. Chan, B. C. Ooi, and A. Zhou (Eds.), pp. 305–316.
  • S. Hong, S. Depner, T. Manhardt, J. V. D. Lugt, M. Verstraaten, and H. Chafi (2015) PGX.D: a fast distributed graph processing engine. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2015, Austin, TX, USA, November 15-20, 2015, J. Kern and J. S. Vetter (Eds.), pp. 58:1–58:12.
  • S. Hong, T. Oguntebi, and K. Olukotun (2011) Efficient parallel graph exploration on multi-core CPU and GPU. In 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT 2011, Galveston, TX, USA, October 10-14, 2011, L. Rauchwerger and V. Sarkar (Eds.), pp. 78–88.
  • S. Horel (2018) Lobbytomie. La Découverte. In French. ISBN 2707194123.
  • S. Horel (2020) Petites ficelles et grandes manoeuvres de l’industrie du tabac pour réhabiliter la nicotine. In French.
  • B. Kolev, P. Valduriez, C. Bondiombouy, R. Jiménez-Peris, R. Pau, and J. Pereira (2016) CloudMdsQL: querying heterogeneous cloud data stores with a common language. Distributed Parallel Databases 34 (4), pp. 463–503.
  • W. Le, F. Li, A. Kementsietsidis, and S. Duan (2014) Scalable keyword search on large RDF data. IEEE Trans. Knowl. Data Eng. 26 (11), pp. 2774–2788.
  • C. E. Leiserson and T. B. Schardl (2010) A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers). In Proceedings of the Twenty-Second Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’10, New York, NY, USA, pp. 303–314. ISBN 9781450300797.
  • Z. Liu and Y. Chen (2007) Identifying meaningful return information for XML keyword search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Beijing, China, June 12-14, 2007, C. Y. Chan, B. C. Ooi, and A. Zhou (Eds.), pp. 329–340.
  • J. Malicevic, B. Lepers, and W. Zwaenepoel (2017) Everything you always wanted to know about multicore graph processing but were afraid to ask. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12-14, 2017, D. D. Silva and B. Ford (Eds.), pp. 631–643.
  • F. Nargesian, K. Q. Pu, E. Zhu, B. G. Bashardoost, and R. J. Miller (2020) Organizing data lakes for navigation. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, D. Maier, R. Pottinger, A. Doan, W. Tan, A. Alawini, and H. Q. Ngo (Eds.), pp. 1939–1950.
  • [32] (2013) Offshore leaks. ICIJ.
  • N. Oreskes and E. Conway (2012) Merchants of doubt. Bloomsbury Publishing. ISBN 1408824833.
  • M. Ota, H. Mueller, J. Freire, and D. Srivastava (2020) Data-driven domain discovery for structured datasets. Proc. VLDB Endow. 13 (7), pp. 953–965.
  • [35] (2010) Physician payments sunshine act. Wikipedia.
  • A. Quamar, J. Straube, and Y. Tian (2020) Enabling rich queries over heterogeneous data from diverse sources in healthcare. In CIDR 2020, 10th Conference on Innovative Data Systems Research, Amsterdam, The Netherlands, January 12-15, 2020, Online Proceedings.
  • A. Roy, I. Mihailovic, and W. Zwaenepoel (2013) X-stream: edge-centric graph processing using streaming partitions. In ACM SIGOPS 24th Symposium on Operating Systems Principles, SOSP ’13, Farmington, PA, USA, November 3-6, 2013, M. Kaminsky and M. Dahlin (Eds.), pp. 472–488.
  • A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu (2012) Finding related tables. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20-24, 2012, K. S. Candan, Y. Chen, R. T. Snodgrass, L. Gravano, and A. Fuxman (Eds.), pp. 817–828.
  • [39] Amazon Web Services. Memory optimized instances. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/memory-optimized-instances.html Last accessed: 2021-01-25.
  • Y. Yang, D. Agrawal, H. V. Jagadish, A. K. H. Tung, and S. Wu (2019) An efficient parallel keyword search engine on knowledge graphs. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8-11, 2019, pp. 338–349.