Interactive graph query language for multidimensional data in Collaboration Spotting visual analytics framework

12/12/2017 ∙ by Adam Agocs, et al. ∙ CERN

Human reasoning in visual analytics of data networks relies mainly on the quality of visual perception and on the capability to explore the data interactively from different facets. Visual quality strongly depends on the size and dimensional complexity of the networks, while exploration capability depends on the intuitiveness and expressiveness of the user frontends. This paper addresses both issues by decomposing data networks into multiple networks of smaller dimensions and by building an interactive graph query language that supports full navigation across the sub-networks. Within sub-networks of reduced dimensionality, structural abstraction and semantic techniques can then be used to further enhance visual perception.


1 Related Work

Related work is twofold, since the approach combines visual analytics techniques with graph query languages. Over the last 15 years, many visual analytics articles have described how multidimensional data can be transformed into node-link diagrams [38, 24].

Many articles address coordinated multiple views, introducing a visual analytics paradigm supported by an interactive query language or by a set of operations. These articles can be divided into the following groups:

  • OLAP-inspired [15] paradigms that use operations such as slice, roll-up and dice. The most relevant papers are PivotGraph [39], ScatterDice [10] (and GraphDice [4]), MatrixCube [1] and Orion [19].

  • Relational algebra-related solutions such as Cross-filtered views [40], which use grouping, filtering, projection and selection operations; Polaris [37], which introduces its own algebra and maps it to SQL; and Ploceus [28], which works with a first-order logic language.

  • Other solutions such as Cross-filtered views with a hypergraph query language [34], and JUNG [31] and Gephi [3], which allow users to work in other programming languages (Java in these cases).

The literature on graph query languages is extensive [22, 17, 21, 16, 26, 2, 32, 44]. It covers different graph models, reflecting the variety of requirements for applications and languages.

The visual analytics model introduced in this paper promotes a different approach to graph query language. The language operates on a directed, labelled graph that is managed via user interactions treated as query inputs and follows the semantic web query language concept, SPARQL [18]. This approach allows users to generate graph patterns and evaluate them directly on the graph.

2 Basic graph and views

Let graph $G$ be a directed, labelled graph defined as a four-element tuple $G = (V, E, L, l)$ where $V$ represents a set of vertices and $E \subseteq V \times V$ a set of edges, defined as a subset of the Cartesian product of these vertices. $L$ is a set of vertex labels and $l : V \to L$ is a mapping function from vertices to the corresponding labels. Figure 1 shows an example of such a graph.

Figure 1: Example of graph where and
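As a minimal sketch of the four-element tuple described above, the graph can be held in a small Python structure. The class name `LabelledGraph`, its field names and the toy vertices are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledGraph:
    """Directed, labelled graph: vertices, edges, labels and a vertex-to-label map."""
    vertices: frozenset
    edges: frozenset      # set of (source, target) pairs
    labels: frozenset
    label_of: dict        # maps every vertex to exactly one label

    def __post_init__(self):
        # Edges must be a subset of the Cartesian product of the vertices.
        for source, target in self.edges:
            if source not in self.vertices or target not in self.vertices:
                raise ValueError(f"edge ({source}, {target}) uses an unknown vertex")
        # The labelling must cover every vertex and only use declared labels.
        for vertex in self.vertices:
            if self.label_of.get(vertex) not in self.labels:
                raise ValueError(f"vertex {vertex} has no valid label")


# Hypothetical toy data: two published items and the organisations they mention.
toy = LabelledGraph(
    vertices=frozenset({"p1", "p2", "cern", "mit"}),
    edges=frozenset({("p1", "cern"), ("p2", "cern"), ("p2", "mit")}),
    labels=frozenset({"Published Item", "Organisation"}),
    label_of={"p1": "Published Item", "p2": "Published Item",
              "cern": "Organisation", "mit": "Organisation"},
)
```

The validation in `__post_init__` only enforces the two structural constraints stated in the definition; anything beyond that is left to the caller.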

We define the reachability graph over graph $G$ as the graph whose vertices are the labels of $G$ and whose edge set is a subset of the Cartesian product of the labels: any two vertices of the reachability graph are connected if and only if there exist two connected vertices in $G$ whose respective labels correspond to these two vertices. The reachability graph is a description of $G$; it is also called the graph schema of $G$. The graph schema helps users view $G$ via different sub-graphs of lesser dimensionality, using the labels of $G$ as dimensions, and facilitates the generation of approximately optimal user-defined graph queries. Let graph be a graph pattern where and . To process the answer to a graph query, one needs to find all possible isomorphic subgraphs of $G$ that are homomorphic to the graph pattern corresponding to the query. This is a graph pattern matching problem, a well-known problem in mathematics [13]. In this case, one defines a graph, a subgraph of $G$, as a sample matching the graph pattern if and only if:

  • ,

  • .
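The reachability graph (graph schema) described above can be derived mechanically from the edges and the label mapping. The following is a minimal sketch; the function name `reachability_graph` and the toy vertices are assumptions made for illustration only.

```python
def reachability_graph(edges, label_of):
    """Return the schema edges: one (label_u, label_v) pair per labelled edge (u, v).

    Two labels are connected if and only if at least one pair of connected
    vertices in the underlying graph carries these two labels.
    """
    return {(label_of[source], label_of[target]) for source, target in edges}


# Hypothetical toy graph: a technology search, two published items,
# two organisations and one city.
edges = {
    ("t1", "p1"), ("p1", "cern"), ("p2", "cern"), ("p2", "mit"), ("cern", "geneva"),
}
label_of = {
    "t1": "Technology",
    "p1": "Published Item", "p2": "Published Item",
    "cern": "Organisation", "mit": "Organisation",
    "geneva": "City",
}

schema = reachability_graph(edges, label_of)
```

Because the result is a set, the many edges between published items and organisations collapse into a single schema edge, which is what makes the schema much smaller than the data graph.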

The answer to a graph query is a view containing the set of subgraphs of matching . To build such a view, one needs first to introduce the graph pairing function and the set . Let and be two graph patterns. These graph patterns are paired iff

  • and

  • .

Where a path is an alternating non-empty sequence of vertices and edges, starting and ending with vertices and requiring that all edges and vertices be distinct from one another. indicates that all edges of this path are in set . The function is defined as

And , the set of these pairs is defined as .

A view of graph is defined as a six-element tuple where

  • ,

  • ,

  • and ,

  • ,

  • ,

  • .

The use of multiple graph patterns for the construction of graph is required since the cardinality of set and set is not necessarily equal to 1 (see details in Section 3.1). To ease the reading, graph is noted to refer directly to the set of labels used in the construction of the view. Also, in practice, we use an aggregation function on the edges and on the vertices of graph to determine their respective weights (for instance, the number of elements) instead of the elements in set . Figure 2 shows an example of a view.

Figure 2: Example of a view where . The two graph patterns are and pair(
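To illustrate the aggregation of weights mentioned above, the sketch below builds a simplified view in which the weight of a vertex or an edge is the number of interconnection vertices behind it. It assumes undirected edges and a single interconnection label (as in the limitation described in Section 5.3.1); the function name `build_view` and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_view(items_to_vertices):
    """Aggregate a view from a mapping {interconnection vertex: viewed vertices}.

    The weight of a viewed vertex is the number of interconnection vertices
    linked to it; the weight of an edge is the number they have in common.
    """
    vertex_weight = Counter()
    edge_weight = Counter()
    for vertices in items_to_vertices.values():
        for vertex in vertices:
            vertex_weight[vertex] += 1
        # Sorting gives each undirected edge a single canonical key.
        for first, second in combinations(sorted(vertices), 2):
            edge_weight[(first, second)] += 1
    return vertex_weight, edge_weight


# Hypothetical data: published items and the organisations attached to them.
links = {
    "p1": {"cern"},
    "p2": {"cern", "mit"},
    "p3": {"cern", "mit", "epfl"},
}
vertex_weight, edge_weight = build_view(links)
```

Counting is only one possible aggregation function; any other reduction over the underlying elements could be substituted in the two `+= 1` lines.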

3 Graph creation from user interactions

In this section, we introduce how the graph patterns and views can be created as a result of the following user interactions:

  • Selection of different nodes in the current view,

  • Removal of all vertices with the same label selected in one of the previous views,

  • Navigation from one view to another.

Users can modify set and set when performing any of the above interactions. Let be the set of vertices corresponding to a user selection, we define from :

  1. which contains the labels of nodes in set and,

  2. with , a subset of set , restricted to vertices having their respective labels in set .

In order for set to operate as a filter, the matched sample definition of Section 2 has to be restricted by requiring that . Example 1 below shows the content of for user selection from the graph depicted in Figure 1.

Example 1
(1)
(2)
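The two quantities derived from a user selection, the set of selected labels and the selected vertices grouped by label, can be sketched as follows. The exact restriction on matched samples is not recoverable from the text here, so `passes_filter` encodes one plausible reading (a sample is kept only if each of its vertices carrying a filtered label was selected). All names and data are assumptions.

```python
def selection_to_filter(selected, label_of):
    """Split a user selection into its label set and its vertices grouped by label."""
    by_label = {}
    for vertex in selected:
        by_label.setdefault(label_of[vertex], set()).add(vertex)
    return set(by_label), by_label


def passes_filter(sample_vertices, by_label, label_of):
    """Assumed restriction: vertices with a filtered label must have been selected."""
    return all(
        vertex in by_label[label_of[vertex]]
        for vertex in sample_vertices
        if label_of[vertex] in by_label
    )


# Hypothetical labelled vertices and a selection of two organisations.
label_of = {
    "p2": "Published Item", "p4": "Published Item",
    "cern": "Organisation", "mit": "Organisation", "epfl": "Organisation",
}
labels, by_label = selection_to_filter({"cern", "mit"}, label_of)
```

Vertices whose label is not part of the selection (here the published items) are left unconstrained, so the filter narrows only the dimensions the user has actually touched.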

3.1 Graph pattern construction

This section shows how to construct a graph pattern with set containing all the labels of vertices in set . We exploit the fact that graph patterns are actually only needed when constructing edges in and their respective weights. A pair of graph patterns is required for each combination of labels in set and set since the path directions between vertices from set and set differ, owing to the construction of edges between vertices of and vertices of . Each pattern has to satisfy the following criteria:

  • It must be a connected and directed graph,

  • It must be minimal,

  • Labels from set can be used as intermediate vertices in the pattern.

These requirements exactly fit the Steiner minimal tree problem [23], which is known to be NP-complete [14] and for which we use a minimum spanning tree solver as an approximation algorithm. Algorithm 1 describes the full process of pair generation. Figure 3 shows the graph schema of the graph depicted in Figure 1 and the generated patterns.

Algorithm 1: Pattern generator algorithm. The function PatternGenerator() consists of two nested while loops followed by a return.

Figure 3: On the left-hand side, the graph schema of the graph in the example of Figure 1. In the middle and on the right-hand side, an example of a graph pattern pair for set F (Example 1), with and
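The text does not say which spanning-tree-based variant is used to approximate the Steiner minimal tree. The sketch below shows one common choice, stated here as an assumption: build the shortest paths between the required labels (the metric closure), take a minimum spanning tree over them with Kruskal's method, expand the chosen paths back into schema edges, and prune superfluous leaves. The schema is treated as undirected with unit edge costs, and the toy schema and all names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(adjacency, source, target):
    """Breadth-first shortest path in an unweighted graph, as a list of vertices."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        vertex = queue.popleft()
        if vertex == target:
            break
        for neighbour in sorted(adjacency[vertex]):
            if neighbour not in previous:
                previous[neighbour] = vertex
                queue.append(neighbour)
    if target not in previous:
        return None
    path = [target]
    while previous[path[-1]] is not None:
        path.append(previous[path[-1]])
    return path[::-1]


def steiner_tree_approx(adjacency, terminals):
    """Approximate a Steiner tree connecting `terminals`; returns a set of edges."""
    terminals = sorted(terminals)
    closure = {}
    for first, second in combinations(terminals, 2):
        path = shortest_path(adjacency, first, second)
        if path is None:
            raise ValueError(f"{first!r} and {second!r} are not connected")
        closure[(first, second)] = path

    # Kruskal over the metric closure, shortest paths first.
    component = {terminal: terminal for terminal in terminals}

    def find(vertex):
        while component[vertex] != vertex:
            vertex = component[vertex]
        return vertex

    expanded = {}  # adjacency of the union of the chosen paths
    for (first, second), path in sorted(closure.items(),
                                        key=lambda item: (len(item[1]), item[0])):
        root_first, root_second = find(first), find(second)
        if root_first == root_second:
            continue
        component[root_first] = root_second
        for left, right in zip(path, path[1:]):
            expanded.setdefault(left, set()).add(right)
            expanded.setdefault(right, set()).add(left)
    if not expanded:
        return set()

    # Spanning tree of the expanded subgraph removes any cycle created by overlaps.
    seen, tree, queue = {terminals[0]}, set(), deque([terminals[0]])
    while queue:
        vertex = queue.popleft()
        for neighbour in sorted(expanded[vertex]):
            if neighbour not in seen:
                seen.add(neighbour)
                tree.add(frozenset((vertex, neighbour)))
                queue.append(neighbour)

    # Repeatedly drop leaves that are not required labels.
    pruned = True
    while pruned:
        pruned = False
        degree = {}
        for edge in tree:
            for vertex in edge:
                degree[vertex] = degree.get(vertex, 0) + 1
        for edge in list(tree):
            if any(degree[v] == 1 and v not in terminals for v in edge):
                tree.discard(edge)
                pruned = True
                break
    return tree


# Hypothetical undirected schema, loosely based on the labels of Section 5.2.
schema = {
    "Technology": {"Published Item"},
    "Published Item": {"Technology", "Organisation", "Author Keyword"},
    "Organisation": {"Published Item", "City"},
    "City": {"Organisation", "Country"},
    "Country": {"City"},
    "Author Keyword": {"Published Item"},
}
pattern = steiner_tree_approx(schema, {"Technology", "City"})
```

In this toy schema the pattern connecting Technology and City passes through Published Item and Organisation, which play the role of intermediate vertices allowed by the third criterion. Directing the resulting edges to obtain the two patterns of a pair is not covered by this sketch.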

3.2 Connecting user interactions and views

Now that graph patterns () have been created using set , set and set , one can introduce the function that generates views from user interactions. ( and ) as
where

are the vertices of graph and

are the “interconnection” vertices. The other members of the six-tuple are unchanged since

  • labels (set ) are not modified and since

  • edge definition (set ) and weighting functions ( and ) only depend on set and set .

4 Operations on graphs

User interactions will result in the following graph operations:

  • Selection: The user selects nodes on the view,

  • Expansion: The user expands a view by removing from a previous selection all vertices that have the same label,

  • Navigation: The user navigates from one view to another.

To define these operations, one first needs to introduce the concepts of visual equivalence and minimal views, since there can be views with vertices of null weight that are hidden from the user and hence non-selectable. Let $F_1$ and $F_2$ be two different filters on the same view complying with . In essence, this means that there is no difference in the sets of vertices with labels contained in which technically should be empty. View and view generated using $F_1$ and $F_2$ are said to be visually equivalent if and only if

Definition 2 (Vis-equivalent)

where () represents the vertices of view (). Intuitively, visual equivalence guarantees that vertices that are not common to two views have null weight. It partitions views into equivalence classes. It is easy to prove that each class of views contains exactly one view that has no vertices with null weight. This view is called the minimal view.
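The intuition behind visual equivalence and minimal views can be sketched with views reduced to a mapping from vertices to weights. This is a simplification made for illustration: edges are ignored, the formal condition of Definition 2 is not reproduced, and the function names are assumptions.

```python
def minimal_view(weights):
    """Drop the vertices of null weight, which are hidden from the user anyway."""
    return {vertex: weight for vertex, weight in weights.items() if weight != 0}


def visually_equivalent(weights_a, weights_b):
    """Two views are treated as equivalent if every vertex that is not common
    to both has a null weight in the view that contains it."""
    not_common = set(weights_a) ^ set(weights_b)
    return all(
        weights_a.get(vertex, 0) == 0 and weights_b.get(vertex, 0) == 0
        for vertex in not_common
    )


# Hypothetical views: `view_a` carries a hidden vertex of null weight.
view_a = {"cern": 3, "mit": 0}
view_b = {"cern": 3}
view_c = {"cern": 3, "epfl": 1}
```

Here `view_a` and `view_b` look identical to the user and share the same minimal view, whereas `view_c` shows an additional visible vertex and therefore belongs to a different class.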

4.1 Selection on graphs

Let be the set of user selected nodes within a view. and where is a set of vertices from the minimal view which is visual-equivalent to graph . The selection operator is defined as

Definition 3

(Selection)

where . It is to be noted that at view creation the selection operator has been used with a more general definition of the function.

4.2 Expansion on graphs

The expansion operator is in some sense the “inverse” of the selection operator. It is defined as

Definition 4

(Expansion)

The expansion operator changes view when and removes all vertices in set that are labelled with labels in .

4.3 Navigation through graphs

By selecting a subset of labels from one can build views of graph with reduced dimensional complexity. Navigation across views is required to enable users to apprehend the full graph . Therefore the navigation function goes from view to a view labelled as and and is defined as:

Definition 5 (Navigation)
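The interplay of the three operators can be sketched as pure functions over a small state made of the labels of the current view and the accumulated filter. This is a deliberately simplified model, not the formal Definitions 3 to 5: the class `ViewState`, the representation of the filter as (vertex, label) pairs and the toy values are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewState:
    view_labels: frozenset   # labels shown in the current view
    selected: frozenset      # filter: (vertex, label) pairs chosen so far


def select(state, picked):
    """Selection: add the picked (vertex, label) pairs to the filter."""
    return ViewState(state.view_labels, state.selected | frozenset(picked))


def expand(state, labels):
    """Expansion: remove from the filter every vertex whose label is in `labels`."""
    kept = frozenset((v, label) for v, label in state.selected if label not in labels)
    return ViewState(state.view_labels, kept)


def navigate(state, new_labels):
    """Navigation: switch to the view built on `new_labels`, keeping the filter."""
    return ViewState(frozenset(new_labels), state.selected)


# Hypothetical session following the same steps as the use case of Section 5.3.
start = ViewState(frozenset({"Technology"}), frozenset())
picked = select(start, {("laser", "Technology"), ("sensor", "Technology")})
categories = navigate(picked, {"Subject Category"})
narrowed = select(categories, {("optics", "Subject Category")})
back = navigate(expand(narrowed, {"Technology"}), {"Technology"})
```

At the end of the session the Technology view is shown again, but filtered by the selected subject category only, since the expansion has removed the earlier technology selection.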

4.4 Navigation history

The navigation history can be represented as a navigation graph where vertices represent navigation states and edges represent navigation steps between states. complies with

  • .

  • ,

where there is a navigation step from node to node if and only if one of the following statements is true:

  1. , and ;

  2. , and ;

  3. and .

In , the third component of an edge is always one of the operations or . It indicates how the step was processed.
The proper size of is where and .
A particular navigation history corresponds to a walk in . An example of such a walk is given below

Example 6 (Walk on graph)

In practice, a particular set of labels is used to create an entry view from which all the above mentioned operations can then be performed by users.
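A navigation history can be recorded as a list of edges of the navigation graph, each edge carrying the operation that produced it as its third component. The sketch below only checks that such a list forms a walk; the state encoding, the operation names and the function `is_walk` are assumptions made for illustration.

```python
OPERATIONS = {"select", "expand", "navigate"}


def is_walk(steps):
    """Check that `steps`, a list of (source, target, operation) triples, is a walk.

    Every operation must be a known one, and each step must start from the
    state reached by the previous step.
    """
    for index, (source, _target, operation) in enumerate(steps):
        if operation not in OPERATIONS:
            return False
        if index > 0 and steps[index - 1][1] != source:
            return False
    return True


# Hypothetical states: (label of the view, frozenset of filtered vertices).
entry = ("Technology", frozenset())
chosen = ("Technology", frozenset({"laser"}))
category = ("Subject Category", frozenset({"laser"}))

history = [
    (entry, chosen, "select"),
    (chosen, category, "navigate"),
]
```

Keeping the operation on the edge makes it possible to replay or undo a session step by step.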

5 Use-case

In the framework of AIDA [6], an FP7 project on Advanced European Infrastructures for Detectors at Accelerators, researchers needed to identify key players from academia and industry for technologies considered strategic for the particle physics programme. To this end, the Collaboration Spotting project was launched in 2012 to enable users to search for technologies in the titles and abstracts of publications and patents, and to view the organisation, journal category, keyword, city and country landscapes for each of these technologies individually. Individual technology searches are represented as vertices in a view named Technogram, used as the user entry view, in which edges represent publications and/or patents common to searches.

5.1 Data

Two different sources are used for searching: the metadata records of publications from the Web of Science™ Core Collection [7], developed by Clarivate Analytics (in the past, Thomson Reuters), and the metadata records of patents from PATSTAT, developed by the European Patent Office [12]. Although the two sources have a number of labels in common, such as organisation, city and country, others, such as journal category and keyword, only belong to publications. The subset of data from the two sources corresponding to the labels of interest for users was used to construct graph and its schema .

5.2 Storing data in a graph database (Neo4j)

Graph is stored in a Neo4j graph database [30], in which individual metadata records are stored as subgraphs of labelled vertices using Published item, Organisation, Journal Category, Author Keyword, City, Region and Country as labels. Figure 4 represents the reachability graph (graph schema) of this network. Besides these labels, additional labels have been introduced to support user authentication and authorisation (User) and searches (Graph and Technology). Searches use full text indices of the Apache Lucene project [29] that have been integrated into the Neo4j database as legacy indices [30].

Figure 4: The database schema (reachability graph). Light-coloured nodes represent nodes uploaded by the data administrator; dark nodes are created by the system itself through its search and authentication modules.

5.2.1 Statistics of our graph data

Searches can be performed on publication and patent metadata records from the 2000 - 2014 period. The resulting data network contains roughly 46 million vertices and 157 million edges. Its breakdown is given in Table 1 and Table 2.

| Type of nodes | Number of nodes |
|---|---|
| Patents | 15.000.442 |
| Publications | 20.087.904 |
| Organisations | 2.918.060 |
| Author Keywords | 8.193.604 |
| Subject Categories | 230 |
| Cities | 7.741 |
| Regions | 946 |
| Countries | 128 |
| Total | 46.209.055 |

Table 1: Number of nodes by node labels

| | Patents | Publications | Total |
|---|---|---|---|
| Organisations | 12.440.903 | 36.672.677 | 49.113.580 |
| Author Keywords | - | 48.941.098 | 48.941.098 |
| Subject Categories | - | 32.566.806 | 32.566.806 |
| Cities | 3.193.709 | 8.826.222 | 12.019.931 |
| Regions | 265.421 | 2.504.441 | 2.769.862 |
| Countries | 3.156.449 | 8.020.648 | 11.177.097 |
| Total | 19.056.482 | 137.531.892 | 156.588.374 |

Table 2: Number of edges by node labels. Patents have no author keyword or subject category properties.
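As a quick consistency check, the totals of Table 1 and Table 2 can be recomputed from the individual rows. The snippet below is only a verification aid; it copies the figures from the tables and assumes that "." is used as a thousands separator and that "-" stands for zero.

```python
def parse(number):
    """Parse an integer written with '.' as thousands separator; '-' means zero."""
    return 0 if number == "-" else int(number.replace(".", ""))


nodes = {
    "Patents": "15.000.442", "Publications": "20.087.904",
    "Organisations": "2.918.060", "Author Keywords": "8.193.604",
    "Subject Categories": "230", "Cities": "7.741",
    "Regions": "946", "Countries": "128",
}

# label: (edges to patents, edges to publications)
edges = {
    "Organisations": ("12.440.903", "36.672.677"),
    "Author Keywords": ("-", "48.941.098"),
    "Subject Categories": ("-", "32.566.806"),
    "Cities": ("3.193.709", "8.826.222"),
    "Regions": ("265.421", "2.504.441"),
    "Countries": ("3.156.449", "8.020.648"),
}

total_nodes = sum(parse(count) for count in nodes.values())
patent_edges = sum(parse(patents) for patents, _ in edges.values())
publication_edges = sum(parse(publications) for _, publications in edges.values())
total_edges = patent_edges + publication_edges

print(total_nodes, patent_edges, publication_edges, total_edges)
```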

As can be seen, the number of region edges is smaller than the number of country edges due to the use of the level of the Nomenclature of Territorial Units for Statistics [11] created by the European Commission.

5.3 Navigation

The entry point for this use case is individual users. Using the terminology introduced above, the initial user interaction set contains user IDs.

5.3.1 Limitations

In the current implementation, there is a restriction on the size of and fixed to a single label, Published Item, and the visualization system only supports undirected edges. This calls for the generation of only one graph pattern instead of two, which makes the system faster.

In Figure 5, a short series of pictures illustrates how the operations work. The user enters the system with a technology view (vertices are labelled with the Technology label and are connected to the other views via vertices labelled with the Published Item label).

(a) Technology view () of a user
(b) Selecting two technologies () and navigating to view.
(c) Subject Category view () for the selected technologies.
(d) Selecting a cluster in the Subject Category view () and expanding the view to go back to the Technology view ().
(e) Technology view with filter
Figure 5: Example of operations; navigation, selection and expansion on views

6 Conclusion and Future Work

The current version of Collaboration Spotting running at CERN [8] implements these concepts using patent and publication metadata records. It is a new experimental service that aims to provide the High Energy Physics community (such as HEPTech [20]) with information on the main players from academia and industry that are active around key technologies, with a view to fostering more inter-disciplinary and inter-sectoral R&D collaborations and giving the procurement service the opportunity to reach a wider selection of high-tech companies for bidding purposes. Collaboration Spotting is generic in its concepts and implementation. It can support visual analytics of any kind of data, and its backend is implemented using the Neo4j graph database [30]. Conference papers, technical & business news, trademarks & designs and financial data are amongst the data targeted to enrich the information on technologies that one can obtain from publications and patents. The choice of data sources will depend on users’ priorities. The tool can be of use to other communities, in particular in dentistry [27], but also to policy makers and investors if the data in the labelled graph is enriched with technical & business news and financial data. Collaboration Spotting also addresses other types of data, such as compatibility and dependency relationships in the software and metadata [5, 35] of the LHCb experiment at CERN.

As an interactive graph query language, Collaboration Spotting is intended to provide a fully customisable visual analytics environment. In the current version data processing supports searches and contextual queries. In the future, labelled & directed relationships and attributes on nodes will be included in the labelled property graph representation of the data network and the processing will be extended to more complex operations directly on the graph resulting from searches and queries with a view to enhancing the visual perception of users.


References

  • [1] B. Bach, E. Pietriga, and J.-D. Fekete. Visualizing dynamic networks with matrix cubes. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’14, pp. 877–886. ACM, New York, NY, USA, 2014. doi: 10.1145/2556288.2557010
  • [2] P. Barceló Baeza. Querying graph databases. In Proceedings of the 32nd Symposium on Principles of Database Systems, PODS ’13, pp. 175–188. ACM, New York, NY, USA, 2013. doi: 10.1145/2463664.2465216
  • [3] M. Bastian, S. Heymann, M. Jacomy, et al. Gephi: an open source software for exploring and manipulating networks. ICWSM, 8:361–362, 2009.
  • [4] A. Bezerianos, F. Chevalier, P. Dragicevic, N. Elmqvist, and J. D. Fekete. GraphDice: A system for exploring multivariate social networks. In Proceedings of the 12th Eurographics / IEEE - VGTC Conference on Visualization, EuroVis’10, pp. 863–872. The Eurographics Association and John Wiley & Sons, Ltd., Chichester, UK, 2010. doi: 10.1111/j.1467-8659.2009.01687.x
  • [5] M. Cattaneo, M. Clemencic, and I. Shapoval. LHCb software and Conditions Database cross-compatibility tracking system: A graph-theory approach. In Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC), 2012 IEEE, pp. 990–996. IEEE, 2012.
  • [6] CERN - AIDA team. Advanced European Infrastructures for Detectors at Accelerators, December 2017.
  • [7] Clarivate Analytics (in the past, Thomson Reuters). Web of Science™, December 2017.
  • [8] Collspotting Developer Team. Collspotting, December 2017.
  • [9] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Trans. Math. Softw., 38(1):1:1–1:25, Dec. 2011.
  • [10] N. Elmqvist, P. Dragicevic, and J. D. Fekete. Rolling the dice: Multidimensional visual exploration using scatterplot matrix navigation. IEEE Transactions on Visualization and Computer Graphics, 14(6):1141–1148, Nov 2008. doi: 10.1109/TVCG.2008.153
  • [11] European Commission. NUTS - Nomenclature Of Territorial Units For Statistics, December 2017.
  • [12] European Patent Office. PATSTAT - Worldwide Patent Statistical Database, December 2017.
  • [13] B. Gallagher. Matching structure and semantics: A survey on graph-based pattern matching. AAAI FS, 6:45–53, 2006.
  • [14] M. R. Garey, R. L. Graham, and D. S. Johnson. The complexity of computing steiner minimal trees. SIAM journal on applied mathematics, 32(4):835–859, 1977.
  • [15] J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data mining and knowledge discovery, 1(1):29–53, 1997.
  • [16] R. H. Güting. Graphdb: Modeling and querying graphs in databases. In VLDB, vol. 94, pp. 12–15. Citeseer, 1994.
  • [17] M. Gyssens, J. Paredaens, J. V. D. Bussche, and D. V. Gucht. A graph-oriented object database model, 1990.
  • [18] S. Harris, A. Seaborne, and E. Prud’hommeaux. SPARQL 1.1 query language. W3C Recommendation, 21, 2013.
  • [19] J. Heer and A. Perer. Orion: A system for modeling, transformation and visualization of multidimensional heterogeneous networks. Information Visualization, 13(2):111–133, 2014.
  • [20] HEPTech Team. HEPTech - website, December 2017.
  • [21] J. Hidders. Typing graph-manipulation operations. In Database Theory-ICDT 2003, pp. 394–409. Springer, 2003.
  • [22] J. Hidders and J. Paredaens. GOAL, A Graph-based Object and Association Language.
  • [23] F. K. Hwang, D. S. Richards, and P. Winter. The Steiner tree problem, vol. 53 of Annals of Discrete Mathematics. Elsevier, 1992.
  • [24] J. Kehrer and H. Hauser. Visualization and visual analysis of multifaceted scientific data: A survey. IEEE Transactions on Visualization and Computer Graphics, 19(3):495–513, March 2013. doi: 10.1109/TVCG.2012.110
  • [25] J. B. Kollat, P. M. Reed, and R. M. Maxwell. Many-objective groundwater monitoring network design using bias-aware ensemble Kalman filtering, evolutionary optimization, and visual analytics. Water Resources Research, 47(2), 2011. W02529.
  • [26] H. S. Kunii. DBMS with Graph Data Model for Knowledge Handling. In Proceedings of the 1987 Fall Joint Computer Conference on Exploring Technology: Today and Tomorrow, ACM ’87, pp. 138–142. IEEE Computer Society Press, Los Alamitos, CA, USA, 1987.
  • [27] E. Leonardi, A. Agocs, S. Fragkiskos, N. Kasfikis, J. Le Goff, M. Cristalli, V. Luzzi, and A. Polimeni. Collaboration spotting for dental science. Minerva Stomatologica, 63(9):295–306, sep 2014.
  • [28] Z. Liu, S. B. Navathe, and J. T. Stasko. Network-based visual analysis of tabular data. In 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 41–50, Oct 2011. doi: 10.1109/VAST.2011.6102440
  • [29] Lucene™/Solr™ Committers. Apache Lucene™ Documentation, December 2017.
  • [30] Neo4j. The Neo4j Manual v2.3.3, December 2017.
  • [31] J. O’Madadhain, D. Fisher, S. White, and Y. Boey. The JUNG (Java Universal Network/Graph) Framework. University of California, Irvine, California, 2003.
  • [32] J. Paredaens, P. Peelman, and L. Tanca. G-Log: A graph-based query language. Knowledge and Data Engineering, IEEE Transactions on, 7(3):436–453, 1995.
  • [33] A. Scharl, A. Hubmann-Haidvogel, A. Weichselbraun, H. P. Lang, and M. Sabou. Media watch on climate change – visual analytics for aggregating and managing environmental knowledge from online sources. In 2013 46th Hawaii International Conference on System Sciences, pp. 955–964, Jan 2013. doi: 10.1109/HICSS.2013.398
  • [34] R. Shadoan and C. Weaver. Visual analysis of higher-order conjunctive relationships in multidimensional data using a hypergraph query system. IEEE Transactions on Visualization and Computer Graphics, 19(12):2070–2079, Dec 2013. doi: 10.1109/TVCG.2013.220
  • [35] I. Shapoval, M. Clemencic, and M. Cattaneo. ARIADNE: a Tracking System for Relationships in LHCb Metadata. In Journal of Physics: Conference Series, vol. 513, p. 042039. IOP Publishing, 2014.
  • [36] Z. Shen, K.-L. Ma, and T. Eliassi-Rad. Visual analysis of large heterogeneous social networks by semantic and structural abstraction. IEEE transactions on visualization and computer graphics, 12(6):1427–1439, 2006.
  • [37] C. Stolte, D. Tang, and P. Hanrahan. Polaris: a system for query, analysis, and visualization of multidimensional relational databases. IEEE Transactions on Visualization and Computer Graphics, 8(1):52–65, Jan 2002. doi: 10.1109/2945.981851
  • [38] T. von Landesberger, A. Kuijper, T. Schreck, J. Kohlhammer, J. van Wijk, J.-D. Fekete, and D. Fellner. Visual analysis of large graphs: State-of-the-art and future research challenges. Computer Graphics Forum, 30(6):1719–1749, 2011.
  • [39] M. Wattenberg. Visual exploration of multivariate graphs. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, pp. 811–819. ACM, New York, NY, USA, 2006. doi: 10.1145/1124772.1124891
  • [40] C. Weaver. Cross-filtered views for multidimensional visual analysis. IEEE Transactions on Visualization and Computer Graphics, 16(2):192–204, March 2010. doi: 10.1109/TVCG.2009.94
  • [41] P. C. Wong, H.-W. Shen, C. R. Johnson, C. Chen, and R. B. Ross. The top 10 challenges in extreme-scale visual analytics. IEEE computer graphics and applications, 32(4):63, 2012.
  • [42] P. C. Wong and J. Thomas. Visual analytics. IEEE Computer Graphics and Applications, 24(5):20–21, Sept 2004. doi: 10.1109/MCG.2004.39
  • [43] P. T. Wood. Query languages for graph databases. SIGMOD Rec., 41(1):50–60, Apr. 2012.
  • [44] J. Yang, S. Zhang, and W. Jin. DELTA: Indexing and Querying Multi-labeled Graphs. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM ’11, pp. 1765–1774. ACM, New York, NY, USA, 2011. doi: 10.1145/2063576.2063832