The HyperBagGraph DataEdron: An Enriched Browsing Experience of Multimedia Datasets

05/28/2019 ∙ by Xavier Ouvrard, et al. ∙ University of Geneva, CERN

Traditional verbatim browsers return information linearly, according to a ranking performed by a search engine that may not be optimal for the surfer. The surfer may need to assess the pertinence of the information retrieved, particularly when they want to explore other facets of a multi-faceted information space. For instance, in a multimedia dataset different facets such as keywords, authors, publication category, organisations and figures can be of interest. The simultaneous visualisation of facets can help to gain insights into the information retrieved and call for further searches. Facets are co-occurrence networks, modeled by HyperBag-Graphs – families of multisets – and are in fact linked not only to the publication itself, but to any chosen reference. These references allow navigation inside the dataset and visual queries. We explore here the case of scientific publications based on Arxiv searches.


I Introduction

Big Data calls for powerful analytical and visualisation tools to reveal the important information it contains. In an information space, meaningful information can be regrouped by hierarchical classification or, non-exclusively, by semantically cohesive categories that are combined to express concepts [1].

When browsing a textual database, traditional verbatim browsers return linear information in the form of a ranked list with a short description of the references. To increase the pertinence of this information, the surfer often has to perform a new search, either by refining the original keywords or by using other pertinent queries that help to refine the retrieved information.

In fact, in most cases, the information space is multi-faceted. Those facets are not independent but are linked by the physical entities contained in the search output. Choosing a type of reference common to these entities enables the construction of a network of co-occurrences for each facet. Navigating between the different facets helps to reveal information. For instance, in the case of scientific publications, different information of interest for the scientist can be regrouped around the publication reference: the authors, the organisation(s) they belong to, the countries of these organisations, the main keywords of an article or the figures inside the publication. All this metadata can give insights into the information space and, as it is available for most of the articles in the database, can be systematically regrouped to get meaningful information. Choosing as reference the article itself, the facets can show different co-occurrence networks such as co-authors, co-keywords and co-categories. The interaction between the different facets is ensured by the links between them. The choice of the reference type is not unique and can be any convenient metadata type in the dataset. Co-occurrences can contain repetitions or an individual weighting: multisets, and particularly natural multisets, allow them while sets don't.

We propose in this article another way to explore an information space by using hyper-bag-graphs (hb-graphs for short), an extension of hypergraphs to families of multisets. In [2], we show that hb-graphs enhance an exchange-based diffusion over co-occurrence networks, providing a fine vertex and hb-edge ranking that accounts for a hb-edge based weighting of vertices.

This paper presents a hyper-bag framework of co-occurrence networks extending the visualisation part of the hypergraph framework sketched in [3] to hb-graphs. This framework supports browsing of an information space and visual queries on a dataset. The framework is validated from a theoretical point of view and with a use case. We have implemented a 2.5D interface to visualise different facets of the Arxiv information space and perform visual queries.

Section II lists the related work and the mathematical background. Section III presents the hb-graph framework. Section IV gives results and Section V concludes and addresses future work.

II Related work and mathematical background

II-A Information space discovery

Discovering knowledge in an information space requires gathering meaningful information, either hierarchically or semantically organized. Semantics provides support to the definition of facets within an information space [1].

Navigation and visualisation of information spaces have been achieved by several authors in many different ways. In [4], a pivot is used to stroll between three facets of the information space; the approach is limited to the visualisation of a small number of pivots at the same time. This visualisation is based on a tri-partite graph. In [5], an interactive exploration of implicit and explicit relations in faceted datasets is proposed. The visualisation space is shared between different metadata with cross findings between metadata, partitioning the space into categories.

[6] proposes a visual analytics graph-based framework that reveals this information and facilitates knowledge discovery. It provides insights into different facets of an information space based on user-selected perspectives. The dataset is stored as a labelled graph in a graph database. Choosing a perspective as reference and a facet as dimension, paths of the labelled graph are retrieved with same dimension extremities going through reference vertices. Visualisation comes in the form of navigable node-link graphs: edges materialise common references between vertices and are seen as pairwise collaboration between two vertices.

II-B Co-occurrence networks

Data mining is only one step in the knowledge discovery processing chain [7]. While numerical data allows rich statistics on the instances, non-numerical data mining often consists in summarizing data as occurrences. Some other approaches exist. Often they consist in regrouping data instances through similarities using techniques such as $k$-nearest neighbours [8]: using a threshold, links are established between different occurrences. These approaches rely on similarities: the curse of dimensionality is nonetheless a limiting factor in their use [9], even if some techniques exist to limit its impact [10, 11]. A last way of finding occurrences is to retrieve links through the dataset itself.

If the dataset reflects existing links, the job is easier since an inherent network can be built through the data instances. A typical example is groups of friends in social networks. When the links exist, the collaborations are derived from them. Nonetheless, links are often neither direct nor tangible: in this case occurrences need to be built or processed from the dataset.

A dataset can be a set of physical references, stored as rows in traditional relational databases. Each physical reference has a metadata instance attached to it. Some of the types of the metadata instances are of interest for visualisation and some for processing additional information. The set of physical references and metadata instances used for visualisation constitute the types of the network, each type being seen either as a reference or a facet of the information space. This allows, as explained in the next section, the retrieval of co-occurrences in one facet based on one reference type, which can differ from the physical reference.

II-C Multisets and hb-graphs

Co-occurrences can be seen as collaborations and therefore constitute, with their links, a network. A collaboration is an $n$-adic relationship between occurrences, as mentioned in [12], and therefore modelling is often done with hypergraphs, i.e. families of sets over a vertex set. But hypergraphs support neither hyperedge-based repetition nor hyperedge-based weighting of vertices. We introduce hb-graphs in [13, 2, 14] to extend the concept of hypergraphs to families of multisets with the same universe, called the vertex set.

Multisets - also known as bags or msets - have been used for a long time in many domains, in particular in text representation [2]. A multiset $A_m$ on a universe $A$ is a couple $(A, m)$ where $A$ is a set and $m$ is an application from $A$ to $\mathbb{W} \subseteq \mathbb{R}^{+}$ called the multiplicity function of the multiset $A_m$. Elements that have a nonzero multiplicity are gathered in the support of the multiset, written $A_m^{\star}$. If the range of the multiplicity function is a subset of $\mathbb{N}$, the multiset is called a natural multiset. In this case, elements of the multiset can be seen as a non-ordered list of repeated elements. The empty mset of universe $A$, written $\emptyset_A$, is the multiset of empty support on the universe $A$.

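Several notations of msets exist. Among the common notations mentioned in [13], we note in this article a mset $A_m$ of universe $A = \{x_1, \dots, x_n\}$ and multiplicity function $m$ by:

$$A_m = \left\{ x_i^{m_i} : i \in \{1, \dots, n\} \right\}$$

where $m_i = m(x_i)$. (A natural multiset can also be expressed as an unordered list with repetition: $\{\{ \underbrace{x_1, \dots, x_1}_{m_1 \text{ times}}, \dots, \underbrace{x_n, \dots, x_n}_{m_n \text{ times}} \}\}$.)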

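Different operations are defined between two multisets $A = (U, m_A)$ and $B = (U, m_B)$ of the same universe $U$. In particular, as it is used in this paper, the additive union of $A$ and $B$ is the multiset $C = A \uplus B$ of universe $U$ whose multiplicity function $m_C$ satisfies, for all $x \in U$, $m_C(x) = m_A(x) + m_B(x)$.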
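As an illustration only, natural multisets and their additive union can be sketched in Python with `collections.Counter`. The names `support` and `additive_union` are hypothetical helpers introduced for this example, not part of the tool described in this paper.

```python
from collections import Counter

# Natural multisets over a shared universe, stored as element -> multiplicity.
a = Counter({"x": 2, "y": 1})
b = Counter({"y": 3, "z": 1})


def support(mset):
    """Return the elements that have a nonzero multiplicity."""
    return {x for x, m in mset.items() if m != 0}


def additive_union(m1, m2):
    """Add the multiplicities element-wise over the common universe."""
    return Counter({x: m1.get(x, 0) + m2.get(x, 0) for x in set(m1) | set(m2)})


c = additive_union(a, b)
# A natural multiset can also be listed as an unordered list with repetition.
as_list = sorted(a.elements())
```

For multisets with real-valued multiplicities, a plain dictionary from elements to floats would play the same role.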

Considering a set of vertices $V = \{v_1, \dots, v_n\}$, a hb-graph $\mathcal{H} = (V, E)$ is a family of multisets $E = (e_i)_{i \in \{1, \dots, p\}}$ having the same universe $V$ and with support a subset of $V$. The elements of $E$ are called the hb-edges. As a multiset, each hb-edge $e_i$ has a multiplicity function associated to it: $m_{e_i} : V \to \mathbb{W}$ where $\mathbb{W} \subseteq \mathbb{R}^{+}$. For a general hb-graph, each hb-edge has to be seen as a weighted system of vertices, where the weights of each vertex are hb-edge dependent.

When the multiplicity range of each hb-edge is a subset of $\mathbb{N}$, the hb-graph is said to be natural. A hypergraph is a natural hb-graph where every vertex in any hb-edge has a binary value - 0 or 1 - for multiplicity.

Considering the family of hb-edge supports $\underline{E} = (e^{\star})_{e \in E}$ of a hb-graph $\mathcal{H} = (V, E)$, we can define its support hypergraph as the hypergraph $\underline{\mathcal{H}} = (V, \underline{E})$, where $e^{\star}$ is the support of the hb-edge $e$. The support hypergraph is unique for a given hb-graph. But reconstructing the hb-graph from a support hypergraph generates an infinite number of hb-graphs, showing that the information contained in a hb-graph is denser than in a hypergraph.
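The relation between a hb-graph and its support hypergraph can be illustrated with a minimal Python sketch. The data layout (a list of dictionaries mapping vertices to multiplicities) and the function names are assumptions made for this example only.

```python
# A hb-graph as a list of hb-edges; each hb-edge maps a vertex to its multiplicity.
hb_edges = [
    {"v1": 2, "v2": 1},
    {"v2": 3, "v3": 1},
]

# A different hb-graph on the same vertices, with other multiplicities.
other_hb_edges = [
    {"v1": 1, "v2": 5},
    {"v2": 1, "v3": 4},
]


def support_hypergraph(edges):
    """Replace each hb-edge by its support: the vertices of nonzero multiplicity."""
    return [frozenset(v for v, m in edge.items() if m != 0) for edge in edges]


def is_natural(edges):
    """A hb-graph is natural when every multiplicity is a nonnegative integer."""
    return all(isinstance(m, int) and m >= 0 for edge in edges for m in edge.values())


same_support = support_hypergraph(hb_edges) == support_hypergraph(other_hb_edges)
```

Here both hb-graphs share one support hypergraph, which illustrates why the multiplicities cannot be recovered from the support alone.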

We represent hb-graphs using an unnormalised extra-node representation [13]: an extra-node per hb-edge is added and the link thickness between it and each hb-edge support vertex is proportional to the vertex multiplicity in this hb-edge. The support hypergraph of the hb-graph constitutes a simplified representation.
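A minimal sketch of how the links of such an extra-node representation could be derived is given below. The function name `extra_node_links`, the extra-node labels and the `scale` parameter are hypothetical; the actual drawing and layout are left aside.

```python
def extra_node_links(edges, scale=1.0):
    """Return (extra_node, vertex, thickness) triples for an extra-node view.

    One extra-node is created per hb-edge and linked to every vertex of the
    hb-edge support; the thickness is proportional to the multiplicity.
    """
    links = []
    for index, edge in enumerate(edges):
        extra_node = f"e{index}"  # assumed naming scheme for extra-nodes
        for vertex, multiplicity in edge.items():
            if multiplicity != 0:
                links.append((extra_node, vertex, scale * multiplicity))
    return links


links = extra_node_links([{"a": 2, "b": 1}, {"b": 3}], scale=1.5)
```

Setting every thickness to the same value in this sketch gives the simplified view based on the support hypergraph.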

III Hb-graph framework

Multidimensional datasets are formed of data that are linked with physical entities. For instance, a publication, a person or a piece of music are possible physical entities. Metadata of various kinds and types, numerical or not, are attached to the physical entities, including their own reference.

Statistics can be easily performed on numerical types. For non-numerical types, only co-occurrence gathering is easily achievable with traditional charts and arrays. But choosing one of these non-numerical types as a reference to build co-occurrences enhances navigation in the information space. The navigation is a simplification of the one presented in [3]. The hb-graph framework extends the facet visualisation achieved in [3] to support multisets instead of sets, using interconnected hb-graphs at the level of the data instances called visualisation hb-graphs.

III-A Enhancing navigation

Traditional database structures can be seen as hypergraphs where the hyperedges reflect the table headers and the vertices the metadata instances. Normalized forms of such databases are linked to properties of the schema hypergraph [15]. In graph databases, the schema (although not required [16]) represents the relationships between the vertex types. The schema hypergraph represents these relationships as hyperedges.

If database knowledge extraction processing is performed, such as natural language processing, the schema hypergraph becomes an extended schema hypergraph.

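Some of the types in the extended schema are of interest neither for visualisation nor for use as reference: we consider the types of interest as a subset $U$ of the vertex set of the extended schema hypergraph, from which we generate the extracted extended schema hypergraph $\mathcal{H}_X = (V_X, E_X)$, where $V_X = U$ and $E_X$ gathers the restrictions to $U$ of the hyperedges of the extended schema hypergraph.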

Only vertices of the extracted extended schema hypergraph $\mathcal{H}_X$ that are reachable - i.e. vertices with a simple path in between them - can be further navigated and used either as reference or visualisation type. Therefore, we build the reachability hypergraph $\mathcal{H}_R$ with the vertex set of $\mathcal{H}_X$ as its vertex set, the individual hyperedges of $\mathcal{H}_R$ being the connected components of $\mathcal{H}_X$. The hyperedges of $\mathcal{H}_R$ are not connected to one another, as the connected components of a hypergraph constitute a partition of its vertex set. When the reachability hypergraph has only one hyperedge, the whole dataset is navigable: it is the ideal situation.
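The construction of the reachability hypergraph amounts to computing the connected components of a hypergraph. The sketch below uses a union-find structure, which is one standard way to do so and not necessarily the one used in the tool; the metadata types "instrument" and "detector" are hypothetical and only serve to produce a second component.

```python
def connected_components(vertices, hyperedges):
    """Return the connected components of a hypergraph as a list of vertex sets."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links up to the representative, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for edge in hyperedges:
        members = list(edge)
        for other in members[1:]:
            # All vertices of a hyperedge belong to the same component.
            parent[find(other)] = find(members[0])

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), set()).add(v)
    return list(groups.values())


types = ["publication id", "authors", "keywords", "instrument", "detector"]
schema_edges = [
    {"publication id", "authors"},
    {"publication id", "keywords"},
    {"instrument", "detector"},
]
reachability_edges = connected_components(types, schema_edges)
```

In this toy schema the reachability hypergraph has two hyperedges, so navigation is only possible within each of them separately.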

We make the assumption that in each hyperedge of the reachability hypergraph, there exists a metadata type or a combination of metadata types that can be chosen as the physical reference. The data instances related to this reference are supposed to be unique. For instance, in a publication dataset the physical reference is the id of the publication itself.

Last hypergraph at the metadata level, the navigation hypergraph $\mathcal{H}_N$ is defined by choosing a hyperedge $e_R$ of the reachability hypergraph and a non-empty subset $R_{\mathrm{ref}}$ of $e_R$ of possible reference types of interest. The choice of a subset $R_{\mathrm{ref}}$ of $e_R$ allows the remaining vertices of $e_R$ to be considered as visualisation vertex types, which will be used to generate the facet visualisation hb-graphs and are called the visualisation types. Hence, navigation without changing references is possible only in one hyperedge of $\mathcal{H}_N$ at a time. The simplest case happens when there is only one reference of interest selected at a time in $R_{\mathrm{ref}}$; we restrict ourselves to this case for the moment, i.e. we consider for the hyperedges of $\mathcal{H}_N$ the sets $e_R \setminus \{r\}$, one for each reference type $r \in R_{\mathrm{ref}}$.

In a publication dataset, typical metadata types are: publication id, title, abstract, authors, affiliations, addresses, author keywords, publication categories, countries, organisations, etc., the metadata of interest for visualisation or referencing being here publication ids, authors, author keywords, publication categories, countries and organisations. Many different navigation hyperedges are possible: for instance, choosing the publication ids as reference, the navigation hyperedge is {authors, author keywords, organisations, country, publication categories}; choosing author keywords as reference, the navigation hyperedge is {authors, organisations, country, publication category, publication ids}.

III-B Facet visualisation hb-graphs

Each physical entity in a dataset is described by a unique physical reference and a set of data instances of different types. In [3], we use sets to store co-occurrences. Nonetheless, in many cases it is worth storing additional information by joining a multiplicity - with nonnegative integer or real values - to elements of co-occurrences. For instance, in a publication dataset, different authors can have the same affiliation organization; retrieving one occurrence of organization per author leads to repetitions and hence to a natural multiset. When considering keywords, their relative frequency in the document can be used as multiplicity. From [2], we know that hb-graphs allow a refined ranking of the information. Hence, the above facts motivate the usage of multisets to store co-occurrences.

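We write $A_{\alpha,r}$ the multiset of values of type $\alpha$ - possibly empty - that are attached to the physical entity of reference $r$. This entity is entirely described by its reference $r$ and the family of multisets that corresponds to co-occurrences of the different types $\alpha$ linked to the physical reference, i.e. $\left(r, \left(A_{\alpha,r}\right)_{\alpha}\right)$.

Performing a search on the dataset retrieves a set $S$ of physical references. In the single-reference-restricted navigation hypergraph, each hyperedge describes accessible facets relatively to a chosen reference type $\rho$. Given a type $\alpha$, the associated facet shows the visualisation hb-graph where the hb-edges are the co-occurrences of type $\alpha$ relatively to reference instances of type $\rho$ ($\alpha/\rho$ for short) retrieved from the different references in $S$.

We then build the co-occurrences $\alpha/\rho$ by considering the set of all values of type $\rho$ attached to all the references $r \in S$: $\Sigma_\rho = \bigcup_{r \in S} A_{\rho,r}^{\star}$. Each element $s$ of $\Sigma_\rho$ is mapped to the set of physical references in which it appears, $R_s = \{r \in S : s \in A_{\rho,r}^{\star}\}$: we write $r_\rho$ the mapping. The multiset of values of type $\alpha$ relatively to the reference instance $s$ is

$$e_{\alpha,s} = \biguplus_{r \in R_s} A_{\alpha,r}.$$

The raw visualisation hb-graph for the facet of type $\alpha/\rho$ attached to the search $S$ is then defined as:

$$\mathcal{H}_{\alpha/\rho,S} = \left( \bigcup_{r \in S} A_{\alpha,r}^{\star},\ \left(e_{\alpha,s}\right)_{s \in \Sigma_\rho} \right).$$

Since some hb-edges can possibly point to the same sub-mset of vertices, we build a reduced visualisation weighted hb-graph from the raw visualisation hb-graph. To achieve it we define $g : s \in \Sigma_\rho \mapsto e_{\alpha,s}$ and the equivalence relation $\mathcal{R}$ such that, for all $s_1, s_2 \in \Sigma_\rho$: $s_1 \mathcal{R} s_2 \Leftrightarrow g(s_1) = g(s_2)$.

Considering a quotient class $\overline{s} \in \Sigma_\rho / \mathcal{R}$ ($\Sigma_\rho / \mathcal{R}$ is the quotient set of $\Sigma_\rho$ by $\mathcal{R}$), we write $\overline{e}_{\alpha,\overline{s}} = g(s_0)$ where $s_0 \in \overline{s}$.

$\overline{E}_\alpha = \left\{ \overline{e}_{\alpha,\overline{s}} : \overline{s} \in \Sigma_\rho / \mathcal{R} \right\}$ is the support set of the multiset $\left\{\left\{ e_{\alpha,s} : s \in \Sigma_\rho \right\}\right\}$: $\overline{e}_{\alpha,\overline{s}}$ is of multiplicity $w\left(\overline{e}_{\alpha,\overline{s}}\right) = \left|\overline{s}\right|$ in this multiset.

It yields:

$$\left\{\left\{ e_{\alpha,s} : s \in \Sigma_\rho \right\}\right\} = \left\{ \overline{e}^{\,w(\overline{e})} : \overline{e} \in \overline{E}_\alpha \right\}.$$

Let $\tilde{g} : \overline{s} \in \Sigma_\rho / \mathcal{R} \mapsto \overline{e}_{\alpha,\overline{s}} \in \overline{E}_\alpha$, then $\tilde{g}$ is bijective. $\tilde{g}^{-1}$ allows the retrieval of the class associated to a given hb-edge, hence of the values of $\Sigma_\rho$ associated to this class, which will be important for navigation. The references associated to $\overline{e} \in \overline{E}_\alpha$ are $\bigcup_{s \in \tilde{g}^{-1}(\overline{e})} R_s$. The reduced visualisation weighted hb-graph for the search $S$ is defined as

$$\overline{\mathcal{H}}_{\alpha/\rho,S} = \left( \bigcup_{r \in S} A_{\alpha,r}^{\star},\ \overline{E}_\alpha,\ w \right).$$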

The hb-graph support hypergraphs can be used to retrieve the results given for hypergraphs in [3].
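The construction of a raw visualisation hb-graph and of its reduced weighted version can be sketched as follows. This is a toy illustration under stated assumptions: the search output is a hypothetical dictionary from physical references to multisets per type, hb-edges are assumed to be obtained by additive union over the physical references sharing a reference instance, and the function names are introduced for this example only.

```python
from collections import Counter, defaultdict

# Toy search output: physical reference -> {type: natural multiset of values}.
search = {
    "p1": {"keyword": Counter({"graph": 1}),
           "organisation": Counter({"CERN": 2, "UniGe": 1})},
    "p2": {"keyword": Counter({"graph": 1, "mset": 1, "bag": 1}),
           "organisation": Counter({"CERN": 1})},
    "p3": {"keyword": Counter({"mset": 1, "bag": 1}),
           "organisation": Counter({"CERN": 1})},
}


def raw_hb_graph(search, ref_type, facet_type):
    """Build one hb-edge of facet values per instance of the reference type."""
    refs_of = defaultdict(set)  # reference instance -> physical references
    for ref, data in search.items():
        for instance in data.get(ref_type, {}):
            refs_of[instance].add(ref)
    edges = {}
    for instance, refs in refs_of.items():
        edge = Counter()
        for ref in refs:
            edge.update(search[ref].get(facet_type, {}))  # additive union
        edges[instance] = edge
    return edges, dict(refs_of)


def reduce_hb_graph(edges):
    """Group reference instances whose hb-edges are identical multisets."""
    classes = defaultdict(set)
    for instance, edge in edges.items():
        classes[frozenset(edge.items())].add(instance)
    return dict(classes)


edges, refs_of = raw_hb_graph(search, "keyword", "organisation")
classes = reduce_hb_graph(edges)
weights = {edge: len(instances) for edge, instances in classes.items()}
```

In this toy case the keywords "mset" and "bag" yield the same hb-edge, so the reduced hb-graph keeps a single hb-edge of weight 2 for them, while the class of reference instances remains available for navigation.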

III-C Navigability through facets

For a given search $S$ and a given reference type $\rho$, the search set $S$ and the set of reference instances of type $\rho$ are fixed, so navigability can be ensured between the different facets. We consider a type $\alpha$, its visualisation hb-graph and a subset $A$ of the vertex set of this hb-graph. We target another type $\alpha'$ of co-occurrences referring to $\rho$ to be visualised. We illustrate the navigation in Figure 1.

Figure 1: Navigating between facets of the information space

We suppose that the user selects the elements of a subset $A$ of the vertex set of the current facet as vertices of interest from which they want to switch facet. Hb-edges of the reduced visualisation hb-graph of the current facet which contain at least one element of $A$ are gathered in a set $E_A$. Using the application that maps each reduced hb-edge back to its class of reference instances, we retrieve the classes of references of the chosen reference type $\rho$ associated to the elements of $E_A$, to build the set of references of type $\rho$ involved in the building of co-occurrences of the targeted type $\alpha'$. Each of these classes contains instances of type $\rho$, which are gathered in a set $\Sigma_A$. Each element $s$ of $\Sigma_A$ is linked to a set $R_s$ of physical references. Hence we obtain the physical reference set $S_A$ involving elements of $A$:

$$S_A = \bigcup_{s \in \Sigma_A} R_s.$$

The raw visualisation hb-graph in the targeted facet is now built using $S_A$ as search set. To obtain the reduced weighted version we use the same approach as above. The multiset of co-occurrences retrieved includes all occurrences that have co-occurred with the references attached to one of the elements of $A$ selected in the first facet. Of course, if $A$ is the whole vertex set of the current facet, the reduced visualisation hb-graph contains all the instances of the targeted type attached to physical entities of the initial search.
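The switch from one facet to another can be sketched in a few lines of Python. The two input dictionaries are assumed to have been computed for the current facet; their layout and the function name `references_for_selection` are hypothetical.

```python
# Assumed inputs for the current facet (organisations seen through keywords):
# reduced hb-edge (as a frozenset of (vertex, multiplicity)) -> class of reference instances
classes = {
    frozenset({("CERN", 3), ("UniGe", 1)}): {"graph"},
    frozenset({("CERN", 2)}): {"mset", "bag"},
}
# reference instance -> physical references in which it appears
refs_of = {"graph": {"p1", "p2"}, "mset": {"p2", "p3"}, "bag": {"p2", "p3"}}


def references_for_selection(selected, classes, refs_of):
    """Return the reference instances and physical references reached from a selection."""
    instances = set()
    for edge, instance_class in classes.items():
        # Keep the hb-edges containing at least one selected vertex.
        if any(vertex in selected for vertex, _ in edge):
            instances |= instance_class
    physical = set()
    for instance in instances:
        physical |= refs_of[instance]
    return instances, physical


instances, physical = references_for_selection({"UniGe"}, classes, refs_of)
```

The resulting set of physical references then plays the role of the search set when building the visualisation hb-graph of the targeted facet.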

The reference type can always be shown in one of the facets as a visualisation hb-graph, which is in fact a hypergraph where each hb-edge consists of the reference itself, with multiplicity the number of times the reference occurs in the hb-graph.

Ultimately, by building a multi-dimensional network organized around types, one can retrieve very valuable information from combined data sources. This process can be extended to any number of data sources as long as they share one or more types. Otherwise the reachability hypergraph is not connected and only separated navigations are possible. Figure 2 shows some examples of visualisation hypergraphs.

Figure 2: A co-occurrence network and some visualisation hb-graphs. (a) A publication network. (b) Reference: publication; Facet: organization; View: hb-graph extra-node representation. (c) and (d) Reference: keywords; Facet: organization; (c) View: hb-graph extra-node representation; (d) View: hb-graph support hypergraph extra-node representation.

III-D The DataHbEdron

The DataHbEdron provides soft navigation between the different facets of the information space. It has been introduced in [3] for hypergraphs and its principle remains the same with hb-graph support.

Each facet of the information space corresponding to a visualisation type includes a visualisation hb-graph viewed in its 2D extra-node representation with a normalised thickness on hb-edges [17]. The different facets are embedded in a 2.5D representation called the DataHbEdron. The DataHbEdron can be toggled between a cube - Figure 3 - and a carousel shape to ease the navigation between facets. The reference facet is presented as a list of references corresponding to the search output.

In the DataHbEdron, the faces show different facets of the information space: the underlying visualisation hb-graphs allow as previously explained the navigability through facets. Hb-edges are selectable interactively between the different facets; as each hb-edge is linked to a subset of the references, the corresponding references can be used to highlight information in the different facets as well as in the face containing the reference visualisation hb-graph.

Figure 3: DataHbEdron: cube shape

IV Results

We applied this framework to perform searches and visual queries on the Arxiv database. The results are visualised in the DataHbEdron, allowing simultaneous visualisation of the different facets of the information space, constituted of authors, extracted keywords and subject categories. The tool developed is now part of the Collaboration Spotting family (http://collspotting.web.cern.ch/). When performing a search, the standard Arxiv API (https://arxiv.org/help/api/index) is used to query the Arxiv database. The queries can be formulated either by a text entry or interactively, directly using the visualisation: queries include single words or multiple words, with AND, OR and NOT as possible operators and parenthesis groupings. The querying history is stored and presented as an interactive hb-graph to allow the visual construction of complex queries, including refinement of the queries already performed. Each time a new query is formulated, the corresponding metadata is retrieved by the Arxiv API.
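As an illustration of such a query, the sketch below only builds a request URL for the public Arxiv API endpoint, without sending it. The helper `build_query_url` and the example expression are assumptions made for this example; the Arxiv API user manual lists AND, OR and ANDNOT as Boolean operators, and how the tool maps NOT onto them is not detailed here.

```python
from urllib.parse import urlencode

# Public query endpoint documented in the Arxiv API user manual.
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(expression, max_results=10, start=0):
    """Build the URL of an Arxiv API search request (the request is not sent)."""
    params = {
        "search_query": expression,  # e.g. fields combined with AND, OR, ANDNOT
        "start": start,              # index of the first answer returned
        "max_results": max_results,  # number of answers kept
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(
    "all:hypergraph AND (all:visualisation OR all:multiset)", max_results=5
)
```

The answer to such a request is an Atom feed whose entries carry the metadata (authors, categories, abstract) used to build the facets.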

When performing a search on Arxiv, the query is transformed into a vector of words. The most relevant documents are retrieved based on a similarity measure between the query vector and the word vectors associated to individual documents. Arxiv relies on Lucene's built-in Vector Space Model of information retrieval and the boolean model (https://lucene.apache.org/core/2_9_4/scoring.html). The Arxiv API returns the metadata associated to the documents with the highest scores for the query performed. We keep only the first $n$ answers, with $n$ tunable by the end user. This metadata, filled in by authors during their submission of a preprint, contains different information such as authors, Arxiv categories and abstract.

The information space contains four main facets: the first facet shows the Arxiv reference visualisation hb-graph with a contextual sentence related to the query, links to Arxiv article’s presentation and pdf. This first facet layout is similar to classical textual search engines - Figure 4.

Figure 4: First facet of the DataEdron: a familiar interface, similar to a classical verbatim browser

The second facet corresponds to co-authors of the articles using as reference the publication. The third facet depicts the co-keywords extracted from the abstracts. The fourth facet shows the Arxiv categories involved in the references.

Co-keywords are extracted from the abstracts using TextBlob, a natural language processing Python library (https://textblob.readthedocs.io/en/dev/). We extract only nouns using the tagged text, which has been lemmatized and singularized.

Nouns in the abstract of each document are scored with TF-IDF, the Term Frequency - Inverse Document Frequency. Scoring each noun in each abstract of the retrieved documents generates a hb-graph whose universe is the set of nouns contained in the abstracts. Each hb-edge contains a set of nouns extracted from a given abstract with a multiplicity function that represents the TF-IDF score of each noun. In order to limit the hb-edge size, we keep only the first $k$ words related to an abstract, where $k$ is tunable by the end-user.
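A minimal sketch of this scoring step is given below. It assumes that the nouns have already been extracted for each abstract and uses one common TF-IDF variant (relative term frequency multiplied by the logarithm of the inverse document frequency); the exact variant used by the tool is not specified here, and the function name `tfidf_hb_edges` is hypothetical.

```python
import math


def tfidf_hb_edges(noun_lists, k):
    """Return one hb-edge per abstract: the k best nouns mapped to their TF-IDF score."""
    n_docs = len(noun_lists)
    doc_freq = {}
    for nouns in noun_lists:
        for noun in set(nouns):
            doc_freq[noun] = doc_freq.get(noun, 0) + 1

    edges = []
    for nouns in noun_lists:
        scores = {}
        for noun in set(nouns):
            tf = nouns.count(noun) / len(nouns)          # relative frequency
            idf = math.log(n_docs / doc_freq[noun])      # assumed IDF variant
            scores[noun] = tf * idf
        # Keep the k highest scores; ties are broken alphabetically.
        best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
        edges.append(dict(best))
    return edges


abstracts = [
    ["graph", "graph", "facet"],
    ["graph", "query"],
    ["facet", "query", "query"],
]
edges = tfidf_hb_edges(abstracts, k=1)
```

Each returned dictionary is a hb-edge with real-valued multiplicities, so the resulting hb-graph is in general not natural.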

The fifth facet shows the queries that have been performed during the session: the graph of those queries can be saved. The sixth facet is reserved to show additional information such as the pdf of publications.

Any node on any facet is interactive, allowing information to be highlighted from one facet to another by showing the hb-edges that are mapped through the references. Queries can be built using the vertices of the hb-graph, either isolated or in combination with the current search using AND, OR and NOT through keyboard shortcuts and mouse. The first query is the only one that has to be typed. Merging queries between different users is immediate as they correspond to hb-edges of a hb-graph. Queries can evolve: they are gathered, stored and can be sketched again months later.

Figure 5: Performed search

The surfer has the possibility to display additional information related to authors using dblp, to keywords using DuckDuckGo for disambiguation and Wikipedia.

V Future work and Conclusion

The framework presented in this paper supports visual querying of datasets: it enables full navigability of the corresponding information space. It provides powerful insights into datasets using simultaneous facet visualisation of the information space constructed from the query results. The 2.5D visualisation helps to understand the links in the dataset. This framework is flexible enough to enhance user insight into many other kinds of multimedia content. Nonetheless, the evaluation of such a human-computer interface, as well as the evaluation of the hb-graph visualisation, remain open questions for experts of these fields.

Acknowledgments

This work is part of the PhD of Xavier OUVRARD, done at UniGe, co-supervised by Pr. Stéphane MARCHAND-MAILLET and Dr Jean-Marie LE GOFF, head of the Collaboration Spotting Project. The research is funded by a doctoral position at CERN. The authors want to thank Tullio BASAGLIA from the CERN Library for his very valuable advice and feedback on the interface.

References

  • [1] S. R. Ranganathan, Elements of library classification. 1962.
  • [2] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Diffusion by exchanges in hb-graphs: Highlighting complex relationships extended version,” Under submission, 2019.
  • [3] X. Ouvrard, J. Le Goff, and S. Marchand-Maillet, “Hypergraph modeling and visualisation of complex co-occurence networks,” Electronic Notes in Discrete Mathematics, vol. 70, pp. 65–70, 2018. TCDM 2018 – 2nd IMA Conference on Theoretical and Computational Discrete Mathematics, University of Derby.
  • [4] M. Dörk, N. H. Riche, G. Ramos, and S. Dumais, “Pivotpaths: Strolling through faceted information spaces,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2709–2718, 2012.
  • [5] J. Zhao, C. Collins, F. Chevalier, and R. Balakrishnan, “Interactive exploration of implicit and explicit relations in faceted datasets,” IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2080–2089, 2013.
  • [6] A. Agocs, D. Dardanis, J.-M. Le Goff, and D. Proios, “Interactive graph query language for multidimensional data in collaboration spotting visual analytics framework,” ArXiv e-prints, Dec. 2017.
  • [7] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
  • [8] J. H. Friedman, “On bias, variance, 0/1—loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 1, no. 1, pp. 55–77, 1997.
  • [9] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in VLDB, vol. 98, pp. 194–205, 1998.
  • [10] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, ACM, 1998.
  • [11] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in International conference on database theory, pp. 420–434, Springer, 2001.
  • [12] M. E. Newman, “Scientific collaboration networks. ii. shortest paths, weighted networks, and centrality,” Physical review E, vol. 64, no. 1, p. 016132, 2001.
  • [13] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Adjacency and tensor representation in general hypergraphs. part 2: Multisets, hb-graphs and related e-adjacency tensors,” arXiv preprint arXiv:1805.11952, 2018.
  • [14] X. Ouvrard, J. Le Goff, and S. Marchand-Maillet, “Hb-graph modeling and visualisation of complex co-occurence networks,” Article under writing, 2019.
  • [15] R. Fagin, “Degrees of acyclicity for hypergraphs and relational database schemes,” Journal of the ACM, vol. 30, no. 3, pp. 514–550, 1983.
  • [16] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader, “A performance evaluation of open source graph databases,” PPAA ’14, pp. 11–18, ACM, 2014.
  • [17] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “On hb-graphs and their application to general hypergraph e-adjacency tensor,” Article under submission to the MCCCC32 proceedings, 2019.