I Introduction
Revealing the important information contained in Big Data calls for powerful analytical and visualisation tools. In an information space, meaningful information can be regrouped by hierarchical classification or, non-exclusively, by semantically cohesive categories that are combined to express concepts [1].
When browsing a textual database, traditional verbatim browsers return linear information in the form of a ranked list with a short description of the references. To increase the pertinence of this information, the surfer often has to perform a new search, either by refining the original keywords or by using other pertinent queries that help to refine the retrieved information.
In fact, in most cases, the information space is multi-faceted. Those facets are not independent but are linked by the physical entities contained in the search output. Choosing a type of reference common to these entities enables the construction of a network of co-occurrences for each facet, and navigating between the different facets helps to reveal information. For instance, in the case of scientific publications, different information of interest for the scientist can be regrouped around the publication reference: the authors, the organisation(s) they belong to, the countries of these organisations, the main keywords of an article or the figures inside the publication. All this metadata can give insights into the information space and, as it is available for most of the articles in the database, it can be systematically regrouped to get meaningful information. Choosing the article itself as reference, the facets can show different co-occurrence networks such as co-authors, co-keywords and co-categories. The interaction between the different facets is ensured by the links between them. The choice of the reference type is not unique: any convenient metadata type in the dataset can be used. Co-occurrences can contain repetitions or an individual weighting: multisets, and particularly natural multisets, allow them while sets do not.
We propose in this article another way to explore an information space by using hyper-bag-graphs (hb-graphs for short), an extension of hypergraphs to families of multisets. In [2], we show that hb-graphs enhance an exchange-based diffusion over co-occurrence networks, providing a fine vertex and hb-edge ranking that accounts for a hb-edge-based weighting of vertices.
This paper presents a hyper-bag framework of co-occurrence networks, extending the visualisation part of the hypergraph framework sketched in [3] to hb-graphs. This framework supports browsing of an information space and visual queries on a dataset. The framework is validated from a theoretical point of view and with a use case: we have implemented a 2.5D interface to visualise different facets of the Arxiv information space and perform visual queries.
Section II lists the related work and the mathematical background. Section III presents the hbgraph framework. Section IV gives results and Section V concludes and addresses future work.
II Related work and mathematical background
II-A Information space discovery
Discovering knowledge in an information space requires gathering meaningful information, organised either hierarchically or semantically. Semantics supports the definition of facets within an information space [1].
Navigation and visualisation of information spaces have been achieved by several authors in many different ways. In [4], a pivot is used to stroll between three facets of the information space; the approach is limited to the visualisation of a small number of pivots at the same time. This visualisation is based on a tripartite graph. In [5], an interactive exploration of implicit and explicit relations in faceted datasets is proposed. The visualisation space is shared between different metadata with cross findings between metadata, partitioning the space in categories.
[6] proposes a visual analytics graph-based framework that reveals this information and facilitates knowledge discovery. It provides insights into different facets of an information space based on user-selected perspectives. The dataset is stored as a labelled graph in a graph database. Choosing a perspective as reference and a facet as dimension, the paths of the labelled graph that go through reference vertices and whose extremities have the same dimension are retrieved. Visualisation comes in the form of navigable node-link graphs: edges materialise common references between vertices and are seen as pairwise collaborations between two vertices.
II-B Co-occurrence networks
Data mining is only one step in the knowledge discovery processing chain [7]. While numerical data allow rich statistics on the instances, non-numerical data mining often consists in summarising data as occurrences. Some other approaches exist. They often consist in regrouping data instances through similarities, using techniques such as nearest neighbours [8]: using a threshold, links are established between different occurrences. These approaches rely on similarities: the curse of dimensionality is nonetheless a limiting factor in their use [9], even if some techniques exist to limit its impact [10, 11]. A last way of finding occurrences is to retrieve links through the dataset itself. If the dataset reflects existing links, the job is easier since an inherent network can be built through the data instances. A typical example is groups of friends in social networks. When the links exist, the collaborations are derived from them. Nonetheless, links are often neither direct nor tangible: in this case, occurrences need to be built or processed from the dataset.
A dataset can be a set of physical references, stored as rows in traditional relational databases. Each physical reference has a metadata instance attached to it. Some of the types of the metadata instances are of interest for visualisation and some for processing additional information. The set of physical references and metadata instances used for visualisation constitute the types of the network, each type being seen either as a reference or as a facet of the information space. This allows, as will be explained in the next section, the retrieval of co-occurrences in one facet, based on one reference type, which can differ from the physical reference.
II-C Multisets and hb-graphs
Co-occurrences can be seen as collaborations and therefore constitute, with their links, a network. A collaboration is an $n$-adic relationship between occurrences, as mentioned in [12], and therefore modelling is often done with hypergraphs, i.e. families of sets over a vertex set. But hypergraphs support neither hyperedge-based repetition nor hyperedge-based weighting of vertices. We introduced hb-graphs in [13, 2, 14] to extend the concept of hypergraphs to families of multisets with same universe, called the vertex set.
Multisets, also known as bags or msets, have been used for a long time in many domains, in particular in text representation [2]. A multiset on a universe $A$ is a couple $A_m = (A, m)$ where $A$ is a set and $m$ is an application from $A$ to $W \subseteq \mathbb{R}^{+}$ called the multiplicity function of the multiset $A_m$. Elements that have a nonzero multiplicity are gathered in the support of the multiset, written $A_m^{\star} = \{x \in A : m(x) \neq 0\}$. If the range of the multiplicity function is a subset of $\mathbb{N}$, the multiset is called a natural multiset. In this case, the multiset can be seen as a non-ordered list of repeated elements. The empty mset of universe $A$, written $\emptyset_A$, is the multiset of empty support on the universe $A$.
Several notations of msets exist. Among the common notations mentioned in [13], we write in this article a mset $A_m$ of universe $A = \{x_i : i \in [n]\}$ as:
$$A_m = \left\{x_i^{m_i} : i \in [n]\right\}$$
where $m_i = m(x_i)$. (A natural multiset can also be expressed as an unordered list with repetition: $\{\{\underbrace{x_1, \ldots, x_1}_{m_1 \text{ times}}, \ldots, \underbrace{x_n, \ldots, x_n}_{m_n \text{ times}}\}\}$.)
Different operations are defined between two multisets $A_{m_A}$ and $B_{m_B}$ of same universe $U$, with respective multiplicity functions $m_A$ and $m_B$. In particular, as it is used in this paper, the additive union of $A_{m_A}$ and $B_{m_B}$ is the multiset $C_{m_C} = A_{m_A} \uplus B_{m_B}$ of universe $U$ such that, for all $x \in U$,
$$m_C(x) = m_A(x) + m_B(x).$$
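As an informal illustration of these definitions, a natural multiset can be sketched in Python with `collections.Counter`, whose `+` operator sums multiplicities element by element and therefore behaves like the additive union for positive multiplicities. The variable names and example values below are illustrative assumptions, not part of the framework's implementation.

```python
from collections import Counter

# A natural multiset: each element is mapped to a non-negative integer
# multiplicity (elements absent from the Counter have multiplicity 0).
keywords_a = Counter({"hypergraph": 2, "multiset": 1})
keywords_b = Counter({"multiset": 3, "facet": 1})


def support(mset):
    """Return the support: the elements with non-zero multiplicity."""
    return {x for x, m in mset.items() if m != 0}


# Additive union: multiplicities are summed element by element.
additive_union = keywords_a + keywords_b

# Unordered list with repetition (only meaningful for natural multisets);
# it is sorted here just to obtain a reproducible display order.
as_list = sorted(keywords_a.elements())
```

Real-valued multiplicities would need a plain dictionary with an explicit sum, since `Counter` is designed around counts.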
Considering a set of vertices $V = \{v_j : j \in [n]\}$, a hb-graph $\mathcal{H} = (V, E)$ is a family of multisets $E = (e_i)_{i \in [p]}$ having same universe $V$ and with support a subset of $V$. The elements of $E$ are called the hb-edges. As a multiset, each hb-edge $e_i$ has a multiplicity function associated to it: $m_{e_i} : V \to W$ where $W \subseteq \mathbb{R}^{+}$. For a general hb-graph, each hb-edge has to be seen as a weighted system of vertices, where the weights of each vertex are hb-edge dependent.
When the multiplicity range of each hb-edge is a subset of $\mathbb{N}$, the hb-graph is said to be natural. A hypergraph is a natural hb-graph where every vertex in any hb-edge has a binary value, 0 or 1, for multiplicity.
Considering the family of hb-edge supports of a hb-graph $\mathcal{H} = (V, E)$, we can define its support hypergraph as the hypergraph $\underline{\mathcal{H}} = (V, \underline{E})$, where $\underline{E} = \{e_i^{\star} : i \in [p]\}$ and $e_i^{\star}$ is the support of the hb-edge $e_i$. The support hypergraph is unique for a given hb-graph. But reconstructing the hb-graph from a support hypergraph generates an infinite number of hb-graphs, showing that the information contained in a hb-graph is denser than in a hypergraph.
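The relation between a hb-graph and its support hypergraph can be sketched as follows. This is a minimal sketch that assumes a hb-graph is stored as a list of dictionaries mapping each vertex to its hb-edge dependent multiplicity; the organisation names and values are purely illustrative.

```python
# A hb-graph sketched as a list of hb-edges; each hb-edge maps a vertex to its
# (hb-edge dependent) multiplicity. Real-valued multiplicities are allowed.
hb_edges = [
    {"CERN": 2, "UniGe": 1},
    {"CERN": 1, "UniGe": 2},
    {"UniGe": 0.5},
]


def support_hypergraph(hb_edges):
    """Keep, for each hb-edge, only the set of vertices of non-zero multiplicity."""
    return [frozenset(v for v, m in e.items() if m != 0) for e in hb_edges]


def is_natural(hb_edges):
    """A hb-graph is natural when every multiplicity is a non-negative integer."""
    return all(isinstance(m, int) and m >= 0 for e in hb_edges for m in e.values())


hyperedges = support_hypergraph(hb_edges)
```

The first two hb-edges differ only by their multiplicities and collapse to the same hyperedge, which illustrates why several hb-graphs share one support hypergraph.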
We represent hb-graphs using an unnormalised extra-node representation [13]: an extra-node is added per hb-edge, and the thickness of the link between it and each vertex of the hb-edge support is proportional to the multiplicity of the vertex in this hb-edge. The support hypergraph of the hb-graph constitutes a simplified representation.
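The data behind such an extra-node drawing can be sketched as a list of links, one per vertex of each hb-edge support, with a width proportional to the multiplicity. The function name, the `e0`, `e1`, ... identifiers and the `base_width` parameter are hypothetical; the actual rendering is left to a drawing library.

```python
def extra_node_links(hb_edges, base_width=1.0):
    """Return (extra_node, vertex, width) triples for an extra-node drawing."""
    links = []
    for i, edge in enumerate(hb_edges):
        extra_node = f"e{i}"  # hypothetical identifier of the added extra-node
        for vertex, multiplicity in edge.items():
            if multiplicity != 0:  # only vertices of the hb-edge support are linked
                links.append((extra_node, vertex, base_width * multiplicity))
    return links


links = extra_node_links([{"a": 2, "b": 1}, {"b": 3}])
```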
III Hb-graph framework
Multidimensional datasets are formed of data that are linked to physical entities. For instance, a publication, a person or a piece of music are possible physical entities. Metadata of various kinds and types, numerical or not, are attached to the physical entities, including their own reference.
Statistics can be easily performed on numerical types. For non-numerical types, only co-occurrence gathering is easily achievable with traditional charts and arrays. But choosing one of these non-numerical types as a reference to build co-occurrences enhances navigation in the information space. The navigation is a simplification of the one presented in [3]. The hb-graph framework extends the facet visualisation achieved in [3] to support multisets instead of sets, using interconnected hb-graphs at the level of the data instances, called visualisation hb-graphs.
III-A Enhancing navigation
Traditional database structures can be seen as hypergraphs where the hyperedges reflect the table headers and the vertices the metadata instances. Normalized forms of such databases are linked to properties of the schema hypergraph [15]. In graph databases, the schema (although not required [16]) represents the relationships between the vertex types. The schema hypergraph represents these relationships as hyperedges.
If database knowledge extraction processing is performed, such as natural language processing, the schema hypergraph becomes an extended schema hypergraph $\overline{\mathcal{H}}_{Sch} = (\overline{V}_{Sch}, \overline{E}_{Sch})$. Some of the types in the extended schema are of interest neither for visualisation nor for being used as reference: we consider the types of interest as a subset $U$ of $\overline{V}_{Sch}$ from which we generate the extracted extended schema hypergraph $\mathcal{H}_X = (V_X, E_X)$ where $V_X = U$ and $E_X = \{e \cap U : e \in \overline{E}_{Sch},\ e \cap U \neq \emptyset\}$.
Only vertices of $\mathcal{H}_X$ that are reachable, i.e. vertices with a simple path in between them, can be further navigated and used either as reference or visualisation type. Therefore, we build the reachability hypergraph $\mathcal{H}_R = (V_R, E_R)$ with $V_R = V_X$ as its vertex set, individual hyperedges of $E_R$ being the connected components of $\mathcal{H}_X$. The hyperedges of $\mathcal{H}_R$ are not connected to each other, as the connected components of a hypergraph constitute a partition of its vertex set. When the reachability hypergraph has only one hyperedge, the whole dataset is navigable: it is the ideal situation.
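The construction of the reachability hyperedges amounts to computing the connected components of a hypergraph. A minimal sketch with a union-find structure is given below; the metadata types and schema hyperedges are invented for the illustration, and every hyperedge is assumed to contain only types of interest.

```python
def connected_components(vertices, hyperedges):
    """Partition the vertex set into the connected components of a hypergraph."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links up to the representative, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # All the vertices of a hyperedge belong to the same component.
    for edge in hyperedges:
        members = list(edge)
        for other in members[1:]:
            parent[find(other)] = find(members[0])

    components = {}
    for v in vertices:
        components.setdefault(find(v), set()).add(v)
    return list(components.values())


types_of_interest = {"publication id", "authors", "keywords", "sensor id", "location"}
schema_hyperedges = [
    {"publication id", "authors"},
    {"publication id", "keywords"},
    {"sensor id", "location"},
]
reachability_hyperedges = connected_components(types_of_interest, schema_hyperedges)
```

With these assumed inputs the dataset splits into two separately navigable parts, one around publications and one around sensors.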
We make the assumption that, in each hyperedge of the reachability hypergraph, there exists a metadata type or a combination of metadata types that can be chosen as the physical reference. The data instances related to these references are supposed to be unique. For instance, in a publication dataset the physical reference is the id of the publication itself.
The last hypergraph at the metadata level, the navigation hypergraph $\mathcal{H}_N$, is defined by choosing a hyperedge $e_R$ of the reachability hypergraph and a nonempty subset $R$ of $e_R$ of possible reference types of interest. The choice of a subset $R$ of $e_R$ allows the remaining vertices of $e_R$ to be considered as visualisation vertex types, which are used to generate the facet visualisation hb-graphs and are called the visualisation types. Hence the set of hyperedges of $\mathcal{H}_N$ is $E_N = \{e_R \setminus A : A \subseteq R,\ A \neq \emptyset\}$. Navigation without changing references is possible only in one hyperedge of $\mathcal{H}_N$ at a time. The simplest case happens when there is only one reference of interest selected at a time in $R$; we restrict ourselves to this case for the moment, i.e. we consider for $E_N$ the set $\{e_R \setminus \{r\} : r \in R\}$.
In a publication dataset, typical metadata types are: publication id, title, abstract, authors, affiliations, addresses, author keywords, publication categories, countries, organisations, etc. (the metadata of interest for visualisation or referencing being publication id, authors, author keywords, publication categories, countries and organisations). There are many different navigation hyperedges possible: for instance, choosing the publication ids as reference, the navigation hyperedge is {authors, author keywords, organisations, country, publication categories}; choosing author keywords as reference, the navigation hyperedge is {authors, organisations, country, publication category, publication ids}.
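The single-reference case can be sketched directly with set differences: each candidate reference type yields the navigation hyperedge made of the remaining types. The function name is hypothetical and the type names simply reuse the publication example above.

```python
# Types of interest that are reachable from one another (one reachability hyperedge).
reachability_hyperedge = frozenset({
    "publication ids", "authors", "author keywords",
    "organisations", "country", "publication categories",
})
# Candidate reference types chosen among them.
reference_types = {"publication ids", "author keywords"}


def navigation_hyperedges(reachability_hyperedge, reference_types):
    """Single-reference case: one navigation hyperedge per candidate reference type."""
    return {r: reachability_hyperedge - {r} for r in reference_types}


navigation = navigation_hyperedges(reachability_hyperedge, reference_types)
```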
III-B Facet visualisation hb-graphs
Each physical entity in a dataset is described by a unique physical reference and a set of data instances of different types. In [3], we use sets to store co-occurrences. Nonetheless, in many cases it is worth storing additional information by attaching a multiplicity, with non-negative integer or real values, to the elements of co-occurrences. For instance, in a publication dataset, different authors can have the same affiliation organisation; retrieving one occurrence of the organisation per author induces repetitions, and hence a natural multiset. When considering keywords, their relative frequency in the document can be used as multiplicity. From [2], we know that hb-graphs allow a refined ranking of the information. Hence, the above facts motivate the usage of multisets to store co-occurrences.
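We write $A_{\alpha,r}$ the multiset of values of type $\alpha$, possibly empty, that are attached to $r$, the physical entity, and $A_{\alpha,r}^{\star}$ its support. $r$ is entirely described by its reference and the family of multisets that corresponds to co-occurrences of the different types $\alpha$ in the chosen hyperedge $e_R$ of the reachability hypergraph linked to the physical reference, i.e. $\left(r, \left(A_{\alpha,r}\right)_{\alpha \in e_R}\right)$.
Performing a search on the dataset retrieves a set $\mathcal{S}$ of physical references. In the single-reference-restricted navigation hypergraph, each hyperedge $e_N$ describes the accessible facets relatively to a chosen reference type $\rho$. Given a type $\alpha \in e_N$, the associated facet shows the visualisation hb-graph $\mathcal{H}_{\alpha/\rho,\mathcal{S}}$ where the hb-edges are the co-occurrences of type $\alpha$ relatively to reference instances of type $\rho$ ($\alpha/\rho$ for short) retrieved from the different references in $\mathcal{S}$.
We then build the co-occurrences $\alpha/\rho$ by considering the set of all values of type $\rho$ attached to all the references $r \in \mathcal{S}$: $\Sigma_\rho = \bigcup_{r \in \mathcal{S}} A_{\rho,r}^{\star}$. Each element $s$ of $\Sigma_\rho$ is mapped to the set $R_s = \{r \in \mathcal{S} : s \in A_{\rho,r}^{\star}\}$ of physical references in which it appears: we write $r_\rho : s \mapsto R_s$ the mapping. The multiset of values of type $\alpha$ relatively to the reference instance $s$ is the additive union
$$e_{\alpha,s} = \biguplus_{r \in R_s} A_{\alpha,r}.$$
The raw visualisation hb-graph for the facet of type $\alpha/\rho$ attached to the search $\mathcal{S}$ is then defined as:
$$\mathcal{H}_{\alpha/\rho,\mathcal{S}} = \left(\bigcup_{r \in \mathcal{S}} A_{\alpha,r}^{\star},\ \left(e_{\alpha,s}\right)_{s \in \Sigma_\rho}\right).$$
Since some hb-edges can possibly point to the same sub-mset of vertices, we build a reduced visualisation weighted hb-graph from the raw visualisation hb-graph. To achieve it we define $g_\alpha : s \in \Sigma_\rho \mapsto e_{\alpha,s}$ and the equivalence relation $\mathcal{R}$ such that: $\forall s_1 \in \Sigma_\rho$, $\forall s_2 \in \Sigma_\rho$: $s_1 \mathcal{R} s_2 \Leftrightarrow g_\alpha(s_1) = g_\alpha(s_2)$.
Considering a quotient class $\overline{s} \in \Sigma_\rho/\mathcal{R}$ (where $\Sigma_\rho/\mathcal{R}$ is the quotient set of $\Sigma_\rho$ by $\mathcal{R}$), we write $\overline{e}_{\alpha,\overline{s}} = g_\alpha(s_0)$ where $s_0 \in \overline{s}$.
$\overline{E}_\alpha = \{\overline{e}_{\alpha,\overline{s}} : \overline{s} \in \Sigma_\rho/\mathcal{R}\}$ is the support set of the multiset $\{\{e_{\alpha,s} : s \in \Sigma_\rho\}\}$: $\overline{e}_{\alpha,\overline{s}}$ is of multiplicity $w_\alpha(\overline{e}_{\alpha,\overline{s}}) = |\overline{s}|$ in this multiset.
It yields:
$$\{\{e_{\alpha,s} : s \in \Sigma_\rho\}\} = \left\{\overline{e}_{\alpha,\overline{s}}^{\,w_\alpha(\overline{e}_{\alpha,\overline{s}})} : \overline{s} \in \Sigma_\rho/\mathcal{R}\right\}.$$
Let $\tilde{g}_\alpha : \overline{s} \in \Sigma_\rho/\mathcal{R} \mapsto \overline{e}_{\alpha,\overline{s}} \in \overline{E}_\alpha$, then $\tilde{g}_\alpha$ is bijective. $\tilde{g}_\alpha^{-1}$ allows the retrieval of the class associated to a given hb-edge, and hence of the values of $\Sigma_\rho$ associated to this class, which will be important for navigation. The references associated to $\overline{e} \in \overline{E}_\alpha$ are $\bigcup_{s \in \tilde{g}_\alpha^{-1}(\overline{e})} R_s$. The reduced visualisation weighted hb-graph for the search $\mathcal{S}$ is defined as
$$\overline{\mathcal{H}}_{\alpha/\rho,\mathcal{S}} = \left(\bigcup_{r \in \mathcal{S}} A_{\alpha,r}^{\star},\ \overline{E}_\alpha,\ w_\alpha\right).$$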
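The two steps of this construction, building one raw hb-edge per reference instance and then merging identical hb-edges into weighted ones, can be sketched as follows. The data layout (a dictionary from physical reference to typed multisets), the function names and the sample values are assumptions made for the illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical search output: physical reference -> {type: multiset of values}.
search = {
    "p1": {"keyword": Counter({"hb-graph": 1}),
           "organisation": Counter({"CERN": 2, "UniGe": 1})},
    "p2": {"keyword": Counter({"hb-graph": 1, "facet": 1}),
           "organisation": Counter({"CERN": 1})},
    "p3": {"keyword": Counter({"multiset": 1}),
           "organisation": Counter({"CERN": 1})},
}


def raw_hb_edges(search, facet_type, reference_type):
    """Map each reference instance to the additive union of the facet multisets
    of the physical references in which that instance appears."""
    physical_refs = defaultdict(set)
    for ref, data in search.items():
        for instance in data.get(reference_type, Counter()):
            physical_refs[instance].add(ref)
    edges = {}
    for instance, refs in physical_refs.items():
        edge = Counter()
        for ref in refs:
            edge += search[ref].get(facet_type, Counter())
        edges[instance] = edge
    return edges, physical_refs


def reduce_hb_edges(edges):
    """Group the reference instances whose hb-edges are identical."""
    classes = defaultdict(set)
    for instance, edge in edges.items():
        classes[frozenset(edge.items())].add(instance)
    return classes


edges, physical_refs = raw_hb_edges(search, "organisation", "keyword")
classes = reduce_hb_edges(edges)
# The weight of a reduced hb-edge is the size of its class of reference instances.
weights = {key: len(instances) for key, instances in classes.items()}
```

Keeping the class of reference instances next to each reduced hb-edge is what later allows a selection in one facet to be traced back to physical references.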
The support hypergraphs of these hb-graphs can be used to retrieve the results given for hypergraphs in [3].
III-C Navigability through facets
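As, for a given search $\mathcal{S}$ and a given reference type $\rho$, the set $\Sigma_\rho$ of reference instances and the physical references attached to each of them are fixed, the navigability can be ensured between the different facets. We consider a type $\alpha$, its visualisation hb-graph $\mathcal{H}_{\alpha/\rho,\mathcal{S}}$ and a subset $A$ of the vertex set of $\mathcal{H}_{\alpha/\rho,\mathcal{S}}$. We target another type $\alpha'$ of co-occurrences referring to $\rho$ to be visualised. We illustrate the navigation in Figure 1.
We suppose that the user selects the elements of $A$ as vertices of interest from which she wants to switch facet. The hb-edges of the reduced visualisation hb-graph which contain at least one element of $A$ are gathered in $\overline{E}_{\alpha|A}$. Using the application $\tilde{g}_\alpha^{-1}$, which maps a reduced hb-edge to its class of reference instances, we retrieve the classes of references of type $\rho$ associated to the elements of $\overline{E}_{\alpha|A}$, to build the set $\mathcal{V}_{\rho,A} = \{\tilde{g}_\alpha^{-1}(e) : e \in \overline{E}_{\alpha|A}\}$ of references of type $\rho$ involved in the building of co-occurrences of type $\alpha'$. Each of the classes in $\mathcal{V}_{\rho,A}$ contains instances of type $\rho$ that are gathered in a set $\Sigma_{\rho,A} = \bigcup_{\overline{s} \in \mathcal{V}_{\rho,A}} \overline{s}$. Each element $s$ of $\Sigma_{\rho,A}$ is linked to a set of physical references by the mapping $r_\rho$. Hence we obtain the physical reference set involving elements of $A$:
$$\mathcal{S}_A = \bigcup_{s \in \Sigma_{\rho,A}} r_\rho(s).$$
The raw visualisation hb-graph in the targeted facet is now built using $\mathcal{S}_A$ as search set. To obtain the reduced weighted version we use the same approach as above. The multiset of co-occurrences retrieved includes all occurrences that have co-occurred with the references attached to one of the elements of $A$ selected in the first facet. Of course, if $A$ is the whole vertex set of $\mathcal{H}_{\alpha/\rho,\mathcal{S}}$, the reduced visualisation hb-graph contains all the instances of type $\alpha'$ attached to physical entities of the search $\mathcal{S}$.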
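A facet switch can be sketched end to end under simplifying assumptions: the search output is a dictionary of physical references, values are stored as lists with repetition, and the reference instances reached by a selection are collected directly rather than through their equivalence classes (the union of the classes gives the same set). All names and data are hypothetical.

```python
from collections import Counter, defaultdict

search = {
    "p1": {"keyword": ["hb-graph"], "organisation": ["CERN", "UniGe"], "author": ["A", "B"]},
    "p2": {"keyword": ["hb-graph", "facet"], "organisation": ["CERN"], "author": ["B"]},
    "p3": {"keyword": ["multiset"], "organisation": ["EPFL"], "author": ["C"]},
}


def hb_edges(search, facet_type, reference_type):
    """Return {reference instance: hb-edge} and {reference instance: physical refs}."""
    refs_of = defaultdict(set)
    for ref, data in search.items():
        for instance in data.get(reference_type, []):
            refs_of[instance].add(ref)
    edges = {
        instance: sum((Counter(search[r].get(facet_type, [])) for r in refs), Counter())
        for instance, refs in refs_of.items()
    }
    return edges, refs_of


def switch_facet(search, reference_type, source_type, selected, target_type):
    """Restrict the search to the physical references reached from the selected
    vertices of the source facet, then build the hb-edges of the target facet."""
    source_edges, refs_of = hb_edges(search, source_type, reference_type)
    # Reference instances whose source hb-edge contains a selected vertex.
    hit = [s for s, edge in source_edges.items() if any(edge[v] > 0 for v in selected)]
    reduced_refs = set().union(*(refs_of[s] for s in hit))
    reduced_search = {r: search[r] for r in reduced_refs}
    return hb_edges(reduced_search, target_type, reference_type)[0]


target_edges = switch_facet(search, "keyword", "organisation", {"UniGe"}, "author")
```

Selecting `UniGe` in the organisation facet keeps the publications `p1` and `p2`, so the author facet also shows the hb-edge of the keyword `facet`, which co-occurs with `hb-graph` in `p2`.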
The reference type can always be shown in one of the facets as a visualisation hb-graph, which is in fact a hypergraph where each hb-edge consists of the reference itself, with a multiplicity equal to the number of times the reference occurs in the hb-graph.
Ultimately, by building a multidimensional network organized around types, one can retrieve very valuable information from combined data sources. This process can be extended to any number of data sources as long as they share one or more types. Otherwise the reachability hypergraph is not connected and only separated navigations are possible. Figure 2 shows some examples of visualisation hypergraphs.
Figure 2: (a) A publication network. (b) Reference: publication; facet: organization; view: hb-graph extra-node representation. (c) and (d) Reference: keywords; facet: organization; (c) view: hb-graph extra-node representation; (d) view: hb-graph support hypergraph extra-node representation.
III-D The DataHbEdron
The DataHbEdron provides soft navigation between the different facets of the information space. It was introduced in [3] for hypergraphs, and its principle is similar with hb-graph support.
Each facet of the information space corresponding to a visualisation type includes a visualisation hb-graph viewed in its 2D extra-node representation with a normalised thickness on hb-edges [17]. The different facets are embedded in a 2.5D representation called the DataHbEdron. The DataHbEdron can be toggled between a cube (Figure 3) and a carousel shape to ease the navigation between facets. The reference facet is presented as a list of references corresponding to the search output.
In the DataHbEdron, the faces show different facets of the information space: the underlying visualisation hb-graphs allow, as previously explained, the navigability through facets. Hb-edges are selectable interactively between the different facets; as each hb-edge is linked to a subset of the references, the corresponding references can be used to highlight information in the different facets as well as in the face containing the reference visualisation hb-graph.
IV Results
We applied this framework to perform searches and visual queries on the Arxiv database. The results are visualised in the DataHbEdron, allowing simultaneous visualisation of the different facets of the information space, which consists of authors, extracted keywords and subject categories. The tool developed is now part of the Collaboration Spotting family (http://collspotting.web.cern.ch/). When performing a search, the standard Arxiv API (https://arxiv.org/help/api/index) is used to query the Arxiv database. The queries can be formulated either by a text entry or interactively, directly using the visualisation: queries include single words or multiple words, with AND, OR and NOT as possible operators and parenthesis groupings. The querying history is stored and presented as an interactive hb-graph to allow the visual construction of complex queries, including refinement of the queries already performed. Each time a new query is formulated, the corresponding metadata is retrieved through the Arxiv API.
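As a rough sketch of how such a query could be sent, the request URL can be assembled with the standard library. The endpoint and the `search_query`, `start` and `max_results` parameters follow the public arXiv API documentation as far as we are aware, where negation is written `ANDNOT`; the helper name and the way the tool itself translates its operators are assumptions. No network call is made here.

```python
from urllib.parse import urlencode

# Public query endpoint of the arXiv API (assumed from its user manual).
ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(search_query, max_results=10, start=0):
    """Assemble the URL of a search request without sending it."""
    params = {"search_query": search_query, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


# Two words combined with AND, each searched in all fields.
url = build_query_url("all:hypergraph AND all:visualisation", max_results=5)
```

The response is an Atom feed, so a feed or XML parser would be needed to extract authors, categories and abstracts from it.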
When performing a search on Arxiv, the query is transformed into a vector of words. The most relevant documents are retrieved based on a similarity measure between the query vector and the word vectors associated to individual documents. Arxiv relies on Lucene’s built-in Vector Space Model of information retrieval and the Boolean model (https://lucene.apache.org/core/2_9_4/scoring.html). The Arxiv API returns the metadata associated to the documents with the highest scores for the query performed. We keep only the first $n$ answers, with $n$ tunable by the end user. This metadata, filled in by authors during the submission of a preprint, contains different information such as authors, Arxiv categories and abstract.
The information space contains four main facets: the first facet shows the Arxiv reference visualisation hb-graph with a contextual sentence related to the query and links to the Arxiv article’s presentation and pdf. This first facet layout is similar to classical textual search engines (Figure 4).
The second facet corresponds to the co-authors of the articles, using the publication as reference. The third facet depicts the co-keywords extracted from the abstracts. The fourth facet shows the Arxiv categories involved in the references.
Co-keywords are extracted from the abstracts using TextBlob, a natural language processing Python library (https://textblob.readthedocs.io/en/dev/). We extract only nouns using the tagged text, which has been lemmatized and singularized.
Nouns in the abstract of each document are scored with TF-IDF, the Term Frequency-Inverse Document Frequency score. Scoring each noun in each abstract of the retrieved documents generates a hb-graph whose universe is the set of nouns contained in the abstracts. Each hb-edge contains the set of nouns extracted from a given abstract, with a multiplicity function that represents the TF-IDF score of each noun. In order to limit the hb-edge size, we keep only the first $k$ words related to an abstract, where $k$ is tunable by the end user.
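The scoring step can be sketched in plain Python once the nouns of each abstract are available. The text does not state which TF-IDF variant is used, so the sketch assumes one common form (relative term frequency multiplied by the logarithm of the inverse document frequency); the function name, the `top_k` parameter and the sample nouns are hypothetical.

```python
import math
from collections import Counter


def tfidf_hb_edges(noun_lists, top_k=3):
    """One hb-edge per abstract: noun -> TF-IDF score, truncated to top_k nouns."""
    n_docs = len(noun_lists)
    # Document frequency: number of abstracts in which each noun appears.
    doc_freq = Counter(noun for nouns in noun_lists for noun in set(nouns))
    hb_edges = []
    for nouns in noun_lists:
        counts = Counter(nouns)
        total = sum(counts.values())
        scores = {
            noun: (count / total) * math.log(n_docs / doc_freq[noun])
            for noun, count in counts.items()
        }
        # Highest scores first; ties are broken alphabetically for reproducibility.
        best = sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_k]
        hb_edges.append(dict(best))
    return hb_edges


noun_lists = [
    ["graph", "graph", "facet"],
    ["graph", "query"],
    ["query", "facet", "facet", "dataset"],
]
edges = tfidf_hb_edges(noun_lists, top_k=2)
```

With this variant, a noun present in every abstract gets a zero score, so it stays in the universe but carries no weight in the hb-edges that keep it.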
The fifth facet shows the queries that have been performed during the session: the graph of those queries can be saved. The sixth facet is reserved to show additional information such as the pdf of publications.
Any node on any facet is interactive, allowing information to be highlighted from one facet to another by showing the hb-edges that are mapped through the references. Queries can be built using the vertices of the hb-graph, either isolated or in combination with the current search using AND, OR and NOT through keyboard shortcuts and mouse. The first query is the only one that has to be typed. Merging queries between different users is immediate as they correspond to hb-edges of a hb-graph. Queries can evolve: they are gathered, stored and can be re-sketched months later.
The surfer has the possibility to display additional information related to authors using dblp, and to keywords using DuckDuckGo for disambiguation and Wikipedia.
V Future work and Conclusion
The framework presented in this paper supports visual queries of datasets: it enables full navigability of the corresponding information space. It provides powerful insights into datasets using simultaneous facet visualisation of the information space constructed from the query results. The 2.5D visualisation helps to understand the links in the dataset. This framework is flexible enough to enhance user insight into many other types of multimedia content. Nonetheless, the evaluation of such a human-computer interface, as well as the evaluation of the hb-graph visualisation, remain open questions for experts of these fields.
Acknowledgments
This work is part of the PhD of Xavier OUVRARD, done at UniGe, co-supervised by Pr. Stéphane MARCHAND-MAILLET and Dr Jean-Marie LE GOFF, head of the Collaboration Spotting Project. The research is funded by a doctoral position at CERN. The authors want to thank Tullio BASAGLIA from the CERN Library for his very precious advice and feedback on the interface.
References
 [1] S. R. Ranganathan, Elements of library classification. 1962.
 [2] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Diffusion by exchanges in hb-graphs: Highlighting complex relationships extended version,” Under submission, 2019.
 [3] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Hypergraph modeling and visualisation of complex co-occurence networks,” Electronic Notes in Discrete Mathematics, vol. 70, pp. 65–70, 2018. TCDM 2018 – 2nd IMA Conference on Theoretical and Computational Discrete Mathematics, University of Derby.
 [4] M. Dörk, N. H. Riche, G. Ramos, and S. Dumais, “PivotPaths: Strolling through faceted information spaces,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2709–2718, 2012.
 [5] J. Zhao, C. Collins, F. Chevalier, and R. Balakrishnan, “Interactive exploration of implicit and explicit relations in faceted datasets,” IEEE Transactions on Visualization and Computer Graphics, vol. 19, no. 12, pp. 2080–2089, 2013.
 [6] A. Agocs, D. Dardanis, J.-M. Le Goff, and D. Proios, “Interactive graph query language for multidimensional data in collaboration spotting visual analytics framework,” ArXiv e-prints, Dec. 2017.
 [7] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
 [8] J. H. Friedman, “On bias, variance, 0/1-loss, and the curse-of-dimensionality,” Data mining and knowledge discovery, vol. 1, no. 1, pp. 55–77, 1997.
 [9] R. Weber, H.-J. Schek, and S. Blott, “A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces,” in VLDB, vol. 98, pp. 194–205, 1998.
 [10] P. Indyk and R. Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613, ACM, 1998.
 [11] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space,” in International conference on database theory, pp. 420–434, Springer, 2001.
 [12] M. E. Newman, “Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality,” Physical review E, vol. 64, no. 1, p. 016132, 2001.
 [13] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Adjacency and tensor representation in general hypergraphs. Part 2: Multisets, hb-graphs and related e-adjacency tensors,” arXiv preprint arXiv:1805.11952, 2018.
 [14] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “Hb-graph modeling and visualisation of complex co-occurence networks,” Article under writing, 2019.
 [15] R. Fagin, “Degrees of acyclicity for hypergraphs and relational database schemes,” Journal of the ACM, vol. 30, no. 3, pp. 514–550, 1983.
 [16] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader, “A performance evaluation of open source graph databases,” PPAA ’14, pp. 11–18, ACM, 2014.
 [17] X. Ouvrard, J.-M. Le Goff, and S. Marchand-Maillet, “On hb-graphs and their application to general hypergraph e-adjacency tensor,” Article under submission to the MCCCC32 proceedings, 2019.