Today, user data can be easily acquired from various domains, ranging from the social Web to medical records, scientific publications, and retail store receipts. This data is characterized by a combination of demographics (e.g., age, gender, occupation) and actions (e.g., rating a movie, publishing a paper, following a medical treatment). Explorers, in their role as data scientists, rely on user data to conduct large-scale population studies and to gain insights into the preferences of different population segments. Explorers, in their role as information consumers, use the social Web for routine tasks such as choosing a restaurant.
In this paper, we introduce Vexus (Visualizing and EXploring User GroupS), a visualization framework which lets explorers understand user data. Explorers can discover and visualize groups of similar users along different dimensions of their choice. They can navigate in the group space and obtain more details on demand. Vexus helps them complete two kinds of tasks: reach a single group of interest (e.g., finding an audience group for targeted advertisement) or collect users among different groups (e.g., forming a geographically diverse set of experts for a conference committee).
User data exploration is a challenging task for two main reasons. First, the sheer size of the data is daunting. For instance, BookCrossing, a book rating dataset, contains one million ratings of 278,858 users for 271,379 books. Second, explorers often have only a partial understanding of the data and of their own needs. To tackle this challenge, we build Vexus on two principles: aggregated analytics [1, 5] and interactivity. The former suggests a group-based analysis which addresses noise and sparsity and enables new findings in a more granular space. The latter stands in opposition to a more traditional single-shot analysis, and provides means to iteratively refine one’s understanding.
Aggregated Analytics. The aggregation of users’ demographics and actions forms groups such as “young professionals in Paris” and “female teenagers who watch romantic movies”. All group members share common demographics and actions that describe the group. Group-based analytics provides a summarized view of user data which facilitates navigating in the space. Once landed on a group of interest, the explorer can switch to a user-level investigation. In Vexus, we build upon group-based principles that we developed in .
An explorer may be interested in a group as a whole. For instance, the group of “middle-aged users in France who read thrillers”, in its entirety, is a good candidate audience for books by John Grisham (an American writer best known for his popular legal thrillers). An explorer may also need to delve into a group to choose some members. For instance, program committee chairs look for an internationally diverse and gender-balanced committee, and hence need to choose members from various groups.
Group-based analysis can be very expensive. The number of possible groups is potentially very large as it is exponential in the number of users’ demographics and actions. Any set of users with at least one demographic or action in common can form a group. As an example, with only four demographic attributes and five values for each, the number of user groups will be in the order of . Consequently, we also need to address scalability in Vexus.
Interactivity. We advocate an interactive human-in-the-loop approach where an explorer who is interested in a group of users may request, in the next iteration, other interesting groups that are semantically connected to that group. We distinguish an interactive process from a random walk in the space of user groups by respecting the following key principles.
P1: Limited Options. The explorer must be able to see different groups without being overwhelmed (i.e., Occam’s razor principle).
P2: Optimality. Groups offered to the explorer must be of high quality. In other words, interactive steps must collectively optimize a quality function. This ensures the purposefulness of the interactive process and prevents statistically false local discoveries such as Simpson’s paradox.
P3: Efficiency. The train of thought of the explorer must not be lost. Hence each interactive step must be fast to preserve fluidity.
The following example shows how the proposed exploration principles enable the efficient analysis of user data. Consider Tiffany, who wants to find a person she met at last night’s party in Westford, Massachusetts (MA). She does not remember his name or any other identifying contact detail, so no querying mechanism is of help. Tiffany uses Vexus to inspect the list of friends of Mike, the party host, and Vexus forms groups from Mike’s friends (aggregated analytics). Vexus returns three groups (limited options): “engineers in MA who work at NextWorth”, “engineers in bioinformatics” and “part-time market managers in Boston”. Those groups are diverse, so they provide different analysis directions, and they cover most of Mike’s friends (optimality). Tiffany remembers that the person she is looking for was talking about “data visualization”, thus he should not be working for NextWorth, a recycling company. She also remembers that he mentioned he is a full-time employee; thus he should not belong to the last group either. So she selects the group of engineers in bioinformatics. In the next iteration, she immediately receives three subsets of that group (efficiency). She notices a group of “software engineers in BioView” (a company for cell imaging and analysis) where she finds the person she was looking for.
Our contributions in Vexus are as follows.
Exploratory Analysis. Vexus particularly targets scenarios of exploratory analysis where explorers have a partial understanding of the underlying user data and need to refine their objectives as they discover new insights. Vexus exploits appropriate indexing paradigms to enable fluid interactions. Moreover, Vexus builds an explorer profile and uses it to anticipate follow-up steps and select groups on-the-fly depending on the explorer’s evolving needs. Vexus serves essential applications on a variety of datasets, such as building a program committee for a conference, assembling a team of experts in crowd data sourcing, recommending items to a group, and validating hypotheses such as “young professionals are more inclined to buy organic food”. To the best of our knowledge, Vexus is the first framework which enables fluid navigation of user groups in an exploratory context.
Visualization. Vexus uses state-of-the-art visualization techniques to interact with the explorer. User groups are visualized in a force-directed layout to prevent clutter. Histograms and charts show detailed statistics about groups. Those statistics are displayed in coordinated views where a brush on one (e.g., a histogram) updates all other statistics instantaneously. The explorer can also request to see the members of a group, in which case a two-dimensional projection provides a clustered view of those users.
II System Architecture
Fig. 1 shows the overall architecture of our system. First, Vexus pre-processes user data offline to obtain user groups. Groups form a disconnected undirected graph where an edge exists between two groups if they are not disjoint. Group exploration is a navigation in that graph.
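The group graph described above can be illustrated with a minimal sketch. It assumes that each group is available as a set of user identifiers; the function name `build_group_graph` and the toy groups are hypothetical and are not taken from the Vexus code base.

```python
from itertools import combinations


def build_group_graph(groups):
    """Return an adjacency map in which two groups are linked
    whenever they share at least one member (i.e., are not disjoint)."""
    adjacency = {name: set() for name in groups}
    for a, b in combinations(groups, 2):
        if groups[a] & groups[b]:  # non-empty intersection
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


# Hypothetical toy groups: group name -> set of member identifiers.
groups = {
    "engineers in MA": {"ann", "bob", "carl"},
    "engineers in bioinformatics": {"carl", "dina"},
    "part-time market managers": {"eve"},
}
graph = build_group_graph(groups)
```

In this toy input, the third group shares no member with the others and therefore has no neighbor, which is how a graph built this way can end up disconnected. The pairwise comparison is quadratic in the number of groups, which is acceptable for an offline step on a small example but would need an index on real data.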
II-A VEXUS Modules
Pre-processing. In the offline process, Vexus receives the input user data either as a dataset (in the form of a CSV file) or as a data stream. An ETL process (including data cleaning) precedes the data import to prepare the data for analysis. Each record in the user data describes one user action (e.g., rating a book). We consider the generic schema [user, item, value] for user data. For instance, the tuple [Mary, Mr Miracle, 4] means that the user “Mary” rates the book “Mr Miracle” with the score 4 (out of 5). Each user is also associated with a set of demographics.
The user data is given as input to a group discovery algorithm. Vexus is independent of this process. For user datasets, different group discovery algorithms such as LCM and α-MOMRI can be used. In the case of user data streams, StreamMining and BIRCH can be employed. For each group, its members and their common attributes are returned.
For efficient navigation in the space of groups, we build one inverted index per group, which lists all other groups in decreasing order of their similarity to that group. We use the Jaccard coefficient to compute the similarity between each pair of groups. To reduce both time and space complexity, we only materialize 10% of each inverted index, which has been shown to be adequate to deliver satisfying results.
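A minimal sketch of such a truncated inverted index follows. It assumes groups are sets of user identifiers and uses the Jaccard coefficient (intersection size over union size) as the similarity; the names `jaccard`, `build_inverted_indexes` and `keep_fraction` are illustrative assumptions, not the Vexus API.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_inverted_indexes(groups, keep_fraction=0.10):
    """For every group, rank all other groups by decreasing similarity
    and keep only the top `keep_fraction` of that ranking."""
    indexes = {}
    for name, members in groups.items():
        ranked = sorted(
            ((jaccard(members, other), other_name)
             for other_name, other in groups.items()
             if other_name != name),
            key=lambda pair: (-pair[0], pair[1]),  # ties broken by name
        )
        keep = max(1, math.ceil(keep_fraction * len(ranked)))
        indexes[name] = [(other_name, score) for score, other_name in ranked[:keep]]
    return indexes


# Hypothetical toy groups: group name -> set of member identifiers.
groups = {
    "g1": {1, 2, 3, 4},
    "g2": {3, 4, 5},
    "g3": {4, 5, 6, 7},
    "g4": {8, 9},
}
index_10 = build_inverted_indexes(groups)                     # 10% of each list
index_50 = build_inverted_indexes(groups, keep_fraction=0.5)  # 50% of each list
```

With only three other groups per index, the 10% setting keeps a single neighbor per group; on a realistic number of groups the same fraction keeps the most similar tenth of each ranking.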
Group Exploration. Fig. 2 shows a screenshot of group exploration. Inspired by OLAP and in conformance with visual analytics principles, we consider five visual modules in Vexus: GroupViz, Context, Stats, History and Memo. In GroupViz, an explorer examines a limited number of groups to obtain one or more groups of interest. She can then ask to navigate to other groups which are similar to what she has already liked (i.e., interactivity). The explorer’s preference, captured in the form of feedback, is illustrated in Context. The sequence of selected groups is visualized in History. The explorer can backtrack to any previous step in History. The explorer may request to delve into more details and observe group members. In this case, an exhaustive set of statistics is shown in Stats. At any stage of the process, the explorer can bookmark a group or a user in Memo. The analysis ends when the explorer is satisfied with her collection in Memo, which serves as her analysis goal.
GroupViz visualizes groups in the form of circles. It is shown in previous research  that is an ideal match for human perception capacity. The position of the circles is enforced by a force-directed layout to prevent visual clutter. The size of each circle reflects the number of users in the group. Circles are color-coded by any attribute of choice (e.g., by gender in Fig. 2) to provide immediate insights. Hovering over a circle shows the group description, which explains the group’s content.
II-B VEXUS Features
In the following, we highlight key distinctive functionalities of Vexus.
Interactivity. At any given point, the explorer can click on a group in GroupViz. Then, Vexus decides which groups (conforming with principle P1) to explore next, based on the implicit feedback collected so far (reflected in Context). To comply with the optimality principle P2, the quality of groups should be verified. We consider diversity and coverage as quality objectives in Vexus. Optimizing diversity provides various analysis directions and reduces redundancy in returned groups. Optimizing coverage ensures that the most interesting records appear in at least one group in the output. The results of our user study in  show that highly diverse and covering groups are preferred as they contain informative and representative users. We use a best-effort greedy approach that we developed in  to return a local diverse and covering set of groups with a lower bound on similarity.
Note that while all interactions in Vexus occur in , the bottleneck of the framework is the greedy process. To comply with the efficiency principle P3, we set a time limit for the greedy process. The higher this limit, the more optimized the set of groups. We safely set the time limit to (i.e., continuity-preserving latency ) which enables Vexus to reach on average 90% of diversity and 85% of coverage.
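The interplay between the greedy process and its time limit can be sketched as follows. This is a simplified stand-in and not the greedy algorithm referenced above: it assumes diversity is one minus the average pairwise Jaccard coefficient, coverage is the fraction of users appearing in at least one selected group, and the two are simply added. The names `greedy_select`, `time_limit_s` and `clock` are hypothetical, and the one-second limit in the example is an arbitrary placeholder.

```python
import time
from itertools import combinations


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def diversity(selected, groups):
    """One minus the average pairwise Jaccard coefficient (assumed definition)."""
    pairs = list(combinations(selected, 2))
    if not pairs:
        return 1.0
    return 1.0 - sum(jaccard(groups[a], groups[b]) for a, b in pairs) / len(pairs)


def coverage(selected, groups, universe):
    """Fraction of all users that belong to at least one selected group."""
    if not universe:
        return 0.0
    covered = set().union(*(groups[g] for g in selected))
    return len(covered) / len(universe)


def greedy_select(groups, k, time_limit_s, clock=time.monotonic):
    """Pick up to k groups greedily; once the deadline has passed, stop
    searching and fall back to the best candidate seen so far."""
    universe = set().union(*groups.values())
    deadline = clock() + time_limit_s
    selected, remaining = [], sorted(groups)
    while remaining and len(selected) < k:
        best, best_score = remaining[0], -1.0
        for name in remaining:
            if clock() > deadline:  # best effort: keep what we have
                break
            trial = selected + [name]
            score = diversity(trial, groups) + coverage(trial, groups, universe)
            if score > best_score:
                best, best_score = name, score
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy groups: group name -> set of member identifiers.
groups = {"a": {1, 2, 3, 4}, "b": {3, 4, 5}, "c": {6, 7}, "d": {1, 2}}
chosen = greedy_select(groups, k=2, time_limit_s=1.0)
```

A larger time limit lets the loop score more candidates before committing, which mirrors the trade-off stated above between the limit and the quality of the returned set. Passing the clock as a parameter is only a convenience for testing the deadline behavior.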
Granular Analysis. The explorer can investigate group members to inspect more details. Vexus employs Linear Discriminant Analysis  as a dimensionality reduction approach to obtain a 2D projection of the members of a desired group (Focus View in Fig. 2). Members whose profiles are more similar appear closer to each other. Also, histograms show an exhaustive list of demographic distributions in the Stats module. For instance, focusing on the group of “very senior researchers in data management with a very high number of publications” reveals that of its members are male. The explorer can brush on histograms and constrain the set of users. For instance, she can express her desire to “limit the search only to females” by a brush on “female” in the gender histogram. An updated list of selected users is shown in a table. For instance, by brushing on gender to select females and on publication rate to select “extremely active” over the above group, the table lists Elke A. Rundensteiner (a full professor in the Computer Science department of Worcester Polytechnic Institute) with 325 publications in the 26 years of her career.
Interoperability. Histograms are implemented using Crossfilter charts (http://square.github.io/crossfilter/). Crossfilter employs the methodology of coordinated views where a brush on one histogram updates all other statistics instantaneously. This satisfies the efficiency principle P3 at the user level. Crossfilter’s efficiency is ensured by employing the concept of incremental queries, which prevents redundant query executions by sub-setting the data under the brush on-the-fly.
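The coordinated-view behavior can be illustrated with a small sketch. It is not Crossfilter (a JavaScript library) and does not implement incremental queries: it recomputes every histogram from scratch, and only shows the semantics of a brush, namely that a histogram reflects the brushes set on all other dimensions but not its own. The class `CoordinatedViews` and the sample records are hypothetical.

```python
from collections import Counter


class CoordinatedViews:
    """Naive coordinated views over a list of flat records (dicts)."""

    def __init__(self, records):
        self.records = list(records)
        self.brushes = {}  # dimension -> set of accepted values

    def brush(self, dimension, accepted):
        self.brushes[dimension] = set(accepted)

    def clear(self, dimension):
        self.brushes.pop(dimension, None)

    def _passes(self, record, skip=None):
        # A record passes if it satisfies every brush except the skipped one.
        return all(record[dim] in values
                   for dim, values in self.brushes.items() if dim != skip)

    def histogram(self, dimension):
        """Counts per value, ignoring the brush on `dimension` itself."""
        return Counter(r[dimension] for r in self.records
                       if self._passes(r, skip=dimension))

    def selected(self):
        """Records that satisfy all active brushes (the table view)."""
        return [r for r in self.records if self._passes(r)]


# Hypothetical members of one group.
views = CoordinatedViews([
    {"name": "u1", "gender": "female", "activity": "extremely active"},
    {"name": "u2", "gender": "male", "activity": "extremely active"},
    {"name": "u3", "gender": "male", "activity": "active"},
    {"name": "u4", "gender": "female", "activity": "active"},
    {"name": "u5", "gender": "male", "activity": "active"},
])
views.brush("gender", {"female"})
activity_hist = views.histogram("activity")  # restricted to female members
gender_hist = views.histogram("gender")      # unaffected by its own brush
views.brush("activity", {"extremely active"})
table = [r["name"] for r in views.selected()]
```

Leaving a dimension's own brush out of its histogram keeps the full distribution visible while brushing, so the explorer can widen or move the brush without first clearing it.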
Feedback Learning. During the interactive process, when the explorer selects a group, Vexus interprets this choice as positive feedback and increases, inside the feedback vector, the scores of that group’s members and of the common activities in its description. The vector is always kept normalized, i.e., all scores in the vector add up to 1. This implicitly means that users and demographics that do not get rewarded will gradually end up with a lower score tending to zero. Vexus shows the current status of the feedback vector explicitly in the Context module. Hence the explorer can easily understand how Vexus results are currently biased. She can easily unlearn (i.e., make Vexus forget about a user or a demographic value) by deleting it from Context. To incorporate feedback in the greedy optimizer behind the group visualizer, we consider a weighted similarity function. Intuitively, a group which is highly in line with the feedback received so far gets a higher weight, hence it is more likely to be chosen as one of the returned groups in subsequent steps.
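The feedback mechanism can be sketched under a few explicit assumptions: the feedback vector is a dictionary from users and attribute values to scores, a selection adds a fixed boost to each rewarded entry, and normalization rescales the scores to sum to 1. The weighting formula in `weighted_similarity` is purely illustrative, since the text does not give the exact form of the weighted similarity function; all names below are hypothetical.

```python
def normalize(scores):
    """Rescale scores so that they sum to 1 (assumed normalization)."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()} if total else {}


def reward(feedback, items, boost=1.0):
    """Increase the score of every rewarded item, then renormalize.
    Items that are not rewarded lose relative weight as a side effect."""
    updated = dict(feedback)
    for item in items:
        updated[item] = updated.get(item, 0.0) + boost
    return normalize(updated)


def forget(feedback, item):
    """Unlearn one user or attribute value and renormalize the rest."""
    return normalize({k: v for k, v in feedback.items() if k != item})


def weighted_similarity(base_similarity, group_items, feedback):
    """Illustrative weighting: scale similarity by the feedback mass
    that the group's members and description carry."""
    weight = sum(feedback.get(item, 0.0) for item in group_items)
    return base_similarity * (1.0 + weight)


# Two consecutive selections by a hypothetical explorer.
feedback = reward({}, ["user:ann", "gender:male"])
feedback = reward(feedback, ["gender:male", "topic:databases"])
after_forget = forget(feedback, "gender:male")
```

After the second selection, "user:ann" was not rewarded again and its share drops from 1/2 to 1/6, which is the gradual decay described above. Deleting "gender:male" redistributes its share over the remaining entries, so later group rankings are no longer biased towards that value.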
We consider two user datasets: database researchers (DB-Authors) and book ratings (BookCrossing, http://www2.informatik.uni-freiburg.de/~cziegler/BX/). Explorers can seek to achieve either a single-target task (ST), where the goal is to find a single group in its entirety (e.g., finding an audience group for targeted advertisement), or a multi-target task (MT), where the goal is to identify several users of interest while exploring user groups (e.g., forming an expert set for a conference). In the following scenarios, we show how Vexus enables explorers to achieve ST and MT tasks in an efficient and comprehensive way.
Scenario 1: Expert Set Formation (MT). Our explorer can be a program committee (PC) chair whose task is to build an expert set formed by geographically distributed male and female researchers with different seniority and expertise levels. In this scenario, we employ our DB-Authors dataset, which is now publicly available (https://persyval-platform.imag.fr/perscido/web/DS32/detaildataset). Vexus guides the chair in the interactive process to find colleagues to invite. The chair may start from a small group of researchers from the previous year’s PC. Then Vexus returns similar groups. Vexus captures the feedback from the chair throughout the process and biases the exploration towards her interest. To diversify the expert set, the chair may delete a learned demographic value, e.g., “male”, to obtain more gender-balanced results. The interactivity of Vexus, coupled with the feedback-based aggregations of user data, distinguishes it from other expert-set formation tools such as DBPubs and Sofia. Our results in  show that Vexus enables PC chairs to form committees of major conferences (SIGMOD, VLDB and CIKM) in less than 10 iterations on average.
Scenario 2: Discussion Groups (ST). Our explorer can be an avid book reader who is looking to join an online book club. With over 1,000 ratings (ranging from 1 to 10 but mostly high) available for her favorite author, Debbie Macomber (author of contemporary women’s fiction), the explorer navigates groups of users in BookCrossing (as a user rating dataset) using Vexus to find discussion groups. For instance, she discovers a group with whom she agrees (e.g., people who like fiction books) and another group with whom she disagrees (e.g., people who like gender-neutral books). The user study in  shows 80% satisfaction with exploring rating datasets via user groups rather than via individuals.
-  S. Amer-Yahia, S. Kleisarchaki, N. K. Kolloju, L. V. Lakshmanan, and R. H. Zamar. Who rates like me: on building flexible rating maps efficiently. WWW, 2017.
-  S. Amer-Yahia, B. Omidvar-Tehrani, S. B. Roy, and N. Shabib. Group recommendation with temporal affinities. In EDBT, 2015.
-  A. Baid, A. Balmin, H. Hwang, E. Nijkamp, J. Rao, B. Reinwald, A. Simitsis, Y. Sismanis, and F. van Ham. DBPubs: multidimensional exploration of database publications. PVLDB, 2008.
-  M. Bender, R. Klein, A. Disch, and A. Ebert. A functional framework for web-based information visualization systems. IEEE Transactions on Visualization and Computer Graphics, 6(1):8–23, 2000.
-  M. Das, S. Amer-Yahia, G. Das, and C. Yu. MRI: meaningful interpretations of collaborative ratings. PVLDB, 4(11), 2011.
-  J.-D. Fekete and R. Primet. Progressive analytics: A computation paradigm for exploratory data analysis. arXiv preprint arXiv:1607.05162, 2016.
-  B. Golshan, T. Lappas, and E. Terzi. Sofia search: a tool for automating related-work search. In SIGMOD, 2012.
-  S. Ji and J. Ye. Generalized linear discriminant analysis: a unified framework and efficient model selection. IEEE Transactions on Neural Networks, 19(10), 2008.
-  R. Jin and G. Agrawal. An algorithm for in-core frequent itemset mining on streaming data. In ICDM, 2005.
-  G. Liu, M. Feng, Y. Wang, L. Wong, S.-K. Ng, T. L. Mah, and E. J. D. Lee. Towards exploratory hypothesis testing and analysis. In ICDE, pages 745–756, 2011.
-  G. Miller. Human memory and the storage of information. IRE Transactions on Information Theory, 2(3), 1956.
-  S. Mishra, V. Leroy, and S. Amer-Yahia. Discovering characterizing regions for consumer products. In DSAA, 2015.
-  B. Omidvar-Tehrani, S. Amer-Yahia, P. Dutot, and D. Trystram. Multi-objective group discovery on the social web. In PKDD, 2016.
-  B. Omidvar-Tehrani, S. Amer-Yahia, and A. Termier. Interactive user group analysis. In CIKM, 2015.
-  B. Omidvar-Tehrani, S. Amer-Yahia, A. Termier, A. Bertaux, É. Gaussier, and M.-C. Rousset. Towards a framework for semantic exploration of frequent patterns. In IMMoA. CEUR-WS, 2013.
-  T. Uno, T. Asai, Y. Uchida, and H. Arimura. LCM: An efficient algorithm for enumerating frequent closed item sets. In FIMI, 2003.
-  B. Valeri, S. Elbassuoni, and S. Amer-Yahia. Acquiring reliable ratings from the crowd. In HCOMP, 2015.
-  T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. In ACM SIGMOD Record, volume 25, 1996.