Visual exploration; Interactive visualization; Information visualization; Visual analytics; Exploratory data analysis.
Data visualization is the presentation of data in a pictorial or graphical format, and a data visualization tool is the software that generates this presentation. Data visualization provides users with intuitive means to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and supports sense-making activities.
Exploring, visualizing and analysing data is a core task for data scientists and analysts in numerous applications. Data visualization111Throughout the article, terms visualization and visual exploration, as well as terms tool and system are used interchangeably.idvBook_v2 provides intuitive ways for the users to interactively explore and analyze data, enabling them to effectively identify interesting patterns, infer correlations and causalities, and support sense-making activities.
The Big Data era has realized the availability of a great amount of massive datasets that are dynamic, noisy and heterogeneous in nature. The level of difficulty in transforming a data-curious user into someone who can access and analyze that data is even more burdensome now for a great number of users with little or no support and expertise on the data processing part. Data visualization has become a major research challenge involving several issues related to data storage, querying, indexing, visual presentation, interaction, personalization bs16; Shneiderman08; HeerK12b; MortonBGM14; WuBM14; GGL15; IdreosPC15; HeerS12.
Given the above, modern visualization and exploration systems should effectively and efficiently handle the following aspects.
Real-time Interaction. Efficient and scalable techniques should support the interaction with billion objects datasets, while maintaining the system response in the range of a few milliseconds.
On-the-fly Processing. Support of on-the-fly visualizations over large and dynamic sets of volatile raw (i.e., not preprocessed) data is required.
Visual Scalability. Provision of effective data abstraction mechanisms is necessary for addressing problems related to visual information overloading (a.k.a. overplotting).
User Assistance and Personalization. Encouraging user comprehension and offering customization capabilities to different user-defined exploration scenarios and preferences according to the analysis needs are important features.
4 Visualization in Big Data Era
This section discusses the basic concepts related to Big Data visualization. First, the limitations of traditional visualization systems are outlined. Then, the basic characteristics of data visualization in the context of Big Data era are presented. Finally, the major prerequisites and challenges that should be addressed by modern exploration and visualization systems are discussed.
4.1 Traditional Systems
Most traditional exploration and visualization systems cannot handle the size of many contemporary datasets. They restrict themselves to dealing with small dataset sizes, which can be easily handled and analysed with conventional data management and visual explorations techniques. Further, they operate in an offline way, limited to accessing preprocessed sets of static data.
4.2 Current Setting
On the other hand, nowadays, the Big Data era has made available large numbers of very big datasets, that are often dynamic and characterized by high variety and volatility. For example, in several cases (e.g., scientific databases), new data constantly arrive (e.g., on a daily/hourly basis); in other cases, data sources offer query or API endpoints for online access and updating. Further, nowadays, an increasingly large number of diverse users (i.e., users with different preferences or skills) explore and analyze data in a plethora of different scenarios.
4.3 Modern Systems
Modern systems should be able to efficiently handle big dynamic datasets, operating on machines with limited computational and memory resources (e.g., laptops). The dynamic nature of nowadays data (e.g., stream data), hinders the application of a preprocessing phase, such as traditional database loading and indexing. Hence, systems should provide on-the-fly processing over large sets of raw data.
Further, in conjunction with performance issues, modern systems have to address challenges related to visual presentation. Visualizing a large number of data objects is a challenging task; modern systems have to “squeeze a billion records into a million pixels” Shneiderman08. Even in small datasets, offering a dataset overview may be extremely difficult; in both cases, information overloading (a.k.a. overplotting) is a common issue. Consequently, a basic requirement of modern systems is to effectively support data abstraction over enormous numbers of data objects.
Apart from the aforementioned requirements, modern systems must also satisfy the diversity of preferences and requirements posed by different users and tasks. Modern systems should provide the user with the ability to customize the exploration experience based on her preferences and the individual requirements of each examined task. Additionally, systems should automatically adjust their parameters by taking into account the environment setting and available resources; e.g., screen resolution/size, available memory.
5 Systems and Techniques
This section presents how state-of-the-art approaches from Data Management and Mining, Information Visualization and Human-Computer Interaction communities attempt to handle the challenges that arise in the Big Data era.
5.1 Data Reduction
In order to handle and visualize large datasets, modern systems have to deal with information overloading issues. Offering visual scalability are crucial in Big Data visualization. Systems should provide efficient and effective abstraction and summarisation mechanisms. In this direction, a large number of systems use approximation techniques (a.k.a. data reduction techniques), in which abstract sets of data are computed. Considering the existing approaches, most of them are based on: (1) sampling and filtering FisherPDs12; ParkCM15; AgarwalMPMMS13; ImVM13; BattleSC13 and/or (2) aggregation (e.g., binning, clustering) EF10; bsps15; JugelJM15; LiuJH13; LinsKS13; GodfreyGLR16.
5.2 Hierarchical Exploration
Approximation techniques are often defined in a hierarchical manner EF10; bsps15; GodfreyGLR16; LinsKS13, allowing users to explore data in different levels of detail (e.g., hierarchical aggregation).
Hierarchical approaches222sometimes also referred as multilevel allow the visual exploration of very large datasets in multiple levels, offering both an overview, as well as an intuitive and effective way for finding specific parts within a dataset. Particularly, in hierarchical approaches, the user first obtains an overview of the dataset before proceeding to data exploration operations (e.g., roll-up, drill-down, zoom, filtering) and finally retrieving details about the data. Therefore, hierarchical approaches directly support the visual information seeking mantra “overview first, zoom and filter, then details on demand” Shneiderman96. Hierarchical approaches can also effectively address the problem of information overloading as they adopt approximation techniques.
Hierarchical techniques have been extensively used in large graphs visualization, where the graph is recursively decomposed into smaller sub-graphs that form a hierarchy of abstraction layers. In most cases, the hierarchy is constructed by exploiting clustering and partitioning methods RodriguesTPTTF13; AbelloHK06; Auber04; TominskiAS09. In other works, the hierarchy is defined with hub-based LinCTWKC13 and density-based ZinsmaierBDS12 techniques. ArchambaultMA08 supports ad-hoc hierarchies which are manually defined by the users. Differents approaches have been adopted in SundaraAKDWCS10; BikakisLKG16, where sampling techniques have been exploited. Other works adopt edge bundling techniques which aggregate graph edges to bundles Gansner2011; Ersoy2011; Phan2005; Lambert2010; Cui2008; Holten2006.
5.3 Progressive Results
Data exploration requires real-time system’s response. However, computing complete results over large (unprocessed) datasets may be extremely costly and in several cases unnecessary. Modern systems should progressively return partial and preferably representative results, as soon as possible.
Progressiveness can significantly improve efficiency in exploration scenarios, where it is common that users attempt to find something interesting without knowing what exactly they are searching for beforehand. In this case, users perform a sequence of operations (e.g., queries), where the result of each operation determines the formulation of the next operation. In systems where progressiveness is supported, in each operation, after inspecting the already produced results, the user is able to interrupt the execution and define the next operation, without waiting the exact result to be computed.
In this context, several systems adopt progressive techniques. In these techniques the results/visual elements are computed/constructed incrementally based on user interaction or as time progresses StolperPG14; bsps15; KalininCZ14. Further, numerous recent systems integrate incremental and approximate techniques. In these cases, approximate results are computed incrementally over progressively larger samples of the data FisherPDs12; AgarwalMPMMS13; ImVM13.
5.4 Incremental and Adaptive Processing
The dynamic setting established nowadays hinders (efficient) data preprocessing in modern systems. Additionally, it is common in exploration scenarios that only a small fragment of the input data to be accessed by the user.
In situ data exploration Alagiannis2012; OlmaKAAA17; bsps15; ZoumpatianosIP14; KarpathiotakisBAA14; TianALAMV17 is a recent trend, which aims at enabling on-the-fly exploration over large and dynamic sets of data, without (pre)processing (e.g., loading, indexing) the whole dataset. In these systems, incremental and adaptive processing and indexing techniques are used, in which small parts of data are processed incrementally “following” users’ interactions.
5.5 Caching and Prefetching
Recall that, in exploration scenarios, a sequence of operations is performed and, in most cases, each operation is driven by the previous one. In this setting, caching and/or prefetching the sets of data that are likely to be accessed by the user in the near future can significantly reduce the response time bcs15; TauheedHSMA12; KalininCZ14; JayachandranTKN14; bsps15; ChanXGH08; KhanSA14. Most of these approaches use prediction techniques which exploit several factors (e.g., user behavior, user profile, use case) in order to determine the upcoming user interactions.
5.6 User Assistance
The large amount of available information makes it difficult for users to manually explore and analyze data. Modern systems should provide mechanisms that assist the user and reduce the effort needed on their part, considering the diversity of preferences and requirements posed by different users and tasks.
Recently, several approaches have been developed in the context of visualization recommendation VartakHSMP16. These approaches recommend the most suitable visualizations in order to assist users throughout the analysis process. Usually, the recommendations take into account several factors, such as data characteristics, examined task, user preferences and behavior, etc.
Especially considering data characteristics, there are several systems that recommend the most suitable visualization technique (and parameters) based on the type, attributes, distribution, or cardinality of the input data Key2012; EhsanSC16; MackinlayHS07; bsps15; EURECOM+4380; ThellmannGOS15. In a similar context, some systems assist users by recommending certain visualizations that reveal surprising and/or interesting data VartakMPP14; WongsuphasawatM16; WillsW10. Other approaches provide visualization recommendations based on user behavior and preferences MutluVT16; GotzW09. Finally, systems provide recommendations and explanations regarding data trends and anomalies KandelPPHH12; WM13.
6 Examples of Applications
Visualization techniques are of great importance in a wide range of application areas in the Big Data era. The volume, velocity, heterogeneity and complexity of available data make it extremely difficult for humans to explore and analyze data. Data visualization enables users to perform a series of analysis tasks that are not always possible with common data analysis techniques daglib0028506.
Major application domains for data visualization and analytics are Physics and Astronomy. Satellites and telescopes collect daily massive and dynamic streams of data. Using traditional analysis techniques, astronomers are able to identify noise, patterns and similarities. On the other hand, visual analytics can enable astronomers to identify unexpected phenomena and perform several complex operations, which are not are feasible by traditional analysis approaches.
Another application domain is atmospheric sciences like Meteorology and Climatology. In this domain high volumes of data are collected from sensors and satellites on a daily basis. Storing these data over the years results in massive amounts of data that have to be analyzed. Visual analytics can assist scientists to perform core tasks, such as climate factors correlation analysis, event prediction, etc. Further, in this domain, visualization systems are used in several scenarios in order capture real-time phenomena, such as, hurricanes, fires, floods, and tsunamis.
In the domain of Bioinformatics, visualization techniques are exploited in numerous tasks. For example, analyzing the large amounts of biological data produced by DNA sequencers is extremely challenging. Visual techniques can help biologist to gain insight and identify interesting “areas” of genes on which to performs their experiments.
In the Big Data era, visualization techniques are extensively used in the business intelligence domain. Finance markets is one application area, where visual analytics allow to monitor markets, identify trends and perform predictions. Besides, market research is also an application area. Marketing agencies and in-house marketing departments analyze a plethora of diverse sources (e.g., finance data, customer behavior, social media). Visual techniques are exploited to realize task such as, identifying trends, finding emerging market opportunities, finding influential users and communities, optimizing operations (e.g., troubleshooting of products and services), business analysis and development (e.g., churn rate prediction, marketing optimization).
7 Further Reading
The literature on visualization is extensive, covering a large range of fields and many decades. Data visualization is discussed in a great number of recent introductory-level textbooks, such as idvBook_v2; Murray13; book0024324; daglib0028506; Johnson2004.
Also, there are various articles discussing Big Data visualization; see Shneiderman08; HeerK12b; MortonBGM14; WuBM14; HeerS12. Surveys of Big Data visualization systems can be found at GGL15; bs16; IdreosPC15.
In what follows we provide some surveys/studies related to issues discussed in this article:
Graph visualization LandesbergerKSKWFF11; HermanMM00; BattistaETT99
Hierarchical exploration EF10
Visualization recommendations VartakHSMP16
Linked and Web data visualization bs16; DadzieP17; MarieG14a; DR11
High-dimensional data visualization LiuMWBP17
Temporal data visualization WohlfartABM08
Some of the major workshops and symposiums focusing on Big Data visualization include:
Workshop on Big Data Visual Exploration and Analytics (BigVis)
Symposium on Big Data Visual Analytics (BDVA)
Big Data Analysis and Visualization (LDAV)
Workshop on Data Mining Meets Visual Analytics at Big Data era (DAVA)
Workshop on Human-In-the-Loop Data Analytics (HILDA)
Workshop on Data Systems for Interactive Analysis (DSIA)
Workshop on Immersive Analytics: Exploring Future Interaction and Visualization Technologies for Data Analytics
Visualizing Semantic Data
Graph exploration and search