1 Related Work
In this section, we discuss existing techniques for studying mixed multivariate datasets that include both categorical and numerical attributes in related domains [65, 25]. The main objectives of visual analytics in these domains include the study of correlations between categorical variables and of clustering in the parameter space, using projection methods (fused displays and dimension-reduction techniques) or parallel sets.
1.1 Techniques to study correlation
There are multiple specialized techniques available to study correlation between features in high-dimensional data. Since the data under consideration is categorical with one dependent numerical variable, most techniques like Pearson correlation give ambiguous results. Hence, specialized correlation measures like Cramér's V (based on the chi-squared statistic) are used [4, 17]. There also exist statistical tests for correlating categorical variables by comparing their behavior on numerical variables, such as the t-test, the chi-square test, one-way ANOVA, and the Kruskal-Wallis test. Techniques also exist to study the correlation of multivariate temporal data
[10, 62]. However, for datasets with very high dimensionality, it can be hard to study correlations across the overall distribution of the dataset. Hence, methods have been devised to study correlation on large datasets over parts of the distribution [58]. The results from these techniques can then be used as input to fused displays in which the correlations are visualized in the form of scatterplots and networks [69].
1.2 Clustering techniques
Since most categorical data consist of unordered nominal values [71], most clustering algorithms are not directly applicable to studying categorical parameter spaces. Specialized techniques like k-modes [30], SQUEEZER [24], and COOLCAT [3] have been developed to work specifically on categorical data. Some of the latest research has focused on advanced clustering techniques in a supervised learning environment [64] based on human perception. All of these techniques differ in the similarity criterion used for clustering, as different criteria are designed to capture specific relationships in the data. However, in multi-objective filtering scenarios, clustering as a concept is limited in scope because each algorithm captures only the particular relationship in the dataset that its similarity criterion targets.
1.3 High-Dimensional Data Visualization Techniques
Projecting high-dimensional data into lower dimensions is another technique to visualize relationships between attributes and data points. Scatterplot matrices [23] are a way to visualize pairwise relationships between variables: multiple plots are generated, each comparing two attributes from the dataset. Variations of this technique include bivariate scatterplot projections of the full space and a HyperSlices-based approach [5, 52]. However, none of these techniques scales with the number of attributes, as the number of plots grows quadratically. This makes it difficult to mentally fuse the disjoint relationships obtained from the individual plots. Similarly, 3D volume datasets can be represented with Multicharts [15] and dynamic volume lines [66], but these techniques are also limited in their application domain.
Parallel Sets [37] is another popular method for visual analytics of multidimensional categorical data. It maps data into ribbons that subdivide according to the percentage of the population they represent. Each categorical variable is mapped to an axis, which is divided into sections according to the percentage of data contained in each category (see Figure 2, right). However, as the number of parameters in the dataset increases, the plot can become too cluttered to convey any useful information. An example parallel sets plot of our systems performance data is shown in Figure 2 (right), illustrating the excessive overlap of ribbons with only five variables. The complete parallel sets plot is given in the supplementary material.
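To illustrate the ribbon-subdivision idea concretely (with a handful of made-up records, not our dataset), the width of each ribbon between two adjacent axes is simply the share of records carrying that pair of categories:

```python
from collections import Counter

# Hypothetical records: (FileSystem, BlockSize) category pairs.
records = [("ext4", "1KB"), ("ext4", "4KB"), ("ext4", "1KB"), ("xfs", "4KB")]

# Each ribbon between two adjacent axes represents the fraction of the
# population with that particular pair of categories.
total = len(records)
ribbon_share = {pair: count / total for pair, count in Counter(records).items()}
# ribbon_share[("ext4", "1KB")] is 0.5: half the records share that pair.
```

With many variables and levels, the number of such ribbons grows combinatorially, which is the source of the clutter visible in Figure 2 (right).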
Another class of dimension-reduction techniques includes MDS [38, 39], PCA, kernel PCA, locally linear embedding (LLE) [54], Fisher's discriminant analysis [47], spectral clustering [49], and t-distributed stochastic neighbor embedding (t-SNE) [45]. Although these techniques were designed to work with numerical data, categorical data can be converted to numeric form and then visualized with them. To convert categorical data into numerical format, one can use one-hot encoding or the remapping technique described by Zhang et al. [70]. These methods are good for visualizing relationships between the data points, but their effectiveness decreases as the dimensionality of the dataset increases. An example case is shown in Figure 1, where no clear clusters based on the dependent numerical variable (throughput) could be seen with spectral clustering and t-SNE on the systems performance dataset.

To better cater to the need of projecting a larger number of dimensions to lower dimensions, another class of multivariate projection techniques arranges variables in radial layouts, e.g., Star Coordinates [33, 34, 40] or RadViz [14, 21, 28]
. Both techniques generate a radial layout with the variables as anchor points on the circumference of a circle; the data points are systematically placed inside the circle based on their value for each variable. Star Coordinates projects a linear transformation of the data, while RadViz projects a nonlinear transformation [55]. These projection techniques work well to project and visualize clusters in high-dimensional numerical data [50]. Star Coordinates and RadViz can also be combined to create a smooth visual transition over multiple dimensions of the data, allowing several dimensions of the dataset to be explored interactively [41, 42]. While these techniques work well for numerical data, they cannot be applied directly to categorical parameter spaces. A variation, Concentric RadViz [51], can be used to study different categorical variables as concentric RadViz circles, but its main objective is to study the data distribution for given parameter combinations; the correlation between different categories cannot be visualized with this technique.

Another technique, Multiple Correspondence Analysis (MCA) [20], is specifically designed for projecting categorical data. Numerical data can also be visualized with MCA by discretizing it into categories. It can be used to generate fused displays in which the levels of the categorical variables are plotted within the same space as the data points. Similar to PCA, one can select a bivariate basis that maximizes the spatial expanse of the plot. In these displays the distance between two points represents a notion of association. As shown in Figure 2
(left), MCA is effective in visualizing associations among the levels of the categories. However, there is a certain loss of information due to the omission of the higher-order basis vectors. The plot also tends to get cluttered when the number of data points (the parameterized configurations) or even the number of categories and levels grows large.
2 Dataset
While our method readily applies to any categorical dataset with a numerical (or categorical) target variable, our specific use case was to support a team of systems researchers in their aim to learn about the impact of configuration choices on throughput and its variability in a benchmark computer system. The dataset we used had been collected over a period of three years in the research team’s lab at our university.
Several experiments were run to measure the system performance for a large number of configurations. Currently, the dataset consists of 10 dimensions with 100k configurations and about 500k data points (i.e., system configurations that were each executed on average five times to ensure stable results). The attributes in the dataset include Workload Type, File System, Block Size, Inode Size, Block Group, Atime Option, Journal Option, Special Option, I/O Scheduler, and Device Type. All of these variables are categorical; a configuration is a set of categories (levels) chosen from at least one of these variables. Some of these variables are ordinal (e.g., Block Size can be 1KB, 2KB, or 4KB only) while others are nominal (e.g., JournalOp can be writeback, ordered, journal, or none). The dependent numerical variable is the Throughput of each parameter configuration.
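To make the dataset layout concrete, here is a minimal sketch (with hypothetical values, not actual measurements) of how the roughly five repeated runs per configuration can be grouped and averaged:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical run records: (configuration, measured throughput). Each
# configuration assigns one level to each categorical variable.
runs = [
    ({"Workload": "dbsrvr", "FileSystem": "ext4", "BlockSize": "4KB"}, 512.0),
    ({"Workload": "dbsrvr", "FileSystem": "ext4", "BlockSize": "4KB"}, 498.0),
    ({"Workload": "websrvr", "FileSystem": "xfs", "BlockSize": "1KB"}, 301.5),
]

# Group repeated runs of the same configuration and average their throughput,
# mirroring the ~5 repetitions per configuration mentioned above.
by_config = defaultdict(list)
for config, throughput in runs:
    by_config[tuple(sorted(config.items()))].append(throughput)

avg_throughput = {cfg: mean(vals) for cfg, vals in by_config.items()}
```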
Direct optimization techniques have been applied to search for optimal configurations in such large parameter spaces, including Control Theory [43, 44, 72] [29, 18], Simulated Annealing [36, 13], and Bayesian Optimization [57]. However, these techniques prove to be too slow and sometimes result in suboptimal solutions, as our experiments confirm [68, 9]. Hence, there is a need to visualize the search space and the efficacy of the search techniques. Our ICE tool helps in visualizing and filtering these large parameter spaces to learn about optimal settings and tradeoffs for the underlying system's performance.
3 Requirement Analysis
To systematically evolve our ICE tool with the needs of the systems researchers in mind, we applied Munzner's nested model for visualization design [48, 46]. Building the ICE tool following the nested model greatly helped in the step-by-step development with proper evaluation at each stage of the implementation. The first of the four stages of developing the eventual visual tool was to gather, from the domain experts, a list of requirements expected to be met by our tool. Our many discussions culminated in the following list of six requirements:
R1: Statistics visualization. Systems researchers are typically interested in assessing the impact of a parameter on throughput via statistical measures. Hence, the framework should display the mean, median, selected percentiles, min, max, range, and distribution of the resulting throughput for each variable independently. Visualizing the complete distribution curve is important to prevent incorrect statistical conclusions. For example, a bimodal distribution and a normal distribution might have the same mean, yet they are different distributions requiring different systems approaches to optimize. A full distribution curve of the data can complement the statistical information, thus preventing deceptive conclusions about a parameter.
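The bimodal-versus-normal pitfall above can be made concrete with a small sketch (hypothetical throughput values):

```python
import statistics

# Hypothetical throughput samples: equal means but very different shapes.
normal_like = [90, 95, 100, 100, 105, 110]  # unimodal, centered at 100
bimodal = [60, 62, 64, 136, 138, 140]       # two clusters, same mean of 100

# The mean alone cannot distinguish the two cases ...
same_mean = statistics.mean(normal_like) == statistics.mean(bimodal)

# ... but the spread (and the full distribution curve) reveals the difference.
spread_gap = statistics.pstdev(bimodal) - statistics.pstdev(normal_like)
```

Both samples report a mean of 100, yet the bimodal one has a far larger spread, which is exactly the information a full distribution curve preserves.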
R2: Comparative visualization. Comparing the impact and tradeoffs of different parameters on system throughput is crucial for choosing the best configuration in such a large parameter space. The ability to compare different parameter settings helps analysts to determine the right set of parameters by repeated selection and filtering to arrive at the desired system performance.
R3: Filtering. When dealing with large parameter spaces, choosing a system configuration with the best performance is non-trivial. Filtering by choosing the best parameters iteratively can reveal complex hierarchical dependencies between the parameters and the system throughput. For example, consider analyst Mike, who seeks to optimize a system running a database server workload. He can first choose the best File System type, followed by the best Block Size, and so on until there is no further improvement in system performance.
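Mike's iterative strategy amounts to a greedy loop over parameters, which can be sketched as follows (the configurations and throughputs are hypothetical, and mean throughput is only one possible selection criterion):

```python
from statistics import mean

# Hypothetical (configuration, throughput) pairs; in the real dataset each
# configuration also carries the remaining eight variables.
configs = [
    ({"FileSystem": "ext4", "BlockSize": "1KB"}, 520.0),
    ({"FileSystem": "ext4", "BlockSize": "4KB"}, 480.0),
    ({"FileSystem": "xfs", "BlockSize": "1KB"}, 300.0),
    ({"FileSystem": "xfs", "BlockSize": "4KB"}, 310.0),
]

def best_level(data, parameter):
    """Level of `parameter` with the highest mean throughput in `data`."""
    levels = {cfg[parameter] for cfg, _ in data}
    return max(levels,
               key=lambda lv: mean(t for cfg, t in data if cfg[parameter] == lv))

# Fix one parameter at a time, keeping only configurations with the best level.
for parameter in ("FileSystem", "BlockSize"):
    chosen = best_level(configs, parameter)
    configs = [(cfg, t) for cfg, t in configs if cfg[parameter] == chosen]
```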
R4: Support informed predictions. As discussed in R3, filtering is important for reducing the large parameter space to a smaller space of interest. Yet, guidelines are needed that can help an analyst choose the right parameters to reach a desired goal. Assume analyst Jane has a system running a database server workload and a File System of type ext2. Now she wishes to choose the system configuration that gives the minimum variation in performance: i.e., the narrowest range of throughput, thus yielding a "stable" throughput behavior. To achieve these goals, the visualization scheme should provide the necessary cues.
R5: Provenance visualization. Iterative filtering is useful, but it needs to be coupled with a visual provenance scheme in which the analyst can keep track of the progress at each stage of the filtering process. Likewise, the analyst should be able to move back to any past state in the pipeline to undo actions if required.
R6: Aggregate view. Requirements R1–R4 focus on analyzing the impact of each parameter in the dataset on throughput, with the goal of assisting informed predictions. At the same time, the interface should also give a summarizing view of the span of throughput performance that is reachable with the evolving system configuration.
During our meetings with the systems research team, we soon realized that they presently had very few visual tools at hand to analyze their large parameter spaces with these six requirements in mind. They were open to the use of visual tools, but they strived for easy-to-understand traditional visualization tools, as opposed to highly specialized designs with a possibly steep learning curve. Their motivation was to develop a tool that would gain wide acceptance within the systems-research community and use well-recognized standards and metrics, made visual and interactive via our tool.
We also concluded that dashboards with standard visualizations, such as bar, line, and pie charts, were insufficient to fully capture the requirements we collected, at least not in an easy and straightforward manner. Other visualization paradigms such as parallel sets and MCA plots were similarly ruled out (see our study in Section 1.3 above).
We thus needed to strike a balance between an advanced visualization design and one that would convey the identified established performance metrics in an intuitive way. We believe that the design that emerged and the lessons learned throughout the process are sufficiently general and apply to domains much wider than computer systems analysis.
4 Interactive Configuration Explorer (ICE)
The ICE interface is divided into three components (see the teaser figure). The first section is the Parameter Explorer (A). Its design satisfies the majority of the requirements (R1 to R4), as it visualizes and allows users to tune the target variable's distribution for each parameter in the dataset. It allows the analyst to turn off parameters deemed irrelevant as well as filter out configurations with unwanted or non-competitive parameter-level settings, both by toggling the parameter and parameter-level (category) bars on and off, respectively, enabling the user to conduct the iterative optimization of the target variable, system throughput in this case. It also supports zooming and panning for better comparison of the bars. To the right of the Parameter Explorer is the Aggregate View (B), which displays the throughput distribution for the configurations selected in the Parameter Explorer, thus satisfying requirement R6. The third component of ICE is the Provenance Terminal (C). It satisfies requirement R5 and allows the user to easily track, roll back, and edit the parameter filtering progress.
4.1 The Range-Distribution (RD) Bars
Sections A and B of the ICE interface consist of a set of Range-Distribution (RD) bars. Each bar contains the probability distribution function along with additional statistical information about the dependent numerical variable. The RD bars are arranged and delimited similarly to a vertical Gantt or timeline chart, with one bar dedicated to one parameter level, and are grouped by variable. The lower/upper limit of each bar is determined by the lowest/highest value of the dependent numerical variable that can be achieved across all configurations containing the parameter level the bar represents.
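As a sketch of how a bar's extent and percentile boundaries could be derived (the throughput values below are made up, and `statistics.quantiles` with its default "exclusive" method is only one of several percentile conventions):

```python
from statistics import quantiles

# Hypothetical throughputs of all configurations containing one parameter level.
throughputs = [120.0, 180.0, 200.0, 220.0, 260.0, 300.0, 340.0, 400.0]

# The bar spans the lowest to the highest reachable throughput ...
bar_low, bar_high = min(throughputs), max(throughputs)

# ... and the gray segments are delimited by percentile boundaries
# (here quartiles, computed with the default "exclusive" method).
p25, p50, p75 = quantiles(throughputs, n=4)
```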
A completely annotated bar displaying the information that each part of the bar contains is shown in Figure 3. Each bar is a sequence of gray shades that represent the percentile ranges. The color codes were chosen with the help of ColorBrewer [22] to show a continuous diverging effect of percentiles along the bar. The magenta region shows the distribution of the target variable over the range. Statistical information is shown with lines separating the percentile ranges and a black dot marking the mean value. See Section 4.6 for more detail on how we arrived at these specific design choices.
4.2 Parameter Explorer
The Parameter Explorer is designed with the goal of visualizing a numerical variable with respect to the individual parameters in the dataset: i.e., requirements R1 to R4. As mentioned, multiple bars are stacked, grouped by parameters and their levels. This grouping allows for easy comparison of the impact of the parameters on the numerical variable. As shown in Figure 4, the level names are listed underneath each bar and the parameters are shown as buttons below the group of levels. The bars for each variable are grouped within a blue box. The statistics (mean and percentiles) are shown as alternating shades of gray for each parameter level, hence partially satisfying R1. The distribution of the dependent variable is shown as a magenta distribution curve. The grouping of bars, with each bar containing the information about the impact on the dependent variable, clearly reveals the correlation between the parameter levels, if there is any. For example, in the teaser figure, the Workload types dbsrvr and websrvr can easily be compared based on the throughput values they span. A system running a websrvr workload has much less variation in throughput than a system running a dbsrvr workload. Similarly, all parameters can be correlated based on the user's objectives for system optimization. This satisfies requirements R1 and R2.
Analysts can use the Parameter Explorer to filter within a large set of possible configuration spaces. As shown in Figure 4, the user can select one or more levels for each parameter; for example, the level dbsrvr is selected (level name shown in black) while the remaining levels in Workload are not (level names shown in red). The user can also include or remove a parameter entirely; for example, Block Size (button shown in red) is toggled off by the analyst, so it is not considered in generating the aggregate view. This satisfies the filtering requirement R3.
We specifically designed the Parameter Explorer to accommodate many parameters in a small space. One bar is generated for each parameter level, and depending on the screen size, analysts can fit several parameters on a single screen for quick comparison and filtering of the parameter space. Compared to parallel sets (Figure 2, right), where at the finest level one line is drawn for each data point or group of identical data points (see the bottom portion of the plot), ICE is highly space-efficient in displaying parameter levels. The simple stacked-bars concept of ICE prevents the data clutter that plagues parallel sets, since it captures the configuration statistics succinctly in each bar. Figure 4 shows a portion of the Parameter Explorer for the system performance dataset. The complete view of the Parameter Explorer is available in the supplementary material.
The analyst can click on a level label to toggle it; the Parameter Explorer and the Aggregate View are then updated based on the filtered parameter-space data. In this way, analysts can iteratively move closer to the configurations with the desired value of the target variable, throughput.
4.3 Provenance Terminal
The Provenance Terminal (see Figure 5) is used to keep track of the progress of the iterative filtering activities. In this process, the analyst might want to toggle between multiple parameter configurations to compare the resulting dependent-variable distributions. The Provenance Terminal can be used to see and compare the dependent-variable ranges for the various iterated parameter configurations. It also allows the analyst to roll back to a previous parameter configuration if the evolution gets stuck with no hope of further improvement. This satisfies requirement R5. The maximum value of the dependent variable at each stage of the selection is shown with a red circular pointer on a red line, while the minimum value is shown with a blue circular pointer on a blue line. This view is updated with each user interaction.
An example use case of the Provenance Terminal can be that of a system administrator searching for the best configuration but with a minimum variation of the throughput. The latter will reduce the uncertainty in the predicted performance when the found parameter settings are applied in practice. The analyst would start off by selecting (Workload:Dbsrvr FileSystem:Xfs) as shown in stages 1–5 in Figure 6. We see that the minimum and the maximum throughput values almost converge to a very small range, but the maximum throughput value is compromised. To correct this, the analyst can go back to stage 4 by clicking on the red or blue pointer. This leads to a replication of this stage at the end of the chain as stage 6. Now the analyst can take a different path to get a better overall throughput while simultaneously optimizing for minimum throughput range: i.e., stages 7–8 in Figure 6 (Workload:Dbsrvr FileSystem:Ext2 InodeSize:128). In this way, the Provenance Terminal helps in comparing multiple configurations: i.e., comparing steps 1–5 (configuration 1) and steps 6–9 (configuration 2).
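The rollback-by-replication behavior described above can be sketched as follows (a minimal stand-in, not the actual ICE code; each stage tuple holds a selection plus the reachable min/max throughput):

```python
# Minimal sketch of the provenance behavior: rolling back to an earlier
# stage appends a copy of it to the end of the chain.
class ProvenanceStack:
    def __init__(self):
        self.stages = []

    def push(self, selection, min_tp, max_tp):
        self.stages.append((selection, min_tp, max_tp))

    def rollback(self, stage_index):
        # Clicking a past stage replicates it at the end of the chain,
        # so the earlier exploration path stays visible for comparison.
        self.stages.append(self.stages[stage_index])

p = ProvenanceStack()
p.push({"Workload": "dbsrvr"}, 100.0, 900.0)
p.push({"Workload": "dbsrvr", "FileSystem": "xfs"}, 480.0, 520.0)
p.rollback(0)  # return to stage 1; it is appended as a new stage
```

Keeping the abandoned path on the chain, rather than discarding it, is what enables the side-by-side comparison of configurations described above.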
4.4 Aggregate View
The Aggregate View, located to the right of the Parameter Explorer (B in the teaser figure), displays a single RD bar. While the main purpose of each Parameter Explorer RD bar is to convey the dependent numerical variable distributions possible if the respective parameter level is chosen, the Aggregate View communicates the distribution possible with all currently selected parameter levels. As such, it can be used to quickly visualize the impact of a transition from one parameter configuration to another. Whereas the Provenance Terminal summarizes only the top and bottom end of the achievable dependent-variable values, the Aggregate View offers detailed distribution information for the current parameter configuration.
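The aggregation over all selected levels can be sketched as follows (the selection structure and the data are hypothetical illustrations, not the ICE internals):

```python
from statistics import mean, median

# Hypothetical current selection: chosen levels per still-active parameter.
selected = {"FileSystem": {"ext4"}, "Workload": {"dbsrvr", "websrvr"}}

data = [
    ({"FileSystem": "ext4", "Workload": "dbsrvr"}, 500.0),
    ({"FileSystem": "xfs", "Workload": "dbsrvr"}, 300.0),
    ({"FileSystem": "ext4", "Workload": "websrvr"}, 420.0),
]

# The aggregate bar summarizes throughput over every configuration that
# matches ALL currently selected levels, not a single parameter level.
matching = [t for cfg, t in data
            if all(cfg[var] in levels for var, levels in selected.items())]
aggregate = {"min": min(matching), "max": max(matching),
             "mean": mean(matching), "median": median(matching)}
```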
4.5 Interaction with ICE: Two Case Studies
To get a sense for how analysts would interact with ICE, we present two use cases involving the systems performance dataset. One practical application is to analyze a system's performance stability. Systems vary greatly in their performance for different workloads, which can be quantified by the aforementioned range, i.e., the difference between the maximum and the minimum throughput for a particular configuration [8]. A large range means less stability and less predictability.
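The stability metric is easy to state in code (with made-up throughput samples; the numbers are illustrative only):

```python
# Hypothetical throughput samples per File System under one workload.
samples = {
    "btrfs": [300.0, 310.0, 305.0, 308.0],
    "ext4": [200.0, 480.0, 520.0, 350.0],
}

# Stability metric: range = max - min; a smaller range means a more
# stable and predictable configuration.
ranges = {fs: max(v) - min(v) for fs, v in samples.items()}
most_stable = min(ranges, key=ranges.get)
```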
The first use case shows how one would optimize a system running a mail server workload. Figure 7 shows the steps involved in the filtering process. First, the analyst selects the workload type as Mail Server by clicking the respective label. The File System throughput values change as shown in the first step in Figure 7. The primary concern here is to minimize the variation in the throughput for a more stable and predictable mail service. The analyst can clearly see that choosing the btrfs File System gives the minimum throughput range and thus is more stable and predictable for the user of the service. While its overall throughput is lower than for ext2 and ext4, these File Systems are less reliable and would leave users of the mail service often frustrated.
However, sometimes the user cannot change the File System (e.g., because it requires a costly disk reformat and restore), and thus it has to be set to ext4 regardless of the application. Such cases are quite common in practice, when it is not possible to change some parameters of the system. In such a case, the analyst can return to the previous state of filtering by way of the Provenance Terminal. After selecting the ext4 File System, the next parameter to tune is the Block Size, whose throughput values are shown in Stage 2 of Figure 7. Comparing the throughput distributions for each level of Block Size, the user selects a block size of 1024 since it results in the highest throughput value with minimum variation. After choosing Block Size = 1024, the Parameter Explorer view is updated with new throughput distributions for each parameter level. The next parameter the user can filter is the device type, shown as Stage 3 in Figure 7. For the given configuration, the device type ssd cannot be chosen since there is no sample with such a configuration in the dataset; its label is hence colored red. Now the analyst can select either a sas or a sata device. This presents a tradeoff: sas has a lower range while sata gives a higher throughput.
4.6 Design Alternatives
There were four design decisions for which we had to choose among alternatives. In this section we discuss why we chose the current design of the ICE tool given these alternatives.

RD bars instead of box plots: Box plots are great for representing the distribution of data with the help of percentiles, but the box shows only fifty percent of the data (i.e., from the 25th to the 75th percentile). They also assume that the data points are normally distributed, which can be restrictive: it certainly is a restriction in our application, as is apparent in the distributions shown in any of the RD bars.

RD bars instead of parallel sets: Bars make it possible to represent the parameters and their levels in a smaller space as compared to parallel sets. The RD bars also prevent data cluttering because they capture the configuration statistics succinctly without the need to draw individual lines (see also Section 4.2).

Displaying the distribution: Violin plots [27] and bean plots [32] are better at displaying distributions than box plots. We chose to display only one half of the violin plot inside each RD bar because it better utilizes the bar real estate. This is important since there might be a large number of parameters, and so the width available to each bar is limited. In the interest of accommodating more parameter levels in a uniform-looking display, the system experts suggested that half-violin plots inside the bars were the better design.

Choice of colors: The color choices for the percentiles and the distribution on the RD bars were decided with a user study. In an interactive session, the system researchers were presented with several possible color combinations for the RD bars chosen from ColorBrewer [22]. The present selection of colors was deemed most appropriate by the experts in terms of visual interpretation.
5 Implementation
Figure 8 shows the block diagram of the different components of our ICE tool. There is a backend server consisting of a Database, a Filtering Engine, and a Provenance Stack. The frontend consists of a Visualization Engine which runs in a browser. The backend is a Python Flask server and the frontend is created with [6]. A database stores the original dataset, which can be uploaded from the ICE interface.
The Filtering Engine updates the existing data based on a user request from the Visualization Engine. The data is then grouped separately for the Parameter Explorer and the Aggregate View and sent to the Visualization Engine for display. Another component of the backend is the Provenance Stack, which keeps track of the dependent-variable values with each user request. With every interaction, the Filtering Engine updates the Provenance Stack, which then updates the Provenance Terminal.
5.1 Data filtering
To filter and display large amounts of data in real time is challenging. ICE is optimized for filtering speed using one-hot-encoded filtering and random sampling. One-hot encoding is used to convert the categorical data to binary variables for faster processing with no loss of information. An example of converting categorical data to numerical form with one-hot encoding is provided in the supplementary material. This technique greatly reduces the time complexity of searching for a parameter level: where regular searching for a categorical parameter level has O(n·l) complexity, one-hot encoding has O(n) time complexity (n is the number of data points and l is the number of parameter levels). Another benefit of using one-hot encoding is that it generates a sparse version of the dataset, which is easier for modern systems to process with specialized data structures [19, 60, 53, 16].

For the requirement to display distribution curves for each parameter level, the time to display the filtered data also needs to be optimized. If we tried to use every data point in the calculation of the distribution, the time to display the visualization would not scale well with the size of the dataset. The time to display the full data on our dataset with around 100k configurations is around 1,400 milliseconds, which is too slow. Hence, sampling of the data is required to estimate the distributions. We evaluated the tradeoff between the information loss incurred by random sampling and the time to display the data. Figure 9 shows that as the distribution similarity (p-value) of the complete and sampled dataset increases, the time to generate the visualization also increases. To measure the information loss with sampling, we used the Kolmogorov-Smirnov test, comparing the data distribution of the sampled dataset with that of the complete dataset.

After evaluating the loss of information with sampling and the time to display the visualizations, a sample size of 20% proved to be an appropriate option: the display-time curve increases steeply for higher sample sizes, but the p-value does not increase much after 20%, hence a good tradeoff. ICE on the systems performance dataset uses 20% of the full dataset (20k data points), which takes around 800 milliseconds of display and filtering time. These results also give a good threshold for the dataset size that can be fully displayed with ICE without sampling. In the current implementation of ICE, datasets with fewer than 20k data points are processed without sampling. For larger datasets, the sample size is chosen as the point where the p-value crosses a 0.5 threshold.
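A pure-Python sketch of the one-hot filtering idea (the actual ICE backend has its own implementation; the rows below are hypothetical):

```python
# Hypothetical rows; the real dataset has ten categorical variables.
rows = [
    {"FileSystem": "ext4", "BlockSize": "1KB"},
    {"FileSystem": "xfs", "BlockSize": "4KB"},
    {"FileSystem": "ext4", "BlockSize": "4KB"},
]

# One binary column per (variable, level) pair: the one-hot encoding.
columns = sorted({(var, lv) for row in rows for var, lv in row.items()})
encoded = [[1 if row.get(var) == lv else 0 for var, lv in columns]
           for row in rows]

# Filtering on FileSystem == ext4 is now a single-column boolean scan
# instead of repeated string comparisons across all parameter levels.
col = columns.index(("FileSystem", "ext4"))
selected = [i for i, row in enumerate(encoded) if row[col]]
```

Because most entries in each encoded row are zero, the encoded table is sparse, which is what makes the specialized data structures mentioned above applicable.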
6 Evaluation
In this section, we evaluate ICE using the techniques suggested in the literature on nested-model visualization design [48, 46]. We first used the Analysis of Competing Hypotheses (ACH) [26] method as a mechanism to efficiently identify which of the existing techniques (see Section 1) would need to be formally compared with ours via a user study. ACH is a methodology for an unbiased comparison of a set of competing hypotheses, in our case the various visualization techniques evaluated in terms of the requirements put forward in Section 3.
The ACH showed that only ICE and Parallel Sets could satisfy all formulated hypotheses. We did not consider hypotheses comparing the goodness of a visualization or the effectiveness of filtering as these could be improved in any existing technique. Also, determining the goodness of a visualization is difficult [31] and requires a subjective study. We then conducted a formal user study to compare Parallel Sets with ICE.
6.1 Initial Comparative Evaluation Using ACH
The Analysis of Competing Hypotheses (ACH) is a technique for choosing the best possible solution to satisfy a set of hypotheses. Fitting our overarching application scenario, we only evaluated the existing techniques (and ICE) in terms of the specific task of analyzing a set of categorical data with respect to a numerical target variable. This corresponds to the interaction and technique design stage of the nested model by Munzner et al. [48, 46]. We derived six hypotheses from the requirements listed by the system performance experts (see Section 3), as follows:
H1: Allow an assessment of the distribution of a numerical variable in terms of a given parameter. The visualization is able to display the distributions of the dependent numerical variable for each parameter. The analyst can get an estimate of the nature of this distribution: bimodal, multimodal, uniform, normally distributed, etc.
H2: Allow an assessment of the correlation between parameters. The visualization makes it possible to compare or correlate the parameters in the dataset with respect to their impact on the target numerical variable. Irrespective of the method of correlation, the analyst should be able to derive informative conclusions while filtering the parameter space based on correlation.
H3: Enable quick filtering. Filtering is used to track the best-performing configurations for a desired goal. The visualization technique enables the analyst to add, remove, and edit the parameters of the configuration and see the updated distribution of the dependent numerical variable within one second.
H4: Allow an assessment of the statistics alongside the distribution. The visualization technique displays the statistics (mean, median, percentiles, max, and min) of the dependent numerical variable for each parameter.
H5: Allow informed predictions. The visualization provides cues to the analyst for filtering the parameter space.
H6: Provide insight on aggregate distributions. Similar to requirement R6, the visualization technique provides a summarized display of the dependent numerical variable values which can be reached from a given parameter setting.
We left out a hypothesis for the provenance visualization because it was not supported by any of the existing techniques (only ICE). Table 1 shows the results of the ACH-based evaluation applied to the available visualization techniques and our ICE. The comparison shows that, after eliminating any visualization technique that does not satisfy one or more of the hypotheses, only Parallel Sets and ICE fit all hypotheses.
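The elimination logic of ACH can be illustrated with a small sketch. The matrix entries below are hypothetical placeholders, not the actual contents of Table 1:

```python
# Illustrative ACH matrix: True means the technique satisfies the hypothesis.
# Entries are invented for illustration, not taken from the paper's Table 1.
hypotheses = ["H1", "H2", "H3", "H4", "H5", "H6"]
satisfies = {
    "ICE":                [True, True, True, True, True, True],
    "Parallel Sets":      [True, True, True, True, True, True],
    "Scatterplot matrix": [True, True, False, False, True, False],
}

# ACH eliminates any candidate that fails at least one hypothesis.
surviving = sorted(t for t, row in satisfies.items() if all(row))
```

Only the techniques whose rows are all True survive the elimination, which is how the candidate set was narrowed to Parallel Sets and ICE before the user study.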
6.2 User Study Comparing Parallel Sets and ICE
Although the ACH evaluation revealed that both Parallel Sets and ICE could be used to analyze categorical variables in the context of a target numerical variable, our computer systems experts voted against the use of Parallel Sets because it becomes too cluttered to effectively filter the parameter space for larger datasets. Nevertheless, to make these informal impressions more concrete, we conducted a user study to compare the effectiveness of ICE and Parallel Sets. The main objective of the user study was to compare the two tools on two metrics: time to filter configurations and accuracy of filtering. The participants were divided into three categories based on their expertise: System performance experts (SE), Visualization experts (VE), and Non-experts (NE). SEs were researchers working in the area of system performance, VEs were researchers working in the area of visual analytics, and NEs were users with no research experience in either of the two areas.
A question bank for the user study was compiled from questions designed independently by three systems researchers, to uniformly represent the requirements of the systems community. After an initial usage tutorial, participants were given two unique sets of five randomly sampled tasks from the question bank to perform on both tools. The dataset used in the study was the systems performance dataset described in Section 2. The user study was conducted with 21 users: 7 SEs, 7 VEs, and 7 NEs. The participants comprised 9 females and 12 males, aged 22 to 34 years.
The results of the user study demonstrated the effectiveness of the ICE tool over Parallel Sets, both in terms of accuracy and time to filter the parameter space. The average time for users to answer a question was 47.6 seconds with the ICE tool, compared to 73.3 seconds with Parallel Sets. To assess the statistical significance of this time difference, we performed a paired t-test on the distributions of each user's average time per question with the two tools. The p-value of the one-tailed t-test was p = .0074, which is lower than the significance level of .05. Hence, the mean time to filter the parameter space is lower with ICE than with Parallel Sets with high probability.
A similar analysis was done to measure the accuracy of each user on the five questions in the user study. The average accuracy of the participants using the ICE tool was 4.37, compared to 2.75 for Parallel Sets. The p-value obtained from the one-tailed t-test comparing the accuracy distributions was p < .001, which is well below the threshold of .05. Hence, the mean accuracy of the analyst for parameter filtering is higher via the ICE tool than via Parallel Sets with high probability. Given the results of this user study, we conclude that ICE is better than Parallel Sets for multi-dimensional parameter space analysis, both in terms of accuracy and time.
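The statistical procedure used for both comparisons can be reproduced as follows. The per-user timings below are illustrative stand-ins (the real study data is in the supplementary material):

```python
from scipy.stats import ttest_rel

# Hypothetical per-user mean answer times (seconds) on each tool.
time_ice   = [42.0, 51.0, 47.0, 49.0, 45.0, 50.0, 48.0, 46.0, 44.0, 52.0]
time_psets = [70.0, 78.0, 72.0, 75.0, 69.0, 80.0, 74.0, 71.0, 68.0, 76.0]

# Paired t-test on per-user differences; halving the two-sided p-value
# (when the t-statistic has the expected sign) gives the one-tailed result.
t_stat, p_two_sided = ttest_rel(time_ice, time_psets)
p_one_sided = p_two_sided / 2 if t_stat < 0 else 1 - p_two_sided / 2
significant = p_one_sided < 0.05
```

Pairing by user controls for individual speed and skill, so the test isolates the effect of the tool itself; the same procedure applies to the accuracy scores.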
We also analyzed the mean accuracy and time based on user expertise. With both tools, the NEs took the most time to answer the user study questions and had the lowest accuracy of the three expertise categories. The VEs were the most accurate with their answers but took slightly more time than the SEs. However, the trend of expertise-wise accuracy and time is the same for both ICE and Parallel Sets. All plots for the expertise-wise analysis, along with the user study tasks and the dataset, are provided in the supplementary material.
6.3 Case Studies
We also evaluated ICE with case studies derived from two datasets taken from Kaggle.com [2, 1]. One is an HR dataset of a US firm containing the hourly pay of its employees along with various parameters. The other is a French population characteristics dataset in which the population distribution of a set of French cities is studied on the basis of gender, cohabitation type, and age group. Two domain experts were consulted to evaluate the effectiveness of the ICE tool in studying the different parameters in these datasets. Expert A, who evaluated the ICE tool on the HR dataset, had management experience at a private firm; Expert B, who evaluated it on the French population dataset, was an expert survey analyst.
6.3.1 Exploring the HR Dataset
This case study uses ICE to explore the HR dataset. The dataset has seven categorical variables (Marital Status, US Residency Status, Hispanic Status, Race, Department, Employee Status, and Performance Score) and one dependent numerical variable: Hourly Pay Rate. To start out, Expert A (EA) first familiarized himself with the dataset and the usage of the ICE tool. Figure 10 shows part of the initial screen he browsed; it displays three of the seven variables with respect to the hourly pay scale. Some of the more interesting observations he made were: (1) Married workers had the highest individual hourly pay, but the mean hourly pay was highest for single workers. (2) The mean hourly pay of non-residents who are eligible for US citizenship is higher than that of residents. (3) White workers have the highest hourly pay among all races. (4) Among the departments, the executive department had the highest hourly pay scale, followed by IT services.
After the initial analysis, the other two variables in the dataset that were of particular interest to EA were Employee Source and Performance Score. He wanted to see whether high-performing employees were properly compensated for their valuable efforts. The Parameter Explorer made this investigation easy, and EA quickly confirmed that exceptional employees were indeed paid more than other employees, with a mean pay of about $40 per hour, as shown in Figure 11.
Another parameter of interest was the hiring source of these exceptional employees. EA selected the exceptional performance score in the Parameter Explorer. This filtering updated the Employment Source group to only show the sources of exceptional workers with respect to their hourly pay. Figure 11 shows the result of this filtering and the caption offers a few interesting observations.
EA suggested that, for better equality across all sources of exceptional workers, their mean hourly pay should be similar. He also suggested that investment in college fairs and job sessions be lowered, as they are not a good source of exceptional workers. EA then confirmed that the use of ICE would help the HR department better manage the company's funding and investments.
6.3.2 Exploring the French Population Dataset
The French population habitation dataset was collected to show existing equalities and inequalities in France. It consists of four categorical variables (City, MOC (Method of Cohabitation), Age Group, and Gender). The dependent numerical variable, Population Count, is the number of people in each of the categories defined by permutations of the independent variables; for example, one category might be adult females aged 21–40 living in Paris with their children. Expert B (EB) was a survey analyst and, like Expert A, first familiarized himself with the ICE tool by looking at an overview of the dataset's variables. The overview screen of ICE showing the population distributions and statistics is provided in the supplementary material. EB's initial observations were: (1) The population count for a few categories in Paris is exceptionally high compared to other categories, since the mean is very low compared to the highest value; this can also be seen in Figure 12. (2) The mean population of the age range 60–80 is the highest in all cities. (3) The age group 20 to 40 is the lowest on average for all cities. (4) The average number of females is higher than the average number of males for the overall population.
Following these basic inferences, EB was interested in studying the habitation methods of females in three major cities of France: Paris, Marseille, and Lyon. EB selected Paris from the City variable, followed by category 2 (female) from the Gender variable. The Parameter Explorer then showed the population distributions for all categories of habitation methods, as shown in Figure 12. EB could see that most females were children living with two parents, i.e., category 11 (shown by a single dot because all of these females fall in the same age group of below 20 years), followed by females living alone (i.e., category 32). Similar analyses were done for the cities of Marseille and Lyon. For Marseille, EB pointed out that almost equal numbers of females lived as a single household and in a family with children. For Lyon, most females were children living with two parents, followed by females living as a single household. EB then used the Provenance Terminal to go back two stages in the filtering process to compare the female habitations across all cities. EB further pointed out that Paris had an exceptionally large number of children living with a single parent as compared to other cities.
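EB's filtering sequence can be approximated outside ICE with a few lines of pandas; the toy table and category codes below are illustrative stand-ins, not the actual Kaggle data:

```python
import pandas as pd

# Tiny hypothetical stand-in for the French population dataset; columns and
# codes mirror the description above (2 = female, MOC = cohabitation code).
df = pd.DataFrame({
    "City":   ["Paris", "Paris", "Paris", "Lyon", "Lyon", "Marseille"],
    "Gender": [2, 2, 1, 2, 1, 2],
    "MOC":    [11, 32, 11, 11, 22, 32],
    "Population": [5400, 3100, 5100, 900, 800, 700],
})

# EB's filter: Paris, then females; then inspect population per habitation method
filtered = df[(df["City"] == "Paris") & (df["Gender"] == 2)]
by_moc = filtered.groupby("MOC")["Population"].sum().sort_values(ascending=False)
```

Each successive filter narrows the parameter space in the same way ICE's Parameter Explorer does, and stepping back through the filters corresponds to the Provenance Terminal's role.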
After evaluating the use of ICE on the French population dataset, EB recommended ICE as an effective tool for quickly filtering and understanding survey statistics. EB also found the ICE tool helpful in understanding biases in the population distributions.
7 Conclusions
This paper presents the ICE tool, a novel approach for categorical parameter space analysis in the context of a dependent numerical variable. ICE overcomes existing challenges by providing an effective layout for parameter space visualization. The stacked RD bars concept used in ICE, together with its interactions, assists in effective filtering of the parameter space. A greater number of parameters can be visualized and readily correlated, thus increasing the efficiency of filtering. Multiple configurations can be compared for their impact on the target variable based on any objective. ICE also supports multi-objective filtering, since it presents full statistics and distribution information to the user for each parameter level.
Several important lessons were learned while designing ICE. In the requirement analysis phase with the systems community researchers, we realized that presenting results gathered from the dataset with existing visualization techniques made the gathering of requirements more effective. Almost from the start, the system experts were skeptical about the accuracy of most existing techniques; they wanted a tool that would show the statistical distributions precisely. It also helps to keep a keen eye on any struggles the collaborating domain experts may experience. For example, in the filtering experiments we noticed that they had trouble remembering the filtering path, which gave rise to the Provenance Terminal.
Besides the effective design of ICE, some limitations remain that can be addressed in future work. For larger datasets, techniques to combine multiple parameters [35] can be incorporated to prevent excessive thinning of the bars. Moreover, related precomputed solutions based on optimization objectives can be provided to the analyst to jump-start the search process. Also, ICE assumes that the cost of changing parameters is the same throughout, which might not be true in some cases; moreover, these costs might vary with time [63]. It would be useful to incorporate cost measures into ICE and provide support for real-time, cost-based filtering. We will continue working on our ICE tool to incorporate these new features.
Acknowledgments
We would like to thank the anonymous VAST 2019 reviewers for their valuable comments. This work was made possible in part thanks to Dell EMC, NetApp, and IBM support; NSF awards IIS-1527200, CNS-1251137, CNS-1302246, CNS-1305360, CNS-1622832, CNS-1650499, and CNS-1730726; and ONR award N00014-16-1-2264.
References
 [1] Kaggle French population dataset. https://www.kaggle.com/etiennelq/frenchemploymentbytown. Accessed: 2019-03-30.
 [2] Kaggle HR dataset. https://www.kaggle.com/rhuebner/humanresourcesdataset. Accessed: 2019-03-30.
 [3] D. Barbará, Y. Li, and J. Couto. COOLCAT: an entropy-based algorithm for categorical clustering. In Proceedings of the eleventh international conference on Information and knowledge management, pp. 582–589. ACM, 2002.
 [4] V. Batagelj and A. Mrvar. Pajek: Program for analysis and visualization of large networks. Timeshift: The World in Twenty-Five Years: Ars Electronica, pp. 242–251, 2004.
 [5] W. Berger, H. Piringer, P. Filzmoser, and E. Gröller. Uncertainty-aware exploration of continuous parameter spaces using multivariate prediction. In Computer Graphics Forum, vol. 30, pp. 911–920. Wiley Online Library, 2011.
 [6] M. Bostock, V. Ogievetsky, and J. Heer. D³: data-driven documents. IEEE transactions on visualization and computer graphics, 17(12):2301–2309, 2011.
 [7] Z. Cao, G. Kuenning, K. Mueller, A. Tyagi, and E. Zadok. Graphs are not enough: Using interactive visual analytics in storage research. In 11th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 19), 2019.
 [8] Z. Cao, V. Tarasov, H. P. Raman, D. Hildebrand, and E. Zadok. On the performance variation in modern storage stacks. In 15th USENIX Conference on File and Storage Technologies (FAST 17), pp. 329–344, 2017.
 [9] Z. Cao, V. Tarasov, S. Tiwari, and E. Zadok. Towards better understanding of black-box auto-tuning: a comparative analysis for storage systems. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 893–907, 2018.
 [10] B. C. Cappers and J. J. van Wijk. Exploring multivariate event sequences using rules, aggregations, and selections. IEEE transactions on visualization and computer graphics, 24(1):532–541, 2017.
 [11] M. Chen, G. B. Bangera, D. Hildebrand, F. Jalia, G. Kuenning, H. Nelson, and E. Zadok. vNFS: maximizing NFS performance with compounds and vectorized I/O. ACM Transactions on Storage (TOS), 13(3):21, 2017.
 [12] W. Chen, F. Guo, and F.-Y. Wang. A survey of traffic data visualization. IEEE Transactions on Intelligent Transportation Systems, 16(6):2970–2984, 2015.
 [13] H. Cohn and M. Fielding. Simulated annealing: searching for an optimal temperature schedule. SIAM Journal on Optimization, 9(3):779–802, 1999.
 [14] K. Daniels, G. Grinstein, A. Russell, and M. Glidden. Properties of normalized radial visualizations. Information Visualization, 11(4):273–300, 2012.
 [15] I. Demir, C. Dick, and R. Westermann. Multi-charts for comparative 3d ensemble visualization. IEEE Transactions on Visualization and Computer Graphics, 20(12):2694–2703, 2014.
 [16] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct methods for sparse matrices. Oxford University Press, 2017.
 [17] S. Epskamp, A. O. Cramer, L. J. Waldorp, V. D. Schmittmann, D. Borsboom, et al. qgraph: Network visualizations of relationships in psychometric data. Journal of Statistical Software, 48(4):1–18, 2012.
 [18] D. E. Goldberg and J. H. Holland. Genetic algorithms and machine learning. Machine learning, 3(2):95–99, 1988.
 [19] G. H. Golub and C. F. Van Loan. Matrix computations, vol. 3. JHU press, 2012.
 [20] M. J. Greenacre. Correspondence analysis. London: Academic Press, 1984.
 [21] G. Grinstein, C. B. Jessee, P. Hoffman, P. O'Neil, A. Gee, and E. Grigorenko. High-dimensional visualization support for data mining gene expression data. DNA Arrays: Technologies and Experimental Strategies, pp. 86–131, 2001.
 [22] M. Harrower and C. A. Brewer. ColorBrewer.org: an online tool for selecting colour schemes for maps. The Cartographic Journal, 40(1):27–37, 2003.
 [23] J. A. Hartigan. Printer graphics for clustering. Journal of Statistical Computation and Simulation, 4(3):187–213, 1975.
 [24] Z. He, X. Xu, and S. Deng. SQUEEZER: an efficient algorithm for clustering categorical data. Journal of Computer Science and Technology, 17(5):611–624, 2002.
 [25] C. Heinzl and S. Stappen. Star: Visual computing in materials science. In Computer Graphics Forum, vol. 36, pp. 647–666. Wiley Online Library, 2017.
 [26] R. J. Heuer Jr. Analysis of competing hypotheses. Psychology of Intelligence Analysis, pp. 95–110, 1999.
 [27] J. L. Hintze and R. D. Nelson. Violin plots: a box plotdensity trace synergism. The American Statistician, 52(2):181–184, 1998.
 [28] P. Hoffman, G. Grinstein, K. Marx, I. Grosse, and E. Stanley. DNA visual and analytic data mining. In Proceedings. Visualization'97 (Cat. No. 97CB36155), pp. 437–441. IEEE, 1997.
 [29] J. H. Holland et al. Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT press, 1992.
 [30] Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3):283–304, 1998.
 [31] C. Johnson, R. Moorhead, T. Munzner, H. Pfister, P. Rheingans, and T. S. Yoo. NIH-NSF visualization research challenges report. Institute of Electrical and Electronics Engineers, 2005.
 [32] P. Kampstra et al. Beanplot: A boxplot alternative for visual comparison of distributions. 2008.
 [33] E. Kandogan. Star coordinates: A multidimensional visualization technique with uniform treatment of dimensions. In Proceedings of the IEEE Information Visualization Symposium, vol. 650, p. 22. Citeseer, 2000.
 [34] E. Kandogan. Visualizing multidimensional clusters, trends, and outliers using star coordinates. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 107–116. ACM, 2001.
 [35] D. A. Keim, F. Mansmann, D. Oelke, and H. Ziegler. Visual analytics: Combining automated discovery with interactive visualizations. In International Conference on Discovery Science, pp. 2–14. Springer, 2008.
 [36] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
 [37] R. Kosara, F. Bendix, and H. Hauser. Parallel sets: Interactive exploration and visual analysis of categorical data. IEEE transactions on visualization and computer graphics, 12(4):558–568, 2006.
 [38] J. Kruskal. The relationship between multidimensional scaling and clustering. In Classification and clustering, pp. 17–44. Elsevier, 1977.
 [39] A. Kumar, A. Tyagi, M. Burch, D. Weiskopf, and K. Mueller. Task classification model for visual fixation, exploration, and search. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications, p. 65. ACM, 2019.
 [40] D. J. Lehmann and H. Theisel. Orthographic star coordinates. IEEE Transactions on Visualization and Computer Graphics, 19(12):2615–2624, 2013.
 [41] D. J. Lehmann and H. Theisel. Optimal sets of projections of highdimensional data. IEEE Transactions on Visualization and Computer Graphics, 22(1):609–618, 2015.
 [42] D. J. Lehmann and H. Theisel. Optimal sets of projections of highdimensional data. IEEE transactions on visualization and computer graphics, 22(1):609–618, 2016.
 [43] Z. Li, K. M. Greenan, A. W. Leung, and E. Zadok. Power consumption in enterprisescale backup storage systems. Power, 2(2), 2012.
 [44] Z. Li, R. Grosu, K. Muppalla, S. A. Smolka, S. D. Stoller, and E. Zadok. Model discovery for energyaware computing systems: An experimental evaluation. In 2011 International Green Computing Conference and Workshops, pp. 1–6. IEEE, 2011.
 [45] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of machine learning research, 9(Nov):2579–2605, 2008.
 [46] M. Meyer, M. Sedlmair, and T. Munzner. The four-level nested model revisited: blocks and guidelines. In Proceedings of the 2012 BELIV Workshop: Beyond Time and Errors: Novel Evaluation Methods for Visualization, p. 11. ACM, 2012.
 [47] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, and K.-R. Mullers. Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop, pp. 41–48. IEEE, 1999.
 [48] T. Munzner. A nested model for visualization design and validation. IEEE transactions on visualization and computer graphics, 15(6):921–928, 2009.
 [49] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in neural information processing systems, pp. 849–856, 2002.
 [50] L. Novakova and O. Štepanková. RadViz and identification of clusters in multidimensional data. In 2009 13th International Conference Information Visualisation, pp. 104–109. IEEE, 2009.
 [51] J. H. P. Ono, F. Sikansi, D. C. Corrêa, F. V. Paulovich, A. Paiva, and L. G. Nonato. Concentric radviz: visual exploration of multitask classification. In 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, pp. 165–172. IEEE, 2015.
 [52] H. Piringer, W. Berger, and J. Krasser. HyperMoVal: Interactive visual validation of regression models for real-time simulation. In Computer Graphics Forum, vol. 29, pp. 983–992. Wiley Online Library, 2010.
 [53] S. Pissanetzky. Sparse Matrix Technology (electronic edition). Academic Press, 1984.
 [54] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. science, 290(5500):2323–2326, 2000.
 [55] M. Rubio-Sánchez, L. Raya, F. Diaz, and A. Sanchez. A comparative study between RadViz and star coordinates. IEEE transactions on visualization and computer graphics, 22(1):619–628, 2015.
 [56] M. Sedlmair, C. Heinzl, S. Bruckner, H. Piringer, and T. Möller. Visual parameter space analysis: A conceptual framework. IEEE Transactions on Visualization and Computer Graphics, 20(12):2161–2170, 2014.
 [57] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
 [58] L. Shao, A. Mahajan, T. Schreck, and D. J. Lehmann. Interactive regression lens for exploring scatter plots. In Computer Graphics Forum, vol. 36, pp. 157–166. Wiley Online Library, 2017.
 [59] V. Tarasov, S. Bhanage, E. Zadok, and M. Seltzer. Benchmarking file system benchmarking: It* is* rocket science. In HotOS, vol. 13, pp. 1–5, 2011.
 [60] R. P. Tewarson and K.-Y. Cheng. A desirable form for sparse matrices when computing their inverse in factored forms. Computing, 11(1):31–38, 1973.
 [61] A. Tyagi, A. Kumar, A. Gandhi, and K. Mueller. Road accidents in the uk (analysis and visualization). IEEE VIS, 2018.
 [62] A. Unger, N. Dräger, M. Sips, and D. J. Lehmann. Understanding a sequence of sequences: Visual exploration of categorical states in lake sediment cores. IEEE transactions on visualization and computer graphics, 24(1):66–76, 2017.
 [63] D. Van Aken, A. Pavlo, G. J. Gordon, and B. Zhang. Automatic database management system tuning through large-scale machine learning. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1009–1024. ACM, 2017.
 [64] Y. Wang, K. Feng, X. Chu, J. Zhang, C.-W. Fu, M. Sedlmair, X. Yu, and B. Chen. A perception-driven approach to supervised dimensionality reduction for visualization. IEEE transactions on visualization and computer graphics, 24(5):1828–1840, 2018.
 [65] J. Waser, R. Fuchs, H. Ribicic, B. Schindler, G. Bloschl, and E. Groller. World lines. IEEE transactions on visualization and computer graphics, 16(6):1458–1467, 2010.
 [66] J. Weissenböck, B. Fröhler, E. Gröller, J. Kastner, and C. Heinzl. Dynamic volume lines: Visual comparison of 3d volumes through space-filling curves. IEEE transactions on visualization and computer graphics, 25(1):1040–1049, 2018.
 [67] Y. Wu, F. Wei, S. Liu, N. Au, W. Cui, H. Zhou, and H. Qu. OpinionSeer: interactive visualization of hotel customer feedback. IEEE transactions on visualization and computer graphics, 16(6):1109–1118, 2010.
 [68] E. Zadok, A. Arora, Z. Cao, A. Chaganti, A. Chaudhary, and S. Mandal. Parametric optimization of storage systems. In 7th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 15), 2015.
 [69] Z. Zhang, K. T. McDonnell, and K. Mueller. A network-based interface for the exploration of high-dimensional data spaces. In 2012 IEEE Pacific Visualization Symposium, pp. 17–24. IEEE, 2012.
 [70] Z. Zhang, K. T. McDonnell, E. Zadok, and K. Mueller. Visual correlation analysis of numerical and categorical data on the correlation map. IEEE transactions on visualization and computer graphics, 21(2):289–303, 2015.
 [71] X. Zhao, J. Liang, and C. Dang. Clustering ensemble selection for categorical data based on internal validity indices. Pattern Recognition, 69:150–168, 2017.
 [72] X. Zhu, M. Uysal, Z. Wang, S. Singhal, A. Merchant, P. Padala, and K. Shin. What does control theory bring to systems research? ACM SIGOPS Operating Systems Review, 43(1):62–69, 2009.