Exploratory data analysis (Tukey, 1977), often performed interactively, is an established approach for learning about patterns in a data set prior to more formal analyses. Humans are able to easily identify patterns in the data visually, even when the patterns are complex and difficult to model algorithmically. Visual data exploration is hence a powerful tool for exploring patterns in the data, and a multitude of visual exploration systems have been designed for this purpose over the years. Let us now consider some general requirements for an efficient visual exploration system.
The system must take into account the user’s knowledge of the data, which iteratively accumulates during exploration.
The user must be shown informative views of the data given the user’s current knowledge.
The user must be able to steer the exploration in order to answer specific questions.
Despite the long history of visual exploration systems, they still lack a principled approach with respect to these general requirements. In this paper we address several shortcomings related to these requirements. Specifically, our goal and main contribution is to devise a framework for human-guided data exploration by modelling the user’s background knowledge and objectives, and using these to provide the user with the most informative views of the data.
Our contribution consists of three main parts: (i) a framework for modelling and incorporating the user’s background knowledge of the data that can be iteratively updated, (ii) a method for finding the most informative views of the data, and (iii) a solution allowing the user to steer the visual data exploration process so that specific hypotheses formulated by the user can be answered. The first and third contributions are general, while the second one, that is, finding the most informative views of the data, is specific to a particular data type. In this paper we focus on data items that can be represented as real-valued vectors of attribute values. This paper extends our earlier works: a preprint (Puolamäki, Oikarinen, Atli, and Henelius, 2018) and Henelius et al. (2018), the latter of which only considers axis-aligned projections of the data and does not take advantage of the dimensionality reduction method presented in this work.
We next discuss the relation of our present work to existing literature on exploratory data analysis. Our first contribution is related to iterative data mining (Hanhijärvi et al., 2009), which is a paradigm where patterns already discovered by the user are taken into account as constraints during subsequent exploration. In brief, this works as follows. The user explores the data and observes a pattern in a view. The user marks the observed pattern as known in the exploration system. The system then takes this newly added pattern, as well as all other previously added patterns, into account when constructing the next view shown to the user. The goal is to prevent the system from showing already known information to the user again. This concept of iterative pattern discovery is also central to the data mining framework presented by De Bie (2011a; 2011b; 2013), where the user’s current knowledge (or beliefs) of the data is modelled as a probability distribution over data sets. This distribution is then updated iteratively during the exploration phase as the user discovers new patterns. Our work has been motivated by Puolamäki et al. (2010, 2016), Kang et al. (2016b), and Puolamäki, Oikarinen, Kang, Lijffijt, and Bie (2018), where these concepts have been successfully applied in visual exploratory data analysis such that the user is visually shown a view of the data which is maximally informative given the user’s current knowledge. Visual interactive exploration has also been applied in different contexts, for example, in item-set mining and subgroup discovery (Boley et al., 2013; Dzyuba and van Leeuwen, 2013; van Leeuwen and Cardinaels, 2015; Paurat et al., 2014), information retrieval (Ruotsalo et al., 2015), and network analysis (Chau et al., 2011).
Concerning our second contribution, the problem of determining which views of the data are maximally informative to the user (and hence interesting) has been approached in terms of, for example, different projections and measures of interestingness (De Bie et al., 2016; Kang et al., 2016a; Vartak et al., 2015). Constraints have also been used to assess the significance of data mining results, for example, in pattern mining (Lijffijt et al., 2014) or in investigating spatio-temporal relations (Chirigati et al., 2016). We observe, however, that a view that is maximally informative to the user is one that contrasts most strongly with the user’s current knowledge. Hence, this kind of view is maximally “surprising” to the user with respect to his or her current knowledge.
However, always showing maximally informative views to the user leads to a problem, which can be seen as one of the major shortcomings of previous work on iterative data mining and its applications to visual exploratory data analysis. By definition, maximally informative views given the user’s existing knowledge will be surprising. Because the user is not able to control the path that the exploration takes, it is difficult to investigate specific hypotheses concerning the data or to steer the exploration process. Traditional iterative data mining hence suffers from a navigational problem (Puolamäki et al., 2010). Our third contribution is to solve this navigational problem by incorporating both the user’s knowledge of the data and different hypotheses concerning the data into the background distribution. It is often the case that the user has some pre-existing exploration objectives before starting the analysis, or that the user develops specific hypotheses during the exploration phase. This navigational aspect of the exploration process has, as far as we are aware, not been addressed previously, and we believe that the contribution we make in this area is highly important for any real interactive iterative data analysis framework.
Our framework is sketched in Figure 1. More formally, as in Lijffijt et al. (2014), we denote the original data set by $X$ and the set of all possible data sets by $\Omega$. We further define a set of constraints $\mathcal{C}$. A constraint $C \in \mathcal{C}$ is simply a subset of all possible data sets which always also includes the original data set, that is, $X \in C \subseteq \Omega$ is satisfied for all $C \in \mathcal{C}$. Any set of constraints $\mathcal{C}$ can be used to define a subset of data sets that satisfy all of the constraints in $\mathcal{C}$ by $\Omega_{\mathcal{C}} = \bigcap_{C \in \mathcal{C}} C$, with $\Omega_{\emptyset}$ defined as $\Omega$.
We assume that the user observes a set of relations such as correlations, cluster structures etc. from the data, as later defined by Definition 4. A constraint—or a set of constraints—can either preserve or break a relation. If the user observes that some relations are preserved in the data the user can infer that the data obeys the constraints that preserve these relations.
In this paper we assume that the user’s knowledge can be parametrised by a set of constraints and by a uniform distribution over the data sets that satisfy all of these constraints, with the probability of all other data sets being zero. We call this uniform distribution a background distribution, which describes the probabilities the user gives for different possible data sets. Intuitively, the constraints denote the relations (or patterns) in the data that the user is already aware of. In the user’s mind, any data set that does not contradict any of these constraints is equally probable, while data sets that contradict any of the constraints have zero probability. In the absence of constraints, that is, when the user knows nothing, the background distribution corresponds to the user considering all possible data sets equally likely, as shown in Figure 2.
Now, out of all possible views of the data (such as scatter plots over different coordinate axes) the most informative should be the one that—according to some measure—shows the maximal difference between the data set and a data set sampled from the background distribution as in Figure 3. When looking at this view, the user may learn more about the relations in the data and can add the learned knowledge as new constraints in , after which a new maximally informative view can be produced. This iterative process continues until the user has learned all he or she wants to know about the data; this is the approach taken, for example, in Puolamäki et al. (2016) and Puolamäki, Oikarinen, Kang, Lijffijt, and Bie (2018). As already mentioned, the drawback of the approach is that each new view is by definition also maximally surprising to the user and that there is no way to guide the exploration towards the user’s interests.
In this paper we complement this framework by using the constraints to parametrise, in addition to the user’s knowledge, what the user wants to know about the data. We do this by defining a new set of constraints, denoted by which defines the relations that are of interest to the user. We further define a set of constraints which defines the relations that are of no interest to the user. Instead of comparing the data and the background distribution, as earlier, we find a view that shows the maximal difference between samples from the uniform distributions from and , respectively. Notice that if we are interested in all constraints, i.e., and , this new formulation reduces to the earlier approach of Puolamäki et al. (2016) and Puolamäki, Oikarinen, Kang, Lijffijt, and Bie (2018), at least if all of the constraints together allow only the original data set, or .
[Figure: two side-by-side panels titled “Informative view” and “Uninformative view”.]
The advantage of the new formulation is that by expressing the user’s knowledge and the user’s objectives using the same parametrisation we can in an elegant way formalise the data exploration process as finding views that show differences between two distributions, or between samples from two distributions.
The above discussion is generic for all classes of data sets and constraints. However, in the remainder of the paper we assume that the data set is a data table with rows corresponding to data items and columns to attributes, and the set consists of all data sets that can be obtained by randomly permuting the columns of . The constraints in are parametrised by tiles containing subsets of rows and columns, respectively. Samples from the constrained distribution can then be obtained efficiently by permutations, as described later in Section 2, which also gives the exact definitions for the concepts mentioned above. The views considered in this paper are scatterplots of the data obtained by linear projections that show the maximal differences of the distributions described above.
In summary, our contributions are:
a computationally efficient formulation and implementation of the user’s background knowledge of the data and objectives (which we here call hypotheses) using constrained randomisation,
a dimensionality reduction method for finding the view most informative to the user, and
an experimental evaluation that supports that our approach is fast, robust, and produces easily understandable results.
The rest of this paper is organised as follows. Our framework is formalised in Section 2, where we describe how to model the user’s background knowledge of the data, how data exploration objectives are formulated and updated, and how maximally informative views are determined. In Section 3 we empirically evaluate our framework, by considering both computational efficiency and robustness against noise, as well as provide use cases of user-guided exploration. We conclude the paper with a discussion in Section 4.
Let $X$ be an $n \times m$ data matrix (data set). Here $X(i, j)$ denotes the $i$th element in column $j$. Each column $X(\cdot, j)$, $j \in [m]$, is an attribute in the data set (we use the shorthand $[m] = \{1, \ldots, m\}$). Let $D$ be a finite set of domains (for example, continuous or categorical) and let $D(j)$ denote the domain of $X(\cdot, j)$. Also let $X(i, j) \in D(j)$ for all $i \in [n]$ and $j \in [m]$, that is, all elements in a column belong to the same domain but different columns can have different domains. The derivations in Sections 2.1 and 2.2 are generic with respect to domains, but in Section 2.3 we consider only real numbers, that is, $D(j) = \mathbb{R}$ for all $j \in [m]$.
2.1 Permutations and Tile Constraints
We proceed to introduce the permutation-based sampling method and tile constraints which are used to constrain the sampled distributions as well as to express the user’s background knowledge and objectives (hypotheses). The distributions are constructed so that in the absence of constraints (tiles) the marginal distributions of the attributes are preserved.
We define a permutation of the data matrix as follows. [Permutation] Let $\mathcal{P}$ denote the set of permutation functions of length $n$ such that $\pi : [n] \to [n]$ is a bijection for all $\pi \in \mathcal{P}$, and denote by $(\pi_1, \ldots, \pi_m) \in \mathcal{P}^m$ the vector of column-specific permutations. A permutation $\hat{X}$ of a data matrix $X$ is then given as $\hat{X}(\pi_j(i), j) = X(i, j)$. When permutation functions are sampled uniformly at random, we obtain a uniform sample from the distribution of data sets where each of the attributes has the same marginal distribution as the original data. Hence, given a data set $X$, the set of possible data sets is $\Omega = \{\hat{X} \mid \hat{X}(\pi_j(i), j) = X(i, j),\ (\pi_1, \ldots, \pi_m) \in \mathcal{P}^m\}$.
A tile is a tuple $t = (R, C)$, where $R \subseteq [n]$ and $C \subseteq [m]$. The tiles considered here are combinatorial (in contrast to geometric), meaning that rows and columns in the tile do not need to be consecutive. In the unconstrained case, there are $(n!)^m$ allowed vectors of permutations. We parametrise distributions using tile constraints preserving the relations in a data matrix for subsets of rows and columns. The tiles constrain the set of allowed permutations as follows. [Tile constraint] Given a tile $t = (R, C)$, where $R \subseteq [n]$ and $C \subseteq [m]$, a vector of permutations $(\pi_1, \ldots, \pi_m)$ is allowed by $t$ iff the following condition is true for all $i \in [n]$, $j \in [m]$, and $j' \in [m]$:
$$ i \in R \,\wedge\, j, j' \in C \implies \pi_j(i) \in R \,\wedge\, \pi_j(i) = \pi_{j'}(i). $$
Given a set of tiles , a vector of permutations is allowed iff is allowed by all . For an empty set of tiles , all permutations in are allowed. A tile defines a subset of rows and columns, and the rows in this subset are permuted by the same permutation function in each column in the tile. In other words, the relations between the columns inside the tile are preserved. Thus, given a data set and a set of tiles , the subset of data sets in that satisfy all of the tile constraints in is given by
Notice that the identity permutation is always an allowed permutation. Figure 4 shows an example of both unconstrained permutation and permutation constrained with a tile.
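As a minimal sketch of the permutation scheme just described, the following Python function draws one permuted data set under a set of non-overlapping tiles. The representation is an assumption made for this illustration only: the data matrix is a list of rows, a tile is a pair of 0-indexed row and column sets, and the name `sample_permutation` is hypothetical. Rows inside a tile share one permutation across the tile's columns, while cells not covered by any tile are permuted independently within each column.

```python
import random

def sample_permutation(data, tiling, rng):
    """Return a permuted copy of `data` (a list of rows) allowed by `tiling`.

    `tiling` is a list of non-overlapping tiles (rows, cols), 0-indexed.
    """
    n, m = len(data), len(data[0])
    # perm[j][i] is the new row index of element (i, j); start from identity.
    perm = [list(range(n)) for _ in range(m)]
    # Rows of each column that are not covered by any tile.
    free = [set(range(n)) for _ in range(m)]
    for rows, cols in tiling:
        src = sorted(rows)
        dst = src[:]
        rng.shuffle(dst)  # one shared permutation for the whole tile
        for j in cols:
            for i, k in zip(src, dst):
                perm[j][i] = k
            free[j] -= set(rows)
    for j in range(m):
        src = sorted(free[j])
        dst = src[:]
        rng.shuffle(dst)  # independent permutation of the uncovered rows
        for i, k in zip(src, dst):
            perm[j][i] = k
    out = [[None] * m for _ in range(n)]
    for j in range(m):
        for i in range(n):
            out[perm[j][i]][j] = data[i][j]  # element (i, j) moves to row perm[j][i]
    return out
```

Every column of the result is a rearrangement of the original column, so the marginal distributions are preserved, and the values of the tile's columns stay paired on the tile's rows.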
We proceed to define formally what we mean by relations in this paper. [Relation] A relation is a real-valued function $f$ over data matrices. Given a set of tiles $T$, we say that $T$ preserves the relation $f$ if $f(\hat{X}) = f(X)$ is satisfied for all permutations $\hat{X}$ of the data matrix $X$ allowed by $T$. Otherwise, we say that $T$ breaks the relation $f$. Thus, we use the term relation to denote any structure in the data which can be controlled (that is, essentially broken if need be) in the permutation scheme parametrised by the tile constraints. In practice, some tolerance could be included in the above definition for the condition $f(\hat{X}) = f(X)$ instead of exact equivalence. Examples of relations conforming to the above definition include correlations between attributes and cluster structures. For example, for a real-valued data matrix a relation could be defined as the covariance between columns $j$ and $j'$, i.e., $f(X) = \mathrm{cov}(X(\cdot, j), X(\cdot, j'))$. A set of tiles that contains a tile $t = ([n], \{j, j'\})$ would preserve this relation. On the other hand, a set of tiles which would allow some of the rows in columns $j$ and $j'$ to be permuted independently would break the relation $f$. Another example of a possible relation would be the scagnostics for a scatterplot visualisation (Wilkinson et al., 2005).
We make an implicit assumption that if the user observes in the data that certain (visual) relations are preserved, then the user can conclude that the data also obeys constraints that preserve those same relations. The user can then add these constraints to the background distribution. Notice that the relations correspond to visual patterns (correlations, cluster structures, outliers etc.) that the user could possibly observe. The relations would not be evaluated by the computer, but they would be part of the user’s cognitive processing of the visualisations. Therefore, in practical applications, there is usually no need, nor would it be possible, to define all of the relations explicitly. For our purposes it is enough that the user can match the observed visual relations in the data to the corresponding constraints with reasonable accuracy.
We use tile constraints to describe the user’s knowledge concerning relations in the data. As the user views the data he or she can observe relations and represent these as tile constraints. For example, the user can mark an observed cluster structure with a tile involving the data points in the cluster and the relevant attributes. We denote the set of user-defined tiles by . Then, a uniform distribution from is the background distribution, which describes the probabilities the user gives for different possible data sets.
We can now formulate our sampling problem as follows.
Problem (Sampling problem)
Given a set of tiles , draw samples uniformly at random from vectors of permutations allowed by .
The sampling problem is trivial when the tiles are non-overlapping, since permutations can be done independently within each non-overlapping tile. However, in the case of overlapping tiles, multiple constraints can affect the permutation of the same subset of rows and columns and this issue must be resolved. To this end, we need to define the equivalence of two sets of tiles, which means that the same constraints are enforced on the permutations. [Equivalence of sets of tiles] Let and be two sets of tiles. is equivalent to , if for all vectors of permutations it holds:
We use the term tiling for a set of tiles where no tiles overlap. Next, we show that there always exists a tiling equivalent to a set of tiles. Given a set of (possibly overlapping) tiles , there exists a tiling that is equivalent to . Let and be two overlapping tiles. Each tile describes a set of constraints on the allowed permutations of the rows in their respective column sets and . A tiling equivalent to is given by:
Tiles and represent the non-overlapping parts of and and the permutation constraints by these parts can be directly met. Tile takes into account the combined effect of and on their intersecting row set, in which case the same permutation constraints must apply to the union of their column sets. It follows that these three tiles are non-overlapping and enforce the combined constraints of tiles and . Hence, a tiling can be constructed by iteratively resolving overlap in a set of tiles until no tiles overlap. Notice that merging overlapping tiles leads to wider (larger column set) and lower (smaller row set) tiles. An example is shown in Figure 5. The limiting case is a fully-constrained situation where each row is a separate tile and only the identity permutation is allowed. We provide an efficient algorithm with the time complexity for merging tiles in Appendix A.
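The overlap resolution described above can be sketched in a few lines of Python. This is a naive pairwise illustration and not the efficient algorithm of Appendix A. The tile representation (a pair of 0-indexed row and column sets) and the function names are assumptions made for this example. Two overlapping tiles with row sets R1, R2 and column sets C1, C2 are replaced by the two non-overlapping remainders (R1 minus R2 with C1, and R2 minus R1 with C2) and by a merged tile on the shared rows with the union of the column sets.

```python
def overlaps(t1, t2):
    """Two tiles overlap if they share at least one row and one column."""
    return bool(t1[0] & t2[0]) and bool(t1[1] & t2[1])

def merge_tiles(tiles):
    """Resolve overlaps pairwise until no two tiles overlap (a tiling)."""
    tiling = [(frozenset(r), frozenset(c)) for r, c in tiles if r and c]
    changed = True
    while changed:
        changed = False
        for a in range(len(tiling)):
            for b in range(a + 1, len(tiling)):
                t1, t2 = tiling[a], tiling[b]
                if overlaps(t1, t2):
                    (r1, c1), (r2, c2) = t1, t2
                    parts = [
                        (r1 - r2, c1),       # rows only in the first tile
                        (r1 & r2, c1 | c2),  # shared rows, union of columns
                        (r2 - r1, c2),       # rows only in the second tile
                    ]
                    rest = [t for k, t in enumerate(tiling) if k not in (a, b)]
                    # Drop parts whose row set became empty.
                    tiling = rest + [p for p in parts if p[0]]
                    changed = True
                    break
            if changed:
                break
    return tiling
```

Each resolution step reduces the number of tiles that contain the shared rows, so the loop terminates. The resulting tiles become wider and lower, as noted in the text.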
2.2 Formulating Hypotheses
As discussed in the introduction, in order to model what the user wants to know about the data, we define two sets of constraints: (the relations that are of the interest for the user) and (the relations that are of no interest for the user). We then find a view that shows the maximal difference between the uniform distributions from and , respectively. We now formalise this idea by formulating a pair of hypotheses, concisely represented using the tilings defined in the previous section. This provides a flexible method for the user to specify the relations in which he or she is interested.
[Hypothesis tilings] Given a subset of rows , a subset of columns , and a -partition of the columns given by , such that and if , a pair of hypothesis tilings is given by and . The hypothesis tilings define the items and attributes of interest, and, through the partition of , the relations between the attributes in which the user is interested. Hypothesis 1 () corresponds to a uniform distribution from the data sets in which all relations in are preserved, and hypothesis 2 () to a uniform distribution from the data sets in which the relations between attributes in the partitions of are broken while the relations between attributes inside each are preserved. In terms of the constraints, as discussed in Section 1, relations preserved by correspond to the relations which the user is not interested in (that is, in Section 1), while relations preserved by but broken by correspond to relations which user is interested in (that is, in Section 1).
Now, for example, if the columns are partitioned into two groups and the user is interested in relations between the attributes in and , but not in relations within or . On the other hand, if the partition is full, that is, and for all , then the user is interested in all relations between the attributes inside . The special case of and indeed reduces to unguided data exploration, where all inter-attribute relations in the data are of interest to the user.
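To make the construction concrete, the sketch below builds a pair of hypothesis tilings from a row subset, a column subset, and a partition of the columns. It follows the verbal description in the definition above: the first tiling is a single tile over all selected rows and columns, and the second has one tile per column group over the same rows. The function name and the set-based tile representation are assumptions made for this example.

```python
def hypothesis_tilings(rows, cols, partition):
    """Build the two tilings of a hypothesis pair.

    The first tiling keeps all relations among `cols` on `rows`. The second
    keeps only the relations inside each group of `partition`.
    """
    rows, cols = frozenset(rows), frozenset(cols)
    groups = [frozenset(g) for g in partition]
    # The groups must be pairwise disjoint and together cover `cols`.
    covered = frozenset().union(*groups)
    if covered != cols or sum(len(g) for g in groups) != len(cols):
        raise ValueError("partition must split cols into disjoint groups")
    h1 = [(rows, cols)]
    h2 = [(rows, g) for g in groups]
    return h1, h2
```

For instance, `hypothesis_tilings(range(4), {0, 1, 2, 3}, [{0, 1}, {2, 3}])` expresses interest in the relations between the attribute groups {0, 1} and {2, 3}, but not in the relations within either group.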
Having defined both the user’s knowledge (background distribution) and the pair of hypotheses formalising the user’s objectives with tile constraints, we can now easily combine these to formalise the uniform distributions from the data sets we want to compare. I.e., we want to compare the uniform distributions from and , where ‘’ is used with a slight abuse of notation to denote the operation of merging two tilings (with possible overlaps between their tiles) into an equivalent tiling. Notice here that, by Definition 2.2, it holds that
and hence the comparison becomes equivalent to the formulation provided in the introduction. Recall that we can draw samples from these distributions as described in Section 2.1. From now on, we use the term hypothesis pair to denote
where is the tiling formalising the (current) background distribution, and and form a pair of hypothesis tilings as defined in Definition 2.2. See Figure 6 for examples of different cases in which a pair of hypothesis tilings and the user’s knowledge are used to explore relations between attributes. Here, the first case demonstrates a scenario in which the user is interested in all relations, and hence the set of relations that are of no interest to the user is empty. Furthermore, the user has no prior knowledge, that is, . The second case is similar, but here the tile shown with an orange dashed border represents the user’s knowledge . Finally, the third case shows a scenario in which . The relations preserved between attributes and for items are of no interest to the user, while the relations between the attribute groups and are.
2.3 Finding Views
We are now ready to formulate our second main problem, that is, given the uniform distributions from the two data sets characterised by the hypothesis pair , how can we find an informative view of the data maximally contrasting these? The answer to this question depends both on the type of data and the selected visualisation. For example, visualisations or measures of difference are different for categorical and real-valued data. The real-valued data discussed in this paper allows us to use projections (such as principal components) that mix attributes.
Problem (Comparing hypotheses)
Given the two uniform distributions characterised by the hypothesis pair , find the projection in which the distributions differ the most.
To solve this problem, we devise a linear projection pursuit method which finds the direction in which the two distributions differ the most in terms of variance. In principle some other difference measure could be used instead. However, a variance-based measure can be implemented efficiently, which is one essential requirement for interactive use. Furthermore, using variance leads to the convenient property that our projection pursuit method reduces to standard principal component analysis (PCA) when the user has no background knowledge and when the hypotheses are most general, as shown in Theorem 2.3 below.
Thus, we formalise the optimisation criterion in Problem 2.3 by defining a measure using variance. Specifically, we choose the following form for our gain function:
$$ G(v) = \frac{v^T \Sigma_1 v}{v^T \Sigma_2 v}, \qquad (1) $$
where $v$ is a vector in $\mathbb{R}^m$ and $\Sigma_1$ and $\Sigma_2$ are the covariance matrices of the two uniform distributions characterised by the hypothesis pair, respectively. Then, the direction in which the distributions differ most in terms of the variance, that is, the solution to Problem 2.3, is given by
$$ v^{*} = \arg\max_{v \in \mathbb{R}^m} G(v). \qquad (2) $$
In order to solve Problem 2.3, we first show that the covariance matrix for a distribution defined using the permutation-based scheme with tile constraints can be computed analytically. Given , the covariance of attributes from the uniform distribution of data sets defined by a tiling is given by , where
and . Here, denotes the set of rows permuted together, denotes the centred data matrix, and denotes a set satisfying where , and , that is, the rows in a tile that the data point belongs to. The covariance is defined by
where the expectation is defined over the permutations and of columns and allowed by the tiling , respectively. The part of the sum for rows permuted together reads
where we have used and reordered the sum for . The remainder of the sum reads
where and the expectations have been taken independently, because the rows in are permuted independently at random. The result then follows from the observation that for any . We have also verified experimentally that the analytically derived covariance matrix matches the covariance matrix estimated from a sample from the distribution.
The direction in which the ratio of the variances is the largest can now be found by applying a whitening operation (Kessy et al., 2018) on $\Sigma_2$, the covariance matrix in the denominator of the gain in Equation (1). The idea of whitening is to find a whitening matrix $W$ such that $W^T \Sigma_2 W = I$. Using this transformation in Equation (1) makes the denominator constant, and we hence obtain the solution to the optimisation in Equation (2) by finding the principal components of $\Sigma_1$, the covariance matrix in the numerator, transformed using $W$.
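The solution to the optimisation problem of Equation (2) is given by $v = Ww$, where $w$ is the first principal component of $W^T \Sigma_1 W$ and $W$ is a whitening matrix such that $W^T \Sigma_2 W = I$. Using $v = Ww$, the gain in Equation (1) can be rewritten as
$$ G(Ww) = \frac{w^T W^T \Sigma_1 W w}{w^T W^T \Sigma_2 W w} = \frac{w^T W^T \Sigma_1 W w}{w^T w}. \qquad (3) $$
Equation (3) is maximised when $w$ is the maximal variance direction of $W^T \Sigma_1 W$, from which it follows that the solution to the optimisation problem of Equation (2) is given by $v = Ww$, where $w$ is the first principal component of $W^T \Sigma_1 W$.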
For visualisation (that is, when making two-dimensional scatterplots), we project the data onto the first two principal components, instead of considering only the first component as in the above.
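A small worked example may help to see the whitening step in action. The matrices below are made up for illustration. Here $\Sigma_1$ and $\Sigma_2$ stand for the covariance matrices in the numerator and denominator of the variance ratio, $W$ is a whitening matrix, $w$ is the first principal component after whitening, and $v$ is the resulting direction. The choice $W = \Sigma_2^{-1/2}$ is an assumption; it is only one of several valid whitening matrices.

```latex
\[
\Sigma_1 = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}, \quad
\Sigma_2 = \begin{pmatrix} 4 & 0 \\ 0 & 1 \end{pmatrix}, \quad
W = \Sigma_2^{-1/2} = \begin{pmatrix} 1/2 & 0 \\ 0 & 1 \end{pmatrix}
\]
\[
W^T \Sigma_2 W = I, \qquad
W^T \Sigma_1 W = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}
\;\Rightarrow\; w = (0, 1)^T, \quad v = W w = (0, 1)^T
\]
\[
\frac{v^T \Sigma_1 v}{v^T \Sigma_2 v} = \frac{4}{1} = 4,
\qquad \text{whereas } u = (1, 0)^T \text{ gives }
\frac{u^T \Sigma_1 u}{u^T \Sigma_2 u} = \frac{4}{4} = 1 .
\]
```

Both attributes have the same variance under the first distribution, so ordinary PCA on $\Sigma_1$ alone would not prefer either axis. The ratio singles out the second attribute because that is where the first distribution has more variance than the second one.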
Finally, we are ready to show that in the limit of no background knowledge and with the most general hypotheses, our method reduces to the PCA of the correlation matrix. In the special case of the first step in unguided data exploration, that is, comparing distributions from a hypothesis pair specified by , where and , the solution to Equation (2) is given by the first principal component of the correlation matrix of the data when the data has been scaled to unit variance. The proof follows from the observations that for the covariance matrix is a diagonal matrix (here a unit matrix), resulting in the whitening matrix . For this pair of hypotheses, denotes the covariance matrix of the original data. The result then follows from Theorem 2.3.
Once we have defined the most informative projection, which displays the directions in which the distributions parametrised by the hypothesis pair differ the most, we can show the original data in this projection. This allows the user to observe different patterns, for example, a clustered set of points, a linear relationship, or a set of outlier points. We note that it would also be possible to show and compare samples from the two distributions characterised by the hypothesis pair in the most informative view. In Henelius et al. (2018) we presented a proof-of-concept tool with which the user can, in fact, toggle between showing the data and samples from the two distributions representing the hypotheses. This can potentially shed some further light on why this particular view is interesting, but as we are mostly interested in the relations present in the actual data, we have chosen for simplicity to consider only the data in the most informative projection in this work.
2.4 Selecting Attributes for a Tile Constraint
After observing a pattern, the user can define a tile to be added to . The set of data points included in the pattern can be easily selected from the projection shown. For selecting the attributes characterising the pattern, we use a procedure where for each attribute the ratio between the standard deviation of the attribute for the selection and the standard deviation of all data points is computed. If this ratio is below a threshold value (for example, ), then the attribute is included in the set of attributes characterising the pattern. The intuition here is that we are looking for attributes in which the selected points are more similar to each other than is expected based on the whole data. Thus, the set of attributes for which the user’s knowledge of dependencies is included is affected by the choice of . A smaller value of will only include attributes for which the selected points are very similar, whereas a larger value of will include a larger set of attributes in the tile constraint. A parallel coordinates plot ordered according to the ratio of standard deviations can be useful in deciding a suitable value for ; see Section 3.3 for examples in which we use parallel coordinate plots.
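The attribute selection rule can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: the data is a list of rows, `selected_rows` holds the 0-indexed points marked by the user, the function name is hypothetical, and the sample standard deviation is used because the text does not say which estimator is intended. The threshold is left as a parameter for the caller to choose.

```python
import statistics

def select_attributes(data, selected_rows, threshold):
    """Return the column indices whose standard deviation ratio is below `threshold`.

    The ratio compares the spread of the selected points with the spread
    of all data points in the same column.
    """
    chosen = []
    for j in range(len(data[0])):
        sd_all = statistics.stdev(row[j] for row in data)
        sd_sel = statistics.stdev(data[i][j] for i in selected_rows)
        # Skip constant columns to avoid dividing by zero.
        if sd_all > 0 and sd_sel / sd_all < threshold:
            chosen.append(j)
    return chosen
```

The selection needs at least two points, since the sample standard deviation is undefined for a single value. Sorting the columns by the same ratio would give the attribute ordering suggested for the parallel coordinates plot.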
2.5 Example: Subsetting Loses Information
We conclude this section with a simple example illustrating how subsetting the data in order to focus on a specific objective can lead to a loss of information. With this example we wish to highlight two aspects: (i) how the user’s background knowledge and objectives affect the views that are most informative, and (ii) how it can be advantageous to investigate relations in the data as a whole instead of using a subset of the data for the analysis.
We construct a toy data set with four attributes , , , and as follows. We first generate two strongly correlated attributes and , after which we generate attribute by adding noise to , and attribute by adding noise to . This data set, visualised in Figure 7(a), is very simple and here it is possible to investigate all pairwise relations in the data in one view. This is in general not possible in any real analysis scenarios. Furthermore, we assume that the user is interested in the relation between attributes and . Our goal is then to find a maximally informative 1-dimensional projection of the data that takes both this objective and the user’s background knowledge into account.
First, let us assume that the user only knows the marginal distribution of each attribute but is unaware of the relations between the attributes. Using the approach in this paper we formulate this by means of the hypothesis pair , where , , , and (all rows in the data). A sample from is shown in Figure 7(b) using purple squares, and a sample from is shown using orange circles. The orange distribution hence models what the user currently knows and the purple what the user could optimally learn about the relation between and from the data. The orange and purple distributions differ the most in the plot , as expected, and indeed the maximally informative 1-dimensional projection satisfying Equation (2), is given by .
Secondly, assume that, unlike above, the user is already aware of the relationship between the attribute pairs and as well as and , but does not know that attributes and are almost identical. We proceed as above with the difference that we now add the user’s knowledge as a constraint to both the distributions. This is achieved by updating the hypothesis pair to , where captures the user’s knowledge.
Samples from the uniform distributions on the data sets conforming to this hypothesis pair are shown in Figure 7(c). Again, the orange distribution models the user’s knowledge (that is, ) and the purple what the user could learn from the relation between and from the data, given that the user already knows about the relationships of the attribute pairs and as well as and (that is, ). The orange and purple distributions differ the most in the plot and therefore the user would gain most information if shown this view. Indeed, the most informative 1-dimensional projection satisfying Equation (2) is . In other words, the knowledge of the relation of and gives maximal information about the relation of and . This makes sense, because the variables and are really connected via and through their generative process.
This example hence shows how the background knowledge affects the views. Also, if we had chosen a subset of the data containing, for example, just attributes and we would not have observed the connection of and through and , even if we knew the relation between and as well as and . Thus, we have demonstrated with this simple example that using hypothesis tilings as above allows us to explore the entire data set at once while still focusing on particular relations of interest.
In this section we first consider the stability and scalability of the framework presented in this paper. After this, we present examples of how the proposed method is used to explore relations in a data set and to focus on investigating a hypothesis concerning relations in a subset of the data. An open source library implementing the proposed framework, including the code for the experiments presented in this paper, is available at https://github.com/edahelsinki/corand/.
All the experiments were run on a MacBook Pro laptop with a 3.1 GHz Intel Core i5 processor using R version 3.5.2 (R Core Team, 2018).
3.1 Data Sets
We use synthetic data in the scalability experiment and two real-world data sets to showcase the applicability of our framework in human-guided data exploration.
The german socioeconomic data set (Boley et al., 2013; Kang et al., 2016a), available from http://users.ugent.be/~bkang/software/sica/sica.zip, contains records from 412 German administrative districts. Each district is represented by 46 attributes describing socioeconomic and political aspects in addition to attributes such as the type of the district (rural/urban), area name/code, state, region (East/West/North/South) and the geographic coordinates of each district center. The socioeconomic attributes include, for example, population density, age and education structure, economic indicators (for example, GDP growth, unemployment, income), and the proportion of the workforce in different sectors. The political attributes include election results of the five major political parties (CDU/CSU, SPD, FDP, Green, and Left) in the German federal elections in 2005 and 2009, as well as the voter turnout. For our experiments we exclude the election results from 2005 (which are highly correlated with the 2009 election results), all non-numeric variables, and the area code and coordinates of the districts, resulting in 32 real-valued attributes (although we use the full data set when interpreting the results). Finally, we scale the real-valued variables to zero mean and unit variance.
The accident data set is a random sample of 3000 accidents from a large data set containing all occupational accidents in Finnish enterprises during the period 2003–2014 reported to the Finnish Workers’ Compensation Center (proprietary data obtained from the Finnish Workers’ Compensation Center, https://www.tvk.fi/). In the original data set, the accidents are described by 37 variables, the majority of which are categorical, including details about the victim (occupation, age, sex, nationality) and the accident (geographical location, cause, type, working process). We transform the categorical variables into real-valued ones using one-hot encoding, which creates a column for every label of every variable in which the presence (absence) of a label is indicated by 1 (0, respectively). To restrict the dimensionality of the resulting encoding, we drop variables with a very high number of labels; for example, the variable for the municipality in which the accident happened has more than 300 labels and would result in equally many columns in the data. Variables with many labels are also implicitly given more weight in the one-hot encoding. For instance, the attribute sukup (gender) has 2 labels, while the attribute ruumis (injured body part) has 68 labels. In the transformed data there are 2 columns for sukup and 68 columns for ruumis, making the latter more strongly represented in the data. Since this could affect further analysis, we scale the binary data in groups, that is, all columns that originate from the same variable are scaled to have a total variance of 1. The resulting data set contains 3000 rows and 220 attributes.
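The group-wise scaling of one-hot encoded columns can be sketched as follows. This is a minimal illustration rather than the preprocessing code used for the experiments: the function name `one_hot_group_scaled` is hypothetical, and the use of the population variance (rather than the sample variance) is an assumption.

```python
import statistics


def one_hot_group_scaled(values):
    """One-hot encode a single categorical variable and scale the
    resulting columns so that their variances sum to 1.

    Returns a dict mapping each label to its scaled indicator column.
    Assumption of this sketch: population variance is used.
    """
    labels = sorted(set(values))
    columns = {
        label: [1.0 if v == label else 0.0 for v in values]
        for label in labels
    }
    total_variance = sum(statistics.pvariance(col) for col in columns.values())
    if total_variance == 0:
        # Only one label present: the single column is constant and
        # cannot be rescaled to a non-zero variance.
        return columns
    scale = total_variance ** -0.5
    return {label: [scale * x for x in col] for label, col in columns.items()}
```

With this scaling a variable with 2 labels and a variable with 68 labels both contribute a total variance of 1, so neither dominates merely because it has more columns.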
3.2 Stability and Scalability
We first study the sensitivity of the results with respect to noise or missing data rows. In this experiment we use the 32 real-valued variables from the german data together with three (non-trivial) factors, namely Type (2 values), State (16 values), and Region (4 values) to create synthetic data sets. A synthetic data set, parametrised by the noise term and an integer is constructed as follows. First, we randomly remove rows from the data, after which Gaussian noise with variance is added to the remaining variables, and finally all variables are rescaled to zero mean and unit variance. We create a random tile by randomly picking a factor that defines the rows in a tile and then randomly sample 2 to 32 attributes as the columns. The background knowledge consists of three such random tiles. The hypothesis tiles are constructed using one such random tile as a basis: and .
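The construction of a perturbed data set described above (remove rows, add Gaussian noise, rescale) can be sketched as follows. The function and parameter names (`perturb`, `n_removed`, `noise_variance`) are hypothetical, and the sketch covers only the real-valued columns, not the random tiles.

```python
import random
import statistics


def standardise(column):
    """Rescale a column to zero mean and unit (population) variance.
    Assumes the column is not constant."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    return [(x - mean) / sd for x in column]


def perturb(rows, n_removed, noise_variance, rng):
    """Randomly drop `n_removed` rows, add zero-mean Gaussian noise with
    the given variance to every remaining value, and standardise columns.

    `rows` is a list of equally long lists of floats.
    Returns the perturbed rows and the indices of the rows that were kept.
    """
    kept_idx = sorted(rng.sample(range(len(rows)), len(rows) - n_removed))
    noise_sd = noise_variance ** 0.5
    noisy = [[x + rng.gauss(0.0, noise_sd) for x in rows[i]] for i in kept_idx]
    columns = [standardise(list(col)) for col in zip(*noisy)]
    return [list(row) for row in zip(*columns)], kept_idx
```

Returning the kept row indices makes it possible to restrict the row sets of the tiles to the rows that survive the removal step.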
The results are shown in Table 1. We notice that the gain is relatively insensitive to both noise and the removal of rows. Even removing about half of the rows () does not change the results meaningfully. Only a very high degree of noise, corresponding to (that is, circa 5–10% signal-to-noise ratio) substantially degrades the results.
Table 2 shows the running time of the algorithm as a function of the size of the data for Gaussian random data with a similar tiling setup as used for the german data. We make two observations. First, the tile operations scale linearly with the size of the data and they are relatively fast. Most of the time is spent on finding the views, that is, solving Equation (2). Even our unoptimised pure R implementation runs in seconds for data sets that are visualisable (having thousands of rows and hundreds of attributes); any larger data set should in any case be downsampled for visualisation purposes.
3.3 Exploration of the German Data Set
Next, we demonstrate our framework by exploring the german data set under different objectives.
Exploration without prior background knowledge
We start with unguided data exploration where we have no prior knowledge about the data and our interest is as generic as possible. In this case and as the hypothesis tilings we use , where all rows and columns belong to the same tile (fully-constrained tiling), and , where all columns form a tile of their own (fully unconstrained tiling). Our hypothesis pair is then .
We then consider the view of the data (Figure 8) which is maximally informative, that is, in which the two distributions parametrised by the hypothesis pair differ the most. We observe that there is some structure visible in this view. In order to investigate the characteristics of the data points corresponding to different patterns in the german data, we first choose to focus on the set of points in the upper right corner, marked with purple in Figure 8. Our selection, denoted by Selection 1, corresponds to rural districts in Eastern Germany (see Table 3). We also consider the parallel coordinates plot of the data, shown in Figure 9. This plot shows the 32 real-valued attributes in the data. The currently selected points (Selection 1) are shown in purple while the rest of the data is shown in black. The number in parentheses following each variable name is the ratio of the standard deviation of the selection and the standard deviation of all data. If this number is small we can conclude that the values for a particular attribute are homogeneous inside the selection (behave similarly). Based on the parallel coordinates plot in Figure 9 we observe that there is little support for the Green party and a high support for the Left party in these districts.
We next add a tile constraint for the items in the observed pattern where the columns (attributes) are chosen as described in Section 2.4 using a threshold value . Thus, we select those attributes for which the standard deviation ratio, that is, the number in parentheses in Figure 9, is below the threshold. The hypothesis pair is then updated to take into account the newly added tile, that is, we consider .
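The attribute selection rule used here, keeping the attributes whose standard deviation inside the selection is small relative to the whole data, can be sketched as follows. The function names are hypothetical and the threshold is left as a parameter; the sketch assumes the selection contains at least two rows and that no attribute is constant over the whole data.

```python
import statistics


def sd_ratios(data, selection):
    """For each attribute, the ratio of the standard deviation within the
    selected rows to the standard deviation over all rows.

    `data` maps attribute names to columns (lists of numbers);
    `selection` is a list of row indices.
    """
    ratios = {}
    for name, column in data.items():
        sd_all = statistics.stdev(column)
        sd_selection = statistics.stdev([column[i] for i in selection])
        ratios[name] = sd_selection / sd_all
    return ratios


def homogeneous_attributes(data, selection, threshold):
    """Attributes whose standard deviation ratio is below the threshold;
    these would form the columns of the tile added for the selection."""
    ratios = sd_ratios(data, selection)
    return [name for name, ratio in ratios.items() if ratio < threshold]
```

A small ratio means the selected rows take similar values on that attribute, which is the sense in which the selection is homogeneous in the parallel coordinates plot.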
The most informative view displaying differences of the distributions parametrised by is shown in Figure 10. Now, Selection 1 (shown in purple for illustration purposes) is no longer as clearly visible in Figure 10 as it is in the first view. This is expected, since this pattern has been accounted for in the distributions parametrised using . We now focus on investigating the sparse region of points shown in orange in Figure 10 (Selection 2). By inspecting the class attributes of this selection we learn that these items correspond to urban districts (see Table 3) in all regions. Based on the parallel coordinates plot shown in Figure 11 we conclude that these districts are characterised by a low fraction of agricultural workforce and a high amount of service workforce, both expected in urban districts. We also notice that these districts have had a higher GDP growth in 2009 and that it appears that the amount of votes for the CDU party in these districts was quite low.
Exploration with a specific hypothesis
Next, we focus on a more specific hypothesis involving only a subset of rows and attributes. In particular, we want to investigate a hypothesis concerning the relations between certain attribute groups in rural areas. We hence define our hypothesis pair as follows. As the subset of rows we choose all 298 districts that are of the type rural. We then consider a subset of the attributes partitioned into four groups. The first attribute group () consists of the voting results for the political parties in 2009. The second attribute group () describes demographic properties such as the fraction of elderly people, old people, middle aged people, young people, and children in the population. The third group () contains attributes describing the workforce in terms of the fraction of the different professions such as agriculture, production, or service. The fourth group () contains attributes describing the level of education, unemployment and income. The attribute groupings are listed in Table 4. Thus, we here want to investigate relations between different attribute groups, ignoring the relations inside the groups.
We form the hypothesis pair , where consists of a tile spanning the rows in and the columns in whereas consists of four tiles: , . Looking at the view in which the distributions parametrised by the pair differ the most, shown in Figure 12(a), we find two clear clusters corresponding to a division of the districts into those located in the East, and those located elsewhere. We could also have used our already observed background knowledge of Selection 1, by considering the hypothesis pair , where is the tile defined earlier for Selection 1. For this hypothesis pair, the most informative view is shown in Figure 12(b), which clearly is different to Figure 12(a), since we already were aware of the relations concerning the rural districts in the East and this was included in our background knowledge.
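Table 4: Attribute groupings used in the exploration with a specific hypothesis.
|Group|Attributes|
|1 (voting results, 2009)|LEFT.2009, CDU.2009, SPD.2009, FDP.2009, GREEN.2009|
|2 (demographics)|Elderly.pop., Old.Pop., Mid.aged.Pop., Young.Pop., Children.Pop.|
|3 (workforce)|Agricult..workf., Prod..workf., Manufac..Workf., Constr..workf., Service.workf., Trade.workf., Finance.workf., Pub..serv..workf.|
|4 (education, unemployment, income)|Highschool.degree, No.school.degree, Unemploy., Unempl..Youth, Income|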
Comparison to PCA and ICA
To demonstrate the utility of the views shown, we compute values of the gain function as follows. We consider our four hypothesis pairs , , , and . For each of these pairs, we denote the direction in which the two distributions differ most in terms of the variance (solutions to Equation (2)) by , , , and , respectively. We then compute the gain for each and . For comparison, we also compute the first principal component analysis (PCA) and independent component analysis (ICA) (Hyvärinen, 1999) projection vectors, denoted by and , respectively, and calculate the gain for different hypothesis pairs using these. For ICA, we use the log-cosh function and default parameters of the R package fastICA. The results are presented in Table 5. We find that the gain is always the highest when the projection vector matches the hypothesis pair (highlighted in the table), as expected. This shows that the views presented are indeed the most informative ones given the current background knowledge and the hypothesis pair being investigated. We also notice that the gain for PCA is equal to that of unguided data exploration, as expected by Theorem 2.3. When some background knowledge is used or if we investigate a particular hypothesis, the views with PCA or ICA objectives are less informative than the one obtained using our framework. The gains close to zero for the ICA objective are directions in which the variance of the more constrained distribution is small due to, for example, linear dependencies in the data.
3.4 Exploration of the Accident Data Set
Due to the preprocessing, several columns in the accident data set are used to encode the distinct categorical values in the original data. If we now want to explore relationships between the original variables, we can define a hypothesis pair, in which columns corresponding to the same categorical attribute are grouped together. We can thus investigate relations between attribute groups, ignoring relations inside the groups.
Table 6: Summary of the attribute groupings in the accident data set.
|AM1LK|14|Occupation of victim|
|EUSEUR|8|Days lost (severity of the accident)|
|IKAL|12|Age of victim (5-year bins, except 0–14 and|
|NUORET|4|Age of victim (0–15, 16–17, 18–19, and 19 years)|
|RUUMIS|33|Injured body part|
|POIKKEA|10|Deviation before accident|
|SATK|12|Month of accident|
|TOLP|22|Main industry category|
|TYOSUOR|9|Specific physical activity|
|TYOTEHT|31|Work done at the time of accident|
|VAHITAP|16|Contact-mode of injury|
|VAMMAL|13|Type of injury|
|VPAIVA|7|Week day of accident|
|VUOSI|12|Year of accident|
We define the hypothesis pair as follows. As the subset of rows we choose all the 3000 rows in the accident data set. We then consider a subset of the attributes , where a summary of the attribute groupings is provided in Table 6. The hypothesis pair is then , where consists of a tile spanning all rows in and all columns in whereas consists of tiles: , . The view in which the distributions parametrised by differ the most is shown in Figure 13(a). Here we observe two clear clusters. We select the points shown in purple and observe that these data points correspond to accidents happening during travel to work (vahty.m). Also, the attributes tpiste, tyoteht, tyosuor, poikkea, and vahitap (all related to the type of work or the place of work in which the accident occurred) have missing values (–) here, which is natural when an accident happens during travelling to work. The points in the complement of the purple selection on the other hand correspond to accidents happening at the workplace (vahty.p).
Next, we add a tile with the selection of purple points in Figure 13(a) as the set of rows and all columns in accident data as the set of columns to incorporate into the exploration our knowledge concerning these points. We proceed to consider the updated hypothesis pair . The most informative view for this hypothesis pair is shown in Figure 13(b). For illustration purposes we show in purple the same selection of rows in data as in Figure 13(a). We can now select, for example, the data points shown in orange in Figure 13(b) for further inspection. This selection corresponds to accidents at the work place (vahty.p) which happened to mainly women (sukup.n) over 19 years old (nuoret.4), and which resulted in 31–90 days of absence from work (euseur.9).
In this paper we propose a framework that integrates, in a principled way, both the user’s background knowledge of the data, which is updated during the exploration, and the user’s current interests into the exploratory data analysis process. We provide an efficient implementation of this method using constrained randomisation. Furthermore, we extend PCA to work seamlessly with the framework in the case of real-valued data sets.
Typical real-world data sets, for instance those empirically investigated in this paper, contain a vast number of interesting patterns. The goal of the data analyst is to find interesting relations in the data. If automated analysis methods are used to extract patterns, the patterns must be specified in advance so that they can be used with a data mining algorithm. Specifying patterns in advance is clearly nontrivial when there is a multitude of variable combinations that must be taken into account. Furthermore, if patterns are only extracted based on a priori specifications, it is not possible to use insights obtained during the exploration to steer further exploration.
It is here where the power of human-guided data exploration lies. A non-interactive data mining method is restricted to either show generic features of the data—which may already be obvious to an expert—or output unusably many patterns, which is a typical problem, for example, in frequent pattern mining (there are easily too many patterns for the user to absorb). Our framework solves this problem: by integrating the human’s background knowledge and focus—formulated as mathematically defined hypotheses—we can at the same time guide the search towards topics interesting to the user at any particular moment while taking the user’s prior knowledge into account in an understandable and efficient way. Hence, the framework described in this paper makes it possible to interactively and efficiently explore relations between attributes in the data through a conceptually simple paradigm where the relations are encoded using tile constraints. This exploration framework allows the data analyst to use his or her innate pattern recognition skills to spot complex patterns, instead of having to specify them in advance. As demonstrated in our empirical evaluation of two real-world data sets, the proposed interactive exploration framework allows us to find interesting patterns and hence to make sense of the relations in the data.
Our work contains implicit assumptions about human cognitive processing, such as that the user’s knowledge can be modelled using a background distribution. The validity of these assumptions would be a good question for future research. For example, for a real user the order in which different relations are observed probably matters, while our formulation is invariant under the ordering of the relations. Also, the user is probably not able to model very fine-grained distributions, while our mathematical formulation of the background distribution can become extremely complex when the number of constraints grows.
As a potential direction for future work we consider the extension of the proposed method to understand classifiers or regression functions in addition to static data. Extending the ideas used here to different data types such as, for example, time series, is also worth investigating. Finding an efficient algorithm that could find a sparse solution to the optimisation problem of Equation (2) would also be an interesting problem. To the best of our knowledge, no such solution is readily available. We note that the solutions for sparse PCA are not directly applicable here: sparse PCA would indeed give a sparse variant of the vector in Theorem 2.3. However, this would not result in a sparse . Furthermore, we plan to study how to incorporate in our framework a scheme for evaluating the statistical significance of the visually observed patterns.
Finally, we have implemented an open source R package that allows us to simulate interactive visual data exploration in our framework. The framework is available under an open source license from https://github.com/edahelsinki/corand/ and it includes, in addition to the code needed to run the experiments in this paper, an interactive web-based interface prototype. We have also earlier released a preliminary prototype called tiler (Henelius et al., 2018), which includes the tile-based constrained randomisation approach, but does not implement the dimensionality reduction method presented in this work.
We thank Buse Gul Atli for discussions and contributions to the preprint (Puolamäki, Oikarinen, Atli, and Henelius, 2018). We thank the Finnish Workers’ Compensation Center for the access to the accident data. This work was supported by the Academy of Finland (decisions 326280 and 326339).
Appendix A Algorithm for merging tiles
Merging a new tile into a tiling where all tiles are non-overlapping can be done efficiently using the tile merging algorithm described in this appendix. We assume that the starting point is always a non-overlapping set of tiles and hence we only need to consider the overlap that the new tile has with the tiles in the tiling. This is similar to the merging of statements considered by Kalofolias et al. (2016). The algorithm has two steps. Let be the current tiling and the new tile to be added. In the first step (lines 1–11) we identify the tiles in with which overlaps, and in the second step (lines 12–17) we resolve (merge) the overlap between and the tiles identified in the previous step.
The first step proceeds as follows. An empty hash map is initialised (line 1) to be used to detect overlap between columns of the tiles in and the new tile . We proceed to iterate over each row in the new tile (lines 2–11). Since is a tiling, all its tiles are non-overlapping. We can thus store in a matrix of the same size as the data matrix where each element corresponds to the ID of the tile that covers that position. With a slight abuse of notation, in the algorithm refers to such a matrix. Now, given a row and a set of columns (line 3) we then get the IDs of the tiles on row with which overlaps. We store this in . The hash map is used to detect if this row has been seen before, that is, whether is a key in (line 4). If this is the first time this row is seen, is used as the key for a new element in the hash map and is initialised to be a tuple (line 5). Elements in this tuple are referred to by name, for instance, gives the set of rows associated with the key , while gives the set of tile IDs. On lines 6 and 7 we store the current row index and the unique tile IDs in the tuple. If the row was seen before, the row set associated with these tile IDs is updated (line 9). After this first step, the hash map contains tuples of the form (rows, id) where id specifies the IDs of the tiles with which overlaps at the rows specified by rows.
In the second step of the algorithm (lines 12–17), we first determine the currently largest tile ID in use (line 12). After this we iterate over the tuples in the hash map . For each tuple we must update the tiles having IDs and on line 14 we hence find the columns associated with these tiles. After this, the IDs of the affected overlapping tiles are updated (line 15), and the tile ID counter is incremented (line 16). Finally, the updated tiling is returned on line 18. The time complexity of the tile merging algorithm is .
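The two steps described above can be sketched in a few lines. This is our reading of the textual description, not the original listing, so its line numbers do not correspond to those referenced in the text. The function name `merge_tile` and the convention that the ID 0 marks cells not covered by any tile are assumptions of this sketch.

```python
def merge_tile(tiling, rows, cols):
    """Merge a new tile (rows x cols) into `tiling`, a matrix of tile IDs.

    tiling[r][c] is the ID of the tile covering cell (r, c); 0 is used
    here (an assumption of this sketch) for cells not covered by a tile.
    Returns a new matrix; the input is left unchanged.
    """
    merged = [list(row) for row in tiling]
    cols = sorted(set(cols))

    # Step 1: group the rows of the new tile by the set of existing tile
    # IDs that the new tile overlaps on that row (the hash map keys).
    groups = {}
    for r in rows:
        ids = frozenset(tiling[r][c] for c in cols) - {0}
        groups.setdefault(ids, []).append(r)

    # Step 2: every group of rows becomes one tile with a fresh ID,
    # spanning the new tile's columns and all columns of the overlapped
    # tiles. A tile covers the same columns on each of its rows, so the
    # first row of the group is enough to look these columns up.
    next_id = max(max(row) for row in tiling) + 1
    for ids, group_rows in groups.items():
        first = group_rows[0]
        affected = set(cols) | {
            c for c, tile_id in enumerate(tiling[first]) if tile_id in ids
        }
        for r in group_rows:
            for c in affected:
                merged[r][c] = next_id
        next_id += 1
    return merged
```

Rows of an existing tile that the new tile does not touch keep their old ID, so the result is again a non-overlapping tiling that can serve as the starting point for the next merge.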
- Boley et al. (2013) Mario Boley, Michael Mampaey, Bo Kang, Pavel Tokmakov, and Stefan Wrobel. One click mining—interactive local pattern discovery through implicit preference and performance learning. In ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA), pages 27–35, 2013.
- Chau et al. (2011) Duen Horng Chau, Aniket Kittur, Jason I. Hong, and Christos Faloutsos. Apolo: making sense of large network data by combining rich user interaction and machine learning. In SIGCHI Conference on Human Factors in Computing Systems, pages 167–176, 2011.
- Chirigati et al. (2016) Fernando Chirigati, Harish Doraiswamy, Theodoros Damoulas, and Juliana Freire. Data polygamy: the many-many relationships among urban spatio-temporal data sets. In International Conference on Management of Data (SIGMOD/PODS), pages 1011–1025. ACM, 2016.
- De Bie (2011a) Tijl De Bie. An information theoretic framework for data mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 564–572. ACM, 2011a.
- De Bie (2011b) Tijl De Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Mining and Knowledge Discovery, 23(3):407–446, 2011b.
- De Bie (2013) Tijl De Bie. Subjective interestingness in exploratory data mining. In International Symposium on Intelligent Data Analysis (IDA), pages 19–31, 2013.
- De Bie et al. (2016) Tijl De Bie, Jefrey Lijffijt, Raúl Santos-Rodriguez, and Bo Kang. Informative data projections: a framework and two examples. In European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 635–640, 2016.
- Dzyuba and van Leeuwen (2013) Vladimir Dzyuba and Matthijs van Leeuwen. Interactive discovery of interesting subgroup sets. In International Symposium on Intelligent Data Analysis (IDA), pages 150–161, 2013.
- Hanhijärvi et al. (2009) Sami Hanhijärvi, Markus Ojala, Niko Vuokko, Kai Puolamäki, Nikolaj Tatti, and Heikki Mannila. Tell me something I don’t know: randomization strategies for iterative data mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 379–388. ACM, 2009.
- Henelius et al. (2018) Andreas Henelius, Emilia Oikarinen, and Kai Puolamäki. Tiler: software for human-guided data exploration. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 672–676. Springer, 2018.
- Hyvärinen (1999) Aapo Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.
- Kalofolias et al. (2016) Janis Kalofolias, Esther Galbrun, and Pauli Miettinen. From sets of good redescriptions to good sets of redescriptions. In International Conference on Data Mining (ICDM), pages 211–220. IEEE, 2016.
- Kang et al. (2016a) Bo Kang, Jefrey Lijffijt, Raúl Santos-Rodríguez, and Tijl De Bie. Subjectively interesting component analysis: Data projections that contrast with prior expectations. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1615–1624, 2016a.
- Kang et al. (2016b) Bo Kang, Kai Puolamäki, Jefrey Lijffijt, and Tijl De Bie. A tool for subjective and interactive visual data exploration. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 3–7. Springer, 2016b.
- Kessy et al. (2018) Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The American Statistician, 72(4):309–314, 2018.
- Lijffijt et al. (2014) Jefrey Lijffijt, Panagiotis Papapetrou, and Kai Puolamäki. A statistical significance testing approach to mining the most informative set of patterns. Data Mining and Knowledge Discovery, 28(1):238–263, 2014.
- Paurat et al. (2014) Daniel Paurat, Roman Garnett, and Thomas Gärtner. Interactive exploration of larger pattern collections: a case study on a cocktail dataset. In ACM SIGKDD Workshop on Interactive Data Exploration and Analytics (IDEA), pages 98–106, 2014.
- Puolamäki et al. (2010) Kai Puolamäki, Panagiotis Papapetrou, and Jefrey Lijffijt. Visually controllable data mining methods. In IEEE International Conference on Data Mining Workshops, pages 409–417, 2010.
- Puolamäki et al. (2016) Kai Puolamäki, Bo Kang, Jefrey Lijffijt, and Tijl De Bie. Interactive visual data exploration with subjective feedback. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 214–229. Springer, 2016.
- Puolamäki et al. (2018) Kai Puolamäki, Emilia Oikarinen, Buse Gul Atli, and Andreas Henelius. Human-guided data exploration using randomisation. arXiv preprint arXiv:1805.07725, 2018.
- Puolamäki et al. (2018) Kai Puolamäki, Emilia Oikarinen, Bo Kang, Jefrey Lijffijt, and Tijl De Bie. Interactive visual data exploration with subjective feedback: an information-theoretic approach. In IEEE International Conference on Data Engineering (ICDE), pages 1208–1211, 2018.
- R Core Team (2018) R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL https://www.R-project.org/.
- Ruotsalo et al. (2015) Tuukka Ruotsalo, Giulio Jacucci, Petri Myllymäki, and Samuel Kaski. Interactive intent modeling: information discovery beyond search. Communications of the ACM, 58(1):86–92, 2015.
- Tukey (1977) John W. Tukey. Exploratory Data Analysis. Addison-Wesley, 1977.
- van Leeuwen and Cardinaels (2015) Matthijs van Leeuwen and Lara Cardinaels. VIPER—visual pattern explorer. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pages 333–336, 2015.
- Vartak et al. (2015) Manasi Vartak, Sajjadur Rahman, Samuel Madden, Aditya Parameswaran, and Neoklis Polyzotis. SeeDB: efficient data-driven visualization recommendations to support visual analytics. In Proceedings VLDB Endowment, volume 8(3), pages 2182–2193, 2015.
- Wilkinson et al. (2005) Leland Wilkinson, Anushka Anand, and Robert Grossman. Graph-theoretic scagnostics. In Proceedings of the 2005 IEEE Symposium on Information Visualization (INFOVIS), page 21. IEEE, 2005.