Dimensionality reduction (DR) methods can be used to create low-dimensional (typically two-dimensional) representations that are straightforward to visualize, making it possible to explore the dominant structure of high-dimensional datasets. Non-linear DR methods are particularly powerful because they can capture complex structure even when it is spread over many dimensions. This explains the huge popularity of methods such as t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018).
However, DR methods yield a single static embedding and the most prominent structure present in the data may already be known to the analyst. One may indeed construct higher-dimensional embeddings, hoping to uncover more structure, but there is no guarantee that any of the constructed dimensions is fully complementary to the prior knowledge of an analyst. Besides, the salient structure that is already known may be spread across all attributes, hence we cannot just remove the associated attributes and generally speaking it is not obvious how to visualize the remaining structure. The question arises: can we actively filter or discount prior knowledge from the embedding?
To this end, we introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that accounts for prior information about the data. By discounting the prior information, the embedding may focus on capturing complementary information. More concretely, it does not aim to construct an embedding that reflects all the proximities in the original data (the objective of t-SNE), but it should reflect all pairwise proximities conditioned on whether we expect that pair to be close or not.
ct-SNE enables at least three new ways to obtain insight into data:
The prior knowledge may indeed be available beforehand, in which case we can straight away focus the analysis on an embedding that is more useful.
Such prior knowledge may be gained during analysis, leading to an iterative data analysis process.
If we observe some specific structure X in an embedding and then factor out specific information Y, then if X remains present in the embedding, we learn that X is complementary to Y.
Note that we use the term prior knowledge even when this knowledge is not available a priori but gained during the analysis. This reflects that the knowledge is available prior to the embedding step.
To demonstrate the idea behind ct-SNE more concretely, consider a ten-dimensional dataset with 1,000 data points. In dimensions 1–4 the data points fall into five clusters (each following a multivariate Gaussian with small variance); similarly, in dimensions 5–6 the points fall randomly into four clusters. Dimensions 7–10 contain Gaussian noise with larger variance. Figure 1a gives the t-SNE embedding. It shows five large clusters, some of which can be split further into smaller clusters. The large clusters correspond to those defined in dimensions 1–4. Figure 1b is the ct-SNE embedding where we have input the five colored clusters as prior knowledge. This figure shows four clusters that are complementary to the five clusters observed in Figure 1a. We see they are complementary because there is no correlation between the colors and the clusters in Figure 1b. These four clusters are indeed those defined in dimensions 5–6. Finally, Figure 1c shows that after combining the labels, ct-SNE yields an embedding capturing only the remaining noise.
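A dataset of the kind described above can be generated along the following lines. This is only a sketch: the cluster centers, the standard deviations (`cluster_std`, `noise_std`), the `spread` of the centers, and the uniform random assignment of points to clusters are assumptions made for illustration, not the settings used for Figure 1.

```python
import numpy as np

def make_synthetic(n=1000, seed=0, cluster_std=0.1, noise_std=1.0, spread=5.0):
    """Toy data: 5 clusters in dims 1-4, 4 clusters in dims 5-6, noise in dims 7-10.

    All numeric settings are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    labels_a = rng.integers(0, 5, size=n)  # cluster membership in dimensions 1-4
    labels_b = rng.integers(0, 4, size=n)  # independent membership in dimensions 5-6
    centers_a = rng.normal(0.0, spread, size=(5, 4))
    centers_b = rng.normal(0.0, spread, size=(4, 2))
    X = np.empty((n, 10))
    X[:, 0:4] = centers_a[labels_a] + rng.normal(0.0, cluster_std, size=(n, 4))
    X[:, 4:6] = centers_b[labels_b] + rng.normal(0.0, cluster_std, size=(n, 2))
    X[:, 6:10] = rng.normal(0.0, noise_std, size=(n, 4))  # larger-variance noise
    return X, labels_a, labels_b

X, labels_a, labels_b = make_synthetic()
# One label per combination of the two clusterings (at most 5 * 4 = 20 values),
# mirroring the "combined labels" prior used for the third embedding.
combined = labels_a * 4 + labels_b
```

Because the two label sets are drawn independently, knowing a point's cluster in dimensions 1–4 says nothing about its cluster in dimensions 5–6, which is what makes the second clustering complementary to the first.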
The implementation of ct-SNE and code for the experiments on public data are available at https://bitbucket.org/ghentdatascience/ct-sne.
Contributions. This paper makes the following contributions:
ct-SNE, a new DR method that searches for an embedding such that a distribution defined in terms of distances in the input space (as in t-SNE) is well-approximated by a distribution defined in terms of distances in the embedding space after conditioning on the prior knowledge (Sec. 2.2);
A Barnes-Hut-Tree based optimization method to efficiently find an embedding (Sec. 2.3);
Extensive qualitative and quantitative experiments on synthetic and real-world datasets, showing that ct-SNE effectively removes the known factors, enables deeper visual analysis of high-dimensional data, and scales sufficiently to handle hundreds of thousands of points (Sec. 3).
In this section, we first briefly recap the idea behind t-SNE and introduce the basic notation. Then, we derive ct-SNE and describe a Barnes-Hut based strategy to optimize the ct-SNE objective. Due to space limitations, we discuss in Appendix B how the idea of factoring out prior information can be applied to many other existing non-linear DR methods such as LargeVis and UMAP.
2.1 Background: t-SNE
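In t-SNE, the input data set $X = (x_1, \ldots, x_n)$ is taken to define the probability distribution of a categorical random variable $e$, of which the value domain is indexed by all pairs of indices $ij$ with $i \neq j$. This distribution is determined by specifying probabilities $p_{ij} \geq 0$ such that $\sum_{i \neq j} p_{ij} = 1$, with $p_{ij}$ equal to the probability that $e = ij$. For brevity, below we will speak of the distribution $p$ when we mean the categorical distribution with parameters $p_{ij}$.
More specifically, in t-SNE, the distribution $p$ is defined as follows:
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}. \tag{1}$$
The goal of t-SNE is to find another embedding $Y = (y_1, \ldots, y_n)$, from which another categorical probability distribution is derived, specified by the values $q_{ij}$ defined as follows:
$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq m} \left(1 + \|y_k - y_m\|^2\right)^{-1}}. \tag{2}$$
An embedding $Y$ is deemed better if the distance between these two categorical distributions is smaller, as quantified by the KL-divergence: $\mathrm{KL}(p \,\|\, q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$.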
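The embedding-side quantities of standard t-SNE can be sketched in a few lines of NumPy. The function names (`student_t_q`, `kl_divergence`) are hypothetical, and the input-space distribution is replaced here by a stand-in matrix, since computing it in practice involves a per-point bandwidth search that is omitted.

```python
import numpy as np

def student_t_q(Y):
    """Low-dimensional similarities q_ij computed from an (n, d) embedding Y."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    t = 1.0 / (1.0 + sq_dist)   # Student-t kernel with one degree of freedom
    np.fill_diagonal(t, 0.0)    # pairs with i == j are excluded
    return t / t.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over all pairs; entries with p_ij = 0 contribute nothing."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))

rng = np.random.default_rng(0)
Y = rng.normal(size=(5, 2))
q = student_t_q(Y)
# Stand-in for the input-space distribution: any symmetric matrix with zero
# diagonal and unit sum works for the purpose of this illustration.
p = student_t_q(rng.normal(size=(5, 2)))
cost = kl_divergence(p, q)
```

The cost is zero only when the two distributions coincide, so lowering it by moving the points of `Y` is what drives the embedding.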
2.2 Conditional t-SNE
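Let us now assume that each data point $x_i$ has a label $l_i$ associated, with $l_i \in \{1, \ldots, L\}$ for all $i$. Moreover, let us assume that it is known a priori that same-labeled data points are more likely to be nearby in $X$. Our goal is to ensure that the embedding $Y$ does not reflect that information again. This can be achieved by minimizing the KL-divergence between the distributions $p$ and $r$ (rather than $q$), where $r$ is the distribution derived from the embedding but conditioned on the prior knowledge.
We formalize this using the following notation. The indicator variable $\delta_{ij} = 1$ if $l_i = l_j$ and $\delta_{ij} = 0$ if $l_i \neq l_j$, and the label matrix $\Delta$ is defined by $[\Delta]_{ij} = \delta_{ij}$. The probability that the random variable $e$ is equal to $ij$, conditioned on the label matrix $\Delta$ (i.e. the prior information), is denoted as:
$$r_{ij} \triangleq P(e = ij \mid \Delta) = \frac{P(\Delta \mid e = ij)\, q_{ij}}{P(\Delta)}.$$
In ct-SNE, this is the probability distribution that needs to be similar to $p_{ij}$ for the embedding to be a good one. Note that if we ensure that $P(\Delta \mid e = ij)$ is larger when $\delta_{ij} = 1$ than when $\delta_{ij} = 0$, it will be less important for the embedding to ensure that $q_{ij}$ is large for same-labeled data points, even if $p_{ij}$ is large. That is, for same-labeled data points, it is less important to be embedded nearby even if they are nearby in the input representation. This is precisely the goal of ct-SNE.
To compute $r_{ij}$, we now investigate its different factors. First, $q_{ij}$ is simply computed as in Eq. (2). Second, we need to determine a suitable form for $P(\Delta \mid e = ij)$, based on the above intuition. To do this, we assume that $\delta_{ij}$ is the sufficient statistic for $P(\Delta \mid e = ij)$, i.e. $P(\Delta \mid e = ij) = \alpha^{\delta_{ij}} \beta^{1 - \delta_{ij}}$, where $\alpha$ and $\beta$ can be regarded as the confidence of points $i$ and $j$ being randomly picked to have the same or different labels. Let us further denote the class size of the $l$'th class as $n_l$. Then, for this distribution to be normalized, it must hold that:
$$\sum_{\Delta} P(\Delta \mid e = ij) = \alpha N_{=} + \beta \left(N - N_{=}\right) = 1, \qquad N = \frac{n!}{\prod_{l} n_l!}, \qquad N_{=} = \sum_{l=1}^{L} \frac{(n-2)!}{(n_l - 2)! \prod_{l' \neq l} n_{l'}!},$$
where $N$ is the number of distinct label assignments with the given class sizes and $N_{=}$ is the number of those in which points $i$ and $j$ share a label.
This yields a relation between $\alpha$ and $\beta$. It also suggests a ballpark figure for $\alpha$. Indeed, one would typically set $\alpha \geq \beta$. For $\alpha = \beta$ (i.e. the lower bound for $\alpha$), they would both be equal to $\prod_{l} n_l! / n!$, i.e. one divided by the number of possible distinct label assignments (this is of course entirely logical). Thus, in tuning $\alpha$, one could take multiples of this minimal value.
We can now also compute the marginal probability $P(\Delta)$ as follows:
$$P(\Delta) = \sum_{i \neq j} P(\Delta \mid e = ij)\, q_{ij} = \sum_{i \neq j} \alpha^{\delta_{ij}} \beta^{1 - \delta_{ij}} q_{ij}.$$
Given all this, one can then compute the required conditional distribution as follows:
$$r_{ij} = \frac{\alpha^{\delta_{ij}} \beta^{1 - \delta_{ij}} q_{ij}}{\sum_{k \neq m} \alpha^{\delta_{km}} \beta^{1 - \delta_{km}} q_{km}}.$$
It is numerically better to express this in terms of new variables $\alpha' = \alpha \cdot n! / \prod_{l} n_l!$ and $\beta' = \beta \cdot n! / \prod_{l} n_l!$:
$$r_{ij} = \frac{\alpha'^{\delta_{ij}} \beta'^{1 - \delta_{ij}} q_{ij}}{\sum_{k \neq m} \alpha'^{\delta_{km}} \beta'^{1 - \delta_{km}} q_{km}}, \tag{3}$$
where the relation between $\alpha'$ and $\beta'$ is:
$$\alpha' \frac{\sum_{l} n_l (n_l - 1)}{n(n-1)} + \beta' \left(1 - \frac{\sum_{l} n_l (n_l - 1)}{n(n-1)}\right) = 1. \tag{4}$$
This has the advantage of avoiding the large factorials and resulting numerical problems. The lower bound for $\alpha'$ to be considered is now $1$ (in which case also $\beta' = 1$).
Finally, computing the KL-divergence with $p$ yields the ct-SNE objective function to be minimized:
$$\mathrm{KL}(p \,\|\, r) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} + \log\left(\sum_{i \neq j} \alpha'^{\delta_{ij}} \beta'^{1 - \delta_{ij}} q_{ij}\right) - \sum_{i \neq j:\, \delta_{ij} = 1} p_{ij} \log \alpha' - \sum_{i \neq j:\, \delta_{ij} = 0} p_{ij} \log \beta'.$$
Note that the last two terms are constant w.r.t. $q_{ij}$. Moreover, it is clear that for $\alpha' = \beta' = 1$, this reduces to standard t-SNE. For $\alpha' > \beta'$ (with $\alpha'$ and $\beta'$ related as per Eq. (4)), the minimization of this KL-divergence will try to minimize $q_{ij}$ when $\delta_{ij} = 1$ more strongly (as it is multiplied with the larger number $\alpha'$) than when $\delta_{ij} = 0$ (when it is multiplied with the smaller number $\beta'$).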
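The conditioning step can be illustrated with a short NumPy sketch. It assumes the rescaled weights are written as `alpha_p` (same label) and `beta_p` (different label), that the conditional distribution is the embedding distribution reweighted per pair and renormalized, and that the two weights are tied by requiring their average over all ordered pairs to equal one. The helper names are hypothetical and the code is not taken from the released implementation.

```python
import numpy as np

def beta_from_alpha(alpha_p, labels):
    """Solve s * alpha' + (1 - s) * beta' = 1 for beta'.

    s is the fraction of ordered pairs (i != j) that share a label.
    """
    n = len(labels)
    counts = np.bincount(labels)
    s = np.sum(counts * (counts - 1)) / (n * (n - 1))
    return (1.0 - s * alpha_p) / (1.0 - s)

def conditional_r(q, labels, alpha_p, beta_p):
    """Reweight q_ij by alpha' (same label) or beta' (different label), then renormalize."""
    same = labels[:, None] == labels[None, :]
    r = np.where(same, alpha_p, beta_p) * q
    np.fill_diagonal(r, 0.0)
    return r / r.sum()

# Small example: six points, three classes of two points each.
labels = np.array([0, 0, 1, 1, 2, 2])
rng = np.random.default_rng(0)
Y = rng.normal(size=(6, 2))
t = 1.0 / (1.0 + np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1))
np.fill_diagonal(t, 0.0)
q = t / t.sum()

alpha_p = 2.0
beta_p = beta_from_alpha(alpha_p, labels)   # s = 6/30 = 0.2, so beta' = 0.6 / 0.8 = 0.75
r = conditional_r(q, labels, alpha_p, beta_p)
```

With both weights equal to one, `conditional_r` returns `q` unchanged, which matches the statement that the objective then reduces to standard t-SNE. Note that `alpha_p` must stay below `1 / s` for `beta_p` to remain positive.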
The objective function of Sec. 2.2 is non-convex w.r.t. the embedding $Y$. Even so, we find that optimizing the objective function using gradient descent with random restarts works well in practice. The gradient of the objective function $C$ w.r.t. the embedding $y_i$ of a point reads (a detailed derivation of the gradient computation can be found in Appendix A):
$$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} \left(p_{ij}\, q_{ij} Z - \alpha'^{\delta_{ij}} \beta'^{1 - \delta_{ij}} q_{ij}^2 \frac{Z^2}{Z'}\right) (y_i - y_j), \tag{5}$$
where $Z = \sum_{k \neq m} \left(1 + \|y_k - y_m\|^2\right)^{-1}$ and $Z' = \sum_{k \neq m} \alpha'^{\delta_{km}} \beta'^{1 - \delta_{km}} \left(1 + \|y_k - y_m\|^2\right)^{-1}$. The gradient can be decomposed into attraction and repelling forces between points in the embedding space. Thus the underlying problem of ct-SNE, just like many other force-based embedding methods, is related to the classic $n$-body problem in physics (see https://en.wikipedia.org/wiki/N-body_problem#Other_n-body_problems), which has also been studied in the recent machine learning literature (Gray & Moore, 2001; Ram et al., 2009). The general goal of the $n$-body problem is to find a constellation of $n$ objects such that equilibrium is achieved according to a certain measure (e.g., forces, energy). In the problem setting of ct-SNE, both the pairwise distances between points and the label information affect the attraction and repelling forces. In particular, the label information strengthens the repelling force (assuming $\alpha' > \beta'$) between two points if they have the same label and weakens the repelling force if two points have different labels. This is desirable behavior because we do not want to reflect the known label information in the resulting embeddings.
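The attraction and repulsion structure of the gradient can be checked numerically. The sketch below assumes the objective is the KL-divergence between a symmetric, normalized input distribution `P` and the label-reweighted embedding distribution, with same-label weight `alpha_p` and different-label weight `beta_p`; the gradient expression in the code was derived from that assumed objective for this illustration. The function name and the example weights are hypothetical.

```python
import numpy as np

def ctsne_cost_and_grad(Y, P, labels, alpha_p, beta_p):
    """Exact cost KL(P || r) and its gradient w.r.t. Y (O(n^2) time and memory)."""
    diff = Y[:, None, :] - Y[None, :, :]             # y_i - y_j, shape (n, n, d)
    t = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))     # Student-t kernel values
    np.fill_diagonal(t, 0.0)
    w = np.where(labels[:, None] == labels[None, :], alpha_p, beta_p)
    r = w * t / np.sum(w * t)                        # conditioned distribution
    mask = P > 0
    cost = np.sum(P[mask] * np.log(P[mask] / r[mask]))
    # Attraction is proportional to P, repulsion to r (which carries the label weights).
    coeff = (P - r) * t
    grad = 4.0 * np.sum(coeff[:, :, None] * diff, axis=1)
    return cost, grad

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1, 2, 2])
Y = rng.normal(size=(6, 2))
P = rng.random((6, 6))
P = P + P.T                  # symmetric
np.fill_diagonal(P, 0.0)
P /= P.sum()                 # sums to one over all pairs
alpha_p, beta_p = 2.0, 0.75  # illustrative weights
cost, grad = ctsne_cost_and_grad(Y, P, labels, alpha_p, beta_p)
```

Comparing one entry of `grad` against a central finite difference of `cost` is a cheap way to validate an implementation before adding any tree-based approximation.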
Evaluating the gradient has complexity $O(n^2)$, which makes the computation (in both time and memory) infeasible when $n$ is large (e.g., ). As an approximation of the gradient computation, we adapt the tree-based approximation strategy described by van der Maaten (2014). To efficiently model the proximity in high-dimensional space (Eq. (1)) we use a vantage-point tree-based algorithm (which exploits the fast diminishing property of the Gaussian distribution). To approximate the low-dimensional proximity (Eq. (3)) we modify the Barnes-Hut algorithm to incorporate the label information. The basic idea of the Barnes-Hut algorithm is to organize the points in the embedding space using a kd-tree (which for 2-d embeddings is equivalent to a quad tree). Each node of the tree corresponds to a cell (dissection) in the embedding space. If a target point $y_i$ is far away from all the points in a given cell, then the interaction between the target point and the points within the cell can be summarized by the interaction between $y_i$ and the cell's center of mass $y_{cell}$, which is computed while constructing the kd-tree. More specifically, the summarization happens when $r_{cell} / \|y_i - y_{cell}\|^2 < \theta$, where $r_{cell}$ is the radius of the cell, while $\theta$ controls the strength of summarization, i.e. the approximation strength. The summarized repelling force in t-SNE reads $-N_{cell}\, q_{i,cell}^2 Z^2 (y_i - y_{cell})$, where $N_{cell}$ is the number of data points in that cell.
In the ct-SNE approximation, we had to overcome an additional complication: we also need to summarize the label information for the points in a cell when the summarization happens. This can be done by maintaining a histogram in each cell that counts the number of data points with each label that fall into that cell. Then the repelling force on a target point can be weighted in proportion to the number of points within the cell that have the same label as the target and the number that have a different label. Namely:
$$-\left(\alpha' N_{cell}^{l_i} + \beta' \left(N_{cell} - N_{cell}^{l_i}\right)\right) q_{i,cell}^2 Z^2 (y_i - y_{cell}),$$
where $N_{cell}^{l_i}$ is the number of data points in the cell that have the same label as point $i$, and $\alpha'$ and $\beta'$ are the same-label and different-label weights of Sec. 2.2.
As both tree-based approximation schemes have complexity $O(n \log n)$, counting the labels adds an extra multiplicative constant $L$, equal to the number of label values in the prior information. Thus the final complexity of approximated ct-SNE is $O(L \cdot n \log n)$.
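The role of the per-cell label histogram can be illustrated in isolation. The sketch below computes the unnormalized repelling contribution of one summarized cell on a target point. The function name, the argument layout, and the choice to return the unnormalized, positively signed term are assumptions for illustration; the tree construction and the summarization test are omitted.

```python
import numpy as np

def cell_repulsion(y_i, label_i, cell_center, cell_hist, alpha_p, beta_p):
    """Summarized, unnormalized repulsion of one cell on the point y_i.

    cell_hist[l] holds the number of points with label l inside the cell.
    Points sharing the target's label are weighted by alpha_p, all others by beta_p.
    """
    n_cell = cell_hist.sum()
    n_same = cell_hist[label_i]
    weight = alpha_p * n_same + beta_p * (n_cell - n_same)
    diff = y_i - cell_center
    t = 1.0 / (1.0 + diff @ diff)   # Student-t kernel between target and center of mass
    return weight * t ** 2 * diff

y_i = np.array([0.0, 0.0])
center = np.array([3.0, 4.0])
hist = np.array([2, 3])             # two points with label 0, three with label 1
force = cell_repulsion(y_i, 0, center, hist, alpha_p=2.0, beta_p=0.5)
```

Looking up one histogram entry replaces a pass over all points in the cell, which is why the extra cost is a factor that depends only on the number of label values.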
The experiments investigate 4 questions: Q1 Does ct-SNE work as expected in finding complementary structure? Q2 How should (or equivalently, ) be chosen? Q3 Could ct-SNE’s goal be achieved also by using (a combination of) other methods? Q4 How well does ct-SNE scale? Two case studies addressing Q1 are presented in Sections 3.1–3.3. Two more case studies addressing Q1 as well as the experiments addressing Q2–Q4 are summarized in Sec. 3.4, and described in detail in Appendix C.
3.1 Datasets used, and experimental settings
The first dataset used in the main paper is a Synthetic dataset consisting of 1000 ten-dimensional data points, as explained in Section 1. The second in the main paper is a Facebook dataset consisting of -dimensional embedding of a de-identified random sample of Facebook users in the US. This embedding is generated based purely on the list of pages and groups that the users follow, as part of an effort to improve the quality of several recommendation systems at Facebook.
To study Q1, both qualitative and quantitative experiments were performed on the synthetic dataset. On the Facebook dataset we only conducted a qualitative evaluation (given the lack of ground truth).
Qualitative experiment. We qualitatively evaluate the effectiveness of ct-SNE through visualizations. More specifically, we compare the t-SNE visualization of a dataset with the ct-SNE visualization that has taken into account certain prior information that is visually identifiable from the t-SNE embedding. Thus by inspecting the presence of the prior information in the ct-SNE embedding and comparing to the t-SNE embedding, we can evaluate whether the prior information is removed. Conversely, we test whether information present in the ct-SNE embedding could have been identified from the t-SNE embedding to verify whether it indeed contains complementary information.
To select the prior information, we first visualize the t-SNE embedding and manually select points that are clustered in the visualization. Then we perform a feature ranking procedure to identify the features that separate the selected points from the rest. This is done by fitting a linear classifier (logistic regression) on the selected cluster against all other data points. By inspecting the weights of the classifier, we can identify the feature that contributes the most to the classifier. Repeating this feature ranking procedure for other clusters, we aim to find a feature that correlates with the majority of the clusters in the t-SNE visualization. This feature is then treated as prior information and provided as input to ct-SNE. In the reported experiments, the most prominent feature was always categorical, so all points with the same value were treated as being in a cluster to define the prior. We apply exact ct-SNE on Synthetic and approximated ct-SNE () on the Facebook dataset.
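The feature ranking step can be sketched as follows. This is a minimal NumPy version that fits an unregularized logistic regression by plain gradient descent on standardized features; the function name, the learning rate, the iteration count, and the toy data are assumptions, and any standard logistic regression implementation could be substituted.

```python
import numpy as np

def rank_features(X, selected, n_iter=500, lr=0.1):
    """Rank features by |weight| of a logistic regression (selected points vs. rest)."""
    # Standardize so that weight magnitudes are comparable across features.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    y = selected.astype(float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))
        w -= lr * (Xs.T @ (prob - y)) / len(y)   # gradient of the mean log-loss
        b -= lr * np.mean(prob - y)
    order = np.argsort(-np.abs(w))               # most influential feature first
    return order, w

# Toy check: the selected points differ from the rest only in feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
selected = rng.random(200) < 0.3
X[:, 2] += 3.0 * selected
order, weights = rank_features(X, selected)
```

Here `order[0]` is the index of the feature that best separates the selected points from the rest; repeating the call for several selected clusters and looking for a recurring top feature mirrors the procedure described above.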
We also evaluated whether ct-SNE can continuously provide new insights, by repeatedly applying the cluster selection and feature ranking procedure on ct-SNE embeddings.
Quantitative experiment. In this experiment, we quantify the presence of certain prior information in a ct-SNE embedding that also takes the same prior information as input. For example, if ct-SNE encodes the prior information using labels, the strong presence of certain prior information is equivalent to high homogeneity of the encoded labels in the embedding, i.e., points that are close to each other in the embedding often have the same label. To quantify such homogeneity, we developed a measure termed the normalized Laplacian score, defined as follows. Given an embedding and parameter , we denote as the adjacency matrix of the k-nearest-neighbor (kNN) graph computed from the embedding. Then, the Laplacian matrix of the kNN graph has the form , where . We further normalize the Laplacian matrix ( ) to obtain a score that is insensitive to the node degrees. Given a label vector with values where each label has points, and denoting the one-hot encoding for each label as , the normalized Laplacian score can be computed as:
This score is essentially the pairwise difference (in terms of labels) between the data points that are connected according to the kNN graph. If a label is locally consistent (homogeneous) in an embedding, the feature difference among the kNN graph neighborhood is small, which results in a small Laplacian score. Conversely, a less homogeneous label over the kNN graph would have a large Laplacian score. Thus, if ct-SNE removes certain prior information from its embedding, then the embedding should have a large Laplacian score on the labels that encode the prior information.
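The behavior described above can be reproduced with a simple stand-in. The sketch below is not Eq. (6): it is one plausible instantiation that assumes a symmetrized kNN graph, the symmetric normalized Laplacian, and an unweighted average over labels of the Rayleigh quotient of each label's indicator vector. All function names are hypothetical.

```python
import numpy as np

def knn_adjacency(Y, k):
    """Symmetrized 0/1 adjacency matrix of the k-nearest-neighbor graph of Y."""
    n = len(Y)
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)            # a point is not its own neighbor
    nn = np.argsort(sq_dist, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    return np.maximum(A, A.T)

def laplacian_label_score(Y, labels, k=10):
    """Average over labels of f^T L_norm f / f^T f, with f the label's indicator vector."""
    A = knn_adjacency(Y, k)
    d = A.sum(axis=1)
    L_norm = np.eye(len(Y)) - A / np.sqrt(np.outer(d, d))
    values = np.unique(labels)
    score = 0.0
    for value in values:
        f = (labels == value).astype(float)
        score += (f @ L_norm @ f) / f.sum()      # f^T f equals the class size
    return score / len(values)

# Two well-separated blobs: labels aligned with the blobs vs. shuffled labels.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
aligned = np.repeat([0, 1], 30)
shuffled = rng.permutation(aligned)
low = laplacian_label_score(Y, aligned, k=5)
high = laplacian_label_score(Y, shuffled, k=5)
```

Labels that follow the neighborhood structure give a small value, while labels unrelated to it give a larger one, which is the direction of the effect used in the quantitative comparison.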
3.2 Case study: Synthetic dataset
Qualitative experiment. The t-SNE visualization of the synthetic dataset shows five large clusters (Fig. 1a). Feature ranking (Sec. 3.1) shows these clusters correspond to the clustering in dimensions 1–4 of the data. Taking the cluster labels in dimensions 1–4 () as prior, ct-SNE gives a different visualization (Fig. 1b). The feature ranking further shows that the ct-SNE embedding indeed reveals the clusters in dimensions 5–6 of the data. We further combine the labels and by assigning a new label to each combination of the labels in and , denoted as . ct-SNE with yields an embedding based only on the remaining noise (Fig. 1c).
Quantitative experiment. We computed the normalized Laplacian scores (Eq. (6)) of the t-SNE and several ct-SNE embeddings. Subfigures in Fig. 2a–c give the Laplacian score for three label sets: , , and . Fig. 2a shows that labels are less homogeneous (higher Laplacian score) in the ct-SNE embeddings with prior and than in the t-SNE embedding, indicating that ct-SNE effectively discounted the prior from the embeddings. Both the t-SNE embedding and ct-SNE with prior clearly pick up the cluster in , as indicated by the very low Laplacian score. Similarly, Figures 2b,c show that ct-SNE removes the prior information effectively for labels and , respectively, given the associated priors.
3.3 Case study: Facebook dataset
Qualitative experiment. Applying t-SNE on the Facebook dataset gives a visualization with many visually salient clusters (Fig. 3a). Computing the feature ranking for classification of selected clusters shows that the geography (i.e., the states) contributes to the embedding the most. This is further confirmed by coloring the data points according to the geographical region in the visualization as shown in Fig. 3a: most of the clusters are indeed quite homogeneous with respect to geography.
3.4 Summary of additional experimental findings
Two other case studies (App. C.2–C.3) on the UCI adult dataset (Dheeru & Karra Taniskidou, 2017) and a DBLP citation network dataset (Tang et al., 2008) confirm the ability of ct-SNE visualizations to reveal insightful clusters after conditioning on prior information that dominates the t-SNE visualizations (Q1). In Appendix C.4 we also analyzed the sensitivity of the ct-SNE embedding with respect to the hyperparameter (or ) (Q2). By varying the hyperparameter, we found ct-SNE yields low-dimensional embeddings that better approximate the original data than t-SNE (i.e., smaller KL-divergence). The analysis also shows that using a small (e.g., ) is a good rule of thumb when using ct-SNE for visualization. To answer Q3, we compared ct-SNE to two non-trivial baselines that remove the known factors from the high-dimensional data using either an adversarial auto-encoder (AAE; Makhzani et al., 2015) or canonical correlation analysis (CCA; Hotelling, 1936) and then apply t-SNE for visualization (App. C.5). We show that these baselines are either difficult to tune (AAE-based baseline) or have limited applicability (CCA-based baseline), while ct-SNE has essentially only one parameter to tune, and does not suffer from the limitations of the CCA baseline. Finally, we conducted a runtime experiment (App. C.6) showing that the approximated ct-SNE can efficiently embed large, high-dimensional data, without substantial quality loss (Q4).
4 Related Work
Many dimensionality reduction methods have been proposed in the literature. Arguably, $n$-body problem based methods such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018) appear to be the most popular ones. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, (3) a loss function comparing the proximity between data points in the embedding space with the proximity in the input space. ct-SNE belongs to this class of DR methods. It accepts both high-dimensional data and priors about the data as inputs, and searches for low-dimensional embeddings while discounting structure in the input data specified as prior knowledge.
As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Closely related to ct-SNE, there is a group of interactive DR methods that adjust the algorithms according to a user’s inputs (e.g., Kang et al., 2016; Puolamäki et al., 2018; Díaz et al., 2014; Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that the user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user guides what is irrelevant to the user. (For an extended discussion of the related work, please refer to Appendix D.)
We introduce conditional t-SNE to efficiently discover new insights from high-dimensional data. ct-SNE finds a lower-dimensional representation of the data in a non-linear fashion while removing the known factors. Extensive case studies on both synthetic and real-world datasets demonstrate that ct-SNE can effectively remove known factors from low-dimensional representations, allowing new structure to emerge and providing new insights to the analyst. A tree-based optimization method allows ct-SNE to scale to high-dimensional datasets with hundreds of thousands of data points.
The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement no. 615517, from the FWO (project no. G091017N, G0F9816N), from the European Union’s Horizon 2020 research and innovation programme and the FWO under the Marie Sklodowska-Curie Grant Agreement no. 665501, and from the EPSRC (SPHERE EP/R005273/1). We thank Laurens van der Maaten for helpful discussions.
- Alipanahi & Ghodsi (2011) Alipanahi, B. and Ghodsi, A. Guided locally linear embedding. PRL, 32(7):1029–1035, 2011.
- Barshan et al. (2011) Barshan, E., Ghodsi, A., Azimifar, Z., and Zolghadri Jahromi, M. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. PR, 44(7):1357–1371, 2011.
- Cavallo & Demiralp (2018) Cavallo, M. and Demiralp, Ç. A visual interaction framework for dimensionality reduction based data exploration. In CHI, pp. 635, 2018.
- Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017.
- Díaz et al. (2014) Díaz, I., Cuadrado, A. A., Pérez, D., García, F. J., and Verleysen, M. Interactive dimensionality reduction for visual analytics. In ESANN, pp. 183–188, 2014.
- Edwards & Storkey (2015) Edwards, H. and Storkey, A. Censoring representations with an adversary. arXiv:1511.05897, 2015.
- Faust et al. (2019) Faust, R., Glickenstein, D., and Scheidegger, C. Dimreader: Axis lines that explain non-linear projections. TVCG, 25(1):481–490, 2019.
- Gray & Moore (2001) Gray, A. G. and Moore, A. W. ‘N-body’ problems in statistical learning. In NeurIPS, pp. 521–527, 2001.
- Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
- Hotelling (1936) Hotelling, H. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
- Jeong et al. (2009) Jeong, D. H., Ziemkiewicz, C., Fisher, B., Ribarsky, W., and Chang, R. ipca: An interactive system for pca-based visual analytics. In CGF, volume 28, pp. 767–774, 2009.
- Kang et al. (2016) Kang, B., Lijffijt, J., Santos-Rodríguez, R., and De Bie, T. Subjectively interesting component analysis: data projections that contrast with prior expectations. In KDD, pp. 1615–1624, 2016.
- Madras et al. (2018) Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. arXiv:1802.06309, 2018.
- Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv:1511.05644, 2015.
- McInnes & Healy (2018) McInnes, L. and Healy, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018.
- Paurat & Gärtner (2013) Paurat, D. and Gärtner, T. Invis: A tool for interactive visual data analysis. In ECML-PKDD, pp. 672–676, 2013.
- Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
- Pezzotti et al. (2017) Pezzotti, N., Lelieveldt, B. P., van der Maaten, L., Höllt, T., Eisemann, E., and Vilanova, A. Approximated and user steerable tsne for progressive visual analytics. TVCG, 23(7):1739–1752, 2017.
- Puolamäki et al. (2018) Puolamäki, K., Oikarinen, E., Kang, B., Lijffijt, J., and De Bie, T. Interactive visual data exploration with subjective feedback: An information-theoretic approach. In ICDE, pp. 1208–1211, 2018.
- Ram et al. (2009) Ram, P., Lee, D., March, W., and Gray, A. G. Linear-time algorithms for pairwise statistical problems. In NeurIPS, pp. 1527–1535, 2009.
- Stahnke et al. (2016) Stahnke, J., Dörk, M., Müller, B., and Thom, A. Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. TVCG, 22(1):629–638, 2016.
- Tang et al. (2008) Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z. Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. ACM, 2008.
- Tang et al. (2016) Tang, J., Liu, J., Zhang, M., and Mei, Q. Visualizing large-scale and high-dimensional data. In WWW, pp. 287–297, 2016.
- Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
- Torgerson (1952) Torgerson, W. S. Multidimensional scaling: I. theory and method. Psychometrika, 17(4):401–419, 1952.
- van der Maaten (2014) van der Maaten, L. Accelerating t-sne using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
- van der Maaten & Hinton (2008) van der Maaten, L. and Hinton, G. Visualizing data using t-sne. JMLR, 9(Nov):2579–2605, 2008.
- van der Maaten & Hinton (2012) van der Maaten, L. and Hinton, G. Visualizing non-metric similarities in multiple maps. MLJ, 87(1):33–55, 2012.
Appendix A Detailed derivation of the gradient of the ct-SNE objective function
Here we derive in detail the gradient of the ct-SNE objective function. Denote the Euclidean distance between points as . The derivative of with respect to embedding reads:
Denote the cost (KL-divergence) by :
Following the derivation from the t-SNE paper (van der Maaten & Hinton, 2008), the derivative of with respect to reads:
To compute the derivative of with respect to , we first have:
Denote The derivative of with respect to can be computed as:
Thus we have derivative of with respect to
Finally, we have derivative:
Appendix B On generalizing the idea of ct-SNE
The idea of removing known factors from low-dimensional representations can be generalized to other $n$-body problem based DR methods. Oftentimes, the gradient of $n$-body problem based methods can be viewed as a summation of attraction forces and repelling forces. Removing a known factor thus amounts to re-weighting the attracting and repelling forces such that points that have the same label repel each other and points with different labels attract each other. For example, LargeVis (Tang et al., 2016) differs from t-SNE by modeling input space proximity using a random kNN graph. Thus we can use the same conditioning idea as in ct-SNE to remove the known factors in LargeVis. However, for Uniform Manifold Approximation and Projection (UMAP; McInnes & Healy, 2018), conditioning is not readily applicable. In contrast to t-SNE, UMAP uses fuzzy sets to model the proximity in both input space and embedding space. The cross entropy between the two fuzzy sets then serves as the loss function to compare the modeled proximity between the input space and the embedding space. In the UMAP setting, it is not straightforward to condition the lower-dimensional proximity model on the prior. But we can still directly re-weight the repelling forces: for data points with the same label, the pushing effect is strengthened by ; for samples with different labels, the pushing effect is weakened by multiplying with , with assumption . However, without proper conditioning, parameters and lose their probabilistic interpretation and along with it their one-to-one correspondence (as in ct-SNE), thus both parameters and need to be set.
Appendix C Extended experiments
In this section, we introduce two additional datasets:
UCI Adult dataset. We sampled 1000 data points from the UCI adult dataset (Dheeru & Karra Taniskidou, 2017) with six attributes: the three numeric attributes age, education level, and work hours per week, and the three binary attributes ethnicity (white/other), gender, and income (>50k).
DBLP dataset. We extracted all papers from 20 venues (NIPS, ICLR, ICML, AAAI, IJCAI, KDD, ECML-PKDD, ICDM, SDM, WSDM, PAKDD, VLDB, SIGMOD, ICDT, ICDE, PODS, SIGIR, WWW, CIKM, ECIR) in four areas (ML/DM/DB/IR) of computer science from the DBLP citation network dataset (Tang et al., 2008). We sampled half of the papers and constructed a network ( nodes; the network consists of paper nodes, author nodes, topic nodes and venue nodes) based on paper-author, paper-topic, and paper-venue relations. Finally, we embedded the network into a dimensional Euclidean space using node2vec (Grover & Leskovec, 2016) with walk length , window size . In our experiment, both node2vec random-walk parameters $p$ and $q$ are set to $1$. Under this setting, node2vec is equivalent to DeepWalk (Perozzi et al., 2014).
C.2 Case study: UCI Adult dataset
Qualitative experiment. Fig. 4a shows t-SNE gives an embedding that consists of clusters grouped according to combinations of three attributes: gender, ethnicity and income (>50k). By incorporating the attribute gender as prior, the ct-SNE embedding (Fig. 4b) contains clusters with a mixture of male and female points, indicating the gender information is removed. Instead, by incorporating the attribute ethnicity the ct-SNE embedding (Fig. 4c) contains clusters with a mixture of ethnicities. Finally, incorporating the combination of attributes gender and ethnicity as prior, the ct-SNE embedding contains data points grouped according to income (Fig. 4d).
Quantitative experiment. We analyzed the homogeneities (Laplacian scores) of the attributes gender, ethnicity and income (>50k) measured on both t-SNE and ct-SNE embeddings. Fig. 5a shows that ct-SNE with prior gender removes the gender factor from the resulting embedding, while ct-SNE with prior ethnicity makes the gender factor in the resulting embedding clearer. Similarly, Fig. 5b,c show that ct-SNE removes the prior information effectively for labels ethnicity and ethnicity&gender respectively, given the associated priors.
C.3 Case study: DBLP dataset
Qualitative experiment. Applying t-SNE on the DBLP dataset gives a visualization with many visual clusters (Fig. 6a). Feature ranking for classification of the selected clusters shows the topics that contribute the most to the visualization. Moreover, we used mpld3 (an interactive visualization library, https://mpld3.github.io) to inspect the metadata of the t-SNE plot (i.e., hovering over data points and checking the tooltips). Upon inspection, the visualization appears to be globally divided according to the four areas. This is further confirmed by coloring the data points according to the four areas: most of the clusters are indeed quite homogeneous with respect to area.
Once we know from the t-SNE visualization that the papers are indeed divided according to areas, the area structure in the visualization is no longer very informative. We can therefore encode the area as the prior for ct-SNE so that other interesting structures can emerge. Using the same color scheme, ct-SNE shows a visualization that has many clusters with mixed colors (Fig. 6b). This indicates that the area information is mostly removed in the ct-SNE embedding. This is further confirmed by selecting clusters in the ct-SNE embedding (Fig. 6d) and highlighting the same set of points in the t-SNE embedding (Fig. 6c). The clusters highlighted in the ct-SNE visualization often consist of clusters (topics) from different areas (i.e., t-SNE clusters with different colors) that are spread over the t-SNE visualization. Indeed, feature ranking indicates that papers in the selected ct-SNE clusters have similar topics, e.g., ‘privacy’, ‘data stream’, ‘computer vision’. Finally, we noticed that some clusters in the ct-SNE embedding (Fig. 6d) also exist in the t-SNE embedding (Fig. 6c). Using feature ranking as above, we found that these clusters are not homogeneous in terms of area of study, but in terms of topics (e.g., ‘clustering’, ‘active learning’), indicating a tightly connected research community behind the topic. Thus, by removing the irrelevant area structure using ct-SNE, clusters that persist in both visualizations become more salient and easier to observe.
C.4 Parameter sensitivity
To understand the effect of the parameter (or equivalently, ) on ct-SNE embeddings (Q3), we study ct-SNE embeddings on the synthetic dataset with the prior fixed to be the cluster labels in dimensions 1–4. First, we try to understand the relation between the ct-SNE objective and the parameter (or equivalently, ). We evaluated the ct-SNE objective (Eq. 2.2) on the ct-SNE embeddings obtained by ranging (and correspondingly) from (strong prior removal effect) to (no prior removal effect, equivalent to t-SNE) with step size . We also evaluated the t-SNE objective (first term in Eq. 2.2) and the second term in Eq. 2.2 (the only term that depends on the prior, subsequently referred to as the prior term) for the ct-SNE embeddings associated with various s.
Fig. 7a visualizes the values of the ct-SNE objective, the t-SNE objective, and the ct-SNE prior term against different s. Observe that by using a prior, the ct-SNE embedding achieves a better approximation to the high-dimensional data. That is, ct-SNE achieves a lower KL-divergence (lowest at ) than t-SNE does (). This is because the prior term in the ct-SNE objective can be negative. Although the t-SNE objective increases when decreases, this increase is compensated by the negative value contributed by the prior term. Indeed, by factoring out a certain prior from the lower-dimensional embedding, the need for the embedding to represent the prior is alleviated, giving ct-SNE more freedom to approximate the high-dimensional proximities.
Interestingly, we observe that the embedding with the smallest KL-divergence does not necessarily give a better visualization (e.g., a clear separation of the clusters). We visualize the ct-SNE embedding that achieves the smallest KL-divergence (, Fig. 7b) and compare it with the ct-SNE embedding that has the strongest prior removal effect but a larger KL-divergence (, Fig. 7c). Although the embedding with the stronger prior removal effect has a larger objective value, it gives a clearer clustering than the embedding with the smaller KL-divergence (). As a result, the clusters in dimensions 5–6 are easier to identify. Hence, as a rule of thumb when using ct-SNE for visualization, we propose to use a small (e.g., ).
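The t-SNE term discussed in this subsection is a KL-divergence between input-space and embedding-space similarities, and evaluating it for a given embedding can be sketched as follows. This sketch covers only the t-SNE term; the prior term of Eq. 2.2 is not reproduced here. It also assumes a single global Gaussian bandwidth for the input similarities, whereas t-SNE calibrates a per-point bandwidth from a perplexity value. All function names are illustrative.

```python
import numpy as np


def pairwise_sq_dists(points):
    """Matrix of squared Euclidean distances between all pairs of rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sum(diffs ** 2, axis=-1)


def input_similarities(data, bandwidth=1.0):
    """Normalized Gaussian similarities P (assumption: one global bandwidth)."""
    weights = np.exp(-pairwise_sq_dists(data) / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()


def embedding_similarities(embedding):
    """Normalized Student-t similarities Q with one degree of freedom."""
    weights = 1.0 / (1.0 + pairwise_sq_dists(embedding))
    np.fill_diagonal(weights, 0.0)
    return weights / weights.sum()


def tsne_objective(data, embedding, bandwidth=1.0, eps=1e-12):
    """KL(P || Q) between input-space and embedding-space similarities."""
    p = input_similarities(data, bandwidth)
    q = embedding_similarities(embedding)
    mask = p > 0  # terms with p = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))


# Hypothetical example: two clusters in 5-d, compared against two 2-d embeddings.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))
data[50:, 0] += 10.0                                  # second cluster is shifted
faithful_embedding = data[:, :2]                      # keeps the cluster structure
scrambled_embedding = rng.permutation(faithful_embedding)  # destroys it
kl_faithful = tsne_objective(data, faithful_embedding)
kl_scrambled = tsne_objective(data, scrambled_embedding)
```

In this toy case the embedding that preserves the cluster structure attains the lower value, which is the sense in which the objective measures how well an embedding approximates the high-dimensional proximities.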
C.5 Baseline comparisons
In this section, we compare ct-SNE with two non-trivial baselines. The basic idea of both is to first remove the known factor from the dataset and then apply t-SNE to produce a lower-dimensional representation. We use one non-linear and one linear method to remove the known factors: an adversarial auto-encoder (AAE) and canonical correlation analysis (CCA). The implementation of the baselines and the code for the comparison experiments are also available at https://bitbucket.org/ghentdatascience/ct-sne.
Baseline: AAE and t-SNE. An adversarial auto-encoder (AAE; Makhzani et al., 2015) can be used to learn a latent representation that prevents a discriminator from predicting certain attributes (Madras et al., 2018). To remove prior information from the low-dimensional representation of a dataset using an AAE, we can configure the discriminator to predict the prior attributes and use the auto-encoder to adversarially remove the prior from the latent representation of the dataset.
We adopt the AAE configuration described by Edwards & Storkey (2015). AAE is in general difficult to tune: it has hyperparameters ( network structure parameters, weights in the objective, and learning rates) and a few design choices about the network architecture (e.g., the number of layers in each subnetwork and the activation functions). We tried different parameter settings and managed to remove the cluster label information in dimensions 1–4 (Fig. 8a) and 5–6 (Fig. 8b) from the data. In Fig. 8a, the AAE approach manages to remove the prior information, but it fails to pick up the complementary structure in the data (clusters in dimensions 5–6). It also fails to remove the prior information (cluster labels in dimensions 1–6) in Fig. 8c. Compared to this baseline, ct-SNE practically has only one parameter () to tune, which can often be set to a small positive number (e.g., ).
Baseline: CCA and t-SNE. Canonical correlation analysis (Hotelling, 1936) aims to find a linear transformation for two random variables such that the correlation between the transformed variables is maximized. To remove the prior information from data using CCA, one approach is to first find the (at most) subspace ( is the dimensionality of the data) in which the transformed data and the prior information (a one-hot encoding of the labels) have the largest correlation. The data is then whitened by projecting it onto the null space (at least -d) of the subspace found in the first step. By doing so, the whitened data becomes less correlated with the known factor.
Another variant of the CCA-based approach directly projects the data onto the -dimensional subspace found by CCA in which the transformed data and the labels have the smallest correlation. To be consistent, we also apply t-SNE to the transformed data.
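The first CCA variant (projecting the data onto the orthogonal complement of the most label-correlated directions) can be sketched with plain NumPy as follows. This is an illustrative reading of the description above rather than the implementation used in the experiments; the function names and the numerical thresholds are assumptions. The canonical directions are obtained from a singular value decomposition of the whitened cross-covariance between the data and the one-hot labels.

```python
import numpy as np


def inverse_sqrt(matrix, eps=1e-10):
    """(Pseudo-)inverse square root of a symmetric positive semi-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    scale = np.where(eigvals > eps, 1.0 / np.sqrt(np.maximum(eigvals, eps)), 0.0)
    return (eigvecs * scale) @ eigvecs.T


def remove_labels_with_cca(data, labels):
    """Project centered data onto the orthogonal complement of the CCA directions."""
    x = data - data.mean(axis=0)
    values = np.unique(labels)
    y = (labels[:, None] == values[None, :]).astype(float)  # one-hot encoding
    y = y - y.mean(axis=0)
    n_points = x.shape[0]
    cxx_inv_sqrt = inverse_sqrt(x.T @ x / n_points)
    # Centered one-hot labels are rank-deficient, hence the pseudo-inverse.
    cyy_inv_sqrt = inverse_sqrt(y.T @ y / n_points)
    cross = cxx_inv_sqrt @ (x.T @ y / n_points) @ cyy_inv_sqrt
    u, singular_values, _ = np.linalg.svd(cross, full_matrices=False)
    n_directions = int(np.sum(singular_values > 1e-8))
    # Map the singular vectors back to canonical directions in data space.
    directions = cxx_inv_sqrt @ u[:, :n_directions]
    basis, _ = np.linalg.qr(directions)
    projector = np.eye(x.shape[1]) - basis @ basis.T
    return x @ projector


# Hypothetical example: a binary label that shifts the data along the first axis.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 1000)
data = rng.normal(size=(2000, 5))
data[:, 0] += 5.0 * labels
cleaned = remove_labels_with_cca(data, labels)
gap_before = np.linalg.norm(data[labels == 1].mean(axis=0) - data[labels == 0].mean(axis=0))
gap_after = np.linalg.norm(cleaned[labels == 1].mean(axis=0) - cleaned[labels == 0].mean(axis=0))
```

The number of directions removed in this sketch can never exceed the dimensionality of the data, so when the label has more distinct values than the data has dimensions, the projection cannot remove the label information without discarding everything else.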
Our experimental results show that the CCA-based approaches can easily remove label information that is orthogonal to the other attributes in the data. For example, in the UCI Adult dataset, the gender information is orthogonal to ethnicity and income, and can thus be easily removed using the CCA approach. However, the CCA-based approach performs poorly when the known factor is correlated with other attributes. Moreover, the CCA-based approaches have the limitation that the number of projection vectors is upper-bounded by the dimensionality of the data. If the number of unique values of an attribute exceeds the dimensionality of the data, the CCA projection cannot remove the label information entirely from the data. To illustrate these points, we synthesized a -dimensional dataset with 1,000 data points. The data points are grouped into clusters, each corresponding to a multivariate Gaussian with random location and small variance. Additionally, each cluster is separated into two small clusters (one contains points of the cluster, and the other contains the rest) along one randomly chosen axis. Fig. 9a,b show that both CCA approaches pick up only the large clusters (differentiated using marker shape) but fail to pick up the structure of the two small clusters (plotted in different colors) within each large cluster. In contrast, ct-SNE removes the cluster information from the embedding and shows that each large cluster can be further separated into two smaller clusters.
Thus, the CCA-based baselines perform poorly when the known factor is correlated with other attributes. Moreover, the number of the projection vectors in CCA-based baselines is upper-bounded by the dimensionality of the data. Meanwhile, ct-SNE does not have these limitations.
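A dataset with the nested cluster structure used in this comparison can be generated along the following lines. Only the number of points (1,000) is taken from the text; the dimensionality, the number of clusters, the split fraction, the offset between sub-clusters, and the noise level are hypothetical values chosen for illustration.

```python
import numpy as np


def make_nested_clusters(n_points=1000, n_dims=10, n_clusters=5,
                         split_fraction=0.5, split_offset=1.0,
                         cluster_std=0.1, seed=0):
    """Gaussian clusters, each split into two sub-clusters along one random axis.

    All defaults except n_points are assumptions, not the paper's settings.
    """
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10.0, 10.0, size=(n_clusters, n_dims))
    cluster_labels = rng.integers(0, n_clusters, size=n_points)
    noise = rng.normal(scale=cluster_std, size=(n_points, n_dims))
    data = centers[cluster_labels] + noise
    # One randomly chosen split axis per large cluster.
    split_axes = rng.integers(0, n_dims, size=n_clusters)
    # Roughly `split_fraction` of the points stay in the first sub-cluster.
    sub_labels = (rng.random(n_points) >= split_fraction).astype(int)
    # Shift the second sub-cluster of each cluster along that cluster's axis.
    rows = np.flatnonzero(sub_labels == 1)
    data[rows, split_axes[cluster_labels[rows]]] += split_offset
    return data, cluster_labels, sub_labels


data, cluster_labels, sub_labels = make_nested_clusters()
```

Using `cluster_labels` as the known factor and `sub_labels` as the complementary structure gives a setting in which removing the large clusters should reveal the split inside each of them.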
We measure the runtime of the exact ct-SNE and the approximated version () on a PC with a quad-core GHz Intel Core i5 and 2133MHz LPDDR3 RAM. By default, the maximum number of ct-SNE gradient update iterations is 1,000. For larger datasets and prior attributes that have many values, more iterations are required to converge. For example, the synthetic dataset (1,000 samples and 10 dimensions) requires fewer than 1,000 iterations to converge, while the Facebook dataset (500,000 examples and 128 dimensions) requires 3,000 iterations. Table 1 shows that approximated ct-SNE is efficient and applicable to large, high-dimensional data, while exact ct-SNE is not.
Appendix D Extended related work
Many dimensionality reduction methods have been proposed in the literature. Arguably, methods based on the N-body problem (see Section 2.3 for more information on the N-body problem), such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018), are the most popular ones. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, and (3) a loss function comparing the proximity between data points in the embedding space with the proximity in the input space. When minimizing the loss over the embedding space, the data points (i.e., the bodies) have pairwise interactions and the embedding of all points needs to be updated simultaneously. Since the optimization problem is not convex, local minima are typically accepted as output. ct-SNE belongs to this class of DR methods. It accepts both high-dimensional data and priors about the data as inputs, and searches for low-dimensional embeddings while discounting structure in the input data specified as prior knowledge. Closely related is the multiple maps t-SNE work (van der Maaten & Hinton, 2012), in which factors that are mutually exclusive are captured by multiple t-SNE embeddings at once. Compared to multiple maps t-SNE, ct-SNE allows users to disentangle information in a targeted (subjective) manner, by specifying which information they would like to have factored out.
As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Existing papers on interactive DR can be categorized into two groups. The first group aims to improve the explainability and computational efficiency of existing DR methods via novel visualizations and interactions. iPCA (Jeong et al., 2009) allows users to easily explore the PCA components and thus achieve a better understanding of the linear projections of the data onto different PCA components. Cavallo & Demiralp (2018) help the user to understand low-dimensional representations by applying perturbations to probe the connection between the input attribute space and the embedding space. Similarly, Faust et al. (2019) introduce a method based on perturbations to visualize the effect of a specific input attribute on the embedding, while Stahnke et al. (2016) introduce ‘probing’ as a means to understand the meaning of point set selections within the embedding. Steerable t-SNE (Pezzotti et al., 2017) aims to make t-SNE more scalable by quickly providing a sketch of an embedding, which is then refined only according to the user’s interests.
The second group of interactive DR methods adjusts the algorithms according to a user’s inputs. SICA (Kang et al., 2016) and SIDE (Puolamäki et al., 2018) explicitly model the user’s belief state and find linear projections that contrast with it. These two methods are linear DR methods and thus cannot present non-linear structures in the low-dimensional representations. Work by Díaz et al. (2014) allows users to define their own metric in the input space, after which the low-dimensional representation reflects the adjusted importance of the attributes. This method places the burden of directly manipulating the input space metric on the user. Many variants of existing DR methods have been introduced in which user feedback entails editing the embedding, and such manually embedded points are used as constraints to guide the dimensionality reduction (e.g., Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that the user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user specifies what is irrelevant to the user.