Conditional t-SNE: Complementary t-SNE embeddings through factoring out prior information

05/24/2019, by Bo Kang et al.

Dimensionality reduction and manifold learning methods such as t-Distributed Stochastic Neighbor Embedding (t-SNE) are routinely used to map high-dimensional data into a 2-dimensional space to visualize and explore the data. However, two dimensions are typically insufficient to capture all structure in the data, the salient structure is often already known, and it is not obvious how to extract the remaining information in a similarly effective manner. To fill this gap, we introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that discounts prior information from the embedding in the form of labels. To achieve this, we propose a conditioned version of the t-SNE objective, obtaining a single, integrated, and elegant method. ct-SNE has one extra parameter over t-SNE; we investigate its effects and show how to efficiently optimize the objective. Factoring out prior knowledge allows complementary structure to be captured in the embedding, providing new insights. Qualitative and quantitative empirical results on synthetic and (large) real data show ct-SNE is effective and achieves its goal.

1 Introduction

Dimensionality reduction (DR) methods can be used to create low-dimensional (typically 2-dimensional; 2-d) representations that are straightforward to visualize and subsequently explore the dominant structure of high-dimensional datasets. Non-linear DR methods are particularly powerful because they can capture complex structure even when it is spread over many dimensions. This explains the huge popularity of methods such as t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018).

However, DR methods yield a single static embedding, and the most prominent structure present in the data may already be known to the analyst. One may indeed construct higher-dimensional embeddings, hoping to uncover more structure, but there is no guarantee that any of the constructed dimensions is fully complementary to the prior knowledge of an analyst. Besides, the salient structure that is already known may be spread across all attributes, so we cannot simply remove the associated attributes, and generally speaking it is not obvious how to visualize the remaining structure. The question thus arises: can we actively filter or discount prior knowledge from the embedding?

To this end, we introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that accounts for prior information about the data. By discounting the prior information, the embedding can focus on capturing complementary information. More concretely, ct-SNE does not aim to construct an embedding that reflects all the proximities in the original data (the objective of t-SNE), but one that reflects the pairwise proximities conditioned on whether we expect each pair to be close or not.

ct-SNE enables at least three new ways to obtain insight into data:

  • The prior knowledge may indeed be available beforehand, in which case we can straight away focus the analysis on an embedding that is more useful.

  • Such prior knowledge may be gained during analysis, leading to an iterative data analysis process.

  • If we observe some specific structure X in an embedding and then factor out specific information Y, then if X remains present in the embedding, we learn that X is complementary to Y.

Note that we use the term prior knowledge even when this knowledge is not available a priori but is gained during the analysis. This reflects that the knowledge is available prior to the embedding step.

Figure 1: Visualization of 2-d embeddings of synthetic data (see ‘example’ below).

Example.

To demonstrate the idea behind ct-SNE more concretely, consider a ten-dimensional dataset with 1,000 data points. In dimensions 1–4 the data points fall into five clusters (each following a multivariate Gaussian with small variance); similarly, in dimensions 5–6 the points fall randomly into four clusters. Dimensions 7–10 contain Gaussian noise with larger variance. Figure 1a gives the t-SNE embedding. It shows five large clusters, some of which can be split further into smaller clusters. The large clusters correspond to those defined in dimensions 1–4. Figure 1b is the ct-SNE embedding where we have input the five colored clusters as prior knowledge. This figure shows four clusters that are complementary to the five clusters observed in Figure 1a. We see they are complementary because there is no correlation between the colors and the clusters in Figure 1b. These four clusters are indeed those defined in dimensions 5–6. Finally, Figure 1c shows that after combining the two label sets, ct-SNE yields an embedding capturing only the remaining noise.
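A minimal sketch of how such a dataset could be generated (the cluster counts follow the description above; the specific variances, cluster-centre spread, and random seed are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Five clusters in dimensions 1-4: each point gets a cluster centre plus small Gaussian noise.
labels_14 = rng.integers(0, 5, size=n)
centres_14 = rng.normal(0, 10, size=(5, 4))
dims_14 = centres_14[labels_14] + rng.normal(0, 0.5, size=(n, 4))

# Four clusters in dimensions 5-6, assigned independently of the first clustering.
labels_56 = rng.integers(0, 4, size=n)
centres_56 = rng.normal(0, 10, size=(4, 2))
dims_56 = centres_56[labels_56] + rng.normal(0, 0.5, size=(n, 2))

# Dimensions 7-10: pure Gaussian noise with larger variance.
dims_noise = rng.normal(0, 2.0, size=(n, 4))

X = np.hstack([dims_14, dims_56, dims_noise])   # (1000, 10) data matrix
labels_combined = labels_14 * 4 + labels_56     # one label per combination of the two clusterings
```

The combined labels correspond to the prior used for Figure 1c.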

The implementation of ct-SNE and code for the experiments on public data are available at https://bitbucket.org/ghentdatascience/ct-sne.

Contributions. This paper makes the following contributions:

  • ct-SNE, a new DR method that searches for an embedding such that a distribution defined in terms of distances in the input space (as in t-SNE) is well-approximated by a distribution defined in terms of distances in the embedding space after conditioning on the prior knowledge (Sec. 2.2);

  • A Barnes-Hut-Tree based optimization method to efficiently find an embedding (Sec. 2.3);

  • We illustrate that the concept of conditioning embeddings on prior information can be applied to other popular non-linear DR methods (mentioned in Sec. 2, with details in Appendix B);

  • Extensive qualitative and quantitative experiments on synthetic and real world datasets show ct-SNE effectively removes the known factors, enables deeper visual analysis of high-dimensional data, and that ct-SNE scales sufficiently to handle hundreds of thousands of points (Sec. 3).

2 Method

In this section, we first briefly recap the idea behind t-SNE and introduce the basic notation. Then, we derive ct-SNE and describe a Barnes-Hut based strategy to optimize the ct-SNE objective. Due to space limitations, we discuss in Appendix B how the idea of factoring out prior information can be applied to many other existing non-linear DR methods such as LargeVis and UMAP.

2.1 Background: t-SNE

In t-SNE, the input data set $\mathbf{X} = \{x_i\}_{i=1}^n$, with $x_i \in \mathbb{R}^d$, is taken to define a probability distribution for a categorical random variable $E$, of which the value domain is indexed by all pairs of indices $(i,j)$ with $i \neq j$. This distribution is determined by specifying probabilities $p_{ij}$ with $\sum_{i\neq j} p_{ij} = 1$, each equal to the probability that $E = (i,j)$. For brevity, below we will speak of the distribution $P$ when we mean the categorical distribution with parameters $p_{ij}$.

More specifically, in t-SNE, the distribution $P$ is defined as follows:

$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k\neq i}\exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad (1)$$

where the bandwidths $\sigma_i$ are chosen such that each conditional distribution has a user-specified perplexity.

The goal of t-SNE is to find a low-dimensional embedding $\mathbf{Y} = \{y_i\}_{i=1}^n$, with $y_i \in \mathbb{R}^{d'}$ (typically $d' = 2$), from which another categorical probability distribution $Q$ is derived, specified by the values $q_{ij}$ defined as follows:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k\neq l}\left(1 + \|y_k - y_l\|^2\right)^{-1}}. \qquad (2)$$

An embedding is deemed better if the distance between these two categorical distributions is smaller, as quantified by the KL-divergence: $KL(P\,\|\,Q) = \sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}}$.
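To make the two distributions concrete, the following sketch computes $P$, $Q$, and their KL-divergence with a fixed bandwidth (t-SNE actually selects each $\sigma_i$ by a perplexity-driven binary search, which is omitted here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def hd_affinities(X, sigma=1.0):
    """Symmetrised high-dimensional probabilities p_ij (Eq. 1), with a single fixed bandwidth."""
    d2 = squareform(pdist(X, "sqeuclidean"))
    P_cond = np.exp(-d2 / (2 * sigma**2))
    np.fill_diagonal(P_cond, 0.0)
    P_cond /= P_cond.sum(axis=1, keepdims=True)      # conditional probabilities p_{j|i}
    return (P_cond + P_cond.T) / (2 * X.shape[0])    # symmetrised p_ij

def ld_affinities(Y):
    """Low-dimensional probabilities q_ij based on the Student-t kernel (Eq. 2)."""
    W = 1.0 / (1.0 + squareform(pdist(Y, "sqeuclidean")))
    np.fill_diagonal(W, 0.0)
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) over all pairs with p_ij > 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```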

2.2 Conditional t-SNE

Let us now assume that each data point $x_i$ has an associated label $l_i$, with $l_i \in \{1,\ldots,k\}$ for all $i$. Moreover, let us assume that it is known a priori that same-labeled data points are more likely to be nearby in $\mathbf{X}$. Our goal is to ensure that the embedding does not reflect that information again. This can be achieved by minimizing the KL-divergence between the distributions $P$ and $R$ (rather than $Q$), where $R$ is the distribution derived from the embedding but conditioned on the prior knowledge.

We formalize this using the following notation. The indicator variable $\delta_{ij} \triangleq 1$ if $l_i = l_j$ and $\delta_{ij} \triangleq 0$ if $l_i \neq l_j$, and the label matrix $\Delta \in \{0,1\}^{n\times n}$ is defined by $\Delta_{ij} \triangleq \delta_{ij}$. The probability that the random variable $E$ is equal to $(i,j)$, conditioned on the label matrix $\Delta$ (i.e. the prior information), is denoted as:

$$r_{ij} \triangleq P\big(E = (i,j)\mid\Delta\big) = \frac{P\big(\Delta\mid E=(i,j)\big)\, q_{ij}}{P(\Delta)}.$$

In ct-SNE, $R$ is the probability distribution that needs to be similar to $P$ for the embedding to be a good one. Note that if we ensure that $P(\Delta\mid E=(i,j))$ is larger when $\delta_{ij}=1$ than when $\delta_{ij}=0$, it will be less important for the embedding to ensure that $q_{ij}$ is large for same-labeled data points, even if $p_{ij}$ is large. I.e., for same-labeled data points it is less important to be embedded nearby, even if they are nearby in the input representation. This is precisely the goal of ct-SNE.

To compute $r_{ij}$, we now investigate its different factors. First, $q_{ij}$ is simply computed as in Eq. (2). Second, we need to determine a suitable form for $P(\Delta\mid E=(i,j))$, based on the above intuition. To do this, we assume that $\delta_{ij}$ is the sufficient statistic for $\Delta$ given $E=(i,j)$, i.e. $P(\Delta\mid E=(i,j)) = \alpha^{\delta_{ij}}\beta^{1-\delta_{ij}}$, where $\alpha$ and $\beta$ can be regarded as the confidence that points $i$ and $j$, when randomly picked, have the same or different labels. Let us further denote the class size of the $c$'th class as $N_c \triangleq |\{i : l_i = c\}|$. Then, for this distribution to be normalized, it must hold that:

$$\frac{n!}{\prod_c N_c!}\left(\alpha\,\frac{\sum_c N_c(N_c-1)}{n(n-1)} + \beta\left(1 - \frac{\sum_c N_c(N_c-1)}{n(n-1)}\right)\right) = 1.$$

This yields a relation between $\alpha$ and $\beta$. It also suggests a ballpark figure for $\alpha$. Indeed, one would typically set $\alpha > \beta$. For $\alpha = \beta$ (i.e. the lower bound for $\alpha$), they would both be equal to $\prod_c N_c!/n!$, i.e. one divided by the number of possible distinct label assignments (which is of course entirely logical). Thus, in tuning $\alpha$, one could take multiples of this minimal value.

We can now also compute the marginal probability $P(\Delta)$ as follows:

$$P(\Delta) = \sum_{i\neq j} P\big(\Delta\mid E=(i,j)\big)\, q_{ij} = \alpha \sum_{\delta_{ij}=1} q_{ij} + \beta \sum_{\delta_{ij}=0} q_{ij}.$$

Given all this, one can then compute the required conditional distribution as follows:

$$r_{ij} = \frac{\alpha^{\delta_{ij}}\beta^{1-\delta_{ij}}\, q_{ij}}{\alpha \sum_{\delta_{kl}=1} q_{kl} + \beta \sum_{\delta_{kl}=0} q_{kl}}. \qquad (3)$$

It is numerically better to express this in terms of the new variables $\alpha' \triangleq \alpha\, n!/\prod_c N_c!$ and $\beta' \triangleq \beta\, n!/\prod_c N_c!$:

$$r_{ij} = \frac{\alpha'^{\delta_{ij}}\beta'^{1-\delta_{ij}}\, q_{ij}}{\alpha' \sum_{\delta_{kl}=1} q_{kl} + \beta' \sum_{\delta_{kl}=0} q_{kl}},$$

where the relation between $\alpha'$ and $\beta'$ is:

$$\beta' = \frac{n(n-1) - \alpha' \sum_c N_c(N_c-1)}{n(n-1) - \sum_c N_c(N_c-1)}. \qquad (4)$$

This has the advantage of avoiding the large factorials and the resulting numerical problems. The lower bound for $\alpha'$ to be considered is now $1$ (in which case also $\beta' = 1$).

Finally, computing the KL-divergence of $R$ with respect to $P$ yields the ct-SNE objective function to be minimized:

$$KL(P\,\|\,R) = \sum_{i\neq j} p_{ij}\log\frac{p_{ij}}{q_{ij}} + \log\!\Big(\alpha'\!\!\sum_{\delta_{kl}=1}\!\! q_{kl} + \beta'\!\!\sum_{\delta_{kl}=0}\!\! q_{kl}\Big) - \log\alpha'\!\!\sum_{\delta_{ij}=1}\!\! p_{ij} - \log\beta'\!\!\sum_{\delta_{ij}=0}\!\! p_{ij}. \qquad (5)$$

Note that the last two terms are constant w.r.t. the embedding $\mathbf{Y}$. Moreover, it is clear that for $\alpha' = \beta' = 1$, this reduces to standard t-SNE. For $\alpha' > 1$ (and $\beta'$ related as per Eq. (4)), the minimization of this KL-divergence will try to minimize $q_{ij}$ more strongly when $\delta_{ij}=1$ (as it is multiplied with the larger number $\alpha'$) than when $\delta_{ij}=0$ (when it is multiplied with the smaller number $\beta'$).
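Based on the reconstruction above, the following sketch computes $r_{ij}$ from the low-dimensional probabilities $q_{ij}$, the labels, and $\alpha'$ ($\beta'$ follows from Eq. (4)); the function names are illustrative and not part of the released implementation:

```python
import numpy as np

def beta_prime_from_alpha_prime(alpha_prime, labels):
    """Relation of Eq. (4): beta' chosen so the conditional prior stays normalised."""
    labels = np.asarray(labels)
    n = len(labels)
    _, counts = np.unique(labels, return_counts=True)
    same = np.sum(counts * (counts - 1))           # number of same-labelled ordered pairs
    return (n * (n - 1) - alpha_prime * same) / (n * (n - 1) - same)

def conditional_affinities(Q, labels, alpha_prime):
    """r_ij of Eq. (3): q_ij reweighted by alpha' (same label) or beta' (different label)."""
    labels = np.asarray(labels)
    beta_prime = beta_prime_from_alpha_prime(alpha_prime, labels)
    gamma = np.where(labels[:, None] == labels[None, :], alpha_prime, beta_prime)
    np.fill_diagonal(gamma, 0.0)                   # pairs (i, i) are excluded
    R = gamma * Q
    return R / R.sum()
```

With alpha_prime = 1 (and hence beta_prime = 1) this reduces to plain t-SNE, mirroring the remark above.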

2.3 Optimization

The objective function (Eq. (5)) is non-convex w.r.t. the embedding $\mathbf{Y}$. Even so, we find that optimizing the objective function using gradient descent with random restarts works well in practice. The gradient of the objective function w.r.t. the embedding $y_i$ of a point reads (a detailed derivation is given in Appendix A):

$$\frac{\partial\, KL(P\,\|\,R)}{\partial y_i} = 4\sum_{j\neq i}\,(p_{ij} - r_{ij})\, w_{ij}\,(y_i - y_j),$$

where $w_{ij} \triangleq (1+\|y_i-y_j\|^2)^{-1}$ and $r_{ij}$ is the conditional probability of Eq. (3). The gradient can be decomposed into attraction and repelling forces between points in the embedding space. Thus the underlying problem of ct-SNE, just like that of many other force-based embedding methods, is related to the classic $n$-body problem in physics (https://en.wikipedia.org/wiki/N-body_problem#Other_n-body_problems), which has also been studied in the recent machine learning literature (Gray & Moore, 2001; Ram et al., 2009). The general goal of the $n$-body problem is to find a constellation of objects such that equilibrium is achieved according to a certain measure (e.g., forces, energy). In the problem setting of ct-SNE, both the pairwise distances between points and the label information affect the attraction and repelling forces. In particular, the label information strengthens the repelling force between two points if they have the same label (assuming $\alpha' > \beta'$) and weakens it if they have different labels. This is desirable behavior, because we do not want the known label information to be reflected in the resulting embedding.
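Under the same reconstruction, a direct $\mathcal{O}(n^2)$ computation of this gradient could look as follows; this is an illustrative sketch, not the released implementation (which uses the tree-based approximation described next):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ct_sne_gradient(Y, P, labels, alpha_prime, beta_prime):
    """Exact gradient 4 * sum_j (p_ij - r_ij) * w_ij * (y_i - y_j) for every point i.

    P is the symmetric matrix of high-dimensional probabilities p_ij with zero diagonal.
    """
    labels = np.asarray(labels)
    W = 1.0 / (1.0 + squareform(pdist(Y, "sqeuclidean")))   # w_ij = (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(W, 0.0)
    gamma = np.where(labels[:, None] == labels[None, :], alpha_prime, beta_prime)
    R = gamma * W
    R /= R.sum()                                            # conditional probabilities r_ij (Eq. 3)
    M = (P - R) * W
    # grad_i = 4 * sum_j M_ij * (y_i - y_j)
    return 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)
```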

Evaluating the gradient exactly has complexity $\mathcal{O}(n^2)$, which makes the computation (in both time and memory) infeasible when $n$ is large. To approximate the gradient computation, we adapt the tree-based approximation strategy described by van der Maaten (2014). To efficiently model the proximity in the high-dimensional space (Eq. (1)) we use a vantage-point tree based algorithm (which exploits the fast diminishing property of the Gaussian distribution). To approximate the low-dimensional proximity (Eq. (3)) we modify the Barnes-Hut algorithm to incorporate the label information. The basic idea of the Barnes-Hut algorithm is to organize the points in the embedding space using a kd-tree (which for 2-d embeddings is equivalent to a quadtree). Each node of the tree corresponds to a cell (a dissection of the embedding space). If a target point $y_i$ is far away from all the points in a given cell, then the interaction between the target point and the points within the cell can be summarized by the interaction between $y_i$ and the cell's center of mass $y_{\text{cell}}$, which is computed while constructing the kd-tree. More specifically, the summarization happens when $r_{\text{cell}}/\|y_i - y_{\text{cell}}\| < \theta$, where $r_{\text{cell}}$ is the radius of the cell and $\theta$ controls the strength of the summarization, i.e. the approximation strength. The summarized repelling force in t-SNE reads $N_{\text{cell}}\, w_{i,\text{cell}}^2\,(y_i - y_{\text{cell}})$ (up to the global normalization), where $N_{\text{cell}}$ is the number of data points in that cell and $w_{i,\text{cell}} \triangleq (1+\|y_i - y_{\text{cell}}\|^2)^{-1}$.

In the ct-SNE approximation, we had to overcome an additional complication: when the summarization happens, we also need to summarize the label information of the points in the cell. This can be done by maintaining a histogram in each cell, counting the number of data points with each label value that fall into that cell. The repelling force exerted on a target point by the cell can then be weighted according to the number of points within the cell that have the same (or a different) label as that point. Namely, the summarized repelling force becomes

$$\big(\alpha'\, N^{=}_{\text{cell}} + \beta'\,(N_{\text{cell}} - N^{=}_{\text{cell}})\big)\, w_{i,\text{cell}}^2\,(y_i - y_{\text{cell}})$$

(again up to the global normalization), where $N^{=}_{\text{cell}}$ is the number of data points in the cell that have the same label as point $i$.

As both tree-based approximation schemes have complexity $\mathcal{O}(n\log n)$, counting the labels adds an extra multiplicative constant $k$, equal to the number of label values in the prior information. Thus the final complexity of approximated ct-SNE is $\mathcal{O}(kn\log n)$.
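A sketch of the per-cell summarization with a label histogram, under the reconstruction above (tree construction and the cell-radius test are omitted; all names are illustrative):

```python
import numpy as np

def cell_repulsion(y_i, label_i, cell_com, cell_hist, alpha_prime, beta_prime):
    """Summarised repulsive contribution of one far-away cell on point y_i.

    cell_com  : centre of mass of the cell
    cell_hist : dict mapping a label value to the number of points with that label in the cell
    The returned vector still has to be divided by the global normalisation
    (the sum of gamma_kl * w_kl over all pairs), as in the exact gradient.
    """
    w = 1.0 / (1.0 + np.sum((y_i - cell_com) ** 2))        # w_{i,cell}
    n_cell = sum(cell_hist.values())
    n_same = cell_hist.get(label_i, 0)
    weight = alpha_prime * n_same + beta_prime * (n_cell - n_same)
    return weight * w**2 * (y_i - cell_com)
```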

3 Experiments

The experiments investigate four questions: Q1 Does ct-SNE work as expected in finding complementary structure? Q2 How should $\alpha'$ (or equivalently, $\beta'$) be chosen? Q3 Could ct-SNE's goal also be achieved using (a combination of) other methods? Q4 How well does ct-SNE scale? Two case studies addressing Q1 are presented in Sections 3.1–3.3. Two more case studies addressing Q1, as well as the experiments addressing Q2–Q4, are summarized in Sec. 3.4 and described in detail in Appendix C.

3.1 Datasets used, and experimental settings

The first dataset used in the main paper is a Synthetic dataset consisting of 1,000 ten-dimensional data points, as explained in Section 1. The second dataset in the main paper is a Facebook dataset consisting of 128-dimensional embeddings of a de-identified random sample of Facebook users in the US. This embedding is generated based purely on the list of pages and groups that the users follow, as part of an effort to improve the quality of several recommendation systems at Facebook.

To study Q1, both qualitative and quantitative experiments were performed on the synthetic dataset. On the Facebook dataset we only conducted a qualitative evaluation (given the lack of ground truth).

Qualitative experiment. We qualitatively evaluate the effectiveness of ct-SNE through visualizations. More specifically, we compare the t-SNE visualization of a dataset with the ct-SNE visualization that has taken into account certain prior information that is visually identifiable from the t-SNE embedding. Thus by inspecting the presence of the prior information in the ct-SNE embedding and comparing to the t-SNE embedding, we can evaluate whether the prior information is removed. Conversely, we test whether information present in the ct-SNE embedding could have been identified from the t-SNE embedding to verify whether it indeed contains complementary information.

To select the prior information, we first visualize the t-SNE embedding and manually select points that are clustered in the visualization. Then we perform a feature ranking procedure to identify the features that separate the selected points from the rest. This is done by fitting a linear classifier (logistic regression) on the selected cluster against all other data points. By inspecting the weights of the classifier, we can identify the features that contribute the most to the classification. Repeating this feature ranking procedure for other clusters, we aim to find a feature that correlates with the majority of the clusters in the t-SNE visualization. This feature is then treated as prior information and provided as input to ct-SNE. In the reported experiments, the most prominent feature was always categorical, so all points with the same value were treated as being in one cluster to define the prior. We apply exact ct-SNE on the Synthetic dataset and approximated ct-SNE on the Facebook dataset.
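A sketch of the feature-ranking step described above (scikit-learn is used for illustration; the exact classifier settings used in the paper are not specified here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def rank_features(X, selected_idx):
    """Fit a selected-cluster-vs-rest logistic regression and rank features by |weight|."""
    y = np.zeros(X.shape[0], dtype=int)
    y[selected_idx] = 1                                   # manually selected cluster
    Xs = StandardScaler().fit_transform(X)                # put features on a comparable scale
    clf = LogisticRegression(max_iter=1000).fit(Xs, y)
    order = np.argsort(-np.abs(clf.coef_[0]))             # most discriminative features first
    return order, clf.coef_[0][order]
```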

We also evaluated whether ct-SNE can continuously provide new insights, by repeatedly applying the cluster selection and feature ranking procedure on ct-SNE embeddings.

Quantitative experiment. In this experiment, we quantify the presence of certain prior information in a ct-SNE embedding that took that same prior information as input. For example, if ct-SNE encodes the prior information using labels, a strong presence of the prior information is equivalent to high homogeneity of the encoded labels in the embedding, i.e., points that are close to each other in the embedding often have the same label. To quantify such homogeneity, we developed a measure termed the normalized Laplacian score, defined as follows. Given an embedding $\mathbf{Y}$ and a parameter $k$, we denote by $A$ the adjacency matrix of the $k$-nearest-neighbor graph computed from the embedding. The Laplacian matrix of the kNN graph then has the form $L = D - A$, where $D_{ii} = \sum_j A_{ij}$. We further normalize the Laplacian matrix ($\tilde{L} = D^{-1/2} L D^{-1/2}$) to obtain a score that is insensitive to the node degrees. Given a label vector $l$ with $m$ distinct values, where each label $c$ has $n_c$ points, and denoting the one-hot encoding of each label $c$ as $f_c \in \{0,1\}^n$, the normalized Laplacian score can be computed as:

$$\text{score}(l) = \sum_{c=1}^{m} \frac{f_c^\top \tilde{L}\, f_c}{f_c^\top f_c}. \qquad (6)$$

This score is essentially the pairwise difference (in terms of labels) between the data points that are connected according to the kNN graph. If a label is locally consistent (homogeneous) in an embedding, the feature difference among the kNN graph neighborhood is small, which results in a small Laplacian score. Conversely, a less homogeneous label over the kNN graph would have a large Laplacian score. Thus, if ct-SNE removes certain prior information from its embedding, then the embedding should have a large Laplacian score on the labels that encode the prior information.
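A sketch of this score, following the reconstruction of Eq. (6) above (the kNN graph is symmetrized here, which is an assumption; scikit-learn is used for the graph construction):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def normalized_laplacian_score(Y, labels, k=20):
    """Homogeneity of `labels` over the kNN graph of embedding Y (higher = less homogeneous)."""
    A = kneighbors_graph(Y, n_neighbors=k, mode="connectivity").toarray()
    A = np.maximum(A, A.T)                                # symmetrise the kNN graph
    d = A.sum(axis=1)
    L = np.diag(d) - A                                    # unnormalised graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = D_inv_sqrt @ L @ D_inv_sqrt                  # degree-normalised Laplacian
    labels = np.asarray(labels)
    score = 0.0
    for c in np.unique(labels):
        f = (labels == c).astype(float)                   # one-hot indicator of label c
        score += f @ L_norm @ f / (f @ f)
    return score
```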

Figure 2: The homogeneity of cluster labels in the t-SNE and several ct-SNE embeddings of the synthetic dataset, for a range of values of $k$ (a parameter of the Laplacian score), for the three label sets: (a) the clustering in dimensions 1–4, (b) the clustering in dimensions 5–6, and (c) the combined clustering. Colored lines give the scores for the different embeddings: t-SNE (blue), ct-SNE with the dimension 1–4 clustering as prior (orange), ct-SNE with the dimension 5–6 clustering as prior (green), and ct-SNE with the combined clustering as prior (red).

3.2 Case study: Synthetic dataset

Qualitative experiment. The t-SNE visualization of the synthetic dataset shows five large clusters (Fig. 1a). Feature ranking (Sec. 3.1) shows these clusters correspond to the clustering in dimensions 1–4 of the data. Taking the cluster labels in dimensions 1–4 as prior, ct-SNE gives a different visualization (Fig. 1b). Feature ranking further shows that the ct-SNE embedding indeed reveals the clusters in dimensions 5–6 of the data. We then combine the two label sets by assigning a new label to each combination of a dimension 1–4 label and a dimension 5–6 label. ct-SNE with this combined prior yields an embedding based only on the remaining noise (Fig. 1c).

Quantitative experiment. We computed the normalized Laplacian scores (Eq. (6)) of the t-SNE and several ct-SNE embeddings. The subfigures of Fig. 2a–c give the Laplacian score for the three label sets: the dimension 1–4 clustering, the dimension 5–6 clustering, and the combined clustering. Fig. 2a shows that the dimension 1–4 labels are less homogeneous (higher Laplacian score) in the ct-SNE embeddings that use the dimension 1–4 clustering or the combined clustering as prior than in the t-SNE embedding, indicating that ct-SNE effectively discounted the prior from those embeddings. Both the t-SNE embedding and ct-SNE with the dimension 5–6 clustering as prior clearly pick up the dimension 1–4 clusters, as indicated by the very low Laplacian score. Similarly, Figures 2b,c show that ct-SNE effectively removes the prior information for the dimension 5–6 and combined label sets, respectively, given the associated priors.

Figure 3: Visualization of 2-d embeddings of the Facebook dataset. Left column: t-SNE embedding, right column: ct-SNE embedding with region as prior. The two rows show identical embeddings but with different cluster markings (colors). See Section 3.3 for further info.

3.3 Case study: Facebook dataset

Qualitative experiment. Applying t-SNE on the Facebook dataset gives a visualization with many visually salient clusters (Fig. 3a). Computing the feature ranking for classification of selected clusters shows that the geography (i.e., the states) contributes to the embedding the most. This is further confirmed by coloring the data points according to the geographical region in the visualization as shown in Fig. 3a: most of the clusters are indeed quite homogeneous with respect to geography.

To understand the effect of an embedding like this on a downstream recommendation system, an analyst would want to know what type of user interests the embedding is capturing. For this, the regional clusters are not very informative. To alleviate this, we can encode the region as prior for ct-SNE so that other interesting structures can emerge in the visualization. Using the same coloring scheme, ct-SNE shows a cluster with large mass that consists of users from different states (Fig. 3b). There are also a few small clusters with mixed colors scattered along the periphery of the visualization. The visualization indicates that geographical information is mostly removed in the ct-SNE embedding. This is further confirmed by selecting clusters (highlighted in red) in the ct-SNE embedding (Fig. 3d) and highlighting the same set of points in the t-SNE embedding (Fig. 3c). The cluster highlighted in the ct-SNE embedding spreads over the t-SNE embedding, indicating these users are not geographically similar. Indeed, feature ranking (Sec. 3.1) indicates that the selected group of users (Fig. 3d) share an interest in horse riding: they tend to follow several pages related to that topic. Interestingly, we noticed that some of the clusters in the ct-SNE embedding are also clustered in the t-SNE embedding. These clusters are indeed not homogeneous in terms of geographical regions. For example, the cluster highlighted in blue in the ct-SNE embedding (Fig. 3d) also exists in the t-SNE embedding (Fig. 3c). Using feature ranking as above, we found that these clusters are homogeneous not in terms of geography but in terms of the users' interest in Indian culture. While these clusters can thus also be seen in the t-SNE embedding, ct-SNE removes the irrelevant (region) cluster structure, such that those other clusters become more salient and easier to observe.

3.4 Summary of additional experimental findings

Two other case studies (App. C.2–C.3), on the UCI Adult dataset (Dheeru & Karra Taniskidou, 2017) and a DBLP citation network dataset (Tang et al., 2008), confirm the ability of ct-SNE visualizations to reveal insightful clusters after conditioning on prior information that dominates the t-SNE visualizations (Q1). In Appendix C.4 we also analyze the sensitivity of the ct-SNE embedding with respect to the hyperparameter $\alpha'$ (or $\beta'$) (Q2). By varying the hyperparameter, we found that ct-SNE yields low-dimensional embeddings that better approximate the original data than t-SNE (i.e., with smaller KL-divergence). The analysis also shows that using a small $\beta'$ is a good rule of thumb when using ct-SNE for visualization. To answer Q3, we compared ct-SNE to two non-trivial baselines that remove the known factors from the high-dimensional data using either an adversarial auto-encoder (AAE; Makhzani et al., 2015) or canonical correlation analysis (CCA; Hotelling, 1936) and then apply t-SNE for visualization (App. C.5). We show that these baselines are either difficult to tune (the AAE-based baseline) or have limited applicability (the CCA-based baseline), while ct-SNE has essentially only one parameter to tune and does not suffer from the limitations of the CCA baseline. Finally, we conducted a runtime experiment (App. C.6) showing that approximated ct-SNE can efficiently embed large, high-dimensional data without substantial quality loss (Q4).

4 Related Work

Many dimensionality reduction methods have been proposed in the literature. Arguably, $n$-body problem based methods such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018) appear to be the most popular ones. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, and (3) a loss function comparing the proximities between data points in the embedding space with those in the input space. ct-SNE belongs to this class of DR methods. It accepts both high-dimensional data and priors about the data as inputs, and searches for low-dimensional embeddings while discounting structure in the input data specified as prior knowledge.

As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Closely related to ct-SNE, there is a group of interactive DR methods that adjust the algorithms according to a user's inputs (e.g., Kang et al., 2016; Puolamäki et al., 2018; Díaz et al., 2014; Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that the user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user specifies what is irrelevant to the user. (For an extended discussion of related work, please refer to Appendix D.)

5 Conclusion

We introduced conditional t-SNE to efficiently discover new insights from high-dimensional data. ct-SNE finds a lower-dimensional representation of the data in a non-linear fashion while removing known factors. Extensive case studies on both synthetic and real-world datasets demonstrate that ct-SNE can effectively remove known factors from low-dimensional representations, allowing new structure to emerge and providing new insights to the analyst. A tree-based optimization method allows ct-SNE to scale to high-dimensional datasets with hundreds of thousands of data points.


6 Acknowledgement

The research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement no. 615517, from the FWO (project no. G091017N, G0F9816N), from the European Union’s Horizon 2020 research and innovation programme and the FWO under the Marie Sklodowska-Curie Grant Agreement no. 665501, and from the EPSRC (SPHERE EP/R005273/1). We thank Laurens van der Maaten for helpful discussions.

References

  • Alipanahi & Ghodsi (2011) Alipanahi, B. and Ghodsi, A. Guided locally linear embedding. PRL, 32(7):1029–1035, 2011.
  • Barshan et al. (2011) Barshan, E., Ghodsi, A., Azimifar, Z., and Zolghadri Jahromi, M. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds. PR, 44(7):1357–1371, 2011.
  • Cavallo & Demiralp (2018) Cavallo, M. and Demiralp, Ç. A visual interaction framework for dimensionality reduction based data exploration. In CHI, pp. 635, 2018.
  • Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017.
  • Díaz et al. (2014) Díaz, I., Cuadrado, A. A., Pérez, D., García, F. J., and Verleysen, M. Interactive dimensionality reduction for visual analytics. In ESANN, pp. 183–188, 2014.
  • Edwards & Storkey (2015) Edwards, H. and Storkey, A. Censoring representations with an adversary. arXiv:1511.05897, 2015.
  • Faust et al. (2019) Faust, R., Glickenstein, D., and Scheidegger, C. Dimreader: Axis lines that explain non-linear projections. TVCG, 25(1):481–490, 2019.
  • Gray & Moore (2001) Gray, A. G. and Moore, A. W. ‘N-body’ problems in statistical learning. In NeurIPS, pp. 521–527, 2001.
  • Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
  • Hotelling (1936) Hotelling, H. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
  • Jeong et al. (2009) Jeong, D. H., Ziemkiewicz, C., Fisher, B., Ribarsky, W., and Chang, R. ipca: An interactive system for pca-based visual analytics. In CGF, volume 28, pp. 767–774, 2009.
  • Kang et al. (2016) Kang, B., Lijffijt, J., Santos-Rodríguez, R., and De Bie, T. Subjectively interesting component analysis: data projections that contrast with prior expectations. In KDD, pp. 1615–1624, 2016.
  • Madras et al. (2018) Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. arXiv:1802.06309, 2018.
  • Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv:1511.05644, 2015.
  • McInnes & Healy (2018) McInnes, L. and Healy, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018.
  • Paurat & Gärtner (2013) Paurat, D. and Gärtner, T. Invis: A tool for interactive visual data analysis. In ECML-PKDD, pp. 672–676, 2013.
  • Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
  • Pezzotti et al. (2017) Pezzotti, N., Lelieveldt, B. P., van der Maaten, L., Höllt, T., Eisemann, E., and Vilanova, A. Approximated and user steerable tsne for progressive visual analytics. TVCG, 23(7):1739–1752, 2017.
  • Puolamäki et al. (2018) Puolamäki, K., Oikarinen, E., Kang, B., Lijffijt, J., and De Bie, T. Interactive visual data exploration with subjective feedback: An information-theoretic approach. In ICDE, pp. 1208–1211, 2018.
  • Ram et al. (2009) Ram, P., Lee, D., March, W., and Gray, A. G. Linear-time algorithms for pairwise statistical problems. In NeurIPS, pp. 1527–1535, 2009.
  • Stahnke et al. (2016) Stahnke, J., Dörk, M., Müller, B., and Thom, A. Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. TVCG, 22(1):629–638, 2016.
  • Tang et al. (2008) Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z. Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. ACM, 2008.
  • Tang et al. (2016) Tang, J., Liu, J., Zhang, M., and Mei, Q. Visualizing large-scale and high-dimensional data. In WWW, pp. 287–297, 2016.
  • Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
  • Torgerson (1952) Torgerson, W. S. Multidimensional scaling: I. theory and method. Psychometrika, 17(4):401–419, 1952.
  • van der Maaten (2014) van der Maaten, L. Accelerating t-sne using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
  • van der Maaten & Hinton (2008) van der Maaten, L. and Hinton, G. Visualizing data using t-sne. JMLR, 9(Nov):2579–2605, 2008.
  • van der Maaten & Hinton (2012) van der Maaten, L. and Hinton, G. Visualizing non-metric similarities in multiple maps. MLJ, 87(1):33–55, 2012.

Appendix A Detailed derivation of the gradient of the ct-SNE objective function

Here we derive in detail the gradient of the ct-SNE objective function. Denote the Euclidean distance between the embedded points $y_i$ and $y_j$ as $d_{ij} \triangleq \|y_i - y_j\|$, and let $w_{ij} \triangleq (1+d_{ij}^2)^{-1}$. The derivative of $w_{ij}$ with respect to the embedding $y_i$ reads:

$$\frac{\partial w_{ij}}{\partial y_i} = -2\, w_{ij}^2\, (y_i - y_j).$$

Denote the cost (KL-divergence) by $C$:

$$C = KL(P\,\|\,R) = \sum_{i\neq j} p_{ij}\log p_{ij} - \sum_{i\neq j} p_{ij}\log r_{ij},$$

where

$$r_{ij} = \frac{\gamma_{ij}\, w_{ij}}{Z}, \qquad \gamma_{ij} \triangleq \alpha'^{\delta_{ij}}\beta'^{1-\delta_{ij}},$$

and

$$Z \triangleq \sum_{k\neq l} \gamma_{kl}\, w_{kl}.$$

Since $\sum_{i\neq j} p_{ij} = 1$ and $\gamma_{ij}$ does not depend on the embedding, the cost can be written as $C = \text{const} - \sum_{i\neq j} p_{ij}\log w_{ij} + \log Z$. Following the derivation from the t-SNE paper, the derivative of the first (attractive) part with respect to $y_i$ reads:

$$\frac{\partial}{\partial y_i}\Big(-\sum_{k\neq l} p_{kl}\log w_{kl}\Big) = -\sum_{j\neq i}\,(p_{ij}+p_{ji})\,\frac{1}{w_{ij}}\,\frac{\partial w_{ij}}{\partial y_i} = 4\sum_{j\neq i} p_{ij}\, w_{ij}\,(y_i - y_j),$$

where we used the symmetry $p_{ij} = p_{ji}$. To compute the derivative of the second (repulsive) part with respect to $y_i$, we first have:

$$\frac{\partial Z}{\partial y_i} = \sum_{j\neq i}\,(\gamma_{ij}+\gamma_{ji})\,\frac{\partial w_{ij}}{\partial y_i} = -4\sum_{j\neq i} \gamma_{ij}\, w_{ij}^2\,(y_i - y_j).$$

Thus the derivative of $\log Z$ with respect to $y_i$ can be computed as:

$$\frac{\partial \log Z}{\partial y_i} = \frac{1}{Z}\frac{\partial Z}{\partial y_i} = -4\sum_{j\neq i} r_{ij}\, w_{ij}\,(y_i - y_j).$$

Finally, we have the derivative:

$$\frac{\partial C}{\partial y_i} = 4\sum_{j\neq i}\,(p_{ij} - r_{ij})\, w_{ij}\,(y_i - y_j).$$
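As a sanity check on this (reconstructed) derivation, the analytic gradient can be compared against a central finite-difference estimate of the objective; a minimal, illustrative sketch:

```python
import numpy as np

def _r_matrix(Y, labels, alpha_prime, beta_prime):
    """Conditional low-dimensional probabilities r_ij (Eq. (3)) and kernel values w_ij."""
    W = 1.0 / (1.0 + np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    np.fill_diagonal(W, 0.0)
    gamma = np.where(labels[:, None] == labels[None, :], alpha_prime, beta_prime)
    R = gamma * W
    return R / R.sum(), W

def objective(Y, P, labels, a, b):
    """KL(P || R); P is the symmetric p_ij matrix with zero diagonal."""
    R, _ = _r_matrix(Y, labels, a, b)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / R[mask]))

def gradient(Y, P, labels, a, b):
    """Analytic gradient: 4 * sum_j (p_ij - r_ij) * w_ij * (y_i - y_j)."""
    R, W = _r_matrix(Y, labels, a, b)
    M = (P - R) * W
    return 4.0 * (M.sum(axis=1)[:, None] * Y - M @ Y)

def check_gradient(Y, P, labels, a, b, eps=1e-6):
    """Largest deviation between the analytic and a central finite-difference gradient."""
    G, G_num = gradient(Y, P, labels, a, b), np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Yp, Ym = Y.copy(), Y.copy()
        Yp[idx] += eps
        Ym[idx] -= eps
        G_num[idx] = (objective(Yp, P, labels, a, b) - objective(Ym, P, labels, a, b)) / (2 * eps)
    return np.max(np.abs(G - G_num))
```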

Appendix B On generalizing the idea of ct-SNE

The idea of removing known factors from low-dimensional representations can be generalized to other $n$-body problem based DR methods. Oftentimes, the gradient of such methods can be viewed as a sum of attraction and repelling forces. Removing a known factor thus amounts to re-weighting the attracting and repelling forces such that points that have the same label repel each other and points with different labels attract each other. For example, LargeVis (Tang et al., 2016) differs from t-SNE by modeling the input space proximity using an approximate kNN graph. Thus we can use the same conditioning idea as in ct-SNE to remove known factors in LargeVis. For Uniform Manifold Approximation and Projection (UMAP; McInnes & Healy, 2018), however, conditioning is not readily applicable. In contrast to t-SNE, UMAP uses fuzzy sets to model the proximity in both the input space and the embedding space; the cross entropy between the two fuzzy sets then serves as the loss function comparing the modeled proximities of the input space and the embedding space. In the UMAP setting it is not straightforward to condition the low-dimensional proximity model on the prior, but we can still directly re-weight the repelling forces: for data points with the same label, the pushing effect is strengthened by multiplying with $\alpha'$; for points with different labels, the pushing effect is weakened by multiplying with $\beta'$, under the assumption $\alpha' > \beta'$. However, without proper conditioning, the parameters $\alpha'$ and $\beta'$ lose their probabilistic interpretation and with it their one-to-one correspondence (as in ct-SNE), so both parameters need to be set.

Appendix C Extended experiments

C.1 Datasets

In this section, we introduce two additional datasets:

UCI Adult dataset. We sampled 1000 data points from the UCI adult dataset (Dheeru & Karra Taniskidou, 2017) with six attributes: the three numeric attributes age, education level, and work hours per week, and the three binary attributes ethnicity (white/other), gender, and income (>50k).

DBLP dataset. We extracted all papers from 20 venues (NIPS, ICLR, ICML, AAAI, IJCAI, KDD, ECML-PKDD, ICDM, SDM, WSDM, PAKDD, VLDB, SIGMOD, ICDT, ICDE, PODS, SIGIR, WWW, CIKM, ECIR) in four areas (ML/DM/DB/IR) of computer science from the DBLP citation network dataset (Tang et al., 2008). We sampled half of the papers and constructed a network (43,346 nodes, consisting of paper, author, topic, and venue nodes) based on paper-author, paper-topic, and paper-venue relations. Finally, we embedded the network into a 64-dimensional Euclidean space using node2vec (Grover & Leskovec, 2016) with a fixed walk length and window size. In our experiment, both the return parameter $p$ and the in-out parameter $q$ are set to 1; under this setting, node2vec is equivalent to DeepWalk (Perozzi et al., 2014).

C.2 Case study: UCI Adult dataset

Figure 4: Visualization of 2-d embeddings of the UCI Adult dataset. Points are visually encoded according to their attributes. Gender: female (orange), male (blue); ethnicity: white (circle), other (triangle); income (>50k): true (unfilled marker), false (filled marker). (a) The t-SNE embedding shows clusters that are grouped according to the combinations of all three attributes. (b) With the attribute gender as prior, the ct-SNE embedding shows four clusters, each a mixture of points with different genders, indicating the gender information is removed. (c) With the attribute ethnicity as prior, the ct-SNE embedding also shows four clusters, but each is a mixture of points with different ethnicities. (d) Incorporating the combination of the attributes gender and ethnicity as prior, the resulting ct-SNE embedding shows two clusters that correlate with the remaining attribute, income (>50k).
Figure 5: The homogeneity of cluster labels in the t-SNE and several ct-SNE embeddings of the UCI Adult dataset for a range of values of $k$ (a parameter of the Laplacian score). Colored lines correspond to scores for different embeddings: t-SNE (blue), ct-SNE with prior gender (orange), ct-SNE with prior ethnicity (green), and ct-SNE with prior ethnicity & gender (red). Subfigures give homogeneity scores for various labels: (a) gender, (b) ethnicity, (c) gender & ethnicity. (a) The attribute gender has lower homogeneity (higher Laplacian score) in the ct-SNE embeddings with gender or ethnicity & gender as prior than in the t-SNE embedding and the ct-SNE embedding with ethnicity as prior. (b) The attribute ethnicity has lower homogeneity in the ct-SNE embeddings with ethnicity or ethnicity & gender as prior than in the t-SNE and ct-SNE-with-gender-prior embeddings. (c) The attribute ethnicity & gender has high homogeneity in the t-SNE embedding only.

Qualitative experiment. Fig. 4a shows t-SNE gives an embedding that consists of clusters grouped according to combinations of three attributes: gender, ethnicity and income (>50k). By incorporating the attribute gender as prior, the ct-SNE embedding (Fig. 4b) contains clusters with a mixture of male and female points, indicating the gender information is removed. Instead, by incorporating the attribute ethnicity the ct-SNE embedding (Fig. 4c) contains clusters with a mixture of ethnicities. Finally, incorporating the combination of attributes gender and ethnicity as prior, the ct-SNE embedding contains data points grouped according to income (Fig. 4d).

Quantitative experiment. We analyzed the homogeneities (Laplacian scores) of the attributes gender, ethnicity, and income (>50k), measured on both the t-SNE and ct-SNE embeddings. Fig. 5a shows that ct-SNE with prior gender removes the gender factor from the resulting embedding, while ct-SNE with prior ethnicity makes the gender factor in the resulting embedding clearer. Similarly, Figures 5b,c show that ct-SNE removes the prior information effectively for the labels ethnicity and ethnicity & gender, respectively, given the associated priors.

C.3 Case study: DBLP dataset

Figure 6: Visualization of 2-d embeddings of the DBLP dataset. Left column: t-SNE embedding; right column: ct-SNE embedding with area as prior. The rows contain different cluster markings. (a) The t-SNE embedding shows a clustering according to four areas of computer science (red: machine learning, green: data mining, blue: databases, orange: information retrieval). (b) The ct-SNE embedding shows a different clustering, with the area information removed. (d) Newly emerged visual clusters in the ct-SNE embedding (magenta: topic 'privacy', dark green: topic 'data stream', orange: topic 'computer vision') spread over the t-SNE embedding (c). (d) Clusters that stand out in the ct-SNE embedding (grass green: topic 'clustering', purple: topic 'active learning') also exist in the t-SNE embedding (c). These are a few out of many clusters that we found to exhibit a much more informative, interest-centric structure than the t-SNE projection.

Qualitative experiment. Applying t-SNE on the DBLP dataset gives a visualization with many visual clusters (Fig. 6a). Feature ranking for classification of the selected clusters shows the topics that contribute the most to the visualization. Moreover, we used mpld3 (https://mpld3.github.io, an interactive visualization library) to inspect the metadata of the t-SNE plot (i.e., by hovering over data points and checking tooltips). Upon inspection, the visualization appears to be globally divided according to the four areas. This is further confirmed by coloring the data points according to the four areas: most of the clusters are indeed quite homogeneous with respect to the areas.

Knowing from the t-SNE visualization that the papers are indeed divided according to areas, the area structure in the visualization is not very informative anymore. Thus we encode the area as prior for ct-SNE so that other interesting structure can emerge. Using the same color scheme, ct-SNE shows a visualization with many clusters of mixed colors (Fig. 6b). This indicates the area information is mostly removed in the ct-SNE embedding. This is further confirmed by selecting clusters in the ct-SNE embedding (Fig. 6d) and highlighting the same set of points in the t-SNE embedding (Fig. 6c). The clusters highlighted in the ct-SNE visualization often consist of clusters (topics) from different areas (i.e., t-SNE clusters with different colors) that spread over the t-SNE visualization. Indeed, feature ranking indicates that papers in a selected ct-SNE cluster share similar topics, e.g., 'privacy', 'data stream', or 'computer vision'. Finally, we noticed that some clusters in the ct-SNE embedding (Fig. 6d) also exist in the t-SNE embedding (Fig. 6c). Using feature ranking as above, we found that these clusters are homogeneous not in terms of area of study but in terms of topics (e.g., 'clustering', 'active learning'), indicating a tightly connected research community behind each topic. Thus, by removing the irrelevant area structure using ct-SNE, clusters that persist in both visualizations become more salient and easier to observe.

C.4 Parameter sensitivity

Figure 7: Visualizing the effect that different values of $\beta'$ (and the corresponding $\alpha'$) have on the ct-SNE embeddings. The embeddings are computed on the synthetic dataset with the cluster labels in dimensions 1–4 as prior information. (a) The values of the ct-SNE objective (green), t-SNE objective (blue), and ct-SNE prior term (orange) against different $\beta'$ values. ct-SNE achieves a smaller KL-divergence than t-SNE. (b) The ct-SNE embedding with the smallest KL-divergence is not the best visualization. (c) A ct-SNE embedding with a stronger prior-removal effect (smaller $\beta'$) gives a better visualization.

To understand the effect of the parameter $\beta'$ (or equivalently, $\alpha'$) on ct-SNE embeddings (Q2), we study ct-SNE embeddings of the synthetic dataset with the prior fixed to the cluster labels in dimensions 1–4. First, we try to understand the relation between the ct-SNE objective and the parameter $\beta'$ (or equivalently, $\alpha'$). We evaluated the ct-SNE objective (Eq. (5)) on the ct-SNE embeddings obtained by ranging $\beta'$ (and $\alpha'$ correspondingly) in regular steps from a small value (strong prior-removal effect) to 1 (no prior-removal effect, equivalent to t-SNE). We also evaluated the t-SNE objective (the first term in Eq. (5)) and the second term in Eq. (5) (the only prior-dependent term that varies with the embedding, subsequently referred to as the prior term) for the ct-SNE embeddings associated with the various $\beta'$ values.

Fig. 7a visualizes the values of the ct-SNE objective, t-SNE objective, and ct-SNE prior term against the different $\beta'$ values. Observe that by using a prior, the ct-SNE embedding achieves a better approximation of the higher-dimensional data. That is, ct-SNE achieves a lower KL-divergence than t-SNE does. This is because the prior term in the ct-SNE objective can be negative. Although the t-SNE objective increases when $\beta'$ decreases, this is compensated by the negative value contributed by the prior term. Indeed, by factoring out a certain prior from the low-dimensional embedding, the need for the embedding to represent the prior is alleviated, giving ct-SNE more freedom to approximate the high-dimensional proximities.

Interestingly, we observe that the embedding with the smallest KL-divergence does not necessarily give the best visualization (e.g., clear separation of the clusters). We visualize the ct-SNE embedding that achieves the smallest KL-divergence (Fig. 7b) and compare it with the ct-SNE embedding that has the strongest prior-removal effect but a larger KL-divergence (Fig. 7c). Although the embedding with the stronger prior-removal effect has a larger objective value, it gives a clearer clustering than the embedding with the smaller KL-divergence. As a result, the clusters in dimensions 5–6 are easier to identify. Hence, as a rule of thumb when using ct-SNE for visualization, we propose to use a small $\beta'$.

C.5 Baseline comparisons

In this section, we compare ct-SNE with two non-trivial baselines. The basic idea is to first remove the known factor from the dataset, and perform t-SNE to produce lower dimensional representations. Here we use a non-linear and a linear method to remove the known factors: adversarial auto-encoder (AAE) and canonical correlation analysis (CCA). The implementation of the baselines and code for comparison experiments are also available at https://bitbucket.org/ghentdatascience/ct-sne.

Baseline: AAE and t-SNE. An adversarial auto-encoder (AAE; Makhzani et al., 2015) can be used to learn a latent representation that prevents a discriminator from predicting certain attributes (Madras et al., 2018). In order to remove prior information from the low-dimensional representation of a dataset using an AAE, we can configure the discriminator to predict the prior attributes and use the auto-encoder to adversarially remove the prior from the latent representation of the dataset.

We adopt the AAE configuration described by Edwards & Storkey (2015). An AAE is in general difficult to tune: it has many hyperparameters (network structure parameters, weights in the objective, and learning rates) and a few design choices about the network architecture (e.g., the number of layers in each subnetwork and the activation functions). We tried different parameter settings and managed to remove the cluster label information in dimensions 1–4 (Fig. 8a) and 5–6 (Fig. 8b) from the data. In Figure 8a, the AAE approach manages to remove the prior information, but it fails to pick up the complementary structure in the data (the clusters in dimensions 5–6). It also fails to remove the prior information (the cluster labels in dimensions 1–6) in Figure 8c. Compared to this baseline, ct-SNE has essentially only one parameter ($\beta'$) to tune, which can often be set to a small positive number.

Figure 8: Visualization of 2-d embeddings obtained by applying the AAE-based approach on the synthetic dataset. The data points are colored according to the cluster labels in dimensions 1–4 and plotted with different markers based on the cluster labels in dimensions 5–6. (a) The AAE-based approach successfully removed the clustering information in dimensions 1–4, but failed to reveal the clusters in dimensions 5–6. (b) The AAE successfully removed the clustering information in dimensions 5–6 and also reveals the clusters in dimensions 1–4. (c) The AAE failed to remove the clustering information in dimensions 1–6.
Figure 9: Visualization of 2-d embeddings obtained by applying the CCA-based approaches and ct-SNE on a second synthetic dataset (described below). (a) Projecting the data onto the null space of the top CCA components and then applying t-SNE gives an embedding that picks up the large clusters (plotted with different markers) but fails to pick up the structure of the two small clusters (colored differently) within each large cluster. (b) Projecting the data onto the CCA components with the least correlation and then applying t-SNE also fails to pick up the two-cluster structure within the large clusters. (c) ct-SNE removes the large-cluster information from the embedding and clearly shows the two-cluster structure within each large cluster.

Baseline: CCA and t-SNE. Canonical correlation analysis (CCA; Hotelling, 1936) aims to find linear transformations of two random variables such that the correlation between the transformed variables is maximized. To remove the prior information from the data using CCA, one approach is to first find the subspace (of dimension at most $d$, where $d$ is the dimensionality of the data) in which the transformed data and the prior information (a one-hot encoding of the labels) have the largest correlation. The data is then whitened by projecting it onto the null space of the subspace found in the first step. By doing so, the whitened data is less correlated with the known factor.

Another variant of the CCA-based approach directly projects the data onto the subspace spanned by the CCA components in which the transformed data and the labels have the smallest correlation. To be consistent, we also apply t-SNE to the transformed data.
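A sketch of how the first CCA-based variant could be implemented with scikit-learn (one-hot encode the prior labels, remove the label-correlated CCA directions, then run t-SNE); this illustrates the baseline idea and is not the exact code used for the experiments:

```python
import numpy as np
from scipy.linalg import null_space
from sklearn.cross_decomposition import CCA
from sklearn.manifold import TSNE

def cca_nullspace_tsne(X, labels, n_components=None):
    """Remove the label-correlated subspace found by CCA, then embed with t-SNE."""
    labels = np.asarray(labels)
    onehot = (labels[:, None] == np.unique(labels)[None, :]).astype(float)
    # Number of CCA directions is bounded by the data dimensionality and the number of labels.
    n_components = n_components or min(X.shape[1] - 1, onehot.shape[1])
    cca = CCA(n_components=n_components).fit(X, onehot)
    basis = null_space(cca.x_weights_.T)                  # directions uncorrelated with the labels
    X_clean = (X - X.mean(axis=0)) @ basis                # project onto the null space
    return TSNE(n_components=2).fit_transform(X_clean)
```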

Our experimental results show that the CCA-based approaches can easily remove label information that is orthogonal to the other attributes in the data. For example, in the UCI Adult dataset, the gender information is orthogonal to ethnicity and income, and can thus easily be removed using the CCA approach. However, the CCA-based approach performs poorly when the known factor is correlated with other attributes. Moreover, the CCA-based approaches have the limitation that the number of projection vectors is upper-bounded by the dimensionality of the data. If the number of unique values of an attribute exceeds the dimensionality of the data, the CCA projection cannot remove the label information entirely from the data. To illustrate these points, we synthesized a higher-dimensional dataset with 1,000 data points. The data points are grouped into clusters, each corresponding to a multivariate Gaussian with a random location and small variance. Additionally, each cluster is separated into two smaller clusters (one containing part of the cluster's points and the other the rest) along one randomly chosen axis. Figures 9a,b show that both CCA approaches pick up only the large clusters (differentiated using marker shape) but fail to pick up the structure of the two small clusters (plotted in different colors) within each large cluster. ct-SNE, on the other hand, removes the large-cluster information from the embedding and shows that each large cluster can be further separated into two smaller clusters.

Thus, the CCA-based baselines perform poorly when the known factor is correlated with other attributes. Moreover, the number of the projection vectors in CCA-based baselines is upper-bounded by the dimensionality of the data. Meanwhile, ct-SNE does not have these limitations.

C.6 Runtime

We measure the runtime of exact ct-SNE and the approximated version on a PC with a quad-core Intel Core i5 and 2133 MHz LPDDR3 RAM. By default, the maximum number of iterations of the ct-SNE gradient update is 1,000. For larger datasets and prior attributes with many values, more iterations are required to achieve convergence. For example, the synthetic dataset (1,000 samples, 10 dimensions) requires fewer than 1,000 iterations to converge, while the Facebook dataset (500,000 examples, 128 dimensions) requires 3,000 iterations. Table 1 shows that approximated ct-SNE is efficient and applicable to large data with high dimensionality, while exact ct-SNE is not.

name        size      dim.   exact      approx.
Synthetic   1,000     10     0.06       0.01
UCI Adult   1,000     6      0.07       0.01
DBLP        43,346    64     503.97     0.45
Synthetic   500,000   128    100,278    9.1
Table 1: Average runtime (in seconds) of exact and approximated ct-SNE for computing one gradient update step. To measure the runtime of ct-SNE on a dataset of similar size to the Facebook dataset, we scaled the Synthetic dataset up to 500,000 data points with 128 dimensions.

Appendix D Extended related work

Many dimensionality reduction methods have been proposed in the literature. Arguably, $n$-body problem based methods (Section 2.3 provides more information on the $n$-body problem) such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018) appear to be the most popular ones. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, and (3) a loss function comparing the proximities between data points in the embedding space with those in the input space. When minimizing the loss over the embedding space, the data points (i.e., the bodies) have pairwise interactions and the embedding of all points needs to be updated simultaneously. Since the optimization problem is not convex, local minima are typically accepted as output. ct-SNE belongs to this class of DR methods. It accepts both high-dimensional data and priors about the data as inputs, and searches for low-dimensional embeddings while discounting structure in the input data specified as prior knowledge. Closely related, in the multi-maps t-SNE work (van der Maaten & Hinton, 2012) factors that are mutually exclusive are captured by multiple t-SNE embeddings at once. Compared to multi-maps t-SNE, ct-SNE allows users to disentangle information in a targeted (subjective) manner, by specifying which information they would like to have factored out.

As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Existing papers on interactive DR can be categorized into two groups. The first group aims to improve the explainability and computational efficiency of existing DR methods via novel visualizations and interactions. iPCA (Jeong et al., 2009) allows users to easily explore the PCA components and thus achieve a better understanding of the linear projections of the data onto different PCA components. Cavallo & Demiralp (2018) help the user understand low-dimensional representations by applying perturbations to probe the connection between the input attribute space and the embedding space. Similarly, Faust et al. (2019) introduce a method based on perturbations to visualize the effect of a specific input attribute on the embedding, while Stahnke et al. (2016) introduce 'probing' as a means to understand the meaning of point-set selections within the embedding. Steerable t-SNE (Pezzotti et al., 2017) aims to make t-SNE more scalable by quickly providing a sketch of an embedding, which is then refined only according to the user's interests.

The second group of interactive DR methods adjusts the algorithms according to a user's inputs. SICA (Kang et al., 2016) and SIDE (Puolamäki et al., 2018) explicitly model the user's belief state and find linear projections that contrast with it. These two methods are linear DR methods and thus cannot present non-linear structure in the low-dimensional representations. Work by Díaz et al. (2014) allows users to define their own metric in the input space, after which the low-dimensional representation reflects the adjusted importance of the attributes. This method puts the burden of directly manipulating the input-space metric on the user. Many variants of existing DR methods have been introduced where user feedback entails editing of the embedding, and such manually embedded points are used as constraints to guide the dimensionality reduction (e.g., Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that the user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user specifies what is irrelevant to the user.