1 Introduction
Dimensionality reduction (DR) methods can be used to create low-dimensional (typically 2-dimensional; 2-d) representations that are straightforward to visualize, and thereby to explore the dominant structure of high-dimensional datasets. Nonlinear DR methods are particularly powerful because they can capture complex structure even when it is spread over many dimensions. This explains the huge popularity of methods such as t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018).
However, DR methods yield a single static embedding, and the most prominent structure it reveals may already be known to the analyst. One may construct higher-dimensional embeddings, hoping to uncover more structure, but there is no guarantee that any of the constructed dimensions is complementary to the prior knowledge of the analyst. Besides, the salient structure that is already known may be spread across all attributes, so we cannot simply remove the associated attributes, and generally speaking it is not obvious how to visualize the remaining structure. The question thus arises: can we actively filter out, or discount, prior knowledge from the embedding?
To this end, we introduce conditional t-SNE (ct-SNE), a generalization of t-SNE that accounts for prior information about the data. By discounting the prior information, the embedding can focus on capturing complementary information. More concretely, ct-SNE does not aim to construct an embedding that reflects all the proximities in the original data (the objective of t-SNE); instead, the embedding should reflect the pairwise proximities conditioned on whether we expect each pair to be close or not.
ct-SNE enables at least three new ways to obtain insight into data:

The prior knowledge may indeed be available beforehand, in which case we can immediately focus the analysis on an embedding that is more useful.

Such prior knowledge may be gained during analysis, leading to an iterative data analysis process.

If we observe some specific structure X in an embedding and then factor out specific information Y, and X remains present in the embedding, we learn that X is complementary to Y.
Note that we use the term prior knowledge even when this knowledge is not available a priori but is gained during the analysis; it reflects that the knowledge is available prior to the embedding step.
Example.
To demonstrate the idea behind ct-SNE more concretely, consider a ten-dimensional dataset with 1,000 data points. In dimensions 1–4 the data points fall into five clusters (each following a multivariate Gaussian with small variance); similarly, in dimensions 5–6 the points fall randomly into four clusters. Dimensions 7–10 contain Gaussian noise with larger variance. Figure 1a gives the t-SNE embedding. It shows five large clusters, some of which can be split further into smaller clusters. The large clusters correspond to those defined in dimensions 1–4. Figure 1b is the ct-SNE embedding where we have input the five colored clusters as prior knowledge. This figure shows four clusters that are complementary to the five clusters observed in Figure 1a: there is no correlation between the colors and the clusters in Figure 1b, so the two structures are indeed complementary. These four clusters are those defined in dimensions 5–6. Finally, Figure 1c shows that after combining the labels, ct-SNE yields an embedding capturing only the remaining noise. The implementation of ct-SNE and code for the experiments on public data are available at https://bitbucket.org/ghentdatascience/ctsne.
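For readers who wish to reproduce this toy setup, a generator along these lines can be sketched as follows (the cluster-center scale and variances are illustrative choices of ours, not necessarily those used for Figure 1):

```python
import numpy as np

def make_synthetic(n=1000, seed=0):
    """Generate data like the running example: dims 0-3 hold five
    Gaussian clusters, dims 4-5 hold four independent clusters,
    and dims 6-9 contain higher-variance noise."""
    rng = np.random.default_rng(seed)
    la = rng.integers(0, 5, size=n)          # cluster labels for dims 0-3
    lb = rng.integers(0, 4, size=n)          # cluster labels for dims 4-5
    ca = rng.normal(0, 10, size=(5, 4))      # cluster centers, dims 0-3
    cb = rng.normal(0, 10, size=(4, 2))      # cluster centers, dims 4-5
    X = np.hstack([
        ca[la] + rng.normal(0, 0.5, size=(n, 4)),  # small within-cluster variance
        cb[lb] + rng.normal(0, 0.5, size=(n, 2)),
        rng.normal(0, 2.0, size=(n, 4)),           # noise dimensions
    ])
    return X, la, lb
```

Feeding `X` to t-SNE should reveal the five clusters of `la`; factoring out `la` with ct-SNE should reveal the four clusters of `lb`.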
Contributions. This paper makes the following contributions:

ct-SNE, a new DR method that searches for an embedding such that a distribution defined in terms of distances in the input space (as in t-SNE) is well-approximated by a distribution defined in terms of distances in the embedding space, after conditioning on the prior knowledge (Sec. 2.2);

A Barnes-Hut-tree-based optimization method to efficiently find an embedding (Sec. 2.3);

Extensive qualitative and quantitative experiments on synthetic and real-world datasets showing that ct-SNE effectively removes the known factors, enables deeper visual analysis of high-dimensional data, and scales sufficiently well to handle hundreds of thousands of points (Sec. 3).
2 Method
In this section, we first briefly recap the idea behind t-SNE and introduce the basic notation. Then, we derive ct-SNE and describe a Barnes-Hut-based strategy to optimize the ct-SNE objective. Due to space limitations, we discuss in Appendix B how the idea of factoring out prior information can be applied to many other existing nonlinear DR methods, such as LargeVis and UMAP.
2.1 Background: t-SNE
In t-SNE, the input data set $X = \{x_i\}_{i=1}^{n}$, $x_i \in \mathbb{R}^{d}$, is taken to define a probability distribution for a categorical random variable $\varepsilon$, of which the value domain is indexed by all pairs of indices $(i,j)$ with $i \neq j$. This distribution is determined by specifying probabilities $p_{ij}$ such that $\sum_{i \neq j} p_{ij} = 1$, with $p_{ij}$ equal to the probability that $\varepsilon = (i,j)$. For brevity, below we will speak of the distribution $P$ when we mean the categorical distribution with parameters $p_{ij}$. More specifically, in t-SNE, the distribution $P$ is defined as follows:
(1) $p_{j|i} = \frac{\exp\big(-\|x_i - x_j\|^2 / 2\sigma_i^2\big)}{\sum_{k \neq i} \exp\big(-\|x_i - x_k\|^2 / 2\sigma_i^2\big)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}.$
The goal of t-SNE is to find an embedding $Y = \{y_i\}_{i=1}^{n}$, $y_i \in \mathbb{R}^{2}$, from which another categorical probability distribution $Q$ is derived, specified by the values $q_{ij}$ defined as follows:
(2) $q_{ij} = \frac{\big(1 + \|y_i - y_j\|^2\big)^{-1}}{\sum_{k \neq l} \big(1 + \|y_k - y_l\|^2\big)^{-1}}.$
An embedding is deemed better if the distance between these two categorical distributions is smaller, as quantified by the KL-divergence: $\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$.
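The two categorical distributions and the KL objective can be sketched in a few lines of numpy; for brevity this sketch uses a single global bandwidth in the input space rather than t-SNE's per-point, perplexity-calibrated bandwidths, which is a simplifying assumption:

```python
import numpy as np

def tsne_distributions(X, Y, sigma=1.0):
    """Toy sketch of the two categorical distributions t-SNE compares:
    Gaussian-based P over input-space pairs, Student-t based Q over
    embedding-space pairs (a single bandwidth replaces the per-point
    perplexity calibration of real t-SNE)."""
    n = X.shape[0]
    d2_x = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d2_y = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    mask = ~np.eye(n, dtype=bool)                 # exclude pairs with i == j
    P = np.where(mask, np.exp(-d2_x / (2 * sigma ** 2)), 0.0)
    P /= P.sum()                                  # normalize over all pairs
    Q = np.where(mask, 1.0 / (1.0 + d2_y), 0.0)
    Q /= Q.sum()
    return P, Q

def kl_divergence(P, Q):
    """KL(P || Q) over the off-diagonal pairs with nonzero P."""
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / Q[nz])))
```

A good embedding `Y` is one for which `kl_divergence(P, Q)` is small.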
2.2 Conditional t-SNE
Let us now assume that each data point $x_i$ has a label $l_i$ associated with it, with $l_i \in [L]$ for all $i \in [n]$. Moreover, let us assume that it is known a priori that same-labeled data points are more likely to be nearby in the input space. Our goal is to ensure that the embedding does not reflect that information again. This can be achieved by minimizing the KL-divergence between the distributions $P$ and $R$ (rather than $Q$), where $R$ is the distribution derived from the embedding but conditioned on the prior knowledge.
We formalize this using the following notation. The indicator variable $\delta_{ij} = 1$ if $l_i = l_j$ and $\delta_{ij} = 0$ otherwise, and the label matrix $\Delta$ is defined by $\Delta \triangleq (\delta_{ij})_{i \neq j}$. The probability that the random variable $\varepsilon$ is equal to $(i,j)$, conditioned on the label matrix $\Delta$ (i.e. the prior information), is denoted as:
$r_{ij} \triangleq P\big(\varepsilon = (i,j) \mid \Delta\big) = \frac{P\big(\Delta \mid \varepsilon = (i,j)\big)\, q_{ij}}{P(\Delta)}.$
In ct-SNE, this is the probability distribution that needs to be similar to $P$ for the embedding to be a good one. Note that if we ensure that $P(\Delta \mid \varepsilon = (i,j))$ is larger when $\delta_{ij} = 1$ than when $\delta_{ij} = 0$, it will be less important for the embedding to ensure that $q_{ij}$ is large for same-labeled data points, even if $p_{ij}$ is large. I.e., for same-labeled data points it is less important to be embedded nearby, even if they are nearby in the input representation. This is precisely the goal of ct-SNE.
To compute $r_{ij}$, we now investigate its different factors. First, $q_{ij}$ is simply computed as in Eq. (2). Second, we need to determine a suitable form for $P(\Delta \mid \varepsilon = (i,j))$, based on the above intuition. To do this, we assume that $\delta_{ij}$ is the sufficient statistic for $\Delta$, i.e. $P(\Delta \mid \varepsilon = (i,j)) = \delta_{ij}\alpha' + (1 - \delta_{ij})\beta'$, where $\alpha'$ and $\beta'$ can be regarded as the confidence that points $i$ and $j$, randomly picked, have the same or different labels. Let us further denote the class size of the $c$'th class as $n_c \triangleq |\{i : l_i = c\}|$. Then, for this distribution to be normalized, it must hold that:
$\alpha' \sum_c \frac{(n-2)!}{(n_c - 2)!\,\prod_{c' \neq c} n_{c'}!} + \beta' \left(\frac{n!}{\prod_c n_c!} - \sum_c \frac{(n-2)!}{(n_c - 2)!\,\prod_{c' \neq c} n_{c'}!}\right) = 1.$
This yields a relation between $\alpha'$ and $\beta'$. It also suggests a ballpark figure for $\alpha'$. Indeed, one would typically set $\alpha' \geq \beta'$. For $\alpha' = \beta'$ (i.e. the lower bound for $\alpha'$), both would be equal to $\frac{\prod_c n_c!}{n!}$, i.e. one divided by the number of possible distinct label assignments (this is of course entirely logical). Thus, in tuning $\alpha'$, one could take multiples of this minimal value.
We can now also compute the marginal probability $P(\Delta)$ as follows:
$P(\Delta) = \sum_{i \neq j} P\big(\Delta \mid \varepsilon = (i,j)\big)\, q_{ij} = \sum_{i \neq j} q_{ij}\big(\delta_{ij}\alpha' + (1 - \delta_{ij})\beta'\big).$
Given all this, one can then compute the required conditional distribution as follows:
(3) $r_{ij} = \frac{q_{ij}\big(\delta_{ij}\alpha' + (1 - \delta_{ij})\beta'\big)}{\sum_{k \neq l} q_{kl}\big(\delta_{kl}\alpha' + (1 - \delta_{kl})\beta'\big)}.$
It is numerically better to express this in terms of new variables $\alpha \triangleq N\alpha'$ and $\beta \triangleq N\beta'$, with $N \triangleq n!/\prod_c n_c!$ the number of distinct label assignments. The factor $1/N$ cancels in Eq. (3), which thus becomes:
$r_{ij} = \frac{q_{ij}\big(\delta_{ij}\alpha + (1 - \delta_{ij})\beta\big)}{\sum_{k \neq l} q_{kl}\big(\delta_{kl}\alpha + (1 - \delta_{kl})\beta\big)},$
where the relation between $\alpha$ and $\beta$ is:
(4) $\beta = \frac{n(n-1) - \alpha \sum_c n_c(n_c - 1)}{n(n-1) - \sum_c n_c(n_c - 1)}.$
This has the advantage of avoiding the large factorials and the resulting numerical problems. The lower bound for $\alpha$ to be considered is now $1$ (in which case also $\beta = 1$).
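The relation of Eq. (4) is straightforward to compute; a minimal sketch, under our reading of the normalization constraint $\alpha \sum_c n_c(n_c-1) + \beta\,(n(n-1) - \sum_c n_c(n_c-1)) = n(n-1)$:

```python
from collections import Counter

def beta_from_alpha(labels, alpha):
    """Compute beta from alpha via the normalization constraint
    alpha * S + beta * (T - S) = T, where S = sum_c n_c (n_c - 1)
    counts same-label ordered pairs and T = n(n-1) counts all ordered
    pairs. alpha = 1 gives beta = 1, recovering plain t-SNE."""
    counts = Counter(labels)
    n = len(labels)
    same = sum(nc * (nc - 1) for nc in counts.values())
    total = n * (n - 1)
    return (total - alpha * same) / (total - same)
```

For example, with two balanced classes, choosing $\alpha > 1$ yields $\beta < 1$, so same-label pairs are discounted and different-label pairs are emphasized.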
Finally, computing the KL-divergence of $P$ with $R$ yields the ct-SNE objective function to be minimized:
(5) $\mathrm{KL}(P \,\|\, R) = -\sum_{i \neq j} p_{ij} \log q_{ij} + \log\Big(\sum_{k \neq l} q_{kl}\big(\delta_{kl}\alpha + (1 - \delta_{kl})\beta\big)\Big) + \sum_{i \neq j} p_{ij} \log p_{ij} - \sum_{i \neq j} p_{ij} \log\big(\delta_{ij}\alpha + (1 - \delta_{ij})\beta\big).$
Note that the last two terms are constant w.r.t. the embedding $Y$. Moreover, it is clear that for $\alpha = \beta = 1$, this reduces to standard t-SNE. For $\alpha > 1$ (and the related $\beta < 1$, as per Eq. (4)), the minimization of this KL-divergence will try to minimize $q_{ij}$ more strongly when $\delta_{ij} = 1$ (as it is then multiplied with the larger number $\alpha$) than when $\delta_{ij} = 0$ (when it is multiplied with the smaller number $\beta$).
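A direct (dense, quadratic-cost) sketch of the conditional distribution of Eq. (3); with $\alpha = \beta = 1$ it recovers plain t-SNE's $Q$:

```python
import numpy as np

def ctsne_conditional(Y, labels, alpha, beta):
    """Sketch of Eq. (3): Student-t similarities in the embedding,
    reweighted by alpha for same-label pairs and beta for
    different-label pairs, then renormalized over all pairs."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Qh = 1.0 / (1.0 + d2)                       # unnormalized Student-t kernel
    np.fill_diagonal(Qh, 0.0)                   # exclude pairs with i == j
    lab = np.asarray(labels)
    W = np.where(lab[:, None] == lab[None, :], alpha, beta)
    R = Qh * W
    return R / R.sum()
```

Note that the normalization of $q_{ij}$ cancels, so the unnormalized kernel suffices here.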
2.3 Optimization
The objective function (Eq. (5)) is non-convex w.r.t. the embedding $Y$. Even so, we find that optimizing the objective function using gradient descent with random restarts works well in practice. The gradient of the objective function w.r.t. the embedding $y_i$ of a point reads:¹
$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \big(p_{ij} - r_{ij}\big)\,\hat{q}_{ij}\,(y_i - y_j),$
where $\hat{q}_{ij} \triangleq (1 + \|y_i - y_j\|^2)^{-1}$ and $r_{ij}$ is as defined in Eq. (3). (¹A detailed derivation of the gradient computation can be found in Appendix A.)
The gradient can be decomposed into attraction and repelling forces between points in the embedding space. Thus the underlying problem of ct-SNE, just like that of many other force-based embedding methods, is related to the classic n-body problem in physics² (²https://en.wikipedia.org/wiki/N-body_problem)
, which has also been studied in the recent machine learning literature
(Gray & Moore, 2001; Ram et al., 2009). The general goal of the n-body problem is to find a constellation of objects such that equilibrium is achieved according to a certain measure (e.g., forces, energy). In the problem setting of ct-SNE, both the pairwise distances between points and the label information affect the attraction and repelling forces. In particular, the label information strengthens the repelling force between two points if they have the same label (assuming $\alpha > 1$) and weakens it if they have different labels. This is desirable behavior, because we do not want the known label information to be reflected in the resulting embeddings.

Evaluating the gradient exactly has complexity $O(n^2)$, which makes the computation (in both time and memory) infeasible when $n$ is large. To approximate the gradient computation, we adapt the tree-based approximation strategy described by van der Maaten (2014). To efficiently model the proximity in the high-dimensional space (Eq. (1)) we use a vantage-point-tree-based algorithm (which exploits the fast diminishing property of the Gaussian distribution). To approximate the low-dimensional proximity (Eq. (3)) we modify the Barnes-Hut algorithm to incorporate the label information. The basic idea of the Barnes-Hut algorithm is to organize the points in the embedding space using a kd-tree (which for 2-d embeddings is equivalent to a quadtree). Each node of the tree corresponds to a cell (dissection) of the embedding space. If a target point $y_i$ is far away from all the points in a given cell, then the interaction between the target point and the points within the cell can be summarized by the interaction between $y_i$ and the cell's center of mass $y_{\mathrm{cell}}$, computed while constructing the kd-tree. More specifically, the summarization happens when $r_{\mathrm{cell}} / \|y_i - y_{\mathrm{cell}}\| < \theta$, where $r_{\mathrm{cell}}$ is the radius of the cell and $\theta$ controls the strength of the summarization, i.e. the approximation strength. The summarized repelling force in t-SNE reads $N_{\mathrm{cell}}\,\hat{q}_{i,\mathrm{cell}}^{2}\,(y_i - y_{\mathrm{cell}})/Z$, where $N_{\mathrm{cell}}$ is the number of data points in that cell, $\hat{q}_{i,\mathrm{cell}} \triangleq (1 + \|y_i - y_{\mathrm{cell}}\|^2)^{-1}$, and $Z$ is the normalization constant.

In the ct-SNE approximation, we had to overcome an additional complication: we also need to summarize the label information for the points in a cell when the summarization happens. This can be done by maintaining a histogram in each cell, counting the numbers of data points with different labels that fall into that cell. The repelling force exerted on a target point $y_i$ can then be weighted in proportion to the numbers of points within the cell that have the same and different label(s) as $y_i$, namely by the factor $\alpha N_{\mathrm{same}} + \beta\,(N_{\mathrm{cell}} - N_{\mathrm{same}})$, where $N_{\mathrm{same}}$ is the number of data points in the cell that have the same label as point $i$.
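The per-cell weighting can be sketched with a hypothetical helper, where `cell_hist` stands for the label histogram maintained in a cell:

```python
def cell_repulsion_weight(cell_hist, point_label, alpha, beta):
    """Weight of a cell's summarized repelling force on a point:
    same-label points in the cell count with factor alpha,
    different-label points with factor beta. With alpha = beta = 1
    this reduces to the plain Barnes-Hut count N_cell."""
    n_cell = sum(cell_hist.values())
    n_same = cell_hist.get(point_label, 0)
    return alpha * n_same + beta * (n_cell - n_same)
```

The histogram is cheap to maintain while inserting points into the kd-tree, which is what keeps the approximation efficient.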
As both tree-based approximation schemes have complexity $O(n \log n)$, counting the labels adds an extra multiplicative constant $L$, equal to the number of label values in the prior information. Thus the final complexity of approximated ct-SNE is $O(L\,n \log n)$.
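Our reconstruction of the exact objective and gradient (before any tree-based approximation) can be sanity-checked against finite differences; this sketch assumes a symmetric $P$, as produced by Eq. (1):

```python
import numpy as np

def _parts(Y, labels, alpha, beta):
    """Unnormalized Student-t kernel Qh and conditional distribution R."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    Qh = 1.0 / (1.0 + d2)
    np.fill_diagonal(Qh, 0.0)
    lab = np.asarray(labels)
    W = np.where(lab[:, None] == lab[None, :], alpha, beta)
    R = Qh * W
    return Qh, R / R.sum()

def ctsne_objective(Y, P, labels, alpha, beta):
    """KL(P || R), the ct-SNE objective of Eq. (5)."""
    _, R = _parts(Y, labels, alpha, beta)
    nz = P > 0
    return float(np.sum(P[nz] * np.log(P[nz] / R[nz])))

def ctsne_gradient(Y, P, labels, alpha, beta):
    """Exact gradient: 4 * sum_j (p_ij - r_ij) q_ij (y_i - y_j)."""
    Qh, R = _parts(Y, labels, alpha, beta)
    PR = (P - R) * Qh                         # attraction minus repulsion
    return 4.0 * np.sum(PR[:, :, None] * (Y[:, None, :] - Y[None, :, :]), axis=1)
```

The finite-difference check below is a useful guard when modifying the force computation.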
3 Experiments
The experiments investigate four questions: Q1 Does ct-SNE work as expected in finding complementary structure? Q2 How should $\alpha$ (or, equivalently, $\beta$) be chosen? Q3 Could ct-SNE's goal also be achieved by (a combination of) other methods? Q4 How well does ct-SNE scale? Two case studies addressing Q1 are presented in Sections 3.1–3.3. Two more case studies addressing Q1, as well as the experiments addressing Q2–Q4, are summarized in Sec. 3.4 and described in detail in Appendix C.
3.1 Datasets used and experimental settings
The first dataset used in the main paper is a Synthetic dataset consisting of 1,000 ten-dimensional data points, as explained in Section 1. The second is a Facebook dataset, consisting of an embedding of a de-identified random sample of Facebook users in the US. This embedding is generated based purely on the list of pages and groups that the users follow, as part of an effort to improve the quality of several recommendation systems at Facebook.
To study Q1, both qualitative and quantitative experiments were performed on the synthetic dataset. On the Facebook dataset we only conducted a qualitative evaluation (given the lack of ground truth).
Qualitative experiment. We qualitatively evaluate the effectiveness of ct-SNE through visualizations. More specifically, we compare the t-SNE visualization of a dataset with the ct-SNE visualization that has taken into account certain prior information that is visually identifiable from the t-SNE embedding. By inspecting the presence of that prior information in the ct-SNE embedding and comparing with the t-SNE embedding, we can evaluate whether the prior information is removed. Conversely, we test whether information present in the ct-SNE embedding could have been identified from the t-SNE embedding, to verify whether it indeed contains complementary information.
To select the prior information, we first visualize the t-SNE embedding and manually select points that are clustered in the visualization. Then we perform a feature-ranking procedure to identify the features that separate the selected points from the rest. This is done by fitting a linear classifier (logistic regression) on the selected cluster against all other data points. By inspecting the weights of the classifier, we can identify the features that contribute the most to the classification. Repeating this feature-ranking procedure for other clusters, we aim to find a feature that correlates with the majority of the clusters in the t-SNE visualization. This feature is then treated as prior information and provided as input to ct-SNE. In the reported experiments, the most prominent feature was always categorical, so all points with the same value were treated as being in a cluster to define the prior. We apply exact ct-SNE on the Synthetic dataset and approximated ct-SNE on the Facebook dataset. We also evaluated whether ct-SNE can continuously provide new insights, by repeatedly applying the cluster selection and feature-ranking procedure on ct-SNE embeddings.
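The feature-ranking step can be sketched as follows; this is a self-contained toy version using a hand-rolled gradient-descent logistic regression (the authors' exact solver and preprocessing are not specified here, so treat the details as assumptions):

```python
import numpy as np

def rank_features(X, selected_idx, steps=500, lr=0.5):
    """Toy version of the feature-ranking step: fit a logistic
    regression (selected cluster vs. rest) by plain gradient descent
    and rank features by absolute weight. Features are standardized
    first so that the weights are comparable across features."""
    n, d = X.shape
    y = np.zeros(n)
    y[selected_idx] = 1.0                     # selected cluster = positive class
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-np.clip(Z @ w + b, -30, 30)))
        w -= lr * (Z.T @ (p - y)) / n
        b -= lr * float(np.mean(p - y))
    return np.argsort(-np.abs(w))             # most discriminative feature first
```

The top-ranked feature is then the candidate prior to feed back into ct-SNE.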
Quantitative experiment. In this experiment, we quantify the presence of certain prior information in a ct-SNE embedding that also takes the same prior information as input. For example, if ct-SNE encodes the prior information using labels, a strong presence of the prior information is equivalent to a high homogeneity of the encoding labels in the embedding, i.e., points that are close to each other in the embedding often have the same label. To quantify such homogeneity, we developed a measure termed the normalized Laplacian score, defined as follows. Given an embedding $Y$ and parameter $k$, we denote $A$ as the adjacency matrix of the $k$-nearest-neighbor ($k$NN) graph computed from the embedding. Then the Laplacian matrix of the $k$NN graph has the form $\mathcal{L} = D - A$, where $D_{ii} = \sum_j A_{ij}$. We further normalize the Laplacian matrix ($\tilde{\mathcal{L}} = D^{-1/2}\mathcal{L}D^{-1/2}$) to obtain a score that is insensitive to the node degrees. Given a label vector $l$ with values in $[L]$, where each label $c$ has $n_c$ points, and denoting the one-hot encoding of each label $c$ as $u_c \in \{0,1\}^{n}$, the normalized Laplacian score can be computed as:
(6) $\mathrm{score}(Y, l) = \frac{1}{L} \sum_{c=1}^{L} \frac{u_c^{\top}\,\tilde{\mathcal{L}}\,u_c}{u_c^{\top}u_c}.$
This score essentially measures the pairwise difference (in terms of labels) between the data points that are connected in the $k$NN graph. If a label is locally consistent (homogeneous) in an embedding, the label differences within the $k$NN neighborhoods are small, which results in a small Laplacian score. Conversely, a label that is less homogeneous over the $k$NN graph will have a large Laplacian score. Thus, if ct-SNE removes certain prior information from its embedding, the embedding should have a large Laplacian score on the labels that encode that prior information.
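A dense numpy sketch of this measure (our reconstruction; the paper's exact normalization of Eq. (6) may differ, but the key property, homogeneous labels yielding low scores, is preserved):

```python
import numpy as np

def laplacian_score(Y, labels, k=10):
    """Build a kNN graph on the embedding, form the symmetrically
    normalized Laplacian, and average the quadratic forms of the
    one-hot label indicator vectors (low = labels locally homogeneous)."""
    n = Y.shape[0]
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbor
    A = np.zeros((n, n))
    nn = np.argsort(d2, axis=1)[:, :k]
    A[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    A = np.maximum(A, A.T)                    # symmetrize the kNN graph
    deg = A.sum(axis=1)
    Lap = np.diag(deg) - A
    Dm = np.diag(1.0 / np.sqrt(deg + 1e-12))
    Ln = Dm @ Lap @ Dm                        # normalized Laplacian
    labels = np.asarray(labels)
    classes = np.unique(labels)
    score = 0.0
    for c in classes:
        u = (labels == c).astype(float)       # one-hot indicator of label c
        score += (u @ Ln @ u) / (u @ u)
    return score / len(classes)
```

On a well-clustered embedding with matching labels the score is near zero; shuffling the labels raises it.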
3.2 Case study: Synthetic dataset
Qualitative experiment. The t-SNE visualization of the synthetic dataset shows five large clusters (Fig. 1a). Feature ranking (Sec. 3.1) shows that these clusters correspond to the clustering in dimensions 1–4 of the data. Taking the cluster labels in dimensions 1–4 (denoted $l_a$) as prior, ct-SNE gives a different visualization (Fig. 1b). The feature ranking further shows that the ct-SNE embedding indeed reveals the clusters in dimensions 5–6 of the data (with cluster labels denoted $l_b$). We further combine the labels $l_a$ and $l_b$ by assigning a new label to each combination of the labels in $l_a$ and $l_b$, denoted as $l_c$. ct-SNE with prior $l_c$ yields an embedding based only on the remaining noise (Fig. 1c).
Quantitative experiment. We computed the normalized Laplacian scores (Eq. (6)) of the t-SNE embedding and several ct-SNE embeddings. The subfigures in Fig. 2a–c give the Laplacian score for three label sets: $l_a$, $l_b$, and $l_c$. Fig. 2a shows that the labels $l_a$ are less homogeneous (higher Laplacian score) in the ct-SNE embeddings with priors $l_a$ and $l_c$ than in the t-SNE embedding, indicating that ct-SNE effectively discounted the prior from the embeddings. Both the t-SNE embedding and ct-SNE with prior $l_b$ clearly pick up the clusters in $l_a$, as indicated by the very low Laplacian score. Similarly, Figures 2b,c show that ct-SNE removes the prior information effectively for labels $l_b$ and $l_c$, respectively, given the associated priors.
3.3 Case study: Facebook dataset
Qualitative experiment. Applying t-SNE on the Facebook dataset gives a visualization with many visually salient clusters (Fig. 3a). Computing the feature ranking for classification of selected clusters shows that geography (i.e., the states) contributes the most to the embedding. This is further confirmed by coloring the data points according to geographical region, as shown in Fig. 3a: most of the clusters are indeed quite homogeneous with respect to geography.
To understand the effect of an embedding like this on a downstream recommendation system, an analyst would want to know what type of user interests the embedding is capturing. For this, the regional clusters are not very informative. To alleviate this, we can encode the region as prior for ct-SNE, so that other interesting structures can emerge in the visualization. Using the same coloring scheme, ct-SNE shows a cluster with large mass that consists of users from different states (Fig. 3b). There are also a few small clusters with mixed colors scattered on the periphery of the visualization. The visualization indicates that the geographical information is mostly removed in the ct-SNE embedding. This is further confirmed by selecting clusters (highlighted in red) in the ct-SNE embedding (Fig. 3d) and highlighting the same set of points in the t-SNE embedding (Fig. 3c). The cluster highlighted in the ct-SNE embedding spreads over the t-SNE embedding, indicating these users are not geographically similar. Indeed, feature ranking (Sec. 3.1) indicates that the selected group of users (Fig. 3d) share an interest in horse riding: they tend to follow several pages related to that topic. Interestingly, we noticed that some of the clusters in the ct-SNE embedding are also clustered in the t-SNE embedding. For example, the cluster highlighted in blue in the ct-SNE embedding (Fig. 3d) also exists in the t-SNE embedding (Fig. 3c). Using feature ranking as above, we found that these clusters are homogeneous not in terms of geography, but in terms of the users' interest in Indian culture. While these clusters can thus also be seen in the t-SNE embedding, ct-SNE removes the irrelevant (region) cluster structure, such that the other clusters become more salient and easier to observe.
3.4 Summary of additional experimental findings
Two other case studies (App. C.2–C.3), on the UCI Adult dataset (Dheeru & Karra Taniskidou, 2017) and a DBLP citation network dataset (Tang et al., 2008), confirm the ability of ct-SNE visualizations to reveal insightful clusters after conditioning on prior information that dominates the t-SNE visualizations (Q1). In Appendix C.4 we also analyze the sensitivity of the ct-SNE embedding with respect to the hyperparameter $\alpha$ (or, equivalently, $\beta$) (Q2). By varying the hyperparameter, we found that ct-SNE yields low-dimensional embeddings that better approximate the original data than t-SNE (i.e., smaller KL-divergence). The analysis also shows that using a small $\beta$ is a good rule of thumb when using ct-SNE for visualization. To answer Q3, we compared ct-SNE to two non-trivial baselines that remove the known factors from the high-dimensional data using either an adversarial autoencoder (AAE; Makhzani et al., 2015) or canonical correlation analysis (CCA; Hotelling, 1936) and then apply t-SNE for visualization (App. C.5). We show that these baselines are either difficult to tune (the AAE-based baseline) or of limited applicability (the CCA-based baseline), while ct-SNE has essentially only one parameter to tune and does not suffer from the limitations of the CCA baseline. Finally, we conducted a runtime experiment (App. C.6) showing that approximated ct-SNE can efficiently embed large, high-dimensional data without substantial quality loss (Q4).

4 Related Work
Many dimensionality reduction methods have been proposed in the literature. Arguably, n-body-problem-based methods such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018) appear to be the most popular ones. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, and (3) a loss function comparing the proximity between data points in the embedding space with the proximity in the input space. ct-SNE belongs to this class of DR methods. It accepts both high-dimensional data and priors about the data as inputs, and searches for low-dimensional embeddings while discounting structure in the input data that is specified as prior knowledge.
As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Closely related to ct-SNE is a group of interactive DR methods that adjust the algorithms according to a user's inputs (e.g., Kang et al., 2016; Puolamäki et al., 2018; Díaz et al., 2014; Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that the user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user specifies what is irrelevant to the user.³ (³For an extended discussion of the related work, please refer to Appendix D.)
5 Conclusion
We introduced conditional t-SNE (ct-SNE) to efficiently discover new insights from high-dimensional data. ct-SNE finds a lower-dimensional representation of the data in a nonlinear fashion while removing known factors. Extensive case studies on both synthetic and real-world datasets demonstrate that ct-SNE can effectively remove known factors from low-dimensional representations, allowing new structure to emerge and providing new insights to the analyst. A tree-based optimization method allows ct-SNE to scale to high-dimensional datasets with hundreds of thousands of data points.
6 Acknowledgements
The research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) / ERC Grant Agreement no. 615517, from the FWO (project nos. G091017N, G0F9816N), from the European Union's Horizon 2020 research and innovation programme and the FWO under the Marie Skłodowska-Curie Grant Agreement no. 665501, and from the EPSRC (SPHERE EP/R005273/1). We thank Laurens van der Maaten for helpful discussions.
References
 Alipanahi & Ghodsi (2011) Alipanahi, B. and Ghodsi, A. Guided locally linear embedding. PRL, 32(7):1029–1035, 2011.

Barshan et al. (2011)
Barshan, E., Ghodsi, A., Azimifar, Z., and Zolghadri Jahromi, M.
Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.
PR, 44(7):1357–1371, 2011.
Cavallo & Demiralp (2018) Cavallo, M. and Demiralp, Ç. A visual interaction framework for dimensionality reduction based data exploration. In CHI, pp. 635, 2018.
 Dheeru & Karra Taniskidou (2017) Dheeru, D. and Karra Taniskidou, E. UCI machine learning repository, 2017.
Díaz et al. (2014) Díaz, I., Cuadrado, A. A., Pérez, D., García, F. J., and Verleysen, M. Interactive dimensionality reduction for visual analytics. In ESANN, pp. 183–188, 2014.
 Edwards & Storkey (2015) Edwards, H. and Storkey, A. Censoring representations with an adversary. arXiv:1511.05897, 2015.
Faust et al. (2019) Faust, R., Glickenstein, D., and Scheidegger, C. DimReader: Axis lines that explain nonlinear projections. TVCG, 25(1):481–490, 2019.
Gray & Moore (2001) Gray, A. G. and Moore, A. W. 'N-body' problems in statistical learning. In NeurIPS, pp. 521–527, 2001.
 Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. ACM, 2016.
 Hotelling (1936) Hotelling, H. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
Jeong et al. (2009) Jeong, D. H., Ziemkiewicz, C., Fisher, B., Ribarsky, W., and Chang, R. iPCA: An interactive system for PCA-based visual analytics. In CGF, volume 28, pp. 767–774, 2009.
 Kang et al. (2016) Kang, B., Lijffijt, J., SantosRodríguez, R., and De Bie, T. Subjectively interesting component analysis: data projections that contrast with prior expectations. In KDD, pp. 1615–1624, 2016.
 Madras et al. (2018) Madras, D., Creager, E., Pitassi, T., and Zemel, R. Learning adversarially fair and transferable representations. arXiv:1802.06309, 2018.
 Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv:1511.05644, 2015.
McInnes & Healy (2018) McInnes, L. and Healy, J. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426, 2018.
Paurat & Gärtner (2013) Paurat, D. and Gärtner, T. InVis: A tool for interactive visual data analysis. In ECML-PKDD, pp. 672–676, 2013.
Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. ACM, 2014.
 Pezzotti et al. (2017) Pezzotti, N., Lelieveldt, B. P., van der Maaten, L., Höllt, T., Eisemann, E., and Vilanova, A. Approximated and user steerable tsne for progressive visual analytics. TVCG, 23(7):1739–1752, 2017.
 Puolamäki et al. (2018) Puolamäki, K., Oikarinen, E., Kang, B., Lijffijt, J., and De Bie, T. Interactive visual data exploration with subjective feedback: An informationtheoretic approach. In ICDE, pp. 1208–1211, 2018.
 Ram et al. (2009) Ram, P., Lee, D., March, W., and Gray, A. G. Lineartime algorithms for pairwise statistical problems. In NeurIPS, pp. 1527–1535, 2009.
 Stahnke et al. (2016) Stahnke, J., Dörk, M., Müller, B., and Thom, A. Probing projections: Interaction techniques for interpreting arrangements and errors of dimensionality reductions. TVCG, 22(1):629–638, 2016.
Tang et al. (2008) Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., and Su, Z. ArnetMiner: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. ACM, 2008.
 Tang et al. (2016) Tang, J., Liu, J., Zhang, M., and Mei, Q. Visualizing largescale and highdimensional data. In WWW, pp. 287–297, 2016.
 Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V., and Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323, 2000.
 Torgerson (1952) Torgerson, W. S. Multidimensional scaling: I. theory and method. Psychometrika, 17(4):401–419, 1952.
van der Maaten (2014) van der Maaten, L. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.
van der Maaten & Hinton (2008) van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. JMLR, 9(Nov):2579–2605, 2008.
 van der Maaten & Hinton (2012) van der Maaten, L. and Hinton, G. Visualizing nonmetric similarities in multiple maps. MLJ, 87(1):33–55, 2012.
Appendix A Detailed derivation of the gradient of the ct-SNE objective function
Here we derive in detail the gradient of the ct-SNE objective function. Denote the Euclidean distance between points $y_i$ and $y_j$ as $d_{ij} \triangleq \|y_i - y_j\|$. The derivative of $d_{ij}$ with respect to the embedding $y_i$ reads:
$\frac{\partial d_{ij}}{\partial y_i} = \frac{y_i - y_j}{d_{ij}}.$
Denote the cost (KL-divergence) by $C$:
$C = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{r_{ij}} = -\sum_{i \neq j} p_{ij} \log \hat{q}_{ij} + \log Z + \mathrm{const},$
where
$\hat{q}_{ij} \triangleq (1 + d_{ij}^2)^{-1}, \qquad w_{ij} \triangleq \delta_{ij}\alpha + (1 - \delta_{ij})\beta,$
and
$Z \triangleq \sum_{k \neq l} \hat{q}_{kl}\, w_{kl}, \qquad r_{ij} = \frac{w_{ij}\,\hat{q}_{ij}}{Z};$
the constant collects $\sum_{i \neq j} p_{ij} \log p_{ij} - \sum_{i \neq j} p_{ij} \log w_{ij}$, which does not depend on the embedding.
Following the derivation from the t-SNE paper, the derivative of the first term with respect to $y_i$ reads:
$-\frac{\partial}{\partial y_i} \sum_{k \neq l} p_{kl} \log \hat{q}_{kl} = 2 \sum_j (p_{ij} + p_{ji})\,\hat{q}_{ij}\,(y_i - y_j) = 4 \sum_j p_{ij}\,\hat{q}_{ij}\,(y_i - y_j),$
where the last step uses the symmetry $p_{ij} = p_{ji}$. To compute the derivative of $\log Z$ with respect to $y_i$, we first have:
$\frac{\partial \hat{q}_{ij}}{\partial y_i} = -2\,\hat{q}_{ij}^{2}\,(y_i - y_j).$
The derivative of $Z$ with respect to $y_i$ can then be computed as:
$\frac{\partial Z}{\partial y_i} = 2 \sum_j w_{ij}\,\frac{\partial \hat{q}_{ij}}{\partial y_i} = -4 \sum_j w_{ij}\,\hat{q}_{ij}^{2}\,(y_i - y_j).$
Thus we have the derivative of $\log Z$ with respect to $y_i$:
$\frac{\partial \log Z}{\partial y_i} = \frac{1}{Z}\,\frac{\partial Z}{\partial y_i} = -4 \sum_j r_{ij}\,\hat{q}_{ij}\,(y_i - y_j).$
Finally, we have the gradient:
$\frac{\partial C}{\partial y_i} = 4 \sum_j \big(p_{ij} - r_{ij}\big)\,\hat{q}_{ij}\,(y_i - y_j).$
Appendix B On generalizing the idea of ct-SNE
The idea of removing known factors from low-dimensional representations can be generalized to other n-body-problem-based DR methods. Oftentimes, the gradient of these methods can be viewed as a summation of attraction and repelling forces. Removing a known factor thus amounts to reweighting these forces such that points with the same label repel each other more, and points with different labels attract each other more. For example, LargeVis (Tang et al., 2016) differs from t-SNE by modeling the input-space proximity using an approximate (randomized) kNN graph. Thus we can use the same conditioning idea as in ct-SNE to remove known factors in LargeVis. However, for Uniform Manifold Approximation and Projection (UMAP; McInnes & Healy, 2018), conditioning is not readily applicable. In contrast to t-SNE, UMAP uses fuzzy sets to model the proximity in both the input space and the embedding space; the cross entropy between the two fuzzy sets then serves as the loss function comparing the modeled proximities. In the UMAP setting, it is not straightforward to condition the low-dimensional proximity model on the prior. But we can still directly reweight the repelling forces: for data points with the same label, the pushing effect is strengthened by multiplying with $\alpha$; for points with different labels, it is weakened by multiplying with $\beta$, under the assumption that $\alpha \geq 1 \geq \beta$. However, without proper conditioning, the parameters $\alpha$ and $\beta$ lose their probabilistic interpretation and, along with it, their one-to-one correspondence (as in ct-SNE); thus both parameters $\alpha$ and $\beta$ need to be set.
Appendix C Extended experiments
C.1 Datasets
In this section, we introduce two additional datasets:
UCI Adult dataset. We sampled 1000 data points from the UCI adult dataset (Dheeru & Karra Taniskidou, 2017) with six attributes: the three numeric attributes age, education level, and work hours per week, and the three binary attributes ethnicity (white/other), gender, and income (>50k).
DBLP dataset. We extracted all papers from 20 venues (NIPS, ICLR, ICML, AAAI, IJCAI, KDD, ECML-PKDD, ICDM, SDM, WSDM, PAKDD, VLDB, SIGMOD, ICDT, ICDE, PODS, SIGIR, WWW, CIKM, ECIR) in four areas (ML/DM/DB/IR) of computer science from the DBLP citation network dataset (Tang et al., 2008). We sampled half of the papers and constructed a network (consisting of paper nodes, author nodes, topic nodes, and venue nodes) based on paper-author, paper-topic, and paper-venue relations. Finally, we embedded the network into a Euclidean space using node2vec (Grover & Leskovec, 2016). In our experiment, the return parameter p and in-out parameter q of node2vec are both set to 1; under this setting, node2vec is equivalent to DeepWalk (Perozzi et al., 2014).
C.2 Case study: UCI Adult dataset
Qualitative experiment. Fig. 4a shows that t-SNE gives an embedding consisting of clusters grouped according to combinations of three attributes: gender, ethnicity, and income (>50k). By incorporating the attribute gender as prior, the ct-SNE embedding (Fig. 4b) contains clusters with a mixture of male and female points, indicating that the gender information is removed. Similarly, incorporating the attribute ethnicity yields a ct-SNE embedding (Fig. 4c) whose clusters contain a mixture of ethnicities. Finally, incorporating the combination of the attributes gender and ethnicity as prior, the ct-SNE embedding groups data points according to income (Fig. 4d).
Quantitative experiment. We analyzed the homogeneity (measured by the Laplacian score) of the attributes gender, ethnicity, and income (>50k) on both the t-SNE and ct-SNE embeddings. Fig. 5a shows that ct-SNE with prior gender removes the gender factor from the resulting embedding, while ct-SNE with prior ethnicity makes the gender factor in the resulting embedding clearer. Similarly, Figs. 5b,c show that ct-SNE effectively removes the prior information for the labels ethnicity and ethnicity&gender, respectively, given the associated priors.
C.3 Case study: DBLP dataset
Qualitative experiment. Applying t-SNE to the DBLP dataset gives a visualization with many visual clusters (Fig. 6a). Feature ranking for classification of the selected clusters shows the topics that contribute most to the visualization. Moreover, we used mpld3 (https://mpld3.github.io, an interactive visualization library) to inspect the metadata of the t-SNE plot, i.e., by hovering over data points and checking tooltips. Upon inspection, the visualization appears to be globally divided according to the four areas. This is further confirmed by coloring the data points according to the four areas: most of the clusters are indeed quite homogeneous with respect to area.
Knowing from the t-SNE visualization that the papers are indeed divided according to areas, the area structure in the visualization is not very informative anymore. Thus we can encode the area as prior for ct-SNE so that other interesting structure can emerge. Using the same color scheme, ct-SNE gives a visualization with many clusters of mixed colors (Fig. 6b), indicating that the area information is mostly removed from the ct-SNE embedding. This is further confirmed by selecting clusters in the ct-SNE embedding (Fig. 6d) and highlighting the same set of points in the t-SNE embedding (Fig. 6c). The clusters highlighted in the ct-SNE visualization often consist of clusters (topics) from different areas (i.e., t-SNE clusters with different colors) that are spread over the t-SNE visualization. Indeed, feature ranking indicates that papers in a selected ct-SNE cluster share similar topics, e.g., 'privacy', 'data stream', 'computer vision'. Finally, we noticed that some clusters in the ct-SNE embedding (Fig. 6d) also exist in the t-SNE embedding (Fig. 6c). Using feature ranking as above, we found that these clusters are not homogeneous in terms of area of study, but in terms of topics (e.g., 'clustering', 'active learning'), indicating a tightly connected research community behind each topic. Thus, by removing the irrelevant area structure using ct-SNE, clusters that persist in both visualizations become more salient and easier to observe.
C.4 Parameter sensitivity
To understand the effect of the parameter β (or equivalently, α) on ct-SNE embeddings (Q3), we study ct-SNE embeddings of the synthetic dataset with the prior fixed to be the cluster labels in dimensions 1–4. First, we examine the relation between the ct-SNE objective and the parameter β (or equivalently, α). We evaluated the ct-SNE objective (Eq. 2.2) on ct-SNE embeddings obtained by ranging β in fixed steps from the value with the strongest prior-removal effect to the value with no prior-removal effect (where ct-SNE is equivalent to t-SNE). We also evaluated the t-SNE objective (the first term in Eq. 2.2) and the second term of Eq. 2.2 (the only term that depends on the prior, subsequently referred to as the prior term) on the ct-SNE embeddings associated with the various values of β.
Fig. 7a plots the values of the ct-SNE objective, the t-SNE objective, and the ct-SNE prior term against β. Observe that by using a prior, the ct-SNE embedding achieves a better approximation of the higher-dimensional data: ct-SNE achieves a lower KL divergence than t-SNE does. This is possible because the prior term in the ct-SNE objective can be negative: although the t-SNE objective increases as β decreases, this is compensated by the negative value contributed by the prior term. Indeed, by factoring a given prior out of the low-dimensional embedding, the need for the embedding to represent the prior is alleviated, giving ct-SNE more freedom to approximate the high-dimensional proximities.
Interestingly, we observe that the embedding with the smallest KL divergence does not necessarily give the best visualization (e.g., a clear separation of the clusters). We visualize the ct-SNE embedding that achieves the smallest KL divergence (Fig. 7b) and compare it with the ct-SNE embedding that has the strongest prior-removal effect but a larger KL divergence (Fig. 7c). Although the embedding with the stronger prior-removal effect has a larger objective value, it shows a clearer clustering than the embedding with the smaller KL divergence. As a result, the clusters in dimensions 5–6 are easier to identify. Hence, as a rule of thumb, we propose using a small β when applying ct-SNE for visualization.
C.5 Baseline comparisons
In this section, we compare ct-SNE with two nontrivial baselines. The basic idea of both baselines is to first remove the known factor from the dataset and then apply t-SNE to produce low-dimensional representations. We use one nonlinear and one linear method to remove the known factors: an adversarial autoencoder (AAE) and canonical correlation analysis (CCA). The implementation of the baselines and the code for the comparison experiments are also available at https://bitbucket.org/ghentdatascience/ctsne.
Baseline: AAE and t-SNE. An adversarial autoencoder (AAE; Makhzani et al., 2015) can be used to learn a latent representation that prevents a discriminator from predicting certain attributes (Madras et al., 2018). To remove prior information from the low-dimensional representation of a dataset using an AAE, we configure the discriminator to predict the prior attributes, and the autoencoder to adversarially remove the prior from the latent representation of the dataset.
We adopt the AAE configuration described by Edwards & Storkey (2015). AAEs are in general difficult to tune: they have many hyperparameters (network structure parameters, weights in the objective, and learning rates) and several design choices about the network architecture (e.g., the number of layers in each subnetwork and the activation functions). We tried different parameter settings and managed to remove the cluster label information in dimensions 1–4 (Fig. 8a) and 5–6 (Fig. 8b) from the data. In Fig. 8a, the AAE approach manages to remove the prior information, but it fails to pick up the complementary structure in the data (the clusters in dimensions 5–6). It also fails to remove the prior information (cluster labels in dimensions 1–6) in Fig. 8c. Compared to this baseline, ct-SNE practically has only one parameter (β) to tune, which can often be set to a small positive number.

Baseline: CCA and t-SNE. Canonical correlation analysis (Hotelling, 1936)
aims to find linear transformations of two random variables such that the correlation between the transformed variables is maximized. To remove prior information from the data using CCA, one approach is to first find the subspace, of dimension at most that of the data, in which the transformed data and the prior information (a one-hot encoding of the labels) have the largest correlation. The data is then whitened by projecting it onto the null space of the subspace found in the first step. By doing so, the whitened data is less correlated with the known factor. Another variant of the CCA-based approach directly projects the data onto the subspace found by CCA in which the transformed data and the labels have the smallest correlation. To be consistent, we again apply t-SNE to the transformed data.
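A sketch of the first variant under simplifying assumptions (regularized whitening, one-hot labels, hypothetical function name; not the actual baseline implementation):

```python
import numpy as np

def remove_label_subspace(X, labels, eps=1e-8):
    """Project X onto the orthogonal complement of the CCA directions
    most correlated with the one-hot encoded labels (illustrative sketch)."""
    n, d = X.shape
    classes = np.unique(labels)
    Y = (labels[:, None] == classes[None, :]).astype(float)  # one-hot labels
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    Cxx = Xc.T @ Xc / n + eps * np.eye(d)
    Cyy = Yc.T @ Yc / n + eps * np.eye(len(classes))
    Cxy = Xc.T @ Yc / n
    # Whitening both views turns CCA into an SVD of the cross-covariance.
    ex, Vx = np.linalg.eigh(Cxx)
    ey, Vy = np.linalg.eigh(Cyy)
    Wx = Vx @ np.diag(ex ** -0.5) @ Vx.T
    Wy = Vy @ np.diag(ey ** -0.5) @ Vy.T
    U, _, _ = np.linalg.svd(Wx @ Cxy @ Wy)
    r = len(classes) - 1                      # at most c-1 informative directions
    Q, _ = np.linalg.qr(Wx @ U[:, :r])        # orthonormal basis in feature space
    return Xc - Xc @ Q @ Q.T                  # project onto the complement

# Toy data: feature 0 encodes a binary label, features 1-2 are noise.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 100)
X = np.column_stack([labels + 0.1 * rng.normal(size=200),
                     rng.normal(size=(200, 2))])
X_clean = remove_label_subspace(X, labels)
```

In this toy case the label-carrying direction is removed almost entirely; as discussed next, the approach degrades when the known factor is correlated with other attributes.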
Our experimental results show that the CCA-based approaches can easily remove label information that is orthogonal to the other attributes in the data. For example, in the UCI Adult dataset, the gender information is orthogonal to ethnicity and income, and can easily be removed using the CCA approach. However, the CCA-based approach performs poorly when the known factor is correlated with other attributes. Moreover, the CCA-based approaches have the limitation that the number of projection vectors is upper-bounded by the dimensionality of the data: if the number of unique values of an attribute exceeds the dimensionality of the data, the CCA projection cannot remove the label information entirely from the data. To illustrate these points, we synthesized a dataset with 1,000 data points. The data points are grouped into clusters, each corresponding to a multivariate Gaussian with a random location and small variance. Additionally, each cluster is separated into two smaller clusters along one randomly chosen axis. Fig. 9a,b show that both CCA approaches pick up only the large clusters (differentiated using marker shape) but fail to pick up the structure of the two small clusters (plotted in different colors) within each large cluster. ct-SNE, on the other hand, removes the large-cluster information from the embedding and shows that each large cluster can be further separated into two smaller clusters.
In summary, the CCA-based baselines perform poorly when the known factor is correlated with other attributes, and their number of projection vectors is upper-bounded by the dimensionality of the data. ct-SNE does not have these limitations.
C.6 Runtime
We measured the runtime of exact ct-SNE and the approximated version on a PC with a quad-core Intel Core i5 and 2133 MHz LPDDR3 RAM. By default, the maximum number of ct-SNE gradient-update iterations is 1,000. For larger datasets and prior attributes with many values, more iterations are required to reach convergence. For example, the synthetic dataset (1,000 samples, 10 dimensions) requires fewer than 1,000 iterations to converge, while the Facebook dataset (500,000 examples, 128 dimensions) requires 3,000 iterations. Table 1 shows that approximated ct-SNE is efficient and applicable to large, high-dimensional data, while exact ct-SNE is not.
| name | size | dim. | exact | approx. |
|---|---|---|---|---|
| Synthetic | 1,000 | 10 | 0.06 | 0.01 |
| UCI Adult | 1,000 | 6 | 0.07 | 0.01 |
| DBLP | 43,346 | 64 | 503.97 | 0.45 |
| Facebook | 500,000 | 128 | 100,278 | 9.1 |
Appendix D Extended related work
Many dimensionality reduction methods have been proposed in the literature. Arguably, N-body-problem-based methods (in Section 2.3 we provide more information on the N-body problem), such as MDS (Torgerson, 1952), Isomap (Tenenbaum et al., 2000), t-SNE (van der Maaten & Hinton, 2008), LargeVis (Tang et al., 2016), and UMAP (McInnes & Healy, 2018), are the most popular. These methods typically have three components: (1) a proximity measure in the input space, (2) a proximity measure in the embedding space, and (3) a loss function comparing the proximities between data points in the embedding space with those in the input space. When minimizing the loss over the embedding space, the data points (i.e., the bodies) have pairwise interactions, and the embeddings of all points need to be updated simultaneously. Since the optimization problem is not convex, local minima are typically accepted as output. ct-SNE belongs to this class of DR methods: it accepts both high-dimensional data and priors about the data as input, and searches for a low-dimensional embedding while discounting structure in the input data that is specified as prior knowledge. Closely related is multiple maps t-SNE (van der Maaten & Hinton, 2012), in which mutually exclusive factors are captured by several t-SNE embeddings at once. Compared to multiple maps t-SNE, ct-SNE allows users to disentangle information in a targeted (subjective) manner, by specifying which information they would like to have factored out.
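For concreteness, the three components can be sketched for a t-SNE-like loss (a simplified illustration with a single global bandwidth instead of t-SNE's perplexity-calibrated affinities):

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances."""
    return ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)

def tsne_style_loss(X, Y, sigma=1.0):
    # (1) proximity in the input space: Gaussian affinities
    P = np.exp(-sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()
    # (2) proximity in the embedding space: heavy-tailed Student-t affinities
    Q = 1.0 / (1.0 + sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()
    # (3) loss comparing the two: KL(P || Q)
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

# A faithful 2-d view of two 5-d blobs scores lower than a scrambled one.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (10, 5)), rng.normal(4, 0.2, (10, 5))])
good = tsne_style_loss(X, X[:, :2])
bad = tsne_style_loss(X, X[rng.permutation(20)][:, :2])
```

Minimizing such a loss over Y requires all points to move jointly, which is what makes these methods instances of the N-body problem.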
As a core component of ct-SNE is the prior information specified by the user, it can be considered an interactive DR method. Existing work on interactive DR can be categorized into two groups. The first group aims to improve the explainability and computational efficiency of existing DR methods via novel visualizations and interactions. iPCA (Jeong et al., 2009) allows users to easily explore the PCA components and thus achieve a better understanding of linear projections of the data onto different PCA components. Cavallo & Demiralp (2018) help the user understand low-dimensional representations by applying perturbations to probe the connection between the input attribute space and the embedding space. Similarly, Faust et al. (2019) introduce a perturbation-based method to visualize the effect of a specific input attribute on the embedding, while Stahnke et al. (2016) introduce 'probing' as a means to understand the meaning of point-set selections within the embedding. Steerable t-SNE (Pezzotti et al., 2017) aims to make t-SNE more scalable by quickly providing a sketch of an embedding, which is then refined only where the user expresses interest.
The second group of interactive DR methods adjusts the algorithm according to a user's input. SICA (Kang et al., 2016) and SIDE (Puolamäki et al., 2018) explicitly model the user's belief state and find linear projections that contrast with it. These are linear DR methods and thus cannot present nonlinear structure in the low-dimensional representations. Work by Díaz et al. (2014) allows users to define their own metric in the input space, after which the low-dimensional representation reflects the adjusted importance of the attributes; this method puts the burden of directly manipulating the input-space metric on the user. Many variants of existing DR methods have been introduced in which user feedback consists of editing the embedding, and such manually embedded points are used as constraints to guide the dimensionality reduction (e.g., Alipanahi & Ghodsi, 2011; Barshan et al., 2011; Paurat & Gärtner, 2013). These methods contrast with ct-SNE in that their user feedback must be obeyed in the output embedding, while for ct-SNE the prior knowledge defined by the user specifies what is irrelevant and should be factored out.