Deep generative models have piqued researchers’ interest in the past decade. Fruitful progress has been achieved on this topic, such as the auto-encoder Hinton and Salakhutdinov (2006) and variational auto-encoder (VAE) Kingma and Welling (2013); Rezende et al. (2014), generative adversarial network (GAN) Goodfellow et al. (2014); Radford et al. (2016); Arjovsky et al. (2017), normalizing flow Rezende and Mohamed (2015); Dinh et al. (2015, 2017); Kingma and Dhariwal (2018), and auto-regressive models van den Oord et al. (2016b, a, 2017). However, it is non-trivial to train a deep generative model that converges to a proper minimum of the associated optimization problem. For example, GANs suffer from instability, mode collapse, and generative distortion during training. Many insightful algorithms have been proposed to circumvent those issues, including feature engineering Salimans et al. (2016), various discrimination metrics Mao et al. (2016); Arjovsky et al. (2017); Berthelot et al. (2017), distinctive gradient penalties Gulrajani et al. (2017); Mescheder et al. (2018), spectral normalization of the discriminator Miyato et al. (2018), and orthogonal regularization of the generator Brock et al. (2019). Of particular interest is that a breakthrough for GANs has been made with the simple technique of progressively growing the neural networks of the generator and discriminator from low-resolution images to high-resolution counterparts Karras et al. (2018a). This kind of progressive growing also helps push the state of the art to a new level by enabling StyleGAN to produce photo-realistic and detail-sharp results Karras et al. (2018b), shedding new light on wide applications of GANs in solving real problems. This idea of progressive learning is actually a general manner of the cognition process Elman (1993); Oudeyer et al. (2007), which has been formally named curriculum learning in machine learning Bengio et al. (2009). The central topic of this paper is to explore a new curriculum for training deep generative models.
To facilitate robust training of deep generative models with noisy data, we propose curriculum learning with clustering. The key contributions are listed as follows:
We first summarize four representative curricula for generative models, i.e. architecture (generation capacity), semantics (data content), dimension (data space), and cluster (data distribution). Among these curricula, cluster curriculum is newly proposed in this paper.
Cluster curriculum treats data according to the centrality of each data point, which is pictorially illustrated and explained in detail. To foster large-scale learning, we devise the active set algorithm, which only needs an active data subset of small fixed size for training.
A geometric principle is formulated to analyze the hardness of noisy data and the advantage of cluster curriculum. The geometry pertains to counting the small spheres packed in an ellipsoid, on which the percolation theory we use is based.
The research on curriculum learning is diverse. Our work focuses on curricula that are closely related to data attributes; curricula beyond these are outside the scope of this paper.
2 Curriculum learning
Curriculum learning has been a basic learning strategy for promoting the performance of algorithms in machine learning. We quote the original explanation from the seminal paper Bengio et al. (2009) as its definition:
“The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level” according to pre-defined or self-learned curricula.
From a cognitive perspective, curriculum learning is common for humans and animals when they interact with environments Elman (1993), which is why it is natural as a learning rule for machine intelligence. The learning process of cognitive development is gradual and progressive Oudeyer et al. (2007). In practice, the design of curricula is task-dependent and data-dependent. Here we summarize the representative curricula that have been developed for generative models.
Architecture curriculum. The deep neural architecture itself can be viewed as a curriculum from the viewpoint of learning concepts Hinton and Salakhutdinov (2006); Bengio et al. (2006) or disentangling representations Lee et al. (2011). For supervised learning, the shallow layers decompose low-level features of objects while the deep layers yield high-level features of more abstract concepts for recognizing objects Lee et al. (2011); Zeiler and Fergus (2014); Zhou et al. (2016). For generative models, the analogous phenomenon is also observed in GANs Bau et al. (2018). In fact, progressively growing architectures from shallow layers of few parameters to deep layers of more parameters has been successfully harnessed in a variety of tasks using GANs Karras et al. (2018a); Korkinof et al. (2018); Karras et al. (2018b) and autoencoders Heljakka et al. (2018).
Semantic curriculum. The most intuitive content of each datum is the semantic information that the datum conveys. The hardness of semantics determines the difficulty of learning knowledge from data. Therefore, semantics can serve as a common curriculum. For instance, one can design an easy environment for a game to help the agent adapt to the task in deep reinforcement learning Justesen et al. (2018), and then gradually complexify the semantic structures of the environment, thus progressively increasing the difficulty level of the game. To unveil the capability of neural networks to learn number sense, the semantic curriculum may be designed with varying numbers of elements in images Zou and McClelland (2013). In light of this semantically progressive learning, the Weber function is gradually sharpened, indicating the effectiveness of learning cognitive concepts.
Dimension curriculum. High dimensionality usually poses difficulty for machine learning due to the curse of dimensionality Donoho (2000), in the sense that the amount of data needed to attain reliable performance grows exponentially with the dimension of variables Vershynin (2018). Therefore, algorithms are expected to benefit from learning with growing dimensions. The effectiveness of the dimension curriculum is evident from recent progress on deep generative models. The photo-realistic results of very high dimensions for GANs Karras et al. (2018a, b) have been achieved by progressively growing the neural architecture with increasing image resolution. For emerging language generation, the dimension curriculum is the length of sequences Rajeswar et al. (2017); Press et al. (2017). Text generation begins with short sequences, and the length is gradually increased to cover sequences of more complexity.
3 Cluster curriculum
For fitting distributions, dense data points are generally easier to handle than sparse data or outliers. To train generative models robustly, therefore, it is plausible to propose a cluster curriculum, meaning that generative algorithms first learn from data points close to cluster centers and then from more data progressively approaching cluster boundaries. Thus the stream of data points fed to models for curriculum learning follows the ordering of data points by cluster centrality, which will be explained in Section 3.2. The toy example in Figure 1 illustrates how to form a cluster curriculum.
3.1 Why clusters matter
The importance of clusters for data points is actually obvious from a geometric point of view. Data sparsity in high-dimensional spaces makes it difficult to fit the underlying distribution of data points Vershynin (2018). So generative algorithms may benefit from proceeding from the local regions where data points are relatively dense. Such data points form clusters that are generally informative subsets with respect to the entire dataset. In addition, clusters contain common regular patterns of data points, on which generative models converge more easily. Most importantly, noisy data points deteriorate the performance of algorithms. For classification, curriculum learning is theoretically proven to circumvent the negative influence of noisy data Gong et al. (2016). We will analyze this aspect for generative models with geometric facts.
3.2 Generative models with cluster curriculum
With cluster curriculum, we are allowed to gradually learn generative models from dense clusters to cluster boundaries and finally to all data points. In this way, generative algorithms are capable of avoiding the direct harm of noise or outliers. To this end, we first need a measure called centrality, a term from graph-based clustering. It quantifies the compactness of a cluster in data points or a community in complex networks Newman (2010). A large centrality implies that the associated data point is close to one of the cluster centers. For easy reference, we provide the algorithms for computing two types of centralities in the supplementary material. For experiments in this paper, all the cluster curricula are constructed by the centrality of the stationary probability distribution, i.e. the eigenvector corresponding to the largest eigenvalue of the transition probability matrix drawn from the data.
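As a rough illustration of the stationary-distribution centrality described above, the following sketch runs power iteration on a small row-stochastic transition matrix. The matrix `P`, the function name `stationary_centrality`, and the stopping tolerance are illustrative assumptions; this is not the algorithm given in the supplementary material.

```python
def stationary_centrality(P, iters=1000, tol=1e-12):
    """Approximate the stationary distribution pi (pi = pi P) by power iteration.

    P is a row-stochastic matrix given as a list of lists. The chain is
    assumed to be irreducible and aperiodic so that the iteration converges.
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step of the chain: new_j = sum_i pi_i * P[i][j]
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        total = sum(new)
        new = [v / total for v in new]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return pi


# Hypothetical 3-node chain in which node 0 acts as a hub.
P = [
    [0.5, 0.25, 0.25],
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.5],
]
centrality = stationary_centrality(P)
```

In this toy chain the hub node receives the largest stationary probability, which matches the intuition that a large centrality marks a point near a cluster center.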
To be specific, let $c \in \mathbb{R}^n$ denote the centrality vector of $n$ data points. Namely, the $i$-th entry $c_i$ of $c$ is the centrality of data point $x_i$. Sorting $c$ in descending order and adjusting the order of the original data points accordingly gives data points arranged by cluster centrality. Let $\vec{X} = \{\vec{X}_0, X_1, \dots, X_m\}$ signify the set of centrality-sorted data points, where $\vec{X}_0$ is the base set that guarantees a proper convergence of generative models, and the rest of $\vec{X}$ is evenly divided into $m$ subsets $X_1, \dots, X_m$ according to centrality order. In general, the number of data points in $\vec{X}_0$ is moderate compared to $n$ and determined according to the data at hand. Such a choice of $\vec{X}_0$ serves the efficiency of training, because we do not need to train models from a very small dataset. The cluster curriculum learning is carried out by incrementally feeding the subsets in $\vec{X}$ into generative algorithms. In other words, algorithms are successively trained on $\vec{X}_i = \vec{X}_{i-1} \cup X_i$ after $\vec{X}_{i-1}$, meaning that the curriculum for each round of training is accumulated with $X_i$.
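The construction just described can be sketched as follows. The function names, the index-based representation of the data, and the toy centrality values are assumptions made for illustration; when the remaining data do not divide evenly, the last subset is simply smaller in this sketch.

```python
def build_cluster_curriculum(centrality, base_size, step_size):
    """Split data indices into a base set and incremental subsets.

    Indices are sorted by descending centrality; the first base_size indices
    form the base set and the rest are cut into chunks of step_size.
    """
    order = sorted(range(len(centrality)), key=lambda i: centrality[i], reverse=True)
    base = order[:base_size]
    rest = order[base_size:]
    subsets = [rest[k:k + step_size] for k in range(0, len(rest), step_size)]
    return base, subsets


def accumulated_curricula(base, subsets):
    """Yield the accumulated training set of each round: base, base + X1, ..."""
    current = list(base)
    yield list(current)
    for subset in subsets:
        current.extend(subset)
        yield list(current)


# Toy centralities for five data points (made-up values).
centrality = [0.10, 0.40, 0.05, 0.30, 0.15]
base, subsets = build_cluster_curriculum(centrality, base_size=2, step_size=1)
rounds = list(accumulated_curricula(base, subsets))
```

Each element of `rounds` is the index set a model would be trained on in that round, starting from the most central points and ending with the full dataset.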
In order to determine the optimal curriculum, we need the aid of a quality metric of generative models, such as Fréchet inception distance (FID) or sliced Wasserstein distance (SWD) Borji (2018). For generative models trained with each curriculum $\vec{X}_i$, we calculate the associated score $s_i$ via the specified quality metric. The optimal curriculum for effective training can be identified by the minimal value of $s_i$ over all $i$, where $i = 0, 1, \dots, m$. The interesting phenomenon of this score curve will be illustrated in the experiment. The minimum of the score is apparently metric-dependent. One can refer to Borji (2018) for a review of evaluation metrics. In practice, we can opt for one reliable metric or use multiple metrics to decide on the optimal model.
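Selecting the optimal curriculum from a list of per-round scores amounts to an argmin, as in the minimal sketch below. It assumes a metric where lower is better (as with FID or SWD); the function name and the score values are made up for illustration and are not experimental results.

```python
def select_optimal_curriculum(scores):
    """Return (index, score) of the curriculum with the lowest quality score."""
    best = min(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]


# Hypothetical V-shaped score curve over six curricula (not measured values).
scores = [41.0, 35.5, 30.2, 28.9, 31.4, 36.0]
best_index, best_score = select_optimal_curriculum(scores)
```

With several metrics, one simple option is to run this selection per metric and compare the resulting indices.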
There are two ways of using the incremental subsets during training. One is that the parameters of models are re-randomized when new data are used, the procedure of which is given in Algorithm 1 in the supplementary material. The other is that the parameters are fine-tuned based on the pre-training of the previous model, which will be presented with a fast learning algorithm in the following section.
3.3 Active set for scalable training
To obtain the precise minimum of the score $s_i$, the cardinality of each subset $X_i$ needs to be set much smaller than $n$, meaning that the number of subsets $m$ will be large even for a dataset of moderate scale. Training over so many loops is time-consuming. Here we propose the active set to address the issue, in the sense that for each loop of cluster curriculum, generative models are always trained with a subset of small fixed size instead of $\vec{X}_i$, whose size becomes incrementally large.
To form the active set $\mathcal{A}$, a subset of data points is randomly sampled from the current curriculum $\vec{X}_i$ to combine with the new subset $X_{i+1}$ for the next loop, where the number of sampled points is $|\mathcal{A}| - |X_{i+1}|$. For easy understanding, we illustrate the active set with a toy example in Figure 2. In this scenario, progressive pre-training must be applied, meaning that the update of model parameters for the current training is based on the parameters of the previous loop. The procedure of cluster curriculum with active set is detailed in Algorithm 2 in the supplementary material.
The active set allows us to train generative models with a small dataset that is actively adapted, thereby significantly reducing the training time for large-scale data.
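A minimal sketch of forming one active set is given below. It assumes that the earlier data are sampled uniformly without replacement from the accumulated curriculum and that the new subset is always included in full; these choices, and the name `next_active_set`, are assumptions of this sketch rather than a transcription of Algorithm 2.

```python
import random


def next_active_set(current_curriculum, new_subset, active_size, rng=random):
    """Combine the new subset with a random sample of earlier data.

    The returned list has active_size elements whenever enough earlier data
    are available; otherwise all earlier data are used.
    """
    n_old = active_size - len(new_subset)
    if n_old < 0:
        raise ValueError("active_size must be at least the size of the new subset")
    n_old = min(n_old, len(current_curriculum))
    sampled = rng.sample(list(current_curriculum), n_old)
    return sampled + list(new_subset)


# Toy example: 100 earlier points, 10 new points, active set of size 30.
rng = random.Random(0)
active = next_active_set(list(range(100)), list(range(100, 110)), 30, rng)
```

Because the active set stays at a fixed size, the cost of each loop does not grow with the size of the accumulated curriculum.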
4 Geometric view of cluster curriculum
Cluster curriculum bears an interesting relation to high-dimensional geometry, which can provide a geometric understanding of our algorithm. Without loss of generality, we work on a cluster obeying the normal distribution. The characteristic of the cluster can be extended to other clusters of the same distribution. For easy analysis, let us begin with a toy example. As Figure 3a shows, the confidence ellipse $\mathcal{E}_{\alpha_i}$ fitted from the subset of centrality-ranked data points is nearly conformal to the ellipse $\mathcal{E}_{\alpha_j}$ of all data points, allowing us to relate these two ellipses by virtue of the confidence-level equation. Let $\mu$ and $\Sigma$ signify the center and the covariance matrix of the cluster of interest, where $\mu \in \mathbb{R}^d$ and $\Sigma \in \mathbb{R}^{d \times d}$. To make it formal, we can write the equation as
$$\Pr\left( (x-\mu)^\top \Sigma^{-1} (x-\mu) \le \chi_d^2(\alpha) \right) = 1 - \alpha,$$
where $\chi_d^2(\alpha)$ can be the chi-squared quantile or the squared Mahalanobis distance, $d$ is the degree of freedom, and $\alpha \in [0,1]$ sets the confidence level $1-\alpha$. For conciseness, we write $\chi_d(\alpha)$ as $\chi_\alpha$ in the following context. Then the ellipses $\mathcal{E}_{\alpha_i}$ and $\mathcal{E}_{\alpha_j}$ correspond to $\chi_{\alpha_i}$ and $\chi_{\alpha_j}$, respectively, where $\chi_{\alpha_i} \le \chi_{\alpha_j}$.
To analyze the hardness of training generative models, a fundamental aspect is to examine the number $n(\mathcal{G})$ of given data points falling in a geometric entity $\mathcal{G}$ (for cluster curriculum, it is an annulus, explained shortly) and the number $N(\mathcal{G})$ of lattice points in it. The smaller $n(\mathcal{G})$ is compared to $N(\mathcal{G})$, the harder the problem will be. However, the enumeration of lattice points is computationally prohibitive for high dimensions. Inspired by the information theory of encoding data of normal distributions Roman (1996), we count the number of small spheres of radius $r$ packed in the ellipsoid instead. Thus we can use this number to replace the role of $N(\mathcal{G})$ as long as the radius of the sphere is set properly. With a little abuse of notation, we still use $N(\mathcal{G})$ to denote the packing number in the following context. Theorem 1 gives the exact form of $N(\mathcal{G})$ for an ellipsoid.
Theorem 1. For a set $X$ of $n$ data points drawn from the normal distribution $\mathcal{N}(\mu, \Sigma)$, the ellipsoid $\mathcal{E}_\alpha$ of confidence $1-\alpha$ is defined as $(x-\mu)^\top \Sigma^{-1} (x-\mu) \le \chi_\alpha^2$, where $\Sigma$ has no zero eigenvalues and $\alpha \in [0,1]$. Let $N(\mathcal{E}_\alpha)$ be the number of spheres of radius $r$ packed in the ellipsoid $\mathcal{E}_\alpha$. Then we can establish
$$N(\mathcal{E}_\alpha) = \left( \frac{\chi_\alpha}{r} \right)^{d} \sqrt{\det \Sigma}.$$
We see that $N(\mathcal{E}_\alpha)$ admits a tidy form with Mahalanobis distance $\chi_\alpha$, dimension $d$, and sphere radius $r$ as variables. The proof is provided in the supplementary material.
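One way to see where a form of this kind can come from is to take the packing number as a ratio of volumes. This is an assumption of the sketch below, not the proof from the supplementary material. Here $\lambda_1, \dots, \lambda_d$ denote the eigenvalues of the covariance matrix $\Sigma$, $\chi_\alpha$ is the Mahalanobis radius of the ellipsoid $\mathcal{E}_\alpha$, $V_d$ is the volume of the unit ball in $d$ dimensions, and $\mathcal{S}_r$ is a sphere of radius $r$; the symbols are introduced only for this illustration.

```latex
\begin{aligned}
\mathrm{Vol}(\mathcal{E}_\alpha)
  &= V_d \prod_{k=1}^{d} \chi_\alpha \sqrt{\lambda_k}
   = V_d \, \chi_\alpha^{d} \sqrt{\det \Sigma}, \\
\mathrm{Vol}(\mathcal{S}_r)
  &= V_d \, r^{d}, \\
\frac{\mathrm{Vol}(\mathcal{E}_\alpha)}{\mathrm{Vol}(\mathcal{S}_r)}
  &= \left( \frac{\chi_\alpha}{r} \right)^{d} \sqrt{\det \Sigma}.
\end{aligned}
```

The unit-ball volume $V_d$ cancels, so the ratio depends only on the Mahalanobis radius, the dimension, the sphere radius, and the determinant of the covariance.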
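The geometric region of interest for cluster curriculum is the annulus $\mathcal{A}$ formed by removing the ellipsoid $\mathcal{E}_{\alpha_i}$ from the ellipsoid $\mathcal{E}_{\alpha_j}$ (the ellipse refers to the surface and the ellipsoid refers to the elliptic ball), as Figure 3b displays. We investigate the varying law between $n(\mathcal{A})$ and $N(\mathcal{A})$ in the annulus when the inner ellipse $\mathcal{E}_{\alpha_i}$ grows with cluster curriculum. For this purpose, we need the following two corollaries that immediately follow from Theorem 1.
Corollary 1. Let $N(\mathcal{A})$ be the number of spheres of radius $r$ packed in the annulus $\mathcal{A}$ that is formed by removing the ellipsoid $\mathcal{E}_{\alpha_i}$ from the ellipsoid $\mathcal{E}_{\alpha_j}$, where $\chi_{\alpha_i} \le \chi_{\alpha_j}$. Then the following identity holds:
$$N(\mathcal{A}) = \left( \chi_{\alpha_j}^{d} - \chi_{\alpha_i}^{d} \right) \frac{\sqrt{\det \Sigma}}{r^{d}}.$$
Figure 4: $n(\mathcal{A})$ and $N(\mathcal{A})$ with respect to the Chi quantile $\chi_{\alpha_i}$. $d$ is the dimension of data points. For each dimension, we sample 70,000 data points from the isotropic normal distribution. The scales of the two axes are normalized, the first by 10,000.
Corollary 2. It is obvious that $N(\mathcal{A})$ goes to infinity when $d \to \infty$ under the conditions that $\chi_{\alpha_j} > \chi_{\alpha_i}$ and $r$ is bounded. Besides, when $\mathcal{E}_{\alpha_i}$ (the cluster) grows, $N(\mathcal{A})$ decreases with exponent $d$ if the ellipsoid $\mathcal{E}_{\alpha_j}$ is fixed.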
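To make the annulus count concrete, the sketch below evaluates a volume-ratio packing number for the outer and inner ellipsoids and takes their difference. The closed form $(\chi / r)^d \sqrt{\det \Sigma}$ used here is the volume-ratio assumption of this sketch, and all function and parameter names are hypothetical. Logarithms are used for the powers, but the final exponentials can still overflow for very large dimensions.

```python
import math


def log_packing_number(chi, r, log_det_sigma, d):
    """Log of (chi / r)^d * sqrt(det Sigma), a volume-ratio packing number."""
    return d * (math.log(chi) - math.log(r)) + 0.5 * log_det_sigma


def annulus_packing_number(chi_inner, chi_outer, r, log_det_sigma, d):
    """Packing number of the annulus: outer ellipsoid minus inner ellipsoid."""
    if not 0 < chi_inner <= chi_outer:
        raise ValueError("need 0 < chi_inner <= chi_outer")
    outer = math.exp(log_packing_number(chi_outer, r, log_det_sigma, d))
    inner = math.exp(log_packing_number(chi_inner, r, log_det_sigma, d))
    return outer - inner


# Toy case: 2 dimensions, identity covariance (log det = 0), unit spheres.
count_2d = annulus_packing_number(1.0, 2.0, 1.0, 0.0, 2)
# Same radii in 10 dimensions: the annulus dominates the outer ellipsoid.
count_10d = annulus_packing_number(1.0, 2.0, 1.0, 0.0, 10)
```

Comparing `count_2d` and `count_10d` shows how quickly the annulus count grows with the dimension when the two radii are held fixed.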
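In light of Corollary 1, we can now demonstrate the functional law between $n(\mathcal{A})$ and $N(\mathcal{A})$. First, we determine $\chi_{\alpha_j}$ as follows
$$\chi_{\alpha_j} = \max_{x_k \in X} \sqrt{(x_k - \mu)^\top \Sigma^{-1} (x_k - \mu)},$$
which means that $\mathcal{E}_{\alpha_j}$ is the ellipsoid of minimal Mahalanobis distance to the center that contains all the data points in the cluster. In addition, we need to estimate a suitable sphere radius $r$, such that $n(\mathcal{E}_{\alpha_j})$ and $N(\mathcal{E}_{\alpha_j})$ have comparable scales in order to make $n(\mathcal{A})$ and $N(\mathcal{A})$ comparable in scale. To achieve this, we define an oracle ellipse $\mathcal{E}$ where $n(\mathcal{E}) = N(\mathcal{E})$. For simplicity, we let $\mathcal{E}_{\alpha_j}$ be the oracle ellipse. Thus we can determine $r$ with Corollary 3.
Corollary 3. If we let $\mathcal{E}_{\alpha_j}$ be the oracle ellipse such that $n(\mathcal{E}_{\alpha_j}) = N(\mathcal{E}_{\alpha_j})$, then the free parameter $r$ can be computed with $r = \chi_{\alpha_j} \left( \sqrt{\det \Sigma} \, / \, n(\mathcal{E}_{\alpha_j}) \right)^{1/d}$.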
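The radius computation can be sketched by solving a volume-ratio packing number for $r$. The sketch assumes the form $(\chi / r)^d \sqrt{\det \Sigma}$ for the packing number and sets it equal to the number of data points in the oracle ellipse; the function name and arguments are hypothetical.

```python
import math


def oracle_radius(chi, n_points, log_det_sigma, d):
    """Solve (chi / r)^d * sqrt(det Sigma) = n_points for the sphere radius r.

    Works in log space so that large d and tiny determinants stay stable.
    """
    if chi <= 0 or n_points <= 0 or d <= 0:
        raise ValueError("chi, n_points and d must be positive")
    log_r = math.log(chi) + (0.5 * log_det_sigma - math.log(n_points)) / d
    return math.exp(log_r)


# Toy check in 2 dimensions with identity covariance: (2 / r)^2 = 4 gives r = 1.
r = oracle_radius(chi=2.0, n_points=4, log_det_sigma=0.0, d=2)
```

Plugging the returned radius back into the packing number recovers `n_points`, which is the matching condition required of the oracle ellipse.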
To make the demonstration tractable, the data points we use for simulation are assumed to obey the isotropic normal distribution, meaning that data points are generated with nearly equal variance along each dimension. Figure 4 shows that $n(\mathcal{A})$ gradually exhibits the critical phenomena of percolation processes when the dimension $d$ grows large, implying that the data points in the annulus are significantly reduced when $\chi_{\alpha_i}$ grows a little bigger near the critical point. (Percolation theory is a fundamental tool for studying the structure of complex systems in statistical physics and mathematics. The critical point is the percolation threshold where the transition takes place; see Stauffer and Aharony (1994).) In contrast, the number $N(\mathcal{A})$ of lattice points is still large and varies negligibly until $\chi_{\alpha_i}$ approaches the boundary. This discrepancy indicates clearly that fitting data points in the annulus is pretty hard and guaranteeing the precision is nearly impossible when crossing the critical point of $\chi_{\alpha_i}$ even for a moderate dimension (e.g. ). Therefore, the plausibility of cluster curriculum can be drawn naturally from this geometric fact.
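The sharp drop in the number of data points outside a growing inner ellipsoid can be reproduced with a small simulation. The sketch below assumes a standard isotropic normal distribution, for which the Mahalanobis distance equals the Euclidean norm, and uses a much smaller sample than the one in the figure; the dimension, sample size, and thresholds are illustrative choices.

```python
import math
import random


def annulus_counts(d, n, thresholds, seed=0):
    """For x ~ N(0, I_d), count samples whose norm exceeds each threshold.

    With identity covariance, each count is the number of points lying
    outside the inner ellipsoid of that Mahalanobis radius.
    """
    rng = random.Random(seed)
    norms = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n)
    ]
    return [sum(1 for v in norms if v > t) for t in thresholds]


d, n = 64, 2000
# Radii below, at, and above sqrt(d), where the norms concentrate.
thresholds = [0.5 * math.sqrt(d), math.sqrt(d), 1.5 * math.sqrt(d)]
counts = annulus_counts(d, n, thresholds)
```

Since the norms of high-dimensional Gaussian samples concentrate near $\sqrt{d}$, the counts fall from nearly all points to nearly none over a narrow range of radii, which is the transition the text describes.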
The generative models that we use for experiments are progressive growing of GAN (ProGAN) Karras et al. (2018a) and Glow, a normalizing-flow algorithm Kingma and Dhariwal (2018). These two algorithms are chosen because they are state of the art and have official open-source implementations available. According to convention, we adopt the Fréchet inception distance (FID) as the quality metric for ProGAN. Considering that the backgrounds of face images generated by Glow are nearly uniform, we take the sliced Wasserstein distance (SWD) as the accuracy measure of the Glow model. One can refer to Borji (2018) for details, where various metrics for generative models are compared.
5.1 Dataset and experimental setting
We randomly sample 200,000 cat images from the LSUN dataset Yu et al. (2015). These cat images are captured in the wild, so their styles vary significantly. Figure 5 shows cat examples of high and low centralities. We can see that noisy cat images differ much from the clean ones. The data actually contain images with very few informative cat features, which are the outliers we refer to. The curriculum parameters are set as $|\vec{X}_0| = 20{,}000$ and $|X_i| = 10{,}000$, which means that the algorithms are trained with 20,000 images first and, after the initial training, another 10,000 images according to centrality order are merged into the current training data for further re-training. For active set, its size is fixed to be .
The CelebA dataset is a large-scale face attribute dataset Liu et al. (2015). We use the cropped and well-aligned faces with a bit of image backgrounds preserved for generation task. For cluster-curriculum learning, we randomly sample 70,000 faces as the training set. The face examples of different centralities are shown in Figure 5. The curriculum parameters are set as and . We bypass the experiment of active set on faces because it is used for large-scale data.
Each image in two databases is resized to be . To form cluster curricula, we exploit ResNet34 He et al. (2016) pre-trained on ImageNet Russakovsky et al. (2015) to extract 512-dimensional features for each face and cat image. The directed graphs are built with these feature vectors. We determine the parameter of edge weights by enforcing the geometric mean of weights to be 0.8. The number of nearest neighbors is set to be . The centrality is the stationary probability distribution. All code is written with TensorFlow.
5.2 Experimental result
From Figure 6, we can see that the FID curves are all nearly V-shaped, indicating that global minima exist amid the training process. This is clear evidence that noisy data and outliers deteriorate the quality of generative models during training. From the optimal curricula found by the two training schemes (i.e. curricula at 110,000 and 100,000), we can see that the curriculum of the active set differs from that of normal training only by a one-step data increment, implying that the active set is reliable for fast cluster-curriculum learning. The performance of the active set measured by FID is much worse than that of normal training, especially when more noisy data are fed into generative models. However, this does not change the overall V-shape of the accuracy curve. Namely, the active set is applicable as long as it admits the metric minimum corresponding to the appropriate curriculum.
Figure 7: (a) ProGAN; (b) Glow.
The clear V-shape of the centrality-FID curve on the cat data arises because the noisy data of low centrality contain little effective information to characterize the cats, as already displayed in Figure 5. However, it is different for the CelebA face dataset, where the face images of low centrality also convey part of the face features. As evidenced by Figure 7a, ProGAN keeps being optimized by the majority of data until the curriculum of size . To highlight the meaning of this nearly negligible minimum, we also conduct the exact same experiment on the FFHQ face dataset, which contains high-quality face images Karras et al. (2018b). For FFHQ data, the noisy face data can be ignored. The gray curve of normal training in Figure 7a indicates that the FID of ProGAN decreases monotonically over all curricula. This slight difference between the ends of the FID curves for CelebA and FFHQ clearly demonstrates the difficulty that noisy data pose to generative algorithms. To further unveil this effect, we run cluster-curriculum learning on CelebA faces with Glow. The performance of Glow is sensitive to the variation of cluster densities Dahl et al. (2017); Kingma and Dhariwal (2018). So Glow is expected to reflect the change in the noise level of the data. To stabilize Glow, we take in this trial. As Figure 7b displays, Glow admits a sharp minimum at , i.e. one incremental curriculum of size in ProGAN. The one-step curriculum difference is good enough to support the plausibility of cluster-curriculum learning.
Figure 8: (a) cat data; (b) face data.
5.3 Geometric investigation
To understand cluster curriculum deeply, we employ the geometric method formulated in Section 4 to analyze the cat and face data. The percolation processes are both conducted with 512-dimensional features from ResNet34. Figure 8 displays the curve of $n(\mathcal{A})$, which is the variable of interest in this scenario. As expected, the critical point in the percolation process occurs in both cases, as shown by the blue curves. An obvious fact is that the optimal curricula (red strips) both fall into the (feasible) domains of the percolation processes after the critical points, as indicated by the gray color. This is a desirable property because data become rather sparse in the annuli when crossing the critical points. Noisy data then play a non-negligible role in tuning the parameters of generative models. Therefore, a fast learning strategy can be derived from the percolation process. The training may begin from the curriculum specified by the critical point, thus significantly accelerating cluster-curriculum learning.
Another intriguing phenomenon is that the noisier the data, the closer the optimal interval (red strip) is to the critical point. We can see that the optimal interval of the cat data is much closer to the critical point than that of the face data. What surprises us here is that the optimal interval of cluster curricula associated with the cat data nearly coincides with the critical point of the percolation process in the annulus! This means that, for heavily noisy data, the optimal curriculum may be found at the intervals close to the critical point of percolation, thus affording great convenience in learning an appropriate generative model for such datasets.
6 Analysis and conclusion
The cluster curriculum is proposed for robust training of generative models. The active set of cluster curriculum is devised to facilitate scalable learning. The geometric principle behind cluster curriculum is analyzed in detail as well. The experimental results on the LSUN cat dataset and the CelebA face dataset demonstrate that generative models trained with cluster curriculum are capable of learning the optimal parameters with respect to the specified quality metric, such as Fréchet inception distance and sliced Wasserstein distance. Geometric analysis indicates that the optimal curricula obtained from generative models are closely related to the critical points of the associated percolation processes established in this paper. This intriguing geometric phenomenon is worth exploring deeply in terms of the theoretical connection between generative models and high-dimensional geometry.
It is worth emphasizing that model optimality here refers to the global minimum of the centrality-FID curve. As we already noted, the optimality is metric-dependent. We are able to obtain the optimal model with cluster curriculum, which does not mean that the algorithm only serves this purpose. We know that more informative data can help learn a more powerful model covering large data diversity. Here a trade-off arises between the robustness against noise and the capacity of fitting more data. The centrality-FID curve provides a visual tool to monitor the state of model training, thus aiding us in understanding the learning process and selecting suitable models according to the noise level of the given data. For instance, we can pick the trained model close to the optimal curriculum for heavily noisy data or the one near the end of the centrality-FID curve for datasets with little noise. In fact, this may be the most common way of using cluster curriculum.
In this paper, we do not investigate cluster-curriculum learning for the multi-class case, e.g. the ImageNet dataset with BigGAN Brock et al. (2019). The cluster-curriculum learning of multiple classes is more complex than what we have analyzed on the face and cat data. We leave this study for future work.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv:1701.07875, 2017.
- Bau et al. (2018) David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. arXiv:1811.10597, 2018.
- Bengio et al. (2006) Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems (NeurIPS), 2006.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In International Conference on Machine Learning (ICML), pages 41–48, 2009.
- Berthelot et al. (2017) David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
- Borji (2018) Ali Borji. Pros and cons of GAN evaluation measures. arXiv:1802.03446, 2018.
- Brock et al. (2019) Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), 2019.
- Dahl et al. (2017) Ryan Dahl, Mohammad Norouzi, and Jonathon Shlens. Pixel recursive super resolution. arXiv:1702.00783, 2017.
- Dinh et al. (2015) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In International Conference on Learning Representations (ICLR), 2015.
- Dinh et al. (2017) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
- Donoho (2000) David L. Donoho. High-dimensional data analysis: The curses and blessings of dimensionality. In AMS Math Challenges Lecture, 2000.
- Elman (1993) Jeffrey L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48:781–799, 1993.
- Gong et al. (2016) Tieliang Gong, Qian Zhao, Deyu Meng, and Zongben Xu. Why curriculum learning & self-paced learning work in big/noisy data: A theoretical perspective. Big Data & Information Analytics, 1(1):111–127, 2016.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. arXiv:1704.00028, 2017.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In
- Heljakka et al. (2018) Ari Heljakka, Arno Solin, and Juho Kannala. Pioneer networks: Progressively growing generative autoencoder. arXiv:1807.03026, 2018.
- Hinton and Salakhutdinov (2006) Geoffrey E. Hinton and Ruslan R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
- Justesen et al. (2018) Niels Justesen, Ruben Rodriguez Torrado, Philip Bontrager, Ahmed Khalifa, Julian Togelius, and Sebastian Risi. Illuminating generalization in deep reinforcement learning through procedural level generation. arXiv:1806.10729, 2018.
- Karras et al. (2018a) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018a.
- Karras et al. (2018b) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2018b.
- Kingma and Dhariwal (2018) Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv:1807.03039, 2018.
- Kingma and Welling (2013) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2013.
- Korkinof et al. (2018) Dimitrios Korkinof, Tobias Rijken, Michael O’Neill, Joseph Yearsley, Hugh Harvey, and Ben Glocker. High-resolution mammogram synthesis using progressive generative adversarial networks. arXiv:1807.03401, 2018.
- Lee et al. (2011) Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- Mao et al. (2016) Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. arXiv:1611.04076, 2016.
- Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? arXiv:1801.04406, 2018.
- Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018.
- Newman (2010) Mark Newman. Networks: An Introduction. Oxford, 1st edition, 2010.
- Oudeyer et al. (2007) Pierre-Yves Oudeyer, Frederic Kaplan, and Verena V. Hafner. Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation, 11(2):265–286, 2007.
- Press et al. (2017) Ofir Press, Amir Bar, Ben Bogin, Jonathan Berant, and Lior Wolf. Language generation with recurrent generative adversarial networks without pre-training. arXiv:1706.01399, 2017.
- Radford et al. (2016) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.
- Rajeswar et al. (2017) Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, and Aaron Courville. Adversarial generation of natural language. arXiv:1705.10929, 2017.
- Rezende and Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), pages 1530–1538, 2015.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), pages 1278–1286, 2014.
- Roman (1996) Steven Roman. Introduction to Coding and Information Theory. Springer, 1st edition, 1996.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
- Salimans et al. (2016) Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. arXiv:1606.03498, 2016.
- Stauffer and Aharony (1994) Dietrich Stauffer and Amnon Aharony. Introduction To Percolation Theory. CRC Press, 2nd edition, 1994.
- van den Oord et al. (2016a) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016a.
- van den Oord et al. (2016b) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv:1601.06759, 2016b.
- van den Oord et al. (2017) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv:1711.10433, 2017.
- Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 1st edition, 2018.
- Yu et al. (2015) Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015.
- Zeiler and Fergus (2014) Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (ECCV), 2014.
- Zhao and Tang (2014) Deli Zhao and Xiaoou Tang. Homophilic clustering by locally asymmetric geometry. arXiv:1407.1352, 2014.
- Zhou et al. (2016) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.
- Zou and McClelland (2013) Will Zou and James McClelland. Progressive development of the number sense in a deep neural network. In Annual Conference of the Cognitive Science Society, 2013.
1 Centrality measure
The centrality or clustering coefficient of a cluster of data points, or of a community in a complex network, is a well-studied topic in machine learning and complex systems. Here we introduce two graph-theoretic centralities for use in cluster curriculum. First, we construct a directed graph (digraph) with nearest neighbors. The weighted adjacency matrix $W$ of the digraph is formed as follows: $W_{ij} = \exp(-d_{ij}^2/\sigma^2)$ if $x_j$ is one of the nearest neighbors of $x_i$ and $W_{ij} = 0$ otherwise, where $d_{ij}$ is the distance between $x_i$ and $x_j$ and $\sigma$ is a free parameter.
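The graph construction can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the Gaussian weight $\exp(-d_{ij}^2/\sigma^2)$, the Euclidean distance, and the names `knn_digraph`, `k`, and `sigma` are assumptions made for the example.

```python
import math


def knn_digraph(points, k, sigma):
    """Weighted adjacency matrix W of a k-nearest-neighbor digraph.

    W[i][j] = exp(-d_ij^2 / sigma^2) if point j is among the k nearest
    neighbors of point i, and 0 otherwise. The Gaussian weight, the
    Euclidean distance and the parameter names are assumptions of this sketch.
    """
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Distances from point i to every other point, smallest first.
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        # Keep an edge i -> j only for the k nearest neighbors of i.
        for d_ij, j in dists[:k]:
            W[i][j] = math.exp(-(d_ij ** 2) / sigma ** 2)
    return W


# Usage: three points on a line; the isolated point links to the pair,
# but the pair does not link back, so W is not symmetric.
W = knn_digraph([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)], k=1, sigma=1.0)
print(W)
```

The asymmetry of $W$ is what makes the graph directed: being a nearest neighbor is not a mutual relation.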
1.1 Stationary probability distribution
The density of data points can be quantified with the stationary probability distribution of a Markov chain. For a digraph built from data, the transition probability matrix $P$ can be derived by row normalization of the weighted adjacency matrix $W$, i.e., $P_{ij} = W_{ij} / \sum_{l} W_{il}$. Then the stationary probability $\pi$ can be obtained by solving the eigenvalue problem
$$P^{\top} \pi = \pi, \tag{S1}$$
where $\top$ denotes the matrix transpose. It is straightforward to see that $\pi$ is the eigenvector of $P^{\top}$ corresponding to the largest eigenvalue (i.e., 1). $\pi$ is also defined as a kind of PageRank in many scenarios.
For density-based cluster curriculum, the centrality $c$ coincides with the stationary probability $\pi$. Figure 1 in the main text shows the plausibility of using the stationary probability distribution to quantify data density.
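As an illustration, the stationary probability can be approximated by power iteration on the transposed transition matrix. The sketch below is not the solver used in the paper: the function name, the iteration limits, and the assumption that the chain is irreducible and aperiodic (so that the iteration converges) are choices made for the example.

```python
def stationary_distribution(W, iters=1000, tol=1e-12):
    """Approximate the stationary probability of the Markov chain built from W.

    W is a square weighted adjacency matrix given as a list of rows.
    Assumes the chain is irreducible and aperiodic; otherwise power
    iteration may not converge.
    """
    n = len(W)
    # Row-normalize W into the transition probability matrix P.
    P = []
    for row in W:
        s = sum(row)
        if s <= 0:
            raise ValueError("every node needs at least one outgoing edge")
        P.append([w / s for w in row])
    pi = [1.0 / n] * n
    for _ in range(iters):
        # One step of pi <- P^T pi, followed by renormalization.
        new = [sum(P[i][j] * pi[i] for i in range(n)) for j in range(n)]
        total = sum(new)
        new = [v / total for v in new]
        converged = max(abs(a - b) for a, b in zip(new, pi)) < tol
        pi = new
        if converged:
            break
    return pi


# Usage on a small fully connected digraph with unequal weights.
W = [[0.0, 2.0, 1.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
print(stationary_distribution(W))
```

Points with larger entries in the returned vector are visited more often by the random walk, which is the sense in which the stationary probability measures density.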
1.2 Indegree-to-outdegree ratio
The curriculum measured with the stationary probability separates clusters if the density discrepancy between different clusters is obvious. Here we introduce another type of metric that draws data points from all clusters for each curriculum. The metric relies on the dual high-order degrees of the digraph, i.e., the indegree and the outdegree Zhao and Tang (2014). Formally, we compute the dual degrees of high order by
$$d^{\mathrm{in}} = (W^{t})^{\top}\mathbf{1}, \qquad d^{\mathrm{out}} = W^{t}\mathbf{1}, \tag{S2}$$
where $t$ is the power of the weighted adjacency matrix $W$ and $\mathbf{1}$ is the all-one vector. Then the degree-based centrality is given by the indegree-to-outdegree ratio
$$c_i = \frac{d^{\mathrm{in}}_i}{d^{\mathrm{out}}_i}. \tag{S3}$$
The degree-ratio metric makes every cluster curriculum contain data points from all clusters, thus promoting the distributional diversity of cluster curriculum, as illustrated in Figure S1.
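A minimal sketch of the degree-ratio centrality is given below. It assumes that the high-order degrees are the column sums and row sums of the $t$-th power of the weighted adjacency matrix; the function name, the default `t`, and the small constant `eps` guarding against division by zero are hypothetical choices for the example.

```python
def degree_ratio_centrality(W, t=2, eps=1e-12):
    """Indegree-to-outdegree ratio computed from the t-th power of W.

    Assumes d_out = W^t 1 and d_in = (W^t)^T 1, evaluated by repeated
    matrix-vector products so that W^t is never formed explicitly.
    """
    n = len(W)
    out_deg = [1.0] * n
    in_deg = [1.0] * n
    for _ in range(t):
        # out_deg <- W out_deg, so after t steps out_deg = W^t 1.
        out_deg = [sum(W[i][j] * out_deg[j] for j in range(n)) for i in range(n)]
        # in_deg <- W^T in_deg, so after t steps in_deg = (W^t)^T 1.
        in_deg = [sum(W[i][j] * in_deg[i] for i in range(n)) for j in range(n)]
    # eps keeps the ratio finite for nodes whose outdegree is zero.
    return [d_in / (d_out + eps) for d_in, d_out in zip(in_deg, out_deg)]


# Usage: node 1 receives two edges, node 2 receives none.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
print(degree_ratio_centrality(W, t=1))
```

A node that many others point to, relative to how strongly it points outward, receives a high ratio regardless of which cluster it belongs to.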
2 Theorem 1 and Proof
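Theorem 1. For a set of data points drawn from the normal distribution $\mathcal{N}(\mathbf{0}, \Sigma)$, the ellipsoid $E_\alpha$ of confidence $1-\alpha$ is defined as $x^{\top}\Sigma^{-1}x \le \chi^2_\alpha$, where $\Sigma \in \mathbb{R}^{d \times d}$ has no zero eigenvalues and $\alpha \in [0, 1]$. Let $\rho(\alpha)$ be the number of spheres of radius $r$ packed in the ellipsoid $E_\alpha$. Then we can establish
$$\rho(\alpha) = \frac{(\chi^2_\alpha)^{d/2}\sqrt{\det\Sigma}}{r^{d}}. \tag{S4}$$
Proof. As explained in the main text, the ellipse equation with respect to the confidence $1-\alpha$ can be expressed as
$$x^{\top}\Sigma^{-1}x = \chi^2_\alpha. \tag{S5}$$
Suppose that $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $\Sigma$. Then, in the coordinates $y$ aligned with the eigenvectors of $\Sigma$, equation (S5) can be written as
$$\sum_{i=1}^{d}\frac{y_i^2}{\lambda_i} = \chi^2_\alpha. \tag{S6}$$
Further, eliminating $\chi^2_\alpha$ on the right side gives
$$\sum_{i=1}^{d}\frac{y_i^2}{\lambda_i\chi^2_\alpha} = 1. \tag{S7}$$
Then we derive the length of the $i$-th semi-axis with respect to $\chi^2_\alpha$, i.e.,
$$r_i = \sqrt{\lambda_i\chi^2_\alpha}. \tag{S8}$$
For a $d$-dimensional ellipsoid $E$, the volume of $E$ is
$$V(E) = \frac{\pi^{d/2}}{\Gamma(d/2+1)}\prod_{i=1}^{d} r_i, \tag{S9}$$
where $r_i$ is the length of the $i$-th semi-axis of $E$ and $\Gamma(\cdot)$ is the Gamma function. Substituting (S8) into the above equation and using $\prod_{i=1}^{d}\lambda_i = \det\Sigma$, we obtain the final formula of volume
$$V(E_\alpha) = \frac{\pi^{d/2}}{\Gamma(d/2+1)}(\chi^2_\alpha)^{d/2}\sqrt{\det\Sigma}. \tag{S10}$$
Using the volume formula in (S9) with all semi-axes equal to $r$, it is straightforward to get the volume of the packing sphere $S_r$:
$$V(S_r) = \frac{\pi^{d/2}}{\Gamma(d/2+1)}r^{d}. \tag{S11}$$
By the definition of $\rho(\alpha)$, we can write
$$\rho(\alpha) = \frac{V(E_\alpha)}{V(S_r)} = \frac{(\chi^2_\alpha)^{d/2}\sqrt{\det\Sigma}}{r^{d}}. \tag{S12}$$
This concludes the proof of the theorem. ∎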
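The result can be checked numerically with a short sketch. It assumes that the count of packed spheres is the ratio of the ellipsoid volume to the sphere volume, as in the proof, and that the semi-axes are the square roots of the covariance eigenvalues scaled by the chi-squared quantile. The function names are hypothetical, and the quantile is passed in as a number rather than computed, so no statistics library is needed.

```python
import math


def ellipsoid_volume(semi_axes):
    """Volume of a d-dimensional ellipsoid with the given semi-axis lengths."""
    d = len(semi_axes)
    # Work in log space so the Gamma function does not overflow for large d.
    log_v = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_v += sum(math.log(a) for a in semi_axes)
    return math.exp(log_v)


def packing_number(eigenvalues, chi2_quantile, r):
    """Ratio of the confidence-ellipsoid volume to the volume of a sphere of radius r."""
    d = len(eigenvalues)
    semi_axes = [math.sqrt(lam * chi2_quantile) for lam in eigenvalues]
    return ellipsoid_volume(semi_axes) / ellipsoid_volume([r] * d)


def packing_number_closed_form(eigenvalues, chi2_quantile, r):
    """Closed form: quantile^(d/2) * sqrt(det) / r^d, with det the eigenvalue product."""
    d = len(eigenvalues)
    det = math.prod(eigenvalues)
    return chi2_quantile ** (d / 2) * math.sqrt(det) / r ** d


# Usage in two dimensions, where the chi-squared quantile for tail
# probability alpha has the closed form -2 * ln(alpha).
alpha = 0.05
chi2 = -2.0 * math.log(alpha)
eigenvalues = [4.0, 1.0]  # eigenvalues of a hypothetical covariance matrix
print(packing_number(eigenvalues, chi2, r=0.1))
print(packing_number_closed_form(eigenvalues, chi2, r=0.1))
```

The two printed values agree up to floating-point error, since the dimension-dependent constant cancels in the ratio.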