Curriculum Learning for Deep Generative Models with Clustering

06/27/2019 ∙ by Deli Zhao, et al. ∙ Xiaomi ∙ Peking University

Training generative models such as generative adversarial networks (GANs) and normalizing flows is challenging for noisy data. This paper proposes a novel curriculum learning algorithm based on clustering to address this issue. The curriculum is constructed from the centrality of the underlying clusters in the data: data points of high centrality take priority in being fed into generative models during training. To make our algorithm scalable to large-scale data, the active set is devised, in the sense that every round of training proceeds only on an active subset containing a small fraction of already trained data and the incremental data of lower centrality. Moreover, a geometric analysis is presented to interpret the necessity of cluster curriculum for generative models. The experiments on cat and human-face data validate that our algorithm is able to learn the optimal generative models (e.g. ProGAN and Glow) with respect to specified quality metrics for noisy data. An interesting finding is that the optimal cluster curriculum is closely related to the critical point of a geometric percolation process formulated in the paper.




1 Introduction

Deep generative models have piqued researchers’ interest over the past decade. Fruitful progress has been achieved on this topic, such as the auto-encoder Hinton and Salakhutdinov (2006) and variational auto-encoder (VAE) Kingma and Welling (2013); Rezende et al. (2014), generative adversarial network (GAN) Goodfellow et al. (2014); Radford et al. (2016); Arjovsky et al. (2017), normalizing flow Rezende and Mohamed (2015); Dinh et al. (2015, 2017); Kingma and Dhariwal (2018), and auto-regressive models van den Oord et al. (2016b, a, 2017). However, it is non-trivial to train a deep generative model that converges to a proper minimum of the associated optimization problem. For example, GANs suffer from instability, mode collapse, and generative distortion during training. Many insightful algorithms have been proposed to circumvent those issues, including feature engineering Salimans et al. (2016), various discrimination metrics Mao et al. (2016); Arjovsky et al. (2017); Berthelot et al. (2017), distinctive gradient penalties Gulrajani et al. (2017); Mescheder et al. (2018), spectral normalization of the discriminator Miyato et al. (2018), and orthogonal regularization of the generator Brock et al. (2019). What is of particular interest is that a breakthrough for GANs has been made with the simple technique of progressively growing the neural networks of the generator and discriminator from low-resolution images to high-resolution counterparts Karras et al. (2018a). This kind of progressive growing also helps push the state of the art to a new level by enabling StyleGAN to produce photo-realistic and detail-sharp results Karras et al. (2018b), shedding new light on wide applications of GANs in solving real problems. This idea of progressive learning is actually a general manner of the cognition process Elman (1993); Oudeyer et al. (2007), which has been formally named curriculum learning in machine learning Bengio et al. (2009). The central topic of this paper is to explore a new curriculum for training deep generative models.

To facilitate robust training of deep generative models with noisy data, we propose curriculum learning with clustering. The key contributions are listed as follows:

  • We first summarize four representative curricula for generative models, i.e. architecture (generation capacity), semantics (data content), dimension (data space), and cluster (data distribution). Among these curricula, cluster curriculum is newly proposed in this paper.

  • Cluster curriculum treats data according to the centrality of each data point, which is pictorially illustrated and explained in detail. To foster large-scale learning, we devise the active set algorithm that only needs an active data subset of small fixed size for training.

  • A geometric principle is formulated to analyze the hardness of noisy data and the advantage of cluster curriculum. The geometry pertains to counting the small spheres packed in an ellipsoid, on which the percolation theory we use is based.

The research on curriculum learning is diverse. Our work focuses on curricula that are closely related to data attributes; other kinds of curricula are beyond the scope of this paper.

2 Curriculum learning

Curriculum learning has been a basic learning strategy for improving the performance of algorithms in machine learning. We quote the original explanation from the seminal paper Bengio et al. (2009) as its definition:

Curriculum learning.

“The basic idea is to start small, learn easier aspects of the task or easier sub-tasks, and then gradually increase the difficulty level” according to pre-defined or self-learned curricula.

From a cognitive perspective, curriculum learning is common in human and animal learning when they interact with environments Elman (1993), which is why it is natural as a learning rule for machine intelligence. The learning process of cognitive development is gradual and progressive Oudeyer et al. (2007). In practice, the design of curricula is task-dependent and data-dependent. Here we summarize the representative curricula that have been developed for generative models.

Architecture curriculum. The deep neural architecture itself can be viewed as a curriculum from the viewpoint of learning concepts Hinton and Salakhutdinov (2006); Bengio et al. (2006) or disentangling representations Lee et al. (2011). For supervised learning, the shallow layers decompose low-level features of objects while the deep layers yield high-level features of more abstract concepts for recognizing objects Lee et al. (2011); Zeiler and Fergus (2014); Zhou et al. (2016). For generative models, the analogous phenomenon is also observed in GANs Bau et al. (2018). In fact, progressively growing architectures from shallow layers of few parameters to deep layers of more parameters has been successfully harnessed in a variety of tasks using GANs Karras et al. (2018a); Korkinof et al. (2018); Karras et al. (2018b) and autoencoders Heljakka et al. (2018).

Semantics curriculum. The most intuitive content of each datum is the semantic information that the datum conveys. The hardness of semantics determines the difficulty of learning knowledge from data. Therefore, semantics can serve as a common curriculum. For instance, one can design an easy environment for a game to help the agent adapt to the task in deep reinforcement learning Justesen et al. (2018), and then gradually complexify the semantic structures of the environment, thus progressively increasing the difficulty level of the game. To unveil the capability of neural networks to learn number sense, the semantic curriculum may be designed with varying numbers of elements in images Zou and McClelland (2013). In light of this semantically progressive learning, the Weber function is gradually sharpened, indicating the effectiveness of learning cognitive concepts.

Dimension curriculum. High dimensionality usually makes machine learning difficult due to the curse of dimensionality Donoho (2000), in the sense that the number of data points needed to attain reliable performance grows exponentially with the dimension of variables Vershynin (2018). Therefore, algorithms are expected to benefit from learning over growing dimensions. The effectiveness of dimension curriculum is evident from recent progress on deep generative models. The photo-realistic results of very high dimensions for GANs Karras et al. (2018a, b) have been achieved by progressively growing the neural architecture with increasing image resolution. For emerging language generation, the dimension curriculum is the length of sequences Rajeswar et al. (2017); Press et al. (2017). The text generation begins with short sequences, and the length intervals are gradually increased to cover sequences of more complexity.

Figure 1: Cluster Curriculum. From magenta color to black color, the centrality of data points reduces. The value is the number of data points taken with centrality order.

3 Cluster curriculum

For fitting distributions, dense data points are generally easier to handle than sparse data or outliers. To train generative models robustly, therefore, it is plausible to propose cluster curriculum, meaning that generative algorithms first learn from data points close to cluster centers and then from more data progressively approaching cluster boundaries. Thus the stream of data points fed to models for curriculum learning follows the process of clustering data points according to cluster centrality, which will be explained in section 3.2. The toy example in Figure 1 illustrates how to form cluster curriculum.

3.1 Why clusters matter

The importance of clusters for data points is actually obvious from a geometric point of view. Data sparsity in high-dimensional spaces makes it difficult to fit the underlying distribution of data points Vershynin (2018). So generative algorithms may benefit from proceeding from the local spaces where data points are relatively dense. Such data points form clusters that are generally informative subsets with respect to the entire dataset. In addition, clusters contain common regular patterns of data points, on which generative models converge more easily. Most importantly, noisy data points deteriorate the performance of algorithms. For classification, curriculum learning is theoretically proven to circumvent the negative influence of noisy data Gong et al. (2016). We will analyze this aspect for generative models with geometric facts.

3.2 Generative models with cluster curriculum

With cluster curriculum, we are allowed to gradually learn generative models from dense clusters to cluster boundaries and finally to all data points. In this way, generative algorithms are capable of avoiding the direct harm of noise or outliers. To this end, we first need a measure called centrality, a term from graph-based clustering. It quantifies the compactness of a cluster in data points or a community in complex networks Newman (2010). A large centrality implies that the associated data point is close to one of the cluster centers. For easy reference, we provide the algorithms for computing two types of centralities in the supplementary material. For the experiments in this paper, all the cluster curricula are constructed by the centrality of the stationary probability distribution, i.e. the eigenvector corresponding to the largest eigenvalue of the transition probability matrix drawn from the data.

To be specific, let $c$ denote the centrality vector of $n$ data points. Namely, the $i$-th entry $c_i$ of $c$ is the centrality of data point $x_i$. Sorting $c$ in descending order and adjusting the order of the original data points accordingly gives the data points arranged by cluster centrality. Let $X = \{X_0, X_1, \dots, X_m\}$ signify the set of centrality-sorted data points, where $X_0$ is the base set that guarantees a proper convergence of generative models, and the rest of $X$ is evenly divided into $m$ subsets according to centrality order. In general, the number of data points in $X_0$ is moderate compared to $n$ and determined according to the data. Such a division of $X$ serves the efficiency of training, because we do not need to train models from a very small dataset. Cluster-curriculum learning is carried out by incrementally feeding the subsets in $X$ into generative algorithms. In other words, algorithms are successively trained on $X_0 \cup \dots \cup X_{i+1}$ after $X_0 \cup \dots \cup X_i$, meaning that the curriculum for each round of training is accumulated with $X_{i+1}$.
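The construction described above (sort by centrality, keep a base set, split the rest evenly, then accumulate) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names `build_cluster_curriculum` and `cumulative_curricula` are hypothetical, and sending the remainder of an uneven split to the last subset is an assumption, since the text does not say how remainders are handled.

```python
from typing import List, Sequence


def build_cluster_curriculum(
    centrality: Sequence[float], base_size: int, num_subsets: int
) -> List[List[int]]:
    """Split point indices into a base set plus `num_subsets` increments.

    Returns index lists ordered from high to low centrality. Any remainder
    of the even split goes to the last subset (an assumption).
    """
    n = len(centrality)
    if not 0 < base_size <= n or num_subsets < 1:
        raise ValueError("need 0 < base_size <= n and num_subsets >= 1")
    # Sort indices by descending centrality.
    order = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    subsets = [order[:base_size]]
    rest = order[base_size:]
    step = len(rest) // num_subsets
    for j in range(num_subsets):
        start = j * step
        end = (j + 1) * step if j < num_subsets - 1 else len(rest)
        subsets.append(rest[start:end])
    return subsets


def cumulative_curricula(subsets: List[List[int]]) -> List[List[int]]:
    """Curriculum i is the base set plus the first i incremental subsets."""
    curricula, current = [], []
    for subset in subsets:
        current = current + subset
        curricula.append(list(current))
    return curricula


centrality = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2, 0.8]
subsets = build_cluster_curriculum(centrality, base_size=3, num_subsets=2)
curricula = cumulative_curricula(subsets)
```

Working with indices rather than the data itself keeps the split independent of how images or features are stored.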

To determine the optimal curriculum, we need the aid of a quality metric for generative models, such as the Fréchet inception distance (FID) or sliced Wasserstein distance (SWD) Borji (2018). For the generative model trained with each curriculum, we calculate the associated score via the specified quality metric. The optimal curriculum for effective training can then be identified as the one attaining the minimal score over all curricula. The interesting phenomenon of this score curve will be illustrated in the experiments. The minimum of the score is apparently metric-dependent. One can refer to Borji (2018) for a review of evaluation metrics. In practice, we can opt for one reliable metric, or use multiple metrics to decide on the optimal model.

There are two ways of using the incremental subsets during training. One is that the parameters of models are re-randomized when new data are used, the procedure of which is given in Algorithm 1 in the supplementary material. The other is that the parameters are fine-tuned based on the pre-training of the previous model, which will be presented with a fast learning algorithm in the following section.
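The first way (fresh parameters for every curriculum, followed by picking the curriculum with the minimal score) amounts to a short loop. The sketch below is only a skeleton under stated assumptions: `train_fn` and `score_fn` are hypothetical callbacks standing in for a real generative model and a real metric such as FID, and the toy stand-ins at the bottom exist purely to make the example runnable.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Model = TypeVar("Model")


def select_optimal_curriculum(
    curricula: Sequence[Sequence[int]],
    train_fn: Callable[[Sequence[int]], Model],
    score_fn: Callable[[Model], float],
) -> Tuple[int, Model, List[float]]:
    """Train one freshly initialized model per curriculum and keep the best.

    train_fn (hypothetical) trains a new model on the given data indices;
    score_fn (hypothetical) returns a quality score where lower is better.
    """
    models = [train_fn(curriculum) for curriculum in curricula]
    scores = [score_fn(model) for model in models]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, models[best], scores


# Toy stand-ins: the "model" is just the curriculum size, and the score is
# V-shaped with its minimum at 5 points, mimicking a noisy tail.
toy_curricula = [list(range(k)) for k in (3, 5, 7)]
best, model, scores = select_optimal_curriculum(
    toy_curricula, train_fn=len, score_fn=lambda size: abs(size - 5) + 1.0
)
```

Keeping the full list of scores, not only the minimum, makes it possible to plot the score curve against the curriculum size afterwards.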

3.3 Active set for scalable training

To locate the minimum of the score precisely, the cardinality of each incremental subset needs to be set much smaller than the total number of data points, meaning that the number of subsets will be large even for a dataset of moderate scale. Training over so many loops is time-consuming. Here we propose the active set to address the issue, in the sense that for each loop of cluster curriculum, generative models are always trained with a subset of small fixed size instead of the accumulated curriculum, whose size becomes incrementally large.

(a) (b)
Figure 2: Schematic illustration of active set for cluster curriculum. Here . The cardinality of the active set is . When is taken for training, we need to randomly sample another (i.e. ) data points from the history data to form . Then the complete active set is composed by . We can see that data points in become less dense after sampling.

To form the active set, a subset of data points is randomly sampled from the already trained (history) data and combined with the new incremental subset for the next loop, so that the size of the active set stays fixed. For easy understanding, we illustrate the active set with a toy example in Figure 2. In this scenario, progressive pre-training must be applied, meaning that the update of model parameters for the current training is based on the parameters of the previous loop. The procedure of cluster curriculum with active set is detailed in Algorithm 2 in the supplementary material.

The active set allows us to train generative models with a small dataset that is actively adapted, thereby significantly reducing the training time for large-scale data.
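A minimal sketch of how the active set could be formed in each loop is given below. It is an illustration rather than the authors' implementation: the helper names `form_active_set` and `active_set_schedule` are hypothetical, and taking all available history when it is smaller than the requested sample is an assumption for the early loops.

```python
import random
from typing import Iterator, List, Sequence


def form_active_set(
    history: Sequence[int],
    new_subset: Sequence[int],
    active_size: int,
    rng: random.Random,
) -> List[int]:
    """Combine a new subset with a random sample of already trained data."""
    num_from_history = active_size - len(new_subset)
    if num_from_history < 0:
        raise ValueError("active_size must be at least len(new_subset)")
    # Early loops may have less history than requested; take what exists.
    num_from_history = min(num_from_history, len(history))
    sampled = rng.sample(list(history), num_from_history)
    return sampled + list(new_subset)


def active_set_schedule(
    subsets: Sequence[Sequence[int]], active_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Yield the training set of each loop: the base set, then active sets."""
    rng = random.Random(seed)
    history = list(subsets[0])
    yield list(history)
    for new_subset in subsets[1:]:
        yield form_active_set(history, new_subset, active_size, rng)
        history.extend(new_subset)  # the new data joins the history


subsets = [[0, 6, 3, 8], [2, 4], [5, 1]]
rounds = list(active_set_schedule(subsets, active_size=4))
```

Every round after the first has the same size, so the cost per loop stays constant while the history keeps growing.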

4 Geometric view of cluster curriculum

Cluster curriculum bears an interesting relation to high-dimensional geometry, which provides a geometric understanding of our algorithm. Without loss of generality, we work on a cluster obeying a normal distribution. The characteristics of the cluster can be extended to other clusters of the same distribution. For easy analysis, let us begin with a toy example. As Figure 3a shows, the confidence ellipse fitted from the subset of centrality-ranked data points is nearly conformal to that of all data points, allowing us to relate these two ellipses by virtue of the confidence-level equation. Let $\mu$ and $\Sigma$ signify the center and the covariance matrix of the cluster of interest. To make it formal, we can write the equation as

$$(x - \mu)^\top \Sigma^{-1} (x - \mu) = \chi^2_\alpha(d),$$

where $\chi^2_\alpha(d)$ can be the chi-squared distribution or Mahalanobis distance square, $d$ is the degree of freedom, and $\alpha$ is the confidence level. For conciseness, we write $\chi^2_\alpha(d)$ as $\chi^2_\alpha$ in the following context. Then the inner and outer ellipses correspond to $\chi_\alpha$ and $\chi_\beta$, respectively, where $\chi_\alpha \le \chi_\beta$.

(a) (b)
Figure 3: Illustration of growing one cluster in cluster curriculum. (a) Data points taken with large centrality. (b) The annulus formed by removing the inner ellipse from the outer one.

To analyze the hardness of training generative models, a fundamental aspect is to examine the number $n$ of given data points falling in a geometric entity (for cluster curriculum, an annulus explained shortly) and the number $N$ of lattice points in it. The smaller $n$ is compared to $N$, the harder the problem will be. However, the enumeration of lattice points is computationally prohibitive in high dimensions. Inspired by the information theory of encoding data of normal distributions Roman (1996), we count the number of small spheres of radius $\varepsilon$ packed in the ellipsoid instead. Thus we can use this number in place of $N$ as long as the radius of the sphere is set properly. With a slight abuse of notation, we still use $N$ to denote the packing number in the following context. Theorem 1 gives the exact form of $N$.

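Theorem 1.

For a set of data points drawn from the normal distribution $\mathcal{N}(\mu, \Sigma)$, the ellipsoid $\mathcal{E}_\alpha$ of confidence $\alpha$ is defined as $(x - \mu)^\top \Sigma^{-1} (x - \mu) \le \chi^2_\alpha$, where $\Sigma$ has no zero eigenvalues and $x \in \mathbb{R}^d$. Let $N(\mathcal{E}_\alpha)$ be the number of spheres of radius $\varepsilon$ packed in the ellipsoid $\mathcal{E}_\alpha$. Then we can establish

$$N(\mathcal{E}_\alpha) = \left(\frac{\chi_\alpha}{\varepsilon}\right)^d \sqrt{\det \Sigma}.$$

We see that $N(\mathcal{E}_\alpha)$ admits a tidy form with Mahalanobis distance $\chi_\alpha$, dimension $d$, and sphere radius $\varepsilon$ as variables. The proof is provided in the supplementary material.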
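Following the proof in the supplementary material, the packing number is read here as the ratio of the ellipsoid volume to the volume of one small sphere, which works out to (chi / eps)^d times the square root of det(Sigma). The sketch below evaluates that ratio in the log domain, since the d-th power overflows quickly in high dimensions. The function names and the choice to pass the covariance through its eigenvalues are assumptions made for illustration.

```python
import math
from typing import Sequence


def log_packing_number(chi: float, eigenvalues: Sequence[float], eps: float) -> float:
    """Log of (chi / eps)^d * sqrt(det Sigma), Sigma given by its eigenvalues."""
    if chi <= 0 or eps <= 0 or any(lam <= 0 for lam in eigenvalues):
        raise ValueError("chi, eps and all eigenvalues must be positive")
    d = len(eigenvalues)
    log_sqrt_det = 0.5 * sum(math.log(lam) for lam in eigenvalues)
    return d * (math.log(chi) - math.log(eps)) + log_sqrt_det


def annulus_packing_number(
    chi_inner: float, chi_outer: float, eigenvalues: Sequence[float], eps: float
) -> float:
    """Packing number of the outer ellipsoid minus that of the inner one."""
    outer = math.exp(log_packing_number(chi_outer, eigenvalues, eps))
    inner = math.exp(log_packing_number(chi_inner, eigenvalues, eps))
    return outer - inner


# Unit covariance in d = 2: a disc of radius 3 against discs of radius 0.5.
n_disc = math.exp(log_packing_number(3.0, [1.0, 1.0], 0.5))
n_ring = annulus_packing_number(1.5, 3.0, [1.0, 1.0], 0.5)
```

In this toy case the disc holds (3 / 0.5)^2 = 36 volume units and the ring 36 - 9 = 27; note that the count is a volume ratio and ignores the gaps a physical sphere packing would leave.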

The geometric region of interest for cluster curriculum is the annulus formed by removing the inner ellipsoid from the outer ellipsoid (the ellipse refers to the surface and the ellipsoid refers to the elliptic ball), as Figure 3b displays. We investigate the varying law between the number of data points and the packing number in the annulus when the inner ellipse grows with cluster curriculum. For this purpose, we need the following two corollaries that immediately follow from Theorem 1.

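Corollary 1.

Let $N(\mathcal{A})$ be the number of spheres of radius $\varepsilon$ packed in the annulus $\mathcal{A}$ that is formed by removing the ellipsoid $\mathcal{E}_\alpha$ from the ellipsoid $\mathcal{E}_\beta$, where $\chi_\alpha \le \chi_\beta$. Then the following identity holds:

$$N(\mathcal{A}) = N(\mathcal{E}_\beta) - N(\mathcal{E}_\alpha) = \left(\chi_\beta^d - \chi_\alpha^d\right) \frac{\sqrt{\det \Sigma}}{\varepsilon^d}.$$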

Figure 4: Comparison between the number of data points sampled from isotropic normal distributions and the number of spheres (lattice) packed in the annulus, with respect to the Chi quantile. $d$ is the dimension of data points. For each dimension, we sample 70,000 data points from an isotropic normal distribution. The counts of data points are normalized by 10,000.
Corollary 2.


It is obvious that goes infinite when under the conditions that and is bounded. Besides, when (cluster) grows, reduces with exponent if the ellipsoid is fixed.

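In light of Corollary 1, we can now demonstrate the functional law between the number of data points and the packing number in the annulus. First, we determine $\chi_\beta$ as follows

$$\chi_\beta = \max_{i} \sqrt{(x_i - \mu)^\top \Sigma^{-1} (x_i - \mu)},$$

which means that $\mathcal{E}_\beta$ is the ellipsoid of minimal Mahalanobis distance to the center that contains all the data points in the cluster. In addition, we need to estimate a suitable sphere radius $\varepsilon$, such that the number of data points and the packing number are comparable in scale. To achieve this, we define an oracle ellipse on which the two numbers are equal. For simplicity, we let $\mathcal{E}_\beta$ be the oracle ellipse. Thus we can determine $\varepsilon$ with Corollary 3.

Corollary 3.

If we let $\mathcal{E}_\beta$ be the oracle ellipse such that $N(\mathcal{E}_\beta) = n$, where $n$ is the number of data points in the cluster, then the free parameter $\varepsilon$ can be computed with $\varepsilon = \chi_\beta \left(\sqrt{\det \Sigma} / n\right)^{1/d}$.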

To make the demonstration tractable, the data points we use for simulation are assumed to obey the isotropic normal distribution, meaning that data points are generated with nearly equal variance along each dimension. Figure 4 shows that the number of data points in the annulus gradually exhibits the critical phenomena of percolation processes when the dimension goes large, implying that the data points in the annulus are significantly reduced when the inner ellipse grows a little bigger near the critical point. (Percolation theory is a fundamental tool for studying the structure of complex systems in statistical physics and mathematics. The critical point is the percolation threshold where the transition takes place. One can refer to Stauffer and Aharony (1994) if interested.) In contrast, the number of lattice points is still large and varies negligibly until the inner ellipse approaches the boundary. This discrepancy indicates clearly that fitting data points in the annulus is pretty hard, and that guaranteeing the precision is nearly impossible when crossing the critical point, even for a moderate dimension. Therefore, the plausibility of cluster curriculum can be drawn naturally from this geometric fact.
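The data side of this comparison is easy to simulate. The sketch below is illustrative only: the dimension (50), sample size (2,000), and the grid of 101 quantiles are assumptions chosen to keep it fast, not the settings behind Figure 4. It counts the sampled points left in the annulus as the inner quantile grows, and locates the critical point as the largest absolute discrete difference of that curve, the rule stated in the caption of Figure 8.

```python
import math
import random
from typing import List


def annulus_counts(radii: List[float], grid: List[float]) -> List[int]:
    """For each inner quantile chi in an ascending grid, count radii > chi."""
    ordered = sorted(radii)
    counts, k = [], 0
    for chi in grid:
        while k < len(ordered) and ordered[k] <= chi:
            k += 1
        counts.append(len(ordered) - k)
    return counts


def critical_index(counts: List[int]) -> int:
    """Index of the largest absolute discrete difference of the curve."""
    diffs = [abs(counts[i + 1] - counts[i]) for i in range(len(counts) - 1)]
    return max(range(len(diffs)), key=diffs.__getitem__)


rng = random.Random(0)
d, n = 50, 2000
# For an isotropic standard normal sample, the Mahalanobis distance to the
# center is simply the Euclidean norm.
radii = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))) for _ in range(n)]
chi_max = max(radii)  # outer ellipsoid: the smallest one containing all points
grid = [chi_max * (t / 100) for t in range(101)]  # growing inner quantile
counts = annulus_counts(radii, grid)
chi_critical = grid[critical_index(counts)]
```

With these settings the counts stay flat for small quantiles and then collapse within a narrow band around the square root of the dimension, which is where the sketch places the critical point.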

Figure 5: Examples of LSUN cat dataset and CelebA face dataset. The samples in the first row are of high centrality, and the samples of low centrality in the second row are what we call noisy data or outliers in this context.

5 Experiment

The generative models that we use for experiments are progressive growing of GANs (ProGAN) Karras et al. (2018a) and Glow, a normalizing-flow algorithm Kingma and Dhariwal (2018). These two algorithms are chosen because they are the state of the art with official open-source implementations available. According to convention, we opt for the Fréchet inception distance (FID) as the quality metric for ProGAN. Considering that the backgrounds of face images generated by Glow are nearly uniform, we take the sliced Wasserstein distance (SWD) as the accuracy measure of the Glow model. One can refer to Borji (2018) for details, where various metrics for generative models are compared.

5.1 Dataset and experimental setting

We randomly sample 200,000 cat images from the LSUN dataset Yu et al. (2015). These cat images are captured in the wild, so their styles vary significantly. Figure 5 shows cat examples of high and low centralities. We can see that the noisy cat images differ much from the clean ones. There are actually images that contain very few informative cat features, which are the outliers we refer to. The curriculum is set with a base set of 20,000 images and incremental subsets of 10,000 images each, which means that the algorithms are trained with 20,000 images first and, after the initial training, another 10,000 images in centrality order are merged into the current training data for each further re-training. For active set, its size is fixed to be .

The CelebA dataset is a large-scale face attribute dataset Liu et al. (2015). We use the cropped and well-aligned faces with a bit of image backgrounds preserved for generation task. For cluster-curriculum learning, we randomly sample 70,000 faces as the training set. The face examples of different centralities are shown in Figure  5. The curriculum parameters are set as and . We bypass the experiment of active set on faces because it is used for large-scale data.

Each image in the two datasets is resized to be . To form cluster curricula, we exploit ResNet34 He et al. (2016) pre-trained on ImageNet Russakovsky et al. (2015) to extract 512-dimensional features for each face and cat image. The directed graphs are built with these feature vectors. We determine the free parameter of edge weights by enforcing the geometric mean of weights to be 0.8. The number of nearest neighbors is set to be . The centrality is the stationary probability distribution. All code is written with TensorFlow.
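The geometric-mean rule has a closed form once a kernel is fixed. The sketch below assumes the Gaussian-type weight exp(-dist^2 / sigma^2); that kernel is an assumption made for illustration and may differ from the one used in the experiments. Under it, the geometric mean of the weights equals exp(-mean(dist^2) / sigma^2), so sigma follows directly from the target value 0.8. The function name `sigma_from_geometric_mean` is hypothetical.

```python
import math
from typing import Sequence


def sigma_from_geometric_mean(distances: Sequence[float], target: float = 0.8) -> float:
    """Solve for sigma so that the geometric mean of exp(-d^2 / sigma^2) is target.

    Assumed kernel: w = exp(-d^2 / sigma^2). Its geometric mean over all edges
    is exp(-mean(d^2) / sigma^2), hence sigma^2 = -mean(d^2) / ln(target).
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    mean_sq = sum(dist * dist for dist in distances) / len(distances)
    return math.sqrt(-mean_sq / math.log(target))


knn_distances = [0.5, 1.0, 1.5, 2.0]  # toy distances to nearest neighbors
sigma = sigma_from_geometric_mean(knn_distances)
weights = [math.exp(-((dist / sigma) ** 2)) for dist in knn_distances]
geo_mean = math.exp(sum(math.log(w) for w in weights) / len(weights))
```

Tying sigma to the data in this way removes a free parameter that would otherwise have to be tuned per dataset.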

Figure 6: FID curves of cluster-curriculum learning for ProGAN on the cat dataset. The centrality and the FID share the x-axis because they have the same order of data points. The matching colors of the y-axis labels and the curves indicate which axis each curve belongs to. The network parameters for “normal training” are randomly re-initialized for each re-training. The active set is based on progressive pre-training with a fixed small dataset. The scale of the x-axis is normalized by 10,000.

5.2 Experimental result

From Figure 6, we can see that the FID curves are all nearly V-shaped, indicating that global minima exist amid the training process. This is clear evidence that noisy data and outliers deteriorate the quality of generative models during training. From the optimal curricula found by two algorithms (i.e. curricula at 110,000 and 100,000), we can see that the curriculum of the active set differs from that of normal training only by one-step data increment, implying that the active set is reliable for fast cluster-curriculum learning. The performance of the active set measured by FID is much worse than that of normal training, especially when more noisy data are fed into generative models. However, this does not change the whole V-shape of the accuracy curve. Namely, it is applicable as long as the active set admits the metric minimum corresponding to the appropriate curriculum.

(a) ProGAN (b) Glow
Figure 7: FID curves of cluster-curriculum learning on CelebA face dataset. The figurative information is the same with the caption of Figure 6.

The clear V-shape of the centrality-FID curve on the cat data arises because the noisy data of low centrality contain little effective information to characterize the cats, as already displayed in Figure 5. However, it is different for the CelebA face dataset, where the face images of low centrality also convey part of the face features. As is evident from Figure 7a, ProGAN keeps being optimized by the majority of data until the curriculum of size . To highlight the meaning of this nearly negligible minimum, we also conduct exactly the same experiment on the FFHQ face dataset containing high-quality face images Karras et al. (2018b). For FFHQ data, the noisy face data can be ignored. The gray curve of normal training in Figure 7a indicates that the FID of ProGAN decreases monotonically over all curricula. This gentle difference between the ends of the FID curves for CelebA and FFHQ clearly demonstrates the difficulty that noisy data pose to generative algorithms. To further unveil this effect, we run cluster-curriculum learning on CelebA faces with Glow. The performance of Glow is sensitive to the variation of cluster densities Dahl et al. (2017); Kingma and Dhariwal (2018). So Glow is expected to reflect the change in the noise level of the data. To stabilize Glow, we take in this trial. As Figure 7b displays, Glow admits a sharp minimum at , i.e. one incremental curriculum of size in ProGAN. The one-step curriculum difference is good enough to support the plausibility of cluster-curriculum learning.

(a) cat data (b) face data
Figure 8: Geometric phenomenon of cluster curriculum on LSUN cat and CelebA face datasets. The pink strips are intervals of optimal curricula derived by generative models. For example, the value of the pink interval in (a) is obtained by , where is one of the minima (i.e. 110,000) in Figure 6. The others are derived in the same way. The subtraction transforms the data number in the cluster to be the one in the annulus. The critical points are determined by searching the maxima of the absolute discrete difference of the associated curves. The scales of -axes are normalized by 10,000.

5.3 Geometric investigation

To understand cluster curriculum more deeply, we employ the geometric method formulated in section 4 to analyze the cat and face data. The percolation processes are both conducted with the 512-dimensional features from ResNet34. Figure 8 displays the curve of the number of data points in the annulus, which is the variable of interest in this scenario. As expected, the critical point in the percolation process occurs for both cases, as shown by the blue curves. An obvious fact is that the optimal curricula (red strips) both fall into the (feasible) domains of percolation processes after critical points, as indicated by gray color. This is a desirable property because data become rather sparse in the annuli when crossing the critical points. Then noisy data play a non-negligible role in tuning the parameters of generative models. Therefore, a fast learning strategy can be derived from the percolation process: training may begin from the curriculum specified by the critical point, thus significantly accelerating cluster-curriculum learning.

Another intriguing phenomenon is that the noisier the data, the closer the optimal interval (red strip) is to the critical point. We can see that the optimal interval of the cat data is much closer to the critical point than that of the face data. What surprises us here is that the optimal interval of cluster curricula associated with the cat data nearly coincides with the critical point of the percolation process in the annulus! This means that the optimal curriculum may be found at the intervals close to the critical point of percolation for heavily noisy data, thus affording great convenience to learning an appropriate generative model for such datasets.

6 Analysis and conclusion

The cluster curriculum is proposed for robust training of generative models. The active set of cluster curriculum is devised to facilitate scalable learning. The geometric principle behind cluster curriculum is analyzed in detail as well. The experimental results on the LSUN cat dataset and CelebA face dataset demonstrate that generative models trained with cluster curriculum are capable of learning the optimal parameters with respect to the specified quality metric, such as Fréchet inception distance and sliced Wasserstein distance. Geometric analysis indicates that the optimal curricula obtained from generative models are closely related to the critical points of the associated percolation processes established in this paper. This intriguing geometric phenomenon is worth exploring further in terms of the theoretical connection between generative models and high-dimensional geometry.

It is worth emphasizing that model optimality here refers to the global minimum of the centrality-FID curve. As we already noted, the optimality is metric-dependent. We are able to obtain the optimal model with cluster curriculum, which does not mean that the algorithm only serves this purpose. We know that more informative data can help learn a more powerful model covering large data diversity. Here a trade-off arises between the robustness against noise and the capacity of fitting more data. The centrality-FID curve provides a visual tool to monitor the state of model training, thus aiding us in understanding the learning process and selecting suitable models according to the noise level of the given data. For instance, we can pick the trained model close to the optimal curriculum for heavily noisy data or the one near the end of the centrality-FID curve for datasets with little noise. In fact, this may be the most common way of using cluster curriculum.

In this paper, we do not investigate the cluster-curriculum learning for the multi-class case, e.g. the ImageNet dataset with BigGAN Brock et al. (2019). The cluster-curriculum learning of multiple classes is more complex than that we have already analyzed on face and cat data. We leave this study for future work.


1 Centrality measure

The centrality or clustering coefficient pertaining to a cluster in data points or a community in a complex network is a well-studied traditional topic in machine learning and complex systems. Here we introduce two graph-theoretic centralities for use in cluster curriculum. Firstly, we construct a directed graph (digraph) with nearest neighbors. The weighted adjacency matrix of the digraph is formed in this way: the entry for an ordered pair of data points is a weight computed from the distance between the two points and a free parameter if the second point is one of the nearest neighbors of the first, and zero otherwise.
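A small sketch of such a nearest-neighbor digraph follows. It is illustrative only: the Gaussian-type weight exp(-dist^2 / sigma^2), the Euclidean distance, and the dense list-of-lists representation are assumptions, and the function name `knn_digraph` is hypothetical. A practical implementation for large datasets would use a sparse matrix and an approximate neighbor search.

```python
import math
from typing import List, Sequence


def knn_digraph(
    points: Sequence[Sequence[float]], k: int, sigma: float
) -> List[List[float]]:
    """Dense weighted adjacency matrix of a k-nearest-neighbor digraph."""
    n = len(points)
    if not 0 < k < n:
        raise ValueError("need 0 < k < number of points")
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = [(math.dist(points[i], points[j]), j) for j in range(n) if j != i]
        for dist, j in sorted(dists)[:k]:
            # Edge i -> j exists only if j is among the k nearest neighbors
            # of i; the Gaussian-type weight is an assumed kernel.
            weights[i][j] = math.exp(-((dist / sigma) ** 2))
    return weights


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
W = knn_digraph(points, k=2, sigma=1.0)
```

The far-away fourth point links to its two nearest neighbors, but no point links back to it, so the matrix is not symmetric; this asymmetry is what makes the graph directed.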

1.1 Stationary probability distribution

The density of data points can be quantified with the stationary probability distribution of a Markov chain. For a digraph built from data with weighted adjacency matrix $W$, the transition probability matrix $P$ can be derived by row normalization, say, $P_{ij} = W_{ij} / \sum_{k} W_{ik}$. Then the stationary probability $\pi$ can be obtained by solving an eigenvalue problem

$$P^\top \pi = \pi, \tag{S1}$$

where $\top$ denotes the matrix transpose. It is straightforward to know that $\pi$ is the eigenvector of $P^\top$ corresponding to the largest eigenvalue (i.e. 1). $\pi$ is also defined as a kind of PageRank in many scenarios.

For density-based cluster curriculum, the centrality coincides with the stationary probability . Figure 1 in the main context shows the plausibility of using the stationary probability distribution to quantify data density.
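One simple way to obtain the stationary probability is power iteration on the row-normalized matrix, sketched below. This is an illustration under assumptions, not the authors' solver: it presumes the chain is irreducible and aperiodic so that the iteration converges, and the function names are hypothetical. A PageRank-style teleportation term could be added when that assumption fails.

```python
from typing import List, Sequence


def row_normalize(weights: Sequence[Sequence[float]]) -> List[List[float]]:
    """Transition matrix by row normalization; every row needs a positive sum."""
    return [[w / sum(row) for w in row] for row in weights]


def stationary_distribution(
    transition: Sequence[Sequence[float]], tol: float = 1e-12, max_iter: int = 10_000
) -> List[float]:
    """Power iteration for pi = P^T pi, starting from the uniform distribution."""
    n = len(transition)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        # (P^T pi)_j = sum_i pi_i * P_ij
        nxt = [sum(pi[i] * transition[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


W = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
pi = stationary_distribution(row_normalize(W))
# Indices sorted from high to low centrality.
ranking = sorted(range(len(pi)), key=pi.__getitem__, reverse=True)
```

For this toy graph the stationary probabilities are 4/9, 3/9, and 2/9, so the first node, which receives the most incoming weight, ranks as the most central.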

1.2 Indegree-to-outdegree ratio

The curriculum measured with the stationary probability separates clusters if the density discrepancy between different clusters is obvious. Here we introduce another type of metric that is applicable to taking data points from all clusters for each curriculum. The metric relies on the dual high-order degrees of the digraph, i.e. the indegree and the outdegree Zhao and Tang (2014). Formally, we compute the dual degrees of high order by


where is the power of and is the all-one vector. Then the degree-based centrality is put by the indegree-to-outdegree ratio


The metric of degree ratio makes every cluster curriculum contain data points of all clusters, thus facilitating the distributional diversity of cluster curriculum, as illustrated in Figure S1.
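The sketch below shows one plausible reading of the degree ratio, stated as an assumption: the order-t indegree is taken as the t-th power of the transposed weighted adjacency matrix applied to the all-one vector, the order-t outdegree as the t-th power of the matrix itself applied to the all-one vector, and the centrality as their elementwise ratio. The function names and the default order are hypothetical.

```python
from typing import List, Sequence


def matvec(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def degree_ratio_centrality(
    weights: Sequence[Sequence[float]], order: int = 1
) -> List[float]:
    """Assumed form: ratio of order-t indegree to order-t outdegree."""
    n = len(weights)
    transposed = [list(col) for col in zip(*weights)]
    indeg, outdeg = [1.0] * n, [1.0] * n
    for _ in range(order):
        indeg = matvec(transposed, indeg)  # incoming weight, propagated
        outdeg = matvec(weights, outdeg)   # outgoing weight, propagated
    # Assumes every node has positive outdegree, as in a k-NN digraph.
    return [i / o for i, o in zip(indeg, outdeg)]


W = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
ratio = degree_ratio_centrality(W, order=1)
```

At first order the indegrees are the column sums (2, 2, 1) and the outdegrees the row sums (2, 1, 2), so the ratios are 1.0, 2.0, and 0.5. Because the ratio is relative to each node's own outgoing weight, it is less dominated by the absolute density of a cluster than the stationary probability.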

Figure S1: Cluster Curriculum based on indegree-to-outdegree ratio. From magenta color to black color, the centrality of data points reduces. The value is the number of data points taken with centrality order.

2 Theorem 1 and Proof

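Theorem 1.

For a set of data points drawn from the normal distribution $\mathcal{N}(\mu, \Sigma)$, the ellipsoid $\mathcal{E}_\alpha$ of confidence $\alpha$ is defined as $(x - \mu)^\top \Sigma^{-1} (x - \mu) \le \chi^2_\alpha$, where $\Sigma$ has no zero eigenvalues and $x \in \mathbb{R}^d$. Let $N(\mathcal{E}_\alpha)$ be the number of spheres of radius $\varepsilon$ packed in the ellipsoid $\mathcal{E}_\alpha$. Then we can establish

$$N(\mathcal{E}_\alpha) = \left(\frac{\chi_\alpha}{\varepsilon}\right)^d \sqrt{\det \Sigma}. \tag{S4}$$

As explained in the main context, the ellipse equation with respect to the confidence $\alpha$ can be expressed by the following equation

$$(x - \mu)^\top \Sigma^{-1} (x - \mu) = \chi^2_\alpha. \tag{S5}$$

Suppose that $\lambda_1, \dots, \lambda_d$ are the eigenvalues of $\Sigma$, and let $y_1, \dots, y_d$ be the coordinates of $x - \mu$ in the corresponding eigenbasis. Then equation (S5) can be written as

$$\sum_{i=1}^{d} \frac{y_i^2}{\lambda_i} = \chi^2_\alpha. \tag{S6}$$

Further, eliminating $\chi^2_\alpha$ on the right side gives

$$\sum_{i=1}^{d} \frac{y_i^2}{\lambda_i \chi^2_\alpha} = 1. \tag{S7}$$

Then we derive the length of semi-axis with respect to $\lambda_i$, i.e.

$$a_i = \chi_\alpha \sqrt{\lambda_i}. \tag{S8}$$

For a $d$-dimensional ellipsoid $\mathcal{E}$, the volume of $\mathcal{E}$ is

$$\mathrm{vol}(\mathcal{E}) = \frac{2 \pi^{d/2}}{d\, \Gamma(d/2)} \prod_{i=1}^{d} a_i, \tag{S9}$$

where $a_i$ is the length of the $i$-th semi-axis of $\mathcal{E}$ and $\Gamma$ is the Gamma function. Substituting (S8) into the above equation, we obtain the final formula of volume

$$\mathrm{vol}(\mathcal{E}_\alpha) = \frac{2 \pi^{d/2}}{d\, \Gamma(d/2)} \chi_\alpha^d \sqrt{\det \Sigma}. \tag{S10}$$

Using the volume formula in (S9), it is straightforward to get the volume of the packing sphere $\mathcal{S}_\varepsilon$

$$\mathrm{vol}(\mathcal{S}_\varepsilon) = \frac{2 \pi^{d/2}}{d\, \Gamma(d/2)} \varepsilon^d. \tag{S11}$$

By the definition of $N(\mathcal{E}_\alpha)$, we can write

$$N(\mathcal{E}_\alpha) = \frac{\mathrm{vol}(\mathcal{E}_\alpha)}{\mathrm{vol}(\mathcal{S}_\varepsilon)} = \left(\frac{\chi_\alpha}{\varepsilon}\right)^d \sqrt{\det \Sigma}. \tag{S12}$$

We conclude the proof of the theorem. ∎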

3 Procedures of Algorithm 1 and Algorithm 2

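Algorithm 1 Cluster Curriculum for Generative Models
Input:
    dataset containing the data points
    GenerativeModel(), generative models
    QualityScore(), metric for generative results
    number of subsets
Solve centralities:
    compute the centrality vector with Centrality() (section 1)
Cluster curriculum:
    get sorted data according to descending order of centrality
    divide the sorted data into the base set and the incremental subsets
Train generative models:
    for each curriculum (base set plus the subsets accumulated so far) do
        initialize model parameters randomly
        train GenerativeModel() on the curriculum (e.g. GAN) and store its parameters
    end for
Search the optimal set:
    for each trained model do
        generate data with the model of the stored parameters
        compute QualityScore() of the generated data (e.g. FID)
    end for
Return the optimal model parameter, i.e. the one with the minimal score

Algorithm 2 Cluster Curriculum with Active Set
Input:
    dataset, GenerativeModel(), QualityScore() as in Algorithm 1
    cardinality of active set
Solve centralities:
    compute the centrality vector with Centrality() (section 1)
Cluster curriculum:
    get sorted data according to descending order of centrality
    divide the sorted data into the base set and the incremental subsets
Train generative models:
    initialize model parameters randomly
    train GenerativeModel() on the base set (e.g. GAN)
    for each incremental subset do
        derive the active set by randomly sampling the already trained data and adding the incremental subset
        use the pre-training model, i.e. start from the parameters of the previous loop
        train GenerativeModel() on the active set and store its parameters
    end for
Search the optimal set:
    for each trained model do
        generate data with the model of the stored parameters by sampling a prior (e.g. Gaussian)
        compute QualityScore() of the generated data (e.g. FID)
    end for
Return the optimal model parameter, i.e. the one with the minimal score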