GANTree
Code release for "GANTree: An Incrementally Learned Hierarchical Generative Framework for MultiModal Data Distributions", ICCV 2019
Despite the remarkable success of generative adversarial networks, their performance is less impressive on diverse training sets, which require learning discontinuous mapping functions. Though multi-mode prior or multi-generator models have been proposed to alleviate this problem, such approaches may fail depending on the empirically chosen initial mode components. In contrast to such bottom-up approaches, we present GANTree, which follows a hierarchical divisive strategy to address such discontinuous multi-modal data. Devoid of any assumption on the number of modes, GANTree utilizes a novel mode-splitting algorithm to effectively split the parent mode into semantically cohesive children modes, facilitating unsupervised clustering. Further, it also enables incremental addition of new data modes to an already trained GANTree, by updating only a single branch of the tree structure. Compared to prior approaches, the proposed framework offers a higher degree of flexibility in choosing among a large variety of sets of mutually exclusive and exhaustive tree nodes, each called a GANSet. Extensive experiments on synthetic and natural image datasets including ImageNet demonstrate the superiority of GANTree over the prior state of the art.
Generative models have gained enormous attention in recent years as an emerging field of research to understand and represent the vast amount of data surrounding us. The primary objective behind such models is to effectively capture the underlying data distribution from a set of given samples. The task becomes more challenging for complex high-dimensional target samples such as image and text. Well-known techniques like Generative Adversarial Network (GAN) [14] and Variational Autoencoder (VAE) [23] realize it by defining a mapping from a predefined latent prior to the high-dimensional target distribution.
Despite the success of GAN, such a framework has certain limitations. GAN is trained to look for the best possible approximation of the target data distribution within the boundaries restricted by the choice of latent variable setting (i.e. the dimension of the latent embedding and the type of prior distribution) and the computational capacity of the generator network (characterized by its architecture and parameter size). Such a limitation is more prominent in the presence of highly diverse intra-class and inter-class variations, where the given target data spans a highly sparse non-linear manifold. This indicates that the underlying data distribution would constitute multiple, sparsely spread, low-density regions. Given enough capacity of the generator architecture (Universal Approximation Theorem [19]), GAN guarantees convergence to the true data distribution. However, the theorem does not hold for mapping functions involving discontinuities (Fig. 1), as exhibited by natural image or text datasets. Furthermore, various regularizations [7, 32] imposed in the training objective inevitably restrict the generator from exploiting its full computational potential.
A reasonable solution to address the above limitations could be to realize a multi-modal prior in place of the single-mode distribution in the general GAN framework. Several recent approaches explored this direction by explicitly enforcing the generator to capture a diverse multi-modal target distribution [15, 21]. The prime challenge encountered by such approaches is the choice of the number of modes to be considered for a given set of fully-unlabelled data samples. To better analyze this challenging scenario, let us consider an extreme case, where a very high number of modes is chosen in the beginning without any knowledge of the inherent number of categories present in a dataset. In such a case, the corresponding generative model would deliver a higher inception score [4] as a result of dedicated prior modes for individual sub-categories or even sample-level hierarchy. This is a clear manifestation of “overfitting in generative modeling”, as such a model would generate few or a negligible number of novel samples as compared to a single-mode GAN. Intuitively, the ability to interpolate between two samples in the latent embedding space [38, 30] demonstrates continuity and generalizability of a generative model. However, in the case of a multi-modal latent distribution, such an interpolation is possible only within a pair of samples belonging to the same mode. It reveals a clear trade-off between the two schools of thought, that is, a multi-modal latent distribution has the potential to model a better estimate of the true data distribution as compared to a single-mode counterpart, but at a cost of reduced generalizability depending on the choice of mode selection. This also highlights the inherent trade-off between quality (multi-modal GAN) and diversity (single-mode GAN) of a generative model [29], specifically in the absence of a concrete definition of natural data distribution.
An ideal generative framework addressing the above concerns must have the following important traits:
The framework should allow enough flexibility in the design choice of the number of modes to be considered for the latent variable distribution.
Flexibility in generation of novel samples depending on varied preferences of quality versus diversity according to the intended application in focus (such as unsupervised clustering, hierarchical classification, nearest neighbor retrieval, etc.).
Flexibility to adapt to a similar but different class of additional data samples introduced later in absence of the initial data samples (incremental learning setting).
In this work, we propose a novel generative modeling framework, which is flexible enough to address the quality-diversity trade-off in a given multi-modal data distribution. We introduce GANTree, a hierarchical generative modeling framework consisting of multiple GANs organized in a specific order similar to a binary-tree structure. In contrast to the bottom-up approach incorporated by recent multi-modal GANs [35, 21, 15], we follow a top-down hierarchical divisive clustering procedure. First, the root node of the GANTree is trained using a single-mode latent distribution on the full target set, aiming at the maximum level of generalizability. Following this, an unsupervised splitting algorithm is incorporated to cluster the target set samples accessed by the parent node into two different clusters based on the most discriminative semantic feature difference. After obtaining a clear cluster of target samples, a bi-modal generative training procedure is realized to enable the generation of plausible novel samples from the predefined children latent distributions. To demonstrate the flexibility of GANTree, we define GANSet, a set of mutually exclusive and exhaustive tree nodes which can be utilized together with the corresponding prior distribution to generate samples with the desired level of quality vs diversity. Note that the leaf nodes would realize improved quality with reduced diversity, whereas the nodes closer to the root would yield the reciprocal effect.
The hierarchical top-down framework opens up interesting possibilities for future extension, which are highly challenging to realize in general GAN settings. One of them is incremental GANTree, denoted as iGANTree. It supports incremental generative modeling in a much more efficient manner, as only a certain branch of the full GANTree has to be updated to effectively model the distribution of a new input set. Additionally, the top-down setup results in an unsupervised clustering of the underlying class labels as a byproduct, which can be further utilized to develop a classification model with implicit hierarchical categorization.
Commonly, most of the generative approaches realize the data distribution as a mapping from a predefined prior distribution [14]. BEGAN [5] proposed an autoencoder-based GAN, which adversarially minimizes an energy function [37] derived from Wasserstein distance [2]. Later, several deficiencies in this approach have been explored, such as mode-collapse [33], unstable generator convergence [26, 1], etc. Recently, several approaches propose to use an inference network [7, 24] or to minimize a divergence over the joint distribution [10, 12] to regularize the generator against mode-collapse. Although these approaches effectively address mode-collapse, they suffer from the limitations of modeling disconnected multi-modal data [21] using a single-mode prior and the capacity of a single generator transformation, as discussed in Section 1.
To effectively address multi-modal data, two different approaches have been explored in recent works, viz. a) multi-generator models and b) a single generator with a multi-mode prior. Works such as [13, 18, 21] propose to utilize multiple generators to account for the discontinuous multi-modal natural distribution. These approaches use a mode-classifier network, either separately [18] or embedded with a discriminator [13], to enforce learning of mutually exclusive and exhaustive data modes dedicated to individual generator networks. Chen et al. [8] proposed InfoGAN, which aims to exploit the semantic latent source of variations by maximizing the mutual information between the generated image and the latent code. Gurumurthy et al. [15] proposed to utilize a Gaussian mixture prior with a fixed number of components in a single generator network. These approaches used a fixed number of Gaussian components and hence do not offer much flexibility on the scale of quality versus diversity required by the end task in focus. Inspired by boosting algorithms, AdaGAN [35] proposes an iterative procedure, which incrementally addresses uncovered data modes by introducing new GAN components using a sample re-weighting technique.
In this section, we provide a detailed outline of the construction scheme and training algorithm of GANTree (Sections 3.1-3.3). Further, we discuss the inference methods for fetching a GANSet from a trained GANTree for generation (Section 3.4). We also elaborate on the procedure to incrementally extend a previously trained GANTree using new data samples from a different category (Section 3.5).
A GANTree is a full binary tree where each node indexed with $i$, $GN^{(i)}$ (GNode), represents an individual GAN framework. The root node is represented as $GN^{(0)}$ with the corresponding children nodes as $GN^{(1)}$ and $GN^{(2)}$ (see Fig. 2). Here we give a brief overview of a general GANTree framework. Given a set of target samples $\mathcal{D} = \{x_j\}_{j=1}^{n}$ drawn from a true data distribution $P_d$, the objective is to optimize the parameters of the mapping $G: \mathcal{Z} \to \mathcal{X}$, such that the distribution of generated samples $P_g$ approximates the target distribution $P_d$ upon randomly drawn latent vectors $z \sim P_z$. Recent generative approaches [7] propose to simultaneously train an inference mapping, $E: \mathcal{X} \to \mathcal{Z}$, to avoid mode-collapse. In this paper, we use the Adversarially Learned Inference (ALI) [12] framework as the basic GAN formulation for each node of GANTree. However, one can employ any other GAN framework for training the individual GANTree nodes, provided it has an inference mapping.
Root node ($GN^{(0)}$). Assuming $\mathcal{D}^{(0)}$ as the set of complete target samples, the root node $GN^{(0)}$ is first trained using a single-mode latent prior distribution $z \sim P_z^{(0)}$. As shown in Fig. 2, $E^{(0)}$, $G^{(0)}$ and $D^{(0)}$ are the encoder, generator and discriminator networks respectively for the root node with index 0, which are trained to generate samples $x \sim P_g^{(0)}$ approximating $P_d^{(0)}$. Here, $P_d^{(0)}$ is the true target distribution whose samples are given as $x \in \mathcal{D}^{(0)}$. After obtaining the best approximation $P_g^{(0)}$, the next objective is to improve the approximation by considering a multi-modal latent distribution in the succeeding hierarchy of GANTree.
Children nodes ($GN^{(l)}$ and $GN^{(r)}$). Without any choice of the initial number of modes, we plan to split each GNode into two children nodes (see Fig. 2). In a general setting, assuming $p$ as the parent node index with the corresponding two children nodes indexed as $l$ and $r$, we define $l = \mathrm{left}(p)$, $r = \mathrm{right}(p)$, and $p = \mathrm{par}(l) = \mathrm{par}(r)$ for simplifying further discussions. Considering the example shown in Fig. 2, with the parent index $p = 0$, the indices of the left and right child would be $l = 1$ and $r = 2$ respectively. A novel binary Mode-splitting procedure (Section 3.2) is incorporated, which, without using the label information, effectively exploits the most discriminative semantic difference in the latent space to realize a clear binary clustering of the input target samples. We obtain the cluster-sets $\mathcal{D}^{(l)}$ and $\mathcal{D}^{(r)}$ by applying Mode-splitting on the parent-set $\mathcal{D}^{(p)}$ such that $\mathcal{D}^{(p)} = \mathcal{D}^{(l)} \cup \mathcal{D}^{(r)}$. Note that a single encoder network $E^{(p)}$ is shared by both the child nodes $GN^{(l)}$ and $GN^{(r)}$, as it is also utilized as a routing network, which can route a given target sample from the root node to one of the leaf nodes by traversing through different levels of the full GANTree. The bi-modal latent distribution at the output of the common encoder $E^{(p)}$ is defined as $z \sim P_z^{(l)}$ and $z \sim P_z^{(r)}$ for the left and right child node respectively.
After the simultaneous training of the two children nodes using a Bi-Modal Generative Adversarial Training (BiMGAT) procedure (Section 3.3), we obtain an improved approximation of the true distribution at the parent node as a combination of the two children's generated distributions. Here, the generated distribution of each child is modelled by its generator applied to latent vectors drawn from that child's prior (Algo. 1). Similarly, one can split the tree further to effectively capture the inherent number of modes associated with the true data distribution.
Node selection for split and stopping criterion. A natural question then arises: how do we decide which node to split first out of all the leaf nodes present at a certain state of GANTree? For making this decision, we choose the leaf node which gives the minimum mean likelihood over the data samples labeled for it (lines 5-6, Algo. 1). Also, the stopping criterion on the splitting of GANTree has to be defined carefully to avoid overfitting to the given target data samples. For this, we make use of a robust IRC-based stopping criterion [16] over the embedding space, preferred over the standard AIC and BIC metrics. However, one may use a fixed number of modes as a stopping criterion and extend the training from that point as and when required.
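The node-selection rule can be illustrated with a minimal sketch. It assumes each leaf has an isotropic Gaussian prior and that the encoded samples of each leaf are already available; the function names, the data layout, and the use of log-likelihood in place of likelihood are assumptions made for this illustration, not details taken from Algo. 1.

```python
import math

def gaussian_log_likelihood(z, mean, sigma):
    """Log-density of an isotropic Gaussian N(mean, sigma^2 I) at point z."""
    d = len(z)
    sq_dist = sum((zi - mi) ** 2 for zi, mi in zip(z, mean))
    return -0.5 * sq_dist / sigma ** 2 - 0.5 * d * math.log(2 * math.pi * sigma ** 2)

def select_leaf_to_split(leaves):
    """Return the id of the leaf with the minimum mean log-likelihood.

    `leaves` maps a leaf id to (embeddings, prior_mean, prior_sigma), where
    `embeddings` are the encoded samples currently assigned to that leaf.
    (Hypothetical layout, chosen only for this sketch.)
    """
    def mean_ll(leaf_id):
        embeddings, mean, sigma = leaves[leaf_id]
        total = sum(gaussian_log_likelihood(z, mean, sigma) for z in embeddings)
        return total / len(embeddings)
    return min(leaves, key=mean_ll)

# Toy example: leaf 2 fits its prior poorly, so it is selected for splitting.
leaves = {
    1: ([[0.1, -0.2], [0.0, 0.3]], [0.0, 0.0], 1.0),
    2: ([[2.5, 2.0], [-1.5, 3.0]], [0.0, 0.0], 1.0),
}
chosen = select_leaf_to_split(leaves)
print(chosen)
```

The leaf whose samples are least well explained by its prior is the one most likely to still contain several modes, which is why it is split first.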
The mode-split algorithm is treated as the basis of the top-down divisive clustering idea, which is incorporated to construct the hierarchical GANTree by performing a binary split of individual GANTree nodes. The splitting algorithm must be efficient enough to successfully exploit the highly discriminative semantic characteristics in a fully-unsupervised manner. To realize this, we first define $P_z^{(l)} = \mathcal{N}(\mu^{(l)}, \Sigma^{(l)})$ and $P_z^{(r)} = \mathcal{N}(\mu^{(r)}, \Sigma^{(r)})$ as the fixed normal prior distributions (non-trainable) for the left and right children respectively. A clear separation between these two priors is achieved by setting the distance between the mean vectors as $k\sigma$ with $\Sigma^{(l)} = \Sigma^{(r)} = \sigma^2 I_d$, where $I_d$ is a $d \times d$ identity matrix. Assuming $p$ as the parent node index, $\mathcal{D}^{(p)}$ is the cluster of target samples modeled by $GN^{(p)}$. Put differently, the objective of the mode-split algorithm is to form two mutually exclusive and exhaustive target data clusters $\mathcal{D}^{(l)}$ and $\mathcal{D}^{(r)}$, by utilizing the likelihood of the latent representations under the predefined priors $P_z^{(l)}$ and $P_z^{(r)}$.
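As a rough illustration of the prior setup, the sketch below builds two fixed isotropic Gaussian priors whose means are a chosen multiple `k` of `sigma` apart and labels a latent vector by the prior under which it is more likely. The separation factor `k`, the choice of the first axis for the offset, and all function names are assumptions for this example.

```python
import math

def make_child_priors(dim, sigma=1.0, k=4.0):
    """Two isotropic Gaussian priors N(mu, sigma^2 I) with means k*sigma apart.

    Placing the offset along the first latent axis is an arbitrary choice
    made for this sketch.
    """
    mu_l = [0.0] * dim
    mu_r = [0.0] * dim
    mu_l[0] = -0.5 * k * sigma
    mu_r[0] = 0.5 * k * sigma
    return (mu_l, sigma), (mu_r, sigma)

def log_likelihood(z, prior):
    """Log-density of z under an isotropic Gaussian prior (mean, sigma)."""
    mean, sigma = prior
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mean))
    return -0.5 * sq_dist / sigma ** 2 - 0.5 * len(z) * math.log(2 * math.pi * sigma ** 2)

def assign_cluster(z, prior_l, prior_r):
    """Label a latent vector by the child prior with the higher likelihood."""
    return "l" if log_likelihood(z, prior_l) >= log_likelihood(z, prior_r) else "r"

prior_l, prior_r = make_child_priors(dim=2, sigma=1.0, k=4.0)
print(math.dist(prior_l[0], prior_r[0]))
print(assign_cluster([-1.0, 0.3], prior_l, prior_r))
print(assign_cluster([0.7, -2.0], prior_l, prior_r))
```

Because both priors share the same covariance, the likelihood comparison reduces to asking which mean is closer to the latent vector.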
To effectively realize mode-splitting (Algo. 2), we define two different bags: a) the assigned bag $B_a$ and b) the unassigned bag $B_u$. $B_a$ holds the semantic characteristics of individual modes in the form of representative high-confidence labeled target samples. Here, the assigned labeled samples are a subset of the parent target samples, with the corresponding hard-assigned cluster-id obtained using the likelihood under the predefined priors (line 11, Algo. 2) in the transformed encoded space. We refer to it as a hard-assignment as we do not update the cluster label of these samples once they are moved from $B_u$ to the assigned bag $B_a$. This effectively tackles mode-collapse in the later iterations of the mode-split procedure. For the samples in $B_u$, a temporary cluster label is assigned depending on the prior with maximum likelihood (line 12, Algo. 2) to aggressively move them towards one of the binary modes (lines 19-22, Algo. 2). Finally, the algorithm converges when all the samples in $B_u$ are moved to $B_a$. The algorithm involves a simultaneous update of three different sets of network parameters (line 18, Algo. 2) using a final loss function consisting of:
the likelihood maximization term for samples in both $B_a$ and $B_u$ (lines 15-16), encouraging exploitation of a binary discriminative semantic characteristic, and
the semantic-preserving reconstruction loss computed using the corresponding generator of the left or right child (lines 13-14). This is used as a regularization to hold the semantic uniqueness of the individual samples, avoiding mode-collapse.
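The bookkeeping of the two bags can be sketched as follows. This is a simplified illustration, not Algo. 2 itself: network updates are omitted, the priors are assumed to share one isotropic covariance (so comparing likelihoods is the same as comparing squared distances to the means), and the `margin` confidence threshold is a hypothetical stand-in for whatever criterion decides that a sample is confident enough to be hard-assigned.

```python
def sq_dist(z, mean):
    """Squared Euclidean distance between a latent vector and a prior mean."""
    return sum((a - b) ** 2 for a, b in zip(z, mean))

def update_bags(assigned, unassigned, embeddings, mu_l, mu_r, margin):
    """One bookkeeping pass over the unassigned bag.

    assigned:   dict sample_id -> 'l' or 'r' (hard labels, never changed later)
    unassigned: set of sample ids still awaiting a hard label
    embeddings: dict sample_id -> latent vector from the shared encoder
    margin:     hypothetical confidence threshold on the distance gap
    Returns temporary labels for the samples that stay unassigned.
    """
    temporary = {}
    for sid in sorted(unassigned):  # sorted() copies, so the set can be edited
        gap = sq_dist(embeddings[sid], mu_r) - sq_dist(embeddings[sid], mu_l)
        label = "l" if gap >= 0 else "r"
        if abs(gap) >= margin:
            assigned[sid] = label      # hard assignment
            unassigned.discard(sid)
        else:
            temporary[sid] = label     # may still change in a later pass
    return temporary

embeddings = {"a": [-1.8, 0.1], "b": [1.5, -0.2], "c": [0.1, 0.4]}
assigned, unassigned = {}, {"a", "b", "c"}
temporary = update_bags(assigned, unassigned, embeddings,
                        mu_l=[-2.0, 0.0], mu_r=[2.0, 0.0], margin=4.0)
print(assigned, unassigned, temporary)
```

In a full implementation the encoder and generators would be updated between passes, so the embeddings of the remaining samples drift towards one of the two priors until the unassigned bag is empty.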
The mode-split algorithm, without explicit attention, does not ensure that the generated distributions of the left and right children match their expected target distributions. Therefore, to enable generation of plausible samples from randomly drawn prior latent vectors, a generative adversarial framework is incorporated simultaneously for both the left and right children. In the ALI [12] setting, the loss function involves optimization of the common encoder along with both the generators in an adversarial fashion, utilizing two separate discriminators, which are trained to distinguish generated samples from target samples for the left and right child respectively.
To utilize a generative model spanning the entire data distribution, an end user can select any combination of nodes from a fully trained GANTree (i.e. a GANSet) such that the data distributions they model are exhaustive and mutually exclusive. However, to generate only a subset of the full data distribution, one may choose a mutually exclusive but non-exhaustive set, called a Partial GANSet.
For a use case where extreme preference is given to diversity in terms of the number of novel samples over quality of the generated samples, selecting the singleton set {root} would be an apt choice. However, in a contrasting use case, one may select all the leaf nodes as a Terminal GANSet to have the best quality in the generated samples, albeit losing novelty in the generated samples. Most practical tasks will involve use cases where a GANSet is constructed as a combination of both intermediate nodes and leaf nodes.
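The conditions on a GANSet can be checked mechanically. The sketch below assumes the tree is stored with heap-style indices (root 0, children of node `i` at `2*i + 1` and `2*i + 2`), which is an indexing convention chosen for this example; a selection is valid when every leaf of the tree is covered by exactly one selected node on its path to the root.

```python
def ancestors_or_self(i):
    """Yield node i and all of its ancestors up to the root (index 0)."""
    while True:
        yield i
        if i == 0:
            return
        i = (i - 1) // 2

def leaves(tree_nodes):
    """Leaves are the nodes whose left child is absent from the tree."""
    return {i for i in tree_nodes if 2 * i + 1 not in tree_nodes}

def is_gan_set(tree_nodes, selected, exhaustive=True):
    """Check mutual exclusivity (and optionally exhaustiveness) of a selection.

    With exhaustive=False the check accepts a Partial GANSet.
    """
    selected = set(selected)
    if not selected <= set(tree_nodes):
        return False
    for leaf in leaves(tree_nodes):
        covering = sum(1 for a in ancestors_or_self(leaf) if a in selected)
        if covering > 1:                    # two selected nodes overlap
            return False
        if exhaustive and covering == 0:    # part of the data is not modeled
            return False
    return True

# Root split into nodes 1 and 2, then node 1 split into nodes 3 and 4.
tree = {0, 1, 2, 3, 4}
print(is_gan_set(tree, {0}))                      # singleton {root}
print(is_gan_set(tree, {2, 3, 4}))                # Terminal GANSet
print(is_gan_set(tree, {1, 2}))                   # mixed selection
print(is_gan_set(tree, {0, 1}))                   # overlapping nodes
print(is_gan_set(tree, {3, 4}, exhaustive=False)) # Partial GANSet
```

Checking coverage per leaf is sufficient because any two overlapping nodes in a tree are in an ancestor-descendant relation, and every node has at least one leaf below it.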
A GANSet can also be used to perform clustering and label assignment for new data samples in a fully unsupervised setting. We provide a formal procedure AssignLabel in the supplementary document for performing the clustering of the data samples using a GANTree.
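The formal AssignLabel procedure is given in the supplementary document and is not reproduced here. The following is only a minimal sketch of the routing idea, under the assumptions that each internal node's shared encoder maps a sample to a latent vector and that the sample is sent to the child whose prior mean is closer (equivalent to the higher likelihood when both priors share one isotropic covariance). The tree layout and the toy encoders are hypothetical.

```python
def sq_dist(z, mean):
    """Squared Euclidean distance between a latent vector and a prior mean."""
    return sum((a - b) ** 2 for a, b in zip(z, mean))

def assign_label(x, tree, root=0):
    """Route sample x from the root to a leaf and return the leaf id.

    tree maps node_id -> None for a leaf, or
    (encoder, (left_id, mu_l), (right_id, mu_r)) for an internal node.
    """
    node = root
    while tree[node] is not None:
        encoder, (left_id, mu_l), (right_id, mu_r) = tree[node]
        z = encoder(x)
        node = left_id if sq_dist(z, mu_l) <= sq_dist(z, mu_r) else right_id
    return node

# Toy stand-ins for trained encoders: each one reads a single input coordinate.
tree = {
    0: (lambda x: [x[0]], (1, [-2.0]), (2, [2.0])),
    1: (lambda x: [x[1]], (3, [-2.0]), (4, [2.0])),
    2: None,
    3: None,
    4: None,
}
print(assign_label([1.5, 0.0], tree))
print(assign_label([-1.0, 3.0], tree))
print(assign_label([-1.0, -0.5], tree))
```

The returned leaf id serves as the unsupervised cluster label; stopping the descent at the nodes of a chosen GANSet instead of at the leaves would give coarser labels.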
How does GANTree differ from previous works?
AdaGAN: The sequential learning approach adopted by [35] requires a fully-trained model on the previously addressed mode before addressing the subsequent undiscovered samples. As it does not enforce any constraints on the amount of data to be modeled by a single generator network, it mostly converges to more modes than are actually present in the data. In contrast, GANTree models a mutually exclusive and exhaustive set at each split of a parent node by simultaneously training the child generator networks. Another major disadvantage of AdaGAN is that it focuses heavily on quality rather than diversity (caused by the over-division of modes), which inevitably restricts the latent space interpolation ability of the final generative model.
DMGAN: Khayatkhoei et al. [21] proposed a disconnected manifold learning generative model using a multi-generator approach. They proposed to start with an initial number of mode components, $n_g$, that overestimates the actual number of modes in the data distribution, $n_r$. As discussed before, we do not assume the existence of a definite value for the number of actual modes as DMGAN does, especially for diverse natural image datasets like CIFAR and ImageNet. In a practical scenario, one cannot decide the initial value of $n_g$ without any clue about the number of classes present in the dataset. DMGAN will fail for cases where $n_g < n_r$, as discussed by the authors. Also note that, unlike GANTree, DMGAN is not suitable for incremental future expansion. This clearly demonstrates the superior flexibility of GANTree over DMGAN as a result of the adopted top-down divisive strategy.
We advance the idea of GANTree to iGANTree, wherein we propose a novel mechanism to extend an already trained GANTree to also model samples from a set of new data samples $\mathcal{D}'$. An outline of the entire procedure is provided across Algorithms 3 and 4. To understand the mechanism, we start with what the algorithm is expected to achieve. On termination of this procedure over $\mathcal{D}'$, we expect to have a single leaf node which solely models the distribution of samples from $\mathcal{D}'$; the other intermediate nodes, which are the ancestors of this new node, should model a mixture distribution which also includes samples from $\mathcal{D}'$.
To achieve this, we first find the right level of hierarchy and position at which to insert this new leaf node using a seek procedure (lines 2-8 in Algo. 4); the quantities computed for one child in lines 5-6 are computed similarly for the other. Let's say the seek procedure stops at some node. We now introduce two new nodes in the tree, a new child node and a new parent node, and perform reassignment (lines 11-17 in Algo. 4). The new child node models only the new data samples $\mathcal{D}'$, and the new parent node models a mixture of the distribution previously modeled at the node where the seek stopped and that of $\mathcal{D}'$. This brings us to the question: how do we learn the new distribution modeled by the new parent node and its ancestors? To solve this, we follow a bottom-up training approach from the new parent node to the root node, incrementally training each node on the branch with samples from $\mathcal{D}'$ to maintain the hierarchical property of the GANTree (lines 22-24, Algo. 4).
Now, the problem reduces to retraining the parent encoder and the child generator and discriminator networks at each node in the selected branch, such that (i) the encoder correctly routes the generated data samples to the proper child node and (ii) the new samples $\mathcal{D}'$ are modeled by the new distribution at all the ancestor nodes of the new leaf node, while the samples from the previously modeled distribution are still remembered. Moreover, we make no assumption of having the data samples on which the GANTree was trained previously. To train a node on the branch, we make use of the terminal GANSet of the sub-GANTree rooted at that node to generate samples for retraining it. A thorough procedure of how each node is trained incrementally is illustrated in Algo. 3. Also, note that we use the mean likelihood measure to decide which of the two child nodes has the potential to model the new samples. We select the child whose mean vector has the minimum average Mahalanobis distance from the embeddings of the samples of $\mathcal{D}'$. This idea can also be implemented to have full persistence over the structure [11] (further details in Supplementary).
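The child-selection step during the seek can be sketched as below. The example assumes diagonal covariances so that the Mahalanobis distance needs no matrix inversion; the function names and the data layout are hypothetical, and the embeddings stand in for the encoder outputs of the new samples.

```python
import math

def mahalanobis_diag(z, mean, var):
    """Mahalanobis distance for a Gaussian with diagonal covariance `var`."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(z, mean, var)))

def pick_child(embeddings, children):
    """Return the child id with the minimum average Mahalanobis distance.

    children maps child_id -> (prior mean, diagonal variances).
    """
    def average_distance(child_id):
        mean, var = children[child_id]
        total = sum(mahalanobis_diag(z, mean, var) for z in embeddings)
        return total / len(embeddings)
    return min(children, key=average_distance)

children = {
    "l": ([-2.0, 0.0], [1.0, 1.0]),
    "r": ([2.0, 0.0], [1.0, 1.0]),
}
new_sample_embeddings = [[1.5, 0.2], [2.4, -0.1]]
print(pick_child(new_sample_embeddings, children))
```

Repeating this choice level by level traces the single branch along which the new samples would be inserted.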
In this section, we discuss a thorough evaluation of GANTree against baselines and prior approaches. We decide not to use any improved learning techniques (as proposed by SNGAN [27] and SAGAN [36]) for the proposed GANTree framework to have a fair comparison against the prior art [21, 13, 18] targeting multimodal distribution.
GANTree is a multi-generator framework, which can work on a multitude of basic GAN formalizations (like AAE [25], ALI [12], RFGAN [3] etc.) at the individual node level. However, in most of the experiments we use ALI [12], except for CIFAR, where both ALI [12] and RFGAN [3] are used to demonstrate generalizability of GANTree over varied GAN formalizations. Also note that we freeze the parameter updates of the lower layers of the encoder and discriminator, and of the higher layers of the generator (close to the data generation layer), in a systematic fashion as we go deeper in the GANTree hierarchical separation pipeline. Such a parameter sharing strategy helps us to reduce overfitting at the individual node level close to the terminal leaf nodes.
We employ modifications to the commonly used DCGAN [30] architecture for the generator, discriminator and encoder networks while working on image datasets, i.e. MNIST (32×32), CIFAR-10 (32×32) and FaceBed (64×64). However, unlike in DCGAN, we use batch normalization [20] with Leaky ReLU non-linearity, in line with the prior multi-generator works [18]. While training GANTree on ImageNet [31], we follow the generator architecture used by SNGAN [27] for a generation resolution of 128×128 with the RFGAN [3] formalization. For both mode-split and BiModal-GAN training we employ the Adam optimizer [22] with a learning rate of 0.001.
Effectiveness of the proposed mode-split algorithm. To verify the effectiveness of the proposed mode-split algorithm, we perform an ablation analysis against a baseline deep-clustering [34] technique on the 10-class MNIST dataset. The performance of GANTree depends heavily on the initial binary split performed at the root node, as an error in cluster assignment at this stage may lead to multiple modes for a single image category across both the child tree hierarchies. Fig. 4B clearly shows the superiority of the mode-split procedure when applied at the MNIST root node.
Model  #Gen  JSD MNIST  JSD FaceBed  FID FaceBed 
DMWGAN [21]  20  
DMWGANPL [21]  20  
Ours GANSet  5  
Ours GANSet  10 
Evaluation on Toy dataset. We construct a synthetic dataset by sampling 2D points from a mixture of nine disconnected Gaussian distributions with distinct means and covariance parameters. The complete GANTree training procedure over this dataset is illustrated in Fig. 5. As observed, the distribution modeled at each pair of child nodes validates the mutually exclusive and exhaustive nature of child nodes for the corresponding parent.
Evaluation on MNIST. We show an extensive comparison of GANTree against DMWGANPL [21] across various qualitative metrics on the MNIST dataset. Table 1 shows the quantitative comparison of inter-class variation against previous state-of-the-art approaches. It highlights the superiority of the proposed GANTree framework.
Evaluation on compositional-MNIST. As proposed by Che et al. [7], the compositional-MNIST dataset consists of 3 random digits at 3 different quadrants of a full 64×64 resolution template, resulting in a data distribution of 1000 unique modes. Following this, a pre-trained MNIST classifier is used for recognizing digits from the generated samples, to compute the number of modes covered while generating from all of the 1000 variants. Table 2 highlights the superiority of GANTree over MADGAN [13].
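The mode count follows from three digit positions with ten possible digits each, i.e. 10^3 = 1000 combinations. A minimal sketch of the coverage metric is given below; it assumes the pre-trained classifier has already turned each generated image into a triple of predicted digits, and the variable names are illustrative only.

```python
from itertools import product

def modes_covered(predicted_triples):
    """Number of distinct digit triples among the classified generations."""
    return len(set(predicted_triples))

# Every possible mode: one digit (0-9) in each of the three quadrants.
all_modes = list(product(range(10), repeat=3))
print(len(all_modes))

# Hypothetical classifier outputs for four generated images.
predictions = [(1, 2, 3), (1, 2, 3), (0, 0, 7), (9, 9, 9)]
print(modes_covered(predictions))
```

A generator covers all modes when this count reaches 1000 over a sufficiently large batch of generated samples.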
iGANTree on MNIST. We show a qualitative analysis of the generations of a trained GANTree after incrementally adding data samples under different settings. We first train a GANTree for 5 modes on MNIST digits 0-4. We then train it incrementally with samples of the digit 5 and show what the modified structure of the GANTree looks like. Fig. 4D shows a detailed illustration of this experiment.
GANTree on MNIST+FMNIST and FaceBed. We perform the divisive GANTree training procedure on two mixed datasets. For MNIST+FashionMNIST, we combine 20K images from each of the two datasets. Similarly, following [21], we combine FaceBed to demonstrate the effectiveness of GANTree in modeling diverse multi-modal data supported on a disconnected manifold (as highlighted by Table 1). The hierarchical generations for MNIST+FMNIST and the mixed FaceBed datasets are shown in Fig. 4A and Fig. 6A respectively.
Methods  KL Div.  Modes covered 
WGAN [1]  0.25  1000 
MADGAN [13]  0.074  1000 
GANSet (root)  0.16  980 
GANSet (5 GNodes)  0.10  1000 
GANSet (10 GNodes)  0.072  1000 
On CIFAR-10 and ImageNet. In Table 3, we report the inception score [32] and FID [17] obtained by GANTree against prior works on both the CIFAR-10 and ImageNet datasets. We separately implement the prior multi-modal approaches, a) GMVAE [9] and b) ClusterGAN [28], and also the prior multi-generator works, a) MADGAN [13] and b) DMWGANPL [21], with a fixed number of generators. Additionally, to demonstrate the generalizability of the proposed framework with varied GAN formalizations at the individual node level, we implement GANTree with ALI [12], RFGAN [3], and BigGAN [6] as the basic GAN setup. Note that we utilize the design characteristics of BigGAN without accessing the class-label information, along with RFGAN's encoder, for both CIFAR-10 and ImageNet.
In Table 3, all the approaches targeting the ImageNet dataset use a modified ResNet50 architecture, where the total number of parameters varies depending on the number of generators (considering the hierarchical weight sharing strategy), as reported under the #Param column. While comparing generation performance, one needs access only to a selected GANSet instead of the entire GANTree. In Table 3, the performance of GANSet (RFGAN) with 3 generators (i.e. a GANTree with 5 generators in total) is superior to DMWGANPL [21] and MADGAN [13], each with 10 generators. This clearly shows the superior computational efficiency of GANTree over prior multi-generator works. An exemplar set of generated images with the first root node split is presented in Fig. 6B and 6C.
Method  #Gen  IS (CIFAR-10)  FID (CIFAR-10)  IS (ImageNet)  FID (ImageNet)  #Param
GMVAE [9]  1  6.89  39.2  -  -  -
ClusterGAN [28]  1  7.02  37.1  -  -  -
RFGAN [3] (root node)  1  6.87  38.0  20.01  46.4  50M
BigGAN (w/o label)  1  7.19  36.7  20.89  42.5  50M
MADGAN [13]  10  7.33  35.1  20.92  38.3  205M
DMWGANPL [21]  10  7.41  33.1  21.57  37.8  205M
Ours GANSet (ALI)  3  7.42  32.5  -  -  -
Ours GANSet (ALI)  5  7.63  28.2  -  -  -
Ours GANSet (RFGAN)  3  7.60  28.3  21.97  34.0  65M
Ours GANSet (RFGAN)  5  7.91  27.8  24.84  29.4  105M
Ours GANSet (BigGAN)  3  8.12  25.2  22.38  31.2  130M
Ours GANSet (BigGAN)  5  8.60  21.9  25.93  27.1  210M
GANTree is an effective framework to address natural data distributions without any assumption on the inherent number of modes in the given data. Its hierarchical tree structure gives enough flexibility by providing GANSets with varied quality-vs-diversity trade-offs. This also makes GANTree a suitable candidate for incremental generative modeling. Further investigation of the limitations and advantages of such a framework will be explored in the future.
Acknowledgements. This work was supported by a Wipro PhD Fellowship (Jogendra) and a grant from ISRO, India.
International Conference on Machine Learning.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system. In Eighth Annual Conference of the International Speech Communication Association.
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4610–4617.
Deepcluster: a general clustering framework based on deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 809–825.