1 Introduction
People usually use natural language to communicate their thoughts and emotions to others. Recently, artificial intelligence (AI) technology has begun to let us use natural language to operate and control computer systems. For example, it is becoming possible for an AI engine to generate 3D shapes from natural language descriptions.
When creating 3D models, 3D designers commonly rely on specialized modeling software such as Maya, Blender, and 3ds Max, and they need to spend a very long time to create 3D models of satisfactory quality with these tools. Even for an expert in these tools, creating a single 3D model takes a long time. In contrast, if a person already has an image of the target object in mind, it is easy to express the outline of the shape as a text description. If an AI engine can quickly generate 3D shapes from text descriptions, the time taken by these heavy tasks can be reduced.
There have been research efforts to generate 3D models from encoded text descriptions by using a Generative Adversarial Network (GAN) [3]. For example, the method in [7] can generate 3D shapes for rough categories, but it cannot generate fine models with colors or per-part details. The prior work in [2] successfully generates colored 3D shapes from text descriptions; however, the resolution of the shapes is still low. In that work, text is first converted to a vector representation by an encoder. Next, the Generator produces a corresponding 3D shape from the vector. This Generator is based on deep learning and a GAN framework. When the vectorized text is input to the Generator, it is combined with noise to ensure the flexibility of the Generator.
The 3D shape is produced through 3D deconvolution operations and then input to a Critic network, which is used to calculate the Wasserstein distance between the actual 3D shapes and the generated ones. On one hand, the Critic network attempts to calculate the Wasserstein distance accurately. On the other hand, the Generator attempts to minimize the distance calculated by the Critic. This brings the probability distribution of the generated 3D shapes closer to that of the actual ones. The details of the algorithm are described in Section 2.
In the previous work, the output is generated in the form of 3D voxels. As for voxel generation, learning was done with voxel size of . Since 3D model designers generally create 3D models as meshes rather than as voxels, the voxels can be converted to a mesh with methods such as Marching Cubes [6]. However, if the resolution of the voxels is low, the converted mesh looks considerably coarse. In addition, low resolution voxels make it difficult to express the details written in the text. Therefore, in this research, we propose another GAN learning methodology to generate voxels with higher resolution. Specifically, we use voxel size of .
One way to raise the resolution is to use a larger neural network model. However, this causes a tremendous increase in the number of parameters to be learned. If we simply use such a large model, training time increases significantly and training sometimes becomes unstable. Therefore, in our proposed methodology, we reconsider the role of the Critic network of the previous research so that the learning phase can be fast and stable.
Throughout this paper, we propose several GAN models in which the roles of the Critic network differ, for example, in whether or not the Critic focuses on faithfulness to the text descriptions. Since these models have advantages and disadvantages with respect to various indices, we also introduce several metrics to compare the effectiveness of the proposed models.
The rest of this paper is organized as follows. In the next section, …..
2 Related Work
2.1 Generative Adversarial Network (GAN)[3]
In the Generative Adversarial Network (GAN) framework, two networks called the Generator and the Discriminator are trained. The Generator is trained to generate data similar to the training data from a latent vector. The Discriminator learns to discriminate between the training data and the data generated by the Generator. The Generator and the Discriminator are trained alternately, and it is expected that the Generator eventually produces data similar to the training data.
These processes can be expressed by the following mathematical expression,
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$
where $G$, $D$, and $x$ are the Generator, the Discriminator, and the training data, respectively. This expression shows the objective function of the GAN. Here, we use $z$ as noise for the latent vector, and $G$ generates data $G(z)$ from the noise $z$. $D(x)$ means the probability that $x$ is regarded as training data.
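As a minimal sketch of how this objective behaves, the value function of the original GAN [3] (the mean of log D(x) on training data plus the mean of log(1 - D(G(z))) on generated data) can be estimated from a batch of Discriminator outputs. The function name `gan_value` and the sample values are hypothetical and serve only as illustration.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: Discriminator outputs on training samples, each in (0, 1).
    d_fake: Discriminator outputs on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A Discriminator that cannot tell the two apart outputs 0.5 everywhere.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
# A confident, correct Discriminator pushes the value towards 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The Discriminator tries to raise this value, while the Generator tries to lower it by making `d_fake` large.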
2.2 Conditional GAN (CGAN)[8]
In the original GAN, all elements of the latent vector input to $G$ are noise drawn from a certain probability distribution. However, when only noise is taken as input, it is difficult to specify which data should be generated. CGAN, on the other hand, can express what we want to generate by inputting a label as a latent vector into both the Generator and the Discriminator. In CGAN, the description of what we want to generate (an image, text, etc.) is encoded into this latent vector before generation.
In CGAN, the latent vector obtained from the training data is combined with a noise vector and input to the Generator. The noise vector is added to give the Generator diversity; otherwise, its possible outputs would be determined solely by the latent vector. In addition, the noise helps make the output more robust for the parts that the latent vector cannot describe sufficiently (such as the details of the background of an image).
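The combination of the conditioning vector with noise can be sketched as a simple concatenation. This is only an illustration under assumptions: the helper `build_generator_input`, the Gaussian noise, and the vector sizes are hypothetical and are not taken from [8].

```python
import random

def build_generator_input(text_embedding, noise_dim, rng):
    """Concatenate a conditioning vector with freshly sampled Gaussian noise."""
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return list(text_embedding) + noise

rng = random.Random(0)
embedding = [0.2, -0.5, 0.9]  # hypothetical encoded description
first = build_generator_input(embedding, noise_dim=4, rng=rng)
second = build_generator_input(embedding, noise_dim=4, rng=rng)
```

Both inputs share the same condition but differ in their noise part, which is what allows one description to map to several outputs.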
In usual GAN frameworks, the latent vector is input only to the Generator, which is trained so that the Discriminator recognizes the generated data as training data. The difference in CGAN is that the conditioning latent vector is also input to the Discriminator. By doing so, the Discriminator can judge whether the output of the Generator is actually linked with the corresponding latent vector description.
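To implement the above point, the objective function of CGAN is expressed as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y) \mid y))],$$
where $y$ is the conditioning latent vector.
In order to strengthen the capability of identifying whether the output is generated from the corresponding latent vector, the literature [10] proposes training $D$ so that it outputs $0$ for training data that does not match the description of the latent vector. In this case, the objective function is expressed as follows:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x \mid y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z \mid y) \mid y))] + \mathbb{E}_{x \sim p_{mis}(x)}[\log(1 - D(x \mid y))],$$
where $p_{mis}$ is the probability distribution of the training data mismatched with the latent vector.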
2.3 Wasserstein GAN (WGAN) [1]
The aim of Wasserstein GAN (WGAN) is to bring the probability distribution of the generated voxels, $P_g$, close to the probability distribution of the training data, $P_r$. The simplest way of doing this is to minimize the Kullback-Leibler (KL) divergence, which is one of the metrics for measuring the distance between two probability distributions. The KL divergence ($D_{KL}$) between the probability distributions $P_r$ and $P_g$ is calculated as follows:
$$D_{KL}(P_r \,\|\, P_g) = \int P_r(x) \log \frac{P_r(x)}{P_g(x)} \, dx.$$
In the case of GANs, it is not possible to numerically calculate $D_{KL}$, and hence to compute a loss function from it, because a specific probability distribution is not assumed in GANs.
Instead of using $D_{KL}$ directly, the GAN tries to minimize the Jensen-Shannon divergence (JSD). Although $D_{KL}$ is asymmetric, JSD is symmetric. JSD is expressed as follows:
$$JSD(P_r \,\|\, P_g) = \frac{1}{2} D_{KL}(P_r \,\|\, P_m) + \frac{1}{2} D_{KL}(P_g \,\|\, P_m), \quad P_m = \frac{P_r + P_g}{2}.$$
The loss of the GAN,
$$V(D, G) = \mathbb{E}_{x \sim P_r}[\log D(x)] + \mathbb{E}_{x \sim P_g}[\log(1 - D(x))],$$
is maximized with respect to $D$ in the case of
$$D^*(x) = \frac{P_r(x)}{P_r(x) + P_g(x)}.$$
In this condition, the objective function becomes as follows:
$$V(D^*, G) = 2\, JSD(P_r \,\|\, P_g) - 2 \log 2.$$
In other words, while $D$ accurately approximates JSD, $G$ learns to minimize the JSD calculated by $D$. However, in training GANs, it is very important to carefully adjust the training balance between the Discriminator and the Generator, as well as their learning rates. If the Discriminator is trained insufficiently, the Generator will minimize an incorrect JSD. On the other hand, if the Discriminator is trained too well, the gradient passed to the Generator becomes small, making the Generator's training infeasible.
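The asymmetry of the KL divergence and the symmetry of JSD can be checked numerically on small discrete distributions. The following is a minimal sketch; the distributions `p_r` and `p_g` are made-up examples, and the discrete sums stand in for the integrals over continuous densities.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence built from KL against the mixture m."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p_r = [0.7, 0.2, 0.1]  # hypothetical "real" distribution
p_g = [0.1, 0.3, 0.6]  # hypothetical "generated" distribution

kl_forward = kl_divergence(p_r, p_g)
kl_backward = kl_divergence(p_g, p_r)
jsd_forward = js_divergence(p_r, p_g)
jsd_backward = js_divergence(p_g, p_r)
# Distributions with disjoint support reach the upper bound log 2.
jsd_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
```

With natural logarithms, JSD never exceeds log 2, which is why it stops being informative once the two distributions no longer overlap.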
As discussed, the success of GAN training depends on the training hyperparameters. Especially when training a model with a large number of parameters, such as the one for 3D voxel generation in this research, adjusting those hyperparameters is extremely difficult. To overcome this challenge, prior work introduces another index to measure the distance between probability distributions instead of JSD. In Wasserstein GAN [1], the Wasserstein distance is introduced as the distance metric. The Wasserstein distance is expressed as follows:
$$W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}[\|x - y\|],$$
where $\Pi(P_r, P_g)$ denotes the set of all joint distributions $\gamma(x, y)$ whose marginals are $P_r$ and $P_g$, respectively. Intuitively, $\gamma(x, y)$ indicates how much “mass” must be transported from $x$ to $y$ to convert the probability distribution $P_r$ into the probability distribution $P_g$. Originally, this is a metric used for optimal transport problems. It makes it possible to measure the distance between low dimensional manifolds. The Wasserstein distance can be a better way to describe the distance than JSD, but it is difficult to calculate. According to the Kantorovich-Rubinstein duality [9], it can be expressed using a 1-Lipschitz function $f$ as follows:
$$W(P_r, P_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})],$$
where $\tilde{x} = G(z)$ and $z \sim p(z)$. A 1-Lipschitz function is one for which the slope of the straight line through any two points does not exceed $1$ in absolute value. Here, if $f$ is a function $f_w$ represented by some parameters $w$ and $w$ follows $w \in \mathcal{W}$, we can express $W(P_r, P_g)$ as follows:
$$W(P_r, P_g) = \max_{w \in \mathcal{W}} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim p(z)}[f_w(G(z))].$$
In order to satisfy the 1-Lipschitz condition, it suffices that each parameter lies in a compact space, that is, the absolute value of each weight parameter is clipped to a certain value $c$. The Discriminator is called the Critic to distinguish it from that of the original GAN.
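Weight clipping itself is a one-line operation applied after each Critic update. The sketch below assumes a flat list of weights and a clipping value of 0.01; both the helper `clip_weights` and that value are illustrative assumptions, not settings stated in this paper.

```python
def clip_weights(weights, c):
    """Clamp every weight to the interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

weights = [-0.30, -0.004, 0.0, 0.007, 0.25]  # hypothetical Critic weights
clipped = clip_weights(weights, c=0.01)
```

Weights that were far outside the interval all end up exactly at the boundary, while small weights are left unchanged.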
In WGAN, the Critic and the Generator networks are trained alternately until the Wasserstein distance converges. While the Critic attempts to calculate the Wasserstein distance between the training data and the generated data accurately, the Generator attempts to bring the probability distribution of the generated data close to that of the training data by minimizing the calculated Wasserstein distance. Because the gradients do not vanish even when the Critic's estimate of the Wasserstein distance converges completely, WGAN is very stable in training. Therefore, careful adjustment of the training balance between the Critic and the Generator is unnecessary.
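A small numerical example may help to see why the Wasserstein distance keeps providing a useful training signal. For two point masses on a one-dimensional grid, JSD stays at log 2 no matter how far apart they are, whereas the Wasserstein distance grows with the separation. The sketch below is an illustration only; it uses the fact that in one dimension, on a unit-spaced grid, the Wasserstein-1 distance equals the summed absolute difference of the cumulative distributions. All helper names are hypothetical.

```python
import math
from itertools import accumulate

def wasserstein_1d(p, q):
    """W1 between two histograms on the same unit-spaced grid."""
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

def js_divergence(p, q):
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def point_mass(position, size):
    """Histogram with all probability at one grid cell."""
    return [1.0 if i == position else 0.0 for i in range(size)]

real = point_mass(0, 6)
# Map each shift to (Wasserstein distance, JSD) against the shifted point mass.
results = {
    shift: (wasserstein_1d(real, point_mass(shift, 6)),
            js_divergence(real, point_mass(shift, 6)))
    for shift in (1, 3, 5)
}
```

The Wasserstein column tells the Generator how far it still has to move, while the JSD column is flat and therefore carries no gradient with respect to the shift.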
In conventional GANs, since the Discriminator and the Generator networks use different loss functions, the loss values do not converge even if training progresses well. The problem is that it is difficult to decide when to stop training. In WGAN, however, the loss value of the Critic tends to converge. Therefore, one of the advantages of WGAN is that a decrease in the loss value correlates with an improvement in the quality of the generated data.
However, one disadvantage of WGAN is the clipping of the weight parameters. If the weight parameters are clipped, the weights become polarized toward the clipping boundary values, resulting in exploding or vanishing gradients. This delays training. Here, the optimal Critic has the characteristic that the norm of its gradient is $1$ at almost all points under $P_r$ and $P_g$ [4]. Based on this feature, WGAN-GP is proposed [4]. In WGAN-GP, a penalty term is introduced into the loss function so that the norm of the Critic's gradient is $1$ at almost all points under $P_r$ and $P_g$. This allows the Critic to be optimized without clipping the weights. Let $D(x)$ be the output of the Critic. Then the loss is expressed as follows:
$$L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda\, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}[(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1)^2],$$
where $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ at $\epsilon \sim U[0, 1]$, $x \sim P_r$, $\tilde{x} \sim P_g$. By penalizing the gradient, the weights take diverse values without polarizing, giving higher model performance.
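The gradient penalty can be sketched in a few lines: interpolate between a real and a generated sample, take the gradient of the Critic at that point, and penalize the squared deviation of its norm from 1. The code below is a toy illustration under stated assumptions: the Critic is a hypothetical linear function, the gradient is approximated by central differences instead of automatic differentiation, and the coefficient of 10 is an assumed value rather than one reported in this paper.

```python
import math
import random

def numerical_gradient(critic, x, eps=1e-5):
    """Central-difference gradient of a scalar function of a vector."""
    grad = []
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        grad.append((critic(hi) - critic(lo)) / (2.0 * eps))
    return grad

def gradient_penalty(critic, real_batch, fake_batch, lam, rng):
    """lam * mean over the batch of (||grad D(x_hat)||_2 - 1)^2."""
    total = 0.0
    for x, x_tilde in zip(real_batch, fake_batch):
        e = rng.random()  # interpolation coefficient in [0, 1)
        x_hat = [e * a + (1.0 - e) * b for a, b in zip(x, x_tilde)]
        grad = numerical_gradient(critic, x_hat)
        norm = math.sqrt(sum(g * g for g in grad))
        total += (norm - 1.0) ** 2
    return lam * total / len(real_batch)

def linear_critic(w):
    """Toy Critic D(x) = w . x, whose gradient is w everywhere."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

rng = random.Random(0)
real_batch = [[1.0, 2.0], [0.5, -1.0]]
fake_batch = [[0.0, 0.0], [2.0, 1.0]]
unit_penalty = gradient_penalty(linear_critic([0.6, 0.8]), real_batch, fake_batch, 10.0, rng)
steep_penalty = gradient_penalty(linear_critic([3.0, 4.0]), real_batch, fake_batch, 10.0, rng)
```

A Critic whose gradient norm is already 1 receives almost no penalty, while a steeper Critic (norm 5 here) is penalized by 10 times (5 - 1) squared.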
2.4 text2shapeGAN[2]
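As mentioned in Section 1, this paper is based on prior work which generates 3D voxels from text [2]. text2shapeGAN (TSGAN) generates 3D voxels stably by combining the CGAN and WGAN-GP techniques. In TSGAN, the Critic evaluates not only how realistic the generated voxels look, but also how faithfully they reflect the text. The objective function of TSGAN is as follows:
$$L_{CWGAN} = \mathbb{E}_{t \sim p_\tau}[D(t, G(t))] + \mathbb{E}_{(\tilde{t}, \tilde{s}) \sim p_{mis}}[D(\tilde{t}, \tilde{s})] - 2\, \mathbb{E}_{(\hat{t}, \hat{s}) \sim p_{mat}}[D(\hat{t}, \hat{s})] + \lambda_{GP} L_{GP},$$
$$L_{GP} = \mathbb{E}_{(\bar{t}, \bar{s}) \sim p_{GP}}[(\|\nabla_{\bar{t}} D(\bar{t}, \bar{s})\|_2 - 1)^2 + (\|\nabla_{\bar{s}} D(\bar{t}, \bar{s})\|_2 - 1)^2],$$
where $t$ is the text embedding, $s$ is the 3D voxel, and $p_\tau$ is the probability distribution of the text embedding. In addition, $p_{mat}$ and $p_{mis}$ are the probability distributions of matching text-voxel pairs and mismatching text-voxel pairs, respectively. Note that they sum up the gradient penalties for all the input variables of $D$ to make the gradient penalty.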
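To make the roles of the three kinds of Critic inputs concrete, the sketch below combines Critic scores into a single loss value. It assumes the objective takes the form used in [2], namely the mean score on generated pairs plus the mean score on mismatching pairs, minus twice the mean score on matching pairs, plus a weighted gradient penalty. The function name, the scores, and the coefficient are hypothetical.

```python
def tsgan_critic_loss(scores_generated, scores_mismatched, scores_matched,
                      gradient_penalty, lam):
    """Critic loss assembled from per-sample Critic scores (assumed form)."""
    def mean(values):
        return sum(values) / len(values)
    return (mean(scores_generated)
            + mean(scores_mismatched)
            - 2.0 * mean(scores_matched)
            + lam * gradient_penalty)

loss = tsgan_critic_loss(
    scores_generated=[0.2, 0.4],   # Critic scores for (text, generated voxel)
    scores_mismatched=[0.1, 0.3],  # Critic scores for (text, wrong real voxel)
    scores_matched=[0.9, 0.7],     # Critic scores for (text, correct real voxel)
    gradient_penalty=0.05,
    lam=10.0,
)
```

The factor of two on the matching term balances the two "negative" terms, so the Critic is rewarded for scoring matching pairs above both generated and mismatching ones.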
Fig. 1 shows the model of TSGAN. First, it combines the text embedding with a noise vector. This is input to the Generator, and a set of 3D voxels is output by deconvolution. The last layer of the Generator is a sigmoid layer, so that the output values are restricted to the range from $0$ to $1$. The generated 3D voxels are input to the Critic and transformed into a one-dimensional vector via convolutional layers. This vector is combined with the text embedding and then passed to the fully connected layers. The output of the final fully connected layer is a scalar value. The output of the Generator is a set of voxels of .
3 Proposed Method
3.1 Approach
The simplest conceivable approach to generating high resolution 3D shapes is to add a higher resolution deconvolution layer to the TSGAN model. However, because the model has to deal with three-dimensional data, the number of parameters to be added becomes too large to handle. In this research, we assume the use of a GPU for faster training, but the number of parameters that can be stored for training is limited by the memory size of the GPU.
If the number of parameters is large, the problem is not only that training is sometimes terminated by a lack of memory, but also that the number of epochs required for training increases dramatically. Since the cost of training on three-dimensional data grows cubically with the resolution, training will not finish in a realistic time. Even if we limit the number of epochs, the generated voxels may collapse. For these reasons, simply adding a higher resolution layer does not work well.
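The cubic growth can be made concrete with a little arithmetic. The sketch below counts the values in a voxel grid and the weights of a single 3D deconvolution layer. The resolutions, channel counts, and kernel size are hypothetical numbers chosen for illustration and are not the settings used in this paper.

```python
def voxel_count(resolution, channels=4):
    """Number of values in a cubic voxel grid (4 channels assumed, e.g. color plus occupancy)."""
    return channels * resolution ** 3

def deconv3d_params(in_channels, out_channels, kernel):
    """Weights plus biases of one 3D (de)convolution layer."""
    return in_channels * out_channels * kernel ** 3 + out_channels

def deconv2d_params(in_channels, out_channels, kernel):
    """Same layer in 2D, for comparison."""
    return in_channels * out_channels * kernel ** 2 + out_channels

low = voxel_count(32)
high = voxel_count(64)
params_3d = deconv3d_params(64, 32, 4)
params_2d = deconv2d_params(64, 32, 4)
```

Doubling the resolution multiplies the number of voxel values by eight, and each 3D layer carries roughly a kernel-size factor more weights than its 2D counterpart.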
To overcome the challenges described above, our approach divides the task into two steps, following the idea of StackGAN [12]: one generates a low resolution shape which roughly reflects the target text (StageI), and the other generates the corresponding high resolution shape (StageII). In StackGAN, StageI generates a low resolution image by roughly deciding its color distribution and layout. In StageII, the low resolution image generated in StageI is passed through convolution layers and then combined with the text embedding, and the result is sent to residual layers. Finally, a high resolution output image is generated via deconvolution layers. In the proposed method, we use TSGAN as StageI and construct a new model for StageII to generate high resolution voxels. The following sections describe the details of these two stages.
3.2 Low resolution task (StageI)
At this stage, the Generator tries to create the rough shapes of the resulting voxels. Unlike TSGAN, StageI in this research does not need to generate voxels that are strictly faithful to the input text, since the details described in the text are refined at StageII. Instead of using latent vectors combined with additional noise as proposed in [2], we use only the latent vectors as the input to the Generator, since we found that this produces voxels of sufficient quality and achieves faster convergence of training. In addition, we found a problem in TSGAN. In TSGAN, the generated voxels are first converted to one-dimensional vectors through the convolutional and fully connected layers of the Critic, and then combined with the text latent vectors. In some cases this loses the spatial information of the voxels, resulting in a lack of meaningful connection between the voxels and the corresponding text. Therefore, we spatially duplicate the text embedding vectors and combine them with the convolved voxels to retain the spatial features, as in StackGAN [12]. Fig. 2 shows our proposed network model for StageI.
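The spatial duplication of the text embedding can be sketched with NumPy broadcasting: the embedding is repeated at every spatial location of the convolved voxel feature map and appended along the channel axis. The function name, the channel-first layout, and the feature map size are assumptions made for illustration; only the 128-dimensional embedding size is taken from this paper.

```python
import numpy as np

def tile_and_concat(voxel_features, text_embedding):
    """Append a text embedding to every spatial cell of a voxel feature map.

    voxel_features: array of shape (C, D, H, W).
    text_embedding: array of shape (E,).
    Returns an array of shape (C + E, D, H, W).
    """
    _, d, h, w = voxel_features.shape
    e = text_embedding.shape[0]
    tiled = np.broadcast_to(text_embedding[:, None, None, None], (e, d, h, w))
    return np.concatenate([voxel_features, tiled], axis=0)

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 4, 4))  # hypothetical convolved voxels
embedding = rng.normal(size=128)          # text embedding
combined = tile_and_concat(features, embedding)
```

Because the voxel features are never flattened, later convolution layers can still relate each spatial region to the text.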
3.3 High resolution task (StageII)
In the high resolution task, the role of the Critic network varies depending on the types of its input variables and its loss function. In contrast with StageI, which decides the rough color and shape of the voxels, we can consider two training models for StageII: one that focuses only on raising the resolution, and one that extends the existing training model by refining the prior work. Therefore, we propose the following two models for StageII:

(v0): This model supposes that the faithfulness to the text achieved in StageI is sufficient, so the binding to the text is relaxed. The Critic network focuses on whether the generated high resolution shapes are correct or not.

(v1): Like TSGAN, the Critic network focuses on whether the voxels are accurately generated from the text description of the shape.
3.3.1 High resolution model v0
In this model, the Critic monitors whether higher resolution shapes are appropriately obtained from the low resolution voxels. Fig. 3 shows the flow of the v0 model.
A text embedding vector is first input to the Generator of StageI to generate low resolution voxels. The output voxels are input to the Generator of StageII to generate higher resolution voxels. In the v0 model, the input of the Critic network consists of both the high resolution voxels and the low resolution voxels, so that the Critic can evaluate the high resolution voxels based on the information in the low resolution ones. We do not use the text embedding vectors in StageII because StageII concentrates solely on raising the resolution. As a result, the high resolution task is completely separated from the low resolution task.
Fig. 4 shows the training model of v0. We introduce a residual layer as a hidden layer of the Generator network of StageII. With the residual layer, each layer learns a residual function with reference to its input instead of learning the desired output directly, which makes the layers easier to optimize [5].
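The idea of a residual layer can be sketched as follows: the layer computes a correction F(x) and adds it to its own input. For brevity this illustration uses two dense layers on flat vectors instead of 3D convolutions, and all names and sizes are hypothetical.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Return x + F(x), where F is a two-layer network with a ReLU in between."""
    hidden = np.maximum(0.0, x @ w1)  # ReLU activation
    return x + hidden @ w2

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6))  # batch of 2 feature vectors

# With zero weights the block is exactly the identity mapping.
identity_out = residual_block(x, np.zeros((6, 10)), np.zeros((10, 6)))

# With non-zero weights the block only has to model the difference from x.
w1 = rng.normal(size=(6, 10)) * 0.1
w2 = rng.normal(size=(10, 6)) * 0.1
out = residual_block(x, w1, w2)
correction = out - x
```

Starting from the identity is convenient for a resolution-raising stage, since the output is expected to stay close to the input features.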
The loss function for v0 model is defined as follows:
where the generators for low resolution and high resolution tasks are indicated by and , respectively. By relieving the term contributed to from , the model suppresses learning of whether the shape matches the text description. The degree of matching between the text description and the generated shape therefore depends on the quality of training in StageI. However, by concentrating solely on raising the resolution, the number of parameter updates in the training phase can be greatly reduced, resulting in faster training.
3.3.2 High resolution model v1
In this model, the Critic network monitors whether the voxels are faithfully generated from the text descriptions, as in TSGAN. Fig. 5 shows the flow of the v1 model.
A text embedding vector is first input to the Generator of StageI to generate low resolution voxels. Here, StageI is the same as the Generator in Fig. 2. The generated low resolution voxels are input to StageII, and a one-dimensional vector is created through the convolution layers. This vector is combined with the text embedding vector and converted to voxels again by the deconvolution layers. The vector representation of the voxels is combined with the text embedding vector in order to generate higher resolution voxels which reflect the details described in the text. The Critic network tries to determine both how realistic the generated voxels look and how closely they match the text by using the Wasserstein distance. Fig. 6 shows the training model of v1.
The loss function for v1 is defined as follows:
This is the same loss function as that of TSGAN. Therefore, the Critic network evaluates whether the generated voxels are properly created from the text. The cost for voxels mismatched with the text is included in the loss function to check how accurately the voxels match the text. However, many parameter updates are needed to train the network, since the Generator has two roles: raising the resolution and making the generated voxels closely match the text. Therefore, a long training time is expected for this model.
4 Evaluation
4.1 Experimental setup
In this section, we evaluate the proposed high resolution models v0 and v1. We compare the shapes generated by the models after 18000 epochs of training. In this evaluation, we use two object categories, table and chair, for generating 3D shapes. For this experiment, we used the same dataset as TSGAN [2]. We created train/validation/test data by randomly splitting the dataset into the ratio of , respectively. As for the text, we exclude descriptions which are not related to a table or a chair, descriptions with misspellings, and descriptions without enough detail to generate voxels.
In order to evaluate the quality of the generated 3D shapes, we use the following two indices.

Accuracy of classification (Class acc.): The first index is the classification accuracy, which evaluates how correctly the generated shapes are classified. We created a separate classifier network to classify the two target objects represented by the voxels generated from the text descriptions. All the generated 3D voxels are input to the classifier, and its accuracy is evaluated. The classifier model is based on the prior work [7]. We expect that this accuracy metric reflects how realistic the generated voxels look.

Mean squared error (mse): The second index is the mean squared error with respect to the text embedding vector. We created an encoder that generates a 128-dimensional vector from a generated voxel. The dimension of this vector is the same as that of the text embedding vector. We train this encoder to minimize the mean squared error between its output vectors and the original text embedding vectors. We input the generated voxels into the encoder network and then calculate the mse between the result and the corresponding latent vectors. We expect that this index evaluates how well each voxel matches its text representation.
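Once the classifier predictions and the encoder outputs are available, the two indices reduce to simple computations. The sketch below shows them on made-up values; the function names, labels, and vectors are hypothetical, and the classifier and encoder networks themselves are not reproduced here.

```python
def classification_accuracy(predicted, expected):
    """Fraction of generated shapes assigned to the intended class."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)

def mean_squared_error(encoded, text_embedding):
    """Mean squared difference between an encoded voxel and its text embedding."""
    return sum((a - b) ** 2 for a, b in zip(encoded, text_embedding)) / len(encoded)

accuracy = classification_accuracy(
    predicted=["chair", "table", "chair", "chair"],
    expected=["chair", "table", "table", "chair"],
)
mse = mean_squared_error([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
```

A higher accuracy suggests more recognizable shapes, and a lower mse suggests that the generated voxels stay closer to the text they were generated from.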
4.2 Generation result and its quantitative evaluation
Table 1 shows the evaluation results of two indices for v0 and v1 models. The higher the classification accuracy rate, the better the quality of generated voxels. As for the mse, the lower the better. From the table, we see that the classification accuracy rate of v1 is larger than that of v0 and the mse of v1 is smaller than that of v0.
The loss curves show the trends of the training losses of the Critic and the Generator networks for the two high resolution tasks. The overall trend of the Critic loss is similar for v0 and v1. However, the variance of the Generator loss is larger for v0 than for v1.
Table 1: Evaluation results of the two indices for the v0 and v1 models.

Method     Class acc.   mse
DataSet    1.0          0.153
v0         0.97         0.156
v1         0.98         0.141
4.3 Discussion
As stated above, we compare the results of the networks trained for 18000 epochs. At this point, both the v0 and v1 models show almost no collapse of the 3D shapes. According to Table 1, the v1 model is more faithful to the text than the v0 model. As for the accuracy, both models achieve high classification accuracy, but the v1 model achieves slightly higher accuracy than the v0 model.
We attribute this to the fact that the Critic in the v1 model can calculate the distance between the probability distributions of the training data and the generated voxels more accurately by referring to the text description. Thus, the Generator could generate more realistic voxels. The mean squared error for v1 is smaller than that for v0, meaning that v1 achieves better encoding results. This indicates that the v1 model can generate voxels more faithful to the text. Since we use the cost for voxels mismatched with the text, the Critic can calculate the Wasserstein distance more properly.
As can be seen from Fig. 7 and Fig. 8 (and also the figures in the Appendix), the shape resolution is appropriately increased in both the v0 and v1 models. However, a slight change in the color distribution can be seen compared with the low resolution cases. We consider this to be because the constraint of “how realistic and faithful the generated voxel is” is applied to the loss function, but no constraint is set regarding “how faithful a high resolution shape is to the corresponding low resolution shape”. As in the previous study [11], introducing a constraint on the mean and variance of the color distribution into the loss function may suppress the change in color.
5 Conclusion
In this paper, we extended the prior work [2] and proposed new GAN models that can generate high resolution voxels. In the proposed model, we also improved the previous method so that even the low resolution voxels are generated more faithfully to a given text.
The contributions of this research are threefold. First, we proposed models which generate high resolution voxels faithful to a given text from low resolution voxels. The evaluation results show that it is possible to generate high resolution voxels with good visual quality. Second, we devised multiple roles for the Critic network and configured multiple models accordingly. We showed that the accuracy differs depending on whether the resolution-raising task is separated from the consideration of the text latent vector. Third, we introduced multiple indices to compare the performance of the models. As described in the discussion, our proposed model has the potential to generate voxels that are more faithful to a given text with higher quality shapes.
References

[1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
[2] Kevin Chen, Christopher B Choy, Manolis Savva, Angel X Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. arXiv preprint arXiv:1803.08495, 2018.
[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
[4] Ishaan Gulrajani. Improved training of Wasserstein GANs.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016.
[6] William E. Lorensen and Harvey E. Cline. Marching cubes: A high resolution 3d surface construction algorithm. SIGGRAPH Comput. Graph., 21(4):163–169, August 1987.
[7] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 922–928, September 2015.
[8] Simon Osindero. Conditional generative adversarial nets. pages 1–7.
[9] Cedric Villani. Grundlehren der mathematischen Wissenschaften. 2009.
[10] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, pages 1316–1324. IEEE Computer Society, 2018.
[11] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis & Machine Intelligence, page 1, 2017.
[12] Han Zhang, Tao Xu, and Hongsheng Li. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pages 5908–5916. IEEE Computer Society, 2017.
Appendix A Examples of Generation
We attach examples of the voxels generated by the models in this experiment. All voxels are generated by model v1.