1 Introduction
Textures play important roles in multimedia applications, such as the understanding and generation of multimedia content. Texture synthesis and generation have also been extensively investigated in the past years [1]. Before the revival of deep learning, researchers mainly used example-based approaches to synthesize textures. In these methods, new textures with similar appearances to existing samples can be produced. With the development of deep learning, more methods for texture generation have been proposed that learn from training data. However, no visual perceptual information is involved in this process, whereas humans commonly use perceptual attributes, such as roughness, coarseness and directionality, to describe textures. Moreover, the majority of deep learning based methods can only generate images of low quality. Thus, it is desirable to develop a new way of generating high-quality textures based on human perceptual descriptions; for example, the new generation method should be able to produce textures with strong directionality or less regularity as required by the user.
Convolutional Neural Networks (CNNs), which were inspired by the mechanism of the visual cortex, have shown great superiority in recent studies [2] [3]. With the aid of deep convolutional networks, researchers have made breakthroughs in many classical computer vision tasks. For example, in the ImageNet Large Scale Visual Recognition Challenge, the performance of computer algorithms even surpassed that of humans [4]. Consequently, researchers have been investigating different CNN-based approaches for image generation [5] [6] [7].
Goodfellow et al. proposed the generative adversarial network (GAN) framework [8], which produced excellent results in many image generation tasks. However, the generated samples were still of low resolution and far from perfect. In order to generate more realistic images, Wang and Gupta factorized the image generation process and proposed a joint model consisting of Style and Structure Generative Adversarial Networks [9]. Experimental results in [9] suggested that this factorization brings a large gain when generating realistic indoor scenes. All of this work indicates that combining convolutional neural networks with adversarial training schemes is a promising practice for generating high-quality images.
In addition to generating natural images, another question is what we can generate from semantics or high-level descriptions. Many efforts have been made on this topic. Karpathy et al. proposed a fragment embedding method in 2014 [10], which was essentially a bidirectional retrieval scheme, as the desired image must exist in the image database. Yan et al. modeled images as a composite of foreground and background and developed a layered generative model [11]. Their method shows promising results in the tasks of attribute-conditioned image reconstruction and completion. Nevertheless, the quality of the generated images is still not good enough for texture perception studies.
The contribution of this paper is a new joint model that combines perceptual feature regression and adversarial schemes for generating textures based on perceptual descriptions. In existing Conditional Generative Adversarial Networks (CGAN) [12], the discriminative model needs to estimate the joint distribution of condition vectors and samples, and cannot always provide enough information for the generator to adjust its parameters. In our new model, by contrast, perceptual feature regression supervises the generator to produce textures consistent with the human visual system. Thus, the discriminative model is assisted by the perceptual regression model and is relieved of the inaccurate estimation of joint distributions. Furthermore, the perceptual model is able to supply more information to the generator and guide it to produce textures with sufficient detail, which leads to high-quality output texture images.
2 Related Work
Textures have attracted widespread attention in the research fields of visual perception and computer vision. Rao et al. identified the perceptual features people use to classify textures and also established the correlation between semantic attributes and textures [13], which showed the importance of perceptual features for understanding texture images. Meanwhile, texture synthesis and texture generation have been active research areas for many years. Shin et al. proposed a pixel-based method for texture synthesis with non-parametric sampling [14], and Wei and Levoy proposed an efficient algorithm using tree-structured vector quantization for realistic texture synthesis, which required only a sample texture as input [15]. These studies mainly concern example-based texture synthesis, whereas our work focuses on generating textures according to user-defined perceptual attributes.
Deep learning models, particularly deep convolutional neural networks, have achieved great success in texture analysis due to their strong learning capability. Texture synthesis based on CNNs is a new research topic [16] that has produced promising results, which suggests that it deserves more research effort. In [16], Gatys et al. combined the conceptual framework of spatial summary statistics on feature responses with the feature space of a convolutional neural network, with the goal of generating textures from a given source image. Ulyanov et al. also trained feed-forward generation networks to generate multiple samples of the same texture with arbitrary sizes [17]. In this manner, the representation of the given image can be learned by the convolutional networks, and new samples can be generated from the networks. Goodfellow et al. [8] proposed a generative adversarial framework that estimates generative models via an adversarial process, in which a generative model $G$ and a discriminative model $D$ are trained simultaneously. The generative model $G$ is responsible for capturing the data distribution, and the discriminative model $D$ estimates the probability that a sample comes from the training data rather than from $G$. The training procedure for $G$ is to maximize the probability of $D$ making a mistake. It has been shown that a GAN can generate realistic images from uniformly distributed random noise [8]. Furthermore, GAN was extended to CGAN for conditional image generation by Mirza and Osindero [12], where both $G$ and $D$ receive an additional vector of information as a condition. This vector might contain information about the class of the training example. CGAN has been successfully applied to digit and face image generation [18], whereas we are interested in generating textures with given perceptual attributes.
Inspired by previous work, this paper proposes a joint model that combines perceptual feature regression and an adversarial training scheme for perception driven texture generation. Since the perceptual regression model can provide additional information for the generator in the adversarial scheme, the proposed model is able to generate high-quality textures.
3 Perception Driven Texture Generation
In this section, we first introduce the overall architecture of the proposed joint model for perception driven texture generation. Then we provide details on the network design and initialization.
3.1 Overall Architecture of the Joint Model
Human observers essentially use perceptual features for texture description, e.g. regularity and repetitiveness [19]. According to [20], there are 12 prominent perceptual features that humans use to perceive a texture. In practice, humans can not only perceive these features from a texture but also imagine a texture from these perceptual descriptions. For example, textures with weak or strong directionality can be easily pictured in the human mind; in contrast, no computer algorithm is able to generate textures from these descriptions. Therefore, we designed a joint deep model to achieve this goal. As shown in Fig. 1, the overall architecture includes three parts: a perceptual feature regression model, a conditional generative model, and a discriminative model. The generative model is responsible for conditional texture generation, the discriminative model is used to distinguish whether a generated texture comes from the training sample distribution, and the perceptual model drives the generative model to produce textures possessing certain attributes.
Inspired by the success of the Inception-v3 model [4], which reached a 3.46% top-5 error rate and even surpassed human performance in the 2015 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), we use Inception-v3 for our perceptual feature regression. First, we change the activation function of the final output layer and auxiliary units to $\tanh$, as our perceptual features are scaled to the range between $-0.9$ and $0.9$. The reason for scaling the range is to avoid saturating the output neurons. Furthermore, $\tanh$ is much easier to train than the sigmoid [21]. Second, we replace the softmax cross-entropy loss with the quadratic loss. Then we train the modified Inception-v3 model on our texture database for perceptual feature prediction. In the following sections, we call the modified Inception-v3 the perceptual model.
In the CGAN framework, the discriminator needs to figure out the joint distribution of the condition and samples. This distinguishing task is relatively difficult, and the discriminator cannot supply enough information for the generator to adjust its parameters. In our model, we use the perceptual model to impose perceptual constraints on the generator; this provides additional information for the generator to produce textures with the intended perceived attributes. We use $G$, $D$ and $P$ to represent the generative, discriminative and perceptual models, respectively. Then the loss of $D$ can be defined as:
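$$L_D = -\frac{1}{N}\sum_{i=1}^{N}\Bigl[\,y_i \log D(x_i, c_i) + (1 - y_i)\log\bigl(1 - D(x_i, c_i)\bigr)\Bigr] \tag{1}$$
where $x_i$ represents a training example, $c_i$ is the corresponding perceptual feature vector, $y_i$ is one or zero, indicating whether $(x_i, c_i)$ is a real pair, and $N$ is the number of training examples. The quadratic loss for $P$ is defined as:
$$L_P = \frac{1}{N}\sum_{i=1}^{N}\bigl\| P(x_i) - c_i \bigr\|_2^2 \tag{2}$$
The loss of $G$ contains two parts: one from $D$, and the other from $P$; the definition is:
$$L_{GD} = -\frac{1}{N}\sum_{i=1}^{N}\log D\bigl(G(z_i, c_i), c_i\bigr) \tag{3}$$
$$L_{GP} = \frac{1}{N}\sum_{i=1}^{N}\bigl\| P\bigl(G(z_i, c_i)\bigr) - c_i \bigr\|_2^2 \tag{4}$$
$$L_G = L_{GD} + \lambda L_{GP} \tag{5}$$
where $\lambda$ is a tradeoff parameter, $z_i$ is a random noise vector, $P$ is trained beforehand, and $G$ and $D$ are trained in an adversarial scheme. In this manner, the discriminator makes the generator produce realistic textures, and the perceptual model makes the generated textures possess certain perceptual attributes.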
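The three losses described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes a binary cross-entropy form for the discriminator, a mean squared Euclidean distance for the perceptual model, and a generator loss that adds the two with a tradeoff weight. The function names and the normalization by the batch size are assumptions made for the sketch; the default weight of 10 is the value reported in the experiments.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an implementation detail of this sketch


def discriminator_loss(d_outputs, labels):
    """Binary cross-entropy over (sample, condition) pairs.

    d_outputs[i] is the discriminator output for pair i, in (0, 1).
    labels[i] is 1 for a real pair and 0 for a generated pair.
    """
    total = 0.0
    for d, y in zip(d_outputs, labels):
        total += y * math.log(d + EPS) + (1 - y) * math.log(1.0 - d + EPS)
    return -total / len(labels)


def perceptual_loss(predicted, target):
    """Mean squared Euclidean distance between feature vectors."""
    total = 0.0
    for p, c in zip(predicted, target):
        total += sum((pj - cj) ** 2 for pj, cj in zip(p, c))
    return total / len(target)


def generator_loss(d_on_fake, p_on_fake, conditions, tradeoff=10.0):
    """Adversarial term (fool the discriminator) plus weighted perceptual term."""
    adversarial = discriminator_loss(d_on_fake, [1] * len(d_on_fake))
    perceptual = perceptual_loss(p_on_fake, conditions)
    return adversarial + tradeoff * perceptual
```

In this sketch the generator is rewarded when the discriminator labels its output as real and when the perceptual model recovers the conditioning vector from the generated texture, which mirrors the two roles described in the text.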
3.2 Network Design Details
In this subsection, we first introduce the initialization scheme for our deep networks, and then present strategies for the design of certain parts of the network. Inspired by [22], we initialize the weights of each layer of the proposed network according to $\mathrm{Var}(w) = 2/n$. In most cases, we only consider the back-propagation situation, so $n$ represents the number of units that can be reached by one input neuron, and $w$ represents a weight in a convolutional or fully connected layer. ReLU is used as the activation function in the network, since it reduces the vanishing gradient effect and makes the model learn fast. However, we would like the output of the generator to be limited to a certain range, because an image always has limited pixel values. The discriminator should yield a probability, which indicates whether an image comes from the real training samples. Accordingly, we use $\tanh$ as the activation function in the output layer of the generative model, and the sigmoid in the discriminative model. Thus, we adopt different initialization strategies for the output layers. In order to preserve the gradient variance, when the activation function is $\tanh$, we initialize the weights using a truncated normal distribution with standard deviation $\sqrt{1/n}$. In contrast, we use $\sqrt{16/n}$ as the deviation when the activation function is the sigmoid. Here, we assume that the weights are initialized independently, and the bias is initialized to zero. In particular, if the number of units decreases too much in the output layer, we slightly reduce the deviation of the weights to avoid saturating the output in the forward pass. We will introduce more details about the network design in Section 4.
It should be noted that, in a fully connected layer, the initialization strategy can be easily analyzed. However, it becomes complicated in convolutional layers. We take the 1D convolution as an example, which can be easily extended to the higher-dimensional case. We use $n$ to represent the number of units that can propagate their gradients to a certain input unit. When the number of input units becomes very large, we can calculate an average value for $n$. We define a universal formulation:
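$$\begin{aligned} n_0 &= \left\lfloor \frac{k - 1}{s} \right\rfloor + 1 \\ n_1 &= \left\lfloor \frac{k - 2}{s} \right\rfloor + 1 \\ &\;\;\vdots \\ n_{s-1} &= \left\lfloor \frac{k - s}{s} \right\rfloor + 1 \end{aligned} \tag{6}$$
where $\lfloor x \rfloor$ represents the maximal integer no larger than $x$, $k$ represents the kernel size, and $s$ represents the step size. Eq. (6) illustrates one period of the convolutional operation. Each line in Eq. (6) calculates the number of units that can be reached by a certain input unit. The period begins with the input unit at the start of a kernel window. The length of the cycle is $s$. From Eq. (6), we can get the average value of $n$ for the general situation:
$$\bar{n} = \frac{1}{s}\sum_{j=0}^{s-1} n_j = \frac{k}{s} \tag{7}$$
We use the average value of $n$ to calculate the deviation of $w$ for initialization. To extend this to the two-dimensional situation, we simply expand $k$ and $s$ to two dimensions. This scheme is used to initialize our networks throughout all experiments.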
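The counting argument for the 1D convolution can be checked numerically. The sketch below is an illustration under assumptions rather than the authors' code: it assumes that kernel windows start at multiples of the step size, indexes the input units within one period by an offset, and compares a closed-form count with a brute-force count on a long input away from the borders. The function names are hypothetical.

```python
def reach_counts(kernel, stride):
    """Closed-form count, for each offset in one period of length `stride`,
    of the output units whose kernel window covers that input unit."""
    counts = []
    for offset in range(stride):
        # Floor division also handles offsets that no window covers
        # (kernel smaller than stride), where the count is zero.
        counts.append((kernel - 1 - offset) // stride + 1)
    return counts


def brute_force_counts(kernel, stride, num_outputs=64):
    """Count coverage directly on a long 1D input and read off one interior period."""
    length = (num_outputs - 1) * stride + kernel
    hits = [0] * length
    for out in range(num_outputs):
        start = out * stride
        for pos in range(start, start + kernel):
            hits[pos] += 1
    middle = (num_outputs // 2) * stride  # aligned with the start of a window
    return hits[middle:middle + stride]


def average_reach(kernel, stride):
    """Average number of output units reached by one input unit."""
    return sum(reach_counts(kernel, stride)) / stride
```

For any kernel size and step size the per-period counts sum to the kernel size, so the average is the kernel size divided by the step size; for example, `average_reach(5, 2)` evaluates to 2.5.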
In order to emphasize the importance of perceptual features for texture generation, we stretch the perceptual feature vector to 800 dimensions via a fully connected layer. The random noise vector is drawn uniformly from a 200-dimensional space ranging from $-1$ to $1$. The reason for using these specific dimensions is explained as follows. A random noise vector with 200 dimensions can be varied significantly to generate diverse textures given certain perceptual features. In theory, if we change each dimension of the random noise vector with a step of 0.1, each of the 200 dimensions can take about 20 distinct values, so we can obtain an enormous number of different vectors. This is a large enough space for varied texture appearance. In addition, textures with the same perceptual feature vector have similar appearances. In the above analysis, we showed that covariate shift can be avoided by a suitable initialization strategy in both the forward and backward views. In the fully connected layer for stretching perceptual features, we simply consider forward propagation. Thus, we let $n$ represent the number of units in the input layer. Consequently, the stretched perceptual features have a variance similar to that of the originals. Let $z$ represent the random noise variable. Then its variance is $1/3$, as $z$ is uniformly distributed on $[-1, 1]$. Recall that the perceptual features are scaled to the range between $-0.9$ and $0.9$. Let $c$ represent one perceptual feature, and we use the following equation for scaling:
(8)
Through this transformation, the variance of the resulting $c$ is about one third of the noise variance. Since the stretching layer is initialized using the forward principle, the stretched features have approximately the same variance. In other words, the variance of the random noise is three times that of the stretched perceptual features. Hence, if we want the perceptual features to play the same role as the random noise in the generating task, we should make the number of output units in the stretching layer three times that of the random noise, i.e. 600. In this work we therefore set the number to 800, so that the perceptual features dominate the generating procedure.
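The variance argument can be illustrated with a small simulation. This is a sketch under stated assumptions, not the authors' procedure: the scaled perceptual features are replaced by a hypothetical zero-mean uniform stand-in whose variance is one third of the noise variance, the stretching layer is a plain linear map with Gaussian weights of variance one over the input size (the forward principle), and all function names are made up for the example.

```python
import math
import random


def uniform_variance(low, high):
    """Variance of a uniform distribution on [low, high]."""
    return (high - low) ** 2 / 12.0


def stretched_variance(num_samples=100, in_dim=12, out_dim=800, seed=0):
    """Empirical variance of the stretched features under forward-preserving init."""
    rng = random.Random(seed)
    # Hypothetical stand-in for the scaled features: uniform with variance 1/9.
    half_width = 1.0 / math.sqrt(3.0)
    # Forward principle: Var(w) = 1 / in_dim keeps the input variance.
    w_std = math.sqrt(1.0 / in_dim)
    weights = [[rng.gauss(0.0, w_std) for _ in range(in_dim)]
               for _ in range(out_dim)]
    total, total_sq, count = 0.0, 0.0, 0
    for _ in range(num_samples):
        features = [rng.uniform(-half_width, half_width) for _ in range(in_dim)]
        for row in weights:
            value = sum(w * x for w, x in zip(row, features))
            total += value
            total_sq += value * value
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


noise_var = uniform_variance(-1.0, 1.0)      # variance of the noise, 1/3
feature_var = stretched_variance()           # close to 1/9 under the assumptions above
noise_budget = 200 * noise_var               # summed variance of the 200 noise units
feature_budget = 800 * feature_var           # summed variance of the 800 stretched units
```

Under these assumptions, 600 stretched units would carry roughly the same summed variance as the 200 noise units, and 800 units carry more, which is the sense in which the perceptual features are said to dominate.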
4 Experiments
4.1 The Data Set
In our experiments, we use the Perceptual Texture Database (PTD), which contains 450 textures with corresponding 12D perceptual features [20]. The textures in PTD have a resolution of , and the 12D perceptual features include contrast, repetitiveness, granularity, randomness, roughness, density, directionality, structural complexity, coarseness, regularity, orientation and uniformity. However, since 450 textures are still too few to train a deep neural network, we expand the examples in the following way. First, we crop each texture into 81 textures of size ; the step used for cropping is 8. Second, we resize the resulting textures to . Regarding perceptual features, we let the resulting textures have the same values as their originals. We eventually obtain 36450 examples of size , and we use 36000 of them to train our models. The remaining 450 textures are left as the validation set. It should be noted that it is reasonable to let the resulting 81 textures share the perceptual features of their originals. First, the textures in PTD are isotropic; a region can cover most of the area of the original texture and can therefore keep the original perceptual characteristics. Second, resizing the texture to does not cause an obvious blurring effect.
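The cropping scheme can be sketched as a sliding window. The image and crop sizes below are hypothetical placeholders, chosen only so that a step of 8 gives 9 offsets per axis and therefore 81 crops per texture; they are not taken from the paper. The counts of 450 textures, 81 crops, and 36000 training examples are the ones stated above.

```python
def crop_offsets(image_size, crop_size, step):
    """Top-left offsets of a sliding window along one axis."""
    return list(range(0, image_size - crop_size + 1, step))


def crop_boxes(image_size, crop_size, step):
    """All (top, left, bottom, right) boxes for a square image."""
    offsets = crop_offsets(image_size, crop_size, step)
    return [(top, left, top + crop_size, left + crop_size)
            for top in offsets for left in offsets]


IMAGE_SIZE = 512  # hypothetical value for illustration
CROP_SIZE = 448   # hypothetical value: leaves 64 pixels of slack, i.e. 9 offsets at step 8

boxes = crop_boxes(IMAGE_SIZE, CROP_SIZE, step=8)
total_examples = 450 * len(boxes)        # every crop inherits its source texture's features
validation_size = total_examples - 36000
```

Each crop is then resized and paired with the perceptual feature vector of the texture it was cut from.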
4.2 Perceptual Feature Regression
Since our perceptual model was modified from Inception-v3, we did not need to train it from scratch. The Inception-v3 model pre-trained on ImageNet can be found in [23]. Since our perceptual model differed from Inception-v3 only in the output layer and loss definition, we initialized the output layer with truncated Gaussian noise and loaded the other layers from the pre-trained Inception-v3 model. Then we fine-tuned the perceptual model with an initial learning rate of 0.001. The RMSProp method was used for gradient descent [24]. We ran the optimization algorithm for 50000 iterations. The process is illustrated in Fig. 2(a). Finally, the Euclidean loss converged to 0.01161, and the final evaluation error was 0.0039. Since the perceptual features have 12 attributes, the average standard error deviation for each attribute can be calculated:
(9) 
This means that we can accurately predict the perceptual features of a texture with a very small deviation. Based on this observation, we make a basic assumption: if the generated textures have certain perceptual attributes, these attributes should be correctly perceived by the perceptual model. We use the pre-trained perceptual model as an accessory of the whole generative framework.
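One way to read the per-attribute deviation is as a root mean square over the 12 attributes. The sketch below assumes that the reported evaluation error of 0.0039 is a squared error summed over the 12 attributes of one texture; this interpretation is an assumption made for illustration, and the function name is hypothetical.

```python
import math


def per_attribute_rms(total_squared_error, num_attributes):
    """Root-mean-square deviation per attribute, given a summed squared error."""
    return math.sqrt(total_squared_error / num_attributes)


# Assumption: 0.0039 is the squared error summed over 12 attributes.
deviation = per_attribute_rms(0.0039, 12)
```

Under that assumption the deviation is about 0.018 per attribute, which is small compared with the feature range from -0.9 to 0.9.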
4.3 Generating Textures from Perceptual Features
To generate realistic textures, we must design a reasonable network structure. The kernel size is a vital factor for generating high-quality images. In the experiments, we found that if we set the kernel size too small, e.g. 3, the generated textures had more details but looked too crude. If the kernel size was too large, e.g. 7, the generated textures looked smoother, but with fewer details. Eventually, we used kernels for convolution or inverse convolution in our discriminative and generative models. We also tried to fuse kernels of different sizes to generate textures with more details and global information. However, this did not produce good results.
Since one part of the input to the generative model was drawn from random noise (the other part being the perceptual feature vector), there were infinitely many training examples in practice. Thus we used the ADAM [25] method for optimization. We optimized the generative model twice after each optimization of the discriminative model. Each batch contained 60 training examples. The tradeoff parameter was set to 10. In the end, we ran 266000 optimization iterations. The training process is illustrated in Fig. 2(b)(c)(d). Two experiments were designed after the models were trained. First, we fed real perceptual features from our database, together with different random noise, to the generative model. The generated textures are shown in Fig. 3. Second, we manually edited some perceptual features and used them to generate textures. It should be emphasized that the manually edited perceptual features were based on existing ones, i.e. only one perceptual feature was set to three different values ($-0.9$, $0$ and $0.9$), whereas the others were kept the same as the existing ones. In Fig. 4, we only provide six results due to limited space, but more results are provided in the supplementary materials. As an example, we can see from the first column of Fig. 4 that, when we decrease the perceptual feature value of directionality from $0.9$ to $-0.9$, the textures gradually lose their overall direction. These results indicate that the proposed method is able to generate desired textures by varying certain perceptual attributes.
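The attribute-editing experiment can be sketched as follows: start from an existing 12D feature vector, overwrite a single attribute with each of the three probe values, and pair every edited vector with fresh uniform noise. This is a minimal illustration; the ordering of attributes inside the vector and the helper names are assumptions, and the generator itself is not shown.

```python
import random

# Attribute names as listed for the database; their order in the vector is assumed.
ATTRIBUTES = [
    "contrast", "repetitiveness", "granularity", "randomness",
    "roughness", "density", "directionality", "structural complexity",
    "coarseness", "regularity", "orientation", "uniformity",
]


def edit_attribute(features, name, values=(-0.9, 0.0, 0.9)):
    """Return one copy of `features` per probe value, changing only `name`."""
    index = ATTRIBUTES.index(name)
    variants = []
    for value in values:
        edited = list(features)  # copy so the original vector is left untouched
        edited[index] = value
        variants.append(edited)
    return variants


def sample_noise(rng, dim=200):
    """Uniform noise in [-1, 1], one value per noise dimension."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
base = [0.1] * len(ATTRIBUTES)  # placeholder for a real feature vector from the database
inputs = [(sample_noise(rng), edited)
          for edited in edit_attribute(base, "directionality")]
```

Each `(noise, features)` pair would then be fed to the trained generator to obtain one texture per probe value.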
5 Conclusion
We propose a novel deep network model for perception driven texture generation. In the proposed model, a perceptual regression component is integrated with the generative framework, which drives the produced textures to possess certain perceptual attributes. This perceptual regression model partially relieves the discriminative model's workload, and can supply more information for the generator to produce textures that are perceived as intended. Experimental results show that the joint model is able to generate realistic textures from given perceptual attributes. We attribute this success to the fact that if the generated texture is realistic enough, it should have the potential to be correctly perceived by the pre-trained deep network.
It should be noted that the perceptual features are not independent of each other. If we change one perceptual attribute arbitrarily, the remaining relevant features might also need to be changed to fit the real distribution. In future work, we will design an auxiliary model for generating valid perceptual feature vectors; in this way, we may simply provide an existing perceptual feature vector and the desired value for a certain attribute, and the tool can generate a suitable input perceptual feature vector.
References
 [1] Alexey Badalov, Irene Cheng, Claudio Silva, and Anup Basu, “An in-place texture synthesis technique for memory constrained multimedia applications,” in IEEE International Conference on Multimedia and Expo, 2011, pp. 1–4.
 [2] Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al., “What is the best multi-stage architecture for object recognition?,” in 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2009, pp. 2146–2153.
 [3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in Neural Information Processing Systems, vol. 25, no. 2, pp. 2012, 2012.
 [4] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” arXiv preprint arXiv:1512.00567, 2015.
 [5] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh, “A fast learning algorithm for deep belief nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
 [6] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra, “Draw: A recurrent neural network for image generation,” arXiv preprint arXiv:1502.04623, 2015.
 [7] Alexey Dosovitskiy, Jost Tobias Springenberg, and Thomas Brox, “Learning to generate chairs with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1538–1546.
 [8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [9] Xiaolong Wang and Abhinav Gupta, “Generative image modeling using style and structure adversarial networks,” arXiv preprint arXiv:1603.05631, 2016.
 [10] Andrej Karpathy, Armand Joulin, and Li Fei-Fei, “Deep fragment embeddings for bidirectional image sentence mapping,” in Advances in Neural Information Processing Systems, 2014, pp. 1889–1897.
 [11] Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee, “Attribute2image: Conditional image generation from visual attributes,” arXiv preprint arXiv:1512.00570, 2015.
 [12] Mehdi Mirza and Simon Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
 [13] Nalini Bhushan, A Ravishankar Rao, and Gerald L Lohse, “The texture lexicon: Understanding the categorization of visual texture terms and their relationship to texture images,” Cognitive Science, vol. 21, no. 2, pp. 219–246, 1997.
 [14] Seunghyup Shin, Tomoyuki Nishita, and Sung Yong Shin, “On pixel-based texture synthesis by non-parametric sampling,” Computers & Graphics, vol. 30, no. 5, pp. 767–778, 2006.
 [15] Li-Yi Wei and Marc Levoy, “Fast texture synthesis using tree-structured vector quantization,” in Proceedings of the 27th annual conference on Computer graphics and interactive techniques. ACM Press/Addison-Wesley Publishing Co., 2000, pp. 479–488.
 [16] Leon Gatys, Alexander S Ecker, and Matthias Bethge, “Texture synthesis using convolutional neural networks,” in Advances in Neural Information Processing Systems, 2015, pp. 262–270.
 [17] Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor Lempitsky, “Texture networks: Feed-forward synthesis of textures and stylized images,” arXiv preprint arXiv:1603.03417, 2016.
 [18] Jon Gauthier, “Conditional generative adversarial nets for convolutional face generation,” Class Project for Stanford CS231N: Convolutional Neural Networks for Visual Recognition, Winter semester, vol. 2014, 2014.
 [19] A. Ravishankar Rao and Gerald L. Lohse, “Towards a texture naming system: Identifying relevant dimensions of texture,” Vision Research, vol. 36, no. 11, pp. 1649–1669, 1996.
 [20] Jun Liu, Junyu Dong, Xiaoxu Cai, Lin Qi, and Mike Chantler, “Visual perception of procedural textures: Identifying perceptual dimensions and predicting generation models,” PloS one, vol. 10, no. 6, pp. e0130335, 2015.
 [21] Xavier Glorot and Yoshua Bengio, “Understanding the difficulty of training deep feedforward neural networks.,” in Aistats, 2010, vol. 9, pp. 249–256.
 [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
 [23] “Inceptionv3 model trained on imagenet,” http://download.tensorflow.org/models/image/imagenet/inceptionv320160301.tar.gz.
 [24] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky, “Overview of mini-batch gradient descent,” http://www.cs.toronto.edu/%7Etijmen/csc321/slides/lecture_slides_lec6.pdf.
 [25] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.