cgCNN
The goal of exemplar-based texture synthesis is to generate texture images that are visually similar to a given exemplar. Recently, promising results have been reported by methods relying on convolutional neural networks (ConvNets) pretrained on large-scale image datasets. However, these methods have difficulties in synthesizing image textures with non-local structures and in extending to dynamic or sound textures. In this paper, we present a conditional generative ConvNet (cgCNN) model which combines deep statistics and the probabilistic framework of the generative ConvNet (gCNN) model. Given a texture exemplar, the cgCNN model defines a conditional distribution using deep statistics of a ConvNet, and synthesizes new textures by sampling from this conditional distribution. In contrast to previous deep texture models, the proposed cgCNN does not rely on pretrained ConvNets, but instead learns the weights of the ConvNet for each input exemplar. As a result, the cgCNN model can synthesize high-quality dynamic, sound and image textures in a unified manner. We also explore the theoretical connections between our model and other texture models. Further investigations show that the cgCNN model can be easily generalized to texture expansion and inpainting. Extensive experiments demonstrate that our model achieves better or at least comparable results than the state-of-the-art methods.
Exemplar-based texture synthesis (EBTS) has been a dynamic yet challenging topic in computer vision and graphics for the past decades [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. It aims to produce new samples that are visually similar to a given texture exemplar. The main difficulty of EBTS is to efficiently synthesize texture samples that are not only perceptually similar to the exemplar, but also able to balance the repeated and innovative elements in the texture. To overcome this difficulty, two main categories of approaches have been proposed in the literature, i.e., patch-based methods [2, 3, 4, 11] and methods relying on parametric statistical models [1, 5, 6, 7, 12]. Given a texture exemplar, patch-based methods regard small patches in the exemplar as basic elements, and generate new samples by copying pixels or patches from the exemplar to the synthesized texture under certain spatial constraints, such as the Markovian property [2, 11, 4]. These methods can produce new textures with high visual fidelity to the given exemplar, but they often result in verbatim copies, and few of them can be extended to dynamic textures, except [13]. Moreover, in contrast with their promising performance, they contribute little to understanding the underlying process of textures. In contrast, statistical parametric methods concentrate on exploring the underlying models of the texture exemplar, and new texture images can then be synthesized by sampling from the learned texture model. These methods are better at balancing the repetitive and innovative nature of textures, but they usually fail to reproduce textures with highly structured elements. It is worth mentioning that a few of these methods can be extended to sound textures [14] and dynamic ones [15]. Some recent surveys on EBTS can be found in [16, 9].
Recently, parametric models have been revived by the use of deep neural networks [12, 17, 18, 19]. These models employ deep ConvNets that are pretrained on large-scale image datasets instead of handcrafted filters as feature extractors, and generate new samples by seeking images that maximize a certain similarity between their deep features and those of the exemplar. Although these methods show great improvements over traditional parametric models, there are still two unsolved or only partially solved problems: 1) It is difficult to extend these methods to other types of textures, such as dynamic and sound textures, since they rely on ConvNets pretrained on large-scale datasets, such as ImageNet, which are difficult to obtain in the video or sound domain. 2) These models cannot synthesize textures with non-local structures, as the optimization algorithm is likely to be trapped in local minima where non-local structures are not preserved. A common remedy is to use extra penalty terms, such as the Fourier spectrum [19] or the correlation matrix [18], but these terms bring in extra hyperparameters and are slow to optimize.

In order to address these problems, we propose a new texture model named conditional generative ConvNet (cgCNN) by integrating deep texture statistics and the probabilistic framework of the generative ConvNet (gCNN) [20]. Given a texture exemplar, cgCNN first defines an energy-based conditional distribution using deep statistics of a trainable ConvNet, which is then trained by maximum likelihood estimation (MLE). New textures can be synthesized by sampling from the learned conditional distribution. Unlike previous texture models that rely on pretrained ConvNets, cgCNN learns the weights of the ConvNet for each input exemplar. It therefore has two main advantages: 1) It allows synthesizing image, dynamic and sound textures in a unified manner. 2) It can synthesize textures with non-local structures without using extra penalty terms, as it is easier for the sampling algorithm to escape from local minima.

We further present two forms of our cgCNN model, i.e. the canonical cgCNN (ccgCNN) and the forward cgCNN (fcgCNN), by exploiting two different sampling strategies. We show that these two forms of cgCNN have strong theoretical connections with previous texture models. Specifically, ccgCNN uses Langevin dynamics for sampling, and it can synthesize highly non-stationary textures, while fcgCNN uses a fully convolutional generator network as an approximate fast sampler, and it can synthesize arbitrarily large stationary textures. We further show that Gatys' method [12] and TextureNet [17] are special cases of ccgCNN and fcgCNN respectively. In addition, we derive a concise texture inpainting algorithm based on cgCNN, which iteratively searches for a template in the uncorrupted region and synthesizes a texture patch according to the template.
Our main contributions are thus summarized as follows:
We propose a new texture model named cgCNN which combines deep statistics and the probabilistic framework of the gCNN model. Instead of relying on pretrained ConvNets as previous deep texture models do, the proposed cgCNN learns the weights of the ConvNet adaptively for each input exemplar. As a result, cgCNN can synthesize high-quality dynamic, sound and image textures in a unified manner.
We present two forms of cgCNN and show their effectiveness in texture synthesis and expansion: ccgCNN can synthesize highly non-stationary textures without extra penalty terms, while fcgCNN can synthesize arbitrarily large stationary textures. We also show their strong theoretical connections with previous texture models. Note that fcgCNN is the first deep texture model that enables us to expand dynamic or sound textures.
We present a simple but effective algorithm for texture inpainting based on the proposed cgCNN. To our knowledge, it is the first neural algorithm for inpainting sound textures.
We conduct extensive experiments on the synthesis, expansion and inpainting of various types of textures using cgCNN, and demonstrate that our model achieves better or at least comparable results than the state-of-the-art methods. (All experimental results can be found at captain.whu.edu.cn/cgcnntexture.)
The rest of this paper is organized as follows: Sec. II reviews related works. Sec. III recalls four baseline models. Sec. IV details cgCNN's formulation and training algorithm, and provides some theoretical analysis of the model. Sec. V uses cgCNN for the synthesis of various types of textures and adapts the synthesis algorithm to texture inpainting. Sec. VI presents results that demonstrate the effectiveness of cgCNN in synthesizing, expanding and inpainting all three types of textures. Sec. VII draws some conclusions.
One seminal work on parametric EBTS was made by Heeger and Bergen [1], who proposed to synthesize textures by matching the marginal distributions of the synthesized and the exemplar texture. Subsequently, Portilla et al. [6] extended this model by using more and higher-order measurements. Another remarkable work at that time was the FRAME model proposed by Zhu et al. [5], a framework unifying random field models and the maximum entropy principle for texture modeling. Other notable works include [7, 8, 21]. These methods laid a solid theoretical foundation for texture synthesis, but are limited in their ability to synthesize structured textures.
Recently, Gatys et al. [12] made a breakthrough in texture modelling by using deep neural networks. Their model can be seen as an extension of Portilla's model [6], where the linear filters are replaced by a pretrained deep ConvNet. Gatys' method was subsequently extended to style transfer [22], where the content image is forced to have deep statistics similar to those of the style image. In more recent works, Gatys' method has been extended to synthesizing textures with non-local structures by using additional constraints such as the correlation matrix [18] and the spectrum [19]. However, such constraints bring in extra hyperparameters that require manual tuning, and are slow to optimize [18] or cause spectrum-like noise [19]. In contrast, our model can synthesize non-local structures without the aid of these constraints, thanks to its effective sampling strategy. In order to accelerate the synthesis process and synthesize textures larger than the input, Ulyanov et al. [17] and Johnson et al. [23] proposed to combine a fully convolutional generator with Gatys' model, so that textures can be synthesized in a fast forward pass of the generator. Similar to Ulyanov et al.'s model [17], our model also uses a generator for fast sampling and texture expansion. In contrast to Gatys' method, which relies on pretrained ConvNets, Xie et al. [20] proposed a generative ConvNet (gCNN) model that can learn the ConvNet and synthesize textures simultaneously. In subsequent works, Xie et al. [24] proposed CoopNet by combining gCNN with a latent variable model. This model was later extended to video [10] and 3D shape [25] synthesis. Our model can be regarded as a combination of Gatys' method and gCNN, as it utilizes the idea of deep statistics from Gatys' method and the probabilistic framework of gCNN.
For dynamic texture synthesis, it is common to use linear autoregressive models [26, 15] to model the appearance and dynamics. Later work [27] compared these methods quantitatively by studying the synthesizability of the input exemplars. Recent works leveraged deep learning techniques for synthesizing dynamic textures. For instance, Tesfaldet et al. [28] proposed to combine Gatys' method [12] with an optical flow network in order to capture the temporal statistics. In contrast, our model does not require the aid of other networks, as it is flexible enough to use spatial-temporal ConvNets for spatial and temporal modelling.

As for sound texture synthesis, classic models [14] are generally based on the wavelet framework and use handcrafted filters to extract temporal statistics. Recently, Antognini et al. [29] extended Gatys' method to sound texture synthesis by applying a random network to the spectrograms of sound textures. In contrast, our model learns the network adaptively instead of fixing it to random weights, and it is applied to raw waveforms directly.
The texture inpainting problem is a special case of image or video inpainting problem, where the inpainted image or video is assumed to be a texture. Igehy [30] transferred Heeger and Bergen’s texture synthesis algorithm [1] to an inpainting algorithm. Our inpainting algorithm shares important ideas with Igehy’s method [30], as we also adopt an inpainting by synthesizing scheme. Other important texture inpainting methods include conditional Gaussian simulation [31] and PatchMatch based methods [32, 33].
This section recalls several baseline models on which cgCNN is built. The theoretical connections between these models and cgCNN will be discussed in Sec. IV.
Given an RGB-color image texture exemplar $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ are the height and width of the image, texture synthesis aims to generate new samples $I'$ that are visually similar to $I$.
Gatys’ method uses a pretrained deep ConvNet as a feature extractor. For an input texture exemplar, the Gram matrices of feature maps at selected layers are first calculated. New texture samples are then synthesized by matching the Gram matrices of the synthesized textures and the exemplar.
Formally, Gatys' method tries to solve the following optimization problem:

$\min_{I'} \mathcal{L}(I', I)$  (1)
The objective function is defined as:

$\mathcal{L}(I', I) = \sum_{l} \| G(F_l(I')) - G(F_l(I)) \|_F^2$  (2)
where $F$ is a pretrained ConvNet, $F_l(\cdot)$ is the feature map at layer $l$, and $\|\cdot\|_F$ is the Frobenius norm. $G(\cdot)$ is the Gram matrix defined as:

$G(F)_{ij} = \sum_{k} F_{ik} F_{jk}$  (3)

where $F \in \mathbb{R}^{C \times N}$ is a feature map with $C$ channels and $N$ elements in each channel.
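As a concrete illustration, the Gram statistic of Eqn. (3) can be sketched in a few lines of NumPy. The per-element normalization below is our own convention for numerical convenience, not necessarily the scaling used in the paper:

```python
import numpy as np

def gram_matrix(feature_map):
    """Gram statistic of a (C, N) feature map: C channels,
    N spatial elements per channel (Eqn. (3), up to normalization)."""
    C, N = feature_map.shape
    # G_ij = sum_k F_ik * F_jk; dividing by N is an assumed convention
    return feature_map @ feature_map.T / N

# Toy feature map: 3 channels, an 8x8 grid flattened to N = 64
F = np.random.default_rng(0).normal(size=(3, 64))
G = gram_matrix(F)
```

Because $G$ only records channel-to-channel correlations summed over all positions, it is symmetric and positive semi-definite, and it discards the spatial arrangement of features, which is what makes it a *texture* statistic.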
This model is trained by gradient descent using back-propagation. Each step follows

$I'_{t+1} = I'_t - \eta \, \frac{\partial \mathcal{L}(I'_t, I)}{\partial I'_t}$  (4)

where $\eta$ is the learning rate.
TextureNet is a forward version of Gatys' method. It learns a generator network $g$ with trainable weights $\theta_g$, which maps a sample $z$ of random noise to a local minimum of Eqn. (2). This amounts to solving the following optimization problem:

$\min_{\theta_g} \mathbb{E}_{z \sim \mathcal{N}(0,1)} \big[ \mathcal{L}(g(z; \theta_g), I) \big]$  (5)

$g$ is trained by gradient descent with approximate gradients:

$\frac{\partial}{\partial \theta_g} \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}(g(z_i; \theta_g), I)$  (6)

where $z_i$ are samples from $\mathcal{N}(0,1)$.
gCNN is defined in a more general setting. It aims to estimate the underlying distribution of a set of images and to generate new images by sampling from this distribution. In our work, we only consider the specific case where the input set contains a single image $I$, i.e. the input set is $\{I\}$, and $I$ is a stationary texture exemplar.
gCNN defines a distribution of $I'$ in image space:

$p(I'; \theta) = \frac{1}{Z(\theta)} \exp\{-E(I'; \theta)\}$  (7)

where $Z(\theta)$ is the normalization factor, and $E(I'; \theta)$ is the energy function defined by

$E(I'; \theta) = -f(I'; \theta)$  (8)

where $f(\cdot; \theta)$ is the output of a ConvNet with learnable weights $\theta$. gCNN is trained by maximum likelihood estimation (MLE).
CoopNet extends gCNN by combining it with a latent variable model [34] which takes the form of

$I' = g(z; \theta_g) + \epsilon, \quad z \sim \mathcal{N}(0, 1), \ \epsilon \sim \mathcal{N}(0, \sigma^2)$  (9)

where $g$ is a forward ConvNet parametrized by $\theta_g$ and $I'$ is the synthesized image.
CoopNet is trained by MLE, which iterates the following four steps:

Generate samples $I'_i = g(z_i; \theta_g)$ using random $z_i$.

Feed $I'_i$ to gCNN and run several steps of Langevin dynamics for $I'_i$ under the energy $E(\cdot; \theta)$, yielding refined samples $\tilde{I}_i$.

Run several steps of Langevin dynamics for $z_i$ under the reconstruction objective of the latent variable model.

Update $\theta$ and $\theta_g$ using gradient descent.
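The cooperative loop above can be sketched in NumPy under strong simplifications: a toy quadratic energy stands in for the gCNN energy, a plain trainable vector stands in for the generator, the Langevin noise is omitted, and the latent-variable step is reduced to pulling the generator toward the refined sample. This is only a schematic of the interaction between the two networks, not the actual CoopNet algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: E(I; theta) = ||I - theta||^2 / 2 plays the
# role of the gCNN energy, so dE/dI = I - theta.
def dE_dI(I, theta):
    return I - theta

exemplar = np.full(4, 2.0)   # the observed "texture"
theta = np.zeros(4)          # stand-in for gCNN weights
theta_g = np.zeros(4)        # stand-in for generator weights
delta, lr = 0.1, 0.1

for _ in range(2000):
    I = theta_g + 0.01 * rng.normal(size=4)      # 1) generate an initial sample
    for _ in range(10):                          # 2) refine I by noise-free Langevin steps
        I -= 0.5 * delta**2 * dE_dI(I, theta)
    theta_g += lr * (I - theta_g)                # 3) pull the generator toward the refined sample
    theta -= lr * ((theta - exemplar)            # 4) MLE-style update: lower the exemplar's
                   - (theta - I))                #    energy, raise the sample's energy
```

Under these toy dynamics both the "gCNN" parameter and the "generator" drift toward the exemplar, mirroring how the two networks in CoopNet cooperate rather than compete.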
In this section, we first present the definition of our conditional generative ConvNet (cgCNN) model, and then explore two forms of cgCNN, i.e. the canonical cgCNN (ccgCNN) and the forward cgCNN (fcgCNN). Finally, we conclude this section with some theoretical explanations of cgCNN.
Let $I$ represent an image, dynamic or sound texture exemplar, and note that the shape of $I$ depends on its type. Specifically, $I \in \mathbb{R}^{T \times H \times W \times 3}$ represents a dynamic texture exemplar, $I \in \mathbb{R}^{H \times W \times 3}$ represents an image texture exemplar, and $I \in \mathbb{R}^{T}$ represents a sound texture exemplar, where $H$, $W$ are spatial sizes and $T$ is the temporal size.
Given a texture exemplar $I$, cgCNN defines a conditional distribution of the synthesized texture $I'$:

$p(I' \mid I; \theta) = \frac{1}{Z(I; \theta)} \exp\{-E(I', I; \theta)\}$  (10)

where $Z(I; \theta)$ is the normalization factor, and $E(I', I; \theta)$ is the energy function which is supposed to capture the visual difference between $I'$ and $I$ by assigning lower values to $I'$'s that are visually closer to $I$. As an analogue of $\mathcal{L}$ in Eqn. (2), we define $E$ by

$E(I', I; \theta) = \sum_{l} \| S(F_l(I'; \theta)) - S(F_l(I; \theta)) \|_F^2$  (11)

where $F(\cdot; \theta)$ is a deep network with learnable weights $\theta$, $F_l$ is the feature map at the $l$-th layer, and $S$ is a statistic measurement, such as the Gram matrix defined in Eqn. (3). We also test the spatial mean vector as an alternative measurement in our experiment section. For simplicity, in the rest of this paper, we denote $E(I', I; \theta)$ by $E(I'; \theta)$ when the meaning is clear from the context.

The objective of training cgCNN is to estimate the conditional distribution $p(\cdot \mid I; \theta)$ using only one input datum $I$. This is achieved by minimizing the KL divergence between the empirical data distribution, which is a Kronecker delta function $\delta_I$, and the estimated distribution $p(\cdot \mid I; \theta)$. The KL divergence can be written as:

$KL(\delta_I \,\|\, p(\cdot \mid I; \theta)) = -H(\delta_I) - \log p(I \mid I; \theta)$  (12)

where $H(\cdot)$ denotes the entropy, and $H(\delta_I) = 0$.
Note that minimizing the above KL divergence is equivalent to MLE, where the log-likelihood is defined as the log-likelihood of the input given itself as the condition:

$L(\theta) = \log p(I \mid I; \theta) = -E(I; \theta) - \log Z(I; \theta)$  (13)

For consistency of notation, in the rest of this paper, we use the negative log-likelihood $-L(\theta)$ instead of the KL divergence as the objective function.
The gradient of $-L(\theta)$ can be written as follows:

$-\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial E(I; \theta)}{\partial \theta} - \mathbb{E}_{I' \sim p(\cdot \mid I; \theta)} \left[ \frac{\partial E(I'; \theta)}{\partial \theta} \right]$  (14)

Note that the expectation term in Eqn. (14) is analytically intractable, and has to be approximated by the Monte Carlo method. Suppose we have $K$ samples $\{I'_i\}_{i=1}^{K}$ drawn from $p(\cdot \mid I; \theta)$; the gradient can then be approximated as:

$-\frac{\partial L(\theta)}{\partial \theta} \approx \frac{\partial E(I; \theta)}{\partial \theta} - \frac{1}{K} \sum_{i=1}^{K} \frac{\partial E(I'_i; \theta)}{\partial \theta}$  (15)

We can then minimize $-L(\theta)$ using gradient descent according to Eqn. (15).
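The effect of the learning rule in Eqns. (14)-(15) can be illustrated with a toy NumPy example. The scalar linear statistic below is a hypothetical stand-in for the deep statistics of Eqn. (11); the point is only to show that one descent step on the approximated gradient raises the energy of the current samples while leaving the exemplar at its energy minimum:

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(x, exemplar, theta):
    """Toy energy (hypothetical): squared mismatch of a single
    linear 'deep statistic' s(x) = theta . x."""
    return (theta @ x - theta @ exemplar) ** 2

def energy_grad_theta(x, exemplar, theta):
    # d/dtheta of (theta.(x - exemplar))^2
    return 2 * (theta @ (x - exemplar)) * (x - exemplar)

exemplar = rng.normal(size=8)
samples = [rng.normal(size=8) for _ in range(4)]  # stand-ins for Langevin samples
theta = rng.normal(size=8)

# Approximated gradient of the negative log-likelihood (Eqn. (15))
grad = energy_grad_theta(exemplar, exemplar, theta) \
     - np.mean([energy_grad_theta(s, exemplar, theta) for s in samples], axis=0)

theta_new = theta - 0.01 * grad
```

Here the exemplar term of the gradient vanishes (its energy is already zero), so the update is pure gradient *ascent* on the mean sample energy: the learning step pushes the current samples "uphill", which is the mechanism that later helps the sampler escape local minima.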
Therefore, the key to training cgCNN is sampling from $p(\cdot \mid I; \theta)$. We use 1) Langevin dynamics and 2) a generator net for sampling, which lead to ccgCNN and fcgCNN respectively.
ccgCNN uses Langevin dynamics to sample from $p(\cdot \mid I; \theta)$. Specifically, starting from a random noise image $I'_0$, it uses the following rule to update $I'$:

$I'_{t+1} = I'_t - \frac{\delta^2}{2} \frac{\partial E(I'_t; \theta)}{\partial I'_t} + \delta \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, 1)$  (16)

where $I'_t$ is the sample at step $t$, $\delta$ is the step size, and $\epsilon_t$ is Gaussian noise. A training algorithm for ccgCNN can be derived by combining the Langevin sampling in Eqn. (16) and the approximated gradient in Eqn. (15). Starting from a random noise image, the algorithm iteratively goes through a Langevin sampling step and a learning step:

Langevin sampling: draw samples $\{I'_i\}$ using Langevin dynamics according to Eqn. (16).

Learning: update the network $F$ using the approximated gradient according to Eqn. (15).
The detailed training process is presented in Alg. 1.
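A minimal NumPy sketch of the Langevin update in Eqn. (16), run on a toy quadratic energy whose gradient we can write in closed form (a stand-in for the $\partial E / \partial I'$ that back-propagation computes in the real algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin_sample(grad_E, x0, step, n_steps):
    """Langevin dynamics of Eqn. (16):
    x <- x - (step^2 / 2) * dE/dx + step * gaussian_noise."""
    x = x0.copy()
    for _ in range(n_steps):
        x += -0.5 * step**2 * grad_E(x) + step * rng.normal(size=x.shape)
    return x

# Toy energy E(x) = ||x - mu||^2 / 2; its Langevin samples
# concentrate around the minimum mu.
mu = np.array([3.0, -1.0])
grad_E = lambda x: x - mu

samples = np.stack([langevin_sample(grad_E, np.zeros(2), step=0.3, n_steps=500)
                    for _ in range(200)])
```

With a small step size the sample population settles around the energy minimum, mimicking how Alg. 1 draws low-energy (i.e. exemplar-like) textures; the injected noise is what provides the diversity and the ability to escape shallow minima.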
The Langevin dynamics used in ccgCNN is slow, and may be the bottleneck of Alg. 1. As an alternative, we may also use a generator net as a fast approximate sampler of $p(\cdot \mid I; \theta)$. Specifically, we introduce a generator network $g$ with learnable weights $\theta_g$, which maps the normal distribution $\mathcal{N}(0, 1)$ to a parametrized distribution $q(\cdot; \theta_g)$. The training objective is to match $q$ and $p$, so that samples of $p$ can be approximated by samples of $q$. In other words, once $g$ is trained, approximate samples of $p$ can be drawn by forwarding a noise $z$ through network $g$, which is much faster than the Langevin dynamics in Eqn. (16). Formally, network $g$ is trained by minimizing the KL divergence between $q$ and $p$:

$KL(q(\cdot; \theta_g) \,\|\, p(\cdot \mid I; \theta)) = -H(q) + \mathbb{E}_{I' \sim q} [E(I'; \theta)] + \log Z(I; \theta)$  (17)
The first term in Eqn. (17) is the entropy of the distribution $q$, which is analytically intractable. Following TextureNet [17], we use the Kozachenko-Leonenko estimator [35] (KLE) to approximate this term. Given $K$ samples $\{x_i\}_{i=1}^{K}$ drawn from a $d$-dimensional distribution $q$, the KLE is defined, up to additive constants, as:

$H(q) \approx \frac{d}{K} \sum_{i=1}^{K} \log \| x_i - x_i^{NN} \|$  (18)

where $x_i^{NN}$ denotes the nearest neighbour of $x_i$ among the other samples.
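A simplified NumPy version of the estimator in Eqn. (18); the digamma and unit-ball-volume constants of the full Kozachenko-Leonenko estimator are dropped here, since only the dependence of the entropy term on the generator's output matters during training:

```python
import numpy as np

def kl_entropy(samples):
    """Simplified Kozachenko-Leonenko entropy estimate:
    d * mean(log nearest-neighbour distance), additive constants dropped."""
    K, d = samples.shape
    # pairwise distances with the diagonal masked out
    diff = samples[:, None, :] - samples[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    nn = dist.min(axis=1)  # distance from each sample to its nearest neighbour
    return d * np.mean(np.log(nn))

X = np.random.default_rng(0).normal(size=(64, 3))
```

A useful sanity check: scaling all samples by a factor $c$ shifts this estimate by exactly $d \log c$, matching the behaviour of true differential entropy; intuitively, spread-out samples have larger nearest-neighbour distances and hence a larger entropy estimate, which is what prevents the generator from collapsing to a single output.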
The second term in Eqn. (17) is an expectation of our energy function $E$. It can be approximated by averaging over a batch of samples of $q$.
Now, since both terms in Eqn. (17) can be approximated, the gradient of the KL divergence with respect to $\theta_g$ can be calculated as:

$\frac{\partial}{\partial \theta_g} KL(q \,\|\, p) \approx \frac{\partial}{\partial \theta_g} \left[ -H(\{g(z_i; \theta_g)\}_{i=1}^{K}) + \frac{1}{K} \sum_{i=1}^{K} E(g(z_i; \theta_g); \theta) \right]$  (19)

where $z_i$ are samples drawn from $\mathcal{N}(0, 1)$, and the entropy term is computed by the KLE in Eqn. (18).
The complete training algorithm of fcgCNN can be derived by training networks $F$ and $g$ jointly. Formally, the goal is to match three distributions, $\delta_I$, $p(\cdot \mid I; \theta)$ and $q(\cdot; \theta_g)$, by optimizing the following objective function,

$\min_{\theta, \theta_g} \; KL(\delta_I \,\|\, p(\cdot \mid I; \theta)) + KL(q(\cdot; \theta_g) \,\|\, p(\cdot \mid I; \theta))$  (20)
To achieve this goal, fcgCNN is trained by iteratively going through three steps: 1) sampling: draw samples $I'_i = g(z_i; \theta_g)$ from the generator; 2) learning $\theta$: update the network $F$ using the approximated gradient in Eqn. (15); 3) learning $\theta_g$: update the generator $g$ according to Eqn. (19).
The detailed algorithm is presented in Alg. 2.
We present some theoretical understandings of cgCNN by relating it to other neural models. We first point out that cgCNN is conceptually related to GAN [36], as it can be written in a min-max adversarial form. Then we show that: 1) ccgCNN and fcgCNN are generalizations of Gatys' method [12] and TextureNet [17] respectively; 2) ccgCNN is a variation of gCNN [20] with extra deep statistics, and the forward structures in fcgCNN and CoopNet are consistent. The main properties of these models are summarized in Tab. I.
The adversarial form of fcgCNN can be written (up to the entropy term of $q$) as:

$\min_{\theta_g} \max_{\theta} \; \mathbb{E}_{I' \sim q(\cdot; \theta_g)} [E(I', I; \theta)] - E(I, I; \theta)$  (21)

This adversarial form has an intuitive explanation: network $g$ tries to synthesize textures that are more visually similar to the input exemplar, and network $F$ tries to detect the differences between them. The training process ends when the adversarial game reaches an equilibrium. Similarly, we have the min-max form of ccgCNN:

$\min_{I'} \max_{\theta} \; E(I', I; \theta) - E(I, I; \theta)$  (22)

where the synthesized texture $I'$ plays the role that is played by $g$ in fcgCNN.
It is easy to see that ccgCNN is a generalization of Gatys' method with an extra step to learn the network $F$. Indeed, if we fix $F$ to be a pretrained ConvNet with fixed weights $\theta_0$ in Eqn. (22), ccgCNN becomes $\min_{I'} E(I', I; \theta_0)$, which is exactly Gatys' method as defined in Eqn. (1). Furthermore, since fcgCNN and TextureNet are built on ccgCNN and Gatys' method respectively, and they use the same forward structures, we can conclude that fcgCNN is a generalization of TextureNet as defined in Eqn. (5). In summary, we have the following proposition:
Gatys' method and TextureNet are special cases of ccgCNN and fcgCNN respectively, where the network $F$ is fixed to be a pretrained ConvNet.
Compared to Gatys' method, samples of ccgCNN are less likely to be trapped in local minima for too long, because the learning step always seeks to increase the energy of the current samples. For example, if $I'_t$ is a local minimum at step $t$, the subsequent learning step will increase the energy of $I'_t$; the energy of $I'_t$ may thus be higher than that of its neighbourhood at the beginning of the next step, and the Langevin steps will sample an $I'_{t+1}$ different from $I'_t$. In our experiments, we find this property enables us to synthesize highly structured textures without extra penalty terms.
Unlike TextureNet and Gatys' method, both ccgCNN and fcgCNN can synthesize other types of textures besides image textures, because they do not rely on pretrained ConvNets. In addition, thanks to their forward structures, both fcgCNN and TextureNet can synthesize textures that are larger than the input.
In general, ccgCNN can be regarded as a variation of gCNN for texture synthesis. It should be noticed that the energy defined in gCNN does not involve any deep statistics, hence it can be used to synthesize both texture and non-texture images, such as human faces. In contrast, the energy defined in cgCNN incorporates deep statistics (Gram matrix or mean vector) specifically designed for texture modelling; it is therefore more powerful in texture synthesis but cannot handle non-texture images.
CoopNet uses a latent variable model as the forward structure to accelerate the Langevin dynamics in gCNN. Note that the forward structures in CoopNet and fcgCNN are consistent, as they both seek to learn the distribution defined by their respective backward structures, i.e. gCNN and cgCNN. Furthermore, they are equivalent in a special setting, as stated in the following proposition.
If we 1) disable all noise terms in the Langevin dynamics, 2) set the number of Langevin steps in CoopNet to one, and 3) discard the entropy term in fcgCNN, the forward structures in CoopNet and fcgCNN become equivalent.
In this setting, denote the output of the latent variable model as $I'_0 = g(z; \theta_g)$. The target $\tilde{I}$ is then defined by one noise-free Langevin step starting from $I'_0$, i.e. $\tilde{I} = I'_0 - \frac{\delta^2}{2} \frac{\partial E(I'_0; \theta)}{\partial I'_0}$. Training $g$ amounts to minimizing the objective function $\frac{1}{2}\| g(z; \theta_g) - \tilde{I} \|^2$ via gradient descent. Note that the gradient of this objective with respect to $\theta_g$ (treating $\tilde{I}$ as fixed) is $\frac{\delta^2}{2} \frac{\partial E(I'_0; \theta)}{\partial I'_0} \frac{\partial g(z; \theta_g)}{\partial \theta_g}$, which is exactly back-propagation for minimizing $E(g(z; \theta_g); \theta)$ according to the chain rule. Because the generator net in fcgCNN is also trained by back-propagating the energy term (the entropy term being discarded), it is clear that the forward structures in CoopNet and fcgCNN are equivalent.
All of cgCNN, CoopNet and gCNN can synthesize various types of textures. However, unlike fcgCNN, whose synthesis step is a simple forward pass, the synthesis step of CoopNet involves several Langevin steps of gCNN; it is therefore difficult to expand textures using CoopNet.
Model | Forward structure | Backward structure | Multi-scale statistics | Dynamic texture synthesis | Sound texture synthesis | Texture expansion | Fast sampling
Gatys' [12] | none | pretrained ConvNet | ✓ | ✗ | ✗ | ✗ | ✗
TextureNet [17] | generator | pretrained ConvNet | ✓ | ✗ | ✗ | ✓ | ✓
gCNN [20] | none | gCNN | ✗ | ✓ | ✗ | ✗ | ✗
CoopNet [24] | latent variable model | gCNN | ✗ | ✓ | ✗ | ✗ | ✓
ccgCNN (Ours) | none | cgCNN | ✓ | ✓ | ✓ | ✗ | ✗
fcgCNN (Ours) | generator | cgCNN | ✓ | ✓ | ✓ | ✓ | ✓
In our model, we use the same training algorithms described in Alg. 1 and Alg. 2, and the same statistics (Gram matrix and mean vector), for all types of textures. Therefore, in order to synthesize different types of textures, we only need to modify the network dimensions accordingly, while all other settings remain the same.
Dynamic textures can be regarded as image textures with an extra temporal dimension. Therefore, we simply use 3-dimensional spatial-temporal convolutional layers in cgCNN to capture the spatial appearances and the temporal dynamics simultaneously. In other words, unlike the methods [28, 26] that model spatial and temporal statistics independently, our model treats them equally by regarding a clip of dynamic texture as a spatial-temporal volume, in which both the spatial and the temporal dimensions are stationary.
Sound textures can be regarded as a special case of dynamic textures in which spatial dimensions are not considered. However, modelling sound textures is not a simple task, because the sampling frequency of sound textures (on the order of 10 kHz) is usually far higher than the frame rate of dynamic textures (on the order of 10 Hz). As a result, sound textures show more complicated long-range temporal dependencies and multi-scale structures than dynamic textures.
In our model, we simply use 1-dimensional temporal convolutional layers in cgCNN to extract temporal statistics. We use atrous (dilated) [37] convolutions to ensure large receptive fields, which enable us to learn long-range dependencies. Unlike Antognini's model [29], which applies fixed random ConvNets to spectrograms, our model learns the ConvNet on raw waveforms.
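To make the receptive-field argument concrete: for a stack of stride-1 dilated convolutions, the receptive field is $r = 1 + \sum_l (k_l - 1) d_l$, so a doubling dilation schedule (assumed here purely for illustration; the paper's exact schedule may differ) grows the receptive field exponentially with depth:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 (a)trous 1-D convolutions:
    r = 1 + sum_l (k_l - 1) * d_l."""
    r = 1
    for k, d in zip(kernel_sizes, dilations):
        r += (k - 1) * d
    return r

# Ten kernel-3 layers with dilations 1, 2, 4, ..., 512
rf = receptive_field(kernel_sizes=[3] * 10, dilations=[2 ** i for i in range(10)])
```

With ten such layers the receptive field already spans 2047 samples, yet at a 10 kHz sampling rate that is still only about 0.2 s of audio, which illustrates why large dilations are essential for reaching texture-scale temporal dependencies in raw waveforms.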
As a proof of concept, we present a simple algorithm for texture inpainting based on our texture synthesis algorithm described in Alg. 1.
Given an input texture $I$ with a corrupted region $\Omega$, the texture inpainting problem is to fill $\Omega$ so that the inpainted texture appears as natural as possible. In other words, $\tilde{\Omega}$ must be visually close to at least one patch in the uncorrupted region $I \setminus \Omega$, where $\tilde{\Omega}$ denotes the corrupted region together with its border.
Our texture synthesis algorithm described in Alg. 1 can be easily generalized to a texture inpainting algorithm, which iteratively searches for a template in $I \setminus \Omega$ and updates $\Omega$ according to the template. Specifically, our method iterates a searching step and a synthesis step. In the searching step, we first measure the energy between $\tilde{\Omega}$ and all candidate patches in the uncorrupted region, and then select the patch with the lowest energy as the template. In the synthesis step, we update $\Omega$ according to the template using Alg. 1. This algorithm clearly ensures that the inpainted region is visually similar to at least one patch (i.e. the template) in the uncorrupted region.
In the searching step, we use grid search to find the template. Note that the template can also be assigned by the user [30]. It is possible to replace the grid search by more advanced searching techniques such as PatchMatch [33], and to use a gradient penalty [38] or a partial mask [30] to ensure a smooth transition near the border of $\Omega$. However, these would contradict the purpose of this algorithm, which is to show the effectiveness of the proposed ccgCNN method by combining it with the simplest possible auxiliary components.
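The searching step can be sketched as a plain grid search in NumPy. The squared distance below is a hypothetical stand-in for the learned energy $E$, and the corrupted region is assumed rectangular for simplicity:

```python
import numpy as np

def find_template(texture, mask_box, patch, stride=4):
    """Grid search: slide a window of the corrupted region's size over the
    uncorrupted area and return the candidate closest to `patch`.
    Squared distance stands in for the learned energy E of the full method."""
    r0, r1, c0, c1 = mask_box              # rectangular corrupted region
    h, w = r1 - r0, c1 - c0
    best, best_cost = None, np.inf
    for i in range(0, texture.shape[0] - h + 1, stride):
        for j in range(0, texture.shape[1] - w + 1, stride):
            # skip candidates overlapping the corrupted region
            if i < r1 and i + h > r0 and j < c1 and j + w > c0:
                continue
            cand = texture[i:i + h, j:j + w]
            cost = ((cand - patch) ** 2).sum()
            if cost < best_cost:
                best, best_cost = cand, cost
    return best

rng = np.random.default_rng(0)
texture = rng.random((32, 32))
patch = texture[0:8, 0:8].copy()           # current synthesis of the hole
template = find_template(texture, (8, 16, 8, 16), patch)
```

Since the current synthesis here is itself a copy of an uncorrupted patch, the search recovers exactly that patch; in the full algorithm the returned template is then fed to Alg. 1 as the exemplar for the next synthesis step.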
The detailed inpainting algorithm is presented in Alg. 3.
In this section, we evaluate the proposed cgCNN model and compare it with other texture models. We first perform self-evaluations of ccgCNN in Sec. VI-B to Sec. VI-E. Specifically, we investigate several key aspects of ccgCNN, including the influence of bounded constraints and the diversity of the synthesized results. We also carry out two ablation studies concerning the network structure and the training algorithm respectively. Then we evaluate the performance of ccgCNN and fcgCNN in texture synthesis and expansion by comparing them with theoretically related or state-of-the-art methods in Sec. VI-F to Sec. VI-G. We finally evaluate our texture inpainting method in Sec. VI-H.
The image exemplars are collected from the DTD dataset [39] and the Internet, and all exemplars are resized to a common size. The dynamic texture exemplars are adopted from [28], where each video has 12 frames and each frame is resized accordingly. We use the sound textures from [14]. For our experiments, we clip the first portion (about 2 seconds) of each audio as the exemplar.
The network $F$ used in cgCNN is shown in Fig. 1. It consists of a deep branch and a shallow branch. The deep branch consists of convolutional layers with small kernel sizes focusing on details in textures, and the shallow branch consists of three convolutional layers with large kernel sizes focusing on larger-scale and non-local structures. The combination of these two branches enables cgCNN to model both global and local structures. When synthesizing dynamic or sound textures, we use spatial-temporal or temporal convolutional layers respectively. We use the hard sigmoid function as the activation function in the network, which is defined as:

$\sigma(x) = \max(0, \min(1, 0.2x + 0.5))$  (23)
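In code, the hard sigmoid is a three-piece clamp. We assume the common clip(0.2x + 0.5, 0, 1) parameterization here; the paper's exact slope and offset may differ:

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear bounded activation (assumed parameterization:
    clip(0.2*x + 0.5, 0, 1)). Its output in [0, 1] keeps every feature
    map, and hence the energy E, bounded above."""
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)
```

Unlike the smooth sigmoid, the clamp saturates exactly, so activations (and therefore the Gram or mean statistics built from them) cannot blow up during the adversarial learning step, which is the stabilizing property discussed below.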
We sample a batch of textures in each iteration, and each sample is initialized as Gaussian noise. We run a fixed number of Langevin steps in each iteration, and the training algorithm stops after a maximal number of iterations. We use RMSprop [40] or Adam [41] to update the networks and the synthesized images, with a fixed initial learning rate. In all our experiments, we follow these settings except where explicitly stated. All the results are available at http://captain.whu.edu.cn/cgcnntexture/, where one can check the dynamic and sound textures.
We find it crucial to constrain the magnitude of the energy in order to stabilize the training process, because the energy often grows too large and causes the exploding gradient problem. In this work, we use a bounded activation function in the network architecture to ensure that the energy is upper bounded. We notice that the choice of activation function has subtle influences on the synthesis results. This is shown in Fig. 2, where we present the results using different activation functions, i.e. hard sigmoid, tanh and sigmoid respectively. We observe that the hard sigmoid produces the most satisfactory results, while tanh often generates some unnatural colors, and the results using sigmoid exhibit some checkerboard artifacts.
It is important for a texture synthesis algorithm to be able to synthesize diversified texture samples from a given exemplar. For the proposed ccgCNN model, the diversity of the synthesized textures is a direct result of the randomness of the initial Gaussian noise, so one does not need to make extra effort to ensure such diversity. This is shown in Fig. 3, where a batch of three synthesized samples for each exemplar is presented. Note that all synthesized textures are visually similar to the exemplars, but they are not identical to each other.
In order to verify the importance of the learning step in Alg. 1, we test a fixed-random method, which disables the learning step. This fixed-random method actually optimizes the synthesized image using a fixed random ConvNet. Fig. 4 presents the comparison between our Alg. 1 and this fixed-random method. Clearly, our method produces more favorable results, as our results are sharper and clearer, while the fixed-random method can only produce blurry and noisy textures. We can therefore conclude that the learning step is key to the success of our algorithm, as it enables us to learn better deep filters than a random ConvNet.
In order to investigate the roles played by different layers of the network in Fig. 1, we carry out an ablation study using different sub-networks. Recall that the original network has two branches, consisting of deep layers and shallow layers respectively. We denote a sub-network with $d$ layers from the deep branch and $s$ layers from the shallow branch by $(d, s)$; for instance, the sub-network $(d, 1)$ consists of the first $d$ layers in the deep branch and the first layer in the shallow branch. Fig. 5 presents the results of five sub-networks with increasingly large receptive fields. As we can see, the synthesized textures capture larger-scale structures as the receptive field increases.
In general, to generate high-fidelity samples, the network must be able to model structures at the different scales contained in the input image. As shown in Fig. 5, the smallest subnetwork generates results with serious artifacts, because its receptive field is only a few pixels wide, which is too small for any meaningful texture element. For the porous texture, which consists of small-scale elements, a subnetwork with a relatively small receptive field is sufficient to produce high-quality textures. However, for textures containing larger-scale structures, such as cherries and pebbles, larger receptive fields are often required for producing better results.
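The receptive field of such a subnetwork follows directly from the kernel sizes and strides of its stacked layers. A small helper illustrates the arithmetic; the layer specifications in the test are examples, not the paper's exact configuration:

```python
def receptive_field(layers):
    # layers: list of (kernel_size, stride) pairs, first layer first.
    # Each layer enlarges the receptive field by (k - 1) times the
    # cumulative stride ("jump") of all preceding layers.
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf
```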
For image texture synthesis, we compare the following methods, which are either theoretically related to our model or have reported state-of-the-art performance.
cgCNN-Gram: Our cgCNN with the Gram matrix as the statistic measurement.
cgCNN-Mean: Our cgCNN where the mean vector is used as the statistic measurement instead of the Gram matrix.
Gatys’ method [12]: A texture model relying on pretrained ConvNets. It is a special case of our cgCNN-Gram model with a pretrained ConvNet.
Self-tuning [11]: A recent patch-based EBTS algorithm that utilizes optimization techniques.
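The two statistic measurements can be sketched as follows for a feature tensor of shape (channels, height, width): cgCNN-Gram summarizes channel cross-correlations, while cgCNN-Mean keeps only per-channel means. The normalization choices here are illustrative, not necessarily those of the paper.

```python
import numpy as np

def gram_statistic(F):
    # F: feature maps of shape (C, H, W).
    # Gram matrix of channel correlations, averaged over spatial positions.
    C = F.shape[0]
    X = F.reshape(C, -1)
    return X @ X.T / X.shape[1]

def mean_statistic(F):
    # Per-channel spatial mean of the feature maps.
    return F.reshape(F.shape[0], -1).mean(axis=1)
```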
Fig. 6 shows the qualitative comparison of these algorithms. We observe that Gatys’ method fails to capture global structures (the 3rd and 4th textures), because the optimization process converges to a local minimum where global structures are not preserved, and it also generates artifacts such as unnatural colors and noise (zoom in on the 1st and 5th textures). Meanwhile, although gCNN and CoopNet can capture most of the large-scale structures, they lose too many details, probably because they do not use any deep statistics. Self-tuning excels at generating regular textures (the 3rd and 4th textures), but it sometimes loses global structures (the 1st, 2nd and 4th textures) due to the lack of global structure modeling. In contrast, cgCNN-Gram and cgCNN-Mean both produce better samples than the baseline methods, since they not only capture large-scale structures but also reproduce small-scale details, even for highly structured textures (the 1st, 3rd and 4th textures). This is because cgCNN uses both deep statistics and an effective sampling strategy that is unlikely to be trapped in poor local minima. It is also worth noting that the results of cgCNN-Gram and cgCNN-Mean are comparable in most cases, even though they use different statistics. For quantitative evaluation, we measure the multi-scale structural similarity (MS-SSIM) [45] between the synthesized texture and the exemplar; a higher score indicates higher visual similarity. The quantitative results are summarized in Tab. II, which shows that our methods outperform the baseline methods in most cases.

TABLE II: MS-SSIM scores for image texture synthesis.
Method            painting  lines  wall  scaly  ink
Gatys’ [12]       0.01      0.01   0.09  0.08   0.34
gCNN [20]         0.05      0.05   0.11  0.11   0.35
Self-tuning [11]  0.03      0.17   0.07  0.01   0.42
CoopNet [24]      0.05      0.09   0.20  0.08   0.32
cgCNN-Gram        0.10      0.09   0.31  0.36   0.43
cgCNN-Mean        0.14      0.10   0.10  0.00   0.46
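The MS-SSIM metric used above can be sketched as below. This is a simplified, global-statistics variant for illustration only: the reference implementation [45] uses Gaussian-windowed local statistics and per-scale exponent weights, and the constants C1 and C2 here are illustrative.

```python
import numpy as np

def ssim_global(x, y, C1=1e-4, C2=9e-4):
    # SSIM computed from whole-image statistics (simplified, unwindowed)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

def downsample(z):
    # 2x2 average pooling, cropping odd edges
    z = z[: z.shape[0] // 2 * 2, : z.shape[1] // 2 * 2]
    return 0.25 * (z[::2, ::2] + z[1::2, ::2] + z[::2, 1::2] + z[1::2, 1::2])

def ms_ssim(x, y, scales=3):
    # product of single-scale SSIM values over a dyadic pyramid
    vals = []
    for _ in range(scales):
        vals.append(ssim_global(x, y))
        x, y = downsample(x), downsample(y)
    return float(np.prod(vals))
```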
For dynamic texture synthesis, we use the network specified above and sample dynamic textures in the sampling step. Fig. 7 presents the qualitative comparison between cgCNN and the recent two-stream method [28]. We notice that the results of the two-stream model suffer from artifacts such as greyish tones (the 1st texture) or low-level noise (the 2nd texture), and sometimes exhibit temporal inconsistency, while the results of both cgCNN-Gram and cgCNN-Mean are more favorable, being cleaner and showing better temporal consistency.
For quantitative evaluation, we measure the average MS-SSIM between each frame of the synthesized result and the corresponding frame of the exemplar. The results are shown in Tab. III, where both cgCNN-Mean and cgCNN-Gram outperform the two-stream method.
TABLE III: Average MS-SSIM scores for dynamic texture synthesis.
Method           ocean  smoke
Two-Stream [28]  0.08   0.01
cgCNN-Gram       0.17   0.79
cgCNN-Mean       0.13   0.86
For sound texture synthesis, we use a network in which every layer shares the same kernel size and number of filters, and the same stride except for the first layer, which uses a larger stride. We do not use pooling layers in this network. Fig. 8 presents the results of sound texture synthesis using cgCNN, McDermott’s model [14] and Antognini’s model [29], shown as waveforms. Unlike the other two methods, which operate in the frequency domain, cgCNN works directly on raw audio. We observe that the results of these methods are generally comparable, except in some cases where our results are noisier than the baselines’, probably because the large strides in the shallow layers lose short temporal dependencies. This also suggests that our results might be further improved with more carefully designed networks.
The structure of the generator net used in fcgCNN is borrowed from TextureNet, with two extra residual blocks at the output layer; see [17] for details. When expanding dynamic or sound textures, the spatial convolutional layers in the generator are replaced by spatial-temporal or temporal convolutional layers accordingly.
Fig. 9 presents a comparison between fcgCNN and TextureNet on image texture expansion. The results of the two methods are generally comparable, because both are able to learn the stationary elements of the exemplars. fcgCNN is generally slower to converge during training, since it trains an extra net, but the synthesis speed of the two methods is the same, as synthesis in both involves a single forward pass through the generator net.
Fig. 10 presents the results of dynamic texture expansion using fcgCNN, where the exemplar dynamic texture is expanded to 48 frames. We observe that fcgCNN successfully reproduces the stationary elements and expands the exemplar dynamic texture in both the temporal and spatial dimensions, i.e. the synthesized texture has more frames, and each frame is larger than the input exemplar’s. It should be noted that fcgCNN is the first neural texture model that enables the expansion of dynamic textures.
Fig. 11 presents the results of sound texture expansion. In this experiment, we clip the first 16384 data points (less than 1 second) of each sound texture as the exemplar, and expand it to 122880 data points (about 5 seconds) using fcgCNN. As in dynamic texture expansion, fcgCNN successfully expands the exemplar sound texture while preserving the sound elements that occur most frequently. Note that fcgCNN is also the first texture model that enables the expansion of sound textures.
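Expansion works because the generator is fully convolutional: its weights are independent of the input size, so feeding larger (or longer) noise yields a proportionally larger output. A minimal numpy sketch of this property, with a toy conv+ReLU stack standing in for the actual TextureNet-style generator of [17]:

```python
import numpy as np

def conv_same(x, k):
    # stride-1, zero-padded ('same') 2-D convolution, naive implementation
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def generator(noise, kernels):
    # fully convolutional: no fixed input size is baked into the weights,
    # so the same kernels map noise of any size to an output of that size
    h = noise
    for k in kernels:
        h = np.maximum(conv_same(h, k), 0.0)  # conv + ReLU
    return h
```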


TABLE IV: MS-SSIM scores for image texture inpainting.
Method          brick  camouflage  fiber  sponge  water
DeepPrior [46]  0.978  0.897       0.856  0.904   0.956
DeepFill [47]   0.966  0.922       0.900  0.900   0.962
cgCNN           0.984  0.930       0.905  0.914   0.912
For image texture inpainting, we evaluate our algorithm by comparing it with the following two deep image inpainting methods:
Deep prior [46]: An inpainting algorithm that utilizes the prior of a random ConvNet. This method does not require extra training data.
Deep fill [47]: A state-of-the-art image inpainting algorithm. It requires extra training data; we use the model pretrained on ImageNet in our experiments.
We use the network with 64 channels. We first prepare a rectangular mask near the center of the image, then obtain the corrupted texture by applying the mask to a raw texture, i.e. all pixels within the masked area are set to zero. The border width is set to 4 pixels. All inpainting methods have access to the mask and the corrupted texture, but not to the raw textures.
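The corruption step described above (zeroing a rectangular region) can be sketched as follows. The function name and explicit mask coordinates are illustrative; the paper places the mask near the image center.

```python
import numpy as np

def corrupt(texture, top, left, mh, mw):
    # Zero out an mh-by-mw rectangle at (top, left); return the
    # corrupted image and the boolean mask given to the inpainter.
    mask = np.zeros(texture.shape[:2], dtype=bool)
    mask[top:top + mh, left:left + mw] = True
    corrupted = texture.copy()
    corrupted[mask] = 0
    return corrupted, mask
```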
Fig. 12 presents the qualitative comparison. In general, although the baseline methods handle textures with non-local structures relatively well (the 1st texture), they cannot handle random texture elements (the 2nd to 5th textures). Most baseline results are blurry, and the results of Deep fill sometimes show obvious color artifacts (the 1st and 5th textures). Our method clearly outperforms the baselines, as it inpaints all corrupted exemplars with convincing textural content and produces neither blur nor color artifacts.
Tab. IV presents the quantitative comparison. We calculate the MS-SSIM score between each inpainted texture and the corresponding raw texture (not shown); a higher score indicates a better inpainting result. Our method outperforms the baselines in most cases.
For dynamic texture inpainting, we prepare a mask and apply it to each frame of the dynamic texture, with the border width set to 2 pixels. We use the same network with the number of channels reduced to 32. The template is assigned by the user, because a grid search may exhaust GPU memory. For sound texture inpainting, the mask covers the interval from the 20000th to the 30000th data point, and the border width is set to 1000 data points. We use the same network settings as in the sound texture synthesis experiment.
Fig. 13 and Fig. 14 present the results of dynamic and sound texture inpainting using our method. As in image inpainting, our method successfully fills the corrupted region with convincing textural content, and the inpainted textures are overall natural and clear. It should be noted that the proposed method is the first neural algorithm for sound texture inpainting.
In this paper, we presented cgCNN for exemplar-based texture synthesis. Our model can synthesize high-quality image, dynamic and sound textures in a unified manner. Experiments demonstrate the effectiveness of our model in texture synthesis, expansion and inpainting.
Several issues need further investigation. One limitation of cgCNN is that it cannot synthesize dynamic patterns without spatial stationarity, such as those studied in [10]; extending cgCNN to such patterns would be an interesting direction for future work. Another limitation is that the current cgCNN cannot learn from multiple input textures, i.e., it learns only one texture at a time. Future work should extend cgCNN to the batch training setting and explore its potential in downstream tasks such as texture feature extraction [48] and classification [25].
J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV. Springer, 2016, pp. 694–711.
T. Tieleman and G. Hinton, “Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural Networks for Machine Learning, 2012.