Code for our CVPR-2021 paper on Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings.
Generation of stroke-based non-photorealistic imagery is an important problem in the computer vision community. As an endeavor in this direction, substantial recent research efforts have been focused on teaching machines "how to paint", in a manner similar to a human painter. However, the applicability of previous methods has been limited to datasets with little variation in position, scale and saliency of the foreground object. As a consequence, we find that these methods struggle to cover the granularity and diversity possessed by real world images. To this end, we propose a Semantic Guidance pipeline with 1) a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time. 2) We also introduce invariance to the position and scale of the foreground object through a neural alignment model, which combines object localization and spatial transformer networks in an end-to-end manner, to zoom into a particular semantic instance. 3) The distinguishing features of the in-focus object are then amplified by maximizing a novel guided backpropagation based focus reward. The proposed agent does not require any supervision on human stroke-data and successfully handles variations in foreground object attributes, thus producing much higher quality canvases for the CUB-200 Birds and Stanford Cars-196 datasets. Finally, we demonstrate the further efficacy of our method on complex datasets with multiple foreground object instances by evaluating an extension of our method on the challenging Virtual-KITTI dataset.
Paintings form a key medium through which humans express their visual conception, creativity and thoughts. Being able to paint constitutes a vital skill in the human learning process and requires long-term planning to efficiently convey the picture within a limited number of brush strokes. Thus, the successful impartation of this challenging skill to machines would not only have huge applications in computer graphics, but would also form a key component in the development of a general artificial intelligence system.
Recently, a lot of research [huang2019learning, mellor2019unsupervised, ganin2018synthesizing, zheng2018strokenet, xie2013artist, ha2017neural] has targeted teaching machines “how to paint”, in a manner similar to a human painter. A popular solution to this problem is to use reinforcement learning and model the painting episode as a Markov Decision Process (MDP). Given a target image, the agent learns to predict a sequence of brush strokes which, when transferred on to a canvas, result in a painting which is semantically and visually similar to the input image. The reward function for the agent is usually learnt using a generative adversarial network (GAN) [goodfellow2014generative], which provides a measure of similarity between the final canvas and the original target image.
In this paper, we propose a semantic guidance pipeline which addresses the following three challenges faced by the current painting agents. First, the current methods [huang2019learning, mellor2019unsupervised, ganin2018synthesizing] are limited to only datasets which depict a single dominant instance per image (e.g. cropped faces). Experimental results reveal that this leads to poor performance on varying the position, scale and saliency of the foreground object within the image. We address this limitation by adopting a bi-level painting procedure, which incorporates semantic segmentation to learn a distinction between brush strokes for foreground and background image regions. Here, we utilize the intuition that the human painting process is deeply rooted in our semantic understanding of the image components. For instance, an accurate depiction of a bird sitting on a tree would depend highly on the agent’s ability to recognize the bird and the tree as separate objects and hence use correspondingly different stroke patterns / plans.
Second, variation in position and scale of the foreground objects within the image introduces high variance in the input distribution for the generative model. To this end, we propose a neural alignment model, which combines object localization and spatial transformer networks to learn an affine mapping between the overall image and the bounding box of the target object. The neural alignment model is end-to-end and preserves the differentiability requirement for our model-based reinforcement learning approach.
Third, accurate depiction of instances belonging to the same semantic class should require the painting agent to give special attention to different distinguishing features. For instance, while the shape of the beak may be a key feature for some birds, it may be of little consequence for other bird types. We thus propose a novel guided backpropagation based focus reward to increase the model’s attention on these fine-grain features. The use of guided backpropagation also helps in amplifying the importance of small image regions, like a bird’s eye which might be otherwise ignored by the reinforcement learning agent.
In summary, the main contributions of this paper are:
We introduce a semantically guided bi-level painting process to develop a better distinction between foreground and background brush strokes.
We propose a neural alignment model, which combines object localization and spatial transformer networks in an end to end manner to zoom in on a particular foreground object in the image.
We finally introduce expert guidance on the relative importance of distinguishing features of the in-focus object (e.g. tail, beak etc. for a bird) by proposing a novel guided backpropagation based focus reward.
Stroke based rendering methods. Automatic generation of non-photorealistic imagery has been a problem of keen interest in the computer vision community. Stroke Based Rendering (SBR) is a popular approach in this regard, which focuses on recreating images by placing discrete elements such as paint strokes or stipples [hertzmann2003survey].
The positioning and selection of appropriate strokes is a key aspect of this approach [zeng2009image]. Most traditional SBR algorithms address this task through either greedy search at each step [hertzmann1998painterly, litwinowicz1997processing], optimization over an energy function using heuristics [turk1996image], or require user interaction for supervising stroke positions [haeberli1990paint, teece19983d].
RNN-based methods. RNN-based methods, such as Sketch-RNN [ha2017neural] for drawings and Graves [graves2013generating] for handwriting generation, require access to sequential stroke data, which limits their applicability for most real world datasets. StrokeNet [zheng2018strokenet] addresses this limitation by using a differentiable renderer; however, it fails to generalize to color images.
Unsupervised stroke decomposition using RL. Recent methods [xie2013artist, ganin2018synthesizing, mellor2019unsupervised, huang2019learning] use RL to learn an efficient stroke decomposition. The adoption of a trial and error approach alleviates the need for stroke supervision, as long as a reliable reward metric is available. SPIRAL [ganin2018synthesizing], SPIRAL++ [mellor2019unsupervised] and Huang [huang2019learning] adopt an adversarial training approach, wherein the reward function is modelled using the WGAN distance [huang2019learning, arjovsky2017wasserstein]. Learning a differentiable renderer model has also been shown to improve the learning speed of the training process [huang2019learning, zheng2018strokenet, nakano2019neural, frans2018unsupervised].
The above methods generalize only to datasets with limited variation in scale, position and saliency of the foreground object (e.g. cropped, aligned faces from CelebA [liu2015faceattributes]). We note that while Huang [huang2019learning] evaluate their approach on ImageNet [deng2009imagenet], we find that competitive results are achieved only after using the division parameter at inference time. In this setting, the agent divides the overall image into a grid with 16 / 256 blocks, and then proceeds to paint each of them in parallel. We argue that such a division does not follow the constraints of the original problem formulation, in which the agent mimics the human painting process. Furthermore, such a division strategy increases the effective number of total strokes and tends towards a pixel-level image regression approach, with the generated images losing the desired artistic / non-photorealistic touch.
Semantic Divide and Conquer. Our work is in part also motivated by semantic division strategies from [wang2020sdc, liu2010single], which propose a division of the overall depth estimation task among the constituent semantic classes. However, to the best of our knowledge, our work is the first attempt at incorporating semantic division (with model-based RL) for the “learning to paint” problem.
Similar to Huang [huang2019learning], we adopt a model-based reinforcement learning approach for this problem. The painting episode is modelled as a Markov Decision Process (MDP) defined by state space $\mathcal{S}$, transition function $\mathcal{T}$ and action space $\mathcal{A}$.
State space. The state $s_t \in \mathcal{S}$ at any time $t$ is defined by the tuple $(C_t, I, S_I, G_I)$, where $C_t$ is the canvas image at timestep $t$ and $I$ is the target image. $S_I$ and $G_I$ represent the semantic instance probability map and the guided backpropagation map for the target image.
Action space. The action $a_t$ at each timestep $t$ depicts the parameters of a quadratic Bézier curve, used to model the brush stroke. The stroke parameters form a 13 dimensional vector, where the first 10 parameters depict stroke position, shape and transparency, while the last 3 parameters form the RGB representation for the stroke color.
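As an illustration of this parametrization, the sketch below unpacks a 13-dimensional action and samples points along the quadratic Bézier curve. The parameter ordering (three control points, then radius and transparency at the two endpoints, then RGB) is assumed from the stroke definition in [huang2019learning]. The function name `bezier_points` and the linear interpolation of radius and transparency along the curve are illustrative choices rather than details stated in this paper.

```python
import numpy as np

def bezier_points(action, n_samples=100):
    """Sample the centre line, radius and opacity of one brush stroke.

    Assumed layout: x0, y0, x1, y1, x2, y2, r0, t0, r1, t1, R, G, B.
    """
    action = np.asarray(action, dtype=np.float64)
    assert action.shape == (13,), "one stroke is a 13-dimensional vector"
    p0, p1, p2 = action[0:2], action[2:4], action[4:6]
    r0, t0, r1, t1 = action[6:10]
    color = action[10:13]

    u = np.linspace(0.0, 1.0, n_samples)[:, None]
    # Quadratic Bezier curve: B(u) = (1-u)^2 P0 + 2(1-u)u P1 + u^2 P2
    pts = (1 - u) ** 2 * p0 + 2 * (1 - u) * u * p1 + u ** 2 * p2
    # Radius and transparency are interpolated between the two endpoints (assumption).
    radius = (1 - u[:, 0]) * r0 + u[:, 0] * r1
    alpha = (1 - u[:, 0]) * t0 + u[:, 0] * t1
    return pts, radius, alpha, color
```

A rasterizer (or the learned neural renderer) would then draw a disc of the given radius and opacity at each sampled point.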
Environment Model. The environment model / transition function $\mathcal{T}$ is modelled through a neural renderer network $\Phi$, which facilitates a differentiable mapping from the current canvas $C_t$ and brush stroke parameters $a_t$ to the updated canvas state $C_{t+1}$. For mathematical convenience alone, we define two distinct stroke maps, $S_\alpha$ and $S_{color}$. $S_\alpha$ represents the stroke density map, whose value at any pixel provides a measure of transparency of the current stroke. $S_{color}$ is the colored rendering of the original stroke density map on an empty canvas.
Action Bundle. We adopt an action bundle approach which has been shown to be an efficient mechanism for enforcing higher emphasis on the planning process [huang2019learning]. Thus, at each timestep the agent predicts the parameters for the next $k$ brush strokes.
In the following sections, we describe the complete pipeline for our semantic guidance model (refer Fig. 2). We first outline our approach for a two class (foreground, background) painting problem and then later demonstrate its extension to more complex image datasets with multiple foreground instances per image in Section 5.
The human painting process is inherently multi-level, wherein the painter would focus on different semantic regions through distinct brush strokes. For instance, brush strokes aimed at painting the general image background would have a different distribution as compared to strokes depicting each of the foreground instances.
Motivated by this, we propose to use semantic segmentation to develop a distinction between the foreground and the background strokes. This distinction is achieved through a bi-level painting procedure which allocates a specialized reward for each stroke type. More specifically, we first modify the action bundle to separately predict Bézier curve parameters for foreground and background strokes, i.e. $a_t = \{a_f, a_b\}$, where $a_f$ and $a_b$ represent the foreground and background stroke parameters, respectively. Next, given a neural renderer network $\Phi$, target image $I$ and semantic class probability map $S_I$, the canvas state is updated in two stages, one for the foreground strokes and one for the background strokes. Both stages combine the canvas with the rendered stroke through element-wise multiplication ($\odot$), using the colored rendering $S_{color}$ of the stroke density map $S_\alpha$.
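A minimal sketch of a two-stage update of this kind is given below, assuming that each stage is a standard alpha-compositing step restricted to its region by the semantic map. The function name `bilevel_update`, the array shapes and the exact form of the masking are assumptions made for illustration and may differ from the actual implementation.

```python
import numpy as np

def bilevel_update(canvas, mask, fg_alpha, fg_color, bg_alpha, bg_color):
    """Composite one foreground and one background stroke onto the canvas.

    canvas, fg_color, bg_color: (H, W, 3) arrays; the *_color maps are the
    colored renderings of the strokes on an empty canvas.
    mask, fg_alpha, bg_alpha: (H, W, 1) arrays with values in [0, 1].
    """
    # Stage 1: the foreground stroke only affects the foreground region.
    a = fg_alpha * mask
    canvas = canvas * (1.0 - a) + fg_color * mask
    # Stage 2: the background stroke only affects the complementary region.
    a = bg_alpha * (1.0 - mask)
    canvas = canvas * (1.0 - a) + bg_color * (1.0 - mask)
    return canvas
```

Because the two masks are complementary, a foreground stroke cannot overwrite background pixels and vice versa, which is what allows a separate reward to be attributed to each stroke type.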
The reward for each stroke type is then defined separately: $r^{fg}$ and $r^{bg}$ denote the foreground and background rewards, respectively, and both are computed from $D(I, C)$, the Wasserstein-1 / Earth-Mover distance between the image $I$ and canvas $C$.
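The following sketch shows one way such per-region rewards could be computed, assuming each reward is the decrease in distance between the masked target and the masked canvas after a stroke is applied. The learned WGAN critic is replaced here by a simple mean squared error stand-in (`distance`), and the function names are hypothetical.

```python
import numpy as np

def distance(a, b):
    # Stand-in for the learned WGAN distance (assumption): mean squared error.
    return float(np.mean((a - b) ** 2))

def bilevel_rewards(target, canvas_prev, canvas_next, mask):
    """Return (foreground reward, background reward) for one painting step."""
    def fg(x):
        return x * mask

    def bg(x):
        return x * (1.0 - mask)

    # A reward is positive when the new canvas is closer to the target.
    r_fg = distance(fg(target), fg(canvas_prev)) - distance(fg(target), fg(canvas_next))
    r_bg = distance(bg(target), bg(canvas_prev)) - distance(bg(target), bg(canvas_next))
    return r_fg, r_bg
```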
The accuracy of the foreground rewards computed using Eq. 6 depends on the ability of the discriminator $D$ to accurately capture the similarity between the target image $I$ and the current canvas state $C_t$. However, the input to the discriminator of the WGAN model would have high variance if the position and scale of the foreground object vary significantly amongst the input images. This high variance poses a direct challenge to the discriminator’s performance while training on complex real world datasets. To this end, we propose a differentiable neural alignment model, which combines object localization and spatial transformer networks to zoom into the foreground object, thereby providing a standardized input for the discriminator.
First, we modify the segmentation model to predict both the foreground object mask and the bounding box coordinates $B_I$ of the foreground object in the target image. We then use a spatial transformer network $T$, which uses the predicted bounding box coordinates to compute an affine mapping from the overall canvas image $C_t$ to the zoomed foreground object image $C_t^{fg}$.
Here, the mask and bounding box are predicted by the foreground segmentation and localization network $F$. The affine matrix for the spatial transformer network is determined by the bounding box coordinates and the overall image size.
The modified foreground reward is then computed using the WGAN distance between the zoomed-in target and canvas images.
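To make the zoom operation concrete, the sketch below derives an affine matrix that maps a bounding box onto the full image in pixel coordinates. It assumes an axis-aligned box given as `(x1, y1, x2, y2)` and a pure scale-and-translate mapping; the helper names `bbox_affine` and `apply_affine` are hypothetical, and the matrix used in practice may follow a different convention.

```python
import numpy as np

def bbox_affine(box, width, height):
    """2x3 affine matrix mapping box (x1, y1, x2, y2) onto a width x height image."""
    x1, y1, x2, y2 = box
    sx = width / (x2 - x1)    # horizontal zoom factor
    sy = height / (y2 - y1)   # vertical zoom factor
    return np.array([[sx, 0.0, -sx * x1],
                     [0.0, sy, -sy * y1]])

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to an (N, 2) list of (x, y) points."""
    pts = np.asarray(points, dtype=np.float64)
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    return homo @ A.T
```

Under this mapping the top-left corner of the box lands on the image origin and the bottom-right corner lands on `(width, height)`, so the foreground object fills the discriminator input regardless of its original position and scale.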
The semantic importance of an image region is not necessarily proportional to the number of pixels covered by the corresponding region. While using WGAN loss provides some degree of abstraction as compared with the direct pixel-wise distance, we observe that a painting agent trained with a WGAN distance based reward function does not pay adequate attention to small but distinguishing object features. For instance, as shown in Fig. 3 for the CUB-200-2011 birds dataset, we see that while the baseline agent captures the global object features like shape and color, it either omits or insufficiently depicts important bird features like eyes, wing texture, color marks around the neck, etc.
In order to address this limitation, we propose to incorporate a novel focus reward, in conjunction with the global WGAN reward, to amplify the focus on the distinguishing features of each foreground instance. The focus reward uses guided backpropagation maps from an expert task model (e.g. classification) to scale the relative importance of different image regions in the painting process. Guided backpropagation (GBP) has been shown to be an efficient mechanism for visualizing key image features [springenberg2014striving, nie2018theoretical]. Thus, by maximizing the focus reward, we encourage the painting agent to generate canvases with enhanced granularity at key feature locations.
Mathematically, given the normalized guided backpropagation map $G_I$ for the target image, object bounding box coordinates $B_I$ and neural alignment model $T$, we first define the GBP distance as the Frobenius norm $\|\cdot\|_F$ of the weighted difference between the neurally aligned target and canvas images, normalized by the total number of non-zero pixels in the guided backpropagation map. Thus, the scale of the GBP distance is invariant to the extent of activations in the zoomed key-point importance map $G_I^{fg}$.
The focus reward is then defined as the difference between GBP distances at successive timesteps.
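A small numerical sketch of this reward follows. It assumes that the weighting is an element-wise product with the zoomed GBP map and that the reward is the decrease in GBP distance from one timestep to the next; the function names and the guard against an all-zero map are illustrative additions.

```python
import numpy as np

def gbp_distance(target_fg, canvas_fg, gbp_fg):
    """Frobenius norm of the GBP-weighted difference, normalized by the
    number of non-zero pixels in the GBP map."""
    weighted = gbp_fg * (target_fg - canvas_fg)
    nonzero = max(int(np.count_nonzero(gbp_fg)), 1)  # avoid division by zero
    return float(np.linalg.norm(weighted) / nonzero)

def focus_reward(target_fg, canvas_prev_fg, canvas_next_fg, gbp_fg):
    """Positive when the new canvas is closer to the target at key features."""
    before = gbp_distance(target_fg, canvas_prev_fg, gbp_fg)
    after = gbp_distance(target_fg, canvas_next_fg, gbp_fg)
    return before - after
```

With this form, a stroke that improves a pixel highlighted by the GBP map is rewarded, while a stroke that only changes pixels with zero GBP activation receives no focus reward.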
The semantic guidance pipeline discussed in Section 4, mainly handles images with a single foreground object instance per image. In this section, we show how the proposed approach can be used to “learn how to paint” on datasets depicting multiple foreground objects per image.
At training time, we maintain the bi-level painting procedure from Section 4.1. The action bundle at each timestep describes the brush stroke parameters for the background and one of the foreground instances. The foreground instance for a particular painting episode is kept fixed and is selected with a probability proportional to the total number of pixels covered by that object.
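The training-time choice of foreground instance can be sketched as area-proportional sampling. The function name `sample_instance` and the use of binary instance masks are assumptions for illustration.

```python
import numpy as np

def sample_instance(instance_masks, rng):
    """Pick one foreground instance with probability proportional to its area."""
    areas = np.array([m.sum() for m in instance_masks], dtype=np.float64)
    probs = areas / areas.sum()
    index = int(rng.choice(len(instance_masks), p=probs))
    return index, probs
```

The sampled index would then stay fixed for the whole painting episode, for example `idx, _ = sample_instance(masks, np.random.default_rng(0))`.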
At inference time, however, the agent would need to pay attention to all of the foreground instances. Given $N$ total foreground objects, the agent at any timestep $t$ of the painting episode chooses to predict brush stroke parameters for the foreground instance with the highest difference between the corresponding areas in the canvas and the target image, where the area of the $i^{th}$ object is given by its foreground segmentation map $S_I^i$.
We also note that the distinction between foreground and background strokes allows us to perform data augmentation with a specialized dataset to improve the quality of foreground data examples. Thus, in our experiments, we augment the Virtual KITTI dataset with Stanford Cars-196 in a ratio of 0.8:0.2 while training.
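The inference-time selection rule can be sketched as an argmax over instances. The measure of difference is assumed here to be the summed squared error inside each instance mask, which is only one possible choice; `select_instance` is a hypothetical helper name.

```python
import numpy as np

def select_instance(target, canvas, instance_masks):
    """Return the index of the instance whose region is currently painted worst."""
    # Assumed difference measure: squared error summed over the instance mask.
    gaps = [float(np.sum(m * (target - canvas) ** 2)) for m in instance_masks]
    return int(np.argmax(gaps)), gaps
```

Re-evaluating this rule at every timestep lets the agent move on to another object once the current one is sufficiently well depicted.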
We use the CUB-200-2011 Birds [WahCUB_200_2011] and Stanford Cars-196 [KrauseStarkDengFei-Fei_3DRR2013] datasets for performing qualitative evaluation of our method. The above datasets mainly feature one foreground instance per image and hence can be trained using the bi-level semantic guidance pipeline described in Section 4. We also use the high-fidelity Virtual-KITTI [cabon2020vkitti2] dataset to demonstrate the extension of the proposed method to multiple foreground instances per image.
CUB-200-2011 Birds [WahCUB_200_2011] is a large-scale birds dataset frequently used for benchmarking fine-grain classification models. It consists of 200 bird species with annotations available for class, foreground mask and bounding box of the bird. The dataset features high variation in object background as well as scale, position and the relative saliency of the foreground bird with respect to its immediate surroundings. These properties make it a challenging benchmark for the “learning to paint” problem.
Stanford Cars-196 [KrauseStarkDengFei-Fei_3DRR2013] is another dataset used for testing fine-grain classification. It consists of 16185 total images depicting cars belonging to 196 distinct categories and having varying 3D orientation. The dataset only provides object category and bounding box annotations. We compute the foreground car masks using the pretrained DeepLabV3-Resnet101 network [chen2017rethinking].
Virtual KITTI [cabon2020vkitti2] is a high fidelity dataset containing photo-realistic renderings of urban environments from 5 distinct scene backgrounds. Each scene contains images depicting variation in camera location, weather, time of day and density / location of foreground objects. The high variability of these image attributes, makes it a very challenging dataset for training the painting agent. Nevertheless, we demonstrate that our method helps in improving the semantic quality of the generated canvases despite these obstacles.
Neural Renderer. We closely follow the architecture from Huang [huang2019learning] while designing the differentiable neural renderer $\Phi$. Given a batch of random brush stroke parameters $a$, the network output is trained to mimic the rendering of the corresponding Bézier curve on an empty canvas. The training labels are generated using an automated graphics module and the renderer is trained for iterations with a batch size of 64.
Learning foreground mask and bounding box. A key component of the semantic guidance pipeline is foreground segmentation and bounding box prediction. We use a fully convolutional network with separate heads to predict a per-pixel foreground probability map and the coordinates of the bounding box. The foreground mask prediction is trained with the standard cross-entropy loss, while the bounding box coordinates are learned using the Smooth L1 [girshick2015fast] regression loss.
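For reference, the Smooth L1 loss used for the bounding box head can be written in a few lines. The threshold `beta` generalizes the fixed threshold of 1 in the original definition and is included here only for illustration; the function name is hypothetical.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = np.abs(np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64))
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(loss.mean())
```

The linear regime keeps gradients bounded for badly localized boxes, which makes the regression less sensitive to outliers than a plain squared error.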
Expert model for Guided Backpropagation. We use the pretrained fine-grain classification NTS-Net model [yang2018learning] as the expert network for generating guided backpropagation maps on the CUB-200-2011 birds dataset. Note that we use NTS-Net due to the easy accessibility of the pretrained model. We expect that using a more state-of-the-art model like [ge2019weakly] would lead to better results with the focus reward.
The expert model for the Stanford Cars-196 dataset is trained in conjunction with the reinforcement learning agent, with an EfficientNet-B0 [tan2019efficientnet] backbone network. The EfficientNet architecture allows us to limit the total number of network parameters while respecting the memory constraints of an NVIDIA RTX 2080 Ti. The expert model is trained for a total of 200 epochs with a batch size of 64. An EfficientNet-B7 model pretrained on the ImageNet [deng2009imagenet] dataset is used as the expert for the Virtual KITTI dataset.
Overall Training. The reinforcement learning agent follows an actor-critic architecture. The actor predicts the policy function $\pi(s_t)$, while the critic computes the value function $V(s_t)$. The agent is trained using model-based DDPG [lillicrap2015continuous] with a policy loss and a value loss, both of which use the discount factor $\gamma$. The final reward function is computed as the weighted sum of the foreground, background and focus rewards, where the weights are hyperparameters. A fixed selection of these weights was seen to give competitive results for our experiments. The model-based RL agent is trained for a total of 2M iterations with a batch size of 96.
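The sketch below illustrates how a combined reward and the corresponding actor-critic losses could be assembled. It assumes the model-based DDPG formulation of [huang2019learning], in which the critic estimates a state value and the actor maximizes the one-step return through the differentiable renderer. All function names are hypothetical, and the weights and discount factor are left as required arguments because their values are not given here.

```python
def total_reward(r_fg, r_bg, r_focus, w_fg, w_bg, w_focus):
    """Weighted sum of foreground, background and focus rewards."""
    return w_fg * r_fg + w_bg * r_bg + w_focus * r_focus

def one_step_return(reward, next_value, gamma):
    """Bootstrapped return r + gamma * V(s_{t+1})."""
    return reward + gamma * next_value

def value_loss(value, reward, next_value, gamma):
    """Squared error between V(s_t) and the one-step return."""
    return (value - one_step_return(reward, next_value, gamma)) ** 2

def policy_loss(reward, next_value, gamma):
    """The actor minimizes the negative one-step return."""
    return -one_step_return(reward, next_value, gamma)
```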
We compare our method with the baseline “learning to paint” pipeline from Huang [huang2019learning] which uses an action bundle containing 5 consecutive brush strokes. In order to provide a fair comparison, we use the same overall bundle size but divide it among foreground and background strokes in the ratio of 3:2. That is, the agent at each timestep predicts 3 foreground and 2 background brush strokes.
Improved foreground saliency. Fig. 3 shows the results for the CUB-200 Birds and Stanford-Cars196 dataset. We clearly see that our method leads to increased saliency of foreground objects, especially when the target object is partly camouflaged by its immediate surroundings (refer Fig. 3a, row-4 and Fig. 3b, row-3). This increased contrast between foreground and background perception, results directly from our semantically guided bi-level painting process and the neural alignment model.
Enhanced feature granularity. We also observe that canvases generated using our method show improved focus on key object features as compared to the baseline. For instance, the red head-feather, which is an important feature of pileated woodpecker (refer Fig. 3a: row-1), is practically ignored by the baseline agent due to its small size. The proposed guided backpropagation based focus reward, helps in amplifying the importance of this key feature in the overall reward function. Similarly, our method also leads to improved depiction of wing patterns and claws in (Fig. 3a: row-2), the small eye region, feather marks in (Fig. 3a: row-3) and car headlights, wheel patterns in (Fig. 3b: row-1,2).
Multiple foreground instances. We use the Virtual-KITTI dataset and the extended training procedure outlined in Section 5, to demonstrate the applicability of our method on images with multiple foreground instances. Note that due to computational limits and the nature of ground-truth data, we stick to vehicular foreground classes like cars, vans, buses etc, for our experiments. Results are shown in Fig. 4. We observe that due to the dominant nature of image backgrounds in this dataset, the baseline agent fails to accurately capture the presence / color spectrum of the foreground vehicles. In contrast, our bi-level painting procedure learns a distinction between foreground and background strokes in the training process itself, and thus provides a much better balance between foreground and background depiction for the target image.
In this section, we design a control experiment in order to isolate the impact of the focus reward proposed in Section 4.3. To this end, we construct a modified birds dataset from the CUB-200-2011 dataset. We do this by first setting the background image pixels to zero, which alleviates the need for the bi-level painting procedure. We next eliminate the need for the neural alignment model by cropping the bounding box for each bird. The resulting dataset is then used to train the baseline [huang2019learning], and a modified semantic guidance pipeline trained only using a weighted combination of the WGAN reward [huang2019learning] and the focus reward.
A zero weight on the focus reward corresponds to the baseline model without the focus loss. We then analyse the effect on the resulting canvas as the weightage of the focus reward is increased. All models are trained for 1M iterations with a batch size of 96.
Fig. 5 shows the results of this modified training. We clearly see that while the baseline [huang2019learning] trained with the WGAN reward captures the overall bird shape and color, it fails to pay adequate attention to finer bird features like texture of the wings (row 1,3,4), density of eyes (row 3,4) and sharp color contrast (red regions near the face for row 1,2). We also observe that the granularity of these features in the painted canvas improves as the weightage of the focus reward is increased.
Recall that the main goal of the “learning to paint” problem, is to make the machine paint in a manner similar to a human painter. Thus, the performance of a painting agent should be measured, not only by the resemblance between the final canvas and the target image, but also by the similarity of the corresponding painting sequence with that of a human painter. In this section, we demonstrate that unlike previous methods, semantic guidance helps the reinforcement learning agent adopt a painting trajectory that is highly similar to the human painting process.
In order to do a fair comparison of agent trajectories between our method and the baseline [huang2019learning], we select test images from the Stanford Cars-196 dataset such that the final canvases from both methods are equally similar to the target image. That is, the $L_2$ distance between the final canvas and the target image is similar for both methods. (Footnote: We note that, in general, $L_2$ distance may not be a reliable measure of semantic similarity between two images. As shown in Fig. 6, two canvases can be qualitatively quite different while having a similar $L_2$ distance to the target image.)
Results are shown in Fig. 6. We can immediately observe a stark difference between the painting styles of the two agents. The standard agent displays bottom-up image understanding, and proceeds to first paint visually distinct car edges / parts like windows, red tail light, black region near the bottom of the car etc. In contrast, the semantically guided agent follows a top down approach, wherein it first begins with a rough structural outline for the car and only then focuses on other structurally non-relevant parts. For instance, in the first example from Fig. 6, the semantically guided agent adds color to the tail-light only after finishing painting the overall structure of the car. On the other hand, the red brush stroke for the tail-light region is painted quite early by the baseline agent, even before the overall car structure begins to emerge on the canvas. Thus, these striking differences in the painting sequences suggest that, the proposed semantic guidance pipeline helps in imparting a more human like painting style to the learning agent.
In this paper, we propose a semantic guidance pipeline for the “learning to paint” problem. Our method incorporates semantic segmentation to propose a bi-level painting process, which helps in learning a distinction between foreground and background brush stroke rewards. We also introduce a guided backpropagation based focus reward, to increase the granularity and importance of small but distinguishing object features in the final canvas. The resulting agent successfully handles variations in position, scale and saliency of foreground objects, and develops a top-down painting style which closely resembles a human painter.
|Method||Accuracy||IoU|
|Semantic Guidance (Ours)||69.26||48.15|
The inadequacy of the frequently used pixel-wise $L_2$ distance [ganin2018synthesizing, huang2019learning] in capturing semantic similarity poses a major challenge in performing a quantitative evaluation of our method. In order to address this, we present a novel approach to quantitatively evaluate the semantic similarity between the generated canvases and the target image. To this end, we use a pretrained DeepLabV3-ResNet101 model [chen2017rethinking] to compute the semantic segmentation maps for the final painted canvases for both Huang [huang2019learning] and the Semantic Guidance (Ours) approach. The detected segmentation maps for both methods are then compared with the ground truth foreground masks for the target image.
Results are shown in Fig. 7. We clearly see that our method learns to paint canvases with semantic segmentation maps having high resemblance with the ground truth foreground masks for the target image. In contrast, the canvases generated using the baseline [huang2019learning] show low foreground saliency. This sometimes results in the pretrained segmentation model [chen2017rethinking] even failing to detect the presence of the foreground object. Note that the semantic guidance pipeline does not directly train the RL agent to mimic the segmentation maps of the original image.
We also provide a more quantitative evaluation of the quality of detected semantic segmentation maps for both methods in Table 2. The accuracy scores are reported on the test set images and represent the percentage of foreground pixels which are correctly detected in the segmentation map of a given canvas. We observe that our method leads to huge improvements in the semantic segmentation accuracy and IoU values for the painted canvases.
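The two scores can be computed from binary masks as in the sketch below, assuming that accuracy is the percentage of ground truth foreground pixels recovered in the canvas segmentation and that IoU is the standard intersection over union. The function name is hypothetical and the thresholding of the segmentation output into a binary mask is assumed to have been done beforehand.

```python
import numpy as np

def foreground_accuracy_and_iou(pred_mask, gt_mask):
    """Return (accuracy %, IoU %) of a predicted foreground mask."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    accuracy = 100.0 * inter / max(gt.sum(), 1)   # share of true foreground found
    iou = 100.0 * inter / max(union, 1)           # also penalizes false positives
    return float(accuracy), float(iou)
```

Under these definitions the IoU of a canvas can never exceed its accuracy, since the union is at least as large as the ground truth foreground.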
|Method||Foreground L2 Distance|
|Semantic Guidance (Ours)||7.81|
The neural alignment model is implemented by replacing the localization net of a standard spatial transformer network [jaderberg2015spatial] with the bounding box prediction network. We also note that the affine matrix defined in Eq. 11 of the main paper represents the ideal affine mapping operation from input to output image coordinates. However, the affine matrix used for practical implementations may vary based on the conventions of the deep learning framework used. For our implementation (in PyTorch), we compute the affine matrix for the spatial transformer network from the normalized bounding box coordinates of the foreground object.
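As a hedged illustration of such a framework-specific matrix, the sketch below builds a 2x3 matrix in the convention used by `torch.nn.functional.affine_grid`, where the matrix maps normalized output coordinates in [-1, 1] to normalized input coordinates. It assumes the bounding box is given as `(x1, y1, x2, y2)` normalized to [0, 1]; a different normalization would change the entries. Only NumPy is used here so that the mapping can be checked directly, and `theta_from_box` is a hypothetical name.

```python
import numpy as np

def theta_from_box(box01):
    """Affine matrix (output [-1, 1] coords -> input [-1, 1] coords) for a
    bounding box (x1, y1, x2, y2) normalized to [0, 1]."""
    x1, y1, x2, y2 = box01
    return np.array([[x2 - x1, 0.0, x1 + x2 - 1.0],
                     [0.0, y2 - y1, y1 + y2 - 1.0]])
```

In a PyTorch pipeline, a batch of such matrices would typically be passed to `affine_grid` and the resulting sampling grid to `grid_sample`, which keeps the zoom operation differentiable with respect to the canvas.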