Stroke-based Artistic Rendering Agent with Deep Reinforcement Learning
Excellent painters can create a fantastic painting with only a few strokes, a hallmark of human intelligence and art. Inverting the rendering process to interpret images has likewise become a challenging computer vision task in recent years. In this paper, we present SARA, a stroke-based artistic rendering agent that combines a neural renderer with deep reinforcement learning (DRL), allowing a machine to learn to deconstruct images into strokes and create striking visual effects. Our agent is an end-to-end program that converts natural images into paintings. The training process requires neither human painting experience nor stroke tracking data.
Painting, an important form of the visual arts, symbolizes the wisdom and creativity of humans. In recent centuries, artists have used a diverse array of tools to create their masterpieces, yet mastering this skill takes people a large amount of time. Teaching machines to paint is therefore a challenging and meaningful task. Furthermore, studying this topic can help us build painting assistant tools and explore the mystery of painting.
We define artificial intelligent painting as follows: an agent paints strokes on a canvas in sequence to generate a painting that resembles a given target image. Some work has studied teaching machines painting-related skills such as sketching [9, 3, 40], doodling and writing characters. In contrast, we aim to teach machines to handle more complex tasks, such as painting portraits of humans and natural scenes of the real world. The rich textures and complex color composition of such images make them harder for machines to deal with.
There are three challenges for an agent to paint texture-rich images.
Figure 1: (a) At every step, the policy (aka actor) outputs a set of stroke parameters based on the canvas and the target image, and the renderer renders the stroke on the canvas. (b) During learning, the evaluator (aka critic) evaluates the action based on the target image and the rendered canvas. In our implementation, the policy, the evaluator and the renderer are all neural networks.
First, painting like humans requires the agent to decompose the given target image into strokes spatially and then paint them on the canvas in the correct order. The agent needs to parse the target image visually, understand the current status of the canvas, and plan ahead for future strokes. One common way to resolve this problem is to apply a supervised loss for stroke decomposition at each step, but this approach is computationally expensive. Also, painting a texture-rich image usually requires hundreds of strokes to reproduce the target, tens of times more than doodling, sketching or character writing [48, 9, 47]. To handle such a long-term planning task, reinforcement learning (RL) is a good choice: RL maximizes the cumulative reward of the whole painting process rather than minimizing a supervised loss at each step, which lets the agent plan stroke decomposition and painting over many steps. Moreover, we adopt an adversarial training strategy to train the painting agent. This strategy has been used successfully in pixel-level image generation tasks, and it also helps the agent paint.
Second, a fine-grained stroke parameter space, including stroke location and color, is essential for painting. Previous work [9, 48, 5] designs discrete stroke parameter spaces in which each parameter has only a limited number of choices, which no longer suffices for texture-rich painting. Defining the stroke parameters over a continuous space poses a grand challenge for most RL algorithms, such as Deep Q-Network (DQN) and policy gradient (PG), due to their poor handling of fine-grained parameter spaces. In contrast, Deep Deterministic Policy Gradient (DDPG) is designed for continuous action spaces, and agents trained with DDPG have demonstrated subtle control performance [18, 45]. We adopt DDPG to empower our agent with the ability to paint.
Third, an efficient painting simulator is critical for the performance of the agent, especially when painting hundreds of strokes on the canvas. Most work [9, 48, 5] paints by interacting with simulated painting environments, which is time-consuming and inflexible. Instead, we train an end-to-end neural renderer that directly maps stroke parameters to stroke paintings and can implement all kinds of stroke designs. Moreover, the renderer is differentiable, so it can be subtly combined with DDPG into a model-based DRL algorithm, which greatly boosts the performance of the original DDPG.
In summary, our contributions are as follows:
We address the painting task with a model-based DRL algorithm, allowing the agent to decompose the target image into hundreds of strokes in sequence to generate a painting that resembles the target image.
The neural renderer enables efficient painting and is compatible with various stroke designs. It also makes our proposed model-based DDPG possible.
The proposed painting agent can handle multiple types of target images well, including digits, house numbers, portraits, and natural scene images.
Stroke-based rendering (SBR) creates nonphotorealistic imagery automatically by placing discrete elements such as paint strokes or stipples, a task similar to ours. Most SBR algorithms either act greedily at every single step or need user interaction. Haeberli proposes a semiautomatic method that requires interaction between users and machines: the user sets parameters controlling the shape of the strokes and selects the position of each stroke. Litwinowicz proposes a single-layer painterly rendering that places brush strokes on a grid in the image plane with randomly perturbed positions. Other work studies the effects of different strokes and generates animations based on video.
One approach combines a differentiable renderer and a recurrent neural network (RNN) to train agents to paint, but fails to generalize to color images. These methods are not strong enough to handle this complicated task and require massive computing resources. Doodle-SDQ trains agents to emulate human doodling with DQN. Earlier, Sketch-RNN used sequential datasets to achieve good results in sketch drawing. Artist Agent explores using RL to generate a single brush stroke automatically.
In recent years, many DRL methods combining deep learning (DL) and RL have been applied successfully to various tasks, such as Go, action real-time strategy (ARTS) games, first-person shooter games, and controlling complex physiologically-based models. Many DRL algorithms are used in these tasks, such as DQN, Asynchronous Advantage Actor-Critic (A3C), Proximal Policy Optimization (PPO) and DDPG. These algorithms are model-free: the agent maximizes the expected reward based only on samples from the environment. Prior work points out that humans learn quickly because they hold a great deal of prior knowledge about the world. For some tasks, the agent can understand a simple environment better by making predictions. Another effective method is to build a generative neural network model. Gu et al. explore using model-based methods to accelerate DQN.
The goal of the painting agent is first to decompose the given target image into stroke representations and then to paint the strokes on the canvas to form a painting. To imitate the human painting process, the agent predicts the next stroke based on the current state of the canvas and the target image. More importantly, for the agent to predict one suitable stroke at a time, each stroke must be compatible with both previous and future strokes, which requires a well-designed feedback mechanism. We postulate that the feedback should be the reward gained after finishing one stroke, and that the agent should maximize the cumulative reward over all strokes. We give diagrams of the overall architecture in Figure 2.
Given a target image $I$ and an empty canvas $C_0$, the agent aims to find a stroke sequence $(a_0, a_1, \ldots, a_{n-1})$, where rendering $a_t$ on $C_t$ yields $C_{t+1}$. After rendering these strokes in sequence, we get the final painting $C_n$, which we hope resembles $I$ as closely as possible. We model this task as a Markov decision process with a state space $S$, an action space $A$, a transition function $\mathrm{trans}(s_t, a_t)$ and a reward function $r(s_t, a_t)$. We introduce each of these components next.
State and Transition Function The state space comprises all information that the agent can observe in the environment. We define a state with three parts: the canvas, the target image, and the step number. Formally, $s_t = (C_t, I, t)$. $C_t$ and $I$ are bitmaps, and the step number $t$ acts as additional information to instruct the agent's painting process. The transition function $s_{t+1} = \mathrm{trans}(s_t, a_t)$ gives the transition between states, implemented by painting a stroke on the current canvas.
Action The action space $A$ is the set of actions the agent can perform. An action $a_t$ is a set of parameters that control the position, shape, color and transparency of the stroke painted at step $t$. We define the behavior of the agent as a policy function $\pi$ that maps states to deterministic actions, i.e. $\pi: S \to A$, $a_t = \pi(s_t)$. At step $t$, the agent observes state $s_t$ and then outputs the parameters of the next stroke $a_t$. The state evolves according to the transition function for $n$ steps.
Reward The reward function evaluates the actions chosen by the policy. Selecting a suitable metric to measure the difference between the canvas and the target image is crucial for training a painting agent. The reward is designed as
$$r(s_t, a_t) = L_t - L_{t+1},$$
where $r(s_t, a_t)$ is the reward at step $t$, $L_t$ is the measured loss between $C_t$ and $I$, and $L_{t+1}$ is the measured loss between $C_{t+1}$ and $I$.
To make the final canvas resemble the target image, the agent is driven to maximize the cumulative reward over the whole episode. At each step $t$, the objective of the agent is to maximize the sum of discounted future rewards $R_t = \sum_{i=t}^{n} \gamma^{i-t} r(s_i, a_i)$ with a discount factor $\gamma \in [0, 1]$.
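The formulation above can be sketched as a short episode loop. The square-stamping renderer and the constant policy below are hypothetical stand-ins for the neural renderer and the learned actor; only the reward structure $r_t = L_t - L_{t+1}$ follows the text.

```python
import numpy as np

def l2_loss(canvas, target):
    """L_t: mean squared distance between the canvas and the target image."""
    return float(np.mean((canvas - target) ** 2))

def render_stroke(canvas, action):
    """Hypothetical stand-in renderer: stamps a square patch.
    action = (x, y, size, intensity), all in [0, 1]."""
    h, w = canvas.shape
    x, y, size, intensity = action
    s = max(1, int(size * h / 4))           # patch side length
    cx, cy = int(x * (h - s)), int(y * (w - s))
    out = canvas.copy()
    out[cx:cx + s, cy:cy + s] = intensity
    return out

def run_episode(target, policy, n_steps=5):
    """One painting episode; the per-step reward is r_t = L_t - L_{t+1},
    so the rewards telescope to the total loss decrease."""
    canvas = np.zeros_like(target)
    rewards = []
    for t in range(n_steps):
        action = policy(canvas, target, t)  # the state is s_t = (C_t, I, t)
        next_canvas = render_stroke(canvas, action)
        rewards.append(l2_loss(canvas, target) - l2_loss(next_canvas, target))
        canvas = next_canvas
    return canvas, rewards
```

Because the rewards telescope, maximizing the (undiscounted) cumulative reward is exactly minimizing the final loss $L_n$.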
In this section, we introduce how to train the agent using a well-designed model-based DDPG. We first describe the original DDPG, then introduce the model-based DDPG built for efficient training of the agent.
As defined, the action space of the painting task is continuous and high-dimensional. Discretizing the action space to fit DRL methods such as DQN and PG loses the precision of the stroke representation and requires much manual structure design to cope with the explosion of parameter combinations in a discrete space. DPG was proposed to resolve the difficulties of high-dimensional continuous action spaces using a deterministic policy, and DDPG later combined neural networks with DPG to improve its performance on many control tasks.
In the original DDPG, there are two networks: the actor $\pi(s)$ and the critic $Q(s, a)$. The actor models a policy $\pi$ that maps a state $s_t$ to an action $a_t$. The critic estimates the expected reward of taking action $a_t$ at state $s_t$, learned using the Bellman equation as in Q-learning, with data sampled from an experience replay buffer:
$$Q(s_t, a_t) = r(s_t, a_t) + \gamma Q(s_{t+1}, \pi(s_{t+1})),$$
where $r(s_t, a_t)$ is the reward given by the environment when performing action $a_t$ at state $s_t$. The actor is trained to maximize the critic's estimate $Q(s_t, \pi(s_t))$. In other words, the actor decides an action for each state; based on the current canvas and the target image, the critic predicts an expected reward for the stroke given by the actor; and the critic is optimized to estimate the expected rewards more accurately.
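As a minimal sketch, with the networks replaced by plain Python callables and $\gamma = 0.95$ an assumed value, the critic's regression target and its loss over a replay-buffer batch look like:

```python
GAMMA = 0.95  # discount factor (assumed value)

def critic_target(reward, next_state, actor, target_critic):
    """Bellman target for the critic in original (model-free) DDPG:
    Q(s_t, a_t) should match r(s_t, a_t) + gamma * Q(s_{t+1}, pi(s_{t+1}))."""
    next_action = actor(next_state)
    return reward + GAMMA * target_critic(next_state, next_action)

def critic_loss(critic, batch, actor, target_critic):
    """Mean squared Bellman error over a replay batch of
    (state, action, reward, next_state) tuples."""
    err = 0.0
    for s, a, r, s2 in batch:
        y = critic_target(r, s2, actor, target_critic)
        err += (critic(s, a) - y) ** 2
    return err / len(batch)
```

In practice `target_critic` is a slowly tracking copy of `critic`, and the loss is minimized by gradient descent on the critic's parameters.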
We cannot train a well-performing painting agent with the original DDPG, because it is hard for the agent to model the complex environment, composed of arbitrary real-world images, during learning. Thus, we design a neural renderer so that the agent can observe a modeled environment, explore it, and improve its policy efficiently. We term the DDPG whose actor can access gradients from the environment model-based DDPG. The difference between the two algorithms is shown in Figure 4.
The optimization of the agent using model-based DDPG differs from that using the original DDPG. At step $t$, the critic takes only $s_t$ as input rather than both $s_t$ and $a_t$. The critic still predicts an expected reward for the state, but no longer includes the reward caused by the current action. This new expected reward is called the value function $V(s_t)$, learned using
$$V(s_t) = r(s_t, a_t) + \gamma V(s_{t+1}),$$
where $r(s_t, a_t)$ is the reward for performing action $a_t = \pi(s_t)$ at $s_t$. The actor is trained to maximize $r(s_t, a_t) + \gamma V(s_{t+1})$. Here, the transition function $s_{t+1} = \mathrm{trans}(s_t, a_t)$ is the differentiable renderer.
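A sketch of the model-based actor objective, again with plain callables standing in for the networks and an assumed $\gamma = 0.95$:

```python
GAMMA = 0.95  # discount factor (assumed value)

def actor_objective(state, actor, value_fn, trans, reward_fn, gamma=GAMMA):
    """Model-based DDPG objective: the actor maximizes
    r(s_t, a_t) + gamma * V(s_{t+1}). Because trans (the neural renderer)
    is differentiable, this objective is differentiable in the actor's
    stroke parameters, unlike in model-free DDPG, where the environment
    blocks the gradient path."""
    action = actor(state)
    next_state = trans(state, action)   # differentiable renderer step
    return reward_fn(state, action) + gamma * value_fn(next_state)
```

The same quantity also serves as the regression target for the value function $V$.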
Frame Skip is a powerful trick for many RL tasks: the agent observes the environment and acts only once every $k$ frames, repeating the chosen action for the intermediate frames. This lets the agent learn associations between more temporally distant states and actions, and it achieves better performance at less computation cost.
Inspired by this trick, we make the actor output the parameters of $K$ strokes at one step, which encourages exploration of the action space and of action combinations. The renderer renders the $K$ strokes simultaneously, greatly speeding up the painting process. We term this trick Action Bundle. We experimentally find that $K = 5$ is a good choice that significantly improves both performance and training speed. It's worth noting that we modify the reward discount factor from $\gamma$ to $\gamma^K$ to keep consistency.
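A small sketch of Action Bundle: one decision paints $K$ strokes, the bundle reward is the total loss decrease, and the per-decision discount becomes $\gamma^K$. The `actor`, `render` and `loss` callables here are illustrative placeholders.

```python
import numpy as np

K = 5  # strokes per bundle; the setting found to work best

def bundle_step(canvas, target, actor, render, loss):
    """One decision of the agent paints K strokes at once; the reward
    is the total loss decrease over the whole bundle."""
    actions = actor(canvas, target)        # K stroke parameter sets
    before = loss(canvas, target)
    for a in actions:                      # rendered in sequence
        canvas = render(canvas, a)
    return canvas, before - loss(canvas, target)

def bundle_discount(gamma_per_stroke, k=K):
    """One bundle spans k strokes of the original MDP, so the per-decision
    discount factor becomes gamma ** k to stay consistent."""
    return gamma_per_stroke ** k
```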
The Wasserstein distance between the two distributions is estimated by training a discriminator $D$ to maximize
$$\mathbb{E}_{y \sim \mu}[D(y)] - \mathbb{E}_{x \sim \nu}[D(x)],$$
where $D$ denotes the discriminator, and $\mu$ and $\nu$ are the distributions of target images and paintings, respectively. The prerequisite of the above objective is that $D$ be 1-Lipschitz. To enforce this constraint, we use WGAN with gradient penalty (WGAN-GP).
We want to reduce the distance between the painting and target-image distributions as much as possible. To achieve this, we use the difference of discriminator scores between consecutive canvases, plugged into the reward definition above, as the reward guiding the actor's learning. In experiments, we find that training the agent with the WGAN reward works better than with $\ell_1$ or $\ell_2$ rewards.
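The two ingredients can be sketched as follows. The discriminator here is a plain callable; in WGAN-GP proper, the gradient at the interpolate comes from autodiff, whereas this sketch estimates it numerically for illustration.

```python
import numpy as np

def wgan_reward(disc, canvas, next_canvas, target):
    """Reward from the discriminator-score difference between consecutive
    canvases: r_t = D(C_{t+1}, I) - D(C_t, I)."""
    return disc(next_canvas, target) - disc(canvas, target)

def gradient_penalty(disc_single, real, fake, lam=10.0, eps=1e-5):
    """WGAN-GP term: penalize (||grad D(x_hat)|| - 1)^2 at a random
    interpolate x_hat between a real and a fake sample. The gradient is
    estimated by central differences here; autodiff is used in practice."""
    alpha = np.random.rand()
    x_hat = alpha * real + (1 - alpha) * fake
    grad = np.zeros_like(x_hat)
    for i in range(x_hat.size):
        d = np.zeros_like(x_hat)
        d.flat[i] = eps
        grad.flat[i] = (disc_single(x_hat + d) - disc_single(x_hat - d)) / (2 * eps)
    return lam * (np.linalg.norm(grad) - 1.0) ** 2
```

The penalty coefficient `lam = 10.0` is the commonly used WGAN-GP default.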
Due to the high variability and complexity of real-world images, we use residual structures similar to ResNet-18 as the feature extractor in both the actor and the critic. The actor works well with Batch Normalization (BN), but BN does not significantly speed up critic training. Salimans et al. apply Weight Normalization (WN) to improve DQN; similarly, we use WN with Translated ReLU (TReLU) on the critic to stabilize training. In addition, we use CoordConv as the first layer of the actor and the critic. For the discriminator, we use a network architecture similar to PatchGAN, also with WN and TReLU. The network architectures of the actor, critic and discriminator are shown in Figure 5 (a) and (b).
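The CoordConv first layer simply appends normalized coordinate channels to the input before the first convolution; a minimal numpy sketch of that input transformation (the convolution itself is omitted):

```python
import numpy as np

def add_coord_channels(batch):
    """CoordConv input: append normalized x / y coordinate channels to an
    (N, C, H, W) image batch, giving the first convolution direct access
    to pixel positions."""
    n, c, h, w = batch.shape
    ys = np.tile(np.linspace(-1.0, 1.0, h)[:, None], (1, w))   # row coords
    xs = np.tile(np.linspace(-1.0, 1.0, w)[None, :], (h, 1))   # col coords
    coords = np.stack([xs, ys])                                # (2, H, W)
    coords = np.broadcast_to(coords, (n, 2, h, w))
    return np.concatenate([batch, coords], axis=1)             # (N, C+2, H, W)
```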
Following the original DDPG paper, we use soft target networks: we keep a copy of the actor and the critic and update their parameters by having them slowly track the learned networks. We apply this trick to the discriminator as well to improve its training stability.
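The soft target update is a one-liner per parameter; `tau = 0.001` is the value suggested in the DDPG paper, and parameters are represented here as a plain list of scalars for illustration.

```python
def soft_update(target_params, learned_params, tau=0.001):
    """Soft target network update:
    theta_target <- tau * theta_learned + (1 - tau) * theta_target,
    so the target network slowly tracks the learned network."""
    return [tau * l + (1.0 - tau) * t
            for t, l in zip(target_params, learned_params)]
```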
In this section, we introduce how to build a neural stroke renderer and use it to generate multiple types of strokes.
Using a neural network to generate strokes has two advantages. First, the neural renderer is flexible enough to generate any style of stroke and is more efficient than hand-crafted stroke simulators. Second, the neural renderer is differentiable, which lets it serve as an environment model for DDPG and boosts the performance of the agent.
Specifically, we feed the neural renderer a set of stroke parameters $a_t$, and it outputs the rendered stroke image. The training samples are generated randomly using graphics renderer programs. The neural renderer can be trained quickly with supervised learning and runs on the GPU, giving us a differentiable and fast-running environment: formally, both the model-based transition dynamics and the reward function are differentiable. Some simple geometric renderings could be done without neural networks and would give gradients as well, but neural networks let us omit cumbersome formula derivations.
The neural renderer network consists of several fully connected layers and convolution layers. Sub-pixel upsampling is used to increase the resolution of strokes in the network; it runs fast and eliminates the checkerboard effect. We show the network architecture of the neural renderer in Figure 5 (c).
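To make the supervised training procedure concrete, the sketch below generates (stroke parameters, stroke image) pairs from a toy graphics renderer and fits a stand-in model by gradient descent on the MSE. The single linear layer and the toy renderer are illustrative assumptions; the paper's renderer is a deep network with sub-pixel upsampling.

```python
import numpy as np

def make_training_batch(graphics_render, param_dim, img_size, batch=16, rng=None):
    """Training data for the neural renderer: random stroke parameters
    paired with ground-truth strokes from a conventional graphics renderer."""
    if rng is None:
        rng = np.random.default_rng(0)
    params = rng.random((batch, param_dim))
    images = np.stack([graphics_render(p, img_size) for p in params])
    return params, images

def train_linear_renderer(params, images, lr=0.5, steps=500):
    """Stand-in for the neural renderer: one linear layer fit with MSE."""
    n, d = params.shape
    Y = images.reshape(n, -1)
    W = np.zeros((d, Y.shape[1]))
    for _ in range(steps):
        pred = params @ W
        W -= lr * params.T @ (pred - Y) / n   # gradient step on the MSE
    return W
```

Because the training pairs come from a program rather than human data, the renderer (and hence the environment model) can be retrained cheaply for any new stroke design.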
Strokes can be designed as a variety of curves or geometries. In general, the parameters of a stroke should include its position, shape, color and transparency.
For curve strokes, which simulate the effect of brushes, the coordinates of the control points and the thickness determine the shape of a stroke. Bezier curves are common in vector drawing programs. We design a brief stroke representation for the Quadratic Bezier Curve (QBC) as
$$a_t = (x_0, y_0, x_1, y_1, x_2, y_2, r_0, t_0, r_1, t_1, R, G, B),$$
where $(x_0, y_0), (x_1, y_1), (x_2, y_2)$ are the coordinates of the three control points of the QBC; $(r_0, t_0)$ and $(r_1, t_1)$ control the thickness and transparency at the two endpoints of the curve, respectively; and $(R, G, B)$ controls the color. The formula of the QBC is
$$B(t) = (1 - t)^2 P_0 + 2(1 - t)t P_1 + t^2 P_2, \quad 0 \le t \le 1.$$
To eliminate aliasing, the curve is first drawn on a high-resolution canvas and then downsampled to the resolution of the target image. Neural renderers with the same structure can implement the rendering of different strokes.
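A minimal rasterizer for this stroke representation (color channels omitted, and a square brush footprint standing in for a round one; the divisor in the radius formula is an illustrative assumption):

```python
import numpy as np

def render_qbc(action, size=128, samples=100):
    """Rasterize a quadratic Bezier curve stroke on a size x size canvas.
    action = (x0, y0, x1, y1, x2, y2, r0, t0, r1, t1), all in [0, 1]:
    control points, then thickness/transparency at the two endpoints."""
    x0, y0, x1, y1, x2, y2, r0, t0, r1, t1 = action
    canvas = np.zeros((size, size))
    for i in range(samples):
        t = i / (samples - 1)
        # B(t) = (1 - t)^2 P0 + 2 (1 - t) t P1 + t^2 P2
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * x1 + t ** 2 * x2
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * y1 + t ** 2 * y2
        radius = max(1, int(((1 - t) * r0 + t * r1) * size / 20))
        alpha = (1 - t) * t0 + t * t1     # transparency varies along the curve
        cx, cy = int(x * (size - 1)), int(y * (size - 1))
        rows = slice(max(cx - radius, 0), min(cx + radius, size))
        cols = slice(max(cy - radius, 0), min(cy + radius, size))
        # stamp a square brush footprint (a disc in the real renderer)
        canvas[rows, cols] = np.maximum(canvas[rows, cols], alpha)
    return canvas
```

A graphics routine like this supplies the random (parameters, image) training pairs for the neural renderer.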
Four datasets are used for our experiments, including MNIST , SVHN , CelebA  and Imagenet . We show that the agent has excellent performance in painting various types of real-world images.
MNIST contains 70,000 examples of hand-written digits, where 60,000 are training data and 10,000 are testing data. Each example is a grayscale image with a resolution of 28 x 28 pixels.
SVHN is a real-world house number image dataset, including 600,000 digit images. Each sample in the Cropped Digits set is a color image with a resolution of 32 x 32 pixels. We randomly sample 200,000 images for our experiments.
CelebA contains approximately 200,000 celebrity face images. The officially provided center-cropped images are used in our experiments.
ImageNet (ILSVRC2012) contains 1.2 million natural scene images falling into 1,000 categories. ImageNet exhibits extreme diversity, which poses a great challenge to the painting agent. We randomly sample 200,000 images covering all 1,000 categories as training data.
In our task, we aim to learn an agent that can paint any image rather than only those in the training set. Thus, we additionally split out a testing set to test the generalization ability of the learned agent. For MNIST, we use the officially defined testing set. For the other datasets, we randomly split out 2,000 images as the testing set.
We resized all images to a fixed resolution before giving them to the agent. The number of training batches was set per dataset, largest for ImageNet and CelebA and smallest for MNIST. Adam was used for optimization with a minibatch size of 96. Agent training was done on a single NVIDIA TITAN Xp with a consumption of 30GB of memory. Training took about 40 hours on ImageNet and CelebA, 20 hours on SVHN and two hours on MNIST. Training the neural renderer took 5 to 15 hours for each stroke design; a learned renderer can be reused by different agents.
The actor, critic and discriminator were updated in turn at each training step. The replay memory buffer was set to store the data of the latest 800 episodes for training the agent. Please refer to the supplemental materials for more training details.
Since the reward is given by a dynamically learned discriminator, some bias is introduced into the calculated rewards; as a result, the agent can still explore the environment well without adding noise to the actions.
The images of MNIST and SVHN show simple structures and regular contents. We train one agent that paints five strokes for MNIST images and another that paints 40 strokes for SVHN images. Example paintings are shown in Figure 3 (a) and (b); the agents reproduce the target images almost perfectly.
By contrast, the images of CelebA have more complex structures and diverse contents. We train a 200-stroke agent for CelebA. As shown in Figure 3 (c), the paintings are quite similar to the target images, losing only a certain level of detail. SPIRAL also reports its performance on CelebA; for a fair comparison, we additionally train a 20-stroke agent with opaque strokes, as in SPIRAL. The results of the two methods are shown in Figure 6 (a) and (b), respectively. Our paintings are clearer than SPIRAL's, and our $\ell_2$ distance is less than one third of SPIRAL's.
We train a 400-stroke agent for ImageNet images, owing to their extremely complex structures and varied contents. As shown in Figure 3 (d), the paintings resemble the target images in the outlines and colors of objects and backgrounds. Although some textures are lost, the agent still shows great power in decomposing complicated scenes into strokes and repainting them in a reasonable way.
In addition, we show the testing loss curves of agents trained on different datasets in Figure 8.
In this section, we study how the components or tricks, including model-based DDPG, Action Bundle and WGAN reward, affect the performance of the agent. For simplicity, we experiment on CelebA only.
We explore how much benefit model-based DDPG brings over the original DDPG. The original DDPG can model the environment only implicitly, through observations and rewards; moreover, the high-dimensional action space prevents model-free methods from handling the painting task successfully. To further explore the capability of model-free methods, we improve the original DDPG with a method inspired by PatchGAN: we split the images into patches before feeding the critic, then use patch-level rewards to optimize it. We term this method PatchQ. PatchQ boosts sample efficiency and improves the agent's performance by providing many more supervision signals during training.
We show the performance of agents trained with different algorithms in Figure 7 (a). Model-based DDPG outperforms both the original DDPG and DDPG with PatchQ, with a distance five times smaller than DDPG with PatchQ and 20 times smaller than the original DDPG. Although it underperforms model-based DDPG, DDPG with PatchQ still improves greatly over the original DDPG.
The $\ell_2$ distance is one choice of reward for training the actor. We show the paintings produced with WGAN rewards and $\ell_2$ rewards in Figure 6 (d) and (e), respectively. The paintings with WGAN rewards show richer textures and look as vivid as the target images. Interestingly, training with WGAN rewards achieves a lower $\ell_2$ loss on the testing data than training with $\ell_2$ rewards. This suggests the WGAN distance is a better metric than $\ell_2$ for measuring the difference between paintings and real-world images.
The number of strokes is critical for the final painting, especially for texture-rich images. We train agents that paint 100, 200, 400 and 1000 strokes; the testing loss curves are shown in Figure 7 (c). A larger number of strokes yields better paintings. We show 200-stroke and 1000-stroke paintings in Figure 7 (e) and (f), respectively. To the best of our knowledge, few methods can handle such a large number of strokes. More strokes help reconstruct the details in the paintings.
Action Bundle speeds up the painting process; here we also explore how it affects the agent's performance. Testing loss curves for several Action Bundle settings are shown in Figure 7 (b). We find that predicting five strokes per bundle performs best. We conjecture that increasing the number of strokes per bundle lets the agent paint more strokes within a fixed number of decisions and helps long-term planning, but also makes it harder to choose many strokes reasonably in one decision. Five strokes per bundle is thus a good trade-off.
Besides the QBC, we show alternative stroke representations that the agent masters well, including straight strokes, circles and triangles. We train one neural renderer per stroke representation. Paintings with these renderers are shown in Figure 9. The QBC strokes produce an excellent visual effect, while the other strokes create different artistic effects. Although the styles differ, the paintings still resemble the target images, showing that the learned agent is general and robust to stroke design.
In addition, by restricting the transparency of strokes, we can get paintings with different stroke effects, such as ink painting and oil painting as shown in Figure 6 (c).
In this paper, we learn a painting agent that decomposes the target image into strokes and paints them on the canvas in sequence to form a painting. The agent is trained within a DRL framework, which encourages it to make a long-term plan over the sequence of strokes to maximize the cumulative reward. Instead of a conventional stroke simulator, a neural renderer generates strokes simply and efficiently. Moreover, the neural renderer enables the model-based DRL algorithm, which performs better than the original DRL algorithm on the painting task. The learned agent can predict hundreds or even thousands of strokes to generate a painting. Experimental results show that our model handles multiple types of target images and performs well on texture-rich natural scene images.
Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 214–223.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690.
An intriguing failing of convolutional neural networks and the CoordConv solution. In Advances in Neural Information Processing Systems, pp. 9628–9639.
Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5485–5493.