LPaintB: Learning to Paint from Self-Supervision

06/17/2019 ∙ by Biao Jia, et al. ∙ 4

We present a novel reinforcement learning-based natural media painting algorithm. Our goal is to reproduce a reference image using brush strokes, and we encode this objective through the observations. Our formulation takes into account that the distribution of the reward in the action space is sparse, which makes training a reinforcement learning algorithm from scratch difficult. We present an approach that combines self-supervised learning and reinforcement learning to turn negative samples into positive ones and change the reward distribution. We demonstrate the benefits of our painting agent by reproducing reference images with brush strokes. The training phase takes about one hour, and the runtime algorithm takes about 30 seconds on a GTX 1080 GPU to reproduce a 1000x800 image with 20,000 strokes.




1 Introduction

Digital painting systems are increasingly used by artists and content developers for various applications. One of the main goals has been to simulate popular or widely-used painting styles. With the development of non-photorealistic rendering techniques, including stroke-based rendering and painterly rendering [9, 33], specially-designed or hand-engineered methods can increasingly simulate the painting process by applying heuristics. In practice, these algorithms can generate compelling results, but it is difficult to extend them to new or unseen styles.

Over the last decade, there has been considerable interest in using machine learning methods for digital painting. These methods include image synthesis algorithms based on convolutional neural networks: modeling the brush [34], generating brush stroke paintings [36], reconstructing paintings in specific styles [31], and constructing stroke-based drawings [7]. Recent developments in generative adversarial networks [6] and variational autoencoders [17] have led to image generation algorithms that can be applied to painting styles [40, 39, 12, 16, 27].

Figure 1: Results Generated by Our Painting Agent: We use three paintings (top row) as the reference images to test our novel self-supervised learning algorithm. Our trained agent automatically generates the digitally painted image (bottom row) of the corresponding column in about 30 seconds, without the need for a paired dataset collected from human artists.

One of the goals is to develop an automatic or intelligent painting agent that can develop its painting skills by imitating reference paintings. In this paper, we focus on building an intelligent painting agent that can reproduce a reference image in an identical or transformed style with a sequence of painting actions. Unlike methods that directly synthesize images while bypassing the painting process, we focus on the more general and challenging problem of training a painting agent from scratch using reinforcement learning. [36, 35, 34, 39] also use reinforcement learning to solve this problem. All of these methods encode goal states, usually defined as reference images, into the observations. This setup differs from classic reinforcement learning tasks: because the objective is only implicit in the input to the policy network, the distribution of the reward in the action space can be very sparse, which makes training a reinforcement learning algorithm from scratch very difficult. To address this, [36, 35, 34, 39] pre-train the policy network with a paired dataset consisting of images and corresponding actions, as defined in [34]. However, collecting such a paired dataset from human artists is very expensive, so we need to explore unsupervised alternatives.

Main Results: We present a reinforcement learning-based algorithm (LPaintB) that incorporates self-supervised learning to train a painting agent on a limited number of reference images without paired datasets. Our approach is data-driven and can be generalized by expanding the image datasets. Specifically, we adopt proximal policy optimization (PPO) [29], encoding the current and goal states as observations and using a continuous action space defined by paint brush configurations such as length, orientation, and brush size. The training component of our method requires only reference paintings in the desired artistic style, not paired datasets collected by human artists. We use a self-supervised learning method to increase the sampling efficiency: by replacing the goal state of an unsuccessful episode with its final state, we automatically generate a paired dataset with positive rewards. After retraining the model on this dataset with reinforcement learning, our approach can efficiently learn the optimal policy. The novel contributions of our work include:


  • An approach for collecting supervised data for painting tasks by self-supervised learning.

  • An adapted deep reinforcement learning network that can be trained using human expert data and self-supervised data, though we mostly rely on self-supervised data.

  • An efficient rendering system that can automatically generate stroke-based paintings of desired resolutions by our trained painting agent.

We evaluate our approach by comparing our painting agent with prior painting agents that are trained from scratch by reinforcement learning [14]. We collect 1000 images with different colors and patterns as the benchmark and compute the L2 loss between generated images and reference images. Our results show that self-supervised learning can efficiently collect paired data and accelerate the training process. The training phase takes about 1 hour, and the runtime algorithm takes about 30 seconds on a GTX 1080 GPU for high-resolution images.

2 Related Work

In this section, we give a brief overview of prior work on non-photorealistic rendering and the use of machine learning techniques for image synthesis.

2.1 Non-Photorealistic Rendering

Non-photorealistic rendering methods render a reference image as a combination of strokes by determining many properties like position, density, size, and color. To mimic the oil-painting process, Hertzmann [9] renders the reference image into primitive strokes using gradient-based features. To simulate mosaic decorative tile effects, Hauser [8] segments the reference image using Centroidal Voronoi diagrams. Many algorithms have been proposed for specific artistic styles, such as stipple drawings [2], pen-and-ink sketches [26], and oil paintings [37, 22]. The drawback of non-photorealistic rendering methods is their lack of generalizability to new or unseen styles; moreover, they may require hand tuning and must be extended manually to other styles.

2.2 Visual Generative Algorithms

Hertzmann et al. [10] introduce image analogies, a generative method based on a non-parametric texture model. Many recent approaches are based on CNNs and use large datasets of input-output training image pairs to learn the mapping function [4]. Inspired by the idea of variational autoencoders [17], Johnson et al. [15] introduce the concept of perceptual loss to model style transfer between paired datasets. Zhu et al. [40] use generative adversarial networks to learn the mappings without paired training examples. These techniques have been used to generate natural images [16, 27], artistic images [20], and videos [32, 21]. Compared to previous visual generative methods, our approach can generate high-resolution results and is easy to extend to different painting media and artistic styles.

2.3 Image Synthesis Using Machine Learning

Many techniques have been proposed for image synthesis using machine learning. Xie et al. [34, 36, 35] present a series of algorithms that simulate strokes using reinforcement learning and inverse reinforcement learning. These approaches learn a policy from either reward functions or expert demonstrations. For interactive artistic creation, stroke-based approaches can generate trajectories and intermediate painting states. Another advantage of stroke-based methods is that the final results are trajectories of paint brushes, which can then be deployed in different synthetic natural media painting environments, or in real painting environments using robot arms. In contrast to our algorithm, Xie et al. [34, 36, 35] focus on designing reward functions to generate orientational painting strokes, and their approach requires expert demonstrations for supervision. Ha et al. [7] collect a large-scale dataset of simple sketches of common objects with corresponding recordings of painting actions. Based on this dataset, a recurrent neural network model is trained in a supervised manner to encode and re-synthesize the action sequences; the trained model is also capable of generating new sketches. Following [7], Zhou et al. [39] use reinforcement learning and imitation learning to reduce the amount of supervision needed to train such a sketch generation model. In contrast to prior methods, [14] operates in a continuous action space with higher dimensions, applying the PPO [29] reinforcement learning algorithm to train the agent from scratch, and can handle dense, high-resolution images.

Compared with prior visual generative methods, our painting agent can automatically generate results from a limited training dataset, without a paired dataset.

3 Self-Supervised Painting Agent

In this section, we introduce notations, formulate the problem and present our self-supervised learning algorithm for natural media painting.

Figure 2: Our Learning Algorithm: We use self-supervised learning to generate a paired dataset from a training dataset containing reference images only, and use it to initialize the model for reinforcement learning. We then feed the trained policy network back into self-supervised learning to generate paired datasets with positive rewards. (1) We initialize the policy network with random painting actions; (2) we roll out the policy by iteratively applying the policy network to the painting environment to collect paired data, then assign the final state as the goal state and change the rewards of each step accordingly; (3) we retrain the policy with the supervision data to generate the self-supervised policy, and use behavior cloning to initialize the policy network; (4) we apply policy optimization [29] and update the policy; (5) we roll out the updated policy and continue the iterative algorithm.
Symbol Meaning
t step index
N time steps to compute accumulated rewards
s_t current painting state of step t, canvas
s* target painting state, reference image
ŝ reproduction of s*
o_t observation of step t
a_t action of step t
r_t reward of step t
q_t accumulated reward of step t
γ discount factor for computing the reward
π(o_t) painting policy, predicts a_t by o_t
V_π(o_t) value function of the painting policy, predicts q_t by o_t
f(s) feature extraction of state s
render(a_t, s_t) render function, renders action a_t to state s_t
O(s_t, s*) observation function, encodes the current state s_t and the target state s*
L(s_t, s*) loss function, measuring the distance between state s_t and objective state s*
Table 1: Notation and Symbols used in our Algorithm

3.1 Background

Self-supervised learning methods [18] are designed to enable learning without explicit supervision: the supervised signal for a pretext task is created automatically. It is a form of unsupervised learning in which the data itself provides the supervision. In its original formulation, part of the information in the data is withheld and a classification or regression function is trained to predict it. The task to be solved usually admits a proxy loss so that it can be addressed by self-supervised learning. Self-supervised learning has a variety of applications in different areas, such as audio-visual analysis [24], visual representation learning [3], image analysis [5], and robotics [13]. In this paper, we use the term self-supervised learning to refer to the process of generating self-supervision data and feeding it to the policy network of the reinforcement learning framework.

3.2 Problem Formulation

Reproducing images with brush strokes can be formalized as finding a series of actions that minimizes the distance between the reference image and the current canvas in the desired feature space. Based on the notation in Table 1, this can be expressed as minimizing the loss function over the sequence of painting actions.

Applying reinforcement learning converts this objective into maximizing the accumulated rewards. The observation function and the loss function can each perform feature extraction of the states, and these feature extractions can be either the same or different; in other words, the two functions can operate in either the same or different feature spaces. If they extract different features, the policy will learn to map the observation into the feature space that the reward function uses.
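To make the formulation concrete, the following sketch (not the authors' trained agent, only a brute-force stand-in) greedily picks, at each step, the candidate action whose rendering most reduces an L2 loss between the canvas and the reference; the toy `render` that stamps a constant patch is purely illustrative:

```python
import numpy as np

def l2_loss(canvas, target):
    """L2 distance between the current canvas and the reference image."""
    return float(np.mean((canvas - target) ** 2))

def toy_render(canvas, action):
    """Illustrative render function: stamp a 2x2 patch of a given value."""
    x, y, value = action
    out = canvas.copy()
    out[x:x + 2, y:y + 2] = value
    return out

def greedy_paint(canvas, target, candidates, steps=8):
    """Greedily apply the candidate action that most reduces the loss."""
    for _ in range(steps):
        best = min(candidates,
                   key=lambda a: l2_loss(toy_render(canvas, a), target))
        canvas = toy_render(canvas, best)
    return canvas
```

A learned policy replaces this exhaustive search with a single forward pass per step, which is what makes the reinforcement learning formulation attractive.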

3.3 Behavior Cloning

Behavior cloning uses a paired dataset of observations and corresponding actions to train the policy to imitate an expert trajectory or behavior. In our setup, the expert trajectory is encoded in the paired dataset, which is related to step 4 in Figure 2. We use behavior cloning to initialize the policy network of reinforcement learning with the supervised policy trained on paired data. The paired dataset can be generated by a human expert or by an optimal algorithm with global knowledge, which our painting agent does not have. Once the paired dataset is obtained, one solution is to apply supervised learning, based on regression or classification, to train the policy; the training process can be represented as an optimization that minimizes the distance between the predicted actions and the expert actions.

It is difficult to generate such an expert dataset for our painting application because of the large variation in the reference images and painting actions. However, we can generate a paired dataset by rolling out a policy, which can be seen as iteratively applying predicted actions to the painting environment. For the painting problem, we can use the trained policy itself as the expert by introducing self-supervised learning.
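As a minimal illustration of this supervised step, the sketch below fits a linear least-squares policy on (observation, action) pairs; the linear model is a stand-in for the actual policy network:

```python
import numpy as np

def behavior_clone(observations, actions):
    """Fit a linear policy a ~ o @ W by least squares on (observation,
    action) pairs -- a stand-in for regressing the policy network on the
    paired dataset."""
    O = np.asarray(observations, dtype=float)   # shape (N, obs_dim)
    A = np.asarray(actions, dtype=float)        # shape (N, act_dim)
    W, *_ = np.linalg.lstsq(O, A, rcond=None)
    return lambda obs: np.asarray(obs, dtype=float) @ W
```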

3.4 Self-Supervised Learning

As we apply reinforcement learning to the painting problem, several properties emerge that distinguish it from classic control problems [29, 28, 23, 30]. We use the reference image as the objective and encode it in the observation of the environment defined in Eq. (12). As a result, the objective of the task Eq. (3) is not explicitly defined; hence, the rollout actions on different reference images can vary.

Through the reinforcement learning training process, positive rewards in the high-dimensional action space can be very sparse: only a small portion of the actions sampled by the policy network have positive rewards. To change the reward distribution in the action space by increasing the probability of a positive reward, we propose using self-supervised learning. Our formulation uses the rollout of the policy as paired data to train the policy network and then retrains the model using reinforcement learning. Specifically, we replace the reference image with the final rendering of the rollout of the policy function, and use the updated observations together with the recorded actions as the paired supervised training dataset.

From the rollout of the trained policy, we collect the observation-action pairs as paired data and denote the rendering of the final state as the new goal. The reward function is defined as the percentage improvement of the loss over the previous state. We then modify the observations and rewards to their self-supervised counterparts and use the resulting dataset to train a self-supervised policy and its value function. Algorithm 1 highlights the learning process for self-supervised learning.
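The relabeling step can be sketched as follows (our reading of the method; the percentage-improvement reward and the concatenated observation are simplified stand-ins for the paper's definitions):

```python
import numpy as np

def l2_loss(a, b):
    return float(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

def reward(prev_state, state, goal):
    """Percentage improvement of the loss over the previous state."""
    prev = l2_loss(prev_state, goal)
    return (prev - l2_loss(state, goal)) / (prev + 1e-8)

def relabel_rollout(states, actions):
    """Treat the rollout's own final canvas as the goal state, so every
    recorded action becomes a positive-reward (observation, action) pair."""
    goal = states[-1]
    paired = []
    for prev, state, action in zip(states[:-1], states[1:], actions):
        obs = np.concatenate([np.ravel(prev), np.ravel(goal)])
        paired.append((obs, action, reward(prev, state, goal)))
    return paired
```

Because the goal is, by construction, the state the rollout actually reached, every transition moves toward it and receives a positive reward.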

0:  Set of objective states (reference images)
0:  Painting policy and its value function
1:  for each objective state do
2:     // Roll out the policy and collect the paired data with positive rewards
3:     while the episode has not ended do
4:        predict an action, apply it to the environment, and record the observation and action
5:     end while
6:     // Build the self-supervised learning dataset
7:     for each recorded step do
8:        replace the goal state with the final rendering and recompute the reward
9:     end for
10:    // Compute cumulative rewards
11:    for each recorded step do
12:       accumulate the discounted rewards
13:    end for
14:    // Initialize policy network for policy optimization
15:    // Initialize value network for policy optimization
16:  end for
17:  return the painting policy and its value function
Algorithm 1 Self-Supervised Learning

3.5 Retraining with Reinforcement Learning

After we build the expert dataset from the rollout of the trained agent, we use it to train the agent by behavior cloning. However, the policy generated by supervised learning alone (Alg. 1) is not robust enough. There are two main problems. First, the paired data only consists of actions with positive rewards, which makes it difficult to recover from actions that return negative rewards. Second, the expert data generated by the policy is not always optimal: for painting and other control problems, each state can be reached by multiple series of actions.

One solution to the generalization problem of behavior cloning is data aggregation [25], which increases the robustness of the trained model by adding noise to the trajectories and computing the corresponding recovering actions and observations. The critical requirement of data aggregation is an expert with global knowledge who can provide the recovering actions for bad states; for our problem, we would still need human experts to provide this guiding information to aggregate the dataset.

Another solution is retraining the model with reinforcement learning. After we obtain the expert dataset, we can use it to train the value network and use a subset of it to train the policy network. In this manner, we can retrain using reinforcement learning and set the objective state to the final rendering, as in self-supervised learning.

Reinforcement learning can solve the two problems mentioned above based on:

  1. Exploring more regions in the action space with negative or positive rewards by adding noise to the action, which can generalize the model.

  2. Optimizing the actions of the expert guide with the reward function.

As described in Figure 2, the self-supervised learning takes a random policy as input, which randomly samples from the action space. In this case, reinforcement learning can benefit from the paired dataset with positive rewards after the initialization. After policy optimization, reinforcement learning can optimize the policy for the next turn of self-supervised learning. The role that reinforcement learning plays is to generalize the model and optimize the trajectories. Self-supervised learning provides paired datasets and expands the variation of the objective states. Therefore, the gap between the performances of reinforcement learning and self-supervised learning narrows during the training process until they converge.
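The feedback loop of Figure 2 can be summarized as below; every callable is a placeholder for a component described above, not an implementation of it:

```python
def train_loop(policy, rollout_and_relabel, bc_update, ppo_update, rounds=3):
    """Alternate self-supervised relabeling, behavior cloning, and RL
    fine-tuning, as in the Figure 2 loop. All callables are placeholders."""
    for _ in range(rounds):
        paired = rollout_and_relabel(policy)   # (2) collect positive-reward pairs
        policy = bc_update(policy, paired)     # (3) supervised initialization
        policy = ppo_update(policy)            # (4) policy optimization
    return policy
```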

3.6 Painting Agent

In this section, we present technical details of our reinforcement learning-based painting agent.

3.6.1 Observation

As shown in Figure 3, our observation function is defined as follows. First, we encode the objective state (reference image) together with the painting canvas. Second, we extract both the global and the egocentric view of the state. As mentioned in [39, 14], the egocentric view encodes the current position of the agent and provides details about the state, while the global view provides overall information about the state. The observation is defined as Eq. (12), given the patch size and the position of the paint brush.
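A sketch of assembling the four-view observation; the patch size, the white (value 1) border padding, and the strided global down-sampling are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def observe(canvas, target, pos, patch=8):
    """Stack the global views and the egocentric crops (around the brush
    position) of the reference image and the canvas."""
    def egocentric(img, x, y, k):
        pad = np.ones((img.shape[0] + 2 * k, img.shape[1] + 2 * k))
        pad[k:img.shape[0] + k, k:img.shape[1] + k] = img
        return pad[x:x + 2 * k + 1, y:y + 2 * k + 1]

    x, y = pos
    k = patch
    size = 2 * k + 1
    # Crude global down-sampling by striding (stands in for real resizing).
    sx = max(1, canvas.shape[0] // size)
    sy = max(1, canvas.shape[1] // size)
    views = [target[::sx, ::sy][:size, :size],
             canvas[::sx, ::sy][:size, :size],
             egocentric(target, x, y, k),
             egocentric(canvas, x, y, k)]
    return np.stack(views)
```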

Figure 3: Observation for Training: The figure demonstrates the observations of a rollout process with 20 iterations. In each sub-figure, we extract the global view of the reference image (upper left), the global view of the canvas (upper right), the egocentric view of the reference image (lower left), and the egocentric view of the canvas (lower right). We normalize the 4 views and combine them as the observation of the state, filling the borders of the canvas and reference image with white.

3.6.2 Action

The action is defined as a vector in a continuous space encoding positional information and paint brush configurations. Each value is normalized. Because the action space is continuous, the agent can be trained using policy-gradient-based reinforcement learning algorithms. The updated position of the paint brush after applying an action is computed by adding the positional offset of the action to the current coordinates of the paint brush.
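The position update might look like the following sketch; the action layout (first two entries as offsets in [0, 1]) and the step scale are illustrative assumptions:

```python
import numpy as np

def apply_action(pos, action, canvas_size, max_step=10):
    """Update the paint brush position from the positional part of a
    normalized action vector, clamped to the canvas. The remaining entries
    (color, brush size, ...) would configure the stroke itself."""
    dx, dy = action[0], action[1]               # each normalized to [0, 1]
    x = int(np.clip(pos[0] + (2 * dx - 1) * max_step, 0, canvas_size - 1))
    y = int(np.clip(pos[1] + (2 * dy - 1) * max_step, 0, canvas_size - 1))
    return x, y
```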

3.6.3 Loss Function

The loss function defines the distance between the current state and the objective state and guides how the agent reproduces the reference image. In practice, we test our algorithm with the loss defined in Eq. (13), computed over the image.


For the self-supervised learning process, the loss function only affects the reward computation. The reinforcement learning training process, however, uses the final renderings as the reference images to train the model, so the loss function can affect the policy network.

3.6.4 Policy Network

To define the structure of the policy network, we consider the input to be a concatenated patch of the reference image and canvas in the egocentric and global views, given the sample size. The first hidden layer convolves 64 filters with stride 4, the second convolves 64 filters with stride 2, and the third convolves 64 filters with stride 1. This is followed by a fully-connected layer with 512 neurons. All layers use the ReLU activation function.

3.6.5 Runtime Algorithm

After training a model using self-supervised learning and reinforcement learning, we can apply it to reproduce reference images at different resolutions. First, we randomly sample a position on the canvas, draw a patch around it, and feed the patch to the policy network. Second, we iteratively predict actions and render them in the environment until the value network returns a negative expected reward. Then we reset the environment by sampling another position on the canvas, and repeat this loop until the loss falls below a threshold.

0:  Reference image, the learned painting policy, and its observation size
0:  Final rendering
1:  initialize the canvas
2:  while the loss between the canvas and the reference image is above a threshold do
3:     // Sample a 2-dimensional point within the image to start the stroke
4:     // Get the observation and initialize the predicted reward
5:     while the predicted reward is positive do
6:        // Predict the painting action
7:        // Predict the expected reward
8:        // Render the action
9:        // Update the observation
10:    end while
11:  end while
12:  return the final rendering
Algorithm 2 Our Runtime Algorithm
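The runtime loop above can be sketched as follows, with stub policy, value, and render functions standing in for the trained networks and the painting environment:

```python
import numpy as np

def paint(target, policy, value_fn, render, max_strokes=20000, tol=1e-3):
    """Runtime loop: restart from random brush positions and keep predicting
    strokes while the value network expects a positive reward."""
    rng = np.random.default_rng(0)
    canvas = np.zeros_like(target)
    strokes = 0
    while strokes < max_strokes and np.mean((canvas - target) ** 2) > tol:
        pos = (int(rng.integers(target.shape[0])),
               int(rng.integers(target.shape[1])))
        while strokes < max_strokes:
            obs = (canvas, target, pos)
            if value_fn(obs) <= 0:          # expected reward turns negative
                break
            action = policy(obs)
            canvas, pos = render(canvas, pos, action)
            strokes += 1
    return canvas
```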

4 Implementation

Our painting environment is similar to that in [14], which is a simplified simulated painting environment. Our system can execute painting actions with parameters describing stroke size, color, and positional information, and it updates the canvas accordingly (as shown in Equation 5). We also implement the reward function of Equation 8, which evaluates the distance between the current state and the goal state. We use a vectorized environment [11] for parallel training, as shown in Figure 5. A vectorized environment consists of multiple environments, usually one per CPU core to achieve the best performance. The environments share the same policy network and value network and update the weights of the neural network at the same time. As a result, we can change the number of environments for the rollout or retraining process. We then apply proximal policy optimization [29] to train the model on the vectorized environment. The training process can finish within 2 hours.
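A minimal sketch of the vectorized environment (the paper uses 16 environments, one per CPU core; the `reset`/`step` interface here is an assumption modeled on common RL APIs):

```python
class VectorizedEnv:
    """Minimal vectorized environment: N independent environments stepped
    with a batch of actions, returning batched observations and rewards."""
    def __init__(self, make_env, n=16):
        self.envs = [make_env() for _ in range(n)]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards = [], []
        for env, action in zip(self.envs, actions):
            obs, rew = env.step(action)
            observations.append(obs)
            rewards.append(rew)
        return observations, rewards
```

All environments share one policy network, so a single batched forward pass can produce the actions for every environment at once.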

4.1 Data Collection

To train the model, we draw random patches from reference images in a specific style at varying scales to assemble the training dataset, and then resample the patches to a fixed size. By applying self-supervised learning, we can augment the dataset with rollouts of the intermediate policy throughout the training process. We also initialize the canvas with a randomly sampled reference image, so the agent learns to paint over existing content. The goal of the learning process is to minimize the loss between the canvas and the reference image.

After self-supervised learning, we have an updated reference image for each rollout. Given a set of training samples, each self-supervised task of several steps yields one paired supervision sample per step, which helps the algorithm generalize better.
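The patch sampling described above might look like the following sketch; the fixed output size is an assumption, and nearest-neighbor index selection stands in for real resampling:

```python
import numpy as np

def sample_patches(images, n_patches, out_size=64, seed=0):
    """Draw random square patches at varying scales from the reference
    images and resample each to a fixed size."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        max_side = min(img.shape[:2])
        side = int(rng.integers(out_size, max_side + 1))
        x = int(rng.integers(0, img.shape[0] - side + 1))
        y = int(rng.integers(0, img.shape[1] - side + 1))
        crop = img[x:x + side, y:y + side]
        idx = np.linspace(0, side - 1, out_size).astype(int)
        patches.append(crop[np.ix_(idx, idx)])
    return np.stack(patches)
```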

Figure 4: Vectorized Environment: We use a vectorized environment with 16 threads to train the model. The figure demonstrates the runtime state of the vectorized environment. Each small figure demonstrates the observation of a training thread, and its left side is the reference image and its right side is the canvas.

4.2 Performance

In practice, we use a 16-core CPU and a GTX 1080 GPU to train the model with a vectorized environment of dimension 16. We use SSPE [14] as the painting environment of Equation 5 to accelerate the training process. The learned policy can also be transferred to other simulated painting media like MyPaint or WetBrush [1] to obtain different visual effects and styles.

5 Results

In this section, we highlight the results and compare the performance with prior learning-based painting algorithms.

5.1 Comparisons

For the first benchmark, we apply a critic condition to the reward of each step; once the agent fails the condition, the environment stops the rollout. We compare the cumulative reward by feeding the same set of unseen images to the environment. We use two benchmarks to test the generalization of the models: Benchmark1 reproduces an image from a random initial image, following the training scheme described in subsection 4.1, and Benchmark2 reproduces an image from a blank canvas. Each benchmark has 1000 patches. Some results of our approach are shown in Figure 6. As shown in Table 2, our combined training scheme outperforms using only self-supervised learning or only reinforcement learning.

Benchmarks Benchmark1 Benchmark2
Reinforcement Learning Only
Self-supervised Learning Only
Our Combined Scheme
Table 2: Comparison of Different Training Schemes: We evaluate our method by comparing the average cumulative rewards Eq. (8) on the test dataset. We apply a critic condition to the reward of each step; once the agent fails the condition, the environment stops the rollout. We build two benchmarks to test the generalization of the models: Benchmark1 reproduces an image from a random initial image, following the training scheme in subsection 4.1, and Benchmark2 reproduces an image from a blank canvas.

For the second benchmark, we evaluate the performance on high-resolution reference images. We compute the L2 loss Eq. (13) and the cumulative rewards Eq. (8) and compare our approach with [14]. We draw patches from 10 reference images to construct the benchmark, and iteratively apply both algorithms to reproduce the reference images. We use the same training dataset to train both models. As shown in Table 3, our approach has a lower L2 loss, although both methods perform well in terms of cumulative rewards.

Approaches Cumulative Rewards Loss
Table 3: Comparison with Previous Work: We evaluate our method by comparing the average cumulative rewards Eq. (8) and the L2 loss between the final rendering and the reference image Eq. (13) on the test dataset. We draw patches from the reference images to build the benchmark and iteratively apply both algorithms for 1000 times to reproduce the reference images.

Overall, the comparison results show that our approach (LPaintB), which combines self-supervised learning and reinforcement learning, has better performance in terms of convergence, cumulative rewards, and generalization.

Figure 5: Learning Curve Comparison: We evaluate our algorithm by plotting the learning curve of training without pretraining (blue) and of the approach pretrained with self-supervised learning (red). As shown in the figure, the self-supervised pretraining yields better convergence and performance.
Figure 6: Result on Unseen Test Dataset: This figure shows the rollout results on the test dataset. The dataset is collected using a policy that generates random painting actions. The left image of each small figure is the reference, and the right image is the final rollout using 20 painting actions.


Figure 7: Our results compared with [14]: We compare the final rendering results using the same scale of reference image and the same number of painting actions. (a) are the reference images; (b) are generated by our painting agent; (c) are generated by the agent of [14]. The training dataset for both algorithms consists of 374 patches sampled from one painting.
Figure 8: Results generated using LPaintB: We demonstrate the benefits of our approach by reproducing paintings as well as photos. For each pair, the top row is the reference image and the bottom row is our result. The training dataset consists of 374 patches sampled from one painting. The reference images are: (a) Bedroom in Arles by Vincent van Gogh; (c) Poppies Near Argenteuil by Claude Monet; (d) Painting by Pierre Bonnard; (e) Photo; (f) Photo.
Benchmarks Resolutions Strokes Loss
Mona Lisa by Leonardo da Vinci Figure 1
Sunflowers by Vincent van Gogh Figure 1
Girl with a Pearl Earring by Johannes Vermeer Figure 1
The Starry Night by Vincent van Gogh Figure 7
Lake Photo Figure 7
Bedroom in Arles by Vincent van Gogh Figure 8
Giudecca by William Turner Figure 8
Poppies Near Argenteuil by Claude Monet Figure 8
Painting by Pierre Bonnard Figure 8
Tulip Photo Figure 8
Road Photo Figure 8
Table 4: Benchmarks: We test our runtime algorithm (Alg. 2) with the trained reinforcement learning model. We record the resolutions of the reference images, the total strokes used for reproduction, and the L2 loss Eq. (13).


6 Conclusion, Limitations and Future Work

We present a novel approach for stroke-based image reproduction using self-supervised learning and reinforcement learning. Our approach is based on a feedback loop between reinforcement learning and self-supervised learning: we modify and reuse the rollout data of the previously trained policy network and feed it back into the reinforcement learning framework. We compare our method with a model trained with only self-supervised learning and a model trained from scratch by reinforcement learning. The results show that our combination of self-supervised and reinforcement learning can greatly improve the sampling efficiency and the performance of the policy.

One major limitation of our approach is that the generalization of the trained policy depends heavily on the training data. Although reinforcement learning enables the policy to generalize to states that supervised learning cannot address, those states still depend on the training data; specifically, the distribution of the generated supervision data is not close to that of unseen data. Another limitation is that our method is trained in a simplified painting environment because of the extremely large exploration space of reinforcement learning. We need to investigate better techniques to handle such large exploration spaces.

For future work, we aim to increase the number of runtime steps and the size of the action space of the painting environment so that the data generated by self-supervised learning is closer to the distribution of unseen data. Our current setup includes the most common stroke parameters like brush size, color, and position, but painting parameters describing pen tilt, pen rotation, and pressure are not used. Moreover, we also aim to build a model-based reinforcement learning framework that can be incorporated with a more natural painting media simulator.