Stroke-based Artistic Rendering Agent with Deep Reinforcement Learning

03/11/2019, by Zhewei Huang et al. (Megvii Technology Limited and Peking University)

Excellent painters can use only a few strokes to create a fantastic painting, which is a symbol of human intelligence and art. Inverting the rendering process to interpret images has also been a challenging computer vision task in recent years. In this paper, we present SARA, a stroke-based artistic rendering agent that combines a neural renderer and deep reinforcement learning (DRL), allowing the machine to learn to deconstruct images into strokes and create striking visual effects. Our agent is an end-to-end program that converts natural images into paintings. The training process requires neither human painting experience nor stroke-tracking data.


Code repository: SARA_DDPG

1 Introduction

Painting, being an important form of the visual arts, symbolizes the wisdom and creativity of humans. In recent centuries, artists have used a diverse array of tools to create their masterpieces. But it is hard for people to master this skill without spending a large amount of time. Therefore, teaching machines to paint is a challenging and meaningful task. Furthermore, the study of this topic can help us build a painting assistant tool and explore the mystery of painting.

We define artificial intelligence painting as the task in which an agent paints strokes on a canvas in sequence to generate a painting that resembles a given target image. Some work has studied teaching machines to learn painting-related skills, such as sketching [9, 3, 40], doodling [48] and writing characters [47]. In contrast, we aim to teach machines to handle more complex tasks, such as painting portraits of humans and natural scenes in the real world. The rich textures and complex compositions of colors make these images harder for machines to deal with.

There are three challenges for an agent to paint texture-rich images.

Figure 1: The painting process. The first column shows the target images. The agent paints outlines first and then adds textures.
Figure 2: The overall architecture. (a) At every step, the policy (aka actor) gives a set of stroke parameters based on the canvas and the target image. The renderer renders the stroke on the canvas. (b) During learning, the evaluator (aka critic) evaluates the action based on the target image and the rendered canvas. In our implementation, the policy, evaluator and renderer are all implemented using neural networks.

First, painting like humans requires the agent to have the ability to decompose the given target image into strokes spatially and then paint them on the canvas in the correct order. The agent needs to parse the target image visually, understand the current status of the canvas, and plan ahead for future strokes. One common way to resolve this problem is to apply a supervised loss for stroke decomposition at each step, as in [48]. This approach is computationally expensive. Also, texture-rich image painting usually requires hundreds of strokes to generate a painting that resembles the target image, which is tens of times more than doodling, sketching or character writing [48, 9, 47]. To handle such a long-term planning task, reinforcement learning (RL) is a good choice, because RL aims to maximize the cumulative rewards of the whole painting process rather than minimizing a supervised loss at each step, which lets the agent plan stroke decomposition and painting over many steps. Moreover, we adopt an adversarial training strategy [6] to train the painting agent. This strategy has been used successfully in pixel-level image generation tasks [23], and it also helps the agent to paint.

Second, a fine-grained stroke parameter space, including stroke location and color, is essential for painting. Previous work [9, 48, 5] designs the stroke parameter space to be discrete, with each parameter having only a limited number of choices, which no longer holds for texture-rich painting. Defining the stroke parameters over a continuous space raises a grand challenge for most RL algorithms, such as Deep Q-Network (DQN) [30] and policy gradient (PG) [41], due to their poor ability to handle fine-grained parameter spaces. Instead, Deep Deterministic Policy Gradient (DDPG) [24] is designed for handling continuous action spaces, and the subtle control performance of agents trained using DDPG has been demonstrated [18, 45]. We adopt DDPG in our method to empower the agent with the ability to paint.

Third, an efficient painting simulator is critical for the performance of the agent, especially when painting hundreds of strokes on the canvas. Most work [9, 48, 5] paints by interacting with simulated painting environments, which is time-consuming and inflexible. Instead, we use a neural network (NN) to train an end-to-end renderer that directly maps stroke parameters to stroke paintings. The renderer can implement all kinds of stroke designs. Moreover, the renderer is differentiable and can be subtly combined with DDPG into a model-based DRL algorithm, which greatly boosts the performance of the original DDPG.

In summary, our contributions are as follows:

  • We address the painting task with a model-based DRL algorithm, allowing the agent to decompose the target image into hundreds of strokes in sequence to generate a painting that resembles the target image.

  • The neural renderer is used for efficient painting, and it is also compatible with various stroke designs. Besides, the neural renderer contributes to our proposed model-based DDPG.

  • The proposed painting agent can handle multiple types of target images well, including digits, house numbers, portraits, and natural scene images.

(a) MNIST [21]
(b) SVHN [31]
(c) CelebA [28]
(d) ImageNet [34]
Figure 3: The painting results from multiple datasets. The stroke numbers for the paintings are 5, 40, 200 and 400 for MNIST, SVHN, CelebA and ImageNet respectively.

2 Related work

Stroke-based rendering (SBR) is an automatic approach to creating non-photorealistic imagery by placing discrete elements such as paint strokes or stipples [13], which is a task similar to ours. Most stroke-based rendering algorithms focus on every single step greedily or require user interaction. Haeberli [11] proposes a semiautomatic method that requires interaction between users and machines: the user sets some parameters to control the shape of the strokes and selects the position of each stroke. Litwinowicz [26] proposes a single-layer painterly rendering method that places brush strokes on a grid in the image plane with randomly perturbed positions. Other work studies the effects of different strokes [14] and generating animations based on video [25].

Similar to our agent, SPIRAL [5] is an adversarially trained RL agent that can reconstruct the high-level structure of images. StrokeNet [47] combines a differentiable renderer and a recurrent neural network (RNN) to train agents to paint but fails to generalize to color images. These methods are not good enough to handle this complicated task and require massive computing resources. Doodle-SDQ [48] trains agents to emulate human doodling with DQN. Earlier, Sketch-RNN [9] uses sequential datasets to achieve good results in sketch drawing. Artist Agent [44] explores using RL for the automatic generation of a single brush stroke.

In recent years, many DRL methods that combine deep learning (DL) [20] and RL have been applied successfully to various tasks, such as Go [38], action real-time strategy (ARTS) games [32], first-person shooter games [17], and controlling complex physiologically-based models [18]. Many DRL algorithms are used in these tasks, such as DQN, Asynchronous Advantage Actor-Critic (A3C) [29], Proximal Policy Optimization (PPO) [36] and DDPG. These algorithms are model-free, which means that the agent maximizes the expected reward based only on samples from the environment. [4] points out that humans can learn quickly because of a great deal of prior knowledge about the world. For some tasks, an agent can understand a simple environment better by making predictions [33]. Another effective approach is to build a generative neural network model of the environment [10]. Gu et al. [7] explore using model-based methods to accelerate DQN.

3 Painting Agent

3.1 Overview

The goal of the painting agent is to first decompose the given target image into stroke representations and then paint the strokes on the canvas to form a painting. To imitate the human painting process, the agent is designed to predict the next stroke based on observing the current state of the canvas and the target image. More importantly, to make the agent able to predict, one at a time, strokes that are compatible with both previous and future strokes, a well-designed feedback mechanism is required. We postulate that the feedback should be the reward gained after finishing one stroke, and that the agent should pursue maximizing the cumulative rewards after finishing all strokes. We give a diagram of the overall architecture in Figure 2.

Based on the above motivations, we model the painting process as a sequential decision-making task, which is described in Section 3.2. To build the feedback mechanism, we use a model-based RL algorithm to train the agent, which is described in Section 3.3.

3.2 The Model

Given a target image I and an empty canvas C_0, the agent aims to find a stroke sequence (a_0, a_1, ..., a_{n-1}), where rendering a_t on C_t gives C_{t+1}. After rendering these strokes in sequence, we get the final painting C_n, which we hope resembles I as closely as possible. We model this task as a Markov decision process with a state space S, an action space A, a transition function trans(s_t, a_t) and a reward function r(s_t, a_t). We next introduce how these components are defined.

State and Transition Function The state space is constructed from all the information that the agent can observe in the environment. We define a state with three parts: the canvas, the target image, and the step number. Formally, s_t = (C_t, I, t). C_t and I are bitmaps, and the step number t acts as additional information to instruct the painting process. The transition function trans(s_t, a_t) = s_{t+1} gives the transition between states, which is implemented by painting a stroke on the current canvas.

Action The action space is the set of actions that the agent can perform. An action a_t is a set of parameters that control the position, shape, color and transparency of the stroke painted at step t. We define the behavior of the agent as a policy function π that maps states to deterministic actions, i.e. a_t = π(s_t). At step t, the agent observes state s_t and then gives the parameters a_t of the next stroke. The state evolves according to the transition function s_{t+1} = trans(s_t, a_t) for n steps.

Reward The reward function evaluates the actions decided by the policy. Selecting a suitable metric to measure the difference between the canvas and the target image is crucial for training the painting agent. The reward is designed as follows:

r(s_t, a_t) = L_t − L_{t+1},    (1)

where r(s_t, a_t) is the reward at step t, L_t is the measured loss between C_t and I, and L_{t+1} is the measured loss between C_{t+1} and I.

To make sure the final canvas resembles the target image, the agent should be driven to maximize the cumulative rewards over the whole episode. At each step, the objective of the agent is to maximize the sum of discounted future rewards R_t = Σ_{k=t}^{n−1} γ^{k−t} r(s_k, a_k) with a discount factor γ ∈ [0, 1].
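For concreteness, the following minimal sketch shows one transition of this MDP and the reward of Equation (1), assuming an ℓ2 pixel loss and a placeholder render function (both are illustrative, not the exact implementation).

```python
import numpy as np

def l2_loss(canvas, target):
    # Pixel-wise mean squared error between the canvas and the target image.
    return np.mean((canvas - target) ** 2)

def mdp_step(state, action, render, loss=l2_loss):
    """One transition of the painting MDP (sketch).

    `state` is (C_t, I, t); `render(canvas, action)` stands for the stroke
    renderer, which paints the stroke described by `action` onto the canvas
    and returns the new canvas.
    """
    canvas, target, t = state
    next_canvas = render(canvas, action)
    # Equation (1): the reward is the decrease in loss, r_t = L_t - L_{t+1}.
    reward = loss(canvas, target) - loss(next_canvas, target)
    return (next_canvas, target, t + 1), reward
```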

3.3 Learning

In this section, we introduce how to train the agent using a well-designed model-based DDPG algorithm.

(a) Original DDPG
(b) Model-based DDPG
Figure 4: In the original DDPG, the critic needs to learn to model the environment implicitly. In model-based DDPG, the environment is explicitly modeled by a neural renderer, which helps train an efficient agent.

3.3.1 Model-based DDPG

We first describe the original DDPG, then introduce how we build model-based DDPG for efficient training of the agent.

As defined, the action space of the painting task is continuous and high-dimensional. Discretizing the action space to fit DRL methods such as DQN and PG would lose the precision of the stroke representation and would require much effort in manual structure design to cope with the explosion of parameter combinations in the discretized space. DPG [39] was proposed to use a deterministic policy to resolve the difficulties caused by high-dimensional continuous action spaces. Furthermore, DDPG was proposed to combine NNs with DPG, enhancing its performance on many control tasks.

In the original DDPG, there are two networks: the actor π(s) and the critic Q(s, a). The actor models a policy π that maps a state s_t to an action a_t. The critic estimates the expected reward for the agent taking action a_t at state s_t, which is learned using the Bellman equation (2), as in Q-learning [42], with data sampled from an experience replay buffer:

Q(s_t, a_t) = r(s_t, a_t) + γ Q(s_{t+1}, π(s_{t+1})),    (2)

where r(s_t, a_t) is the reward given by the environment when performing action a_t at state s_t. The actor is trained to maximize the critic's estimate Q(s_t, π(s_t)). In other words, the actor decides an action for each state. Based on the current canvas and the target image, the critic predicts an expected reward for the stroke given by the actor. The critic is optimized to estimate more accurate expected rewards.

We cannot train a well-performing painting agent using the original DDPG, because it is hard for the agent to implicitly model an environment composed of arbitrary real-world images during learning. Thus, we design a neural renderer so that the agent can observe a modeled environment, and it can then explore the environment and improve its policy efficiently. We refer to DDPG whose actor has access to gradients from the environment as model-based DDPG. The difference between the two algorithms is shown in Figure 4.

The optimization of the agent using model-based DDPG differs from that using the original DDPG. At step t, the critic takes only a state as input rather than the state-action pair (s_t, a_t). The critic still predicts the expected reward for the state, but this no longer includes the reward caused by the current action. The new expected reward is called the value function V(s), learned using the following equation:

V(s_t) = r(s_t, a_t) + γ V(s_{t+1}),    (3)

where r(s_t, a_t) is the reward when performing action a_t at state s_t. The actor is trained to maximize r(s_t, a_t) + γ V(s_{t+1}). Here, the transition function is the differentiable renderer.
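The following PyTorch-style sketch illustrates how the differentiable renderer lets the actor be optimized directly against r(s_t, a_t) + γ V(s_{t+1}), while the critic is fit to the target of Equation (3). The modules `actor`, `critic`, `critic_target`, `renderer` and `reward_fn` are assumed placeholders, and the discount value is only an example; this is not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def model_based_ddpg_update(actor, critic, critic_target, renderer, reward_fn,
                            state, actor_opt, critic_opt, gamma=0.95):
    """One model-based DDPG update (sketch).

    `renderer(state, action)` is assumed to be the differentiable transition
    that paints the predicted strokes and returns the next state, and
    `reward_fn(state, next_state)` the differentiable reward of Equation (1).
    """
    # Actor: maximize r(s, a) + gamma * V(s'), with gradients flowing through
    # the differentiable renderer back into the stroke parameters.
    action = actor(state)
    next_state = renderer(state, action)
    actor_loss = -(reward_fn(state, next_state) + gamma * critic(next_state)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic: regress V(s) toward the one-step target of Equation (3),
    # using a slowly updated target network for stability.
    with torch.no_grad():
        action = actor(state)
        next_state = renderer(state, action)
        target = reward_fn(state, next_state) + gamma * critic_target(next_state)
    critic_loss = F.mse_loss(critic(state), target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```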

3.3.2 Action Bundle

Frame Skip [2] is a powerful parameter for many RL tasks: the agent observes the environment and acts only once every k frames rather than at every frame. This trick gives agents a better ability to learn associations between temporally distant states and actions. The agent predicts one action, reuses it for the next k frames, and achieves better performance with less computation cost.

Inspired by this trick, we make the actor output the parameters of K strokes at one step. This practice encourages exploration of the action space and of action combinations. The renderer renders the K strokes together, which greatly speeds up the painting process. We term this trick Action Bundle. We experimentally find that setting K = 5 is a good choice that significantly improves performance and training speed. It is worth noting that we modify the reward discount factor from γ to γ^K to keep consistency.
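A minimal sketch of the bundling itself, applying the K strokes one after another (the actual renderer can process them in a batch); it assumes the 13-parameter stroke representation of Equation (5), a placeholder `render_one` function, and an example discount value.

```python
import torch

K = 5            # strokes per Action Bundle
STROKE_DIM = 13  # parameters per QBC stroke, Equation (5)

def apply_bundle(canvas, bundle, render_one):
    # The actor outputs one flat vector of K * STROKE_DIM parameters;
    # all K strokes are applied before the next observation.
    for stroke in bundle.view(K, STROKE_DIM):
        canvas = render_one(canvas, stroke)
    return canvas

# One agent decision now covers K frames, so the per-decision discount
# becomes gamma ** K to keep the return consistent.
gamma, gamma_bundle = 0.95, 0.95 ** K
```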

3.3.3 WGAN Reward

GAN has been widely used as a particular loss function in transfer learning, text modeling and image restoration [22, 46], because of its great ability to measure the distribution distance between generated data and target data. Wasserstein GAN (WGAN) [1] is an improved version of the original GAN. WGAN minimizes the Wasserstein-1 distance, also known as the Earth-Mover distance, which helps stabilize GAN training. In our task, the objective of the discriminator in WGAN is defined as

max_D  E_{y∼μ}[D(y)] − E_{x∼ν}[D(x)],    (4)

where D denotes the discriminator, and ν and μ are the distributions of paintings and target images, respectively. The prerequisite of the above objective is that D should satisfy the 1-Lipschitz constraint. To enforce this constraint, we use WGAN with gradient penalty (WGAN-GP) [8].

We want to reduce the distribution distance between paintings and target images as much as possible. To achieve this, we use the change in discriminator score from C_t to C_{t+1}, plugged into Equation (1), as the reward for guiding the learning of the actor. In experiments, we find that training the agent with the WGAN loss works better than with the ℓ1 or ℓ2 losses.
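A sketch of the WGAN-GP objective and the derived reward, assuming (as one possible implementation) that the discriminator `D` is conditioned on the target image; the function names and the penalty weight are illustrative.

```python
import torch

def gradient_penalty(D, real, fake, target):
    # WGAN-GP: push the gradient norm of D toward 1 at random interpolates
    # between real and fake samples.
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    mixed = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(mixed, target).sum(), mixed, create_graph=True)[0]
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

def discriminator_loss(D, paintings, targets, gp_weight=10.0):
    # Loss form of Equation (4): maximize D(target) - D(painting).
    wasserstein = D(paintings, targets).mean() - D(targets, targets).mean()
    return wasserstein + gp_weight * gradient_penalty(D, targets, paintings, targets)

def wgan_reward(D, canvas_t, canvas_t1, target):
    # The reward is the increase in discriminator score from C_t to C_{t+1},
    # i.e., Equation (1) with the (negative) score as the measured loss.
    return D(canvas_t1, target) - D(canvas_t, target)
```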

3.4 Network Architectures

(a) The actor and critic
(b) The discriminator
(c) The neural renderer
Figure 5: Network architectures. FC refers to a fully-connected layer, Conv refers to a convolution layer, and GAP refers to a global average pooling layer. The actor and the critic use the same structure except for the different output dimensions of the last fully-connected layer.

Due to the high variability and high complexity of real-world images, we use residual structures similar to ResNet-18 [12] as the feature extractor in the actor and the critic. The actor works well with Batch Normalization (BN) [15], but BN cannot speed up the critic training significantly. Salimans and Kingma [35] apply Weight Normalization (WN) to improve DQN. Similarly, we use WN with Translated ReLU (TReLU) [43] on the critic to stabilize training. In addition, we use CoordConv [27] as the first layer in the actor and the critic. For the discriminator, we use a network architecture similar to PatchGAN [16], with WN and TReLU as well. The network architectures of the actor, critic and discriminator are shown in Figure 5 (a) and (b).

Following the original DDPG paper, we use soft target networks: we create a copy of the actor and the critic and update their parameters by having them slowly track the learned networks. We also apply this trick to the discriminator to improve its training stability.
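The soft update is the standard Polyak averaging from the DDPG paper; a one-function sketch (tau here is a typical value, not necessarily the one used in this work):

```python
def soft_update(target_net, source_net, tau=0.001):
    # Target parameters slowly track the learned network:
    # theta_target <- (1 - tau) * theta_target + tau * theta_source
    for t_param, s_param in zip(target_net.parameters(), source_net.parameters()):
        t_param.data.copy_((1.0 - tau) * t_param.data + tau * s_param.data)
```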

(a) SPIRAL paintings with 20 strokes [5]
(b) Ours with 20 opaque strokes
(c) Ours with 200 opaque strokes
(d) Ours with 200 strokes using ℓ2 reward
(e) Ours with 200 strokes
(f) Ours with 1000 strokes
(g) The target
Figure 6: CelebA paintings under different settings. (a) The results of SPIRAL. (b)-(c) Paintings with opaque strokes. (d) Paintings with ℓ2 reward. (e) Baseline. (f) Paintings with 1000 strokes. (g) The target images.

4 Stroke-based Renderer

In this section, we introduce how to build a neural stroke renderer and use it to generate multiple types of strokes.

4.1 Neural Renderer

Using a neural network to generate strokes has two advantages. First, the neural renderer is flexible enough to generate any style of stroke and is more efficient than hand-crafted stroke simulators. Second, the neural renderer is differentiable and can explicitly model the environment for DDPG, thereby boosting the performance of the agent.

Specifically, we feed the neural renderer a set of stroke parameters a_t, and it outputs the rendered stroke image. The training samples are generated randomly using graphics renderer programs. The neural renderer can be quickly trained with supervised learning and runs on the GPU, so we obtain a differentiable and fast-running environment. Formally, the model-based transition dynamics trans(s_t, a_t) and the reward function r(s_t, a_t) become differentiable. Some simple geometric renderings can be done without neural networks and can also provide gradients, but neural networks let us avoid cumbersome formula derivations.

The neural renderer network consists of several fully-connected layers and convolution layers. Sub-pixel upsampling [37] is used to increase the resolution of strokes in the network; it is a fast operation and eliminates the checkerboard effect. We show the network architecture of the neural renderer in Figure 5 (c).
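A sketch of the supervised training loop for the renderer; `graphics_render` stands for the (non-differentiable) graphics program that rasterizes randomly sampled stroke parameters into ground-truth images, and the batch size and stroke dimension are illustrative.

```python
import torch
import torch.nn.functional as F

def train_renderer_step(renderer, graphics_render, optimizer,
                        batch_size=64, stroke_dim=13):
    # Sample random stroke parameters in [0, 1] and rasterize them with the
    # graphics program to obtain ground-truth stroke images.
    params = torch.rand(batch_size, stroke_dim)
    with torch.no_grad():
        ground_truth = graphics_render(params)   # e.g. (B, 1, H, W)
    # Train the neural renderer to reproduce the rasterized strokes.
    prediction = renderer(params)
    loss = F.mse_loss(prediction, ground_truth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```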

4.2 Stroke Design

Strokes can be designed as a variety of curves or geometries. In general, the parameters of a stroke should include its position, shape, color and transparency.

For curve strokes, which simulate the effects of brushes, the coordinates of the control points and the thickness of the stroke determine its shape. Bezier curves are common in vector drawing programs. We design a brief stroke representation for a quadratic Bezier curve (QBC) as follows:

a_t = (x_0, y_0, x_1, y_1, x_2, y_2, r_0, r_1, t_0, t_1, R, G, B),    (5)

where (x_0, y_0), (x_1, y_1) and (x_2, y_2) are the coordinates of the three control points of the QBC; (r_0, r_1) and (t_0, t_1) control the thickness and transparency of the two endpoints of the curve, respectively; and (R, G, B) controls the color. The formula of the QBC is:

B(s) = (1 − s)^2 P_0 + 2(1 − s)s P_1 + s^2 P_2,  s ∈ [0, 1],    (6)

where P_i = (x_i, y_i) denotes the i-th control point.

To eliminate aliasing, the curve is first drawn on a high-resolution canvas and then resized to the resolution of the target image. We can use neural renderers with the same structure to implement the rendering of different strokes.
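As an illustration of this anti-aliasing strategy, the sketch below rasterizes a constant-thickness QBC (Equation (6)) on a supersampled canvas and then average-pools it back to the target resolution; thickness and transparency interpolation along the curve is omitted, and all sizes are example values.

```python
import numpy as np

def draw_qbc(p0, p1, p2, radius=2.0, size=128, supersample=4, samples=200):
    """Rasterize a quadratic Bezier curve with constant thickness (sketch).

    Control points are given as (x, y) pairs in [0, 1]. The curve is stamped
    on a canvas `supersample` times larger, then average-pooled down.
    """
    hi = size * supersample
    canvas = np.zeros((hi, hi), dtype=np.float32)
    P = [np.asarray(p, dtype=np.float32) * (hi - 1) for p in (p0, p1, p2)]
    r = radius * supersample
    for s in np.linspace(0.0, 1.0, samples):
        # B(s) = (1-s)^2 P0 + 2(1-s)s P1 + s^2 P2   (Equation (6))
        x, y = (1 - s) ** 2 * P[0] + 2 * (1 - s) * s * P[1] + s ** 2 * P[2]
        # Stamp a filled disk of radius r inside a local bounding box.
        x0, x1 = int(max(x - r, 0)), int(min(x + r + 1, hi))
        y0, y1 = int(max(y - r, 0)), int(min(y + r + 1, hi))
        yy, xx = np.ogrid[y0:y1, x0:x1]
        canvas[y0:y1, x0:x1][(xx - x) ** 2 + (yy - y) ** 2 <= r * r] = 1.0
    # Average-pool supersample x supersample blocks down to (size, size).
    return canvas.reshape(size, supersample, size, supersample).mean(axis=(1, 3))
```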

(a) DDPG and model-based DDPG
(b) Different settings of Action Bundle
(c) Different number of strokes
Figure 7: The testing ℓ2-distance between the paintings and the target images of CelebA for the ablation studies.

5 Experiments

Four datasets are used in our experiments: MNIST [21], SVHN [31], CelebA [28] and ImageNet [34]. We show that the agent performs well in painting various types of real-world images.

5.1 Datasets

MNIST contains 70,000 examples of hand-written digits, of which 60,000 are training data and 10,000 are testing data. Each example is a grayscale image with a resolution of 28 × 28 pixels.

SVHN is a real-world house-number image dataset containing 600,000 digit images. Each sample in the Cropped Digits set is a color image with a resolution of 32 × 32 pixels. We randomly sample 200,000 images for our experiments.

CelebA contains approximately 200,000 celebrity face images. The officially provided center-cropped images are used in our experiments.

ImageNet (ILSVRC2012) contains 1.2 million natural scene images that fall into 1,000 categories. ImageNet exhibits extreme diversity, which poses a great challenge to the painting agent. We randomly sample 200,000 images covering the 1,000 categories as training data.

In our task, we aim to learn an agent that can paint any image, not only those in the training set. Thus, we additionally split out a testing set to test the generalization ability of the learned agent. For MNIST, we use the officially defined testing set. For the other datasets, we randomly split out 2,000 images as the testing set.

5.2 Training

We resized all images to a fixed resolution before giving them to the agent, and trained the agent for different numbers of batches on ImageNet and CelebA, SVHN, and MNIST. Adam [19] was used for optimization, and the minibatch size was set to 96. Agent training was done on a single NVIDIA TITAN Xp with a memory consumption of 30 GB. It took about 40 hours of training for ImageNet and CelebA, 20 hours for SVHN and two hours for MNIST. Training the neural renderer took 5 to 15 hours for each stroke design; a learned renderer can be reused by different agents.

The actor, critic and discriminator were updated in turn at each training step. The replay memory buffer was set to store the data of the latest 800 episodes for training the agent. Please refer to the supplemental materials for more training details.

Since the reward was given by a dynamically learned discriminator, a bias was introduced into the calculated rewards, which itself acted as a source of exploration. Thus the agent could still explore the environment well without adding noise to the actions.

5.3 Results

The images of MNIST and SVHN have simple structures and regular contents. We train one agent that paints five strokes for MNIST images and another that paints 40 strokes for SVHN images. Example paintings are shown in Figure 3 (a) and (b). The agents can perfectly reproduce the target images.

By contrast, the images of CelebA have more complex structures and more diverse contents. We train a 200-stroke agent to deal with the images of CelebA. As shown in Figure 3 (c), the paintings are quite similar to the target images, although a certain level of detail is lost. SPIRAL [5] also reports its performance on CelebA. To make a fair comparison, we additionally train a 20-stroke agent, as in SPIRAL, and use opaque strokes. The results of the two methods are shown in Figure 6 (a) and (b), respectively. Our paintings are clearer than SPIRAL's, and our ℓ2 distance is less than one third of SPIRAL's.

We train a 400-stroke agent to deal with the images of ImageNet, owing to their extremely complex structures and varied contents. As shown in Figure 3 (d), the paintings are similar to the target images with regard to the outlines and colors of objects and backgrounds. Although some textures are lost, the agent still shows great power in decomposing complicated scenes into strokes and repainting them in a reasonable way.

In addition, we show the testing loss curves of agents trained on different datasets in Figure 8.

Figure 8: The testing ℓ2-distance between paintings and the target images for the different datasets.

5.4 Ablation Studies

In this section, we study how the components or tricks, including model-based DDPG, Action Bundle and WGAN reward, affect the performance of the agent. For simplicity, we experiment on CelebA only.

5.4.1 Model-based vs. Model-free DDPG

We explore how much benefit model-based DDPG brings over the original DDPG. As noted, the original DDPG can model the environment only implicitly, from observations and rewards. Besides, the high-dimensional action space also prevents model-free methods from dealing with the painting task successfully. To further explore the capability of model-free methods, we improve the original DDPG with a method inspired by PatchGAN: we split the images into patches before feeding the critic, then use the patch-level rewards to optimize the critic. We term this method PatchQ. PatchQ boosts sample efficiency and improves the performance of the agent by providing many more supervision signals during training.
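A sketch of the patch-level reward that such a critic could be trained against, assuming an ℓ2 loss and an illustrative patch size; a PatchGAN-style critic then outputs a matching grid of values that is regressed against these per-patch targets rather than a single scalar.

```python
import torch
import torch.nn.functional as F

def patch_rewards(canvas_t, canvas_t1, target, patch=16):
    # Per-patch loss decrement: each spatial cell of the output grid is the
    # reward of Equation (1) computed on one patch of the image.
    def patch_loss(canvas):
        err = ((canvas - target) ** 2).mean(dim=1, keepdim=True)  # (B, 1, H, W)
        return F.avg_pool2d(err, patch)                           # (B, 1, H/p, W/p)
    return patch_loss(canvas_t) - patch_loss(canvas_t1)
```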

We show the performance of agents trained with the different algorithms in Figure 7 (a). Model-based DDPG outperforms the original DDPG and DDPG with PatchQ, with an ℓ2 distance five times smaller than DDPG with PatchQ and 20 times smaller than the original DDPG. Although it underperforms model-based DDPG, DDPG with PatchQ still outperforms the original DDPG by a large margin.

5.4.2 Rewards

The ℓ2 distance is a natural choice of reward for learning the actor. We show the painting results using ℓ2 rewards and WGAN rewards in Figure 6 (d) and (e), respectively. The paintings with WGAN rewards show richer textures and look more vivid, like the target images. Interestingly, we find that using WGAN rewards to train the agent achieves a lower ℓ2 loss on the testing data than using ℓ2 rewards. This shows that the WGAN distance is a better metric than ℓ2 for measuring the difference between paintings and real-world images.

5.4.3 Stroke Number and Action Bundle

The number of strokes used for painting is critical for the final painting effect, especially for texture-rich images. We train agents that paint 100, 200, 400 and 1000 strokes; the testing loss curves are shown in Figure 7 (c). We observe that a larger stroke number contributes to better painting effects. We show the paintings with 200 strokes and 1000 strokes in Figure 6 (e) and (f), respectively. To the best of our knowledge, few methods can handle such a large number of strokes. More strokes help reconstruct the details in the paintings.

Action Bundle is a trick for speeding up the painting process. Beyond that, we explore how Action Bundle affects the performance of the agent. We show testing loss curves for several settings of Action Bundle in Figure 7 (b). We find that making the agent predict five strokes in one bundle achieves the best performance. We conjecture that increasing the number of strokes in one bundle lets the agent paint more strokes with a given number of decisions and helps the agent plan over a longer horizon, but it also makes it harder for the agent to arrange more strokes reasonably within one decision. Thus, as a trade-off, five strokes per bundle is a good setting for the agent.

5.4.4 Alternative Stroke Representations

Besides the QBC, we show alternative stroke representations that can be mastered well by the agent, including straight strokes, circles and triangles. We train one neural renderer for each stroke representation. The paintings with these renderers are shown in Figure 9. The QBC strokes produce an excellent visual effect, while the other strokes create different artistic effects. Although painted in different styles, the paintings still resemble the target images, which shows that the learned agent is general and robust to the stroke design.

In addition, by restricting the transparency of strokes, we can obtain paintings with different stroke effects, such as ink-painting and oil-painting effects, as shown in Figure 6 (c).

Figure 9: CelebA paintings using different strokes. From left to right: the target, QBC, straight stroke, triangle, and circle.

6 Conclusion

In this paper, we train a painting agent that decomposes the target image into strokes and paints them on the canvas in sequence to form a painting. The agent is trained within a DRL framework, which encourages it to make a long-term plan for the sequence of stroke painting so as to maximize the cumulative rewards. Instead of a conventional stroke simulator, a neural renderer is used to generate strokes simply and efficiently. Moreover, the neural renderer also contributes to the model-based DRL algorithm, which shows better performance than the original model-free DRL algorithm in the painting task. The learned agent can predict hundreds or even thousands of strokes to generate a painting. The experimental results show that our model can handle multiple types of target images and achieves good performance in painting texture-rich natural scene images.

References

  • [1] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 214–223. Cited by: §3.3.3.
  • [2] A. Braylan, M. Hollenbeck, E. Meyerson, and R. Miikkulainen (2015) Frame skip is a powerful parameter for learning to play atari. In Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §3.3.2.
  • [3] Y. Chen, S. Tu, Y. Yi, and L. Xu (2017) Sketch-pix2seq: a model to generate sketches of multiple categories. arXiv preprint arXiv:1709.04121. Cited by: §1.
  • [4] R. Dubey, P. Agrawal, D. Pathak, T. L. Griffiths, and A. A. Efros (2018) Investigating human priors for playing video games. arXiv preprint arXiv:1802.10217. Cited by: §2.
  • [5] Y. Ganin, T. Kulkarni, I. Babuschkin, S. A. Eslami, and O. Vinyals (2018) Synthesizing programs for images using reinforced adversarial learning. In International Conference on Machine Learning, pp. 1652–1661. Cited by: §1, §1, §2, 6(a), §5.3.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
  • [7] S. Gu, T. Lillicrap, I. Sutskever, and S. Levine (2016) Continuous deep q-learning with model-based acceleration. In International Conference on Machine Learning, pp. 2829–2838. Cited by: §2.
  • [8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §3.3.3.
  • [9] D. Ha and D. Eck (2017) A neural representation of sketch drawings. arXiv preprint arXiv:1704.03477. Cited by: §1, §1, §1, §1, §2.
  • [10] D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 2451–2463. Note: https://worldmodels.github.io External Links: Link Cited by: §2.
  • [11] P. Haeberli (1990) Paint by numbers: abstract image representations. In ACM SIGGRAPH computer graphics, Vol. 24, pp. 207–214. Cited by: §2.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.4.
  • [13] A. Hertzmann (2003-07) A survey of stroke-based rendering. IEEE Computer Graphics and Applications 23 (4), pp. 70–81. External Links: Document, ISSN 0272-1716 Cited by: §2.
  • [14] A. Hertzmann (1998) Painterly rendering with curved brush strokes of multiple sizes. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 453–460. Cited by: §2.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37, pp. 448–456. Cited by: §3.4.
  • [16] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §3.4.
  • [17] M. Kempka, M. Wydmuch, G. Runc, J. Toczek, and W. Jaśkowski (2016) Vizdoom: a doom-based ai research platform for visual reinforcement learning. In Computational Intelligence and Games (CIG), 2016 IEEE Conference on, pp. 1–8. Cited by: §2.
  • [18] Ł. Kidziński, S. P. Mohanty, C. F. Ong, Z. Huang, S. Zhou, A. Pechenko, A. Stelmaszczyk, P. Jarosik, M. Pavlov, S. Kolesnikov, et al. (2018) Learning to run challenge solutions: adapting reinforcement learning methods for neuromusculoskeletal environments. In The NIPS’17 Competition: Building Intelligent Systems, pp. 121–153. Cited by: §1, §2.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.2.
  • [20] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §2.
  • [21] Y. LeCun (1998) The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/. Cited by: 3(a), §5.
  • [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §3.3.3.
  • [23] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang (2016) Photo-realistic single image super-resolution using a generative adversarial network. Cited by: §1.
  • [24] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §1.
  • [25] L. Lin, K. Zeng, H. Lv, Y. Wang, Y. Xu, and S. Zhu (2010) Painterly animation using video semantics and feature correspondence. In Proceedings of the 8th International Symposium on Non-Photorealistic Animation and Rendering, pp. 73–80. Cited by: §2.
  • [26] P. Litwinowicz (1997) Processing images and video for an impressionist effect. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 407–414. Cited by: §2.
  • [27] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the coordconv solution. In Advances in Neural Information Processing Systems, pp. 9628–9639. Cited by: §3.4.
  • [28] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: 3(c), §5.
  • [29] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pp. 1928–1937. Cited by: §2.
  • [30] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: §1.
  • [31] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, Vol. 2011, pp. 5. Cited by: 3(b), §5.
  • [32] OpenAI OpenAI five. Note: https://blog.openai.com/openai-five/ Cited by: §2.
  • [33] S. Racanière, T. Weber, D. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al. (2017) Imagination-augmented agents for deep reinforcement learning. In Advances in neural information processing systems, pp. 5690–5701. Cited by: §2.
  • [34] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: 3(d), §5.
  • [35] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909. Cited by: §3.4.
  • [36] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §2.
  • [37] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: §4.1.
  • [38] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of go with deep neural networks and tree search. nature 529 (7587), pp. 484–489. Cited by: §2.
  • [39] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller (2014) Deterministic policy gradient algorithms. In ICML, Cited by: §3.3.1.
  • [40] J. Song, K. Pang, Y. Song, T. Xiang, and T. M. Hospedales (2018) Learning to sketch with shortcut cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 801–810. Cited by: §1.
  • [41] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems, pp. 1057–1063. Cited by: §1.
  • [42] C. J. Watkins and P. Dayan (1992) Q-learning. Machine learning 8 (3-4), pp. 279–292. Cited by: §3.3.1.
  • [43] S. Xiang and H. Li (2017) On the effects of batch and weight normalization in generative adversarial networks. arXiv preprint arXiv:1704.03971. Cited by: §3.4.
  • [44] N. Xie, H. Hachiya, and M. Sugiyama (2013) Artist agent: a reinforcement learning approach to automatic stroke generation in oriental ink painting. IEICE TRANSACTIONS on Information and Systems 96 (5), pp. 1134–1144. Cited by: §2.
  • [45] Z. Yang, K. E. Merrick, H. A. Abbass, and L. Jin (2017) Multi-task deep reinforcement learning for continuous action control.. In IJCAI, pp. 3301–3307. Cited by: §1.
  • [46] R. A. Yeh, C. Chen, T. Yian Lim, A. G. Schwing, M. Hasegawa-Johnson, and M. N. Do (2017) Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5485–5493. Cited by: §3.3.3.
  • [47] N. Zheng, Y. Jiang, and D. Huang (2019) StrokeNet: a neural painting environment. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §2.
  • [48] T. Zhou, C. Fang, Z. Wang, J. Yang, B. Kim, Z. Chen, J. Brandt, and D. Terzopoulos Learning to doodle with deep q-networks and demonstrated strokes. Cited by: §1, §1, §1, §1, §2.