This paper proposes a methodology that helps designers obtain morphological insights through deep reinforcement learning. Within the constructive design process, design methodologies have evolved to identify points of human inconvenience or problems that arise in the use of a product. In the problem-solving process, designers search for a solution to the defined problem that can be applied to the actual design. Many designers have suggested various methods to solve these design problems [2, 3]. However, since these methods usually incur a high cost in time and space, many researchers have difficulty finding a good solution.
Meanwhile, the role of computer science in the design process has changed. Computer scientists collaborate not only by developing tools that designers use directly to design products, but also by supporting designers with computer simulation or machine learning algorithms before the mock-up stage [4, 5, 6, 7, 8]. However, these approaches use computer science only as a rule-based assistant, not as a creative design tool. Dreamsketch [9], on the other hand, proposed a methodology for obtaining multiple 3D design solutions based on the context created in the sketch phase. Still, because it is applied only after all the analysis has been done, rather than being used to understand the context of the design, it is of little help in the design process and does not directly address the underlying problem solving. Pahn et al. [10] is another good example of an attempt to assist the design process using machine learning, but it is very costly because it requires dividing the product and the corresponding design problem into small objects and assigning a role to each object.
Unlike the methods mentioned above, we chose algorithms that can understand the design process more directly, in order to offer designers a more direct and lower-cost methodology at the problem-solution bridge stage. Reinforcement learning, which is attracting growing attention in computer science, learns the best action (the one that yields the best reward) in a provided environment. Even without detailed information about the intermediate process leading to the best reward, the agent can learn to take the best possible action. Building on this, we devised a framework that can design based on a given task.
In this paper, we apply a reinforcement learning algorithm to product design and present a link that enables the computer to find a task-oriented solution directly from the problem. The whole proposed methodology is shown in Figure 1. When problems from design research are given, we define tasks, 3D simulation environments, and reinforcement learning environments. Tasks and 3D simulation environments are processed in Blender [11] and linked to the reinforcement learning algorithm. This paper deals with the processes enclosed in the dashed-line box in the figure.
Zhu et al. [8] is also a good example of a deep learning algorithm for product interpretation. However, we go one step further and present a methodology that allows the computer itself to design based on its understanding of the product or task. More specifically, we chose to design a pot with a couple of design objectives. Using this problem, we discuss how the reinforcement learning algorithm finds solutions that achieve high scores in a given task. To enable this, we define the design process as an action space that the computer can understand. We also cover how the generated output can ultimately be used by giving morphological intuition to product designers.
II Related Work
II-A Constructive Design
Several studies of the product design process [1, 12] discuss constructive design research, which begins by formulating a research question from an existing theory or philosophy and then investigates the question by making and designing artifacts.
In constructive design, user research should come first. By studying users and products, designers gain insights that should be applied to their final design. After studio work, constructive design researchers develop designs, beginning with sketchy ideas and mock-ups. In this stage, designers usually make hundreds of mock-ups, which costs a lot of time and effort.
II-B Reinforcement Learning
Reinforcement learning is an approach that learns which actions to take to maximize rewards in a given environment. In reinforcement learning, we define and solve problems within the framework of Markov decision processes (MDPs). An MDP consists of the environment and the agent. They interact with each other at every time index $t$, and the agent tries to achieve a given goal.
Specifically, the agent receives information about the state $s_t$ from the environment and takes an action $a_t$ to obtain the maximum reward in the current state, where the action is determined by the policy. The policy is defined as a probability distribution $\pi(a \mid s)$ over the available actions $a$ in a given state $s$. The environment outputs a reward $r_t$ as a result of the agent’s action, and this process repeats. In the end, trajectories such as $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$ are obtained from the interaction of the agent and the environment. State changes are determined by a stationary transition dynamics distribution $p(s_{t+1} \mid s_t, a_t)$ that satisfies the Markov property.
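As a minimal sketch of this interaction loop, the following Python code rolls out a stochastic policy in a toy environment and records the trajectory as (state, action, reward) tuples. The environment `ToyEnv`, the fixed policy, and the step limit are hypothetical and serve only to illustrate the agent-environment cycle; they are not the pot environment used in this paper.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment: the state is an integer position, the goal is position 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # The next state depends only on the current state and action (Markov property).
        self.pos += action
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done


def policy(state, rng):
    """Stochastic policy pi(a|s): move right with probability 0.8, left otherwise."""
    return 1 if rng.random() < 0.8 else -1


def rollout(env, rng, max_steps=50):
    """Collect a trajectory [(s_0, a_0, r_0), (s_1, a_1, r_1), ...]."""
    trajectory = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


trajectory = rollout(ToyEnv(), random.Random(0))
```

Here the policy is fixed; a learning algorithm would adjust the action probabilities based on the rewards collected in such trajectories.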
Deep Q-Network (DQN) [15] achieved surprising performance in the Atari 2600 task learning environment by approximating the action-value function $Q(s, a)$ (the Q-function) with a deep neural network. The Q-function estimates the maximum achievable cumulative reward for the current state $s$ with action $a$.
In DQN, the output of the Q-function is a vector with one entry per action in the discrete action space, which prevents DQN from being applied to environments with continuous action spaces.
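To illustrate why this output form ties DQN to discrete actions, the sketch below selects the greedy action by enumerating a list of Q-values. The four Q-values are made-up numbers, not outputs of a trained network. The enumeration only works because the number of actions is finite; a continuous action space offers no such finite list to scan.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value, i.e. argmax over the discrete actions."""
    best = 0
    for index, value in enumerate(q_values):
        if value > q_values[best]:
            best = index
    return best


# One hypothetical Q-value per discrete action for a single state.
q_values = [0.1, 0.7, 0.3, 0.2]
action = greedy_action(q_values)
```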
Trust region policy optimization (TRPO) [18] then used the Kullback-Leibler divergence [19] as a constraint to resolve the unstable policy updates caused by the fixed step size in DDPG [16]. In addition, the proximal policy optimization (PPO) algorithm [20] proposes a clipped surrogate objective function to alleviate the complex computational requirements of TRPO.
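The clipped surrogate objective of PPO can be sketched in a few lines. The code below follows the form given in the PPO paper, the mean over samples of min(ratio × advantage, clip(ratio, 1 − ε, 1 + ε) × advantage), where the ratio is the new action probability divided by the old one. The function name, the plain-list inputs, and the default ε = 0.2 are choices made for this illustration and are not taken from our implementation.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective over a batch of samples."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|s) / pi_old(a|s)
        clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes the incentive to move the ratio outside the clip range.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# A ratio of 1.5 with a positive advantage is capped at 1 + clip_eps.
objective = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
```

Because the objective stops rewarding changes once the ratio leaves the clip range, the policy update stays close to the old policy without the explicit constrained optimization used in TRPO.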
III-A Task Specification
In this study, to verify the effectiveness of RL in the design process, we selected the cylindrical pot shown in Fig. 2 as the basic design because of its simple but flexible form. Starting from a cylinder-shaped pot, the agent tries to maximize the cumulative reward for a specific task by taking actions, which are defined as increasing or decreasing the diameter of the pot at different heights. The first task we consider is pouring as much of the water in the pot as possible into another cup, in keeping with the primary purpose of a pot. As a second task, we assume a shaking situation, in which the agent tries to keep the water inside the shaking pot. We name these environments the ‘pouring environment’ and the ‘shaking environment’, respectively. After training each task successfully, we also show that training both tasks simultaneously is possible even though the two tasks have conflicting features.
III-B 3D Simulation Environment
The availability of simulation environments has been one of the main reasons for the success of deep reinforcement learning. Mnih et al. [15] were able to learn a high-performing policy by training in the Atari 2600 game environment through the Arcade Learning Environment [21]. Simulation is equally important in our task. Without it, we would have to repeat the inefficient process of designing the pot, fabricating the product, and experimenting each time.
We used Blender, an open-source 3D modeling tool, to construct the reinforcement learning environment. We chose Blender because it supports fluid simulation through its built-in particle systems and allows all available information about the environment to be controlled and exported via Python scripts. The initial model of the pot implemented in Blender is shown in Figure 2.
III-C Reinforcement Learning Environment
As described in the previous section, to apply each modeling design to reinforcement learning, we need to define the state and action spaces. To do so, we assigned 11 control points along the vertical axis of the pot by dividing that axis into 10 regions. The cross-sectional circle at each control point consists of 32 equally spaced points, so the pot consists of $11 \times 32 = 352$ points, as shown in Figure 2. The action space is defined as an 11-dimensional vector that controls the radii of the 11 circle groups, and the state is a vector that holds the coordinates of all 352 points in the pot design.
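The parameterization can be sketched as follows. The code builds 11 rings of 32 points each and flattens them into a state vector. Several details are assumptions made for this illustration only: the pot height of 1.0, the initial radius of 0.5, the use of three coordinates per point (which would give a 1056-dimensional state), and the interpretation of each action entry as a scale factor on the initial radius.

```python
import math

NUM_RINGS = 11          # control points along the vertical axis
POINTS_PER_RING = 32    # equally spaced points on each cross-sectional circle


def build_pot(radii, height=1.0):
    """Return the pot as a list of (x, y, z) points, one ring per control point."""
    if len(radii) != NUM_RINGS:
        raise ValueError("expected one radius per control point")
    points = []
    for i, radius in enumerate(radii):
        z = height * i / (NUM_RINGS - 1)  # 11 control points bound 10 regions
        for j in range(POINTS_PER_RING):
            theta = 2.0 * math.pi * j / POINTS_PER_RING
            points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points


def apply_action(initial_radii, action):
    """Assumed action semantics: each entry scales the matching initial radius."""
    return [r * a for r, a in zip(initial_radii, action)]


def flatten(points):
    """Assumed state layout: all coordinates concatenated into one flat vector."""
    return [coord for point in points for coord in point]


initial_radii = [0.5] * NUM_RINGS   # hypothetical radius of the initial cylinder
action = [1.0] * NUM_RINGS
action[5] = 0.8                     # narrow the middle ring
points = build_pot(apply_action(initial_radii, action))
state = flatten(points)
```

With this layout, a single 11-dimensional action changes all 32 points of a ring together, which keeps every cross-section circular.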
As shown in Figure 3, the pouring environment consists of a pot containing water and a cup to receive the water. First, the agent takes an action that changes the design of the pot from its initial state. The environment then simulates one step and measures a reward. The step proceeds as follows: with a certain amount of water in the pot, the pot is tilted about a fixed axis from 0 to 130 degrees over 2 seconds. The amount of water in the cup at that point is measured as the reward.
In the shaking environment shown in Figure 4, unlike the pouring environment, the environment consists only of a pot containing water. The procedure is the same as in the pouring environment except for the simulation step. With a certain amount of water in the pot, the environment shakes the pot about a fixed axis for thirteen seconds with swings of up to 70 degrees. The amount of water remaining in the pot at that point is measured as the reward.
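One plausible way to measure the amount of water in a container from a particle-based fluid simulation is to count the particles that lie inside a bounding region. The sketch below does this for an upright cylinder standing in for the cup. The function name, the cylinder approximation, and the example particle positions are all hypothetical; the measurement actually used in our Blender scripts is not reproduced here.

```python
def fraction_in_cylinder(particles, center_xy, radius, z_min, z_max):
    """Fraction of particles inside an upright cylinder (a stand-in for the cup)."""
    if not particles:
        return 0.0
    cx, cy = center_xy
    inside = 0
    for x, y, z in particles:
        within_radius = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
        if within_radius and z_min <= z <= z_max:
            inside += 1
    return inside / len(particles)


# Four hypothetical particle positions; two of them land in the cup.
particles = [(2.0, 0.0, 0.1), (2.1, 0.0, 0.2), (0.0, 0.0, 0.5), (5.0, 5.0, 0.0)]
reward = fraction_in_cylinder(particles, center_xy=(2.0, 0.0), radius=0.3, z_min=0.0, z_max=1.0)
```

The same counting idea would apply to the shaking task by testing whether particles remain inside the pot instead of the cup.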
In this section, we show through experiments that task-oriented design is possible not only in the tasks described in the previous section but also in a multi-task environment where the goal is a hybrid of the two tasks. We then analyze how the computer understands a task and designs a model by examining i) the rewards during learning, quantitatively, and ii) the resulting pot designs, qualitatively.
IV-A Training Model Details
We used the actor-critic based PPO algorithm for the learning experiments. The model is composed of an actor network and a critic network, each a 3-layer fully connected network with (256, 128, 64) units and the tanh (hyperbolic tangent) activation function. Each network receives as input the 352 points that make up the pot. The actor network outputs, from a Gaussian distribution, values for the 11 action dimensions defined in Section III-C, and the critic network computes the expected cumulative reward from the current state. For learning stability, the scale of the action is limited to between 0.5 and 1.5 times the initial value. We used the Adam optimizer with a learning rate of 0.0007.
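The action sampling and range limit can be sketched as below. The code draws one value per control point from a Gaussian distribution and clamps it to the interval [0.5, 1.5]. It assumes that the limit applies to a scale factor relative to the initial radius; the fixed means and the standard deviation of 0.3 are placeholders for what the actor network would output.

```python
import random

ACTION_DIM = 11
SCALE_MIN, SCALE_MAX = 0.5, 1.5


def sample_action(means, std, rng):
    """Sample one value per control point from N(mean, std^2), then clamp to the allowed range."""
    raw = [rng.gauss(mean, std) for mean in means]
    return [min(SCALE_MAX, max(SCALE_MIN, value)) for value in raw]


rng = random.Random(0)
# Placeholder means: in training these would come from the actor network.
scales = sample_action([1.0] * ACTION_DIM, std=0.3, rng=rng)
```

Clamping keeps every radius within half to one and a half times its initial size, so a single large sample cannot collapse or inflate the pot.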
IV-B Pouring Environment
In the pouring environment, the pot performs the task of pouring water into a narrow cup located a certain distance away, as shown in Figure 3. If the amount of water in the cup is $w_{\text{cup}}$ and the total amount of water is $w_{\text{total}}$, the reward is defined as
$$r_{\text{pour}} = \frac{w_{\text{cup}}}{w_{\text{total}}}. \tag{1}$$
With this reward, we designed an experiment to deliver the maximum amount of water to the cup when the pot is tilted. We used PPO as the reinforcement learning algorithm. In general, a reinforcement learning model needs millions to billions of steps to converge. In our experiment, however, we used only 1,000 training steps because of the computational bottleneck of the Blender simulation. We divided the 1,000 steps into 5 episodes and reset the pot design to its initial state every 200 steps to keep the agent from settling into a sub-optimal design. In this way, we encouraged the agent to find the optimal model.
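The training schedule can be sketched as a simple loop. The two callbacks below are placeholders for resetting the pot in the simulator and for running one simulation and update step; they are hypothetical and do nothing useful here. Zero-based episode and step indices are also an assumption of this sketch.

```python
TOTAL_STEPS = 1000
STEPS_PER_EPISODE = 200


def run_schedule(reset_design, train_step):
    """Run 1,000 steps, resetting the design at the start of every 200-step episode."""
    log = []
    for global_step in range(TOTAL_STEPS):
        episode, step_in_episode = divmod(global_step, STEPS_PER_EPISODE)
        if step_in_episode == 0:
            reset_design()
        log.append((episode, step_in_episode, train_step()))
    return log


counters = {"resets": 0}


def reset_design():
    """Placeholder: would restore the initial cylindrical pot."""
    counters["resets"] += 1


def train_step():
    """Placeholder: would run one simulation step and return its reward."""
    return 0.0


log = run_schedule(reset_design, train_step)
```

Only the pot design is reset at each episode boundary; the learned network parameters carry over, which is what allows later episodes to start from the same cylinder yet reach better designs.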
IV-B1 Quantitative Analysis
Figure 6 shows the overall rise in reward during training. In the initial state, the reward is quite low because of the gap between the tip of the pot and the center of the cup. As training proceeds, the reward increases. The valley at the start of each episode in Figure 6 (at steps 1, 201, 401, 601, and 801) shows that the design restarts from the initial state, which lets the agent escape sub-optimal designs, and the reward improves slightly as learning progresses. Compared with 23% of the water reaching the cup in the initial state, 53% reaches the cup by the end of training. The final image in the pouring-environment row of Figure 5 shows the pot design that obtained a reward of 0.53 at episode 4, step 107.
IV-B2 Qualitative Analysis
The models created by the deep reinforcement learning algorithm reveal what the agent tries to accomplish at each step. In Figure 5, the first row shows how the model trained in the pouring environment changes. In the initial state, the pot inevitably fails to aim correctly because the cup is far from the pot, as can be seen in Figure 3. Our agent solves this problem by shaping the pot so that the center is narrow and the head and bottom are wide, which controls the acceleration of the fluid. After this, the algorithm goes through exploration steps to maximize the reward (1). We can also see that the head area is resized to maximize the reward.
IV-C Shaking Environment
In the shaking environment, the pot is designed to spill as little water as possible while it is being shaken. If the initial amount of water in the pot is $w_{\text{init}}$ and the final amount of water after shaking is $w_{\text{final}}$, the reward is defined as
$$r_{\text{shake}} = \frac{w_{\text{final}}}{w_{\text{init}}}. \tag{2}$$
With this reward, we trained the agent to retain the maximum amount of water in the pot. As in the pouring environment, we used the PPO algorithm and trained the model for 5 episodes of 200 steps each.
IV-C1 Quantitative Analysis
Figure 7 shows the trend of the reward as learning progresses in the shaking environment. In the first three episodes, the reward tends to decrease slightly after reaching a saturation level, which indicates that the agent was stuck in a local optimum during training. After episode 4, however, exploration lets the agent escape the local optimum, and the maximal reward increases. As a result, whereas the initial design retains only 41% of the water, the resulting pot retains 86% of the water at the end of training. The bottom-right image in the shaking-environment row of Figure 5 shows the design that obtained a reward of 0.86, recorded at episode 4, step 147.
IV-C2 Qualitative Analysis
The changes in the models generated by reinforcement learning show how the network is trained to retain water. As the bottom row of Figure 5 shows, the bottom part of the pot grows larger and larger to hold as much water as possible, giving a structure well suited to storing water. As training proceeded, storing water in the lower part became difficult, and training moved toward a double-tube structure. The designs also share a barrier structure in the upper part that prevents water from splashing out when the pot is shaken. Consequently, the simulation confirmed that the pot was trained to keep as much of the bouncing water as possible inside despite the large swing of 70 degrees.
In hybrid learning, we examined whether a pot design can perform both of the conflicting tasks. To do this, we defined a new reward as a weighted sum of the reward in the pouring environment, $r_{\text{pour}}$, and the reward in the shaking environment, $r_{\text{shake}}$:
$$r_{\text{hybrid}} = \alpha \, r_{\text{pour}} + (1 - \alpha) \, r_{\text{shake}}. \tag{3}$$
In this equation, $\alpha$ is a weight parameter between 0 and 1. We examined how the algorithm interprets each task for five values of $\alpha$.
IV-D1 Quantitative Analysis
Table I shows the rewards obtained for different values of the weight parameter $\alpha$. The episode and step indicate when the maximum hybrid reward was achieved for the corresponding weight, and the three rewards (pouring, shaking, and hybrid) are the rewards at that time. When $\alpha$ is 0.1, according to (3), the shaking environment carries more weight in training. As $\alpha$ increases, training focuses more on the pouring environment. As Table I shows, when $\alpha$ was 0.1, the pouring reward at the point of the largest hybrid reward was 0.32. As $\alpha$ increased to 0.9, the pouring reward increased to 0.53, which matches the best score in the pouring-only environment, because the algorithm gave more weight to the pouring environment. Conversely, the shaking reward was 0.87 when $\alpha$ was 0.1 and decreased to 0.71 when $\alpha$ was 0.9. In this way, we showed that a deep RL algorithm combining two opposing tasks can train a model that satisfies both.
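To make the weighting concrete, the sketch below evaluates a weighted sum of the two task rewards for the two weight settings quoted above. It assumes that the weight multiplies the pouring reward and that its complement multiplies the shaking reward, which matches the description that a larger weight shifts the focus toward pouring. The resulting sums are illustrative arithmetic under that assumption and are not copied from Table I.

```python
def hybrid_reward(alpha, pouring_reward, shaking_reward):
    """Weighted sum of the two task rewards; alpha weights the pouring task."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * pouring_reward + (1.0 - alpha) * shaking_reward


# (weight, pouring reward, shaking reward) values quoted in the text above.
reported = [(0.1, 0.32, 0.87), (0.9, 0.53, 0.71)]
hybrid = [hybrid_reward(a, p, s) for a, p, s in reported]
```

With a small weight the sum is dominated by the shaking reward, and with a large weight by the pouring reward, so the agent gains most by improving the task that carries more weight.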
IV-D2 Qualitative Analysis
Figure 8 shows how the model changes with the weight parameter $\alpha$. When $\alpha$ is 0.1, there is a water trap structure at a lower position, a narrow entrance, and a barrier structure below the entrance, as in the model designed in the pure shaking environment. This shows that training is focused on the shaking environment and aims to maximize the water kept in the pot. On the other hand, when $\alpha$ is 0.9, the model has been trained to form a smooth line in the middle, as in the model designed in the pouring environment, so that the water flows out as easily as possible. The third model, which performs the two tasks in the most balanced way, maintains all of these features. Although it has a storage part in the middle, influenced by the shaking environment, it tends to minimize the water remaining in the pot through a structure that narrows from the bottom to the top, which resembles the design from the pouring environment.
IV-E Contribution in Design Process
Since the computer-designed products from this experiment do not consider usability or aesthetics beyond the given objectives, the creative part of the design must still be developed on top of the generated form. Therefore, to verify the validity of the current methodology, we invited a product designer to sketch designs based on the computer-generated design. The designer chose a Chinese pot as the main design concept and sketched a new Chinese pot using the characteristics of the generated form. The resulting design sketch is shown in Figure 9.
After the sketching session, we held a short interview with the product designer to gather the pros and cons of this methodological concept. The designer commented that ‘this methodology is highly useful when designers should make a product from a task-based concept’. He also found that the output form is quite similar to the common understanding of pots designed for similar tasks. He therefore believes that the results of this study will be a good reference for task-based study, which will increase the reliability of the results produced by the designer. However, he was concerned that significant features of the output could be lost because of the extreme morphological tendency of the computer-generated design.
Through this study, we have shown that task-oriented design using deep reinforcement learning is possible for a specific task whose objective can be well defined. The task objective achieved by the design modified through deep reinforcement learning is significantly higher than that of the basic form, which indicates that the computer succeeded in designing a task-oriented model. With this methodology, designers and researchers will be able to conduct task-based form research before they move on to the creative parts of the product design process. In addition, the proposed methodology is highly efficient, because morphology can be studied within 20 hours at low cost while achieving a high understanding of the task.
However, because the current action space simply changes the radius of each point layer, the output is limited in how aesthetic or complex a design it can produce for real-life use. Also, because the reward function for the task is simply designed, performance plateaus once it reaches a certain level. For these reasons, future work will need a more refined action space as well as a more refined reward function.
- [1] I. Koskinen, J. Zimmerman, T. Binder, J. Redström, and S. Wensveen, Design Research Through Practice: From the Lab, Field, and Showroom. Morgan Kaufmann, 2011.
- [2] K. Dorst and N. Cross, “Creativity in the design process: co-evolution of problem–solution,” Design Studies, vol. 22, pp. 425–437, 2001.
- [3] D. G. Ullman, The Mechanical Design Process. New York: McGraw-Hill Education, 1991.
- [4] R. Aish, J. Stam, M. Glueck, and A. Khan, “Physics-based generative design,” CAAD Futures Conference, vol. 14, pp. 231–244, 2009.
- [5] T. Du, A. Schulz, B. Zhu, B. Bickel, and W. Matusik, “Computational multicopter design,” ACM Transactions on Graphics, vol. 35, no. 227, 2016.
- [6] Y. I. H. Parish and P. Müller, “Procedural modeling of cities,” SIGGRAPH, vol. 8, 2001.
- [7] N. Umetani, T. Igarashi, and N. J. Mitra, “Guided exploration of physically valid shapes for furniture design,” ACM Transactions on Graphics, vol. 31, no. 86, 2012.
- [8] Y. Zhu, Y. Zhao, and S.-C. Zhu, “Understanding tools: Task-oriented object modeling, learning and recognition,” CVPR, vol. 10, 2015.
- [9] R. H. Kazi, T. Grossman, H. Cheong, A. Hashemi, and G. Fitzmaurice, “Dreamsketch: Early stage 3d design explorations with sketching and generative design,” UIST, vol. 14, pp. 401–414, 2017.
- [10] F. Pahn, N. Senin, and D. Wallace, “Distribution modeling and evaluation of product design problems,” Computer-Aided Design, vol. 30, pp. 411–423, 1998.
- [11] Blender Online Community, Blender - a 3D modelling and rendering package, Blender Foundation, Blender Institute, Amsterdam, 2018. [Online]. Available: http://www.blender.org
- [12] A. L. Bang, P. G. Krogh, M. Ludvigsen, and T. Markussen, “The role of hypothesis in constructive design research,” The Art of Research IV, vol. 11, 2012.
- [13] R. Sutton and A. Barto, “Reinforcement learning: An introduction, (complete draft),” 2017.
- [14] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in ICML, 2014.
- [15] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
- [16] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
- [17] J. Peters and S. Schaal, “Natural actor-critic,” Neurocomputing, vol. 71, no. 7-9, pp. 1180–1190, 2008.
- [18] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
- [19] S. Kullback and R. A. Leibler, “On information and sufficiency,” The Annals of Mathematical Statistics, vol. 22, no. 1, pp. 79–86, 1951.
- [20] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- [21] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.