In VLSI design, high-quality global placement results are highly correlated with the final quality (area, performance, and power) of the physical design. As one of the first stages of the physical design flow, placement decisions affect the results of all downstream design stages. Recently, DREAMPlace and ABCDPlace have provided an extremely fast platform for running global and detail placement [5, 6]. This opens up the possibility of running large numbers of exploratory placement runs to find the best possible placement solution. We leverage these advancements along with recent advances in reinforcement learning and introduce an augmented state-of-the-art placement algorithm that learns new internal heuristics which yield higher-quality final solutions.
State of the art force based academic placers use the Lagrangian relaxation technique to optimize the constrained objective function which takes into account both cell density and half-perimeter wire length (HPWL). Improvements to these algorithms (such as work in ePlace and RePlAce) frequently come as improvements to heuristic rules used during the course of the optimization [7, 2]. In this paper we investigate whether it is possible to use reinforcement learning to find better heuristics than the ones being used today. The main contributions of this paper are:
We propose the use of large scale placement exploration accelerated by GPUs to fuel a data driven approach to placement.
We propose two distinct methods of augmenting force-based global placement algorithms with reinforcement learning.
We introduce a unique correlated sampling strategy for reinforcement learning algorithms acting on a two dimensional action space.
We demonstrate that using these methods results in a 1% reduction in HPWL across a range of academic and industry benchmarks.
As far as we are aware this is the first attempt to use reinforcement learning to directly control the dynamics of a state of the art global placement algorithm.
The modern VLSI design flow is an iterative process. The placement stage of a given design is run many times often with incremental changes to the underlying design. A placement engine that can learn from previous iterations and apply learned strategies to provide better quality of results to later iterations is therefore desirable. To this end we also demonstrate that our learned policies retain on average 77% of their original benefit through hundreds of synthetic netlist edits. Given that the compute time for training our agents on a new partition is significant this ability to generalize to design changes is important.
II-A Global Placement Optimization
The objective of the global placement problem is to find locations for design cells that minimize competing objectives. These objectives always include HPWL (half-perimeter wire length) and cell density, and can optionally include other metrics such as routability and timing. In this work we focus only on the HPWL and cell density objectives and leave extension to other objectives as possible future work.
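As a concrete illustration of the HPWL objective, the following minimal sketch sums the half perimeter of each net's bounding box. It assumes every cell is a single point (pin offsets are ignored), and the names `hpwl`, `cells`, and `nets` are hypothetical and not taken from DREAMPlace.

```python
from typing import Dict, List, Tuple


def hpwl(cell_xy: Dict[str, Tuple[float, float]], nets: List[List[str]]) -> float:
    """Sum over all nets of the half perimeter of the net's bounding box."""
    total = 0.0
    for net in nets:
        xs = [cell_xy[name][0] for name in net]
        ys = [cell_xy[name][1] for name in net]
        # Bounding-box width plus height for this net.
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


# Toy placement: three cells and three nets (illustrative values only).
cells = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 1.0)}
nets = [["a", "b"], ["a", "b", "c"], ["c"]]
total_hpwl = hpwl(cells, nets)
```

A single-pin net contributes zero, and adding cell `c` inside the bounding box of `a` and `b` does not change that net's contribution.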
State-of-the-art placers such as ePlace and RePlAce model placement as an electrostatic system [7, 2]. In these models cells are treated as point charges, with the cell density cost calculated as the potential energy of the system. This formulation allows the fast Fourier transform to be used to efficiently and differentiably calculate the potential energy, and therefore the density cost, of a given placement. These methods then use Nesterov’s method [13] to iteratively solve for a placement that minimizes both HPWL and density costs.
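The optimization process of these approaches can be summarized as
$$\min_{x, y} \; C(x, y) = WL(x, y) + \lambda \, D(x, y) \qquad (1)$$
where $C$ is the overall cost, $WL$ is the wire length cost function, $D$ is the cell density cost function, and $\lambda$ is the density cost weight.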
As part of the placement algorithm these solvers must choose a value ($\lambda$) to control the tradeoff between wire length and density costs (referred to as the density cost weight). Before RePlAce/ePlace, electrostatic-based solvers chose either a fixed or a gradually increasing value for the density cost weight. ePlace proposed a heuristic rule based on recent changes in wire length cost, and RePlAce further suggested ’dynamic step size adaptation’, which automatically adjusts the penalty based on the HPWL curve of a trial placement [7, 2]. Using this approach they were able to show an improvement in solution quality. Additionally, the authors showed that adding a local density cost function, which adjusts the density weight based on local overflow statistics, further improved the solution quality.
Previous works improved results by adding heuristic rules that control the dynamics of the placement solver, which highlights how important these heuristics are to global placement tools. In this work we instead study training a reinforcement learning agent to either replace existing heuristics or leverage new controls in order to optimize final solution quality.
|Log of current HPWL|
|Log of HPWL|
|Log of the current density weight ($\lambda$)|
|Current overflow as reported by DREAMPlace|
|Current $\mu$ (as defined in Algorithm 1)|
|Cell Density Map|
|Wire Density Map|
|Local HPWL (2x2, 4x4, 8x8, 16x16)|
II-B Asynchronous Advantage Actor Critic
We frame global placement as a Markov Decision Process (MDP) with states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, transition function $\mathcal{P}$, and reward function $\mathcal{R}$. We define $\mathcal{R}$, $\mathcal{S}$, and $\mathcal{A}$ for global placement in Sections IV, IV-A, and IV-B1 or IV-B2 respectively. $\mathcal{P}$ is defined by DREAMPlace as explained in Section IV.
In the reinforcement learning paradigm, we optimize a policy $\pi_\theta(a \mid s)$ to maximize the expected discounted return $\mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$, where $T$ is the horizon length, $\gamma$ is a discount factor between 0 and 1, and $r_t$ is the reward at timestep $t$ of a trajectory.
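The discounted return can be made concrete with a short sketch. The reward sequence below is an illustrative placeholder for an episode whose only nonzero reward arrives at the final step, and `discounted_return` is a hypothetical helper name.

```python
from typing import List


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t by accumulating from the last step backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Sparse episode: reward only at the terminal step (illustrative values).
sparse_rewards = [0.0, 0.0, 0.0, 0.7]
undiscounted = discounted_return(sparse_rewards, gamma=1.0)
discounted = discounted_return(sparse_rewards, gamma=0.5)
```

With a discount factor below 1, a terminal reward is worth less the later it arrives, which is why the choice of discount matters for sparse terminal rewards.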
Policy gradient methods with continuous action spaces traditionally model $\pi_\theta$ as a parameterized Gaussian, where an underlying function provides the parameters (mean and standard deviation) of this distribution.
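Policy gradient methods then optimize $\pi_\theta$ directly by approximating its gradient using the objective function:
$$L^{PG}(\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\right]$$
where $s_t$ and $a_t$ are the state and action at timestep $t$ of a sampled trajectory $\tau$. An agent following $\pi_\theta$ acts in an environment defined by the transition function $\mathcal{P}$ to collect $\tau$. $\hat{A}_t$ is the advantage function and represents how much better or worse taking one action is compared to some baseline.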
We choose the Asynchronous Advantage Actor Critic (A3C) algorithm [10] to implement policy gradient reinforcement learning. This method improves over vanilla policy gradient methods while remaining straightforward to implement and modify. A3C defines the advantage $\hat{A}_t$ as:
$$\hat{A}_t = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} V_\phi(s_{t+n}) - V_\phi(s_t)$$
where $V_\phi$ is a learned value function parameterized by $\phi$ and updated to approximate the expected discounted return from a state. This advantage function compares the return from a trajectory to the expected return from starting state $s_t$. The A3C advantage function makes use of an $n$-step Bellman target [14].
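The n-step advantage can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: it assumes the value estimates are given as plain floats, truncates the n-step window at the end of the trajectory, and uses a hypothetical `bootstrap_value` for the state after the last step (0.0 when that state is terminal).

```python
from typing import List


def n_step_advantages(
    rewards: List[float],
    values: List[float],
    bootstrap_value: float,
    gamma: float,
    n: int,
) -> List[float]:
    """Advantage at step t: n-step Bellman target minus the value estimate V(s_t)."""
    horizon = len(rewards)
    ext_values = list(values) + [bootstrap_value]  # V(s_0) ... V(s_T)
    advantages = []
    for t in range(horizon):
        k = min(n, horizon - t)  # truncate the window at the end of the episode
        target = sum(gamma**i * rewards[t + i] for i in range(k))
        target += gamma**k * ext_values[t + k]
        advantages.append(target - values[t])
    return advantages


# Three-step episode with a terminal reward (illustrative numbers).
adv = n_step_advantages(
    rewards=[0.0, 0.0, 1.0],
    values=[0.5, 0.6, 0.8],
    bootstrap_value=0.0,
    gamma=1.0,
    n=2,
)
```

In this example the first step bootstraps from the value estimate two steps ahead, while the last two steps see the terminal reward directly.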
A3C also introduces an entropy term to the loss to aid in exploration and avoid early convergence. This is calculated with $\beta H(\pi_\theta(\cdot \mid s_t))$, where $H$ is the entropy function and $\beta$ weights the entropy term’s contribution to the total loss.
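For a Gaussian policy the entropy has a closed form, which the sketch below uses. It assumes independent Gaussian action dimensions, so the total entropy is the sum over dimensions; the function names and the value of the entropy weight are illustrative.

```python
import math
from typing import List


def gaussian_entropy(std: float) -> float:
    """Differential entropy of a univariate Gaussian: 0.5 * ln(2 * pi * e * std^2)."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)


def entropy_bonus(stds: List[float], beta: float) -> float:
    """Weighted entropy of independent Gaussian action dimensions."""
    return beta * sum(gaussian_entropy(s) for s in stds)


narrow = entropy_bonus([0.1, 0.1], beta=0.01)
wide = entropy_bonus([1.0, 1.0], beta=0.01)
```

A wider distribution has higher entropy, so rewarding entropy discourages the policy standard deviation from collapsing early in training.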
III Related Work
Some recent work has used reinforcement learning to solve related placement problems. The work in [9] learned a policy to explicitly place a small number of large macro cells before using a force-based method to place the remaining cells. Other work such as [8] has studied using reinforcement learning for the assignment of logic elements to FPGA logic blocks. However, these differ significantly from our work, as we investigate ways to directly improve the force-based method used to place smaller standard cells.
Many other works attempt to use machine learning to predict downstream problems during the placement stage in order to quickly identify potentially problematic placement solutions [1, 15, 16]. These approaches are applications of supervised learning which provide potentially actionable information to other portions of the design flow. Our approach instead attempts to leverage reinforcement learning to learn a better placement algorithm.
To create our reinforcement learning environment we modified DREAMPlace code to run placement in steps, yielding state information every 10 iterations and allowing the agent to observe this state and modify placement control parameters before continuing. The reward, which is provided to the agent when the density target is reached, is the percent decrease in detail place HPWL when compared to the DREAMPlace algorithm run without agent interference (the baseline HPWL). In the event the placement process diverges the environment provides a fixed reward of -10.
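The terminal reward described above can be written as a small function. This is a sketch under the stated definition (percent decrease relative to the baseline, with a fixed penalty on divergence); the function and argument names are hypothetical, and the example values are taken from the Design G row of the results table.

```python
DIVERGENCE_REWARD = -10.0  # fixed reward when the placement process diverges


def placement_reward(agent_hpwl: float, baseline_hpwl: float, diverged: bool = False) -> float:
    """Percent decrease in HPWL relative to the baseline run without the agent."""
    if diverged:
        return DIVERGENCE_REWARD
    return 100.0 * (baseline_hpwl - agent_hpwl) / baseline_hpwl


# Design G, spatial cost action: 519.71 against a baseline of 523.37.
reward_g = placement_reward(agent_hpwl=519.71, baseline_hpwl=523.37)
```

A positive reward means the agent's placement has shorter wire length than the baseline, and a reward of zero means it merely matched it.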
IV-A DREAMPlace State
The state is presented to the policy network as a 3-dimensional tensor. The first two dimensions are spatial and the final channel dimension separates each individual feature. The features are listed in Table I. All scalar features are repeated across the first two dimensions. All features are clipped within their 10th and 90th percentile values and then normalized to zero-mean and unit-variance using the statistics of that feature during the baseline run.
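The feature preprocessing described above might look like the following NumPy sketch. It assumes that the mean and variance are computed on the already clipped baseline values, which the text does not specify, and all function names are hypothetical.

```python
import numpy as np


def fit_feature_stats(baseline_values) -> dict:
    """Collect clipping bounds and normalization statistics from the baseline run."""
    v = np.asarray(baseline_values, dtype=np.float64)
    lo, hi = np.percentile(v, [10, 90])
    clipped = np.clip(v, lo, hi)
    # Assumption: statistics are taken after clipping; epsilon avoids division by zero.
    return {"lo": lo, "hi": hi, "mean": clipped.mean(), "std": clipped.std() + 1e-8}


def normalize_feature(x, stats: dict) -> np.ndarray:
    """Clip to the baseline 10th/90th percentiles, then standardize."""
    x = np.clip(np.asarray(x, dtype=np.float64), stats["lo"], stats["hi"])
    return (x - stats["mean"]) / stats["std"]


def broadcast_scalar(value: float, height: int, width: int) -> np.ndarray:
    """Repeat a scalar feature across both spatial dimensions."""
    return np.full((height, width), value, dtype=np.float64)


baseline = np.arange(101.0)  # stand-in for one feature logged during the baseline run
stats = fit_feature_stats(baseline)
channel = broadcast_scalar(float(normalize_feature(42.0, stats)), 4, 4)
```

Clipping before standardizing keeps rare outliers, such as values seen only when a run is about to diverge, from dominating the input scale.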
|Learning Rate (’density weight’)|
|Learning Rate (’spatial cost’)|
|Batch Size (trajectories)|
Table II: Actor Critic Hyperparameters
IV-B DREAMPlace Actions
IV-B1 Density Weight Control
The first of two action spaces we define is referred to as ’density weight control’. It allows the agent to set the density weight coefficient ($\mu$) used to adjust the value of the density cost weight ($\lambda$) during the placement algorithm. Algorithm 1 shows the original ePlace update rule for $\lambda$ used in DREAMPlace with an additional condition added for our RL control. When enabled, the heuristic calculation of $\mu$ is replaced with the current output of the RL agent. This action space is therefore a single continuous value.
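The control flow of this action space can be sketched as below. The heuristic branch follows the general shape of an ePlace-style rule, in which the coefficient depends on the most recent change in HPWL and is clamped to a fixed range; the constants `MU_0`, `DELTA_HPWL_REF`, `MU_MIN`, and `MU_MAX` are illustrative placeholders and are not the values used in DREAMPlace or in Algorithm 1. All names are hypothetical.

```python
from typing import Optional

# Illustrative placeholder constants, not DREAMPlace settings.
MU_0 = 1.1
DELTA_HPWL_REF = 3.5e5
MU_MIN, MU_MAX = 0.75, 1.1


def heuristic_mu(delta_hpwl: float) -> float:
    """Coefficient from the recent HPWL change, clamped to [MU_MIN, MU_MAX]."""
    mu = MU_0 ** (1.0 - delta_hpwl / DELTA_HPWL_REF)
    return min(MU_MAX, max(MU_MIN, mu))


def update_density_weight(lam: float, delta_hpwl: float, rl_mu: Optional[float] = None) -> float:
    """Multiply the density weight by the agent's coefficient when RL control is enabled."""
    mu = heuristic_mu(delta_hpwl) if rl_mu is None else rl_mu
    return lam * mu


lam_heuristic = update_density_weight(2.0, delta_hpwl=0.0)
lam_agent = update_density_weight(2.0, delta_hpwl=0.0, rl_mu=1.02)
```

Because the agent only replaces the coefficient, the rest of the solver loop is unchanged, and passing `rl_mu=None` recovers the heuristic behavior.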
IV-B2 Spatial Cost Weighting
The second action space we define is referred to as ’spatial cost weighting’. This action space allows the agent to provide a 2D field which is used to scale the gradient of the placement objective function ($C$ from Equation 1) for each cell $i$ based on the cell’s location, $(x_i, y_i)$, performing the below operation immediately after the gradient of the objective is calculated.
$$\nabla C_i \leftarrow a_t\left[\lfloor x_i / w \rfloor, \lfloor y_i / w \rfloor\right] \cdot \nabla C_i$$
where $a_t$ is the action at the current time step and $w$ is the partition size divided by the dimension of the action space. This approach gives the agent the ability to resist or amplify cell movement in specific areas during the placement process.
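A minimal NumPy sketch of this per-cell gradient scaling is shown below. It assumes a square partition with its origin at (0, 0) and a square action grid, and it clamps cells on the upper boundary into the last bin; the function and argument names are hypothetical.

```python
import numpy as np


def scale_gradients(grad, cell_xy, action, partition_size: float) -> np.ndarray:
    """Scale each cell's gradient by the action value of the bin containing the cell."""
    action = np.asarray(action, dtype=np.float64)
    n_bins = action.shape[0]
    bin_width = partition_size / n_bins  # partition size divided by action dimension
    idx = (np.asarray(cell_xy, dtype=np.float64) // bin_width).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)  # keep boundary cells inside the grid
    scale = action[idx[:, 0], idx[:, 1]]
    return np.asarray(grad, dtype=np.float64) * scale[:, None]


# 2x2 action grid over a 10x10 partition (illustrative values).
action = [[1.0, 2.0], [0.5, 0.0]]
cell_xy = [[1.0, 1.0], [1.0, 6.0], [7.0, 2.0], [10.0, 10.0]]
grad = np.ones((4, 2))
scaled = scale_gradients(grad, cell_xy, action, partition_size=10.0)
```

An action value above 1 amplifies movement for cells in that bin, a value below 1 resists it, and a value of 0 freezes those cells for the step.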
IV-C Choice of Reinforcement Learning Algorithm
The choice of which RL algorithm to use for a given problem is an important one. Most RL algorithms fit into one of two categories, model-free or model-based, based on whether they attempt to directly model the environment dynamics. We focus on the ’model-free’ category in this discussion (although we mention the possible application of ’model-based’ approaches in Section VI). Within ’model-free’ RL we must decide between two styles, ’policy optimization’ and ’Q learning’. In ’Q learning’ an agent is trained to identify the expected future reward given a state-action pair ($Q(s, a)$). Alternatively, ’policy optimization’ methods directly train a policy function ($\pi(a \mid s)$) to maximize the expected future reward.
There are two reasons we chose to explore the Actor-Critic (’policy optimization’) framework for this specific problem. The first is that the complete state of the placement solver is difficult to represent, and therefore a significant portion of the state of the placer remains hidden from the model. This makes approximating the true $Q$ function difficult. The second is that the placement environment provides only a single reward at the terminal state. Because ’Q learning’ methods train the $Q$ function to match the sum of the immediate reward and $\gamma \max_{a'} Q(s', a')$, it can take many samples to incorporate sparse reward information into the model. Due to the ’on policy’ nature of policy optimization methods such as Actor Critic, policy model updates can instead use the final observed reward to adjust the likelihood of all actions taken during a single sampled episode.
IV-D Actor-Critic Implementation
Our experiments use the A3C algorithm described above. We extend the asynchronous framework TorchBeast [3] with support for our multidimensional and continuous action spaces.
In our experiments we train both $\pi_\theta$ and $V_\phi$ simultaneously. We clip the gradient norm at 8 to increase training stability. To prevent the policy gradient loss from trending towards negative infinity during training, we enforce a minimum value on the policy standard deviation by adding a small fixed value to the network output.
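The two stabilizers mentioned above can be illustrated in plain Python. The sketch assumes a softplus activation on the raw standard deviation output and a floor of `MIN_STD = 1e-2`; neither is stated in the text, and both are placeholders. The point is that with a floor on the standard deviation the Gaussian log-probability is bounded above, so the loss cannot run off to negative infinity by shrinking the distribution around a sampled action.

```python
import math
from typing import List

MIN_STD = 1e-2        # hypothetical floor added to the network output
MAX_GRAD_NORM = 8.0   # gradient norm clip from the text


def policy_std(raw_output: float) -> float:
    """Positive standard deviation bounded below by MIN_STD (softplus is an assumption)."""
    return math.log1p(math.exp(raw_output)) + MIN_STD


def gaussian_log_prob(action: float, mean: float, std: float) -> float:
    """Log-density of a univariate Gaussian."""
    return -0.5 * ((action - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def clip_by_global_norm(grads: List[float], max_norm: float = MAX_GRAD_NORM) -> List[float]:
    """Rescale the gradient vector so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


# Even for a very negative raw output, the log-probability at the mean stays bounded.
bounded_log_prob = gaussian_log_prob(0.0, 0.0, policy_std(-50.0))
clipped = clip_by_global_norm([30.0, 40.0])
```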
Because our agent receives a sparse reward limited to terminal states, setting $\gamma = 1$ ensures that early actions receive equal credit and that our agent has no incentive to reach terminal states too quickly.
Additional hyper-parameters can be found in Table II.
IV-E Neural Network Model Architectures
In deep reinforcement learning both $\pi_\theta$ and $V_\phi$ are approximated by deep neural networks. Figure 1 illustrates our network architecture. In our experiments both networks share a subset of their parameters in a trunk network (a 2D convolutional neural network with residual connections). The output of the trunk network is used as input to separate value and policy branches. The value branch is composed of a single fully connected layer. In the case of our ’density weight control’ action space the policy branch is also a single fully connected layer. For the ’spatial cost weighting’ action space the policy branch is a fully convolutional neural network similar to the main trunk.
|Design||DREAMPlace Baseline HPWL||DREAMPlace Baseline Steps||Spatial Cost Action HPWL||Spatial Cost Action Steps||Spatial Cost Action % Improvement (median, max)||Density Weight Action HPWL||Density Weight Action Steps||Density Weight Action % Improvement (median, max)|
|Design G||523.37||101||519.71||106||(0.63, 0.70)||521.95||137||(0.19, 0.27)|
|Design H||367.87||117||365.16||119||(0.48, 0.74)||361.19||134||(1.50, 1.82)|
|Design I||530.14||96||517.73||105||(2.32, 2.34)||524.30||116||(0.93, 1.10)|
|Design||# Standard Cells||# Nets|
IV-F Correlated Noise
Traditionally when drawing samples from the policy $\pi_\theta$ for a continuous action space, a sample is drawn from the unit normal distribution and then scaled and shifted by the parameters provided by the policy neural network. The entropy in this parameterized distribution allows the agent to occasionally “explore” new actions even if they were not highly likely under the given policy. For our 2D action space care must be taken to ensure ’meaningful’ exploration decisions are taken. If each value in the 2D action grid is IID, there is too much high-frequency noise (both spatially and temporally) in the actions for the agent to chance upon coherent strategies. Prior work [4, 19] suggests sampling an Ornstein-Uhlenbeck process [17] to enforce temporal consistency in the sampled noise. This process models the velocity of a Brownian particle, which is both temporally correlated and mean-reverting. We extend this to also include spatial consistency across the action space. To do this, for each training episode we pick a random resolution lower than the action resolution and sample each pixel from an independent Ornstein-Uhlenbeck process. We then use bilinear upsampling to interpolate this lower-resolution noise back to the size of the original action space. This forces varying amounts of spatial and temporal consistency in the exploration noise. The noise sampled from this process is visually depicted in Figure 2. Without this correlated sampling our ’spatial cost’ agent was unable to learn a policy that improved over the baseline HPWL.
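The sampling scheme described above can be sketched in NumPy as follows. This is an illustration under several assumptions that are not from the text: a square action grid, a discretized Ornstein-Uhlenbeck update with unit time step and placeholder `theta` and `sigma` values, a coarse resolution of at least 2, and an endpoint-aligned bilinear interpolation built from two passes of `np.interp`. The class and function names are hypothetical.

```python
import numpy as np


def bilinear_upsample(grid: np.ndarray, out_size: int) -> np.ndarray:
    """Bilinear interpolation of a square grid, done as two 1D linear passes."""
    n = grid.shape[0]
    src = np.linspace(0.0, 1.0, n)
    dst = np.linspace(0.0, 1.0, out_size)
    rows = np.stack([np.interp(dst, src, r) for r in grid])            # (n, out)
    return np.stack([np.interp(dst, src, c) for c in rows.T], axis=1)  # (out, out)


class CorrelatedNoise:
    """Spatially and temporally correlated exploration noise for a 2D action grid."""

    def __init__(self, action_res: int, theta: float = 0.15, sigma: float = 0.2, seed=None):
        self.rng = np.random.default_rng(seed)
        self.action_res = action_res
        self.theta, self.sigma = theta, sigma
        self.reset()

    def reset(self) -> None:
        """Start an episode: pick a random coarse resolution below the action resolution."""
        self.low_res = int(self.rng.integers(2, self.action_res))
        self.state = np.zeros((self.low_res, self.low_res))

    def sample(self) -> np.ndarray:
        """Advance one independent OU process per coarse pixel, then upsample."""
        noise = self.rng.standard_normal(self.state.shape)
        self.state = self.state - self.theta * self.state + self.sigma * noise
        return bilinear_upsample(self.state, self.action_res)


noise = CorrelatedNoise(action_res=16, seed=0)
frames = [noise.sample() for _ in range(3)]
```

A low coarse resolution produces smooth, large-scale perturbations, while a resolution close to the action resolution approaches independent per-pixel noise, so drawing the resolution per episode varies the amount of spatial correlation.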
IV-G Training Setup
Training is performed on a single DGX machine with 8 Tesla V100 GPUs. 7 GPUs are filled with actor instances (2 or 3 per GPU depending on design size) running our DREAMPlace based RL environment. The final GPU is used for running the training algorithm for the policy and value networks. Training is performed for 1.5e6 training steps (between 10,000 and 25,000 placement episodes depending on placement episode length) and the best solution found during training is reported. This training process takes between 24 and 48 hours depending on the size of the partition.
IV-H Inference Time
The addition of our agent to the placement process does have a small effect on runtime because we run our feature collection and agent neural network once every 10 placement steps. The addition of our agent adds a roughly 10% overhead to the DREAMPlace global placement runtime.
V-A Data Set
We train our reinforcement learning policy on an array of open source placement benchmarks that have previously been used for various placement competitions. These include the ISPD ’05 and ’06 [11, 12] and DAC ’12 [18] competitions. We report non-scaled detail place HPWL values, running DREAMPlace followed by ABCDPlace to obtain the ’DREAMPlace HPWL’ detail place values for all partitions. In addition we run the same experiment on 9 industry designs. Due to memory limits when running the GPU accelerated detail place step, we report improvement in the global place HPWL result for these designs.
In Table III we present our results on open source benchmark designs. We report the max and median of the best placement solutions found across 3 independent training runs. The median ’density weight’ action space solutions are on average 0.87% better than the DREAMPlace baseline solutions and the ’spatial cost’ action space solutions are on average 1.03% improved over the baseline.
In Table IV we present results in a similar manner on nine designs taken from multiple industry workflows. On these designs the agent is able to improve by 1.47% and 1.49% on average using the ’spatial cost’ and ’density weight’ actions respectively. Notably the agent is able to improve the solution for Design D by more than 5%.
Our RL guided placement methods are able to find placement solutions for all designs which have a shorter HPWL than the DREAMPlace baseline. Interestingly, one action space does not outperform the other across all benchmarks. While the ’density weight’ action does appear to be superior on most designs, it comes at the cost of an increase in the number of placement iterations to convergence.
V-C Investigating Trained Agent Policies
V-C1 Density Weight Control
As seen in Figure 3, the trained agent’s policy for adjusting the density weight chooses a significantly different schedule than the baseline heuristic. Specifically, the agent tends to favor smaller increases in the density weight, which leads to more placement iterations and a longer time to converge to the target density. However, the choice of when to use small density weight updates and when to use larger ones seems to be both important and non-trivial, because the final policies outperform both slow and fast static policies.
V-C2 Spatial Cost Weighting
If we visualize the actions the trained ’spatial cost weighting’ agent takes during placement (seen in Figure 4), we observe the network leverages both the current state of placement and placement structures such as the macro positions and high density logic clusters to control the evolution of the placement process. Interestingly, unlike the ’density weight’ action space the number of placement iterations to convergence is not significantly larger than the baseline placement. The agent is able to improve results with only a small increase in the number of iterations.
V-D Netlist Modification Test
In order to investigate how robust the learned policy is to changes in the netlist we performed the following test. We make synthetic edits to the benchmark netlists and rerun DREAMPlace with our learned policy to measure how the performance of our learned policy degrades as the netlist changes. Random edits which consist of adding and removing nodes as well as adding additional nets are performed according to Algorithm 2. As shown in Figure 5, we find that while the improvement of the learned policy degrades with these netlist edits we still see a significant improvement over the baseline placement scheme with hundreds of netlist edits. Specifically, on average we retain 78% of the trained benefit with 500 random edits and 73% of the benefit with 1000 random edits.
In this work, we used GPU accelerated placement to train a reinforcement learning agent that augments a state of the art global placement algorithm. We have demonstrated this approach can learn heuristics that improve the final quality of results. Although we focus on improving half-perimeter wire length, this approach can be extended to include many other downstream metrics such as congestion, timing or power.
Ideally, an agent trained with reinforcement learning would be able to perform placement on novel designs that were unseen during training. However, we observed that the training of a generalized agent is difficult. One major challenge to progress is a lack of open source datasets. Currently, open source datasets are too small to provide sufficient coverage of the distribution of all possible designs. We invite future work to tackle this challenge of generalization. As the placement environment is deterministic, exploration of model-based approaches to improve generalization is potentially promising. Meanwhile, since designs are often built iteratively, we believe our approach can learn strategies specific to each design, yielding benefits as the design is iteratively updated.
Our approach reduces HPWL by more than 1% on both widely used academic benchmarks and industry designs. While the training time is high, this benefit is still significant for highly optimized designs and demonstrates the potential of data driven approaches to global placement optimization. To our knowledge this is the first attempt to integrate reinforcement learning directly into force-based global placement algorithms.
[1] (2018) A learning-based methodology for routability prediction in placement. In 2018 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 1–4.
[2] (2019) RePlAce: advancing solution quality and routability validation in global placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 38 (9), pp. 1717–1730.
[3] TorchBeast: a PyTorch platform for distributed RL.
[4] (2019) Continuous control with deep reinforcement learning.
[5] DREAMPlace: deep learning toolkit-enabled GPU acceleration for modern VLSI placement. In 2019 56th ACM/IEEE Design Automation Conference (DAC), pp. 1–6.
[6] (2020) ABCDPlace: accelerated batch-based concurrent detailed placement on multi-threaded CPUs and GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pp. 1–1.
[7] (2014) ePlace: electrostatics based placement using Nesterov’s method. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pp. 1–6.
[8] Placement and routing for 3D-FPGAs using reinforcement learning and support vector machines. In 18th International Conference on VLSI Design held jointly with 4th International Conference on Embedded Systems Design, pp. 451–456.
[9] (2020) Chip placement with deep reinforcement learning.
[10] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937.
[11] (2005) The ISPD2005 placement contest and benchmarks suite. In ISPD, pp. 216–220.
[12] (2006) ISPD2006 placement contest: benchmark suite and results. In ISPD, pp. 167–167.
[13] (1983) A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Dokl. Akad. Nauk SSSR 269, pp. 543–547.
[14] (1988) Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
[15] (2017) Detailed routing violation prediction during placement using machine learning. In 2017 International Symposium on VLSI Design, Automation and Test (VLSI-DAT), pp. 1–4.
[16] (2018) A machine learning framework to identify detailed routing short violations from a placed netlist. In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pp. 1–6.
[17] (1930) On the theory of the Brownian motion. Vol. 36, pp. 823.
[18] (2012) The DAC 2012 routability-driven placement contest and benchmark suite. In DAC Design Automation Conference 2012, pp. 774–782.
[19] (2015) Control policy with autocorrelated noise in reinforcement learning for robotics. International Journal of Machine Learning and Computing 5 (2), pp. 91.