In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapidly generating optimized placements for previously unseen chip blocks. To achieve these results, we pose placement as a Reinforcement Learning (RL) problem and train an agent to place the nodes of a chip netlist onto a chip canvas. To enable our RL policy to generalize to unseen blocks, we ground representation learning in the supervised task of predicting placement quality. By designing a neural architecture that can accurately predict reward across a wide variety of netlists and their placements, we are able to generate rich feature embeddings of the input netlists. We then use this architecture as the encoder of our policy and value networks to enable transfer learning. Our objective is to minimize PPA (power, performance, and area), and we show that, in under 6 hours, our method can generate placements that are superhuman or comparable on modern accelerator netlists, whereas existing baselines require human experts in the loop and take several weeks.READ FULL TEXT VIEW PDF
Rapid progress in AI has been enabled by remarkable advances in computer systems and hardware, but with the end of Moore’s Law and Dennard scaling, the world is moving toward specialized hardware to meet AI’s exponentially growing demand for compute. However, today’s chips take years to design, leaving us with the speculative task of optimizing them for the machine learning (ML) models of 2-5 years from now. Dramatically shortening the chip design cycle would allow hardware to better adapt to the rapidly advancing field of AI. We believe that it is AI itself that will provide the means to shorten the chip design cycle, creating a symbiotic relationship between hardware and AI with each fueling advances in the other.
In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. The objective is to place a netlist graph of macros (e.g., SRAMs) and standard cells (logic gates, such as NAND, NOR, and XOR) onto a chip canvas, such that power, performance, and area (PPA) are optimized, while adhering to constraints on placement density and routing congestion (described in Sections 3.3.6 and 3.3.5). Despite decades of research on this problem, it is still necessary for human experts to iterate for weeks with the existing placement tools, in order to produce solutions that meet multi-faceted design criteria. The problem’s complexity arises from the sizes of the netlist graphs (millions to billions of nodes), the granularity of the grids onto which these graphs must be placed, and the exorbitant cost of computing the true target metrics (many hours and sometimes over a day for industry-standard electronic design automation (EDA) tools to evaluate a single design). Even after breaking the problem into more manageable subproblems (e.g., grouping the nodes into a few thousand clusters and reducing the granularity of the grid), the state space is still orders of magnitude larger than recent problems on which learning-based methods have shown success.
To address this challenge, we pose chip placement as a Reinforcement Learning (RL) problem, where we train an agent (e.g., RL policy network) to optimize the placements. In each iteration of training, all of the macros of the chip block are sequentially placed by the RL agent, after which the standard cells are placed by a force-directed method (Hanan and Kurtzberg, 1972; Tao Luo and Pan, 2008; Bo Hu and Marek-Sadowska, 2005; Obermeier et al., 2005; Spindler et al., 2008; Viswanathan et al., 2007, 2007). Training is guided by a fast-but-approximate reward signal for each of the agent’s chip placements.
To our knowledge, the proposed method is the first placement approach with the ability to generalize, meaning that it can leverage what it has learned from placing previous netlists to generate placements for new unseen netlists. In particular, we show that, as our agent is exposed to a greater volume and variety of chips, it becomes both faster and better at generating optimized placements for new chip blocks, bringing us closer to a future in which chip designers are assisted by artificial agents with vast chip placement experience.
We believe that the ability of our approach to learn from experience and improve over time unlocks new possibilities for chip designers. We show that we can achieve superior PPA on real AI accelerator chips (Google TPUs), as compared to state-of-the-art baselines. Furthermore, our methods generate placements that are superior or comparable to human expert chip designers in under 6 hours, whereas the highest-performing alternatives require human experts in the loop and take several weeks for each of the dozens of blocks in a modern chip. Although we evaluate primarily on AI accelerator chips, our proposed method is broadly applicable to any chip placement optimization.
Global placement is a longstanding challenge in chip design, requiring multi-objective optimization over circuits of ever-growing complexity. Since the 1960s, many approaches have been proposed, so far falling into three broad categories: 1) partitioning-based methods, 2) stochastic/hill-climbing methods, and 3) analytic solvers.
Starting in the 1960s, industry and academic labs took a partitioning-based approach to the global placement problem, proposing (Breuer, 1977; Kernighan, 1985; Fiduccia and Mattheyses, 1982), as well as resistive-network based methods (Chung-Kuan Cheng and Kuh, 1984; Ren-Song Tsay et al., 1988). These methods are characterized by a divide-and-conquer approach; the netlist and the chip canvas are recursively partitioned until sufficiently small sub-problems emerge, at which point the sub-netlists are placed onto the sub-regions using optimal solvers. Such approaches are quite fast to execute and their hierarchical nature allows them to scale to arbitrarily large netlists. However, by optimizing each sub-problem in isolation, partitioning-based methods sacrifice quality of the global solution, especially routing congestion. Furthermore, a poor early partition may result in an unsalvageable end placement.
In the 1980s, analytic approaches emerged, but were quickly overtaken by stochastic / hill-climbing algorithms, particularly simulated annealing (Kirkpatrick et al., 1983; Sechen and Sangiovanni-Vincentelli, 1986; Sarrafzadeh et al., 2003). Simulated annealing (SA) is named for its analogy to metallurgy, in which metals are first heated and then gradually cooled to induce, or anneal, energy-optimal crystalline surfaces. SA applies random perturbations to a given placement (e.g., shifts, swaps, or rotations of macros), and then measures their effect on the objective function (e.g., half-perimeter wirelength described in Section 3.3.1
). If the perturbation is an improvement, it is applied; if not, it is still applied with some probability, referred to as temperature. Temperature is initialized to a particular value and is then gradually annealed to a lower value. Although SA generates high-quality solutions, it is very slow and difficult to parallelize, thereby failing to scale to the increasingly large and complex circuits of the 1990s and beyond.
The 1990s-2000s were characterized by multi-level partitioning methods (Agnihotri et al., 2005; Roy et al., 2007), as well as the resurgence of analytic techniques, such as force-directed methods (Tao Luo and Pan, 2008; Bo Hu and Marek-Sadowska, 2005; Obermeier et al., 2005; Spindler et al., 2008; Viswanathan et al., 2007, 2007) and non-linear optimizers (Kahng et al., 2005; Chen et al., 2006). The renewed success of quadratic methods was due in part to algorithmic advances, but also to the large size of modern circuits (10-100 million nodes), which justified approximating the placement problem as that of placing nodes with zero area. However, despite the computational efficiency of quadratic methods, they are generally less reliable and produce lower quality solutions than their non-linear counterparts.
Non-linear optimization approximates cost using smooth mathematical functions, such as log-sum-exp (William et al., 2001) and weighted-average (Hsu et al., 2011) models for wirelength, as well as Gaussian (Chen et al., 2008) and Helmholtz models for density. These functions are then combined into a single objective function using a Lagrange penalty or relaxation. Due to the higher complexity of these models, it is necessary to take a hierarchical approach, placing clusters rather than individual nodes, an approximation which degrades the quality of the placement.
The last decade has seen the rise of modern analytic techniques, including more advanced quadratic methods (Kim et al., 2010, 2012b; Kim and Markov, 2012; Brenner et al., 2008; Lin et al., 2013), and more recently, electrostatics-based methods like ePlace (Lu et al., 2015b) and RePlAce (Cheng et al., 2019). Modeling netlist placement as an electrostatic system, ePlace (Lu et al., 2015b) proposed a new formulation of the density penalty where each node (macro or standard cell) of the netlist is analogous to a positively charged particle whose area corresponds to its electric charge. In this setting, nodes repel each other with a force proportional to their charge (area), and the density function and gradient correspond to the system’s potential energy. Variations of this electrostatics-based approach have been proposed to address standard-cell placement (Lu et al., 2015b) and mixed-size placement (Lu et al., 2015a, 2016). RePlAce (Cheng et al., 2019) is a recent state-of-the-art mixed-size placement technique that further optimizes ePlace’s density function by introducing a local density function, which tailors the penalty factor for each individual bin size. Section 5 compares the performance of the state-of-the-art RePlAce algorithm against our approach.
Recent work (Huang et al., 2019) proposes training a model to predict the number of Design Rule Check (DRC) violations for a given macro placement. DRCs are rules that ensure that the placed and routed netlist adheres to tape-out requirements. To generate macro placements with fewer DRCs, (Huang et al., 2019) use the predictions from this trained model as the evaluation function in simulated annealing. While this work represents an interesting direction, it reports results on netlists with no more than 6 macros, far fewer than any modern block, and the approach does not include any optimization during the place and the route steps. Due to the optimization, the placement and the routing can change dramatically, and the actual DRC will change accordingly, invalidating the model prediction. In addition, although adhering to the DRC criteria is a necessary condition, the primary objective of macro placement is to optimize for wirelength, timing (e.g. Worst Negative Slack (WNS) and Total Negative Slack (TNS)), power, and area, and this work does not even consider these metrics.
To address this classic problem, we propose a new category of approach: end-to-end learning-based methods. This type of approach is most closely related to analytic solvers, particularly non-linear ones, in that all of these methods optimize an objective function via gradient updates. However, our approach differs from prior approaches in its ability to learn from past experience to generate higher-quality placements on new chips. Unlike existing methods that optimize the placement for each new chip from scratch, our work leverages knowledge gained from placing prior chips to become better over time. In addition, our method enables direct optimization of the target metrics, such as wirelength, density, and congestion, without having to define convex approximations of those functions as is done in other approaches (Cheng et al., 2019; Lu et al., 2015b). Not only does our formulation make it easy to incorporate new cost functions as they become available, but it also allows us to weight their relative importance according to the needs of a given chip block (e.g., timing-critical or power-constrained).
Domain adaptation is the problem of training policies that can learn across multiple experiences and transfer the acquired knowledge to perform better on new unseen examples. In the context of chip placement, domain adaptation involves training a policy across a set of chip netlists and applying that trained policy to a new unseen netlist. Recently, domain adaptation for combinatorial optimization has emerged as a trend(Zhou et al., 2019; Paliwal et al., 2019; Addanki et al., 2019). While the focus in prior work has been on using domain knowledge learned from previous examples of an optimization problem to speed up policy training on new problems, we propose an approach that, for the first time, enables the generation of higher quality results by leveraging past experience. Not only does our novel domain adaptation produce better results, it also reduces the training time 8-fold compared to training the policy from scratch.
In this work, we target the chip placement optimization problem, in which the objective is to map the nodes of a netlist (the graph describing the chip) onto a chip canvas (a bounded 2D space), such that final power, performance, and area (PPA) is optimized. In this section, we describe an overview of how we formulate the problem as a reinforcement learning (RL) problem, followed by a detailed description of the reward function, action and state representations, policy architecture, and policy updates.
We take a deep reinforcement learning approach to the placement problem, where an RL agent (policy network) sequentially places the macros; once all macros are placed, a force-directed method is used to produce a rough placement of the standard cells, as shown in Figure 1
. RL problems can be formulated as Markov Decision Processes (MDPs), consisting of four key elements:
states: the set of possible states of the world (e.g., in our case, every possible partial placement of the netlist onto the chip canvas).
actions: the set of actions that can be taken by the agent (e.g., given the current macro to place, the available actions are the set of all the locations in the discrete canvas space (grid cells) onto which that macro can be placed without violating any hard constraints on density or blockages).
state transition: given a state and an action, this is the probability distribution over next states.
reward: the reward for taking an action in a state. (e.g., in our case, the reward is 0 for all actions except the last action where the reward is a negative weighted sum of proxy wirelength and congestion, subject to density constraints as described in Section 3.3).
In our setting, at the initial state, , we have an empty chip canvas and an unplaced netlist. The final state corresponds to a completely placed netlist. At each step, one macro is placed. Thus, is equal to the total number of macros in the netlist. At each time step , the agent begins in state (), takes an action (), arrives at a new state (), and receives a reward () from the environment (0 for and negative proxy cost for ).
We define to be a concatenation of features representing the state at time , including a graph embedding of the netlist (including both placed and unplaced nodes), a node embedding of the current macro to place, metadata about the netlist (Section 4), and a mask representing the feasibility of placing the current node onto each cell of the grid (Section 3.3.6).
The action space is all valid placements of the macro, which is a function of the density mask described in section 3.3.6. Action is the cell placement of the macro that was chosen by the RL policy network.
is the next state, which includes an updated representation containing information about the newly placed macro, an updated density mask, and an embedding for the next node to be placed.
In our formulation, is for every time step except for the final , where it is a weighted sum of approximate wirelength and congestion as described in Section 3.3.
Through repeated episodes (sequences of states, actions, and rewards), the policy network learns to take actions that will maximize cumulative reward. We use Proximal Policy Optimization (PPO) (Schulman et al., 2017) to update the parameters of the policy network, given the cumulative reward for each placement.
In this section, we define the reward , state , actions , policy network architecture parameterized by , and finally the optimization method we use to train those parameters.
Our goal in this work is to minimize power, performance and area, subject to constraints on routing congestion and density. Our true reward is the output of a commercial EDA tool, including wirelength, routing congestion, density, power, timing, and area. However, RL policies require 100,000s of examples to learn effectively, so it is critical that the reward function be fast to evaluate, ideally running in a few milliseconds. In order to be effective, these approximate reward functions must also be positively correlated with the true reward. Therefore, a component of our cost is wirelength, because it is not only much cheaper to evaluate, but also correlates with power and performance (timing). We define approximate cost functions for both wirelength and congestion, as described in Section 3.3.1 and Section 3.3.5, respectively.
To combine multiple objectives into a single reward function, we take the weighted sum of proxy wirelength and congestion where the weight can be used to explore the trade-off between the two metrics.
While we treat congestion as a soft constraint (i.e., lower congestion improves the reward function), we treat density as a hard constraint, masking out actions (grid cells to place nodes onto) whose density exceeds the target density, as described further in section 3.3.6.
To keep the runtime per iteration small, we apply several approximations to the calculation of the reward function:
We group millions of standard cells into a few thousand clusters using hMETIS (Karypis and Kumar, 1998), a partitioning technique based on the normalized minimum cut objective. Once all the macros are placed, we use force-directed methods to place the standard cell clusters, as described in section 3.3.4. Doing so enables us to achieve an approximate but fast standard cell placement that facilitates policy network optimization.
We discretize the grid to a few thousand grid cells and place the center of macros and standard cell clusters onto the center of the grid cells.
When calculating wirelength, we make the simplifying assumption that all wires leaving a standard cell cluster originate at the center of the cluster.
To calculate routing congestion cost, we only consider the average congestion of the top 10% most congested grid cells, as described in Section 3.3.5.
Following the literature (Shahookar and Mazumder, 1991), we employ half-perimeter wirelength (HPWL), the most commonly used approximation for wirelength. HPWL is defined as the half-perimeter of the bounding boxes for all nodes in the netlist. The HPWL for a given net (edge) is shown in the equation below:
Here and show the x and y coordinates of the end points of net . The overall HPWL cost is then calculated by taking the normalized sum of all half-perimeter bounding boxes, as shown in Equation 2.
is a normalization factor which improves the accuracy of the estimate by increasing the wirelength cost as the number of nodes increases, whereis the number of nets.
Intuitively, the HPWL for a given placement is roughly the length of its Steiner tree (Gilbert and Pollak, 1968), which is a lower bound on routing cost.
Wirelength also has the advantage of correlating with other important metrics, such as power and timing. Although we don’t optimize directly for these other metrics, we observe high performance in power and timing (as shown in Table 2).
Given the dimensions of the chip canvas, there are many choices to discretize the 2D canvas into grid cells. This decision impacts the difficulty of optimization and the quality of the final placement. We limit the maximum number of rows and columns to 128. We treat choosing the optimal number of rows and columns as a bin-packing problem and rank different combinations of rows and columns by the amount of wasted space they incur. We use an average of 30 rows and columns in the experiments described in Section 5.
To select the order in which the macros are placed, we sort macros by descending size and break ties using a topological sort. By placing larger macros first, we reduce the chance of there being no feasible placement for a later macro. The topological sort can help the policy network learn to place connected nodes close to one another. Another potential approach would be to learn to jointly optimize the ordering of macros and their placement, making the choice of which node to place next part of the action space. However, this enlarged action space would significantly increase the complexity of the problem, and we found that this heuristic worked in practice.
To place standard cell clusters, we use an approach similar to classic force-directed methods (Shahookar and Mazumder, 1991). We represent the netlist as a system of springs that apply force to each node, according to the
formula, causing tightly connected nodes to be attracted to one another. We also introduce a repulsive force between overlapping nodes to reduce placement density. After applying all forces, we move nodes in the direction of the force vector. To reduce oscillations, we set a maximum distance for each move.
We also followed convention in calculating proxy congestion (Kim et al., 2012a), using a simple deterministic routing based on the locations of the driver and loads on the net. The routed net occupies a certain amount of available routing resources (determined by the underlying semiconductor fabrication technology) for each grid cell which it passes through. We keep track of vertical and horizontal allocations in each grid cell separately. To smoothe the congestion estimate, we run convolutional filters in both the vertical and horizontal direction. After all nets are routed, we take the average of the top congestion values, drawing inspiration from the ABA10 metric in MAPLE (Kim et al., 2012a). The congestion cost in Equation 4 is the top average congestion calculated by this process.
We treat density as a hard constraint, disallowing the policy network from placing macros in locations which would cause density to exceed the target () or which would result in infeasible macro overlap. This approach has two benefits: (1) it reduces the number of invalid placements generated by the policy network, and (2) it reduces the search space of the optimization problem, making it more computationally tractable.
A feasible standard cell cluster placement should meet the following criterion: the density of placed items in each grid cell should not exceed a given target density threshold (). We set this threshold to be in our experiments. To meet this constraint, during each RL step, we calculate the current density mask, a binary matrix that represents grid cells onto which we can place the center of the current node without violating the density threshold criteria. Before choosing an action from the policy network output, we first take the dot product of the mask and the policy network output and then take the argmax over feasible locations. This approach prevents the policy network from generating placements with overlapping macros or dense standard cell areas.
We also enable blockage-aware placements (such as clock straps) by setting the density function of the blocked areas to .
To prepare the placements for evaluation by commercial EDA tools, we perform a greedy legalization step to snap macros onto the nearest legal position while honoring the minimum spacing constraints. We then fix the macro placements and use an EDA tool to place the standard cells and evaluate the placement.
For policy optimization purposes, we convert the canvas into a grid. Thus, for any given state, the action space (or the output of the policy network) is the probability distribution of placements of the current macro over the grid. The action is the argmax of this probability distribution.
Our state contains information about the netlist graph (adjacency matrix), its node features (width, height, type, etc.), edge features (number of connections), current node (macro) to be placed, and metadata of the netlist and the underlying technology (e.g., routing allocations, total number of wires, macros, and standard cell clusters, etc.). In the following section, we discuss how we process these features to learn effective representations for the chip placement problem.
Our goal is to develop RL agents that can generate higher quality results as they gain experience placing chips. We can formally define the placement objective function as follows:
Here is the cost function. The agent is parameterized by . The dataset of netlist graphs of size is denoted by with each individual netlist in the dataset written as . is the episode reward of a placement drawn from the policy network applied to netlist .
Equation 4 shows the reward that we used for policy network optimization, which is the negative weighted average of wirelength and congestion, subject to density constraints. The reward is explained in detail in Section 3.3. In our experiments, congestion weight is set to 0.01 and the max density threshold is set to 0.6.
We propose a novel neural architecture that enables us to train domain-adaptive policies for chip placement. Training such a policy network is a challenging task since the state space encompassing all possible placements of all possible chips is immense. Furthermore, different netlists and grid sizes can have very different properties, including differing numbers of nodes, macro sizes, graph topologies, and canvas widths and heights. To address this challenge, we first focused on learning rich representations of the state space. Our intuition was that a policy network architecture capable of transferring placement optimization across chips should also be able to encode the state associated with a new unseen chip into a meaningful signal at inference time. We therefore proposed training a neural network architecture capable of predicting reward on new netlists, with the ultimate goal of using this architecture as the encoder layer of our policy network.
To train this supervised model, we needed a large dataset of chip placements and their corresponding reward labels. We therefore created a dataset of 10,000 chip placements where the input is the state associated with a given placement and the label is the reward for that placement (wirelength and congestion). We built this dataset by first picking 5 different accelerator netlists and then generating 2,000 placements for each netlist. To create diverse placements for each netlist, we trained a vanilla policy network at various congestion weights (ranging from 0 to 1) and random seeds, and collected snapshots of each placement during the course of policy training. An untrained policy network starts off with random weights and the generated placements are of low quality, but as the policy network trains, the quality of generated placements improves, allowing us to collect a diverse dataset with placements of varying quality.
To train a supervised model that can accurately predict wirelength and congestion labels and generalize to unseen data, we developed a novel graph neural network architecture that embeds information about the netlist. The role of graph neural networks is to distill information about the type and connectivity of a node within a large graph into low-dimensional vector representations which can be used in downstream tasks. Some examples of such downstream tasks are node classification (Nazi et al., 2019), device placement (Zhou et al., 2019), link prediction (Zhang and Chen, 2018), and Design Rule Violations (DRCs) prediction (Univeristy, 2018).
We create a vector representation of each node by concatenating the node features. The node features include node type, width, height, and x and y coordinates. We also pass node adjacency information as input to our algorithm. We then repeatedly perform the following updates: 1) each edge updates its representation by applying a fully connected network to an aggregated representation of intermediate node embeddings, and 2) each node updates its representation by taking the mean of adjacent edge embeddings. The node and edge updates are shown in Equation 5.
Node embeddings are denoted by s for , where is the total number of macros and standard cell clusters. Vectorized edges connecting nodes and are represented as . Both edge () and node () embeddings are randomly initialized and are 32-dimensional. is a , is a feedforward network and s are learnable weights corresponding to edges. shows the neighbors of . The outputs of the algorithm are the node and edge embeddings.
Our supervised model consists of: (1) The graph neural network described above that embeds information about node types and the netlist adjacency matrix. (2) A fully connected feedforward network that embeds the metadata, including information about the underlying semiconductor technology (horizontal and vertical routing capacity), the total number of nets (edges), macros, and standard cell clusters, canvas size and number of rows and columns in the grid. (3) A fully connected feedforward network (the prediction layer) whose input is a concatenation of the netlist graph and metadata embedding and whose output is the reward prediction. The netlist graph embedding is created by applying a reduce mean function on the edge embeddings. The supervised model is trained via regression to minimize the weighted sum of the mean squared loss of wirelength and congestion.
This supervised task allowed us to find the features and architecture necessary to generalize reward prediction across netlists. To incorporate this architecture into our policy network, we removed the prediction layer and then used it as the encoder component of the policy network as shown in Figure 2.
To handle different grid sizes corresponding to different choices of rows and columns, we set the grid size to , and mask the unused L-shaped section for grid sizes smaller than 128 rows and columns.
To place a new test netlist at inference time, we load the pre-trained weights of the policy network and apply it to the new netlist. We refer to placements generated by a pre-trained policy network with no finetuning as zero-shot placements. Such a placement can be generated in less than a second, because it only requires a single inference step of the pre-trained policy network. We can further optimize placement quality by finetuning the policy network. Doing so gives us the flexibility to either use the pre-trained weights (that have learned a rich representation of the input state) or further finetune these weights to optimize for the properties of a particular chip netlist.
Figure 2 depicts an overview of the policy network (modeled by in Equation 3) and the value network architecture that we developed for chip placement. The inputs to these networks are the netlist graph (graph adjacency matrix and node features), the id of the current node to be placed, and the metadata of the netlist and the semiconductor technology. The netlist graph is passed through our proposed graph neural network architecture as described earlier. This graph neural network generates embeddings of (1) the partially placed graph and (2) the current node. We use a simple feedforward network to embed (3) the metadata. These three embedding vectors are then concatenated to form the state embedding, which is passed to a feedforward neural network. The output of the feedforward network is then fed into the policy network (composed of 5 deconvolutions 111 The deconvolutions layers have a 3x3 kernel size with stride 2 and 16, 8, 4, 2, and 1 filter channels respectively.
The deconvolutions layers have a 3x3 kernel size with stride 2 and 16, 8, 4, 2, and 1 filter channels respectively.
and Batch Normalization layers) to generate a probability distribution over actions and passed to a value network (composed of a feedforward network) to predict the value of the input state.
In Equation 3, the objective is to train a policy network that maximizes the expected value () of the reward () over the policy network’s placement distribution. To optimize the parameters of the policy network, we use Proximal Policy Optimization (PPO) (Schulman et al., 2017) with a clipped objective as shown below:
where represents the expected value at timestep , is the ratio of the new policy and the old policy, and is the estimated advantage at timestep .
In this section, we evaluate our method and answer the following questions: Does our method enable domain transfer and learning from experience? What is the impact of using pre-trained policies on the quality of result? How does the quality of the generated placements compare to state-of-the-art baselines? We also inspect the visual appearance of the generated placements and provide some insights into why our policy network made those decisions.
Figure 3 compares the quality of placements generated using pre-trained policies to those generated by training the policy network from scratch. Zero-shot means that we applied a pre-trained policy network to a new netlist with no finetuning, yielding a placement in less than one second. We also show results where we finetune the pre-trained policy network on the details of a particular design for 2 and 12 hours. The policy network trained from scratch takes much longer to converge, and even after 24 hours, the results are worse than what the finetuned policy network achieves after 12 hours, demonstrating that the learned weights and exposure to many different designs are helping us to achieve higher quality placements for new designs in less time.
Figure 4 shows the convergence plots for training from scratch vs. training from a pre-trained policy network for Ariane RISC-V CPU. The pre-trained policy network starts with a lower placement cost at the beginning of the finetuning process. Furthermore, the pre-trained policy network converges to a lower placement cost and does so more than 30 hours faster than the policy network that was trained from scratch.
|Replacing Deep RL with SA in our framework||Ours|
As we train on more chip blocks, we are able to speed up the training process and generate higher quality results faster. Figure 5 (left) shows the impact of a larger training set on performance. The training dataset is created from internal TPU blocks. The training data consists of a variety of blocks including memory subsystems, compute units, and control logic. As we increase the training set from 2 blocks to 5 blocks and finally to 20 blocks, the policy network generates better placements both at zero-shot and after being finetuned for the same number of hours. Figure 5 (right) shows the placement cost on the test data, as the policy network is being (pre-)trained. We can see that for the small training dataset, the policy network quickly overfits to the training data and performance on the test data degrades, whereas it takes longer for the policy network to overfit on largest dataset and the policy network pre-trained on this larger dataset yields better results on the test data. This plot suggests that as we expose the policy network to a greater variety of distinct blocks, while it might take longer for the policy network to pre-train, the policy network becomes less prone to overfitting and better at finding optimized placements for new unseen blocks.
Figure 6 show the placement results for the Ariane RISC-V CPU. On the left, placements from the zero-shot policy network and on the right, placements from the finetuned policy network are shown. The zero-shot placements are generated at inference time on a previously unseen chip. The zero-shot policy network places the standard cells in the center of the canvas surrounded by macros, which is already quite close to the optimal arrangement. After finetuning, the placements of macros become more regularized and the standard cell area in the center becomes less congested.
Figure 7 shows the visualized placements: on the left, results from a manual placement, and on the right, results from our approach are shown. The white area shows the macro placements and the green area shows the standard cell placements. Our method creates donut-shaped placements of macros, surrounding standard cells, which results in a reduction in the total wirelength.
In this section, we compare our method with 3 baselines methods: Simulated Annealing, RePlAce, and human expert baselines. For our method, we use a policy pre-trained on the largest dataset (of 20 TPU blocks) and then finetune it on 5 target unseen blocks denoted by Blocks 1 to 5. Our dataset consists a variety of blocks including memory subsystems, compute units, and control logic. Due to confidentiality, we cannot disclose the details of these blocks, but to give an idea of the scale, each block contains up to a few hundred macros and millions of standard cells.
|WNS (ps)||TNS (ns)||Total ()||Total (W)||(m)||H (%)||V (%)|
Comparisons with Simulated Annealing: Simulated Annealing (SA), is known to be a powerful, but slow, optimization method. However, like RL, simulated annealing is capable of optimizing arbitrary non-differentiable cost functions. To show the relative sample efficiency of RL, we ran experiments in which we replaced it with a simulated annealing based optimizer. In these experiments, we use the same inputs and cost function as before, but in each episode, the simulated annealing optimizer places all macros, followed by an FD step to place the standard cell clusters. Each macro placement is accepted according to the SA update rule using an exponential decay annealing schedule (Kirkpatrick et al., 1983). SA takes 18 hours to converge, whereas our method takes no more than 6 hours. To make comparisons fair, we ran multiple SA experiments that sweep different hyper-parameters, including min and max temperature, seed, and max SA episodes, such that SA and RL spend the same amount of CPU-hours in simulation and search a similar number of states. The results from the experiment with minimum cost are reported in Table 1. As shown in the table, even with additional time, SA struggles to produce high-quality placements compared to our approach, and produces placements with higher wirelength and higher congestion on average.
Comparisons with RePlAce (Cheng et al., 2019) and manual baselines: Table 2 compares our results with the state-of-the-art method RePlAce (Cheng et al., 2019) and manual baselines. The manual baseline is generated by a production chip design team, and involved many iterations of placement optimization, guided by feedback from a commercial EDA tool over a period of several weeks.
With respect to RePlAce, we share the same optimization goals, namely to optimize global placement in chip design, but we use different objective functions. Thus, rather than comparing results from different cost functions, we treat the output of a commercial EDA tool as ground truth. To perform this comparison, we fix the macro placements generated by our method and by RePlAce and allow a commercial EDA tool to further optimize the standard cell placements, using the tool’s default settings. We then report total wirelength, timing (worst (WNS) and total (TNS) negative slack), area, and power metrics. As shown in Table 2, our method outperforms RePLAce in generating placements that meet the design requirements. Given constraints imposed by the underlying semiconductor technology, placements of these blocks will not be able to meet timing constraints in the later stage of the design flow if the WNS is significantly above 100 ps or if the horizontal or vertical congestion is over 1%, rendering some RePlAce placements (Blocks 1, 2, 3) unusable. These results demonstrate that our congestion-aware approach is effective in generating high-quality placements that meet design criteria.
RePlAce is faster than our method as it converges in 1 to 3.5 hours, whereas our results were achieved in 3 to 6 hours. However, some of the fundamental advantages of our approach are 1) our method can readily optimize for various non-differentiable cost functions, without the need to formulate closed form or differentiable equivalents of those cost functions. For example, while it is straightforward to model wirelength as a convex function, this is not true for routing congestion or timing. 2) our method has the ability to improve over time as the policy is exposed to more chip blocks, and 3) our method is able to adhere to various design constraints, such as blockages of differing shapes.
Table 2 also shows the results generated by human expert chip designers. Both our method and human experts consistently generate viable placements, meaning that they meet the timing and congestion design criteria. We also outperform or match manual placements in WNS, area, power, and wirelength. Furthermore, our end-to-end learning-based approach takes less than 6 hours, whereas the manual baseline involves a slow iterative optimization process with experts in the loop and can take multiple weeks.
Opportunities for further optimization of our approach: There are multiple opportunities to further improve the quality of our method. For example, the process of standard cell partitioning, row and column selection, as well as selecting the order in which the macros are placed all can be further optimized. In addition, we would also benefit from a more optimized approach to standard cell placement. Currently, we use a force-directed method to place standard cells due to its fast runtime. However, we believe that more advanced techniques for standard cell placement such as RePlAce (Cheng et al., 2019) and DREAMPlace (Lin et al., 2019) can yield more accurate standard cell placements to guide the policy network training. This is helpful because if the policy network has a clearer signal on how its macro placements affect standard cell placement and final metrics, it can learn to make more optimal macro placement decisions.
Implications for a broader class of problems: This work is just one example of domain-adaptive policies for optimization and can be extended to other stages of the chip design process, such as architecture and logic design, synthesis, and design verification, with the goal of training ML models that improve as they encounter more instances of the problem. A learning based method also enables further design space exploration and co-optimization within the cascade of tasks that compose the chip design process.
In this work, we target the complex and impactful problem of chip placement. We propose an RL-based approach that enables transfer learning, meaning that the RL agent becomes faster and better at chip placement as it gains experience on a greater number of chip netlists. We show that our method outperforms state-of-the-art baselines and can generate placements that are superior or comparable to human experts on modern accelerators. Our method is end-to-end and generates placements in under 6 hours, whereas the strongest baselines require human experts in the loop and take several weeks.
This project was a collaboration between Google Research and the Google Chip Implementation and Infrastructure (CI2) Team. We would like to thank Cliff Young, Ed Chi, Chip Stratakos, Sudip Roy, Amir Yazdanbakhsh, Nathan Myung-Chul Kim, Sachin Agarwal, Bin Li, Martin Abadi, Amir Salek, Samy Bengio, and David Patterson for their help and support.
DREAMPlace: deep learning toolkit-enabled gpu acceleration for modern vlsi placement. In Proceedings of the 56th Annual Design Automation Conference 2019, DAC ’19. Cited by: §5.5.
EPlace: electrostatics-based placement using fast fourier transform and nesterov’s method. ACM Trans. Des. Autom. Electron. Syst. 20 (2). External Links: Cited by: §2, §2.
RouteNet: routability prediction for mixed-size designs using convolutional neural network. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD, Cited by: §4.1.