As technology scaling approaches its physical limit, the lithography process is considered a critical step to continue Moore's law [moore1965cramming]. Even though the light wavelength used in the process is larger than the actual transistor feature size, recent advances in lithography processing, e.g., multi-patterning and optical proximity correction, have made it possible to overcome the sub-wavelength lithography gap [2015Optical]. On the other hand, due to the complex design rules and process control at sub-14nm, even with such lithography advances, circuit designers have to consider lithography-friendliness at the design stage as part of design for manufacturability (DFM) [2012Accurate].
Lithography hotspot detection (LHD) is one such essential DFM task, which is no longer optional for modern sub-14nm VLSI designs. A lithography hotspot is a mask layout location that is susceptible to fatal pinching or bridging owing to the poor printability of certain layout patterns. To avoid such unprintable patterns or layout regions, it is commonly required to conduct full mask lithography simulation to identify hotspots. While lithography simulation remains the most accurate method to recognize lithography hotspots, obtaining full-chip characteristics with this procedure can be very time-consuming [2003Hotspot]. To speed up the procedure, pattern matching and machine learning techniques have recently been deployed in LHD to save simulation time [2017Layout, 2014A, 2015Machine]. For example, [2014A] built a hotspot library to match and identify hotspot candidates, and [2015Machine] applied machine learning models trained on known hotspot data. However, the performance of all the aforementioned methods heavily depends on the quantity and quality of the underlying hotspot data used to build the library or train the model. Otherwise, these methods may generalize poorly, especially for unique design patterns or topologies at advanced technology nodes.
In practice, each design house may own a certain amount of hotspot data, which can be homogeneous (i.e., hotspot candidates that share the same feature space due to similar design patterns or layout topologies) and possibly insufficient to build a general and robust model/library through local learning. On the other hand, design houses are unwilling to directly share such data with other houses, or even with the tool developer, to build one unified model through centralized learning, due to privacy concerns. Recently, advances in federated learning in the deep learning community provide a promising alternative to address the aforementioned dilemma. Unlike centralized learning, which needs to collect the data at a centralized server, or local learning, which can only utilize a design house's own data, federated learning allows each design house to train the model locally and then upload the updated model, instead of the data, to a centralized server, which aggregates and re-distributes the updated global model back to each design house [mcmahan2017communication].
While federated learning naturally protects layout data privacy by avoiding direct access to local data, its performance (or even convergence) can be very problematic when data are heterogeneous (so-called non-independent and identically distributed, i.e., non-IID). Such heterogeneity is very common for lithography hotspot data, as each design house may have very unique design patterns and layout topologies, leading to lithography hotspot pattern heterogeneity. To overcome the challenge of heterogeneity in federated learning, the deep learning community has recently introduced many variants of federated learning [2020A, smith2018federated, 2017Model, 2018Federated].
For example, federated transfer learning [2020A] ingests knowledge from a source domain and reuses the model in a target domain. In [smith2018federated], the concept of federated multi-task learning is proposed to allow the model to learn the shared and unique features of different tasks. To provide more local model adaptability, [2017Model] used meta-learning to fine-tune the global model into different local models for different tasks. [2020Think] further separated the global and local representations of the model through alternating model updates, which may get trapped at a sub-optimal solution when the global representation is much larger than the local one. A recent work [2018Federated] presented a framework called FedProx that adds a proximal term to the objective to help handle statistical heterogeneity. Note that LHD differs from common deep learning applications: LHD features a limited number of design houses (several to tens), each of which usually has a reasonable amount of data (thousands to tens of thousands of layout clips). The prior federated learning variants [2020A, smith2018federated, 2017Model, 2020Think, 2018Federated] are not designed for LHD and hence can be inefficient without such domain knowledge. For example, meta-learning only loosely ensures model consistency among the local nodes and hence fails to learn the shared knowledge for LHD when the number of local nodes is small, while FedProx strictly enforces model consistency, yielding limited local model adaptivity to support local hotspot data heterogeneity. Thus, it is highly desirable to have an LHD framework that properly balances local data heterogeneity and global model robustness.
To address the aforementioned issues in centralized learning, local learning, and federated learning, in this work, we propose an accurate and efficient LHD framework using heterogeneous federated learning with local adaptation. The major contributions are summarized as follows:
The proposed framework accounts for the domain knowledge of LHD to design a heterogeneous federated learning framework for hotspot detection. A local adaptation scheme is employed to automatically balance the framework between local data heterogeneity and global model robustness.
While many prior works empirically decide the low-dimensional representation of the layout clips, we propose an efficient feature selection method to automatically select the most critical features and remove unnecessary redundancy to build a more compact and accurate feature representation.
A heterogeneous federated learning with local adaptation (HFL-LA) algorithm is presented to handle data heterogeneity with a global sub-model to learn shared knowledge and local sub-models to adapt to local data features. A synchronization scheme is also presented to support communication heterogeneity.
We perform a detailed theoretical analysis to provide the convergence guarantee for our proposed HFL-LA algorithm and establish the relationship between design parameters and convergence performance.
Experimental results show that our proposed framework outperforms the other local learning, centralized learning, and federated learning methods for various metrics and settings on both open-source and industrial datasets. Compared with federated learning and its variants [mcmahan2017communication, 2018Federated], the proposed framework achieves 7-11% accuracy improvement with an order of magnitude smaller false positive rate. Moreover, our framework maintains consistent performance when the number of clients increases and/or the size of the dataset reduces, while the performance of local learning quickly degrades in such scenarios. Finally, with guidance from the theoretical analysis, the proposed framework achieves faster convergence even with heterogeneous communication between the clients and the central server, while the other methods need around 5x more iterations to converge.
II-A Feature Tensor Extraction
(c) shows an example of concentric circle sampling, which samples from the layout clip in a concentric circling manner. These feature extraction methods exploit prior knowledge of lithographic layout patterns and hence can help reduce the layout representation complexity in LHD. However, as the spatial information surrounding the polygonal patterns within the layout clip is ignored, such methods may suffer from accuracy issues [2017Layout].
Another possible feature extraction is based on the spectral domain [2017Layout, yang2018layout], which can include more spatial information. For example, [2017Layout, yang2018layout] use the discrete cosine transform (DCT) to convert the layout spatial information into the spectral domain, where the coefficients after the transform are taken as the feature representation of the clip. Since such a feature tensor representation is still large and may cause non-trivial computational overhead, [yang2018layout] proposes to ignore the high-frequency components, which are assumed to be sparse and to carry limited useful information. However, this assumption does not necessarily hold for advanced technologies, which can exhibit subtle and abrupt shape changes. In other words, ignoring these components may discard critical features and hence cause accuracy loss.
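The spectral representation discussed above can be sketched as follows: a minimal 2-D DCT on a toy binary clip, where the clip size and the low-frequency truncation window are illustrative stand-ins, not the benchmark settings.

```python
import numpy as np

def dct2(block):
    """2-D DCT-II using the orthonormal DCT matrix (no SciPy dependency)."""
    n = block.shape[0]
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # spatial index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    C[0, :] = np.sqrt(1.0 / n)        # orthonormal scaling of the DC row
    return C @ block @ C.T

# Toy binary layout clip: an 8x8 grid with a vertical wire in the middle.
clip = np.zeros((8, 8))
clip[:, 3:5] = 1.0
coeffs = dct2(clip)

# Truncating to a low-frequency corner (as in [yang2018layout]) shrinks the
# feature tensor, but may drop high-frequency detail of abrupt shapes.
low_freq = coeffs[:4, :4]
```

Because the transform is orthonormal, the full coefficient tensor preserves the clip's energy; only the truncation step discards information.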
II-B Federated Learning
Federated learning allows local clients to collaboratively learn a shared model while keeping all the training data local [mcmahan2017communication]. Consider a set of $N$ clients connected to a central server, where each client $k$ can only access its own local data and has a local objective function $F_k(w)$. Federated learning can then be formulated as
$$\min_{w} F(w) \triangleq \sum_{k=1}^{N} p_k F_k(w),$$
where $w$ is the model parameter, $p_k$ is the weight of client $k$ with $\sum_{k=1}^{N} p_k = 1$, and $F(w)$ denotes the global objective function. FedAvg [mcmahan2017communication] is a popular federated learning method to solve the above problem. In FedAvg, the clients send updates of locally trained models to the central server in each round; the server then averages the collected updates and distributes the aggregated update back to all the clients. FedAvg works well with independent and identically distributed (IID) datasets but may suffer significant performance degradation when applied to non-IID datasets.
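For illustration, one FedAvg round can be sketched on a toy least-squares problem; the model, client data, and hyperparameters below are stand-ins, not the LHD network.

```python
import numpy as np

def local_update(w, data, lr=0.1, epochs=5):
    """A few gradient steps on a least-squares stand-in for the local
    loss F_k(w) = 0.5 * ||X w - y||^2 / n_k (real clients train a CNN)."""
    X, y = data
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def fedavg_round(w_global, client_data):
    """One FedAvg round: each client trains locally, then the server
    averages the returned models weighted by data size p_k = n_k / n."""
    n_total = sum(len(y) for _, y in client_data)
    updates = [local_update(w_global.copy(), d) for d in client_data]
    return sum((len(y) / n_total) * w
               for w, (_, y) in zip(updates, client_data))

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ w_true))  # noiseless IID toy data

w = np.zeros(2)
for _ in range(30):
    w = fedavg_round(w, clients)
# On this IID toy problem, w approaches w_true.
```

On non-IID data the averaged model instead drifts toward a compromise of conflicting local optima, which is exactly the degradation discussed above.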
III Proposed Framework
Fig. 2 demonstrates two commonly used procedures for LHD, i.e., local learning in Fig. 2(a) and centralized learning in Fig. 2(b). Both procedures contain two key steps, feature tensor extraction and learning. We adopt these two procedures as our baseline models for LHD. Table I defines the symbols that will be used in the rest of the paper.
|$W$|The set of weights of a CNN model|
|$w_g$|Global weights of the model|
|$w_l^k$|Local weights of the $k$th client model|
|$N$|Total number of clients|
|$n_k$|The data size of client $k$|
The performance of LHD can be evaluated by the true positive rate (TPR), the false positive rate (FPR), and the overall accuracy, which can be defined as follows.
Definition 1 (True Positive Rate).
The ratio between the number of correctly identified layout hotspots and the total number of hotspots.
Definition 2 (False Positive Rate).
The ratio between the number of wrongly identified layout hotspots (false alarms) and the total number of non-hotspots.
Definition 3 (Accuracy).
The ratio between the number of correctly classified clips and the total number of clips.
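The three metrics defined above can be computed as follows; a minimal sketch for binary labels, with 1 denoting a hotspot.

```python
def lhd_metrics(y_true, y_pred):
    """TPR, FPR, and accuracy for binary hotspot labels (1 = hotspot)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n_hs = sum(y_true)                 # total hotspots
    n_non = len(y_true) - n_hs         # total non-hotspots
    tpr = tp / n_hs                    # Definition 1
    fpr = fp / n_non                   # Definition 2 (false alarms)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return tpr, fpr, acc

# Toy labels: 3 hotspots, 5 non-hotspots; one miss and one false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr, acc = lhd_metrics(y_true, y_pred)  # 2/3, 1/5, 6/8
```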
With the definitions above, we propose to formulate the following heterogeneous federated learning based LHD:
Problem Formulation 1 (Heterogeneous Federated Learning Based Lithography Hotspot Detection).
Given $N$ clients (or design houses), each owning unique layout data, the proposed LHD aggregates information from the clients and creates a compact local sub-model on each client and a global sub-model shared across the clients. The global and local sub-models together form a unique hotspot detector for each client.
The proposed heterogeneous federated learning based LHD aims to support the heterogeneity at different levels: data, model, and communication:
Data: The hotspot patterns at each design house (client) can be non-IID.
Model: The optimized detector model includes global and local sub-models, where the local sub-model can be different from client to client through the local adaptation.
Communication: Unlike the prior federated learning [mcmahan2017communication], the framework allows asynchronous updates from the clients while maintaining good convergence.
Figure 3 presents an overview of the proposed framework to solve the above LHD problem with the desired features, which includes three key operations:
Feature Selection: An efficient feature selection method is proposed to automatically find critical features of the layout clip and remove unnecessary redundancy.
Global Aggregation: Global aggregation only updates the global sub-model shared across the clients, which has fewer parameters than the full model. This not only reduces the computational cost but also facilitates heterogeneous communication.
Local Adaptation: This operation allows the unique local sub-model at each client to have personalized feature representation of local non-IID layout data.
These operations connect the central server and clients together to build a privacy-preserving system, which allows distilled knowledge sharing through federated learning and balances global model robustness against local feature support. In the following, we discuss the three operations in detail.
III-B Feature Selection
As discussed in Sec. II-B, while the spectral-based method can utilize more spatial information, it may easily generate a very large feature vector. To reduce computational cost, the vector is often shortened based on prior knowledge or heuristics [yang2018layout, 2017Layout]. In this paper, we propose an automatic feature selection method that finds the most critical components while maintaining accuracy.
The proposed selection procedure is shown in Fig. 4. The input layout clip is first mapped to the spectral domain with DCT. Then we use Group Lasso training to remove unwanted redundancy [yuan2006model], a common regularization to induce grouped sparsity in a deep CNN model. Generally, the optimization regularized by Group Lasso is
$$\min_{W} \; L_D(W) + \lambda R(W) + \lambda_g \sum_{g=1}^{G} R_g\big(W^{(g)}\big),$$
where $W$ is the set of the weights, $L_D(W)$ is the loss on data, $R(W)$ is a general regularization term applied on all the weights (e.g., L2-norm), and $R_g(\cdot)$ is a structured L2 regularization on the specific weight group $W^{(g)}$. In particular, if we make the channels of each filter in the first convolution layer of a deep CNN model a penalized group, the optimization tends to remove less important channels. Since each channel directly corresponds to a channel in the feature space, this is equivalent to removing redundant feature channels. In other words, the remaining features are supposed to be the critical feature representation. The optimization target with the channel-wise Group Lasso penalty can be defined as
$$\min_{W} \; L_D(W) + \lambda \|W\|_2^2 + \lambda_g \sum_{c=1}^{C} \big\| W^{(1)}_{(:,c,:,:)} \big\|_2,$$
where $W^{(1)}$ is the weight of the first convolutional layer, $W^{(1)}_{(:,c,:,:)}$ is the $c$th channel of all the filters in $W^{(1)}$, $\|W\|_2^2$ is the L2 regularization term applied on all the weights, $\lambda$ is the L2 regularization strength, and $\lambda_g$ is the Group Lasso regularization strength. When $W^{(1)}_{(:,c,:,:)}$ corresponds to a feature channel with less impact on the data loss, our feature selection method tends to enforce the L2 norm of all the weights related to that channel to zero. The remaining feature channels are then the more critical features, leading to a reduction in the dimension of the layout clip representation.
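The channel-wise grouping described above can be sketched as follows; a numpy illustration of how first-layer conv weights are grouped by input channel and ranked, with shapes chosen for illustration (in practice the penalty is added to the CNN training loss in PyTorch).

```python
import numpy as np

def group_lasso_penalty(conv_weight):
    """Channel-wise Group Lasso on a conv weight of shape
    (out_channels, in_channels, kH, kW): one group per input channel,
    containing that channel's slice of every filter. The sum of
    (unsquared) L2 norms drives whole channels toward zero."""
    channel_norms = np.sqrt((conv_weight ** 2).sum(axis=(0, 2, 3)))
    return channel_norms.sum(), channel_norms

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 32, 3, 3))  # 32 input feature channels
w1[:, 5, :, :] = 0.0                  # a channel driven to zero by training
penalty, norms = group_lasso_penalty(w1)

# Rank channels by norm; near-zero channels are pruning candidates,
# and the top-ranked channels form the compact feature representation.
ranking = np.argsort(-norms)
```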
III-C Global Aggregation and Local Adaptation
Global aggregation and local adaptation are the key operations in the proposed Heterogeneous Federated Learning with Local Adaptation (HFL-LA) algorithm. HFL-LA is particularly designed for LHD with awareness of its unique domain knowledge: (1) the design patterns of different clients (design houses) may have a non-trivial portion in common, which motivates a larger global sub-model for knowledge sharing; (2) the number of clients may be limited, e.g., from several to tens; and (3) the local data at each client may be insufficient to support training a large local sub-model.
As shown in Fig. 3, HFL-LA adopts a flow similar to the conventional federated learning that has a central server to aggregate the information uploaded from the distributed clients. However, unlike the conventional federated learning, the model that each client maintains can be further decomposed into a global sub-model and a local sub-model, where: (1) the global sub-model is downloaded from the server and shared across the clients to fuse the common knowledge for LHD, and (2) the local sub-model is maintained within the client to adapt to the non-IID local data and hence, varies from client to client.
To derive such a model, we define the following objective function for optimization:
$$\min_{w_g, W_l} \; F(w_g, W_l) \triangleq \sum_{k=1}^{N} p_k F_k\big(w_g, w_l^k\big),$$
where $w_g$ is the global sub-model parameter shared by all the clients; $W_l$ is a matrix whose $k$th column $w_l^k$ is the local sub-model parameter for the $k$th client; $N$ is the number of clients; and $p_k$ is the contribution ratio of each client, with $n_k$ the data size of client $k$. By default, we can set $p_k = n_k / n$, where $n = \sum_{k=1}^{N} n_k$ is the total number of samples across all the clients. For the local data at client $k$, $F_k(\cdot)$ is the local (potentially non-convex) loss function, which is defined as
$$F_k\big(w_g, w_l^k\big) = \frac{1}{n_k} \sum_{j=1}^{n_k} \ell\big(w_g, w_l^k; x_j^k\big),$$
where $x_j^k$ is the $j$th sample of client $k$. As shown in Algorithm 1, in the $t$th round, the central server broadcasts the latest global sub-model parameter $w_g^t$ to all the clients. Then, each client (e.g., client $k$) starts with $w_g^t$ and conducts $s$ local updates for its local sub-model parameters:
$$w_l^{k,(j+1)} = w_l^{k,(j)} - \eta \, \nabla_{w_l} F_k\big(w_g^t, w_l^{k,(j)}; \xi_k^t\big), \quad j = 0, \dots, s-1,$$
where $w_l^{k,(j)}$ denote the intermediate variables locally updated by client $k$ in the $t$th round, with $w_l^{k,(0)} = w_l^{k,t}$; $\eta$ is the learning rate; and $\xi_k^t$ are the samples uniformly chosen from the local data in the $t$th round of training. After that, the local sub-model parameters at client $k$ become $w_l^{k,t+1} = w_l^{k,(s)}$, and the global sub-model parameters are then updated by $\tau$ steps of inner gradient descent as follows:
$$w_g^{k,(j+1)} = w_g^{k,(j)} - \eta \, \nabla_{w_g} F_k\big(w_g^{k,(j)}, w_l^{k,t+1}; \xi_k^t\big), \quad j = 0, \dots, \tau-1,$$
where $w_g^{k,(j)}$ denote the intermediate variables updated by client $k$ in the $t$th round, with $w_g^{k,(0)} = w_g^t$. Finally, client $k$ sends the global sub-model parameters $w_g^{k,t+1} = w_g^{k,(\tau)}$ back to the server, which then aggregates the global sub-model parameters of all the clients, i.e., $w_g^{t+1} = \sum_{k=1}^{N} p_k \, w_g^{k,t+1}$, to generate the new global sub-model.
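The per-round update flow described above can be sketched as follows; toy quadratic losses stand in for the CNN objective, and the step counts and learning rate are illustrative.

```python
import numpy as np

def hfl_la_round(w_g, w_locals, grads, p, eta=0.1, s=2, tau=2):
    """One HFL-LA round: each client k first takes s gradient steps on
    its local sub-model (with the broadcast global sub-model fixed),
    then tau inner steps on its copy of the global sub-model; the server
    averages only the global copies, weighted by p_k."""
    new_locals, global_copies = [], []
    for k, w_l in enumerate(w_locals):
        g_local, g_global = grads[k]
        for _ in range(s):                    # local adaptation steps
            w_l = w_l - eta * g_local(w_g, w_l)
        w_gk = w_g.copy()
        for _ in range(tau):                  # inner steps on global copy
            w_gk = w_gk - eta * g_global(w_gk, w_l)
        new_locals.append(w_l)
        global_copies.append(w_gk)
    w_g_new = sum(pk * w for pk, w in zip(p, global_copies))
    return w_g_new, new_locals

# Toy clients: F_k = 0.5*||w_g - a_k||^2 + 0.5*||w_l - b_k||^2
targets = [(np.array([1.0]), np.array([3.0])),
           (np.array([2.0]), np.array([-1.0]))]
grads = [(lambda wg, wl, b=b: wl - b,    # gradient w.r.t. w_l
          lambda wg, wl, a=a: wg - a)    # gradient w.r.t. w_g
         for a, b in targets]
p = [0.5, 0.5]
w_g, w_locals = np.zeros(1), [np.zeros(1), np.zeros(1)]
for _ in range(50):
    w_g, w_locals = hfl_la_round(w_g, w_locals, grads, p)
# w_g converges to the shared optimum (mean of a_k), while each
# w_l^k adapts to its own non-IID target b_k.
```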
The network architecture of each client used in our experiment has two convolution stages and two fully connected stages. Each convolution stage has two convolution layers, a rectified linear unit (ReLU) layer, and a max-pooling layer. The second fully connected layer is the output layer of the network, whose outputs correspond to the predicted probabilities of hotspot and non-hotspot. We note that the presented network architecture is just a specific example for the target application, and our proposed framework is not limited to specific network architectures.
III-D Communication Heterogeneity
In addition to data heterogeneity, the proposed framework also supports communication heterogeneity, i.e., the clients can conduct synchronous or asynchronous updates while still ensuring good convergence. For synchronous updates, all the clients participate in each round of global aggregation:
$$w_g^{t+1} = \sum_{k=1}^{N} \frac{n_k}{n} \, w_g^{k,t+1}.$$
All the clients then need to wait for the slowest client to finish the update. Due to the heterogeneity of data, the computational complexity and the willingness to participate in a synchronous or asynchronous update may vary from client to client. Thus, it is more realistic to assume that different clients may update at different rates. We can set a threshold $K$ and let the central server collect the outputs of only the first $K$ responding clients. After collecting $K$ outputs, the server stops waiting for the rest, i.e., the $(K+1)$th to $N$th clients are ignored in this round of global aggregation. Let $\mathcal{S}_t$ be the set of the indices of the first $K$ clients in the $t$th round; the global aggregation can then be rewritten as
$$w_g^{t+1} = \sum_{k \in \mathcal{S}_t} \frac{n_k}{n_{\mathcal{S}_t}} \, w_g^{k,t+1},$$
where $n_{\mathcal{S}_t} = \sum_{k \in \mathcal{S}_t} n_k$ is the sum of the sample data volume of the first $K$ clients.
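The partial (first-K) aggregation can be sketched as follows; the client values and data sizes are toy numbers for illustration.

```python
import numpy as np

def partial_aggregate(global_copies, n_sizes, responded):
    """Aggregate only the first-K responding clients S_t: each weight is
    n_k / n_{S_t}, where n_{S_t} is the total data size over S_t."""
    n_S = sum(n_sizes[k] for k in responded)
    return sum((n_sizes[k] / n_S) * global_copies[k] for k in responded)

# Four clients' global sub-model copies (scalars for simplicity)
copies = [np.array([1.0]), np.array([2.0]), np.array([4.0]), np.array([8.0])]
sizes = [10, 30, 40, 20]

# Suppose clients 1 and 2 respond first (K = 2); the rest are skipped
# in this round and simply keep training for the next one.
w_g = partial_aggregate(copies, sizes, responded=[1, 2])
# (30*2 + 40*4) / 70
```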
IV Convergence Analysis
In this section, we study the convergence of the proposed HFL-LA algorithm. Unlike the conventional federated learning, our proposed HFL-LA algorithm for LHD works with fewer clients, smaller data volume, and non-IID datasets, making the convergence analysis more challenging. Before proceeding into the main convergence result, we provide the following widely used assumptions on the local cost functions and stochastic gradients [2019Parallel].
Assumption 1. $F_1, \dots, F_N$ are all $L$-smooth, i.e., $\|\nabla F_k(u) - \nabla F_k(v)\| \le L \|u - v\|$, $\forall u, v$ and $k = 1, \dots, N$.
Assumption 2. Let $\xi_k$ be uniformly sampled from the $k$th client's local data. The variance of the stochastic gradients at each client is upper bounded, i.e., $\mathbb{E}\|\nabla F_k(w; \xi_k) - \nabla F_k(w)\|^2 \le \sigma_k^2$ for $k = 1, \dots, N$.
Assumption 3. The expected squared norm of the stochastic gradients is uniformly bounded by a constant $G^2$, i.e., $\mathbb{E}\|\nabla F_k(w; \xi_k)\|^2 \le G^2$ for all $k = 1, \dots, N$.
With the above assumptions, we are ready to present the following main results of the convergence of the proposed algorithm. The detailed proof can be found in the Appendix.
Lemma 1 (Consensus of global sub-model parameters).
Suppose Assumption 3 holds. Then, $\frac{1}{N}\sum_{k=1}^{N} \mathbb{E}\big\|w_g^{k,t} - \bar{w}_g^t\big\| = O(\eta \tau G)$, where $\bar{w}_g^t = \frac{1}{N}\sum_{k=1}^{N} w_g^{k,t}$.
The above lemma guarantees that the global sub-model parameters of all the clients reach consensus with an error proportional to the learning rate while the following theorem ensures the convergence of the proposed algorithm.
Theorem 1. Suppose Assumptions 1-3 hold. Then, for a constant step-size $\eta$ and $T$ rounds, we have
$$\frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\big\|\nabla F\big(\bar{w}_g^t, W_l^t\big)\big\|^2 \le O\!\left(\frac{1}{T}\right) + O\big(\sqrt{\eta}\big).$$
The above theorem shows that, with a constant step-size, the parameters of all clients converge to an $\epsilon$-neighborhood of a stationary point at a rate of $O(1/T)$. It should be noted that the second term of the steady-state error is proportional to the square root of the step-size and vanishes as the step-size approaches zero. This theorem sheds light on the relationship between design parameters and convergence performance, which helps guide the design of the proposed HFL-LA algorithm.
V Experimental Results
|Benchmarks|Training Set|Testing Set|Size/Clip|
We implement the proposed framework using the PyTorch library [paszke2019pytorch]. We use the following hyperparameter setup to conduct model training on each client: we train our models with the Adam optimizer with a fixed learning rate and batch size; in each round, we conduct a fixed number of local update iterations followed by global update iterations; and, to prevent overfitting, we apply L2 regularization. We adopt two benchmarks (ICCAD and Industry) for training and testing. We merge all the 28nm patterns in the test cases published in the ICCAD 2012 contest [torres2012iccad] into a unified benchmark denoted ICCAD, while Industry is obtained from our industrial partner at the 20nm technology node. Table II summarizes the benchmark details, including the training/testing sets as well as the layout clip size. In the table, columns "HS#" and "non-HS#" list the total numbers of hotspots and non-hotspots, respectively. Since the original layout clips have different sizes, clips in ICCAD are divided into nine blocks to have a size consistent with Industry. We note that, due to the different technologies and design patterns, the two benchmarks have different feature representations, and Industry has more diverse design patterns (i.e., higher data heterogeneity) than ICCAD.
V-A Feature Selection
This subsection presents the performance of the proposed feature selection method. As discussed in Sec. III-B, the L2 norm of each channel-wise group in the first convolutional layer is correlated with the contribution of the corresponding feature channel to model performance, as shown in Fig. 6. We then sort all the feature channels by their L2 norms and retrain our model from scratch with the selected top-$k$ channels. To validate the efficiency of our feature selection method, we test the performance of HFL-LA with different numbers of features representing the layout clips on the validation set and compare the results. Fig. 7 shows that HFL-LA achieves comparable (even slightly higher) accuracy with the reduced feature set suggested by the proposed selection method for both benchmarks, which indicates a computation reduction for the subsequent learning in comparison to the original 32 features.
V-B Heterogeneous Federated Learning with Local Adaptation
To demonstrate the performance of the proposed HFL-LA algorithm, we compare the results of HFL-LA with those of the state-of-the-art federated learning algorithms, FedAvg [mcmahan2017communication] and FedProx [2018Federated], as well as local and centralized learning. Here we have:
FedAvg: The conventional federated learning algorithm that averages the uploaded models [mcmahan2017communication].
FedProx: The algorithm adds a proximal term to the objective to handle the heterogeneity [2018Federated].
Local learning (denoted as local): The local learning algorithm that only uses the local data of each client.
Central learning (denoted as centralized): The central learning algorithm has access to all the training sets to train one unified model.
In our experiments, the training sets of the ICCAD and Industry benchmarks are merged together and then assigned to different numbers of clients as local data, i.e., 2, 4, and 10 clients. The testing sets are held out in advance, as shown in Table II, and used to validate the performance of the trained models. We compare the performance of the algorithms in terms of TPR, FPR, and accuracy, as defined in Sec. III-A, and summarize the results in Table III. In the experiments in Table III, all the clients communicate with the server in a synchronous manner, and we report the average performance across all the clients for the three scenarios of 2, 4, and 10 clients, with the best performance cases marked in bold. It is noted that the proposed HFL-LA achieves 7-11% improvement in accuracy (with corresponding gains in TPR and FPR) compared to FedAvg and FedProx. Owing to the use of only local homogeneous training data, local learning can achieve slightly better results for ICCAD. However, when the data heterogeneity increases, as in Industry, the performance of local learning quickly drops, yielding 4% degradation compared to HFL-LA.
We further compare the results when the model can be updated asynchronously for the scenarios of 4 and 10 clients, where half of the clients are randomly selected for training and update in each round. Since only federated learning based methods require model updates, we only compare HFL-LA with FedAvg and FedProx in Fig. 8. As shown in the figure, even with heterogeneous communication and updates, HFL-LA still achieves 5-10% accuracy improvement over the other federated learning methods [mcmahan2017communication, 2018Federated].
Finally, we compare the accuracy of different methods with different update mechanisms (synchronous and asynchronous, denoted as sync and async, respectively) for 10 clients during training. For the ICCAD benchmark in Fig. 9(a), local learning and HFL-LA achieve the highest accuracy and converge much faster than the other methods. Even with asynchronous updates, HFL-LA achieves a convergence rate and accuracy similar to the synchronous case. For Industry in Fig. 9(b), the superiority of HFL-LA is more obvious, outperforming all the other methods in terms of accuracy (e.g., 3.7% improvement over local learning). Moreover, HFL-LA achieves almost 5x convergence speedup compared to the other federated learning methods, even when adopting asynchronous updates.
In this paper, we propose a novel heterogeneous federated learning based hotspot detection framework with local adaptation. By adopting efficient feature selection and utilizing the domain knowledge of LHD, our framework supports heterogeneity in data, model, and communication. Experimental results show that our framework not only outperforms the alternative methods in terms of performance but also guarantees good convergence even in scenarios with high heterogeneity.