1 Related Work
Researchers have developed reinforcement learning (RL) methods for solving various combinatorial optimization problems. In bello2016neural, the authors trained a Pointer Network vinyals2015pointer to solve the Traveling Salesman Problem (TSP). Khalil et al. khalil2017learning combined graph embeddings and RL to solve a diverse set of combinatorial optimization problems, including Minimum Vertex Cover, Maximum Cut, and TSP. Chen and Tian chen2019learning proposed a method that learns policies to rewrite the heuristics in existing solvers for combinatorial optimization problems. Lu et al. Lu2020A showed that an RL-based method could outperform a classic operations research algorithm in terms of both average cost and time efficiency.
Many real-life applications can be formalized as sequence generation problems li2016deep; popova2018deep; Angermueller2020Model-based; mirhoseini2020chip. In li2016deep, the authors integrated RL and seq2seq models to automatically generate responses by simulating a dialogue between two agents. In Angermueller2020Model-based, the authors proposed a model-based variant of PPO for the large-batch, low-round setting of biological sequence design. Mirhoseini et al. mirhoseini2020chip combined graph neural networks with RL to sequentially place devices on a chip. These previous works all trained sequence generation models with policy gradient algorithms. In this work, we introduce a sequence generation network architecture tailored to the optical design task. Additionally, we combine local search with DRL to finetune the thicknesses of the generated layers.
Deep-learning-based inverse design ma2018deep; liu2018training; liu2018generative has been gaining popularity in recent years. In ma2018deep, the authors trained convolutional neural networks to directly predict design parameters from the design target given as the network input. Liu et al. liu2018generative trained a generative adversarial network (GAN) to inversely design optical devices by generating 2D shapes of the optical structure. However, these approaches all rely on a curated training set that contains diverse examples. When the goal is to push the performance limit of a device, near-optimal structures are unlikely to lie within the training data distribution. Thus, these static methods are not well suited to optimizing design performance. Our proposed method tackles this problem by actively searching the design space via reinforcement learning to generate high-performance designs. In jiang2019free, the authors also developed an active search process by adding high-quality data to augment the initial training set. However, their approach requires users to retrain the neural network with the augmented dataset, while our RL-based method accomplishes the design task within a single training process.

2 Methods
Multi-layer films can be treated as sequences. Each layer is represented as a pair (m_i, d_i). We can represent such a structure with n layers as s = ((m_1, d_1), (m_2, d_2), ..., (m_n, d_n)), where m_i and d_i denote the material and the thickness of the i-th layer (counting from the top), respectively. When designing optical multi-layer films, we hope to synthesize a sequence s that has the desired target spectral response T*. Thus, the design task is equivalent to a sequence generation problem, where we generate m_i and d_i in each step. Generation tasks such as dialogue generation li2016deep, molecule generation popova2018deep, and biological sequence generation Angermueller2020Model-based have been widely studied by machine learning researchers. In these works, researchers train a neural network as a generator for synthesizing sequences. Because we do not have ground-truth data for optimal design tasks, we apply reinforcement learning sutton2018reinforcement to train the sequence generator.

2.1 Sequence generation network
To generate the optical layer sequences, we use a recurrent neural network (RNN) hochreiter1997long. Unlike simple feed-forward neural networks, RNNs maintain a hidden state that contains useful information from the history of the sequence. Thus, RNNs are suitable for tasks that require memorizing history and have been widely used in sequence generation tasks graves2013generating. Gated recurrent units (GRUs) chung2014empirical and long short-term memory networks (LSTMs) hochreiter1997long are two popular variants of RNNs. Researchers have previously found that the empirical performance of GRUs and LSTMs is similar. Because GRUs have a simpler structure than LSTMs and require fewer parameters, we choose a GRU for generating the optical multi-layer structures. Similar to sampling words from a dictionary when generating a sentence, we sample the material for each layer from a fixed set of materials. Though the thickness is intrinsically a continuous variable, we sample it from a set of discrete values D to reduce the size of the exploration space. Later, we apply quasi-Newton methods zhu1997algorithm to finetune the layer thicknesses of the generated structure for further performance improvement.

Our optical multi-layer sequence generation network consists of a GRU and two multi-layer perceptrons (MLPs)
goodfellow2016deep. At generation step t, the GRU takes its own output from the previous step and the previous hidden state h_{t-1} as inputs to compute the new hidden state h_t. This auto-regressive generation process allows the GRU to remember what has been generated so far. To generate the material and thickness for layer t, the hidden state h_t of the GRU is inputted to two MLPs. One MLP outputs a logits vector z corresponding to all possible materials and an end-of-sequence (EOS) token. The other MLP outputs a thickness logits vector corresponding to all allowable thicknesses in the set D. Then, we transform these logits vectors with the softmax function to obtain proper probability distributions. Finally, the material and thickness are sampled from their corresponding distributions. The generation process stops either when the length reaches the maximum length N set by the user or when the EOS token is sampled. Thus, the number of layers of a generated structure is always lower than or equal to the maximum sequence length N. The process for generating a sequence is illustrated in Figure 2.

2.1.1 Non-repetitive gating
The aforementioned material sampling procedure does not prevent the same material from being sampled for adjacent layers. However, consecutive layers of the same material are equivalent to a single thicker layer, so allowing the sequence generator to assign the same material to adjacent layers leads to redundant computation. Moreover, it enlarges the exploration space and makes the search problem harder. We therefore introduce a non-repetitive gating function that removes the logit element corresponding to the most recently sampled material, preventing the generator from producing the same material twice in a row. This gating function is a matrix G formed by removing the row corresponding to the most recently sampled material from an identity matrix. When the logits vector z is multiplied by G, the element corresponding to that material is removed, i.e., z' = Gz. Then, we pass the transformed logits vector z' to the softmax layer to obtain the sampling probabilities. By doing so, we set the sampling probability of the recurring material to 0. With non-repetitive gating, the generated material sequence is guaranteed to have different materials for adjacent layers. Note that we do not apply the gating function at the first generation step because there is no previously sampled material.
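As a concrete illustration, the gating step can be sketched in numpy. The gating matrix below follows the identity-with-row-removed construction described above; the logits and material count are toy values, not the trained model's:

```python
import numpy as np

def non_repetitive_gate(num_materials: int, last_idx: int) -> np.ndarray:
    """Gating matrix G: identity with the row of the last-sampled material removed."""
    return np.delete(np.eye(num_materials), last_idx, axis=0)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def gated_material_probs(logits: np.ndarray, last_idx: int):
    """Apply G to the logits, then softmax over the remaining materials.
    Returns (indices of remaining materials, their sampling probabilities)."""
    G = non_repetitive_gate(len(logits), last_idx)
    z_gated = G @ logits  # the logit of the last material is removed entirely
    remaining = [i for i in range(len(logits)) if i != last_idx]
    return remaining, softmax(z_gated)

logits = np.array([2.0, 0.5, -1.0, 0.0])  # toy logits for 4 candidate materials
remaining, probs = gated_material_probs(logits, last_idx=0)
```

Because the row is removed before the softmax, the recently used material receives exactly zero probability rather than a small one.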
2.1.2 Auto-regressive generation of material and thickness
Because the proper thickness of a layer depends on its material, we input the sampled material to the thickness MLP in addition to the hidden state h_t. A similar approach has been applied in RL problems where the actions depend on each other vinyals2019grandmaster. Instead of using a one-hot vector to represent the material, we train a material embedding matrix E together with the sequence generator network. Each row of E is a continuous representation of one material of dimension k, where k is the embedding size. Using an embedding allows us to use a large number of materials without significantly increasing the dimensionality of the material representation. The material embedding vector for the sampled material is concatenated with the hidden state h_t to form the input to the thickness MLP.
The full sequence generator architecture is plotted in Figure 2(a). To understand the effect of non-repetitive gating and of modeling the dependency between material and thickness, we compare the proposed OML-PPO architecture against a baseline architecture in the Experiment section.
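The generation procedure described above can be sketched end to end. This is a toy numpy illustration with random stand-in weights: the matrices here are hypothetical placeholders for the trained GRU, material MLP, thickness MLP, and embedding matrix, and the simple tanh update stands in for the GRU cell:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_MATERIALS, NUM_THICKNESS, HID, EMB, MAX_LEN = 4, 6, 8, 5, 10
EOS = NUM_MATERIALS  # extra index reserved for the end-of-sequence token

# Toy stand-ins for the trained components (random weights for illustration).
W_h = rng.normal(size=(HID, HID + EMB + NUM_THICKNESS))  # recurrent update
W_mat = rng.normal(size=(NUM_MATERIALS + 1, HID))        # material MLP (+EOS)
W_thk = rng.normal(size=(NUM_THICKNESS, HID + EMB))      # thickness MLP
E = rng.normal(size=(NUM_MATERIALS, EMB))                # material embeddings

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def generate():
    h = np.zeros(HID)
    prev = np.zeros(EMB + NUM_THICKNESS)  # previous (embedding, thickness one-hot)
    layers, last_mat = [], None
    for _ in range(MAX_LEN):
        h = np.tanh(W_h @ np.concatenate([h, prev]))  # stand-in GRU step
        z = W_mat @ h                                 # material + EOS logits
        if last_mat is not None:
            z = np.delete(z, last_mat)                # non-repetitive gating
        choices = [i for i in range(NUM_MATERIALS + 1) if i != last_mat]
        mat = choices[rng.choice(len(choices), p=softmax(z))]
        if mat == EOS:
            break
        # thickness sampling is conditioned on the sampled material's embedding
        zt = W_thk @ np.concatenate([h, E[mat]])
        thk = rng.choice(NUM_THICKNESS, p=softmax(zt))
        layers.append((mat, thk))
        one_hot = np.zeros(NUM_THICKNESS)
        one_hot[thk] = 1.0
        prev = np.concatenate([E[mat], one_hot])
        last_mat = mat
    return layers

seq = generate()
```

The sketch preserves the three key properties of the generator: generation stops at EOS or at the maximum length, adjacent layers never share a material, and each thickness is sampled conditioned on the material chosen for that layer.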
2.2 Reinforcement learning training
We train the sequence generation network with reinforcement learning. The goal of reinforcement learning is to maximize the expected cumulative reward E[Σ_t γ^t r_t] by learning a policy π that maps a state s_t to an action a_t. Here, γ is the discount factor that penalizes future rewards and r_t is the reward at step t. The sequence generation network described above serves as the policy.
We represent the state at the t-th generation step as the concatenation of the last layer's information and the GRU hidden state. The actions correspond to the material and thickness of the current layer. We set the reward to 0 for all generation steps except the final one. At the final step (i.e., when the structure has been completely generated), we compute the spectrum of the generated structure with the optical spectrum calculation package TMM byrnes2016multilayer and assign the final reward based on how well the structure's spectrum matches the target spectrum. We also tried calculating the spectrum after every generation step and assigning intermediate rewards. However, this dense-reward approach is slow and does not improve performance, so we only report the final-reward approach here. We set the discount factor γ = 1. Thus, the cumulative reward for the generated sequence is simply the reward at the final step, which is defined as one minus the mean absolute error between the spectrum of the generated structure and the target spectrum:
r = 1 − (1 / (|Λ||Θ|)) Σ_{λ∈Λ} Σ_{θ∈Θ} |T(λ, θ) − T*(λ, θ)|    (1)

where T(λ, θ) is the spectrum of the generated structure at wavelength λ under incidence angle θ, and T*(λ, θ) is the target. Because 0 ≤ T(λ, θ), T*(λ, θ) ≤ 1, the cumulative reward is always non-negative. The reward becomes higher as the spectrum gets closer to the target, reaching 1 when the structure's spectrum perfectly matches the target spectrum.
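A minimal sketch of this reward, assuming the structure's spectrum has already been computed (e.g., by the TMM package) and discretized on a wavelength (and angle) grid; the grid size of 161 points is an arbitrary illustration:

```python
import numpy as np

def reward(spectrum: np.ndarray, target: np.ndarray) -> float:
    """One minus the mean absolute error between the simulated and target
    spectra, averaged over all sampled wavelengths (and incidence angles)."""
    return 1.0 - float(np.mean(np.abs(spectrum - target)))

target = np.ones(161)          # e.g. 100% absorption at every sampled wavelength
spectrum = np.full(161, 0.95)  # a structure absorbing 95% everywhere
```

A perfect match yields a reward of 1, and because both spectra lie in [0, 1], the reward can never be negative.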
During training, the sequence generator actively generates new structures and receives rewards. Our goal is to maximize the expected reward of structures sampled from the sequence generation network:
J = E_{s~π}[ r(s) ]    (2)
Based on the calculated rewards for the generated sequences, the agent adjusts its parameters with gradient ascent so that future rewards improve. Here, we use a policy gradient algorithm to compute the gradient for updating the sequence generator π. From the policy gradient theorem sutton2018reinforcement; schulman2017proximal, we have
∇J = E_{s~π}[ ∇ log p(s) Â(s) ]    (3)

where p(s) is the probability of sampling a structure s from the generator network and Â(s) is the estimated advantage function schulman2015high, which measures the performance of the generated sequence against the average performance of structures sampled from π.

Instead of directly updating the sequence generator using Eqn. 3, we use a state-of-the-art policy gradient algorithm, Proximal Policy Optimization (PPO) schulman2017proximal, to compute the policy gradient from a surrogate objective function:
L^{CLIP} = E_t[ min( ρ_t Â_t, clip(ρ_t, 1 − ε, 1 + ε) Â_t ) ]    (4)

where ρ_t is the importance weight, i.e., the ratio between the action probabilities under the updated and the old policies, which measures the distance between the policies before and after the gradient update. The clip function disincentivizes large update steps to the policy, where ε is a hyperparameter that controls the actual update size. Here, the advantage Â_t is estimated by Generalized Advantage Estimation (GAE) schulman2015high, which achieves a good balance between the bias and variance of the estimated gradients.
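The clipped surrogate objective above can be sketched numerically. This is a minimal illustration with made-up ratios and advantages, not the Spinning Up implementation used in this work:

```python
import numpy as np

def ppo_clip_objective(ratio: np.ndarray, advantage: np.ndarray,
                       eps: float = 0.2) -> float:
    """Clipped PPO surrogate, averaged over a batch of timesteps.
    ratio: pi_new(a|s) / pi_old(a|s) for each sampled action."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # taking the elementwise minimum removes the incentive for large updates
    return float(np.mean(np.minimum(unclipped, clipped)))

ratio = np.array([1.0, 1.5, 0.5])
adv = np.array([1.0, 1.0, -1.0])
obj = ppo_clip_objective(ratio, adv)
```

Note how the clip caps the gain from pushing a ratio far past 1 + ε when the advantage is positive, which is what allows several gradient steps on the same batch of trajectories without the policy drifting too far.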
φ denotes the parameters of a critic network that is trained together with the sequence generator. Compared to the vanilla policy gradient and actor-critic algorithms, PPO is more sample-efficient because it allows multi-step updates using the same batch of trajectories. Previous results show that PPO can achieve state-of-the-art performance on many tasks schulman2017proximal. With the computed policy gradient, the sequence generator parameters are updated using the Adam optimizer kingma2014adam. The model training process is summarized in Figure 4. Similar to the active search approach of Bello et al. bello2016neural, we output the best structure discovered throughout the entire training process as the final design. The pseudocode summarizing our design generation process is given in Algorithm 1.

Our model is implemented using PyTorch paszke2019pytorch and Spinning Up SpinningUp2018. The data used in this study and our code are publicly available at https://github.com/hammer-wang/oml-ppo.

3 Experiment
We applied the proposed method to two optical design tasks relevant to energy applications: 1) designing ultra-wideband absorbers and 2) designing incandescent light bulb filters. The designed ultra-wideband absorbers can help solar thermal panels absorb sunlight more efficiently, and the light bulb filter can enhance an incandescent bulb's efficiency in emitting visible light while suppressing radiation in the infrared range, which represents energy loss. We also conducted an ablation study to understand the effect of non-repetitive gating and auto-regressive material/thickness sampling.
Performance evaluation: In Task 1 (ultra-wideband absorber design), we measure the quality of a designed structure by its average absorption. In Task 2 (incandescent light bulb filter design), we calculate the visible light enhancement factor to measure the performance of designed structures.
3.1 Task 1: ultra-wideband absorber
First, we apply our algorithm to the task of designing an ultra-wideband absorber for the wavelength range [400, 2000] nm. We choose the target spectrum as a constant 100% absorption under normal light incidence (i.e., the light shines on the absorber at a right angle) to represent an ideal broadband absorber. This task has been previously studied by Yang et al. yang2016compact based on physical models, where broadband absorption is achieved by overlapping multiple absorption resonances and using an overall graded-index structure to minimize reflection. The authors designed a 5-layer structure using MgF_{2}, TiO_{2}, Si, Ge, and Cr. The simulated average absorption of their structure over the wavelength range is 95.37% under normal incidence. If not specified otherwise, we assume normal incidence when reporting average absorption.
Ag | Al | Al_{2}O_{3} | Cr | Fe_{2}O_{3} | Ge | HfO_{2} | MgF_{2} |
---|---|---|---|---|---|---|---|
Ni | Si | SiO_{2} | Ti | TiO_{2} | ZnO | ZnS | ZnSe |
We hypothesize that, when choosing from a larger set of materials than used in the previous work yang2016compact, it is possible to design a structure with higher average absorption than the human-designed structure. Thus, we expanded the original material set yang2016compact with 11 additional materials (16 in total). The set of materials is listed in Table 1. We set the available discrete thicknesses to be nm, with a total of 38 different values. When training the sequence generator, we set the learning rate to and the maximum length to . The material embedding size is set to 5, i.e., k = 5. The generator is trained for a total of epochs with the batch size set to generation steps. We repeat the training for 10 runs with different random seeds. The best structure discovered in each run was recorded and finetuned using the quasi-Newton method.
ID | Material | Thickness | ID | Material | Thickness |
---|---|---|---|---|---|
1 | MgF_{2} | 123 nm | 8 | Si | 15 nm |
2 | TiO_{2} | 32 nm | 9 | Cr | 17 nm |
3 | MgF_{2} | 21 nm | 10 | Ge | 15 nm |
4 | Si | 15 nm | 11 | TiO_{2} | 33 nm |
5 | TiO_{2} | 15 nm | 12 | Cr | 29 nm |
6 | Si | 15 nm | 13 | TiO_{2} | 81 nm |
7 | Ge | 15 nm | 14 | Cr | 116 nm |
It is worth noting that our algorithm can find structures very similar to the one reported in yang2016compact, i.e., it can rediscover the structure designed by human experts. One such structure is {(MgF_{2}, 112 nm), (TiO_{2}, 55 nm), (Ti, 30 nm), (Ge, 30 nm), (Cr, 200 nm)} with an average absorption of 96.12%, which has exactly the same material composition as the one reported previously yang2016compact. However, the best structure discovered by the algorithm, with a higher average absorption of 97.64%, is {(SiO_{2}, 115 nm), (Fe_{2}O_{3}, 70 nm), (Ti, 15 nm), (MgF_{2}, 124 nm), (Ti, 148 nm)}. Its spectrum under normal incidence is plotted in Figure 4(a).
We plot the best absorption values before and after finetuning for all ten runs in Figure 6. After finetuning, the average absorption of the discovered structure improved in every run. We found the algorithm robust to randomness during training: 8 out of the 10 runs achieved an absorption higher than 95% after finetuning.
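The quasi-Newton finetuning step can be sketched with SciPy's L-BFGS-B (the bounded quasi-Newton method of zhu1997algorithm). Here a toy quadratic objective stands in for the TMM-computed figure of merit, and the "ideal" thicknesses are hypothetical values chosen only for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in objective: in the real pipeline this would be
# 1 - (average absorption computed by TMM) for the fixed material
# sequence, as a function of the layer thicknesses d.
target = np.array([120.0, 30.0, 20.0, 150.0])  # toy "ideal" thicknesses (nm)

def objective(d: np.ndarray) -> float:
    return float(np.sum((d - target) ** 2))

d0 = np.array([100.0, 40.0, 15.0, 140.0])      # thicknesses from the RL design
bounds = [(15.0, 200.0)] * len(d0)             # keep thicknesses feasible

res = minimize(objective, d0, method="L-BFGS-B", bounds=bounds)
finetuned = res.x
```

Because the RL stage already lands near a good optimum, a local bounded quasi-Newton polish of the continuous thicknesses is cheap and reliably improves the discretized design.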
In an additional experiment, we explore whether the algorithm can design a structure with more layers to achieve even higher absorption. We set a larger maximum length and sample layer materials from MgF_{2}, TiO_{2}, Si, Ge, and Cr. The best discovered structure has 14 layers with an average absorption of 99.24%. The structure configuration is summarized in Table 2. We plot the normal-incidence spectrum of this structure in Figure 4(b). The structure discovered by OML-PPO reaches close-to-perfect performance under normal incidence and maintains high absorption over a wide range of angles.
3.2 Task 2: incandescent light bulb filter
To further test whether our method scales to more complicated tasks, we apply it to design a filter that enhances the luminous efficiency of incandescent light bulbs zhou2016efficient; ilic2016tailoring. The idea is to reflect the infrared light emitted by the light bulb filament so that its energy can be recycled. To this end, we set the target reflectivity to 0% in the range [480, 700] nm and 100% outside this range (Figure 6(a)). In this way, the infrared light, which cannot contribute to lighting, is reflected back to heat up the emitter.
A similar design has been previously studied ilic2016tailoring; shi2017optimization. We choose the same seven dielectric materials as the available set: Al_{2}O_{3}, HfO_{2}, MgF_{2}, SiC, SiN, SiO_{2}, and TiO_{2} shi2017optimization. Similar to our previous experiment, we train our policy for multiple runs with different random seeds. Here, we set the maximum allowed length and the learning rate to be . The number of epochs and batch size are 10,000 and 3,000, respectively. The best discovered structure is reported in the Appendix.
In Figure 7, we compare the average reflectivity, normalized over all incidence angles (0 - 90 degrees), of the 42-layer structure designed by our algorithm and the 41-layer structure designed by a memetic algorithm shi2017optimization. Our structure has a higher average reflectivity in the infrared range than the 41-layer structure.
We quantitatively evaluated the performance of the designed filter by calculating the enhancement factor for visible light (400 - 780 nm) under a fixed operating power. The results are reported in Table 3. Details of the enhancement factor calculation are included in the Appendix.
Model | Enhancement factor |
---|---|
OML-PPO | |
Memetic shi2017optimization |
3.3 Ablation study
On the ultra-wideband absorber design task, we conducted an ablation study to understand the effects of non-repetitive gating and auto-regressive generation of materials and thicknesses. We trained four different models: 1) OML-PPO with both non-repetitive gating and auto-regressive generation, 2) non-repetitive gating only, 3) auto-regressive generation only, and 4) neither. For each model, we repeated the training ten times. The maximum absorption values discovered by each model before finetuning are reported in Table 4. Both non-repetitive gating and auto-regressive material/thickness generation improve on the baseline model.
Figure 8: Training trajectory of OML-PPO and the baseline variants. (a) Average absorption trajectory. (b) Maximum absorption trajectory. The non-repetitive gating enables the model to converge to better solutions than models without the gating. The shaded area corresponds to one standard deviation.
Model | Average Absorption |
---|---|
OML-PPO | |
Only gating | |
Only auto-regressive | |
None (baseline) |
In Figure 8, we plot the average and maximum absorption of the structures generated in each epoch over the entire training trajectory. The effect of non-repetitive gating is more significant than that of auto-regressive material/thickness generation, as the OML-PPO and gating-only variants both significantly outperform the other two variants. Non-repetitive gating markedly improves model convergence during training. When non-repetitive gating and auto-regressive sampling are combined, the model achieves the best performance.
4 Conclusion
We introduced a novel sequence generation architecture and a deep reinforcement learning pipeline to automatically design optical multi-layer films. To the best of our knowledge, our work is the first to apply deep reinforcement learning to design multi-layer optical structures when the optimal number of layers is not known beforehand. Using a sequence generation network, the proposed method selects the material and thickness of each layer of a multi-layer structure sequentially. On the task of designing an ultra-wideband absorber, we demonstrate that our method achieves high performance robustly. The algorithm automatically discovered a 5-layer structure with 97.64% average absorption over the [400, 2000] nm range, which is 2% higher than a structure previously designed by human experts. When applied to generate a structure with more layers, the algorithm discovered a 14-layer structure with 99.24% average absorption, approaching perfect performance. On the task of designing incandescent light bulb filters, our method achieves an 8.5% higher visible light enhancement factor than a structure designed by a state-of-the-art memetic algorithm.
Through an ablation study, we showed that customizing the sequence generation network based on optical design domain knowledge can greatly improve the optimization performance. Our results demonstrated the high performance of the proposed method on complicated optical design tasks. Because the proposed method does not rely on hand-crafted heuristics, we believe that it can be applied to many other multi-layer optical design tasks such as lens design and multi-layer metasurface design.
References
Appendix A RL-designed 42-layer incandescent light bulb filter
ID | Material | Thickness | ID | Material | Thickness | ID | Material | Thickness |
---|---|---|---|---|---|---|---|---|
1 | SiO_{2} | 289 nm | 15 | SiC | 210 nm | 29 | SiC | 117 nm |
2 | SiN | 268 nm | 16 | SiN | 168 nm | 30 | MgF_{2} | 224 nm |
3 | MgF_{2} | 185 nm | 17 | MgF_{2} | 200 nm | 31 | SiC | 122 nm |
4 | SiN | 189 nm | 18 | SiC | 227 nm | 32 | MgF_{2} | 235 nm |
5 | SiC | 214 nm | 19 | SiN | 242 nm | 33 | SiC | 127 nm |
6 | SiN | 214 nm | 20 | MgF_{2} | 222 nm | 34 | MgF_{2} | 230 nm |
7 | MgF_{2} | 210 nm | 21 | SiC | 228 nm | 35 | SiC | 234 nm |
8 | SiN | 206 nm | 22 | MgF_{2} | 216 nm | 36 | MgF_{2} | 218 nm |
9 | SiC | 205 nm | 23 | SiC | 229 nm | 37 | SiC | 235 nm |
10 | SiN | 183 nm | 24 | MgF_{2} | 203 nm | 38 | MgF_{2} | 220 nm |
11 | MgF_{2} | 184 nm | 25 | SiC | 101 nm | 39 | SiC | 231 nm |
12 | SiN | 179 nm | 26 | MgF_{2} | 209 nm | 40 | MgF_{2} | 216 nm |
13 | SiC | 203 nm | 27 | SiC | 121 nm | 41 | SiC | 233 nm |
14 | SiN | 273 nm | 28 | MgF_{2} | 225 nm | 42 | Al_{2}O_{3} | 95 nm |
Appendix B Visible light enhancement factor
We first calculated the angle-averaged emissivity over a hemisphere:

ε(λ) = 2 ∫_0^{π/2} [1 − R(λ, θ)] sin θ cos θ dθ

where θ ∈ [0, π/2] and R(λ, θ) is the reflection of the structure at wavelength λ under the incidence angle θ. F is the view factor, equal to the proportion of the light from the emitter that can reach the filter. We compared two different view factors in our calculation. In addition, we assume the light bulb operates at P = 100 W and that the surface area of the emitter is equal to A. Then, we can solve for the temperature T of the light emitter with the power balance

P = A ∫ ε(λ) I_BB(λ, T) dλ
where I_BB(λ, T) is the blackbody emission intensity spectrum. With the first (second) view factor, the OML-PPO-designed filter leads to an emitter temperature of 3810 K (3553 K), while the structure designed by the memetic algorithm achieves a temperature of 3750 K (3498 K). The blackbody temperature under the same condition is calculated to be K. We measure the enhancement factor by

η = ∫ V(λ) ε(λ) I_BB(λ, T) dλ / ∫ V(λ) I_BB(λ, T_bb) dλ

where V(λ) is the human eye's sensitivity spectrum sharpe2005luminous and T_bb is the blackbody temperature above. Our structure achieves an enhancement factor of 16.60 (10.67) while the memetic structure has an enhancement factor of 15.30 (9.72). The 42-layer structure designed by OML-PPO outperforms the previous 41-layer design by 8.5% (9.8%) in terms of visible light enhancement.
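As an illustration of the power-balance step, the following sketch solves for the temperature of an ideal blackbody emitter (ε(λ) = 1) by bisection, integrating the Planck exitance numerically. The 100 W power is from the text, but the 1 cm² emitter area is a hypothetical value, since the actual area is not restated here:

```python
import math

H, C, KB = 6.62607015e-34, 2.99792458e8, 1.380649e-23
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, used only as a sanity check

def planck_exitance(lam: float, T: float) -> float:
    """Blackbody spectral exitance [W / m^2 / m] at wavelength lam [m]."""
    x = H * C / (lam * KB * T)
    if x > 700:  # avoid overflow; this tail's contribution is negligible
        return 0.0
    return (2 * math.pi * H * C**2 / lam**5) / math.expm1(x)

def radiated_power(T: float, area: float, n: int = 4000) -> float:
    """Integrate the Planck exitance over 50 nm .. 500 um on a log grid."""
    lo, hi = math.log(50e-9), math.log(500e-6)
    step = (hi - lo) / n
    total = 0.0
    for i in range(n):
        lam = math.exp(lo + (i + 0.5) * step)
        total += planck_exitance(lam, T) * lam * step  # d(lam) = lam d(log lam)
    return area * total

def solve_temperature(power: float, area: float,
                      lo: float = 300.0, hi: float = 6000.0) -> float:
    """Bisection on the power balance P = A * integral(I_BB(lam, T))."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if radiated_power(mid, area) < power:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical numbers for illustration: 100 W into a 1 cm^2 blackbody emitter.
T = solve_temperature(100.0, 1e-4)
```

For the real filter, `planck_exitance` would be weighted by the angle-averaged emissivity ε(λ) of the emitter-filter system, which is what shifts the solved temperature between designs at fixed power.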