Automated Optical Multi-layer Design via Deep Reinforcement Learning

06/21/2020 ∙ by Haozhu Wang, et al. ∙ University of Michigan

Optical multi-layer thin films are widely used in optical and energy applications requiring photonic designs. Engineers often design such structures based on their physical intuition. However, solely relying on human experts can be time-consuming and may lead to sub-optimal designs, especially when the design space is large. In this work, we frame the multi-layer optical design task as a sequence generation problem. A deep sequence generation network is proposed for efficiently generating optical layer sequences. We train the deep sequence generation network with proximal policy optimization to generate multi-layer structures with desired properties. The proposed method is applied to two energy applications. Our algorithm successfully discovered high-performance designs, outperforming structures designed by human experts in task 1, and a state-of-the-art memetic algorithm in task 2.




1 Related Work

Researchers have developed reinforcement learning methods for solving various combinatorial optimization problems. In bello2016neural, the authors trained a Pointer Network vinyals2015pointer to solve the Traveling Salesman Problem (TSP). Khalil et al. khalil2017learning combined graph embedding and RL to solve a diverse set of combinatorial optimization problems including Minimum Vertex Cover, Maximum Cut, and TSP. Chen and Tian chen2019learning proposed a method to learn policies that can rewrite the heuristics in existing solvers for combinatorial optimization problems. Lu et al. Lu2020A showed that an RL-based method could outperform a classic operations research algorithm in terms of both average cost and time efficiency.

Many real-life applications can be formalized as sequence generation problems li2016deep; popova2018deep; Angermueller2020Model-based; mirhoseini2020chip. In li2016deep, the authors integrated RL and seq2seq to automatically generate a response by simulating the dialogue between two agents. In Angermueller2020Model-based, the authors proposed a model-based variant of PPO to deal with the large-batch, low-round setting of biological sequence design. Mirhoseini et al. mirhoseini2020chip combined graph neural networks with RL for sequentially placing devices on a chip. These previous works all trained sequence generation models using policy gradient algorithms. In this work, we introduce a sequence generation network architecture tailored to the optical design task. Additionally, we combine local search with DRL to finetune the thicknesses of the generated layers.

Deep-learning-based inverse design ma2018deep; liu2018training; liu2018generative has been gaining popularity in recent years. In ma2018deep, the authors trained convolutional neural networks to directly predict design parameters using the design target as the input to the network. Liu et al. liu2018generative trained a generative adversarial network (GAN) to inversely design optical devices by generating 2D shapes of the optical structure. However, these approaches all rely on a curated training set that contains diverse examples. When our goal is to push the performance limit of certain devices, the near-optimal structures are unlikely to be within the training data distribution. Thus, these static methods are not appropriate for optimizing design performance. Our proposed method tackles this problem by actively searching the design space to generate high-performance designs via reinforcement learning. In jiang2019free, the authors also developed an active search process by adding high-quality data to augment the initial training set. However, their approach requires the users to retrain the neural network with the augmented dataset, while our RL-based method accomplishes the design task within one training process.

2 Methods

Multi-layer films can be treated as sequences. Each layer is represented as a pair $(m_i, d_i)$. We can represent a structure with $N$ layers as $S = \{(m_1, d_1), (m_2, d_2), \dots, (m_N, d_N)\}$, where $m_i$ and $d_i$ denote the material and the thickness of the $i$-th layer (counting from the top), respectively. When designing optical multi-layer films, we hope to synthesize a sequence that has the desired target spectral response $T^{*}$. Thus, the design task is equivalent to a sequence generation problem, where we generate $m_i$ and $d_i$ in each step. Generation tasks such as dialogue generation li2016deep, molecule generation popova2018deep, and biological sequence generation Angermueller2020Model-based have been widely studied by machine learning researchers. In these works, researchers train a neural network as a generator for synthesizing sequences. Because we do not have ground-truth data for optimal design tasks, we apply reinforcement learning sutton2018reinforcement to train the sequence generator.

2.1 Sequence generation network

To generate the optical layer sequences, we use a recurrent neural network (RNN) hochreiter1997long. Unlike simple feed-forward neural networks, RNNs maintain a hidden state that contains useful information from the history of the sequence. Thus, RNNs are suitable for tasks that require memorizing history and have been widely used in sequence generation tasks graves2013generating. Gated recurrent units (GRUs) and long short-term memory networks (LSTMs) hochreiter1997long are two popular variants of RNNs. Researchers have previously found that the empirical performance of GRUs and LSTMs is similar. Because GRUs have a simpler structure than LSTMs and require fewer parameters to train, we choose to use a GRU for generating the optical multi-layer structures. Similar to sampling words from a dictionary when generating a sentence, we sample the material from a fixed set of materials for each layer. Though the thickness is intrinsically a continuous variable, we choose to sample the thickness from a set of discrete values to reduce the size of the exploration space. Later, we apply quasi-Newton methods zhu1997algorithm to finetune the layer thicknesses of the generated structure for further performance improvement.

Figure 2: Optical multi-layer design as sequence generation. The generation process stops when either the EOS token is sampled or the length of the sequence reaches the maximum allowed length $L$.

Our optical multi-layer sequence generation network consists of a GRU and two multi-layer perceptrons (MLPs) goodfellow2016deep. At generation step $t$, the GRU takes its own output from the previous step, $o_{t-1}$, and the previous hidden state $h_{t-1}$ as the inputs to compute the hidden state $h_t$. This auto-regressive generation process allows the GRU to remember what has been generated so far. To generate the material and thickness for layer $t$, the hidden state $h_t$ of the GRU is fed to two MLPs. One of the MLPs outputs a logits vector $z_t^{m}$ corresponding to all possible materials and an end-of-sequence (EOS) token. The other MLP outputs a thickness logits vector $z_t^{d}$ corresponding to all allowable thicknesses in the set $\mathcal{D}$. Then, we transform these logits vectors with the softmax function to obtain proper probability distributions. Finally, the material $m_t$ and thickness $d_t$ are sampled from their corresponding distributions. The generation process stops either when the length reaches the maximum length $L$ set by the user or when the EOS token is sampled. Thus, the number of layers of a generated structure is always lower than or equal to the maximum sequence length $L$. The process for generating a sequence is illustrated in Figure. 2.
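The sampling loop described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `step_fn` is a hypothetical stand-in for the GRU and the two MLP heads (it is not part of the authors' code), and the EOS entry is assumed to be the last element of the material logits.

```python
import math
import random

EOS = "EOS"  # end-of-sequence token


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate_structure(step_fn, materials, thicknesses, max_len, rng):
    """Sample (material, thickness) pairs until EOS or max_len layers.

    step_fn(layers) is a hypothetical stand-in for the GRU plus the two
    MLP heads. It returns (material_logits, thickness_logits), where the
    material logits carry one extra trailing entry for EOS (an assumption).
    """
    choices = list(materials) + [EOS]
    layers = []
    while len(layers) < max_len:
        material_logits, thickness_logits = step_fn(layers)
        material = rng.choices(choices, weights=softmax(material_logits))[0]
        if material == EOS:
            break
        thickness = rng.choices(thicknesses, weights=softmax(thickness_logits))[0]
        layers.append((material, thickness))
    return layers


def uniform_step(layers):
    """Toy policy: uniform logits over 3 materials + EOS and 3 thicknesses."""
    return [0.0] * 4, [0.0] * 3


rng = random.Random(0)
structure = generate_structure(
    uniform_step, ["MgF2", "TiO2", "Cr"], [15, 50, 100], max_len=5, rng=rng
)
print(structure)
```

Because sampling stops at EOS or at `max_len`, the returned list never has more than `max_len` layers, and it can be empty if EOS is drawn first.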

Figure 3: Neural network architectures for generating optical multi-layer films. (a) We show one generation step in the plot. The hidden state $h_t$ of the GRU is passed to two MLPs to output material and thickness probabilities, respectively. The actual material $m_t$ and thickness $d_t$ for layer $t$ are sampled from categorical distributions parametrized by $p_t^{m}$ and $p_t^{d}$. Built upon the baseline architecture, our proposed model adds a non-repetitive gating function and an auto-regressive connection between the sampled material and the thickness MLP. (b) Illustration of how the non-repetitive gating works. Here we suppose there are a total of 5 materials. Thus, the gating matrix is of dimension .

2.1.1 Non-repetitive gating

The aforementioned material sampling procedure does not prevent the situation where the same material is sampled for adjacent layers. However, such consecutive layers of the same material are equivalent to a single thicker layer. Thus, allowing the sequence generator to generate the same material for adjacent layers leads to redundant computation. Moreover, doing so increases the exploration space size and makes the search problem harder. We therefore introduce a non-repetitive gating function that removes the logit element corresponding to the most recently sampled material, which prevents the sequence generator from generating the same material twice in a row. This gating function is a matrix $G_t$ formed by removing the row corresponding to the most recently sampled material from an identity matrix. When $G_t$ is multiplied with the logits vector $z_t^{m}$, the element corresponding to that material is removed, i.e., $\tilde{z}_t^{m} = G_t z_t^{m}$. Then, we pass the transformed logits vector $\tilde{z}_t^{m}$ to the softmax layer to obtain the sampling probability. By doing so, we set the sampling probability for the recurring material to 0. With the non-repetitive gating, the generated material sequence is guaranteed to have different materials for adjacent layers. Note that we do not apply the gating function for the first generation step because there is no previously sampled material.
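To make the gating step concrete, the following sketch builds the gating matrix by deleting one row of an identity matrix, applies it to a logits vector, and then runs softmax over the remaining entries. The function names and the example logits are illustrative assumptions; whether the EOS entry is part of the gated vector is left aside here.

```python
import math


def gating_matrix(size, prev_index):
    """Identity matrix of the given size with row prev_index removed."""
    return [
        [1.0 if col == row else 0.0 for col in range(size)]
        for row in range(size)
        if row != prev_index
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_probabilities(logits, prev_index):
    """Sampling probabilities in the original indexing; prev_index gets 0."""
    gate = gating_matrix(len(logits), prev_index)
    kept = softmax(matvec(gate, logits))  # one entry fewer than logits
    probs = list(kept)
    probs.insert(prev_index, 0.0)  # restore the removed position
    return probs


logits = [2.0, 1.0, 0.5, -1.0, 0.0]  # illustrative logits for 5 materials
gate = gating_matrix(len(logits), prev_index=0)
probs = gated_probabilities(logits, prev_index=0)
print(len(gate), len(gate[0]), probs)
```

Removing the entry before the softmax, rather than zeroing the probability afterwards, keeps the remaining probabilities normalized without an extra renormalization step.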

2.1.2 Auto-regressive generation of material and thickness

Because the proper thickness of a layer should depend on the material, we input the sampled material to the thickness MLP in addition to the hidden state $h_t$. A similar approach has been applied in RL problems where the actions are dependent on each other vinyals2019grandmaster. Instead of using a one-hot vector to represent the material, we train a material embedding matrix $E$ together with the sequence generator network. Each row of $E$ is a continuous representation of one material with $n_e$ entries, where $n_e$ is the embedding size. Using an embedding allows us to use a large number of materials without significantly increasing the dimensionality of the material representation. The material embedding vector for the sampled material is concatenated with the hidden state to form the input to the thickness MLP.
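A small sketch of the embedding lookup and concatenation may help here. It is an illustration only: the hidden size of 8, the random initialization, and the function names are assumptions, while the 16 materials and the embedding size of 5 match the values used in task 1.

```python
import random


def make_embedding(num_materials, embed_size, rng):
    """Randomly initialised embedding matrix, one row per material."""
    return [
        [rng.gauss(0.0, 1.0) for _ in range(embed_size)]
        for _ in range(num_materials)
    ]


def thickness_mlp_input(hidden_state, embedding, material_index):
    """Concatenate the GRU hidden state with the sampled material's row."""
    return list(hidden_state) + list(embedding[material_index])


rng = random.Random(0)
embedding = make_embedding(num_materials=16, embed_size=5, rng=rng)
hidden = [0.0] * 8  # hypothetical hidden size
features = thickness_mlp_input(hidden, embedding, material_index=3)
print(len(features))
```

The input to the thickness MLP grows with the embedding size only, so adding more materials adds rows to the embedding matrix without widening that input.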

The full sequence generator architecture is plotted in Figure. 3(a). To understand the effect of non-repetitive gating and of modeling the dependency between the material and the thickness, we compare the proposed OML-PPO architecture against a baseline architecture in the Experiment section.

2.2 Reinforcement learning training

We train the sequence generation network with reinforcement learning. The goal of reinforcement learning is to maximize the expected cumulative reward $\mathbb{E}\left[\sum_{t} \gamma^{t} r_t\right]$ by learning a policy $\pi_\theta$ that maps a state $s_t$ to an action $a_t$. Here, $\gamma$ is the discount factor that penalizes future rewards and $r_t$ is the reward at step $t$. The sequence generation network described above serves as the policy.

We represent the state $s_t$ at the $t$-th generation step as the concatenation of the last layer information and the GRU hidden state. The actions correspond to the material and thickness of the current layer. We set the reward to be 0 for all generation steps except the final step. At the final step (i.e., when the structure has been completely generated), we compute the spectrum of the generated structure with the optical spectrum calculation package TMM byrnes2016multilayer and assign the final reward based on how well the structure spectrum matches the target spectrum. We also tried calculating the spectrum after every generation step and assigning intermediate rewards. However, this dense-reward approach is slow and does not lead to improved performance. Thus, we only report the final-only approach here. We set the discount factor $\gamma = 1$. Thus, the cumulative reward for the generated sequence is simply the reward at the final step, which is defined as one minus the mean absolute error between the spectrum of the generated structure and the target spectrum:

$$R(S) = 1 - \frac{1}{|\Lambda|\,|\Theta|} \sum_{\lambda \in \Lambda} \sum_{\vartheta \in \Theta} \left| T_S(\lambda, \vartheta) - T^{*}(\lambda, \vartheta) \right| \tag{1}$$

where $T_S(\lambda, \vartheta)$ is the spectrum of the generated structure $S$ at wavelength $\lambda$ under incidence angle $\vartheta$, $T^{*}(\lambda, \vartheta)$ is the target spectrum, and $\Lambda$ and $\Theta$ are the sets of sampled wavelengths and incidence angles. Because $T_S(\lambda, \vartheta), T^{*}(\lambda, \vartheta) \in [0, 1]$, the cumulative reward is always non-negative. The reward increases as the spectrum gets closer to the target spectrum and reaches 1 when the structure spectrum perfectly matches the target spectrum.
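As a worked illustration of this reward, the sketch below computes one minus the mean absolute error between two spectra sampled on the same grid. The dictionary layout keyed by (wavelength, angle) and the three sample points are assumptions made for the example; the real spectra come from the TMM simulation.

```python
def reward(spectrum, target):
    """One minus the mean absolute error between two spectra.

    Both arguments map (wavelength_nm, angle_deg) to a value in [0, 1].
    """
    if spectrum.keys() != target.keys():
        raise ValueError("spectra must be sampled on the same grid")
    errors = [abs(spectrum[key] - target[key]) for key in target]
    return 1.0 - sum(errors) / len(errors)


# Toy example: a target of 100% absorption at three wavelengths, normal incidence.
target = {(400, 0): 1.0, (1200, 0): 1.0, (2000, 0): 1.0}
spectrum = {(400, 0): 0.9, (1200, 0): 1.0, (2000, 0): 0.8}
print(reward(spectrum, target))  # approximately 0.9
```

Since every sampled value lies in [0, 1], each absolute error is at most 1, which is why the reward cannot drop below 0.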

During training, the sequence generator actively generates new structures and receives rewards. Our goal is to maximize the expected reward for structures sampled from the sequence generation network:

$$J(\theta) = \mathbb{E}_{S \sim \pi_\theta}\left[ R(S) \right] \tag{2}$$

Based on the calculated rewards for generated sequences, the agent adjusts its parameters with gradient ascent so that future rewards can be improved. Here, we use a policy gradient algorithm to compute the gradient for updating the sequence generator $\pi_\theta$. From the policy gradient theorem sutton2018reinforcement; schulman2017proximal, we have

$$\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \pi_\theta}\left[ \nabla_\theta \log \pi_\theta(S)\, \hat{A}(S) \right] \tag{3}$$

where $\pi_\theta(S)$ is the probability of sampling a structure $S$ from the generator network and $\hat{A}(S)$ is the estimated advantage function schulman2015high, which measures the performance of the generated sequence compared against the average performance of structures sampled from $\pi_\theta$.
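One standard way to estimate such an advantage is Generalized Advantage Estimation, the estimator from schulman2015high. The sketch below is a minimal per-step version for a single finished episode. The default `lam=0.95` is an assumed value for illustration, not one reported in the text, and the critic values in the example are made up.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards[t] is the reward received after step t, and values[t] is the
    critic's estimate for the state at step t. The value after the final
    step is taken to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Sparse reward: only the final step is rewarded, as in the setup above.
rewards = [0.0, 0.0, 0.9]
values = [0.5, 0.5, 0.5]  # made-up critic outputs
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))
```

With `gamma=1` and `lam=1` every step receives the final reward minus the critic's estimate, while `lam=0` keeps only the one-step temporal-difference error, so `lam` trades variance against bias.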

Instead of directly updating the sequence generator using Eqn. 3, we use a state-of-the-art policy gradient algorithm, Proximal Policy Optimization (PPO) schulman2017proximal, to compute the policy gradient from a surrogate objective function:

$$J^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{S \sim \pi_{\theta_{\mathrm{old}}}}\left[ \min\left( r(\theta)\, \hat{A}(S),\ \mathrm{clip}\left( r(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) \hat{A}(S) \right) \right] \tag{4}$$

where $r(\theta) = \pi_\theta(S) / \pi_{\theta_{\mathrm{old}}}(S)$ is the importance weight that measures the distance between the policies before and after the gradient update. The $\mathrm{clip}$ function disincentivizes large update steps to the policy, where $\epsilon$ is a hyperparameter that affects the actual update size. Here, the advantage $\hat{A}$ is estimated by Generalized Advantage Estimation (GAE) schulman2015high, which achieves a good balance between bias and variance of the estimated gradients, using a critic network with parameters $\phi$ that is trained together with the sequence generator. Compared to the vanilla policy gradient and actor-critic algorithms, PPO is more sample-efficient because it allows multi-step updates using the same batch of trajectories. Previous results show that PPO can achieve state-of-the-art performance on many tasks schulman2017proximal. With the computed policy gradient, the sequence generator model parameters are updated using the Adam optimizer kingma2014adam. The model training process is summarized in Figure. 4. Similar to the active search approach in Bello et al. bello2016neural, we output the best structure discovered throughout the entire training process as the final design. The pseudocode that summarizes our design generation process is given in Algorithm 1.
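The clipped surrogate objective of PPO can be evaluated on a batch with a few lines of code. The sketch below only computes the objective value from log-probabilities and advantages. It does not compute gradients, and the default `epsilon=0.2` is the value suggested in the PPO paper rather than one stated in this text.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # importance weight r
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# The policy doubled the probability of a sample with positive advantage:
# the gain is capped at (1 + epsilon) * advantage.
print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

Taking the minimum of the clipped and unclipped terms means the objective stops rewarding the policy once the ratio leaves the interval around 1 in the direction that the advantage favors, which is what discourages large updates.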

Our model is implemented using PyTorch paszke2019pytorch and Spinning Up SpinningUp2018. The data used in this study and our code are publicly available.

Figure 4: Pipeline of the sequence generator training process. We first generate multi-layer structures using the sequence generator $\pi_\theta$. The spectra of the generated structures are simulated by the TMM module. Next, the PPO algorithm is applied to compute the policy gradient for updating the sequence generator model. We keep pushing the best discovered structure into a buffer of size 1. This process is repeated until convergence. Finally, we finetune the layer thicknesses to obtain the design.
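Input: target spectrum $T^{*}$, number of epochs $K$, batch size $B$, maximum length $L$
Output: Optical multi-layer sequence $S^{*}$
Initialize sequence generator parameters $\theta$;
Initialize critic network parameters $\phi$;
Initialize best design $S^{*}$;
for k = 1, …, K do
      Generate a batch of structures with the sequence generator and compute their rewards with TMM;
      Update $\theta$ and $\phi$ with PPO;
      Update the best design $S^{*}$ if a better structure was generated;
end for
Algorithm 1 OML-PPO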
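The outer loop of this procedure, including the size-1 buffer for the best design, can be sketched as follows. This is not the authors' implementation: `sample_batch`, `reward_fn`, and `update_policy` are hypothetical callables standing in for the sequence generator, the TMM-based reward, and the PPO update, and the toy usage replaces structures with single thickness values.

```python
import random


def active_search(sample_batch, reward_fn, update_policy, num_epochs):
    """Keep the best structure seen over all epochs (a buffer of size 1)."""
    best_design, best_reward = None, float("-inf")
    for _ in range(num_epochs):
        batch = sample_batch()
        rewards = [reward_fn(structure) for structure in batch]
        update_policy(batch, rewards)  # e.g. one PPO update; a stub below
        for structure, value in zip(batch, rewards):
            if value > best_reward:
                best_design, best_reward = structure, value
    return best_design, best_reward


# Toy stand-ins: a "structure" is one thickness, and the reward peaks at 100 nm.
rng = random.Random(0)
grid = list(range(10, 210, 10))


def sample():
    return [rng.choice(grid) for _ in range(8)]


def toy_reward(thickness):
    return 1.0 - abs(thickness - 100) / 100.0


best, score = active_search(sample, toy_reward, lambda batch, rewards: None, num_epochs=20)
print(best, score)
```

Returning the best design seen so far, rather than a sample from the final policy, means a good structure found early in training is never lost.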

3 Experiment

We applied the proposed method to two optical design tasks that are relevant to energy applications, i.e., 1) designing ultra-wideband absorbers and 2) designing incandescent light bulb filters. The designed ultra-wideband absorbers can help solar thermal panels to absorb the sunlight more efficiently and the light bulb filter can enhance incandescent light bulb efficiency in emitting visible light while suppressing the radiation in the infrared range that represents energy loss. We also did an ablation study to understand the effect of non-repetitive gating and auto-regressive materials/thickness sampling.

Performance evaluation: In task 1 ultra-wideband absorber design, we measure the quality of the designed structure by average absorption. In task 2 incandescent light bulb filter, we calculate the visible light enhancement factor to measure the performance of designed structures.
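For reference, the average absorption metric can be illustrated with a short sketch. By energy conservation, absorption at each wavelength is one minus reflection minus transmission. The equal weighting over sampled wavelengths and the sample values below are assumptions made for the example, since the text does not spell out the averaging scheme.

```python
def average_absorption(reflection, transmission):
    """Mean of A = 1 - R - T over equally weighted wavelength samples."""
    if len(reflection) != len(transmission):
        raise ValueError("R and T must be sampled at the same wavelengths")
    absorption = [1.0 - r - t for r, t in zip(reflection, transmission)]
    return sum(absorption) / len(absorption)


# Made-up reflection and transmission values at four wavelengths.
reflection = [0.02, 0.01, 0.03, 0.04]
transmission = [0.00, 0.00, 0.01, 0.01]
print(average_absorption(reflection, transmission))
```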

3.1 Task 1: ultra-wideband absorber

Firstly, we apply our algorithm to the task of designing an ultra-wideband absorber for the wavelength range [400, 2000] nm. We choose the target spectrum as a constant 100% absorption under normal light incidence angle (i.e., the light is shining at the absorber at a right angle) to represent an ideal broadband absorber. This task has been previously studied by Yang et al. yang2016compact based on physical models, where the broadband absorption is achieved by overlapping multiple absorption resonances and with an overall graded-index structure to minimize reflection. The authors designed a 5-layer structure using MgF2, TiO2, Si, Ge, and Cr. The simulated average absorption of their structure over the wavelength range is 95.37% under normal incidence. If not specified otherwise, we assume normal incidence when reporting average absorption.

Ag Al Al2O3 Cr Fe2O3 Ge HfO2 MgF2
Ni Si SiO2 Ti TiO2 ZnO ZnS ZnSe
Table 1: Available materials for constructing the ultra-wideband absorber.

We hypothesize that, when choosing from a larger set of materials than used in the previous work yang2016compact, it is possible to design a structure with higher average absorption than the human-designed structure. Thus, we expanded the original material set yang2016compact to include 11 more materials (16 total). The set of materials is listed in Table 1. We set the available discrete thicknesses to be nm with a total of 38 different values. When training the sequence generator, we set the learning rate to and the maximum length to . The material embedding size is set to 5. The generator is trained for a total of epochs with the batch size set to be generation steps. We repeat the training for ten runs with different random seeds. The best structure discovered in each run was recorded and finetuned using the quasi-Newton method.

ID Material Thickness ID Material Thickness
1 MgF2 123 nm 8 Si 15 nm
2 TiO2 32 nm 9 Cr 17 nm
3 MgF2 21 nm 10 Ge 15 nm
4 Si 15 nm 11 TiO2 33 nm
5 TiO2 15 nm 12 Cr 29 nm
6 Si 15 nm 13 TiO2 81 nm
7 Ge 15 nm 14 Cr 116 nm
Table 2: RL designed 14-layer structure with 99.24% average absorption.

It is worth noting that our algorithm can yield structures very similar to the one reported in yang2016compact, i.e., it can search for and find the structure designed by human experts. One such structure is {(MgF2, 112 nm), (TiO2, 55 nm), (Ti, 30 nm), (Ge, 30 nm), (Cr, 200 nm)} with an average absorption of 96.12%, which has exactly the same material composition as the one reported previously yang2016compact. However, the best structure discovered by the algorithm, exhibiting a higher average absorption of 97.64%, is {(SiO2, 115 nm), (Fe2O3, 70 nm), (Ti, 15 nm), (MgF2, 124 nm), (Ti, 148 nm)}. Its spectra under normal incidence are plotted in Figure 5(a).

Figure 5: Normal incidence spectrum for the best discovered absorber structures with 5 and 14 layers. R: reflection, T: transmission, A: absorption. We design the multi-layer thin film to have high absorption over the entire wavelength range. (a) Normal incidence spectrum for the 5-layer structure. (b) Normal incidence spectrum for the 14-layer structure.

We plot the best absorption values before and after finetuning of all ten runs in Figure. 6. After finetuning, the average absorptions for the discovered structures across all runs were improved. We found that the algorithm is robust to the randomness during training as 8 out of the 10 runs achieved an absorption that is higher than 95% after finetuning.

Figure 6: Absorption values before and after finetuning. Finetuning improves the average absorption of the structure discovered in each run. (a) Average absorption values before and after finetuning for each individual run. (b) Box plot of the ten average absorption values.

In an additional experiment, we explore whether the algorithm can design a structure with more layers to achieve even higher absorption. We set the maximum length and sample layer materials from MgF2, TiO2, Si, Ge, and Cr. The best discovered structure has 14 layers with an average absorption of 99.24%. The structure configuration is summarized in Table 2. We plot the normal incidence spectrum of the structure in Figure. 5(b). The structure discovered by OML-PPO reaches close-to-perfect performance under normal incidence and has high absorption over a wide range of angles.

3.2 Task 2: incandescent light bulb filter

To further test whether our method is scalable to more complicated tasks, we apply the proposed method to design a filter that can enhance the luminous efficiency of incandescent light bulbs zhou2016efficient; ilic2016tailoring. The idea is to reflect the infrared light emitted by the light bulb filament so that its energy can be recycled. To this end, we set the target reflectivity to be 0% in the range [480, 700] nm, and 100% outside this range (Figure. 7(a)). In this way, the infrared light, which cannot contribute to lighting, will be reflected back to heat up the emitter.

A similar design has been previously studied ilic2016tailoring; shi2017optimization. We choose the same seven dielectric materials as the available materials: Al2O3, HfO2, MgF2, SiC, SiN, SiO2, and TiO2 shi2017optimization. Similar to our previous experiment, we train our policy for runs with different random seeds. Here, we set the maximum allowed length and the learning rate to be . The number of epochs and batch size are 10,000 and 3,000, respectively. The best discovered structure is reported in Appendix A.

In Figure 7, we compare the average reflectivity normalized over all incidence angles (0 - 90 degree) of the 42-layer structure designed with our algorithm and the 41-layer structure designed by a memetic algorithm shi2017optimization. Our structure has a higher average reflectivity in the infrared range ( nm) than the 41-layer structure.

Figure 7: Results on the incandescent light bulb design. (a) Target spectrum and the average reflectivity of structures designed by OML-PPO and the memetic algorithm. (b) Emissive power spectrum. A good design will have high emissive power in the visible range [380, 780] nm. $F$ is the view factor, which equals the proportion of emitted light from the light bulb filament that can reach the light bulb filter. We report results under view factors 0.95 and 1.

We quantitatively evaluated the performance of the designed filter by calculating the enhancement factor for visible light (400 - 780 nm) under a fixed operating power. The results are reported in Table 3. Details of the enhancement factor calculation are included in Appendix B.

Model Enhancement factor
OML-PPO 16.60
Memetic shi2017optimization 15.30
Table 3: Visible light enhancement. Our RL-designed structure achieved 8.5% higher visible light enhancement than the structure designed by a memetic algorithm.

3.3 Ablation study

On the ultra-wideband absorber design task, we conducted an ablation study to understand the effect of non-repetitive gating and auto-regressive generation of materials and thicknesses. We trained four different models: 1) OML-PPO with both non-repetitive gating and auto-regressive generation, 2) non-repetitive gating only, 3) auto-regressive generation only, 4) neither non-repetitive gating nor auto-regressive generation. For each model, we repeated the training ten times. The maximum absorption values discovered by each model before finetuning are reported in Table 4. Both non-repetitive gating and auto-regressive material/thickness generation improve the performance of the baseline model.

Figure 8: Training trajectory of OML-PPO and other baseline algorithms. (a) Average absorption trajectory. (b) Maximum absorption trajectory. The non-repetitive gating enables the model to converge to better solutions than models without the gating. The shaded area corresponds to one standard deviation.

Model Average Absorption
Only gating
Only auto-regressive
None (baseline)
Table 4: Highest absorption values discovered by each algorithm across 10 runs. The mean average absorption values and standard deviations of the 10 runs are reported.

In Figure. 8, we plot the average absorption and maximum absorption of the structures generated in each epoch over the entire training trajectory. The effect of non-repetitive gating is more significant than auto-regressive material/thickness generation as the OML-PPO and the only-gating variants both significantly outperform the other two variants. The non-repetitive gating significantly improves the model convergence during training. When non-repetitive gating and the auto-regressive sampling are combined together, the model achieves the best performance.

4 Conclusion

We introduced a novel sequence generation architecture and a deep reinforcement learning pipeline to automatically design optical multi-layer films. To the best of our knowledge, our work is the first to apply deep reinforcement learning to design multi-layer optical structures with the optimal number of layers not known beforehand. Using a sequence generation network, the proposed method can select the material and thickness for each layer of a multi-layer structure sequentially. On the task of designing an ultra-wideband absorber, we demonstrate that our method can achieve high performance robustly. The algorithm automatically discovered a 5-layer structure with 97.64% average absorption over the [400, 2000] nm range, which is about 2 percentage points higher than a structure previously designed by human experts. When applied to generate a structure with more layers, the algorithm discovered a 14-layer structure with 99.24% average absorption, approaching perfect performance. On the task of designing incandescent light bulb filters, our method achieves an 8.5% higher visible light enhancement factor than a structure designed by a state-of-the-art memetic algorithm.

Through an ablation study, we showed that customizing the sequence generation network based on optical design domain knowledge can greatly improve the optimization performance. Our results demonstrated the high performance of the proposed method on complicated optical design tasks. Because the proposed method does not rely on hand-crafted heuristics, we believe that it can be applied to many other multi-layer optical design tasks such as lens design and multi-layer metasurface design.


Appendix A RL-designed 42-layer incandescent light bulb filter

ID Material Thickness ID Material Thickness ID Material Thickness
1 SiO2 289 nm 15 SiC 210 nm 29 SiC 117 nm
2 SiN 268 nm 16 SiN 168 nm 30 MgF2 224 nm
3 MgF2 185 nm 17 MgF2 200 nm 31 SiC 122 nm
4 SiN 189 nm 18 SiC 227 nm 32 MgF2 235 nm
5 SiC 214 nm 19 SiN 242 nm 33 SiC 127 nm
6 SiN 214 nm 20 MgF2 222 nm 34 MgF2 230 nm
7 MgF2 210 nm 21 SiC 228 nm 35 SiC 234 nm
8 SiN 206 nm 22 MgF2 216 nm 36 MgF2 218 nm
9 SiC 205 nm 23 SiC 229 nm 37 SiC 235 nm
10 SiN 183 nm 24 MgF2 203 nm 38 MgF2 220 nm
11 MgF2 184 nm 25 SiC 101 nm 39 SiC 231 nm
12 SiN 179 nm 26 MgF2 209 nm 40 MgF2 216 nm
13 SiC 203 nm 27 SiC 121 nm 41 SiC 233 nm
14 SiN 273 nm 28 MgF2 225 nm 42 Al2O3 95 nm
Table 5: RL designed incandescent light bulb filter with 42 layers. The total thickness is .
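As a quick consistency check on the tabulated design, the sketch below sums the layer thicknesses listed in Table 5 and verifies that no two adjacent layers share a material, which is the property the non-repetitive gating is meant to guarantee. The tabulated thicknesses appear to be rounded to whole nanometers, so the sum is only the total implied by the table and may differ slightly from the exact value.

```python
# Layers of Table 5 as (material, thickness in nm), listed from layer 1 to 42.
LAYERS = [
    ("SiO2", 289), ("SiN", 268), ("MgF2", 185), ("SiN", 189), ("SiC", 214),
    ("SiN", 214), ("MgF2", 210), ("SiN", 206), ("SiC", 205), ("SiN", 183),
    ("MgF2", 184), ("SiN", 179), ("SiC", 203), ("SiN", 273), ("SiC", 210),
    ("SiN", 168), ("MgF2", 200), ("SiC", 227), ("SiN", 242), ("MgF2", 222),
    ("SiC", 228), ("MgF2", 216), ("SiC", 229), ("MgF2", 203), ("SiC", 101),
    ("MgF2", 209), ("SiC", 121), ("MgF2", 225), ("SiC", 117), ("MgF2", 224),
    ("SiC", 122), ("MgF2", 235), ("SiC", 127), ("MgF2", 230), ("SiC", 234),
    ("MgF2", 218), ("SiC", 235), ("MgF2", 220), ("SiC", 231), ("MgF2", 216),
    ("SiC", 233), ("Al2O3", 95),
]

# Total thickness implied by the rounded table entries.
total_thickness_nm = sum(thickness for _, thickness in LAYERS)

# True when every pair of neighbouring layers uses different materials.
no_adjacent_repeats = all(
    upper[0] != lower[0] for upper, lower in zip(LAYERS, LAYERS[1:])
)

print(len(LAYERS), total_thickness_nm, no_adjacent_repeats)
```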

Appendix B Visible light enhancement factor

We first calculated the angle averaged emissivity over a hemisphere:

Here, $R(\lambda, \vartheta)$ is the reflection of the structure at wavelength $\lambda$ under the incidence angle $\vartheta$, and $F$ is the view factor, which equals the proportion of the light from the emitter that can reach the filter. We compared two different view factors, 0.95 and 1, in our calculation. In addition, we assume the light bulb operates at 100 W and the surface area of the emitter is equal to . Then, we can solve for the temperature of the light emitter with the equation:

where $I_{\mathrm{BB}}$ is the blackbody emission intensity spectrum. With view factor $F = 1$ ($F = 0.95$), the OML-PPO designed filter leads to an emitter temperature of 3810 K (3553 K), while the structure designed by the memetic algorithm achieves a temperature of 3750 K (3498 K). The blackbody temperature under the same condition is calculated to be K. We measure the enhancement factor by:

where $V(\lambda)$ is the human eye's sensitivity spectrum (sharpe2005luminous). Our structure achieves an enhancement factor of 16.60 (10.67), while the memetic structure has an enhancement factor of 15.30 (9.72). The 42-layer structure designed by OML-PPO outperforms the previous 41-layer design by 8.5% (9.8%) in terms of visible light enhancement.
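The reported improvements follow directly from the enhancement factors quoted above. The short check below recomputes them as relative gains; the function name is illustrative.

```python
def relative_gain_percent(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours / baseline - 1.0) * 100.0


# Enhancement factors quoted in the text for the two view factors.
gain_first = relative_gain_percent(16.60, 15.30)
gain_second = relative_gain_percent(10.67, 9.72)
print(round(gain_first, 1), round(gain_second, 1))
```

Rounded to one decimal place, the two gains are 8.5% and 9.8%, matching the figures stated in the text.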

Appendix C Angle-dependent absorption map for ultra-wideband absorbers

Figure 9: Angle-dependent absorption map for the best discovered absorber structures with 5 and 14 layers. Both achieve high absorption over a wide range of angles. (a) 5-layer structure. (b) 14-layer structure.

Appendix D Angle-dependent reflection map for incandescent light bulb filter

Figure 10: Angle-dependent reflection map for RL-designed structure and a structure designed by memetic algorithm. (a) structure designed by RL algorithm. (b) structure designed by memetic algorithm.