wavetorch
🌊 Numerically solving and backpropagating through the wave equation
Analog machine learning hardware platforms promise to be faster and more energy-efficient than their digital counterparts. Wave physics, as found in acoustics and optics, is a natural candidate for building analog processors for time-varying signals. Here we identify a mapping between the dynamics of wave physics, and the computation in recurrent neural networks. This mapping indicates that physical wave systems can be trained to learn complex features in temporal data, using standard training techniques for neural networks. As a demonstration, we show that an inversely-designed inhomogeneous medium can perform vowel classification on raw audio data by simple propagation of waves through such a medium, achieving performance that is comparable to a standard digital implementation of a recurrent neural network. These findings pave the way for a new class of analog machine learning platforms, capable of fast and efficient processing of information in its native domain.
Funding: This work was supported by a Vannevar Bush Faculty Fellowship (Grant No N00014-17-1-3030) from the U.S. Department of Defense, by the Gordon and Betty Moore Foundation (Grant No GBMF4744), by a MURI grant from the U.S. Air Force Office of Scientific Research (Grant No FA9550-17-1-0002), and by the Swiss National Science Foundation (Project No P300P2_177721). Author contributions: T.W.H. conceived of the idea with input from I.A.D.W. and M.M.. The software for performing numerical simulations and training of the wave equation was written by I.A.D.W. with input from T.W.H. and M.M.. The model for the standard RNN was developed and trained by M.M.. S.F. supervised the project. All authors contributed to analyzing the results and writing the manuscript. Competing interests: The authors have jointly filed for a provisional patent on the idea. The authors declare no other competing interests. Data and materials availability: The code for performing numerical simulations and training of the wave equation, as well as generating the figures presented in this paper, is available online at http://www.github.com/fancompute/wavetorch/.
Olga Russakovsky et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision 115, 211–252 (2015).
Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Deep learning with coherent nanophotonic circuits,” Nature Photonics 11, 441–446 (2017).
Tyler W. Hughes, Momchil Minkov, Yu Shi, and Shanhui Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864–871 (2018).
Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation 9, 1735–1780 (1997).

In the main text, we specified that the dynamics of the scalar field distribution, $u = u(x, y, t)$, are governed by the wave equation
$$\frac{\partial^2 u}{\partial t^2} = c^2 \cdot \nabla^2 u + f \tag{S1}$$
where $\nabla^2 = \partial^2/\partial x^2 + \partial^2/\partial y^2$ is the Laplacian operator, $c = c(x, y)$ is the spatial distribution of the wave speed, and $f = f(x, y, t)$ is a source term. For a nonlinear system, $c$ can depend on the wave amplitude. Eq. (S1) may be discretized in time using centered finite differences with a temporal step size of $\Delta t$, after which it becomes
$$\frac{u_{t+1} - 2u_t + u_{t-1}}{\Delta t^2} = c^2 \cdot \nabla^2 u_t + f_t \tag{S2}$$
Here, the subscript $t$ in $u_t$ is used to indicate the value of the scalar field at time step $t$. To connect Eq. (S2) to the RNN update equations from Eqs. 1 and 2, we express this in matrix form as
$$\begin{bmatrix} u_{t+1} \\ u_t \end{bmatrix} = \begin{bmatrix} 2 + \Delta t^2\, c^2 \nabla^2 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \end{bmatrix} + \Delta t^2 \begin{bmatrix} f_t \\ 0 \end{bmatrix} \tag{S3}$$
Then, the update equation for the wave equation defined by Eq. (S3) can be rewritten as
$$h_t = A(h_{t-1}) \cdot h_{t-1} + P^{(i)} \cdot x_t \tag{S4}$$
$$y_t = \left| P^{(o)} \cdot h_t \right|^2 \tag{S5}$$
where we have defined $A$ as the matrix appearing in Eq. (S3). The nonlinear dependence on $h_{t-1}$ is defined by the nonlinear wave speed described above.
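To make this correspondence concrete, a minimal Python sketch of the time stepping in Eqs. (S2)–(S4) is given below. It simply advances the pair $(u_t, u_{t-1})$, which plays the role of the hidden state $h_t$; the function names, arguments, and the slicing-based Laplacian are illustrative and do not reproduce the wavetorch implementation.

```python
import torch

def laplacian(u, h):
    """Five-point finite-difference Laplacian of u with grid spacing h (interior cells only)."""
    lap = torch.zeros_like(u)
    lap[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
                       - 4.0 * u[1:-1, 1:-1]) / h**2
    return lap

def step(u_t, u_tm1, c, f_t, dt, h):
    """One leapfrog step of Eq. (S2): returns the new hidden state (u_{t+1}, u_t)."""
    u_tp1 = 2.0 * u_t - u_tm1 + dt**2 * (c**2 * laplacian(u_t, h) + f_t)
    return u_tp1, u_t
```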
An absorbing region is introduced to approximate an open boundary condition Oskooi et al. (2008), corresponding to the grey region in Fig. 2B. This region is defined by a dampening coefficient, $b$, which has a cubic dependence on the distance from the interior boundary of the layer. The scalar wave equation with damping is defined by the inhomogeneous partial differential equation Elmore and Heald (2012)
$$\frac{\partial^2 u}{\partial t^2} + 2b \cdot \frac{\partial u}{\partial t} = c^2 \cdot \nabla^2 u + f \tag{S6}$$
where $u = u(x, y, t)$ is the unknown scalar field and $b = b(x, y)$ is the dampening coefficient. Here, we assume that $b$ can be spatially varying but is frequency-independent. For a time step indexed by $t$, Eq. S6 is discretized using centered finite differences in time to give
$$\frac{u_{t+1} - 2u_t + u_{t-1}}{\Delta t^2} + 2b \cdot \frac{u_{t+1} - u_{t-1}}{2\Delta t} = c^2 \cdot \nabla^2 u_t + f_t \tag{S7}$$
From Eq. S7, we may form a recurrence relation in terms of $u_{t+1}$, which leads to the following update equation
$$u_{t+1} = \frac{1}{1 + b\,\Delta t} \left[ \left( 2 + \Delta t^2\, c^2 \nabla^2 \right) u_t - \left( 1 - b\,\Delta t \right) u_{t-1} + \Delta t^2 f_t \right] \tag{S8}$$
Equation S8 therefore represents the discretized update equation for the scalar wave equation with damping. In matrix form, we may express Eq. (S8) as
$$\begin{bmatrix} u_{t+1} \\ u_t \end{bmatrix} = \begin{bmatrix} \dfrac{2 + \Delta t^2\, c^2 \nabla^2}{1 + b\,\Delta t} & -\dfrac{1 - b\,\Delta t}{1 + b\,\Delta t} \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \end{bmatrix} + \frac{\Delta t^2}{1 + b\,\Delta t} \begin{bmatrix} f_t \\ 0 \end{bmatrix} \tag{S9}$$
which also has the same form as Eqs. (5) and (6) of the main text.
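Extending the sketch above, the damped update of Eq. (S8) only changes the prefactors multiplying $u_{t+1}$ and $u_{t-1}$; setting $b = 0$ recovers the lossless update of Eq. (S2). Again, this is an illustrative sketch rather than the wavetorch implementation.

```python
import torch

def step_damped(u_t, u_tm1, c, b, f_t, dt, h):
    """One step of Eq. (S8) with a spatially varying damping b(x, y)."""
    # Five-point Laplacian, as in the previous sketch / Eq. (S14).
    lap = torch.zeros_like(u_t)
    lap[1:-1, 1:-1] = (u_t[2:, 1:-1] + u_t[:-2, 1:-1] + u_t[1:-1, 2:] + u_t[1:-1, :-2]
                       - 4.0 * u_t[1:-1, 1:-1]) / h**2
    u_tp1 = (2.0 * u_t - (1.0 - b * dt) * u_tm1
             + dt**2 * (c**2 * lap + f_t)) / (1.0 + b * dt)
    return u_tp1, u_t
```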
In the wave equation, we use the linear operators $P^{(i)}$ and $P^{(o)}$ to define the injection and measurement locations within our domain. Here we provide more details on these operators.
We start from the vectors $u_t$ and $f_t$, which are the discretized and flattened versions of the field distribution $u$ and the source $f$. Then, we define the linear operators $P^{(i)}_u$ and $P^{(o)}_u$, each column of which defines the respective spatial distribution of the injection and measurement points in this flattened basis. With this, we can write the injection of the input vector ($x_t$) as a matrix-vector multiplication
$$f_t = P^{(i)}_u \cdot x_t \tag{S10}$$
Similarly, as the output of the RNN at each time step is given by an intensity measurement of the scalar field, we may express this in terms of the flattened scalar field, $u_t$, as
$$y_t = \left| P^{(o)}_u \cdot u_t \right|^2 \tag{S11}$$
As the wave equation hidden state, $h_t$, is defined as the concatenation of $u_t$ and $u_{t-1}$, we define the following matrices for convenience, as they act only on the $u_t$ portion of $h_t$:
$$P^{(i)} = \begin{bmatrix} P^{(i)}_u \\ 0 \end{bmatrix} \tag{S12}$$
$$P^{(o)} = \begin{bmatrix} P^{(o)}_u & 0 \end{bmatrix} \tag{S13}$$
where $0$ is a matrix of all zeros. These matrices are used in the injection and measurement stages of the scalar wave update equations of the main text and thus serve a similar role to the $W^{(x)}$ and $W^{(y)}$ matrices of the traditional RNN in Eqs. (1) and (2). However, unlike $W^{(x)}$ and $W^{(y)}$, these matrices are fixed and are not trainable parameters.
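In practice, because each column of $P^{(i)}_u$ and $P^{(o)}_u$ is non-zero only at a single source or probe location, the injection and measurement can be carried out by simple indexing rather than by a dense matrix multiplication. The grid size and coordinates in the sketch below are hypothetical values chosen only for illustration.

```python
import torch

Ny, Nx = 100, 150                              # hypothetical grid size
src = (50, 20)                                 # hypothetical source (injection) cell
probes = [(30, 130), (50, 130), (70, 130)]     # hypothetical probe cells, one per vowel class

def inject(x_t):
    """Equivalent of f_t = P_u^(i) x_t for a single source location."""
    f_t = torch.zeros(Ny, Nx)
    f_t[src] = x_t
    return f_t

def measure(u_t):
    """Equivalent of y_t = |P_u^(o) u_t|^2: wave intensity at each probe cell."""
    return torch.stack([u_t[p] ** 2 for p in probes])
```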
The procedure for training the vowel recognition system is as follows. First, each vowel waveform is downsampled from its original recording, which has a 16 kHz sampling rate, to a sampling rate of 10 kHz. Next, the entire dataset of (3 classes) × (45 males + 48 females) = 279 vowel samples is divided into 5 groups of approximately equal size. Cross-validated training is performed with 4 out of the 5 sample groups forming a training set and 1 out of the 5 sample groups forming a testing set. Independent training runs are performed with each of the 5 groups serving as the testing set, and the results are averaged over all training runs. Each training run is performed for 30 epochs using the Adam optimization algorithm Kingma and Ba (2014b). During each epoch, every vowel sequence from the training set is windowed to a length of 1000, taken from the center of the sequence. This limits the computational cost of the training procedure by reducing the number of time steps through which gradients must be tracked.
All windowed samples from the training set are run through the simulation in batches of 9, and the categorical cross entropy loss between the output probe probability distribution and the correct one-hot vector for each vowel sample is computed. To encourage the optimizer to produce a binarized distribution of the wave speed with relatively large feature sizes, the optimizer minimizes this loss function with respect to a material density distribution, $\rho$, within a central region of the simulation domain, indicated by the green region in Fig. 2A. The distribution of the wave speed, $c$, is computed by first applying a low-pass spatial filter and then a projection operation to the density distribution. The details of this process are described in section S4. We use the Adam algorithm Kingma and Ba (2014b) with a learning rate of 0.0004 to perform the optimization. Fig. 2D illustrates the optimization process over several epochs, during which the wave velocity distribution converges to a final structure. At the end of each epoch, the classification accuracy is computed over both the testing and training sets. Unlike the training set, the full length of each vowel sample from the testing set is used.
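A schematic of one cross-validation fold of this training loop is sketched below. The model object standing in for the differentiable wave simulation, the dataset tensors, and the helper name `train_fold` are hypothetical placeholders; the window length, batch size, learning rate, and epoch count follow the values given above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_fold(model, X, y, train_idx, window=1000, epochs=30, batch_size=9, lr=4e-4):
    """Train a differentiable wave model on one cross-validation fold.
    X: (N, T) downsampled vowel waveforms; y: (N,) integer class labels."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X[train_idx], y[train_idx]),
                        batch_size=batch_size, shuffle=True)
    for epoch in range(epochs):
        for xb, yb in loader:
            start = (xb.shape[1] - window) // 2        # window taken from the sequence center
            xb = xb[:, start:start + window]
            probs = model(xb)                          # normalized probe distribution, shape (batch, 3)
            loss = torch.nn.functional.nll_loss(torch.log(probs + 1e-12), yb)
            optimizer.zero_grad()
            loss.backward()                            # reverse-mode autodiff through the simulation
            optimizer.step()
        # Classification accuracy on the full-length testing sequences is evaluated after each epoch.
    return model
```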
The mean energy spectrum of the three vowel classes after downsampling to 10 kHz is shown in Fig. S1. We observe that the majority of the energy for all vowel classes is below 1 kHz and that there is strong overlap between the mean peak energy of the ei and iy vowel classes. Moreover, the mean peak energy of the ae vowel class is very close to the peak energy of the other two vowels. Therefore, the vowel recognition task learned by the system in the main text is non-trivial.
Numerical modeling and simulation of the wave equation physics was performed using a custom software package written in Python (wavetorch). The software package was developed on top of the popular machine learning library, pytorch, to compute the gradients of the loss function with respect to the material distribution via reverse-mode automatic differentiation. In the context of inverse design in the fields of physics and engineering, this method of gradient computation is commonly referred to as the adjoint variable method and has a computational cost of one additional wave simulation. Using a machine learning platform to perform the numerical simulation greatly reduces opportunities for errors in the analytic derivation or numerical implementation of the gradient. The code for performing numerical simulations and training of the wave equation, as well as generating the figures presented in this paper, may be found online at http://www.github.com/fancompute/wavetorch/.
In wavetorch, the operation of $\nabla^2$ on a spatially discretized wave field, $u_t$, is carried out using the convolution
$$\nabla^2 u_t = \frac{1}{h^2} \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \ast u_t \tag{S14}$$
where $h$ is the step size of the spatial grid and $\ast$ denotes a convolution.
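A minimal sketch of Eq. (S14) using pytorch's 2D convolution is shown below; the kernel is the standard five-point stencil, and the function name and grid-spacing argument are illustrative.

```python
import torch
import torch.nn.functional as F

def laplacian_conv(u, h):
    """Eq. (S14): convolve the discretized field with the five-point stencil, scaled by 1/h^2."""
    kernel = torch.tensor([[0.0,  1.0, 0.0],
                           [1.0, -4.0, 1.0],
                           [0.0,  1.0, 0.0]]) / h**2
    # conv2d expects (batch, channel, H, W); padding=1 keeps the field size unchanged.
    return F.conv2d(u[None, None], kernel[None, None], padding=1)[0, 0]
```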
To create realistic devices with sufficiently large minimum feature sizes and a binarized distribution of the wave speed, we employed filtering and projection schemes during our optimization. Rather than updating the wave speed distribution directly, one may instead choose to update a design density $\rho$, which varies between 0 and 1 within the design region and describes the density of material in each pixel. To create a structure with larger feature sizes, a low-pass spatial filter, represented here as a convolution with a smoothing kernel $K$, can be applied to $\rho$ to create a filtered density, labelled $\tilde{\rho}$:
$$\tilde{\rho} = K \ast \rho \tag{S15}$$
For binarization of the structure, a projection scheme is used to recreate the final wave speed from the filtered density. We define $\bar{\rho}$ as the projected density, which is created from $\tilde{\rho}$ as
$$\bar{\rho} = \frac{\tanh\left(\beta \eta\right) + \tanh\left(\beta\,(\tilde{\rho} - \eta)\right)}{\tanh\left(\beta \eta\right) + \tanh\left(\beta\,(1 - \eta)\right)} \tag{S16}$$
Here, $\eta$ is a parameter between 0 and 1 that controls the mid-point of the projection, typically 0.5, and $\beta$ controls the strength of the projection, typically around 100.
Finally, the wave speed, $c$, can be determined from $\bar{\rho}$ as
$$c = c_0 + \left( c_1 - c_0 \right) \bar{\rho} \tag{S17}$$
where $c_0$ and $c_1$ are the background and optimized material wave speeds, respectively.
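A sketch of this filtering and projection pipeline is given below. The use of a normalized box blur for the low-pass filter and the blur radius are illustrative assumptions; `eta`, `beta`, `c0`, and `c1` correspond to $\eta$, $\beta$, $c_0$, and $c_1$ above.

```python
import math
import torch
import torch.nn.functional as F

def density_to_speed(rho, c0, c1, eta=0.5, beta=100.0, blur_radius=2):
    """Map the raw design density rho (values in [0, 1]) to a wave speed distribution."""
    # Low-pass filter (Eq. S15): here, a normalized box blur of the given radius.
    k = 2 * blur_radius + 1
    kernel = torch.ones(1, 1, k, k) / k**2
    rho_tilde = F.conv2d(rho[None, None], kernel, padding=blur_radius)[0, 0]
    # Projection (Eq. S16): pushes the filtered density toward 0 or 1.
    rho_bar = (math.tanh(beta * eta) + torch.tanh(beta * (rho_tilde - eta))) / \
              (math.tanh(beta * eta) + math.tanh(beta * (1.0 - eta)))
    # Wave speed (Eq. S17): interpolate between the background and material speeds.
    return c0 + (c1 - c0) * rho_bar
```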
We compare the wave equation results to a conventional RNN model as defined in Eqs. (1) and (2). The number of trainable parameters in the model is determined by the hidden state size $N_h$, as the model is given by three matrices $W^{(h)}$, $W^{(x)}$, and $W^{(y)}$ of size $N_h \times N_h$, $N_h \times 1$, and $3 \times N_h$, respectively. We tried $N_h = 70$, for which the total number of RNN free parameters is 5250, and $N_h = 100$, with 10500 free parameters. The RNN was implemented and trained using pytorch. In Table S1 we show the results of a standard RNN on the vowel recognition task and compare them to the scalar wave. We find that the conventional RNN achieves a performance comparable to that of the wave equation. However, this performance is highly dependent on the number of trainable parameters. For a similar number of trainable parameters, the conventional RNN achieves about 6% lower classification accuracy; when the number of free parameters is increased to about twice that of the scalar wave, its accuracy is higher by about 3%. We note that it is possible that more advanced recurrent models, such as long short-term memory (LSTM) Hochreiter and Schmidhuber (1997) or gated recurrent unit (GRU) Chung et al. (2014) networks, could achieve better performance with a smaller number of parameters, but exploring this is outside the scope of this study.

Table S1: Training and testing accuracy of the wave equation and the conventional RNN on the vowel recognition task.

| Model | Nonlinearity | Number of parameters | Training accuracy | Testing accuracy |
|---|---|---|---|---|
| Wave Equation | linear wave speed | 4200 | 93.1% | 86.6% |
| Wave Equation | nonlinear wave speed | 4200 | 92.6% | 86.3% |
| Conventional RNN | linear | 5250 | 78.8% | 79.4% |
| Conventional RNN | leaky ReLU | 5250 | 82.6% | 80.2% |
| Conventional RNN | linear | 10500 | 88.9% | 88.2% |
| Conventional RNN | leaky ReLU | 10500 | 89.4% | 89.4% |
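For reference, a pytorch sketch of a conventional RNN of the type described by Eqs. (1) and (2) is shown below. The bias term, the readout from the final hidden state, and other architectural details are assumptions and may differ from the exact model used to produce Table S1.

```python
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    """h_t = sigma(W_h h_{t-1} + W_x x_t); class scores computed from the final hidden state."""
    def __init__(self, hidden_size=70, num_classes=3, nonlinearity=torch.nn.functional.leaky_relu):
        super().__init__()
        self.W_h = nn.Linear(hidden_size, hidden_size)              # hidden-to-hidden (with bias)
        self.W_x = nn.Linear(1, hidden_size, bias=False)            # input-to-hidden
        self.W_y = nn.Linear(hidden_size, num_classes, bias=False)  # hidden-to-output
        self.sigma = nonlinearity

    def forward(self, x):            # x: (batch, T) raw audio sequence
        h = x.new_zeros(x.shape[0], self.W_h.out_features)
        for t in range(x.shape[1]):
            h = self.sigma(self.W_h(h) + self.W_x(x[:, t:t + 1]))
        return self.W_y(h)

# With hidden_size=70, this sketch has 70*70 + 70 (bias) + 70 + 3*70 = 5250 trainable parameters.
```

The "linear" rows of the table would correspond to replacing the leaky ReLU with an identity activation.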
The conventional RNN and the one implemented by a scalar wave equation have many qualitative differences. We discuss some of those below. First, in the RNN case, the trainable parameters are given by the elements of the weight matrices. In the wave equation case, we choose to use the wave velocity, $c$, as the trainable parameters, because a specific distribution of $c$ can be physically implemented after the training process. In acoustic or optical systems, this can be practically realized using technologies such as 3D printing or nanolithography. Furthermore, whereas the RNN free parameters define matrices that multiply the input, output, and hidden state vectors, in the wave equation case the free parameters are multiplied element-wise with the hidden state, which limits the influence of each individual parameter over the full dynamics.
For a given amount of expressive power, the size of the hidden state in the wave equation must arguably be much larger than that in the RNN case. This is because the amount of information that can be encoded in the spatial distribution of $c$ is constrained by the diffraction limit for wave systems. It follows that a single RNN hidden state element may be analogous to several grid cells in the scalar wave equation. Furthermore, the wave update matrix $A$ is sparse and contains non-zero entries only on diagonal elements (self coupling) and on those corresponding to neighbor-to-neighbor coupling between spatial grid cells. Because of this, information in a given cell of $u_t$ will take several time steps to reach other cells, as determined by the wave velocity and the distance between them. In practice, the presence of this form of causality means that one must wait longer for a full 'mixing' of information between cells in the domain, suggesting that in our numerical simulations a larger number of time steps may be needed as compared to the typical RNN.
The form of nonlinearity used in the wave equation is different from that in the typical RNN, which involves the application of the nonlinear activation function, $\sigma^{(h)}(\cdot)$, as in Eq. (1). In the wave equation, nonlinearity is provided by making the wave velocity, $c$, or the damping, $b$, dependent on the instantaneous wave intensity, $u_t^2$, for example $c = c(u_t^2)$ or $b = b(u_t^2)$. In optics, these nonlinearities may be implemented using $\chi^{(3)}$ materials or saturable absorption, respectively. With this addition, the update matrix of Eq. (5), $A$, becomes a function of the solution at that time step, making the dynamics nonlinear. Nonlinearity is also introduced into the output of the wave system, $y_t$, by measuring the intensity of the wave field, which involves a squaring operation. One may also consider directly discretizing a nonlinear wave equation using the same technique as in the main text.
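As an illustration, one possible intensity-dependent wave speed is sketched below; the specific functional form and the coefficient `c_nl` are assumptions made for illustration and are not necessarily the form used in the paper.

```python
def nonlinear_speed(c_lin, u_t, c_nl=1e-3):
    """Hypothetical intensity-dependent wave speed, c = c_lin + c_nl * u_t**2 (Kerr-like)."""
    return c_lin + c_nl * u_t**2

# In the update of Eq. (S8), c would be replaced by nonlinear_speed(c_lin, u_t),
# so the effective update matrix A depends on the current field u_t.
```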