Wave Physics as an Analog Recurrent Neural Network

04/29/2019
by Tyler W. Hughes, et al.
Stanford University

Analog machine learning hardware platforms promise to be faster and more energy-efficient than their digital counterparts. Wave physics, as found in acoustics and optics, is a natural candidate for building analog processors for time-varying signals. Here we identify a mapping between the dynamics of wave physics, and the computation in recurrent neural networks. This mapping indicates that physical wave systems can be trained to learn complex features in temporal data, using standard training techniques for neural networks. As a demonstration, we show that an inversely-designed inhomogeneous medium can perform vowel classification on raw audio data by simple propagation of waves through such a medium, achieving performance that is comparable to a standard digital implementation of a recurrent neural network. These findings pave the way for a new class of analog machine learning platforms, capable of fast and efficient processing of information in its native domain.



Acknowledgements

Funding: This work was supported by a Vannevar Bush Faculty Fellowship (Grant No N00014-17-1-3030) from the U.S. Department of Defense, by the Gordon and Betty Moore Foundation (Grant No GBMF4744), by a MURI grant from the U.S. Air Force Office of Scientific Research (Grant No FA9550-17-1-0002), and by the Swiss National Science Foundation (Project No P300P2_177721). Author contributions: T.W.H. conceived of the idea with input from I.A.D.W. and M.M. The software for performing numerical simulations and training of the wave equation was written by I.A.D.W. with input from T.W.H. and M.M. The model for the standard RNN was developed and trained by M.M. S.F. supervised the project. All authors contributed to analyzing the results and writing the manuscript. Competing interests: The authors have jointly filed for a provisional patent on the idea. The authors declare no other competing interests. Data and materials availability: The code for performing numerical simulations and training of the wave equation, as well as generating the figures presented in this paper, is available online at http://www.github.com/fancompute/wavetorch/.

References

  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision 115, 211–252 (2015).
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals,  and Quoc V Le, “Sequence to Sequence Learning with Neural Networks,” in Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence,  and K. Q. Weinberger (Curran Associates, Inc., 2014) pp. 3104–3112.
  • Shainline et al. (2017) Jeffrey M Shainline, Sonia M Buckley, Richard P Mirin,  and Sae Woo Nam, “Superconducting optoelectronic circuits for neuromorphic computing,” Physical Review Applied 7, 034013 (2017).
  • Tait et al. (2017) Alexander N. Tait, Thomas Ferreira de Lima, Ellen Zhou, Allie X. Wu, Mitchell A. Nahmias, Bhavin J. Shastri,  and Paul R. Prucnal, “Neuromorphic photonic networks using silicon photonic weight banks,” Scientific Reports 7, 7430 (2017).
  • Romera et al. (2018) Miguel Romera, Philippe Talatchian, Sumito Tsunegi, Flavio Abreu Araujo, Vincent Cros, Paolo Bortolotti, Juan Trastoy, Kay Yakushiji, Akio Fukushima, Hitoshi Kubota, Shinji Yuasa, Maxence Ernoult, Damir Vodenicarevic, Tifenn Hirtzlin, Nicolas Locatelli, Damien Querlioz,  and Julie Grollier, “Vowel recognition with four coupled spin-torque nano-oscillators,” Nature 563, 230 (2018).
  • Shen et al. (2017) Yichen Shen, Nicholas C. Harris, Scott Skirlo, Mihika Prabhu, Tom Baehr-Jones, Michael Hochberg, Xin Sun, Shijie Zhao, Hugo Larochelle, Dirk Englund, and Marin Soljačić, “Deep learning with coherent nanophotonic circuits,” Nature Photonics 11, 441–446 (2017).
  • Biamonte et al. (2017) Jacob Biamonte, Peter Wittek, Nicola Pancotti, Patrick Rebentrost, Nathan Wiebe,  and Seth Lloyd, “Quantum machine learning,” Nature 549, 195–202 (2017).
  • Laporte et al. (2018) Floris Laporte, Andrew Katumba, Joni Dambre,  and Peter Bienstman, “Numerical demonstration of neuromorphic computing with photonic crystal cavities,” Optics express 26, 7955–7964 (2018).
  • Lin et al. (2018) Xing Lin, Yair Rivenson, Nezih T Yardimci, Muhammed Veli, Yi Luo, Mona Jarrahi,  and Aydogan Ozcan, “All-optical machine learning using diffractive deep neural networks,” Science 361, 1004–1008 (2018).
  • Yao et al. (2013) Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi,  and Dong Yu, “Recurrent neural networks for language understanding.” in Interspeech (2013) pp. 2524–2528.
  • Hüsken and Stagge (2003) Michael Hüsken and Peter Stagge, “Recurrent neural networks for time series classification,” Neurocomputing 50, 223–235 (2003).
  • Dorffner (1996) Georg Dorffner, “Neural Networks for Time Series Processing,” Neural Network World 6, 447–468 (1996).
  • Connor et al. (1994) J. T. Connor, R. D. Martin,  and L. E. Atlas, “Recurrent neural networks and robust time series prediction,” IEEE Transactions on Neural Networks 5, 240–254 (1994).
  • Elman (1990) Jeffrey L Elman, “Finding structure in time,” Cognitive Science 14, 179–211 (1990).
  • Jordan (1997) Michael I Jordan, “Serial order: A parallel distributed processing approach,” in Advances in psychology, Vol. 121 (Elsevier, 1997) pp. 471–495.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio,  and Aaron Courville, Deep Learning (MIT Press, 2016).
  • Ursell (1953) F. Ursell, “The long-wave paradox in the theory of gravity waves,” Mathematical Proceedings of the Cambridge Philosophical Society 49, 685 (1953).
  • Boyd (2008) Robert W. Boyd, Nonlinear Optics (Academic Press, 2008).
  • Hillenbrand et al. (1995) James Hillenbrand, Laura A. Getty, Michael J. Clark,  and Kimberlee Wheeler, “Acoustic characteristics of American English vowels,” The Journal of the Acoustical Society of America 97, 3099–3111 (1995).
  • Ba et al. (2017) Abdoulaye Ba, Artem Kovalenko, Christophe Aristégui, Olivier Mondain-Monval,  and Thomas Brunet, “Soft porous silicone rubbers with ultra-low sound speeds in acoustic metamaterials,” Scientific Reports 7, 40106 (2017).
  • Hughes et al. (2018) Tyler W. Hughes, Momchil Minkov, Yu Shi, and Shanhui Fan, “Training of photonic neural networks through in situ backpropagation and gradient measurement,” Optica 5, 864–871 (2018).
  • Molesky et al. (2018) Sean Molesky, Zin Lin, Alexander Y Piggott, Weiliang Jin, Jelena Vucković,  and Alejandro W Rodriguez, “Inverse design in nanophotonics,” Nature Photonics 12, 659 (2018).
  • Elesin et al. (2012) Y. Elesin, B. S. Lazarov, J. S. Jensen,  and O. Sigmund, “Design of robust and efficient photonic switches using topology optimization,” Photonics and Nanostructures - Fundamentals and Applications 10, 153–165 (2012).
  • Kingma and Ba (2014a) Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv:1412.6980  (2014a).
  • Jing et al. (2017) Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark,  and Marin Soljačić, “Tunable efficient unitary neural networks (eunn) and their application to rnns,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70 (JMLR. org, 2017) pp. 1733–1741.
  • Silva et al. (2014) Alexandre Silva, Francesco Monticone, Giuseppe Castaldi, Vincenzo Galdi, Andrea Alù,  and Nader Engheta, “Performing Mathematical Operations with Metamaterials,” Science 343, 160–163 (2014).
  • Hermans et al. (2015) Michiel Hermans, Michaël Burm, Thomas Van Vaerenbergh, Joni Dambre,  and Peter Bienstman, “Trainable hardware for dynamical computing using error backpropagation through physical media,” Nature Communications 6, 6729 (2015).
  • Guo et al. (2018) Cheng Guo, Meng Xiao, Momchil Minkov, Yu Shi,  and Shanhui Fan, “Photonic crystal slab Laplace operator for image differentiation,” Optica 5, 251–256 (2018).
  • Kwon et al. (2018) Hoyeong Kwon, Dimitrios Sounas, Andrea Cordaro, Albert Polman,  and Andrea Alù, “Nonlocal Metasurfaces for Optical Signal Processing,” Physical Review Letters 121, 173004 (2018).
  • Estakhri et al. (2019) Nasim Mohammadi Estakhri, Brian Edwards,  and Nader Engheta, “Inverse-designed metastructures that solve equations,” Science 363, 1333–1338 (2019).
  • Oskooi et al. (2008) Ardavan F. Oskooi, Lei Zhang, Yehuda Avniel,  and Steven G. Johnson, “The failure of perfectly matched layers, and towards their redemption by adiabatic absorbers,” Optics Express 16, 11376–11392 (2008).
  • Elmore and Heald (2012) William C. Elmore and Mark A. Heald, Physics of Waves (Courier Corporation, 2012).
  • Kingma and Ba (2014b) Diederik P. Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980 [cs]  (2014b).
  • wavetorch, “A Python framework for simulating and backpropagating through the wave equation,” https://github.com/fancompute/wavetorch.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation 9, 1735–1780 (1997).
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho,  and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv:1412.3555  (2014).

Supplementary Materials

S1 Derivation of the wave equation update relationship

In the main text, we specified that the dynamics of the scalar field distribution, u_t, are governed by the wave equation

\frac{\partial^2 u}{\partial t^2} - c^2 \nabla^2 u = f, (S1)

where \nabla^2 is the Laplacian operator, c = c(x, y) is the spatial distribution of the wave speed, and f is a source term. For a nonlinear system, c can depend on the wave amplitude. Eq. (S1) may be discretized in time using centered finite differences with a temporal step size of Δt, after which it becomes

\frac{u_{t+1} - 2 u_t + u_{t-1}}{\Delta t^2} - c^2 \nabla^2 u_t = f_t. (S2)

Here, the subscript t in u_t is used to indicate the value of the scalar field at time step t. To connect Eq. (S2) to the RNN update equations from Eqs. (1) and (2), we express this in matrix form as

\begin{bmatrix} u_{t+1} \\ u_t \end{bmatrix} = \begin{bmatrix} 2 + \Delta t^2 c^2 \nabla^2 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \end{bmatrix} + \Delta t^2 \begin{bmatrix} f_t \\ 0 \end{bmatrix}. (S3)

Then, the update equation for the wave equation defined by Eq. (S3) can be rewritten as

h_t \equiv \begin{bmatrix} u_t \\ u_{t-1} \end{bmatrix}, (S4)

h_{t+1} = A(h_t) \cdot h_t + \Delta t^2 \begin{bmatrix} f_t \\ 0 \end{bmatrix}, (S5)

where we have defined A(h_t) as the matrix appearing in Eq. (S3). The nonlinear dependence of A on h_t is defined by the nonlinear wave speed described above.
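To make the update relation of Eqs. (S2)–(S5) concrete, the following minimal sketch (our own illustration, not code from the wavetorch package; names and argument shapes are assumptions) advances the discretized field by one time step:

```python
def wave_step(u_t, u_tm1, lap_u_t, c, f_t, dt):
    """One centered-difference update of Eq. (S2).

    u_t, u_tm1 : scalar field at time steps t and t-1 (2D arrays/tensors)
    lap_u_t    : Laplacian of u_t, e.g. computed with the convolution of Eq. (S14)
    c, f_t, dt : wave-speed distribution, source at step t, and time step

    Returns the pair (u_{t+1}, u_t), i.e. the next hidden state h_{t+1} of Eq. (S5).
    """
    u_tp1 = 2.0 * u_t - u_tm1 + dt ** 2 * (c ** 2 * lap_u_t + f_t)
    return u_tp1, u_t
```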

An absorbing region is introduced to approximate an open boundary condition Oskooi et al. (2008), corresponding to the grey region in Fig. 2B. This region is defined by a dampening coefficient, b, which has a cubic dependence on the distance from the interior boundary of the layer. The scalar wave equation with damping is defined by the inhomogeneous partial differential equation Elmore and Heald (2012)

\frac{\partial^2 u}{\partial t^2} + 2 b \frac{\partial u}{\partial t} - c^2 \nabla^2 u = f, (S6)

where u is the unknown scalar field and b is the dampening coefficient. Here, we assume that b can be spatially varying but is frequency-independent. For a time step indexed by t, Eq. (S6) is discretized using centered finite differences in time to give

\frac{u_{t+1} - 2 u_t + u_{t-1}}{\Delta t^2} + 2 b \, \frac{u_{t+1} - u_{t-1}}{2 \Delta t} - c^2 \nabla^2 u_t = f_t. (S7)

From Eq. (S7), we may form a recurrence relation in terms of u_{t+1}, which leads to the following update equation

u_{t+1} = \frac{1}{1 + b \Delta t} \left[ \left( 2 + \Delta t^2 c^2 \nabla^2 \right) u_t - \left( 1 - b \Delta t \right) u_{t-1} + \Delta t^2 f_t \right]. (S8)

Equation (S8) therefore represents the discretized update equation for the scalar wave equation with damping. In matrix form, we may express Eq. (S8) as

\begin{bmatrix} u_{t+1} \\ u_t \end{bmatrix} = \frac{1}{1 + b \Delta t} \begin{bmatrix} 2 + \Delta t^2 c^2 \nabla^2 & -(1 - b \Delta t) \\ 1 + b \Delta t & 0 \end{bmatrix} \begin{bmatrix} u_t \\ u_{t-1} \end{bmatrix} + \frac{\Delta t^2}{1 + b \Delta t} \begin{bmatrix} f_t \\ 0 \end{bmatrix}, (S9)

which also has the same form as Eqs. (5) and (6) of the main text.
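The earlier sketch extends naturally to the damped update of Eq. (S8); with b = 0 it reduces to the undamped step above (again an illustrative sketch under assumed names and shapes, not the wavetorch implementation):

```python
def damped_wave_step(u_t, u_tm1, lap_u_t, c, b, f_t, dt):
    """One update of Eq. (S8) for the scalar wave equation with damping.

    b is the dampening coefficient: zero in the interior of the domain and
    rising cubically inside the absorbing boundary layer.
    Returns the pair (u_{t+1}, u_t).
    """
    u_tp1 = (2.0 * u_t
             - (1.0 - b * dt) * u_tm1
             + dt ** 2 * (c ** 2 * lap_u_t + f_t)) / (1.0 + b * dt)
    return u_tp1, u_t
```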

S2 Input and output connection matrices

In the wave equation, we use the linear operators P^(i) and P^(o) to define the injection and measurement locations within our domain. Here we provide more details on these operators.

We start from the vectors u_t and f_t, which are the discretized and flattened vectors corresponding to the field distribution u(x, y) and the source f(x, y). Then, we define the linear operators M^(i) and M^(o), each column of which defines the spatial distribution of an injection or measurement point, respectively, in this flattened basis. With this, we can write the injection of the input vector (x_t) as the matrix-vector multiplication

f_t = M^{(i)} x_t. (S10)

Similarly, as the output of the RNN at each time step is given by an intensity measurement of the scalar field, we may express this in terms of the flattened scalar field as

y_t = \left| M^{(o)T} u_t \right|^2. (S11)

As the wave equation hidden state, h_t, is defined as the concatenation of u_t and u_{t-1}, we define the following matrices for convenience, as they act only on the u_t portion of h_t:

P^{(i)} = \begin{bmatrix} M^{(i)} \\ 0 \end{bmatrix}, (S12)

P^{(o)} = \begin{bmatrix} M^{(o)} \\ 0 \end{bmatrix}, (S13)

where 0 is a matrix of all zeros. These matrices are used in the injection and measurement stages of the scalar wave update equations of the main text and thus serve a similar role to the W^(x) and W^(y) matrices of the traditional RNN in Eqs. (1) and (2). However, unlike W^(x) and W^(y), these matrices are fixed and not trainable parameters.
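To make these operators concrete, here is a minimal sketch (our own illustration; the grid size, probe locations, and names are assumptions, not the wavetorch API) that builds one-hot injection and measurement matrices on the flattened grid and applies Eqs. (S10) and (S11):

```python
import torch

def point_operator(grid_shape, points):
    """Build a matrix whose columns are one-hot spatial distributions
    (one column per injection or measurement point) on the flattened grid."""
    ny, nx = grid_shape
    M = torch.zeros(ny * nx, len(points))
    for k, (iy, ix) in enumerate(points):
        M[iy * nx + ix, k] = 1.0
    return M

# Example: one source and three probe points on a 100 x 100 grid (illustrative positions).
Mi = point_operator((100, 100), [(50, 10)])                        # injection operator
Mo = point_operator((100, 100), [(30, 90), (50, 90), (70, 90)])    # measurement operator

u = torch.randn(100 * 100)      # flattened scalar field u_t
x = torch.tensor([0.7])         # input amplitude at this time step
f = Mi @ x                      # source term, Eq. (S10)
y = (Mo.T @ u) ** 2             # intensity measured at the probes, Eq. (S11)
```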

S3 Training method for vowel classification

The procedure for training the vowel recognition system is as follows. First, each vowel waveform is downsampled from its original 16 kHz sampling rate to a sampling rate of 10 kHz. Next, the entire dataset of 3 classes × (45 males + 48 females) = 279 vowel samples is divided into 5 groups of approximately equal size. Cross-validated training is performed with 4 of the 5 sample groups forming a training set and the remaining group forming a testing set. Independent training runs are performed with each of the 5 groups serving as the testing set, and the results are averaged over all training runs. Each training run is performed for 30 epochs using the Adam optimization algorithm Kingma and Ba (2014b). During each epoch, every vowel sequence from the training set is windowed to a length of 1000 samples, taken from the center of the sequence. This limits the computational cost of the training procedure by reducing the number of time steps through which gradients must be tracked.
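A minimal sketch of this preprocessing (function and variable names are illustrative, and Fourier resampling is one possible way to downsample; this is not the actual wavetorch data pipeline):

```python
import numpy as np
from scipy.signal import resample

def preprocess(waveform, orig_rate=16000, new_rate=10000, window=1000):
    """Downsample a vowel recording from 16 kHz to 10 kHz and take a centered 1000-sample window."""
    x = resample(waveform, int(len(waveform) * new_rate / orig_rate))
    start = max(0, (len(x) - window) // 2)
    return x[start:start + window]

# Split the 279 samples into 5 groups of roughly equal size for cross-validation:
# each run trains on 4 groups and tests on the remaining one.
indices = np.random.permutation(279)
folds = np.array_split(indices, 5)
```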

All windowed samples from the training set are run through the simulation in batches of 9, and the categorical cross entropy loss between the output probe probability distribution and the correct one-hot vector for each vowel sample is computed. To encourage the optimizer to produce a binarized distribution of the wave speed with relatively large feature sizes, the loss function is minimized with respect to a material density distribution, ρ, within a central region of the simulation domain, indicated by the green region in Fig. 2A. The distribution of the wave speed, c, is computed by first applying a low-pass spatial filter and then a projection operation to the density distribution. The details of this process are described in section S4. We use the Adam algorithm Kingma and Ba (2014b) with a learning rate of 0.0004 to perform the optimization. Fig. 2D illustrates the optimization process over several epochs, during which the wave velocity distribution converges to a final structure. At the end of each epoch, the classification accuracy is computed over both the testing and training sets. Unlike for the training set, the full length of each vowel sample from the testing set is used.
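The following sketch outlines one training epoch under these settings (batches of 9, Adam with learning rate 4e-4, cross entropy loss). The simulate function and the rho/optimizer setup are placeholders standing in for the wavetorch simulation, not its actual classes or API:

```python
import torch
import torch.nn.functional as F

def train_epoch(simulate, rho, train_batches, optimizer):
    """One epoch of density optimization.

    simulate(waveforms, rho) is assumed to run the wave simulation with the wave
    speed derived from the density rho (Eqs. (S15)-(S17)) and return the normalized
    probe intensities (a probability over the 3 vowel classes) for each sample.
    """
    for waveforms, labels in train_batches:          # batches of 9 windowed vowel samples
        probs = simulate(waveforms, rho)             # forward pass through the wave equation
        loss = F.nll_loss(torch.log(probs + 1e-12), labels)  # categorical cross entropy
        optimizer.zero_grad()
        loss.backward()                              # reverse-mode AD through all time steps
        optimizer.step()

# Setup (illustrative):
# rho = torch.rand(ny, nx, requires_grad=True)       # material density in the design region
# optimizer = torch.optim.Adam([rho], lr=0.0004)     # Adam with learning rate 4e-4
```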

The mean energy spectrum of the three vowel classes after downsampling to 10 kHz is shown in Fig. S1. We observe that the majority of the energy for all vowel classes is below 1 kHz and that there is strong overlap between the mean peak energy of the ei and iy vowel classes. Moreover, the mean peak energy of the ae vowel class is very close to the peak energy of the other two vowels. Therefore, the vowel recognition task learned by the system in the main text is non-trivial.

Figure S1: Mean energy spectrum for the ae, ei, and iy vowel classes.

S4 Numerical modeling methods

Numerical modeling and simulation of the wave equation physics was performed using a custom software package written in Python, wavetorch. The software package was developed on top of the popular machine learning library pytorch to compute the gradients of the loss function with respect to the material distribution via reverse-mode automatic differentiation. In the context of inverse design in the fields of physics and engineering, this method of gradient computation is commonly referred to as the adjoint variable method and has a computational cost of one additional wave simulation. Using a machine learning platform to perform the numerical simulation greatly reduces opportunities for errors in the analytic derivation or numerical implementation of the gradient. The code for performing numerical simulations and training of the wave equation, as well as generating the figures presented in this paper, may be found online at http://www.github.com/fancompute/wavetorch/.

In wavetorch, the operation of ∇² on a spatially discretized wave field, u_t, is carried out using the convolution

\nabla^2 u_t = \frac{1}{h^2} \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \ast u_t, (S14)

where h is the step size of the spatial grid and \ast denotes a convolution.
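In PyTorch this convolution can be expressed as follows (a minimal sketch; the actual wavetorch implementation may differ in padding and batching conventions):

```python
import torch
import torch.nn.functional as F

def laplacian(u, h):
    """Apply the 5-point stencil of Eq. (S14) to a 2D field u with grid spacing h."""
    kernel = torch.tensor([[0.0,  1.0, 0.0],
                           [1.0, -4.0, 1.0],
                           [0.0,  1.0, 0.0]]) / h ** 2
    # The stencil is symmetric, so conv2d's cross-correlation equals the convolution.
    return F.conv2d(u[None, None], kernel[None, None], padding=1)[0, 0]
```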

To create realistic devices with sufficiently large minimum feature sizes and a binarized wave speed distribution, we employed filtering and projection schemes during our optimization. Rather than updating the wave speed distribution directly, one may instead choose to update a design density, ρ, which varies between 0 and 1 within the design region and describes the density of material in each pixel. To create a structure with larger feature sizes, a low-pass spatial filter can be applied to ρ to create a filtered density, labelled ρ̃,

(S15)

For binarization of the structure, a projection scheme is used to recreate the final wave speed from the filtered density. We define ρ̄ as the projected density, which is created from ρ̃ as

\bar{\rho} = \frac{\tanh(\beta \eta) + \tanh\left( \beta (\tilde{\rho} - \eta) \right)}{\tanh(\beta \eta) + \tanh\left( \beta (1 - \eta) \right)}. (S16)

Here, η is a parameter between 0 and 1 that controls the mid-point of the projection, typically η = 0.5, and β controls the strength of the projection, typically β ≈ 100.

Finally, the wave speed distribution, c, can be determined from ρ̄ as

c = c_0 + (c_1 - c_0) \, \bar{\rho}, (S17)

where c_0 and c_1 are the background and optimized material wave speed, respectively.
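A sketch of this density-to-wave-speed pipeline is given below. The box-blur kernel standing in for the low-pass filter of Eq. (S15) is our own illustrative choice; the projection and interpolation follow Eqs. (S16) and (S17):

```python
import math
import torch
import torch.nn.functional as F

def density_to_speed(rho, c0, c1, eta=0.5, beta=100.0, blur_radius=2):
    """Map a raw design density rho (values in [0, 1]) to a wave-speed distribution c."""
    # Low-pass spatial filter (illustrative box blur) enforcing a minimum feature size, cf. Eq. (S15).
    k = 2 * blur_radius + 1
    kernel = torch.ones(1, 1, k, k) / k ** 2
    rho_f = F.conv2d(rho[None, None], kernel, padding=blur_radius)[0, 0]

    # tanh projection of Eq. (S16): pushes the filtered density toward 0 or 1.
    num = math.tanh(beta * eta) + torch.tanh(beta * (rho_f - eta))
    den = math.tanh(beta * eta) + math.tanh(beta * (1.0 - eta))
    rho_p = num / den

    # Linear interpolation of Eq. (S17) between background (c0) and material (c1) wave speeds.
    return c0 + (c1 - c0) * rho_p
```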

S5 Comparison of the wave equation and a conventional RNN

We compare the wave equation results to a conventional RNN model as defined in Eqs. (1) and (2). The number of trainable parameters in the model is determined by the size of the hidden state, N_h, as the model is given by three matrices, W^(h), W^(x), and W^(y), whose dimensions are set by N_h. We tried N_h = 70, for which the total number of RNN free parameters is 5250, and N_h = 100, with 10500 free parameters. The RNN was implemented and trained using pytorch. In Table S1 we show the results of a standard RNN on the vowel recognition task and compare them to the scalar wave. We find that the conventional RNN achieves a performance comparable to that of the wave equation. However, this performance depends strongly on the number of trainable parameters. For a similar number of trainable parameters, the conventional RNN achieves about 6% lower classification accuracy. However, when the number of free parameters is increased to about twice that of the scalar wave, the accuracy is higher by about 3%. We note that it is possible that more advanced recurrent models, such as the long short-term memory (LSTM) Hochreiter and Schmidhuber (1997) or the gated recurrent unit (GRU) Chung et al. (2014), could achieve better performance with a smaller number of parameters, but exploring this is outside the scope of this study.

Model            | Nonlinearity         | Number of parameters | Training accuracy | Testing accuracy
Wave Equation    | linear wave speed    | 4200                 | 93.1%             | 86.6%
Wave Equation    | nonlinear wave speed | 4200                 | 92.6%             | 86.3%
Conventional RNN | linear               | 5250                 | 78.8%             | 79.4%
Conventional RNN | leaky ReLU           | 5250                 | 82.6%             | 80.2%
Conventional RNN | linear               | 10500                | 88.9%             | 88.2%
Conventional RNN | leaky ReLU           | 10500                | 89.4%             | 89.4%

Table S1: Comparison of the scalar wave model and a conventional RNN on the vowel recognition task.
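For reference, a minimal PyTorch sketch of a conventional RNN of the form of Eqs. (1) and (2) is shown below. The readout over time and the bias handling here are our own choices, so the parameter count will not exactly reproduce the numbers in Table S1:

```python
import torch
import torch.nn as nn

class VanillaRNN(nn.Module):
    """h_t = sigma(W_h h_{t-1} + W_x x_t),  y_t = W_y h_t, cf. Eqs. (1) and (2)."""
    def __init__(self, hidden_size=70, num_classes=3, nonlinear=True):
        super().__init__()
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_x = nn.Linear(1, hidden_size, bias=False)
        self.W_y = nn.Linear(hidden_size, num_classes, bias=False)
        self.sigma = nn.LeakyReLU() if nonlinear else nn.Identity()

    def forward(self, x):
        # x: (batch, time) raw audio; accumulate class scores over all time steps.
        h = torch.zeros(x.shape[0], self.W_h.in_features, device=x.device)
        y_sum = 0.0
        for t in range(x.shape[1]):
            h = self.sigma(self.W_h(h) + self.W_x(x[:, t:t + 1]))
            y_sum = y_sum + self.W_y(h)
        return y_sum
```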

The conventional RNN and the one implemented by the scalar wave equation have many qualitative differences. We discuss some of them below. First, in the RNN case, the trainable parameters are given by the elements of the weight matrices. In the wave equation case, we choose to use the wave velocity, c, as the trainable parameters, because a specific distribution of c can be physically implemented after the training process. In acoustic or optical systems, this can be practically realized using technologies such as 3D printing or nanolithography. Furthermore, whereas the RNN free parameters define matrices that are multiplied by the input, output, and hidden state vectors, in the wave equation case the free parameters are multiplied element-wise with the hidden state, which limits the influence of each individual parameter over the full dynamics.

For a given amount of expressive power, the size of the hidden state in the wave equation must arguably be much larger than that in the RNN case. This is because the amount of information that can be encoded in the spatial distribution of the hidden state, u_t, is constrained by the diffraction limit for wave systems. It follows that a single RNN hidden state element may be analogous to several grid cells in the scalar wave equation. Furthermore, the wave update matrix, A, is sparse and contains non-zero entries only on diagonal elements (self coupling) and on those corresponding to neighbor-to-neighbor coupling between spatial grid cells. Because of this, information in a given cell of u_t will take several time steps to reach other cells, as determined by the wave velocity and the distance between them. The presence of this form of causality means, in practice, that one must wait longer for a full ‘mixing’ of information between cells in the domain, suggesting that in our numerical simulations a larger number of time steps may be needed as compared to a typical RNN.

The form of nonlinearity used in the wave equation is different from that in the typical RNN, which involves the application of a nonlinear activation function, σ(·), as in Eq. (1). In the wave equation, nonlinearity is provided by making the wave velocity, c, or the damping, b, depend on the instantaneous wave intensity, u_t². For example, c = c(u_t²) or b = b(u_t²). In optics, these nonlinearities may be implemented using third-order (χ⁽³⁾) nonlinear materials or saturable absorption, respectively. With this addition, the update matrix of Eq. (5), A(h_t), becomes a function of the solution at that time step, making the dynamics nonlinear. Nonlinearity is also introduced into the output of the wave system (y_t) by measuring the intensity of the wave field, which involves a squaring operation. One may also consider directly discretizing a nonlinear wave equation using the same technique as in the main text.
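As an illustration, an intensity-dependent wave speed of this kind can be combined with the earlier update sketches. The specific functional form c_lin + c_nl·u² is an assumption used only for illustration:

```python
def nonlinear_speed(u_t, c_lin, c_nl):
    """Intensity-dependent wave speed c(u_t) = c_lin + c_nl * u_t**2 (illustrative form).

    Passing this c into the wave_step / damped_wave_step sketches above makes the
    update matrix A(h_t) depend on the hidden state, i.e. the dynamics become nonlinear.
    """
    return c_lin + c_nl * u_t ** 2
```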