Instructional implementation of Physics-Aware Training (PAT) with demonstrations on simulated experiments.
Deep neural networks have become a pervasive tool in science and engineering. However, modern deep neural networks' growing energy requirements now increasingly limit their scaling and broader use. We propose a radical alternative for implementing deep neural network models: Physical Neural Networks. We introduce a hybrid physical-digital algorithm called Physics-Aware Training to efficiently train sequences of controllable physical systems to act as deep neural networks. This method automatically trains the functionality of any sequence of real physical systems, directly, using backpropagation, the same technique used for modern deep neural networks. To illustrate their generality, we demonstrate physical neural networks with three diverse physical systems-optical, mechanical, and electrical. Physical neural networks may facilitate unconventional machine learning hardware that is orders of magnitude faster and more energy efficient than conventional electronic processors.READ FULL TEXT VIEW PDF
Deep learning has rapidly become a widespread tool in both scientific an...
In the last decade, deep learning has become a major component of artifi...
To make progress in science, we often build abstract representations of
We propose a new family of neural networks to predict the behaviors of
A Deep Neural Network is applied to classify physical signatures obtaine...
We present a technique, which we term leapfrogging, to parallelize back-...
To understand changes in physical systems and facilitate decisions,
Instructional implementation of Physics-Aware Training (PAT) with demonstrations on simulated experiments.
Like many historical developments in artificial intelligencebrooks1991intelligence; hooker2020hardware
, the widespread adoption of deep neural networks (DNNs) was enabled in part by synergistic hardware. In 2012, building on numerous earlier works, Krizhevsky et al. showed that the backpropagation algorithm for stochastic gradient descent (SGD) could be efficiently executed with graphics-processing units to train large convolutional DNNskrizhevsky2012imagenet to perform accurate image classification. Since 2012, the breadth of applications of DNNs has expanded, but so too has their typical size. As a result, the computational requirements of DNN models have grown rapidly, outpacing Moore’s Law schwartz2019green; patterson2021carbon. Now, DNNs are increasingly limited by hardware energy efficiency.
The emerging DNN energy problem has inspired special-purpose hardware: DNN ‘accelerators’ reuther2020survey; markovic2020physics. Several proposals push beyond conventional electronics with alternative physical platforms markovic2020physics, such as optics wetzstein2020inference; shen2017deep; hamerly2019large; shastri2021photonics or memristor crossbar arrays prezioso2015training; hu2014memristor. These devices rely on approximate analogies between the hardware physics and the mathematical operations in DNNs (Fig. 1A). Consequently, their success will depend on intensive engineering to push device performance toward the limits of the hardware physics, while carefully suppressing parts of the physics that violate the analogy, such as unintended nonlinearities, noise processes, and device variations.
More generally, however, the controlled evolutions of physical systems are well-suited to realizing deep learning models. DNNs and physical processes share numerous structural similarities, such as hierarchy, approximate symmetries, redundancy, and nonlinearity. These structural commonalities explain much of DNNs’ success operating robustly on data from the natural, physical worldlin2017does
. As physical systems evolve, they perform, in effect, the mathematical operations within DNNs: controlled convolutions, nonlinearities, matrix-vector operations and so on. We can harness these physical computations by encoding input data into the initial conditions of the physical system, then reading out the results by performing measurements after the system evolves (Fig. 1C). Physical computations can be controlled by adjusting physical parameters. By cascading such controlled physical input-output transformations, we can realize trainable, hierarchical physical computations: deep physical neural networks (PNNs, Fig. 1D). As anyone who has simulated the evolution of complex physical systems appreciates, physical transformations are typically faster and consume less energy than their digital emulations: processes which take nanoseconds and nanojoules frequently require seconds and joules to digitally simulate. PNNs are therefore a route to scalable, energy-efficient, and high-speed machine learning.
Theoretical proposals for physical learning hardware have recently emerged in various fields, such as optics hughes2019wave; Wu2020neuromorphicmetasurface; furuhata2021physical; nakajima2020neural, spintronic nano-oscillators Romera2018spintronicPNNvowel, nanoelectronic devices Chen2020geneticalgorithmSi; euler2020deep, and small-scale quantum computers Mitarai2018quantumcircuitlearning; Benedetti2019parameterizedQCL; havlivcek2019supervised; fosel2021quantum. A related trend is physical reservoir computing Tanaka2019reservoircomputingreview; Nakajima2020reservoircomputingreview, in which the information transformations of a physical system ‘reservoir’ are not trained but are instead linearly combined by a trainable output layer. Reservoir computing harnesses generic physical processes for computation, but its training is inherently shallow: it does not allow the hierarchical process learning that characterizes modern deep neural networks. In contrast, the newer proposals for physical learning hardware hughes2019wave; Wu2020neuromorphicmetasurface; Mitarai2018quantumcircuitlearning; Benedetti2019parameterizedQCL; Chen2020geneticalgorithmSi; euler2020deep; Romera2018spintronicPNNvowel; stern2020supervised; furuhata2021physical; nakajima2020neural overcome this by training the physical transformations themselves.
There have been few experimental studies on physical learning hardware, however, and those that exist have relied on gradient-free learning algorithms Chen2020geneticalgorithmSi; Romera2018spintronicPNNvowel; bueno2018reinforcement. While these works have made critical steps, it is now appreciated that gradient-based learning algorithms, such as the backpropagation algorithm, are essential for the efficient training and good generalization of large-scale DNNs poggio2020theoretical. To solve this problem, proposals to realize backpropagation on physical hardware have appeared hughes2019wave; lin2018all; Wu2020neuromorphicmetasurface; euler2020deep; furuhata2021physical; hermans2015trainable; hughes2018training; lopez2021self. While inspirational, these proposals nonetheless often rely on restrictive assumptions, such as linearity or dissipation-free evolution. The most general proposals hughes2019wave; Wu2020neuromorphicmetasurface; furuhata2021physical; euler2020deep may overcome such constraints, but still rely on performing training in silico, i.e., wholly within numerical simulations. Thus, to be realized experimentally, and in scalable hardware, they will face the same challenges as hardware based on mathematical analogies: intense engineering efforts to force hardware to precisely match idealized simulations.
Here we demonstrate a universal framework to directly train arbitrary, real physical systems to execute deep neural networks, using backpropagation. We call these trained hierarchical physical computations physical neural networks (PNNs). Our approach is enabled by a hybrid physical-digital algorithm, physics-aware training (PAT). PAT allows us to efficiently and accurately execute the backpropagation algorithm on any sequence of physical input-output transformations, directly in situ
. We demonstrate the universality of this approach by experimentally performing image classification using three distinct systems: the multimode mechanical oscillations of a driven metal plate, the analog dynamics of a nonlinear transistor oscillator, and ultrafast optical second harmonic generation. We obtain accurate hierarchical classifiers, which utilize each system’s unique physical transformations, and which inherently mitigate each system’s unique noise processes and imperfections. While PNNs are a radical departure from traditional hardware, they are easily integrated into modern machine learning. We show that PNNs can be seamlessly combined with conventional hardware and neural network methods via physical-digital hybrid architectures, in which conventional hardware learns to opportunistically cooperate with unconventional physical resources using PAT. Ultimately, PNNs provide a basis for hardware-physics-software codesign in artificial intelligencemarkovic2020physics, routes to improving the energy efficiency and speed of machine learning by many orders of magnitude, and pathways to automatically designing complex functional devices, such as functional nanoparticles peurifoy2018nanophotonic, robots veit2016residual, and smart sensors Zhou2020smartsensor; Martel2020smartsensor; Mennel2020smartsensor; Duarte2008singlepixelimaging.
Figure 2 shows an example physical neural network based on the propagation of broadband optical pulses in quadratic nonlinear optical media, i.e. ultrafast second harmonic generation (SHG)111Technically speaking, propagation of intense broadband pulses in quadratic nonlinear optical media gives rise not only to second-harmonic generation, but also to sum-frequency and difference-frequency processes among the multitude of optical frequencies. For simplicity, we refer to these complex nonlinear optical dynamics as SHG.. The ultrafast SHG process realizes a physical computation, a nonlinear transformation of the input near-infrared pulse’s spectrum ( 800 nm center-wavelength) to a new optical spectrum in the ultraviolet ( 400 nm center-wavelength). To use and control this physical computation, input data and parameters are encoded into sections of the near-infrared pulse’s spectrum by modulating its frequency components using a pulse shaper (Fig. 2A). This modulated pulse then propagates through a nonlinear optical crystal, producing a broadband ultraviolet pulse, whose spectrum is measured to read-out the computation result.
A task routinely used to demonstrate novel machine learning hardware is the classification of spoken vowels according to their formant frequencies shen2017deep; Romera2018spintronicPNNvowel; voweldata. In this task, the machine learning device must learn to predict spoken vowels from 12-dimensional input data vectors of formant frequencies extracted from audio recordings.
To realize vowel classification with SHG, we construct a multi-layer SHG-PNN (Fig. 1B) where the input data for the first physical layer (i.e., the first physical transformation) consists of the vowel formant frequency vector. At the final layer, the SHG-PNN’s predicted vowel is read out by taking the maximum value of 7 spectral bins in the measured ultraviolet spectrum (Fig. 1C). In each layer, the output spectrum is digitally renormalized before being passed to the next layer, along with a trainable digital rescaling (i.e., ). Thus, the classification is achieved almost entirely using the trained nonlinear optical transformations. The results of Fig. 2 show that the SHG-PNN adapts the physical transformation of ultrafast SHG into a hierarchical physical computation, which consists of five sequential, trained nonlinear transformations that cooperate to produce a final optical spectrum that accurately predicts vowels (Fig. 1B,D).
To train PNNs’ parameters, we use physics-aware training (PAT, Fig. 3), an algorithm that allows us to effectively perform the backpropagation algorithm for stochastic gradient descent (SGD) directly on any sequence of physical input-output transformations, such as the sequence of nonlinear optical transformations in Fig. 2. In the backpropagation algorithm, automatic differentiation efficiently determines the gradient of a loss function with respect to trainable parameters. This makes the algorithm
-times more efficient than finite-difference methods for gradient estimation (whereis the number of parameters). PAT has some similarities to quantization-aware training algorithms used to train neural networks for low-precision hardware hubara2017quantized, and feedback alignment lillicrap2016random. PAT can be seen as solving a problem analogous to the ‘simulation-reality gap’ in robotics jakobi1995noise, which is increasingly addressed by hybrid physical-digital techniques howison2021reality.
PAT’s essential advantages stems from the forward pass being executed by the actual physical hardware, rather than by a simulation. Given an accurate model, one could attempt to train by autodifferentiating simulations, then transferring the final parameters to the physical hardware (termed in silico training in Fig. 3). Our data-driven digital model for SHG is remarkably accurate (Supplementary Figure 20) and even includes an accurate noise model (Supplementary Figures 18-19). However, as evidenced by Fig. 3B, in silico training still fails, reaching a maximum vowel-classification accuracy of 40%. In contrast, PAT succeeds, accurately training the SHG-PNN, even when additional layers are added (Fig. 3B-C).
An intuitive motivation for why PAT works is that the training’s optimization of parameters is always grounded in the true optimization landscape by the physical forward pass. With PAT, even if gradients are estimated only approximately, the true loss function is always precisely known. Moreover, the true output from each intermediate layer is also known, so gradients of intermediate physical layers are always computed with respect to correct inputs. In contrast, in any form of in silico training, compounding errors build up through the simulation of each physical layer, leading to a rapidly diverging simulation-reality gap as training proceeds (see Supplementary Section 1 for details). As a secondary benefit, PAT ensures learned models are inherently resilient to noise and other imperfections beyond a digital model, since the change of loss along noisy directions in parameter space will tend to average to zero. This facilitates the learning of noise-resilient (more speculatively, noise-enhanced) models that automatically adapt to device-device variations and brain-like stochastic dynamics markovic2020physics.
PNNs can learn to accurately perform more complex tasks, and can be realized with virtually any physical system. In Figure 4, we present three PNN classifiers for the MNIST handwritten digit classification task, based on three distinct physical systems. For each physical system, we also demonstrate a different PNN architecture, illustrating the wide variety of PNN models possible. In all cases, models were constructed and trained in PyTorchpaszke2017automatic, where each trainable physical transformation called an automated experimental apparatus with a vector of the input data, , and parameters , y=run_exp(x,theta), which is made differentiable by using the digital model for backward operations (see Supplementary Section 1).
In the mechanical PNN (Fig. 4A-D), a metal plate is driven by a time-varying force, which encodes both input data and trainable parameters. We find that the plate’s multimode oscillation enacts a controllable convolution on input data (Supplementary Figures 16-17). Using the oscillating plate’s trainable transformation three times in sequence, we classify 28 by 28 (784-pixel) images input as unrolled, 784-dimensional time series. To control the physical transformations, we train element-wise rescaling of the time-series of forces sent to the plate digitally (i.e., , where and are trainable and is the input image or an output from a previous layer). The PNN achieves 87% test accuracy, close to the optimal performance of a linear classifier lecun1998gradient. To ensure that the digital re-scaling operations are not responsible for a significant part of the physical computation, we repeat the experiment with the physical transformation replaced by an identity operation. When this model is trained, its optimal performance is comparable to random guessing (10%). This shows that most of the PNN’s functionality comes from the controlled physical transformations.
An analog-electronic PNN is implemented with a circuit featuring a transistor (Fig. 4E-H), which results in a noisy, highly-nonlinear transient response (Supplementary Figures 12-13) that is more challenging to accurately model than the oscillating plate’s response (Supplementary Figure 23). Inputs and parameters are realized through the time-varying voltage applied across the circuit and the output is taken as the voltage time-series measured across the inductor and resistor (Fig. 4F). The electronic PNN architecture is similar to the mechanical PNN’s, except that the final prediction is obtained by averaging the predictions of 7 independent, 3-layer PNNs. Since it is nonlinear, the electronic PNN outperforms the mechanical PNN, achieving 93% test accuracy.
Finally, using broadband SHG, we demonstrate a physical-digital hybrid PNN (Fig. 4I-L). The hybrid architecture performs an inference as follows: First, the MNIST image is acted on by a trainable linear input layer, which produces the input to the SHG transformation. The full PNN architecture involves 4 separate channels, each with 2 physical layers, whose outputs are concatenated to produce the PNN’s prediction. The inclusion of the trainable SHG transformations boosts the performance of the digital operations from roughly 90% accuracy to 97%. Since the classification task’s difficulty is nonlinear with respect to accuracy, such an improvement typically requires increasing the digital operations count by at least one order of magnitude lecun1998gradient. This illustrates how a hybrid physical-digital PNN can automatically learn to offload portions of a computation from an expensive digital processor to a fast, energy-efficient physical co-processor.
While the above results demonstrate that a variety of physical systems can be trained to perform machine learning, these proof-of-concept experiments leave open practical questions such as: What physical systems are good candidates for physical neural networks (PNNs), and how much can they speed up or reduce the energy consumption of machine learning? One naive approach is to analyze the costs of digitally simulating programmable physical systems relative to evolving them physically, as is done in recent work demonstrating quantum computational advantage arute2019quantum; zhong2020quantum. Of course, such self-simulation advantages
will overestimate PNN advantages for practical tasks, but simulation analysis can still be insightful: it allows us to upper-bound and evaluate how potential advantages scale with physical parameters, to appreciate the mathematical operations that each physical system effectively approximates, and to therefore identify practical tasks each system may be best-suited for. We find that typical self-simulation advantages grow as the connectivity of the PNN’s physical degrees of freedom increases, as the dimensionality of the physics increases, and as the nonlinearity of interactions increases (Supplementary Section 3). Realistic device implementations routinely achieve self-simulation advantages over() for speed (energy), with much larger values possible by size-scaling. Simulation analysis reveals that candidate PNN systems approximate standard machine learning operations like controllable matrix-vector multiplications, as in multimode wave propagation saade2016random or coupled spintronic oscillators Romera2018spintronicPNNvowel
or electrical networks, as well as less-common operations, like the nonlinear convolutions in ultrafast optical SHG and the higher-order tensor contractions implemented by (multimode) nonlinear wave propagation in Kerr mediateugin2020scalable.
Reaching the performance ceiling implied by self-simulation for tasks of practical interest should be possible, but will require a co-design of physics and algorithms markovic2020physics
. On one hand, many candidate PNN systems, such as multimode optical waves, and networks of coupled transistors, lasers or nano-oscillators, exhibit dynamical evolutions that closely resemble standard neural networks and neural ordinary differential equationschen2018neural (see Supplementary Section 3). For these systems, training PNNs with physics-aware training (PAT) will facilitate the learning of hierarchical physical computations of similar form as those in conventional deep neural networks, but overcoming the imperfect simulations, device variations, and higher-order physical processes that would pose an impenetrable challenge for approaches based on in silico training or approximate mathematical analogies. Since their controlled dynamics are well-known to resemble conventional neural networks, it is likely that PNNs based on these systems will be able to translate self-simulation advantages of and higher into comparable speed-ups and energy efficiency gains on problems of practical interest. As our SHG-based machine learning shows, however, PNNs do not need not be limited to physical operations that resemble today’s popular machine learning algorithms. Since they have not yet been widely studied in machine learning models, it is difficult to estimate how more exotic physical operations like nonlinear-optics-based high-order tensor contractions will translate to practical computations. Nonetheless, if we can learn how to harness them productively, PNNs incorporating these novel operations may realize even more significant computational advantages. In addition, there are clear technical strategies to design PNNs to maximize their practical benefits. Even though the inference stage may account for as much as 90 of energy consumption due to DNNs in commercial deployments jassy2019aws; patterson2021carbon, it will also be valuable to improve PAT to facilitate physics-enhanced or physics-based learning hughes2018training; lopez2021self. Meanwhile, general design techniques like multidimensional layouts, in-place stored parameters, and physical data feedforward will permit scaling PNN complexity with minimal added energy cost (for a full exposition, see Supplementary Section 4).
This work has focused so far on PNNs specifically as accelerators for machine learning, but PNNs are promising for other applications as well. PNNs can perform computations on data within its physical domain, allowing for smart sensors Zhou2020smartsensor; Martel2020smartsensor; Mennel2020smartsensor; Duarte2008singlepixelimaging that pre-process information prior to conversion to the electronic domain (e.g., a low-power, microphone-coupled circuit tuned to recognize specific hotwords). Since the achievable sensitivity and resolution of many sensors is limited by digital electronics, PNN-sensors should have advantages over conventional approaches. More broadly, with PAT, one is simply training the complex functionality of physical systems. While machine learning and sensing are important functionalities, they are but two of many peurifoy2018nanophotonic; howison2021reality that PAT, and the concept of PNNs, could be applied to.
L.G.W., T.O. and P.L.M. conceived the project and methods. T.O. and L.G.W. performed the SHG-PNN experiments. L.G.W. performed the electronic-PNN experiments. M.M.S. performed the oscillating-plate-PNN experiments. T.W., D.T.S., and Z.H. contributed to initial parts of the work. L.G.W., T.O., M.M.S. and P.L.M. wrote the manuscript. P.L.M. supervised the project. L.G.W. and T.O. contributed equally to the work overall. The authors wish to thank NTT Research for their financial and technical support. Portions of this work were supported by the National Science Foundation (award CCF-1918549). L.G.W. and T.W. acknowledge support from Mong Fellowships from Cornell Neurotech during early parts of this work. P.L.M. acknowledges membership of the CIFAR Quantum Information Science Program as an Azrieli Global Scholar. We acknowledge helpful discussions with Danyal Ahsanullah, Vladimir Kremenetski, Edwin Ng, Sebestian Popoff, Sridhar Prabhu, Hidenori Tanaka and Ryotatsu Yanagimoto, and members of the NTT PHI Lab / NSF Expeditions research collaboration, and thank Philipp Jordan for discussions and illustrations.Competing interests: L.G.W., T.O., M.M.S. and P.L.M. are listed as inventors on a U.S. provisional patent application (No. 63/178,318) on physical neural networks and physics-aware training.
An expandable demonstration code for applying physics-aware training to train physical neural networks is available at: https://github.com/mcmahon-lab/Physics-Aware-Training. All data generated and code used for this work is available at https://doi.org/10.5281/zenodo.4719150.