I Diving into machine learning
Companies today invest tens of billions of dollars every year to develop machine learning technology, making it a ubiquitous tool for analyzing and interpreting data. Google and Facebook use machine learning algorithms to serve you ads. Amazon and Apple use machine learning both to process spoken language and to synthesize realistic-sounding voices. Tesla uses learning tools to develop self-driving vehicles. Learning techniques have also made their way into more surprising applications: Jaguar has adopted learning tools, not to drive its cars, but to provide mapping services that optimize cellular reception along the drive. Unilever even uses machine learning to design consumer products like shampoos.
Machine learning impacts more than commerce and consumer goods. The number of scientific applications is exploding. In the physical sciences, learning techniques have delivered new tools for data analysis and prediction, new methods for comparing simulations and experiments, and new directions in scientific computing and computer architecture. Researchers from disparate disciplines have incorporated machine learning across a host of applications: fitting scattered data, fitting or recognizing vector- or image-valued data, signal analysis, approximation of partial differential equations, construction of smooth functions for analysis and optimization, and much more.
Beyond the technical advances, nations are vying for technical dominance in the arena, with China and the US widely perceived as leading. China's goal is to achieve dominance in machine learning by 2030. Vladimir Putin announced, "Artificial intelligence is the future … whoever becomes the leader in this sphere will become the ruler of the world." In a move that scientists can expect to influence science policy, the US House of Representatives created the Artificial Intelligence Caucus to seek science and technology input for developing public policy. For many reasons, then, a working knowledge of the principles of machine learning is beneficial to physical scientists.
Our aims are:
to develop a foundation from which researchers can explore machine learning,
to demystify and define machine learning with an emphasis on deep learning via neural networks,
to lay out the vocabulary and essential concepts necessary to recognize the strengths of deep learning,
to identify appropriate learning techniques for specific applications, and
to choose software tools to begin research exploration.
II Machine learning: context and a definition
Machine learning is the application of a numerical algorithm that improves its performance at a given task based on experience [mitchell:defn_learn]. The task is to predict a numerical value based on numerical input. Mathematically, we desire a function f that maps our inputs to output values, say y = f(x). The experience is the collection of input and output values, {(x_i, y_i)}, where the x_i are inputs and the y_i are outputs, with i ranging over examples. These examples come to us from simulation or experimental observation. We can measure the performance of a learning algorithm by the nearness of its predicted values, ŷ_i, to the true target values, y_i. In the simplest case, we might measure the performance by the squared error, (ŷ_i − y_i)². The learning is the improvement of the algorithm performance with exposure to additional experience or data. Typical tasks for machine learning include classification, clustering, dimensional reduction, and regression. Our task for this tutorial will be regression – using learning algorithms to approximate real-valued functions.
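The squared-error performance measure above can be made concrete in a few lines. This is a minimal numeric sketch; the sine-wave "experience" and the deliberately crude mean predictor are invented for illustration:

```python
import numpy as np

# Hypothetical "experience": noisy samples of an unknown function.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(50)

# A deliberately crude predictor: always guess the mean of the outputs.
y_hat = np.full_like(y, y.mean())

# Performance measured by the mean squared error between predictions
# and true targets; a learning algorithm would drive this down.
mse = np.mean((y_hat - y) ** 2)
```

A trained regressor would be judged by exactly this kind of comparison, only with predictions that actually depend on the inputs.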
The familiar fitting methods used in the physical sciences are elementary parametric machine learning algorithms. The prototype is the linear least squares method. Here, we use labeled (supervised) data, {(x_i, y_i)}, to fit a model with explicit parameters. Examples of parametrized model functions for use with linear least squares include the familiar straight line, y(x) = a_0 + a_1 x, and the series y(x) = Σ_n a_n f_n(x), both of which are linear in their parameters. They clearly need not have basis functions f_n that are linear in x. We can relax the need for linearity in the parameters to accommodate models like y(x) = a exp(bx). However, in this nonlinear case, we must appeal to nonlinear solution techniques, like the Levenberg–Marquardt procedure. In any case, linear or nonlinear, these parametric methods require that we know a suitable basis in advance based on prior knowledge of the application at hand.
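The linear-versus-nonlinear distinction can be seen in standard fitting tools. This is an illustrative sketch, with invented exponential-decay data: a polynomial fit is linear in its parameters and solved directly, while an exponential model needs an iterative solver (SciPy's `curve_fit` uses Levenberg–Marquardt-style least squares for unconstrained problems):

```python
import numpy as np
from scipy.optimize import curve_fit

# Invented data: y = 2*exp(-1.5*x) plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 40)
y = 2.0 * np.exp(-1.5 * x) + 0.01 * rng.standard_normal(40)

# Linear-in-parameters fit: y ~ c0*x^2 + c1*x + c2, solved in closed form.
coeffs = np.polyfit(x, y, deg=2)

# Nonlinear-in-parameters fit: y ~ a*exp(b*x), solved iteratively
# from an initial guess for (a, b).
def model(x, a, b):
    return a * np.exp(b * x)

popt, _ = curve_fit(model, x, y, p0=(1.0, -1.0))
a_fit, b_fit = popt
```

Both fits require us to commit to a model form in advance, which is exactly the prior knowledge that non-parametric methods avoid.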
Machine learning algorithms can be extended beyond parametric techniques to non-parametric methods. These algorithms do not require an explicit parameterization or, in linear models, a statement of the basis. Examples include support vector machines, decision trees, and (deep) neural networks. In neural networks, the algorithm builds a useful representation of the data by setting a very large number of parameters. The parameters combine many very simple functions to build up the function being approximated. It is counterintuitive that neural network techniques are considered non-parametric given that they employ a large number of parameters. But the essential feature of non-parametric techniques, in particular neural networks, is that we need not describe a parameterization in advance based on prior knowledge. This gives the technique valuable flexibility to fit potentially complicated and unknown details in the function to be approximated. Avoiding the specification of a parameterization, of course, comes at a cost. Without the constraining prior information of a parameterization, non-parametric techniques require more data for training (fitting). This tradeoff between flexibility and data volume presents a recurrent challenge as we design and execute learning algorithms.
III Some motivational examples from the plasma physics community
Contemporary advances in machine learning are being quickly incorporated into research of interest to plasma physicists. Machine learning has been broadly investigated to help predict disruption in tokamak devices. Disruption, the sudden loss of confinement, is both potentially damaging to the device and difficult to model and predict. Rea and Granetz [rea] have used random forest learning techniques to predict disruptions on DIII-D with high accuracy. Here, the learning tool assigns the observed device conditions to a category – nondisrupted, near disruption, or far from disruption. This categorical prediction task is called classification. Others have developed similar predictive classification capabilities for DIII-D and JET using neural networks and support vector machines [cannas:nn_disrupt; vega:disrupt].
Researchers are also incorporating learning techniques directly into numerical simulations. Multiple groups have investigated using neural networks to learn closure models for hydrodynamic simulations of turbulent flow. We consider here an illustrative proof of principle for incorporating trained neural networks directly into discretized partial differential equation (PDE) models [duraisamy]. Using the Spalart–Allmaras turbulence model, researchers trained a neural network to approximate the source terms in the model (all right-hand-side terms excluding the diffusion term), then performed numerical simulations showing that the model with the learned approximation reproduced the solutions of the full PDE simulations. Similar techniques might be used in future investigations to approximate expensive physics packages with the goal of reducing computational cost.
In a final example, inertial confinement fusion (ICF) researchers used neural networks to explore high-dimensional design spaces. The team used both random forests and deep neural networks to learn the response of an expensive radiation hydrodynamics code over a 9-dimensional parameter space. With this learned response in hand, they navigated parameter space to find implosions that optimized a combination of high neutron yield and implosion robustness. The exercise led to the discovery of asymmetric implosions that, in simulation, provide high yield and a greater robustness to perturbations than spherical implosions. Without the ability to search parameter space with machine learning tools, the rare, well-performing, asymmetric simulations would have been difficult, if not impossible, to find [Peterson:2017kq; humbird:djinn; Nora:coda2015].
IV Fundamentals of neural networks
The most exciting growth in contemporary machine learning has come from advancements in neural network methods. A neural network is a set of nested, nonlinear functions that can be adjusted to fit data. A neural network, then, is really a complex function of the form ŷ = f(x) = f^(N)(f^(N−1)(··· f^(1)(x))), with one constituent function per layer.
An example network is conveniently represented as a graph in figure 1. The input values, x, experience a nonlinear transformation at each layer of the network. The final layer, or output layer, produces the ultimate result, the predicted values, ŷ. Intermediate layers are called hidden layers since their inputs and outputs are buried within the network. Each of these layers is composed of units, or neurons. A network layer can be described by its width, or the number of units in the layer. The network can also be described by the total number of layers, or the depth. Many-layer networks, or deep neural networks, frequently outperform shallow ones, supporting the heavy interest in deep learning.
Each neuron in a layer operates on a linear combination of the values from a previous layer, such that a subsequent layer accepts values z_j constructed from the prior-layer outputs, x_i, as z_j = Σ_i W_ji x_i + b_j. The elements of the tensor, W, are known as the weights, and those of the vector, b, as the biases. The weights and biases are the (many) free parameters to be chosen to approximate the relationship between inputs and outputs in a set of data to be fitted. The nonlinear operation performed by each unit is known as the activation function. We show candidate activation functions in figure 2. Historically, the activation function was sigmoidal, like σ(z) = 1/(1 + e^(−z)). Current practice relies heavily on the rectified linear unit, or ReLU, f(z) = max(0, z). This piecewise linear, but globally nonlinear, function often yields much better results than sigmoidal functions. This is mainly attributed to the saturation behavior of sigmoidal functions, which can lead to shallow gradients that slow learning. Taking advantage of the linear combinations between layers and choosing ReLU as the activation function, our example neural network becomes ŷ = W^(2) max(0, W^(1) x + b^(1)) + b^(2).
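The per-layer operation (linear combination of inputs, then activation) takes only a few lines of code. This is an illustrative sketch; the weights, biases, and input values are arbitrary:

```python
import numpy as np

def relu(z):
    # Rectified linear unit: piecewise linear, globally nonlinear.
    return np.maximum(0.0, z)

def layer(x, W, b):
    # One network layer: weighted sum plus bias, then activation.
    return relu(W @ x + b)

# Hypothetical small layer mapping 3 inputs to 2 units.
W = np.array([[1.0, -1.0, 0.5],
              [0.2,  0.3, -0.7]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 3.0])

h = layer(x, W, b)   # second unit's pre-activation is negative, so ReLU zeros it
```

Stacking calls to `layer` with different weight matrices reproduces the nested-function structure of the full network.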
To cement our understanding of the basics of neural networks, we turn to an instructive, analytical example. We will develop a small network to learn the exclusive or function, XOR. The XOR, represented in figure 3, accepts independent variables x_1 and x_2. When both input values are 0 or both values are 1, XOR returns 0. When x_1 and x_2 differ from each other, XOR returns 1. Using our language from section II, our task is to regress f(x_1, x_2) on the experience {(0,0), (0,1), (1,0), (1,1)} with supervised labels {0, 1, 1, 0}, respectively.
The example is not only interesting because we can write down the solution without appealing to extensive numerics, but also because it is of historical importance. Critics of early neural networks noted that the XOR problem could not be solved with a 2-layer network. This led critics to generalize, wrongly, that deep neural networks might also fail to handle essential nonlinearities in learning tasks. It is now well known that deep networks are exceptionally powerful for handling richly nonlinear tasks.
We proceed here to show that a 3-layer network (figure 4) succeeds at the XOR task. Our treatment is a modification of an example from the excellent book, Deep Learning [Goodfellow:deep_learning]. We take the opportunity to emphasize the importance of our choice of activation function to the network performance. We will experiment with two activation functions: a linear function (a bad choice) and the ReLU (a good choice). We begin with the linear activation function. At this point, we have specified our network architecture (figure 4) and our activation function (linear). We next choose the cost function we use to measure the nearness of our predicted values to the true XOR values. For simplicity, we choose the mean squared error over the four examples, such that C = (1/4) Σ_i (ŷ_i − y_i)².
Our network approximation is very simple: with a linear activation, the whole network collapses to ŷ = w·x + b. Inserting this into the cost function, we recover the normal equations for linear least squares. The solution is w = 0 and b = 1/2, so the network predicts 1/2 for every input. This constant solution is not at all what we want.
Let us now explore the same procedure – same network, same loss function, but this time choosing ReLU for the activation function. Calling the input, x, the hidden layer output, h, and the final scalar output, ŷ, we have h_j = max(0, W_ij x_i + c_j) as the transform from input layer to hidden layer and ŷ = w_j h_j + b as the transform from hidden layer to final output. Combining the transformations, we have (summing on repeated indices) ŷ = w_j max(0, W_ij x_i + c_j) + b. We now have a neural network, albeit a simple one. What remains is to select the indexed constants. We could try to learn these constants using the training experience and an optimization algorithm like gradient descent, which we describe next. For now, we simply select the nine numbers needed to exactly reproduce the XOR behavior: all four elements of W equal to 1, c = (0, −1), w = (1, −2), and b = 0. This leads to a completely specified network, ŷ = max(0, x_1 + x_2) − 2 max(0, x_1 + x_2 − 1),
which by inspection can be seen to give the desired answers. This simple example has served two purposes for us. It has made concrete what a neural network is, but it has also highlighted the importance of the proper activation function. We must exercise caution when choosing this function in practical applications, too.
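The inspection can also be automated. This short check evaluates a 3-layer ReLU network on all four XOR inputs, using the parameter values from the Deep Learning book's XOR example (nine numbers in total, matching the count in the text):

```python
import numpy as np

W = np.array([[1.0, 1.0],
              [1.0, 1.0]])   # input -> hidden weights
c = np.array([0.0, -1.0])    # hidden biases
w = np.array([1.0, -2.0])    # hidden -> output weights
b = 0.0                      # output bias

def xor_net(x):
    h = np.maximum(0.0, W @ x + c)   # hidden layer with ReLU activation
    return w @ h + b                 # linear output layer

inputs = [np.array(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
outputs = [xor_net(x) for x in inputs]
```

The four outputs are 0, 1, 1, 0, exactly the XOR truth table.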
Of course, deep learning is interesting because it scales well to enormously difficult research tasks. For these tasks, we need a numerical method for selecting the optimal parameters when we cannot surmise them by inspection. In these cases, we seek a technique for minimizing the cost function. The standard process is as follows:
compute current estimates of the output, ŷ,
measure the difference between the current estimates and the true training data using the loss function,
compute the gradient of the loss function with respect to the parameters, θ, using backpropagation, and
choose new parameters that most reduce the loss function using gradient descent.
Backpropagation is an efficient algorithm to compute the gradient of the loss function with respect to the parameters, ∇_θ C. Because the training data is independent of the choice of θ, this is really an algorithm for finding the gradient of the network itself, ∇_θ f(x; θ). The algorithm specifies the order of differentiation operations following the chain rule so that repeatedly used derivatives are stored in memory rather than recomputed. This accelerates the computation at the cost of additional memory, a tradeoff that is desirable for most applications.
With the gradient in hand, a gradient descent algorithm can be used to update the parameters according to a rule like θ ← θ − ε ∇_θ C. The parameter ε is commonly called the learning rate. We must set the learning rate with care. The nonlinear nature of deep neural networks typically introduces many local minima. Setting the learning rate too small can trap the gradient descent in a sub-optimal local minimum. Setting it too large can allow large leaps that skip regions of desirable behavior. There are also alternative parameter optimization techniques, including ones with variable learning rates and Newton-style schemes.
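The update rule and the sensitivity to the learning rate can be seen on a toy cost function. This sketch (not from the text) minimizes C(θ) = (θ − 3)² by gradient descent with two different learning rates:

```python
# Gradient of the toy cost C(theta) = (theta - 3)^2.
def grad(theta):
    return 2.0 * (theta - 3.0)

# A modest learning rate converges toward the minimum at theta = 3.
theta = 0.0
for _ in range(200):
    theta -= 0.1 * grad(theta)

# Too large a learning rate overshoots the minimum on every step
# and the iteration diverges.
theta_big = 0.0
for _ in range(50):
    theta_big -= 1.1 * grad(theta_big)
```

For this quadratic, each step multiplies the error by (1 − 2ε), so ε = 0.1 contracts the error while ε = 1.1 amplifies it; deep-network loss surfaces add local minima and curvature variation on top of this basic behavior.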
V A numerical starting point
We now turn to a simple numerical example to help develop the numerical tools required for application of deep neural networks. Our task will be to develop an approximate function for the simple, nonlinear relationship y = x_1² + x_2². We will use the open-source Python package scikit-learn [scikit-learn] to help readers begin.
```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Build training data on a 2D grid: y = x1^2 + x2^2.
x1, x2 = np.mgrid[-1:1:200j, -1:1:200j]
v1 = np.ravel(x1)
v2 = np.ravel(x2)
Y = v1**2 + v2**2
X = np.stack((v1, v2), axis=1)

# Fit a neural network, then evaluate it on the training inputs.
nn = MLPRegressor()
nn.fit(X, Y)
yptrain = nn.predict(X)
```
Here, the class MLPRegressor (a MultiLayer Perceptron, or deep neural network) returns a neural network object. The method fit() performs backpropagation and gradient descent using the training data X, Y. Then, the method predict() evaluates the trained neural network at all locations in the data X. Software tools like MLPRegressor are helpful because they can be implemented with relative ease. However, even simple deep learning techniques are powerful and flexible, and they require the user to set, or accept defaults for, multiple parameters, for example the hidden layer sizes, learning rate, and activation function. Choosing these efficiently requires knowledge of the underlying numerics and often some experimentation. We show in figure 5 the true function and neural network approximations made with both poor and good choices of parameters.
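Rather than accepting the defaults, the same fit can be run with the key hyperparameters stated explicitly. The particular values below are illustrative choices for this toy problem, not recommendations:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Random samples of the same target relationship, y = x1^2 + x2^2.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
Y = X[:, 0]**2 + X[:, 1]**2

nn = MLPRegressor(hidden_layer_sizes=(50, 50),  # two hidden layers of 50 units
                  activation='relu',            # ReLU activation
                  learning_rate_init=1e-3,      # initial gradient step size
                  max_iter=2000,                # cap on training iterations
                  random_state=0)               # reproducible initialization
nn.fit(X, Y)
score = nn.score(X, Y)   # R^2 on the training data
```

Varying any one of these settings (try a single 2-unit hidden layer, or a learning rate of 1.0) makes the poor-choice/good-choice contrast of figure 5 easy to reproduce.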
VI Examining the quality of your learned model
This raises a key question: what does it mean for a learned model to be good? We can begin by defining a scalar measure for goodness of fit like the R² value, R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)², where the y_i are the true training values, the ŷ_i are the predicted values, and ȳ is the expectation value of the y_i. As the ŷ_i approach the y_i, R² tends to unity. However, it is not sufficient for the model to achieve a high R² value on the training data. We show a set of three model fits in figure 6. The first model achieves a high R² and is intuitively what we mean by a good fit. We call this a well fitted model. The model with low R² is a bad fit and uses a model that is too simple to explain the data. We call this failure to match the training data underfitting. The third model also has a good fitness metric, but is clearly overly complicated for the data. We call this behavior overfitting. All of our fitness assessments have been made on the same data that we used to train our models. We call this an assessment of training error.
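The R² metric follows directly from its definition. This sketch computes it for a small set of invented true and predicted values:

```python
import numpy as np

# Invented true targets and model predictions.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2 = 1.0 - ss_res / ss_tot   # -> 0.98 for these values
```

A model that simply predicted the mean of y would score R² = 0, and perfect predictions would score R² = 1.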
With simple univariate data, it is sometimes possible to identify underfitting or overfitting by plotting both the model and the training data against the independent variable. However, we need to be more sophisticated with the high-dimensional data typical of deep learning applications. To do so, we introduce the notion of generalization to our model. We demand not only that the fitted model get the right answer for data that was used in training, but also that it generalize – that it get the right answer for data that was not used in the training. We can compute a generalization error, or test error, using the same R² function to assess data not used in training. This data might be a subset of the available training data that was intentionally held out to test generalization, or it might be new data collected after training. The concept of testing both training error and generalization error is called cross validation.
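Holding out test data is a one-line operation in scikit-learn. This sketch, with invented data for the same toy relationship as section V, reports both training and test R²:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
Y = X[:, 0]**2 + X[:, 1]**2

# Intentionally hold out 25% of the data to estimate generalization error.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0)

nn = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
nn.fit(X_train, Y_train)

train_r2 = nn.score(X_train, Y_train)   # training error assessment
test_r2 = nn.score(X_test, Y_test)      # generalization (test) error assessment
```

A large gap between `train_r2` and `test_r2` is the numerical signature of overfitting.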
While developing a reliable trained model, we usually adjust the model capacity, or the flexibility with which it can accommodate the data. We can add capacity by introducing additional neurons or layers, for example. We can remove capacity by adding a cost function penalty (regularization) for regions of parameter space that produce undesirable models. As we increase model capacity, the test and training errors typically evolve as shown in figure 7. The training error falls to low values as the model "connects the dots," or directly interpolates the data. However, the test error reaches a minimum before rebounding. As the model becomes overly complicated, it begins to fail to predict unseen test data. Our models are underfitted if they have high training error. Once we have increased the model capacity to reduce training error, we turn to the test error. Models with low training error, but high test error, are overfitted. For intermediate capacities, the model is said to be well fitted. It may be that even in the well-fitting regime, we find the test error unacceptably high. In this case, we may be forced to collect more training data to improve the fit. This is usually an expensive or time-consuming proposition.
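The capacity sweep of figure 7 can be reproduced with a simple parametric family: polynomial fits of increasing degree play the role of increasing capacity. The data here are invented (a noisy sine wave), and the sketch records training and test error at each degree:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of sin(2*pi*x): the "truth" we are trying to learn.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # held-out test set

train_err, test_err = [], []
for degree in range(1, 13):         # capacity = polynomial degree
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err.append(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_err.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
```

Training error falls monotonically toward the "connect the dots" regime, while test error is minimized at an intermediate degree and rebounds as the high-degree fits oscillate between the training points.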
VII The strengths of deep learning solutions
In principle, neural networks can offer perfect approximations to functions. This notion is described formally and theoretically in work on universal approximation. Multiple authors have shown that any sufficiently smooth function can be represented by a 3-layer neural network [cybenko; hornik]. To be capable of universal approximation, the network must have a nonlinear (squashing) activation function. While such a network can be proven to exist, it may not be very useful. First, the network may need to be arbitrarily wide, making it impossible to develop enough data for training. Second, even the existence of a finite network says nothing about whether the network can be trained. Much prior work has been done using sigmoidal activation functions. Though they meet the nonlinearity requirements for universal representation, they also saturate at extreme input values. This saturation often leads to shallow gradients in the cost function which greatly slow the training process (see section IV). The cost function can sometimes be chosen to rectify these shallow gradients, but not always.
The revolution in contemporary deep learning has been based on successful repairs to the shortcomings of historical networks. A key advance is the now-routine use of nonlinear activation functions that don’t saturate (e.g., ReLU). Networks also commonly use cost functions that are engineered to interact well with the selected activation function (e.g., cross entropy). Perhaps the most useful advance is the recognition that deep networks routinely outperform shallow ones. Deep networks typically require fewer total units for the same task and produce improved generalization error. These features couple well with a host of other advancements: the development of backpropagation for efficient gradient computation, the arrival of "big data" for training large networks, modern computer architectures and processor development (e.g., the general purpose graphics processing unit (GPGPU)), and neural network architectures that can exploit structures in the training data. Taken together, these advances have propelled the explosion of progress in deep learning.
The distinguishing feature of deep learning techniques is their ability to build very efficient representations of the training data. Deep networks use the many hidden layers to develop an intermediate representation of the data called a latent space (see figure 8). This latent space is essentially a nonlinear coordinate transformation. We can think of this as something like a basis for expressing the training data. Deep neural networks rely on these effective latent spaces to capture fine details in the mapping from input to output.
The notion of the latent space and the associated sequential transformations in hidden layers is beautifully described in an example by Honglak Lee et al. [lee_honglak:latent], which we partly reproduce in figure 9. At each layer of a neural network developed for facial recognition, we can see the structure of the latent space develop. Each layer develops more resolving power, leading to features that can be interpreted and can also be combined to produce a desired output. Deep neural networks like this work very well for the strong nonlinearities that can characterize plasma physics problems. We show an ICF example in figure 10. The task in this example is to reproduce the very rapid change in total neutron yield for an ICF implosion experiencing strong degradations. While a more traditional learning model, like Bayesian additive regression trees (BART), achieves moderate training error, it generalizes rather poorly. A deep neural network tool (called DJINN) captures the nonlinearities and generalizes well. The network built here is considerably more sophisticated than the demonstration network in section V. It was developed using the software package TensorFlow (www.tensorflow.org), which is specifically designed for complicated networks and large-scale data.
VIII Tailoring deep networks to your application
Deep neural networks and their efficient latent spaces are flexible tools that can be applied to many tasks. However, the network can and should be specialized to the task. We cover here a few common tasks that occur in physical science problems and the specialized networks that best handle them.
viii.1 Autoencoders for dimensional reduction
We touch first on autoencoders. Autoencoders are networks composed of two consecutive pieces, an encoder and a decoder. The encoder transforms the network input data to a more efficient representation in latent space. The decoder reverses the transformation, restoring the network input from the latent space representation. Because the network maps input back to input, this is an unsupervised learning technique. In our initial definition of learning, supervised training used paired input and output sets, {(x_i, y_i)}. Here, we use only a single set as network input, say {x_i}.
Autoencoders have a characteristic bottleneck structure (see figure 11) to compress information into a lower-dimensional latent space. The overarching goal is usually to develop a descriptive latent representation of the data while maintaining good fidelity following decoding. These networks can be used to reduce the dimensionality of data, analogous to a principal components method. This type of dimensional reduction is useful in data analysis and learning tasks. Reducing the number of dimensions can reduce the volume of data needed to train models and perform analyses. As an example, we show a dimensionally reduced autoencoder representation of x-ray spectral data [humbird:spectra]. The network successfully reduces the number of variables necessary to describe the spectrum from 250 to 8. This reduction is close to that achieved by a parameterized physics model created with expert knowledge [oxford:mix_spectra]. However, because it is a non-parametric technique, the autoencoder did not require the parametric description of the model.
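The bottleneck idea can be sketched even with scikit-learn's MLPRegressor by training the network to reproduce its own input through a narrow hidden layer. This is an illustrative hack, with invented data that secretly depends on only two parameters; dedicated frameworks like TensorFlow are better suited to real autoencoder work:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# 20-dimensional "spectra" generated from just 2 hidden parameters.
t = rng.uniform(-0.5, 0.5, size=(500, 2))
grid = np.linspace(0.0, 1.0, 20)
X = t[:, [0]] * np.sin(np.pi * grid) + t[:, [1]] * np.cos(np.pi * grid)

# Hidden layers 16 -> 2 -> 16: the 2-unit middle layer is the bottleneck
# (latent space); training target equals the input (unsupervised).
ae = MLPRegressor(hidden_layer_sizes=(16, 2, 16), activation='tanh',
                  max_iter=5000, tol=1e-6, random_state=0)
ae.fit(X, X)

recon_r2 = ae.score(X, X)   # reconstruction fidelity after the bottleneck
```

Because the data truly live on a 2-parameter family, a 2-unit latent space can reconstruct them well; shrinking the bottleneck below the intrinsic dimensionality would degrade `recon_r2`.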
viii.2 Convolutional networks for arrayed data
Neural networks can be specialized and simplified to account for structure and correlation in the training data. We discuss now modifications that may be suitable for treating array data, whether image data or fixed-length vector data. Here, neighboring pixel values are often correlated. Well-designed networks can encode these relationships in the structure of the model. The neural network of choice is typically a convolutional network.
To start, we recognize that the network architecture determines the relationships between the input layer and other neurons. While the most general neural network is fully connected, with each neuron providing input to every neuron in the next layer (see figure 13), the network need not be fully connected. In fact, the data to be learned may not support the many connections in a fully connected network. Furthermore, we may want to modify the network to reduce its size, accelerate training, or improve its accuracy. For example, a pixel in the center of an image likely depends on its nearest neighbors, but it is probably much less affected by the corners of the image. We might then employ sparse connectivity. A sparse network reduces the number of connections, allowing a neuron to feed only a few near neighbors in the subsequent layer. This reduces the number of weights and biases to be trained, consequently reducing the data required for training. Sparse connections also change the receptive field for each neuron. In a fully connected network, the activation for a particular neuron depends on the inputs from all neurons in the previous layer. The receptive field for the neuron is the entire previous layer. In the sparsely connected example, the receptive field is reduced to only three nearby neurons in the preceding layer. This reduces the impact of far-field information on local neuron values and may better reflect the underlying data, as in our central pixel example.
The network can be further modified to reduce the number of free parameters using parameter sharing. In this scheme, the weights on edges connecting neurons in the same relative position are the same. We represent this shared weighting with color in figure 13. Each directly downstream neuron has the same weight; edges on major diagonals likewise share values. This is especially sensible if a pixel depends on its neighbors in the same way regardless of its position in the array – a good assumption for most scientific images.
Ultimately, to accommodate the correlations in array data, we replace the matrix multiplication in the neural network with convolution over a kernel. This not only reduces the data required to train, thanks to sparse connections and parameter sharing, but it greatly reduces the number of numerical operations needed in training. Convolution also builds in a degree of invariance to small displacements, simplifying registration requirements in the analysis process. In practice, convolutional neural networks have been responsible for a dramatic improvement in deep learning for image processing. Each year, learning experts compete to develop image recognition tools using an open-source image data set called ImageNet [imagenet] (http://www.image-net.org/). Until 2012, the winning error rate was about 25%, falling a percent or two per year. The introduction of convolutional networks in 2012 brought a 10% reduction, and top error rates are now routinely in the low single digits. We note that at the same time that convolutional networks were being introduced, training on graphics processing units (GPUs) arrived, leading to computational hardware developments to support the software advancements.
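The claim that convolution equals matrix multiplication with sparse connections and shared parameters can be checked directly. This sketch (a 1D toy, with an arbitrary 3-tap kernel and signal) builds the equivalent banded weight matrix and compares it to NumPy's convolution:

```python
import numpy as np

kernel = np.array([0.5, 0.3, 0.2])            # one shared 3-tap kernel
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0])  # a short 1D "image"

# Dense-matrix view: every row holds the same three weights (parameter
# sharing), placed in a sliding band (sparse connectivity).
n_out = len(x) - len(kernel) + 1
W = np.zeros((n_out, len(x)))
for i in range(n_out):
    W[i, i:i + 3] = kernel[::-1]   # flipped to match convolution's convention

by_matrix = W @ x
by_convolve = np.convolve(x, kernel, mode='valid')
```

The two results agree exactly; a convolutional layer is just this banded, weight-shared matrix, applied with far fewer free parameters than a fully connected layer.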
viii.3 Transfer learning for sparse data
While deep learning inherently relies on large data sets to train the many parameters in the network, it is also possible to develop networks using sparse data. The key concept is called transfer learning (see figure 14). In transfer learning, we first train a deep neural network on a large corpus of data. This could be open-source data, like ImageNet. Or, it might be scientific simulation data that is easier to obtain in large volumes than corresponding experimental observations. In this initial training step, the network develops an efficient latent space representation of the data. The model sets the full complement of parameters in this period. If the task is image recognition, we might say that the network learns to see in this first step. In the following step, a limited set of parameters, typically those in the last layer or layers of the network, are re-trained on a smaller corpus of data. This data is typically more expensive data associated with a specialized task. Because only a limited number of parameters can be adjusted in the re-training step, we can get by with a much smaller data set. Thus, transfer learning allows us to augment small, precious data sets with large, low-cost data sets to train effective networks. This may sound too good to be true, but it works. For example, scientists working at the National Ignition Facility trained a deep neural network classifier [mundhenk] on ImageNet data (images of cats, fruits, etc.), but used subsequent transfer learning to help identify defects in high-power laser optics (images of damage sites in lenses) with high accuracy (figure 15). Transfer learning potentially allows deep learning techniques to be applied to relatively small experimental data sets using augmentation from cheaper related simulation data sets or even unrelated open-source data sets.
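The two-step recipe can be sketched with the tools from section V. Everything here is invented for illustration: a large, cheap "simulation" set trains the full network, then only the final linear layer is re-fit on a small, shifted "experimental" set using the frozen hidden layers as features:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Step 1: train the full network on a large, cheap data set.
X_big = rng.uniform(-1, 1, size=(5000, 2))
Y_big = X_big[:, 0]**2 + X_big[:, 1]**2
nn = MLPRegressor(hidden_layer_sizes=(64, 16), max_iter=2000, random_state=0)
nn.fit(X_big, Y_big)

def hidden_features(model, X):
    # Forward pass through the frozen hidden (ReLU) layers only.
    h = X
    for W, b in zip(model.coefs_[:-1], model.intercepts_[:-1]):
        h = np.maximum(0.0, h @ W + b)
    return h

# Step 2: a small, precious data set from a related task (offset by 0.5).
X_small = rng.uniform(-1, 1, size=(30, 2))
Y_small = X_small[:, 0]**2 + X_small[:, 1]**2 + 0.5

# Re-train only the last layer: linear least squares on the frozen features.
H = np.hstack([hidden_features(nn, X_small), np.ones((30, 1))])
w_new, *_ = np.linalg.lstsq(H, Y_small, rcond=None)

# Evaluate the transferred model on fresh data from the small task.
X_chk = rng.uniform(-1, 1, size=(200, 2))
pred = np.hstack([hidden_features(nn, X_chk), np.ones((200, 1))]) @ w_new
true = X_chk[:, 0]**2 + X_chk[:, 1]**2 + 0.5
mse = np.mean((pred - true) ** 2)
```

Only 17 last-layer parameters are re-fit from 30 samples, yet the transferred model inherits the representation learned from the 5000-sample corpus.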
viii.4 Recurrent networks for time series
We finally consider specializations for time series data. The networks we have considered so far are feedforward networks. Information that enters the network propagates through the network with each layer affecting only the subsequent layers. However, when handling sequence information, like natural language or scientific time series, we may need to remind a layer of a value that it has seen before in the context of later values. More specifically, we may want a feedback mechanism. For this, we replace the simple neuron with a recurrent unit called a long short-term memory (LSTM) unit [colah:rnn]. The LSTM, more complicated than the feedforward neuron, uses feedback to establish a state of the unit. Thus, the unit output depends not only on the current input from a sequence, but also on the state established by previous sequence values. As shown in figure 16, a recurrent network can be unfolded to look like a feedforward network. The recurrent LSTM allows networks to adapt to sequences of arbitrary length and is a useful tool for analyzing records parameterized by time or another single scalar.
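The state-carrying behavior of an LSTM can be made concrete with the standard cell equations in a few lines of NumPy. The weights below are random and untrained; the point is only the structure: gates control what the cell state forgets, writes, and exposes as it steps through a sequence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, params):
    # One standard LSTM update; (h, c) carry state across the sequence.
    Wi, Wf, Wo, Wg, bi, bf, bo, bg = params
    z = np.concatenate([x, h])       # current input joined with prior state
    i = sigmoid(Wi @ z + bi)         # input gate: what to write
    f = sigmoid(Wf @ z + bf)         # forget gate: what to keep
    o = sigmoid(Wo @ z + bo)         # output gate: what to expose
    g = np.tanh(Wg @ z + bg)         # candidate cell values
    c = f * c + i * g                # updated cell state
    h = o * np.tanh(c)               # updated unit output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 4
params = tuple(rng.normal(scale=0.5, size=(n_hidden, n_in + n_hidden))
               for _ in range(4)) + tuple(np.zeros(n_hidden) for _ in range(4))

h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x_t in [0.5, -1.0, 0.25]:        # a short scalar time series
    h, c = lstm_step(np.array([x_t]), h, c, params)
```

Because `h` and `c` are threaded through the loop, the final output reflects the whole sequence, not just the last sample; unrolling the loop gives the feedforward picture of figure 16.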
We summarize in table 1 the various networks and the tasks for which they might be appropriate.
|network type or technique|application or data type|
|fully-connected network|scalar data|
|convolutional network|fixed-length vector or image data|
|recurrent network|time-histories|
|transfer learning|sparse data|
|auto-encoder|data to be dimensionally reduced|
IX Impacts of machine learning on computer architectures
Machine learning operations are readily parallelized. This has made them amenable to execution on general-purpose GPUs, which are characterized by many-core processors and high memory bandwidth. The CUDA language for writing arbitrary code on GPUs has allowed numerous machine learning algorithms and software packages to take advantage of this capability. As practitioners looking to implement learning algorithms, we must choose the computer architecture for training carefully. For the DJINN model humbird:djinn , written in TensorFlow, training on a GPU proceeds about twice as fast as on an equivalent CPU. This puts competing design pressures on computers for scientific machine learning. We may still want the good branching control, parallelism across large networks, and programming convenience of CPUs for scientific simulation. For subsequent learning, we may want the benefits of GPUs for model training. In some circumstances, machine learning workflows can benefit from specialized chips, sometimes called inference engines, used only to evaluate an already trained neural network. Customers and computer vendors are increasingly considering heterogeneous architectures containing CPUs, GPUs, and inference engines. However, the needs of computer users in the commercial technology, consumer goods, and scientific communities can be quite varied. Our scientific community is responsible for exploring the computer design requirements generated by our research and developing a vision for the next generation of scientific computers.
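The claim that learning operations parallelize well rests on the fact that training is dominated by dense linear algebra. A small NumPy sketch (sizes here are arbitrary) makes the point even on a CPU: a single batched matrix product, dispatched to a multithreaded BLAS kernel, vastly outpaces the mathematically identical computation expressed as explicit Python-level loops, and the same kernel is what maps onto the many cores of a GPU.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 256))   # a batch of network inputs
W = rng.normal(size=(256, 128))   # one layer's weight matrix

# Serial view: one dot product at a time, in interpreted Python.
t0 = time.perf_counter()
slow = np.array([[row @ W[:, j] for j in range(W.shape[1])] for row in X])
t_loop = time.perf_counter() - t0

# Parallel view: the whole batch as one linear-algebra kernel.
t0 = time.perf_counter()
fast = X @ W
t_blas = time.perf_counter() - t0

print(f"loop: {t_loop:.3f} s, batched kernel: {t_blas:.5f} s")
```

The two results agree to machine precision; only the scheduling of the arithmetic differs, which is precisely the freedom that many-core hardware exploits.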
X Jointly advancing physical science and machine learning
Regardless of the particular task or the computer platform used, learning algorithms derive much of their power from their flexibility. In fact, deep learning models achieve their tasks without detailed intervention by the user, say by explicitly constructing a parametric model. Some go so far as to say that, for the most advanced algorithms, no one knows exactly how they function ai:dark_secret . Interpreting the function of these complicated algorithms is difficult, at least in part because there is often no external theory for the tasks they aim to achieve. There is no set of first-principles laws for teaching autonomous vehicles or for parsing natural-language text. Applied science, however, is distinctly different. For many tasks, like a regression task mapping numerical simulation inputs to their computed outputs, there exists at least an approximate parallel theory. Learned models for scientific tasks can be compared to a variety of existing theoretical models, they can be tested against repeatable experiments, and they can be checked against physical laws. Moreover, the scientific community often produces its own data through simulation or experiment. Thus, we can perform experiments on the learned models themselves, augmenting or adapting training data with new examples to test the effects.
The use of modern machine learning for scientific purposes raises a long list of questions for exploration by the community. Can we use machine learning to better understand experimental data? Can we use machine learning to accelerate and improve numerical simulation? How should we use learning to explore experimental design spaces? How do we quantify uncertainty in analysis using machine learning? Can we apply learning across data sets of multiple fidelities: experiment, low-order simulations, and higher-order simulations? Can we, as a scientific community, develop a more formal theory of machine learning by building on the foundations of statistical physics, for which there are many parallels? With the proliferation of machine learning algorithms and of software tools for implementing them (table 2), it is incumbent upon our community to embrace and develop these tools to advance our scientific missions.
Acknowledgements. I would like to thank my Ensembles and Machine Learning Strategic Initiative team members for the challenging and exciting discussions that teach me so much. Special thanks to Luc Peterson, John Field, Kelli Humbird, Jim Gaffney, Ryan Nora, Timo Bremer, Jay Thiagarajan, and Brian Van Essen. I also thank Jim Brase and Katie Lewis for inviting me into this research area and giving this kind of work an organized home at Lawrence Livermore National Laboratory. Prepared by LLNL under Contract DE-AC52-07NA27344.
-  B. Cannas, A. Fanni, P. Sonato, M. K. Zedda, and JET-EFDA contributors. A prediction tool for real-time application in the disruption protection system at JET. Nuclear Fusion, 47(11):1559, 2007.
-  Yu-Ning Aileen Chuang. Lawmakers: Don’t gauge artificial intelligence by what you see in the movies. https://www.npr.org/sections/alltechconsidered/2017/10/05/555032943/lawmakers-dont-gauge-artificial-intelligence-by-what-you-see-in-the-movies. Accessed: 13-December-2017.
-  O. Ciricosta, H. Scott, P. Durey, B. A. Hammel, R. Epstein, T. R. Preston, S. P. Regan, S. M. Vinko, N. C. Woolsey, and J. S. Wark. Simultaneous diagnosis of radial profiles and mix in NIF ignition-scale implosions via x-ray spectroscopy. Physics of Plasmas, 24(11):112703, 2017.
-  G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
-  Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
-  Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 – 366, 1989.
-  K. D. Humbird. Private communication, 2017.
-  K. D. Humbird, J. L. Peterson, and Ryan G. McClarren. Deep jointly-informed neural networks. CoRR, abs/1707.00784, 2017.
-  Will Knight. The dark secret at the heart of AI. https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/. Accessed: 15-December-2017.
-  Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, pages 609–616, New York, NY, USA, 2009. ACM.
-  T. M. Mitchell. Machine Learning. McGraw-Hill, 1997. New York.
-  T. Nathan Mundhenk, Laura M. Kegelmeyer, and Scott K. Trummer. Deep learning for evaluating difficult-to-detect incomplete repairs of high fluence laser optics at the national ignition facility. In QCAV, 2017.
-  Ryan Nora, Jayson Luc Peterson, Brian Keith Spears, John Everett Field, et al. Ensemble simulations of inertial confinement fusion implosions. Statistical Analysis and Data Mining: The ASA Data Science Journal, 10(4):230–237, 2017.
-  Chris Olah. Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/. Accessed: 17-December-2017.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  J. L. Peterson, K. D. Humbird, J. E. Field, S. T. Brandon, S. H. Langer, R. C. Nora, B. K. Spears, and P. T. Springer. Zonal flow generation in inertial confinement fusion implosions. Physics of Plasmas, 24(3):032702, 2017.
-  C. Rea and R. S. Granetz. Exploratory machine learning studies for disruption prediction using large databases on DIII-D. Fusion Science and Technology, 2017.
-  Brendan Tracey, Karthik Duraisamy, and Juan J. Alonso. A machine learning strategy to assist turbulence model development. American Institute of Aeronautics and Astronautics Inc, AIAA, 2015.
-  Jesus Vega, Sebastian Dormido-Canto, Juan M. Lopez, Andrea Murari, Jesus M. Ramirez, Raul Moreno, Mariano Ruiz, Diogo Alves, and Robert Felton. Results of the jet real-time disruption predictor in the iter-like wall campaigns. Fusion Engineering and Design, 88(6):1228 – 1231, 2013. Proceedings of the 27th Symposium On Fusion Technology (SOFT-27); Liege, Belgium, September 24-28, 2012.