MLMOD Package: Machine Learning Methods for Data-Driven Modeling in LAMMPS

by Paul J. Atzberger, et al.

We discuss a software package for incorporating into simulations data-driven models trained using machine learning methods. These can be used for (i) modeling dynamics and time-step integration, (ii) modeling interactions between system components, and (iii) computing quantities of interest characterizing system state. The package allows for use of machine learning methods with general model classes, including Neural Networks, Gaussian Process Regression, Kernel Models, and other approaches. In this whitepaper we discuss our prototype C++ package, its aims, and example usage.





Recent advances in machine learning, optimization, and available computational resources are presenting new opportunities for data-driven modeling and simulation in the natural sciences and engineering. Empirical successes in deep learning suggest promising techniques for learning non-linear mappings and for identifying features of underlying structure [12, 10]. Scientific computations and associated dynamical systems present a unique set of challenges for employing recent machine learning methods, which are often motivated by image analysis and natural language processing [2, 7, 27]. For scientific and engineering applications there are often important constraints required to obtain plausible models and a need for results to be more interpretable. In large-scale scientific computations, modeling efforts often aim to start with first principles, or more detailed concepts, from which computation is used to obtain insights into larger-scale emergent system behaviors. Examples include the rheological responses of soft materials and complex fluids from microstructure interactions [1, 4, 18, 15], molecular dynamics modeling of protein structure and functional domains from atomic-level interactions [20, 14, 6, 23], and prediction of weather and climate phenomena from detailed physical models and measurements [25, 3].

Obtaining observables and quantities of interest (QoI) from simulations of such high-fidelity detailed models can involve significant computational resources [19, 26, 31, 22, 21, 30]. However, many observables are insensitive to much of the detailed system behavior and are expected to depend on only a subset of underlying factors. If these key features of the system could be identified for classes of observables, this would present opportunities to formulate simplified models that are less computationally expensive to simulate and could provide further insights into the mechanisms underlying system behaviors.

Recent machine learning methods provide promising data-driven approaches for learning features and models for system behaviors from high-fidelity simulation data. This includes learning data-driven models for (i) dynamics of the system at larger spatial-temporal scales, (ii) interactions between system components, and (iii) features yielding coarser degrees of freedom or new quantities of interest characterizing system behaviors. The models obtained from learning can take many forms, including Deep Neural Networks (DNNs) [10], Kernel Regression Models (KRM) [28], Gaussian Process Regression (GPR) [24], and others [11].

A practical challenge is the effort often required to incorporate the contributions of such learned system features to augment existing models and simulations. Our package MLMOD aims to address this aspect of data-driven modeling by providing a general interface for incorporating ML models using standardized representations and by leveraging existing simulation frameworks such as LAMMPS. The MLMOD package provides hooks which are triggered during key parts of simulation calculations. In this way, standard machine learning frameworks such as PyTorch and TensorFlow can be used to train ML models, and the resulting models can then be translated readily into practical simulations.

Data-Driven Modeling

Data-driven modeling can take many forms. As a specific motivation for the package and our initial prototype implementations, we discuss a rather particular example case in detail, but we aim for our overall package also to be useful in other settings. Consider the case of a detailed molecular dynamics simulation of relatively large colloidal particles within a bath of much smaller solvent particles. It is often of interest to try to infer the interaction law between the colloidal particles given the type of solution, charge, and other physical conditions. While there is extensive theoretical literature on colloidal interactions and approximate models [8, 9, 13], which have had many successes, these can also have limited accuracy and be challenging to apply in many practical settings [13, 29]. Computational simulations are widely used and allow for investigations of phenomena where detailed modeling can be employed to control for contributing physical effects.

While providing predictions related to the principles built into the detailed models, such computational simulations can be expensive given the many degrees of freedom required to represent the solvent in the entire spatial domain and from small time-scales associated with solvent-solvent interactions. However, given the size contrast of a colloid and the solvent it is expected that in many circumstances the colloidal dynamics and interactions are governed primarily through temporal averaging of the solvent-solvent and solvent-colloid interactions.

Figure 1: Data-driven modeling from detailed molecular simulations can be used to train machine learning (ML) models for performing simulations at larger spatial-temporal scales. This can include models for the dynamics, interactions, or for computing quantities of interest (QoI) characterizing the system state. The colloidal system, for example, could be modeled by dynamics at a larger scale using equation 1 with the mobility tensor $M(\mathbf{X})$ obtained from training. In the MLMOD package, the ML models can be represented by Deep Neural Networks, Kernel Regression Models, or other model classes.


Relative to the detailed molecular dynamics simulation, this motivates a simplified model for the effective colloid dynamics

$$\frac{d\mathbf{X}}{dt} = M(\mathbf{X})\,\mathbf{F} + k_B T\,\nabla_{\mathbf{X}} \cdot M(\mathbf{X}) + \mathbf{F}_{\mathrm{thm}}, \qquad \langle \mathbf{F}_{\mathrm{thm}}(s)\,\mathbf{F}_{\mathrm{thm}}(t)^T \rangle = 2\,k_B T\, M(\mathbf{X})\,\delta(t-s). \qquad (1)$$

Here $\mathbf{X}$ refers to the collective configuration of all colloids and $\mathbf{F}$ to the forces acting on them. The main component to determine for the simplified model is the mobility tensor $M(\mathbf{X})$, which captures the effective coupling between the colloidal particles.

In principle, $M(\mathbf{X})$ could be tabulated for each configuration $\mathbf{X}$ by performing a sequence of computational simulations over all configurations and force combinations. However, in the general case this is inefficient in practice unless there are known symmetries or other physical structures, for example interactions occurring only pairwise, or translational and rotational invariances / equivariances. In the case of pairwise interactions, translational invariance, and rotational equivariance, the mobility can be reduced effectively to a one-dimensional dependence on the configuration. In many circumstances such symmetries of the system may not be immediately apparent, and even when symmetries are known, tabulation can present drawbacks for interpolation and storage.

Machine learning methods provide data-driven approaches for learning representations and features for such modeling. Optimization frameworks with judicious choices of loss functions and training protocols can be used to identify system features underlying interactions, symmetries, and other structures. In machine learning methods these are represented in a compressed form over some model class allowing for efficient interpolation, and even extrapolation in some cases, especially when using explicit low dimensional latent spaces or when imposing other inductive biases.

As a further simplification for the colloidal example, if we assume the interactions occur to a good approximation only pairwise, the problem can be reduced to a model depending on the six dimensions of the pair configuration. This can be further constrained to learn only symmetric positive semi-definite tensors, for example by learning a factor $L$ and generating $M = LL^T$.
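To make the positivity constraint concrete, here is a minimal sketch in plain Python of generating a symmetric positive semi-definite tensor through the factorization M = L L^T. The unconstrained parameter vector is a hypothetical stand-in for what a learned model would output; it is not the package's parameterization.

```python
import random

def spd_from_factor(params, n):
    """Build a symmetric positive semi-definite tensor M = L L^T from an
    unconstrained parameter vector (hypothetical parameterization; a learned
    model would output these entries instead)."""
    # Fill a lower-triangular factor L row by row from the parameters.
    L = [[0.0] * n for _ in range(n)]
    idx = 0
    for i in range(n):
        for j in range(i + 1):
            L[i][j] = params[idx]
            idx += 1
    # M = L L^T is symmetric positive semi-definite by construction.
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

random.seed(0)
n = 3
params = [random.gauss(0.0, 1.0) for _ in range(n * (n + 1) // 2)]
M = spd_from_factor(params, n)

# Symmetry, and a non-negative quadratic form for a probe vector.
assert all(abs(M[i][j] - M[j][i]) < 1e-12 for i in range(n) for j in range(n))
v = [1.0, -2.0, 0.5]
q = sum(v[i] * M[i][j] * v[j] for i in range(n) for j in range(n))
assert q >= 0.0
```

No projection step is needed during training with this construction, since every parameter vector maps to a valid mobility tensor.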

There are many ways we can obtain the model $M(\mathbf{X})$. This includes learning $M$ directly using machine learning approaches, or a multi-stage approach using first a tabulation from more traditional estimators from statistical mechanics, followed by compression to an ML model. For example, a common way to estimate mobility in fluid mechanics is to apply active forces $\mathbf{F}$ and compute the velocity response $\mathbf{V} = M\mathbf{F}$. For large enough forces $\mathbf{F}$, the thermal fluctuations can be averaged away readily by repeating this measurement and taking the mean. In statistical mechanics, another estimator is obtained when $\mathbf{F} = 0$ by using the passive fluctuations of the system. A commonly used moment-based estimator is

$$M(\mathbf{X}) \approx \frac{\langle \Delta\mathbf{X}(\epsilon)\,\Delta\mathbf{X}(\epsilon)^T \rangle}{2 k_B T \epsilon}, \qquad \Delta\mathbf{X}(\epsilon) = \mathbf{X}(t+\epsilon) - \mathbf{X}(t),$$

for $\epsilon$ chosen carefully. While theoretically each of these estimators gives information on $M$, in practice there can be subtleties, such as a good choice for $\epsilon$, the magnitude of $\mathbf{F}$, and the role of fluctuations. Even for these more traditional estimators, it could still be useful for storage efficiency and convenience to train an ML model to provide a compressed representation and interpolation for evaluating $M(\mathbf{X})$.
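The passive-fluctuation estimator can be sketched as follows. For illustration we assume a single degree of freedom with a known scalar mobility and draw increments directly from the discretized zero-force dynamics, where they are Gaussian with variance 2 k_B T M ε; the true mobility is then recovered from the second moment.

```python
import math
import random

random.seed(1)
kBT, M_true, eps = 1.0, 0.5, 1e-3
n_samples = 200000

# With no applied force, increments of the discretized overdamped dynamics
# are Gaussian with variance 2 kBT M eps (Euler-Maruyama step).
sigma = math.sqrt(2.0 * kBT * M_true * eps)
dX = [random.gauss(0.0, sigma) for _ in range(n_samples)]

# Moment-based estimator: M ~ <dX dX> / (2 kBT eps).
M_hat = sum(d * d for d in dX) / n_samples / (2.0 * kBT * eps)
assert abs(M_hat - M_true) < 0.01
```

In an actual tabulation workflow, the increments would come from the detailed simulation at a fixed configuration rather than being synthesized as here.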

Machine learning methods also could be used to train $M(\mathbf{X})$ more directly from simulation data for sampled colloid trajectories $\{\mathbf{X}(t_i)\}$. The training would select an ML model over some class parameterized by $\theta$, such as the weights and biases of a Deep Neural Network. For instance, this could be done by Maximum Likelihood Estimation (MLE) from the trajectory data by optimizing the objective

$$\theta^* = \arg\max_{\theta}\ \log p_{\theta}\big(\{\mathbf{X}(t_i)\}\big), \qquad (2)$$

where $p_{\theta}$ denotes the probability density for observing the configurations $\{\mathbf{X}(t_i)\}$ when $M = M_{\theta}$ is used in equation 1. To obtain tractable and robust training algorithms, further approximations and regularizations may be required for the MLE problem in equation 2. This could include using variational inference approaches, further restrictions on the model architectures, or introducing priors [17, 16, 5]. Combining such approximations with further regularizations could help facilitate learning possible symmetries and other features shaping the learned model $M_{\theta}$.
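As a toy illustration of the MLE objective, assume a single degree of freedom with constant scalar mobility, so the increments of the discretized zero-force dynamics are Gaussian with variance 2 k_B T M ε. Scanning candidate mobilities for the likelihood maximum then recovers the true value. This is a sketch of the principle only, not the package's training code.

```python
import math
import random

random.seed(2)
kBT, M_true, eps, N = 1.0, 0.8, 1e-3, 50000

# Synthetic "trajectory increments" from the discretized dynamics.
sigma2_true = 2.0 * kBT * M_true * eps
dX = [random.gauss(0.0, math.sqrt(sigma2_true)) for _ in range(N)]
S = sum(d * d for d in dX)  # sufficient statistic for this Gaussian model

def neg_log_likelihood(M):
    """Negative log-likelihood of the increments when mobility M is used:
    each dX is N(0, 2 kBT M eps)."""
    s2 = 2.0 * kBT * M * eps
    return 0.5 * N * math.log(2.0 * math.pi * s2) + S / (2.0 * s2)

# Scan candidate mobilities and keep the maximum-likelihood one.
candidates = [0.1 + 0.01 * k for k in range(200)]  # M in [0.1, 2.09]
M_mle = min(candidates, key=neg_log_likelihood)
assert abs(M_mle - M_true) < 0.05
```

For a configuration-dependent $M_\theta(\mathbf{X})$ the same objective is optimized over the model parameters $\theta$ with gradient-based methods instead of a grid scan.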

The MLMOD package provides a way to transfer such learned models from the data into components of practical simulations in LAMMPS. We discussed here one example of a basic data-driven modeling approach. The MLMOD package can be used more generally and supports broader classes of models for incorporating machine learning results into simulation components, including the dynamics, the interactions, and the computation of quantities of interest serving as system state information or characterizing system behaviors. The initial prototype implementation we present in this whitepaper supports the basic mobility modeling framework as a proof-of-concept, with longer-term aims to support more general classes of reduced dynamics and interactions in future releases.

Structure of the Package Components

The package is organized as a module within LAMMPS that is called each time-step and can serve multiple functions in simulations. This includes (i) serving as a time-step integrator updating the configuration of the system based on a specified learned model, (ii) evaluating interactions between system components to compute energies and forces, and (iii) computing quantities of interest (QoI) that can be used as state information during simulations or for computing statistics. The package is controlled by external XML files that specify the mode of operation, the source of pre-trained models, and other information; see the schematic in Figure 2.
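To illustrate how such an XML parameter file can be consumed, here is a sketch using Python's standard library, following the dX_MF_ML1 example shown in the usage section. The filenames are hypothetical placeholders; the exact schema is defined by the MLMOD release.

```python
import xml.etree.ElementTree as ET

# Hypothetical mlmod_params content in the style of the dX_MF_ML1 example;
# the model filenames here are illustrative placeholders only.
params_xml = """<?xml version="1.0" encoding="UTF-8"?>
<model_data type="dX_MF_ML1">
  <M_ii_filename value="M_ii_torch.pt"/>
  <M_ij_filename value="M_ij_torch.pt"/>
</model_data>"""

root = ET.fromstring(params_xml)
mode = root.get("type")                         # simulation mode, e.g. dX_MF_ML1
m_ii = root.find("M_ii_filename").get("value")  # self-mobility model file
m_ij = root.find("M_ij_filename").get("value")  # pairwise-mobility model file

assert mode == "dX_MF_ML1"
assert m_ii.endswith(".pt") and m_ij.endswith(".pt")
```

Using a standardized format like XML keeps the simulation mode and model sources editable without recompiling the package.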

Figure 2: The MLMOD package is structured modularly, with subcomponents providing ML models in simulations for the dynamics, interactions, and computation of quantities of interest (QoI) characterizing the system state. The package makes use of standardized data formats, such as XML for inputs and exported ML model formats from machine learning frameworks.


The MLMOD package is incorporated into a simulation by using either the LAMMPS scripting language or the Python interface. This is done using the "fix" command in LAMMPS, with this terminology historically motivated by algorithms for "fixing" molecular bonds as rigid each time-step. For our package, the command to set up the triggers for our algorithms is fix 1 all mlmod filename.mlmod_params. This specifies the tag "1" for this fix, the group of particles controlled by the package as "all", and the XML file of parameters. The XML file filename.mlmod_params specifies the MLMOD simulation mode and where to find the associated exported ML models. An example and more details are discussed below in the section on package usage. The MLMOD package can evaluate machine learning models using frameworks such as the PyTorch C++ API. While this would also allow for the possibility of on-the-fly learning, we anticipate the most common use scenario would be to train models in advance and then incorporate them into simulations through evaluation.

A common scenario we anticipate is for a data-driven model to be obtained from a machine learning framework, such as PyTorch, by training on trajectory data from detailed high-fidelity simulations. Once the model is trained, it can be exported to a portable format such as TorchScript. The MLMOD package would import these pre-trained models from the exported .pt files. This allows the models to then be invoked by MLMOD to provide elements for (i) performing time-step integration to model dynamics, (ii) computing interactions between system components, and (iii) computing quantities of interest (QoI) for further computations or as statistics. This provides a modular and general way for data-driven models obtained from training with machine learning methods to be used to govern LAMMPS simulations.
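A minimal sketch of this export/import round trip using PyTorch's TorchScript facilities is shown below. The linear map is only an illustrative stand-in for a trained mobility model, not the MLMOD architecture, and loading is shown in Python, whereas the package would load the same file through the C++ API.

```python
import os
import tempfile

import torch

# Stand-in for a trained mobility model: any torch.nn.Module mapping
# configurations to tensor entries could be exported the same way.
model = torch.nn.Linear(3, 9)
scripted = torch.jit.script(model)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "mobility_model.pt")
    scripted.save(path)            # portable TorchScript .pt file
    loaded = torch.jit.load(path)  # what the simulation host would do

x = torch.zeros(1, 3)
# The exported and reloaded models agree on the same input.
assert torch.allclose(model(x), loaded(x))
```

The same .pt file can be loaded from C++ via torch::jit::load, which is what makes the trained model portable into the LAMMPS process.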

Example Usage of the Package

We give one basic example usage of the package for the case of modeling colloids using a mobility tensor $M$. The MLMOD package is invoked by using the "fix" command in LAMMPS. To set up the triggers for the MLMOD package during LAMMPS simulations, a typical command would look like

fix m1 c_group mlmod model.mlmod_params

The m1 gives the tag for the fix, c_group specifies the label for the group of particles controlled by this instance of the MLMOD package, mlmod specifies use of the MLMOD package, and model.mlmod_params gives the XML file with parameters controlling the mode in which to run the MLMOD package and the associated exported ML models.

Multiple instances of the MLMOD package are permitted, controlling different groups of particles. The package is designed for modularity: a mode is first defined in a parameter file, and different sets of algorithms and parameters are then used based on that information. For the mobility example discussed in the introduction, a basic prototype implementation is given by the MLMOD simulation mode dX_MF_ML1. For this modeling mode, a typical parameter file would look like

<?xml version="1.0" encoding="UTF-8"?>
<model_data type="dX_MF_ML1">
<M_ii_filename value=""/>
<M_ij_filename value=""/>
</model_data>

This specifies, for an assumed mobility tensor of pairwise interactions, the ML models for the self-mobility responses $M_{ii}$ and the pairwise mobility responses $M_{ij}$. For example, a hydrodynamic model for the interactions when the two colloids of radius $a$ are not too close together is to use the Oseen tensors $M_{ii} = (6\pi\eta a)^{-1} I$ and $M_{ij} = (8\pi\eta r)^{-1}\left(I + \hat{\mathbf{r}}\hat{\mathbf{r}}^T\right)$. Here $\eta$ is the fluid viscosity, $\mathbf{r} = \mathbf{X}_i - \mathbf{X}_j$ with $r = |\mathbf{r}|$ gives the particle separation, and $\hat{\mathbf{r}} = \mathbf{r}/r$. The responses are $\mathbf{V}_i = M_{ij}\mathbf{F}_j$, with summation over the repeated index.
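For concreteness, the Oseen-tensor model just described can be evaluated directly; here is a small self-contained sketch in Python (illustrative only, independent of the package's C++ implementation).

```python
import math

def oseen_mobility(r_vec, eta):
    """Pairwise Oseen mobility M_ij = (8 pi eta r)^(-1) (I + rhat rhat^T),
    valid when the particles are well separated."""
    r = math.sqrt(sum(c * c for c in r_vec))
    rhat = [c / r for c in r_vec]
    pref = 1.0 / (8.0 * math.pi * eta * r)
    return [[pref * ((1.0 if a == b else 0.0) + rhat[a] * rhat[b])
             for b in range(3)] for a in range(3)]

def self_mobility(a_radius, eta):
    """Self-mobility M_ii = (6 pi eta a)^(-1) I (Stokes drag)."""
    pref = 1.0 / (6.0 * math.pi * eta * a_radius)
    return [[pref if i == j else 0.0 for j in range(3)] for i in range(3)]

eta = 1.0
M = oseen_mobility([2.0, 0.0, 0.0], eta)  # separation along x
S = self_mobility(1.0, eta)

# Symmetric, isotropic self term, and the longitudinal pair response
# exceeds the transverse one.
assert all(abs(M[i][j] - M[j][i]) < 1e-14 for i in range(3) for j in range(3))
assert S[0][0] == S[1][1] == S[2][2]
assert M[0][0] > M[1][1] > 0.0
```

A learned $M_{ij}$ model would replace this closed-form evaluation while keeping the same role in the dynamics.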

The dX_MF_ML1 mode allows for this type of model and for replacing these interactions with ML models learned from simulation data. Related modes can also be implemented to extend the models to capture more complicated interactions or near-field effects, for example to allow for localized many-body interactions with ML models giving contributions to the mobility $M$. In this way MLMOD can be used for hybrid modeling, combining ML models with more traditional modeling approaches within a unified framework.

The models used by MLMOD can in principle be of any exported format from machine learning frameworks. Currently, the implementation uses PyTorch and the export format based on TorchScript with .pt files. This allows for a variety of models to be used, ranging from Deep Neural Networks to Kernel Regression Models and others. We describe here our early work on a prototype implementation, so some aspects of the interface may change in future releases. For examples, updates, and additional information, please check the MLMOD package website.


Conclusion

The package provides capabilities in LAMMPS for incorporating into simulations data-driven models for dynamics and interactions obtained from training with machine learning methods. The package supports representations using Neural Networks, Gaussian Process Regression, Kernel Models, and other classes. Discussed here is our early prototype implementation of the MLMOD package. Please check the MLMOD website for updates and future releases.


Acknowledgments

The authors' research was supported by DOE Grant ASCR PHILMS DE-SC0019246 and NSF Grant DMS-1616353. The authors also acknowledge the UCSB Center for Scientific Computing NSF MRSEC (DMR-1121053) and UCSB MRL NSF CNS-1725797. P.J.A. would also like to acknowledge a hardware grant from Nvidia.