1 Introduction
In the past decade, multiscale simulation methods have demonstrated significant advantages in computational mechanics due to their ability to account for microscopic heterogeneities inside a material. The macroscopic material properties are strongly affected by the morphology and evolution of the microstructure, especially under extreme events involving large material deformations and geometric nonlinearities. While direct simulations of large-scale heterogeneous structures are extremely expensive and therefore uncommon in industrial applications, representative volume element (RVE) techniques based on homogenization theory [hill1965self; feyel2000a; geers2010multi] form a class of hierarchical multiscale simulation methods offering a numerical constitutive closure relationship at each macroscopic material point.
In terms of RVE analysis and homogenization, direct numerical simulations (DNS), such as the finite element method (FEM) [feyel2000a; belytschko2008a], meshfree methods [wu2012three; wu2016immersed], and fast Fourier transform (FFT)-based methods [moulinec1998a; de2017finite], offer high accuracy at the expense of high computational cost. A myriad of model reduction techniques have been introduced to predict effective mechanical properties in a manner that balances computational cost and accuracy. Analytical micromechanics methods [eshelby1957determination; hashin1963variational; hill1965self; mura1987micromechanics; liu2015statistical; liu2016extended] can be regarded as one type of reduced-order model with high efficiency. However, because they lose detailed physics at the microscale, they typically sacrifice accuracy or require extensive model calibration when irregular complex morphologies, nonlinear history-dependent properties, or large deformations are present. For heterogeneous hyperelastic materials, manifold-learning methods like isomap have been used for nonlinear dimensionality reduction of microscopic strain fields [bhattacharjee2016nonlinear]. Model reduction of history-dependent plastic materials is more complex and challenging. Two examples are the nonuniform transformation field analysis (NTFA) [michel2003a; michel2016model] and variants of principal component analysis [jolliffe2002a] or proper orthogonal decomposition (POD) [yvonnet2007a; kerfriden2013a; oliver2017reduced]. However, these methods usually require extensive a priori simulations for interpolating nonlinear responses, and their extrapolation capability for new material inputs is usually limited. Recently, the self-consistent clustering analysis (SCA)
[liu2016self; liu2018microstructural] has demonstrated a powerful trade-off between accuracy and efficiency in predicting small-strain elasto-plastic behavior through clustering techniques, and it requires only linear elastic simulations in the offline stage.

Meanwhile, modern machine learning models (e.g. artificial neural networks and deep learning) have achieved great success in broad areas of computer engineering, such as computer vision, gaming, and natural language processing [hinton2012deep; lecun2015deep; Goodfellowetal2016; silver2017mastering]. Although these techniques are able to construct models for complex input-output relations, their applications to the mechanics of materials are still limited. Methods like tensor product approximation and neural networks have been employed to directly construct the overall strain energy density surface of an RVE [le2015computational; yvonnet2013computational; bessa2017; ibanez2017data]. Artificial neural networks have also been used to approximate constitutive behavior by fitting the stress-strain relation directly [ghaboussi1991knowledge; unger2008coupling; wang2018multiscale]. However, these techniques are usually problem-dependent and may suffer from "the danger of extrapolation" beyond the original sampling space, e.g. to different material laws and loading paths. Issues like material history dependency, physical invariance, and conservation laws are not naturally resolved, mainly due to the loss of physics in current machine learning models.

This paper introduces a novel multiscale material modeling method, called the deep material network, which represents the DNS model of an RVE by a hierarchical topological structure with mechanistic building blocks and is able to predict nonlinear material behavior both accurately and efficiently. In the proposed material network, the above limitations of various reduced-order methods are addressed simultaneously by meeting three fundamental goals: 1) avoiding an extensive offline sampling stage (cf. POD, NTFA and generic neural networks) by requiring only linear elastic RVE analyses; 2) eliminating the need for extra calibration of the constitutive laws (cf. NTFA, isomap-based methods) or micromechanical assumptions of homogenization (cf. micromechanics methods, the self-consistent scheme in SCA); 3) discovering an efficient reduced representation of the RVE without loss of physics or the danger of extrapolation (cf. generic neural networks). Due to its intrinsic hierarchical structure, its computational time is proportional to the number of degrees of freedom in the system.
After a one-time offline training, the optimized material network creates a microstructural database of the RVE by virtue of its unique capability of extrapolation to unknown material and loading spaces, which is useful for multiscale concurrent simulation and material design.
In Section 2, the theory of the material network, including the building block and the network architecture, is explained for RVEs under the 2-dimensional (2D) plane strain condition. In Section 3, machine learning approaches, such as stochastic gradient descent (SGD) with the backpropagation algorithm and model compression algorithms, are developed for training the material network. Extrapolation to general RVE problems with material and geometric nonlinearities is discussed in Section 4. Finally, applications to several challenging problems are addressed in Section 5, including linear elasticity with a high contrast of phase properties, nonlinear history-dependent plasticity, and finite-strain hyperelasticity under large deformations. Concluding remarks are given in Section 6.
2 Theory of material network
The basic concept of the material network is to use a collection of connected simple building blocks to describe complex RVE responses, similar to an artificial neural network in which connected neurons define an arbitrary function. The building block is chosen to be a simple structure with analytical homogenization solutions, and the architecture of the material network represents the path of the homogenization process from each individual phase to the overall macroscopic material. With the mechanistic machine learning approaches introduced in Section 3, the trained material network provides a simplified representation of an RVE with heterogeneous microstructures.

2.1 The physically based building block
The theory of the material network is first developed for a two-phase linearly elastic building block under the 2-dimensional (2D) plane strain condition. Under the small-strain assumption, the strain and stress measures are the infinitesimal strain $\boldsymbol{\varepsilon}$ and the Cauchy stress $\boldsymbol{\sigma}$, which are related by the fourth-order compliance tensor $\mathbf{D}$. The overall stress-strain relation of the building block can be expressed as
\[\bar{\boldsymbol{\varepsilon}} = \bar{\mathbf{D}} : \bar{\boldsymbol{\sigma}}. \tag{1}\]
For materials 1 and 2, we have
\[\bar{\boldsymbol{\varepsilon}} = f_1 \boldsymbol{\varepsilon}^{(1)} + f_2 \boldsymbol{\varepsilon}^{(2)}, \qquad \bar{\boldsymbol{\sigma}} = f_1 \boldsymbol{\sigma}^{(1)} + f_2 \boldsymbol{\sigma}^{(2)}, \tag{2}\]
where $f_1$ and $f_2$ are the volume fractions of materials 1 and 2, respectively. In Mandel notation, the stress and strain can be written as
\[\boldsymbol{\sigma} = \begin{bmatrix} \sigma_1 \\ \sigma_2 \\ \sigma_3 \end{bmatrix} = \begin{bmatrix} \sigma_{11} \\ \sigma_{22} \\ \sqrt{2}\,\sigma_{12} \end{bmatrix} \tag{3}\]
and
\[\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \end{bmatrix} = \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{22} \\ \sqrt{2}\,\varepsilon_{12} \end{bmatrix},\]
where the subscript 3 denotes the shear direction. The compliance matrix can be written as
\[\mathbf{D} = \begin{bmatrix} D_{11} & D_{12} & D_{13} \\ D_{12} & D_{22} & D_{23} \\ D_{13} & D_{23} & D_{33} \end{bmatrix}. \tag{4}\]
Similar to artificial neural networks, we want the building block to be simple and easy to analyze. Here a two-layer structure is proposed. As shown in Figure 1, there are two operations in the building block: 1) homogenization, which gives $\bar{\mathbf{D}}$, and 2) rotation of the two-layer structure, which gives $\bar{\mathbf{D}}^r$.
The homogenized compliance matrix $\bar{\mathbf{D}}$ before the rotation can be expressed as a function of the microscopic compliance matrix of each constituent and its morphological descriptors, in this case the volume fractions $f_1$ and $f_2$, with $f_1 + f_2 = 1$. Mathematically, it can be written as
\[\bar{\mathbf{D}} = \bar{\mathbf{D}}\left(\mathbf{D}^{(1)}, \mathbf{D}^{(2)}, f_1\right). \tag{5}\]
Analytical homogenized results are available for this simple two-layer structure, derived based on the equilibrium condition
\[\bar{\sigma}_2 = \sigma_2^{(1)} = \sigma_2^{(2)}, \qquad \bar{\sigma}_3 = \sigma_3^{(1)} = \sigma_3^{(2)} \tag{6}\]
and the kinematic constraint
\[\bar{\varepsilon}_1 = \varepsilon_1^{(1)} = \varepsilon_1^{(2)}. \tag{7}\]
Analytical expressions of the components in the homogenized compliance matrix are
(8) 
where
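The analytical results can be cross-checked numerically. The sketch below (our own illustration, not code from the paper) recovers the homogenized compliance of the two-layer structure directly from the equilibrium and kinematic constraints by probing it with unit macroscopic stresses; it assumes Mandel components ordered as (11, 22, 12) and layers stacked normal to direction 2:

```python
import numpy as np

def homogenize_two_layer(D1, D2, f1):
    """Homogenized 3x3 Mandel compliance of a two-layer laminate.
    Layers are normal to direction 2: stress components 2 and 3 are
    uniform across the layers, strain component 1 is uniform."""
    f2 = 1.0 - f1
    Dbar = np.zeros((3, 3))
    for k in range(3):                  # probe with unit macroscopic stresses
        s_bar = np.zeros(3)
        s_bar[k] = 1.0
        s2, s3 = s_bar[1], s_bar[2]     # uniform stress components
        # Unknowns: the in-plane stresses s1 in each phase, determined by
        # (i) equal strain e1 in both phases and (ii) the stress average.
        A = np.array([[D1[0, 0], -D2[0, 0]],
                      [f1,        f2      ]])
        b = np.array([(D2[0, 1] - D1[0, 1]) * s2 + (D2[0, 2] - D1[0, 2]) * s3,
                      s_bar[0]])
        s1_1, s1_2 = np.linalg.solve(A, b)
        sig1 = np.array([s1_1, s2, s3])
        sig2 = np.array([s1_2, s2, s3])
        # Column k of Dbar is the volume-averaged strain under this probe.
        Dbar[:, k] = f1 * (D1 @ sig1) + f2 * (D2 @ sig2)
    return Dbar
```

For two identical phases the routine returns the phase compliance itself, and the resulting matrix is symmetric, as a physically consistent homogenization must be.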
After the homogenization operation, the two-layer structure is rotated, and the new compliance matrix is passed to a child node of the next material building block in the upper level. As shown in Fig. 1, the rotation angle is denoted as $\theta$. The matrix $\mathbf{R}$ defines the rotation of a second-order tensor through the angle $\theta$ under Mandel notation,
\[\mathbf{R}(\theta) = \begin{bmatrix} \cos^2\theta & \sin^2\theta & \sqrt{2}\sin\theta\cos\theta \\ \sin^2\theta & \cos^2\theta & -\sqrt{2}\sin\theta\cos\theta \\ -\sqrt{2}\sin\theta\cos\theta & \sqrt{2}\sin\theta\cos\theta & \cos^2\theta - \sin^2\theta \end{bmatrix}, \tag{9}\]
and it satisfies
\[\mathbf{R}^{-1}(\theta) = \mathbf{R}^{T}(\theta) = \mathbf{R}(-\theta). \tag{10}\]
After rotation, the new compliance matrix can be expressed as
\[\bar{\mathbf{D}}^r = \mathbf{R}(\theta)\, \bar{\mathbf{D}}\, \mathbf{R}^{T}(\theta). \tag{11}\]
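The rotation operation can be sketched numerically as follows. The Mandel rotation matrix below uses one common sign convention for 2D, which we assume here and which may differ from the paper's figures; the function names are ours:

```python
import numpy as np

def mandel_rotation(theta):
    """Rotation matrix for a symmetric second-order tensor written in
    2D Mandel notation (11, 22, sqrt(2)*12); one common sign convention."""
    c, s = np.cos(theta), np.sin(theta)
    r2 = np.sqrt(2.0)
    return np.array([[c * c,      s * s,      r2 * c * s],
                     [s * s,      c * c,     -r2 * c * s],
                     [-r2 * c * s, r2 * c * s, c * c - s * s]])

def rotate_compliance(D, theta):
    """Rotated compliance: D^r = R(theta) D R(theta)^T."""
    R = mandel_rotation(theta)
    return R @ D @ R.T
```

Because $\mathbf{R}$ is orthogonal in Mandel notation, the rotation preserves the Frobenius norm of the compliance matrix, consistent with the rotation-invariance argument used later for the cost function.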
Combining the homogenization and rotation operations in Eq. (5) and (11) yields the complete homogenization function of the two-layer building block,
\[\bar{\mathbf{D}}^r = \mathbf{R}(\theta)\, \bar{\mathbf{D}}\left(\mathbf{D}^{(1)}, \mathbf{D}^{(2)}, f_1\right) \mathbf{R}^{T}(\theta). \tag{12}\]
Remark 1
It is equivalent to derive all the analytical homogenization and rotation functions based on the stiffness tensor C, as we will show in Section 4 and Appendix A. However, we use the compliance tensor D here because the expressions of the analytical solution based on D are neater for this small-strain 2D building block under the plane strain condition.
Note that analytical forms for building blocks with multiple layers could also be derived for multiphase materials; however, the material network would become more complex and more partial derivatives would be involved in the learning process. For simplicity, this paper focuses on two-phase RVEs, so the two-layer structure and the corresponding binary-tree architectures will be mainly discussed. In practice, the compliance matrices are stored in a vectorized form with 6 independent variables. The data flow for one building block after the vectorization is shown in Fig. 2.

Derivatives of $\bar{\mathbf{D}}$ and $\bar{\mathbf{D}}^r$ will be used in the stochastic gradient descent with the backpropagation algorithm introduced in Section 3.2. Specifically, the derivatives of $\bar{\mathbf{D}}^r$ with respect to the rotation angle $\theta$ are
(13) 
with
(14) 
In terms of the derivatives of $\bar{\mathbf{D}}^r$ with respect to the components in $\bar{\mathbf{D}}$, we have
(15) 
Derivatives of $\bar{\mathbf{D}}$ with respect to the volume fraction $f_1$ are
(16) 
Moreover, derivatives of $\bar{\mathbf{D}}$ with respect to the components in $\mathbf{D}^{(1)}$ are
(17) 
Similarly, the forms of the derivatives with respect to the components in $\mathbf{D}^{(2)}$ can be obtained by switching the phase index in Eq. (17).
In Section 5.2.3, we will extend the theory of the material network to general finite-strain problems, such as heterogeneous hyperelastic RVEs. Furthermore, the concept of the two-layer building block can be applied to 3-dimensional (3D) problems, which will be explored in our future work. It should be noted that the existence of an analytical homogenization function is crucial for optimizing the fitting parameters in the material network using gradient-based methods. More details on the training of the material network can be found in Section 3.
2.2 Architecture of the material network
As shown in Fig. 3, a material network with a binary-tree structure is proposed. The depth of the tree structure is denoted by $N$, and $k$ is the layer index. Layer 0 represents the output layer, equivalent to the macroscopic homogenized material, and layer $N$ contains the inputs from each individual microscopic constituent; layer $N$ is regarded as the bottom layer. Since the two-layer structure is considered as the basic homogenization building block, each node has 2 child nodes, and there are $2^k$ nodes at layer $k$ for $k = 0, 1, \ldots, N$.
To explain the data flows inside the material network, a homogenization building block for the $i$-th node at layer $k$ is provided in Fig. 4. Correspondingly, the indices of its two child nodes at layer $k+1$ are $2i-1$ and $2i$.
Other than the compliance matrix at each node, a so-called weighting function $w$ is added into the data flow, which is used for tracking the volume fractions in each building block. In the feed-forward pass, the weights of the two child nodes are summed and passed to their parent node,
\[w_i^{k} = w_{2i-1}^{k+1} + w_{2i}^{k+1}. \tag{18}\]
By performing the summation recursively, the weight can be expressed as a summation of the weights of its descendant nodes at the bottom layer $N$,
\[w_i^{k} = \sum_{j=2^{N-k}(i-1)+1}^{2^{N-k} i} w_j^{N}. \tag{19}\]
Therefore, only the weights at the bottom layer are independent parameters. The partial derivative of $w_i^k$ with respect to $w_j^N$ can be written as
\[\frac{\partial w_i^{k}}{\partial w_j^{N}} = \begin{cases} 1 & \text{if node } j \text{ at layer } N \text{ is a descendant of node } i, \\ 0 & \text{otherwise.} \end{cases} \tag{20}\]
The weights at layer $N$ are activated through the rectified linear unit (ReLU) [glorot2011deep], which has been widely used in deep learning. For $j = 1, 2, \ldots, 2^N$, we have
\[w_j^{N} = \max(0, z_j). \tag{21}\]
The activations $z_j$ are defined at the bottom layer and determine all the weighting functions in the network. Therefore, the $z_j$ are treated as fitting parameters. The derivative of the ReLU is
\[\frac{\partial w_j^{N}}{\partial z_j} = \begin{cases} 1 & z_j > 0, \\ 0 & z_j \le 0. \end{cases} \tag{22}\]
Once a unit is deactivated ($z_j \le 0$ and $w_j^N = 0$), its gradient vanishes and the unit will never be activated again during the training process. This automatically simplifies the material network and increases the training speed, as long as the learning rate is set appropriately.
Returning to the building block in Fig. 4, the volume fraction of the first child node in the corresponding two-layer structure can be written as
\[f_1 = \frac{w_{2i-1}^{k+1}}{w_{2i-1}^{k+1} + w_{2i}^{k+1}}. \tag{23}\]
The derivative of $f_1$ with respect to a weight $w_j^N$ at the bottom layer is
(24) 
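The recursive weight summation, the ReLU activation, and the volume fraction of a child node can be sketched in a few lines. This is an illustration under our own naming (1-indexed parents, as in the text), not code from the paper:

```python
import numpy as np

def feed_forward_weights(z):
    """Given activations z at the bottom layer (length 2**N), return the
    weights of every layer: ReLU at the bottom, then pairwise summation
    of the two children for each parent, up to the output layer."""
    layers = [np.maximum(z, 0.0)]                 # layer N (bottom), Eq.-(21)-style
    while len(layers[-1]) > 1:
        prev = layers[-1]
        layers.append(prev[0::2] + prev[1::2])    # parent = sum of its two children
    return layers[::-1]                           # layers[0] is the output layer

def child_volume_fraction(child_layer, i):
    """Volume fraction of the first child of parent i (1-indexed),
    computed from the weights of the child layer."""
    w1, w2 = child_layer[2 * i - 2], child_layer[2 * i - 1]
    return w1 / (w1 + w2)
```

A deactivated bottom node (negative activation) contributes zero weight everywhere above it, which is how the ReLU prunes subtrees during training.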
The compliance matrices $\bar{\mathbf{D}}$ and $\bar{\mathbf{D}}^r$ at the $i$-th node in layer $k$ become
(25) 
and
(26) 
By combining Eq. (19), (21), (23) and (26), the homogenized compliance matrix of the RVE can be written as a function of the compliance matrices of the two material phases ($\mathbf{D}^{(1)}$ and $\mathbf{D}^{(2)}$) and the fitting parameters (the rotation angles $\theta$ and the activations $z$),
(27) 
The number of layers $N$ can be regarded as a hyperparameter of the material network, and the total number of fitting parameters in a material network of depth $N$ for a 2D problem grows in proportion to $2^N$. The optimum choice of $N$ needs to be studied through numerical experiments. A shallow material network with a small $N$ may not be sufficient for capturing the RVE behavior, while a deep network with a large $N$ will increase the training time, as well as the computational cost of online extrapolation.
Remark 2
Other than using the activations $z_j$ in the bottom layer, it is also possible to directly use the volume fractions of the nodes at each layer as the fitting parameters. However, according to our tests, this causes a vanishing gradient problem for the volume fractions in the deep layers and may also result in unwanted early deactivation of subtree structures, whereas the training process based on $z_j$ is more continuous and stable.

3 Machine learning of deep material network
3.1 Cost function and dataset for training
The procedure for training the material network is illustrated in Fig. 5. The dataset for training is generated through high-fidelity DNS of an RVE, or through experiments (which can be challenging due to the limited number of experimental tests). The inputs, outputs, and fitting parameters for this training/optimization problem are listed below, with the number of variables given in parentheses:
Inputs (12):
Outputs (6):
Fitting parameters ():
When generating the input data, the components in $\mathbf{D}^{(1)}$ and $\mathbf{D}^{(2)}$ cannot be assigned arbitrarily: for a physical material, situations resulting in negative strain energy density must be avoided. For ease of data generation, the two material phases are considered to be orthotropic elastic,
(28) 
To remove the redundancy due to the scaling effect, we have
(29) 
The other inputs are selected randomly as
where $U$ stands for the uniform distribution. The Poisson's ratios are selected to guarantee that the compliance matrices are positive definite, so that the strain energy density is positive,
Design of experiments based on the Latin hypercube is performed to generate the input space.
To quantify how well the network is being trained, a cost function based on the mean square error (MSE) is proposed:
\[C(\theta, z) = \frac{1}{N_s} \sum_{s=1}^{N_s} \frac{\left\| \bar{\mathbf{D}}^{(s)} - \bar{\mathbf{D}}^{dns,(s)} \right\|^2}{\left\| \bar{\mathbf{D}}^{dns,(s)} \right\|^2}. \tag{30}\]
Here, $\theta$ and $z$ are the parameters to be fitted, $s$ is the index of the sample (or data point), and $N_s$ is the total number of training samples. Note that the cost function is normalized by the squared norm of the DNS compliance matrix to remove the scaling effect. The operator $\|\cdot\|$ denotes the matrix norm. For a general second-order matrix B of dimension $n \times n$, the matrix norm is defined as
\[\|\mathbf{B}\| = \sqrt{\operatorname{tr}\left(\mathbf{B}\mathbf{B}^{T}\right)} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} B_{ij}^{2}}, \tag{31}\]
and it is also called the Frobenius norm of matrix B. Since the trace of a matrix product is invariant under cyclic permutation, it can be proved that the norm of a compliance matrix under Mandel notation is invariant under arbitrary rotation. By using Eq. (11) and the symmetry of the compliance matrix, we have
\[\left\|\bar{\mathbf{D}}^r\right\|^2 = \operatorname{tr}\left(\mathbf{R}\bar{\mathbf{D}}\mathbf{R}^{T}\mathbf{R}\bar{\mathbf{D}}^{T}\mathbf{R}^{T}\right) = \operatorname{tr}\left(\bar{\mathbf{D}}\bar{\mathbf{D}}^{T}\right) = \left\|\bar{\mathbf{D}}\right\|^2. \tag{32}\]
This guarantees that the cost function defined in Eq. (30), as well as the optimized fitting parameters, is independent of the choice of the coordinate system.
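A minimal sketch of the matrix norm and the normalized mean-square cost described above (function names are ours, not from the paper):

```python
import numpy as np

def frobenius(B):
    """Matrix (Frobenius) norm: sqrt(tr(B B^T))."""
    return np.sqrt(np.trace(B @ B.T))

def cost_mse(D_pred_list, D_dns_list):
    """Mean over samples of the squared error in the homogenized
    compliance, normalized by the squared norm of the DNS reference
    to remove the scaling effect."""
    errs = [(frobenius(Dp - Dd) / frobenius(Dd)) ** 2
            for Dp, Dd in zip(D_pred_list, D_dns_list)]
    return float(np.mean(errs))
```

The trace-based definition agrees with the usual element-wise Frobenius norm, and a perfect fit gives a cost of exactly zero.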
To constrain the magnitude of the activations and make the optimization problem well-posed, an additional term is appended to the cost function,
(33) 
where the constraint term is defined as
(34) 
The hyperparameter $\lambda$ determines the magnitude of the constraint term. Although it will not alter the optimum fitting parameters for a given training dataset, in practice it should be set appropriately to expedite the gradient descent algorithm in the training process.
3.2 Stochastic gradient descent and backpropagation algorithm
To minimize the cost function, stochastic gradient descent (SGD) with the backpropagation algorithm is adopted. The gradient vector of the cost function can be written as
(35) 
with
Due to the simplicity of the constraint term, this section focuses only on deriving the gradients of the cost function.
In the context of training neural networks, backpropagation is commonly used to adjust the weights of neurons by calculating the gradient of the cost function. Sometimes this technique is also called backward propagation of errors, because the error is calculated at the output and distributed backward through the network layers. A similar technique can be utilized to train the material network. The heart of backpropagation is the chain rule of calculus, by which the time for computing the partial derivatives can be greatly reduced and less memory is needed. The derivation and implementation of backpropagation for training the material network are discussed below.
As shown in Fig. 2, each independent component of a compliance matrix is regarded as a neuron in the network. After the vectorization, the components in the compliance matrices $\bar{\mathbf{D}}$ for all the nodes at layer $k$ are collected into one vector, and likewise for the components in the rotated matrices $\bar{\mathbf{D}}^r$. In the algorithm, we define the error of neuron $j$ in layer $k$ by
(36) 
and the error associated with the rotated components by
(37) 
In the output layer, the components of the error are given by
(38) 
Since the cost function takes a quadratic form in Eq. (30), the output error can be easily computed. For the lower layers $k = 1, 2, \ldots, N$, the errors can be computed recursively by
(39) 
(40) 
All the required derivatives have been defined previously; the relevant expressions can be found in Eq. (15) and (17), respectively.
By combining Eq. (38), (39) and (40), we can compute the errors at any layer in the network. Within one step, the error of the cost function propagates backward from the output layer 0 to the bottom layer $N$. After the backpropagation of the errors for the compliance matrices, we can compute the rate of change of the cost function with respect to a rotation angle $\theta$ at layer $k$ by
(41) 
where the derivative of $\bar{\mathbf{D}}^r$ with respect to $\theta$ can be found in Eq. (13). Meanwhile, the rate of change of the cost function with respect to an activation $z_j$ in the bottom layer can be computed by
(42) 
where $\odot$ denotes the Hadamard (element-wise) product. Analytical forms of the derivatives with respect to the volume fraction are provided in Eq. (16), and the derivative of $f_1$ with respect to $w_j^N$ is defined in Eq. (24).
Finally, we summarize the backpropagation algorithm for finding the gradients of the cost function with respect to the fitting parameters in the material network:

Input: set the initial values of $\theta$ and $z$.

Feed-forward: for each layer $k = N, N-1, \ldots, 0$, compute the weights and the compliance matrices $\bar{\mathbf{D}}$ and $\bar{\mathbf{D}}^r$.

Output error: compute the output-layer error vector using Eq. (38).

Backpropagate the error: for each layer $k = 1, 2, \ldots, N$, compute the errors using Eq. (39) and (40).

Gradients: compute the gradients with respect to $\theta$ and $z$ using Eq. (41) and (42).
In a standard gradient descent method, the gradients of each sample are computed and averaged to obtain the total gradient based on Eq. (30),
(43) 
To accelerate training, stochastic gradient descent (SGD) is used to train the material network once the gradient vector is available. Instead of computing and averaging the gradients over all the samples at each step, a small number $m$ of samples is randomly picked to estimate the gradient. In this way, the original dataset is divided into several mini-batches, which are used in a sequence of learning steps. If we label the samples in a mini-batch by $s_1, s_2, \ldots, s_m$, the gradient of the cost function for the corresponding learning step can be approximated by
\[\nabla C \approx \frac{1}{m} \sum_{t=1}^{m} \nabla C_{s_t}. \tag{44}\]
Writing out the gradient descent updating rule in terms of the components, we have
\[\theta \leftarrow \theta - \eta \frac{\partial C}{\partial \theta} \tag{45}\]
and
\[z_j \leftarrow z_j - \eta \frac{\partial C}{\partial z_j}, \tag{46}\]
where $\eta$ is a positive parameter known as the learning rate. An epoch is completed each time the algorithm has processed all the mini-batches and thus seen all the samples in the original dataset. In practice, after an epoch is completed, the dataset is randomly shuffled and prepared for the next epoch to minimize the sample bias.
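The mini-batch SGD loop described above can be sketched as follows. Here `grad_fn` is a hypothetical per-sample gradient routine standing in for the backpropagation algorithm, and the parameter vector is generic; none of the names come from the paper:

```python
import numpy as np

def sgd_epochs(params, grad_fn, samples, lr=0.01, batch_size=20,
               n_epochs=10, rng=None):
    """Minimal mini-batch SGD sketch: reshuffle the dataset each epoch,
    split it into mini-batches, and update the parameter vector with the
    batch-averaged gradient. grad_fn(params, sample) is assumed to return
    d(cost of one sample)/d(params)."""
    if rng is None:
        rng = np.random.default_rng(0)
    params = params.copy()
    for _ in range(n_epochs):
        order = rng.permutation(len(samples))     # reshuffle to reduce sample bias
        for start in range(0, len(samples), batch_size):
            batch = [samples[i] for i in order[start:start + batch_size]]
            g = np.mean([grad_fn(params, s) for s in batch], axis=0)
            params -= lr * g                      # update rule as in Eq. (45)-(46)
    return params
```

On a simple quadratic cost the loop converges to the minimizer, which is the behavior the full training relies on.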
All fitting parameters are initialized randomly following a uniform distribution at the beginning of the SGD algorithm,
(47) 
In order to reduce the influence of the constraint term at the early training stage, the hyperparameter $\lambda$ is chosen to be
(48) 
This helps to avoid unwanted early deactivation of the material network. Additionally, the learning rate is an important hyperparameter that determines the performance of SGD. In this paper, we utilize a bold driver algorithm to dynamically adapt the learning rate, which has effectively improved the training speed in our numerical studies. In the future, more sophisticated algorithms, such as annealing, can be used to tune the learning rate.
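A bold driver rule can be sketched in one line; the growth and shrink factors below are illustrative choices, not values taken from the paper:

```python
def bold_driver(lr, cost, prev_cost, grow=1.05, shrink=0.5):
    """Bold-driver learning-rate adaptation (illustrative factors):
    increase the rate slightly after an epoch that lowered the cost,
    cut it sharply after an epoch that did not."""
    return lr * grow if cost < prev_cost else lr * shrink
```

In practice the rule is applied once per epoch, using the cost evaluated at the end of the previous epoch as `prev_cost`.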
3.3 Model compression and parameter reduction
The speed and convergence rate of the network training process can be improved by removing redundancy in the network, a procedure called model compression. Moreover, a less complex network is always beneficial for the extrapolation process in the online stage. As discussed in Section 2.2, the rectified linear unit enables automatic deactivation of nodes during training, so it already acts as a form of network compression. In addition, two further approaches are introduced for model compression: 1) deletion of a parent node with only one child node; 2) subtree merging based on similarity search. Illustrations of these two model compression operations are shown in Fig. 6.
In Fig. 6 (a), node 2 is deleted since its volume fraction is equal to 1. Meanwhile, the rotation angle of node 1 needs to be updated to avoid a sudden jump in the cost function,
\[\theta_1 \leftarrow \theta_1 + \theta_2. \tag{49}\]
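One way to read the deletion operation: when a parent node has a single active child, its homogenization step is an identity and only its rotation survives, so the parent's angle can be folded into the child's. The sketch below encodes our reading of Eq. (49) and is not code from the paper:

```python
def delete_parent(theta_parent, theta_child):
    """Fold the rotation of a to-be-deleted parent node (whose single
    child has volume fraction 1) into the child's rotation angle.
    Composing two in-plane rotations simply adds their angles."""
    return theta_child + theta_parent
```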
As shown in Fig. 6 (b), the merging of subtree structures is based on similarity search. The comparison of two subtrees is performed over all descendant layers of their root nodes. The differences are evaluated by
(50) 
where the index denotes the order of the nodes in the subtree. If the differences in the fitting parameters are both below the tolerances, the two subtrees are merged to compress the network and reduce the number of fitting parameters. Afterwards, the parameters in the new subtree become
(51) 
Note that a deletion operation follows right after the merging operation, since the volume fraction of the merged child becomes 1 upon the merging of the two subtrees in Fig. 6 (b).
The merging operation should be performed on an ordered material network: at any layer of an ordered material network, the weighting functions of sibling nodes should appear in a prescribed (e.g. descending) order. Each time before the similarity search and subtree merging, the whole material network is reordered based on the weighting functions. To save training time, the network compression operations are performed every 10 epochs in our study.
4 Prediction and extrapolation
4.1 Nonlinear smallstrain plasticity
Since the material network captures the essential topological structure of an RVE, it can be used for the online prediction of nonlinear plasticity, not just linear elasticity. Each node at the bottom layer can be regarded as an individual material node with independent degrees of freedom (DOFs), specifically the infinitesimal strain $\boldsymbol{\varepsilon}$. As a result, the total number of DOFs in the material network at the beginning of training is proportional to $2^N$. On the other hand, the ReLU activation function and the compression algorithms introduced in Section 3.3 decrease the model complexity during training. For the final trained network, the number of DOFs becomes proportional to the number of active/remaining nodes at the bottom layer, denoted by $N_a$,
(52) 
In the bottom layer, the Cauchy stress at the $j$-th node is denoted by $\boldsymbol{\sigma}_j$. At each loading step, its stress-strain relation can be written as
(53) 
The compliance matrix and incremental stress are given by the local constitutive law,
(54) 
and
(55) 
where the local constitutive law depends on a vector of history-dependent internal variables. Once the compliance matrix and the incremental stress are obtained, we can compute the residual strain in the bottom layer by
(56) 
Note that the residual strain comes from the material nonlinearity. In other words, if all the materials are linear elastic, there is no residual strain and the homogenization procedure is the same as the one described for the training process.
In a network with material nonlinearity, both the compliance matrix D and the residual strain are fed forward from the bottom layer $N$ to the output layer 0. In Fig. 4, the stress-strain relations for the $(2i-1)$-th and $2i$-th nodes on layer $k+1$ at each loading step can be written as
(57) 
and the stress-strain relation of their parent node at layer $k$ is
(58) 
In the forward homogenization process, the compliance matrix and residual strain at the parent node are calculated from the ones in its child nodes,
(59) 
and