1 Introduction
Multiscale structures are very common in industrial and biomechanical applications, for instance composite materials and trabecular bones [Fig. 1]. Full Finite Element Analysis (FEA) for stress prediction is usually very expensive in these structures as the FE mesh needs to be very dense to capture the effect of the fine scale features. Therefore, a common approach is to use a coarse scale mesh to predict the global stress field and to perform local scale corrections in the areas where the fine scale features are present.
The effect of fine (micro) scale features on the global (macro) stress field can be expressed using Stress Intensity Factors (SIFs). SIFs can be calculated analytically, but only for spherical or elliptical isolated micro scale features, or specific assemblies of such features where the main axis of inertia is aligned with the principal stress, on semi-infinite or infinite planes [Pilkey and Pilkey, 2008]. Additionally, analytical SIFs assume a separation of scales and thus require the application of homogeneous boundary conditions at infinity.
In this work we will use a nonparametric regression model to perform the local corrections. We don't want to explicitly parameterise the micro and macro scale geometric features, nor assume scale separability. A promising approach that satisfies all these constraints is a Convolutional Neural Network (CNN). A CNN is a specific type of Neural Network (NN) designed to work with images. CNNs have been widely used for tasks like classification, localisation and segmentation, and are able to recognise arbitrary shapes and textures as long as these are relevant to their objective [Wu et al., 2015; Zhang et al., 2018; Aladem and Rawashdeh, 2020]. Therefore, our approach is not a priori limited to any kind of micro scale geometric features, although in practice we need to stay relatively close to the training set. Additionally, we can use the full macro scale stress fields as input to the CNN, not only handcrafted averages such as the ones used in homogenisation theory [Sanchez-Palencia, 1986]. Consequently, we do not need to assume scale separability, nor to apply homogeneous boundary conditions at infinity. Lastly, real medical or industrial data are hard to find and often expensive, so we aim to train our CNN on simpler, artificial datasets and find a way to transfer our knowledge to real cases. To achieve that we are going to train our CNN using only patches of the geometry, so the CNN will be completely agnostic to the overall structure and will learn to identify the effect of microscale features on the macro stress field.
Although Neural Networks (NNs) are nowadays used in a wide range of applications, they usually lack one very important feature: the vast majority of NNs fail to incorporate uncertainty information in their predictions. That makes them overconfident when facing data far away from the training set, something that can lead to catastrophic consequences for critical tasks. Even worse, in classification tasks, predictive probabilities obtained after the softmax layer are often erroneously interpreted as model confidence, when in reality a NN can be very uncertain even with a very high softmax output. NNs that can provide uncertainty information are called Bayesian Neural Networks (BNNs). The extracted uncertainty from a BNN is expected to increase far from the dataset or in high noise areas, warning us not to trust the prediction. On the other hand, low uncertainty implies that the BNN is confident about the prediction for the specific input and thus the prediction is sensible. There are two common ways to quantify the uncertainty in a NN and get a probabilistic output. The first one is the Bayes by Backprop method
[Blundell et al., 2015]. This is a proper Bayesian treatment of the model where the parameters of the network are replaced by distributions, often but not necessarily Gaussians. The second option is to use dropout as a Bayesian approximation [Gal and Ghahramani, 2016]. This method doesn't require changes in the architecture of the NN; the main idea is to use dropout even during inference to introduce randomness into the network. Although the uncertainty on its own is a very important property, BNNs are usually employed to solve another, bigger problem with modern NNs, namely the lack of labelled data. Modern Deep Neural Networks (DNNs) are extremely data hungry, but labelled data are usually hard to find and expensive either in terms of time or money. This is true in our case as well. To reduce the labelled data requirements we will employ a Selective Learning (SL) framework that will help us identify the data that contain new, useful information and label only these. Selective learning with image data is a challenging task with a very sparse existing literature [Gal et al., 2017; Gal and Ghahramani, 2016; Holub et al., 2008; Joshi et al., 2009; Li and Guo, 2013].
Machine Learning (ML) has already been used in fields such as solid mechanics and biomechanics. One of the earliest works using ML models as surrogates for FEA is from [Liang et al., 2018], who developed an image-to-image deep learning framework to predict the aortic wall stress distribution, where the mechanical behaviour in the FEA model was described by a fibre-reinforced hyperelastic material model. After that, other NNs with fully connected layers have been used for stress predictions for nonlinear materials but simple beam structures, as shown by
[Roewer-Despres et al., 2018; Meister et al., 2018]. Later, [Mendizabal et al., 2019] used a CNN for the prediction of nonlinear soft tissue deformation on more complicated structures such as a liver, but without any kind of microscale features. Moreover, [Nie et al., 2019] deployed a CNN model for stress prediction on simple structures with geometric features, but not on multiscale problems, as the size of these features was comparable to the size of the structure. Also, [Jiang et al., 2020] used a GAN to analyze mechanical stress distributions on a set of structures encoded as high and low resolution images. A variety of loading and boundary conditions was used, some of which resembled the effect of isolated microstructural features. Recently, [Sun et al., 2020], based on the architecture of [Nie et al., 2019], created an Encoder-Decoder CNN for the prediction of the stress field on Fiber-reinforced Polymers, but their samples come from a single specimen and a single FE simulation, implying low generalisation ability both in terms of different structures and loading/boundary conditions. Additionally, they predict only one component of the stress tensor and they report a value of about 70% in their evaluation metric. Lastly, [Wang et al., 2020] used a Convolutional Aided Bidirectional Long Short-term Memory Network to predict the sequence of maximum internal stress until material failure.
In contrast to all the aforementioned approaches, our model is able to make predictions in multiscale cases, where multiple microscale features interact with each other and with the macroscale structure. Also, our NN is trained on patches of different structures under different boundary conditions, thus it can be applied to a much broader set of macro scale features and loadings. Furthermore, because we use the cheap macroscale stress as input, the NN only needs to learn how the microscale features affect the macro scale stress, further contributing to a model that generalizes well to unseen cases. Moreover, we incorporate uncertainty information into our prediction, allowing us to either make confident predictions or be aware of the potentially large error in the prediction. Lastly, we propose a SL framework to tackle the well known problem of insufficient labelled data.
2 Methods and Governing Equations
In this section we will discuss the reference multiscale mechanical model that we aim to solve online using the trained CNN and we will also introduce definitions and notations that will be necessary for us to explain our methodology.
2.1 General problem of elasticity
We consider a 2D body occupying a domain Ω with boundary ∂Ω. The body is subjected to prescribed displacements u_D on the part Γ_D of its boundary and prescribed tractions t on the complementary part Γ_N, with Γ_D ∪ Γ_N = ∂Ω. The equations of linear elasticity under the plane strain assumption are as follows [Eq. 1a - 1d].
(1a) ∇·σ + f = 0 in Ω
(1b) σ = λ tr(ε) I + 2μ ε
(1c) ε = (∇u + (∇u)ᵀ) / 2
(1d) u = u_D on Γ_D and σ·n = t on Γ_N
where σ is the stress tensor, f is the body force per unit volume, λ and μ are the Lamé elasticity parameters for the material in Ω, I is the identity tensor, tr is the trace operator on a tensor, ε is the symmetric strain tensor (symmetric gradient), u is the displacement vector field and lastly n is the outward unit normal to ∂Ω. We are interested in predicting the stress field and more specifically a fracture indicator. A proper fracture indicator is the Tresca stress, defined as the maximum of the principal stresses. The Tresca stress can be calculated from the stress tensor as described below. The stress tensor can be rotated [Eq. 2] using the rotation matrix [Eq. 3]. The resulting components of the stress tensor after the rotation can be found in [Eq. 4a - 4c]. From [Eq. 4c] we can conclude that there must be an angle θ_p such that the shear stress after rotation is zero [Eq. 5]. After inserting θ_p into [Eq. 4a, 4b] we can calculate the two principal stress components [Eq. 6]. The maximum of σ_1 and σ_2 is the Tresca stress.
(2) σ' = R σ Rᵀ
(3) R = [ cos θ   sin θ ; −sin θ   cos θ ]
(4a) σ'_xx = σ_xx cos²θ + σ_yy sin²θ + 2 σ_xy sin θ cos θ
(4b) σ'_yy = σ_xx sin²θ + σ_yy cos²θ − 2 σ_xy sin θ cos θ
(4c) σ'_xy = (σ_yy − σ_xx) sin θ cos θ + σ_xy (cos²θ − sin²θ)
(5) tan 2θ_p = 2 σ_xy / (σ_xx − σ_yy)
(6) σ_{1,2} = (σ_xx + σ_yy) / 2 ± √( ((σ_xx − σ_yy) / 2)² + σ_xy² )
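The principal stress computation above can be sketched numerically; this is a minimal numpy version of [Eq. 6], with function names of our own choosing:

```python
import numpy as np

def principal_stresses(sxx, syy, sxy):
    """Principal stresses of a 2D stress state, as in [Eq. 6]."""
    centre = 0.5 * (sxx + syy)
    radius = np.sqrt((0.5 * (sxx - syy)) ** 2 + sxy ** 2)
    return centre + radius, centre - radius

def tresca(sxx, syy, sxy):
    """Fracture indicator used in this work: the maximum principal stress."""
    return max(principal_stresses(sxx, syy, sxy))
```

For a uniaxial state (σ_xx = 1, σ_yy = σ_xy = 0) this gives principal stresses 1 and 0, matching the classical result.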
2.2 Multiscale Problem
Assuming a structure with both macroscale and microscale features like the one in [Fig 2], we assume that elasticity can be written over Ω, where microscale features are ignored or averaged, through a modification of the constitutive law (i.e. standard homogenisation) as for example in [Sanchez-Palencia, 1986]. We assume that the macro stress prediction, along with the microscale information in a domain surrounding a Region Of Interest (ROI), is sufficient to predict the micro stress field in the ROI up to an acceptable level of accuracy. This assumption is backed up by Saint-Venant's principle, stating that the micro effect in a subregion can be fully predicted knowing the macroscopic solution, if we look sufficiently far away from the boundaries of B as shown in [Fig 2].
In our case we assume a homogeneous distribution of spherical pores as microscale features. Nevertheless, our surrogate model is completely agnostic to the shape of the geometric features, and this information is only available to it through the training examples.
2.3 Inputoutput strategy
One strategy is to work on the entire domain, as described in [Zhang et al., 2020; Sasaki and Igarashi, 2019] for topology optimisation. Unfortunately, in our problem this strategy suffers from high computational requirements, both in terms of training the NN and in creating enough data to train on. Most importantly, it will result in poor generalization, as the NN will unavoidably try to learn the specific macrostructures present in the dataset.
Another strategy, more suited to our case, is first to solve for the stress in the whole domain and then to divide the domain into squares that we call patches, as can be seen in [Fig. 3]. Therefore, we can consider each patch as a training example instead of each domain. This strategy will reduce the computational requirements, because we can extract a lot of patches from a single domain and the NN will have inputs of smaller size. Last but not least, this strategy will result in better generalization, because the NN is encouraged to learn how the microstructural features affect the global stress field instead of learning the specific structures present in the training dataset.
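The patch-extraction step can be sketched as a sliding square window over the solved stress field; the patch size and stride below are hypothetical placeholders, not the values used in this work:

```python
import numpy as np

def extract_patches(field, patch_size, stride):
    """Cut a 2D field into overlapping square patches of side `patch_size`,
    sliding the window by `stride` pixels; each patch becomes one example."""
    h, w = field.shape
    patches = []
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(field[i:i + patch_size, j:j + patch_size])
    return np.stack(patches)
```

A 4x4 field with a 2x2 window and stride 2, for instance, yields four non-overlapping patches.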
As already discussed, according to Saint-Venant's principle we can't fully predict the micro stress in the patch but only in a ROI far from the boundaries. Consequently, we need to provide to the NN all the available information in the patch (macro stress field and microscale features), but we can only ask for a micro stress prediction in the ROI. Specifically, the input of our model will be the full macro stress tensor in the patch along with the microscale features, while the output will be the Tresca stress in the ROI.
2.4 Training dataset
For the purpose of training our model we have assumed a distribution of elliptical pores as macroscale features. We consider all the microscale features as circles with the same radius, r. We assume that for a distance larger than four radii from the center of a microscale feature the micro effect on the global stress field is negligible; for instance, in the case of an infinite plate under uniaxial loading the max stress at that distance is 1.04 times the macro stress [Pilkey and Pilkey, 2008]. Based on that, the defect length and the interaction length both follow from r and the 4r cut-off. Given those two parameters we conclude the patch length, with the ROI being a window in the middle of the patch as shown in [Fig 4].
The boundary conditions are applied to a buffer area where the mesh is much coarser, as can be seen in [Fig. 5]. The buffer area allows us to apply boundary conditions without introducing boundary effects in the fine mesh area. Additionally, because the mesh in the buffer area is very coarse, the computational cost remains practically the same. We apply displacements as boundary conditions [Eq. 7].
(7) u(x) = [ ū_xx  ū_xy ; ū_xy  ū_yy ] (x − x_0)

where ū_xx is the far field displacement along the xx direction, ū_yy is the far field displacement along the yy direction, ū_xy is the far field displacement along the xy direction, x is the position of a point in Ω and x_0 is the initial position of the center of the body in Ω.
3 CNN
3.1 InputOutput
The input of the CNN is a 4D array of size N × H × W × C, where: N is the number of data points, H and W are the sizes of the input image along the x and y directions respectively, and C is the number of channels of every data point. Each data point has 4 channels, corresponding to the xx, yy and xy components of the macro stress tensor and a binary image of the geometry respectively. The output of the model is an image corresponding to the micro Tresca stress. Note that we are only interested in the ROI of the patch, so all the statistics during training and inference are calculated in a window in the middle of the patch, as shown in [Fig 4]. Because we want to identify the effect of micro scale features on the macro scale stress, we will scale the output with a number that reflects the intensity of the macro stress field. This number is the sum of the principal stresses of the macro stress tensor from [Eq. 6]. The micro stress in areas away from micro scale features should be the same as the macro scale stress, because these features only have a local effect. This suggests that the output should be constant away from the micro scale features and change rapidly very close to them. That is clearly visible in [Fig. 6].
Differences in the scales across input variables may increase the difficulty of the problem being modeled, for example increased difficulty for the optimizer to converge to a local minimum or unstable behaviour of the network; thus a standard practice is to preprocess the input data, usually with a simple linear rescaling [Bishop, 1995]. In our case we will scale the data not only to improve the model but also to restrict the space we have to explore. The space that we have to cover is infinite, because the input can take any real value. Fortunately, because we chose to model linear elasticity problems, we can scale the input stress tensor by any value and the output micro Tresca field will be scaled by exactly the same number. From [Eq. 6] it is trivial to show that if we divide the stress components by some scaling factor s, then the new Tresca after scaling is the original one divided by s. Here s is the maximum stress value present in all the 3 stress components. This scaling of the input values allows us to make predictions on input data of any possible scale. We just have to calculate s, divide the input by it to transfer it to the desired scale, and then multiply the output by s to get the true output.
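The linearity argument above can be sketched as follows, assuming we scale by the maximum absolute value over the three stress channels:

```python
import numpy as np

def scale_input(stress):
    """Divide the stress channels by the largest absolute value present,
    so the input lies in [-1, 1]; return the factor to undo the scaling."""
    s = np.abs(stress).max()
    return stress / s, s

# usage sketch: x_scaled, s = scale_input(x); y_true = model(x_scaled) * s
```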
3.2 Architecture
Training Deep Neural Networks is complex for a number of reasons. A common one is that the distribution of each layer's inputs changes during training as the parameters of the previous layers change; this phenomenon is known as internal covariate shift. It slows down training by requiring lower learning rates and careful parameter initialization [Ioffe and Szegedy, 2015]. Batch normalization (BN) aims at reaching a stable distribution of activation values throughout training
[Ioffe and Szegedy, 2015; Santurkar et al., 2019]. To achieve that, BN normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. After the normalization, BN scales and shifts the normalized output by adding two trainable parameters to each layer: BN multiplies the normalized output by a "standard deviation" parameter γ and then adds a "mean" parameter β. This scaling and shifting happens to restore the representation power of the network; the network could even recover the original activations if that were the optimal thing to do [Ioffe and Szegedy, 2015; Santurkar et al., 2019]. Batch normalization acts to standardize only the mean and variance of each unit in order to stabilize learning, but allows the relationships between units and the nonlinear statistics of a single unit to change [Goodfellow et al., 2016]. Reduction of the internal covariate shift [Ioffe and Szegedy, 2015] is one possible reason why BN works. The other popular theory is that BN makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behaviour of the gradients, allowing for faster training [Santurkar et al., 2019]. Nevertheless, both agree that BN enables faster (larger learning rates) and more stable training of DNNs.
There is also a discussion about whether BN should be used with dropout or not. According to [Ioffe and Szegedy, 2015; Goodfellow et al., 2016] BN reduces generalization error and allows dropout to be omitted, due to the noise in the estimate of the statistics used to normalize each variable. [Chen et al., 2019] show that the simultaneous use of BN and dropout can actually reduce accuracy. However, they state that this is not true if dropout is used after BN. [Garbin et al., 2020; Li et al., 2018] agree that both of them can be used together if dropout is used after the BN layer and actually [Li et al., 2018]
proposed a modification in the architecture of well studied Neural Networks to take advantage of the combined effect of BN and dropout. Lastly, everyone seems to agree that the BN layer has to be used before the activation function.
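A minimal numpy sketch of the BN transformation described above (training-time batch statistics only; the running averages used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalise activations over the batch axis, then scale by the
    trainable "standard deviation" gamma and shift by the "mean" beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With gamma = sqrt(var) and beta = mu the layer recovers the original activations, which is the "restored representation power" argument above.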
In theory, the more layers a network has the better it should perform. Nevertheless, that is not true in practice, and training very deep NNs is very challenging due to problems like vanishing gradients, causing the NN to not be able to learn even simple functions like the identity function between input and output [Sussillo and Abbott, 2015; Hochreiter et al., 2001]. The usual way to train DNNs is through residual blocks [He et al., 2015; Zagoruyko and Komodakis, 2017; Lim et al., 2017; Kim et al., 2016]. With residual blocks the NN itself can choose its depth by skipping the training of a few layers using skip connections. As we can see from [Fig. 7], even if the NN chooses to ignore some layers, i.e. F(x) = 0, it will still learn to map the input to the output, since the output is y = F(x) + x. This way we can use a large number of residual blocks and the network will simply ignore the ones it does not need. The name residual comes from the fact that the network tries to learn the residual, F(x) = y − x, or in other words the difference between the true output, y, and the input, x.
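The skip connection itself can be sketched in one line; here `f` stands for the stacked layers the block may learn to ignore:

```python
def residual_block(x, f):
    """Output y = f(x) + x: if the network drives f(x) to zero,
    the block collapses to the identity mapping."""
    return f(x) + x
```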
In the residual blocks of our CNN we are using another kind of block, the Squeeze-and-Excitation (SE) block. The SE block can adaptively recalibrate channel-wise feature responses by explicitly modelling interdependencies between channels, resulting in improved generalization across datasets and improved performance [Hu et al., 2019; Cheng et al., 2018; Li et al., 2018]. The input of the SE block has C channels, height H and width W. The input is reduced in size using a global-average pooling layer, resulting in a linear array of size C. After that, two fully connected layers downsample and then upsample the linear array. Firstly the linear array is downsampled by a factor of 16, as this is indicated to result in optimum performance [Hu et al., 2019], then a ReLU activation function is applied before upsampling again by a factor of 16, and in the end a Sigmoid activation function is applied. Lastly, the linear array is reshaped to size C × 1 × 1 and multiplied with the input of the SE block [Fig. 8]. The residual blocks we will use in this work consist of two convolution layers, each followed by a BN layer and a ReLU activation function, and a SE block in the end [Fig. 9]. The input and output of this block have exactly the same size, as we choose the number of filters for the convolution layers to be the same as the number of filters at the input of the residual block.
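A forward-pass sketch of the SE block in numpy, assuming an input of shape (C, H, W) and untrained weight matrices passed in explicitly:

```python
import numpy as np

def squeeze_excite(x, w1, w2):
    """x: (C, H, W); w1: (C // 16, C); w2: (C, C // 16).
    Squeeze by global average pooling, excite through the bottleneck,
    then rescale each channel of x by its learned gate."""
    z = x.mean(axis=(1, 2))                  # squeeze: (C,)
    s = np.maximum(w1 @ z, 0.0)              # downsample by 16 + ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ s)))      # upsample + sigmoid gate: (C,)
    return x * s[:, None, None]              # channel-wise recalibration
```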
The architecture of the network is inspired by the "StressNet" proposed by [Nie et al., 2019]. Three convolution layers with an increasing number of filters downsample the input; after that, five residual blocks are applied to the resulting array, before three deconvolution layers with a decreasing number of filters upsample to the original dimension, but with 1 channel instead of 4 [Fig. 10].
4 Bayesian Neural Network
4.1 Bayes By Backprop
In our effort to include uncertainty information in our prediction we will deploy a Bayesian framework, first described by [Blundell et al., 2015], to introduce uncertainty in the weights of the network. To achieve that, we will replace the constant weights of a plain neural network with a distribution over each weight, as seen in [Fig. 11]. The output of this probabilistic model, for an input x, will be a probability distribution over all possible outputs y. The distribution of the weights before observing the data is called the prior distribution, P(w), and it incorporates our prior beliefs about the weights. The goal is to calculate the posterior, the distribution of the weights after observing the data, because during training and of course inference the weights of the network are sampled from the posterior. Bayesian inference can calculate the posterior distribution of the weights given the training data D, P(w|D). Unfortunately, the posterior is intractable for NNs, but it can be approximated by a variational distribution q(w|θ) [Hinton and van Camp, 1993; Graves, 2011], parameterised by θ. Variational learning finds the parameters θ that minimise the Kullback-Leibler (KL) divergence between the approximate posterior and the true Bayesian posterior. That is how the loss function is defined [Eq. 8]. The first term of the loss is the KL divergence between the approximate posterior and the prior. It is obvious that the prior introduces a regularization effect, because the KL divergence penalizes complexity by forcing the approximate posterior to be close to the prior. The second part is the negative log likelihood. This is a data dependent term and it forces the network to fit the data.

(8) F(D, θ) = KL[ q(w|θ) ‖ P(w) ] − E_{q(w|θ)}[ log P(D|w) ]
Here we consider the approximate posterior to be a fully factorised Gaussian [Graves, 2011]. During one forward pass we sample the weights from the posterior, but during back propagation we would have to define the gradient of the loss with respect to this sampling procedure, which is of course not possible. Instead we use the reparameterization trick [Kingma and Welling, 2014]. This procedure is well described in [Blundell et al., 2015]. Instead of having a parameter-free operation (sampling), we can obtain the weights of the posterior by sampling a unit Gaussian, shifting it by a mean μ and scaling it by a standard deviation σ. This standard deviation is parameterised as σ = log(1 + exp(ρ)) and thus it is always positive. So the weights are sampled according to the following scheme: w = μ + log(1 + exp(ρ)) ∘ ε with ε ∼ N(0, I), and the variational parameters to be optimised are θ = (μ, ρ).
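The sampling scheme can be sketched as follows; the softplus keeps the standard deviation positive, and gradients flow to the variational parameters (mu, rho) rather than through the random draw:

```python
import numpy as np

def sample_weights(mu, rho, rng):
    """Reparameterisation trick: w = mu + softplus(rho) * eps, eps ~ N(0, 1)."""
    sigma = np.log1p(np.exp(rho))        # sigma = log(1 + exp(rho)) > 0
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps
```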
The architecture remains the same, but we replace all the convolutional and dense layers with the respective Bayesian layers. We use a Gaussian as the predictive distribution, corresponding to a squared loss. The prior is a single Gaussian for each layer, corresponding to L2 regularization. During backpropagation we optimise the prior by considering the gradients of the loss not only with respect to the posterior parameters but also the prior parameters. The mean prediction and the variance are calculated by passing the same input through the network multiple times.
4.2 Selective Learning
In this work we train a NN in a supervised way. This means that we need to provide labelled data during training. Labelled data contain both the input and the output of the NN, while unlabelled data contain only the input. Usually, but also in our work as well, labelled data are considerably more expensive to create than unlabelled data.
A selective learning process, in a supervised learning framework, assumes that a large pool of unlabelled data is available, while there is a very expensive function that labels these data. The aim of selective learning is to identify which of the unlabelled data contain useful information, so that only these are labelled. To achieve that, an acquisition function able to identify the useful data needs to be formulated.
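Given an uncertainty score for every sample in the unlabelled pool, the acquisition step reduces to picking the top-scoring ones; a minimal sketch, assuming predictive variance as the score:

```python
import numpy as np

def select_to_label(variances, k):
    """Return the indices of the k pool samples with the highest predictive
    variance; only these are sent to the expensive labelling (FEA) step."""
    return np.argsort(variances)[::-1][:k]
```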
[Tsymbalov et al., 2018] suggest that the uncertainty extracted from a Bayesian Neural Network is a sensible acquisition function for this task. This is also intuitively a sensible conclusion, because high uncertainty in the prediction of the network means that the input is far away from the training data distribution.

5 Results
5.1 Initial Dataset
Firstly, we created an initial dataset with very simple examples [Fig 12]: a single ellipse in the middle plays the role of the macroscale feature, creating a diverse macro stress field, and a few defects are randomly positioned around the ellipse, accounting for the micro scale features that will affect the macro stress field. All the defects have a circular shape and the same radius. From almost 500 examples we extracted about 33.000 patches, 5.000 of which were used as a validation set.
Experiments on this dataset showed very positive results. Training with 28.000 data points and validating on 5.000 unseen data points resulted in a validation accuracy of 96% when training with the Adam optimizer for 600 epochs, which required about 6 hours on a NVIDIA T4 GPU. The concept of accuracy in a regression task with images needs to be discussed. The process followed to define accuracy is summarised in [Algorithm 1] and described in more detail in the following passage. First of all, we take the max of each prediction; this happens because we are primarily interested in the max values, as these values will indicate if the material will fail or not. Next, we define an error metric between the real max value and the max value in our prediction; here we use the relative error between the two. Finally, we need to set a threshold for the acceptable error; in this case we will use 10%. To sum up, 96% validation accuracy means that in the validation set 96% of the max values were predicted with a relative error of less than 10%. Someone could argue that this 10% threshold is arbitrarily chosen and that it should be more application specific, because different applications have different error requirements. We have constructed a diagram that shows the accuracy as a function of the threshold [Fig. 13]. We present results from 2 random patches [Fig. 14] and then a result on the whole structure [Fig. 15]. The prediction happens again on the patch level, but then the original image is reconstructed. This is possible if we align one corner of the ROI with a corner of the image and use a sliding window equal to the size of the ROI, as can be seen in [Fig. 16]. We can see that in all cases the CNN was able to accurately reconstruct the full micro stress field, and it was also able to predict the max values with a very small error. More specifically, it is clear that away from the micro scale features the micro scale field is constant. Also, we can see that very close to the defects we have a very steep rise of the micro stress field. The orientation and the shape of the micro stress field are accurately predicted, even in complicated cases where more than one defect is interacting or defects and macroscale features are interacting. We also wanted to investigate the effect of training with less data.
We randomly chose 10.000 data points, almost 30% of the available data, and trained the NN with exactly the same settings; an example can be found in [Fig. 17]. We noticed that for the 10% threshold the accuracy is only 6% higher for the large dataset, even though we used almost 3 times more data. This implies that after a specific point in training, most of the data from this dataset do not really contain any new information. This highlights the need to use selective learning in order to only label (and train on) the useful data.
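The accuracy metric of [Algorithm 1] can be sketched as follows (the 10% threshold is the one used in this section):

```python
import numpy as np

def patch_accuracy(y_true, y_pred, threshold=0.10):
    """Fraction of patches whose maximum stress is predicted within a
    relative-error threshold of the true maximum."""
    hits = 0
    for t, p in zip(y_true, y_pred):
        if abs(t.max() - p.max()) / abs(t.max()) < threshold:
            hits += 1
    return hits / len(y_true)
```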
5.2 Advanced Dataset
Even though the CNN we trained seems to work well for the data it was trained on, we believe that it can only be used for simple cases and is not suitable for complicated structures like the femur. To tackle this problem we created a new, more interesting family of data, with the expectation that this would add more complexity [Fig. 18]. We tried to use the old CNN to make predictions on the new dataset and observed that the accuracy dropped from 96% to 72%. This implies two things. Firstly, the drop in accuracy means that the new dataset contains information that the network had never seen before, thus training on this dataset will make the CNN generalize better. Secondly, the concept of making the knowledge transferable seems to be working, as we were able to make reasonable, but not perfect, predictions on a new family of data. This suggests that we managed to learn interactions between micro scale features and the macro stress field, and not just the structures themselves.
Training a CNN with the new dataset proved to be more challenging. By using 23.000 training data points (almost as many as in the original case) and 5.000 validation data points we obtained, with the same settings, a validation accuracy of 74%, in contrast to the 96% in the first case. From experiments we found that as more and more new data points are added, the accuracy tends to increase more and more slowly. This happens because the new data points tend to contain less and less new information. As discussed before, we can use rotation as a data augmentation technique, by rotating the stress mechanically and the images "physically". We started from an initial training set of 5.000 data points (about 1/4 of the full set) and we rotated the dataset 6 and 12 times. After training with the same settings in all cases, validation accuracies of 62%, 80% and 82% were achieved respectively for the 10% threshold [Fig. 19]. Firstly, this means that we managed to outperform the full dataset by 8%, and secondly, we realized that going from 6 to 12 rotations didn't add a significant amount of new information, even though the data are doubled. Once more, that was the motivation to start working with Selective Learning. We can see an example of a prediction with all 3 CNNs on the same input [Fig. 20], where the prediction improves with the number of rotations. We can also see predictions of the CNN trained with 6 rotations on 4 random patches [Fig. 21].
5.3 Bayesian Neural Network
Until now we have used a deterministic neural network for the predictions. In this section we will see some results from the Bayesian Neural Network. We trained the BNN with the same 5.000 data points as earlier for 600 epochs and validated on 10.000 data points. This requires almost twice as much time as the deterministic case. The accuracy of the prediction is 72% for the 10% threshold, compared to 62% for the deterministic case.
The mean and the variance of the prediction are calculated by passing the input through the network 100 times. The results are shown in [Fig. 22]. The first image shows that the mean prediction is very close to the real value. We also observe higher errors for higher values, which we expected because those cases are more challenging. The second image shows that the error between the real maximum value in the ROI and the predicted one lies inside the 95% credible interval almost always, specifically for 92% of the data points. This is a very positive result, as our objective is to always have the real maximum value inside the 95% credible interval. Lastly, the error tends to increase with the variance, which justifies our choice of the BNN's variance as the acquisition function for the Selective Learning framework.
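The repeated forward passes amount to a Monte Carlo estimate of the predictive distribution. A minimal sketch, assuming the BNN is a stochastic callable that returns a different weight sample on each call (the `stochastic_model` name and Gaussian-approximation interval are our assumptions, not the exact implementation):

```python
import numpy as np

def mc_predict(stochastic_model, x, n_samples=100):
    """Monte Carlo estimate of the predictive mean, std and 95% credible
    interval: stochastic_model(x) must return a fresh prediction sample on
    each call (e.g. with newly sampled weights)."""
    samples = np.stack([stochastic_model(x) for _ in range(n_samples)])
    mean = samples.mean(axis=0)
    std = samples.std(axis=0)
    # 95% CI under a Gaussian approximation of the predictive distribution
    return mean, std, (mean - 1.96 * std, mean + 1.96 * std)
```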
Results from the uncertainty estimation at image level are shown in [Fig. 23]. The top 2 cases are examples of very good mean predictions that nevertheless carry relatively high uncertainty. This is justified by the fact that the error is also high; it simply does not occur in high stress regions and therefore does not affect the maximum values. We observe that the uncertainty, expressed as the 95% CI, is higher close to the high error pixels, indicating that the BNN has successfully identified the unseen interactions. The bottom left image is an example where the maximum value is mispredicted with a large error of about 1 unit. Fortunately, the uncertainty is also very large; specifically, the 95% CI spans about 1.5 units, meaning that the true maximum value lies between the mean prediction and the 95% CI. The bottom right image is an example with low uncertainty and low error: the CIs are very tight and the BNN is very confident about the prediction. This was an expected result, in the sense that this is a very simple case and we would expect the BNN to handle it without a problem.
5.4 Selective Learning
In this last section we investigate the idea of Selective Learning to reduce the labelled data requirements for training the BNN. The principles of this framework are as follows. We have an initial dataset with labelled data and a bigger dataset with only unlabelled data. We train on the initial dataset and then use an acquisition function to select small batches from the unlabelled set to label and train on. Once the new information is incorporated into the BNN, we repeat the same process until we reach the desired accuracy or have labelled the entire unlabelled set.
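The loop just described can be sketched as follows. The `fit`/`uncertainty` API is hypothetical; in our setting the "oracle" labels would come from a fine scale finite element solve, which is what makes labelling expensive:

```python
import numpy as np

def selective_learning(model, labelled, unlabelled, query_rate, n_iterations):
    """Sketch of a Selective Learning loop (hypothetical model API).

    `model.fit(data)` retrains on all labelled data so far and
    `model.uncertainty(x)` returns a scalar score (e.g. the BNN predictive
    variance). `unlabelled` holds (input, label) pairs whose labels are
    only consumed once a point is selected, standing in for the oracle.
    """
    for _ in range(n_iterations):
        if not unlabelled:
            break
        model.fit(labelled)
        # score every unlabelled point, then pick the query_rate most uncertain
        scores = np.array([model.uncertainty(x) for x, _ in unlabelled])
        picked = set(np.argsort(scores)[::-1][:query_rate])
        labelled.extend(p for i, p in enumerate(unlabelled) if i in picked)
        unlabelled = [p for i, p in enumerate(unlabelled) if i not in picked]
    model.fit(labelled)  # final retraining with all acquired labels
    return labelled, unlabelled
```

Swapping the `np.argsort` line for a random permutation gives the random acquisition baseline used in the comparisons below.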
We designed a small experiment to validate our approach, inspired by [Gal et al., 2017]. We used the following setup: 2.500 data points for the initial set, 2.500 data points as the unlabelled set and 11.000 data points as the validation set. We trained each network for 50 epochs, performed 50 forward passes for the uncertainty estimation and added 500 data points to the labelled set at each iteration (query rate = 500). The accuracy is calculated from the mean prediction of the network. Here we compared the max uncertainty acquisition function, which picks the points with the highest uncertainty first, against a random acquisition function that chooses data points at random. For the random selection approach we repeated the experiment 5 times and present the mean and the 95% confidence interval. The results are shown in [Fig. 24]. We observe that the max uncertainty acquisition function consistently produces better results, with higher accuracy. More specifically, with this unlabelled set we can reach an accuracy of about 75%. This can be achieved using 1.500 points with the max uncertainty acquisition function, but requires all 2.500 points if we choose randomly. This means we reduced the labelled data requirement by 40%, from which we conclude that the selective framework works.
Next we use a larger unlabelled dataset of 10.000 data points and again compare the max uncertainty acquisition function with a random acquisition function. The initial training set has 5.000 data points, each network is trained for 150 epochs, we perform 100 forward passes for the uncertainty quantification and we label 2.000 unlabelled data points in every iteration (query rate = 2.000). The results are shown in [Fig. 25]. We again observe the accuracy increasing faster with the max uncertainty acquisition function, and the loss decreasing faster, until 6.000 new data points are reached. At this point the accuracy practically stops increasing and the loss gradually approaches the same value as with the random acquisition function. Using the max uncertainty acquisition function we reach the maximum accuracy with 6.000 data points, whereas we need all 10.000 data points when choosing new data randomly. Again we have a 40% decrease in the labelled data requirement.
This time we perform a similar experiment, but we are interested in the effect of the query rate on the results. Specifically, we use an initial set of 5.000 data points and perform Selective Learning on an unlabelled dataset of 4.000 data points. We repeat the experiment 3 times, with query rates of 500, 1.000 and 2.000. A similar experiment was conducted by [Islam, 2016], who concluded that very small query rates result in suboptimal performance, higher simulation times and noisy behaviour. There are 2 reasons why the results are worse in this case. First, adding only a few data points relative to the size of the initial dataset might cause overfitting, and second, these data points might get smoothed out in the loss function. The simulation time increases because the network needs to be retrained a considerable number of times. On the other hand, too large query rates also give worse results, because the weights of the network are not updated frequently enough, so new information is rarely incorporated into the network and we again end up labelling and training on data points that contain no new information. The results of our experiment are shown in [Fig. 26], and we reached the same conclusions: with a query rate of 1.000 we obtain the optimal behaviour, doubling it leads to slower convergence, and a small query rate gives noisy, suboptimal behaviour.
After validating the Selective Learning framework, we now use it without the random acquisition function as a baseline. We use all 30.000 available data points to train the network, again with an initial set of 5.000 data points. We query 5.000 unlabelled data points at each iteration, chosen by the max uncertainty acquisition function, train for 300 epochs and perform 100 forward passes for the uncertainty quantification. The results are shown in [Fig. 27]. It is clear that the accuracy does not improve after the third iteration (15.000 data points), but we continued labelling points for demonstration purposes. The mean squared error decreases for the first 3 iterations and then stops decreasing as well. In this example we could reach the maximum accuracy using 15.000 of the 30.000 data points, so we reduced the labelled data requirements by 50%.
Lastly, we test the BNN on data outside of the training set. Specifically, we use ellipses as defects. Ellipses have a similar shape to circles and not very different behaviour. Neural networks extrapolate when they make predictions outside of the training distribution, and they are notoriously bad at it. What we hope is that the BNN will recognise that the ellipses are not in the dataset and assign high variance to most of the patches. We created 500 patches and made predictions with the previous BNN. The results are shown in [Fig. 28]. The first plot shows that the mean BNN prediction of the max value in the patch is not close to the real max value for a large fraction of the data (accuracy 50%), but it is not unreasonable either. Moreover, in many cases the network successfully identified the interactions produced by the ellipses even though it was never trained on anything like them. On the other hand, the second plot shows that in most cases (80%) the true max value is indeed inside the 95% CI. Even more encouraging is the fact that higher uncertainty clearly corresponds to higher error, which also implies that Selective Learning is very promising in this case. We also show some example predictions. [Fig. 29] shows 2 cases where the error in the max values is relatively high and, even though the 95% credible intervals are very broad, they fail to contain the real value. [Fig. 30] shows 2 cases where the error is high but inside the 95% CI. Lastly, [Fig. 31] shows 2 examples where the mean prediction of the BNN is very close to the real value; some error is present in other areas of the patch, but this error is captured by the uncertainty of the BNN.
6 Conclusions
The goal of this work was to use a CNN to identify the effect of micro scale features on the global stress field. We aimed for a Bayesian approach that would provide not a point estimate but credible intervals for the prediction. We successfully trained a plain CNN on a simple dataset, achieving 96% validation accuracy using 28.000 data points. We then moved to a more advanced dataset with more interesting interactions and found that training with 23.000 data points only resulted in a 74% accuracy. We used rotation as a data augmentation technique to acquire more data at minimal computational cost; using only 5.000 data points and applying 12 rotations, we achieved an accuracy of 82%. Finally, we deployed a Bayesian Neural Network able to provide uncertainty information for the predictions. The mean prediction of this network fitted the data very well, achieving a 72% validation accuracy using again 5.000 data points. The true result was inside the 95% Credible Interval for 92% of the data points, and the uncertainty was higher close to the high error pixels. This suggests that the BNN can efficiently quantify the uncertainty of the prediction and provide high quality uncertainty estimates for use in a Selective Learning framework. We demonstrated the advantages of the Selective Learning framework twice, by comparing a random acquisition function with a max uncertainty acquisition function and showing that the latter results in a 40% reduction in the labelled data requirement. We also examined the effect of the query rate on the efficiency of the Selective Learning process and concluded that too small or too large query rates should be avoided. Furthermore, we used Selective Learning to train the BNN with all the available data and reached an accuracy of 84% using 50% less data. Lastly, we tested the limits of our BNN by making predictions on points outside the training data distribution.
The accuracy of course dropped, but the important conclusion is that the network was able to efficiently quantify the uncertainty on the unseen cases: the real value was inside the prediction's 95% CI for 80% of the cases.
Acknowledgments
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 764644.
This paper only contains the author’s views and the Research Executive Agency and the Commission are not responsible for any use that may be made of the information it contains.
References
A single-stream segmentation and depth prediction CNN for autonomous driving. IEEE Intelligent Systems, pp. 1–1.
Neural networks for pattern recognition. Oxford University Press, USA. ISBN 0198538642.
Weight uncertainty in neural networks. arXiv:1505.05424.
Rethinking the usage of batch normalization and dropout in the training of deep neural networks. arXiv:1905.05928.
SESR: single image super resolution with recursive squeeze and excitation networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pp. 147–152.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. arXiv:1506.02142.
Deep Bayesian active learning with image data. arXiv:1703.02910.
Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimedia Tools and Applications 79, pp. 1–39.
Deep learning. MIT Press. http://www.deeplearningbook.org.
Practical variational inference for neural networks. In Advances in Neural Information Processing Systems 24, pp. 2348–2356.
Deep residual learning for image recognition. arXiv:1512.03385.
Keeping neural networks simple by minimizing the description length of the weights. In Proceedings of the 16th Annual Conference on Learning Theory (COLT).
Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In A Field Guide to Dynamical Recurrent Networks.
Entropy-based active learning for object recognition. In CVPR Workshops, pp. 1–8.
Squeeze-and-excitation networks. arXiv:1709.01507.
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
Active learning for high dimensional inputs using Bayesian convolutional neural networks.
StressGAN: a generative deep learning model for 2D stress distribution prediction. arXiv:2006.11376.
Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 2372–2379. ISBN 9781424439935.
Deeply-recursive convolutional network for image super-resolution. arXiv:1511.04491.
Auto-encoding variational Bayes. arXiv:1312.6114.
Recurrent squeeze-and-excitation context aggregation net for single image deraining. In Proceedings of the European Conference on Computer Vision (ECCV).
Understanding the disharmony between dropout and batch normalization by variance shift. arXiv:1801.05134.
Adaptive active learning for image classification. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '13), pp. 859–866. ISBN 9780769549897.
A deep learning approach to estimate stress distribution: a fast and accurate surrogate of finite-element analysis. Journal of the Royal Society Interface 15.
Enhanced deep residual networks for single image super-resolution. arXiv:1707.02921.
Towards fast biomechanical modeling of soft tissue using neural networks. arXiv:1812.06186.
Simulation of hyperelastic materials in real-time using deep learning. arXiv:1904.06197.
Stress field prediction in cantilevered structures using convolutional neural networks. Journal of Computing and Information Science in Engineering 20 (1).
Peterson's stress concentration factors, third edition, pp. 1–522.
Towards finite-element simulation using deep learning.
Homogenization in mechanics, a survey of solved and open problems. Rendiconti del Seminario Matematico 44 (1), pp. 1–45.
How does batch normalization help optimization? arXiv:1805.11604.
Topology optimization accelerated by deep learning. IEEE Transactions on Magnetics 55 (6), pp. 1–5.
Predicting mechanical properties from microstructure images in fiber-reinforced polymers using convolutional neural networks. arXiv:2010.03675.
Random walk initialization for training very deep feedforward networks. arXiv:1412.6558.
Dropout-based active learning for regression. In Analysis of Images, Social Networks and Texts, pp. 247–258.
StressNet: deep learning to predict stress with fracture propagation in brittle materials. arXiv:2011.10227.
Deep multiple instance learning for image classification and auto-annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wide residual networks. arXiv:1605.07146.
Diverse region-based CNN for hyperspectral image classification. IEEE Transactions on Image Processing 27 (6), pp. 2623–2634.
A deep convolutional neural network for topology optimization with strong generalization ability. arXiv:1901.07761.