1 Introduction
With the rapid development of computer technology, the Finite Element Method (FEM) has been widely used to solve science and engineering problems roy2008recent ; xu2016tetrahedral . The FEM is a mesh-based numerical method that divides the study domain into a valid computational mesh consisting of interconnected and non-overlapping elements. In the FEM, it is necessary to first create a shape function for the displacement field, then calculate the strain field by exploiting the relationship between displacement and strain, and finally create and solve the system of discrete equations to obtain the nodal displacements, strains, and stresses.
After years of development, the conventional FEM has become quite mature and is capable of accurately solving various problems in science and engineering. However, the FEM also encounters several problems leveque2002finite . For example, when modeling and analyzing large deformation he2010coupled or dynamic crack propagation, the FEM may have difficulties, and the precision of the numerical results is often unsatisfactory. Moreover, when analyzing three-dimensional solid problems, the strain field of elements such as linear tetrahedra is discontinuous across the contact faces, which makes the stiffness overly rigid. In addition, volume locking also occurs when the material is incompressible or nearly incompressible wihler2006locking .
To deal with the aforementioned weaknesses of the conventional FEM, in recent years, Liu G.R. et al. liu2016smoothed proposed a series of SFEMs by combining the FEM with strain smoothing techniques yoo2004stabilized . The SFEM lies between the FEM and the pure meshfree methods. Compared with the FEM, the SFEM requires a modification or reconstruction of the strain field based on a background mesh. The background mesh can be exactly the same as the computational mesh in the FEM. When using the same number of elements as the FEM, the SFEM reduces the mesh quality requirements while providing more accurate (smoothed) computational results.
Although the SFEM can achieve more accurate numerical results than the FEM, it is less efficient. In the SFEM, the interaction between elements makes the bandwidth of the global stiffness matrix larger than that in the FEM nguyen2009face . Moreover, smoothing domains need to be created and the smoothed strains must be calculated. Therefore, the SFEM is computationally more expensive than the conventional FEM. When analyzing large-scale problems in science or engineering, the computational efficiency of the SFEM should be improved.
There are two typical strategies to improve the computational efficiency of the SFEM. One is to redesign the algorithm structure theoretically, in combination with the characteristics of the SFEM. The other is to utilize the parallelism on multi-core CPUs or many-core GPUs. As a variant of the SFEM, the edge-based smoothed FEM (ESFEM) has been parallelized on the GPU cai2018parallel .
Because the process of the SFEM is quite similar to that of the FEM, most current SFEM programs are modified on the basis of FEM codes phan2013edge . Reference li2016automatic introduced an effective algorithm for establishing the smoothing domains, realizing the automation of 3D solid calculation and adaptive analysis. However, current SFEM programs written in languages such as C/C++ or Fortran usually demand strong programming skills, and the code is difficult to read and modify. For programs written in high-level languages such as Python and MATLAB, the code is readable but computationally inefficient sanner1999python .
In summary, quite few efforts have been dedicated to developing open-source packages of the SFEM that comprehensively consider accuracy, efficiency, readability, and ease of development. To balance the execution efficiency and the ease of implementation, in this paper, we design and implement an open-source package of parallel SFEM for elastic problems by using the Julia language bezanson2017julia . We term our package juSFEM. To the best of the authors’ knowledge, juSFEM is the first package of parallel SFEM developed with the Julia language.
The Julia language is a fast, easy-to-use, and open-source programming language that was originally designed for high-performance computing. In combination with the principle of the SFEM and the characteristics of parallel computing, we redesign the algorithm structure and develop a package of the SFEM, juSFEM, to solve 3D elastic problems.
Our contributions in this work can be summarized as follows.
(1) A Julia-based parallel package of the SFEM for elastic problems is developed.
(2) The redesigned strategy in juSFEM significantly improves the computing efficiency.
(3) The structure and function of juSFEM are modularized, and the code is clear and easy to understand, which is convenient for subsequent improvement.
The rest of this paper is organized as follows. Section 2 presents the background introduction to the SFEM and the Julia language. Section 3 introduces the implementation details of the package juSFEM. Section 4 provides several examples to validate the correctness and evaluate the efficiency of juSFEM. Section 5 analyzes the performance, advantages, and shortcomings of the package juSFEM. Finally, Section 6 concludes this work.
2 Background
In this section, we present a brief introduction to (1) the SFEM and (2) the Julia language.
2.1 Smoothed Finite Element Method (SFEM)
The SFEM was proposed by Liu G.R. liu2016smoothed to overcome weaknesses of the conventional FEM such as the overly stiff phenomenon zienkiewicz1971reduced . Because of its excellent performance, the SFEM has been widely applied in material mechanics liu2015smoothed , fracture mechanics chen2010singular ; chen2011novel , acoustics liu2009node ; li2014analysis , heat transfer li2014smoothed ; cui2016steady , and fluid–structure interactions he2011coupled . The SFEM has proved to be an effective and robust numerical method for various types of problems in the past decade zeng2018smoothed .
The essential idea behind the SFEM is to modify the compatible strain field or to construct a strain field using only the displacements. Such modification and construction are performed within the so-called smoothing domains. The method of modification is only suitable for the three-noded triangular element (T3) or the four-noded tetrahedral element (T4), since their compatible strain fields are easy to obtain. In order to support multiple element types, we choose the method of constructing the strain field in juSFEM.
Depending on how the smoothing domains are constructed, the SFEM can be divided into four categories: the node-based smoothed FEM (NSFEM) liu2009node , the edge-based smoothed FEM (ESFEM) liu2009edge , the face-based smoothed FEM (FSFEM) nguyen2009face , and the cell-based smoothed FEM (CSFEM) liu2007smoothed . The calculation process of the FSFEM using the T4 element is as follows.
STEP 1. Discretization of the 3D problem domain
The FSFEM generally uses the T4 element to discretize three-dimensional problem domains. When the T4 element is used, mesh generation is performed in the same manner as in the FEM, for example, by employing the widely used Delaunay tetrahedralization algorithm shewchuk2016delaunay .
STEP 2. Creating a displacement field
As shown in Figure 1, the faces are divided into two types: the interior faces (e.g., triangular face ABC) and the exterior faces (e.g., triangular face ACD). For the interior face ABC, the centroids F and G of the tetrahedra ABCD and ABCE are calculated. Point F connects with the three points of ABC to form a new tetrahedron ABCF, and point G connects with the three points of ABC to form a new tetrahedron ABCG. The hexahedron ABCFG formed by these two tetrahedra is the smoothing domain of the interior face ABC. An exterior face such as ACD has no adjacent tetrahedral element on one side; the centroid F connects with the three points of ACD to form a new tetrahedron ACDF, which is the smoothing domain generated by the face ACD.
The displacement field function is then assumed as:
(1) $\mathbf{u}^h(\mathbf{x})=\sum_{I}\mathbf{N}_I(\mathbf{x})\,\mathbf{d}_I$

where $\mathbf{N}_I(\mathbf{x})$ is the shape function matrix of node $I$ and $\mathbf{d}_I$ is the vector of nodal displacements of node $I$.
STEP 3. Evaluation of the compatible strain field
The FSFEM uses the boundary integral method to calculate the nodal shape function values at the relevant points according to the node information of the smoothing domains and the connection information between the elements. In the FSFEM, the smoothed strain is calculated as:
(2) $\bar{\boldsymbol{\varepsilon}}_k=\int_{\Omega_k}\boldsymbol{\varepsilon}(\mathbf{x})\,\Phi_k(\mathbf{x})\,d\Omega$
where $\boldsymbol{\varepsilon}(\mathbf{x})$ is the compatible strain field evaluated using the assumed displacement field, $\Omega_k$ is the smoothing domain, and $\Phi_k(\mathbf{x})$ is the smoothing function:
(3) $\Phi_k(\mathbf{x})=\begin{cases}1/V_k, & \mathbf{x}\in\Omega_k\\ 0, & \mathbf{x}\notin\Omega_k\end{cases}$
where $V_k$ is the volume of the smoothing domain $\Omega_k$.
Using Green’s divergence theorem, the smoothed strain matrix can be evaluated by a boundary integral over the smoothing domain:
(4) $\bar{\mathbf{B}}_I(\mathbf{x}_k)=\frac{1}{V_k}\int_{\Gamma_k}\mathbf{n}(\mathbf{x})\,N_I(\mathbf{x})\,d\Gamma$
In Eq. (4):
(5) $\bar{b}_{Ih}=\frac{1}{V_k}\int_{\Gamma_k}N_I(\mathbf{x})\,n_h(\mathbf{x})\,d\Gamma,\quad h=x,y,z$
where $\Gamma_k$ is the face boundary of the smoothing domain $\Omega_k$; in Figure 1, it is the six triangular faces of the hexahedron ABCFG.
$\mathbf{n}(\mathbf{x})$ is the matrix of the components $n_x$, $n_y$, $n_z$ of the outward unit normal vector of each face, and its form is:
(6) $\mathbf{n}(\mathbf{x})=\begin{bmatrix}n_x&0&0\\0&n_y&0\\0&0&n_z\\n_y&n_x&0\\0&n_z&n_y\\n_z&0&n_x\end{bmatrix}$
Because the assumed displacement field is linear on each face of the boundary, one Gauss point located at the centroid of each face is sufficient for the boundary integration. Therefore, Eq. (5) can be simplified to:
(7) $\bar{b}_{Ih}=\frac{1}{V_k}\sum_{p=1}^{N}N_I(\mathbf{x}_p^G)\,n_{ph}\,A_p$
where $N$ is the number of boundary faces $\Gamma_p$ of the smoothing domain; for an interior-face smoothing domain, $N=6$, and for an exterior-face smoothing domain, $N=4$. $\mathbf{x}_p^G$ is the Gauss point (centroid) of $\Gamma_p$, $A_p$ is the area of $\Gamma_p$, and $n_{ph}$ is the component of the outward unit normal vector of $\Gamma_p$.
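For illustration, the one-point boundary rule of Eq. (7) can be sketched in a few lines of Julia. The function names and data layout below are ours (not part of juSFEM): each boundary face is given by its three vertex coordinates, ordered so that the cross product points outward, and the shape-function values at each face centroid are supplied by the caller.

```julia
using LinearAlgebra

# Area-weighted outward normal A_p * n_p of a triangular face with vertices a, b, c
# (vertices ordered so that the cross product points outward).
face_area_normal(a, b, c) = cross(b - a, c - a) / 2

# One-point (centroid) boundary integration of Eq. (7):
# bbar[I, h] = (1/V) * sum_p N_I(x_p^G) * n_ph * A_p over the boundary faces.
# `faces` is a vector of vertex triples; `shapevals[p][I]` is the shape-function
# value of node I at the centroid of face p; V is the smoothing-domain volume.
function smoothed_gradients(faces, shapevals, V)
    nnodes = length(first(shapevals))
    bbar = zeros(nnodes, 3)
    for (p, (a, b, c)) in enumerate(faces)
        An = face_area_normal(a, b, c)        # A_p * n_p in one vector
        for I in 1:nnodes
            bbar[I, :] .+= shapevals[p][I] .* An ./ V
        end
    end
    bbar
end
```

A convenient sanity check follows from the divergence theorem: for a constant field (all shape-function values equal to 1), the weighted normals over a closed boundary sum to zero, so the computed gradients must vanish.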
The values of the shape functions at the Gauss points (centroid points) on the boundary of the smoothing domains are listed in Table 1.
Site  Node A  Node B  Node C  Node D  Node E  Description 
A  1  0  0  0  0  Field node 
B  0  1  0  0  0  Field node 
C  0  0  1  0  0  Field node 
D  0  0  0  1  0  Field node 
E  0  0  0  0  1  Field node 
F  1/4  1/4  1/4  1/4  0  Centroid of element 
G  1/4  1/4  1/4  0  1/4  Centroid of element 
g1  5/12  1/12  5/12  1/12  0  Gauss point 
g2  5/12  1/12  5/12  0  1/12  Gauss point 
g3  1/12  5/12  5/12  0  1/12  Gauss point 
g4  1/12  5/12  5/12  1/12  0  Gauss point 
g5  5/12  5/12  1/12  1/12  0  Gauss point 
g6  5/12  5/12  1/12  0  1/12  Gauss point 
STEP 4. Establish the system equation
The FSFEM uses the smoothed Galerkin weak form with the assumed displacement field and the smoothed strain field to establish the discrete system of linear algebraic equations, instead of the standard Galerkin weak form used in the FEM. In this process, the FSFEM requires only a simple summation over all of the smoothing domains. When using the T4 element, the linear system of equations is as follows:
(8) $\bar{\mathbf{K}}\,\mathbf{d}=\mathbf{f}$
where $\bar{\mathbf{K}}$ is the smoothed stiffness matrix whose entries are given by:
(9) $\bar{\mathbf{K}}_{IJ}=\sum_{k=1}^{N_s}\bar{\mathbf{B}}_I^{\mathrm{T}}\,\mathbf{D}\,\bar{\mathbf{B}}_J\,V_k$

where $\mathbf{D}$ is the elasticity matrix and $N_s$ is the number of smoothing domains.
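The per-smoothing-domain contribution of Eq. (9) can be sketched as follows. This is a minimal illustration, not juSFEM's implementation: the function names are ours, the isotropic elasticity matrix assumes a hypothetical modulus E and Poisson ratio ν in Voigt notation, and the smoothed gradients are supplied as a small matrix with one row per node.

```julia
using LinearAlgebra

# 6x6 isotropic elasticity matrix D in Voigt notation (E, ν are illustrative inputs).
function elasticity_matrix(E, ν)
    λ = E * ν / ((1 + ν) * (1 - 2ν))   # Lamé's first parameter
    μ = E / (2 * (1 + ν))              # shear modulus
    [fill(λ, 3, 3) + 2μ * I  zeros(3, 3); zeros(3, 3)  Matrix(μ * I, 3, 3)]
end

# 6x3 strain-displacement block for one node from its smoothed gradients.
Bblock(bx, by, bz) = [bx 0 0; 0 by 0; 0 0 bz; by bx 0; 0 bz by; bz 0 bx]

# Smoothing-domain stiffness contribution: K = B' * D * B * V, where B stacks
# the blocks of all nodes of the domain (bvals is n_nodes x 3).
function domain_stiffness(bvals::Matrix{Float64}, D, V)
    B = reduce(hcat, (Bblock(bvals[i, :]...) for i in 1:size(bvals, 1)))
    V .* (B' * D * B)
end
```

For a four-node (exterior-face) smoothing domain this produces a symmetric 12×12 block, which is then scattered into the global matrix by node indices.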
STEP 5. Impose essential boundary conditions
In the SFEM, the essential boundary conditions can be satisfied when the displacement field is assumed, because the shape functions constructed for creating the displacement field possess the Kronecker delta property. This practical approach of treating the essential boundary conditions is exactly the same as in the FEM: essentially by removing (or modifying) the corresponding rows and columns of the stiffness matrix.
The overall calculation process of the FSFEM is illustrated in Figure 2.
2.2 Julia Language
The Julia language is a fast, easy-to-use, and open-source scripting language released under the MIT license julia . It was originally designed to meet the needs of high-performance numerical analysis and computational science. Fortran was the first programming language to formally adopt high-level programming concepts, with the goal of writing high-level, general-purpose formulas that could be automatically converted into efficient code. However, the complexity of Fortran not only demands strong programming skills, but also makes the written code less readable. In recent years, dynamic languages such as Python and MATLAB have gradually become popular pine2019introduction , but their computational efficiency is limited. Julia offers a new approach to numerical computing: it is a dynamic language with built-in support for parallel and distributed computing. Julia’s performance is close to that of C/C++ and other statically compiled languages, making it ideal for writing efficient SFEM programs; see Figure 3. A package written in Julia is more readable and easier for users to extend and improve.
The Julia language has been used in many practical engineering problems. Frondelius et al. frondelius2017juliafem proposed an FEM framework in the Julia language, which allows the use of simple programming models for distributed processing of large finite element models across computer clusters. Otter et al. otter2019thermodynamic described a prototype that demonstrates how thermodynamic property and thermofluid pipe component modeling can be enhanced with the Julia language. To implement large-scale image processing algorithms, Lage-Freitas et al. 7729158 proposed a solution that transparently handles the underlying distributed computing resources by exploiting Julia DistributedArrays, providing a numerical programming interface to process huge data sets.
There are three levels of parallel computing in Julia: (1) Julia coroutines (green threading), (2) multi-threading (experimental), and (3) multi-core or distributed processing.
Julia coroutines are suitable for small tasks. Multi-threading in Julia is currently an experimental interface and may cause unknown errors. To achieve program stability and high efficiency, we choose multi-core processing to realize the parallelism of the SFEM algorithm.
Nowadays, multi-core computing is ubiquitous in ordinary computers, and according to Moore’s law, we expect the number of cores per chip to increase as the number of transistors increases chai2007understanding . As shown in Figure 4, multi-core computing slices program tasks and distributes them to each process for computing. In Julia, the use of “SharedArrays” allows different processes to access common data.
The functions “addprocs”, “rmprocs”, “workers”, and others are available as a programmatic means of adding, removing, and querying the processes. In a parallel loop, the iterations do not occur in a specified order, and writes to variables or arrays are not globally visible because iterations run on different processes. Any variables used inside the parallel loop are copied and broadcast to each process. “SharedArrays” can be used when multiple processes need to operate on the same array at the same time.
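The multi-process pattern described above can be sketched as follows; this is a minimal illustration of the standard library, not juSFEM code, and the array name and sizes are arbitrary.

```julia
using Distributed
addprocs(2)                      # add two worker processes
@everywhere using SharedArrays   # make SharedArrays available on all processes

n = 1_000
a = SharedArray{Float64}(n)      # one array mapped into every process's memory
# iterations are split across the workers and run in no particular order;
# each process writes its own slice of the shared array
@sync @distributed for i in 1:n
    a[i] = sqrt(i)
end
```

Without `SharedArray`, each worker would write into its own copy of `a` and the results would be lost; with it, the master process sees all writes once `@sync` returns.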
3 The Developed Package juSFEM
3.1 Overview
In this paper, we design and implement an open-source package of parallel SFEM for elastic problems by utilizing the Julia language. We term our package juSFEM. To the best of the authors’ knowledge, juSFEM is the first package of parallel SFEM developed with the Julia language; it is publicly available at https://github.com/Elliothuo/juSFEM.
The package juSFEM is composed of three components: preprocessing, solver, and postprocessing; see the structure of juSFEM in Figure 5.
(1) Preprocessing: the background mesh is generated using mature mesh generation algorithms, and a new form for indexing mesh topological information is proposed to prepare for the building of the smoothing domains.
(2) Solver: the global stiffness matrix is assembled according to the SFEM principle; the boundary conditions are imposed, and the system of linear equations is solved to obtain nodal displacements.
(3) Postprocessing: the numerical results are plotted interactively.
3.2 Preprocessing
We use TetGen si2015tetgen to generate a high-quality tetrahedral mesh. The tetrahedral mesh is stored in “node”, “face”, and “ele” files. On this basis, and according to the computational characteristics of the FSFEM and the features of parallel computing, we further process the mesh and store the subdivision information in five preprocessing matrices.
First, the “node” file is read into the matrix “Node” in the Julia program, and “ele” is stored in the matrix “Element”. The number of rows of the “Node” matrix is the number of nodes; each row has three entries, i.e., the x, y, and z coordinates. The number of rows of the “Element” matrix is the number of all tetrahedral elements in the mesh; its four columns are the four points that make up each tetrahedron. A new matrix, “Centroid”, is created with the same number of rows as “Element” and three columns. From the four points in each row of “Element”, the centroid of the tetrahedron is computed, and its x, y, and z coordinates are stored in the corresponding row of “Centroid”; see Figure 6.
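The construction of the “Centroid” matrix amounts to averaging the four vertex coordinates of each element. A minimal sketch (the function name is ours; the matrix names follow the text):

```julia
# Build the "Centroid" matrix: row e holds the centroid of tetrahedral element e,
# i.e. the average of the coordinates of its four vertices.
function build_centroid(Node::Matrix{Float64}, Element::Matrix{Int})
    nele = size(Element, 1)
    Centroid = zeros(nele, 3)
    for e in 1:nele
        for v in 1:4
            Centroid[e, :] .+= Node[Element[e, v], :]
        end
        Centroid[e, :] ./= 4
    end
    Centroid
end
```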
In the FSFEM, the smoothing domains are built based on faces. The number of rows in the “face” file is the number of all faces in the mesh model. The first three columns of each row are the indices of the points composing the triangular face. As shown in Figure 7, if the fourth column is “1”, the triangular face is internal; if it is “0”, the triangular face is external. The fourth column is computed by TetGen and is used to determine the property of the current triangular face. We divide all the faces of the model into two parts based on the values of the fourth column: the external faces are stored in the “face_out” matrix, and the internal faces are stored in the “face_in” matrix, both of which have three columns. Next, we create two new matrices named “indexout” and “indexin”. The matrix “indexout” has the same number of rows as “face_out” with 2 columns, and the matrix “indexin” has the same number of rows as “face_in” with 4 columns. By looping over the “face_out” and “face_in” matrices, we can determine which tetrahedra each face belongs to.
For the matrix “indexout”, the sequence number of the tetrahedron is stored in the first column, and the remaining point of the tetrahedron is stored in the second column. Because each “face_in” face belongs to two tetrahedra, the first two columns of “indexin” are the IDs of the two tetrahedra, and the last two columns are the IDs of the remaining points of these tetrahedra. We then use Julia’s concatenation functions to merge (1) “face_out” and “indexout” and (2) “face_in” and “indexin”. Finally, we obtain the matrix “face_out” with five columns and “face_in” with seven columns. The above operations are illustrated in Figure 7. Note that in the mesh generation package TetGen, the index starts at “0”, while in Julia it starts at “1”.
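The core of this lookup is a set-membership test: a tetrahedron owns a face if its four vertices contain the face's three vertices, and the fourth vertex is the "remaining point". The following sketch (function name ours, brute-force search for clarity rather than juSFEM's indexed scheme) returns one (element ID, remaining node) pair per owning tetrahedron, so interior faces yield two pairs and exterior faces yield one:

```julia
# For one triangular face (3 node IDs), find every tetrahedron that contains it
# together with the tetrahedron's remaining vertex.
function owners(face::Vector{Int}, Element::Matrix{Int})
    result = Tuple{Int,Int}[]            # (element ID, remaining node ID)
    fs = Set(face)
    for e in 1:size(Element, 1)
        elem = Set(Element[e, :])
        if issubset(fs, elem)
            rest = first(setdiff(elem, fs))   # the vertex not on the face
            push!(result, (e, rest))
        end
    end
    result
end
```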
To efficiently process the “face” file, we utilize the parallelism on the multi-core CPU in Julia. We choose “SharedArrays” to deal with data interference among multiple processes. “SharedArrays” use shared system memory to map the same array across many processes, and their indexing (assignment and routines) works just as with regular arrays. We set “indexin” and “indexout” to the “SharedArrays” type, which enables different processes to operate at the same time document . An example for the external faces is listed in Algorithm 1.
After preprocessing, we obtain five matrices: “Centroid”, “Face_in”, “Face_out”, “Element”, and “Node”.
3.3 Solver
The process of the solver is divided into two stages: (1) the assembly of the stiffness matrix and (2) the solution of the linear equations, both of which utilize parallelism to improve the efficiency. In the above two stages, there are three critical issues that need to be carefully addressed.
(1) Large-scale computation often leads to insufficient memory. Thus, the global stiffness matrix commonly needs to be stored in a compressed format to reduce memory requirements. The conventional compression formats include COO, CSC, and CSR jain2002parallel . Multidimensional arrays in Julia are stored in column-major order, which means that arrays are stacked one column at a time. Therefore, we choose the COO format to assemble the global stiffness matrix, and we transform it into the CSC format to improve the speed when solving the linear equations.
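The COO-to-CSC conversion is a one-liner in Julia's standard library; the triplet values below are arbitrary illustrations. A useful property for assembly is that `sparse` sums duplicate (row, column) pairs, which is exactly the accumulation a stiffness-matrix assembly needs:

```julia
using SparseArrays

# COO triplets: row indices, column indices, and values.
# The duplicated (1,1) entry is accumulated, mimicking stiffness assembly.
IK = [1, 1, 2, 3, 3]
JK = [1, 1, 2, 3, 1]
VK = [2.0, 3.0, 4.0, 1.0, 5.0]
K = sparse(IK, JK, VK)   # returns a SparseMatrixCSC (column-major CSC storage)
```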
(2) In parallel computing, one of the critical problems is the data interference among multiple processes. For example, iterations do not occur in a specified order, and writes to variables or arrays are not globally visible because iterations run on different processes; any variables used inside the parallel loop are copied and broadcast to each process. Therefore, we set the global stiffness matrix arrays to the “SharedArrays” type and determine the position and corresponding value for filling the stiffness matrix according to the index of the loop, so that each process can assign to the stiffness matrix concurrently.
(3) After obtaining the linear equations, the system needs to be solved efficiently to obtain the nodal displacements. We invoke PARDISO, a parallel direct sparse solver, from the Intel® Math Kernel Library (MKL) mkl to solve the equations.
The process of the solver in juSFEM is illustrated in Figure 8. After the preprocessing, the global stiffness matrix is first assembled. The assembly of the stiffness matrix is divided into two steps: (1) the assembly for the external faces and (2) the assembly for the internal faces. For the external faces, each smoothing domain is a tetrahedron composed of four points, and each point has three degrees of freedom; thus, the element stiffness matrix for each external face is a 12×12 matrix. Similarly, each internal-face smoothing domain involves five points, so the element stiffness matrix for each internal face is a 15×15 matrix. The detailed process of the solver in juSFEM is introduced as follows.
STEP 1. The calculation of the element stiffness matrices for the internal faces and the external faces is basically the same. First, the volumes of the two tetrahedra to which an interior face belongs are calculated. This step is implemented in the file “volume.jl”:
(10) $V=\frac{1}{6}\left|\det\begin{bmatrix}x_2-x_1&y_2-y_1&z_2-z_1\\x_3-x_1&y_3-y_1&z_3-z_1\\x_4-x_1&y_4-y_1&z_4-z_1\end{bmatrix}\right|$

where $(x_i,\,y_i,\,z_i)$, $i=1,\dots,4$, are the coordinates of point $i$ of the tetrahedron.
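A minimal sketch of the volume formula of Eq. (10) in Julia (the function name is ours, not necessarily the one used in “volume.jl”):

```julia
using LinearAlgebra

# Volume of a tetrahedron from its four vertex coordinate vectors (Eq. 10):
# one sixth of the absolute determinant of the three edge vectors from p1.
tet_volume(p1, p2, p3, p4) = abs(det(hcat(p2 - p1, p3 - p1, p4 - p1))) / 6
```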
STEP 2. The area of each triangular face is calculated by Eq. (11). For an internal-face smoothing domain, there are six such faces. This step is implemented in the file “area.jl”.
(11) $A=\frac{1}{2}\left\|\overrightarrow{AB}\times\overrightarrow{AC}\right\|$

where $\overrightarrow{AB}$ and $\overrightarrow{AC}$ are the vectors of two edges of the triangular face.
STEP 3. The unit normal vectors of the faces are calculated. As shown in Figure 9, when calculating the normal vector of a face, we take the vectors of any two edges of the triangular face and calculate the normal vector of the face through their cross product. After that, the element stiffness matrix can be assembled directly according to the principle of the FSFEM. This step is implemented in the file “vectorin.jl”.
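Steps 2 and 3 can be sketched together, since the same cross product yields both the area of Eq. (11) and the unit normal (the function name is ours, not necessarily the one used in “area.jl” or “vectorin.jl”):

```julia
using LinearAlgebra

# Area (Eq. 11) and unit normal of a triangular face with vertices a, b, c:
# the cross product of two edge vectors has norm 2A and points along the normal.
function tri_area_normal(a, b, c)
    v = cross(b - a, c - a)
    norm(v) / 2, v / norm(v)
end
```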
STEP 4. According to the method by which a sparse matrix is constructed in Julia, we need three one-dimensional arrays: the array IK stores, row by row, the row index of each value in the global stiffness matrix; the array JK stores the corresponding column index; and the array VK stores the value itself.
The sparse function automatically accumulates the values at the same location; thus, the sizes of IK, JK, and VK can be determined in advance:
(12) $\mathrm{length}(IK)=\mathrm{length}(JK)=\mathrm{length}(VK)=144\,N_{fo}+225\,N_{fi}$

where $N_{fo}$ is the number of external faces (each contributing a 12×12 element stiffness matrix, i.e., 144 entries) and $N_{fi}$ is the number of internal faces (each contributing a 15×15 matrix, i.e., 225 entries).
Moreover, the arrays IK, JK, and VK are converted to “SharedArrays” in advance. Because the calculation of the stiffness matrix of each element is not data-dependent, no data interference occurs during the parallel calculation in each process.
As mentioned above, we calculate the volume, area, and normal vectors of the smoothing domains where each inner face is located, and the calculation for each inner face is independent. Therefore, the above calculations are suitable to be parallelized on the multi-core CPU, and the calculated results can also be saved into the three arrays IK, JK, and VK in parallel.
The process of assembling the stiffness matrix for the external faces is listed in Algorithm 2.
STEP 5. After performing similar operations on the external faces, IK, JK, and VK are completed. The “sparse()” function is used to construct the stiffness matrix in the CSC format, the penalty method is used to impose the boundary conditions, and the nodal loads are calculated. After that, the MKL is configured, and the Pardiso.jl pardiso package is invoked to solve the system of linear equations to obtain first the nodal displacements and then the strain and stress.
3.4 Postprocessing
In this subsection, we will introduce how to plot the numerical results calculated using the developed package juSFEM.
3.4.1 Visualization in ParaView
After running the solver, the nodal displacements are obtained by solving the system of linear equations, and the strain and stress are then calculated from the nodal displacements. To plot these numerically calculated results, we employ the open-source software ParaView paraview for visualization. First, the background mesh and the associated displacements, strain, and stress are converted into a “vtu” format file using Julia’s open-source library “WriteVTK.jl” vtk . Then, the results stored in the “vtu” file are imported into ParaView for visualization (Figure 10).
3.4.2 Visualization in PlotlyJS
We also use “PlotlyJS.jl” plotlyjs to plot the displacement of the example in an interactive manner. In Julia, we first input the coordinates of each node in the mesh model, the details of each triangular face and the change of displacement of each node, and then plot the numerically calculated results interactively; see Figure 11.
4 Validation and Evaluation of juSFEM
To evaluate juSFEM’s correctness and efficiency, two groups of experiments are conducted on a workstation computer. The specifications of the workstation are listed in Table 2.
Specifications  Details 
OS  Windows 10 Professional 
CPU  Intel Xeon 5118 
CPU Frequency (GHz)  2.30 
CPU Cores  24 
CPU RAM (GB)  128 
Julia Version  Julia 1.1.1 
4.1 Verification of the Accuracy of juSFEM
To verify the correctness of juSFEM, a 3D elastic cantilever beam model is analyzed to compare the computational accuracy. In this example, the size of the cantilever beam is ; see Figure 12. The upper face of the cantilever beam is subjected to a uniform pressure in the direction of . The elastic modulus and Poisson’s ratio . The model consists of 458 nodes and 1,576 tetrahedral elements.
To verify the calculation accuracy, the displacements of the 3D elastic cantilever beam model are calculated and compared in the following three scenarios.
(1) The displacements and stress are calculated using juSFEM with the mesh model consisting of 458 nodes and 1,576 tetrahedral elements (T4 elements); see Figure 13 and Figure 14.
(2) The displacements are calculated using the conventional FEM with the mesh model consisting of 458 nodes and 1,576 tetrahedral elements (T4 elements).
(3) According to reference liu2016smoothed , the commercial software ABAQUS is used to calculate the displacements with a very fine mesh consisting of 45,779 nodes and 30,998 10-node tetrahedral elements (T10 elements).
As shown in Figure 15, the displacements calculated using juSFEM are more accurate than those of the FEM-T4, and slightly less accurate than those of the FEM-T10. Thus, the correctness and accuracy of juSFEM can be considered verified.
4.2 Evaluation of the Efficiency of juSFEM
The developed package juSFEM can be executed (1) in serial on a single CPU core or (2) in parallel on multiple CPU cores. The efficiency of juSFEM in these two scenarios is recorded and compared. Moreover, to better evaluate the computational efficiency of juSFEM, four mesh models are generated for the same cantilever beam illustrated in Figure 12; see the details of the meshes in Table 3. For each mesh model, both the serial and the parallel calculation times are recorded and compared.
Mesh model (T4)  Number of Nodes  Number of Elements 
1  94,066  503,963 
2  191,132  1,037,691 
3  275,655  1,498,126 
4  382,375  2,076,930 
The calculation process is divided into (1) the assembly of the global stiffness matrix and (2) the solution of the system of linear equations. Because the solution is performed by invoking the library PARDISO, its computational efficiency is not discussed; only the efficiency of assembling the global stiffness matrix is evaluated here.
The calculation time for assembling the global stiffness matrix for each mesh model is compared in Figure 16. The results show that, for the mesh model consisting of about two million elements, the assembly requires approximately 208 s on a single core but only approximately 10 s on multiple cores. The speedup reaches about 20 on the 24-core CPU.
To show the calculation efficiency of juSFEM, we also compared the times required by commercial software and juSFEM to calculate the same model; see the benchmark tests in Figure 17. In the comparisons, we recorded the total time of assembling the stiffness matrix and solving the equations. The results show that juSFEM is approximately 1.8 times faster than ABAQUS.
5 Discussion
5.1 Comprehensive Evaluation of juSFEM
For a numerical modeling software or package, the correctness of the calculation results should be guaranteed first. Moreover, the computational efficiency should be satisfactory, especially when solving large-scale problems. In this paper, we design and implement the open-source package of parallel SFEM, juSFEM. Here we present a comprehensive evaluation of juSFEM by analyzing its correctness and efficiency.
5.1.1 Computational Correctness
For the cantilever beam example presented above, a comparison of the calculation errors of the FEM and juSFEM is listed in Table 4.
Position  FEM-T4  juSFEM-T4  FEM-T10 
0.2 m  8.20E-04 m  8.66E-04 m  9.54E-04 m 
0.4 m  2.61E-03 m  2.82E-03 m  3.08E-03 m 
0.6 m  4.99E-03 m  5.41E-03 m  5.86E-03 m 
0.8 m  7.62E-03 m  8.28E-03 m  8.92E-03 m 
1.0 m  1.03E-02 m  1.12E-02 m  1.20E-02 m 
When the numerical results calculated by the FEM-T10 are used as the baseline, the displacement error of juSFEM at the maximum deflection of the cantilever beam is 6.6%, while that of the FEM-T4 is 14.2%. Table 4 shows that the displacements of the FEM-T4 are the smallest, while the displacement solution of the juSFEM-T4 is more accurate. This is because, for the same mesh, the stiffness matrix of the FEM is overly rigid, so only a lower bound of the displacement solution can be obtained. The SFEM softens the stiffness matrix through the smoothing operation between elements, making the displacement solution closer to the baseline.
5.1.2 Computational Efficiency
In the second verification example, we evaluated the computational efficiency of juSFEM by comparing the serial and parallel calculation times for four mesh models. The speedups of the parallel version over the corresponding serial version are illustrated in Figure 18.
As shown in Figure 18, for the mesh model composed of approximately 0.5 million elements, the speedup is 17.51 on a 24-core CPU, which is not close to the theoretical speedup of 24. This is because additional time is required for task allocation in the multi-core parallelism. When the amount of computation is not large, the total computing time is short, and the time consumed in task allocation accounts for a larger proportion of it. As the number of elements increases, the time required to allocate tasks increases slightly, but its share of the total computing time shrinks. For the mesh model composed of approximately 2 million elements, the efficiency of parallel computing on the multi-core CPU improves significantly, offsetting the time required for task allocation; in this case, the speedup reaches 20.79 and shows an upward trend.
Another strategy to improve the computational efficiency is to reduce the memory operations by compressing the global stiffness matrix. In juSFEM, the COO format is used to assemble the global stiffness matrix; see the effect of compression of the global stiffness matrix in Table 5.
Table 5: Effect of compressing the global stiffness matrix using the COO format

Dataset  Number of nodes  Full matrix (GB)  COO format (GB)  Compression ratio (%)
1        94,066           593.33            4.21             0.71
2        191,132          2449.62           8.27             0.34
3        275,655          5095.24           11.77            0.23
4        382,375          9804.19           15.48            0.16
For a large-scale mesh model, there are many zero elements in the global stiffness matrix, and storing it as a full matrix is extremely memory-consuming. When the number of nodes reaches 382,375, storing the full global stiffness matrix would take nearly 10 TB of memory; given realistic hardware specifications, computing such a large-scale mesh model in full storage is infeasible. juSFEM adopts the COO compression format and stores only the nonzero elements of the global stiffness matrix, which significantly reduces memory usage. After the stiffness matrix is assembled in the COO format, it is converted to the CSC format to further reduce the memory requirements, and is then used to generate the system of linear equations. Solving the system of linear equations becomes more efficient when the coefficient (global stiffness) matrix is stored in the CSC format.
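Although juSFEM itself is written in Julia, the COO-assembly-then-CSC-conversion workflow can be illustrated with a minimal Python/SciPy sketch (the names and values are illustrative only, not the actual juSFEM data):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy "assembly": triplets (row, col, value); the duplicated (1, 1) entry mimics
# two elements contributing to the same degree of freedom.
rows = np.array([0, 1, 1, 2, 1])
cols = np.array([0, 1, 2, 2, 1])
vals = np.array([4.0, 2.0, -1.0, 3.0, 2.0])

K_coo = coo_matrix((vals, (rows, cols)), shape=(3, 3))
K_csc = K_coo.tocsc()  # duplicates are summed; CSC is ready for a sparse direct solver

print(K_csc[1, 1])  # 4.0, i.e. 2.0 + 2.0 summed during conversion
print(K_csc.nnz)    # 4 stored nonzeros instead of 9 dense entries
```

The key property exploited during assembly is that duplicate (i, j) triplets, i.e., contributions from different elements to the same degree of freedom, are summed automatically in the conversion, so each thread can append its element contributions without coordinating with the others.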
As mentioned above, the computation time of juSFEM is divided into two parts: (1) assembling the global stiffness matrix and (2) solving the equations. To analyze the performance of juSFEM, a mesh model composed of two million tetrahedral elements was selected, and we recorded the proportion of the total computing time spent in each part for the parallel and serial cases, respectively.
As shown in Figure 19, when the parallel juSFEM is used for the mesh model composed of approximately 2 million tetrahedral elements, assembling the global stiffness matrix takes only 1.85% of the total calculation time, and the rest is spent on solving the equations. Thus, the computationally most expensive stage in juSFEM is solving the system of equations. Libraries such as BLAS, Sparse BLAS, LAPACK, and the Intel MKL burylov2007intel are commonly used to solve such equations; moreover, reference schenk2004solving shows that PARDISO is particularly fast at solving them.
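As a generic stand-in for PARDISO (which juSFEM calls through Pardiso.jl), the sketch below solves a small sparse system with SciPy's direct solver; like PARDISO, it expects the coefficient matrix in a compressed format, which is why the conversion from COO matters before this stage:

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import spsolve

# Small symmetric positive definite system standing in for K u = f.
K = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 0.0],
                         [0.0, 0.0, 2.0]]))
f = np.array([1.0, 2.0, 2.0])

u = spsolve(K, f)  # direct sparse solve on the CSC coefficient matrix
print(np.allclose(K @ u, f))  # True: the residual is (numerically) zero
```

This is only a sketch of the solve stage; a production solver such as PARDISO additionally exploits symmetry, multi-threaded factorization, and fill-reducing orderings.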
In summary, juSFEM realizes a multi-core parallel, high-performance SFEM for elastic problems by redesigning the parallel algorithm and exploiting the advantages of the Julia language. Although juSFEM can make full use of a multi-core processor, solving the linear equations of the large sparse matrix still takes most of the time. Moreover, the number of cores in a multi-core computer also limits the achievable speedup of juSFEM.
5.2 Comparison with Other SFEM Programs
At present, there are very few studies on SFEM programs, and most of them use programming languages such as C/C++ or Fortran; however, achieving efficient computing with these languages demands considerable programming skill. In recent years, MATLAB and Python have become popular. Although their code is more readable, it is in general quite computationally expensive nickolls2008scalable .
juSFEM has a clear structure: all routines are constructed in a modular way, and each calculation step is highly customizable, which makes the code both efficient and concise. Meanwhile, juSFEM achieves competitive computational efficiency.
In reference li2016automatic , a set of connectivity lists was developed for the SFEM. These connectivity lists are created simultaneously, which reduces computation and time consumption. In addition, the centroid of the mesh is selected as a uniform index to avoid duplicate storage.
In juSFEM, we adopt the FS-FEM, and the face is used as the uniform index in the program. However, we redesigned the method of storing the mesh details, which not only avoids repeated storage but also takes parallel computation into account. juSFEM stores the global stiffness matrix directly in a compressed format, which saves a large amount of memory and facilitates large-scale mesh computing.
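The idea of using faces as a uniform, duplicate-free index can be sketched as follows (a minimal Python illustration, not the actual juSFEM data structure): each triangular face is keyed by its sorted node triple, so a face shared by two tetrahedra is stored exactly once.

```python
from itertools import combinations

def build_face_index(tets):
    """Key each triangular face by its sorted node triple so that a face shared
    by two tetrahedra is stored exactly once; the value lists the adjacent tets."""
    faces = {}
    for e, tet in enumerate(tets):
        for tri in combinations(sorted(tet), 3):
            faces.setdefault(tri, []).append(e)
    return faces

# Two tetrahedra sharing the face (1, 2, 3): 4 + 4 faces in total, 7 unique.
tets = [(0, 1, 2, 3), (1, 2, 3, 4)]
faces = build_face_index(tets)
print(len(faces))        # 7
print(faces[(1, 2, 3)])  # [0, 1]: the interior face shared by both tets
```

Because each face record also identifies its adjacent elements, the face-based smoothing domains can be processed independently, which is what makes the assembly stage parallelizable.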
As discussed above, juSFEM is computationally accurate and efficient. Under the redesigned calculation framework, the power of parallelism is well utilized. Meanwhile, each calculation step in juSFEM is packaged as a function with concise syntax, which makes further improvement convenient.
5.3 Outlook and Future Work
Currently, the SFEM is widely used in materials science and mechanics. We hope that juSFEM can also be commonly used in geomechanics to provide fast and reliable numerical modeling results, for example, to analyze slope deformation and failure 8895751 .
We also plan to extend juSFEM to analyze plasticity problems, in which the equations need to be solved iteratively. For large-scale computational models, the solution time of the equations becomes particularly important nyssen1981efficient . We plan to adopt a more efficient parallel method to accelerate the solution of the equations on the GPU garland2008parallel ; ge2013multi .
In juSFEM, we currently use two methods to plot the results in postprocessing. ParaView needs to import a mesh file rewritten by juSFEM, which is tedious; PlotlyJS can only plot the displacements of the model, and thus needs to be improved in the future. We hope that the postprocessing in juSFEM can eventually be conducted in a convenient and efficient way.
Liu G.R. et al. liu2016smoothed proposed the SFEM by combining the gradient smoothing technique with the FEM, and its computational results are closer to the analytical solution. However, the SFEM suffers from element incongruity in higher-order interpolation. To further improve the accuracy of interpolation, Liu G.R. and Zhang gui2013smoothed proposed the smoothed point interpolation method (SPIM), based on the G space theory and the weakened-weak (W2) form liu2010ag , by developing the generalized gradient smoothing technique. The SPIM can reduce the error by optimizing the interpolation form. We also hope to use the Julia language to implement a parallel SPIM as an extension of juSFEM.
6 Conclusion
In this paper, we have designed and implemented an open-source package of the parallel SFEM for elastic problems by utilizing the Julia language on a multi-core CPU. We specifically designed a new strategy for computing the element stiffness matrices, which is well suited to parallelization on the multi-core CPU. To verify the correctness and evaluate the efficiency of juSFEM, two groups of benchmark tests were conducted. We found that (1) juSFEM required only 543 seconds to calculate the displacements of a 3D elastic cantilever beam model composed of approximately 2 million tetrahedral elements, whereas the commercial FEM software required 930 seconds for the same model; (2) the parallel juSFEM executed on the 24-core CPU is approximately 20x faster than the corresponding serial version; (3) juSFEM employs the COO compression format to store the global stiffness matrix, and the compression ratio reaches 0.16% for the mesh model composed of approximately 2 million tetrahedral elements. juSFEM is computationally efficient and achieves accurate numerical results. Moreover, the structure and functions of juSFEM are modularized, and the Julia code is clear and readable, which is convenient for further development.
Acknowledgments
This research was jointly supported by the National Natural Science Foundation of China (Grant Numbers 11602235 and 41772326) and the Fundamental Research Funds for the Central Universities (Grant Numbers 2652018091, 2652018107, and 2652018109). The authors would like to thank the editor and the reviewers for their contributions.
References
 (1) R. Roy, S. Hinduja, R. Teti, Recent advances in engineering design optimisation: Challenges and future trends, CIRP Annals 57 (2) (2008) 697–715.
 (2) F. Xu, D. Schillinger, D. Kamensky, V. Varduhn, G. Wang, M.-C. Hsu, The tetrahedral finite cell method for fluids: Immersogeometric analysis of turbulent flow around complex geometries, Computers & Fluids 141 (2016) 135–154.
 (3) R. J. LeVeque, et al., Finite volume methods for hyperbolic problems, Vol. 31, Cambridge University Press, 2002.
 (4) Z. He, G. Liu, Z. Zhong, X. Cui, G. Zhang, A. Cheng, A coupled edge-/face-based smoothed finite element method for structural–acoustic problems, Applied Acoustics 71 (10) (2010) 955–964.
 (5) T. Wihler, Locking-free adaptive discontinuous Galerkin FEM for linear elasticity problems, Mathematics of Computation 75 (255) (2006) 1087–1102.
 (6) G.R. Liu, N. Trung, Smoothed Finite Element Methods, CRC Press, 2016.
 (7) J. W. Yoo, B. Moran, J. S. Chen, Stabilized conforming nodal integration in the natural-element method, International Journal for Numerical Methods in Engineering 60 (5) (2004) 861–890.
 (8) T. Nguyen-Thoi, G. Liu, K. Lam, G. Zhang, A face-based smoothed finite element method (FS-FEM) for 3D linear and geometrically nonlinear solid mechanics problems using 4-node tetrahedral elements, International Journal for Numerical Methods in Engineering 78 (3) (2009) 324–353.
 (9) Y. Cai, X. Cui, G. Li, W. Liu, A parallel finite element procedure for contact-impact problems using edge-based smoothed triangular element and GPU, Computer Physics Communications 225 (2018) 47–58.
 (10) H. Phan-Dao, H. Nguyen-Xuan, C. Thai-Hoang, T. Nguyen-Thoi, T. Rabczuk, An edge-based smoothed finite element method for analysis of laminated composite plates, International Journal of Computational Methods 10 (01) (2013) 1340005.
 (11) Y. Li, J. Yue, R. Niu, G. Liu, Automatic mesh generation for 3D smoothed finite element method (S-FEM) based on the weaken-weak formulation, Advances in Engineering Software 99 (2016) 111–120.
 (12) M. F. Sanner, et al., Python: a programming language for software integration and development, J Mol Graph Model 17 (1) (1999) 57–61.
 (13) J. Bezanson, A. Edelman, S. Karpinski, V. B. Shah, Julia: A fresh approach to numerical computing, SIAM Review 59 (1) (2017) 65–98.
 (14) O. Zienkiewicz, R. Taylor, J. Too, Reduced integration technique in general analysis of plates and shells, International Journal for Numerical Methods in Engineering 3 (2) (1971) 275–290.
 (15) J. Liu, Z.-Q. Zhang, G. Zhang, A smoothed finite element method (S-FEM) for large-deformation elastoplastic analysis, International Journal of Computational Methods 12 (04) (2015) 1540011.
 (16) L. Chen, G. Liu, N. Nourbakhsh-Nia, K. Zeng, A singular edge-based smoothed finite element method (ES-FEM) for bimaterial interface cracks, Computational Mechanics 45 (2–3) (2010) 109.
 (17) L. Chen, G. Liu, K. Zeng, J. Zhang, A novel variable power singular element in G space with strain smoothing for bi-material fracture analyses, Engineering Analysis with Boundary Elements 35 (12) (2011) 1303–1317.
 (18) G. Liu, T. Nguyen-Thoi, H. Nguyen-Xuan, K. Lam, A node-based smoothed finite element method (NS-FEM) for upper bound solutions to solid mechanics problems, Computers & Structures 87 (1–2) (2009) 14–26.
 (19) W. Li, Y. Chai, M. Lei, G. Liu, Analysis of coupled structural-acoustic problems based on the smoothed finite element method (S-FEM), Engineering Analysis with Boundary Elements 42 (2014) 84–91.
 (20) E. Li, Z. Zhang, Z. He, X. Xu, G. Liu, Q. Li, Smoothed finite element method with exact solutions in heat transfer problems, International Journal of Heat and Mass Transfer 78 (2014) 1219–1231.
 (21) X. Cui, Z. Li, H. Feng, S. Feng, Steady and transient heat transfer analysis using a stable node-based smoothed finite element method, International Journal of Thermal Sciences 110 (2016) 12–25.
 (22) Z. He, G. Liu, Z. Zhong, G. Zhang, A. Cheng, A coupled ES-FEM/BEM method for fluid–structure interaction problems, Engineering Analysis with Boundary Elements 35 (1) (2011) 140–147.
 (23) W. Zeng, G. Liu, Smoothed finite element methods (S-FEM): an overview and recent developments, Archives of Computational Methods in Engineering 25 (2) (2018) 397–435.
 (24) G. Liu, T. Nguyen-Thoi, K. Lam, An edge-based smoothed finite element method (ES-FEM) for static, free and forced vibration analyses of solids, Journal of Sound and Vibration 320 (4–5) (2009) 1100–1130.
 (25) G. Liu, K. Dai, T. T. Nguyen, A smoothed finite element method for mechanics problems, Computational Mechanics 39 (6) (2007) 859–877.
 (26) J. Shewchuk, T. K. Dey, S.W. Cheng, Delaunay Mesh Generation, Chapman and Hall/CRC, 2016.
 (27) The Julia programming language, https://julialang.org/ (2019).
 (28) D. J. Pine, Introduction to Python for Science and Engineering, CRC Press, 2019.
 (29) Julia Benchmarks, https://julialang.org/benchmarks/ (2019).
 (30) T. Frondelius, J. Aho, JuliaFEM - open source solver for both industrial and academia usage, Rakenteiden Mekaniikka 50 (3) (2017) 229–233.
 (31) M. Otter, H. Elmqvist, D. Zimmer, C. Laughman, Thermodynamic property and fluid modeling with modern programming language constructs, in: Proceedings of the 13th International Modelica Conference, Regensburg, Germany, March 4–6, 2019, no. 157, Linköping University Electronic Press, 2019.
 (32) A. Lage-Freitas, A. C. Frery, N. D. C. Oliveira, R. P. Ribeiro, R. Sarmento, CloudArray: Easing huge image processing, in: 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016, pp. 631–634.
 (33) L. Chai, Q. Gao, D. K. Panda, Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system, in: Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid'07), IEEE, 2007, pp. 471–478.
 (34) H. Si, TetGen, a Delaunay-based quality tetrahedral mesh generator, ACM Transactions on Mathematical Software (TOMS) 41 (2) (2015) 11.
 (35) Julia 1.1 Documentation, https://docs.julialang.org/en/v1/manual/parallel-computing/ (2019).

 (36) A. Jain, N. Goharian, On parallel implementation of sparse matrix information retrieval engine, in: Proceedings of the International Multiconferences in Computer Science: on Information and Knowledge Engineering (IKE), 2002.
 (37) Intel® Math Kernel Library, https://software.intel.com/en-us/mkl (2019).
 (38) Pardiso.jl, https://github.com/JuliaSparse/Pardiso.jl (2019).
 (39) ParaView, https://www.paraview.org/ (2019).
 (40) WriteVTK.jl, https://github.com/jipolanco/WriteVTK.jl (2019).
 (41) PlotlyJS.jl, https://github.com/sglyon/PlotlyJS.jl (2019).
 (42) I. Burylov, M. Chuvelev, B. Greer, G. Henry, S. Kuznetsov, B. Sabanin, Intel performance libraries: Multicore-ready software for numeric-intensive computation, Intel Technology Journal 11 (4) (2007).
 (43) O. Schenk, K. Gärtner, Solving unsymmetric sparse systems of linear equations with PARDISO, Future Generation Computer Systems 20 (3) (2004) 475–487.
 (44) J. Nickolls, I. Buck, M. Garland, Scalable parallel programming, in: 2008 IEEE Hot Chips 20 Symposium (HCS), IEEE, 2008, pp. 40–53.
 (45) G. Mei, N. Xu, J. Qin, B. Wang, P. Qi, A survey of Internet of Things (IoT) for geohazard prevention: Applications, technologies, and challenges, IEEE Internet of Things Journal (2019) 1–16. doi:10.1109/JIOT.2019.2952593.
 (46) C. Nyssen, An efficient and accurate iterative method, allowing large incremental steps, to solve elastoplastic problems, in: Computational Methods in Nonlinear Structural and Solid Mechanics, Elsevier, 1981, pp. 63–71.
 (47) M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, V. Volkov, Parallel computing experiences with CUDA, IEEE Micro 28 (4) (2008) 13–27.
 (48) W. Ge, J. Xu, Q. Xiong, X. Wang, F. Chen, L. Wang, C. Hou, M. Xu, J. Li, Multi-scale continuum-particle simulation on CPU–GPU hybrid supercomputer, in: GPU Solutions to Multi-scale Problems in Science and Engineering, Springer, 2013, pp. 143–161.
 (49) G. R. Liu, G. Y. Zhang, Smoothed Point Interpolation Methods: G Space Theory and Weakened Weak Forms, World Scientific, 2013.
 (50) G. Liu, A G space theory and a weakened weak (W2) form for a unified formulation of compatible and incompatible methods: Part II applications to solid mechanics problems, International Journal for Numerical Methods in Engineering 81 (9) (2010) 1127–1156.