1 Introduction
Numerical simulations are of paramount importance across many areas of industrial and research development. These simulations are performed by solving partial differential equations (PDEs), which are represented on a discretized computational domain using finite difference or finite volume methods. These methods provide accurate predictions, but they are computationally very expensive. As a result, researchers in the deep learning community have devised many models to learn the physics behind these engineering problems, using either supervised learning methods that determine the input-to-output mapping (li2020fourier; bhatnagar2019prediction; zhu2018bayesian; guo2016convolutional; ranade2022composable) or unsupervised learning methods that embed physical laws into loss functions to compute PDE solutions (bar2019unsupervised; smith2020eikonet; raissi2019physics; ranade2021discretizationnet). These physics-informed methods provide a unique benefit over most approaches by imposing initial and boundary conditions in the optimization process. Even though these surrogate models perform reasonably well, they suffer from error accumulation, especially in the extrapolation regime. The accumulation of error is worse for transient problems because the deep learning models diverge during inference. Many researchers have tried to solve this problem by proposing different operator networks, such as the Fourier Neural Operator (FNO) li2020fourier, the Multiscale Neural Operator lutjens2022multiscale, and the Koopman operator balakrishnan2021stochastic, but the error accumulation problem for predictions over long time ranges is still prevalent.
This problem is also very common in the field of natural language processing. In models that translate one language to another, error-prone predictions at the beginning lead to a completely different output during sequential rollouts. To tackle this problem, researchers in this field have come up with several approaches. Williams et al. (williams1989learning) proposed an algorithm named Teacher Forcing in their 1989 paper on training recurrent neural networks. This approach suffers from overgeneralization on the training data and performs worse during inference. Later, in 2015, Bengio et al. (bengio2015scheduled) proposed a new training mechanism in their paper, “scheduled sampling for sequence prediction”, also known as Curriculum Learning.
Taking inspiration from natural language processing, we demonstrate the effectiveness of these methods for numerical simulations. The most common way to train transient models is to roll out the whole trajectory using the model’s own predictions right from the beginning of training and to calculate the loss over all the predictions li2020fourier. In this study, we show that Teacher Forcing and Curriculum Learning yield better results than this regular training procedure. Specifically, we show that Curriculum Learning outperforms all other approaches during inference.
2 Method
When training a model for any transient (time-series) problem, we predict a sequence of outputs and calculate the loss between the predictions and the ground truth. Consider the general transient system in Eq. 1, where a neural network $f_\theta$ maps the solutions at time steps $t-n+1, \dots, t$ to the solution at $t+1$.
$$\hat{u}_{t+1} = f_\theta\left(u_{t-n+1}, \dots, u_{t}\right) \tag{1}$$
Here $\theta$ represents the parameters of the model. To predict the next time step, $\hat{u}_{t+2}$, we pass in the previous steps, which now include the model’s own prediction. In Eq. 2, the prediction of $\hat{u}_{t+2}$ takes as input the latest prediction at $t+1$ as well as the remaining $n-1$ historical solutions. This approach is also used by Li et al. in their work on Fourier Neural Operators li2020fourier.
$$\hat{u}_{t+2} = f_\theta\left(u_{t-n+2}, \dots, u_{t}, \hat{u}_{t+1}\right) \tag{2}$$
During training, the ground truth for $u_{t+1}$ is available, so we can use it in place of the prediction, as shown in Eq. 3. As a result, the model always receives the correct sequence of previous time steps during training. This approach of handling time history is known as Teacher Forcing williams1989learning.
$$\hat{u}_{t+2} = f_\theta\left(u_{t-n+2}, \dots, u_{t}, u_{t+1}\right) \tag{3}$$
Models trained with this approach often perform better than those trained with the approach in Eq. 2, because training in the latter is less robust: error accumulation over long time ranges makes the loss noisy. However, a model trained with Teacher Forcing never learns to correct itself, as it never sees its own predictions during training. This results in divergence from the correct behavior for long-time predictions.
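The difference between the two ways of handling time history can be illustrated with a small sketch. It is not the training code used in this work: the states are scalars rather than 2D fields, and `toy_model`, `rollout`, and the `teacher_forcing` flag are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def rollout(model: Model, history: Sequence[float], targets: Sequence[float],
            teacher_forcing: bool) -> List[float]:
    """Predict len(targets) steps starting from an n-step history window.

    teacher_forcing=False feeds each prediction back in (as in Eq. 2);
    teacher_forcing=True feeds the ground truth back in (as in Eq. 3).
    """
    window = list(history)
    predictions = []
    for target in targets:
        prediction = model(window)
        predictions.append(prediction)
        feedback = target if teacher_forcing else prediction
        window = window[1:] + [feedback]  # drop the oldest step, append the newest
    return predictions


def toy_model(window: Sequence[float]) -> float:
    # Hypothetical stand-in for the network: last value plus a slightly biased step.
    return window[-1] + 1.1


truth = [float(i) for i in range(6)]      # ground truth: 0, 1, 2, 3, 4, 5
history, targets = truth[:2], truth[2:]   # n = 2 input steps, 4 steps to predict

free_running = rollout(toy_model, history, targets, teacher_forcing=False)
teacher_forced = rollout(toy_model, history, targets, teacher_forcing=True)

for name, preds in (("free running", free_running), ("teacher forced", teacher_forced)):
    errors = [abs(p - t) for p, t in zip(preds, targets)]
    print(name, [round(err, 2) for err in errors])
```

With this toy model, the free-running error grows at every step because each biased prediction is fed back in, whereas the teacher-forced error stays constant because the window is always reset to the ground truth.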
Curriculum Learning bengio2015scheduled, an advanced variation of the Teacher Forcing algorithm, is a mix of the approaches in Eqs. 2 and 3. In this approach, we randomly decide whether to use the ground truth (target) or the model’s own output (prediction) for future predictions during each training epoch. Fig. 1 explains the curriculum learning approach in more detail. Let e be the ratio of the number of targets used to the length of the sequence. In the initial phase of training (when e=1), the emphasis is on the target, as shown in Fig. 1a, where all future predictions are computed from historical ground truth solutions. In the middle phase of training, e begins to decrease and more predicted solutions are used for future predictions. For example, in Fig. 1b some predictions depend on the prediction from the previous step rather than on the target. Lower values of e correspond to the use of more predicted solutions for future predictions. Finally, as shown in Fig. 1c, during the end phase of training all future predictions are computed from previously predicted solutions. The slow transition from targets to predictions allows for stable training and accurate models. Moreover, the training imitates the actual inference process and hence improves the robustness of the model.
3 Experiments
In this work, we use a publicly available dataset to demonstrate the effectiveness of the Curriculum Learning technique. The dataset contains solutions of the 2D vorticity equation derived from the Navier-Stokes equations for a viscous, incompressible fluid on a unit torus. The viscosity is 1e-3, and the solution is evolved from each initial condition for 50 time steps on a 64 × 64 structured grid. More information about the dataset can be found in the work by Li et al. li2020fourier. The dataset has 5000 samples: 4000 are used for training, 500 for validation, and 500 for testing. Out of the 50 time steps, only the first 40 are used for training. The last 10 time steps are used to check the extrapolation performance.
We use two models, a UNet model ronneberger2015u and the FNO 2D time model li2020fourier. These models are trained with three different schemes:
(i) The whole rollout using model predictions (this is similar to the training mechanics used in li2020fourier)
(ii) Teacher Forcing
(iii) Curriculum Learning
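The three schemes differ only in how many rollout steps feed the target back into the model. The sketch below is one possible reading of that idea, not the implementation used in this work. It assumes that e decays linearly from 1 at the first epoch to 0 at the last, and that the positions of the steps using the target are drawn uniformly at random. The names `linear_ratio` and `sample_target_mask` are hypothetical.

```python
import random
from typing import List


def linear_ratio(epoch: int, num_epochs: int) -> float:
    """Target ratio e: 1.0 at the first epoch, decaying linearly to 0.0 at the last.

    Assumption: the decay spans the whole training run.
    """
    if num_epochs <= 1:
        return 0.0
    return 1.0 - epoch / (num_epochs - 1)


def sample_target_mask(e: float, num_steps: int, rng: random.Random) -> List[bool]:
    """Mark which rollout steps feed back the target (True) or the prediction (False).

    Exactly round(e * num_steps) steps use the target; their positions are random.
    """
    num_targets = round(e * num_steps)
    chosen = set(rng.sample(range(num_steps), num_targets))
    return [step in chosen for step in range(num_steps)]


rng = random.Random(0)
for epoch in (0, 250, 499):
    e = linear_ratio(epoch, num_epochs=500)
    mask = sample_target_mask(e, num_steps=30, rng=rng)
    print(f"epoch {epoch}: e = {e:.2f}, targets used = {sum(mask)} of 30")
```

Under these assumptions, holding e fixed at 1 gives Teacher Forcing, holding it fixed at 0 gives the whole rollout on model predictions, and letting it decay gives Curriculum Learning.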
During training, the model takes the first 10 time steps as input and predicts the next time step. The model is rolled out over all the remaining 30 time steps before the gradient descent step, and the mean squared error loss is calculated over the whole sequence of 30 predicted time steps. At inference, the model is rolled out over all 40 remaining time steps in the same fashion. Each model is trained for 500 epochs with the Adam optimizer. The initial learning rate is 0.001 and is halved every 100 epochs. For the Fourier Neural Operator model, the hyperparameters are the same as those in li2020fourier. All experiments are run on a single NVIDIA Tesla A100 GPU. The decay scheme used for e in Curriculum Learning is linear.
4 Results
Table 1 shows the relative L2 error on the test set, averaged over the 500 test samples. The first technique, the rollout using model predictions, gives satisfactory results. Teacher Forcing improves on it: the error of the FNO model drops by 32.5%, while UNet shows a remarkable improvement of 50%.
| Solution approach | FNO 2D time | UNet |
| --- | --- | --- |
| Rollout using model predictions | 0.046 | 0.082 |
| Teacher Forcing | 0.031 | 0.041 |
| Curriculum Learning | 0.025 | 0.027 |
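The formula behind the relative L2 error is not spelled out in the text. The sketch below assumes the common definition, the L2 norm of the prediction error divided by the L2 norm of the target, computed per sample and then averaged over the test set. The function names are hypothetical, and each sample is treated as a flattened list of values.

```python
import math
from typing import Sequence


def relative_l2(prediction: Sequence[float], target: Sequence[float]) -> float:
    """Return ||prediction - target||_2 / ||target||_2 for one flattened sample."""
    diff = math.sqrt(sum((p - t) ** 2 for p, t in zip(prediction, target)))
    norm = math.sqrt(sum(t ** 2 for t in target))
    return diff / norm


def mean_relative_l2(predictions: Sequence[Sequence[float]],
                     targets: Sequence[Sequence[float]]) -> float:
    """Average the per-sample relative L2 error over a set of samples."""
    errors = [relative_l2(p, t) for p, t in zip(predictions, targets)]
    return sum(errors) / len(errors)


# Two toy samples: a perfect prediction and an all-zero prediction.
toy_targets = [[3.0, 4.0], [3.0, 4.0]]
toy_predictions = [[3.0, 4.0], [0.0, 0.0]]
print(mean_relative_l2(toy_predictions, toy_targets))
```

Normalizing by the target norm keeps samples with large and small solution magnitudes on a comparable scale.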
Teacher Forcing improves the performance of both models, and Curriculum Learning reduces the error further, outperforming all other approaches. The FNO model improves by 52% over its baseline and achieves the best overall performance. UNet improves by 67% over its baseline and therefore benefits the most. With the general training mechanics, the FNO model is inherently better at mapping transient problems than UNet, but with Curriculum Learning the results from UNet and FNO are very close.
Fig. 2 shows the rollout error on the test set. The rollout error clearly improves with Curriculum Learning: it rises at a significantly lower rate than in the corresponding baselines. In the extrapolation region (time steps beyond 40), the models also behave reasonably well, although the error rate increases in this region for both models because the data is unseen.
In Fig. 3, we show snapshots of the output produced by the UNet model for a test sample. The figure shows the targets, the baseline outputs, and the Curriculum Learning outputs at selected time steps. In the figure, circle (A) marks the approximate location in time and space where the baseline prediction starts to deviate from the ground truth. This deviation accumulates, and the prediction diverges from the actual trajectory, as is evident in circle (B). The Curriculum Learning output also deviates somewhat in the beginning, but it corrects itself sufficiently and does not stray far from the target. Note that the last two time steps are from the extrapolation region, which the model never saw during training. This indicates that the model’s performance in the extrapolation region improves significantly with this training mechanism.
5 Conclusion and Future work
This work shows the effectiveness of Curriculum Learning in learning PDEs for better performance during inference. It also shows that techniques from other fields, such as natural language processing, can be beneficial in deep learning for numerical simulations. This simple change in the training mechanics helps generalization and extrapolation, as the model gradually learns to correct itself during training using its own predictions. This is possible because the training objective gradually approaches the inference procedure. We expect that this method can be used for general transient problems to improve the overall performance of any model architecture. Future work includes trying out different decay schemes for Curriculum Learning, such as exponential or inverse sigmoid decay. Investigating the effect of the number of epochs on convergence could be useful for challenging datasets. The relationship between the learning rate decay scheme and the performance improvement obtained with this method also needs to be studied for stable convergence.