1 Introduction
Plasma phenomena contain a wide range of temporal and spatial scales, often exhibiting multiscale characteristics (see Figure 1). In fusion energy plasmas, many disparate diagnostic instruments are simultaneously used in order to capture these various spatiotemporal scales, and to cover the multiple physics present in these plasmas. In addition, fusion experiments are increasingly built to run longer pulses, with a goal of eventually running a reactor continuously. The confluence of these facts leads to large, complex datasets with phenomena manifest over long sequences. A key challenge is enabling scientists/engineers to utilize these long sequence datasets to, for example, automatically catalog events of interest or predict the onset of phenomena.
Many deep learning architectures have been created and successfully applied to sequence learning problems Graves (2013); LeCun (2015); Lipton (2015); Fawaz (2018), in areas such as time-series analysis and natural language processing. However, many of the typical architectures used for learning from sequences (e.g. recurrent neural networks (RNNs) and their most popular variant, Long Short-Term Memory networks (LSTMs)) suffer from memory loss; long-range dependencies in sequences are difficult for these architectures to track Bai (2018).
In this paper we discuss recent advances in neural networks, specifically an architecture that uses dilated convolutions in a deep convolutional neural network (CNN), which was designed to overcome these problems of learning on long sequences. We use this architecture to predict oncoming disruptions in fusion plasma discharges of the DIII-D tokamak utilizing only raw data from a single, high temporal resolution imaging diagnostic (the Electron Cyclotron Emission imaging diagnostic, or ECEi) Tobias (2010). Because the ECEi diagnostic is sensitive to a range of multiscale dynamics in the plasma related to disruptions Choi (2016), it offers the potential to more accurately predict them. Avoiding disruptions is a grand challenge for tokamak fusion devices on the road to fusion energy Hender (2007). While much research has gone into utilizing machine learning for disruption prediction Vega (2013); Rea (2018); Kates-Harbeck (2019), often global, reduced 0D features are used in shallow machine learning methods. Recent work utilizing deep LSTM networks also added the use of low temporal resolution 1D plasma profiles Kates-Harbeck (2019), and another work used a combination CNN/LSTM on resampled, low temporal resolution bolometer data Ferreira (2018). The work we present here takes inspiration from these works in utilizing higher dimensional signals, and shows how to use newer deep learning architectures to learn on high temporal resolution data with long-range dependencies due to multiscale physics.
2 Deep convolutional neural networks with dilated convolutions
Recently there has been much research into deep learning architectures which can overcome the deficiencies of RNNs/LSTMs, and handle long, multiscale sequences Vaswani (2017); Devlin (2018); Radford (2019); Wang (2017); Gehring (2017). A seminal paper presented one such architecture, WaveNet Van Den Oord (2016), which is a convolutional neural network (CNN) focused on generating realistic audio. One of the key insights of this paper was to use dilated convolutions to increase the receptive field of the network. This overcomes the dilemma faced with using normal convolutions in causal networks, where to be sensitive to long sequences you must increase the convolutional filter size and/or the number of layers in the network. Dilated convolutions have a dilation factor ($d$) which represents the number of input points skipped between filter parameters, e.g. the sequence output $y$ from a dilated convolution with dilation $d$ is:
$$y[n] = \sum_{i=0}^{k-1} w[i]\, x[n - d \cdot i]$$
where $w$ represents the weights of the 1D dilated convolution filter of length $k$, and $x$ is the input sequence. A normal convolution results by setting $d = 1$. By stacking layers of dilated convolutions, and increasing the dilation factor in each layer, the receptive field of the network can be increased while maintaining a tractable number of model parameters.
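As an illustration of the dilated convolution described above, the following is a minimal pure-Python sketch, not the implementation used in this work. The function names, the zero-padding of the input before its first sample, and the toy filter are assumptions made for the example. The second function counts the receptive field of a stack with one dilated convolution per layer.

```python
def dilated_causal_conv1d(x, w, d=1):
    """Compute y[t] = sum_i w[i] * x[t - d*i], treating x as zero before t = 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w_i in enumerate(w):
            j = t - d * i  # input index reached by filter tap i
            if j >= 0:
                acc += w_i * x[j]
        y.append(acc)
    return y


def stacked_receptive_field(kernel_size, dilations):
    """Receptive field of a stack with ONE dilated convolution per layer."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [1.0, -1.0]  # toy two-tap difference filter (assumption for illustration)

y_normal = dilated_causal_conv1d(x, w, d=1)   # difference with the previous sample
y_dilated = dilated_causal_conv1d(x, w, d=2)  # difference with the sample two steps back
rf = stacked_receptive_field(2, [1, 2, 4, 8])
print(y_normal, y_dilated, rf)
```

With a kernel size of 2 and dilations doubling per layer, four layers already cover 16 input points, whereas four undilated layers would cover only 5.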
Dilated convolutions impose an inductive bias, or specific structure, on the architecture, which guides the transformations learned by the neural network. Specifically, dilated convolutions have a natural connection with wavelet structures, which have been used for separating out structure in multiscale data, including turbulent flows Farge (1992). In a loose sense, these neural networks allow us to learn the wavelet coefficients needed to accomplish our classification task.
A simplified yet powerful architecture named temporal convolutional network (TCN) Bai (2018) built upon this WaveNet work, utilizing dilated convolutions and many modern neural network techniques, such as weight normalization and residual connections. Bai et al. Bai (2018) showed the TCN could outperform LSTM and GRU architectures on many common sequence learning tasks, especially for long sequences with long-range dependencies. It is this TCN architecture that we will now apply to the problem of disruption prediction using ECEi data.
3 Application to Disruption Prediction using Raw ECEi Imaging Data
Disruptions in tokamak plasmas are sudden losses of control that terminate the plasma and can cause large destructive forces and/or heating on the containment vessel and protective wall materials. Next-step devices such as ITER and beyond will have a low tolerance for disruptions DeVries (2016). We need to ensure that disruptions can be avoided through accurate prediction of oncoming disruptions, with mitigation techniques applied if necessary.
Here we apply the TCN architecture to high temporal resolution, raw ECEi imaging data from the DIII-D tokamak for the purpose of predicting oncoming disruptions (code available at https://github.com/rmchurch/disruptcnn).
3.1 Data
The ECEi diagnostic Tobias (2010) is used to measure electron temperature on very fast timescales, normally sampling at 1 MHz on the DIII-D tokamak. The diagnostic has 160 spatial channels, laid out in a rectangular grid with 20 vertical by 8 radial channels. An example time series from the DIII-D ECEi diagnostic near a disruption is shown in Figure 2. ECEi can capture a number of plasma phenomena such as turbulence fluctuations, tearing modes, sawteeth, and ELMs Tobias (2010), which allow it to be sensitive to a number of pre-disruption markers. A dataset of ECEi data with good signal-to-noise ratio (SNR) from 2,747 DIII-D shots, both disruptive and non-disruptive, was selected, measuring about 10 TB. The time length of each shot varies, typically between 5 and 10 seconds. Raw digitizer voltage output was corrected for digitizer drift, then normalized before being input into the TCN. For ease of training the neural network, we decided as an initial step to temporally downsample the ECEi data to 100 kHz (i.e. a factor of 10 less data).
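The preprocessing described above can be sketched for a single channel as follows. The text does not specify how the drift correction, normalization, or downsampling were carried out, so every choice here is an assumption for illustration: block averaging as the downsampling method, a constant offset as the drift estimate, and scaling to unit standard deviation.

```python
from statistics import mean, pstdev


def downsample(signal, factor):
    """Average consecutive blocks of `factor` samples (assumed method)."""
    n_blocks = len(signal) // factor
    return [mean(signal[i * factor:(i + 1) * factor]) for i in range(n_blocks)]


def normalize(signal, offset):
    """Subtract an offset (assumed drift estimate) and scale to unit std."""
    centered = [v - offset for v in signal]
    scale = pstdev(centered) or 1.0  # avoid dividing by zero on flat signals
    return [v / scale for v in centered]


raw = list(range(20))              # stand-in for 20 raw samples at 1 MHz
reduced = downsample(raw, 10)      # 10x fewer samples, i.e. 100 kHz
scaled = normalize([1.0, 3.0], offset=2.0)
print(reduced, scaled)
```

Block averaging acts as a crude low-pass filter before the sample rate is reduced; plain striding or a dedicated decimation filter would be alternative choices.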
3.2 Model and Training Setup
We treat the problem of disruption prediction as a binary classification problem, where we predict whether each time slice corresponds to a “non-disruptive” or “disruptive” class. We label all time slices within 300 ms of a disruption as “disruptive”, and all other time slices as “non-disruptive” Rea (2018) (sequences from shots without disruptions are taken during established times of the discharge, i.e. during the plasma current flattop). Typical binary cross-entropy loss is used as the loss function for the neural network training.
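The labeling rule and loss described above can be sketched as follows. This is a minimal illustration rather than the code used in this work; the function names, the use of integer sample indices, and the inclusive treatment of the window boundary are assumptions.

```python
import math


def label_time_slices(n_samples, sample_rate_hz, t_disrupt_s=None, window_s=0.3):
    """Return 1 for slices within `window_s` before the disruption, else 0.

    Works in integer sample indices to avoid floating-point boundary issues.
    A shot without a disruption (t_disrupt_s=None) is labeled all zeros.
    """
    if t_disrupt_s is None:
        return [0] * n_samples
    n_disrupt = round(t_disrupt_s * sample_rate_hz)
    n_window = round(window_s * sample_rate_hz)
    return [1 if 0 <= n_disrupt - n <= n_window else 0 for n in range(n_samples)]


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Toy shot sampled at 1 kHz that disrupts at t = 0.9 s (last sample).
labels = label_time_slices(901, 1000, t_disrupt_s=0.9)
loss = binary_cross_entropy([0.5, 0.5], [0, 1])
print(sum(labels), loss)
```

Even in this toy shot only about a third of the slices are labeled disruptive, which illustrates the class imbalance at the time-slice level.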
We define our TCN model to have a receptive field of roughly 30,000 time points. This is an order of magnitude larger than receptive fields in the original TCN Bai (2018) or WaveNet Van Den Oord (2016) papers. With the 100 kHz sampling rate, this means that each time slice prediction uses roughly the previous 300 ms of data. Combined with our definition of disruptive time slices as those within 300 ms of the disruption, this implicitly assumes that the 600 ms before a disruption is sufficient to predict oncoming disruptions. We use a TCN with 4 hidden layers whose dilations increase by a factor of about 10 in each layer, with a filter kernel size of 15. The number of filters per hidden layer was held constant at 80 (varying the number of filters per hidden layer was not attempted).
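The receptive field implied by the stated kernel size and layer count can be estimated with a short calculation. The sketch below rests on two assumptions that the text does not confirm: the dilations are taken to be exactly 1, 10, 100, and 1000 (the text states only that they grow by a factor of about 10 per layer), and each hidden layer is taken to be a residual block containing two dilated convolutions, as in the TCN of Bai (2018).

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field of stacked residual blocks of dilated causal convolutions."""
    return 1 + convs_per_block * (kernel_size - 1) * sum(dilations)


dilations = [1, 10, 100, 1000]   # ASSUMED values, not taken from the text
sample_rate_hz = 100_000         # 100 kHz downsampled ECEi data

n_recept = tcn_receptive_field(kernel_size=15, dilations=dilations)
window_ms = 1000.0 * n_recept / sample_rate_hz
print(n_recept, window_ms)
```

Under these assumptions the look-back window is a little over 300 ms, which together with the 300 ms labeling window is consistent with the 600 ms total span discussed in the text.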
The TCN architecture allows parallelization of the sequence prediction by inputting sequences of length $N_{seq}$ that are longer than the receptive field $N_{recept}$, resulting in $N_{seq} - N_{recept} + 1$ predictions per sequence. Empirically, it was found that the chosen sequence length allowed model computations to fit within GPU memory constraints while permitting a batch size of 12 (per GPU), which ensures sufficient variety within each batch for training with stochastic gradient descent. The total batch size with data parallelism was 192; a larger batch size can be achieved by reducing the sequence length, though at an increased computational cost due to more data reads. The set of sequences whose time slices consist of only the majority class (“non-disruptive”) was undersampled so that the numbers of disruptive and non-disruptive sequences were balanced.
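The bookkeeping in this paragraph can be sketched as follows. The helper names, the example sequence length, and the random seed are hypothetical; only the batch size of 12 per GPU, the 16 GPUs, and the undersampling of majority-only sequences come from the text.

```python
import random


def predictions_per_sequence(n_seq, n_recept):
    """Number of outputs that see a full receptive field of real data."""
    return max(n_seq - n_recept + 1, 0)


def undersample_majority(has_disruptive, seed=0):
    """Keep every sequence containing disruptive slices plus an equal number
    of randomly chosen majority-only sequences; returns sorted indices."""
    rng = random.Random(seed)
    minority = [i for i, flag in enumerate(has_disruptive) if flag]
    majority = [i for i, flag in enumerate(has_disruptive) if not flag]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return sorted(minority + kept)


n_pred = predictions_per_sequence(n_seq=50_000, n_recept=30_000)  # example sizes
flags = [True, False, False, False, True, False]
balanced = undersample_majority(flags)
total_batch = 12 * 16  # per-GPU batch size times number of GPUs
print(n_pred, balanced, total_batch)
```

Balancing at the sequence level does not balance the time slices themselves, since a sequence containing a disruption still includes many non-disruptive slices.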
Stochastic Gradient Descent (SGD) with Nesterov momentum of 0.9 was used to train the model, with an initial learning rate of 0.5 that was decreased automatically upon plateau (ReduceLROnPlateau). A warmup period was used for the first 5 epochs, increasing the learning rate from 0.0625 to 0.5 to enable larger batch training Goyal (2017). A multi-node, multi-GPU setup was used to parallelize the training. The PyTorch built-in synchronous data parallel training routine DistributedDataParallel was used Paszke (2017), training on 16 GPUs over 2 days.
3.3 Results
The results of training this TCN model on ECEi data for disruption prediction on DIII-D are shown in Figure 3. Results are plotted over 1000 training epochs. The training binary cross-entropy loss continually decreases over the training, showing our model has the capacity to learn the task from this dataset. The validation loss also continually decreases, slightly flattening towards the end, indicating the model is reaching the limit of its generalizability after 1000 epochs. Two validation metrics are also shown: accuracy (the fraction of time slices predicted correctly as disruptive or non-disruptive), and F1-score (the harmonic mean of precision and recall). Because the time slice classes are imbalanced (even though the sequence sets are balanced), the F1-score gives a better indication of how well our classifier does on the minority class (disruptive).
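To illustrate why the F1-score is more informative than accuracy under class imbalance, the sketch below compares the two metrics on a toy set of 100 time slices, 10 of which are disruptive. The data and function names are invented for the example and do not correspond to results from this work.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with class 1 (disruptive) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


y_true = [0] * 90 + [1] * 10

# A classifier that never predicts "disruptive".
always_negative = [0] * 100
# A classifier with 4 false alarms that catches 8 of the 10 disruptive slices.
imperfect = [1] * 4 + [0] * 86 + [1] * 8 + [0] * 2

print(accuracy(y_true, always_negative), f1_score(y_true, always_negative))
print(accuracy(y_true, imperfect), f1_score(y_true, imperfect))
```

The trivial classifier reaches 90% accuracy while its F1-score is zero, whereas the imperfect classifier is only slightly more accurate but has a far higher F1-score.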
Both the accuracy and, more importantly, the F1-score reach high values (see Figure 3), showing the neural network has learned to predict individual time slices of both the disruptive and non-disruptive classes very well. Current machine learning disruption predictors typically achieve a true-positive rate in the low 90% range on a per-shot basis Vega (2013); Rea (2018); Kates-Harbeck (2019); Ferreira (2018), with the goal being a higher true-positive rate at a low false-positive rate DeVries (2016). The results presented here offer a promising path to overcome this gap. Consolidation of the time slice predictions to make shot predictions is left for future work, including not training on the last 30 ms before a disruption, since this is the minimum amount of time needed to trigger mitigation systems. Most likely, techniques like the hysteresis threshold algorithm will be needed Montes (2019), as the fast, noisy ECEi data could easily trigger occasional false predictions.
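One way such a consolidation could work is sketched below: a generic hysteresis rule that declares an alarm only after the per-slice score has crossed a high threshold and then stayed above a lower one for a minimum number of slices. This is an illustrative assumption, not necessarily the algorithm of Montes (2019) and not something implemented in this work; the thresholds, the count, and the function name are hypothetical.

```python
def hysteresis_alarm(scores, high, low, min_count):
    """Return the index at which an alarm is declared, or None.

    Counting starts when a score reaches `high`, continues while scores
    stay at or above `low`, and resets as soon as a score drops below `low`.
    """
    armed = False
    count = 0
    for n, s in enumerate(scores):
        if not armed and s >= high:
            armed = True
            count = 1
        elif armed and s >= low:
            count += 1
        else:
            armed = False
            count = 0
        if armed and count >= min_count:
            return n
    return None


# Toy per-slice "disruptive" probabilities: an isolated spike, then a sustained rise.
scores = [0.1, 0.9, 0.2, 0.1, 0.8, 0.7, 0.6, 0.9, 0.3]
alarm_index = hysteresis_alarm(scores, high=0.8, low=0.5, min_count=3)
print(alarm_index)
```

The isolated spike at index 1 does not raise an alarm, while the sustained rise starting at index 4 does, which is the behavior wanted when individual slice predictions are noisy.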
4 Discussion and Future Work
These results show the usefulness of deep convolutional neural networks with dilated convolutions for fusion problems where the multiscale, multiphysics nature mandates capturing long-range dependencies in time series. They show that it is possible to apply deep learning directly to the raw data from a single diagnostic with high temporal resolution in order to make useful disruption predictions, a topic critical to the success of magnetic confinement fusion. They also show that training TCN networks with large receptive fields, on the order of tens of thousands of time points, is possible, allowing learning on long sequences with long-range dependencies.
Future work in various areas is planned. At the base level, using the full dataset at full temporal resolution could give further improvement, though may require model parallelism to train. Further, combining multiple modalities (including more diagnostics) Dumoulin (2018) can allow the disruption predictions to be sensitive to the various physics which can trigger disruptions DeVries (2016). Also, interpretability of the network decisions is highly desired, especially to understand the physics and extend to future machines Kim (2017).
Acknowledgments
The main author would like to thank Ben Tobias, Yilun Zhu, Dave Schissel, C.S. Chang, Bill Tang, Julien Kates-Harbeck, Raffi Nazikian, Cristina Rea, Bob Granetz, Neville Luhmann, Sean Flanagan, Ahmed Diallo, and Ken Silber for various contributions to this work. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Fusion Energy Sciences, under AC02-09CH11466, DE-FC02-04ER54698, and FG02-99ER54531. We also recognize the Princeton Research Computing center for the computational resources used in this paper.
References
 Bai [2018] S. Bai, J. Z. Kolter, and V. Koltun. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv e-prints, page arXiv:1803.01271, mar 2018.
 Choi [2016] M. Choi, H. Park, G. Yun, W. Lee, N. Luhmann, K. Lee, W.H. Ko, Y.S. Park, B. Park, and Y. In. 2D/3D electron temperature fluctuations near explosive MHD instabilities accompanied by minor and major disruptions. Nucl. Fusion, 56(6):066013, jun 2016.
 DeVries [2016] P. C. de Vries, G. Pautasso, D. Humphreys, M. Lehnen, S. Maruyama, J. A. Snipes, A. Vergara, and L. Zabeo. Requirements for Triggering the ITER Disruption Mitigation System. Fusion Sci. Technol., 69(2):471–484, apr 2016.
 Devlin [2018] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, oct 2018.
 Dumoulin [2018] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. de Vries, A. Courville, and Y. Bengio. Feature-wise transformations. Distill, 3(7):e11, jul 2018.
 Farge [1992] M. Farge. Wavelet Transforms and their Applications to Turbulence. Annu. Rev. Fluid Mech., 24(1):395–458, jan 1992.
 Fawaz [2018] H. I. Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P.-A. Muller. Deep learning for time series classification: a review. arXiv e-prints, sep 2018.
 Ferreira [2018] D. R. Ferreira. Applications of Deep Learning to Nuclear Fusion Research. arXiv e-prints, nov 2018.
 Gehring [2017] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional Sequence to Sequence Learning. arXiv e-prints, may 2017.
 Goyal [2017] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv e-prints, page arXiv:1706.02677, jun 2017.
 Graves [2013] A. Graves. Generating Sequences With Recurrent Neural Networks. arXiv e-prints, aug 2013.
 Hender [2007] T. Hender, J. Wesley, J. Bialek, A. Bondeson, A. Boozer, R. Buttery, A. Garofalo, T. Goodman, R. Granetz, Y. Gribov, O. Gruber, M. Gryaznevich, G. Giruzzi, S. Günter, N. Hayashi, P. Helander, C. Hegna, D. Howell, D. Humphreys, G. Huysmans, A. Hyatt, A. Isayama, S. Jardin, Y. Kawano, A. Kellman, C. Kessel, H. Koslowski, R. La Haye, E. Lazzaro, Y. Liu, V. Lukash, J. Manickam, S. Medvedev, V. Mertens, S. Mirnov, Y. Nakamura, G. Navratil, M. Okabayashi, T. Ozeki, R. Paccagnella, G. Pautasso, F. Porcelli, V. Pustovitov, V. Riccardo, M. Sato, O. Sauter, M. Schaffer, M. Shimada, P. Sonato, E. Strait, M. Sugihara, M. Takechi, A. Turnbull, E. Westerhof, D. Whyte, R. Yoshino, H. Zohm, and the ITPA MHD, Disruption and Magnetic Control Topical Group. Chapter 3: MHD stability, operational limits and disruptions. Nucl. Fusion, 47(6):S128–S202, jun 2007.
 Kates-Harbeck [2019] J. Kates-Harbeck, A. Svyatkovskiy, and W. Tang. Predicting disruptive instabilities in controlled fusion plasmas through deep learning. Nature, 568(7753):526–531, apr 2019.

 Kim [2017] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). arXiv e-prints, nov 2017.
 LeCun [2015] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, may 2015.
 Lipton [2015] Z. C. Lipton, J. Berkowitz, and C. Elkan. A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv e-prints, may 2015.
 Montes [2019] K. Montes, C. Rea, R. Granetz, R. Tinguely, N. Eidietis, O. Meneghini, D. Chen, B. Shen, B. Xiao, K. Erickson, and M. Boyer. Machine learning for disruption warnings on Alcator C-Mod, DIII-D, and EAST. Nucl. Fusion, 59(9):096015, sep 2019.
 Paszke [2017] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPSW, oct 2017.
 Radford [2019] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language Models are Unsupervised Multitask Learners. Technical report, OpenAI, 2019.
 Rea [2018] C. Rea and R. S. Granetz. Exploratory Machine Learning Studies for Disruption Prediction Using Large Databases on DIII-D. Fusion Sci. Technol., pages 1–12, mar 2018.
 FESAC ISOFS Subcommittee [2002] F. I. Subcommittee. FESAC ISOFS Subcommittee Final Report. Technical report, FES, 2002.
 Tobias [2010] B. Tobias, C. W. Domier, T. Liang, X. Kong, L. Yu, G. S. Yun, H. K. Park, I. G. J. Classen, J. E. Boom, A. J. H. Donné, T. Munsat, R. Nazikian, M. Van Zeeland, R. L. Boivin, and N. C. Luhmann. Commissioning of electron cyclotron emission imaging instrument on the DIII-D tokamak and first data. Rev. Sci. Instrum., 81(10):10D928, oct 2010.
 Van Den Oord [2016] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. arXiv e-prints, page arXiv:1609.03499, 2016.
 Vaswani [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention Is All You Need. arXiv e-prints, jun 2017.
 Vega [2013] J. Vega, S. Dormido-Canto, J. M. López, A. Murari, J. M. Ramírez, R. Moreno, M. Ruiz, D. Alves, and R. Felton. Results of the JET real-time disruption predictor in the ITER-like wall campaigns. Fusion Eng. Des., 88(6-8):1228–1231, oct 2013.
 Wang [2017] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local Neural Networks. arXiv e-prints, nov 2017.